
Re: [Sks-devel] Mass drop from pool


From: Phil Pennock
Subject: Re: [Sks-devel] Mass drop from pool
Date: Tue, 19 Apr 2011 14:29:00 -0400

On 2011-04-19 at 18:42 +1000, Matthew Palmer wrote:
> I would recommend making the algorithm less tied to one particular peer
> (which, after all, should have no reason for being in any way special from
> the point of view of clients), and instead use a statistic across the pool;
> something like "any peer which is more than N standard deviations from the
> mean number of keys across all candidate servers gets dropped".  (N could be
> fractional, if appropriate).
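For illustration, that statistic might be sketched roughly like this (a toy example; the function name and sample counts are mine, not from any real tool):

```python
import statistics

def filter_by_stddev(key_counts, n=2.0):
    """Drop any server whose key count is more than n standard
    deviations from the mean across all candidate servers.
    key_counts: dict mapping server name -> number of keys."""
    mean = statistics.mean(key_counts.values())
    sigma = statistics.pstdev(key_counts.values())
    return {server: count for server, count in key_counts.items()
            if abs(count - mean) <= n * sigma}

# Six servers in close agreement, one lagging badly:
counts = {"a": 3_100_000, "b": 3_101_500, "c": 3_099_000,
          "d": 3_100_800, "e": 3_098_500, "f": 3_102_000,
          "g": 2_500_000}
print(sorted(filter_by_stddev(counts)))  # "g" is dropped as the outlier
```

Note that with very few candidate servers a single large outlier inflates the standard deviation enough to hide itself, so N may need tuning against the pool size.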

I wrote such an algorithm for the analysis tool I run, which has an
interface that just spits out a list of IP addresses (plus a
header/footer for error detection).  I separately just build a zone out
of retrieving that, from cron.  I deliberately didn't give the zone a
friendly name, or filter to <512 octets, or try to do any geographic
load-balancing, to keep people from switching to it -- I like Kristian's
established setup.

But if you want to play, you can grab sks_peers.py (plus .asc signature
and changelog) from: http://people.spodhuis.org/phil.pennock/software/

The code needs refactoring and clean-up; it grew organically and has had
surgery to replace unstable Python dependencies.  It's WSGI.

You can see it running at:
  http://sks.spodhuis.org/sks-peers
and see the links at the bottom of http://sks.spodhuis.org/ for some of
the other end-points.  The direct list of IPs is:
  http://sks.spodhuis.org/sks-peers/ip-valid
and debug info with histograms available at:
  http://sks.spodhuis.org/sks-peers/ip-valid-stats

Shove ?json onto the end of those last two, if that's your poison.

The algorithm used has not been formally proven, it was just an ad-hoc
attempt to filter crazy stuff and then get a list of remaining keys.  It
seems to work well enough, but any decent statistician would probably
have a screaming fit.

The approach is to bucket the counts of keys (bucket size 3000), find
the largest bucket, then collect together all the servers (not just this
bucket) within 5σ of the mean of that bucket.  That gets rid of "crazy"
outliers.  Then take the standard deviation of the remaining counts,
and set the threshold for the acceptable number of keys to the
second-largest count, less one stddev, less a "daily jitter" constant,
which allows for minor propagation delays and should be about the number
of new keys seen in a day (I set 500).  Then strip out servers running
1.0.10.  There's some minor complexity to avoid double-counting
dual-stacked servers, etc.

(No idea why I didn't go with 2σ after the craziness filter.)
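A rough sketch of that filtering, assuming a plain list of key counts as input (the function and variable names are mine; the real sks_peers.py also handles versions and dual-stacking, which this omits):

```python
import statistics
from collections import defaultdict

BUCKET_SIZE = 3000
DAILY_JITTER = 500  # roughly the number of new keys seen per day

def acceptable_threshold(key_counts):
    """Bucket the key counts, keep servers within 5 sigma of the
    largest bucket's mean, then derive the acceptance threshold
    from the remaining counts."""
    # 1. Bucket counts (bucket size 3000) and find the largest bucket.
    buckets = defaultdict(list)
    for count in key_counts:
        buckets[count // BUCKET_SIZE].append(count)
    largest = max(buckets.values(), key=len)

    # 2. Keep every count (from any bucket, not just this one) within
    #    5 sigma of that bucket's mean, discarding "crazy" outliers.
    mean = statistics.mean(largest)
    sigma = statistics.pstdev(largest) or BUCKET_SIZE  # guard zero sigma
    sane = [c for c in key_counts if abs(c - mean) <= 5 * sigma]

    # 3. Threshold: second-largest sane count, less one stddev of the
    #    sane counts, less the daily-jitter constant.
    sane.sort(reverse=True)
    second = sane[1] if len(sane) > 1 else sane[0]
    return second - statistics.pstdev(sane) - DAILY_JITTER

counts = [3_100_000, 3_100_500, 3_101_200, 3_099_800, 50_000]
threshold = acceptable_threshold(counts)
accepted = [c for c in counts if c >= threshold]  # the 50k server fails
```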

Really I should just add a JSON interface to the main set of data which
has names, IPs and key-counts and let the stats be done by a second tool
which isn't in the main WSGI app.  Anyone could then play with making
their own stats for filtering.  Anyone interested in that?
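As a sketch of how that split might look from the consumer's side (the field names and values here are purely invented for illustration, not the actual feed format):

```python
import json

# Hypothetical JSON feed of names, IPs and key counts; a separate tool
# reads it and applies whatever filtering statistic it prefers.
feed = json.loads("""
[
  {"name": "keys.example.org", "ips": ["192.0.2.10"], "keys": 3100000},
  {"name": "sks.example.net",
   "ips": ["198.51.100.5", "2001:db8::5"], "keys": 3099500}
]
""")

counts = [server["keys"] for server in feed]
print(max(counts) - min(counts))  # spread across the pool
```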

-Phil


