pspp-dev
Re: Number of histogram bins


From: Jason H. Stover
Subject: Re: Number of histogram bins
Date: Sun, 5 Dec 2004 19:14:27 -0500
User-agent: Mutt/1.4.2.1i

I found a recent preprint online suggesting an automatic
bin width selection:

L. Birge and Y. Rozenholc. How many bins should be put in a regular histogram.
www.proba.jussieu.fr/pageperso/rozen/preprint/Histo-030629.pdf

This paper also mentions a few other methods, including three used
by GNU/R: Sturges' (number of bins approximately 1 + log_2(N)), Scott's
(mentioned by Ben below), and one by Freedman and Diaconis in:

D. Freedman and P. Diaconis. On the histogram as a density estimator: L_2
theory. Z. Wahrscheinlichkeitstheor. Verw. Geb. 1981. 57, 453-476.

Sturges' is the default in R.

-Jason

On Sun, Dec 05, 2004 at 12:18:56PM -0800, Ben Pfaff wrote:
> John Darrington <address@hidden> writes:
> 
> > Does anyone know how spss decides the number of bins to construct a
> > histogram?  Or can anyone suggest a suitable algorithm for doing so?
> 
> PSPP 0.1.0 had vestigial support for plotting histograms.  At the
> time, if I recall correctly, I checked out in some detail how
> SPSS/PC+ chose the number of bins.  Here's the code that
> version used to decide:
> 
>   #define MIN_HIST_BARS 3
>   #define MAX_HIST_BARS 20
> ...
>   double upper = /* maximum value in data */;
>   double lower = /* minimum value in data */;
>   if (upper - lower >= 10)
>     {
>       double l, u;
> 
>       u = round_up (upper, 5);
>       l = round_down (lower, 5);
>       nbars = (u - l) / 5;
>       if (nbars * 2 + 1 <= MAX_HIST_BARS)
>         {
>           nbars *= 2;
>           u = round_up (upper, 2.5);
>           l = round_down (lower, 2.5);
>           if (l + 1.25 <= lower && u - 1.25 >= upper)
>             nbars--, lower = l + 1.25, upper = u - 1.25;
>           else if (l + 1.25 <= lower)
>             lower = l + 1.25, upper = u + 1.25;
>           else if (u - 1.25 >= upper)
>             lower = l - 1.25, upper = u - 1.25;
>           else
>             nbars++, lower = l - 1.25, upper = u + 1.25;
>         }
>       else if (nbars < MAX_HIST_BARS)
>         {
>           if (l + 2.5 <= lower && u - 2.5 >= upper)
>             nbars--, lower = l + 2.5, upper = u - 2.5;
>           else if (l + 2.5 <= lower)
>             lower = l + 2.5, upper = u + 2.5;
>           else if (u - 2.5 >= upper)
>             lower = l - 2.5, upper = u - 2.5;
>           else
>             nbars++, lower = l - 2.5, upper = u + 2.5;
>         }
>       else
>         nbars = MAX_HIST_BARS;
>     }
>   else
>     {
>       nbars = /* number of unique values in data */;
>       if (nbars > MAX_HIST_BARS)
>         nbars = MAX_HIST_BARS;
>     }
>   if (nbars < MIN_HIST_BARS)
>     nbars = MIN_HIST_BARS;
>   interval = (upper - lower) / nbars;
> 
> It seemed to make some kind of sense at the time, but this was
> way back in 1994 or so and I didn't write as many useful comments
> then as I do now.  I think that the rationale is roughly this:
> the upper and lower values should by preference be rounded to
> "round" numbers, like multiples of 5, because it makes the graph
> easier to read and data tends to be more naturally interpretable
> that way.  Then it tries to recenter the actual range plotted
> based on the actual lower and upper values.
> 
> I'm not sure we want any part of this anymore.  The above is a
> pretty weak defense of the rationale, and I actually wrote the
> code.
> 
> A search for "histogram bin width" turned up this webpage:
>         http://www.fmrib.ox.ac.uk/analysis/techrep/tr00mj2/tr00mj2/node24.html
> which gives the formula
>         W = 3.49 * s * N^(-1/3)
> as an "optimal bin width", where s is the sample standard
> deviation and N is the number of samples, as well as
>         W = 2 * IQR * N^(-1/3)
> where IQR is the interquartile range.  Either one of
> these would be pretty easy to implement, and the webpage claims
> the latter is more robust.
> -- 
> On Perl: "It's as if H.P. Lovecraft, returned from the dead and speaking by
> seance to Larry Wall, designed a language both elegant and terrifying for his
> Elder Things to write programs in, and forgot that the Shoggoths didn't turn
> out quite so well in the long run." --Matt Olson
> 
> 
> _______________________________________________
> pspp-dev mailing list
> address@hidden
> http://lists.gnu.org/mailman/listinfo/pspp-dev
> 



