pspp-dev

Re: Boxplot whisker length


From: John Darrington
Subject: Re: Boxplot whisker length
Date: Sat, 3 Jan 2015 08:29:26 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Dec 31, 2014 at 10:28:10AM +0100, John Darrington wrote:
     On Tue, Dec 30, 2014 at 04:58:48PM -0600, Alan Mead wrote:
     or GNU/Linux.
     
          Regarding the actual algorithm, the boxplot I get from SPSS is attached
          as "boxplot2.png".  I think it's a lot more reasonable (albeit uglier).
          The main difference is the SPSS boxplot had short whiskers while PSPP's
          boxplot whiskers seem to include the entire range of the data
          (including the outlier). In the physio dataset, apparently there are
          some outliers like 30 mm for a human height.  That's the kind of thing
          that boxplots are supposed to help you find.  Maybe that's a bug in PSPP
          that the whisker length is just wrong?  Otherwise I think it would make
          more sense to limit the whiskers to some reasonable value like 1.5 times
          the inter-quartile range (or to the highest and lowest values that are
          within 1.5 times the inter-quartile range).
     
     Here is what SPSS has to say about boxplots:
     
        The boundaries of the box are Tukey's hinges. The length of the box is
        the interquartile range based on Tukey's hinges. That is, IQR = Q_3 - Q_1
        Define
         STEP = 1.5 IQR
        A case is an outlier if 
        Q_3 + STEP < y < Q_3 + 2 * STEP
        or
        Q_1 - 2 * STEP < y < Q_1 - STEP
     
        A case is an extreme if
        y >= Q_3 + 2 * STEP
        or
        y <= Q_1 - 2 * STEP
     
          
     Note that it doesn't actually say where the whiskers should be.  However,
     it seems that PSPP is placing the lower whisker at the lowest value y of
     the dataset for which
      y < Q_1 - STEP
     and the upper whisker at the highest value y for which
      y < Q_3 + STEP
     
     I vaguely remember reading this recommendation in the literature.
     
     If someone can reference any better recommendations, then we can consider
     implementing that instead.
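
For anyone who wants to experiment with the SPSS definitions quoted above,
here is a rough Python sketch.  It is not PSPP code.  The function names are
made up for illustration, the hinges are taken as the medians of the lower
and upper halves of the sorted data (each half including the overall median
when n is odd), and the lower outlier band is read as
Q_1 - 2 * STEP < y < Q_1 - STEP, by symmetry with the upper band.

```python
from statistics import median


def tukey_hinges(values):
    """Return (Q_1, Q_3) as Tukey's hinges: the medians of the lower and
    upper halves, each half including the overall median when n is odd."""
    xs = sorted(values)
    n = len(xs)
    half = (n + 1) // 2
    return median(xs[:half]), median(xs[n - half:])


def classify(y, q1, q3):
    """Label y as 'extreme', 'outlier' or 'normal' per the quoted rules."""
    step = 1.5 * (q3 - q1)  # STEP = 1.5 IQR
    if y >= q3 + 2 * step or y <= q1 - 2 * step:
        return "extreme"
    # Assumption: lower outlier band is Q_1 - 2*STEP < y < Q_1 - STEP.
    if y > q3 + step or y < q1 - step:
        return "outlier"
    return "normal"


# Made-up sample, only to exercise the functions.
q1, q3 = tukey_hinges(range(1, 10))
print(q1, q3, classify(14, q1, q3), classify(19, q1, q3))
```

Note that a value lying exactly on Q_3 + STEP or Q_1 - STEP counts as normal
here, since the quoted outlier inequalities are strict.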
     

Most other implementations seem to have the whiskers extend to the most extreme
points of the dataset that are not themselves outliers.

So I pushed a change so that boxplots in PSPP do that too.
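
Roughly, the rule amounts to the following.  This is a Python sketch for
illustration only, not the code that was pushed; the function name and the
sample heights are invented, and the hinges are assumed to have been computed
already.

```python
def whisker_ends(values, q1, q3):
    """Return (lower, upper) whisker positions: the most extreme data
    points that are not outliers, i.e. that lie within 1.5 IQR of the box."""
    step = 1.5 * (q3 - q1)
    inside = [y for y in values if q1 - step <= y <= q3 + step]
    if not inside:
        raise ValueError("no data points within 1.5 IQR of the box")
    return min(inside), max(inside)


# Invented heights in mm, with one implausible 30 mm entry.
# 1600 and 1750 are the hinges of this sample, worked out by hand.
heights_mm = [30, 1600, 1650, 1700, 1750, 1800]
print(whisker_ends(heights_mm, 1600, 1750))
```

With this rule the 30 mm value falls outside the whiskers and would be drawn
as a separate point instead of stretching the lower whisker down to it.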

J'


-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://sks-keyservers.net or any PGP keyserver for public key.

Attachment: signature.asc
Description: Digital signature

