pspp-dev

Re: bug in covariance.c/categoricals.c


From: John Darrington
Subject: Re: bug in covariance.c/categoricals.c
Date: Sun, 12 Jun 2011 20:42:23 +0000
User-agent: Mutt/1.5.18 (2008-05-17)

You're right.  It's a problem of semantics.  The current implementation
encodes each categorical variable into N columns, where N is the number
of distinct values.  For the covariance matrix we require N - 1 columns.
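
A minimal sketch of the N - 1 column scheme, for illustration only.  The
helper name encode_reference_coded is hypothetical and is not part of
PSPP, and treating the last level as the all-zero reference level is an
assumption; any single level could serve as the reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PSPP code.  Writes the N_LEVELS - 1 indicator
   columns for the category at 0-based position LEVEL into OUT.  The last
   level is assumed to be the reference and encodes as all zeros. */
void
encode_reference_coded (size_t level, size_t n_levels, double *out)
{
  assert (n_levels > 0 && level < n_levels);

  /* Only the first N_LEVELS - 1 levels own a column. */
  for (size_t j = 0; j + 1 < n_levels; j++)
    out[j] = (j == level) ? 1.0 : 0.0;
}
```

For a variable with four distinct values this yields (1 0 0), (0 1 0),
(0 0 1) and (0 0 0); for a variable with two values it yields a single
column holding 1 or 0.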

Can you try the attached patch?  I think it will solve your problem.
(It causes all the ONEWAY tests to fail, but we can think about that later.)

J'

On Sun, Jun 12, 2011 at 12:40:37PM -0400, Jason Stover wrote:
     I have found a bug in the computation of the covariance matrix when
     categorical variables are involved. It is very subtle, and wasn't easy
     to find because it's caused by an inconsistency in the way
     covariance.c and categoricals.c interpret the encoded categorical
     values, rather than a straightforward miscomputation.
     
     Here is the syntax to generate the problem:
     
     data list list / v0 v1 v2.
     begin data
     3.2 1 1
     3.1 1 1
     3.3 1 2
     3.4 1 2
     3.2 1 3
     3.3 1 3
     3.3 1 4
     3.2 1 4
     2.8 2 1
     2.9 2 1
     3.3 2 2
     3.0 2 2
     3.1 2 3
     3.2 2 3
     3.2 2 4
     3.1 2 4
     end data
     GLM v0 by v1 v2
         /INTERCEPT = include.
     
     dump_matrix in glm.c gives this as the covariance matrix:
     
     0.378 0.700 -0.700 -0.650 0.350 
     0.700 4.000 -4.000 0.000 0.000 
     -0.700 -4.000 4.000 0.000 0.000 
     -0.650 0.000 0.000 3.000 -1.000 
     0.350 0.000 0.000 -1.000 3.000 
     
     Examining how this matrix was computed showed this to be the
     encoding covariance.c used for the data:
     
     3.2 1 0 1 0
     3.1 1 0 1 0
     3.3 1 0 0 1
     3.4 1 0 0 1
     3.2 1 0 0 0
     3.3 1 0 0 0
     3.3 1 0 0 0
     3.2 1 0 0 0
     2.8 0 1 1 0
     2.9 0 1 1 0
     3.3 0 1 0 1
     3.0 0 1 0 1
     3.1 0 1 0 0
     3.2 0 1 0 0
     3.2 0 1 0 0
     3.1 0 1 0 0
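
The link between the encoding listed above and the dumped matrix can be
checked with a short sketch.  It assumes the dumped figures are centered
sums of squares and cross-products (not divided by n - 1); that is
inferred from the numbers, not from the PSPP source.  The names
encoded_cases and centered_cross_products are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* v0 followed by the four indicator columns, row-major, copied from the
   encoding listed above. */
const double encoded_cases[16 * 5] = {
  3.2, 1, 0, 1, 0,
  3.1, 1, 0, 1, 0,
  3.3, 1, 0, 0, 1,
  3.4, 1, 0, 0, 1,
  3.2, 1, 0, 0, 0,
  3.3, 1, 0, 0, 0,
  3.3, 1, 0, 0, 0,
  3.2, 1, 0, 0, 0,
  2.8, 0, 1, 1, 0,
  2.9, 0, 1, 1, 0,
  3.3, 0, 1, 0, 1,
  3.0, 0, 1, 0, 1,
  3.1, 0, 1, 0, 0,
  3.2, 0, 1, 0, 0,
  3.2, 0, 1, 0, 0,
  3.1, 0, 1, 0, 0,
};

/* Mean of column J of the row-major N_CASES x N_COLS matrix X. */
double
column_mean (const double *x, size_t n_cases, size_t n_cols, size_t j)
{
  double sum = 0.0;
  for (size_t i = 0; i < n_cases; i++)
    sum += x[i * n_cols + j];
  return sum / n_cases;
}

/* Hypothetical helper, not PSPP code.  Fills the row-major
   N_COLS x N_COLS matrix OUT with the centered sums of squares and
   cross-products of the columns of X. */
void
centered_cross_products (const double *x, size_t n_cases, size_t n_cols,
                         double *out)
{
  assert (n_cases > 0);

  for (size_t j = 0; j < n_cols; j++)
    for (size_t k = 0; k < n_cols; k++)
      {
        double mj = column_mean (x, n_cases, n_cols, j);
        double mk = column_mean (x, n_cases, n_cols, k);
        double sum = 0.0;
        for (size_t i = 0; i < n_cases; i++)
          sum += (x[i * n_cols + j] - mj) * (x[i * n_cols + k] - mk);
        out[j * n_cols + k] = sum;
      }
}
```

Calling centered_cross_products (encoded_cases, 16, 5, m) with a
25-element array m gives values that agree with the dumped matrix to
three decimal places, which supports the reading that this encoding is
the one covariance.c actually used.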
     
     This is not among the possible correct encodings. An example of a
     correct encoding is the following:
     
     3.2 0 1 0 0
     3.1 0 1 0 0
     3.3 0 0 1 0
     3.4 0 0 1 0
     3.2 0 0 0 1
     3.3 0 0 0 1
     3.3 0 0 0 0
     3.2 0 0 0 0
     2.8 1 1 0 0
     2.9 1 1 0 0
     3.3 1 0 1 0
     3.0 1 0 1 0
     3.1 1 0 0 1
     3.2 1 0 0 1
     3.2 1 0 0 0
     3.1 1 0 0 0
     
     The problem happens in a call to categoricals_get_binary_by_subscript (),
     called by get_val (), called by covariance_accumulate_pass2 (). To see
     the problem, it may be easiest to consider the first case:
     
     3.2 1 1
     
     For this case, get_val is first called with i equal to 0, which is
     fine.  Then, get_val is called with i equal to 1, which causes it to call
     categoricals_get_binary_by_subscript (cov->categoricals, 0, c). Inside
     categoricals_get_binary_by_subscript, var is v1 (OK), val is 1 (OK),
     which matches the value shown by categoricals_get_value_by_subscript,
     so the function returns 1 (which could be OK, depending on the
     encoding we want).
     
     The next call to categoricals_get_binary_by_subscript shows the
     problem. subscript is now 1, which causes var to be v1. val is then 2,
     which does not match the second value in the case, so the function
     returns 0. This by itself could be fine, but the variable we want on
     this second call is now v2. So you can see that sometimes, this
     function will return correctly, but sometimes it may not, and we can
     see this inconsistency later: The values of 3 and 4 for v2 both end up
     being mapped to (0 0) during computation of the covariance, because
     covariance_accumulate_pass2 stops asking for binary values at cov->dim
     - 1, which is 4. So: get_val gets *two* values for v1 (it should get
     only one), and *two* values for v2 (it should get three).
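
The lookup described above can be sketched as follows, assuming each
categorical variable contributes N - 1 columns and the last level is the
reference.  The struct cat_var and the functions locate_column and
binary_by_subscript are hypothetical illustrations, not the PSPP
functions named in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one categorical variable. */
struct cat_var
{
  size_t n_levels;              /* Number of distinct values. */
};

/* Resolves flat column SUBSCRIPT into a variable index and a 0-based
   level index, assuming each variable owns n_levels - 1 consecutive
   columns.  Returns 1 on success, 0 if SUBSCRIPT is out of range. */
int
locate_column (const struct cat_var *vars, size_t n_vars, size_t subscript,
               size_t *var_idx, size_t *level_idx)
{
  size_t first = 0;             /* First column owned by vars[v]. */

  for (size_t v = 0; v < n_vars; v++)
    {
      assert (vars[v].n_levels > 0);
      size_t width = vars[v].n_levels - 1;
      if (subscript < first + width)
        {
          *var_idx = v;
          *level_idx = subscript - first;
          return 1;
        }
      first += width;
    }
  return 0;
}

/* Binary value of column SUBSCRIPT for a case whose categorical
   variables take the 0-based level indexes LEVELS[0 .. n_vars - 1]. */
double
binary_by_subscript (const struct cat_var *vars, size_t n_vars,
                     const size_t *levels, size_t subscript)
{
  size_t v, level;

  if (!locate_column (vars, n_vars, subscript, &v, &level))
    return 0.0;
  return levels[v] == level ? 1.0 : 0.0;
}
```

With v1 having two levels and v2 having four, subscript 0 resolves to v1
and subscripts 1 through 3 resolve to v2, so v1 supplies one value and v2
supplies three.  Under this sketch the values 3 and 4 of v2 map to
(0 0 1) and (0 0 0) and are no longer conflated.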
     
     Hence, for the first case, the columns for v1 and v2 are mapped to
     (1 0 1 0). That would be fine if we had a mapping as follows:
     
     variable value ------> binary encoded
     v1       1     ------> 1
     v1       2     ------> 0
     v2       1     ------> 0 1 0
     v2       2     ------> (something else)
     v2       3     ------> (something else)
     v2       4     ------> (something else)
     
     where exactly one of those "something else" entries is (0 0 0), one is
     (1 0 0) and one is (0 0 1). But this can't happen, because, under the
     current system, 3 and 4 are ignored, and consequently both map to (0 0).
     
     To fix the problem, covariance.c must properly interpret the encoding
     used by categoricals.c. This seems to be mostly a problem in get_val,
     but it may be elsewhere around covariance.c.
     

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.

Attachment: cat.patch
Description: Text Data
