pspp-dev

Re: GLM and interactions


From: Jason Stover
Subject: Re: GLM and interactions
Date: Fri, 8 Jul 2011 15:09:24 -0400
User-agent: Mutt/1.5.18 (2008-05-17)

On Fri, Jul 08, 2011 at 07:19:28AM +0000, John Darrington wrote:
> On Thu, Jul 07, 2011 at 04:09:28PM -0400, Jason Stover wrote:
>      
>      The binary encoding for category, drug, and the interaction term would
>      be something like this:
>      
>      category:
>       1 --> 0
>       2 --> 1
>      
>      drug:
>       1 --> 0 0
>       2 --> 1 0
>       3 --> 0 1
>      
>      interaction (category * drug):
>       1 * 1 --> 0 0
>       1 * 2 --> 0 0
>       1 * 3 --> 0 0
>       2 * 1 --> 0 0
>       2 * 2 --> 1 0
>       2 * 3 --> 0 1
>      
>      So, I have just multiplied each of the pairs. Notice most are mapped
>      to the origin. This isn't a problem, though, if we just want to test
>      for an interaction. If we take X to be our binary variable for
>      category, and Y_1, Y_2 for our binary variables for drug, we can write
>      our linear model this way:
>      
>       response = intercept + b_1 * X + b_2 * Y_1 + b_3 * Y_2
>                  + b_4 * X * Y_1 + b_5 * X * Y_2 + error
>      
> 
> How would this encoding look in the more general case where "category" had
> (say) five distinct values instead of only two?

category (5 categories --> 4 degrees of freedom):
        a --> 0 0 0 0
        b --> 0 0 0 1
        c --> 0 0 1 0
        d --> 0 1 0 0
        e --> 1 0 0 0

drug (3 categories --> 2 degrees of freedom):
        1 --> 0 0
        2 --> 1 0
        3 --> 0 1

drug * category ((5 - 1) * (3 - 1) = 8 degrees of freedom):
     a1 --> 0 0 0 0 0 0 0 0
     a2 --> 0 0 0 0 0 0 0 0
     a3 --> 0 0 0 0 0 0 0 0
     b1 --> 0 0 0 0 0 0 0 0
     b2 --> 0 0 0 1 0 0 0 0
     b3 --> 0 0 0 0 0 0 0 1
     c1 --> 0 0 0 0 0 0 0 0
     c2 --> 0 0 1 0 0 0 0 0
     c3 --> 0 0 0 0 0 0 1 0
     d1 --> 0 0 0 0 0 0 0 0
     d2 --> 0 1 0 0 0 0 0 0
     d3 --> 0 0 0 0 0 1 0 0
     e1 --> 0 0 0 0 0 0 0 0
     e2 --> 1 0 0 0 0 0 0 0
     e3 --> 0 0 0 0 1 0 0 0

This is not the only valid encoding, but it's the one that occurred to
me first. Many different encodings could be considered as being
correct. The only constraint is that we need to estimate the mean of
each factor/level combination by summing the coefficients available.
And we do not want any more coefficients than necessary, lest we lose
degrees of freedom for error (and hence our ability to estimate the
variability).

The reason we have 8 degrees of freedom for the interaction is that we
need to estimate means for all possible factor/level combinations,
that is, 15 means in total. One of those means we can estimate with
the intercept, so that leaves 14 coefficients to estimate (one
coefficient per mean, though a mean is estimated with some combination
of those coefficients). We estimate 4 coefficients for category, 2
coefficients for drug, leaving 14 - (4 + 2) = 8 coefficients for the
interactions. In general, following the same line of reasoning, if we
have n categories for factor 1, and k categories for factor 2, then we
need

  n * k - (n - 1) - (k - 1) - 1 = n * k - n - k + 1 = (n - 1) * (k - 1)

degrees of freedom for the interaction.
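
The counting argument is easy to check numerically. This is just a
sketch; interaction_df() is a hypothetical helper written for this
message, not a function in PSPP.

```python
def interaction_df(n, k):
    """Coefficients left over for the interaction of an n-level factor
    and a k-level factor, counted the way the paragraph above does."""
    cell_means = n * k                # one mean per factor/level combination
    intercept = 1                     # one of those means is the intercept
    main_effects = (n - 1) + (k - 1)  # dummy columns for the two factors
    return cell_means - intercept - main_effects


# The example from this thread: 5 categories, 3 drugs.
print(interaction_df(5, 3))  # 15 - 1 - (4 + 2) = 8

# The count agrees with (n - 1) * (k - 1) for a range of small designs.
for n in range(2, 10):
    for k in range(2, 10):
        assert interaction_df(n, k) == (n - 1) * (k - 1)
```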

Another way to see this is to write the model, with all its coefficients:

output = intercept + b_1 * X_1 +...+ b_4 * X_4 + b_5 * Y_1 + b_6 * Y_2 +
       blah blah blah

...where 'blah blah blah' stands for 'some products of X's and Y's,
multiplied by their coefficients', the point being to give us an
estimate of the mean for each factor/level combination.
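
One way to convince yourself that these coefficients are enough, and
no more than enough, is to build one row of the model per factor/level
combination and check that the resulting 15 x 15 matrix has full rank.
The sketch below does that with exact arithmetic from the Python
standard library. All the names in it (dummy, design_row, rank) are
hypothetical, and the dummy columns are ordered differently from the
tables earlier in this message, which does not affect the rank.

```python
from fractions import Fraction


def dummy(level, n_levels):
    """Reference-level coding: level 0 is all zeros, level i has a 1 in column i - 1."""
    return [1 if level == i else 0 for i in range(1, n_levels)]


def design_row(cat, drug, n_cat=5, n_drug=3):
    """Intercept, category dummies, drug dummies, then all pairwise products."""
    x = dummy(cat, n_cat)
    y = dummy(drug, n_drug)
    products = [xi * yj for yj in y for xi in x]
    return [1] + x + y + products


def rank(rows):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# One row per factor/level combination: 5 categories times 3 drugs.
X = [design_row(c, d) for c in range(5) for d in range(3)]
print(len(X), len(X[0]), rank(X))  # 15 rows, 1 + 4 + 2 + 8 = 15 columns, rank 15
```

Full rank means each of the 15 cell means corresponds to exactly one
setting of the coefficients, so nothing is missing and nothing is
redundant.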

Because of the foregoing, I'm glad I'm not the only one thinking about
categoricals.c and how it relates to covariance.c.

-Jason


