[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Thu, 5 Jun 2008 17:05:38 -0400
On Thu, Jun 05, 2008 at 04:58:06PM +0100, Ed wrote:
> The linear regression code (math/linreg/* ) already handles multiple
> regression, i.e. it solves
> y = Xb + e
> for the vector b of regression coefficients, given y a vector of
> observations, X a matrix of independent variables and e an error term.
> In particular it handles the case where X is not of full rank, which
> is one of the distinguishing characteristics of GLM.
> The GLM problem is just YM = XB + E (all matrices), where the new term
> M is a linear transform of the independent variables. This could be
> handled by the existing linear regression code if vectors are changed
> to matrices (could select between vector and matrix where applicable
> if there's a big efficiency gain).
This is a nice way to understand the problem, but I don't think it's
the right way to code it. I would like to keep lib/linreg as a
univariate library. That way, it is useful to other procedures in an
obvious way ("lib/linreg: solve the normal equations for me now!")
Also, it's complex enough dealing with Level 2 linear algebraic
operations. Making it deal with Level 3 to solve systems of matrices
could be done more easily by writing another, higher-level library
that uses lib/linreg to do the lower-level solving of systems.
Also, most users aren't going to need the full matrix systems since
most users are just doing the usual linear regression with qualitative
variables. So unless the user specifically asks for MANOVA,
lib/linreg can already handle the job.
There is another reason not to change lib/linreg: lib/linreg currently
assumes it has been passed a matrix consisting of all the data. I
would like to change this to make PSPP better able to handle large
data sets. This will mean that lib/linreg is going to have to know
whether it has been passed all the data, or products of the design
matrix and the vector of the dependent variable. That change to
lib/linreg will make it more complex than it already is.
> The bit that's quite a lot harder than I'd realised is extracting all
> the [M]AN[C]OVA bits from this; I've never really seen this in full
> generality. The algorithm is just to take your fitted regression means
> B and compare sums of squared error from various parts of the model.
> There are at least 6 common ways of doing this (type i through vi sums
> of squares). Random factors in this method are handled by linear
> combinations of dependent variables Y in the M matrix (although I'm
> not 100% sure about this - my textbook says it must be so, but its not
> obvious to me that this yields the same random factors calculations as
> via other methods?).
We will have to look up the more abstruse sums of squares and
statistics as we need them. I don't know if all that stuff should be
in a linear regression library, or outside in the code of a
procedure. It should be outside in a procedure if it's not something
that other procedures might need; inside a library otherwise.
> Anyway my feeling is that the first step is to build design matrices
> based on a given ANOVA/model spec. The second step is to implement the
> various kinds of sums of squares. The final step is to provide the
> UNIANOVA command/other glm interfaces for users in PSPP (and
> presumably the GLM/GLM repeated measures/etc dialogs too).
There is code for making design matrices in
src/math/design-matrix.[ch]. It needs work, but it is a start. The big
problem there is the "accounting" problem of mapping values of a qualitative
variable back and forth to vectors with binary entries.
> If this seems reasonable, it would be really useful if anyone has a
> good references on (pointers will do, I have access to a decent
> * the exact method of calculation of the various kinds of sums of
> squares (especially type iv - it seems to be an unsafe method that
> enjoyed brief popularity, but is now warned about but not described
Types 1, 2 and 3 will be ok at the beginning. That will satisfy most
users without making the programmer spend too much time coding
something that won't be used often. The others we can look up in the