Re: regression lib

pspp-dev

From:	John Darrington
Subject:	Re: regression lib
Date:	Mon, 2 May 2005 14:32:51 +0800
User-agent:	Mutt/1.3.28i

On Sun, May 01, 2005 at 11:43:07AM -0400, Jason H. Stover wrote:
     
     I got started on a regression lib. You can find it
     at 
     
        www.sakla.net/linreg.tar.bz2
     
     Let me know if it looks offensive. I just dropped it into lib/ and
     compiled it. It doesn't contain much yet, but I thought I should give
     people a chance to critique its design before going much further.
     
     I called it 'linreg' because 'regression' could mean 'non-linear
     regression'. I also created a struct which can contain a lot of
     relevant information about estimation for a linear model, including
     coefficients, residuals, sums of squares and whatever else becomes
     necessary later.  That information can be passed to other procedures,
     making extra data passes unnecessary for some analyses.
     
     On this topic of caching statistics: It would be nice if pspp_linreg()
     could accept as an argument the means and standard deviations of all
     model variables. That would eliminate the need for pspp_linreg() to
     pass through the data to get those values. Under this design, when
     pspp_linreg() gets a mean and/or std. dev. for a variable in the
     model, it will not compute that mean/std. dev. again. If it doesn't
     get the mean/std. dev. for a variable in the model, it will compute
     that mean/std. dev.
     
     If some PSPP procedure had already computed means/std. dev.'s by the
     time pspp_linreg() is called, can PSPP pass those values to
     pspp_linreg()? If so, where does PSPP store that information? What
     structure should I look in to figure this all out? I see the variable
     structure contains information about a variable like its label and
     number of values. Can it also contain a variable's mean and standard
     deviation?


Regression analysis is starting to get a bit outside my area of
expertise.  But it's certainly something that SPSS does, and PSPP
should aim to do it too.

Currently there's no caching of statistics.  Each procedure
calculates them for itself, which is less than ideal because it leads
to a lot of duplication.  For example, group.c largely duplicates
factor_stats.c.  I think there should be some framework for caching
these values, like Jason suggests.  The problem is coming up with a
model which is flexible enough to suit our purposes and yet simple
enough to understand.

It's not only mean and stddev.  I can foresee dozens of procedures
which need to calculate SST, SSE, etc.  It would be good if
applications could just look these values up in a cache.  But there
are a lot of issues to consider:

* The cache would have to be invalidated every time a transformation
  is done.

* Caching would be useful not only on complete variables, but also on
  subsets of cases, e.g. variable X factored by variable Y.  So how
  does one define all the possibilities?

* Each statistic (e.g. mean, stddev) will be different depending upon
  the specification of the procedure's /MISSING subcommand.

All these things complicate the implementation and would mean that the
potential cache space would be quite large.
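
To make the three issues above concrete, here is a rough sketch in C of
one possible shape for such a cache.  It is only an illustration under
stated assumptions, not a proposed design: every name in it
(stat_cache, stat_key, stat_cache_lookup and so on) is hypothetical and
does not exist in PSPP.  The key records the variable, an optional
factoring variable and group value, and the /MISSING treatment, so that
statistics computed under different conditions never collide.  A
generation counter, bumped whenever a transformation runs, makes every
older entry stale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of these exist in PSPP. */
enum stat_kind { STAT_MEAN, STAT_STDDEV, STAT_SST, STAT_SSE };
enum missing_policy { MISSING_LISTWISE, MISSING_PAIRWISE, MISSING_INCLUDE };

#define NO_FACTOR (-1)          /* Statistic covers all cases. */
#define CACHE_SLOTS 64

struct stat_key
  {
    int var_index;               /* Variable the statistic describes. */
    int factor_index;            /* Factoring variable, or NO_FACTOR. */
    double factor_value;         /* Group value; ignored if NO_FACTOR. */
    enum missing_policy missing; /* How missing values were treated. */
    enum stat_kind kind;         /* Which statistic this is. */
  };

struct stat_entry
  {
    struct stat_key key;
    double value;
    unsigned long generation;    /* Cache generation when stored. */
    bool used;
  };

struct stat_cache
  {
    struct stat_entry slots[CACHE_SLOTS];
    unsigned long generation;    /* Bumped by each transformation. */
  };

void
stat_cache_init (struct stat_cache *c)
{
  assert (c != NULL);
  memset (c, 0, sizeof *c);
}

/* Call after any transformation: all existing entries become stale. */
void
stat_cache_invalidate (struct stat_cache *c)
{
  c->generation++;
}

static bool
key_equal (const struct stat_key *a, const struct stat_key *b)
{
  return (a->var_index == b->var_index
          && a->factor_index == b->factor_index
          && (a->factor_index == NO_FACTOR
              || a->factor_value == b->factor_value)
          && a->missing == b->missing
          && a->kind == b->kind);
}

/* Returns true and sets *VALUE if a current entry for KEY exists. */
bool
stat_cache_lookup (const struct stat_cache *c, const struct stat_key *key,
                   double *value)
{
  for (size_t i = 0; i < CACHE_SLOTS; i++)
    {
      const struct stat_entry *e = &c->slots[i];
      if (e->used && e->generation == c->generation
          && key_equal (&e->key, key))
        {
          *value = e->value;
          return true;
        }
    }
  return false;
}

/* Stores VALUE under KEY, reusing a stale or empty slot.
   Returns false if every slot holds a current entry. */
bool
stat_cache_store (struct stat_cache *c, const struct stat_key *key,
                  double value)
{
  struct stat_entry *target = NULL;
  for (size_t i = 0; i < CACHE_SLOTS; i++)
    {
      struct stat_entry *e = &c->slots[i];
      bool live = e->used && e->generation == c->generation;
      if (live && key_equal (&e->key, key))
        {
          target = e;            /* Overwrite the existing entry. */
          break;
        }
      if (!live && target == NULL)
        target = e;              /* Remember first reusable slot. */
    }
  if (target == NULL)
    return false;
  target->key = *key;
  target->value = value;
  target->generation = c->generation;
  target->used = true;
  return true;
}
```

A procedure would call stat_cache_lookup() first and only make a data
pass when it returns false, storing the result afterwards.  The fixed
slot count and linear search keep the sketch short.  Given how large
the key space can get, a real version would presumably want a hash
table and some eviction policy.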

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.
