Re: regression lib
Mon, 2 May 2005 14:32:51 +0800
On Sun, May 01, 2005 at 11:43:07AM -0400, Jason H. Stover wrote:
I got started on a regression lib. You can find it
Let me know if it looks offensive. I just dropped it into lib/ and
compiled it. It doesn't contain much yet, but I thought I should give
people a chance to critique its design before going much further.
I called it 'linreg' because 'regression' could mean 'non-linear
regression'. I also created a struct which can contain a lot of
relevant information about estimation for a linear model, including
coefficients, residuals, sums of squares and whatever else becomes
necessary later. That information can be passed to other procedures,
making extra data passes unnecessary for some analyses.
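As a rough illustration of that idea, here is a minimal sketch of such a struct. The name `linreg_result`, its fields, and the helper `linreg_r_squared` are all hypothetical, not the actual linreg code. It only shows how a later procedure could derive R^2 from stored sums of squares without touching the data again.

```c
#include <stddef.h>

/* Hypothetical container for the results of one linear-model fit.
   The field names are assumptions made for this sketch. */
struct linreg_result
{
  size_t n_obs;      /* number of cases used in the fit */
  size_t n_coeffs;   /* number of estimated coefficients */
  double *coeff;     /* estimated coefficients, n_coeffs entries */
  double *residual;  /* residuals, n_obs entries */
  double sst;        /* total sum of squares */
  double sse;        /* error (residual) sum of squares */
};

/* R^2 computed from the stored sums of squares alone; no data pass.
   Returns 0 when sst is not positive, to avoid dividing by zero. */
double
linreg_r_squared (const struct linreg_result *r)
{
  return r->sst > 0.0 ? 1.0 - r->sse / r->sst : 0.0;
}
```

Any procedure that receives the struct can call `linreg_r_squared()` directly instead of re-reading the cases.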
On this topic of caching statistics: It would be nice if pspp_linreg()
could accept as an argument the means and standard deviations of all
model variables. That would eliminate the need for pspp_linreg() to
pass through the data to get those values. Under this design, when
pspp_linreg() gets a mean and/or std. dev. for a variable in the
model, it will not compute that mean/std. dev. again. If it doesn't
get the mean/std. dev. for a variable in the model, it will compute
that mean/std. dev.
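A minimal sketch of that "use it if supplied, compute it otherwise" behaviour follows. The type `var_moments` and the function `fill_missing_moments` are hypothetical names invented here, not part of PSPP. To stay free of libm, the sketch tracks the variance; the std. dev. is its square root.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-variable summary a caller may pass in.
   A false flag means "not supplied, please compute". */
struct var_moments
{
  bool have_mean;
  bool have_variance;
  double mean;
  double variance;   /* sample variance; std. dev. = sqrt (variance) */
};

/* Fill in only the values the caller did not supply.
   Supplied values are trusted and never recomputed. */
void
fill_missing_moments (const double *x, size_t n, struct var_moments *m)
{
  size_t i;

  if (!m->have_mean)
    {
      double sum = 0.0;
      for (i = 0; i < n; i++)
        sum += x[i];
      m->mean = n > 0 ? sum / (double) n : 0.0;
      m->have_mean = true;
    }

  if (!m->have_variance)
    {
      double ss = 0.0;
      for (i = 0; i < n; i++)
        {
          double d = x[i] - m->mean;
          ss += d * d;
        }
      m->variance = n > 1 ? ss / (double) (n - 1) : 0.0;
      m->have_variance = true;
    }
}
```

If both flags are already set, the function makes no pass over the data at all, which is the saving described above.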
If some PSPP procedure had already computed means/std. dev.'s by the
time pspp_linreg() is called, can PSPP pass those values to
pspp_linreg()? If so, where does PSPP store that information? What
structure should I look in to figure this all out? I see the variable
structure contains information about a variable like its label and
number of values. Can it also contain a variable's mean and standard
deviation?
Regression analysis is starting to get a bit outside my area of
expertise. But it's certainly something that SPSS does, and PSPP
should aim to do it too.
Currently there's no caching of statistics. Each procedure
calculates them for itself, which is less than ideal because it leads
to a lot of duplication. For example, group.c largely duplicates
factor_stats.c. I think there should be some framework for caching
these values like Jason suggests. The problem is coming up with a
model which is flexible enough to suit our purpose and yet simple
enough to understand.
It's not only mean and stddev. I can foresee dozens of procedures
which need to calculate SST, SSE, etc. It would be good if
procedures could just look these values up in a cache. But there
are a lot of issues to consider:
* The cache would have to be invalidated every time a transformation
* Caching would be useful not only on complete variables, but also on
subsets of cases, e.g. variable X factored by variable Y. So how
does one define all the possibilities?
* Each statistic (e.g. mean, stddev) will be different depending upon
the specification of the procedure's /MISSING subcommand.
All these things complicate the implementation and would mean that the
potential cache space would be quite large.
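To make those three issues concrete, here is a rough sketch of what a cache key might have to carry. Everything in it is an assumption for illustration: the names `stat_key`, `stat_cache` and the functions below are hypothetical, the enum values are only examples, and a real key for "X factored by Y" would also need the level of Y. Invalidation is modelled with a generation counter that is bumped whenever the data changes.

```c
#include <stdbool.h>
#include <stddef.h>

enum stat_kind { STAT_MEAN, STAT_STDDEV, STAT_SST, STAT_SSE };
enum missing_policy { MISSING_LISTWISE, MISSING_PAIRWISE, MISSING_INCLUDE };

/* One cached value is identified by all four of these. */
struct stat_key
{
  int var_idx;                  /* variable the statistic describes */
  int factor_idx;               /* factoring variable, or -1 for all cases */
  enum missing_policy missing;  /* how missing values were treated */
  enum stat_kind kind;          /* which statistic */
};

struct stat_entry
{
  struct stat_key key;
  double value;
  unsigned long generation;     /* cache generation when stored */
  bool used;
};

#define STAT_CACHE_SLOTS 64

struct stat_cache
{
  struct stat_entry slot[STAT_CACHE_SLOTS];
  unsigned long generation;     /* bumped on every invalidation */
};

static bool
key_equal (const struct stat_key *a, const struct stat_key *b)
{
  return a->var_idx == b->var_idx && a->factor_idx == b->factor_idx
         && a->missing == b->missing && a->kind == b->kind;
}

/* Call after anything that changes the data: all entries become stale. */
void
stat_cache_invalidate (struct stat_cache *c)
{
  c->generation++;
}

/* Store VALUE under KEY; returns false if no slot is available. */
bool
stat_cache_put (struct stat_cache *c, const struct stat_key *key, double value)
{
  size_t i;
  for (i = 0; i < STAT_CACHE_SLOTS; i++)
    {
      struct stat_entry *e = &c->slot[i];
      if (!e->used || e->generation != c->generation
          || key_equal (&e->key, key))
        {
          e->key = *key;
          e->value = value;
          e->generation = c->generation;
          e->used = true;
          return true;
        }
    }
  return false;
}

/* Look up KEY; on a hit, store the value in *VALUE and return true. */
bool
stat_cache_get (const struct stat_cache *c, const struct stat_key *key,
                double *value)
{
  size_t i;
  for (i = 0; i < STAT_CACHE_SLOTS; i++)
    {
      const struct stat_entry *e = &c->slot[i];
      if (e->used && e->generation == c->generation
          && key_equal (&e->key, key))
        {
          *value = e->value;
          return true;
        }
    }
  return false;
}
```

Even in this toy form the key has four dimensions, which gives a feel for how quickly the space of cacheable values grows.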
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.