Re: contribution and cvs
Sat, 3 Sep 2005 17:30:03 +0800
This discussion is starting to get out of my depth, so if this
suggestion doesn't make sense then please ignore it.
One of the commands we haven't yet implemented is MATRIX DATA,
which to me is a slightly unintuitive name, because it inputs
both a matrix and other data.
Most (all) of the regression commands support a /MATRIX subcommand
to input/output data in this form rather than as individual values.
If we're going to add new state to the variable struct, wouldn't it
make sense to use the MATRIX DATA parameters as a model for the new
On Fri, Sep 02, 2005 at 06:18:26PM +0000, Jason Stover wrote:
> > The code I wrote before did not add anything to the struct variable,
> > but to make it work I had to create a struct
> > recoded_categorical_array. The recoded_categorical_array is cumbersome
> > and would be unnecessary if the variable values could be stored inside
> > the struct variable. So may I/we/someone add a gsl_matrix * to the
> > definition of struct variable? Doing so will make a lot of numerical
> > routines easier to write.
> I'm not against adding a member if that's the best solution, but
> I'd like to learn more. If I understand correctly, the primary
> purpose of the matrix is to identify the values that the variable
> actually takes on. Is that correct?...
> If so, then I have two concerns.
> First, how do we track changes to the data in the active file
> between procedures? If the user does something like
> COMPUTE x = x + 1.
> SELECT IF x NE 1.
> then this means that we have to invalidate the cache, but
> currently there isn't any mechanism for that. We want such a
> cache and invalidation mechanism for other reasons, so it's
> becoming increasingly clear to me that it's something to
> implement soon, but it's not there yet.
Agreed. In light of this need, I will leave the struct
recoded_categorical_array where it is for now, ugly as it is.
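
One way the missing invalidation mechanism could work is a generation counter: anything that modifies the active file bumps the counter, and a cached result is trusted only if it was stored under the current count. The sketch below is an illustration under that assumption, not existing PSPP code; struct active_file, struct var_cache and all the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the active file.  Every transformation
   that changes case data bumps the generation counter. */
struct active_file
  {
    unsigned long generation;
  };

/* Hypothetical per-variable cache of one derived result. */
struct var_cache
  {
    bool valid;                 /* False until the cache is first filled. */
    unsigned long generation;   /* Generation of the data when cached. */
    size_t n_distinct;          /* Example cached statistic. */
  };

/* To be called after COMPUTE, SELECT IF, etc. modify the data. */
static void
active_file_modified (struct active_file *af)
{
  assert (af != NULL);
  af->generation++;
}

/* Returns true only if C was stored against the current data. */
static bool
var_cache_is_fresh (const struct var_cache *c, const struct active_file *af)
{
  assert (c != NULL && af != NULL);
  return c->valid && c->generation == af->generation;
}

/* Records a freshly computed statistic together with the generation
   of the data it was computed from. */
static void
var_cache_store (struct var_cache *c, const struct active_file *af,
                 size_t n_distinct)
{
  assert (c != NULL && af != NULL);
  c->valid = true;
  c->generation = af->generation;
  c->n_distinct = n_distinct;
}
```

With this scheme a procedure would call var_cache_is_fresh() before using a cached value and recompute on a false result. Nothing has to walk the variables to clear their caches when the data changes.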
> Second, is this the best way to represent this data (as you say)?
> If I'm correct in that the matrix mainly identifies a variable's
> data values, then perhaps we should really be storing a frequency
> table for the variable, and transforming that into the
> appropriate gsl_matrix as needed. After all, a lot of procedures
> could find a frequency table useful (right?).
Right. I like the idea of the frequency table, but there should still be
a lookup table that maps the variable values to binary vectors. The
reason is that the design matrix, and anything derived from it, may be
worth saving (especially after computations resulting from passing a
large data set). The procedures that see the design matrix, or the
Hessian, or whatever, later will need to know which vectors correspond
to which values, and vice versa. That lookup table could be stored
elsewhere, but the most appropriate place is probably in the struct
A frequency table is an example of a 'sufficient statistic',
which is a good thing to cache. For the purpose of writing software,
storing a sufficient statistic means not having to pass the data again
(unless we switch to a different statistical model). I am in favor of
caching sufficient statistics whenever possible. If creating a
general-purpose cache to hold sufficient statistics is feasible, it
might be worth creating someday. It would also take a lot of thought to
do it right, since there are so many models, each with their own
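
As a rough illustration of the frequency table and the value-to-vector lookup discussed above, here is a minimal sketch in plain C. It is not PSPP code: struct freq_table, MAX_VALUES and the function names are all hypothetical. It also assumes a full indicator coding with one position per observed value; a real design matrix might drop a reference category. The index of a value in the table doubles as the position of the 1 in its binary vector, so the mapping can be followed in either direction.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VALUES 16   /* Arbitrary limit, for this sketch only. */

/* Hypothetical frequency table for one categorical variable. */
struct freq_table
  {
    double values[MAX_VALUES];   /* Distinct values, in order first seen. */
    size_t counts[MAX_VALUES];   /* Number of cases with each value. */
    size_t n_values;             /* Number of distinct values. */
  };

/* Returns the index of VALUE, or ft->n_values if it is absent. */
static size_t
freq_table_lookup (const struct freq_table *ft, double value)
{
  size_t i;
  for (i = 0; i < ft->n_values; i++)
    if (ft->values[i] == value)
      return i;
  return ft->n_values;
}

/* Counts one more case that has VALUE. */
static void
freq_table_add (struct freq_table *ft, double value)
{
  size_t i = freq_table_lookup (ft, value);
  if (i == ft->n_values)
    {
      assert (ft->n_values < MAX_VALUES);
      ft->values[i] = value;
      ft->counts[i] = 0;
      ft->n_values++;
    }
  ft->counts[i]++;
}

/* Writes the binary vector for VALUE into OUT, which must have room
   for ft->n_values entries.  Returns 1 on success, 0 if VALUE was
   never observed. */
static int
freq_table_to_indicator (const struct freq_table *ft, double value,
                         double *out)
{
  size_t idx = freq_table_lookup (ft, value);
  size_t i;
  if (idx == ft->n_values)
    return 0;
  for (i = 0; i < ft->n_values; i++)
    out[i] = (i == idx) ? 1.0 : 0.0;
  return 1;
}

/* Reverse mapping: the value whose binary vector has its 1 at IDX. */
static double
freq_table_value_at (const struct freq_table *ft, size_t idx)
{
  assert (idx < ft->n_values);
  return ft->values[idx];
}
```

A procedure that later sees a saved design matrix or Hessian could use freq_table_value_at() to recover which value a given column stands for. The counts need only one pass over the data to build, and they are the part worth caching.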
PGP Public key ID: 1024D/2DE827B3
fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.