Re: contribution and cvs
Fri, 2 Sep 2005 18:18:26 +0000
On Thu, Sep 01, 2005 at 01:07:11PM -0700, Ben Pfaff wrote:
> Jason Stover <address@hidden> writes:
> > Recoding a categorical variable's values as vectors with binary
> > entries is a basic necessity for most statistical procedures which
> > use categorical data. PSPP must pass the data once to recode
> > those values, so it would be nice if the struct variable held those
> > binary vectors, even after the procedure that created them exits, thereby
> > making the vectors available to the next procedure. There would be one
> > binary vector per distinct value.
> > But, by the comment above, v->aux can hold the binary vectors only until
> > someone else needs to hold other auxiliary data.
> > The code I wrote before did not add anything to the struct variable,
> > but to make it work I had to create a struct
> > recoded_categorical_array. The recoded_categorical_array is cumbersome
> > and would be unnecessary if the variable values could be stored inside
> > the struct variable. So may I/we/someone add a gsl_matrix * to the
> > definition of struct variable? Doing so will make a lot of numerical
> > routines easier to write.
> I'm not against adding a member if that's the best solution, but
> I'd like to learn more. If I understand correctly, the primary
> purpose of the matrix is to identify the values that the variable
> actually takes on. Is that correct?...
> If so, then I have two concerns.
> First, how do we track changes to the data in the active file
> between procedures? If the user does something like
> COMPUTE x = x + 1.
> SELECT IF x NE 1.
> then this means that we have to invalidate the cache, but
> currently there isn't any mechanism for that. We want such a
> cache and invalidation mechanism for other reasons, so it's
> becoming increasingly clear to me that it's something to
> implement soon, but it's not there yet.
Agreed. In light of this need, I will leave the struct
recoded_categorical_array where it is for now, ugly as it is.
> Second, is this the best way to represent this data (as you say)?
> If I'm correct in that the matrix mainly identifies a variable's
> data values, then perhaps we should really be storing a frequency
> table for the variable, and transforming that into the
> appropriate gsl_matrix as needed. After all, a lot of procedures
> could find a frequency table useful (right?).
Right. I like the idea of the frequency table, but there should still be some
lookup table that maps the variable values to binary vectors. The
reason is that the design matrix, and anything derived from it, may be
worth saving (especially after computations resulting from passing a
large data set). The procedures that see the design matrix, or the
Hessian, or whatever, later will need to know which vectors correspond
to which values, and vice versa. That lookup table could be stored
elsewhere, but the most appropriate place is probably in the struct
variable.
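To make the lookup-table idea concrete, here is a minimal sketch in
plain C. Everything in it is hypothetical and is not existing PSPP code:
struct cat_lookup, the function names, the fixed MAX_CATEGORIES limit,
and the use of a plain double array where a real implementation would
probably fill a row of a gsl_matrix. It only illustrates the two-way
mapping between values and binary vectors, with one vector per distinct
value.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CATEGORIES 16            /* Arbitrary limit for this sketch. */
#define NOT_FOUND ((size_t) -1)

/* Hypothetical lookup table: the i-th distinct value of a categorical
   variable corresponds to the binary vector whose i-th entry is 1. */
struct cat_lookup
  {
    double values[MAX_CATEGORIES];   /* Distinct values, in order first seen. */
    size_t n_values;                 /* Number of distinct values stored. */
  };

void
cat_lookup_init (struct cat_lookup *t)
{
  t->n_values = 0;
}

/* Returns the index of VALUE in T, or NOT_FOUND. */
size_t
cat_lookup_find (const struct cat_lookup *t, double value)
{
  for (size_t i = 0; i < t->n_values; i++)
    if (t->values[i] == value)
      return i;
  return NOT_FOUND;
}

/* Records VALUE during the data pass and returns its index.
   A value seen before keeps its original index. */
size_t
cat_lookup_insert (struct cat_lookup *t, double value)
{
  size_t i = cat_lookup_find (t, value);
  if (i != NOT_FOUND)
    return i;
  assert (t->n_values < MAX_CATEGORIES);
  t->values[t->n_values] = value;
  return t->n_values++;
}

/* Value -> vector.  Writes t->n_values entries into OUT.
   Returns 1 on success, 0 if VALUE was never recorded. */
int
cat_lookup_encode (const struct cat_lookup *t, double value, double *out)
{
  size_t idx = cat_lookup_find (t, value);
  if (idx == NOT_FOUND)
    return 0;
  for (size_t i = 0; i < t->n_values; i++)
    out[i] = (i == idx) ? 1.0 : 0.0;
  return 1;
}

/* Vector -> value.  Stores the value whose entry in VEC is 1 into *VALUE.
   Returns 1 on success, 0 if VEC has no entry equal to 1. */
int
cat_lookup_decode (const struct cat_lookup *t, const double *vec,
                   double *value)
{
  for (size_t i = 0; i < t->n_values; i++)
    if (vec[i] == 1.0)
      {
        *value = t->values[i];
        return 1;
      }
  return 0;
}
```

A procedure that later receives the design matrix could call something
like cat_lookup_decode on a row to recover the original value, and
something like cat_lookup_encode to go the other way.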
A frequency table is an example of a 'sufficient statistic',
which is a good thing to cache. For the purpose of writing software,
storing a sufficient statistic means not having to pass the data again
(unless we switch to a different statistical model). I am in favor of
caching sufficient statistics whenever possible. If creating a
general-purpose cache to hold sufficient statistics is feasible, it
might be worth creating someday. It would also take a lot of thought to
do it right, since there are so many models, each with their own
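
As a rough illustration of caching a frequency table, the sketch below
accumulates counts in one pass and then answers questions without
touching the data again. The names (struct freq_table, freq_add, and so
on) and the fixed FREQ_MAX_VALUES limit are assumptions made for this
example only. It does not address invalidating the cache when the
active file changes.

```c
#include <assert.h>
#include <stddef.h>

#define FREQ_MAX_VALUES 16           /* Arbitrary limit for this sketch. */

/* Hypothetical cached frequency table for one categorical variable. */
struct freq_table
  {
    double values[FREQ_MAX_VALUES];  /* Distinct values seen. */
    size_t counts[FREQ_MAX_VALUES];  /* Number of cases with each value. */
    size_t n_values;                 /* Number of distinct values. */
    size_t n_cases;                  /* Total number of cases counted. */
  };

void
freq_init (struct freq_table *f)
{
  f->n_values = 0;
  f->n_cases = 0;
}

/* Called once per case during the single pass through the data. */
void
freq_add (struct freq_table *f, double value)
{
  size_t i;
  for (i = 0; i < f->n_values; i++)
    if (f->values[i] == value)
      break;
  if (i == f->n_values)
    {
      /* First time this value is seen: add a new entry. */
      assert (f->n_values < FREQ_MAX_VALUES);
      f->values[i] = value;
      f->counts[i] = 0;
      f->n_values++;
    }
  f->counts[i]++;
  f->n_cases++;
}

/* Returns the number of cases with VALUE, or 0 if it never occurred. */
size_t
freq_count (const struct freq_table *f, double value)
{
  for (size_t i = 0; i < f->n_values; i++)
    if (f->values[i] == value)
      return f->counts[i];
  return 0;
}

/* Returns the sample proportion of VALUE, computed from the cached
   counts alone, or 0 if no cases have been counted. */
double
freq_proportion (const struct freq_table *f, double value)
{
  if (f->n_cases == 0)
    return 0.0;
  return (double) freq_count (f, value) / (double) f->n_cases;
}
```

The list of distinct values in such a table is also what would be
needed to build the binary vectors on demand.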
SDF Public Access UNIX System - http://sdf.lonestar.org