Re: data sets and caching

pspp-dev

From:	Ben Pfaff
Subject:	Re: data sets and caching
Date:	Mon, 31 Oct 2005 15:09:29 -0800
User-agent:	Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

Jason Stover <address@hidden> writes:

> On Mon, Oct 31, 2005 at 10:25:20AM -0800, Ben Pfaff wrote:
>> Jason Stover <address@hidden> writes:
>> 
>> > I need to be able to append residuals to the active file
>> > with a 'save' subcommand. How should I go about this?
>> 
>> Would you like to save them for a single session only,
>> or should it be possible to save them to disk and retrieve them
>> in later sessions?
>
> Good question. I had intended to save them to the working data file,
> as the SPSS SAVE subcommand does in its regression procedure.  Users
> mostly like to look at residuals and run tests on them after the model
> has been fit. But if this working data file is written to disk, the
> residuals are written with it, and can be used later. 

Ah, so the models would be included as part of the working file
dictionary?  That's a workable idea.  (SPSS already does
something like this, you say?)

Should it be possible to save them to and retrieve them from
separate files?  (Maybe the SAVE/XSAVE command could support an
option that saves models without associated data.)

>> > This syntax illustrates two design changes that would make pspp more
>> > flexible for users.
>> >
>> > 1. The user can name the output from any procedure.  [...]
>> 
>> This looks good to me.  Do you have a good idea for syntax?  It
>> would be nice if the syntax were uniform across procedures, so
>> we'd want a keyword that wasn't already used (much) and ideally
>> one unique in its first three letters.  "name" seems a little too
>> generic for that purpose.
>
> I do not have a good idea for syntax, but will look into it. If 'name'
> is too generic, then the subcommand should indicate that we are naming
> a model, so maybe 'modname' or 'mname'? (I'll think about it more.)

Maybe TOMODEL or MOUT for the output side, or FRMODEL/MIN for
the input side?  Or maybe just MODEL if it's not necessary to
have both?  Rationale: I don't think 'name' is a very good word
to include, because it doesn't help the user to understand what
is being acted on or what is being done with it.  (When a command
takes variables it uses VARIABLES; when it takes a file it uses
FILE; 'name' isn't in either of those.)  'model' is good because
it identifies the type of object; 'to' indicates that it's a
destination.  (These are really just musings.)

> It would be a matter of supporting multiple, named active files.
>
>> I don't know whether a "name" keyword on procedures would be
>> sufficient for this purpose, because transformations that precede
>> procedure invocation need to know what active file they're
>> working out of.  That's assuming that the different active files
>> can have different dictionaries; if their dictionaries are
>> identical and they just have different data sets, then it
>> wouldn't be necessary as far as I can tell.
>
> Now I have a question: Do you mean that the 'name' keyword would be
> insufficient because just naming the cache of a procedure tells
> it nothing of the data set used to create the cache? So if a data
> set is modified, that cache may believe incorrect things about that
> data set? 

If different active files can have different dictionaries, then
transformation commands need to know which dictionary to use.
For example, if there are different variables named ID in the two
dictionaries, one of which is a string, the other of which is
numeric, then COMPUTE will need to know which is being referred
to.
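
A minimal Python sketch of this point (an illustration only, not PSPP
code; the names Variable, Dictionary, active_files, and resolve are
hypothetical). Each named active file carries its own dictionary, so
the same variable name resolves to a different type depending on
which active file is selected:

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    name: str
    is_string: bool  # True for a string variable, False for numeric


@dataclass
class Dictionary:
    variables: dict = field(default_factory=dict)

    def add(self, name, is_string):
        # Store names upper-cased so that lookups ignore case.
        self.variables[name.upper()] = Variable(name, is_string)

    def lookup(self, name):
        return self.variables[name.upper()]


# Two hypothetical named active files, each with its own dictionary.
active_files = {"patients": Dictionary(), "visits": Dictionary()}
active_files["patients"].add("ID", is_string=True)
active_files["visits"].add("ID", is_string=False)


def resolve(active_file_name, var_name):
    """Resolve a variable name against one active file's dictionary."""
    return active_files[active_file_name].lookup(var_name)
```

Here resolve("patients", "ID") yields a string variable and
resolve("visits", "ID") a numeric one, so a transformation that
mentions ID cannot be type-checked until an active file is chosen.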

If the different active files share the dictionary, but have
different data, then a reference to a variable name is
unambiguous.  We don't need to know which data set is in use
until we read the data.
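
The shared-dictionary case can be sketched the same way (again a
hypothetical Python illustration, not PSPP code; shared_dictionary,
data_sets, and compile_double are made-up names). The variable name
is bound to a column when the transformation is set up, and the data
set is chosen only when cases are actually read:

```python
# One shared dictionary: variable name -> column index in each case.
shared_dictionary = {"ID": 0, "X": 1}

# Two hypothetical data sets that share that layout.
data_sets = {
    "a": [[1.0, 10.0], [2.0, 20.0]],
    "b": [[3.0, 30.0]],
}


def compile_double(var_name):
    """Bind the variable name now; pick the data set only when run."""
    column = shared_dictionary[var_name]  # resolved without any data set

    def run(data_set_name):
        # The data set is consulted only here, when cases are read.
        return [2 * case[column] for case in data_sets[data_set_name]]

    return run


double_x = compile_double("X")
```

The same double_x can then be applied to either data set, for
example double_x("a") or double_x("b"), without re-resolving X.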

> Do you think it would be beneficial to use a garbage collector
> for cache allocation? (Like Boehm's, which is not entirely GPL'd?)

I don't see how a language-level GC is going to help with this
problem.  Perhaps you can explain further, if you think
differently.
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org

Current Thread

data sets and caching, Jason Stover, 2005/10/31
- Re: data sets and caching, Ben Pfaff, 2005/10/31
  - Re: data sets and caching, Jason Stover, 2005/10/31
    - Re: data sets and caching, Ben Pfaff <=
