Re: regression lib
Tue, 3 May 2005 07:31:58 +0800
On Mon, May 02, 2005 at 08:24:57AM -0700, Ben Pfaff wrote:
John Darrington <address@hidden> writes:
> Currently there's no caching of statistics. Each procedure
> calculates them for itself, which is less than ideal because it leads
> to a lot of duplication. For example group.c largely duplicates
Hmm. If so, I think that's probably orthogonal to the caching
problem. Is there some reason those files can't share some
common code to perform their common functionality?
There's no fundamental reason. It's just a question of coming up
with the right model to fit the problem (which implies that one has to
understand the problem adequately). If I'm going to spend the time
refactoring those two files, then I want to do it in such a way
that'll make implementation of other procedures easier.
> It's not only mean and stddev. I can foresee dozens of procedures
> which need to calculate sst sse etc. It would be good if
> applications could just look these values up in a cache. But there's
> a lot of issues to consider:
> * The cache would have to be invalidated every time a transformation
> is done.
This is something we'll just have to deal with. I don't think
it's too hard. We just add a `statcache_invalidate(variable)'
function and call it for the modified variables from every
transformation that modifies variables, plus a
`statcache_invalidate_all()' function that invalidates everything
for procedures that modify the entire file (e.g. MATCH FILES).
> * Caching would be useful not only on complete variables, but also on
> subsets of cases. E.g. variable X, factored by variable Y. So how
> does one define all the possibilities?
I have two ideas:
1. Ignore the problem. Only cache statistics on complete variables.
That's the easiest way. Will it give sufficient optimisation? It'll
affect only the most trivial uses of PSPP.
2. Try to handle some special cases as special cases.
For example, if FILTER BY <VAR> is in effect, then we
could cache those values as long as FILTER BY <VAR>
remained in effect and <VAR> was unmodified.
I was thinking more about situations such as
DATA LIST LIST /A * /B * .
ONEWAY A BY B.
Here the ONEWAY procedure does all the same calculations as T-TEST
would (assuming B takes only two values). But everything that T-TEST
calculated is freed when it exits.
> * Each statistic (eg: mean, stddev) will be different depending upon
> the specification of the procedure's /MISSING subcommand.
The most common case is "itemwise" missing with user-missing
values removed. We can ignore other cases if we want to. When
you're caching, you want to save time in the most common cases.
If you can save time in other cases, too, that's great, but it's
not as valuable because they don't come up as much.
Sure, it can be done. We just have to be very careful that we don't
end up using a cached value where it's not appropriate.
> All these things complicate the implementation and would mean that the
> potential cache space would be quite large.
But you don't reserve space for all of them on each variable.
You just allocate space as you need it. Furthermore, because the
cache is just an optimization, you can throw it, or part of it,
away if it gets too large.
I wasn't thinking so much about the physical memory, but rather the
way in which we would address these cached values, given that a lot of
parameters are needed in order to correctly specify the required
statistic.
I think this came up before and I threw up some of these same
objections. They are problems, sure. But they are problems we
can deal with and I think we should, sometime post-0.4.0.
I'm not saying that these problems can't be overcome. But they are
problems which need to be carefully considered. And yes, if we're
going to do it, then it definitely should be after 0.4.0.