pspp-dev

Re: pspp development


From: Ben Pfaff
Subject: Re: pspp development
Date: Sun, 07 Nov 2004 14:09:11 -0800
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

[Adding pspp-dev CC because this seems of general interest]

"Jason H. Stover" <address@hidden> writes:

>> Yes.  Also, I'm not so happy with the structure of a lot of the
>> code right now, so reactions are welcome--if you have suggestions
>> for improvement, let's talk about it.
>
> I just took a first glance in the src directory, and
> I already have a suggestion you might have thought of:
> Make more subdirectories. PSPP, if it grows to include 
> most of the procedures currently in SPSS, will have 
> hundreds of megabytes of source, so putting each routine
> in its own directory will make it more readable.

Yes, this is a good idea.  I actually meant to do this much
earlier (years ago) but Automake wasn't quite ready for it at the
time.  I think it's about time to give it another shot.

> Also, as PSPP acquires more procedures, it might need each procedure
> to be a shared library, to keep the size of the executable from being
> enormous.

I'm in favor of shared libraries when they help a large program
to be logically partitioned in a sensible way, or when parts of a
program are really libraries that can be used by other software.
But I don't think that doing it just because a program is getting
big is a good reason.  Enormous executables don't necessarily
start up any more slowly or run more slowly than small ones, and
there are definitely extra issues that come up when shared
libraries are introduced.  So in summary, I'm in favor of shared
libraries for the right reasons, but I don't think these have
come up yet.

(If there's some reason to introduce "plug-ins" into a program,
then that's another reason to use shared libraries.)

> One other question I have: Is there a policy of allowing one procedure
> to eat the 'output' (or intermediate data structures) from another
> procedure? E.g., can the residuals of the regression procedure be fed
> back to another regression procedure? Or a ttest procedure? SPSS
> cannot do this kind of thing, and I don't know if you want PSPP to be
> able to do it. I don't know if it's feasible. But it is a statistical
> practice that has become quite popular in the last decade, so if PSPP
> could do such a thing without becoming a nightmare, it would be good.
> SAS, SPSS's main competitor, was designed to allow this kind of thing,
> and that has helped make SAS more popular than SPSS over the years.
> (Though SAS's ability to do this output piping is klugey.)

This is one of the things I've been thinking about lately, and
it's one of the reasons that my long-promised "new output module"
has been delayed so much.  I had been of the opinion earlier that
what we needed for output was a module that, when supplied a
description of tables, etc., wanted for output, translated that
description into the format the user wants, such as plain text or
HTML or PostScript.

But lately I've come to realize that, while this is okay for
human-readable output, it's not very nice as machine-readable
output.  By the time the output has been transformed into a
description of tables, much has been effectively lost:
machine-precision numbers are now just a few decimal places,
multidimensional arrays are flattened into 2 dimensions, there's
no semantic description of the analysis that took place, and so
on.  Some of this can be recovered by careful parsing (like
building up multidimensional arrays) but there really shouldn't
be the need for it.  It should be possible to "play nicely" with
other software (and other procedures) without the need for
parsing textual or tabular output.
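
[Illustration only, not PSPP code.  A minimal Python sketch of the
loss described above, assuming a made-up 2 x 2 x 3 array of
statistics and a renderer that prints two decimal places.  The
names render_table and parse_table are hypothetical.]

```python
import math

# Hypothetical result: a 2 x 2 x 3 array of full-precision statistics.
full = [[[math.sqrt(i + 2 * j + 4 * k + 2) for k in range(3)]
         for j in range(2)]
        for i in range(2)]


def render_table(cube, places=2):
    """Flatten the cube into text rows of 3 cells, rounding each cell."""
    rows = []
    for plane in cube:
        for row in plane:
            rows.append(" ".join(f"{x:.{places}f}" for x in row))
    return "\n".join(rows)


def parse_table(text):
    """Recover numbers from the rendered text, as a consumer would have to."""
    return [[float(cell) for cell in line.split()]
            for line in text.splitlines()]


text = render_table(full)
parsed = parse_table(text)

# The 2 x 2 leading dimensions have collapsed into 4 anonymous rows,
# and every value is only as good as the two printed decimals.
original_cells = [x for plane in full for row in plane for x in row]
parsed_cells = [x for row in parsed for x in row]
max_error = max(abs(a - b) for a, b in zip(original_cells, parsed_cells))
```

Here parsed is a flat list of 4 rows with nothing to say which rows
belonged to which plane, and max_error is nonzero because the
rounded text cannot reproduce the original values.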

So lately I've been thinking about adding another level of
indirection (the way that all computer problems can be solved :-)
That is, instead of having PSPP generate a description of what
output should look like, have it dump out the actual output in a
machine-readable format, together with information on what
analysis led to the output and the formatting requested by the
user.  Programs that want the real output can obtain it easily,
and we can still write translators that produce pretty
human-readable output in the format originally requested.  We can
do better, in fact; we can now have "output browsers" that let
the formatting be reconfigured on-the-fly, and things like pivot
tables become pretty easy.
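
[Illustration only, not PSPP code.  A rough Python sketch of the
idea above, under several assumptions: the record layout, its field
names, and the functions to_text, to_html, and pivot are all
invented here, and JSON merely stands in for whatever
machine-readable format is eventually chosen.]

```python
import json
from html import escape

# Hypothetical result record: the real output, plus what analysis
# produced it and what formatting the user asked for.
result = {
    "analysis": {"procedure": "DESCRIPTIVES",
                 "variables": ["height", "weight"]},
    "format": {"type": "text", "decimals": 2},
    "dims": ["variable", "statistic"],
    "labels": {"variable": ["height", "weight"],
               "statistic": ["mean", "sd"]},
    "values": [[170.18333333333334, 9.5123456789],
               [68.25, 12.000000001]],
}


def to_text(r, decimals=None):
    """Translate the record into a tab-separated, human-readable table."""
    d = r["format"]["decimals"] if decimals is None else decimals
    row_dim, col_dim = r["dims"]
    lines = ["\t".join([""] + r["labels"][col_dim])]
    for label, row in zip(r["labels"][row_dim], r["values"]):
        lines.append("\t".join([label] + [f"{v:.{d}f}" for v in row]))
    return "\n".join(lines)


def to_html(r):
    """Translate the same record into an HTML table."""
    d = r["format"]["decimals"]
    row_dim, col_dim = r["dims"]
    head = "".join(f"<th>{escape(c)}</th>" for c in r["labels"][col_dim])
    body = ""
    for label, row in zip(r["labels"][row_dim], r["values"]):
        cells = "".join(f"<td>{v:.{d}f}</td>" for v in row)
        body += f"<tr><th>{escape(label)}</th>{cells}</tr>"
    return f"<table><tr><th></th>{head}</tr>{body}</table>"


def pivot(r):
    """Swap rows and columns; the underlying values are untouched."""
    return {**r,
            "dims": r["dims"][::-1],
            "values": [list(col) for col in zip(*r["values"])]}


# Other programs read the record itself and get the numbers back exactly.
machine = json.loads(json.dumps(result))
```

Because the translators run after the fact, an output browser could
call to_text(result, decimals=4) or to_text(pivot(result)) on the
same stored record without rerunning the analysis.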

I've been investigating what generalized formats already exist
for efficiently dealing with scientific data.  Right now I'm most
impressed with HDF5 (http://hdf.ncsa.uiuc.edu/HDF5/), which seems
to offer everything that I want.  

> I'm sorry to mention all this stuff before I've familiarized myself
> with the code. I'll tinker before blabbering about any more
> 'brilliant' ideas.

*shrug*  Seem like good ideas to me.
-- 
Ben Pfaff 
email: address@hidden
web: http://benpfaff.org



