
Re: RFC: output as tables


From: Ben Pfaff
Subject: Re: RFC: output as tables
Date: Tue, 03 Jun 2008 22:24:04 -0700
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)

First: in this message I'm adopting the vocabulary for "tables"
and "relations" that I described in my parallel reply to John.

Jason Stover <address@hidden> writes:

> In this case, the relationship between rows and columns is easy to see.
> Certainly there would be other times when this relationship would not
> be so easy. Can you think of any cases in which there would 
> be so many row/column relationships as to make this approach too bloated
> or complex?

I am not sure that I understand this question.  Maybe this is a
responsive answer, maybe not: I hope to use a relational model
(as in SQL) for output from statistical procedures, so that if
there is a complex relationship, then it can be realized as
multiple relations and "joins" between them.  If you are
concerned about storage space, this might not help, because
executing the "join" will materialize the whole unified relation
anyhow, but if you are concerned about code complexity, or if you
might only need a subset of the result of the join (either in
total or just one piece at a time), then it can be an
improvement.
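A minimal sketch of that idea, assuming SQL really were the query
language (which is not decided) and using SQLite purely for
illustration.  The relation names `cells` and `row_totals`, their
columns, and the counts are all invented for the example.  The point
is that the join is expressed once and can be asked for just one
piece of the result at a time:

```python
import sqlite3

# Two hypothetical relations that a crosstabulation procedure might emit.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (row_value INTEGER, col_value INTEGER, count INTEGER);
    CREATE TABLE row_totals (row_value INTEGER, total INTEGER);
    INSERT INTO cells VALUES (1, 1, 10), (1, 2, 30), (2, 1, 20), (2, 2, 40);
    INSERT INTO row_totals VALUES (1, 40), (2, 60);
""")


def row_percentages(conn, row_value):
    """Join the two relations, but only for one row of the crosstabulation."""
    query = """
        SELECT c.row_value, c.col_value, 100.0 * c.count / t.total
        FROM cells AS c
        JOIN row_totals AS t ON c.row_value = t.row_value
        WHERE c.row_value = ?
        ORDER BY c.col_value
    """
    return conn.execute(query, (row_value,)).fetchall()


print(row_percentages(conn, 1))
```

Dropping the WHERE clause would materialize the whole joined
relation, which is the storage cost mentioned above.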

> Also, this is the only example in which you defined the row/column
> relationship: A row corresponding to a file split, a column to a
> statistic. What other kinds of tables can you (or anyone) think of, where
> the rows denote something besides a split file?

I don't mean to imply that using a row in a relation to denote a
split file would be a primary use for rows, or even a common use
for rows.  In particular, the output for a split file is not
ordinarily expected to be a single row, and in some cases there
would be no direct relationship at all between rows and split
files.

In the CROSSTABS example, a row is a cell in a crosstabulation.
In FREQUENCIES, a row would be a value and its frequency.  In
other cases, a row might be one entry in a matrix.

> There is the obvious one in which someone uses a SAVE-like command to
> save, say, residuals, where one row corresponds to one case and a column
> to the new variable RESID.

This is doable, I think.

> And which part of the program would have to know what the rows "mean"?
> How would it know? Do we have to explicitly code rules like "rows from
> this table from this procedure means A; from that table, B; from that
> procedure the rows mean C; ..."?

We will need a way to map from relations to tables, yes.  My
tentative approach to this is that each statistical procedure
would be accompanied by a little bit of code to do this
transformation.  This code would probably not be written in C.
Its duties would be:

        1. Use relational queries to join relations as necessary,
           producing a relation that is isomorphic to what the
           presentational table will show.  (The language for
           these queries might be SQL, but that is possibly too
           heavy-weight.)

        2. Designate columns in the resulting relation to be
           represented as rows or columns or layers, etc., in the
           presentational table.  This is where the Polaris paper
           I cited earlier comes in, or where the Wilkinson work
           comes in.

           (In a GUI, the user could adjust these choices
           interactively, in the fashion of a pivot table.)

        3. Provide default styles: column and row labels, lines
           between cells, colors, and so on.  (Think of what
           cascading style sheets make possible here.)

           (In a GUI, the user could also adjust these choices
           interactively.)

If done properly, it should be a rather small amount of code per
procedure, easier to write than the corresponding formatting code
we have now, and much more flexible.
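As a rough illustration of duties 2 and 3, here is a sketch in
Python.  The language, the function names `pivot` and `render`, the
list-of-dicts representation of a relation, and the sample counts are
all assumptions made for the example, not a proposed design.  The
per-procedure code would mostly amount to saying which relation
column goes on which axis:

```python
def pivot(relation, row_key, col_key, value_key):
    """Lay out a relation as a grid, one designated column per axis."""
    row_labels = sorted({r[row_key] for r in relation})
    col_labels = sorted({r[col_key] for r in relation})
    lookup = {(r[row_key], r[col_key]): r[value_key] for r in relation}
    grid = [[lookup.get((rl, cl)) for cl in col_labels] for rl in row_labels]
    return row_labels, col_labels, grid


def render(row_labels, col_labels, grid, width=8):
    """Apply a very plain default style: labels plus right-aligned cells."""
    lines = [" " * width + "".join(str(c).rjust(width) for c in col_labels)]
    for label, cells in zip(row_labels, grid):
        body = "".join(("" if v is None else str(v)).rjust(width) for v in cells)
        lines.append(str(label).ljust(width) + body)
    return "\n".join(lines)


# A hypothetical relation: one row per cell of a crosstabulation.
relation = [
    {"sex": "F", "region": "North", "count": 12},
    {"sex": "F", "region": "South", "count": 7},
    {"sex": "M", "region": "North", "count": 9},
    {"sex": "M", "region": "South", "count": 15},
]

print(render(*pivot(relation, "sex", "region", "count")))
# Swapping the axis assignment is all a pivot-table style GUI would need to do.
print(render(*pivot(relation, "region", "sex", "count")))
```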

> How will the table look if the statistic consists of more than one
> value? For example, a confidence interval can be thought of as a single
> statistic, but it is two values. Do you plan to keep the two values
> in one "column" or two? (I guess John's suggestion about a "multicolumn"
> feature would fix this.)

A confidence interval would likely be two columns in a relation,
which in a table could be mapped to two columns, to two entries
in a cell within a single row, or to another desired format.
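For instance (a sketch only; the column names `ci_lower` and
`ci_upper`, the two formatting functions, and the numbers are made up
for illustration), the same relation row could feed either
presentation:

```python
# One row of a hypothetical relation: the interval is stored as two columns.
row = {"variable": "x", "mean": 5.0, "ci_lower": 4.21, "ci_upper": 5.79}


def ci_as_two_columns(row):
    """Present the interval as two separate table cells."""
    return [f"{row['ci_lower']:.2f}", f"{row['ci_upper']:.2f}"]


def ci_as_one_cell(row):
    """Present the interval as two entries inside a single table cell."""
    return [f"({row['ci_lower']:.2f}, {row['ci_upper']:.2f})"]


print(ci_as_two_columns(row))
print(ci_as_one_cell(row))
```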

> Can multiple procedures read from/write to the same table? 

My inclination is this: when a procedure outputs a relation, it
is thereafter immutable.  A later procedure can read from the
relation, or it can output a new relation that incorporates or is
informed by all or part of the older relation.
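A small sketch of what I mean, using Python tuples to stand in for an
immutable relation.  The names `FreqRow` and `with_percent` and the
frequencies are hypothetical.  The later step reads the old relation
and emits a new one and never modifies the original:

```python
from typing import NamedTuple


class FreqRow(NamedTuple):
    value: int
    frequency: int


# Output of an earlier (hypothetical) FREQUENCIES-like procedure.
# A tuple of named tuples cannot be modified in place.
frequencies = (FreqRow(1, 10), FreqRow(2, 30), FreqRow(3, 60))


def with_percent(relation):
    """Derive a new relation that adds a percentage column."""
    total = sum(r.frequency for r in relation)
    return tuple(
        (r.value, r.frequency, 100.0 * r.frequency / total) for r in relation
    )


percentages = with_percent(frequencies)
print(percentages)
```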

> Let's say someone fits many models and wants to compare the
> performance of each. Could the output be made to have each procedure
> write to a table, with one row corresponding to that procedure, and
> one column to a statistic?  I think the answer to this is "yes"
> because you mentioned it being done with something similar to
> casereaders. I ask because, if this were all possible, then taking
> that table and displaying it as something human-readable would be easy
> and useful. 

Now I understand why you asked the question.

I would be inclined to do this by taking the relations that each
procedure output and joining them into a larger relation in the
way you describe, then displaying that larger relation in
human-readable form as a table.  I think that this achieves the
same effect you have in mind.
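Sketched in Python, with invented model names and statistics and with
hypothetical helpers `combine` and `as_table`, this could look like
the following.  Here "joining" is just a union of one-row relations
that share the same columns; a real design might need something more
general:

```python
# Each (hypothetical) procedure run emits a one-row relation.
# All numbers below are invented for illustration.
model_a = ({"model": "A", "r_squared": 0.81, "aic": 120.5},)
model_b = ({"model": "B", "r_squared": 0.85, "aic": 118.2},)
model_c = ({"model": "C", "r_squared": 0.79, "aic": 125.0},)


def combine(relations):
    """Union several relations with the same columns into a larger one."""
    return tuple(row for relation in relations for row in relation)


def as_table(relation, columns, width=10):
    """Display a relation in human-readable form, one row per model."""
    lines = ["  ".join(c.ljust(width) for c in columns).rstrip()]
    for row in relation:
        lines.append("  ".join(str(row[c]).ljust(width) for c in columns).rstrip())
    return "\n".join(lines)


combined = combine([model_a, model_b, model_c])
print(as_table(combined, ["model", "r_squared", "aic"]))
```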

> Are these tables how procedures read each others' output?

Yes.

>> Not all output is reasonably treated as part of a table, so procedures
>> may also write out additional metadata output not associated with a
>> table.  Some kinds of metadata might be included in some kind of
>> standardized form along with every procedure (or every command, even
>> those that are not procedures), such as the following:
> ...
>
> I know that if anything seems to not be easily describable by a table,
> we can put it into metadata. But I'm afraid that if there are many
> such exceptions to the table format, we often would wind up saying "oh
> that can't be in a table, so we'll just put it in the metadata," and
> thereby make the metadata a large dumping ground without any
> coherence. Do you think this is a serious risk?

Yes, I do.  Nothing is yet concrete enough to try to pin down the
difference between data and metadata, though.

I hope that you and John have further comments.
-- 
Ben Pfaff 
http://benpfaff.org



