pspp-dev
RFC: output as tables


From: Ben Pfaff
Subject: RFC: output as tables
Date: Mon, 02 Jun 2008 22:43:09 -0700
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)

I've been trying to write up a proposal for a new PSPP output
subsystem.  So far, I've only managed to finish writing up the
basic idea and a couple of brief examples, even though I have
more details in mind.  Here is what I have.

This part of the proposal doesn't get at all to the point where
we're actually showing data to the user.  It's just about how
PSPP procedures send along raw data to the next stage.

I expect there will be plenty of comments, objections, etc. at
this point.  That's fine.  Let's try to figure out whether this
is a good idea that can be made into a good design.  If it isn't,
then it's best to head it off at this point.

Basic Idea
==========

PSPP procedures will write out the bulk of their data output as cases
in tables.  Here, "tables" means, in PSPP terms, casewriters and
casereaders, analogous to tables in relational databases, not tables
drawn for presentational reasons.

Tables may be accompanied by metadata that annotates rows, columns,
and other properties, analogous to the way that variables can be
annotated with variable labels and value labels.  Typically, each
column or row could be annotated with an English description of the
column's purpose, suitable for use in a presentational table.
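
As a rough illustration of a table that carries its own annotations,
here is a minimal sketch in Python (not PSPP's C).  Every name in it
(Column, OutputTable, append, column_values) is hypothetical and is not
an existing PSPP interface; real output would go through casewriters
and casereaders.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    label: str = ""  # English description, usable in a presentational table


@dataclass
class OutputTable:
    columns: list                                    # list of Column
    rows: list = field(default_factory=list)         # one tuple per case
    properties: dict = field(default_factory=dict)   # table-level metadata

    def append(self, *values):
        """Add one case; it must supply exactly one value per column."""
        if len(values) != len(self.columns):
            raise ValueError("row width does not match column count")
        self.rows.append(tuple(values))

    def column_values(self, name):
        """Return every value stored in the named column, in row order."""
        index = [c.name for c in self.columns].index(name)
        return [row[index] for row in self.rows]
```

The point of the sketch is only that labels travel with the data, so a
later stage can build a presentational table without consulting the
procedure that produced it.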

Not all output is reasonably treated as part of a table, so procedures
may also write out additional metadata output not associated with a
table.  Some kinds of metadata might be included in some kind of
standardized form along with every procedure (or every command, even
those that are not procedures), such as the following:

        - The PSPP syntax used to invoke the procedure, as a string,
          along with file names and line numbers.

        - Any error, warning, or informational messages issued while
          parsing the syntax or executing the procedure.

        - Date and time at which the procedure was invoked.
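
The standardized per-command metadata listed above could be collected
in a record along the following lines.  This is a sketch only: the
class and field names are assumptions made for illustration, and the
severity strings are placeholders rather than PSPP's actual message
classes.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Message:
    severity: str  # assumed values: "error", "warning", or "note"
    text: str


@dataclass
class CommandMetadata:
    syntax: str            # the syntax text that invoked the command
    file_name: str         # syntax file the command came from
    first_line: int
    last_line: int
    invoked_at: datetime
    messages: list = field(default_factory=list)  # list of Message

    def has_errors(self):
        """True if any message issued for this command was an error."""
        return any(m.severity == "error" for m in self.messages)
```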

Here are a few examples.  There are many reasonable ways to fit data
into relational tables.  The following are the ways that occur to me
first, but they may not be the best ones.

Example 1: DESCRIPTIVES
-----------------------

Consider the DESCRIPTIVES procedure.  Its primary output could be a
table with the following columns:

        - One column per SPLIT FILE variable (only if SPLIT FILE is in
          effect).  These columns would be annotated as SPLIT FILE
          variables.  (Oddly, SPLIT FILE doesn't give us a way to know
          the sort order, unless we use a technique like that at
          http://groups.google.com/group/comp.programming/msg/c7ebefe24af2f930
          to figure it out on our own.)

        - VARIABLE, the name of a variable whose descriptive
          statistics were calculated by the procedure.  This column
          would be annotated with the variable's variable label.

        - N_VALID, the number of valid observations for this variable.

        - N_MISSING, the number of missing observations for this
          variable.

        - N_MISSING_LISTWISE, the number of cases that had at least
          one missing value in any of the analyzed variables.

        - One column per descriptive statistic calculated.

Each row in the table gives the descriptive statistics for one
variable within one file split.
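
To make the column layout concrete, here is a small Python sketch that
builds such rows from in-memory cases.  It assumes a single SPLIT FILE
variable, uses None to stand in for a missing value, and computes only
the mean as its one descriptive statistic; the function name
descriptives_rows is hypothetical.

```python
from statistics import mean


def descriptives_rows(cases, split_var, variables):
    """Return (column names, rows), one row per variable per split."""
    columns = [split_var, "VARIABLE", "N_VALID", "N_MISSING",
               "N_MISSING_LISTWISE", "MEAN"]
    # Group cases by the value of the split variable, keeping input order.
    groups = {}
    for case in cases:
        groups.setdefault(case[split_var], []).append(case)
    rows = []
    for split_value, group in groups.items():
        # Cases with a missing value in any of the analyzed variables.
        listwise = sum(1 for c in group
                       if any(c[v] is None for v in variables))
        for var in variables:
            valid = [c[var] for c in group if c[var] is not None]
            stat = mean(valid) if valid else None
            rows.append((split_value, var, len(valid),
                         len(group) - len(valid), listwise, stat))
    return columns, rows
```

For two analyzed variables and two split groups this yields four rows,
each tagged with its split value and variable name.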

When Z scores are calculated, the output could include a second table
with the following columns:

        - VARIABLE, the name of a variable on which Z scores were
          calculated.

        - Z_VARIABLE, the name of the Z score variable.

The procedure output would also include some additional metadata,
probably not exposed as tables.  Here are some possibilities that come
to mind:

        - Options that influenced calculations used to build the data
          tables, e.g. whether missing values were considered on a
          per-variable or casewise basis.

        - Options that influence output formatting, e.g. the requested
          sort order (to allow later stages of output to honor those
          options without having to parse PSPP syntax).

Example 2: CROSSTABS
--------------------

Consider the CROSSTABS procedure.  Its primary output would be a
collection of tables, one per requested crosstabulation combination,
each of which would have the following columns:

        - One column per SPLIT FILE variable (only if SPLIT FILE is in
          effect), as above.

        - One column per variable in the crosstabulation.

        - One column with the count.

Notably, these tables do not include all the data to be displayed in
the cells of the crosstabulations, such as row, column, and table
totals.  The output engine is expected to be able to handle simple
calculations of this type on its own.  Such totals will change anyhow
as the table is pivoted in an interactive environment.
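
The kind of simple calculation expected of the output engine can be
sketched as follows, in Python, for a two-variable crosstabulation
with no SPLIT FILE columns.  The function name marginal_totals and the
(row value, column value, count) row layout are assumptions made for
the example.

```python
from collections import defaultdict


def marginal_totals(rows):
    """Derive row, column, and table totals from a table of counts.

    rows: iterable of (row_value, column_value, count) tuples.
    """
    row_totals = defaultdict(int)
    column_totals = defaultdict(int)
    table_total = 0
    for row_value, column_value, count in rows:
        row_totals[row_value] += count
        column_totals[column_value] += count
        table_total += count
    return dict(row_totals), dict(column_totals), table_total
```

Pivoting the two variables only swaps which dictionary holds the row
totals and which holds the column totals, so nothing needs to be
recomputed by the procedure.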

The statistics optionally output by CROSSTABS would go, I think, into
a second table, with the following columns:

        - One column per SPLIT FILE variable (only if SPLIT FILE is in
          effect), as above.

        - One column to specify the crosstabulation combination.

        - One column per requested statistic.  Many statistics only
          apply to crosstabulation tables with certain characteristics
          (e.g. square tables, 2x2 tables, tables with only numeric
          data); crosstabulation tables that do not have these
          characteristics would have missing values for those statistics.
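
A sketch of how one row of such a statistics table might be filled in,
using Python with None standing in for a missing value.  The two
statistics shown (Pearson chi-square, which applies to any table with
nonzero marginal totals, and the odds ratio, which applies only to 2x2
tables) are chosen for illustration; the function names are
hypothetical and this is not how CROSSTABS is implemented.

```python
def pearson_chisq(counts):
    """Pearson chi-square for a table of counts (list of equal-length rows).

    Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in counts]
    column_totals = [sum(column) for column in zip(*counts)]
    total = sum(row_totals)
    chisq = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            chisq += (observed - expected) ** 2 / expected
    return chisq


def odds_ratio(counts):
    """Odds ratio for a 2x2 table; None (missing) when not applicable."""
    if len(counts) != 2 or any(len(row) != 2 for row in counts):
        return None
    (a, b), (c, d) = counts
    if b * c == 0:
        return None
    return (a * d) / (b * c)


def statistics_row(combination, counts):
    """One row: (crosstabulation combination, CHISQ, ODDS_RATIO)."""
    return (combination, pearson_chisq(counts), odds_ratio(counts))
```

A 2x3 table would get a chi-square value but a missing odds ratio,
which is the pattern described in the last list item above.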

CROSSTABS has a lot of options, for calculations and formatting, that
would need to be expressed as metadata.

-- 
"Let others praise ancient times; I am glad I was born in these."
--Ovid (43 BC-18 AD)



