Re: data access

pspp-dev

From:	Jason Stover
Subject:	Re: data access
Date:	Sat, 1 Jul 2006 21:30:26 -0400
User-agent:	Mutt/1.5.10i

I see your point. Large files aren't going to be stored entirely in a
matrix, and files small enough to be read entirely into matrices can
be read multiple times.

So it's okay the way it is, for most purposes.

In the future, there may be a need to read a subset of a very large
casefile into a gsl_matrix, and in that situation, reading it only
once and passing the gsl_matrix around would be beneficial. But that
hasn't happened yet, so I'll leave it for now.
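
A minimal sketch of that "read once, pass the matrix around" idea, under
stated assumptions: `case_source` and `dense_matrix` are hypothetical
stand-ins invented for this example, not PSPP's casefile API and not GSL.
With GSL available, the role of `dense_matrix` would be played by a
gsl_matrix. The subset is collected in one sequential pass, since random
access to a casefile is the expensive pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sequential reader over row-major case data.
   PSPP's real casefile interface differs. */
struct case_source {
    const double *data;   /* n_cases x n_vars values, row-major */
    size_t n_cases;
    size_t n_vars;
    size_t next;          /* index of the next case to hand out */
};

/* Returns the next case, or NULL when the source is exhausted. */
static const double *case_source_next(struct case_source *src)
{
    if (src->next >= src->n_cases)
        return NULL;
    return src->data + src->n_vars * src->next++;
}

/* Hypothetical dense matrix standing in for gsl_matrix. */
struct dense_matrix {
    size_t n_rows;
    size_t n_cols;
    double *data;         /* row-major */
};

static void dense_matrix_free(struct dense_matrix *m)
{
    if (m != NULL) {
        free(m->data);
        free(m);
    }
}

static double dense_matrix_get(const struct dense_matrix *m,
                               size_t i, size_t j)
{
    assert(i < m->n_rows && j < m->n_cols);
    return m->data[i * m->n_cols + j];
}

/* Reads up to `count` cases starting at case `first`, keeping only the
   columns listed in `cols`.  Cases before `first` are skipped by reading
   past them, so the source is only ever accessed sequentially. */
static struct dense_matrix *read_subset(struct case_source *src,
                                        size_t first, size_t count,
                                        const size_t *cols, size_t n_cols)
{
    struct dense_matrix *m = malloc(sizeof *m);
    const double *c;
    size_t idx = 0;

    if (m == NULL)
        return NULL;
    m->n_rows = 0;
    m->n_cols = n_cols;
    /* +1 keeps the request non-zero when the subset is empty. */
    m->data = malloc((count * n_cols + 1) * sizeof *m->data);
    if (m->data == NULL) {
        free(m);
        return NULL;
    }

    while (m->n_rows < count && (c = case_source_next(src)) != NULL) {
        if (idx++ < first)
            continue;                       /* sequential skip */
        for (size_t j = 0; j < n_cols; j++) {
            assert(cols[j] < src->n_vars);
            m->data[m->n_rows * n_cols + j] = c[cols[j]];
        }
        m->n_rows++;
    }
    return m;
}

/* Example consumer: works on the matrix, never touches the source again. */
static double column_mean(const struct dense_matrix *m, size_t col)
{
    double sum = 0.0;

    if (m->n_rows == 0)
        return 0.0;
    for (size_t i = 0; i < m->n_rows; i++)
        sum += dense_matrix_get(m, i, col);
    return sum / (double) m->n_rows;
}
```

Any number of routines like `column_mean` can then share the one matrix
without the data being read again.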

-Jason

On Sun, Jul 02, 2006 at 08:23:36AM +0800, John Darrington wrote:
> I think that part of the problem is that the casefiles are designed to
> be a) Fast; and b) Potentially very large.  One of the costs of these
> design criteria is that they're not quite so flexible.
> 
> As Ben has explained to me before, accessing a casefile in random
> order is much less efficient than doing so in sequential order.  I'm
> not sure that a 10^8 x 300 gsl_matrix would be very efficient.
> 
> I'm not a statistician, but I cannot envisage any situation where a
> matrix operation (e.g. pre-multiplication, inversion, etc.) would need
> to be performed on a casefile as a whole; it wouldn't make sense in the
> general case, because of non-numeric data.
> 
> Having said that, I'm working on abstracting the interface for the
> casefiles right now.  It might be possible to devise a casefile type
> that is more convenient for math routines, but probably not one that
> would be quite as flexible as gsl_matrix.
> 
> Can you give me an example of a particular problem you've encountered?
> I'll see if I can come up with any suggestions.
> 
> J'
> 
> 
> On Sat, Jul 01, 2006 at 02:01:45PM -0400, Jason Stover wrote:
>      I'm wrestling with reading data via casefiles again. We've all said
>      it would be nice to make reading the data easier, and Ben has
>      complained about every procedure's need to pass the entire data
>      set. 
>      
>      I thought of what might be a simple approach: Each time a procedure
>      reads the data via casefiles, it stores them in a gsl_matrix, along
>      with some other information about variable names, etc. Then the next
>      time a procedure needs the data, it uses that gsl_matrix, if it's
>      available and contains the necessary information. If not, it reads the
>      data via casefiles.
>      
>      Filling up and using a gsl_matrix is easy. I don't know how easy it
>      would be to store the meta-data the procedures would need.
>      
>      Pardon me if this is an old idea. But the difficulty of using
>      casefiles prevents other people from contributing mathematical code,
>      whereas gsl_matrices are easy to handle.
>      
> -- 
> PGP Public key ID: 1024D/2DE827B3 
> fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
> See http://pgp.mit.edu or any PGP keyserver for public key.
> 
>
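
The caching proposal in the original message (keep the matrix together with
variable names, reuse it only if it "contains the necessary information")
could be sketched as below. This is an illustration under assumptions, not
PSPP code: `cached_data` and both functions are hypothetical names, and the
only meta-data modelled is the list of variable names. A procedure would
call `cached_data_covers` first and fall back to reading the casefile when
it returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cache entry: numeric data plus the variable names that
   label its columns.  A real version would likely hold a gsl_matrix. */
struct cached_data {
    const char *const *var_names;  /* one name per column */
    size_t n_vars;
    size_t n_cases;
    const double *values;          /* n_cases x n_vars, row-major */
};

/* Returns the column index of `name`, or -1 if it is not cached. */
static int cached_data_find_var(const struct cached_data *c,
                                const char *name)
{
    for (size_t j = 0; j < c->n_vars; j++)
        if (strcmp(c->var_names[j], name) == 0)
            return (int) j;
    return -1;
}

/* Returns 1 if the cache exists and holds every variable in `needed`,
   0 otherwise (in which case the caller rereads the casefile). */
static int cached_data_covers(const struct cached_data *c,
                              const char *const *needed, size_t n_needed)
{
    assert(needed != NULL || n_needed == 0);
    if (c == NULL)
        return 0;
    for (size_t i = 0; i < n_needed; i++)
        if (cached_data_find_var(c, needed[i]) < 0)
            return 0;
    return 1;
}
```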

Current Thread

data access, Jason Stover, 2006/07/01
- Re: data access, John Darrington, 2006/07/01
  - Re: data access, Jason Stover <=
  - Re: data access, Ben Pfaff, 2006/07/03