
From: Marcus G. Daniels
Subject: Re: [Swarm-Support] HDF5 hdfview works, Swarm usage mysterious (was Re: New Code: Interpolator...
Date: Mon, 19 May 2003 17:01:12 -0600
User-agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.4b) Gecko/20030516

Paul E Johnson wrote:

In the R list, Robert Gentleman has appeared with another library that also supports HDF5 data.

Cool, I didn't know about that.  Here's the reference I guess you mean:

  http://www.bioconductor.org/repository/release1.1/package/html/rhdf5.html

I took a look at the source code, and basically it is an R veneer over a subset of the HDF5 interfaces, plus some operator implementations. By implementing methods for array operators and the like, it is possible to avoid pulling whole HDF5 datasets ("dataset" is a technical term in HDF5 for a typed matrix) into virtual memory. That makes R more like SAS -- useful for handling arbitrarily large datasets. One oversight in this package seems to be that it doesn't make use of HDF5 compound types, which are like rows in data frames in R or ivars in Swarm. HDF5 compound types make it possible to store flat objects in a very time- and space-efficient way.
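A rough stdlib analogy (not from the original mail, and not the HDF5 API itself): an HDF5 compound type packs heterogeneous fields into fixed-size binary records, much as Python's struct module packs an int, a double, and a fixed-length string into one compact row. The field names and sizes below are made up for illustration:

```python
import struct

# Record layout analogous to an HDF5 compound type for an agent
# with an int, a double, and an 8-byte fixed-length string.
# "<" = little-endian, no padding; i = int32, d = float64, 8s = 8 bytes.
AGENT_FMT = "<id8s"
RECORD_SIZE = struct.calcsize(AGENT_FMT)  # 4 + 8 + 8 = 20 bytes per agent

def pack_agent(agent_id, wealth, name):
    """Pack one agent into a fixed-size binary record."""
    raw = name.encode("ascii")[:8].ljust(8, b"\0")
    return struct.pack(AGENT_FMT, agent_id, wealth, raw)

def unpack_agent(record):
    """Recover the (id, wealth, name) tuple from a packed record."""
    agent_id, wealth, raw = struct.unpack(AGENT_FMT, record)
    return agent_id, wealth, raw.rstrip(b"\0").decode("ascii")

# 100 agents stored as 100 contiguous fixed-size rows -- the same
# time/space win that a compound-type HDF5 dataset gives.
records = b"".join(pack_agent(i, i * 1.5, "a%d" % i) for i in range(100))
assert len(records) == 100 * RECORD_SIZE
assert unpack_agent(records[:RECORD_SIZE]) == (0, 0.0, "a0")
```

Because every record has the same fixed size, the i-th agent can be read without parsing anything before it, which is what makes this layout fast.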

The hdf5 package I wrote works with most of the primitive types that R provides, and also has special support for R data frames (tables, basically). R data frames are useful in conjunction with Swarm because data frames are a high-level interface similar to class declarations in Java or Objective C. For example, if you have an agent class with an int, a double, and a string, and 100 instances of that agent, and your simulation periodically flushes those agents to an HDF5 file, it's possible to load those dumps into a list of data frames and use them in a convenient way in R.

I think the ideal kind of statistical interface for dealing with large datasets (e.g. periodic snapshots of simulations, or loading of complex landscapes or scenarios) would be one that was not only fast and compact on disk but also used memory with care (my package pulls any requested thing fully into memory). In order to do that in R, one would have to provide a set of methods that do all of the things R does in its (S language) library code, and in addition make sure that composite concepts like data frames mapped efficiently to HDF5 features. R & S-Plus, being object-oriented, are designed to make that possible, but I don't think it would be a trivial endeavor.
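The snapshot workflow described above -- periodically flushing agents, then later loading each dump as one table in a list -- can be sketched with stdlib Python. This is only an analogy: the function names (snapshot, load_snapshots) and the CSV-in-memory storage are stand-ins for what a real pipeline would do with Swarm's HDF5 serialization and the R hdf5 package:

```python
import csv
import io

def snapshot(agents, stream):
    """Append one time step's agents as rows (like one HDF5 dataset dump)."""
    writer = csv.writer(stream)
    for agent in agents:
        writer.writerow([agent["id"], agent["wealth"], agent["name"]])

# Simulate three periodic flushes, one buffer per time step.
dumps = []
for step in range(3):
    buf = io.StringIO()
    agents = [{"id": i, "wealth": float(step + i), "name": "a%d" % i}
              for i in range(2)]
    snapshot(agents, buf)
    dumps.append(buf.getvalue())

def load_snapshots(dumps):
    """Load each dump into a column-oriented table, like a list of R data frames."""
    frames = []
    for text in dumps:
        rows = list(csv.reader(io.StringIO(text)))
        frames.append({
            "id": [int(r[0]) for r in rows],
            "wealth": [float(r[1]) for r in rows],
            "name": [r[2] for r in rows],
        })
    return frames

frames = load_snapshots(dumps)
assert len(frames) == 3            # one "data frame" per flush
assert frames[1]["wealth"] == [1.0, 2.0]
```

The column-per-field dictionaries mirror how a data frame exposes each ivar of the agent class as a named vector.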

In other words, rewrite both packages into a single module. The approach of the rhdf5 package is more scalable, but I think my package is probably faster for datasets that fit in virtual memory. (Marcus, who is now doing R runs on an Opteron machine halfway across the country, because he is out of both virtual and physical memory!) While doing this, it would probably be worth reflecting on whether or not the effort could be unified with the SQL code somehow.


