
Re: [h5md-user] Another HDF5-based trajectory format


From: Felix Höfling
Subject: Re: [h5md-user] Another HDF5-based trajectory format
Date: Tue, 02 Sep 2014 23:46:34 +0200
User-agent: Opera Mail/12.16 (Linux)

On 02.09.2014 at 10:59, <address@hidden> wrote:

Hi Konrad,

On Mon, Sep 01, 2014 at 11:58:27AM +0200, Konrad Hinsen wrote:
I found this description somewhat accidentally:

  http://mdtraj.org/latest/hdf5_format.html

It looks like a complete definition of an HDF5-based trajectory
format for biomolecular systems, but it also looks like tailor-made
for the needs of a particular library.

This is very interesting indeed. From browsing the history [1] (I just love
that you can do that), the HDF5 support was added a bit more than a year ago,
i.e. later than my extensive web searches. I had found H5Part [2], but it was
not satisfactory. I'll have a closer look at mdtraj.

As you mention, it is program-specific. Also, the structure seems really rigid (no
groups, for instance).

The topology [their own word; we say connectivity now, I think :-)] storage is
a bit awkward: it is a JSON text file embedded in an HDF5 dataset. But at
least they have connectivity information.

With respect to MOSAIC, it also seems more rigid for the same reasons as above:
one single entity in the file.

An interesting aspect is the use of compression. Did anyone try this
in H5MD?

Only a few times, without much gain (from memory, less than 10% with gzip).
I don't have a good automated strategy to test this.

Pierre

[1] https://github.com/rmcgibbo/mdtraj/commits/master/docs/hdf5_format.rst
[2] http://vis.lbl.gov/Research/H5Part/




The MDtraj project is indeed interesting. It lacks a conversion tool to
and from H5MD :-)

Compression was one of the major criteria for choosing HDF5 when Peter and I
started with MD simulations. In HALMD it is enabled by default (but the
parameters are not optimised). The underlying HDF5 concept is filters,
which require a chunked dataset layout. The relevant filters are
"shuffle" in combination with "deflate" (GZIP) or SZIP. You can nicely
play around with

    h5repack -f SHUF -f GZIP=6 input.h5 output.h5

and check the result with "h5dump -Hp" or "h5ls -v". For some arbitrary
file with 205k particles (2 snapshots, float32), I get 13% compression
with GZIP alone and 39% with shuffle plus GZIP. (I can't compare with SZIP
which is missing in my h5repack build.) Using GZIP=9 improves the ratio
only marginally by 0.2%.
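Why shuffle helps deflate so much can be sketched in plain Python, without
HDF5 at all: byte-shuffling stores the first byte of every float32 together,
then the second byte, and so on, so the nearly constant sign/exponent bytes
form long runs that deflate handles very well. The data below is made up for
illustration; this mimics the filter's reordering, it is not the HDF5 API.

```python
import struct
import zlib

# Smooth float32 "coordinates": neighbouring values share their
# sign/exponent bytes, as trajectory data typically does.
values = [100.0 + 1e-4 * i for i in range(50000)]
raw = struct.pack("<%df" % len(values), *values)

def shuffle(buf, itemsize):
    # Sketch of HDF5's shuffle filter: byte 0 of every element first,
    # then byte 1, etc. Purely a reordering, fully reversible.
    return bytes(buf[j] for i in range(itemsize)
                 for j in range(i, len(buf), itemsize))

plain = len(zlib.compress(raw, 6))
shuffled = len(zlib.compress(shuffle(raw, 4), 6))
print(plain, shuffled)  # the shuffled stream compresses noticeably better
```

The reordering costs nothing in size by itself; it only rearranges bytes so
that the compressor sees homogeneous streams instead of interleaved ones.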

Dataset layout and the chunk size are relatively important. Switching from
chunks of 1xNxD to 1xNx1 by

    h5repack -l particles/A/position:CHUNK=1x204800x1 input.h5 output.h5

packs all x-coordinates separately etc. and improves the compression ratio
considerably from 39% to 59% (for my specific example).
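The chunk-layout effect can be illustrated the same way: storing each axis
contiguously (the 1xNx1 layout) hands the compressor three homogeneous
streams instead of interleaved x/y/z triples. The per-axis behaviour below is
invented purely for illustration; real trajectories will differ.

```python
import struct
import zlib

N = 50000
# Made-up per-axis behaviour: x drifts, y sits near a wall, z oscillates.
x = [0.1 * i for i in range(N)]
y = [25.0] * N
z = [50.0 + 0.01 * (i % 100) for i in range(N)]

# 1xNxD-style stream: coordinates interleaved as x0 y0 z0 x1 y1 z1 ...
interleaved = struct.pack("<%df" % (3 * N),
                          *(v for triple in zip(x, y, z) for v in triple))
# 1xNx1-style stream: each axis stored contiguously, as after the repack.
per_axis = struct.pack("<%df" % (3 * N), *(x + y + z))

a = len(zlib.compress(interleaved, 6))
b = len(zlib.compress(per_axis, 6))
print(a, b)  # the per-axis layout compresses better on this data
```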

I find these numbers really encouraging to enable the compression features
of HDF5 (and thus H5MD).

The HDF5 library also helps to get rid of the low-order floating-point bits
(which are almost incompressible white noise and thus irrelevant). If the
memory and file datatypes are different, e.g. double in memory but float in
the file, the conversion should be done by the library (but I haven't tried
this myself).
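What such a narrowing conversion does can be shown with a minimal stdlib
sketch (in HDF5 the library would perform this when memory and file datatypes
differ; the struct calls here only illustrate the effect on one value):

```python
import struct

x = 0.1234567890123456  # a double whose low-order mantissa bits are "noise"

as_f8 = struct.pack("<d", x)  # 8 bytes, full double precision
as_f4 = struct.pack("<f", x)  # 4 bytes: low-order mantissa bits dropped
roundtrip = struct.unpack("<f", as_f4)[0]

# Half the storage, at the cost of a small (~1e-7) relative error.
print(len(as_f8), len(as_f4), abs(roundtrip - x) / x)
```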

Cheers,

Felix


