From: Felix Höfling
Subject: Re: [h5md-user] Another HDF5-based trajectory format
Date: Tue, 02 Sep 2014 23:46:34 +0200
User-agent: Opera Mail/12.16 (Linux)
On 02.09.2014, 10:59, <address@hidden> wrote:
Hi Konrad,

On Mon, Sep 01, 2014 at 11:58:27AM +0200, Konrad Hinsen wrote:
> I found this description somewhat accidentally:
> http://mdtraj.org/latest/hdf5_format.html
> It looks like a complete definition of an HDF5-based trajectory format
> for biomolecular systems, but it also looks tailor-made for the needs
> of a particular library.

This is very interesting indeed. From browsing the history (I just love that you can do that) [1], the HDF5 support is from a bit more than a year ago, that is, later than my extensive web searches. I had found H5Part [2], but it was not satisfactory. I'll have a closer look at mdtraj.

As you mention, it is program-specific. Also, the structure seems really rigid (no groups, for instance).

The topology [their own word; we use "connectivity" now, I think :-)] storage is a bit awkward, as it is a JSON text file embedded in an HDF5 dataset. But at least they have connectivity information.

With respect to MOSAIC, it also seems more rigid, for the same reasons as above: one single entity in the file.

> An interesting aspect is the use of compression. Did anyone try this in H5MD?

Only a few times, without much gain (from memory, less than 10% gain with gzip). I don't have a good automated strategy to test this.

Pierre

[1] https://github.com/rmcgibbo/mdtraj/commits/master/docs/hdf5_format.rst
[2] http://vis.lbl.gov/Research/H5Part/
The MDtraj project is indeed interesting. It lacks a conversion tool to and from H5MD :-)

Compression was one of the major criteria for choosing HDF5 when Peter and I started with MD simulations. In HALMD it is enabled by default (but the parameters are not optimised). The underlying HDF5 concept is filters, which require a chunked dataspace layout. The relevant filters are "shuffle" in combination with "deflate" (GZIP) or SZIP. You can nicely play around with

    h5repack -f SHUF -f GZIP=6 input.h5 output.h5

and check the result with "h5dump -Hp" or "h5ls -v".

For some arbitrary file with 205k particles (2 snapshots, float32), I get 13% compression with GZIP alone and 39% with shuffle plus GZIP. (I can't compare with SZIP, which is missing in my h5repack build.) Using GZIP=9 improves the ratio only marginally, by 0.2%.

The dataset layout and the chunk size are relatively important. Switching from chunks of 1xNxD to 1xNx1 by

    h5repack -l particles/A/position:CHUNK=1x204800x1 input.h5 output.h5

packs all x-coordinates separately, etc., and improves the compression ratio considerably, from 39% to 59% (for my specific example).

I find these numbers really encouraging for enabling the compression features of HDF5 (and thus of H5MD).

The HDF5 library also helps to get rid of the noisy low-order floating-point bits (which are almost incompressible white noise and thus irrelevant): if the memory and file datatypes are different, the conversion should be done by the library (but I haven't tried this myself).

Cheers,
Felix
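P.S. The same settings can of course be applied already when writing the data, e.g. with h5py. A minimal, untested sketch; the file name and the random data are made up, and the dataset path just follows the h5repack example above:

    import numpy as np
    import h5py

    N = 204800    # particles, matching the example above
    steps = 2     # snapshots

    # Made-up data: a random walk, held as float64 in memory.
    positions = np.cumsum(0.01 * np.random.randn(steps, N, 3), axis=0)

    with h5py.File("positions.h5", "w") as f:
        dset = f.create_dataset(
            "particles/A/position",
            shape=(steps, N, 3),
            dtype="float32",     # file datatype differs from memory datatype
            chunks=(1, N, 1),    # one chunk per coordinate, as with CHUNK=1x204800x1
            shuffle=True,        # byte-shuffle filter (h5repack -f SHUF)
            compression="gzip",  # deflate filter (h5repack -f GZIP=6)
            compression_opts=6,
        )
        dset[...] = positions    # the library converts float64 -> float32 here

Checking the output with "h5ls -v positions.h5" should then list the shuffle and deflate filters along with the achieved storage utilisation per dataset.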