From: Felix Höfling
Subject: Re: [h5md-user] Variable-size particle groups
Date: Tue, 29 May 2012 10:15:08 +0200
User-agent: Opera Mail/11.64 (Linux)

Hi Peter,

On 26.05.2012, 15:55, Peter Colberg <address@hidden> wrote:
> Dear H5MD community,
>
> Let's break the silence with a new extension for H5MD :-).
>
> While finishing the support of particle groups in HALMD, which allow selection of a subset of particles of the system for observation, I am pondering how to store variable-size trajectory data in H5MD. This would become necessary once I track, e.g., particles in the neighbourhood of a particle, while avoiding to sample an entire system of millions of solvent particles (or, at least, with a significantly lower frequency).
>
> One idea I had in mind was to use the existing trajectory dataset structure, and fill empty placeholders with some invalid value (NaN). While the storage overhead should be negligible due to compression, this has a serious disadvantage: The number of placeholders must be chosen wisely, otherwise a lengthy simulation may have to abort due to an overflow of particles.
>
> Instead, I propose a better scheme:
>
> H5MD implements an optional dataset “range” inside each trajectory subgroup, next to the other datasets groups “step” and “time”. The dataset “range” is two-dimensional, with the first dimension as the [variable] dimension (in H5MD lingo “to accumulate time steps”), and the second dimension equal to 2. The dataset stores an array of ranges [first, last), which reference the variable dimension of the datasets position/sample, velocity/sample, …
>
> The datasets position/sample, velocity/sample, … are reduced by one dimension, i.e. [variable][N][D] are reduced to [variable × N][D].
>
> For readers, this will add an additional indirection when looking up particle data, e.g. to look up the position sample at step s, the reader first looks up the range [first, last) at step s, and then selects this range from the position/sample dataset.
>
> As an example, a lookup by range [first, last) could be implemented with ease using NumPy's array indexing, array[first:last], e.g.
>
>     first, last = range[step]
>     sample = position[first:last]
>
> Of course, with a fluctuating number of particles, one would probably also store a trajectory subgroup “tag” to identify particles, but this is a separate issue from my proposal.
>
> What do you think of this proposal? Should such an extension be optional, or mandatory? Do you see even more complex use cases which could not be handled by this scheme?

In the majority of MD simulations, the particle number is fixed; this includes even semi-grandcanonical simulations, where the particle type/species changes but the total number of particles is preserved. I'm not sure whether people actually store particle configurations in true grandcanonical Monte-Carlo simulations since mostly averages matter in the end.
The trajectory group is at the heart of the H5MD format and shall be as simple as possible. On the other hand, it shall be as flexible as possible, of course. I think the current scheme, which we all agreed on some time ago, fulfils this aim, and I would like to stick to this direct and straightforward scheme.
Your suggestion is tailored to your specific application, whereas a change of the H5MD structure would have an impact on _all_ users. An indirect lookup, or several alternative formats for the trajectory, would make things more complicated and, I believe, would effectively discourage people from using H5MD.
A trajectory describes the time evolution of a given set of particles. Snapshots of a changing subset of particles are, strictly speaking, not a trajectory. Probably you also do not want to resume a simulation from such a partial dataset. Hence, the right place for the data you need to store is a specific H5MD subgroup (e.g., in "structure/..."?), and there, the proposed format would be perfectly fine.
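For concreteness, the proposed layout could look roughly like the following inside such a subgroup. This is only a minimal sketch: NumPy arrays stand in for the HDF5 datasets, and the names `counts`, `samples`, `ranges`, and `lookup` are made up for illustration and are not part of H5MD.

```python
import numpy as np

D = 3  # spatial dimension
# Hypothetical samples with a varying number of particles per stored step.
counts = [4, 2, 5]
rng = np.random.default_rng(0)
samples = [rng.random((n, D)) for n in counts]

# Flattened layout: [variable x N][D] instead of [variable][N][D].
position = np.concatenate(samples, axis=0)

# "range" dataset: one half-open interval [first, last) per stored step.
last = np.cumsum(counts)
first = last - np.asarray(counts)
ranges = np.stack([first, last], axis=1)

def lookup(step):
    """Return the position sample stored for the given step index."""
    lo, hi = ranges[step]
    return position[lo:hi]
```

Here `lookup(1)` returns the two particles stored for the second step. The array is called `ranges` rather than `range` only to avoid shadowing the Python built-in.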
I strongly favour separate subgroups for application-specific data structures; this keeps general groups like "trajectory/" clean and simple. Nevertheless, the structure of commonly used subgroups may be defined by the H5MD format (as we have done for "observables/" already).
Finally, a technical point: slicing a large HDF5 dataset, position[first:last], may be much less efficient than using a dataset with appropriately formed dimensions and accessing a full snapshot via a single index, position[step]. This has to be checked. I expect that the performance depends sensitively on the way slicing is implemented, i.e., on the backend used for HDF5 access.
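One way to check this would be a small timing test along the following lines. It is a rough sketch assuming h5py as the backend; the dataset names, sizes, and chunk shapes are arbitrary choices for illustration, and the result will differ for other backends, chunkings, and file drivers.

```python
import time
import numpy as np
import h5py

steps, N, D = 200, 1000, 3
data = np.random.default_rng(0).random((steps, N, D))

# In-memory HDF5 file (core driver, nothing is written to disk).
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    # Fixed-size layout [variable][N][D], one chunk per snapshot.
    fixed = f.create_dataset("fixed", data=data, chunks=(1, N, D))
    # Flattened layout [variable x N][D], chunked in blocks of N rows.
    flat = f.create_dataset("flat", data=data.reshape(steps * N, D),
                            chunks=(N, D))

    # Access a full snapshot via a single index.
    t0 = time.perf_counter()
    for s in range(steps):
        a = fixed[s]
    t_index = time.perf_counter() - t0

    # Access the same snapshot via a [first, last) slice.
    t0 = time.perf_counter()
    for s in range(steps):
        b = flat[s * N:(s + 1) * N]
    t_slice = time.perf_counter() - t0

    same = np.array_equal(a, b)

print(f"index: {t_index:.4f} s, slice: {t_slice:.4f} s")
```

Note that with a constant N the slices coincide with chunk boundaries, which is the favourable case; with a truly fluctuating particle number the ranges would straddle chunks, so that case needs to be timed separately.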
Best wishes, Felix