Re: [h5md-user] Variable-size particle groups

From: Felix Höfling
Subject: Re: [h5md-user] Variable-size particle groups
Date: Tue, 29 May 2012 10:15:08 +0200
User-agent: Opera Mail/11.64 (Linux)

Hi Peter,

On 26.05.2012 at 15:55, Peter Colberg <address@hidden> wrote:

Dear H5MD community,

Let's break the silence with a new extension for H5MD :-).

While finishing the support of particle groups in HALMD, which allow
selection of a subset of particles of the system for observation,
I am pondering how to store variable-size trajectory data in H5MD.

This would become necessary once I track, e.g., particles in the
neighbourhood of a particle, without having to sample an entire
system of millions of solvent particles (or, at least, sampling it
with a significantly lower frequency).

One idea I had in mind was to use the existing trajectory dataset
structure, and fill empty placeholders with some invalid value (NaN).
While the storage overhead should be negligible due to compression,
this has a serious disadvantage: The number of placeholders must be
chosen wisely, otherwise a lengthy simulation may have to abort due
to an overflow of particles.
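
A minimal sketch of the placeholder idea and its overflow problem, in
plain Python. The names CAPACITY and pad_sample are hypothetical and
not part of H5MD; the sketch only assumes a fixed number of slots per
step, with unused slots filled with NaN.

```python
import math

CAPACITY = 4  # hypothetical number of particle slots reserved per step

def pad_sample(sample, capacity=CAPACITY, dim=3):
    """Pad a list of particle coordinates with NaN rows up to `capacity`."""
    if len(sample) > capacity:
        # This is the failure mode described above: too many particles.
        raise OverflowError("sample exceeds the reserved number of slots")
    missing = capacity - len(sample)
    return [list(p) for p in sample] + [[math.nan] * dim for _ in range(missing)]

# Two particles present, two NaN placeholders appended.
padded = pad_sample([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
```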

Instead, I propose a better scheme:

H5MD implements an optional dataset “range” inside each trajectory
subgroup, next to the existing datasets “step” and “time”.

The dataset “range” is two-dimensional, with the first dimension
as the [variable] dimension (in H5MD lingo “to accumulate time steps”),
and the second dimension equal to 2. The dataset stores an array of
ranges [first, last), which reference the variable dimension of the
datasets position/sample, velocity/sample, …

The datasets position/sample, velocity/sample, … are reduced by one
dimension, i.e. [variable][N][D] are reduced to [variable × N][D].

For readers, this will add an additional indirection when looking up
particle data, e.g. to look up the position sample at step s, the
reader first looks up the range [first, last) at step s, and then
selects this range from the position/sample dataset.

As an example, a lookup by range [first, last) could be implemented
with ease using NumPy's array indexing, array[first:last], e.g.

  first, last = range[step]
  sample = position[first:last]
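
A slightly fuller sketch of the same lookup, using plain Python lists
as in-memory stand-ins for the datasets. The particle counts, the
dummy coordinates and the helper read_sample are assumptions made for
illustration only; the ranges are built as running offsets over the
per-step particle counts.

```python
from itertools import accumulate

# Hypothetical example data: number of tracked particles at each stored step.
counts = [2, 3, 1]

# "range" dataset, shape [variable][2]: half-open row ranges [first, last).
offsets = [0] + list(accumulate(counts))
ranges = list(zip(offsets[:-1], offsets[1:]))

# Flattened position/sample, shape [sum of counts][D] with D = 3 (dummy values).
position = [[float(i), 0.0, 0.0] for i in range(offsets[-1])]

def read_sample(data, ranges, step):
    """Return the rows of `data` that belong to the given step index."""
    first, last = ranges[step]
    return data[first:last]

# Second stored step: rows 2, 3 and 4 of the flattened dataset.
sample = read_sample(position, ranges, 1)
```

The range table is called ranges here so that it does not shadow
Python's built-in range. NumPy arrays and h5py datasets accept the
same slice syntax, so the lookup itself would look the same when
reading from a file.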

Of course, with a fluctuating number of particles, one would probably
also store a trajectory subgroup “tag” to identify particles, but this
is a separate issue from my proposal.

What do you think of this proposal?

Should such an extension be optional, or mandatory? Do you see even
more complex use cases which could not be handled by this scheme?

In the majority of MD simulations, the particle number is fixed; this includes even semi-grandcanonical simulations, where the particle type/species changes but the total number of particles is preserved. I'm not sure whether people actually store particle configurations in true grandcanonical Monte-Carlo simulations since mostly averages matter in the end.

The trajectory group is at the heart of the H5MD format and shall be as simple as possible. On the other hand, it shall be as flexible as possible, of course. I think the current scheme, which we all agreed on some time ago, fulfills this aim, and I would like to stick to this direct and straightforward scheme.

Your suggestion is tailored to your specific application, whereas a change of the H5MD structure would have an impact on _all_ users. The indirect lookup, or several alternative formats for the trajectory, would make things more complicated and, I believe, would effectively discourage people from using H5MD.

A trajectory describes the time evolution of a given set of particles. Snapshots of a changing subset of particles are, strictly speaking, not a trajectory. Probably you also do not want to resume a simulation from such a partial dataset. Hence, the right place for the data you need to store is a specific H5MD subgroup (e.g., in "structure/..."?), and there, the proposed format would be perfectly fine.

I strongly favour separate subgroups for application-specific data structures, as this keeps general groups like "trajectory/" clean and simple. Nevertheless, the structure of commonly used subgroups may be defined by the H5MD format (as we have done for "observables/" already).

Finally, a technical point: slicing a large HDF5 dataset may be much less efficient than using a dataset with appropriately shaped dimensions and accessing a full snapshot via a single index. This has to be checked. I expect that the performance depends sensitively on the way slicing is implemented, i.e., on the backend used for HDF5 access.

Best wishes,

