## Re: [h5md-user] The Box Story

From: Peter Colberg
Subject: Re: [h5md-user] The Box Story
Date: Thu, 26 Sep 2013 10:47:08 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

```
On Thu, Sep 26, 2013 at 09:19:36AM +0200, Konrad Hinsen wrote:
> Proposition 1: Store a single time series with box information for the
> whole trajectory. It must cover at least those steps for which any
> position information is stored. The box information for a given step
> must be retrieved by binary search for random-access step
> retrieval. For sequential traversal of the trajectory, more efficient
> methods are available.
>
>  + Simplicity. Easy to understand, easy to check.
>
>  + Efficient storage: no duplication of box data.

The storage-efficiency argument can be dropped, since the overhead of
duplicated box data (of order 1 element per time step) is negligible
compared to regular position data (of order N elements per time step).

>  - Box information retrieval is less efficient.

Here we need data: a benchmark that compares the time for random
access to one element of "edges/value" with the time to read the
accompanying N positions, species, …

The algorithm to look up the step without reading the entire dataset is
outlined here; it applies to a step dataset with “many” elements:

http://article.gmane.org/gmane.science.simulation.h5md.user/146

If the lookup turns out to be a bottleneck, even a minor one, we have
to scrap the "step" dataset on the reader side and devise a new
scheme to read data at a given step.

>  - Parallel writing (in the sense of parallel I/O) of independent
>    position time series requires coordination between processes.

The box time series is not written in parallel; but indeed, it
requires coordination between the process writing the box, and
the processes writing the subsystems.

> Proposition 2: With every position time series, store a box time
> series at exactly the same step numbers. If multiple such box time
> series are identical, links can be used to avoid duplicating the data.

This proposition is currently implemented in the specification, and
it requires the following modification to make it useful: “The time
and step datasets of "position" and "box" within a subsystem must

>
>  - Efficient writing (without data duplication) requires some effort
>    and careful thought.

What do you mean by this point? The writing is straightforward, no?

Peter

```
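The random-access lookup discussed in the message can be illustrated with a plain binary search over a sorted "step" dataset. This is only a minimal sketch: it is not necessarily the algorithm outlined in the linked message, and the names `find_step_index` and `CountingSteps` are hypothetical. It assumes the step values are stored in increasing order and that the dataset can be read one element at a time.

```python
def find_step_index(steps, target):
    """Return the index of `target` in the sorted sequence `steps`, or None.

    Elements are read one at a time, so only about log2(len(steps)) reads
    are needed. `steps` may be any indexable object with a length.
    """
    lo, hi = 0, len(steps)
    while lo < hi:
        mid = (lo + hi) // 2
        value = steps[mid]  # a single-element read
        if value < target:
            lo = mid + 1
        else:
            hi = mid
    # `lo` is now the first index whose value is >= target (if any).
    if lo < len(steps) and steps[lo] == target:
        return lo
    return None


class CountingSteps:
    """Hypothetical stand-in for a "step" dataset that counts element reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def __len__(self):
        return len(self._values)

    def __getitem__(self, i):
        self.reads += 1
        return self._values[i]


# Assumed example data: box information stored every 10 steps.
steps = CountingSteps(range(0, 1_000_000, 10))
index = find_step_index(steps, 123_450)
```

The counter gives a rough idea of how many single-element reads a lookup costs, which is the kind of number the proposed benchmark would have to weigh against reading N positions for the same step.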