Re: [h5md-user] The Box Story

From: Konrad Hinsen
Subject: Re: [h5md-user] The Box Story
Date: Thu, 26 Sep 2013 17:20:16 +0200

Peter Colberg writes:

 > >  - Box information retrieval is less efficient.
 > Here we are in need of data. A benchmark that determines the time of
 > random access of one element of "edges/value" versus a prerequisite
 > binary search, and compares the overhead of the latter against reading
 > N positions, species, …

That would be nice to have, but it's a lot of work if you want to
eliminate other factors: hardware, HDF5 array layout (chunked, ...),
data access pattern, etc.

 > The algorithm to lookup the step without reading the entire dataset is
 > outlined here; this applies for a step dataset with “many” elements:
 > http://article.gmane.org/gmane.science.simulation.h5md.user/146

I'd expect a potential performance problem to come from jumping around
in the file, leading to bad cache use. But it would indeed have to be
checked by a benchmark.
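
As a rough illustration of the lookup being discussed, here is a minimal
sketch of a binary search over a sorted "step" array that also counts how
many elements it touches. The names CountingSequence and find_step_index are
hypothetical, and a plain Python list stands in for the HDF5 dataset; this is
not the algorithm from the linked message, only the generic technique.

```python
from bisect import bisect_left


class CountingSequence:
    """Wraps a sorted sequence and counts element reads (hypothetical helper)."""

    def __init__(self, data):
        self._data = data
        self.reads = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        self.reads += 1  # each access stands for one random read from the file
        return self._data[i]


def find_step_index(steps, target):
    """Return the index of `target` in the sorted `steps`, or None if absent."""
    i = bisect_left(steps, target)
    if i < len(steps) and steps[i] == target:
        return i
    return None


# Assumed example data: 100 frames written every 10 steps.
steps = CountingSequence(list(range(0, 1000, 10)))
index = find_step_index(steps, 250)
```

The number of reads grows only logarithmically with the number of stored
steps, but each read may land in a different part of the file, which is
where the cache concern above would show up in a real benchmark.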

BTW, there is one aspect of all this that bothers me a bit: in 90% of
my trajectories, steps are sampled on a regular grid, meaning that the
index of each step can be computed trivially and at essentially no
cost.  There is no way in H5MD to optimize for such an arrangement.
Such trajectories shouldn't even have explicit storage for "step" and
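
For comparison, a minimal sketch of the regular-grid case, where the index
follows from arithmetic alone. The function name and the parameters
first_step, stride and n_frames are assumptions made for illustration; H5MD
as discussed here defines no such metadata.

```python
def regular_step_index(step, first_step, stride, n_frames):
    """Index of `step` on a regular grid, or None if it was not sampled.

    Assumes frames were written at first_step, first_step + stride, ...
    for a total of n_frames frames (all hypothetical parameters).
    """
    offset = step - first_step
    if offset < 0 or offset % stride != 0:
        return None  # before the first frame, or between two samples
    index = offset // stride
    return index if index < n_frames else None


# Assumed example: 100 frames, one every 10 steps, starting at step 0.
index = regular_step_index(250, first_step=0, stride=10, n_frames=100)
```

No element of a "step" dataset is read at all in this case, which is the
optimization opportunity mentioned above.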

 > >  - Parallel writing (in the sense of parallel I/O) of independent
 > >    position time series requires coordination between processes.
 > The box time series is not written in parallel

What I am thinking of is a parallelized simulation (MPI style) in
which different processors write different subsets of the system.

 > >  - Efficient writing (without data duplication) requires some effort
 > >    and careful thought.
 > What do you mean with this point? The writing is straightforward, no?

Let's consider an example: subsystem 1 is written at every Nth step,
subsystem 2 at every Mth step. That creates two different box time
series.  However, if the user sets M=N, there should be only one box
time series, linked to two places. That's what I mean by "efficient
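
The bookkeeping in this example could be sketched as follows. This is an
in-memory illustration only: BoxSeriesRegistry and its methods are
hypothetical names, and in an actual HDF5 file the sharing would presumably
be expressed as a link to one stored series rather than a Python reference.

```python
class BoxSeriesRegistry:
    """Hands out one box time series per sampling interval (hypothetical)."""

    def __init__(self):
        self._by_interval = {}

    def series_for(self, interval):
        # Subsystems with the same interval receive the very same object,
        # so the box data is stored once and referenced from several places.
        return self._by_interval.setdefault(
            interval, {"step": [], "value": []}
        )

    def count(self):
        """Number of distinct box time series actually stored."""
        return len(self._by_interval)


registry = BoxSeriesRegistry()
N, M = 10, 10  # assumed sampling intervals for subsystems 1 and 2
box1 = registry.series_for(N)
box2 = registry.series_for(M)
```

With M equal to N both subsystems share one series; with different
intervals the registry creates two. The writer needs this kind of lookup
before creating a series, which is the extra effort referred to above.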

Konrad Hinsen
Centre de Biophysique Moléculaire, CNRS Orléans
Synchrotron Soleil - Division Expériences
Saint Aubin - BP 48
91192 Gif sur Yvette Cedex, France
Tel. +33-1 69 35 97 15
E-Mail: research AT khinsen DOT fastmail DOT net
ORCID: http://orcid.org/0000-0003-0330-9428
Twitter: @khinsen
