Re: [h5md-user] Specifying the data type


From: Felix Höfling
Subject: Re: [h5md-user] Specifying the data type
Date: Thu, 29 Aug 2013 13:46:31 +0200
User-agent: Opera Mail/12.15 (Linux)

On 29.08.2013 at 12:40, Olaf Lenz <address@hidden> wrote:


More seriously, there is not just one integer or float type in
HDF5. For this reason, the H5MD spec just states "of integer data
 type" or "real-valued".

That is why I haven't specified it more precisely. Yes, there are
several datatypes, but the HDF5 docs[1] state: "The source and
destination may have different (but compatible) layouts, in which case
the data elements are automatically transformed during the transfer."

To me, this means that you do not have to specify the exact datatype
layout, but only the "Datatype class" as it is termed in the HDF5 docs.
Of these, only the "Atomic" datatypes need to be specified, i.e. String,
Integer and Float (in our case). However, some properties might have to
be specified.

And for the most interesting datasets, the actual data in
"value", the data type is not specified at all, as it depends on the
specific data stored.

That's what I used the <type> for, to keep it open. However, in many
cases, we have to specify a "datatype class"; otherwise, writing any
tools that can use h5md as an interchange format is impossible.


Hi Olaf,

I overlooked these abstract datatype classes. Indeed, it might be useful to specify them in the HDF5 terminology. The datatype would be any of Integer, Float, String, Bitfield, Time, or Opaque. But in general, we also want to allow (custom) data of Composite type, namely Array, Enumeration, Variable Length, or Compound. Quite a long list ...

I feel that expressing all that by the tree graphs would overload
them; the focus should be on the tree structure. And for the
details, the user is encouraged to read the text, not just to
look at the pictures ;-)

Still it would be way easier to see what is expected. I do not think it
is bloated, and furthermore it clearly points out when we have forgotten
to specify it where it is needed.

Actually, the only datatype that is really important is the integer
datatype for "step" and for some data such as "id". HDF5 is
otherwise flexible and I would avoid (i) clobbering the
specification and (ii) putting constraints where they are not needed.

I do not agree. If we do not specify datatypes for the datasets that
actually carry semantics, the specification is useless. How am I able to
interpret a h5md file if I do not know whether the positions are stored
as floats, integers or strings? A specification defines how to interpret
data, and to do so it often also has to put constraints.


Does it make a difference with respect to reading whether positions are stored as Float[N][D] or as Array[N], where the Array type is a D-dimensional vector?

The Array type, however, does not work for dimensions > 4 and not for tensor-valued data. Hence we may exclude it from H5MD. Implicitly, the current draft says that the Float version is to be used, see the description of "value" in
http://nongnu.org/h5md/draft.html#time-dependent-data

Even in the specification it is obvious that most of the datasets have a
well-defined datatype class.

file root
 \-- h5md
     +-- version : String[variable]
     \-- author
     |   +-- name : String[variable]
     |   +-- (email : String[variable])
     \-- creator
         +-- name : String[variable]
         +-- version : String[variable]
 \-- (particles)
     \-- <group1>
         \-- box
         \-- (position)
             \-- value : <type>[variable][N][D]
             \-- step : Integer[variable]
             \-- time : Float[variable]
         \-- (species : Integer[N])
             \--


I think adding the type class information to the notation would make it
significantly more readable and easier to understand.

Olaf


Actually, your example revealed an ambiguity: in the current notation it is not clear whether "box" is a scalar dataset or a group.

Some more remarks:

- I would not put a restriction on the type of String, whether fixed-size or variable-size. The reader has to handle both cases (although this means some extra effort for the reader).

- the type of the position is actually restricted to be Atomic (see above) and to the domain of real numbers, i.e., Float or Integer.

- something similar happens with the particle species: it could be Integer or Enumeration

- typographic things to improve readability: close the parentheses directly after the identifier (before the colon), drop the space before the colon and insert a space after the data type.
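
To make the type restrictions in these remarks concrete, here is a minimal sketch of how a reader might check them. The table ALLOWED_CLASSES, its path keys and the function check_class are hypothetical names chosen for illustration; they are not part of H5MD or of any HDF5 library, and the datatype classes are represented as plain strings rather than real HDF5 type objects.

```python
# Hypothetical table of allowed HDF5 datatype classes per dataset,
# following the remarks above (not an official H5MD listing).
ALLOWED_CLASSES = {
    "position/value": {"Float", "Integer"},
    "position/step": {"Integer"},
    "position/time": {"Float"},
    "species": {"Integer", "Enumeration"},
}


def check_class(path, datatype_class):
    """Return True if datatype_class is acceptable for the dataset at path.

    Datasets without an entry in the table are left unconstrained.
    """
    allowed = ALLOWED_CLASSES.get(path)
    return allowed is None or datatype_class in allowed
```

For example, check_class("position/value", "String") would be rejected, while check_class("species", "Enumeration") would pass. A real reader would first have to map the stored HDF5 datatype to one of these class names.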

In conclusion, adding the HDF5 datatype class to the graphs may help to detect possible issues. But it should happen typographically in a modest form and allow for multiple datatype classes. I suggest introducing our own abbreviations, which can easily be combined:

A=Atomic, C=Composite (the generic cases)
I=Integer, F=Float, S=String, B=Bitfield, T=Time, O=Opaque
A=Array, E=Enumeration, V=Variable Length, C=Compound.

Note the clash for Atomic/Array and Composite/Compound which needs to be resolved.
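
The clash can also be found mechanically, which may be handy if the list of abbreviations grows. The following is only a sketch: the names PROPOSED and find_clashes are made up for this illustration, and the pairs are simply the abbreviations listed above.

```python
from collections import defaultdict

# Abbreviations exactly as proposed above, as (letter, class name) pairs.
PROPOSED = [
    ("A", "Atomic"), ("C", "Composite"),
    ("I", "Integer"), ("F", "Float"), ("S", "String"),
    ("B", "Bitfield"), ("T", "Time"), ("O", "Opaque"),
    ("A", "Array"), ("E", "Enumeration"),
    ("V", "Variable Length"), ("C", "Compound"),
]


def find_clashes(pairs):
    """Map each letter used for more than one class to those classes."""
    by_letter = defaultdict(list)
    for letter, name in pairs:
        by_letter[letter].append(name)
    return {k: v for k, v in by_letter.items() if len(v) > 1}


clashes = find_clashes(PROPOSED)
# Only "A" (Atomic, Array) and "C" (Composite, Compound) are reported.
```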

A general dataset would then be of type "AC", and the particle group would look like:

  <particle_group>
       \-- box
       \-- (position)
       |   \-- value: FI [variable][N][D]
       |   \-- step: I [variable]
       |   \-- time: F [variable]
       \-- (species): IE [N]
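
One benefit of such a compact notation is that the annotated entries become machine-readable. Below is a rough sketch of a parser for a single entry, i.e. the text after the "\--" branch marker. The names CLASSES, LINE_RE and parse_entry are hypothetical, and the letters "A" and "C" are deliberately left out because their meaning is still ambiguous, so an entry using them would raise a KeyError here.

```python
import re

# Unambiguous single-letter abbreviations from the proposal; "A" and "C"
# are left out because their meaning is not yet settled.
CLASSES = {
    "I": "Integer", "F": "Float", "S": "String", "B": "Bitfield",
    "T": "Time", "O": "Opaque", "E": "Enumeration", "V": "Variable Length",
}

# name or (name), a colon, class letters, then zero or more [dim] groups.
LINE_RE = re.compile(
    r"^\(?(?P<name>\w+)\)?:\s*(?P<types>[A-Z]+)\s*(?P<dims>(\[\w+\])*)$"
)


def parse_entry(entry):
    """Parse e.g. 'value: FI [variable][N][D]' into (name, classes, dims)."""
    match = LINE_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"not a typed tree entry: {entry!r}")
    classes = [CLASSES[letter] for letter in match["types"]]
    dims = re.findall(r"\[(\w+)\]", match["dims"])
    return match["name"], classes, dims
```

With this sketch, parse_entry("value: FI [variable][N][D]") yields the name "value", the classes Float and Integer, and the dimensions variable, N and D. Entries without a type, such as "box", are rejected with a ValueError, which mirrors the ambiguity noted above.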


Felix


