Re: [h5md-user] Specifying the data type
From: Felix Höfling
Subject: Re: [h5md-user] Specifying the data type
Date: Thu, 29 Aug 2013 13:46:31 +0200
User-agent: Opera Mail/12.15 (Linux)
On 29.08.2013 at 12:40, Olaf Lenz <address@hidden> wrote:
More seriously, there is not just one integer or float type in
HDF5. For this reason, the H5MD spec just states "of integer data
type" or "real-valued".
That is why I haven't specified it more precisely. Yes, there are
several datatypes, but the HDF5 docs[1] state: "The source and
destination may have different (but compatible) layouts, in which case
the data elements are automatically transformed during the transfer."
To me, this means that you do not have to specify the exact datatype
layout, but only the "Datatype class" as it is termed in the HDF5 docs.
Of these, only the "Atomic" datatypes need to be specified, i.e. String,
Integer and Float (in our case). However, some properties might have to
be specified.
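
The difference between a datatype class and a concrete storage layout can be illustrated without HDF5. The following is only a sketch: it uses Python's standard `struct` module as a stand-in for the conversion step (an assumption made for illustration; it is not the HDF5 conversion machinery), and the layout lists are hypothetical. The same value is stored in several byte layouts of one class, and a reader that only knows the class gets the same number back each time.

```python
import struct

# Hypothetical "layouts" of one datatype class, written as struct formats
# (byte order + size). This mimics, but is not, HDF5's own conversion.
FLOAT_LAYOUTS = ["<f", ">f", "<d", ">d"]   # 32/64-bit, little/big endian
INTEGER_LAYOUTS = ["<h", ">i", "<q"]       # 16/32/64-bit signed integers


def roundtrip(value, layout):
    """Store `value` in the given layout and read it back as a native number."""
    stored = struct.pack(layout, value)
    (restored,) = struct.unpack(layout, stored)
    return restored


# 1.5 is exactly representable in both 32 and 64 bits, so all layouts agree.
floats = [roundtrip(1.5, layout) for layout in FLOAT_LAYOUTS]
ints = [roundtrip(-42, layout) for layout in INTEGER_LAYOUTS]
```

The reader never needs to know which layout was used on disk, only that the data is of class Float or Integer.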
And for the most interesting datasets, the actual data in
"value", the data type is not specified at all, as it depends on the
specific data stored.
That's what I used the <type> for, to keep it open. However, in many
cases, we have to specify a "datatype class", otherwise writing
tools that can use h5md as an interchange format is impossible.
Hi Olaf,
I overlooked these abstract datatype classes. Indeed, it might be useful
to specify them in the HDF5 terminology. The datatype would be any of
Integer, Float, String, Bitfield, Time, or Opaque. But in general, we also
want to allow (custom) data of Composite type, namely Array, Enumeration,
Variable Length, and Compound. Quite a long list ...
I feel that expressing all that in the tree graphs would overload
them; the focus should be on the tree structure. And for the
details, the user is encouraged to read the text, not just to
look at the pictures ;-)
Still it would be way easier to see what is expected. I do not think it
is bloated, and furthermore it clearly points out when we have forgotten
to specify it where it is needed.
Actually, the only datatype that is really important is the integer
datatype for "step" and for some data such as "id". HDF5 is
otherwise flexible and I would avoid (i) clobbering the
specification and (ii) putting constraints where they are not needed.
I do not agree. If we do not specify datatypes for the datasets that
actually carry semantics, the specification is useless. How am I able to
interpret an h5md file if I do not know whether the positions are stored
as floats, integers or strings? A specification defines how to interpret
data, and to do so it often also has to put constraints.
Does it make a difference with respect to reading whether positions are
stored as Float[N][D] or as Array[N], where the Array type is a
D-dimensional vector?
The Array type, however, does not work for dimensions > 4, nor for
tensor-valued data. Hence we may exclude it from H5MD. Implicitly, the
current draft says that the Float version is to be used, see the
description of "value" in
http://nongnu.org/h5md/draft.html#time-dependent-data
Even in the specification it is obvious that most of the datasets have a
well-defined datatype class.
file root
 \-- h5md
      +-- version : String[variable]
      \-- author
      |    +-- name : String[variable]
      |    +-- (email : String[variable])
      \-- creator
           +-- name : String[variable]
           +-- version : String[variable]
 \-- (particles)
      \-- <group1>
           \-- box
           \-- (position)
                \-- value : <type>[variable][N][D]
                \-- step : Integer[variable]
                \-- time : Float[variable]
           \-- (species : Integer[N])
           \--
I think adding the type class information to the notation would make it
significantly more readable and easier to understand.
Olaf
Actually, your example revealed an ambiguity: in the current notation it
is not clear whether "box" is a scalar dataset or a group.
Some more remarks:
- I would not put a restriction on the type of String, whether fixed-size
or variable-size. The reader has to handle both cases (although this means
some extra effort for the reader).
- the type of the position is actually restricted to be Atomic (see above)
and to the domain of real numbers, i.e., Float or Integer.
- something similar happens with the particle species: it could be Integer
or Enumeration
- typographic things to improve readability: close the parentheses
directly after the identifier (before the colon), drop the space before
the colon and insert a space after the data type.
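
The three typographic rules in the last item can be sketched as a small rewriting function. This is only an illustration: the names `normalize_entry` and `normalize_line` and the regular expressions are hypothetical, and the sketch assumes each entry has the form `name`, `name : Type[dims]` or the parenthesised variants used in the example tree.

```python
import re

# name, optional ")" right after it, optional ": type" with [..] dimensions,
# optional ")" at the very end (the old placement of the closing parenthesis).
_ENTRY = re.compile(
    r"^(?P<open>\()?\s*(?P<name>[^\s:()]+)\s*(?P<close1>\))?\s*"
    r"(?::\s*(?P<type>[^\s\[\]()]+)\s*(?P<shape>(?:\[[^\]]*\])*))?"
    r"\s*(?P<close2>\))?$"
)
# Tree branch prefix such as "| \-- " or " +-- ".
_PREFIX = re.compile(r"^[\s|]*[\\+]--\s*")


def normalize_entry(entry):
    """Apply the three typographic rules to one tree entry (no branch prefix)."""
    m = _ENTRY.match(entry.strip())
    if m is None:
        return entry  # leave anything unrecognised untouched
    opened = m.group("open") is not None
    closed = [m.group("close1"), m.group("close2")].count(")")
    if closed != (1 if opened else 0):
        return entry  # unbalanced parentheses: do not guess
    # Rule 1: close the parenthesis directly after the identifier.
    name = f"({m.group('name')})" if opened else m.group("name")
    if m.group("type") is None:
        return name
    # Rule 2: no space before the colon. Rule 3: a space after the data type.
    shape = m.group("shape")
    return f"{name}: {m.group('type')}" + (f" {shape}" if shape else "")


def normalize_line(line):
    """Normalise a full tree line, keeping its branch prefix as it is."""
    m = _PREFIX.match(line)
    prefix = m.group(0) if m else ""
    return prefix + normalize_entry(line[len(prefix):])
```

For instance, `normalize_entry("(species : Integer[N])")` gives `(species): Integer [N]`, and entries that already follow the rules come back unchanged.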
In conclusion, adding the HDF5 datatype class to the graphs may help to
detect possible issues. But it should happen typographically in a modest
form and allow for multiple datatype classes. I suggest introducing our
own abbreviations, which can easily be combined:
A=Atomic, C=Composite (the generic cases)
I=Integer, F=Float, S=String, B=Bitfield, T=Time, O=Opaque
A=Array, E=Enumeration, V=Variable Length, C=Compound.
Note the clash for Atomic/Array and Composite/Compound, which needs to be
resolved.
A general dataset would then be of type "AC", and the particle group would
look like:
<particle_group>
 \-- box
 \-- (position)
 |    \-- value: FI [variable][N][D]
 |    \-- step: I [variable]
 |    \-- time: F [variable]
 \-- (species): IE [N]
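
To show how a tool might read such combined abbreviations, here is a minimal sketch of a parser for a single entry. Everything in it is an assumption made for illustration: the function name `parse_entry` is hypothetical, and since the A/C clash is unresolved, the sketch arbitrarily reads "A" and "C" as the generic Atomic and Composite cases and gives Array and Compound no letter of their own.

```python
import re

ATOMIC = {"I": "Integer", "F": "Float", "S": "String",
          "B": "Bitfield", "T": "Time", "O": "Opaque"}
# Assumption: Array and Compound have no letter here, because "A" and "C"
# are taken by the generic cases below. The clash is not actually resolved.
COMPOSITE = {"E": "Enumeration", "V": "Variable Length"}
GENERIC = {
    "A": set(ATOMIC.values()),
    "C": set(COMPOSITE.values()) | {"Array", "Compound"},
}

_SPEC = re.compile(
    r"^\(?(?P<name>[^\s:()]+)\)?:\s*(?P<codes>[A-Z]+)\s*"
    r"(?P<shape>(?:\[[^\]]*\])*)$"
)


def parse_entry(entry):
    """Return (name, allowed classes, shape labels, optional?) for one entry."""
    text = entry.strip()
    m = _SPEC.match(text)
    if m is None:
        raise ValueError(f"not a typed entry: {entry!r}")
    classes = set()
    for code in m.group("codes"):
        if code in GENERIC:
            classes |= GENERIC[code]
        elif code in ATOMIC:
            classes.add(ATOMIC[code])
        elif code in COMPOSITE:
            classes.add(COMPOSITE[code])
        else:
            raise ValueError(f"unknown datatype code {code!r}")
    shape = re.findall(r"\[([^\]]*)\]", m.group("shape"))
    optional = text.startswith("(")  # parentheses mark optional elements
    return m.group("name"), classes, shape, optional
```

With these assumptions, `parse_entry("value: FI [variable][N][D]")` yields the name `value`, the classes Float and Integer, and the shape labels `variable`, `N`, `D`, while "AC" expands to all ten datatype classes.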
Felix