openexr-devel

Re: [Openexr-devel] UTF-8


From: Florian Kainz
Subject: Re: [Openexr-devel] UTF-8
Date: Thu, 15 Nov 2012 12:21:11 -0800
User-agent: Thunderbird 2.0.0.24 (X11/20100428)

After mulling this over a bit more, I think the rule that _all_ strings
must be normalized is too restrictive.  Strings such as the value of the
"comments" attribute are only stored in the file or retrieved from the file.
The library performs no other processing, so there's no requirement for
normalization.

However, attribute names and channel names are used by the library for table
lookups.  In those cases normalization is necessary, or the lookup will not
work correctly.  Comparison with strcmp() can fail when a string has more
than one possible representation.
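
A small sketch in C may make the failure concrete.  The syllable used here
(U+D55C, with its Jamo decomposition U+1112 U+1161 U+11AB) and the helper
name same_bytes() are assumptions chosen for illustration; they are not
taken from the earlier mail or from the IlmImf code.

```c
#include <string.h>

/* Illustrative byte sequences (an assumption for this sketch): the same
   Hangul syllable, U+D55C, spelled in two canonically equivalent ways,
   each followed by ".R". */
static const char name_syllable[] =
    "\xED\x95\x9C" ".R";                                /* one precomposed syllable */
static const char name_jamo[] =
    "\xE1\x84\x92" "\xE1\x85\xA1" "\xE1\x86\xAB" ".R";  /* three conjoining Jamo */

/* Hypothetical helper: the byte-wise equality test that a table lookup
   based on strcmp() would effectively perform. */
static int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Both names display identically, yet same_bytes(name_syllable, name_jamo)
is false, so a lookup keyed on one spelling would miss a channel stored
under the other.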

I don't see how the onus could be shifted to application code.  If a user
types a channel name such as ??.R (see my earlier mail), must the application
try all possible representations of this string in order to find out if the
corresponding channel exists?  If the application fails to do this, should
the library allow a channel list that contains multiple channels with the
name ??.R, the only difference being that in one case ?? is represented
as two Hangul syllable characters, in the next ? has been split into Jamo
but ? is a syllable, and so on?

If a single application generates all the attribute and channel names in
a file then we can reasonably assume that the application uses consistent
rules for encoding strings throughout its code base.  However, applications
must be able to handle channel names found in files that may have been
generated by other applications, possibly with different internal conventions
for text processing.  For example, an application that internally represents
Korean texts using syllables should be able to handle OpenEXR files that
were written by an application that uses Jamo, or any combination of Jamo
and syllables.
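
For the Hangul case specifically, bringing Jamo and syllables to a common
form is mechanical.  The following is a rough sketch of the Hangul
composition step described in the Unicode Standard, operating on an array
of code points.  The function name compose_hangul() is hypothetical, and
the sketch covers only Hangul; full NFC also needs the decomposition and
composition tables for other scripts, so a real implementation would more
likely rely on an existing Unicode library.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable composition algorithm. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    S_COUNT = L_COUNT * V_COUNT * T_COUNT
};

/* Hypothetical helper: composes conjoining Jamo into precomposed
   syllables, in place.  Returns the new length in code points. */
static size_t compose_hangul(uint32_t *cp, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t ch = cp[i];
        if (out > 0) {
            uint32_t last = cp[out - 1];
            /* Leading consonant + vowel -> LV syllable. */
            if (last >= L_BASE && last < L_BASE + L_COUNT &&
                ch >= V_BASE && ch < V_BASE + V_COUNT) {
                cp[out - 1] = S_BASE +
                    ((last - L_BASE) * V_COUNT + (ch - V_BASE)) * T_COUNT;
                continue;
            }
            /* LV syllable + trailing consonant -> LVT syllable. */
            if (last >= S_BASE && last < S_BASE + S_COUNT &&
                (last - S_BASE) % T_COUNT == 0 &&
                ch > T_BASE && ch < T_BASE + T_COUNT) {
                cp[out - 1] = last + (ch - T_BASE);
                continue;
            }
        }
        cp[out++] = ch;   /* anything else is copied through unchanged */
    }
    return out;
}
```

Applied to the sequence U+1112 U+1161 U+11AB, this yields the single code
point U+D55C, so a name written with Jamo and the same name written with
syllables end up as the same code point sequence.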


I propose a revised set of rules:

- All text strings are to be interpreted as Unicode, encoded as UTF-8.
  This includes attribute names and strings contained in attributes,
  for example, channel names.

- Attribute names and channel names stored in files must be in Normalization
  Form C (NFC, canonical decomposition followed by canonical composition).

- Where attribute names or channel names need to be collated, strcmp() is
  used to compare the corresponding char sequences:  string A comes before
  (or is less than) string B if

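    strcmp(A,B) < 0

  (Note: this is not ambiguous; the C99 standard specifies that strcmp()
  interprets the bytes that make up a string as unsigned.  The standard
  guarantees only the sign of the result, not its magnitude, so the test
  must be "< 0" rather than "== -1".)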

- Attribute names and channel names passed to the IlmImf library must be
  encoded as UTF-8 and in Normalization Form C.

  (Note - this last rule could be changed to: Attribute names and channel
  names must be encoded as UTF-8.  The library converts the names to
  Normalization Form C before any further processing.)
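
The collation rule above can be sketched in C as follows.  The names
name_less(), compare_names() and sort_names() are hypothetical helpers for
illustration, not part of the IlmImf API.  The sketch tests only the sign
of strcmp(), since the C standard does not promise any particular nonzero
value.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: nonzero if name a collates before name b.
   Only the sign of strcmp() is meaningful. */
static int name_less(const char *a, const char *b)
{
    return strcmp(a, b) < 0;
}

/* qsort() comparator for an array of const char * names. */
static int compare_names(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* Hypothetical helper: sorts n names into strcmp() order. */
static void sort_names(const char **names, size_t n)
{
    qsort(names, n, sizeof names[0], compare_names);
}
```

Because strcmp() compares bytes as unsigned values, every multi-byte UTF-8
sequence (lead byte 0xC2 or higher) sorts after every ASCII character; for
example, "z" comes before "\xC3\xA9" (U+00E9).  Unsigned byte order on
UTF-8 is also the same as code point order, so the result does not depend
on the platform's char signedness or locale.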

Florian


