openexr-devel

Re: [Openexr-devel] UTF-8


From: David Aguilar
Subject: Re: [Openexr-devel] UTF-8
Date: Wed, 14 Nov 2012 18:52:47 -0800

On Wed, Nov 14, 2012 at 2:57 PM, Florian Kainz <address@hidden> wrote:
>
> The problem is that a channel or attribute name such as
> "grün" could be represented as the character sequence
>
>     0067 0072 0075 0308 006E
>     (g, r, u, combining diaeresis, n)
>
> or as
>
>     0067 0072 00FC 006E
>     (g, r, u with diaeresis, n).
>
> Typographically the representations look identical, but
> string comparisons would treat them as different.
> I can't imagine users being happy to be told that a file
> contains, for example, a "grün" channel of type HALF, and
> a "grün" channel of type FLOAT, where the only difference
> between the names is how they are represented as Unicode.
>
> As far as I can tell, either string comparison needs to
> perform some normalization on the fly, or the strings that
> are compared must already be normalized.
>
> Yes, normalization is a headache, but with Unicode there is
> not a one-to-one correspondence between the character sequence
> stored in a string and the typographical representation of
> that string.

I understand the point of normalization,
but I do not think it is the responsibility of the library.

From the POV of an application: if it hands one Unicode
representation to the library and later asks the library for the
string back, it will get a different answer than the one it supplied.

That would be a hard bug to track down.

Similarly, someone who stores filenames in headers and expects to get
back byte-for-byte identical strings will run into problems when they
find that the filenames do not exist (because they use a different
form).

Is it not easier to treat the data like raw bytes and not care?

I'm in favor of UTF-8 as a recommendation.
I'm on the fence about enforcing it in the library (it couldn't hurt).
I am not overly excited about pushing normalization issues into the library.

What's the driving benefit of forcing a particular normalization?

The user chose a particular form.  Why not preserve it as-is?
Presumably the rest of their application uses the same form, so
leaving the data untouched lets them make the call.



> Florian
>
>
>
> David Aguilar wrote:
>>
>> On Wed, Nov 14, 2012 at 11:47 AM, Florian Kainz <address@hidden> wrote:
>>>
>>> The ACES image container specification, meant to be compatible with
>>> OpenEXR,
>>> prescribes UTF-8 for the representation of strings.  Therefore I suggest
>>> that OpenEXR adopt the following rules:
>>>
>>> - All text strings are to be interpreted as Unicode, encoded as UTF-8.
>>>   This includes attribute names and strings contained in attributes,
>>>   for example, as channel names.
>>>
>>> - Text strings stored in files must be in Normalization Form C (NFC,
>>>   canonical decomposition followed by canonical composition).
>>
>>
>> I would stay far away from dealing with normalization issues.
>>
>> Poke around on OS X and its broken HFS filesystem to see why:
>>
>> http://radsoft.net/rants/20080405,00.shtml
>>
>> If the library verified utf-8 that would be enough IMO.
>>
>> Imagine some poor sucker who goes and stores unicode filenames in a
>> header.  It's not fun to have a library silently "fix" things for you.
>>
>> What's the upside of doing the normalization?  How about just leave it
>> as-is?  That way the code can stay simple.  Whatever you put in can be
>> byte-for-byte identical to what you get out.
>>
>> Other than that, UTF-8 all the way as the "recommended" encoding.
>>
>>> - Where text strings need to be collated, strcmp() is used to compare
>>>   the corresponding char sequences:  string A comes before (or is less
>>>   than) string B if
>>>
>>>     strcmp(A,B) < 0
>>>
>>>   (Note: this is not ambiguous; the C99 standard specifies that strcmp()
>>>   interprets the bytes that make up a string as unsigned.)
>>>
>>> - Text strings passed to the IlmImf library must be encoded as UTF-8
>>>   and in Normalization Form C.
>>>
>>> As far as I can tell, these rules are entirely compatible with all
>>> existing versions of the IlmImf library.  Users whose writing system
>>> includes non-ASCII Unicode characters can continue to employ the
>>> existing library versions without change.
>>>
>>> Future versions of the library should verify that text strings are
>>> valid UTF-8.  In addition, the library should either verify that
>>> strings are normalized to NFC, or normalize to NFC on the fly.
>>
>>
>> If we treat them like raw bytes then we really don't care about the
>> encoding, do we?  (that's why I said, "recommended")
>>
>> It would be nice if the thing stayed agnostic.
>>
>> Is there a reason why it needs to enforce the encoding,
>> or is a strong recommendation to use UTF-8 good enough?



-- 
David


