openexr-devel
Re: [Openexr-devel] UTF-8


From: David Aguilar
Subject: Re: [Openexr-devel] UTF-8
Date: Wed, 14 Nov 2012 22:35:53 -0800

Thanks for the detailed explanation.

On Wed, Nov 14, 2012 at 9:11 PM, Florian Kainz <address@hidden> wrote:
> David Aguilar wrote:
>>
>> Is it not easier to treat the data like raw bytes and not care?
>>
>> I'm in favor of UTF-8 as a recommendation.
>> I'm on the fence about enforcing it in the library (it couldn't hurt).
>> I am not overly excited about pushing normalization issues into the
>> library.
>>
>> What's the driving benefit of forcing a particular normalization?
>>
>> The user used a particular form.  Why not use it as-is?
>> Presumably the rest of their app uses it too, so leaving data as-is
>> lets them make the call.
>
>
> I'm not sure that treating the data as raw bytes and not caring is a
> good idea.
>
> Suppose someone hands you an OpenEXR file, and a listing of the header
> reveals the following set of channels:
>
>     공룡.R
>     공룡.G
>     공룡.B
>     배경.R
>     배경.G
>     배경.B
>
> In your image processing application you want to extract the first layer
> from the file, so you type 공룡.  However, you don't know - and you
> shouldn't have to know - how the text is encoded in the file: Hangul Jamo,
> Hangul syllables (pre-composed Jamo) or a combination of both.  In order
> to access the correct channel, the name in the file and the name that
> was typed in must both be converted into a common, canonical encoding.
> Unicode normalization does that.
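
To make the lookup problem concrete, here is a minimal sketch in Python using the standard `unicodedata` module. It is only an illustration, not OpenEXR code: the helper name `canonical` and the choice of NFC as the common form are assumptions made for the example.

```python
import unicodedata

def canonical(name, form="NFC"):
    """Return the name in one agreed-upon normalization form (NFC assumed here)."""
    return unicodedata.normalize(form, name)

# The same visible channel name in two encodings:
# pre-composed Hangul syllables (NFC) and decomposed Jamo (NFD).
syllables = unicodedata.normalize("NFC", "공룡.R")
jamo = unicodedata.normalize("NFD", syllables)

# A raw comparison treats them as different names.
raw_match = (syllables == jamo)

# Comparing after normalizing both sides treats them as the same name.
normalized_match = (canonical(syllables) == canonical(jamo))

print(raw_match, normalized_match)
print(len(syllables), len(jamo))  # the Jamo form uses more code points
```

The name typed by the user and the name read from the file both go through `canonical` before the comparison, so neither side needs to know how the other was encoded.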

That makes sense.  This is probably the most common use case,
so I see how it helps here.  Without a header field that records
which form was used, one form has to be chosen, so it's best to
pick one and stick with it.

I just wanted to illustrate one tiny use case where not doing
auto-normalization could be helpful.

Just thinking out loud --

Auto-normalization definitely makes sense for channel
and header names. Are there any use cases for raw
const char * storage?  Header values?

> Similarly, if the file already contains a channel called 배경.R, encoded
> using Jamo, then it should not be possible to add another channel with
> the name 배경.R, but encoded as syllables.  Code might not have a problem
> distinguishing the two channel names, but people certainly would.  The
> OpenEXR library should detect an attempt to add two channels with the
> same name, and generate an appropriate error message.
>
> The fact that storing a string in a file and retrieving it may change its
> encoding should not be a big problem for application code that is aware of
> Unicode, since the application must already be able to handle alternate
> encodings of a string.
>
> Instead of normalizing strings before they are stored in files, the
> OpenEXR library could normalize strings on the fly before every string
> comparison.  That way every string would be preserved exactly.  Speed
> could be an issue, though.  String comparisons are not rare, and on-the-fly
> normalization would slow them down considerably.
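
One possible middle ground between the two approaches above, sketched in Python with the standard `unicodedata` module: normalize each name once when it is added, use the normalized form only as a lookup key, and keep the original string untouched. The class `ChannelNames` and its methods are hypothetical and are not part of the OpenEXR API; NFC is again an assumed choice of form.

```python
import unicodedata

class ChannelNames:
    """Hypothetical container, not an OpenEXR class."""

    def __init__(self, form="NFC"):
        self._form = form
        self._by_key = {}  # normalized name -> name exactly as supplied

    def _key(self, name):
        return unicodedata.normalize(self._form, name)

    def add(self, name):
        # Normalize once per insertion; reject names that differ only in encoding.
        key = self._key(name)
        if key in self._by_key:
            raise ValueError(f"duplicate channel name: {name!r}")
        self._by_key[key] = name

    def find(self, name):
        # Returns the stored string exactly as it was supplied, or None.
        return self._by_key.get(self._key(name))

names = ChannelNames()
syllables = unicodedata.normalize("NFC", "배경.R")
jamo = unicodedata.normalize("NFD", syllables)

names.add(jamo)
try:
    names.add(syllables)  # same visible name, different encoding
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(duplicate_rejected)
print(names.find(syllables) == jamo)  # lookup succeeds, original bytes preserved
```

This still normalizes the query string on every lookup, so it does not remove the speed concern entirely; it only avoids re-normalizing the stored names for each comparison.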
-- 
David