Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

From: Ken Hornstein
Subject: Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects
Date: Tue, 17 Jun 2014 16:11:22 -0400

>> if not for file names?
>The Unix kernel stores filenames as a run of bytes, not including `/'
>and NUL.

That's not universally true anymore.  Some newer filesystems mandate
that filenames be UTF-8 and enforce normalization rules (MacOS X and
Solaris are two notable examples).  Obviously some charset conversion is
happening for non-UTF-8 locales.  I think that's inevitable, given the
issues with composed and decomposed characters.

For example, let's say you see this:

% ls
Résumé.txt      Résumé.txt

How can that be?  Well, they aren't the same sequence of bytes.  In the
first one the “é” is U+00E9.  In the second, it's U+0065 U+0301 (a regular
“e” followed by a combining accent character).  The only way of resolving
this is to use the normalization rules for Unicode and do filename
searching that way; MacOS X actually rewrites all of the filenames
using Normalization Form D (all characters in decomposed form, which
means the regular character followed by the combining accents) and I think
that sucks, but they didn't ask me.  Solaris is better; the original bytes
are preserved, but lookup is done using normalized names, so you can't
have two filenames that differ only in how their characters are composed.
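
To make the composed/decomposed distinction concrete, here is a minimal
sketch in Python using the standard unicodedata module.  It only compares
strings in memory; it is not how any particular filesystem implements its
lookup, and the helper name same_name is made up for illustration.

```python
import unicodedata

# The same visible name, spelled two ways.
composed = "R\u00e9sum\u00e9.txt"      # "é" as the single code point U+00E9
decomposed = "Re\u0301sume\u0301.txt"  # "e" followed by combining acute U+0301

def same_name(a, b, form="NFD"):
    """Hypothetical helper: compare two names after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# As raw strings (and as raw bytes) the two names differ.
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False

# After normalizing both to the same form, they compare equal.
print(same_name(composed, decomposed))                         # True
print(same_name(composed, decomposed, form="NFC"))             # True
```

A filesystem that compares raw bytes would treat these as two distinct
names; one that compares normalized names would treat them as the same.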

