Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

From: Ralph Corderoy
Subject: Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects
Date: Wed, 18 Jun 2014 12:01:37 +0100

Hello Ken,

> > The Unix kernel stores filenames as a run of bytes, not including
> > `/' and NUL.
> That's not universally true anymore.  Some newer filesystems are
> mandating that filenames are UTF-8 and enforcing normalization rules
> (MacOS X and Solaris are two notable examples).

Thanks, I didn't know.  Haven't used Solaris in years, and never bought

> The only way of resolving this is to use the normalization rules for
> Unicode and do filename searching that way;


> MacOS X actually rewrites all of the filenames using Normalization
> Form D (all characters in decomposed form, which means the regular
> character followed by the combining accents) and I think that sucks,
> but they didn't ask me.

I think I agree with you.
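
A minimal Python sketch of what Normalization Form D does to such a name,
using the standard unicodedata module.  The variable names are only for
illustration, and this says nothing about how any particular filesystem
implements it:

```python
import unicodedata

# Precomposed (NFC) spelling: each "é" is the single code point U+00E9.
name = "r\u00e9sum\u00e9"

# NFD splits each "é" into "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd = unicodedata.normalize("NFD", name)

# NFC recombines them, so the round trip gives back the original string.
nfc = unicodedata.normalize("NFC", nfd)

print([f"{ord(c):04X}" for c in nfd])
# ['0072', '0065', '0301', '0073', '0075', '006D', '0065', '0301']

# 6 code points precomposed, 8 decomposed; 8 versus 10 bytes in UTF-8.
print(len(name), len(nfd))
print(len(name.encode("utf-8")), len(nfd.encode("utf-8")))
```

The two spellings render identically but are different byte strings, which
is why a byte-oriented kernel treats them as two distinct filenames.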

> Solaris is better; the original bytes are preserved, but lookup is
> done using normalized names so you can't have two filenames with the
> same characters.
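
The "preserve the bytes, look up by normalized name" idea can be sketched
as a toy directory in Python.  NormalizedDir is hypothetical, and the
choice of NFD for the lookup key is an assumption made for the example,
not a statement about which form Solaris uses:

```python
import unicodedata


class NormalizedDir:
    """Toy directory: keeps names as given, compares them by NFD key."""

    def __init__(self):
        self._entries = {}  # normalized key -> name exactly as created

    @staticmethod
    def _key(name):
        # Assumed normalization form; any single form would do here.
        return unicodedata.normalize("NFD", name)

    def create(self, name):
        key = self._key(name)
        if key in self._entries:
            # Same characters, different encoding: refuse the duplicate.
            raise FileExistsError(name)
        self._entries[key] = name

    def lookup(self, name):
        # Returns the stored spelling, whichever spelling was asked for.
        return self._entries[self._key(name)]


d = NormalizedDir()
d.create("r\u00e9sum\u00e9")               # stored precomposed
found = d.lookup("re\u0301sume\u0301")     # asked for decomposed
print(found == "r\u00e9sum\u00e9")         # True: original spelling kept
```

Creating the decomposed spelling afterwards raises FileExistsError, so the
two names cannot coexist, yet the first one is never rewritten.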

What about globbing, especially on Mac OS X?  Given your two examples on
Linux with bash,

    $ touch résumé résumé
    $ ls r?sum?
    $ ls r?sum? | recode ..dump
    UCS2   Mne   Description

    0072   r     latin small letter r
    00E9   e'    latin small letter e with acute
    0073   s     latin small letter s
    0075   u     latin small letter u
    006D   m     latin small letter m
    00E9   e'    latin small letter e with acute
    000A   LF    line feed (lf)
    $ ls r??sum??

Do you think NFKC would be better, so ? often matches what appears as a
single rune and fi matches the ligature ﬁ (U+FB01)?
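
A rough way to experiment with that question is Python's fnmatch, which
matches ? against one code point.  This is only a sketch: glob_names is a
hypothetical helper, and a shell's globbing depends on its locale, so the
results need not match bash exactly:

```python
import fnmatch
import unicodedata

nfc = "r\u00e9sum\u00e9"        # precomposed spelling
nfd = "re\u0301sume\u0301"      # decomposed spelling
names = [nfc, nfd]


def glob_names(pattern, names, form=None):
    """Return names matching pattern, optionally normalizing them first."""
    matched = []
    for n in names:
        candidate = unicodedata.normalize(form, n) if form else n
        if fnmatch.fnmatchcase(candidate, pattern):
            matched.append(n)
    return matched


# Without normalization each pattern finds only one of the two spellings.
print(glob_names("r?sum?", names) == [nfc])       # True
print(glob_names("r??sum??", names) == [nfd])     # True

# Normalizing to NFKC before matching lets r?sum? find both.
print(glob_names("r?sum?", names, "NFKC") == names)   # True

# NFKC also folds the ligature U+FB01 into the two letters "fi".
lig = "\ufb01le"
print(glob_names("fi*", [lig]))            # [] without normalization
print(glob_names("fi*", [lig], "NFKC") == [lig])      # True
```

NFC alone would not give the last result, since U+FB01 has only a
compatibility decomposition; that is the difference the K forms make.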

Cheers, Ralph.
