Re: [Nmh-workers] I like neither green eggs and ham nor MIME

From: Ralph Corderoy
Subject: Re: [Nmh-workers] I like neither green eggs and ham nor MIME
Date: Fri, 18 Jul 2014 19:39:54 +0100


> Norm wrote:
> > I am not at all secure about how the standard GNU utilities will
> > handle non-ascii characters. For example, 'wc -c', just counts
> > bytes.

Christian has pointed out that -c, AKA --bytes, still counts bytes
because too many things would break otherwise, and that -m, AKA --chars,
has been added to count multi-byte characters.  tr(1) remains resolutely
single-byte, though its documentation talks of growing multibyte support
with a -C complement option.
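
As a rough sketch of the -c versus -m difference, the following counts
the same character both ways.  The variable names are only illustrative,
and the character count depends on whatever locale is in effect, so no
particular value is promised for it.

```shell
# The three UTF-8 bytes of U+2190 (←), written as octal escapes so the
# example does not depend on the encoding of this file.
arrow=$(printf '\342\206\220')

# -c counts bytes whatever the locale.
bytes=$(printf '%s' "$arrow" | LC_ALL=C wc -c)

# -m counts characters as the current locale defines them, so the
# result depends on LC_ALL, LC_CTYPE, and LANG.
chars=$(printf '%s' "$arrow" | wc -m)

# $(( )) strips the padding some wc implementations add.
echo "bytes=$((bytes)) chars=$((chars))"
```

The byte count is 3.  The character count should be 1 in a UTF-8 locale
and 3 in the C locale.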

    $ od -c <<<←
    0000000 342 206 220  \n
    $ tr \\220 \\221 <<<←
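
To see what that byte-wise tr does to the arrow, here is a small sketch
that dumps the result in octal.  Setting LC_ALL=C is an assumption added
here so the example behaves the same with any tr; the variable name is
illustrative.

```shell
# Swap byte 0220 for 0221 in the three-byte UTF-8 encoding of ←.
# LC_ALL=C is an assumption added here so that every tr treats the
# input as plain bytes.
out=$(printf '\342\206\220' | LC_ALL=C tr '\220' '\221')

# Dump the result in octal to see which byte changed.
printf '%s' "$out" | od -An -b
```

The bytes 342 206 221 are the UTF-8 encoding of U+2191, ↑, so changing
that one byte happens to produce another valid character.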

Unix moved to UTF-8 some years ago, and things like sed and grep all
work fine in that world, though often a bit more slowly.

    $ sed 'y/\220/\221/' <<<←
    $ sed y/←/x/ <<<←

For the odd occasion when I want to remove locale specifics, I use
~/bin/C as a shorthand.

    $ cat ~/bin/C
    #! /bin/sh

    # LC_ALL has precedence over LANG according to POSIX[1], but we may as
    # well stamp out any traces by setting LANG too.
    # 1.  The Open Group Base Specifications, Ch. 8 Environment Variables.

    LC_ALL=C LANG=C exec "$@"
    $ C sed 'y/←/x/' <<<←
    sed: -e expression #1, char 8: strings for `y' command are different lengths
    $ C sed 'y/←/xyz/' <<<←
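
The reason three replacement characters are needed can be sketched like
this.  The shell function is a hypothetical stand-in for the ~/bin/C
script, and the arrow is built with printf so the example does not
depend on this message's encoding.

```shell
# Hypothetical function form of the ~/bin/C wrapper, for shells where
# a separate script is inconvenient.
C() { env LC_ALL=C LANG=C "$@"; }

arrow=$(printf '\342\206\220')   # the UTF-8 bytes of ←

# In the C locale the arrow is three one-byte characters, so y///
# needs three replacement characters to match.
result=$(printf '%s\n' "$arrow" | C sed "y/$arrow/xyz/")
echo "$result"
```

Each of the arrow's bytes is mapped to one of x, y, and z, so the output
is plain ASCII.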

Ken wrote:
> But since UTF-8 has the excellent property that non-ASCII characters
> look like just 8-bit characters but won't ever be mistaken for ASCII
> (not a surprise, since it was designed by two of the original Unix
> geeks)

Ken Thompson and Rob Pike.  (Pike's not quite original, but nearly.)
Back in 2012, Rob wrote up how it was created on a napkin in a diner.
There's a comment by me there with a Google Street View of the diner.

> I jumped whole-hog into UTF-8 a few years ago, and I haven't regretted
> it one bit.

No regrets here.  You might find iconv(1) useful to convert existing
files from one encoding to another.
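
A minimal sketch of such a conversion, assuming the input is
ISO-8859-1; the encoding names are assumptions, and `iconv -l` lists
what the local iconv supports.

```shell
# Byte 0253 is « in ISO-8859-1.
latin1=$(printf '\253')

# -f names the encoding to read, -t the encoding to write.
utf8=$(printf '%s' "$latin1" | iconv -f ISO-8859-1 -t UTF-8)

# « becomes the two bytes 302 253 in UTF-8.
printf '%s' "$utf8" | od -An -b
```

For a real file, the same options work with a filename argument and the
output redirected to a new file.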

Cheers, Ralph.
