From: Ken Hornstein
Subject: Re: [Nmh-workers] nmh architecture discussion: format engine character set
Date: Tue, 11 Aug 2015 12:28:39 -0400

>I am in no way an expert on this.  But, I won't let that stop me.

Welcome to the club!  I think we're all in the same boat in that

>It seems to me that the only solution is to use Unicode internally.
>Disgusting as it seems to those of us who are old enough to hoard
>bytes, we might want to consider using something other than UTF-8
>for the internal representation.  Using UTF-16 wouldn't be horrible
>but I recall that the Unicode folks made a botch of things so that
>one really needs 24 bits now, which really means using 32 internally.

AFAICT ... there is probably no advantage in using UTF-16 or UTF-32
versus UTF-8.

People might think that you gain something because with UTF-16 two
bytes == 1 character.  But that's only true for things in the Basic
Multilingual Plane, and people are now telling us 🖕 because they want
to send emoji in email which are NOT part of the BMP, which means we
have to start dealing with 💩 like surrogate pairs. And really, even
with just the BMP, combining characters toss that idea out of the window.
UTF-32 lets you say 4 bytes == 1 character ... but do we care about
'characters' or 'column positions'?
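
To make the size argument concrete, here is a minimal sketch in C. The
names measure_utf8 and struct text_sizes are made up for illustration
and are not part of nmh. It assumes its input is already valid UTF-8
and does no validation. For one C string it reports the UTF-8 byte
count, the number of code points (the units UTF-32 would need), and the
number of 16-bit units UTF-16 would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of nmh: size tallies for one string. */
struct text_sizes {
    size_t utf8_bytes;   /* what strlen() reports */
    size_t code_points;  /* number of UTF-32 units needed */
    size_t utf16_units;  /* number of 16-bit units, surrogates included */
};

/* Assumes s is valid UTF-8; no validation is attempted. */
static struct text_sizes measure_utf8(const char *s)
{
    struct text_sizes t = {0, 0, 0};
    const unsigned char *p;

    assert(s != NULL);
    t.utf8_bytes = strlen(s);

    for (p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;  /* continuation byte, not the start of a code point */
        t.code_points++;
        /* A lead byte of 0xF0 or above starts a 4-byte sequence, i.e. a
           code point above U+FFFF, which UTF-16 stores as a surrogate
           pair (two 16-bit units). */
        t.utf16_units += (*p >= 0xF0) ? 2 : 1;
    }
    return t;
}
```

Fed "\xF0\x9F\x92\xA9" (U+1F4A9), this gives 4 bytes, 1 code point and
2 UTF-16 units. Fed "e\xCC\x81" (e followed by combining acute accent
U+0301), it gives 3 bytes, 2 code points and 2 UTF-16 units, although
a terminal would normally draw that in a single column. None of the
three counts is the column position.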

So given that, I think sticking with UTF-8 is preferable; it has the
nice property that we can represent text as C strings and it's just
ASCII if you're living in a 7-bit world.
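
A small sketch of what that property buys us, again with made-up
helper names (is_pure_ascii and at_sign_offset are illustrations, not
nmh functions). Every byte of a multibyte UTF-8 sequence has its high
bit set, so a plain strchr() for an ASCII delimiter such as '@' can
never match in the middle of a character, and a string with no
high-bit bytes is already plain ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if every byte is 7-bit, in which case the UTF-8 string is
   byte-for-byte identical to its ASCII form; otherwise returns 0. */
static int is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s & 0x80)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: byte offset of the first '@', or -1 if there is
   none. Ordinary strchr() is safe on UTF-8 here because '@' (0x40)
   cannot appear inside a multibyte sequence. */
static long at_sign_offset(const char *addr)
{
    const char *at;

    assert(addr != NULL);
    at = strchr(addr, '@');
    return at != NULL ? (long)(at - addr) : -1L;
}
```

For "j\xC3\xB6rg@example.org" the offset is 5, counted in bytes rather
than characters, which is fine for slicing the string but not for
deciding how many columns it takes up.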

>On the output side, we just have to do the best we can if characters in
>the input locale can't be represented in the output locale.  This is
>independent of the internal representation.

Well, this works great if your locale is UTF-8.  But ... what happens
if your email address contains UTF-8, and your locale setting is


