Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects

From: Ken Hornstein
Subject: Re: [Nmh-workers] Non-ASCII Characters in bodies and subjects
Date: Tue, 17 Jun 2014 15:34:39 -0400

>I've had to deal with messages that have non-ASCII characters in headers,
>so they can occur in the wild, and usually occur in non-English locales,
>but can still occur in English locales where special characters (e.g.
>English pound, euro) are used.

And sadly, before 1.6 was released some of those messages might have been
sent by nmh users!

>In a program I developed that has to parse emails, I had to provide a
>configuration option that instructed the program what the default
>character encoding should be when parsing message headers because of
>this.  The MIME RFCs say US-ASCII is the default, but the real world
>indicates this is not always the case.  Not sure what nmh does when
>encountering such data.

Most of the time, we ignore it.  Well, let me rephrase that.  If
an 8-bit character appears in an email address (the actual address@hidden
part), it's summarily rejected.  Otherwise it's treated as ASCII, which
generally means it's sent unmodified to the user.  We have not yet reached
consensus on what should be done in these cases; there are no wonderful
answers, and the code structure makes it a bit hard (there's not a clear
separation between "stuff we read on disk" and "stuff we've decoded and
converted to the local character set").  I'd be open to having a setting
which specifies the default character set for unencoded 8-bit headers,
but implementing that would be some work.
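
To make the idea of such a setting concrete, here is a rough sketch in
Python (not nmh's C, and not anything nmh implements).  The function name
decode_raw_header and the default_charset parameter are hypothetical; the
parameter stands in for the proposed setting.  It tries ASCII first and
only falls back to the configured charset for unencoded 8-bit bytes.

```python
def decode_raw_header(raw: bytes, default_charset: str = "iso-8859-1") -> str:
    """Decode an unencoded header value as read from disk.

    Hypothetical helper: default_charset plays the role of the proposed
    "default character set for unencoded 8-bit headers" setting.
    """
    try:
        # Pure 7-bit data is what the MIME RFCs say a header should be.
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # 8-bit bytes with no RFC 2047 encoding: assume the configured
        # charset, substituting U+FFFD for bytes that do not decode.
        return raw.decode(default_charset, errors="replace")


# A pound sign sent as a raw ISO-8859-1 byte.
subject = decode_raw_header(b"Cost \xa310")
```

With the ISO-8859-1 default the pound sign survives; with a UTF-8 default
the same byte is invalid and comes out as a replacement character, which
is exactly why the right default depends on the user.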

>As for message storage, nothing prohibits nmh from auto-converting (aka
>normalizing) non-ASCII encoded data to UTF-8 when storing the message.
>The underlying message parsing tools of nmh should not be affected (but
>others would have to confirm this).  This would allow standard Unix
>tools, or other tools like search indexing tools, to process the files
>w/o having to do full MIME-aware parsing.  Also, it would avoid the
>on-the-fly decoding of non-ASCII headers by nmh each time it reads a
>message (for pick, show, scan, etc).

The problem is that if you forward or redistribute such a message, we
have no facility to reencode those message headers; fixing that would
require some rearchitecting.  Also, once you've decoded you've lost any
charset information (since in the case of RFC 2047 headers, the charset
is part of the encoding) so if you change your locale the headers will
be in the "wrong" character set.  We could normalize everything to
UTF-8 ... but that's also problematic from a technical standpoint since
character set conversion is often imperfect and lossy (I realize
that converting TO UTF-8 is not lossy in a perfect world, but as you
note we don't live in that world).  Also, people tend to complain when
that has been suggested in the past.  I think those complainers should
suck it up and just switch to UTF-8, but not everyone agrees.
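
As a small illustration of the charset point, here is a sketch using
Python's standard email.header module (again, not nmh code; the sample
header is made up).  The charset travels inside the RFC 2047 encoded-word,
so it is only available until the decoded text is all you have stored.

```python
from email.header import Header, decode_header

# Made-up RFC 2047 header value: "café" in ISO-8859-1.
encoded = "=?iso-8859-1?q?caf=E9?="

# decode_header returns (bytes, charset) pairs; the charset is taken from
# the encoded-word itself.
raw, charset = decode_header(encoded)[0]
text = raw.decode(charset)

# If only `text` were stored, the original charset would be gone.
# Re-encoding for a forward means choosing a charset again (UTF-8 here).
reencoded = Header(text, charset="utf-8").encode()

# Conversion away from UTF-8 can fail outright: the euro sign has no
# code point in ISO-8859-1.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

The re-encoded header carries the same text but no longer says anything
about the charset the sender originally used.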

You mention the overhead of the decoding/conversion, and that might have
been a concern 20 years ago ... but today?  Computers are faster; it's
not something to be concerned about.

So, my summary is: storing messages with unencoded 8-bit headers has
a bunch of side effects that would need to be carefully thought out.
I do not believe any gains would be worth the hassle.  I would be open
to being proven wrong.

