From: Ken Hornstein
Subject: Re: [Nmh-workers] nmh architecture discussion: format engine character set
Date: Tue, 11 Aug 2015 14:07:32 -0400

>- Messages should be stored in their original form.  I.e., the
>   character encoding transformation should only be done for
>   display/access purposes.

Completely, 100% agree here.

>- I think using a character encoding library is unavoidable.  Is iconv()
>   sufficient?  If UTF-8 is to be used as the normalized encoding
>   format, a library is needed that can transform the various encodings
>   into it, and likely from it.  Maybe it is not as big an issue as it
>   was in the past, but not everyone was sold on Unicode.  In my
>   mail-related project, I had users that preferred their local character
>   encoding formats over anything Unicode related.

Weeeeel .... not exactly.  It's not just a transformation issue; if it
was, iconv() would be fine.
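
For the pure transformation part, a minimal sketch with iconv() might look
like the following.  The helper name convert_charset() is hypothetical (it
is not an nmh function), and the sketch assumes the caller supplies an
output buffer that is large enough for the converted text.

```c
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of nmh: convert inlen bytes of `in` from
 * charset `from` to charset `to`, writing at most outlen bytes to `out`.
 * Returns the number of bytes written, or (size_t) -1 on any failure
 * (unknown charset, invalid input, or output buffer too small).
 */
size_t
convert_charset(const char *from, const char *to,
                const char *in, size_t inlen, char *out, size_t outlen)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: to, then from */
    char *inp = (char *) in;             /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen, outleft = outlen;
    size_t rc;

    if (cd == (iconv_t) -1)
        return (size_t) -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t) -1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t) -1)
        return (size_t) -1;
    return outlen - outleft;
}
```

For example, convert_charset("ISO-8859-1", "UTF-8", ...) turns the four
bytes "caf\xE9" into the five bytes "caf\xC3\xA9".  This covers conversion
only; it says nothing about the properties of the characters involved.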

The issue in the format engine is that we need to know things about
characters, like is ' ' a space? (The format engine does space
compression.)  If the strings are UTF-8, we can't use isspace() on them.
We can't even use iswspace(), because that requires the locale to be set
to a UTF-8 locale.  So we need a library that can process UTF-8,
regardless of the locale setting.
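
To make the requirement concrete, here is a rough sketch of
locale-independent space compression over UTF-8.  All three function names
are hypothetical (none of this is nmh code), the whitespace test hard-codes
the code points carrying the Unicode White_Space property, and malformed
bytes are simply copied through unchanged, which is an assumption about
the desired behaviour rather than a settled design.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Decode one UTF-8 sequence from s (at most len bytes) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * truncated, overlong, a surrogate, or otherwise malformed.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t c, min;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        need = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {
        need = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {
        need = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;
    }
    if (len < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return need;
}

/* True for code points with the Unicode White_Space property. */
int
ucs_isspace(uint32_t cp)
{
    return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
           cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
           cp == 0x3000;
}

/*
 * Collapse each run of whitespace in `in` to a single ASCII space.
 * `out` must have room for len bytes; the output is never longer than
 * the input.  Returns the number of bytes written.
 */
size_t
utf8_compress_spaces(const char *in, size_t len, char *out)
{
    const unsigned char *s = (const unsigned char *) in;
    size_t i = 0, o = 0;
    int in_space = 0;

    while (i < len) {
        uint32_t cp;
        size_t n = utf8_decode(s + i, len - i, &cp);

        if (n > 0 && ucs_isspace(cp)) {
            if (!in_space)
                out[o++] = ' ';
            in_space = 1;
            i += n;
            continue;
        }
        if (n == 0)
            n = 1;              /* malformed byte: copy it through as-is */
        memcpy(out + o, in + i, n);
        o += n;
        i += n;
        in_space = 0;
    }
    return o;
}
```

None of this consults the locale, so it behaves the same whether LC_CTYPE
is "C" or a UTF-8 locale.  A real library would also cover the other
character properties the format engine needs, such as display width.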

>   Character encoding choices can get quite political.
>   If a library is adopted, then users have full control of what encoding
>   they prefer.

Well, I was thinking that the locale would control the display/encoding
character set, like it does now.

>- As for parsing message headers, make it a configurable option
>   on what the default character encoding should be.  UTF-8 could be the
>   default (which fortunately is US-ASCII compatible).
>   Real-world note: I have encountered emails that actually use a
>   non-ASCII default encoding for message header data.  Messages in
>   non-English locales.  Technically, these messages are not conformant to
>   the RFCs, but such messages actually exist.  Hence, in my project, I
>   support an option that specifies what the default encoding is.

While I understand where you're coming from, back before EAI those
messages were invalid according to the RFCs.  Now the RFCs have changed
and those messages are defined as being UTF-8, full stop, no exceptions.
I understand the need to define a default character set for messages
which don't meet the RFCs, but it feels wrong to me to allow the user to
override the interpretation of a message which is now legal.  I welcome
discussion in this area.

>- I think it is perfectly reasonable to leverage the current locale
>   setting to determine defaults, but one should be able to explicitly
>   override such defaults via .mh_profile and command-line options.

Well, a user can already override that by changing locale environment
variables.  To me that seems like the right mechanism; you can do that
on the command line, with shell wrappers, whatever.
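
As a sketch of that mechanism: a program can pick up whatever locale the
environment specifies and ask for its character set with the standard
setlocale() and nl_langinfo() calls.  The function name display_charset()
is made up for illustration and is not an nmh function.

```c
#include <langinfo.h>
#include <locale.h>

/*
 * Hypothetical helper: return the name of the character set selected by
 * the LC_ALL / LC_CTYPE / LANG environment variables.
 */
const char *
display_charset(void)
{
    /* "" means "take the locale from the environment". */
    if (setlocale(LC_CTYPE, "") == NULL)
        setlocale(LC_CTYPE, "C");   /* unknown locale: fall back to "C" */
    return nl_langinfo(CODESET);
}
```

Setting LC_ALL (or LC_CTYPE) for a single command or inside a wrapper
script changes what this returns, with no extra option needed in the
program itself.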

>Warning message(s) should be generated when character data is lost due
>to conversion.

It's unclear to me where those messages should go, and it doesn't seem like
anyone else does that.

