[Nmh-workers] A € for your thoughts - should we fix UTF-8 subject output


From: Ken Hornstein
Subject: [Nmh-workers] A € for your thoughts - should we fix UTF-8 subject output in scan for 1.5?
Date: Sun, 20 May 2012 22:06:37 -0400

Greetings all,

I've noticed for a while that although exmh now displays people's
names with non-ASCII characters in them fine, it's been busted for
subject lines with some non-ASCII characters in them ... like this one,
at least for me (although I suspect it works fine for others).

I thought this was exmh's fault, but I decided to look at it a bit
and I realized that it's really nmh's fault.  The core problem: if
you're using a multibyte character encoding (like UTF-8) as your locale,
then the function cpstripped() can mangle the UTF-8 because that
function calls isspace() and iscntrl() on the individual encoded bytes.
On the Linux systems I have, that works because those functions return 0
for everything > 127.  On Mac OS X, that does NOT work because it seems
that the value is interpreted as a Unicode codepoint and the functions
return "true" for some values > 127 if you have a UTF-8 locale.  I've
thought about this a lot and I've read the relevant standards, and I
don't know which is the "right" behavior regarding the is*() functions
when you get passed in something > 127 in a UTF-8 locale.  But
regardless, it seems that we're doing it wrong and we should fix that.

The "From" header comes out fine because it goes through cptrimmed(),
which is actually multibyte-aware.  It seems to make sense that
cpstripped() should be multibyte-aware as well.  I assume this is
non-controversial.  Also, I am planning on fixing fmt_scan() at the same
time so you can pass in an extra argument denoting the printable width
you want your buffer to be ... you might notice (if you're using a UTF-8
locale) that scan lines that contain multibyte characters are "short"
on the end because fmt_scan() is truncating based on bytes instead of
printable characters.  I assume that this is non-controversial as well.

The question I have, however, is this: should this be fixed for 1.5?  I
would argue that it should because this is (for me) a significant
charset bug.  Another part of me says this might be too much this late
in the release cycle.  Thoughts?

--Ken


