[Nmh-workers] A € for your thoughts - should we fix UTF-8 subject output
From: Ken Hornstein
Subject: [Nmh-workers] A € for your thoughts - should we fix UTF-8 subject output in scan for 1.5?
Date: Sun, 20 May 2012 22:06:37 -0400

Greetings all,

I've been noticing for a while that while exmh now displays people's
names with non-ASCII characters in them fine, it's been busted for
subject lines with some non-ASCII characters in them ... like this one,
at least for me (although I suspect it works fine for others).

I thought this was exmh's fault, but I decided to look at it a bit
and I realized that it's really nmh's fault. The core problem: if
you're using a multibyte character encoding (like UTF-8) as your locale
then the function cpstripped() can mangle the UTF-8, because that
function calls isspace() and iscntrl() on the individual encoded bytes.
On the Linux systems I have, that works because those functions return 0
for everything > 127. On MacOS X, that does NOT work, because it seems
that the value is interpreted as a Unicode codepoint and the functions
return "true" for some values > 127 if you have a UTF-8 locale. I've
thought about this a lot and I've read the relevant standards and I
don't know which is the "right" behavior regarding the is*() functions
when you get passed in something > 127 in a UTF-8 locale. But regardless,
it seems that we're doing it wrong and we should fix that.
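
As a rough illustration of the multibyte-aware approach, the sketch below
decodes whole characters with mbrtowc() and classifies them with iswspace()
and iswcntrl(), so the bytes of a multibyte sequence are never tested one at
a time. The function strip_mb() and its exact stripping rules are
hypothetical and are not nmh's actual cpstripped(). It also assumes the
program has already called setlocale(LC_ALL, "") so the user's locale is in
effect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not nmh's cpstripped()): copy src into dst, using at
 * most dstlen - 1 bytes plus a terminating NUL.  Leading and trailing
 * whitespace is dropped and each run of whitespace/control characters is
 * collapsed into a single space.  Returns the number of bytes written,
 * not counting the NUL.
 */
static size_t
strip_mb(char *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    size_t remaining, out = 0;
    int pending_space = 0;

    assert(dst != NULL && src != NULL);
    if (dstlen == 0)
        return 0;
    remaining = strlen(src);
    memset(&st, 0, sizeof st);

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src, remaining, &st);
        int is_blank;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: reset the state and pass
             * one byte through untouched rather than guessing. */
            memset(&st, 0, sizeof st);
            n = 1;
            is_blank = 0;
        } else if (n == 0) {
            break;              /* decoded a NUL character */
        } else {
            /* Classify the decoded character, not its bytes. */
            is_blank = iswspace((wint_t)wc) || iswcntrl((wint_t)wc);
        }

        if (is_blank) {
            /* Remember the gap; emit it only if more text follows. */
            if (out > 0)
                pending_space = 1;
        } else {
            size_t need = n + (pending_space ? 1 : 0);

            if (out + need >= dstlen)
                break;          /* keep room for the NUL */
            if (pending_space)
                dst[out++] = ' ';
            memcpy(dst + out, src, n);
            out += n;
            pending_space = 0;
        }
        src += n;
        remaining -= n;
    }
    dst[out] = '\0';
    return out;
}
```

Because a character is copied only when all of its bytes fit, the output
never ends in the middle of a multibyte sequence. Separately from the locale
question, the C standard only defines the is*() functions for arguments that
are EOF or representable as unsigned char, so calling them on a plain char
with the high bit set is undefined wherever char is signed.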

This works fine when parsing the "From" header because that code calls
cptrimmed(), which is actually multibyte-aware. It seems to make sense
that cpstripped() should be multibyte-aware as well. I assume this is
non-controversial. Also, I am planning on fixing fmt_scan() at the same
time so you can pass in an extra argument denoting the printable width
you want your buffer to be ... you might notice (if you're using a UTF-8
locale) that scan lines that contain multibyte characters are "short"
on the end because fmt_scan() is truncating based on bytes instead of
printable characters. I assume that this is non-controversial as well.
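
To make the bytes-versus-columns distinction concrete, here is a minimal
sketch of truncating by printable width. The helper bytes_for_width() is
hypothetical and is not part of fmt_scan(); it assumes a POSIX system that
provides wcwidth() and that setlocale(LC_ALL, "") has already been called.

```c
#define _XOPEN_SOURCE 700       /* for wcwidth() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not nmh code): return how many bytes of src fit in
 * max_cols terminal columns without splitting a multibyte character.
 * If cols_used is non-NULL it receives the number of columns consumed.
 */
static size_t
bytes_for_width(const char *src, int max_cols, int *cols_used)
{
    mbstate_t st;
    size_t remaining, used = 0;
    int cols = 0;

    assert(src != NULL);
    remaining = strlen(src);
    memset(&st, 0, sizeof st);

    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src + used, remaining, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Undecodable byte: assume it takes one column. */
            memset(&st, 0, sizeof st);
            n = 1;
            w = 1;
        } else if (n == 0) {
            break;              /* decoded a NUL character */
        } else {
            w = wcwidth(wc);
            if (w < 0)
                w = 0;          /* non-printable: assumed zero columns */
        }

        if (cols + w > max_cols)
            break;              /* next character would overflow */
        cols += w;
        used += n;
        remaining -= n;
    }
    if (cols_used != NULL)
        *cols_used = cols;
    return used;
}
```

In a UTF-8 locale a string such as "€uro" is six bytes but normally only
four columns wide, so a byte-based limit of four would cut it after "€u"
while a column-based limit of four keeps the whole word. A caller could then
pad the output with max_cols - *cols_used spaces to keep scan columns
aligned.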

The question I have, however, is this: should this be fixed for 1.5? I
would argue that it should because this is (for me) a significant
charset bug. Another part of me says this might be too much this late
in the release cycle. Thoughts?

--Ken