
[Pan-devel] GMime & charsets

From: Jeffrey Stedfast
Subject: [Pan-devel] GMime & charsets
Date: Wed, 28 Mar 2007 10:30:10 -0400

So as I was fixing bug #342196, I noticed that GMime's header decoders
do not presently enforce that they return valid UTF-8...

Not really a big surprise... I already knew that, but I didn't think it
was quite as bad as it actually is:

- I had thought that they at least /tried/ to convert unencoded 8bit
text into UTF-8, but they don't.

- turns out that rfc2047 encoded-word tokens which declared themselves
to have been encoded in UTF-8 weren't checked for validity (e.g. so if
they were actually random 8bit garbage after being qp/b64 decoded, they
were just returned as if they were valid UTF-8)

- when decoding non-UTF-8-declared rfc2047 encoded-word tokens, if the
conversion to UTF-8 from the declared charset failed, it only fell back
to the user's locale charset (actually, this one is a pretty reasonable
fallback behavior, but I'm adding it here anyway for completeness)

- the decoded headers might contain some sequences of valid UTF-8 and
some invalid ones (e.g. a few encoded-word tokens are successfully
converted to UTF-8 while others are not... and let's not forget about
the possibility of some unencoded 8bit text being thrown in there as
well), so how do you as an app developer handle that case? You could
try to do your own UTF-8 conversions on the returned decoded string...
but if the text contains some pieces of valid UTF-8, how would you know
which ones? (well... you could, but it'd be a lot of work)

Since I added g_mime_set_user_charsets() and g_mime_user_charsets() to
allow custom overrides for charsets used in the encoding process, I
figured... why not take advantage of them and use them in the decode
process as well?

What I've done (not yet committed to svn) is to iterate over the list of
charsets provided, keeping track of the number of bytes that needed to
be dropped in order to convert the text sequence (might be a qp/b64
decoded text blob from an encoded-word token or raw 8bit text found
elsewhere in the header). If at any time we successfully convert to
UTF-8 without the need to drop any bytes, we simply return the result.
If, instead, we get to the end of the list of charsets and discover that
none of them converts cleanly to UTF-8, we choose the best
(i.e. fewest dropped bytes) charset and convert using that, replacing
all dropped bytes from the input with 'x' in the output.
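
For anyone who wants to picture the approach, here is a rough sketch in
plain C on top of POSIX iconv(). This is NOT the actual GMime code: the
names convert_lossy() and decode_to_utf8() are made up for illustration,
the output buffer bound is an assumption, and unlike the description
above it keeps the best result seen so far instead of re-converting at
the end.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `in` (len bytes) from `charset` to UTF-8.
 * Every input byte that cannot be converted is replaced by 'x' in the
 * output and counted in *dropped. Returns a malloc'd NUL-terminated
 * string, or NULL if the charset is unknown or a buffer is too small. */
static char *convert_lossy(const char *charset, const char *in,
                           size_t len, size_t *dropped)
{
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t) -1)
        return NULL;

    /* Assumption: 4 output bytes per input byte is a generous bound. */
    size_t outsize = len * 4 + 1;
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inptr = (char *) in;   /* iconv() does not modify the input */
    char *outptr = out;
    size_t inleft = len, outleft = outsize - 1;
    *dropped = 0;

    while (inleft > 0) {
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) != (size_t) -1)
            break;               /* everything that was left converted */

        if (errno == E2BIG || outleft == 0) {
            free(out);
            iconv_close(cd);
            return NULL;
        }

        /* EILSEQ or EINVAL: drop one input byte, emit 'x' in its place */
        inptr++;
        inleft--;
        *outptr++ = 'x';
        outleft--;
        (*dropped)++;
        iconv(cd, NULL, NULL, NULL, NULL);   /* reset conversion state */
    }

    iconv(cd, NULL, NULL, &outptr, &outleft); /* flush any shift state */
    *outptr = '\0';
    iconv_close(cd);
    return out;
}

/* Hypothetical driver: try each charset in the NULL-terminated list.
 * Return immediately on the first lossless conversion; otherwise return
 * the result with the fewest dropped bytes (NULL if nothing worked). */
char *decode_to_utf8(const char **charsets, const char *in, size_t len,
                     size_t *dropped_out)
{
    char *best = NULL;
    size_t best_dropped = (size_t) -1;

    for (size_t i = 0; charsets[i] != NULL; i++) {
        size_t dropped;
        char *out = convert_lossy(charsets[i], in, len, &dropped);
        if (out == NULL)
            continue;

        if (dropped == 0) {      /* clean conversion: we are done */
            free(best);
            *dropped_out = 0;
            return out;
        }

        if (dropped < best_dropped) {
            free(best);
            best = out;
            best_dropped = dropped;
        } else {
            free(out);
        }
    }

    if (best != NULL)
        *dropped_out = best_dropped;
    return best;
}
```

With a list like { "UTF-8", "ISO-8859-1", NULL }, a Latin-1 encoded
header fails the UTF-8 attempt (at least one byte dropped) and then
converts cleanly on the second charset, so the caller never sees the
lossy result. The caller frees the returned string.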

So that's what I'm presently doing... I think I've got
g_mime_utils_header_decode_text() fixed now (which is what
8bit_header_decode wraps, fwiw) and am going to look into fixing
g_mime_utils_header_decode_phrase() tonight if I can...

So what does this mean? Well, hopefully it'll mean that you can rely on
the header decoders returning valid UTF-8 for display in Gtk+-2.0 apps
w/o need to validate on your own (which is likely to be annoying at best
and not very useful at worst since what are you going to do with invalid
UTF-8 at that point? especially since some sequences in the decoded
header might be UTF-8 and others not?)

Any thoughts on this? Anyone have test messages that are failing (due to
charset issues) to display properly in Pan right now?

