bug-mailutils

Re: subject decoding


From: Sam Roberts
Subject: Re: subject decoding
Date: Wed, 13 Nov 2002 11:31:20 -0500
User-agent: Mutt/1.5.0i

Wrote Frederic Gobry <address@hidden>, on Wed, Nov 13, 2002 at 09:09:15AM +0100:
> > This would be a good way, as long as it takes more than a word, I
> > think it needs to take an entire header field value.
> > 
> > So, you can call header_aget_field_value(), then, if you need to
> > display it, you could call:
> > 
> > rfc2047_decode_field () on the value.
> 
> Sure. The function name is misleading, but this seemed to be the
> official term in the RFC.

Look at RFC 2047, section 8. See the example for Subject:? It has
TWO encoded words in it! So a field value could look like:

Subject: =?iso-8859-1?Q?Joe?= is =?utf-8?Q?funny?= in prose

It can have multiple encoded words, and they can be in different
character sets.

> It will decode / encode a complete string of course.

So you have to go through, find all the words, and decode them.
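
A minimal sketch of that scan in C, under some assumptions: the struct
and function names (encoded_word, next_encoded_word,
count_encoded_words) are hypothetical and not part of mailutils, and the
scanner only locates the words, it does not decode them.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* One encoded-word located inside a field value (hypothetical type). */
struct encoded_word {
    const char *charset;  /* points into the field value, not NUL-terminated */
    size_t charset_len;
    char encoding;        /* 'Q' or 'B', upper-cased */
    const char *text;     /* the encoded-text, still encoded */
    size_t text_len;
    const char *end;      /* first byte after the closing "?=" */
};

/* Find the next encoded-word at or after s.
   Returns 1 and fills *w on success, 0 if there is none. */
static int next_encoded_word(const char *s, struct encoded_word *w)
{
    assert(s != NULL && w != NULL);
    for (const char *p = strstr(s, "=?"); p != NULL; p = strstr(p + 1, "=?")) {
        const char *cs = p + 2;
        const char *q1 = strchr(cs, '?');          /* end of charset */
        if (q1 == NULL || q1 == cs)
            continue;
        if (memchr(cs, ' ', (size_t)(q1 - cs)) != NULL)
            continue;
        int enc = toupper((unsigned char)q1[1]);   /* single-letter encoding */
        if ((enc != 'Q' && enc != 'B') || q1[2] != '?')
            continue;
        const char *text = q1 + 3;
        const char *end = strstr(text, "?=");      /* closing delimiter */
        if (end == NULL)
            continue;
        size_t text_len = (size_t)(end - text);
        /* encoded-text may contain neither spaces nor '?' */
        if (memchr(text, ' ', text_len) != NULL ||
            memchr(text, '?', text_len) != NULL)
            continue;
        w->charset = cs;
        w->charset_len = (size_t)(q1 - cs);
        w->encoding = (char)enc;
        w->text = text;
        w->text_len = text_len;
        w->end = end + 2;
        return 1;
    }
    return 0;
}

/* Count the encoded-words in a whole field value. */
static int count_encoded_words(const char *value)
{
    struct encoded_word w;
    int n = 0;
    while (next_encoded_word(value, &w)) {
        n++;
        value = w.end;
    }
    return n;
}
```

Each word carries its own charset, so a caller has to handle them one at
a time rather than treating the field value as a single encoded string.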

> > As far as I can see, it is not possible for header_aget_field_value() to
> > do it internally, because it involves converting from the MIME character
> > set that the field value is encoded in to the character set used by
> > whatever display device the application is using, and how is it supposed
> > to know that? If it was sieve, for example, it shouldn't be converted
> > to the terminal's character set, but to utf-8, because all sieve
> > field comparisons are supposed to be made against headers that have
> > been canonicalized to utf-8.
> 
> I intended only to manage the _transfer_ encoding (this reminds me that
> the function should also return the content encoding). To do it in the
> header_aget_field_value() would only require to check if the mail uses
> MIME or not (so that non-RFC2047 mails won't be corrupted).

I'm not sure what you mean by transfer and content encoding. Those terms
refer to message bodies, and RFC 2047 only covers message headers.

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

So if you have:

Subject: Eduardo =?iso-8859-1?Q?P=E9rez?= called it =?koi8-r?Q?=D0=D0?=

It seems to me that if you strip the encoding, you get the string
"Eduardo P\xE9rez called it \xD0\xD0" (in C string notation), but you
can't do anything with that: some of the chars are koi8-r, some are
latin-1. How can you display this? How do you tell which are which?

That's why I think you'll need a

int rfc2047_decode(const char* toset, const char* fromstr, char** tostr);

Then you could call decode on the above example, tell it your terminal
or GUI uses utf-8, and it would display the latin-1 and koi8-r
perfectly. Or you could tell it you have a latin-1 terminal, and the
latin-1 would display, and the koi8-r would have '?' marks, or
something, instead of undisplayable characters.
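
Here is a rough sketch of what such a function could look like. It is
only an illustration under stated assumptions: it uses POSIX iconv(3)
for the charset conversion, handles only "Q" encoded words (a "B" word
is copied through untouched), copies text outside encoded words
unchanged, and returns -1 when a character cannot be converted instead
of substituting '?'. The helpers hex_val and q_decode are made up for
the example.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Undo "Q" encoding. out must hold at least len bytes; returns bytes written. */
static size_t q_decode(const char *in, size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        int hi = (i + 2 < len) ? hex_val((unsigned char)in[i + 1]) : -1;
        int lo = (i + 2 < len) ? hex_val((unsigned char)in[i + 2]) : -1;
        if (in[i] == '=' && hi >= 0 && lo >= 0) {
            out[n++] = (char)(hi * 16 + lo);    /* "=XX" is one raw byte */
            i += 2;
        } else {
            out[n++] = (in[i] == '_') ? ' ' : in[i];
        }
    }
    return n;
}

/* Decode every "Q" encoded-word in fromstr, converting each one from its
   own charset to toset. On success returns 0 and stores a malloc'd string
   in *tostr; returns -1 on any error. */
int rfc2047_decode(const char *toset, const char *fromstr, char **tostr)
{
    assert(toset != NULL && fromstr != NULL && tostr != NULL);
    size_t len = strlen(fromstr), cap = 4 * len + 1, n = 0;
    char *raw = malloc(len + 1), *out = malloc(cap);
    const char *p = fromstr;
    int rc = (raw != NULL && out != NULL) ? 0 : -1;

    while (rc == 0 && *p != '\0') {
        const char *q1 = NULL, *end = NULL;
        if (p[0] == '=' && p[1] == '?')
            q1 = strchr(p + 2, '?');            /* end of charset name */
        if (q1 != NULL && q1 > p + 2 && q1 - (p + 2) < 64 &&
            (q1[1] == 'Q' || q1[1] == 'q') && q1[2] == '?')
            end = strstr(q1 + 3, "?=");
        if (end == NULL) {                      /* not a Q word: copy one byte */
            if (n + 1 >= cap) rc = -1; else out[n++] = *p++;
            continue;
        }
        char cset[64];
        size_t cslen = (size_t)(q1 - (p + 2));
        memcpy(cset, p + 2, cslen);
        cset[cslen] = '\0';

        size_t il = q_decode(q1 + 3, (size_t)(end - (q1 + 3)), raw);
        size_t ol = cap - n - 1;                /* keep room for the NUL */
        char *ip = raw, *op = out + n;
        iconv_t cd = iconv_open(toset, cset);   /* this word's charset -> toset */
        if (cd == (iconv_t)-1 || iconv(cd, &ip, &il, &op, &ol) == (size_t)-1)
            rc = -1;
        if (cd != (iconv_t)-1)
            iconv_close(cd);
        n = (size_t)(op - out);
        p = end + 2;
    }
    free(raw);
    if (rc != 0) {
        free(out);
        return -1;
    }
    out[n] = '\0';
    *tostr = out;
    return 0;
}
```

Called as rfc2047_decode("UTF-8", value, &s), each word is converted
from its own charset, so the latin-1 and koi8-r words in the example
above can end up in one displayable string. A real implementation would
also need "B" decoding and the RFC 2047 rule about ignoring white space
between adjacent encoded words.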

Even if you wrote a function that only removed the encoding, and assumed
that all strings in the header were in the same character set, then
you would have to have something like:


int rfc2047_to_binary(const char* fromstr, char** tostr, const char** charset);


Where you would get back tostr with all the QP and B64 decoding done,
and the charset would tell you what the charset was, and then it would
be up to the user to decide what to do with strings in that charset.
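
To show the limitation, here is a sketch of that simpler function. The
details are assumptions made for the example: only "Q" words are
decoded, the reported charset lives in the same allocation as *tostr (so
one free() releases both), "us-ascii" is reported when there are no
encoded words, and the call fails with -1 as soon as two words name
different charsets.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Undo "Q" encoding. out must hold at least len bytes; returns bytes written. */
static size_t q_decode(const char *in, size_t len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        int hi = (i + 2 < len) ? hex_val((unsigned char)in[i + 1]) : -1;
        int lo = (i + 2 < len) ? hex_val((unsigned char)in[i + 2]) : -1;
        if (in[i] == '=' && hi >= 0 && lo >= 0) {
            out[n++] = (char)(hi * 16 + lo);
            i += 2;
        } else {
            out[n++] = (in[i] == '_') ? ' ' : in[i];
        }
    }
    return n;
}

/* Case-insensitive comparison of two length-delimited charset names. */
static int same_name(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen) return 0;
    for (size_t i = 0; i < alen; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Strip the Q encoding only; report the single charset the bytes are in.
   Returns 0 on success, -1 on allocation failure or mixed charsets. */
int rfc2047_to_binary(const char *fromstr, char **tostr, const char **charset)
{
    assert(fromstr != NULL && tostr != NULL && charset != NULL);
    size_t len = strlen(fromstr), n = 0, cslen = 0;
    const char *cs = NULL, *p = fromstr;
    char *buf = malloc(2 * len + 16);  /* decoded text, NUL, charset name, NUL */
    if (buf == NULL) return -1;

    while (*p != '\0') {
        const char *q1 = NULL, *end = NULL;
        if (p[0] == '=' && p[1] == '?')
            q1 = strchr(p + 2, '?');
        if (q1 != NULL && q1 > p + 2 &&
            (q1[1] == 'Q' || q1[1] == 'q') && q1[2] == '?')
            end = strstr(q1 + 3, "?=");
        if (end == NULL) {                 /* ordinary text: copy one byte */
            buf[n++] = *p++;
            continue;
        }
        if (cs == NULL) {                  /* first word fixes the charset */
            cs = p + 2;
            cslen = (size_t)(q1 - cs);
        } else if (!same_name(cs, cslen, p + 2, (size_t)(q1 - (p + 2)))) {
            free(buf);                     /* a second charset: give up */
            return -1;
        }
        n += q_decode(q1 + 3, (size_t)(end - (q1 + 3)), buf + n);
        p = end + 2;
    }
    buf[n] = '\0';
    if (cs != NULL) {
        memcpy(buf + n + 1, cs, cslen);
        buf[n + 1 + cslen] = '\0';
    } else {
        strcpy(buf + n + 1, "us-ascii");
    }
    *tostr = buf;
    *charset = buf + n + 1;
    return 0;
}
```

On the two-charset Subject example above this sketch can only fail,
which is the case the single-charset assumption cannot represent.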

> > Also, only some headers are allowed to have MIME encoded multi-lingual
> > tokens, but people can add headers and define them as being in that 
> > set. The header_() API would have to know not only the character
> > encoding you want, but also whether each specific header field is
> > allowed to be 2047 encoded.
> 
> For fields having a defined syntax (like those containing addresses),
> decoding could be performed transparently when one accesses the
> address_t structure for instance.

I don't see how this can be transparent, for the reasons above. Do you
see what I'm getting at? Am I a little clearer?

Either:

1) the address_get_name() function just strips the Q and B encoding, in
which case it must also return the charset of the result (you can't do
anything with it unless you know whether it's utf-8, latin-1, or koi8-r),
and even that would assume the input was in only one charset

or

2) you tell the address_get_name() function what charset you want, and
it will strip the Q and B encoding, and then convert the character
encoding to whatever you wanted.

I suggest approach (2).

Do you think I'm missing something?

-- 
Sam Roberts <address@hidden>



