Re: [Nmh-workers] RFC2047 section 5 and other MIME issues for the new scan
Sun, 14 Nov 2010 17:56:10 -0600
On Sun, Nov 14, 2010 at 11:45 AM, Jon Steinhart <address@hidden> wrote:
> My preference is to say that we'll treat any =?...?= as an encoded word
> wherever it appears and that we'll decode it. It appears that the authors of
> RFC2047 expect that everything will be parsed into tokens and examined before
> looking for encoded words.
You're right. RFC 822 defined the basic tokenization rules,
and MIME attempts to stay compatible with that. I.e., you have
a system that knows how to do RFC 822 tokenization, and then
that token data can be passed to the MIME-aware layer.
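A rough Python sketch of that layering, for illustration only. The
tokenizer here just splits on white space (a real RFC 822 lexer also
handles quoted strings, comments, and specials), and the names
ENCODED_WORD, tokenize, and decode_atom are made up for this example;
none of this is nmh code.

```python
import base64
import binascii
import re

# encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def tokenize(body):
    """Crude stand-in for RFC 822 tokenization: split on white space only."""
    return body.split()


def decode_atom(atom):
    """Decode an atom only if the whole atom is a single encoded-word."""
    m = ENCODED_WORD.fullmatch(atom)
    if m is None:
        return atom  # ordinary atom, passed through untouched
    charset, encoding, text = m.group(1), m.group(2).upper(), m.group(3)
    if encoding == "B":
        raw = base64.b64decode(text)
    else:
        # In the "Q" encoding "_" means SPACE and "=XX" is a hex octet.
        raw = binascii.a2b_qp(text.encode("ascii"), header=True)
    return raw.decode(charset, errors="replace")


# Spaces encoded as =20: one atom, recognized as an encoded-word.
good = tokenize("=?iso-8859-1?q?this=20is=20some=20text?=")
# Raw spaces: the tokenizer yields four atoms, none of them an encoded-word.
bad = tokenize("=?iso-8859-1?q?this is some text?=")

print(len(good), [decode_atom(a) for a in good])
print(len(bad), [decode_atom(a) for a in bad])
```

The MIME-aware layer never sees raw text here, only atoms, so a
malformed sequence with embedded white space is simply left alone.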
Here is a relevant note from RFC 2047:
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser. As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'. For example, the character sequence
=?iso-8859-1?q?this is some text?=
would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-words'). The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.
I think many mail implementations today probably do
not work that way, mainly due to ignorance of the developers.
Although not related to this topic, an example of this
ignorance is the syntax adopted in DKIM headers.
As for space between encoded words, such space should be
collapsed. I.e., two adjacent encoded words should be
concatenated together after decoding, with no space between
them.
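To make the collapsing rule concrete, here is a small Python sketch.
It assumes an unstructured field body (such as Subject) and
white-space-separated tokens; decode_unstructured and
decode_encoded_word are hypothetical helper names, not anything taken
from nmh.

```python
import base64
import binascii
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_encoded_word(m):
    """Turn one matched encoded-word into a Python string."""
    charset, encoding, text = m.group(1), m.group(2).upper(), m.group(3)
    if encoding == "B":
        raw = base64.b64decode(text)
    else:
        # "Q" encoding: "_" is SPACE, "=XX" is a hex-encoded octet.
        raw = binascii.a2b_qp(text.encode("ascii"), header=True)
    return raw.decode(charset, errors="replace")


def decode_unstructured(body):
    """Decode encoded-words, dropping white space between adjacent ones."""
    out = []
    prev_encoded = False
    for token in body.split():
        m = ENCODED_WORD.fullmatch(token)
        is_encoded = m is not None
        # Keep a separating space unless both neighbours are encoded-words.
        # (This sketch also squeezes runs of white space between plain words.)
        if out and not (is_encoded and prev_encoded):
            out.append(" ")
        out.append(decode_encoded_word(m) if is_encoded else token)
        prev_encoded = is_encoded
    return "".join(out)


subject = decode_unstructured("=?iso-8859-1?q?caf?= =?iso-8859-1?q?=E9?= au lait")
print(subject)
```

The space between the two encoded words disappears, while the space
before the plain word "au" is kept.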
Note, it is a mistake to blindly assume that all sequences
of =?...?= should be decoded, which has led to some erroneous
uses by some software. For example, using =?...?= inside
parameter values instead of using RFC 2184 (now RFC 2231).
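For comparison, the extended parameter syntax from RFC 2184 (carried
forward in RFC 2231) looks nothing like an encoded word: the value is
charset'language'percent-encoded-text. A minimal Python sketch of
decoding a single, non-continued extended value follows;
decode_extended_value is a made-up name, and parameter continuations
(title*0*, title*1*, ...) are deliberately not handled.

```python
from urllib.parse import unquote_to_bytes


def decode_extended_value(value):
    """Decode one extended parameter value: charset'language'%XX-text."""
    charset, _language, encoded = value.split("'", 2)
    raw = unquote_to_bytes(encoded)
    # An empty charset is allowed by the syntax; assume US-ASCII in that case.
    return raw.decode(charset or "us-ascii", errors="replace")


# Parameter as written on the wire:
#   title*=us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A
title = decode_extended_value("us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A")
print(title)
```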
> My current plan for the new scan code is to:
> 1. Read a header field name.
> 2. Read a header field body if the header field is used by the format,
> unfolding folded lines in the process.
> 3. Look for encoded words and decode them creating a UTF-8 version of the
> header field body.
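The unfolding in step 2 can be sketched in a few lines of Python. This
is only an illustration of the rule (a line break followed by a space
or tab is removed, the white space itself is kept); unfold is a
hypothetical name and says nothing about how the scan code does it.

```python
import re


def unfold(raw_field):
    """Remove each line break that is followed by a space or tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", raw_field)


folded = "Subject: one\r\n two\r\n\tthree"
unfolded = unfold(folded)
print(repr(unfolded))
```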
I've never really dived into MH/nmh parsing code. Is there any
attempt to perform RFC 822 based tokenization before doing any
Decoding of encoded words should only be done in specific contexts.
Look at Section 5 of RFC 2047 for the contexts that encoded words are