Re: [Pan-devel] [PATCH] 8 bit characters in header

From: Christophe Lambin
Subject: Re: [Pan-devel] [PATCH] 8 bit characters in header
Date: 21 Jul 2002 16:25:32 +0200

On Sun, 2002-07-21 at 02:27, Sam Solon wrote:
> Although the proper answer is "they're wrong according to RFC 977" there
> seem to be a number of postings that use 8 bit characters in the header
> -- particularly for the subject. This seems most common in binary
> newsgroups and is probably an attempt to disguise a copyright violation.

This is (unfortunately) very common on Usenet, and isn't restricted to
just binary newsgroups: go to any Russian newsgroup and you'll see that
more than 50% of the articles have 8bit characters in the header.

(in many cases, the problem is even worse, in that the articles don't
even have a content-type header, so we have no way of knowing which
charset the headers are in)

Pan 0.12.1 handles this better than CVS HEAD, but I appear to have
missed merging these changes back to HEAD (so egg on my face, and thanks
for bringing this to my attention :-)).

The way this works for headers is:

        if the header is quote-encoded
                convert the header from quote-encoded to utf8
        else if the header contains 8bit characters
                convert the header to utf8
        else
                don't convert the header

The source charset for the 8bit character conversion is the group's
default charset. This defaults to the user's locale (so, if a
non-ISO8859-1 user has set up his locale, this is transparent), but it
is configurable in the group's properties.
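
To make the three cases concrete, here is a rough sketch in Python
rather than Pan's actual C code. The function name header_to_utf8 and
the group_charset parameter are made up for illustration, "quote-encoded"
is assumed to mean RFC 2047 encoded-words, and the result is a Python
string instead of a utf8 byte buffer:

```python
from email.header import decode_header


def header_to_utf8(raw: bytes, group_charset: str = "iso-8859-1") -> str:
    """Hypothetical helper: decode one raw header value to text.

    group_charset stands in for the group's default charset
    (the user's locale unless overridden in the group's properties).
    """
    if b"=?" in raw and b"?=" in raw:
        # Case 1: encoded-words such as =?iso-8859-1?q?caf=E9?=
        text = raw.decode("ascii", errors="replace")
        pieces = []
        for chunk, charset in decode_header(text):
            if isinstance(chunk, bytes):
                # Unlabelled chunks are treated as plain ASCII here.
                chunk = chunk.decode(charset or "ascii", errors="replace")
            pieces.append(chunk)
        return "".join(pieces)
    if any(byte > 0x7F for byte in raw):
        # Case 2: raw 8bit characters, interpreted in the group's charset.
        return raw.decode(group_charset, errors="replace")
    # Case 3: pure 7bit ASCII, nothing to convert.
    return raw.decode("ascii")
```

For a Russian group the caller would pass something like "koi8-r" as
group_charset; if that guess is wrong the text comes out garbled, which
is why the charset has to stay configurable per group.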

In 0.12.1, something similar is done for reading an article. However,
reading an article works differently in 0.12.90 / HEAD: there we get the
article from the server as one gstring and then pass it to gmime, which
parses the string and gives us back a GMimeMessage, which we then use to
populate the gui headers and the article text.

Although we could do something similar in HEAD for article reading, I
feel that the better solution would be to let GMime handle this (which
would also take care of the 'Invalid UTF-8 sequence encountered' errors
you get when reading such articles). GMime could then also look at the
content type to see the charset for the 8bit headers (if any).

Charles, fejj: thoughts ?