[bug-gnu-libiconv] Re: MIME encodings
From: Bruno Haible
Subject: [bug-gnu-libiconv] Re: MIME encodings
Date: Sun, 29 Jun 2008 19:37:27 +0200
User-agent: KMail/1.5.4
Hi Tristan,
> RFC 2047 defines a thing called an encoded-word for use in email headers
> when you have non-ASCII text, for example, when you want a japanese
> subject line.
>
> In this scheme, a sequence of text is stored as something like:
>
> =?iso-2022-jp?B?34hyf3whseku3409usekijf3409u9f==?=
>
> this encoding is limited, in the best case, to 75 characters and often
> has to be less. Encoding a sequence of characters requires splitting the
> sequence up into multiple sections so that each section will be placed
> in its own encoded word which can all then be concatenated with an
> appropriate delimiter.
>
> There is a significant requirement for an encoded word:
>
> Some character sets use code-switching techniques to switch between
> "ASCII mode" and other modes. If unencoded text in an 'encoded-word'
> contains a sequence which causes the charset interpreter to switch
> out of ASCII mode, it MUST contain additional control codes such that
> ASCII mode is again selected at the end of the 'encoded-word'. (This
> rule applies separately to each 'encoded-word', including adjacent
> 'encoded-word's within a single header field.)
>
> The limit on the number of 7bit ASCII characters (after encoding) on the
> first line of an email header field is:
>
> l = 68 - "length of header name" - "length of any whitespace before the
> field body begins" - "length of character encoding name"
>
> and the limit on any other line is:
>
> l = 68 - "length of character encoding name"
>
> And the primary encoding (eg iso-2022-jp) is followed by either base64
> or a special form of quoted printable before counting the number of 7bit
> ASCII characters resulting from that encoding to check it against the
> limit. This requires either guaranteed early unshifting with some
> arbitrary shortfall, or repeated attempts at encoding.
>
> For optimal encoding, the sequence to shift back to ASCII should be
> placed as late as possible.
>
> Implementing that is pretty tough and pretty unpretty :)
>
> So, I suppose the ideal feature to be added would include:
>
> 1) (optional) the ability to select an encoding wrapped in quoted
> printable encoding (because that would make for a really nice interface
> for this) or wrapped in base64 encoding (because base64 encoding has
> special termination requirements involving padding with '=' to a 4
> character boundary that could be neatly handled with the next feature).
>
> 2) much more importantly, an api (I suppose via iconvctl) to set a
> parameter in the conversion state indicating how many output characters
> to produce before which point the sequence should be unshifted (and, if
> feature 1 above is supported, also padded). This feature should stop
> conversion and unshift at the latest possible moment before exhausting
> the number of characters requested. I think this should require setting
> after each time it is used as the state would simply count down how many
> characters are left before unshifting and that count would need to be
> reset before encoding could continue (it would be an error to attempt to
> continue). There would need to be a ctl command to determine how many
> characters remain after any conversion stops.
Thanks for explaining all this.
> I think:
>
> for feature 1) (optional) a syntax for composing character encoding
> schemes like quoted-printable onto a named "character set/encoding
> form/scheme" (as in iso-2022-jp or UTF-8), or the special
> quoted-printable variant used for MIME encoded-words, and base64. This
> might be something like "base64(iso-2022-jp)",
> "quoted-printable-encoded-word(iso-2022-jp)", and
> "quoted-printable(iso-2022-jp)".
>
> for feature 2) three additional ctl commands for iconvctl:
> SET_LIMIT_ENABLED
> SET_LIMIT_COUNT
> GET_LIMIT_COUNT.
Before going into details, two facts:
- GNU has two iconv implementations: one in glibc, for use on Linux and other
glibc systems, and libiconv, for portability to other systems. When we
add a feature to one of them, it should also make sense for the other one.
- iconv is a fundamental facility for all kinds of applications, and most
of them don't use MIME. You may convince the glibc maintainers to add
additional encodings to glibc (since they sit passively on the disk in
the form of a .so file if not used), but it is very hard to justify new code
in libc.so if it's only for MIME.
So let's look at how difficult it is to realize your two features inside and
outside the iconv implementation.
------------------------------------------------------------------------------
Feature 1), the "encoding form/scheme". This is a concept that neither glibc
nor libiconv has so far. (Only librecode has it.)
glibc's encoding syntax is "ENCODING//ERRORHANDLING" or also
"ISO10646/SURFACE/ERRORHANDLING". But the concept of the surface here is only
present in the name; in the implementation it does not correspond to a specific
type.
libiconv would have to add a lot of extra code for this, similar to its
handling of wchar_t.
Outside the iconv implementation, by contrast, you just call iconv() and then
apply your own buffer transformation for quoted-printable or base64.
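A minimal sketch of that two-step approach, for illustration only: convert with iconv(), flush any pending shift sequence, then base64-encode the resulting bytes in application code. The helper names base64_encode and convert_then_base64 and the fixed 256-byte intermediate buffer are assumptions made for this example; they are not part of glibc, libiconv, or gnulib.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Base64-encode n bytes from src into dst and NUL-terminate it.
   dst must hold at least 4 * ((n + 2) / 3) + 1 bytes.
   Returns the number of characters written (excluding the NUL). */
static size_t base64_encode(const unsigned char *src, size_t n, char *dst)
{
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t o = 0;
    for (size_t i = 0; i < n; i += 3) {
        unsigned long v = (unsigned long)src[i] << 16;
        if (i + 1 < n) v |= (unsigned long)src[i + 1] << 8;
        if (i + 2 < n) v |= src[i + 2];
        dst[o++] = tbl[(v >> 18) & 63];
        dst[o++] = tbl[(v >> 12) & 63];
        dst[o++] = (i + 1 < n) ? tbl[(v >> 6) & 63] : '=';  /* pad short tail */
        dst[o++] = (i + 2 < n) ? tbl[v & 63] : '=';
    }
    dst[o] = '\0';
    return o;
}

/* Hypothetical helper: convert the NUL-terminated string `in` from
   `fromcode` to `tocode` with iconv(), then base64 the converted bytes
   into `out`. Returns 0 on success, -1 on any failure (unknown encoding,
   invalid input, or a buffer that is too small). */
static int convert_then_base64(const char *tocode, const char *fromcode,
                               const char *in, char *out, size_t outsize)
{
    char raw[256];                       /* assumed large enough for a header */
    char *inp = (char *)in, *outp = raw;
    size_t inleft = strlen(in), outleft = sizeof raw;
    int rc = -1;

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return -1;

    /* Step 1: the character set conversion, followed by a call with a
       NULL source so that any unshift sequence is written out. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1
        && iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1) {
        size_t n = (size_t)(outp - raw);
        /* Step 2: the transfer encoding, done entirely outside iconv. */
        if (4 * ((n + 2) / 3) + 1 <= outsize) {
            base64_encode((const unsigned char *)raw, n, out);
            rc = 0;
        }
    }
    iconv_close(cd);
    return rc;
}
```

A quoted-printable transform would slot into step 2 in the same way, without touching the iconv call.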
Another argument for doing this outside the iconv implementation is that
many mailers write "iso-2022-jp" when they in fact send some string in
"iso-2022-jp-2". Or similarly with "iso-8859-1" and "windows-1252". So the
encoding specification has to be changed in an application specific way.
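As a rough illustration of such an application-specific adjustment, a mailer could remap the declared charset label before passing it to iconv_open(). The table contents mirror the two examples above; the names charset_override and effective_charset are hypothetical, and which remappings are appropriate is a policy decision for each application.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical override table: declared label -> encoding actually used. */
struct charset_override {
    const char *label;
    const char *decode_as;
};

static const struct charset_override overrides[] = {
    { "iso-2022-jp", "ISO-2022-JP-2" },
    { "iso-8859-1",  "WINDOWS-1252"  },
};

/* Return the encoding name to hand to iconv_open() for a declared label.
   Labels without an override are returned unchanged. */
static const char *effective_charset(const char *label)
{
    for (size_t i = 0; i < sizeof overrides / sizeof overrides[0]; i++)
        if (strcasecmp(label, overrides[i].label) == 0)
            return overrides[i].decode_as;
    return label;
}
```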
------------------------------------------------------------------------------
Feature 2), the "try to exhaust buffer" feature. Essentially, such a feature
exists already in iconv, through the fact that iconv() returns when the
output buffer is full (returning (size_t)-1 with errno set to E2BIG). It just
does not try to "unshift" before stopping. But the unshift sequence is never
very long:
  encoding           max unshift length
  UTF-7              2
  ISO-2022-JP        3
  ISO-2022-JP-1      3
  ISO-2022-JP-2      3
  ISO-2022-JP-3      8
  ISO-2022-CN        1
  ISO-2022-CN-EXT    1
  HZ                 2
  BIG5-HKSCS         2
  ISO-2022-KR        1
  EUC-JISX0213       2
  SHIFT_JISX0213     2
Inside iconv, this feature would add a lot of complexity.
Outside iconv, simply try to use 'l' as the maximum output length; after
iconv() returns, call it again with a NULL source to produce the unshift
sequence, and you will see whether it fits. If it does not, decrement the
maximum length by 1 and retry. As shown above, this rarely needs to be done
more than 3 times. Speed is not an issue, since the text being converted up
to that point is only ca. 60 bytes in size.
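The retry loop described here could be sketched roughly as follows. This is an illustration under stated assumptions, not tested library code: the function name convert_within_limit is made up for this example, the limit is counted in raw converted bytes (a real RFC 2047 encoder would have to account for the base64 or quoted-printable expansion when choosing it), and the conversion restarts from the beginning of the chunk on every attempt.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert as much of in[0..inlen) as possible so that
   the converted bytes PLUS the unshift sequence fit into `limit` bytes of
   `out` (which must have room for at least `limit` bytes).
   On success returns the number of bytes written and stores the number of
   input bytes used in *consumed. Returns (size_t)-1 on failure. */
static size_t convert_within_limit(const char *tocode, const char *fromcode,
                                   const char *in, size_t inlen,
                                   char *out, size_t limit, size_t *consumed)
{
    size_t result = (size_t)-1;
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return result;

    for (size_t max = limit; ; max--) {
        char *inp = (char *)in, *outp = out;
        size_t inleft = inlen, outleft = max;

        iconv(cd, NULL, NULL, NULL, NULL);      /* back to the initial state */

        /* Convert until the input is used up or `max` bytes are filled. */
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno != E2BIG)
            break;                              /* invalid or truncated input */

        /* NULL source: ask for the unshift sequence, allowing it to use
           whatever is left of the full limit. */
        outleft = limit - (size_t)(outp - out);
        if (iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1) {
            *consumed = (size_t)(inp - in);
            result = (size_t)(outp - out);
            break;                              /* everything fits */
        }
        if (max == 0)
            break;                              /* nothing left to give up */
    }
    iconv_close(cd);
    return result;
}
```

The caller would then start the next encoded-word at in + *consumed and repeat until the input is exhausted.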
------------------------------------------------------------------------------
In summary, I think the effort to do this outside iconv is smaller than
inside iconv.
Certainly it would be good to have such code in a GNU library. GNU gnulib
(http://www.gnu.org/software/gnulib/) already has a macro for detecting
the available iconv implementation and a module for base64 conversion.
If you want, you can contribute a module for quoted-printable conversion
and another one for RFC 2047 and RFC 2822 compliant header encoding and
decoding. Or, you can create a library of your own, of course.
Regards,
Bruno