Re: [Nmh-workers] bug in decode_rfc2047()

From:	David Levine
Subject:	Re: [Nmh-workers] bug in decode_rfc2047()
Date:	Thu, 03 Jan 2013 10:09:11 -0500

> >I've been slow to adapt to the multibyte world, so I
> >tripped over a bug in decode_rfc2047():
> >
> >  12  12/31 "Sears"            2013 New Year?s Deals! Start the yearr right
> >
> >scan really did produce "yearr", wrongly.  valgrind noticed,
> >too.
>
> I'm trying to understand the bug ... what exactly triggered it?  The
> encoding on the Subject line was bad?  I'm just trying to understand
> potential pitfalls so future code doesn't have it.

The encoding was correct.  The problem was due to improper
handling of an invalid character for the locale.

U+2019 was used for the apostrophe in "Year's".  With my
single-byte locale, iconv reported the first invalid byte.
decode_rfc2047() output the '?', moved on to the next character,
and continued conversion.

decode_rfc2047() keeps track of its position in the input byte
string ("start") and the count of remaining bytes ("inbytes").
The problem was that, for UTF-8 input, it initially advanced
start to the next byte but didn't decrement inbytes.  That left
inbytes too large, so it eventually fed iconv a byte of garbage.
(The input was split into two strings, so that showed up in the
middle of scan's Subject.)

The fix was to decrement inbytes when (initially) advancing
start.  It already did that for non-UTF-8 input.  So triggering
the bug took a combination of UTF-8 input, a multibyte
character, and a locale that couldn't handle that character.
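
To make the mismatch concrete, here is a small standalone sketch.  It
is not nmh code: skip_char_buggy() is only my reconstruction of the old
behavior from the description above, and the helper names and
overread() are hypothetical, made up for illustration.  The fixed
variant mirrors the idea of moving the pointer and the count together.

```c
#include <assert.h>
#include <stddef.h>

/* Reconstruction of the buggy skip, inferred only from the description
 * (an assumption, not copied from nmh): the lead byte is stepped over
 * without decrementing *inbytes. */
static void skip_char_buggy(const char **start, size_t *inbytes,
                            const char *q)
{
    assert(start && *start && inbytes && q && *inbytes > 0);
    for (++*start;
         *start < q && ((unsigned char)**start & 0xC0) == 0x80;
         ++*start)
        --*inbytes;
}

/* Fixed skip: pointer and count move together on every byte,
 * including the lead byte. */
static void skip_char_fixed(const char **start, size_t *inbytes,
                            const char *q)
{
    assert(start && *start && inbytes && q && *inbytes > 0);
    for (++*start, --*inbytes;
         *start < q && ((unsigned char)**start & 0xC0) == 0x80;
         ++*start, --*inbytes)
        continue;
}

/* Number of bytes a converter would be told to read past the real
 * end q (0 means start and inbytes are consistent). */
static size_t overread(const char *start, size_t inbytes, const char *q)
{
    size_t remaining = (size_t)(q - start);
    return inbytes > remaining ? inbytes - remaining : 0;
}
```

With the five input bytes E2 80 99 's' '!' (U+2019 followed by "s!"),
both variants leave start pointing at the 's', but the buggy one
reports 3 bytes remaining where only 2 exist, so a later iconv() call
would read one byte beyond the end of the input.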

The root of all this is iconv's behavior that requires us to
skip past the invalid character.  Looking at it now, I wonder if
we can do better than the current special handling for UTF-8?
It's the "fromutf8" block below:

    while (inbytes) {
        if (iconv(cd, &start, &inbytes, &saveq, &savedstlen) ==
                (size_t)-1) {
            if (errno != EILSEQ) break;
            /* character couldn't be converted. we output a `?'
             * and try to carry on which won't work if
             * either encoding was stateful */
            iconv (cd, 0, 0, &saveq, &savedstlen);
            if (!savedstlen)
                break;
            *saveq++ = '?';
            savedstlen--;
            if (!savedstlen)
                break;
            /* skip to next input character */
            if (fromutf8) {
                for (++start, --inbytes;
                     start < q  &&  (*start & 192) == 128;
                     ++start, --inbytes)
                    continue;
            } else
                start++, inbytes--;
            if (start >= q)
                break;
        }
    }

That's the only special handling of UTF-8 in decode_rfc2047().
And decode_rfc2047() is our only caller of iconv(), and it's
just in this one place.

David
