Re: Bug reported regarding Unicode handling in email address

From: Ralph Corderoy
Subject: Re: Bug reported regarding Unicode handling in email address
Date: Sat, 12 Jun 2021 11:19:12 +0100

Hi Ken,

> Probably the best way to do that is using mhbuild directives.
> That is, you can today do stuff like:
> #<text/plain; charset=utf-8
> [... utf-8 text here ...]
> #<text/plain; charset=iso-8859-1
> [... iso-8859-1 text here ...]
> #<text/html; charset=utf-8
> [... HTML text here ...]

The input to mhbuild can be that, it's true, though a text editor might
only handle it in the C locale.  And then nmh treats a NUL byte as end
of string, so e.g. charset=ucs-2le doesn't work.  Worse than just
truncating the UCS-2LE input, it corrupts earlier parts of the message
in this experiment.

    $ cat build
    #! /bin/bash

    (
        printf '%s\n' \
            'subject: Test.' \
            '' \
            'Disappears.' \
            '#<text/plain; charset=iso-8859-1' \
            $'Fiat: $ \xa3' \
            '#<text/plain; charset=ucs-2le'
        iconv -t ucs-2le <<<'† Footnote.'
    ) >draft
    sed -n l draft

    cp draft mimed
    mhbuild -list -realsize -headers -verbose mimed

    sed -n l mimed
    $ ./build
    subject: Test.$
    #<text/plain; charset=iso-8859-1$
    Fiat: $ \243$
    #<text/plain; charset=ucs-2le$
 ¹     \000F\000o\000o\000t\000n\000o\000t\000e\000.\000$

     msg part  type/subtype              size description
       0       multipart/mixed             99
                 boundary="----- =_aaaaaaaaaa0"
         1     text/plain                  34
         2     text/plain                   3

    subject: Test.$
    MIME-Version: 1.0$
    Content-Type: multipart/mixed; boundary="----- =_aaaaaaaaaa0"$
    Content-ID: <21398.1623492782.0@orac.inputplus.co.uk>$
    Content-Transfer-Encoding: 8bit$
    ------- =_aaaaaaaaaa0$
    Content-Type: text/plain; charset="UTF-8"$
    Content-ID: <21398.1623492782.1@orac.inputplus.co.uk>$
    Content-Transfer-Encoding: 8bit$
 ²  ain; charset=iso-8859-1$
    Fiat: $ \243$
    ------- =_aaaaaaaaaa0$
    Content-Type: text/plain; charset="ucs-2le"$
    Content-ID: <21398.1623492782.2@orac.inputplus.co.uk>$
 ³     $
    ------- =_aaaaaaaaaa0--$

1. sed happily displays the NUL bytes in the draft.

2. The ‘Disappears’ part in the draft has vanished.  The Fiat part
starts with part of the preceding directive.  Altering the length of the
UCS-2LE part changes how far back this part erroneously starts;
I suspect some pointer subtraction.

3. All that makes it into the UCS-2LE part is three spaces: the first
three of the four bytes that encode the U+2020 dagger and its following
U+0020 space.  The fourth byte is the first NUL.
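
The byte arithmetic in point 3 can be checked without nmh.  This is
only a rough sketch: bytes_before_nul is a hypothetical helper, not
part of nmh, and the input is spelt out as octal bytes on the assumption
that they match what iconv -t ucs-2le produces for the dagger, the
space, and the ‘F’.  It counts how many bytes a reader that stops at the
first NUL would see.

```shell
#!/bin/bash

# Count the bytes on stdin that precede the first NUL byte, i.e. all
# that a consumer treating the input as a C string would get.
# (Hypothetical helper for illustration; not an nmh function.)
bytes_before_nul() {
    LC_ALL=C od -An -v -tx1 |    # one hex pair per input byte
    tr -s ' ' '\n' |             # one hex pair per line
    awk 'NF { if ($1 == "00") exit; n++ } END { print n + 0 }'
}

# U+2020 is 0x20 0x20 and U+0020 is 0x20 0x00 in UCS-2LE, then 'F' 0x00.
printf '\040\040\040\000F\000' | bytes_before_nul    # prints 3
```

Three bytes out of the first four survive, which agrees with the size
of 3 that mhbuild lists for part 2.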

This isn't a complaint, just passing on the observation having made the

Cheers, Ralph.
