nmh-workers

Re: mhfixmsg character set conversion


From: David Levine
Subject: Re: mhfixmsg character set conversion
Date: Fri, 4 Feb 2022 05:08:20 -0800

Steven wrote:

> I routinely use mhfixmsg to clean up incoming messages, using this command
> in a shell script invoked through procmail:
>
>    mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
>             -reformat -fixcte -fixboundary -noreplacetextplain \
>             -fixtype application/octet-stream -noverbose -file - \
>             -outfile $destination < $source

> original message:
>
>    Veuillez ne pas r=E9
>
> This should decode to the following (represented in UTF-8):
>
>    Veuillez ne pas ré
>
> ...but mhfixmsg turns that into
>
>    Veuillez ne pas ré

(I truncated the examples to focus on the first errant conversion; see below.)

> My questions are then:
>
> 1) Is this a bug in mhfixmsg, or am I just using it incorrectly?
>
> 2) If the former, is there further information I can supply to help track
>    this down, or further tests I can conduct on the message in question?
>
> 3) ...or if the latter, what am I doing wrong, and what should I be doing
>    instead?

Good questions, and thank you for your detailed report.

The first 8-bit character in the excerpt, E9 in ISO-8859-1, should have
been converted to C3A9 in UTF-8.  iconv does that correctly:

$ printf '\xE9' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
00000000  c3 a9                                             |..|

Instead, mhfixmsg converted it to C383C2A9.  I'm not sure why.  I expect
that your environment is close enough to:

$ iconv --version
iconv (GNU libc) 2.34

$ locale
LANG=en_CA.utf8
LC_CTYPE="en_CA.utf8"
LC_NUMERIC="en_CA.utf8"
LC_TIME="en_CA.utf8"
LC_COLLATE="en_CA.utf8"
LC_MONETARY="en_CA.utf8"
LC_MESSAGES="en_CA.utf8"
LC_PAPER="en_CA.utf8"
LC_NAME="en_CA.utf8"
LC_ADDRESS="en_CA.utf8"
LC_TELEPHONE="en_CA.utf8"
LC_MEASUREMENT="en_CA.utf8"
LC_IDENTIFICATION="en_CA.utf8"
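
For reference, C383C2A9 is exactly what results from applying the
ISO-8859-1 to UTF-8 conversion twice: the first pass yields C3A9, and a
second pass treats those two bytes as the ISO-8859-1 characters Ã and ©
and encodes each of them again.  That is only an observation about the
bytes, not a diagnosis of what mhfixmsg did.  A minimal sketch with
iconv, where to_hex is a hypothetical helper defined here (it is not
part of nmh or iconv):

```shell
# Hypothetical helper: print stdin as a continuous lowercase hex string.
to_hex() { od -An -tx1 | tr -d ' \n'; }

# One ISO-8859-1 -> UTF-8 pass over the single byte E9 (octal 351).
once=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | to_hex)

# A second pass over the already-converted bytes, again read as ISO-8859-1.
twice=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 |
        iconv -f ISO-8859-1 -t UTF-8 | to_hex)

echo "once:  $once"
echo "twice: $twice"
```

The first value is c3a9 and the second is c383c2a9, matching the bytes
in the errant output.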

With this small example:

$ cat 3
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="mime-boundary"
Content-Transfer-Encoding: 8bit

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=iso-8859-1

=E9

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=iso-8859-1

&#233;

--mime-boundary--

I see correct conversion of the quoted-printable E9 to UTF-8 C3A9:

$ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
           -reformat -fixcte -fixboundary -noreplacetextplain \
           -fixtype application/octet-stream -noverbose -file - \
           -out - < 3 | hexdump -C | egrep a9
000000c0  65 74 3d 22 55 54 46 2d  38 22 0a 0a c3 a9 0a 0a  |et="UTF-8"......|

Does adding -verbose to your mhfixmsg invocation provide any clues?
With my small example, I get:

mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, decode text/plain; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 1, decode text/html; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, convert iso-8859-1 to UTF-8

David


