Re: mhfixmsg character set conversion
From: David Levine
Subject: Re: mhfixmsg character set conversion
Date: Fri, 4 Feb 2022 05:08:20 -0800
Steven wrote:
> I routinely use mhfixmsg to clean up incoming messages, using this command
> in a shell script invoked through procmail:
>
> mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
> -reformat -fixcte -fixboundary -noreplacetextplain \
> -fixtype application/octet-stream -noverbose -file - \
> -outfile $destination < $source
> original message:
>
> Veuillez ne pas r=E9
>
> This should decode to the following (represented in UTF-8):
>
> Veuillez ne pas ré
>
> ...but mhfixmsg turns that into
>
> Veuillez ne pas rÃ©
(I truncated the examples to focus on the first errant conversion; see below.)
> My questions are then:
>
> 1) Is this a bug in mhfixmsg, or am I just using it incorrectly?
>
> 2) If the former, is there further information I can supply to help track
> this down, or further tests I can conduct on the message in question?
>
> 3) ...or if the latter, what am I doing wrong, and what should I be doing
> instead?
Good questions, and thank you for your detailed report.
The first 8-bit character in the excerpt, E9 in ISO-8859-1, should
have been converted to C3A9 in UTF-8. iconv does that correctly:
$ printf '\xE9' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
00000000 c3 a9 |..|
Instead, it got converted to C383C2A9. I'm not sure why. I expect
that your environment is close enough to:
$ iconv --version
iconv (GNU libc) 2.34
$ locale
LANG=en_CA.utf8
LC_CTYPE="en_CA.utf8"
LC_NUMERIC="en_CA.utf8"
LC_TIME="en_CA.utf8"
LC_COLLATE="en_CA.utf8"
LC_MONETARY="en_CA.utf8"
LC_MESSAGES="en_CA.utf8"
LC_PAPER="en_CA.utf8"
LC_NAME="en_CA.utf8"
LC_ADDRESS="en_CA.utf8"
LC_TELEPHONE="en_CA.utf8"
LC_MEASUREMENT="en_CA.utf8"
LC_IDENTIFICATION="en_CA.utf8"
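For comparison, C383C2A9 is exactly the byte sequence that results when the correct UTF-8 output (C3A9) is treated as ISO-8859-1 and converted to UTF-8 a second time. The sketch below demonstrates that with iconv and od. It only illustrates one mechanism that produces these bytes and is not a diagnosis of what mhfixmsg did here; the `hex` helper and the variable names are made up for the illustration.

```shell
# Hypothetical helper: print stdin as a continuous lowercase hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# E9 (octal 351) converted from ISO-8859-1 to UTF-8 once.
once=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex)

# The same byte, with the UTF-8 result wrongly treated as ISO-8859-1
# and converted a second time.
twice=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 \
                      | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "converted once:  $once"
echo "converted twice: $twice"
```

If the second value matches the bytes in the damaged message, a double conversion somewhere in the delivery pipeline is one candidate worth ruling out.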
With this small example:
$ cat 3
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="mime-boundary"
Content-Transfer-Encoding: 8bit

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=iso-8859-1

=E9

--mime-boundary
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=iso-8859-1

é

--mime-boundary--
I see correct conversion of the quoted-printable E9 to UTF-8 C3A9:
$ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
-reformat -fixcte -fixboundary -noreplacetextplain \
-fixtype application/octet-stream -noverbose -file - -out - < 3 | \
hexdump -C | egrep a9
000000c0 65 74 3d 22 55 54 46 2d 38 22 0a 0a c3 a9 0a 0a |et="UTF-8"......|
Does adding -verbose to your mhfixmsg invocation provide any clues?
For the small example above, it reports:
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, decode text/plain; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 1, decode text/html; charset=iso-8859-1
mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, convert iso-8859-1 to UTF-8
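A quick way to check whether a converted message contains the doubly encoded sequence is to search for the bytes C3 83 C2 directly. This is a minimal sketch: `has_double_utf8` is a hypothetical helper name, and the test is only a heuristic, since a message that legitimately contains "Ã" followed by another Latin-1 symbol would also match.

```shell
# Hypothetical helper: succeed if the file contains the bytes C3 83 C2,
# the prefix produced when a character in U+00C0..U+00FF is converted
# from ISO-8859-1 to UTF-8 twice.
has_double_utf8() {
    pattern=$(printf '\303\203\302')
    LC_ALL=C grep -qF -- "$pattern" "$1"
}

# Demonstration on a temporary file holding the damaged bytes.
tmp=$(mktemp)
printf 'Veuillez ne pas r\303\203\302\251\n' > "$tmp"
if has_double_utf8 "$tmp"; then
    echo "double-encoded text found"
fi
rm -f "$tmp"
```

Running the helper on both $source and $destination from the procmail script would show whether the damage is already present before mhfixmsg runs.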
David