Re: mhfixmsg character set conversion

nmh-workers

From:	Ralph Corderoy
Subject:	Re: mhfixmsg character set conversion
Date:	Wed, 09 Feb 2022 11:30:43 +0000

Hi Steven,

This has been fun.  :-)

> In my case I don't have just the one sentence in a file by itself, but
> let's try grep (unpatched, and installed from grep 3.7-1 on Manjaro):
>
>    $ grep ^Veuillez good | cut -c1-68
>    Veuillez ne pas répondre au présent courriel. Il a été généré
>
>    $ grep ^Veuillez bad | cut -c1-68
>    Veuillez ne pas répondre au présent courriel. Il a été généré
>
> Really.  I'm not making this up. :-/

No, I don't think you are.  I think that line in both files is correctly
UTF-8 encoded.

> ...but if I open the incorrect output file in vim and go to line 108,
> I see this (pasted from an xterm in which vim was running):
>
>    Veuillez ne pas rÃ©pondre au prÃ©sent courriel. Il a Ã©tÃ© gÃ©nÃ©rÃ©

vim isn't the vi(1) I grew up with, and probably not the one you grew up
with either.  It has much more smarts with which to be ‘helpful’.  Try
‘:se fileencoding?’ when vim-ing good and again with bad.  I expect the
bad file has something earlier on which fixes vim's idea of the encoding
as ISO 8859-1 and so the later UTF-8 encoded bytes get encoded a second
time for display to your UTF-8 locale.
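
To see what that second encoding does to the bytes, here is a small
sketch.  It assumes iconv(1) and od(1) are to hand, and the sample word
is only borrowed from your line 108; it isn't read from either file.

```shell
# ‘é’ in UTF-8 is the two bytes c3 a9.
once=$(printf 'r\303\251pondre\n' | od -An -t x1 | tr -d ' \n')
# Misread those bytes as ISO 8859-1 and encode them as UTF-8 again.
twice=$(printf 'r\303\251pondre\n' |
    iconv -f iso-8859-1 -t utf-8 | od -An -t x1 | tr -d ' \n')
echo "once:  $once"
echo "twice: $twice"
```

Each c3 a9 turns into c3 83 c2 a9, which a UTF-8 terminal shows as ‘Ã©’.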

Emails can have bytes which encode characters with a variety of
encodings.  :-)

> But wait.  It gets worse:
>
>    $ grep -n ^Veuillez good | cut -c1-68
>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>
>    $ grep -n ^Veuillez bad | cut -c1-68
>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�

The worse being it is the very same line 108 you're seeing in vim which
grep is also showing?  (The ‘�’ at the end is to be expected.)
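
The ‘�’ comes from cut(1) rather than grep.  A sketch of why, assuming
your cut counts bytes for -c just as it does for -b, which upstream GNU
cut does: the -n puts the four bytes of ‘108:’ on the front, so column
68 now falls between the two bytes of an ‘é’.

```shell
# ‘généré’ is nine bytes in UTF-8; keep only the first five of them.
clipped=$(printf 'g\303\251n\303\251r\303\251\n' |
    cut -b1-5 | od -An -t x1 | tr -d ' \n')
echo "$clipped"
```

The output ends with a lone c3 before the linefeed, half an ‘é’, and the
terminal can only show that as ‘�’.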

> Is my shell somehow getting involved?
>
>    $ echo $SHELL
>    /usr/bin/tcsh

I used to use tcsh(1) before Linux, e.g. AIX, and back then it didn't
interfere with a command's stdout, so let's assume not.

> > bad is double-encoded.
> > 
> >     $ iconv -f iso-8859-1 -t utf-8 good | cmp - bad
> >     $
>
> I understand that, although I don't understand why that's happening.

Yes, I was just showing that is precisely the relationship between the
two.

> > head(1) and more(1) don't disguise that.
>
> They certainly shouldn't, but:
>
>    $ head -108 bad | tail -1 | cut -c1-68
>    Veuillez ne pas répondre au présent courriel. Il a été généré

Yep, looks fine.  Line 108 is validly UTF-8 encoded.

>    $ cp -p good good_snippet
>    $ cp -p bad bad_snippet
>    $ vi good_snippet bad_snippet
>         # delete all but the relevant part of line 108
>
>    $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
>
> ...but nothing appeared to happen, and I killed the command after
> waiting about a minute.  (...and yes, I tried that in a bash subshell
> because I know that syntax won't work in tcsh).

I don't understand that.  The -p sets up a loop to read a line from
good_snippet, do the substitution on it, and print the result, until
EOF.  The -l strips off the linefeed on input and puts it back on the
output.  The substitution in between changes every byte which isn't in
the range space to tilde (bytes rather than characters, thanks to
LC_ALL=C) into a string like ‘<42>’ giving its hex value.  It should
complete ‘instantly’.  Sitting forever suggests it was trying to read
stdin instead of good_snippet.

> However, just to muddy the waters even further, I fell back on od:
>
>    $ od -t x1c good_snippet 
>    0000000  56  65  75  69  6c  6c  65  7a  20  6e  65  20  70  61  73  20
>              V   e   u   i   l   l   e   z       n   e       p   a   s    
>    0000020  72  c3  a9  70  6f  6e  64  72  65  20  61  75  20  70  72  c3
>              r 303 251   p   o   n   d   r   e       a   u       p   r 303
>    0000040  a9  73  65  6e  74  20  63  6f  75  72  72  69  65  6c  2e  20
>            251   s   e   n   t       c   o   u   r   r   i   e   l   .    
>    0000060  49  6c  20  61  20  c3  a9  74  c3  a9  20  67  c3  a9  6e  c3
>              I   l       a     303 251   t 303 251       g 303 251   n 303
>    0000100  a9  72  c3  a9  0a
>            251   r 303 251  \n
>    0000105

Nothing wrong with od(1).  If you have hexdump(1) installed then
‘hexdump -C’ gives quite nice output.

> ...and both snippets are identical!

Well, those lines were identical to start with before snipping.
You could confirm this with

    cmp <(sed -n 108p good) <(sed -n 108p bad)
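
The ‘<(...)’ there is process substitution, which bash, ksh, and zsh
have but tcsh and plain sh don't.  A sketch of the same check with
temporary files instead; the two small files made here are hypothetical
stand-ins for good and bad, and line 2 plays the part of line 108.

```shell
dir=$(mktemp -d)
printf 'line one\nr\303\251pondre\n' > "$dir/good"   # stand-in for good
printf 'line ONE\nr\303\251pondre\n' > "$dir/bad"    # stand-in for bad
sed -n 2p "$dir/good" > "$dir/good.line"
sed -n 2p "$dir/bad" > "$dir/bad.line"
# cmp -s is silent and reports only through its exit status.
if cmp -s "$dir/good.line" "$dir/bad.line"; then same=yes; else same=no; fi
echo "line 2 identical: $same"
rm -rf "$dir"
```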

> Strangely, both snippet files look fine in vim.

Because you have chopped off the non-UTF-8 bytes which occur earlier in
bad and which fix vim's idea of the file's encoding.
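
One way to hunt for such bytes is to ask iconv(1) to convert each line
from UTF-8 to UTF-8 and note where it complains.  This is only a sketch:
the two-line file built here is a made-up stand-in, not your bad file,
and it assumes iconv exits non-zero on an invalid sequence.

```shell
f=$(mktemp)
printf 'caf\351 in ISO 8859-1\n' > "$f"       # line 1: lone e9, not valid UTF-8
printf 'r\303\251pondre in UTF-8\n' >> "$f"   # line 2: valid UTF-8
n=0
invalid=
while IFS= read -r line; do
    n=$((n + 1))
    # iconv fails when the line isn't well-formed UTF-8.
    if ! printf '%s\n' "$line" | iconv -f utf-8 -t utf-8 >/dev/null 2>&1; then
        invalid="$invalid $n"
    fi
done < "$f"
echo "lines that are not valid UTF-8:$invalid"
rm -f "$f"
```

Pointing the loop at the real file instead of "$f" would list the line
numbers worth looking at with od(1).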

> One additional fact which must be relevant although I don't know
> enough to say exactly how is that the status bar in vim looks like
> this when the good file is newly opened:
>
>    "good" 836 lines, 50844 bytes                    1,1           Top
>
> ...but for the bad file, that becomes
>
>    "bad" [converted] 336 lines, 49471 bytes         1,1           Top

Ta-da!

> The smaller number of lines is expected (that's the effect of my
> no-longer-wanted patch to mhfixmsg), but does that also explain the
> different number of bytes?

Hard to say, I don't have enough detail.

> More importantly, vim explicitly claims that the bad file is
> "[converted]", so maybe that's the source of the double encoding?

Quite.  :-)

-- 
Cheers, Ralph.
