Re: mhfixmsg character set conversion
From: Ralph Corderoy
Subject: Re: mhfixmsg character set conversion
Date: Wed, 09 Feb 2022 11:30:43 +0000
Hi Steven,
This has been fun. :-)
> In my case I don't have just the one sentence in a file by itself, but
> let's try grep (unpatched, and installed from grep 3.7-1 on Manjaro):
>
> $ grep ^Veuillez good | cut -c1-68
> Veuillez ne pas répondre au présent courriel. Il a été généré
>
> $ grep ^Veuillez bad | cut -c1-68
> Veuillez ne pas répondre au présent courriel. Il a été généré
>
> Really. I'm not making this up. :-/
No, I don't think you are. I think that line in both files is correctly
UTF-8 encoded.
> ...but if I open the incorrect output file in vim and go to line 108,
> I see this (pasted from an xterm in which vim was running):
>
> Veuillez ne pas répondre au présent courriel. Il a été généré
vim isn't the vi(1) I grew up with, and probably you too. It has much
more smarts with which to be ‘helpful’. Try ‘:se fileencoding?’ when
vim-ing good and again with bad. I expect the bad file has something
earlier on which fixes vim's idea of the encoding to ISO 8859-1 and so
the later UTF-8 encoded bytes get encoded a second time for display to
your UTF-8 locale.
Emails can have bytes which encode characters with a variety of
encodings. :-)
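A minimal sketch of what that second encoding does to the bytes. The sample word and the temporary file names are made up for illustration; only iconv(1) and od(1) are assumed to be installed.

```shell
# Hypothetical sample: "généré" as UTF-8, where each é is the bytes c3 a9.
tmp=$(mktemp -d)
printf 'g\303\251n\303\251r\303\251\n' > "$tmp/good"

# Misread those bytes as ISO 8859-1 and convert them to UTF-8.
# Each of c3 and a9 is taken as a character in its own right and
# becomes a two-byte sequence, so the text ends up encoded twice.
iconv -f iso-8859-1 -t utf-8 "$tmp/good" > "$tmp/bad"

good_hex=$(od -An -tx1 "$tmp/good" | tr -d ' \n')
bad_hex=$(od -An -tx1 "$tmp/bad" | tr -d ' \n')
echo "good: $good_hex"
echo "bad:  $bad_hex"
rm -r "$tmp"
```

In the second dump every c3 a9 pair should have turned into c3 83 c2 a9, which a UTF-8 terminal shows as ‘Ã©’.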
> But wait. It gets worse:
>
> $ grep -n ^Veuillez good | cut -c1-68
> 108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>
> $ grep -n ^Veuillez bad | cut -c1-68
> 108:Veuillez ne pas répondre au présent courriel. Il a été gén�
The ‘worse’ being that grep is showing the very same line 108 you're
seeing in vim? (The ‘�’ at the end is to be expected: the ‘108:’ prefix
uses up four of the 68 bytes, so cut now chops the line in the middle of
a two-byte ‘é’.)
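A rough check of that arithmetic, assuming this script is itself saved as UTF-8. It uses cut -b to make the byte counting explicit; that the cut -c in the transcript above also counted bytes is an assumption on my part.

```shell
# The sentence from line 108; 68 bytes when encoded as UTF-8.
line='Veuillez ne pas répondre au présent courriel. Il a été généré'
nbytes=$(printf '%s' "$line" | wc -c | tr -d ' ')
echo "bytes in line: $nbytes"

# Prefix it as grep -n would, keep the first 68 bytes, and dump the
# last two bytes of what cut prints (the final one is cut's newline).
tail_hex=$(printf '108:%s\n' "$line" | cut -b1-68 | tail -c 2 | od -An -tx1 | tr -d ' \n')
echo "last bytes: $tail_hex"
```

A trailing c3 with no a9 after it is half of an ‘é’, which is what the terminal renders as ‘�’.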
> Is my shell somehow getting involved?
>
> $ echo $SHELL
> /usr/bin/tcsh
I used to use tcsh(1) before Linux, e.g. AIX, and back then it didn't
interfere with a command's stdout, so let's assume not.
> > bad is double-encoded.
> >
> > $ iconv -f iso-8859-1 -t utf-8 good | cmp - bad
> > $
>
> I understand that, although I don't understand why that's happening.
Yes, I was just showing that is precisely the relationship between the
two.
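If that relationship holds for a whole file, the conversion can be run backwards to recover the original. A small sketch with made-up sample data and temporary file names; it only works when every character in the double-encoded file fits in ISO 8859-1, so treat it as an illustration rather than a general repair tool.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-ins for good and bad.
printf 'pr\303\251sent\n' > "$tmp/good"
iconv -f iso-8859-1 -t utf-8 "$tmp/good" > "$tmp/bad"

# Undo the extra layer: decode bad as UTF-8 and write it out as
# ISO 8859-1. The resulting bytes are the original UTF-8 again.
iconv -f utf-8 -t iso-8859-1 "$tmp/bad" > "$tmp/recovered"

if cmp -s "$tmp/good" "$tmp/recovered"; then
    status=same
else
    status=different
fi
echo "good vs recovered: $status"
rm -r "$tmp"
```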
> > head(1) and more(1) don't disguise that.
>
> They certainly shouldn't, but:
>
> $ head -108 bad | tail -1 | cut -c1-68
> Veuillez ne pas répondre au présent courriel. Il a été généré
Yep, looks fine. Line 108 is validly UTF-8 encoded.
> $ cp -p good good_snippet
> $ cp -p bad bad_snippet
> $ vi good_snippet bad_snippet
> # delete all but the relevant part of line 108
>
> $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
>
> ...but nothing appeared to happen, and I killed the command after
> waiting about a minute. (...and yes, I tried that in a bash subshell
> because I know that syntax won't work in tcsh).
I don't understand that. The -p sets up a loop to read a line from
good_snippet, do the substitution on it, and print the result, until
EOF. The -l strips off the linefeed on input and puts it back on the
output. The substitution in between turns every byte (bytes rather
than characters, thanks to LC_ALL=C) which isn't in the range space to
tilde into a string like ‘<42>’ giving its hex value. It should
complete ‘instantly’. Sitting forever suggests it was trying to read
stdin instead of good_snippet.
> However, just to muddy the waters even further, I fell back on od:
>
> $ od -t x1c good_snippet
> 0000000 56 65 75 69 6c 6c 65 7a 20 6e 65 20 70 61 73 20
> V e u i l l e z n e p a s
> 0000020 72 c3 a9 70 6f 6e 64 72 65 20 61 75 20 70 72 c3
> r 303 251 p o n d r e a u p r 303
> 0000040 a9 73 65 6e 74 20 63 6f 75 72 72 69 65 6c 2e 20
> 251 s e n t c o u r r i e l .
> 0000060 49 6c 20 61 20 c3 a9 74 c3 a9 20 67 c3 a9 6e c3
> I l a 303 251 t 303 251 g 303 251 n 303
> 0000100 a9 72 c3 a9 0a
> 251 r 303 251 \n
> 0000105
Nothing wrong with od(1). If you have hexdump(1) installed then
‘hexdump -C’ gives quite nice output.
> ...and both snippets are identical!
Well, those lines were identical to start with before snipping.
You could confirm this with
cmp <(sed -n 108p good) <(sed -n 108p bad)
> Strangely, both snippet files look fine in vim.
Because you have chopped off the non-UTF-8 bytes which occur earlier in
bad and which fix vim's idea of the file's encoding.
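One way to test whether a file contains such bytes is to ask iconv(1) to convert it from UTF-8 to UTF-8 and look at the exit status. A sketch with invented sample files; is_utf8 is a hypothetical helper name, not an existing command.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-in for bad: a lone ISO 8859-1 e-acute (octal 351)
# on line 1, followed by a validly UTF-8 encoded line.
printf 'caf\351\npr\303\251sent\n' > "$tmp/mixed"
# The same second line on its own, like a snippet cut from the file.
printf 'pr\303\251sent\n' > "$tmp/clean"

# Hypothetical helper: succeeds only if the whole file decodes as UTF-8.
is_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

if is_utf8 "$tmp/mixed"; then mixed=valid; else mixed=invalid; fi
if is_utf8 "$tmp/clean"; then clean=valid; else clean=invalid; fi
echo "mixed: $mixed"
echo "clean: $clean"
rm -r "$tmp"
```

Without the redirection of stderr, iconv normally reports where the first offending byte is, which helps in finding it.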
> One additional fact which must be relevant although I don't know
> enough to say exactly how is that the status bar in vim looks like
> this when the good file is newly opened:
>
> "good" 836 lines, 50844 bytes 1,1 Top
>
> ...but for the bad file, that becomes
>
> "bad" [converted] 336 lines, 49471 bytes 1,1 Top
Ta-da!
> The smaller number of lines is expected (that's the effect of my
> no-longer-wanted patch to mhfixmsg), but does that also explain the
> different number of bytes?
Hard to say, I don't have enough detail.
> More importantly, vim explicitly claims that the bad file is
> "[converted]", so maybe that's the source of the double encoding?
Quite. :-)
--
Cheers, Ralph.