Re: mhfixmsg character set conversion
From: Steven Winikoff
Subject: Re: mhfixmsg character set conversion
Date: Wed, 09 Feb 2022 00:43:44 -0500
>BTW, to begin a thread, please don't reply to an existing message
>on the list and change the subject
That makes sense, but (a) I wasn't trying to start a new thread, and
(b) I replied to an existing message without changing the subject.
I'll try to remember that for future reference, but I don't understand
why you mentioned it here and now.
>> ...but when I look at the files with command-line tools such as more or
>> head, *both* versions look correct.
>
>Have you patched more or head? ;-)
No, but that's a fair question. :-)
They're both unpatched, installed as part of util-linux 2.37.3-2 (for more)
and coreutils 9.0-2 (for head) on Manjaro Linux.
>Can you cut-and-paste commands and output from your terminal to show us
>the problem.
Of course.
>Otherwise we have to trust your competency, no offence intended,
None taken. It's a perfectly fair request.
>Here's my go.
>
>How I could be influencing programs.
>
> $ locale
> LANG=en_GB.utf8
> LC_CTYPE="en_GB.utf8"
> LC_NUMERIC="en_GB.utf8"
> LC_TIME="en_GB.utf8"
> LC_COLLATE="en_GB.utf8"
> LC_MONETARY="en_GB.utf8"
> LC_MESSAGES="en_GB.utf8"
> LC_PAPER="en_GB.utf8"
> LC_NAME="en_GB.utf8"
> LC_ADDRESS="en_GB.utf8"
> LC_TELEPHONE="en_GB.utf8"
> LC_MEASUREMENT="en_GB.utf8"
> LC_IDENTIFICATION="en_GB.utf8"
> LC_ALL=
> $
Mine is:
$ locale
LANG=en_CA.UTF-8
LC_CTYPE="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_PAPER="en_CA.UTF-8"
LC_NAME="en_CA.UTF-8"
LC_ADDRESS="en_CA.UTF-8"
LC_TELEPHONE="en_CA.UTF-8"
LC_MEASUREMENT="en_CA.UTF-8"
LC_IDENTIFICATION="en_CA.UTF-8"
LC_ALL=
>Test inputs.
>
> $ cat good
> Veuillez ne pas répondre au présent courriel. Il a été généré
> automatiquement, nous ne pourrons pas y donner suite.
> $ cat bad
> Veuillez ne pas répondre au présent courriel. Il a été généré
> automatiquement, nous ne pourrons pas y donner suite.
> $
In my case I don't have just the one sentence in a file by itself, but
let's try grep (unpatched, and installed from grep 3.7-1 on Manjaro):
$ grep ^Veuillez good | cut -c1-68
Veuillez ne pas répondre au présent courriel. Il a été généré
$ grep ^Veuillez bad | cut -c1-68
Veuillez ne pas répondre au présent courriel. Il a été généré
Really. I'm not making this up. :-/
...but if I open the incorrect output file in vim and go to line 108,
I see this (pasted from an xterm in which vim was running):
Veuillez ne pas répondre au présent courriel. Il a été généré
But wait. It gets worse:
$ grep -n ^Veuillez good | cut -c1-68
108:Veuillez ne pas répondre au présent courriel. Il a été gén�
$ grep -n ^Veuillez bad | cut -c1-68
108:Veuillez ne pas répondre au présent courriel. Il a été gén�
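One possible explanation for the trailing � (an assumption based on documented GNU coreutils behaviour, not something verified on this system): GNU cut's -c option currently selects bytes rather than characters. The line is 68 bytes long, so the four-byte "108:" prefix would push the cut-off into the middle of a two-byte é. The Python sketch below reproduces that effect; the sample line matches the od dump of the snippet, and cut_bytes is a made-up helper name.

```python
# Sketch only: emulates a byte-based "cut -c1-68" on a single line.
LINE = "Veuillez ne pas répondre au présent courriel. Il a été généré"

def cut_bytes(text: str, last: int) -> str:
    """Keep the first `last` bytes of the UTF-8 encoding (hypothetical helper)."""
    raw = text.encode("utf-8")[:last]
    # A multibyte character split by the cut is shown as U+FFFD.
    return raw.decode("utf-8", errors="replace")

plain = cut_bytes(LINE, 68)               # whole line fits in 68 bytes
numbered = cut_bytes("108:" + LINE, 68)   # prefix shifts the cut point
print(plain)
print(numbered)
```

If this is what is happening, the damage comes from cut and says nothing about the contents of either file.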
Is my shell somehow getting involved?
$ echo $SHELL
/usr/bin/tcsh
That's (also unpatched :-) tcsh 6.23.02-1 from Manjaro's tcsh package.
>bad is double-encoded.
>
> $ iconv -f iso-8859-1 -t utf-8 good | cmp - bad
> $
I understand that, although I don't understand why that's happening.
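For reference, the transformation that the iconv command applies can be sketched in Python. This only illustrates what "double-encoded" means at the byte level; it does not show where in the mail pipeline the second encoding step occurs, and double_encode is a name invented for this example.

```python
def double_encode(utf8_bytes: bytes) -> bytes:
    """Misread UTF-8 bytes as ISO-8859-1, then encode the result as UTF-8."""
    return utf8_bytes.decode("iso-8859-1").encode("utf-8")

good = "répondre\n".encode("utf-8")
bad = double_encode(good)

print(good.hex(" "))  # each é is the two bytes c3 a9
print(bad.hex(" "))   # each é has become the four bytes c3 83 c2 a9
```

Viewed as UTF-8, the four-byte form displays as "Ã©" in place of "é".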
>head(1) and more(1) don't disguise that.
They certainly shouldn't, but:
$ head -108 bad | tail -1 | cut -c1-68
Veuillez ne pas répondre au présent courriel. Il a été généré
If you tell me this shouldn't be happening, I'll agree 100%. But somehow
it is happening and I have no idea why.
>Show the hex values of non-ASCII bytes.
I can't do that on the whole file, so I did this:
$ cp -p good good_snippet
$ cp -p bad bad_snippet
$ vi good_snippet bad_snippet
# delete all but the relevant part of line 108
$ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
...but nothing appeared to happen, and I killed the command after waiting
about a minute. (...and yes, I tried that in a bash subshell because I know
that syntax won't work in tcsh).
My perl is a bit rusty, so I'm not sure exactly how this command works.
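As a reading aid, here is what the perl command is meant to do: -p loops over the input lines and prints each one, -l strips and restores the newline, and the substitution replaces every byte outside the printable ASCII range (space through tilde) with its hex value in angle brackets. A rough Python equivalent for one line, with show_non_ascii as a hypothetical name, might look like this. It does not explain why the perl command appeared to hang.

```python
import re

def show_non_ascii(raw_line: bytes) -> str:
    """Replace each byte outside ' '..'~' with <hh>, like the perl s///ge."""
    marked = re.sub(rb"[^ -~]", lambda m: b"<%02x>" % m.group()[0], raw_line)
    return marked.decode("ascii")

# Single and double encodings of the same word, for comparison.
single = "généré".encode("utf-8")
double = single.decode("iso-8859-1").encode("utf-8")
print(show_non_ascii(single))
print(show_non_ascii(double))
```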
However, just to muddy the waters even further, I fell back on od:
$ od -t x1c good_snippet
0000000  56  65  75  69  6c  6c  65  7a  20  6e  65  20  70  61  73  20
          V   e   u   i   l   l   e   z       n   e       p   a   s
0000020  72  c3  a9  70  6f  6e  64  72  65  20  61  75  20  70  72  c3
          r 303 251   p   o   n   d   r   e       a   u       p   r 303
0000040  a9  73  65  6e  74  20  63  6f  75  72  72  69  65  6c  2e  20
        251   s   e   n   t       c   o   u   r   r   i   e   l   .
0000060  49  6c  20  61  20  c3  a9  74  c3  a9  20  67  c3  a9  6e  c3
          I   l       a     303 251   t 303 251       g 303 251   n 303
0000100  a9  72  c3  a9  0a
        251   r 303 251  \n
0000105
$ od -t x1c bad_snippet
0000000  56  65  75  69  6c  6c  65  7a  20  6e  65  20  70  61  73  20
          V   e   u   i   l   l   e   z       n   e       p   a   s
0000020  72  c3  a9  70  6f  6e  64  72  65  20  61  75  20  70  72  c3
          r 303 251   p   o   n   d   r   e       a   u       p   r 303
0000040  a9  73  65  6e  74  20  63  6f  75  72  72  69  65  6c  2e  20
        251   s   e   n   t       c   o   u   r   r   i   e   l   .
0000060  49  6c  20  61  20  c3  a9  74  c3  a9  20  67  c3  a9  6e  c3
          I   l       a     303 251   t 303 251       g 303 251   n 303
0000100  a9  72  c3  a9  0a
        251   r 303 251  \n
0000105
...and both snippets are identical! Suddenly I understand even less than
I did when I started writing this reply. :-(
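Because a hex dump of a whole message is impractical, another way to test a file for double encoding is to try undoing it: decode as UTF-8, encode back to ISO-8859-1, and see whether the result is itself valid UTF-8. The sketch below is a heuristic under that assumption, not a definitive test (some legitimate text could pass it by accident), and looks_double_encoded is a hypothetical name.

```python
def looks_double_encoded(raw: bytes) -> bool:
    """Heuristic: True if undoing one ISO-8859-1 -> UTF-8 step yields valid UTF-8."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False              # not UTF-8 at all
    if text.isascii():
        return False              # pure ASCII cannot tell us anything
    try:
        text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False              # reversing the step does not give UTF-8
    return True

good = "généré".encode("utf-8")
bad = good.decode("iso-8859-1").encode("utf-8")
print(looks_double_encoded(good), looks_double_encoded(bad))
```

Applied line by line to the original files rather than to snippets saved from an editor, a check like this avoids any conversion the editor may perform when reading or writing.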
Strangely, both snippet files look fine in vim. But the original bad file
still looks bad in vim, and I'm at a loss for how to prove that except by
taking a screenshot, so I've done that and attached the result as a 34 KB
PDF file.
One additional fact which must be relevant although I don't know enough
to say exactly how is that the status bar in vim looks like this when
the good file is newly opened:
"good" 836 lines, 50844 bytes 1,1 Top
...but for the bad file, that becomes
"bad" [converted] 336 lines, 49471 bytes 1,1 Top
The smaller number of lines is expected (that's the effect of my
no-longer-wanted patch to mhfixmsg), but does that also explain the
different number of bytes?
More importantly, vim explicitly claims that the bad file is "[converted]",
so maybe that's the source of the double encoding?
The more I try to think about this, the more my head hurts. :-(
- Steven
--
___________________________________________________________________________
Steven Winikoff |
Montreal, QC, Canada | "Do not meddle in the affairs of dragons,
smw@smwonline.ca | for you are crunchy and good with ketchup."
http://smwonline.ca |
Attachment: bad.pdf