Re: mhfixmsg character set conversion


From: Steven Winikoff
Subject: Re: mhfixmsg character set conversion
Date: Fri, 11 Feb 2022 20:24:11 -0500

>I assume vim(1) will read up to a certain amount until it either makes up
>its mind or assumes the default.

That makes sense.


>Try this to remove the boring ASCII bytes and see what's left.
>
>    tr -d ' -~' <bad | env LC_ALL=C grep -n .

Done.  I've attached an 11 KB PDF file to show the results, but I can
describe them here as follows:

   - The first 39 output lines show the tab characters from the message
     headers.

   - Lines 94, 96, 98, 100, 102, 104, 108 and 110 all show accented
     characters, which appear out of context to be exactly what should
     appear in the message.  This is absolutely consistent with the file
     being properly encoded in UTF-8.

   - Lines 289, 291, 293, 295, 300, 304, 308 and 310 all show sequences
     of (nothing but) ‘�’ glyphs; in each case the number of these glyphs
     matches the number of valid characters in the lines 94-110 range.

   - For reference, lines in the original file are divided as follows:

        - Lines 1-83 are the message headers

        - Lines 85-110 are the text/plain portion, with

             Content-Transfer-Encoding: 8bit
             Content-Type: text/plain; charset="UTF-8"
             Mime-Version: 1.0

        - Lines 112-336 are the text/html portion, with

             Content-Transfer-Encoding: 8bit
             Content-Type: text/html; charset=iso-8859-1
             Mime-Version: 1.0

...so it seems that tr is reporting exactly what we'd expect to see.
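
For anyone following along without the original message, here is a
minimal sketch of the same check on a made-up two-line file.  The file
contents and the function name non_ascii_lines are my own illustration,
not the real message, and I've added LC_ALL=C to tr as well so that both
tools work on raw bytes (this assumes a reasonably recent grep, which
treats every byte as a valid character in the C locale):

```shell
# Hypothetical sample: the same word encoded two different ways.
sample=$(mktemp)
printf 'r\303\251pondre\n' >"$sample"    # e-acute as UTF-8: bytes c3 a9
printf 'r\351pondre\n' >>"$sample"       # e-acute as ISO-8859-1: byte e9

# Delete printable ASCII (space through tilde), then number every line
# that still has something left on it.
non_ascii_lines() {
    LC_ALL=C tr -d ' -~' <"$1" | LC_ALL=C grep -n .
}

# Dump as hex so the result doesn't depend on the terminal's charset.
non_ascii_lines "$sample" | od -An -tx1
```

In a UTF-8 terminal the first line would display as a proper accented
character and the second as a replacement glyph, which is the same
pattern as the two groups of lines described above.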


>https://en.wikipedia.org/wiki/Specials_(Unicode_block)#Replacement_character
>describes ‘�’ and it's being seen above because cut(1) is cutting bytes
>and the ‘108:’ at the start of the line has shifted the 68/69 cut-off
>point to part-way through the UTF-8 for a single code point AKA rune.

For me, this falls into the category of "things that are perfectly obvious,
but only after they've been explained".  Thank you for explaining it.
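
To convince myself, I sketched the effect on a tiny made-up line.  The
text and the cut-off of 5 bytes are arbitrary choices for illustration
(the real cut-off was at 68/69), and split_demo is just a name I picked:

```shell
# A grep -n style line: the "108:" prefix, then "été" in UTF-8.
# Each é is two bytes (c3 a9), and cut -b counts bytes, not characters.
split_demo() {
    printf '108:\303\251t\303\251\n' | LC_ALL=C cut -b 1-5
}

# Bytes 1-5 are "108:" plus the lone lead byte c3; its partner a9 is cut off.
split_demo | od -An -tx1
```

The orphaned c3 byte is not valid UTF-8 on its own, which is why a UTF-8
terminal shows the replacement character in its place.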


>Try
>
>    sh
>    LC_ALL=C; export LC_ALL
>    locale
>    perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

Done, and I just learned something interesting.  First, the output looks
like this:

   sh-5.1$ LC_ALL=C; export LC_ALL
   sh-5.1$ locale
   LANG=en_CA.UTF-8
   LC_CTYPE="C"
   LC_NUMERIC="C"
   LC_TIME="C"
   LC_COLLATE="C"
   LC_MONETARY="C"
   LC_MESSAGES="C"
   LC_PAPER="C"
   LC_NAME="C"
   LC_ADDRESS="C"
   LC_TELEPHONE="C"
   LC_MEASUREMENT="C"
   LC_IDENTIFICATION="C"
   LC_ALL=C
   sh-5.1$ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
   Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a <c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>

Second, the problem with the original command appearing to hang turns out
to be an interaction between bash and xterm's pasting mechanism(!).

I'm accustomed to pasting a command line by triple-clicking to select the
whole line, then middle-clicking to paste it.  That's how xterm has worked
since I first started using it <mumble> years ago.

...and it still works exactly this way, and the line gets pasted just as I
expect, in tcsh.

...but in bash, although the line gets pasted, the newline at the end of it
somehow doesn't.  When 

   LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

originally seemed to hang, in fact it was just waiting for me to press the
Enter key!  I still don't know why this is happening, but at least I'm
comforted by the fact that my bash binary isn't totally broken. :-/


>Beware that invoking bash(1) as ‘sh’ is not the same as running ‘bash’.

I did know that, but thank you for mentioning it just in case.


>Might not make a difference in this case, but in general it's better to
>run whichever is desired.

Right, but in this case sh was what was desired.  As I understand it,
when invoked that way bash behaves closer to a real Bourne shell than
when invoked as bash.


>> I propose to forget this particular clupea harengus of the crimson
>> variety unless you find it interesting in and of itself.
>
>It is odd.  And odd might affect other things, including to do with nmh.
>:-)

Odd indeed, but apparently only when used interactively with xterm, so nmh
is unlikely to be affected.

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      |
Montreal, QC, Canada | "The reward of a thing well
smw@smwonline.ca     |  done is to have done it."
http://smwonline.ca  |
                     |                   - Emerson

Attachment: tr_output.pdf
Description: tr_output.pdf

