nmh-workers
Re: mhfixmsg character set conversion


From: Steven Winikoff
Subject: Re: mhfixmsg character set conversion
Date: Wed, 09 Feb 2022 19:48:07 -0500

>> Really.  I'm not making this up. :-/
>
>No, I don't think you are.  I think that line in both files is correctly
>UTF-8 encoded.

And now that you've explained what's going on, it's clear that you're
right.


>vim isn't the vi(1) I grew up with, and probably you too.

Definitely.  The first time I used vi was in 1984, on a 68000-based Cadmus
system.


>Try ‘:se fileencoding?’ when vim-ing good and again with bad.

Good point:

   $ vim good
   :set fileencoding
   fileencoding=utf-8

   $ vim bad
   :set fileencoding
   fileencoding=latin1


>I expect the bad file has something earlier on which fixes vim's idea of
>the encoding to ISO 8859-1

That does seem to be the case.  Do you have any idea what kind of thing
that might be?  (I know you can't diagnose a file you haven't seen, but in
general, what sorts of things should I look for?)


>> But wait.  It gets worse:
>>
>>    $ grep -n ^Veuillez good | cut -c1-68
>>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>>
>>    $ grep -n ^Veuillez bad | cut -c1-68
>>    108:Veuillez ne pas répondre au présent courriel. Il a été gén�
>
>The worse being it is the very same line 108 you're seeing in vim which
>grep is also showing?

Exactly, because...


>(The ‘�’ at the end is to be expected.)

...this is still more evidence that you know more about character sets and
conversions than I do.  As if further evidence were needed at this point. :-/

Until now, I've only ever seen that glyph when a character doesn't exist in
the font being used -- but that can't be the case here because that same
character is shown correctly five times in the same line of output.

Why is it to be expected?
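
One plausible answer, assuming cut -c counts bytes rather than characters
here (GNU cut currently treats -c like -b): each é is two bytes in UTF-8
(c3 a9), so byte 68 of the grep output is only the first half of the sixth
é, and the terminal shows the orphaned byte as the replacement glyph.  A
minimal Python sketch of the same truncation follows.  The sample string is
rebuilt from the grep -n prefix and the good_snippet dump in this message,
and cut_bytes is a hypothetical helper name:

```python
# Line 108 as grep -n would print it, rebuilt from the output in this thread
# (assumption: the real line starts exactly like this).
line = "108:Veuillez ne pas répondre au présent courriel. Il a été généré"


def cut_bytes(text: str, count: int) -> str:
    """Keep the first `count` bytes of the UTF-8 encoding, as a byte-based cut would."""
    truncated = text.encode("utf-8")[:count]
    # errors="replace" turns an incomplete sequence into U+FFFD,
    # which is roughly what a terminal does with a stray lead byte.
    return truncated.decode("utf-8", errors="replace")


shown = cut_bytes(line, 68)
print(shown)
```

The five complete é characters before the cut survive because both of their
bytes fall inside the first 68; only the sixth is split.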


>>    $ LC_ALL=C perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
>> [...]
>
>I don't understand that.  The -p sets up a loop to read a line from
>good_snippet, do the substitution on it, and print the result, until
>EOF.  The -l strips off the linefeed on input and puts it back on the
>output.  The substitution in between changes all bytes, thanks to
>LC_ALL=C, which aren't space to tilde into a ‘<42>’ string representing
>their hex value.

Thank you for explaining that.
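
For comparison, the substitution described above can be sketched in Python.
This is only an illustration, not a replacement for the one-liner: working
on a bytes object plays the role that LC_ALL=C plays for perl, and
hex_escape is a hypothetical name.

```python
import re


def hex_escape(raw: bytes) -> str:
    """Replace every byte outside space..tilde with <xx>, its two-digit hex value."""
    # The pattern is a bytes regex, so it matches single bytes, not characters.
    escaped = re.sub(rb"[^ -~]", lambda m: b"<%02x>" % m.group()[0], raw)
    return escaped.decode("ascii")


print(hex_escape("Il a été".encode("utf-8")))
```

Unlike perl -l, this does not strip the trailing newline, so a line passed
in with its newline attached would end in <0a>.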

Just for fun, I tried the following in tcsh:

   $ setenv LC_ALL C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet
   Veuillez ne pas r<c3><a9>pondre au pr<c3><a9>sent courriel. Il a <c3><a9>t<c3><a9> g<c3><a9>n<c3><a9>r<c3><a9>

As expected, this returned pretty much instantly.  Then I tried this:

   $ sh
   $ LC_ALL=C
   $ echo $LC_ALL
   C
   $ perl -lpe 's/[^ -~]/sprintf "<%02x>", ord($&)/ge' good_snippet

...and that also hung.  Which in a way is good, because at least it means
bash is behaving consistently.  But also not good, because it's behaving
badly. :-/

On my system, /bin/sh is a symlink to /bin/bash, which is version 5.1.016-2
as packaged by Manjaro.

...but troubleshooting bash is far outside the scope of this discussion, so
I propose to forget this particular clupea harengus of the crimson variety
unless you find it interesting in and of itself.


>Nothing wrong with od(1).  If you have hexdump(1) installed then it with
>-C gives quite nice output.

Yes, I see (or -C? :-).  Thanks for that tip; I hadn't known that hexdump
existed.


>> ...and both snippets are identical!
>
>Well, those lines were identical to start with before snipping.
>You could confirm this with
>
>    cmp <(sed -n 108p good) <(sed -n 108p bad)

As written, this also hangs in bash (and is invalid syntax in tcsh).

But it's effectively equivalent to

   $ sed -n 108p good > good.sed
   $ sed -n 108p bad  >  bad.sed 
   $ cmp good.sed bad.sed
   $ echo $?
   0

...which behaves as expected.
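
The same byte-for-byte check can be done without temporary files or process
substitution.  A minimal Python sketch, assuming the files are small enough
to scan line by line; nth_line and same_line are hypothetical names:

```python
from itertools import islice


def nth_line(path, n):
    """Return line n (1-based) of the file as raw bytes, or None if it is missing."""
    with open(path, "rb") as fh:
        return next(islice(fh, n - 1, n), None)


def same_line(path_a, path_b, n):
    """True when line n is byte-identical in both files (or absent from both)."""
    return nth_line(path_a, n) == nth_line(path_b, n)
```

Calling same_line("good", "bad", 108) would correspond to the cmp of the two
sed outputs above.  Reading in binary mode matters, since any decoding step
would reintroduce the encoding guesswork this test is meant to avoid.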


>> Strangely, both snippet files look fine in vim.
>
>Because you have chopped off the non-UTF-8 which occurs earlier in bad
>which fixes vim's idea of the file's encoding.

In retrospect this should have been obvious. :-/
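
A direct way to find whatever precedes line 108 is to look for the first
line that does not decode as UTF-8.  The sketch below assumes vim is using a
fallback list along the lines of fileencodings=ucs-bom,utf-8,default,latin1,
in which case a single invalid sequence anywhere in the file is enough to
make it settle on latin1.  first_invalid_utf8 is a hypothetical helper:

```python
def first_invalid_utf8(path):
    """Return (line number, byte offset in line, offending bytes), or None if all valid."""
    with open(path, "rb") as fh:
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start and err.end delimit the bytes the decoder rejected.
                return lineno, err.start, raw[err.start:err.end]
    return None
```

Run against the bad file, this should point at the line to inspect with
hexdump -C; a lone byte such as e9 there would suggest text that was already
ISO 8859-1 when it was pasted in.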


>> ...but for the bad file, that becomes
>>
>>    "bad" [converted] 336 lines, 49471 bytes         1,1           Top
>
>Ta-da!

Indeed. :-)

Thank you.

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      |
Montreal, QC, Canada |             Eschew obfuscation.
smw@smwonline.ca     |
http://smwonline.ca  |


