nmh-workers

Re: mhfixmsg character set conversion


From: Steven Winikoff
Subject: Re: mhfixmsg character set conversion
Date: Fri, 04 Feb 2022 20:33:09 -0500

>As Robert and Ken pointed out, one explanation could be that the
>content is converted twice, the second time incorrectly.

I saw those replies, but I wasn't sure how to interpret them (as in, the
evidence is compelling, but I have no idea why that would be happening or
what to do about it).
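
For reference, here is a minimal sketch of what a double conversion does to
a single accented character. It is only an illustration, not how mhfixmsg
converts text: latin1_to_utf8() is a hypothetical hand-rolled helper, and it
assumes the input really is ISO-8859-1 on the first pass.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of nmh): convert a NUL-terminated
 * ISO-8859-1 string to UTF-8.  Returns the number of bytes written
 * (excluding the terminating NUL), or (size_t)-1 if `out` is too small.
 */
size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(outsize > 0);

    for (const unsigned char *p = (const unsigned char *)in; *p != '\0'; p++) {
        if (*p < 0x80) {
            /* ASCII maps to itself. */
            if (n + 1 >= outsize) return (size_t)-1;
            out[n++] = (char)*p;
        } else {
            /* U+0080..U+00FF become a two-byte UTF-8 sequence. */
            if (n + 2 >= outsize) return (size_t)-1;
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

Feeding the Latin-1 byte 0xE9 (e-acute) through this once gives the correct
UTF-8 bytes C3 A9. Feeding that result through a second time, as if it were
still Latin-1, gives C3 83 C2 A9, which displays as two wrong characters.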


>I don't see at this point how mhfixmsg could do that but this needs more
>investigation.  We can continue this way, or if you want to send me a
>sanitized excerpt of the message, I'd be glad to work with it.

I can't think of a reasonable way to sanitize it, but I'm willing to send
it to you privately.  Should I use your <levinedl@acm.org> address for this
purpose?


>> $ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 -reformat \
>>            -fixcte -fixboundary -noreplacetextplain \
>>            -fixtype application/octet-stream -verbose -file - \
>>            -outfile $destination < $source
>> mhfixmsg: /home/smw/Mail/mhfixmsgnss3pI part 2, decode text/plain; charset=iso-8859-1
>> mhfixmsg: /home/smw/Mail/mhfixmsgnss3pI part 1, decode text/html; charset=iso-8859-1
>> mhfixmsg: /home/smw/Mail/mhfixmsgnss3pI part 2, convert UTF-8 to UTF-8
>>
>> ...which is interesting for more than one reason, including that there's
>> apparently no conversion of iso-8859-1 to UTF-8,
>
>That's strange, unless $source had already been run through mhfixmsg.

It hadn't.  In normal use my procmail-invoked shell script does run the
message through a program I wrote myself, which decodes RFC 2047-encoded
headers -- but that only affects the headers, and passes the body through
unmodified; the relevant excerpt for that is:

   [loop that processes header lines elided]

         /**  an empty input line means the end of the message headers:  **/

         if (strlen(input_line) < 1) break;
      }


      /**  read and write message body:  **/

      while (getline(&input_line, &len, infile) >= 0)
      {
         fputs(input_line, outfile);
      }


      /**  ...and we're done:  **/

      return(0);

   }


The only change this produces in the problematic message is as follows:

   47,57c47,57
   < X-SG-EID:  =?us-ascii?Q?CePduXinO1TKWf=2FmbcRcIcb5o7KEfW6Q=2FLxIZrPrRA0dtxQ5evb2UIV0M0r6v6?=
   <  =?us-ascii?Q?DfqG=2FoldGlAr6l6p1riD1OEyVdX0=2F57dKo740dz?=
   <  =?us-ascii?Q?NZIhwlTw5J3KSyIU4H7pjfyfMBv0e9LGxKHVezS?=
   <  =?us-ascii?Q?FeSLaVJyOzyyK3LeB3eGx+QysKjtjkJzuVDXsW4?=
   <  =?us-ascii?Q?ZiePczPvW34XaHeheXAl2m0RGMRgZENpvRzzX2M?=
   <  =?us-ascii?Q?G6=2FuEHfZ5+X57rF1w=3D?=
   < X-SG-ID:  =?us-ascii?Q?N2C25iY2uzGMFz6rgvQsb8raWjw0ZPf1VmjsCkspi=2FKHgAsE=2FCUk5eZaRe5Ltr?=
   <  =?us-ascii?Q?cbw5EBe1xYnaBlEvYrWq76guWX6eVcLnBjZLZsv?=
   <  =?us-ascii?Q?fUgud7M9swcG4+O7RGb81dd6HibI6WdUCRYi2bx?=
   <  =?us-ascii?Q?T8y2GlCc1B+71TSgKjD9dEU2IqN30RZ1qRbAGlx?=
   <  =?us-ascii?Q?5EAyl462xuJc+?=
   ---
   > X-SG-EID:  CePduXinO1TKWf/mbcRcIcb5o7KEfW6Q/LxIZrPrRA0dtxQ5evb2UIV0M0r6v6
   >  DfqG/oldGlAr6l6p1riD1OEyVdX0/57dKo740dz
   >  NZIhwlTw5J3KSyIU4H7pjfyfMBv0e9LGxKHVezS
   >  FeSLaVJyOzyyK3LeB3eGx+QysKjtjkJzuVDXsW4
   >  ZiePczPvW34XaHeheXAl2m0RGMRgZENpvRzzX2M
   >  G6/uEHfZ5+X57rF1w=
   > X-SG-ID:  N2C25iY2uzGMFz6rgvQsb8raWjw0ZPf1VmjsCkspi/KHgAsE/CUk5eZaRe5Ltr
   >  cbw5EBe1xYnaBlEvYrWq76guWX6eVcLnBjZLZsv
   >  fUgud7M9swcG4+O7RGb81dd6HibI6WdUCRYi2bx
   >  T8y2GlCc1B+71TSgKjD9dEU2IqN30RZ1qRbAGlx
   >  5EAyl462xuJc+

...but in my testing last night and just now, I see the same behavior
when I run mhfixmsg directly on the unmodified original file (my script
always saves an unmodified copy when it makes changes, in case something
goes wrong).
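
The header change in the diff above is ordinary RFC 2047 "Q" decoding
(=2F becomes "/", =3D becomes "="). The following is a rough sketch of that
step for a single encoded word. It is not my actual program:
decode_q_word() and hex_value() are names made up for this illustration,
and the sketch ignores the charset label and does not handle "B" (base64)
encoded words.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode one encoded word of the form
 * "=?charset?Q?text?=" into `out`.  Returns 0 on success, or -1 if the
 * word is not a well-formed Q-encoded word or `out` is too small.
 */
int decode_q_word(const char *word, char *out, size_t outsize)
{
    assert(word != NULL && out != NULL);

    size_t len = strlen(word);
    if (outsize == 0 || len < 8) return -1;
    if (strncmp(word, "=?", 2) != 0) return -1;
    if (strcmp(word + len - 2, "?=") != 0) return -1;

    /* The first '?' after the leading "=?" ends the charset name. */
    const char *q = strchr(word + 2, '?');
    if (q == NULL || (q[1] != 'Q' && q[1] != 'q') || q[2] != '?') return -1;

    const char *p = q + 3;             /* start of the encoded text */
    const char *end = word + len - 2;  /* start of the trailing "?=" */
    if (p > end) return -1;

    size_t n = 0;
    while (p < end) {
        unsigned char c = (unsigned char)*p;
        if (c == '=') {
            /* "=XX" is one byte written as two hex digits. */
            if (end - p < 3) return -1;
            int hi = hex_value((unsigned char)p[1]);
            int lo = hex_value((unsigned char)p[2]);
            if (hi < 0 || lo < 0) return -1;
            c = (unsigned char)(hi * 16 + lo);
            p += 3;
        } else {
            if (c == '_') c = ' ';     /* underscore encodes a space */
            p++;
        }
        if (n + 1 >= outsize) return -1;
        out[n++] = (char)c;
    }
    out[n] = '\0';
    return 0;
}
```

Given "=?us-ascii?Q?G6=2FuEHfZ5+X57rF1w=3D?=" this produces
"G6/uEHfZ5+X57rF1w=", which matches the corresponding line in the diff.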


>Conversion to the same charset is a no-op, I'll look into removing the
>verbose output in that case.

That's probably a helpful thing to do, but the question I was wondering
about wasn't why the UTF-8-to-UTF-8 conversion was reported, but rather why
the iso-8859-1-to-UTF-8 conversion wasn't reported.
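
One check that might help narrow this down is whether the decoded bytes of
a part are well-formed UTF-8 at a given stage. The sketch below is only a
diagnostic idea, not anything mhfixmsg does, and is_valid_utf8() is a
hypothetical name. Note its limit: text that has been converted twice is
still well-formed UTF-8, so this can only distinguish raw Latin-1 bytes from
UTF-8, not correct UTF-8 from doubly converted UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise.  Rejects truncated sequences, stray continuation bytes,
 * overlong forms, surrogates, and code points above U+10FFFF.
 */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t i = 0;
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* code point being assembled */
        unsigned long min;     /* smallest code point for this length */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;         /* stray continuation or invalid lead byte */

        if (len - i <= extra) return 0;   /* sequence runs past the end */

        for (size_t k = 1; k <= extra; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF) return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;

        i += extra + 1;
    }
    return 1;
}
```

A body containing the lone Latin-1 byte 0xE9 fails this check, while the
same text encoded as C3 A9 passes.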


>> and that in fact it's part 1 rather than part 2 that gets converted
>> improperly
>
>The part numbers are reversed because that's the order used for display.
>Part 2 is the text/plain part, that's the one that got converted.

Thank you.  That clears up part of my confusion.

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      | "The thing is, I mean, there's times when
Montreal, QC, Canada |  you look at the universe and you think,
smw@smwonline.ca     |  'What about me?' and you can just hear
http://smwonline.ca  |  the universe replying, 'Well, what about
                     |  you?'"
                     |         - Terry Pratchett (Thief of Time)


