Re: mhfixmsg character set conversion

nmh-workers

From:	Steven Winikoff
Subject:	Re: mhfixmsg character set conversion
Date:	Sat, 12 Feb 2022 01:48:36 -0500

>I would do this if you haven't already:
>1. download nmh HEAD, build, and install somewhere
>2. move your $(mhpath +)/mhn.defaults
>3. move your profile and create one with just a Path: entry
>4. run the "mhfixmsg -file original_copy -out -" from 1. and see if the
>   output looks good or bad

I just tried this, and a couple of other things, but only after installing
par 1.53.0 from source and using that to replace the AUR binary.  Here's
what I learned:

   1) Replacing par does indeed fix one of the three failed tests.  I can
      send you the details, but I seem to recall that you already have them
      from Valdis Klētnieks; please let me know if I should forward them
      anyway.

   2) After running make install, the newly built mhfixmsg produces correct
      output.  But so does nmh-1.7.1 mhfixmsg when compiled without my patch.

   3) Step (3) above was the key, and it turned out that I was being misled
      by this .mh_profile entry:

         mhshow-show-text/html:  html_to_text %F | cat -

      ...where html_to_text is a shell script that basically just runs this
      command:

         elinks -force-html -dump -dump-charset utf-8 ${html}

      Removing this profile entry causes the message to be displayed
      correctly -- both the original, unmodified version, and the one that
      was saved after being converted by my patched version of nmh-1.7.1
      mhfixmsg.  That's pretty conclusive evidence that I'd been looking
      in the wrong place all along. :-(

      The man page for elinks describes -dump-charset as follows:

         -dump-charset (alias for document.dump.codepage)
             Codepage used when formatting dump output.

      Interestingly, when I restored the mhshow-show-text/html .mh_profile
      entry and modified my shell script to run elinks without this option,
      I still saw the same doubly encoded output.

      So next I tried passing the character set to my script as follows:

         mhshow-show-text/html:  html_to_text %{charset} %F

      ...and changed the script to use the provided character set rather
      than forcing utf-8:

         elinks -force-html -dump -dump-charset $1 ${html}

      This failed differently: instead of rendering the message with '�'
      marking undisplayable characters, it used '*'.  Somehow, I don't
      consider that to be much of an improvement. :-/
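
For readers unfamiliar with the term, "doubly encoded" output typically arises when bytes that are already UTF-8 are treated as a single-byte character set and converted to UTF-8 a second time. The following is a minimal sketch of that effect only; it assumes iconv and od are installed, the function names double_encode and hex_bytes are made up for illustration, and it does not establish that this is what happened inside the elinks pipeline described above.

```shell
#!/bin/sh
# Sketch: reproduce "double encoding" with iconv (assumed to be installed).

# double_encode reads UTF-8 on stdin but deliberately labels it as
# ISO-8859-1 before converting to UTF-8; that mislabeling is the mistake
# being illustrated.  (Hypothetical helper name.)
double_encode() {
    iconv -f ISO-8859-1 -t UTF-8
}

# hex_bytes prints stdin as one contiguous string of lowercase hex digits.
# (Hypothetical helper name.)
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# "e with acute accent" is the two bytes C3 A9 in UTF-8, written here as
# octal escapes so the script itself stays pure ASCII.
printf '\303\251' | hex_bytes; echo                  # c3a9
printf '\303\251' | double_encode | hex_bytes; echo  # c383c2a9
```

Comparing a hex dump of the stored message part with a hex dump of the rendered output in this way can help show which stage of a pipeline introduced the extra layer of encoding.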

...so clearly I need to replace elinks in my html_to_text script, and doing
that will solve the problem that prompted this discussion, leaving the
following questions:

   1) What's the best replacement for elinks?

   2) Should I replace my 1.7.1 installation with the version I just built?
      Basically I'm asking what benefits the current snapshot has over
      1.7.1, and how far away the next numbered release might be.

   3) How can I guarantee that messages will be saved with quoted-printable
      or base64 parts decoded, without patching mhfixmsg to deal with
      messages in which the decoded text would contain lines longer than
      998 characters?

      I used the current mhfixmsg with the test message I've been using
      throughout this discussion, with this command line:

         /tmp/nmh/root/bin/mhfixmsg \
             -decodeheaderfieldbodies utf-8 -decodetext binary \
             -decodetypes text -textcharset UTF-8 -reformat \
             -fixcte -fixboundary -noreplacetextplain  \
             -fixtype application/octet-stream \
             -verbose -file $source -outfile $destination

      ...and that resulted in these headers after decoding:

         - for the text/plain part:

              Content-Transfer-Encoding: 8bit
              Content-Type: text/plain; charset="UTF-8"

         - for the text/html part:

              Content-Transfer-Encoding: binary
              Content-Type: text/html; charset=iso-8859-1

      That raises some further questions:

         - Why wasn't the text/html part converted to utf-8?

         - Regardless of the answer to the previous question, after a
           message has been refiled (and assuming I'm not planning to
           resend it to anyone), is there a practical difference between
           binary and 8bit encoding?

         - Why are the headers of the decoded message identical to those
           of the input, despite the use of -decodeheaderfieldbodies?

           (...and yes, the unmodified version of the message does contain
            some encoded headers that my decode_headers program found and
            decoded; mhfixmsg appears not to have done so).
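
Two of the observations above (headers that still look encoded, and the 998-character line limit) can be checked mechanically on the input and output files. This is only a rough sketch under stated assumptions: the function names are invented for illustration, the message file is assumed to use LF line endings, and the pattern matches the general RFC 2047 encoded-word shape (=?charset?B?...?= or =?charset?Q?...?=) without validating it. It does not use or depend on any nmh command.

```shell
#!/bin/sh
# Sketch: inspect a message file for still-encoded headers and long lines.
# All function names here are hypothetical helpers, not part of nmh.

# headers_of prints the header section: everything up to and including
# the first empty line (assumes LF line endings).
headers_of() {
    sed '/^$/q' "$1"
}

# count_encoded_words prints the number of header lines that still contain
# something shaped like an RFC 2047 encoded word.
count_encoded_words() {
    headers_of "$1" | grep -c -E '=\?[^?]+\?[BbQq]\?[^?]*\?=' || true
}

# count_long_lines prints how many lines in the whole file are longer than
# 998 bytes, not counting the newline.  LC_ALL=C makes awk count bytes
# rather than multibyte characters.
count_long_lines() {
    LC_ALL=C awk 'length($0) > 998 { n++ } END { print n + 0 }' "$1"
}

# Example use, with $source and $destination as in the command line above:
#   count_encoded_words "$source"; count_encoded_words "$destination"
#   count_long_lines "$destination"
```

If the two encoded-word counts are equal and nonzero, that is consistent with the headers not having been decoded; a nonzero long-line count in the output would be relevant to the 8bit versus binary question.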

   Thanks,

     - Steven
-- 
___________________________________________________________________________
Steven Winikoff      | "'Somebody, SOMEBODY
Montreal, QC, Canada | Has to, you see.'
smw@smwonline.ca     | Then she picked out two Somebodies.
http://smwonline.ca  | Sally and me."
                     |                        - Dr. Seuss
