Re: [nmh-workers] nmh 1.7.1: both bcc and dcc broken for mts sendmail/pi

From: Paul Fox
Subject: Re: [nmh-workers] nmh 1.7.1: both bcc and dcc broken for mts sendmail/pipe
Date: Fri, 15 Feb 2019 09:19:50 -0500

ken wrote:
 > >The «"» around `Blind-Carbon-Copy' should be \(lq and \(rq, or the
 > >equivalent strings for consistency with the style used at start of the
 > >paragraph.
 > So, in a mostly unrelated note ... I couldn't help noticing that Ralph
 > used guillemets («») in one of his messages on this thread (way to push
 > non-US-ASCII characters, Ralph!), and after a series of replies to his note
 > things devolved into classic mojibake.  And since hopefully most everyone
 > on this thread is an nmh user, I wanted to understand why, because really
 > that shouldn't have happened.

Mea Culpa.  I haven't fully worked through the bug or the fix, but
rest assured, the problem isn't with nmh.

My replies and forwarded message drafts are constructed by a script
that predates replyfilter.  It does things like add attribution ("ken
wrote:"), my .sig, and the bulk of the body with the " > " indents.
It includes the original headers if forwarding, but not when replying, 
and also adjusts the current headers based on what folder I'm in, for
things like Reply-to: and Fcc:.

I haven't done full debugging yet, but looking quickly I see that the
body content is created by:
            mhshow -form mhl.null -type text/plain -file $original_text  |
                utf_clean |

where $original_text is the path to the message being replied to.

The function remove_part_markers_and_quote() runs sed to get rid of
the "part markers" that mhshow emits:
        # delete part markers entirely if they're the whole line,
        # otherwise just remove that part of the line.
        # and because we're already running sed, add the leading ' > '
        sed -e '/address@hidden(\[ part .* \]\)@\*\]$/d' \
            -e 's/address@hidden(\[ part .* \]\)@\*\]//' \
            -e 's/^/ > /'

But utf_clean() is the culprit, I believe -- it's there to remove a
few really annoying binary characters that my fonts don't display
correctly.  But it does so with a fairly large and indiscriminate
hammer, completely ignoring the current encoding.
        #eliminate utf hard non-printing space:  <U+200B> or \u200B
        #also eliminate A0, which is non-breaking space in iso-8859
        sed -e 's/\xe2\x80\x8e/ /g' \
            -e 's/\xe2\x80\x8b//g' \
            -e 's/\xa0/ /g' \
            -e 's/\xc2/ /g'
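
To make the failure concrete, here is a minimal sketch in Python rather
than sed.  It is not the actual script: the function names are made up
for illustration, and it assumes sed matches raw bytes, which is
consistent with the bytes Ken found in the archive.  Note that the first
pattern is the UTF-8 encoding of U+200E and the second that of U+200B.
The last substitution is the damaging one here, because 0xC2 is the lead
byte of every UTF-8 character from U+0080 to U+00BF, guillemets included.

```python
# Illustrative stand-in for the sed-based utf_clean(); names are hypothetical.

def utf_clean_bytes(data: bytes) -> bytes:
    """Apply the same four substitutions as the sed script, byte by byte."""
    return (
        data.replace(b"\xe2\x80\x8e", b" ")  # U+200E encoded as UTF-8
            .replace(b"\xe2\x80\x8b", b"")   # U+200B encoded as UTF-8
            .replace(b"\xa0", b" ")          # NBSP in ISO-8859-1, a continuation byte in UTF-8
            .replace(b"\xc2", b" ")          # lead byte of U+0080..U+00BF in UTF-8
    )


def utf_clean_text(data: bytes, charset: str) -> bytes:
    """Assumed encoding-aware variant: decode first, then edit code points."""
    text = data.decode(charset)
    text = text.replace("\u200e", " ").replace("\u200b", "").replace("\u00a0", " ")
    return text.encode("utf-8")


# The quoted line with guillemets, as UTF-8: b'The \xc2\xab"\xc2\xbb around'
quoted = "The \u00ab\"\u00bb around".encode("utf-8")

damaged = utf_clean_bytes(quoted)
print(damaged)                                    # lone 0xAB and 0xBB remain
print(damaged.decode("utf-8", errors="replace"))  # each becomes U+FFFD
print(utf_clean_text(quoted, "utf-8").decode("utf-8"))
```

The byte-level version leaves a space where each 0xC2 was, followed by a
bare 0xAB or 0xBB, which is the extra space and invalid UTF-8 described
in the quoted text below.  Decoding with the message's declared charset
first avoids that, at the cost of having to know the charset.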

I'll work on this, and also take a look at replyfilter to see if
I can't get it to do more of the heavy lifting.


 > I went back to the raw archives (ftp://lists.gnu.org/nmh-workers/2019-02)
 > because the mailing list software will sometimes translate stuff into
 > base64 encoding when it sees non-ASCII characters.  And, well, I hate to
 > assign blame, but I think it's a bit unavoidable ... please, don't anyone
 > take this as a personal attack, I am just trying to understand how we
 > could do better.
 > Ralph's original note containing the guillemets (Message-Id
 > <address@hidden>) was text/plain, a
 > character set of utf-8, and encoded using quoted-printable.  The
 > characters were encoded properly using quoted-printable, specifically
 > they were listed as =C2=AB and =C2=BB.
 > Valdis was the first reply to that (Message-ID
 > <address@hidden>), and HIS email was text/plain,
 > character set iso-8859-1, and encoded using quoted-printable.  He quoted
 > Ralph's message, and the guillemets were encoded as =AB and =BB.  Which seems
 > correct to me.
 > Paul Fox replied to Valdis's note (Message-Id
 > <address@hidden>), and THAT note
 > was text/plain, character set UTF-8, encoded using quoted-printable ...
 > but it seems like this was the start of where things went off the rails.
 > The original line in Valdis's email was (in raw form):
 >    > The =AB=22=BB around ...
 > But in Paul's note it ended up as (extra > added in the reply)
 >    > > The  =AB" =BB around 
 > This is NOT correct.  First, there is an extra space in front of
 > the encoded bytes.  Secondly, they're not valid UTF-8; they're the
 > ISO-8859-1 bytes.  So I am guessing whatever Paul used to quote the reply
 > didn't translate the ISO-8859-1 characters properly into UTF-8.
 > However, whatever Mark Bergman uses for email actually made an intelligent
 > decision.  When he replied to Paul's note, those invalid UTF-8 characters
 > got converted to the Unicode Replacement Character (U+FFFD), which was
 > sent out as =EF=BF=BD (utf-8, quoted-printable).
 > Further muddying the waters ... when Ralph replied to Mark's email,
 > those Unicode Replacement Characters somehow got converted back to
 > the correct guillemets (=C2=AB and =C2=BB).  Which means Ralph has
 > perhaps the most intelligent reply quoting program ever and he should
 > immediately share it as it would revolutionize AI, or he went back and
 > manually corrected it when he replied to Mark's note.  I'm 50/50 on
 > which one of those scenarios is more likely.
 > If anyone involved with this email thread wants to pipe up with some
 > more explanation on what exactly they used to compose their email
 > replies, I would love to hear it.  No judgements; I just want to know
 > how nmh could help everyone do better.  Like, do we need to include
 > better tools for composing reply messages?  Well, duh, the answer to
 > that is "yes", and I think replyfilter does ok here but obviously we
 > need to do better.  But if we're SENDING something that is not valid
 > UTF-8, should we be smarter and flag it?  People were upset when we
 > refused to send out 8-bit characters when your locale was US-ASCII (I
 > mean, REALLY?  I couldn't believe it), so I don't know what makes sense.
 > Sending out invalid UTF-8 just seems wrong to me.
 > --Ken
 > -- 
 > nmh-workers
 > https://lists.nongnu.org/mailman/listinfo/nmh-workers
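
The sequence Ken describes can be replayed with a short Python sketch.
This is only an illustration of the idea of checking a body against its
declared charset before sending; it is not something nmh does, and
decode_part() and is_valid() are hypothetical helper names.  It uses the
standard quopri module to undo the quoted-printable encoding.

```python
import quopri


def decode_part(qp_body: bytes, charset: str) -> str:
    """Undo quoted-printable, then decode strictly with the declared charset."""
    raw = quopri.decodestring(qp_body)
    return raw.decode(charset)  # raises UnicodeDecodeError on invalid bytes


def is_valid(qp_body: bytes, charset: str) -> bool:
    """Return True if the body really is text in the charset it claims."""
    try:
        decode_part(qp_body, charset)
        return True
    except UnicodeDecodeError:
        return False


# Ralph: utf-8, guillemets sent as =C2=AB and =C2=BB
print(decode_part(b"=C2=AB=22=C2=BB", "utf-8"))
# Valdis: iso-8859-1, guillemets sent as =AB and =BB
print(decode_part(b"=AB=22=BB", "iso-8859-1"))
# My reply: labelled UTF-8 but carrying the ISO-8859-1 bytes
print(is_valid(b'The  =AB" =BB around', "utf-8"))
# Mark: the invalid bytes replaced by U+FFFD, sent as =EF=BF=BD
print(decode_part(b"=EF=BF=BD", "utf-8") == "\ufffd")
```

A check like is_valid() would have flagged my reply before it went out,
while the other three messages pass.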

paul fox, address@hidden (arlington, ma, where it's 33.6 degrees)