From: Valdis Klētnieks
Subject: Re: [nmh-workers] detecting enclosed msg as spam - unicode regex help needed, I think; spam unicode chars in header
Date: Sat, 04 May 2019 13:01:05 -0400

On Sat, 04 May 2019 06:16:27 -0500, address@hidden said:
> This is OT/not a nmh issue.  It does concern spam unicode chars in the mail
> header though, so maybe you could direct me a bit.

> I use procmail, so I should be able to filter out msgs like the above,
> but I could use some tips on a general strategy.

Note that the presence of =?utf-8? in the headers is *not* always proof of
spam (see headers of this message), so be prepared to deal with false positives
appropriately (but see below).

Also, note that while procmail does support onboard regular expressions, they're
not a full PCRE set of expressions.  So, for instance, you can't look for utf-8 
of more than a certain length by searching for 
nor can you check for more than 10 occurrences via '(=?utf-8?.*){10}'.
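If you do want counts and lengths, one option is to do that part outside procmail's regex engine. The following is only a minimal sketch, assuming a small Python helper; the names utf8_word_stats and looks_suspicious and both thresholds are made up for illustration and are not tuned against real mail.

```python
import re

# RFC 2047 encoded-word: =?charset?encoding?payload?=
# Only utf-8 with B or Q encoding is matched here, case-insensitively.
ENCODED_WORD = re.compile(r"=\?utf-8\?[bq]\?([^?\s]*)\?=", re.IGNORECASE)

def utf8_word_stats(header_value):
    """Return (count, longest_payload_length) for utf-8 encoded-words."""
    payloads = ENCODED_WORD.findall(header_value)
    longest = max((len(p) for p in payloads), default=0)
    return len(payloads), longest

def looks_suspicious(header_value, max_words=10, max_payload=40):
    """Hypothetical heuristic: too many encoded-words, or one very long one.

    The default thresholds are arbitrary illustration values.
    """
    count, longest = utf8_word_stats(header_value)
    return count > max_words or longest > max_payload
```

A script built on this could be hooked into procmail through a `* ?` exit-code condition, bearing in mind the false-positive caveat above.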

You're probably better served by installing SpamAssassin and calling that from
procmail (as it will help with things other than utf-8 as well).

There are 90 presumed-spam messages in my spam folder at the moment.  Of those,
12 have one bodypart and specify charset=utf-8 in the rfc822 headers, while 44
specify multipart and thus the charset=, if any, is buried in the body.  10 have
raw utf-8 in the From: line, and 17 have raw utf-8 in the Subject: line (but see
below).
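A tally like that can be reproduced with a short script. This is a rough sketch, not the method used for the numbers above; the function names and label strings are invented for illustration, and each message is assumed to be available as raw bytes (for example, one file per message in an MH folder).

```python
import email
import re
from collections import Counter

def raw_8bit_headers(raw_bytes, names=(b"from", b"subject")):
    """Return the names of the listed headers that contain raw 8-bit bytes."""
    head = raw_bytes.replace(b"\r\n", b"\n").split(b"\n\n", 1)[0]
    head = re.sub(rb"\n[ \t]+", b" ", head)  # unfold continuation lines
    hits = set()
    for line in head.split(b"\n"):
        name, sep, value = line.partition(b":")
        key = name.strip().lower()
        if sep and key in names and any(b >= 0x80 for b in value):
            hits.add(key.decode("ascii"))
    return hits

def classify(raw_bytes):
    """Label one message; the labels are illustrative, not a standard."""
    msg = email.message_from_bytes(raw_bytes)
    labels = set()
    if msg.get_content_maintype() == "multipart":
        labels.add("multipart")
    elif (msg.get_content_charset() or "").lower() == "utf-8":
        labels.add("single-part utf-8")
    for name in raw_8bit_headers(raw_bytes):
        labels.add("raw-8bit-" + name)
    return labels

def tally(messages):
    """Count how many messages carry each label."""
    counts = Counter()
    for raw in messages:
        counts.update(classify(raw))
    return counts
```

Running tally() over both the spam folder and a folder of known-good mail would show which of these traits actually separate the two.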

And something in the e-mail ecosphere is filtering and converting explicit
=?utf-8? encoding in rfc822 headers.  I was going to blame mhfixmsg, but it's
happening before procmail gets hold of it.  I send mail to myself, 'send'
tosses it to Google, Google hands it back to me via fetchmail/imap thence to
sendmail and procmail, and the =?utf-8? has already been decoded.  I invoke
mhfixmsg as '| tee $tmpfile | mhfixmsg -noverbose -file - -outfile -', and the
version in $tmpfile is already converted.  Meanwhile, some other mail
arrives with raw chars, while some *does* arrive with =?utf-8? still intact.
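Since the same header can show up either way, any filter probably wants to normalize both forms before matching. Here is a minimal sketch using Python's standard email.header module; normalize_header is a made-up name, and it assumes raw 8-bit header bytes are utf-8, which will not hold for all mail.

```python
from email.header import decode_header, make_header

def normalize_header(raw_value: bytes) -> str:
    """Return a header value as text, whether it arrived as RFC 2047
    encoded-words or as raw 8-bit bytes.

    Assumption: raw 8-bit bytes are utf-8; undecodable bytes are replaced.
    Headers mixing encoded-words with raw non-ASCII text are not handled
    carefully in this sketch.
    """
    text = raw_value.decode("utf-8", errors="replace")
    return str(make_header(decode_header(text)))
```

With this, an encoded Subject: and its already-decoded twin compare equal, so one set of rules covers both arrival paths.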

