Re: [Pan-users] Ignoring specific threads
From: Duncan
Subject: Re: [Pan-users] Ignoring specific threads
Date: Mon, 15 Sep 2014 23:36:07 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT d447f7c /m/p/portage/src/egit-src/pan2)
JCA posted on Mon, 15 Sep 2014 15:06:31 -0600 as excerpted:
> I was wondering if Pan can do the following:
>
> Let's take a user U in a given group G. U is a crank, a
> troll or something like that. I would like to tell Pan to ignore not
> only all posts from U but also all threads initiated by U. Is this
> possible with Pan?
Ignoring threads by a specific person isn't necessarily impossible, but
it's not /directly/ possible, either. You'll sort of be relying on a bit
of a side-effect of something else, and hoping that you can get a good
match without catching too many unrelated posts in the process. Tho if
it /does/ catch other posts you can potentially use score ordering or
incremental scoring to rescue them.
IOW, this will be advanced score usage that could be complicated to set up
and not necessarily worth the hassle, but in theory it can be done...
sort-of.
Here's the deal. Proper threading uses the references header. This
header contains a multi-generational list of "parent" post message-IDs.
To score on threads or subthreads you score on the appropriate message-ID
in the references header, and anything that matches will get the assigned
score.
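
As a rough illustration of what a references-header match has to work with, here is a minimal Python sketch that splits a References value into its message-IDs. The header value and the IDs in it are made up, the helper names are hypothetical, and treating the first ID as the thread's original post assumes the posting clients populated the header properly.

```python
import re

def parse_references(header_value):
    """Return the message-IDs found in a References value, in header order."""
    return re.findall(r"<[^<>\s]+>", header_value)

def in_subthread(header_value, target_id):
    """True if target_id is one of the listed parent message-IDs."""
    return target_id in parse_references(header_value)

# Made-up header value: original post first, then each later parent.
refs = "<root.1@example.com> <reply.2@example.net> <reply.3@example.org>"

ids = parse_references(refs)
thread_root = ids[0]  # assumed to be the post that started the thread
```

A full message-ID match like `in_subthread(refs, "<reply.2@example.net>")` only ever identifies the one (sub)thread hanging off that one message.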
The problem is that message-IDs are designed to uniquely ID specific
messages, so a match of an entire message-ID will match only the single
(sub)thread in reply to that specific message. (Message-IDs are assigned
to both email and news messages. News message format is almost entirely
the same as email message format, adding a few news-specific headers and
generally omitting a few mail-specific ones, altho both news and mail
headers can be present and normally won't conflict with each other.)
To match all threads originated by a specific author, you need to find
something unique about that author's message-IDs that you can score on,
that won't catch other authors' message-IDs as well. To the extent that
you can do so, you can filter threads replying to that person. To the
extent that you cannot, because the fixed part of the target author's
message-IDs also appears in the message-IDs of others, you score their
messages also.
As it happens, message-IDs are set either by the posting client, or by
the server posted to, if the posting client didn't set one. There are no
hard rules governing the algorithm used to get a globally unique ID that
is extremely unlikely to apply to a different message, only general rules
on the characters it can contain and the general format, which is similar
to an email address, userpart @ domainpart. (I deliberately spaced it out
to avoid triggering gmane's email address obfuscation.) Uniqueness
matters because message-IDs are used to track messages, so if two
different messages get the same ID, only the first one seen by a
particular server or client will normally appear.
If the posting client doesn't include a message-ID, then the server will
set one. Usually the domain side of these is the domain name of the news
service provider the message was posted to, say @ giganews.com, or some
such. Of course scoring on that will catch all users who post to that
NSP, with clients that don't set the message-ID themselves.
Clients that set the message-ID can use a similar pattern; pan, for
instance, uses the domain name of the email address you are posting with. The
Agent (and freeagent) client at least used to use the agent domain name
instead. Of course, in most cases either one of these will result in a
domain name match that matches far more than one poster.
So the domain name side of the message-ID can be useful in narrowing
things down, but ordinarily won't be enough by itself to identify a
single poster, so you'll need to match something from the user side of
the message-ID as well.
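
One way to check how far the domain side gets you is to group the posters you've seen by the domain part of their message-IDs. The following is only a sketch in Python: the authors, message-IDs and domains are invented, and `split_message_id` is a hypothetical helper, not anything Pan provides.

```python
from collections import defaultdict

def split_message_id(message_id):
    """Split '<user@domain>' into (user_part, domain_part)."""
    inner = message_id.strip().strip("<>")
    # Split on the last "@"; if there is none, everything lands in domain_part.
    user_part, _, domain_part = inner.rpartition("@")
    return user_part, domain_part

# Invented (author, message-ID) pairs standing in for posts seen in a group.
posts = [
    ("U", "<abc123@news.example.com>"),
    ("V", "<def456@news.example.com>"),
    ("W", "<pan.2014.1@w-mail.example.org>"),
    ("U", "<abc789@news.example.com>"),
]

authors_by_domain = defaultdict(set)
for author, message_id in posts:
    _, domain = split_message_id(message_id)
    authors_by_domain[domain].add(author)

# Domains used by exactly one author are candidates for a domain-only match.
unique_domains = {d: a for d, a in authors_by_domain.items() if len(a) == 1}
```

In this made-up sample, U shares a domain with V, so a score on that domain alone would catch V too, while W's domain happens to be unique.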
But the user-side of the message-ID tends to be almost entirely
unstandardized, except of course there are some restrictions on the
characters that can be used, and the idea is to ultimately have something
unique enough that no other message will have the same message-ID,
despite a lot of other messages from the same poster and others normally
having the same domain-side.
So what you'll want to try to do is look at the message-ID of a post from
the target author, and **TRY** to find a match that's as unique to his
posts as possible, but still dependably identifies ALL his posts.
If you're lucky, he uses a news server or client that nobody else posting
to the group in question uses, and between limiting the score to that
domain-name side of the message-ID, plus anything that's unique on the
user side, and limiting that score to a specific group, it'll "just
work". Tho of course there's always the possibility that a new poster
will appear that matches as well, that you'll miss.
But chances are pretty good you won't find a good enough match and that
other posters will match that score as well. Still, if it's only a few
other posters that get caught in the net, all hope is not yet lost.
Pan uses two types of scoring, absolute scoring, where a matching rule
sets that score and no further rules are processed, and incremental
scoring, where the score is simply increased or decreased by the value in
the score.
Ignore is a score of -9999 or lower. Normally, setting an ignore sets an
absolute score of -9999, but a post can also be ignored if no absolute
scores apply but the total of all incremental scores ends up being -9999
or lower. So if the net cast by your would-be references-header
message-ID ignore is too wide and catches others as well, you have two
possible methods to counteract that.
If you want to use an absolute score ignore, then counteracting it is as
simple as setting another absolute score that catches the "mistakes",
that gets processed first (appears before the too wide score in the
scorefile, which you can edit for order as necessary).
The problem here is that the references header will contain message-IDs
from multiple generations of parents, and the ones that contain the target
may well contain the false-positive IDs as well. So an absolute score
isn't likely to do what you need, because trying to undo it for the
false-positives will likely undo too much as well.
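
To see why an absolute rescue tends to undo too much, here is a rough Python sketch. The patterns, the "u-"/"v-" ID prefixes and the shared.example domain are all invented for illustration, and Python's re module is only standing in for Pan's regex matching.

```python
import re

# Hypothetical too-wide ignore: anything referencing the shared domain.
ignore_pattern = re.compile(r"@shared\.example>")
# Hypothetical rescue for innocent poster V, whose IDs start with "v-".
rescue_pattern = re.compile(r"<v-[^>]*@shared\.example>")

# A post deep in a thread that U started, sitting under one of V's replies.
references = "<u-001@shared.example> <v-002@shared.example> <x-003@other.example>"

ignore_hits = bool(ignore_pattern.search(references))
rescue_hits = bool(rescue_pattern.search(references))
```

Both patterns hit the same header, so a rescue rule processed first as an absolute score would pull this post back out of the ignore, even tho it belongs to a thread U started.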
Which leaves incremental scoring. The idea here would be to find a mix
of scores such that in the end, all the matches for the target posts end
up at -9999 or lower, while incrementals add just enough score back to
the false-positives to rescue them from the ignore, bringing their score
up to at least -9998, if not up further, to zero or positive. That's
definitely an art unto itself; or as I said above, "advanced".
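
The arithmetic can be sketched with a toy model in Python. This is not Pan's code, just the behavior as described above: an absolute rule sets the score and stops processing, an incremental rule adds to a running total, and -9999 or lower means ignore. The rule tuples, patterns and message-IDs are all hypothetical.

```python
import re

IGNORE_THRESHOLD = -9999  # per the text: -9999 or lower is ignored

def score_post(references, rules):
    """Apply rules in order; each rule is (pattern, value, is_absolute)."""
    total = 0
    for pattern, value, is_absolute in rules:
        if re.search(pattern, references):
            if is_absolute:
                return value  # absolute: set the score, process nothing further
            total += value    # incremental: adjust the running total
    return total

def is_ignored(score):
    return score <= IGNORE_THRESHOLD

# Hypothetical incremental rules: push threads whose first ID is on the shared
# domain down to ignore, then add the same amount back for V's "v-" IDs.
rules = [
    (r"^<[^>]*@shared\.example>", -9999, False),
    (r"^<v-", 9999, False),
]

u_thread = "<u-001@shared.example> <x-002@other.example>"
v_thread = "<v-010@shared.example> <x-011@other.example>"

u_score = score_post(u_thread, rules)
v_score = score_post(v_thread, rules)
```

With these made-up rules the thread U started stays at -9999 and is ignored, while the thread V started ends up back at zero.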
Meanwhile, something that may help: In your example you specified
threads INITIATED by U. As it happens, regular-expression matches have a
way to specify BEGINS WITH and/or ENDS WITH. If you're only worried
about matching threads where U is the original poster, the ^ character at
the beginning of the regex can be used to specify "begins with". You can
then use a wildcard that omits the ">" character used to terminate each
message-ID, thus forcing the match to only apply to the first one.
Something like this (spaces again inserted either side of the @) :
References: ^[^>]* @ sample\.com>
^ means begins-with. The [] encloses a character-set, with ^ as the
first character meaning "not". * means "any number of matches of the
previous". So what that means is:
References header, begins with, any-number-of-characters-not-including->,
@ sample.com, >.
Thus the first message-ID in the references header would have to have
sample.com as the domain name portion.
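
A quick way to convince yourself of the anchoring is to try the pattern on a couple of made-up References values. This sketch uses Python's re module, which may differ in details from the regex engine Pan uses, tho the constructs here (^, a negated character set, *) are common to most regex flavors. The message-IDs are invented, and the spaces around the @ are removed.

```python
import re

# The pattern from the text, without the spaces around "@".
first_id_on_domain = re.compile(r"^[^>]*@sample\.com>")
# The same match without the "begins with" part, for comparison.
unanchored = re.compile(r"@sample\.com>")

# Thread started from sample.com, later reply from elsewhere.
started_on_domain = "<a1@sample.com> <b2@elsewhere.example>"
# Thread started elsewhere, later reply from sample.com.
replied_from_domain = "<a1@elsewhere.example> <b2@sample.com>"

matches_root = bool(first_id_on_domain.search(started_on_domain))
matches_reply = bool(first_id_on_domain.search(replied_from_domain))
unanchored_reply = bool(unanchored.search(replied_from_domain))
```

Only the first header matches the anchored pattern, because [^>]* cannot cross the ">" that ends the first message-ID. The unanchored version matches the second header as well.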
But something else to keep in mind as well: Some clients are broken and
do not include a properly populated References header in replies. These
clients will often attempt to thread by the contents of the subject
header, instead. Obviously, no references header, no match on a
references-header score. =:^(
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman