From: Duncan
Subject: Re: [Pan-devel] You need to incorporate Pre-processing tool (filters) to control downloading
Date: Sun, 7 Sep 2014 09:20:25 +0000 (UTC)
User-agent: Pan/0.140 (Chocolate Salty Balls; GIT d447f7c /m/p/portage/src/egit-src/pan2)
Rajib Bandopadhyay posted on Sat, 06 Sep 2014 20:50:10 +0530 as excerpted:
> I have used PAN in the past and was impressed by its ease of use.
>
> However, there is a great disadvantage for PAN - it doesn't have a
> pre-processing filter like Claws Mail.
[You originally posted to the pan-devel list. However, pan-user has more
traffic and the following subthread is likely both more discoverable by
new users and available to existing regulars if it's posted to pan-user,
so I'm posting this reply to both, with followup to pan-user (only)
requested.]
Being a user of both pan and claws-mail, I can say that pan has
functionality conceptually as powerful as that of at least claws-mail's
built-in filtering functionality. However, it (1) works differently and
perhaps not as immediately intuitively, (2) does not have claws-mail's
script-your-own extensibility (tho in the higher performance multi-
threaded pan environment the only real effective way to do that would be
native code or at least JIT-compiled bytecode in any case, and of course
those with the ability to code can already do that by patching pan's own
code directly), and (3) there's potential to make it even more powerful,
but one bug and lack of a couple feature extensions currently limit it a
bit.
I am of course speaking of pan's scoring functionality combined with the
(relatively) new feature, (automated) score-based "actions". More below.
> This pre-processing filter empowers us readers to pre-filter files we
> need to download, and we can control, selectively download, or even bar
> downloads, with pre-processing filters.
>
> This pre-processing filter makes it a great tool. But it has its
> weaknesses, its downloads are single-threaded and hence, very slow.
>
> You have a scope here to improve upon your design!
As I said above, pan's comparable feature combination is scoring combined
with actions. But there's a big conceptual difference in how they're
implemented in pan, for at least two reasons:
1) The nature of the news protocol and typical user task requirements.
2) The requirements of pan's high-performance multi-threaded environment.
Unlike claws-mail which is designed with single-threaded user-scripted-
extensibility as a very high priority, pan's automation focus is on
higher-volume multi-threaded binary-post downloading and attachment
saving. Certainly both clients are designed to be usable for BOTH
binaries and text. But...
A primary focus of claws-mail is extreme user-scripted extensibility in a
lower performance single-threaded environment with an assumption that
actually reading message text is the primary reason for downloading, such
that lower-performance single-threaded execution, with single-threaded
"we can stop and wait for the extension-script result before proceeding",
works just fine. Efficient download and saving of binary attachments is
certainly possible, but it's clearly lower on the priority scale than
simple reading of primarily text messages, and full user-scripted
extensibility, even at the cost of efficiency and effectively locking
processing to single-thread.
OTOH...
A primary focus of pan is on high performance multi-threaded download and
saving of binary attachments, with clear emphasis on "high performance"
and "multi-threaded", and with no assumption that the text message itself
is of any interest at all as it may in fact be ONLY the saved binary
attachments that are of interest. Clearly pan can reasonably effectively
handle reading and replying to text messages as well and that in fact is
the primary use-case of many users including myself, but pan's emphasis
on efficient multi-threaded downloading means the claws-mail "we can and
will stop everything and wait for the result of a user-script extension
before proceeding" approach is simply not possible and ENTIRELY out of
the question.
That's also the reason for pan's distinction between "caching" and
"downloading", and why pan's local message cache is relatively small (10
MB IIRC) by default.
* "Downloading" in pan terms means caching whole multi-individual-text-
post messages (where 100 or more such individual text messages are often
entirely transparently combined to allow the next step), automatically
extracting and saving off the attachments, then deleting the still
entirely unread text messages from cache in order to make room for the
next batch of individual text messages to be cached and their attachments
saved off. By this definition ONLY the often rather large binary
attachments matter, the text messages themselves are simply the container
the attachments ship in and can be discarded after the attached contents
are safely unpacked and stored.
* By contrast, "caching" in pan terms is what claws-mail would call
"downloading", that is, downloading the text message to cache, leaving it
marked unread, ready for the user to read and save or reply to later as
desired. This "text mode message processing", as opposed to "binary mode
attachment processing", is conceptually an /entirely/ different work-flow
where high efficiency multi-thread bulk-processing isn't as vital, but
user-scripted extensibility such as claws-mail offers for this use-case
might be. Pan's default 10 MB cache is rather small for this, and
indeed, my text-instance[1] pan is configured as unexpiring with multi-
gig cache so it doesn't delete anything, and I have the entire content of
the couple dozen or so text groups I follow (including the pan lists, as
newsgroups via gmane.org's list2news service), minus a few spam and troll
posts, going back over ten years in several cases. =:^) I even have the
contents of several of my ISP's old discussion newsgroups still cached,
even tho they killed their news server some years ago now so there's
actually no server left to connect to for new messages (tho a few of the
former ISP-private groups live on as pretty much entirely spam-filled
zombie groups on various commercial NSP services).
It's important to get this difference, because back in the day when I had
first come to pan on Linux from MSOE on MS Windows, I was originally
rather dismayed to see messages I had "downloaded" but only to cache,
believing them to be saved for reading and further processing later,
suddenly disappearing again, still unread, as I downloaded (to cache)
additional messages! The cache was still set to its default 10 MB, and I
had 10 MB of messages downloaded, so pan was simply deleting the oldest
ones from cache in order to make way for new ones. Once I figured out
what was happening, I was able to set a far bigger cache (tho it was
limited to I think 1 GB or some such back then, no such limits now), and
my messages quit disappearing from cache before I even had a chance to
read them!
The point being, pan ASSUMES binary mode download, attachment saving and
unread containing-text-message discard by default, and is optimized to
process that as efficiently as it can. While it can /do/ text messages
and in fact isn't actually a bad text-message news client at all, so much
so that many folks including me actually use pan primarily for such text-
mode messages and groups, that's not what it assumes or is optimized for,
just as claws-mail doesn't assume nor is it optimized for the bulk-binary-
mode attachment saving pan handles far more efficiently, even tho it can
handle it in its slower single-threaded fashion. (Tho I don't actually
know if claws-mail handles the more efficient yenc-encoded attachments
common on binary newsgroups or not, but if it doesn't do so directly,
there's certainly third-party-utilities that can do so, and claws-mail is
certainly extensible enough to add that functionality as a third-party-
script-extension if desired.)
OK, now that we've dealt with the underlying concepts, let's move on to
their practical application in the context of your
post. Again, pan's implementation is scoring combined with actions.
We'll deal with them one at a time.
1) Scoring
Pan's scoring system can work in one of two modes, absolute or
incremental. In fact, pan's ignore and watch features are implemented
simply enough as the extreme ends of absolute-mode scoring, -9999 (or
lower) scores are ignored, (+)9999 (or higher) scores are interpreted as
watched, and pan's ignore and watch features simply create scoring rules
that set =-9999 and =9999 absolute scores.
But in the absence of an applicable absolute (aka forced) score, pan will
match any incremental mode scores that apply and the resulting score is
the total of all incremental scoring rules.
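The absolute-versus-incremental behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not pan's actual code: the MatchedRule type and final_score function are hypothetical names, and treating the first matching absolute rule as final is an assumption.

```python
from typing import Iterable, NamedTuple


class MatchedRule(NamedTuple):
    """One scoring rule that matched an article (hypothetical type)."""
    value: int      # score value written in the rule
    absolute: bool  # True for a forced "=value" rule, False for incremental


def final_score(matched: Iterable[MatchedRule]) -> int:
    """Combine the rules that matched one article into a single score."""
    total = 0
    for rule in matched:
        if rule.absolute:
            # Assumption: the first matching absolute rule wins outright
            # and any incremental total is discarded.
            return rule.value
        total += rule.value
    # No absolute rule applied: the score is the sum of the increments.
    return total
```

With no matching rules at all the result is 0, which is the normal (neutral) case.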
Further, there are several scoring "zones" in addition to the two
extremes. If you have the score column active in your headers pane/tab,
pan will color-code the scored posts by zone as configured on the colors
tab of prefs, and can be set to show or hide posts by score zone as well,
as accessed via the "match scores" section in the view, header pane
submenu. As can be seen from both places, the scoring zones are as
follows, lowest to highest:
-9999 (and lower):  ignored
-9998 to -1:        low (negative)
0:                  normal (zero/neutral: no scores apply, or the
                    combined effect of multiple scores is as if none
                    applied at all)
1 to 4999:          medium
5000 to 9998:       high
9999 (and higher):  watched
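As a quick way to check which zone a given score lands in, the table above can be written as a small Python function. The function name score_zone is made up for illustration; the boundaries are the ones listed above.

```python
def score_zone(score: int) -> str:
    """Return the scoring zone name for a score, per the table above."""
    if score <= -9999:
        return "ignored"
    if score < 0:
        return "low"
    if score == 0:
        return "normal"
    if score < 5000:
        return "medium"
    if score < 9999:
        return "high"
    return "watched"
```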
These scoring zones are critical to the automated actions discussed
below, but before we get to them, let's talk a bit more about how the
scoring rules actually work.
In general, pan's scoring rules are stored in a scorefile with a format
based on that of another news client, slrn, altho pan's implementation isn't
as advanced as that of slrn. FWIW, at least one other news client, xnews
(MS platform news client), uses a very similar scorefile format.
Here's the slrn scorefile.txt documentation:
http://slrn.sourceforge.net/docs/score.txt
Again, keep in mind while reading that, that pan's implementation is very
similar, but not as advanced. In particular, pan lacks support for the
include directive, as well as for nested/grouped rules. An additional
difference is that pan's processing is case insensitive.
And one additional difference, presently, pan's scoring only supports
logical OR: if ANY of the conditions match, the scoring rule is applied.
As documented, logical OR is Score:: (double colon), while Score: (single
colon) /should/ be logical AND (only apply if ALL conditions match).
Presently, however, Score: seems to behave like Score::, they both are
treated as logical OR and ANY matching condition triggers the score.
I'm not sure but I /believe/ this to be a bug as I could almost swear
that logical AND (single colon) *USED* to work. But someone posted that
they couldn't get it working and I tested and sure enough, all my scores
were being applied as logical OR as well, so either it broke somewhere
along the line or I'm mis-remembering and it never worked in the first
place.
Tho it can be noted that with pan's scoring zones and appropriate use of
incremental scoring, /almost/ the same effect can be achieved by simply
adjusting the score values of the various individual elements composing
the would-be AND, such that selected posts only fall in the desired score-
zone if ALL the appropriate conditions match, otherwise they'll fall into
a different zone due to failure to match some of the conditions that
incrementally combine to put it in the target zone.
Meanwhile, it's also worthwhile to specifically point out that as pan
creates the scores, MOST of the lines pan actually writes into the scorefile
are actually either blank lines or comments, due to the leading % comment
indicator. You can thus trim all the %BOS and %EOS lines, etc, without
affecting actual scoring functionality at all, as they're comments, there
simply to clarify what pan was actually doing when it wrote the score,
add date information, etc.
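If you want to do that trimming mechanically rather than by hand, a minimal Python sketch follows. The sample text is an illustrative slrn-style entry, not copied from a real pan scorefile, and strip_comment_lines is a hypothetical helper. Work on a backup copy of the scorefile.

```python
def strip_comment_lines(text: str) -> str:
    """Drop every line whose first non-blank character is '%'."""
    return "".join(
        line
        for line in text.splitlines(keepends=True)
        if not line.lstrip().startswith("%")
    )


# Illustrative entry only; real entries written by pan will differ.
SAMPLE = r"""%BOS
%Illustrative comment line
[alt.example.group]
Score: =-9999
From: spammer@example\.invalid
%EOS
"""

trimmed = strip_comment_lines(SAMPLE)
```

Only whole-line comments are removed here; the group, Score and header lines pass through unchanged.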
Also, note the overview-headers-only recommendation, which applies to pan
as well as slrn. In particular, while scoring can be done on any header,
if the header isn't in the overview file, it's likely the whole post will
have to be downloaded before the score can be applied. While for ignored
posts especially that can still be better than having to actually see and
deal with the post manually, it does mean it has to be downloaded to
cache before the score can be applied, so if at all possible, it's MUCH
better to score on overview-included headers only, thus avoiding the
download entirely.
Finally, here's a link (via gmane) to an earlier post of mine, with an
excerpt from my own scorefile (and some additional explanation/commentary)
as an example of what a nicely organized hand edited scorefile can look
like in practice.
http://permalink.gmane.org/gmane.comp.gnome.apps.pan.user/8689
OK, that covers scoring, but other than showing or hiding posts and/or
making their scores show up in pretty colors in the header pane's score
column, of what practical USE are they? In particular, how can they be
used to trigger automated pre-download filtering and selective download
or delete-before-download? That's where actions come in! =:^)
2) Actions
Once you understand how scoring works and master the art of writing good
scoring rules, pan's still relatively new (automated) actions feature
makes putting those scoring rules to practical use actually quite simple.
=:^)
Actions are configured in pan preferences on the actions tab, and are
scoring-zone based. Depending on how much you want to rely on scoring to
determine what's automatically processed, there are several suggested
configurations possible. I'd recommend NOT setting automated delete or
even mark-read just yet, until you've watched how your scoring config is
working and are comfortable that it's working as intended and you're not
going to be missing a whole lot of posts due to accidental ignore-score
matches.
In fact, here's what I recommend:
Before setting up actions at all, do this:
1) In pan prefs, headers tab:
Ensure that you have the score column enabled, and order it so it's in
view all the time.
Back in the main window, header pane/tab, expand or shrink the score and
other columns as necessary to fit, while keeping the score column in view.
2) In the view menu, header pane submenu:
Ensure that ALL the "match scores of" options, including "low" and
"ignored", are checked.
3) Back in pan prefs, on the colors tab under header pane:
Set up your colors for each score zone so you can tell the zones apart
just by color.
In particular, make sure the ignored and low/negative score zone colors
stand out, as well as watched and, if you eventually intend to auto-download
them, the medium and high score zones too.
4) In pan prefs, actions tab:
Ensure that all actions are currently DISABLED, for testing mode.
Now go back to using pan, setting up scores to put posts in the desired
score zones as appropriate, and watching that the scores work as
intended. In particular, be sure the low/negative and ignored zones
aren't catching posts that you actually want to see.
5) After some time watching that posts are getting assigned to their
intended score zones, when you're comfortable that they are...
6) Back in pan prefs, on the actions tab:
Enable actions for score zones as appropriate. Here's a recommended
example:
Delete articles scoring at: -9999 or less (ignored)
This is optional. If you're conservative, you might wish to keep this
disabled instead.
Mark articles read scoring at: -9998 to -1 (low)
Alternatively, if you're not deleting ignored articles, you can set it to
simply mark-read ignored.
Assuming you have pan set to hide read posts, this will hide them, but
the headers won't actually be deleted, so you can still set show read
posts temporarily if you want to refer back to them, perhaps because
someone (not scored so low so you see the post) referred to them in a
quote and you want to read the entire post to get the context.
Note that the above "negative actions" should work at the set level AND
BELOW. So if you for instance set mark-read from low-zone, it should
mark-read ignored-zone as well.
The below "positive actions" should work the other way, at the set level
AND ABOVE. So if you set cache articles scoring medium-zone, it should
also cache those in the high and watched zones.
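The "at the set level and below" versus "at the set level and above" behavior can be modeled like this in Python. It is a sketch of the rule as described, which is how it /should/ work; the function names are hypothetical and this is not pan's implementation.

```python
# Zones ordered from lowest to highest.
ZONES = ["ignored", "low", "normal", "medium", "high", "watched"]


def negative_action_applies(article_zone: str, configured_zone: str) -> bool:
    """Delete / mark-read: applies at the configured zone AND BELOW."""
    return ZONES.index(article_zone) <= ZONES.index(configured_zone)


def positive_action_applies(article_zone: str, configured_zone: str) -> bool:
    """Cache / download: applies at the configured zone AND ABOVE."""
    return ZONES.index(article_zone) >= ZONES.index(configured_zone)
```

So with mark-read configured at low, negative_action_applies("ignored", "low") is true, while a normal-zone post is left alone.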
Depending on whether you run pan in binary download-and-save-attachments
or cache-and-process-later modes, and whether you want to auto-cache/
download medium and high scorezone posts or only watched posts, you can
set these as appropriate, but:
Read-text-mode example:
Cache articles scoring at: 5000 to 9998 (high)
Download-binary-mode example:
Download attachments of articles scoring at: 9999 or more (watched)
For the download example, after you're sure it's downloading as
appropriate, you likely want to set the mark affected articles read
option as well, since once the attachments are downloaded (and saved) you
probably don't care to see them any longer.
I believe once you have both scoring and actions set up appropriately,
you'll find it does what you need quite well. Pre-processing filters?
Why? That would only slow pan down! =:^)
Tho since you posted to the dev list there's a reasonable chance that you
can code in the C++ pan's written in, and if so, I'm sure no one would
object if you found time to figure out why pan's AND scoring doesn't work
and fix that, and if we're /really/ lucky, we might even get a patch for
the missing include and/or nested/grouped scoring condition support, as
well. Then pan's scoring support would /really/ rock! =:^)
Several devs have in the past cloned pan's git repo to their github
accounts and requested pulls from there when they have patches ready. =:^)
---
[1] My text-instance: I run several separate pan instances each with its
own config and cache, one for text, one for binaries, and one for
temporary testing. Pan reads the PAN_HOME environment variable and uses
that for its config and data if set, using the default (~/.pan2 on *ix
anyway, I don't do windows) otherwise. I use that in a wrapper script to
point pan at the appropriate config depending on whether I launched the
bin, text or test wrapper.
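A wrapper along those lines might look like the following Python sketch. The directory names are hypothetical (pick your own), and it assumes a pan executable is on the PATH; only the PAN_HOME variable itself comes from the description above.

```python
import os
from pathlib import Path
from typing import Dict, List, Optional

# Hypothetical per-instance directories; adjust to taste.
INSTANCE_DIRS = {
    "text": Path.home() / ".pan2-text",
    "bin": Path.home() / ".pan2-bin",
    "test": Path.home() / ".pan2-test",
}


def pan_environment(instance: str,
                    base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with PAN_HOME set for the instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["PAN_HOME"] = str(INSTANCE_DIRS[instance])
    return env


def launch(instance: str, extra_args: List[str]) -> None:
    """Replace this process with pan, pointed at the chosen instance."""
    os.execvpe("pan", ["pan", *extra_args], pan_environment(instance))
```

Calling launch("text", []) from a small script (one per instance, or one script taking the instance name as an argument) then starts pan with that instance's config and cache.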
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman