Re: [Pan-devel] Scoring articles by ratio of bytes/lines


From: Konrad Karl
Subject: Re: [Pan-devel] Scoring articles by ratio of bytes/lines
Date: Thu, 1 Feb 2007 11:15:08 +0100
User-agent: Mutt/1.4.2.2i

Hi Charles,
please see below.

On Wed, Jan 31, 2007 at 07:17:33PM -0600, Charles Kerr wrote:
> Konrad Karl wrote:
> >Hi,
> >
> >I want to be able to score/filter/delete articles where
> >the ratio of article_bytes / article_lines is below a certain
> >value.
> >
> >Many sporged postings could be easily identified this way. With
> >the old Pan 1.x I used a simple Perl filter program
> >in order to delete articles with too low a ratio, and this simple
> >approach worked surprisingly well. The algorithm might require some
> >tweaking, e.g. if the number of lines is < 10 then don't apply the
> >ratio rule, etc.
> >
> >Now I have started looking into the latest sources, but I am
> >afraid it will take considerable time until I understand
> >what's going on.
> >
> >What do you think?
> >
> >Greetings,
> >Konrad
> 
> Hi Konrad,
> 
> This can be done in 0.120 by adding a scoring rule to ignore
> all articles with a line count less than 10.
> See Article > Add a Scoring Rule

Yes, I know. But I want a more complex rule, expressed here in
pseudocode:

if (article_lines > some_threshold) {
    ratio = article_bytes / article_lines;
    if (ratio < ratio_threshold)
        apply_some_rule;
}

"apply_some_rule" could perhaps mean: display only these articles, so that
they can then be deleted manually.

There are sporged articles which have a line count in the
range of several hundred to several thousand and a ridiculously low
byte count. I have had much success with a ratio_threshold of about
10 (every line has to have at least 10 bytes on average, otherwise
the article is very likely sporge).
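
To make the rule concrete, here is a rough Python sketch. It is only an
illustration: it is neither Pan code nor the Perl filter mentioned earlier,
and the names Article, looks_sporged, MIN_LINES and MIN_BYTES_PER_LINE are
made up for this example. The ratio threshold of 10 is the value given
above; the line threshold of 10 is the example value from the first message.

```python
from dataclasses import dataclass

# Assumed defaults for this sketch (see the lead-in for where they come from).
MIN_LINES = 10               # articles with this many lines or fewer are exempt
MIN_BYTES_PER_LINE = 10.0    # below this average, the article is suspicious


@dataclass
class Article:
    """Hypothetical stand-in for an article header; not a Pan data structure."""
    subject: str
    byte_count: int
    line_count: int


def looks_sporged(article, min_lines=MIN_LINES, min_ratio=MIN_BYTES_PER_LINE):
    """Return True if the article has enough lines and too few bytes per line."""
    # Short articles are exempt from the ratio rule.
    if article.line_count <= min_lines:
        return False
    # Average number of bytes per line.
    ratio = article.byte_count / article.line_count
    return ratio < min_ratio


articles = [
    Article("normal post", 4200, 60),    # 70 bytes per line
    Article("short reply", 40, 5),       # 8 bytes per line, but too short to judge
    Article("suspect post", 3000, 1500)  # 2 bytes per line
]

# "apply_some_rule" is modelled here as simply collecting the suspicious subjects.
flagged = [a.subject for a in articles if looks_sporged(a)]
print(flagged)
```

Only the third article is flagged: the second one also has a low ratio, but
it falls under the line-count exemption.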

Greetings,
Konrad

> 
> cheers,
> Charles



