RE: [Pan-users] Composing regex for Pan
From: Paul Hudson
Subject: RE: [Pan-users] Composing regex for Pan
Date: Sun, 14 Mar 2004 14:01:47 -0000
> >
> > \b[:upper:]{2,}\b
> This dumped all replies. The regex animal book doesn't
> explain those constructs very well (nor have any of the web
> sites I've looked at).
Have a look at the link I sent - all the info's in there somewhere, I think
:)
> > http://www.pcre.org/pcre.txt).
> > (?-i)\b[A-Z]{2,}\b
> This works, sort-of, if I select NONE OF:, but things like
> "!?&" in the string break it.
(All the below untested as before)
So, I'm unclear what you want. How about keeping things with at least one
word with at least one lower case letter in the middle of it?
(?-i)\b.+[a-z].+\b
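To try that idea outside Pan, here is a rough Python sketch. The sample subjects and the names KEEP and kept are made up for illustration, and Python's re module is not PCRE: it is case-sensitive by default, so the (?-i) prefix is simply left off.

```python
import re

# Python's re is case-sensitive unless told otherwise, so no (?-i) is needed.
KEEP = re.compile(r"\b.+[a-z].+\b")

# Hypothetical subject lines, invented for this example.
subjects = ["MAKE MONEY FAST", "Just a test post", "RE: HELP!!!"]

# Keep only subjects where the pattern finds a lower case letter
# with at least one character on either side of it.
kept = [s for s in subjects if KEEP.search(s)]
print(kept)
```

Note that a subject like "Re: HELP" would still be kept, since the "e" in "Re" is enough to satisfy the pattern.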
> What I've been reading says that the ? refers to "zero or more times"
> (this must be my "snake & necklace" problem again).
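It's the ( followed by ? that is important here: (? starts an option
setting or special group rather than an ordinary one. You're right that ? in
other contexts is a quantifier, though strictly it means zero or one (* is
zero or more).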
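A small Python sketch of the two uses of ?, in case it helps. The patterns and the names OPTION_DEMO and QUANTIFIER_DEMO are invented for illustration. As far as I know, Python's re only accepts the scoped form (?-i:...) for switching an option off (Python 3.6 or later), whereas PCRE also takes a bare (?-i).

```python
import re

# "(?" introduces an option setting: (?i) makes the whole pattern
# case-insensitive, and (?-i:...) switches that off inside the group only.
OPTION_DEMO = re.compile(r"(?i)re: (?-i:[A-Z]{2,})")

# A bare "?" after an item is a quantifier: here it makes the "u" optional.
QUANTIFIER_DEMO = re.compile(r"colou?r")

print(bool(OPTION_DEMO.search("Re: HELP")))          # prefix in any case, caps word
print(bool(OPTION_DEMO.search("re: help")))          # word is not upper case
print(bool(QUANTIFIER_DEMO.fullmatch("color")))      # "u" absent
print(bool(QUANTIFIER_DEMO.fullmatch("colouur")))    # "u" twice is too many
```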
>
> I want to dump as many of the annoying spam, troll and
> AOL-keyboard posts as I can, which I think, will require
> parsing the string's individual characters, multiple times
> (maybe my approach is flawed?) Once for ALL CAPS (if true,
> dump the post, regardless of additional characters in the
> string).
So dump lines that don't match
(?-i)[a-z]
maybe (i.e. lines that don't contain at least one lower case character)
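Sketched in Python, the ALL CAPS test comes down to "no lower case letter anywhere". The function name is_all_caps and the sample strings are assumptions for the example, and again Python needs no (?-i) because it is case-sensitive by default.

```python
import re

HAS_LOWER = re.compile(r"[a-z]")

def is_all_caps(subject):
    """Return True when the subject contains no lower case ASCII letter."""
    return HAS_LOWER.search(subject) is None

# Punctuation such as "!?&" does not get in the way, because the test
# only asks whether a lower case letter is present at all.
for subject in ["BUY NOW !?&", "Just A Test Post #10"]:
    print(subject, "->", "dump" if is_all_caps(subject) else "keep")
```

One side effect to be aware of: a subject made up only of digits or punctuation has no lower case letter either, so this test would dump it as well.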
>After that, it gets interesting. Now we should have
> mixed-case alpha and/or alpha-numeric (or "should" have).
So, don't do anything with these (leave them with the default score which
means they'll be shown)
> Next, filter on multiple instances (2 or more to start) of
> any non-alpha, printable characters, anywhere in the string.
Do you mean the same character repeated? This one's interesting. I think we
can use backreferences here....
Keep lines that don't match
([[:punct:]])\1
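For a backreference like \1 to work, the punctuation class has to sit inside a capture group, and the POSIX class has to sit inside a bracket expression. Here is a Python sketch of the same idea. Python's re has no [[:punct:]], so string.punctuation stands in for it as an assumption (ASCII punctuation only), and the names are made up.

```python
import re
import string

# Build a character class from ASCII punctuation, capture one such
# character, then require the very same character again with \1.
REPEATED_PUNCT = re.compile("([" + re.escape(string.punctuation) + r"])\1")

def has_repeated_punct(subject):
    """Return True when any punctuation character appears twice in a row."""
    return REPEATED_PUNCT.search(subject) is not None

for subject in ["HELP!!!", "What?!", "Just a Test Post #10"]:
    print(subject, "->", "dump" if has_repeated_punct(subject) else "keep")
```

"What?!" is kept because the two punctuation characters differ. Catching any run of punctuation, same character or not, would need a plain quantifier instead of a backreference.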
> Dump the matches. Then filter those results against any other
> specific criteria until what remains are subjects that look
> "normal" as in: Just a test post | Just A Test Post | Just a
> Test Post #10 | any of the previous, prefixed by "Re:", ect.
These should be straightforward?
What are you setting the score to for each of these?