Re: How to feed Bayes on relay-only server?
From: Thomas Cameron
Subject: Re: How to feed Bayes on relay-only server?
Date: Mon, 14 Jun 2004 21:24:13 -0500
----- Original Message -----
From: "Dan Nelson" <address@hidden>
To: "Thomas Cameron" <address@hidden>
Cc: <address@hidden>
Sent: Monday, June 14, 2004 10:17 AM
Subject: Re: How to feed Bayes on relay-only server?
> It primarily depends on the end users' setups. One relatively easy way
> would be to set up "spam@" and "notspam@" email accounts, and have
> something processing those mailboxes and training any
> attached/forwarded messages. This only works if your end users' mail
> agents can forward the entire original message (including headers) as
> an attachment or inline. If they don't, then your options are limited,
> since without the message you can't retrain.
>
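For what it's worth, here is a rough sketch of the mailbox-processing side of that idea. It assumes users forward the spam as an attachment (a message/rfc822 part), and that sa-learn is on the PATH and reads a message from stdin when no file is given. The function names are made up for illustration, and this is not what Dan actually runs.

```python
import email
import subprocess
from email import policy
from email.message import EmailMessage


def extract_forwarded(raw: bytes) -> list:
    """Return every message/rfc822 attachment in a report as raw bytes."""
    report = email.message_from_bytes(raw, policy=policy.default)
    originals = []
    for part in report.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the embedded message, headers included.
            originals.append(part.get_payload(0).as_bytes())
    return originals


def train(raw_report: bytes, is_spam: bool) -> None:
    """Feed each forwarded original to sa-learn (hypothetical helper).

    Assumes sa-learn is installed and accepts the message on stdin.
    """
    flag = "--spam" if is_spam else "--ham"
    for original in extract_forwarded(raw_report):
        subprocess.run(["sa-learn", flag], input=original, check=True)
```

Mail arriving at spam@ would go through train(raw, True) and mail arriving at notspam@ through train(raw, False). Inline forwards usually lose the original headers, so this sketch simply skips them.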
> What I do here with Lotus Notes clients is save all incoming messages
> under 32k (spam is rarely bigger than that) to a MySQL database with
> another milter.
Which milter is that? Sounds intriguing...
> I then wrote an agent that grabs just the Message-ID
> out of tagged messages and submits them via xmlrpc to a daemon on the
> mailserver, which extracts the full messages from the database and runs
> sa-learn and razor-report on them. I age messages out of the database
> after a week.
But how do you tell SA which is spam and which is ham? I'm looking at my
inbox and most of my (non-spam) messages are under 32k.
> If you have no control over your end users' clients, maybe a
> combination of both approaches would work. Save all incoming mail, and
> scour messages sent to spam/ham@ for enough information to pull the
> original out of the database. If there's a message-id in the forwarded
> mail, you're home free. Otherwise, filtering on subject and date
> (maybe recipient) should get you close enough.
What about auto-learning? If I understand correctly, SA auto-learns from
messages that score very high or very low. Would it make sense to change
those thresholds so that SA is more likely to auto-learn? Or am I
misunderstanding auto-learn?
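
For reference, these are the local.cf settings I believe control this. The option names and default values are as I understand them from the Mail::SpamAssassin::Conf documentation; older releases may spell them differently, so check the docs for your version.

```
# Turn Bayes auto-learning on (1) or off (0)
bayes_auto_learn 1

# Messages scoring below this are auto-learned as ham (believed default: 0.1)
bayes_auto_learn_threshold_nonspam 0.1

# Messages scoring above this are auto-learned as spam (believed default: 12.0)
bayes_auto_learn_threshold_spam 12.0
```

Moving the two thresholds closer together would make SA learn from more messages, but it would presumably also make it more likely to learn from a misclassified one.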
Thanks for all the feedback. Keep it coming!
Thomas