Spam filtering with Pyzor and SpamBayes

Posted by Steve on Mon 2 Jan 2006 at 10:42

Spam appears to be a fact of life for most of the online world at the moment. Here is how I personally handle the filtering of incoming mail, using a combination of Pyzor, SpamBayes and Procmail. These tools each integrate nicely, and work easily with my mail reader of choice: mutt.

Procmail is a mail processing tool which allows you to filter incoming mail into different mailboxes in a variety of ways. We've introduced procmail previously.

Procmail can do several things; here I use it to pass all incoming mail through two programs, Pyzor and SpamBayes.

Pyzor

Pyzor describes itself as a "collaborative, networked system to detect and block spam using identifying digests of messages". What this means is that each incoming message is reduced to a simple checksum, or digest, which is then queried against an online database of messages that others have reported as being spam.
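To make the digest idea concrete, here is a minimal shell sketch. It is not Pyzor's real algorithm, which is more involved than this; the body_digest helper and the choice of sha1sum are my own assumptions for illustration. It throws away the headers and hashes what is left, so the same spam sent to two different people reduces to the same digest.

```shell
#!/bin/sh
# Illustrative only: body_digest is a made-up helper, not part of Pyzor.
# It drops the headers (everything up to the first blank line) and
# prints a SHA-1 checksum of the remaining body.
body_digest() {
    sed '1,/^$/d' | sha1sum | cut -d' ' -f1
}

# Two copies of the same message body, delivered to different people.
a=$(printf 'To: alice@example.com\n\nBuy cheap pills\n' | body_digest)
b=$(printf 'To: bob@example.com\n\nBuy cheap pills\n' | body_digest)

[ "$a" = "$b" ] && echo "same digest"
```

Because only the digest is compared, a lookup of this kind does not need to send the text of your mail to the server.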

Installing Pyzor is very simple:

apt-get install pyzor

Once installed you need to make sure that it has the correct server address to query against. This can be achieved by running:

steve@skx2:~/bin$ pyzor discover
downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x

Once this has worked you can test that you may communicate with the server by running "pyzor ping" as follows:

steve@skx2:~/bin$ pyzor ping
66.250.40.33:24441      (200, 'OK')

Here we see that we managed to connect to the current server, 66.250.40.33, and received an "OK" response.

The next step is to filter our incoming mail through it. Inside your procmail configuration file ~/.procmailrc add this (before any normal handling, but after any setup steps you might have):

#
#  Check each message with Pyzor
#
:0 Wc
| pyzor check

#
# Add header to the mail identified as being spam.
#
:0 Waf
| formail -A 'X-Pyzor: spam'

#
#  Now filter anything that has the spam identifier header into a
# "spam" mailbox
#
:0:
* ^X-Pyzor: spam
spam

Breaking this down we have three distinct operations:

Checking the mail

This pipes a copy of the message (headers and body) through the pyzor process with the argument check.

Anything that has already been submitted by a pyzor user will match.

Adding a Flag

If the result of the "pyzor check" was a positive hit we add the header "X-Pyzor: spam" to the message. The "a" flag makes this recipe run only when the previous one completed successfully. The header can later be used by our procmail recipe.

Moving Flagged Messages

Any mail with the "X-Pyzor: spam" header is moved into a mailbox called "spam". We could have done this in the previous step, but separating it out allows us to do other processing if we wish.
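The same check-then-tag logic can be sketched in plain shell, which is handy for seeing what procmail is doing. This is only a sketch: fake_check is a stand-in I have invented for "pyzor check" (which I assume exits with status 0 on a hit, as the recipes above rely on), and tag_if_spam is a hypothetical helper rather than anything procmail or formail provides.

```shell
#!/bin/sh
# fake_check stands in for "pyzor check": it exits 0 on a "hit" so the
# sketch runs without Pyzor or a network connection.
fake_check() {
    grep -q 'cheap pills'
}

# Read a message on stdin. If the checker command named in $1 reports a
# hit, prepend the X-Pyzor header; otherwise pass the message through
# unchanged.
tag_if_spam() {
    msg=$(cat)
    if printf '%s\n' "$msg" | "$1"; then
        printf 'X-Pyzor: spam\n%s\n' "$msg"
    else
        printf '%s\n' "$msg"
    fi
}

# Prints "X-Pyzor: spam", the first line of the tagged message.
printf 'Subject: hi\n\nBuy cheap pills\n' | tag_if_spam fake_check | head -n 1
```

Keeping the tagging separate from the filing, as the procmail recipes do, means the header is still there for any later rule or mail reader to act on.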

SpamBayes

SpamBayes is a Bayesian, or learning, mail filter which will attempt to classify your mail into two categories: "spam" or "ham" (which is good mail). It will place mail it cannot classify with certainty into a third category called "unsure".

After a few days' training I found it to work very nicely in handling my incoming mail.

Installation is fairly simple:

apt-get install spambayes

Once installed you must create a database which you can do by executing "sb_filter.py -n". After that you must create a simple configuration file ~/.spambayesrc pointing the system at your previously created database.

The following is a sensible configuration file:

[Storage]
persistent_use_database = True
persistent_storage_file = ~/.hammiedb

With that out of the way you may now set up your procmail installation to filter using the system. The following is sufficient:

#
# Filter incoming mail via SpamBayes
#
:0 fw:hamlock
| /usr/bin/sb_filter.py

#
# Place matching messages into the Spam folder.
#
:0:
* ^X-Spambayes-Classification: spam
spam

This filters each incoming mail through the sb_filter.py process and moves each message classified as "spam" into the "spam" folder.

In the first few days it will be worth checking the results manually, but over time it should improve considerably.

Each incoming mail message will have a header added to it called X-Spambayes-Classification with a value of either spam, ham, or unsure. You may retrain these classifications when it makes mistakes by piping either individual messages, or whole mailboxes through the sb_filter.py program.
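While you are still checking the results by hand, a rough tally of those headers can show how the filter is doing. The following is a sketch only: count_class is a hypothetical helper of mine, and the sample mailbox is generated on the spot so the script runs on its own. It counts matching lines, so a message that quotes the header in its body would be counted too.

```shell
#!/bin/sh
# count_class MAILBOX LABEL (hypothetical helper): count the lines in
# MAILBOX that begin with the SpamBayes header for LABEL.
count_class() {
    grep -c "^X-Spambayes-Classification: $2" "$1"
}

# Build a tiny sample mailbox so the sketch is self-contained.
mbox=$(mktemp)
trap 'rm -f "$mbox"' EXIT
for label in spam ham unsure spam; do
    printf 'From sender@example.com Mon Jan  2 10:42:00 2006\n'
    printf 'X-Spambayes-Classification: %s\n\nbody text\n\n' "$label"
done > "$mbox"

# Report how many messages fall into each category.
for label in spam ham unsure; do
    echo "$label: $(count_class "$mbox" "$label")"
done
```

Against a real mailbox you would pass its path instead of the temporary file. A shrinking "unsure" count over the first few days is a reasonable sign that the training is working.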

I have my mail reader, mutt, set up so that I can press "S" to retrain the current message as "Spam", and "H" to retrain a message as "Ham". This is accomplished by the following settings in my ~/.muttrc file:


macro index S "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -s -f  >/dev/null\n<enter-command>set wait_key\n<delete-message>"
macro pager S "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -s -f  >/dev/null\n<enter-command>set wait_key\n<delete-message>"
macro index H "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -g -f  >/dev/null\n<enter-command>set wait_key\n"
macro pager H "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -g -f  >/dev/null\n<enter-command>set wait_key\n"

(Note that this deletes a message which I file as spam; not much point keeping it around).

To enable me to retrain messages which SpamBayes is unsure about I also highlight messages tagged "unsure" in yellow with the following configuration option:

color index yellow default "~h '^X-Spambayes-Classification: unsure'"

SpamBayes is a very nice program, and it has more features than those described here. Most importantly it can be used as both an SMTP and a POP3 proxy server, which allows you to interact with it via its own webserver.

The webserver allows you to train, or reclassify, incoming messages via your browser. Check the documentation for the sb_server.py program and the SpamBayes homepage for more details.

Tweaks

One thing that immediately stood out as a problem for me when I started using both Pyzor and SpamBayes was that there was no "whitelist" support.

I frequently receive email which has a high chance of matching as spam. These are usually short automatic mails from servers, etc.

To handle this I updated my .procmailrc to include some rudimentary whitelisting support. This allows me to skip the other checks for particular incoming addresses. Any address found in ~/.procmail_whitelist is assumed to be OK, and not checked with either Pyzor or SpamBayes.

Here is the updated code:

#
# ~/.procmailrc   - Spam filtering setup.
#
#
#  Look for whitelisted senders, who are excused the spam checks.
#

#
# Remove any whitelist headers from the incoming mail - somebody
# may be trying to spoof us.
#
:0 fhw
* ^X-whitelist:
| formail -I "X-whitelist"

#
# Add the whitelist tag if the message is from a user in our whitelist file.
#
:0 fhw
* ? test -s $HOME/.procmail_whitelist
* ? formail -rxTo: | fgrep -qisf $HOME/.procmail_whitelist
| formail -A "X-whitelist: yes"


#
#  Check each message with Pyzor unless the mail is whitelisted.
#
:0 Wc
* !^X-whitelist: yes
| pyzor check

#
# Add header to bogus mail.
#
:0 Waf
| formail -A 'X-Pyzor: spam'

#
# If pyzor marked it then we can move it into the spam folder.
#
:0:
* ^X-Pyzor: spam
spam

#
# After pyzor has run let SpamBayes test the mail.  Again we don't
# do this if the mail is from a whitelisted user.
#
:0 fw:hamlock
* !^X-whitelist: yes
| /usr/bin/sb_filter.py

#
# Move things that SpamBayes identifies into the spam folder.
#
:0:
* ^X-Spambayes-Classification: spam
spam
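The whitelist test in the recipe above can be tried outside procmail, which is useful when working out what to put in the file. This sketch makes a few assumptions: the addresses are invented, is_whitelisted is a hypothetical helper, and the sender address is passed in directly rather than extracted with formail. It uses "grep -F", which behaves like fgrep.

```shell
#!/bin/sh
# Hypothetical whitelist contents: one address per line.
whitelist=$(mktemp)
trap 'rm -f "$whitelist"' EXIT
printf '%s\n' 'cron@server.example.com' 'alerts@example.org' > "$whitelist"

# is_whitelisted ADDRESS: succeed if ADDRESS contains any line of the
# whitelist, compared case-insensitively and as a fixed string.
is_whitelisted() {
    printf '%s\n' "$1" | grep -F -q -i -s -f "$whitelist"
}

if is_whitelisted 'CRON@server.example.com'; then
    echo "whitelisted"
fi
if ! is_whitelisted 'stranger@example.net'; then
    echo "not whitelisted"
fi
```

Two things are worth knowing about this style of matching. Entries match as substrings, so a line containing just "example.org" would whitelist every sender at that domain. An empty line in the file matches every address, so keep the file free of blank lines.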


This article can be found online at the Debian Administration website at the following bookmarkable URL:

This article is copyright 2006 Steve - please ask for permission to republish or translate.