Spam filtering with Pyzor and SpamBayes

Posted by Steve on Mon 2 Jan 2006 at 10:42

Spam appears to be a fact of life for most of the online world at the moment. Here is how I personally handle the filtering of incoming mail, using a combination of Pyzor, SpamBayes and Procmail. These tools integrate nicely with one another, and work easily with my mail reader of choice: mutt.

Procmail is a mail processing tool which allows you to filter incoming mail into different mailboxes in a variety of ways. We've introduced procmail previously.

Using procmail several things can be achieved:

  • Incoming mail may be sent to different "mailboxes".
  • Programs may be executed to filter, process, or otherwise handle mail.

It is the latter that I use procmail for here, passing all incoming mail through two programs:

  • pyzor - A distributed spam identification system.
  • spambayes - A Bayesian "learning" spam filter.

Pyzor

Pyzor describes itself as a "collaborative, networked system to detect and block spam using identifying digests of messages". What this means is that each incoming message is reduced to a simple checksum, or digest, which is then looked up in an online database of messages that others have reported as being spam.
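
To make the digest idea concrete, here is a minimal shell sketch. It is not Pyzor's actual digest algorithm, which does its own normalisation of the message; it only assumes a generic approach of dropping the headers, stripping whitespace and hashing what remains. The function name body_digest and the sample messages are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, NOT Pyzor's real algorithm: hash the body of a
# message read from stdin, ignoring the headers and all whitespace.
#   sed  - drop everything up to and including the first blank line
#   tr   - remove whitespace so re-wrapped copies still match
#   sha1sum / cut - reduce the text to a fixed-size hex digest
body_digest() {
    sed '1,/^$/d' | tr -d '[:space:]' | sha1sum | cut -d' ' -f1
}

# Two copies of the same spam, sent to different people and wrapped
# differently.
msg_a='From: a@example.com
To: me@example.org

Buy cheap pills now'
msg_b='From: b@example.net
To: you@example.org

Buy cheap  pills
now'

digest_a=$(printf '%s\n' "$msg_a" | body_digest)
digest_b=$(printf '%s\n' "$msg_b" | body_digest)
echo "$digest_a"
echo "$digest_b"
```

Both lines of output should be identical, which is the property a shared database relies on: a recipient only has to send the short digest to the server, not the message itself.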

Installing Pyzor is very simple:

apt-get install pyzor

Once installed you need to make sure that it has the correct server address to query against. This can be achieved by running:

steve@skx2:~/bin$ pyzor discover
downloading servers from http://pyzor.sourceforge.net/cgi-bin/inform-servers-0-3-x

Once this has worked you can test that you may communicate with the server by running "pyzor ping" as follows:

steve@skx2:~/bin$ pyzor ping
66.250.40.33:24441      (200, 'OK')

Here we see that we managed to connect to the current server 66.250.40.33 and received an "OK" response.

The next step is to filter our incoming mail through it. Inside your procmail configuration file ~/.procmailrc add this (before any normal handling, but after any setup steps you might have):

#
#  Check each message with Pyzor
#
:0 Wc
| pyzor check

#
# Add header to the mail identified as being spam.
#
:0 Waf
| formail -A 'X-Pyzor: spam'

#
#  Now filter anything that has the spam identifier header into a
# "spam" mailbox
#
:0:
* ^X-Pyzor: spam
spam

Breaking this down we have three distinct operations:

Checking the mail

This pipes a copy of the message to the pyzor process with the argument check, and waits for its exit status.

Any message that has already been reported as spam by another Pyzor user will match.

Adding a Flag

If the result of the "pyzor check" was a positive hit we add the header "X-Pyzor: spam" to the message. This can later be used by our procmail recipe.

Moving Flagged Messages

Any mail with the "X-Pyzor: spam" header is moved into a mailbox called "spam". We could have done this in the previous step, but separating it out allows us to do other processing if we wish.
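
The first two recipes hinge on the exit status: as far as I know "pyzor check" exits with 0 when the message has been reported, and the "a" flag only lets the next recipe run if the previous one succeeded. The following shell sketch mimics that logic so it can be tried without a network connection. It assumes a stand-in checker called fake_check (hypothetical, it just greps for a phrase) and, unlike formail -A, it simply puts the new header on the first line.

```shell
#!/bin/sh
# Stand-in for "pyzor check" (hypothetical): exits 0 when the message on
# stdin looks like known spam, 1 otherwise. The real command asks a server.
fake_check() {
    grep -q 'cheap pills'
}

# Mirror of the first two recipes: hand a copy of the message to the
# checker and add the header only if the checker exited with status 0.
tag_message() {
    msg=$(cat)
    if printf '%s\n' "$msg" | fake_check; then
        printf 'X-Pyzor: spam\n%s\n' "$msg"
    else
        printf '%s\n' "$msg"
    fi
}

spam=$(printf 'Subject: hi\n\nBuy cheap pills now\n' | tag_message)
ham=$(printf 'Subject: lunch\n\nSee you at noon\n' | tag_message)
printf '%s\n---\n%s\n' "$spam" "$ham"
```

Only the first message gains the X-Pyzor header; the second passes through untouched, which is what lets the third recipe sort on that header alone.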

SpamBayes

SpamBayes is a Bayesian, or learning, mail filter which will attempt to classify your mail into two categories: "spam" and "ham" (which is good mail). It will place mail it cannot classify with certainty into a third category called "unsure".

After a few days of training I found it to work very nicely in handling my incoming mail.

Installation is fairly simple:

apt-get install spambayes

Once installed you must create a database, which you can do by executing "sb_filter.py -n". After that you must create a simple configuration file, ~/.spambayesrc, pointing the system at your previously created database.

The following is a sensible configuration file:

[Storage]
persistent_use_database = True
persistent_storage_file = ~/.hammiedb

With that out of the way you may now set up your procmail installation to filter using the system. The following is sufficient:

#
# Filter incoming mail via SpamBayes
#
:0 fw:hamlock
| /usr/bin/sb_filter.py

#
# Place matching messages into the Spam folder.
#
:0:
* ^X-Spambayes-Classification: spam
spam

This filters each incoming mail through the sb_filter.py process and moves each message classified as "spam" into the "spam" folder.

In the first few days it will be worth checking the results manually, but over time it should improve considerably.

Each incoming mail message will have a header added to it called X-Spambayes-Classification with a value of either spam, ham, or unsure. When SpamBayes makes mistakes you may retrain it by piping either individual messages or whole mailboxes through the sb_filter.py program.
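
While the filter is still learning, a quick tally of those headers shows how much mail is landing in each category. This is only a sketch: it assumes an mbox-format file with one classification header per message, and the tally function and the sample mailbox are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count how many messages in an mbox file carry each
# SpamBayes verdict. Matches on the start of the header line, so a score
# or other text after the verdict does not matter.
tally() {
    for verdict in spam ham unsure; do
        count=$(grep -c "^X-Spambayes-Classification: $verdict" "$1")
        echo "$verdict: $count"
    done
}

# Build a tiny sample mailbox so the sketch runs on its own.
mbox=$(mktemp)
cat > "$mbox" <<'EOF'
From a@example.com Mon Jan  2 10:00:00 2006
X-Spambayes-Classification: spam

Buy now

From b@example.com Mon Jan  2 10:05:00 2006
X-Spambayes-Classification: ham

Lunch?

From c@example.com Mon Jan  2 10:10:00 2006
X-Spambayes-Classification: unsure

Hello

EOF

result=$(tally "$mbox")
echo "$result"
rm -f "$mbox"
```

Pointing tally at a real mailbox instead of the sample gives a rough idea of how large the "unsure" pile is, and so how much retraining is still worthwhile.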

I have my mail reader, mutt, set up so that I can press "S" to retrain the current message as "Spam", and "H" to retrain a message as "Ham". This is accomplished by the following settings in my ~/.muttrc file:


macro index S "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -s -f  >/dev/null\n<enter-command>set wait_key\n<delete-message>"
macro pager S "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -s -f  >/dev/null\n<enter-command>set wait_key\n<delete-message>"
macro index H "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -g -f  >/dev/null\n<enter-command>set wait_key\n"
macro pager H "<enter-command>unset wait_key\n<pipe-entry>sb_filter.py -g -f  >/dev/null\n<enter-command>set wait_key\n"

(Note that this deletes a message which I file as spam; not much point keeping it around).

To enable me to retrain messages which SpamBayes is unsure about I also highlight messages tagged "unsure" in yellow with the following configuration option:

color index yellow default "~h '^X-Spambayes-Classification: unsure'"

SpamBayes is a very nice program, and it has more features than those described here. Most importantly it can be used as both an SMTP and a POP3 proxy server - which allows you to interact with it via its own webserver.

The webserver allows you to train, or reclassify, incoming messages via your browser. Check the documentation for the sb_server.py program and the SpamBayes homepage for more details.

Tweaks

One thing that immediately stood out as a problem for me when I started using both Pyzor and SpamBayes was that there was no "whitelist" support.

I frequently receive email which has a high chance of matching as spam. These are usually short automatic mails from servers, etc.

To handle this I updated my .procmailrc to include some rudimentary whitelisting support. This allows me to skip the other checks for particular incoming addresses. Any address found in ~/.procmail_whitelist is assumed to be OK, and not checked with either Pyzor or SpamBayes.

Here is the updated code:

#
# ~/.procmailrc   - Spam filtering setup.
#
#
#  Now look for whitelists which get excused the spam checks.
#

#
# Remove any whitelist headers from the incoming mail - somebody
# may be trying to spoof us.
#
:0 fhw
* ^X-whitelist:
| formail -I "X-whitelist"

#
# Add the whitelist tag if the message is from a user in our whitelist file.
#
:0 fhw
* ? test -s $HOME/.procmail_whitelist
* ? formail -rxTo: | fgrep -qisf $HOME/.procmail_whitelist
| formail -A "X-whitelist: yes"


#
#  Check each message with Pyzor unless the mail is whitelisted.
#
:0 Wc
* !^X-whitelist: yes
| pyzor check

#
# Add header to bogus mail.
#
:0 Waf
| formail -A 'X-Pyzor: spam'

#
# If pyzor marked it then we can move it into the spam folder.
#
:0:
* ^X-Pyzor: spam
spam

#
# After pyzor has run let SpamBayes test the mail.  Again we don't
# do this if the mail is from a whitelisted user.
#
:0 fw:hamlock
* !^X-whitelist: yes
| /usr/bin/sb_filter.py

#
# Move things that SpamBayes identifies into the spam folder.
#
:0:
* ^X-Spambayes-Classification: spam
spam
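
The whitelist condition above can be tried out by hand before trusting it with real mail. The sketch below assumes a stand-in function, is_whitelisted (hypothetical), which takes the sender address as an argument instead of extracting it with "formail -rxTo:", and uses a temporary file in place of ~/.procmail_whitelist.

```shell
#!/bin/sh
# Hypothetical stand-in for the two whitelist conditions: succeed when the
# whitelist file is non-empty and the sender matches any line of it.
# grep flags: -F fixed strings, -q quiet, -i ignore case,
#             -s hide file errors, -f read patterns from a file.
is_whitelisted() {
    test -s "$2" && printf '%s\n' "$1" | grep -Fqis -f "$2"
}

# Sample whitelist: one full address and one whole domain.
whitelist=$(mktemp)
printf '%s\n' 'cron@server.example.org' 'example.net' > "$whitelist"

report=$(
    for sender in Cron@Server.example.org alerts@example.net stranger@example.com; do
        if is_whitelisted "$sender" "$whitelist"; then
            echo "$sender: skip spam checks"
        else
            echo "$sender: scan"
        fi
    done
)
echo "$report"
rm -f "$whitelist"
```

Two things are worth knowing about this style of matching. Entries are substrings, so a bare domain whitelists every sender at that domain. And a blank line in the whitelist file matches every address, which would silently disable the spam checks, so keep the file free of empty lines.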


Posted by bdf (213.132.xx.xx) on Mon 2 Jan 2006 at 12:39
An alternative to SpamBayes is SpamOracle (apt-get install spamoracle). It offers the same facilities as described for SpamBayes in this article (although its internal classification algorithm might be less advanced - I haven't compared). The main advantage is that SpamOracle is not implemented in a scripting language; it therefore has a very small footprint and almost no dependencies. If you need to scan large volumes of e-mail, this can make a big difference in computer load.

Posted by Steve (82.41.xx.xx) on Mon 2 Jan 2006 at 13:09

Thanks for the pointer, it looks interesting although as you say I've no idea how well it works compared to SpamBayes.

You're right to point out the relatively high startup cost of running a scripted scanner. Right now that isn't a problem for me, but if it becomes one I can always switch to using a "persistent" scanner - kinda like the spamd/spamc programs available with SpamAssassin.

Ultimately I guess I'm not going to lose too much, since my .procmailrc file does make a lot of optimizations (more so than shown here) which avoid scanning a lot of incoming mail; just that which is otherwise flagged as being potentially spammy. The heuristics are pretty simple, but I've been pleased with how well they work out.

Steve

Posted by Anonymous (80.99.xx.xx) on Mon 2 Jan 2006 at 17:46
Hi,

First of all, sorry for my poor English.
I've got a question about this article.
Our company has an e-mail server (with postfix+mysql+amavisd-new+clamav) with a lot of virtual users. All of the e-mail accounts are in a MySQL database, so I am the only real user on the server. Can I set up this spam filtering technique in this environment?
Thanks!

Best regards,
Laszlo Laszlo

Posted by Steve (82.41.xx.xx) on Tue 3 Jan 2006 at 08:16

In general if you're able to execute scripts as mails come in (by some mechanism such as procmail) you can use it.

I think in your case you'll find it hard, since the messages will go from Postfix straight into the database?

Having said that you can use the SMTP / POP3 proxy support of SpamBayes - but the setup will be a lot different to that shown here. Investigate the documentation if you're curious about what will be involved.

Steve

Posted by Anonymous (69.76.xx.xx) on Mon 2 Jan 2006 at 18:14
Do you see any particular advantage to this setup over using SpamAssassin with Pyzor and its internal Bayesian classifier (along with all of its many other checks, RBLs, Razor, etc.)? Not evangelizing, just asking.

Posted by Steve (82.41.xx.xx) on Tue 3 Jan 2006 at 08:21

Good question!

I guess for me there are a couple of reasons to choose this approach instead of the SpamAssassin route.

Primarily my decision is based upon memories of previous SA releases which were a little bit involved to set up and maintain. I remember that I frequently had to adjust the points returned by particular tests to get SA to work nicely for my environment.

Whilst I'm happy to train a mail filter on individual messages (either way round) I'm not really wanting to fiddle within the guts of a filter to change particular tests - because 99% of the time I'm not sure if I'm making things worse or not.

Another advantage of using SpamBayes is that it has a lot of other options not explored here, such as being able to run as both a POP3 proxy server, and an SMTP proxy server - in these cases it can be controlled via a web interface!

Generally I find that SB is more lightweight, easier to tweak, and more flexible. It also doesn't suffer by trying to do everything it can, which is sometimes a problem for SA. (I find blacklists are a frequent problem, by design, and am very careful only to use very small blacklists which are locally maintained...)

That isn't a great comparison between the two, and as you can see I'm biased by previous SA usage, but those are the points I can think of so early in the morning!

Steve

Posted by Anonymous (85.250.xx.xx) on Sun 15 Jan 2006 at 14:59
I don't have Debian, so I had to install pyzor manually (download pyzor-0.4.0.tar.bz2 from SourceForge and then build and install it).
discover and ping work OK, but when I try to check an email I get this traceback:

# pyzor check < spam.eml
Traceback (most recent call last):
  File "/usr/bin/pyzor", line 4, in ?
    pyzor.client.run()
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 934, in run
    ExecCall().run()
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 188, in run
    if not apply(dispatch, (self, args)):
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 262, in check
    for digest in FileDigester(sys.stdin, self.digest_spec):
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 615, in __init__
    self.digester = iter(get_file_digester(fp, spec, mbox))
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 632, in get_file_digester
    return (DataDigester(rfc822BodyCleaner(fp),
  File "/usr/lib/python2.4/site-packages/pyzor/client.py", line 678, in __init__
    self.multifile.next()
  File "/usr/lib/python2.4/multifile.py", line 120, in next
    while self.readline(): pass
  File "/usr/lib/python2.4/multifile.py", line 92, in readline
    if marker == self.section_divider(sep):
  File "/usr/lib/python2.4/multifile.py", line 155, in section_divider
    return "--" + str
TypeError: cannot concatenate 'str' and 'NoneType' objects

(same for legit messages)

What version of pyzor do you use?
What version of Python?

Any idea how I can fix this problem?

Posted by Steve (82.41.xx.xx) on Sun 15 Jan 2006 at 15:06

Looks like a bug in something. Sadly my Python skills are limited, so I'd suggest you look to see if there is a mailing list/bug reporting address on the project homepage.

For reference I'm using Debian Sarge with Python version 2.3.5, and pyzor 1:0.4.0+cvs20030201-3

Steve
