Spam Words

This service is turned off as of 26feb2005. We are evaluating another spam filter called qsf, and will probably resume publishing a spam words list using that filter in a few weeks.

Here at ACME Labs we get a lot of spam - almost a million messages per day. We use a number of different filters to deal with it, from IP blacklists to the ClamAV worm detector to procmail patterns to the bogofilter learning Bayesian filter. Because we get so much spam, our bogofilter wordlist has become extremely effective. We'd like to share that effectiveness with you.

Bogofilter has an option to use multiple wordlists at the same time. What this page suggests is that you use your own local wordlist, and use ACME's wordlist as well - but only the spam part of it. Many people have thought of combining other people's bogofilter wordlists for greater filtering effectiveness, but the problem is that the legitimate email different people get varies a lot, so sharing whole filters doesn't work very well. However, everyone gets pretty much the same load of spam. Sharing just the spam part of the filter should work quite well indeed.

So, how does it work? Every night around 11pm Pacific Time, ACME's mainframe extracts the spam words from our wordlist and puts them on our web server. Here's the shell script we use - it takes a few minutes to run. Then every night around midnight Pacific Time, a cron job on your machine fetches the wordlist and installs it on your system. Here is the wordlist:

acmespamlist.db
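
The nightly fetch-and-install step could look something like the sketch below. It is only an illustration, not a script ACME provides: the URL is a placeholder, the destination assumes the default '~/.bogofilter' directory, and the function names are made up for this example. The download is written to a temporary file and then renamed into place, so bogofilter never sees a half-written wordlist.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Placeholder URL (an assumption): substitute the real location of the list.
WORDLIST_URL = "http://example.com/acmespamlist.db"
# Assumes the default bogofilter directory, ~/.bogofilter.
DEST = Path.home() / ".bogofilter" / "acmespamlist.db"

# A crontab entry along these lines would run the fetch at midnight
# (the script path is hypothetical):
#   0 0 * * * python3 $HOME/bin/fetch_acme_wordlist.py


def install_atomically(data: bytes, dest: Path) -> None:
    """Write data to a temp file next to dest, then rename it over dest."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Renaming within one directory replaces the old file in a single step.
        os.replace(tmp_name, dest)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def fetch_and_install(url: str = WORDLIST_URL, dest: Path = DEST) -> None:
    """Download the wordlist and install it, keeping the old one on failure."""
    with urllib.request.urlopen(url, timeout=60) as response:
        data = response.read()
    if not data:
        raise RuntimeError("empty download; keeping the existing wordlist")
    install_atomically(data, dest)
```

A script run from cron would simply call fetch_and_install(). If the download fails or comes back empty, the previous night's list stays in place.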

To use it, you add two options like these to your bogofilter config file (~/.bogofilter.cf):

wordlist=r,local,wordlist.db,1
wordlist=r,acme,acmespamlist.db,2

You can also put the options on your bogofilter command line if you prefer:

--wordlist=r,local,wordlist.db,1 --wordlist=r,acme,acmespamlist.db,2

The first option tells bogofilter about your local list - with two lists, you can't just let it default. The second tells bogofilter to also use ACME's list. The filenames are relative to your bogofilter_dir setting, which defaults to '~/.bogofilter'. Any training you do with the -n/-s/-u flags goes into the first wordlist - your local one.

The numbers on the end (,1 ,2) mean that the ACME list only gets looked at for words which don't appear at all in your local list. Your local list takes precedence. You can also use the same number on both (,1 ,1) which means bogofilter will always look words up in both lists and add the counts together.
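
To make the precedence rule concrete, here is a small model of the lookup as described above. It is a sketch of the behavior this page describes, not bogofilter's actual code; the function name, the in-memory dictionaries, and the sample counts are all invented for illustration.

```python
from itertools import groupby


def lookup(word, wordlists):
    """Return (spam_count, good_count) for word.

    wordlists is a list of (order, counts) pairs, where counts maps a
    word to (spam_count, good_count). Lists are consulted in ascending
    order; lists sharing an order number have their counts added
    together, and the search stops at the first order that knows the word.
    """
    ordered = sorted(wordlists, key=lambda wl: wl[0])
    for _, group in groupby(ordered, key=lambda wl: wl[0]):
        spam = good = 0
        found = False
        for _, counts in group:
            if word in counts:
                s, g = counts[word]
                spam += s
                good += g
                found = True
        if found:
            return spam, good
    return 0, 0  # word is in none of the lists


# Made-up sample data: a small local list and a spam-only shared list.
local = {"viagra": (3, 0), "meeting": (0, 12)}
acme = {"viagra": (9000, 0), "mortgage": (5000, 0)}

# Orders (,1 ,2): the local list takes precedence.
print(lookup("viagra", [(1, local), (2, acme)]))    # (3, 0)
print(lookup("mortgage", [(1, local), (2, acme)]))  # (5000, 0)

# Orders (,1 ,1): both lists are consulted and the counts are added.
print(lookup("viagra", [(1, local), (1, acme)]))    # (9003, 0)
```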

If you use this, please let me know so we can try to figure out whether it actually works! And if you have a wordlist which you think is also particularly good, by all means copy this idea and export yours too. Let me know and I'll link to your list.