StandBayeMail

What is StandBayeMail?

StandBayeMail is a Java application that classifies all the mail coming from a POP3 server in real time, telling whether each message passing through is spam or not. To do this, StandBayeMail uses a POP3 proxy that applies a Bayesian algorithm to every mail coming from the server. When a message holds good enough evidence to be considered spam, its Subject: is marked with the word "[SPAM]" and the header "X-StandBayeMail-Spamicity:" is added, followed by a "spamicity" value. When the message is proxied back to the client, simple filtering rules are therefore enough to remove unwanted messages automatically.

News

Version 1.0.0 is finally out!

License

StandBayeMail, a bayesian mail proxy filter.
(C) 2004 by Luca M. Viola

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

the Free Software Foundation, Inc.
59 Temple Place - Suite 330
Boston, MA 02111-1307
USA

Getting Started

In the archive you will find the file StandBayeMail.jar, an example configuration and a minimal spam words database. It is important to understand that the Bayesian filter "learns" your style in classifying emails. What you consider spam might not be considered as such by somebody else. The included spam database, then, although useful for a quick start, might not be sufficient. It is supplied anyway because it is unlikely that you have saved enough spam to do a significant training out of the box. You should, instead, have plenty of good emails. A good number for a significant training is about 4000 emails in each domain, although you can start seeing good results with much, much less. While you receive your mail, if the filter misclassifies a message you can still teach it which domain the message belongs to. This will improve the accuracy of future classifications. Let's now see the steps needed to start.

Initial training

The first thing to do is the training. What you need here is 2 files, one with all of your good mail and one with your spam. If you have more than one account, you might want to create a temporary folder in which you can merge all of your mail. If you have no spam you can start with the supplied "spam.sps" file, although you should not consider it a perfect match for you, because it does not know anything about the spam that's *specific* to you. You could also wait for some time and collect enough messages in a dedicated "spam" folder, created in your mail client only for manually collecting spam messages.

WARNING: IT IS REALLY IMPORTANT THAT MESSAGES ARE CORRECTLY CLASSIFIED. IF THE FILTER FINDS GOOD EMAILS IN THE SPAM FOLDER OR VICE VERSA, IT MIGHT NOT HAVE A WAY TO CORRECTLY RECOGNIZE SIMILAR INCOMING MESSAGES. THEREFORE, BEFORE STARTING, IT IS A GOOD IDEA TO DO SOME CLEANING, MOVING EMAILS INTO THE CORRECT FOLDERS.

StandBayeMail recognizes the standard unix mailbox format and Outlook Express' dbx format, therefore it can be used with mail clients that use such formats or that can at least export to the unix mailbox format. To find out whether your client supports the unix mailbox format, you can just locate your mail folder files on the filesystem and open them with a regular text editor. If every mail header starts with a line similar to


From name@address Sat May 03 16:42:15 2003

and the file itself is in plain text format you are set. In any case, before importing/learning, StandBayeMail tries to verify that the file format is correct. The folder files you just located have to be used for training. In practice, suppose you have 2 folders called Inbox.mbx and spam.mbx, as used by a hypothetical mail client. Unzip the StandBayeMail archive in a directory and go to that directory using the command line prompt. Once you are there, type:


java -jar StandBayeMail.jar

and you will see the option list, which will look like this:


Usage: StandBayeMail <mailbox|outlookexpress> <spam|mail|test> <mailboxfile>

<mailbox|outlookexpress>
    : specify if the mailbox file is a regular unix mbox or
    : an outlook express dbx file.
<spam|mail|test>
    : The switch "spam" or "mail" will add the <mailboxfile>'s words
    : to either the good words' database or the spam words'.
    : The switch "test" will check the <mailboxfile> applying
    : the bayesian filter.
<mailboxfile>
    : specify the path to the mailbox file.

Examples:
  java -jar StandBayeMail.jar mailbox mail c:\mail\in.mbx
  java -jar StandBayeMail.jar mailbox test c:\mail\new.mbx
  java -jar StandBayeMail.jar outlookexpress spam spam.dbx

All parameters are mandatory.

The three command line parameters are all mandatory. To start the learning of your good mail in mailbox format, we will write something like


java -jar StandBayeMail.jar mailbox mail c:\pathto\mailclient\Inbox.mbx

( on a unix system it might be something like:


java -jar StandBayeMail.jar mailbox mail /var/spool/mail/username

)

The first parameter ("mailbox") says that we want to import an mbox format file, the second ("mail") that we are training the filter on good mail, and the third is the path to the mailbox file. Similarly, to train on spam, you need the command:


java -jar StandBayeMail.jar mailbox spam c:\pathto\mailclient\spam.mbx

(please note that the second parameter changed from "mail" to "spam"). StandBayeMail will keep you informed about the progress of the operation. Should you notice a huge slowdown at a certain point, or should you get a java.lang.OutOfMemoryError, the mailbox is probably too big and the word tokenization is stressing the garbage collector and the heap memory. In such cases your options are either to increase the minimum and maximum heap memory or to split bigger mailboxes into smaller ones. This might happen more frequently when training on spam, because such mails can contain more tokens (words) than your regular ones. The above mentioned commands might therefore become something like


java -ms128m -mx512m -jar StandBayeMail.jar mailbox spam c:\pathto\mailclient\spam.mbx

with which we ask for a minimum heap size of 128 megabytes and a maximum of 512 megabytes (recent JVMs spell these options -Xms128m and -Xmx512m). If you are importing from Outlook Express, you first have to find the path to the folder that you want to use for training. You can do this by right-clicking on the folder name (on the left of Outlook Express) and selecting "Properties". You will get a dialog from which you can "Select All" the path to your mail folder, then use it with Copy/Paste. For example, after "Copy"ing the file path selection, you can give the command


java -jar StandBayeMail.jar outlookexpress spam 
                                           "C:\Documents and Settings\
                                            Administrator\Local Settings\
                                            Application Data\Identities\
                                            {9D1DD68B-867C-4FFF-85FD-F1465ACFF358}\
                                            Microsoft\Outlook Express\SPAM.dbx"

(The double quotes " delimiting the path name are necessary because some directory names in the path may contain spaces). The above command will read the dbx file and train from it. If you want to check in detail how the dbx file is extracted you can use the command


java -cp StandBayeMail.jar oeimport.dbxImport <path_file_dbx> <output file>

As a result, in <output file> you'll find the content of the dbx file converted to unix mbox format.
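If you want to double check a mailbox before training, the "From " test described earlier in this section can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class MboxCheck and its methods are hypothetical names, not part of StandBayeMail, and the pattern only covers separator lines shaped like the example shown above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of StandBayeMail: counts the messages in a
// unix mailbox by looking for "From " separator lines.
final class MboxCheck {
    // Matches lines such as "From name@address Sat May 03 16:42:15 2003".
    private static final Pattern SEPARATOR = Pattern.compile(
        "^From \\S+ +\\w{3} \\w{3} +\\d{1,2} \\d{2}:\\d{2}:\\d{2} \\d{4}.*$");

    static boolean isSeparator(String line) {
        return SEPARATOR.matcher(line).matches();
    }

    // Returns how many separator lines (that is, messages) were found.
    static int countMessages(Iterable<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (isSeparator(line)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // ISO-8859-1 maps every byte to a character, so reading never fails
        // on mail that is not valid UTF-8.
        int n = countMessages(
            Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1));
        System.out.println(n == 0 ? "does not look like mbox" : n + " messages found");
    }
}
```

After compiling it, run it with the path to a folder file as its only argument. A count of zero suggests the file is not in unix mailbox format.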

Proxy configuration

On completion of the previous phase you can set up the POP3 proxy. Open the file StandBayeMail.config with a text editor; it will look something like:


#################################################
#
#   user list for java proxy server
#
#################################################

loginname  = pop3.server.address.here

For example, if you have an account itsme@myprovider.com, your username is "itsme" and the pop3 server address is "pop.myprovider.com", the above line will become


itsme = pop.myprovider.com

You can put here all of your user names and the related pop3 server addresses. The StandBayeMail pop3 proxy is able to handle multiple accounts: when your client sends a request, the proxy checks the requested user name and connects to the corresponding pop3 server.

When you are done you can save the config file.
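To make the user-to-server mapping concrete, here is a minimal sketch of how a file in this "name = value" layout can be read in Java. It assumes the standard java.util.Properties class is a good enough parser for it (lines starting with # are skipped as comments). Whether StandBayeMail itself reads the file this way is an assumption, and ProxyConfigSketch is a hypothetical name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch, NOT StandBayeMail's own code: maps each login name
// to the pop3 server the proxy should contact for it.
final class ProxyConfigSketch {
    // Parses "loginname = pop3.server.address" lines; '#' lines are comments.
    static Properties parse(String configText) {
        Properties users = new Properties();
        try {
            users.load(new StringReader(configText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return users;
    }

    // Returns the server for a login name, or null if the user is unknown.
    static String serverFor(Properties users, String loginName) {
        return users.getProperty(loginName);
    }

    public static void main(String[] args) {
        String config =
            "#   user list for java proxy server\n" +
            "itsme = pop.myprovider.com\n";
        Properties users = parse(config);
        System.out.println(serverFor(users, "itsme"));
    }
}
```

With several accounts you simply get one entry per login name, which is all the proxy needs to pick the right server for each incoming request.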

Mail client configuration

When you are done with the previous point you have to configure your email client. For every user that you specified in the config file you must create an account in the mail client. As the pop3 server address you must specify "localhost", and as the user name the same one that you set in the configuration file. Once this is done you have to create, for each account if you wish, a folder called "Spam", one called "FalsePositives" and one called "FalseNegatives". Finally you have to create a rule in your email client to automatically move incoming spam to the folder Spam. In order to do this, you need to know that when the pop3 proxy detects a spam message it adds the string "[SPAM]" to the Subject: of the incoming message and it adds a header like "X-StandBayeMail-Spamicity: 0.9999988374837483" among the others in the message (the decimal number after the ":" varies according to the "spamicity" of the message). Therefore the filter rule in the mail client should be one of the following:

If the Subject contains "[SPAM]" move the message to the folder "Spam"

If the header X-StandBayeMail-Spamicity exists move the message to the folder "Spam"

The way of creating message filter rules is specific to each mail client, but with these 2 possibilities you should be able to work with most of them.
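For readers who prefer code to dialogs, the two rules above boil down to the following check. It is just a sketch of the logic a mail client applies; SpamRuleSketch is a hypothetical name and the headers are assumed to be already parsed into a name/value map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the two client-side rules, NOT part of StandBayeMail.
final class SpamRuleSketch {
    // True if either rule fires: the spamicity header exists,
    // or the Subject contains the "[SPAM]" marker.
    static boolean isMarkedSpam(Map<String, String> headers) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            String name = header.getKey();
            if (name.equalsIgnoreCase("X-StandBayeMail-Spamicity")) {
                return true;
            }
            if (name.equalsIgnoreCase("Subject")
                    && header.getValue().contains("[SPAM]")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Subject", "[SPAM] Buy now");
        headers.put("X-StandBayeMail-Spamicity", "0.9999988374837483");
        System.out.println(isMarkedSpam(headers) ? "move to Spam" : "leave in Inbox");
    }
}
```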
Once you are done with this last task you are ready to go. Run the script

StandBayeMail.bat (Windows)

StandBayeMail.sh (Unix)

and ask the client to fetch the incoming mail. If everything is working you will see the folders Inbox and Spam getting automatically populated.

Everyday training

Now we just have to explain the purpose of the folders "FalsePositives" and "FalseNegatives". They exist because, especially at the beginning with a small mail corpus, the Bayesian filter could misinterpret some incoming messages. When this happens human intervention is necessary. In the folder "FalsePositives" we put the mail falsely judged positive to the "Is it spam?" question. That is, REGULAR messages that by mistake went to the SPAM folder. StandBayeMail does its best to limit this event, by judging unknown words "a bit better" than statistical neutrality. This is in fact a very rare event, but if it happens the mail has to be moved manually to the "FalsePositives" folder so that the filter can learn from it. The learning of the false positives happens exactly as described in the paragraph Initial training, but as a path we use the one to the "FalsePositives" folder. As an example, for Outlook Express, that means a command like


java -jar StandBayeMail.jar outlookexpress mail "C:\..\Outlook Express\FalsePositives.dbx"

Similarly, false negatives are spam mails that pass through and go to your Inbox. Of course this event is less troubling than the previous one, and training the filter on false negatives will ease the recognition of similar messages in the future. To do it you can use the command


java -jar StandBayeMail.jar mailbox spam /home/username/mail/FalseNegatives

(which describes a typical unix environment situation). After doing the training you can empty the FalsePositives and FalseNegatives folders. Periodically you might want to teach the filter everything about new spam and new good mail, even when they were categorized correctly. My suggestion is that once you reach a good corpus of about 4000 emails for each category, you continue the training only on false positives and false negatives. If the word database gets too large, in fact, the lookups in the HashMaps could start getting slower. If you want to keep training the filter on ALL spam and good mail anyway, even the messages correctly detected, you can do it by the means explained in the Initial training section. The only thing you should be careful about is not to retrain the filter on the same emails used in the last learning session. That is, you train it on new spam, then delete the content of the spam folder and start collecting new spam for the next training session. In a similar way you should train the filter only on the new good mail that you received after the last learning session. This is necessary to preserve the statistical frequency with which the words counted in your mail corpus actually appeared in your mail folders.
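Why retraining on the same emails hurts can be seen with a toy word counter. This is a sketch under assumptions: WordCountSketch is a hypothetical class, and its tokenization (lower-casing, then splitting on non-word characters) is a simplification that does not claim to match what StandBayeMail really does.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy word database, NOT StandBayeMail's own code.
final class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Adds every word of the message to the database.
    void train(String message) {
        for (String token : message.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // How many times the word was seen during training.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        WordCountSketch spamWords = new WordCountSketch();
        String mail = "Cheap pills, really cheap!";
        spamWords.train(mail);
        System.out.println(spamWords.count("cheap")); // prints 2
        spamWords.train(mail);                        // same mail again
        System.out.println(spamWords.count("cheap")); // prints 4
    }
}
```

After the second session the word looks twice as frequent as it really was in your mail, and this is exactly the distortion you avoid by emptying the folders after each training.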

Frequently Asked Questions

Does it really work?!

Everyone using the prerelease version (including me) has had a really great experience so far. So the answer would be a straight and clear yes. Of course, since you are a specific user with your own spam receiving profile, your mileage may vary.

What do I need to run this?

You need an O/S with an available Java runtime. This program has been tested on Windows, Linux, Solaris and Tru64. It has been tested on a variety of JREs/JDKs, ranging from the Jeode for Ipaq, roughly equivalent to a trimmed down jre 1.1.8, to the jdk1.5 beta. You might also need at least 256 MB of memory during the training process, especially if you deal with big mailboxes (>20 MB), unless you are willing to cut them down to smaller files and do the training on the single pieces.

Why this name?

It's a pun: it comes from the words "Stand-by mail" and the word Bayesian, which refers to the algorithm used, invented in the 1700s by the Reverend Thomas Bayes.

...Bayesian?!

Rev. Bayes' algorithm gives a method to calculate the combined probability that an event may occur when there is some contrasting evidence. Suppose, for example, that we ask 4 people if it's going to rain on the next day. Suppose we also know how often they answered this question correctly in the past:
A says yes, and he has a past 40% precision
B says no, and she has a past 50% precision
C says no, and he has a past 60% precision
D says yes, and she has a past 50% precision
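
As a worked version of this example, the sketch below combines the four opinions with the usual formula for independent evidence, prod(p) / (prod(p) + prod(1 - p)). Two assumptions are made here and are not stated in the text: the opinions are treated as independent, and a "no" with precision x is read as a probability of rain of 1 - x. RainExample is a hypothetical name.

```java
// Hypothetical worked example, NOT StandBayeMail's own code.
final class RainExample {
    // Combines independent probabilities of the same event:
    // prod(p) / (prod(p) + prod(1 - p)).
    static double combine(double[] probabilities) {
        double yes = 1.0;
        double no = 1.0;
        for (double p : probabilities) {
            yes *= p;
            no *= 1.0 - p;
        }
        return yes / (yes + no);
    }

    public static void main(String[] args) {
        double[] rain = {
            0.40,        // A says yes, 40% precision
            1.0 - 0.50,  // B says no, 50% precision
            1.0 - 0.60,  // C says no, 60% precision
            0.50         // D says yes, 50% precision
        };
        System.out.println("P(rain) = " + combine(rain));
    }
}
```

With these numbers the products are 0.04 for "rain" and 0.09 for "no rain", so the combined probability of rain is 0.04 / 0.13, about 0.31. Note how B and D, at 50%, do not move the result at all.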

Bayes' algorithm is able to combine all these answers together to determine the overall probability of the event. This approach requires a "history" of the events, which must be defined beforehand, and it did not seem possible to define it from a totally neutral point of view. For this very reason, after an initial hype at the beginning of the 20th century, the use of this method faded into oblivion starting from the '30s. That lasted for about five decades, but from the '80s this method was reintroduced and today it is an important instrument in many fields, from medicine to economics, engineering and law. With computers and the possibility to create and handle large databases, the lack of neutrality in determining the previous history of certain events was minimized, if not eliminated.

Its application to e-mail classification was proposed by Paul Graham in his interesting article "A plan for Spam", in August 2002. In that paper Graham suggests creating an initial database of "good" mail and "spam" by simply partitioning them manually. In such a database every word of the mails is counted to see how many times it occurs in the 2 domains (good and spam). Then, for every incoming message, we select a maximum of N words and we check how many times they occur in the 2 domains, obtaining N probabilities that the message examined is spam. Such N probabilities are passed to the Bayes algorithm, which combines them together, reaching a very high precision in classification. This is, briefly put, what StandBayeMail does, also integrating a few ideas described in Paul Graham's follow-up article "Better Bayesian Filtering".
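
The scheme from "A plan for Spam" can be sketched roughly as follows. The constants used (good counts doubled, words seen fewer than 5 times scored 0.4, probabilities clamped between 0.01 and 0.99) are the ones given in Graham's article. StandBayeMail's own values and code may well differ, and GrahamSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch following Graham's article, NOT StandBayeMail's code.
final class GrahamSketch {
    // Probability that a message containing the word is spam.
    // goodMessages and badMessages must both be greater than zero.
    static double wordProbability(int goodCount, int badCount,
                                  int goodMessages, int badMessages) {
        int g = 2 * goodCount;  // good mail counts double, to limit false positives
        int b = badCount;
        if (g + b < 5) {
            return 0.4;         // too rare to judge: slightly "innocent"
        }
        double goodRatio = Math.min(1.0, (double) g / goodMessages);
        double badRatio = Math.min(1.0, (double) b / badMessages);
        return Math.max(0.01, Math.min(0.99, badRatio / (goodRatio + badRatio)));
    }

    // Keeps the maxWords probabilities farthest from 0.5 and combines them.
    static double spamicity(List<Double> wordProbabilities, int maxWords) {
        List<Double> sorted = new ArrayList<>(wordProbabilities);
        sorted.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        double spam = 1.0;
        double good = 1.0;
        for (double p : sorted.subList(0, Math.min(maxWords, sorted.size()))) {
            spam *= p;
            good *= 1.0 - p;
        }
        return spam / (spam + good);
    }

    public static void main(String[] args) {
        List<Double> message = new ArrayList<>();
        message.add(wordProbability(0, 10, 100, 100)); // seen only in spam
        message.add(wordProbability(0, 25, 100, 100)); // seen only in spam
        message.add(wordProbability(1, 1, 100, 100));  // hardly ever seen
        System.out.println("spamicity = " + spamicity(message, 15));
    }
}
```

Graham uses the 15 most interesting words per message, which is why the example passes 15 as the limit.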

Will you implement [...]?

I am actually looking forward to some feedback and/or suggestions. Of course I will be more than happy to implement the most interesting features requested. And of course I will also give contributions due consideration.

Why do spam messages still pass through/do I see errors on startup?

If that is happening, either the training process was not significant (not enough mails), or you are just using the supplied spam.sps file, or you are running StandBayeMail.jar from a directory in which there are no *.sps and/or StandBayeMail.config files. Such files in fact MUST stay in the same directory where you run StandBayeMail from. Do not rely too much on the supplied spam.sps file either, as it can't really reflect 100% your personal spam profile.

Can I compile this with gcj?

I tried with earlier versions, and they would compile fine, although with some run-time heap problems. This final release has some parts in the source code that do not compile (due to minor API differences with the Sun JDK). So the answer is "probably yes, but with some tweaking".

What meaning do you give to the version number?

The version number is composed of three different parts:

Major.Minor.Revision (e.g. 1.0.0)

All the numbers are non-negative, increasing integers. "Revision" is incremented only when a new release of StandBayeMail just contains bug fixes. "Minor" is increased whenever new features are added, while "Major" is incremented if very important changes happen. When "Minor" or "Major" is increased, everything on its right is reset to zero. As a further rule, every time a new compilation is made there is a "build" number that gets automatically increased and managed through my build-release scripts. You can read the build number when StandBayeMail starts.
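
The numbering rules above fit in a few lines of code. The sketch below only illustrates the scheme: VersionSketch is a hypothetical name, it has nothing to do with my build-release scripts, and the build number is left out.

```java
// Hypothetical illustration of the Major.Minor.Revision rules.
final class VersionSketch {
    final int major;
    final int minor;
    final int revision;

    VersionSketch(int major, int minor, int revision) {
        this.major = major;
        this.minor = minor;
        this.revision = revision;
    }

    // Parses a string such as "1.0.0".
    static VersionSketch parse(String text) {
        String[] parts = text.split("\\.");
        return new VersionSketch(Integer.parseInt(parts[0]),
                                 Integer.parseInt(parts[1]),
                                 Integer.parseInt(parts[2]));
    }

    // Bug fixes only: Revision grows.
    VersionSketch bugFix() {
        return new VersionSketch(major, minor, revision + 1);
    }

    // New features: Minor grows, Revision is reset to zero.
    VersionSketch newFeature() {
        return new VersionSketch(major, minor + 1, 0);
    }

    // Very important changes: Major grows, the rest is reset to zero.
    VersionSketch majorChange() {
        return new VersionSketch(major + 1, 0, 0);
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + revision;
    }

    public static void main(String[] args) {
        VersionSketch v = parse("1.0.0");
        System.out.println(v.bugFix());              // prints 1.0.1
        System.out.println(v.bugFix().newFeature()); // prints 1.1.0
        System.out.println(v.majorChange());         // prints 2.0.0
    }
}
```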

HELP!!!!

Here I come!

What if I find a bug?

Please let me know.

Can I exchange ideas about this with other users?

Sure thing!

Your english seems a bit odd...

If you find it odd, that's probably because I am not a native speaker. My apologies in advance if anything sounds difficult to understand.

I am a spammer and I really hate you for doing this.

Right. I hope you can get back at least 0.001% of the frustration that the whole world goes through because of you guys. With just that amount of frustration you would have to use a massive amount of s0ma, cyalys and víagra to get some action. And when I just stop at analyzing your job I understand that probably what you must REALLY need is some of your own very well (and obsessively) advertised "enlargements", so that you'll get more confident in yourself. Well, you sure should know where to get all that stuff ;-) If not, I suggest that you go selling flowers, or something less damaging to others like that.

Do you accept donations?

As a matter of fact, yes. If you find this project useful, if it solves the spam problem for you, if you just like it and feel like giving financial support, that will be appreciated. All donations will be exclusively used for the project (new hardware, documentation, connectivity) and part of them automatically transferred to the Free Software Foundation.

Download

Download StandBayeMail v1.0.0 (zip file, including binaries and source code)

References

(1) "Combining probabilities"
     http://www.mathpages.com/home/kmath267.htm

(2) "How Bayes' Theorem can change your life"
     Focus Magazine UK, n. 120 November 2002 pag. 56
     http://www.focusmag.co.uk

(3) "A plan for Spam" by Paul Graham
     http://www.paulgraham.com/spam.html

(4) "Better Bayesian Filtering" by Paul Graham,
     http://www.paulgraham.com/better.html

(5) "Two Spam Filters 10 Times As Accurate As Humans"
     http://yro.slashdot.org/yro/04/02/24/0025219.shtml

Known Issues

The memory usage while training on big spam mailboxes can be really excessive. This problem was investigated and improved a lot, but it is still far from being optimal. During the tokenization of words StandBayeMail creates lots of temporary objects that need to be kept in memory, then the tokens are stored in HashMaps. This will be fixed. During normal operations, instead, the memory usage is quite acceptable (around 16-20 MB).
Many parameters are hard coded; they should be settable from the command line and/or the config file. This will be rationalized in a more general fashion with the GNU Getopt library.
Some parts of the code should be optimized better and/or rewritten without a "rush mode" ;-) I am aware of this, but I just wanted to get started. There will surely be improvements in the code quality as the application evolves.

To do...

Better memory handling during training

Smart whitelist builder

Better command line options management using GNU Java GetOpt

Proxy for the IMAP protocol

Server Side support (for ex. with PostFix or SendMail)

Import from formats other than mailbox or dbx (Outlook MAPI?)

Evolving the filter toward Markovian or Dobly algorithms, which can be 10 times more precise than a human.