What is StandBayeMail?

StandBayeMail is a Java application that classifies, in real time, all the mail coming from a POP3 server, telling whether each message passing through is spam or not. To do this, StandBayeMail uses a POP3 proxy that applies a Bayesian algorithm to every mail coming from the server. When a message holds good enough evidence to be considered spam, its Subject: is marked with the word "[SPAM]" and the header "X-StandBayeMail-Spamicity:" is added, followed by a "spamicity" value. When the incoming message is proxied back to the client, simple filtering rules can then remove unwanted messages automatically.
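The marking step described above can be sketched in a few lines of Java. This is only an illustration, not StandBayeMail's actual code: the class name SpamTagger, the method tag and the 0.9 threshold are assumptions (the page only says "good enough evidence"), and folded (multi-line) headers are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpamTagger {
    // Hypothetical threshold: the real cut-off used by StandBayeMail is not documented here.
    static final double SPAM_THRESHOLD = 0.9;

    /** Returns a copy of the header lines, marked as spam if spamicity reaches the threshold. */
    static List<String> tag(List<String> headers, double spamicity) {
        if (spamicity < SPAM_THRESHOLD) {
            return new ArrayList<>(headers);          // good mail passes through unchanged
        }
        List<String> out = new ArrayList<>();
        for (String line : headers) {
            // Header names are case-insensitive, so compare the prefix ignoring case.
            if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                out.add("Subject: [SPAM] " + line.substring(8).trim());
            } else {
                out.add(line);
            }
        }
        out.add("X-StandBayeMail-Spamicity: " + spamicity);
        return out;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("From: someone@example.com", "Subject: Cheap pills");
        tag(headers, 0.9999).forEach(System.out::println);
    }
}
```

A message without a Subject: line would only receive the extra header in this sketch, which is one reason a client rule based on the header can be more reliable than one based on the subject.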

News

Version 1.0.0 is finally out!

License

StandBayeMail, a bayesian mail proxy filter.
(C) 2004 by Luca M. Viola

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

the Free Software Foundation, Inc.
59 Temple Place - Suite 330
Boston, MA 02111-1307
USA

Getting Started

In the archive you will find the file StandBayeMail.jar, an example configuration and a minimal spam words database. It is important to understand that the Bayesian filter "learns" your style of classifying emails. What you consider spam might not be considered as such by somebody else. The included spam database, then, although useful for a quick start, might not be sufficient. It is supplied anyway because it is improbable that you have saved enough spam to do a significant training out of the box. You should, instead, have enough of your good emails. A good number for a significant training is about 4000 emails in each domain, although you can start seeing good results with far fewer. As you receive your mail, if the filter misclassifies a message you can still teach it which domain the message belongs to. This will improve the accuracy of future classifications. Let's now see the steps needed to get started.

Initial training

The first thing to do is the training. What you need here is two files, one with all of your good mail and one with your spam. If you have more than one account, you might want to create a temporary folder in which to gather all of your mail. If you have no spam you can start with the supplied "spam.sps" file, although you should not consider it a perfect match for you, because it knows nothing about the spam that is *specific* to you. You could also wait for some time and collect enough messages in a dedicated "spam" folder, created with your mail client only for manually collecting spam messages.

WARNING: IT IS REALLY IMPORTANT THAT MESSAGES ARE CORRECTLY CLASSIFIED. IF THE FILTER FINDS GOOD EMAILS IN THE SPAM FOLDER OR VICE VERSA, IT MIGHT NOT BE ABLE TO CORRECTLY RECOGNIZE SIMILAR INCOMING MESSAGES. THEREFORE, BEFORE STARTING, IT IS A GOOD IDEA TO DO SOME CLEANING, MOVING EMAILS TO THE CORRECT FOLDERS.

StandBayeMail recognizes the standard Unix mailbox format and Outlook Express' dbx format, so it can be used with mail clients that use such formats or that can at least export to the Unix mailbox format. To find out whether your client supports the Unix mailbox format, locate your mail folder files on the filesystem and open them with a regular text editor. If every mail header starts with a line similar to


From name@address Sat May 03 16:42:15 2003

and the file itself is in plain text format, you are set. In any case, before importing/learning, StandBayeMail tries to verify that the file format is correct. The folder files you just located are the ones to use for training. In practice, suppose you have two folders called Inbox.mbx and spam.mbx, as used by a hypothetical mail client. Unzip the StandBayeMail archive into a directory and change to that directory at the command line prompt. Once there, type:


java -jar StandBayeMail.jar

and you will see the option list, which will look like this:


Usage: StandBayeMail <mailbox|outlookexpress> <spam|mail|test> <mailboxfile>

<mailbox|outlookexpress>
    : specify if the mailbox file is a regular unix mbox or
    : an outlook express dbx file.
<spam|mail|test>
    : The switch "spam" or "mail" will add the <mailboxfile>'s words
    : to either the good words' database or the spam words'.
    : The switch "test" will check the <mailboxfile> applying
    : the bayesian filter.
<mailboxfile>
    : specify the path to the mailbox file.

Examples:
  java -jar StandBayeMail.jar mailbox mail c:\mail\in.mbx
  java -jar StandBayeMail.jar mailbox test c:\mail\new.mbx
  java -jar StandBayeMail.jar outlookexpress spam spam.dbx

All parameters are mandatory.

All three command line parameters are mandatory. To start learning from your good mail in mailbox format, type something like


java -jar StandBayeMail.jar mailbox mail c:\pathto\mailclient\Inbox.mbx

(on a Unix system it might be something like:


java -jar StandBayeMail.jar mailbox mail /var/spool/mail/username

)

The first parameter ("mailbox") says that we want to import an mbox format file, the second ("mail") that we are training the filter on good mail, and the third is the path to the mailbox file. Similarly, to train the filter on spam, you need the command:


java -jar StandBayeMail.jar mailbox spam c:\pathto\mailclient\spam.mbx

(note that the second parameter changed from "mail" to "spam"). StandBayeMail will keep you informed about the progress of the operation. If at a certain point you notice a huge slowdown, or you get a java.lang.OutOfMemoryError, the mailbox is probably too big and the word tokenization is stressing the garbage collector and the heap memory. In such cases your options are either to increase the minimum and maximum heap memory or to split bigger mailboxes into smaller ones. This might happen more frequently when training on spam, because such mails can contain more tokens (=words) than your regular ones. The commands mentioned above might therefore become something like


java -Xms128m -Xmx512m -jar StandBayeMail.jar mailbox spam c:\pathto\mailclient\spam.mbx

with which we ask for a minimum heap size of 128 megabytes and a maximum of 512 megabytes. If you are importing from Outlook Express, you first have to find the path to the folder that you want to use for training. You can do this by right-clicking on the folder name (on the left of Outlook Express) and selecting "Properties". You will get a dialog from which you can "Select All" the path to your mail folder, then use it with Copy/Paste. For example, after copying the file path, you can give the command


java -jar StandBayeMail.jar outlookexpress spam "C:\Documents and Settings\Administrator\Local Settings\Application Data\Identities\{9D1DD68B-867C-4FFF-85FD-F1465ACFF358}\Microsoft\Outlook Express\SPAM.dbx"

(The double quotes " delimiting the path name are necessary because some directory names in the path contain spaces.) The above command will parse the dbx file and train the filter from it. If you want to check in detail how the dbx file is extracted, you can use the command


java -cp StandBayeMail.jar oeimport.dbxImport <path_file_dbx> <output file>

As a result, <output file> will contain the content of the dbx file converted to Unix mbox format.
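Splitting a big mailbox into smaller ones, as suggested earlier in this section for out-of-memory problems, can be done with a small helper like the following. It is a sketch under assumptions, not part of StandBayeMail: the class name MboxSplitter, the output naming (<file>.1, <file>.2, ...) and the regular expression for the "From " separator line are all hypothetical, and real mbox variants differ in detail (for instance, separators are normally preceded by a blank line, which is not checked here).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class MboxSplitter {
    // Matches lines such as "From name@address Sat May 03 16:42:15 2003" (assumed pattern).
    private static final Pattern FROM_LINE = Pattern.compile(
        "^From \\S+ +\\w{3} \\w{3} [ \\d]\\d \\d{2}:\\d{2}:\\d{2} \\d{4}.*$");

    static boolean isSeparator(String line) {
        return FROM_LINE.matcher(line).matches();
    }

    /** Streams the mbox into parts of at most messagesPerPart messages; returns the part count. */
    static int split(Path mbox, int messagesPerPart) throws IOException {
        int parts = 0;
        int inPart = 0;
        BufferedWriter out = null;
        // ISO-8859-1 maps every byte to a char, so odd encodings in mail cannot break reading.
        try (BufferedReader in = Files.newBufferedReader(mbox, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (isSeparator(line)) {
                    if (out == null || inPart == messagesPerPart) {
                        if (out != null) out.close();
                        parts++;
                        out = Files.newBufferedWriter(
                            Paths.get(mbox + "." + parts), StandardCharsets.ISO_8859_1);
                        inPart = 0;
                    }
                    inPart++;
                }
                // Lines before the first separator are skipped: the file is not mbox there.
                if (out != null) {
                    out.write(line);
                    out.newLine();
                }
            }
        } finally {
            if (out != null) out.close();
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        int parts = split(Paths.get(args[0]), Integer.parseInt(args[1]));
        System.out.println(parts == 0 ? "No mbox separator found" : "Wrote " + parts + " part(s)");
    }
}
```

Because the file is processed line by line, the helper itself needs very little memory. A return value of 0 means that no "From " line was found, which is the same check described above for recognizing the Unix mailbox format.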

Proxy configuration

Once the previous phase is complete you can set up the POP3 proxy. Open the file StandBayeMail.config with a text editor. It will look something like:


#################################################
#
#   user list for java proxy server
#
#################################################

loginname  = pop3.server.address.here

For example, if you have an account itsme@myprovider.com, and your username is "itsme" and the pop3 server address is "pop.myprovider.com", the above line will become


itsme = pop.myprovider.com

You can list here all of your user names and the related POP3 server addresses. The StandBayeMail POP3 proxy is able to handle multiple accounts: when your client sends a request, the proxy checks the requested user name and connects to the corresponding POP3 server.

When you are done you can save the config file.
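The "loginname = server" layout happens to be compatible with the standard java.util.Properties format, so the user-to-server lookup can be sketched as follows. How StandBayeMail really parses its config file is not stated on this page; the class ProxyConfig and the method serverFor are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

final class ProxyConfig {
    private final Properties users = new Properties();

    /** Reads "loginname = pop3.server" pairs; lines starting with '#' are comments. */
    static ProxyConfig load(Reader reader) throws IOException {
        ProxyConfig config = new ProxyConfig();
        config.users.load(reader);
        return config;
    }

    /** Returns the POP3 server for the given login name, or null if the user is not listed. */
    String serverFor(String loginName) {
        String server = users.getProperty(loginName);
        // Properties keeps trailing whitespace in values, so trim it defensively.
        return server == null ? null : server.trim();
    }

    public static void main(String[] args) throws IOException {
        String text = "# user list for java proxy server\n"
                    + "itsme = pop.myprovider.com\n";
        ProxyConfig config = ProxyConfig.load(new StringReader(text));
        System.out.println(config.serverFor("itsme"));
    }
}
```

Note that in the Properties format a login name containing spaces, ":" or "=" would have to be escaped with a backslash, which may or may not match what the real parser expects.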

Mail client configuration

When you are done with the previous step you have to configure your email client. For every user that you specified in the config file you must create an account in the mail client. As the POP3 server address you must specify "localhost", and as the user name the same one that you set in the configuration file. Then create, for each account if you wish, a folder called "Spam", one called "FalsePositives" and one called "FalseNegatives". Finally, create a rule in your email client to automatically move incoming spam to the folder Spam. To do this, you need to know that when the POP3 proxy detects a spam message it adds the string "[SPAM]" to the Subject: of the incoming message and adds a header like "X-StandBayeMail-Spamicity: 0.9999988374837483" to the message (the decimal number after the ":" varies according to the "spamicity" of the message). Therefore the filter rule in the mail client should be like:

If the Subject contains "[SPAM]" move the message to the folder "Spam"

or

If the header X-StandBayeMail-Spamicity exists move the message to the folder "Spam"

The way to create message filter rules is specific to each mail client, but with these two possibilities you should be able to work with most of them.
Once you are done with this last task you are ready to go. Run the script

StandBayeMail.bat (Windows)
or
StandBayeMail.sh (Unix)

and ask the client to fetch the incoming mail. If everything is working you will see the folders Inbox and Spam being populated automatically.
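For readers who filter mail with their own code rather than with a mail client, the two rules given in this section can be expressed as a small Java sketch. The class SpamRule and its methods are hypothetical; only the "[SPAM]" subject marker and the X-StandBayeMail-Spamicity header come from the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

final class SpamRule {
    private static final String SPAMICITY = "X-StandBayeMail-Spamicity:";

    // Header names are case-insensitive, so prefixes are compared ignoring case.
    private static boolean hasPrefix(String line, String prefix) {
        return line.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    /** Applies both rules: either marker is enough to file the message under "Spam". */
    static String folderFor(List<String> headers) {
        for (String line : headers) {
            if (hasPrefix(line, SPAMICITY)) return "Spam";
            if (hasPrefix(line, "Subject:") && line.contains("[SPAM]")) return "Spam";
        }
        return "Inbox";
    }

    /** Extracts the numeric spamicity, if the header is present and well formed. */
    static OptionalDouble spamicity(List<String> headers) {
        for (String line : headers) {
            if (hasPrefix(line, SPAMICITY)) {
                try {
                    String value = line.substring(SPAMICITY.length()).trim();
                    return OptionalDouble.of(Double.parseDouble(value));
                } catch (NumberFormatException e) {
                    return OptionalDouble.empty();
                }
            }
        }
        return OptionalDouble.empty();
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList(
            "Subject: [SPAM] Cheap pills",
            "X-StandBayeMail-Spamicity: 0.9999988374837483");
        System.out.println(folderFor(headers) + " " + spamicity(headers));
    }
}
```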

Everyday training

We still have to explain why the folders "FalsePositives" and "FalseNegatives" were created. They exist because, especially at the beginning with a small mail corpus, the Bayesian filter can misjudge some incoming messages. When this happens, human intervention is necessary. In the folder "FalsePositives" we put the mail falsely judged positive to the "Is it spam?" question, that is, REGULAR messages that went to the Spam folder by mistake. StandBayeMail does its best to limit this event by judging unknown words "a bit better" than statistical neutrality. This is in fact a very rare event, but when it happens the mail has to be moved manually to the "FalsePositives" folder so that the filter can learn from it. Learning from false positives works exactly as described in the section Initial training, using the path to the "FalsePositives" folder. For Outlook Express, for example, that means a command like


java -jar StandBayeMail.jar outlookexpress mail "C:\..\Outlook Express\FalsePositives.dbx"

Similarly, false negatives are spam mails that pass through and go to your Inbox. Of course this event is less troubling than the previous one, and training the filter on false negatives will ease the recognition of similar messages in the future. To do it you can use the command

java -jar StandBayeMail.jar mailbox spam /home/username/mail/FalseNegatives

(which describes a typical Unix environment). After the training you can empty the false positives and false negatives folders. Periodically you might want to teach the filter about new spam and new good mail, even when they were categorized correctly. My suggestion is that once you reach a good corpus of about 4000 emails for each category, you continue the training only on false positives and false negatives: if the word database gets too large, lookups in the HashMaps could start getting slower. If you still want to keep training the filter on ALL spam and good mail, even the messages correctly detected, you can do it as explained in the Initial training section. The only thing to be careful about is not to retrain the filter on emails it has already learned in a previous session. That means you train it on new spam, then delete the content of the spam folder and start collecting new spam for the next training session. In the same way, you should train the filter only on the good mail that you received after the last learning session. This is necessary to preserve the statistical frequency with which the words counted in your mail corpus appear in your mail folders.
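To see why word frequencies matter, here is a minimal sketch of a Bayesian word filter in the spirit of reference (3). It is not StandBayeMail's actual algorithm: the class BayesSketch, the 0.4 score for unknown words (the value suggested in "A plan for Spam", used here as a stand-in for "a bit better" than neutral) and the 0.01/0.99 clamping are assumptions, and refinements such as using only the most significant tokens are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BayesSketch {
    static final double UNKNOWN = 0.4;   // assumed: slightly below the neutral value 0.5
    private final Map<String, Integer> good = new HashMap<>();
    private final Map<String, Integer> spam = new HashMap<>();
    private int goodMessages;
    private int spamMessages;

    /** Adds the tokens of one message to the good or the spam word counts. */
    void train(List<String> tokens, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spam : good;
        for (String token : tokens) counts.merge(token, 1, Integer::sum);
        if (isSpam) spamMessages++; else goodMessages++;
    }

    /** Probability that a message containing this token is spam, from relative frequencies. */
    double tokenProbability(String token) {
        int g = good.getOrDefault(token, 0);
        int s = spam.getOrDefault(token, 0);
        if (g + s == 0 || goodMessages == 0 || spamMessages == 0) return UNKNOWN;
        double goodFreq = Math.min(1.0, (double) g / goodMessages);
        double spamFreq = Math.min(1.0, (double) s / spamMessages);
        double p = spamFreq / (goodFreq + spamFreq);
        return Math.max(0.01, Math.min(0.99, p));   // never fully certain
    }

    /** Combines token probabilities: prod(p) / (prod(p) + prod(1 - p)), computed via log-odds. */
    double spamicity(List<String> tokens) {
        double logOdds = 0.0;
        for (String token : tokens) {
            double p = tokenProbability(token);
            logOdds += Math.log(p / (1.0 - p));
        }
        return 1.0 / (1.0 + Math.exp(-logOdds));
    }

    public static void main(String[] args) {
        BayesSketch filter = new BayesSketch();
        filter.train(Arrays.asList("cheap", "pills", "offer"), true);
        filter.train(Arrays.asList("meeting", "tomorrow", "offer"), false);
        System.out.println(filter.spamicity(Arrays.asList("cheap", "offer", "tomorrow")));
    }
}
```

In this sketch, calling train twice on the same message inflates that message's word counts relative to the rest of the corpus, which is exactly the distortion the advice above about not retraining on old mail is meant to avoid.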

Frequently Asked Questions

Download

Download StandBayeMail v1.0.0 (zip file, including binaries and source code)

References

(1) "Combining probabilities"
     http://www.mathpages.com/home/kmath267.htm

(2) "How Bayes' Theorem can change your life"
     Focus Magazine UK, n. 120 November 2002 pag. 56
     http://www.focusmag.co.uk

(3) "A plan for Spam" by Paul Graham
     http://www.paulgraham.com/spam.html

(4) "Better Bayesian Filtering" by Paul Graham,
     http://www.paulgraham.com/better.html

(5) "Two Spam Filters 10 Times As Accurate As Humans"
     http://yro.slashdot.org/yro/04/02/24/0025219.shtml

Known Issues

  • Memory usage while training on big spam mailboxes can be really excessive. This problem was investigated and improved a lot, but it is still far from optimal. During the tokenization of words StandBayeMail creates lots of temporary objects that need to be kept in memory, then the tokens are stored in HashMaps. This will be fixed. During normal operation, instead, memory usage is quite acceptable (around 16-20 MB).

  • Many parameters are hard coded; they should be settable from the command line and/or the config file. This will be reorganized in a more general fashion with the GNU Getopt library.

  • Some parts of the code should be optimized better and/or rewritten without "rush mode" ;-) I am aware of this, but I just wanted to get started. Code quality will surely improve as the application evolves.
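Regarding the memory issue in the first item, a quick way to check how much heap a given Java command line actually provides is a tiny program like the one below. It is a generic sketch using the standard java.lang.Runtime API, not something shipped with StandBayeMail; the class name HeapReport is hypothetical.

```java
final class HeapReport {
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Summarizes the heap currently in use and the maximum the JVM will try to use. */
    static String describe(Runtime runtime) {
        long used = runtime.totalMemory() - runtime.freeMemory();
        return "used=" + toMegabytes(used) + "mb, max=" + toMegabytes(runtime.maxMemory()) + "mb";
    }

    public static void main(String[] args) {
        // Run with the same heap options you plan to use for training to see their effect.
        System.out.println(describe(Runtime.getRuntime()));
    }
}
```

Running it with and without the heap options used for training shows whether the larger maximum was picked up by the JVM.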

To do...