Break the CODE!!!: Adaptive Spam Filtering algorithm

When we talk about spam, we generally mean email. A spam mail is one that is sent to you as a promotion or as bulk mail, and in most cases you are not interested in receiving it. In the early days we had to go through each mail and decide whether it was spam or not. A mail that is not spam (called ham) stayed in the inbox, and spam had to be moved to a junk folder by hand. That is a lot of work, given that these days 50-60% of mails are spam. A few algorithms came up to solve this issue, and arguably the best of them is the Bayesian algorithm. It is an adaptive, machine-learning algorithm, and we will discuss the details below.

Classifying an email as spam or ham cannot be done once for everyone at the mail server. It needs to be done per recipient, at the email client. For instance, say there are two users, A and B. A works for a bank and B works as a pharmacist. A mail with the content "Reduce your mortgage loan" is spam for B but ham for A, and a mail about a "Solution for baldness" is spam for A but ham for B. So when a recipient receives a mail that he considers spam, he can "Mark it as Spam". A miss of this kind is not a big issue. On the other hand, if he notices that a ham mail went into his spam folder, he can "Mark it as NOT Spam". This kind of mistake is an issue, because the mail might be an important one and he might miss it (it never shows up in the inbox). So spam detectors should be careful not to mark ham as spam. Spam can be detected based on the email content, email subject, sender address, recipient addresses, etc. Let's see how this works.

In the industry there are collections of thousands of ham and spam emails that can be used to build our spam filter application. Download these emails into your data store. Run a job over them (Map-Reduce or batch) that goes through each email message and splits it into words. You might have to do additional cleanup, such as removing special characters and quotes, converting to lower case, ignoring words shorter than 4 characters, ignoring common words, ignoring words that are not made up only of letters, etc. Add each valid word to a HashMap as the key. The value in the map is a Node. The Node class has 3 fields: spamCount, hamCount and probability. So if I read the word "XYZ" from a spam email and it is the first time I have encountered this word, its Node would have spamCount=1, hamCount=0. If the word is already in the map, increment the matching count instead. We will calculate the probability after the map is constructed. Note that the same word can also appear in the ham list. Every time a word is put in the map, increment a class-level variable totalSpam (or totalHam) by 1. After all the emails are read and the map is constructed, iterate over the map and, for each key, get the spamCount and hamCount. Calculate the probability using:

probability = (spamCount/totalSpam)/((spamCount/totalSpam) + (hamCount/totalHam))

Do this for all the keys. The probability is a floating point value between 0.0 and 1.0.
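The training step described above could be sketched in Java roughly as follows. This is a minimal sketch and not the implementation linked from this post: the class name SpamTrainer, the method names and the exact cleanup rules (strip everything that is not a letter or digit, then keep letter-only words of length 4 or more) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SpamTrainer {
    // Per-word statistics, mirroring the three fields described in the text.
    static class Node {
        int spamCount;
        int hamCount;
        double probability;
    }

    final Map<String, Node> words = new HashMap<>();
    final Set<String> commonWords; // assumed list of common words to skip
    int totalSpam;
    int totalHam;

    SpamTrainer(Set<String> commonWords) {
        this.commonWords = commonWords;
    }

    // Lower-case the mail, split on whitespace, strip special characters and
    // keep only letter-only words of length >= 4 that are not common words.
    List<String> tokenize(String mail) {
        List<String> result = new ArrayList<>();
        for (String raw : mail.toLowerCase(Locale.ROOT).split("\\s+")) {
            String w = raw.replaceAll("[^a-z0-9]", "");
            if (w.length() >= 4 && w.matches("[a-z]+") && !commonWords.contains(w)) {
                result.add(w);
            }
        }
        return result;
    }

    // Count every valid word of one training mail as spam or ham.
    void train(String mail, boolean isSpam) {
        for (String w : tokenize(mail)) {
            Node n = words.computeIfAbsent(w, k -> new Node());
            if (isSpam) {
                n.spamCount++;
                totalSpam++;
            } else {
                n.hamCount++;
                totalHam++;
            }
        }
    }

    // Apply the probability formula from the text to every key in the map.
    void computeProbabilities() {
        for (Node n : words.values()) {
            double s = totalSpam == 0 ? 0.0 : (double) n.spamCount / totalSpam;
            double h = totalHam == 0 ? 0.0 : (double) n.hamCount / totalHam;
            // Guard against 0/0; a word with no evidence stays neutral.
            n.probability = (s + h == 0.0) ? 0.5 : s / (s + h);
        }
    }
}
```

Call train once per mail from the spam and ham collections, then call computeProbabilities once at the end. A word seen only in spam ends up with probability 1.0 and a word seen only in ham with 0.0.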

That completes the training step. Next is the filtering step.

Say an email arrives from a sender "X". Again, extract the words (as described above) and for each word look up its probability in the map. If the word doesn't exist in the map, the spam filter has not been trained on this word. It could be a perfectly valid word, so give it a neutral probability of 0.5. Calculate the interest value I for each word as follows:

I = |0.5 - probability|

Once this is calculated for all the words, sort the I values in descending order (highest interest first). Out of these, take the top N values (N=15). For these I values, get the corresponding probabilities p1, p2, p3 ... p15. Now calculate the total probability using the following formula:

P = (p1*p2*p3..p15)/((p1*p2*p3..p15) + ((1-p1)*(1-p2)*(1-p3)....(1-p15)))

This value lies between 0.0 and 1.0. The nearer it is to 0, the lower the chance of the mail being spam, and the nearer it is to 1, the higher the chance. So we mark anything equal to or greater than 0.9 as spam.
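The filtering step could be sketched like this. Again this is only an illustration under some assumptions: the class name SpamScorer is made up, the trained map is simplified to word -> probability instead of word -> Node, and probabilities are clamped to the range 0.01 to 0.99. The clamping is not part of the description above; it is added here so that one word with probability exactly 0.0 and another with exactly 1.0 cannot make both products zero.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class SpamScorer {
    static final int N = 15;             // number of most interesting words
    static final double THRESHOLD = 0.9; // spam cut-off used in the text
    static final double UNKNOWN = 0.5;   // probability for untrained words

    // Combine the probabilities of the N most interesting words of one mail.
    static double score(List<String> mailWords, Map<String, Double> wordProbability) {
        List<Double> probs = new ArrayList<>();
        for (String w : mailWords) {
            double p = wordProbability.getOrDefault(w, UNKNOWN);
            // Assumption: clamp to avoid a 0/0 when both products vanish.
            probs.add(Math.min(0.99, Math.max(0.01, p)));
        }
        // Sort by interest I = |0.5 - p|, highest interest first.
        probs.sort(Comparator.comparingDouble((Double p) -> Math.abs(0.5 - p)).reversed());

        double spamProduct = 1.0;
        double hamProduct = 1.0;
        for (double p : probs.subList(0, Math.min(N, probs.size()))) {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        return spamProduct / (spamProduct + hamProduct);
    }

    static boolean isSpam(List<String> mailWords, Map<String, Double> wordProbability) {
        return score(mailWords, wordProbability) >= THRESHOLD;
    }
}
```

A mail made up only of unknown words scores exactly 0.5 in this sketch, so it stays below the 0.9 threshold and is treated as ham, which matches the goal of not marking ham as spam.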

Next comes machine learning. It could happen that an email which was not marked as spam turns out to be spam. You mark it as spam. To do that, add its words back to the map as spam words and calculate the probabilities again.
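A rough sketch of this feedback step is shown below. The class name FeedbackLearner and the method names are hypothetical, and the counting logic is repeated only so that the example is self-contained. The important part is that every probability is recalculated after the user's label is added, because the totals have changed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FeedbackLearner {
    static class Node {
        int spamCount;
        int hamCount;
        double probability;
    }

    final Map<String, Node> words = new HashMap<>();
    int totalSpam;
    int totalHam;

    // Add the (already cleaned) words of a mail the user has just labelled,
    // then refresh all probabilities.
    void mark(List<String> mailWords, boolean isSpam) {
        for (String w : mailWords) {
            Node n = words.computeIfAbsent(w, k -> new Node());
            if (isSpam) {
                n.spamCount++;
                totalSpam++;
            } else {
                n.hamCount++;
                totalHam++;
            }
        }
        recompute();
    }

    // Same formula as in the training step; every word in the map has at
    // least one count, so the denominator is never zero.
    void recompute() {
        for (Node n : words.values()) {
            double s = totalSpam == 0 ? 0.0 : (double) n.spamCount / totalSpam;
            double h = totalHam == 0 ? 0.0 : (double) n.hamCount / totalHam;
            n.probability = s / (s + h);
        }
    }
}
```

Marking a mail as spam pushes the probability of each of its words towards 1.0, so the same mail scores higher the next time it arrives.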

Implementation

I have built a basic implementation which can be trained and can also do machine learning. I created 3 files: ham.txt, spam.txt and common-words.txt. In this basic implementation, each line of a text file stores the content of one mail. In the sample data I set up, I want to filter all jobsite, lottery and retirement emails. The spam filter gives the following output.

'quick and easy jobsite' is spam
'will there be a production release tomorrow' is not a spam
'plan your retirement soon' is spam
'you have won a state lottery claim your money now' is spam
'register with our jobsite and get your dream job and money now' is spam
'today with our jobsite making money is not all that hard' is not a spam
'today with our jobsite making money is not all that hard' is spam
'back to office today and started off coding' is not a spam

Note that the sixth mail in the output was initially found to be ham. The reason is that a few of its words, like today, money, etc., are found in the ham list as well. But after I marked it as spam, when I received the same email again (the seventh line), the filter automatically classified it as spam.

Solution

CODE

spam.txt

ham.txt

common-words.txt

6 comments:

radhakrishna_65, February 6, 2014 at 10:04 PM
It is very useful for me to learn about Spam detection techniques
Radhakrishna

Unknown, November 10, 2014 at 3:35 PM
is hashmap an alternative or the same as saving data in database?? thanks!!

Unknown, November 10, 2014 at 3:37 PM
raju, can you make a video tutorial of it cause i was confused of how it works though! thanks, regards in advance! hope you so. It will be a great help for me if you did so! Its great.

Unknown, December 2, 2015 at 1:08 AM
thank you ... it's help me a lot ...

Monday, December 3, 2012
