Determine the 10 most frequent words given a terabyte of strings.
This is one of my favorite Facebook interview questions. There is a terabyte-sized file of words. Scan the file and compute the top 10 words by number of occurrences, listed in descending order. Also support looking up the frequency of any given word in the text. Further, implement the solution so that it can be used in a machine learning setting: it should accept streams of data arriving in real time and maintain the top 10 in real time (no batching allowed). The output of the solution we are going to discuss is shown below.
130 MB data
Loaded 14146404 words into Data Structure. Time taken = 18 sec
*************
Top 10 words are:
*************
[type] found 360234 times
[2012, main] found 337678 times
[values] found 337549 times
[debug, summary, options, auditing] found 337375 times
[return] found 336572 times
[2012] found 335377 times
[class] found 269462 times
[auto] found 123839 times
*************
Word performance found 97 times
I generated this report on a 130 MB log file containing about 14 million words, and I applied some custom rules for frequency calculation. For example, the tokens
Algorithm.,
algorithm
algorithm's
should all be counted as occurrences of algorithm. Then, of course, comes the most important part: loading the 130 MB of data and operating on it. Wow, all done in 18 seconds. I was also able to query the frequency of any arbitrary word, and each lookup takes less than a millisecond. A rough sketch of this kind of normalization rule is shown below.
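To make the rules concrete, here is a minimal Java sketch of such a normalization step. The exact rules I used are not reproduced here; the class name, the possessive handling and the regex below are purely illustrative assumptions.

public class WordNormalizer {
    // Illustrative normalization: lowercase, drop a possessive "'s",
    // and strip leading/trailing punctuation so that "Algorithm.,",
    // "algorithm" and "algorithm's" all map to the key "algorithm".
    public static String normalize(String token) {
        String word = token.toLowerCase();
        if (word.endsWith("'s")) {
            word = word.substring(0, word.length() - 2);
        }
        return word.replaceAll("^[^a-z]+|[^a-z]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Algorithm.,"));  // algorithm
        System.out.println(normalize("algorithm"));    // algorithm
        System.out.println(normalize("algorithm's"));  // algorithm
    }
}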
Before discussing the solution, let's look at similar problems we see day to day. An e-commerce site shows its top sellers; a gaming site displays a leaderboard. Another instance comes from network analysis: you are building a solution that captures the IP addresses of requesting machines in real time and, in real time, reports the IP with the highest number of hits. The traditional approach of using a HashMap or an array of counters does not work here, because you would very soon run out of memory or end up with slow computation. For contrast, that naive approach is sketched below.
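Here is a minimal Java sketch of that traditional HashMap counter (the class and method names are my own, purely for illustration). It is exact, but the map grows with the number of distinct keys, which is exactly what breaks down at terabyte scale or on an unbounded stream of IP addresses.

import java.util.HashMap;
import java.util.Map;

public class NaiveCounter {
    // One entry per distinct word: memory grows without bound on large inputs.
    private final Map<String, Long> counts = new HashMap<>();

    public void add(String word) {
        counts.merge(word, 1L, Long::sum);
    }

    public long frequency(String word) {
        return counts.getOrDefault(word, 0L);
    }
}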
There is another approach: online algorithms. This is a general term for algorithms where you build the structure once and can then query it in real time and get an immediate response. An example of an online algorithm is the Patricia trie, which I posted about earlier.
So what do these give us? They are very efficient in both computation and space. For instance, a Patricia trie is a binary trie that uses only 2n-1 nodes, and its lookup time does not depend on the size of the stored data, only on the length of the key being looked up. So can we use these online algorithms for real-time data analytics? Maybe not. However, if you want to build an application that goes through a huge novel (the complete works of William Shakespeare, the Bible, and so on) and prints the frequency of each word, they are a good fit. We need to look beyond this.
Let's take the novel example and work through it. Go through the novel line by line and fetch each word. Assume that you have 3 hash functions - h1, h2 and h3 - each of which takes a word as input and returns an int between 0 and 9. The hash functions should be chosen so that, as far as possible, different words do not produce the same set of values. Now create 3 int arrays of size 10 each, all initialized to 0. Let's say we computed the hashes for the string "xyz" and got the values below.
h1("xyz") = 8
h2("xyz") = 6
h3("xyz") = 9
So in the 3 arrays, increment the value by one at the positions returned above. We would then have:
array1 [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
array2 [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
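As an aside, here is a minimal Java sketch of what three such bucket hash functions might look like. The seeds and the mixing below are my own arbitrary choices; any reasonably independent hash functions that map a word into 0..9 would do.

public class SimpleHashes {
    // Polynomial rolling hash, folded into the range 0..9.
    private static int bucket(String word, int seed) {
        int h = seed;
        for (int i = 0; i < word.length(); i++) {
            h = 31 * h + word.charAt(i);
        }
        return Math.abs(h % 10);
    }

    public static int h1(String w) { return bucket(w, 17); }
    public static int h2(String w) { return bucket(w, 31); }
    public static int h3(String w) { return bucket(w, 53); }

    public static void main(String[] args) {
        System.out.println(h1("xyz") + " " + h2("xyz") + " " + h3("xyz"));
    }
}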
Now proceed to the next word and do the same.
h1("abc") = 7
h2("abc") = 3
h3("abc") = 9
Here h3("abc") clashes with h3("xyz"); we will come back to this in a while. Update the table values:
array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
array2 [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
Assume that the next word is again "xyz". The tables then become:
array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
array2 [ 0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
To calculate the frequency of "xyz", compute h1, h2 and h3 again, go to the tables, and take the minimum of the values at those positions. Here we have 2, 2 and 3, so the minimum, 2, is the answer: "xyz" is found 2 times. Similarly, for "abc" we get 1, 1 and 3, and the minimum, 1, is the answer. It has been found that with a proper choice of hash functions collisions can be reduced considerably, and increasing the array size from 10 to a much larger number makes conflicts remote. Algorithms of this type are called streaming algorithms (this top-k problem is known as heavy hitters), and this particular one is the Count-Min Sketch algorithm. It is used by Google, Facebook, Twitter and many more, and many Big Data solutions are based on it. The Bloom filters used in HBase belong to the same family of sketch data structures.
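Putting the walkthrough together, here is a minimal, self-contained Java sketch of a Count-Min Sketch. The depth (3 rows), width (10 buckets) and seeds are the toy numbers from the example above, not production settings; a real deployment would use a much larger width and better hash functions.

public class CountMinSketch {
    private static final int DEPTH = 3;                // number of hash functions / rows
    private static final int WIDTH = 10;               // buckets per row
    private static final int[] SEEDS = {17, 31, 53};   // arbitrary illustrative seeds

    private final int[][] counts = new int[DEPTH][WIDTH];

    // Polynomial rolling hash folded into 0..WIDTH-1.
    private int bucket(String word, int seed) {
        int h = seed;
        for (int i = 0; i < word.length(); i++) {
            h = 31 * h + word.charAt(i);
        }
        return Math.abs(h % WIDTH);
    }

    // Update: bump one counter in each row for the incoming word.
    public void add(String word) {
        for (int row = 0; row < DEPTH; row++) {
            counts[row][bucket(word, SEEDS[row])]++;
        }
    }

    // Query: the estimate is the minimum counter across the rows;
    // collisions can make it over-count, but it never under-counts.
    public int estimate(String word) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < DEPTH; row++) {
            min = Math.min(min, counts[row][bucket(word, SEEDS[row])]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch();
        sketch.add("xyz");
        sketch.add("abc");
        sketch.add("xyz");
        System.out.println("xyz ~ " + sketch.estimate("xyz"));  // 2, as in the walkthrough
        System.out.println("abc ~ " + sketch.estimate("abc"));  // 1 (could be higher on a collision)
    }
}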
The beauty of this algorithm is that you do not need to store the input, and you can compute the result in one pass, so both time and space requirements are remarkably low. Even though this is a probabilistic approach, with suitable parameters the chance of error can be brought down to around one in a million, which is still good given the performance gain.
Now, this algorithm is good, but how can we compute the top 10 words when we are not storing the actual words? For this, I used a workaround: a MaxHeap to track the top entries, keeping around the top 1000 occurrences to support a future requirement. In the code you might notice that I have not implemented a MaxHeap at all; I wanted to show that the same bookkeeping can be built with a simple HashMap too (really simple). One possible heap-based sketch is shown below.
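For illustration, here is one way to keep a running top-k next to the sketch. This is not the HashMap-based workaround from my code: it uses java.util.PriorityQueue as a size-bounded min-heap, and it reuses the hypothetical CountMinSketch class sketched above.

import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

public class TopKTracker {
    private final int k;
    private final CountMinSketch sketch = new CountMinSketch();
    // Min-heap ordered by estimated count: the smallest of the current top-k sits at the head.
    private final PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
    private final Set<String> tracked = new HashSet<>();

    public TopKTracker(int k) { this.k = k; }

    public void add(String word) {
        sketch.add(word);
        int estimate = sketch.estimate(word);
        if (tracked.contains(word)) {
            // Refresh the entry for a word that is already being tracked.
            heap.removeIf(e -> e.getKey().equals(word));
            heap.add(new AbstractMap.SimpleEntry<>(word, estimate));
        } else if (heap.size() < k) {
            heap.add(new AbstractMap.SimpleEntry<>(word, estimate));
            tracked.add(word);
        } else if (estimate > heap.peek().getValue()) {
            tracked.remove(heap.poll().getKey());  // evict the current smallest
            heap.add(new AbstractMap.SimpleEntry<>(word, estimate));
            tracked.add(word);
        }
    }

    public void printTopK() {
        heap.stream()
            .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
            .forEach(e -> System.out.println("[" + e.getKey() + "] found " + e.getValue() + " times"));
    }
}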
Note: what can this algorithm not do? If we want to print all words along with their frequencies, that is not possible, since we are not storing the words themselves. I have also built solutions using online algorithms (a Patricia trie) and traditional data structures (a HashMap with a linked list). If you are interested in those, please email me at rajur79@gmail.com.
Solution