Tuesday, December 25, 2012

Count Distinct Values in huge text file

This is a common problem today. You are given a huge text file, containing tens of millions of text data. And you have to count how many distinct values exist in the file. The problem gets more complicated, if you cannot load the entire data into memory. How do you make your algorithm "Online". An "online algorithm" is one in which, the result is expected to be returned immediately - may be in one pass, and you won't have memory to hold all the data.

The usage of this algorithm is in various fields. In web analytics to return number of unique visitors last month, last week, so on. Another usage is in detecting a suspected attack. If the load on your server increases all of a sudden, and you notice a spike in the network, then the first thing to look for the possibility of an attack. In this case, the requests could be coming repeatedly from a system. So next we will discuss different implementations available. For the purpose of this article, I used Shakespeare's complete work as a huge text file. It can be downloaded from http://www.gutenberg.org/cache/epub/100/pg100.txt. The file is around 5.5 MB in size and contains 67802 distinct words.

Read the input file line by line, break it into words. Then store the word in a set. Once all the words are stored, then get the size of the set. This is the distinct count. However with this approach, we have to store the entire 5.5 MB data in the Set. Assuming that there are repeated words, and say 4 MB is the distinct words. Still we need to save 4 MB data in Set. And since a Set internally uses a dictionary, and assuming 0.7 load factor then we would need 4/0.7 = 5.7 MB. And if the file is extremely big then we cannot use this approach at all.

A small work-around for Approach 1 is here. Assume that you have a very good hashCode function and the hashCode for any 2 words are never the same. In this case, store the hashCode instead of the word. So hashCode being int and when stored in a set as Integer, it is 32 bits or 4-bytes long. So to store 67802 hashCodes we need 378 KB memory. This is a huge gain, but the problem is we should find a good hashCode function. And the size of the Set increases with the input file size.

We can use "Bloom Filters" as an alternative. It is a 2-dimensional boolean array. In this approach, we use 3 or 4 different hashCode functions - h1, h2, h3, h4. For each word, compute h1 to h4. The rows of our bloom filter are the hashCodes and columns are fixed based on the accuracy needed. One thing to note here is that, the Bloom Filter is a probabilistic algorithm. It doesn't give 100% accurate results always. There is a chance of error 1% to 4% or so. The bigger the size of the Bloom Filter array sizes, the more accurate the results are. Assume that the values for h1, h2, h3, and h4 are 22, 17, 35, and 19. If the bloom filter array is represented as a, we set a[22][0], a[17][1], a[35][2], a[19][3] to true. Using this approach, I could get a result of 67801 instead of 67802 with array 4*1M. Since boolean is a byte in Java, we would need 4*1M bytes or 3 MB memory. However with a memory of 300 KB, I got 3% error. The efficiency of these algorithms are - we don't need to store the entire data and we can get the result in one pass.


The next approach we discuss is a very efficient algorithm. And the same has been used in many applications - including Cassandra (by Facebook). It is called "HyperLogLog Counter". Its a slightly complex and mathematical algorithm. It needs very little memory and doesn't vary much on the input data size. The idea is as follows. Read each word and compute the hash of the word. Cassandra uses "MurMur Hash". In this algorithm we use something called "Stochastic Averaging". We define 'b' as a number between 4 and 16. And create an int (or byte) array of size 2^b. So if we use b=4, then we have 16 buckets. Assume that we have chosen b=15, hashCode of a word is 1379266913. We convert this hashCode to a 32-bit binary number (Integer is 32-bit). And then break it into first 15 bits and next 17 bits. The first 15-bits decide which bucket to use. The next 17-bit is used to determine the rank. The rank is nothing but, the position of the first set-bit (or position of 1), from left to right. At the bucket index, if the value is lesser than the rank, then rank is stored in the bucket. So basically at each bucket, we store the highest 1-bit position. Once all the words are read and the buckets populated, we use the harmonic mean to compute the result.
Using this approach with b-value of 15, meaning 32768 buckets ( or 32 KB) memory the result obtained was 67749 (against the actual 67802). Which means using just 32 KB memory we got a result which is 99.92% accurate!!



  1. Very good.

    I would also mention counting with hash function and bitmap.

    Also I do not think Cassandra currently uses HyperLogLog, but this is not that important.

  2. In approach 3, should it be 300kb versus 3MB?