Thursday, November 29, 2012

Real-time data analytics on terabytes of data

Determine the 10 most frequent words given a terabyte of strings.

This is one of my favorite Facebook interview questions. There is a terabyte-sized file comprising words. Scan the file and calculate the top 10 words in the file with respect to the number of occurrences, listing them in descending order of occurrence. Also, support searching for the frequency of any word in the text. Further, implement the solution such that it can be used in machine learning: it should accept streams of data arriving in real time and calculate the top 10 in real time (no batching allowed).

The output of the solution that we are going to discuss is shown below -

130 MB data
Loaded 14146404 words into Data Structure. Time taken = 18 sec

*************
Top 10 words are:
*************
[argument] found 1011599 times
[type] found 360234 times
[2012, main] found 337678 times
[values] found 337549 times
[debug, summary, options, auditing] found 337375 times
[return] found 336572 times
[2012] found 335377 times
[class] found 269462 times
[auto] found 123839 times
*************
Word performance found 97 times


I generated this report on a 130 MB log file comprising 14 million words, and I applied some custom rules for frequency calculation:

Algorithm.,
algorithm
algorithm's

All 3 of the above should be counted as occurrences of "algorithm". Then of course there is the most important part - loading the 130 MB of data and operating on it. Wow!! All done in 18 sec. Further, I was able to query the frequency of any arbitrary word, and the search takes less than a millisecond.

Before discussing the solution, let's look at similar problems that we see day to day. An e-commerce site displays top sellers, and a gaming site displays a leaderboard. Another problem is in network analysis - you are building a solution to capture the IP addresses of requesting machines in real time and, in real time, report the IP with the highest hits. The traditional approaches of using a HashMap or an array of counters would not work here, since you would very soon run out of memory or end up with slow computation.

There is another approach to this: Online Algorithms. This is a general term for algorithms where you build a structure once, and then in real time you can query it and get a response. An example of an Online Algorithm is the Patricia Trie which I posted earlier.
So what do these give us? They are very efficient in terms of computation and also in space. For instance, a Patricia Trie uses only 2n-1 nodes and is a binary trie. At the same time, the processing time doesn't depend on the size of the data set, only on the size of the input. So can we use these Online Algorithms for real-time data analytics? Maybe not. However, if you want to build an application that goes through a huge novel (for instance, the complete works of William Shakespeare, the Bible, etc.) and prints the frequency of each word, they are good. We need to look beyond this.

Let's take the example of the novel and address it. Go through the novel line by line and fetch each word. Assume that you have 3 hash functions - h1, h2, h3. Each of them takes a word as input and returns an int. The hash functions should be built such that for a given word the three returned values are all distinct, and each value lies between 0 and 9. Now create 3 int arrays of size 10 each, all initialized to 0. Let's say we computed the hashes for the string "xyz" and got the below.

h1("xyz") = 8
h2("xyz") = 6
h3("xyz") = 9

So in the 3 arrays, increase the value by one at the positions returned above. We would then have the below:

array1 [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
array2 [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Now proceed to next word and do the same.

h1("abc") = 7
h2("abc") = 3
h3("abc") = 9

So in this case, h3("abc") clashes with h3("xyz"). We will come to this in a while. Update the table values:


array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
array2 [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

Assume that the next word is again "xyz". We would then get the tables below:

array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
array2 [ 0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

To calculate the frequency of "xyz", again calculate h1, h2, and h3. Go to the tables and find the values at those positions: we have 2, 2 and 3. The minimum, 2, is the answer - the word "xyz" is found 2 times. Similarly for "abc" we get 1, 1, 3, and the minimum, 1, is the answer. It's been found that with a proper choice of hash functions, collisions can be reduced considerably. Also, increasing the array size from 10 to a bigger number makes the possibility of conflicts remote. Algorithms of this type are called Streaming Algorithms (they solve the "heavy hitters" problem), and this particular one is the Count-Min Sketch algorithm. They are used by Google, Facebook, Twitter and many more, and many Big Data solutions are based on them. The Bloom filters used in HBase are built on a similar idea.

The beauty of this algorithm is that you don't need to store the input, and you can compute the result in one pass. So both the time and space complexity are remarkably low. Even though this is a probabilistic approach, the chance of error can be kept down to around 1 in a million, which is still good given the performance gain.
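To make the idea concrete, here is a minimal Count-Min Sketch in Java. The class name, hash seeds, width and depth below are illustrative choices of mine, not the code behind the report above.

class CountMinSketch {
    private static final int DEPTH = 3;            // number of hash functions
    private static final int WIDTH = 1 << 20;      // counters per hash function
    private final int[][] table = new int[DEPTH][WIDTH];
    private final int[] seeds = {31, 131, 1313};   // one seed per hash function

    private int index(String word, int row) {
        int h = seeds[row];
        for (int i = 0; i < word.length(); i++)
            h = h * 31 + word.charAt(i);
        return Math.abs(h % WIDTH);                // map the hash into [0, WIDTH)
    }

    // One pass over the stream: bump one counter per row.
    public void add(String word) {
        for (int row = 0; row < DEPTH; row++)
            table[row][index(word, row)]++;
    }

    // Estimated frequency = the minimum counter across the rows.
    public int estimate(String word) {
        int min = Integer.MAX_VALUE;
        for (int row = 0; row < DEPTH; row++)
            min = Math.min(min, table[row][index(word, row)]);
        return min;
    }
}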

Now this algorithm is good, but how can we compute the top-10 words when we are not storing the actual words? For this, I did a workaround: I used a MaxHeap to compute the top 10, and I keep around the top 1000 occurrences to support a future requirement. In the code, you might notice that I have not implemented a MaxHeap at all; I wanted to show that a MaxHeap can be built using a simple HashMap too (really simple).

Note: what can this algorithm not do? If we want to print all words and their frequencies, it's not possible, since we are not storing all the words. I have also built solutions using Online Algorithms (a Patricia Trie) and traditional algorithms (a HashMap with a Linked List). If you are interested in these, please email me at rajur79@gmail.com.

Solution



Tuesday, November 27, 2012

LRU Cache Implementation - O(1)

LRU Cache is a well-known data structure used in many applications, such as Ehcache (used by Hibernate), web caching (to cache the most frequently used pages), disk-cleanup utilities, etc. The main concept here is that you need to build a cache which holds key/value pairs. The cache has a predefined size; if the number of key/value pairs exceeds that size, then flush out the least recently used items.

For instance, say I added items a, b, c, d, e in that order and the size limit is 5. If I want to add f, then I have to flush out a (since that is the least recently used). So I would have b, c, d, e, f. Now if I access c (either update or get), then that becomes the most recent item, so the order would be b, d, e, f, c. We should support a remove operation as well.

Java provides the LinkedHashMap class, which can be used to implement an LRU cache. However, the time complexity would not be a guaranteed O(1), as it doesn't ensure collision-free hashing. So we will redesign this cache. I have not added multi-threading capability to my implementation, and it is not templatized.

In my earlier post I discussed how to build a perfect HashMap (with no collisions). Today we will use that HashMap and slightly modify the structure to achieve LRU behavior. The HashMap stores key/value pairs; for now I am using Integer keys and String values. Let's call this key/value class "Pair". The HashMap is nothing but an array of Pairs. But as per our requirement, whenever a key is accessed (update/get) we need to make it the most recent. So what we need is a linked list of the keys, where the accessed key is brought to the front. However, if I kept a separate list, I would need to traverse all the keys to find the specific one. So I modified the Pair class as below:


class Pair {
    Integer key;
    Node val;    // the value lives inside a linked-list node
}

class Node {
    String val;
    Node left;   // previous node in recency order
    Node right;  // next node in recency order
}

So I can store the value in the node, and the same node can be used as a linked-list node.

So whenever a new key/value pair is added, the value node is added at the head. When the cache is full, the key and the value node at the tail are removed. Whenever a key is accessed, the corresponding node is brought to the head.
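Here is a minimal sketch of the resulting design in Java. It uses the standard java.util.HashMap instead of the perfect HashMap from my earlier post, and the names are illustrative.

import java.util.HashMap;
import java.util.Map;

class LRUCache {
    private static class Node {
        int key; String val; Node prev, next;
        Node(int key, String val) { this.key = key; this.val = val; }
    }

    private final int capacity;
    private final Map<Integer, Node> map = new HashMap<>();
    private final Node head = new Node(0, null); // sentinel: most recent side
    private final Node tail = new Node(0, null); // sentinel: least recent side

    LRUCache(int capacity) {
        this.capacity = capacity;
        head.next = tail; tail.prev = head;
    }

    private void unlink(Node n) { n.prev.next = n.next; n.next.prev = n.prev; }

    private void addFront(Node n) {
        n.next = head.next; n.prev = head;
        head.next.prev = n; head.next = n;
    }

    public String get(int key) {
        Node n = map.get(key);
        if (n == null) return null;
        unlink(n); addFront(n);                   // accessed => most recent
        return n.val;
    }

    public void put(int key, String val) {
        Node n = map.get(key);
        if (n != null) { n.val = val; unlink(n); addFront(n); return; }
        if (map.size() == capacity) {             // full: evict least recently used
            Node lru = tail.prev;
            unlink(lru);
            map.remove(lru.key);
        }
        n = new Node(key, val);
        map.put(key, n);
        addFront(n);
    }
}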

Solution




Sunday, November 25, 2012

Multi-Player Snake And Ladder Game using Java

A bit of fun using Java to build a snake and ladder game. We are given the positions of the snakes and the ladders. Initialize the board by setting 4 arrays - snakeStart, snakeEnd, ladderStart, ladderEnd. Once that is set, input the total number of players (N) and initialize a score array of int[N]. In round-robin fashion, starting with player 1 through player N, throw the dice - this is nothing but generating a random number between 1 and 12. The value obtained is added to score[i]. To check whether the square landed on is the head of a snake or the foot of a ladder, do a binary search on snakeStart and ladderStart; if it is, the player moves to the corresponding position from snakeEnd or ladderEnd. If the total reaches 100, declare the winner :) A sketch of one turn is below.
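This sketch assumes snakeStart and ladderStart are kept sorted, with snakeEnd/ladderEnd aligned index-for-index; the overshoot rule (stay put past 100) is one common convention, not spelled out in the post.

import java.util.Arrays;
import java.util.Random;

class SnakeAndLadderTurn {
    // Returns the player's new position after one dice throw.
    static int turn(int pos, int[] snakeStart, int[] snakeEnd,
                    int[] ladderStart, int[] ladderEnd, Random rnd) {
        int next = pos + rnd.nextInt(12) + 1;              // the post's 1-12 throw
        if (next > 100) return pos;                        // overshoot: no move
        int s = Arrays.binarySearch(snakeStart, next);     // landed on a snake head?
        if (s >= 0) return snakeEnd[s];
        int l = Arrays.binarySearch(ladderStart, next);    // landed on a ladder foot?
        if (l >= 0) return ladderEnd[l];
        return next;
    }
}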


Solution


Saturday, November 24, 2012

Interviewstreet Challenge: Median

Problem
https://www.interviewstreet.com/challenges/dashboard/#problem/4fcf919f11817


The median of M numbers is defined as the middle number after sorting them in order if M is odd, or the average of the middle 2 numbers (again after sorting) if M is even. You have an empty number list at first. Then you can add or remove numbers from the list. For each add or remove operation, output the median of the numbers in the list.
 
Example : For a set of m = 5 numbers, { 9, 2, 8, 4, 1 } the median is the third number in sorted set { 1, 2, 4, 8, 9 } which is 4. Similarly for set of m = 4, { 5, 2, 10, 4 }, the median is the average of second and the third element in the sorted set { 2, 4, 5, 10 } which is (4+5)/2 = 4.5  
 
Input:
 
The first line is an integer n indicating the number of operations. Each of the next n lines is either "a x" or "r x", which indicates whether the operation is an add or a remove.
 
Output:
 
For each operation: If the operation is add output the median after adding x in a single line. If the operation is remove and the number x is not in the list, output "Wrong!" in a single line. If the operation is remove and the number x is in the list, output the median after deleting x in a single line. (if the result is an integer DO NOT output decimal point. And if the result is a double number , DO NOT output trailing 0s.)
 
Constraints:
 
0 < n <= 100,000
 
for each "a x" or "r x" , x will fit in 32-bit integer.
 
Sample Input:
 
7
r 1
a 1
a 2
a 1
r 1
r 2
r 1
 
Sample Output:
Wrong!
1
1.5
1
1.5
1
Wrong!
 
Note: As evident from the last line of the input, if after remove operation the list becomes empty you have to print "Wrong!" ( quotes are for clarity )

Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with at most 256 MB of RAM. That means our solution can never be brute force or O(n^2). The same applies to the space complexity.

Median can be calculated easily if the data is already sorted. However in this case the data is not sorted, we can add and remove data in any order. 
So we are left with 2 options - 
  1. keep the data sorted at all times
  2. use Heaps.
Approach:1
If we need to keep the data sorted at all times, we need to ensure that whenever the user adds a value, we search for the appropriate position and insert the value there. You might be wondering why I didn't use a TreeSet. The problem is that a Set doesn't allow duplicates, and in this case I want to allow duplicates. We also have to ensure that adding a value doesn't take O(n) time. So we can use a variant of binary search - find the number, or the nearest number less than it. I have already discussed this approach in my earlier postings. So what I did is build a sorted ArrayList class where add/remove use binary search rather than sequential scans. This saved a lot of time: the worst case took 3 seconds against the 5-second benchmark.
Next comes the median calculation. Once we have the sorted ArrayList, the median is the middle number if the list size is odd. Otherwise we take the numbers at (size/2 - 1) and (size/2); their average is the median. Use long or BigInteger arithmetic so the intermediate sum doesn't exceed Integer.MAX_VALUE.

Approach:2
Maintain the data in 2 heaps - one max-heap (for the lower half) and one min-heap (for the upper half). The sizes of both should be the same if the total number of elements across both heaps is even; otherwise the sizes differ by 1. I have given code for this median solution in my earlier postings. The main problem with this approach is that when data is added or removed, we have to maintain the heap sizes, which may mean moving some data from one heap to the other. A minimal sketch is below.
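A sketch of the two-heap idea; note that java.util.PriorityQueue.remove(Object) is O(n), so this favors clarity over the strict time bounds (an indexed heap, or the sorted-list approach above, fixes that).

import java.util.Collections;
import java.util.PriorityQueue;

class MedianHeaps {
    private final PriorityQueue<Integer> lower =
            new PriorityQueue<>(Collections.reverseOrder());   // max-heap: lower half
    private final PriorityQueue<Integer> upper = new PriorityQueue<>(); // min-heap: upper half

    public void add(int x) {
        if (lower.isEmpty() || x <= lower.peek()) lower.add(x);
        else upper.add(x);
        rebalance();
    }

    public boolean remove(int x) {             // note: PriorityQueue.remove is O(n)
        boolean removed = (!lower.isEmpty() && x <= lower.peek())
                ? lower.remove(x) : upper.remove(x);
        if (removed) rebalance();
        return removed;                        // false => print "Wrong!"
    }

    private void rebalance() {                 // sizes equal, or lower holds one extra
        if (lower.size() > upper.size() + 1) upper.add(lower.poll());
        else if (upper.size() > lower.size()) lower.add(upper.poll());
    }

    public double median() {                   // assumes at least one element present
        if (lower.size() == upper.size())
            return ((double) lower.peek() + upper.peek()) / 2.0;
        return lower.peek();
    }
}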

Solution 



Interviewstreet Challenge: Flowers

Problem
https://www.interviewstreet.com/challenges/dashboard/#problem/4fd05444acc45


You and your K-1 friends want to buy N flowers. Flower number i has cost ci. Unfortunately the seller does not like a customer to buy a lot of flowers, so he raises the price of flowers for customers who have bought flowers before. More precisely, if a customer has already bought x flowers, he should pay (x+1)*ci dollars to buy flower number i.
You and your K-1 friends want to buy all N flowers in such a way that you spend as little money as possible.

Input:
The first line of input contains two integers N and K.
next line contains N positive integers c1,c2,...,cN respectively.

Output:
Print the minimum amount of money you (and your friends) have to pay in order to buy all n flowers.
Sample input:
3 3
2 5 6
Sample output :
13
Explanation :
In the example, each of you and your friends should buy one flower; in this case you have to pay 13 dollars.
Constraint :
1 <= N, K  <= 100
Each ci is at most 1,000,000
Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with at most 256 MB of RAM. That means our solution can never be brute force or O(n^2). The same applies to the space complexity.

The task is to find the minimum total cost, and the total is minimum only if each term (x+1)*ci is kept small - x should be small where ci is large. This indicates that the group should buy flowers uniformly: we cannot have person A buying 3 flowers while person B buys only 1. So we should use a round-robin scheme, each person buying one flower at a time. And keeping ci small where x is large means we should buy the low-cost flowers later - start buying the high-cost flowers and then move to the low-cost ones.
So we arrive at a solution - sort the costs in descending order and assign them round-robin.

For instance, suppose the costs are 9, 6, 8, 7, and there are 3 people buying. Then one person has to buy 2 flowers and the other two buy one each. The person who buys 2 flowers should buy the cheapest flower (cost 6) as the 2nd flower. That gives us:
Person A = 9, 6
Person B = 8
Person C = 7
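A sketch of the whole approach - sort descending, assign round-robin; the names are illustrative.

import java.util.Arrays;

class Flowers {
    static long minTotalCost(int[] costs, int k) {
        Integer[] c = new Integer[costs.length];
        for (int i = 0; i < costs.length; i++) c[i] = costs[i];
        Arrays.sort(c, (a, b) -> b - a);          // most expensive flowers first
        long total = 0;
        for (int i = 0; i < c.length; i++) {
            int round = i / k;                    // flowers this buyer already has
            total += (long) (round + 1) * c[i];   // pay (x+1)*ci
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(minTotalCost(new int[]{2, 5, 6}, 3)); // prints 13
    }
}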

Solution




Thursday, November 22, 2012

Interviewstreet Challenge: Even Tree

Problem


You are given a tree (a simple connected graph with no cycles). You have to remove as many edges from the tree as possible to obtain a forest, with the condition that each connected component of the forest contains an even number of vertices.
Your task is to calculate the number of removed edges in such a forest.

Input:
The first line of input contains two integers N and M. N is the number of vertices and M is the number of edges. 2 <= N <= 100.
The next M lines contain two integers ui and vi which specify an edge of the tree (1-based index).

Output:
Print a single integer which is the answer
Sample Input 
10 9
2 1
3 1
4 3
5 2
6 1
7 2
8 6
9 8
10 8
Sample Output :
2
Explanation : On removing the edges (1, 3) and (1, 6), we can get the desired result.
Original tree:


Decomposed tree:

Note: The tree in the input will be such that it can always be decomposed into components containing even number of nodes. 

Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with at most 256 MB of RAM. That means our solution can never be brute force or O(n^2). The same applies to the space complexity.

Well, this is an interesting problem. Initially it appears to be a problem on n-ary trees, but building a tree and performing the decomposition on it is time consuming. So I approached it with a map, and all my 10 test cases ran in about 1 second.
Create a map from each parent to the list of its children. Once this map is created, take the root and go through its children. Here node 1 has children 2, 3 and 6. Take 2 and check whether the number of nodes under it is even or odd (I use a recursive call to get the count). In this case there are only two nodes under it - 7 and 5 (an even count) - so we cannot remove the link 1-2, since that would leave a component of 3 nodes (2, 7 and 5), which is not an even number. So do not increment the count; instead remove 2 from the list of 1's children and add 7 and 5 to the list.
Now take 6: it has 3 nodes under it (8, 9 and 10), so the link 1-6 can be removed. Increment the counter, remove 6, and add 8, 9 and 10 to the list.
The next entries 7, 5, and 4 have no children. Continue to 8: it has 2 nodes under it, hence its link cannot be removed.



Now take the next element in the list - 3. It has only 1 node under it (an odd count), so we can break the 1-3 link. Increase the counter, remove 3 and add 4 to the list. A recursive sketch of the same counting idea is given below.
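This is my recursive variant of the same idea, not the iterative map manipulation described above: any child whose subtree has an even node count marks one removable edge.

import java.util.*;

class EvenTree {
    static int removed = 0;

    // Returns the size of the subtree rooted at node; counts removable edges.
    static int dfs(int node, Map<Integer, List<Integer>> children) {
        int size = 1;                              // count this node itself
        for (int child : children.getOrDefault(node, Collections.emptyList())) {
            int sub = dfs(child, children);
            if (sub % 2 == 0) removed++;           // even subtree: this edge can be cut
            size += sub;
        }
        return size;
    }

    public static void main(String[] args) {
        // Sample input edges, given as (child, parent) with root 1.
        int[][] edges = {{2,1},{3,1},{4,3},{5,2},{6,1},{7,2},{8,6},{9,8},{10,8}};
        Map<Integer, List<Integer>> children = new HashMap<>();
        for (int[] e : edges)
            children.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        dfs(1, children);
        System.out.println(removed);               // prints 2 for the sample input
    }
}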



Solution


Interviewstreet Challenge: String Similarity


Problem

String Similarity (25 Points)
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
Calculate the sum of similarities of a string S with each of its suffixes.
Input:
The first line contains the number of test cases T. Each of the next T lines contains a string each.
Output:
Output T lines containing the answer for the corresponding test case.
Constraints:
1 <= T <= 10
The length of each string is at most 100,000, and strings contain only lowercase characters.
Sample Input:
2
ababaa
aa

Sample Output:
11
3

Analysis

All Java solutions on InterviewStreet need to run within 5 seconds with at most 256 MB of RAM. That means our solution can never be brute force or O(n^2). The same applies to the space complexity.
So the length of the string can be up to 100,000, and in total we have up to 10 test cases, all of which have to run within 5 seconds. The hint here is that you cannot afford heavy String manipulation: calls like String.indexOf or String.charAt incur additional time. A better option is to convert the String to a character array and work on that.
The next hint is that instead of creating multiple substrings, we create a virtual substring in place. What that means is: if the string is "raju", I don't need to store another substring "aju". I can use 2 pointers, one pointing at position 0 of "raju" and the other pointing at position 1. While the characters match, increment the counter and move both pointers forward.
E.g.
Take the string "ababaa". The first suffix is the original string itself, "ababaa", and it matches on all characters.
Next we compare "ababaa" with "babaa": check index 0 against index 1. They don't match, so stop here.
Next, proceed to "ababaa" versus "abaa": check index 0 against index 2 (a and a). They match; increment the counter. Next, index 1 against index 3 (b and b): increment again. The next pair matches as well. After that, b does not match a, so stop here.
Continue this for the remaining suffixes; a sketch follows.
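A sketch of the two-pointer approach described above:

class StringSimilarity {
    static long similaritySum(String str) {
        char[] s = str.toCharArray();          // work on a char array, not String.charAt
        long sum = s.length;                   // the string matches itself fully
        for (int start = 1; start < s.length; start++) {
            int i = 0, j = start;
            while (j < s.length && s[i] == s[j]) { i++; j++; }
            sum += i;                          // common-prefix length with this suffix
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(similaritySum("ababaa")); // 11
        System.out.println(similaritySum("aa"));     // 3
    }
}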

Solution



Interviewstreet Challenge: Pairs

Problem

https://www.interviewstreet.com/challenges/dashboard/#problem/4e14b83d5fd12

Given N numbers [N <= 10^5], we need to count the total pairs of numbers that have a difference of K [K > 0 and K < 10^9].

Input Format:
1st line contains N & K (integers).
2nd line contains N numbers of the set. All the N numbers are assured to be distinct.
Output Format:
One integer saying the no of pairs of numbers that have a diff K.

Sample Input #00:
5 2
1 5 3 4 2

Sample Output #00:
3


Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with at most 256 MB of RAM. That means our solution can never be brute force or O(n^2). The same applies to the space complexity.

The array contains up to 100,000 numbers, and we need to find the pairs of numbers whose difference is K. The clue here is that if the numbers are not sorted, we are only left with the brute-force approach:
checking each number against every other number to see if they make a pair. This proves costly, and we would run out of our 5 seconds.
So one of the simplest solutions is to sort the array using any built-in sort method, then apply a modified binary search: take each number x and do a binary search for (x+k). If it is found, they make a pair. Then take the next number in the sorted array, and so on.
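A sketch of the sort-plus-binary-search approach (the overflow guard is mine, since x+k can exceed an int):

import java.util.Arrays;

class Pairs {
    static int countPairs(int[] a, int k) {
        Arrays.sort(a);
        int count = 0;
        for (int x : a) {
            // numbers are distinct, so each hit is exactly one pair
            if (x <= Integer.MAX_VALUE - k && Arrays.binarySearch(a, x + k) >= 0)
                count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPairs(new int[]{1, 5, 3, 4, 2}, 2)); // prints 3
    }
}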

Solution



Tuesday, November 20, 2012

5 Solutions for Search

Search is a favorite topic today and there are lots of algorithms around it, thanks to the rise in the data generated by systems and the increase in storage capacity. So we need faster and more efficient algorithms to search. Today we will discuss search in depth and look at multiple algorithms. I am limiting the discussion to 5 solutions, but there are tons of others not covered here.

Types of Search
  1. Fixed text base and variable pattern - we have a more or less fixed data set which we search repeatedly, each time for a different pattern. This is more or less like web search. Take Google as an example: there is a huge number of documents to be searched, so the crawlers index the web and build a data structure. The data set is now fixed, and our searches operate on it.
  2. Variable text base and fixed pattern - the data set changes regularly, but we search for the same pattern every time. This is employed in censoring applications, e.g. a news feed: the feed arrives continuously, and we check whether it contains any abusive words.
  3. Variable text base and variable pattern - the data keeps changing and so do our queries. This is the common case.
Solution:1 - Naive or Brute-Force approach
This is one of the simplest solutions and the least effective one. We are given a string "S" of length "n" and a pattern "P" of length "m". We need to scan through S and find everywhere the pattern P occurs, printing all the indexes in increasing order. This is somewhat like printing every s.indexOf(p).
e.g. S:   thisishiscar
      P:   hiscar
In this approach, we keep two variables i and j initialized to 0. If the characters at i and j are the same, save the position i in another variable k. Increase i and j while the two characters match. In our case, they match up to the point below.
thisishiscar
hiscar
After this position it does not match, so we reset j to 0 and set i to (k+1). This is because position k failed to match, so we restart from the next position.
This approach is definitely not suggested. Let's derive its time complexity (for people who are not familiar with Big-O): in the worst case we compare each character of pattern 'P' to each character of string 'S', so we have m*n comparisons and the time complexity is O(m*n). What that means is, if 1 comparison takes say 1 ms, then searching for a pattern of length 5 in a string of length 100 takes 500 ms. A sketch is below.
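A sketch of the brute-force scan:

class NaiveSearch {
    static void search(String s, String p) {
        for (int k = 0; k + p.length() <= s.length(); k++) {
            int j = 0;                 // advance j while the pattern matches at k
            while (j < p.length() && s.charAt(k + j) == p.charAt(j)) j++;
            if (j == p.length()) System.out.println("found at index " + k);
        }
    }

    public static void main(String[] args) {
        search("thisishiscar", "hiscar");   // found at index 6
    }
}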

CODE

Alternate solution: to address the problem of comparing every character of the pattern with every character of the text, there is the KMP algorithm. But then again, the catch with that algorithm is that we need to calculate the failure function for each pattern.


Solution:2 - Java List Approach
Now let's grow the problem slightly. Say we are given a list of names where each name consists of FirstName and LastName. We have to search for a name (either first name or last name), and the program should print all names where it matches.
I have used a text file available on the web consisting of 151,671 names, available at http://stavi.sh/downloads/names.txt. Download this text file to your local system; I have stored it in the "C:\\Apps" folder. We will use this file for the remaining approaches.
In this approach, store all the names in an ArrayList. When the user searches for a name "xyz", go through the entire list and check whether a value starts with "xyz " (notice the space at the end) or ends with " xyz" (notice the space at the beginning). This is because we want an exact match.
So what is the time complexity of this approach? It's the same as approach 1: we scan the entire list and compare the pattern against each entry. If we have N words, the length of the longest word is K, and the pattern length is M, then the complexity is O(N*M*K), because for each of the N words in the word list the comparison costs O(K*M).

The program loaded the 151,671 words in 77 ms and searched in 94 ms.

CODE


Solution:3 - Java List and Map Approach
So how can we improve on approach 2? Assume all the names contain only the 26 English letters. Instead of keeping one list, we keep an array of 26 lists: any first name or last name starting with 'A' goes to list entry 0, one starting with 'B' goes to entry 1, and so on. At each entry we maintain a LinkedHashMap, mapping the first name/surname to the index of the full name in the text file. So if we have the name "Anand Chari" at position 4 of the text file, we store ("Anand", 4) in list entry 0 and ("Chari", 4) in list entry 2. Once this data structure is built, to search we take the first character of the search key and go to the appropriate list entry. We search only that entry, not the other 25.

So have we made the search about 26 times faster? Roughly yes, but the map factor adds to the cost, and we are still going through all the names that share a given first character. So the order is O(N*M*K/d), where d is the character-set size - in this case 26.


The program loaded the 151,671 words in 769 ms and searched in 3 ms.


CODE


Solution:4 - TRIE
We will now introduce the TRIE data structure. (Figure: the basic structure of a TRIE - each node holds a character, and double-circled nodes mark ends of words.)
There are 2 ways of implementing a TRIE - with a character array or a linked list. In the character-array TRIE, each node stores a value and a marker indicating whether it is an end node (the double circle), along with its children. The children are again TRIE nodes, and the number of child slots is d, the character-set size - 26 for English, or 256 for an 8-bit character set. So this wastes a lot of space.
To overcome this, we can use the linked-list TRIE, where a node keeps only the children actually used, avoiding the extra space requirement. But we pay for the lost indexing capability with sequential access through the sibling list; keeping the list sorted speeds this up.
In this approach, we will store the first name/last name and the position of each occurrence of the name in the file.
Search is a simple walk down the trie - a depth-first descent. For instance, to search for "acd" in the linked-list variant, we go to node "a", step down to its first child "b", move along the sibling list to "c", and then go down to "d". So we only touched 4 nodes. What is the time complexity? O(m*d), because in the worst case we traverse up to d sibling nodes for each of the pattern's m characters. A sketch of the character-array variant is below.
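A sketch of the character-array variant, restricted to lowercase words for brevity (the end marker plays the role of the double circle):

class Trie {
    static class Node {
        Node[] children = new Node[26];   // d = 26 child slots
        boolean isWord;                   // the "double circle" end marker
    }

    private final Node root = new Node();

    void insert(String word) {
        Node cur = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (cur.children[i] == null) cur.children[i] = new Node();
            cur = cur.children[i];
        }
        cur.isWord = true;
    }

    boolean search(String word) {         // O(m) with the array variant
        Node cur = root;
        for (char c : word.toCharArray()) {
            cur = cur.children[c - 'a'];
            if (cur == null) return false;
        }
        return cur.isWord;
    }
}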


The program loaded the 151,671 words in 1 sec and searched in 0 ms. As you can see, the build time has increased, but the search is really fast.

CODE

Solution:5 - PATRICIA TRIE (RADIX TRIE)
The problem with tries is the space requirement: they store a character per node, so the number of nodes can approach d*N even with the linked-list approach - on the order of 151671*26 nodes just for the English alphabet. To address this we have the PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) trie. Here, if we have N keys, the total number of nodes is always (2*N - 1).
It's a difficult algorithm to explain, but I will try my best. It uses the binary representation of the string.
The ASCII value of 'A' is 65, which is represented in binary as "0100 0001".
So if we have the string s = "AC", we can represent it as below:
0100 0001 0100 0011

Lets say we want to add 3 strings "THAT", "THEM" and "THEY" to the tree.

THAT     0101 0100 0100 1000 0100 0001 0101 0100
THEM    0101 0100 0100 1000 0100 0101 0100 1101
THEY     0101 0100 0100 1000 0100 0101 0101 1001

So the words THAT and THEM differ at bit position 21 (0-based, counting left to right). Similarly, THEM and THEY differ at position 27. So we make a binary tree keyed on these distinguishing bits, as in the figure below.

Notice that to store 3 keys we have used 5 nodes (2*3 - 1).

In "THAT" the bit at position 21 is 0, which is why it sits on the left. For "THEM" and "THEY" that bit is 1, which is why they are on the right. "THEM" and "THEY" are then separated by bit 27: THEM has 0 at that position and THEY has 1.
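A small helper sketch for reading such bits (the method name is mine), checked against the values above:

class BitAt {
    // The i-th bit (0-based, left to right) of a string's 8-bit encoding.
    static int bitAt(String s, int i) {
        char c = s.charAt(i / 8);              // which character the bit falls in
        return (c >> (7 - (i % 8))) & 1;       // which bit inside that character
    }

    public static void main(String[] args) {
        System.out.println(bitAt("THAT", 21)); // 0 -> THAT goes left
        System.out.println(bitAt("THEM", 21)); // 1 -> THEM goes right
    }
}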

So we build a tree like this, and search is very simple. There are several other reasons for choosing this data structure: we can do a lot more, like proximity search, approximate search, regex search, substring search, and so on.


The program loaded the 151,671 words in 2 sec and searched in 1 ms. The increase in build time is due to the fact that we are using string manipulation functions to convert the String to binary; I haven't optimized that yet.

So what did we gain from all this? A much smaller memory requirement and a very fast search.


CODE



Thursday, November 15, 2012

Solution for Probability using Tree

Probability is one area that's most often ignored. However, there are very interesting problems on it, such as tossing coins or picking a ball. Why do people ask such questions in interviews? Well, if we observe carefully, a coin toss is Head/Tail - that is, binary. So these problems can be solved using binary algorithms. We will discuss how, using a special type of complete binary tree called a "Probability Tree".
Let's say I toss a coin n times and get the following sequence - HTHTHHTTTHHHTHTH
Here H denotes Head and T denotes Tail. With this data, can you predict the probability for a series of 1, 2 or 3 coin tosses? For example, what is the probability that my first toss results in H? What is the probability of getting Head first and Tail second? What is the probability of getting HHT? All of these can be answered using a probability tree.
For our purposes we will consider series of at most 3 tosses. Treat H as '0' and T as '1'. Read the input as sliding windows of 3, starting from the left. Since there are 16 tosses in total, we get 14 windows of 3:

HTH
THT
HTH
THH
HHT
HTT
TTT
TTH
THH
HHH
HHT
HTH
THT
HTH

What we have done is read 3 tosses at a time, each time dropping the first: (1,2,3), (2,3,4), (3,4,5), etc.

Now prepare a 3-level complete binary tree and let left branches denote 'T' and right branches denote 'H'. Initialize the count of every node to 0. Take the above 14 windows one by one. First comes HTH: go right, left, and right, and as we traverse, increase each visited node's count by 1. Do this for all 14 windows. Now we are done with the counts. At level 1 the left node has count 6 and the right node 8, so the total for the 2 nodes is 14. The probability for the left node is 6/14 and for the right 8/14. Store these and move on to the next levels: for each node, consider its two children and set their probabilities from the children's counts. That's all - we are done!!

Now, to find the probability of an event, let's say "H": we can go directly to the right node and read its probability - nearly 0.57.
Say we want the probability of "TT": it's 0.43*0.33.
The probability of "TTH" is 0.43*0.33*0.5.

Note that the path probabilities over all leaf nodes sum to exactly 1. A counting sketch is below.
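This sketch flattens the tree into a map of prefix counts - reading a prefix count over the total windows is equivalent to multiplying the branch probabilities down the tree:

import java.util.HashMap;
import java.util.Map;

class ProbabilityTree {
    public static void main(String[] args) {
        String tosses = "HTHTHHTTTHHHTHTH";
        Map<String, Integer> count = new HashMap<>();
        int windows = tosses.length() - 2;              // 14 windows of length 3
        for (int i = 0; i + 3 <= tosses.length(); i++) {
            String w = tosses.substring(i, i + 3);
            for (int len = 1; len <= 3; len++)          // count every prefix of the window
                count.merge(w.substring(0, len), 1, Integer::sum);
        }
        System.out.println("P(H)  = " + (double) count.get("H") / windows);  // 8/14 ~ 0.57
        System.out.println("P(TT) = " + (double) count.get("TT") / windows); // ~ 0.43*0.33
    }
}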

Solution:



Wednesday, November 14, 2012

Algorithm used by BZip and JPEG compression

We all use WinZip, JPEG and other compression utilities, and the space gained by compression is really high. We will discuss the underlying implementation behind them.

In Java the char primitive type is stored as 2 bytes (16 bits). ASCII values, however, are 7 bits, so the total number of ASCII characters is 128; in ASCII 'a' is 97 and 'A' is 65.

So in Java, if I want to store the string "cat", it takes 2*3 = 6 bytes (48 bits).

Imagine you are writing a utility which loads an English novel into memory. If we store this huge data as-is, we will most probably run into memory issues. So what is the alternative?
Huffman proposed an algorithm for compression, with support for both encoding and decoding.
Let's say our text is "bcbbbbbbaacaabbcade". This contains 19 letters, so we need 19*16 = 304 bits to represent this string.

Algorithm:
Consider the input string and count how many times each character appears in it. Store the counts somewhere in ascending order of frequency, which gives the list below:

e=1
d=1
c=3
a=5
b=9

What is the most efficient way to do this? Our character set size is 256, so initialize an integer array of size 256. Read the input string character by character and increment the count at the index of each character. So if our array is arr[256], we end up with arr[97] = 5 (for 'a'), arr[98] = 9 (for 'b'), and so on.
So the time complexity is O(n), where n is the length of the input string.

Now take the first 2 entries, (e,1) and (d,1). We store these 2 as child nodes of a binary-tree-like structure. The parent of these 2 nodes is a consolidated node (ed, 2) - here we combined the 2 characters and added their frequencies.

           (ed, 2)
(e,1)                 (d, 1)

So the frequency of 'ed' is 2. That is now the minimum; the next minimum is (c, 3). So our new node (edc, 5) has left child (ed, 2) and right child (c, 3). Similarly we combine with (a, 5) and then (b, 9); finally we get (bedca, 19) as the root node, whose frequency 19 is the length of the input string. The figure below shows the structure of the tree.
Mark all left nodes 0 and right nodes 1.
Double-circled nodes are leaves.

Start from the root node and do a DFS (Depth-First Search) until you encounter the leaf nodes. We get:

b = 0 
a = 11
c = 101
d = 1001
e = 1000



The input string thus translates to "010100000011111011111001011110011000", which has length 36. That means we only need 36 bits to store it, not 304 bits. Wow, that's a great saving of 88.15%.


Now, how do we decode a string? Say we want to decode "0001101011000". Read the bits from left to right, walking down the tree until you hit a leaf node. The first 0 immediately hits the leaf b, so "000" gives us "bbb". Then comes 1, taking us into the (edca, 10) subtree; we continue, since we still haven't hit a leaf. The next bit is 1, and we land on (a, 5), hence it is 'a'. Continuing this way we get "bbbabce". That is the decoded value.

In the solution, I have built the Huffman tree in an efficient way: no costly recursion, just plain arrays. Have fun coding!!
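For reference, here is a PriorityQueue-based sketch of the tree construction (not the array version used in my solution); ties may break differently than the tree drawn above, but the code lengths stay optimal.

import java.util.PriorityQueue;

class Huffman {
    static class Node {
        char ch; int freq; Node left, right;
        Node(char ch, int freq) { this.ch = ch; this.freq = freq; }
    }

    public static void main(String[] args) {
        String text = "bcbbbbbbaacaabbcade";
        int[] freq = new int[256];
        for (char c : text.toCharArray()) freq[c]++;     // O(n) frequency count

        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.freq - b.freq);
        for (int c = 0; c < 256; c++)
            if (freq[c] > 0) pq.add(new Node((char) c, freq[c]));

        while (pq.size() > 1) {                          // merge the two rarest nodes
            Node a = pq.poll(), b = pq.poll();
            Node parent = new Node('\0', a.freq + b.freq);
            parent.left = a; parent.right = b;
            pq.add(parent);
        }
        printCodes(pq.poll(), "");
    }

    static void printCodes(Node n, String code) {        // DFS: left = 0, right = 1
        if (n.left == null && n.right == null) {
            System.out.println(n.ch + " = " + code);
            return;
        }
        printCodes(n.left, code + "0");
        printCodes(n.right, code + "1");
    }
}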

Solution


Monday, November 12, 2012

Spell Check - Minimum Distance Calculation

Wherever there is manual user input, there is a possibility of a typo. As per research, the possibility of the first letter being typed wrongly is very remote.
Errors are classified into the following types:

  1. Extra character : "inpuut" instead of "input"
  2. Missing character: "publsh" instead of "publish"
  3. Wrong character: "prebiew" instead of "preview"

We count each of the above errors as a distance of 1. Given 2 strings, it is then possible to calculate the distance between them.
For e.g., the distance between "Apple" and "Applets" is 2, as there are 2 missing characters.
The distance between "Apple" and "Tablet" is 4: 3 wrong characters (A instead of T, p instead of a, p instead of b) and 1 missing character (t missing).

There is an algorithm to calculate this distance, due to the Russian mathematician Levenshtein, after whom it is named. If the length of string1 is 'n' and the length of string2 is 'm', the algorithm runs in O(n*m). First we plot the characters of string1 and string2 along the two axes of a matrix, and then we start filling in the values.

Calculation starts from (1,1) and the result is stored at the bottom right (shown in red in the figure).

At (1,1) we have 't' and 'a'. They do not match, so consider the values at (0,0), (0,1) and (1,0), take the minimum, and add 1. The candidate values are 0, 1, and 1; the minimum is 0, so adding 1 gives 1.

At (2,1) we have 'a' and 'a'. They match, so take the value at (1,0), which is diagonally above the current element.

The minimum distance is stored at (6,5), which is 4.
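A sketch of the matrix fill:

class EditDistance {
    static int distance(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] d = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) d[i][0] = i;   // deleting i characters
        for (int j = 0; j <= m; j++) d[0][j] = j;   // inserting j characters
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1))
                    d[i][j] = d[i - 1][j - 1];                  // match: take the diagonal
                else
                    d[i][j] = 1 + Math.min(d[i - 1][j - 1],     // substitute
                              Math.min(d[i - 1][j], d[i][j - 1])); // delete / insert
            }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("apple", "applets")); // 2
        System.out.println(distance("apple", "tablet"));  // 4
    }
}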


The same algorithm can be extended to check multiple strings; it can be applied to find all matching words in a dictionary. Some of the spell checkers that we see day to day use this to find the nearest matching word in a given dictionary, and the average distance they check against is 2 (as per research).
But preparing this matrix for each string and comparing in O(n*m) is time consuming, so there is a revised algorithm which runs in O(n log n + m log m). I am planning to put that algorithm up in a coming session, where I will post on the algorithm behind Google Docs.

Solution



Sunday, November 11, 2012

Perfect Hashing - Collision Free HashMap

The HashMap is a reliable key-value store in Java which works on the principles of hashing. The structure of the HashMap class is as below:

Buckets are an array of linked lists; the initial number of buckets is 11.
Each key/value pair is stored in the linked list, and the key's hashCode is used to find the bucket.
Due to hash collisions, multiple key/value pairs can hash to the same bucket.

For instance, consider the code below:
map.put( new Integer(33), "ThirtyThree");
map.put( new Integer(11), "Eleven");
map.put( new Integer(25), "TwentyFive");

The keys here are Integers, and the default hashCode implementation for the Integer class is its intValue.
So for the first put, the hashCode of key 33 is 33. This value is converted to a bucket index:
index = hashCode % capacity
index = 33 % 11 = 0
So the key/value pair is inserted at bucket 0.
Next, for 11 the bucket index is also 0. This is a hash collision, and it increases the length of the linked list. The more conflicts there are, the worse the performance gets, as the linked list has to be traversed.

How do we ensure that there are no conflicts at all?
The answer is perfect hashing. There are various techniques available; we will look at "Cuckoo Hashing", a scheme which ensures zero conflicts.


(Image source: Wikipedia)

In this scheme we maintain 2 sets of buckets (or tables) and use 2 hash functions, h(k) and h'(k):

h(k) = key % 11
h'(k) = (key/11) % 11

Here 11 is the capacity, which is changed at run-time as the table grows.

So to insert key 20, h(k) is calculated first:
20 % 11 = 9

The first preference is table1. The algorithm checks if index 9 of table1 is free. If it is, store the key (the key-value pair, in fact) at index 9; in that case h'(k) need not even be calculated.
The next key is 50, for which h(k) is 6. Again, table1 index 6 is free, so the key is inserted there.
The next key is 53, for which h(k) is 9. But at that index we already have key 20, so calculate h'(k): (53/11) % 11 = 4 % 11 = 4. Check if index 4 of table2 is free. It is, so insert the key into table2.

When inserting 67, h(k) = 1 and h'(k) = 6, and you'll notice that table1 index 1 is occupied, as is table2 index 6. Now apply the greedy step: steal table1 index 1 and store 67 there. The old key at that position, 100, is pushed out and has to find a new place. Since it was evicted from table1, it now seeks its position in table2 using h'(k). For 100, h'(k) is 9 and table2 index 9 is free, so it is stored there.
Similarly, follow the path for 39. For 39, h(k) is 6 and h'(k) is 3, and both slots are occupied. So force 105 out of table1 index 6. For 105, h'(k) is 9, so store 105 at table2 index 9 by forcing out 100. For 100, h(k) is 1; store it at table1 index 1 by forcing out 67. For 67, h'(k) is 6, so store it at table2 index 6 by forcing out 75. For 75, h(k) is 9, hence force out 53... and so on.

If we proceed like this, could we end up in an infinite loop? YES.
So the algorithm works by ensuring that the table's load factor never exceeds 0.5 - that is, no more than 50% of the buckets are occupied. (For people not familiar with load factor, it is size/capacity.) This way the possibility of an infinite loop stays remote, and if one does happen, we can rehash with different hash functions.
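An insert sketch following the walkthrough above, with an explicit relocation limit standing in for the rehash trigger (the names and the MAX_KICKS constant are illustrative):

class CuckooHash {
    private static final int CAPACITY = 11;
    private static final int MAX_KICKS = 32;            // bail out instead of looping forever
    private final Integer[] table1 = new Integer[CAPACITY];
    private final Integer[] table2 = new Integer[CAPACITY];

    private int h1(int key) { return key % CAPACITY; }
    private int h2(int key) { return (key / CAPACITY) % CAPACITY; }

    public boolean insert(int key) {
        // fast path: take a free slot if either table has one
        if (table1[h1(key)] == null) { table1[h1(key)] = key; return true; }
        if (table2[h2(key)] == null) { table2[h2(key)] = key; return true; }
        // both occupied: steal table1's slot and bounce evicted keys between tables
        boolean intoTable1 = true;
        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            Integer evicted;
            if (intoTable1) { evicted = table1[h1(key)]; table1[h1(key)] = key; }
            else            { evicted = table2[h2(key)]; table2[h2(key)] = key; }
            if (evicted == null) return true;           // found a free slot
            key = evicted;                              // evicted key tries the other table
            intoTable1 = !intoTable1;
        }
        return false;                                   // probable cycle: rehash needed
    }
}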

The code below implements Cuckoo Hashing. To simulate the infinite loop, test it by adding 6 (comment out the portion mentioned in the code).

Finally, why the name Cuckoo? It's the nest-stealing technique used by the cuckoo bird :)

Solution




Saturday, November 10, 2012

Convert a Binary Tree to Linked List

Given a binary tree (or binary search tree), convert it to a doubly-linked list. Can you do it in place?

Input:
                    60
           50             70
      40       55   65      75

Output:
      60 -> 50 -> 70 -> 40 -> 55 -> 65 -> 75

Approach
The structure of a binary tree node is similar to a linked list node: a tree node has left and right, while a doubly-linked list node has prev and next.
Do a Breadth-First Search (BFS) and adjust the pointers as the nodes are visited, as sketched below.
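A sketch, reusing left/right as prev/next so the conversion is in place apart from the queue:

import java.util.ArrayDeque;
import java.util.Queue;

class TreeToList {
    static class Node { int val; Node left, right; Node(int v) { val = v; } }

    static Node toDoublyLinkedList(Node root) {
        if (root == null) return null;
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        Node head = root, tail = null;
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            if (cur.left != null) queue.add(cur.left);    // enqueue children first,
            if (cur.right != null) queue.add(cur.right);  // before rewiring pointers
            cur.left = tail;                              // left now acts as prev
            if (tail != null) tail.right = cur;           // right now acts as next
            tail = cur;
        }
        tail.right = null;                                // terminate the list
        return head;
    }
}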

Solution



Friday, November 9, 2012

Print prime numbers from 1 to 10 Million

Print all prime numbers from 1 to 10 million.

Approach
A prime number is one which is divisible only by 1 and itself. So to find whether a number 'X' is prime, we could try dividing 'X' by each smaller number. However, this can be optimized: instead of checking all the numbers, check only up to the square root of 'X'.
Beyond this, the Sieve of Eratosthenes provides an optimized solution for calculating primes:
if a number is prime, then any multiple of it is not a prime.

With this approach, I was able to print all the prime numbers from 1 to 10 million in 4 seconds on my laptop.
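A sketch of the sieve:

class Primes {
    public static void main(String[] args) {
        int n = 10_000_000;
        boolean[] composite = new boolean[n + 1];
        for (int i = 2; (long) i * i <= n; i++) {
            if (composite[i]) continue;                  // i was crossed out: not prime
            for (long j = (long) i * i; j <= n; j += i)  // first new multiple is i*i
                composite[(int) j] = true;
        }
        for (int i = 2; i <= n; i++)
            if (!composite[i]) System.out.println(i);
    }
}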

Solution


Find the next smallest number for a given number


In an unsorted array, given a number 'x',
find the next smallest number (the largest number smaller than 'x').
The number 'x' is known to exist in the array.
Array {62,50,91,32,60,63,17,36,55,61}
Num = 62
Result = 61

Num = 50
Result = 36

Approach:
Sorting is not a solution, since it costs at least O(N log N).
Create a binary search tree and find the number 'X'. Once found, fetch the rightmost node of its left subtree - that is the required element. (If 'X' has no left subtree, the answer is instead the nearest ancestor whose right subtree contains 'X'.)

Solution



Find the number or next smallest number


 In a sorted array, find whether a given number 'X' is present.
 If found, print it.
 If not found, print the largest number smaller than the given number.
 Eg. {17,32,58,60,63,91}
 Num = 35
 Answer = 32

Approach
As the numbers are already sorted, we can use a modified version of binary search. The twist here is: when the middle number is less than 'X', check whether the next number is greater than 'X' - if so, the middle number is the answer.

Alternatively, what if the numbers were not sorted?
Use a min-heap: keep extracting the minimum until you reach a number greater than 'X'; the number extracted just before it is the answer.

Solution


Create a Binary Search Tree from Array

You are given an array representing the preorder traversal of a BST. From this array, construct the original BST.

for.eg. 
              { 60 40 30 28 35 42 70 65 62 68 72 }
 
Approach
Remember the property of a BST - the parent is greater than everything in its left subtree and less than everything in its right subtree.
Now, 60 is the root of the BST. Search for its possible right child: scan the array from position 2 onwards until you encounter a number bigger than 60. We end up at 70 - that is the right child. The left child is 40, the immediately following number.
Next take 40 (the left child of the root). Find its right child - the first following number greater than 40 and less than 70. It is 42. And the left child is 30.
Now take 70 (the right child of the root). Find its right child - the first following number greater than 70. We find 72. Similarly, its left child is 65.
For 30, the right child is the first following number greater than 30 (but less than 40): it is 35. Hence 28 is the left child.
For 42 there is no right child or left child, as there are no numbers between 42 and 70 in the array.
And so on... A recursive sketch is below.
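A recursive sketch of the same splitting idea:

class BstFromPreorder {
    static class Node { int val; Node left, right; Node(int v) { val = v; } }

    // Builds the BST from pre[lo..hi].
    static Node build(int[] pre, int lo, int hi) {
        if (lo > hi) return null;
        Node root = new Node(pre[lo]);
        int split = hi + 1;                          // first index with a bigger value
        for (int i = lo + 1; i <= hi; i++)
            if (pre[i] > pre[lo]) { split = i; break; }
        root.left = build(pre, lo + 1, split - 1);   // smaller values: left subtree
        root.right = build(pre, split, hi);          // bigger values: right subtree
        return root;
    }

    public static void main(String[] args) {
        int[] pre = {60, 40, 30, 28, 35, 42, 70, 65, 62, 68, 72};
        Node root = build(pre, 0, pre.length - 1);
        System.out.println(root.right.val);          // prints 70
    }
}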

Solution


Two non-repeating numbers in an unsorted array


Design an algorithm to find the two non-repeating numbers in an array where all the other numbers are repeated.
for eg.
input array: {2, 8, 6, 8, 3, 9, 2, 9}
result: 3,6

Approach:
XOR all the numbers. This gives a value x.
2^2, 8^8, and 9^9 all become 0, so what remains is 6^3:

0110
0011
------
0101
------

So x in this case is 5 (0101). Take its lowest set bit - here it is bit 0, the units place. The two unique numbers differ at that bit, which is why the XOR result has a 1 there: one of them has a 1 in that position and the other has a 0.
Now scan the input array again, and as the numbers are inspected, check whether each has a 0 or a 1 at that position. Form 2 XOR groups - one of the numbers with a 1 there and the other of the numbers with a 0. Within each group, the duplicates cancel out.
The results of the two groups are the 2 numbers.
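A sketch:

class TwoUnique {
    public static void main(String[] args) {
        int[] a = {2, 8, 6, 8, 3, 9, 2, 9};
        int xorAll = 0;
        for (int v : a) xorAll ^= v;        // pairs cancel out; leaves 6 ^ 3 = 5
        int bit = xorAll & -xorAll;         // lowest set bit: the two numbers differ here
        int x = 0, y = 0;
        for (int v : a) {
            if ((v & bit) != 0) x ^= v;     // group with the bit set
            else y ^= v;                    // group with the bit clear
        }
        System.out.println(x + " " + y);    // prints 3 6 (in some order)
    }
}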

Solution


First occurrence of a number in a sorted array


Given a sorted array of numbers where each number is repeated multiple times,  design an algorithm to find the first occurrence of a given number.
for e.g.
input array: {2,2,2,3,4,5,5,5,6,6,6,7,7,7,7,12,12,15,19}
number: 6
Here the number 6 is found at positions 8, 9 and 10 (0-indexed); 8 is the first index.

Approach: using binary search, locate the block of entries equal to the target number.
Then apply binary search within that block again to find its left edge - equivalently, whenever the target is found, record the index and keep searching the left half, as sketched below.
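A sketch of the left-biased binary search:

class FirstOccurrence {
    static int firstIndex(int[] a, int target) {
        int lo = 0, hi = a.length - 1, result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] == target) { result = mid; hi = mid - 1; } // keep searching left
            else if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return result;    // -1 if the target is absent
    }

    public static void main(String[] args) {
        int[] a = {2,2,2,3,4,5,5,5,6,6,6,7,7,7,7,12,12,15,19};
        System.out.println(firstIndex(a, 6)); // prints 8
    }
}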

Solution



Single non-repeating number


Given an array of numbers where every number except one is repeated, efficiently find that number.
for e.g. {9, 1, 8, 7, 8, 1, 9}
Answer: 7

Approach
Sorting and then comparing each number with the next - this is not an ideal approach, as the sort alone takes O(N log N).
Instead, use bit manipulation: XOR all the numbers, and the number left over is the result.

Solution



Sum of Digits until it becomes a single digit


You are given a huge number, and the task is to add its digits repeatedly until a single digit remains.
E.g.
Number = 9580658690133945975556610984511994659
9+5+8+0+6+5+8+6+9+0+1+3+3+9+4+5+9+7+5+5+5+6+6+1+0+9+8+4+5+1+1+9+9+4+6+5+9 = 195
1+9+5 = 15
1+5 = 6
Result = 6
 
Wait!! Can I do it in one pass? Yes, there is an O(N) solution:
add the digits once; if the sum is divisible by 9, the result is 9,
else the result is (sum mod 9). In the above case, the sum after the first pass is 195, and 195 % 9 = 6, the desired answer.
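A sketch (assuming a non-zero number):

class DigitRoot {
    public static void main(String[] args) {
        String number = "9580658690133945975556610984511994659";
        int sum = 0;
        for (char c : number.toCharArray()) sum += c - '0';  // 195 for this input
        int root = (sum % 9 == 0) ? 9 : sum % 9;             // digital root via mod 9
        System.out.println(root);                            // prints 6
    }
}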

Solution


Wednesday, November 7, 2012

Count number of 1's in a Number

Count the number of 1's in a number's binary representation.
E.g. the number 23 is 0001 0111,
so there are four 1's.

Approach
Right-shift the number (equivalent to dividing by 2). If doubling the shifted number gives back the original, the original was even - ignore it; else increment the count (the original number was odd, i.e. its lowest bit was a 1). Repeat on the shifted value:
23 >> 1   gives 11
11 >> 1   gives 5
5  >> 1    gives 2
2 >> 1     gives 1
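A sketch, testing the lowest bit directly and using an unsigned right shift so negative inputs terminate too:

class CountOnes {
    static int countOnes(int n) {
        int count = 0;
        while (n != 0) {
            count += n & 1;     // odd means the lowest bit is a 1
            n >>>= 1;           // unsigned shift: divide by 2
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOnes(23)); // prints 4
    }
}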

Solution



In place merge sort 2 sorted arrays

Given two sorted arrays A and B of sizes (n+m) and m respectively, where the last m slots of A are empty, design an algorithm to merge A and B in place.

e.g.  A = {8, 12, 15, 20, 22, 0, 0, 0, 0}
        B = {6, 13, 18, 19}
 Result= {6, 8, 12, 13, 15, 18, 19, 20, 22}

Approach
Apply a 2-way merge from the back. An advantage of this method is that it can be extended to any number of arrays (k-way merge). Take the last elements of the two arrays (22 and 19) and put the larger one at the end of array A. Next take 20 and the remaining smaller element (19), compare, and put the larger in the last-but-one position, and so on.
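A sketch of the backward merge - filling from the back means no element of A is overwritten before it is read:

class InPlaceMerge {
    // a holds n sorted values followed by b.length free slots.
    static void merge(int[] a, int n, int[] b) {
        int i = n - 1, j = b.length - 1, k = n + b.length - 1;
        while (j >= 0) {
            if (i >= 0 && a[i] > b[j]) a[k--] = a[i--];
            else a[k--] = b[j--];
        }
    }

    public static void main(String[] args) {
        int[] a = {8, 12, 15, 20, 22, 0, 0, 0, 0};
        int[] b = {6, 13, 18, 19};
        merge(a, 5, b);
        System.out.println(java.util.Arrays.toString(a));
        // prints [6, 8, 12, 13, 15, 18, 19, 20, 22]
    }
}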

Solution

