Tuesday, December 25, 2012

Count Distinct Values in huge text file

This is a common problem today. You are given a huge text file, containing tens of millions of words, and you have to count how many distinct values exist in the file. The problem gets more complicated if you cannot load the entire data into memory: how do you make your algorithm "online"? An online algorithm is one that returns its result immediately - ideally in one pass - without having memory to hold all the data.

This algorithm has uses in various fields. In web analytics it can return the number of unique visitors last month, last week, and so on. Another use is in detecting a suspected attack: if the load on your server increases all of a sudden and you notice a spike in the network, the first thing to look for is the possibility of an attack, where the requests could be coming repeatedly from one system. Next we will discuss the different implementations available. For the purpose of this article, I used Shakespeare's complete works as a huge text file. It can be downloaded from http://www.gutenberg.org/cache/epub/100/pg100.txt. The file is around 5.5 MB in size and contains 67802 distinct words.

APPROACH:1
Read the input file line by line and break it into words, then store each word in a Set. Once all the words are stored, the size of the set is the distinct count. However, with this approach we have to keep nearly the entire 5.5 MB of data in the Set. Assuming there are repeated words and, say, 4 MB of distinct words, we still need to hold 4 MB in the Set. And since a Set internally uses a hash table, assuming a 0.7 load factor we would need 4/0.7 = 5.7 MB. If the file is extremely big, we cannot use this approach at all.
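
A minimal sketch of this approach (the file path and the space-based word split are simplifying assumptions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class SetDistinctCount {
    public static void main(String[] args) throws Exception {
        Set<String> words = new HashSet<String>();
        BufferedReader br = new BufferedReader(new FileReader("C:\\Apps\\Shakespeare.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            for (String w : line.split(" ")) {
                words.add(w); // the Set keeps exactly one copy of each distinct word
            }
        }
        br.close();
        System.out.println("Distinct words = " + words.size());
    }
}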

APPROACH:2
A small work-around for Approach 1 is this: assume that you have a very good hashCode function, so that the hashCodes of any 2 words are never the same. In this case, store the hashCode instead of the word. The hashCode being an int, when stored in a Set as an Integer it is 32 bits or 4 bytes long, so to store 67802 hashCodes we need about 378 KB of memory. This is a huge gain, but the problem is finding such a good hashCode function. And the size of the Set still grows with the input file size.
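
The same loop with each word replaced by its hashCode - a sketch assuming, optimistically, a collision-free hash function:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

public class HashCodeDistinctCount {
    public static void main(String[] args) throws Exception {
        Set<Integer> hashes = new HashSet<Integer>();
        BufferedReader br = new BufferedReader(new FileReader("C:\\Apps\\Shakespeare.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            for (String w : line.split(" ")) {
                // 4-byte hash stored instead of the word; any collision would undercount
                hashes.add(w.hashCode());
            }
        }
        br.close();
        System.out.println("Distinct words (assuming no collisions) = " + hashes.size());
    }
}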

APPROACH:3
We can use a "Bloom Filter" as an alternative. Here it is a 2-dimensional boolean array. In this approach we use 3 or 4 different hashCode functions - h1, h2, h3, h4 - and compute h1 to h4 for each word. The rows of our Bloom filter correspond to the hash functions, and the number of columns is fixed based on the accuracy needed. One thing to note here is that the Bloom filter is a probabilistic structure: it doesn't always give 100% accurate results, and there is a chance of error of 1% to 4% or so. The bigger the Bloom filter arrays, the more accurate the results. Assume that the values of h1, h2, h3, and h4 for a word are 22, 17, 35, and 19. If the Bloom filter array is represented as a, we set a[0][22], a[1][17], a[2][35], a[3][19] to true (one row per hash function, matching the code below). Using this approach, I got a result of 67801 instead of 67802 with 4 arrays of 1M entries each. Since a boolean typically occupies a byte in a Java array, we would need 4*1M bytes, or roughly 4 MB of memory; with about 300 KB of memory I got a 3% error. The efficiency of these algorithms is that we don't need to store the entire data, and we can get the result in one pass.

Solution
package com.raj.streaming;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

/**
 * Shakespeare Complete works from http://www.gutenberg.org/cache/epub/100/pg100.txt
 * Bloom Filter implementation
 *
 * @author raju rama krishna
 *
 */
public class CountMinAlgo {

    final int[] primes = {3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
            101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181};
    int count = 0;
    boolean[][] table = new boolean[4][1000000];

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        CountMinAlgo algo = new CountMinAlgo();
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Apps\\Shakespeare.txt")));
        String line = br.readLine();
        String[] sArr = null;
        while (line != null) {
            sArr = line.split(" ");
            for (String s : sArr) {
                algo.add(s);
            }
            line = br.readLine();
        }
        br.close();
        System.out.println("Unique words = " + algo.count);
    }

    public int[] hash(String s) {
        int h1 = hash1(s);
        int h2 = hash2(s);
        int h3 = hash3(s);
        int h4 = hash4(s);
        int[] res = new int[]{h1, h2, h3, h4};
        return res;
    }

    // Weighted character sum; note this assumes words no longer than primes.length (41) characters
    public int hash1(String x) {
        char ch[] = x.toCharArray();
        int i, sum;
        for (sum = 0, i = 0; i < x.length(); i++)
            sum += ch[i] * primes[i];
        return sum % 1000000;
    }

    public int hash2(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        h = h % 1000000;
        if (h < 0) {
            h = -h;
        }
        return h;
    }

    public int hash3(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 17 * h + s.charAt(i);
        }
        h = h % 1000000;
        if (h < 0) {
            h = -h;
        }
        return h;
    }

    public int hash4(String s) {
        int hash = 0;
        for (int i = 0, l = s.length(); i < l; i++) {
            hash += (int) s.charAt(i);
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 6;
        hash += hash << 16;
        hash = Math.abs(hash);
        return hash % 1000000;
    }

    // Count the word as new only if at least one of its four slots is not yet set
    public void add(String s) {
        int[] h = hash(s);
        if (!(table[0][h[0]] && table[1][h[1]] && table[2][h[2]] && table[3][h[3]])) {
            table[0][h[0]] = true;
            table[1][h[1]] = true;
            table[2][h[2]] = true;
            table[3][h[3]] = true;
            count++;
        }
    }
}

APPROACH:4
The next approach we discuss is a very efficient algorithm, and it has been used in many applications, including Cassandra (originally developed at Facebook). It is called the "HyperLogLog Counter". It is a slightly complex, mathematical algorithm that needs very little memory, and its memory use doesn't vary much with the input data size. The idea is as follows. Read each word and compute the hash of the word (Cassandra uses MurmurHash). The algorithm uses something called "stochastic averaging". We define b as a number between 4 and 16 and create a byte (or int) array of size 2^b, so if we use b=4 we have 16 buckets. Assume that we have chosen b=15 and that the hashCode of a word is 1379266913. We view this hashCode as a 32-bit binary number (an int is 32 bits) and break it into the first 15 bits and the remaining 17 bits. The first 15 bits decide which bucket to use. The remaining 17 bits determine the rank, which is nothing but the position of the first set bit (the leftmost 1), counting from the left. At that bucket index, if the stored value is less than the rank, the rank is stored in the bucket; so each bucket ends up holding the highest rank seen for it. Once all the words are read and the buckets populated, a harmonic mean over the buckets gives the result: the raw estimate is E = alphaM * m^2 / (sum over buckets j of 2^(-M[j])), with m = 2^b, exactly as in the get() method below.
Using this approach with a b-value of 15, meaning 32768 buckets (or 32 KB of memory), the result obtained was 67749 (against the actual 67802). That means using just 32 KB of memory we got a result which is 99.92% accurate!
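
To make the bucket/rank split concrete, here is a tiny check using the hash value from the example above (the rank computation mirrors the rank method in the solution below):

public class HllSplitDemo {
    public static void main(String[] args) {
        int b = 15;
        int x = 1379266913;                     // example hash from the text
        int bucket = x >>> (Integer.SIZE - b);  // first 15 bits select the bucket
        int rest = x << b;                      // the remaining 17 bits, shifted to the top
        // position of the leftmost 1-bit in the remainder
        // (the full update in the solution also ORs in a low sentinel bit to bound the rank)
        int rank = rest == 0 ? 0 : 1 + Integer.numberOfLeadingZeros(rest);
        System.out.println("bucket=" + bucket + ", rank=" + rank);
    }
}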

Solution
package com.raj.dvsketch;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

/**
 * HyperLogLog Counter implementation.
 * Please refer to Cassandra's HyperLogLogCounter for full details.
 *
 * @author raju rama krishna
 *
 */
public class HyperLogLog {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        HyperLogLog hyperLogLog = new HyperLogLog(15);
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Apps\\Shakespeare.txt")));
        String line = br.readLine();
        String[] sArr = null;
        while (line != null) {
            sArr = line.split(" ");
            for (String s : sArr) {
                hyperLogLog.update(s);
            }
            line = br.readLine();
        }
        br.close();
        System.out.println("Unique words = " + (int) hyperLogLog.get());
    }

    private static final long Pow2_32 = 1L << 32;
    private final int b;
    // number of registers m = 2^b
    private final int m;
    private final double alphaM;
    // a collection of m registers
    private byte[] M;

    /**
     * Create HyperLogLogCounter with design parameter b.
     * b must be in the range [4, 16] for optimization.
     *
     * @param b parameter
     */
    public HyperLogLog(int b) {
        assert b >= 4 && b <= 16;
        this.b = b;
        this.m = 1 << b;
        this.alphaM =
                b == 4 ? 0.673 // m == 16
              : b == 5 ? 0.697 // m == 32
              : b == 6 ? 0.709 // m == 64
              : 0.7213 / (1 + 1.079 / m);
        this.M = new byte[m];
    }

    public void update(String value) {
        int x = hash(value);
        // determine position in register using first b bits of hashed value x
        int j = x >>> (Integer.SIZE - b);
        M[j] = (byte) Math.max(M[j], rank((x << b) | (1 << (b - 1)) + 1));
    }

    public int hash(String value) {
        int hash = 0;
        for (int i = 0, l = value.length(); i < l; i++) {
            hash += (int) value.charAt(i);
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 6;
        hash += hash << 16;
        return hash;
    }

    public double get() {
        // compute Z with 'indicator' function
        double Z = 0.0;
        for (int i = 0; i < m; ++i)
            Z += 1.0 / (1 << M[i]);
        // "raw" HyperLogLog estimate
        double E = alphaM * m * m / Z;
        if (E <= (5.0 / 2.0) * m) {
            // small range correction
            int V = 0;
            for (int v : M)
                if (v == 0) V++;
            return V == 0 ? E : m * Math.log((float) m / V);
        } else if (E <= Pow2_32 / 30.0) {
            // intermediate range - no correction
            return E;
        } else {
            // large range correction
            return -1 * Pow2_32 * Math.log(1.0 - E / Pow2_32);
        }
    }

    /**
     * Determine the leftmost position of 1-bit
     *
     * @param w
     * @return the position of the leftmost 1-bit (rank(00011010) = 4)
     */
    int rank(int w) {
        return w == 0 ? 0 : 1 + Integer.numberOfLeadingZeros(w);
    }
}


Friday, December 14, 2012

Google Fresher Interview

Question:1

There is a linked list of numbers of length N. N is very large and you don’t know N. 
You have to write a function that will return k random numbers from the list. 
Numbers should be completely random.


Approach

One approach is to use a random index modulo N - but we don't know N, which is what makes this interesting. The approach we discuss is called "reservoir sampling". We read the first k numbers and store them in a list (the reservoir). For our example let's say k is 10, so we will have the first 10 elements in the list. Next read the 11th element and pick a random number between 0 and 10 inclusive (one more choice than the number of elements seen so far). If the random number r is less than 10, replace the r'th element of the reservoir with the 11th element; else continue. Repeating this for every subsequent element (for the i'th element, r is drawn from 0 to i-1) leaves each of the N elements in the reservoir with equal probability k/N.

Solution


package com.raj.google;

import java.util.Arrays;
import java.util.Random;

/**
 * There is a linked list of numbers of length N. N is very large and you don't know N.
 * You have to write a function that will return k random numbers from the list.
 * Numbers should be completely random.
 *
 * @author raju rama krishna
 *
 */
public class RandomSample {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Random random = new Random();
        LinkList list = new LinkList();
        for (int i = 10000; i >= 1; i--) {
            list.add(i);
        }
        int count = 0;
        int k = 10;
        Node[] reservoir = new Node[k];
        Node node = list.root;
        // fill the reservoir with the first k elements
        for (int i = 0; i < k; i++) {
            reservoir[i] = node;
            node = node.next;
            count++;
        }
        // for the i'th element (i > k), keep it with probability k/i
        while (node != null) {
            count++;
            int rand = random.nextInt(count);
            if (rand < k) {
                reservoir[rand] = node;
            }
            node = node.next;
        }
        System.out.println(Arrays.toString(reservoir));
    }
}

class LinkList {
    public Node root;

    public void add(int val) {
        if (root == null) {
            root = new Node(val);
        } else {
            Node node = new Node(val);
            node.next = root;
            root = node;
        }
    }
}

class Node {
    public int val;
    public Node next;

    public Node(int val) {
        this.val = val;
    }

    public String toString() {
        return String.valueOf(val);
    }
}
Question: 2


Given a Data Structure having first n integers and next n chars. A = i1 i2 i3 ... iN c1 c2 c3 ... cN.
Write an in-place algorithm to rearrange the elements of the array as A = i1 c1 i2 c2 ... iN cN

Approach


We can use a divide-and-conquer approach. Say the list contains {1, 2, 3, 4, 5, 6, 7, 8, a, b, c, d, e, f, g, h}. Using a binary-search-like split, swap the second quarter with the third quarter to convert it into {1, 2, 3, 4, a, b, c, d, 5, 6, 7, 8, e, f, g, h}. Then break the list into 2 halves and apply the same algorithm on the 2 parts - {1, 2, 3, 4, a, b, c, d} and {5, 6, 7, 8, e, f, g, h}. This works in place, at the cost of O(n log n) time.


Solution

package com.raj.google;

import java.util.Arrays;

/**
 * Given a Data Structure having first n integers and next n chars. A = i1 i2 i3 ... iN c1 c2 c3 ... cN.
 * Write an in-place algorithm to rearrange the elements of the array as A = i1 c1 i2 c2 ... iN cN
 *
 * @author raju rama krishna
 *
 */
public class RearrangeArray {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Object[] a = new Object[]{1, 2, 3, 4, 5, 6, 7, 8, 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'};
        int n = a.length;
        rearrange(a, 0, n - 1);
        System.out.println(Arrays.toString(a));
    }

    public static void rearrange(Object[] a, int l, int h) {
        if (l < h) {
            int m = (l + h) / 2;
            int r1 = (l + m) / 2;
            int i = r1 + 1;
            int j = m + 1;
            // swap the second quarter with the third quarter
            while (i <= m) {
                Object t = a[i];
                a[i] = a[j];
                a[j] = t;
                i++;
                j++;
            }
            rearrange(a, l, m);
            rearrange(a, m + 1, h);
        }
    }
}
Question: 3

There is an array A[N] of N numbers. You have to compose an array Output[N] such that Output[i] will be equal to multiplication of all the elements of A[N] except A[i]. For example Output[0] will be multiplication of A[1] to A[N-1] and Output[1] will be multiplication of A[0] and from A[2] to A[N-1]. Solve it without division operator and in O(n).

Approach

We can do it in 2 passes, O(n). If the list is a = {4, 2, 5, 1}, we create a product list b = {0, 0, 0, 0} and set b[0] to 1. Set pf (product forward) to 1 and pb (product backward) to 1. Start the forward pass: at each i, set pf = pf * a[i-1] and store pf in b[i]. So at i=1, pf will be 4 and the list becomes {1, 4, 0, 0}. At i=2, pf becomes 4*2=8, so the list is {1, 4, 8, 0}, and at i=3 it becomes {1, 4, 8, 40}.
Then start the backward pass from the last-but-one item, that is i=2. Set pb = pb*a[i+1] and then b[i] = b[i]*pb. So pb = 1*a[3] = 1 and b[2] stays 8. At i=1, pb = 1*5 = 5 and b[1] = 4*5 = 20; at i=0, pb = 5*2 = 10 and b[0] = 1*10 = 10. The list is finally {10, 20, 8, 40}.

Solution

package com.raj.google;

import java.util.Arrays;

/**
 * There is an array A[N] of N numbers. You have to compose an array Output[N] such that
 * Output[i] will be equal to multiplication of all the elements of A[N] except A[i].
 * For example Output[0] will be multiplication of A[1] to A[N-1] and
 * Output[1] will be multiplication of A[0] and from A[2] to A[N-1].
 *
 * Solve it without division operator and in O(n).
 *
 * @author raju rama krishna
 *
 */
public class ProductArray {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int[] a = {4, 2, 5, 1};
        int n = a.length;
        int[] b = new int[n];
        b[0] = 1;
        int pf = 1;
        int pb = 1;
        // forward pass: b[i] holds the product of everything to the left of i
        for (int i = 1; i < n; i++) {
            pf = pf * a[i - 1];
            b[i] = pf;
        }
        // backward pass: multiply in the product of everything to the right of i
        for (int i = n - 2; i >= 0; i--) {
            pb = pb * a[i + 1];
            b[i] = b[i] * pb;
        }
        System.out.println(Arrays.toString(b));
    }
}
Question: 4

You are given an array with integers (both positive and negative) in any random order.
Find the sub-array with the largest sum.

Approach

This is the classic maximum-subarray problem, and the solution below is Kadane's algorithm: keep a running sum, reset it to zero whenever it goes negative, and remember the best sum (and its start and end positions) seen so far. It takes one pass and O(n) time.

Solution
package com.raj.google;

/**
 * You are given an array with integers (both positive and negative) in any random order.
 * Find the sub-array with the largest sum
 *
 * @author raju rama krishna
 *
 */
public class MaxSumSubArray {

    /**
     * @param args
     */
    public static void main(String[] args) {
        int[] a = {4, 3, -11, 9, 2, 3, -18, 11, 2, -5};
        maxSumSubArray(a, a.length);
    }

    public static void maxSumSubArray(int[] array, int len) {
        int maxSum = Integer.MIN_VALUE;
        int curSum = 0;
        int a = 0;
        int b = 0;
        int i = 0;
        int s = 0;
        for (i = 0; i < len; i++) {
            curSum += array[i];
            if (curSum > maxSum) {
                maxSum = curSum;
                a = s;
                b = i;
            }
            if (curSum < 0) {
                curSum = 0;
                s = i + 1;
            }
        }
        System.out.println("Maximum sum=" + maxSum + " between position : " + a + " and " + b);
    }
}






Sort a very large text file on limited memory

Problem

We are given a very large text file of size say 10 GB and we need to sort the file and store it in an output location. But the memory (RAM) is limited, say 1 GB.

Approach

1. Brute-Force Approach
Read the input file line by line using a BufferedReader and store each line in a TreeSet (assume there are no duplicate lines). Once the file is completely read, write the contents of the TreeSet into the output file.
For my analysis, I used a 400 MB file containing line items, extracted from the TPC-H dataset. You can use any large file. When I ran my program with 512 MB memory (VM argument -Xmx512M) I got an OutOfMemoryError, so I set it to 1 GB and it ran fine. The time taken was 7 seconds.
So as we can observe, to sort 10 GB this way we would need around 15-20 GB of RAM. The memory is consumed by the TreeSet holding the entire data.

2. External Sort
In this approach, we again read the input file line by line and store the lines into a TreeSet, but we do not store the entire file in it. If the file is 10 GB and our RAM capacity is 1 GB, we store about 512 MB of data in the TreeSet; once it reaches 512 MB we flush it to disk as a sorted temporary file (call it temp1.txt). Repeat this procedure of writing temporary files till the 10 GB is read completely, giving 20 sorted temporary files. The next step is to merge these 20 files into a single sorted file and delete the temporary files. We call this a k-way merge. Consider the example below.

temp1:  { dog, fan, jug }
temp2:  { egg, kit, rat, sun }
temp3:  { ant, gun, hat}

Read the first items - dog, egg, ant. Write the smallest of them to output.txt, remove it from its temp file, and add the next item from that file to the list. So we remove ant from temp3 and put it in output.txt; the list now holds dog, egg, and we add gun (the next item from temp3). The next smallest is dog: add it to output.txt and add fan to the list. Repeat till all the elements are written.
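
Here is a minimal sketch of the k-way merge built on a PriorityQueue (the file names are placeholders). Unlike the TreeMap keyed on lines used in the solutions below, a heap of (line, reader) entries also handles the case where the same line appears in two temp files:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.PriorityQueue;

public class KWayMerge {
    // one entry per temp file: its current line plus the reader it came from
    static class Entry implements Comparable<Entry> {
        String line;
        BufferedReader reader;
        Entry(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Entry o) { return line.compareTo(o.line); }
    }

    public static void main(String[] args) throws Exception {
        String[] tempFiles = {"temp1.txt", "temp2.txt", "temp3.txt"};
        PriorityQueue<Entry> heap = new PriorityQueue<Entry>();
        for (String f : tempFiles) {
            BufferedReader br = new BufferedReader(new FileReader(f));
            String line = br.readLine();
            if (line != null) heap.add(new Entry(line, br));
        }
        FileWriter out = new FileWriter("output.txt");
        while (!heap.isEmpty()) {
            Entry e = heap.poll();          // smallest current line across all temp files
            out.write(e.line);
            out.write('\n');
            String next = e.reader.readLine();
            if (next != null) {
                e.line = next;              // re-insert the entry with that file's next line
                heap.add(e);
            } else {
                e.reader.close();
            }
        }
        out.close();
    }
}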

I first implemented this method to sort a 400 MB file with 40 MB of RAM. The time taken was 22 seconds.
This is definitely a gain in space-complexity at the cost of increased time. Can we do better?

Next I utilized memory-mapped files (MMAP). For this we need to consider how Java does file IO. Data in the file system (secondary storage, i.e., the hard disk) has to be brought into memory (RAM). To do this, we can either read character by character or read a chunk of data at a time. The time taken by file IO is huge compared to CPU processing time, so it is better to reduce the frequency of file reads - hence the BufferedReader.
But if we use MMAP, a file or a portion of a file can be mapped directly into memory, backed by the operating system's virtual-memory machinery rather than the Java heap (where objects normally live). The data can then be read without an explicit read system call per chunk, and system calls are orders of magnitude slower than memory access. You cannot memory-map a full 10 GB file at once, so we can MMAP a portion of the file at a time. The other advantage of MMAP is that multiple processes can share the same mapped file. In Java, the classes for this are in the NIO package; FileChannel is the entry point.
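
Note that Solution 3 below actually reads through a direct ByteBuffer rather than a true memory map. For reference, a minimal sketch of mapping a region of a file with FileChannel.map would look like this (the path and window size are placeholders):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        RandomAccessFile file = new RandomAccessFile("C:\\Apps\\input.txt", "r");
        FileChannel channel = file.getChannel();
        long windowSize = Math.min(8 * 1024, channel.size());
        // map the first windowSize bytes of the file directly into memory
        MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, windowSize);
        byte[] chunk = new byte[(int) windowSize];
        buf.get(chunk); // served from the page cache; no explicit read() call per chunk
        System.out.println(new String(chunk));
        channel.close();
        file.close();
    }
}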

With this approach, I read the 400 MB file in 8 KB chunks: read each 8 KB of data and store it, repeating 5000 times to flush 40 MB to a temporary file (we end up with 10 temporary sorted files), then apply the k-way merge.
Using this approach, the program ran in 17 secs (a 5 sec gain). It is generally at least 20% faster than the BufferedReader version.

Solution:1

package com.raj.filesort;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Date;
import java.util.Set;
import java.util.TreeSet;

/**
 * Sort a 400 MB file
 *
 * @author raju rama krishna
 * @see http://javatroops.blogspot.co.nz
 *
 */
public class FileSorter {

    public static void main(String[] args) throws Exception {
        long t1 = new Date().getTime();
        Set<String> set = new TreeSet<String>();
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Project\\tpc-h\\dbloader\\LINEITEM.xml")));
        String s = br.readLine();
        while (s != null) {
            set.add(s);
            s = br.readLine();
        }
        br.close();
        FileWriter fw = new FileWriter(new File("C:\\Apps\\output.txt"));
        for (String x : set) {
            fw.write(x);
            fw.write('\n');
        }
        fw.close();
        long t2 = new Date().getTime();
        System.out.println("Time taken = " + (t2 - t1) / 1000 + " sec");
    }
}
Solution:2

package com.raj.filesort;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Date;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/**
 * Sort a 400 MB file on a 40MB RAM
 *
 * @author raju rama krishna
 * @see http://javatroops.blogspot.co.nz
 *
 */
public class SorterUtil {

    public static void main(String[] args) throws Exception {
        long t1 = new Date().getTime();
        int i = 0;
        Set<String> set = new TreeSet<String>();
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Project\\tpc-h\\dbloader\\LINEITEM.xml")));
        String s = br.readLine();
        // phase 1: write sorted runs of 60000 lines each to temporary files
        while (s != null) {
            set.add(s);
            if (set.size() == 60000) {
                FileWriter fw = new FileWriter(new File("C:\\Apps\\tmp\\temp-" + i + ".txt"));
                for (String x : set) {
                    fw.write(x);
                    fw.write("\n");
                }
                fw.close();
                i++;
                set = new TreeSet<String>();
            }
            s = br.readLine();
        }
        br.close();
        br = null;
        FileWriter fw = new FileWriter(new File("C:\\Apps\\tmp\\temp-" + i + ".txt"));
        for (String x : set) {
            fw.write(x);
            fw.write('\n');
        }
        fw.close();
        // phase 2: k-way merge; the TreeMap keys on the line itself,
        // so this relies on the no-duplicate-lines assumption stated above
        Map<String, Integer> map = new TreeMap<String, Integer>();
        BufferedReader[] brArr = new BufferedReader[i + 1];
        for (int j = 0; j <= i; j++) {
            brArr[j] = new BufferedReader(new FileReader(new File("C:\\Apps\\tmp\\temp-" + j + ".txt")));
            map.put(brArr[j].readLine(), j);
        }
        BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:\\Apps\\output.txt")));
        String endofline = "\n";
        while (!map.isEmpty()) {
            s = map.keySet().iterator().next(); // smallest current line across all temp files
            i = map.get(s);
            map.remove(s);
            bw.write(s);
            bw.write(endofline);
            s = brArr[i].readLine();
            if (s != null) {
                map.put(s, i);
            }
        }
        bw.close();
        for (int j = 0; j < brArr.length; j++) {
            brArr[j].close();
            new File("C:\\Apps\\tmp\\temp-" + j + ".txt").delete();
        }
        long t2 = new Date().getTime();
        System.out.println("Time taken = " + (t2 - t1) / 1000 + " sec");
    }
}
Solution:3
package com.raj.filesort;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.FileWriter;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.Date;
import java.util.Map;
import java.util.TreeMap;

/**
 * Sort a 400MB file using FileChannel.
 * Note: this version copies raw 8 KB chunks straight to the temp files;
 * unlike Solution 2 it does not decode lines or sort each chunk before
 * flushing, so chunk boundaries can split lines. A complete version would
 * decode each buffer into lines and sort them (as in Solution 2) before writing.
 *
 * @author raju rama krishna
 * @see http://javatroops.blogspot.co.nz
 *
 */
public class NIOFileSort {

    private static final int DEFAULT_BUFFER_SIZE = 1024 * 8;

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        long t1 = new Date().getTime();
        int i = 0;
        FileChannel source = new FileInputStream(new File("C:\\Project\\tpc-h\\dbloader\\LINEITEM.xml")).getChannel();
        ByteBuffer buf = ByteBuffer.allocateDirect(DEFAULT_BUFFER_SIZE);
        int j = 0;
        FileChannel destination = new FileOutputStream(new File("C:\\Apps\\tmp\\temp-" + i + ".txt")).getChannel();
        // phase 1: split the input into ~40 MB temp files, 5000 buffers of 8 KB each
        while ((source.read(buf)) != -1) {
            buf.flip();
            destination.write(buf);
            if (j == 5000) {
                i++;
                destination.close();
                destination = new FileOutputStream(new File("C:\\Apps\\tmp\\temp-" + i + ".txt")).getChannel();
                j = 0;
            } else {
                j++;
            }
            buf.clear();
        }
        source.close();
        // phase 2: k-way merge of the temp files, as in Solution 2
        Map<String, Integer> map = new TreeMap<String, Integer>();
        BufferedReader[] brArr = new BufferedReader[i + 1];
        for (j = 0; j <= i; j++) {
            brArr[j] = new BufferedReader(new FileReader(new File("C:\\Apps\\tmp\\temp-" + j + ".txt")));
            map.put(brArr[j].readLine(), j);
        }
        BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:\\Apps\\output.txt")));
        String s = null;
        String endofline = "\n";
        while (!map.isEmpty()) {
            s = map.keySet().iterator().next();
            i = map.get(s);
            map.remove(s);
            bw.write(s);
            bw.write(endofline);
            s = brArr[i].readLine();
            if (s != null) {
                map.put(s, i);
            }
        }
        bw.close();
        for (j = 0; j < brArr.length; j++) {
            brArr[j].close();
            new File("C:\\Apps\\tmp\\temp-" + j + ".txt").delete();
        }
        long t2 = new Date().getTime();
        System.out.println("Time taken = " + (t2 - t1) / 1000 + " sec");
    }
}


Monday, December 3, 2012

Adaptive Spam Filtering algorithm

When we talk about spam, we generally mean emails: a spam mail is one sent to you as an email promotion or a bulk mail, and in most cases you are not interested in receiving it. In the earlier days we had to go through each mail and identify whether it was spam or not; a mail which is not spam (called ham) we keep in the Inbox, and spam we manually moved to a junk folder. That is a lot of work, given that these days 50-60% of mail is spam. A few algorithms came up to solve this issue, and the best known of them is Bayesian filtering - an adaptive, machine-learning algorithm. We will discuss the details below.

Classifying an email as spam or not cannot be done at the mail server; it needs to be done at the email client. For instance, say there are 2 users - A and B - where A works for a bank and B works as a pharmacist. A mail with content "Reduce your mortgage loan" is spam for B but ham for A, and a mail "Solution for baldness" is spam for A but ham for B. When a recipient receives a mail he considers spam, he can "Mark it as Spam"; this is not a big issue. On the other hand, if he notices that a ham mail went into his spam folder, he can "Mark it as NOT Spam" - and this is an issue, as the mail might be an important one that he misses (since it is not showing in his inbox). So spam detectors should be careful not to mark ham as spam. Spam can be detected based on email content, email subject, sender email, recipient emails, etc. Let's see how they work.

In the industry we have collections of thousands of ham/spam emails which can be used to build our spam filter application. Download these emails into your data store and run a job on them (Map-Reduce or batch) to go through each email message and split it into words. You might have to do additional tasks like removing special characters and quotes, converting to lower case, ignoring words shorter than 4 characters, ignoring common words, keeping only words made up of letters, etc. Add each valid word into a HashMap as the key. The value in the map is a Node; the Node class has 3 fields - spamCount, hamCount and probability. So if I read a word "XYZ" from a spam email and it is the first time I encountered this word, its Node would have spamCount=1, hamCount=0. We will calculate probability after the map is constructed; note that the same word can appear in the ham list. Every time a word is put in the map, increment a class-level variable totalSpam (or totalHam) by 1. After all the emails are read and the map is constructed, iterate the map and for each key get the spamCount and hamCount, then calculate the probability using -

probability = (spamCount/totalSpam)/((spamCount/totalSpam) + (hamCount/totalHam))

Do this for all the keys. The probability is a floating point value between 0.0 and 1.0.
That completes the training step. Next is the filtering step.

An email comes from a sender "X". Again, extract the words (as described above) and for each word get its probability from the map. If the word doesn't exist in the map, the spam filter is not trained for this word; it could be a valid word, so give it the neutral value 0.5. Calculate the interest value I for each word as follows -

I = |0.5 - probability|

Once it is calculated for all the words, sort the I values in descending order (highest interest first) and take the top N values (N=15). For these I values, get the corresponding probabilities p1, p2, p3, ..., p15. Now calculate the total probability using the following formula

P = (p1*p2*p3..p15)/((p1*p2*p3..p15)  + ((1-p1)*(1-p2)*(1-p3)....(1-p15)))

This value would be between 0.0 and 1.0. The nearer the value is to 0, the lesser the chances of it being spam. So we mark anything equal to or greater than 0.9 as spam. 

Next comes the machine learning. It could happen that an email which was not marked spam is later found to be spam; you mark it as spam. To handle that, add its words back to the map and calculate the probabilities again.

Implementation
I have built a basic implementation which can be trained and also does machine-learning. I created 3 files - ham.txt, spam.txt and common-words.txt. In this basic implementation, each line of the text file stores one mail's content. In the sample data I set up, I want to filter all jobsite, lottery and retirement emails. The spam filter gives the following output.
  1. 'quick and easy jobsite' is spam
  2. 'will there be a production release tomorrow' is not a spam
  3. 'plan your retirement soon' is spam
  4. 'you have won a state lottery claim your money now' is spam
  5. 'register with our jobsite and get your dream job and money now' is spam
  6. 'today with our jobsite making money is not all that hard' is not a spam
  7. 'today with our jobsite making money is not all that hard' is spam
  8. 'back to office today and started off coding' is not a spam
Note that 6 was initially found to be ham, the reason being that a few words like today, money, etc. are found in the ham list as well. But when I marked it as spam, the next time I received the same email (at 7) it was automatically traced as spam.

Solution
CODE
package com.raj.streaming;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/**
 * Machine-learning spam detector using Bayesian Algorithm
 *
 * @author raju rama krishna
 * @see http://javatroops.blogspot.co.nz
 *
 */
public class SpamDetector {

    int spamCount = 0;
    int hamCount = 0;
    Map<String, Node> map = new HashMap<String, Node>();
    List<String> commonWords = new ArrayList<String>();

    public void init() throws Exception {
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Apps\\common-words.txt")));
        String line = br.readLine();
        while (line != null) {
            commonWords.add(line);
            line = br.readLine();
        }
        br = new BufferedReader(new FileReader(new File("C:\\Apps\\spam.txt")));
        line = br.readLine();
        while (line != null) {
            line = line.toLowerCase();
            String[] words = line.split(" ");
            for (String s : words) {
                if (s.length() > 3 && !commonWords.contains(s)) {
                    spamCount++;
                    if (map.containsKey(s)) {
                        map.get(s).spamCount++;
                    } else {
                        map.put(s, new Node(1, 0));
                    }
                }
            }
            line = br.readLine();
        }
        br = new BufferedReader(new FileReader(new File("C:\\Apps\\ham.txt")));
        line = br.readLine();
        while (line != null) {
            line = line.toLowerCase();
            String[] words = line.split(" ");
            for (String s : words) {
                if (s.length() > 3 && !commonWords.contains(s)) {
                    hamCount++;
                    if (map.containsKey(s)) {
                        map.get(s).hamCount++;
                    } else {
                        map.put(s, new Node(0, 1));
                    }
                }
            }
            line = br.readLine();
        }
        Set<String> keys = map.keySet();
        for (String key : keys) {
            Node node = map.get(key);
            double res = ((node.spamCount) / (double) (spamCount))
                    / (double) (((node.spamCount) / (double) (spamCount)) + (node.hamCount) / (double) (hamCount));
            node.probability = res;
        }
        br.close();
    }

    public void detect(String s) {
        boolean result = false;
        s = s.toLowerCase();
        String[] sArr = s.split(" ");
        TreeMap<Double, List<Double>> interestMap = new TreeMap<Double, List<Double>>(Collections.reverseOrder());
        for (String x : sArr) {
            if (x.length() > 3 && !commonWords.contains(x)) {
                double i = 0.5;
                double p = 0.5;
                if (map.containsKey(x)) {
                    p = map.get(x).probability;
                }
                i = Math.abs(i - p);
                if (!interestMap.containsKey(i)) {
                    List<Double> values = new ArrayList<Double>();
                    values.add(p);
                    interestMap.put(i, values);
                } else {
                    interestMap.get(i).add(p);
                }
            }
        }
        List<Double> probabilities = new ArrayList<Double>();
        int count = 0;
        Set<Double> set = interestMap.keySet();
        for (Double d : set) {
            List<Double> list = interestMap.get(d);
            for (Double x : list) {
                count++;
                probabilities.add(x);
                if (count == 15) {
                    break;
                }
            }
            if (count == 15) {
                break;
            }
        }
        double res = 1;
        double numerator = 1;
        double denominator = 1;
        for (Double d : probabilities) {
            numerator = numerator * d;
            denominator = denominator * (1 - d);
        }
        res = numerator / (double) (numerator + denominator);
        if (res >= 0.9) {
            result = true;
        }
        if (result) {
            System.out.println("'" + s + "' is spam");
        } else {
            System.out.println("'" + s + "' is not a spam");
        }
    }

    public void moveToSpam(String s) {
        s = s.toLowerCase();
        String[] sArr = s.split(" ");
        for (String x : sArr) {
            if (x.length() > 3 && !commonWords.contains(x)) {
                spamCount++;
                if (map.containsKey(x)) {
                    map.get(x).spamCount++;
                } else {
                    map.put(x, new Node(1, 0));
                }
            }
        }
        Set<String> keys = map.keySet();
        for (String key : keys) {
            Node node = map.get(key);
            double res = ((node.spamCount) / (double) (spamCount))
                    / (double) (((node.spamCount) / (double) (spamCount)) + (node.hamCount) / (double) (hamCount));
            node.probability = res;
        }
    }

    public static void main(String[] args) throws Exception {
        SpamDetector spamDetector = new SpamDetector();
        spamDetector.init();
        spamDetector.detect("Quick and easy jobsite");
        spamDetector.detect("Will there be a Production Release tomorrow");
        spamDetector.detect("Plan your retirement soon");
        spamDetector.detect("You have won a state lottery claim your money now");
        spamDetector.detect("register with our jobsite and get your dream job and money now");
        spamDetector.detect("today with our jobsite making money is not all that hard");
        spamDetector.moveToSpam("today with our jobsite making money is not all that hard");
        spamDetector.detect("today with our jobsite making money is not all that hard");
        spamDetector.detect("back to office today and started off coding");
    }
}

class Node {

    int spamCount;
    int hamCount;
    double probability;

    public Node(int spamCount, int hamCount) {
        this.spamCount = spamCount;
        this.hamCount = hamCount;
    }

    public String toString() {
        return String.valueOf(probability);
    }
}
spam.txt
Quick and easy way to make money
Congratulations you have won a lottery
Claim your prize now
Thanks for applying to our jobsite
Greetings from jobsite
Solution for all your financial needs
Buy home now at less money
Dream of your home
You are minutes away from retirement
One stop solution for your financial needs
Plan your retirement days
Its all about money honey
Our jobsite found that you have not been active
You have won the US state lottery
Help me sell this piece of land
ham.txt
Code walk-thru meeting scheduled today
I will be late to the office today
We will have a day off tomorrow after the release
Production is down today
I will be transferring the money to you today
Some of the changes you did are not right
Lets hope that the code changes go smooth
Have you finished your code changes
common-words.txt
able
been
being
could
does
done
first
good
have
just
many
most
should
this
that
then
they
value
would
when
with
your


Saturday, December 1, 2012

FaceBook - Write a function that can show the number of users online at any given time

You're given a period of time where users log in and log out and a set of login and log out times for those users. Facebook is asking you to create an efficient algorithm that will calculate a simple, but extremely important, metric.

Facebook currently has more than 1 Billion users and assume that each user is given a unique identifier like user_192. Hundreds of people login (and logout) to FB every second. There is one audit log maintained explicitly to store the time of login/logout and the userid. A sample log is given below-


[10/11/2012 10:25:06] Login user_4
[10/11/2012 10:28:55] Login user_6
[10/11/2012 10:29:19] Logout user_4
[10/11/2012 10:36:33] Login user_8

Using the above information, we need to come up with the most efficient algorithm (fast, with low memory utilization) to calculate the users logged in at any time. For instance, in the above example, if I query for 10/11/2012 10:29:00, two users were logged in: user_4 and user_6. Assume that there can be users who are logged into Facebook the entire day (and never log out). Also assume that all 1 Billion users log in every day. Most importantly, we need to consider boundary conditions as well, such as somebody who logged in at 23:55:00 and logged out the next day at 01:17:12.

Analysis
There were a few solutions available which had time complexity O(n log n). However, I was thinking whether I could do it in 1 pass, which is O(n). Also, I didn't want to use the BitSet approach, as I would have to create a BitSet of size 1 Billion every day. So I came up with this solution.


Maintain a map where the key is the date (10/11/2012) and the value is an int[] of size 86400 (the total number of seconds in a day), pre-initialized to -1. Maintain a counter initialized to 0. Now go through the log file; for each log entry, convert the time portion into seconds. If the activity is "Login" increment the counter, if "Logout" decrement it, then set the array value at that second to the value of the counter. This captures each user's logged-in window, including users whose sessions span 2 overlapping days. So what we have achieved is that in linear time we have created our data structure.

Now let's see how we can do querying. Convert the time to seconds and get the corresponding value from the int array. If the value is not -1, that value is the result.
If the value is -1, navigate back from the current position of the array till you encounter a value which is not -1 (case a). While doing so, you might reach the beginning of the array without finding any value that is not -1 (case b). In case a, the value at that position is the result. In case b, we need to go back and refer to the previous day's map: in that map's int array, scan from the end of the array till you find a value that is not -1.
Let's see an example. Say at time 37411 we had 1 user and at time 37506 we had 2 users. If I query for time 37506, we can directly say 2. If we query for 37500, we have 1 user. How did we arrive at this? At 37500 the value was -1, so we navigate left and at 37411 we get the value 1. That is the result.

Note: In a real-world scenario the user log would not be a single text file but a distributed file. Facebook has its own logging systems, PTail and Puma, and the back-end uses HBase over HDFS. So the log would be broken into multiple 64 MB blocks, and a map-reduce job would run on each of these pieces in parallel, building the above map.

Solution

CODE
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;

public class LogCounter {

    Map<String, int[]> dailyStats = new HashMap<String, int[]>();

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        LogCounter logCounter = new LogCounter();
        BufferedReader br = new BufferedReader(new FileReader(new File("C:\\Apps\\users.log")));
        String line = br.readLine();
        Holder holder = null;
        String date1 = null;
        String date2 = null;
        int[] counter = null;
        int count = 0;
        while (line != null) {
            holder = logCounter.getHolder(line);
            date2 = holder.date;
            int time = holder.time;
            if (!date2.equals(date1)) {
                counter = new int[86400];
                Arrays.fill(counter, -1);
                date1 = date2;
                logCounter.dailyStats.put(date2, counter);
            }
            if (!holder.isLogout) {
                count++;
            } else {
                count--;
            }
            counter[time] = count;
            line = br.readLine();
        }
        logCounter.loggedInUsers("10/11/2012 10:20:00");
        logCounter.loggedInUsers("10/11/2012 10:23:40");
        logCounter.loggedInUsers("10/11/2012 10:28:54");
        logCounter.loggedInUsers("10/11/2012 10:28:56");
        logCounter.loggedInUsers("11/11/2012 00:04:09");
        br.close();
    }

    public void loggedInUsers(String dateTime) throws Exception {
        String date = dateTime.split(" ")[0];
        String time = dateTime.split(" ")[1];
        int hour = Integer.parseInt(time.split(":")[0]);
        int min = Integer.parseInt(time.split(":")[1]);
        int sec = Integer.parseInt(time.split(":")[2]);
        int val = hour * 60 * 60 + min * 60 + sec;
        int[] counter = dailyStats.get(date);
        if (counter[val] != -1) {
            System.out.println(counter[val] + " users were logged in at : " + dateTime);
        } else {
            int val1 = val;
            while (val1 >= 0 && counter[val1] == -1) {
                val1--;
            }
            if (val1 == -1) {
                String prevDate = getPreviousDay(date);
                if (dailyStats.containsKey(prevDate)) {
                    counter = dailyStats.get(prevDate);
                    val = 86399;
                    while (counter[val] == -1) {
                        val--;
                    }
                    System.out.println(counter[val] + " users were logged in at : " + dateTime);
                } else {
                    System.out.println("No users were logged in at : " + dateTime);
                }
            } else {
                System.out.println(counter[val1] + " users were logged in at : " + dateTime);
            }
        }
    }

    public String getPreviousDay(String date) throws Exception {
        Calendar cal = Calendar.getInstance();
        SimpleDateFormat dateFormat = new SimpleDateFormat("dd/MM/yyyy");
        cal.setTime(dateFormat.parse(date));
        cal.add(Calendar.DATE, -1);
        String prevDate = dateFormat.format(cal.getTime());
        return prevDate;
    }

    public Holder getHolder(String line) {
        String dateTime = line.substring(1, line.indexOf("]"));
        String date = dateTime.split(" ")[0];
        String time = dateTime.split(" ")[1];
        int hour = Integer.parseInt(time.split(":")[0]);
        int min = Integer.parseInt(time.split(":")[1]);
        int sec = Integer.parseInt(time.split(":")[2]);
        int val = hour * 60 * 60 + min * 60 + sec;
        boolean isLogout = line.indexOf(" Logout ") != -1;
        int userId = Integer.parseInt(line.substring(line.indexOf("_") + 1));
        Holder holder = new Holder(date, val, isLogout, userId);
        return holder;
    }
}

class Holder {

    String date;
    int time;
    boolean isLogout;
    int userId;

    public Holder(String date, int time, boolean isLogout, int userId) {
        super();
        this.date = date;
        this.time = time;
        this.isLogout = isLogout;
        this.userId = userId;
    }
}
LOG File
[10/11/2012 10:23:31] Login user_7
[10/11/2012 10:25:06] Login user_4
[10/11/2012 10:28:55] Login user_6
[10/11/2012 10:29:19] Logout user_4
[10/11/2012 10:36:33] Login user_8
[10/11/2012 10:38:16] Logout user_7
[10/11/2012 10:42:20] Logout user_8
[10/11/2012 10:45:12] Logout user_6
[10/11/2012 10:47:15] Login user_18
[10/11/2012 10:47:37] Login user_99
[10/11/2012 10:51:18] Logout user_99
[10/11/2012 10:55:36] Login user_5
[10/11/2012 11:02:10] Login user_99
[10/11/2012 11:07:37] Logout user_5
[10/11/2012 11:21:39] Logout user_18
[10/11/2012 11:25:16] Logout user_99
[10/11/2012 23:55:10] Login user_99777
[10/11/2012 23:59:22] Login user_88772
[11/11/2012 00:05:00] Logout user_99777
[11/11/2012 00:07:02] Logout user_88772
[11/11/2012 00:17:12] Login user_271
[11/11/2012 00:25:17] Login user_45
[11/11/2012 00:28:03] Login user_60
[11/11/2012 01:00:00] Login user_43
[11/11/2012 01:12:02] Logout user_45
[11/11/2012 01:18:09] Logout user_271
[11/11/2012 01:22:16] Logout user_60
[11/11/2012 01:45:59] Logout user_43
[12/11/2012 01:47:00] Login user_18
[12/11/2012 01:52:09] Logout user_18
[12/11/2012 01:53:50] Login user_993
[12/11/2012 01:55:01] Login user_5
[12/11/2012 02:12:12] Logout user_993
[12/11/2012 02:27:19] Logout user_5
[13/11/2012 09:51:17] Login user_33
[13/11/2012 09:55:33] Login user_102890
[13/11/2012 10:03:18] Login user_790
[13/11/2012 10:05:08] Login user_4
[13/11/2012 10:08:00] Logout user_102890
[13/11/2012 10:19:00] Logout user_33
[13/11/2012 10:26:11] Login user_8
[13/11/2012 10:28:33] Logout user_790
[13/11/2012 10:32:03] Login user_8995
[13/11/2012 10:35:17] Logout user_4
[13/11/2012 10:37:19] Logout user_8995
[13/11/2012 10:37:44] Login user_9934550
[13/11/2012 10:41:42] Login user_777
[13/11/2012 10:45:29] Login user_666
[13/11/2012 10:57:08] Logout user_9934550
[13/11/2012 10:57:05] Logout user_777
[13/11/2012 11:01:00] Login user_99
[13/11/2012 11:05:11] Logout user_99
[13/11/2012 11:05:04] Logout user_666
[13/11/2012 11:19:22] Logout user_8


Thursday, November 29, 2012

Real-time data analytics on Terrabytes of data

Determine the 10 most frequent words given a terabyte of strings.

This is one of my favorite Facebook interview questions. There is a terabyte-size file comprising words. Scan the file and calculate the top 10 words in the file with respect to number of occurrences, listed in descending order of occurrence. Also, support searching the word frequency of any word in the text. Further, implement the solution such that it can be used in machine learning: it should accept streams of data coming in real-time and calculate the top-10 in real-time (no batching allowed).

The output of the solution that we are going to discuss is shown below-

130 MB data
Loaded 14146404 words into Data Structure. Time taken = 18 sec

*************
Top 10 words are:
*************
[argument] found 1011599 times
[type] found 360234 times
[2012, main] found 337678 times
[values] found 337549 times
[debug, summary, options, auditing] found 337375 times
[return] found 336572 times
[2012] found 335377 times
[class] found 269462 times
[auto] found 123839 times
*************
Word performance found 97 times


I generated this report on a 130 MB log file comprising 14 million words, and I applied some custom rules for frequency calculation:
Algorithm.,
algorithm
algorithm's
All 3 of these should be counted as occurrences of algorithm. Then of course there is the most important part of loading the 130 MB of data and operating on it - all done in 18 sec. Further, I was able to query the frequency of any arbitrary word; the search takes less than a millisecond.

Before discussing the solution, let's see similar problems that we encounter day-to-day. An e-commerce site has top-sellers, and a gaming site displays a leaderboard. Another problem is in network analysis: you are building a solution to capture the IP addresses of requesting machines in real-time, and in real-time calculate the IP of the machine with the highest hits. The traditional approaches of using a HashMap or an array of counters would not work here, since you would very soon run out of memory or end up with slow computation.

There is another approach, the Online Algorithm. This is a general term for algorithms where you build a solution once and can then query it in real-time and get a response. An example of an online algorithm is the Patricia Trie, which I posted earlier. So what do these do? They are very efficient in computation and also in space: for instance, a Patricia Trie uses only 2n-1 nodes and is a binary trie, and its processing time doesn't depend on the size of the stored data, only on the input size. So can we use these online algorithms for real-time data analytics? Maybe not. They are good if you want to build an application that goes through a huge novel (for instance, the complete works of William Shakespeare, the Bible, etc.) and prints the frequency of each word - but we need to look beyond this.

Let's take the example of the novel and address it. Go through the novel line by line and fetch each word. Assume that you have 3 hash functions - h1, h2, h3. Each of them takes a word as input and returns an int. For this example, assume the three values for a given word are always distinct and lie between 0 and 9. Now create 3 int arrays of size 10 each, all initialized to 0. Say we computed the hashes for a string "xyz" and got the values below.

h1("xyz") = 8
h2("xyz") = 6
h3("xyz") = 9

So in the 3 arrays, increase the value by one at the positions returned above. So we would have below

array1 [ 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
array2 [ 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Now proceed to next word and do the same.

h1("abc") = 7
h2("abc") = 3
h3("abc") = 9

So in this case, h3("abc") clashes with h3("xyz"). We will come to this in a while. Set the table values.


array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
array2 [ 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

Assume that the next word is again "xyz". So we would get the table below

array1 [ 0, 0, 0, 0, 0, 0, 0, 1, 2, 0]
array2 [ 0, 0, 0, 1, 0, 0, 2, 0, 0, 0]
array3 [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]

To calculate the frequency of "xyz", again calculate h1, h2, and h3, go to the tables, and find the minimum value at those positions. We have 2, 2 and 3, so the minimum, 2, is the answer: the value "xyz" is found 2 times. Similarly for "abc" we get 1, 1, 3, and the minimum, 1, is the answer. It has been found that with a proper choice of hash functions the collisions can be reduced considerably, and increasing the array size from 10 to a bigger number makes the possibility of conflicts remote. Algorithms of this type are called streaming algorithms (or heavy-hitter algorithms); this one is the Count-Min Sketch. They are used by Google, Facebook, Twitter and many more, and Big Data solutions are based on them. The bloom filters used in HBase are a closely related kind of sketch.

The beauty of this algorithm is that you don't need to store the input, and you can compute the result in one pass, so both time and space complexity are remarkably low. Even though this is a probabilistic approach, the chance of error is found to be around 1 in a million. This is still good, given the performance gain.

Now this algorithm is good, but how can we compute the top-10 words when we are not storing the actual words? For this, I did a workaround: basically, use a max-heap to compute the top-10, keeping around the top occurrences to support a future requirement. In the code, you might notice that I have not implemented a max-heap at all; I wanted to show that the same top-10 tracking can be done with a simple reverse-ordered TreeMap too (really simple).

Note: what can this algorithm not do? If we want to print all words and their frequencies, it's not possible, since we are not storing all the words. I have built solutions using online algorithms (Patricia Trie) and traditional algorithms (HashMap with linked list). For people interested in these, please email me at rajur79@gmail.com

Solution

package com.raj.streaming;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Date;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/**
 * List top 10 words and their count in the order of occurrences in a huge file of 25 MB.
 * Also support word count for any arbitrary word in the document.
 *
 * Performance: Searched 25 MB data comprising 2.5 Million words in 4 sec
 *
 * @author raju rama krishna
 * @see http://javatroops.blogspot.co.nz
 *
 */
public class FrequencyCount {

    int[][] table = new int[4][1000000];
    TreeMap<Integer, List<String>> map = new TreeMap<Integer, List<String>>(Collections.reverseOrder());

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        FrequencyCount freq = new FrequencyCount();
        BufferedReader br = null;
        try {
            long t1 = new Date().getTime();
            br = new BufferedReader(new FileReader(new File("C:\\Apps\\Professional.NoSQL.Aug.2011.txt")));
            String line = br.readLine();
            int i = 0;
            while (line != null) {
                if (line.length() > 0) {
                    String[] sArr = line.split(" ");
                    for (String s : sArr) {
                        if (s.trim().length() > 3) {
                            try {
                                freq.add(freq.trimStr(s));
                            } catch (ArrayIndexOutOfBoundsException e) {
                                // TODO: just for testing
                            }
                        }
                        i++;
                    }
                }
                line = br.readLine();
            }
            long t2 = new Date().getTime();
            System.out.println("Loaded " + i + " words into Data Structure. Time taken = " + (t2 - t1) + "ms.");
            System.out.println("*************");
            System.out.println("Top 10 words are:");
            System.out.println("*************");
            Set<Integer> set = freq.map.keySet();
            for (Integer x : set) {
                System.out.println(freq.map.get(x) + " found " + x + " times");
            }
            System.out.println("*************");
            System.out.println("Word performance found " + freq.count("performance") + " times");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            br.close();
        }
    }

    // normalize a word: lower-case, strip trailing 's and any non-letter edges
    public String trimStr(String s) {
        if (s.toUpperCase().equals(s.toLowerCase())) {
            return s;
        }
        s = s.toLowerCase().trim();
        if (s.endsWith("'s")) {
            s = s.substring(0, s.length() - 2);
        }
        int i = 0;
        int j = s.length() - 1;
        char[] cArr = s.toCharArray();
        while (!(cArr[i] >= 65 && cArr[i] <= 90)
                && !(cArr[i] >= 97 && cArr[i] <= 122)) {
            i++;
        }
        while (!(cArr[j] >= 65 && cArr[j] <= 90)
                && !(cArr[j] >= 97 && cArr[j] <= 122)) {
            j--;
        }
        return s.substring(i, j + 1);
    }

    public int[] hash(String s) {
        int h1 = hash1(s);
        int h2 = hash2(s);
        int h3 = hash3(s);
        int h4 = hash4(s);
        int[] res = new int[]{h1, h2, h3, h4};
        return res;
    }

    public int hash1(String x) {
        char ch[] = x.toCharArray();
        int i, sum;
        for (sum = 0, i = 0; i < x.length(); i++)
            sum += ch[i];
        return sum % 1000000;
    }

    public int hash2(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        h = h % 1000000;
        if (h < 0) {
            h = -h;
        }
        return h;
    }

    public int hash3(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 17 * h + s.charAt(i);
        }
        h = h % 1000000;
        if (h < 0) {
            h = -h;
        }
        return h;
    }

    public int hash4(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 11 * h + s.charAt(i);
        }
        h = h % 1000000;
        if (h < 0) {
            h = -h;
        }
        return h;
    }

    // count-min update: increment all four counters, then maintain the top-10 map
    public void add(String s) {
        int[] h = hash(s);
        table[0][h[0]] = table[0][h[0]] + 1;
        table[1][h[1]] = table[1][h[1]] + 1;
        table[2][h[2]] = table[2][h[2]] + 1;
        table[3][h[3]] = table[3][h[3]] + 1;
        int r = Math.min(Math.min(Math.min(table[0][h[0]], table[1][h[1]]), table[2][h[2]]), table[3][h[3]]);
        boolean add = true;
        List<String> list = map.get(r);
        if (list == null) {
            if (map.size() == 10) {
                Integer lastKey = map.lastKey();
                if (lastKey.intValue() > r) {
                    add = false;
                } else {
                    map.remove(lastKey);
                }
            }
            list = new ArrayList<String>();
        }
        if (add) {
            list.add(s);
            map.put(r, list);
            if (r > 1) {
                list = map.get(r - 1);
                if (list != null) {
                    if (list.size() == 1) {
                        map.remove(r - 1);
                    } else {
                        list.remove(s);
                    }
                }
            }
        }
    }

    // count-min query: the minimum of the four counters bounds the true frequency
    public int count(String s) {
        int[] h = hash(s);
        int a = table[0][h[0]];
        int b = table[1][h[1]];
        int c = table[2][h[2]];
        int d = table[3][h[3]];
        int r = Math.min(Math.min(Math.min(a, b), c), d);
        return r;
    }
}


Tuesday, November 27, 2012

LRU Cache Implementation - O(1)

LRU Cache is a well-known data structure used in many applications, like EhCache (used by Hibernate), web caching (to cache the most frequently used pages), disk-cleanup utilities, etc. The main concept here is: you need to build a cache which holds key/value pairs. The cache has a predefined size; if the number of key/value pairs exceeds the size, flush out the least recently used items.

For instance, if I added items a, b, c, d, e in that order and the size limit is 5, then to add f I have to flush out a (since that is the least recently used one), leaving b, c, d, e, f. Now if I access c (either update or get), it becomes the most recent item, so the order would be b, d, e, f, c. We should support a remove operation as well.

Java provides the LinkedHashMap class, which can be used to implement an LRU cache (see the sketch below). However, its lookups are only O(1) on average, since it doesn't ensure collision-free hashing. So we will redesign this cache. I have not added multi-threading capability to my implementation and it is not templatized.
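
For reference, a minimal LRU cache built on LinkedHashMap looks like this (the capacity of 5 is arbitrary); the rest of this post builds the collision-free variant instead:

import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public SimpleLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder=true: get()/put() move the entry to the back
        this.capacity = capacity;
    }

    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry when over capacity
    }

    public static void main(String[] args) {
        SimpleLruCache<Integer, String> cache = new SimpleLruCache<Integer, String>(5);
        for (int i = 1; i <= 5; i++) cache.put(i, "v" + i);
        cache.get(1);       // touching 1 makes it the most recent
        cache.put(6, "v6"); // evicts 2, the least recently used
        System.out.println(cache.keySet()); // prints [3, 4, 5, 1, 6]
    }
}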

In my earlier post I discussed how to build a perfect HashMap (with no collisions). Today we will use that HashMap and slightly modify the structure to achieve LRU. The HashMap stores key/value pairs; for now I am using an Integer key and a String value. Let's call this key/value class "Pair"; the HashMap is nothing but an array of Pairs. As per our requirement, whenever a key is accessed (update/get) we need to make it the most recent, so we need a linked list to store the keys, and whenever a key is accessed we bring it to the front. However, if I had a plain list of keys, I would have to go through all the keys to find the specific key. So I modified the Pair class as below.


class Pair {
Integer key;
Node val;
}


class Node {
String val;
Node left;
Node right;
}

So I can store the value in the node, and the same node doubles as a linked-list node.

So whenever a new key/value is added the value node is added to the head. And when the cache is full, the key and the value node at tail are removed. Whenever a key is accessed, the corresponding node is brought to the head.

Solution


package com.raj.lru;
/**
* Least Recently Used Cache (LRU Cache)
*
* Head=Zero; Tail=Ten
* 21 , Twenty One
* Head=Twenty One; Tail=Ten
* Head=Fifty Five; Tail=Twenty
* Head=Twenty One; Tail=Twenty
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class LRUCache {
public static void main(String[] args) {
PerfectHashMap map = new PerfectHashMap();
map.put(10, "Ten");
map.put(20, "Twenty");
map.put(21, "Twenty One");
map.put(30, "Thirty");
map.put(1, "One");
map.put(19, "Nineteen");
map.put(33, "Thirty Three");
map.put(23, "Twenty Three");
map.put(66, "Sixty Six");
map.put(0, "Zero");
System.out.println("Head=" + map.linkedList.head + "; Tail=" +map.linkedList.tail);
System.out.println(map.get(21));
System.out.println("Head=" + map.linkedList.head + "; Tail=" +map.linkedList.tail);
map.put( 55, "Fifty Five");
System.out.println("Head=" + map.linkedList.head + "; Tail=" +map.linkedList.tail);
map.remove(55);
System.out.println("Head=" + map.linkedList.head + "; Tail=" +map.linkedList.tail);
}
}
class PerfectHashMap {
int MAX_SIZE = 10;
float MAX_LOAD_FACTOR = 0.5f;
int capacity1 = 11; //Choose a Prime
int size1 = 0;
float loadFactor1 = 0.0f; //loadfactor = (size/capacity)
int capacity2 = 11; //Choose a Prime
int size2 = 0;
float loadFactor2 = 0.0f;
Pair[] table1 = new Pair[capacity1];
Pair[] table2 = new Pair[capacity2];
LinkedList linkedList = new LinkedList();
public void remove( Integer key ) {
int h1 = hash1( key );
int h2 = hash2( key );
Pair pair1 = table1[h1];
Pair pair2 = table2[h2];
if( pair1 != null && pair1.key.equals(key) ) {
linkedList.remove(pair1.val);
table1[h1] = null; //free the slot so the key is really gone
size1--;
} else if( pair2 != null && pair2.key.equals(key) ) {
linkedList.remove(pair2.val);
table2[h2] = null;
size2--;
}
}
public Pair get( Integer key ) {
int h1 = hash1( key );
int h2 = hash2( key );
Pair pair1 = table1[h1];
Pair pair2 = table2[h2];
if( pair1 != null && pair1.key.equals(key) ) {
linkedList.moveToHead(pair1.val);
return pair1;
} else if( pair2 != null && pair2.key.equals(key) ) {
linkedList.moveToHead(pair2.val);
return pair2;
}
return null;
}
public void put( Integer key, String val ) {
int h1 = hash1( key );
int h2 = hash2( key );
if( slot1Empty(h1) ) {
addTable1( key, val, h1 );
} else if( slot2Empty(h2) ) {
addTable2( key, val, h2 );
} else {
boolean pass = false;
Integer original = key;
Pair pushed = null; //Pushed out key
Pair pair = null;
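//Both slots are taken: displace the occupant of table1, re-insert it
//into table2, and keep bouncing evicted pairs between the two tables
//(cuckoo hashing) until an empty slot is found or a cycle is detected.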
while( true ) {
pushed = getTable1( h1 );
if( pushed == null ) {
break;
}
if( pass && (original.equals(key)) ) {
System.out.println("Found a Cycle and breaking. Key = " +key);
break;
}
if( key.equals(original) ) {
addTable1( key, val, h1 );
} else {
addTable1( pair, h1 );
}
h2 = hash2( pushed.key );
pair = getTable2( h2 );
addTable2( pushed, h2 );
if( pair == null ) {
break;
}
key = pair.key;
h1 = hash1( key );
pass = true;
}
}
}
public Pair getTable1( int index ) {
return table1[ index ];
}
public Pair getTable2( int index ) {
return table2[ index ];
}
/**
* Add the number to the table. It could be replace or add.
* If its add, then increment size. Once added, check the loadfactor.
* If loadfactor is greater than 0.5f then double the capacity and
* move the array contents to the new array
*
* @param key
* @param index
*/
public void addTable1( Integer key, String val, int index ) {
if( getTable1(index) == null ) {
if( size() == MAX_SIZE ) {
linkedList.removeTail();
}
size1++;
loadFactor1 = size1/(float)capacity1;
//Comment this if block: to simulate Cycle
if( loadFactor1 >= MAX_LOAD_FACTOR ) {
int newCapacity = capacity1 * 2;
capacity1 = newCapacity;
Pair[] temp = new Pair[newCapacity];
System.arraycopy(table1, 0, temp, 0, table1.length);
table1 = temp;
temp = null;
index = hash1(key);
}
}
Node node = new Node(val);
linkedList.insertHead(node );
Pair pair = new Pair( key, node );
table1[index] = pair;
}
public void addTable1( Pair pair, int index ) {
if( getTable1(index) == null ) {
if( size() == MAX_SIZE ) {
linkedList.removeTail();
}
size1++;
loadFactor1 = size1/(float)capacity1;
//Comment this if block: to simulate Cycle
if( loadFactor1 >= MAX_LOAD_FACTOR ) {
int newCapacity = capacity1 * 2;
capacity1 = newCapacity;
Pair[] temp = new Pair[newCapacity];
System.arraycopy(table1, 0, temp, 0, table1.length);
table1 = temp;
temp = null;
index = hash1(pair.key);
}
}
table1[index] = pair;
}
/**
* Add the number to the table. It could be replace or add.
* If its add, then increment size. Once added, check the loadfactor.
* If loadfactor is greater than 0.5f then double the capacity and
* move the array contents to the new array
*
* @param key
* @param index
*/
public void addTable2( Integer key, String val, int index ) {
if( getTable2(index) == null ) {
if( size() == MAX_SIZE ) {
linkedList.removeTail();
}
size2++;
loadFactor2 = size2/(float)capacity2;
//Comment this if block: to simulate Cycle
if( loadFactor2 >= MAX_LOAD_FACTOR ) {
int newCapacity = capacity2 * 2;
capacity2 = newCapacity;
Pair[] temp = new Pair[newCapacity];
System.arraycopy(table2, 0, temp, 0, table2.length);
table2 = temp;
temp = null;
index = hash2(key);
}
}
Node node = new Node(val);
linkedList.insertHead(node );
Pair pair = new Pair( key, node );
table2[index] = pair;
}
public void addTable2( Pair pair, int index ) {
if( getTable2(index) == null ) {
if( size() == MAX_SIZE ) {
linkedList.removeTail();
}
size2++;
loadFactor2 = size2/(float)capacity2;
//Comment this if block: to simulate Cycle
if( loadFactor2 >= MAX_LOAD_FACTOR ) {
int newCapacity = capacity2 * 2;
capacity2 = newCapacity;
Pair[] temp = new Pair[newCapacity];
System.arraycopy(table2, 0, temp, 0, table2.length);
table2 = temp;
temp = null;
index = hash2(pair.key);
}
}
table2[index] = pair;
}
public boolean slot1Empty( int index ) {
return (table1[index] == null);
}
public boolean slot2Empty( int index ) {
return (table2[index] == null);
}
public int hash1( int key ) {
return ( key % capacity1 );
}
public int hash2( int key ) {
return ( ( key / capacity2 ) % capacity2 );
}
public int size() {
return size1 + size2;
}
}
class Pair {
Integer key;
Node val;
public Pair( Integer key, Node val ) {
this.key = key;
this.val = val;
}
public String toString() {
return key.intValue() + " , " +val.val;
}
}
class Node {
String val;
Node left;
Node right;
public Node( String val ) {
this.val = val;
}
public String toString() {
return val;
}
}
class LinkedList {
Node head;
Node tail;
public void insertHead( Node node ) {
if( head == null ) {
head = node;
tail = node;
head.right = null;
} else {
node.right = head;
head.left = node;
head = node;
}
}
public void removeTail() {
if( tail == null ) return;
if( head == tail ) { head = null; tail = null; return; } //single node
Node node = tail.left;
node.right = null;
tail = node;
}
public void moveToHead( Node node ) {
if( node != head ) {
node.left.right = node.right;
if( node.right != null ) node.right.left = node.left;
else tail = node.left; //node was the tail
node.right = head;
head.left = node;
head = node;
node.left = null;
}
}
public void remove( Node node ) {
if( node == tail ) tail = node.left;
if( node == head ) {
head = node.right;
if( head != null ) head.left = null;
} else {
node.left.right = node.right;
if( node.right != null ) node.right.left = node.left;
}
}
}


Sunday, November 25, 2012

Multi-Player Snake And Ladder Game using Java

A bit of fun using Java to build a snake and ladder game. We are given the positions of the snakes and the ladders. Initialize the board by setting the 4 arrays - snake starts, snake ends, ladder starts, ladder ends. Once that is set, input the total number of players (N) and initialize a score array of int[N]. In round-robin fashion, starting with player 1 through player N, throw the dice - this is nothing but generating a random number between 1 and 12. The value obtained is added to score[i]. To check whether the new score lands on a snake or a ladder, do a binary search on the (sorted) snake-start and ladder-start arrays; if it does, the player's new position comes from the corresponding snake-end or ladder-end entry. When a total reaches 100, declare the winner :)


Solution
package com.raj.games;
import java.util.Arrays;
import java.util.Scanner;
/**
* Multi-player Snake and Ladder Game
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class SnakeAndLadder {
int MAX_PLAYERS = 5;
int[] snakesA = null;
int[] snakesB = null;
int[] laddersA = null;
int[] laddersB = null;
int totalPos = 100;
int[] players = null;
int[] scores = null;
int n = 0;
public void init() throws Exception {
//Position of snakes for eg. snake at 31 bites and fall to 14
String snakes = "31-14,37-17,73-53,78-39,92-35,99-7";
//Position of ladders for eg. Ladder at 5 takes to 25
String ladders = "5-25,10-29,22-41,28-55,44-95,70-89,79-81";
String[] snakesArr = snakes.split(",");
String[] laddersArr = ladders.split(",");
snakesA = new int[snakesArr.length];
snakesB = new int[snakesArr.length];
laddersA = new int[laddersArr.length];
laddersB = new int[laddersArr.length];
int i=0;
for(String s: snakesArr) {
String[] x = s.split("-");
snakesA[i] = Integer.parseInt(x[0]);
snakesB[i] = Integer.parseInt(x[1]);
i++;
}
i=0;
for(String s: laddersArr) {
String[] x = s.split("-");
laddersA[i] = Integer.parseInt(x[0]);
laddersB[i] = Integer.parseInt(x[1]);
i++;
}
//Read total players required as input
System.out.println("Enter number of players : between 2 and " +MAX_PLAYERS);
Scanner scan=new Scanner(System.in);
n=scan.nextInt();
scan.close();
if( n < 2 || n > MAX_PLAYERS ) {
System.exit(-1);
}
players = new int[ n ];
scores = new int[ n ];
}
public boolean isSnake( int pos ) {
int indx = Arrays.binarySearch(snakesA, pos);
boolean result = (indx < 0)? false: true;
return result;
}
public boolean isLadder( int pos ) {
int indx = Arrays.binarySearch(laddersA, pos);
boolean result = (indx < 0)? false: true;
return result;
}
public int getSnake( int pos ) {
int indx = Arrays.binarySearch(snakesA, pos);
return snakesB[indx];
}
public int getLadder( int pos ) {
int indx = Arrays.binarySearch(laddersA, pos);
return laddersB[indx];
}
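//returns a value between 1 and 12: a random number in 0..99 is taken mod 13, and 0 is re-rolled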
public int throwDice() {
int i = 0;
while( i == 0) {
i = (int)(Math.random()*100);
i = i%13;
}
return i;
}
public void game() throws Exception {
init();
int i=0;
while( true ) {
int v = throwDice();
char user = (char)('A' + i);
int score = scores[ i ] + v;
if( score >= totalPos ) {
System.out.println("User : " +user+ " got " +v+ ".WINNER!!");
break;
}
if( isLadder(score) ) {
score = getLadder(score);
System.out.println("User : " +user+ " got " +v+ " climbed ladder to: " +score);
scores[i] = score;
} else if( isSnake(score)) {
score = getSnake(score);
System.out.println("User : " +user+ " got " +v+ " snake bit to: " +score);
scores[i] = score;
} else {
System.out.println("User : " +user+ " got " +v+ " and moved to: " +score);
scores[i] = score;
}
i++;
if( i == n) {
i = 0;
}
}
}
/**
* @param args
*/
public static void main(String[] args) throws Exception {
SnakeAndLadder snakeAndLadder = new SnakeAndLadder();
snakeAndLadder.game();
}
}
class Snake {
int start;
int end;
public Snake(int start, int end) {
this.start = start;
this.end = end;
}
}
class Ladder {
int start;
int end;
public Ladder(int start, int end) {
this.start = start;
this.end = end;
}
}


Saturday, November 24, 2012

Interviewstreet Challenge: Median

Problem
https://www.interviewstreet.com/challenges/dashboard/#problem/4fcf919f11817


The median of M numbers is defined as the middle number after sorting them in order, if M is odd or the average number of the middle 2 numbers (again after sorting) if M is even. You have an empty number list at first. Then you can add or remove some number from the list. For each add or remove operation, output the median of numbers in the list.
 
Example : For a set of m = 5 numbers, { 9, 2, 8, 4, 1 } the median is the third number in sorted set { 1, 2, 4, 8, 9 } which is 4. Similarly for set of m = 4, { 5, 2, 10, 4 }, the median is the average of second and the third element in the sorted set { 2, 4, 5, 10 } which is (4+5)/2 = 4.5  
 
Input:
 
The first line is an integer n indicates the number of operations. Each of the next n lines is either "a x" or "r x" which indicates the operation is add or remove.
 
Output:
 
For each operation: If the operation is add output the median after adding x in a single line. If the operation is remove and the number x is not in the list, output "Wrong!" in a single line. If the operation is remove and the number x is in the list, output the median after deleting x in a single line. (if the result is an integer DO NOT output decimal point. And if the result is a double number , DO NOT output trailing 0s.)
 
Constraints:
 
0 < n <= 100,000
 
for each "a x" or "r x" , x will fit in 32-bit integer.
 
Sample Input:
 
7
r 1
a 1
a 2
a 1
r 1
r 2
r 1
 
Sample Output:
Wrong!
1
1.5
1
1.5
1
Wrong!
 
Note: As evident from the last line of the input, if after remove operation the list becomes empty you have to print "Wrong!" ( quotes are for clarity )

Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM. So our solution can never be brute-force or O(n^2); the same applies to the space complexity.

Median can be calculated easily if the data is already sorted. In this case, however, the data is not sorted - we can add and remove data in any order.
So we are left with 2 options -
  1. keep the data sorted at all times
  2. use heaps.
Approach:1
If we need to keep the data sorted at all times, we have to ensure that whenever the user adds a value we find the appropriate position and insert it there. You might be wondering why I didn't use a TreeSet? The problem is that a Set doesn't allow duplicates, and in my case I want to allow duplicates. We also have to ensure that locating the insert position doesn't take a linear number of comparisons, so we use a variant of binary search - find the number, or the nearest number less than it. I have already discussed this approach in my earlier postings. So what I did is build a sorted ArrayList class where add/remove locate positions by binary search rather than a sequential scan. This saved a lot of time: the worst case took 3 sec against the 5 sec benchmark.
Next comes the median calculation. Once we have the sorted ArrayList, the median is the middle number if the list size is odd; else we take the numbers at (size/2 - 1) and (size/2), and their average is the median. Use BigDecimal arithmetic (as in the code below) so that the sum of the two middle values cannot overflow Integer.MAX_VALUE.

Approach:2
Maintain the data in 2 heaps - one max-heap and one min-heap, where the max-heap holds the lower half of the numbers and the min-heap the upper half. The sizes of both should be the same (if the total number of elements in both heaps together is even); else the sizes differ by 1. I have given code for a median solution in my earlier postings. The main problem with this approach is that when data is added or removed we have to maintain the sizes of the heaps, which may mean moving some data from one heap to the other.
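A bare-bones sketch of the heap idea for the add-only case (the class name RunningMedian is mine, and the remove operation - the tricky part of this problem - is left out):

import java.util.Collections;
import java.util.PriorityQueue;

public class RunningMedian {
    //maxHeap holds the lower half of the numbers, minHeap the upper half
    private final PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(16, Collections.reverseOrder());
    private final PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();

    public void add( int x ) {
        if( maxHeap.isEmpty() || x <= maxHeap.peek() ) {
            maxHeap.add(x);
        } else {
            minHeap.add(x);
        }
        //rebalance so that the sizes never differ by more than 1
        if( maxHeap.size() > minHeap.size() + 1 ) {
            minHeap.add(maxHeap.poll());
        } else if( minHeap.size() > maxHeap.size() + 1 ) {
            maxHeap.add(minHeap.poll());
        }
    }

    public double median() { //assumes at least one element has been added
        if( maxHeap.size() == minHeap.size() ) {
            return (maxHeap.peek() + minHeap.peek()) / 2.0;
        }
        return (maxHeap.size() > minHeap.size()) ? maxHeap.peek() : minHeap.peek();
    }
}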

Solution 

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
/**
* Interview Street - Median
*
* The median of M numbers is defined as the middle number after sorting them in order, if M is odd
* or the average number of the middle 2 numbers (again after sorting) if M is even.
* You have an empty number list at first. Then you can add or remove some number from the list.
* For each add or remove operation, output the median of numbers in the list.
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class Solution {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
SortedArrayList h1 = new SortedArrayList( 50000 );
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
int N = Integer.parseInt(in.readLine());
for( int i=0; i<N; i++) {
String line = in.readLine();
if( line.startsWith("r")) {
int val = Integer.parseInt(line.split(" ")[1]);
h1.remove( val );
} else {
int val = Integer.parseInt(line.split(" ")[1]);
h1.add( val );
}
}
}
}
class SortedArrayList {
int size = 0;
List<Integer> a;
public SortedArrayList( int capacity ) {
a = new ArrayList<Integer>( capacity );
}
public void remove( int val ) {
boolean found = false;
if( size == 1 && a.get(0) == val ) {
a.remove( 0 );
size--;
} else if( size > 0 ) {
int l = 0;
int h = size-1;
int m = (l+h)/2;
while( h-l > 1 ) {
m = (l+h)/2;
if(a.get(m) == val ) {
a.remove(m);
found = true;
break;
}
if(a.get(m) > val ) {
h = m;
} else if( a.get(m) < val ) {
l = m;
}
}
if( !found && l < h ) {
if(a.get(l) == val) {
a.remove(l);
found = true;
} else if(a.get(h) == val ) {
a.remove(h);
found = true;
}
}
if(found) {
size--;
}
}
if( found )
System.out.println(median());
else
System.out.println("Wrong!");
}
public void add( int val ) {
if( a.isEmpty() || val > max() ) {
a.add(val);
size++;
} else {
int l = 0;
int h = size-1;
int m = (l+h)/2;
while( h-l > 1 ) {
m = (l+h)/2;
if(a.get(m) == val ) {
break;
}
if(a.get(m) > val ) {
h = m;
} else if( a.get(m) < val ) {
l = m;
}
}
if(a.get(m) == val ) {
a.add(m, val);
} else if(a.get(l) < val ) {
a.add(l+1, val);
} else {
a.add(l, val);
}
size++;
}
System.out.println(median());
}
public int max() {
if(a.isEmpty()) {
return Integer.MAX_VALUE;
} else {
return a.get(size-1);
}
}
public String median() {
if(size% 2 == 1) {
return new BigDecimal(a.get(size/2)).toString();
} else {
int x = a.get(size/2-1);
int y = a.get(size/2);
return new BigDecimal(x).add(new BigDecimal(y)).divide(new BigDecimal(2.0)).toString();
}
}
}


Interviewstreet Challenge: Flowers

Problem
https://www.interviewstreet.com/challenges/dashboard/#problem/4fd05444acc45


You and your K-1 friends want to buy N flowers. Flower number i has cost ci. Unfortunately the seller does not like a customer to buy a lot of flowers, so he raises the price of flowers for a customer who has bought flowers before. More precisely, if a customer has already bought x flowers, he should pay (x+1)*ci dollars to buy flower number i.
You and your K-1 friends want to buy all N flowers in such a way that you spend as little money as possible.

Input:
The first line of input contains two integers N and K.
next line contains N positive integers c1,c2,...,cN respectively.

Output:
Print the minimum amount of money you (and your friends) have to pay in order to buy all n flowers.
Sample Input:
3 3
2 5 6
Sample Output:
13
Explanation :
In the example each of you and your friends should buy one flower. In this case you have to pay 2 + 5 + 6 = 13 dollars.
Constraint :
1 <= N, K  <= 100
Each ci is not more than 1,000,000
Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM. So our solution can never be brute-force or O(n^2); the same applies to the space complexity.

The task is to find the total cost such that it is minimum, and the total is minimum only if each term (x+1)*ci is kept small - the multiplier (x+1) should be small exactly when the cost ci is large. So each person should buy flowers uniformly; we cannot have person A buying 3 flowers while person B buys only 1. Hence we use a round-robin: each person buys one flower at a time. And keeping ci small for the larger multipliers means buying the low-cost flowers later - start with the high-cost flowers and move to the low-cost ones.
So we arrive at a solution - sort the costs and buy in round-robin, costliest first.

For instance, if the costs are 9, 6, 8, 7 and there are 3 people buying, then one person has to buy 2 flowers and the other 2 buy one each. Also, the person who buys 2 flowers should buy the flower with cost 6 (the minimum) as the 2nd flower. That gives us-
Person A = 9, 6
Person B = 8
Person C = 7
for a total of 9 + 2*6 + 8 + 7 = 36 dollars.

Solution


package com.raj.strings;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
/**
* Interview Street - Flowers
*
* You and your K-1 friends want to buy N flowers. Flower number i has cost ci. Unfortunately the seller does not like a customer to buy a lot of flowers, so he raises the price of flowers for a customer who has bought flowers before. More precisely, if a customer has already bought x flowers, he should pay (x+1)*ci dollars to buy flower number i.
* You and your K-1 friends want to buy all N flowers in such a way that you spend as little money as possible.
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class Solution {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line = br.readLine();
int n = Integer.parseInt(line.split(" ")[0]);
int k = Integer.parseInt(line.split(" ")[1]);
String s = br.readLine();
List<Integer> set = new ArrayList<Integer>();
String[] sArr = s.split(" ");
for( String s1: sArr ) {
set.add( Integer.parseInt(s1));
}
Collections.sort(set);
int[] counters = new int[k];
int total = 0;
int i=0;
while(!set.isEmpty()) {
int m = set.size() - 1;
int val = set.remove(m);
total = total + (1+counters[i])*val;
counters[i] = counters[i] + 1;
if( i < k-1 ) {
i++;
} else {
i=0;
}
}
System.out.println(total);
}
}


Thursday, November 22, 2012

Interviewstreet Challenge: Even Tree

Problem


You are given a tree (a simple connected graph with no cycles). You have to remove as many edges from the tree as possible to obtain a forest, with the condition that each connected component of the forest contains an even number of vertices.
Your task is to calculate the number of removed edges in such a forest.

Input:
The first line of input contains two integers N and M. N is the number of vertices and M is the number of edges. 2 <= N <= 100.
Next M lines contains two integers ui and vi which specifies an edge of the tree. (1-based index)

Output:
Print a single integer which is the answer
Sample Input 
10 9
2 1
3 1
4 3
5 2
6 1
7 2
8 6
9 8
10 8
Sample Output :
2
Explanation : On removing the edges (1, 3) and (1, 6), we can get the desired result.
Original tree:


Decomposed tree:

Note: The tree in the input will be such that it can always be decomposed into components containing even number of nodes. 

Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM. So our solution can never be brute-force or O(n^2); the same applies to the space complexity.

Well, this is an interesting problem. Initially it appears to be a problem on n-ary trees, but building an actual tree and performing the decomposition on it is time consuming. So I approached it with a map, and all my 10 test cases together took only around 1 second.
Create a map with each parent as key and the list of its children as value. Once this map is created, take the root and get each of its children from the list. So we have 1 as the root node and 2, 3, 6 as children. Take 2 and check whether the number of nodes below it is even or odd (I am using a recursive call here to get the count). In this case there are only two nodes below it - 7 and 5 (an even count). So we cannot remove the link 1-2, since cutting it would leave a component of 3 nodes (2, 7 and 5), which is not an even number. So do not increment the count; just remove 2 from the list of 1's children and add 7 and 5 to the list.
Now take 6. It has 3 nodes below it (8, 9 and 10 - an odd count), so the link 1-6 can be removed: the cut-off component (6, 8, 9, 10) has 4 nodes. Increment the counter, remove 6, and add 8, 9 and 10 to the list.
The next entries 7, 5, and 4 have no children. Continue to 8. It has 2 nodes below it (even), hence its edge cannot be removed.



Now take the next element in the list - it is 3. It has only 1 node below it (an odd count), so we can break the 1-3 link. Increment the counter, remove 3 and add 4 to the list.



Solution
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
* Interview Street - Even Tree
*
* You are given a tree (a simple connected graph with no cycles).You have to remove as many edges from the tree as possible to obtain a forest with the condition that : Each connected component of the forest contains even number of vertices
* Your task is to calculate the number of removed edges in such a forest.
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class Solution {
private static Map<Integer, List<Integer>> map = new HashMap<Integer, List<Integer>>();
/**
* @param args
*/
public static void main(String[] args) throws Exception {
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
String line = br.readLine();
int n = Integer.parseInt(line.split(" ")[0]);
int m = Integer.parseInt(line.split(" ")[1]);
List<Integer> list = null;
for( int i=0; i<m; i++ ) {
line = br.readLine();
int b = Integer.parseInt(line.split(" ")[0]);
int a = Integer.parseInt(line.split(" ")[1]);
list = map.get(a);
if( list == null ) {
list = new ArrayList<Integer>();
}
list.add(b);
map.put(a, list);
}
int res = compute( );
System.out.println(res);
}
private static int compute() {
int total = 0;
int node = 1;
List<Integer> list = map.get(node);
List<Integer> list2 = null;
while(!list.isEmpty()) {
int child = list.remove(0);
int count = getCount( child );
if( count%2 == 1) {
total++;
}
list2 = map.get( child );
if( list2 != null ) {
list.addAll(list2);
}
}
return total;
}
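//returns the number of nodes below 'node' (all of its descendants), not counting 'node' itself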
private static int getCount( int node ) {
List<Integer> list = map.get(node);
if( list != null ) {
int count1 = list.size();
int count2 = 0;
for(int v: list ) {
count2 += getCount( v );
}
return count1 + count2;
}
return 0;
}
}


Interviewstreet Challenge: String Similarity


Problem

String Similarity (25 Points)
For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
Calculate the sum of similarities of a string S with each of its suffixes.
Input:
The first line contains the number of test cases T. Each of the next T lines contains a string each.
Output:
Output T lines containing the answer for the corresponding test case.
Constraints:
1 <= T <= 10
The length of each string is at most 100000 and contains only lower case characters.
Sample Input:
2
ababaa
aa
Sample Output:
11
3

Analysis

All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM. So our solution can never be brute-force or O(n^2); the same applies to the space complexity.
So the length of each string can be up to 100,000 and in total we have up to 10 test cases, all of which have to run within 5 seconds. The hint here is that you cannot afford repeated String manipulation operations: if you use String.indexOf or String.charAt in the inner loop you incur additional time. A better option is to convert the String to a character array once and work on that.
The next hint is that instead of creating multiple substrings, we create a virtual substring inline. What this means is: if the string is "raju", I don't need to store a separate substring "aju". I can use 2 pointers, one at position 0 of "raju" and the other at position 1. As long as the characters match, increment the counter and move both pointers forward.
E.g.
String "ababaa". The first substring is the original string itself "ababaa" and it would match for all characters.
Next we need to compare "ababaa" with "babaa". So check index 0 and index 1. they don't match. So stop here.
Next proceed for "ababaa" with "abaa". So check index 0 and index 2 (a and a). They match. Increment counter. Next index 1 and Index 3 (b and b). Again increment counter. Next also match. After that b does not match with a. So stop here.
Continue this search.

Solution

import java.io.BufferedReader;
import java.io.InputStreamReader;
/**
* Interview Street - String Similarity
*
* For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
* Calculate the sum of similarities of a string S with each of its suffixes.
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class Solution {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
int T = Integer.parseInt(in.readLine());
for( int i=0; i< T; i++ ) {
char[] s = in.readLine().toCharArray();
System.out.println(getcount( s ));
}
}
private static int getcount( char[] s ) {
int i = 0;
int n = s.length;
int total = 0;
while( i < n ) {
int sum = 0;
int k = 0;
while( i+k < n ) {
if(s[k] == s[i+k] ) {
sum++;
} else {
break;
}
k++;
}
total += sum;
i++;
}
return total;
}
}


Interviewstreet Challenge: Pairs

Problem

https://www.interviewstreet.com/challenges/dashboard/#problem/4e14b83d5fd12

Given N numbers, [N <= 10^5] we need to count the total pairs of numbers that have a difference of K. [K > 0 and K < 1e9]

Input Format:
1st line contains N & K (integers).
2nd line contains N numbers of the set. All the N numbers are assured to be distinct.
Output Format:
One integer saying the no of pairs of numbers that have a diff K.

Sample Input #00:
5 2
1 5 3 4 2

Sample Output #00:
3


Analysis
All Java solutions on InterviewStreet need to run within 5 seconds with a maximum of 256 MB RAM. So our solution can never be brute-force or O(n^2); the same applies to the space complexity.

The array contains 100,000 numbers and we need to find the pairs of numbers whose difference is K. The clue here is that if the numbers stay unsorted, a direct pair check degenerates to brute force:
we would check each number against every other number to see if they make a pair, which proves costly and would run out of the 5 seconds.
So one of the simplest solutions is to sort the array using any inbuilt sort method and then apply a modified binary search: take each number x and do a binary search for (x+k); if it is found, they make a pair. Then take the next number in the sorted array, and so on.
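As a side note: if extra memory is acceptable, there is an even simpler route - put all the numbers in a HashSet and, for each number x, check whether x+k is present. A minimal sketch of that alternative (class and method names are mine):

import java.util.HashSet;
import java.util.Set;

public class PairsWithSet {
    //O(n) average time and O(n) extra space; the numbers are distinct per the problem statement
    public static int countPairs( int[] nums, int k ) {
        Set<Integer> set = new HashSet<Integer>();
        for( int x : nums ) {
            set.add(x);
        }
        int count = 0;
        for( int x : nums ) {
            if( set.contains(x + k) ) { //x and x+k form a pair with difference k
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPairs(new int[]{1, 5, 3, 4, 2}, 2)); //prints 3
    }
}

The sorted-array solution below avoids the extra hash table at the cost of the initial sort.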

Solution

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
/**
* Interview Street - Find pairs of numbers whose difference is k
* Given N numbers , [N<=10^5] we need to count the total pairs of numbers that have a difference of K. [K>0 and K<1e9]
*
* @author raju rama krishna
* @see http://javatroops.blogspot.co.nz
*
*/
public class Solution {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
String line1 = in.readLine();
String line2 = in.readLine();
int n = Integer.parseInt(line1.split(" ")[0]);
int k = Integer.parseInt(line1.split(" ")[1]);
List<Integer> list = new ArrayList<Integer>();
String[] sArr = line2.split(" ");
for( String s : sArr ) {
list.add(Integer.parseInt(s));
}
//Sort the array
Collections.sort(list);
int num = list.get(0);
int l = 1;
int h = n-1;
int i = 1;
int count = 0;
while( i < n ) {
//For 'num' search if 'num+k' exists using binary search
while( l <= h ) {
int mid = (l+h)/2;
int val = list.get(mid);
if( val == num+k ) {
count++;
break;
} else if( val < num+k ) {
l = mid+1;
} else {
h = mid-1;
}
}
num = list.get(i);
i++;
l = i;
h = n-1;
}
System.out.println(count);
}
}


Tuesday, November 20, 2012

5 Solutions for Search

Search has been a favorite topic of late and there are lots of algorithms around it, all due to the rise in the data generated by systems and the increase in storage capacity. We need ever faster and more efficient algorithms to search. Today we will discuss search in depth and look at multiple algorithms. I am limiting the solutions to 5, but there are tons of others not discussed here.

Types of Search
  1. Fixed Text Base and Variable Pattern - In this type of search, we have a more or less fixed data set on which we search repeatedly, but each time for a different pattern. This is more or less like a web search. Let's take Google as an example: there is a huge number of documents to be searched, so the crawlers index the web and build a data structure. The data set is now fixed, and our searches operate on it.
  2. Variable Text Base and Fixed Pattern - In this type of search, the data set changes regularly but we search for the same pattern every time. This is employed in censoring applications, e.g. a news feed: the feed arrives continuously, and on it we check whether it contains any abusive words.
  3. Variable Text Base and Variable Pattern - Here both the data and our queries keep changing. This is the most common case.
Solution:1 - Naive or Brute-Force approach
This is one of the simplest solutions and the least effective one. We are given a string "S" of length "n" and a pattern "P" of length "m". We need to scan through S and find all the places where the pattern P occurs, printing the indexes in increasing order. This is somewhat like printing every s.indexOf(p).
e.g. S:   thisishiscar
      P:   hiscar
In this approach, we keep two variables i and j initialized to 0. If the characters at i and j are the same, save the position i in another variable k, then keep incrementing i and j as long as the characters match. In our case, they match up to the point below.
thisishiscar
hiscar
After this position they do not match. So we reset j to 0 and set i to (k+1): position k did not produce a full match, so we restart from the next position.
Definitely this approach is not recommended. Let's derive the time-complexity of this algorithm (for people who are not familiar with Big-O). In the worst case we compare each character of pattern P with each character of string S, so we have m*n comparisons and the time-complexity is O(m*n). What this means is: if 1 comparison takes say 1 ms, then searching for a pattern of length 5 in a string of length 100 can take 500 ms.

CODE
package com.raj.strings;
import java.util.ArrayList;
import java.util.List;
/**
* Naive approach - Time Complexity O(n*m)
* n length of text
* m length of pattern
* match m times for each letter. do it for all 'n' letters.
*
* @author raju rama krishna
*
*/
public class NaiveSearch {
/**
* @param args
*/
public static void main(String[] args) {
String s = "i have a car which i care. i have a card and a cat";
String p = "car";
List<Integer> list = search( s, p );
System.out.println(list);
}
private static List<Integer> search(String s, String p) {
List<Integer> list = new ArrayList<Integer>();
int i=0;
int j=0;
int k=0;
int n = s.length();
int m = p.length();
while(i < n && j < m) {
k = i;
while(i < n && j < m && s.charAt(i) == p.charAt(j)) {
i++;
j++;
if(j == m) {
list.add(i-m);
}
}
i = k+1;
j = 0;
}
return list;
}
}

Alternate Solution:- To address the problem of re-comparing every character of the pattern against every character of the text, we have the KMP algorithm. The catch there is that we first need to calculate the failure function for each pattern.
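For completeness, here is a sketch of that failure function; fail[i] is the length of the longest proper prefix of the pattern that is also a suffix of p[0..i] (the class name KMPFailure is mine):

import java.util.Arrays;

public class KMPFailure {
    public static int[] failureFunction( char[] p ) {
        int[] fail = new int[p.length];
        int k = 0; //length of the currently matched border
        for( int i = 1; i < p.length; i++ ) {
            while( k > 0 && p[k] != p[i] ) {
                k = fail[k-1]; //fall back to the next shorter border
            }
            if( p[k] == p[i] ) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    public static void main(String[] args) {
        //prints [0, 0, 1, 2, 3, 1]
        System.out.println(Arrays.toString(failureFunction("ababaa".toCharArray())));
    }
}

With this table the search never moves the text pointer backwards, giving O(n+m) overall.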


Solution:2 - Java List Approach
Now let's grow the problem slightly. Say we are given a list of names where each name contains a FirstName and a LastName. We have to search for a name (either first name or last name). The program should print all names where it matches.
I used a text file available on the web consisting of 151,671 names, available at http://stavi.sh/downloads/names.txt. Download this text file to your local system; I have stored it in the "C:\\Apps" folder. We will use this file for the remaining approaches.
In this approach, store all the names in an ArrayList. When the user searches for a name "xyz", go through the entire list and check whether the list value starts with "xyz " (notice the trailing space) or ends with " xyz" (notice the leading space). This is because we want an exact match.
So what is the time complexity of this approach? It's the same as approach 1: we go through the entire list and compare the pattern against each entry. If we have N words, the length of the longest word is K, and the pattern length is M, then the complexity is O(N*M*K) - for each of the N words in the list we spend O(K*M) comparing it against the pattern.

The program loaded the 151671 words in 77 ms and searched in 94 ms.

CODE


package com.raj.strings;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
/**
*
* @author raju rama krishna
*
*/
public class ListNames {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
List<String> list = new ArrayList<String>();
BufferedReader br = null;
try {
long t1 = new Date().getTime();
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
String line = br.readLine();
int i=0;
while( line != null ) {
list.add(line);
line = br.readLine();
i++;
}
long t2 = new Date().getTime();
System.out.println("Loaded " +i+ " names into Map. Time taken = " + (t2-t1) + "ms.");
String p = "JAMES";
System.out.println("Searching for " +p);
i = 0;
for( String key: list ) {
if(key.startsWith(p+" ") || key.endsWith(" " +p)) {
System.out.println(key);
i++;
}
}
t1 = new Date().getTime();
System.out.println("Found " +i+ " records in " +(t1-t2) + "ms.");
}catch(Exception e) {
e.printStackTrace();
} finally {
br.close();
}
}
}
Solution:3 - Java List and Map Approach
So how can we improve on approach 2? Let's say all the names contain only the 26 English letters. Instead of keeping one list we keep a list of size 26: any first name or last name starting with 'A' goes to list entry 0, one starting with 'B' goes to entry 1, and so on. At each entry we maintain a LinkedHashMap. This map contains the first name/surname and the index of the name in the text file. So if we have a name "Anand Chari" at position 4 of the text file, we store ("Anand", 4) in list entry 0 and ("Chari", 4) in list entry 2. Once this data structure is built, to search we take the first character of the search key and go to the appropriate list entry. We search only in this entry and not in the other 25.

So have we made the search about 25 times faster? Well yes, but the map factor adds to the cost, and we still go through all the names which share the specific first character. So the order is O(N*M*K/(d-1)), where d is the character-set size - in this case 26.


The program loaded the 151671 words in 769 ms and searched in 3 ms.


CODE


package com.raj.strings;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Set;
/**
* List Map implementation
*
* @author raju rama krishna
*
*/
public class ListMap {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
List<LinkedHashMap<String, String>> list = new ArrayList<LinkedHashMap<String, String>>();
BufferedReader br = null;
try {
long t1 = new Date().getTime();
for( int i=0; i<26; i++ ) {
list.add(new LinkedHashMap<String, String>());
}
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
String line = br.readLine();
int i=0;
while( line != null ) {
String[] sArr = line.split(" ");
for( String s: sArr ) {
int indx = (s.charAt(0) - 'A');
LinkedHashMap<String, String> map = list.get(indx);
String val = map.get(s);
if( val == null ) {
map.put(s, String.valueOf(i));
} else {
val = val + "," +i;
map.put(s, val);
}
}
line = br.readLine();
i++;
}
long t2 = new Date().getTime();
System.out.println("Loaded " +i+ " names into Map. Time taken = " + (t2-t1) + "ms.");
String p = "JAMES";
List<Integer> indexes = new ArrayList<Integer>();
int indx = (p.charAt(0) - 'A');
LinkedHashMap<String, String> map = list.get(indx);
Set<String> keys = map.keySet();
for( String key: keys ) {
if(key.equals(p)) {
String val = map.get(key);
String[] vArr = val.split(",");
for( i=0; i< vArr.length; i++ ) {
indexes.add(Integer.parseInt(vArr[i]));
}
break;
}
}
t1 = new Date().getTime();
System.out.println("Searched for : " +p+ " . Time taken = " +(t1-t2) + "ms.");
if(indexes != null && !indexes.isEmpty()) {
System.out.println("RESULTS: Found " +indexes.size()+ " matches");
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
line = br.readLine();
i=0;
while( line != null ) {
if( indexes.contains(i)) {
System.out.println(line);
}
line = br.readLine();
i++;
}
} else {
System.out.println("ERROR: Not Found!!");
}
}catch(Exception e) {
e.printStackTrace();
} finally {
br.close();
}
}
}
Solution:4 - TRIE
We will introduce TRIE data structure. The basic structure of the TRIE is as shown below

There are 2 ways of implementing a TRIE - as a character array or as a linked list. In the character-array TRIE each node stores a value and a marker indicating whether it is an end node (the double circle), along with its children, which are again TRIE nodes. The number of children is 'd', the character-set size - 26 for English or 256 for an 8-bit character set. So this wastes a lot of space.
To overcome this, we can use the linked-list TRIE, in which a node keeps only the children actually in use, avoiding the extra space. But we pay for the lost indexing capability: we have to do sequential access through the sibling list. We can keep the list sorted for faster access.
In this approach, we will store the firstname/lastname and the occurrences of the name in the file.
Search is very simple - a depth-first walk. For instance, to search for "acd", we go to node "a" and from there down to its children list, which starts at "b". From there we move along to "c", then go down to "d". So we only touched 4 nodes. What is the time-complexity of this algorithm? O(m*d), because in the worst case we traverse up to d sibling nodes for each of the m characters of the pattern.


The program loaded the 151671 words in 1 sec and searched in 0 ms. As you notice, the build-time is increased but the search time is really fast.

CODE

package com.raj.strings;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
/**
* Trie implementation:
* Names taken from: http://stavi.sh/downloads/names.txt
*
* @author raju rama krishna
*
*/
public class Trie {
LinkList rootList;
/**
* @param args
*/
public static void main(String[] args) throws Exception {
Trie trie = new Trie();
BufferedReader br = null;
try {
long t1 = new Date().getTime();
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
String line = br.readLine();
int i=0;
while( line != null ) {
trie.createTree(line, i++);
line = br.readLine();
}
long t2 = new Date().getTime();
System.out.println("Loaded " +i+ " names into Trie. Time taken = " + (t2-t1) + "ms.");
String p = "JAMES";
List<Integer> indexes = trie.search(p);
t1 = new Date().getTime();
System.out.println("Searched for : " +p+ " . Time taken = " +(t1-t2) + "ms.");
if(indexes != null) {
System.out.println("RESULTS: Found " +indexes.size()+ " matches");
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
line = br.readLine();
i=0;
while( line != null ) {
if( indexes.contains(i)) {
System.out.println(line);
}
line = br.readLine();
i++;
}
} else {
System.out.println("ERROR: Not Found!!");
}
}catch(Exception e) {
e.printStackTrace();
} finally {
br.close();
}
}
public void createTree( String s, int i ) {
String[] sArr = s.split(" ");
for( int j=0; j<sArr.length; j++ ) {
insert( sArr[j], i );
}
}
public void insert( String x, int index ) {
char[] cArr = x.toCharArray();
LinkList list = rootList;
for(int i=0; i<cArr.length; i++) {
if( list == null ) {
rootList = new LinkList();
LinkNode node = new LinkNode(cArr[i]);
rootList.head = node;
node.children = new LinkList();
list = node.children;
} else {
if( i == cArr.length-1 ) {
list = list.add(cArr[i], true, index);
} else {
list = list.add(cArr[i], false, index);
}
}
}
}
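//search: at each level walk the (sorted) sibling list to find the current character, then descend into its children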
public List<Integer> search( String p ) {
List<Integer> indexes = null;
LinkList list = rootList;
int i = 0;
char c = p.charAt(i);
LinkNode node = list.head;
while( node != null ) {
if( c == node.c ) {
i++;
if( i == p.length()) {
if(node.isEnd) {
indexes = node.indexes;
}
break;
}
c = p.charAt(i);
list = node.children;
node = list.head;
} else {
node = node.next;
}
}
return indexes;
}
}
class LinkList {
LinkNode head;
public LinkList add( char c, boolean isEnd, int index ) {
LinkNode res = null;
if( head == null ) {
LinkNode linkNode = new LinkNode( c );
head = linkNode;
res = linkNode;
} else {
if( c < head.c ) {
LinkNode linkNode = new LinkNode( c );
linkNode.next = head;
head = linkNode;
res = linkNode;
} else {
LinkNode prev = null;
LinkNode cur = head;
while( cur != null && cur.c < c ) {
prev = cur;
cur = cur.next;
}
if( cur != null && cur.c == c ) {
res = cur;
} else {
LinkNode linkNode = new LinkNode( c );
prev.next = linkNode;
linkNode.next = cur;
res = linkNode;
}
}
}
if( !isEnd ) {
if(res.children == null ) {
res.children = new LinkList();
}
} else {
res.isEnd = true;
if(res.indexes == null) {
res.indexes = new ArrayList<Integer>();
}
res.indexes.add(index);
}
return res.children;
}
}
class LinkNode {
boolean isEnd;
char c;
LinkNode next;
LinkList children;
List<Integer> indexes;
public LinkNode( char c ) {
this.c = c;
}
public String toString() {
return String.valueOf(c);
// StringBuffer sb = new StringBuffer();
// LinkNode node = this;
// while( node != null ) {
// sb.append("->").append(node.c);
// node = node.next;
// }
// return sb.toString();
}
}
Solution:5 - PATRICIA TRIE (RADIX TRIE)
The problem with TRIEs is the space requirement: characters are stored at each node, so the number of nodes can approach (d*N) even with the linked-list approach. That could mean 151671*26 nodes just for the English alphabet. To address this we have the PATRICIA (Practical Algorithm To Retrieve Information Coded In Alphanumeric) TRIE. Here, if we have N keys, the total number of nodes is always (2*N - 1).
It's a difficult algorithm to explain, but I will try my best. It uses the binary representation of the strings.
The ASCII value of 'A' is 65, which is represented in binary as "0100 0001".
So if we have a String s = "AC", we can represent it as below
0100 0001 0100 0011

Let's say we want to add 3 strings "THAT", "THEM" and "THEY" to the tree.

THAT     0101 0100 0100 1000 0100 0001 0101 0100
THEM    0101 0100 0100 1000 0100 0101 0100 1101
THEY     0101 0100 0100 1000 0100 0101 0101 1001

So the words THAT and THEM differ at bit position 21 (0-based, counting left to right). Similarly, THEM and THEY differ at position 27. So we make a binary tree like this

Notice that to store 3 keys we have used 5 nodes (2*3 - 1).

In "THAT" the 21st bit is 0 that is why it is on the left. For "THEM" and "THEY" the bit is 1 and thats why they they are on right. The words "THEM" and "THEY" differ at bit 27. THEM has 0 at the position and THEY has 1.





So we build a tree like this, and search is very simple. There are several other reasons for choosing this data structure: we can do a lot more, like proximity search, approximate search, regex search, substring search, and so on.


The program loaded the 151671 words in 2 sec and searched in 1 ms. The increase in build time is due to the string-manipulation functions used to convert each String to binary; I haven't optimized that yet.

So what did we gain? A much lower memory requirement and very fast search.


CODE

package com.raj.strings;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
/**
* Patricia Tree - Search
* Total nodes = 2*n - 1
* Leaves = n
* Internal = n-1
* @author raju rama krishna
*
*/
public class Patricia {
/**
* @param args
*/
public static void main(String[] args) throws Exception {
PTree pTree = new PTree();
BufferedReader br = null;
try {
long t1 = new Date().getTime();
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
String line = br.readLine();
int i=0;
while( line != null ) {
String[] sArr = line.split(" ");
for( String s: sArr ) {
pTree.insert(s, i);
}
line = br.readLine();
i++;
}
long t2 = new Date().getTime();
System.out.println("Loaded " +i+ " names into Trie. Time taken = " + (t2-t1) + "ms.");
String p = "JAMES";
List<Integer> indexes = pTree.find(p);
t1 = new Date().getTime();
System.out.println("Searched for : " +p+ " . Time taken = " +(t1-t2) + "ms.");
if(indexes != null) {
System.out.println("RESULTS: Found " +indexes.size()+ " matches");
br = new BufferedReader(new FileReader(new File("C:\\Apps\\names.txt")));
line = br.readLine();
i=0;
while( line != null ) {
if( indexes.contains(i)) {
System.out.println(line);
}
line = br.readLine();
i++;
}
} else {
System.out.println("ERROR: Not Found!!");
}
}catch(Exception e) {
e.printStackTrace();
} finally {
br.close();
}
}
}
class PTree {
PNode root;
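/*
* insert(): walk the internal nodes, using the bit at each node's index
* to go left (0) or right (1), until a leaf is reached. If the leaf holds
* a different key, compute the first differing bit position and split the
* leaf into an internal node whose two children are the old and new keys.
*/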
public void insert( String s, int occurence ) {
if( root == null ) {
root = new PNode( s, occurence );
root.isLeaf = true;
} else {
String binary = getBinary( s );
PNode node = root;
while( !node.isLeaf ) {
int index = node.index;
int val = 0;
if(binary.length() > index) {
val = binary.charAt(index) - '0';
}
if( val == 0 ) {
node = node.left;
} else {
node = node.right;
}
}
String data = node.data;
if( data != null ) {
if( data.equals(s) ) {
node.occurences.add(occurence);
} else {
IndexHolder indexHolder = getIndexHolder( data, s );
if( indexHolder.zero ) {
PNode left = new PNode( data, node.occurences );
left.isLeaf = true;
PNode right = new PNode( s, occurence );
right.isLeaf = true;
node.index = indexHolder.index;
node.data = null;
node.isLeaf = false;
node.left = left;
node.right = right;
node.occurences = null;
} else {
PNode left = new PNode( s, occurence );
left.isLeaf = true;
PNode right = new PNode( data, node.occurences );
right.isLeaf = true;
node.index = indexHolder.index;
node.data = null;
node.isLeaf = false;
node.left = left;
node.right = right;
node.occurences = null;
}
}
}
}
}
public List<Integer> find(String s) {
List<Integer> list = null;
String binary = getBinary(s);
PNode node = root;
while( node != null ) {
if( node.isLeaf && node.data.equals(s)) {
list = node.occurences;
break;
}
int index = node.index;
int val = 0;
if( binary.length() > index ) {
val = binary.charAt(index) - '0';
}
if( val == 0 ) {
node = node.left;
} else {
node = node.right;
}
}
return list;
}
public String getBinary( String s ) {
char[] cArr = s.toCharArray();
StringBuffer sb = new StringBuffer();
for( char c: cArr ) {
sb.append(getString((byte)c));
}
return sb.toString();
}
public IndexHolder getIndexHolder( String s1, String s2 ) {
IndexHolder indexHolder;
boolean zero = false;
int index = 0;
int i=0;
while( i < s1.length() && i < s2.length() ) {
if(s1.charAt(i) == s2.charAt(i)) {
i++;
} else {
break;
}
}
if( i == s1.length()) {
String b2 = getString((byte)s2.charAt(i));
int j = 0;
while( j < 8 ) {
if(b2.charAt(j) == '0') {
j++;
} else {
break;
}
}
index = (i*8) + j;
zero = true;
} else if( i == s2.length()) {
String b1 = getString((byte)s1.charAt(i));
int j = 0;
while( j < 8 ) {
if(b1.charAt(j) == '0') {
j++;
} else {
break;
}
}
index = (i*8) + j;
zero = false;
} else {
String b1 = getString((byte)s1.charAt(i));
String b2 = getString((byte)s2.charAt(i));
int j = 0;
while( j < 8 ) {
if( b1.charAt(j) == b2.charAt(j)) {
j++;
} else {
break;
}
}
index = (i*8) + j;
zero = (b1.charAt(j) == '0')? true: false;
}
indexHolder = new IndexHolder(index, zero);
return indexHolder;
}
public static String getString( byte b ) {
String s = Integer.toBinaryString(b);
for(int i=0; i<8-s.length(); i++) {
s = "0" + s;
}
return s;
}
}
class PNode {
int index;
String data;
boolean isLeaf;
PNode left;
PNode right;
List<Integer> occurences;
public PNode( String data, int occurence ) {
this.data = data;
if( occurences == null ) {
occurences = new ArrayList<Integer>();
}
occurences.add( occurence );
}
public PNode( String data, List<Integer> occurences ) {
this.data = data;
this.occurences = occurences;
}
public PNode( int index ) {
this.index = index;
}
public String toString() {
if(isLeaf) {
return data;
}
return String.valueOf(index);
}
}
class IndexHolder {
int index;
boolean zero;
public IndexHolder() {
}
public IndexHolder( int index, boolean zero ) {
this.index = index;
this.zero = zero;
}
}


Thursday, November 15, 2012

Solution for Probability using Tree

Probability is one area that's most often ignored by many, yet there are very interesting problems on it - tossing coins, picking a ball, and so on. Why do people ask such questions in interviews? Well, if we observe carefully, a coin toss is Head/Tail - that is, binary. So these problems can be solved using binary algorithms. We will discuss how, using a special type of complete binary tree called a "Probability Tree".
Let's say I toss a coin n times and get the following sequence - HTHTHHTTTHHHTHTH
Here H denotes Head and T denotes Tail. With this data, can you predict the probability of a series of 1, 2 or 3 coin tosses? For example, what is the probability that my first toss results in H? Or the probability of getting Head first and Tail second? Or of getting HHT? All of these can be answered using the probability tree.
For our purposes we will consider series of at most 3 tosses. Treat H as '0' and T as '1'. Read the input as a sliding window of 3, starting from the left. Since there are 16 tosses in total, we get 14 windows of 3. They are

HTH
THT
HTH
THH
HHT
HTT
TTT
TTH
THH
HHH
HHT
HTH
THT
HTH

What we have done is read 3 at a time, sliding the window by one: (1,2,3), (2,3,4), (3,4,5), etc.

Now prepare a 3-level complete binary tree where left children denote 'T' and right children denote 'H'. Initialize the count of every node to 0 and feed in the 14 inputs one by one. The first is HTH, so we go right, left, and then right, increasing the count of each node we pass through by 1. Do likewise for all 14 inputs, and we are done with the counts. At level 1 the left node has count 6 and the right node 8, so the total of the 2 nodes is 14. The probability of the left node is 6/14 and of the right 8/14. Store these and move on to the next levels: for each node, take its two children and set their probabilities from their counts. That's all, we are done!

Now, to find the probability of an event, say "H", we go directly to the right node and read its probability: about 0.57 (8/14).
Say we want the probability of "TT": it is 0.43*0.33.
The probability of "TTH" is 0.43*0.33*0.5.

Note that the total probability, summed over all the leaf nodes, is equal to 1.

Solution:

package com.raj.probability;
/**
* Probability Tree for coin toss
*
* @author raju rama krishna
* @email rajur79@gmail.com
* @see http://www.javatroops.blogspot.co.nz
*
*/
public class ProbabilityTree {
/**
* @param args
*/
public static void main(String[] args) {
String pattern = "HTHTHHTTTHHHTHTH";
Tree tree = new Tree();
for(int i=0; i <= pattern.length()-3; i++) {
tree.setCount(pattern.substring(i, i+3));
}
tree.setProbabilities();
//Calculate probability for an occurrence
int res = (int)tree.getProbability("T");
System.out.println("Probability of getting a Tail first time = " +res+ "%");
res = (int)tree.getProbability("HHT");
System.out.println("Probability of getting HHT = " +res+ "%");
}
}
class Tree {
Node root = new Node(-1);
public Tree() {
buildTree();
}
public void buildTree() {
add( root, 0 );
}
private void add( Node node, int level ) {
if( level != 3 ) {
node.left = new Node(0);
node.right = new Node(0);
add( node.left, level+1);
add( node.right, level+1);
}
}
public void setCount( String s ) {
Node node = root;
char[] cArr = s.toCharArray();
for(char c: cArr) {
if( c == 'T' ) {
node = node.left;
node.count++;
} else if( c == 'H' ) {
node = node.right;
node.count++;
}
}
}
public void setProbabilities() {
Node node = root;
setProbability(node);
}
private void setProbability( Node node ) {
if( node.left != null && node.right != null) {
int a = node.left.count;
int b = node.right.count;
int c = a + b;
float pa = (a/(float)c);
float pb = 1 - pa; // probability of a + probability of b = 1
node.left.prob = pa;
node.right.prob = pb;
setProbability( node.left );
setProbability( node.right );
}
}
public float getProbability(String s) {
char[] cArr = s.toCharArray();
float res = 1;
Node node = root;
for(char c: cArr) {
if(c == 'T') {
node = node.left;
} else if(c == 'H') {
node = node.right;
}
res = res * node.prob;
}
res = (int)(res * 100); //truncate to a whole-number percentage
return res;
}
public float getTotalProb() {
String s = "HHH,HHT,HTH,HTT,THH,THT,TTH,TTT";
String[] sArr = s.split(",");
float res = 0;
for( String x: sArr ) {
float p = getProbability(x);
res = res + p;
}
return res;
}
}
class Node {
int count;
float prob;
Node left;
Node right;
public Node( int count ) {
this.count = count;
}
public void setProb( float prob ) {
this.prob = prob;
}
public String toString() {
return String.valueOf( count+ " : " +prob );
}
}


Wednesday, November 14, 2012

Algorithm used by BZip, JPeg compression

We all use WinZip, JPEG, and other compression utilities, and the space gained by compression is substantial. We will discuss the underlying algorithm behind them.

In Java the char primitive type is stored as 2 bytes (or 16 bits). ASCII values, however, are 7 bits, so there are 128 ASCII characters in total. In ASCII, 'a' is 97 and 'A' is 65.
Refer to an ASCII table for the complete list.

So in Java, storing the string "cat" takes 2*3 = 6 bytes (or 48 bits).

Imagine you are writing a utility that loads an English novel into memory. Storing that much data as-is will most probably run into memory issues. So what is the alternative?
Huffman proposed a compression algorithm for exactly this; it supports both encoding and decoding.
Let's say our text is "bcbbbbbbaacaabbcade". It contains 19 letters, so we need 19*16 = 304 bits to represent this String.

Algorithm:
Consider the input string and count how many times each character appears in it. Store the counts in ascending order of frequency, which gives the list below:

e=1
d=1
c=3
a=5
b=9

What is the most efficient way to do this? Our character set has size 256, so initialize an integer array of size 256, read the input string character by character, and increase the count at the index of each character. If our integer array is arr[256], we end up with arr[97] = 5, arr[98] = 9, and so on.
So the time complexity is O(n), where n is the length of the input string.
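A minimal sketch of this counting step (the getTNodes method in the solution below does the same thing):

public class FrequencyCount {
    public static void main(String[] args) {
        int[] freq = new int[256]; // one counter per (extended) ASCII character
        for (char c : "bcbbbbbbaacaabbcade".toCharArray()) {
            freq[c]++; // O(n): one increment per input character
        }
        System.out.println(freq['a'] + " " + freq['b'] + " " + freq['c']); // 5 9 3
    }
}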

Now take the two characters with the lowest frequencies, (e,1) and (d,1). We store these two as child nodes of a binary-tree-like structure. Their parent is a consolidated node (ed, 2), formed by concatenating the two characters and adding their frequencies.

           (ed, 2)
(e,1)                 (d, 1)

So the frequency of 'ed' is 2, which is currently the minimum; the next minimum is (c, 3). Our new node (edc, 5) has left child (ed, 2) and right child (c, 3). Similarly we merge (edc, 5) with (a, 5) into (edca, 10), and finally (b, 9) with (edca, 10) into (bedca, 19). This is the final root node, and its frequency, 19, is the length of the input string. The following diagram shows the structure of the tree.
Mark all left branches 0 and all right branches 1.
Double-circled nodes in the diagram are leaves.

Start from the root node and do a DFS (depth-first search) until you reach each leaf node, collecting the 0/1 branch labels along the way. We get:

b = 0 
a = 11
c = 101
d = 1001
e = 1000



The input string translates to "010100000011111011111001011110011000", which is 36 bits long. That means we only need 36 bits instead of 304. Wow, that's a saving of about 88.16%.


Now, how do we decode a string? Let's say we want to decode "0001101011000". Read the bits from left to right, walking down the tree until we hit a leaf node. The first 0 takes us straight to the leaf b, so "000" gives "bbb". The next bit is 1, which takes us to (edca, 10); that is not a leaf, so we continue. The following 1 takes us to the leaf (a, 5), so "11" decodes to 'a'. Continuing the same way ("0" → b, "101" → c, "1000" → e), we get "bbbabce". This is the decoded value.
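The tree construction is often written with a priority queue. Here is a minimal sketch of that variant (a hypothetical HuffmanSketch class, not the author's solution below). Note that ties between equal frequencies may be broken differently, so the exact codes can differ from the table above while staying equally optimal:

import java.util.PriorityQueue;

public class HuffmanSketch {
    static class HNode {
        char ch; // meaningful only for leaves
        int freq;
        HNode left, right;
        HNode(char ch, int freq) { this.ch = ch; this.freq = freq; }
        HNode(HNode l, HNode r) { this.freq = l.freq + r.freq; this.left = l; this.right = r; }
        boolean isLeaf() { return left == null && right == null; }
    }

    public static void main(String[] args) {
        String s = "bcbbbbbbaacaabbcade";
        int[] freq = new int[256];
        for (char c : s.toCharArray()) freq[c]++;
        PriorityQueue<HNode> pq = new PriorityQueue<>((x, y) -> x.freq - y.freq);
        for (int c = 0; c < 256; c++) {
            if (freq[c] > 0) pq.add(new HNode((char) c, freq[c]));
        }
        // repeatedly merge the two least frequent nodes until one root remains
        while (pq.size() > 1) {
            pq.add(new HNode(pq.poll(), pq.poll()));
        }
        printCodes(pq.poll(), "");
    }

    // DFS: a left branch appends '0', a right branch appends '1'
    static void printCodes(HNode n, String code) {
        if (n.isLeaf()) {
            System.out.println(n.ch + " = " + code);
            return;
        }
        printCodes(n.left, code + "0");
        printCodes(n.right, code + "1");
    }
}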

In the solution, I have built the Huffman tree efficiently: the frequency counting uses a plain array and the tree is built iteratively, with only a small recursion to derive the code table. Have fun coding!!

Solution
package com.raj.compression;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
/**
* Compression Algorithm for Strings. This is used in JPEG and BZip
* It provides both encoding and decoding
*
* @author raju rama krishna
*
*/
public class Huffman {
/**
* @param args
*/
public static void main(String[] args) {
Huffman compressor = new Huffman();
String s = "bcbbbbbbaacaabbcade";
Set<CharacterNode> tNodes = compressor.getTNodes( s );
//Frequency table
// for( CharacterNode n: tNodes ) {
// System.out.println(n.c + "=" +n.freq);
// }
HuffmanTree tree = HuffmanTree.buildTree( tNodes );
String res = tree.encode(s);
System.out.println("Encoded to : " +res);
int reducedLength = res.length();
int actualLength = s.length() * 16; //1 unicode char = 16 bits
float compression = 100 - (float)(reducedLength/(float)(actualLength))*100;
System.out.println("Reduced from " +actualLength+ " bits to " +reducedLength+ " bits");
System.out.println("Space Gained: " +compression+ "%");
//Compressed to : 010100000011111011111001011110011000
String pattern = "0001101011000";
res = tree.decode( pattern );
System.out.println("Decoded pattern: " +res);
//Decoded to: bbbabce
}
public Set<CharacterNode> getTNodes(String s) {
Set<CharacterNode> tNodes = new TreeSet<CharacterNode>();
int[] charSet = new int[256];
char[] cArr = s.toCharArray();
for( char c: cArr ) {
charSet[c] = charSet[c]+ 1;
}
for( int i=0; i<charSet.length; i++ ) {
if( charSet[i] == 0 ) {
continue;
}
CharacterNode node = new CharacterNode( String.valueOf((char)i), charSet[i]);
tNodes.add(node);
}
return tNodes;
}
}
class CharacterNode implements Comparable {
String c;
int freq;
public CharacterNode( String c, int freq ) {
this.c = c;
this.freq = freq;
}
public int compareTo(Object o) {
if( o instanceof CharacterNode ) {
CharacterNode x = (CharacterNode)o;
if(x.freq == this.freq ) {
//never return 0: nodes with equal frequencies must both remain in the TreeSet
return -1;
} else {
return (this.freq - x.freq);
}
}
throw new ClassCastException("Invalid!!");
}
public boolean equals(Object o) {
if( o instanceof CharacterNode ) {
CharacterNode x = (CharacterNode)o;
return x.c.equals(this.c);
}
throw new ClassCastException("Invalid!!");
}
public String toString() {
return c;
}
}
class HuffmanTree {
HuffmanNode tree = new HuffmanNode( null );
Map<Character, String> encodingMap = new HashMap<Character, String>();
public static HuffmanTree buildTree( Set<CharacterNode> tNodes ) {
List<HuffmanNode> hNodes = new ArrayList<HuffmanNode>();
HuffmanTree tree = new HuffmanTree();
while( true ) {
//take the two lowest-frequency nodes, merge them, and put the merged node back
Iterator<CharacterNode> itr = tNodes.iterator();
if(!itr.hasNext()) {
break;
}
CharacterNode a = itr.next();
itr.remove();
if(!itr.hasNext()) {
break;
}
CharacterNode b = itr.next();
itr.remove();
CharacterNode c = new CharacterNode( a.c+b.c, a.freq+b.freq );
tNodes.add(c);
HuffmanNode node = new HuffmanNode(c);
HuffmanNode left = null;
HuffmanNode right = null;
for( HuffmanNode h: hNodes ) {
if( h.c.equals(a) ) {
left = h;
} else if( h.c.equals(b)) {
right = h;
}
}
if( left != null ) {
hNodes.remove(left);
} else {
left = new HuffmanNode(a);
}
if( right != null ) {
hNodes.remove(right);
} else {
right = new HuffmanNode(b);
}
node.left = left;
node.right = right;
hNodes.add(node);
}
tree.tree = hNodes.remove(0);
return tree;
}
public String encode( String s ) {
HuffmanNode node = tree;
encode( node, "" );
StringBuffer sBuf = new StringBuffer();
char[] cArr = s.toCharArray();
for( char c: cArr ) {
sBuf.append(encodingMap.get(c));
}
return sBuf.toString();
}
private void encode( HuffmanNode node, String s ) {
if( node != null ) {
encode( node.left, s + "0");
if(node.c.c.length() == 1) {
encodingMap.put(node.c.c.charAt(0), s);
}
encode( node.right, s + "1");
}
}
public String decode( String p ) {
StringBuffer sBuf = new StringBuffer();
HuffmanNode node = tree;
char[] cArr = p.toCharArray();
int i=0;
while( i < cArr.length ) {
int val = cArr[i] - '0';
if( val == 0 ) {
node = node.left;
} else {
node = node.right;
}
if( node.left == null && node.right == null ) {
sBuf.append(node.c.c);
node = tree;
}
i++;
}
return sBuf.toString();
}
}
class HuffmanNode {
CharacterNode c;
HuffmanNode left;
HuffmanNode right;
public HuffmanNode( CharacterNode c ) {
this.c = c;
}
public String toString() {
return c.c;
}
@Override
public boolean equals(Object o) {
if(o instanceof HuffmanNode ) {
HuffmanNode n = (HuffmanNode)o;
return( n.c.c.equals(c.c));
}
throw new ClassCastException("Invalid!!");
}
}


Monday, November 12, 2012

Spell Check - Minimum Distance Calculation

Wherever there is manual user input, there is a possibility of a typo. Research has found that the chance of the first letter being typed wrongly is very remote.
Errors are classified into the following types

  1. Extra character : "inpuut" instead of "input"
  2. Missing character: "publsh" instead of "publish"
  3. Wrong character: "prebiew" instead of "preview"

We count each of the above errors as a distance of 1. Given 2 strings, it is therefore possible to calculate the distance between them.
For example, the distance between "Apple" and "Applets" is 2, since there are 2 missing characters.
The distance between "Apple" and "Tablet" is 4: 3 wrong characters (A instead of T, p instead of a, p instead of b) and 1 missing character (the trailing t).

There is an algorithm to calculate this distance, devised by the Russian mathematician Levenshtein, after whom it is named. If the length of string1 is 'n' and the length of string2 is 'm', the algorithm runs in O(n*m). First we lay out the characters of string1 and string2 along the rows and columns of a matrix and then start filling in the values.

Calculation starts at cell (1,1) and the result is stored in the bottom-right cell.

At (1,1) we have 't' and 'a'. They do not match, so consider the values at (0,0), (0,1), and (1,0), take the minimum, and add 1. Those values are 0, 1, and 1; the minimum is 0, so adding 1 gives 1.

At (2,1) we have 'a' and 'a'. They match, so we simply copy the value at (1,0), the cell diagonally above and to the left of the current one.

The minimum distance is stored at (6,5), and it is 4.


The same algorithm can be extended to check multiple strings, for example to find all matching words in a dictionary. Some of the spell checkers we see day to day use this to find the nearest matching word in a given dictionary; while doing so, the average distance they check against is 2 (as per research).
But preparing this matrix for every dictionary word and comparing in O(n*m) is time consuming, so there is a revised algorithm that runs in O(n log n + m log m). I am planning to cover that algorithm in an upcoming post on the algorithm behind Google Docs.
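Independently of that revised algorithm, a standard memory optimization is worth knowing: since each row of the matrix depends only on the previous row, we can keep just two rows, reducing space from O(n*m) to O(n). A minimal sketch (a hypothetical TwoRowLevenshtein class, not part of the solution below):

public class TwoRowLevenshtein {
    public static int distance(String s, String t) {
        int m = s.length(), n = t.length();
        int[] prev = new int[n + 1]; // row i-1 of the full matrix
        int[] curr = new int[n + 1]; // row i
        for (int j = 0; j <= n; j++) prev[j] = j; // distance from the empty string
        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    curr[j] = prev[j - 1]; // no edit
                } else {
                    curr[j] = 1 + Math.min(prev[j],       // deletion
                                  Math.min(curr[j - 1],   // insertion
                                           prev[j - 1])); // substitution
                }
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two arrays
        }
        return prev[n];
    }

    public static void main(String[] args) {
        System.out.println(distance("apple", "tablet")); // prints 4
    }
}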

Solution

package com.palindrome;
/**
* Levenshtein Distance Calculator
*
* @author raju rama krishna
*
*/
public class DistanceCalculator {
/**
* @param args
*/
public static void main(String[] args) {
String s1 = "apple";
String s2 = "tablet";
int d = levenshteinDistance(s1.toCharArray(), s2.toCharArray());
System.out.println(d);
}
public static int levenshteinDistance( char[] s, char[] t ) {
int m = s.length;
int n = t.length;
int[][] d = new int[m+1][n+1];
//The distance between the empty string and another
// string is equal to the length of the other string.
for( int i = 0; i<=m; i++) {
d[i][0] = i;
}
for( int j = 0; j <=n; j++) {
d[0][j] = j;
}
for(int j =1; j<=n; j++) {
for( int i = 1; i<=m; i++) {
if( s[i-1] == t[j-1]) {
d[i][j] = d[i-1][j-1]; //No edit
}
else {
d[i][j] = Math.min(d[i-1][j] + 1, //Deletion
Math.min(d[i][j-1] + 1, //Insertion
d[i-1][j-1] + 1)); //Substitution
}
}
}
return d[m][n];
}
}

