Tuesday, November 20, 2012

5 Solutions for Search

Search has been a favorite topic of today and there are a lots of algorithms around it. Its all due to the rise in the data generated by systems and the increase in storage capacity. So we would need more faster and efficient algorithms to search. Today we will discuss Search in-depth, and look at multiple algorithms. I am limiting the solution to 5, but there are tons of other solutions not discussed here.

Types of Search
  1. Fixed Text Base and Variable Pattern - In this type of search, we have a more or less fixed data set on which we repeatedly search. But each time we search for a different pattern. This is more or less like a web search. Lets take Google as an example. There are huge number of documents that has to be searched. So the crawlers index the web and build a data structure. So the data set is fixed now. Our search operates on this data set.
  2. Variable Text Base and Fixed Pattern - In this type of search, we have the data set changing regularly. But we search for the same pattern every time. This is employed in censoring applications, e.g. News Feed. The feed comes in continuous fashion. On this we search if it contains any abusive words. 
  3. Variable Text Base and Variable Pattern - In this pattern, the data keeps changing and so also our queries. This is a common one. 
Solution:1 - Naive or Brute-Force approach
This is one of the simplest solutions and least effective one. In this we are given a string "S" of length "n". And we are given a pattern "P" of length "m". We need to scan through S and find where all we find the pattern P. Print all the indexes in increasing order. This is some what like saying  print all s.indexOf(p). 
e.g. S:   thisishiscar
      P:   hiscar
In this approach, we keep two variables i and j initialized to 0. If the characters at i and j are same, save the position i in another variable k. Increase i and j by till the two characters match. In our case, they match till below.
After this position it does not match. So we have to reset j to 0 and set i to (k+1). This is because we found that the position k does not match, so we have to restart from the next position. 
Definitely this approach is not suggested. Lets derive the time-complexity for this algorithm (for people who are not familiar with Big-O). Here in worst case we compare each character of pattern 'P' to each character of String 'S'. So we have m*n comparisons. Hence the time-complexity is O(m*n). So what it means is if 1 comparison takes say 1 ms, then to search for a pattern of length 5 in a string of length 100, we take 500 ms.  


Alternate Solution:- To address the problem of comparing every character of pattern with every character of text, we have KMP Algorithm. But then again the problem with this algorithm is we need to calculate the fault functions for each pattern.

Solution:2 - Java List Approach
Now lets increase the system behavior slightly. So lets say we are given a list of names where each name contains FirstName and LastName. We have to search for a name (either firstname or lastname). The program should print all names where it matches.
I have used a text file available on the web consisting of 1,51,671 names. Its available at http://stavi.sh/downloads/names.txt. Download this text file to your local system. I have stored it in "C:\\Apps" folder. We will use this file for the remaining approaches.
In this approach, store all the names in an ArrayList. When the user searches for a name "xyz" - go through the entire list. Search if the value in the list starts with "xyz " (notice the space here at end) or ends with " xyz" (notice the space here at the beginning). This is because we want to do exact search.
So what is the time complexity of this approach? Its again same as approach 1. We are checking the entire list and comparing the pattern against each entry. So if we have N words and the length of the longest word is "K". And if the pattern length is "M", then the complexity is O(N*M*K). This is because for each word in the word list we are comparing the pattern O(K*M) and we are doing it for 'N' words.

The program loaded the 151671 words in 77 ms and searched in 94 ms.


Solution:3 - Java List and Map Approach
So how can we increase the processing time of approach 2? Lets say all the names contains only the 26 English alphabets. So instead of keeping one list we will keep a list of size 26. So any first name or last name starting with 'A' will goto list entry 0. And one starting with 'B' goto entry 1. At each entry we will maintain a LinkedHashMap. This map will contain the firstname/surname and the index of the name in the text file. So if we have a name "Anand Chari" at position 4 of the text file then we will store ("Anand", 4) in list entry 0 and ("Chari", 4) in list entry 2. Once this data structure is built, to search we can get the first character of the search key and goto the appropriate list entry. We will search only in this entry and not the other 25 entries.

So we have made the search 25 times faster?? Well yes, but the map factor adds to the cost. And still we are going through all the names which have the specific first character. So the order is O(N*M*K/(d-1)) where d is the character set size. In this case its 26.

The program loaded the 151671 words in 769 ms and searched in 3 ms.


Solution:4 - TRIE
We will introduce TRIE data structure. The basic structure of the TRIE is as shown below

There are 2 ways of implementing a TRIE - as a character array or linked list. In case of character array TRIE we have each node storing a value and a marker indicating if its an end node (double-circle). Along with this it stores children. The children here are again TRIE nodes. The number of children is 'd', which is the character set size. So we could have 26 for english or 256 for unicode character set. So this wastes a lot of space.
To overcome this, we can use the Linked List TRIE. In this the children are not 'd' but only the ones that are used. So we can avoid the additional space requirements. But we have to pay for the lost indexing capability, we have to do sequential access in linked list. We can maintain the linked list sorted for faster access.
In this approach, we will store the firstname/lastname and the occurrence of the name in the file.
Search is very simple, we will have to do a depth first search. For instance to search for "acd", we goto node    "a" and from there we go down to "b". From here we goto "c". Then we go down to "d". So we only searched 4 nodes. Then what is the time-complexity of this algorithm? O(m*d). This is because in worst-case we might have to traverse d-nodes for each character of the pattern. This we do for each of the 'm' characters.

The program loaded the 151671 words in 1 sec and searched in 0 ms. As you notice, the build-time is increased but the search time is really fast.


The problem with TRIES is the space requirement. It stores characters at each node. So the number of nodes is (d*N), even with linked list approach. That means we would need 151671*26 nodes just for English alphabets. To address this we have PATRICIA (Practical Algorithm to Retrieve Information Coded in Alphanumeric) TRIE. Here if we have N keys the total nodes would be (2*N - 1) always.
Its a difficult algorithm to explain, but I will try my best to attempt. It uses the binary representation of string.
Ascii value of 'A' is 65 which is represented in binary as "0100 0001".
So if we have a String s = "AC", we can represent it as below
0100 0001 0100 0011

Lets say we want to add 3 strings "THAT", "THEM" and "THEY" to the tree.

THAT     0101 0100 0100 1000 0100 0001 0101 0100
THEM    0101 0100 0100 1000 0100 0101 0100 1101
THEY     0101 0100 0100 1000 0100 0101 0101 1001

So the words THAT and THEM differ at position 21 (left to right). Similarly, THEM and THEY differ at position 27. So we make a binary tree like this

Notice that to store 3 keys we have used 5 nodes (2*3 - 1).

In "THAT" the 21st bit is 0 that is why it is on the left. For "THEM" and "THEY" the bit is 1 and thats why they they are on right. The words "THEM" and "THEY" differ at bit 27. THEM has 0 at the position and THEY has 1.

So we build a tree like this and the search is very simple in this case. There are several other reasons for choosing this data structure. We can do lot more things like proximity search, approximate search, regex search, substring search, and so on.

The program loaded the 151671 words in 2 sec and searched in 1 ms. The increase in build time is due to the fact that we are using string manipulation functions to convert String to Binary. I haven't optimized it yet.

So what did we gain in this?? A lot less memory requirement and very fast search. 


No comments:

Post a Comment