Implement a basic web search. It should load a document list into memory. Each document contains text data of variable length. When the user searches for a particular text, it should find all matching patterns for the complete word list (in the same sequence) he entered.
For. e.g
If my documents are -
"where is the house",
"my house is where the mountain is",
"it is my house",
"my dog is a pet",
"where is the horse"
Conclusion
As always, this is just a prototype i came up with and it can have a lot of additional features like document ranking, suggestion, completion, etc. In an actual web search engine, there would be a complex algorithm and processing involved.
For. e.g
If my documents are -
"where is the house",
"my house is where the mountain is",
"it is my house",
"my dog is a pet",
"where is the horse"
And I search for - "where is the house". The exact pattern should be matched. It should only return "where is the house". It should not return the 2nd string "my house is where the mountain is", even though it contains all the 4 words (where, is, the, house) in a jumbled fashion.
The user should be able to search for substring as well e.g. "my house". It will return "my house is where the mountain is" and also "it is my house".
Design:
If we did not have to support substring search, it could have been a scenario for storing key-value pair. 
With this requirement, we can only implement it with a Tree or a TRIE. I have choosen a Binary Search Tree since I can search for words in Log(N). 
Read documents one by one. For every document read the words. Check if the word is in the BST. If not add a node to the BST with value as the word. The index for it would be a linked list with one key-value pair (key is document-id and value is word-id in the document). If there is a node already with the word, then append the key-value index to the end of the linked-list. This way, the document1's index would be stored before the document-2's index. 
Search - "x y z".
First search for word "x". Get the indexes for the word. Lets say we got (1,4), (3,9). Then search for word "y". If it returned (1,5), (2,1), (3,11). Then we are only interested in document 1. Since neither 3 nor 2 fit in our criteria. It is because in document 1, word x was at position 4 and word y is at position 5. We continue the same with word "z". But this time our focus is only on index (1,5).
Conclusion
As always, this is just a prototype i came up with and it can have a lot of additional features like document ranking, suggestion, completion, etc. In an actual web search engine, there would be a complex algorithm and processing involved.
Solution
 
No comments:
Post a Comment