Monday, November 12, 2012

Spell Check - Minimum Distance Calculation

Wherever there is a manual user input, there is a possibility of a typo error. As per research its been found that possibility of the first letter being typed wrongly is very remote.
Errors are classified into the following types

  1. Extra character : "inpuut" instead of "input"
  2. Missing character: "publsh" instead of "publish"
  3. Wrong character: "prebiew" instead of "preview"

We call each of the above errors as 1 distance.  Now given 2 strings, its possible to calculate the distance.
For eg. "Apple" and "Applets" distance is 2 as it is 2 missing characters.
"Apple" and "Tablet" distance is 4 as it is 3 Wrong characters (A instead of T, p instead of a, p instead of b),   and 1 missing character (t missing).

There is an algorithm to calculate this distance, found by a Russian mathematician Levenshtein. This algorithm is named according to him. So if the length of string1 is 'n' and length of string2 is 'm' then the algorithm runs in O(n*m). First we plot the string1 and string2 characters in a matrix format and then start filling up the values.

Calculation starts from (1,1) and the result is stored at the right bottom (red).

at (1,1) we have 't' and 'a'. they do not match so consider the values (0,0), (0,1) and (1,0) and take the minimum value and add 1. So the numbers are 0, 1, and 1. The minimum is 0 add 1 to it. We get 1.

at (2,1) we have 'a' and 'a'. they match so take the value at (1,0) which is the diagonally above the current element.

The minimum distance is stored at (6,5) which is 4


The same algorithm can be extended to check multiple strings. It can be applied to find all matching words in a dictionary. Some of the spell checkers that we see day-today use this to find the nearest matching word in a given dictionary. While doing so, the average distance they would check against is 2 (as per research).
But if they have to prepare this matrix for each string and compare it in O(n*m) its a time consuming thing. So there is a revised algorithm which runs in O(n log n + m log m). I am planning to put up this algorithm in the next sessions where I will be posting on the algorithm behind Google Docs.

Solution

No comments:

Post a Comment

UA-36403895-1