Monday, February 15, 2010

Speedup strings matching

In the previous post, I presented a new structure of the dictionary that will improve the classification speed by 40%. In spite of this improvement, the classification will stay to slow when the dictionaries will grow up.  In this post, I will present a another solution theorically more performant.

1 - The objectives

- To find any alphanumeric term in the dictionary and to get its weigh and its class. The term can belong several different classes and have different weighs.
- The dictionary must be easily viewable and editable with a basic text editor.
- The dictionary must be robust enough to be repeared in case of data corruption.

2 - Solution: Dictionary structure

   I proposed this structure in my previous post:
label_index label_name:K/W1L1\TERM1/W2L2\TERM2..../WiLi\TERMi/EOL
where Wi is the weigh and Li the length of the TERMi.
  Li is used to compute the position of the next TERM to compare with the searched term, then the comparison is limited to TERMi and the searched term.
  The algorithm proposed sequentially gets the length Li, compare the TERMi with the searched term, compute the index of the TERMi+1, jump to the TERMi+1.

 3 - Optimization

The comparison operation and the new index computation can be treated in parallel pretty easily.

let index= position of the first slash
        let length=val(s1[index+1],s1[index+2]);

    if (integer)(s1[index+3,index+4]) equals (integer)(s2[0,1])
        wait for end of thread;
        if s1[index+3,index+3+length] equals s2
            let weight=val('0',s1[index-1]);
            return weight;
    wait for end of thread;
while s1[index] does not equal EOL;

Note: val() converts 2 chars (0...F) in integer (0...255)

In the algorithm above, the first tread computes the index of the next substring TERMi+1 to compare, while the second tread compares the current substring TERMi with the searched term in s2.
The instruction (integer)(s1[index+3,index+4]) equals (integer)(s2[0,1]) compares the first 2 bytes of the TERMi with the first 2 bytes of the string S2. It is a very fast operation between two integers; it is interesting because the probability (*) is low that the comparison returns true, then the slower strings comparison  "if s1[index+3,index+3+length] equals s2" will run rarely.

(*) Low probability that the first two letters of a word match the first two letters of another word. Explanation:
  English words have leading digrams with these frequencies:
th 3.15%  he 2.51%  an 1.72%  in 1.69%  er 1.54%  re 1.48%  es 1.45% ...
In consequence, 3.15% of the terms in the dictionary will start with "th" and there are 3.15% of chance that the searched term s2 starts with "th" too.
Hence, the probability that the leading digram of s2 matches the leading digrams of any term in the dictionary is only 0.0315 * 0.0315 ~ 0.001 or 0.1%

Finally there is a very low probability that will have to compare the other characters following the digrams. Because there are 99.9% chance that the digrams will not match and because we compare the digrams simultaneously, we can consider that we have an O(m+n) string searching algorithm. The limitation is to code each letter on one byte (plain ascii) and to exclude terms of length < 2.

4 - Sorting

Terms in each class of the dictionary have different probabilities of occuring. Intuitively we understand that if the terms with the highest probabilities are in the beginning of the string s1 and terms with the lowest probabilities are at the end of s1, and if of course, we start looking for the term from the left to the right, then we have a better chance to find the term faster.  The weight of each term represents the frequency of the term in its class. In consequence, it is just needed to sort the terms of each class based on their weights. Intuitively, we can imagine the gain cannot be important because a significant term cannot strongly belongs all the classes of the dictionary.

Other solution:

Sorting terms of dictionary based on the inverse of the frequency of their leading digrams is probably more interesting. A term in s2 with a frequent digram will be found faster than a term with a rare digram. We can significantly increase the speed if we skip all the similar leading diagram as soon as the first test shows they do not match. To do that, we just need to know the length of the substrings of s1 containing the same leading digrams.

Structure of the dico:
label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/....
where S1 is the length of the substring  "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"
S2 is the length of the substring  "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn


If we search the weight of the word "hertz", the first digrams comparison (th of theobromine with  he of hertz) will not match, then we jump directly to the next group of digram (heavy, heat, hertz). The advantage is we do only one comparison for each digram until both digram match.

Now, we have just to imagine a smart algorithm to do this job efficiently.

No comments:

Post a Comment