Friday, February 12, 2010

Dictionary structure

  The structure of the dictionaries is very simple. Because TexLexAn is essentially experimental, the dictionary structure was designed to be easily viewable and editable with any basic text editor and to work without complication with the C function strstr(). The consequence of this choice is a poor search speed. The classification becomes particularly slow when the dictionaries grow in number of classes and number of words.

  The idea is to improve the structure of the dictionary without losing the ease of viewing and editing it.

The current structure of the dictionaries 'keyworder.lan.dicN' is a set of classes, where each class is defined by a single line of any length:

j class_label:kj/w1\n-gram1/....../wi\n-grami/
where j is the class index, kj is the constant of the class j, wi is the weight of the n-gram i.

The function strstr(s1,s2) is used to search a term (n-gram) inside each line of the dictionary.
The searched term is delimited with one backslash at the beginning and one slash at the end: \searched n-gram/ . So the string s1 contains the line "j class_label:kj/w1\n-gram1/....../wi\n-grami/" and the string s2 contains the searched term: "\searched n-gram/".

Of course, this is a very simple solution, but it allows a simple and robust algorithm and makes it very easy to search for the root of a word. For example, the term 'power' delimited as "\power" will be found in the dictionary line "/w1\powerful/.....". This basic solution requires only a fast stemming operation on the searched term.
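
To make this concrete, here is a minimal sketch in C of how such a strstr()-based lookup could look. It is not the actual TexLexAn code: the helper name lookup_weight and the example line (class index, label and constant) are made up, and the weight is assumed to be the digits sitting between the '/' and the '\'.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Sketch of the strstr() based lookup: the searched term is wrapped as
   "\term/" and searched in the whole class line; the weight wi is read
   between the preceding '/' and the '\'. */
static int lookup_weight(const char *line, const char *term)
{
    char s2[64];
    snprintf(s2, sizeof s2, "\\%s/", term);        /* build "\term/" */

    const char *hit = strstr(line, s2);
    if (hit == NULL)
        return -1;                                 /* term not in this class */

    const char *w = hit;
    while (w > line && w[-1] != '/')               /* back up to the '/' */
        w--;
    return atoi(w);                                /* weight wi */
}

int main(void)
{
    const char *line = "3 energy:12/8\\powerful/5\\system/9\\thermal/";
    printf("%d\n", lookup_weight(line, "thermal"));  /* prints 9 */
    return 0;
}

For a root search, one simply builds s2 without the trailing '/', so that "\power" matches the entry "\powerful/".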

The biggest inconvenience of strstr() in our case is that the backslash '\' of s2 is also searched inside the n-grams and weights of s1. Example: for the string s2 "\thermal/" and the string s1 "/8\powerful/5\system/9\thermal/", the backslash '\' will be searched all along "powerful/5" and "system/9"; that is a waste of time because we know it cannot be present in the n-grams and weights.

One solution is to store the length of each term/n-gram and to use this length to compute the position of the next term/n-gram.

Example of structure:
j class_label:kj/w1l1\n-gram1/....../wili\n-grami/   where the length li of each term is coded on 2 digits.
Our previous string s1 becomes:
/808\powerful/506\system/907\thermal/

Another example of structure:
j class_label:kj/w1\l1n-gram1/....../wi\lin-grami/   where the length li of each term is coded on 2 digits too.

Our previous string s1 becomes:

/8\08powerful/5\06system/9\07thermal/


For both examples, the term "powerful" has a length of 8 characters, so we know that the next term is at the position i = i + 8 + 5 (the value 5 accounts for the sequence /808\ in the first structure, or /8\08 in the second structure).
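
To check the encoding, here is a small hypothetical helper pair (append_v1 and append_v2 are not part of TexLexAn; single-digit weights are assumed) that rebuilds the two example strings above from the weight/term pairs:

#include <stdio.h>
#include <string.h>

/* Rebuild the two example class lines, with the weight on 1 digit
   and the length li of each term coded on 2 digits. */
static void append_v1(char *line, int w, const char *t)   /* first structure  */
{
    sprintf(line + strlen(line), "/%d%02zu\\%s", w, strlen(t), t);
}

static void append_v2(char *line, int w, const char *t)   /* second structure */
{
    sprintf(line + strlen(line), "/%d\\%02zu%s", w, strlen(t), t);
}

int main(void)
{
    char v1[256] = "", v2[256] = "";
    const char *terms[]   = {"powerful", "system", "thermal"};
    const int   weights[] = {8, 5, 9};

    for (int i = 0; i < 3; i++) {
        append_v1(v1, weights[i], terms[i]);
        append_v2(v2, weights[i], terms[i]);
    }
    strcat(v1, "/");  puts(v1);   /* /808\powerful/506\system/907\thermal/ */
    strcat(v2, "/");  puts(v2);   /* /8\08powerful/5\06system/9\07thermal/ */
    return 0;
}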

For the second example, the search algorithm (simplified) could be written in C as:

#include <string.h>

/* Simplified search in one class line of the second structure, where s2 is
   the bare searched n-gram (no delimiters). Single-digit weights and
   2-digit term lengths are assumed. Returns the weight, or -1 if absent. */
static int search_term(const char *s1, const char *s2)
{
    size_t index = 0, len2 = strlen(s2);

    /* first loop: skip the class label up to the first '\' (see Note 1) */
    while (s1[index] != '\\')
        index++;

    do {
        /* length li of the current term, coded on 2 digits */
        int length = (s1[index + 1] - '0') * 10 + (s1[index + 2] - '0');

        if ((size_t)length == len2 && strncmp(&s1[index + 3], s2, len2) == 0)
            return s1[index - 1] - '0';   /* single-digit weight wi */

        /* jump to the '\' of the next term: 2 length digits + term + '/' + weight */
        index += length + 5;
    } while (s1[index - 1] != '\0' && s1[index - 1] != '\n');

    return -1;   /* the term is not in this class line */
}

Note 1: The first loop is not required if we take care to store the length of the class label.

Note 2: If we want to keep the easy root or stem search, it is better to choose the first example as the structure of the dictionary: "/808\powerful/506\system/907\thermal/". The algorithm described above stays almost the same, but we keep the possibility to search the root of a word, for example "\power".
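
For completeness, here is a sketch of the same loop adapted to the first structure and to a root search (search_root is a hypothetical helper, not TexLexAn's code; single-digit weights and 2-digit lengths are still assumed):

#include <string.h>

/* Root search in one class line of the first structure "/808\powerful/...":
   the match succeeds as soon as the stored term starts with the root. */
static int search_root(const char *s1, const char *root)
{
    size_t index = 0, lenr = strlen(root);

    while (s1[index] != '\\')              /* reach the '\' of the first term */
        index++;

    do {
        /* in this structure the 2 length digits precede the '\' */
        int length = (s1[index - 2] - '0') * 10 + (s1[index - 1] - '0');

        if (lenr <= (size_t)length && strncmp(&s1[index + 1], root, lenr) == 0)
            return s1[index - 3] - '0';    /* single-digit weight wi */

        index += length + 5;               /* '\' of the next term */
    } while (s1[index - 3] != '\0' && s1[index - 3] != '\n');

    return -1;
}

Called on the line "/808\powerful/506\system/907\thermal/" with the root "power", this function returns the weight 8.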

Search speed and Gain:

The simple search solution strstr(s1,s2) costs, in the worst case, L1 + N1 * L2 byte-by-byte comparisons, where L1 is the length of s1, N1 is the number of '\' (equivalent to the number of terms in s1) and L2 is the length of s2.

The algorithm given above decreases the search cost to N1 * L2 comparisons, but it requires computing an index based on the length of each term in the dictionary.
We can estimate that the computation of the index costs the equivalent of 5 comparisons; the total cost of the algorithm is then 5 * N1 + N1 * L2, or N1 * ( 5 + L2 ), still in the worst case.

The gain for the dictionary of single words (or unigrams) is not very important. If we consider that the average length of a word is 8 characters and that 3 characters are used to delimit the weight and the word ( .../w\term... ), then L1 = ( 8 + 3 ) * N1 ;
in consequence the gain is just ( 8 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 ) ,
simplified:   Gain = 6 * N1.

The gain becomes more interesting for the digram and trigram dictionaries that TexLexAn uses too. For instance, the gain for the digram "word1-word2" dictionary can be estimated at:
( 17 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 )  => Gain = 15 * N1.
The gain for the trigram "word1-word2-word3" dictionary can be estimated at:
( 26 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 )  => Gain = 24 * N1.

Practically, if we have a text of 1,000 filtered words and a small dictionary of 1,000,000 terms (200 classes of 5,000 terms), then the maximum gains are 6,000,000,000; 15,000,000,000 and 24,000,000,000 comparisons for the unigram, digram and trigram dictionaries. If the comparison routine of 2 unsigned bytes takes 1 ns, then the gains are 6 s, 15 s and 24 s.

Concerning N1 * L2 + 5 * N1: if N1 equals 1,000,000 terms and L2 is 8, 17 and 26 chars in the case of unigrams, digrams and trigrams, our text of 1,000 filtered words will be processed in:
1,000 * ( 1,000,000 *   8 * 10^-9 + 1,000,000 * 5 * 10^-9 ) = 13 secs
1,000 * ( 1,000,000 * 17 * 10^-9 + 1,000,000 * 5 * 10^-9 ) = 22 secs
1,000 * ( 1,000,000 * 26 * 10^-9 + 1,000,000 * 5 * 10^-9 ) = 31 secs
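
These figures are easy to double-check; the following throwaway program recomputes the worst-case durations and the gains above under the same assumptions (1 ns per comparison, 5 comparisons per index computation, L1 = ( L2 + 3 ) * N1 for the strstr() solution):

#include <stdio.h>

/* Back-of-the-envelope check of the worst-case figures above. */
int main(void)
{
    const double ns    = 1e-9;           /* cost of one comparison            */
    const long   N1    = 1000000;        /* terms compared per searched word  */
    const int    words = 1000;           /* filtered words in the text        */
    const int    L2[]  = {8, 17, 26};    /* unigram, digram, trigram          */

    for (int i = 0; i < 3; i++) {
        double with_strstr = words * (N1 * (L2[i] + 3.0) + N1 * L2[i]) * ns;
        double with_index  = words * (N1 * 5.0 + N1 * L2[i]) * ns;
        printf("L2=%2d  strstr=%2.0f s  indexed=%2.0f s  gain=%2.0f s\n",
               L2[i], with_strstr, with_index, with_strstr - with_index);
    }
    return 0;     /* indexed: 13, 22, 31 s ; gains: 6, 15, 24 s */
}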


Of course, these results are for the worst case, where none of the terms of the document are found in the dictionaries and, furthermore, the terms only differ on their last character. It is theoretically impossible, but it gives an idea of the search efficiency of the algorithm.

We can see that the new search algorithm improves the search speed by about 40%.

Because it will be exceptional for the searched term and the term in the dictionary to differ only on the last character, we can say that on average only the first half of the term will be compared (*), then the search duration can be divided by 2.
So the search duration of 1,000 terms in dictionaries of 1,000,000 terms can be estimated at:
Unigrams:  6.5s
Digrams:  11s
Trigrams: 15.5s

Note: The classification algorithm of TexLexAn is based on the unigram, digram and trigram searches, so we can estimate that classifying a text of 1,000 words with our small dictionaries of 1,000,000 terms will take at least 33 seconds, which is still pretty long!
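
The same kind of quick arithmetic check applies to the halved durations and their total:

#include <stdio.h>

/* Halved worst-case durations and their total (the "at least 33 seconds"). */
int main(void)
{
    const double worst[] = {13.0, 22.0, 31.0};   /* unigram, digram, trigram */
    double total = 0.0;

    for (int i = 0; i < 3; i++) {
        double half = worst[i] / 2.0;    /* only half of each term compared */
        printf("%.1f s\n", half);        /* prints 6.5, 11.0, 15.5 */
        total += half;
    }
    printf("total: %.0f s\n", total);    /* prints 33 s */
    return 0;
}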

Conclusion:
The new structures of the dictionaries will improve the classification speed significantly: a speedup of about 40% can be expected. However, the classification will remain too slow, particularly as the dictionaries grow in number of classes and terms per class. Finally, a more sophisticated structure and algorithm will have to be developed.
