Sunday, February 28, 2010

Dictionary structure: Switch between sorted and unsorted leading digrams

Computing the jump value for each leading digram has a cost (see my post of Feb 17), so when the number of terms sharing the same leading digram is too small, it is more expensive to compute the jump to the next leading digram than to compare each term one by one.  This conclusion brings two ideas:

- The dictionary should mix two structures: terms sorted and grouped by their leading digram when the number of terms sharing a digram is large, and terms unsorted and ungrouped when this number is small.
- A calibration routine will compute the threshold value used to decide when to switch between the two structures. This calibration will run only the first time the program is executed (a rough sketch follows).
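
A rough sketch of such a calibration in C (hypothetical code, not the actual routine): it measures n, the cost of one digram comparison, and m, the cost of one group-jump computation, and prints the ratio m/n. Digram groups smaller than that ratio should be left in the unsorted/ungrouped part.

#include <stdio.h>
#include <time.h>

static double ns_per_op(struct timespec a, struct timespec b, long iters)
{
    return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / iters;
}

int main(void)
{
    const long iters = 10000000;
    volatile int sink = 0;
    volatile char d1 = 't', d2 = 'h', s[4] = {'0', '0', '4', '5'};
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)            /* n: compare two digrams */
        sink += (d1 == 't' && d2 == 'h');
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double n = ns_per_op(t0, t1, iters);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)            /* m: decode a 4-digit S and jump */
        sink += (s[0]-'0')*1000 + (s[1]-'0')*100 + (s[2]-'0')*10 + (s[3]-'0');
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double m = ns_per_op(t0, t1, iters);

    printf("n = %.2f ns, m = %.2f ns, switch threshold N > %.1f\n", n, m, m / n);
    return 0;
}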

1- New dictionary structure:

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/wk+1Lk+1\TERMk+1/...
/0/wl+1Ll+1\TERMl+1/wl+2Ll+2\TERMl+2/...EOL
where S1 is the length of the substring  "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"
S2 is the length of the substring  "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn
/0/ marks the end of grouped leading digrams.

The difference from the structure presented in my post of Feb 17 is the /0/ sequence of characters. The value 0 indicates the end of the grouped terms; past this point, a routine has to compare the terms one by one rather than comparing just the leading digram of the first term of each group.
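
For illustration, reusing the made-up terms of my Feb 15 example and adding an ungrouped tail (quasar, ion and muon standing for terms with rare leading digrams):

label_index label_name:k/0045/w106\theory/w211\theobromine/w312\theophylline/0030/w405\heavy/w504\heat/w605\hertz/0/w706\quasar/w803\ion/w904\muon/EOL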

2 -New routines:

To keep the program simple, the grouping of terms with the same leading digram will be done by a third program, independently of the classifier and the learner.

Simplified flowchart:


The new program, named 'smartgroup', will reorganize the dictionaries. It will be triggered after several dictionary updates. It runs independently of Texlexan and Lazylearner.

New search algorithm:
(pseudo code, Python style with goto):

GROUP:
      While not EOL do:
         get length S of the group of digram
         if length S is 0 goto BULK
         compute the index of the next group
         get the digram of the first term
         if digram of the first term is the digram searched goto TERM
      else: goto NO_FOUND
TERM:
      while not EOL do:
         get length L of the term
         compute the index of the next term
         get the term
         if term is the term searched goto FOUND
         if index of the next term is at the end of the group goto GROUP
      else: goto NO_FOUND

BULK:
      while not EOL do:
         get the length L of the term
         compute the index of the next term
         get the digram
         if digram is the digram searched:
             get the term
             if term is the term searched goto FOUND
      else: goto NO_FOUND

FOUND:
      get the weight
      return the weight ( exit of this routine )

NO_FOUND:
      return 0 ( exit of this routine )
This algorithm is divided into three parts:
- GROUP: jumps from digram group to digram group as long as the first digram read does not match the searched digram.
- TERM: jumps from term to term inside a group of same leading digram as long as the term does not match the searched term.
- BULK: jumps from term to term as soon as the value of S is zero. (A C rendering of the whole search is sketched below.)
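
Since TexLexAn is written in C, here is a minimal C rendering of the GROUP/TERM/BULK search. It is only a sketch under my reading of the format, not the actual code: I assume a 4-digit S field (as in the Feb 15 example), a 1-character weight, a 2-digit term length, a literal /0/ marker and '\n' (or '\0') as EOL. Unlike the pseudocode, TERM returns "not found" directly instead of going back to GROUP, since a sorted digram cannot reappear in a later group.

#include <string.h>

static int digits(const char *p, int n)          /* decode n decimal digits */
{
    int v = 0;
    while (n--)
        v = v * 10 + (*p++ - '0');
    return v;
}

/* Returns the weight of `term` in the class line `s`, or 0 if absent.
 * Assumes the class label contains no '/' and `term` has >= 2 characters. */
int lookup(const char *s, const char *term)
{
    size_t tlen = strlen(term);
    const char *p = strchr(s, '/');              /* '/' before the first S  */
    if (!p)
        return 0;

    /* GROUP: jump from digram group to digram group. */
    while (p[1] && p[1] != '\n' && !(p[1] == '0' && p[2] == '/')) {
        int S = digits(p + 1, 4);                /* length of this group    */
        const char *end = p + S + 4;             /* its trailing '/'        */
        if (p[10] == term[0] && p[11] == term[1]) {
            /* TERM: compare the terms of this group one by one. */
            for (const char *q = p + 5; q < end; ) {
                int L = digits(q + 2, 2);        /* 2-digit term length     */
                if ((size_t)L == tlen && memcmp(q + 5, term, tlen) == 0)
                    return q[1] - '0';           /* FOUND: return the weight */
                q += 5 + L;                      /* jump to the next term   */
            }
            /* Digrams are sorted, so the term cannot appear further on.    */
            return 0;                            /* NO_FOUND                */
        }
        p = end;                                 /* '/' before the next S   */
    }
    if (!(p[1] == '0' && p[2] == '/'))
        return 0;                                /* no "/0/" marker found   */

    /* BULK: ungrouped tail, check the digram of every remaining term. */
    for (p += 2; p[0] == '/' && p[1] && p[1] != '\n'; ) {
        int L = digits(p + 2, 2);
        if (p[5] == term[0] && p[6] == term[1]
            && (size_t)L == tlen && memcmp(p + 5, term, tlen) == 0)
            return p[1] - '0';                   /* FOUND                   */
        p += 5 + L;
    }
    return 0;                                    /* NO_FOUND                */
}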


Speed optimization (multi-core processor):
The computation of the index of each group, the computation of the index of each term, and the comparison of digrams or terms can be done in parallel. The code must eliminate the race conditions (in case the comparison is faster than the computation of the indexes), as sketched below.
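
One possible shape of that parallelism with POSIX threads, reusing the layout assumptions of the sketch above. Spawning one thread per step is only to keep the example short: pthread_create costs far more than the few additions it hides, so a real version would keep a single worker thread alive and hand it work.

#include <pthread.h>

struct jump {
    const char *p;          /* '/' before the 4-digit group length S  */
    const char *next;       /* '/' before the next group (the result) */
};

static void *compute_next(void *arg)
{
    struct jump *j = arg;
    int S = (j->p[1] - '0') * 1000 + (j->p[2] - '0') * 100
          + (j->p[3] - '0') * 10   + (j->p[4] - '0');
    j->next = j->p + S + 4;
    return NULL;
}

/* One step of the GROUP loop: returns 1 if the leading digram matches. */
int group_step(const char **pp, const char *term)
{
    struct jump j = { *pp, NULL };
    pthread_t worker;

    pthread_create(&worker, NULL, compute_next, &j); /* compute the jump... */
    int match = ((*pp)[10] == term[0] &&             /* ...while comparing  */
                 (*pp)[11] == term[1]);              /* the leading digram  */
    pthread_join(worker, NULL);                      /* removes the race on */
    *pp = j.next;                                    /* j.next              */
    return match;
}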

The next post will be about the robustness of this structure of data.

Wednesday, February 17, 2010

Improved dictionary structure and Digrams statistics

In my previous post I presented the idea of sorting the terms by their leading digram and grouping the terms with the same leading digram into a substring whose length is stored:

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/wk+1Lk+1\TERMk+1/EOL
where S1 is the length of the substring  "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"
S2 is the length of the substring  "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn

The advantage of this structure is that when the leading digram of a dictionary term and that of the searched term do not match, we skip directly to the next leading digram group.


Now, the question is "What is the efficiency of this solution?"

First, we need to know the frequency of leading digrams. The table below gives the frequencies for English words (extracted from http://www-math.cudenver.edu/~wcherowi/courses/m5410/engstat.html ):

TABLE 1 - Order and frequency of leading digrams (in %)
  TH  3.15   TO  1.11   SA  0.75   MA  0.56
  HE  2.51   NT  1.10   HI  0.72   TA  0.56
  AN  1.72   ED  1.07   LE  0.72   CE  0.55
  IN  1.69   IS  1.06   SO  0.71   IC  0.55
  ER  1.54   AR  1.01   AS  0.67   LL  0.55
  RE  1.48   OU  0.96   NO  0.65   NA  0.54
  ES  1.45   TE  0.94   NE  0.64   RO  0.54
  ON  1.45   OF  0.94   EC  0.64   OT  0.53
  EA  1.31   IT  0.88   IO  0.63   TT  0.53
  TI  1.28   HA  0.84   RT  0.63   VE  0.53
  AT  1.24   SE  0.84   CO  0.59   NS  0.51
  ST  1.21   ET  0.80   BE  0.58   UR  0.49
  EN  1.20   AL  0.77   DI  0.57   ME  0.48
  ND  1.18   RI  0.77   LI  0.57   WH  0.48
  OR  1.13   NG  0.75   RA  0.57   LY  0.47
Table 1 gives the 60 most frequent leading digrams, which together cover about 55% of English words.
 

Now,
if n is the computation cost of comparing two digrams, N is the number of terms of a class sharing the same leading digram, and m is the computation cost of the new index pointing to the next digram group,
then the new structure is interesting if m < N * n, i.e. m/n < N

What are the values of N, m and n?
The values of m and n depend on the processor, the programming language and the code optimization, but we can estimate N for each digram pretty easily.
One class can have about 10,000 terms.
If we make the naïve assumption that all combinations are possible, the 26 letters give 676 digrams; with the second naïve assumption that the words are evenly distributed, the number of words per digram is 10,000/676 ≈ 14.8.
We conclude (naïvely) that the index computation must not be more than about 15 times slower than the digram comparison for this dictionary structure to be interesting.
In fact, the words are not evenly distributed: table 1 shows that the leading digram 'TH' represents 3.15% of English words, so such a class will contain about 315 terms beginning with 'th' (this is the best case). The worst case in our table, 'LY', represents 47 terms. The problem is that our table covers only 55% of the words. Of course, it would be better to find a complete table, but we can continue with a rough approximation:
If we assume there are 676 possible digrams and table 1 covers the first 60 of them, then the remaining 45% of the words are distributed over the other 616 digrams. The average frequency is 0.45/616 ≈ 0.00073 (0.073%), so for a class of 10,000 terms the average is N ≈ 7.3 words per digram.
This result suggests that the new structure will be efficient for the most frequent leading digrams (roughly the first 60 to 100) and penalizing for the long tail of rare digrams.
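
A tiny sanity check of these rough numbers (illustrative figures only, reusing the values above):

#include <stdio.h>

int main(void)
{
    const double terms_per_class = 10000.0;
    const double freq_th   = 0.0315;         /* most frequent leading digram  */
    const double freq_ly   = 0.0047;         /* least frequent one in table 1 */
    const double freq_rare = 0.45 / 616.0;   /* average of the remaining ones */

    printf("N(th)   = %.0f terms\n", terms_per_class * freq_th);
    printf("N(ly)   = %.0f terms\n", terms_per_class * freq_ly);
    printf("N(rare) = %.1f terms\n", terms_per_class * freq_rare);
    /* The grouped structure is worth it only for digrams where m/n < N. */
    return 0;
}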

The new idea is to switch from the new structure (and algorithm) to the old structure (and old algorithm) once past the most frequent digrams.

Monday, February 15, 2010

Speedup strings matching

In the previous post, I presented a new structure of the dictionary that will improve the classification speed by about 40%. In spite of this improvement, the classification will remain too slow as the dictionaries grow.  In this post, I present another solution that is theoretically faster.

1 - The objectives

- Find any alphanumeric term in the dictionary and get its weight and its class. A term can belong to several different classes and have different weights.
- The dictionary must be easily viewable and editable with a basic text editor.
- The dictionary must be robust enough to be repaired in case of data corruption.

2 - Solution: Dictionary structure

   I proposed this structure in my previous post:
label_index label_name:K/W1L1\TERM1/W2L2\TERM2..../WiLi\TERMi/EOL
where Wi is the weight and Li the length of TERMi.
  Li is used to compute the position of the next TERM to compare with the searched term, so the comparison is limited to TERMi and the searched term.
  The proposed algorithm sequentially gets the length Li, compares TERMi with the searched term, computes the index of TERMi+1, and jumps to TERMi+1.

 3 - Optimization

The comparison operation and the new index computation can be treated in parallel pretty easily.


let index = position of the first slash '/'           /* start of the first /WLL\TERM entry */
do
{
    thread                                             /* computes the index of the next entry */
    {
        let length = val(s1[index+2], s1[index+3]);    /* 2-digit length of TERMi */
        index2 = index + length + 5;
    }

    if (integer)(s1[index+5], s1[index+6]) equals (integer)(s2[0], s2[1])
    {
        wait for end of thread;                        /* length is needed below */
        if s1[index+5, index+5+length] equals s2
        {
            let weight = val('0', s1[index+1]);        /* weight char after the '/' */
            return weight;
        }
    }
    wait for end of thread;                            /* no race on index2 */
    index = index2;
}
while s1[index+1] does not equal EOL;

Note: val() converts 2 chars (0...F) into an integer (0...255)
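
One possible implementation matching this note (hexadecimal digits assumed):

static int hexval(char c)                      /* '0'..'9', 'A'..'F', 'a'..'f' */
{
    return (c >= '0' && c <= '9') ? c - '0' : (c & ~0x20) - 'A' + 10;
}

static int val(char hi, char lo)               /* two hex chars -> 0..255 */
{
    return hexval(hi) * 16 + hexval(lo);
}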

In the algorithm above, a worker thread computes the index of the next substring TERMi+1 while the main flow compares the current substring TERMi with the searched term s2.
The instruction (integer)(s1[index+5],s1[index+6]) equals (integer)(s2[0],s2[1]) compares the first 2 bytes of TERMi with the first 2 bytes of the string s2. It is a very fast operation between two integers, and it is interesting because the probability (*) that this comparison returns true is low, so the slower string comparison "if s1[index+5, index+5+length] equals s2" will run rarely.

(*) There is a low probability that the first two letters of a word match the first two letters of another word. Explanation:
  English words have leading digrams with these frequencies:
th 3.15%  he 2.51%  an 1.72%  in 1.69%  er 1.54%  re 1.48%  es 1.45% ...
Consequently, 3.15% of the terms in the dictionary will start with "th", and there is a 3.15% chance that the searched term s2 starts with "th" too.
Hence, the probability that a given dictionary term and s2 both start with "th" is only 0.0315 * 0.0315 ≈ 0.001, or 0.1%.

Finally, there is a very low probability that we will have to compare the characters following the digrams. Because there is about a 99.9% chance that the digrams will not match, and because the two bytes of the digrams are compared simultaneously (a single integer comparison), we can consider that we have roughly an O(m+n) string searching algorithm. The limitations are that each letter must be coded on one byte (plain ASCII) and that terms of length < 2 are excluded.

4 - Sorting

Terms in each class of the dictionary have different probabilities of occurring. Intuitively, if the terms with the highest probabilities are at the beginning of the string s1 and the terms with the lowest probabilities are at the end of s1, and if, of course, we scan s1 from left to right, then we have a better chance of finding the term quickly.  The weight of each term represents the frequency of the term in its class, so it is enough to sort the terms of each class by their weights. Intuitively, though, we can expect the gain to be small, because a significant term cannot strongly belong to all the classes of the dictionary.

Other solution:

Sorting the terms of the dictionary by the inverse of the frequency of their leading digram is probably more interesting. A term in s2 with a frequent digram will be found faster than a term with a rare digram. We can increase the speed significantly if we skip all the terms with the same leading digram as soon as the first test shows the digrams do not match. To do that, we just need to know the length of the substrings of s1 containing the same leading digram.


Structure of the dico:
label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/....
where S1 is the length of the substring  "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"
S2 is the length of the substring  "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn


Example:   
/0045/w106\theory/w211\theobromine/w312\theophylline/0030/w405\heavy/w504\heat/w605\hertz/....

If we search for the weight of the word "hertz", the first digram comparison (th of theory with he of hertz) will not match, so we jump directly to the next digram group (heavy, heat, hertz). The advantage is that we do only one comparison per digram group until the digrams match.

Now, we have just to imagine a smart algorithm to do this job efficiently.

Friday, February 12, 2010

Dictionaries structure

  The structure of the dictionaries is very simple. Because TexLexAn is essentially experimental, the dictionary structure was designed to be easily viewable/editable with any basic text editor and to work without complication with the C function strstr(). The consequence of this choice is a poor search speed. The classification becomes particularly slow as the dictionaries grow in number of classes and number of words.

  The idea is to improve the structure of the dictionary without losing the ease of viewing and editing it.

The current structure of the dictionaries 'keyworder.lan.dicN' is a set of class labels, where each class is defined by a single line of any length:

j class_label:kj/w1\n-gram1/....../wi\n-grami/
where j is the class index, kj is the constant of the class j, wi is the weight of the n-gram i.

The function strstr(s1,s2) is used to search for a term (n-gram) inside each line of the dictionary.
The searched term is delimited with one backslash at the beginning and one slash at the end, so the string s1 contains the line "j class_label:kj/w1\n-gram1/....../wi\n-grami/" and the string s2 contains the searched term: "\searched n-gram/"
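
For illustration, a minimal sketch of this lookup (the class line and the term are made up):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *line = "3 physics:12/8\\powerful/5\\system/9\\thermal/";
    char key[64];

    snprintf(key, sizeof key, "\\%s/", "thermal");   /* builds "\thermal/"  */
    char *hit = strstr(line, key);
    if (hit)
        printf("weight = %c\n", hit[-1]);            /* the char before '\' */
    return 0;
}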

Of course, it is a very simple solution, but it allows a simple and robust algorithm and gives the possibility to search for the root of words very easily. For example, the term 'power' delimited as "\power" will be found in the dictionary line "/w1\powerful/.....". This basic solution requires only a fast stemming operation on the searched term.

The biggest drawback of strstr() in our case is that the backslash '\' of s2 is searched for inside the n-grams and weights of s1.   Example: for the string s2 "\thermal/" and the string s1 "/8\powerful/5\system/9\thermal/", the backslash '\' will be searched all along "powerful/5" and "system/9", which is a waste of time because we know it cannot be present in the n-grams and weights.

One solution is to store the length of each term/n-gram and to use this length to compute the position of the next term/n-gram.

Example of structure:
j class_label:kj/w1l1\n-gram1/....../wili\n-grami/   where the length li of each term is coded on 2 digits.
Our previous string s1 becomes:
/808\powerful/506\system/907\thermal/

Another example of structure:
j class_label:kj/w1\l1n-gram1/....../wi\lin-grami/   where the length li of each term is coded on 2 digits too.

Our previous string s1 becomes:

/8\08powerful/5\06system/9\07thermal/


For both examples, the term "powerful" has a length of 8 characters, so we know the next term is at position i = i+8+5 (the value 5 accounts for the sequence /808\ of the first structure or /8\08 of the second structure).

For the second example the search algorithm (simplified) could be:

let index=0;
while s1[index] does not equal '\'                /* skip the class label and kj */
    let index=index+1;

do
{
    let length=val(s1[index+1],s1[index+2]);      /* 2-digit length of the term */
    if s1[index+3, index+3+length] equals s2
    {
        let weight=val(s1[index-1]);              /* weight char before the '\' */
        return weight;
    }
    else
        let index=index+length+5;                 /* jump to the next '\' */
}
while s1[index-1] does not equal EOL;

Note 1: The first loop is not required if we take care to store the length of the class label.

Note 2: If we want to keep the easy root or stem search, it is better to choose the first example as the structure of the dictionary: "/808\powerful/506\system/907\thermal/". The algorithm described above stays almost the same, but we keep the possibility to search for the root of a word, for example "\power".
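
For completeness, a sketch of the same search adapted to the first structure (the function name and the '\n' EOL convention are my assumptions, not the TexLexAn code):

#include <string.h>

/* Returns the weight of s2 in the line s1 ("/wll\term/..."), or 0 if absent. */
int search_weight(const char *s1, const char *s2, int s2len)
{
    const char *p = strchr(s1, '/');              /* first '/' after the label */
    while (p && p[1] != '\n' && p[1] != '\0') {
        int len = (p[2] - '0') * 10 + (p[3] - '0');    /* 2-digit term length  */
        if (len == s2len && memcmp(p + 5, s2, s2len) == 0)
            return p[1] - '0';                    /* weight digit after '/'    */
        p += len + 5;                             /* '/' of the next entry     */
    }
    return 0;
}

Dropping the exact-length test and comparing only the first s2len characters gives the root/stem search mentioned above ("\power" matching "powerful").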

Search speed and Gain:

The simple search solution strstr(s1,s2) costs in the worst case L1 + N1*L2 byte comparisons, where L1 is the length of s1, N1 is the number of '\' (equivalent to the number of terms in s1) and L2 is the length of s2 (L1 comparisons to locate each '\' in s1, plus up to L2 comparisons at each of the N1 candidate positions).

The algorithm given above decreases the search cost to N1*L2 comparisons, but requires computing an index based on the length of each term of the dictionary.
We can estimate that the computation of the index costs the equivalent of 5 comparisons, so the total cost of the algorithm is 5 * N1 + N1 * L2, or N1 * (5 + L2), still in the worst case.

The gain for the dictionary of single words (unigrams) is not very large: if we consider that the average word length is 8 characters and that 3 characters are used to delimit the weight and the word ( .../w\term... ), then L1 = ( 8 + 3 ) * N1 ;
in consequence the gain is just ( 8 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 ) ,
simplified:   Gain = 6 * N1.

The gain becomes more interesting for the digram and trigram dictionaries that TexLexAn also uses. For instance, the gain for the digram "word1-word2" dictionary can be estimated at:
( 17 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 )  => Gain = 15 * N1.
The gain for the trigram "word1-word2-word3" dictionary can be estimated at:
( 26 + 3 ) * N1 + N1 * L2 - ( 5 * N1 + N1 * L2 )  => Gain = 24 * N1.

In practice, if we have a text of 1,000 filtered words and a small dictionary of 1,000,000 terms (200 classes of 5,000 terms), then the maximum gains are 6,000,000,000; 15,000,000,000 and 24,000,000,000 comparisons for the unigram, digram and trigram dictionaries.  If the comparison routine of 2 unsigned bytes takes 1 ns, then the gains are 6 s, 15 s and 24 s.

Concerning N1 * L2 + 5 * N1: with N1 equal to 1,000,000 terms and L2 equal to 8, 17 and 26 chars for unigrams, digrams and trigrams, our text of 1,000 filtered words will be processed in:
1,000 * ( 1,000,000 * 8 * 10⁻⁹ + 1,000,000 * 5 * 10⁻⁹ ) = 13 s
1,000 * ( 1,000,000 * 17 * 10⁻⁹ + 1,000,000 * 5 * 10⁻⁹ ) = 22 s
1,000 * ( 1,000,000 * 26 * 10⁻⁹ + 1,000,000 * 5 * 10⁻⁹ ) = 31 s


Of course, these results are for the worst case, where no term of the document is found in the dictionaries and, furthermore, every compared pair of terms differs only on the last character. This situation is practically impossible, but it gives an idea of the search efficiency of the algorithm.

We can see that the new search algorithm improves the search speed by about 40%.

Because it will be exceptional for the searched term and a dictionary term to differ only on their last character, we can assume that on average only about the first half of each term will be compared, so the search duration can be divided by 2.
So the search duration of 1,000 terms in dictionaries of 1,000,000 terms can be estimated at:
Unigrams:  6.5s
Digrams:  11s
Trigrams: 15.5s

Note: The classification algorithm of TexLexAn is based on the unigram, digram and trigram searches, so we can estimate that the classification of a text containing 1,000 words against our small dictionaries of 1,000,000 terms will take at least 33 seconds, which is still pretty long!

Conclusion:
The new structure of the dictionaries will improve the classification speed significantly; a speedup of about 40% can be expected. However, the classification will remain too slow, particularly as the dictionaries grow in number of classes and terms per class. Eventually, a more sophisticated structure and algorithm will have to be developed.

Sunday, February 7, 2010

Estimate Bc

 I explained in the last post that the constant Bc depends on the size of the training set; more precisely:
   Bc = Log(P(C)/P'(C)).

Because P(C) = Nc/Nt and P'(C) = 1 - Nc/Nt,
we have Bc = Log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt is the total number of documents, all classes included.

The sizes of the documents are very different, so it seems more correct to use the number of words rather than the number of documents; we then have this relation:

Bc = Log(Wc/(Wt-Wc)) , where Wc is the number of words in documents of class C and Wt is the number of words all classes included.

This second relation is better but is still not perfect because it does not take into consideration the updating frequency of each class. A better solution is a combination of the two relations above:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) / 2)

Wc and Wt are the numbers of words in the unfiltered documents, but not all words of the documents participate in the classification. Furthermore, each class of the dictionary does not contain the same number of words. Depending on the lexical richness of the class and on the training set used, the number of N-grams in a dictionary class may vary a lot. Intuitively, we can say that the probability of classifying a document in the class C increases when the number of N-grams present in the dictionary class C is high.

A better estimate of Bc should probably include the size of each class of the dictionary. Below is an average of the three ratios:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) * Gc/(Gt-Gc) / 3), where Gc is the number of N-grams in the class C of the dictionary and Gt is the total number of N-grams in the dictionary.
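
A hedged sketch of this estimate (the variable names follow the post, not the actual Lazylearner code; Log is taken here as base 10, the base being a scaling choice that must stay consistent with the one used for the weights):

#include <math.h>
#include <stdio.h>

double estimate_bc(double Nc, double Nt, double Wc, double Wt,
                   double Gc, double Gt)
{
    double doc_ratio  = Nc / (Nt - Nc);    /* documents of class C vs. the rest */
    double word_ratio = Wc / (Wt - Wc);    /* words of class C vs. the rest     */
    double gram_ratio = Gc / (Gt - Gc);    /* N-grams of class C vs. the rest   */
    return log10(doc_ratio * word_ratio * gram_ratio / 3.0);
}

int main(void)
{
    /* purely illustrative numbers */
    printf("Bc = %f\n", estimate_bc(120, 2000, 90000, 1500000, 8000, 200000));
    return 0;
}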

Saturday, February 6, 2010

Improving the classification model

We saw in the previous post that the Naive Bayes model is a linear classifier in the log space: 
        Score= ∑ Wci*Xi + Bc

The weight Wci of each term i in the class C is estimated during the training of the classifier. This is the job of the program Lazylearner.

The term Bc is logarithmically proportional to the number of documents (*) of class C used to train the classifier:   Bc ~ Log(Nc/(Nt-Nc))  (Nc is the number of documents of class C, Nt the total number of documents)

(*) The documents are assumed to have the same size.
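
A quick sketch of this linear scoring rule (illustrative names only; Xi is taken here as the count of term i in the document):

/* Score of one class c: sum of Wci*Xi over the terms, plus the constant Bc. */
double class_score(const double *w_c, const double *x, int nterms, double b_c)
{
    double score = b_c;
    for (int i = 0; i < nterms; i++)
        score += w_c[i] * x[i];
    return score;   /* the document gets the label of the highest-scoring class */
}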

Because we just look for the highest score to assign the class label to the document, we can ignore Bc if we take care to train the classifier with almost the same number of documents for each class. Unfortunately, it is generally impossible to train the classifier evenly. The consequence of training the classifier with more documents for one class than for the others is an overemphasis of the classes with the largest number of training documents.

The new version of texlexan ( pack 1.47 ) tries to compensate for the size inequalities in the training set. The model used to classify the documents includes the constant Bc; in consequence, the dictionaries are modified and completed with these constants (one constant for each class). These constants are computed by Lazylearner from the size (number of words) of the documents used to train the classifier.

The number of words was chosen rather than the number of documents because the sizes of the documents are often very unequal.

Note: The dictionaries keyworder.'lan'.dic'N' have changed but stay compatible with the previous versions.

The new package is available here:
http://sourceforge.net/projects/texlexan/files/