Wednesday, February 17, 2010

Improved dictionary structure and Digrams statistics

In my previous post I presented the idea of sorting the terms by their leading digrams and regrouping the terms that share a leading digram into a substring whose length we store:

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1.../wi+jLi+j\TERMi+j/S3/.../Sn/.../wkLk\TERMk/EOL

where S1 is the length of the substring "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/",
S2 is the length of the substring "/wi+1Li+1\TERMi+1.../wi+jLi+j\TERMi+j/",
and so on for S3, S4... Sn.

The advantage of this structure is that when the leading digram of the searched term does not match the leading digram of the current group, we can skip directly to the next digram group, as sketched below.
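To make the skip concrete, here is a minimal C sketch of this scan, under assumptions the post does not fix: S is stored as a little-endian 16-bit byte count, and w and L are single bytes, so the leading digram of a group's first term sits 6 bytes after the start of the group (past S and the "/wL\" prefix).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Locate the group whose leading digram is d0,d1, skipping whole
 * groups by their stored length S on a mismatch.  Assumes the layout
 * [uint16_t S]["/" w L "\" TERM ...], with w and L as single bytes. */
const uint8_t *find_digram_group(const uint8_t *p, const uint8_t *end,
                                 char d0, char d1)
{
    while (p + 8 <= end) {
        uint16_t s;
        memcpy(&s, p, sizeof s);       /* S = byte length of the group   */
        if (p[6] == (uint8_t)d0 && p[7] == (uint8_t)d1)
            return p + 2;              /* match: caller scans this group */
        p += 2 + s;                    /* mismatch: jump over the group  */
    }
    return NULL;                       /* digram absent from this class  */
}
```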


Now, the question is "What is the efficiency of this solution?"

First we need to know the frequency of leading digrams; the table below gives the frequencies for English words (extracted from http://www-math.cudenver.edu/~wcherowi/courses/m5410/engstat.html ):

TABLE 1 - Order and Frequency of Leading Digrams (% of English words)
TH  3.15   TO  1.11   SA  0.75   MA  0.56
HE  2.51   NT  1.10   HI  0.72   TA  0.56
AN  1.72   ED  1.07   LE  0.72   CE  0.55
IN  1.69   IS  1.06   SO  0.71   IC  0.55
ER  1.54   AR  1.01   AS  0.67   LL  0.55
RE  1.48   OU  0.96   NO  0.65   NA  0.54
ES  1.45   TE  0.94   NE  0.64   RO  0.54
ON  1.45   OF  0.94   EC  0.64   OT  0.53
EA  1.31   IT  0.88   IO  0.63   TT  0.53
TI  1.28   HA  0.84   RT  0.63   VE  0.53
AT  1.24   SE  0.84   CO  0.59   NS  0.51
ST  1.21   ET  0.80   BE  0.58   UR  0.49
EN  1.20   AL  0.77   DI  0.57   ME  0.48
ND  1.18   RI  0.77   LI  0.57   WH  0.48
OR  1.13   NG  0.75   RA  0.57   LY  0.47
Table 1 gives the 60 most frequent leading digrams, which together represent about 55% of the words.
 

Now, if n is the computation cost of comparing two digrams, N is the number of terms in a class that share the same leading digram, and m is the computation cost of the new index pointing to the next digram group, then the new structure is interesting if m < N * n, i.e. m/n < N.

What are the values of N, m and n?
The values of m and n depend on the processor, the programming language and the code optimization, but we can estimate N for each digram pretty easily.
One class can have about 10,000 terms.
If we make the naïve assumption that all combinations are possible, the 26 letters give 26 × 26 = 676 digrams; with the second naïve assumption that the words are evenly distributed, the number of words per digram is 10,000/676 ≈ 14.79.
We conclude (naïvely) that the index computation must not be more than 14.79 times slower than the digram comparison for this dictionary structure to be interesting.
In fact, the words are not evenly distributed: Table 1 shows that the leading digram 'TH' represents 3.15% of English words, so this class will contain about 315 terms beginning with 'th' (this is the best case). The worst case in our table, 'LY', will represent 47 terms. The problem is that our table covers only 55% of the words. Of course, it would be better to find a complete table, but we can continue with a rough approximation:
If we assume 676 possible digrams, Table 1 gives the 60 most frequent ones, so the remaining 45% of the words are distributed over the other 616 digrams. The average frequency is 0.45/616 ≈ 0.00073 (0.073%), so for the class of 10,000 terms the average is N ≈ 7.3 words per digram.
This result suggests that the new structure will be efficient for the most frequent leading digrams (roughly the first 60~100 digrams) and will be a penalty for the large group of rare digrams.
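As a quick sanity check of these estimates, here is a small C program that recomputes the class sizes N used above, for a class of 10,000 terms and the Table 1 frequencies:

```c
#include <stdio.h>

/* Recompute the per-digram class sizes N for a class of 10,000 terms:
 * the uniform estimate, the best and worst cases from Table 1, and the
 * average over the 616 digrams outside the table (45% of the words). */
int main(void)
{
    const double class_size = 10000.0;

    printf("uniform       N = %.2f\n", class_size / 676.0);        /* 14.79 */
    printf("'TH' (3.15%%)  N = %.0f\n", class_size * 0.0315);       /* 315   */
    printf("'LY' (0.47%%)  N = %.0f\n", class_size * 0.0047);       /* 47    */
    printf("rare average  N = %.1f\n", class_size * 0.45 / 616.0); /* ~7.3  */
    return 0;
}
```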

The new idea is to switch from the new structure (and algorithm) to the old structure (and old algorithm) once past the most frequent digrams, as sketched below.
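Here is a minimal sketch of this hybrid lookup, assuming the frequent digram groups are stored first in the new skip-indexed format and the rare terms follow as one flat tail in the old format; frequent_bytes and scan_flat() are hypothetical names standing for the size of the skip-indexed region and the old linear-scan algorithm:

```c
#include <stddef.h>
#include <stdint.h>

/* find_digram_group() is the skip-scan from the earlier sketch;
 * scan_flat() stands for the old linear scan and is assumed here. */
const uint8_t *find_digram_group(const uint8_t *p, const uint8_t *end,
                                 char d0, char d1);
const uint8_t *scan_flat(const uint8_t *p, const uint8_t *end,
                         char d0, char d1);

/* Hybrid lookup: skip-indexed region for the ~60-100 frequent digrams,
 * then fall back to the old flat structure for the rare tail, where a
 * skip index would cost more than the few comparisons it saves. */
const uint8_t *lookup(const uint8_t *base, size_t frequent_bytes,
                      size_t total_bytes, char d0, char d1)
{
    const uint8_t *hit = find_digram_group(base, base + frequent_bytes,
                                           d0, d1);
    if (hit)
        return hit;
    return scan_flat(base + frequent_bytes, base + total_bytes, d0, d1);
}
```

The break-even point (how many digram groups go into the skip-indexed region) can then be tuned per class with the m/n < N criterion above.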
