Sunday, February 28, 2010

Dictionary structure: Switch between sorted and unsorted leading digrams

The computation of the jump value of each leading digrams has cost (my post of feb 17), so when the number of terms with the same leading digrams is too small, it will be more expensive to compute the jump to the next leading diagram than to compare each term.  This conclusion brings two ideas:

- The dictionary should have two structures: the terms sorted and grouped on their leading digrams when the number of terms of same digrams is important and the terms unsorted and ungrouped when this number is small.
- A calibration routine will compute the threshold value. This value will be used to decide when to switch between the two structures. This calibration will only run the first time the program is executed.

1- New dictionary structure:

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/Wk+1Lk+1\TERMk/...
/0/Wl+1Ll+1\TERMl+1/Wl+2Ll+2\TERMl+2/...EOL
where S1 is the length of the substring  "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"
S2 is the length of the substring  "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn
/0/ marks the end of grouped leading digrams.

The difference with the structure presented in my post of Feb 17 are the /0/ sequence of characters. The value 0 indicates the end of the group of terms, and then a routine will have to compare each terms one by one rather to compare just the leading digram of the first term of each group of terms.

2 -New routines:

To keep the program simple, the grouping of terms with the same leading digram will be done by a third program, independently of the classifier and the learner.

Simplified flowchar:


The new program named 'smartgroup' will reorganize the dictionaries. It will be trigged after several dictionaries updates. It runs independently of Texlexan and Lazylearner.

New search algorithm:
(pseudo code, Python style with goto):

GROUP:
      While not EOL do:
         get length S of the group of digram
         if length S is 0 goto BULK
         compute the index of the next group
         get the digram of the first term
         if digram of the first term is the digram searched goto TERM
      else: goto NO_FOUND
TERM:
      while not EOL do:
         get length L of the term
         compute the index of the next term
         get the term
         if term is the term searched goto FOUND
         if index of the next term is at the end of the group goto GROUP
      else: goto NO_FOUND

BULK:
      while not EOL do:
         get the length L of the term
         compute the index of the next term
         get the digram
         if digram is the digram searched:
             get the term
             if term is the term searched goto FOUND
      else: goto NO_FOUND

FOUND:
      get the weight
      return the weight ( exit of this routine )

NO_FOUND:
      return 0 ( exit of this routine )
This algorithm is divided in three parts:
 - GROUP: jump from group of digram to group of digram when the first digram read does not match the searched digram.
- TERM: jump from term to term inside the group of same leading digram when the term does not match the searched term.
- BULK: jump from term to term as soon as the value of S is zero.


Speed optimization (multi-core processor):
The computation of the index of each group, the computation of the index of each term, and the comparison of digrams or terms can be done in parallel . The code should eliminate the race conditions ( in case the comparison is faster than the computation of the indexes ).

The next post will be about the robustness of this structure of data.

No comments:

Post a Comment