- The dictionary should have two structures: the terms sorted and grouped on their leading digrams when the number of terms of same digrams is important and the terms unsorted and ungrouped when this number is small.
- A calibration routine will compute the threshold value. This value will be used to decide when to switch between the two structures. This calibration will only run the first time the program is executed.
1- New dictionary structure:
label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/Wk+1Lk+1\TERMk/...
/0/Wl+1Ll+1\TERMl+1/Wl+2Ll+2\TERMl+2/...EOL
where S1 is the length of the substring "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/"S2 is the length of the substring "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/"
etc for S3, S4... Sn
/0/ marks the end of grouped leading digrams.
The difference with the structure presented in my post of Feb 17 are the /0/ sequence of characters. The value 0 indicates the end of the group of terms, and then a routine will have to compare each terms one by one rather to compare just the leading digram of the first term of each group of terms.
2 -New routines:
To keep the program simple, the grouping of terms with the same leading digram will be done by a third program, independently of the classifier and the learner.
Simplified flowchar:
The new program named 'smartgroup' will reorganize the dictionaries. It will be trigged after several dictionaries updates. It runs independently of Texlexan and Lazylearner.
New search algorithm:
(pseudo code, Python style with goto):
GROUP:
While not EOL do:
get length S of the group of digram
if length S is 0 goto BULK
compute the index of the next group
get the digram of the first term
if digram of the first term is the digram searched goto TERM
else: goto NO_FOUND
TERM:
while not EOL do:
get length L of the term
compute the index of the next term
get the term
if term is the term searched goto FOUND
if index of the next term is at the end of the group goto GROUP
else: goto NO_FOUND
BULK:
while not EOL do:
get the length L of the term
compute the index of the next term
get the digram
if digram is the digram searched:
get the term
if term is the term searched goto FOUND
else: goto NO_FOUND
FOUND:
get the weight
return the weight ( exit of this routine )
NO_FOUND:
return 0 ( exit of this routine )
This algorithm is divided in three parts:While not EOL do:
get length S of the group of digram
if length S is 0 goto BULK
compute the index of the next group
get the digram of the first term
if digram of the first term is the digram searched goto TERM
else: goto NO_FOUND
TERM:
while not EOL do:
get length L of the term
compute the index of the next term
get the term
if term is the term searched goto FOUND
if index of the next term is at the end of the group goto GROUP
else: goto NO_FOUND
BULK:
while not EOL do:
get the length L of the term
compute the index of the next term
get the digram
if digram is the digram searched:
get the term
if term is the term searched goto FOUND
else: goto NO_FOUND
FOUND:
get the weight
return the weight ( exit of this routine )
NO_FOUND:
return 0 ( exit of this routine )
- GROUP: jump from group of digram to group of digram when the first digram read does not match the searched digram.
- TERM: jump from term to term inside the group of same leading digram when the term does not match the searched term.
- BULK: jump from term to term as soon as the value of S is zero.
Speed optimization (multi-core processor):
The computation of the index of each group, the computation of the index of each term, and the comparison of digrams or terms can be done in parallel . The code should eliminate the race conditions ( in case the comparison is faster than the computation of the indexes ).
The next post will be about the robustness of this structure of data.