Wednesday, March 3, 2010

Robustness of the new dictionary

The robustness of a structure of data is an important point when a large amount will have to be stored on disk for a long period of time. The current hard disk reliability is high ( 600,000 hrs < MTBF < 1,200,000 ), furthermore some array configurations of hard disks increase the overall reliability (raid 6 for instance). Anyway, the risk is non-null and it is interesting to verify if it will be possible to recover the data in case of hard disk failure.

Analyze:

Structure of each class in the dictionary: 

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/Wk+1Lk+1\TERMk/...
/0/Wl+1Ll+1\TERMl+1/Wl+2Ll+2\TERMl+2/...EOL

There are 3 dictionaries with the same structure for the word-unigrams, words-digrams and words-trigrams.

Hypothesis: one or a few consecutive bytes are corrupted.

- The sequence "label_index label_name:" is exactly repeated in each dictionary. It is easy to rebuilt this sequence from another dictionary by searching the closest match. If the probability is P to have a problem is this sequence and the probability is the same for each dictionary, then the probability is P3 to have an unrecoverable sequence "label_index label_name:".  P3 is a tiny probability and furthermore, it will always possible to correct manually the label_name.

- The value k can be easily computed again from the Number of terms in the dictionary / number of term in the class. So, the risk is not here.


Finally , the biggest risk comes from the list of terms of each class because it represent the largest part of the data.

/S1/w1L1\TERM1/w2L2\TERM2...
- The sequence /Si/ takes 6 bytes, the value Si 4 takes bytes and can be computed again by searching the next leading digram and counting the number of characters.
- The sequence /wjLj\ takes 6 bytes. wj takes 2 bytes and can be roughly estimated from the previous weight wj-1 and the next weight wj+1 if the term of the same leading digram are sorted on the weight w. Lj take 2 bytes and is the length of the TERMj. It can be computed again.
So, the sequence /Si/wjLj\ does not represent a big risk too.

- The chain \TERMj/ has a variable length given by Lj + 2. In many case, this term belongs to other classes inside the dictionary, then it will be possible to retreive it by searching its closest match (lowest edit distance) inside the dictionary.

In conclusion:
This structure seems intrinsectly robust with enough redundency to recover any errors affecting one byte or a few continuous bytes.

Note: I am starting to modify the dictionaries and the programs. They will be in the new branch of the repository /branch/new_dico_struct/...  Be patient, that will take time to modify and to valid the code.