*The robustness of a structure of data is an important point when a large amount will have to be stored on disk for a long period of time. The current hard disk reliability is high ( 600,000 hrs < MTBF < 1,200,000 ), furthermore some array configurations of hard disks increase the overall reliability (raid 6 for instance). Anyway, the risk is non-null and it is interesting to verify if it will be possible to recover the data in case of hard disk failure.*

**Analyze:**

Structure of each class in the dictionary:

label_index label_name:k/S1/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/S2/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/S3/..../Sn/Wk+1Lk+1\TERMk/...

/0/Wl+1Ll+1\TERMl+1/Wl+2Ll+2\TERMl+2/...EOL

There are 3 dictionaries with the same structure for the word-unigrams, words-digrams and words-trigrams.

Hypothesis: one or a few consecutive bytes are corrupted.

- The sequence "label_index label_name:" is exactly repeated in each dictionary. It is easy to rebuilt this sequence from another dictionary by searching the closest match. If the probability is P to have a problem is this sequence and the probability is the same for each dictionary, then the probability is P

^{3}to have an unrecoverable sequence "label_index label_name:". P

^{3}is a tiny probability and furthermore, it will always possible to correct manually the label_name.

- The value k can be easily computed again from the Number of terms in the dictionary / number of term in the class. So, the risk is not here.

Finally , the biggest risk comes from the list of terms of each class because it represent the largest part of the data.

/S1/w1L1\TERM1/w2L2\TERM2...

- The sequence /S

_{i}/ takes 6 bytes, the value S

_{i}4 takes bytes and can be computed again by searching the next leading digram and counting the number of characters.

- The sequence /w

_{j}L

_{j}\ takes 6 bytes. w

_{j}takes 2 bytes and can be roughly estimated from the previous weight w

_{j-1}and the next weight w

_{j+1}if the term of the same leading digram are sorted on the weight w. L

_{j}take 2 bytes and is the length of the TERM

_{j}. It can be computed again.

So, the sequence /Si/wjLj\ does not represent a big risk too.

- The chain \TERM

_{j}/ has a variable length given by L

_{j}+ 2. In many case, this term belongs to other classes inside the dictionary, then it will be possible to retreive it by searching its closest match (lowest edit distance) inside the dictionary.

**In conclusion:**

This structure seems intrinsectly robust with enough redundency to recover any errors affecting one byte or a few continuous bytes.

*Note: I am starting to modify the dictionaries and the programs. They will be in the new branch of the repository /branch/new_dico_struct/... Be patient, that will take time to modify and to valid the code.*

## No comments:

## Post a Comment