Sunday, February 7, 2010

Estimating Bc

I explained in the last post that the constant Bc depends on the size of the training set; more precisely:
   Bc = Log(P(C)/P'(C)).

Since P(C) = Nc/Nt and P'(C) = 1 - Nc/Nt = (Nt-Nc)/Nt, we have Bc = Log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt the total number of documents, all classes included.
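
As a minimal sketch, this first estimate in Python (the natural logarithm, the function name, and the example counts are my assumptions; the post does not state the base of Log):

import math

def bc_from_documents(n_c, n_t):
    # Bc = Log(Nc/(Nt-Nc)): log-odds of class C in the training set.
    return math.log(n_c / (n_t - n_c))

print(bc_from_documents(500, 1000))  # balanced class: 0.0
print(bc_from_documents(100, 1000))  # rare class: log(100/900) ≈ -2.197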

Document sizes vary widely, so it seems more accurate to use the number of words rather than the number of documents, which gives this relation:

Bc = Log(Wc/(Wt-Wc)), where Wc is the number of words in the documents of class C and Wt the number of words, all classes included.

This second relation is better but still not perfect, because it does not take into account the updating frequency of each class. A better solution is to combine the two relations above by averaging them:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc)) / 2
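
Read this way, the division by 2 sits outside the logarithm, so the formula is the average of the two log estimates. A sketch under that assumption (variable names and example counts are mine):

import math

def bc_combined(n_c, n_t, w_c, w_t):
    # (Log(Nc/(Nt-Nc)) + Log(Wc/(Wt-Wc))) / 2
    #   = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc)) / 2
    return math.log((n_c / (n_t - n_c)) * (w_c / (w_t - w_c))) / 2

# A class holding 100 of 1000 documents and 30000 of 120000 words:
print(bc_combined(100, 1000, 30000, 120000))  # ≈ -1.648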

Wc and Wt count the words of the unfiltered documents, but not all words of a document participate in the classification. Furthermore, the classes of the dictionary do not all contain the same number of words: depending on the lexical richness of the class and on the training set used, the number of N-grams in a dictionary class may vary a lot. Intuitively, the probability of classifying a document in class C increases when the number of N-grams present in the dictionary's class C is high.

A better estimate of Bc should therefore probably include the size of each class of the dictionary. The relation below averages the three terms:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) * Gc/(Gt-Gc)) / 3, where Gc is the number of N-grams in class C of the dictionary and Gt the total number of N-grams in the dictionary.
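
The same averaging reading applies here, with the dictionary term added; a sketch under that assumption (names and example counts are mine):

import math

def bc_full(n_c, n_t, w_c, w_t, g_c, g_t):
    # Average of the three log terms:
    # Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) * Gc/(Gt-Gc)) / 3
    odds = (n_c / (n_t - n_c)) \
         * (w_c / (w_t - w_c)) \
         * (g_c / (g_t - g_c))
    return math.log(odds) / 3

# 100 of 1000 documents, 30000 of 120000 words, 5000 of 40000 N-grams:
print(bc_full(100, 1000, 30000, 120000, 5000, 40000))  # ≈ -1.747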
