Saturday, February 6, 2010

Improving the classification model

We saw in the previous post that the Naive Bayes model is a linear classifier in log space:
        Score = ∑ Wci*Xi + Bc

The weight Wci of each term i in class C is estimated during the training of the classifier; computing these weights is the job of the program Lazylearner.
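
To make the scoring rule concrete, here is a minimal sketch in Python; the class names, weights, and counts are invented for illustration and are not texlexan's actual code or data (the constant Bc is discussed just below):

    def score(weights, bias, counts):
        # linear score in log space: sum of Wci*Xi over the terms i, plus Bc
        return sum(weights.get(term, 0.0) * x for term, x in counts.items()) + bias

    # Xi: term counts of the document to classify (hypothetical)
    doc_counts = {"kernel": 3, "memory": 1}

    # Wci and Bc per class, as estimated during training (made-up values)
    model = {
        "linux":  ({"kernel": -1.2, "memory": -2.5}, -0.7),
        "soccer": ({"kernel": -6.3, "memory": -5.9}, -0.7),
    }

    # the document gets the label of the class with the highest score
    best = max(model, key=lambda c: score(*model[c], doc_counts))
    print(best)  # -> linux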

The term Bc is logarithmically proportional to the number of documents (*) of class C used to train the classifier: Bc ~ Log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt the total number of documents.

(*) documents are assumed to have the same size.
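
For instance, assuming a natural logarithm (the base is not stated in the post), a class trained on 30 documents out of 100 gets:

    from math import log

    def class_constant(n_c, n_t):
        # Bc ~ Log(Nc/(Nt-Nc)): the log-odds of class C in the training set
        return log(n_c / (n_t - n_c))

    print(class_constant(30, 100))  # log(30/70) ≈ -0.85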

Because we simply assign to the document the label of the class with the highest score, we can drop Bc, provided we take care to train the classifier with roughly the same number of documents for each class. Unfortunately, it is generally impossible to train the classifier evenly, and training with more documents for one class than for the others biases the results toward the classes with the largest number of training documents.
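
A small hypothetical example gives an idea of the size of this bias: with 90 training documents for one class and 10 for the other, the two constants differ by about 4.4 in log score, a head start that the term weights Wci*Xi must overcome:

    from math import log

    b_a = log(90 / (100 - 90))  # class A, 90 docs: ≈ +2.20
    b_b = log(10 / (100 - 10))  # class B, 10 docs: ≈ -2.20
    print(b_a - b_b)            # ≈ 4.39 in favor of class A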

The new version of texlexan (pack 1.47) tries to compensate for the size imbalance in the training set. The model used to classify the documents now includes the constant Bc; in consequence, the dictionaries are modified and completed with these constants (one constant for each class). These constants are computed by Lazylearner from the size (number of words) of the documents used to train the classifier.

The number of words was chosen rather than the number of documents because document sizes are often very unequal.
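
A plausible sketch of the corrected constant reuses the same log-odds form, but over word counts instead of document counts; the post does not show Lazylearner's exact formula, so this is an assumption:

    from math import log

    def class_constant_by_words(words_c, words_t):
        # assumed form: log-odds of class C measured in training words
        return log(words_c / (words_t - words_c))

    # same number of documents per class, but one class's documents are
    # ten times longer: 50,000 words versus 5,000 words
    print(class_constant_by_words(50_000, 55_000))  # ≈ +2.30
    print(class_constant_by_words(5_000, 55_000))   # ≈ -2.30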

Note: The dictionaries keyworder.'lan'.dic'N' have changed but remain compatible with the previous versions.

The new package is available here:
http://sourceforge.net/projects/texlexan/files/
