Sunday, January 24, 2010

Statistical approach

Introduction:

The classifier and the summarizer are based on a purely statistical approach. For the classifier, the text is viewed as a bag of words and short sequences of words. For the summarizer, each sentence is viewed as a bag of words.

The order of the words does not matter in the case of word unigrams. In the case of bigrams and trigrams (sequences of 2 and 3 words), word order plays a role only within the limited window of 2 or 3 consecutive words.

TexLexAn is configured by default to classify a text based on its unigrams, bigrams and trigrams. TexLexAn can use 4-grams, 5-grams and more (just add the option -n, for instance -n6 for 1-gram to 6-gram), but the computation time and the memory required increase dramatically.

Bigrams and trigrams are sufficient for the majority of texts. Many concepts are expressed with a sequence of two or three words, for example:

Police officer, power station, fruit juice, answering machine, computer assisted, traffic light, heart disease, mineral water, vegetable oil are bigrams.

Random access memory, local water treatment, free online dictionary, high speed internet, very high frequency are trigrams.

Single words generally belong to many classes, but bigrams and trigrams are often more specific. Using bigrams and trigrams to classify a text helps to resolve many ambiguities and increases the precision of the classification.
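As an illustration, here is a minimal Python sketch (not TexLexAn's actual code) of how the unigrams, bigrams and trigrams of a text can be extracted; the tokenization is deliberately crude:

def extract_ngrams(text, max_n=3):
    """Return all 1-grams to max_n-grams of a text (illustrative sketch only)."""
    words = text.lower().split()              # crude tokenization, for illustration
    ngrams = []
    for n in range(1, max_n + 1):             # unigrams, bigrams, ..., max_n-grams
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams

print(extract_ngrams("random access memory is fast", max_n=3))
# ['random', 'access', 'memory', 'is', 'fast', 'random access', 'access memory',
#  'memory is', 'is fast', 'random access memory', 'access memory is', 'memory is fast']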

Some statistics:

In our case, Bayes' theorem helps us estimate the probability that a text belongs to a class when a set of n-grams (words, bigrams, trigrams...) is present in the text. Applying Bayes' theorem to text classification, we can write:

P(C|W) = P(W|C) * P(C) / P(W)

In the expression above:
C is a topic or text class.
P(C) is the probability that C occurs.
W is one word (unigram) or one sequence of words (bigram, trigram...).
P(W) is the probability that W occurs.
P(W|C) is the probability of observing W given C.
P(C|W) is the probability of C given W.

P(C) is normally well known because it depends on the number of classes we have.
P(W) is independent of the number of classes and is constant.
P(W|C) can be estimated by computing the frequency of W when the text has the class C; this is the training of the classifier.
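A toy Python sketch of this training step (an illustration only; the add-one smoothing is an assumption, not necessarily what TexLexAn does):

from collections import Counter

def train_class(documents, vocabulary):
    """Estimate P(W|C) by counting the n-grams of documents labeled with class C."""
    counts = Counter()
    for doc in documents:                     # each doc is a list of n-grams
        counts.update(doc)
    total = sum(counts.values())
    # add-one (Laplace) smoothing so that unseen terms keep a small probability
    return {w: (counts[w] + 1) / (total + len(vocabulary)) for w in vocabulary}

docs_sport = [["football match", "goal"], ["goal", "goal", "referee"]]
vocab = {"football match", "goal", "referee", "election"}
p_w_given_sport = train_class(docs_sport, vocab)
print(p_w_given_sport["goal"])      # frequent term -> high P(W|C)
print(p_w_given_sport["election"])  # unseen term  -> small smoothed P(W|C)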

We can extend the simple equation above to more than one n-gram (W). If we suppose that the n-grams are independent (we know this is false: words in a sentence depend on each other, and so do the sentences of a text, but it greatly simplifies the equation and works pretty well), then the probability that the text belongs to C given all its n-grams reduces to a product of the individual likelihoods P(Wi|C):

Pc = P(C|W1,W2,...,Wn) = ∏ P(Wi|C) * P(C) / Z = ∏ P(Wi|C) * Kc     where Z = P(W1,W2,...,Wn) and Kc = P(C)/Z

We assume the terms of the document follow a multinomial distribution, so:
Pc = ∏ (θci)^xi * Kc     where θci is the probability of the term i in class C, xi is the number of occurrences of the term i in the document, and Kc depends only on the class C.
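In practice the product of many small probabilities underflows, so it is computed in log space. A small sketch of the multinomial form above, with invented θ values:

import math

def log_pc(theta_c, doc_counts, prior_c):
    """Unnormalized log of Pc = Kc * prod(theta_ci ** xi), computed in log space."""
    score = math.log(prior_c)                            # log P(C); Z is the same for every class
    for term, xi in doc_counts.items():
        score += xi * math.log(theta_c.get(term, 1e-6))  # tiny floor for unseen terms
    return score

theta_sport = {"goal": 0.4, "referee": 0.1, "election": 0.01}
doc = {"goal": 3, "referee": 1}        # xi = occurrences of each term in the document
print(log_pc(theta_sport, doc, prior_c=0.5))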

Now, we consider the ratio:  P(C|W) / P(C'|W) = ∏ (θci / θc'i)^xi * P(C) / P(C')
where C' is the complement of C (all the other classes).

We can easily linearize the above equation by taking the logarithm:
log(P(C|W) / P(C'|W)) = ∑ xi * log(θci / θc'i) + log(P(C) / P(C'))

Finally, we can simplify:
Score = ∑ Wci*Xi + Bc     where Score is log(P(C|W) / P(C'|W)), Wci = log(θci / θc'i) = log(P(Wi|C) / P(Wi|C')) is the weight of the term i for the class C, Xi = xi is its number of occurrences in the document, and Bc = log(P(C) / P(C')).
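A sketch of this final scoring step, with invented weights (in TexLexAn the weights come from the training phase described below):

def score(doc_counts, weights_c, bias_c=0.0):
    """Score = sum_i(Wci * Xi) + Bc, for one class C."""
    return sum(weights_c.get(term, 0.0) * xi
               for term, xi in doc_counts.items()) + bias_c

weights_sport = {"goal": 1.8, "referee": 1.2, "election": -2.5}   # invented weights
doc = {"goal": 3, "referee": 1}
print(score(doc, weights_sport))   # compute one score per class; the highest wins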

The weight Wci of each term i in the class C is estimated during the training of the classifier. The term Bc can be estimated from the number of documents of class C used to train the classifier:   Bc ~ log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt is the total number of documents.
Because we look for the highest score, we can drop Bc if we take care to train the classifier with roughly the same number of documents for each class.
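For illustration, the weights and the bias could be derived from the trained probabilities like this (a sketch of the formulas above, not TexLexAn's code):

import math

def make_weights(theta_c, theta_not_c):
    """Wci = log(theta_ci / theta_c'i) for every term i."""
    return {w: math.log(theta_c[w] / theta_not_c[w]) for w in theta_c}

def make_bias(n_c, n_total):
    """Bc ~ log(Nc / (Nt - Nc))."""
    return math.log(n_c / (n_total - n_c))

theta_sport = {"goal": 0.40, "referee": 0.10}
theta_other = {"goal": 0.05, "referee": 0.02}
print(make_weights(theta_sport, theta_other))
print(make_bias(50, 100))          # balanced classes -> Bc = 0 and can be dropped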
