Sunday, January 24, 2010

Statistical approach

Introduction:

The classifier and the summarizer are based on a purely statistical approach. For the classifier, the text is viewed as a bag of words and short sequences of words. For the summarizer, each sentence is viewed as a bag of words.

The order of the words has no importance in the case of word unigrams. In the case of bigrams and trigrams (sequences of 2 and 3 words), the order of the words plays a role within the limited span of 2 or 3 consecutive words.

TexLexAn is configured by default to classify a text based on its unigrams, bigrams and trigrams. TexLexAn can use 4-grams, 5-grams and more (just add the option -n, for instance -n6 for 1-gram to 6-gram), but the computation time and the memory required increase dramatically.

Bigrams and trigrams are sufficient for the majority of texts. Many concepts are expressed with a sequence of two or three words, for example:

Police officer, power station, fruit juice, answering machine, computer assisted, traffic light, heart disease, mineral water, vegetable oil are bigrams.

Random access memory, local water treatment, free online dictionary, high speed internet, very high frequency are trigrams.

Single words generally belong to many classes, but bigrams and trigrams are often more specific. Using bigrams and trigrams to classify a text helps to resolve many ambiguities and increases the precision of the classification.
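As an illustration, here is a minimal sketch in Python (not TexLexAn's actual code) of how a text can be reduced to the bag of unigrams, bigrams and trigrams used by the classifier:

from collections import Counter

def extract_ngrams(text, max_n=3):
    # Split the text into words and count every sequence of
    # 1..max_n consecutive words (unigrams, bigrams, trigrams...).
    words = text.lower().split()
    grams = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return grams

print(extract_ngrams("random access memory is a type of computer memory"))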

Some statistics:

In our particular case, Bayes' theorem will help us to estimate the probability that a text belongs to a class when a set of n-grams (words, bigrams, trigrams...) is present in the text. If we apply Bayes' theorem to our particular case of text classification, we can write:

P(C|W) = P(W|C) * P(C) / P(W)

In the expression above:
C is a topic or text class.
P(C) is the probability that C occurs.
W is one word (unigram) or one sequence of words (bigram, trigram...).
P(W) is the probability that W occurs.
P(W|C) is the probability of observing W given C.
P(C|W) is the probability of C given W.
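To make the formula concrete, here is a small worked example with made-up numbers: suppose the bigram "heart disease" appears in 8% of the texts of the class "health" (P(W|C) = 0.08), that 20% of the archived texts belong to this class (P(C) = 0.20), and that the bigram appears in 2% of all texts (P(W) = 0.02). Then P(C|W) = 0.08 * 0.20 / 0.02 = 0.80: observing this single bigram already makes the class "health" very likely.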

P(C) is normally well known because it depends on the number of classes we have.
P(W) is independent of the number of classes and is constant.
P(W|C) can be estimated by computing the frequency of W in the texts of class C; this estimation is the training of the classifier.
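Here is a minimal training sketch (illustrative only, not TexLexAn's actual training code): count how often each n-gram occurs in the documents of a class and divide by the total number of n-grams seen in that class.

from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: list of (class_name, ngram_counter) pairs,
    # for example produced by extract_ngrams() above.
    counts = defaultdict(Counter)   # class -> n-gram -> count
    totals = Counter()              # class -> total number of n-grams
    for cls, grams in labeled_docs:
        counts[cls].update(grams)
        totals[cls] += sum(grams.values())
    # P(W|C) estimated as the relative frequency of W inside class C.
    return {cls: {w: c / totals[cls] for w, c in grams.items()}
            for cls, grams in counts.items()}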

We can extend the simple equation above to more than one n-gram (W). If we suppose that the n-grams are independent (we know this is false: words in a sentence depend on each other, and sentences in a text depend on each other too, but it greatly simplifies the equation and works pretty well), then the resulting probability is the product of the probabilities P(C|Wi).

Pc = P(C|W1)*P(C|W2)*...*P(C|Wn) = ∏ P(Wi|C) * P(C) / Z = ∏ P(Wi|C) * Kc

We assume the probability Pc follows a multinomial distribution:
Pc = ∏ (θci)^xi * Kc, where θci is the probability of the term i in class C, xi is the number of occurrences of term i in the document, and Kc depends on the class C.

Now, we consider:  P(C|W) / P(C'|W) = ∏ (P(Wi|C) / P(Wi|C')) * P(C) / P(C')
where C' is the complement of C.

We can easily linearize the above equation by taking the logarithm:
log(P(C|W) / P(C'|W)) = ∑ log(P(Wi|C) / P(Wi|C')) + log(P(C) / P(C'))

Finally, we can simplify:
Score = ∑ Wci*xi + Bc, where Score is log(P(C|W) / P(C'|W)), Wci is log(P(Wi|C) / P(Wi|C')), xi is the number of occurrences of term i in the document, and Bc is log(P(C) / P(C')).

The weight Wci of each term i in the class C is estimated during the training of the classifier. The term Bc is proportional to the logarithm of the number of documents of class C used to train the classifier: Bc ~ log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt is the total number of documents.
Because we look for the highest score, we can drop Bc if we take care to train the classifier with roughly the same number of documents for each class.
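Putting the pieces together, here is a minimal scoring sketch (illustrative only; the weights Wci = log(P(Wi|C)/P(Wi|C')) are assumed to come from the training step, n-grams never seen during training simply get a weight of zero, and Bc is dropped as suggested above):

def score(doc_grams, weights, bias=0.0):
    # Score = sum_i Wci * xi + Bc for one class:
    # doc_grams maps each n-gram to its count xi in the document,
    # weights maps each n-gram to its trained weight Wci.
    return sum(weights.get(w, 0.0) * x for w, x in doc_grams.items()) + bias

def classify(doc_grams, class_weights):
    # Return the class with the highest linear score.
    return max(class_weights, key=lambda c: score(doc_grams, class_weights[c]))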

Saturday, January 16, 2010

Results window


I am continuing my previous post with a description of the results. The window is split into two parts. The left panel contains the text returned by the engine texlexan; the most useful information there is the classification results and the list of the most relevant sentences extracted from the summaries.


The right panel displays the classification results as bar graphs, which makes it easy to spot the most significant results quickly.

Example:

Number of summaries extracted: 42

Grade:  23% Class: en.text-technic-computer
Grade:  25% Class: en.text-technic-computer-text_mining
Grade:  10% Class: en.text-technic-computer-processor
Grade:   7% Class: en.text-technic-computer-machine_learning
Grade:   6% Class: en.text-technic-computer-operating_system
Grade:   7% Class: en.text-law_agreement-international
Grade:   2% Class: en.text-technic-computer-memory
Grade:   4% Class: en.text-health
Grade:   3% Class: en.text-technic-computer-programming
Grade:   3% Class: en.text-technic-computer-unclassified
Grade:   3% Class: en.text-knowledge_management
Grade:   2% Class: en.text-technic-computer-artificial_intelligence
Grade:   1% Class: en.text-health-drug
Grade:   1% Class: en.text-science-chemistry-chemical
Grade:   0% Class: en.text-food-fruit

We know the result comes from 42 summaries that have been extracted and analysed. It means 42 documents were analysed, summarized and archived over the period considered.
The classification list shows that the majority of the documents were about computers, text mining, processors, machine learning, operating systems and international agreements.
It is important to be careful when the grade (pseudo-probability) of a classification is low: there is a high probability that the classification is erroneous and simply due to noise, for instance the result "1% Class: en.text-health-drug".


The next interesting part of the results is the sentences extracted from the summaries:
Additionally, some use these terms to refer only to multi-core
microprocessors that are manufactured on the same integrated circuit
die .These people generally refer to separate microprocessor dies in
the same package by another name, such as multi-chip module .This
article uses both the terms "multi-core" and "dual-core" to
reference microelectronic CPUs manufactured on the same integrated
circuit, unless otherwise noted.

There are only multi-threaded managed runtimes means when it loads
an single threaded managed app the runtime itself creates multi
threads for its own purpose, right ? A: The multi-threading managed
runtime takes care of creating multiple threads as needed by the
application.

Others, generally seeking more compact and stable methods for
indexing highly diverse sources for which full, word-based indexes
are often unavailable, have explored higher-level indexing methods
including free-text and controlled-vocabulary metadata schemes,
semantic representations, and query-based indexing with training
sets.

Hierarchical Indexing Hierarchical indexing is a method of indexing
large documents at several levels of structure, so that a retrieval
system can pinpoint the most relevant sections within each document.

For document retrieval, hierarchical concept-based indexing and
document sectioning show promise for improving on word indexing alone.



Total of cue words found 431/20758 not found

The tag "business" indicates the cue words belong the business and only 431 cue words were found in summaries analysed.


The sentences extracted above represent, in theory, the main information expressed in the 42 documents analysed. Unfortunately, because the extraction is based on a purely statistical method, a few non-relevant sentences can sometimes be extracted. I will explain the reason in more detail in a next post.

Friday, January 15, 2010

Extract meaningful information

In a previous post, I discussed the idea of extracting the most interesting information from the mass of electronic text circulating in the enterprise. After two weeks of work, the first step is done: TexLexAn is able to extract the most relevant sentences from a set of documents.

The main difficulty is to decide whether a sentence is relevant or not. The solution chosen is to weight each sentence with the keywords extracted from the summaries, and to use a list of cue words to increase the weight. Finally, only the sentences with a weight above a threshold are kept.
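Here is a minimal sketch of this idea (illustrative only; the actual weights, cue-word boost and threshold used by TexLexAn may differ):

def relevant_sentences(sentences, keywords, cue_words, threshold=2.0, cue_boost=1.0):
    # Keep only the sentences whose weight exceeds the threshold;
    # the weight counts keyword hits and adds a boost for each cue-word hit.
    kept = []
    for sentence in sentences:
        words = set(sentence.lower().split())
        weight = sum(1.0 for w in words if w in keywords)
        weight += sum(cue_boost for w in words if w in cue_words)
        if weight > threshold:
            kept.append((weight, sentence))
    return [s for w, s in sorted(kept, key=lambda p: p[0], reverse=True)]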

A trial has been done by classifying the set of summaries and then using the keywords belonging to the classification dictionary. This solution may seem better because it increases the number of keywords, but when the classification is wrong, the keywords are inappropriate.

The interface is very basic: there are two fields to enter the starting date and the ending date (a calendar can be called), and a large text window to enter some extra options. The most interesting options are -v1 for verbose output and -K for the keyword list.

The results are too long to comment on here; they will be the subject of a next post.

The package pack1.46.tar.gz is available here http://sourceforge.net/projects/texlexan/files/