Friday, January 15, 2010

Extract meaningful information

In a previous post, I discussed the idea to extract the most interesting information in the mass of electronic text circulating in the enterprise. After two weeks of work, the first step is done: TexLexAn is able to extract the most relevant sentences in a set of documents.

The main difficulty is to decide if a sentence is relevant or not. The solution chosen is to weight each sentences with the keywords extracted from the summaries, and to use a list of cue words to increase the weight. Finally, only the sentences with a weight above a threshold are kept.

A trial has been done by classifying the set of summaries and then by using the keywords belonging the classification dictionary. This solution could seem better because it increases the number of the keywords, but when the classification is wrong then the keywords are inappropriate.

The interface is very basic: There are two fields to enter the starting date and the ending date (a calendar can be called), and a large text window to enter some extra-options. The most interesting option is -v1 for verbose and -K for the keyword list.

The results are pretty long to comment and cannot fit here, they will be the object of a next post.

The package pack1.46.tar.gz is available here http://sourceforge.net/projects/texlexan/files/

No comments:

Post a Comment