In a previous post, I discussed the idea of extracting the most interesting information from the mass of electronic text circulating in the enterprise. After two weeks of work, the first step is done: TexLexAn can now extract the most relevant sentences from a set of documents.
The main difficulty is deciding whether a sentence is relevant or not. The solution chosen is to weight each sentence using the keywords extracted from the summaries, and to use a list of cue words to increase that weight. Finally, only the sentences whose weight is above a threshold are kept.
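To make the weighting idea concrete, here is a minimal sketch in Python. It is not TexLexAn's actual code: the function names, the multiplicative cue-word bonus, the naive sentence splitter and the numeric values are all assumptions chosen for illustration. It only shows the general shape of the approach: sum the keyword weights found in a sentence, boost the sum when a cue word is present, and keep the sentences above a threshold.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, keyword_weights, cue_words, cue_bonus=1.5):
    # Lowercase the sentence and pull out word-like tokens.
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    # Base weight: sum of the weights of the keywords present in the sentence.
    weight = sum(keyword_weights.get(w, 0.0) for w in words)
    # Hypothetical rule: multiply the weight when at least one cue word appears.
    if any(w in cue_words for w in words):
        weight *= cue_bonus
    return weight

def extract_relevant(text, keyword_weights, cue_words, threshold=2.0):
    # Keep only the sentences whose weight is strictly above the threshold.
    return [
        s for s in split_sentences(text)
        if score_sentence(s, keyword_weights, cue_words) > threshold
    ]

if __name__ == "__main__":
    # Illustrative data only; real keyword weights would come from the summaries.
    text = (
        "The parser extracts keywords from each summary. "
        "Lunch was fine. "
        "In conclusion, the keywords drive the summary ranking."
    )
    keyword_weights = {"keywords": 1.0, "summary": 1.0, "parser": 0.5}
    cue_words = {"conclusion", "important"}
    for sentence in extract_relevant(text, keyword_weights, cue_words):
        print(sentence)
```

With this sample data, the first and third sentences pass the threshold and the second is dropped. Whether the cue words should multiply the weight or add a fixed bonus is a design choice; the multiplicative form used here has the side effect that a cue word alone, without any keyword, never makes a sentence relevant.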
The interface is very basic: there are two fields to enter the start and end dates (a calendar can be called up), and a large text window to enter extra options. The most interesting options are -v1 for verbose output and -K for the keyword list.
The results are too long to discuss here; they will be the subject of a future post.
The package pack1.46.tar.gz is available here: http://sourceforge.net/projects/texlexan/files/