In a previous post, I discussed the idea of extracting the most interesting information from the mass of electronic text circulating in the enterprise. After two weeks of work, the first step is done: TexLexAn is able to extract the most relevant sentences from a set of documents.
The main difficulty is deciding whether a sentence is relevant or not. The solution chosen is to weight each sentence with the keywords extracted from the summaries, and to use a list of cue words to increase the weight. Finally, only the sentences with a weight above a threshold are kept.
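To make the idea concrete, here is a minimal Python sketch of this kind of scoring. It is not TexLexAn's actual code: the function names, the length normalization, the multiplicative cue bonus, and the default threshold are all assumptions chosen for illustration, and the keyword weights are simply passed in as a dictionary rather than extracted from summaries.

```python
import re

def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, keyword_weights, cue_words, cue_bonus=1.5):
    # Lowercase the sentence and keep alphabetic tokens only.
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    # Sum the weights of the keywords present, normalized by sentence
    # length so that long sentences are not favored (assumption).
    score = sum(keyword_weights.get(w, 0.0) for w in words) / len(words)
    # A cue word increases the weight; the multiplicative bonus is hypothetical.
    if any(w in cue_words for w in words):
        score *= cue_bonus
    return score

def extract_relevant(text, keyword_weights, cue_words, threshold=0.2):
    # Keep only the sentences whose weight is above the threshold.
    return [
        s for s in split_sentences(text)
        if score_sentence(s, keyword_weights, cue_words) > threshold
    ]
```

Calling extract_relevant(text, {"keywords": 1.0, "summary": 0.8}, {"conclusion"}) returns the retained sentences in their original order. In practice the threshold and the cue bonus would have to be tuned on real documents.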

The results are too long to discuss here; they will be the subject of a future post.
The package pack1.46.tar.gz is available here: http://sourceforge.net/projects/texlexan/files/