Thursday, December 31, 2009

Finding meaningful information in the noise

I decided to extend TexLexAn's capabilities toward business intelligence. It may sound pretentious, but I am certain that TexLexAn can do the job.


Let's start with a few questions:

- How can we extract the most important information from the stream of reports, documents, mails and memos circulating in the enterprise?

- What makes a piece of information important?

- How can a machine know that one piece of information is important and another is uninteresting?

- Will TexLexAn be able to analyse this mass of texts?

- Will it be able to extract the most pertinent, important and useful information from this mass of texts?

Now let's answer them:

- I will answer the first question at the end; let's start with the second. The question is pretty naïve. The information that matters concerns the future of the enterprise: its competitiveness, its safety, compliance, nuisances, growth... This information can be categorized and ranked by value.

- For the third question, the program can search for cue words to detect the sentences carrying important information, it can use a classifier to extract the relevant sentences, and it can use a list of keywords extracted from the corpus to extract the most significant sentences.
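The cue-word idea can be sketched very simply: score each sentence by the weighted cue words it contains, and keep the sentences above a threshold. The cue list and weights below are hypothetical examples; TexLexAn's real lists and its classifier are of course more elaborate.

```python
# Hypothetical cue words with weights (the real lists are larger).
CUE_WORDS = {
    "risk": 3, "deadline": 3, "competitor": 3, "urgent": 3,
    "compliance": 2, "growth": 2, "decision": 2, "important": 2,
}

def score_sentence(sentence):
    """Sum the weights of the cue words found in the sentence."""
    words = sentence.lower().split()
    return sum(CUE_WORDS.get(w.strip(".,;:!?"), 0) for w in words)

def extract_relevant(text, threshold=3):
    """Keep the sentences whose cue-word score reaches the threshold."""
    sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
    return [s for s in sentences if score_sentence(s) >= threshold]
```

A sentence like "A competitor announced an urgent price cut" scores high and is kept, while small talk scores zero and is dropped.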

- Concerning the fourth question, TexLexAn cannot analyse the totality of the documents as one whole document over a long period of time: the volume of texts, sentences and words to analyse would quickly make the computation time unacceptable. But the documents in the stream can be summarized one by one, the summaries over a period of time can be compiled into a single document, and finally this document can be analysed.
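This two-stage pipeline can be sketched as follows. The `summarize` function here is a hypothetical stand-in for TexLexAn's sentence extractor: to keep the sketch self-contained it naively keeps the first sentences of a text, which is not what the real program does.

```python
def summarize(text, max_sentences=2):
    """Toy summarizer: keep the first sentences (stand-in for TexLexAn)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

def summarize_stream(documents, max_sentences=5):
    """Summarize each document, compile the summaries into one text,
    then summarize that compilation."""
    compiled = "\n".join(summarize(doc) for doc in documents)
    return summarize(compiled, max_sentences)
```

The point is the structure, not the toy summarizer: the second stage works on a text that is already much smaller than the original stream, so the computation stays tractable.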


- The answer to the fifth question: the current versions of TexLexAn extract the most relevant sentences and generate a summary from them, using the classifier, the cue words or a list of keywords to find the relevant sentences. Because it can do the same job with a list of summaries, it will be able to extract the most pertinent and important information from a mass of texts. As I explained above, it will not work directly with the documents but with their summaries.

- Now I can answer the first question. We can imagine that TexLexAn is installed on the servers of the enterprise and summarizes the stream of documents it is able to recognize (text, html, msdoc, odt, pdf, ppt) circulating on the intranet. It then produces a file containing the summaries of the texts analysed, plus the date and time of each analysis (the current version does this during the archiving operation). Finally, this file of summaries is analysed and summarized at headquarters. The new summary could cover a period of one day, one week, two weeks or one month, for instance, and consequently it will report the most important information of the period considered.

TODO: The file containing all the summaries already exists in the folder texlexan_archive (classification.lst). I have to add a new function to TexLexAn allowing it to analyse and summarize this file between two dates. This is not very difficult!
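Selecting the summaries between two dates could look like the sketch below. I am assuming here that each line of the file begins with a YYYY-MM-DD timestamp; the actual line format of classification.lst may differ, so this is only an illustration of the filtering step.

```python
from datetime import date

def select_period(lines, start, end):
    """Keep the summary lines whose leading date falls in [start, end].
    Assumes each line starts with a YYYY-MM-DD timestamp; the real
    format of classification.lst may differ."""
    selected = []
    for line in lines:
        try:
            d = date.fromisoformat(line[:10])
        except ValueError:
            continue  # skip lines without a leading date
        if start <= d <= end:
            selected.append(line)
    return selected
```

The selected lines can then be compiled into one text and passed to the summarizer, as described above.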
