Saturday, January 16, 2010

Results window


This post continues my previous one with a description of the results window. The window is split into two parts. The left panel shows the text returned by the texlexan engine; its most useful information is the classification results and the list of the most relevant sentences extracted from the summaries.


The right panel displays the classification results as bar graphs, which makes it possible to spot the most significant results quickly.

Example:

Number of summaries extracted: 42

Grade:  23% Class: en.text-technic-computer
Grade:  25% Class: en.text-technic-computer-text_mining
Grade:  10% Class: en.text-technic-computer-processor
Grade:   7% Class: en.text-technic-computer-machine_learning
Grade:   6% Class: en.text-technic-computer-operating_system
Grade:   7% Class: en.text-law_agreement-international
Grade:   2% Class: en.text-technic-computer-memory
Grade:   4% Class: en.text-health
Grade:   3% Class: en.text-technic-computer-programming
Grade:   3% Class: en.text-technic-computer-unclassified
Grade:   3% Class: en.text-knowledge_management
Grade:   2% Class: en.text-technic-computer-artificial_intelligence
Grade:   1% Class: en.text-health-drug
Grade:   1% Class: en.text-science-chemistry-chemical
Grade:   0% Class: en.text-food-fruit

The first line tells us that 42 summaries were extracted and analysed, which means that 42 documents were analysed, summarized and archived over the period considered.
The classification list shows that the majority of the documents were about computers, text mining, processors, machine learning, operating systems and international agreements.
Be careful when the grade (a pseudo-probability) of a classification is low: such a classification is likely to be erroneous and simply due to noise. The result "1% Class: en.text-health-drug" is an example.
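To make the point about low grades concrete, here is a minimal Python sketch that reads "Grade: ... Class: ..." lines like the ones above, separates the low grades from the rest, and draws a rough text bar for each class. It is not part of texlexan: the function names are hypothetical, and the 5% cut-off is only an assumption chosen for illustration.

```python
import re

# Matches lines such as "Grade:  23% Class: en.text-technic-computer"
GRADE_LINE = re.compile(r"Grade:\s*(\d+)%\s+Class:\s*(\S+)")


def parse_grades(text):
    """Return (grade, class_name) pairs found in the engine's text output."""
    return [(int(grade), name) for grade, name in GRADE_LINE.findall(text)]


def split_by_threshold(grades, threshold=5):
    """Separate grades at or above the threshold from probable noise.

    The default of 5% is an arbitrary assumption, not a texlexan setting.
    """
    kept = [(g, c) for g, c in grades if g >= threshold]
    noise = [(g, c) for g, c in grades if g < threshold]
    return kept, noise


def bar(grade, width=50):
    """Draw a text bar whose length is proportional to the grade."""
    return "#" * (grade * width // 100)


# A few lines copied from the example output above.
sample = """\
Grade:  23% Class: en.text-technic-computer
Grade:  25% Class: en.text-technic-computer-text_mining
Grade:   7% Class: en.text-law_agreement-international
Grade:   1% Class: en.text-health-drug
"""

if __name__ == "__main__":
    kept, noise = split_by_threshold(parse_grades(sample))
    for grade, name in sorted(kept, reverse=True):
        print(f"{grade:3d}% {bar(grade):<13} {name}")
    for grade, name in noise:
        print(f"{grade:3d}% (low grade, possibly noise) {name}")
```

With this cut-off, "en.text-health-drug" lands in the probable-noise group while the three other classes are kept. A different threshold may suit a different set of documents better.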


The next interesting part of the results is the set of sentences extracted from the summaries:
Additionally, some use these terms to refer only to multi-core
microprocessors that are manufactured on the same integrated circuit
die .These people generally refer to separate microprocessor dies in
the same package by another name, such as multi-chip module .This
article uses both the terms "multi-core" and "dual-core" to
reference microelectronic CPUs manufactured on the same integrated
circuit, unless otherwise noted.

There are only multi-threaded managed runtimes means when it loads
an single threaded managed app the runtime itself creates multi
threads for its own purpose, right ? A: The multi-threading managed
runtime takes care of creating multiple threads as needed by the
application.

Others, generally seeking more compact and stable methods for
indexing highly diverse sources for which full, word-based indexes
are often unavailable, have explored higher-level indexing methods
including free-text and controlled-vocabulary metadata schemes,
semantic representations, and query-based indexing with training
sets.

Hierarchical Indexing Hierarchical indexing is a method of indexing
large documents at several levels of structure, so that a retrieval
system can pinpoint the most relevant sections within each document.

For document retrieval, hierarchical concept-based indexing and
document sectioning show promise for improving on word indexing alone.



Total of cue words found 431/20758 not found

The tag "business" indicates that the cue words belong to the business category, and only 431 cue words were found in the summaries analysed.


In theory, the sentences extracted above represent the main information expressed in the 42 documents analysed. Unfortunately, because the extraction is based on a purely statistical method, a few non-relevant sentences are sometimes extracted. I will explain the reason in more detail in a later post.
