TexLexAn: June 2010

Thursday, June 24, 2010

Formatting the data

This is not the most complex task. The main goal is to remove meaningless information.

For instance, articles, very common words and formatting characters are removed from a text document. In an audio recording, low-intensity sounds of high or low frequency that are masked by a middle-frequency sound are ignored. Edges of a still image are detected and extracted. Motion in video is detected and quantified.

In the case of text, we use a small dictionary containing the words to suppress and a simple algorithm to suppress paragraph numbers, tabulation, indentation marks... The words can also be simplified: plurals can be converted to singular, or words can be stemmed or lemmatized.
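As a rough illustration of this text step, here is a minimal Python sketch. The stop list, the `clean_text` and `naive_singular` names, and the plural rule are assumptions made for the example; they are not the dictionary or the code used by TexLexAn, and the plural stripping is far cruder than a real stemmer or lemmatizer.

```python
import re

# Hypothetical stop list; a real dictionary would be much larger.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "to"}


def naive_singular(word):
    """Very rough plural stripping; not a real stemmer or lemmatizer."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_text(text):
    """Return the simplified, meaningful words of a text."""
    # Drop paragraph numbers such as "1." or "2)" at the start of a line,
    # together with any leading tabs or spaces.
    text = re.sub(r"(?m)^[ \t]*\d+[.)][ \t]+", "", text)
    # Keep alphabetic tokens only; this also discards tabs and indentation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove the words to suppress, then simplify what is left.
    return [naive_singular(t) for t in tokens if t not in STOP_WORDS]
```

For example, `clean_text("1. The pears are in the trees.")` keeps only the simplified words "pear" and "tree".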

In the case of audio, we use a codec based on simultaneous and temporal masking to suppress sounds that the ear cannot discriminate.

In the case of a picture, we can apply Canny edge detection. http://www.pages.drexel.edu/~weg22/can_tut.html

In the case of video, we can detect motion by comparing the changes between consecutive frames. http://www.codeproject.com/KB/audio-video/Motion_Detection.aspx
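The frame comparison can be sketched in a few lines of Python. This is only an illustration of simple frame differencing, not the method of the linked article: frames are assumed to be lists of rows of grey levels (0 to 255), and the `motion_score` name and the threshold of 25 are arbitrary choices for the example.

```python
def motion_score(prev_frame, next_frame, threshold=25):
    """Fraction of pixels whose grey level changed by more than `threshold`.

    Both frames are assumed to be lists of rows with identical dimensions.
    """
    changed = 0
    total = 0
    for row_prev, row_next in zip(prev_frame, next_frame):
        for p, q in zip(row_prev, row_next):
            total += 1
            if abs(p - q) > threshold:
                changed += 1
    return changed / total if total else 0.0
```

A score close to 0 means the two frames are nearly identical; a higher score quantifies how much of the image moved.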

An additional step in the data formatting is to compute the relative values of the parameters in order to create a kind of invariant pattern. For instance, we compute the relative frequency of words (1), the relative frequency and amplitude of sounds (2), and the relative size and position of shapes (3). This operation makes the information to analyse independent of the size of the text (1), of the pitch and volume of the speaker (2), and of the distance of the objects in the image (3).
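For the text case (1), the relative frequency is simply each word count divided by the total number of words. A minimal sketch, where the `relative_frequencies` name is chosen for the example:

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of the total number of words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}
```

Repeating the same text twice doubles every count but leaves the relative frequencies unchanged, which is what makes the pattern independent of the size of the text.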

Sunday, June 20, 2010

How to mimic our brain!

In my previous post, I tried to explain that a large knowledge base with the appropriate algorithms could mimic our brain's reasoning. But what could these algorithms be?

In a naïve approach, the simplest form of reasoning program will have to:
1 - format the input facts.
2 - retrieve similar or related facts in the knowledge base.
3 - retrieve the rules linked with the retrieved facts in the knowledge base.
4 - search for synonyms and repeat steps 2 and 3.
5 - check for incoherences in the retrieved facts and rules and mark them "check it!"
6 - make inferences between facts and rules.
7 - feed the conclusions back into step 1 (the conclusion becomes an input fact) until the conclusion matches our criteria.

This is a very rough description of the different steps and algorithms required to mimic our brain's reasoning. I will detail each step in future posts.

Friday, June 18, 2010

When knowledge size matters

We are able to estimate the risk of a fruit falling on our head because of the observations we have accumulated since childhood. As young children, without any knowledge of gravity, we know that any object held between our fingers will fall when we release it. Later, we learn that fruits fall from the tree when it is windy or when they are ripe. If a program has this knowledge too, logically it will be able to estimate the same risk. The reasoning will be a succession of inferences. For example, the database contains these facts and rules (the probability that each fact is true is indicated in percent):

1 - 100% - A pear is a fruit.
2 - 100% - A pear grows in the top of a tree.
3 - 100% - The top of a tree is above the ground surface.
4 - 100% - A thing above the ground surface falls when released.
5 - 100% - A pear is a thing.
6 - 80% - A fruit is released when it is ripe.
7 - 60% - A pear is ripe in October.
8 - 100% - We are in October.

The inferences are:

(1,6 => 9) A pear is a fruit + A fruit is released when it is ripe = A pear is released when it is ripe.

(9,7,8 => 10) A pear is released when it is ripe + A pear is ripe in October + We are in October = A pear is released (0.8 x 0.6 x 1 = 0.48).

(4,5 => 11) A thing above the ground surface falls when released + A pear is a thing = A pear above the ground surface falls when released.

(11,10 => 12) A pear above the ground surface falls when released + A pear is released = A pear above the ground surface falls (1 x 0.48 = 0.48).

(2,3 => 13) A pear grows in the top of a tree + The top of a tree is above the ground surface = A pear is above the ground surface.

(12,13 => 14) A pear above the ground surface falls + A pear is above the ground surface = A pear falls (0.48 x 1 = 0.48).

So the set of facts and rules above leads to the conclusion that a pear could fall with a probability of 48%.
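The arithmetic of this chain can be replayed with a short Python sketch. It does not do any real logical matching: the premises of each inference are given by hand, exactly as numbered above, and the probability of a conclusion is taken as the product of the probabilities of its premises, as in the example. The names `FACTS`, `INFERENCES` and `chain` are chosen for the illustration.

```python
from math import prod

# Facts 1 to 8 from the list above: id -> (statement, probability).
FACTS = {
    1: ("A pear is a fruit", 1.0),
    2: ("A pear grows in the top of a tree", 1.0),
    3: ("The top of a tree is above the ground surface", 1.0),
    4: ("A thing above the ground surface falls when released", 1.0),
    5: ("A pear is a thing", 1.0),
    6: ("A fruit is released when it is ripe", 0.8),
    7: ("A pear is ripe in October", 0.6),
    8: ("We are in October", 1.0),
}

# Inferences 9 to 14: (premise ids, new id, conclusion), in the order given.
INFERENCES = [
    ((1, 6), 9, "A pear is released when it is ripe"),
    ((9, 7, 8), 10, "A pear is released"),
    ((4, 5), 11, "A pear above the ground surface falls when released"),
    ((11, 10), 12, "A pear above the ground surface falls"),
    ((2, 3), 13, "A pear is above the ground surface"),
    ((12, 13), 14, "A pear falls"),
]


def chain(facts, inferences):
    """Apply each inference in order, multiplying the premise probabilities."""
    known = dict(facts)
    for premises, new_id, text in inferences:
        known[new_id] = (text, prod(known[i][1] for i in premises))
    return known
```

Looking up entry 14 in the result of `chain(FACTS, INFERENCES)` gives the conclusion "A pear falls" with a probability of about 0.48. Multiplying probabilities this way treats the premises as independent, which is a simplification.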

This short example shows that software is capable of reasoning, but this requires an extensive and precise factual and procedural knowledge base.

But what happens when the knowledge base is incomplete, a fact is wrong or uncertain, or different words are used for the same thing?

For example:

a) What if fact 6 is missing: "A fruit is released when it is ripe"?

b) What if fact 7 is wrong: "A pear is ripe in June"?

c) What if facts 3 and 4 use two different terms, soil and ground surface, for the same thing: "The top of a tree is above the soil" and "A thing above the ground surface falls when released"?

As for a human, it will depend on the quantity of knowledge stored in the database.

For problem (a), the software can proceed by analogy. For instance, the two facts below lead to the hypothesis that a pear could fall like an apple:
- An apple falls from the tree when it is ripe.
- Apples and pears grow on trees.

For (b), the software can detect an error in the knowledge base, for instance if the knowledge base contains these facts:
- Anjou, Bartlett, Bosc, Comice are pear varieties.
- Comice pears are harvested September through February.
- Bosc pears are harvested September through April.
- Bartlett pears are harvested July through October.
The fact (b) is then inconsistent with the facts listed above, so the program will warn that there is an incoherence in its knowledge base and ask for fact (b) to be checked.
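The check for (b) could be sketched as follows in Python. The harvest periods are the ones listed above; the names `HARVEST`, `months_between` and `is_coherent` are made up for the example, and a real knowledge base would of course not store its facts in a hard-coded table.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Harvest periods taken from the facts listed above (first month, last month).
HARVEST = {
    "Comice": ("September", "February"),
    "Bosc": ("September", "April"),
    "Bartlett": ("July", "October"),
}


def months_between(start, end):
    """Months from `start` through `end` inclusive, wrapping past December."""
    i, j = MONTHS.index(start), MONTHS.index(end)
    span = (j - i) % 12 + 1
    return {MONTHS[(i + k) % 12] for k in range(span)}


def is_coherent(claimed_month):
    """True if at least one known variety is harvested in `claimed_month`."""
    known = set()
    for start, end in HARVEST.values():
        known |= months_between(start, end)
    return claimed_month in known
```

With these facts, `is_coherent("June")` is false, so the program would mark "A pear is ripe in June" as a fact to check, while `is_coherent("October")` is true.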

For (c), the software can search for synonyms of soil and ground, and then decide that both words designate the same thing.
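One simple way to handle (c) is to rewrite every fact with a single canonical term before comparing facts. The sketch below assumes a small hand-written synonym table (`CANONICAL`) and a `normalize` helper, both hypothetical; a real system would look synonyms up in a much larger dictionary.

```python
import re

# Hypothetical synonym table: each term maps to one canonical term.
CANONICAL = {
    "ground surface": "ground surface",
    "soil": "ground surface",
    "ground": "ground surface",
}

# Try longer terms first so that "ground surface" is not matched as "ground".
_TERMS = sorted(CANONICAL, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in _TERMS) + r")\b",
    re.IGNORECASE,
)


def normalize(fact):
    """Replace every known synonym in `fact` by its canonical term."""
    return _PATTERN.sub(lambda m: CANONICAL[m.group(1).lower()], fact)
```

After normalization, "The top of a tree is above the soil" becomes "The top of a tree is above the ground surface", so facts 3 and 4 share the same term and the inference can proceed.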

Like a human, a program is able to tolerate some incoherence in its knowledge base and to ask for some weird facts to be verified.

But what kind of algorithms are required for this pseudo-human reasoning?

To be continued...

Thursday, June 17, 2010

Facts and Rules

My work to improve the dictionaries of TexLexAn, which are in fact its knowledge base, brings me to a very basic question: what is knowledge?

We can distinguish two kinds of knowledge:

The procedural knowledge or the rules to do things.
The declarative knowledge or the facts.

In the computer world, procedural knowledge is represented by rule-based expert systems such as online diagnosis programs, and declarative knowledge is the foundation of databases, the phone directory being the simplest form.

Today, both forms of knowledge are managed by two completely different kinds of programs, but our brain does not work like that! It seems evident that facts and rules are intimately mixed. There is a good reason for it: we construct our rules from our observations (the facts). These rules can be very dependent on the observations, but we try to generalize them. We call this inductive reasoning.

We can imagine this funny situation: in the middle of October, while we are reading under an apple tree, an apple falls on our head, so we decide to move under another tree, in fact a pear tree. Pears are not apples, but we are cautious because we infer that if an apple fell, there is a chance that a pear can fall too. Can a program form the same hypothesis?

to be continued...
