INDUCTION - LEARNING FROM EXAMPLES

   The time spent obtaining knowledge from human experts can greatly increase the cost of developing an expert system. One approach that is often used instead is to learn a decision tree from a table of examples. This approach is based on Quinlan's algorithm known as ID3.

   Ref: J.R. Quinlan, "Discovering rules by induction from large collections of examples", in D. Michie (ed.), Expert Systems in the Micro-Electronic Age, pp. 168-201, Edinburgh University Press, 1979.

   ID3 takes a table of examples as input and produces a decision tree that is consistent with those examples. We shall describe ID3 with the aid of the following example.

Example:
   A recruitment agency which specialises in recruiting trainee programmers wants to develop a decision tree to filter out unsuitable applicants.
   The agency uses three criteria to make this decision: programming ability (Prog.), communication ability (Com.) and presentation (Pres.).
   Suppose the agency provides the examples in the following table:
 
No    Prog.      Com.      Pres.      Decision
1     good       ok        neat       yes
2     average    ok        fair       yes
3     poor       liar      scruffy    no
4     average    liar      fair       no
5     good       boring    scruffy    yes
6     good       ok        scruffy    yes
7     average    boring    scruffy    no
8     good       liar      scruffy    no
9     poor       boring    fair       no
10    poor       boring    scruffy    no
 
   A decision tree which is consistent with these examples can be obtained by applying the following procedure:

1- Choose an attribute as the root. Only those attributes that have not already been used as an ancestor of the current node are eligible. Initially, all attributes except the "Decision" attribute are eligible.

2- Create an arc for each possible value of the root attribute, and label each arc with its value.

3- Partition the table into sub-tables, one for each arc. The sub-table for a particular arc contains exactly those examples whose root attribute has the value labelling that arc.

4- Repeat the process on those sub-tables whose "Decision" attribute still takes more than one value (a code sketch of the whole procedure is given below).
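
   A minimal sketch of this procedure in Python, assuming each example is a dictionary keyed by attribute name with a "Decision" entry (the names build_tree and choose are illustrative; choose defaults to "first eligible attribute" here, and an entropy-based choice, as discussed further below, can be passed in instead):

    def build_tree(examples, attributes, choose=lambda exs, attrs: attrs[0]):
        """Recursively apply steps 1-4 to a list of example dictionaries."""
        decisions = {ex["Decision"] for ex in examples}
        if len(decisions) == 1:                        # step 4: sub-table is already pure
            return decisions.pop()                     # leaf node: the common decision
        root = choose(examples, attributes)            # step 1: pick an eligible attribute
        tree = {root: {}}
        for value in {ex[root] for ex in examples}:    # step 2: one labelled arc per value
            subset = [ex for ex in examples if ex[root] == value]    # step 3: partition
            remaining = [a for a in attributes if a != root]         # root is now an ancestor
            tree[root][value] = build_tree(subset, remaining, choose)
        return tree

   Called as build_tree(table, ["Prog.", "Com.", "Pres."]) on the ten examples above, it returns a nested dictionary representing the tree.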

[Figure: decision tree obtained when Prog. is chosen as the root]

   Alternatively, we can construct the following decision tree taking Pres. as the root:

[Figure: decision tree obtained when Pres. is chosen as the root]

   The figures above show two different decision trees that result when different attributes are chosen as the root. Notice that although both trees are consistent with the examples, they are not equivalent. For example, a candidate who is neat but a liar and a poor programmer is accepted by the second tree, whereas that candidate would rightly be rejected by the first.
   This problem occurs in the second tree because we chose an attribute (Pres.) which does not give much information about the final classification.
   So the question is: "How can we choose an attribute which will result in a good tree?"
- We should choose an attribute which gives the most information.
- Information Theory provides a suitable measure.
 

    Information can be measured by the extent to which it reduces our uncertainty about the unknown: if we are told something we already know, we gain little or no information.

   The average information per message is called entropy (H).
   H is a measure of the amount of information in a message, based on the logarithm of the number of equally likely messages.

   For a discrete probability distribution over the values xi of a variable X (e.g. a yes/no decision), the entropy is

                  H(X) = - Σi P(xi) log2 P(xi)

H(X) is also a measure of the uncertainty about X.
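   For instance, over all ten examples above the decision is "yes" in 4 cases and "no" in 6, so

                  H(Decision) = -(4/10 log2 4/10 + 6/10 log2 6/10) ≈ 0.971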
   For example, how much information would we gain by knowing whether the decision is a yes or a no if we already knew how good the candidate is at programming?
   Conditional entropy (or equivocation) H(C/A) measures the uncertainty about the decision C that remains once the value of the attribute A is known:

                  H(C/A) = Σj P(aj) H(C/aj)

where

                  H(C/aj) = - Σi P(ci/aj) log2 P(ci/aj)

and the "aj" are the n values of the attribute A, and the "ci" are the m values of the decision C.

   For example, we can calculate H(Decision/Prog) by first using the latter formula to obtain H(Decision/aj) for each value aj of programming ability.
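   For instance, aj = good covers 4 of the 10 examples, 3 of them "yes" and 1 "no", so

                  H(Decision/good) = -(3/4 log2 3/4 + 1/4 log2 1/4) ≈ 0.811

which is the first entry in the following table.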
 
 
aj        P(yes/aj)   P(no/aj)   P(aj)   H(Decision/aj)
good      3/4         1/4        4/10    0.811
average   1/3         2/3        3/10    0.919
poor      0/3         3/3        3/10    0.000

and then use the former formula to obtain:

                  H(Decision/Prog) = 4/10 x 0.811 + 3/10 x 0.919 + 3/10 x 0.000 ≈ 0.600

Similarly we obtain:
                              H(Decision/Comm) = 0.325
                              H(Decision/Pres) = 0.827
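
   These figures can be cross-checked with a short piece of Python; the sketch below (the names entropy and conditional_entropy are illustrative) implements the two formulas above and evaluates them on the ten examples:

    from math import log2

    # The ten examples from the table above, as (Prog., Com., Pres., Decision).
    ROWS = [
        ("good", "ok", "neat", "yes"),          ("average", "ok", "fair", "yes"),
        ("poor", "liar", "scruffy", "no"),      ("average", "liar", "fair", "no"),
        ("good", "boring", "scruffy", "yes"),   ("good", "ok", "scruffy", "yes"),
        ("average", "boring", "scruffy", "no"), ("good", "liar", "scruffy", "no"),
        ("poor", "boring", "fair", "no"),       ("poor", "boring", "scruffy", "no"),
    ]
    TABLE = [dict(zip(("Prog.", "Com.", "Pres.", "Decision"), row)) for row in ROWS]

    def entropy(probs):
        # H = -sum p log2 p, with the convention 0 log2 0 = 0
        return -sum(p * log2(p) for p in probs if p > 0)

    def conditional_entropy(examples, attribute):
        # H(Decision/A) = sum over the values aj of P(aj) * H(Decision/aj)
        total = len(examples)
        h = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [ex for ex in examples if ex[attribute] == value]
            p_yes = sum(ex["Decision"] == "yes" for ex in subset) / len(subset)
            h += (len(subset) / total) * entropy([p_yes, 1 - p_yes])
        return h

    for a in ("Prog.", "Com.", "Pres."):
        print(a, round(conditional_entropy(TABLE, a), 3))
    # prints approximately: Prog. 0.6, Com. 0.325, Pres. 0.826 (0.827 above is just rounding)

    best = min(("Prog.", "Com.", "Pres."), key=lambda a: conditional_entropy(TABLE, a))
    print(best)   # -> Com., the attribute leaving the least uncertainty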

   Thus communication ability (Com.) gives us the least remaining uncertainty about the final decision, so we choose it as the root.
   We now partition the examples into sub-tables, one for each value of Com.

                                 TABLE-1
No    Prog.      Pres.      Decision
1     good       neat       yes
2     average    fair       yes
6     good       scruffy    yes
                      (when Com. is "ok")

                                  TABLE-2
5.
good
scruffy
yes
7.
ave
scruffy
no
9.
poor
fair
no
10.
poor
scruffy
no
                       When applicant is BORING

                                 TABLE-3
No    Prog.      Pres.      Decision
3     poor       scruffy    no
4     average    fair       no
8     good       scruffy    no
                      (when Com. is "liar")
 
   The decision for every example in Table-1 is YES, while the decision for every example in Table-3 is NO, so these two branches become leaves of the tree.
We now repeat the process on Table-2 and find that:
                        H(Decision/Prog) = 0
                        H(Decision/Pres) = 0.69
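These values can be checked directly from Table-2: partitioning on Prog. gives sub-tables whose decisions are all "yes" or all "no", so H(Decision/Prog) = 0, while partitioning on Pres. gives

                        H(Decision/Pres) = 3/4 x ( -(1/3 log2 1/3 + 2/3 log2 2/3) ) + 1/4 x 0 ≈ 0.69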
Hence we choose programming ability as the next criterion and find that no further partitioning is necessary. The resulting decision tree, which also rightly rejects the "neat liar", is given below:
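
In plain-text form, the tree reconstructed from Tables 1-3 is:

    Com.?
       ok ......... YES
       liar ....... NO
       boring ..... Prog.?
                       good ...... YES
                       average ... NO
                       poor ...... NO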



These are the notes of Dr. Mehmet Tolun, written up for the web by Mehmet Nuri Cankaya.