The time spent obtaining knowledge from human experts can greatly increase the cost of developing an expert system. One approach that is often used is to learn a decision tree from a table of examples. This approach is based on Quinlan's algorithm known as ID3.
Ref.: J. R. Quinlan, "Discovering rules by induction from large collections of examples", in D. Michie (ed.), Expert Systems in the Microelectronic Age, pp. 168-201, Edinburgh University Press, 1979.
ID3 takes a table of examples as input and produces a decision tree as output. We shall describe ID3 with the aid of the following example.
Example:
A recruitment agency which specialises in recruiting trainee programmers wants to develop a decision tree to filter out unsuitable applicants. The agency uses three criteria to make this decision: programming ability (Prog.), communication ability (Com.) and presentation (Pres.). Suppose the agency provides the examples in the following table:
[Table of example candidates: values of Prog., Com. and Pres. together with the agency's Decision for each candidate.]
1- Choose an attribute as the root. Only attributes that have not already been used as an ancestor on the current path are eligible. Initially, all attributes except the "Decision" attribute are eligible.
2- Create an arc for each possible value of the root, and label each arc with its value.
3- Partition the table into sub-tables, one for each arc. The sub-table for a particular arc contains exactly those examples whose value for the root attribute matches the arc's label.
4- Repeat the process on those sub-tables whose "Decision" attribute still takes more than one value. (A sketch of this procedure is given below.)
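A minimal sketch of these four steps in Python (my own illustration, not Quinlan's code), assuming each example is a dict that maps attribute names, including "Decision", to values, and that the examples are consistent; the attribute-selection rule is passed in as a parameter so that the entropy-based choice described below can be plugged in:

    # Minimal ID3 sketch: examples are dicts with a "Decision" key.
    def id3(examples, attributes, choose_attribute):
        decisions = {ex["Decision"] for ex in examples}
        # Step 4 stopping case: all examples share one decision -> leaf.
        if len(decisions) == 1:
            return decisions.pop()
        # Step 1: pick a root among the attributes not yet used on this path.
        root = choose_attribute(examples, attributes)
        remaining = [a for a in attributes if a != root]
        # Steps 2 and 3: one labelled arc and one sub-table per root value.
        tree = {root: {}}
        for value in {ex[root] for ex in examples}:
            subtable = [ex for ex in examples if ex[root] == value]
            tree[root][value] = id3(subtable, remaining, choose_attribute)
        return tree

For instance, choose_attribute could simply return attributes[0], which would reproduce the arbitrary root choices illustrated next, or it could pick the attribute with the least conditional entropy, as discussed below.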
[Figure: decision tree when Prog. is chosen as the root.]
Alternatively, we can construct the following decision tree taking Pres. as the root.
[Figure: decision tree when Pres. is chosen as the root.]
The figures above show two different decision trees that result when different attributes are chosen as the root. Notice that although both trees are consistent with the examples, they are not equivalent. Thus, for example, a candidate who is neat but a liar and a poor programmer is accepted by the second tree, whereas that candidate would rightly be rejected by the first tree. This problem arises in the second tree because we chose an attribute which does not give much information about the final classification.
So the question is "How can we choose an attribute which
will result in a good tree?"
- We should choose an attribute which gives the most information.
- Information Theory provides a suitable measure.
Information can be measured by the extent to which it clarifies the unknown: if we are told something we already know, we gain little or no information.
The average information per message is called the entropy, H. H is a measure of the amount of information in a message, based on the logarithm of the number of equally likely messages. For a discrete probability distribution p(x) over the possible outcomes (e.g. yes/no), the entropy of the variable X is

    H(X) = - Σx p(x) log2 p(x)

and H(X) is also a measure of the uncertainty about X.
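As a rough illustration (this helper is my own, not part of the notes), the definition can be computed directly:

    import math

    def entropy(probabilities):
        # H = - sum of p * log2(p); zero-probability outcomes contribute nothing.
        return sum(-p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # 1.0 bit: a fair yes/no answer is maximally informative
    print(entropy([1.0, 0.0]))  # 0.0 bits: a foregone conclusion tells us nothing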
For example, how much information would we gain by knowing whether the decision is a yes or a no if we already know how good the candidate is at programming?
The conditional entropy, or equivocation, H(C/A) is a measure of the uncertainty that remains about the decision C once the value of the attribute A is known:

    H(C/A) = Σj p(aj) H(C/aj)

where

    H(C/aj) = - Σi p(ci/aj) log2 p(ci/aj)

and the aj are the n values of the attribute A, and the ci are the m values of the decision C.
For example, we can calculate H(Decision/Prog) by first using the latter formula to obtain H(Decision/aj) for each value aj of programming ability:
[Table: H(Decision/aj) for each value aj of programming ability.]
and then using the former formula to combine these into H(Decision/Prog).
Similarly we obtain:
H(Decision/Comm) = 0.325
H(Decision/Pres) = 0.827
Thus we know that communication ability (Comm) leaves us the least uncertainty about the final decision, so it is chosen as the root.
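A sketch of how such a value might be computed (again an illustrative helper of my own, following the two formulas above):

    import math
    from collections import Counter

    def conditional_entropy(examples, attribute):
        # H(Decision/A) = sum over values aj of p(aj) * H(Decision/aj)
        total = len(examples)
        h = 0.0
        for value in {ex[attribute] for ex in examples}:
            subset = [ex for ex in examples if ex[attribute] == value]
            counts = Counter(ex["Decision"] for ex in subset)
            # H(Decision/aj) = - sum over ci of p(ci/aj) * log2 p(ci/aj)
            h_value = sum(-(c / len(subset)) * math.log2(c / len(subset))
                          for c in counts.values())
            h += (len(subset) / total) * h_value
        return h

A selection rule built on this, e.g. lambda exs, attrs: min(attrs, key=lambda a: conditional_entropy(exs, a)), could be passed to the id3 sketch above so that the tree grows as in the hand calculation.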
We now partition the examples into sub-tables, one for each value of the chosen root attribute.
TABLE-1
[Sub-table of examples for one value of the root attribute.]
TABLE-2
[Sub-table of examples for one value of the root attribute.]
TABLE-3
[Sub-table of examples for one value of the root attribute.]
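Since the rows of the sub-tables are not reproduced here, the snippet below uses invented rows purely to show the mechanics of the partitioning step; the values are hypothetical and are not the agency's data:

    # Hypothetical rows, invented for illustration only.
    examples = [
        {"Prog.": "good", "Com.": "good", "Pres.": "neat",    "Decision": "yes"},
        {"Prog.": "poor", "Com.": "poor", "Pres.": "neat",    "Decision": "no"},
        {"Prog.": "good", "Com.": "poor", "Pres.": "scruffy", "Decision": "no"},
        {"Prog.": "poor", "Com.": "good", "Pres.": "scruffy", "Decision": "yes"},
    ]

    # Step 3: one sub-table per value of the chosen root attribute.
    root = "Com."
    subtables = {}
    for ex in examples:
        subtables.setdefault(ex[root], []).append(ex)

    for value, rows in subtables.items():
        print(f"Sub-table for {root} = {value}: {len(rows)} example(s)")

In these invented rows both sub-tables happen to agree on the Decision and would become leaves; any sub-table with mixed decisions would be processed recursively, as in step 4 of the algorithm.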