Decision Tree Learning

[read Chapter 3]

[recommended exercises 3.1, 3.4]

- Decision tree representation

- ID3 learning algorithm

- Entropy, Information gain

- Overfitting

Lecture slides for textbook Machine Learning, Tom M. Mitchell, McGraw Hill, 1997

Decision Tree for PlayTennis

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


A Tree to Predict C-Section Risk

Learned from medical records of 1000 women

Negative examples are C-sections

[833+,167-] .83+ .17-

Fetal_Presentation = 1: [822+,116-] .88+ .12-

| Previous_Csection = 0: [767+,81-] .90+ .10-

| | Primiparous = 0: [399+,13-] .97+ .03-

| | Primiparous = 1: [368+,68-] .84+ .16-

| | | Fetal_Distress = 0: [334+,47-] .88+ .12-

| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-

| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-

| | | Fetal_Distress = 1: [34+,21-] .62+ .38-

| Previous_Csection = 1: [55+,35-] .61+ .39-

Fetal_Presentation = 2: [3+,29-] .11+ .89-

Fetal_Presentation = 3: [8+,22-] .27+ .73-


Decision Trees

Decision tree representation:

- Each internal node tests an attribute

- Each branch corresponds to an attribute value

- Each leaf node assigns a classification

How would we represent (see the sketch below):

- ∧, ∨, XOR

- (A ∧ B) ∨ (C ∧ ¬D ∧ E)

- M of N
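
For concreteness (not from the original slides), here is how A XOR B can be written as a decision tree: test A at the root, test B on each branch, and put the classification at the leaves. A minimal Python sketch, with a hypothetical Node class of my own:

    # Minimal sketch (illustrative): a decision tree that computes A XOR B.
    class Node:
        def __init__(self, attribute=None, branches=None, label=None):
            self.attribute = attribute      # attribute tested at this internal node
            self.branches = branches or {}  # attribute value -> child node
            self.label = label              # classification stored at a leaf

        def classify(self, example):
            if self.attribute is None:      # leaf: return its classification
                return self.label
            return self.branches[example[self.attribute]].classify(example)

    # A XOR B: test A at the root, then test B on each branch.
    xor_tree = Node("A", {
        False: Node("B", {False: Node(label=False), True: Node(label=True)}),
        True:  Node("B", {False: Node(label=True),  True: Node(label=False)}),
    })

    assert xor_tree.classify({"A": True, "B": False}) is True
    assert xor_tree.classify({"A": True, "B": True}) is False

(A ∧ B) ∨ (C ∧ ¬D ∧ E) and M-of-N functions can be encoded the same way, although the resulting trees can be much larger than the formulas they represent.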


When to Consider Decision Trees

- Instances describable by attribute-value pairs

- Target function is discrete valued

- Disjunctive hypothesis may be required

- Possibly noisy training data

Examples:

- Equipment or medical diagnosis

- Credit risk analysis

- Modeling calendar scheduling preferences


Top-Down Induction of Decision Trees

Main loop:

1. A ← the "best" decision attribute for the next node

2. Assign A as the decision attribute for node

3. For each value of A, create a new descendant of node

4. Sort training examples to leaf nodes

5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes

(A code sketch of this loop follows the example below.)

Which attribute is best?

A1 splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
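
The main loop above maps directly onto a short recursive procedure. A sketch in Python, using a nested-dict tree representation and a pluggable choose_attribute function of my own invention (ID3 itself scores attributes by information gain, defined on the next slides):

    # Sketch of the loop above as a recursive ID3 skeleton (illustrative conventions:
    # examples are dicts, an internal node is {attribute: {value: subtree}}, a leaf is a label).
    from collections import Counter

    def id3(examples, target, attributes, choose_attribute):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:                 # training examples perfectly classified: STOP
            return labels[0]
        if not attributes:                        # nothing left to test: majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        A = choose_attribute(examples, target, attributes)        # 1. pick the "best" attribute
        tree = {A: {}}                                            # 2. A is the decision attribute
        for value in set(ex[A] for ex in examples):               # 3. one descendant per value of A
            subset = [ex for ex in examples if ex[A] == value]    # 4. sort examples to the branch
            rest = [a for a in attributes if a != A]
            tree[A][value] = id3(subset, target, rest, choose_attribute)  # 5. recurse on impure leaves
        return tree

With an information-gain-based choose_attribute (next slides) and the PlayTennis table given later, this skeleton should reproduce the PlayTennis tree shown at the start of this section.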


Entropy

[Figure: Entropy(S) as a function of p_+, the proportion of positive examples; it is 0 when p_+ is 0 or 1 and reaches its maximum of 1.0 at p_+ = 0.5.]

- S is a sample of training examples

- p_+ is the proportion of positive examples in S

- p_- is the proportion of negative examples in S

- Entropy measures the impurity of S

Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_-
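
The formula translates directly into a few lines of Python (a sketch; for more than two classes the same sum runs over every class proportion):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy, in bits, of a collection of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["+"] * 7 + ["-"] * 7))   # 1.0  (evenly mixed sample)
    print(entropy(["+"] * 14))              # -0.0 (pure sample: zero impurity)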


Entropy

Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

Why?

Information theory: an optimal length code assigns -\log_2 p bits to a message having probability p.

So, the expected number of bits to encode + or - of a random member of S is

p_+ (-\log_2 p_+) + p_- (-\log_2 p_-)

Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_-
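
As a worked instance (for the [9+,5-] PlayTennis sample used on later slides):

    Entropy([9+,5-]) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940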


Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

A1 splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
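
A sketch of Gain(S, A) in Python, checked against the A1 split shown above (the entropy helper is repeated so the snippet stands on its own; the example-as-dict layout is my own convention):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, attribute, target):
        """Gain(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
        S = [ex[target] for ex in examples]
        gain = entropy(S)
        for v in set(ex[attribute] for ex in examples):
            S_v = [ex[target] for ex in examples if ex[attribute] == v]
            gain -= len(S_v) / len(S) * entropy(S_v)
        return gain

    # Rebuild the [29+,35-] sample split by A1 into t: [21+,5-] and f: [8+,30-].
    examples = ([{"A1": "t", "label": "+"}] * 21 + [{"A1": "t", "label": "-"}] * 5 +
                [{"A1": "f", "label": "+"}] * 8  + [{"A1": "f", "label": "-"}] * 30)
    print(round(info_gain(examples, "A1", "label"), 3))   # 0.266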


Training Examples

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No


Selecting the Next Attribute

Which attribute is the best classifier?

S: [9+,5-], Entropy(S) = 0.940

Humidity:  High -> [3+,4-] (E = 0.985),  Normal -> [6+,1-] (E = 0.592)
Wind:      Weak -> [6+,2-] (E = 0.811),  Strong -> [3+,3-] (E = 1.00)

Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151
Gain(S, Wind)     = .940 - (8/14).811 - (6/14)1.0  = .048

{D1, D2, ..., D14}  [9+,5-]

Outlook = Sunny:    {D1,D2,D8,D9,D11}   [2+,3-]  ?
Outlook = Overcast: {D3,D7,D12,D13}     [4+,0-]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14}  [3+,2-]  ?

Which attribute should be tested here?

S_sunny = {D1,D2,D8,D9,D11}

Gain(S_sunny, Humidity)    = .970 - (3/5) 0.0 - (2/5) 0.0             = .970
Gain(S_sunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(S_sunny, Wind)        = .970 - (2/5) 1.0 - (3/5) .918            = .019
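
A sketch that re-derives these attribute choices from the training-example table above (the helpers and the dict layout are the same illustrative conventions as in the earlier sketches):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, attr, target="PlayTennis"):
        S = [r[target] for r in rows]
        gain = entropy(S)
        for v in set(r[attr] for r in rows):
            Sv = [r[target] for r in rows if r[attr] == v]
            gain -= len(Sv) / len(S) * entropy(Sv)
        return gain

    cols = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
    data = [dict(zip(cols, row)) for row in [
        ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
    ]]

    for attr in ("Outlook", "Temperature", "Humidity", "Wind"):
        print(attr, round(info_gain(data, attr), 3))
    # Outlook has the largest gain and becomes the root; repeating the computation on the
    # five Sunny examples picks Humidity, matching the S_sunny gains listed above.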


Hypothesis Space Search by ID3

[Figure: ID3's search through the space of decision trees, starting from the empty tree and successively growing it by adding attribute tests (A1, A2, A3, A4, ...).]


Hypothesis Space Search by ID3

- Hypothesis space is complete!
  - Target function surely in there...

- Outputs a single hypothesis (which one?)
  - Can't play 20 questions...

- No backtracking
  - Local minima...

- Statistically-based search choices
  - Robust to noisy data...

- Inductive bias: approx "prefer shortest tree"


Inductive Bias in ID3

Note H is the power set of instances X

→ Unbiased?

Not really...

- Preference for short trees, and for those with high information gain attributes near the root

- Bias is a preference for some hypotheses, rather than a restriction of hypothesis space H

- Occam's razor: prefer the shortest hypothesis that fits the data


Occam's Razor

Why prefer short hypotheses?

Argument in favor:

- Fewer short hyps. than long hyps.
  → a short hyp that fits data is unlikely to be coincidence
  → a long hyp that fits data might be coincidence

Argument opposed:

- There are many ways to define small sets of hyps

- e.g., all trees with a prime number of nodes that use attributes beginning with "Z"

- What's so special about small sets based on size of hypothesis??


Overfitting in Decision Trees

Consider adding noisy training example #15:

⟨Sunny, Hot, Normal, Strong, PlayTennis = No⟩

What effect on the earlier tree?

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


Overfitting

Consider the error of hypothesis h over

- training data: error_{train}(h)

- entire distribution D of data: error_{D}(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_{train}(h) < error_{train}(h')

and

error_{D}(h) > error_{D}(h')


Overfitting in Decision Tree Learning

[Figure: Accuracy vs. size of tree (number of nodes), for accuracy measured on the training data and on the test data; as the tree grows, training accuracy keeps increasing while test accuracy peaks and then declines.]


Avoiding Overfitting

How can we avoid overfitting?

- stop growing when the data split is not statistically significant

- grow full tree, then post-prune

How to select the "best" tree:

- Measure performance over training data

- Measure performance over a separate validation data set

- MDL: minimize size(tree) + size(misclassifications(tree))


Reduced-Error Pruning

Split data into training and validation sets

Do until further pruning is harmful:

1. Evaluate impact on the validation set of pruning each possible node (plus those below it)

2. Greedily remove the one that most improves validation set accuracy

- produces smallest version of the most accurate subtree

- What if data is limited?
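
A minimal sketch of this loop, assuming the nested-dict trees of the earlier ID3 sketch (leaf = class label, internal node = {attribute: {value: subtree}}); the helper names and the data layout are illustrative, and only the greedy prune-while-not-harmful loop follows the slide:

    import copy
    from collections import Counter

    def classify(tree, example, default="No"):
        """Walk a nested-dict tree; unseen attribute values fall back to `default`."""
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(example.get(attr), default)
        return tree

    def accuracy(tree, rows, target):
        return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

    def internal_nodes(tree, path=()):
        """Yield (path, node) for every internal node; path is a tuple of (attr, value) steps."""
        if isinstance(tree, dict):
            attr = next(iter(tree))
            for value, child in tree[attr].items():
                yield from internal_nodes(child, path + ((attr, value),))
            yield path, tree

    def prune_at(tree, path, leaf_label):
        """Return a copy of `tree` with the subtree at `path` replaced by a leaf."""
        if not path:
            return leaf_label
        new = copy.deepcopy(tree)
        node = new
        for attr, value in path[:-1]:
            node = node[attr][value]
        attr, value = path[-1]
        node[attr][value] = leaf_label
        return new

    def reduced_error_prune(tree, train_rows, valid_rows, target):
        while True:
            best, best_acc = None, accuracy(tree, valid_rows, target)
            for path, _ in internal_nodes(tree):
                # Replace the node with the majority training label among examples reaching it.
                reaching = [r for r in train_rows if all(r.get(a) == v for a, v in path)]
                label = Counter(r[target] for r in reaching or train_rows).most_common(1)[0][0]
                candidate = prune_at(tree, path, label)
                acc = accuracy(candidate, valid_rows, target)
                if acc >= best_acc:          # pruning is not harmful on the validation set
                    best, best_acc = candidate, acc
            if best is None:                 # every possible pruning hurts: stop
                return tree
            tree = best                      # greedily keep the best pruning and repeat

Accepting prunings that merely tie the current validation accuracy is what yields the smallest version of the most accurate subtree.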


Effect of Reduced-Error Pruning

[Figure: Accuracy vs. size of tree (number of nodes), for accuracy on the training data, on the test data, and on the test data during pruning.]


Rule Post-Pruning

1. Convert tree to equivalent set of rules

2. Prune each rule independently of others

3. Sort final rules into desired sequence for use

Perhaps the most frequently used method (e.g., C4.5)
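
A sketch of steps 2 and 3, with a rule represented as a (preconditions, conclusion) pair such as ([("Outlook", "Sunny"), ("Humidity", "High")], "No"). Estimating rule accuracy on a held-out validation set is one simple choice of pruning criterion; C4.5 itself uses a pessimistic estimate computed over the training data:

    def rule_accuracy(preconditions, conclusion, rows, target):
        """Fraction of matching validation rows that have the rule's predicted class."""
        matched = [r for r in rows if all(r.get(a) == v for a, v in preconditions)]
        if not matched:
            return 0.0
        return sum(r[target] == conclusion for r in matched) / len(matched)

    def prune_rule(preconditions, conclusion, rows, target):
        """Step 2: drop any precondition whose removal does not hurt estimated accuracy."""
        preconditions = list(preconditions)
        improved = True
        while improved and preconditions:
            improved = False
            base = rule_accuracy(preconditions, conclusion, rows, target)
            for i in range(len(preconditions)):
                trial = preconditions[:i] + preconditions[i + 1:]
                if rule_accuracy(trial, conclusion, rows, target) >= base:
                    preconditions, improved = trial, True
                    break
        return preconditions, conclusion

    def post_prune_rules(rules, rows, target):
        pruned = [prune_rule(p, c, rows, target) for p, c in rules]
        # Step 3: order the final rules by estimated accuracy for later use.
        return sorted(pruned, key=lambda pc: -rule_accuracy(pc[0], pc[1], rows, target))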


Converting A Tree to Rules

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


IF (Outlook = Sunny) ∧ (Humidity = High)
  THEN PlayTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal)
  THEN PlayTennis = Yes

...


Continuous Valued Attributes

Create a discrete attribute to test a continuous one:

- Temperature = 82.5

- (Temperature > 72.3) = t, f

Temperature: 40 48 60 72 80 90

PlayTennis: No No Yes Yes Yes No
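
A sketch of one common way to pick the threshold: consider the midpoints between adjacent sorted values where the class changes, and keep the candidate with the highest information gain. The data is the Temperature example from this slide:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Candidate thresholds sit midway between adjacent values where the class changes."""
        pairs = sorted(zip(values, labels))
        candidates = [(a + b) / 2
                      for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
        def gain(t):
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            n = len(pairs)
            return (entropy(labels)
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
        return max(candidates, key=gain)

    temperature = [40, 48, 60, 72, 80, 90]
    play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temperature, play_tennis))   # 54.0, the midpoint of the 48 -> 60 change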


Attributes with Many Values

Problem:

- If an attribute has many values, Gain will select it

- Imagine using Date = Jun_3_1996 as an attribute

One approach: use GainR atio instead

GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}

SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}

where S_i is the subset of S for which A has value v_i
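
A sketch of GainRatio in Python (helper names are mine). An attribute like Date, with a distinct value for nearly every example, has a very large SplitInformation, so its GainRatio stays small even though its raw Gain is maximal:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def split_information(rows, attr):
        """-sum_i |S_i|/|S| log2 |S_i|/|S| over the subsets induced by the values of attr."""
        n = len(rows)
        return -sum(s / n * math.log2(s / n)
                    for s in Counter(r[attr] for r in rows).values())

    def gain_ratio(rows, attr, target):
        S = [r[target] for r in rows]
        gain = entropy(S)
        for v in set(r[attr] for r in rows):
            Sv = [r[target] for r in rows if r[attr] == v]
            gain -= len(Sv) / len(S) * entropy(Sv)
        si = split_information(rows, attr)
        return gain / si if si > 0 else 0.0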


Attributes with Costs

Consider:

- medical diagnosis: BloodTest has cost $150

- robotics: Width_from_1ft has cost 23 sec.

How to learn a consistent tree with low expected cost?

One approach: replace gain by

- Tan and Schlimmer (1990): \frac{Gain^2(S, A)}{Cost(A)}

- Nunez (1988): \frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}

  where w ∈ [0, 1] determines the importance of cost
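
Both measures are one-liners once Gain(S, A) and Cost(A) are available; a sketch (the function names are mine, and `gain` is assumed to come from an information-gain routine like the one sketched earlier):

    def tan_schlimmer(gain, cost):
        """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
        return gain ** 2 / cost

    def nunez(gain, cost, w=0.5):
        """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
        return (2 ** gain - 1) / (cost + 1) ** w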


Unknown Attribute Values

What if some examples are missing values of A?

Use the training example anyway, sort it through the tree:

- If node n tests A, assign the most common value of A among the other examples sorted to node n

- or assign the most common value of A among other examples with the same target value

- or assign probability p_i to each possible value v_i of A
  - assign fraction p_i of the example to each descendant in the tree

Classify new examples in the same fashion
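
A sketch of the third option (fractional examples): estimate the distribution of A's values from the examples at the node that do have a value, then pass a weighted fraction of the incomplete example down every branch. The weighted-row layout is my own convention:

    from collections import Counter, defaultdict

    def split_with_missing(rows, attr):
        """rows: list of (weight, example-dict) pairs.  Returns {value: list of weighted rows}."""
        known = [(w, ex) for w, ex in rows if ex.get(attr) is not None]
        counts = Counter(ex[attr] for _, ex in known)
        total = sum(counts.values())
        branches = defaultdict(list)
        for w, ex in rows:
            if ex.get(attr) is not None:
                branches[ex[attr]].append((w, ex))
            else:
                for value, c in counts.items():              # fractional copies, one per value
                    branches[value].append((w * c / total, ex))
        return branches

    rows = [(1.0, {"Wind": "Weak"}), (1.0, {"Wind": "Weak"}),
            (1.0, {"Wind": "Strong"}), (1.0, {"Wind": None})]
    for value, part in split_with_missing(rows, "Wind").items():
        print(value, [round(w, 2) for w, _ in part])
    # The missing example goes to Weak with weight 0.67 and to Strong with weight 0.33.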
