Decision Tree Learning

[read Chapter 3]

[recommended exercises 3.1, 3.4]

- Decision tree representation

- ID3 learning algorithm

- Entropy, Information gain

- Overfitting

Lecture slides for textbook Machine Learning, Tom M. Mitchell, McGraw Hill, 1997

Decision Tree for PlayTennis

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


A Tree to Predict C-Section Risk

Learned from medical records of 1000 women

Negative examples are C-sections

[833+,167-] .83+ .17-

Fetal_Presentation = 1: [822+,116-] .88+ .12-

| Previous_Csection = 0: [767+,81-] .90+ .10-

| | Primiparous = 0: [399+,13-] .97+ .03-

| | Primiparous = 1: [368+,68-] .84+ .16-

| | | Fetal_Distress = 0: [334+,47-] .88+ .12-

| | | | Birth_Weight < 3349: [201+,10.6-] .95+ .05-

| | | | Birth_Weight >= 3349: [133+,36.4-] .78+ .22-

| | | Fetal_Distress = 1: [34+,21-] .62+ .38-

| Previous_Csection = 1: [55+,35-] .61+ .39-

Fetal_Presentation = 2: [3+,29-] .11+ .89-

Fetal_Presentation = 3: [8+,22-] .27+ .73-


Decision Trees

Decision tree representation:

- Each internal node tests an attribute

- Each branch corresponds to an attribute value

- Each leaf node assigns a classification

How would we represent (see the sketch below):

- ∧, ∨, XOR

- (A ∧ B) ∨ (C ∧ ¬D ∧ E)

- M of N
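
For concreteness (not from the original slides), here is how A XOR B can be written as a decision tree: test A at the root, test B on each branch, and put the classification at the leaves. A minimal Python sketch, with a hypothetical Node class of my own:

    # Minimal sketch (illustrative): a decision tree that computes A XOR B.
    class Node:
        def __init__(self, attribute=None, branches=None, label=None):
            self.attribute = attribute      # attribute tested at this internal node
            self.branches = branches or {}  # attribute value -> child node
            self.label = label              # classification stored at a leaf

        def classify(self, example):
            if self.attribute is None:      # leaf: return its classification
                return self.label
            return self.branches[example[self.attribute]].classify(example)

    # A XOR B: test A at the root, then test B on each branch.
    xor_tree = Node("A", {
        False: Node("B", {False: Node(label=False), True: Node(label=True)}),
        True:  Node("B", {False: Node(label=True),  True: Node(label=False)}),
    })

    assert xor_tree.classify({"A": True, "B": False}) is True
    assert xor_tree.classify({"A": True, "B": True}) is False

(A ∧ B) ∨ (C ∧ ¬D ∧ E) and M-of-N functions can be encoded the same way, although the resulting trees can be much larger than the formulas they represent.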


When to Consider Decision Trees

- Instances describable by attribute-value pairs

- Target function is discrete valued

- Disjunctive hypothesis may be required

- Possibly noisy training data

Examples:

- Equipment or medical diagnosis

- Credit risk analysis

- Modeling calendar scheduling preferences


Top-Down Induction of Decision Trees

Main loop:

1. A ← the "best" decision attribute for the next node

2. Assign A as the decision attribute for node

3. For each value of A, create a new descendant of node

4. Sort training examples to leaf nodes

5. If training examples perfectly classified, Then STOP, Else iterate over new leaf nodes

(A code sketch of this loop follows the example below.)

Which attribute is best?

A1 splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
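
The main loop above maps directly onto a short recursive procedure. A sketch in Python, using a nested-dict tree representation and a pluggable choose_attribute function of my own invention (ID3 itself scores attributes by information gain, defined on the next slides):

    # Sketch of the loop above as a recursive ID3 skeleton (illustrative conventions:
    # examples are dicts, an internal node is {attribute: {value: subtree}}, a leaf is a label).
    from collections import Counter

    def id3(examples, target, attributes, choose_attribute):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:                 # training examples perfectly classified: STOP
            return labels[0]
        if not attributes:                        # nothing left to test: majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        A = choose_attribute(examples, target, attributes)        # 1. pick the "best" attribute
        tree = {A: {}}                                            # 2. A is the decision attribute
        for value in set(ex[A] for ex in examples):               # 3. one descendant per value of A
            subset = [ex for ex in examples if ex[A] == value]    # 4. sort examples to the branch
            rest = [a for a in attributes if a != A]
            tree[A][value] = id3(subset, target, rest, choose_attribute)  # 5. recurse on impure leaves
        return tree

With an information-gain-based choose_attribute (next slides) and the PlayTennis table given later, this skeleton should reproduce the PlayTennis tree shown at the start of this section.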


Entropy

[Figure: Entropy(S) as a function of p_+, the proportion of positive examples; it is 0 when p_+ is 0 or 1 and reaches its maximum of 1.0 at p_+ = 0.5.]

- S is a sample of training examples

- p_+ is the proportion of positive examples in S

- p_- is the proportion of negative examples in S

- Entropy measures the impurity of S

Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_-
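
The formula translates directly into a few lines of Python (a sketch; for more than two classes the same sum runs over every class proportion):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy, in bits, of a collection of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    print(entropy(["+"] * 7 + ["-"] * 7))   # 1.0  (evenly mixed sample)
    print(entropy(["+"] * 14))              # -0.0 (pure sample: zero impurity)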


Entropy

Entropy(S) = expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code)

Why?

Information theory: an optimal length code assigns -\log_2 p bits to a message having probability p.

So, the expected number of bits to encode + or - of a random member of S is

p_+ (-\log_2 p_+) + p_- (-\log_2 p_-)

Entropy(S) \equiv -p_+ \log_2 p_+ - p_- \log_2 p_-
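
As a worked instance (for the [9+,5-] PlayTennis sample used on later slides):

    Entropy([9+,5-]) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940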


Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A

Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

A1 splits [29+,35-] into t: [21+,5-] and f: [8+,30-]
A2 splits [29+,35-] into t: [18+,33-] and f: [11+,2-]
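
A sketch of Gain(S, A) in Python, checked against the A1 split shown above (the entropy helper is repeated so the snippet stands on its own; the example-as-dict layout is my own convention):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, attribute, target):
        """Gain(S, A): entropy of S minus the weighted entropy of the subsets S_v."""
        S = [ex[target] for ex in examples]
        gain = entropy(S)
        for v in set(ex[attribute] for ex in examples):
            S_v = [ex[target] for ex in examples if ex[attribute] == v]
            gain -= len(S_v) / len(S) * entropy(S_v)
        return gain

    # Rebuild the [29+,35-] sample split by A1 into t: [21+,5-] and f: [8+,30-].
    examples = ([{"A1": "t", "label": "+"}] * 21 + [{"A1": "t", "label": "-"}] * 5 +
                [{"A1": "f", "label": "+"}] * 8  + [{"A1": "f", "label": "-"}] * 30)
    print(round(info_gain(examples, "A1", "label"), 3))   # 0.266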


Training Examples

Day   Outlook    Temperature  Humidity  Wind    PlayTennis
D1    Sunny      Hot          High      Weak    No
D2    Sunny      Hot          High      Strong  No
D3    Overcast   Hot          High      Weak    Yes
D4    Rain       Mild         High      Weak    Yes
D5    Rain       Cool         Normal    Weak    Yes
D6    Rain       Cool         Normal    Strong  No
D7    Overcast   Cool         Normal    Strong  Yes
D8    Sunny      Mild         High      Weak    No
D9    Sunny      Cool         Normal    Weak    Yes
D10   Rain       Mild         Normal    Weak    Yes
D11   Sunny      Mild         Normal    Strong  Yes
D12   Overcast   Mild         High      Strong  Yes
D13   Overcast   Hot          Normal    Weak    Yes
D14   Rain       Mild         High      Strong  No


Selecting the Next Attribute

Which attribute is the best classifier?

S: [9+,5-], Entropy(S) = 0.940

Humidity:  High -> [3+,4-] (E = 0.985),  Normal -> [6+,1-] (E = 0.592)
Wind:      Weak -> [6+,2-] (E = 0.811),  Strong -> [3+,3-] (E = 1.00)

Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151
Gain(S, Wind)     = .940 - (8/14).811 - (6/14)1.0  = .048

{D1, D2, ..., D14}  [9+,5-]

Outlook = Sunny:    {D1,D2,D8,D9,D11}   [2+,3-]  ?
Outlook = Overcast: {D3,D7,D12,D13}     [4+,0-]  Yes
Outlook = Rain:     {D4,D5,D6,D10,D14}  [3+,2-]  ?

Which attribute should be tested here?

S_sunny = {D1,D2,D8,D9,D11}

Gain(S_sunny, Humidity)    = .970 - (3/5) 0.0 - (2/5) 0.0             = .970
Gain(S_sunny, Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain(S_sunny, Wind)        = .970 - (2/5) 1.0 - (3/5) .918            = .019
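
A sketch that re-derives these attribute choices from the training-example table above (the helpers and the dict layout are the same illustrative conventions as in the earlier sketches):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, attr, target="PlayTennis"):
        S = [r[target] for r in rows]
        gain = entropy(S)
        for v in set(r[attr] for r in rows):
            Sv = [r[target] for r in rows if r[attr] == v]
            gain -= len(Sv) / len(S) * entropy(Sv)
        return gain

    cols = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
    data = [dict(zip(cols, row)) for row in [
        ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
    ]]

    for attr in ("Outlook", "Temperature", "Humidity", "Wind"):
        print(attr, round(info_gain(data, attr), 3))
    # Outlook has the largest gain and becomes the root; repeating the computation on the
    # five Sunny examples picks Humidity, matching the S_sunny gains listed above.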


Hypothesis Space Search by ID3

[Figure: ID3's search through the space of decision trees, starting from the empty tree and successively growing it by adding attribute tests (A1, A2, A3, A4, ...).]


Hypothesis Space Search by ID3

- Hypothesis space is complete!
  - Target function surely in there...

- Outputs a single hypothesis (which one?)
  - Can't play 20 questions...

- No backtracking
  - Local minima...

- Statistically-based search choices
  - Robust to noisy data...

- Inductive bias: approx "prefer shortest tree"


Inductive Bias in ID3

Note H is the power set of instances X

→ Unbiased?

Not really...

- Preference for short trees, and for those with high information gain attributes near the root

- Bias is a preference for some hypotheses, rather than a restriction of hypothesis space H

- Occam's razor: prefer the shortest hypothesis that fits the data


Occam's Razor

Why prefer short hypotheses?

Argument in favor:

- Fewer short hyps. than long hyps.
  → a short hyp that fits data is unlikely to be coincidence
  → a long hyp that fits data might be coincidence

Argument opposed:

- There are many ways to define small sets of hyps

- e.g., all trees with a prime number of nodes that use attributes beginning with "Z"

- What's so special about small sets based on size of hypothesis??


Overfitting in Decision Trees

Consider adding noisy training example #15:

⟨Sunny, Hot, Normal, Strong, PlayTennis = No⟩

What effect on the earlier tree?

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


Overfitting

Consider the error of hypothesis h over

- training data: error_{train}(h)

- entire distribution D of data: error_{D}(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_{train}(h) < error_{train}(h')

and

error_{D}(h) > error_{D}(h')


Overfitting in Decision Tree Learning

[Figure: Accuracy vs. size of tree (number of nodes), for accuracy measured on the training data and on the test data; as the tree grows, training accuracy keeps increasing while test accuracy peaks and then declines.]


Avoiding Overfitting

How can we avoid overfitting?

- stop growing when the data split is not statistically significant

- grow full tree, then post-prune

How to select the "best" tree:

- Measure performance over training data

- Measure performance over a separate validation data set

- MDL: minimize size(tree) + size(misclassifications(tree))


Reduced-Error Pruning

Split data into training and validation sets

Do until further pruning is harmful:

1. Evaluate impact on the validation set of pruning each possible node (plus those below it)

2. Greedily remove the one that most improves validation set accuracy

- produces smallest version of the most accurate subtree

- What if data is limited?
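
A minimal sketch of this loop, assuming the nested-dict trees of the earlier ID3 sketch (leaf = class label, internal node = {attribute: {value: subtree}}); the helper names and the data layout are illustrative, and only the greedy prune-while-not-harmful loop follows the slide:

    import copy
    from collections import Counter

    def classify(tree, example, default="No"):
        """Walk a nested-dict tree; unseen attribute values fall back to `default`."""
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(example.get(attr), default)
        return tree

    def accuracy(tree, rows, target):
        return sum(classify(tree, r) == r[target] for r in rows) / len(rows)

    def internal_nodes(tree, path=()):
        """Yield (path, node) for every internal node; path is a tuple of (attr, value) steps."""
        if isinstance(tree, dict):
            attr = next(iter(tree))
            for value, child in tree[attr].items():
                yield from internal_nodes(child, path + ((attr, value),))
            yield path, tree

    def prune_at(tree, path, leaf_label):
        """Return a copy of `tree` with the subtree at `path` replaced by a leaf."""
        if not path:
            return leaf_label
        new = copy.deepcopy(tree)
        node = new
        for attr, value in path[:-1]:
            node = node[attr][value]
        attr, value = path[-1]
        node[attr][value] = leaf_label
        return new

    def reduced_error_prune(tree, train_rows, valid_rows, target):
        while True:
            best, best_acc = None, accuracy(tree, valid_rows, target)
            for path, _ in internal_nodes(tree):
                # Replace the node with the majority training label among examples reaching it.
                reaching = [r for r in train_rows if all(r.get(a) == v for a, v in path)]
                label = Counter(r[target] for r in reaching or train_rows).most_common(1)[0][0]
                candidate = prune_at(tree, path, label)
                acc = accuracy(candidate, valid_rows, target)
                if acc >= best_acc:          # pruning is not harmful on the validation set
                    best, best_acc = candidate, acc
            if best is None:                 # every possible pruning hurts: stop
                return tree
            tree = best                      # greedily keep the best pruning and repeat

Accepting prunings that merely tie the current validation accuracy is what yields the smallest version of the most accurate subtree.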


Effect of Reduced-Error Pruning

[Figure: Accuracy vs. size of tree (number of nodes), for accuracy on the training data, on the test data, and on the test data during pruning.]


Rule Post-Pruning

1. Convert tree to equivalent set of rules

2. Prune each rule independently of others

3. Sort final rules into desired sequence for use

Perhaps the most frequently used method (e.g., C4.5)
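
A sketch of steps 2 and 3, with a rule represented as a (preconditions, conclusion) pair such as ([("Outlook", "Sunny"), ("Humidity", "High")], "No"). Estimating rule accuracy on a held-out validation set is one simple choice of pruning criterion; C4.5 itself uses a pessimistic estimate computed over the training data:

    def rule_accuracy(preconditions, conclusion, rows, target):
        """Fraction of matching validation rows that have the rule's predicted class."""
        matched = [r for r in rows if all(r.get(a) == v for a, v in preconditions)]
        if not matched:
            return 0.0
        return sum(r[target] == conclusion for r in matched) / len(matched)

    def prune_rule(preconditions, conclusion, rows, target):
        """Step 2: drop any precondition whose removal does not hurt estimated accuracy."""
        preconditions = list(preconditions)
        improved = True
        while improved and preconditions:
            improved = False
            base = rule_accuracy(preconditions, conclusion, rows, target)
            for i in range(len(preconditions)):
                trial = preconditions[:i] + preconditions[i + 1:]
                if rule_accuracy(trial, conclusion, rows, target) >= base:
                    preconditions, improved = trial, True
                    break
        return preconditions, conclusion

    def post_prune_rules(rules, rows, target):
        pruned = [prune_rule(p, c, rows, target) for p, c in rules]
        # Step 3: order the final rules by estimated accuracy for later use.
        return sorted(pruned, key=lambda pc: -rule_accuracy(pc[0], pc[1], rows, target))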


Converting A Tree to Rules

Outlook = Sunny:
| Humidity = High: No
| Humidity = Normal: Yes
Outlook = Overcast: Yes
Outlook = Rain:
| Wind = Strong: No
| Wind = Weak: Yes


IF (Outlook = Sunny) ∧ (Humidity = High)
  THEN PlayTennis = No

IF (Outlook = Sunny) ∧ (Humidity = Normal)
  THEN PlayTennis = Yes

...


Continuous Valued Attributes

Create a discrete attribute to test a continuous one:

- Temperature = 82.5

- (Temperature > 72.3) = t, f

Temperature: 40 48 60 72 80 90

PlayTennis: No No Yes Yes Yes No
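
A sketch of one common way to pick the threshold: consider the midpoints between adjacent sorted values where the class changes, and keep the candidate with the highest information gain. The data is the Temperature example from this slide:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Candidate thresholds sit midway between adjacent values where the class changes."""
        pairs = sorted(zip(values, labels))
        candidates = [(a + b) / 2
                      for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
        def gain(t):
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            n = len(pairs)
            return (entropy(labels)
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
        return max(candidates, key=gain)

    temperature = [40, 48, 60, 72, 80, 90]
    play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(best_threshold(temperature, play_tennis))   # 54.0, the midpoint of the 48 -> 60 change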


Attributes with Many Values

Problem:

- If an attribute has many values, Gain will select it

- Imagine using Date = Jun_3_1996 as an attribute

One approach: use GainR atio instead

GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}

SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}

where S_i is the subset of S for which A has value v_i
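
A sketch of GainRatio in Python (helper names are mine). An attribute like Date, with a distinct value for nearly every example, has a very large SplitInformation, so its GainRatio stays small even though its raw Gain is maximal:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def split_information(rows, attr):
        """-sum_i |S_i|/|S| log2 |S_i|/|S| over the subsets induced by the values of attr."""
        n = len(rows)
        return -sum(s / n * math.log2(s / n)
                    for s in Counter(r[attr] for r in rows).values())

    def gain_ratio(rows, attr, target):
        S = [r[target] for r in rows]
        gain = entropy(S)
        for v in set(r[attr] for r in rows):
            Sv = [r[target] for r in rows if r[attr] == v]
            gain -= len(Sv) / len(S) * entropy(Sv)
        si = split_information(rows, attr)
        return gain / si if si > 0 else 0.0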


Attributes with Costs

Consider:

- medical diagnosis: BloodTest has cost $150

- robotics: Width_from_1ft has cost 23 sec.

How to learn a consistent tree with low expected cost?

One approach: replace gain by

- Tan and Schlimmer (1990): \frac{Gain^2(S, A)}{Cost(A)}

- Nunez (1988): \frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}

  where w ∈ [0, 1] determines the importance of cost
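
Both measures are one-liners once Gain(S, A) and Cost(A) are available; a sketch (the function names are mine, and `gain` is assumed to come from an information-gain routine like the one sketched earlier):

    def tan_schlimmer(gain, cost):
        """Tan and Schlimmer (1990): Gain^2(S, A) / Cost(A)."""
        return gain ** 2 / cost

    def nunez(gain, cost, w=0.5):
        """Nunez (1988): (2^Gain(S, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
        return (2 ** gain - 1) / (cost + 1) ** w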


Unknown Attribute Values

What if some examples are missing values of A?

Use the training example anyway, sort it through the tree:

- If node n tests A, assign the most common value of A among the other examples sorted to node n

- or assign the most common value of A among other examples with the same target value

- or assign probability p_i to each possible value v_i of A
  - assign fraction p_i of the example to each descendant in the tree

Classify new examples in the same fashion
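
A sketch of the third option (fractional examples): estimate the distribution of A's values from the examples at the node that do have a value, then pass a weighted fraction of the incomplete example down every branch. The weighted-row layout is my own convention:

    from collections import Counter, defaultdict

    def split_with_missing(rows, attr):
        """rows: list of (weight, example-dict) pairs.  Returns {value: list of weighted rows}."""
        known = [(w, ex) for w, ex in rows if ex.get(attr) is not None]
        counts = Counter(ex[attr] for _, ex in known)
        total = sum(counts.values())
        branches = defaultdict(list)
        for w, ex in rows:
            if ex.get(attr) is not None:
                branches[ex[attr]].append((w, ex))
            else:
                for value, c in counts.items():              # fractional copies, one per value
                    branches[value].append((w * c / total, ex))
        return branches

    rows = [(1.0, {"Wind": "Weak"}), (1.0, {"Wind": "Weak"}),
            (1.0, {"Wind": "Strong"}), (1.0, {"Wind": None})]
    for value, part in split_with_missing(rows, "Wind").items():
        print(value, [round(w, 2) for w, _ in part])
    # The missing example goes to Weak with weight 0.67 and to Strong with weight 0.33.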
