Introduction to Machine Learning
Amo G. Tong

Lecture 3: Supervised Learning
• Decision Tree
• Readings: Mitchell, Ch. 3.
• Some materials are courtesy of Vibhav Gogate and Tom Mitchell.
• All pictures belong to their creators.

Supervised Learning
• Given some training examples <x, f(x)> and an unknown function f.
• Find a good approximation of f.
• Step 1. Select the features of x to be used.
• Step 2. Select a hypothesis space H: a set of candidate functions to approximate f.
• Step 3. Select a measure to evaluate the functions in H.
• Step 4. Use a machine learning algorithm to find the best function in H according to your measure.
• Concept Learning: the input is the values of the attributes; the output is Yes or No.
• Decision Tree: learn a discrete-valued function.

Decision Tree
• Each instance is characterized by some discrete attributes.
• Each attribute has several possible values.
• Each instance is associated with a discrete target value.
• A decision tree classifies an instance by testing the attributes sequentially.
• [Diagram: the root tests attribute1 (values value11, value12, value13); one branch then tests attribute2 (values value21, value22); the leaves are labeled target value 1 and target value 2.]

Decision Tree
• Play Tennis?
• [Table: the Play Tennis training data, listing the attributes and the target value of each instance.]

Decision Tree
• A decision tree.
• Outlook = Rain and Wind = Weak. Play tennis?
• Outlook = Sunny and Humidity = High. Play tennis?

Decision Tree
• A decision tree.
• How do we build a decision tree from the training data so that it classifies unobserved instances well?

Decision Tree
• We will learn some algorithms to build a decision tree.
• We will discuss some issues regarding decision trees.

ID3
• To build a decision tree is to decide which attribute should be tested.
• Which attribute is the best classifier? We need an algorithm and a measure.
• The ID3 algorithm: select the attribute that can maximally reduce the impurity of the instances.
• Entropy measures the purity of a collection of instances.
• For a collection S of positive and negative instances, its entropy is defined as
• Entropy(S) := −p1 log2(p1) − p0 log2(p0),
• where p1 is the proportion of positive instances and p0 = 1 − p1.
• Smaller Entropy(S) ⇒ higher purity.
• [Plot: Entropy(S) as a function of p1; it equals 0 when all instances are negative (p1 = 0) or all positive (p1 = 1), and reaches its maximum of 1 when the collection is half-half (p1 = 1/2).]
• Generally, Entropy(S) := −Σ_i p_i log2(p_i), where p_i is the proportion of instances that have the i-th value.
• For the Play Tennis data: p1 = 9/14, p0 = 5/14, Entropy(S) = 0.94.
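To make the entropy computation concrete, here is a minimal Python sketch; the function name entropy and the use of collections.Counter are illustrative choices, not part of the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of target values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# Play Tennis data: 9 positive and 5 negative instances.
labels = ["Yes"] * 9 + ["No"] * 5
print(round(entropy(labels), 2))  # 0.94
```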
ID3
• What will happen if an attribute A is selected?
• If A has k values, the instances in S will be divided into subsets S1, ..., Sk.
• After the split, the expected entropy is Σ_i (|S_i|/|S|) Entropy(S_i).

ID3
• Consider Outlook.
• Expected entropy: (5/14) Entropy(S_Sunny) + (4/14) Entropy(S_Overcast) + (5/14) Entropy(S_Rain).

ID3
• Before A is tested: the entropy is Entropy(S).
• After A is tested: the expected entropy is Σ_i (|S_i|/|S|) Entropy(S_i).
• Information Gain is the difference:
• Gain(S, A) = Entropy(S) − Σ_i (|S_i|/|S|) Entropy(S_i).
• ID3: select the attribute that maximizes Gain(S, A); repeat until every leaf is pure. (A minimal code sketch of this rule appears at the end of this section, after the discussion of inductive bias.)

ID3
• Step 1. Outlook has the highest gain, so it is tested at the root.
• The Overcast branch is pure, so it becomes a leaf.
• The other branches are expanded recursively; when computing Gain(S, A) for a branch, S is the set of instances belonging to the current branch.

Issues
• Hypothesis Space.
• Inductive Bias.
• Overfitting.
• Real-valued features.
• Missing values.

Hypothesis Space
• Fact: every finite discrete function can be represented by a decision tree.
• ID3 searches a complete space.
• Note: every n-feature Boolean function can be represented by a binary decision tree of depth n.
• Just test the features one by one.

Inductive bias
• ID3 outputs only one decision tree, but there can be many decision trees consistent with the training data.
• What is the inductive bias?
• ID3 prefers shorter trees.
• Purity ⇒ less need for further classification ⇒ shorter trees.

Inductive bias
• Concept learning algorithm.
• Restriction Bias: assume the true target is a conjunction of constraints.
• Preference Bias: none (or consistency); we output the whole version space.
• With no restriction bias,
• (a) we would need every possible instance to be present to narrow the space down to exactly one hypothesis, and
• (b) predictions would be no better than random guesses.

Inductive bias
• Decision tree learning + ID3 algorithm.
• Restriction Bias: none; the space is complete.
• Preference Bias: shorter trees are preferred.
• If no preference bias: ??

Inductive bias
• Fact: every finite discrete function can be represented by a decision tree.
• ID3 searches a complete space.
• ID3 prefers shorter trees.
• Purity ⇒ less need for further classification ⇒ shorter trees.
• Why not longer trees?
• Occam's Razor: the simplest model that explains the data should be preferred.
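As promised above, here is a minimal Python sketch of the ID3 selection rule: compute Gain(S, A) for every candidate attribute and pick the largest. The function names, the dictionary-based instance representation, and the tiny toy dataset are illustrative assumptions, not the lecture's code or data.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of target values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_i (|S_i|/|S|) * Entropy(S_i)."""
    total = len(labels)
    expected = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        expected += (len(subset) / total) * entropy(subset)
    return entropy(labels) - expected

def best_attribute(rows, labels, attributes):
    """ID3 selection rule: the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data in the spirit of the Play Tennis example.
rows = [
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
    {"Outlook": "Overcast", "Wind": "Weak"},
    {"Outlook": "Rain", "Wind": "Weak"},
]
labels = ["No", "No", "Yes", "Yes"]
print(best_attribute(rows, labels, ["Outlook", "Wind"]))  # Outlook
```

The full ID3 algorithm applies this rule at the root, splits the instances by the chosen attribute's values, and recurses on each impure branch until every leaf is pure.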
Overfitting
• Consider the error of a hypothesis h over
• (a) the training data, error_train(h), and
• (b) the entire distribution D of the data, error_D(h).
• Overfitting happens if there is another hypothesis h' such that
• error_train(h) < error_train(h') and
• error_D(h) > error_D(h').
• Generalization/prediction is important.

Overfitting
• Source of overfitting: the training data cannot reflect the entire distribution.
• (a) Noise: if some training data have errors, they cannot reflect the entire distribution; fitting them does not help.
• (b) Imbalanced distribution: if a leaf has only one training instance, is it reliable?
• [Figure: four y-vs-x panels: the true function, good data, data with too much noise, and ill-distributed data.]
• Overfitting happens when the tree is too large [the rules are too specific].
• [Plot: as the size of the tree grows, accuracy on the training data keeps increasing while accuracy on the testing data eventually drops.]
• How to avoid overfitting? Control the size of the tree.

Overfitting
• How to avoid overfitting? Control the size of the tree.
• Method 1. Stop growing the tree if a further split is not statistically significant.
• Method 2. Grow the full tree and then prune it.
• Use training data to grow the tree.
• Use validation data to evaluate the utility of post-pruning.
• Idea: the same noise is not likely to appear in both datasets.
• Two approaches: reduced-error pruning and rule post-pruning.

Overfitting
• Reduced-error pruning.
• Pruning a node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training instances affiliated with that node.
• Pruning effect: the change in accuracy on the validation set after pruning.
• Algorithm (see the sketch at the end of this section):
• Prune the node that has the best pruning effect.
• Repeat until no pruning is helpful [i.e., no pruning can increase the accuracy on the validation set].

Decision Tree
• Overfitting.
• Reduced-error pruning.
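As a rough illustration of the reduced-error pruning loop described above, here is a minimal Python sketch. The Node class, the helper functions, and the rule "prune only while validation accuracy strictly improves" are assumptions made for illustration; they are not taken from the lecture.

```python
class Node:
    """A decision tree node: either a leaf carrying a label, or an internal attribute test."""
    def __init__(self, attribute=None, children=None, label=None, majority=None):
        self.attribute = attribute      # attribute tested at this node (None for a leaf)
        self.children = children or {}  # maps an attribute value to a child Node
        self.label = label              # class label if this node is a leaf, else None
        self.majority = majority        # most common training label among instances at this node

def classify(node, instance):
    """Follow the attribute tests until a leaf is reached."""
    if node.label is not None:
        return node.label
    child = node.children.get(instance[node.attribute])
    return classify(child, instance) if child is not None else node.majority

def accuracy(tree, data):
    """Fraction of (instance, label) pairs classified correctly."""
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    """All non-leaf nodes; each is a candidate for pruning."""
    if node.label is not None:
        return []
    return [node] + [n for child in node.children.values() for n in internal_nodes(child)]

def reduced_error_pruning(tree, validation):
    """Repeatedly prune the node whose removal best improves validation accuracy."""
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in internal_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            # Trial pruning: turn the node into a leaf predicting its majority label.
            node.attribute, node.children, node.label = None, {}, node.majority
            gain = accuracy(tree, validation) - base
            node.attribute, node.children, node.label = saved  # undo the trial pruning
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None:  # no pruning increases validation accuracy: stop
            return tree
        best_node.attribute, best_node.children, best_node.label = None, {}, best_node.majority
```

Here, validation is a list of (instance, label) pairs held out from training, and majority must be recorded while the tree is grown so that a pruned node knows which label to predict.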