Machine Learning: Decision Trees
Chapter 18.1–18.3
Some material adopted from notes by Chuck Dyer

What is learning?
• "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time." –Herbert Simon
• "Learning is constructing or modifying representations of what is being experienced." –Ryszard Michalski
• "Learning is making useful changes in our minds." –Marvin Minsky

Why study learning?
• Understand and improve the efficiency of human learning
  – Use to improve methods for teaching and tutoring people (e.g., better computer-aided instruction)
• Discover new things or structure that were previously unknown
  – Examples: data mining, scientific discovery
• Fill in skeletal or incomplete specifications about a domain
  – Large, complex AI systems can't be completely derived by hand and require dynamic updating to incorporate new information
  – Learning new characteristics expands the domain of expertise and lessens the "brittleness" of the system
• Build agents that can adapt to users, other agents, and their environment

A general model of learning agents
(figure: architecture of a general learning agent)

Major paradigms of machine learning
• Rote learning – One-to-one mapping from inputs to stored representation. "Learning by memorization." Association-based storage and retrieval.
• Induction – Use specific examples to reach general conclusions
• Clustering – Unsupervised identification of natural groups in data
• Analogy – Determine correspondence between two different representations

The inductive learning problem
• Extrapolate from a given set of examples to make accurate predictions about future examples
• Supervised versus unsupervised learning
  – Learn an unknown function f(X) = Y, where X is an input example and Y is the desired output
  – Supervised learning implies we are given a training set of (X, Y) pairs by a "teacher"
  – Unsupervised learning means we are only given the Xs and some (ultimate) feedback function on our performance
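The supervised setting above — learning an unknown f(X) = Y from a training set of (X, Y) pairs — can be made concrete with the simplest paradigm on the list, rote learning. This is a minimal illustrative sketch (the toy inputs and the name `rote_learner` are invented for the example, not from the notes):

```python
# Rote learning: a one-to-one mapping from inputs to stored outputs,
# i.e. "learning by memorization" with association-based retrieval.

def rote_learner(training_set):
    """Return a classifier that simply memorizes the (x, y) pairs."""
    memory = dict(training_set)   # input -> stored output
    def f(x):
        # Retrieval only: a rote learner cannot extrapolate to unseen inputs.
        return memory.get(x)
    return f

# Toy training set of (X, Y) pairs supplied by a "teacher".
examples = [("sunny", "+"), ("rainy", "-"), ("cloudy", "+")]
f = rote_learner(examples)
print(f("sunny"))   # memorized example -> "+"
print(f("foggy"))   # unseen example -> None (no generalization)
```

The gap this exposes — no prediction at all for unseen inputs — is exactly what induction (reaching general conclusions from specific examples) is meant to fill.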
Major paradigms of machine learning (continued)
• Discovery – Unsupervised; a specific goal is not given
• Genetic algorithms – "Evolutionary" search techniques, based on an analogy to "survival of the fittest"
• Reinforcement – Feedback (positive or negative reward) given at the end of a sequence of steps

The inductive learning problem (continued)
• Concept learning or classification
  – Given a set of examples of some concept/class/category, determine if a given example is an instance of the concept or not
  – If it is an instance, we call it a positive example
  – If it is not, it is called a negative example
  – Or we can make a probabilistic prediction (e.g., using a Bayes net)

Inductive learning framework
• Raw input data from sensors are typically preprocessed to obtain a feature vector, X, that adequately describes all of the relevant features for classifying examples
• Each x is a list of (attribute, value) pairs. For example, X = [Person:Sue, EyeColor:Brown, Age:Young, Sex:Female]
• The number of attributes (a.k.a. features) is fixed (positive, finite)
• Each attribute has a fixed, finite number of possible values (or could be continuous)
• Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of attributes

Supervised concept learning
• Given a training set of positive and negative examples of a concept
• Construct a description that will accurately classify whether future examples are positive or negative
• That is, learn some good estimate of the function f given a training set {(x1, y1), (x2, y2), ..., (xn, yn)}, where each yi is either + (positive) or - (negative), or a probability distribution over +/-

Inductive learning as search
• Instance space I defines the language for the training and test instances
  – Typically, but not always, each instance i ∈ I is a feature vector
  – Features are sometimes called attributes or variables
  – I: V1 × V2 × … × Vk, i = (v1, v2, …, vk)
• Class variable C gives an instance's class (to be predicted)
• Model space M defines the possible classifiers
  – M: I → C, M = {m1, …, mn} (possibly infinite)
  – Model space is sometimes, but not always, defined in terms of the same features as the instance space
• Training data can be used to direct the search for a good hypothesis in the model space

Model spaces
• Decision trees
  – Partition the instance space into axis-parallel regions, labeled with class value
• Version spaces
  – Search for necessary (lower-bound) and sufficient (upper-bound) partial instance descriptions for an instance to be a member of the class
• Nearest-neighbor classifiers
  – Partition the instance space into regions defined by the centroid instances (or cluster of k instances)
• Associative rules (feature values → class)
• First-order logical rules (consistent, complete, simple)
• Bayesian networks (probabilistic dependencies of class on attributes)
• Neural networks

Model spaces
(figure: how a decision tree, a version space, and a nearest-neighbor classifier each partition the same instance space I into + and - regions)

Inductive learning and bias
• Suppose that we want to learn a function f(x) = y and we are given some sample (x, y) pairs, as in figure (a)
• There are several hypotheses we could make about this function, e.g.: (b), (c) and (d)
• A preference for one over the others reveals the bias of our learning technique, e.g.:
  – prefer piece-wise functions
  – prefer a smooth function
  – prefer a simple function and treat outliers as noise

Preference bias: Ockham's Razor
• A.k.a. Occam's Razor, Law of Economy, or Law of Parsimony
• Principle stated by William of Ockham (1285–1347/49), a scholastic:
  – "non sunt multiplicanda entia praeter necessitatem"
  – or, entities are not to be multiplied beyond necessity
• The simplest consistent explanation is the best
• Therefore, the smallest decision tree that correctly classifies all of the training examples is best
• Finding the provably smallest decision tree is NP-hard, so instead of constructing the absolute smallest tree consistent with the training examples, construct one that is pretty small

Learning decision trees
• Goal: Build a decision tree to classify examples as positive or negative instances of a concept using supervised learning from a training set
• A decision tree is a tree where
  – each non-leaf node has associated with it an attribute (feature)
  – each leaf node has associated with it a classification (+ or -)
  – each arc has associated with it one of the possible values of the attribute at the node from which the arc is directed
• Generalization: allow for >2 classes
  – e.g., for stocks, classify into {sell, hold, buy}
(figure: an example tree testing Color at the root with arcs green/red/blue, then Size or Shape, with +/- classifications at the leaves)

Decision tree-induced partition – example

Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf
(figure: the example Color/Size/Shape tree and the axis-parallel partition of instance space I that it induces)
• Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
• We prefer to find more compact decision trees

Hypothesis spaces
• How many distinct decision trees with n Boolean attributes?
  – = number of Boolean functions
  – = number of distinct truth tables with 2^n rows = 2^(2^n)
  – e.g., with 6 Boolean attributes, 18,446,744,073,709,551,616 trees
• How many conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
  – Each attribute can be in (positive), in (negative), or out ⇒ 3^n distinct conjunctive hypotheses
  – e.g., with 6 Boolean attributes, 729 hypotheses

R&N's restaurant domain
• Develop a decision tree to model the decision a patron makes when deciding whether or not to wait for a table at a restaurant
• Two classes: wait, leave
• Ten attributes: Alternative available? Bar in restaurant? Is it Friday? Are we hungry? How full is the restaurant? How expensive? Is it raining? Do we have a reservation? What type of restaurant is it? What's the purported waiting time?

Hypothesis spaces (continued)
• A more expressive hypothesis space
  – increases chance that target function can be expressed
  – increases number of hypotheses consistent with training set ⇒ may get worse predictions in practice

R&N's restaurant domain (continued)
• Training set of 12 examples
• ~7,000 possible cases

A decision tree from introspection
(figure: a hand-built decision tree for the restaurant domain, e.g., situations where I will/won't wait for a table)

Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
  – E.g., situations where I will/won't wait for a table
• Classification of examples is positive (T) or negative (F)
• Serves as a training set

ID3 Algorithm
• A greedy algorithm for decision tree construction developed by Ross Quinlan circa 1987
• Top-down construction of the decision tree by recursively selecting the "best attribute" to use at the current node in the tree
  – Once an attribute is selected for the current node, generate child nodes, one for each possible value of the selected attribute
  – Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node
  – Repeat for each child node until all examples associated with a node are either all positive or all negative

Choosing the best attribute
• The key problem is choosing which attribute to split a given set of examples on
• Some possibilities are:
  – Random: Select any attribute at random
  – Least-Values: Choose the attribute with the smallest number of possible values
  – Most-Values: Choose the attribute with the largest number of possible values
  – Max-Gain: Choose the attribute that has the largest expected information gain, i.e., the attribute that will result in the smallest expected size of the subtrees rooted at its children
• The ID3 algorithm uses the Max-Gain method of selecting the best attribute

ID3-induced decision tree
(figure: the decision tree ID3 induces from the 12 restaurant examples)

Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
• Patrons? is a better choice than Type?
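The recursive construction and the Max-Gain criterion described above can be sketched in a few lines. This is a minimal illustrative sketch, not Quinlan's actual implementation; the four toy examples and their attribute values are invented for the demo (the attribute names Patrons and Type just echo the restaurant domain):

```python
import math
from collections import Counter

def entropy(examples):
    """Entropy (in bits) of the class labels of a set of (x, y) examples."""
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr):
    """Expected reduction in entropy from splitting on attr (Max-Gain)."""
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    labels = {y for _, y in examples}
    if len(labels) == 1:              # all positive or all negative: leaf
        return labels.pop()
    if not attributes:                # no attributes left: majority class
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {}                         # one child per value of the best attribute
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[value] = id3(subset, [a for a in attributes if a != best])
    return (best, tree)

# Toy training set: each x is a dict of (attribute, value) pairs,
# each y is + (positive) or - (negative).
data = [({"Patrons": "none", "Type": "thai"},   "-"),
        ({"Patrons": "some", "Type": "thai"},   "+"),
        ({"Patrons": "some", "Type": "french"}, "+"),
        ({"Patrons": "full", "Type": "french"}, "-")]
print(id3(data, ["Patrons", "Type"]))
```

On this toy set, splitting on Patrons yields all-pure subsets (gain = 1 bit) while Type yields no gain at all, so ID3 puts Patrons at the root — the same "Patrons? beats Type?" behavior the slide points out.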
Information theory 101
• Information theory sprang almost fully formed from the seminal work of Claude E. Shannon at Bell Labs
  – classic paper "A Mathematical Theory of Communication", Bell System Technical Journal, 1948
• Information is measured in bits
• If there are n equally probable possible messages, then the probability p of each is 1/n
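A quick numeric check of the "measured in bits" idea. The formula used here is the standard Shannon measure — a message of probability p carries -log2(p) bits, so with n equally probable messages each carries log2(n) bits — stated as background, not as part of the notes above; the sample values of n are arbitrary:

```python
import math

def bits(p):
    """Information (in bits) conveyed by a message of probability p."""
    return -math.log2(p)

# With n equally probable messages, each has probability p = 1/n,
# so receiving one message conveys log2(n) bits.
for n in (2, 4, 16):
    print(n, bits(1 / n))   # 2 -> 1.0, 4 -> 2.0, 16 -> 4.0
```

E.g., learning the outcome of one fair coin flip (n = 2) conveys exactly 1 bit.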