Overview

• Learning in general
• Inductive learning (chapter 18)
• Statistical learning: neural networks (chapter 20; old 19)

Learning

• Adapt through interaction with the world: from rote memory to developing a complex strategy.
• Types of learning:
  1. Supervised (dense feedback)
  2. Unsupervised (no feedback)
  3. Reinforcement (sparse feedback, environment-altering), etc.
• Advantages (two, among many):
  1. Fault tolerance
  2. No need for a complete specification to begin with
• Becoming a central focus of AI.


Inductive Learning

• Given example pairs (x, f(x)), return a function h that approximates the function f:
  – pure inductive inference, or induction.
• The function h is called a hypothesis.

Training and Testing

• Different types of error:
  – Training error
  – Validation error
  – Test error
• Issues:
  – Generalization
  – Bias-variance dilemma
  – Overfitting, underfitting
  – Model complexity

Inductive Learning and Inductive Bias

[Figure: (a) a set of training data points; (b)-(d) three different hypotheses fit to the same data.]

• Given (a) as the training data, we can come up with several different hypotheses: (b) to (d).
• Selection of one hypothesis over another is called an inductive bias (don't confuse with other things called bias):
  – exact match to training data
  – prefer an imprecise but smooth approximation
  – etc.

Decision Trees

[Figure: decision tree for the restaurant domain, with root test Patrons? (None/Some/Full) and further tests Hungry?, Type?, and Fri/Sat?.]

• Learn to approximate discrete-valued target functions.
• Step-by-step decision making (disjunction of conjunctions).
• Applications: medical diagnosis, assessing the credit risk of loan applicants, etc.


Decision Trees: What They Represent

[Figure: the restaurant decision tree. Patrons?: None leads to No; Some leads to Yes; Full leads to Hungry?. Hungry?: Yes leads to No; No leads to Type?. Type?: French leads to Yes; Italian to No; Thai to Fri/Sat?; Burger to Yes. Fri/Sat?: No leads to No; Yes leads to Yes.]

• Wait or not (Yes/No)? The decision above corresponds to:

    (Patrons = Some)
    ∨ (Patrons = Full ∧ Hungry = No ∧ Type = French)
    ∨ (Patrons = Full ∧ Hungry = No ∧ Type = Thai ∧ Fri/Sat = Yes)
    ∨ (Patrons = Full ∧ Hungry = No ∧ Type = Burger)

• Decision trees represent a disjunction of conjunctions (see the code sketch below).

Decision Trees: What They Represent (cont'd)

• In other words, for each instance (or example), there are attributes (Patrons, Hungry, etc.), and each instance has a full attribute-value assignment.
• For a given instance, it is classified into different discrete classes by the decision tree.
• For training, many (instance, class) pairs are used.
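To make the representation concrete, here is a minimal Python sketch (an editorial illustration, not from the original slides): the tree above encoded as nested dicts, with a classify function that walks from the root test to a leaf class.

    # The restaurant tree above, as nested dicts: an internal node tests an
    # attribute; a leaf is the class label ("Yes" = wait, "No" = don't).
    tree = {"attr": "Patrons", "branches": {
        "None": "No",
        "Some": "Yes",
        "Full": {"attr": "Hungry", "branches": {
            "Yes": "No",
            "No": {"attr": "Type", "branches": {
                "French": "Yes",
                "Italian": "No",
                "Thai": {"attr": "Fri/Sat", "branches": {"No": "No", "Yes": "Yes"}},
                "Burger": "Yes"}}}}}}

    def classify(node, instance):
        """Follow the branch matching the instance's attribute value
        until a leaf (a plain class label) is reached."""
        while isinstance(node, dict):
            node = node["branches"][instance[node["attr"]]]
        return node

    # One conjunction of the disjunction above:
    print(classify(tree, {"Patrons": "Full", "Hungry": "No",
                          "Type": "Thai", "Fri/Sat": "Yes"}))   # -> Yes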

Constructing Decision Trees from Examples

  Example | Alt Bar Fri Hun Pat   Price Rain Res  Type    Est   | WillWait
  --------+--------------------------------------------------- -+---------
  X1      | Yes No  No  Yes Some  $$$   No   Yes  French  0–10  | Yes
  X2      | Yes No  No  Yes Full  $     No   No   Thai    30–60 | No
  X3      | No  Yes No  No  Some  $     No   No   Burger  0–10  | Yes
  X4      | Yes No  Yes Yes Full  $     No   No   Thai    10–30 | Yes
  X5      | Yes No  Yes No  Full  $$$   No   Yes  French  >60   | No
  X6      | No  Yes No  Yes Some  $$    Yes  Yes  Italian 0–10  | Yes
  X7      | No  Yes No  No  None  $     Yes  No   Burger  0–10  | No
  X8      | No  No  No  Yes Some  $$    Yes  Yes  Thai    0–10  | Yes
  X9      | No  Yes Yes No  Full  $     Yes  No   Burger  >60   | No
  X10     | Yes Yes Yes Yes Full  $$$   No   Yes  Italian 10–30 | No
  X11     | No  No  No  No  None  $     No   No   Thai    0–10  | No
  X12     | Yes Yes Yes Yes Full  $     No   No   Burger  30–60 | Yes

• Given a set of examples (training set), both positive and negative, the task is to construct a decision tree that describes a concise decision path.
• Using the resulting decision tree, we want to classify new instances of examples (either as yes or no).

Constructing Decision Trees: Trivial Solution

• A trivial solution is to explicitly construct paths for each given example.
• The problem with this approach is that it cannot deal with situations where some attribute values are missing or new kinds of situations arise.
• Consider also that some attributes may not count much toward the final classification.


Finding a Concise Decision Tree

• Memorizing all cases may not be the best way.
• We want to extract a decision pattern that can describe a large number of cases in a concise way.
• Such an inductive bias is called Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
• In terms of a decision tree, we want to make as few tests as possible before reaching a decision, i.e., the depth of the tree should be shallow.

Finding a Concise Decision Tree (cont'd)

• Basic idea: pick attributes that can clearly separate positive and negative cases.
• These attributes are more important than others: the final classification heavily depends on the value of these attributes.

Finding a Concise Decision Tree (cont'd)

[Figure: splitting the 12 examples (+: X1,X3,X4,X6,X8,X12; −: X2,X5,X7,X9,X10,X11). (a)-(b) Splitting on Type? leaves every branch mixed (French: +X1/−X5; Italian: +X6/−X10; Thai: +X4,X8/−X2,X11; Burger: +X3,X12/−X7,X9). (c) Splitting on Patrons? gives None: all negative (X7,X11); Some: all positive (X1,X3,X6,X8); only Full (+X4,X12; −X2,X5,X9,X10) is mixed, and a further Hungry? test (Y: +X4,X12/−X2,X10; N: −X5,X9) largely resolves it.]

Decision Tree Learning Algorithm

  function DECISION-TREE-LEARNING(examples, attributes, default) returns a decision tree
    inputs: examples, set of examples
            attributes, set of attributes
            default, default value for the goal predicate
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return MAJORITY-VALUE(examples)
    else
      best ← CHOOSE-ATTRIBUTE(attributes, examples)
      tree ← a new decision tree with root test best
      for each value v_i of best do
        examples_i ← elements of examples with best = v_i
        subtree ← DECISION-TREE-LEARNING(examples_i, attributes − best, MAJORITY-VALUE(examples))
        add a branch to tree with label v_i and subtree subtree
      end
      return tree
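Below is a minimal Python transcription of the pseudocode (an editorial sketch, not the course's code). Examples are (attribute-dict, class) pairs; CHOOSE-ATTRIBUTE is passed in so that the information-gain heuristic of the next slides can be plugged in. One simplification: branches are created only for attribute values that actually occur in the examples.

    from collections import Counter

    def majority_value(examples):
        """MAJORITY-VALUE: the most common classification among the examples."""
        return Counter(cls for _, cls in examples).most_common(1)[0][0]

    def decision_tree_learning(examples, attributes, default, choose_attribute):
        """Direct transcription of DECISION-TREE-LEARNING. Returns the same
        nested-dict tree format used in the earlier classify() sketch."""
        if not examples:
            return default
        classes = {cls for _, cls in examples}
        if len(classes) == 1:                       # all same classification
            return classes.pop()
        if not attributes:
            return majority_value(examples)
        best = choose_attribute(attributes, examples)
        tree = {"attr": best, "branches": {}}
        for v in {inst[best] for inst, _ in examples}:
            exs_v = [(inst, cls) for inst, cls in examples if inst[best] == v]
            tree["branches"][v] = decision_tree_learning(
                exs_v, [a for a in attributes if a != best],
                majority_value(examples), choose_attribute)
        return tree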


Resulting Decision Tree

[Figure: the learned tree. Patrons?: None leads to No; Some to Yes; Full to Hungry?. Hungry?: No leads to Type?; Yes to No. Type?: French leads to Yes; Italian to No; Thai to Fri/Sat?; Burger to Yes. Fri/Sat?: No leads to No; Yes to Yes.]

• Some attributes are not tested at all.
• Odd paths can be generated (the Thai food branch).
• Sometimes the tree can be incorrect for new examples (exceptional cases).

Accuracy of Decision Trees

[Figure: learning curve, % correct on the test set (about 0.4 up to 1.0) vs. training set size (0 to 100).]

• Divide the examples into training and test sets.
• Train using the training set.
• Measure the accuracy of the resulting decision tree on the test set.

Choosing the Best Attribute to Test First

• Use Shannon's entropy to choose the attribute that gives the maximum information gain.
• Pick an attribute such that the information gain (or entropy reduction) is maximized.
• Entropy measures the average surprisal of events: less probable events are more surprising.

Entropy and Information Gain

    Entropy(E) = − Σ_{i ∈ C} P_i log₂(P_i)

    Gain(E, A) = Entropy(E) − Σ_{v ∈ Values(A)} ( |E_v| / |E| ) · Entropy(E_v)

• E: set of examples
• A: a single attribute
• E_v: set of examples where attribute A = v
• |S|: cardinality of set S
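The two formulas translate directly into Python; this editorial sketch supplies the CHOOSE-ATTRIBUTE function used by the learning algorithm above (examples are (attribute-dict, class) pairs, as before).

    import math
    from collections import Counter

    def entropy(examples):
        """Entropy(E) = -sum_i P_i log2(P_i) over the class distribution."""
        counts = Counter(cls for _, cls in examples)
        total = len(examples)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, attr):
        """Gain(E, A) = Entropy(E) - sum_v |E_v|/|E| * Entropy(E_v)."""
        total = len(examples)
        remainder = 0.0
        for v in {inst[attr] for inst, _ in examples}:
            e_v = [(inst, cls) for inst, cls in examples if inst[attr] == v]
            remainder += len(e_v) / total * entropy(e_v)
        return entropy(examples) - remainder

    def choose_attribute(attributes, examples):
        """CHOOSE-ATTRIBUTE: pick the attribute with maximum gain."""
        return max(attributes, key=lambda a: information_gain(examples, a))

On the 12 restaurant examples (6 positive, 6 negative), Patrons scores about 0.541 bits of gain while Type scores 0, which is why Patrons is tested first.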


Issues in Decision Tree Learning

• Noise and overfitting
• Missing attribute values in examples
• Multi-valued attributes with a large number of possible values
• Continuous-valued attributes

Key Points

• Decision tree learning: what is the embodied principle (or bias)?
• How to choose the best attribute? Given a set of examples, choose the best attribute to test first.
• What are the issues? Noise, overfitting, etc.

Neural Networks

• Neural networks are one particular form of learning from data.
• Simple processing elements: named units, or neurons.
• Connective structure and associated connection weights.
• Learning: adaptation of the connection weights.
• Neural networks mimic the human (or animal) nervous system.

Many Faces of Neural Networks

• Abstract mathematical/statistical model
• Optimization algorithm
• Pattern recognition algorithm
• Tool for understanding the function of the brain
• Robust engineering applications


The Central Nervous System

• Cortex: thin outer sheet where most of the neurons are.
• Sub-cortical nuclei: thalamus, hippocampus, basal ganglia, etc.
• Midbrain, pons, and medulla: connect to the spinal cord.
• Cerebellum (hindbrain, or "small brain").

Function of the Nervous System

• Perception
• Cognition
• Motor control
• Regulation of essential bodily functions

The Central Nervous System: Facts

Facts about the human neocortex [1]:

• Thickness: 1.6 mm
• Area: 36 cm × 36 cm (about 1.4 ft²)
• Neurons: 10 billion (10¹⁰) [2]
• Connections: 60 trillion (6 × 10¹³) to 100 trillion (10¹⁴)
• Connections per neuron: 10⁴
• Energy usage per operation: 10⁻¹⁶ J (compare to 10⁻⁶ J in modern computers)

[1] Neural Networks: A Comprehensive Foundation by Simon Haykin (1994), and Foundations of Vision by Brian Wandell (1995). May slightly differ from those in Russell & Norvig. No need to memorize these figures.
[2] Note: a more recent/accurate estimate is 86 billion neurons for the whole brain (see https://www.brainfacts.org/the-brain-facts-book).

How the Brain Differs from Computers

• Densely connected.
• Massively parallel.
• Highly nonlinear.
• Asynchronous: no central clock.
• Fault tolerant.
• Highly adaptable.
• Creative.

Why are these crucial?


Neurons: Basic Functional Unit of the Brain

[Figure: a neuron, showing the dendritic arbor, dendrites, cell body, nucleus, axon, and axon terminals.]

• Dendrites receive input from upstream neurons.
• Ions flow in to make the cell positively charged.
• Once a firing threshold is reached, a spike is generated and transmitted along the axon.
• Axon terminals release neurotransmitters to relay the signal to the downstream neurons.

Propagation of Activation Across the Synapse

[Figure: presynaptic axon terminal and postsynaptic dendrite separated by the synaptic cleft; an arriving action potential triggers neurotransmitter release, producing a postsynaptic potential.]

1. An action potential reaches the axon terminal.
2. Neurotransmitters are released into the synaptic cleft and bind to the postsynaptic cell's receptors.
3. Binding allows ion channels to open (Na⁺), and Na⁺ ions flow in and make the postsynaptic cell depolarize.
4. Once the membrane voltage reaches the threshold, an action potential is generated.

Lesson: neural activity propagation has a very complex cellular/molecular mechanism.

Abstraction of the Neuron in Neural Networks

[Figure: a unit with inputs x, connection weights w, and output f(Σ wx).]

• Input
• Connection weight
• Transfer function: f(·)
• Typical transfer functions: step function or sigmoid.

Typical Activation Functions

[Figure: plots of step(x), sign(x), and the sigmoid 1/(1 + exp(−kx)) for k = 1, 2, 4.]

• Step_t(x) = 1 if x ≥ t, 0 if x < t
• Sign(x) = +1 if x ≥ 0, −1 if x < 0
• Sigmoid(x) = 1 / (1 + e⁻ˣ)
• Note that Step_t(x) = Step_0(x − t), which we will simply call Step(x − t).
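For quick reference, the three activation functions in plain Python (an editorial sketch; the slope parameter on the sigmoid reflects the 1/(1 + exp(−kx)) family plotted above).

    import math

    def step(x, t=0.0):
        """Step_t(x) = 1 if x >= t else 0; note Step_t(x) == Step_0(x - t)."""
        return 1.0 if x >= t else 0.0

    def sign(x):
        """Sign(x) = +1 if x >= 0 else -1."""
        return 1.0 if x >= 0 else -1.0

    def sigmoid(x, slope=1.0):
        """Sigmoid(x) = 1 / (1 + exp(-slope * x)); larger slope -> closer to step."""
        return 1.0 / (1.0 + math.exp(-slope * x))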


Classification of Neural Networks

Teacher exists?

• Supervised (with teacher): perceptrons, backpropagation network, etc.
• Unsupervised (no teacher): self-organizing maps, etc.

Recurrent connections?

• Feed-forward: perceptrons, backpropagation network, etc.
• Recurrent: Hopfield network, Boltzmann machines, SRN (simple recurrent network), LSTM, GRU, etc.

More Activation Functions: tanh(x/2)

[Figure: sigmoid(x) and tanh(x/2) curves.]

• Sigmoid(x) = 1 / (1 + e⁻ˣ)
• tanh(x/2) = (1 − e⁻ˣ) / (1 + e⁻ˣ)

Feedforward Networks

• Perceptrons: single layer, threshold-gated.
• Backpropagation networks: multiple layers, sigmoid (or tanh) activation function. (This is the basis of all deep learning algorithms.)

Perceptrons

[Figure: a single unit computing f(Σ wx) from inputs a_j through weights W_ij.]

    a_i = step_t( Σ_{j=1..n} W_ij a_j )
        = step_0( Σ_{j=1..n} W_ij a_j − t )
        = step_0( t × (−1) + Σ_{j=1..n} W_ij a_j )
        = step_0( Σ_{j=0..n} W_ij a_j ),   where W_i0 = t and a_0 = −1        (1)

Boolean Logic Gates with Perceptron Units

[Figure: three units with a −1 bias input: AND (W1 = W2 = 1, t = 1.5), OR (W1 = W2 = 1, t = 0.5), NOT (W1 = −1, t = −0.5). Same as Russell & Norvig p. 570, figure 19.6.]

• Input: 0 or 1. Note that the activation function is the step_0(·) function.
• Perceptrons can represent basic Boolean functions.
• Thus, a network of perceptron units can compute any Boolean function (see the sketch below).

Limitation of Perceptrons

[Figure: the (I0, I1) input plane cut by the line W0·I0 + W1·I1 = t; output 1 on or above the line, 0 below.]

• Perceptrons can only represent linearly-separable functions.
• Output of the perceptron:
  – if W0 × I0 + W1 × I1 − t ≥ 0, then the output is 1
  – if W0 × I0 + W1 × I1 − t < 0, then the output is 0
• What about XOR or EQUIV?
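A tiny editorial Python sketch of the three gates above as step-threshold units; composing them yields XOR, illustrating that a *network* of units escapes the single-unit limitation discussed next.

    def unit(weights, inputs, t):
        """A threshold unit: step_0(sum_i w_i x_i - t)."""
        s = sum(w * x for w, x in zip(weights, inputs))
        return 1 if s - t >= 0 else 0

    AND = lambda a, b: unit([1, 1], [a, b], t=1.5)
    OR  = lambda a, b: unit([1, 1], [a, b], t=0.5)
    NOT = lambda a:    unit([-1],   [a],    t=-0.5)

    # XOR needs a network of units (it is not linearly separable):
    XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, AND(a, b), OR(a, b), XOR(a, b))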

Limitation of Perceptrons (cont'd)

[Figure: the same (I0, I1) plane and separating line.]

• A geometrical interpretation of this is:

    I1 = −(W0 / W1) × I0 + t / W1,

  where for points on or above the line the output is 1, and 0 for those below the line (when W1 is positive). Compare with the line equation

    y = −(W0 / W1) × x + t / W1.

• Note: when dividing both sides by W1, the inequality can flip its direction depending on the sign (see the previous page).

Limitation of Perceptrons (cont'd)

• Thus, perceptrons can represent only those functions where the points that result in 0 and 1 as output can be separated by a line.
• Note: the previous result generalizes to functions of n arguments, i.e., a perceptron with n inputs plus one threshold (or bias) unit.

Linear Separability

[Figure: left, two classes separable by a straight line; right, classes that no straight line can separate.]

• For functions that take integer or real values as arguments and output either 0 or 1.
• Left: linearly separable (i.e., a straight line can be drawn between the classes).
• Right: not linearly separable (i.e., perceptrons cannot represent such a function).

Linear Separability (cont'd)

[Figure: AND, OR, and XOR in the (I0, I1) plane. For AND (only (1,1) outputs 1) and OR (only (0,0) outputs 0) a separating line exists; for XOR ((0,1) and (1,0) output 1) it does not.]

• Perceptrons cannot represent XOR!
• Minsky and Papert (1969).

Learning in Perceptrons

[Figure: a single unit computing f(Σ wx) from inputs a_j through weights W_ij.]

• The weights do not have to be calculated manually.
• We can train the network with (input, output) pairs according to the following weight update rule (a code sketch follows):

    W_ij ← W_ij + α × I_j × Err,

  where α is the learning rate parameter, I_j is the input (a_j in the figure), and Err = DesiredOutput − NetworkOutput.

Exercise: Implementing the Perceptron

• It is fairly easy to implement a perceptron.
• You can implement it in any programming language: C/C++, etc.
• Look for examples on the web.
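For instance, a minimal editorial implementation in Python (function name and the AND data set are illustrative, not from the course): the threshold is folded in as weight w[0] with a constant −1 input, as in equation (1).

    import random

    def train_perceptron(data, alpha=0.1, epochs=50):
        """Perceptron learning: W_j <- W_j + alpha * I_j * Err."""
        n = len(data[0][0])
        w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]
        for _ in range(epochs):
            for inputs, target in data:
                x = [-1.0] + list(inputs)            # bias input a_0 = -1
                out = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
                err = target - out                   # Desired - Network output
                w = [wi + alpha * xi * err for wi, xi in zip(w, x)]
        return w

    # AND is linearly separable, so this converges; XOR would not.
    and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    print(train_perceptron(and_data))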


Key Points

• The central nervous system: how it differs from conventional computers.
• Basic mechanism of synaptic information transfer.
• Types of neural networks.
• Perceptrons: the basic idea and the geometric interpretation. What is the limitation? How to train?

Overview

• Multilayer feed-forward networks
• Gradient descent search
• Backpropagation
• Evaluation of backpropagation
• Applications of backpropagation

Multilayer Feed-Forward Networks

[Figure: input units I_k feed hidden units a_j through weights W_jk; hidden units feed output units O_i through weights W_ij.]

• Proposed in the 1950s.
• A proper procedure for training the network came later (1969) and became popular in the 1980s: back-propagation.

Back-Propagation Learning Rule

• Back-prop is basically a gradient descent algorithm.
• The tough problem: the output layer has an explicit error measure, so finding the error surface is trivial. However, for the hidden layers, it is hard to determine how much error each connection eventually causes at the output nodes.
• Backpropagation determines how to distribute the blame to each connection.


Gradient Descent

[Figure: error E as a function of a weight W_i; where the slope is negative, ∆W = +; where the slope is positive, ∆W = −.]

• We want to minimize the total error E by tweaking the network weights.
• E depends on W_i; thus, by adjusting W_i, you can reduce E.
• For weight W_i and error function E, to minimize E, W_i should be changed according to W_i ← W_i + ∆W_i:

    ∆W_i = α × ( −∂E/∂W_i ),

  where α is the learning rate parameter.
• E can be a function of many weights, E(W_1, W_2, ..., W_i, ..., W_n, ...); thus the partial derivative is used in the above.

Gradient Descent (cont'd)

• Figuring out how to simultaneously adjust the weights W_i for all i at once is practically impossible, so use an iterative approach.
• A sensible way is to reduce E with respect to one weight W_i at a time, proportional to the gradient (or slope) at that point.
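In code, the rule W_i ← W_i − α ∂E/∂W_i is a one-line loop. This editorial Python sketch applies it to a toy error surface whose gradient is known in closed form.

    def gradient_descent(dE_dW, w, alpha=0.1, steps=100):
        """Iteratively apply W_i <- W_i + alpha * (-dE/dW_i) to every weight."""
        for _ in range(steps):
            grads = dE_dW(w)                      # partial derivatives at w
            w = [wi - alpha * g for wi, g in zip(w, grads)]
        return w

    # Toy error surface E(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1).
    dE = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
    print(gradient_descent(dE, [0.0, 0.0]))       # -> approximately [3, -1]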

Hidden to Output Weights

[Figure: the two-layer network again, highlighting the hidden-to-output weights W_ij.]

• Error function:

    E = (1/2) Σ_i (E_i)²,    E_i = T_i − O_i,    O_i = g( Σ_j W_ij a_j ),

  where g(·) is the sigmoid activation function and T_i the target.

Hidden to Output Weights (cont'd)

    ∂E/∂W_ij = ∂[ (1/2) Σ_i ( T_i − g( Σ_j W_ij a_j ) )² ] / ∂W_ij
             = −( T_i − O_i ) × ∂g( Σ_j W_ij a_j ) / ∂W_ij
             = −( T_i − O_i ) × g′( Σ_j W_ij a_j ) × a_j

Hidden to Output Weights (cont'd)

• For easier calculation later on, we can rewrite:

    ∂E/∂W_ij = −( T_i − O_i ) × g′( Σ_j W_ij a_j ) × a_j
             = −a_j × ∆_i,

  where ∆_i = ( T_i − O_i ) × g′( Σ_j W_ij a_j ).

Hidden to Output Weight Update

• From ∂E/∂W_ij = −a_j × ∆_i, we get the update rule:

    W_ij ← W_ij + α × a_j × ∆_i

• It is easy to verify g′(x) = g(x)(1 − g(x)) from g(x) = 1/(1 + e⁻ˣ), so we can reuse the g(x) value from the feed-forward phase in the feedback weight update.
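The reuse trick in miniature (an editorial Python sketch): the sigmoid value computed in the feed-forward phase is all that is needed for g′.

    import math

    def g(x):                      # sigmoid
        return 1.0 / (1.0 + math.exp(-x))

    def delta_output(target, net_input):
        """Delta_i = (T_i - O_i) * g'(net_i), using g'(x) = g(x) * (1 - g(x)):
        the g(net) value from the feed-forward phase is simply reused."""
        o = g(net_input)           # already available from feed-forward
        return (target - o) * o * (1.0 - o)

    # Weight update for one hidden-to-output connection:
    # W_ij <- W_ij + alpha * a_j * delta_output(T_i, net_i)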

Input to Hidden Weights

[Figure: the two-layer network again, highlighting the input-to-hidden weights W_jk.]

    ∂E/∂W_jk = ∂[ (1/2) Σ_i ( T_i − g( Σ_j W_ij g( Σ_k W_jk I_k ) ) )² ] / ∂W_jk

However, this is too complex.

Input to Hidden Weights (cont'd)

Use the chain rule for easier calculation of the partial derivative:

    ∂E/∂W_jk = ( ∂E/∂a_j ) × ( ∂a_j/∂W_jk )
             = −[ Σ_i ( T_i − O_i ) g′( Σ_j W_ij a_j ) W_ij ] × g′( Σ_k W_jk I_k ) I_k
             = −[ Σ_i ∆_i W_ij ] × g′( Σ_k W_jk I_k ) I_k

(the factor ( T_i − O_i ) g′( Σ_j W_ij a_j ) is ∆_i).

Input to Hidden Weight Update Rule

• From ∂E/∂W_jk, we can rename [ Σ_i ∆_i W_ij ] × g′( Σ_k W_jk I_k ) to be ∆_j; then the whole equation becomes:

    ∂E/∂W_jk = −∆_j I_k

• Thus the update rule becomes:

    W_jk ← W_jk + α × ∆_j × I_k

Back-Propagation: Summary

• Weight update:  ∆W_yx ∝ ∆_y × Input_x
• The ∆s:  ∆_y = Error_y × g′(WeightedSum_y)
• Thus, each node has its own ∆, and that is used to update the weights. These ∆s are backpropagated for weight updates further below.

General Case: More Than 2 Layers

• In general, the same rule for back-propagating ∆s applies to networks with more than two layers.
• That is, the ∆ for a deep hidden unit can be determined from the product of the weighted sum of the fed-back ∆s and the first derivative of the feed-forward weighted sum at the current unit.

Backpropagation Algorithm

1. Pick an (Input, Target) pair.
2. Using the input, activate the hidden and output layers through feed-forward activation.
3. At the output nodes, calculate the error (T_i − O_i), and from that calculate the ∆s.
4. Update the weights to the output layer, and backpropagate the ∆s.
5. Successively update the hidden layer weights until the input layer has been reached.
6. Repeat steps 1–5 until the total error goes below a set threshold.
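Putting steps 1-6 together, a self-contained editorial Python sketch that trains a 2-2-1 network on XOR (layer sizes, α, and epoch count are illustrative choices; convergence can depend on the random seed).

    import math, random

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def train_xor(alpha=0.5, epochs=10000, seed=1):
        """Bias is handled as an extra input fixed at -1 (equation (1))."""
        rnd = random.Random(seed)
        W = [[rnd.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # input->hidden
        V = [rnd.uniform(-1, 1) for _ in range(3)]                      # hidden->output
        data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
        for _ in range(epochs):
            for (x1, x2), T in data:                       # 1. pick (Input, Target)
                I = [-1.0, x1, x2]
                a = [-1.0] + [g(sum(w * i for w, i in zip(row, I))) for row in W]
                O = g(sum(v * ai for v, ai in zip(V, a)))  # 2. feed-forward
                d_o = (T - O) * O * (1 - O)                # 3. output Delta
                d_h = [d_o * V[j] * a[j] * (1 - a[j]) for j in range(3)]
                for j in range(3):                         # 4. output weights
                    V[j] += alpha * a[j] * d_o
                for j in (1, 2):                           # 5. hidden weights
                    for k in range(3):
                        W[j - 1][k] += alpha * d_h[j] * I[k]
        return W, V

    W, V = train_xor()                                     # 6. (fixed epoch budget)
    for x1, x2 in ((0, 0), (0, 1), (1, 0), (1, 1)):
        I = [-1.0, x1, x2]
        a = [-1.0] + [g(sum(w * i for w, i in zip(row, I))) for row in W]
        print(x1, x2, round(g(sum(v * ai for v, ai in zip(V, a))), 2))

After training, the four outputs should approach 0, 1, 1, 0.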


Technical Issues in Training

• Batch vs. online training
  – Batch: accumulate the weight updates over one epoch, then update.
  – Online: immediately apply the weight update after each input-output pair.
• When to stop training
  – Training set: use for training.
  – Validation set: determine when to stop.
  – Test set: use for testing performance.

Problems With Backprop

• Learning can be extremely slow: introduce momentum, etc.
• The network can get stuck in local minima: this is a common problem for any gradient-based method.
• Other issues: how to introduce new batches of data after training has been completed.

Backprop Application

• Speech generation: NETtalk (Sejnowski and Rosenberg, 1987)
• Character recognition: LeCun (1989)
• Driving a car: ALVINN, etc.
• Many other engineering applications: control, etc., especially nowadays in the form of deep neural networks (e.g., convolutional neural networks).

Demo: NETtalk

• "I want to", "I want to go to my grandmother's" ...
• friend, sent, around, not, red, soon, doubt, key, attention, lost


Key Points

• Basic concept of a multi-layer feed-forward network.
• How hidden units know how much error they caused.
• Backprop is a gradient descent algorithm.
• Drawbacks of backprop.

Overview

• More on backprop
• Self-organizing maps

Another Application of Backpropagation: Image Compression

[Figure: the two-layer network with a narrow hidden layer.]

• Image compression:
  1. The target output is the same as the input.
  2. The hidden layer has fewer units than the output (and input) layer.
  3. The hidden layer forms the compressed representation.
• This is also known as an autoencoder, and is the basis of many deep learning algorithms.

Improving Backpropagation

To overcome the local minima problem [1]:

• Add momentum (see the sketch below):

    ∆W_ij(t) = α × ∆_i × I_j + η × ∆W_ij(t − 1)

• Incremental update (as opposed to batch update) with random input-target order.
• Add a little bit of noise to the input.
• Allow E to increase with a small probability, as in Simulated Annealing.
• Many new innovations in deep learning: optimizers, skip connections, dropped connections, different activation functions (ReLU, etc.).

[1] From Hertz et al., Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.
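The momentum term in code (an editorial sketch; the α and η values are illustrative): the previous step's update is blended in so the weight keeps moving across plateaus and shallow local minima.

    def momentum_update(dW_prev, delta_i, I_j, alpha=0.2, eta=0.9):
        """Delta W_ij(t) = alpha * Delta_i * I_j + eta * Delta W_ij(t-1)."""
        return alpha * delta_i * I_j + eta * dW_prev

    # Usage inside a training loop (per connection):
    #   dW = momentum_update(dW, delta_i, I_j)
    #   W_ij += dW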

Backpropagation Exercise

• Backprop URL: http://www.cs.tamu.edu/faculty/choe/src/backprop-1.6.tar.gz
• Untar and read the README file:

    gzip -dc backprop-1.6.tar.gz | tar xvf -

• Run make to build (on departmental unix machines).
• Run ./bp conf/xor.conf etc.

Backpropagation: Example Results

[Figure: error vs. epochs (×10,000) for OR, AND, and XOR; error falls from about 0.25 toward 0.]

• Epoch: one full cycle of training through all training input patterns.
• OR was easiest, AND the next, and XOR was the most difficult to learn.
• The network had 2 input, 2 hidden, and 1 output units. The learning rate was 0.001.

Backpropagation: Example Results (cont'd)

[Figure: network output maps for OR, AND, and XOR; the outputs for (0,0), (0,1), (1,0), and (1,1) form each row.]

Backpropagation: Things to Try

• How does increasing the number of hidden layer units affect (1) the time and (2) the number of epochs of training?
• How does increasing or decreasing the learning rate affect the rate of convergence?
• How does changing the slope of the sigmoid affect the rate of convergence?
• Different problem domains: handwriting recognition, etc.


Unsupervised Learning

• No teacher signal (i.e., no feedback from the environment).
• The network must discover patterns, features, regularities, correlations, or categories in the input data and code them in the output.
• The units and connections must display some degree of self-organization.
• Unsupervised learning can be useful when there is redundancy in the input data: in a data channel where the input data content is less than the channel capacity, there is redundancy.

What Can Unsupervised Learning Do?

• Familiarity: how similar is the current input to past inputs?
• Principal Component Analysis: find orthogonal basis vectors (or axes) against which to project high-dimensional data.
• Clustering: n output classes, each representing a distinct category. Each cluster of similar or nearby patterns will be classified as a single class.
• Prototyping: for a given input, the most similar output class (or exemplar) is determined.
• Encoding: an application of clustering/prototyping.
• Feature mapping: topographic mapping of the input space onto the output network configuration.

Self-Organizing Map (SOM)

[Figure: a 2D SOM layer; each unit i has a reference vector w_i = (w_i1, w_i2); the input is x = (x_1, x_2); units near the best-matching unit are its neighbors.]

• Kohonen (1982)
• 1-D or 2-D layout of units.
• One reference vector for each unit.
• Unsupervised learning (no target output).

SOM Algorithm

1. Randomly initialize the reference vectors w_i.
2. Randomly sample an input vector x.
3. Find the Best Matching Unit (BMU):  i(x) = argmin_j ‖x − w_j‖
4. Update the reference vectors:  w_j ← w_j + α Λ(j, i(x)) (x − w_j)
   – α: learning rate
   – Λ(j, i(x)): neighborhood function of the BMU
5. Repeat steps 2–4 (a code sketch follows).
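Steps 1-5 in an editorial Python sketch (grid size, decay schedules, and epoch count are illustrative choices; the Gaussian neighborhood anticipates the next slide).

    import math, random

    def train_som(data, rows, cols, epochs=2000, alpha0=0.5, seed=0):
        """Train a rows x cols map; alpha and the Gaussian neighborhood
        radius sigma both decay over time (see Training Tips below)."""
        rnd = random.Random(seed)
        dim = len(data[0])
        sigma0 = max(rows, cols) / 2.0
        w = {(r, c): [rnd.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}     # 1. initialize
        for t in range(epochs):
            frac = t / epochs
            alpha = alpha0 * (1 - frac)                     # decaying rate
            sigma = max(sigma0 * (1 - frac), 0.5)           # shrinking radius
            x = rnd.choice(data)                            # 2. sample input
            bmu = min(w, key=lambda u: sum((xi - wi) ** 2
                                           for xi, wi in zip(x, w[u])))  # 3. BMU
            for u in w:                                     # 4. update units
                d2 = (u[0] - bmu[0]) ** 2 + (u[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))        # Gaussian neighborhood
                w[u] = [wi + alpha * h * (xi - wi) for xi, wi in zip(x, w[u])]
        return w                                            # 5. loop until done

    # Uniform 2D inputs: the map should unfold into a regular grid.
    data = [[random.random(), random.random()] for _ in range(500)]
    som = train_som(data, rows=8, cols=8)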


Typical Neighborhood Functions

[Figure: Gaussian neighborhood exp(−(x² + y²)/2) over the map grid.]

• Gaussian: Λ(j, i(x)) = exp( −‖r_j − r_{i(x)}‖² / (2σ²) )
• Flat: Λ(j, i(x)) = 1 if ‖r_j − r_{i(x)}‖ ≤ σ, and 0 otherwise.
• r_k is the location of unit k on the map (grid).
• σ is called the neighborhood radius.

Training Tips

• Start with a large neighborhood radius; gradually decrease the radius to a small value.
• Start with a high learning rate α; gradually decrease α to a small value.

Properties of SOM

• Approximation of the input space: maps a continuous input space onto a discrete output space of N units.
• Topology preservation: nearby units represent nearby points in the input space.
• Density mapping: more units represent regions of the input space that are more frequently sampled.

Performance Measures

• Quantization error: the average distance between each data vector and its BMU:

    Q = (1/N) Σ_{j=1..N} ‖x_j − w_{i(x_j)}‖

• Topographic error: the proportion of all data vectors for which the first and second BMUs are not adjacent units:

    T = (1/N) Σ_{j=1..N} u(x_j),

  where u(x) = 1 if the 1st and 2nd BMUs are not adjacent, and u(x) = 0 otherwise.
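Both measures, computed for a map trained with the earlier sketch (editorial code; "adjacent" is read here as the 8-neighborhood on the grid, one reasonable interpretation).

    def som_errors(data, w):
        """Quantization error Q and topographic error T for a trained map w
        (a dict mapping grid coordinates to reference vectors, as above)."""
        def dist2(x, u):
            return sum((xi - wi) ** 2 for xi, wi in zip(x, w[u]))
        Q, T = 0.0, 0.0
        for x in data:
            ranked = sorted(w, key=lambda u: dist2(x, u))
            b1, b2 = ranked[0], ranked[1]            # 1st and 2nd BMUs
            Q += dist2(x, b1) ** 0.5                 # distance to the BMU
            adjacent = abs(b1[0] - b2[0]) <= 1 and abs(b1[1] - b2[1]) <= 1
            T += 0 if adjacent else 1                # u(x)
        return Q / len(data), T / len(data)

    print(som_errors(data, som))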


Example: 2D Input / 2D Output

• Train with uniformly random 2D inputs.
• Each input is a point in the Cartesian plane.
• Nodes: reference vectors (x and y coordinates).
• Edges: connect immediate neighbors on the map.

Different 2D Input Distributions

• What would the resulting SOM map look like?
• Why would it look like that?

High-Dimensional Inputs

[Figure: mapping from a high-dimensional input space onto the 2D SOM output space.]

• SOM can be trained with inputs of arbitrary dimension.
• Dimensionality reduction: N-D to 2-D.
• Extracts topological features.
• Used for visualization of data.

Applications

• Data clustering and visualization.
• Optimization problems: traveling salesman problem.
• Semantic maps: natural language processing.
• Preprocessing for signal and image processing:
  1. Hand-written character recognition.
  2. Phonetic maps for speech recognition.


Exercise

1. What happens when the neighborhood N_{i(x)} and α are reduced quickly vs. slowly?
2. How would the map organize if different input distributions are given?
3. For a fixed number of input vectors from real-world data, a different visualization scheme is required. How would you use the number of input vectors that best match each unit to visualize the properties of the map?

Key Points

• How can backprop be improved?
• What are the various ways to apply backprop?
• SOM: the basic algorithm.
• What kind of tasks is SOM good for?

Overview

• SOM demo
• Recurrent networks
• Genetic algorithms

SOM Example: Handwritten Digit Recognition

• Preprocessing for feedforward networks (supervised learning).
• Better representation for training.
• Better generalization.


SOM Demo

• Jochen Fröhlich's Neural Networks with JAVA page:
  http://rfhs8012.fh-regensburg.de/~saj39122/jfroehl/diplom/e-index.html
• Check out the Sample Applet link.
• Click on [Parameters] to bring up the config panel. After the parameters are set, click on [Reset] in the main applet, and then [Start learning].

SOM Demo: Traveling Salesman Problem

Using Fröhlich's SOM applet:

• 1D SOM map (1 × n, where n is the number of nodes).
• 2D input space.
• Initial neighborhood radius of 8.
• Stop when radius < 0.001.
• Try 50 nodes, 20 input points.

SOM Demo: Space Filling in 2D

Using Fröhlich's SOM applet:

• 1D SOM map (1 × n, where n is the number of nodes).
• 2D input space.
• Initial neighborhood radius of 100.
• Stop when radius < 0.001.
• Try 1000 nodes and 1000 input points.

SOM Demo: Space Filling in 3D

Using Fröhlich's SOM applet:

• 2D SOM map (n × n nodes).
• 3D input space.
• Initial neighborhood radius of 10.
• Stop when radius < 0.001.
• Try 30 × 30 nodes and 500 input points. Limit the y range to 15.
• Also try 50 × 50 nodes, 1000 input points, and an initial radius of 16.


Recurrent Networks

The connection graph can contain cycles, e.g., reciprocal connections: i.e., not strictly feed-forward.

• Statistical mechanics based models (associative or content-addressable memory): Hopfield network, Boltzmann machines, etc.
• Sequence encoding: Simple Recurrent Network, etc.
• Other biologically motivated networks: laterally connected self-organizing maps, etc.

Simple Recurrent Network (Elman Network)

[Figure: input and context units feed the hidden layer, which feeds the output; the context units hold the hidden activation delayed by one step.]

• Sequence encoding.
• The hidden layer activation from the previous time step is used as input.
• Uses standard back-propagation learning.

SRN Example

[Figure: the SRN with output, hidden, input, and context (1-step delay) layers.]

Example input and target sequence: output 1 when two 1s appear in a row (see the sketch below).

  Time:   0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
  Input:  1 0 1 0 1 1 0 0 0 1  1  1  1  0  1  0  1  1  0  1
  Target: 0 0 0 0 0 1 0 0 0 0  1  1  1  0  0  0  0  1  0  0

Genetic Algorithms

[Figure: a population of genes with fitness values (f=100, f=20, f=10, f=5) undergoes selection, mating (crossover), and random mutation to form a new generation.]

Evolution as a problem-solving strategy:

• a population of solutions, where each chromosome represents an individual
• selection based on a fitness function: survival of the fittest
• mating (cross-over) and reproduction
• random mutation
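Returning to the SRN above: an editorial Python sketch of a single Elman step. The hidden activation is returned so it can be fed back as the next step's context; the weights themselves would be trained with standard backprop (omitted here), and the sizes below are purely illustrative.

    import math

    def srn_step(x, context, W_in, W_ctx, W_out):
        """One Elman-network step: hidden units see the current input x
        plus the previous hidden activation (the context)."""
        g = lambda v: 1.0 / (1.0 + math.exp(-v))
        hidden = [g(sum(wi * xi for wi, xi in zip(W_in[j], x)) +
                    sum(wc * ci for wc, ci in zip(W_ctx[j], context)))
                  for j in range(len(W_in))]
        output = g(sum(wo * h for wo, h in zip(W_out, hidden)))
        return output, hidden          # hidden becomes the next context

    # Toy usage with hypothetical 1-input, 2-hidden weights:
    W_in, W_ctx, W_out = [[1.0], [-1.0]], [[0.5, 0.0], [0.0, 0.5]], [1.0, -1.0]
    ctx = [0.0, 0.0]
    for x in [1, 0, 1, 1]:
        out, ctx = srn_step([x], ctx, W_in, W_ctx, W_out)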

Properties of Genetic Algorithms

• Each chromosome encodes a solution.
• Similar to hill-climbing search.
• Parallel search.
• Works for both immediate and delayed reward.

Designing a GA Solution to a Problem

There are many different issues (a minimal example follows):

• What is the fitness function?
• How is an individual chromosome represented (how is it encoded), and what does it represent?
• How are individuals selected?
• How do individuals reproduce?

Cross-over strategy: success depends on how genes are • encoded (or represented).

Not too much theoretical understanding about why it works so fitness fitness • well.

gene Crevices in fitness landscape: similar to spikes in hill climbing. gene • How to combine learning with evolution. How to maintain diversity: • Letting only successful ones to reproduce can seriously reduce How to use cultural leverage. • • the gene pool and an epidemic can wipe out the whole population: solution can not generalize in new and unexpected conditions.

Converged population can often times be found at local minima, • not at the global optimum. 97 98

GA as a Learning Algorithm

• An individual chromosome may not seem to learn, but when we look at the evolution of individuals over time, they can be seen as adapting, and thus learning to cope with the environment.
• If each individual encodes a function rather than a simple solution, the above point becomes clearer: at each generation, the parameters of the function can be seen as being adapted.
• Fitness can then be measured by using the function with the given parameters in specific tasks.

GA as a Learning Algorithm: Neuroevolution

[Figure: genes decode into neural networks; the networks are evaluated, and mate selection produces the next generation.]

Evolving neural networks:

• Genes encode neural networks (connection topology and connection weights).
• Evaluate, select, and reproduce a new population of neural networks.
• Problem: individual neurons performing good work may get lost.

Neuroevolution: Evolving Individual Neurons

[Figure: genes decode into individual neurons, which are combined into neural networks for evaluation and mate selection.]

• SANE: Moriarty and Miikkulainen.
• Genes encode individual neurons.
• Neurons solve sub-problems, and the ones that solve the problem well get a chance to participate in a network in the next generation.
• Better diversity is maintained.

GA Demo

http://www.cs.utexas.edu/users/nn/

• Generation of melodies (Chen and Miikkulainen)
• Gaming AI: harvesters and predators (Stanley and Miikkulainen)
• Non-Markovian control (Gomez and Miikkulainen)

Key Points

• SOM: try out the effects of different parameters (network size, 1D or 2D map, neighborhood radius, etc.).
• Simple recurrent networks: how can they encode sequences, and how are they different from (and similar to) standard backprop networks?
• Genetic algorithm basics.
• What are the issues to be solved in genetic algorithms?
