Data Mining Techniques: Classification and Prediction
Mirek Riedewald
Some slides based on presentations by Han/Kamber/Pei, Tan/Steinbach/Kumar, and Andrew Moore

Classification and Prediction Overview
• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods
Classification vs. Prediction
• Assumption: after data preparation, we have a data set where each record has attributes X1,…,Xn, and Y.
• Goal: learn a function f:(X1,…,Xn)→Y, then use this function to predict y for a given input record (x1,…,xn).
  – Classification: Y is a discrete attribute, called the class label; usually a categorical attribute with small domain
  – Prediction: Y is a continuous attribute
• Called supervised learning, because true labels (Y-values) are known for the initially provided data
• Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Induction: Model Construction
• The classification algorithm constructs a model (function) from the training data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Deduction: Using the Model
• Apply the model (function) to test data or new, unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen record: (Jeff, Professor, 4) → Tenured?
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree (splitting attributes Refund, MarSt, TaxInc):
  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
          ├─ Single, Divorced → TaxInc?
          │                     ├─ < 80K → NO
          │                     └─ > 80K → YES
          └─ Married → NO

Another Example of Decision Tree
  MarSt?
  ├─ Married → NO
  └─ Single, Divorced → Refund?
                        ├─ Yes → NO
                        └─ No → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree and follow the branches that match the test record: Refund = No → MarSt = Married → leaf NO. Assign Cheat to “No”.
Decision Tree Induction
• Basic greedy algorithm
  – Top-down, recursive divide-and-conquer
  – At start, all the training records are at the root
  – Training records partitioned recursively based on split attributes
  – Split attributes selected based on a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – Pure node (all records belong to same class)
  – No remaining attributes for further partitioning: majority voting for classifying the leaf
  – No cases left

Decision Boundary
[Figure: a tree splitting on X1 < 0.43, then X2 < 0.47 and X2 < 0.33, and the corresponding axis-parallel partitioning of the (x1, x2) plane.]
Decision boundary = border between two neighboring regions of different classes. For trees that split on a single attribute at a time, the decision boundary is parallel to the axes.
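The greedy top-down loop above can be sketched in a few lines. This is an illustrative ID3-style implementation using information gain; the function names and the tuple/dict tree encoding are my own, not prescribed by the slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a list of class labels: -sum p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(records, labels, attrs):
    """Greedy top-down induction: pick the attribute with the highest
    information gain, partition the records on it, and recurse until a
    node is pure or no attributes remain (then use majority voting)."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attrs:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        # expected information after splitting on attribute a
        groups = Counter(r[a] for r in records)
        rem = sum((cnt / len(records)) *
                  entropy([l for r, l in zip(records, labels) if r[a] == v])
                  for v, cnt in groups.items())
        return entropy(labels) - rem

    best = max(attrs, key=gain)
    branches = {}
    for v in set(r[best] for r in records):
        sub = [(r, l) for r, l in zip(records, labels) if r[best] == v]
        branches[v] = build_tree([r for r, _ in sub], [l for _, l in sub],
                                 [a for a in attrs if a != best])
    return (best, branches)
```

A leaf is a class label; an inner node is a pair (split attribute, branch dictionary), mirroring the multi-way splits discussed below.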
Oblique Decision Trees
• Test condition may involve multiple attributes, e.g., x + y < 1 separating Class = + from Class = –
• More expressive representation
• Finding optimal test condition is computationally expensive

How to Specify Split Condition?
• Depends on attribute types
  – Nominal
  – Ordinal
  – Numeric (continuous)
• Depends on number of ways to split
  – 2-way split
  – Multi-way split
Splitting Nominal Attributes
• Multi-way split: use as many partitions as distinct values, e.g., CarType ∈ {Family, Sports, Luxury}
• Binary split: divides values into two subsets; need to find optimal partitioning, e.g., {Sports, Luxury} vs. {Family} OR {Family, Luxury} vs. {Sports}

Splitting Ordinal Attributes
• Multi-way split, e.g., Size ∈ {Small, Medium, Large}
• Binary split, e.g., {Small, Medium} vs. {Large} OR {Medium, Large} vs. {Small}
• What about the split {Small, Large} vs. {Medium}? It violates the order of the attribute values, so it is not a valid ordinal split.
Splitting Continuous Attributes
• Different options:
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits, choose best one
• Example: (i) binary split on Taxable Income > 80K?; (ii) multi-way split on Taxable Income into ranges such as < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
How to Determine Best Split
• Before splitting: 10 records of class C0, 10 records of class C1. Which test condition is the best: Own Car? (2-way), Car Type? (3-way), or Student ID? (one branch per student)?
• Greedy approach:
  – Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – C0: 5, C1: 5 — non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 — homogeneous, low degree of impurity
Attribute Selection Measure: Information Gain
• Select attribute with highest information gain
• pi = probability that an arbitrary record in D belongs to class Ci, i=1,…,m
• Expected information (entropy) needed to classify a record in D:
  Info(D) = −Σ_{i=1}^{m} p_i log2(p_i)
• Information needed after using attribute A to split D into v partitions D1,…,Dv:
  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · Info(D_j)
• Information gained by splitting on attribute A:
  Gain_A(D) = Info(D) − Info_A(D)

Example
• Predict if somebody will buy a computer
• Given data set:

Age    Income  Student  Credit_rating  Buys_computer
≤30    High    No       Bad            No
≤30    High    No       Good           No
31…40  High    No       Bad            Yes
>40    Medium  No       Bad            Yes
>40    Low     Yes      Bad            Yes
>40    Low     Yes      Good           No
31…40  Low     Yes      Good           Yes
≤30    Medium  No       Bad            No
≤30    Low     Yes      Bad            Yes
>40    Medium  Yes      Bad            Yes
≤30    Medium  Yes      Good           Yes
31…40  Medium  No       Good           Yes
31…40  High    Yes      Bad            Yes
>40    Medium  No       Good           No
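As a sanity check, the entropy and gain formulas can be evaluated directly on the class counts from the buys_computer table. A minimal sketch; the helper name `entropy` is mine:

```python
import math

def entropy(counts):
    """Info(D) = -sum p_i * log2(p_i), computed from per-class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# class counts in the buys_computer data: 9 yes, 5 no
info_D = entropy([9, 5])        # I(9,5), about 0.940

# partitioning on age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
info_age = ((5 / 14) * entropy([2, 3]) +
            (4 / 14) * entropy([4, 0]) +
            (5 / 14) * entropy([3, 2]))

gain_age = info_D - info_age    # about 0.246, the value on the next slide
```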
Information Gain Example
• Class P: buys_computer = “yes”; Class N: buys_computer = “no”
• Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

Age    #yes  #no  I(#yes, #no)
≤30    2     3    0.971
31…40  4     0    0
>40    3     2    0.971

• The term (5/14)·I(2,3) means “age ≤ 30” has 5 out of 14 samples, with 2 yes’es and 3 no’s; similar for the other terms
• Hence Gain_age(D) = Info(D) − Info_age(D) = 0.246
• Similarly, Gain_income(D) = 0.029, Gain_student(D) = 0.151, Gain_credit_rating(D) = 0.048
• Therefore we choose age as the splitting attribute

Gain Ratio for Attribute Selection
• Information gain is biased towards attributes with a large number of values
• Use gain ratio to normalize information gain:
  – GainRatio_A(D) = Gain_A(D) / SplitInfo_A(D)
  – SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)
  – E.g., SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
  – GainRatio_income(D) = 0.029 / 1.557 = 0.019
• Attribute with maximum gain ratio is selected as splitting attribute
Gini Index
• Gini index of data set D:
  gini(D) = 1 − Σ_{i=1}^{m} p_i²
• If data set D is split on A into v subsets D1,…,Dv, the gini index gini_A(D) is defined as
  gini_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) · gini(D_j)
• Reduction in impurity:
  Δgini_A(D) = gini(D) − gini_A(D)
• Attribute that provides smallest gini_A(D) (= largest reduction in impurity) is chosen to split the node

Comparing Attribute Selection Measures
• No clear winner (and there are many more)
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits where one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes; tends to favor tests that result in equal-sized partitions and purity in both partitions
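The gini formulas are just as direct to evaluate; a minimal sketch with my own helper names, using the 10-vs-10 example from the "best split" slide:

```python
def gini(counts):
    """gini(D) = 1 - sum p_i^2, computed from per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini of a candidate split, given per-partition class counts."""
    n = sum(sum(p) for p in partitions)
    return sum((sum(p) / n) * gini(p) for p in partitions)

# 10 records of class C0 and 10 of class C1 before splitting:
before = gini([10, 10])                 # 0.5, the maximum for two classes
# a split producing near-pure partitions (9,1) and (1,9) reduces impurity:
after = gini_split([[9, 1], [1, 9]])    # 0.18
```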
Practical Issues of Classification
• Underfitting and overfitting
• Missing values
• Computational cost
• Expressiveness

How Good is the Model?
• Training set error: compare prediction of training record with true value
  – Not a good measure for the error on unseen data. (Discussed soon.)
• Test set error: for records that were not used for training, compare model prediction and true value
  – Use holdout data from available data set
Training versus Test Set Error
• We’ll create a training data set: five inputs a, b, c, d, e, all bits, are generated in all 32 possible combinations
• Output y = copy of e, except a random 25% of the records have y set to the opposite of e

a b c d e  y
0 0 0 0 0  0
0 0 0 0 1  0
0 0 0 1 0  0
0 0 0 1 1  1
0 0 1 0 0  1
: : : : :  :    (32 records)
1 1 1 1 1  1

Test Data
• Generate test data using the same method: y = copy of e, with 25% inverted; this noise process is done independently from the training set’s noise process
• Some y’s that were corrupted in the training set will be uncorrupted in the test set
• Some y’s that were uncorrupted in the training set will be corrupted in the test set

Full Tree for the Training Data
• The full tree splits on e at the root, then on a, and so on, until each leaf contains exactly one record; hence there is no error in predicting the training data!
• 25% of these leaf node labels will be corrupted

Testing the Tree with the Test Set
• 1/4 of the tree’s leaf labels are corrupted; 3/4 are fine
• 1/4 of the test set records are corrupted; 3/4 are fine
• Breakdown of the test set:
  – corrupted record, corrupted leaf: 1/16 will be correctly predicted for the wrong reasons
  – corrupted record, fine leaf: 3/16 will be wrongly predicted because the test record is corrupted
  – fine record, corrupted leaf: 3/16 will be wrong because the tree node is corrupted
  – fine record, fine leaf: 9/16 of the test predictions will be fine
• In total, we expect to be wrong on 3/8 of the test set predictions
What’s This Example Shown Us?
• Discrepancy between training and test set error
• But more importantly, it indicates that there is something we should do about it if we want to predict well on future data

Suppose We Had Less Data
• The irrelevant bits a, b, c, d are hidden; only e is visible
• Output y = copy of e, except a random 25% of the records have y set to the opposite of e

a b c d e  y
0 0 0 0 0  0
0 0 0 0 1  0
0 0 0 1 0  0
0 0 0 1 1  1
0 0 1 0 0  1
: : : : :  :    (32 records)
1 1 1 1 1  1
Tree Learned Without Access to the Irrelevant Bits
• The root splits on e; the child nodes e=0 and e=1 are unexpandable
• In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0
• In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1
• Testing this tree with the test set: almost certainly, none of the tree nodes are corrupted (each leaf is labeled by a clear majority)
  – 1/4 of the test set will be wrongly predicted because the test record is corrupted
  – 3/4 of the test predictions will be fine
• In total, we expect to be wrong on only 1/4 of the test set predictions

Typical Observation: Overfitting
• Model M overfits the training data if another model M’ exists, such that M has smaller error than M’ over the training examples, but M’ has smaller error than M over the entire distribution of instances.
• Underfitting: when the model is too simple, both training and test errors are large.
Reasons for Overfitting
• Noise
  – Too closely fitting the training data means the model’s predictions reflect the noise as well
• Insufficient training data
  – Not enough data to enable the model to generalize beyond idiosyncrasies of the training records
• Data fragmentation (special problem for trees)
  – Number of instances gets smaller as you traverse down the tree
  – Number of instances at a leaf node could be too small to make any confident decision about class

Avoiding Overfitting
• General idea: make the tree smaller
  – Addresses all three reasons for overfitting
• Prepruning: halt tree construction early
  – Do not split a node if this would result in the goodness measure falling below a threshold
  – Difficult to choose an appropriate threshold, e.g., tree for XOR
• Postpruning: remove branches from a “fully grown” tree
  – Use a set of data different from the training data to decide when to stop pruning
  – Validation data: train tree on training data, prune on validation data, then test on test data
Minimum Description Length (MDL)
• Alternative to using validation data
  – Motivation: data mining is about finding regular patterns in data; regularity can be used to compress the data; the method that achieves the greatest compression found the most regularity and hence is best
• Minimize Cost(Model, Data) = Cost(Model) + Cost(Data|Model)
  – Cost is the number of bits needed for encoding
  – Cost(Model) uses node encoding plus splitting condition encoding (model size)
  – Cost(Data|Model) encodes the misclassification errors

MDL-Based Pruning Intuition
[Figure: Cost(Model) grows with tree size while Cost(Data|Model) shrinks; the best tree size is where the total cost Cost(Model, Data) is lowest.]
Handling Missing Attribute Values
• Missing values affect decision tree construction in three different ways:
  – How impurity measures are computed
  – How to distribute instances with missing values to child nodes
  – How a test instance with missing value is classified

Distribute Instances
• Record 10 (Refund = ?, Single, 90K, Cheat = Yes) has a missing Refund value
• Among the 9 records with known Refund value, 3 have Refund = Yes and 6 have Refund = No:
  – Probability that Refund = Yes is 3/9
  – Probability that Refund = No is 6/9
• Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9:
  – Refund = Yes: Class=Yes 0 + 3/9, Class=No 3
  – Refund = No:  Class=Yes 2 + 6/9, Class=No 4
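The fractional-weight bookkeeping above can be sketched directly. An illustrative helper (the name `distribute` and the dict-of-dicts layout are mine); records with a missing split value (`None`) are spread over all children in proportion to the observed value frequencies:

```python
from collections import defaultdict

def distribute(records, attr, class_attr):
    """Send records with a known split-attribute value to the matching
    child; split records with a missing value across all children with
    fractional weights proportional to the observed value frequencies."""
    known = [r for r in records if r[attr] is not None]
    freq = defaultdict(float)
    for r in known:
        freq[r[attr]] += 1
    # children: split value -> class label -> (possibly fractional) count
    children = defaultdict(lambda: defaultdict(float))
    for r in records:
        if r[attr] is not None:
            children[r[attr]][r[class_attr]] += 1
        else:
            for v, cnt in freq.items():
                children[v][r[class_attr]] += cnt / len(known)
    return children
```

On the slide's data (3 known Refund=Yes, 6 known Refund=No, one missing record with Cheat=Yes) this reproduces the 0 + 3/9 and 2 + 6/9 counts.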
Computing Impurity Measure
• Split on Refund; assume records with missing values are distributed as discussed before: 3/9 of record 10 go to Refund=Yes, 6/9 of record 10 go to Refund=No
• Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.881
• Entropy(Refund=Yes) = −(1/3 / 10/3) log(1/3 / 10/3) − (3 / 10/3) log(3 / 10/3) = 0.469
• Entropy(Refund=No) = −(8/3 / 20/3) log(8/3 / 20/3) − (4 / 20/3) log(4 / 20/3) = 0.971
• Entropy(Children) = 1/3 · 0.469 + 2/3 · 0.971 = 0.804
• Gain = 0.881 − 0.804 = 0.077

Classify Instances
• New record: Tid 11, Refund = No, Marital Status = ?, Income = 85K, Class = ?
• The record reaches the MarSt node; distribute it according to the (weighted) training counts at that node:

           Married  Single  Divorced  Total
Class=No   3        1       0         4
Class=Yes  6/9      1       1         2.67
Total      3.67     2       1         6.67

• Probability that Marital Status = Married is 3.67/6.67
• Probability that Marital Status ∈ {Single, Divorced} is 3/6.67
Tree Cost Analysis
• Finding an optimal decision tree is NP-complete
  – Optimization goal: minimize expected number of binary tests to uniquely identify any record from a given finite set
• Greedy algorithm
  – O(#attributes * #training_instances * log(#training_instances))
  – At each tree depth, all instances considered
  – Assume tree depth is logarithmic (fairly balanced splits)
  – Need to test each attribute at each node
  – What about binary splits? Sort data once on each attribute, and use this sort order to avoid re-sorting subsets; incrementally maintain counts for class distribution as different split points are explored
• In practice, trees are considered to be fast both for training (when using the greedy algorithm) and making predictions

Tree Expressiveness
• Can represent any finite discrete-valued function
  – But it might not do it very efficiently
• Example: parity function
  – Class = 1 if there is an even number of Boolean attributes with truth value = True
  – Class = 0 if there is an odd number of Boolean attributes with truth value = True
  – For accurate modeling, must have a complete tree
• Not expressive enough for modeling continuous attributes
  – But we can still use a tree for them in practice; it just cannot accurately represent the true function
Rule Extraction from a Decision Tree
• One rule is created for each path from the root to a leaf
  – Precondition: conjunction of all split predicates of nodes on path
  – Consequent: class prediction from leaf
• Rules are mutually exclusive and exhaustive
• Example: rule extraction from the buys_computer decision tree (root split on age; the young branch splits on student, the old branch on credit_rating):
  – IF age = young AND student = no THEN buys_computer = no
  – IF age = young AND student = yes THEN buys_computer = yes
  – IF age = mid-age THEN buys_computer = yes
  – IF age = old AND credit_rating = excellent THEN buys_computer = yes
  – IF age = old AND credit_rating = fair THEN buys_computer = no

Classification in Large Databases
• Scalability: classify data sets with millions of examples and hundreds of attributes with reasonable speed
• Why use decision trees for data mining?
  – Relatively fast learning speed
  – Can handle all attribute types
  – Convertible to intelligible classification rules
  – Good classification accuracy, but not as good as newer methods (but tree ensembles are top!)
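The path-to-rule extraction is a straightforward tree walk. A minimal sketch under my own encoding (a leaf is a class label, an inner node is a pair of split attribute and branch dictionary); the tree literal is a simplified, hypothetical fragment of the slide's tree:

```python
def extract_rules(tree, path=()):
    """Emit one IF-THEN rule per root-to-leaf path: the precondition is
    the conjunction of the split predicates, the consequent is the leaf."""
    if not isinstance(tree, tuple):            # leaf: class prediction
        cond = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {cond} THEN class = {tree}"]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, path + ((attr, value),)))
    return rules

# simplified fragment of the buys_computer tree (age/student part only):
tree = ("age", {"young": ("student", {"no": "no", "yes": "yes"}),
                "mid-age": "yes"})
rules = extract_rules(tree)
```

Because every record follows exactly one root-to-leaf path, the extracted rules are mutually exclusive and exhaustive, as the slide states.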
Scalable Tree Induction
• High cost when the training data at a node does not fit in memory
• Solution 1: special I/O-aware algorithm
  – Keep only class list in memory, access attribute values on disk
  – Maintain separate list for each attribute
  – Use count matrix for each attribute
• Solution 2: sampling
  – Common solution: train tree on a sample that fits in memory
  – More sophisticated versions of this idea exist, e.g., Rainforest
    • Build tree on sample, but do this for many bootstrap samples
    • Combine all into a single new tree that is guaranteed to be almost identical to the one trained from the entire data set
    • Can be computed with two data scans

Tree Conclusions
• Very popular data mining tool
  – Easy to understand
  – Easy to implement
  – Easy to use: little tuning, handles all attribute types and missing values
  – Computationally relatively cheap
• Overfitting problem
• Focused on classification, but easy to extend to prediction (future lecture)
Theoretical Results
• Trees make sense intuitively, but can we get some hard evidence and deeper understanding about their properties?
• Statistical decision theory can give some answers
• Need some probability concepts first
Random Variables
• Intuitive version of the definition:
  – Can take on one of possibly many values, each with a certain probability
  – These probabilities define the probability distribution of the random variable
  – E.g., let X be the outcome of a coin toss; then Pr(X=‘heads’)=0.5 and Pr(X=‘tails’)=0.5, and the distribution is uniform
• Consider a discrete random variable X with numeric values x1,…,xk
  – Expectation: E[X] = Σ xi·Pr(X=xi)
  – Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

Working with Random Variables
• E[X + Y] = E[X] + E[Y]
• Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X,Y)
• For constants a, b:
  – E[aX + b] = a·E[X] + b
  – Var(aX + b) = Var(aX) = a²·Var(X)
• Iterated expectation:
  – E[Y] = E_X[ E_Y[Y | X] ], where E_Y[Y | X] = Σ yi·Pr(Y=yi | X=x) is the expectation of Y for a given value x of X, i.e., is a function of X
  – In general, for any function f(X,Y): E_{X,Y}[f(X,Y)] = E_X[ E_Y[f(X,Y) | X] ]
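The constant-shift identities are easy to verify empirically on a sample; a minimal sketch (the sample statistics satisfy the same identities as the population quantities, so the check holds up to floating-point error):

```python
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

a, b = 3.0, 5.0
ys = [a * x + b for x in xs]

# E[aX + b] = a E[X] + b  and  Var(aX + b) = a^2 Var(X)
lhs_mean, rhs_mean = mean(ys), a * mean(xs) + b
lhs_var, rhs_var = var(ys), a * a * var(xs)
```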
What is the Optimal Model f(X)?
• Let X denote a real-valued random input variable and Y a real-valued random output variable.
• The squared error of trained model f(X) is E_{X,Y}[(Y − f(X))²].
• Which function f(X) will minimize the squared error?
• Consider the error for a specific value of X and let Ȳ = E_Y[Y | X]:
  E_Y[(Y − f(X))² | X]
    = E_Y[(Y − Ȳ + Ȳ − f(X))² | X]
    = E_Y[(Y − Ȳ)² | X] + (Ȳ − f(X))² + 2·(Ȳ − f(X))·E_Y[(Y − Ȳ) | X]
    = E_Y[(Y − Ȳ)² | X] + (Ȳ − f(X))²
  (Notice: E_Y[(Y − Ȳ) | X] = E_Y[Y | X] − Ȳ = Ȳ − Ȳ = 0.)

Optimal Model f(X) (cont.)
• The choice of f(X) does not affect E_Y[(Y − Ȳ)² | X], but (Ȳ − f(X))² is minimized for f(X) = Ȳ = E_Y[Y | X].
• Note that E_{X,Y}[(Y − f(X))²] = E_X[ E_Y[(Y − f(X))² | X] ]. Hence
  E_{X,Y}[(Y − f(X))²] = E_X[ E_Y[(Y − Ȳ)² | X] + (Ȳ − f(X))² ]
• Hence the squared error is minimized by choosing f(X) = E_Y[Y | X] for every X.
• (Notice that for minimizing absolute error E_{X,Y}[|Y − f(X)|], one can show that the best model is f(X) = median(Y | X).)
Interpreting the Result
• To minimize mean squared error, the best prediction for input X=x is the mean of the Y-values of all training records (x(i), y(i)) with x(i) = x
  – E.g., assume there are training records (5,22), (5,24), (5,26), (5,28). The optimal prediction for input X=5 would be estimated as (22+24+26+28)/4 = 25.
• Problem: to reliably estimate the mean of Y for a given X=x, we need sufficiently many training records with X=x. In practice, often there is only one or no training record at all for an X=x of interest.
  – If there were many such records with X=x, we would not need a model and could just return the average Y for that X=x.
• The benefit of a good data mining technique is its ability to interpolate and extrapolate from known training records to make good predictions even for X-values that do not occur in the training data at all.
• Classification for two classes: encode as 0 and 1, use squared error as before
  – Then f(X) = E[Y | X=x] = 1·Pr(Y=1 | X=x) + 0·Pr(Y=0 | X=x) = Pr(Y=1 | X=x)
• Classification for k classes: can show that for 0-1 loss (error = 0 if correct class, error = 1 if wrong class predicted) the optimal choice is to return the majority class for a given input X=x
  – This is called the Bayes classifier.

Implications for Trees
• Since there are not enough, or none at all, training records with X=x, the output for input X=x has to be based on records “in the neighborhood”
  – A tree leaf corresponds to a multi-dimensional range in the data space
  – Records in the same leaf are neighbors of each other
• Solution: estimate mean Y for input X=x from the training records in the same leaf node that contains input X=x
  – Classification: leaf returns majority class or class probabilities (estimated from fraction of training records in the leaf)
  – Prediction: leaf returns average of Y-values or fits a local model
  – Make sure there are enough training records in the leaf to obtain reliable estimates
Bias-Variance Tradeoff
• Let’s take this one step further and see if we can understand overfitting through statistical decision theory
• As before, consider two random variables X and Y
• From a training set D with n records, we want to construct a function f(X) that returns good approximations of Y for future inputs X
  – Make dependence of f on D explicit by writing f(X; D)
• Goal: minimize mean squared error over all X, Y, and D, i.e., E_{X,D,Y}[(Y − f(X; D))²]

Bias-Variance Tradeoff Derivation
E_{X,D,Y}[(Y − f(X; D))²] = E_X[ E_D[ E_Y[(Y − f(X; D))² | X, D] ] ]. Now consider the inner term:
  E_D[ E_Y[(Y − f(X; D))² | X, D] ]
    = E_D[ E_Y[(Y − E[Y|X])² | X, D] + (f(X; D) − E[Y|X])² ]
      (same derivation as before for the optimal function f(X))
    = E_Y[(Y − E[Y|X])² | X] + E_D[(f(X; D) − E[Y|X])²]
      (the first term does not depend on D, hence E_D[ E_Y[(Y − E[Y|X])² | X, D] ] = E_Y[(Y − E[Y|X])² | X])
Consider the second term:
  E_D[(f(X; D) − E[Y|X])²]
    = E_D[(f(X; D) − E_D[f(X; D)] + E_D[f(X; D)] − E[Y|X])²]
    = E_D[(f(X; D) − E_D[f(X; D)])²] + (E_D[f(X; D)] − E[Y|X])²
      (the cross term is zero, because E_D[f(X; D) − E_D[f(X; D)]] = E_D[f(X; D)] − E_D[f(X; D)] = 0)
Overall we therefore obtain:
  E_{X,D,Y}[(Y − f(X; D))²]
    = E_X[ (E_D[f(X; D)] − E[Y|X])² + E_D[(f(X; D) − E_D[f(X; D)])²] + E_Y[(Y − E[Y|X])² | X] ]
Bias-Variance Tradeoff and Overfitting
• (E_D[f(X; D)] − E[Y|X])² : squared bias
• E_D[(f(X; D) − E_D[f(X; D)])²] : variance
• E_Y[(Y − E[Y|X])² | X] : irreducible error (does not depend on f and is simply the variance of Y given X)
• Option 1: f(X; D) = E[Y | X, D]
  – Bias: since E_D[ E[Y | X, D] ] = E[Y | X], bias is zero
  – Variance: (E[Y | X, D] − E_D[E[Y | X, D]])² = (E[Y | X, D] − E[Y | X])² can be very large since E[Y | X, D] depends heavily on D
  – Might overfit!
• Option 2: f(X; D) = X (or some other function independent of D)
  – Variance: (X − E_D[X])² = (X − X)² = 0
  – Bias: (E_D[X] − E[Y | X])² = (X − E[Y | X])² can be large, because E[Y | X] might be completely different from X
  – Might underfit!
• Find the best compromise between fitting the training data too closely (option 1) and completely ignoring it (option 2)

Implications for Trees
• Bias decreases as the tree becomes larger
  – A larger tree can fit the training data better
• Variance increases as the tree becomes larger
  – Sample variance affects the predictions of a larger tree more
• Find the right tradeoff as discussed earlier
  – Validation data to find the best pruned tree
  – MDL principle
Lazy vs. Eager Learning
• Lazy learning: simply stores training data (or does only minor processing) and waits until it is given a test record
• Eager learning: given a training set, constructs a classification model before receiving new (test) data to classify
• General trend: lazy = faster training, slower predictions
• Accuracy: not clear which one is better!
  – Lazy method: typically driven by local decisions
  – Eager method: driven by global and local decisions
Nearest-Neighbor
• Recall our statistical decision theory analysis: the best prediction for input X=x is the mean of the Y-values of all records (x(i), y(i)) with x(i) = x (majority class for classification)
• Problem was to estimate E[Y | X=x] or the majority class for X=x from the training data
• Solution was to approximate it
  – Use Y-values from training records in the neighborhood around X=x

Nearest-Neighbor Classifiers
• Requires:
  – Set of stored records
  – Distance metric for pairs of records; common choice is Euclidean:
    d(p, q) = sqrt( Σ_i (p_i − q_i)² )
  – Parameter k: number of nearest neighbors to retrieve
• To classify a record (the “unknown tuple”):
  – Find its k nearest neighbors
  – Determine output based on the (distance-weighted) average of the neighbors’ output
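The two steps just listed fit in a few lines. A minimal sketch of a k-NN classifier with unweighted majority vote (function name and the list-of-(point, label) layout are mine):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among the k training records
    closest in Euclidean distance. `train` is a list of (point, label)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

A distance-weighted variant would weight each vote by, e.g., 1/d² instead of counting neighbors equally.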
Definition of Nearest Neighbor
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
• [Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a query point X]

1-Nearest Neighbor
• [Figure: Voronoi diagram — each training record’s cell contains exactly the points for which that record is the nearest neighbor]
Nearest Neighbor Classification
• Choosing the value of k:
  – k too small: sensitive to noise points
  – k too large: neighborhood may include points from other classes

Effect of Changing k
[Figure: decision boundaries for different values of k. Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning]
Explaining the Effect of k
• Recall the bias-variance tradeoff
• Small k, i.e., predictions based on few neighbors
  – High variance, low bias
• Large k, e.g., average over entire data set
  – Low variance, but high bias
• Need to find the k that achieves the best tradeoff
• Can do that using validation data

Experiment
• 50 training points (x, y)
  – −2 ≤ x ≤ 2, selected uniformly at random
  – y = x² + ε, where ε is selected uniformly at random from range [−0.5, 0.5]
• Test data sets: 500 points from the same distribution as the training data, but with ε = 0
• Plot 1: all (x, NN1(x)) for 5 test sets
• Plot 2: all (x, AVG(NN1(x))), averaged over 200 test data sets
  – Same for NN20 and NN50
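The experiment is easy to replicate. A simplified sketch of it (my own code; I evaluate only at the single query point x = 0, where the true value is 0, rather than reproducing the full plots):

```python
import random

random.seed(1)

def train_set(n=50):
    """One training set from the experiment: x ~ U(-2,2), y = x^2 + eps,
    eps ~ U(-0.5, 0.5)."""
    return [(x, x * x + random.uniform(-0.5, 0.5))
            for x in (random.uniform(-2, 2) for _ in range(n))]

def knn_predict(data, x, k):
    """k-NN regression: average y over the k nearest x-neighbors."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

# average prediction at x = 0 (true value 0) over 200 training sets
preds1 = [knn_predict(train_set(), 0.0, 1) for _ in range(200)]
preds50 = [knn_predict(train_set(), 0.0, 50) for _ in range(200)]
avg1 = sum(preds1) / len(preds1)     # near 0: low bias, but noisy (high variance)
avg50 = sum(preds50) / len(preds50)  # near E[x^2] = 4/3: low variance, high bias
```

With k = 50 the prediction is the mean of the whole training set, so it barely varies between training sets but sits far from the true value at x = 0; with k = 1 the average prediction is nearly unbiased but individual predictions fluctuate with the noise.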
[Figures: NN1, NN20, and NN50 predictions for the experiment, illustrating the squared bias (E_D[f(X; D)] − E[Y|X])², the variance E_D[(f(X; D) − E_D[f(X; D)])²], and the irreducible error E_Y[(Y − E[Y|X])² | X].]
Scaling Issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example:
  – Height of a person may vary from 1.5m to 1.8m
  – Weight of a person may vary from 90lb to 300lb
  – Income of a person may vary from $10K to $1M
  – Income difference would dominate record distance

Other Problems
• Problem with Euclidean measure:
  – High dimensional data: curse of dimensionality
  – Can produce counter-intuitive results, e.g.,
    1 1 1 1 1 1 1 1 1 1 1 0  vs.  0 1 1 1 1 1 1 1 1 1 1 1   d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0  vs.  0 0 0 0 0 0 0 0 0 0 0 1   d = 1.4142
  – Solution: normalize the vectors to unit length
• Irrelevant attributes might dominate distance
  – Solution: eliminate them
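One common fix for the dominance problem is min-max scaling of each attribute to [0, 1]. A minimal sketch (my own helper; the height/income numbers echo the example above):

```python
def min_max_scale(records):
    """Rescale each attribute to [0, 1] so that no single attribute
    dominates the Euclidean distance."""
    cols = list(zip(*records))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(r, mins, maxs))
            for r in records]

# height in meters vs. income in dollars: raw distances are dominated
# by income; after scaling, both attributes contribute comparably
people = [(1.5, 10_000), (1.8, 1_000_000), (1.6, 20_000)]
scaled = min_max_scale(people)
```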
Computational Cost
• Brute force: O(#trainingRecords)
  – For each training record, compute distance to test record, keep if among top-k
• Pre-compute Voronoi diagram (expensive), then search a spatial index of Voronoi cells: if lucky, O(log(#trainingRecords))
• Store training records in a multi-dimensional search tree, e.g., R-tree: if lucky, O(log(#trainingRecords))
• Bulk-compute predictions for many test records using a spatial join between training and test set
  – Same worst-case cost as one-by-one predictions, but usually much faster in practice
Bayesian Classification Bayesian Theorem: Basics
• Performs probabilistic prediction, i.e., predicts • X = random variable for data records (“evidence”) class membership probabilities • H = hypothesis that specific record X=x belongs to class C • Based on Bayes’ Theorem • Goal: determine P(H| X=x) – Probability that hypothesis holds given a record x • Incremental training • P(H) = prior probability – Update probabilities as new training records arrive – The initial probability of the hypothesis – Can combine prior knowledge with observed data – E.g., person x will buy computer, regardless of age, income etc. • Even when Bayesian methods are • P(X=x) = probability that data record x is observed computationally intractable, they can provide a • P(X=x| H) = probability of observing record x, given that the standard of optimal decision making against hypothesis holds – E.g., given that x will buy a computer, what is the probability which other methods can be measured that x is in age group 31...40, has medium income, etc.?
Bayes’ Theorem
• Given data record x, the posterior probability of a hypothesis H, P(H|X=x), follows from Bayes’ theorem:
P(H|X=x) = P(X=x|H) P(H) / P(X=x)
• Informally: posterior = likelihood * prior / evidence
• Among all candidate hypotheses H, find the maximally probable one, called the maximum a posteriori (MAP) hypothesis
• Note: P(X=x) is the same for all hypotheses
• If all hypotheses are equally probable a priori, we only need to compare P(X=x|H)
– Winning hypothesis is called the maximum likelihood (ML) hypothesis
• Practical difficulties: requires initial knowledge of many probabilities and has high computational cost

Towards Naïve Bayes Classifier
• Suppose there are m classes C1, C2,…, Cm
• Classification goal: for record x, find the class Ci that has the maximum posterior probability P(Ci|X=x)
• Bayes’ theorem:
P(Ci|X=x) = P(X=x|Ci) P(Ci) / P(X=x)
• Since P(X=x) is the same for all classes, we only need to find the maximum of P(X=x|Ci) P(Ci)
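Applying the theorem is mechanical once likelihoods and priors are available. A minimal sketch (the class priors and likelihood values are taken from the buys_computer example worked later in this deck; the function itself is an illustrative helper, not slide code):

```python
def posterior(likelihoods, priors):
    """Bayes' theorem: posterior = likelihood * prior / evidence,
    where the evidence P(X=x) comes from the law of total probability."""
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}

priors = {'buys': 0.643, 'no_buy': 0.357}        # P(Ci)
likelihoods = {'buys': 0.044, 'no_buy': 0.019}   # P(X=x | Ci)
post = posterior(likelihoods, priors)
map_class = max(post, key=post.get)              # MAP hypothesis
```

Note that the evidence term cancels when we only compare classes, which is why the slides maximize P(X=x|Ci) P(Ci) directly.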
Computing P(X=x|Ci) and P(Ci)
• Estimate P(Ci) by counting the frequency of class Ci in the training data
• Can we do the same for P(X=x|Ci)?
– Need a very large set of training data
– Have |X1|*|X2|*…*|Xd|*m different combinations of possible values for X and Ci
– Need to see every instance x many times to obtain reliable estimates
• Solution: decompose into lower-dimensional problems

Example:
• P(buys_computer = yes) = 9/14
• P(buys_computer = no) = 5/14
• P(age>40, income=low, student=no, credit_rating=bad | buys_computer=yes) = 0 ?

Age    Income  Student  Credit_rating  Buys_computer
≤30    High    No       Bad            No
≤30    High    No       Good           No
31…40  High    No       Bad            Yes
>40    Medium  No       Bad            Yes
>40    Low     Yes      Bad            Yes
>40    Low     Yes      Good           No
31…40  Low     Yes      Good           Yes
≤30    Medium  No       Bad            No
≤30    Low     Yes      Bad            Yes
>40    Medium  Yes      Bad            Yes
≤30    Medium  Yes      Good           Yes
31…40  Medium  No       Good           Yes
31…40  High    Yes      Bad            Yes
>40    Medium  No       Good           No
Conditional Independence
• X, Y, Z random variables
• X is conditionally independent of Y, given Z, if P(X|Y,Z) = P(X|Z)
– Equivalent to: P(X,Y|Z) = P(X|Z) * P(Y|Z)
• Example: people with longer arms read better
– Confounding factor: age
– Young child has shorter arms and lacks reading skills of adult
– If age is fixed, the observed relationship between arm length and reading skills disappears

Derivation of Naïve Bayes Classifier
• Simplifying assumption: all input attributes are conditionally independent, given the class:
P(X=(x1,…,xd)|Ci) = ∏_{k=1}^{d} P(Xk=xk|Ci) = P(X1=x1|Ci) * P(X2=x2|Ci) * … * P(Xd=xd|Ci)
• Each P(Xk=xk|Ci) can be estimated robustly
– If Xk is a categorical attribute:
• P(Xk=xk|Ci) = #records in Ci that have value xk for Xk, divided by #records of class Ci in the training data set
– If Xk is continuous, we could discretize it
• Problem: interval selection
– Too many intervals: too few training cases per interval
– Too few intervals: limited choices for decision boundary
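The categorical counting estimator above is a few lines of code. A minimal sketch (the four-record data set is illustrative, not the slides' table):

```python
from collections import defaultdict

def estimate_conditionals(records, labels):
    """Estimate P(Xk=xk|Ci) as (#class-Ci records with value xk for Xk)
    divided by (#class-Ci records)."""
    class_count = defaultdict(int)
    value_count = defaultdict(int)   # keyed by (class, attribute index, value)
    for x, y in zip(records, labels):
        class_count[y] += 1
        for k, v in enumerate(x):
            value_count[(y, k, v)] += 1
    def p(k, v, c):
        return value_count[(c, k, v)] / class_count[c]
    return p

records = [('<=30', 'high'), ('<=30', 'high'), ('31..40', 'high'), ('>40', 'medium')]
labels  = ['no', 'no', 'yes', 'yes']
p = estimate_conditionals(records, labels)
prob = p(0, '<=30', 'no')   # fraction of class-'no' records with age '<=30'
```

Each conditional is a one-dimensional estimate, which is exactly the decomposition the independence assumption buys.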
Estimating P(Xk=xk|Ci) for Continuous Attributes without Discretization
• P(Xk=xk|Ci) computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
as
P(Xk=xk|Ci) = g(xk, μ_{k,Ci}, σ_{k,Ci})
• Estimate μ_{k,Ci} from the sample mean of attribute Xk for all training records of class Ci
• Estimate σ_{k,Ci} similarly from the sample standard deviation

Naïve Bayes Example
• Classes:
– C1: buys_computer = yes
– C2: buys_computer = no
• Data sample x:
– age ≤ 30,
– income = medium,
– student = yes, and
– credit_rating = bad
(Training data: the 14-record age/income/student/credit_rating table shown earlier.)
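The Gaussian estimate above is straightforward to compute. A minimal sketch (the list of ages for the 'yes' class is hypothetical, chosen only to exercise the formulas):

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = exp(-(x-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)"""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit_class_gaussian(values):
    """Estimate mu and sigma from the training records of one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
    return mu, math.sqrt(var)

ages_yes = [35, 38, 42, 30, 45]          # hypothetical ages of class-'yes' records
mu, sigma = fit_class_gaussian(ages_yes)
likelihood = gaussian(33, mu, sigma)     # P(age=33 | yes) under the Gaussian model
```

One pair (μ, σ) is fitted per continuous attribute per class, so no interval selection is needed.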
Naïve Bayesian Computation
• Compute P(Ci) for each class:
– P(buys_computer = “yes”) = 9/14 = 0.643
– P(buys_computer = “no”) = 5/14 = 0.357
• Compute P(Xk=xk|Ci) for each class:
– P(age = “≤30” | buys_computer = “yes”) = 2/9 = 0.222
– P(age = “≤30” | buys_computer = “no”) = 3/5 = 0.6
– P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
– P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
– P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
– P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
– P(credit_rating = “bad” | buys_computer = “yes”) = 6/9 = 0.667
– P(credit_rating = “bad” | buys_computer = “no”) = 2/5 = 0.4
• Compute P(X=x|Ci) using the Naïve Bayes assumption:
– P(≤30, medium, yes, bad | buys_computer = “yes”) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
– P(≤30, medium, yes, bad | buys_computer = “no”) = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
• Compute the final result P(X=x|Ci) * P(Ci):
– P(X=x | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
– P(X=x | buys_computer = “no”) * P(buys_computer = “no”) = 0.007
• Therefore we predict buys_computer = “yes” for input x = (age = “≤30”, income = “medium”, student = “yes”, credit_rating = “bad”)

Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero (why?):
P(X=(x1,…,xd)|Ci) = ∏_{k=1}^{d} P(Xk=xk|Ci)
• Example: 1000 records for buys_computer=yes with income=low (0), income=medium (990), and income=high (10)
– For an input with income=low, the conditional probability is zero
• Use the Laplacian correction (or Laplace estimator): add 1 dummy record for each income level
• Prob(income = low) = 1/1003
• Prob(income = medium) = 991/1003
• Prob(income = high) = 11/1003
– The “corrected” probability estimates are close to their “uncorrected” counterparts, but none is zero
Naïve Bayesian Classifier: Comments Probabilities
• Easy to implement • Summary of elementary probability facts we have • Good results obtained in many cases used already and/or will need soon – Robust to isolated noise points • Let X be a random variable as usual – Handles missing values by ignoring the instance during probability estimate calculations • Let A be some predicate over its possible values – Robust to irrelevant attributes – A is true for some values of X, false for others • Disadvantages – E.g., X is outcome of throw of a die, A could be “value – Assumption: class conditional independence, is greater than 4” therefore loss of accuracy • P(A) is the fraction of possible worlds in which A – Practically, dependencies exist among variables is true • How to deal with these dependencies? – P(die value is greater than 4) = 2 / 6 = 1/3
17 Axioms Theorems from the Axioms
• 0 P(A) 1 • 0 P(A) 1, P(True) = 1, P(False) = 0 • P(True) = 1 • P(A B) = P(A) + P(B) - P(A B) • P(False) = 0 • P(A B) = P(A) + P(B) - P(A B) • From these we can prove: – P(not A) = P(~A) = 1 - P(A) – P(A) = P(A B) + P(A ~B)
Conditional Probability Definition of Conditional Probability
• P(A|B) = fraction of worlds in which B is true that also have A true
• Definition of conditional probability: P(A|B) = P(A ∧ B) / P(B)
• Corollary, the Chain Rule: P(A ∧ B) = P(A|B) P(B)
• Example: H = “Have a headache”, F = “Coming down with flu”
– P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
– “Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”
Multivalued Random Variables
• Suppose X can take on more than 2 values
• X is a random variable with arity k if it can take on exactly one value out of {v1, v2,…, vk}
• Thus P(X=vi ∧ X=vj) = 0 if i ≠ j, and P(X=v1 ∨ X=v2 ∨ … ∨ X=vk) = 1

Easy Fact about Multivalued Random Variables
• Using the axioms of probability
– 0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0
– P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• And assuming that X obeys the above, we can prove that
P(X=v1 ∨ X=v2 ∨ … ∨ X=vi) = Σ_{j=1}^{i} P(X=vj)
• And therefore: Σ_{j=1}^{k} P(X=vj) = 1
Useful Easy-to-Prove Facts
• P(A|B) + P(¬A|B) = 1
• Σ_{j=1}^{k} P(X=vj|B) = 1
The Joint Distribution (example: Boolean variables A, B, C)
Recipe for making a joint distribution of d variables:
1. Make a truth table listing all combinations of values of your d variables (has 2^d rows for d Boolean variables).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A B C  Prob
0 0 0  0.30
0 0 1  0.05
0 1 0  0.10
0 1 1  0.05
1 0 0  0.05
1 0 1  0.10
1 1 0  0.25
1 1 1  0.10

Using the Joint Dist.
Once you have the JD you can ask for the probability of any logical expression E involving your attributes:
P(E) = Σ_{rows matching E} P(row)
Using the Joint Dist.
• P(E) = Σ_{rows matching E} P(row)
• Example: P(Poor ∧ Male) = 0.4654, P(Poor) = 0.7604
Inference with the Joint Dist.
• P(E1|E2) = P(E1 ∧ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
• Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612
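Both kinds of query reduce to summing matching rows. The sketch below stores the deck's A/B/C joint table as a dictionary; the example events (P(A), P(B|A)) are illustrative queries, not from the slides:

```python
# Joint distribution over Boolean variables (A, B, C) from the truth-table slide.
joint = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}

def prob(event):
    """P(E) = sum of P(row) over rows matching E."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1|E2) = (sum over rows matching E1 and E2) / (sum over rows matching E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

p_a = prob(lambda r: r[0] == 1)                                    # P(A)
p_b_given_a = cond_prob(lambda r: r[1] == 1, lambda r: r[0] == 1)  # P(B|A)
```

The dictionary has 2^d entries, which is exactly why the joint distribution becomes impossible to build beyond a handful of attributes.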
Joint Distributions What Would Help?
• Good news: Once you • Bad news: Impossible to • Full independence have a joint create joint distribution – P(gender=g hours_worked=h wealth=w) = distribution, you can for more than about ten P(gender=g) * P(hours_worked=h) * P(wealth=w) answer important attributes because – Can reconstruct full joint distribution from a few questions that involve there are so many marginals uncertainty. numbers needed when • Full conditional independence given class value you build it. – Naïve Bayes • What about something between Naïve Bayes and general joint distribution?
20 Bayesian Belief Networks Bayesian Network Properties
• Subset of the variables conditionally independent • Each variable is conditionally independent of • Graphical model of causal relationships its non-descendents in the graph, given its – Represents dependency among the variables parents – Gives a specification of joint probability distribution • Naïve Bayes as a Bayesian network:
Nodes: random variables Y Links: dependency Y X X and Y are the parents of Z, and Y is the parent of P Z Given Y, Z and P are independent P X1 X2 Xn Has no loops or cycles
General Properties Structural Property
• Chain rule: P(X1,X2,X3) = P(X1|X2,X3) P(X2|X3) P(X3)
– Also: P(X1,X2,X3) = P(X3|X1,X2) P(X2|X1) P(X1)
• General: P(X1,…,Xn) = ∏_{i=1}^{n} P(Xi|Xi−1, Xi−2,…, X1)
• Given a network: P(X1,…,Xn) = ∏_{i=1}^{n} P(Xi | parents(Xi))
– The terms P(Xi | parents(Xi)) are given as conditional probability tables (CPTs) in the network
• Network does not necessarily reflect causality

Structural Property
• Missing links simplify the computation of P(X1, X2,…, Xn)
– Fully connected: link between every pair of nodes
– Some links are missing
• Sparse network allows better estimation of CPTs (fewer combinations of parent values, hence more reliable to estimate from limited data) and faster computation
Small Example Computing P(S|J)
Small Example
• S: Student studies a lot for 6220
• L: Student learns a lot and gets a good grade
• J: Student gets a great job
• Network: S → L → J, with CPTs P(S) = 0.4; P(L|S) = 0.9, P(L|~S) = 0.2; P(J|L) = 0.8, P(J|~L) = 0.3

Computing P(S|J)
• Probability that a student who got a great job was doing her homework
• P(S|J) = P(S, J) / P(J)
• P(S, J) = P(S, J, L) + P(S, J, ~L)
• P(J) = P(J, S, L) + P(J, S, ~L) + P(J, ~S, L) + P(J, ~S, ~L)
• P(J, L, S) = P(J|L, S) * P(L, S) = P(J|L) * P(L|S) * P(S) = 0.8*0.9*0.4
• P(J, ~L, S) = P(J|~L) * P(~L|S) * P(S) = 0.3*(1−0.9)*0.4
• P(J, L, ~S) = P(J|L) * P(L|~S) * P(~S) = 0.8*0.2*(1−0.4)
• P(J, ~L, ~S) = P(J|~L) * P(~L|~S) * P(~S) = 0.3*(1−0.2)*(1−0.4)
• Putting this all together, we obtain:
P(S|J) = (0.8*0.9*0.4 + 0.3*0.1*0.4) / (0.8*0.9*0.4 + 0.3*0.1*0.4 + 0.8*0.2*0.6 + 0.3*0.8*0.6) = 0.3 / 0.54 ≈ 0.56
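The same computation can be done by enumerating the joint distribution that the network factorization defines. A minimal sketch using the CPT values from the example (S → L → J chain):

```python
# CPTs from the student/job example.
P_S = 0.4
P_L = {True: 0.9, False: 0.2}   # P(L | S), P(L | ~S)
P_J = {True: 0.8, False: 0.3}   # P(J | L), P(J | ~L)

def joint(s, l, j):
    """P(S,L,J) = P(J|L) * P(L|S) * P(S) by the network factorization."""
    ps = P_S if s else 1 - P_S
    pl = P_L[s] if l else 1 - P_L[s]
    pj = P_J[l] if j else 1 - P_J[l]
    return pj * pl * ps

# Marginalize out L (and S for the denominator) to get P(S|J) = P(S,J) / P(J).
p_j = sum(joint(s, l, True) for s in (True, False) for l in (True, False))
p_s_and_j = sum(joint(True, l, True) for l in (True, False))
p_s_given_j = p_s_and_j / p_j
```

Brute-force enumeration is exponential in the number of variables, which is why exact inference in general networks is NP-hard; for this three-node chain it is only 4 terms.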
21 More Complex Example Computing with Bayes Net
P(M)=0.6 P(S)=0.3 S M T: The lecture started on time L: The lecturer arrives late P(RM)=0.3 R: The lecture concerns data mining P(LM, S)=0.05 L P(R~M)=0.6 M M: The lecturer is Mike P(LM, ~S)=0.1 P(TL)=0.3 R S: It is snowing P(L~M, S)=0.1 P(T~L)=0.8 R P(L~M, ~S)=0.2 T T: The lecture started on time L: The lecturer arrives late R: The lecture concerns data mining M: The lecturer is Mike S ? S: It is snowing
L P(T, ~R, L, ~M, S) T = P(T | L) P(~R | ~M) P(L | ~M, S) P(~M) P(S)
146 147
Computing with Bayes Net Inference with Bayesian Networks
P(M)=0.6 P(S)=0.3 S M • Can predict the probability for any attribute, P(RM)=0.3 given any subset of the other attributes P(LM, S)=0.05 L P(R~M)=0.6 P(LM, ~S)=0.1 – P(M | L, R), P(T | S, ~M, R) and so on P(TL)=0.3 R P(L~M, S)=0.1 P(T~L)=0.8 P(L~M, ~S)=0.2 • Easy case: P(Xi | Xj1, Xj2,…, Xjk) where T T: The lecture started on time L: The lecturer arrives late parents(Xi){Xj1, Xj2,…, Xjk} R: The lecture concerns data mining M: The lecturer is Mike – Can read answer directly from Xi’s CPT P(R | T, ~S) = P(R, T, ~S) / P(T, ~S) S: It is snowing • What if values are not given for all parents of Xi? P(R, T, ~S) – Exact inference of probabilities in general for an = P(L, M, R, T, ~S) + P(~L, M, R, T, ~S) + P(L, ~M, R, T, ~S) + P(~L, ~M, R, T, ~S) arbitrary Bayesian network is NP-hard – Solutions: probabilistic inference, trade precision for Compute P(T, ~S) similarly. Problem: There are now 8 such terms to be computed. efficiency
Training Bayesian Networks Classification and Prediction Overview
• Several scenarios: • Introduction – Network structure known, all variables observable: learn • Decision Trees only the CPTs • Statistical Decision Theory – Network structure known, some hidden variables: gradient • Nearest Neighbor descent (greedy hill-climbing) method, analogous to neural • Bayesian Classification network learning • Artificial Neural Networks – Network structure unknown, all variables observable: search through the model space to reconstruct network • Support Vector Machines (SVMs) topology • Prediction – Unknown structure, all hidden variables: No good • Accuracy and Error Measures algorithms known for this purpose • Ensemble Methods • Ref.: D. Heckerman: Bayesian networks for data mining
22 Basic Building Block: Perceptron Perceptron Decision Hyperplane
Input: {(x1, x2, y), …}
Output: classification function f(x):
f(x) = sign(b + Σ_{i=1}^{d} wi xi)
f(x) > 0: return +1
f(x) ≤ 0: return −1
• Input vector x, weight vector w, weighted sum b + w∙x, activation function sign; b is called the bias
• Decision hyperplane: b + w∙x = 0 (in two dimensions: b + w1x1 + w2x2 = 0)
• Note: b + w∙x > 0 if and only if Σ_{i=1}^{d} wi xi > −b; b represents a threshold for when the perceptron “fires”
Representing Boolean Functions Perceptron Training Rule
• AND with two-input perceptron • Goal: correct +1/-1 output for each training record
– b=-0.8, w1=w2=0.5 • Start with random weights, constant (learning rate) • OR with two-input perceptron • While some training records are still incorrectly – b=-0.3, w1=w2=0.5 classified do • m-of-n function: true if at least m out of n inputs – For each training record (x, y) are true • Let fold(x) be the output of the current perceptron for x • – All input weights 0.5, threshold weight b is set Set b:= b + b, where b = ( y - fold(x) ) according to m, n • For all i, set wi := wi + wi, where wi = ( y - fold(x))xi • Can also represent NAND, NOR • Converges to correct decision boundary, if the classes • What about XOR? are linearly separable and a small enough is used
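The training rule above can be sketched directly. This is a minimal Python version; the AND data set, learning rate, and epoch cap are illustrative choices (AND is linearly separable, so the loop converges):

```python
def train_perceptron(data, eta=0.1, max_epochs=100):
    """Perceptron training rule: for each misclassified record,
    b += eta*(y - f_old(x)) and w_i += eta*(y - f_old(x))*x_i."""
    d = len(data[0][0])
    b, w = 0.0, [0.0] * d
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            f_old = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if f_old != y:
                errors += 1
                b += eta * (y - f_old)
                w = [wi + eta * (y - f_old) * xi for wi, xi in zip(w, x)]
        if errors == 0:   # all training records correctly classified
            break
    return b, w

# AND function with +1/-1 labels.
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
b, w = train_perceptron(and_data)
```

Since y and f_old are ±1, each update has magnitude 2η; if the classes were not separable, the loop would simply stop at max_epochs without converging.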
Gradient Descent Gradient Descent Rule
• If training records are not linearly separable, find best • Find weight vector that minimizes E(b,w) by altering it fit approximation in direction of steepest descent – Gradient descent to search the space of possible weight – Set (b,w) := (b,w) + (b,w), where (b,w) = - E(b,w)
vectors
• Consider the un-thresholded perceptron (no sign function applied), i.e., u(x) = b + w∙x
• Measure training error by the squared error
E(b,w) = (1/2) Σ_{(x,y)∈D} (y − u(x))²
– D = training data
• −∇E(b,w) = [−∂E/∂b, −∂E/∂w1,…, −∂E/∂wn] gives the direction of steepest descent, hence the update rules
b := b + η Σ_{(x,y)∈D} (y − u(x))
wi := wi + η Σ_{(x,y)∈D} (y − u(x)) xi
• Start with random weights, iterate until convergence
– Will converge to the global minimum if η is small enough
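The batch update rules above translate directly into code. A minimal sketch on a one-attribute data set generated by y = 1 + 2x (the learning rate and epoch count are illustrative choices that happen to converge here):

```python
def gradient_descent(data, eta=0.05, epochs=500):
    """Batch gradient descent on E(b,w) = 1/2 * sum (y - u(x))^2 for the
    un-thresholded unit u(x) = b + w.x; one update per pass over all of D."""
    d = len(data[0][0])
    b, w = 0.0, [0.0] * d
    for _ in range(epochs):
        db, dw = 0.0, [0.0] * d
        for x, y in data:
            err = y - (b + sum(wi * xi for wi, xi in zip(w, x)))
            db += err                 # accumulates -dE/db over D
            for i, xi in enumerate(x):
                dw[i] += err * xi     # accumulates -dE/dwi over D
        b += eta * db
        w = [wi + eta * dwi for wi, dwi in zip(w, dw)]
    return b, w

# Records drawn from y = 1 + 2*x, so the global minimum is b=1, w=(2,).
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
b, w = gradient_descent(data)
```

Moving the updates inside the inner loop (one record at a time) turns this into case updating, i.e., stochastic gradient descent.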
23 Gradient Descent Summary Multilayer Feedforward Networks
• Epoch updating (batch mode)
– Compute gradient over entire training set
– Changes model once per scan of entire training set
• Case updating (incremental mode, stochastic gradient descent)
– Compute gradient for a single training record
– Changes model after every single training record immediately
• Case updating can approximate epoch updating arbitrarily closely if η is small enough
• What is the difference between the perceptron training rule and case updating for gradient descent?
– Error computation on thresholded vs. unthresholded function

Multilayer Feedforward Networks
• Use another perceptron (output layer) to combine the output of the lower layer (hidden layer)
– What about linear units only? Can only construct linear functions!
– Need a nonlinear component
– sign function: not differentiable
• Use the sigmoid: σ(x) = 1/(1+e^{−x})
• Perceptron function becomes: y = 1 / (1 + e^{−(b+w∙x)})
1-Hidden Layer ANN Example Making Predictions
• Input record fed simultaneously into the units of the input layer
• Then weighted and fed simultaneously to a hidden layer
• Weighted outputs of the last hidden layer are the input to the units in the output layer, which emits the network's prediction
• The network is feed-forward
– None of the weights cycles back to an input unit or to an output unit of a previous layer
• Statistical point of view: neural networks perform nonlinear regression
• Hidden unit i computes vi = g(bi + Σ_{k=1}^{N_INS} wik xk)
• Output unit computes Out = g(B + Σ_{k=1}^{N_HID} Wk vk)
• g is usually the sigmoid function
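The two formulas above are one forward pass. A minimal sketch with hand-picked, hypothetical weights (not from the slides) that make a 2-input, 2-hidden-unit network behave like XOR:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, hidden, output):
    """Feed-forward pass: v_i = g(b_i + sum_k w_ik x_k) per hidden unit,
    then Out = g(B + sum_k W_k v_k)."""
    v = [sigmoid(b + sum(wk * xk for wk, xk in zip(w, x))) for b, w in hidden]
    B, W = output
    return sigmoid(B + sum(Wk * vk for Wk, vk in zip(W, v)))

# Hypothetical weights: hidden unit 1 ~ OR, hidden unit 2 ~ NAND, output ~ AND.
hidden = [(-5.0, [10.0, 10.0]),
          (15.0, [-10.0, -10.0])]
output = (-15.0, [10.0, 10.0])
out = forward([1.0, 0.0], hidden, output)   # near 1 for an XOR-true input
```

This also illustrates the earlier point about XOR: a single perceptron cannot represent it, but one hidden layer can.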
Backpropagation Algorithm Overfitting
• Earlier discussion: gradient descent for a single perceptron • When do we stop updating the weights? using a simple un-thresholded function • If sigmoid (or other differentiable) function is applied to • Overfitting tends to happen in later iterations weighted sum, use complete function for gradient descent – Weights initially small random values • Multiple perceptrons: optimize over all weights of all perceptrons – Weights all similar => smooth decision surface – Problems: huge search space, local minima – Surface complexity increases as weights diverge • Backpropagation – Initialize all weights with small random values • Preventing overfitting – Iterate many times – Weight decay: decrease each weight by small factor • Compute gradient, starting at output and working back – Error of hidden unit h: how do we get the true output value? Use weighted during each iteration, or sum of errors of each unit influenced by h • Update all weights in the network – Use validation data to decide when to stop iterating
24 Neural Network Decision Boundary Backpropagation Remarks
• Computational cost – Each iteration costs O(|D|*|w|), with |D| training records and |w| weights – Number of iterations can be exponential in n, the number of inputs (in practice often tens of thousands) • Local minima can trap the gradient descent algorithm: convergence guaranteed to local minimum, not global • Backpropagation highly effective in practice – Many variants to deal with local minima issue, use of case updating Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning
Defining a Network Representational Power
1. Decide network topology • Boolean functions – #input units, #hidden layers, #units per hidden layer, #output – Each can be represented by a 2-layer network units (one output unit per class for problems with >2 classes) – Number of hidden units can grow exponentially with 2. Normalize input values for each attribute to [0.0, 1.0] number of inputs – Nominal/ordinal attributes: one input unit per domain value • Create hidden unit for each input record • For attribute grade with values A, B, C, have 3 inputs that are set to • Set its weights to activate only for that input 1,0,0 for grade A, to 0,1,0 for grade B, and 0,0,1 for C • Implement output unit as OR gate that only activates for desired • Why not map it to a single input with domain [0.0, 1.0]? output patterns 3. Choose learning rate , e.g., 1 / (#training iterations) • Continuous functions – Too small: takes too long to converge – Every bounded continuous function can be approximated – Too large: might never converge (oversteps minimum) arbitrarily close by a 2-layer network 4. Bad results on test data? Change network topology, initial • Any function can be approximated arbitrarily close by a weights, or learning rate; try again. 3-layer network
Neural Network as a Classifier Classification and Prediction Overview
• Weaknesses • Introduction – Long training time • Decision Trees – Many non-trivial parameters, e.g., network topology • Statistical Decision Theory – Poor interpretability: What is the meaning behind learned weights and hidden units? • Nearest Neighbor • Note: hidden units are alternative representation of input values, • Bayesian Classification capturing their relevant features • Strengths • Artificial Neural Networks – High tolerance to noisy data • Support Vector Machines (SVMs) – Well-suited for continuous-valued inputs and outputs • Prediction – Successful on a wide array of real-world data • Accuracy and Error Measures – Techniques exist for extraction of rules from neural networks • Ensemble Methods
25 SVM—Support Vector Machines SVM—History and Applications
• Newer and very popular classification method • Vapnik and colleagues (1992) • Uses a nonlinear mapping to transform the – Groundwork from Vapnik & Chervonenkis’ statistical original training data into a higher dimension learning theory in 1960s • Training can be slow but accuracy is high • Searches for the optimal separating – Ability to model complex nonlinear decision hyperplane (i.e., “decision boundary”) in the boundaries (margin maximization) new dimension • Used both for classification and prediction • SVM finds this hyperplane using support • Applications: handwritten digit recognition, vectors (“essential” training records) and object recognition, speaker identification, margins (defined by the support vectors) benchmarking time-series prediction tests
Linear Classifiers Linear Classifiers
f(x,w,b) = sign(wx + b) f(x,w,b) = sign(wx + b) denotes +1 denotes +1 denotes -1 denotes -1
How would you How would you classify this data? classify this data?
Linear Classifiers Linear Classifiers
f(x,w,b) = sign(wx + b) f(x,w,b) = sign(wx + b) denotes +1 denotes +1 denotes -1 denotes -1
How would you How would you classify this data? classify this data?
26 Linear Classifiers Classifier Margin
f(x,w,b) = sign(wx + b) f(x,w,b) = sign(wx + b) denotes +1 denotes +1 denotes -1 denotes -1 Define the margin of a linear classifier as the Any of these width that the would be fine.. boundary could be increased by ..but which is before hitting a best? data record.
Maximum Margin Maximum Margin
f(x,w,b) = sign(wx + b) f(x,w,b) = sign(wx + b) denotes +1 denotes +1 denotes -1 Find the maximum denotes -1 margin linear classifier.
This is the Support Vectors simplest kind of are those SVM, called linear datapoints that SVM or LSVM. the margin pushes up against
Why Maximum Margin? Specifying a Line and Margin Plus-Plane • If we made a small error in the location of the Classifier Boundary boundary, this gives us the least chance of Minus-Plane causing a misclassification. • Model is immune to removal of any non- support-vector data records. • Plus-plane = { x : wx + b = +1 } • There is some theory (using VC dimension) • Minus-plane = { x : wx + b = -1 } that is related to (but not the same as) the proposition that this is a good thing. Classify as +1 if w x + b 1 • Empirically it works very well. -1 if wx + b -1 what if -1 < wx + b < 1 ?
182 183
27 Computing Margin Width Computing Margin Width
M = Margin Width x+ M = Margin Width
x-
• Plus-plane = { x : wx + b = +1 } • Choose arbitrary point x- on minus-plane • Minus-plane = { x : wx + b = -1 } + - • Goal: compute M in terms of w and b • Let x be the point in plus-plane closest to x – Note: vector w is perpendicular to plus-plane • Since vector w is perpendicular to these planes, it • Consider two vectors u and v on plus-plane and show that w(u-v)=0 + - • Hence it is also perpendicular to the minus-plane holds that x = x + w, for some value of
Putting It All Together Finding the Maximum Margin
• We have so far:
– w∙x⁺ + b = +1 and w∙x⁻ + b = −1
– x⁺ = x⁻ + λw
– |x⁺ − x⁻| = M
• Derivation:
– w∙(x⁻ + λw) + b = +1, hence w∙x⁻ + b + λ w∙w = 1
– This implies λ w∙w = 2, i.e., λ = 2 / (w∙w)
– Since M = |x⁺ − x⁻| = |λw| = λ|w| = λ (w∙w)^0.5
– We obtain M = 2 (w∙w)^0.5 / (w∙w) = 2 / (w∙w)^0.5

Finding the Maximum Margin
• How do we find w and b such that the margin is maximized and all training records are in the correct zone for their class?
• Solution: Quadratic Programming (QP)
• QP is a well-studied class of optimization algorithms to maximize a quadratic function of some real-valued variables subject to linear constraints.
– There exist algorithms for finding such constrained quadratic optima efficiently and reliably.
Quadratic Programming
Find arg max_u c + dᵀu + uᵀRu/2 (quadratic criterion)
Subject to n additional linear inequality constraints:
a11 u1 + a12 u2 + … + a1m um ≤ b1
a21 u1 + a22 u2 + … + a2m um ≤ b2
:
an1 u1 + an2 u2 + … + anm um ≤ bn
And subject to e additional linear equality constraints:
a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
:
a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)

What Are the SVM Constraints?
• Consider n training records (x(k), y(k)), where y(k) = ±1
• Margin M = 2 / √(w∙w)
• How many constraints will we have?
• What should they be?
• What is the quadratic optimization criterion?
What Are the SVM Constraints?
• Consider n training records (x(k), y(k)), where y(k) = ±1
• Margin M = 2 / √(w∙w)
• How many constraints will we have? n.
• What should they be? For each 1 ≤ k ≤ n:
w∙x(k) + b ≥ 1, if y(k) = 1
w∙x(k) + b ≤ −1, if y(k) = −1
• What is the quadratic optimization criterion?
– Minimize w∙w

Problem: Classes Not Linearly Separable
• Inequalities for the training records are not satisfiable by any w and b
Solution 1? Solution 2? denotes +1 • Find minimum ww, denotes +1 • Minimize ww + denotes -1 while also minimizing denotes -1 C(#trainSetErrors) number of training set – C is a tradeoff parameter errors • Problems: – Not a well-defined – Cannot be expressed as optimization problem QP, hence finding solution might be slow (cannot optimize two things at the same time) – Does not distinguish between disastrous errors and near misses
Solution 3
• Minimize w∙w + C (distance of error records to their correct place)
• This works!
• But still need to do something about the unsatisfiable set of inequalities

What Are the SVM Constraints?
• Consider n training records (x(k), y(k)), where y(k) = ±1
• How many constraints will we have? n. For each 1 ≤ k ≤ n:
w∙x(k) + b ≥ 1 − εk, if y(k) = 1
w∙x(k) + b ≤ −1 + εk, if y(k) = −1
εk ≥ 0
• What is the quadratic optimization criterion?
– Minimize (1/2) w∙w + C Σ_{k=1}^{n} εk
29 Facts About the New Problem Effect of Parameter C Formulation • Original QP formulation had d+1 variables
– w1, w2,..., wd and b • New QP formulation has d+1+n variables
– w1, w2,..., wd and b – 1, 2,..., n • C is a new parameter that needs to be set for the SVM – Controls tradeoff between paying attention to margin size versus misclassifications Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning
An Equivalent QP (The “Dual”)
Maximize Σ_{k=1}^{n} αk − (1/2) Σ_{k=1}^{n} Σ_{l=1}^{n} αk αl y(k) y(l) (x(k)∙x(l))
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1}^{n} αk y(k) = 0
Then define w = Σ_{k=1}^{n} αk y(k) x(k) and b = AVG_{k: 0<αk<C} (y(k) − x(k)∙w), and classify with f(x,w,b) = sign(w∙x + b)

Important Facts
• The dual formulation of the QP can be optimized more quickly, but the result is equivalent
• Data records with αk > 0 are the support vectors
– Those with 0 < αk < C lie on the plus- or minus-plane
– Those with αk = C are on the wrong side of the classifier boundary (have εk > 0)
• Computation for w and b only depends on those records with αk > 0, i.e., the support vectors
• The alternative QP has another major advantage, as we will see now...
Easy To Separate Easy To Separate
What would Not a big surprise SVMs do with this data?
Positive “plane” Negative “plane”
30 Harder To Separate Harder To Separate X’ (= X2) What can be done about this? Non-linear basis functions: Original data: (X, Y) Transformed: (X, X2, Y)
Think of X2 as a new X attribute, e.g., X’
Corresponding “Planes” in Original Now Separation Is Easy Again Space X’ (= X2)
Region above plus-”plane”
X Region below minus-”plane”
Common SVM Basis Functions
• Polynomial of attributes X1,…, Xd of certain max degree
• Radial basis function
– Symmetric around a center c, i.e., KernelFunction(|X − c| / kernelWidth)
• Sigmoid function of X, e.g., hyperbolic tangent
• Let Φ(x) be the transformed input record
– Previous example: Φ((x)) = (x, x²)

Quadratic Basis Functions
Φ(x) = (1; √2 x1,…, √2 xd; x1²,…, xd²; √2 x1x2, √2 x1x3,…, √2 xd−1xd)
– Constant term, linear terms, pure quadratic terms, quadratic cross-terms
– Number of terms (assuming d input attributes): (d+2)-choose-2 = (d+2)(d+1)/2 ≈ d²/2
– Why did we choose this specific transformation?
Dual QP With Basis Functions
Maximize Σ_{k=1}^{n} αk − (1/2) Σ_{k=1}^{n} Σ_{l=1}^{n} αk αl y(k) y(l) (Φ(x(k))∙Φ(x(l)))
Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1}^{n} αk y(k) = 0
Then define w = Σ_{k=1}^{n} αk y(k) Φ(x(k)) and b = AVG_{k: 0<αk<C} (y(k) − Φ(x(k))∙w), and classify with f(x,w,b) = sign(w∙Φ(x) + b)

Computation Challenge
• Input vector x has d components (its d attribute values)
• The transformed input vector Φ(x) has about d²/2 components
• Hence computing Φ(x(k))∙Φ(x(l)) now costs order d²/2 instead of order d operations (additions, multiplications)
• ...or is there a better way to do this?
– Take advantage of properties of certain transformations
Quadratic Dot Products
Φ(a)∙Φ(b) = 1 + 2 Σ_{i=1}^{d} ai bi + Σ_{i=1}^{d} ai² bi² + 2 Σ_{i=1}^{d} Σ_{j=i+1}^{d} ai aj bi bj

Now consider another function of a and b:
(a∙b + 1)² = (a∙b)² + 2 a∙b + 1
= (Σ_{i=1}^{d} ai bi)² + 2 Σ_{i=1}^{d} ai bi + 1
= Σ_{i=1}^{d} (ai bi)² + 2 Σ_{i=1}^{d} Σ_{j=i+1}^{d} ai bi aj bj + 2 Σ_{i=1}^{d} ai bi + 1
• Hence Φ(a)∙Φ(b) = (a∙b + 1)²
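The identity is easy to verify numerically. A minimal sketch that builds the quadratic basis expansion explicitly and compares it against the kernel shortcut (the random 5-dimensional vectors are illustrative):

```python
import math
import random

def phi(a):
    """Quadratic basis Phi(a): constant 1, linear terms sqrt(2)*a_i,
    pure quadratic terms a_i^2, and cross-terms sqrt(2)*a_i*a_j (i < j)."""
    d = len(a)
    out = [1.0]
    out += [math.sqrt(2) * ai for ai in a]
    out += [ai * ai for ai in a]
    out += [math.sqrt(2) * a[i] * a[j] for i in range(d) for j in range(i + 1, d)]
    return out

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

random.seed(0)
a = [random.uniform(-1, 1) for _ in range(5)]
b = [random.uniform(-1, 1) for _ in range(5)]
explicit = dot(phi(a), phi(b))   # ~d^2/2 operations in the transformed space
kernel = (dot(a, b) + 1) ** 2    # ~d+2 operations via the kernel trick
```

For d=5 the expansion already has (d+2)(d+1)/2 = 21 components, while the kernel evaluation never leaves the original 5 dimensions.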
Quadratic Dot Products Any Other Computation Problems?
• The results of Φ(a)∙Φ(b) and of (a∙b+1)² are identical
• Computing Φ(a)∙Φ(b) costs about d²/2 operations, while computing (a∙b+1)² costs only about d+2 operations
• This means that we can work in the high-dimensional space (d²/2 dimensions) where the training records are more easily separable, but pay about the same cost as working in the original space (d dimensions)
• Savings are even greater when dealing with higher-degree polynomials, i.e., degree q>2, which can be computed as (a∙b+1)^q

Any Other Computation Problems?
• What about computing w?
– Finally need f(x,w,b) = sign(w∙Φ(x) + b):
w∙Φ(x) = Σ_{k=1}^{n} αk y(k) (Φ(x(k))∙Φ(x))
– Can be computed using the same trick as before
• Can apply the same trick again to b, because
Φ(x(k))∙w = Σ_{j=1}^{n} αj y(j) (Φ(x(k))∙Φ(x(j)))
212 213
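The identity above is easy to check numerically. A small sketch (helper names are our own) that builds the explicit quadratic basis expansion and compares the two ways of computing the dot product:

```python
import itertools
import math
import random

def phi(x):
    """Explicit quadratic basis expansion: constant, linear,
    pure quadratic, and cross terms (about d^2/2 components)."""
    d = len(x)
    features = [1.0]
    features += [math.sqrt(2) * xi for xi in x]                # linear terms
    features += [xi * xi for xi in x]                          # pure quadratic terms
    features += [math.sqrt(2) * x[i] * x[j]                    # cross terms
                 for i, j in itertools.combinations(range(d), 2)]
    return features

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

random.seed(0)
a = [random.uniform(-1, 1) for _ in range(5)]
b = [random.uniform(-1, 1) for _ in range(5)]

explicit = dot(phi(a), phi(b))        # O(d^2) work in the transformed space
kernel = (dot(a, b) + 1) ** 2         # O(d) work in the original space
print(abs(explicit - kernel) < 1e-9)  # True: the two computations agree
```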
SVM Kernel Functions
• For which transformations, called kernels, does the same trick work?
• Polynomial: K(a,b) = (a·b + 1)^q
• Radial-Basis-style (RBF): K(a,b) = exp( −(a − b)² / (2σ²) )
• Neural-net-style sigmoidal: K(a,b) = tanh(κ a·b − δ)
– q, σ, κ, and δ are magic parameters that must be chosen by a model selection method

Overfitting
• With the right kernel function, computation in the high-dimensional transformed space is no problem
• But what about overfitting? There seem to be so many parameters...
• Usually not a problem, due to the maximum-margin approach
– Only the support vectors determine the model, hence SVM complexity depends on the number of support vectors, not dimensions (still, in higher dimensions there might be more support vectors)
– Minimizing w·w discourages extremely large weights, which smoothes the function (recall weight decay for neural networks!)
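The three kernels can be written directly from their formulas; this sketch uses our own function names and illustrative parameter defaults:

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(a, b, q=2):
    """Polynomial kernel: (a.b + 1)^q."""
    return (dot(a, b) + 1) ** q

def rbf_kernel(a, b, sigma=1.0):
    """Radial-basis kernel: exp(-|a - b|^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(a, b, kappa=1.0, delta=0.0):
    """Sigmoidal kernel: tanh(kappa * a.b - delta)."""
    return math.tanh(kappa * dot(a, b) - delta)

a, b = [1.0, 0.0], [1.0, 0.0]
print(poly_kernel(a, b))  # (1 + 1)^2 = 4.0
print(rbf_kernel(a, b))   # identical points -> 1.0
```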
Different Kernels
Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning

Multi-Class Classification
• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2)
• With output arity N, learn N SVMs
– SVM 1 learns "Output==1" vs. "Output != 1"
– SVM 2 learns "Output==2" vs. "Output != 2"
– …
– SVM N learns "Output==N" vs. "Output != N"
• Predict with each SVM and find out which one puts the prediction the furthest into the positive region
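The one-vs-rest scheme above can be sketched as follows; the three linear "SVMs" here are hypothetical hand-set models standing in for trained classifiers:

```python
# One-vs-rest multi-class prediction. Each binary model exposes a linear
# decision function f(x) = w.x + b (weights below are illustrative, not trained).
def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Three hypothetical binary models, one per class ("class k" vs. "rest").
models = {
    1: ([1.0, 0.0], -0.5),
    2: ([-1.0, 1.0], 0.0),
    3: ([0.0, -1.0], 0.2),
}

def predict(x):
    # Pick the class whose SVM pushes x furthest into the positive region.
    return max(models, key=lambda k: decision(*models[k], x))

print(predict([2.0, 0.0]))  # 1: class 1's decision value (1.5) is the largest
```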
Why Is SVM Effective on High-Dimensional Data?
• Complexity of the trained classifier is characterized by the number of support vectors, not the dimensionality of the data
• If all other training records are removed and training is repeated, the same separating hyperplane would be found
• The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM, which is independent of data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

SVM vs. Neural Network
• SVM
– Relatively new concept
– Deterministic algorithm
– Nice generalization properties
– Hard to train: learned in batch mode using quadratic programming techniques
– Using kernels can learn very complex functions
• Neural Network
– Relatively old
– Nondeterministic algorithm
– Generalizes well but doesn't have a strong mathematical foundation
– Can easily be learned in incremental fashion
– To learn complex functions, use multilayer perceptron (not that trivial)
Classification and Prediction Overview
• Introduction • Decision Trees • Statistical Decision Theory • Nearest Neighbor • Bayesian Classification • Artificial Neural Networks • Support Vector Machines (SVMs) • Prediction • Accuracy and Error Measures • Ensemble Methods

What Is Prediction?
• Essentially the same as classification, but the output is continuous, not discrete
– Construct a model, then use the model to predict a continuous output value for a given input
• Major method for prediction: regression
– Many variants of regression analysis in the statistics literature; not covered in this class
• Neural networks and k-NN can do regression "out-of-the-box"
• SVMs for regression exist
• What about trees?
Regression Trees and Model Trees
• Regression tree: proposed in the CART system (Breiman et al. 1984)
– CART: Classification And Regression Trees
– Each leaf stores a continuous-valued prediction
• Average output value for the training records in the leaf
• Model tree: proposed by Quinlan (1992)
– Each leaf holds a regression model—a multivariate linear equation
• Training: like for classification trees, but uses variance instead of a purity measure for selecting split predicates

Classification and Prediction Overview
• Introduction • Decision Trees • Statistical Decision Theory • Nearest Neighbor • Bayesian Classification • Artificial Neural Networks • Support Vector Machines (SVMs) • Prediction • Accuracy and Error Measures • Ensemble Methods
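The variance-based split selection mentioned above can be sketched for a single numeric attribute (helper names are our own):

```python
# Regression-tree split selection: pick the threshold that minimizes the
# weighted variance of the output values in the two resulting partitions.
def variance(ys):
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Try every threshold on one numeric attribute; return (threshold, score)
    with the lowest weighted variance of left/right outputs."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.1, 4.9, 20.0, 20.2, 19.8]
print(best_split(xs, ys)[0])  # 3: splitting here separates the two flat regions
```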
Classifier Accuracy Measures

True class \ Predicted class   buy_computer = yes   buy_computer = no   total
buy_computer = yes                   6954                  46           7000
buy_computer = no                     412                2588           3000
total                                7366                2634          10000

• Accuracy of a classifier M, acc(M): percentage of test records that are correctly classified by M
– Error rate (misclassification rate) of M = 1 − acc(M)
– Given m classes, CM[i,j], an entry in a confusion matrix, indicates # of records in class i that are labeled by the classifier as class j

True \ Predicted     C1                C2
C1                   True positive     False negative
C2                   False positive    True negative

Precision and Recall
• Precision: measure of exactness
– t-pos / (t-pos + f-pos)
• Recall: measure of completeness
– t-pos / (t-pos + f-neg)
• F-measure: combination of precision and recall
– 2 * precision * recall / (precision + recall)
• Note: Accuracy = (t-pos + t-neg) / (t-pos + t-neg + f-pos + f-neg)
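Plugging the slide's buy_computer confusion matrix into these definitions (treating "buy_computer = yes" as the positive class):

```python
# Counts from the buy_computer confusion matrix above.
tp, fn = 6954, 46    # true "yes" predicted yes / no
fp, tn = 412, 2588   # true "no" predicted yes / no

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.9542
print(round(precision, 4))  # 0.9441
print(round(recall, 4))     # 0.9934
print(round(f_measure, 4))  # 0.9681
```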
Limitation of Accuracy
• Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because the model does not detect any class 1 example
• Always predicting the majority class defines the baseline
– A good classifier should do better than the baseline

Cost-Sensitive Measures: Cost Matrix

                        PREDICTED CLASS
C(i|j)                  Class=Yes     Class=No
ACTUAL   Class=Yes      C(Yes|Yes)    C(No|Yes)
CLASS    Class=No       C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification

Cost matrix:         PREDICTED CLASS
C(i|j)               +      −
ACTUAL    +         −1    100
CLASS     −          1      0

Model M1:            PREDICTED CLASS
                     +      −
ACTUAL    +        150     40
CLASS     −         60    250
Accuracy = 80%, Cost = 3910

Model M2:            PREDICTED CLASS
                     +      −
ACTUAL    +        250     45
CLASS     −          5    200
Accuracy = 90%, Cost = 4255

Prediction Error Measures
• Continuous output: it matters how far off the prediction is from the true value
• Loss function: distance between y and predicted value y'
– Absolute error: |y − y'|
– Squared error: (y − y')²
• Test error (generalization error): average loss over the test set
• Mean absolute error: (1/n) Σ_{i=1..n} |y(i) − y'(i)|
• Mean squared error: (1/n) Σ_{i=1..n} (y(i) − y'(i))²
• Relative absolute error: Σ_{i=1..n} |y(i) − y'(i)| / Σ_{i=1..n} |y(i) − ȳ|
• Relative squared error: Σ_{i=1..n} (y(i) − y'(i))² / Σ_{i=1..n} (y(i) − ȳ)²
• Squared error exaggerates the presence of outliers
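The error measures can be written directly from their formulas (assuming nonempty, equal-length lists):

```python
# Prediction error measures from the definitions above.
def mae(y, yp):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yp)) / len(y)

def mse(y, yp):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, yp)) / len(y)

def relative_absolute_error(y, yp):
    """Absolute error relative to always predicting the mean of y."""
    ybar = sum(y) / len(y)
    return sum(abs(a - b) for a, b in zip(y, yp)) / sum(abs(a - ybar) for a in y)

y = [1.0, 2.0, 3.0, 4.0]
yp = [1.5, 2.0, 2.5, 5.0]
print(mae(y, yp))  # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
print(mse(y, yp))  # (0.25 + 0 + 0.25 + 1) / 4 = 0.375
```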
Evaluating a Classifier or Predictor
• Holdout method
– The given data set is randomly partitioned into two sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Can repeat holdout multiple times
• Accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition data into k mutually exclusive subsets, each approximately equal size
– In the i-th iteration, use D_i as test set and the others as training set
– Leave-one-out: k folds where k = # of records
• Expensive, often results in high variance of the performance metric

Learning Curve
• Accuracy versus sample size
• Effect of small sample size:
– Bias in estimate
– Variance of estimate
• Helps determine how much training data is needed
– Still need enough test and validation data to be representative of the distribution
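A minimal sketch of the k-fold partitioning step (helper name is our own):

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition record indices 0..n-1 into k mutually exclusive
    folds of approximately equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, 3)
for test in folds:  # i-th iteration: fold i is the test set
    train = [j for f in folds if f is not test for j in f]
    assert not set(train) & set(test)  # training and test sets are disjoint
print(sorted(j for f in folds for j in f))  # [0, 1, ..., 9]: each record in exactly one fold
```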
ROC (Receiver Operating Characteristic)
• Developed in the 1950s for signal detection theory to analyze noisy signals
– Characterizes the trade-off between positive hits and false alarms
• ROC curve plots T-Pos rate (y-axis) against F-Pos rate (x-axis)
• Performance of each classifier is represented as a point on the ROC curve
– Changing the threshold of the algorithm, sample distribution, or cost matrix changes the location of the point

ROC Curve
• 1-dimensional data set containing 2 classes (positive and negative)
– Any point located at x > t is classified as positive
• At threshold t: TPR = 0.5, FPR = 0.12
ROC Curve
(TPR, FPR):
• (0,0): declare everything to be negative class
• (1,1): declare everything to be positive class
• (1,0): ideal
• Diagonal line: random guessing

Diagonal Line for Random Guessing
• Classify a record as positive with fixed probability p, irrespective of attribute values
• Consider a test set with a positive and b negative records
• True positives: p·a, hence true positive rate = (p·a)/a = p
• False positives: p·b, hence false positive rate = (p·b)/b = p
• For every value 0 ≤ p ≤ 1, we get the point (p,p) on the ROC curve
Using ROC for Model Comparison
• Neither model consistently outperforms the other
– M1 better for small FPR
– M2 better for large FPR
• Area under the ROC curve
– Ideal: area = 1
– Random guess: area = 0.5

How to Construct an ROC Curve
• Use a classifier that produces posterior probability P(+|x) for each test record x
• Sort records according to P(+|x) in decreasing order
• Apply a threshold at each unique value of P(+|x)
– Count number of TP, FP, TN, FN at each threshold
– TP rate, TPR = TP/(TP+FN)
– FP rate, FPR = FP/(FP+TN)

record   P(+|x)   True Class
1        0.95     +
2        0.93     +
3        0.87     −
4        0.85     −
5        0.85     −
6        0.85     +
7        0.76     −
8        0.53     +
9        0.43     −
10       0.25     +
How To Construct An ROC Curve

Class          +      −      +      −      +      −      −      −      +      +
Threshold ≥   0.25   0.43   0.53   0.76   0.85   0.85   0.85   0.87   0.93   0.95   1.00
TP             5      4      4      3      3      2      2      1      0
FP             5      5      4      4      3      1      0      0      0
TN             0      0      1      1      2      4      5      5      5
FN             0      1      1      2      2      3      3      4      5
TPR            1      0.8    0.8    0.6    0.6    0.4    0.4    0.2    0
FPR            1      1      0.8    0.8    0.6    0.2    0      0      0

[ROC curve plot of true positive rate vs. false positive rate omitted.]

Test of Significance
• Given two models:
– Model M1: accuracy = 85%, tested on 30 instances
– Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
– How much confidence can we place on the accuracy of M1 and M2?
– Can the difference in accuracy be explained as a result of random fluctuations in the test set?
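The threshold sweep can be reproduced from the ten records on the previous slide:

```python
# Build ROC points from the slide's ten test records: sort by P(+|x)
# descending and apply a threshold at each unique probability value.
records = [(0.95, '+'), (0.93, '+'), (0.87, '-'), (0.85, '-'), (0.85, '-'),
           (0.85, '+'), (0.76, '-'), (0.53, '+'), (0.43, '-'), (0.25, '+')]

P = sum(1 for _, c in records if c == '+')  # 5 positives
N = len(records) - P                        # 5 negatives

points = []
for t in sorted({p for p, _ in records}, reverse=True):
    tp = sum(1 for p, c in records if p >= t and c == '+')
    fp = sum(1 for p, c in records if p >= t and c == '-')
    points.append((fp / N, tp / P))         # (FPR, TPR)

print(points[0])   # threshold 0.95 -> (0.0, 0.2)
print(points[-1])  # threshold 0.25 -> everything positive: (1.0, 1.0)
```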
Confidence Interval for Accuracy
• Classification can be regarded as a Bernoulli trial
– A Bernoulli trial has 2 possible outcomes, "correct" or "wrong" for classification
– A collection of Bernoulli trials has a Binomial distribution
• Probability of getting c correct predictions if model accuracy is p (= probability of getting a single prediction right):
  C(n, c) · p^c · (1 − p)^(n−c)
• Given c, or equivalently ACC = c/n, and n (# test records), can we predict p, the true accuracy of the model?

Confidence Interval for Accuracy
• Binomial distribution for X = "number of correctly classified test records out of n"
– E(X) = pn, Var(X) = p(1−p)n
• Accuracy = X / n
– E(ACC) = p, Var(ACC) = p(1−p)/n
• For large test sets (n > 30), the Binomial distribution is closely approximated by a normal distribution with the same mean and variance
– ACC has a normal distribution with mean = p, variance = p(1−p)/n:
  P( Z_{α/2} ≤ (ACC − p) / √(p(1−p)/n) ≤ Z_{1−α/2} ) = 1 − α
• Confidence interval for p:
  p = ( 2n·ACC + Z²_{α/2} ± Z_{α/2} √( Z²_{α/2} + 4n·ACC − 4n·ACC² ) ) / ( 2(n + Z²_{α/2}) )
Confidence Interval for Accuracy
• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances
– n = 100, ACC = 0.8
– Let 1 − α = 0.95 (95% confidence)
– From the probability table, Z_{α/2} = 1.96

1−α:   0.99   0.98   0.95   0.90
Z:     2.58   2.33   1.96   1.65

n          50      100     500     1000    5000
p(lower)   0.670   0.711   0.763   0.774   0.789
p(upper)   0.888   0.866   0.833   0.824   0.811

Comparing Performance of Two Models
• Given two models M1 and M2, which is better?
– M1 is tested on D1 (size = n1), found error rate = e1
– M2 is tested on D2 (size = n2), found error rate = e2
– Assume D1 and D2 are independent
– If n1 and n2 are sufficiently large, then err1 ~ N(μ1, σ1) and err2 ~ N(μ2, σ2)
– Estimate: μ̂i = ei and σ̂i² = ei(1 − ei)/ni
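The confidence-interval formula above reproduces the table's bounds (a sketch; the function name is our own):

```python
import math

def accuracy_ci(acc, n, z):
    """Confidence interval for the true accuracy p, given observed ACC on
    n test records (normal approximation to the binomial)."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# Reproduce the slide's table: ACC = 0.8, 95% confidence (z = 1.96).
for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_ci(0.8, n, 1.96)
    print(n, round(lo, 3), round(hi, 3))
```

As in the table, the interval tightens around 0.8 as the test set grows.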
Testing Significance of Accuracy Difference
• Consider the random variable d = err1 − err2
– Since err1, err2 are normally distributed, so is their difference
– Hence d ~ N(d_t, σ_t), where d_t is the true difference
• Estimator for d_t:
– E[d] = E[err1 − err2] = E[err1] − E[err2] ≈ e1 − e2
– Since D1 and D2 are independent, the variances add up:
  σ̂_t² = σ̂1² + σ̂2² = e1(1 − e1)/n1 + e2(1 − e2)/n2
– At (1−α) confidence level, d_t = E[d] ± Z_{α/2} σ̂_t

An Illustrative Example
• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• E[d] = |e1 − e2| = 0.1
• 2-sided test: d_t = 0 versus d_t ≠ 0
  σ̂_t² = 0.15(1 − 0.15)/30 + 0.25(1 − 0.25)/5000 ≈ 0.0043
• At 95% confidence level, Z_{α/2} = 1.96
  d_t = 0.100 ± 1.96 √0.0043 = 0.100 ± 0.128
• The interval contains zero, hence the difference may not be statistically significant
• But: may reject the null hypothesis (d_t ≠ 0) at a lower confidence level
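The example's computation, as a small function (the name is our own):

```python
import math

def accuracy_difference_test(e1, n1, e2, n2, z=1.96):
    """Approximate confidence interval for the true difference in error
    rates of two models tested on independent test sets."""
    d = abs(e1 - e2)
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # variances add up
    margin = z * math.sqrt(var)
    return d - margin, d + margin

lo, hi = accuracy_difference_test(0.15, 30, 0.25, 5000)
print(round(lo, 3), round(hi, 3))  # -0.028 0.228
print(lo <= 0 <= hi)               # True: the difference may not be significant
```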
Significance Test for K-Fold Cross-Validation
• Each learning algorithm produces k models:
– L1 produces M11, M12, …, M1k
– L2 produces M21, M22, …, M2k
• Both models are tested on the same test sets D1, D2, …, Dk
– For each test set, compute dj = e1,j − e2,j
– For large enough k, dj is normally distributed with mean d_t and variance σ_t
– Estimate:  σ̂_t² = Σ_{j=1..k} (dj − d̄)² / (k(k − 1))
– t-distribution: get the t coefficient t_{1−α,k−1} from a table by looking up confidence level (1−α) and degrees of freedom (k−1)
– d_t = d̄ ± t_{1−α,k−1} σ̂_t

Classification and Prediction Overview
• Introduction • Decision Trees • Statistical Decision Theory • Nearest Neighbor • Bayesian Classification • Artificial Neural Networks • Support Vector Machines (SVMs) • Prediction • Accuracy and Error Measures • Ensemble Methods
Ensemble Methods
• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating predictions made by multiple classifiers

General Idea
• Step 1: Create multiple data sets D1, D2, …, Dt from the original training data D
• Step 2: Build multiple classifiers C1, C2, …, Ct
• Step 3: Combine the classifiers into C*
Why Does It Work?
• Consider a 2-class problem
• Suppose there are 25 base classifiers
– Each classifier has error rate ε = 0.35
– Assume the classifiers are independent
• Return the majority vote of the 25 classifiers
– The ensemble errs only if at least 13 base classifiers are wrong; the probability that the ensemble classifier makes a wrong prediction:
  Σ_{i=13..25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
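The ensemble error above is the upper tail of a binomial distribution:

```python
from math import comb

# Probability that a majority vote of 25 independent base classifiers,
# each with error rate 0.35, is wrong: at least 13 of 25 must err.
eps, n = 0.35, 25
ensemble_error = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
                     for i in range(13, n + 1))
print(round(ensemble_error, 2))  # 0.06, far below the base error rate of 0.35
```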
Model Averaging and Bias-Variance Tradeoff
• Single model: lowering bias will usually increase variance
– A "smoother" model has lower variance but might not model the function well enough
• Ensembles can overcome this problem
1. Let models overfit: low bias, high variance
2. Take care of the variance problem by averaging many of these models
• This is the basic idea behind bagging

Bagging: Bootstrap Aggregation
• Given a training set with n records, sample n records randomly with replacement

Original Data       1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)   7  8  10 8  2  5  10 10 5  9
Bagging (Round 2)   1  4  9  1  2  3  2  7  3  2
Bagging (Round 3)   1  8  5  10 5  5  9  6  3  7

• Train a classifier for each bootstrap sample
• Note: each training record has probability 1 − (1 − 1/n)^n of being selected at least once in a sample of size n
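The selection probability noted above approaches 1 − 1/e ≈ 0.632 as n grows:

```python
# Chance that a given record appears at least once in a bootstrap sample
# of size n drawn with replacement: 1 - (1 - 1/n)^n.
for n in (10, 100, 1000):
    print(n, round(1 - (1 - 1 / n) ** n, 3))
# 10 -> 0.651, 1000 -> 0.632: converging toward 1 - 1/e
```

So each bootstrap sample contains roughly 63% of the distinct training records, which is what makes the samples "almost" independent of one another.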
Bagged Trees
• Create k trees from training data
– Bootstrap sample, grow large trees
• Design goal: independent models, high variability between models
• Ensemble prediction = average of individual tree predictions (or majority vote):
  (1/k)·tree1 + (1/k)·tree2 + … + (1/k)·treek
• Works the same way for other classifiers

Typical Result
[Plots comparing bagged-tree performance to a single model omitted.]
Bagging Challenges
• Ideal case: all models independent of each other
• Train on independent data samples
– Problem: limited amount of training data
• Training set needs to be representative of the data distribution
– Bootstrap sampling allows creation of many "almost" independent training sets
• Diversify models, because a similar sample might result in a similar tree
– Random Forest: limit the choice of split attributes to a small random subset of attributes (new selection of the subset for each node) when training a tree
– Use different model types in the same ensemble: tree, ANN, SVM, regression models

Additive Grove
• Ensemble technique for predicting continuous output
• Instead of individual trees, train additive models
– Prediction of a single Grove model = sum of its tree predictions
• Prediction of the ensemble = average of the individual Grove predictions:
  (1/k)·(tree11 + … + tree1m) + … + (1/k)·(treek1 + … + treekm)
• Combines large trees and additive models
– Challenge: how to train the additive models without having the first trees fit the training data too well
• The next tree is trained on the residuals of the previously trained trees in the same Grove model
• If the previously trained trees capture the training data too well, the next tree is mostly trained on noise
Training Groves
[Diagram omitted.]

Typical Grove Performance
• Root mean squared error
– Lower is better
• Horizontal axis: tree size
– Fraction of training data at which to stop splitting (0.5 down to 0.002, then 0)
• Vertical axis: number of trees in each single Grove model (1 to 10)
• 100 bagging iterations
[Performance plot omitted.]
Boosting
• Iterative procedure to adaptively change the distribution of training data by focusing more on previously misclassified records
– Initially, all n records are assigned equal weights
– Record weights may change at the end of each boosting round

Boosting
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

Original Data        1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)   7  3  2  8  7  9  4  10 6  3
Boosting (Round 2)   5  4  9  4  2  5  1  7  4  2
Boosting (Round 3)   4  4  8  10 4  5  4  6  3  4

• Assume record 4 is hard to classify
• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
Example: AdaBoost
• Base classifiers: C1, C2, …, CT
• Error rate (n training records, wj are weights that sum to 1):
  εi = Σ_{j=1..n} wj δ( Ci(xj) ≠ yj )
• Importance of a classifier:
  αi = ½ ln( (1 − εi) / εi )

AdaBoost Details
• Weight update:
  wj^(i+1) = (wj^(i) / Zi) · exp(−αi)  if Ci(xj) = yj
  wj^(i+1) = (wj^(i) / Zi) · exp(αi)   if Ci(xj) ≠ yj
  where Zi is the normalization factor
• Weights initialized to 1/n
• Zi ensures that weights add to 1
• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
• Final classification:
  C*(x) = argmax_y Σ_{i=1..T} αi δ( Ci(x) = y )
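A compact AdaBoost sketch using one-dimensional threshold stumps as base classifiers — illustrative only; the stump learner and the early-exit handling are our own simplifications of the procedure above:

```python
import math

def train_stump(xs, ys, w):
    """Return (weighted error, threshold, polarity) of the best stump
    h(x) = polarity if x <= threshold else -polarity. Labels are +1/-1."""
    best = None
    for t in xs:
        for pol in (1, -1):
            err = sum(wi for wi, x, y in zip(w, xs, ys)
                      if (pol if x <= t else -pol) != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n                            # weights initialized to 1/n
    ensemble = []
    for _ in range(rounds):
        err, t, pol = train_stump(xs, ys, w)
        if err == 0:                             # perfect stump: keep it and stop
            ensemble.append((1.0, t, pol))
            break
        if err >= 0.5:                           # simplification: stop instead of resampling
            break
        alpha = 0.5 * math.log((1 - err) / err)  # importance of this classifier
        ensemble.append((alpha, t, pol))
        w = [wi * math.exp(-alpha * y * (pol if x <= t else -pol))
             for wi, x, y in zip(w, xs, ys)]     # raise weights of mistakes
        z = sum(w)                               # normalization factor Z_i
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (pol if x <= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

xs, ys = [1, 2, 3, 6, 7, 8], [1, 1, 1, -1, -1, -1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])  # [1, 1, 1, -1, -1, -1]
```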
Illustrating AdaBoost
[Diagram omitted: data points start with equal weights of 1/n; three boosting rounds B1–B3 show updated weights (e.g., 0.0094 and 0.4623 in round 1; 0.3037, 0.0009, 0.0422 in round 2; 0.0276, 0.1819, 0.0038 in round 3) with classifier importances α = 1.9459, 2.9323, and 3.8744; the overall classifier combines all three rounds.]
Note: the numbers appear to be wrong, but they convey the right idea.
Bagging vs. Boosting
• Analogy
– Bagging: diagnosis based on multiple doctors' majority vote
– Boosting: weighted vote, based on the doctors' previous diagnosis accuracy
• Sampling procedure
– Bagging: records have the same weight; easy to train in parallel
– Boosting: weights a record higher if the model predicts it wrong; inherently sequential process
• Overfitting
– Bagging is robust against overfitting
– Boosting is susceptible to overfitting: make sure individual models do not overfit
• Accuracy usually significantly better than a single classifier
– Best boosted model often better than best bagged model
• Additive Grove
– Combines strengths of bagging and boosting (additive models)
– Shown empirically to make better predictions on many data sets
– Training more tricky, especially when data is very noisy

Classification/Prediction Summary
• Forms of data analysis that can be used to train models from data and then make predictions for new records
• Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian networks, rule-based classifiers, Backpropagation, Support Vector Machines (SVM), nearest neighbor classifiers, and many other classification methods
• Regression models are popular for prediction. Regression trees, model trees, and ANNs are also used for prediction.
Classification/Prediction Summary
• K-fold cross-validation is a popular method for accuracy estimation, but determining accuracy on a large test set is equally accepted
– If test sets are large enough, a significance test for finding the best model is not necessary
• Area under the ROC curve and many other common performance measures exist
• Ensemble methods like bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models
– Often state-of-the-art in prediction quality, but expensive to train, store, use
• No single method is superior over all others for all data sets
– Issues such as accuracy, training and prediction time, robustness, interpretability, and scalability must be considered and can involve trade-offs