Classification and Prediction Techniques

Mirek Riedewald
Some slides based on presentations by Han/Kamber/Pei, Tan/Steinbach/Kumar, and Andrew Moore

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods

Classification vs. Prediction

• Assumption: after data preparation, we have a data set where each record has attributes X1,…,Xn, and Y.
• Goal: learn a function f:(X1,…,Xn)→Y, then use this function to predict y for a given input record (x1,…,xn).
  – Classification: Y is a discrete attribute, called the class label
    • Usually a categorical attribute with small domain
  – Prediction: Y is a continuous attribute
• Called supervised learning, because true labels (Y-values) are known for the initially provided data
• Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Induction: Model Construction

• Training data (NAME, RANK, YEARS, TENURED):
  Mike, Assistant Prof, 3, no
  Mary, Assistant Prof, 7, yes
  Bill, Professor, 2, yes
  Jim, Associate Prof, 7, yes
  Dave, Assistant Prof, 6, no
  Anne, Associate Prof, 3, no
• Learned model (function): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Deduction: Using the Model

• The learned model (function) is applied to test data and then to unseen data.
• Test data (NAME, RANK, YEARS, TENURED):
  Tom, Assistant Prof, 2, no
  Merlisa, Associate Prof, 7, no
  George, Professor, 5, yes
  Joseph, Assistant Prof, 7, yes
• Unseen record: (Jeff, Professor, 4) → Tenured?

Classification and Prediction Overview (agenda repeated; next topic: Decision Trees)

Example of a Decision Tree

• Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
  1, Yes, Single, 125K, No
  2, No, Married, 100K, No
  3, No, Single, 70K, No
  4, Yes, Married, 120K, No
  5, No, Divorced, 95K, Yes
  6, No, Married, 60K, No
  7, Yes, Divorced, 220K, No
  8, No, Single, 85K, Yes
  9, No, Married, 75K, No
  10, No, Single, 90K, Yes
• Model: decision tree with splitting attributes Refund, MarSt, TaxInc:
  – Refund = Yes → NO; Refund = No → test MarSt
  – MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc
  – TaxInc < 80K → NO; TaxInc > 80K → YES

Another Example of Decision Tree

• A different tree for the same training data:
  – MarSt = Married → NO; MarSt = Single or Divorced → test Refund
  – Refund = Yes → NO; Refund = No → test TaxInc
  – TaxInc < 80K → NO; TaxInc > 80K → YES
• There could be more than one tree that fits the same data!

Apply Model to Test Data

• Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree and follow the branch that matches the record's attribute value at each node:
  – Refund = No → go to the MarSt node
  – MarSt = Married → reach the leaf labeled NO
• Assign Cheat to "No"
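As a quick illustration (not from the slides), the tree above can be written as nested conditionals; the sketch below applies it to the test record (Refund = No, Married, 80K):

```python
def classify(record):
    """Decision tree from the slides: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: check taxable income (in thousands)
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))   # -> "No", matching "Assign Cheat to 'No'"
```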

Decision Tree Induction

• Basic greedy algorithm
  – Top-down, recursive divide-and-conquer
  – At start, all the training records are at the root
  – Training records partitioned recursively based on split attributes
  – Split attributes selected based on a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – Pure node (all records belong to same class)
  – No remaining attributes for further partitioning: majority voting for classifying the leaf
  – No cases left

Decision Boundary

• (Figure: records in the unit square, partitioned by the tree X1 < 0.43? → (X2 < 0.47?, X2 < 0.33?); each leaf region contains records of a single class.)
• Decision boundary = border between two neighboring regions of different classes.
• For trees that split on a single attribute at a time, the decision boundary is parallel to the axes.

Oblique Decision Trees

• Test condition may involve multiple attributes, e.g., x + y < 1 separating class + from class –
• More expressive representation
• Finding optimal test condition is computationally expensive

How to Specify Split Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Numeric (continuous)
• Depends on number of ways to split
  – 2-way split
  – Multi-way split

Splitting Nominal Attributes

• Multi-way split: use as many partitions as distinct values, e.g., CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g., CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}

Splitting Ordinal Attributes

• Multi-way split, e.g., Size → Small | Medium | Large
• Binary split, e.g., Size → {Small, Medium} vs. {Large}, OR {Medium, Large} vs. {Small}
• What about this split: Size → {Small, Large} vs. {Medium}? (It ignores the ordering of the ordinal attribute.)

Splitting Continuous Attributes

• Different options
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits, choose the best one
• Example for Taxable Income:
  (i) Binary split: Taxable Income > 80K? Yes / No
  (ii) Multi-way split: < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

How to Determine Best Split

• Before splitting: 10 records of class 0, 10 records of class 1
• Candidate split attributes: Own Car? (Yes/No), Car Type? (Family/Sports/Luxury), Student ID? (c1,…,c20)
• Which test condition is the best? (Figure: class distributions C0/C1 in the children of each candidate split.)
• Greedy approach: nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – Non-homogeneous class distribution (classes roughly balanced): high degree of impurity
  – Homogeneous class distribution (one class dominates): low degree of impurity

Attribute Selection Measure: Information Gain

• Select attribute with highest information gain
• pi = probability that an arbitrary record in D belongs to class Ci, i = 1,…,m
• Expected information (entropy) needed to classify a record in D:
  Info(D) = −Σ_{i=1..m} pi log2(pi)
• Information needed after using attribute A to split D into v partitions D1,…,Dv:
  InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)
• Information gained by splitting on attribute A:
  GainA(D) = Info(D) − InfoA(D)

Example

• Predict if somebody will buy a computer
• Given data set (Age, Income, Student, Credit_rating, Buys_computer):
  ≤30, High, No, Bad, No
  ≤30, High, No, Good, No
  31…40, High, No, Bad, Yes
  >40, Medium, No, Bad, Yes
  >40, Low, Yes, Bad, Yes
  >40, Low, Yes, Good, No
  31…40, Low, Yes, Good, Yes
  ≤30, Medium, No, Bad, No
  ≤30, Low, Yes, Bad, Yes
  >40, Medium, Yes, Bad, Yes
  ≤30, Medium, Yes, Good, Yes
  31…40, Medium, No, Good, Yes
  31…40, High, Yes, Bad, Yes
  >40, Medium, No, Good, No

Information Gain Example

• Class P: buys_computer = "yes"; class N: buys_computer = "no"
• Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• Per age group (#yes, #no, I(#yes, #no)): ≤30: (2, 3, 0.971); 31…40: (4, 0, 0); >40: (3, 2, 0.971)
• Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  – (5/14) I(2,3) means "age ≤ 30" has 5 out of 14 samples, with 2 yes'es and 3 no's; similar for the other terms
• Hence Gainage(D) = Info(D) − Infoage(D) = 0.246
• Similarly, Gainincome(D) = 0.029, Gainstudent(D) = 0.151, Gaincredit_rating(D) = 0.048
• Therefore we choose age as the splitting attribute

Gain Ratio for Attribute Selection

• Information gain is biased towards attributes with a large number of values
• Use gain ratio to normalize information gain:
  – GainRatioA(D) = GainA(D) / SplitInfoA(D)
  – SplitInfoA(D) = −Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|)
• E.g., SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
  – GainRatioincome(D) = 0.029 / 1.557 = 0.019
• Attribute with maximum gain ratio is selected as splitting attribute

Gini Index

• Gini index of data set D: gini(D) = 1 − Σ_{i=1..m} pi²
• If data set D is split on A into v subsets D1,…,Dv, the gini index giniA(D) is defined as
  giniA(D) = Σ_{j=1..v} (|Dj| / |D|) · gini(Dj)
• Reduction in impurity: ΔginiA(D) = gini(D) − giniA(D)
• Attribute that provides smallest giniA(D) (= largest reduction in impurity) is chosen to split the node

Comparing Attribute Selection Measures

• No clear winner (and there are many more)
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits where one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes; tends to favor tests that result in equal-sized partitions and purity in both partitions
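A minimal Python sketch of these impurity measures, applied to the buys_computer data from the previous slides. The function and variable names are my own, not from the slides; the printed gains should reproduce the 0.246 / 0.029 / 0.151 / 0.048 values of the worked example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(records, attr, target="Buys_computer"):
    labels = [r[target] for r in records]
    before = entropy(labels)
    after = 0.0
    for value in set(r[attr] for r in records):
        part = [r[target] for r in records if r[attr] == value]
        after += len(part) / len(records) * entropy(part)
    return before - after

# buys_computer data (Age, Income, Student, Credit_rating, Buys_computer)
rows = [("<=30","High","No","Bad","No"), ("<=30","High","No","Good","No"),
        ("31..40","High","No","Bad","Yes"), (">40","Medium","No","Bad","Yes"),
        (">40","Low","Yes","Bad","Yes"), (">40","Low","Yes","Good","No"),
        ("31..40","Low","Yes","Good","Yes"), ("<=30","Medium","No","Bad","No"),
        ("<=30","Low","Yes","Bad","Yes"), (">40","Medium","Yes","Bad","Yes"),
        ("<=30","Medium","Yes","Good","Yes"), ("31..40","Medium","No","Good","Yes"),
        ("31..40","High","Yes","Bad","Yes"), (">40","Medium","No","Good","No")]
names = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
records = [dict(zip(names, r)) for r in rows]

for attr in names[:-1]:
    print(attr, round(info_gain(records, attr), 3))
# Age has the highest gain (about 0.246), so it is chosen as the split attribute.
```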

Practical Issues of Classification

• Underfitting and overfitting
• Missing values
• Computational cost
• Expressiveness

How Good is the Model?

• Training set error: compare prediction of training record with true value
  – Not a good measure for the error on unseen data (discussed soon)
• Test set error: for records that were not used for training, compare model prediction and true value
  – Use holdout data from available data set

Training versus Test Set Error

• We'll create a training dataset: five inputs a, b, c, d, e, all bits, generated in all 32 possible combinations
• Output y = copy of e, except a random 25% of the records have y set to the opposite of e
• (Table: the 32 records (a, b, c, d, e, y), from (0,0,0,0,0) to (1,1,1,1,1).)

Test Data

• Generate test data using the same method: y = copy of e, with a random 25% inverted; done independently from the previous noise process
• Some y's that were corrupted in the training set will be uncorrupted in the test set
• Some y's that were uncorrupted in the training set will be corrupted in the test set
• (Table: the 32 records with both the training y and the independently corrupted test y.)

Full Tree for the Training Data

• The full tree splits on e at the root, then on a, and so on, until each leaf contains exactly one record; hence no error in predicting the training data!
• 25% of these leaf node labels will be corrupted

Testing the Tree with the Test Set

• 1/4 of the tree's leaf labels are corrupted, 3/4 are fine; independently, 1/4 of the test set records are corrupted, 3/4 are fine:
  – Corrupted test record, corrupted tree node: 1/16 of the test set will be correctly predicted for the wrong reasons
  – Corrupted test record, fine tree node: 3/16 of the test set will be wrongly predicted because the test record is corrupted
  – Fine test record, corrupted tree node: 3/16 of the test predictions will be wrong because the tree node is corrupted
  – Fine test record, fine tree node: 9/16 of the test predictions will be fine
• In total, we expect to be wrong on 3/8 of the test set predictions

What Has This Example Shown Us?

• There is a discrepancy between training and test set error
• But more importantly, it indicates that there is something we should do about it if we want to predict well on future data.

Suppose We Had Less Data

• Same setup: output y = copy of e, except a random 25% of the records have y set to the opposite of e
• But now the other (irrelevant) input bits are hidden from the learner; only e remains usible in the 32 training records

Tree Learned Without Access to the Irrelevant Bits

• The learned tree has only the root split on e (e=0 / e=1); these nodes will be unexpandable
• In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0
• In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1

Tree Learned Without Access to the Irrelevant Bits (cont.)

• With only the e-split, almost certainly none of the tree nodes are corrupted and all are fine:
  – 1/4 of the test set records are corrupted: these will be wrongly predicted because the test record is corrupted
  – 3/4 are fine: these test predictions will be fine
• In total, we expect to be wrong on only 1/4 of the test set predictions

Typical Observation: Overfitting

• Model M overfits the training data if another model M' exists, such that M has smaller error than M' over the training examples, but M' has smaller error than M over the entire distribution of instances.
• Underfitting: when the model is too simple, both training and test errors are large

Reasons for Overfitting

• Noise
  – Too closely fitting the training data means the model's predictions reflect the noise as well
• Insufficient training data
  – Not enough data to enable the model to generalize beyond idiosyncrasies of the training records
• Data fragmentation (special problem for trees)
  – Number of instances gets smaller as you traverse down the tree
  – Number of instances at a leaf node could be too small to make any confident decision about class

Avoiding Overfitting

• General idea: make the tree smaller
  – Addresses all three reasons for overfitting
• Prepruning: halt tree construction early
  – Do not split a node if this would result in the goodness measure falling below a threshold
  – Difficult to choose an appropriate threshold, e.g., tree for XOR
• Postpruning: remove branches from a "fully grown" tree
  – Use a set of data different from the training data to decide when to stop pruning
  – Validation data: train tree on training data, prune on validation data, then test on test data

Minimum Description Length (MDL)

• Alternative to using validation data
  – Motivation: data mining is about finding regular patterns in data; regularity can be used to compress the data; the method that achieves the greatest compression found the most regularity and hence is best
• Minimize Cost(Model, Data) = Cost(Model) + Cost(Data|Model)
  – Cost is the number of bits needed for encoding
• Cost(Data|Model) encodes the misclassification errors
• Cost(Model) uses node encoding plus splitting condition encoding
• (Figure: a sender with labeled records (X, y) and a receiver with the X-values only; instead of sending all labels, the sender can transmit a small decision tree plus the records it misclassifies.)

MDL-Based Pruning Intuition

• (Figure: cost as a function of tree size. Cost(Model) = model size grows with the tree, while Cost(Data|Model) = model errors shrinks; the total Cost(Model, Data) is minimized at an intermediate tree size, which gives the best tree size.)

Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:
  – How impurity measures are computed
  – How to distribute an instance with a missing value to child nodes
  – How a test instance with a missing value is classified

Distribute Instances

• Training data: the tax-cheat records from before (Tid 1–9 with known Refund), plus record 10 with Refund = ?, Single, 90K, Class = Yes
• Based on the records with known Refund: probability that Refund = Yes is 3/9, probability that Refund = No is 6/9
• Assign record 10 to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9
• Resulting class counts: Refund = Yes: Class=Yes 0 + 3/9, Class=No 3; Refund = No: Class=Yes 2 + 6/9, Class=No 4

Computing Impurity Measure

• Split on Refund; assume records with missing values are distributed as discussed before: 3/9 of record 10 go to Refund = Yes, 6/9 go to Refund = No
• Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.881
• Entropy(Refund=Yes) = −((1/3)/(10/3)) log((1/3)/(10/3)) − (3/(10/3)) log(3/(10/3)) = 0.469
• Entropy(Refund=No) = −((8/3)/(20/3)) log((8/3)/(20/3)) − (4/(20/3)) log(4/(20/3)) = 0.971
• Entropy(Children) = 1/3 · 0.469 + 2/3 · 0.971 = 0.804
• Gain = 0.881 − 0.804 = 0.077

Classify Instances

• New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
• Weighted class counts for Marital Status (including the fractional record 10):
  Married: Class=No 3, Class=Yes 6/9, total 3.67
  Single: Class=No 1, Class=Yes 1, total 2
  Divorced: Class=No 0, Class=Yes 1, total 1
  Total: Class=No 4, Class=Yes 2.67, total 6.67
• Probability that Marital Status = Married is 3.67/6.67
• Probability that Marital Status ∈ {Single, Divorced} is 3/6.67
• The test record is sent down both branches of the MarSt node with these weights, and the leaf predictions are combined accordingly

Tree Cost Analysis

• Finding an optimal decision tree is NP-complete
  – Optimization goal: minimize expected number of binary tests to uniquely identify any record from a given finite set
• Greedy algorithm: O(#attributes · #training_instances · log(#training_instances))
  – At each tree depth, all instances considered
  – Assume tree depth is logarithmic (fairly balanced splits)
  – Need to test each attribute at each node
• What about binary splits?
  – Sort data once on each attribute, use this order to avoid re-sorting subsets
  – Incrementally maintain counts for class distribution as different split points are explored
• In practice, trees are considered to be fast both for training (when using the greedy algorithm) and for making predictions

Tree Expressiveness

• Can represent any finite discrete-valued function
  – But it might not do it very efficiently
  – Example: parity function
    • Class = 1 if there is an even number of Boolean attributes with truth value = True
    • Class = 0 if there is an odd number of Boolean attributes with truth value = True
    • For accurate modeling, must have a complete tree
• Not expressive enough for modeling continuous attributes
  – But we can still use a tree for them in practice; it just cannot accurately represent the true function

Rule Extraction from a Decision Tree

• One rule is created for each path from the root to a leaf
  – Precondition: conjunction of all split predicates of nodes on path
  – Consequent: class prediction from leaf
• Rules are mutually exclusive and exhaustive
• Example: rule extraction from the buys_computer decision tree (age? → student? / yes / credit rating?)
  – IF age = young AND student = no THEN buys_computer = no
  – IF age = young AND student = yes THEN buys_computer = yes
  – IF age = mid-age THEN buys_computer = yes
  – IF age = old AND credit_rating = excellent THEN buys_computer = yes
  – IF age = old AND credit_rating = fair THEN buys_computer = no

Classification in Large Databases

• Scalability: classify data sets with millions of examples and hundreds of attributes with reasonable speed
• Why use decision trees for data mining?
  – Relatively fast learning speed
  – Can handle all attribute types
  – Convertible to intelligible classification rules
  – Good classification accuracy, but not as good as newer methods (but tree ensembles are top!)

Scalable Tree Induction

• High cost when the training data at a node does not fit in memory
• Solution 1: special I/O-aware algorithm
  – Keep only the class list in memory, access attribute values on disk
  – Maintain a separate list for each attribute
  – Use a count matrix for each attribute
• Solution 2: sampling
  – Common solution: train tree on a sample that fits in memory
  – More sophisticated versions of this idea exist, e.g., Rainforest
    • Build the tree on a sample, but do this for many bootstrap samples
    • Combine all into a single new tree that is guaranteed to be almost identical to the one trained from the entire data set
    • Can be computed with two data scans

Tree Conclusions

• Very popular data mining tool
  – Easy to understand
  – Easy to implement
  – Easy to use: little tuning, handles all attribute types and missing values
  – Computationally relatively cheap
• Overfitting problem
• Focused on classification, but easy to extend to prediction (future lecture)

Classification and Prediction Overview (agenda repeated; next topic: Statistical Decision Theory)

Theoretical Results

• Trees make sense intuitively, but can we get some hard evidence and deeper understanding about their properties?
• Statistical decision theory can give some answers
• Need some probability concepts first

Random Variables

• Intuitive version of the definition:
  – Can take on one of possibly many values, each with a certain probability
  – These probabilities define the probability distribution of the random variable
  – E.g., let X be the outcome of a coin toss, then Pr(X='heads') = 0.5 and Pr(X='tails') = 0.5; the distribution is uniform
• Consider a discrete random variable X with numeric values x1,…,xk
  – Expectation: E[X] = Σ xi · Pr(X=xi)
  – Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

Working with Random Variables

• E[X + Y] = E[X] + E[Y]
• Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X,Y)
• For constants a, b
  – E[aX + b] = a E[X] + b
  – Var(aX + b) = Var(aX) = a² Var(X)
• Iterated expectation:
  – E[Y] = EX[ EY[Y | X] ], where EY[Y | X] = Σ yi · Pr(Y=yi | X=x) is the expectation of Y for a given value x of X, i.e., a function of X
  – In general, for any function f(X,Y): EX,Y[ f(X,Y) ] = EX[ EY[ f(X,Y) | X ] ]
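A tiny numeric check of the expectation and variance formulas, using a fair die as a made-up example (not from the slides):

```python
# Fair die: values 1..6, each with probability 1/6
values = [1, 2, 3, 4, 5, 6]
probs  = [1 / 6] * 6

E_X  = sum(x * p for x, p in zip(values, probs))        # E[X] = sum x_i * Pr(X = x_i)
E_X2 = sum(x * x * p for x, p in zip(values, probs))
Var_X = E_X2 - E_X ** 2                                 # Var(X) = E[X^2] - (E[X])^2

print(E_X, Var_X)   # 3.5 and roughly 2.917
```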

What is the Optimal Model f(X)?

• Let X denote a real-valued random input variable and Y a real-valued random output variable. The squared error of a trained model f(X) is EX,Y[ (Y − f(X))² ].
• Which function f(X) will minimize the squared error?
• Consider the error for a specific value of X and let Ȳ = EY[Y | X]:
  EY[ (Y − f(X))² | X ]
    = EY[ (Y − Ȳ + Ȳ − f(X))² | X ]
    = EY[ (Y − Ȳ)² | X ] + EY[ (Ȳ − f(X))² | X ] + 2 EY[ (Y − Ȳ)(Ȳ − f(X)) | X ]
    = EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))² + 2 (Ȳ − f(X)) EY[ (Y − Ȳ) | X ]
    = EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))²
  (Notice: EY[ (Y − Ȳ) | X ] = EY[Y | X] − EY[Ȳ | X] = Ȳ − Ȳ = 0.)

Optimal Model f(X) (cont.)

• The choice of f(X) does not affect EY[ (Y − Ȳ)² | X ], but (Ȳ − f(X))² is minimized for f(X) = Ȳ = EY[Y | X].
• Note that EX,Y[ (Y − f(X))² ] = EX[ EY[ (Y − f(X))² | X ] ]. Hence
  EX,Y[ (Y − f(X))² ] = EX[ EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))² ]
• Hence the squared error is minimized by choosing f(X) = EY[Y | X] for every X.
• (Notice that for minimizing the absolute error EX,Y[ |Y − f(X)| ], one can show that the best model is f(X) = median(Y | X).)

Interpreting the Result

• To minimize mean squared error, the best prediction for input X=x is the mean of the Y-values of all training records (x(i), y(i)) with x(i) = x
  – E.g., assume there are training records (5,22), (5,24), (5,26), (5,28). The optimal prediction for input X=5 would be estimated as (22+24+26+28)/4 = 25.
• Problem: to reliably estimate the mean of Y for a given X=x, we need sufficiently many training records with X=x. In practice, often there is only one or no training record at all for an X=x of interest.
  – If there were many such records with X=x, we would not need a model and could just return the average Y for that X=x.
• The benefit of a good data mining technique is its ability to interpolate and extrapolate from known training records to make good predictions even for X-values that do not occur in the training data at all.
• Classification with two classes: encode them as 0 and 1 and use squared error as before
  – Then f(X) = E[Y | X=x] = 1·Pr(Y=1 | X=x) + 0·Pr(Y=0 | X=x) = Pr(Y=1 | X=x)
• Classification with k classes: one can show that for 0-1 loss (error = 0 if correct class, error = 1 if wrong class predicted) the optimal choice is to return the majority class for a given input X=x
  – This is called the Bayes classifier.

Implications for Trees

• Since there are not enough training records with X=x, or none at all, the output for input X=x has to be based on records "in the neighborhood"
  – A tree leaf corresponds to a multi-dimensional range in the data space
  – Records in the same leaf are neighbors of each other
• Solution: estimate the mean of Y for input X=x from the training records in the same leaf node that contains input X=x
  – Classification: leaf returns the majority class or class probabilities (estimated from the fraction of training records in the leaf)
  – Prediction: leaf returns the average of the Y-values or fits a local model
  – Make sure there are enough training records in the leaf to obtain reliable estimates
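A small check (not from the slides) that the mean really minimizes the squared error for the example records (5,22), (5,24), (5,26), (5,28):

```python
ys = [22, 24, 26, 28]                      # Y-values of the training records with X = 5

def sse(prediction):
    """Sum of squared errors of a constant prediction for these records."""
    return sum((y - prediction) ** 2 for y in ys)

best = min(range(20, 31), key=sse)         # try integer predictions 20..30
print(best, sse(best), sse(24))            # 25 is best (SSE 20); e.g. 24 gives SSE 24
```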

Bias-Variance Tradeoff

• Let's take this one step further and see if we can understand overfitting through statistical decision theory
• As before, consider two random variables X and Y
• From a training set D with n records, we want to construct a function f(X) that returns good approximations of Y for future inputs X
  – Make the dependence of f on D explicit by writing f(X; D)
• Goal: minimize mean squared error over all X, Y, and D, i.e., EX,D,Y[ (Y − f(X; D))² ]

Bias-Variance Tradeoff Derivation

• EX,D,Y[ (Y − f(X; D))² ] = EX[ ED[ EY[ (Y − f(X; D))² | X, D ] ] ]. Now consider the inner term:
  ED[ EY[ (Y − f(X; D))² | X, D ] ]
    = ED[ EY[ (Y − E[Y|X])² | X, D ] + (f(X; D) − E[Y|X])² ]   (same derivation as before for the optimal function f(X))
    = EY[ (Y − E[Y|X])² | X ] + ED[ (f(X; D) − E[Y|X])² ]
  (The first term does not depend on D, hence ED[ EY[ (Y − E[Y|X])² | X, D ] ] = EY[ (Y − E[Y|X])² | X ].)
• Consider the second term:
  ED[ (f(X; D) − E[Y|X])² ] = ED[ (f(X; D) − ED[f(X; D)] + ED[f(X; D)] − E[Y|X])² ]
    = ED[ (f(X; D) − ED[f(X; D)])² ] + ED[ (ED[f(X; D)] − E[Y|X])² ] + 2 ED[ (f(X; D) − ED[f(X; D)]) (ED[f(X; D)] − E[Y|X]) ]
    = ED[ (f(X; D) − ED[f(X; D)])² ] + (ED[f(X; D)] − E[Y|X])²
  (The third term is zero, because ED[ f(X; D) − ED[f(X; D)] ] = ED[f(X; D)] − ED[f(X; D)] = 0.)
• Overall we therefore obtain:
  EX,D,Y[ (Y − f(X; D))² ] = EX[ (ED[f(X; D)] − E[Y|X])² + ED[ (f(X; D) − ED[f(X; D)])² ] + EY[ (Y − E[Y|X])² | X ] ]

Bias-Variance Tradeoff and Overfitting

• (ED[f(X; D)] − E[Y|X])²: bias
• ED[ (f(X; D) − ED[f(X; D)])² ]: variance
• EY[ (Y − E[Y|X])² | X ]: irreducible error (does not depend on f and is simply the variance of Y given X)
• Option 1: f(X; D) = E[Y | X, D]
  – Bias: since ED[ E[Y | X, D] ] = E[Y | X], the bias is zero
  – Variance: (E[Y | X, D] − ED[E[Y | X, D]])² = (E[Y | X, D] − E[Y | X])² can be very large, since E[Y | X, D] depends heavily on D
  – Might overfit!
• Option 2: f(X; D) = X (or another function independent of D)
  – Variance: (X − ED[X])² = (X − X)² = 0
  – Bias: (ED[X] − E[Y | X])² = (X − E[Y | X])² can be large, because E[Y | X] might be completely different from X
  – Might underfit!
• Find the best compromise between fitting the training data too closely (option 1) and completely ignoring it (option 2)

Implications for Trees

• Bias decreases as the tree becomes larger
  – A larger tree can fit the training data better
• Variance increases as the tree becomes larger
  – Sample variance affects the predictions of a larger tree more
• Find the right tradeoff as discussed earlier
  – Validation data to find the best pruned tree
  – MDL principle

Classification and Prediction Overview (agenda repeated; next topic: Nearest Neighbor)

Lazy vs. Eager Learning

• Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test record
• Eager learning: given a training set, constructs a classification model before receiving new (test) data to classify
• General trend: lazy = faster training, slower predictions
• Accuracy: not clear which one is better!
  – Lazy method: typically driven by local decisions
  – Eager method: driven by global and local decisions

Nearest-Neighbor

• Recall our statistical decision theory analysis: the best prediction for input X=x is the mean of the Y-values of all records (x(i), y(i)) with x(i) = x (the majority class for classification)
• Problem was to estimate E[Y | X=x], or the majority class for X=x, from the training data
• Solution was to approximate it
  – Use the Y-values from training records in a neighborhood around X=x

Nearest-Neighbor Classifiers

• Requires:
  – Set of stored records
  – Distance metric for pairs of records; common choice: Euclidean, d(p, q) = sqrt( Σi (pi − qi)² )
  – Parameter k: the number of nearest neighbors to retrieve
• To classify a record (the unknown tuple):
  – Find its k nearest neighbors
  – Determine the output based on the (distance-weighted) average of the neighbors' output

Definition of Nearest Neighbor

• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
• (Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.)

1-Nearest Neighbor

• (Figure: Voronoi diagram of the training records; each cell is labeled with the class of its training record.)

Nearest Neighbor Classification

• Choosing the value of k:
  – k too small: sensitive to noise points
  – k too large: neighborhood may include points from other classes

Effect of Changing k

• (Figure: decision boundaries for different values of k. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)
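A minimal k-nearest-neighbor sketch with Euclidean distance and majority voting; the toy training records are made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); query: feature vector."""
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]   # majority class of the k neighbors

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))        # -> "A"
```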

Explaining the Effect of k

• Recall the bias-variance tradeoff
• Small k, i.e., predictions based on few neighbors
  – High variance, low bias
• Large k, e.g., average over the entire data set
  – Low variance, but high bias
• Need to find the k that achieves the best tradeoff
• Can do that using validation data

Experiment

• 50 training points (x, y)
  – −2 ≤ x ≤ 2, selected uniformly at random
  – y = x² + ε, where ε is selected uniformly at random from the range [−0.5, 0.5]
• Test data sets: 500 points from the same distribution as the training data, but with ε = 0
• Plot 1: all (x, NN1(x)) for 5 test sets
• Plot 2: all (x, AVG(NN1(x))), averaged over 200 test data sets
  – Same for NN20 and NN50
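A rough simulation of this experiment (the slides show full plots; this sketch, under the same data-generating assumptions, only estimates the average prediction and its variance of k-NN at one test point over many freshly drawn training sets):

```python
import random

def knn_regress(train, x0, k):
    """Average the y-values of the k training points closest to x0."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def make_training_set(n=50):
    return [(x, x * x + random.uniform(-0.5, 0.5))
            for x in (random.uniform(-2, 2) for _ in range(n))]

x0 = 1.0   # true function value at x0 is 1.0
for k in (1, 20, 50):
    preds = [knn_regress(make_training_set(), x0, k) for _ in range(200)]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    print(f"k={k:2d}  avg prediction={mean:.2f}  variance={var:.3f}")
# Small k: low bias, higher variance; k=50 (the whole data set): low variance, large bias.
```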

(Figures: results of the experiment for NN1, NN20, and NN50, annotated with the bias term (ED[f(X; D)] − E[Y|X])², the variance term ED[(f(X; D) − ED[f(X; D)])²], and the irreducible error EY[(Y − E[Y|X])² | X].)

Scaling Issues

• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example:
  – Height of a person may vary from 1.5m to 1.8m
  – Weight of a person may vary from 90lb to 300lb
  – Income of a person may vary from $10K to $1M
  – The income difference would dominate the record distance
• Irrelevant attributes might dominate distance
  – Solution: eliminate them

Other Problems

• Problem with Euclidean measure:
  – High dimensional data: curse of dimensionality
  – Can produce counter-intuitive results, e.g.,
    1 1 1 1 1 1 1 1 1 1 1 0  vs.  0 1 1 1 1 1 1 1 1 1 1 1 : d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0  vs.  0 0 0 0 0 0 0 0 0 0 0 1 : d = 1.4142
  – Solution: normalize the vectors to unit length

Computational Cost

• Brute force: O(#trainingRecords)
  – For each training record, compute the distance to the test record; keep it if it is among the top-k
• Pre-compute the Voronoi diagram (expensive), then search a spatial index of the Voronoi cells: if lucky, O(log(#trainingRecords))
• Store the training records in a multi-dimensional search tree, e.g., an R-tree: if lucky, O(log(#trainingRecords))
• Bulk-compute predictions for many test records using a spatial join between the training and test sets
  – Same worst-case cost as one-by-one predictions, but usually much faster in practice

Classification and Prediction Overview (agenda repeated; next topic: Bayesian Classification)

Bayesian Classification

• Performs probabilistic prediction, i.e., predicts class membership probabilities
• Based on Bayes' theorem
• Incremental training
  – Update probabilities as new training records arrive
  – Can combine prior knowledge with observed data
• Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics

• X = random variable for data records ("evidence")
• H = hypothesis that a specific record X=x belongs to class C
• Goal: determine P(H | X=x)
  – Probability that the hypothesis holds given a record x
• P(H) = prior probability
  – The initial probability of the hypothesis
  – E.g., person x will buy a computer, regardless of age, income etc.
• P(X=x) = probability that data record x is observed
• P(X=x | H) = probability of observing record x, given that the hypothesis holds
  – E.g., given that x will buy a computer, what is the probability that x is in age group 31…40, has medium income, etc.?

Bayes' Theorem

• Given data record x, the posterior probability of a hypothesis H, P(H | X=x), follows from Bayes' theorem:
  P(H | X=x) = P(X=x | H) P(H) / P(X=x)
• Informally: posterior = likelihood · prior / evidence
• Among all candidate hypotheses H, find the maximally probable one, called the maximum a posteriori (MAP) hypothesis
• Note: P(X=x) is the same for all hypotheses
• If all hypotheses are equally probable a priori, we only need to compare P(X=x | H)
  – The winning hypothesis is called the maximum likelihood (ML) hypothesis
• Practical difficulties: requires initial knowledge of many probabilities and has high computational cost

Towards Naïve Bayes Classifier

• Suppose there are m classes C1, C2,…, Cm
• Classification goal: for record x, find the class Ci that has the maximum posterior probability P(Ci | X=x)
• Bayes' theorem: P(Ci | X=x) = P(X=x | Ci) P(Ci) / P(X=x)
• Since P(X=x) is the same for all classes, we only need to find the maximum of P(X=x | Ci) P(Ci)

Computing P(X=x|Ci) and P(Ci)

• Estimate P(Ci) by counting the frequency of class Ci in the training data
• Can we do the same for P(X=x|Ci)?
  – Would need a very large set of training data
  – There are |X1|·|X2|·…·|Xd|·m different combinations of possible values for X and Ci
  – Need to see every instance x many times to obtain reliable estimates
• Solution: decompose into lower-dimensional problems

Example: Computing P(X=x|Ci) and P(Ci)

• From the 14-record buys_computer training data:
  – P(buys_computer = yes) = 9/14
  – P(buys_computer = no) = 5/14
  – P(age>40, income=low, student=no, credit_rating=bad | buys_computer=yes) = 0 ?

Conditional Independence

• X, Y, Z random variables
• X is conditionally independent of Y, given Z, if P(X | Y,Z) = P(X | Z)
  – Equivalent to: P(X,Y | Z) = P(X | Z) · P(Y | Z)
• Example: people with longer arms read better
  – Confounding factor: age
    • A young child has shorter arms and lacks the reading skills of an adult
  – If age is fixed, the observed relationship between arm length and reading skills disappears

Derivation of Naïve Bayes Classifier

• Simplifying assumption: all input attributes are conditionally independent, given the class:
  P(X = (x1,…,xd) | Ci) = Π_{k=1..d} P(Xk = xk | Ci) = P(X1=x1 | Ci) · P(X2=x2 | Ci) · … · P(Xd=xd | Ci)
• Each P(Xk=xk | Ci) can be estimated robustly
  – If Xk is a categorical attribute
    • P(Xk=xk | Ci) = #records in Ci that have value xk for Xk, divided by #records of class Ci in the training data set
  – If Xk is continuous, we could discretize it
    • Problem: interval selection
      – Too many intervals: too few training cases per interval
      – Too few intervals: limited choices for decision boundary

Estimating P(Xk=xk|Ci) for Continuous Attributes without Discretization

• P(Xk=xk | Ci) computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
  with P(Xk=xk | Ci) = g(xk, μk,Ci, σk,Ci)
• Estimate μk,Ci from the sample mean of attribute Xk over all training records of class Ci
• Estimate σk,Ci similarly from the sample standard deviation

Naïve Bayes Example

• Classes:
  – C1: buys_computer = yes
  – C2: buys_computer = no
• Data sample x:
  – age ≤ 30,
  – income = medium,
  – student = yes, and
  – credit_rating = bad
• (Training data: the buys_computer table from before.)

Naïve Bayesian Computation

• Compute P(Ci) for each class:
  – P(buys_computer = "yes") = 9/14 = 0.643
  – P(buys_computer = "no") = 5/14 = 0.357
• Compute P(Xk=xk | Ci) for each class:
  – P(age = "≤30" | buys_computer = "yes") = 2/9 = 0.222
  – P(age = "≤30" | buys_computer = "no") = 3/5 = 0.6
  – P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  – P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  – P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  – P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  – P(credit_rating = "bad" | buys_computer = "yes") = 6/9 = 0.667
  – P(credit_rating = "bad" | buys_computer = "no") = 2/5 = 0.4
• Compute P(X=x | Ci) using the Naïve Bayes assumption:
  – P(≤30, medium, yes, bad | buys_computer = "yes") = 0.222 · 0.444 · 0.667 · 0.667 = 0.044
  – P(≤30, medium, yes, bad | buys_computer = "no") = 0.6 · 0.4 · 0.2 · 0.4 = 0.019
• Compute the final result P(X=x | Ci) · P(Ci):
  – P(X=x | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
  – P(X=x | buys_computer = "no") · P(buys_computer = "no") = 0.007
• Therefore we predict buys_computer = "yes" for input x = (age = "≤30", income = "medium", student = "yes", credit_rating = "bad")

Zero-Probability Problem

• Naïve Bayesian prediction requires each conditional probability to be non-zero (why?):
  P(X = (x1,…,xd) | Ci) = Π_{k=1..d} P(Xk = xk | Ci), so a single zero factor makes the whole product zero
• Example: 1000 records for buys_computer = yes with income = low (0), income = medium (990), and income = high (10)
  – For an input with income = low, the conditional probability, and hence the product, is zero
• Use the Laplacian correction (or Laplace estimator) by adding 1 dummy record to each income level:
  – Prob(income = low) = 1/1003
  – Prob(income = medium) = 991/1003
  – Prob(income = high) = 11/1003
  – The "corrected" probability estimates are close to their "uncorrected" counterparts, but none is zero
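A minimal Naïve Bayes sketch for categorical attributes with the Laplacian correction, using the buys_computer table again. The helper names are my own; the prediction for the sample x agrees with the worked example (the exact scores differ slightly because of the smoothing):

```python
from collections import Counter, defaultdict

rows = [("<=30","High","No","Bad","No"), ("<=30","High","No","Good","No"),
        ("31..40","High","No","Bad","Yes"), (">40","Medium","No","Bad","Yes"),
        (">40","Low","Yes","Bad","Yes"), (">40","Low","Yes","Good","No"),
        ("31..40","Low","Yes","Good","Yes"), ("<=30","Medium","No","Bad","No"),
        ("<=30","Low","Yes","Bad","Yes"), (">40","Medium","Yes","Bad","Yes"),
        ("<=30","Medium","Yes","Good","Yes"), ("31..40","Medium","No","Good","Yes"),
        ("31..40","High","Yes","Bad","Yes"), (">40","Medium","No","Good","No")]
names = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
records = [dict(zip(names, r)) for r in rows]

def train_nb(records, target, laplace=1):
    classes = Counter(r[target] for r in records)
    priors = {c: n / len(records) for c, n in classes.items()}
    cond = defaultdict(dict)                 # cond[attr][(value, class)] = P(value | class)
    for a in (a for a in names if a != target):
        values = {r[a] for r in records}
        for c, nc in classes.items():
            counts = Counter(r[a] for r in records if r[target] == c)
            for v in values:                 # Laplacian correction: +1 dummy record per value
                cond[a][(v, c)] = (counts[v] + laplace) / (nc + laplace * len(values))
    return priors, cond

def predict_nb(priors, cond, x):
    scores = dict(priors)                    # start from P(Ci)
    for c in scores:
        for a, v in x.items():
            scores[c] *= cond[a][(v, c)]     # multiply by P(Xk = xk | Ci)
    return max(scores, key=scores.get), scores

priors, cond = train_nb(records, target="Buys_computer")
print(predict_nb(priors, cond, {"Age": "<=30", "Income": "Medium",
                                "Student": "Yes", "Credit_rating": "Bad"}))
```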

Naïve Bayesian Classifier: Comments

• Easy to implement
• Good results obtained in many cases
  – Robust to isolated noise points
  – Handles missing values by ignoring the instance during probability estimate calculations
  – Robust to irrelevant attributes
• Disadvantages
  – Assumption: class conditional independence, therefore loss of accuracy
  – Practically, dependencies exist among variables
• How to deal with these dependencies?

Probabilities

• Summary of elementary probability facts we have used already and/or will need soon
• Let X be a random variable as usual
• Let A be some predicate over its possible values
  – A is true for some values of X, false for others
  – E.g., X is the outcome of the throw of a die, A could be "value is greater than 4"
• P(A) is the fraction of possible worlds in which A is true
  – P(die value is greater than 4) = 2/6 = 1/3

Axioms

• 0 ≤ P(A) ≤ 1
• P(True) = 1
• P(False) = 0
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorems from the Axioms

• From 0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0, and P(A ∨ B) = P(A) + P(B) − P(A ∧ B) we can prove:
  – P(not A) = P(~A) = 1 − P(A)
  – P(A) = P(A ∧ B) + P(A ∧ ~B)

Conditional Probability

• P(A|B) = fraction of worlds in which B is true that also have A true
• Example: H = "Have a headache", F = "Coming down with flu"
  – P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
  – "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

Definition of Conditional Probability

• P(A|B) = P(A ∧ B) / P(B)
• Corollary, the chain rule: P(A ∧ B) = P(A|B) P(B)

Multivalued Random Variables

• Suppose X can take on more than 2 values
• X is a random variable with arity k if it can take on exactly one value out of {v1, v2,…, vk}
• Thus P(X = vi ∧ X = vj) = 0 if i ≠ j, and P(X = v1 ∨ X = v2 ∨ … ∨ X = vk) = 1

Easy Fact about Multivalued Random Variables

• Using the axioms of probability
  – 0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0
  – P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• And assuming that X obeys P(X = vi ∧ X = vj) = 0 for i ≠ j and P(X = v1 ∨ … ∨ X = vk) = 1
• We can prove that P(X = v1 ∨ X = v2 ∨ … ∨ X = vi) = Σ_{j=1..i} P(X = vj)
• And therefore: Σ_{j=1..k} P(X = vj) = 1

Useful Easy-to-Prove Facts

• P(A|B) + P(~A|B) = 1
• Σ_{j=1..k} P(X = vj | B) = 1

The Joint Distribution (example: Boolean variables A, B, C)

• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (has 2^d rows for d Boolean variables).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.
• Example joint distribution (A, B, C, Prob):
  0, 0, 0, 0.30
  0, 0, 1, 0.05
  0, 1, 0, 0.10
  0, 1, 1, 0.05
  1, 0, 0, 0.05
  1, 0, 1, 0.10
  1, 1, 0, 0.25
  1, 1, 1, 0.10

Using the Joint Dist.

• Once you have the JD you can ask for the probability of any logical expression E involving your attributes:
  P(E) = Σ_{rows matching E} P(row)
• Example (joint distribution over gender, hours worked, and wealth):
  – P(Poor ∧ Male) = 0.4654
  – P(Poor) = 0.7604

Inference with the Joint Dist.

• P(E1 | E2) = P(E1 ∧ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
• Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Joint Distributions

• Good news: once you have a joint distribution, you can answer important questions that involve uncertainty.
• Bad news: it is impossible to create a joint distribution for more than about ten attributes, because there are so many numbers needed when you build it.

What Would Help?

• Full independence
  – P(gender=g ∧ hours_worked=h ∧ wealth=w) = P(gender=g) · P(hours_worked=h) · P(wealth=w)
  – Can reconstruct the full joint distribution from a few marginals
• Full conditional independence given the class value
  – Naïve Bayes
• What about something between Naïve Bayes and the general joint distribution?
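A short sketch of joint-distribution inference using the A, B, C table above; P(E) sums the matching rows and P(E1|E2) follows the formula on the slides (the example queries themselves are made up):

```python
# Joint distribution over the Boolean variables A, B, C from the slide
joint = {(0,0,0): 0.30, (0,0,1): 0.05, (0,1,0): 0.10, (0,1,1): 0.05,
         (1,0,0): 0.05, (1,0,1): 0.10, (1,1,0): 0.25, (1,1,1): 0.10}

def prob(event):
    """P(E) = sum of P(row) over rows matching the predicate `event`."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

A = lambda r: r[0] == 1
B = lambda r: r[1] == 1
print(prob(A))             # P(A) = 0.05 + 0.10 + 0.25 + 0.10 = 0.50
print(cond_prob(A, B))     # P(A | B) = 0.35 / 0.50 = 0.70
```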

Bayesian Belief Networks

• Subset of the variables conditionally independent
• Graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of the joint probability distribution
• Nodes: random variables; links: dependency
  – Example: X and Y are the parents of Z, and Y is the parent of P; given Y, Z and P are independent
• Has no loops or cycles

Properties

• Each variable is conditionally independent of its non-descendants in the graph, given its parents
• Naïve Bayes as a Bayesian network: the class Y is the parent of all attribute nodes X1, X2,…, Xn

General Properties

• P(X1,X2,X3) = P(X1|X2,X3) P(X2|X3) P(X3)
• P(X1,X2,X3) = P(X3|X1,X2) P(X2|X1) P(X1)
• The network does not necessarily reflect causality
• (Figure: two three-node networks over X1, X2, X3 with different edge directions.)

Structural Property

• Missing links simplify the computation of P(X1, X2,…, Xn)
• General factorization: Π_{i=1..n} P(Xi | Xi−1, Xi−2,…, X1)
  – Fully connected: link between every pair of nodes
• Given a network: Π_{i=1..n} P(Xi | parents(Xi))
  – Some links are missing
  – The terms P(Xi | parents(Xi)) are given as conditional probability tables (CPTs) in the network
• A sparse network allows better estimation of the CPTs (fewer combinations of parent values, hence more reliable to estimate from limited data) and faster computation

Small Example

• S: Student studies a lot for 6220
• L: Student learns a lot and gets a good grade
• J: Student gets a great job
• Network: S → L → J, with CPTs
  – P(S) = 0.4
  – P(L|S) = 0.9, P(L|~S) = 0.2
  – P(J|L) = 0.8, P(J|~L) = 0.3

Computing P(S|J)

• Probability that a student who got a great job was doing her homework
• P(S | J) = P(S, J) / P(J)
• P(S, J) = P(S, J, L) + P(S, J, ~L)
• P(J) = P(J, S, L) + P(J, S, ~L) + P(J, ~S, L) + P(J, ~S, ~L)
• P(J, L, S) = P(J | L, S) · P(L, S) = P(J | L) · P(L | S) · P(S) = 0.8 · 0.9 · 0.4
• P(J, ~L, S) = P(J | ~L, S) · P(~L, S) = P(J | ~L) · P(~L | S) · P(S) = 0.3 · (1−0.9) · 0.4
• P(J, L, ~S) = P(J | L, ~S) · P(L, ~S) = P(J | L) · P(L | ~S) · P(~S) = 0.8 · 0.2 · (1−0.4)
• P(J, ~L, ~S) = P(J | ~L, ~S) · P(~L, ~S) = P(J | ~L) · P(~L | ~S) · P(~S) = 0.3 · (1−0.2) · (1−0.4)
• Putting this all together, we obtain:
  P(S | J) = (0.8·0.9·0.4 + 0.3·0.1·0.4) / (0.8·0.9·0.4 + 0.3·0.1·0.4 + 0.8·0.2·0.6 + 0.3·0.8·0.6) = 0.3 / 0.54 = 0.56

More Complex Example

• Variables: S: It is snowing; M: The lecturer is Mike; L: The lecturer arrives late; R: The lecture concerns data mining; T: The lecture started on time
• Network: S and M are the parents of L; M is the parent of R; L is the parent of T
• CPTs:
  – P(S) = 0.3, P(M) = 0.6
  – P(L|M, S) = 0.05, P(L|M, ~S) = 0.1, P(L|~M, S) = 0.1, P(L|~M, ~S) = 0.2
  – P(R|M) = 0.3, P(R|~M) = 0.6
  – P(T|L) = 0.3, P(T|~L) = 0.8

Computing with Bayes Net

• P(T, ~R, L, ~M, S) = P(T | L) · P(~R | ~M) · P(L | ~M, S) · P(~M) · P(S)
• P(R | T, ~S) = P(R, T, ~S) / P(T, ~S)
  – P(R, T, ~S) = P(L, M, R, T, ~S) + P(~L, M, R, T, ~S) + P(L, ~M, R, T, ~S) + P(~L, ~M, R, T, ~S)
  – Compute P(T, ~S) similarly. Problem: there are now 8 such terms to be computed.

Inference with Bayesian Networks

• Can predict the probability for any attribute, given any subset of the other attributes
  – P(M | L, R), P(T | S, ~M, R) and so on
• Easy case: P(Xi | Xj1, Xj2,…, Xjk) where parents(Xi) ⊆ {Xj1, Xj2,…, Xjk}
  – Can read the answer directly from Xi's CPT
• What if values are not given for all parents of Xi?
  – Exact inference of probabilities in general, for an arbitrary Bayesian network, is NP-hard
  – Solutions: probabilistic inference that trades precision for efficiency
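A minimal inference-by-enumeration sketch for the S → L → J network; it reproduces the 0.3 / 0.54 ≈ 0.56 computation above:

```python
# CPTs for the chain S -> L -> J from the slides
P_S = 0.4
P_L = {True: 0.9, False: 0.2}          # P(L | S)
P_J = {True: 0.8, False: 0.3}          # P(J | L)

def joint(s, l, j):
    """P(S=s, L=l, J=j) = P(J | L) * P(L | S) * P(S) for this network."""
    ps = P_S if s else 1 - P_S
    pl = P_L[s] if l else 1 - P_L[s]
    pj = P_J[l] if j else 1 - P_J[l]
    return pj * pl * ps

# P(S | J) by summing out the hidden variable L
p_S_and_J = sum(joint(True, l, True) for l in (True, False))
p_J = sum(joint(s, l, True) for s in (True, False) for l in (True, False))
print(p_S_and_J / p_J)   # 0.3 / 0.54 = 0.555..., the 0.56 from the slide
```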

Training Bayesian Networks

• Several scenarios:
  – Network structure known, all variables observable: learn only the CPTs
  – Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  – Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  – Unknown structure, all hidden variables: no good algorithms known for this purpose
• Ref.: D. Heckerman: Bayesian networks for data mining

Classification and Prediction Overview (agenda repeated; next topic: Artificial Neural Networks)

Basic Building Block: Perceptron

• Input: records {(x1,…, xd, y), …}; the perceptron combines the input vector x with a weight vector w and a bias b
• Output: classification function f(x) = sign( b + Σ_{i=1..d} wi xi )
  – f(x) > 0: return +1
  – f(x) ≤ 0: return −1
• (Diagram: input vector x, weight vector w, weighted sum b + Σ wi xi, activation function, output y.)

Decision Hyperplane

• Decision hyperplane: b + w·x = 0 (in two dimensions: b + w1 x1 + w2 x2 = 0)
• Note: b + w·x > 0 if and only if Σ_{i=1..d} wi xi > −b
• b represents a threshold for when the perceptron "fires"

Representing Boolean Functions

• AND with a two-input perceptron
  – b = −0.8, w1 = w2 = 0.5
• OR with a two-input perceptron
  – b = −0.3, w1 = w2 = 0.5
• m-of-n function: true if at least m out of n inputs are true
  – All input weights 0.5, threshold weight b is set according to m, n
• Can also represent NAND, NOR
• What about XOR?

Perceptron Training Rule

• Goal: correct +1/−1 output for each training record
• Start with random weights and a constant learning rate η
• While some training records are still incorrectly classified, do
  – For each training record (x, y)
    • Let fold(x) be the output of the current perceptron for x
    • Set b := b + Δb, where Δb = η (y − fold(x))
    • For all i, set wi := wi + Δwi, where Δwi = η (y − fold(x)) xi
• Converges to a correct decision boundary if the classes are linearly separable and a small enough η is used
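A minimal sketch of the perceptron training rule as stated above, learning the Boolean AND function from the previous slide (the learning rate, initialization, and epoch limit are arbitrary choices):

```python
import random

def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (x, y) with x a tuple of inputs and y in {+1, -1}."""
    d = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(d)]
    b = random.uniform(-0.05, 0.05)
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            f_old = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if f_old != y:                   # (y - f_old) is zero for correct records
                b += eta * (y - f_old)
                w = [wi + eta * (y - f_old) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                      # converged: all records classified correctly
            break
    return b, w

# Boolean AND: output +1 only for input (1, 1)
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
print(train_perceptron(and_data))
```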

Gradient Descent

• If the training records are not linearly separable, find the best-fit approximation
  – Gradient descent to search the space of possible weight vectors
  – Basis for the Backpropagation algorithm
• Consider the un-thresholded perceptron (no sign function applied), i.e., u(x) = b + w·x
• Measure the training error by the squared error
  E(b,w) = (1/2) Σ_{(x,y)∈D} (y − u(x))²
  – D = training data

Gradient Descent Rule

• Find the weight vector that minimizes E(b,w) by altering it in the direction of steepest descent
  – Set (b,w) := (b,w) + Δ(b,w), where Δ(b,w) = −η ∇E(b,w)
• ∇E(b,w) = [∂E/∂b, ∂E/∂w1,…, ∂E/∂wn] is the gradient, hence
  b := b + η Σ_{(x,y)∈D} (y − u(x))
  wi := wi + η Σ_{(x,y)∈D} (y − u(x)) xi
• Start with random weights, iterate until convergence
  – Will converge to the global minimum if η is small enough
• (Figure: error surface E(w1, w2) over the weight space.)

Gradient Descent Summary

• Epoch updating (batch mode)
  – Compute the gradient over the entire training set
  – Changes the model once per scan of the entire training set
• Case updating (incremental mode, stochastic gradient descent)
  – Compute the gradient for a single training record
  – Changes the model after every single training record immediately
• Case updating can approximate epoch updating arbitrarily closely if η is small enough
• What is the difference between the perceptron training rule and case updating for gradient descent?
  – Error computation on the thresholded vs. the unthresholded function

Multilayer Feedforward Networks

• Use another perceptron to combine the output of a lower layer (input layer → hidden layer → output layer)
  – What about linear units only? Can only construct linear functions!
  – Need a nonlinear component
    • sign function: not differentiable (needed for gradient descent!)
    • Use the sigmoid: σ(x) = 1/(1+e^(−x))
• Perceptron function becomes y = 1 / (1 + e^(−(b + w·x)))

1-Hidden Layer ANN Example

• (Figure: inputs x1, x2 feed three hidden units v1, v2, v3 with weights w11,…,w32, which feed a single output unit.)
  v1 = g(b1 + Σ_{k=1..N_INS} w1k xk)
  v2 = g(b2 + Σ_{k=1..N_INS} w2k xk)
  v3 = g(b3 + Σ_{k=1..N_INS} w3k xk)
  Out = g(B + Σ_{k=1..N_HID} Wk vk)
  g is usually the sigmoid

Making Predictions

• The input record is fed simultaneously into the units of the input layer
• It is then weighted and fed simultaneously to a hidden layer
• The weighted outputs of the last hidden layer are the input to the units in the output layer, which emits the network's prediction
• The network is feed-forward
  – None of the weights cycles back to an input unit or to an output unit of a previous layer
• Statistical point of view: neural networks perform nonlinear regression
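A minimal sketch of the forward pass described above, for one hidden layer and a single output unit; the weights below are made up purely for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, hidden, output, g=sigmoid):
    """hidden: list of (bias_i, weights_i) per hidden unit; output: (B, W).
    Implements v_i = g(b_i + sum_k w_ik * x_k) and Out = g(B + sum_k W_k * v_k)."""
    v = [g(b + sum(w_k * x_k for w_k, x_k in zip(w, x))) for b, w in hidden]
    B, W = output
    return g(B + sum(W_k * v_k for W_k, v_k in zip(W, v)))

# Three hidden units on two inputs, as in the slide's sketch (weights are arbitrary)
hidden = [(0.1, (0.5, -0.4)), (-0.3, (0.8, 0.2)), (0.0, (-0.6, 0.9))]
output = (0.2, (1.0, -1.5, 0.7))
print(predict((0.3, 0.8), hidden, output))
```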

Backpropagation Algorithm

• Earlier discussion: gradient descent for a single perceptron using a simple un-thresholded function
• If a sigmoid (or other differentiable) function is applied to the weighted sum, use the complete function for gradient descent
• Multiple layers of perceptrons: optimize over all weights of all perceptrons
  – Problems: huge search space, local minima
• Backpropagation
  – Initialize all weights with small random values
  – Iterate many times
    • Compute the gradient, starting at the output and working back
      – Error of hidden unit h: how do we get the true output value? Use the weighted sum of the errors of each unit influenced by h
    • Update all weights in the network

Overfitting

• When do we stop updating the weights?
• Overfitting tends to happen in later iterations
  – Weights are initially small random values
  – Weights all similar => smooth decision surface
  – Surface complexity increases as the weights diverge
• Preventing overfitting
  – Weight decay: decrease each weight by a small factor during each iteration, or
  – Use validation data to decide when to stop iterating

Neural Network Decision Boundary

• (Figure: nonlinear decision boundaries learned by a neural network. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Backpropagation Remarks

• Computational cost
  – Each iteration costs O(|D|·|w|), with |D| training records and |w| weights
  – The number of iterations can be exponential in n, the number of inputs (in practice often tens of thousands)
• Local minima can trap the gradient descent algorithm: convergence guaranteed to a local minimum, not the global one
• Backpropagation is highly effective in practice
  – Many variants to deal with the local minima issue, use of case updating

Defining a Network

1. Decide the network topology
   – #input units, #hidden layers, #units per hidden layer, #output units (one output unit per class for problems with >2 classes)
2. Normalize the input values for each attribute to [0.0, 1.0]
   – Nominal/ordinal attributes: one input unit per domain value
     • For attribute grade with values A, B, C, have 3 inputs that are set to 1,0,0 for grade A, to 0,1,0 for grade B, and 0,0,1 for C
     • Why not map it to a single input with domain [0.0, 1.0]?
3. Choose the learning rate η, e.g., 1 / (#training iterations)
   – Too small: takes too long to converge
   – Too large: might never converge (oversteps the minimum)
4. Bad results on test data? Change network topology, initial weights, or learning rate; try again.

Representational Power

• Boolean functions
  – Each can be represented by a 2-layer network
  – The number of hidden units can grow exponentially with the number of inputs
    • Create a hidden unit for each input record
    • Set its weights to activate only for that input
    • Implement the output unit as an OR gate that only activates for the desired output patterns
• Continuous functions
  – Every bounded continuous function can be approximated arbitrarily closely by a 2-layer network
• Any function can be approximated arbitrarily closely by a 3-layer network

Neural Network as a Classifier

• Weaknesses
  – Long training time
  – Many non-trivial parameters, e.g., network topology
  – Poor interpretability: what is the meaning behind learned weights and hidden units?
    • Note: hidden units are an alternative representation of the input values, capturing their relevant features
• Strengths
  – High tolerance to noisy data
  – Well-suited for continuous-valued inputs and outputs
  – Successful on a wide array of real-world data
  – Techniques exist for extraction of rules from neural networks

Classification and Prediction Overview (agenda repeated; next topic: Support Vector Machines)

SVM—Support Vector Machines

• Newer and very popular classification method
• Uses a nonlinear mapping to transform the original training data into a higher dimension
• Searches for the optimal separating hyperplane (i.e., "decision boundary") in the new dimension
• SVM finds this hyperplane using support vectors ("essential" training records) and margins (defined by the support vectors)

SVM—History and Applications

• Vapnik and colleagues (1992)
  – Groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
• Training can be slow but accuracy is high
  – Ability to model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

Linear Classifiers

• f(x,w,b) = sign(w·x + b); training records are labeled +1 or −1
• (Figures: several different lines, each of which separates the +1 records from the −1 records.)
• How would you classify this data? Any of these would be fine… but which is best?

Classifier Margin

• f(x,w,b) = sign(w·x + b)
• Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data record.

Maximum Margin

• f(x, w, b) = sign(w·x + b)
• Find the maximum margin linear classifier.
• This is the simplest kind of SVM, called linear SVM or LSVM.
• Support Vectors are those data points that the margin pushes up against.

Why Maximum Margin?

• If we made a small error in the location of the boundary, this choice gives us the least chance of causing a misclassification.
• The model is immune to removal of any non-support-vector data records.
• There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
• Empirically it works very well.

Specifying a Line and Margin

• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = -1 }
• Classify as +1 if w·x + b ≥ 1, and as -1 if w·x + b ≤ -1
• What if -1 < w·x + b < 1?
(Figure: classifier boundary with plus-plane and minus-plane.)

Computing Margin Width

• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = -1 }
• Goal: compute the margin width M in terms of w and b
  – Note: vector w is perpendicular to the plus-plane
    • Consider two vectors u and v on the plus-plane and show that w·(u − v) = 0
    • Hence w is also perpendicular to the minus-plane
• Choose an arbitrary point x⁻ on the minus-plane
• Let x⁺ be the point on the plus-plane closest to x⁻
• Since vector w is perpendicular to these planes, it holds that x⁺ = x⁻ + λw, for some value of λ
(Figure: the two parallel planes, the points x⁺ and x⁻, and the margin width M.)

Putting It All Together

• We have so far:
  – w·x⁺ + b = +1 and w·x⁻ + b = -1
  – x⁺ = x⁻ + λw
  – |x⁺ − x⁻| = M
• Derivation:
  – w·(x⁻ + λw) + b = +1, hence w·x⁻ + b + λ(w·w) = 1
  – This implies λ(w·w) = 2, i.e., λ = 2 / (w·w)
  – Since M = |x⁺ − x⁻| = |λw| = λ|w| = λ(w·w)^0.5
  – We obtain M = 2(w·w)^0.5 / (w·w) = 2 / (w·w)^0.5

Finding the Maximum Margin

• How do we find w and b such that the margin is maximized and all training records are in the correct zone for their class?
• Solution: Quadratic Programming (QP)
• QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.
  – There exist algorithms for finding such constrained quadratic optima efficiently and reliably.
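As an illustration (not part of the slides), a linear SVM with a very large C behaves essentially like the hard-margin classifier above; the learned w and b then give the margin 2 / (w·w)^0.5 from the derivation. This sketch assumes scikit-learn and an invented toy data set.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    margin = 2.0 / np.sqrt(w @ w)                 # M = 2 / sqrt(w.w)
    print("w =", w, "b =", b, "margin =", margin)
    print("support vectors:\n", clf.support_vectors_)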

Quadratic Programming

• Find arg max_u [ c + dᵀu − (uᵀRu)/2 ]   (quadratic criterion)
• Subject to n linear inequality constraints:
  a11·u1 + a12·u2 + ... + a1m·um ≤ b1
  ...
  an1·u1 + an2·u2 + ... + anm·um ≤ bn
• And subject to e additional linear equality constraints:
  a(n+1)1·u1 + a(n+1)2·u2 + ... + a(n+1)m·um = b(n+1)
  ...
  a(n+e)1·u1 + a(n+e)2·u2 + ... + a(n+e)m·um = b(n+e)

What Are the SVM Constraints?

• Consider n training records (x(k), y(k)), where y(k) = +/- 1
• What is the quadratic optimization criterion? Maximize the margin M = 2 / (w·w)^0.5, i.e., minimize w·w
• How many constraints will we have? n.
• What should they be? For each 1 ≤ k ≤ n:
  – w·x(k) + b ≥ 1, if y(k) = 1
  – w·x(k) + b ≤ -1, if y(k) = -1

Problem: Classes Not Linearly Separable

• The inequalities for the training records are not satisfiable by any w and b

Solution 1?

• Find the minimum w·w, while also minimizing the number of training set errors
• Problem: not a well-defined optimization problem (cannot optimize two things at the same time)

Solution 2?

• Minimize w·w + C·(#trainSetErrors)
  – C is a tradeoff parameter
• Problems:
  – Cannot be expressed as a QP, hence finding the solution might be slow
  – Does not distinguish between disastrous errors and near misses

Solution 3

• Minimize w·w + C·(distance of error records to their correct place)
• This works!
• But we still need to do something about the unsatisfiable set of inequalities

What Are the SVM Constraints?

• Consider n training records (x(k), y(k)), where y(k) = +/- 1
• Quadratic optimization criterion: minimize (1/2)·w·w + C·Σ_{k=1..n} εk
• There are n constraints; for each 1 ≤ k ≤ n:
  – w·x(k) + b ≥ 1 − εk, if y(k) = 1
  – w·x(k) + b ≤ -1 + εk, if y(k) = -1
  – εk ≥ 0

Facts About the New Problem Formulation

• The original QP formulation had d+1 variables: w1, w2, ..., wd and b
• The new QP formulation has d+1+n variables: w1, w2, ..., wd, b, and ε1, ε2, ..., εn
• C is a new parameter that needs to be set for the SVM
  – It controls the tradeoff between paying attention to margin size versus misclassifications

Effect of Parameter C

(Figure; source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

An Equivalent QP (The "Dual")

• Maximize  Σ_{k=1..n} αk − (1/2)·Σ_{k=1..n} Σ_{l=1..n} αk·αl·y(k)·y(l)·(x(k)·x(l))
• Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1..n} αk·y(k) = 0
• Then define:
  – w = Σ_{k=1..n} αk·y(k)·x(k)
  – b = AVG over {k : 0 < αk < C} of ( 1/y(k) − x(k)·w )
• Then classify with: f(x, w, b) = sign(w·x + b)

Important Facts

• The dual formulation of the QP can be optimized more quickly, but the result is equivalent
• Data records with αk > 0 are the support vectors
  – Those with 0 < αk < C lie on the plus- or minus-plane
  – Those with αk = C are on the wrong side of the classifier boundary (have εk > 0)
• The computation of w and b only depends on the records with αk > 0, i.e., the support vectors
• The alternative QP has another major advantage, as we will see now...
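A small check (not from the slides) of the relationship w = Σ αk·y(k)·x(k), using scikit-learn: for a linear kernel, SVC stores αk·y(k) for the support vectors in dual_coef_, so w can be recomputed from the support vectors alone. The toy data set is invented.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=10.0).fit(X, y)

    # w = sum_k alpha_k * y(k) * x(k), summed over the support vectors only
    w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
    print("w from dual coefficients:", w_from_dual)
    print("w reported by the solver:", clf.coef_[0])   # should match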

Easy To Separate

• What would SVMs do with this data? Not a big surprise: the maximum margin separator, with a positive "plane" and a negative "plane". (Figures.)

Harder To Separate

• Here the classes cannot be separated in the original space. What can be done about this?
• Non-linear basis functions:
  – Original data: (X, Y)
  – Transformed: (X, X², Y)
• Think of X² as a new attribute, e.g., X'
(Figure: the data plotted against X and X' = X².)
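A sketch (invented data, numpy only) of the basis-function idea: a 1-D data set whose positive class sits in the middle is not linearly separable in X, but becomes separable after adding X² as a second attribute.

    import numpy as np

    X = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
    y = np.array([-1, -1, 1, 1, 1, -1, -1])        # +1 only near the origin

    X_transformed = np.column_stack([X, X ** 2])   # add X' = X^2 as a new attribute

    # In the transformed space, the horizontal line X' = 2 separates the classes:
    predictions = np.sign(2.0 - X_transformed[:, 1])   # +1 if X^2 < 2, else -1
    print(np.all(predictions == y))                    # True: separable in the new space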

Now Separation Is Easy Again

• In the transformed space (X, X' = X²), the two classes can be separated by a line with a plus-"plane" and a minus-"plane". (Figure.)

Corresponding "Planes" in Original Space

• Mapped back to the original space, the separating planes become the region above the plus-"plane" and the region below the minus-"plane". (Figure.)

Common SVM Basis Functions

• Polynomials of the attributes X1, ..., Xd up to a certain max degree
• Radial basis function
  – Symmetric around a center c, i.e., KernelFunction(|X − c| / kernelWidth)
• Sigmoid function of X, e.g., hyperbolic tangent
• Let Φ(x) denote the transformed input record
  – Previous example: Φ( (x) ) = (x, x²)

Quadratic Basis Functions

• Φ(x) = ( 1;  √2·x1, √2·x2, ..., √2·xd;  x1², x2², ..., xd²;  √2·x1·x2, √2·x1·x3, ..., √2·x(d−1)·xd )
  – Constant term, linear terms, pure quadratic terms, and quadratic cross-terms
• Number of terms (assuming d input attributes): (d+2)-choose-2 = (d+2)(d+1)/2 ≈ d²/2
• Why did we choose this specific transformation?

Dual QP With Basis Functions

• Maximize  Σ_{k=1..n} αk − (1/2)·Σ_{k=1..n} Σ_{l=1..n} αk·αl·y(k)·y(l)·( Φ(x(k))·Φ(x(l)) )
• Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1..n} αk·y(k) = 0
• Then define:
  – w = Σ_{k=1..n} αk·y(k)·Φ(x(k))
  – b = AVG over {k : 0 < αk < C} of ( 1/y(k) − Φ(x(k))·w )
• Then classify with: f(x, w, b) = sign(w·Φ(x) + b)

Computation Challenge

• The input vector x has d components (its d attribute values)
• The transformed input vector Φ(x) has about d²/2 components
• Hence computing Φ(x(k))·Φ(x(l)) now costs on the order of d²/2 instead of d operations (additions, multiplications)
• ...or is there a better way to do this?
  – Take advantage of properties of certain transformations

Quadratic Dot Products

• Expanding term by term:
  Φ(a)·Φ(b) = 1 + 2·Σ_i ai·bi + Σ_i ai²·bi² + 2·Σ_i Σ_{j>i} ai·aj·bi·bj
• Now consider another function of a and b:
  (a·b + 1)² = (a·b)² + 2·(a·b) + 1
             = (Σ_i ai·bi)² + 2·Σ_i ai·bi + 1
             = Σ_i Σ_j ai·bi·aj·bj + 2·Σ_i ai·bi + 1
             = Σ_i (ai·bi)² + 2·Σ_i Σ_{j>i} ai·bi·aj·bj + 2·Σ_i ai·bi + 1

Quadratic Dot Products

• The results of Φ(a)·Φ(b) and of (a·b + 1)² are identical
• Computing Φ(a)·Φ(b) costs about d²/2 operations, while computing (a·b + 1)² costs only about d+2 operations
• This means that we can work in the high-dimensional space (d²/2 dimensions), where the training records are more easily separable, but pay about the same cost as working in the original space (d dimensions)
• The savings are even greater when dealing with higher-degree polynomials, i.e., degree q > 2, which can be computed as (a·b + 1)^q

Any Other Computation Problems?

• What about computing w? We finally need f(x, w, b) = sign(w·Φ(x) + b):
  – w·Φ(x) = Σ_{k=1..n} αk·y(k)·( Φ(x(k))·Φ(x) )
  – This can be computed using the same trick as before
• The same trick can be applied to b as well, because
  Φ(x(k))·w = Σ_{j=1..n} αj·y(j)·( Φ(x(k))·Φ(x(j)) )
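A quick numeric sanity check (not from the slides) of the identity Φ(a)·Φ(b) = (a·b + 1)², using the quadratic basis expansion with the √2 scaling shown above; the vectors are arbitrary.

    import numpy as np
    from itertools import combinations

    def phi(x):
        # Quadratic basis: constant, sqrt(2)*linear, squares, sqrt(2)*cross-terms
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
        return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

    a = np.array([0.3, -1.2, 2.0])
    b = np.array([1.5, 0.4, -0.7])

    print(phi(a) @ phi(b))      # explicit dot product in the transformed space
    print((a @ b + 1.0) ** 2)   # kernel evaluation in the original space (same value)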

SVM Kernel Functions

• For which transformations, called kernels, does the same trick work?
• Polynomial: K(a, b) = (a·b + 1)^q
• Radial-Basis-style (RBF): K(a, b) = exp( −(a − b)² / (2σ²) )
• Neural-net-style sigmoidal: K(a, b) = tanh(κ·a·b − δ)
• q, σ, κ, and δ are magic parameters that must be chosen by a model selection method.

Overfitting

• With the right kernel function, computation in the high-dimensional transformed space is no problem
• But what about overfitting? There seem to be so many parameters...
• Usually not a problem, due to the maximum margin approach
  – Only the support vectors determine the model, hence SVM complexity depends on the number of support vectors, not the dimensionality (still, in higher dimensions there might be more support vectors)
  – Minimizing w·w discourages extremely large weights, which smoothes the function (recall weight decay for neural networks!)

Different Kernels

(Figure; source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Multi-Class Classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2).
• With output arity N, learn N SVMs:
  – SVM 1 learns "Output==1" vs. "Output != 1"
  – SVM 2 learns "Output==2" vs. "Output != 2"
  – ...
  – SVM N learns "Output==N" vs. "Output != N"
• Predict with each SVM and find out which one puts the prediction the furthest into the positive region.

Why Is SVM Effective on High-Dimensional Data?

• The complexity of the trained classifier is characterized by the number of support vectors, not the dimensionality of the data
• If all other training records were removed and training repeated, the same separating hyperplane would be found
• The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

SVM vs. Neural Network

• SVM
  – Relatively new concept
  – Deterministic algorithm
  – Nice generalization properties
  – Hard to train – learned in batch mode using quadratic programming techniques
  – Using kernels can learn very complex functions
• Neural Network
  – Relatively old
  – Nondeterministic algorithm
  – Generalizes well but doesn't have a strong mathematical foundation
  – Can easily be learned in incremental fashion
  – To learn complex functions—use multilayer perceptron (not that trivial)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods

What Is Prediction?

• Essentially the same as classification, but the output is continuous, not discrete
  – Construct a model, then use the model to predict a continuous output value for a given input
• Major method for prediction: regression
  – Many variants of regression exist in the literature; not covered in this class
• Neural networks and k-NN can do regression "out-of-the-box"
• SVMs for regression exist
• What about trees?

Regression Trees and Model Trees

• Regression tree: proposed in the CART system (Breiman et al. 1984)
  – CART: Classification And Regression Trees
  – Each leaf stores a continuous-valued prediction
    • The average output value of the training records in the leaf
• Model tree: proposed by Quinlan (1992)
  – Each leaf holds a regression model—a multivariate linear equation
• Training: like for classification trees, but uses variance instead of a purity measure for selecting split predicates (see the sketch after the overview)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods
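A minimal sketch (not from the slides) of a CART-style regression tree using scikit-learn's DecisionTreeRegressor, which splits to reduce squared error (variance) and predicts the leaf average; the data is synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # noisy continuous target

    tree = DecisionTreeRegressor(max_depth=3)   # split criterion: squared error reduction
    tree.fit(X, y)

    print(tree.predict([[2.0], [5.0], [8.0]]))  # each prediction = mean y of the matching leaf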

Classifier Accuracy Measures

  True class \ Predicted class   buy_computer = yes   buy_computer = no   total
  buy_computer = yes             6954                 46                  7000
  buy_computer = no              412                  2588                3000
  total                          7366                 2634                10000

• Accuracy of a classifier M, acc(M): percentage of test records that are correctly classified by M
  – Error rate (misclassification rate) of M = 1 − acc(M)
  – Given m classes, CM[i,j], an entry in a confusion matrix, indicates the number of records in class i that are labeled by the classifier as class j
• For two classes:

  Actual \ Predicted   C1                C2
  C1                   True positive     False negative
  C2                   False positive    True negative

Precision and Recall

• Precision: measure of exactness
  – t-pos / (t-pos + f-pos)
• Recall: measure of completeness
  – t-pos / (t-pos + f-neg)
• F-measure: combination of precision and recall
  – 2 · precision · recall / (precision + recall)
• Note: Accuracy = (t-pos + t-neg) / (t-pos + t-neg + f-pos + f-neg)
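A small worked check (plain Python) of these measures on the buy_computer confusion matrix above, treating "yes" as the positive class:

    t_pos, f_neg = 6954, 46     # actual yes: predicted yes / predicted no
    f_pos, t_neg = 412, 2588    # actual no:  predicted yes / predicted no

    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    precision = t_pos / (t_pos + f_pos)
    recall = t_pos / (t_pos + f_neg)
    f_measure = 2 * precision * recall / (precision + recall)

    print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} F={f_measure:.4f}")
    # accuracy=0.9542 precision=0.9441 recall=0.9934 F=0.9681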

Limitation of Accuracy

• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example
• Always predicting the majority class defines the baseline
  – A good classifier should do better than the baseline

Cost-Sensitive Measures: Cost Matrix

  Actual \ Predicted   Class=Yes      Class=No
  Class=Yes            C(Yes|Yes)     C(No|Yes)
  Class=No             C(Yes|No)      C(No|No)

• C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification

  Cost matrix C(i|j):   Predicted +   Predicted −
    Actual +            -1            100
    Actual −             1            0

  Model M1:             Predicted +   Predicted −
    Actual +            150           40
    Actual −            60            250
  Accuracy = 80%, Cost = 3910

  Model M2:             Predicted +   Predicted −
    Actual +            250           45
    Actual −            5             200
  Accuracy = 90%, Cost = 4255

Prediction Error Measures

• Continuous output: it matters how far off the prediction is from the true value
• Loss function: distance between y and the predicted value y'
  – Absolute error: |y − y'|
  – Squared error: (y − y')²
• Test error (generalization error): average loss over the test set
  – Mean absolute error: (1/n)·Σ_{i=1..n} |y(i) − y'(i)|
  – Mean squared error: (1/n)·Σ_{i=1..n} (y(i) − y'(i))²
  – Relative absolute error: Σ_{i=1..n} |y(i) − y'(i)| / Σ_{i=1..n} |y(i) − ȳ|
  – Relative squared error: Σ_{i=1..n} (y(i) − y'(i))² / Σ_{i=1..n} (y(i) − ȳ)²
• Squared error exaggerates the presence of outliers
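A short sketch (invented values) computing the four prediction error measures defined above:

    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # invented true values
    y_pred = np.array([2.5, 5.5, 2.0, 8.0, 4.0])   # invented predictions

    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)

    y_bar = y_true.mean()
    rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_bar))
    rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_bar) ** 2)

    print(mae, mse, rae, rse)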

Evaluating a Classifier or Predictor

• Holdout method
  – The given data set is randomly partitioned into two sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  – Can repeat holdout multiple times
    • Accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  – Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  – In the i-th iteration, use D_i as the test set and the others as the training set
  – Leave-one-out: k folds where k = number of records
    • Expensive, often results in high variance of the performance metric

Learning Curve

• Accuracy versus sample size (figure)
• Effect of small sample size:
  – Bias in the estimate
  – Variance of the estimate
• Helps determine how much training data is needed
  – Still need enough test and validation data to be representative of the distribution
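A sketch (scikit-learn, synthetic data) of the holdout and 10-fold cross-validation estimates described above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)

    # Holdout: 2/3 training, 1/3 test
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

    # 10-fold cross-validation: average accuracy over the 10 test folds
    cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()

    print(f"holdout accuracy = {holdout_acc:.3f}, 10-fold CV accuracy = {cv_acc:.3f}")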

ROC (Receiver Operating Characteristic)

• Developed in the 1950s for signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the T-Pos rate (y-axis) against the F-Pos rate (x-axis)
• The performance of each classifier is represented as a point on the ROC curve
  – Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

ROC Curve

• 1-dimensional data set containing 2 classes (positive and negative)
  – Any point located at x > t is classified as positive
  – At threshold t: TPR = 0.5, FPR = 0.12 (figure)

ROC Curve

• (TPR, FPR):
  – (0,0): declare everything to be the negative class
  – (1,1): declare everything to be the positive class
  – (1,0): ideal
• Diagonal line:
  – Random guessing

Diagonal Line for Random Guessing

• Classify a record as positive with fixed probability p, irrespective of its attribute values
• Consider a test set with a positive and b negative records
  – True positives: p·a, hence true positive rate = (p·a)/a = p
  – False positives: p·b, hence false positive rate = (p·b)/b = p
• For every value 0 ≤ p ≤ 1, we get the point (p, p) on the ROC curve

Using ROC for Model Comparison

• Neither model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area under the ROC curve
  – Ideal: area = 1
  – Random guess: area = 0.5

How to Construct an ROC Curve

• Use a classifier that produces a posterior probability P(+|x) for each test record x
• Sort the records according to P(+|x) in decreasing order
• Apply a threshold at each unique value of P(+|x)
  – Count the number of TP, FP, TN, FN at each threshold
  – TP rate, TPR = TP/(TP+FN)
  – FP rate, FPR = FP/(FP+TN)

  Record   P(+|x)   True Class
  1        0.95     +
  2        0.93     +
  3        0.87     -
  4        0.85     -
  5        0.85     -
  6        0.85     +
  7        0.76     -
  8        0.53     +
  9        0.43     -
  10       0.25     +

How To Construct An ROC Curve

  Threshold >=   TP   FP   TN   FN   TPR   FPR
  0.25            5    5    0    0   1.0   1.0
  0.43            4    5    0    1   0.8   1.0
  0.53            4    4    1    1   0.8   0.8
  0.76            3    4    1    2   0.6   0.8
  0.85            3    3    2    2   0.6   0.6
  0.87            2    1    4    3   0.4   0.2
  0.93            2    0    5    3   0.4   0.0
  0.95            1    0    5    4   0.2   0.0
  1.00            0    0    5    5   0.0   0.0

(Figure: the resulting ROC curve, plotting TPR against FPR.)

Test of Significance

• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  – How much confidence can we place on the accuracy of M1 and M2?
  – Can the difference in accuracy be explained as a result of random fluctuations in the test set?
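A sketch (plain Python) reproducing the TPR/FPR columns above from the ten (P(+|x), class) pairs:

    probs = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    pos = labels.count('+')
    neg = labels.count('-')

    for t in sorted(set(probs)) + [1.0]:
        tp = sum(1 for p, c in zip(probs, labels) if p >= t and c == '+')
        fp = sum(1 for p, c in zip(probs, labels) if p >= t and c == '-')
        print(f"threshold>={t:.2f}  TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")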

Confidence Interval for Accuracy

• Classification can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes; for classification these are "correct" or "wrong"
  – A collection of Bernoulli trials has a Binomial distribution
• Probability of getting c correct predictions if the model accuracy is p (= probability of getting a single prediction right):
  C(n, c) · p^c · (1 − p)^(n−c)
• Given c, or equivalently ACC = c/n and n (the number of test records), can we predict p, the true accuracy of the model?

Confidence Interval for Accuracy (cont.)

• Binomial distribution for X = "number of correctly classified test records out of n"
  – E(X) = pn, Var(X) = p(1−p)n
• Accuracy = X / n
  – E(ACC) = p, Var(ACC) = p(1−p)/n
• For large test sets (n > 30), the Binomial distribution is closely approximated by a normal distribution with the same mean and variance
  – ACC has a normal distribution with mean = p and variance = p(1−p)/n
  – P( Z_{α/2} ≤ (ACC − p) / √(p(1−p)/n) ≤ Z_{1−α/2} ) = 1 − α  (area 1 − α between Z_{α/2} and Z_{1−α/2})
• Confidence interval for p:
  p = [ 2n·ACC + Z²_{α/2} ± Z_{α/2}·√( Z²_{α/2} + 4n·ACC − 4n·ACC² ) ] / [ 2(n + Z²_{α/2}) ]

Confidence Interval for Accuracy: Example

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances
  – n = 100, ACC = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table, Z_{α/2} = 1.96

  1 − α:     0.99   0.98   0.95   0.90
  Z_{α/2}:   2.58   2.33   1.96   1.65

  n:          50     100    500    1000   5000
  p(lower):   0.670  0.711  0.763  0.774  0.789
  p(upper):   0.888  0.866  0.833  0.824  0.811

Comparing Performance of Two Models

• Given two models M1 and M2, which is better?
  – M1 is tested on D1 (size = n1), found error rate = e1
  – M2 is tested on D2 (size = n2), found error rate = e2
  – Assume D1 and D2 are independent
  – If n1 and n2 are sufficiently large, then err1 ~ N(μ1, σ1) and err2 ~ N(μ2, σ2)
  – Estimate: μ̂_i = e_i and σ̂_i² = e_i·(1 − e_i) / n_i
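A sketch (numpy only) evaluating the confidence-interval formula above for ACC = 0.8; it should reproduce the p(lower)/p(upper) rows of the table.

    import numpy as np

    acc, z = 0.8, 1.96     # Z_{alpha/2} for 1 - alpha = 0.95

    for n in [50, 100, 500, 1000, 5000]:
        center = 2 * n * acc + z ** 2
        half = z * np.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
        denom = 2 * (n + z ** 2)
        print(n, round((center - half) / denom, 3), round((center + half) / denom, 3))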

Testing Significance of Accuracy Difference

• Consider the random variable d = err1 − err2
  – Since err1 and err2 are normally distributed, so is their difference
  – Hence d ~ N(d_t, σ_t), where d_t is the true difference
• Estimator for d_t:
  – E[d] = E[err1 − err2] = E[err1] − E[err2] ≈ e1 − e2
  – Since D1 and D2 are independent, the variances add up:
    σ̂_t² = σ̂_1² + σ̂_2² = e1·(1 − e1)/n1 + e2·(1 − e2)/n2
  – At the (1 − α) confidence level, d_t = E[d] ± Z_{α/2}·σ̂_t

An Illustrative Example

• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• E[d] = |e1 − e2| = 0.1
• 2-sided test: d_t = 0 versus d_t ≠ 0
  – σ̂_t² = 0.15·(1 − 0.15)/30 + 0.25·(1 − 0.25)/5000 = 0.0043
• At the 95% confidence level, Z_{α/2} = 1.96
  – d_t = 0.100 ± 1.96·√0.0043 = 0.100 ± 0.128
• The interval contains zero, hence the difference may not be statistically significant
• But: the null hypothesis (d_t = 0) may still be rejected at a lower confidence level
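A sketch (numpy) of this two-model significance test, using the example numbers above:

    import numpy as np

    n1, e1 = 30, 0.15      # model M1: test set size and error rate
    n2, e2 = 5000, 0.25    # model M2
    z = 1.96               # 95% confidence

    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half_width = z * np.sqrt(var_d)

    print(f"d = {d:.3f} +/- {half_width:.3f}")
    # The interval (-0.028, 0.228) contains 0, so the difference may not be significant.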

Significance Test for K-Fold Cross-Validation

• Each learning algorithm produces k models:
  – L1 produces M11, M12, ..., M1k
  – L2 produces M21, M22, ..., M2k
• Both models are tested on the same test sets D1, D2, ..., Dk
  – For each test set, compute d_j = e_{1,j} − e_{2,j}
  – For large enough k, d_j is normally distributed with mean d_t and variance σ_t
  – Estimate: σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / (k·(k − 1))
  – Then d_t = d̄ ± t_{1−α, k−1}·σ̂_t
    • t-distribution: get the coefficient t_{1−α, k−1} from a table by looking up the confidence level (1 − α) and the degrees of freedom (k − 1)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods
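A sketch (numpy and scipy, with invented per-fold error rates) of the k-fold significance test above; the 0.975 quantile is used for a two-sided 95% confidence level.

    import numpy as np
    from scipy.stats import t

    e1 = np.array([0.20, 0.18, 0.22, 0.19, 0.21, 0.17, 0.23, 0.20, 0.18, 0.22])  # learner 1
    e2 = np.array([0.22, 0.21, 0.21, 0.23, 0.22, 0.20, 0.25, 0.21, 0.20, 0.24])  # learner 2

    d = e1 - e2                        # per-fold error differences
    k = len(d)
    d_bar = d.mean()
    sigma_t = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))

    t_coef = t.ppf(0.975, df=k - 1)    # t coefficient for k-1 degrees of freedom
    print(f"d_t = {d_bar:.4f} +/- {t_coef * sigma_t:.4f}")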

Ensemble Methods

• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

General Idea

• Step 1: Create multiple data sets D1, D2, ..., Dt from the original training data D
• Step 2: Build multiple classifiers C1, C2, ..., Ct
• Step 3: Combine the classifiers into C*

Why Does It Work?

• Consider a 2-class problem
• Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume the classifiers are independent
• Return the majority vote of the 25 classifiers
  – Probability that the ensemble classifier makes a wrong prediction (i.e., that 13 or more of the 25 classifiers are wrong):
    Σ_{i=13..25} C(25, i)·ε^i·(1 − ε)^(25−i) ≈ 0.06

Base Classifier vs. Ensemble Error

(Figure: ensemble error as a function of the base classifier error rate.)
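A sketch (plain Python) evaluating the majority-vote error probability above:

    from math import comb

    eps, n = 0.35, 25
    ensemble_error = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
                         for i in range(13, n + 1))
    print(round(ensemble_error, 3))   # about 0.06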

Model Averaging and Bias-Variance Tradeoff

• Single model: lowering bias will usually increase variance
  – A "smoother" model has lower variance but might not model the function well enough
• Ensembles can overcome this problem:
  1. Let the models overfit
     • Low bias, high variance
  2. Take care of the variance problem by averaging many of these models
• This is the basic idea behind bagging

Bagging: Bootstrap Aggregation

• Given a training set with n records, sample n records randomly with replacement

  Original Data       1  2  3  4  5  6  7  8  9  10
  Bagging (Round 1)   7  8  10 8  2  5  10 10 5  9
  Bagging (Round 2)   1  4  9  1  2  3  2  7  3  2
  Bagging (Round 3)   1  8  5  10 5  5  9  6  3  7

• Train a classifier for each bootstrap sample
• Note: each training record has probability 1 − (1 − 1/n)^n of being selected at least once in a sample of size n
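A sketch (numpy) of bootstrap sampling and the 1 − (1 − 1/n)^n inclusion probability, which approaches about 0.632 for large n:

    import numpy as np

    n, rounds = 1000, 2000
    rng = np.random.default_rng(0)

    hits = 0
    for _ in range(rounds):
        sample = rng.integers(0, n, size=n)   # sample n records with replacement
        hits += 0 in sample                   # did record 0 appear at least once?

    print(hits / rounds)                # empirical inclusion frequency
    print(1 - (1 - 1 / n) ** n)         # theoretical value, about 0.632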

Bagged Trees

• Create k trees from the training data
  – Bootstrap sample, grow large trees
• Design goal: independent models, high variability between models
• Ensemble prediction = average of the individual tree predictions (or majority vote):
  C*(x) = (1/k)·C1(x) + (1/k)·C2(x) + ... + (1/k)·Ck(x)
• Works the same way for other classifiers

Typical Result

(Figures; the following "Typical Result" slides contain only plots.)


Bagging Challenges

• Ideal case: all models are independent of each other
• Train on independent data samples
  – Problem: limited amount of training data
    • The training set needs to be representative of the data distribution
  – Bootstrap sampling allows creation of many "almost" independent training sets
• Diversify the models, because a similar sample might result in a similar tree
  – Limit the choice of split attributes to a small random subset of attributes (a new selection of the subset for each node) when training a tree, as in random forests
  – Use different model types in the same ensemble: tree, ANN, SVM, regression models

Additive Grove

• Ensemble technique for predicting continuous output
• Instead of individual trees, train additive models
  – Prediction of a single Grove model = sum of its tree predictions
  – Prediction of the ensemble = average of the individual Grove predictions
• Combines large trees and additive models
  – Challenge: how to train the additive models without having the first trees fit the training data too well
    • The next tree is trained on the residuals of the previously trained trees in the same Grove model
    • If the previously trained trees capture the training data too well, the next tree is mostly trained on noise
• Ensemble prediction = (1/k)·(sum of trees in Grove 1) + ... + (1/k)·(sum of trees in Grove k)

Training Groves

(Figure: Grove models built from progressively larger trees.)

Typical Grove Performance

• Root mean squared error
  – Lower is better
• Horizontal axis: tree size
  – Fraction of training data at which to stop splitting (0.5 down to 0)
• Vertical axis: number of trees in each single Grove model (1 to 10)
• 100 bagging iterations
(Figure: performance surface; a value of 0.13 is marked.)

Boosting

• Iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  – Initially, all n records are assigned equal weights
  – Record weights may change at the end of each boosting round
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

  Original Data        1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1)   7  3  2  8  7  9  4  10 6  3
  Boosting (Round 2)   5  4  9  4  2  5  1  7  4  2
  Boosting (Round 3)   4  4  8  10 4  5  4  6  3  4

• Example: assume record 4 is hard to classify
  – Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Example: AdaBoost

• Base classifiers: C1, C2, ..., CT
• Error rate of classifier Ci (n training records; the weights w_j sum to 1):
  ε_i = Σ_{j=1..n} w_j·δ( C_i(x_j) ≠ y_j )
• Importance of a classifier:
  α_i = (1/2)·ln( (1 − ε_i) / ε_i )

AdaBoost Details

• Weight update:
  w_j^(i+1) = ( w_j^(i) / Z_i )·exp(−α_i)  if C_i(x_j) = y_j
  w_j^(i+1) = ( w_j^(i) / Z_i )·exp(α_i)   if C_i(x_j) ≠ y_j
  where Z_i is a normalization factor that ensures the weights add up to 1
• Weights are initialized to 1/n
• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
• Final classification:
  C*(x) = arg max_y Σ_{i=1..T} α_i·δ( C_i(x) = y )
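A compact sketch (numpy plus scikit-learn decision stumps, invented data) of the AdaBoost loop defined above; it uses the reweighting form rather than the resampling variant mentioned on the slide.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # invented labels in {-1, +1}

    n, T = len(X), 10
    w = np.full(n, 1.0 / n)                         # initial weights 1/n
    stumps, alphas = [], []

    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))               # weighted error rate
        if eps >= 0.5:                              # revert weights if the round is too weak
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / eps)       # importance of this classifier
        w = w * np.exp(-alpha * y * pred)           # decrease if correct, increase if wrong
        w = w / w.sum()                             # normalize (Z_i)
        stumps.append(stump)
        alphas.append(alpha)

    # Final classification: weighted vote of the base classifiers
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    print("training accuracy:", np.mean(np.sign(votes) == y))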

Illustrating AdaBoost

(Figures: a one-dimensional +/− data set over three boosting rounds B1, B2, B3; initially every point has weight 0.1, misclassified points receive larger weights in the next round, and the rounds are combined with α ≈ 1.9459, 2.9323, and 3.8744. Note: the numbers appear to be wrong, but they convey the right idea.)

Bagging vs. Boosting

• Analogy
  – Bagging: diagnosis based on multiple doctors' majority vote
  – Boosting: weighted vote, based on the doctors' previous diagnosis accuracy
• Sampling procedure
  – Bagging: records have the same weight; easy to train in parallel
  – Boosting: weights a record higher if the model predicts it wrong; inherently sequential process
• Overfitting
  – Bagging is robust against overfitting
  – Boosting is susceptible to overfitting: make sure the individual models do not overfit
• Accuracy usually significantly better than a single classifier
  – The best boosted model is often better than the best bagged model
• Additive Grove
  – Combines the strengths of bagging and boosting (additive models)
  – Shown empirically to make better predictions on many data sets
  – Training is more tricky, especially when the data is very noisy

Classification/Prediction Summary

• Forms of data analysis that can be used to train models from data and then make predictions for new records
• Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian networks, rule-based classifiers, Backpropagation, Support Vector Machines (SVMs), nearest neighbor classifiers, and many other classification methods
• Regression models are popular for prediction. Regression trees, model trees, and ANNs are also used for prediction.

Classification/Prediction Summary

• K-fold cross-validation is a popular method for accuracy estimation, but determining accuracy on a large test set is equally accepted
  – If test sets are large enough, a significance test for finding the best model is not necessary
• Area under the ROC curve and many other common performance measures exist
• Ensemble methods like bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models
  – Often state-of-the-art in prediction quality, but expensive to train, store, and use
• No single method is superior over all others for all data sets
  – Issues such as accuracy, training and prediction time, robustness, interpretability, and scalability must be considered and can involve trade-offs

