Classification and Prediction Techniques

Mirek Riedewald
Some slides based on presentations by Han/Kamber/Pei, Tan/Steinbach/Kumar, and Andrew Moore

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods

Classification vs. Prediction

• Assumption: after data preparation, we have a data set where each record has attributes X1,…,Xn, and Y.
• Goal: learn a function f:(X1,…,Xn)→Y, then use this function to predict y for a given input record (x1,…,xn).
  – Classification: Y is a discrete attribute, called the class label
    • Usually a categorical attribute with small domain
  – Prediction: Y is a continuous attribute
• Called supervised learning, because true labels (Y-values) are known for the initially provided data
• Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Induction: Model Construction

• Training data (NAME, RANK, YEARS, TENURED):
  Mike, Assistant Prof, 3, no
  Mary, Assistant Prof, 7, yes
  Bill, Professor, 2, yes
  Jim, Associate Prof, 7, yes
  Dave, Assistant Prof, 6, no
  Anne, Associate Prof, 3, no
• Learned model (function): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Deduction: Using the Model

• The learned model (function) is applied to test data and then to unseen data.
• Test data (NAME, RANK, YEARS, TENURED):
  Tom, Assistant Prof, 2, no
  Merlisa, Associate Prof, 7, no
  George, Professor, 5, yes
  Joseph, Assistant Prof, 7, yes
• Unseen record: (Jeff, Professor, 4) → Tenured?

Classification and Prediction Overview (agenda repeated; next topic: Decision Trees)

Example of a Decision Tree

• Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
  1, Yes, Single, 125K, No
  2, No, Married, 100K, No
  3, No, Single, 70K, No
  4, Yes, Married, 120K, No
  5, No, Divorced, 95K, Yes
  6, No, Married, 60K, No
  7, Yes, Divorced, 220K, No
  8, No, Single, 85K, Yes
  9, No, Married, 75K, No
  10, No, Single, 90K, Yes
• Model: decision tree with splitting attributes Refund, MarSt, TaxInc:
  – Refund = Yes → NO; Refund = No → test MarSt
  – MarSt = Married → NO; MarSt = Single or Divorced → test TaxInc
  – TaxInc < 80K → NO; TaxInc > 80K → YES

Another Example of Decision Tree

• A different tree for the same training data:
  – MarSt = Married → NO; MarSt = Single or Divorced → test Refund
  – Refund = Yes → NO; Refund = No → test TaxInc
  – TaxInc < 80K → NO; TaxInc > 80K → YES
• There could be more than one tree that fits the same data!

Apply Model to Test Data

• Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
• Start from the root of the tree and follow the branch that matches the record's attribute value at each node:
  – Refund = No → go to the MarSt node
  – MarSt = Married → reach the leaf labeled NO
• Assign Cheat to "No"
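As a quick illustration (not from the slides), the tree above can be written as nested conditionals; the sketch below applies it to the test record (Refund = No, Married, 80K):

```python
def classify(record):
    """Decision tree from the slides: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: check taxable income (in thousands)
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))   # -> "No", matching "Assign Cheat to 'No'"
```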

Decision Tree Induction

• Basic greedy algorithm
  – Top-down, recursive divide-and-conquer
  – At start, all the training records are at the root
  – Training records partitioned recursively based on split attributes
  – Split attributes selected based on a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – Pure node (all records belong to same class)
  – No remaining attributes for further partitioning: majority voting for classifying the leaf
  – No cases left

Decision Boundary

• (Figure: records in the unit square, partitioned by the tree X1 < 0.43? → (X2 < 0.47?, X2 < 0.33?); each leaf region contains records of a single class.)
• Decision boundary = border between two neighboring regions of different classes.
• For trees that split on a single attribute at a time, the decision boundary is parallel to the axes.

Oblique Decision Trees

• Test condition may involve multiple attributes, e.g., x + y < 1 separating class + from class –
• More expressive representation
• Finding optimal test condition is computationally expensive

How to Specify Split Condition?

• Depends on attribute types
  – Nominal
  – Ordinal
  – Numeric (continuous)
• Depends on number of ways to split
  – 2-way split
  – Multi-way split

Splitting Nominal Attributes

• Multi-way split: use as many partitions as distinct values, e.g., CarType → Family | Sports | Luxury
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g., CarType → {Sports, Luxury} vs. {Family}, OR {Family, Luxury} vs. {Sports}

Splitting Ordinal Attributes

• Multi-way split, e.g., Size → Small | Medium | Large
• Binary split, e.g., Size → {Small, Medium} vs. {Large}, OR {Medium, Large} vs. {Small}
• What about this split: Size → {Small, Large} vs. {Medium}? (It ignores the ordering of the ordinal attribute.)

Splitting Continuous Attributes

• Different options
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  – Binary decision: (A < v) or (A ≥ v)
    • Consider all possible splits, choose the best one
• Example for Taxable Income:
  (i) Binary split: Taxable Income > 80K? Yes / No
  (ii) Multi-way split: < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

How to Determine Best Split

• Before splitting: 10 records of class 0, 10 records of class 1
• Candidate split attributes: Own Car? (Yes/No), Car Type? (Family/Sports/Luxury), Student ID? (c1,…,c20)
• Which test condition is the best? (Figure: class distributions C0/C1 in the children of each candidate split.)
• Greedy approach: nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
  – Non-homogeneous class distribution (classes roughly balanced): high degree of impurity
  – Homogeneous class distribution (one class dominates): low degree of impurity

Attribute Selection Measure: Information Gain

• Select attribute with highest information gain
• pi = probability that an arbitrary record in D belongs to class Ci, i = 1,…,m
• Expected information (entropy) needed to classify a record in D:
  Info(D) = −Σ_{i=1..m} pi log2(pi)
• Information needed after using attribute A to split D into v partitions D1,…,Dv:
  InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) · Info(Dj)
• Information gained by splitting on attribute A:
  GainA(D) = Info(D) − InfoA(D)

Example

• Predict if somebody will buy a computer
• Given data set (Age, Income, Student, Credit_rating, Buys_computer):
  ≤30, High, No, Bad, No
  ≤30, High, No, Good, No
  31…40, High, No, Bad, Yes
  >40, Medium, No, Bad, Yes
  >40, Low, Yes, Bad, Yes
  >40, Low, Yes, Good, No
  31…40, Low, Yes, Good, Yes
  ≤30, Medium, No, Bad, No
  ≤30, Low, Yes, Bad, Yes
  >40, Medium, Yes, Bad, Yes
  ≤30, Medium, Yes, Good, Yes
  31…40, Medium, No, Good, Yes
  31…40, High, Yes, Bad, Yes
  >40, Medium, No, Good, No

Information Gain Example

• Class P: buys_computer = "yes"; class N: buys_computer = "no"
• Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
• Per age group (#yes, #no, I(#yes, #no)): ≤30: (2, 3, 0.971); 31…40: (4, 0, 0); >40: (3, 2, 0.971)
• Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  – (5/14) I(2,3) means "age ≤ 30" has 5 out of 14 samples, with 2 yes'es and 3 no's; similar for the other terms
• Hence Gainage(D) = Info(D) − Infoage(D) = 0.246
• Similarly, Gainincome(D) = 0.029, Gainstudent(D) = 0.151, Gaincredit_rating(D) = 0.048
• Therefore we choose age as the splitting attribute

Gain Ratio for Attribute Selection

• Information gain is biased towards attributes with a large number of values
• Use gain ratio to normalize information gain:
  – GainRatioA(D) = GainA(D) / SplitInfoA(D)
  – SplitInfoA(D) = −Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|)
• E.g., SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557
  – GainRatioincome(D) = 0.029 / 1.557 = 0.019
• Attribute with maximum gain ratio is selected as splitting attribute

Gini Index

• Gini index of data set D: gini(D) = 1 − Σ_{i=1..m} pi²
• If data set D is split on A into v subsets D1,…,Dv, the gini index giniA(D) is defined as
  giniA(D) = Σ_{j=1..v} (|Dj| / |D|) · gini(Dj)
• Reduction in impurity: ΔginiA(D) = gini(D) − giniA(D)
• Attribute that provides smallest giniA(D) (= largest reduction in impurity) is chosen to split the node

Comparing Attribute Selection Measures

• No clear winner (and there are many more)
  – Information gain: biased towards multivalued attributes
  – Gain ratio: tends to prefer unbalanced splits where one partition is much smaller than the others
  – Gini index: biased towards multivalued attributes; tends to favor tests that result in equal-sized partitions and purity in both partitions
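A minimal Python sketch of these impurity measures, applied to the buys_computer data from the previous slides. The function and variable names are my own, not from the slides; the printed gains should reproduce the 0.246 / 0.029 / 0.151 / 0.048 values of the worked example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(records, attr, target="Buys_computer"):
    labels = [r[target] for r in records]
    before = entropy(labels)
    after = 0.0
    for value in set(r[attr] for r in records):
        part = [r[target] for r in records if r[attr] == value]
        after += len(part) / len(records) * entropy(part)
    return before - after

# buys_computer data (Age, Income, Student, Credit_rating, Buys_computer)
rows = [("<=30","High","No","Bad","No"), ("<=30","High","No","Good","No"),
        ("31..40","High","No","Bad","Yes"), (">40","Medium","No","Bad","Yes"),
        (">40","Low","Yes","Bad","Yes"), (">40","Low","Yes","Good","No"),
        ("31..40","Low","Yes","Good","Yes"), ("<=30","Medium","No","Bad","No"),
        ("<=30","Low","Yes","Bad","Yes"), (">40","Medium","Yes","Bad","Yes"),
        ("<=30","Medium","Yes","Good","Yes"), ("31..40","Medium","No","Good","Yes"),
        ("31..40","High","Yes","Bad","Yes"), (">40","Medium","No","Good","No")]
names = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
records = [dict(zip(names, r)) for r in rows]

for attr in names[:-1]:
    print(attr, round(info_gain(records, attr), 3))
# Age has the highest gain (about 0.246), so it is chosen as the split attribute.
```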

Practical Issues of Classification

• Underfitting and overfitting
• Missing values
• Computational cost
• Expressiveness

How Good is the Model?

• Training set error: compare prediction of training record with true value
  – Not a good measure for the error on unseen data (discussed soon)
• Test set error: for records that were not used for training, compare model prediction and true value
  – Use holdout data from available data set

Training versus Test Set Error

• We'll create a training dataset: five inputs a, b, c, d, e, all bits, generated in all 32 possible combinations
• Output y = copy of e, except a random 25% of the records have y set to the opposite of e
• (Table: the 32 records (a, b, c, d, e, y), from (0,0,0,0,0) to (1,1,1,1,1).)

Test Data

• Generate test data using the same method: y = copy of e, with a random 25% inverted; done independently from the previous noise process
• Some y's that were corrupted in the training set will be uncorrupted in the test set
• Some y's that were uncorrupted in the training set will be corrupted in the test set
• (Table: the 32 records with both the training y and the independently corrupted test y.)

Full Tree for the Training Data

• The full tree splits on e at the root, then on a, and so on, until each leaf contains exactly one record; hence no error in predicting the training data!
• 25% of these leaf node labels will be corrupted

Testing the Tree with the Test Set

• 1/4 of the tree's leaf labels are corrupted, 3/4 are fine; independently, 1/4 of the test set records are corrupted, 3/4 are fine:
  – Corrupted test record, corrupted tree node: 1/16 of the test set will be correctly predicted for the wrong reasons
  – Corrupted test record, fine tree node: 3/16 of the test set will be wrongly predicted because the test record is corrupted
  – Fine test record, corrupted tree node: 3/16 of the test predictions will be wrong because the tree node is corrupted
  – Fine test record, fine tree node: 9/16 of the test predictions will be fine
• In total, we expect to be wrong on 3/8 of the test set predictions

What Has This Example Shown Us?

• There is a discrepancy between training and test set error
• But more importantly, it indicates that there is something we should do about it if we want to predict well on future data.

Suppose We Had Less Data

• Same setup: output y = copy of e, except a random 25% of the records have y set to the opposite of e
• But now the other (irrelevant) input bits are hidden from the learner; only e remains usible in the 32 training records

Tree Learned Without Access to the Irrelevant Bits

• The learned tree has only the root split on e (e=0 / e=1); these nodes will be unexpandable
• In about 12 of the 16 records in the e=0 node the output will be 0, so this node will almost certainly predict 0
• In about 12 of the 16 records in the e=1 node the output will be 1, so this node will almost certainly predict 1

Tree Learned Without Access to the Irrelevant Bits (cont.)

• With only the e-split, almost certainly none of the tree nodes are corrupted and all are fine:
  – 1/4 of the test set records are corrupted: these will be wrongly predicted because the test record is corrupted
  – 3/4 are fine: these test predictions will be fine
• In total, we expect to be wrong on only 1/4 of the test set predictions

Typical Observation: Overfitting

• Model M overfits the training data if another model M' exists, such that M has smaller error than M' over the training examples, but M' has smaller error than M over the entire distribution of instances.
• Underfitting: when the model is too simple, both training and test errors are large

Reasons for Overfitting

• Noise
  – Too closely fitting the training data means the model's predictions reflect the noise as well
• Insufficient training data
  – Not enough data to enable the model to generalize beyond idiosyncrasies of the training records
• Data fragmentation (special problem for trees)
  – Number of instances gets smaller as you traverse down the tree
  – Number of instances at a leaf node could be too small to make any confident decision about class

Avoiding Overfitting

• General idea: make the tree smaller
  – Addresses all three reasons for overfitting
• Prepruning: halt tree construction early
  – Do not split a node if this would result in the goodness measure falling below a threshold
  – Difficult to choose an appropriate threshold, e.g., tree for XOR
• Postpruning: remove branches from a "fully grown" tree
  – Use a set of data different from the training data to decide when to stop pruning
  – Validation data: train tree on training data, prune on validation data, then test on test data

Minimum Description Length (MDL)

• Alternative to using validation data
  – Motivation: data mining is about finding regular patterns in data; regularity can be used to compress the data; the method that achieves the greatest compression found the most regularity and hence is best
• Minimize Cost(Model, Data) = Cost(Model) + Cost(Data|Model)
  – Cost is the number of bits needed for encoding
• Cost(Data|Model) encodes the misclassification errors
• Cost(Model) uses node encoding plus splitting condition encoding
• (Figure: a sender with labeled records (X, y) and a receiver with the X-values only; instead of sending all labels, the sender can transmit a small decision tree plus the records it misclassifies.)

MDL-Based Pruning Intuition

• (Figure: cost as a function of tree size. Cost(Model) = model size grows with the tree, while Cost(Data|Model) = model errors shrinks; the total Cost(Model, Data) is minimized at an intermediate tree size, which gives the best tree size.)

Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:
  – How impurity measures are computed
  – How to distribute an instance with a missing value to child nodes
  – How a test instance with a missing value is classified

Distribute Instances

• Training data: the tax-cheat records from before (Tid 1–9 with known Refund), plus record 10 with Refund = ?, Single, 90K, Class = Yes
• Based on the records with known Refund: probability that Refund = Yes is 3/9, probability that Refund = No is 6/9
• Assign record 10 to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9
• Resulting class counts: Refund = Yes: Class=Yes 0 + 3/9, Class=No 3; Refund = No: Class=Yes 2 + 6/9, Class=No 4

Computing Impurity Measure

• Split on Refund; assume records with missing values are distributed as discussed before: 3/9 of record 10 go to Refund = Yes, 6/9 go to Refund = No
• Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.881
• Entropy(Refund=Yes) = −((1/3)/(10/3)) log((1/3)/(10/3)) − (3/(10/3)) log(3/(10/3)) = 0.469
• Entropy(Refund=No) = −((8/3)/(20/3)) log((8/3)/(20/3)) − (4/(20/3)) log(4/(20/3)) = 0.971
• Entropy(Children) = 1/3 · 0.469 + 2/3 · 0.971 = 0.804
• Gain = 0.881 − 0.804 = 0.077

Classify Instances

• New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
• Weighted class counts for Marital Status (including the fractional record 10):
  Married: Class=No 3, Class=Yes 6/9, total 3.67
  Single: Class=No 1, Class=Yes 1, total 2
  Divorced: Class=No 0, Class=Yes 1, total 1
  Total: Class=No 4, Class=Yes 2.67, total 6.67
• Probability that Marital Status = Married is 3.67/6.67
• Probability that Marital Status ∈ {Single, Divorced} is 3/6.67
• The test record is sent down both branches of the MarSt node with these weights, and the leaf predictions are combined accordingly

Tree Cost Analysis

• Finding an optimal decision tree is NP-complete
  – Optimization goal: minimize expected number of binary tests to uniquely identify any record from a given finite set
• Greedy algorithm: O(#attributes · #training_instances · log(#training_instances))
  – At each tree depth, all instances considered
  – Assume tree depth is logarithmic (fairly balanced splits)
  – Need to test each attribute at each node
• What about binary splits?
  – Sort data once on each attribute, use this order to avoid re-sorting subsets
  – Incrementally maintain counts for class distribution as different split points are explored
• In practice, trees are considered to be fast both for training (when using the greedy algorithm) and for making predictions

Tree Expressiveness

• Can represent any finite discrete-valued function
  – But it might not do it very efficiently
  – Example: parity function
    • Class = 1 if there is an even number of Boolean attributes with truth value = True
    • Class = 0 if there is an odd number of Boolean attributes with truth value = True
    • For accurate modeling, must have a complete tree
• Not expressive enough for modeling continuous attributes
  – But we can still use a tree for them in practice; it just cannot accurately represent the true function

Rule Extraction from a Decision Tree

• One rule is created for each path from the root to a leaf
  – Precondition: conjunction of all split predicates of nodes on path
  – Consequent: class prediction from leaf
• Rules are mutually exclusive and exhaustive
• Example: rule extraction from the buys_computer decision tree (age? → student? / yes / credit rating?)
  – IF age = young AND student = no THEN buys_computer = no
  – IF age = young AND student = yes THEN buys_computer = yes
  – IF age = mid-age THEN buys_computer = yes
  – IF age = old AND credit_rating = excellent THEN buys_computer = yes
  – IF age = old AND credit_rating = fair THEN buys_computer = no

Classification in Large Databases

• Scalability: classify data sets with millions of examples and hundreds of attributes with reasonable speed
• Why use decision trees for data mining?
  – Relatively fast learning speed
  – Can handle all attribute types
  – Convertible to intelligible classification rules
  – Good classification accuracy, but not as good as newer methods (but tree ensembles are top!)

Scalable Tree Induction

• High cost when the training data at a node does not fit in memory
• Solution 1: special I/O-aware algorithm
  – Keep only the class list in memory, access attribute values on disk
  – Maintain a separate list for each attribute
  – Use a count matrix for each attribute
• Solution 2: sampling
  – Common solution: train tree on a sample that fits in memory
  – More sophisticated versions of this idea exist, e.g., Rainforest
    • Build the tree on a sample, but do this for many bootstrap samples
    • Combine all into a single new tree that is guaranteed to be almost identical to the one trained from the entire data set
    • Can be computed with two data scans

Tree Conclusions

• Very popular data mining tool
  – Easy to understand
  – Easy to implement
  – Easy to use: little tuning, handles all attribute types and missing values
  – Computationally relatively cheap
• Overfitting problem
• Focused on classification, but easy to extend to prediction (future lecture)

Classification and Prediction Overview (agenda repeated; next topic: Statistical Decision Theory)

Theoretical Results

• Trees make sense intuitively, but can we get some hard evidence and deeper understanding about their properties?
• Statistical decision theory can give some answers
• Need some probability concepts first

Random Variables

• Intuitive version of the definition:
  – Can take on one of possibly many values, each with a certain probability
  – These probabilities define the probability distribution of the random variable
  – E.g., let X be the outcome of a coin toss, then Pr(X='heads') = 0.5 and Pr(X='tails') = 0.5; the distribution is uniform
• Consider a discrete random variable X with numeric values x1,…,xk
  – Expectation: E[X] = Σ xi · Pr(X=xi)
  – Variance: Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

Working with Random Variables

• E[X + Y] = E[X] + E[Y]
• Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X,Y)
• For constants a, b
  – E[aX + b] = a E[X] + b
  – Var(aX + b) = Var(aX) = a² Var(X)
• Iterated expectation:
  – E[Y] = EX[ EY[Y | X] ], where EY[Y | X] = Σ yi · Pr(Y=yi | X=x) is the expectation of Y for a given value x of X, i.e., a function of X
  – In general, for any function f(X,Y): EX,Y[ f(X,Y) ] = EX[ EY[ f(X,Y) | X ] ]
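A tiny numeric check of the expectation and variance formulas, using a fair die as a made-up example (not from the slides):

```python
# Fair die: values 1..6, each with probability 1/6
values = [1, 2, 3, 4, 5, 6]
probs  = [1 / 6] * 6

E_X  = sum(x * p for x, p in zip(values, probs))        # E[X] = sum x_i * Pr(X = x_i)
E_X2 = sum(x * x * p for x, p in zip(values, probs))
Var_X = E_X2 - E_X ** 2                                 # Var(X) = E[X^2] - (E[X])^2

print(E_X, Var_X)   # 3.5 and roughly 2.917
```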

What is the Optimal Model f(X)?

• Let X denote a real-valued random input variable and Y a real-valued random output variable. The squared error of a trained model f(X) is EX,Y[ (Y − f(X))² ].
• Which function f(X) will minimize the squared error?
• Consider the error for a specific value of X and let Ȳ = EY[Y | X]:
  EY[ (Y − f(X))² | X ]
    = EY[ (Y − Ȳ + Ȳ − f(X))² | X ]
    = EY[ (Y − Ȳ)² | X ] + EY[ (Ȳ − f(X))² | X ] + 2 EY[ (Y − Ȳ)(Ȳ − f(X)) | X ]
    = EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))² + 2 (Ȳ − f(X)) EY[ (Y − Ȳ) | X ]
    = EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))²
  (Notice: EY[ (Y − Ȳ) | X ] = EY[Y | X] − EY[Ȳ | X] = Ȳ − Ȳ = 0.)

Optimal Model f(X) (cont.)

• The choice of f(X) does not affect EY[ (Y − Ȳ)² | X ], but (Ȳ − f(X))² is minimized for f(X) = Ȳ = EY[Y | X].
• Note that EX,Y[ (Y − f(X))² ] = EX[ EY[ (Y − f(X))² | X ] ]. Hence
  EX,Y[ (Y − f(X))² ] = EX[ EY[ (Y − Ȳ)² | X ] + (Ȳ − f(X))² ]
• Hence the squared error is minimized by choosing f(X) = EY[Y | X] for every X.
• (Notice that for minimizing the absolute error EX,Y[ |Y − f(X)| ], one can show that the best model is f(X) = median(Y | X).)

Interpreting the Result

• To minimize mean squared error, the best prediction for input X=x is the mean of the Y-values of all training records (x(i), y(i)) with x(i) = x
  – E.g., assume there are training records (5,22), (5,24), (5,26), (5,28). The optimal prediction for input X=5 would be estimated as (22+24+26+28)/4 = 25.
• Problem: to reliably estimate the mean of Y for a given X=x, we need sufficiently many training records with X=x. In practice, often there is only one or no training record at all for an X=x of interest.
  – If there were many such records with X=x, we would not need a model and could just return the average Y for that X=x.
• The benefit of a good data mining technique is its ability to interpolate and extrapolate from known training records to make good predictions even for X-values that do not occur in the training data at all.
• Classification with two classes: encode them as 0 and 1 and use squared error as before
  – Then f(X) = E[Y | X=x] = 1·Pr(Y=1 | X=x) + 0·Pr(Y=0 | X=x) = Pr(Y=1 | X=x)
• Classification with k classes: one can show that for 0-1 loss (error = 0 if correct class, error = 1 if wrong class predicted) the optimal choice is to return the majority class for a given input X=x
  – This is called the Bayes classifier.

Implications for Trees

• Since there are not enough training records with X=x, or none at all, the output for input X=x has to be based on records "in the neighborhood"
  – A tree leaf corresponds to a multi-dimensional range in the data space
  – Records in the same leaf are neighbors of each other
• Solution: estimate the mean of Y for input X=x from the training records in the same leaf node that contains input X=x
  – Classification: leaf returns the majority class or class probabilities (estimated from the fraction of training records in the leaf)
  – Prediction: leaf returns the average of the Y-values or fits a local model
  – Make sure there are enough training records in the leaf to obtain reliable estimates
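A small check (not from the slides) that the mean really minimizes the squared error for the example records (5,22), (5,24), (5,26), (5,28):

```python
ys = [22, 24, 26, 28]                      # Y-values of the training records with X = 5

def sse(prediction):
    """Sum of squared errors of a constant prediction for these records."""
    return sum((y - prediction) ** 2 for y in ys)

best = min(range(20, 31), key=sse)         # try integer predictions 20..30
print(best, sse(best), sse(24))            # 25 is best (SSE 20); e.g. 24 gives SSE 24
```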

Bias-Variance Tradeoff

• Let's take this one step further and see if we can understand overfitting through statistical decision theory
• As before, consider two random variables X and Y
• From a training set D with n records, we want to construct a function f(X) that returns good approximations of Y for future inputs X
  – Make the dependence of f on D explicit by writing f(X; D)
• Goal: minimize mean squared error over all X, Y, and D, i.e., EX,D,Y[ (Y − f(X; D))² ]

Bias-Variance Tradeoff Derivation

• EX,D,Y[ (Y − f(X; D))² ] = EX[ ED[ EY[ (Y − f(X; D))² | X, D ] ] ]. Now consider the inner term:
  ED[ EY[ (Y − f(X; D))² | X, D ] ]
    = ED[ EY[ (Y − E[Y|X])² | X, D ] + (f(X; D) − E[Y|X])² ]   (same derivation as before for the optimal function f(X))
    = EY[ (Y − E[Y|X])² | X ] + ED[ (f(X; D) − E[Y|X])² ]
  (The first term does not depend on D, hence ED[ EY[ (Y − E[Y|X])² | X, D ] ] = EY[ (Y − E[Y|X])² | X ].)
• Consider the second term:
  ED[ (f(X; D) − E[Y|X])² ] = ED[ (f(X; D) − ED[f(X; D)] + ED[f(X; D)] − E[Y|X])² ]
    = ED[ (f(X; D) − ED[f(X; D)])² ] + ED[ (ED[f(X; D)] − E[Y|X])² ] + 2 ED[ (f(X; D) − ED[f(X; D)]) (ED[f(X; D)] − E[Y|X]) ]
    = ED[ (f(X; D) − ED[f(X; D)])² ] + (ED[f(X; D)] − E[Y|X])²
  (The third term is zero, because ED[ f(X; D) − ED[f(X; D)] ] = ED[f(X; D)] − ED[f(X; D)] = 0.)
• Overall we therefore obtain:
  EX,D,Y[ (Y − f(X; D))² ] = EX[ (ED[f(X; D)] − E[Y|X])² + ED[ (f(X; D) − ED[f(X; D)])² ] + EY[ (Y − E[Y|X])² | X ] ]

Bias-Variance Tradeoff and Overfitting

• (ED[f(X; D)] − E[Y|X])²: bias
• ED[ (f(X; D) − ED[f(X; D)])² ]: variance
• EY[ (Y − E[Y|X])² | X ]: irreducible error (does not depend on f and is simply the variance of Y given X)
• Option 1: f(X; D) = E[Y | X, D]
  – Bias: since ED[ E[Y | X, D] ] = E[Y | X], the bias is zero
  – Variance: (E[Y | X, D] − ED[E[Y | X, D]])² = (E[Y | X, D] − E[Y | X])² can be very large, since E[Y | X, D] depends heavily on D
  – Might overfit!
• Option 2: f(X; D) = X (or another function independent of D)
  – Variance: (X − ED[X])² = (X − X)² = 0
  – Bias: (ED[X] − E[Y | X])² = (X − E[Y | X])² can be large, because E[Y | X] might be completely different from X
  – Might underfit!
• Find the best compromise between fitting the training data too closely (option 1) and completely ignoring it (option 2)

Implications for Trees

• Bias decreases as the tree becomes larger
  – A larger tree can fit the training data better
• Variance increases as the tree becomes larger
  – Sample variance affects the predictions of a larger tree more
• Find the right tradeoff as discussed earlier
  – Validation data to find the best pruned tree
  – MDL principle

Classification and Prediction Overview (agenda repeated; next topic: Nearest Neighbor)

Lazy vs. Eager Learning

• Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test record
• Eager learning: given a training set, constructs a classification model before receiving new (test) data to classify
• General trend: lazy = faster training, slower predictions
• Accuracy: not clear which one is better!
  – Lazy method: typically driven by local decisions
  – Eager method: driven by global and local decisions

Nearest-Neighbor

• Recall our statistical decision theory analysis: the best prediction for input X=x is the mean of the Y-values of all records (x(i), y(i)) with x(i) = x (the majority class for classification)
• Problem was to estimate E[Y | X=x], or the majority class for X=x, from the training data
• Solution was to approximate it
  – Use the Y-values from training records in a neighborhood around X=x

Nearest-Neighbor Classifiers

• Requires:
  – Set of stored records
  – Distance metric for pairs of records; common choice: Euclidean, d(p, q) = sqrt( Σi (pi − qi)² )
  – Parameter k: the number of nearest neighbors to retrieve
• To classify a record (the unknown tuple):
  – Find its k nearest neighbors
  – Determine the output based on the (distance-weighted) average of the neighbors' output

Definition of Nearest Neighbor

• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
• (Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record x.)

1-Nearest Neighbor

• (Figure: Voronoi diagram of the training records; each cell is labeled with the class of its training record.)

Nearest Neighbor Classification

• Choosing the value of k:
  – k too small: sensitive to noise points
  – k too large: neighborhood may include points from other classes

Effect of Changing k

• (Figure: decision boundaries for different values of k. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)
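A minimal k-nearest-neighbor sketch with Euclidean distance and majority voting; the toy training records are made up for illustration:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); query: feature vector."""
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]   # majority class of the k neighbors

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3))        # -> "A"
```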

Explaining the Effect of k

• Recall the bias-variance tradeoff
• Small k, i.e., predictions based on few neighbors
  – High variance, low bias
• Large k, e.g., average over the entire data set
  – Low variance, but high bias
• Need to find the k that achieves the best tradeoff
• Can do that using validation data

Experiment

• 50 training points (x, y)
  – −2 ≤ x ≤ 2, selected uniformly at random
  – y = x² + ε, where ε is selected uniformly at random from the range [−0.5, 0.5]
• Test data sets: 500 points from the same distribution as the training data, but with ε = 0
• Plot 1: all (x, NN1(x)) for 5 test sets
• Plot 2: all (x, AVG(NN1(x))), averaged over 200 test data sets
  – Same for NN20 and NN50
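A rough simulation of this experiment (the slides show full plots; this sketch, under the same data-generating assumptions, only estimates the average prediction and its variance of k-NN at one test point over many freshly drawn training sets):

```python
import random

def knn_regress(train, x0, k):
    """Average the y-values of the k training points closest to x0."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def make_training_set(n=50):
    return [(x, x * x + random.uniform(-0.5, 0.5))
            for x in (random.uniform(-2, 2) for _ in range(n))]

x0 = 1.0   # true function value at x0 is 1.0
for k in (1, 20, 50):
    preds = [knn_regress(make_training_set(), x0, k) for _ in range(200)]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    print(f"k={k:2d}  avg prediction={mean:.2f}  variance={var:.3f}")
# Small k: low bias, higher variance; k=50 (the whole data set): low variance, large bias.
```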

(Figures: results of the experiment for NN1, NN20, and NN50, annotated with the bias term (ED[f(X; D)] − E[Y|X])², the variance term ED[(f(X; D) − ED[f(X; D)])²], and the irreducible error EY[(Y − E[Y|X])² | X].)

Scaling Issues

• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
• Example:
  – Height of a person may vary from 1.5m to 1.8m
  – Weight of a person may vary from 90lb to 300lb
  – Income of a person may vary from $10K to $1M
  – The income difference would dominate the record distance
• Irrelevant attributes might dominate distance
  – Solution: eliminate them

Other Problems

• Problem with Euclidean measure:
  – High dimensional data: curse of dimensionality
  – Can produce counter-intuitive results, e.g.,
    1 1 1 1 1 1 1 1 1 1 1 0  vs.  0 1 1 1 1 1 1 1 1 1 1 1 : d = 1.4142
    1 0 0 0 0 0 0 0 0 0 0 0  vs.  0 0 0 0 0 0 0 0 0 0 0 1 : d = 1.4142
  – Solution: normalize the vectors to unit length

Computational Cost

• Brute force: O(#trainingRecords)
  – For each training record, compute the distance to the test record; keep it if it is among the top-k
• Pre-compute the Voronoi diagram (expensive), then search a spatial index of the Voronoi cells: if lucky, O(log(#trainingRecords))
• Store the training records in a multi-dimensional search tree, e.g., an R-tree: if lucky, O(log(#trainingRecords))
• Bulk-compute predictions for many test records using a spatial join between the training and test sets
  – Same worst-case cost as one-by-one predictions, but usually much faster in practice

Classification and Prediction Overview (agenda repeated; next topic: Bayesian Classification)

Bayesian Classification

• Performs probabilistic prediction, i.e., predicts class membership probabilities
• Based on Bayes' theorem
• Incremental training
  – Update probabilities as new training records arrive
  – Can combine prior knowledge with observed data
• Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem: Basics

• X = random variable for data records ("evidence")
• H = hypothesis that a specific record X=x belongs to class C
• Goal: determine P(H | X=x)
  – Probability that the hypothesis holds given a record x
• P(H) = prior probability
  – The initial probability of the hypothesis
  – E.g., person x will buy a computer, regardless of age, income etc.
• P(X=x) = probability that data record x is observed
• P(X=x | H) = probability of observing record x, given that the hypothesis holds
  – E.g., given that x will buy a computer, what is the probability that x is in age group 31…40, has medium income, etc.?

Bayes' Theorem

• Given data record x, the posterior probability of a hypothesis H, P(H | X=x), follows from Bayes' theorem:
  P(H | X=x) = P(X=x | H) P(H) / P(X=x)
• Informally: posterior = likelihood · prior / evidence
• Among all candidate hypotheses H, find the maximally probable one, called the maximum a posteriori (MAP) hypothesis
• Note: P(X=x) is the same for all hypotheses
• If all hypotheses are equally probable a priori, we only need to compare P(X=x | H)
  – The winning hypothesis is called the maximum likelihood (ML) hypothesis
• Practical difficulties: requires initial knowledge of many probabilities and has high computational cost

Towards Naïve Bayes Classifier

• Suppose there are m classes C1, C2,…, Cm
• Classification goal: for record x, find the class Ci that has the maximum posterior probability P(Ci | X=x)
• Bayes' theorem: P(Ci | X=x) = P(X=x | Ci) P(Ci) / P(X=x)
• Since P(X=x) is the same for all classes, we only need to find the maximum of P(X=x | Ci) P(Ci)

Computing P(X=x|Ci) and P(Ci)

• Estimate P(Ci) by counting the frequency of class Ci in the training data
• Can we do the same for P(X=x|Ci)?
  – Would need a very large set of training data
  – There are |X1|·|X2|·…·|Xd|·m different combinations of possible values for X and Ci
  – Need to see every instance x many times to obtain reliable estimates
• Solution: decompose into lower-dimensional problems

Example: Computing P(X=x|Ci) and P(Ci)

• From the 14-record buys_computer training data:
  – P(buys_computer = yes) = 9/14
  – P(buys_computer = no) = 5/14
  – P(age>40, income=low, student=no, credit_rating=bad | buys_computer=yes) = 0 ?

Conditional Independence

• X, Y, Z random variables
• X is conditionally independent of Y, given Z, if P(X | Y,Z) = P(X | Z)
  – Equivalent to: P(X,Y | Z) = P(X | Z) · P(Y | Z)
• Example: people with longer arms read better
  – Confounding factor: age
    • A young child has shorter arms and lacks the reading skills of an adult
  – If age is fixed, the observed relationship between arm length and reading skills disappears

Derivation of Naïve Bayes Classifier

• Simplifying assumption: all input attributes are conditionally independent, given the class:
  P(X = (x1,…,xd) | Ci) = Π_{k=1..d} P(Xk = xk | Ci) = P(X1=x1 | Ci) · P(X2=x2 | Ci) · … · P(Xd=xd | Ci)
• Each P(Xk=xk | Ci) can be estimated robustly
  – If Xk is a categorical attribute
    • P(Xk=xk | Ci) = #records in Ci that have value xk for Xk, divided by #records of class Ci in the training data set
  – If Xk is continuous, we could discretize it
    • Problem: interval selection
      – Too many intervals: too few training cases per interval
      – Too few intervals: limited choices for decision boundary

Estimating P(Xk=xk|Ci) for Continuous Attributes without Discretization

• P(Xk=xk | Ci) computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) · exp( −(x − μ)² / (2σ²) )
  with P(Xk=xk | Ci) = g(xk, μk,Ci, σk,Ci)
• Estimate μk,Ci from the sample mean of attribute Xk over all training records of class Ci
• Estimate σk,Ci similarly from the sample standard deviation

Naïve Bayes Example

• Classes:
  – C1: buys_computer = yes
  – C2: buys_computer = no
• Data sample x:
  – age ≤ 30,
  – income = medium,
  – student = yes, and
  – credit_rating = bad
• (Training data: the buys_computer table from before.)

Naïve Bayesian Computation

• Compute P(Ci) for each class:
  – P(buys_computer = "yes") = 9/14 = 0.643
  – P(buys_computer = "no") = 5/14 = 0.357
• Compute P(Xk=xk | Ci) for each class:
  – P(age = "≤30" | buys_computer = "yes") = 2/9 = 0.222
  – P(age = "≤30" | buys_computer = "no") = 3/5 = 0.6
  – P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  – P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  – P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  – P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  – P(credit_rating = "bad" | buys_computer = "yes") = 6/9 = 0.667
  – P(credit_rating = "bad" | buys_computer = "no") = 2/5 = 0.4
• Compute P(X=x | Ci) using the Naïve Bayes assumption:
  – P(≤30, medium, yes, bad | buys_computer = "yes") = 0.222 · 0.444 · 0.667 · 0.667 = 0.044
  – P(≤30, medium, yes, bad | buys_computer = "no") = 0.6 · 0.4 · 0.2 · 0.4 = 0.019
• Compute the final result P(X=x | Ci) · P(Ci):
  – P(X=x | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
  – P(X=x | buys_computer = "no") · P(buys_computer = "no") = 0.007
• Therefore we predict buys_computer = "yes" for input x = (age = "≤30", income = "medium", student = "yes", credit_rating = "bad")

Zero-Probability Problem

• Naïve Bayesian prediction requires each conditional probability to be non-zero (why?):
  P(X = (x1,…,xd) | Ci) = Π_{k=1..d} P(Xk = xk | Ci), so a single zero factor makes the whole product zero
• Example: 1000 records for buys_computer = yes with income = low (0), income = medium (990), and income = high (10)
  – For an input with income = low, the conditional probability, and hence the product, is zero
• Use the Laplacian correction (or Laplace estimator) by adding 1 dummy record to each income level:
  – Prob(income = low) = 1/1003
  – Prob(income = medium) = 991/1003
  – Prob(income = high) = 11/1003
  – The "corrected" probability estimates are close to their "uncorrected" counterparts, but none is zero
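A minimal Naïve Bayes sketch for categorical attributes with the Laplacian correction, using the buys_computer table again. The helper names are my own; the prediction for the sample x agrees with the worked example (the exact scores differ slightly because of the smoothing):

```python
from collections import Counter, defaultdict

rows = [("<=30","High","No","Bad","No"), ("<=30","High","No","Good","No"),
        ("31..40","High","No","Bad","Yes"), (">40","Medium","No","Bad","Yes"),
        (">40","Low","Yes","Bad","Yes"), (">40","Low","Yes","Good","No"),
        ("31..40","Low","Yes","Good","Yes"), ("<=30","Medium","No","Bad","No"),
        ("<=30","Low","Yes","Bad","Yes"), (">40","Medium","Yes","Bad","Yes"),
        ("<=30","Medium","Yes","Good","Yes"), ("31..40","Medium","No","Good","Yes"),
        ("31..40","High","Yes","Bad","Yes"), (">40","Medium","No","Good","No")]
names = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
records = [dict(zip(names, r)) for r in rows]

def train_nb(records, target, laplace=1):
    classes = Counter(r[target] for r in records)
    priors = {c: n / len(records) for c, n in classes.items()}
    cond = defaultdict(dict)                 # cond[attr][(value, class)] = P(value | class)
    for a in (a for a in names if a != target):
        values = {r[a] for r in records}
        for c, nc in classes.items():
            counts = Counter(r[a] for r in records if r[target] == c)
            for v in values:                 # Laplacian correction: +1 dummy record per value
                cond[a][(v, c)] = (counts[v] + laplace) / (nc + laplace * len(values))
    return priors, cond

def predict_nb(priors, cond, x):
    scores = dict(priors)                    # start from P(Ci)
    for c in scores:
        for a, v in x.items():
            scores[c] *= cond[a][(v, c)]     # multiply by P(Xk = xk | Ci)
    return max(scores, key=scores.get), scores

priors, cond = train_nb(records, target="Buys_computer")
print(predict_nb(priors, cond, {"Age": "<=30", "Income": "Medium",
                                "Student": "Yes", "Credit_rating": "Bad"}))
```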

Naïve Bayesian Classifier: Comments

• Easy to implement
• Good results obtained in many cases
  – Robust to isolated noise points
  – Handles missing values by ignoring the instance during probability estimate calculations
  – Robust to irrelevant attributes
• Disadvantages
  – Assumption: class conditional independence, therefore loss of accuracy
  – Practically, dependencies exist among variables
• How to deal with these dependencies?

Probabilities

• Summary of elementary probability facts we have used already and/or will need soon
• Let X be a random variable as usual
• Let A be some predicate over its possible values
  – A is true for some values of X, false for others
  – E.g., X is the outcome of the throw of a die, A could be "value is greater than 4"
• P(A) is the fraction of possible worlds in which A is true
  – P(die value is greater than 4) = 2/6 = 1/3

Axioms

• 0 ≤ P(A) ≤ 1
• P(True) = 1
• P(False) = 0
• P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Theorems from the Axioms

• From 0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0, and P(A ∨ B) = P(A) + P(B) − P(A ∧ B) we can prove:
  – P(not A) = P(~A) = 1 − P(A)
  – P(A) = P(A ∧ B) + P(A ∧ ~B)

Conditional Probability

• P(A|B) = fraction of worlds in which B is true that also have A true
• Example: H = "Have a headache", F = "Coming down with flu"
  – P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2
  – "Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache."

Definition of Conditional Probability

• P(A|B) = P(A ∧ B) / P(B)
• Corollary, the chain rule: P(A ∧ B) = P(A|B) P(B)

Multivalued Random Variables

• Suppose X can take on more than 2 values
• X is a random variable with arity k if it can take on exactly one value out of {v1, v2,…, vk}
• Thus P(X = vi ∧ X = vj) = 0 if i ≠ j, and P(X = v1 ∨ X = v2 ∨ … ∨ X = vk) = 1

Easy Fact about Multivalued Random Variables

• Using the axioms of probability
  – 0 ≤ P(A) ≤ 1, P(True) = 1, P(False) = 0
  – P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• And assuming that X obeys P(X = vi ∧ X = vj) = 0 for i ≠ j and P(X = v1 ∨ … ∨ X = vk) = 1
• We can prove that P(X = v1 ∨ X = v2 ∨ … ∨ X = vi) = Σ_{j=1..i} P(X = vj)
• And therefore: Σ_{j=1..k} P(X = vj) = 1

Useful Easy-to-Prove Facts

• P(A|B) + P(~A|B) = 1
• Σ_{j=1..k} P(X = vj | B) = 1

The Joint Distribution (example: Boolean variables A, B, C)

• Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (has 2^d rows for d Boolean variables).
  2. For each combination of values, say how probable it is.
  3. If you subscribe to the axioms of probability, those numbers must sum to 1.
• Example joint distribution (A, B, C, Prob):
  0, 0, 0, 0.30
  0, 0, 1, 0.05
  0, 1, 0, 0.10
  0, 1, 1, 0.05
  1, 0, 0, 0.05
  1, 0, 1, 0.10
  1, 1, 0, 0.25
  1, 1, 1, 0.10

Using the Joint Dist.

• Once you have the JD you can ask for the probability of any logical expression E involving your attributes:
  P(E) = Σ_{rows matching E} P(row)
• Example (joint distribution over gender, hours worked, and wealth):
  – P(Poor ∧ Male) = 0.4654
  – P(Poor) = 0.7604

Inference with the Joint Dist.

• P(E1 | E2) = P(E1 ∧ E2) / P(E2) = Σ_{rows matching E1 and E2} P(row) / Σ_{rows matching E2} P(row)
• Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Joint Distributions

• Good news: once you have a joint distribution, you can answer important questions that involve uncertainty.
• Bad news: it is impossible to create a joint distribution for more than about ten attributes, because there are so many numbers needed when you build it.

What Would Help?

• Full independence
  – P(gender=g ∧ hours_worked=h ∧ wealth=w) = P(gender=g) · P(hours_worked=h) · P(wealth=w)
  – Can reconstruct the full joint distribution from a few marginals
• Full conditional independence given the class value
  – Naïve Bayes
• What about something between Naïve Bayes and the general joint distribution?
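A short sketch of joint-distribution inference using the A, B, C table above; P(E) sums the matching rows and P(E1|E2) follows the formula on the slides (the example queries themselves are made up):

```python
# Joint distribution over the Boolean variables A, B, C from the slide
joint = {(0,0,0): 0.30, (0,0,1): 0.05, (0,1,0): 0.10, (0,1,1): 0.05,
         (1,0,0): 0.05, (1,0,1): 0.10, (1,1,0): 0.25, (1,1,1): 0.10}

def prob(event):
    """P(E) = sum of P(row) over rows matching the predicate `event`."""
    return sum(p for row, p in joint.items() if event(row))

def cond_prob(e1, e2):
    """P(E1 | E2) = P(E1 and E2) / P(E2)."""
    return prob(lambda r: e1(r) and e2(r)) / prob(e2)

A = lambda r: r[0] == 1
B = lambda r: r[1] == 1
print(prob(A))             # P(A) = 0.05 + 0.10 + 0.25 + 0.10 = 0.50
print(cond_prob(A, B))     # P(A | B) = 0.35 / 0.50 = 0.70
```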

Bayesian Belief Networks

• Subset of the variables conditionally independent
• Graphical model of causal relationships
  – Represents dependency among the variables
  – Gives a specification of the joint probability distribution
• Nodes: random variables; links: dependency
  – Example: X and Y are the parents of Z, and Y is the parent of P; given Y, Z and P are independent
• Has no loops or cycles

Properties

• Each variable is conditionally independent of its non-descendants in the graph, given its parents
• Naïve Bayes as a Bayesian network: the class Y is the parent of all attribute nodes X1, X2,…, Xn

General Properties

• P(X1,X2,X3) = P(X1|X2,X3) P(X2|X3) P(X3)
• P(X1,X2,X3) = P(X3|X1,X2) P(X2|X1) P(X1)
• The network does not necessarily reflect causality
• (Figure: two three-node networks over X1, X2, X3 with different edge directions.)

Structural Property

• Missing links simplify the computation of P(X1, X2,…, Xn)
• General factorization: Π_{i=1..n} P(Xi | Xi−1, Xi−2,…, X1)
  – Fully connected: link between every pair of nodes
• Given a network: Π_{i=1..n} P(Xi | parents(Xi))
  – Some links are missing
  – The terms P(Xi | parents(Xi)) are given as conditional probability tables (CPTs) in the network
• A sparse network allows better estimation of the CPTs (fewer combinations of parent values, hence more reliable to estimate from limited data) and faster computation

Small Example

• S: Student studies a lot for 6220
• L: Student learns a lot and gets a good grade
• J: Student gets a great job
• Network: S → L → J, with CPTs
  – P(S) = 0.4
  – P(L|S) = 0.9, P(L|~S) = 0.2
  – P(J|L) = 0.8, P(J|~L) = 0.3

Computing P(S|J)

• Probability that a student who got a great job was doing her homework
• P(S | J) = P(S, J) / P(J)
• P(S, J) = P(S, J, L) + P(S, J, ~L)
• P(J) = P(J, S, L) + P(J, S, ~L) + P(J, ~S, L) + P(J, ~S, ~L)
• P(J, L, S) = P(J | L, S) · P(L, S) = P(J | L) · P(L | S) · P(S) = 0.8 · 0.9 · 0.4
• P(J, ~L, S) = P(J | ~L, S) · P(~L, S) = P(J | ~L) · P(~L | S) · P(S) = 0.3 · (1−0.9) · 0.4
• P(J, L, ~S) = P(J | L, ~S) · P(L, ~S) = P(J | L) · P(L | ~S) · P(~S) = 0.8 · 0.2 · (1−0.4)
• P(J, ~L, ~S) = P(J | ~L, ~S) · P(~L, ~S) = P(J | ~L) · P(~L | ~S) · P(~S) = 0.3 · (1−0.2) · (1−0.4)
• Putting this all together, we obtain:
  P(S | J) = (0.8·0.9·0.4 + 0.3·0.1·0.4) / (0.8·0.9·0.4 + 0.3·0.1·0.4 + 0.8·0.2·0.6 + 0.3·0.8·0.6) = 0.3 / 0.54 = 0.56

More Complex Example

• Variables: S: It is snowing; M: The lecturer is Mike; L: The lecturer arrives late; R: The lecture concerns data mining; T: The lecture started on time
• Network: S and M are the parents of L; M is the parent of R; L is the parent of T
• CPTs:
  – P(S) = 0.3, P(M) = 0.6
  – P(L|M, S) = 0.05, P(L|M, ~S) = 0.1, P(L|~M, S) = 0.1, P(L|~M, ~S) = 0.2
  – P(R|M) = 0.3, P(R|~M) = 0.6
  – P(T|L) = 0.3, P(T|~L) = 0.8

Computing with Bayes Net

• P(T, ~R, L, ~M, S) = P(T | L) · P(~R | ~M) · P(L | ~M, S) · P(~M) · P(S)
• P(R | T, ~S) = P(R, T, ~S) / P(T, ~S)
  – P(R, T, ~S) = P(L, M, R, T, ~S) + P(~L, M, R, T, ~S) + P(L, ~M, R, T, ~S) + P(~L, ~M, R, T, ~S)
  – Compute P(T, ~S) similarly. Problem: there are now 8 such terms to be computed.

Inference with Bayesian Networks

• Can predict the probability for any attribute, given any subset of the other attributes
  – P(M | L, R), P(T | S, ~M, R) and so on
• Easy case: P(Xi | Xj1, Xj2,…, Xjk) where parents(Xi) ⊆ {Xj1, Xj2,…, Xjk}
  – Can read the answer directly from Xi's CPT
• What if values are not given for all parents of Xi?
  – Exact inference of probabilities in general, for an arbitrary Bayesian network, is NP-hard
  – Solutions: probabilistic inference that trades precision for efficiency
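A minimal inference-by-enumeration sketch for the S → L → J network; it reproduces the 0.3 / 0.54 ≈ 0.56 computation above:

```python
# CPTs for the chain S -> L -> J from the slides
P_S = 0.4
P_L = {True: 0.9, False: 0.2}          # P(L | S)
P_J = {True: 0.8, False: 0.3}          # P(J | L)

def joint(s, l, j):
    """P(S=s, L=l, J=j) = P(J | L) * P(L | S) * P(S) for this network."""
    ps = P_S if s else 1 - P_S
    pl = P_L[s] if l else 1 - P_L[s]
    pj = P_J[l] if j else 1 - P_J[l]
    return pj * pl * ps

# P(S | J) by summing out the hidden variable L
p_S_and_J = sum(joint(True, l, True) for l in (True, False))
p_J = sum(joint(s, l, True) for s in (True, False) for l in (True, False))
print(p_S_and_J / p_J)   # 0.3 / 0.54 = 0.555..., the 0.56 from the slide
```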

Training Bayesian Networks

• Several scenarios:
  – Network structure known, all variables observable: learn only the CPTs
  – Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  – Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  – Unknown structure, all hidden variables: no good algorithms known for this purpose
• Ref.: D. Heckerman: Bayesian networks for data mining

Classification and Prediction Overview (agenda repeated; next topic: Artificial Neural Networks)

Basic Building Block: Perceptron

• Input: records {(x1,…, xd, y), …}; the perceptron combines the input vector x with a weight vector w and a bias b
• Output: classification function f(x) = sign( b + Σ_{i=1..d} wi xi )
  – f(x) > 0: return +1
  – f(x) ≤ 0: return −1
• (Diagram: input vector x, weight vector w, weighted sum b + Σ wi xi, activation function, output y.)

Decision Hyperplane

• Decision hyperplane: b + w·x = 0 (in two dimensions: b + w1 x1 + w2 x2 = 0)
• Note: b + w·x > 0 if and only if Σ_{i=1..d} wi xi > −b
• b represents a threshold for when the perceptron "fires"

Representing Boolean Functions

• AND with a two-input perceptron
  – b = −0.8, w1 = w2 = 0.5
• OR with a two-input perceptron
  – b = −0.3, w1 = w2 = 0.5
• m-of-n function: true if at least m out of n inputs are true
  – All input weights 0.5, threshold weight b is set according to m, n
• Can also represent NAND, NOR
• What about XOR?

Perceptron Training Rule

• Goal: correct +1/−1 output for each training record
• Start with random weights and a constant learning rate η
• While some training records are still incorrectly classified, do
  – For each training record (x, y)
    • Let fold(x) be the output of the current perceptron for x
    • Set b := b + Δb, where Δb = η (y − fold(x))
    • For all i, set wi := wi + Δwi, where Δwi = η (y − fold(x)) xi
• Converges to a correct decision boundary if the classes are linearly separable and a small enough η is used
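A minimal sketch of the perceptron training rule as stated above, learning the Boolean AND function from the previous slide (the learning rate, initialization, and epoch limit are arbitrary choices):

```python
import random

def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (x, y) with x a tuple of inputs and y in {+1, -1}."""
    d = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(d)]
    b = random.uniform(-0.05, 0.05)
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            f_old = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if f_old != y:                   # (y - f_old) is zero for correct records
                b += eta * (y - f_old)
                w = [wi + eta * (y - f_old) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                      # converged: all records classified correctly
            break
    return b, w

# Boolean AND: output +1 only for input (1, 1)
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
print(train_perceptron(and_data))
```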

Gradient Descent

• If the training records are not linearly separable, find the best-fit approximation
  – Gradient descent to search the space of possible weight vectors
  – Basis for the Backpropagation algorithm
• Consider the un-thresholded perceptron (no sign function applied), i.e., u(x) = b + w·x
• Measure the training error by the squared error
  E(b,w) = (1/2) Σ_{(x,y)∈D} (y − u(x))²
  – D = training data

Gradient Descent Rule

• Find the weight vector that minimizes E(b,w) by altering it in the direction of steepest descent
  – Set (b,w) := (b,w) + Δ(b,w), where Δ(b,w) = −η ∇E(b,w)
• ∇E(b,w) = [∂E/∂b, ∂E/∂w1,…, ∂E/∂wn] is the gradient, hence
  b := b + η Σ_{(x,y)∈D} (y − u(x))
  wi := wi + η Σ_{(x,y)∈D} (y − u(x)) xi
• Start with random weights, iterate until convergence
  – Will converge to the global minimum if η is small enough
• (Figure: error surface E(w1, w2) over the weight space.)

Gradient Descent Summary

• Epoch updating (batch mode)
  – Compute the gradient over the entire training set
  – Changes the model once per scan of the entire training set
• Case updating (incremental mode, stochastic gradient descent)
  – Compute the gradient for a single training record
  – Changes the model after every single training record immediately
• Case updating can approximate epoch updating arbitrarily closely if η is small enough
• What is the difference between the perceptron training rule and case updating for gradient descent?
  – Error computation on the thresholded vs. the unthresholded function

Multilayer Feedforward Networks

• Use another perceptron to combine the output of a lower layer (input layer → hidden layer → output layer)
  – What about linear units only? Can only construct linear functions!
  – Need a nonlinear component
    • sign function: not differentiable (needed for gradient descent!)
    • Use the sigmoid: σ(x) = 1/(1+e^(−x))
• Perceptron function becomes y = 1 / (1 + e^(−(b + w·x)))

1-Hidden Layer ANN Example

• (Figure: inputs x1, x2 feed three hidden units v1, v2, v3 with weights w11,…,w32, which feed a single output unit.)
  v1 = g(b1 + Σ_{k=1..N_INS} w1k xk)
  v2 = g(b2 + Σ_{k=1..N_INS} w2k xk)
  v3 = g(b3 + Σ_{k=1..N_INS} w3k xk)
  Out = g(B + Σ_{k=1..N_HID} Wk vk)
  g is usually the sigmoid

Making Predictions

• The input record is fed simultaneously into the units of the input layer
• It is then weighted and fed simultaneously to a hidden layer
• The weighted outputs of the last hidden layer are the input to the units in the output layer, which emits the network's prediction
• The network is feed-forward
  – None of the weights cycles back to an input unit or to an output unit of a previous layer
• Statistical point of view: neural networks perform nonlinear regression
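A minimal sketch of the forward pass described above, for one hidden layer and a single output unit; the weights below are made up purely for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(x, hidden, output, g=sigmoid):
    """hidden: list of (bias_i, weights_i) per hidden unit; output: (B, W).
    Implements v_i = g(b_i + sum_k w_ik * x_k) and Out = g(B + sum_k W_k * v_k)."""
    v = [g(b + sum(w_k * x_k for w_k, x_k in zip(w, x))) for b, w in hidden]
    B, W = output
    return g(B + sum(W_k * v_k for W_k, v_k in zip(W, v)))

# Three hidden units on two inputs, as in the slide's sketch (weights are arbitrary)
hidden = [(0.1, (0.5, -0.4)), (-0.3, (0.8, 0.2)), (0.0, (-0.6, 0.9))]
output = (0.2, (1.0, -1.5, 0.7))
print(predict((0.3, 0.8), hidden, output))
```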

Backpropagation Algorithm

• Earlier discussion: gradient descent for a single perceptron using a simple un-thresholded function
• If a sigmoid (or other differentiable) function is applied to the weighted sum, use the complete function for gradient descent
• Multiple layers of perceptrons: optimize over all weights of all perceptrons
  – Problems: huge search space, local minima
• Backpropagation
  – Initialize all weights with small random values
  – Iterate many times
    • Compute the gradient, starting at the output and working back
      – Error of hidden unit h: how do we get the true output value? Use the weighted sum of the errors of each unit influenced by h
    • Update all weights in the network

Overfitting

• When do we stop updating the weights?
• Overfitting tends to happen in later iterations
  – Weights are initially small random values
  – Weights all similar => smooth decision surface
  – Surface complexity increases as the weights diverge
• Preventing overfitting
  – Weight decay: decrease each weight by a small factor during each iteration, or
  – Use validation data to decide when to stop iterating

Neural Network Decision Boundary

• (Figure: nonlinear decision boundaries learned by a neural network. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Backpropagation Remarks

• Computational cost
  – Each iteration costs O(|D|·|w|), with |D| training records and |w| weights
  – The number of iterations can be exponential in n, the number of inputs (in practice often tens of thousands)
• Local minima can trap the gradient descent algorithm: convergence guaranteed to a local minimum, not the global one
• Backpropagation is highly effective in practice
  – Many variants to deal with the local minima issue, use of case updating

Defining a Network

1. Decide the network topology
   – #input units, #hidden layers, #units per hidden layer, #output units (one output unit per class for problems with >2 classes)
2. Normalize the input values for each attribute to [0.0, 1.0]
   – Nominal/ordinal attributes: one input unit per domain value
     • For attribute grade with values A, B, C, have 3 inputs that are set to 1,0,0 for grade A, to 0,1,0 for grade B, and 0,0,1 for C
     • Why not map it to a single input with domain [0.0, 1.0]?
3. Choose the learning rate η, e.g., 1 / (#training iterations)
   – Too small: takes too long to converge
   – Too large: might never converge (oversteps the minimum)
4. Bad results on test data? Change network topology, initial weights, or learning rate; try again.

Representational Power

• Boolean functions
  – Each can be represented by a 2-layer network
  – The number of hidden units can grow exponentially with the number of inputs
    • Create a hidden unit for each input record
    • Set its weights to activate only for that input
    • Implement the output unit as an OR gate that only activates for the desired output patterns
• Continuous functions
  – Every bounded continuous function can be approximated arbitrarily closely by a 2-layer network
• Any function can be approximated arbitrarily closely by a 3-layer network

Neural Network as a Classifier

• Weaknesses
  – Long training time
  – Many non-trivial parameters, e.g., network topology
  – Poor interpretability: what is the meaning behind learned weights and hidden units?
    • Note: hidden units are an alternative representation of the input values, capturing their relevant features
• Strengths
  – High tolerance to noisy data
  – Well-suited for continuous-valued inputs and outputs
  – Successful on a wide array of real-world data
  – Techniques exist for extraction of rules from neural networks

Classification and Prediction Overview (agenda repeated; next topic: Support Vector Machines)

SVM—Support Vector Machines

• Newer and very popular classification method
• Uses a nonlinear mapping to transform the original training data into a higher dimension
• Searches for the optimal separating hyperplane (i.e., "decision boundary") in the new dimension
• SVM finds this hyperplane using support vectors ("essential" training records) and margins (defined by the support vectors)

SVM—History and Applications

• Vapnik and colleagues (1992)
  – Groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
• Training can be slow but accuracy is high
  – Ability to model complex nonlinear decision boundaries (margin maximization)
• Used both for classification and prediction
• Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

Linear Classifiers

• f(x,w,b) = sign(w·x + b); training records are labeled +1 or −1
• (Figures: several different lines, each of which separates the +1 records from the −1 records.)
• How would you classify this data? Any of these would be fine… but which is best?

Classifier Margin

• f(x,w,b) = sign(w·x + b)
• Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data record.

Maximum Margin

• f(x, w, b) = sign(w·x + b)
• Find the maximum margin linear classifier.
• This is the simplest kind of SVM, called linear SVM or LSVM.
• Support Vectors are those data points that the margin pushes up against.

Why Maximum Margin?

• If we made a small error in the location of the boundary, this choice gives us the least chance of causing a misclassification.
• The model is immune to removal of any non-support-vector data records.
• There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
• Empirically it works very well.

Specifying a Line and Margin

• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = -1 }
• Classify as +1 if w·x + b ≥ 1, and as -1 if w·x + b ≤ -1
• What if -1 < w·x + b < 1?
(Figure: classifier boundary with plus-plane and minus-plane.)

Computing Margin Width

• Plus-plane = { x : w·x + b = +1 }
• Minus-plane = { x : w·x + b = -1 }
• Goal: compute the margin width M in terms of w and b
  – Note: vector w is perpendicular to the plus-plane
    • Consider two vectors u and v on the plus-plane and show that w·(u − v) = 0
    • Hence w is also perpendicular to the minus-plane
• Choose an arbitrary point x⁻ on the minus-plane
• Let x⁺ be the point on the plus-plane closest to x⁻
• Since vector w is perpendicular to these planes, it holds that x⁺ = x⁻ + λw, for some value of λ
(Figure: the two parallel planes, the points x⁺ and x⁻, and the margin width M.)

Putting It All Together

• We have so far:
  – w·x⁺ + b = +1 and w·x⁻ + b = -1
  – x⁺ = x⁻ + λw
  – |x⁺ − x⁻| = M
• Derivation:
  – w·(x⁻ + λw) + b = +1, hence w·x⁻ + b + λ(w·w) = 1
  – This implies λ(w·w) = 2, i.e., λ = 2 / (w·w)
  – Since M = |x⁺ − x⁻| = |λw| = λ|w| = λ(w·w)^0.5
  – We obtain M = 2(w·w)^0.5 / (w·w) = 2 / (w·w)^0.5

Finding the Maximum Margin

• How do we find w and b such that the margin is maximized and all training records are in the correct zone for their class?
• Solution: Quadratic Programming (QP)
• QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.
  – There exist algorithms for finding such constrained quadratic optima efficiently and reliably.
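As an illustration (not part of the slides), a linear SVM with a very large C behaves essentially like the hard-margin classifier above; the learned w and b then give the margin 2 / (w·w)^0.5 from the derivation. This sketch assumes scikit-learn and an invented toy data set.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)   # huge C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    margin = 2.0 / np.sqrt(w @ w)                 # M = 2 / sqrt(w.w)
    print("w =", w, "b =", b, "margin =", margin)
    print("support vectors:\n", clf.support_vectors_)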

Quadratic Programming

• Find arg max_u [ c + dᵀu − (uᵀRu)/2 ]   (quadratic criterion)
• Subject to n linear inequality constraints:
  a11·u1 + a12·u2 + ... + a1m·um ≤ b1
  ...
  an1·u1 + an2·u2 + ... + anm·um ≤ bn
• And subject to e additional linear equality constraints:
  a(n+1)1·u1 + a(n+1)2·u2 + ... + a(n+1)m·um = b(n+1)
  ...
  a(n+e)1·u1 + a(n+e)2·u2 + ... + a(n+e)m·um = b(n+e)

What Are the SVM Constraints?

• Consider n training records (x(k), y(k)), where y(k) = +/- 1
• What is the quadratic optimization criterion? Maximize the margin M = 2 / (w·w)^0.5, i.e., minimize w·w
• How many constraints will we have? n.
• What should they be? For each 1 ≤ k ≤ n:
  – w·x(k) + b ≥ 1, if y(k) = 1
  – w·x(k) + b ≤ -1, if y(k) = -1

Problem: Classes Not Linearly Separable

• The inequalities for the training records are not satisfiable by any w and b

Solution 1?

• Find the minimum w·w, while also minimizing the number of training set errors
• Problem: not a well-defined optimization problem (cannot optimize two things at the same time)

Solution 2?

• Minimize w·w + C·(#trainSetErrors)
  – C is a tradeoff parameter
• Problems:
  – Cannot be expressed as a QP, hence finding the solution might be slow
  – Does not distinguish between disastrous errors and near misses

Solution 3

• Minimize w·w + C·(distance of error records to their correct place)
• This works!
• But we still need to do something about the unsatisfiable set of inequalities

What Are the SVM Constraints?

• Consider n training records (x(k), y(k)), where y(k) = +/- 1
• Quadratic optimization criterion: minimize (1/2)·w·w + C·Σ_{k=1..n} εk
• There are n constraints; for each 1 ≤ k ≤ n:
  – w·x(k) + b ≥ 1 − εk, if y(k) = 1
  – w·x(k) + b ≤ -1 + εk, if y(k) = -1
  – εk ≥ 0

Facts About the New Problem Formulation

• The original QP formulation had d+1 variables: w1, w2, ..., wd and b
• The new QP formulation has d+1+n variables: w1, w2, ..., wd, b, and ε1, ε2, ..., εn
• C is a new parameter that needs to be set for the SVM
  – It controls the tradeoff between paying attention to margin size versus misclassifications

Effect of Parameter C

(Figure; source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

An Equivalent QP (The "Dual")

• Maximize  Σ_{k=1..n} αk − (1/2)·Σ_{k=1..n} Σ_{l=1..n} αk·αl·y(k)·y(l)·(x(k)·x(l))
• Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1..n} αk·y(k) = 0
• Then define:
  – w = Σ_{k=1..n} αk·y(k)·x(k)
  – b = AVG over {k : 0 < αk < C} of ( 1/y(k) − x(k)·w )
• Then classify with: f(x, w, b) = sign(w·x + b)

Important Facts

• The dual formulation of the QP can be optimized more quickly, but the result is equivalent
• Data records with αk > 0 are the support vectors
  – Those with 0 < αk < C lie on the plus- or minus-plane
  – Those with αk = C are on the wrong side of the classifier boundary (have εk > 0)
• The computation of w and b only depends on the records with αk > 0, i.e., the support vectors
• The alternative QP has another major advantage, as we will see now...
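A small check (not from the slides) of the relationship w = Σ αk·y(k)·x(k), using scikit-learn: for a linear kernel, SVC stores αk·y(k) for the support vectors in dual_coef_, so w can be recomputed from the support vectors alone. The toy data set is invented.

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=10.0).fit(X, y)

    # w = sum_k alpha_k * y(k) * x(k), summed over the support vectors only
    w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
    print("w from dual coefficients:", w_from_dual)
    print("w reported by the solver:", clf.coef_[0])   # should match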

Easy To Separate

• What would SVMs do with this data? Not a big surprise: the maximum margin separator, with a positive "plane" and a negative "plane". (Figures.)

Harder To Separate

• Here the classes cannot be separated in the original space. What can be done about this?
• Non-linear basis functions:
  – Original data: (X, Y)
  – Transformed: (X, X², Y)
• Think of X² as a new attribute, e.g., X'
(Figure: the data plotted against X and X' = X².)
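A sketch (invented data, numpy only) of the basis-function idea: a 1-D data set whose positive class sits in the middle is not linearly separable in X, but becomes separable after adding X² as a second attribute.

    import numpy as np

    X = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
    y = np.array([-1, -1, 1, 1, 1, -1, -1])        # +1 only near the origin

    X_transformed = np.column_stack([X, X ** 2])   # add X' = X^2 as a new attribute

    # In the transformed space, the horizontal line X' = 2 separates the classes:
    predictions = np.sign(2.0 - X_transformed[:, 1])   # +1 if X^2 < 2, else -1
    print(np.all(predictions == y))                    # True: separable in the new space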

Now Separation Is Easy Again

• In the transformed space (X, X' = X²), the two classes can be separated by a line with a plus-"plane" and a minus-"plane". (Figure.)

Corresponding "Planes" in Original Space

• Mapped back to the original space, the separating planes become the region above the plus-"plane" and the region below the minus-"plane". (Figure.)

Common SVM Basis Functions

• Polynomials of the attributes X1, ..., Xd up to a certain max degree
• Radial basis function
  – Symmetric around a center c, i.e., KernelFunction(|X − c| / kernelWidth)
• Sigmoid function of X, e.g., hyperbolic tangent
• Let Φ(x) denote the transformed input record
  – Previous example: Φ( (x) ) = (x, x²)

Quadratic Basis Functions

• Φ(x) = ( 1;  √2·x1, √2·x2, ..., √2·xd;  x1², x2², ..., xd²;  √2·x1·x2, √2·x1·x3, ..., √2·x(d−1)·xd )
  – Constant term, linear terms, pure quadratic terms, and quadratic cross-terms
• Number of terms (assuming d input attributes): (d+2)-choose-2 = (d+2)(d+1)/2 ≈ d²/2
• Why did we choose this specific transformation?

Dual QP With Basis Functions

• Maximize  Σ_{k=1..n} αk − (1/2)·Σ_{k=1..n} Σ_{l=1..n} αk·αl·y(k)·y(l)·( Φ(x(k))·Φ(x(l)) )
• Subject to these constraints: 0 ≤ αk ≤ C for all k, and Σ_{k=1..n} αk·y(k) = 0
• Then define:
  – w = Σ_{k=1..n} αk·y(k)·Φ(x(k))
  – b = AVG over {k : 0 < αk < C} of ( 1/y(k) − Φ(x(k))·w )
• Then classify with: f(x, w, b) = sign(w·Φ(x) + b)

Computation Challenge

• The input vector x has d components (its d attribute values)
• The transformed input vector Φ(x) has about d²/2 components
• Hence computing Φ(x(k))·Φ(x(l)) now costs on the order of d²/2 instead of d operations (additions, multiplications)
• ...or is there a better way to do this?
  – Take advantage of properties of certain transformations

Quadratic Dot Products

• Expanding term by term:
  Φ(a)·Φ(b) = 1 + 2·Σ_i ai·bi + Σ_i ai²·bi² + 2·Σ_i Σ_{j>i} ai·aj·bi·bj
• Now consider another function of a and b:
  (a·b + 1)² = (a·b)² + 2·(a·b) + 1
             = (Σ_i ai·bi)² + 2·Σ_i ai·bi + 1
             = Σ_i Σ_j ai·bi·aj·bj + 2·Σ_i ai·bi + 1
             = Σ_i (ai·bi)² + 2·Σ_i Σ_{j>i} ai·bi·aj·bj + 2·Σ_i ai·bi + 1

Quadratic Dot Products

• The results of Φ(a)·Φ(b) and of (a·b + 1)² are identical
• Computing Φ(a)·Φ(b) costs about d²/2 operations, while computing (a·b + 1)² costs only about d+2 operations
• This means that we can work in the high-dimensional space (d²/2 dimensions), where the training records are more easily separable, but pay about the same cost as working in the original space (d dimensions)
• The savings are even greater when dealing with higher-degree polynomials, i.e., degree q > 2, which can be computed as (a·b + 1)^q

Any Other Computation Problems?

• What about computing w? We finally need f(x, w, b) = sign(w·Φ(x) + b):
  – w·Φ(x) = Σ_{k=1..n} αk·y(k)·( Φ(x(k))·Φ(x) )
  – This can be computed using the same trick as before
• The same trick can be applied to b as well, because
  Φ(x(k))·w = Σ_{j=1..n} αj·y(j)·( Φ(x(k))·Φ(x(j)) )
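A quick numeric sanity check (not from the slides) of the identity Φ(a)·Φ(b) = (a·b + 1)², using the quadratic basis expansion with the √2 scaling shown above; the vectors are arbitrary.

    import numpy as np
    from itertools import combinations

    def phi(x):
        # Quadratic basis: constant, sqrt(2)*linear, squares, sqrt(2)*cross-terms
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
        return np.concatenate([[1.0], np.sqrt(2) * x, x ** 2, cross])

    a = np.array([0.3, -1.2, 2.0])
    b = np.array([1.5, 0.4, -0.7])

    print(phi(a) @ phi(b))      # explicit dot product in the transformed space
    print((a @ b + 1.0) ** 2)   # kernel evaluation in the original space (same value)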

SVM Kernel Functions

• For which transformations, called kernels, does the same trick work?
• Polynomial: K(a, b) = (a·b + 1)^q
• Radial-Basis-style (RBF): K(a, b) = exp( −(a − b)² / (2σ²) )
• Neural-net-style sigmoidal: K(a, b) = tanh(κ·a·b − δ)
• q, σ, κ, and δ are magic parameters that must be chosen by a model selection method.

Overfitting

• With the right kernel function, computation in the high-dimensional transformed space is no problem
• But what about overfitting? There seem to be so many parameters...
• Usually not a problem, due to the maximum margin approach
  – Only the support vectors determine the model, hence SVM complexity depends on the number of support vectors, not the dimensionality (still, in higher dimensions there might be more support vectors)
  – Minimizing w·w discourages extremely large weights, which smoothes the function (recall weight decay for neural networks!)

Different Kernels

(Figure; source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Multi-Class Classification

• SVMs can only handle two-class outputs (i.e., a categorical output variable with arity 2).
• With output arity N, learn N SVMs:
  – SVM 1 learns "Output==1" vs. "Output != 1"
  – SVM 2 learns "Output==2" vs. "Output != 2"
  – ...
  – SVM N learns "Output==N" vs. "Output != N"
• Predict with each SVM and find out which one puts the prediction the furthest into the positive region.

Why Is SVM Effective on High-Dimensional Data?

• The complexity of the trained classifier is characterized by the number of support vectors, not the dimensionality of the data
• If all other training records were removed and training repeated, the same separating hyperplane would be found
• The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

SVM vs. Neural Network

• SVM
  – Relatively new concept
  – Deterministic algorithm
  – Nice generalization properties
  – Hard to train – learned in batch mode using quadratic programming techniques
  – Using kernels can learn very complex functions
• Neural Network
  – Relatively old
  – Nondeterministic algorithm
  – Generalizes well but doesn't have a strong mathematical foundation
  – Can easily be learned in incremental fashion
  – To learn complex functions—use multilayer perceptron (not that trivial)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods

What Is Prediction?

• Essentially the same as classification, but the output is continuous, not discrete
  – Construct a model, then use the model to predict a continuous output value for a given input
• Major method for prediction: regression
  – Many variants of regression exist in the literature; not covered in this class
• Neural networks and k-NN can do regression "out-of-the-box"
• SVMs for regression exist
• What about trees?

Regression Trees and Model Trees

• Regression tree: proposed in the CART system (Breiman et al. 1984)
  – CART: Classification And Regression Trees
  – Each leaf stores a continuous-valued prediction
    • The average output value of the training records in the leaf
• Model tree: proposed by Quinlan (1992)
  – Each leaf holds a regression model—a multivariate linear equation
• Training: like for classification trees, but uses variance instead of a purity measure for selecting split predicates (see the sketch after the overview)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods
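A minimal sketch (not from the slides) of a CART-style regression tree using scikit-learn's DecisionTreeRegressor, which splits to reduce squared error (variance) and predicts the leaf average; the data is synthetic.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # noisy continuous target

    tree = DecisionTreeRegressor(max_depth=3)   # split criterion: squared error reduction
    tree.fit(X, y)

    print(tree.predict([[2.0], [5.0], [8.0]]))  # each prediction = mean y of the matching leaf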

Classifier Accuracy Measures

  True class \ Predicted class   buy_computer = yes   buy_computer = no   total
  buy_computer = yes             6954                 46                  7000
  buy_computer = no              412                  2588                3000
  total                          7366                 2634                10000

• Accuracy of a classifier M, acc(M): percentage of test records that are correctly classified by M
  – Error rate (misclassification rate) of M = 1 − acc(M)
  – Given m classes, CM[i,j], an entry in a confusion matrix, indicates the number of records in class i that are labeled by the classifier as class j
• For two classes:

  Actual \ Predicted   C1                C2
  C1                   True positive     False negative
  C2                   False positive    True negative

Precision and Recall

• Precision: measure of exactness
  – t-pos / (t-pos + f-pos)
• Recall: measure of completeness
  – t-pos / (t-pos + f-neg)
• F-measure: combination of precision and recall
  – 2 · precision · recall / (precision + recall)
• Note: Accuracy = (t-pos + t-neg) / (t-pos + t-neg + f-pos + f-neg)
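A small worked check (plain Python) of these measures on the buy_computer confusion matrix above, treating "yes" as the positive class:

    t_pos, f_neg = 6954, 46     # actual yes: predicted yes / predicted no
    f_pos, t_neg = 412, 2588    # actual no:  predicted yes / predicted no

    accuracy = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
    precision = t_pos / (t_pos + f_pos)
    recall = t_pos / (t_pos + f_neg)
    f_measure = 2 * precision * recall / (precision + recall)

    print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} F={f_measure:.4f}")
    # accuracy=0.9542 precision=0.9441 recall=0.9934 F=0.9681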

Limitation of Accuracy

• Consider a 2-class problem
  – Number of Class 0 examples = 9990
  – Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  – Accuracy is misleading because the model does not detect any class 1 example
• Always predicting the majority class defines the baseline
  – A good classifier should do better than the baseline

Cost-Sensitive Measures: Cost Matrix

  Actual \ Predicted   Class=Yes      Class=No
  Class=Yes            C(Yes|Yes)     C(No|Yes)
  Class=No             C(Yes|No)      C(No|No)

• C(i|j): cost of misclassifying a class j example as class i

Computing Cost of Classification

  Cost matrix C(i|j):   Predicted +   Predicted −
    Actual +            -1            100
    Actual −             1            0

  Model M1:             Predicted +   Predicted −
    Actual +            150           40
    Actual −            60            250
  Accuracy = 80%, Cost = 3910

  Model M2:             Predicted +   Predicted −
    Actual +            250           45
    Actual −            5             200
  Accuracy = 90%, Cost = 4255

Prediction Error Measures

• Continuous output: it matters how far off the prediction is from the true value
• Loss function: distance between y and the predicted value y'
  – Absolute error: |y − y'|
  – Squared error: (y − y')²
• Test error (generalization error): average loss over the test set
  – Mean absolute error: (1/n)·Σ_{i=1..n} |y(i) − y'(i)|
  – Mean squared error: (1/n)·Σ_{i=1..n} (y(i) − y'(i))²
  – Relative absolute error: Σ_{i=1..n} |y(i) − y'(i)| / Σ_{i=1..n} |y(i) − ȳ|
  – Relative squared error: Σ_{i=1..n} (y(i) − y'(i))² / Σ_{i=1..n} (y(i) − ȳ)²
• Squared error exaggerates the presence of outliers
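A short sketch (invented values) computing the four prediction error measures defined above:

    import numpy as np

    y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])   # invented true values
    y_pred = np.array([2.5, 5.5, 2.0, 8.0, 4.0])   # invented predictions

    mae = np.mean(np.abs(y_true - y_pred))
    mse = np.mean((y_true - y_pred) ** 2)

    y_bar = y_true.mean()
    rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_bar))
    rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_bar) ** 2)

    print(mae, mse, rae, rse)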

Evaluating a Classifier or Predictor

• Holdout method
  – The given data set is randomly partitioned into two sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  – Can repeat holdout multiple times
    • Accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
  – Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  – In the i-th iteration, use D_i as the test set and the others as the training set
  – Leave-one-out: k folds where k = number of records
    • Expensive, often results in high variance of the performance metric

Learning Curve

• Accuracy versus sample size (figure)
• Effect of small sample size:
  – Bias in the estimate
  – Variance of the estimate
• Helps determine how much training data is needed
  – Still need enough test and validation data to be representative of the distribution
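A sketch (scikit-learn, synthetic data) of the holdout and 10-fold cross-validation estimates described above:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=10, random_state=0)

    # Holdout: 2/3 training, 1/3 test
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
    holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

    # 10-fold cross-validation: average accuracy over the 10 test folds
    cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10).mean()

    print(f"holdout accuracy = {holdout_acc:.3f}, 10-fold CV accuracy = {cv_acc:.3f}")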

ROC (Receiver Operating Characteristic)

• Developed in the 1950s for signal detection theory to analyze noisy signals
  – Characterizes the trade-off between positive hits and false alarms
• The ROC curve plots the T-Pos rate (y-axis) against the F-Pos rate (x-axis)
• The performance of each classifier is represented as a point on the ROC curve
  – Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point

ROC Curve

• 1-dimensional data set containing 2 classes (positive and negative)
  – Any point located at x > t is classified as positive
  – At threshold t: TPR = 0.5, FPR = 0.12 (figure)

ROC Curve

• (TPR, FPR):
  – (0,0): declare everything to be the negative class
  – (1,1): declare everything to be the positive class
  – (1,0): ideal
• Diagonal line:
  – Random guessing

Diagonal Line for Random Guessing

• Classify a record as positive with fixed probability p, irrespective of its attribute values
• Consider a test set with a positive and b negative records
  – True positives: p·a, hence true positive rate = (p·a)/a = p
  – False positives: p·b, hence false positive rate = (p·b)/b = p
• For every value 0 ≤ p ≤ 1, we get the point (p, p) on the ROC curve

Using ROC for Model Comparison

• Neither model consistently outperforms the other
  – M1 is better for small FPR
  – M2 is better for large FPR
• Area under the ROC curve
  – Ideal: area = 1
  – Random guess: area = 0.5

How to Construct an ROC Curve

• Use a classifier that produces a posterior probability P(+|x) for each test record x
• Sort the records according to P(+|x) in decreasing order
• Apply a threshold at each unique value of P(+|x)
  – Count the number of TP, FP, TN, FN at each threshold
  – TP rate, TPR = TP/(TP+FN)
  – FP rate, FPR = FP/(FP+TN)

  Record   P(+|x)   True Class
  1        0.95     +
  2        0.93     +
  3        0.87     -
  4        0.85     -
  5        0.85     -
  6        0.85     +
  7        0.76     -
  8        0.53     +
  9        0.43     -
  10       0.25     +

How To Construct An ROC Curve

  Threshold >=   TP   FP   TN   FN   TPR   FPR
  0.25            5    5    0    0   1.0   1.0
  0.43            4    5    0    1   0.8   1.0
  0.53            4    4    1    1   0.8   0.8
  0.76            3    4    1    2   0.6   0.8
  0.85            3    3    2    2   0.6   0.6
  0.87            2    1    4    3   0.4   0.2
  0.93            2    0    5    3   0.4   0.0
  0.95            1    0    5    4   0.2   0.0
  1.00            0    0    5    5   0.0   0.0

(Figure: the resulting ROC curve, plotting TPR against FPR.)

Test of Significance

• Given two models:
  – Model M1: accuracy = 85%, tested on 30 instances
  – Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  – How much confidence can we place on the accuracy of M1 and M2?
  – Can the difference in accuracy be explained as a result of random fluctuations in the test set?
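A sketch (plain Python) reproducing the TPR/FPR columns above from the ten (P(+|x), class) pairs:

    probs = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
    labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']

    pos = labels.count('+')
    neg = labels.count('-')

    for t in sorted(set(probs)) + [1.0]:
        tp = sum(1 for p, c in zip(probs, labels) if p >= t and c == '+')
        fp = sum(1 for p, c in zip(probs, labels) if p >= t and c == '-')
        print(f"threshold>={t:.2f}  TPR={tp / pos:.1f}  FPR={fp / neg:.1f}")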

Confidence Interval for Accuracy

• Classification can be regarded as a Bernoulli trial
  – A Bernoulli trial has 2 possible outcomes; for classification these are "correct" or "wrong"
  – A collection of Bernoulli trials has a Binomial distribution
• Probability of getting c correct predictions if the model accuracy is p (= probability of getting a single prediction right):
  C(n, c) · p^c · (1 − p)^(n−c)
• Given c, or equivalently ACC = c/n and n (the number of test records), can we predict p, the true accuracy of the model?

Confidence Interval for Accuracy (cont.)

• Binomial distribution for X = "number of correctly classified test records out of n"
  – E(X) = pn, Var(X) = p(1−p)n
• Accuracy = X / n
  – E(ACC) = p, Var(ACC) = p(1−p)/n
• For large test sets (n > 30), the Binomial distribution is closely approximated by a normal distribution with the same mean and variance
  – ACC has a normal distribution with mean = p and variance = p(1−p)/n
  – P( Z_{α/2} ≤ (ACC − p) / √(p(1−p)/n) ≤ Z_{1−α/2} ) = 1 − α  (area 1 − α between Z_{α/2} and Z_{1−α/2})
• Confidence interval for p:
  p = [ 2n·ACC + Z²_{α/2} ± Z_{α/2}·√( Z²_{α/2} + 4n·ACC − 4n·ACC² ) ] / [ 2(n + Z²_{α/2}) ]

Confidence Interval for Accuracy: Example

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances
  – n = 100, ACC = 0.8
  – Let 1 − α = 0.95 (95% confidence)
  – From the probability table, Z_{α/2} = 1.96

  1 − α:     0.99   0.98   0.95   0.90
  Z_{α/2}:   2.58   2.33   1.96   1.65

  n:          50     100    500    1000   5000
  p(lower):   0.670  0.711  0.763  0.774  0.789
  p(upper):   0.888  0.866  0.833  0.824  0.811

Comparing Performance of Two Models

• Given two models M1 and M2, which is better?
  – M1 is tested on D1 (size = n1), found error rate = e1
  – M2 is tested on D2 (size = n2), found error rate = e2
  – Assume D1 and D2 are independent
  – If n1 and n2 are sufficiently large, then err1 ~ N(μ1, σ1) and err2 ~ N(μ2, σ2)
  – Estimate: μ̂_i = e_i and σ̂_i² = e_i·(1 − e_i) / n_i
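A sketch (numpy only) evaluating the confidence-interval formula above for ACC = 0.8; it should reproduce the p(lower)/p(upper) rows of the table.

    import numpy as np

    acc, z = 0.8, 1.96     # Z_{alpha/2} for 1 - alpha = 0.95

    for n in [50, 100, 500, 1000, 5000]:
        center = 2 * n * acc + z ** 2
        half = z * np.sqrt(z ** 2 + 4 * n * acc - 4 * n * acc ** 2)
        denom = 2 * (n + z ** 2)
        print(n, round((center - half) / denom, 3), round((center + half) / denom, 3))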

Testing Significance of Accuracy Difference

• Consider the random variable d = err1 − err2
  – Since err1 and err2 are normally distributed, so is their difference
  – Hence d ~ N(d_t, σ_t), where d_t is the true difference
• Estimator for d_t:
  – E[d] = E[err1 − err2] = E[err1] − E[err2] ≈ e1 − e2
  – Since D1 and D2 are independent, the variances add up:
    σ̂_t² = σ̂_1² + σ̂_2² = e1·(1 − e1)/n1 + e2·(1 − e2)/n2
  – At the (1 − α) confidence level, d_t = E[d] ± Z_{α/2}·σ̂_t

An Illustrative Example

• Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
• E[d] = |e1 − e2| = 0.1
• 2-sided test: d_t = 0 versus d_t ≠ 0
  – σ̂_t² = 0.15·(1 − 0.15)/30 + 0.25·(1 − 0.25)/5000 = 0.0043
• At the 95% confidence level, Z_{α/2} = 1.96
  – d_t = 0.100 ± 1.96·√0.0043 = 0.100 ± 0.128
• The interval contains zero, hence the difference may not be statistically significant
• But: the null hypothesis (d_t = 0) may still be rejected at a lower confidence level
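A sketch (numpy) of this two-model significance test, using the example numbers above:

    import numpy as np

    n1, e1 = 30, 0.15      # model M1: test set size and error rate
    n2, e2 = 5000, 0.25    # model M2
    z = 1.96               # 95% confidence

    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    half_width = z * np.sqrt(var_d)

    print(f"d = {d:.3f} +/- {half_width:.3f}")
    # The interval (-0.028, 0.228) contains 0, so the difference may not be significant.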

Significance Test for K-Fold Cross-Validation

• Each learning algorithm produces k models:
  – L1 produces M11, M12, ..., M1k
  – L2 produces M21, M22, ..., M2k
• Both models are tested on the same test sets D1, D2, ..., Dk
  – For each test set, compute d_j = e_{1,j} − e_{2,j}
  – For large enough k, d_j is normally distributed with mean d_t and variance σ_t
  – Estimate: σ̂_t² = Σ_{j=1..k} (d_j − d̄)² / (k·(k − 1))
  – Then d_t = d̄ ± t_{1−α, k−1}·σ̂_t
    • t-distribution: get the coefficient t_{1−α, k−1} from a table by looking up the confidence level (1 − α) and the degrees of freedom (k − 1)

Classification and Prediction Overview

• Introduction
• Decision Trees
• Statistical Decision Theory
• Nearest Neighbor
• Bayesian Classification
• Artificial Neural Networks
• Support Vector Machines (SVMs)
• Prediction
• Accuracy and Error Measures
• Ensemble Methods
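A sketch (numpy and scipy, with invented per-fold error rates) of the k-fold significance test above; the 0.975 quantile is used for a two-sided 95% confidence level.

    import numpy as np
    from scipy.stats import t

    e1 = np.array([0.20, 0.18, 0.22, 0.19, 0.21, 0.17, 0.23, 0.20, 0.18, 0.22])  # learner 1
    e2 = np.array([0.22, 0.21, 0.21, 0.23, 0.22, 0.20, 0.25, 0.21, 0.20, 0.24])  # learner 2

    d = e1 - e2                        # per-fold error differences
    k = len(d)
    d_bar = d.mean()
    sigma_t = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))

    t_coef = t.ppf(0.975, df=k - 1)    # t coefficient for k-1 degrees of freedom
    print(f"d_t = {d_bar:.4f} +/- {t_coef * sigma_t:.4f}")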

Ensemble Methods

• Construct a set of classifiers from the training data
• Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers

General Idea

• Step 1: Create multiple data sets D1, D2, ..., Dt from the original training data D
• Step 2: Build multiple classifiers C1, C2, ..., Ct
• Step 3: Combine the classifiers into C*

Why Does It Work?

• Consider a 2-class problem
• Suppose there are 25 base classifiers
  – Each classifier has error rate ε = 0.35
  – Assume the classifiers are independent
• Return the majority vote of the 25 classifiers
  – Probability that the ensemble classifier makes a wrong prediction (i.e., that 13 or more of the 25 classifiers are wrong):
    Σ_{i=13..25} C(25, i)·ε^i·(1 − ε)^(25−i) ≈ 0.06

Base Classifier vs. Ensemble Error

(Figure: ensemble error as a function of the base classifier error rate.)
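A sketch (plain Python) evaluating the majority-vote error probability above:

    from math import comb

    eps, n = 0.35, 25
    ensemble_error = sum(comb(n, i) * eps ** i * (1 - eps) ** (n - i)
                         for i in range(13, n + 1))
    print(round(ensemble_error, 3))   # about 0.06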

Model Averaging and Bias-Variance Tradeoff

• Single model: lowering bias will usually increase variance
  – A "smoother" model has lower variance but might not model the function well enough
• Ensembles can overcome this problem:
  1. Let the models overfit
     • Low bias, high variance
  2. Take care of the variance problem by averaging many of these models
• This is the basic idea behind bagging

Bagging: Bootstrap Aggregation

• Given a training set with n records, sample n records randomly with replacement

  Original Data       1  2  3  4  5  6  7  8  9  10
  Bagging (Round 1)   7  8  10 8  2  5  10 10 5  9
  Bagging (Round 2)   1  4  9  1  2  3  2  7  3  2
  Bagging (Round 3)   1  8  5  10 5  5  9  6  3  7

• Train a classifier for each bootstrap sample
• Note: each training record has probability 1 − (1 − 1/n)^n of being selected at least once in a sample of size n
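A sketch (numpy) of bootstrap sampling and the 1 − (1 − 1/n)^n inclusion probability, which approaches about 0.632 for large n:

    import numpy as np

    n, rounds = 1000, 2000
    rng = np.random.default_rng(0)

    hits = 0
    for _ in range(rounds):
        sample = rng.integers(0, n, size=n)   # sample n records with replacement
        hits += 0 in sample                   # did record 0 appear at least once?

    print(hits / rounds)                # empirical inclusion frequency
    print(1 - (1 - 1 / n) ** n)         # theoretical value, about 0.632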

Bagged Trees

• Create k trees from the training data
  – Bootstrap sample, grow large trees
• Design goal: independent models, high variability between models
• Ensemble prediction = average of the individual tree predictions (or majority vote):
  C*(x) = (1/k)·C1(x) + (1/k)·C2(x) + ... + (1/k)·Ck(x)
• Works the same way for other classifiers

Typical Result

(Figures; the following "Typical Result" slides contain only plots.)


Bagging Challenges

• Ideal case: all models are independent of each other
• Train on independent data samples
  – Problem: limited amount of training data
    • The training set needs to be representative of the data distribution
  – Bootstrap sampling allows creation of many "almost" independent training sets
• Diversify the models, because a similar sample might result in a similar tree
  – Limit the choice of split attributes to a small random subset of attributes (a new selection of the subset for each node) when training a tree, as in random forests
  – Use different model types in the same ensemble: tree, ANN, SVM, regression models

Additive Grove

• Ensemble technique for predicting continuous output
• Instead of individual trees, train additive models
  – Prediction of a single Grove model = sum of its tree predictions
  – Prediction of the ensemble = average of the individual Grove predictions
• Combines large trees and additive models
  – Challenge: how to train the additive models without having the first trees fit the training data too well
    • The next tree is trained on the residuals of the previously trained trees in the same Grove model
    • If the previously trained trees capture the training data too well, the next tree is mostly trained on noise
• Ensemble prediction = (1/k)·(sum of trees in Grove 1) + ... + (1/k)·(sum of trees in Grove k)

Training Groves

(Figure: Grove models built from progressively larger trees.)

Typical Grove Performance

• Root mean squared error
  – Lower is better
• Horizontal axis: tree size
  – Fraction of training data at which to stop splitting (0.5 down to 0)
• Vertical axis: number of trees in each single Grove model (1 to 10)
• 100 bagging iterations
(Figure: performance surface; a value of 0.13 is marked.)

Boosting

• Iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
  – Initially, all n records are assigned equal weights
  – Record weights may change at the end of each boosting round
• Records that are wrongly classified will have their weights increased
• Records that are classified correctly will have their weights decreased

  Original Data        1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1)   7  3  2  8  7  9  4  10 6  3
  Boosting (Round 2)   5  4  9  4  2  5  1  7  4  2
  Boosting (Round 3)   4  4  8  10 4  5  4  6  3  4

• Example: assume record 4 is hard to classify
  – Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds

Example: AdaBoost

• Base classifiers: C1, C2, ..., CT
• Error rate of classifier Ci (n training records; the weights w_j sum to 1):
  ε_i = Σ_{j=1..n} w_j·δ( C_i(x_j) ≠ y_j )
• Importance of a classifier:
  α_i = (1/2)·ln( (1 − ε_i) / ε_i )

AdaBoost Details

• Weight update:
  w_j^(i+1) = ( w_j^(i) / Z_i )·exp(−α_i)  if C_i(x_j) = y_j
  w_j^(i+1) = ( w_j^(i) / Z_i )·exp(α_i)   if C_i(x_j) ≠ y_j
  where Z_i is a normalization factor that ensures the weights add up to 1
• Weights are initialized to 1/n
• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
• Final classification:
  C*(x) = arg max_y Σ_{i=1..T} α_i·δ( C_i(x) = y )
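A compact sketch (numpy plus scikit-learn decision stumps, invented data) of the AdaBoost loop defined above; it uses the reweighting form rather than the resampling variant mentioned on the slide.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # invented labels in {-1, +1}

    n, T = len(X), 10
    w = np.full(n, 1.0 / n)                         # initial weights 1/n
    stumps, alphas = [], []

    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))               # weighted error rate
        if eps >= 0.5:                              # revert weights if the round is too weak
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / eps)       # importance of this classifier
        w = w * np.exp(-alpha * y * pred)           # decrease if correct, increase if wrong
        w = w / w.sum()                             # normalize (Z_i)
        stumps.append(stump)
        alphas.append(alpha)

    # Final classification: weighted vote of the base classifiers
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    print("training accuracy:", np.mean(np.sign(votes) == y))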

Illustrating AdaBoost

(Figures: a one-dimensional +/− data set over three boosting rounds B1, B2, B3; initially every point has weight 0.1, misclassified points receive larger weights in the next round, and the rounds are combined with α ≈ 1.9459, 2.9323, and 3.8744. Note: the numbers appear to be wrong, but they convey the right idea.)

Bagging vs. Boosting

• Analogy
  – Bagging: diagnosis based on multiple doctors' majority vote
  – Boosting: weighted vote, based on the doctors' previous diagnosis accuracy
• Sampling procedure
  – Bagging: records have the same weight; easy to train in parallel
  – Boosting: weights a record higher if the model predicts it wrong; inherently sequential process
• Overfitting
  – Bagging is robust against overfitting
  – Boosting is susceptible to overfitting: make sure the individual models do not overfit
• Accuracy usually significantly better than a single classifier
  – The best boosted model is often better than the best bagged model
• Additive Grove
  – Combines the strengths of bagging and boosting (additive models)
  – Shown empirically to make better predictions on many data sets
  – Training is more tricky, especially when the data is very noisy

Classification/Prediction Summary

• Forms of data analysis that can be used to train models from data and then make predictions for new records
• Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian networks, rule-based classifiers, Backpropagation, Support Vector Machines (SVMs), nearest neighbor classifiers, and many other classification methods
• Regression models are popular for prediction. Regression trees, model trees, and ANNs are also used for prediction.

Classification/Prediction Summary

• K-fold cross-validation is a popular method for accuracy estimation, but determining accuracy on a large test set is equally accepted
  – If test sets are large enough, a significance test for finding the best model is not necessary
• Area under the ROC curve and many other common performance measures exist
• Ensemble methods like bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models
  – Often state-of-the-art in prediction quality, but expensive to train, store, and use
• No single method is superior over all others for all data sets
  – Issues such as accuracy, training and prediction time, robustness, interpretability, and scalability must be considered and can involve trade-offs

