<<

K236: Basis of Data Analytics
Lecture 7: Classification and Prediction (Decision Tree Induction)

Lecturers: Tu Bao Ho and Hieu Chi Dam
TAs: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236

1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination 7/27 (the date is not fixed)

Data schemas vs. mining methods

Types of data:
• Flat data tables
• Relational databases
• Temporal and spatial data
• Transactional databases
• Multimedia data
• Genome databases
• Materials science data
• Textual data
• Web data
• etc.

Mining tasks and methods:
• Classification/Prediction
  - Decision trees
  - Bayesian classification
  - Neural networks
  - Rule induction
  - Support vector machines (SVM)
  - Hidden Markov models
  - etc.
• Description
  - Association analysis
  - Clustering
  - Summarization
  - etc.

Outline

1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Classification and prediction

Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
- $x$ is the description of an object, phenomenon, etc.
- $y$ (the label attribute) is some property of $x$; if it is not available, learning is unsupervised
Find: a function $f(x)$ that characterizes $\{x\}$, or such that $f(x) = y$

Unsupervised data (no label):
      color  #nuclei  #tails
  H1  light  1        1
  H2  dark   1        1
  H3  light  1        2
  H4  light  2        1
  C1  dark   1        2
  C2  dark   2        1
  C3  light  2        2
  C4  dark   2        2

Supervised data (with class label):
      color  #nuclei  #tails  class
  H1  light  1        1       healthy
  H2  dark   1        1       healthy
  H3  light  1        2       healthy
  H4  light  2        1       healthy
  C1  dark   1        2       cancerous
  C2  dark   2        1       cancerous
  C3  light  2        2       cancerous
  C4  dark   2        2       cancerous

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).

Classification—a two-step process

• Model construction: describing a set of predetermined classes
  - Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae (classifiers)
• Model usage: classifying future or unknown objects. Estimate the accuracy of the model:
  - The known label of each test object is compared with the classified result from the model
  - The accuracy rate is the percentage of test-set objects that are correctly classified by the model
  - The test set is independent of the training set, otherwise over-fitting will occur
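A minimal sketch of the two-step process (construct a model on a training set, then estimate its accuracy on an independent test set), assuming scikit-learn and its bundled iris data; these are illustrative choices, not part of the lecture.

```python
# Sketch: build a decision tree on a training set and estimate accuracy
# on an independent test set (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set to avoid an
# over-optimistic (over-fitted) accuracy estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()        # model construction
model.fit(X_train, y_train)

y_pred = model.predict(X_test)          # model usage on unseen objects
print("accuracy on test set:", accuracy_score(y_test, y_pred))
```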

Classification—a two-step process

[Figure: the two-step process on the cell data. Model construction: a classification algorithm is applied to the training data (H1-H4, C1-C4) and produces a classifier (model), e.g. the rule "IF color = dark AND #tails = 2 THEN cancerous cell". Model usage: the classifier is applied to an unknown object, which is classified as cancerous.]

Criteria for classification methods

• Predictive accuracy: the ability of the classifier to correctly predict unseen data
• Speed: refers to the computation cost
• Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
• Scalability: the ability to construct the classifier efficiently given large amounts of data
• Interpretability: the level of understanding and insight that is provided by the classifier

Machine learning: view by nature of methods

Tribe            Origins                 Master algorithm
Symbolists       Logic, philosophy       Inverse deduction
Evolutionaries   Evolutionary biology    Genetic programming
Connectionists   Neuroscience            Backpropagation
Bayesians        Statistics              Probabilistic inference
Analogizers      Psychology              Kernel machines

(The five tribes of machine learning, Pedro Domingos)

Symbolists

Tom Mitchell, Steve Muggleton, Ross Quinlan

Classification with decision trees

Decision tree for the cell data {H1, H2, H3, H4, C1, C2, C3, C4}:

#nuclei?
  = 1: color?
      = light: H {H1, H3}
      = dark: #tails?
          = 1: H {H2}
          = 2: C {C1}
  = 2: color?
      = light: #tails?
          = 1: H {H4}
          = 2: C {C3}
      = dark: C {C2, C4}
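To make the tree concrete, here is a minimal sketch that encodes it as nested Python dicts (an illustrative representation, not one prescribed by the lecture) and classifies an unknown cell.

```python
# Sketch: the cell-data decision tree as nested dicts, and a function that
# classifies an object by following the branches that match its attributes.
tree = {
    "#nuclei": {
        1: {"color": {
            "light": "healthy",
            "dark": {"#tails": {1: "healthy", 2: "cancerous"}},
        }},
        2: {"color": {
            "light": {"#tails": {1: "healthy", 2: "cancerous"}},
            "dark": "cancerous",
        }},
    }
}

def classify(node, obj):
    """Walk the tree: a str node is a class label, a dict node is a test."""
    if isinstance(node, str):
        return node
    attribute, branches = next(iter(node.items()))
    return classify(branches[obj[attribute]], obj)

# An unknown cell: dark, 2 nuclei, 2 tails -> cancerous
print(classify(tree, {"color": "dark", "#nuclei": 2, "#tails": 2}))
```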

Analogizers

Peter Hart, Vladimir Vapnik, Douglas Hofstadter

Kernel methods: the basic ideas

[Figure: a map φ takes objects x1, ..., xn in the input space X to φ(x1), ..., φ(xn) in the feature space F.]

The kernel function $k: X \times X \to \mathbb{R}$ gives the inner product of the mapped objects, $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. These values form the $n \times n$ kernel matrix $K$, and the kernel-based algorithm operates on $K$ (all computation is done on the kernel matrix).
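As an illustration (the slide does not fix a particular kernel), a minimal sketch that builds an RBF kernel matrix with NumPy:

```python
# Sketch: build an n x n kernel (Gram) matrix with an RBF kernel,
# k(xi, xj) = exp(-gamma * ||xi - xj||^2); any kernel-based algorithm
# (SVM, kernel PCA, ...) then works only on this matrix.
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # squared Euclidean distances between all pairs of rows of X
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # 3 objects in input space
K = rbf_kernel_matrix(X, gamma=0.5)
print(K.shape)   # (3, 3) kernel matrix
print(K)
```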

Connectionists

Yann LeCun, Geoff Hinton, Yoshua Bengio

Classification with neural networks

[Figure: a small neural network for the cell data. Inputs such as "color = dark", "#nuclei = 1" and "#tails = 2" feed a network whose two outputs correspond to the classes Healthy and Cancerous.]

Deep learning

Bayesians in machine learning

David Heckerman, Judea Pearl, Michael Jordan


Probabilistic graphical models: instances of graphical models

[Figure: taxonomy of probabilistic graphical models. Directed graphical models include Bayes nets, dynamic Bayesian networks (DBNs), hidden Markov models (HMM), the Kalman filter, LDA, the naïve Bayes classifier and mixture models; undirected graphical models include Markov random fields (MRFs), conditional random fields and MaxEnt models. After Murphy, ML for life sciences.]

Outline

1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Mining with decision trees

A decision tree is a flow-chart-like tree structure:
• each internal node denotes a test on an attribute
• each branch represents an outcome of the test
• leaf nodes represent classes or class distributions
• the top-most node in the tree is the root node

Example (cell data {H1, H2, H3, H4, C1, C2, C3, C4}):

#nuclei?
  = 1: {H1, H2, H3, C1} → color?
      = light: {H1, H3} → H
      = dark: {H2, C1} → #tails?
          = 1: {H2} → H
          = 2: {C1} → C
  = 2: {H4, C2, C3, C4} → #tails?
      = 1: {H4, C2} → color?
          = light: {H4} → H
          = dark: {C2} → C
      = 2: {C3, C4} → C

Decision tree induction (DTI)

• Decision tree generation consists of two phases:
  - Tree construction: partition the examples recursively based on selected attributes; at the start, all the training objects are at the root
  - Tree pruning: identify and remove branches that reflect noise or outliers
• Use of decision trees: classify unknown objects by testing the attribute values of the object against the decision tree

Tree construction: general algorithm

Two steps: recursively generate the tree by selecting attributes and splitting the data (steps 1-4), and prune the tree (step 5).

1. At each node, choose the "best" attribute according to a given measure for attribute selection.
2. Extend the tree by adding a new branch for each value of that attribute.
3. Sort the training examples to the leaf nodes.
4. If the examples in a node all belong to one class, stop; otherwise repeat steps 1-4 for the leaf nodes.
5. Prune the tree to avoid over-fitting.

(The slide traces steps 1-4 on the cell data, reproducing the tree shown above.)
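A minimal sketch of steps 1-4 (recursive partitioning), assuming examples are Python dicts and using information gain (defined later in this lecture) as the selection measure; function and variable names are illustrative.

```python
# Sketch: recursive tree construction (steps 1-4). Examples are dicts of
# attribute values plus a class label; the "best" attribute is the one
# with the highest information gain.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    labels = [e[target] for e in examples]
    gain = entropy(labels)
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def build_tree(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                     # step 4: pure node -> leaf
        return labels[0]
    if not attributes:                            # no attribute left -> majority
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    node = {best: {}}                             # step 1: choose best attribute
    for value in set(e[best] for e in examples):  # step 2: one branch per value
        subset = [e for e in examples if e[best] == value]   # step 3
        rest = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, rest, target)
    return node

cells = [
    {"color": "light", "#nuclei": 1, "#tails": 1, "class": "healthy"},
    {"color": "dark",  "#nuclei": 1, "#tails": 1, "class": "healthy"},
    {"color": "light", "#nuclei": 1, "#tails": 2, "class": "healthy"},
    {"color": "light", "#nuclei": 2, "#tails": 1, "class": "healthy"},
    {"color": "dark",  "#nuclei": 1, "#tails": 2, "class": "cancerous"},
    {"color": "dark",  "#nuclei": 2, "#tails": 1, "class": "cancerous"},
    {"color": "light", "#nuclei": 2, "#tails": 2, "class": "cancerous"},
    {"color": "dark",  "#nuclei": 2, "#tails": 2, "class": "cancerous"},
]
print(build_tree(cells, ["color", "#nuclei", "#tails"], "class"))
```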

Training data for concept "play-tennis"

• A typical dataset in machine learning.
• 14 objects belonging to two classes {Y, N} are observed on 4 attributes.
• Dom(Outlook) = {sunny, overcast, rain}
• Dom(Temperature) = {hot, mild, cool}
• Dom(Humidity) = {high, normal}
• Dom(Wind) = {weak, strong}

A decision tree for playing tennis

[Figure: a large decision tree that tests "temperature" at the root (branches cool / hot / mild) and then still needs tests on outlook, wind and humidity several levels deep before its leaves become pure.]

A simple decision tree for playing tennis

outlook?
  = sunny: {D1, D2, D8, D9, D11} → humidity?
      = high: {D1, D2, D8} → no
      = normal: {D9, D11} → yes
  = o'cast: {D3, D7, D12, D13} → yes
  = rain: {D4, D5, D6, D10, D14} → wind?
      = strong: {D6, D14} → no
      = weak: {D4, D5, D10} → yes

This tree is much simpler because "outlook" is selected at the root. How do we select a good attribute to split a decision node?

Which attribute is the best?

• The "play-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
• If the attributes "humidity" and "wind" split S into sub-nodes with the following proportions of positive and negative objects, which attribute is better?

A1 = humidity: [9+, 5-]
  = normal: [6+, 1-]
  = high: [3+, 4-]

A2 = wind: [9+, 5-]
  = weak: [6+, 2-]
  = strong: [3+, 3-]

Entropy

• Entropy characterizes the impurity (purity) of an arbitrary collection of objects.
  - S is the collection of positive and negative objects
  - $p_+$ is the proportion of positive objects in S
  - $p_-$ is the proportion of negative objects in S
  - In the play-tennis example, these numbers are |S| = 14, $p_+$ = 9/14 and $p_-$ = 5/14, respectively
• Entropy is defined as follows:

$$\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

Entropy

[Figure: the entropy function for a Boolean classification, as the proportion $p_+$ of positive objects varies between 0 and 1.]

If the collection contains c distinct groups of objects, the entropy is defined by

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
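A minimal sketch of this computation from class counts; the values match the play-tennis numbers used below.

```python
# Sketch: entropy of a collection described by its class counts.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))   # play-tennis set [9+, 5-] -> about 0.940
print(entropy([14, 0]))  # pure collection -> 0.0
print(entropy([7, 7]))   # evenly split -> 1.0
```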

Example

From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]):

Entropy([9+, 5-]) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940

Notice:
1. Entropy is 0 if all members of S belong to the same class. For example, if all members are positive ($p_+ = 1$), then $p_-$ is 0, and Entropy(S) = -1·log2(1) - 0·log2(0) = -1·0 - 0 = 0 (with the convention 0·log2(0) = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1.

Information gain measures the expected reduction in entropy

We define a measure, called information gain, of the effectiveness of an attribute in classifying data. It is the expected reduction in entropy caused by partitioning the objects according to this attribute:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where Values(A) is the set of all possible values of attribute A, and $S_v$ is the subset of S for which A has value v.

Information gain measures the expected reduction in entropy

Values(Wind) = {Weak, Strong}, S = [9+, 5-]

$S_{weak}$, the subnode with value "weak", is [6+, 2-]
$S_{strong}$, the subnode with value "strong", is [3+, 3-]

$$\mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \sum_{v \in \{weak,\,strong\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
$$= \mathrm{Entropy}(S) - \tfrac{8}{14}\,\mathrm{Entropy}(S_{weak}) - \tfrac{6}{14}\,\mathrm{Entropy}(S_{strong}) = 0.940 - \tfrac{8}{14}\cdot 0.811 - \tfrac{6}{14}\cdot 1.0 = 0.048$$
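A minimal sketch that reproduces this calculation from class counts alone; the same helper also gives Gain(S, Humidity), used on the next slide.

```python
# Sketch: information gain computed from class counts of the parent node
# and of the sub-nodes produced by an attribute.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, child_counts):
    total = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(child) / total * entropy(child) for child in child_counts)

# S = [9+, 5-]; Wind: weak [6+, 2-], strong [3+, 3-]
print(gain([9, 5], [[6, 2], [3, 3]]))   # about 0.048
# Humidity: high [3+, 4-], normal [6+, 1-]
print(gain([9, 5], [[3, 4], [6, 1]]))   # about 0.151
```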

Which attribute is the best classifier?

Humidity: S = [9+, 5-], E = 0.940
  = high: [3+, 4-], E = 0.985
  = normal: [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151

Wind: S = [9+, 5-], E = 0.940
  = weak: [6+, 2-], E = 0.811
  = strong: [3+, 3-], E = 1.00
Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.00 = 0.048

Information gain of all attributes

Gain (S, Outlook) = 0.246

Gain (S, Humidity) = 0.151

Gain (S, Wind) = 0.048

Gain (S, Temperature) = 0.029

Next step in growing the decision tree

{D1, D2, ..., D14}, [9+, 5-]
Outlook?
  = Sunny: {D1, D2, D8, D9, D11}, [2+, 3-] → ?
  = Overcast: {D3, D7, D12, D13}, [4+, 0-] → Yes
  = Rain: {D4, D5, D6, D10, D14}, [3+, 2-] → ?

Which attribute should be tested here (at the Sunny node)?

S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(S_sunny, Temperature) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019

Attributes with many values

• If an attribute has many values (e.g., days of the month), ID3 will select it.
• C4.5 uses GainRatio instead:

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}$$

$$\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

where $S_i$ is the subset of S for which A has the value $v_i$.
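A minimal sketch of GainRatio from class counts; the Wind split below is just an illustration.

```python
# Sketch: gain ratio = information gain / split information,
# both computed from class counts of the parent node and its sub-nodes.
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent_counts, child_counts):
    total = sum(parent_counts)
    sizes = [sum(child) for child in child_counts]
    gain = entropy(parent_counts) - sum(
        s / total * entropy(child) for s, child in zip(sizes, child_counts))
    split_info = -sum(s / total * math.log2(s / total) for s in sizes if s > 0)
    return gain / split_info

# S = [9+, 5-] split by Wind into weak [6+, 2-] and strong [3+, 3-]
print(gain_ratio([9, 5], [[6, 2], [3, 3]]))
```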

Measures for attribute selection

• Gain ratio (Quinlan, C4.5, 1993)
• Gini index (Breiman, CART, 1984)
• χ² statistic
• R-measure (Ho & Nguyen, 1997)
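The slide only names the measures; as an illustration, here is the Gini index in its standard CART form (this formula is not taken from the slide).

```python
# Sketch: Gini impurity of a node and the weighted Gini index of a split,
# in the standard CART form.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(child_counts):
    total = sum(sum(child) for child in child_counts)
    return sum(sum(child) / total * gini(child) for child in child_counts)

# S = [9+, 5-] split by Wind into weak [6+, 2-] and strong [3+, 3-]
print(gini([9, 5]))                     # impurity of the parent node
print(gini_split([[6, 2], [3, 3]]))     # weighted impurity after the split
```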

Outline

1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Stopping condition

1. Every attribute has already been included along this path through the tree.
2. The training objects associated with each leaf node all have the same target attribute value (i.e., their entropy is zero).

Notice: the algorithm ID3 uses information gain, and C4.5, its successor, uses gain ratio (a variant of information gain) as the measure of how good a split is.

Generalization problem in classification

• One of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general, untrained data.
• Overfitting: a statistical model describes random error or noise instead of the underlying relationship.
• Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
• A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.

[Figure: three fits of the same data, illustrating underfitting, good fitting and overfitting.]

Over-fitting in decision trees

• The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen objects
• Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"

Converting a tree to rules

outlook?
  = sunny: humidity?
      = high: no
      = normal: yes
  = o'cast: yes
  = rain: wind?
      = strong: no
      = weak: yes

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
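A minimal sketch of the conversion, assuming the nested-dict tree representation used earlier: each root-to-leaf path becomes one IF-THEN rule.

```python
# Sketch: convert a nested-dict decision tree into IF-THEN rules by
# collecting the attribute tests along each root-to-leaf path.
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):                 # leaf: emit one rule
        test = " AND ".join(f"({a} = {v})" for a, v in conditions)
        return [f"IF {test} THEN PlayTennis = {node}"]
    attribute, branches = next(iter(node.items()))
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
```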

Visualization of decision trees


[Figure: decision-tree visualizations in our D2MS system: fisheye view, tree map, 2.5D hyperbolic tree, cone tree.]

Ensemble learning

Ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.
• Boosting: make the examples currently misclassified more important
• Bagging: use different subsets of the training data for each model

[Figure: training data drawn from some unknown distribution is split into subsets Data1, Data2, ..., Data m; Learner1, Learner2, ..., Learner m produce Model1, Model2, ..., Model m, which a model combiner merges into the final model.]

Random forest

• A random forest is a forest of random decision trees (an ensemble).
• Tree bagging: given a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$:
  - Sample n training examples with replacement and learn a tree
  - Repeat B times to learn B decision trees
  - Make a prediction for an unknown case by majority vote over the results of the B trees
• Random forest: as tree bagging, but choose a random subset of the attributes when building each tree (see the sketch below). Leo Breiman, 1928-2005.
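A minimal sketch of tree bagging with a random attribute subset per tree, assuming scikit-learn's DecisionTreeClassifier and iris data (illustrative choices, not prescribed by the lecture); max_features="sqrt" plays the role of the random attribute subset.

```python
# Sketch: tree bagging with a random subset of attributes per tree
# (the essence of a random forest), using scikit-learn decision trees.
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
B = 25                                               # number of trees
forest = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))       # bootstrap sample
    tree = DecisionTreeClassifier(max_features="sqrt")  # random attribute subset
    tree.fit(X[idx], y[idx])
    forest.append(tree)

def predict(forest, x):
    votes = [int(t.predict([x])[0]) for t in forest]
    return Counter(votes).most_common(1)[0][0]       # majority vote

print(predict(forest, X[0]), "true label:", y[0])
```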

Issues in decision tree learning

• Attribute selection
• Pruning trees
• From trees to rules (high cost of pruning)
• Visualization
• Data access: recent development on very large training sets; fast, efficient and scalable methods (well-known systems: C4.5 and CART)
• Random forests
• Further reading: http://www.jaist.ac.jp/~bao/DA-K236/TopTenDMAlgorithms.pdf

Homework

A company prepares its marketing strategy: it sent out a promotion to various houses and recorded 4 facts (attributes) about each house, as well as whether the people responded or not (the outcome of the promotion). The data are given in the table.

Manually build a decision tree with the method studied in this lecture.