K236: Basis of Data Analytics
Lecture 7: Classification and prediction (decision tree induction)
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236
1. Introduction to data science (1) データ科学入門 6/9
2. Introduction to data science (2) データ科学入門 6/13
3. Data and databases データとデータベース 6/16
4. Review of univariate statistics 単変量統計 6/20
5. Review of linear algebra 線形代数 6/23
6. Data mining software データマイニングソフトウェア 6/27
7. Data preprocessing データ前処理 6/30
8. Classification and prediction (1) 分類と予測 (1) 7/4
9. Knowledge evaluation 知識評価 7/7
10. Classification and prediction (2) 分類と予測 (2) 7/11
11. Classification and prediction (3) 分類と予測 (3) 7/14
12. Mining association rules (1) 相関ルールの解析 7/18
13. Mining association rules (2) 相関ルールの解析 7/21
14. Cluster analysis クラスター解析 7/25
15. Review and examination レビューと試験 (the date is not fixed) 7/27
Data schemas vs. mining methods データ・スキーマ vs. 学習手法
Types of data:
- Flat data tables 表形式データ
- Relational databases 関係DB
- Temporal & spatial data 時空間データ
- Transactional databases 取引データ
- Multimedia data マルチメディアデータ
- Genome databases ゲノムデータ
- Materials science data 材料データ
- Textual data テキストデータ
- Web data ウェブデータ
- etc.

Mining tasks and methods マイニングの課題と手法:
- Classification/Prediction 分類/予測
  - Decision trees 決定木
  - Bayesian classification ベイジアン分類
  - Neural networks 神経回路網
  - Rule induction ルール帰納法
  - Support vector machines (SVM)
  - Hidden Markov Model 隠れマルコフ
  - etc.
- Description 記述
  - Association analysis 相関分析
  - Clustering クラスタリング
  - Summarization 要約
  - etc.
Outline

1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Classification and prediction
Given: (x1, y1), (x2, y2), ..., (xn, yn)
- xi is a description of an object, phenomenon, etc.
- yi (the label attribute) is some property of xi; if it is not available, learning is unsupervised
Find: a function f(x) that characterizes {xi}, or such that f(xi) = yi
Unsupervised data: the cells H1-H4, C1-C4 described by their attributes only.

      color   #nuclei   #tails
H1    light   1         1
H2    dark    1         1
H3    light   1         2
H4    light   2         1
C1    dark    1         2
C2    dark    2         1
C3    light   2         2
C4    dark    2         2

Supervised data: the same cells together with a class label.

      color   #nuclei   #tails   label
H1    light   1         1        healthy
H2    dark    1         1        healthy
H3    light   1         2        healthy
H4    light   2         1        healthy
C1    dark    1         2        cancerous
C2    dark    2         1        cancerous
C3    light   2         2        cancerous
C4    dark    2         2        cancerous
The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in this case, if the descriptive attributes are numerical, the problem is regression).

Classification—a two-step process
- Model construction: describing a set of predetermined classes
  - Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae (classifiers)
- Model usage: classifying future or unknown objects, after estimating the accuracy of the model:
  - The known label of each test object is compared with the result classified by the model
  - The accuracy rate is the percentage of test-set objects that are correctly classified by the model
  - The test set is independent of the training set; otherwise over-fitting will occur
Classification—a two-step process
(Figure: model construction runs a classification algorithm over the training data H1-H4, C1-C4 to produce a classifier, here the rule "If color = dark and #tails = 2 then cancerous cell"; model usage then applies the classifier to an unknown object, which is classified as cancerous.)
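As a sketch, the two steps can be run on the lecture's cell data in plain Python. The rule used as the "model" is the one shown in the figure; everything else (the variable names, and measuring accuracy on the training data rather than on a proper held-out test set) is illustrative only.

```python
training_data = [
    # (color, #nuclei, #tails, label)
    ("light", 1, 1, "healthy"),    # H1
    ("dark",  1, 1, "healthy"),    # H2
    ("light", 1, 2, "healthy"),    # H3
    ("light", 2, 1, "healthy"),    # H4
    ("dark",  1, 2, "cancerous"),  # C1
    ("dark",  2, 1, "cancerous"),  # C2
    ("light", 2, 2, "cancerous"),  # C3
    ("dark",  2, 2, "cancerous"),  # C4
]

# Step 1, model construction: here the "learned" classifier is simply
# the rule from the figure: if color = dark and #tails = 2 then cancerous.
def model(color, nuclei, tails):
    if color == "dark" and tails == 2:
        return "cancerous"
    return "healthy"

# Step 2, model usage: estimate accuracy by comparing known labels with
# the model's output (a real evaluation would use an independent test set),
# then classify unknown objects with the same function.
correct = sum(model(c, n, t) == label for c, n, t, label in training_data)
accuracy = correct / len(training_data)
```

Note that this single rule covers only part of the cancerous cells, so its accuracy is below 100%; a full decision tree (next slides) classifies all eight cells correctly.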
Criteria for classification methods
- Predictive accuracy (予測精度): the ability of the classifier to correctly predict unseen data
- Speed: the computational cost of building and applying the classifier
- Robustness (頑健性): the ability of the classifier to make correct predictions given noisy data or data with missing values
- Scalability (拡張性): the ability to construct the classifier efficiently given large amounts of data
- Interpretability (解釈容易性): the level of understanding and insight provided by the classifier
Machine learning: view by nature of methods
Tribes           Origins                 Master Algorithms
Symbolists       Logic, philosophy       Inverse deduction
Evolutionaries   Evolutionary biology    Genetic programming
Connectionists   Neuroscience            Backpropagation
Bayesians        Statistics              Probabilistic inference
Analogizers      Psychology              Kernel machines
(The five tribes of machine learning, after Pedro Domingos)

Symbolists
(Photos: Tom Mitchell, Steve Muggleton, Ross Quinlan)
Classification with decision trees
#nuclei?
├─ 1 → color?
│       ├─ light → H
│       └─ dark → #tails?
│                   ├─ 1 → H
│                   └─ 2 → C
└─ 2 → color?
        ├─ light → #tails?
        │            ├─ 1 → H
        │            └─ 2 → C
        └─ dark → C
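The tree above can be transcribed directly as nested conditionals. This is a sketch; the function and attribute names are illustrative, but the branching follows the tree exactly, and it classifies every training cell correctly.

```python
def classify(color, nuclei, tails):
    """The cell decision tree, written as nested if-tests."""
    if nuclei == 1:
        if color == "light":
            return "H"
        return "H" if tails == 1 else "C"    # dark
    else:                                    # nuclei == 2
        if color == "light":
            return "H" if tails == 1 else "C"
        return "C"                           # dark

cells = {
    "H1": ("light", 1, 1), "H2": ("dark", 1, 1),
    "H3": ("light", 1, 2), "H4": ("light", 2, 1),
    "C1": ("dark", 1, 2),  "C2": ("dark", 2, 1),
    "C3": ("light", 2, 2), "C4": ("dark", 2, 2),
}
# Every cell's predicted class matches the first letter of its name (H or C).
assert all(classify(*attrs) == name[0] for name, attrs in cells.items())
```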
Analogizers
(Photos: Peter Hart, Vladimir Vapnik, Douglas Hofstadter)
Kernel methods: the basic ideas
(Figure: a feature map φ sends the points x1, x2, ..., xn of the input space X to φ(x1), φ(x2), ..., φ(xn) in the feature space F; the inverse map goes back from F to X.)

kernel function k: X × X → R, with k(xi, xj) = φ(xi)·φ(xj)
kernel matrix K ∈ R^(n×n); the kernel-based algorithm runs on K (all computation is done on the kernel matrix)
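A minimal illustration of this idea, using the degree-2 polynomial kernel on R² (chosen here as an example, not taken from the slides): for φ(x) = (x1², √2·x1·x2, x2²) one has φ(x)·φ(z) = (x·z)², so the kernel matrix K can be filled in without ever computing φ explicitly.

```python
import math

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def k(x, z):
    """Kernel function k: X x X -> R, computed directly in input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

# Kernel matrix K (n x n): a kernel-based algorithm only ever needs K.
K = [[k(x, z) for z in points] for x in points]

# Verify k(xi, xj) = phi(xi) . phi(xj) on every pair of points.
for i, x in enumerate(points):
    for j, z in enumerate(points):
        dot = sum(a * b for a, b in zip(phi(x), phi(z)))
        assert abs(K[i][j] - dot) < 1e-9
```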
Connectionists
(Photos: Yann LeCun, Geoff Hinton, Yoshua Bengio)
Classification with neural networks
(Figure: a neural network takes the cell attributes, e.g. "color = dark", "#nuclei = 1", "#tails = 2", as inputs and outputs the class "healthy" or "cancerous" for the cells H1-H4, C1-C4.)
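For this particular dataset, a single linear threshold unit is already enough to separate the eight cells. The encoding, weights, and threshold below are hand-picked for illustration (a real network would learn its weights, e.g. by backpropagation); they are not from the slides.

```python
def features(color, nuclei, tails):
    # Hand-chosen numeric encoding: dark -> 1, and shift counts to start at 0.
    return (1 if color == "dark" else 0, nuclei - 1, tails - 1)

def predict(color, nuclei, tails, w=(1.0, 1.0, 1.0), threshold=1.5):
    """One linear threshold unit: fire 'cancerous' when w.x > threshold."""
    activation = sum(wi * xi for wi, xi in zip(w, features(color, nuclei, tails)))
    return "cancerous" if activation > threshold else "healthy"

data = [
    ("light", 1, 1, "healthy"),   ("dark", 1, 1, "healthy"),
    ("light", 1, 2, "healthy"),   ("light", 2, 1, "healthy"),
    ("dark", 1, 2, "cancerous"),  ("dark", 2, 1, "cancerous"),
    ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous"),
]
# With these weights the unit classifies all eight cells correctly.
assert all(predict(c, n, t) == label for c, n, t, label in data)
```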
Deep learning
Bayesians in machine learning
(Photos: David Heckerman, Judea Pearl, Michael Jordan)
Probabilistic graphical models: instances of graphical models
Graphical models are a family of probabilistic models:
- Directed graphical models (Bayes nets): naïve Bayes classifier, LDA, mixture models, DBNs, Hidden Markov Model (HMM), Kalman filter
- Undirected graphical models (MRFs): conditional random fields, MaxEnt model

(After Murphy, ML for life sciences)

Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues
Mining with decision trees 決定木でのマイニング
A decision tree is a flow-chart-like tree structure フローチャートのような木構造:
- each internal node denotes a test on an attribute 属性の値を判定するのが中間にある節
- each branch represents an outcome of the test 値を判定して各枝へ分岐
- leaf nodes represent classes or class distributions 末端(葉)はクラス/分布
- the top-most node in a tree is the root node 木構造の頂点は根

(Figure: the cell decision tree annotated with the objects at each node: "#nuclei" splits {H1, H2, H3, H4, C1, C2, C3, C4} into {H1, H2, H3, C1} (value 1) and {H4, C2, C3, C4} (value 2); further "color" and "#tails" tests split these sets down to single-class leaves such as {H1, H3} → H and {C3, C4} → C.)

Decision tree induction (DTI)
Decision tree generation consists of two phases:
- Tree construction (決定木構築)
  - At the start, all the training objects are at the root
  - Partition the examples recursively based on selected attributes
- Tree pruning (構築した木の枝刈)
  - Identify and remove branches that reflect noise or outliers

Use of decision trees: classify unknown objects (新事例の分類)
- Test the attribute values of the object against the decision tree
Tree construction: general algorithm 木構造を構築する一般的なアルゴリズム
Two steps: recursively generate the tree (順次、属性を選んでデータを分割) (steps 1-4), then prune the tree (構築した木の枝刈) (step 5).

1. At each node, choose the "best" attribute by a given measure for attribute selection 各節では事前に指定した選択基準に対し、最良の属性を選ぶ
2. Extend the tree by adding a new branch for each value of the attribute その属性の値ごとに枝を追加して木を拡張
3. Sort the training examples to the leaf nodes 末端に訓練データを並べ替える
4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes ある節のデータが同一クラスだけなら停止、混じっていれば1から繰返す
5. Prune the tree to avoid over-fitting 枝刈をして過学習を防ぐ

(Figure: the cell decision tree from earlier, built by these steps.)

Training data for concept "play-tennis"
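Steps 1-4 can be sketched as a recursive procedure. This is a generic skeleton under stated assumptions, not the exact algorithm of any particular system: `choose_attribute` stands for the pluggable selection measure of step 1 (information gain, defined later, is the usual choice), and the trivial chooser in the usage example simply takes the first remaining attribute.

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute):
    """examples: list of (attribute-dict, label) pairs. Pruning (step 5) omitted."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                  # step 4: all one class -> stop
        return labels[0]
    if not attributes:                         # no tests left: majority class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)            # step 1
    node = {}
    for value in {x[best] for x, _ in examples}:             # step 2: branch per value
        subset = [(x, y) for x, y in examples if x[best] == value]   # step 3
        remaining = [a for a in attributes if a != best]
        node[(best, value)] = build_tree(subset, remaining,  # step 4: recurse
                                         choose_attribute)
    return node

# Usage sketch on two cells, with a placeholder attribute chooser:
examples = [({"color": "dark", "tails": 2}, "C"),
            ({"color": "light", "tails": 1}, "H")]
tree = build_tree(examples, ["color", "tails"], lambda ex, attrs: attrs[0])
```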
- A typical dataset in machine learning
- 14 objects belonging to two classes {Y, N} are observed on 4 properties:
  - Dom(Outlook) = {sunny, overcast, rain}
  - Dom(Temperature) = {hot, mild, cool}
  - Dom(Humidity) = {high, normal}
  - Dom(Wind) = {weak, strong}
A decision tree for playing tennis テニスに関する決定木の一例
(Figure: a decision tree for play-tennis that tests "temperature" at the root. Each of the cool, hot, and mild branches then needs further tests on outlook, wind, and humidity before reaching single-class leaves, so the tree is deep and complicated, and one branch even ends in an empty ("null") leaf.)

A simple decision tree for playing tennis テニスに関する簡潔な決定木
outlook
├─ sunny {D1, D2, D8, D9, D11} → humidity
│       ├─ high {D1, D2, D8} → no
│       └─ normal {D9, D11} → yes
├─ o'cast {D3, D7, D12, D13} → yes
└─ rain {D4, D5, D6, D10, D14} → wind
        ├─ strong {D6, D14} → no
        └─ weak {D4, D5, D10} → yes
This tree is much simpler because "outlook" is selected at the root. How do we select a good attribute for splitting a decision node?
最初の属性として”outlook”を選択することで決定木がかなり簡潔になる. 分割条件として適切な属性をどのように選ぶのか?

Which attribute is the best? 最良の属性は?
- The "play-tennis" set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-] テニスデータ(テニスする(+)9件、しない(-)5件)のクラス分布[9+, 5-]
- If attributes "humidity" and "wind" split S into sub-nodes with the proportions of positive and negative objects shown below, which attribute is better? データを“humidity”で分割する場合と“wind”で分割する場合とでは、クラスの分布はどちらがよいか?
A1 = humidity [9+, 5-]           A2 = wind [9+, 5-]
├─ normal → [6+, 1-]             ├─ weak   → [6+, 2-]
└─ high   → [3+, 4-]             └─ strong → [3+, 3-]

Entropy エントロピー
- Entropy characterizes the impurity (purity) of an arbitrary collection of objects データ集合の純度の指標:
  - S is the collection of positive and negative objects 全体
  - p+ is the proportion of positive objects in S 該当データの比率
  - p− is the proportion of negative objects in S 非該当データの比率
  - In the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively テニスデータでは、それぞれ14, 9/14, 5/14
- Entropy is defined as follows エントロピーの定義式:

Entropy(S) = − p+·log2(p+) − p−·log2(p−)
Entropy
(Figure: the entropy function relative to a Boolean classification, as the proportion p+ of positive objects varies between 0 and 1.)

If the collection has c distinct groups of objects, then the entropy is defined by

Entropy(S) = − Σ (i = 1 to c) pi·log2(pi)
Example
From the 14 examples of play-tennis, 9 positive and 5 negative objects (denoted [9+, 5-]) 14件中、正例9件、負例5件なら:

Entropy([9+, 5-]) = − (9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
Notice:
1. Entropy is 0 if all members of S belong to the same class (全データが同じクラスの場合のエントロピーは0). For example, if all members are positive (p+ = 1), then p− is 0, and Entropy(S) = −1·log2(1) − 0·log2(0) = −1·0 − 0·log2(0) = 0 (using the convention 0·log2(0) = 0).
2. Entropy is 1 if the collection contains an equal number of positive and negative examples. If these numbers are unequal, the entropy is between 0 and 1. (両クラスのデータ件数が等しい場合のエントロピーは1、等しくなければ0から1の間の値)
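A small sketch of the entropy computation (the `entropy` helper and its signature are illustrative; the explicit `p > 0` check implements the 0·log2(0) = 0 convention):

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative objects."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # convention: 0 * log2(0) = 0
            e -= p * math.log2(p)
    return e

# The values from the slides:
print(round(entropy(9, 5), 3))   # [9+, 5-] of play-tennis
print(entropy(14, 0))            # all one class -> 0
print(entropy(7, 7))             # evenly split -> 1
```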
Information gain measures the expected reduction in entropy
We define a measure of the effectiveness of an attribute in classifying data, called information gain (情報利得): the expected reduction in entropy caused by partitioning the objects according to this attribute. その属性によるデータ分割における不純度低減効果をはかる尺度のひとつが情報利得
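As a sketch, the standard form of this measure, Gain(S, A) = Entropy(S) − Σv (|Sv|/|S|)·Entropy(Sv), can be computed for the two candidate splits shown earlier (humidity: [6+, 1-] and [3+, 4-]; wind: [6+, 2-] and [3+, 3-]). The helper names below are illustrative.

```python
import math

def entropy(pos, neg):
    e, total = 0.0, pos + neg
    for count in (pos, neg):
        p = count / total
        if p > 0:               # convention: 0 * log2(0) = 0
            e -= p * math.log2(p)
    return e

def gain(parent, splits):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv)."""
    n = sum(parent)
    return entropy(*parent) - sum((p + q) / n * entropy(p, q)
                                  for p, q in splits)

S = (9, 5)                                  # play-tennis: [9+, 5-]
g_humidity = gain(S, [(6, 1), (3, 4)])      # normal: [6+, 1-], high: [3+, 4-]
g_wind     = gain(S, [(6, 2), (3, 3)])      # weak: [6+, 2-], strong: [3+, 3-]
# g_humidity ≈ 0.152 > g_wind ≈ 0.048, so humidity is the better attribute
```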