K236: Basis of Data Analytics
Lecture 7: Classification and Prediction (Decision Tree Induction)
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Schedule of K236
1. Introduction to data science (1) 6/9
2. Introduction to data science (2) 6/13
3. Data and databases 6/16
4. Review of univariate statistics 6/20
5. Review of linear algebra 6/23
6. Data mining software 6/27
7. Data preprocessing 6/30
8. Classification and prediction (1) 7/4
9. Knowledge evaluation 7/7
10. Classification and prediction (2) 7/11
11. Classification and prediction (3) 7/14
12. Mining association rules (1) 7/18
13. Mining association rules (2) 7/21
14. Cluster analysis 7/25
15. Review and examination (the date is not fixed) 7/27

Data schemas vs. mining methods
Types of data:
• Flat data tables
• Relational databases
• Temporal and spatial data
• Transactional databases
• Multimedia data
• Genome databases
• Materials science data
• Textual data
• Web data
• etc.
Mining tasks and methods:
• Classification/prediction: decision trees, Bayesian classification, neural networks, rule induction, support vector machines (SVM), hidden Markov models, etc.
• Description: association analysis, clustering, summarization, etc.

Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Classification and prediction
Given: $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$
• $x_i$ is a description of an object, phenomenon, etc.
• $y_i$ (the label attribute) is some property of $x_i$; if it is not available, the learning is unsupervised.
Find: a function $f(x)$ that characterizes $\{x_i\}$, or such that $f(x_i) = y_i$.

Supervised data (the unsupervised data are the same objects without the label column):

  id   color   #nuclei   #tails   label
  H1   light   1         1        healthy
  H2   dark    1         1        healthy
  H3   light   1         2        healthy
  H4   light   2         1        healthy
  C1   dark    1         2        cancerous
  C2   dark    2         1        cancerous
  C3   light   2         2        cancerous
  C4   dark    2         2        cancerous

The problem is usually called classification if the label is categorical, and prediction if the label is continuous (in the latter case, if the descriptive attributes are numerical, the problem is regression).

Classification: a two-step process
• Model construction: describing a set of predetermined classes
  - Each tuple/object is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae (classifiers).
• Model usage: classifying future or unknown objects, and estimating the accuracy of the model
  - The known label of each test object is compared with the classified result from the model.
  - The accuracy rate is the percentage of test-set objects that are correctly classified by the model.
  - The test set is independent of the training set; otherwise over-fitting will occur.

[Figure: the two-step process on the cell data. Model construction: a classification algorithm applied to the training data (H1-H4, C1, C2) yields the classifier "If color = dark and #tails = 2 then cancerous cell". Model usage: asked whether an unknown object is cancerous, the classifier answers "cancerous".]
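To make the two steps concrete, here is a minimal sketch in Python, assuming pandas and scikit-learn are available (the slides do not prescribe any library); it constructs a decision-tree classifier from the eight-cell training table above and then uses it on the unknown object from the figure.

```python
# Step 1 (model construction) and step 2 (model usage) on the toy cell data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Training set: the eight labeled cells from the table above.
train = pd.DataFrame(
    [["light", 1, 1, "healthy"],
     ["dark",  1, 1, "healthy"],
     ["light", 1, 2, "healthy"],
     ["light", 2, 1, "healthy"],
     ["dark",  1, 2, "cancerous"],
     ["dark",  2, 1, "cancerous"],
     ["light", 2, 2, "cancerous"],
     ["dark",  2, 2, "cancerous"]],
    columns=["color", "nuclei", "tails", "label"],
    index=["H1", "H2", "H3", "H4", "C1", "C2", "C3", "C4"],
)

# One-hot encode the categorical 'color' attribute; numeric columns pass through.
X = pd.get_dummies(train[["color", "nuclei", "tails"]])
model = DecisionTreeClassifier().fit(X, train["label"])

# Model usage: classify an unknown cell (dark, 1 nucleus, 2 tails).
unknown = pd.get_dummies(
    pd.DataFrame([["dark", 1, 2]], columns=["color", "nuclei", "tails"])
).reindex(columns=X.columns, fill_value=0)   # align encoded columns with X
print(model.predict(unknown))                # expected: ['cancerous']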
Criteria for classification methods
• Predictive accuracy: the ability of the classifier to correctly predict unseen data.
• Speed: the computation cost.
• Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values.
• Scalability: the ability to construct the classifier efficiently given large amounts of data.
• Interpretability: the level of understanding and insight that is provided by the classifier.

Machine learning: view by nature of methods (the five tribes of machine learning, after Pedro Domingos)

  Tribe           Origins               Master algorithm
  Symbolists      Logic, philosophy     Inverse deduction
  Evolutionaries  Evolutionary biology  Genetic programming
  Connectionists  Neuroscience          Backpropagation
  Bayesians       Statistics            Probabilistic inference
  Analogizers     Psychology            Kernel machines

Symbolists: Tom Mitchell, Steve Muggleton, Ross Quinlan

Classification with decision trees
[Figure: a decision tree for the cell data]
#nuclei?
  1 → color?
      light → H
      dark → #tails? (1 → H, 2 → C)
  2 → color?
      light → #tails? (1 → H, 2 → C)
      dark → C

Analogizers: Peter Hart, Vladimir Vapnik, Douglas Hofstadter

Kernel methods: the basic ideas
[Figure: a map $\phi$ embeds the inputs $x_1, \ldots, x_n$ from the input space X into a feature space F; the kernel function $k: X \times X \to \mathbb{R}$, with $k(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, fills the $n \times n$ kernel matrix K; the kernel-based algorithm then works on K (all computation is done on the kernel matrix).]

Connectionists: Yann LeCun, Geoff Hinton, Yoshua Bengio

Classification with neural networks
[Figure: a network with inputs color = dark, #nuclei = 1, #tails = 2 and outputs Healthy / Cancerous.]

Deep learning

Bayesians in machine learning: David Heckerman, Judea Pearl, Michael Jordan

Probabilistic graphical models
Graphical models are instances of probabilistic models (after Murphy, ML for life sciences):
• Directed (Bayes nets): naïve Bayes classifier, LDA, mixture models, DBNs, hidden Markov models (HMM), Kalman filter
• Undirected (MRFs): conditional random fields, MaxEnt model

Outline
1. Issues regarding classification and prediction
2. Attribute selection in decision tree induction
3. Tree pruning and other issues

Mining with decision trees
A decision tree is a flow-chart-like tree structure:
• each internal node denotes a test on an attribute,
• each branch represents an outcome of the test,
• leaf nodes represent classes or class distributions,
• the top-most node in a tree is the root node.

[Figure: the tree for the cell data, with the objects reaching each node]
#nuclei? {H1, H2, H3, H4, C1, C2, C3, C4}
  1 → {H1, H2, H3, C1}: color?
      light → {H1, H3}: H
      dark → {H2, C1}: #tails? (1 → {H2}: H, 2 → {C1}: C)
  2 → {H4, C2, C3, C4}: #tails?
      1 → {H4, C2}: color? (light → {H4}: H, dark → {C2}: C)
      2 → {C3, C4}: C

Decision tree induction (DTI)
• Decision tree generation consists of two phases:
  - Tree construction: partition the examples recursively based on selected attributes; at the start, all the training objects are at the root.
  - Tree pruning: identify and remove branches that reflect noise or outliers.
• Use of decision trees: classify unknown objects by testing their attribute values against the decision tree.

Tree construction: general algorithm
Two steps: recursively generate the tree (steps 1-4), then prune it (step 5); the same cell-data tree as above illustrates the process. A code sketch follows this list.
1. At each node, choose the "best" attribute by a given measure for attribute selection.
2. Extend the tree by adding a new branch for each value of the attribute.
3. Sort the training examples to the leaf nodes.
4. If the examples in a node belong to one class, stop; else repeat steps 1-4 for the leaf nodes.
5. Prune the tree to avoid over-fitting.
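The recursion in steps 1-4 can be written down directly. The following is a minimal sketch, not the lecture's own code: it assumes the attribute-selection measure is supplied as a function `best_attribute(examples, attributes)`, since choosing that measure is exactly the topic of the next section.

```python
# ID3-style recursive tree construction (steps 1-4 of the algorithm above).
from collections import Counter

def build_tree(examples, attributes, best_attribute):
    """examples: list of (dict-of-attribute-values, label) pairs."""
    labels = [label for _, label in examples]
    # Step 4 (stop): all examples in this node belong to one class,
    # or no attributes remain; the leaf takes the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: choose the "best" attribute by the given measure.
    a = best_attribute(examples, attributes)
    node = {}
    # Steps 2-3: add a branch per value of `a` and sort the examples
    # into the corresponding sub-nodes, then recurse on each.
    for v in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        node[(a, v)] = build_tree(subset, rest, best_attribute)
    return node
```

The tree is returned as a nested dict keyed by (attribute, value) pairs, with majority-class labels at the leaves; step 5 (pruning) is deliberately left out here and treated later.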
Training data for the concept "play-tennis"
• A typical dataset in machine learning.
• 14 objects belonging to two classes {Y, N} are observed on 4 attributes:
  - Dom(Outlook) = {sunny, overcast, rain}
  - Dom(Temperature) = {hot, mild, cool}
  - Dom(Humidity) = {high, normal}
  - Dom(Wind) = {weak, strong}

[The table itself is an image in the slides; it is Quinlan's standard play-tennis data, consistent with the splits shown below:]

  Day  Outlook   Temperature  Humidity  Wind    Play
  D1   sunny     hot          high      weak    N
  D2   sunny     hot          high      strong  N
  D3   overcast  hot          high      weak    Y
  D4   rain      mild         high      weak    Y
  D5   rain      cool         normal    weak    Y
  D6   rain      cool         normal    strong  N
  D7   overcast  cool         normal    strong  Y
  D8   sunny     mild         high      weak    N
  D9   sunny     cool         normal    weak    Y
  D10  rain      mild         normal    weak    Y
  D11  sunny     mild         normal    strong  Y
  D12  overcast  mild         high      strong  Y
  D13  overcast  hot          normal    weak    Y
  D14  rain      mild         high      strong  N

A decision tree for playing tennis
[Figure: one possible tree, with "temperature" at the root]
temperature?
  cool → {D5, D6, D7, D9}: outlook?
      sunny → {D9}: yes
      o'cast → {D7}: yes
      rain → {D5, D6}: wind? (true → {D6}: no, false → {D5}: yes)
  hot → {D1, D2, D3, D13}: wind?
      true → {D2}: no
      false → {D1, D3, D13}: humidity?
          high → {D1, D3}: outlook? (sunny → {D1}: no, o'cast → {D3}: yes, rain → null)
          normal → {D13}: yes
  mild → {D4, D8, D10, D11, D12, D14}: outlook?
      sunny → {D8, D11}: wind? (true → {D11}: yes, false → {D8}: no)
      o'cast → {D12}: yes
      rain → {D4, D10, D14}: humidity?
          high → {D4, D14}: wind? (true → {D14}: no, false → {D4}: yes)
          normal → {D10}: yes

A simple decision tree for playing tennis
outlook?
  sunny → {D1, D2, D8, D9, D11}: humidity?
      high → {D1, D2, D8}: no
      normal → {D9, D11}: yes
  o'cast → {D3, D7, D12, D13}: yes
  rain → {D4, D5, D6, D10, D14}: wind?
      true → {D6, D14}: no
      false → {D4, D5, D10}: yes

This tree is much simpler because "outlook" is selected at the root. How do we select a good attribute to split a decision node?

Which attribute is the best?
• The play-tennis set S contains 9 positive objects (+) and 5 negative objects (-), denoted [9+, 5-].
• If the attributes "humidity" and "wind" split S into sub-nodes with the following proportions of positive and negative objects, which attribute is better?
  - A1 = humidity: [9+, 5-] splits into normal → [6+, 1-] and high → [3+, 4-]
  - A2 = wind: [9+, 5-] splits into weak → [6+, 2-] and strong → [3+, 3-]

Entropy
• Entropy characterizes the impurity (purity) of an arbitrary collection of objects:
  - S is the collection of positive and negative objects;
  - $p_\oplus$ is the proportion of positive objects in S;
  - $p_\ominus$ is the proportion of negative objects in S;
  - in the play-tennis example, these numbers are 14, 9/14, and 5/14, respectively.
• Entropy is defined as

$$\mathrm{Entropy}(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

[Figure: the entropy function relative to a Boolean classification, as the proportion $p_\oplus$ of positive objects varies between 0 and 1.]
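As a small check of the definition, the sketch below computes the entropy of S = [9+, 5-] and the resulting information gain of the two candidate splits from the "Which attribute is the best?" slide (the gain formula, entropy minus the weighted entropy of the sub-nodes, is the standard one and is developed in the next lecture).

```python
# Entropy of a two-class collection and information gain of a split.
from math import log2

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for p in (pos / total, neg / total):
        if p > 0:                      # 0 * log2(0) is taken as 0
            e -= p * log2(p)
    return e

def gain(pos, neg, splits):
    """splits: list of (pos, neg) counts of the sub-nodes."""
    total = pos + neg
    rest = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - rest

print(entropy(9, 5))                   # Entropy(S) ~ 0.940
print(gain(9, 5, [(6, 1), (3, 4)]))    # humidity: ~0.151
print(gain(9, 5, [(6, 2), (3, 3)]))    # wind:     ~0.048
```

Humidity yields the larger gain, so by this measure it is the better of the two attributes for splitting S.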