Overfitting in Decision Trees, Boosting Slides

Decision Trees: Overfitting
CSE 446: Machine Learning
Emily Fox, University of Washington, January 30, 2017
©2017 Emily Fox

Decision tree recap
[Tree diagram: loan data at the root (22 Safe, 18 Risky) split on Credit?.
 Credit = excellent (9, 0) → Safe.
 Credit = fair (9, 4) → split on Term?: 3 years (0, 4) → Risky; 5 years (9, 0) → Safe.
 Credit = poor (4, 14) → split on Income?: high (4, 5) → split on Term?: 3 years (0, 2) → Risky; 5 years (4, 3) → Safe; low (0, 9) → Risky.]
• For each leaf node, set ŷ = majority value.

Greedy decision tree learning
• Step 1: Start with an empty tree.
• Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error).
• For each split of the tree:
  - Step 3: If there is nothing more to do (stopping conditions 1 & 2), make predictions.
  - Step 4: Otherwise, go to Step 2 and continue (recurse) on this split.

Scoring a loan application
• xi = (Credit = poor, Income = high, Term = 5 years)
• Following the tree: Credit = poor → Income = high → Term = 5 years gives ŷi = Safe.

Decision trees vs. logistic regression: Example

Logistic regression
  Feature   Value   Learned weight
  h0(x)     1        0.22
  h1(x)     x[1]     1.12
  h2(x)     x[2]    -1.07

Depth 1: Split on x[1]
[Tree diagram: root has 18 negative and 13 positive examples; the split x[1] < -0.07 gives (13 −, 3 +) on the left and (4 −, 11 +) on the right.]

Depth 2
[Tree diagram: the left child (x[1] < -0.07) splits again on x[1] < -1.66 into (7 −, 0 +) and (6 −, 3 +); the right child splits on x[2] < 1.55 into (1 −, 11 +) and (3 −, 0 +).]

Threshold split caveat
• For threshold splits, the same feature can be used multiple times.

Decision boundaries
[Figure: decision boundaries at depth 1, depth 2, and depth 10.]

Comparing decision boundaries
[Figure: decision tree boundaries at depth 1, 3, and 10 vs. logistic regression boundaries with degree 1, degree 2, and degree 6 features.]

Predicting probabilities with decision trees
[Tree diagram: loan data at the root (18 Safe, 12 Risky) split on Credit?: excellent (9, 2) → Safe; fair (6, 9) → Risky; poor (3, 1) → Safe.]
• At the poor-credit leaf: P(y = Safe | x) = 3 / (3 + 1) = 0.75.

Depth 1 probabilities
[Tree diagram: the depth-1 split on x[1] above, with class probabilities at each leaf.]

Depth 2 probabilities
[Tree diagram: the depth-2 tree above, with class probabilities at each leaf.]

Comparison with logistic regression
[Figure: class probability surfaces for logistic regression (degree 2) vs. decision trees (depth 2).]

Overfitting in decision trees

What happens when we increase depth?
• Training error decreases with depth:

  Tree depth       1      2      3      5      10
  Training error   0.22   0.13   0.10   0.03   0.00

[Figure: decision boundary at each depth.]
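The pattern in this table is easy to reproduce with any off-the-shelf decision tree learner. Below is a minimal sketch (not the course's data or code) using scikit-learn on a synthetic two-feature dataset; the dataset, split, and exact error values are illustrative assumptions only.

```python
# Sketch: training error shrinks with depth, while validation error
# eventually rises -- the overfitting pattern shown on the slide.
# Synthetic data; not the course's 2-D example.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in [1, 2, 3, 5, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = 1 - tree.score(X_train, y_train)   # classification error = 1 - accuracy
    val_err = 1 - tree.score(X_val, y_val)
    print(f"depth={depth:2d}  train error={train_err:.2f}  val error={val_err:.2f}")
```

Training error falls toward zero as depth grows, while validation error typically bottoms out and then climbs, which motivates the two remedies below.
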
Two approaches to picking simpler trees
1. Early stopping: stop the learning algorithm before the tree becomes too complex.
2. Pruning: simplify the tree after the learning algorithm terminates.

Technique 1: Early stopping
• Stopping conditions (recap):
  1. All examples have the same target value.
  2. No more features to split on.
• Early stopping conditions:
  1. Limit tree depth (choose max_depth using a validation set).
  2. Do not consider splits that do not cause a sufficient decrease in classification error.
  3. Do not split an intermediate node which contains too few data points.

Challenge with early stopping condition 1
• Hard to know exactly when to stop.
• Also, we might want some branches of the tree to go deeper while others remain shallow.
[Figure: classification error vs. tree depth, from simple to complex trees; training error keeps decreasing while true error turns back up beyond the best max_depth.]

Early stopping condition 2: Pros and cons
• Pros:
  - A reasonable heuristic for early stopping that avoids useless splits.
• Cons:
  - Too short-sighted: we may miss "good" splits that occur right after "useless" splits.
  - We saw this with the "xor" example.

Two approaches to picking simpler trees
1. Early stopping: stop the learning algorithm before the tree becomes too complex.
2. Pruning: simplify the tree after the learning algorithm terminates. Pruning complements early stopping.

Pruning: Intuition
• Train a complex tree, simplify later.
[Figure: a complex tree simplified into a smaller tree.]

Pruning motivation
• Simplify after the tree is built; don't stop too early.
[Figure: classification error vs. tree depth, from simple to complex trees; training error keeps decreasing while true error rises for overly complex trees.]

Scoring trees: Desired total quality format
• Want to balance:
  i.  how well the tree fits the data;
  ii. the complexity of the tree.
• Total cost = measure of fit + measure of complexity
  - Measure of fit (classification error): a large value means a bad fit to the training data.
  - Measure of complexity: a large value means the tree is likely to overfit.

Simple measure of complexity of a tree
• L(T) = number of leaf nodes.
[Tree diagram: a single split on Credit? with five leaves (excellent, good, fair, bad, poor) labeled Safe, Risky, Safe, Safe, Risky, so L(T) = 5.]

Balance simplicity & predictive power
• Too complex a tree risks overfitting; too simple a tree has high classification error.
[Figure: the full loan tree (splits on Credit?, Income?, Term?) as the too-complex extreme vs. a single-leaf tree that always predicts Risky as the too-simple extreme.]

Balancing fit and complexity
• Total cost: C(T) = Error(T) + λ L(T), where λ is a tuning parameter.
• If λ = 0: only training error matters, so we recover the standard (unpruned) tree.
• If λ = ∞: complexity dominates, so the best tree is a single leaf.
• If λ is in between: we trade off training fit against tree size.

Tree pruning algorithm

Step 1: Consider a split
[Tree diagram: the full loan tree T; the lower Term? split (under Income = high) is the candidate for pruning.]

Step 2: Compute the total cost C(T) of the split
• C(T) = Error(T) + λ L(T), with λ = 0.03:

  Tree       Error   #Leaves   Total
  T          0.25    6         0.43

Step 3: "Undo" the split on Tsmaller
• Replace the candidate split by a leaf node:

  Tree       Error   #Leaves   Total
  T          0.25    6         0.43
  Tsmaller   0.26    5         0.41

Step 4: Prune if the total cost is lower: C(Tsmaller) ≤ C(T)
• Tsmaller has worse training error but lower overall cost, so yes, replace the split by a leaf node.

Step 5: Repeat Steps 1-4 for every split
• Decide whether each split can be "pruned".
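The cost comparison above reduces to one line of arithmetic. Here is a minimal sketch that mirrors the table; the helper name is an illustrative assumption, the hard-coded numbers are the table values, and λ = 0.03 is the setting consistent with the listed totals.

```python
# Sketch: compare total cost C(T) = Error(T) + lambda * L(T) for the full tree
# and the candidate pruned tree from the table above.
def total_cost(error, num_leaves, lam):
    """Total cost = measure of fit + lambda * measure of complexity (leaf count)."""
    return error + lam * num_leaves

lam = 0.03
c_full = total_cost(error=0.25, num_leaves=6, lam=lam)      # tree T
c_smaller = total_cost(error=0.26, num_leaves=5, lam=lam)   # tree Tsmaller

print(f"C(T)        = {c_full:.2f}")      # 0.43
print(f"C(Tsmaller) = {c_smaller:.2f}")   # 0.41
print("Prune?", c_smaller <= c_full)      # True: worse fit, but lower total cost
```

The same idea appears in practice as cost-complexity pruning; for instance, scikit-learn's ccp_alpha parameter plays a role analogous to λ, penalizing trees by their number of leaves.
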
Summary of overfitting in decision trees

What you can do now…
• Identify when overfitting occurs in decision trees.
• Prevent overfitting with early stopping:
  - Limit tree depth.
  - Do not consider splits that do not reduce classification error.
  - Do not split intermediate nodes with only a few points.
• Prevent overfitting by pruning complex trees:
  - Use a total cost formula that balances classification error and tree complexity.
  - Use total cost to merge potentially complex trees into simpler ones.

Boosting
CSE 446: Machine Learning
Emily Fox, University of Washington, January 30, 2017
©2017 Emily Fox

Simple (weak) classifiers are good!
[Figure: a decision stump "Income > $100K?" with Yes → Safe and No → Risky, shown alongside logistic regression and shallow decision trees as examples of simple (weak) classifiers.]
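To make the weak-classifier idea concrete, here is a hedged sketch (not from the slides) that plugs a depth-1 decision stump into scikit-learn's AdaBoost on a synthetic dataset; the data and all parameter choices are illustrative assumptions.

```python
# Sketch: a depth-1 decision stump as the weak learner inside AdaBoost.
# Synthetic data; not the course's loan dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # one split, e.g. "Income > $100K?"
# Note: older scikit-learn versions take base_estimator= instead of estimator=.
boosted = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)

print("single stump accuracy:  ", cross_val_score(stump, X, y, cv=5).mean())
print("boosted stumps accuracy:", cross_val_score(boosted, X, y, cv=5).mean())
```

A single stump is a weak but cheap classifier; boosting many of them typically outperforms any one of them, which is where the boosting portion of the lecture picks up.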