Direct Optimization for Classification with Boosting

Total Pages: 16

File Type: PDF, Size: 1020 KB

Direct Optimization for Classification with Boosting

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

Shaodan Zhai
M.S., Northwest University, 2009
B.S., Northwest University, 2006

2015
Wright State University

Wright State University GRADUATE SCHOOL
November 9, 2015

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Shaodan Zhai ENTITLED Direct Optimization for Classification with Boosting BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.

Shaojun Wang, Ph.D., Dissertation Director
Michael Raymer, Ph.D., Director, Department of Computer Science and Engineering
Robert E. W. Fyffe, Ph.D., Vice President for Research and Dean of the Graduate School

Committee on Final Examination: Shaojun Wang, Ph.D.; Keke Chen, Ph.D.; Krishnaprasad Thirunarayan, Ph.D.; Chaocheng Huang, Ph.D.

ABSTRACT

Zhai, Shaodan. Ph.D., Department of Computer Science and Engineering, Wright State University, 2015. Direct Optimization for Classification with Boosting.

Boosting, as one of the state-of-the-art classification approaches, is widely used in industry for a broad range of problems. Existing boosting methods often formulate classification tasks as convex optimization problems by using surrogates of performance measures. While convex surrogates are computationally efficient to optimize globally, they are sensitive to outliers and inconsistent under some conditions. On the other hand, boosting's success can be ascribed to maximizing the margins, but few boosting approaches are designed to directly maximize the margin. In this research, we design novel boosting algorithms that directly optimize non-convex performance measures, including the empirical classification error and margin functions, without resorting to any surrogates or approximations. We first applied this approach to binary classification, and then extended the idea to more complicated classification problems, including multi-class classification, semi-supervised classification, and multi-label classification. These extensions are non-trivial: we have to mathematically re-formulate the optimization problem, defining new objectives and designing new algorithms that depend on the specific learning tasks. Moreover, we showed good theoretical properties of the optimization objectives, which explain why we define these objectives and how we design algorithms to efficiently optimize them. Finally, we showed experimentally that the proposed approaches achieve results competitive with or better than state-of-the-art convex-relaxation boosting methods, and that they perform especially well in noisy cases.
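The abstract refers to directly optimizing the empirical classification error (0-1 loss) and margin functions of a boosted ensemble, and the figure and table lists below mention average margins over the bottom n′ examples. As background, here is a minimal Python sketch of those quantities for a fixed weighted ensemble of weak learners; it only illustrates the objectives being discussed, not the DirectBoost algorithms themselves, and the normalization of margins by the sum of weights is one common convention assumed here.

```python
import numpy as np

def ensemble_score(weak_learners, alphas, X):
    """F(x) = sum_t alpha_t * h_t(x), where each weak learner returns values in {-1, +1}."""
    return sum(a * h(X) for h, a in zip(weak_learners, alphas))

def empirical_01_loss(weak_learners, alphas, X, y):
    """Empirical classification error: fraction of examples with sign(F(x)) != y."""
    return np.mean(np.sign(ensemble_score(weak_learners, alphas, X)) != y)

def average_bottom_margin(weak_learners, alphas, X, y, n_prime):
    """Average of the n_prime smallest normalized margins y * F(x) / sum_t |alpha_t|."""
    margins = y * ensemble_score(weak_learners, alphas, X) / np.sum(np.abs(alphas))
    return np.mean(np.sort(margins)[:n_prime])

# Toy usage: two hand-made decision stumps on four 1-D points
X = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([-1, -1, 1, 1])
h1 = lambda X: np.where(X > 0.0, 1, -1)
h2 = lambda X: np.where(X > -1.0, 1, -1)
print(empirical_01_loss([h1, h2], [1.0, 0.5], X, y))        # 0.0 on this toy data
print(average_bottom_margin([h1, h2], [1.0, 0.5], X, y, 3)) # mean of the 3 smallest margins
```

Both quantities are non-convex in the ensemble weights, which is why the thesis develops direct (surrogate-free) optimization procedures for them instead of relying on convex relaxations.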
Contents

1 Introduction ..... 1
1.1 Direct Loss Minimization ..... 3
1.2 Direct Margin Maximization ..... 4
1.3 Multi-class Boosting ..... 5
1.4 Semi-supervised Boosting ..... 6
1.5 Multi-label Boosting ..... 8
1.6 Our works ..... 9
2 DirectBoost: a boosting method for binary classification ..... 12
2.1 Minimizing Classification Error ..... 13
2.2 Maximizing Margins ..... 19
2.3 Experimental Results ..... 34
2.3.1 Experiments on UCI data ..... 35
2.3.2 Evaluate noise robustness ..... 38
3 DMCBoost: a boosting method for multi-class classification ..... 41
3.1 Minimizing Multi-class Classification Errors ..... 42
3.2 Maximizing a Multi-class Margin Objective ..... 48
3.3 Experiments ..... 60
3.3.1 Experimental Results on UCI Datasets ..... 61
3.3.2 Evaluation of Noise Robustness ..... 64
4 SSDBoost: a boosting method for semi-supervised classification ..... 66
4.1 Minimize generalized classification error ..... 67
4.2 Maximize the generalized average margin ..... 70
4.3 Experiments ..... 77
4.3.1 UCI datasets ..... 79
4.3.2 Evaluation of Noise Robustness ..... 81
5 MRMM: a multi-label boosting method ..... 84
5.1 Minimum Ranking Margin ..... 85
5.2 The Algorithm ..... 89
5.2.1 Outline of MRMM algorithm ..... 89
5.2.2 Line search algorithm for a given weak classifier ..... 90
5.2.3 Weak learning algorithm ..... 92
5.2.4 Computational Complexity ..... 93
5.3 Experiments ..... 94
5.3.1 Settings ..... 94
5.3.2 Experimental Results ..... 97
5.3.3 Influence Of Each Parameter In MRMM ..... 100
6 Conclusion and future work ..... 103
Bibliography ..... 105

List of Figures

1.1 Convex surrogates of 0-1 loss ..... 2
2.1 An example of computing the minimum 0-1 loss of a weak learner over 4 samples ..... 16
2.2 0-1 loss curve in one dimension when the 0-1 loss hasn't reached 0 ..... 17
2.3 0-1 loss surface in two dimensions when the 0-1 loss hasn't reached 0 ..... 17
2.4 0-1 loss curve in one dimension when the 0-1 loss has reached 0 ..... 18
2.5 0-1 loss surface in two dimensions when the 0-1 loss has reached 0 ..... 18
2.6 Margin curves of six examples. At points q_1, q_2, q_3 and q_4, the median example is changed; at points q_2 and q_4, the set of bottom n′ = 3 examples is changed ..... 22
2.7 The average margin curve over the bottom 5 examples in one dimension ..... 29
2.8 The average margin surface over the bottom 5 examples in two dimensions ..... 29
2.9 Contour of the average margin surface over the bottom 5 examples in two dimensions ..... 30
2.10 Path of the ε-relaxation method, adapted from (Bertsekas, 1998, 1999) ..... 31
2.11 The bottom k-th order margin curve in one dimension ..... 33
2.12 The bottom k-th order margin surface in two dimensions ..... 33
2.13 Contour of the bottom k-th order margin surface in two dimensions ..... 34
2.14 The value of the average margin of the bottom n′ samples vs. the number of iterations for LPBoost with column generation and DirectBoost_avg on the Australian dataset; left: decision tree, right: decision stump ..... 37
3.1 Three scenarios for computing the empirical error of a weak learner h_t over an example pair (x_i, y_i), where l denotes the incorrect label with the highest score and q denotes the intersection point that results in an empirical error change. The red bold line in each scenario represents the inference function of example x_i and its true label y_i ..... 43
3.2 Three scenarios of the margin curve of a weak learner h_t over an example pair (x_i, y_i), where l denotes the incorrect label with the highest score ..... 47
3.3 Left: training errors of AdaBoost.M1, SAMME, and DMCBoost with the 0-1 loss minimization algorithm on the Car dataset. Middle: test errors of AdaBoost.M1, SAMME, and DMCBoost with the 0-1 loss minimization algorithm on the Car dataset. Right: training and test error of DMCBoost when it switches to the margin maximization algorithm on the Car dataset ..... 62
4.1 Mean error rates (in %) of ASSEMBLE, EntropyBoost, and SSDBoost with 20 labeled and an increasing number of unlabeled samples on the Mushroom dataset ..... 80
4.2 Margin distributions of ASSEMBLE, EntropyBoost, and SSDBoost for labeled and unlabeled samples on synthetic data with 20% label noise. For SSDBoost, the parameters n′ and m′ used to plot the figure were selected on the validation set; they are n/2 and m/2, respectively ..... 82
5.1 The average of the 5 smallest MRMs along two coordinates in one dimension (upper left and right), and the surface of the average of the 5 smallest MRMs in these two coordinates (lower) ..... 85
5.2 Left: F1 and micro-F1 vs. different values of m on the medical dataset. Right: F1 and micro-F1 vs. different values of k on the medical dataset ..... 101
5.3 Left: F1 and micro-F1 vs. the number of ensemble rounds on the medical dataset. Right: objective value vs. the number of ensemble rounds on the medical dataset ..... 101

List of Tables

2.1 Percent test errors of AdaBoost, LogitBoost, soft-margin LPBoost with column generation, BrownBoost, and three DirectBoost methods on 10 UCI datasets, each with N samples and D attributes ..... 35
2.2 Number of iterations and total run times (in seconds) in the training stage on the Adult dataset with 10,000 training samples and decision trees of depth 3 ..... 37
2.3 Percent test errors of AdaBoost, LogitBoost, LPBoost, BrownBoost, DirectBoost_avg, and DirectBoost_order on Long and Servedio's three-point example with random noise ..... 39
2.4 Percent test errors of AdaBoost, LogitBoost, LPBoost, BrownBoost, DirectBoost_avg, and DirectBoost_order on two UCI datasets with random noise ..... 39
3.1 Description of 13 UCI datasets ..... 61
3.2 Test error (and standard deviation) of the multi-class boosting methods AdaBoost.M1, SAMME, GD-MCBoost, and DMCBoost on 13 UCI datasets, using multi-class decision trees with a depth of 3 ..... 63
3.3 Test error (and standard deviation) of the multi-class boosting methods AdaBoost.M1, AdaBoost.MH, SAMME, GD-MCBoost, and DMCBoost on 13 UCI datasets, using decision trees with a maximum depth of 12 ..... 64
3.4 Test error (and standard deviation) of the multi-class boosting methods AdaBoost.M1, AdaBoost.MH, SAMME, GD-MCBoost,
Recommended publications
  • Bootstrap Aggregating and Random Forest
Bootstrap Aggregating and Random Forest
Tae-Hwy Lee, Aman Ullah and Ran Wang
Abstract: Bootstrap Aggregating (Bagging) is an ensemble technique for improving the robustness of forecasts. Random Forest is a successful method based on Bagging and Decision Trees. In this chapter, we explore Bagging, Random Forest, and their variants in various aspects of theory and practice. We also discuss applications based on these methods in economic forecasting and inference.
1 Introduction. The last 30 years have witnessed dramatic developments and applications of Bagging and Random Forests. The core idea of Bagging is model averaging. Instead of choosing one estimator, Bagging considers a set of estimators trained on bootstrap samples and then takes their average output, which helps improve the robustness of an estimator. In Random Forest, we grow a set of Decision Trees to construct a 'forest' to balance accuracy and robustness for forecasting. This chapter is organized as follows. First, we introduce Bagging and some variants. Second, we discuss Decision Trees in detail. Then, we move to Random Forest, one of the most attractive machine learning algorithms, which combines Decision Trees and Bagging. Finally, several economic applications of Bagging and Random Forest are discussed. As we mainly focus on regression problems rather than classification problems, the response y is a real number, unless otherwise mentioned.
Tae-Hwy Lee, Department of Economics, University of California, Riverside, e-mail: [email protected]
Aman Ullah, Department of Economics, University of California, Riverside, e-mail: [email protected]
Ran Wang, Department of Economics, University of California, Riverside, e-mail: ran.wang@email.
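As a concrete illustration of the Bagging idea described in this excerpt (train each estimator on a bootstrap resample, then average the outputs), here is a minimal Python sketch using scikit-learn decision trees; the synthetic data and hyperparameters are purely illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only): y = sin(x) + noise
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def bagged_trees(X, y, n_estimators=50, max_depth=3):
    """Train decision trees on bootstrap resamples of (X, y)."""
    trees = []
    n = len(y)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagged_predict(trees, X):
    """Average the outputs of all trees (the model-averaging step)."""
    return np.mean([t.predict(X) for t in trees], axis=0)

trees = bagged_trees(X, y)
print(bagged_predict(trees, np.array([[0.0], [1.5]])))
```

Random Forest follows the same recipe but additionally restricts each split to a random subset of features, which further decorrelates the trees being averaged.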
  • Outline of Machine Learning
Outline of machine learning
The following outline is provided as an overview of and topical guide to machine learning:
Machine learning – subfield of computer science[1] (more particularly soft computing) that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.[1] In 1959, Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn without being explicitly programmed".[2] Machine learning explores the study and construction of algorithms that can learn from and make predictions on data.[3] Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.
Contents: What type of thing is machine learning? · Branches of machine learning · Subfields of machine learning · Cross-disciplinary fields involving machine learning · Applications of machine learning · Machine learning hardware · Machine learning tools · Machine learning frameworks · Machine learning libraries · Machine learning algorithms · Machine learning methods · Dimensionality reduction · Ensemble learning · Meta learning · Reinforcement learning · Supervised learning · Unsupervised learning · Semi-supervised learning · Deep learning · Other machine learning methods and problems · Machine learning research · History of machine learning · Machine learning projects · Machine learning organizations · Machine learning conferences and workshops · Machine learning publications
  • Adaptively Pruning Features for Boosted Decision Trees
    Adaptively Pruning Features for Boosted Decision Trees Maryam Aziz Jesse Anderton Javed Aslam Northeastern University Northeastern University Northeastern University Boston, Massachusetts, USA Boston, Massachusetts, USA Boston, Massachusetts, USA [email protected] [email protected] [email protected] Abstract Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-arm bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound. 1 Introduction Boosting algorithms are among the most popular classification algorithms in use today, e.g. in computer vision, learning-to-rank, and text classification. Boosting, originally introduced by Schapire [1990], Freund [1995], Freund and Schapire [1996], is a family of machine learning algorithms in which an accurate classification strategy is learned by combining many “weak” hypotheses, each trained with respect to a different weighted distribution over the training data. These hypotheses are learned sequentially, and at each iteration of boosting the learner is biased towards correctly classifying the examples which were most difficult to classify by the preceding weak hypotheses.
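The boosting loop summarized in this abstract (weak hypotheses learned sequentially, each trained on a reweighted distribution that emphasizes previously misclassified examples) can be sketched in a few lines of Python. This is a generic AdaBoost-style sketch with decision stumps, not the Quick Boost variant or the authors' algorithm; labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost-style loop; y must be an array of labels in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                            # weighted distribution over the data
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)    # weak hypothesis (decision stump)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # weight of this weak hypothesis
        w *= np.exp(-alpha * y * pred)                 # emphasize misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```

The expensive step for large datasets is the `stump.fit` call repeated every round, which is exactly the cost the paper's feature-pruning strategy targets.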
  • Simple and Ensemble Decision Tree Classifier Based Detection of Breast Cancer Tina Elizabeth Mathew
INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH, VOLUME 8, ISSUE 11, NOVEMBER 2019, ISSN 2277-8616
Simple And Ensemble Decision Tree Classifier Based Detection Of Breast Cancer
Tina Elizabeth Mathew
Abstract: Breast cancer, being the top cancer among all types in women worldwide, has an increasing incidence, particularly in developing countries where the majority of cases are diagnosed in late stages. Low survival is attributed mainly to late diagnosis of the disease, which is ascribed to a lack of early diagnostic facilities. Many techniques are being used to aid early diagnosis. Besides medical and imaging methods, statistical and data mining techniques are being implemented for breast cancer diagnosis. Data mining techniques provide promising results in prediction, and employing these methods can help medical practitioners in quicker disease diagnosis. Numerous supervised techniques are being deployed to improve prediction and diagnosis. Decision trees are supervised learning models which provide high accuracy, stability and ease of interpretation. In this study, different Decision tree models, single and ensemble methods, are implemented and their performance is evaluated using the Wisconsin breast cancer diagnostic data. The Rotation Forest classifier was seen to produce the best accuracy among ensembles, and NBTree among single models, for disease diagnosis.
Keywords: Decision tree (DT), Naive Bayes Tree (NBTree), REPTree, Rotation Forest, Bagging, Boosting, Ensembles, Random Forest, Adaptive Boosting (AdaBoost)
1. INTRODUCTION
Decision trees are considered to be among the topmost few prediction methods that aid in efficient supervised classification. K-fold Cross Validation, using a validation dataset, and considering the proportion of records with error are a few ways to prune a tree.
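One of the complexity-control strategies mentioned above is to use a separate validation dataset. A minimal sketch of that idea on the Wisconsin breast cancer data (selecting the tree depth that does best on a held-out validation split, which is pre-pruning rather than post-pruning, with illustrative split sizes) could look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Wisconsin breast cancer data, as in the study; the split sizes are illustrative
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Control tree complexity by choosing the depth that scores best on the validation set
best_depth, best_acc = None, 0.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    acc = tree.score(X_val, y_val)
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(f"Selected depth: {best_depth} (validation accuracy {best_acc:.3f})")
```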
  • Overfitting in Decision Trees, Boosting Slides
Decision Trees: Overfitting. CSE 446: Machine Learning, Emily Fox, University of Washington, January 30, 2017. ©2017 Emily Fox.
Slide topics (the slide bodies are mostly tree diagrams and tables that do not survive text extraction): Decision tree recap (for each leaf node, set ŷ to the majority value) · Greedy decision tree learning (Step 1: start with an empty tree; Step 2: select a feature to split the data, picking the split with the lowest classification error; Step 3: if there is nothing more to split on, make predictions; Step 4: otherwise, go to Step 2 and recurse on this split) · Scoring a loan application · Decision trees vs logistic regression: example · Logistic regression (learned weights per feature) · Depth 1: split on x[1] · Depth 2
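The greedy learning procedure listed on these slides (start with an empty tree, pick the feature split with the lowest classification error, stop when there is nothing left to split, otherwise recurse) can be sketched as follows; this is an illustrative toy implementation for categorical features and integer class labels, not the course's reference code:

```python
import numpy as np

def classification_error(y):
    """Error of predicting the majority label (y: array of integer class labels)."""
    return 0.0 if len(y) == 0 else np.mean(y != np.bincount(y).argmax())

def best_split(X, y):
    """Step 2: pick the categorical feature whose split gives the lowest weighted error."""
    best_feat, best_err = None, classification_error(y)
    for f in range(X.shape[1]):
        err = sum((X[:, f] == v).mean() * classification_error(y[X[:, f] == v])
                  for v in np.unique(X[:, f]))
        if err < best_err:
            best_feat, best_err = f, err
    return best_feat

def build_tree(X, y, used=()):
    """Steps 1-4: recurse on the best split until a stopping condition holds."""
    # Stopping conditions: all labels agree, or every feature has already been split on
    if classification_error(y) == 0.0 or len(used) == X.shape[1]:
        return {"predict": int(np.bincount(y).argmax())}
    f = best_split(X, y)
    if f is None:  # no split improves the error: make this node a leaf
        return {"predict": int(np.bincount(y).argmax())}
    children = {v: build_tree(X[X[:, f] == v], y[X[:, f] == v], used + (f,))
                for v in np.unique(X[:, f])}
    return {"feature": f, "children": children}

def tree_predict(node, x):
    """Route one example down the tree (assumes its feature values were seen in training)."""
    while "feature" in node:
        node = node["children"][x[node["feature"]]]
    return node["predict"]
```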
  • Performance Analysis of Boosting Classifiers in Recognizing Activities of Daily Living
    International Journal of Environmental Research and Public Health Article Performance Analysis of Boosting Classifiers in Recognizing Activities of Daily Living Saifur Rahman 1,*, Muhammad Irfan 1 , Mohsin Raza 2, Khawaja Moyeezullah Ghori 3 , Shumayla Yaqoob 3 and Muhammad Awais 4,* 1 Electrical Engineering Department, College of Engineering, Najran University, Najran 61441, Saudi Arabia; [email protected] 2 Department of Computer and Information Sciences, Northumbria University, Newcastle-upon-Tyne NE1 8ST, UK; [email protected] 3 Department of Computer Science, National University of Modern Languages, Islamabad 44000, Pakistan; [email protected] (K.M.G.); [email protected] (S.Y.) 4 Faculty of Medicine and Health, School of Psychology, University of Leeds, Leeds LS2 9JT, UK * Correspondence: [email protected] (S.R.); [email protected] (M.A.) Received: 22 January 2020; Accepted: 5 February 2020; Published: 8 February 2020 Abstract: Physical activity is essential for physical and mental health, and its absence is highly associated with severe health conditions and disorders. Therefore, tracking activities of daily living can help promote quality of life. Wearable sensors in this regard can provide a reliable and economical means of tracking such activities, and such sensors are readily available in smartphones and watches. This study is the first of its kind to develop a wearable sensor-based physical activity classification system using a special class of supervised machine learning approaches called boosting algorithms. The study presents the performance analysis of several boosting algorithms (extreme gradient boosting—XGB, light gradient boosting machine—LGBM, gradient boosting—GB, cat boosting—CB and AdaBoost) in a fair and unbiased performance way using uniform dataset, feature set, feature selection method, performance metric and cross-validation techniques.
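The comparison protocol described above (a fixed dataset, feature set, performance metric, and cross-validation scheme applied uniformly to each boosting classifier) can be sketched with scikit-learn as below. The synthetic data and the two models shown are only placeholders; the gradient-boosting variants evaluated in the paper (e.g. XGBoost, LightGBM, CatBoost) would plug into the same loop through their scikit-learn-compatible estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a sensor-feature dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# The same folds and metric for every model keep the comparison unbiased
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```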
  • DATA MINING with DECISION TREES Theory and Applications 2Nd Edition
    DATA MINING WITH DECISION TREES Theory and Applications 2nd Edition 9097_9789814590075_tp.indd 1 30/7/14 2:32 pm SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors: H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA) Vol. 65: Fighting Terror in Cyberspace (Eds. M. Last and A. Kandel) Vol. 66: Formal Models, Languages and Applications (Eds. K. G. Subramanian, K. Rangarajan and M. Mukund) Vol. 67: Image Pattern Recognition: Synthesis and Analysis in Biometrics (Eds. S. N. Yanushkevich, P. S. P. Wang, M. L. Gavrilova and S. N. Srihari ) Vol. 68: Bridging the Gap Between Graph Edit Distance and Kernel Machines (M. Neuhaus and H. Bunke) Vol. 69: Data Mining with Decision Trees: Theory and Applications (L. Rokach and O. Maimon) Vol. 70: Personalization Techniques and Recommender Systems (Eds. G. Uchyigit and M. Ma) Vol. 71: Recognition of Whiteboard Notes: Online, Offline and Combination (Eds. H. Bunke and M. Liwicki) Vol. 72: Kernels for Structured Data (T Gärtner) Vol. 73: Progress in Computer Vision and Image Analysis (Eds. H. Bunke, J. J. Villanueva, G. Sánchez and X. Otazu) Vol. 74: Wavelet Theory Approach to Pattern Recognition (2nd Edition) (Y. Y. Tang) Vol. 75: Pattern Classification Using Ensemble Methods (L. Rokach) Vol. 76: Automated Database Applications Testing: Specification Representation for Automated Reasoning (R. F. Mikhail, D. Berndt and A. Kandel) Vol. 77: Graph Classification and Clustering Based on Vector Space Embedding (K. Riesen and H. Bunke) Vol. 78: Integration of Swarm Intelligence and Artificial Neural Network (Eds. S. Dehuri, S. Ghosh and S.-B. Cho) Vol. 79 Document Analysis and Recognition with Wavelet and Fractal Theories (Y.
  • Infinite Ensemble Learning with Support Vector Machines
Infinite Ensemble Learning with Support Vector Machines
Thesis by Hsuan-Tien Lin
In Partial Fulfillment of the Requirements for the Degree of Master of Science, California Institute of Technology, Pasadena, California, 2005 (Submitted May 18, 2005)
© 2005 Hsuan-Tien Lin. All Rights Reserved.
Acknowledgements
I am very grateful to my advisor, Professor Yaser S. Abu-Mostafa, for his help, support, and guidance throughout my research and studies. It has been a privilege to work in the free and easy atmosphere that he provides for the Learning Systems Group. I have enjoyed numerous discussions with Ling Li and Amrit Pratap, my fellow members of the group. I thank them for their valuable input and feedback. I am particularly indebted to Ling, who not only collaborates with me in some parts of this work, but also gives me lots of valuable suggestions in research and in life. I also appreciate Lucinda Acosta for her help in obtaining necessary resources for research. I thank Kai-Min Chung for providing useful comments on an early publication of this work. I also want to express special gratitude to Professor Chih-Jen Lin, who brought me into the fascinating area of the Support Vector Machine five years ago. Most importantly, I thank my family and friends for their endless love and support. I am thankful to my parents, who have always encouraged me and have believed in me. In addition, I want to convey my personal thanks to Yung-Han Yang, my girlfriend, who has shared every moment with me during days of our lives in the U.S.
  • Direct Optimization for Classification with Boosting
Wright State University CORE Scholar, Browse all Theses and Dissertations, Theses and Dissertations, 2015.
Direct Optimization for Classification with Boosting. Shaodan Zhai, Wright State University.
Follow this and additional works at: https://corescholar.libraries.wright.edu/etd_all. Part of the Computer Engineering Commons, and the Computer Sciences Commons.
Repository Citation: Zhai, Shaodan, "Direct Optimization for Classification with Boosting" (2015). Browse all Theses and Dissertations. 1644. https://corescholar.libraries.wright.edu/etd_all/1644
This Dissertation is brought to you for free and open access by the Theses and Dissertations at CORE Scholar. It has been accepted for inclusion in Browse all Theses and Dissertations by an authorized administrator of CORE Scholar. For more information, please contact [email protected].
  • D4.2. Measure
ITEA 2 Office, High Tech Campus 69-3, 5656 AG Eindhoven, The Netherlands. Tel: +31 88 003 6136, Fax: +31 88 003 6130, Email: [email protected], Web: www.itea2.org. ITEA 2 is a EUREKA strategic ICT cluster programme.
D4.2: Industrial Data Analysis Algorithms (MEASURE, ITEA 2 – 14009)
Edited by Sana Mallouli and Wissam Mallouli on 01/12/2017; edited by Sana Mallouli and Wissam Mallouli on 15/05/2018; edited by Selami Bagriyanik on 16/05/2018; edited by Sana Mallouli and Wissam Mallouli on 16/05/2018; edited by Sana Mallouli on 23/05/2018; edited by Alin Stefanescu on 25/05/2018; reviewed by Jérôme Rocheteau on 28/05/2018; reviewed by Stephane Maag on 01/06/2018; edited by Sana Mallouli on 04/06/2018; edited by Wissam Mallouli on 04/06/2018; edited by Alin Stefanescu on 06/06/2018; edited by Sana Mallouli and Wissam Mallouli on 06/06/2018; edited by Alessandra Bagnato on 06/06/2018.
Executive summary: The objective of WP4 is to design and develop a platform to analyse the large amount of measurement data generated by the use cases carried out in WP5. The data analysis platform will implement analytics algorithms to correlate the different phases of software development and perform the tracking of metrics and their values. The platform will also connect and assure interoperability among the tools selected or developed in WP3, and will evaluate their performance and define actions for improvement and possible countermeasures.
  • Decision Trees: Overfitting STAT/CSE 416: Machine Learning Emily Fox University of Washington April 26, 2018
Decision Trees: Overfitting. STAT/CSE 416: Machine Learning, Emily Fox, University of Washington, April 26, 2018. ©2018 Emily Fox.
Slide topics (the slide bodies are mostly tree diagrams that do not survive text extraction): Decision trees recap (for each leaf node, set ŷ to the majority value) · Greedy decision tree learning (Step 1: start with an empty tree; Step 2: select a feature to split the data, picking the split with the lowest classification error; Step 3: if there is nothing more to split on, make predictions; Step 4: otherwise, go to Step 2 and recurse on this split) · Stopping condition 1: all data in a node agrees on y, so there is nothing to do · Stopping condition 2: already split on all possible features, so there is nothing to do
  • Stackoverflow Vs Kaggle: a Study of Developer Discussions About Data Science
StackOverflow vs Kaggle: A Study of Developer Discussions About Data Science
David Hin, University of Adelaide, [email protected]
ABSTRACT
Software developers are increasingly required to understand fundamental Data science (DS) concepts. Recently, the presence of machine learning (ML) and deep learning (DL) has dramatically increased in the development of user applications, whether they are leveraged through frameworks or implemented from scratch. These topics attract much discussion on online platforms. This paper conducts large-scale qualitative and quantitative experiments to study the characteristics of 197836 posts from StackOverflow and Kaggle. Latent Dirichlet Allocation topic modelling is used to extract twenty-four DS discussion topics. The main findings include that TensorFlow-related topics were most prevalent in StackOverflow, while meta discussion topics were the prevalent ones on Kaggle. StackOverflow tends to include lower-level troubleshooting, while Kaggle focuses on practicality and optimising leaderboard performance. In addition, across
This analysis uses Latent Dirichlet Allocation (LDA) [6] combined with qualitative analyses to investigate DS discussion on both StackOverflow and Kaggle. A large-scale empirical study of 197836 posts from StackOverflow and Kaggle is conducted to better understand DS-related discussion across online websites. The key contributions are as follows:
1. This is the first research paper to investigate the relationship between DS discussion on a Q&A website like StackOverflow and a competition-oriented community like Kaggle.
2. Twenty-four key DS topics are identified.
3. Source code to perform the analyses in this paper [9]. Full reproduction package (including data) also available [8].
Paper structure: Section 2 introduces the motivation behind this work.
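The topic-modelling step this abstract relies on (LDA over a corpus of posts, followed by inspection of each topic's top words) can be sketched with scikit-learn; the tiny corpus, vectorizer settings, and number of topics below are illustrative stand-ins, not the paper's actual configuration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus standing in for StackOverflow/Kaggle posts
posts = [
    "tensorflow model training loss not decreasing",
    "how to tune xgboost hyperparameters for kaggle leaderboard",
    "pandas dataframe merge error on column types",
    "keras gpu out of memory during training",
    "feature engineering tips for tabular competition data",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(posts)

# Fit an LDA model with a handful of topics (the paper extracts twenty-four)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

# Show the top words of each discovered topic
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {', '.join(top)}")
```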