Mining Software Engineering Data for Useful Knowledge
Boris Baldassari

To cite this version:

Boris Baldassari. Mining Software Engineering Data for Useful Knowledge. Machine Learning [stat.ML]. Université de Lille, 2014. English. ⟨tel-01297400⟩

HAL Id: tel-01297400 https://tel.archives-ouvertes.fr/tel-01297400 Submitted on 4 Apr 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

École doctorale Sciences Pour l'Ingénieur

Thesis presented to obtain the degree of Doctor, speciality Computer Science, by Boris Baldassari

Mining Software Engineering Data for Useful Knowledge

prepared in the SequeL joint project-team

Publicly defended on 1 July 2014 before a jury composed of:

Philippe Preux, Professor, Université de Lille 3 (Supervisor)
Benoit Baudry, INRIA Research Scientist, INRIA Rennes (Reviewer)
Laurence Duchien, Professor, Université de Lille 1 (Examiner)
Flavien Huynh, PhD Engineer, Squoring Technologies (Examiner)
Pascale Kuntz, Professor, Polytech' Nantes (Reviewer)
Martin Monperrus, Associate Professor, Université de Lille 1 (Examiner)

Preface

Maisqual is a recursive acronym standing for "Maisqual Automagically Improves Software QUALity". It may sound naive or pedantic at first sight, but it clearly stated, at one point in time, the expectations of Maisqual. The primary idea of the Maisqual project was to investigate the application of data mining methods to practical concerns of software quality engineering and assessment. This includes the development of innovative techniques for the Squore product, as well as the verification of some common myths and beliefs about software engineering. It also proposes new tools and methods to extract useful information from software data.

The Maisqual project is deeply rooted in empirical software engineering and heavily relies on the experience gathered during years of practice and consulting with the industry. In order to apply the data mining component, we had to widen our knowledge of the field by reading student manuals and reference books, experimenting with formulae and verifying the results.

As a bridge between two distinct disciplines, we had to pay great attention to the communication of ideas and results between practitioners of both fields. Most of Maisqual’s users have a software engineering background and one of the requirements of the project was to help them to grasp the benefits of the investigated data mining tools. We intensively used analogies, pictures and common sense in order to explain concepts and to help people map ideas to notions with which they are familiar.

We decided to formulate our different lines of enquiry into projects and established a structure to complete them. Requirements were refined to be more effective, more precise and more practical. From the initial intent we designed a framework and a method to address the various objectives of our research. This enabled us to set up a secure foundation for the establishment of specific tasks, which gave birth to the Squore Labs. Squore Labs are clearly-defined projects addressing major subjects with pragmatic results and implementations.

This report retraces the evolution of the Maisqual project from the discovery of new statistical methods to their implementation in practical applications. We tried to make it useful, practical and not too boring.

Acknowledgements

This project has been a huge and long investment, which would never have been possible without many people. First of all, I would like to thank Squoring Technologies for funding this long-running project, and Philippe Preux from INRIA Lille, who was a faithful director. Philippe kept me on the rails, orienting research in safe directions and always watching the compass and the line of sight. I would like to thank my dissertation committee members, Benoit Baudry, Laurence Duchien, Flavien Huynh, Pascale Kuntz, and Martin Monperrus, for taking interest in my research and agreeing to examine my work. Special thanks go to Pascale Kuntz and Benoit Baudry for their thorough reviews and insightful comments. I will also be eternally grateful to all the people who supported me and took care of me while I was working on Maisqual, and during the many adventures that happened during these three years: family, friends, neighbours, those who left (Tanguy, Mélodie) and those who are still there. Yakacémé, the Dutch barge I live on and work in, was a safe and dependable home and shelter. Special thanks go to Cassandra Tchen, who was a warm-hearted coach, brought sound advice and reviews, and helped a lot in correcting and enhancing the English prose.

Contents

Preface 1

1 Introduction 13 1.1 Context of the project...... 13 1.1.1 Early history of Maisqual...... 13 1.1.2 About INRIA Lille and SequeL...... 14 1.1.3 About Squoring Technologies...... 14 1.2 Project timeline...... 15 1.3 Expected outputs...... 16 1.3.1 Squore Labs...... 16 1.3.2 Communication...... 16 1.3.3 Publications...... 17 1.4 About this document...... 18 1.5 Summary...... 20

I State of the art 21

2 Software Engineering 23 2.1 The art of building software...... 24 2.1.1 Development processes...... 24 2.1.2 Development practices...... 25 2.2 Software measurement...... 27 2.2.1 The art of measurement...... 27 2.2.2 Technical debt...... 30 2.3 Quality in software engineering...... 31 2.3.1 A few words about quality...... 31 2.3.2 Garvin’s perspectives on quality...... 31 2.3.3 Shewhart...... 32 2.3.4 Crosby...... 32 2.3.5 Feigenbaum...... 33 2.3.6 Deming...... 33 2.3.7 Juran...... 33

2.4 Quality Models in Software Engineering...... 34 2.4.1 A few words about quality models...... 34 2.4.2 Product-oriented models...... 35 2.4.3 Process-oriented models...... 40 2.4.4 FLOSS models...... 42 2.5 Summary...... 43

3 Data mining 45 3.1 Exploratory analysis...... 45 3.1.1 Basic statistic tools...... 45 3.1.2 Scatterplots...... 46 3.2 Principal Component Analysis...... 47 3.3 Clustering...... 49 3.3.1 K-means clustering...... 50 3.3.2 Hierarchical clustering...... 50 3.3.3 dbscan clustering...... 52 3.4 Outliers detection...... 53 3.4.1 What is an outlier?...... 53 3.4.2 Boxplot...... 54 3.4.3 Local Outlier Factor...... 55 3.4.4 Clustering-based techniques...... 55 3.5 Regression analysis...... 56 3.6 Time series...... 58 3.6.1 Seasonal-trend decomposition...... 58 3.6.2 Time series modeling...... 58 3.6.3 Time series clustering...... 59 3.6.4 Outliers detection in time series...... 59 3.7 Distribution of measures...... 60 3.7.1 The Pareto distribution...... 61 3.7.2 The Weibull distribution...... 62 3.8 Summary...... 62

II The Maisqual project 65

4 Foundations 69 4.1 Understanding the problem...... 69 4.1.1 Where to begin?...... 69 4.1.2 About literate data analysis...... 70 4.2 Version analysis...... 71 4.2.1 Simple summary...... 71 4.2.2 Distribution of variables...... 72 4.2.3 Outliers detection...... 72

4.2.4 Regression analysis...... 73 4.2.5 Principal Component Analysis...... 74 4.2.6 Clustering...... 75 4.2.7 Survival analysis...... 77 4.2.8 Specific concerns...... 78 4.3 Evolution analysis...... 79 4.3.1 Evolution of metrics...... 80 4.3.2 Autocorrelation...... 80 4.3.3 Moving average & loess...... 82 4.3.4 Time series decomposition...... 83 4.3.5 Time series forecasting...... 83 4.3.6 Conditional execution & factoids...... 85 4.4 Primary lessons...... 86 4.4.1 Data quality and mining algorithms...... 86 4.4.2 Volume of data...... 86 4.4.3 About scientific software...... 87 4.4.4 Check data...... 88 4.5 Summary & Roadmap...... 89

5 First stones: building the project 91 5.1 Topology of a software project...... 91 5.1.1 The big picture...... 91 5.1.2 Artefact Types...... 92 5.1.3 ...... 94 5.1.4 Configuration management...... 94 5.1.5 Change management...... 95 5.1.6 Communication...... 96 5.1.7 Publication...... 96 5.2 An approach for data mining...... 97 5.2.1 Declare the intent...... 97 5.2.2 Identify quality attributes...... 98 5.2.3 Identify available metrics...... 98 5.2.4 Implementation...... 99 5.2.5 Presentation of results...... 99 5.3 Implementation...... 99 5.3.1 Selected tools...... 99 5.3.2 Data retrieval...... 101 5.3.3 Data analysis...... 103 5.3.4 Automation...... 103 5.4 Summary...... 104

6 Generating the data sets 105 6.1 Defining metrics...... 106 6.1.1 About metrics...... 106 6.1.2 Code metrics...... 107 Artefact counting metrics...... 107 Line counting metrics...... 107 Control flow complexity metrics...... 110 Halstead metrics...... 111 Rules-oriented measures...... 112 Differential measures...... 112 Object-oriented measures...... 112 6.1.3 Software Configuration Management metrics...... 113 6.1.4 Communication metrics...... 114 6.2 Defining rules and practices...... 115 6.2.1 About rules...... 115 6.2.2 Squore ...... 116 6.2.3 ...... 117 6.2.4 PMD...... 118 6.3 Projects...... 119 6.3.1 Apache Ant...... 120 6.3.2 Apache Httpd...... 121 6.3.3 Apache JMeter...... 122 6.3.4 Apache Subversion...... 122 6.4 Summary...... 123

III Squore Labs 125

7 Working with the Eclipse foundation 127 7.1 The Eclipse Foundation...... 128 7.2 Declaration of intent...... 128 7.3 Quality requirements...... 129 7.4 Metrics identification...... 130 7.5 From metrics to quality attributes...... 131 7.6 Results...... 132 7.7 Summary...... 134

8 Outliers detection 137 8.1 Requirements: what are we looking for?...... 137 8.2 Statistical methods...... 138 8.2.1 Simple tail-cutting...... 138 8.2.2 Boxplots...... 139 8.2.3 Clustering...... 140

8.2.4 A note on metrics selection...... 141 8.3 Implementation...... 142 8.3.1 Knitr documents...... 142 8.3.2 Integration in Squore ...... 144 8.3.3 modular scripts...... 145 8.4 Use cases...... 145 8.4.1 Hard to read files and functions...... 146 8.4.2 Untestable functions...... 149 8.4.3 Code cloning in functions...... 152 8.5 Summary...... 154

9 Clustering 157 9.1 Overview of existing techniques...... 157 9.2 Automatic classification of artefacts...... 159 9.2.1 Squore indicators...... 159 9.2.2 Process description...... 159 9.2.3 Application: the auto-calibration wizard...... 159 9.3 Multi-dimensional quality assessment...... 162 9.4 Summary & future work...... 164

10 Correlating practices and attributes of software 167 10.1 Nature of data...... 167 10.2 Knitr investigations...... 168 10.3 Results...... 169 10.4 Summary & future work...... 171

IV Conclusion 173

Bibliography 177

Appendices 197

Appendix A Papers and articles published 197 Monitoring Software Projects with Squore ...... 199 Squore, une nouvelle approche pour la mesure de qualité logicielle...... 207 De l’ombre à la lumière : plus de visibilité sur l’Eclipse (full version)..... 217 De l’ombre à la lumière : plus de visibilité sur l’Eclipse (short version).... 229 De l’ombre à la lumière : plus de visibilité sur l’Eclipse (poster)...... 233 Outliers in Software Engineering...... 235 Mining Software Engineering Data...... 243 Understanding software evolution: the Maisqual Ant data set...... 251

Appendix B Data sets 255 B.1 Apache Ant...... 255 B.2 Apache httpd...... 255 B.3 Apache JMeter...... 256 B.4 Apache Subversion...... 256 B.5 Versions data sets...... 257

Appendix C Knitr documents 259 C.1 Squore Lab Outliers...... 259 C.2 Squore Lab Clustering...... 267 C.3 Squore Lab Correlations...... 272

Appendix D Code samples 279 D.1 Ant > Javadoc.java > execute()...... 279 D.2 Agar > sha1.c > SHA1Transform()...... 284 D.3 R code for hard to read files...... 285

Index 286

List of Figures

1.1 Timeline for the Maisqual project...... 15 1.2 Homepage of the Maisqual web site...... 17 1.3 Evolution of weekly visits on the Maisqual website in 2013...... 17

2.1 A typical waterfall development model...... 25 2.2 A typical iterative development model...... 25 2.3 The Goal-Question-Metric approach...... 29 2.4 Technical debt landscape [117]...... 30 2.5 McCall’s quality model...... 36 2.6 Boehm’s quality model...... 37 2.7 ISO/IEC 9126 quality model...... 38 2.8 ISO/IEC 250xx quality model...... 39 2.9 The Capability Maturity Model Integration...... 41

3.1 Scatterplots of some file metrics for Ant...... 46 3.2 Principal Component Analysis for three metrics...... 47 3.3 Principal Component Analysis for Ant 1.8.1 file metrics...... 48 3.4 Hierarchical Clustering dendogram for Ant 1.7.0 files...... 52 3.5 Examples of outliers on a few data sets...... 53 3.6 Boxplots of some common metrics for Ant 1.8.1, without and with outliers. 55 3.7 Linear regression examples...... 57 3.8 Major approaches for time series clustering [125]...... 60 3.9 Major approaches for outliers detection in time series [76]...... 60 3.10 Normal and exponential distribution function examples...... 61 3.11 Distribution of prerelease faults in Eclipse 2.1 [198]...... 62

4.1 Scatterplots of common metrics for JMeter 2007-01-01...... 72 4.2 Examples of distribution of metrics for Ant 1.7...... 73 4.2 Combinations of boxplots outliers for Ant 1.7...... 73 4.3 Linear regression analysis for Ant 2007-01-01 file metrics...... 74 4.4 Hierarchical clustering of file metrics for Ant 1.7 – 5 clusters...... 76 4.5 K-means clustering of file metrics for Ant 1.7 – 5 clusters...... 77 4.5 First, second and third order regression analysis for Ant 1.7...... 79 4.6 Metrics evolution for Ant: sloc and scm_committers...... 80

4.7 Univariate time-series autocorrelation on vg, sloc and scm_fixes for Ant. 81 4.8 Multivariate time-series autocorrelation on vg, sloc for Ant...... 81 4.9 Moving average and loess on sloc, lc, scm_committers and scm_fixes for Ant 1.7...... 82 4.10 Time series forecasting for Subversion and JMeter...... 84 4.11 Synopsis of the Maisqual roadmap...... 89

5.1 Software repositories and examples of metrics extracted...... 92 5.2 Different artefacts models for different uses...... 93 5.3 Mapping of bug tracking processes...... 95 5.4 From quality definition to repository metrics...... 97 5.5 Validation of data in the Squore dashboard...... 101 5.6 Data retrieval process...... 102 5.7 Implementation of the retrieval process in Jenkins...... 104

6.0 Control flow examples...... 111 6.1 Ant mailing list activity...... 120 6.2 History of Apache httpd usage on the Web...... 121

7.1 Proposed Eclipse quality model...... 129 7.2 Proposed Eclipse quality model: Product...... 130 7.3 Proposed Eclipse quality model: Process...... 130 7.4 Proposed Eclipse quality model: Community...... 131 7.5 Examples of results for Polarsys...... 133 7.6 Examples of results for Polarsys...... 134

8.1 Combined univariate boxplot outliers on metrics for Ant 1.7: sloc, vg (left) and sloc, vg, ncc (right)...... 140 8.2 Outliers in clusters: different sets of metrics select different outliers.... 142 8.2 Modifications in the architecture of Squore for outliers detection..... 144 8.3 Hard To Read files and functions with our outliers detection method... 148 8.4 Hard To Read files with injected outliers...... 149 8.5 Untestable functions with Squore’s action items and with our outliers detection method...... 151 8.6 Code cloning in functions with Squore’s action items and with our outliers detection method...... 154

9.1 Examples of univariate classification of files with k-means and Squore... 161 9.2 Computation of testability in Squore’s ISO9126 OO quality model.... 162 9.3 Examples of multivariate classification of files for Ant 1.8.0...... 163

10.1 Repartition of violations by rule for Ant 1.8.0 files...... 170 10.2 Synopsis of Maisqual...... 175

List of Tables

3.1 Quick summary of common software metrics for Ant 1.8.1...... 46

4.1 ARMA components optimisation for Ant evolution...... 83 4.2 Number of files and size on disk for some projects...... 87

6.1 Metrics artefact types...... 107 6.2 Artefact types for common code metrics...... 108 6.3 Artefact types for specific code metrics...... 112 6.4 Artefact types for configuration management...... 114 6.5 Artefact types for communication channels...... 115 6.6 Major releases of Ant...... 120 6.7 Summary of data sets...... 123

8.1 Values of different thresholds (75%, 85%, 95% and maximum value for metrics) for tail-cutting, with the number of outliers selected in bold... 139 8.2 Cardinality of clusters for hierarchical clustering for Ant 1.7 files (7 clusters).141

9.1 Auto-calibration ranges for Ant 1.8.1 file metrics...... 160

Chapter 1

Introduction

This chapter presents the organisation adopted for running the Maisqual project, introducing the people and organisations involved, and stating the objectives and outputs of the work. It also gives useful information to better understand the structure and contents of this report, so that one can capitalise on the knowledge and conclusions delivered by this project. The first section gives insights into the genesis of Maisqual, how it came to life and who was involved in its setup. The project timeline in section 1.2 depicts the high-level phases the project went through, and expected outputs and requirements are documented in section 1.3. Typographic and writing conventions used in the document are explained in section 1.4. A short summary outlines the structure of the remainder of the document in section 1.5.

1.1 Context of the project

1.1.1 Early history of Maisqual

The Maisqual project had its inception back in 2007, when I met Christophe Peron1 and we started sharing our experiences and beliefs about software engineering. Through various talks, seminars and industrial projects we had in common, the idea of a research work on empirical software engineering steadily evolved. The project would eventually become a reality in 2011 with the foundation of Squoring Technologies, a start-up initiated by major actors of the world of software quality and publisher of the Squore product. Patrick Artola, co-founder and CEO of Squoring Technologies, had contacted the INRIA Lille SequeL team for a research project on data mining and automatic learning. The new company needed manpower in its early days, and since the project was considered important and time-consuming, a compromise was found to mix product-oriented development, industrial consultancy and research studies. A deal was signed in April 2011, with 2 days per week for the thesis work and 3 days per week for Squoring Technologies. This was accepted by all parties and would run for the next three years.

1 Christophe Peron is an experienced software consultant. He was one of the founders of Verilog, has worked for Continuous, Kalimetrix, Telelogic, and IBM, and is now Product Manager for Squoring Technologies.

1.1.2 About INRIA Lille and SequeL

SequeL2 is an acronym for Sequential Learning; it is a joint research project of the LIFL (Laboratoire d'Informatique Fondamentale de Lille, Université de Lille 3), the CNRS (Centre National de la Recherche Scientifique) and INRIA (Institut National de Recherche en Informatique et en Automatique), located in the Lille-Nord Europe research center. The aim of SequeL is to study the resolution of sequential decision problems. For that purpose, its members study sequential learning algorithms, with a focus on reinforcement and bandit learning, and put an emphasis on the use of concepts and tools drawn from statistical learning. Work ranges from the theory of learnability, to the design of efficient algorithms, to applications; it is important to note that the team has interests and activities in both fundamental aspects and real applications. An important specificity of SequeL is that its members originate from three different fields: computer science, applied mathematics, and signal processing. Usually researchers from these fields work independently; in SequeL they work together and cross-fertilise. Philippe Preux is the leader of the SequeL team, working in the fields of sequential learning and data mining3. He teaches at Université de Lille 3 and takes part in various conferences, committees and scientific events.

1.1.3 About Squoring Technologies

Squoring Technologies4 was founded in late 2010 by a group of software engineering experts, united by the same vision of providing a de facto standard platform for projects in the embedded market. The company is located in Toulouse, France, with offices in Paris and Grenoble. As for its international presence, Squoring Technologies GmbH was founded in 2011 in Bavaria, Germany, to target the German market of embedded automotive and space software. Several distributors propose the product on the Asian, European, and US markets. Squoring Technologies offers product licensing, consultancy, services and training to help organisations deploy SQUORE into their operational environment and fill the gap between a hard-to-implement metric-based process and team awareness and successful decision making. Based on the strong operational background and deep knowledge in software-related standards of its staff members, Squoring Technologies' mission is to improve capability, maturity and performance for its customers in their acquisition, development and maintenance of software products.

2 For more information see sequel.lille.inria.fr/.
3 For more information see www.grappa.univ-lille3.fr/∼ppreux/rech/publis..
4 More information about Squoring Technologies can be found on the company's web site: www.squoring.com

Its customers range from medium to large companies in different domains of activities: aeronautics (Airbus, Rockwell-Collins, Thalès), defense (DCNS), healthcare (McKesson), transport and automotive (Valéo, Continental, Delphi, Alstom, IEE, Magneti-Marelli), energy (Schneider Electric), computing and contents (Bull, Sysfera, Technicolor, Infotel).

1.2 Project timeline

The chronicle of Maisqual can be roughly divided into three phases pictured in figure 1.1. They correspond more or less to the three parts listed in this report:

The State of the Art part investigated existing research to establish our starting point. Building upon this landscape, we decided on a methodological framework on which to lay a secure base for the final objectives of the Maisqual project. Finally, on top of this foundation, the Squore Labs projects are the very research topics addressed by Maisqual.

Figure 1.1: Timeline for the Maisqual project.

Each of these phases had a distinct purpose; although they considerably overlap in time, they really have been a successive progression of steps, each stage building upon the results of the preceding one.

1.3 Expected outputs

1.3.1 Squore Labs

Squore Labs are projects that target practical concerns like new features for the Squore product. They have been set up as pragmatic implementations of the primary goals of the project, and secure the transfer of academic and research knowledge into Squoring Technologies assets. They are time-boxed and have well-identified requirements and deliverables. Each Squore Lab addresses a specific software engineering concern and is usually defined as an extension of a promising intuition unveiled by state of the art studies or experts' observations. The following Squore Labs have been defined as the thesis' goals:

The Polarsys quality assessment working group, initiated by the Eclipse foundation to address maturity concerns for industries of the critical embedded systems market. We participated with the group in the definition and prototyping of the solution, to establish process-related measures and define a sound, literature-backed quality model.

Outliers detection has been introduced in Squore to highlight specific types of artefacts, like untestable files or obfuscated code. The advantage of using outliers detection techniques is that thresholds are dynamic and do not depend on fixed values. This makes the highlights work for many different types of software, by adapting the acceptable values to the majority of the data instead of relying on generic conventions.

Clustering brings automatic scales to Squore, by proposing ranges adapted to the characteristics of projects. It enhances the capitalisation base feature of Squore and facilitates the manual process of model calibration.

Regression analysis of data makes it possible to establish relationships among the metrics and to find correlations between practices, as defined by violations of coding or naming standards, and attributes of software. The intent is to build a body of knowledge to validate or invalidate common beliefs about software engineering.

Squore Labs have been put in a separate part of this report to highlight the outputs of the project.

1.3.2 Communication

The Maisqual website, which was started at the beginning of the project, contains a comprehensive glossary of software engineering terms (422 definitions and references), lists standards and quality models (94 entries) related to software along with links and information, references papers and books, and publicises the outputs of the Maisqual project: data sets, publications, analyses. Its home page is depicted on figure 1.2.

Figure 1.2: Homepage of the Maisqual web site.

Web usage statistics show increasing traffic on the maisqual.squoring.com web site. At the time of writing there is an average of 20 daily visits, as shown in figure 1.3, and it is even referenced on other web sites.

Figure 1.3: Evolution of weekly visits on the Maisqual website in 2013.

The server used for the analysis has also been publicised to show reports of some open-source projects, most notably for Topcased, Eclipse, and GitHub.

1.3.3 Publications

Another requirement of Squoring Technologies is to have papers published in the company's area of knowledge, so as to contribute to the global understanding and handling of quality-related questions. This has been done through talks at both academic and industrial conferences, involvement in the Polarsys Eclipse Industry Working Group, and participation in various meetings. Papers published during the Maisqual project are listed hereafter:

SQuORE: a new approach to software project quality measurement [11] is a presentation of Squore principles, and how they can be used to assess software products and projects. It was presented at the 24th International Conference on Software & Systems Engineering and their Applications, held in Paris in 2012. See the full article in appendix, page 199.

A French version of the above article was published in the Génie Logiciel quarterly: SQuORE : une nouvelle approche de mesure de la qualité des projets logiciels [10]. See the full article in appendix, page 207.

Software Quality: the Eclipse Way and Beyond [12] presents the work accomplished with the Polarsys task force to build a quality program for Eclipse projects. It was presented at EclipseCon France 2013.

De l'ombre à la lumière : plus de visibilité sur l'Eclipse [13] is another insight into the work conducted with Polarsys, stressing the knowledge extraction process that was set up. It was presented at the 14th Journées Francophones Extraction et Gestion des Connaissances, held in Rennes in January 2014. The complete article submitted is available in appendix, page 217. The poster and short article published are reproduced in appendix, on pages 233 and 229 respectively.

Outliers in Software Engineering [15] describes how we applied outliers detection techniques to software engineering data. It has been submitted to the 36th International Conference on Software Engineering, held in Hyderabad, India, in 2014. See the full article in appendix, page 235.

A Practitioner approach to Software Engineering Data Mining [14] details the lessons we learned and summarises the experience acquired when mining software engineering data. It has been submitted to the 36th International Conference on Software Engineering, held in Hyderabad, India, in 2014. See the full article in appendix, page 243.

Understanding Software Evolution: the Apache Ant data set [16] describes the data set generated for the Apache Ant open source project, with a thorough description of the retrieval process set up and the metrics gathered. It has been submitted to the 2014 Mining Software Repositories data track, held in Hyderabad, India, in 2014. See the full article in appendix, page 251.

1.4 About this document

This document has been written using Emacs and the LATEX document preparation system. All the computations conducted for our research have been executed with R, an open source language and environment for statistical computing and graphics. Similarly, pictures have been generated from R scripts to ensure reproducibility.

About metrics

This document uses a lot of well-known and not-so-well-known software metrics, which may be referenced either by their full name (e.g. Cyclomatic complexity) or by their mnemonic (e.g. vg). The most common metrics are listed in the index on page 286, and they are extensively described in chapter 6, section 6.1 on page 106.

Visual information & Pictures

Some trade-offs had to be found between the quantity of information on a picture and the size of a page: a scatterplot showing 6 different metrics will be too small and thus unreadable on a screen, but showing only 3 metrics at a time takes two pages of plots. We tried to find optimal compromises in our graphs; most of the pictures presented in this report have been generated with a high resolution and rescaled to fit onto pages. One can have a rough idea of what is displayed, and export the picture to an external file to see it fullscreen in greater detail. These high-resolution pictures can also be zoomed in directly in the document and give very nice graphics when viewed with a 400% scale.

Typographic conventions

Commands, paths, file names and programming language words are typeset in a typewriter font. Metrics are written in small caps. The most common typographic conventions are listed below.

Convention (Style): Example
Commands (Typewriter): We used the hclust command from the stats [147] package.
Code (Typewriter): There shall be no fall through the next case in a switch statement.
Metrics (Small Caps): The sloc is a common line-counting metric used to estimate the size and the effort of a software project.
Paths and file names (Typewriter): Results were temporarily stored in /path/to/file.csv.

Places where we identified things to be further investigated, or room for improvement, are typeset as follows:

The time series analysis is an emerging technique [91]. The data sets should be used in a time series context to investigate impacts of practices on the project quality.

Citations are outlined this way: “ A thinker sees his own actions as experiments and questions–as attempts to find out something. Success and failure are for him answers above all. ” Friedrich Nietzsche

1.5 Summary

This report intends to retrace the evolution of this project, both in time and in ideas. The state of the art in part I gives a thorough vision of the knowledge we built upon, and provides a semantic context for our research. Part II follows the progression of ideas and the steps we took to set up a sound methodological framework and to develop it into a reliable and meaningful process for our purpose. Part III describes the practical application of our research to real-life concerns and needs, most notably the implementation of the developed techniques into the Squore product. Finally, we draw conclusions and propose future directions in part IV.

Part I

State of the art

Learning is the beginning of wealth. Learning is the beginning of health. Learning is the beginning of spirituality. Searching and learning is where the miracle process all begins.

Jim Rohn


Chapter 2

Software Engineering

Simply put, software engineering can be defined as the art of building great software. At its very beginning, in the 1960s, software was mere craftsmanship [159]: good and bad practices were dictated by talented gurus, and solutions to problems were ad hoc. Later on, practitioners began to record and share their experiences and lessons learned, laying down the first stones of the early software industry. Knowledge and mechanisms were slowly institutionalised, with newcomers relying on their elders' experience, and an emerging science with its own body of knowledge and practices began to take shape; an example is the arrival of design patterns in the late 1980s [22, 70].

With new eyes, new tools, new techniques and methods, software research and industry steadily gained maturity, with established procedures and paradigms. Since then software production has gone mainstream, giving birth to a myriad of philosophies and trends (Lean development, XP, Scrum), and permeating every aspect of daily life. The use of software programs across all disciplines underlines the need to establish the field of software engineering as a mature, professional and scientific engineering discipline.

Software engineering is quite old as a research topic: many fundamentals (e.g. McCabe's cyclomatic number [130], Boehm's quality model [27], or Halstead's Software Science [82]) were produced during the seventies. Resting on this sound foundation, standardisation bodies have built glossaries and quality models, and software measurement methods have been designed and improved. This chapter reviews the major software engineering concepts used in this project: section 2.1 portrays the fundamentals of software development processes and practices; section 2.2 establishes the concerns and rules of measurement as applied to software; section 2.3 gives a few definitions and concepts about software quality; and section 2.4 lists some major quality models that one needs to know when working with software quality.

2.1 The art of building software

2.1.1 Development processes

For small, simple programs, ad hoc practices are sufficient. As soon as the software becomes too complex however, with many features and requirements, and hundreds of people working on the same massive piece of code, teams need to formalise what should be done (good practices) or not (bad practices) and generally write down the roadmap of the building process. In this context, the development process is defined as how the software product is developed, from its definition to its delivery and maintenance. It usually encompasses a set of methods, tools, and practices, which may either be explicitly formalised (e.g. in the context of industrial projects), or implicitly understood (e.g. as it happens sometimes in community-driven projects) – see “The cathedral and the bazaar” from E. Raymond [148]. Although there are great differences between the different lifecycle models, some steps happen to be present in the definition of all development processes:

Requirements: the expectations of the product have to be defined, either completely (waterfall model) or at least partially (iterative models). These can be formalised as use cases or mere lists of capabilities.
Design: draw sketches to answer the requirements, define the architecture and plan development.
Implementation: write code, tests, documentation and everything needed to generate a working product.
Verification: check that the initial requirements are answered by the developed product. This is accomplished through testing before delivery, and through customer acceptance after delivery.
Maintenance: the evolution of the product after delivery, to include bug fixes or new features. It represents the most expensive phase of the product lifecycle since it can last for years.

The first established development process, in the 1960s and 1970s, was the waterfall model, which has a strong focus on planning and control [20]. Every aspect of the software has to be known up-front; it is then decomposed into smaller parts to tackle complexity (divide-and-conquer approach). The sub-components are developed, tested, and then re-assembled to build the full integrated system. Figure 2.1 depicts a typical waterfall process. The next-generation development processes use more than one iteration to build a system, delivering at the end of each cycle a partial product with an increasing scope of functionality. The iterative model introduced by Basili and Turner [19], or the spiral model presented by Boehm [26], are early examples of these. Figure 2.2 shows a typical iterative development process. Iterative models, which are at the heart of agile-like methods, have now become the mainstream development model [103].

Figure 2.1: A typical waterfall development model.

Figure 2.2: A typical iterative development model.

Agile methods define short iterations to deliver small increments that can be executed and tested, providing early feedback to the team and mitigating risks. The Agile Manifesto [21], reproduced below, summarises the main principles of the Agile movement. Extreme Programming, Kanban, Scrum, and Lean software development are major representatives of the agile methods.

Individuals and interactions over Processes and tools
Working software over Comprehensive documentation
Customer collaboration over Contract negotiation
Responding to change over Following a plan

2.1.2 Development practices

Practices are methods and techniques used in the elaboration of a product. Experience defines good and bad practices, depending on the impact they have on quality attributes. Together they form the body of knowledge that software engineering builds on its way to maturity.

Product-oriented practices

A good example of practices is design patterns. They are defined by Gamma et al. [70] as "a general reusable solution to a commonly occurring problem within a given context". They provide safe, experience-backed solutions for the design, coding and testing of a system or product, and improve areas like flexibility, reusability or understandability [70]. Y.G. Guéhéneuc et al. propose in [75] to use design patterns as the defining bricks of a dedicated quality model, and link patterns to a larger variety of quality attributes like operability, scalability or robustness. As for naming conventions, tools like Checkstyle allow formatting rules to be tailored to the project's local customs. This however implies the use of custom configurations to describe the specific format of the project; only a few projects provide, in their configuration management repository, a configuration file for the different checks that fits the project's traditions. Some coding convention violations may be detected as well – Squore, Checkstyle, PMD, and Findbugs [154, 183, 8, 181] detect a few patterns that are known to be wrong, like missing defaults in switches, race conditions, or sub-optimal constructions. Although programming languages define a wide spectrum of possible constructions, some of them are discouraged by usage recommendations and can be statically verified.
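As a purely illustrative aside, the R sketch below mimics the spirit of such statically verifiable checks with a deliberately naive, line-based heuristic: it flags C files that contain a switch statement but no default branch. It is an invented toy, not how Squore, Checkstyle, PMD or Findbugs work, and the src/ directory is a hypothetical path.

# Toy convention check (assumption: C-like sources on disk under src/).
# Real rule checkers parse the code; this sketch only inspects raw lines.
has_switch_without_default <- function(path) {
  src <- readLines(path, warn = FALSE)
  any(grepl("\\bswitch\\s*\\(", src, perl = TRUE)) &&
    !any(grepl("\\bdefault\\s*:", src, perl = TRUE))
}

files   <- list.files("src", pattern = "\\.c$", recursive = TRUE, full.names = TRUE)
flagged <- Filter(has_switch_without_default, files)
flagged   # files violating the (toy) rule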

Process-oriented practices

Examples of process-oriented practices are milestone and review planning for conducting the project. Most often these practices are informal: people organise meetings through mailing lists, doodles, or phone calls, and coding conventions may be implicitly defined or scattered across a wide variety of documents (mailing lists, wiki, the code itself). As a consequence, many process-related practices are identified through binary answers like: "Is there a document for coding conventions?" or "Are there code reviews?". Hence it is not always easy to detect these practices: the variety of development processes and tools, and the human nature of software development, make them difficult to measure consistently. Furthermore, when it comes to meaningful thresholds it is difficult to find a value that fits all projects: for agile-like methods, milestones and reviews should take place at each sprint (which lasts only a few weeks), while they often happen only later in the process for waterfall-like processes. Projects also often rely on different tools or even different forges: Jenkins uses GitHub for its code base, but has its own continuous integration server, bug tracking (JIRA) and website hosting.

From process practices to product quality

Whether adherence to a good development process leads to an improved software product has long been an open question. Dromey [53] claims this is not enough, and advocates that practitioners not forget to "construct, refine and use adequate product quality models" [53]. Kitchenham and Pfleeger reinforce this opinion by stating:

" There is little evidence that conformance to process standards guarantees good products. In fact, the critics of this view suggest that process standards guarantee only uniformity of output [...]. " B. Kitchenham and S.L. Pfleeger [111]

However, studies conducted at Motorola [56, 49] and Raytheon [80] show that there is indeed a correlation between the maturity level of an organisation as measured by the CMM and the quality of the resulting product. These cases show how a higher maturity level can lead to improved error/defect density, lower error rate, lower cycle time, and better estimation capability. There is no silver bullet, though, and what works well in some contexts may fail or simply be not applicable in other situations.

2.2 Software measurement

2.2.1 The art of measurement

The art of measurement is quite mature, compared to the field of software engineering. Humans have used measures for millennia, to count seasons, elks, big white ducks, or land. From this rough counting practice, engineers have built up and defined clear rules for better measurement. Hence, from a rigorous perspective, measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules [62, 66]. Measurement theory clearly defines most of the terms used here; however, for some more specific aspects of software we rather rely on established definitions, borrowed either from standards or from recognised authors.

Measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules [62]. As a consequence, a Measure is the variable (either number or symbol) assigned as the result of a measurement.
A Metric is a defined measurement method and the associated measurement scale [94]. An example of metric is the Lines Count, which is the raw number of lines of an artefact (application, file, function) using an ordinal scale (the number of lines).
Quality Attributes or Quality Factors are attributes of software that contribute to its quality [158]. Examples of quality attributes include readability of code, reliability of product, or popularity of the project.
Quality Models organise quality attributes according to a specific hierarchy of characteristics. Examples of quality models include those proposed by McCall [131] and Boehm et al. [28, 27], or standards like ISO 9126 [94] or CMMi [174].

As for the big picture, quality models define specific aspects of software (quality attributes) like maintainability, usability or reliability, which are decomposed into sub-characteristics: e.g. analysability, changeability, stability and testability for maintainability. These sub-characteristics themselves rely on a set of metrics; there is generally a many-to-many relationship between quality attributes and metrics: control-flow complexity has an impact on both analysability and testability, while analysability of a software program is commonly measured as a function of control-flow complexity and respect of coding conventions.

The representation condition of measurement

According to Fenton, another requirement for a valid measurement system is the representation condition, which asserts that the correspondence between empirical and numerical relations is two-way [62]. As applied to the measurement of human height, it means that: 1. if Jane is taller than Joe, then everyone knows that Jane's measure is greater than Joe's; and 2. if Jane's measure is greater than Joe's, then everyone knows that Jane is taller than Joe. Mathematically put, if we define the measurement representation M and suppose that the binary relation ≺ is mapped by M to the numerical relation <, then, formally, we have the following instance:

x ≺ y ⇐⇒ M(x) < M(y)

This is one of the most common reasons why software metrics fail [62]. An example would be to take the cyclomatic number as a measure of maintainability. Respecting the representation condition would imply that if code A has a higher cyclomatic number than code B, then code A is always less maintainable than code B – which is not true, since maintainability also depends on a variety of other parameters like coding conventions and indentation.
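To make the two-way requirement concrete, the following R sketch checks whether a candidate measure respects the representation condition over a set of artefacts for which an empirical (expert) ordering is available. The file names, expert ranks and vg values are invented for illustration; only the principle matters.

# Hypothetical data: expert ordering of maintainability (1 = most maintainable)
# and the candidate measure (cyclomatic number, vg) for three files.
expert_rank <- c(fileC = 1, fileA = 2, fileB = 3)
vg          <- c(fileC = 7, fileA = 12, fileB = 25)

# Representation condition, two-way: for every pair of artefacts, the empirical
# order and the numerical order induced by the measure must agree.
cmb   <- t(combn(names(vg), 2))
agree <- apply(cmb, 1, function(p)
  (expert_rank[p[1]] < expert_rank[p[2]]) == (vg[p[1]] < vg[p[2]]))
all(agree)   # TRUE on these toy values; a heavily commented, well indented file
             # with a high vg that experts still judge maintainable would break it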

The Goal-Question-Metric approach

In [20], Basili et al. review the requirements of a sound measurement system. To be effective, the application of metrics and models in industrial environments must be:

Focused on goals.
Applied to all life-cycle products, processes and resources.
Interpreted based on characterisation and understanding of the organisational context, environment and goals.

They propose a three-level model in order to build a measurement system, further detailed in figure 2.3:

At the conceptual level, goals are to be defined for objects of measurement (product, process, and resources).

At the operational level, questions are used to characterise the way the assessment is going to be performed.
At the quantitative level, a set of metrics is associated with every question to answer it in a quantitative way.

Figure 2.3: The Goal-Question-Metric approach.

This approach has proven to be useful for the consistency and relevance of systems of measurement. It was further commented on and expanded by L. Westfall [185], who proposes 12 steps to conduct a successful Goal-Question-Metric process in industrial contexts.
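As an illustration only, a GQM decomposition can be sketched as a simple nested structure. The goal, questions and most metric names below are invented examples (sloc and vg are metrics used throughout this document), not an actual Maisqual or Squore model.

# One goal, characterised by questions, each answered by a set of metrics,
# mirroring the conceptual, operational and quantitative levels.
gqm <- list(
  goal      = "Improve the maintainability of the delivered product",
  questions = list(
    "Is the code easy to analyse?"      = c("sloc", "vg", "comment_ratio"),
    "Are coding conventions respected?" = c("rule_violations")
  )
)
str(gqm)   # displays the three-level hierarchy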

The Hawthorne effect

The idea behind the Hawthorne effect is that participants may, either consciously or unconsciously, alter their behaviour because they know they are being studied, thus undermining the experiment. The term was coined in 1950 [68] and refers to a series of experiments conducted from 1927 to 1933 on factory workers at Western Electric's Hawthorne Plant. The experiments showed that productivity increased regardless of the changes made to the working conditions (number and duration of breaks, meals during breaks, work time). The gains apparently had little to do with the changes themselves, but rather with the fact that the workers saw themselves as special participants in an experiment, and that their inter-relationships improved. Although the conclusions of this study have been deeply debated [141, 99, 124], the influence of a measurement process on people should never be underestimated.

Ecological inference

Ecological inference is the assumption that an empirical finding at an aggregated level (e.g. package or file) also applies at the atomic sub-level (respectively file or function). If this inference does not hold, we have an ecological fallacy. The concept of ecological inference is borrowed from geographical and epidemiological research. It was first noticed by Robinson as early as 1950 [152], when he observed that, at an aggregated level, immigrant status in U.S. states was positively correlated (+0.526) with educational achievement, whereas at the individual level it was negatively correlated (-0.118). It was attributed to the congregational tendency of human nature, but it introduced a confounding phenomenon which jeopardised the internal validity of the study at the aggregated level. Posnett et al. [146] discuss ecological inference in the software engineering field and highlight the risks of metrics aggregation in software models. There are many cases in software engineering where we want to apply ecological inference. As an example, the cyclomatic complexity proposed by McCabe [130] is measured at the function level, but practitioners also often consider it at the file or application level.
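A minimal R simulation (invented data, not drawn from the Maisqual data sets) illustrates the fallacy: within each aggregate the relationship between two file-level variables is negative, yet the aggregated means correlate positively, mirroring Robinson's observation.

set.seed(1)
g <- rep(1:10, each = 50)                    # 10 aggregates (e.g. packages), 50 files each
e <- rnorm(length(g), sd = 3)                # within-aggregate variation
x <- g + e                                   # file-level metric
y <- 0.6 * g - 1.5 * e + rnorm(length(g))    # file-level outcome
df <- data.frame(g = g, x = x, y = y)

cor(df$x, df$y)                              # individual (file) level: negative
agg <- aggregate(cbind(x, y) ~ g, data = df, FUN = mean)
cor(agg$x, agg$y)                            # aggregated level: strongly positive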

2.2.2 Technical debt

The technical debt concept is a compromise on one dimension of a project (e.g. maintainability) to meet an urgent demand in some other dimension (e.g. a release deadline). Every piece of work accomplished on the not-quite-right code counts as interest on the debt, until the main debt is repaid through a rewrite. Martin Fowler puts it nicely in a 2009 column [67]:

“ You have a piece of functionality that you need to add to your system. You see two ways to do it, one is quick to do but is messy - you are sure that it will make further changes harder in the future. The other results in a cleaner design, but will take longer to put in place. ” Martin Fowler

The concept is not new – it was first coined by Ward Cunningham in 1992 [43] – but is now attracting more and more interest with the increasing concerns about software development productivity. Many software analysis tools and methods propose a component based on the technical debt: e.g. Squore, Sqale [122] or SonarQube. Another strength of the technical debt lies in its practical definition: developers and stakeholders easily understand it, both for weighing decisions and from an evaluation perspective. Furthermore, it defines a clear roadmap to better quality.

Figure 2.4: Technical debt landscape [117].

Technical debt is not always bad, though: it may be worth making changes immediately for an upcoming release rather than missing the time-to-market slot. In these cases, practitioners just have to manage the technical debt and keep it under control [31]. Code technical debt (as opposed to architectural or structural debt, see figure 2.4) is usually based on a few common metrics (e.g. complexity, size) and the number of non-conformities detected in the source [117]. When comparing software products, the technical debt is often expressed as an index by dividing its amount by the number of files or lines of code.
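For illustration, such an index takes only a couple of lines to compute; the remediation efforts and sizes below are invented figures, not measurements of any real product.

# Technical debt index: remediation effort normalised by size (minutes per kSLOC).
debt_minutes <- c(productA = 1800, productB = 4200)    # estimated remediation effort
sloc         <- c(productA = 12000, productB = 45000)  # lines of code
debt_index   <- debt_minutes / (sloc / 1000)
debt_index   # productA: 150, productB: about 93; the smaller product carries more debt per kSLOC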

2.3 Quality in software engineering

2.3.1 A few words about quality

Defining quality is not only a matter of knowing who is right or close to the truth: as we will see in this section, the plethora of different definitions shows how much this concept can evolve, depending on the domain (e.g. critical embedded system versus quick desktop utility), the development model (e.g. open versus closed source), or who is qualifying it (e.g. developer versus end-user). Nevertheless, one needs to know some of the canonical definitions to better understand usual expectations and requirements, and to get another view on one's local quality context. What is called quality is important because it really defines what the different actors of a software project expect from the system or service delivered, and what can be improved. The definition of quality directly shapes how a quality model is built.

2.3.2 Garvin’s perspectives on quality

David Garvin [71] stated five different perspectives of quality that are still relevant to modern views on software quality:

The transcendental perspective deals with the metaphysical aspect of quality. In this view of quality, it is "something toward which we strive as an ideal, but may never implement completely" [111]. It can hardly be defined, but is similar to what a federal judge once commented about obscenity: "I know it when I see it" [103].
The user perspective is concerned with the appropriateness of the product for a given context of use. Whereas the transcendental view is ethereal, the user view is more concrete, grounded in the product characteristics that meet the user's needs [111].
The manufacturing perspective represents quality as conformance to requirements. This aspect of quality is stressed by standards such as ISO 9001, which defines quality as "the degree to which a set of inherent characteristics fulfils requirements" (ISO/IEC 1999b). Other models like the Capability Maturity Model state that the quality of a product is directly related to the quality of the engineering process, thus emphasising the need for a manufacturing-like process.
The product perspective implies that quality can be appreciated by measuring the inherent characteristics of the product. Such an approach often leads to a bottom-up approach to software quality: by measuring some attributes of the different components composing a software product, a conclusion can be drawn as to the quality of the end product.
The final perspective of quality is value-based. This perspective recognises that the different perspectives of quality may have different importance, or value, to various stakeholders.

The manufacturing view has been the predominant view in software engineering since the 1960s, when the US Department of Defense and IBM gave birth to Software Quality Assurance [182].

2.3.3 Shewhart

Shewhart was a physicist, engineer and statistician working for Bell Laboratories in the early 1930s. He was the first to introduce statistical control in the manufacturing industry, and is now recognised as the founding father of process quality control. His work has been further commented on and expanded by W. Edwards Deming. In a book published in 1931, Economic Control of Quality of Manufactured Product [160], Shewhart proposes a sound foundation for quality, which is still used and referenced nowadays in the industry:

" There are two common aspects of quality: one of them has to do with the consideration of the quality of a thing as an objective reality independent of the existence of man. The other has to do with what we think, feel or sense as a result of the objective reality. In other words, there is a subjective side of quality. " Walter A. Shewhart

2.3.4 Crosby

In "Quality is free: The art of making quality certain" [42], Crosby puts conformance to requirements forward as the main characteristic of quality, mainly because it makes the concept easier to define, measure and manage. However, he tries to widen the scope of requirements by adding an external perspective to the traditional production perspective.

" As follows quality must be defined as conformance to requirements if we are to manage it. Consequently, the non-conformance detected is the absence of quality, quality problems become non-conformance problems, and quality becomes definable. " Philip B. Crosby

In the same essay, Crosby enumerates four key points to be the foundations of a quality system:

Conformance to requirements, as opposed to goodness or elegance.
System for causing quality is prevention, not appraisal.
Performance standard must be Zero Defect.
Measurement of quality is the price of non-conformance.

2.3.5 Feigenbaum

Feigenbaum is one of the originators of the Total Quality Management movement [61]. He stresses the importance of meeting customer needs, either “stated or unstated, conscious or merely sensed, technically operational or entirely subjective”. As such, quality is a dynamic concept in constant change. Product and service quality is a multidimensional concept and must be comprehensively analysed; it can be defined as “the total composite product and service characteristics of marketing, engineering, manufacture and maintenance through which the product and service in use will meet the expectations of the customer”. Feigenbaum and the Total Quality Management system greatly influenced quality management in the industry: it was a sound starting point for Hewlett-Packard’s Total Quality Control, Motorola’s Six Sigma, or IBM’s Market Driven Quality [103].

2.3.6 Deming

Deming was an early supporter of Shewhart's views on quality. In "Out of the crisis: quality, productivity and competitive position" [46], he defines quality in terms of customer satisfaction. Deming's approach goes beyond the product characteristics and encompasses all aspects of software production, stressing the role of management: meeting and exceeding customer expectations should be the ultimate task of everyone within the company or organisation. Deming also introduces 14 points to help people understand and implement a sound quality control process, spanning from organisational advice (e.g. break down barriers between departments, remove numerical targets) to social and philosophical considerations (e.g. drive out fear, institute leadership). Interestingly, agile practitioners would probably recognise some aspects of this list which are common in modern agile processes: drive out fear, educate people, constantly improve quality and reduce waste...

2.3.7 Juran

In his “Quality Control Handbook” [101], Juran proposes two definitions for quality: 1. quality consists of those product features which meet the needs of customers and thereby provide product satisfaction; and 2. quality consists of freedom from deficiencies. He nevertheless concludes with a short, standardised definition of quality as “fitness for use”:

“ Of all concepts in the quality function, none is so far-reaching and vital as “fitness for use”. [...] Fitness for use is judged as seen by the user, not by the manufacturer, merchant, or repair shop. ” J.M. Juran [101]

According to Juran, fitness for use is the result of some identified parameters that go beyond the product itself:

- Quality of design: identifying and building the product or service that meets the user’s needs. It is composed of: 1. identification of what constitutes fitness for use to the user (quality of market research); 2. choice of a concept of product or service to be responsive to the identified needs of the user; and 3. translation of the chosen product concept into a detailed set of specifications which, if faithfully executed, will then meet the user’s needs. There are also some “abilities” specific to long-lived products: availability, reliability, and maintainability.
- Quality of conformance: the design must reflect the needs of fitness for use, and the product must also conform to the design. The extent to which the product does conform to the design is called “quality of conformance”.
- Field service: following the sale, the user’s ability to secure continuity of service depends largely on some services the organisation should provide: training, clear and unequivocal contracts, repairing, conducting affairs with courtesy and integrity.

2.4 Quality Models in Software Engineering

2.4.1 A few words about quality models

Why do we need quality models?

Quality models help organise the different aspects of quality into an extensive and consistent structure [45]. In some sense, they help to explicitly define the implicit needs for the product or service. By linking the quality characteristics to defined measures, one is then able to assess the compliance of a product or project with expectations [54, 93]. By putting words on concepts, individuals can formalise the fuzzy notion of quality and communicate it to others [93, 45]. Since there is no single definition of quality, quality models allow people to at least agree on a localised, context-specific compromise on what users (developers, stakeholders, management, etc.) expect from a good product or project. Quality models ultimately provide a structure to collaborate on a common framework for quality assessment – which is the first step to measure, and then improve, quality.

Common criticisms

- Quality models that propose means to evaluate software quality from measures generally rely on numerical or ratio scales. But in the end, the appreciation of quality is human-judged, and humans often use non-numeric scales [62, 105].
- There have been various attempts to integrate new areas of knowledge: quality models target different types of quality measurement and factors, from community-related characteristics in FLOSS models to user satisfaction surveys. Khosravi et al. [110] propose to introduce different categories of inputs to capture information present in artefacts that go beyond source code.
- A single quality model cannot rule all software development situations. The variety of software development projects poses a serious threat to the pertinence of universally defined attributes of quality [110]. Small, geographically localised projects (like a ticketing system for a small shop) have different quality requirements than large, distributed, critical banking systems.
- Links between quality attributes and metrics are jeopardised by many factors, dependent both on the type of project and on its environment [105]. Some metrics are not available in specific contexts (e.g. because of the tools or the development process used) or have a different meaning. This in turn induces different threshold values for what is fair, poor or forbidden.
- Standard requirements evolve. Software for industry in the 80s had different constraints than we have nowadays: processing time was a lot more expensive, and some design choices were still tied to hardware constraints. The requirements of quality have thus greatly evolved, and some characteristics of older quality models are no longer relevant.

Still, one needs to know the different approaches, both for the referencing part (terminology, semantics, structure) and for model completeness (there is no single quality model with all available characteristics).

2.4.2 Product-oriented models

McCall

The McCall model, also known as the General Electric model, was published by Jim McCall et al. in 1977 [131]. It can be seen as an attempt to bridge the gap between users and developers by focusing on a number of quality factors that reflect both the user’s views and the developer’s priorities. McCall identifies three areas of software quality:

- Product Operation refers to the product’s ability to be quickly understood, efficiently operated and capable of providing the results required by the user. It includes correctness, reliability, efficiency, integrity, and usability.
- Product Revision is the ability to undergo changes, including error correction and system adaptation. It includes maintainability, flexibility, and testability.
- Product Transition is the adaptability to new environments, e.g. distributed processing or rapidly changing hardware. It includes portability, reusability, and interoperability.

Figure 2.5: McCall’s quality model.

Each quality factor is defined as a linear combination of metrics:

F_a = c_1 m_1 + c_2 m_2 + \dots + c_n m_n

McCall defines 21 metrics, from auditability (the ease with which conformance to standards can be checked) to data commonalities (use of standard data structures and types) or instrumentation (the degree to which a program monitors its own operations and identifies errors that do occur). Unfortunately, many of these metrics can only be defined subjectively [111]: an example of a question, corresponding to the Self-documentation metric, is: “is all documentation structured and written clearly and simply such that procedures, functions, algorithms, and so forth can easily be understood?”
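As a purely illustrative sketch of this weighted-sum idea, the R snippet below computes one such factor; the metric names, weights and scores are hypothetical and are not taken from McCall's model.

```r
# Illustrative sketch only: a quality factor F_a = c1*m1 + ... + cn*mn.
# Weights and metric scores are made up for the example.
weights <- c(auditability = 0.5, self_documentation = 0.3, instrumentation = 0.2)
scores  <- c(auditability = 0.8, self_documentation = 0.4, instrumentation = 0.6)
Fa <- sum(weights * scores)   # 0.5*0.8 + 0.3*0.4 + 0.2*0.6
Fa                            # 0.64
```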

Boehm

In a 1976 article on the “Quantitative evaluation of software characteristics” [28], Barry W. Boehm proposed a new model targeted at product quality. It discusses the organisation of quality characteristics for different contexts and proposes an iterative approach to link them to metrics. One of the strong points of Boehm’s proposal is the pragmatic usability of the model and its associated method: “one is generally far more interested in where and how rather than how often the product is deficient.” The model defines 7 quality factors and 15 sub-characteristics that together represent the qualities expected from a software system:

- Portability: it can be operated easily and well on computer configurations other than its current one.
- Reliability: it can be expected to perform its intended functions satisfactorily.
- Efficiency: it fulfils its purpose without waste of resources.
- Usability: it is reliable, efficient and human-engineered.
- Testability: it facilitates the establishment of verification criteria and supports evaluation of its performance.

Figure 2.6: Boehm’s quality model.

- Understandability: its purpose is clear to the inspector.
- Modifiability: it facilitates the incorporation of changes, once the nature of the desired change has been determined.

FURPS/FURPS+

The FURPS model was originally presented by Robert Grady and extended by Rational Software into FURPS+. It is basically structured in the same manner as Boehm’s and McCall’s models. FURPS is an acronym which stands for:

- Functionality – which may include feature sets, capabilities and security.
- Usability – which may include human factors, aesthetics, consistency in the user interface, online and context-sensitive help, wizards and agents, user documentation, and training materials.
- Reliability – which may include frequency and severity of failure, recoverability, predictability, accuracy, and time between failures.
- Performance – which imposes conditions on functional requirements such as speed, efficiency, availability, accuracy, throughput, response time, recovery time, and resource usage.
- Supportability – which may include testability, extensibility, adaptability, maintainability, compatibility, configurability, serviceability, installability, and localisability (internationalisation).

ISO/IEC 9126

The ISO/IEC 9126 standard is probably the most widespread quality framework in use in the industry nowadays. It defines an extensive glossary of terms for software concepts (process, product, quality characteristics, measurement, etc.) and a recognised framework for product quality assessment, and it even proposes links to some common metrics identified as being related to the quality attributes. According to Deissenboeck’s classification [45], the ISO/IEC 9126 standard is a definition model. The metrics it provides, however, are not pragmatic enough to raise it to an assessment model.

Figure 2.7: ISO/IEC 9126 quality model.

The standard makes an interesting distinction between the various factors of quality. Internal quality is defined by the quality requirements from an internal view, i.e. independently from any usage scenario, and can be directly measured on the code. External quality is measured and evaluated when the software is executed in a given context, i.e. typically during testing. Finally, quality in use is the user’s perspective on the software product when it is used in a specific environment and a specific context of use. It measures the extent to which users can achieve their goals in a particular environment, rather than measuring the properties of the software itself. Quality in use and external quality may have some metrics in common. The first version of the model, ISO/IEC 9126:1991, has been replaced by two related standards: ISO/IEC 9126:2001 (Product quality) and ISO/IEC 14598 (Product evaluation). These in turn have been superseded by the ISO 250xx SQuaRE series of standards.

ISO/IEC 250xx SQuaRE

The SQuaRE series of standards was initiated in 2005 as the intended successor of ISO 9126. Most of its structure is modelled after its ancestor, with some enhancements on the scope (computer systems are now included) and a few changes on the sub-characteristics

(eight product quality characteristics instead of ISO 9126’s six).

Figure 2.8: ISO/IEC 250xx quality model.

The series spans from ISO/IEC 25000 to 25099 and is decomposed into five sub-series of standards:

- Quality management. ISO/IEC 2500n presents and describes the SQuaRE series of standards, giving advice on how to read and use it. It can be seen as an introduction to the remaining documents.
- Quality model. ISO/IEC 25010 defines the quality in use and product quality models. ISO/IEC 25012 defines a model for data quality, complementary to the above models. It provides help to define software and system requirements, the design and testing objectives, and the quality control criteria.
- Quality measurement. ISO/IEC 2502n provides a measurement reference model and a guide for measuring the quality characteristics defined in ISO/IEC 25010, and sets requirements for the selection and construction of quality measures.
- Quality requirement. ISO/IEC 2503n provides recommendations for quality requirements, and guidance for the processes used to define and analyse quality requirements. It is intended as a method to use the quality model defined in ISO/IEC 25010.
- Quality evaluation. ISO/IEC 2504n revises the ISO/IEC 14598-1 standard for the evaluation of software product quality and establishes the relationship of the evaluation reference model to the other SQuaRE documents.

Although some parts are already published, the standard is still a work in progress: available slots in the 25000–25099 range should be filled with upcoming complements, such as 25050 for the evaluation of COTS (Commercial Off-The-Shelf) components.

SQALE

SQALE (Software Quality Assessment based on Lifecycle Expectations) is a method to support the evaluation of a software application’s source code. Its authors intended it as an analysis model compliant with the representation condition of quality measurement [123]. It is a generic method, independent of the language and source code analysis tools, and relies on ISO 9126 for the structure of its quality factors. Its main advantage over older quality models is that it proposes pragmatic metrics to quantitatively compute quality from the lower levels (code) up to the upper quality factors. It uses some concepts borrowed from the technical debt metaphor [122] and has been gaining interest from the industry in recent years.

2.4.3 Process-oriented models

The Quality Management Maturity Grid

The Quality Management Maturity Grid (QMMG) was introduced by P. Crosby in his book “Quality is free” [42] to assess the degree of maturity of organisations and processes. Although it has largely been superseded by other models like the CMM (which acknowledges Crosby’s work as a predecessor), it was an early precursor that laid the foundation for all next-generation maturity assessment models. The model defines categories of measures (e.g. management understanding and attitude, quality organisation status or problem handling), and five levels of maturity:

- Uncertainty: We don’t know why we have problems with quality.
- Awakening: Why do we always have problems with quality?
- Enlightenment: We are identifying and resolving our problems.
- Wisdom: Defect prevention is a routine part of our operation.
- Certainty: We know why we do not have problems with quality.

The Capability Maturity Model

In the 1980s, several US military projects involving software subcontractors ran over time, over budget, or even failed, at an unacceptable rate. In an effort to determine why this was occurring, the United States Air Force funded a study at Carnegie Mellon’s Software Engineering Institute. This project eventually gave birth to version 1.1 [142] of the Capability Maturity Model in 1993. Criticisms of the model, most of them revolving around the lack of integration between the different processes in the organisation, led to the development and first publication of the CMMi in 2002 [173]. The latest version of the CMMi is 1.3 [174], published in 2010. The CMMi defines two complementary representations for process maturity evaluation. The staged view attaches a set of activities, known as process areas, to each level of maturity of the organisation. The continuous representation uses capability levels to characterise the state of the organisation’s processes relative to individual areas. The CMMi comes in three areas of interest: development, services, and acquisition.

Figure 2.9: The Capability Maturity Model Integration.

CMMi-Development 1.3 defines five maturity levels and twenty-two process areas, illustrated in figure 2.9. Organisation-related maturity levels are initial, managed, defined, quantitatively managed, and optimising. Examples of process areas include causal analysis and resolution, configuration management, requirements management, or project planning. In the continuous representation, process areas are assessed through the following four capability levels: incomplete, performed, managed, and defined. The CMMi information centre has recently been moved from the SEI web site to the CMMi Institute1, which is hosted by Carnegie Innovations, the Carnegie-controlled technology commercialisation enterprise.

ISO 15504

The ISO 15504 standard, also known as SPICE (Software Process Improvement and Capability dEtermination), provides a guide for performing an assessment. ISO/IEC 15504 was initially derived from the process lifecycle standard ISO/IEC 12207 and from maturity models like Bootstrap, Trillium and the CMM. It includes the assessment process, the model for the assessment, and any tools used in the assessment, and is decomposed into two dimensions of software quality:

- Process areas, decomposed into five categories: customer/supplier, engineering, supporting, management, and organisation.

1http://cmmiinstitute.com/

- Capability levels, to measure the achievement of the different process attributes: incomplete, performed, managed, established, predictable, optimising. These are similar to CMM’s levels of maturity.

2.4.4 FLOSS models

First-generation FLOSS models

Traditional quality models ignore various aspects of software unique to FLOSS (Free/Libre Open Source Software), most notably the importance of the community. Between 2003 and 2005 the first generation of quality assessment models emerged on the FLOSS scene. They were:

- Open Source Maturity Model (OSMM), Capgemini, provided under a non-free license [55].
- Open Source Maturity Model (OSMM), Navica, provided under the Academic Free license [137].
- Qualification and Selection of Open Source Software (QSOS), provided by Atos Origin under the GNU Free Documentation license [140].
- Open Business Readiness Rating (OpenBRR), provided by the Carnegie Mellon West Centre for Open Source Investigation, sponsored by O’Reilly CodeZoo, SpikeSource, and Intel, and made available under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license [184].

All of these models are based on manual work, supported by evaluation forms. The most sophisticated tool support can be found in QSOS, where the evaluation is supported by either a stand-alone program or a Firefox plugin which enables feeding results back to the QSOS website for others to download. Still, the data gathering and evaluation is manual. As of 2010, none of these FLOSS quality models had been widely adopted, and few or none of them can really be considered a success yet. The OSMM Capgemini model has a weak public presence in the open source community, the web resources for the OSMM Navica are no longer available, and the OpenBRR community consists of an abandoned web site that is frequently not available. The QSOS project shows a slow growth in popularity [186], with half a dozen papers published and many web sites referencing it. Unlike the other quality models it is still under development, with a defined roadmap and active committers.

Second-generation FLOSS models

This next generation of FLOSS quality models has learned from both traditional quality models and first-generation FLOSS quality models, notably with more extensive tool support. These models are:

- QualOSS [47] is a high-level and semi-automated methodology to benchmark the quality of open-source software. Product, community and process are considered to be of equal importance for the quality of a FLOSS endeavour. It can be applied to both FLOSS products and components.
- QualiPSo Open Source Maturity Model [144], a CMM-like model for FLOSS. It focuses on process quality and improvement, and only indirectly on product quality [79].
- Software Quality Observatory for Open Source Software (SQO-OSS) proposes a quality model and a platform with quality assessment plug-ins. It comprises a core tool with software quality assessment plug-ins and an assortment of UIs, including a Web UI and an Eclipse plugin [157]. SQO-OSS is being maintained, but the quality model itself is not yet mature, and most of the focus is on the development of the infrastructure to enable the easy development of plug-ins.

2.5 Summary

In this chapter we walked through the landscape of software engineering, identifying the building blocks of our research. First, we quickly described the major development principles, processes and practices needed to localise where and how we can measure and act on a software project. We defined the basis of measurement as applied to software projects, the pitfalls that one may face when applying it in different contexts, and the experience gathered on software measurement programs in the past decades. We then presented a few definitions and models of quality. They provide a sound and experience-proven foundation on which to build custom, context-aware quality models. This is a mandatory step for establishing a common terminology and comprehension of the concepts used during the project, and helps to identify concerns and issues encountered in many different situations and in large, real-life projects. On the one hand, there are complex but fully-featured quality models, with pragmatic links between quality attributes and metrics and a method for their application. On the other hand, the most widespread quality models are those which are much simpler to understand or apply, even if they raise some concerns or show discrepancies. What remains from this fish-eye view is the multiplicity of definitions of quality for researchers and practitioners: even today, and despite the many references available, people do not easily agree on it. Another difficult point is the link between quality attributes and the metrics that allow them to be measured pragmatically, because of issues with the definition, understanding and availability of metrics. As an example, there is no common agreement on the metrics that definitely measure maintainability and comply with the representational condition of measurement.

Chapter 3

Data mining

Data mining is the art of examining large data sources and generating a new level of knowledge to better understand and predict a system’s behaviour. In this chapter we review some statistical and data mining techniques, looking more specifically at how they have been applied to software engineering challenges. As often as possible we give practical examples extracted from the experiments we conducted. Statistics and data mining methods evolved on their own and were not introduced into software-related topics until 1999, when Mendonça and Sunderhaft published an extensive survey [132] summarising the state of data mining for software engineering for the DACS (Data and Analysis Center for Software). Since then data mining methods for software engineering have emerged as a specific domain of interest for research, with works on software quality predictors (Khoshgoftaar et al. [109] in 1999), test automation (M. Last et al. [119] in 2003), software history discovery (Jensen [97] in 2004), prediction of source code changes (A. Ying et al. [195]), and recommenders to help developers find related changes when modifying a file (Zimmermann [202] in 2005). The specific application of data mining techniques to software engineering concerns is now becoming an established field of knowledge [190, 172]. This chapter is organised as follows. We start with some basic exploratory analysis tools in section 3.1 and principal component analysis in section 3.2. We then review the main techniques we investigated for our purpose: clustering in section 3.3, outliers detection in section 3.4, regression analysis in section 3.5, time series analysis in section 3.6, and distribution functions in section 3.7. Section 3.8 summarises what we learned.

3.1 Exploratory analysis

3.1.1 Basic statistical tools

The most basic exploration technique is a quick statistical summary of the data: minimum, maximum, median, mean, and variance of attributes. Table 3.1 shows an example of a summary for a few metrics gathered on Ant 1.8.1. This allows us to get some important information on the general shape of metrics, and to quickly identify and dismiss erroneous metrics

(e.g. if min and max are equal).

Table 3.1: Quick summary of common software metrics for Ant 1.8.1.

Metric            Min    1st Qu.  Median  Mean    3rd Qu.  Max      Var
cft               0      4        16      37.43   41       669      3960.21
clas              1      1        1       1.42    1        14       1.65
comr              5.14   39.2     50      51.13   64       92.86    334.22
sloc              2      22       56      107.5   121      1617     25222.24
vg                0      4        11      22.7    25       379      1344.14
scm_commits       0      4        12      22.11   30       302      962.92
scm_committers    0      1        2       3.44    5        19       10.97
scm_fixes         0      0        1       1.54    2        39       8.03
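A summary of this kind can be obtained in R with summary() and var(); the sketch below uses synthetic data and hypothetical metric names rather than the actual Ant measures, but produces the same kind of figures as table 3.1.

```r
# Minimal sketch on synthetic data (hypothetical metric names).
set.seed(42)
metrics <- data.frame(sloc        = round(rlnorm(200, meanlog = 4, sdlog = 1)),
                      vg          = rpois(200, lambda = 20),
                      scm_commits = rpois(200, lambda = 22))
summary(metrics)       # min, quartiles, median, mean and max for each metric
sapply(metrics, var)   # variance of each metric
```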

3.1.2 Scatterplots

Scatterplots are very useful to visually identify relationships in a multivariate dataset. As an example, figure 3.1 shows a scatterplot of some common file metrics for the Ant project: sloc, vg, and scm_commits_total. We can immediately say that the sloc and vg metrics seem to be somewhat correlated: they follow a rough line with little variance.

Figure 3.1: Scatterplots of some file metrics for Ant

Cook and Swayne [39] use scatterplots to show specific characteristics of data by visually differentiating some elements. This adds another dimension of information to the plot: in figure 3.1, files that have been heavily modified (more than 50, 75 and 100 commits) are plotted with orange, light red and dark red crosses respectively. The most unstable files lie at about two thirds of the metric’s range, and only a few of the larger files (with a high value of sloc) are heavily modified.
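This kind of colour-coded scatterplot can be sketched in R as follows; the data is synthetic and the metric names are hypothetical, but the commit thresholds mirror the description above.

```r
# Minimal sketch: colour files by their number of commits, as in figure 3.1.
set.seed(42)
sloc        <- round(rlnorm(200, meanlog = 4, sdlog = 1))
vg          <- round(0.2 * sloc + rpois(200, lambda = 3))
scm_commits <- rnbinom(200, size = 1, mu = 30)          # long-tailed commit counts
colour <- cut(scm_commits, breaks = c(-Inf, 50, 75, 100, Inf),
              labels = c("black", "orange", "red", "darkred"))
plot(vg, sloc, col = as.character(colour), pch = 3, xlab = "vg", ylab = "sloc")
```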

3.2 Principal Component Analysis

The idea of principal component analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the data as a whole. With a large number of variables it may be easier to consider a small number of combinations of the original data rather than the entire data set. In other words, principal component analysis finds a set of orthogonal standardised linear combinations which together explain all of the variation in the original data. There are potentially as many principal components as there are variables, but typically it is the first few that explain important amounts of the total variation.

Figure 3.2: Principal Component Analysis for three metrics.

Figure 3.2 illustrates the application of principal component analysis to a set of three variables: sloc, vg, and scm_commits_total for the Ant 1.8.0 files. We know that sloc and vg are correlated, i.e. they carry a similar load of information and tend to vary together. The first picture (upper left) shows the decreasing relative importance of each principal component as given by its variance. In this case the third component is insignificant compared to the variance of the first two, which means that most of the variation in the three variables can be summarised with only two components. The next three plots show the projection of the metrics on the three components: sloc and vg overlap on the first plot since they carry the same information, and are almost perpendicular to scm_commits_total. The number of components to be selected varies; in our work the rule we adopted was to select only the principal components that represent at least 20% of the first principal component’s amplitude.

Figure 3.3: Principal Component Analysis for Ant 1.8.1 file metrics.

Figure 3.3 extends the previous example to a larger number of variables, corresponding to the whole set of a project’s file metrics. There are more components with a relatively high variance, which means that all of the variation cannot be summarised in one or two

axes. In the case depicted here, only the first 4 components are selected by the 20% rule. It should also be noted that PCA works well on common shapes, i.e. with variables that show a Gaussian-like distribution, but is unable to sort out complex shapes like nested circles – most common algorithms will fail on these anyway. Jung et al. [100] apply principal component analysis to the results of a survey on the comprehension of ISO 9126 quality characteristics by practitioners. They use it to propose a new decomposition of quality attributes that better suits people’s perception of software quality. In [179] Turan uses backward feature selection and PCA for dimensionality reduction. On the NASA public data sets, the original set of 42 metrics is reduced to 15 with almost no information loss. Tsai and Chan [178] review various dimensionality reduction techniques and compare PCA with other methods. They find that PCA works well for identifying coordinates and linear correlations in high dimensions, but is unsuitable for nonlinear relationships among metrics. Kagdi et al. [102] apply PCA to software measurement data to identify the main domains of variation among metrics and to build a prediction model for post-release defects. Nagappan et al. [135] also use PCA to remove correlated metrics before building statistical predictors to estimate the post-release failure-proneness of components of Windows Server 2003.
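A minimal PCA run in R is sketched below on synthetic, partly correlated metrics (hypothetical names); prcomp() returns the components and the share of variance each one explains, which is what the scree plot of figure 3.2 displays.

```r
# Minimal sketch: PCA on three correlated-ish file metrics.
set.seed(42)
sloc  <- rlnorm(300, meanlog = 4, sdlog = 1)
files <- data.frame(sloc        = sloc,
                    vg          = 0.2 * sloc + rnorm(300, sd = 5),
                    scm_commits = rpois(300, lambda = 20))
pca <- prcomp(files, scale. = TRUE)    # standardise the metrics before projection
summary(pca)                           # proportion of variance per component
screeplot(pca, type = "lines")         # decreasing variance of the components
```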

3.3 Clustering

Clustering techniques group items on the basis of similarities in their characteristics, with respect to multiple dimensions of measurement. Representing data by fewer clusters necessarily causes the loss of certain fine details but achieves simplification [24]. Clustering can also be used to automatically decide on a partition of elements based on their own characteristics [200]. The usability and efficiency of clustering techniques depend on the nature of the data. In particular cases (e.g. with concentric circles or worm-like shapes) some algorithms simply will not work. The preparation of data is of primary importance to get the best out of these methods; applying a monotonic transformation to variables, or considering them in a different referential, may be needed to get better results. Clustering algorithms may also be combined with other techniques to achieve better results. As an example, Antonellis et al. [7] apply a variant of k-means clustering on top of the results of the AHP (Analytic Hierarchy Process) of the output of a survey on ISO 9126 quality characteristics. This is what we also did for the principal component analysis and correlation matrix results of our product version analysis document – see section 4.2.5. Applying clustering to the correlation results brings more knowledge by identifying groups of inter-dependent artefacts or metrics. Clustering is also used in outliers detection approaches, by selecting elements that either lie outside of clusters or are grouped in small clusters [23, 165]. This is described later in section 3.4.

49 3.3.1 K-means clustering

K-means is the most popular clustering tool used in scientific and industrial applications [24]; it aims at grouping n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells [59]. k-means algorithms are considered to be quite efficient on large data sets and often terminate at a local optimum, but are sensitive to noise [115]. There are alternatives to the k-means algorithm which use other computation methods for the distance: k-medoids uses a median point in clusters instead of a weighted mean, and k-attractors, introduced by Kanellopoulos et al., addresses the specificities of software measurement data [104]. In another study published in 2008 [193] Kanellopoulos uses the k-attractors clustering method to group software artefacts according to the ISO 9126 quality characteristics. k-means clustering has also been used for evaluating the success of software reuse: Kaur et al. [107] apply k-means clustering to the results of a survey on the factors of success for software reuse, and are able to extract meaningful categories of practices with good accuracy. Herbold et al. [88] apply k-means to time data (namely sloc and bugs) of Eclipse projects to automatically detect milestones. Zhong et al. compare in [200] clusters produced by two unsupervised techniques (k-means and neural-gas) to a set of categories defined by a local expert and discuss the implications on noise reduction. They find that both algorithms generally perform well, with differences in time and resource performance, and with errors varying according to the analysed projects. In another paper [201] the same authors cluster files in large software projects to extract a few representative samples, which are then submitted to experts to assess the software quality of the whole file set. More recently Naib [136] applies k-means and k-medoids clustering to fault data (error criticality and error density) of a real-time C project to identify error-prone modules and assess software quality. In another area of software engineering, Dickinson et al. examine in [50] data obtained from random execution sampling of instrumented code and focus on comparing procedures for filtering and selecting data, each of which involves a choice of a sampling strategy and a clustering metric. They find that for identifying failures in groups of execution traces, clustering procedures are more effective than simple random sampling. Liu and Han present in [127] a new failure proximity metric for grouping error types, which compares execution traces and considers them as similar if the fault appears to come from a close location. They use a statistical debugging tool to automatically localise faults and better determine failure proximity.
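A basic k-means run in R is sketched below (synthetic data, hypothetical metric names); kmeans() is part of the stats package and the choice of k = 3 is arbitrary.

```r
# Minimal sketch: partition files into k = 3 clusters from two standardised metrics.
set.seed(42)
sloc <- rlnorm(300, meanlog = 4, sdlog = 1)
vg   <- 0.2 * sloc + rnorm(300, sd = 5)
x    <- scale(cbind(sloc, vg))           # centre and reduce both metrics
km   <- kmeans(x, centers = 3, nstart = 25)
table(km$cluster)                        # number of files per cluster
plot(sloc, vg, col = km$cluster, pch = 19)
```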

3.3.2 Hierarchical clustering

The idea behind hierarchical clustering is to build a binary tree of the data that successively merges similar groups of points [134, 115, 24]. The algorithm starts with each individual as a separate entity and ends up with a single aggregation. It relies on the distance matrix to show which artefact is similar to another, and to group these similar artefacts in the

same limb of a tree. Artefacts that heavily differ are placed in another limb. The algorithm only requires a measure of similarity between groups of data points and a linkage criterion which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets. The measure of distance can be e.g. Euclidean distance, squared Euclidean distance, Manhattan distance, or maximum distance. For text or other non-numeric data, metrics such as the Hamming distance or Levenshtein distance are often used. The following table lists the mathematical functions associated with these distances.

Euclidean distance: \|a - b\|_2 = \sqrt{\sum_i (a_i - b_i)^2}

Squared Euclidean distance: \|a - b\|_2^2 = \sum_i (a_i - b_i)^2

Manhattan distance: \|a - b\|_1 = \sum_i |a_i - b_i|

Maximum distance: \|a - b\|_\infty = \max_i |a_i - b_i|

The most common linkage criteria are the complete [44] (similarity of the furthest pair), the single [162] (similarity of the closest pair), and the average [168] (average similarity between groups) methods. They are listed in the next table with their computation formulae.

Complete linkage: \max \{ d(a, b) : a \in A, b \in B \}

Single linkage: \min \{ d(a, b) : a \in A, b \in B \}

Average linkage: \frac{1}{|A| \, |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)

Single linkage can produce chaining, where a sequence of close observations in different groups causes early merges of those groups. Complete linkage has the opposite problem: it might not merge close groups because of outlier members that are far apart. Group average represents a natural compromise, but depends on the scale of the similarities. Other linkage criteria include the Ward, McQuitty, Median, and Centroid methods, all of which are available in the stats R package. Hierarchical clustering is usually visualised using a dendrogram, as shown in figure 3.4. Each iteration of clustering (either joining or splitting, depending on the linkage) corresponds to a limb of the tree. One can get as many clusters as needed by selecting the right height cut on the tree: in the aforementioned figure the blue rectangles split the set into 3 clusters while the green rectangles create 5 clusters. There are a few issues with hierarchical clustering, however. Firstly, most hierarchical algorithms do not revisit their splitting (or joining) decisions once constructed [24]. Secondly, the algorithm imposes a hierarchical structure on the data, even for data for which such a structure is not appropriate. Furthermore, finding the right number of natural clusters may also be hard to achieve.

Figure 3.4: Hierarchical Clustering dendrogram for Ant 1.7.0 files.

Almeida and Barbosa propose in [3] an approach to automatically find the natural number of clusters present in the underlying organisation of data. They also remove outliers before applying clustering and re-integrate them into the established clusters afterwards, which significantly improves the algorithm’s robustness.

Variants of hierarchical clustering include Cure and Chameleon. These are relevant for arbitrary shapes of data, and perform well on large volumes [24]. The Birch algorithm, introduced by Zhang [199], is known to be very scalable, both for large volumes and for high-dimensional data.
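The sketch below shows a basic hierarchical clustering in R on synthetic data (hypothetical metric names): Euclidean distances, average linkage, and a cut of the dendrogram into three clusters with cutree().

```r
# Minimal sketch: agglomerative clustering with average linkage.
set.seed(42)
sloc <- rlnorm(150, meanlog = 4, sdlog = 1)
vg   <- 0.2 * sloc + rnorm(150, sd = 5)
d    <- dist(scale(cbind(sloc, vg)), method = "euclidean")
hc   <- hclust(d, method = "average")
plot(hc)                        # dendrogram, as in figure 3.4
clusters <- cutree(hc, k = 3)   # equivalent to choosing a cut height
table(clusters)
```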

3.3.3 dbscan clustering

Density-based clustering methods [115] try to find groups based on density of data points in a region. The key idea of such algorithms is that for each instance of a cluster the neighbourhood of a given radius (Eps) has to contain at least a minimum number of instances (MinPts).

One of the most well-known density-based algorithms is dbscan [59]; it starts with an arbitrary point in the data set and retrieves all instances reachable from it with respect to Eps and MinPts. The algorithm uses a spatial data structure (SR-tree [106]) to locate points within Eps distance from the core points of the clusters. Ester et al. propose an incremental version of the dbscan algorithm in [58].
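A minimal run is sketched below, assuming the fpc package (one of several R implementations of dbscan); the eps and MinPts values are arbitrary and would need tuning on real data.

```r
# Minimal sketch: density-based clustering; points labelled 0 are noise.
library(fpc)
set.seed(42)
x  <- scale(cbind(rlnorm(300, 4, 1), rlnorm(300, 2, 1)))
db <- dbscan(x, eps = 0.4, MinPts = 5)
table(db$cluster)                   # cluster 0 gathers the points considered as noise
plot(x, col = db$cluster + 1, pch = 19)
```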

3.4 Outliers detection

3.4.1 What is an outlier?

Simply put, outliers are artefacts that differ heavily from their neighbours. More elaborate definitions abound in the literature; in the following list the first two definitions are generic and the last three are more targeted at software engineering measurement data.

- For Hawkins [85], an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
- For Barnett and Lewis [18], an outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.
- Yoon et al. [197] define outliers as the software data which is inconsistent with the majority of the data.
- For Wohlin et al. [187], an outlier denotes a value that is atypical and unexpected in the data set.
- Lincke et al. [126] take a more statistical and practical perspective and define outliers as artefacts with a metric value that is within the highest/lowest 15% of the value range defined by all classes in the system.

The plot on the left of figure 3.5 shows 7 outliers highlighted in a set of three-dimensional points. These differ from their neighbours by standing far from the average points on one or more of their dimensions. The coloured plot on the right presents an interesting case: most of the highlighted outliers visually differ from the majority of points, except for two of them which are hidden in the mass of points. These are different from their neighbours on another dimension but seem to be normal on the two axes selected here.

Figure 3.5: Examples of outliers on a few data sets.

Hodge [92], Ben-Gal [23] and Chandola [35] provide extensive reviews of outliers detection techniques. Cateni et al. [34] review their usage from a broad industrial application perspective. Many studies apply outliers detection to security-related data, e.g. for insurance, credit card or telecommunication fraud detection [116, 145], insider trading detection [166] and network intrusion detection systems [121], by detecting deviating behaviours and unusual patterns. Outliers detection methods have also been applied to healthcare data [120, 188] to detect anomalous events in a patient’s health, to military surveillance for enemy activities, to image processing [166], and to industrial damage detection [35]. Mark C. Paulk [143] applies outliers detection and process control charts in the context of the Personal Software Process (PSP) to highlight individual performance. Another similar research area is the detection of outliers in large multi-dimensional data sets. There is a huge variety of metrics that can be gathered from a software project, from code to execution traces or a software project’s repositories, and large software systems commonly have thousands of files, mails exchanged in mailing lists, or commits. This has two consequences: first, in high-dimensional space the data is sparse and the notion of proximity fails to retain its meaningfulness [1]. Second, some algorithms are inefficient or even not applicable in high dimensions [1, 65]. It should also be noted that too many outliers may be a sign of errors in the dataset [92, 35], e.g. missing values treated as zeros, bugs in the retrieval process, or a wrong measurement procedure. If they are justified, outliers can bring very meaningful information by showing values in exceptional ranges. Take bugs as an example: a high number of defects for a file would deserve more investigation into its other characteristics to determine what criteria are correlated with this fault-proneness.

3.4.2 Boxplot

Laurikkala et al. [120] use informal box plots to pinpoint outliers in both univariate and multivariate data sets. Boxplots show a great deal of information: the lower and upper limits of the box are respectively the first and third quartiles1, while the line plotted in the box is the median2. Points more than 1.5 times the interquartile range3 above the third quartile and points more than 1.5 times the interquartile range below the first quartile are defined as outliers. Figure 3.6 shows four boxplots for a sample of metrics: cloc, eloc, sloc, lc, plotted with (right) and without (left) outliers. The points above the boxplots are the outliers, and the boxes span the first to the third quartile. As shown in the figure, in the context of software measures there is often more space above the box than below it, and outliers are mostly (if not all, as displayed in the figure) above the upper end of the whiskers. This is consistent with the long-tailed distribution function observed for such metrics, as presented in sections 3.7 and 4.2.2.

1 The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data. The first quartile is the value of the element standing at the first quarter, the third quartile is the value of the element standing at the third quarter.
2 The median is the numerical value separating the higher half of a data sample from its lower half.
3 The interquartile range (IQR) is the difference between the third and the first quartiles.

Figure 3.6: Boxplots of some common metrics for Ant 1.8.1, without and with outliers.
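In R, the 1.5 × IQR rule described above is implemented by boxplot.stats(); a minimal sketch on synthetic, long-tailed data follows.

```r
# Minimal sketch: extract the boxplot outliers of a single metric.
set.seed(42)
sloc <- round(rlnorm(500, meanlog = 4, sdlog = 1))
boxplot(sloc, main = "sloc")          # outliers appear above the whiskers
outliers <- boxplot.stats(sloc)$out   # values beyond 1.5 * IQR from the box
length(outliers)
```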

3.4.3 Local Outlier Factor

LOF (Local Outlier Factor) is an algorithm for identifying density-based local outliers [30]. The local density of a point is compared with that of its neighbours. If the former is significantly lower than the latter (with a LOF value greater than 1), the point is in a sparser region than its neighbours, which suggests it is an outlier. A drawback of the local outlier factor method is that it works only on numerical values. Cateni et al. [34] compare the local outlier factor method with fuzzy-logic clustering techniques, showing that the latter outperforms LOF but needs more computational time (roughly by a factor of 10). LOF is also used by Dokas et al. [52] in network intrusion detection to identify attacks in TCP connection data sets. One may use the lofactor function from the DMwR package [177] for such a task; the number of neighbours used for the LOF density can optionally be set. LOF outlier detection may be used on either univariate or multivariate data. As for visualisation, it should be noted that when the dissimilarities are computed on all metrics, the highlighted outliers may not visually stand out on the two metrics used for the plot.
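A minimal use of lofactor is sketched below, assuming the DMwR package cited above is installed; the synthetic data and the choice of k = 10 neighbours are arbitrary.

```r
# Minimal sketch: LOF scores for a two-metric data set.
library(DMwR)
set.seed(42)
x   <- data.frame(sloc = rlnorm(300, 4, 1), vg = rlnorm(300, 2, 1))
lof <- lofactor(x, k = 10)              # one LOF score per observation
head(order(lof, decreasing = TRUE), 5)  # indices of the five strongest outliers
```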

3.4.4 Clustering-based techniques

Clustering techniques are also often used for outliers detection. The rationale is that data points that either do not fit into any cluster or constitute very small clusters are indeed dissimilar to the remainder of the data [165]. Yoon et al. [197] propose an approach based on k-means clustering to detect outliers in software measurement data, only to remove them as they threaten the conclusions drawn from the measurement program. In the

latter case the detection process involves the intervention of software experts and thus misses full automation. Al-Zoubi [2] uses Partitioning Around Medoids (PAM) to first remove small clusters (considered as outliers) for robustness, and then re-applies clustering on the remaining data set. Other mining techniques have been applied with some success to software engineering concerns, such as recommendation systems to aid in software maintenance and evolution [33, 151] or testing [119].

3.5 Regression analysis

Regression analysis allows one to model, examine, and explore relationships between a dependent (response or outcome) variable and independent (predictor or explanatory) variables. Regression analysis can help understand the dynamics of variable evolution and explain the factors behind observed patterns. Regression models define the formula followed by the variables and are used to forecast outcomes. Regression analysis is used when both the response variable and the explanatory variable are continuous, and relies on the following assumptions [86]: 1) constant variance, 2) zero mean error, and 3) independence of the input parameters. The adequacy of the model is assessed using the coefficient of determination R² [133]. It measures the fraction of the total variation in y that is explained by variation in x. A value of R² = 1 means that all of the variation in the response variable is explained by variation in the explanatory variable. One of the drawbacks of R² is that it artificially increases when new explanatory variables are added to the model. To circumvent this, the adjusted R² takes into account the number of explanatory terms in the model relative to the number of data points.

Usage of regression analysis

Postnett et al. use logistic regression in [146] to confirm relationships between the metrics-based selection of defective components and the actual bugs found post-release. Sowe et al. [169] use regression analysis to establish correlations between size, bugs and mailing list activity data for a set of large open-source projects. They find that mailing list activity is significantly correlated (R² = 0.79) with the number of bugs, and that size is highly correlated (R² = 0.9) with bugs. Gyimothy et al. [78] apply machine learning techniques (decision tree and neural network) and regression analysis (logistic and linear) to fault data against object-oriented metrics for several versions of Mozilla. All four methods yield similar results, and classify metrics according to their ability to predict fault-prone modules. Regression analysis is widely used to build prediction models for effort estimation and for the evolution of software quality attributes (maintainability, faults, community). In [60], Fedotova et al. draw a comprehensive study of effort estimation models and use multiple linear regression to predict effort during a CMMi improvement program in a development organisation. The results outperformed expert-based judgement in the testing phase but

dramatically failed in the development phase. Okanović et al. [139] apply linear regression analysis to software profiling data to predict performance of large distributed systems.

Linear and polynomial regression

Linear regression tries to fit the data to a model of the form given in equation 3.1. Polynomial regression tries to fit the data to a second-order (equation 3.2), third-order (equation 3.3) or higher-order polynomial expression.

y = ax + b    (3.1)

y = ax^2 + bx + c    (3.2)

y = ax^3 + bx^2 + cx + d    (3.3)

A common practice in modern statistics is to use the maximum likelihood estimates of the parameters as providing the “best” estimates. That is to say, given the data, and having selected a linear model, we want to find the values for a and b that make the data most likely.

Figure 3.7: Linear regression examples: (a) sloc vs. vg; (b) sloc vs. scm_fixes.

An example of linear regression analysis is easily visible when the number of source lines of code (sloc) is plotted against the cyclomatic complexity (vg). The result shows a more or less tight, elongated “exploding cloud”, as shown in figure 3.7a. The linear regression analysis draws a red line which is closest to the overall set of points. The blue plot is the prediction of a second-order regression model and the green plot is the prediction of a third-order regression model. If the spread of values is too large, however (i.e. the variance is high), as shown in figure 3.7b, the prediction becomes inaccurate and is not suitable.
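The three models of equations 3.1 to 3.3 can be fitted in R with lm(); the sketch below uses synthetic sloc/vg data and reports the adjusted R² of the linear fit.

```r
# Minimal sketch: first-, second- and third-order fits of vg against sloc.
set.seed(42)
sloc <- rlnorm(300, meanlog = 4, sdlog = 1)
vg   <- 0.2 * sloc + rnorm(300, sd = 5)
fit1 <- lm(vg ~ sloc)                        # linear model, equation (3.1)
fit2 <- lm(vg ~ poly(sloc, 2, raw = TRUE))   # second-order polynomial (3.2)
fit3 <- lm(vg ~ poly(sloc, 3, raw = TRUE))   # third-order polynomial (3.3)
summary(fit1)$adj.r.squared                  # adjusted R-squared of the linear fit
```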

3.6 Time series

Time series are defined as vectors of numbers, typically regularly spaced in time. Yearly counts of animals, monthly means of temperature, daily prices of shares, and minute-by-minute details of blood pressure are examples of time series. Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend, seasonal variation, or lagged correlation between series) that should be accounted for. Time series analysis brings useful information on possible cycles, structure and relationships with internal or external factors. It also allows practitioners to build prediction models for forecasting events or measures.

3.6.1 Seasonal-trend decomposition

STL (Seasonal-trend decomposition) performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess. The trend component stands for the long-term trend, the seasonal component captures seasonal variation, the cyclical component captures repeated but non-periodic fluctuations, and the residuals are the irregular components, or outliers. Time series can be decomposed in R with the stl function from the stats package [147], which uses the seasonal-trend decomposition based on loess introduced by Cleveland et al. in [37].
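A minimal decomposition with stl is sketched below on a synthetic monthly series (the trend, seasonality and noise are made up).

```r
# Minimal sketch: seasonal-trend decomposition based on loess.
set.seed(42)
x <- ts(50 + 0.3 * (1:120) + 10 * sin(2 * pi * (1:120) / 12) + rnorm(120),
        frequency = 12, start = c(2004, 1))
fit <- stl(x, s.window = "periodic")
plot(fit)    # seasonal, trend and remainder panels
```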

3.6.2 Time series modeling

Time series models were first introduced by Box and Jenkins in 1976 [29] and are intensively used to predict software characteristics like reliability [81, 76] and maintainability [150]. As an example, some common flavours of time series models are presented below:

Moving average (MA) models where

X_t = \sum_{j=0}^{q} \beta_j \epsilon_{t-j}

Autoregressive (AR) models where

X_t = \sum_{i=1}^{p} \alpha_i X_{t-i} + \epsilon_t

Autoregressive moving average (ARMA) models where

X_t = \sum_{i=1}^{p} \alpha_i X_{t-i} + \epsilon_t + \sum_{j=0}^{q} \beta_j \epsilon_{t-j}

A moving average of order q averages the random variation over the last q time periods. An autoregressive model of order p computes X_t as a function of the last p values of X.

So for a second-order process, we would use

X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \epsilon_t

Autocorrelation

The autocorrelation function is a useful tool for revealing the inter-relationships within a time series. It is a collection of correlations, p_k for k = 1, 2, 3, ..., where p_k is the correlation between all pairs of data points that are exactly k steps apart. The presence of autocorrelations is one indication that an autoregressive integrated moving average (ARIMA) model could fit the time series. Conversely, if no autocorrelation is detected in the time series it is useless to try to build a prediction model. Amasaki et al. [4] propose to use time series to assess the maturity of components in the testing phase as a possible replacement for current reliability growth models. By analysing fault-related data they identify trend patterns in metrics which are demonstrated to be correlated with fault-prone modules. In [89], Herraiz et al. use both linear regression models and ARIMA time series modeling to predict the size evolution of large open-source projects. They find that the mean squared relative error of the regression models varies from 7% to 17% whereas the ARIMA models’ R² ranges from 1.5% to 4%. Herraiz et al. [90] use ARIMA models to predict the number of changes in Eclipse components. Another study from the same authors [91] applies ARIMA modeling to three large software projects, using the Box-Jenkins method to establish the model parameters. They provide safe guidance on the application of such methods, listing threats to validity and giving sound advice on the practical approach that they use.
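In R, the acf() and arima() functions from the stats package cover the steps described above; the sketch below uses a synthetic trending series and an arbitrary ARIMA(1,1,1) order.

```r
# Minimal sketch: check autocorrelation, fit an ARIMA model, forecast ahead.
set.seed(42)
x   <- ts(100 + cumsum(rnorm(200)))   # synthetic, non-stationary series
acf(x)                                # autocorrelation function plot
fit <- arima(x, order = c(1, 1, 1))   # ARIMA(p = 1, d = 1, q = 1)
predict(fit, n.ahead = 12)            # point forecasts and standard errors
```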

3.6.3 Time series clustering

T. Liao presents a comprehensive survey of time series clustering in [125]. He identifies three major approaches, illustrated in figure 3.8: raw-data-based approaches, which apply generic clustering techniques to the raw data; feature-based approaches, which rely on the extraction of specific features; and model-based approaches, which assume there is some kind of model or a mixture of underlying probability distributions in the data. Liao also lists the similarity/dissimilarity distances used, the criteria to evaluate the performance of the analysis, and the fields of application for each type of analysis.

3.6.4 Outliers detection in time series

Gupta et al. [76] provide a comprehensive and very interesting tour of outliers detection techniques for time series data. They propose a taxonomy (depicted in figure 3.9) to classify them, then review the different approaches: supervised, unsupervised, parametric, model-based, pattern-based, and sliding-window. Streaming series are also considered, which is useful for current-trend analysis of real-time data. Limitations, advantages and the pragmatic application of the methods are also discussed.

Figure 3.8: Major approaches for time series clustering [125].

Figure 3.9: Major approaches for outliers detection in time series [76].

Many classical techniques for outliers detection in time series assume linear correlations in the data, and have encountered various limitations in real applications. More robust alternatives have recently arisen through non-linear analysis, as reviewed by Gounder et al. in [74].

3.7 Distribution of measures

A distribution function gives the frequency of values in a set. Distribution functions may be discrete, when only a restricted set of values is allowed, or continuous, e.g. when values are real. Some common distribution functions are the normal and exponential distributions, depicted respectively on the left and right of figure 3.10. Distribution functions are characterised by an equation and a few parameters; as an example the normal distribution is defined by equation 3.7 and the parameters mean and standard deviation.

P(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

Distribution functions essentially show the typical repartition of values for an attribute of data.

Figure 3.10: Normal and exponential distribution function examples.

They are usually illustrated with a histogram (see figure 4.2 on page 73 for an example), plotting the values against their occurrence frequency. The impact of the distribution of software measures is twofold:

- It allows practitioners to safely build estimation models (guess at a parameter value for a known configuration) and prediction models (i.e. guess at some value that is not part of the data set).
- For exploratory analysis it gives typical values for the measures and thus helps to identify outliers and erroneous data (e.g. missing values or retrieval issues).

Many statistical algorithms rely on the fact that the underlying distribution of data is normal. Although this assumption is correct in a vast majority of domains, one may want to know how close (or how far) reality is from this common basic hypothesis, as in the case of software engineering.

It is not easy, though, to establish the formal distribution of a set of measures. Techniques like qq-plots and the Shapiro-Wilk test allow one to check data against a specific distribution, but the variety of distribution shapes available nowadays makes it difficult to try them all. Classic distributions are not always well suited to all metrics. In [198], Zhang fits software fault data to Weibull and Pareto distributions and uses regression analysis to derive the parameters of the different distribution functions. The coefficient of determination is then used to assess the goodness of fit.

3.7.1 The Pareto distribution

Fenton and Ohlsson published in 2000 [64] a study stating that the number of faults in large software systems follows the Pareto principle [101], also known as the “80-20 rule”: 20 percent of the code produces 80 percent of the faults. Andersson and Runeson replicated the study on different data sets in 2007 [5], globally confirming this assumption. The Pareto distribution can be defined as [40]:

P(x) = 1 - \left( \frac{\gamma}{x} \right)^{\beta} \quad \text{with } \gamma > 0, \beta > 0

In recent studies the Pareto principle has been further reinforced by Desharnais et al. in [48] for the repartition of ISO/IEC 9126 quality attributes, and by Goeminne for open source software activity in [72].

3.7.2 The Weibull distribution

However Zhang [198], further elaborating on Fenton and Ohlsson’s study of fault distribution, proposed in 2008 a better match with the Weibull distribution, as shown in figure 3.11.

Figure 3.11: Distribution of prerelease faults in Eclipse 2.1 [198].

The Weibull distribution can be formally defined as:

P(x) = 1 - \exp\left(-\left(\frac{x}{\gamma}\right)^{\beta}\right) \qquad \text{with } \gamma > 0,\ \beta > 0

Simmons also proposed the Weibull distribution for defect arrival modeling in [163] and [164]. This is especially useful for answering long-running questions such as when to stop testing.
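
As a hedged illustration of such a fit, the following R sketch estimates the parameters of a Weibull distribution on hypothetical per-module fault counts; note that it uses maximum likelihood (MASS::fitdistr) rather than the regression approach of [198].

# Minimal sketch: fit a Weibull distribution to hypothetical fault counts.
library(MASS)

faults <- c(1, 2, 2, 3, 5, 8, 13, 1, 1, 4, 6, 21, 2, 9)   # hypothetical data
fit <- fitdistr(faults, densfun = "weibull")
fit                                # estimated shape and scale, with standard errors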

3.8 Summary

In this chapter we reviewed the main data mining techniques that will be useful for the purpose of our research. We explored principal component analysis for the upcoming dimensionality reduction paper, clustering and outliers detection analysis for the Squore Labs, and time series analysis for software evolution. The papers listed herein will be used often throughout the forthcoming work, either as references or for further investigation. On the one hand, data mining techniques have recently greatly improved, especially regarding their robustness and ability to handle non-normal and non-linear cases. The

domains driving this development are mainly economics, business analysis, and bioinformatics. On the other hand, these techniques have scarcely been applied to software engineering data, because even if statisticians may be occasional developers or computer fanatics, they do not necessarily know much about software engineering. Software engineers do not necessarily have a statistical background either. Nevertheless, in recent years the application of data mining techniques to software engineering data has gained wide interest, both in academic research and in industrial applications. The MSR (Mining Software Repositories) conference, started in 2004, fertilises the field every year with many new articles and ideas [86]. Tao Xie has been working on this specific domain for years, building a comprehensive bibliography [189] and publishing numerous important studies. The trends and advances observed in recent years in this area of knowledge should be carefully watched. The very specificities of software measurement data (e.g. non-normal distributions of metrics, changes in variance across the measurement range, non-cyclic evolution of variables) call for specific treatments: algorithms with improved robustness, and models and methods for non-linear and non-normal data.

Part II

The Maisqual project

A learning experience is one of those things that says, “You know that thing you just did? Don’t do that.”

Douglas Adams


Overview

During the second phase of the project, we try out and organise our toolshelf, and prepare the foundations for the upcoming Squore Labs. Literate analysis documents developed during this phase and presented in chapter 4 provide a sandbox for investigations and trials. These are meant to be instantiated and optimised into dedicated documents for research on specific areas and concerns. We elaborate in chapter 5 a method for data retrieval and analysis to ensure the consistency and meaning of results. It is implemented and fully automated to provide a sound and practical basis for further research. We generate in chapter 6 a set of metrics related to various projects written in C and Java, thoroughly describing the metrics and rules extracted from the projects. Time ranges span from a few random extracts to extensive weekly historical records. The measures are retrieved from unusual repositories, allowing us to extract information targeted at new areas of quality like community (e.g. activity, support) and process (e.g. planning, risk, change requests management). The output of this part of the project consists of the publication of the data sets and the description of the methodological and tooling framework designed for the Squore Labs detailed in part III.

Chapter 4

Foundations

In addition to building the state of the art for both data mining and software engineering, we also wished to implement some of the methods we found in the papers we read. Also, from the project organisation perspective, we wanted to run reproducible experiments to better appreciate and understand both the problem and the available ways to solve it. This application of academic and industrial knowledge and experience to real-life data and concerns is the genesis of Maisqual. The present chapter describes the early steps of Maisqual and the process we followed to run trials and investigate various paths. It is for the most part written in chronological order to reflect the progression of work and development of ideas. The output of this time-consuming phase is a clear roadmap and a well-structured architecture for the next programmed steps. Section 4.1 describes how we fashioned the data mining process and managed to retrieve the first round of data and run the first task. We quickly used literate data analysis to run semantic-safe investigations. Two main documents were written: one a study of a single version of a project, described in section 4.2; and another a study of a series of consecutive snapshots of a project, described in section 4.3. The experience gathered during these trials provided us with lessons expounded in section 4.4, which allowed us to formalise different axes, summarised in a roadmap in section 4.5.

4.1 Understanding the problem

4.1.1 Where to begin?

While investigating the state of the art we wanted to apply some of the described techniques on real data to experiment with the methods described. To achieve this we decided at first to gather manually various metrics from different sources (like configuration management tools, bug trackers) for a few open-source projects and started playing with R. We wrote shell and Perl scripts to consistently retrieve the same set of information and faced the first retrieval problems. Projects use different tools, different workflows, bugs search filters were not consistent with the code architecture, or past mailboxes

were not available. Scripts became more and more complex as local adaptations were added. We applied R scripts on the extracted data sets to print results on output and generate pictures, and integrated them in the toolchain. Since the data retrieval steps were error-prone, we wrote Perl modules for the most common operations (retrieval, data preparation and analysis) and all the configuration information needed to individually treat all projects. R chunks of code were embedded in Perl scripts using the Statistics::R CPAN module. However we quickly ended up with too many scripts, and had trouble understanding our own outputs afterwards. Comments in R scripts are great for maintainability, but fail to provide a sound semantic context to the results. We needed a tool to preserve the complex semantics of the computations and their results. These Perl scripts and modules have really been a workbench for our tests: some of the R code has been moved to a dedicated package (maisqual) or to Knitr documents, while other parts have been integrated as Data Providers for Squore.

4.1.2 About literate data analysis

At this point, we found the literate analysis principle very convenient to generate reports with contextual information on how graphics are produced, what mining method is being applied with its parameters' values, and to some extent how it behaves through automatic validation of results1. Literate data analysis was introduced by Leisch [69] as an application of Donald Knuth's principles on literate programming [112]. The primary intent was to allow reproducible research, but its benefits reach far beyond later verification of results: practitioners get an immediate understanding from the textual and graphical reports, and documents and conclusions are easier to share and to communicate. Leisch wrote Sweave to dynamically insert results of R code chunks into LaTeX documents. These results can either be textual (i.e. a single piece of information or a full table of values) or graphical through R's extended graphical capabilities. Xie further enhanced this approach with Knitr [191], an R package that integrates many Sweave extensions and adds other useful features. Among them, Knitr allows caching chunks of code: when the code has not changed the chunk is not executed and the stored results are retrieved, which gives remarkable optimisations on some heavy computations. Conditional execution of chunks also allows proposing textual advice depending on computational results, like Ohloh's factoids2. This is a very important point since it provides readers with contextually sound advice.

1As an example, p-values returned by regression analysis algorithms can be interpreted quite easily.
2www.ohloh.net lists simple facts from the project's statistics. Examples include “mostly written in Java”, “mature, well-established codebase”, “increasing year-over-year development activity”, or “extremely well-commented source code”.
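
To give an idea of what such documents look like, here is a minimal, hypothetical Knitr chunk in Rnw (Sweave) syntax; the chunk label and the run.summary flag are made up for the example, but the cache and eval options are standard Knitr features.

<<summary-table, cache=TRUE, eval=run.summary>>=
# The chunk is only executed when run.summary is TRUE, and its results are
# cached: if the code does not change, the stored results are simply reused.
summary(metrics$sloc)
@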

4.2 Version analysis

We wrote a first Sweave document, which served as a workshop to explore techniques and as a laboratory for the papers we read. Since our primary background is software engineering, we had to experiment with the data mining algorithms to understand how they work and how they can be applied to software engineering concerns. The analysis was performed both at the file and function levels, with minor changes between them – some metrics like configuration management metadata not being available at the function level. At one point we migrated this document to Knitr [191] because of the useful features it brings (namely caching and conditional execution). We heavily relied on The R Book [41] from M. Crawley, and the R Cookbook [175] from P. Teetor to efficiently implement the different algorithms.

4.2.1 Simple summary

The simple summary shows a table with some common statistical information on the metrics, including ranges of variables, mean, variance, modes, and number of NAs (Not Available). An example is shown below for JMeter3.

Metric          Min  Max   Mean  Var      Modes  NAs
cft             0    282   25.5  1274.3   128    0
comr            5.6  95.9  47.5  348.2    778    0
ncc             0    532   59.6  4905.8   200    0
sloc            3    695   89.2  10233.4  253    0
scm_commits     0    202   13.5  321.8    43     0
scm_committers  0    8     2.3   1.9      9      0

These tables show the usual ranges of values and thus allow detecting faulty measures, and provide quick insights on the dispersion of values (variance). The number of modes and NAs is useful to detect missing values or errors in the retrieval process. Scatterplots of metrics are also plotted, as shown in figure 4.1 for the JMeter extract of 2007-01-01. One can immediately identify some common shapes (e.g. all plots involving comr have values spread on the full range with a specific silhouette) and relationships (e.g. for sloc vs. eloc or sloc vs. lc). The document preamble also filters metrics that may harm the applied algorithms: columns with NAs, or metrics with a low number of modes (e.g. binary values) are displayed and discarded.
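
Such a summary table can be computed with a few lines of R, sketched below under the assumption that the metrics of a version are available in a data frame named metrics with one numeric column per metric.

# Minimal sketch: min, max, mean, variance, number of modes and NAs per metric.
summarise_metric <- function(x) {
  c(Min   = min(x, na.rm = TRUE),
    Max   = max(x, na.rm = TRUE),
    Mean  = round(mean(x, na.rm = TRUE), 1),
    Var   = round(var(x, na.rm = TRUE), 1),
    Modes = length(unique(na.omit(x))),   # number of distinct values
    NAs   = sum(is.na(x)))
}
t(sapply(metrics, summarise_metric))      # one row per metric, as in the table above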

3For a complete description of metrics used in this document, check section 6.1 on page 106.

Figure 4.1: Scatterplots of common metrics for JMeter 2007-01-01.

4.2.2 Distribution of variables

Since many algorithms make the assumption of a normal distribution of values, we wanted to check data sets for normality. For this we plotted the histogram of some measures, as pictured in figure 4.2. We realised that for most of the metrics considered, the distribution was actually not Gaussian, with a few metrics as possible exceptions (e.g. comr). This later on led to questions about the shapes of software4.

4.2.3 Outliers detection

Outliers detection is also tested through boxplots [120] and Local Outlier Factor [30]. Boxplot is applied first to every metric individually: for each metric, we count the number of outliers and compute the percentage of the whole set, and list the top ten outliers on a few important metrics (e.g. scm_fixes or vg). Multivariate outliers detection is computed on a combination of two metrics (sloc vs. vg and sloc vs. scm_fixes) and on a combination of three metrics (sloc vs. vg vs. scm_fixes). Examples of scatterplots showing the outliers on these metrics are provided in figures 4.2a and 4.2b. These plots illustrate how outliers on one metric are hidden in the mass of points when plotted on

4A research article is in progress about shapes of software – see conclusion on page 176 for more information.

Figure 4.2: Examples of distribution of metrics for Ant 1.7.

another metric. We also tried to plot the overall union and intersection of outliers on more than 3 metrics, but it consistently gave unusable results, with respectively all and no points selected.

(a) sloc and vg (b) sloc, vg and scm_fixes

Figure 4.2: Combinations of boxplots outliers for Ant 1.7.

These have been the early steps of the Outliers detection Squore Labs – cf. chapter 8 on page 137.
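
A hedged sketch of these early outliers trials in R is shown below; metrics is again assumed to be the data frame of file measures, and the Local Outlier Factor computation relies on the lofactor function of the DMwR package (one of several possible implementations).

# Minimal sketch: univariate outliers with the boxplot rule, multivariate
# outliers with the Local Outlier Factor.
out.sloc <- boxplot.stats(metrics$sloc)$out      # values flagged by the boxplot rule
length(out.sloc) / nrow(metrics) * 100           # percentage of outliers for this metric

library(DMwR)
lof <- lofactor(scale(metrics[, c("sloc", "vg", "scm_fixes")]), k = 10)
head(order(lof, decreasing = TRUE), 10)          # indices of the ten strongest outliers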

4.2.4 Regression analysis

We applied linear regression analysis and correlation matrices to uncover the linear relationships among measures. We used a generalised approach by applying a pairwise linear regression to all metrics and getting the adjusted R² for every linear regression. To help identify correlations, only cases that have an adjusted R² value higher than

Figure 4.3: Linear regression analysis for Ant 2007-01-01 file metrics.

0.2 are displayed, and higher values (with R² > 0.9) were typeset with a bold font. An example of result for Ant 2007-01-01 is shown in figure 4.3. In the upper table one can find the afore-mentioned correlations between sloc and eloc (0.997) and between sloc and lc (0.947). The hierarchical clustering dendrograms at the bottom of the figure classify similar metrics (e.g. eloc, lc and sloc) together on the same limb of the tree. We also computed hierarchical clusters on results to help identify metrics that have high correlation rates. Several aggregation methods were used and displayed, both for our own comprehension of the aggregation methods and to highlight metrics that are always close. The correlation matrix, also computed, gave similar results.
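
The pairwise regression described above can be sketched as follows in R, assuming once more a numeric data frame metrics.

# Minimal sketch: pairwise linear regressions, keeping the adjusted R-squared.
m  <- names(metrics)
r2 <- matrix(NA, length(m), length(m), dimnames = list(m, m))
for (i in m) {
  for (j in m) {
    if (i != j) {
      r2[i, j] <- summary(lm(metrics[[i]] ~ metrics[[j]]))$adj.r.squared
    }
  }
}
r2[!is.na(r2) & r2 < 0.2] <- NA                  # hide pairs below the 0.2 threshold
round(r2, 2)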

4.2.5 Principal Component Analysis

Principal component analysis is used to find relevant metrics on a data set. Many software measures are highly correlated, as is the case, for example, for all line-counting metrics. Using a principal component analysis allows finding the most orthogonal metrics in terms of information. The variance gives the principal components' representativity. For our trials, we computed PCA on metrics using R princomp from MASS [180], and kept only principal components with a variance greater than 20% of the maximum variance (i.e. the variance of the first component). The restricted set of components is shown in the following table, with values lower than 0.2 hidden to help identify the components' repartition. One can notice that in this example the first component is predominantly composed of size-related metrics with similar loadings (around 0.3).

Metric          Loadings on the first four components (values below 0.2 hidden)
blan            -0.29
cft             -0.31
clas             0.21   0.95
cloc            -0.28  -0.30
comr            -0.38  -0.86
eloc            -0.31
func            -0.29
lc              -0.31
scm_commits     -0.24  -0.47
scm_committers  -0.54   0.27
scm_fixes       -0.24  -0.43
sloc            -0.31
stat            -0.30
vg              -0.31

However this gave inconsistent and unpredictable results when applied over multiple projects or multiple releases: metrics composing the first components are not always the same, and loadings are not representative and constant enough either. We decided to leave this to be continued in a later work on metrics selection5.
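
A minimal sketch of this selection, under the same metrics data frame assumption, is given below.

# Minimal sketch: PCA on scaled metrics, keeping components whose variance
# exceeds 20% of the first component's variance.
pca   <- princomp(scale(metrics))
vars  <- pca$sdev^2
keep  <- vars > 0.2 * vars[1]
loads <- unclass(pca$loadings)[, keep]
loads[abs(loads) < 0.2] <- NA                    # hide small loadings, as in the table above
round(loads, 2)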

4.2.6 Clustering

Various unsupervised classification methods were applied to the data: hierarchical and K-means clustering from the stats package, and dbscan from the fpc package [87]. Hierarchical clustering was applied both on metrics (to identify similar variables) and artefacts (to identify categories of files or functions). Extensive tests were run on this, with various aggregation methods (Ward, Average, Single, Complete, McQuitty, Median, Centroid) and numbers of clusters (3, 5, 7, 10). The repartition into clusters was drawn in colourful plots. In the dendrogram depicted in figure 4.4 the blue and green boxes define respectively 3 and 5 clusters in the forest of artefacts. Plots underneath illustrate the repartition of artefacts in 5 coloured clusters. Two types of tables were output for hierarchical clustering: the number of items in each cluster, as presented in the next section for k-means clustering, and the repartition of files or functions in a static number of clusters with the various aggregation methods. The latter produces tables as shown below for five clusters:

5A research article is in progress about metrics selection – see conclusion on page 176 for more information.

Method used  Cl1   Cl2  Cl3  Cl4  Cl5  Total
Ward         365   11   202  403  132  1113
Average      1005  7    97   1    3    1113
Single       1109  1    1    1    1    1113
Complete     1029  7    73   1    3    1113
McQuitty     915   7    187  1    3    1113
Median       308   7    794  1    3    1113
Centroid     1000  7    102  1    3    1113

Figure 4.4: Hierarchical clustering of file metrics for Ant 1.7 – 5 clusters.
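
The hierarchical clustering runs can be sketched in R as follows (same metrics assumption); each column of the result gives the size of the five clusters for one aggregation method.

# Minimal sketch: hierarchical clustering with several aggregation methods.
d <- dist(scale(metrics))                        # Euclidean distance on scaled metrics
methods <- c("ward.D", "average", "single", "complete",
             "mcquitty", "median", "centroid")
sapply(methods, function(m) table(cutree(hclust(d, method = m), k = 5)))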

K-means clustering was applied on the whole set of metrics, with a number of clusters of 3, 5 and 7. The repartition into clusters was drawn in colourful plots as shown in figure 4.5. The number of artefacts in each cluster was output in tables like the following:

Number of clusters  Cl1  Cl2  Cl3  Cl4  Cl5  Cl6  Cl7
3                   37   914  162
5                   58   317  114  613  11
7                   58   317  155  319  88   59   11

Figure 4.5: K-means clustering of file metrics for Ant 1.7 – 5 clusters.
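
The corresponding k-means runs can be sketched as follows.

# Minimal sketch: k-means clustering for 3, 5 and 7 clusters.
set.seed(42)                                     # k-means depends on its random start
for (k in c(3, 5, 7)) {
  km <- kmeans(scale(metrics), centers = k, nstart = 25)
  print(km$size)                                 # number of artefacts in each cluster
}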

4.2.7 Survival analysis

Survival analysis is often used to analyse mortality data, e.g. for time or risk of death of a class of individuals. In our context, survival analysis can be applied to the analysis of failures of components; it involves the number of failures, the timing of failures, or the risks of failure to which a class of components are exposed. Typically, each component is followed until it encounters a bug, then the conditions and time of failure are recorded (this will be the response variable). Components that live until the end of the experiment (i.e. that had no registered bug until the time of analysis) are said to be censored. This is especially useful to study the Mean Time To Failure (mttf) or Mean Time Between Failure (mtbf) of a component. We tried to define a set of metrics pertinent to the time of failure of a component.

The number of commits before the first bug is detected in commit messages.
The age in days of the file before the first bug is detected in commit messages.
The average number of commits between two bug-related commit messages.
The average number of days between two bug-related commit messages.

However these metrics were difficult to compute reliably because of the lack of consistency in naming conventions (i.e. how to detect that a commit is related to a bug?), and we could not get consistent, predictable results.

To our knowledge, there is very little research on the application of survival analysis techniques to software engineering data, although there exists some early work for power-law shaped failure-time distributions. Reed proposes in [149] a hazard-rate function adapted to power-law adjusted Weibull, gamma, log-gamma, generalised gamma, lognormal and Pareto distributions.

Survival analysis, as applied to failure-time of software components, provides interesting tools and new insights for software engineering. Although many methods assume a normal distribution, more robust algorithms have lately arisen that could be useful for software data, such as [149]. One may get better results by establishing reliable links between configuration management and bug tracking systems and mapping the circumstances of death (e.g. component impacted, type of bug, time of rework, or severity) to applications and file characteristics. Examples of practical usage for these methods include: enhancing or replacing current reliability-growth models, which are used to estimate the number of residual bugs after testing, and analysing the impact of practices on the reliability and failure-time of components.
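
As a hedged illustration, a Kaplan-Meier estimate with the survival R package could be sketched as below; surv.data is a hypothetical data frame with, for each file, the number of days before its first bug-related commit (days) and a flag telling whether a bug was actually observed (bugged, 0 meaning the file is censored).

# Minimal sketch: Kaplan-Meier survival curve for the time to first bug.
library(survival)

fit <- survfit(Surv(days, bugged) ~ 1, data = surv.data)
plot(fit, xlab = "Days", ylab = "Probability of having no registered bug yet")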

4.2.8 Specific concerns

The intention of this section was to formulate specific questions, but it was quickly put aside since we switched to dedicated documents to address specific concerns (e.g. for the Squore Labs). Only one target was actually defined: regression analysis on the scm_fixes metric. We applied first, second and third-order polynomial regression to explain this variable with other meaningful metrics such as sloc, vg, scm_commits or scm_committers. This was intended as a verification of some urban legends that one may often hear. Figure 4.5 plots the number of fixes identified on files (scm_fixes) against their cyclomatic number (vg) and the number of different committers that worked on the file (scm_committers). The red line is a linear regression (y = ax + b), the blue plot is the prediction of a second order regression model (y = ax² + bx + c), and the green plot is the prediction of a third order regression model (y = ax³ + bx² + cx + d). In the left plot curves suggest an order 2 polynomial correlation in this case (see chapter 9 for further investigations). Some significant numbers were also computed and displayed in full sentences: the average number of fixes per 1000 Source Lines Of Code, and the average number of fixes per cyclomatic complexity unit. However, we have not been very confident in these measures for the same reasons that Kaner exposed for the mttf in [105].
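
The polynomial regressions themselves are straightforward in R; the sketch below assumes the metrics data frame has no missing values for the two columns used.

# Minimal sketch: order 1, 2 and 3 polynomial regressions of scm_fixes on vg.
fit1 <- lm(scm_fixes ~ vg, data = metrics)
fit2 <- lm(scm_fixes ~ poly(vg, 2), data = metrics)
fit3 <- lm(scm_fixes ~ poly(vg, 3), data = metrics)

plot(metrics$vg, metrics$scm_fixes, xlab = "vg", ylab = "scm_fixes")
ord <- order(metrics$vg)
lines(metrics$vg[ord], fitted(fit1)[ord], col = "red")
lines(metrics$vg[ord], fitted(fit2)[ord], col = "blue")
lines(metrics$vg[ord], fitted(fit3)[ord], col = "green")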

(a) scm_fixes vs. scm_committers (b) scm_fixes vs. vg

Figure 4.5: First, second and third order regression analysis for Ant 1.7.

4.3 Evolution analysis

The second Sweave document we wrote targeted the evolution of metrics across the successive versions of a software project. Very useful information can be extracted from temporal data, regarding the trend and evolution of a project. For business intelligence or project management the latter is of primary importance. The first version of the document included all application-, file- and function-level analyses. This proved to be a bad idea for a number of reasons:

It was very long to run and error-prone. A single run lasted for 20 to 30 minutes, depending on the number of versions analysed. Furthermore, it was difficult to have all algorithms run correctly on such a huge amount of data: when an error pops up on a chunk of code, e.g. on a function-level analysis, then the whole document has to be re-executed. When it runs fine (after many trials) on a project, then chances are good that it will fail again on another set of data.

Considering all files and functions across successive versions introduces a bias which may considerably threaten the results. If a file is present in more than one version, then we are analysing the same artefact twice or more – even if it has differing measures. For algorithms like outliers detection or clustering, this is a blocker: if a file lies outside of typical values on a single version, it has all its siblings from other versions next to it when considering multiple versions, and is not an outlier anymore.

We introduced some formulae to compute application-level summaries of each version, but it took far too much time to be usable.

So we quickly split this huge document into three smaller articles for the different levels of analysis (application, files, functions), and soon after only the application-level document was updated and used. Many of the techniques applied here were borrowed from the version analysis Sweave document. Besides these, the following new algorithms were defined in the application-level evolution analysis documents.

4.3.1 Evolution of metrics

One of the first steps to get to know a variable is to plot its evolution on the full time range. Data is converted to time series with the xts R package [155], and then plotted and treated with the package’s facilities. Examples of metrics evolution for the Ant data set6 are displayed in figure 4.6.

Figure 4.6: Metrics evolution for Ant: sloc and scm_committers.
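
A minimal sketch of this conversion is shown below, assuming a data frame evol with one row per weekly extract, a date column and one column per application-level metric.

# Minimal sketch: build an xts time series for sloc and plot its evolution.
library(xts)

sloc.ts <- xts(evol$sloc, order.by = as.Date(evol$date))
plot(sloc.ts, main = "sloc evolution")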

4.3.2 Autocorrelation

Autocorrelation in time series analysis describes how this week's values (since we are working on weekly extracts) are correlated to last week's values – this is the autocorrelation at lag 1. Partial autocorrelation describes the relationship between this week's values and the values at lag t once we have controlled for the correlations between all of the successive weeks between this week and week t. Autocorrelation can be applied to a single variable, as shown in figure 4.7 for the vg, sloc and scm_fixes for Ant 1.7. The trends show that values at time t highly depend on values at time t − 1, mildly depend on values at time t − 2, etc. In other words, the impact of past values on recent values decreases with time.

6See section 6.3.1 for a description of this data set.

Figure 4.7: Univariate time-series autocorrelation on vg, sloc and scm_fixes for Ant.

Figure 4.8: Multivariate time-series autocorrelation on vg, sloc for Ant.

The autocorrelation can also be computed on different metrics, to check if the evolution of some metrics depends on the values of others. The scatterplot displayed in figure 4.8 shows how the evolutions of the sloc and vg metrics are inter-dependent.
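
A minimal sketch of these computations with the standard R functions is given below, reusing the series built in the previous section (vg.ts is assumed to be created the same way as sloc.ts).

# Minimal sketch: autocorrelation, partial autocorrelation and cross-correlation.
acf(as.numeric(sloc.ts))                         # correlation of the series with its own lags
pacf(as.numeric(sloc.ts))                        # partial autocorrelation
ccf(as.numeric(sloc.ts), as.numeric(vg.ts))      # cross-correlation between sloc and vg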

4.3.3 Moving average & loess

A simple way of seeing patterns in time series data is to plot the moving average. In our study we use the three-points and five-points moving average:

y'_i = \frac{y_{i-1} + y_i + y_{i+1}}{3}

y''_i = \frac{y_{i-2} + y_{i-1} + y_i + y_{i+1} + y_{i+2}}{5}

It should be noted that a moving average can never capture the maxima or minima of a series. Furthermore, it is highly dependent on the variable variations and is usually really close to the actual data.

Figure 4.9: Moving average and loess on sloc, lc, scm_committers and scm_fixes for Ant 1.7.

loess is an iterative smoothing algorithm, taking as parameter the proportion of points in the plot which influence the smooth. It is defined by a complex algorithm (Ratfor by W. S. Cleveland); normally a local linear polynomial fit is used, but under some circumstances a local constant fit can be used. In figure 4.9 the green plots represent the 3-point moving average (light green, chartreuse3) and the 5-point moving average (dark green, chartreuse4). The loess curve is plotted in purple, with smoother spans of 0.1, 0.3, 0.5 and 2/3 (the default setting), from light to dark purple.
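
Both smoothers are readily available in base R, as the following sketch suggests (reusing the sloc.ts series defined earlier).

# Minimal sketch: 3-point and 5-point moving averages, plus a loess smooth.
x   <- as.numeric(sloc.ts)
ma3 <- filter(x, rep(1/3, 3), sides = 2)         # centred 3-point moving average
ma5 <- filter(x, rep(1/5, 5), sides = 2)         # centred 5-point moving average

t  <- seq_along(x)
lo <- loess(x ~ t, span = 0.3)                   # span = proportion of points influencing the smooth

plot(t, x, type = "l")
lines(t, ma3, col = "chartreuse3")
lines(t, ma5, col = "chartreuse4")
lines(t, fitted(lo), col = "purple")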

4.3.4 Time series decomposition

Time series were first decomposed with robust regression using the R function stl from the stats package. STL (Seasonal-Trend decomposition using Loess) performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess. The trend component stands for the long term trend, the seasonal component is the seasonal variation, the cyclical component corresponds to repeated but non-periodic fluctuations, and the residuals form the irregular component, or outliers. We quickly abandoned STL decomposition, for the data we had at that time was not periodic and the algorithm constantly failed to find seasonal and cyclical terms. For projects that use an iterative development process however, one should be able to decompose many measures over the iteration cycles.
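
For reference, a minimal STL sketch is given below; it only makes sense on a periodic series, here hypothetically declared as weekly data with a yearly cycle, and it requires at least two full periods of history.

# Minimal sketch: STL decomposition of a weekly series with assumed yearly seasonality.
sloc.weekly <- ts(as.numeric(sloc.ts), frequency = 52)
plot(stl(sloc.weekly, s.window = "periodic"))    # seasonal, trend and remainder components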

One of the issues we had with time series was that analysed projects did not show cycles that would allow the use of STL decomposition, and thus the prediction of future behaviour and the proposal of typical templates for metrics evolution. To further investigate this, we need to apply our Knitr documents to agile projects (as can be found e.g. on GitHub) and study the evolution of metrics across multiple iterations.

4.3.5 Time series forecasting

This section applied regression models on the temporal evolution of metrics. We give an example of a run on the Ant evolution data set hereafter, with some explanations on the optimisation methods we use.

Table 4.1: ARMA components optimisation for Ant evolution.

Order  ar (ma=0)  aic     ma (ar=0)  aic      ma (ar≠0)  aic     Diff  aic
1      1.00       168.61  1.00       4062.91  1.00       170.73  1.00  89.42
2      2.00       160.54  2.00       3283.74  2.00       173.91  2.00  71.91
3      3.00       158.16  3.00       2722.46  3.00       188.52  3.00  126.06
4      4.00       157.53  4.00       2225.62  4.00       160.53
5      5.00       158.21  5.00       1917.32  5.00       158.97
6      6.00       159.90  6.00       1639.14  6.00       130.65

Three time series models were experimented with on the data: MA (moving average), AR (autoregressive), and ARMA (autoregressive moving average), borrowed from [29]. We try to find the best parameters for the arma model: the order of the autoregressive operators, the number of differences, and the order of the moving average operators. We look

Figure 4.10: Time series forecasting for Subversion and JMeter.

at the autoregressive operator first: the arima function from the stats [147] package is invoked with 6 different orders, while ma and differences are null, and we get the minimum aic. The same technique is applied to the moving average operators, zero-ing ar and differences. For differences, the optimised values for the ar and ma components are used and the same iterative process is executed to detect the lowest aic. Results of ARMA parameters optimisation for Ant files are displayed in table 4.1. In our example, the best order for the autoregressive process (ar lag) (without moving average component) is 4, the best order for moving average (without autoregressive component) is 6 and the optimal number of differences is 2. These settings are used to forecast metrics at given times of the project history: first, second and third quarters, and at the last time in history (which is 2012-07-30 in our case). In practice this never worked well on the data we were working on: confidence intervals

were far too wide to be usable, as shown in figure 4.10.
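
The order selection itself can be sketched with the arima function, as below; the final orders are those reported in table 4.1 for Ant, and the call may fail to converge on erratic data, as noted above.

# Minimal sketch: AIC-driven selection of the autoregressive order, then forecasting.
x <- as.numeric(sloc.ts)
aic.ar <- sapply(1:6, function(p) arima(x, order = c(p, 0, 0))$aic)
aic.ar                                           # the order with the lowest AIC is retained

best <- arima(x, order = c(4, 2, 6))             # ar = 4, differences = 2, ma = 6
predict(best, n.ahead = 12)                      # point forecasts and standard errors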

4.3.6 Conditional execution & factoids

As presented earlier in this chapter, Knitr offers an interesting feature in the conditional execution of chunks of code. It allows disabling fragile algorithms which would fail in some cases (e.g. when a distance matrix cannot be computed), and thus considerably enhances the reliability of documents. We used it frequently in our documents to restrict the analysis of class-related information to object-oriented code. LaTeX code also defines if statements to conditionally display text. Under some assumptions, one can retrieve the value of variables from the R chunks into the LaTeX formatting structure and use it to print custom sentences with dynamically generated numbers. We used it to mimic Ohloh factoids as a proof of concept, but this could be extended to provide advice based on the execution of complex algorithms – a few examples being [194, 195, 88, 107]. The following two examples in R and LaTeX show how the project size Ohloh factoid was implemented in the evolution analysis document.

if (project_length < 1) {
  # From 0 to 1 year => New
  MFACTTIMEA <- project_length
  MFACTTIMEB <- FALSE
  MFACTTIMEC <- FALSE
  MFACTTIMED <- FALSE
}

\ifMFACTTIMEA
  \item The project is all new. It has been created less than one year ago.
\fi
\ifMFACTTIMEB
  \item The project is young: it has between one and three years of development.
\fi

Factoids should be further investigated because they offer significant opportunities for custom advice in reports based on high-complexity analysis. They are similar to the Action Items provided by Squore (see chapter 8 for more information) but greatly extend the scope of the methods available to detect them. Recommender systems [83] may also be included for better advice based on the experience gathered.

4.4 Primary lessons

4.4.1 Data quality and mining algorithms

Since we had very little knowledge of the mining characteristics of software engineering data at that time, we had to try several methods before finding one that really did what was intended. Because of the distribution function of many software metrics and the very specificities of software measurement data (e.g. correlations amongst metrics), some algorithms give bad, unusable results or simply do not work. The first example is about missing values. In many cases a single NA in the whole matrix causes algorithms to fail. They have to be identified first and then either replaced with some default values (e.g. zeros, depending on the metric) or have the whole column removed. Some algorithms, like matrix distances or clustering, need a minimum variance on data to be safely applied – and metrics that only have a few different values make them constantly fail. Data has to be checked, cleaned, and repaired or discarded if needed. Another example of a difficult method to apply was the time series decomposition of evolution metrics. We tried to find metrics that would show cycles or repeated patterns (possibly with lag), to normalise or more generally prepare data, or try different algorithms. We also had a hard time using arima models for prediction: they often failed (i.e. stopped the R script execution) when the data was too erratic to conclude on parameters. Even when it worked (i.e. passed on to the next command) results were not convincing, with very low p-values and bad curves on prediction graphs. We encountered some issues with configuration management metrics. Firstly, the purpose of configuration management is to remember everything, including human or technical errors. One example is when someone accidentally copies a whole hierarchy and does not notice it before making the commit. Even if this is corrected shortly after, say in the next commit, it will still appear as a huge peak in statistics. Secondly, SCM operations like merges can introduce big-bang changes in the repository contents, potentially impacting almost every other metric. Sources of such actions may be errors (typos, thoughtlessness...), poor knowledge or bad practices, or may even be intended and correct. One answer to this problem is to always check for consistency and relevance of results, with confidence intervals and statistical tests. From there on, we used p-values for our regressions and prediction models.
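
The basic cleaning step eventually applied before most algorithms can be sketched in a few lines of R (metrics being, as before, the data frame of measures).

# Minimal sketch: drop columns containing NAs and columns with (almost) no variance.
has.na   <- sapply(metrics, function(x) any(is.na(x)))
few.vals <- sapply(metrics, function(x) length(unique(x)) < 3)
clean    <- metrics[, !has.na & !few.vals]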

4.4.2 Volume of data

We worked with a huge amount of data, and this had a deep impact on the way we conducted our analyses. As an example the GCC project is one of the biggest projects that we analysed. It spans over several years, and gathering weekly versions of its sources took a huge amount of time. A single version of its sources (Subversion extract at 2007-08-27) is 473MB, and contains roughly 59 000 files. The code analysis of this version takes nearly one and a half hours to complete, data providers (Checkstyle, PMD) take 40 minutes,

and the Sweave analysis takes another 30 minutes. Running the analysis on the full GCC weekly data set takes fifteen days. Table 4.2 lists some of the projects gathered for our analyses. The full hierarchy of source files takes approximately 250GB of disk space.

Table 4.2: Number of files and size on disk for some projects.

Project              Files       Size on disk
Ant releases         37 336      400MB
Ant weekly           1 421 063   14.8GB
Papyrus weekly       2 896 579   44GB
JMeter releases      44 096      600MB
JMeter weekly        930 706     14.1GB
GCC releases         1 939 018   18GB
GCC weekly           27 363 196  88GB
Subversion releases  140 554     3.2GB
Subversion weekly    407 167     8.8GB

The available computing power and resources shaped how we worked. For long-running tasks spanning days or weeks (e.g. weekly extracts and analyses), if a single error occurs during the process (e.g. bad XML entities in PMD or Checkstyle files) then the full extract may be corrupted, and thrown away. Projects may also enforce request or bandwidth limits to preserve their resources, especially for huge lists of bugs or data-intensive needs; scripts sometimes have to introduce random delays to avoid hitting timeouts. One has to set up checkpoints that allow restarting the process (either for extraction or analysis) where it failed, in order not to have to start it all over again. Another good practice is to consistently check the data from the beginning to the end of the process, as subsequently explained.

4.4.3 About scientific software

Scientific software, as every software product, is subject to bugs, faults and evolutions. This has been pointed out by Les Hatton [84] and D. Kelly and R. Sanders [108], who argue that most scientific results relying on computational methods have some bias. These may either be introduced by non-conformance to explicit requirements (i.e. bugs in code) or evolving implicit requirements. The Maisqual case is no exception: the products we rely on (Squore, R, Checkstyle or PMD) have their own identified bugs – or even worse, non-identified faults. On a time range of three years this may have a significant impact on results. We list below a few examples on the different products we use:

Squore had a few changes in its requirements: one in 2012 relating to Java inner types detection, and another in 2013 regarding file and function headers. Should anonymous classes be considered as self-standing classes? Although they cannot compare to typical classes as for the number of public attributes or complexity, they still are class structures as defined by the Java language.

Checkstyle has about 60 bugs registered on the 5.0 version7 and PMD has 14 registered issues on the 5.0.5 version8. Issues span from false positives to violent crashes.

The versions of R (3.0.1) and its packages also have a number of issues: 111 registered issues as of 2013 on R-core9, and 9 registered bugs on Knitr 1.6.

Another source of bugs comes from the integration between tools: if the output format of a tool evolves across versions, some constructions may not be seen by the next tool in the process. We also have to do our mea culpa: the scripts we wrote have been found guilty as well and were corrected over time. As an answer to these concerns, we decided to value the consistency of results over exact conformance to a subjective truth: if the sloc as computed by Squore differs from some definitions found in the literature, we simply stick to our definition, providing all the information needed for others to reproduce results (through clear requirements and examples) and trying to be consistent in our own process. When we encountered a bug that changed the output, we had to re-run the process to ensure consistency and correctness of all data sets.

4.4.4 Check data

Retrieved data should be checked at every step of the mining process for consistency and completeness. Every project has its own local traditions, customs and tools, and this variety makes automation mandatory but very fragile. Sanity checks must be done from time to time at important steps of the retrieval and analysis processes – there is far too much information to be entirely checked at once. Examples of problems that may arise, and those we met during our experiences, include:

When the project is not very active (e.g. typically at the beginning of the project), one may ask for a day which has no revision attached to it. In this case, Subversion simply failed to retrieve the content of the repository for that day, and we had to manually extract code at the previous revision.

When a data provider tool fails (e.g. an exception occurs because of some strange character or an unusual code construction), all findings delivered by this tool are unavailable for the specific version. This may be a silent error, and the only way to identify it is to check for semantic consistency of data (say, 0 instead of 8 on some columns).

Some errors can be detected via automatic treatments (e.g. sum up all metrics delivered by a tool and check it is not null) while others must be manually checked to be uncovered.
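
An automatic check of the first kind can be sketched as below; the column naming convention used to recognise the findings of each data provider is hypothetical.

# Minimal sketch: flag data providers whose findings sum to zero over a version.
tool.cols  <- grep("^(pmd|checkstyle)_", names(metrics), value = TRUE)
suspicious <- tool.cols[colSums(metrics[, tool.cols, drop = FALSE]) == 0]
if (length(suspicious) > 0)
  warning("Possibly failing data providers: ", paste(suspicious, collapse = ", "))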

7http://sourceforge.net/p/checkstyle/bugs/milestone/release_5.0/.
8http://sourceforge.net/p/pmd/bugs/milestone/PMD-5.0.5/.
9https://bugs.r-project.org/bugzilla3/buglist.cgi?version=R%203.0.1.
10https://github.com/yihui/knitr/issues.

It is not always easy to check for the validity of measures, however: double-checking the total number of possible execution paths in a huge file may be impossible to reproduce exactly. Unable to formally verify the information, one should still validate that the data at least looks ok. Publishing the data sets is an opportunity to get feedback from other people: some errors may be easily caught (e.g. a failing data provider results in a null column in the data set) while other errors may look fine until someone with a specific knowledge or need can highlight them. In other words, many eyes will see more potential problems, as spotted by E. Raymond in [148] for open-source projects.

“ Given a large enough beta-tester and co-developer base, almost every problem will be characterized quickly and the fix will be obvious to someone. ” Eric S. Raymond

4.5 Summary & Roadmap

This preliminary step was useful in laying down solid foundations for the upcoming steps of the Maisqual project. It allowed us to better know the data we were going to analyse, what algorithms could be applied to it, and moreover how to treat this information. From there we designed the roadmap depicted in figure 4.11, and the following clear objectives were decided:

Figure 4.11: Synopsis of the Maisqual roadmap.

Produce data sets to allow reproducible research and maintain a secure consistency between versions of the analysis documents. Readily available data sets also provide an easy means to quickly test new techniques without having to re-check data for consistency or completeness. The introduction of new data types (e.g. from configuration management tools) permits us to easily investigate new areas for the quality models provided in the Squore configuration.

Write rough Knitr documents to select and test the statistical methods that can be used on software engineering data and generic concerns, and more generally try a lot of ideas without much difficulty. This was achieved through the project analysis version and evolution working documents, which provided useful insights on the applicability of techniques and served as a basis for future work.

Write optimised Knitr documents targeted at specific techniques and concerns, like outliers detection or automatic scales for Squore. These usually start as simple extracts of the generic document, which are then enhanced. They run faster (since all unnecessary elements have been removed from them), are more reliable (issues not related to the very concerns of the document are wiped out) and are more focused and user-friendly, with a helpful textual description of the context and goals of the document.

Implement the solution to make it available to users: a genius idea is worthless if no one uses it. It should either be integrated into Squore or made available in order to benefit from it.

Chapter 5

First stones: building the project

After learning the basic concepts and issues of software engineering data mining, we drew a plan to fulfill our goals. Clear requirements were established in a new roadmap shown in section 4.5. This chapter describes how the Maisqual project was modified after these experiences and how the precepts uncovered in the previous chapter were implemented in Maisqual. Here we describe how we investigated the nature of software engineering data and established a typical topology of artefacts in section 5.1. The methodological approach we decided to apply is explained in section 5.2. Section 5.3 describes how the process was designed, implemented and automated on a dedicated server and executed to generate the data sets and analyses.

5.1 Topology of a software project

5.1.1 The big picture

Repositories hold all of the assets and information available for a software project, from code to review reports and discussions. In the mining process it is necessary to find a common basis of mandatory repositories and measures for all projects to be analysed: e.g. configuration management, change management (bugs and change requests), communication (e.g. forums or mailing lists) and publication (website, wiki). Figure 5.1 shows repositories that can be typically defined for a software project and examples of metrics that can be retrieved from them. It is useful to retrieve measures at different time frames (e.g. yesterday, last week, last month, last three months) to better grasp the dynamics of the information. Evolution is an important aspect of the analysis since it allows users to understand the link between measures, practices and quality attributes. It also makes them realise how much better or worse they do with time.

91 Figure 5.1: Software repositories and examples of metrics extracted.

5.1.2 Artefact Types

Artefacts are all items that can be retrieved from a software repository or source of data and that have some information associated with them. In the context of software development, a variety of different artefacts may be defined depending on the purpose of the search activity. The most evident examples of types are related to the product (source code, documentation), communication, or process. At the product level, data attributes are usually software metrics and findings, attached to artefact types such as the following:

Application An application is the set of files considered as a release. It is useful in a portfolio perspective to compare projects, and to track the evolution of some characteristics of the project over time. It can be built up from pure application-level metrics (e.g. the number of commits) or from the aggregation of lower-level measures1.

Folder A folder is associated to a directory on the file system. For the Java language, it is also equivalent to a package. Folders are useful for mapping and highlighting parts of the code that have common or remarkable characteristics: e.g. components or plugins.

File A file belongs to the repository source hierarchy and may contain declarations, classes, functions, instructions, or even comments only. Most analysers only consider files with an extension matching the language under analysis.

Class A class can either be the main class from the file, or an anonymous type – see section 4.4.3 for the latter. In object-oriented architectures code reusability is mostly

1When building up an attribute from lower-level measures, one has to pay attention to ecological inference, see section 2.2.1.

92 Figure 5.2: Different artefacts models for different uses.

achieved at the class level, so quality characteristics that involve reusability will often use a class-level index.

Function Functions – or methods in the case of a class – are most useful to developers to identify where findings are located (most violation detection tools provide the line number as well).

In the data sets we generated, the artefact types defined are application, file and function. A collection of information is attached to each artefact, corresponding to the file columns and extracted from different sources (static code analysis, tool metadata, mailing lists). But as is the case for a measurement program, goals are the very drivers for the selection of an artefact. If one considers the activity of a repository, then considering commits as artefacts is probably the best guess. The architecture used to model the system under measurement has a great impact on the artefacts selected as well: the decomposition will differ depending on the quality characteristics one wants to measure and manage, as figure 5.2 illustrates. At the process level, many other types of artefacts can be defined. Most of them are attached to an application, since people and practices are usually considered at the organisation or project level. We provide below a few examples of possible artefact types.

Bugs contain a wealth of diverse information, including insights on product reliability and project activity, among others. Tools can often be queried through a documented API. Some examples of attributes that can be retrieved from there are the total volume of issues, the ratio of issues rejected, and the time for a bug to be taken into consideration or to be resolved.

Requirements are sometimes managed as features within the change request tracker. Agile projects often use it to define the contents of the sprint (iteration). Examples of information that can be identified from there are: the scope of the release, the progress of work for the next release, or requirements stability.

Web site The project's official web site is the main communication means towards the public: it may propose a description of the project, its goals and features, or means to join the project. These can be retrieved as boolean measures for the assessment

of the project's communication management. It also holds information about the project's organisation and roadmap: planned release dates2, frequency of news postings or edits.

Wiki A wiki is a user-edited place to share information about the project and tells a lot about community activity. Examples of attributes that can be retrieved from it are the number of different editors, daily edits, or stability (ratio of the number of edited pages to the total number of pages).

People can be considered as artefacts in the case of social science investigations. But this may be considered bad practice if the measure is liable to be used as a performance indicator. One has to pay attention to the Hawthorne effect (see section 2.2.1) as well when involving people in the measurement process.

5.1.3 Source code

Source code is generally extracted from the configuration management repository at a specific date and for a specific version (or branch) of the product. Source releases as published by open source projects can be used as well but may heavily differ from direct extracts, because they represent a subset of the full project repository and may have been modified during the release process. Retrieved sources usually include both product code and tests. Some practitioners advocate that tests should not be included in the analysis since they are not a part of the delivered product. We argue this is often due to management and productivity considerations to artificially reduce technical debt or maintenance costs; our belief is that test code has to be maintained as well and therefore needs quality assessment. Source code analysis [128] tools usually provide metrics and findings. Metrics target intrinsic characteristics of the code, like its size or the number of nested loops. Findings are occurrences of non-conformities to some standard rules or good practices. Static analysis tools like Squore [11], Checkstyle [32] or PMD [51, 183] check for anti-patterns or violations of conventions, and thus provide valuable information on acquired development practices [183]. Dynamic analysis tools like FindBugs [8, 183, 181] or Polyspace give more information on the product performance and behaviour [57, 170], but need compiled, executable products – which is difficult to achieve automatically on a large scale. Dynamic and static analysis are complementary techniques in terms of completeness, scope, and precision [17].

5.1.4 Configuration management

A configuration management system allows recording and restoring all changes on versioned artefacts, with some useful information about who did it, when, and (to some extent) why. It brings useful information on artefacts' successive modifications and on the activity and diversity of actors in the project.

2When it comes to time-related information, delays are easier to use as measures than dates.

The primary source of data is the verbose log of the repository branch or trunk, as provided in various formats by all configuration management tools. The interpretation of metrics depends considerably on the configuration management tool in use and its associated workflow: as an example, commits on a centralised Subversion repository do not have the same meaning as on a Git distributed repository because of the branching system and philosophy. A Subversion repository with hundreds of branches is probably the sign of a bad usage of the branching system, while it can be considered normal with Git. In such a context it may be useful to set up some scales to adapt ranges and compare tools together. The positioning of the analysis in the configuration management branches also influences its meaning: working on maintenance-only branches (i.e. with many bug fixes and few new features) or on the trunk (next release, with mainly new features and potentially large refactoring or big-bang changes) does not yield the same results. Examples of metrics that can be gathered from there are the number of commits, committers or fix-related commits over a given period of time.
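
As a rough, hedged illustration, such counts can be extracted from a Subversion verbose log dumped to a text file; the file name and the keywords used to spot fix-related messages below are assumptions.

# Minimal sketch: commits, committers and fix-related lines from an svn log dump.
log      <- readLines("svn_log_verbose.txt")
headers  <- grep("^r[0-9]+ \\|", log, value = TRUE)            # revision header lines
commits  <- length(headers)
authors  <- length(unique(sapply(strsplit(headers, " \\| "), `[`, 2)))
fixes    <- sum(grepl("fix|bug", log, ignore.case = TRUE))     # lines mentioning a fix or a bug
c(commits = commits, committers = authors, fixes = fixes)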

5.1.5 Change management

A tracker, or bug tracking system, allows recording any type of item with a defined set of associated attributes. They are typically used to track bugs and enhancement requests, but may as well be used for requirements or support requests. The comments posted on issues offer a plethora of useful information regarding the bug itself and people's behaviour; some studies even treat them as a communication channel.

Figure 5.3: Mapping of bug tracking processes.

A bias may be introduced by different tools and workflows, or even different interpretations of the same status. Hemmati et al. [86] warn that due to the variety of different processes, only closed and fixed issues may provide safe measures. A common work-around

is to map actual states to a minimalist set of useful canonical states: e.g. open, working, verifying, closed. Almost all life cycles can be mapped to these simple steps without twisting them too much. Figure 5.3 shows a few examples of mapping to real-world bug tracking workflows: the one used at Squoring Technologies and the default lifecycle proposed by the Eclipse foundation for its projects. Examples of metrics that can be retrieved from change management systems include the time to closure, the number of comments, or votes, depending on the system's features.
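
Such a mapping boils down to a simple lookup table, sketched below in R; the left-hand statuses are examples and depend on the tracker in use.

# Minimal sketch: map tracker-specific statuses to canonical states.
canonical <- c(NEW      = "open",      UNCONFIRMED = "open",
               ASSIGNED = "working",   IN_PROGRESS = "working",
               RESOLVED = "verifying", VERIFIED    = "verifying",
               CLOSED   = "closed")
bugs$state <- canonical[bugs$status]             # 'bugs' is a hypothetical data frame of issues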

5.1.6 Communication

Actors of a software project need to communicate to coordinate efforts, ensure some consistency in coding practices, or simply get help [148]. This communication may flow through different channels: mailing lists, forums, or news servers. Every project is supposed to have at least two communication media, one for contributors to exchange on the product development (the developers mailing list) and another one for the product usage (the users mailing list). Local customs may impact the analysis of communication channels. In some cases low-activity projects send commit or bug notifications to the developer mailing list as a convenience to follow repository activity, or contributors may use different email addresses or logins when posting – which makes it difficult, if not impossible, to link their communications to activity in other repositories. The different communication media can all be parsed to retrieve a common set of metrics like the number of threads, mails, authors or response ratio, and time data like the median time to first response. Another way to get historical information about project communications lies in mailing list archives, which are public web sites that archive, and index for full-text search, all messages of the mailing lists of open-source projects. We list hereafter a few of them:

Gmane: http://gmane.org
Nabble: http://nabble.org
The Mail Archive: http://www.mail-archive.com
MarkMail: http://markmail.org

5.1.7 Publication

A project has to make the final product available to its users along with a number of artefacts such as documentation, FAQ or How-to's, or project-related information like team members, history of the project or advancement of the next release. A user-edited wiki is a more sophisticated communication channel often used by open source projects. The analysis of these repositories may be difficult because of the variety of publishing engines available: some of them display the time of last modification and who did it, while other projects use static pages with almost no associated meta data. As an example, user-edited wikis are quite easy to parse because they display a whole bouquet of information

about their history, but may be very time-consuming to parse because of their intrinsic changeability. Metrics that can be gathered from these repositories are commonly linked to the product documentation and community wealth characteristics. Examples include the number of pages, recent modifications or entries in the FAQ, the number of different authors and readers, or the age of entries (which may denote obsolete pages).

5.2 An approach for data mining

As highlighted in our state of the art, one has to pay great attention to the approach used in the measurement or mining process to ensure consistency and meaning of the measures. Besides relying on the experience of software measurement programs and meaning-safe methods, we propose a few guidelines to be followed when building up a data mining process. They are summarised in figure 5.4.

Figure 5.4: From quality definition to repository metrics.

We applied these guidelines to the experiments we conducted during the Maisqual project and found them practical and useful. They proved successful in ensuring the integrity and usability of information and in keeping all the actors synchronised on the same concerns and solutions.

5.2.1 Declare the intent

The whole mining process is driven by its stated goals. The quality model and attributes, the means to measure them, and the presentation of the results will differ if the program is designed as an audit-like acceptance test for projects, or as an incremental quality improvement program to ease the evolution and maintenance of projects. The users of the mining program, who may be developers, managers, buyers, or end-users of the product, have to map its objectives to concepts and needs they are familiar with. Including users in the definition of the intent of the program also helps to prevent counter-productive use of the metrics and quality models. The intent must be simple, clearly expressed in a few sentences, and published for all considered users of the program.

5.2.2 Identify quality attributes

The concerns identified in the intent are then decomposed into quality attributes. Firstly, this gives a structure to the quality model; secondly, it allows us to rely on well-defined characteristics – which greatly simplifies the communication and exchange of views. Recognised standards and established practices provide useful frameworks and definitions for this step. One should strive for simplicity when elaborating quality attributes and concepts. Common sense is a good argument, and even actors who have no knowledge of the field associated with the quality attributes should be able to understand them. Obscurity is a source of fear and distrust and must be avoided. The output of this step is a fully-featured quality model that reflects all of the expected needs and views of the mining program. The produced quality model is also a point of convergence for all actors: requirements of different origins and natures are bound together and form a unified, consistent view.

5.2.3 Identify available metrics

Once we know precisely what quality characteristics we are looking for, we have to identify measures that reflect this information need. Data retrieval is a fragile step of the mining process. Depending on the information we are looking for, various artefact types and measures may be available: one has to select them carefully according to their intended purpose. The different repositories available for the projects being analysed should be listed, along with the measures that may be retrieved from them. Selected metrics have to be stable and reliable (i.e. their meaning to users must remain constant over time and usage), and their retrieval automated (i.e. no human intervention is required). This step also defines how the metrics are aggregated up to the top quality characteristics. Since there is no universally recognised agreement on these relationships, one has to rely on local understanding and conventions. All actors, or at least a vast majority of them, should agree on the meaning of the selected metrics and on their links to the quality attributes.

5.2.4 Implementation

The mining process must be fully automated, from data retrieval to results presentation, and transparently published. Automation enables us to reliably and regularly collect the information, even when people are not available3 or not willing to do it4. When digging back into historical records, missing data poses a serious threat to the consistency of results and to the algorithms used to analyse them – information is often easier to get at the time than later on. Publishing the entire process also helps people understand what quality is in this context and how it is measured (i.e. no magic), making them more confident in the process.

5.2.5 Presentation of results

Visualisation of results is of primary importance for the efficiency of the information we want to transmit. In some cases (e.g. for incremental improvement of quality) a short list of action items will be enough, because the human mind feels more comfortable correcting a few warnings than hundreds of them. But if our goal is to audit the technical debt of the product, it is better to list them all to get a good idea of the actual status of the product's maintainability. Pictures and plots also prove to be very useful to illustrate ideas and are sometimes worth a thousand words. If we want to highlight unstable files, a plot showing the number of recent modifications immediately spots the most unstable of them and shows how volatile they are compared to the average.

5.3 Implementation

5.3.1 Selected tools

Besides the tools used by the software projects themselves (e.g. Subversion, Git, MHonArc), we had to select tools to set up and automate the retrieval and analysis process. We naturally relied on Squore for the static analysis part, and on a set of open-source tools commonly used by free/libre projects. This has two advantages: firstly they can be downloaded and used for free, and secondly they provide well-known features, checks, and results. As an example, since we restricted our study to Java projects, the static analysis tools selected for detecting bad practices and violations were Checkstyle and PMD. Practitioners using the generated data sets will be able to investigate e.g. how the checks are implemented and map them to documented practices they know or have seen in projects.

3Nobody has time for data retrieval before a release or during holidays. 4Kaner cites in [105] the interesting case of testers waiting for the release before submitting new bugs, pinning them on the wall for the time being.

Checkstyle

Checkstyle is an open-source static code analyser that checks for violations of style and naming rules in Java files. It was originally developed by Oliver Burn back in 2001, mainly to check code layout issues, but since the architecture refactoring that occurred in version 3 many new types of checks have been added. As of today, Checkstyle provides a few checks to find class design problems, duplicated code, or bug patterns like double-checked locking. The open-source community has widely adopted Checkstyle: many projects propose a Checkstyle configuration file to ensure consistency of the practices in use. Checkstyle checks can be seamlessly integrated in the development process of projects through several integrations with common Java tools like Ant, Maven, Jenkins, Eclipse, Emacs or Vim. It offers a comprehensive set of built-in rules and makes it easy to define new custom checks. The types of checks available out of the box include coding style checks, which target code layout and structure conformance to established standards, and naming convention checks for generic or custom naming of artefacts. When we started to work on Maisqual, we used Checkstyle 5.5, which was the latest version at that time and the one supported by the Squore engine. The 5.6 version has since been released, introducing new checks and fixing various bugs, and Squore adopted it in the 2013-B release. Version 3.0 of the data sets has been generated with Checkstyle 5.6, and some of the rules checked are described in section 6.2.3.

PMD

PMD is another static code analyser commonly used in open-source projects to detect bad programming practices. It finds common Java programming flaws like unused variables, empty catch blocks, or unnecessary object creation, among others. Its main target is Java, although some rules target other languages like JavaScript, XML, and XSL. It also integrates with various common Java tools, including Ant, Maven, Jenkins, Eclipse, and editors like Emacs JDE and Vim. Compared to Checkstyle's, PMD checks are rather oriented towards design issues and bad coding practices. Examples include dead code, exception handling, or direct implementations instead of using interfaces. The version of PMD included in Squore at the beginning of the project was PMD 4.3. As for Checkstyle, we switched to PMD 5.0.5 for the generation of version 3.0 of the data sets. Some of the rules checked are described in section 6.2.4.

Squore

To ensure consistency between all artefact measures retrieved from different sources, we relied on Squore, a professional tool for software project quality evaluation and business intelligence [11]. It features a parser, which computes code metrics and builds a tree of artefacts for the various data providers. The engine then associates measures to each node, aggregates data to upper levels, and stores them in a Postgres database. Squore has two features that proved to be useful for our purpose. Firstly, the input to the engine can be any type of information or format; as an example we could write specific data providers to analyse configuration management metadata and to retrieve communication channel information through various means (forums, NNTP, mboxes). Secondly, the aggregation of data, as well as the measures and scales used, can be entirely configured in custom quality models. A quality model defines a hierarchy of quality attributes and links them to metrics. Findings from rule checking tools are integrated through derived metrics like the number of violations of rule X on this artefact. We used the Squore dashboard to quickly verify measures as they were retrieved, with temporal plots and tables as pictured in figure 5.5.

Figure 5.5: Validation of data in the Squore dashboard.

5.3.2 Data retrieval

Figure 5.6 depicts the whole process: metrics are retrieved from source code with the Squore parser and external tools (Checkstyle, PMD) while configuration management metadata and mailing lists are dug with custom data providers written in Perl. Once Squore has finished aggregating the different sources of information, we retrieve them from the database to generate the CSV files. Data can be checked visually in the Squore dashboard and is validated through R/Knitr scripts.

Figure 5.6: Data retrieval process.

Source code

Source code is extracted from CVS, Subversion or Git repositories at specific dates. We had to check exported snapshots manually and to correct them when needed: some extracts were missing (e.g. when there was no commit at the specified date) or incomplete (e.g. when the remote server cut bandwidth to preserve resources). Using Subversion, one can use the export command to directly fetch the contents of the repository at a specific date:

    svn export http://svn.apache.org/repos/asf/ant/core/trunk -r {2000-02-21}

Using Git, one has first to clone the repository, then reset it to the commit corresponding to the desired date. The Git-specific directories (namely .git and .gitignore) can then safely be deleted to preserve disk space:

    git clone https://git.eclipse.org/r/p/mylyn/org.eclipse.mylyn.git

    git reset --hard $(git rev-list -1 $(git rev-parse --until=${i}) master)
    rm -rf .git .gitignore

Releases were manually downloaded from the project web site and extracted to the sources directory5. It should be noted that release versioning may be discontinuous (e.g. jumping from 1.2.1 to 1.2.3); this might happen when a critical bug is found after the release process has been started (i.e. sources have been tagged) but before the publishing has occurred. In these cases the next version usually gets out within a few days.

Configuration management metadata

Basically all configuration management systems implement the same core functionalities – e.g. checkout, commits or revisions, history or log of artefacts. Although some tools provide supplemental kinds of information (e.g. a reliable link to the change requests associated with the commit) there is a common basis of meta data that is available from

5The full set of sources takes up to 300GB of disk space.

all decent modern tools. The basic information we want to retrieve is the author, date and reason (message) of each commit that occurred, at the file and application levels. We retrieve this information with the following commands. Using Subversion:

    svn log -v --xml http://svn.apache.org/repos/asf/ant/core/ > ant_log.xml

Using Git:

    git log --stat=4096 -- . > git_log.txt

We defined in our context a fix-related commit as a commit having one of the terms fix, bug, issue or error in the message associated with the revision. Depending on the project tooling and conventions, more sophisticated methods may be set up to establish the link between bugs and commits [9]. We could not observe consistent naming practices for commit messages across the different projects and forges that we mined. Some projects published guidelines or explicitly relied on more generic conventions like the official Java Coding Convention, while some others did not even mention the topic. None of them enforced these practices, however, e.g. by setting up pre-commit hooks on the repository.
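As an illustration, a minimal sketch of this keyword heuristic (hypothetical code; the actual data providers are written in Perl) could look like this:

    import java.util.regex.Pattern;

    // Hypothetical sketch of the fix-related commit heuristic described above.
    public final class FixCommitHeuristic {

        // Keywords looked for in commit messages (case-insensitive, whole words).
        private static final Pattern FIX_PATTERN =
                Pattern.compile("\\b(fix|bug|issue|error)\\b", Pattern.CASE_INSENSITIVE);

        public static boolean isFixRelated(String commitMessage) {
            return commitMessage != null && FIX_PATTERN.matcher(commitMessage).find();
        }
    }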

5.3.3 Data analysis

The main place for trying things out was the version analysis Knitr document. It steadily evolved to include new techniques and graphics, and is now 85 pages long. Sections that were too time-consuming or not robust enough were shut down when needed. Specific Knitr documents were also written to address more focused needs.

5.3.4 Automation

We used Jenkins as a continuous integration server to automate the different processes. Simply put, it allows one to define sequential steps and to schedule long runs in an automated, repeatable manner. It integrates well with a lot of useful third-party components: build tools, configuration management and bug tracking systems, and it can be finely tuned to fit the most extravagant processes. One useful feature of Jenkins is that it enables one to configure template projects whose steps can be reused by other projects, as illustrated in figure 5.7. We defined a few common jobs for the analysis of an arbitrary single version, of a series of successive extracts, and of the current state of a project. Project-specific jobs were then set up to run the steps defined in the templates with customised values. The Jenkins instance hosts all processes executed during the analysis of a project, from the Checkstyle, PMD and Squore parsers to the R analyses, and consumes a lot of resources. The generated builds take 45 GB of disk space while the Jenkins workspace

Figure 5.7: Implementation of the retrieval process in Jenkins.

takes up to 145 GB. It runs on a fairly powerful server6 with eight processors and 24 GB of RAM.

5.4 Summary

In this chapter we defined the building blocks of Maisqual. We ascertained the structure of available information sources and delineated a nominal topology for the projects we intended to analyse. From a methodological perspective, we instituted requirements to ensure the relevance and semantic consistency of our results, and eventually implemented them on a dedicated server. This provided us with a strong framework upon which to build reproducible and theoretically sound research. The work accomplished during this preparation step has also benefited Squoring Technologies in the following areas:

The dedicated data providers developed for the acquisition of unusual data have uncovered new areas for the Squore product. The configuration management metadata extraction of application- and file-level measures is already integrated in the Squore repository. The preparation work for other areas (change requests, communication means, and other process-related sources) provides a sound premise for upcoming data providers.

The application of state-of-the-art research, the pitfalls and common issues identified during the retrieval and analysis process, and the techniques used to detect and defeat them also contributed to Squoring Technologies' knowledge and experience. This proved to be useful for investigating paths for the project quality assessment initiative of the Squore product, and for broadening the scientific knowledge of the teams.

6The server is hosted by OVH at http://ns228394.ovh.net. The Jenkins instance can be accessed at http://ns228394.ovh.net:8080 and the Squore instance at http://ns228394.ovh.net:8180.

Chapter 6

Generating the data sets

This chapter introduces the data sets that were generated for the Squore Labs using the framework set up in chapter 5. The data sets produced during this phase have different structures and characteristics. Hence three different types of sets are defined:

Evolution data sets show a weekly snapshot of the project repository over a long period of time – up to 12 years in the case of Ant. The time interval between extracts is constant: the source has been extracted every Monday over the defined period of time. Characteristics include common and differential code metrics, communication measures, and configuration management measures.

Release data sets show analyses of the source releases of the software product, as published by the project. They differ from the evolution data sets in their nature (the source release does not necessarily show the same structure and contents as the project repository) and in their time structure (releases show no regular time structure: a new release may happen only a few days after the previous one – e.g. in the case of blocking issues – or months after).

Version data sets present information on a single release or extract of a product. They include only static information: all time-dependent attributes are discarded, but they still provide valuable insights into the intrinsic measures and detected practices.

We first detail in section 6.1 the metrics selected for extraction. We give a short description for each metric and provide pragmatic examples to serve as requirements for reproducible research. The rules and practices checked on source code are then presented in section 6.2. Some of the major projects that we analysed are quickly described in section 6.3 to provide insights on their context and history. Finally section 6.4 summarises the different types of data sets that were generated.

6.1 Defining metrics

6.1.1 About metrics

The definition of metrics is an important step for the comprehension and the consistency of the measurement program, as pointed out by Fenton [62] and Kaner [105]. Their arguments can be summarised as follows:

The exact definition of the measurement process and how to retrieve the information reliably. A common example proposed by Kaner in [105] is the Mean Time Between Failures (mtbf) metric. This concern applies both at the retrieval stage (is the actual measure really the information we intended to get?) and at the analysis phase (how do people understand and use the definition of the measure?).

The meaning and interpretation of measures, which vary greatly according to the development process, local practices and the program objectives. Compared to the previous case, even if a measure signifies exactly what was intended, its implications will not be the same according to the external and internal context of the project. As an example, developers working in a collaborative open-source project will not have the same productivity and hierarchical constraints as a private company with financial and time constraints [148].

To circumvent these pitfalls, we paid great attention to:

Clearly define the metrics so that they can be understood by everyone. Provide examples of values for a given chunk of code. Give context, indications and common beliefs about each metric's usage.

Giving code samples with metric values serves both understandability and reference purposes. On the one hand, one is able to get an approximate idea of the values of metrics on a sample of code structures and test her understanding on real code. On the other hand, these samples can be used as requirements for a custom development if needed for reasons of reproducibility. Some metrics are only available in specific contexts, like the number of classes for object-oriented programming. Metrics available for each type of data set are described in table 6.1. OO metrics are clas (number of classes defined in the artefact) and ditm (depth of inheritance tree). Diff metrics are ladd, lmod and lrem (number of lines added, modified and removed since the last analysis). Time metrics for SCM are scm_*_1w, scm_*_1m, and scm_*_3m. Total metrics for SCM are scm_commits_total, scm_committers_total, scm_commits_files_total and scm_fixes_total. Time metrics for communication are com_*_1w, com_*_1m, and com_*_3m. Metrics defined on a lower level (e.g. function) can be aggregated to upper levels in a smart manner: as an example, the cyclomatic number at the file level is the sum of its functions' cyclomatic numbers. The meaning of the upper-level metric shall be interpreted with this fact in mind, since it may introduce a bias (also known as the

Table 6.1: Metrics artefact types.

                    Code                      SCM              COM
                    Common   OO    Diff.      Total   Time     Time
Java Evolution      X        X     X          X       X        X
Java Releases       X        X     X          X
Java Versions       X        X                X
C Evolution         X              X          X       X        X
C Releases          X              X          X
C Versions          X                         X

ecological fallacy, cf. section 2.2.1). When needed, the smart manner used to aggregate information at upper levels is described hereafter. All data sets are structured in three files, corresponding to the different artefact types that were investigated: application, file and function. The artefact levels available for each metric are detailed in their own sections: tables 6.2 and 6.3 for code metrics, table 6.4 for configuration management metrics, and table 6.5 for communication metrics.
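As a minimal illustration of this bottom-up aggregation (a hypothetical sketch, not the Squore engine), the file-level cyclomatic number is simply the sum of the values of the functions it contains:

    import java.util.List;

    // Hypothetical sketch: aggregating a function-level metric to the file level.
    public final class MetricAggregation {

        // Cyclomatic numbers of the functions defined in one file.
        // Note: interpreting the summed value requires care (ecological fallacy,
        // cf. section 2.2.1).
        public static int fileCyclomatic(List<Integer> functionCyclomaticNumbers) {
            return functionCyclomaticNumbers.stream().mapToInt(Integer::intValue).sum();
        }
    }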

6.1.2 Code metrics

We present in table 6.2 the code metrics that are available in all data sets, with the artefact levels they are available at.

Artefact counting metrics

Artefact counting metrics are file, func and clas (the latter is defined in the section dedicated to object-oriented measures). The number of files (file) counts the number of source files in the project, i.e. files that have an extension corresponding to the defined language (.java for Java, or .c and .h for C). The number of functions (func) sums up the number of methods or functions recursively defined in the artefact.

Line counting metrics

Line counting metrics propose a variety of different means to grasp the size of code from different perspectives. They include stat, sloc, eloc, cloc, mloc, and brac. The number of statements (stat) counts the total number of instructions. Examples of instructions include control-flow tokens, plus else, cases, and assignments. Source lines of code (sloc) is the number of non-blank and non-comment lines in code. Effective lines of code (eloc) also removes the lines that contain only braces. Comment lines of code (cloc) counts the number of lines that include a comment in the artefact. If a line includes both code and comment, it is counted in the sloc, cloc and mloc metrics.

Table 6.2: Artefact types for common code metrics.

Artefact counting metrics          Mnemo   App   File   Func.
Number of files                    file    X
Number of functions                func    X     X

Line counting metrics              Mnemo   App   File   Func.
Lines of braces                    brac    X     X      X
Blank lines                        blan    X     X      X
Effective lines of code            eloc    X     X      X
Source lines of code               sloc    X     X      X
Line count                         lc      X     X      X
Mixed lines of code                mloc    X     X      X
Comment lines of code              cloc    X     X      X

Miscellaneous                      Mnemo   App   File   Func.
Non conformities count             ncc     X     X      X
Comment rate                       comr    X     X      X
Number of statements               stat    X     X      X

Control flow complexity metrics    Mnemo   App   File   Func.
Maximum nesting level              nest                 X
Number of paths                    npat                 X
Cyclomatic number                  vg      X     X      X
Control flow tokens                cft     X     X      X

Halstead metrics                   Mnemo   App   File   Func.
Total number of operators          topt                 X
Number of distinct operators       dopt                 X
Total number of operands           topd                 X
Number of distinct operands        dopd                 X

Mixed lines of code (mloc) counts the number of lines that have both code and comments, and braces (brac) counts the number of lines that contain only braces. In this context, we have the following relationships:

    sloc = eloc + brac
    lc = sloc + blan + cloc − mloc
    lc = (eloc + brac) + blan + cloc − mloc
    comr = ((cloc + mloc) × 100) / (eloc + cloc)
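These relationships can be checked mechanically; the following hypothetical sketch derives lc and comr from the base counts defined above:

    // Hypothetical sketch of the line counting relationships given above.
    public final class LineCounts {

        public static int lc(int eloc, int brac, int blan, int cloc, int mloc) {
            // lc = sloc + blan + cloc - mloc, with sloc = eloc + brac
            return (eloc + brac) + blan + cloc - mloc;
        }

        public static double comr(int cloc, int mloc, int eloc) {
            // comr = (cloc + mloc) * 100 / (eloc + cloc)
            return (cloc + mloc) * 100.0 / (eloc + cloc);
        }
    }

For the forceLoadClass example below, lc(6, 2, 2, 0, 0) yields 10 and, by the formula above, comr(0, 0, 6) yields 0.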

The following examples of line counting metrics are provided for reference. The forceLoadClass function has 10 lines (lc), 8 source lines of code (sloc), 6 effective lines of code (eloc), 2 blank lines (blan), and 0 comment lines of code (cloc).

    public Class forceLoadClass(String classname) throws ClassNotFoundException {
        log("force loading " + classname, Project.MSG_DEBUG);

        Class theClass = findLoadedClass(classname);

        if (theClass == null) {
            theClass = findClass(classname);
        }
        return theClass;
    }

The findClassInComponents function has 30 lines (lc), 27 source lines of code (sloc), 21 effective lines of code (eloc), 0 blank lines (blan), 3 comment lines of code (cloc), 0 mixed lines of code (mloc), and 5 lines of braces (brac).

    private Class findClassInComponents(String name)
        throws ClassNotFoundException {
        // we need to search the components of the path to see if
        // we can find the class we want.
        InputStream stream = null;
        String classFilename = getClassFilename(name);
        try {
            Enumeration e = pathComponents.elements();
            while (e.hasMoreElements()) {
                File pathComponent = (File) e.nextElement();
                try {
                    stream = getResourceStream(pathComponent, classFilename);
                    if (stream != null) {
                        log("Loaded from " + pathComponent + " "
                            + classFilename, Project.MSG_DEBUG);
                        return getClassFromStream(stream, name, pathComponent);
                    }
                } catch (SecurityException se) {
                    throw se;
                } catch (IOException ioe) {
                    // ioe.printStackTrace();
                    log("Exception reading component " + pathComponent + " (reason: "
                        + ioe.getMessage() + ")", Project.MSG_VERBOSE);
                }
            }
            throw new ClassNotFoundException(name);
        } finally {
            FileUtils.close(stream);
        }
    }

Control flow complexity metrics

The maximum nesting level (nest) counts the deepest level of nested code (including conditions and loops) in a function. Deeper nesting threatens the understandability of code and induces more test cases to cover the different branches. Practitioners usually consider that a function with three or more nesting levels becomes significantly more difficult for the human mind to apprehend. The number of execution paths (npat) is an estimate of the possible execution paths in a function. Higher values induce more test cases to cover all the possible ways the function can execute depending on its parameters. An infinite number of execution paths typically indicates that some combination of parameters may cause an infinite loop during execution. The cyclomatic number (vg), a measure borrowed from graph theory and introduced by McCabe in [130], is the number of linearly independent paths through the program. For good testability and maintainability, McCabe recommends that no program module (or function, in the case of Java) should exceed a cyclomatic number of 10. It is primarily defined at the function level and is summed up for higher levels of artefacts. The number of control-flow tokens (cft) counts the number of control-flow oriented operators (e.g. if, while, for, switch, throw, return, ternary operators, blocks of execution). else and case are typically considered part of if and switch respectively and are not counted. The control flow graph of a function visually plots all paths available when executing it. Examples of control flow graphs are provided in figure 6.0; figures 6.0a and 6.0b show two Java examples from Ant (the code of these functions is reproduced in appendix D.1 page 279 for reference) and figure 6.0c shows a C example extracted from GCC. But control flows can be a lot more complex, as exemplified in figure 6.0d for an industrial application. In these examples, the functions have the following characteristics.

Metric                       (a)    (b)    (c)
Language                     Java   Java   C
Line Count                   407    6      99
Source Lines of Code         337    6      95
Executable Statements        271    4      62
Cyclomatic Complexity        78     2      18
Maximum Nested Structures    7      1      5
Non-Cyclic Paths             inf    2      68
Comment rate                 9 %    0 %    4 %

Figure 6.0: Control flow examples: (a) Ant Javadoc.java/execute(); (b) Ant Javadoc.java/createSourcepath(); (c) GCC GC_make_array_descriptor(); (d) an industrial application.
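As an additional illustration (hypothetical code, not taken from the analysed projects), the following small function has a cyclomatic number of 3, a maximum nesting level of 1, and 3 execution paths:

    // Hypothetical example: vg = 3, nest = 1, npat = 3.
    public final class Clamp {
        public static int clamp(int value, int min, int max) {
            if (value < min) {      // first decision point
                return min;
            }
            if (value > max) {      // second decision point
                return max;
            }
            return value;
        }
    }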

Halstead metrics

Halstead proposed in his Elements of Software Science [82] a set of metrics to estimate some important characteristics of a piece of software. He starts by defining four base measures: the number of distinct operands (dopd, or n2), the number of distinct operators (dopt, or n1), the total number of operands (topd, or N2) and the total number of operators (topt, or N1). Together they constitute the following higher-level derived metrics:

    program vocabulary: n = n1 + n2,
    program length: N = N1 + N2,
    program difficulty: D = (n1 / 2) × (N2 / n2),
    program volume: V = N × log2(n),
    estimated effort needed for program comprehension: E = D × V,
    estimated number of delivered bugs: B = E^(2/3) / 3000.

In the data sets, only the four base measures are retrieved: dopd, dopt, topd, and topt. Derived measures are not provided in the data sets since they can all be computed from the provided base measures.
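For reference, a hypothetical sketch recomputing the derived Halstead measures from the four base measures found in the data sets could look as follows:

    // Hypothetical sketch: Halstead derived measures from the base measures
    // provided in the data sets (dopd = n2, dopt = n1, topd = N2, topt = N1).
    public final class Halstead {

        public static double volume(int dopt, int dopd, int topt, int topd) {
            int n = dopt + dopd;               // program vocabulary
            int length = topt + topd;          // program length
            return length * (Math.log(n) / Math.log(2));
        }

        public static double difficulty(int dopt, int dopd, int topd) {
            return (dopt / 2.0) * ((double) topd / dopd);
        }

        public static double effort(int dopt, int dopd, int topt, int topd) {
            return difficulty(dopt, dopd, topd) * volume(dopt, dopd, topt, topd);
        }

        public static double deliveredBugs(int dopt, int dopd, int topt, int topd) {
            return Math.pow(effort(dopt, dopd, topt, topd), 2.0 / 3.0) / 3000.0;
        }
    }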

Rules-oriented measures

ncc is the number of non-conformities detected on an artefact. From the practices perspective, it sums the number of times any rule has been transgressed on the artefact (application, file or function). The rate of acquired practices (rokr) is the number of respected rules (i.e. rules with no violation detected on the artefact) divided by the total number of rules defined for the run. It shows the number of acquired practices with regard to the full rule set.

Specific code metrics

Table 6.3: Artefact types for specific code metrics.

Differential metrics                 Mnemo   App   File   Func.
Lines added                          ladd    X     X      X
Lines modified                       lmod    X     X      X
Lines removed                        lrem    X     X      X

Object-Oriented metrics              Mnemo   App   File   Func.
Maximum Depth of Inheritance Tree    ditm    X
Number of classes                    clas    X     X      X
Rate of acquired rules               rokr    X     X      X

Differential measures

Differential measures are only available for evolution and release data sets. They quantify the number of lines added (ladd), modified (lmod) or removed (lrem) since the last analysis, be it a week (for evolution data sets) or the delay between two releases (which varies from one week to one year). They give an idea of the volume of changes (either bug fixes or new features) that occurred between two releases. In the case of a large refactoring, or between major releases, there may be a massive number of lines modified, whereas only small increments may appear for more agile-like, often-released projects (e.g. Jenkins). As shown in table 6.3, differential measures are available at all artefact levels: application, file and function.

Object-oriented measures

Three measures are only available for object-oriented code. They are the number of classes (clas), the maximum depth of inheritance tree (ditm), and the above-mentioned rate of acquired rules (rokr). Table 6.3 lists the artefact levels available for these metrics. The number of classes sums up the number of classes defined in the artefact and its sub-defined artefacts. One file may include several classes, and in Java anonymous classes may be included in functions. The maximum depth of inheritance tree of a class within the inheritance hierarchy is defined as the maximum length from the considered class to the root of the class hierarchy tree, measured by the number of ancestor classes. In cases involving multiple inheritance, ditm is the maximum length from the node to the root of the tree [161]. It is available solely at the application level. A deep inheritance tree makes the understanding of the object-oriented architecture difficult. Well structured OO systems have a forest of classes rather than one large inheritance lattice. The deeper a class is within the hierarchy, the greater the number of methods it is likely to inherit, making it more complex to predict its behavior and, therefore, more fault-prone [77]. However, the deeper a particular class is in the tree, the greater the potential reuse of inherited methods [161]. rokr is another measure specific to object-oriented code, since it is computed relative to the full set of rules, including Java-related checks. In the case of C projects only the Squore rules are checked, so it loses its meaning and is not generated.

6.1.3 Software Configuration Management metrics

Configuration management systems hold a quantity of meta-information about the modifications committed to the project repository. The following metrics are defined. The number of commits (scm_commits) counts the commits registered by the software configuration management tool for the artefact on the repository (either the trunk or a branch). At the application level, commits can concern any type of artefact (e.g. code, documentation, or web site). Commits can be executed for many different purposes: e.g. add a feature, fix a bug, add comments, refactor, or even simply re-indent code. The number of fixes (scm_fixes) counts the number of fix-related commits, i.e. commits that include either the fix, issue, problem or error keywords in their message. At the application level, all commits with these keywords in their message are considered until the date of analysis. At the file level, it represents the number of fix-related revisions associated with the file. If a file is created while fixing code (i.e. its first version is associated with a fix commit), the fix is not counted, since the file cannot be considered responsible for a problem that was detected when it was not there. The number of distinct committers (scm_committers) is the total number of different committers registered by the software configuration management tool on the artefact. On the one hand, having fewer committers enforces cohesion, makes coding and naming conventions easier to respect, and allows easy communication and quick connection between developers. On the other hand, having a large number of committers means the project is active; it attracts more talented developers and more eyes to look at problems. The project also has better chances of being maintained over the years. It should be noted that some practices may threaten the validity of this metric. As an example, occasional contributors may send their patches to official maintainers, who review them before integrating them into the repository. In such cases, the commit is executed by the official committer, although the code has originally been modified by an anonymous (at least for us) developer. Some core maintainers use a convention stating the name

or identifier of the contributor, but there is no established or enforced usage of such conventions. Another point is that multiple online personas can cause individuals to be represented as multiple people [86]. The number of files committed (scm_commit_files) is the number of files associated with the commits registered by the software configuration management tool. This measure allows one to identify big commits, which usually imply big changes in the code.

Table 6.4: Artefact types for configuration management.

Configuration management      Mnemo: scm_       App.   File   Func.
SCM Fixes                     fixes             X      X
SCM Commits                   commits           X      X
SCM Committers                committers        X      X
SCM Committed files           committed_files   X

To reflect recent activity on the repository, we retrieved measures both on a limited time basis and on a global basis: over the last week (e.g. scm_commits_1w), the last month (e.g. scm_commits_1m), the last three months (e.g. scm_commits_3m), and in total (e.g. scm_commits_total). The metrics available at the different levels of artefacts are presented in table 6.4.
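A minimal sketch of this time-window counting (hypothetical code, not the actual data provider) is shown below; windows of 7, 30 and 90 days would approximate the one-week, one-month and three-month measures respectively:

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.List;

    // Hypothetical sketch: counting commits in a sliding time window, as done
    // for scm_commits_1w, scm_commits_1m and scm_commits_3m.
    public final class ScmWindows {

        public static long commitsSince(List<Instant> commitDates, Instant analysisDate, int days) {
            Instant from = analysisDate.minus(days, ChronoUnit.DAYS);
            return commitDates.stream()
                    .filter(d -> !d.isBefore(from) && !d.isAfter(analysisDate))
                    .count();
        }
    }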

6.1.4 Communication metrics

Communication metrics show an unusual part of the project: people's activity and interactions during the elaboration of the product. Most software projects have two communication media: one targeted at the internal development of the product, for developers who actively contribute to the project by committing to the source repository, testing the product, or finding bugs (a.k.a. the developers mailing list); and one targeted at end-users for general help and good use of the product (a.k.a. the users mailing list). The type of media varies across the different forges and projects: most of the time mailing lists are used, with a web interface like MHonArc or mod_mbox. In some cases, projects may use forums as well (especially for user-oriented communication) or NNTP news servers, as for the Eclipse foundation projects. The variety of media and tools makes it difficult to be exhaustive; however, data providers can be written to map these to the common mbox format. We wrote connectors for mboxes, MHonArc, GMane and FUDForum (used by Eclipse). The following metrics are defined. The number of posts (com_dev_vol, com_usr_vol) is the total number of mails posted on the mailing list during the considered period of time. All posts are counted, regardless of their depth (i.e. new posts or answers). The number of distinct authors (com_dev_auth, com_usr_auth) is the number of people having posted at least once on the mailing list during the considered period of time. Authors are counted once even if they posted multiple times, based on their email address.

The number of threads (com_dev_subj, com_usr_subj) is the number of different subjects (i.e. a question and its responses) that have been posted on the mailing list during the considered period of time. Subjects that are replies to other subjects are not counted, even if the subject text is different. The number of answers (com_dev_resp_vol, com_usr_resp_vol) is the total number of replies to requests on the user mailing list during the considered period of time. A message is considered an answer if it uses the Reply-to header field. The number of answers is often combined with the number of threads to compute the useful response ratio metric. The median time to first reply (com_dev_resp_time_med, com_usr_resp_time_med) is the number of seconds between a question (first post of a thread) and the first response (second post of a thread) on the mailing list during the considered period of time.

Table 6.5: Artefact types for communication channels.

Communication metrics        Mnemo: com_          App   File   Func.
Dev Volume                   dev_vol              X
Dev Subjects                 dev_subj             X
Dev Authors                  dev_auth             X
Dev Median response time     dev_resp_time_med    X
User Volume                  usr_vol              X
User Subjects                usr_subj             X
User Authors                 usr_auth             X
User Median response time    usr_resp_time_med    X

As for configuration management metrics, we worked on temporal measures to produce measures for the last week, last month, and last three months. Communication metrics are only available at the application level, as shown in table 6.5.
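A hypothetical sketch of the median time to first reply, computed from the dates of the first two posts of each thread, could be:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Hypothetical sketch: median time to first reply, in seconds, from a list
    // of (question date, first response date) pairs collected over the period.
    public final class ResponseTime {

        public static long medianFirstReplySeconds(List<Instant[]> threads) {
            long[] delays = threads.stream()
                    .mapToLong(t -> Duration.between(t[0], t[1]).getSeconds())
                    .sorted()
                    .toArray();
            int n = delays.length;
            if (n == 0) {
                return 0;
            }
            return (n % 2 == 1) ? delays[n / 2] : (delays[n / 2 - 1] + delays[n / 2]) / 2;
        }
    }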

6.2 Defining rules and practices

6.2.1 About rules

Rules are associated with practices that have an impact on some characteristic of quality. We list hereafter the rule checks that we decided to include in our data sets, with information about their associated practices and further references. The categories defined in the rule sets are mapped to the ISO 9126 decomposition of quality (analysability, changeability, stability, testability, etc.) and to development concerns (programming techniques, architecture, etc.). Many of these rules can be linked to coding conventions published by standardisation bodies like MISRA1 (the Motor Industry Software Reliability Association, whose C rule set is widely used) or CERT2 (Carnegie Mellon's secure coding initiative). They usually give an indication of the remediation cost and the severity of the rule. There are also language-specific coding conventions, as is the case with Sun's coding conventions for the Java programming language [171]. There are 39 rules from Checkstyle, 58 rules from PMD, and 21 rules from Squore. We define hereafter only a subset of them, corresponding to the most common and detrimental errors. In the data sets, conformity to rules is displayed as the number of violations of the rule for the given artefact.

6.2.2 Squore

Squore rules apply to C, C++ and Java code. The following families of rules are defined: fault tolerance (2 rules), analysability (7 rules), maturity (1 rule), stability (10 rules), changeability (12 rules) and testability (13 rules). The most common checks are:

No fall through (r_nofallthrough). There shall be no fall through to the next case in a switch statement. It threatens the analysability of code (one may not understand where execution goes through) and its stability (one needs to modify existing code to add a feature). It is related to rules Cert MSC17-C, Cert MSC18-CPP, Cert MSC20-C and Misra-C (2004) 15-2.

Compound if (r_compoundifelse). An if (expression) construct shall be followed by a compound statement. The else keyword shall be followed by either a compound statement or another if statement. It is related to Misra-C (2004) rule 14.9.

Final else (r_elsefinal). All if ... else if constructs shall be terminated with an else clause. This impacts changeability, because developers can quickly identify what the default treatment is, and fault tolerance, because unforeseen cases are caught. It is related to Misra-C (2004) rule 14.10.

No multiple breaks (r_sglbrk). For any iteration statement there shall be at most one break statement used for loop termination. It is related to Misra-C (2004) rule 14.6 and impacts the analysability (developers will have trouble understanding the control flow) and testability (more paths to test) of code.

No goto (r_nogoto). Gotos are considered bad practice (Misra-C (2004) rule 14.4) and may be hazardous (see Cert MSC35-CPP): they threaten the analysability of code, because one needs to scroll through the file instead of following a clear sequence of steps, and they make test cases harder to write. Similarly, the no backward goto (r_bwgoto) rule searches for goto operators that jump to code located before the goto. Backward gotos shall not be used; they shall be rewritten using a loop instead.

1See the official MISRA web site at http://www.misra.org.uk/. 2See Carnegie Mellon’s web site: https://www.securecoding.cert.org.

No assignment in Boolean (r_noasginbool). Assignment operators shall not be used in expressions that yield a boolean value. It is related to Cert EXP45-C, Cert EXP19-CPP and Misra-C (2004) rule 13.1.

No assignment in expressions without comparison (r_noasgcond). Assignment operators shall not be used in expressions that do not contain comparison operators.

Case in switch (r_onecase). Every switch statement shall have at least one case clause. It is related to Misra-C (2004) rule 15.5.

Label out of a switch (r_nolabel). A switch label shall only be used when the most closely-enclosing compound statement is the body of a switch statement. It is related to Misra-C (2004) rule 15.1.

Missing default (r_default). The final clause of a switch statement shall be the default clause. It is related to Cert MSC01-C, Cert MSC01-CPP and Misra-C (2004) rule 15.3.

Code before first case (r_nocodebeforecase). There shall be no code before the first case of a switch statement. It is related to Cert DCL41-C.
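The following hypothetical snippet illustrates three of these checks at once (fall through, missing default, and a non-compound if):

    // Hypothetical example violating r_nofallthrough, r_default and r_compoundifelse.
    public final class SwitchExamples {
        public static String describe(int code, boolean verbose) {
            String label = "";
            switch (code) {
                case 0:
                    label = "ok";      // falls through to the next case: r_nofallthrough
                case 1:
                    label = "warning";
                    break;
                case 2:
                    label = "error";
                    break;
                // no default clause: r_default
            }
            if (verbose)               // if without a compound statement: r_compoundifelse
                label = label + " (code " + code + ")";
            return label;
        }
    }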

6.2.3 Checkstyle

We identified 39 elements from the Checkstyle 5.6 rule set, corresponding to useful practices generally well adopted by the community. The quality attributes impacted by these rules are: analysability (23 rules), reusability (11 rules), reliability (5 rules), efficiency (5 rules), testability (3 rules), robustness (2 rules) and portability (1 rule). All rules are described on the project page at http://checkstyle.sourceforge.net/config.html.

Javadoc checks (javadocmethod, javadocpackage, javadoctype, javadocvariable) ensure that Javadoc comments are present at the different levels of information, as recommended by Sun [171].

Import checks. unusedimports looks for declared imports that are not used in the file: they clutter space and are misleading for readers. avoidstarimport checks that no generic import is used; specific import statements shall be used to help the reader grasp what is inherited in the file. redundantimport looks for duplicates in imports, which uselessly take up visual space.

Equals Hash Code (equalshashcode). A class that overrides equals() shall also override hashCode(). A caller may use both methods without knowing that one of them has not been modified to fit the behaviour intended when modifying the other one (consistency in behaviour). It impacts reusability, reliability and fault tolerance.

No Hard Coded Constant (magicnumber). Hard coded constants, or magic numbers, shall not be used. Magic numbers are literal numbers like 27 that appear in the code and require the reader to figure out what 27 is being used for. One should consider using named constants for any number other than 0 and 1. Using meaningful names for constants instead of magic numbers makes the code self-documenting, reducing the need for trailing comments. This rule is related to programming techniques and changeability.

Illegal Throws (illegalthrows) lists exceptions that are illegal or too generic; throwing java.lang.Error or java.lang.RuntimeException is considered to be almost never acceptable. In these cases a new exception type shall be defined to reflect the distinctive features of the throw.

New line at end of file (newlineatendoffile) checks that there is a trailing newline at the end of each file. This is an ages-old convention, but many tools still complain when they find no trailing newline. Examples include the diff or cat commands, and some SCM systems like CVS will print a warning when they encounter a file that does not end with a newline.

Anonymous Inner Length (anoninnerlength) checks for long anonymous inner classes. For analysability reasons these should be defined as self-standing classes if they embed too much logic.

Multiple String Literals (multiplestringliterals) checks for multiple occurrences of the same string literal within a single file. Such a literal should be defined as a constant, both for reusability and changeability, so that people can change the string in one place without forgetting occurrences.
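As an illustration, the following hypothetical class would trigger both the equalshashcode and magicnumber checks:

    // Hypothetical example: overrides equals() without hashCode() (equalshashcode)
    // and uses a hard coded constant (magicnumber).
    public class Temperature {

        private final int celsius;

        public Temperature(int celsius) {
            this.celsius = celsius;
        }

        public boolean isBoiling() {
            return celsius >= 100;   // magic number: should be a named constant
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Temperature
                    && ((Temperature) other).celsius == celsius;
        }
        // hashCode() is not overridden: the equals/hashCode contract is broken.
    }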

6.2.4 PMD

We selected 58 rules from the PMD 5.0.5 rule set. These are related to the following quality attributes: analysability (26 rules), maturity (31 rules), testability (13 rules), changeability (5 rules), and efficiency (5 rules). The full rule set is documented on the PMD web site: http://pmd.sourceforge.net/pmd-5.0.5/rules/.

Jumbled incrementer (jumbledincrementer) detects when a variable used in an outer structure is modified in a nested structure. One shall avoid jumbled loop incrementers – it is usually a mistake, and even when it is intended it is confusing for the reader.

Return from finally block (returnfromfinallyblock). One shall avoid returning from a finally block, since this can discard exceptions. This rule has an effect on analysability (developers will have trouble understanding where the exception comes from) and fault tolerance (the return in the finally block may stop the exception that happened in the try block from propagating up, even though it is not caught). It is related to the Cert ERR04-J rule.

Unconditional if statements (unconditionalifstatement), empty if statements (emptyifstmt), empty switch statements (emptyswitchstatements), empty synchronized blocks (emptysynchronizedblock), and empty while statements (emptywhilestmt) are useless and clutter code. They impact analysability – developers will spend more time trying to understand what they are for – and they may have undesirable side effects. As for empty while statements, if it is a timing loop then Thread.sleep() is better suited; if it does a lot in the exit expression then it should be rewritten to make it clearer. All of these are related to Cert MSC12-C.

Empty catch blocks (emptycatchblock) are instances where an exception is caught but nothing is done. In most circumstances an exception should either be acted on or reported, as exemplified in the Cert ERR00-J rule. Empty try blocks (emptytryblock) and empty finally blocks (emptyfinallyblock) serve no purpose and should be removed, because they clutter the code and hinder the file's analysability.

The God Class (godclass) rule detects the God Class design flaw using metrics. God classes do too many things, are very big and overly complex. They should be split apart to be more object-oriented. The rule uses the detection strategy described in Object-Oriented Metrics in Practice [118].

Avoid Catching Throwable (avoidcatchingthrowable). Catching Throwable is not recommended since its scope is very broad. It includes runtime issues such as OutOfMemoryError that should be exposed and managed separately. It is related to the Cert ERR07-J rule.

Avoid Catching NPE (avoidcatchingnpe). Code should never throw NullPointerExceptions under normal circumstances. A catch block for such an exception may hide the original error, causing other, more subtle problems later on. It is related to the Cert ERR08-J rule.

Non Thread Safe Singleton (nonthreadsafesingleton). Non-thread safe singletons can result in bad, unpredictable state changes. Static singletons are usually not needed, as only a single instance exists anyway: they can be eliminated by instantiating the object directly. Other possible fixes are to synchronize the entire method or to use an initialize-on-demand holder class. See Effective Java [25], item 48, and Cert MSC07-J.
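The following hypothetical class illustrates two of these rules: a non-thread safe singleton and an empty catch block.

    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;

    // Hypothetical examples of two PMD violations described above.
    public class Configuration {

        private static Configuration instance;

        // nonthreadsafesingleton: lazy initialisation without synchronisation.
        public static Configuration getInstance() {
            if (instance == null) {
                instance = new Configuration();
            }
            return instance;
        }

        public void load(File file) {
            try {
                Reader reader = new FileReader(file);
                reader.close();
            } catch (IOException e) {
                // emptycatchblock: the exception is silently swallowed
            }
        }
    }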

6.3 Projects

This section lists the projects selected for the data sets. They are open-source projects with a set of repositories that we are able to query through the data providers developed for that purpose (e.g. Subversion or Git for configuration management). When the early versions of the data sets were generated, no language had been selected, so projects written in both C/C++ and Java were retrieved and analysed. Later on, when we focused on the Java language, all the sources and information for the previous C projects were already there, so we managed to generate new versions of these data sets at the same time. However, Checkstyle and PMD analyses are not available for C/C++ projects, which considerably diminishes the volume of rules and practices verified on code – only the rules from Squore remain. Since many of the examples provided in this document rely on a subset of projects, we describe these projects more precisely here to provide insights into their history and modus operandi.

6.3.1 Apache Ant

The early history of Ant begins in the late nineties with the donation of the Tomcat software from Sun to Apache. From a specific build tool, it evolved steadily through Tomcat contributions to become more generic and usable. James Duncan Davidson announced the creation of the Ant project on 13 January 2000, with its own mailing lists, source repository and issue tracking. There have been many versions since then: 8 major releases and 15 updates (minor releases). The data set ends in July 2012, and the last version officially released at that time is 1.8.4. Table 6.6 lists the major releases of Ant with some characteristics of the official builds as published. It should be noted that these characteristics may show inconsistencies with the data set, since the build process extracts and transforms a subset of the actual repository content.

Table 6.6: Major releases of Ant.

Date          Version   SLOC     Files   Functions
2000-07-18    1.1       9671     87      876
2000-10-24    1.2       18864    171     1809
2001-03-02    1.3       33347    385     3332
2001-09-03    1.4       43599    425     4277
2002-07-15    1.5       72315    716     6782
2003-12-18    1.6       97925    906     9453
2006-12-19    1.7       115973   1113    12036
2010-02-08    1.8       126230   1173    12964

Figure 6.1: Ant mailing list activity.

Ant is arguably one of the most relevant examples of a successful open source project: from 2000 to 2003, the project attracted more than 30 developers whose efforts contributed to nominations for awards and to its recognition as a reliable, extendable and well-supported build standard for both the industry and the open source community. Figure 6.1 shows the mailing lists' activity over the data set time range. An interesting aspect of the Ant project is the amount of information available on the lifespan of a project: from its early beginnings in 2000, activity reached its peak around 2002-2003 and then decreased steadily. Although the project is actively maintained and still delivers regular releases, the list of new features has been shrinking over the years. It is still hosted by the Apache Foundation, which is known to have a high interest in software product and process quality. The releases considered for the data sets are reproduced in appendix B.1 page 255 with their publication dates.

6.3.2 Apache Httpd

Apache httpd is an open-source, fully-featured web server written in C. It is the most widely used server on the Web according to Netcraft surveys3 and has a strong reputation for robustness, reliability and scalability.

Figure 6.2: History of Apache httpd usage on the Web.

The Apache httpd project started in 1995 when Rob McCool left the NCSA (National Center for Supercomputing Applications), stalling the development of the NCSA web server – which was the pre-eminent server at that time. Starting from the NCSA 1.3 code base, a group of webmasters from around the world collected bug fixes and worthwhile enhancements, and released the first official public release (0.6.2) in April 1995. It was quite a hit: the community around the project grew steadily and development continued with new features (0.7.x) and a new architecture (0.8.8). After extensive testing, ports to a variety of obscure platforms, and a new set of documentation, Apache 1.0 was released in December 1995.

3Netcraft maintains a landscape of the web, and most notably the servers used. See http://www.netcraft.com/survey/ for more information.

The Apache Software Foundation was founded in 1999 to provide organizational, legal, and financial support for the Apache httpd server. Since then the foundation has greatly expanded the number of open-source projects falling under its umbrella, and is largely recognised for its quality concerns, both for the products it provides and for the development process in use. There have been three major branches of the Apache httpd server. Apache 1.3 was published on June 6, 1998 and reached its end of life with 1.3.42, on February 2, 2010. The 2.0 series was first released in 2004 and ended with the 2.0.65 release in July 2013. The full list of releases for the 2.0.x and 2.2.x series, along with their dates of publication considered for the data set, is reproduced in appendix B.2 page 255.

6.3.3 Apache JMeter

The Apache JMeter desktop application is an open source, 100% pure Java application designed to load test functional behavior and measure the performance of Java client/server applications. It is widely used both in the industry and in open-source projects, and is an essential part of the classical tool suite for Java development [167]. Development of JMeter was initially started by Stefano Mazzocchi of the Apache Software Foundation, primarily to test the performance of Apache JServ (a project that has since been replaced by the Apache Tomcat project). JMeter steadily grew to enhance the GUI and to add functional-testing capabilities, and became a subproject of Apache Jakarta in 2001, with dedicated mailing lists and development resources. JMeter eventually became a top-level Apache project in November 2011, with its own Project Management Committee and a dedicated website. The evolution data set starts on 2004-09-13 and ends on 2012-07-30, which constitutes 411 versions across almost 8 years. The release data set spans from 1.8.1, the first version published on the Apache web site on 2003-02-03, to 2.10, the last version at the time of writing, published on 2013-10-21. This represents 22 versions spanning more than 10 years.

6.3.4 Apache Subversion

Subversion is a fully-featured version control system written in C. In a period of 6 years (from 2004 to 2010), it has become a de facto standard for configuration management needs, both in open-source projects and in the industry. CollabNet founded the Subversion project in 2000 as an effort to write an open-source version-control system which operated much like CVS but fixed its bugs and supplied some features missing in CVS. By 2001, Subversion had advanced sufficiently to host its own source code4. In November 2009, Subversion was accepted into the Apache Incubator: this marked the beginning of the process of becoming a standard top-level Apache project. It became a top-level Apache project on February 17, 2010. The releases data set considers all published releases from 1.0.0 (the first release available in the archives, published on 2004-02-23) to 1.8.5 (the last release at the time of this writing, published on 2013-11-25), which represents 71 versions across almost 10 years. The full list of releases, with their dates of publication considered for the data set, is reproduced in appendix B.4 page 256.

6.4 Summary

In this chapter we presented how we managed to extract a full set of consistent metrics from various open-source software projects, carefully defining the metrics and violations retrieved and providing useful hints to understand the history and evolution of projects. Table 6.7 summarises all data sets that have been published at the time of writing, with the metrics extracted for them and the number of versions available.

Table 6.7: Summary of data sets.

Columns: Project, Lang., Extracts, then the metrics available per group: Code (Common, OO, Diff), SCM (Time, Total), Com. (Time), Rules (Squore, PMD, Checkstyle).

Ant              Java  654   XXX XX X XXX
JMeter           Java  411   XXX XX X XXX
httpd 2.4        C     584   XX XX X X
Ant              Java  22    XXX X XXX
JMeter           Java  22    XXX X XXX
httpd 2.0        C     24    XX X X
httpd 2.2        C     20    XX X X
Subversion       C     71    XX X X
Epsilon          Java  2     XX X XXX
Subversion       C     2     X X X
Topcased Gendoc  Java  4     XX X XXX
Topcased MM      Java  4     XX X XXX
Topcased gPM     Java  2     XX X XXX

Evolution data sets are weekly data sets describing the evolution of the project over a long range of time. They include temporal metrics (ladd, lmod and lrem) and measures extracted from SCM and Communication repositories. They can be used

4See the full history of Subversion at http://svnbook.red-bean.com/en/1.7/svn.intro.whatis.html.

to analyse the evolution of metrics and practices over a long period. Since they are extracted from the project repository they correspond to what developers actually work with, and give good insights into the practices in use for day-to-day development. Since they provide equally-spaced, regular data points, they form a comfortable basis for time series analysis. It should be noted that when activity is low (e.g. at the beginning of the project), the repository may not evolve for more than a week. In such cases the retrieved source code is identical for both dates.

Release data sets show characteristics of the product sources when delivered to the public, i.e. when they are considered good enough to be published. They provide useful information about what the users know about the product; some practices may be tolerated during development but tackled before release, so these may be hidden in the published source release. Furthermore, the source code tarball as published by the project may be different from the configuration management repository files: some of them may be generated, moved, or filtered during the build and release process. Temporal metrics are not included in releases, because their dates are not evenly distributed in time and a measure with a static number of weeks or months would be misleading: the last release may have happened weeks or months ago.

Version data sets only provide a few extracts or releases of the project. They may be used to investigate static software characteristics and practices, or relationships between the different levels of artefacts. Metrics linked to a previous state of the project, like the number of lines changed since the previous version or the number of commits during the last week, lack semantic context and become meaningless. These metrics have thus been discarded in the version data sets.

More projects are to be analysed now that the analysis process is clearly defined and fully automated. The architecture we set up allows us to quickly add new samples. A full weekly analysis of some Eclipse projects (including Papyrus, Mylyn, Sphinx, and Epsilon) is about to be published, and single snapshots are added regularly.

Part III

Squore Labs

In theory, theory and practice are the same. In practice, they are not.

Albert Einstein


Chapter 7

Working with the Eclipse foundation

The first Squore Lab was initiated in November 2012 as a collaboration with Airbus on the Topcased project, an open-source initiative targeted at embedded systems development and led by major actors of the aerospace industry. Squoring Technologies had been in touch with Pierre Gaufillet1 and Gaël Blondelle2 to work on a quality model and a process for the various components of the project and we implemented it on the Maisqual server. In 2013 Topcased was included in the new Polarsys Industry Working Group, operating under the Eclipse foundation umbrella, and we were invited to participate in the Maturity Assessment task force.

For six months we participated in conference calls and meetings, really pushing the project forward, discussing issues with major actors from the industry (Airbus, Ericsson, Mercedes, BMW, among others) and the Eclipse Foundation, and we finally came up with a fully-fledged quality model for Polarsys. We also worked with Wayne Beaton3 during this time.

This chapter describes how we conducted this project, the steps we followed to propose a tailored quality model, and its implementation. Section 7.1 gives a quick description of the context of the mining program. The data mining program was shaped according to the approach defined in Foundations: sections 7.2, 7.3 and 7.4 respectively declare the intent of the program, define the quality requirements and describe the metrics to be used. Section 7.5 explains how metrics are mapped to the quality attributes. Section 7.6 shows some results and section 7.7 summarises what we learned from this project and our contribution.

1Pierre Gaufillet is a software engineer at Airbus in Toulouse. Advocating an open source strategy and model-driven engineering, he is also involved in the Topcased and OPEES initiatives. 2Gaël Blondelle is the French representative and community manager for Eclipse. He co-founded the Toulouse JUG in 2003, and is now heavily involved in the Polarsys IWG. 3Wayne Beaton is responsible for the Eclipse Project Management Infrastructure (PMI) and Common Build Infrastructure (CBI) projects.

7.1 The Eclipse Foundation

In the late 1990s IBM started the development of a Java IDE, with a strong emphasis on modularity. Initial partners were reluctant to invest in a new, unproven technology, so in November 2001 IBM decided to adopt the open source licensing and operating model for this technology to increase exposure and accelerate adoption. The Eclipse consortium, composed of IBM and eight other organisations, was established to develop and promote the new platform. The initial assumption was that the community would own the code and the commercial consortium would drive marketing and commercial relations.

The community started to grow, but the project was still considered by many as an IBM-controlled effort. To circumvent this perception the Eclipse Foundation was created in 2003 as a non-profit organisation with its own independent, paid professional staff, financed by member companies. This move has been a success. From its roots, and thanks to its highly modular architecture, hundreds of projects have flourished, bringing more tools, programming languages and features. Now Eclipse is the de facto standard both for the open source community and for industry. It is one of the largest open source software foundations, and a wonderful example of open source success.

An important characteristic of the Eclipse Foundation is its strong link to industry. Many companies evolve around the Eclipse ecosystem, shipping products based on the platform, offering consulting services, or simply using it on a large scale. This collaboration takes place first in Industry Working Groups (IWG), which are task forces working on a specific subject, with clear requirements, milestones, and well-defined outputs.

The Polarsys Industry Working Group was started in 2012 to address concerns specific to the development of critical embedded systems. Namely, its objectives are to:

Provide Very Long Term Support – from 10 up to 75 years. Provide certification to ease the tools qualification in complex certification processes. Develop the ecosystem of Eclipse tools for Critical Embedded Systems.

7.2 Declaration of intent

The objectives of the conducted program have been identified as follows:

Assess project maturity as defined by the Polarsys working group's stated goals. Is the project ready for large-scale deployments in high-constraint software organisations? Does it conform to the Eclipse foundation recommendations? Help teams develop better software with regard to the quality requirements defined. Propose guidance to improve the management, process and product of projects.

4See www.polarsys.org.

Establish the foundation for a global agreement on quality conforming to the Eclipse way. A framework to collaborate on the semantics of quality is the first step to a better understanding and awareness of these concerns in the Eclipse community.

7.3 Quality requirements

The first thing we did was to establish the quality requirements of the Eclipse foundation, and more specifically in the context of embedded systems for Polarsys. We gathered a set of concerns from the foundation’s stated guidelines, recommendations and rules, and discussed with the Polarsys team the establishment of criteria for good-citizen projects.

Figure 7.1: Proposed Eclipse quality model.

Because of their open-source nature, Eclipse components have strong maintainability concerns, which are actually reinforced by the Polarsys long-term support requirements. In the spirit of the ISO/IEC 9126 standard [94] these concerns are decomposed into analysability, changeability and reusability. Another quality requirement is reliability: Eclipse components are meant to be used in bundles or stacks of software, and the failure of a single component may threaten the whole application. For an industrial deployment over thousands of people in worldwide locations this may have serious repercussions.

The Eclipse foundation has a strong interest in IP management and predictability of outputs. Projects are classified into phases, from proposal (defining the project) to incubating (growing the project) and mature (ensuring project maintenance and vitality). Different constraints are imposed on each phase: e.g. setting up reviews and milestones, publishing a roadmap, or documenting backward compatibility. In the first iteration of the quality model only IP management and planning management are addressed.

Communities really lie at the heart of the Eclipse way. The Eclipse manifesto defines three communities: developers, adopters, and users. Projects must constitute, then grow and nurture their communities regarding their activity, diversity, responsiveness and support capability. In the context of our program we will mainly focus on developer and user communities.

Figure 7.2: Proposed Eclipse quality model: Product.

Figure 7.3: Proposed Eclipse quality model: Process.

7.4 Metrics identification

Source code Source code is extracted from Subversion or Git repositories and analysed with Squore [11]. Static analysis was preferred because automatically compiling the different projects was unsafe and unreliable. Further analysis may be integrated with continuous build servers to run dynamic analysis as well. Metrics include common established measures like line-counting (comment lines, source lines, effective lines, statements), control-flow complexity (cyclomatic complexity, maximum level of nesting, number of paths, number of control-flow tokens) or Halstead's software science metrics (number of distinct and total operators and operands). Findings include violations from Squore, Checkstyle and PMD. All of these have established lists of rules that are mapped to practices and classified according to their impact (e.g. reliability, readability).

Figure 7.4: Proposed Eclipse quality model: Community.

SCM metadata A data provider was developed to retrieve software configuration metadata from the two software configuration management systems used inside Eclipse, Subversion and Git. The metrics identified are the following: number of commits, committers, committed files, and fix-related commits. All measures are computed for the last week, last month and last three months as well as in total. This enables us to identify recent evolution of the project by giving more weight to the most recent modifications.
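As an illustration, the sketch below shows how such windowed SCM metrics could be derived from an already-parsed commit log. This is not the actual Maisqual data provider: the commits data frame and its columns (date, author, nb_files, is_fix) are assumptions made for the example.

# Illustrative sketch only: windowed SCM metrics from a pre-parsed commit log.
# The 'commits' data frame and its columns (date, author, nb_files, is_fix)
# are assumptions made for this example.
scm_metrics <- function(commits, ref_date = max(commits$date)) {
  windows <- c(week = 7, month = 30, three_months = 90, total = Inf)
  sapply(windows, function(days) {
    w <- commits[as.numeric(ref_date - commits$date) <= days, ]
    c(scm_commits    = nrow(w),                   # number of commits
      scm_committers = length(unique(w$author)),  # distinct committers
      scm_files      = sum(w$nb_files),           # committed files
      scm_fixes      = sum(w$is_fix))             # fix-related commits
  })
}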

Communication Communication channels used at Eclipse are NNTP news, mailing lists and web forums. All of these can be mapped and converted to the mbox format, which we then parse to extract the metrics we need. Metrics retrieved are the number of threads, distinct authors, the response ratio and median time to first answer. Metrics are gathered for both user and developer communities, on different time frames to better grasp the dynamics of the project evolution (one week, one month, three months).
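For illustration, the following sketch computes these communication metrics from an mbox that has already been parsed into a data frame. The messages data frame and its columns (msg_id, in_reply_to, author, date) are assumptions for the example and do not reflect the actual Maisqual parser.

# Illustrative sketch: communication metrics from a parsed mailing list.
# The 'messages' data frame and its columns are assumptions for this example;
# 'date' is expected to be a POSIXct timestamp.
mls_metrics <- function(messages) {
  starters <- messages[is.na(messages$in_reply_to), ]   # thread roots
  replies  <- messages[!is.na(messages$in_reply_to), ]
  # Delay (in hours) before the first answer, for each thread root.
  first_answer <- sapply(starters$msg_id, function(id) {
    r <- replies[replies$in_reply_to == id, ]
    if (nrow(r) == 0) return(NA)
    as.numeric(min(r$date) - starters$date[starters$msg_id == id],
               units = "hours")
  })
  c(threads = nrow(starters),
    authors = length(unique(messages$author)),
    response_ratio = mean(!is.na(first_answer)),
    median_first_answer_hours = median(first_answer, na.rm = TRUE))
}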

Process repository The Eclipse foundation has recently started an initiative to centralise and publish real-time process-related information. This information is available through a public API returning JSON and XML data about every Eclipse component. At that time only a few fields were actually fed by projects or robots, and although we worked with Wayne Beaton on the accessibility of metrics it was decided to postpone their utilisation to the next iteration. Here are some examples of the metrics defined in the JSON file: description of the basic characteristics of the project, the number and status of milestones and reviews, their scope in terms of Bugzilla change requests. The XML API provides valuable information about intellectual property logs coverage.

7.5 From metrics to quality attributes

The next step in the mining process is to define the relationships between attributes of quality and the metrics defined. As pointed out earlier (cf. section 2.3), there is no single definitive truth here and the people involved in the process have to be in agreement and

the work published. As for the product quality part (cf. figure 7.2), all sub-characteristics have been constructed with the same structure:
1. One or more technical debt indexes computed at different levels (e.g. File, Class, Function Characteristic Index). As an example, reusability is often considered at the class level, hence a Class Reusability Index.
2. A number of violations of rules (a.k.a. Non-conformities Index for the Characteristic) that are linked to the quality characteristic (e.g. fault tolerance, analysability).
3. A ratio of acquired practices (a.k.a. Adherence to Characteristic Standards), which is the number of rules that are never violated divided by the total number of rules for the quality characteristic.
The process part is loosely mapped to process-related standards such as CMMI. Sub-characteristics are linked to metrics from the process repository (e.g. boolean values such as Is there a wiki? or Is there a download site?, milestones or reviews), IP logs, test results, and communication metrics. The exact decomposition is shown in figure 7.3. The community quality part is tied to SCM metrics (e.g. number of commits or committers) and communication metrics (e.g. number of distinct authors, ratio of responses, or median time to first answer). The exact decomposition is shown in figure 7.4.
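As an illustration of the product-quality ingredients above, the following sketch computes the adherence ratio and the non-conformities count for one characteristic. The rules and violations data frames and their column names are assumptions made for the example, not Squore's internal representation.

# Illustrative sketch: adherence and non-conformities for one characteristic.
# 'rules' maps each rule to a quality characteristic; 'violations' holds one
# row per finding. Both data frames are assumptions made for this example.
adherence <- function(rules, violations, characteristic) {
  in_char  <- rules$rule[rules$characteristic == characteristic]
  violated <- unique(violations$rule[violations$rule %in% in_char])
  length(setdiff(in_char, violated)) / length(in_char)  # rules never violated
}
non_conformities <- function(rules, violations, characteristic) {
  in_char <- rules$rule[rules$characteristic == characteristic]
  sum(violations$rule %in% in_char)                     # number of findings
}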

7.6 Results

The first output of this work is the Eclipse quality model depicted in figure 7.1 on page 129. We made a proposal to the working group; from this proposal, the members of the group discussed and improved the quality attributes and model. They were also able to better visualise the decomposition of quality and the relationships between quality attributes and metrics. This led to new ideas, constructive feedback and fruitful discussion, and the group could eventually agree on a common understanding of quality and build upon it. The prototype for this first iteration of the quality program has been implemented with Squore [11], which provides several graphical and textual means to deliver the information: A quality tree that lists the quality attributes and shows the non-conformance to the identified requirements (cf. figure 7.5a). The evolution or trend since the last analysis is also depicted for every quality attribute, which enables one to immediately identify the areas that have been improved or degraded. Action lists, classified according to the different concerns identified: overly complex or untestable files, naming convention violations, or a refactoring wish list. An example is shown in figure 7.6b. Clear coloured graphics that immediately illustrate specific concepts, as shown in figure 7.6a. This proves to be especially useful when the characteristics of an incriminated file make it significantly different from the average, so it can be identified at first sight, for example for highly complex or unstable files.

(a) Epsilon quality tree. (b) Sphinx quality tree.

Figure 7.5: Examples of results for Polarsys.

From the maturity assessment perspective, the quality model tree as shown in figure 7.5 makes it possible to quickly identify the requirements that are not fulfilled by the project. A good example is the Sphinx project, which has fairly good product and process quality, but receives low marks on the community aspects. When investigating, we found out that there is a single company behind all commits and that they do not respond to the mailing list, which is a serious concern for the project's maintainability and overall durability. This is something hard to identify without the indicators that have been set up. The other objective of the project was to propose pragmatic advice on quality improvement, both at the product and process levels. This is achieved through several means:

Lists of violations of the defined practices (figure 7.6b), with their description, precise location (line number in the file), impact, and possible solutions. The delta with the last analysis is also shown to highlight what has been recently introduced. Action item lists dedicated to some important violations or specific areas of improvement. The model typically proposes to act first on the practices that are almost acquired, e.g. with a low number of violations, before acting on practices that demand a huge effort to comply with. The intent is both to secure the acquired practices and not to discourage developers with an extensive (and useless) list of thousands of warnings. The decomposition into artefacts (application > folders > files > classes > functions) allows us to focus on the parts of the product that have lower quality, both to warn developers about recent modifications and to guide refactoring.

(a) Evolution of Gendoc files notations. (b) Eclipse E4 findings.

Figure 7.6: Examples of results for Polarsys.

Graphics that allow us to grasp some aspects of the software. Examples are the repartition of file evaluations across different versions to show their trend (cf. figure 7.6a) or plots showing the instability of files (number of commits and fixes).

7.7 Summary

This was a very instructive project, which benefited all actors involved in the process. We achieved the following objectives during this prototype:

Lay down the foundations of a constructive discussion on software quality and sow the seeds of a common agreement on its meaning and ways to improve it. A first, fully-featured version of the quality model gathered enough agreement to be recognised by all parties. The quality model would then be enhanced during the next iterations, with new quality attributes and new metrics. Define a wide spectrum of metrics to address the different quality attributes, from code to process and community. Practices and metrics have been publicly defined, their retrieval process has been automated and a common agreement has been established as to their links to the various quality attributes. Build a functional proof of concept: the prototype we installed on the Maisqual server demonstrated the feasibility and usability of such a program.

A licensing problem arose in July 2013 when Polarsys decided that all tools involved in the working group had to be open source. Unfortunately, the Eclipse foundation and Squoring Technologies could not come to an accord and the involvement in Polarsys stopped in August 2013. Before leaving we described in the Polarsys wiki the method, metrics and quality model for the community to reuse and enhance. Gaël Blondelle and

Victoria Torres (from the Universitat Politècnica de València, Spain) are now working on the next iteration of the prototype, continuing the early steps we began. From the Squoring Technologies perspective, this work brought two substantial benefits: the marketing service could use it for a communication campaign, and the experience gathered on the process and community aspects has been integrated in the Squore product models. This eventually gave birth to another process-oriented assessment project on GitHub. From the Maisqual perspective, this was a successful implementation of the approach we defined earlier, in an industrial context (see section 5.2). People were happy to build something new, conscious of the different constraints and pitfalls, and our collaboration was very profitable. We could produce both a quality model and a prototype that were considered satisfying by stakeholders and users. Finally the goals, definitions, metrics and resulting quality model were presented at the EclipseCon France conference held in May 2013 in Toulouse and received good and constructive feedback5. "Squoring proposal for the Eclipse Quality Model at #eclipsecon sounds great! Hope it will be adopted by the whole eclipse community." Fabien Toral, tweeted during EclipseCon France.

5Slides for the EclipseCon France 2013 session are still available on the EclipseCon web site: www.eclipsecon.org/france2013/sessions/software-quality-eclipse-way-and-beyond.

Chapter 8

Outliers detection

One of the key features of Squore resides in Action Items, which are lists of artefacts with specific characteristics: e.g. very complex functions that may be untestable, or files with too many rule violations. They are very flexible in their definitions and can be customised for the specific needs of customers, providing powerful clues for maintenance and decision making. As of today they rely on static thresholds, which have two drawbacks: static values do not fit all situations (critical embedded systems have different constraints than desktop applications) and they frequently output a huge number of artefacts (too many for incremental quality improvement). The intent of this work on outliers is to replace the statically triggered action items with dynamic techniques that auto-adapt the detection thresholds to the project characteristics. The first section defines the requirements of this project. Section 8.2 details the statistical techniques we use for the detection of outliers. The actual implementation of the process and the precise description of the outliers we are looking for is shown in section 8.3. Results are discussed in section 8.4 and conclusions expanded in section 8.5.

8.1 Requirements: what are we looking for?

Outliers have a specific meaning in the semantic context of software engineering. Before trying to find something special, we first need to define what kind of information and specificity we are looking for (e.g. files that have a control flow complexity that is significantly higher than the average files in the project). We also need to identify the characteristics and the detection technique to trigger them automatically with a good level of confidence. We intend to detect in the first round of trials the following three types of outliers:

Unreadable code: files or functions that may be difficult to read, understand and modify by other developers – obfuscated code is an extreme case of an unreadable file. We define two levels of unreadability: medium (code is difficult to read) and high (obfuscated code). For this we rely on a measure of density of operands and

operators, as output by Halstead metrics. Squore does not propose any mechanism to detect them as of today. Untestable functions that have a huge control-flow complexity. This induces a large number of test cases to obtain a fair coverage and increases the overall complexity of testing. Cloned code inside a function is detected through frequency of vocabulary and control-flow complexity. Cloning is bad for maintainability: any modification (bug correction or new feature) in one part should be implemented in the other cloned parts as well.

A strong requirement is the number of artefacts output by the method; since most Squore models deal with incremental quality improvement, we are looking for a rather small number of artefacts to focus on. Doing this prevents developers from being overwhelmed by too many issues: they can work on small increments and steadily improve their practices and the code quality. The validation of results is achieved via two means. Firstly, the Squore product comes out-of-the-box with a pre-defined set of action items, which we will use to validate our own findings when available. Since static thresholds are used, however, we do not expect to get the same number of artefacts, but rather an intersecting set with a low number of items. Secondly, when no action item exists for a given purpose or if we get a different set of artefacts, we manually check them to ensure they suit the search.

8.2 Statistical methods

We used and combined three outliers detection techniques: boxplots (as described in Laurikkala et al. [120] and an adjusted version for skewed distributions from Rousseeuw [153]), clustering-based detection, and tail-cutting, a selection method that we set up for our needs.

8.2.1 Simple tail-cutting

As presented in section 4.2.2 many metric distributions have long tails. If we use the definition for outliers of Lincke et al. [126] and select artefacts with metrics that have values greater than 85% of their maximum, we usually get a fairly small number of artefacts with especially high values. The figure on the right illustrates the selection method on one of the distribution functions we met. Table 8.1 shows the maximum value of a sample of metrics and their 95% (Max - 5%), 85% (Max - 15%), and 75% (Max - 25%) thresholds with the number of artefacts that fit in the range. Projects analysed are Ant 1.7 (1113

files), and various extracts from SVN: JMeter (768 files), Epsilon (777 files) and Sphinx (467 files).

Table 8.1: Values of different thresholds (75%, 85%, 95% and maximum value for metrics) for tail-cutting, with the number of outliers selected in bold.

Project   Metrics   75%        85%        95%        Max value (100%)
Ant       sloc      1160 (3)   1315 (2)   1469 (1)   1546
          ncc        507 (3)    575 (2)    643 (2)    676
          vg         267 (3)    303 (2)    339 (1)    356
JMeter    sloc       729 (3)    827 (1)    924 (1)    972
          ncc        502 (5)    569 (3)    636 (1)    669
          vg         129 (4)    147 (2)    164 (1)    172
Epsilon   sloc      3812 (6)   4320 (6)   4828 (6)   5082
          ncc       7396 (6)   8382 (6)   9368 (2)   9861
          vg         813 (6)    922 (6)   1030 (6)   1084
Sphinx    sloc       897 (5)   1016 (4)   1136 (1)   1195
          ncc        801 (2)    907 (2)   1014 (1)   1067
          vg         233 (2)    264 (2)    295 (2)    310

The main advantage of this method is that it permits us to find points without knowing the threshold value that makes them peculiar: e.g. for the cyclomatic complexity, the recommended and argued value of 10 is replaced by a value that makes those artefacts significantly more complex than the vast majority of artefacts. This technique is a lot more selective than boxplots and thus produces fewer artefacts with higher values.
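A minimal sketch of the selection follows, assuming a project data frame of file metrics with a Name column, as in our csv data sets:

# Tail-cutting sketch: keep artefacts whose metric exceeds 85% of the maximum.
tail_cut <- function(project, metric, ratio = 0.85) {
  threshold <- ratio * max(project[[metric]], na.rm = TRUE)
  project$Name[project[[metric]] > threshold]
}
# e.g. files with an especially high cyclomatic complexity:
# tail_cut(project, "vg", 0.85)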

8.2.2 Boxplots

One of the simplest statistical outlier detection methods was introduced by Laurikkala et al. [120] and uses informal box plots to pinpoint outliers on individual variables. We applied the idea of series union and intersection1 from Cook et al. [39] to outliers detection, a data point being a multi-dimensional outlier if many of its variables are themselves outliers. The rationale is that a file with unusually high complexity, size, number of commits, and a very bad comment ratio is an outlier, because of its maintainability issues and development history. We first tried to apply this method to the full set of available metrics, by sorting the components according to the number of variables that are detected as univariate outliers. The drawback of this accumulative technique is its dependence on the selected set of variables: highly correlated metrics like line-counting measures will quickly trigger big components even if all of their other variables show

1In [39] authors use univariate series unions and intersections to better visualise and understand data repartition.

standard values. Having a limited set of orthogonal metrics with low correlation is needed to balance the available information.

Figure 8.1: Combined univariate boxplot outliers on metrics for Ant 1.7: sloc, vg (left) and sloc, vg, ncc (right).

Figure 8.1 (left) shows the sloc and vg metrics on files for the Ant 1.7 release, with outliers highlighted in blue for vg (111 items), green for sloc (110 items), and red for the intersection of both (99 items). The same technique was applied with three metrics (sloc, vg, ncc) and their intersection in the figure on the right. There are 108 ncc outliers (plotted in orange) and the intersecting set (plotted in red) has 85 outliers. Intersecting artefacts accumulate outstanding values on the three selected metrics. Because of the underlying distribution of measures, however, the classical boxplot shows too many outliers – or at least too many for practical improvement. We used a more robust boxplots algorithm: adjboxStats [153]. This algorithm is targeted at skewed distributions2 and gave better results with fewer, more atypical artefacts.
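The following minimal sketch illustrates the intersection of univariate adjusted-boxplot outliers on two metrics, assuming the same project data frame of file metrics as above:

# Intersection of univariate adjusted-boxplot outliers (robustbase).
library(robustbase)
out_sloc <- project$Name[project$sloc %in% adjboxStats(project$sloc)$out]
out_vg   <- project$Name[project$vg   %in% adjboxStats(project$vg)$out]
multi    <- intersect(out_sloc, out_vg)   # files outlying on both metrics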

8.2.3 Clustering

Clustering techniques allow us to find categories of similar data in a set. If a data point cannot fit into a cluster, or if it is in a small cluster (i.e. there are very few items that have these similar characteristics), then it can be considered an outlier [23, 165]. In this situation, we want to use clustering algorithms that produce unbalanced trees rather than evenly distributed sets. Typical clustering algorithms used for outliers detection are k-means [98, 197] and hierarchical [73, 3]. We used the latter because of the very small

2Extremes of the upper and lower whiskers of the adjusted boxplots are computed using the medcouple, a robust measure of skewness.

Table 8.2: Cardinality of clusters for hierarchical clustering for Ant 1.7 files (7 clusters).

Euclidean distance
Method used   Cl1    Cl2   Cl3   Cl4   Cl5   Cl6   Cl7
Ward          226    334    31    97   232   145    48
Average      1006      6    19    76     3     1     2
Single       1105      3     1     1     1     1     1
Complete      998     17    67     3    21     3     4
McQuitty     1034      6    57    10     3     1     2
Median       1034      6    66     3     1     2     1
Centroid      940     24   142     3     1     2     1

Manhattan distance
Method used   Cl1    Cl2   Cl3   Cl4   Cl5   Cl6   Cl7
Ward          404     22    87   276    62   196    66
Average       943     21   139     4     3     1     2
Single       1105      3     1     1     1     1     1
Complete      987     17    52     3    47     3     4
McQuitty     1031     12    60     3     1     4     2
Median        984      6   116     3     1     2     1
Centroid      942     24   140     3     1     2     1

clusters it produces and its fast implementation in R [147]. Clustering techniques have two advantages: first they don’t need to be supervised, and second they can be used in an incremental mode: after learning the clusters, new points can be inserted into the set and tested for outliers [95]. We applied hierarchical clustering to file measures with different distances and agglom- eration methods. On the Apache Ant 1.7 source release (1113 files) we got the repartition of artefacts shown in table 8.2 with Euclidean and Manhattan distances. Linkage methods investigated are Ward, Average, Single, Complete, McQuitty, Median and Centroid. The Manhattan distance produces more clusters with fewer elements than the Euclidean distance. The aggregation method used also has a great impact: the ward method draws more evenly distributed clusters while the single method consistently gives many clusters with only a few individuals.
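The kind of computation behind table 8.2 can be sketched as follows, assuming project_data holds the numeric file metrics (with the Name column removed); the linkage names follow the hclust interface of recent R versions:

# Cluster cardinalities for one distance and several linkage methods.
d <- dist(project_data, method = "manhattan")
methods <- c("ward.D", "average", "single", "complete",
             "mcquitty", "median", "centroid")
sapply(methods, function(m)
  table(factor(cutree(hclust(d, method = m), k = 7), levels = 1:7)))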

8.2.4 A note on metrics selection

Metrics selection greatly influences the results: they have to be carefully selected according to the target of the analysis. The metrics used in traditional action items are selected by software engineering experts and usually show a great relevance. We still have to check them for redundancy and pertinence, sometimes removing some measures that do not bring any added value. Using too many metrics usually gives poor or unusable results

because of their relationships, the diversity of information they deliver and our ability to capture this information and its ramifications.

Figure 8.2: Outliers in clusters: different sets of metrics select different outliers.

As an example, the same detection technique (i.e. same distance and linkage method) was applied to different sets of metrics in figure 8.2: outliers are completely different depending on the metric sets used, and are even difficult to identify visually. We got the best results by first selecting the metrics identified by our experience with customers, and then challenging them by removing some (e.g. correlated) measures, or adding new dimensions that may have had an impact on the practice we wanted to target. Results were manually checked: plots allow us to quickly dismiss bad results and lists of items allow fine-grained verification.

8.3 Implementation

8.3.1 Knitr documents

The first trials were run through a dedicated Knitr document. It featured different outliers detection techniques, combined and applied on different sets of metrics and different projects. The different techniques implemented were:
Boxplots outliers detection, both univariate and multivariate, on common metrics (e.g. vg, cft or ncc). Graphs are plotted to highlight the outliers visually (see appendix C.1). A second part lists the top 10 outliers on each metric.
Univariate boxplots sorted: files with a higher number of outliering metrics are highlighted and displayed as a list, with the metrics that are considered as outliers for them. An example is provided in appendix C.1.

Hierarchical clustering, based on Euclidean and Manhattan distances. Various aggregation methods and numbers of clusters are tried and output as tables and graphs. An example of a run on the Papyrus project is shown in appendices C.1 at page 266 and C.1 at page 267.
Local Outlier Factor [30], implemented in DMwR::lofactor [177], applied both on a set of single metrics and on a multivariate set.
PCOut [65], a fast algorithm for identifying multivariate outliers in high-dimensional and/or large datasets. Based on the robustly sphered data, semi-robust principal components are computed which are needed for determining distances for each observation. Separate weights for location and scatter outliers are computed based on these distances. The combined weights are then used for outlier identification. Without optimisation it produces very high rates of outliers: 95 out of 784 files (12%) for JMeter (2005-03-07 extract), and 1285 out of 5012 files (25%) for Papyrus (2010-01-18 extract).
covMCD [153] from V. Todorov [176], another outliers detection algorithm developed for large data sets. It computes a robust multivariate location and scale estimate using the fast MCD (Minimum Covariance Determinant) estimator. It is well suited to normal data, but unfortunately in the samples we ran the number of outliers is far too high to be usable in our context: 225 out of 784 files (28%) for JMeter, 1873 out of 5012 files (37%) for Papyrus.
For each technique, results are displayed both through graphics (to grasp the essence of the specificities of selected artefacts) and as lists of outliers. The document is long (112 pages on average, although it depends on the number of outliers listed) and could not be included in its entirety in the appendix. However a few pages have been extracted from the Papyrus analysis to demonstrate specific concerns in appendix C.1:

The list of figures and tables – lists of outliers are output with a different environment and do not appear there.
The basic summary of metrics showing minimum, maximum, mean, variance and number of modes for all measures.
An example of scatterplot for a 3-metrics combination of boxplots outliers.
The table of metrics with the count of outliers and percentage of the data sets, for the univariate sorting method.
An example of plots for the hierarchical clustering, with a dendrogram of artefacts for the Euclidean distance and Ward aggregation method, and a table summarising the number of elements in each cluster for the different aggregation methods.
Another example of hierarchical clustering with the Manhattan distance and the Single aggregation method. Compared to the previous sample, the Ward method separates the artefacts into several balanced clusters while the Single method produces one huge cluster with many adjacent, small clusters.
The prototype was considered to perform well enough to allow for implementation in the Squore product.

8.3.2 Integration in Squore

The integration in Squore required a few changes in the engine itself, since some inputs needed for the outliers detection mechanism were available only at a later stage in the process. A Squore analysis first executes the Squore analyser, then runs the data providers to attach information to the artefacts, runs the computations for the derived data and then computes action items. We had to slightly modify the execution flow to implement back loops to use the findings at the AddData stage and provide outliers detection information to the engine, as shown in figure 8.2.

(a) Squore architecture before OD.

(b) Squore architecture with OD.

Figure 8.2: Modifications in the architecture of Squore for outliers detection.

We implemented a modular architecture to allow easy integration of new outliers types. Outliers are detected using R algorithms provided as external scripts which are automatically executed by the engine when they are put in the data provider directory. As for the data provider itself, the sequence of actions is as follows:

1. Metrics for files and functions output by the Squore analyser are read. Some basic sanity checks are performed regarding the availability of metrics and the consistency of input files.
2. The od_init.R file, located in the data provider directory, is executed as a preamble to the outliers detection. It computes some derived measures which we do not have at that time in Squore, such as vg, vocf, or r_dopd (dopd/sloc).
3. Each type of outlier has a specific file in the same directory. These are automatically read and executed in an R session when the data provider is run. Artefacts are tagged with a measure of their outlierness and the results can be used afterwards in graphs and tables in the Squore interface.

8.3.3 R modular scripts

A naming convention was defined for the name and type of artefacts of the detection technique used. The files used to configure detection have the following format: od_<artefact>_<type>.conf, where <artefact> is one of c("file", "function") and <type> is the descriptor of the outliers found. In the quality model used afterwards the outliers have a measure IS_OUTLIER_<TYPE> set to 1. Hence it is very easy to add a new type of outlier and its associated detection technique. The goal is to allow new types and implementations of outliers in specific contexts, using specific metrics, quickly and easily – one does not even need to restart Squore to take modifications into account. The structure of these files is twofold. We first define the metrics we want to use in the detection mechanism; their values are dynamically generated and made available to the R script through command-line parameters upon execution. The second part gives the R code that actually computes the outliers and returns them stored in an array. The following example shows a typical sample that simply returns the first 10 lines of the data set as outliers (purposely named outs), and needs three metrics: vg, sloc, and ncc.

= START METRICS =
VG SLOC NCC
= END METRICS =

= START R =
outs <- project[1:10,1]
= END R =

8.4 Use cases

Once a comfortable setup had been designed and implemented, we could experiment with our data easily (see the next paragraph for the data sets considered). We defined three outliers detection techniques for the use cases defined in Section 8.1: hard to read files and functions, untestable functions, and code cloning inside a function.

We conducted the tests and the validation phase on a set of 8 open-source projects of various sizes: 4 are written in C and 4 in Java. Results are reproduced in the tables below, and elements of comparison are given when available – i.e. when existing action items provide static thresholds in the Squore base product.

8.4.1 Hard to read files and functions

Inputs

This type of outlier has no equivalence in the Squore product. We discovered that code with a high density of operands and operators – as shown by Halstead’s dopd, dopt, topd, topt measures – is more difficult to read and understand, at least when it is compressed on a few lines. For this purpose four derived density measures are computed in the initialisation file (od_init.R) and passed on to the script:

r_dopd is the number of distinct operands divided by the number of source lines of code. r_dopt is the number of distinct operators divided by the number of source lines of code. r_topd is the total number of operands divided by the number of source lines of code. r_topt is the total number of operators divided by the number of source lines of code.
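A minimal sketch of the corresponding initialisation step is shown below, assuming the project data frame exposes the Halstead counts and sloc produced by the analyser; the exact content of od_init.R may differ.

# Derived density ratios, in the spirit of od_init.R.
project$r_dopd <- project$dopd / project$sloc
project$r_dopt <- project$dopt / project$sloc
project$r_topd <- project$topd / project$sloc
project$r_topt <- project$topt / project$sloc
# Very small files would yield artificially high ratios; discard them.
project <- project[project$sloc > 10, ]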

The sloc metric is also used as an input to the script in order to remove files that are too small. The rationale is that people reading this code need to track more items (i.e. high values for topd and topt), in a more complex control flow (i.e. high values for dopd and dopt) to grasp the algorithm. An example of an obfuscated file that can be easily identified with such high ratios is provided below:

# include// .IOCCC Fluid- # # include //2012 _Sim!_ # # include //|||| ,____. IOCCC- # # define h for( x=011; 2012/* # # */-1>x ++;)b[ x]//-’ winner # # define f(p,e) for(/* # # */p=a; e,p10?32< x?4[/* ## ## */*r++ =w,r]= w+1,*r =r[5]= x==35, r+=9:0 ,w-I/* ## ## */:(x= w+2);; for(;; puts(o ),o=b+ 4){z(p [1]*/* ## ## */9,2) w;z(G, 3)(d*( 3-p[2] -q[2]) *P+p[4 ]*V-/* ## ## */q[4] *V)/p[ 2];h=0 ;f(p,( t=b+10 +(x=*p *I)+/* ## ## */80*( y=*p/2 ),*p+=p [4]+=p [3]/10 *!p[1]) )x=0/* ## ## */ <=x &&x<79 &&0<=y&&y<23?1[1 [*t|=8 ,t]|=4,t+=80]=1/* ## ## */, *t |=2:0; h=" ’‘-.|//,\\" "|\\_" "\\/\x23\n"[x/** ## ## */%80- 9?x[b] :16];;usleep( 12321) ;}return 0;}/* ## #### #### ############################################################################### **###########################################################################*/

Algorithm

We applied two detection techniques to identify hard to read files: tail-cutting and cluster- based. Artefacts found by both methods were considered as heavy, high-impact unreadable files, whereas artefacts that were exclusively identified by only one of the methods were considered as medium-impact unreadable. We also had to add a criterion on the size of files, since the enumerated ratios can be set to artificially high values because of a very small sloc without having a large dopd/dopt/etc., as it is the case for Java interfaces. The size threshold has been set to sloc > 10. The R code serving this purpose for files is quite long and is shown in appendix D.3 on page 285. At the function level, best results were achieved with cluster selection only. In the following code, only clusters which represent less than 10% of the total population are retained and their included items selected: = START R = c l u s t s < 7 − project_d < dist(project_data , method="manhattan") − hc < hclust(d=project_d , method="single") − groups_single < cutree(hc, k=clusts) − mylen < ceiling( nrow(project_data) ∗ 0 . 1 ) − outs < project[0 ,1:2] − names(outs) < c("Name", "Cluster") − for (i in 1:clusts) { if ( length( groups_single[groups_single == i] ) < mylen ) { d < project[groups_single == i , 1:2]; − names (d) < c("Name", "Cluster"); − d [ , 2 ] < i ; outs < rbind(outs, d) − − } } outs_all < outs − outs < outs_all[ ,1] − = END R =

Results

These methods gave good results. Figure 8.3 shows the number of hard to read files and functions for the various projects we analysed. We always find a fair number of them, defining a small set of items to watch or refactor for developers. For the C projects we retrieved some outliers from the International Obfuscated C Code Contest web site (see IOCCC [138]) and scattered them in the test projects. We intended to identify all of them with our method, plus a few files in the project that definitely had analysability concerns. It worked quite well since the algorithm found the obfuscated files in all cases, either as high-impact or medium-impact. The high-impact method unveiled obfuscated files only, and one obfuscated file (the most readable one among the obfuscated files) was caught by the medium-impact rule. When more files were found by the medium-impact rule, they were found to be hard to read as well. The following tables

show numeric results on the Java and C projects: we get only a few Hard To Read files and they all show readability concerns.

Figure 8.3: Hard To Read files and functions with our outliers detection method.

Project           Jacoco   HoDoKu   Freemind   JFreeChart
File HTR High     0        0        1          1
File HTR Medium   6        8        3          6
Func HTR High     1        2        1          1
Func HTR Medium   3        4        1          3

Project                              Evince   Agar   Amaya   Mesa
File HTR High (injected/natural)     4/0      4/0    4/0     4/0
File HTR Medium (injected/natural)   1/3      1/5    1/3     1/4
Func HTR High                        0        1      0       0
Func HTR Medium                      4        3      5       3

Figure 8.4 highlights the number of hard to read files detected with our method: all obfuscated files are identified (purple and blue) and a few other natural files (orange) are proposed, which all have analysability concerns.

Figure 8.4: Hard To Read files with injected outliers.

Obfuscated files do not define functions that would trigger the Hard-To-Read flag. As a consequence, the functions highlighted by the outliers detection mechanism are only real-life functions found in the project itself and are hugely complex. An example of such a function is shown in appendix D.2.

8.4.2 Untestable functions

Inputs

Untestable functions are detected according to three measures that reflect the complexity of the code control flow. All of them have to be triggered to make the action item pop up:

vg is the cyclomatic complexity, a graph-complexity measure derived from the graph theory by McCabe [130]. Its forbidden range is defined as [30; ∞[.
nest is the level of control-flow structures nesting. Its forbidden range is defined as [4; ∞[.
npat is the total number of execution paths. Its forbidden range is defined as [800; ∞[.
For domains with high constraints on testing and verification (like space or aeronautics, for which all (100% of) execution paths of the program must be covered by tests) this may be very expensive and may even block certification.
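For comparison, the static trigger that the outliers detection is meant to replace can be sketched as a simple conjunction of the three forbidden ranges; the func_metrics data frame is an assumption made for this example.

# Static, action-item style trigger: all three metrics in their forbidden range.
untestable <- with(func_metrics, vg >= 30 & nest >= 4 & npat >= 800)
func_metrics$Name[untestable]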

Algorithm

We want to highlight functions that cumulate high values on many of their attributes. To do this, we rely on the recipe experimented with in the Version Analysis Knitr document (see section 4.2). Outliers are identified with adjusted boxplots on each metric, and artefacts

which have the highest number of outliering variables are selected. The R algorithm used is the following:

= START R =
require('robustbase')
dim_cols <- ncol(project)
dim_rows <- nrow(project)
outs_all <- data.frame(matrix(FALSE, nrow = dim_rows, ncol = dim_cols))
colnames(outs_all) <- names(project)
outs_all$Name <- project$Name
for (i in 2:ncol(outs_all)) {
  myouts <- which(project[,i] %in% adjboxStats(project[,i])$out)
  outs_all[myouts, i] <- TRUE
}
for (i in 1:nrow(outs_all)) {
  nb <- 0
  for (j in 1:dim_cols) {
    nb <- nb + (if (outs_all[i,j] == TRUE) 1 else 0)
  }
  outs_all[i, 1+dim_cols] <- nb
}
outs_all <- outs_all[order(outs_all[,1+dim_cols], decreasing=T), ]
outs <- subset(outs_all, outs_all[,1+dim_cols] == max(outs_all[,1+dim_cols]))
outs <- project_function[row.names(outs[,0]), "Name"]
= END R =

Results

In the tables shown below, OD lines show the number of artefacts identified by the outliers detection method while AI lines give the number of artefacts found by traditional, statically-defined action items. OD & AI lines give the number of artefacts that are found by both techniques. The coverage line indicates how many of the outliers are effectively detected by the action items while the matching line shows how many of the action items are found by the outliers. A 100% coverage means that all outliers are found by action items, and a 100% match means that all action items are covered by outliers. From our results, we took the best of both measures because in the first case the outliers of the system are only a part of the action items (because they have much higher values than the average), and in the second case there are more outliers because the threshold has been lowered compared to the average files of the project.

Project (Java)         Jacoco   HoDoKu   Freemind   JFreeChart
Number of functions    1620     2231     4003       7851
OD Untest.             13       6        30         32
AI Untest.             5        23       4          10
OD & AI Untest.        5        6        4          10
Coverage               38%      100%     13%        31%
Matching               100%     26%      100%       100%
Overall intersection   100%     100%     100%       100%

Results are very good: in almost all cases we find an optimum intersection between action items and outliers, with a fair (i.e. low) number of artefacts. The Mesa project is the only exception, with some distinct outliers identified by both methods (AI & OD) – which is a very interesting feature, as we will see later on.

Project (C)            Evince   Agar   Amaya   Mesa
Number of functions    2818     4277   7206    11176
OD Untest.             3        5      1       24
AI Untest.             13       17     448     76
OD & AI Untest.        3        5      1       16
Coverage               100%     100%   100%    67%
Matching               23%      29%    0%      21%
Overall intersection   100%     100%   100%    67%

Figure 8.5: Untestable functions with Squore’s action items and with our outliers detection method.

The Mesa project shows an interesting feature: some functions are identified by the outliers but not raised by action items. These are functions with a really high value on one or two of their metrics, which are not caught by the action items since their other metrics have standard values. They are however very difficult to read and test and deserve to be highlighted. The three-dimensional plot on the right clearly highlights these artefacts by showing their difference as compared to the average of files.

8.4.3 Code cloning in functions

Inputs

Code cloning refers to the duplication of code inside a function. It is one of the favourite Squore action items since it was introduced by our software engineering experts from their experience with customers and it works quite well. It relies on the following two metrics to identify potential duplicated code:

vocf is the vocabulary frequency. vocf > 10 reveals high reuse of the same vocabulary, i.e. operands and operators. vg is the cyclomatic number of the function. vg >= 30 reveals a complex control flow.

Algorithm

We use the adjboxStats R function for skewed distributions to get extreme values on each metric, then apply a median-cut filter to remove lower values – we are not interested in low-value outliers. Only artefacts that have both metrics in outliering values are selected.

= START R =
require('robustbase')
dim_cols <- ncol(project)
dim_rows <- nrow(project)
outs_all <- data.frame(matrix(FALSE, nrow = dim_rows, ncol = dim_cols))
colnames(outs_all) <- names(project)
outs_all$Name <- project$Name
for (i in 2:ncol(outs_all)) {
  myouts <- which(project[,i] %in% adjboxStats(project[,i])$out)
  outs_all[myouts, i] <- TRUE
}
for (i in 1:nrow(outs_all)) {
  nb <- 0
  for (j in 1:dim_cols) {
    nb <- nb + (if (outs_all[i,j] == TRUE) 1 else 0)
  }
  outs_all[i, 1+dim_cols] <- nb
}
outs_all <- outs_all[order(outs_all[,1+dim_cols], decreasing=T), ]
outs <- subset(outs_all, outs_all[,1+dim_cols] == max(outs_all[,1+dim_cols]))
outs <- project_function[row.names(outs[,0]), "Name"]
= END R =

Results

In the tables shown below, OD lines show the number of artefacts identified by the outliers detection method while AI lines give the number of artefacts found by traditional, statically-defined action items. OD & AI lines give the number of artefacts that are found by both techniques.

Project (Java)         Jacoco   HoDoKu   Freemind   JFreeChart
Number of functions    2818     4277     7206       11176
OD Clone.              3        8        10         13
AI Clone.              1        8        0          8
OD & AI Clone.         1        6        0          7
Coverage               33%      75%      0%         54%
Matching               100%     75%      100%       88%
Overall intersection   100%     75%      100%       88%

Project (C)            Evince   Agar   Amaya   Mesa
Number of functions    2818     4277   7206    11176
OD Clone.              2        6      10      35
AI Clone.              2        6      154     80
OD & AI Clone.         2        6      10      33
Coverage               100%     100%   100%    94%
Matching               100%     100%   6%      41%
Overall intersection   100%     100%   100%    94%

The method works quite well (although slightly less well than for the untestable functions above) on the set of C and Java projects used for validation: the intersection between action items and outliers varies from 75% to 100%. Null values could be marked as 100% since they simply show that the action items did not find any artefacts with the defined static thresholds. The number of outliers output is always relatively small, even when action items return too many artefacts (as is the case for Amaya). Figure 8.6 depicts the number of functions with code cloning for each project; our algorithm finds a relatively small number of functions, even when Squore action items find none (e.g. for Freemind) or too many of them (e.g. for Amaya). These results have been cross-validated with our staff of experts:

153 Figure 8.6: Code cloning in functions with Squore’s action items and with our outliers detection method.

the cases where outliers detection did not find the same artefacts as the classical action items (as for Mesa, HoDoKu or JFreeChart) still have potential for refactoring, because the latter cannot catch smaller pieces of duplicated code nor duplicates that have been intensively edited (but are still subject to refactoring).

8.5 Summary

The technique presented here, based on automatic outliers detection, is more specific than statically computed triggers and highlights types of files that are not discovered by classic techniques. The overall method is more robust: it will highlight extreme values regardless of the shape of the software. This robustness is demonstrated by two examples: we could detect specific untestable files within Mesa, and some code cloning functions within the Freemind project, where Squore could not identify them. Another great advantage of our method is the fair number of files or functions selected: we usually get fewer than 20-30 artefacts, even on bigger projects. This Squore Lab was the first one to implement the Maisqual roadmap as defined in section 4.5, and the first pragmatic implementation of data mining techniques into the Squore product. We first relied on Knitr documents as working drafts to run test drives and improve the methods, so we were confident enough in our results before implementing a clean integration in Squore. The Knitr documents were applied on the Maisqual data sets, and a different set of projects was used for the validation – since the validation of the full chain (from project parsing to results visualisation in the Squore interface) had to start from project sources, not flat csv files. The final integration in Squore is the ultimate achievement of this work. The integration is now fully functional and provides, out-of-the-box, the three detection

techniques described here. This first round of outliers should be expanded and improved with the specific knowledge of people involved with customers' needs. The next step is to involve consultants in the definition of new rules and let them write their own detection techniques, thanks to the modular and easy-to-use architecture. New rules should be added to the standard configuration of the Squore product as people become more confident with it. The Squore Labs outliers project took 28 man-days to complete, from requirements definition to delivery. Flavien Huynh worked on the integration into Squore and I worked on the modular architecture for configuration scripts. Accuracy of results has been considered satisfactory enough by the product owner to be integrated in the product trunk and should be made available in the upcoming 2014 major release of Squore.

Chapter 9

Clustering

Clustering, or unsupervised classification, is a very useful tool to automatically define categories of data according to their intrinsic characteristics. It has been intensively applied in areas like medicine, document classification or finance, but until now has not given practical and applicable results for software. However there is a need for unsupervised classification methods in software engineering. They offer two advantages: firstly, they automatically adapt values and thresholds to the context of the analysis, which answers some of the issues raised by the huge diversity of software projects. Secondly, they allow for autonomous rating of artefacts without the need for supervisors. This second point is mandatory for the pragmatic application of such metrics, since most projects will not take the time to train a model before using it.

In this chapter we investigate the use of unsupervised classification algorithms to define optimised groups of artefacts for quality assessment and incremental improvement. This is an on-going work: we still need to explore new metrics, test new clustering algorithms, find better optimisations for the different input parameters, and interact with experts to fine-tune our methods.

Section 9.1 quickly reviews some existing work on software measurement data clustering. In sections 9.2 and 9.3 we describe the problems we want to address in the context of Squore: automatic scaling of measures and multi-dimensional automatic quality assessment. We illustrate the paths we followed and the early results for both axes. Finally section 9.4 lists some of the pitfalls and semantic issues that arose during this work and proposes directions for future research. Appendix C.2 shows a few extracts of the Knitr documents we used.

9.1 Overview of existing techniques

Clustering methods

This chapter investigates two clustering methods: k-means and hierarchical clustering. K-means determines group membership by calculating the centroid (or mean, for univariate data) of each group. The user asks for a number of clusters, and the algorithm minimises the within-cluster sum of squares from these centres, based on a distance measure such as Euclidean or Manhattan. Other techniques may use different ways to compute the centre of clusters, like selecting the median instead of the mean for k-medoids [115, 24]. K-means cluster visualisation is usually achieved by plotting coloured data points; the centres of the clusters can also be displayed to show their centres of gravity. An example of k-means clustering on file sloc for Ant 1.8.1 is shown in the left plots of figure 9.1 page 161.

The R implementation of hierarchical clustering [134, 115, 24] we are using is hclust from the stats [147] package. It implements two distances, Euclidean and Manhattan, and several aggregation methods: Ward, Complete, Single, Average, Median, McQuitty, Centroid. Hierarchical clustering results are usually displayed through a dendrogram, as shown for Ant 1.7.0 files in figure 3.4 on page 52.

We are more specifically looking for evenly distributed clusters: having all items rated at the same level is not very useful. Furthermore, with evenly distributed clusters, one has enough examples to practically compare files with different ratings and identify good and bad practices.
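To make the two algorithms concrete, the following R sketch applies both of them to a single metric; the sloc vector is simulated for the example, whereas the actual analyses run on the Maisqual data sets.

    # Minimal sketch: univariate k-means and hierarchical clustering in R.
    # The sloc values are simulated; the real analyses use the Maisqual data sets.
    set.seed(42)
    sloc <- round(c(rlnorm(900, meanlog = 4, sdlog = 1),
                    rlnorm(100, meanlog = 6, sdlog = 0.5)))

    # k-means with 7 clusters (one per indicator level).
    km <- kmeans(sloc, centers = 7, nstart = 25)
    sort(km$centers)          # cluster centres, ordered
    table(km$cluster)         # cluster sizes

    # Hierarchical clustering with hclust from the stats package:
    # Manhattan distance, Ward aggregation, then a cut into 7 groups.
    hc <- hclust(dist(sloc, method = "manhattan"), method = "ward.D2")
    groups <- cutree(hc, k = 7)
    table(groups)
    # plot(hc, labels = FALSE)   # dendrogram view

The cut into 7 groups is what allows a direct comparison with the 7 rating levels used by Squore.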

The objective of hierarchical clustering is to find underlying structures in the data's characteristics. Depending on the type of data considered, the number of natural clusters may differ, and establishing a fixed number of groups may be inaccurate. Almeida et al. propose in [3] an improved method for hierarchical clustering that automatically finds the natural number of clusters, if any. Outliers are filtered before the analysis and re-attached to the identified clusters afterwards, which considerably increases the robustness of the algorithm.

Usage of clustering

K-means clustering was applied by Herbold [88] on time series to detect feature freezes before milestones in Eclipse projects. Zhong et al. [200] apply two different clustering algorithms (namely k-means and neural-gas) to software measurement data sets to ease the inspection process by experts: instead of evaluating thousands of software components, experts only had to look at 20 to 30 groups at most. They also apply various classifiers to fault-related metrics and compare the results with the expert-based classification, showing large false positive rates in some cases (between 20 and 40%). Naib [136] also uses k-means to classify fault-related data on large C projects to estimate the fault-proneness of software components. In [6] Antonellis et al. use the analytic hierarchy process (AHP)¹ to weight expert-based evaluations of the maintainability of the project's files. k-attractors [104] clustering is then applied to the resulting maintainability measure, and conclusions are manually drawn from the different groups identified and their correlations with individual metrics.

¹AHP is a decision-making technique that allows consideration of both qualitative and quantitative aspects of decisions [156]. It reduces complex decisions to a series of one-to-one comparisons and then synthesizes the results.

9.2 Automatic classification of artefacts

9.2.1 Squore indicators

In Squore, measures (i.e. raw numbers such as 1393 SLOC) are matched against scales to build indicators. These are qualitative levels allowing practitioners to instantly know whether the measured value can be considered good, fair, or poor in their context. Most scales have 7 levels, because it is a number that is easily grasped by the mind. Squore uses designations from level A to level G, in a way that mimics the well-known Energy Efficiency Rating logo. In our tests we will use 7 clusters, since we want to check our results against the ratings proposed by Squore. Scales are defined as static intervals: the triggering values are defined during the configuration phase of the measurement program and should not be modified after going live, because doing so would impact the rating of artefacts and threaten the temporal consistency of the analysis.
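As a small illustration of how a static scale turns a measure into an indicator, the following R sketch maps vg values onto seven levels; the threshold values are invented for the example and are not those shipped with Squore.

    # Illustrative static scale: the breaks below are made up for the example,
    # not the actual thresholds of a Squore configuration.
    rate_vg <- function(vg) {
      breaks <- c(-Inf, 4, 8, 12, 20, 30, 60, Inf)   # 7 intervals -> levels A..G
      cut(vg, breaks = breaks, labels = LETTERS[1:7])
    }
    rate_vg(c(1, 7, 15, 45, 120))
    # [1] A B D F G
    # Levels: A B C D E F G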

9.2.2 Process description

From the consulting perspective, the typical execution of a Squore assessment process is as follows:

The customer first has to define her requirements for the quality assessment program. The consultant helps the customer select a model that best fits her needs and identified requirements – Squore has a predefined set of models for many different types of industry. When the quality model is defined, the consultant has to calibrate it: meaningful values have to be chosen for the rating of artefacts. As an example, complexity thresholds are lowered for critical embedded systems with strict certification constraints.

Calibration is achieved by running the quality model on real data sources and projects and making it match the customer's feeling of what is good, fair or bad. The intent is to enforce the representation condition of the measurement program.

9.2.3 Application: the auto-calibration wizard

We wanted a non-intrusive tool to help in the calibration phase of a Squore deployment without interfering with the consultant's usual methods. For this we wrote a Knitr document that takes as input the results of a project analysis run and proposes scales tailored to the project's values. The document can be applied to a large data set containing several components to compute optimised thresholds for each metric from the full range of values in the projects. Since a quality model is supposed to be used by many projects of various sizes and with different types of constraints, such proposals are useful to set up balanced ranges for the very context of the customer.

Univariate k-means clustering is applied to each metric individually and various pieces of information are displayed to help the consultant calibrate the model. The limits, or triggering values, that define the clusters are inferred from the computed repartition, and the document proposes an adapted range with the associated repartition of artefacts. It draws colourful plots of the data to visually grasp the repartition of artefacts according to their rating. Table 9.1 shows the table output by the document for the vg and sloc metrics on the Ant 1.8.1 release files. For each metric, we give the number of items in the cluster, and the minimum and maximum values delimiting the cluster's boundaries.

Table 9.1: Auto-calibration ranges for Ant 1.8.1 file metrics.

Metric  Levels                          A     B     C     D     E     F     G
vg      Number of items in cluster    707    45    27     6   111   275     5
vg      Min range                       0    15    38    72   117   183   286
vg      Max range                      14    37    71   115   167   234   379
sloc    Number of items in cluster    167    27   531     8    85    53   305
sloc    Min range                       2    48   107   192   328   529   952
sloc    Max range                      47   106   188   320   513   855  1617

The first (upper) row in figure 9.1 shows the repartition of file artefacts for Ant 1.8.0 on the vg metric, with the k-means clustering algorithm and with the Squore static values. The next pictures show the same output for the sloc metric. These figures clearly show that the repartition of artefacts is much more balanced with automatic clustering than with static thresholds. This is especially true for higher values, which is considered good for highlighting artefacts that have outstanding characteristics. One can notice that the range proposed by the algorithm is discontinuous. We modify it to make it a continuous scale and include outer limits so that the full range spans from 0 to infinity. In the vg example, a scale is proposed from the above ranges and formatted according to the structure of Squore XML configuration files, so practitioners just have to copy and paste the desired scale; a sketch of how such a scale can be derived is given below.
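The listing below is a minimal sketch of this derivation, not the actual Squore implementation: the vg values are simulated, and the XML element and attribute names printed at the end are placeholders rather than the real Squore configuration schema.

    # Sketch of the auto-calibration idea: cluster one metric with univariate
    # k-means, read off the cluster boundaries, then make the scale continuous
    # from 0 to infinity. Values are simulated; XML tags are illustrative only.
    set.seed(1)
    vg <- rpois(1000, lambda = 8) + rgeom(1000, prob = 0.05)

    km   <- kmeans(vg, centers = 7, nstart = 50)
    ord  <- order(km$centers)                 # clusters sorted by centre value
    lvl  <- match(km$cluster, ord)            # 1 = lowest values ... 7 = highest
    mins <- tapply(vg, lvl, min)
    maxs <- tapply(vg, lvl, max)
    data.frame(level = LETTERS[1:7], items = as.vector(table(lvl)),
               min = as.vector(mins), max = as.vector(maxs))

    # Continuous ranges spanning [0, +inf): each level starts where the
    # previous one ends, and the last level is open-ended.
    lower <- c(0, maxs[-7])
    upper <- c(maxs[-7], Inf)
    cat(sprintf('<level name="%s" min="%s" max="%s"/>\n', LETTERS[1:7], lower, upper))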

Figure 9.1: Examples of univariate classification of files with k-means and Squore. (a) vg with k-means; (b) vg with Squore; (c) sloc with k-means; (d) sloc with Squore.

9.3 Multi-dimensional quality assessment

The idea of multi-dimensional quality assessment is that a few metrics can summarise most of the variation of a given quality factor across artefacts. We would like to use it to autonomously identify artefacts that have good, fair or bad quality characteristics. For that purpose we applied a multivariate clustering algorithm to a carefully selected subset of metrics that are known to contribute to a given quality characteristic. We relied on the experience and knowledge of our software engineering experts to select the metrics. Two examples of empirical findings are provided: first we investigated maturity (as formulated by embedded engineers), which is often thought to depend roughly on the size (measured as sloc), complexity (measured as vg) and number of non-conformities (ncc) of an artefact. We then investigated testability, which is estimated using control-flow complexity (vg), the maximum level of nesting in a control structure (nest), and the number of potential execution paths (npat).

Figure 9.2: Computation of testability in Squore’s ISO9126 OO quality model.

In order to validate our findings, we wanted to compare the clusters produced when the algorithm is applied to these three metrics with a measure of testability for the artefacts. Squore produces such a measure, based on the above-mentioned metrics plus a number of others like the number of non-conformities to testability rules, data and control flow complexity, and the number of classes and functions. The static aggregation method used in Squore for the ISO 9126 OO quality model is depicted in figure 9.2.

Figure 9.3 shows the results of multi-dimensional clustering and compares them with the ratings output by Squore. The plots in the first row deal with maturity; the clustered repartition is obviously driven by the sloc metric, while Squore draws a fuzzy repartition in the perspective of the two plotted metrics. Results are similar for the testability comparison pictured in the second row. The shapes of the clusters produced by the two methods (Squore and k-means) are clearly not similar. This can be partially explained, however, and improved: firstly, Squore relies on a number of extra parameters, which grasp different aspects of testability and introduce new dimensions to the knowledge space. The impact of these measures on the results should be investigated, and the set of selected metrics modified or expanded. Secondly, another bias lies in the visualisation of these repartitions: the Squore output should be visualised on different axes to identify the driving metrics. Doing so, we should be able to find a better perspective to visually identify ranges of values and make them match the k-means-clustered repartition of artefacts.

Figure 9.3: Examples of multivariate classification of files for Ant 1.8.0. (a) Squore results for maturity; (b) clustering on sloc, vg and ncc; (c) Squore results for testability; (d) clustering on vg, nest and npat.
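The following R sketch shows the multivariate step for the testability axis on simulated file metrics (the experiments themselves use the Maisqual data sets); it also illustrates the dominance issue mentioned above, since the metrics are clustered on their raw scales.

    # Sketch of multivariate k-means for the testability axis; the data frame is
    # simulated, the real experiments use file metrics from the Maisqual data sets.
    set.seed(7)
    files <- data.frame(
      vg   = rpois(500, 10),                                 # cyclomatic complexity
      nest = rpois(500, 3),                                  # maximum nesting level
      npat = round(rlnorm(500, meanlog = 3, sdlog = 1.5))    # execution paths
    )

    km <- kmeans(files, centers = 7, nstart = 50)
    files$cluster <- km$cluster

    # Median metric profile of each cluster; scale() could be applied beforehand
    # so that a single wide-ranged metric does not dominate the Euclidean distance.
    aggregate(cbind(vg, nest, npat) ~ cluster, data = files, FUN = median)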

We extracted some arbitrary files and manually checked them for their maintainability and testability with our team of experts. Results were good, i.e. files that were considered by the clustering algorithm to be hard to test really were hard to test, which confirms that the selected metrics indeed show interesting properties for the considered quality characteristic. We consider, however, that more work is needed to build a more consistent and more verifiable automatic assessment method.

9.4 Summary & future work

Univariate clustering gives interesting insights into an optimised repartition of artefacts to establish a scale on individual indicators. We wrote Knitr documents to demonstrate the application of such methods on real projects and proposed means to apply them seamlessly in the course of a typical Squore analysis. This research looks promising because the clusters defined on the artefacts are much more balanced than those of the Squore product and correspond to a natural repartition of values. This brings useful insights into the categories to be watched and helps in the understanding and improvement of the software quality of artefacts (files, classes, functions) in a product.

We also investigated multi-dimensional analysis to look for groups of artefacts with similar quality characteristics. Using a few carefully selected metrics, we clustered files from a project and compared the results with categories of artefacts selected by Squore and our team of experts, according to some high-level quality attribute (testability, analysability, maturity). Results are incomplete on that part, however, and need to be further refined because of the semantic and practical applicability issues described hereafter.

If scales are automatically computed for each project, then files cannot be compared across projects: if one project is very well written and another shows poor maintainability, G-rated files from the former may be equivalent, in terms of quality, to the A-rated files from the latter. This implies a redefinition of some of the Squore assumptions and mechanisms to ensure consistency of results. One great advantage of Squore is that it explicitly describes the computation mechanism used to aggregate measures into upper quality characteristics, allowing users to discover why they get a bad (or good!) rating. The drawback of automatic clustering methods is that they do not always clearly show which metric or set of metrics is responsible for rating an artefact in an unexpected category. We need to develop techniques to explain the classification of artefacts, by identifying which metrics or combinations of metrics have unusual values.

There are a few solutions that should be investigated, though. Other clustering algorithms should be evaluated for this task, with better robustness and pertinence for software measurement data. Methods based on the Mahalanobis distance look promising [24, 95] (a small sketch is given below), and new papers and ideas specifically targeted at software measurement data are being published regularly [6, 107, 3, 136] and should be checked in this context. Once again, one of the key advantages of clustering methods is that they do not need to be supervised; we believe this is a determining factor of success and practical use in real-world software development projects, since users or managers have no time (nor possibly the will or knowledge) to do much configuration. Practitioners also often consider that an autonomous algorithm is more objective than a manually trained one. This Squore Lab is currently investigating means to improve this aspect and integrate them smoothly into the process.

The clustering methods we implemented unveiled interesting paths to be investigated, but are not yet mature enough to be used in a production environment. The calibration document has been presented to developers and consultants, and we are in the process of setting up a protocol to get feedback from the field and make it a part of the typical Squore calibration process. However, the culmination of this Squore Lab is to be integrated in the Squore product as an extension of the existing facilities in the capitalisation base.
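As a pointer for that future work, and under the assumption that the file metrics are available as a simple numeric matrix, the Mahalanobis distance mentioned above can be computed directly with base R; this is a sketch of the ingredient, not of a full clustering method.

    # Sketch: Mahalanobis distances of files to the centre of the metric cloud,
    # taking the correlations between metrics into account. Data is simulated.
    set.seed(3)
    m  <- cbind(sloc = round(rlnorm(300, 5, 1)),
                vg   = rpois(300, 12),
                ncc  = rpois(300, 4))
    d2 <- mahalanobis(m, center = colMeans(m), cov = cov(m))
    # Files whose squared distance exceeds the 97.5% chi-squared quantile stand out.
    which(d2 > qchisq(0.975, df = ncol(m)))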

Chapter 10

Correlating practices and attributes of software

Regression analysis allows one to find underlying structure or relationships in a set of measures (see section 3.5 in the data mining chapter). One of the intents of the Maisqual project was to investigate the impact and consequences of development practices on the software metrics we defined. Examples of the correlations we are looking for include scm_commits and scm_fixes being correlated with cyclomatic complexity (i.e. more modifications are needed when a file is complex), or the respect of coding conventions being correlated with the number of distinct committers. This chapter dives into the links that exist between the behaviour and characteristics of a software project, including its code, the people working on it, and its local culture. As defined earlier, we wrote a Knitr document to uncover the basics of the problem, then refined it to address more specific patterns or issues. Section 10.1 presents the approach used during this work, and section 10.2 describes the documents developed to achieve the stated goals. Preliminary results are discussed in section 10.3. Section 10.4 states some early conclusions and guidelines for future work on this subject.

10.1 Nature of data

The early stages of this work took place when we wanted to extract practice-related information from software repositories. Most rule-checking tools indicate transgressions as findings attached to lines of code. Two transformations are usually applied to these findings in order to extract numerical information: either counting the number of violations on the artefact (be it an application, folder, file, class or function), or counting the number of rules observed on the artefact (regardless of how often a transgressed rule is violated). In this work, development practices are primarily identified by the number of rule violations on artefacts. Some metrics can be considered as practices as well, since they may be representative of a development policy or custom. Examples of such metrics include the number of lines with only braces (brac), which denotes a coding convention style, and the comment rate (comr) as a measure of code documentation. Of course, rule-related metrics like rokr and ncc are considered for the practice analysis. The attributes of software are the characteristics of the project or product we identified when building the data sets – see section 6.1 for a comprehensive list of the metrics gathered.

One of the assumptions of regression analysis is the independence of the base measures. This is not true for all of our measures, though; the following list gives some of the most obvious combinations observed in the metrics we gathered:

    sloc = eloc + brac
    lc = sloc + blan + cloc − mloc
    lc = (eloc + brac) + blan + cloc − mloc
    comr = ((cloc + mloc) × 100) / (eloc + cloc)

This is even less true if we consider the high correlation among line-counting metrics or between size and complexity metrics [63, 64]. For these reasons we need to investigate and take into account the independence of variables during the analysis and interpretation of results.

Non-robust regression algorithms are by definition sensitive to noise. Rule-checking tools are known to have false positives [183]. We tried to avoid rules that generated too much debate about their relevance or reliability regarding the number of false positives. Nevertheless, since we measure the number of warnings raised by tools (and not real issues in the code), the noise associated with false positives is assumed to be constant over the data set.
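Before regressing, such dependences can be checked directly on the data; the sketch below simulates counts that satisfy the identities above and shows how the pairwise correlation matrix flags the measures that should not be treated as independent variables.

    # Sketch: checking (in)dependence among base measures before regression.
    # Counts are simulated so that sloc, lc and comr follow the identities above.
    set.seed(11)
    eloc <- rpois(200, 80); brac <- rpois(200, 20); mloc <- rpois(200, 5)
    blan <- rpois(200, 15); cloc <- rpois(200, 30)
    sloc <- eloc + brac
    lc   <- sloc + blan + cloc - mloc
    comr <- (cloc + mloc) * 100 / (eloc + cloc)
    metrics <- data.frame(eloc, brac, mloc, blan, cloc, sloc, lc, comr)
    round(cor(metrics), 2)   # near-1 entries flag measures that are not independent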

10.2 Knitr investigations

We wrote a first Knitr document to automatically apply various regression mechanisms to the file-level data sets we generated. It featured a quick statistical summary of the data, a study of the relationships between the metrics themselves, and the analysis of rules. Three categories of rules are defined, depending on the tool used: Squore, PMD and Checkstyle. Since they target different types of checks (see section 6.2), results vary greatly from one set to another. We computed pairwise regressions using different algorithms and techniques; this gave us full control over the regression commands and parameters, and provided us with access points for plotting specific values. The methods investigated were the following:

The standard lm command from the stats package can be used to apply many different regression models. We first tried a linear fit, using the following formula:

y = ax + b

The returned lm object supplies the adjusted R² to assess the goodness of fit.

We also applied quadratic regression using lm and the following formula:

y = ax² + bx + c

We relied on the adjusted R² of the fit to assess its goodness. The MASS package provides the rlm command for robust regression using an M-estimator [196]. In rlm, fitting is done by iterated re-weighted least squares; the initial estimate is obtained from a weighted least-squares fit. The robustbase package provides another command for robust regression, lmrob. It computes an MM-type regression estimator as described by Yohai [196] and Koller and Stahel [114]. We configure the call to use a four-step sequence: the initial estimator is an S-estimator, computed using the Fast-S algorithm []. Nonsingular subsampling [113] is used for improved robustness. The next steps use an M-regression estimate, then a Design Adaptive Scale [114] estimate, and again an M-estimator. A sketch of these calls is given below.
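The sketch below shows the four kinds of calls on a single (metric, rule) pair; the data frame and its column names (vg, linelength) are placeholders for the generated data sets, not the actual Maisqual export.

    # Sketch of the four pairwise fits used in the Knitr document.
    library(MASS)        # rlm
    library(robustbase)  # lmrob
    set.seed(5)
    d <- data.frame(linelength = rpois(300, 4))
    d$vg <- 3 + 0.5 * d$linelength + rnorm(300, sd = 4)

    fit_lin  <- lm(vg ~ linelength, data = d)                    # y = ax + b
    fit_quad <- lm(vg ~ linelength + I(linelength^2), data = d)  # y = ax^2 + bx + c
    summary(fit_lin)$adj.r.squared
    summary(fit_quad)$adj.r.squared

    fit_rlm  <- rlm(vg ~ linelength, data = d)     # M-estimator, fitted by IRWLS
    fit_rob  <- lmrob(vg ~ linelength, data = d)   # MM-type estimator (S then M steps)
    coef(fit_rlm); coef(fit_rob)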

We applied these regressions to two different sets: the first to uncover the correlations among metrics (i.e. metricA ∼ metricX), as explained in the previous section, and the second to explain metrics with rules (i.e. metricA ∼ ruleX). Results are delivered in tables giving, for each metric, the adjusted R² for each metric and rule it is regressed against. Extracts of the Knitr document are provided in appendix C.3 page 272, showing some correlation tables on metrics and rules at the file level for all releases of JMeter. For the sake of conciseness and visual comfort, we highlighted specific values in the generated tables. For pure metric correlation tables, values lower than 0.01 are not displayed and values higher than 0.9 are typeset in bold. For metric-rule correlation tables, values higher than 0.6 are typeset in bold.

10.3 Results

At the file level, applying regression on metrics with rules as explanatory variables gives very poor results. For linear and polynomial regression, the adjusted R² varies between 0 and 0.4, with most values under 0.2. Robust algorithms simply do not converge. Unsurprisingly, sets of metrics that had been identified as correlated in the pure-metrics correlation matrix (as stated in the previous section) change in similar proportions. As an example, the table generated for the JMeter releases set of files, provided in appendix C.3 on page 278, shows that the linelength check is slightly¹ correlated with the number of statements (0.619), and also with sloc (0.568), ncc (0.731), lc (0.55) and eloc (0.617). If we take the table of pure-metrics correlations shown in appendix C.3 on page 274, we observe that these metrics are indeed highly correlated (adjusted R² between 0.864 and 0.984).

¹At least compared to the rest of the table; statisticians typically consider that there is a correlation for values higher than 0.9.

The repartition of rule violations is peculiar: in the rule set we selected, many rules are rarely violated, while other rules target common, harmless warnings and hence show a large number of violations. This is highlighted in figure 10.1: most rules have their third quartile at zero (i.e. the box is not visible), and strictly positive values (i.e. at least one violation) are most of the time considered as outliers. The redness of the outliers shows that the density of violations in files decreases rapidly. The most contravened rules are immediately identified by their box popping over the x axis; they are r_return (Squore), interfaceistype (PMD), and javadocmethod (PMD).

Figure 10.1: Repartition of violations by rule for Ant 1.8.0 files.
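A figure like 10.1 can be reproduced with a simple boxplot; the long-format data frame below and its column names are assumptions made for the sketch, not the actual Maisqual export.

    # Sketch of a per-rule boxplot of violation counts (one box per rule).
    set.seed(9)
    viol <- data.frame(
      rule  = rep(paste0("rule", sprintf("%02d", 1:30)), each = 200),
      count = rpois(6000, lambda = rep(c(0.05, 0.3, 3), each = 2000))
    )
    boxplot(count ~ rule, data = viol, las = 2, cex.axis = 0.6,
            outcol = rgb(1, 0, 0, 0.2), ylab = "violations per file")
    # Most boxes collapse onto zero; the translucent red outliers give a rough
    # view of how quickly the density of violations decreases.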

This creates a very sparse matrix, which causes standard algorithms to fail. The Knitr document produced many warnings stating that rlm could not converge because the matrix is singular. lmrob consistently reports that it could not converge within the allowed number of iterations. We raised the number of iterations from 20 (the default) to 40 and 60, and tried different methods (e.g. different estimators and parameters, and the alternative defaults for lmrob proposed in [114]) without success.
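For reference, the sketch below shows where these knobs live in the two packages, on a small simulated example; the parameter names (maxit for rlm, max.it and setting for lmrob.control) are taken from the MASS and robustbase documentation and may differ in other versions.

    # Sketch of the convergence tweaks tried above (simulated sparse rule data).
    library(MASS)        # rlm
    library(robustbase)  # lmrob, lmrob.control
    set.seed(13)
    d <- data.frame(rule_a = rpois(300, 0.05),   # mostly-zero violation counts
                    rule_b = rpois(300, 0.1))
    d$vg <- 5 + 2 * d$rule_a + rnorm(300, sd = 3)

    # rlm: raise the IRWLS iteration cap from its default of 20.
    fit_rlm <- rlm(vg ~ rule_a + rule_b, data = d, maxit = 60)
    # lmrob: more iterations plus the alternative defaults of Koller & Stahel (2011).
    ctrl    <- lmrob.control(setting = "KS2011", max.it = 60)
    fit_rob <- lmrob(vg ~ rule_a + rule_b, data = d, control = ctrl)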

10.4 Summary & future work

In this chapter, we described the experiments performed to establish relationships between practices, measured as violations of known coding rules, and metrics. We applied various regression methods and ultimately could not corroborate verifiable relationships using these methods. Although much work remains to be done with other techniques before giving a definitive judgement, the following intermediate conclusions can already be drawn:

This is not as trivial as it seems. We identified two bottlenecks: firstly, the measurement of daily development practices is subject to many biases and can be done from different perspectives; secondly, the nature of the data is unusual and takes us out of the scope of comfortable, more classical shapes and treatments.

Using metrics as outputs and rules as explanatory variables, linear and polynomial regression methods give very bad results, with adjusted R² values in the range [0, 0.33]. We could not establish any definitive relationship on the data sets analysed with these algorithms.

Rule-related data, when extracted from rule-checking tools, are generally sparse and have an overall unusual shape, which causes many algorithms to fail. Classical robust regression (the rlm or lmrob algorithms) is not able to converge on such data.

Results are incomplete for now and need further work before reliable and meaningful conclusions can be drawn. From what we have learned, a few paths may be worth exploring:

Further reflection on how practices can be discovered in software metrics and rule findings may bring novel insights to the initial question. Different means of measuring similar characteristics may open new perspectives by introducing new data with more usable distributions, or may enable the use of different algorithms (such as those used for process discovery, see [96, 97, 192, 86]).

We applied only pairwise correlations, but the underlying structure of the measured system may involve an arbitrary combination of variables. While this would be a time-consuming task, investigating multiple regression would bring useful insights into the complex structure of influences among metrics.

There are many different types of robust algorithms for regression analysis. Logistic, more elaborate non-parametric, or sparse-data-targeted regression techniques may be more successful. Also, transforming the data (e.g. normalisation, principal component analysis) may make it more accessible to some algorithms.

Part IV

Conclusion


Initial objectives

During these three years we learned to apply methods from the statistics and data mining fields to domain-specific concerns of software engineering. The roadmap we roughly followed is depicted in figure 10.2: we elaborated a semantic, methodological and technical framework to generate safe data sets; these were used in specific workshops. Conclusions and consequences were either developed in Squore Labs or expressed in papers and ideas.

Figure 10.2: Synopsis of Maisqual.

What has been done

The Maisqual project allowed us to uncover how data mining methods can help in software engineering practices and quality assessment. During these three years, we were able to:

Study the most prominent and advanced techniques in the intersecting fields of software engineering and data mining, to exchange ideas with people and expand the knowledge of the Squoring Technologies team and the practitioners we met.

Set up a reliable framework, on both the methodological and the pragmatic implementation aspects. This permitted us to communicate our preliminary findings, and to submit articles to conferences and journals.

Run a restricted set of projects to address specific software engineering concerns, which led to the direct implementation of new features in the Squore product and pragmatic application in projects.

This last part, composed of the Squore Labs, comprises the very projects that enabled us to draw practical conclusions from Maisqual. The Outliers detection and Clustering Squore Labs contributed practical developments to Squoring Technologies and enabled us to provide useful tools and methodological help for the consultants. The Correlation Squore Lab, albeit not yet complete, has already provided some good insights pertaining to the nature of the data and the tooling needed for further analysis. Besides the implementation of novel ideas into pragmatic applications, it was an opportunity for Squoring Technologies and our collaborators to build a bridge with the academic field of data mining and its thrilling contributions.

Taking a step back, carrying out an extensive research project was difficult to achieve in the designated time of two days a week. We had to put aside many interesting ideas and constantly check the scope of the project to make it fit into the restricted time frame and still deliver valuable and practical results. Here are a few of the interesting paths that we did not follow up on but which are undoubtedly very promising.

What lies ahead?

A few papers are on the way, in collaboration with researchers from INRIA, and should be submitted during the course of 2014 to software engineering conferences. Following the observations we made on the distribution of software metrics in section 4.2.2, we are in the process of writing an article on the shapes of software. The idea is to investigate power laws [36] in metric distributions, and a first Knitr document has already been written that implements common checks on distribution functions. Some studies targeting power laws in software have been published recently for graph-oriented measures of software, e.g. for software dependency graphs [129] or for various software characteristics as in Concas et al. [38], but to our knowledge there is no comprehensive study on the various software measures we are dealing with. We are working on this with Martin Monperrus and Vincenzo Musco from the INRIA Lille team.

The selection of metrics, using techniques such as principal component analysis, is another area of great interest, especially for Squore: the capacity to know which metrics should be selected for analysis would save resources and time for the engine. A few studies investigate dimensionality reduction of software metrics for specific concerns like cost estimation [179]; however, we believe that a more generic method can be used to uncover the main drivers of software analysis.

There is still room for improvement in the Squore Labs that were run with Squoring Technologies. The outliers detection project should be enriched with new outlier detection techniques like LOF, k-means or the Mahalanobis distance, and new types of outliers should be investigated. We want to assist Squoring Technologies consultants in applying their acquired knowledge and experience to the writing of algorithms for specific customer needs. The clustering Squore Lab is not finished and will be continued, either to set up a configuration wizard for the team and customers or to be directly integrated into the Squore product. Last but not least, the Squore Lab working on correlations between metrics and practices needs to be completed and its conclusions publicised and integrated into the body of knowledge.

Throughout the course of this study some interesting ideas emerged. They could not be investigated in depth: they may not have been in the scope of this thesis, and the allocated time for research was filled to capacity, so we were forced to carefully select our goals. Nevertheless, these ideas have been put on a back burner for later study, because they could quite possibly provide meaningful conclusions or important features for the Squore product.

We finally need to communicate better about our work and results, to foster our collaboration with practitioners and to investigate new uses and applications for our research. First, the data sets should be publicised, with a thorough description of the metrics extracted (definition, samples as requirements for reproduction, and links between metrics), a clear description of the retrieval process, a quick analysis showing important events in the project, and examples of usage for the data set with working R code samples. New projects will be added over time thanks to the automated architecture we set up. Second, the Maisqual wiki will be further elaborated upon and updated with new information and insights resulting from our work.

Bibliography

[1] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. ACM SIGMOD Record, 30(2):37–46, June 2001.

[2] M. Al-Zoubi. An effective clustering-based approach for outlier detection. European Journal of Scientific Research, 2009.

[3] J. Almeida and L. Barbosa. Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering. Chemometrics and Intelligent Laboratory Systems, 87(2):208–217, 2007.

[4] S. Amasaki, T. Yoshitomi, and O. Mizuno. A new challenge for applying time series metrics data to software quality estimation. Software Quality Journal, 13:177–193, 2005.

[5] C. Andersson and P. Runeson. A replicated quantitative analysis of fault distributions in complex software systems. IEEE Transactions on Software Engineering, 33(5):273– 286, 2007.

[6] P. Antonellis, D. Antoniou, Y. Kanellopoulos, C. Makris, E. Theodoridis, C. Tjortjis, and Nikos Tsirakis. Employing Clustering for Assisting Source Code Maintainability Evaluation according to ISO/IEC-9126. In Intelligence Techniques in Software Engineering Workshop, 2008.

[7] P. Antonellis, D. Antoniou, Y. Kanellopoulos, C. Makris, E. Theodoridis, C. Tjortjis, and N. Tsirakis. A data mining methodology for evaluating maintainability according to ISO/IEC-9126 software engineering–product quality standard. Special Session on System Quality and Maintainability-SQM2007, 2007.

[8] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix. Using static analysis to find bugs. IEEE Software, 25(5):22–29, 2008.

[9] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The Missing Links: Bugs and Bug-fix Commits. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, pages 97–106, 2010.

[10] B. Baldassari. SQuORE : une nouvelle approche de mesure de la qualité des projets logiciels. Génie Logiciel, 103:9–14, 2012.

[11] B. Baldassari. SQuORE: A new approach to software project quality measurement. In International Conference on Software & Systems Engineering and their Applications, 2012.

[12] B. Baldassari and F. Huynh. Software Quality: the Eclipse Way and Beyond. In EclipseCon France, 2013.

[13] B. Baldassari, F. Huynh, and P. Preux. De l’ombre à la lumière : plus de visibilité sur l’Eclipse. In Extraction et Gestion des Connaissances, 2014.

[14] B. Baldassari and P. Preux. A practitioner approach to software engineering data mining. In 36th International Conference on Software Engineering, 2014.

[15] B. Baldassari and P. Preux. Outliers Detection in Software Engineering data. In 36th International Conference on Software Engineering, 2014.

[16] B. Baldassari and P. Preux. Understanding software evolution: The Maisqual Ant data set. In Mining Software Repositories, number July, 2014.

[17] T. Ball. The Concept of Dynamic Analysis. In Software Engineering—ESEC/FSE’99, pages 216–234, 1999.

[18] V. Barnett and T. Lewis. Outliers in Statistical Data (Wiley Series in Probability & Statistics). Wiley, 1994.

[19] V. R. Basil and A. J. Turner. Iterative enhancement: A practical technique for software development. IEEE Transactions on Software Engineering, 4:390–396, 1975.

[20] V. Basili and J. Musa. The future generation of software: a management perspective. Computer, 24(9):90–96, 1991.

[21] K. Beck, A. Cockburn, W. Cunningham, M. Fowler, M. Beedle, A. van Bennekum, J. Grenning, J. Highsmith, A. Hunt, and R. Jeffries. The Agile Manifesto. Technical Report 1, 2001.

[22] K. Beck and W. Cunningham. Using pattern Languages for Object-Oriented Programs. In Object-Oriented Programming, Systems, Languages & Applications, 1987.

[23] I. Ben-Gal, O. Maimon, and L. Rokach. Outlier detection. In Data Mining and Knowledge Discovery Handbook, pages 131–147. Kluwer Academic Publishers, 2005.

[24] P. Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. 2006.

[25] J. Bloch. Effective Java (2nd Edition). Addison-Wesley, 2008.

[26] B. W. Boehm. A spiral model of software development and enhancement. Computer, 21(5):61–72, 1988.

[27] B. W. Boehm, J. R. Brown, H. Kaspar, M. Lipow, G. J. MacLeod, and M. J. Merrit. Characteristics of Software Quality, volume 1. Elsevier Science Ltd, 1978.

[28] B. W. Boehm, J. R. Brown, and M. Lipow. Quantitative evaluation of software quality. In Proceedings of the 2nd international conference on Software engineering, pages 592–605, San Francisco, California, United States, 1976. IEEE Computer Society Press.

[29] G. Box, G. Jenkins, and G. Reinsel. Time series analysis: forecasting and control. Wiley, 2013.

[30] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Local Outlier Factor: Identifying Density-Based Local Outliers. ACM SIGMOD Record, 29(2):93–104, June 2000.

[31] N. Brown, Y. Cai, Y. Guo, R. Kazman, M. Kim, P. Kruchten, E. Lim, A. MacCormack, R. Nord, and I. Ozkaya. Managing technical debt in software-reliant systems. In Proceedings of the FSE/SDP workshop on Future of software engineering research, pages 47–52, 2010.

[32] O. Burn. Checkstyle, 2001.

[33] G. Canfora and L. Cerulo. Impact analysis by mining software and change request repositories. In 11th IEEE International Symposium on Software Metrics, pages 1–9, 2005.

[34] S. Cateni, V. Colla, and M. Vannucci. Outlier Detection Methods for Industrial Applications. Advances in robotics, automation and control, pages 265–282, 2008.

[35] V. Chandola, A. Banerjee, and V. Kumar. Outlier Detection : A Survey. ACM Computing Surveys, 2007.

[36] A. Clauset, C. R. Shalizi, and M. E. Newman. Power-law distributions in empirical data. SIAM review, 51(4):661–703, 2009.

[37] W. Cleveland. Visualizing data. Hobart Press, 1993.

[38] G. Concas, M. Marchesi, S. Pinna, and N. Serra. Power-Laws in a Large Object- Oriented Software System. IEEE Transactions on Software Engineering, 33(10):687– 708, 2007.

[39] D. Cook and D. Swayne. Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. 2007.

[40] R. Cooper and T. Weekes. Data, models, and statistical analysis. 1983.

[41] M. J. Crawley. The R Book. Wiley, 2007.

[42] P. B. Crosby. Quality is free: The art of making quality certain, volume 94. McGraw- Hill New York, 1979.

[43] W. Cunningham. The WyCash portfolio management system. ACM SIGPLAN OOPS Messenger, 4(2):29–30, 1992.

[44] D. Defays. An efficient algorithm for a complete link method. The Computer Journal, 20(4):364–366, Apr. 1977.

[45] F. Deissenboeck, E. Juergens, K. Lochmann, and S. Wagner. Software quality models: Purposes, usage scenarios and requirements. In ICSE 2009 Workshop on Software Quality, pages 9–14, 2009.

[46] W. Deming. Out of the crisis: quality, productivity and competitive position. Cam- bridge University Press, 1988.

[47] J. Deprez, K. Haaland, F. Kamseu, and U. de la Paix. QualOSS Methodology and QUALOSS assessment methods. QualOSS project Deliverable, 2008.

[48] J. Desharnais, A. Abran, and W. Suryn. Attributes Within ISO 9126: A Pareto Analysis. Quality Software Management, 2009.

[49] M. Diaz and J. Sligo. How software process improvement helped Motorola. IEEE Software, 14(5):75–81, 1997.

[50] W. Dickinson, D. Leon, and A. Podgurski. Finding failures by cluster analysis of execution profiles. In Proceedings of the 23rd international conference on Software engineering, pages 339–348, 2001.

[51] D. Dixon-Peugh. PMD, 2003.

[52] P. Dokas, L. Ertoz, V. Kumar, A. Lazarevic, J. Srivastava, and P.-N. Tan. Data mining for network intrusion detection. In NSF Workshop on Next Generation Data Mining, pages 21–30, 2002.

[53] R. Dromey. Cornering the Chimera. IEEE Software, 13(1):33–43, 1996.

[54] R. G. Dromey. A Model for Software Product Quality. IEEE Transactions on Software Engineering, 21(2):146–162, 1995.

[55] F.-W. Duijnhouwer and C. Widdows. Capgemini Expert Letter Open Source Maturity Model. Technical report, 2003.

[56] N. Eickelman. An Insider’s view of CMM Level 5. IEEE Software, 20(4):79–81, 2003.

[57] T. Eisenbarth. Aiding program comprehension by static and dynamic feature analysis. In International Conference on Software Maintenance, 2001.

[58] M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In Proceedings of the 24th Very Large Data Bases International Conference, pages 323–333, 1998.

[59] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Knowledge Discovery in Database, 96:226–231, 1996.

[60] O. Fedotova, L. Teixeira, and H. Alvelos. Software Effort Estimation with Multiple Linear Regression: review and practical application. Journal of Information Science and Engineering, 2011.

[61] A. V. Feigenbaum. Total Quality Control 4th edition. McGraw-Hill Professional, 2006.

[62] N. Fenton. Software Measurement: a Necessary Scientific Basis. IEEE Transactions on Software Engineering, 20(3):199–206, Mar. 1994.

[63] N. Fenton and M. Neil. Software metrics: roadmap. In Proceedings of the Conference on the Future of Software Engineering, pages 357–370, 2000.

[64] N. Fenton and N. Ohlsson. Quantitative analysis of faults and failures in a complex software system. IEEE Transactions on Software Engineering, 26(8):797–814, 2000.

[65] P. Filzmoser, R. Maronna, and M. Werner. Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52(3):1694–1711, 2008.

[66] L. Finkelstein and M. Leaning. A review of the fundamental concepts of measurement. Measurement, 2(1):25–34, Jan. 1984.

[67] M. Fowler. Martin Fowler on Technical Debt, 2009.

[68] J. French. Field experiments: Changing group productivity. Experiments in social process: A symposium on social psychology, 81:96, 1950.

[69] F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. Technical Report 69, SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, Vienna, 2002.

[70] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[71] D. A. Garvin. Managing Quality - the strategic and competitive edge. Free Press [u.a.], New York, NY, 1988.

[72] M. Goeminne and T. Mens. Evidence for the pareto principle in open source software activity. In Joint Proceedings of the 1st International workshop on Model Driven Software Maintenance and 5th International Workshop on Software Quality and Maintainability, pages 74–82, 2011.

[73] A. Gordon. Classification, 2nd Edition. Chapman and Hall/CRC, 1999.

[74] M. K. Gounder, M. Shitan, and R. Imon. Detection of outliers in non-linear time series: A review. Festschrift in honor of distinguished professor Mir Masoom Ali on the occasion of his retirement, pages 18–19, 2007.

[75] Y.-G. Guéhéneuc, J.-Y. Guyomarc'h, K. Khosravi, and H. Sahraoui. Design patterns as laws of quality. Object-oriented Design Knowledge: Principles, Heuristics, and Best Practices, pages 105–142, 2006.

[76] A. Gupta, B. R. Mohan, S. Sharma, R. Agarwal, and K. Kavya. Prediction of Software Anomalies using Time Series Analysis – A recent study. International Journal on Advanced Computer Theory and Engineering (IJACTE), 2(3):101–108, 2013.

[77] D. A. Gustafson and B. Prasad. Properties of software measures. In Formal Aspects of Measurement, Proceedings of the BCS-FACS Workshop on Formal Aspects of Measurement, South Bank University, London, 5 May 1991, pages 179–193. Springer, 1991.

[78] T. Gyimothy, R. Ferenc, and I. Siket. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10):897–910, 2005.

[79] K. Haaland, A.-k. Groven, N. Regnesentral, R. Glott, and A. Tannenberg. Free/Libre Open Source Quality Models - a comparison between two approaches. In 4th FLOS International Workshop on Free/Libre/Open Source Software, volume 0, pages 1–2, 2010.

[80] T. Haley. Software process improvement at Raytheon. IEEE Software, 13(6):33–41, 1996.

[81] T. Hall, S. Beecham, D. Bowes, D. Gray, and S. Counsell. A Systematic Literature Review on Fault Prediction Performance in Software Engineering. pages 1–31, 2010.

[82] M. H. Halstead. Elements of Software Science. Elsevier Science Inc., 1977.

[83] H.-J. Happel and W. Maalej. Potentials and challenges of recommendation systems for software development. In Proceedings of the 2008 international workshop on Recommendation systems for software engineering, pages 11–15. ACM, 2008.

[84] L. Hatton. The chimera of software quality. Computer, 40(8):103–104, 2007.

[85] D. Hawkins. Identification of outliers, volume 11. Chapman and Hall London, 1980.

[86] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang, R. Holmes, and M. W. Godfrey. The MSR Cookbook. In 10th International Workshop on Mining Software Repositories, pages 343–352, 2013.

[87] C. Hennig, T. Harabasz, and M. Hennig. Package ’fpc’: Flexible procedures for clustering, 2012.

[88] S. Herbold, J. Grabowski, H. Neukirchen, and S. Waack. Retrospective Analysis of Software Projects using k-Means Clustering. In Proceedings of the 2nd Design for Future 2010 Workshop (DFF 2010), May 3rd 2010, Bad Honnef, Germany, 2010.

[89] I. Herraiz. A Statistical Examination of the Evolution and Properties of Libre Software. In Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM), pages 439–442. IEEE Computer Society, 2009.

[90] I. Herraiz, J. M. Gonzalez-Barahona, and G. Robles. Forecasting the Number of Changes in Eclipse Using Time Series Analysis. In Fourth International Workshop on Mining Software Repositories (MSR’07:ICSE Workshops 2007), pages 32–32. IEEE, May 2007.

[91] I. Herraiz, J. M. Gonzalez-Barahona, G. Robles, and D. M. German. On the prediction of the evolution of libre software projects. In IEEE International Conference on Software Maintenance, pages 405–414, 2007.

[92] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.

[93] M. Ing and E. Georgiadou. Software Quality Model Requirements for Software Quality Engineering. In 14th International Conference on the Software Quality Requirement, 2003.

[94] ISO. ISO/IEC 9126 Software Engineering - Product Quality - Parts 1-4. Technical report, ISO/IEC, 2005.

[95] G. S. D. S. Jayakumar and B. J. Thomas. A New Procedure of Clustering Based on Multivariate Outlier Detection. Journal of Data Science, 11:69–84, 2013.

[96] C. Jensen and W. Scacchi. Applying a Reference Framework to Open Source Software Process Discovery. In Proceedings of the 1st Workshop on Open Source in an Industrial Context, 2003.

[97] C. Jensen and W. Scacchi. Data mining for software process discovery in open source software development communities. Proc. Workshop on Mining Software Repositories, pages 96–100, 2004.

[98] M. Jiang, S. Tseng, and C. Su. Two-phase clustering process for outliers detection. Pattern recognition letters, 22(6):691–700, 2001.

[99] S. R. G. Jones. Was There a Hawthorne Effect? American Journal of Sociology, 98(3):451, Nov. 1992.

[100] H.-W. Jung, K. Seung-Gweon, and C.-S. Chung. Measuring Software Product Quality: A Survey of ISO/IEC 9126. IEEE Software, 21(05):88–92, Sept. 2004.

[101] J. M. Juran, F. M. Gryna, and R. Bingham. Quality control handbook. McGraw-Hill, third edition, 1974.

[102] H. Kagdi, M. Collard, and J. Maletic. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice, 19(2):77–131, 2007.

[103] S. H. Kan. Metrics and Models in Software Quality Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2002.

[104] Y. Kanellopoulos. k-Attractors: A clustering algorithm for software measurement data analysis. In 19th IEEE International Conference on Tools with Artificial Intelligence, pages 358–365, 2007.

[105] C. Kaner and W. P. Bond. Software engineering metrics: What do they measure and how do we know? Methodology, 8(6):1–12, 2004.

[106] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. ACM SIGMOD Record, 26(2):369–380, 1997.

[107] J. Kaur, S. Gupta, and S. Kundra. A k-means Clustering Based Approach for Evaluation of Success of Software Reuse. In Proceedings of International Conference on Intelligent Computational Systems (ICICS’2011), 2011.

[108] D. Kelly and R. Sanders. Assessing the Quality of Scientific Software. In First International workshop on Software Engineering for Computational Science and Engineering, number May, Leipzig, Germany, 2008.

[109] T. M. Khoshgoftaar, E. B. Allen, W. D. Jones, and J. P. Hudepohl. Data mining for predictors of software quality. International Journal of Software Engineering and Knowledge Engineering, 9(5):547–563, 1999.

[110] K. Khosravi and Y. Guéhéneuc. On issues with software quality models. In The Proceedings of the 11th Working Conference on Reverse Engineering, pages 172–181, 2004.

[111] B. Kitchenham and S. Pfleeger. Software quality: the elusive target. IEEE Software, 13(1):12–21, 1996.

[112] D. E. Knuth. Literate programming. The Computer Journal, 27(2):97–111, 1984.

[113] M. Koller. Nonsingular subsampling for S-estimators with categorical predictors. Technical report, 2012.

[114] M. Koller and W. Stahel. Sharpening wald-type inference in robust regression for small samples. Computational Statistics & Data Analysis, 55(8):2504–2515, 2011.

[115] S. Kotsiantis and P. Pintelas. Recent advances in clustering: A brief survey. WSEAS Transactions on Information Science and Applications, 1(1):73–81, 2004.

[116] Y. Kou, C.-T. Lu, S. Sirwongwattana, and Y.-P. Huang. Survey of fraud detection techniques. In 2004 IEEE International Conference on Networking, Sensing and Control, pages 749–754, 2004.

[117] P. Kruchten, R. Nord, and I. Ozkaya. Technical debt: from metaphor to theory and practice. IEEE Software, 29(6):18–21, 2012.

[118] M. Lanza and R. Marinescu. Object-oriented metrics in practice: using software metrics to characterize, evaluate, and improve the design of object-oriented systems. 2006.

[119] M. Last, M. Friedman, and A. Kandel. The data mining approach to automated software testing. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 388–396, 2003.

[120] J. Laurikkala, M. Juhola, and E. Kentala. Informal identification of outliers in medical data. In Proceedings of the 5th International Workshop on Intelligent Data Analysis in Medicine and Pharmacology, pages 20–24, 2000.

[121] A. Lazarevic, L. Ertoz, and V. Kumar. A comparative study of anomaly detection schemes in network intrusion detection. Proc. SIAM, 2003.

[122] J.-L. Letouzey. The SQALE method for evaluating Technical Debt. In Third International Workshop on Managing Technical Debt (MTD), pages 31–36. IEEE, 2012.

[123] J.-L. Letouzey and T. Coq. The SQALE analysis model: An analysis model compliant with the representation condition for assessing the quality of software source code. In Second International Conference on Advances in System Testing and Validation Lifecycle, pages 43–48, 2010.

[124] S. D. Levitt and J. A. List. Was There Really a Hawthorne Effect at the Hawthorne Plant? An Analysis of the Original Illumination Experiments. American Economic Journal: Applied Economics, 3(1):224–238, Jan. 2011.

[125] T. W. Liao. Clustering of time series data — a survey. Pattern Recognition, 38(1):1857–1874, 2005.

[126] R. Lincke, J. Lundberg, and W. Löwe. Comparing software metrics tools. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 131–142, New York, New York, USA, 2008. ACM Press.

[127] X. Liu, G. Kane, and M. Bambroo. An intelligent early warning system for software quality improvement and project management. Journal of Systems and Software, 2006.

[128] P. Louridas. Static Code Analysis. IEEE Software, 23(4):58–61, 2006.

[129] P. Louridas, D. Spinellis, and V. Vlachos. Power laws in software. ACM Transactions on Software Engineering and Methodology, 18(1):1–26, Sept. 2008.

[130] T. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320, 1976.

[131] J. McCall. Factors in Software Quality: Preliminary Handbook on Software Qual- ity for an Acquisiton Manager. Information Systems Programs, General Electric Company, 1977.

[132] M. Mendonca and N. Sunderhaft. Mining software engineering data: A survey. Technical report, 1999.

[133] D. C. Montgomery, E. A. Peck, and G. G. Vining. Introduction to Linear Regression Analysis, 4th edition Student Solutions Manual (Wiley Series in Probability and Statistics). Wiley-Interscience.

[134] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.

[135] N. Nagappan, T. Ball, and B. Murphy. Using historical in-process and product metrics for early estimation of software failures. In Proceedings of the 17th International Symposium on Software Reliability Engineering, pages 62–74, 2006.

[136] B. B. Naib. An Improved Clustering Approach for Software Quality Analysis. International Journal of Engineering, Applied and Management Sciences Paradigms, 05(01):96–100, 2013.

[137] Navica Software. The Open Source Maturity Model is a vital tool for planning open source success. Technical report, 2009.

[138] L. C. Noll, S. Cooper, P. Seebach, and A. B. Leonid. The International Obfuscated C Code Contest.

[139] D. Okanović and M. Vidaković. Software Performance Prediction Using Linear Regression. In Proceedings of the 2nd International Conference on Information Society Technology and Management, pages 60–64, 2012.

[140] A. Origin. Method for Qualification and Selection of Open Source software (QSOS), version 1.6. Technical report, 2006.

[141] H. M. Parsons. What Happened at Hawthorne?: New evidence suggests the Hawthorne effect resulted from operant reinforcement contingencies. Science (New York, N.Y.), 183(4128):922–32, Mar. 1974.

[142] M. Paulk, B. Curtis, M. Chrissis, and C. Weber. Capability Maturity Model, version 1.1. Technical report, 1993.

[143] M. C. Paulk, K. L. Needy, and J. Rajgopal. Identify outliers, understand the Process. ASQ Software Quality Professional, 11(2):28–37, 2009.

[144] E. Petrinja, R. Nambakam, and A. Sillitti. Introducing the OpenSource Maturity Model. In 2009 ICSE Workshop on Emerging Trends in Free/Libre/Open Source Software Research and Development, pages 37–41. IEEE, May 2009.

[145] C. Phua, V. Lee, K. Smith, and R. Gayler. A Comprehensive Survey of Data Mining-based Fraud Detection Research. arXiv preprint arXiv:1009.6119, page 14, Sept. 2010.

[146] D. Posnett, V. Filkov, and P. Devanbu. Ecological inference in empirical software engineering. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 362–371. IEEE Computer Society, Nov. 2011.

[147] R Core Team. R: A Language and Environment for Statistical Computing, 2013.

[148] E. Raymond. The cathedral and the bazaar. Knowledge, Technology & Policy, 1999.

[149] W. J. Reed. Power-Law Adjusted Survival Models. Communications in Statistics- Theory and Methods, 41(20):3692–3703, 2012.

[150] M. Riaz, E. Mendes, and E. Tempero. A Systematic Review of Software Maintainability Prediction and Metrics. In Proceedings of the 2009 3rd International Symposium on Empirical Software Engineering and Measurement, pages 367–377, 2009.

[151] M. Robillard and R. Walker. Recommendation systems for software engineering. IEEE Software, 27(4):80–86, 2010.

[152] W. Robinson. Ecological correlations and the behavior of individuals. International journal of epidemiology, 15(3):351–357, 2009.

[153] P. Rousseeuw, C. Croux, V. Todorov, A. Ruckstuhl, M. Salibian-Barrera, T. Verbeke, M. Koller, and M. Maechler. robustbase: Basic Robust Statistics. R package version 0.9-10, 2013.

[154] N. Rutar, C. B. Almazan, and J. S. Foster. A comparison of bug finding tools for Java. In 15th International Symposium on Software Reliability Engineering, pages 245–256, 2004.

[155] J. A. Ryan and J. M. Ulrich. xts: eXtensible Time Series, 2013.

[156] T. L. Saaty. Fundamentals of decision making and priority theory: with the analytic hierarchy process. Rws Publications, 1994.

[157] I. Samoladas, G. Gousios, D. Spinellis, and I. Stamelos. The SQO-OSS quality model: measurement-based open source software evaluation. In Open source development, communities and quality, pages 237–248. Springer, 2008.

[158] N. Schneidewind. Methodology for validating software metrics. IEEE Transactions on Software Engineering, 18(5):410–422, 1992.

[159] M. Shaw. Prospects for an Engineering Discipline of Software. IEEE Software, (November):15–24, 1990.

[160] W. Shewhart. Economic control of quality of manufactured product. Van Nostrand, 1931.

[161] S. R. Chidamber and C. F. Kemerer. A Metrics Suite for Object Oriented Design. 1993.

[162] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, Jan. 1973.

[163] E. Simmons. When Will We be Done Testing? Software Defect Arrival Modeling Using the Weibull Distribution. In Northwest Software Quality Conference, Portland, OR, 2000.

[164] E. Simmons. Defect Arrival Modeling Using the Weibull Distribution. International Software Quality Week, San Francisco, CA, 2002.

[165] G. Singh and V. Kumar. An Efficient Clustering and Distance Based Approach for Outlier Detection. International Journal of Computer Trends and Technology, 4(7):2067–2072, 2013.

[166] K. Singh and S. Upadhyaya. Outlier Detection : Applications And Techniques. In International Journal of Computer Science, volume 9, pages 307–323, 2012.

[167] J. F. Smart. Java Power Tools. O’Reilly Media, 2008.

[168] R. Sokal and C. Michener. A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin, 28:1409–1438, 1958.

[169] S. Sowe, R. Ghosh, and K. Haaland. A Multi-Repository Approach to Study the Topology of Open Source Bugs Communities: Implications for Software and Code Quality. In 3rd IEEE International Conference on Information Management and Engineering, 2011.

[170] M.-A. Storey. Theories, methods and tools in program comprehension: Past, present and future. In 13th International Workshop on Program Comprehension (IWPC 2005). IEEE Computer Society, 2005.

[171] Sun. Code Conventions for the Java Programming Language. Technical report, 1999.

[172] Q. Taylor and C. Giraud-Carrier. Applications of data mining in software engineering. International Journal of Data Analysis Techniques and Strategies, 2(3):243–257, July 2010.

[173] C. P. Team. CMMi® for Development, Version 1.1. Technical report, 2002.

[174] C. P. Team. CMMI® for Development, Version 1.3. Technical report, Carnegie Mellon University, 2010.

[175] P. Teetor. R Cookbook. O’Reilly Publishing, 2011.

[176] V. Todorov and P. Filzmoser. An Object-Oriented Framework for Robust Multivari- ate Analysis. Journal of Statistical Software, 32(3):1–47, 2009.

[177] L. Torgo. Data Mining with R, learning with case studies. Chapman and Hall/CRC, 2010.

[178] F. Tsai and K. Chan. Dimensionality reduction techniques for data exploration. In 6th International Conference on Information, Communications & Signal Processing, pages 1–5. IEEE, 2007.

[179] M. Turan and Z. Çataltepe. Clustering and dimensionality reduction to determine important software quality metrics. In 22nd International Symposium on Computer and Information Sciences, pages 1–6, Nov. 2007.

[180] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, 2002.

[181] A. Vetrò, M. Torchiano, and M. Morisio. Assessing the precision of FindBugs by mining Java projects developed at a university. In 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pages 110–113. IEEE, May 2010.

[182] J. Voas. Assuring software quality assurance. IEEE Software, 20(3):48–49, 2003.

[183] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, and M. Schwalb. An Evaluation of Two Bug Pattern Tools for Java. In 2008 International Conference on Software Testing, Verification, and Validation, pages 248–257. IEEE, Apr. 2008.

[184] A. I. Wasserman, M. Pal, and C. Chan. Business Readiness Rating Project. Technical report, 2005.

[185] L. Westfall and C. Road. 12 Steps to Useful Software Metrics. Proceedings of the 17th Annual Pacific Northwest Software Quality Conference, 57 Suppl 1(May 2006):S40–3, 2005.

[186] J. A. J. Wilson. Open Source Maturity Model. Technical report, 2006.

[187] C. Wohlin, M. Höst, and K. Henningsson. Empirical research methods in software engineering. Empirical Methods and Studies in Software Engineering, pages 7–23, 2003.

[188] W. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In ICML, pages 808–815, 2003.

[189] T. Xie. Bibliography on mining software engineering data, 2010.

[190] T. Xie, S. Thummalapenta, D. Lo, and C. Liu. Data mining for software engineering. IEEE Computer, 42(8):35–42, 2009.

[191] Y. Xie. Knitr: A general-purpose package for dynamic report generation in R. Technical report, 2013.

[192] Q. Yang and X. Wu. 10 challenging problems in data mining research. International Journal of Information Technology & Decision Making, 5(4):597–604, 2006.

[193] Y. Kanellopoulos and C. Tjortjis. Interpretation of Source Code Clusters in Terms of the ISO/IEC-9126 Maintainability Characteristics. In 12th European Conference on Software Maintenance and Reengineering, pages 63–72, 2008.

[194] A. Ying. Predicting source code changes by mining revision history. PhD thesis, 2003.

[195] A. Ying and G. Murphy. Predicting source code changes by mining change history. IEEE Transactions on Software Engineering, 30(9):574–586, 2004.

[196] V. Yohai. High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics, 15:642–65, 1987.

[197] K.-A. Yoon, O.-S. Kwon, and D.-H. Bae. An Approach to Outlier Detection of Software Measurement Data using the K-means Clustering Method. In Empirical Software Engineering and Measurement, pages 443–445, 2007.

[198] H. Zhang. On the distribution of software faults. IEEE Transactions on Software Engineering, 2008.

[199] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A new data clustering algorithm and its applications. Data Mining and Knowledge Discovery, 1(2):141–182, 1997.

[200] S. Zhong, T. Khoshgoftaar, and N. Seliya. Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems, 19(2):20–27, Mar. 2004.

[201] S. Zhong, T. Khoshgoftaar, and N. Seliya. Unsupervised learning for expert- based software quality estimation. In Proceeding of the Eighth IEEE International Symposium on High Assurance Systems Engineering, pages 149–155, 2004.

[202] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl. Mining version histories to guide software changes. IEEE Transactions on Software Engineering, 31(6):429–445, 2005.

Appendices


Appendix A

Papers and articles published

SQuORE: a new approach to software project assessment.

Boris Baldassari SQuORING Technologies 76, Allées Jean Jaurès, 31000 Toulouse - France www.squoring.com [email protected] July 10, 2013

Abstract

Quality has a price. But non-quality is even more expensive. Knowing the cost and consequences of software assets, being able to understand and control the development process of a service, or quickly evaluating the quality of external developments are of primary importance for every company relying on software. Standards and tools have tried with varying degrees of success to address these concerns, but there are many difficulties to be overcome: the diversity of software projects, the measurement process – from goals and metrics selection to data presentation – or the user's understanding of the reports. These are situations where the SQuORE business intelligence tool introduces a novel decision-based approach to software project quality assessment, by providing a more reliable, more intuitive, and more context-aware view on quality. This in turn allows all actors of the project to share a common vision of the project's progress and performance, which then allows efficient enhancement of both the product and the process. This position paper presents how SQuORE solves the quality dilemma, and showcases two real-life examples of industrial projects: a unit testing improvement program, and a fully-featured software project management model.

Key words: software quality, key performance indicators, trend analysis, measurement, quality models, process evaluation, business intelligence, project management.

1 Introduction

Despite an increasing interest in software quality, many still think about quality achievement as an expensive and unproductive process. On the other hand, as Shaw [23] pointed out a few years earlier, the software engineering discipline is currently in the process of maturing: in the past decade, new methods and new processes have grown, standards have been published, and the many years of experience gathered in the field brought much feedback and a new maturity to the discipline. The time has come for a new era in business intelligence [5] for software projects.

In the second section of this paper, we lay down the ground foundations of our approach to software measurement and present the state of practice, along with some terms and concepts about software development quality and management. In section three, we discuss the SQuORE approach to software projects assessment, with its features and benefits. Section four

further expands the scope of quality assessment by showing two real-life implementations that demonstrate the use of SQuORE, with unit testing prioritisation and project- and schedule dashboards and models.

2 State of practice

2.1 The cost of failures

There are many examples of software project failures, and their associated costs – either in human or financial losses. All of them originate from non-quality or lack of control on development. Some well-known example bugs in the past decades include:

• Therac-25: six deaths before being fixed, took two years to diagnose and fix [15].
• Mars Climate Orbiter: race conditions on bus priority, system rebooted continuously and robot eventually crashed [19].
• Patriot missile target shifted 6 meters every hour due to float precision bug. 28 soldiers killed in Dhahran [25].
• Ariane 5 infamous buffer overrun crash due to abusive reuse of Ariane 4 software [16]: 500 million $ pure loss.
• AT&T failure of 1990: software upgrade of switch network led to a 9 hours crash, traced back to a missing break [20]. 60 million $ lost revenue.

According to a study conducted by the U.S. Department of Commerce's National Institute of Standards and Technology (NIST), bugs and glitches cost the U.S. economy about 59.5 billion dollars a year [21]. The Standish Group CHAOS 2004 report [24] shows failure rates of almost 70%.

2.2 Standards for software products and processes

Many quality-oriented standards and norms have been published in the last decades, some focusing on product quality (from Boehm [1] and McCall [18], further simplified and enhanced by ISO 9126 [9]), while others rather consider the process quality (e.g. ISO/IEC 15504 [6] and CMMi [2]). More recently, two quality assessment standards have been developed: SQALE [14, 13], a generic method independent of the language and source code analysis tools, mainly relying on technical and design debts, and ISO SQuARE [7], the successor of ISO 9126, which is still being developed. Furthermore, some domains have their own de-facto standards: HIS and ISO 26262 [8] for the automotive industry, or DO-178 [22] for the aeronautics and critical embedded systems.

But some objections have been opposed to established standards, because:

• they may be seen as mere gut-feeling and opinions from experts, as pointed out by Jung et al. [10] for the ISO 9126, and
• they don't fit well every situation and view on quality, as they are rather scope-fixed [11].

Another point, also related to the complexity and diversity of software projects, is that published standards don't provide any pragmatic measures or tool references, which leads to misunderstanding and misconceptions of what is measured – as an example, consider the thousands of different ideas and concepts behind the Mean Time To Failure metric [12].

2.3 Metrics for software quality measurement

There is a huge amount of software-oriented metrics available in the literature. Examples of widely-used metrics include McCabe's cyclomatic complexity for control flow [17], Halstead's complexity for data flow [4], and size or coupling measures. Each measure is supposed to characterise some attributes of software quality, and they have to be put together to give the complete picture.

Another means to assess software quality is the number of non-conformities to a given reference. As an example, if naming or coding conventions or programming patterns have been decided, then any violation of these rules is supposed to decrease the

quality, because it threatens some of the characteristics of quality, like analysability (for conventions), or reliability (for coding patterns).

The concept of technical debt, coined by Ward Cunningham in 1992 [3] and gaining more and more interest nowadays, can be considered as the distance to the desired state of quality. In that sense, it is largely driven by the number and importance of non-conformities.

The trend of software measurement globally tends to multi-dimensional analysis [12]: quality or progress of a software project or product is a composite of many different measures, reflecting its different characteristics or attributes. The next step is the way information can be aggregated and consolidated.

3 Principles of SQuORE

The purpose of SQuORE is to retrieve information from several sources, compute a consolidation of the data, and show an optimised report on software or project state. The rating of the application is displayed on a common, eye-catching 7-steps scale which allows immediate understanding of the results, as shown in Figure 1.

3.1 Architecture

SQuORE analysis process can be broken down in three separate steps: data providers that take care of gathering inputs from different sources, the engine, which computes the consolidation and rating from the base measures gathered by data providers, and the dashboard to present information in a smart and efficient way.

3.2 Data Providers

As stated in our introduction, there are nowadays many tools available, each one having a specific domain of expertise and an interesting, but partial, view on quality. SQuORE brings in the glue and consistency between them all, by importing this information – any type of input is accepted, from XML or CSV to Microsoft binary files – and processing it globally.

The SQuORE analyser runs first. It is fast, does not need third-party dependencies, and constitutes a tree of artefacts corresponding to the items measured: source code, tests, schedule, hardware components, or more generally speaking any kind of item able to represent a node of the project hierarchy. In addition, the SQuORE engine adds the findings and information from external tools, attaching them to the right artefacts with their values and meaning.
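To make the artefact tree and the data-provider step more concrete, here is a minimal sketch in Python; the class and field names (Artefact, measures, findings) are illustrative assumptions and do not reflect SQuORE's internal API.

```python
from dataclasses import dataclass, field

@dataclass
class Artefact:
    """One node of the project hierarchy: application, folder, file, function, test..."""
    name: str
    kind: str
    measures: dict = field(default_factory=dict)   # base measures, e.g. {"SLOC": 120}
    findings: list = field(default_factory=list)   # non-conformities reported by external tools
    children: list = field(default_factory=list)

    def add_child(self, child: "Artefact") -> "Artefact":
        self.children.append(child)
        return child

# A data provider attaches its results to the right artefact of the tree.
app = Artefact("MyApp", "APPLICATION")
src = app.add_child(Artefact("main.c", "FILE"))
src.measures["VG"] = 27                                     # cyclomatic complexity from the analyser
src.findings.append(("QAC", "missing default in switch"))   # finding imported from a third-party tool
```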

3.3 Data Consolidation

Figure 2: Artefacts and Quality Trees

Figure 1: The 7-levels SQuORE rating

Once the base measures have been collected, SQuORE computes derived measures as defined in the quality model and builds the quality hierarchy for every node of the artefact tree, as shown in Figure 2.
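As a rough illustration of this bottom-up consolidation, the sketch below computes two derived measures over a nested artefact structure; the measure names (SLOC, CLOC, COMR) and the aggregation rules are assumptions chosen for the example, not the actual quality model.

```python
def consolidate(node: dict) -> dict:
    """Compute derived measures for every node, leaves first, then aggregate upwards."""
    for child in node.get("children", []):
        consolidate(child)
    # Derived measure: total source lines, summed from the leaves up to the root.
    node["SLOC"] = node.get("SLOC", 0) + sum(c["SLOC"] for c in node.get("children", []))
    # Derived measure: comment ratio, guarded against empty artefacts.
    node["CLOC"] = node.get("CLOC", 0) + sum(c["CLOC"] for c in node.get("children", []))
    node["COMR"] = node["CLOC"] / node["SLOC"] if node["SLOC"] else 0.0
    return node

project = {"name": "MyApp", "children": [
    {"name": "a.c", "SLOC": 120, "CLOC": 30, "children": []},
    {"name": "b.c", "SLOC": 80,  "CLOC": 10, "children": []},
]}
consolidate(project)   # project["SLOC"] == 200, project["COMR"] == 0.2
```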

3.3.1 Quality Models

Quality models define how data are aggregated and summarised from the leaves up to the root artefact (usually the application or project): base measures collected by data providers are transformed into derived measures and associated to the different attributes of quality.

As stated before, there is no silver bullet for quality models [11]: one has to tailor the assessment method, considering the specific needs and goals of the development. In many cases existing models¹ constitute a good start, and they should simply be fine-tuned to fit most common needs. But for specific or uncommon situations, SQuORE proposes everything one would need in such a task, from basic operations on measures to complex, cross-tree computations.

3.3.2 Metrics, Scales, Indicators

Raw measures give the status of a characteristic, without qualitative judgement. Indicators give this information, by comparing the measure to a scale. Scales define specific levels for the measure and their associated rating, which allow fine-tuning the model with specific thresholds and weights. As an example, the well-known cyclomatic complexity metric [17] for functions could be compared to a four-level scale, such that:

• from 0 to 7, rating is A (very good) and weight for technical debt is 0,
• from 7 to 15, rating is B (ok) and weight is 2,
• from 15 to 25, rating is C (bad) and weight is 4,
• above 25, rating is D (very bad) and weight is 16, because you really should refactor it.

Considering this, the cyclomatic complexity indicator gives at first sight the status of the intended meaning of the metric (a minimal sketch of such a scale is given after this section).

3.3.3 Action Items

Quite often, the dynamics of development depend on many different factors: as an example, if a function is quite long, has an important control complexity, many non-conformities, and a poor comment rate, then it should really be looked at, although none of these individual indicators, taken separately, would be worth raising it. Action items serve this goal: the quality officer can define triggers, which can be any combination of indicators on the artefact tree, and SQuORE will create action items if one or all criteria are met. A helpful description about the problem and its resolution is displayed as well. Because they can be applied on any artefact type, the possibilities of action items are almost limitless. They are often the best way to implement experience-based heuristics, and automate long, tedious, and error-prone checking processes.

3.4 From Quality Assessment to Project Monitoring

Quality assessment, as a static figure, is the first step to project control: if you don't know where you are, a map won't help. The next step is to monitor the dynamics of the project, by following the evolution of this static status across iterations – this can be thought of as search-based decision making, as described by Hassan et al. in [5]. SQuORE proposes for this several mechanisms:

• Trends show at first sight how an artefact or attribute of quality did evolve.
• Drill-downs, sorts and filters help identify quickly what artefacts actually went wrong.
• The quality view helps understand why the rating went down, and what should be done to get it back to a good state.
• Action items help identifying complex evolution schemas, by specifying multiple criteria based on artefacts, measures and trends (see the trigger sketch below).

¹The default SQuORE setup proposes several models and standards: SQALE, ISO 9126 Maintainability and Automotive HIS are available right out-of-the-box.
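The four-level scale above translates directly into a small lookup. The following Python sketch only encodes the thresholds and weights quoted in Section 3.3.2; the function name and the handling of boundary values are assumptions, and this is not how SQuORE models are actually configured.

```python
# Cyclomatic complexity (VG) scale from Section 3.3.2: A/B/C/D with debt weights 0/2/4/16.
VG_SCALE = [
    (7,  "A", 0),    # up to 7: very good
    (15, "B", 2),    # up to 15: ok
    (25, "C", 4),    # up to 25: bad
]
VG_DEFAULT = ("D", 16)   # above 25: very bad, refactoring advised

def rate_vg(vg: int) -> tuple:
    """Map a raw cyclomatic complexity value to a rating letter and a technical-debt weight."""
    for upper, rating, weight in VG_SCALE:
        if vg <= upper:
            return rating, weight
    return VG_DEFAULT

assert rate_vg(5) == ("A", 0)
assert rate_vg(27) == ("D", 16)
```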

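An action item trigger (Section 3.3.3) can be thought of as a predicate over several indicators of the same artefact. The sketch below is purely illustrative: the measure names and thresholds are invented for the example, not taken from the paper.

```python
def needs_review(measures: dict) -> bool:
    """Hypothetical trigger: raise an action item when several weak signals combine,
    even though none of them alone would be worth raising."""
    return (
        measures.get("SLOC", 0) > 100         # the function is quite long
        and measures.get("VG", 0) > 15        # important control complexity
        and measures.get("NCC", 0) > 5        # many non-conformities
        and measures.get("COMR", 1.0) < 0.1   # poor comment rate
    )

if needs_review({"SLOC": 180, "VG": 22, "NCC": 9, "COMR": 0.04}):
    print("Action item: review this function, add comments and reduce its complexity.")
```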
4 Use Cases

4.1 General Feedback

From our experience, there are some common reactions to a SQuORE evaluation:

• People are able to quickly identify issues and are not overwhelmed by the amount of information, which allows finding in a few minutes serious issues like missing breaks. Specialised tools are indeed able to uncover such issues, but due to the sheer volume of results they generate, it is not uncommon for end users to miss important results.

• People are glad to see that their general feeling about some applications is verified by pragmatic evidence. This re-enforces the representativeness of measurement and puts facts on words².

• Developers are concerned by their rating: the simple, eye-catching mark is immediately recognised as a rating standard. Further investigations help them understand why it is so, and how they could improve it³.

4.2 Unit Test Monitoring

One of our customers needed to know what parts of software had to be tested first for maximum efficiency. Until now, the process was human-driven: files to be tested were selected depending on their history and recent changes, the complexity of their functions, and their number of non-conformities. The team had developed home-grown heuristics gathered from years of experience, with defined attributes and thresholds.

We built a model with inputs from two external tools, QAC and Rational Test Real Time, and the SQuORE analyser. From these, we defined four main measures: non-conformities from QAC, RTRT, and SQuORE, plus cyclomatic complexity. Weights and computations were then selected in such a way that the files would get exponential-like ratings: as an example, if a file had only one of these main measures marked as bad, it would get a weight of 2. For two, three or four bad measures, it would get respectively a weight of 8, 16 or 32. This allowed quickly identifying the worst files in the artefact hierarchy (a sketch of this weighting scheme is given after this section).

Figure 3: Artefact filters

Folder ratings were computed according to the number of bad files they had under their hierarchy and their relative badness, which allowed focusing on worst components easily by sorting the artefact tree by folder ratings.

Action items were set up for the files that really needed to be checked, because they had either really bad ratings, or cyclomatic complexities or numbers of non-conformities that exceeded by far the defined criteria. Such action items allowed identifying problematic files hidden in hierarchies of good or not-so-bad files.

We were able to achieve the following goals:

• Define a standardised, widely-accepted means of estimating testing efforts.

• Reproduce some gut-feeling mechanisms that had been thoroughly experienced and fine-tuned along the years by human minds, without having been explicitly formalised until now.

²In other words: I told you this code was ugly!
³In other words: I suspected this part needed refactoring.
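The exponential-like rating used for unit test prioritisation in Section 4.2 (a weight of 2 for one bad measure, then 8, 16 and 32 for two, three and four) can be sketched as follows; the function name, the weight of 0 for a clean file and the ranking step are assumptions added for illustration, not the customer's actual model.

```python
# Weight grows roughly exponentially with the number of main measures marked as bad,
# so that the worst files stand out immediately (0 bad measures -> 0 is assumed here).
BAD_COUNT_WEIGHT = {0: 0, 1: 2, 2: 8, 3: 16, 4: 32}

def file_weight(qac_bad: bool, rtrt_bad: bool, squore_bad: bool, vg_bad: bool) -> int:
    """Combine the four main measures (QAC, RTRT and SQuORE non-conformities,
    plus cyclomatic complexity) into a single test-prioritisation weight."""
    return BAD_COUNT_WEIGHT[sum((qac_bad, rtrt_bad, squore_bad, vg_bad))]

files = {"driver.c": (True, False, False, True), "utils.c": (False, False, False, False)}
ranked = sorted(files, key=lambda f: file_weight(*files[f]), reverse=True)
print(ranked)   # worst files first: ['driver.c', 'utils.c']
```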

Dashboard graphs were set up to get immediate visual information on the rate of bad files in components. The evolution of component ratings also helped to identify parts of software that tended to entropy or bad testability, and take appropriate actions with development teams.

4.3 Project Monitoring

Figure 4: Example scorecards

Another experience encompassed a full software project monitoring solution: the customer wanted to have a factual and unified vision on the overall progress of his developments. The SQuORE analyser was used for the source code quality assessment. Additional data providers were defined to retrieve data from change management, test, and scheduling (as the number of open/closed tasks) tools.

Considering these entries, we defined the following axes in the quality model, as shown in Figure 4:

• Are the quality objectives respected? – based on the ISO 9126 maintainability assessment of code.
• Are we on-time? – an agile-like reporting on tasks completion.
• Are we overwhelmed by new issues? – the amount of defect reports waiting for treatment.
• Was the product deteriorated? – which reflects the state of test coverage and success: regression, unit, and system tests.

These axes are summarised in a global indicator (Figure 5), showing the composite progress of the project. In this case, we chose to take the worst sub-characteristic as the global rating, to directly identify problems on progress (a minimal sketch of this aggregation is given at the end of this section).

Action items were developed to identify:

• Parts of code that had a bad maintainability rating, were not enough covered by tests, and got many defect change requests.
• Schedule shifts, when the number of opened tasks was too high for the remaining time and many change requests were incoming (opened).

Figure 5: Project Quality model

The Jenkins continuous integration server was used to execute SQuORE automatically on a daily basis. Automatic retrieval of data and analysis was needed to ensure reliability and consistency of data – the constant availability being one of the keys to meaningful data.

Dashboards were defined for the main concerns of the project: evolution of the maintainability rating on files, evolution of change requests treatment, and failing tests. The scheduling information was represented using burn-down and burn-up charts, and dot graphs with trends and upper- and lower-limits.
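Taking the worst sub-characteristic as the global project rating, as described above, boils down to selecting the lowest grade among the four axes. A minimal sketch, with axis names paraphrased from the paper and the 7-level scale assumed to run from A (best) to G (worst):

```python
LEVELS = ["A", "B", "C", "D", "E", "F", "G"]   # assumed ordering of the 7-level scale

def global_rating(axes: dict) -> str:
    """Global project indicator: the worst rating among the sub-characteristics,
    so that any degraded axis is immediately visible at the top level."""
    return max(axes.values(), key=LEVELS.index)

print(global_rating({
    "quality objectives respected":  "B",   # ISO 9126 maintainability of the code
    "on time":                       "A",   # agile-like task completion
    "incoming issues under control": "C",   # defect reports waiting for treatment
    "product not deteriorated":      "B",   # regression, unit and system test results
}))   # -> "C"
```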

We were able to:

• Provide real-time information about current state and progress of project. Web links to the Mantis change management system even allowed knowing exactly what actions are on-going.
• Get back confidence and control on project to team leaders and developers by enabling a crystal-clear, shared and consistent vision.

References

[1] B. W. Boehm, J. R. Brown, and M. Lipow. Quantitative evaluation of software quality. In Proceedings of the 2nd international conference on Software engineering, pages 592–605, San Francisco, California, United States, 1976. IEEE Computer Society Press.

[2] CMMI Product Team. CMMI for Development, Version 1.3. Technical report, Carnegie Mellon University, 2010.

[3] Martin Fowler. Martin Fowler on Technical Debt, 2004.

[4] Maurice H. Halstead. Elements of Software Science (Operating and programming systems series). Elsevier Science Inc., New York, NY, USA, 1977.

[5] Ahmed E. Hassan and Tao Xie. Software Intelligence: The Future of Mining Software Engineering Data. In Proc. FSE/SDP Workshop on the Future of Software Engineering Research (FoSER 2010), pages 161–166. ACM, 2010.

[6] ISO IEC. ISO/IEC 15504-1 – Information technology – Process assessment. Software Process: Improvement and Practice, 2(1):35–50, 2004.

[7] ISO IEC. ISO/IEC 25000 – Software engineering – Software product quality requirements and evaluation (SQuaRE) – Guide to SQuaRE. Systems Engineering, page 41, 2005.

[8] ISO. ISO/DIS 26262-1 – Road vehicles – Functional safety – Part 1: Glossary. Technical report, International Organization for Standardization / Technical Committee 22 (ISO/TC 22), 2009.

[9] ISO/IEC. ISO/IEC 9126 – Software engineering – Product quality. 2001.

[10] Ho-Won Jung, Seung-Gweon Kim, and Chang-Shin Chung. Measuring software product quality: A survey of ISO/IEC 9126. IEEE Software, 21:88–92, 2004.

[11] Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2002.

[12] C. Kaner and Walter P. Bond. Software engineering metrics: What do they measure and how do we know? In 10th International Software Metrics Symposium, METRICS 2004, pages 1–12, 2004.

[13] Jean-Louis Letouzey. The SQALE method for evaluating Technical Debt. In 2012 Third International Workshop on Managing Technical Debt, pages 31–36, 2012.

[14] Jean-Louis Letouzey and Thierry Coq. The SQALE Analysis Model: An analysis model compliant with the representation condition for assessing the Quality of Software Source Code. In 2010 Second International Conference on Advances in System Testing and Validation Lifecycle (VALID), pages 43–48, 2010.

[15] Nancy Leveson and Clark S. Turner. An Investigation of the Therac-25 Accidents. IEEE Computer, 26(7):18–41, 1993.

[16] J.L. Lions. ARIANE 5 Flight 501 failure. Technical report, 1996.

[17] T.J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.

[18] J.A. McCall. Factors in Software Quality: Preliminary Handbook on Software Quality for an Acquisition Manager. Information Systems Programs, General Electric Company, 1977.

[19] National Aeronautics and Space Administration. Mars Climate Orbiter Mishap Investigation Report. Technical report, National Aeronautics and Space Administration, Washington, DC, 2000.

[20] Peter G. Neumann. Cause of AT&T network failure. The Risks Digest, 9(62), 1990.

[21] National Institute of Standards and Technology (NIST), Department of Commerce. Software errors cost U.S. economy $59.5 billion annually, 2002.

[22] RTCA. DO-178B: Software Considerations in Airborne Systems and Equipment Certification. Technical report, Radio Technical Commission for Aeronautics (RTCA), 1982.

[23] Mary Shaw. Prospects for an engineering discipline of software. IEEE Software, (November):15–24, 1990.

[24] The Standish Group International Inc. CHAOS Technical Report. Technical report, 2004.

[25] United States General Accounting Office. Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia. Technical report, United States General Accounting Office, 1992.

SQuORE : une nouvelle approche de mesure de la qualité des projets logiciels

Boris Baldassari,

SQuORING Technologies, 76 Allées Jean Jaurès, 31000 Toulouse Tel: +33 581 346 397, E-mail: [email protected]

Abstract : La qualité a un prix. Mais la non-qualité est encore plus chère. Connaître les coûts et la valeur du patrimoine logiciel, mieux comprendre les équipes de développement, ou évaluer rapidement la qualité d'un développement externalisé sont des activités centrales pour toute société dont le fonctionnement s'appuie sur l'informatique. Des standards et outils ont été mis au point pour adresser ces éléments, avec plus ou moins de succès, mais les difficultés sont nombreuses, de par la diversité des projets logiciels, la mise en place effective d'un processus de mesure – de la sélection des objectifs et métriques de mesure à l'affichage de l'information, ou la compréhension par les acteurs du processus qualité. C'est pour ces situations que l'outil SQuORE introduit une approche novatrice de la qualité logicielle, plus fiable, plus intuitive et plus contextuelle, appliquant les techniques de la Business Intelligence à l’analyse des projets logiciels. SQuORE offre ainsi une vision et une compréhension unifiées de l'état et des avancées de projets, permettant une amélioration pragmatique du processus et des développements. Cet article présente comment la solution SQuORE résout le dilemme de la qualité, et décrit deux cas issus de projets industriels : un programme de priorisation des tests unitaires, et un modèle de processus plus complet, s'appuyant sur de nombreux éléments hors code source.

1 Introduction

Malgré un intérêt croissant pour les notions de qualité logicielle, beaucoup la considèrent encore comme une activité coûteuse et non productive. Mary Shaw[23] a étudié en 1990 l'évolution de la discipline du génie logiciel, en la comparant avec d'autres sciences plus anciennes et plus matures (telles que l’architecture) ; le génie logiciel n’en est qu’à ses premiers pas en tant que science, et la discipline doit encore évoluer et gagner en maturité pour parfaire son développement. Mais de nombreux progrès ont été faits ces dernières années, l’expérience s’accumule et l'heure est venue d’une nouvelle ère pour le développement logiciel, avec l'analyse multidimensionnelle [12] et la prise de décision par fouille de données [5].

Dans la seconde section de cet article, nous rappelons quelques termes et définitions, et récapitulons l'état de l'art et des pratiques actuelles. La section 3 décrit l'approche choisie pour SQuORE, ses fonctionnalités et avantages. La section 4 étend davantage le périmètre de l'analyse logicielle en montrant deux exemples industriels d'implémentation de modèles de qualité, pour les tests et pour la gestion de projet.

2 Etat de l'art

1 Coût de la non-qualité Les exemples d'erreurs logicielles et leurs coûts associés ne manquent pas, que les pertes soient humaines ou financières. Toutes trouvent leur origine dans un manque de qualité et de contrôle sur le processus de développement. Quelques exemples connus sont reproduits ci-après :

 Therac-25 : surdosage de radiations sur un appareil médical : 6 morts et deux ans avant correction [15].

 Mars Climate Orbiter : durée des tests insuffisante, le robot s'est écrasé suite au remplissage de son espace mémoire [19].

 Missile Patriot : dérive de 6 mètres par heure due à un bug de conversion. 28 soldats tués à Dhahran [25].

 Ariane 5 : réutilisation abusive de routines écrites pour Ariane 4, 500 M$ de pure perte [16].

 Crash du réseau AT&T (90's) : une mise à jour des routeurs rend le réseau inutilisable pendant 9 heures, 60M$ de perte de revenus, cause du bug : un break manquant [20].

Le National Institute for Standards and Technology (NIST) a commandé en 2002 un rapport sur le coût estimé des problèmes logiciels pour l'industrie américaine : 60 Milliard de dollars par an. Le rapport technique CHAOS, édité par le Standish Group, faisait état en 2004 de taux d'échec des projets logiciels de 70%. Les chiffres sont un peu meilleurs depuis, mais le constat reste accablant.

2 Standards pour produits et processus logiciels De nombreux standards et normes dédiés à la qualité ont été produits par les organismes de standardisation ces dernières décennies, certains adressant plutôt la qualité des produits (de Boehm [1] à McCall [18], simplifié et amélioré par l'ISO 9126 [9]), tandis que d'autres adressent plutôt la qualité du processus (ainsi ISO 15504 [6] et le CMMi [2]). Plus récemment, deux nouveaux modèles ont vu le jour : SQALE [13, 14], une méthode indépendante du langage et des outils d'analyse utilisés, qui s'appuie principalement sur les notions de dette technique et de design, et l’ISO SQuaRE [7], successeur de l'ISO 9126 toujours en cours de développement. De plus, certains domaines ont leur propre standards de-facto : HIS et ISO 26268 [8] pour l'industrie automobile ou la DO-178 [22] pour l'aéronautique et les systèmes embarqués critiques.

Mais certaines objections à ces modèles ont vu le jour dans les dernières années :

 Les standards, et certains modèles en particulier, sont parfois considérés comme le jugement subjectif et l'opinion d'experts, ainsi que relevé par Jung et al. [10] pour l'ISO 9126.

 Ils ne sont pas adaptés à tous les contextes et tous les besoins de qualité [11].

Un autre argument employé, toujours dans le cadre de la complexité et de la diversité des projets logiciels, est qu'ils ne proposent aucune mesure, ni aucun outil pratique pour mettre en œuvre l'analyse de qualité, ce qui provoque l'incompréhension de certaines mesures – à titre d'exemple, considérons la métrique MTBF, ou Mean Time Between Failure et la myriade de définitions qui lui sont associées [12].

3 Métriques pour la mesure de la qualité logicielle La littérature propose un grand nombre de métriques orientées logiciel, telles que par exemple la complexité cyclomatique, introduite par McCabe [17] pour mesurer la complexité du flot de contrôle, ou les mesures d'Halstead [4], qui permettent de qualifier le flot de données.

Un autre moyen d'évaluer la qualité d'un code est le nombre de non-conformités à une certaine référence. Par exemple, si des conventions de nommage et de codage ont été décidées, toute violation à ces règles est supposée diminuer la qualité, ou certaines de ses sous-caractéristiques, telles que l'analysabilité (pour les conventions de nommage) ou la fiabilité (pour les conventions de codage).

La notion de dette technique, édictée la première fois par Ward Cunningham en 1992 [3], et suscitant de plus en plus d'intérêt de nos jours, peut être vu comme la distance entre l'état actuel du logiciel et l'état de qualité souhaité. En ce sens, la dette technique est largement liée au nombre de non-conformités présentes dans l'application.

La tendance actuelle de la qualimétrie va vers l'analyse multidimensionnelle [12] : la qualité d'un produit ou d'un processus logiciel est la composée d'un ensemble de caractéristiques ou attributs, et donc de leurs mesures associées. Reste à savoir comment agréger et consolider ces mesures composites.

3 Les principes de SQuORE En quelques mots, SQuORE récupère des informations issues de sources variées, agrège intelligemment les données et présente un rapport optimisé de l'état du logiciel ou du projet. La note principale de l'application est affichée sur une échelle de 7 niveaux, reconnaissable par tous, et donnant une indication visuelle immédiate sur les résultats de l'analyse.

Figure 1 : La notation SQuORE

4 Architecture

Le processus d'analyse SQuORE se décompose en trois étapes distinctes : la récupération des données par les data providers, le calcul des données composées par le moteur (engine), et la présentation des résultats, courbes et tableaux : le dashboard.

5 Data Providers Ainsi que dit précédemment, il existe aujourd'hui de nombreux outils d'analyse de la qualité, chacun avec son propre domaine d'expertise et apportant une vue intéressante, mais partielle, sur la qualité. SQuORE rassemble ces informations et apporte la cohérence nécessaire à une utilisation intelligente de ces données. Tous les types de données sont acceptés : XML, CSV, appels API, jusqu'aux formats binaires, tels les anciens documents Microsoft Word.

L'analyseur SQuORE est exécuté en premier ; il est rapide, ne demande aucune compilation ou dépendance, et construit l'arbre des artefacts (quel que soit leur type : code source, tests, documentation, composants hardware, etc.) sur lesquels viendront se greffer les mesures issues d'outils tiers.

6 Consolidation des données Une fois les mesures de base collectées, SQuORE agrège l'ensemble et calcule les mesures dérivées définies dans le modèle qualité pour chaque artefact, ainsi que montré en Figure 2.

Modèles de qualité

Figure 2 : Arbres des artefacts et de qualité

Les modèles de qualité définissent la manière dont sont agrégées les métriques, des feuilles (fonctions, tests, etc.) jusqu'à la racine du projet : les mesures composites sont associées à un attribut de la qualité, lui-même affecté à chaque artefact de l'arbre.

Il n'existe pas de modèle qualité parfait et universel : chacun doit adapter la méthode d'analyse, en prenant en considération les besoins spécifiques et les objectifs de développement. Dans la plupart des cas l'élaboration du modèle de qualité se fait à partir de l'un des modèles existants, qui est adapté, pondéré différemment en fonction du contexte. Pour des situations plus complexes SQuORE propose tous les moyens nécessaires à la mise en œuvre d'un nouveau modèle, des calculs les plus simples aux requêtes les plus alambiquées.

Métriques, échelles, indicateurs

Les mesures en elles-mêmes donnent la valeur d'une caractéristique, sans jugement qualitatif. Les indicateurs fournissent cette information en comparant la valeur à une échelle de niveaux adaptée, ce qui permet de configurer les seuils et poids pour chaque niveau de jugement. Par exemple, le célèbre nombre cyclomatique [17] au niveau des fonctions pourrait être réparti sur une échelle de 4 niveaux :  De 1 à 7, la fonction est notée A (très bon) et son poids de dette technique est de 0.

 De 8 à 15, la fonction est notée B (ok), et son poids est de 2.

 De 16 à 25, la fonction est notée C (mauvais) et son poids est de 4,

 Au-delà de 25, la fonction est notée D (très mauvais) et son poids est de 16 – car un refactoring est nécessaire.

Appliqué à cette échelle, l'indicateur du nombre cyclomatique donne immédiatement l'état de complexité de la fonction, quels que soient les triggers utilisés pour le contexte local.

Action Items

Il arrive souvent qu'une caractéristique que nous souhaitons évaluer dépende de plusieurs mesures de base. Par exemple, si une fonction est plutôt longue, a une complexité au-dessus de la moyenne, de nombreuses violations et un faible taux de commentaire, il est certainement intéressant de s'y pencher, alors même que chacun de ces éléments, pris individuellement, ne sont pas décisifs. Les action items permettent de mettre en place des triggers qui se déclenchent sur une variété de critères entièrement paramétrables. Une description du problème et des pistes pour sa résolution sont affichées, avec l'emplacement précis des différents critères qui ont déclenché le trigger. Les triggers peuvent être positionnés sur tout type d'artefact, ce qui fait d'eux un moteur puissant, par exemple pour formaliser et automatiser des heuristiques issues de l'expérience et du savoir-faire des équipes.

7 De la qualité produit à la gestion de projets L'évaluation de la qualité d'un produit ou d'un processus est la première étape pour mieux contrôler ses développements (si l'on ne sait pas où l'on est, une carte est de peu d'utilité). L'étape suivante est de surveiller l'évolution du projet au fil du temps et des versions, et d'en extraire les éléments nécessaires à la prise de décision – processus décrit par Hassan et al. [5]. SQuORE propose de nombreux mécanismes d'exploration pour aider à cette démarche :

 Les graphiques d'évolution montrent l'historique des mesures, et attributs de qualité au fil des versions, et la tendance.

 Les nombreuses options de filtre et de tri permettent de localiser rapidement certains éléments (e.g. éléments dégradés, nouvelles fonctions, fichiers les plus mal notés).

 L'arbre de qualité montre quelles caractéristiques de la qualité se sont dégradées : maintenabilité du code, dérive du planning, tests échoués ?

 Les actions items permettent d'identifier des schémas d'évolution complexes, basés sur des caractéristiques actuelles ou sur certaines tendances (par exemple, dégradation de la testabilité, échecs nombreux sur les tests existants, diminution du taux de couverture de test). 4 Cas d'utilisation

8 Retours immédiats des équipes Lors de nos présentations de l'outil SQuORE, certaines situations et réactions reviennent régulièrement:

 Les personnes sont capables d'identifier rapidement certains problèmes sérieux, sans être submergées par le flot d'informations. Souvent, ces problèmes avaient déjà été détectés par d'autres outils, mais l'information n'avait pas été identifiée ou suffisamment mise en valeur.

 Le sentiment général à propos de certaines applications connues des équipes est confirmé par des preuves factuelles et démontrables. Cela démontre la représentativité de la mesure et rend visibles et argumentables les pratiques de développement.

 Les développeurs sont concernés par la note de leur code et sa représentation simple et concise. L'utilisation de l'échelle d'énergie, reconnue de tous, renforce le caractère de standard (local) du modèle de qualité.

9 Modèle de test unitaire Les délais de développement étant toujours trop courts pour le marché, il arrive parfois de devoir prioriser l'effort de test pour maximiser la rentabilité et le nombre de problèmes trouvés avant livraison, sachant que l'exécution complète des tests est de toute façon impossible dans les temps impartis.

L'un de nos clients avait seulement quelques jours de test pour qualifier une version logicielle. Jusque-là, le processus était essentiellement humain : les fichiers à tester étaient sélectionnés par rapport à leur historique de test, leurs récents changements, la complexité des fonctions et le nombre de violations détectées. A force d'expérience, certaines heuristiques avaient été développées pour aider à l'identification des fichiers prioritaires, avec attributs et seuils de tolérance.

Nous avons donc construit un modèle de qualité dédié, prenant en entrées les sorties de deux outils utilisés par l'équipe : QAC et Rational Test Real Time, plus l'analyseur SQuORE. Quatre mesures ont été définies : nombre de non-conformités QAC, RTRT, SQuORE, et le nombre cyclomatique de chaque fonction. La notation du modèle est construite sur une base exponentielle. Par exemple, si un fichier a une seule de ses mesures en défaut, sa note de dette technique sera de 2. Si deux, trois, ou quatre de ces mesures sont en défaut, le fichier recevra une note qui sera respectivement de 8, 16, 32. Cette structure permet d'identifier rapidement les pires fichiers.

La notation des répertoires est basée sur le ratio de « mauvais » fichiers récursivement trouvés sous ledit répertoire, afin de mettre en relief les pires composants du logiciel simplement en triant l'arbre des artefacts par rapport à la notation de chaque répertoire.

Figure 3 : Filtres d'artefacts

Des action items spécifiques ont été mis en place pour les fichiers présentant un risque réel pour la validation, parce qu'ils ont une complexité cyclomatique réellement importante, ont un nombre de violations inquiétant ou une très mauvaise notation. Ces triggers permettent notamment de trouver de mauvais éléments cachés dans des hiérarchies de « bons » fichiers ou composants.

Nous avons ainsi pu :

 Mettre en place un modèle de prédiction d'efforts de test formalisé, et reconnu par les acteurs comme étant efficace. Le temps gagné à ce moment du processus est capital pour la mise sur le marché, et permet d'exécuter davantage de tests, puisque la détection est plus rapide.

 Reproduire et automatiser une heuristique « au ressenti » qui était auparavant affaire d'expérience et de savoir-faire. En le formalisant et en l'automatisant, les équipes ont pu davantage se concentrer sur l'effort de test.

10 Suivi de projet Une autre expérience industrielle concernait la mise en place d'un suivi de projet plus complet, avec progression du backlog, avancées des tests et surveillance de la qualité logicielle.

L'analyseur SQuORE a été utilisé pour la qualité produit (code Java), et des data providers ont été mis en place pour récupérer les informations des outils de gestion des changements, test, et planning.

Figure 4 : Scorecards de suivi de projet

4 axes de qualité ont été définis par rapport à ces entrées, résumés par les questions suivantes (cf. l'approche de V. Basili : Goal-Question-Metric [26]).

 « Est-ce que les objectifs de qualité sont respectés ? », basé sur la norme ISO9126.

 « Sommes-nous dans les temps? », utilisant les notions agiles de tâches.

 « Sommes-nous surchargés par les nouvelles requêtes ? », avec le nombre de bugs reportés en attente de traitement.

 « Le produit s'est-il détérioré ? », qui reprend les résultats de test (couverture et exécution).

Ces axes sont ramenés sur un unique indicateur, montrant l'avancement global du projet. Dans ce cas, la pire note trouvée parmi les sous-caractéristiques est utilisée en note globale, car il est considéré comme nominal que chacun de ces indicateurs soit positif.

Nous avons développé des action items afin d'identifier :

 Les parties du logiciel (composants ou fichiers) qui ont une mauvaise note de maintenabilité, sont peu couverts par les tests, et sont concernées par beaucoup de reports de bugs.

 Les dérives de planning, lorsque le nombre de tâches ouvertes est trop important et de nombreux bugs sont ouverts.

L'automatisation des opérations de récupération des données, puis d'exécution de l'analyse et de publication des résultats est un élément clef de la réussite d'un projet de qualimétrie [27] ; Jenkins a donc été mis en place comme serveur d'intégration continue pour exécuter SQuORE régulièrement (exécutions quotidienne et hebdomadaire).

Des dashboards spécifiques ont été mis en place pour suivre aisément les critères importants de la gestion de projet : maintenabilité, requêtes ouvertes, tâches en attente, tests exécutés avec succès... Les informations de planning ont été représentées par les graphiques agiles de burn-up et burn-down.

Nous avons ainsi pu :

 Mettre en place un suivi quotidien et consolidé de l'état d'avancement des développements, avec surveillance des régressions et de la qualité.

 Reprendre le contrôle du projet et redonner confiance aux acteurs, managers et développeurs, en proposant une vision claire et concise, partagée et cohérente des développements.

Figure 5 : Arbre de qualité suivi de projet

5 References

[1] B. W. Boehm, J. R. Brown, and M. Lipow. Quantitative evaluation of software quality. In Proceedings of the 2nd international conference on Software engineering , pages 592-605, San Francisco, California, United States, 1976. IEEE Computer Society Press.

[2] CMMI Product Team. CMMI for Development, Version 1.3. Technical report, Carnegie Mellon University, 2010.

[3] Martin Fowler. Martin Fowler on Technical Debt, 2004.

[4] Maurice H. Halstead. Elements of Software Science (Operating and programming systems se- ries). Elsevier Science Inc., New York, NY, USA, 1977.

[5] Ahmed E. Hassan and Tao Xie. Software Intelligence: The Future of Mining Software Engineering Data. In Proc. FSE/SDP Workshop on the Future of Software Engineering Research (FoSER 2010), pages 161-166. ACM, 2010.

[6] ISO IEC. ISO/IEC 15504-1 – Information Technology – Process assessment. Software Process: Improvement and Practice, 2(1):35-50, 2004.

[7] ISO IEC. ISO/IEC 25000 – Software Engineering – software product quality requirements and evaluation (SQuaRE) – guide to SQuaRE. Systems Engineering, page 41, 2005.

[8] ISO. ISO/DIS 26262-1 - Road vehicles Functional safety Part 1 Glossary. Technical report, In- ternational Organization for Standardization / Technical Committee 22 (ISO/TC 22), 2009.

[9] ISO/IEC. ISO/IEC 9126 – Software Engineering – Product quality. 2001.

[10] Ho-Won Jung, Seung-Gweon Kim, and Chang-Shin Chung. Measuring software product qual- ity: A survey of ISO/IEC 9126. IEEE Software, 21:88-92, 2004.

[11] Stephen H. Kan. Metrics and Models in Software Quality Engineering. Addison-Wesley Long- man Publishing Co., Inc., Boston, MA, USA, 2nd edition, 2002.

[12] C Kaner andWalter P. Bond. Software engineering metrics: What do they measure and how do we know? In 10Th International Software Metrics Symposium, METRICS 2004, pages 1-12, 2004.

[13] Jean-louis Letouzey. The SQALE method for evaluating Technical Debt. In 2012 Third Inter- national Workshop on Managing Technical Debt, pages 31-36, 2012.

[14] Jean-louis Letouzey and Thierry Coq. The SQALE Analysis Model An analysis model compli- ant with the representation condition for assessing the Quality of Software Source Code. In 2010 Second International Conference on Advances in System Testing and Validation Lifecycle (VALID), pages 43-48, 2010.

[15] Nancy Leveson and Clark S. Turner. An Investigation of the Therac-25 Accidents. IEEE Computer, 26(7):18-41, 1993.

[16] J.L. Lions. ARIANE 5 Flight 501 failure. Technical report, 1996.

[17] TJ McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308-320, 1976.

[18] J.A. McCall. Factors in Software Quality: Preliminary Handbook on Software Quality for an Acquisition Manager. Information Systems Programs, General Electric Company, 1977.

[19] National Aeronautics and Space Administration. Mars Climate Orbiter Mishap Investigation Re- port. Technical report, National Aeronautics and Space Administration, Washington, DC, 2000.

[20] Peter G. Neumann. Cause of AT&T network failure. The Risks Digest, 9(62), 1990.

[21] National Institute of Standards and Technology (NIST), Department of Commerce. Software errors cost U.S. economy $59.5 billion annually, 2002.

[22] RTCA. DO-178B: Software Considerations in Airborne Systems and Equipment Certification. Technical report, Radio Technical Commission for Aeronautics (RTCA), 1982.

[23] Mary Shaw. Prospects for an engineering discipline of software. IEEE Software, (November):15-24, 1990.

[24] The Standish Group International Inc. CHAOS Technical Report. Technical report, 2004.

[25] United States General Accounting Office. Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia. Technical report, United States General Accounting Of- fice, 1992.

[26] Basili, V. R., Caldiera, G., & Rombach, H. D. (1994). The goal question metric approach. En- cyclopedia of Software Engineering. Wiley.

[27] Basili, V., & Weiss, D. (1984). A methodology for collecting valid software engineering data. IEEE Trans. Software Eng., 10(6), 728-738.

De l'ombre à la lumière : plus de visibilité sur l'Eclipse

Résumé. L’extraction de connaissances à partir de données issues du génie logi- ciel est un domaine qui s’est beaucoup développé ces dix dernières années, avec notamment la fouille de référentiels logiciels (Mining Software Repositories) et l’application de méthodes statistiques (partitionnement, détection d’outliers) à des thématiques du processus de développement logiciel. La fondation Eclipse, l’une des plus importantes forges du monde libre, a mis en place récemment une infrastructure de centralisation de ces données et a initié le groupe de travail Polarsys pour le suivi de qualité de projets impliqués dans le développement de systèmes embarqués critiques. Cet article présente la démarche de fouille de données mise en œuvre dans le cadre de Polarsys, de la définition des exigences à la proposition d’un modèle de qualité dédié et à son implémentation sur un prototype. Les principaux pro- blèmes rencontrés et les leçons tirées sont également passés en revue.

1 Introduction

Les méthodes statistiques ont été introduites il y a plusieurs décennies déjà dans le domaine du suivi et du contrôle de processus de production, avec par exemple les méthodes SPC (Sta- tistical Process Control) ou Six Sigma. Bien que l’industrie logicielle en ait considérablement bénéficié depuis, ce n’est que récemment que les problématiques particulières du génie logiciel, c’est-à-dire l’art de construire de bons logiciels, ont été spécifiquement adressées. L’arrivée de nouvelles méthodes issues de la fouille de données (e.g. partitionnement, détection d’outliers) ouvre cependant de nouveaux horizons pour la compréhension et l’amélioration du logiciel et des pratiques de développement. De nouveaux types de données sont également utilisés : métadonnées des outils du pro- cessus (gestion de configuration, gestion des changements), analyse des moyens de communi- cation, statistiques de publications, etc. Il est nécessaire de définir et d’étudier ces nouveaux types d’artéfacts pour pouvoir les collecter de manière fiable, puis les exploiter et leur définir De l’ombre à la lumière : plus de visibilité sur l’Eclipse un lien pertinent avec les attributs de qualité. Les données issues du processus de développe- ment logiciel ont par ailleurs des propriétés particulières : leur distribution n’est pas normale, il y a de fortes corrélations entre certaines d’entre elles, et leur sémantique doit être pleinement comprise pour les rendre pleinement utilisables. En ce sens de nombreuses questions sur le processus de mesure de données logicielles, notamment relevées par Basili et Weiss(1984), Fenton(1994) et Kaner et Bond(2004), restent d’actualité pour ce domaine émergeant. Les approches proposées dans ces dix dernières années (Zhong et al., 2004; Herbold et al., 2010; Yoon et al., 2007) n’ont pas su tenir compte de ces caractéristiques et manquent encore d’ap- plications concrètes – ceci étant une caractéristique du manque de maturité de la discipline, ainsi que le soulignait Shaw(1990). Nous proposons dans cet article quelques pistes pour répondre à ces différentes probléma- tiques, et démontrons son application dans le cadre de la fondation Eclipse. La section2 pose les bases de notre réflexion et présente une topologie nominale pour les projets de dévelop- pement logiciel. La section3 propose une approche adaptée à l’application de méthodes de fouilles de données au processus de développement logiciel, et la section4 décrit sa mise en pratique dans le cadre du groupe de travail Eclipse Polarsys. Quelques exemples de résultats sont montrés en section5. Finalement, nous présentons en section6 les prochaines évolutions de la mesure de qualité pour Polarsys et récapitulons le travail et ses apports dans la section7.

2 Prolégomènes : comprendre le problème

2.1 Nature des données Lorsque l’on parle de fouille de données logicielles, la source la plus intuitive est sans doute le code source, puisque c’est l’objectif principal de tout développement et une ressource commune à tout projet logiciel. Cependant, en élargissant notre champ de vision il existe de nombreuses autres mesures qui apportent des informations importantes pour le projet, notam- ment relatives au processus de développement lui-même. Chaque référentiel d’information utilisé par le projet possède des informations exploi- tables. Il importe de lister les référentiels disponibles, puis d’identifier pour chacun d’eux les artéfacts et mesures disponibles, et de leur donner un sens. Certains types d’artéfacts sont plus pertinents que d’autres pour certains usage ; par exemple l’activité d’un projet est visible sur le nombre de révisions et/ou la diversité des développeurs, alors que la réactivité du support est davantage visible sur le système de gestion des tickets. Notamment, le croisement des données entre artéfacts s’avère très utiles mais peut être difficile à établir – par exemple les liens entre modifications de code et bugs (Bachmann et al., 2010). Les conventions de nommage sont un moyen courant de tracer les informations entre les référentiels, mais elles sont locales à chaque projet et ne sont pas nécessairement respectées ou vérifiées.

2.2 Software data sources

Although the environment in use varies with the project's maturity and context, a common base of tools and repositories can be defined for any software development project.
Source code is the material most commonly used for the analysis of software projects. From the static analysis perspective (Louridas, 2006; Spinellis, 2006), the information that can be retrieved from source code is:
– metrics, corresponding to the measurement of defined characteristics of the software: e.g. its size, the complexity of its control flow, or the maximum nesting level;
– violations, corresponding to breaches of coding or naming conventions. For example, not exceeding a metric threshold, forbidding backward gotos, or requiring a default clause in every switch.
This information is provided by analysers such as Checkstyle (http://checkstyle.sourceforge.net), PMD (http://pmd.sourceforge.net) or SQuORE (http://www.squoring.com).
Configuration management holds all the modifications made to the project tree, together with meta-information about the date, the author and the product version (branch) targeted by the revision. The position in the version tree matters, because results will not be the same for a version under active development (with many new features, typically developed on the trunk) and for a version in maintenance (less active, with mostly bug fixes, typically developed on a branch). In our analyses we reference the trunk, in order to analyse and improve the next version of the product.
The ticket management system records all the change requests made on the project. These may be problems (bugs), new features or simple questions. Some projects also use change requests to trace the requirements attached to a product.
Mailing lists are the main communication channels used within software projects. There are usually at least two lists, one dedicated to development and the other to user questions. Their archives are generally available, either in mbox format or through a web interface (e.g. mhonarc, mod_mbox, gmane or markmail).
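As an illustration of how configuration management metadata can be turned into measures, the following R sketch derives a couple of activity metrics from a Git history. It is a minimal sketch only: it assumes a local clone, and the log format string, function name and metric names are illustrative rather than the connectors actually used in this work.

# Minimal sketch (R): basic activity metrics from a Git repository.
# `repo` is a path to a local clone; metric names are illustrative.
scm_metrics <- function(repo, since = "1970-01-01") {
  fmt <- "--pretty=format:%H|%an|%ad"
  log <- system2("git", c("-C", repo, "log", "--date=short",
                          paste0("--since=", since), fmt), stdout = TRUE)
  if (length(log) == 0) return(data.frame(revisions = 0, committers = 0))
  fields <- do.call(rbind, strsplit(log, "|", fixed = TRUE))
  data.frame(
    revisions  = nrow(fields),                  # number of commits in the period
    committers = length(unique(fields[, 2]))    # distinct authors
  )
}

# Example usage: activity since the origin, then over the last three months.
# scm_metrics("path/to/project")
# scm_metrics("path/to/project", since = Sys.Date() - 90)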

2.3 Measuring software characteristics

The art of measurement is far more mature than that of software development. For instance, the notion of a person's height is shared by everyone, and everybody knows what it means to be taller. But much of this self-evidence fades when it comes to software, as pointed out by Fenton (1994) and Kaner and Bond (2004). As an example, for equivalent functionality the size of a Java program is not comparable to that of C or Perl code, because of the abstraction level of the language, its naming conventions and the available data structures.
This is all the more true for abstract characteristics such as quality, reliability or usability. From conformance to requirements to user satisfaction, many definitions, and then many quality models, have been proposed. But each of these notions differs according to individual experience and expertise, to the application domain and to the development constraints. For this reason it is important, as a preamble to any software data mining activity, to define the users' expectations and to give everyone a common ground for understanding and interpreting the objectives of the mining programme; in our case these are, for example, the developers and end users of the software product.

When defining the notion of quality, it is preferable to rely on proven and recognised models or standards, because their concepts and definitions have already been debated and have reached a common consensus to build upon. In addition, the available literature and experience reports ease communication about the programme and give solid foundations to the chosen approach, for example to support the selection of metrics with respect to quality attributes. From the product quality perspective, the de facto reference appears to be ISO 9126 (ISO, 2005). Its successor, currently being elaborated by the ISO standardisation body, is the 250xx SQuaRE series (Organization, 2005). The maturity of the development process is addressed by widely recognised initiatives such as CMMI or ISO 15504. Finally, dedicated quality models have been proposed for some domains, such as ECSS for space, HIS for automotive, or SQO-OSS for the evaluation of open source software.
The integrity of metrics suffers from this diversity and this lack of shared understanding. To clarify the measurement approach, following established frameworks such as the Goal-Question-Metric approach proposed by Basili et al. (1994) and taken up by Westfall and Road (2005) allows a more rigorous approach, one that preserves the effectiveness of the analysis and the meaning of its results.

3 Presentation of the extraction method

Mining software data is in several respects similar to running a measurement programme. The experience gathered over the years in the latter field (see notably Iversen and Mathiassen (2000), Gopal et al. (2002) and Menzies et al. (2011)) has led to good practices that help with the definition, the implementation and the use of analysis results. Drawing on this experience, we followed the steps described below to ensure the usability and the semantic integrity of the results.
Declaration of intent. The declaration of intent of the mining effort and of the quality model is essential, because it provides a methodological and semantic frame for the users' understanding and for the interpretation of results. For example, the accessible measures and the expected results will not be the same for an audit and for a continuous quality improvement programme. The intent must be simple and fit in a few sentences.
Decomposition of quality attributes. The intent of the mining programme is then formalised and decomposed into quality attributes and sub-attributes. This step forms the link between an informal declaration of intent and the concrete measures that will allow it to be measured and improved; it must be the object of a general consensus and be validated by the programme's stakeholders. The deliverable expected from this phase is a quality model that describes and decomposes the identified needs, as illustrated in Figure 1.

FIG. 1 – Process for defining the quality model and selecting the metrics.

Definition of the accessible metrics. Identifying and collecting the measurement data is a fragile step of the process, and a classic stumbling block (Menzies et al., 2011). We must first establish the list of exploitable repositories (see Section 2.2),

then identify the measures accessible in each of them, and relate them to the quality characteristics. The collection must be reliable (i.e. the information sought is systematically present and valid), automatable in order to reduce the human error factor, and understandable in order to keep the stakeholders' trust in the process.
Implementation. The collection and analysis process must be fully automated, from data collection to the publication of results, and executed regularly, for example by means of a continuous integration server such as Jenkins (Smart, 2011). Its implementation must be transparent so that users can appropriate it and possibly report problems or improvements (not without a certain irony, quality analysis and fault detection tools are themselves not free of defects; see Hatton (2007) on this topic).
Presentation. The way the information is presented is crucial and can jeopardise the whole approach. In some cases a concise list of artefacts is enough, whereas in other cases a well-chosen graphic will be more suitable and will deliver the essence of the message in a few seconds. For each of the identified interests and objectives, it is important to choose a visualisation mode that allows it to be easily understood and acted upon. For example, if a list of artefacts to rework is to be presented, it must be short enough not to discourage developers, and its meaning must be clear enough that what needs to be improved is immediately obvious.
Reproducibility of the analysis. The data handled are numerous (the metrics that can be collected over the whole life cycle of a software project number in the dozens once filtered) and complex (because of their semantics and their interactions).


To guarantee the consistency of the extracted information, it is important to preserve the reproducibility of the analyses. The use of literate analysis techniques, introduced by D. Knuth for software development and adopted by F. Leisch and Xie (2013) for data analysis (we used Knitr for our analyses: http://yihui.name/knitr/), greatly helps the understanding of analysis results. The information can be provided as dynamically generated tables or graphs, but also as explanatory text. Literate data analysis thus guarantees that all projects use the same analysis process, which makes them comparable, and gives a semantic context to the results.
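To make the literate-analysis idea concrete, here is a minimal, self-contained sketch of a Knitr run in R. The report text and the simulated metric column are placeholders, not the actual weekly reports generated for the working group.

# Minimal sketch: a self-contained literate analysis with knitr.
# The report content and the simulated metrics are illustrative only.
library(knitr)

writeLines(c(
  "# Weekly quality report",
  "",
  "```{r sloc, echo=FALSE}",
  "metrics <- data.frame(sloc = rpois(100, 200))  # stand-in for real project metrics",
  "summary(metrics$sloc)",
  "```"
), "report.Rmd")

knit("report.Rmd")   # executes the chunk and weaves text and results into report.md

Running the same document on every project guarantees that each one goes through exactly the same, fully scripted analysis, which is the property the text argues for.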

4 Putting it into practice with Eclipse

4.1 Context

The Eclipse foundation is one of the major forges of the free software world and a noted success of the open source model. It now hosts several hundred projects, providing them with a forge and a range of development-related services: intellectual property management, project management guidance, and ecosystem development.
One of the strong characteristics of the Eclipse foundation is its closeness to industry. Working groups (IWG, Industry Working Group) are set up on specific topics, with a defined schedule and deliverables, to propose solutions to concrete problems. The Polarsys working group (https://polarsys.org/wiki//index.php/Main_Page) thus aims to promote the use of Eclipse in the embedded systems world, with a dedicated development environment and very-long-term support (up to 75 years) for the delivered tools.
The foundation has also recently launched several initiatives to federate the development process tools and make their data publicly accessible, notably the PMI project (Project Management Infrastructure, http://wiki.eclipse.org/Project_Management_Infrastructure), which gathers information related to the development process, and Dash (http://wiki.eclipse.org/Dash_Project), which provides tools for analysing this information.

4.2 Quality at Eclipse

4.2.1 Declaration of intent

The objectives of this mining programme were identified as follows:
– Assess the maturity of the project. Can it reasonably be relied upon when building an application? Does the project follow the foundation's recommendations?
– Help teams build better software, by giving them a clear view of their developments and simple actions to improve the product or the development process.
– Build a common understanding of quality within the foundation, and lay the ground for a healthy discussion on the evaluation and improvement of Eclipse projects.

4.2.2 Quality requirements for Eclipse

There is no strong product quality constraint on Eclipse projects. Projects are advised to set up their own development conventions and to work on the quality of the project, but this is neither mandatory nor checked. Nevertheless, when browsing the foundation's online resources several keywords come up regularly, notably regarding the readability and modifiability of the code (maintainability concerns are obviously heightened for free software projects, which are meant to be maintainable by anyone). The Eclipse development process defines several phases for a project: proposal, incubation, mature, archived. Each phase comes with a number of constraints: reviews and milestones, verification of intellectual property information, etc. Most of these rules are mandatory. Finally, Eclipse defines three fundamental communities that a project must develop and maintain: developers, users and adopters. For the purpose of our evaluation we retain only two of them: developers and users. These communities must be responsive (the project cannot stay inactive for very long) and diverse (the project cannot depend on a single individual or company).

4.2.3 Quality requirements for Polarsys

Polarsys delivers assemblies of Eclipse components for the embedded industry. This implies specific quality constraints, particularly regarding the industrialisation capability of the solution and its sustainability.
– Reliability: a fault in a single component can endanger the whole software stack. More precisely, the notion of reliability expressed by the working group lies at the crossroads of maturity and fault tolerance in the ISO 9126 model.
– Maintainability: support must be guaranteed for up to 75 years for some systems (e.g. space and aeronautics). Beyond the maintainability of the product itself, this implies that the communities around the project be active and diverse.
– Predictability: the solution will be deployed on thousands of workstations around the world and will have a significant impact on team productivity. Being able to rely on the date and quality of deliveries is vital.
Overall, the quality constraints of Polarsys are similar to those of the Eclipse foundation, but they are reinforced for the specific industrial context of the assembly. The financial participation of the industrial members involved in the group provides a substantial workforce to meet these requirements and makes it possible to impose certain constraints on the selected projects.

4.2.4 Proposed quality model

From the requirements identified during the preliminary analysis, three quality axes were identified: product, process and community quality. These are decomposed into sub-characteristics according to the hierarchy shown in Figure 2. These sub-characteristics, which are the finest granularity of quality attributes, are then linked to metrics according to a method agreed with the stakeholders and published for the users (developers and project leads).


FIG. 2 – Proposed quality model for Polarsys.
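One convenient way to think about the model in Figure 2 is as a tree of attributes whose leaf marks are aggregated bottom-up. The following R sketch uses the three axes named in the text, but the sub-characteristics, scores and the unweighted mean are illustrative assumptions, not the decomposition or the aggregation rule agreed by the working group.

# Sketch of the quality model as a tree, with a naive bottom-up aggregation.
# Sub-characteristic names, scores and the aggregation are illustrative.
model <- list(
  product   = list(analysability = 0.8, changeability = 0.6, reliability = 0.7),
  process   = list(scm_use = 0.9, its_use = 0.5, planning = 0.6),
  community = list(activity = 0.4, diversity = 0.3, responsiveness = 0.7, support = 0.6)
)

aggregate_axis <- function(axis) mean(unlist(axis))   # unweighted mean for the sketch
round(sapply(model, aggregate_axis), 2)               # axis-level marks, as on the quality tree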

4.3 Selection of the metrics

We listed the available repositories and the information they hold. The forge principle (a forge offers a complete, integrated working environment for developing a project, and thus limits the number of tools in use) helps a lot by limiting the number of different connectors needed, although each project keeps a number of local customs. For each repository we then defined indicators relevant to the previously identified quality attributes.
Configuration management metadata is extracted from the server logs. Two connectors were developed, one for Subversion and one for Git. The metrics defined at the application and file levels are: the number of revisions, of fix revisions, and of developers. At the application level the number of files touched by the revisions is also used. All these metrics are retrieved over four time periods to better capture their recent evolution: since the origin of the project, over the last week, the last month and the last three months.
The communication channels provided by Eclipse are online forums (often used for user lists), mailing lists and an NNTP server (often used for development lists). Several scripts were written to convert these different access methods into a single format, the mbox format. From this common base the following metrics are extracted at the application level: the number of mails exchanged, of discussion threads, of distinct authors, of replies, and the median time to first reply. The data is retrieved over time windows of one week, one month and three months.
Other repositories were investigated, but will be included in a later iteration of the quality model. Process information comes from the Eclipse PMI initiative, which is still being implemented. The list of referenced information is therefore still moving (see the list of metadata: http://wiki.eclipse.org/Project_Management_Infrastructure/Project_Metadata), but the information currently retrieved is: the number of milestones and reviews defined for the project, the number of high-level features planned for the release scope, and the number of associated low-level requirements.

Intellectual property logs are recorded in a dedicated repository and made accessible through a public API. These logs are used to check the coverage of contributions, so that none is left without an intellectual property declaration.
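The communication metrics above are computed over sliding time windows. The R sketch below shows one way this could be done once the mbox archives have been parsed into a data frame; the column names and the window lengths are assumptions taken from the text, and the median time to first reply is omitted for brevity.

# Sketch: communication metrics over the last week / month / three months.
# Assumes one row per message with columns: date (POSIXct), author,
# thread_id, in_reply_to. Column names are illustrative.
mail_metrics <- function(mails, days) {
  win <- mails[mails$date >= Sys.time() - days * 86400, ]
  data.frame(
    period_days = days,
    mails       = nrow(win),
    threads     = length(unique(win$thread_id)),
    authors     = length(unique(win$author)),
    replies     = sum(!is.na(win$in_reply_to))
  )
}

# Example usage, producing one row per time window:
# do.call(rbind, lapply(c(7, 30, 90), mail_metrics, mails = mails))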

4.4 Linking quality attributes and metrics

The metrics used to measure the quality attributes are agreed upon by the working group. A proposal was made, its principles were presented, and a constructive debate allowed the model to evolve and then be validated. The sub-attributes of product quality all follow the same scheme: the number of non-conformities to the coding rules impacting the characteristic, the adherence to conventions (practices not acquired, whatever the number of their violations), and one or more indices drawn from recognised good practices; this last criterion particularly requires the group's assent to be valid (according to Fenton (1994), measurement theory considers a measure valid if it obtains the minimum agreement of the majority of the experts in the community).
The community component of the project is analysed along the following axes: project activity (the number of recent contributions), diversity (the number of distinct actors), responsiveness (the time taken to answer requests) and support (the capacity to answer requests).
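The distinction between counting non-conformities and checking whether a practice is acquired at all can be made concrete with a small R sketch. The rule names, the violations table and the "zero violations means acquired" criterion are illustrative assumptions, not the rule set or thresholds used in the Polarsys model.

# Sketch: non-conformities and adherence to conventions, per coding rule.
# A practice is treated as "acquired" when no file violates the rule.
violations <- data.frame(
  rule  = c("no-goto", "no-goto", "default-in-switch", "max-nesting"),
  file  = c("a.c", "b.c", "a.c", "c.c"),
  count = c(1, 2, 1, 4)
)
all_rules <- c("no-goto", "default-in-switch", "max-nesting", "no-magic-numbers")

by_rule   <- aggregate(count ~ rule, violations, sum)
acquired  <- setdiff(all_rules, by_rule$rule)        # rules with zero violations
list(non_conformities = sum(by_rule$count),          # first component of the sub-attribute
     adherence        = length(acquired) / length(all_rules))  # second component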

5 Results

The prototype implemented for the Eclipse working group relied on the SQuORE tool (Baldassari, 2012) and was applied to a subset of pilot projects. The results of the weekly analyses were made available together with automatically generated Knitr reports. The results of this first iteration of the programme are convincing.

FIG. 3 – Example results: quality tree and violations.

Du point de vue de l’aide à la prise de décision premièrement, les axes de qualité choisis permettent d’évaluer la maturité du projet d’un coup d’œil, selon les critères propres à l’or- ganisation. Par exemple le projet analysé en figure3 montre une bonne qualité produit, mais

12. D’après Fenton(1994) la théorie de la mesure considère comme valide une mesure qui obtient l’agrément minimum de la majorité des experts dans la communauté. De l’ombre à la lumière : plus de visibilité sur l’Eclipse une faible activité de communication et un petit nombre de participants, car l’essentiel des contributions est fait par une unique société. Cette caractéristique peu visible a priori en fait un risque pour la pérennité du projet, rapidement identifiable sur l’arbre de qualité. Du côté des équipes de développement, l’aide à l’amélioration de la qualité se fait au moyen de listes de violations et de points d’action, qui permettent d’identifier et d’améliorer les pratiques non acquises et fournissent des recommandations pragmatiques pour l’amélioration du code ou du processus. En outre, les non-conformités aux exigences édictées par la fondation Eclipse ou par les projets eux-même sont rendues clairement visibles par des points d’action listant les sujets considérés comme faibles (e.g. activité des listes de diffusion, retard sur les milestones, etc.). La capture d’écran située à droite de la figure3 liste ainsi certaines violations de code détectées sur un projet, avec les fonctions incriminées et le delta par rapport à la dernière analyse – il est toujours plus facile de reprendre un code qui vient d’être modifié. Enfin, certaines préoccupations font l’objet d’un traitement spécifique, comme les fichiers très complexes (flots de contrôle et de données, taille) ou instables (ayant eu un grand nombre de révisions). Leur identification visuelle donne une autre vision de la structure du projet et permet d’améliorer directement la qualité du produit.

6 Future directions

The second iteration of the Eclipse quality model will build on this work to refine the quality requirements, improve the links with the metrics, and calibrate the model to define acceptable validity thresholds for projects. New data sources will be integrated into the quality model: tests (e.g. number, ratio, coverage), continuous integration information (number and ratio of failed builds, mean compilation and test times), and statistics from the communication channels (wiki, web site, downloads). New artefact types and measures are also planned from the existing repositories: requirements drawn from change management, and new information drawn from the process repository.
This work was presented to the Eclipse community at the EclipseCon France 2013 conference held in Toulouse in June 2013, and its industrialisation is under way for a release planned in 2014. Extending the quality approach initiated for Polarsys to Eclipse projects, to complement the current PMI and Dash initiatives, is also being considered. The approach will be replayed to define a new, adapted quality model, with different quality requirements and lighter constraints.

7 Conclusion

The prototype developed within the Polarsys working group made it possible to:
– Define a common methodological basis to discuss the notion of quality. The quality model will evolve to include new features or quality constraints, but its structure has been generally agreed upon by the working group and the attached community.
– Define a set of new or uncommon metrics to evaluate the notion of quality. The information drawn from the development repositories in particular provides a more complete view of the project.
– Show the feasibility of such a solution. The prototype proved that the information could be used pragmatically and effectively to improve software projects, whether at the level of the code or of the development practices.
The lessons learned from this data mining experience in an industrial and community context are not limited to the Eclipse ecosystem; they can be applied to other projects and other forges, both in the free software world and for in-house developments. The use of complete, integrated forges is a great help for collecting information over the projects' life cycle and an important success factor for the quality evaluation and improvement approach. In this respect, interoperability projects such as OSLC should make this integration easier in the future.
The key concepts discussed form a methodological framework that can easily be adapted to other environments. The participation of the stakeholders is fundamental to obtaining a constructive synergy and a global recognition of the approach. The declaration of intent of the measurement programme and its decomposition into a quality model prevent misuse and drift, and ensure the quality and usability of the extracted information. The computation method of the measures must be clearly explained so that end users can draw the greatest benefit from them and take part in the evolution of the programme. Finally, the presentation of the data is what turns information into knowledge that is useful for improving the product or the development process.

References

Bachmann, A., C. Bird, F. Rahman, P. Devanbu, and A. Bernstein (2010). The Missing Links: Bugs and Bug-fix Commits.
Baldassari, B. (2012). SQuORE: A new approach to software project quality measurement. In International Conference on Software & Systems Engineering and their Applications.
Basili, V. and D. Weiss (1984). A methodology for collecting valid software engineering data. IEEE Trans. Software Eng. 10(6), 728–738.
Basili, V. R., G. Caldiera, and H. D. Rombach (1994). The Goal Question Metric approach.
Fenton, N. (1994). Software Measurement: a Necessary Scientific Basis. IEEE Transactions on Software Engineering 20(3), 199–206.
Gopal, A., M. Krishnan, T. Mukhopadhyay, and D. Goldenson (2002). Measurement programs in software development: determinants of success. IEEE Transactions on Software Engineering 28(9), 863–875.
Hatton, L. (2007). The chimera of software quality. Computer 40(8), 103–104.
Herbold, S., J. Grabowski, H. Neukirchen, and S. Waack (2010). Retrospective Analysis of Software Projects using k-Means Clustering. In Proceedings of the 2nd Design for Future 2010 Workshop (DFF 2010), May 3rd 2010, Bad Honnef, Germany.

ISO (2005). ISO/IEC 9126 Software Engineering – Product Quality – Parts 1-4. Technical report, ISO/IEC.
Iversen, J. and L. Mathiassen (2000). Lessons from implementing a software metrics program. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, pp. 1–11. IEEE.
Kaner, C. and W. P. Bond (2004). Software engineering metrics: What do they measure and how do we know? In 10th International Software Metrics Symposium, METRICS 2004, pp. 1–12.
Louridas, P. (2006). Static Code Analysis. IEEE Software 23(4), 58–61.
Menzies, T., C. Bird, and T. Zimmermann (2011). The inductive software engineering manifesto: Principles for industrial data mining. In Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering, Lawrence, Kansas, USA.
Organization, I. S. (2005). ISO/IEC 25000:2005 Software engineering – Software product Quality Requirements and Evaluation (SQuaRE) – Guide to SQuaRE. Technical report, ISO/IEC, Geneva.
Shaw, M. (1990). Prospects for an Engineering Discipline of Software. IEEE Software (November), 15–24.
Smart, J. (2011). Jenkins: The Definitive Guide. O'Reilly Media, Inc.
Spinellis, D. (2006). Bug Busters. IEEE Software 23(2), 92–93.
Westfall, L. and C. Road (2005). 12 Steps to Useful Software Metrics. Proceedings of the Seventeenth Annual Pacific Northwest Software Quality Conference 57 Suppl 1 (May 2006), S40–3.
Xie, Y. (2013). Knitr: A general-purpose package for dynamic report generation in R. Technical report.
Yoon, K.-A., O.-S. Kwon, and D.-H. Bae (2007). An Approach to Outlier Detection of Software Measurement Data using the K-means Clustering Method. In Empirical Software Engineering and Measurement, pp. 443–445.
Zhong, S., T. Khoshgoftaar, and N. Seliya (2004). Analyzing software measurement data with clustering techniques. IEEE Intelligent Systems 19(2), 20–27.

Summary

Knowledge discovery and extraction from software-related repositories has made great advances in recent years with the introduction of statistical methods to software engineering concerns and the development of software repository data mining. In this article, we present a data mining program initiated within the Eclipse foundation to assess and improve software project maturity. We thoroughly describe the approach and the key concepts retained, list the encountered issues and propose solutions for them. Results are discussed and future directions proposed to foster further work in the domain.

De l'ombre à la lumière : plus de visibilité sur l'Eclipse

Boris Baldassari∗, Flavien Huynh∗, Philippe Preux∗∗

∗ 76 Allées Jean Jaurès, Toulouse, France. [email protected] – http://www.squoring.com ∗∗ Université de Lille 3, LIFL (UMR CNRS) & INRIA, Villeneuve d’Ascq, France [email protected] – https://sequel.inria.fr

Résumé. L’extraction de connaissances à partir de données issues du génie logi- ciel est un domaine qui s’est beaucoup développé ces dix dernières années, avec notamment la fouille de référentiels logiciels (Mining Software Repositories) et l’application de méthodes statistiques (partitionnement, détection d’outliers) à des thématiques du processus de développement logiciel. Cet article présente la démarche de fouille de données mise en œuvre dans le cadre de Polarsys, un groupe de travail de la fondation Eclipse, de la définition des exigences à la pro- position d’un modèle de qualité dédié et à son implémentation sur un prototype. Les principaux concepts adoptés et les leçons tirées sont également passés en revue.

1 Introduction

L’évolution et le croisement des disciplines de la fouille de données et du génie logiciel ouvrent de nouveaux horizons pour la compréhension et l’amélioration du logiciel et des pra- tiques de développement. A l’aube de ce domaine émergent de nombreuses questions restent cependant d’actualité du point de vue de la conduite de programmes de mesure logicielle, no- tamment relevées par Fenton(1994) et Kaner et Bond(2004). Nous proposons dans cet article quelques pistes pour répondre à ces problématiques et mettre en place un processus de mesure fiable et efficace. Il est utile pour la définition de la notion de qualité de s’appuyer sur des modèles ou stan- dards reconnus : du point de vue de la qualité produit, la référence de facto semble être l’ISO 9126 et son futur successeur, la série 250xx SQuaRE. La maturité du processus de développe- ment est adressée par des initiatives largement reconnues telle que le CMMi ou l’ISO 15504. Afin de clarifier la démarche de mesure, l’approche Goal-Question-Metric proposée par Basili et al. et reprise par Westfal Westfall et Road(2005) permet une approche plus rigoureuse, qui préserve l’efficacité de l’analyse et le sens de ses résultats.

2 Topologie d’un projet de développement logiciel

Chaque référentiel d’information utilisé par le projet possède des informations exploi- tables. Il importe de lister les référentiels disponibles, puis d’identifier pour chacun d’eux les artéfacts et mesures disponibles, et de leur donner un contexte sémantique. De l’ombre à la lumière : plus de visibilité sur l’Eclipse

Source code is the artefact type most used for the analysis of software projects. From the static analysis perspective (Louridas, 2006), the information that can be retrieved from source code consists of metrics, which measure defined characteristics of the software (e.g. its size or the complexity of its control flow), and violations, which correspond to breaches of good practices or of coding or naming conventions (e.g. requiring a default clause in every switch). This information is provided by analysers such as Checkstyle, PMD or SQuORE (Baldassari, 2012).
Configuration management holds all the modifications made to the project tree, with meta-information about the author, the date or the intent of the changes. The position in the version tree matters, because results will not be the same for a version under development (developed on the trunk) and for a version in maintenance (developed on a branch).
The ticket management system records all the change requests made on the project. These may be problems (bugs), new features, or simple questions.
Mailing lists are the main communication channels used within software projects. There are usually at least two lists, one dedicated to development and the other to user questions.

FIG. 1 – Process for defining the quality model and selecting the metrics.

3 Presentation of the extraction method

Mining software data is in several respects similar to running a measurement programme. The experience gathered over the years in the latter field (see notably Gopal et al. (2002) and Menzies et al. (2011)) has made it possible to establish good practices that help with the definition, the implementation and the use of analysis results.

La déclaration d’intention donne un cadre méthodologique et sémantique pour la com- préhension et l’interprétation des résultats. Par exemple, les mesures accessibles et les résultats attendus ne seront pas les mêmes pour un audit et pour un programme d’amélioration continue de la qualité. L’intention doit être simple et tenir en quelques phrases. La décomposition des attributs de qualité décompose les objectifs de la fouille et forme le lien entre une déclaration informelle d’intention et les mesures concrètes qui vont permettre de mesurer et d’améliorer la qualité ainsi définie ; elle doit faire l’objet d’un consensus général et être validée par les acteurs du programme. La définition des métriques accessibles à partir des référentiels identifiés sur le projet doit être fiable – i.e. l’information recherchée est systématiquement présente et valide – et compréhensible pour garder la confiance des acteurs dans le processus. L’implémentation du processus de collecte et d’analyse doit être transparente pour que les acteurs puissent se l’approprier, intégralement automatisée, et exécutée de manière régulière. La présentation des informations est capitale. Dans certains cas une liste concise d’arté- facts est suffisante, alors que dans d’autres cas un graphique bien choisi sera plus adapté et délivrera en quelques secondes l’essentiel du message.

4 Putting it into practice with Eclipse

This approach was implemented within Polarsys, a working group of the Eclipse foundation whose goals include proposing a framework for evaluating the quality of the foundation's projects. The quality tree shown in Figure 2 presents the requirements identified for Eclipse and Polarsys, and their organisation into quality attributes.

FIG. 2 – Proposed quality model for Polarsys.

Du point de vue de l’aide à la prise de décision, les axes de qualité choisis permettent d’évaluer la maturité du projet immédiatement, selon les critères propres à l’organisation. Par exemple le projet analysé sur la droite de la figure2 montre une bonne qualité produit, mais une faible activité de communication et un petit nombre de participants, car l’essentiel des contributions est fait par une unique société. Cette caractéristique peu visible a priori en fait un risque pour la pérennité du projet, rapidement identifiable sur l’arbre de qualité. Du côté De l’ombre à la lumière : plus de visibilité sur l’Eclipse des équipes de développement, l’aide à l’amélioration de la qualité se fait au moyen de listes de violations et de points d’action, qui permettent d’identifier et d’améliorer les pratiques non acquises et fournissent des recommandations pragmatiques pour l’amélioration du code et du processus.

5 Conclusion

The prototype developed within the Polarsys working group made it possible to define a common methodological basis and a set of metrics drawn from new sources to work together on the notion of quality, and more generally demonstrated the feasibility of such a solution. The prototype proved that practical knowledge could be extracted from the project for the evaluation and improvement of maturity. This work was presented to the Eclipse community at the EclipseCon France 2013 conference held in Toulouse in June 2013, and its industrialisation is under way for a release planned in 2014. Extending the quality approach initiated for Polarsys to Eclipse projects, to complement the current PMI and Dash initiatives, is also being considered.

References

Baldassari, B. (2012). SQuORE: A new approach to software project quality measurement. In International Conference on Software & Systems Engineering and their Applications.
Fenton, N. (1994). Software Measurement: a Necessary Scientific Basis. IEEE Transactions on Software Engineering 20(3), 199–206.
Gopal, A., M. Krishnan, T. Mukhopadhyay, and D. Goldenson (2002). Measurement programs in software development: determinants of success. IEEE Transactions on Software Engineering 28(9), 863–875.
Kaner, C. and W. P. Bond (2004). Software engineering metrics: What do they measure and how do we know? In 10th International Software Metrics Symposium, pp. 1–12.
Louridas, P. (2006). Static Code Analysis. IEEE Software 23(4), 58–61.
Menzies, T., C. Bird, and T. Zimmermann (2011). The inductive software engineering manifesto: Principles for industrial data mining. In Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering, Lawrence, Kansas, USA.
Westfall, L. and C. Road (2005). 12 Steps to Useful Software Metrics. Proceedings of the 17th Annual Pacific Northwest Software Quality Conference 57 Suppl 1 (May 2006), S40–3.

Summary

Knowledge discovery and extraction from software-related repositories has made great advances in recent years. In this article, we present a data mining program initiated within the Eclipse foundation to assess and improve software project maturity. We thoroughly describe the approach and the key concepts retained. Results are discussed and future directions proposed to foster further work in the domain.

De l'ombre à la lumière : plus de visibilité sur Eclipse

Boris Baldassari, SQuORING Technologies – Flavien Huynh, SQuORING Technologies – Philippe Preux, INRIA Lille

The poster summarises the path from "Je veux un logiciel" (I want software) to "Je veux un bon logiciel" (I want good software), across technical and functional concerns and away from obscurity. It presents the knowledge delivered to developers, industry members and users (objective estimation of maturity, acquired practices, action points, help with project management) and the implementation concerns (semantics and computation method of the metrics, repositories, accessibility and reliability of the information, with indicators meant to be predictable, maintainable, sustainable and reliable). The main works relied upon are Menzies et al. (2011), Deissenboeck et al. (2009) on quality model purposes and requirements, Spinellis and Gousios (2009) on evaluating open source quality, Westfall and Road (2006), Basili et al. (1994) on the Goal Question Metric approach, and Kitchenham and Pfleeger (1996) on software quality as an elusive target.

Outliers Detection in Software Engineering data

Boris Baldassari, SQuORING Technologies, 74 Allées Jean Jaurès, 31000 Toulouse, France – [email protected]
Philippe Preux, SequeL, INRIA Lille, Campus, 59000 Lille, France – [email protected]

ABSTRACT As observed by Shaw [44] some years ago, the discipline of Outliers detection has been intensively studied and applied software engineering is not yet mature – one of the reasons in recent years, both from the theoretical point of view being that mathematical and scientific concepts are still dif- and its applications in various areas. These advances have ficult to apply to the domain, and many concepts are not brought many useful insights on domain-specific issues giv- generally agreed upon by specialists. In this context the ing birth to visible improvements in industry systems. application of data mining techniques should bring many However this is not yet the case for software engineer- new insights into the field. Among them, the meaning and ing: the few studies that target the domain are not well usefulness of outliers has not been extensively studied, al- known and lack pragmatic, usable implementation. We be- thought there exists a considerable need from actors of the lieve this is due to two main errors. Firstly the semantics of development process. the outliers search has not been clearly defined, thus threat- Considering the huge variety in nature of data, there is no ening the understandability, usability and overall relevance single generic approach to outliers detection [31]. One has of results. Outliers must be linked to software practition- to carefully select and tailor the different techniques at hand ers’ needs to be useful and to software engineering concepts to ensure the relevance of the results. Furthermore software and practices to be understood. Secondly the nature of soft- measurement data has specific semantics and characteristics ware engineering data implies specific considerations when that make some algorithms inefficient or useless, and the working with it. results often lack practical considerations – i.e. what we are This article will focus on these two issues to improve and really looking for when mining for outliers. We believe this foster the application of efficient and usable outliers detec- has a strong impact on the afterwards practical usability of tion techniques in software engineering. To achieve this we these techniques. propose some guidelines to assist in the definition and im- In this paper we intend to define a lightweight semantic plementation of software outliers detection programs, and framework for them, establish use cases for their utilisation describe the application of three outliers detection methods and draw practical requirements for their exploitation. We to real-world software projects, drawing usable conclusions then demonstrate this approach by applying outliers detec- for their utilisation in industrial contexts. tion to two common software engineering concerns: obfus- cated files and untestable functions. Results are compared to experience-based detection techniques and submitted to Keywords a team of software engineering experts for cross-validation. Outliers Detection, Data Mining, Software Metrics It appears that provided a semantic context software devel- opers and practitionners agree on the resulting set and find it both useful and usable. 1. INTRODUCTION This paper is organised as follows: in section2 we give Generally speaking outliers in a data set are often detected the usual definitions and usages of outliers in statistics and as noise in the data cleaning and preprocessing phase of a in the specific context of software engineering, and review mining process [18]. 
But outliers also have a wealth of useful recent related work. Section3 defines a semantic framework information to tell and in some cases they are the primary for our research, stating what we are looking for and what goal of an analysis [9]: their application in specific, domain- it really means for us. In section4 we describe some outliers aware contexts has led to important advances and numerous detection techniques we intend to use: tail-cutting, boxplots reliable integration in industrial products [25, 40]. and clustering-based, and we apply them to real-world soft- ware development concerns in section5 with comments from field experts. We finally draw our conclusions and propose directions for future work in section6. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies 2. STATE OF THE ART bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific 2.1 What is an outlier? permission and/or a fee. International Conference on Software Engineering 2014, Hyderabad, India One of the oldest definition of outliers still used today Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. comes from Hawkins [9]: an outlier is an observation that deviates so much from other observations as to arouse sus- parametric [18, 11]. picion that it was generated by a different mechanism. An- Distribution-based approaches are parametric in nature: other often encountered definition for outliers in scientific they assume the existence of an underlying distribution of literature comes from Barnett and Lewis [9], who define an values and flag all points that deviate from the model [9]. outlier as an observation (or subsets of observations) which They are not recommended in software engineering, mainly appears to be inconsistent with the remainder of that set of because of the distribution assumption [23, 37, 17] and their data. Both are useful to get the idea of outliers, the former complexity in high dimensions [9]. addressing rather the causes and the later the consequences. Distance-based approaches [24, 45] generalise many con- However in domain-specific contexts (e.g. healthcare, fraud cepts from distribution-based approaches and rely on a dis- detection, or software engineering) there is a lot more to tance between points to identify outliers: points that have say about what outliers are and what they mean. As ap- few neighbours in a given distance are considered as out- plied to software engineering, Yoon et al. [51] define outliers liers [9]. They suffer exponential computational growth as as the software data which is inconsistent with the major- they are founded on the calculation of the distances between ity data. For Wohlin et al. [49] an outlier denotes a value all records [18]. that is atypical and unexpected in the data set. Both are Density-based methods [7, 34] assign an “outlierness” co- close to Barnett’s definition. Lincke et al. [30] take a more efficient to each object, which depends on the region density statistical perspective and define outliers as artefacts with around the object. They can be applied on univariate or a metric value that is within the highest/lowest 15% of the multivariate series, and are generally more capable of han- value range defined by all classes in the system. dling large data sets [11]. 
Clustering-based approaches [51,2, 45] apply unsuper- 2.2 Usages of outliers detection vised classification to the data set and identify points that From the data mining perspective, one of the main uses either are not included in any cluster or constitute very small of outliers detection techniques is to clean data before pro- clusters [45]. They can be used on univariate or multivariate cessing [18, 33]: simple statistical estimates like variance or series and scale well to high dimensions, but may be compu- mean, correlation coefficients in regression models or predic- tationally expensive depending on the clustering approach tion models may be biased by individual outliers [27]. selected [11]. Outliers may arise because of human error, instrument Recent studies also propose outliers detection methods error, natural deviations, fraudulent behaviour, changes in borrowed from artificial intelligence, like fuzzy-based detec- behaviour, or faults in the system. In some cases they should tion [9], neural networks [46] or support vector machines [9]. be identified to be discarded, in other cases they really rep- Another close research area is the detection of outliers in resent a specific and valuable information that will help im- large multi-dimensional data sets. There is a huge variety of prove the process, the product, or people’s behaviour [27]. metrics that can be gathered from a software project, from Many studies apply outliers detection to security-related code to execution traces or software project’s repositories, data for e.g. insurance, credit card or telecommunication and important software systems commonly have thousands fraud detection [25, 40], insider trading detection [46] and of files, mails exchanged in mailing lists or commits. This network intrusion detection systems [29] by detecting de- has two consequences: first in high dimensional space the viating behaviours and unusual patterns. Outliers detec- data is sparse and the notion of proximity fails to retain its tion methods have also been applied to healthcare data [28, meaningfulness [1]. Second some algorithms are unefficient 50] to detect anomalous events in patient’s health, military or even not applicable in high dimensions [1, 15]. surveillance for enemy activities, image processing [46], or industrial damages detection [11]. Studies that specifically target the application of min- 3. OUTLIERS IN SOFTWARE ENGINEER- ing techniques to the software engineering domain are rare. ING Mark C. Paulk [39] applies outliers detection and Process Control charts in the context of the Personal Software Pro- 3.1 About software engineering data cess (PSP) to highlight individuals performance. Yoon et The nature of software measurement data has a great im- al. [51] propose an approach based on k-means clustering pact on the methods that can be used on it and on the se- to detect outliers in software measurement data, only to re- mantic of results. Firstly every method making an assump- move them as they threaten the conclusions drawn from the tion about the underlying repartition of values is biased, measurement program. In the later case the detection pro- as we will see in the next section. Secondly the definition cess involves the intervention of software experts and thus of measures is still a subject of controversy and may vary misses full automation. 
Other mining techniques have been widely across programming languages, tools, or authors, as applied with some success to software engineering concerns, identified by Kaner and Bond [20]. Since there is no estab- such as recommendation systems to aid in software’s main- lished standard, one has to clearly state the definition she tenance and evolution [8, 42] or testing [26]. will be using and seriously consider the consistency of the measures: once a measurement system is decided, all other 2.3 Outliers detection methods measurements should happen in the exact same way – repro- The most trivial approaches to detect outliers are man- ducible research tools like literate data analysis are precious ual inspection of scatterplots or boxplots [28]. However helpers for this purpose. when analysing high-volume or high-dimensional data auto- Another point is that many software metrics have high matic methods are more effective. Most common methods correlation rates between them. As an example line-counting are listed below; they can be classified according to differ- metrics (e.g. overall/source/effective lines of code, number ent criteria: univariate or multivariate, parametric or non- of statements) are heavily correlated together – and they often represent the majority of available metrics. Complex- 3.3 Defining outliers for software ity metrics (cyclomatic number, number control flow tokens, One of the early steps in a data mining process is to define number of paths) are highly correlated as well. When using exactly the objectives of the search. We believe this is even multivariate methods these relationships must be considered more true in the context of industry developments to make since they may put a heavy bias on results. it both usable and used. End-users and practitionners must From the history mining perspective, practitionners will understand what outliers are to make good use of them. encounter missing data, migrations between tools and evolv- Software engineers and developers have an intuitive idea ing local customs. As an example while most projects send of some outlier types. When looking at a piece of code ex- commit or bug notifications to a dedicated mailing list or perienced developers will instinctively tag it and identify RSS feed, projects with a low activity may simply send them special characteristics. As an example they can identify to the developer mailing list. In this case one needs to filter “classes” of files that they may consider as difficult to under- the messages to get the actual and accurate information. In stand or modify, or “strange patterns” like obfuscated code the generic case every step of the mining analysis must be or declaration-only files. We need to map the outliers we are checked against the specific local requirements of projects. looking for to those specific classes of files known to users. The volume of outliers output by the method is an impor- 3.2 Impact of distribution tant parameter. We may want a fairly small amount of them No assumption can be made on the distribution of software- for incremental improvement of practices and quality, but in related observations. Although some models have been pro- some cases (e.g. an audit) an extended list is preferable to posed in specific areas like the Pareto [14] and Weibull [52] help estimate the amount of technical debt of the software. 
distributions for faults repartition in modules, or power laws To achieve this some mandatory information should be in dependency graphs [32], to our knowledge there is very stated and published to ensure it will be understood and little scientific material about the general shape of com- safely used according to its original aim. We propose the mon software metrics. And at first sight empirical software following template to be answered everytime outliers detec- data does not go normal or Gaussian, as shown in figure tion (or more generally data mining) technique is applied. 1 for some metrics of Ant 1.7 source files: source lines of Define what outliers are for the user. They usually cor- code (SLOC), cyclomatic complexity (VG), comment rate respond to a specific need, like unstable files in the context of (COMR) and number of commits (SCM_COMMITS). a release or overly complex functions for testing. Maybe de- velopers, managers or end-users already have a word coined for this concern. How is this information used? Metrics programs may have counter-productive effects if they are not well under- stood [20]1. To avoid this the selected method and intent should be explicitely linked to requirements and concepts common to the end users so they can attach their own se- mantic to the results and make better use of them. How should results be presented? Visualisation is the key to understanding, and a good picture is often worth a thousand words. Select a presentation that shows why the artefact is considered as an outlier. Sometimes a list of artefacts may be enough, or a plot clearly showing the extreme value of the outlier compared to other artefacts may be more eye- and mind- catching. In the context of software quality improvement, having too many items to watch or identified as outliers would be useless for the end user; selecting a lower value gives more practical and understandable results. In that case clusters that have less than 15% of the maximum value as proposed by Lincke et al [30], outputs a fair and usable number of artefacts due to the light tail of the frequency distribution – one that can be considered by the human mind to be easily improved. In other cases (e.g. audits) an extended list of artefacts may be preferred. Figure 1: Distribution of some common file metrics for Ant 1.7. 3.4 Examples of outliers types More specifically, outliers in software measurement data From an univariate perspective the number of extreme can be used to identify: outliers is proportional to the surface of the tail of the den- sity function. In the case of heavy tailed distributions more Noise. outliers will be found than for lightly tailed distributions. Artefacts that should be removed from the working set In the case of long tails as seen in many metrics distribu- tion functions, the threshold of 15% of the maximum value 1As an example productivity metrics solely based on lines defined by Lincke et al. in [30] gives a small number of of code tend to produce very long files without any produc- artefacts considered as outliers. tivity increase. because they threaten the analysis and the decisions taken from there. As an example, if some files are automatically Table 1: Number of outliers for percents of maxi- generated then including them in the analysis is pointless, mum metric value for some projects. because they are not to be maintained and it would be wiser Ant Max 25% 15% 5% to analyse the model or generation process. 
Unstable files. Files that have been heavily modified, either for bug fixes or enhancements.

Complexity watch list. Artefacts that need to be watched, because they have complexity and maintainability issues. When modifying such artefacts developers should show special care because they are more likely to introduce new bugs [6]. This is rather linked to extreme values of artefacts.

Anomalous. Artefacts that show a deviance from canonical behaviour or characteristics. A file may have a high but not unusual complexity, along with a very low SLOC – as is the case for obfuscated code. This is often linked to dissimilarities in the metrics matrix.

Obfuscated code. Artefacts that contain obfuscated code pose a real threat to understandability and maintainability. They should be identified and rewritten.

4. OUTLIERS DETECTION METHODS

4.1 Simple tail-cutting

As presented in section 3.2, many metric distributions have long tails. If we use the definition for outliers of Lincke et al. [30] and select artefacts with metrics that have values greater than 85% of their maximum, we usually get a fairly small number of artefacts with especially high values. Table 1 shows the maximum value of a few metrics and their 25, 15 and 5% thresholds, with the number of artefacts that fit in the range. Projects analysed are Ant 1.7 (1113 files), and various extracts from SVN: JMeter (768 files), Epsilon (777 files) and Sphinx (467 files).

The main advantage of this method is that it allows finding points without knowing the threshold value that makes them peculiar: e.g. for the cyclomatic complexity, the recommended and argued value of 10 is replaced by a value that makes those artefacts significantly more complex than the vast majority of artefacts. This technique is a lot more selective than boxplots and thus produces fewer artefacts, with higher values.

Table 1: Number of outliers for percents of maximum metric value for some projects.

    Ant      Max     25%        15%        5%
    SLOC     1546    1160 (3)   1315 (2)   1469 (1)
    NCC       676     507 (3)    575 (2)    643 (2)
    VG        356     267 (3)    303 (2)    339 (1)
    JMeter   Max     25%        15%        5%
    SLOC      972     729 (3)    827 (1)    924 (1)
    NCC       669     502 (5)    569 (3)    636 (1)
    VG        172     129 (4)    147 (2)    164 (1)
    Epsilon  Max     25%        15%        5%
    SLOC     3812    3812 (6)   4320 (6)   4828 (6)
    NCC      9861    7396 (6)   8382 (6)   9368 (2)
    VG       1084     813 (6)    922 (6)   1030 (6)
    Sphinx   Max     25%        15%        5%
    SLOC     1195     897 (5)   1016 (4)   1136 (1)
    NCC      1067     801 (2)    907 (2)   1014 (1)
    VG        310     233 (2)    264 (2)    295 (2)
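A minimal R sketch of this tail-cutting selection is given below; the data frame files and its column names are assumptions made for illustration, not the actual SQuORE implementation.

    # Tail-cutting: flag artefacts whose value lies within 'cut' of the observed
    # maximum, e.g. cut = 0.15 keeps values greater than 85% of the maximum [30].
    tail_cut <- function(x, cut = 0.15) {
      which(x > (1 - cut) * max(x, na.rm = TRUE))
    }

    out_sloc <- tail_cut(files$SLOC)
    out_vg   <- tail_cut(files$VG)
    length(out_vg)   # number of artefacts flagged on cyclomatic complexity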
4.2 Boxplots

One of the simplest statistical outlier detection methods was introduced by Laurikkala et al. [28] and uses informal box plots to pinpoint outliers on individual variables. Simply put, boxplots draw the first and third quartiles of each variable and flag the points lying far outside this range.

We applied the idea of series union and intersection from Cook et al.2 [13] to outliers detection, a data point being a multi-dimensional outlier if many of its variables are themselves outliers. The rationale is that a file with unusually high complexity, size, number of commits, and say a very bad comment ratio is an outlier, because of its maintainability issues and development history.

2 In [13] the authors use univariate series unions and intersections to better visualise and understand data repartition.

We first tried to apply this method to the full set of available metrics, by sorting the components according to the number of variables that are detected as univariate outliers. The drawback of this accumulative technique is its dependence on the selected set of variables: highly correlated metrics like line-counting measures will quickly trigger big components even if all of their other variables show standard values. Having a limited set of orthogonal metrics with low correlation is needed to balance the available information.

Figure 2 shows the SLOC and VG metrics on files for the Ant 1.7 release, with outliers highlighted in blue for VG (111 items), green for SLOC (110 items), and red for the intersection of both (99 items). The same technique was applied with three metrics (SLOC, VG, NCC) and their intersection in figure 3. There are 108 NCC outliers (plotted in orange) and the intersecting set (plotted in red) has 85 outliers. Intersecting artefacts cumulate outstanding values on the three selected metrics.

Figure 2: Univariate boxplot outliers on SLOC and VG for Ant 1.7.

Table 2: File clusters.

    Euclidean distance
    Method used   Cl1    Cl2   Cl3   Cl4   Cl5   Cl6   Cl7
    Ward           226    334    31    97   232   145    48
    Average       1006      6    19    76     3     1     2
    Single        1105      3     1     1     1     1     1
    Complete       998     17    67     3    21     3     4
    McQuitty      1034      6    57    10     3     1     2
    Median        1034      6    66     3     1     2     1
    Centroid       940     24   142     3     1     2     1

    Manhattan distance
    Method used   Cl1    Cl2   Cl3   Cl4   Cl5   Cl6   Cl7
    Ward           404     22    87   276    62   196    66
    Average        943     21   139     4     3     1     2
    Single        1105      3     1     1     1     1     1
    Complete       987     17    52     3    47     3     4
    McQuitty      1031     12    60     3     1     4     2
    Median         984      6   116     3     1     2     1
    Centroid       942     24   140     3     1     2     1
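The boxplot-based detection and the union/intersection combination described above could be prototyped along the following lines in R; the per-file data frame files is again an assumption, and boxplot.stats applies the usual 1.5 * IQR whisker rule, which is one concrete reading of the informal boxplots of [28].

    # Univariate boxplot outliers for one metric: indices of points beyond the whiskers.
    box_out <- function(x) which(x %in% boxplot.stats(x)$out)

    out_sloc <- box_out(files$SLOC)
    out_vg   <- box_out(files$VG)
    out_ncc  <- box_out(files$NCC)

    # Series union and intersection, in the spirit of [13]: artefacts that are
    # outliers on all selected metrics cumulate outstanding values.
    union_out     <- Reduce(union,     list(out_sloc, out_vg, out_ncc))
    intersect_out <- Reduce(intersect, list(out_sloc, out_vg, out_ncc))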

Figure 3: Univariate boxplot outliers on SLOC, VG and NCC for Ant 1.7.

Because of the underlying distribution of measures, however, the classical boxplot shows too many outliers – or at least too many for practical improvement. We used a more robust boxplot algorithm [43] targeted at skewed distributions3 and had better results, with fewer, more atypical artefacts.

3 Extremes of the upper and lower whiskers of the adjusted boxplots are computed using the medcouple, a robust measure of skewness.

4.3 Clustering

Clustering techniques allow us to find categories of similar data in a set. If a data point cannot fit into a cluster, or if it is in a small cluster (i.e. there are very few items that have these similar characteristics), then it can be considered an outlier [33, 45]. In this situation, we want to use clustering algorithms that produce unbalanced trees rather than evenly distributed sets. Typical clustering algorithms used for outliers detection are k-means [19, 51] and hierarchical clustering [16, 3]. We used the latter because of the very small clusters it produces and its fast implementation in R [41].

We applied hierarchical clustering to file measures with different distances and agglomeration methods. On the Apache Ant 1.7 source release (1113 files) we get the repartition of artefacts shown in table 2 with Euclidean and Manhattan distances. Linkage methods investigated are Ward, Average, Single, Complete, McQuitty, Median and Centroid.

The Manhattan distance gives more small clusters than the Euclidean distance. The aggregation method used also has a great impact: the Ward method draws more evenly distributed clusters, while the Single method consistently gives many clusters with only a few individuals.

Metrics selection has a great impact on results. Metrics have to be carefully selected according to the aim of the analysis. As an example, the same detection technique (i.e. same distance and linkage method) was applied to different sets of metrics in figure 4: the outliers are completely different, and are even difficult to see4. Using too many metrics usually gives bad or unusable results, because of their relationships, the diversity of information they deliver, and our ability to capture this information and its consequences.

4 Remember section 3.3 about visualisation.

Figure 4: Outliers in clusters: different sets of metrics.
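As a sketch of how this clustering-based detection could be prototyped in R: the numeric matrix m of selected file metrics, the scaling step and the choice of seven clusters are assumptions made for illustration (seven clusters mirrors table 2, and single linkage is used here because it tends to isolate very small clusters).

    # Hierarchical clustering of file measures, then flag members of tiny clusters.
    d  <- dist(scale(m), method = "manhattan")
    hc <- hclust(d, method = "single")
    cl <- cutree(hc, k = 7)

    sizes    <- table(cl)
    small    <- as.integer(names(sizes)[sizes <= 3])   # clusters with very few items
    outliers <- which(cl %in% small)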
5. IMPLEMENTING OUTLIERS DETECTION

5.1 Overview

These techniques have been prototyped in SQuORE, a commercial software project quality assessment tool [4]. SQuORE has a modular, plugin-based architecture that makes it easy to integrate with various data providers, and provides a fully-featured dashboard for the presentation of results.

For our purpose, we decided to identify two different types of outliers: functions that are considered as untestable (high complexity, deep nesting, huge number of execution paths) and files with obfuscated code. For each of these we follow the directives proposed in section 3.3 and give their definition, usage, and implementation. The validation steps we used are also described.

Detected outliers are displayed in dedicated action lists, with the reasons for their presence in the list and corrective actions that can be undertaken to improve them. Coloured plots allow highlighting artefacts on various metrics, showing graphical evidence of their outlierness.

5.2 Untestable functions

5.2.1 Definition

Complex files are difficult to test and maintain because a large number of tests should be written before reaching a fair test coverage ratio, and developers who need to understand and modify the code will most probably have trouble understanding it, thus increasing the possibility of introducing new bugs [21, 22]. Furthermore, complex files induce higher maintenance costs [5]. As a consequence, they should be identified so developers know they should be careful when modifying them. They can also be used as a short-list of action items when refactoring [36, 47].

Complexity relies on a number of parameters [48]. Traditional methods used by practitioners to spot untestable files usually rely on a set of control-flow related metrics with static thresholds: cyclomatic complexity (>10) and levels of nesting in control loops (>3) are common examples.

5.2.2 Implementation

We selected with our team of experts a set of three metrics that directly impact the complexity of code, thus threatening its testability, understandability and changeability: level of nesting in control loops (NEST), cyclomatic complexity (VG), and number of execution paths (NPAT). In this specific case we are solely interested in the highest values of the variables. We also know that an extreme value on a single metric is enough to threaten the testability of the artefact. Hence we applied the simple tail-cutting algorithm to univariate series, and listed as outliers all artefacts that had at least one extreme metric value.

5.2.3 Validation

This technique was applied to a set of open-source C and Java projects5 and the resulting set of functions was compared to those identified with traditional thresholds. While most untestable functions were present in both lists, we also identified artefacts that were exclusively selected by one or the other method. The traditional method needs all selected triggers on metrics to be met, hence an artefact with an extremely high nesting but with standard cyclomatic complexity would not be selected. Figure 5 shows such examples of functions with a low number of execution paths but a really deep nesting. On the other hand, static triggers may give in some situations a huge number of artefacts, which is inefficient for our purpose.

We cross-validated these results by asking a group of software engineering experts to review the list of untestable functions that were found exclusively by our method. Their analysis confirmed that the resulting set was both complete and more practical: in the cases where our method outputs more untestable artefacts, they were consistently recognised as being actually difficult to test, read and change.

5 Projects analysed are: Evince, Agar, Amaya, Mesa, Ant, JMeter, Sphinx, Papyrus, Jacoco, HoDoKu, Freemind and JFreeChart.

Figure 5: Untestable functions for Evince (2818 functions).
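The selection rule of section 5.2.2 (flag a function as soon as any of NEST, VG or NPAT lies in the extreme tail of its distribution) can be sketched as follows; the data frame funcs, its columns (name, NEST, VG, NPAT) and the 15% cut are illustrative assumptions, not the exact SQuORE rule set.

    # A function is listed as untestable if at least one metric is extreme.
    is_extreme <- function(x, cut = 0.15) x > (1 - cut) * max(x, na.rm = TRUE)

    untestable <- with(funcs, is_extreme(NEST) | is_extreme(VG) | is_extreme(NPAT))
    funcs[untestable, c("name", "NEST", "VG", "NPAT")]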
5.3 Obfuscated code

5.3.1 Definition

Code obfuscation is a protection mechanism used to limit the possibility of reverse engineering or attack activities on a software system [10]. Obfuscation is also sometimes intended as a challenging game by developers [38], but it generally threatens the understandability and maintainability of software. Bugs are also more difficult to detect and fix. Obfuscated code is usually identified by humans, although some automatic methods have been proposed for security-related concerns [12].

Obfuscated files should be identified, analysed and eventually rewritten. Clear warnings should be displayed when modifying them. They are rather rare events in the normal development of a product, but their impact is such that identifying them is of primary importance for the understandability and overall maintainability of a software product.

5.3.2 Implementation

One of the remarkable characteristics of obfuscated code is its unusual amount and diversity of operands and operators compared to the number of lines of code. We thus selected for our search Halstead's base metrics (n1, n2, N1 and N2) along with the size of the artefact, measured as lines of code. Simple tail-cutting and clustering detection techniques were applied at both the function and file levels, and artefacts that were found by both methods were tagged as obfuscated.

Very small files, however, tend to have an artificially high density because of their low number of lines rather than their number of operands or operators. For our purpose we decided to dismiss obfuscated code that is shorter than a few lines in the file-level analysis.

5.3.3 Validation

Obfuscated files retrieved from the International Obfuscated C Code Contest [38] were hidden in the source code of four open-source C projects (Evince, Agar, Amaya and Mesa). Analysis extracted them all with 100% accuracy. If we use the union of tail-cutting and clustering outliers instead of the intersection, we also get regular files from the projects with strong obfuscation characteristics. Figure 6 shows clusters (left) and tail-cutting (right) outliers on Halstead's n1 and n2 metrics. It is interesting to note that the coloured artefacts are not necessarily outliers on these two metrics, since we are in a multi-dimensional analysis.

Figure 6: Obfuscated files from clusters and tail-cutting.

6. CONCLUSION

Outliers detection may bring very useful insights for software development, maintenance and comprehension if the outliers are carefully specified. For this we need to define exactly what we are looking for and what it means to the user: specific outliers detection techniques should serve a clear-cut purpose and fill specific requirements and needs for the user.

One of the main benefits of outliers detection techniques in the context of software engineering is that they propose dynamic triggering values for artefacts selection. Dynamic thresholds have two advantages over static limits: firstly, they usually find a fair (low) number of outliers where static thresholds may bring hundreds of them on complex systems; and secondly, their ability to adapt themselves to a context makes them applicable and more robust in a wider range of types of systems. As an example, McCabe recommends not to exceed a threshold of 10 for cyclomatic complexity in functions [35]. This has been widely debated since then, and no definitive value came out that could fit every domain. By using dynamic thresholds we are able to highlight very complex artefacts in the specific context of the project, and even to set a recommended value for the organisation's coding conventions.

The usage of outliers detection techniques shall be driven by the goals and semantic context of the mining task. As an example, it may be wise to detect obfuscated files and remove them before the analysis, since they don't fit the usual quality requirements and may confuse the presentation and understanding of results. On the other hand, maintainability outliers must be included in the analysis and results since they really bring useful information for software maintenance and evolution.

Different outliers detection techniques can be used and combined to get the most out of their individual benefits. The three outliers detection techniques we studied have distinct characteristics: tail-cutting and boxplots look for extreme values at different output rates, while clustering rather finds similarities in artefacts. By combining these techniques we were able to adapt the behaviour of the analysis and specifically identify given types of artefacts with great accuracy.

Metrics selection is of primary importance when applying mining techniques to data sets, because of the tight relationships between metrics and the semantics users can attach to them. A small set of carefully selected metrics is easier to understand and gives more predictable results.

7. REFERENCES

[1] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. ACM SIGMOD Record, 30(2):37–46, June 2001.
[2] M. Al-Zoubi. An effective clustering-based approach for outlier detection. European Journal of Scientific Research, 2009.
[3] J. Almeida and L. Barbosa. Improving hierarchical cluster analysis: A new method with outlier detection and automatic clustering. Chemometrics and Intelligent Laboratory Systems, 87(2):208–217, 2007.
[4] B. Baldassari. SQuORE: A new approach to software project quality measurement. In International Conference on Software & Systems Engineering and their Applications, 2012.
[5] R. D. Banker, S. M. Datar, C. F. Kemerer, and D. Zweig. Software complexity and maintenance costs. Communications of the ACM, 36(11):81–94, Nov. 1993.
[6] B. Boehm and V. Basili. Software Defect Reduction Top 10 List. Computer, pages 123–137, 2001.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Record, 29(2):93–104, June 2000.
[8] G. Canfora and L. Cerulo. Impact analysis by mining software and change request repositories. In 11th IEEE International Symposium on Software Metrics, pages 1–9, 2005.
[9] S. Cateni, V. Colla, and M. Vannucci. Outlier detection methods for industrial applications. Advances in Robotics, Automation and Control, pages 265–282, 2008.
[10] M. Ceccato, M. Di Penta, J. Nagra, P. Falcarin, F. Ricca, M. Torchiano, and P. Tonella. The effectiveness of source code obfuscation: An experimental assessment. In 2009 IEEE 17th International Conference on Program Comprehension, pages 178–187. IEEE, May 2009.
[11] V. Chandola, A. Banerjee, and V. Kumar. Outlier detection: A survey. ACM Computing Surveys, 2007.
[12] M. Christodorescu and S. Jha. Static analysis of executables to detect malicious patterns. Technical report, DTIC Document, 2006.
[13] D. Cook and D. Swayne. Interactive and Dynamic Graphics for Data Analysis: With R and GGobi. 2007.
[14] N. Fenton and N. Ohlsson. Quantitative analysis of faults and failures in a complex software system. IEEE Transactions on Software Engineering, 26(8):797–814, 2000.
[15] P. Filzmoser, R. Maronna, and M. Werner. Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52(3):1694–1711, 2008.
[16] A. Gordon. Classification, 2nd Edition. Chapman and Hall/CRC, 1999.
[17] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang, R. Holmes, and M. W. Godfrey. The MSR Cookbook. In 10th International Workshop on Mining Software Repositories, pages 343–352, 2013.
[18] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
[19] M. Jiang, S. Tseng, and C. Su. Two-phase clustering Seminar, pages 23–31, 2005. process for outliers detection. Pattern recognition [35] T. McCabe. A complexity measure. Software letters, 22(6):691–700, 2001. Engineering, IEEE Transactions on, (4):308–320, [20] C. Kaner and W. P. Bond. Software engineering 1976. metrics: What do they measure and how do we know? [36] T. Mens and T. Tourwé. A survey of software In 10th International Software Metrics Symposium, refactoring. IEEE Transactions on Software METRICS 2004, pages 1–12, 2004. Engineering, 30(2):126–139, 2004. [21] C. F. Kemerer. Software complexity and software [37] T. Menzies, C. Bird, and T. Zimmermann. The maintenance: A survey of empirical research. Annals inductive software engineering manifesto: Principles of Software Engineering, 1(1):1–22, Dec. 1995. for industrial data mining. In Proceedings of the [22] T. Khoshgoftaar and J. Munson. Predicting software International Workshop on Machine Learning development errors using software complexity metrics. Technologies in Software Engineering, Lawrence, IEEE Journal on Selected Areas in Communications, Kansas, USA, 2011. 8(2):253–261, 1990. [38] L. C. Noll, S. Cooper, P. Seebach, and A. B. Leonid. [23] B. Kitchenham, S. Pfleeger, L. M. Pickard, P. W. The International Obfuscated C Code Contest. Jones, D. C. Hoaglin, K. E. Emam, and J. Rosenberg. [39] M. C. Paulk, K. L. Needy, and J. Rajgopal. Identify Preliminary guidelines for empirical research in outliers, understand the Process. ASQ Software software engineering. IEEE Transactions on Software Quality Professional, 11(2):28–37, 2009. Engineering, 28(8):721–734, 2002. [40] C. Phua, V. Lee, K. Smith, and R. Gayler. A [24] E. Knorr and R. Ng. Algorithms for mining comprehensive survey of data mining-based fraud distance-based outliers in large datasets. In detection research. arXiv preprint arXiv:1009.6119, Proceedings of the International Conference on Very 2010. Large Data Bases, volume 8.3, 1998. [41] R Core Team. R: A Language and Environment for [25] Y. Kou, C.-T. Lu, S. Sirwongwattana, and Y.-P. Statistical Computing, 2013. Huang. Survey of fraud detection techniques. In 2004 [42] M. Robillard and R. Walker. Recommendation IEEE International Conference on Networking, systems for software engineering. Software, IEEE, Sensing and Control, pages 749–754, 2004. pages 80–86, 2010. [26] M. Last, M. Friedman, and A. Kandel. The data [43] P. Rousseeuw, C. Croux, V. Todorov, A. Ruckstuhl, mining approach to automated software testing. In M. Salibian-Barrera, T. Verbeke, M. Koller, and Proceedings of the 9th ACM SIGKDD international M. Maechler. {robustbase}: Basic Robust Statistics. R conference on Knowledge discovery and data mining, package version 0.9-10., 2013. pages 388–396, 2003. [44] M. Shaw. Prospects for an Engineering Discipline of [27] M. Last and A. Kandel. Automated Detection of Software. IEEE Software, (November):15–24, 1990. Outliers in Real-World Data. In Proceedings of the [45] G. Singh and V. Kumar. An Efficient Clustering and second international conference on intelligent Distance Based Approach for Outlier Detection. technologies, pages 292–301, 2001. International Journal of Computer Trends and [28] J. Laurikkala, M. Juhola, and E. Kentala. Informal Technology, 4(7):2067–2072, 2013. identification of outliers in medical data. In [46] K. Singh and S. Upadhyaya. Outlier Detection : Proceedings of the 5th International Workshop on Applications And Techniques. 
In International Journal Intelligent Data Analysis in Medicine and of Computer Science, volume 9, pages 307–323, 2012. Pharmacology, pages 20–24, 2000. [47] K. Stroggylos and D. Spinellis. Refactoring – Does It [29] A. Lazarevic, L. Ertoz, and V. Kumar. A comparative Improve Software Quality? In Fifth International study of anomaly detection schemes in network Workshop on Software Quality, ICSE Workshops 2007, intrusion detection. Proc. SIAM, 2003. pages 10–16. IEEE, 2007. [30] R. Lincke, J. Lundberg, and W. Löwe. Comparing [48] E. Weyuker. Evaluating software complexity measures. software metrics tools. Proceedings of the 2008 IEEE Transactions on Software Engineering, international symposium on Software testing and 14(9):1357–1365, 1988. analysis - ISSTA ’08, page 131, 2008. [49] C. Wohlin, M. Höst, and K. Henningsson. Empirical [31] A. Loureiro, L. Torgo, and C. Soares. Outlier research methods in software engineering. Empirical detection using clustering methods: a data cleaning Methods and Studies in Software Engineering, pages application. In roceedings of KDNet Symposium on 7–23, 2003. Knowledge-based systems for the Public Sector, 2004. [50] W. Wong, A. Moore, G. Cooper, and M. Wagner. [32] P. Louridas, D. Spinellis, and V. Vlachos. Power laws Bayesian network anomaly pattern detection for in software. ACM Transactions on Software disease outbreaks. In ICML, pages 808–815, 2003. Engineering and Methodology, 18(1):1–26, Sept. 2008. [51] K.-A. Yoon, O.-S. Kwon, and D.-H. Bae. An [33] O. Maimon and L. Rokach. Data Mining and Approach to Outlier Detection of Software Knowledge Discovery Handbook. Kluwer Academic Measurement Data using the K-means Clustering Publishers, 2005. Method. In Empirical Software Engineering and [34] M. Mansur, M. Sap, and M. Noor. Outlier Detection Measurement, pages 443–445, 2007. Technique in Data Mining: A Research Perspective. In [52] H. Zhang. On the distribution of software faults. Proceedings of the Postgraduate Annual Research Software Engineering, IEEE Transactions on, 2008. A practitioner approach to software engineering data mining

Boris Baldassari, SQuORING Technologies, 76, Avenue Jean Jaurès, Toulouse, France – [email protected]
Philippe Preux, INRIA Lille, 165, Avenue de Bretagne, Lille, France – [email protected]

ABSTRACT projects have been cataloged and now provide a compre- As an emerging discipline, software engineering data mining hensive set of practices and empirical knowledge – although has made important advances in recent years. These in turn some concepts still miss a common, established consensus have produced new tools and methods for the analysis and among actors of the field. Software measurement programs improvement of software products and practices. have been widely studied and debated and now offer a com- Building upon the experience gathered on software mea- prehensive set of recommendations and known issues for surement programs, data mining takes a wider and deeper practitioners. On the data mining side research areas like view on software projects by introducing unusual kinds of software repository mining [16], categorisation (clustering) information and addressing new software engineering con- of artefacts, program comprehension [11, 32], or developer cerns. On the one hand software repository mining brings assistance [36, 15, 25,7] have brought new insights and new new types of process- and product-oriented metrics along perspectives with regards to data available for analyses and with new methods to use them. On the other hand the data the guidance that can be drawn from them. mining field introduces more robust algorithms and practical This has led to a new era of knowledge for software mea- tools, thus allowing to face new challenges and requirements surement programs, which expands their scope beyond sim- for mining programs. ple assessment concerns. Data mining programs have enough However some of the concerns risen years ago about soft- information at hand to leverage quality assessment to prag- ware measurement programs and the validity of their results matic recommendations and improvement of the process, are still relevant to this novel approach. From data retrieval practices and overall quality of software projects. to organisation of quality attributes and results presenta- In this paper we review practices established for software tion, one has to constantly check the validity of the analysis measurement programs and consider them from the perspec- process and the semantics of the results. tive of practical quality improvement, from data retrieval in This paper focuses on practical software engineering data software repositories to definition and organisation of qual- mining. We first draw a quick landscape of software engi- ity attributes and to the publication of results. The experi- neering data, how they can be retrieved, used and organised. ence gathered in the context of a high-constraints industrial Then we propose some guidelines to conduct a successful and software measurement program is reported to illustrate the efficient software data mining program and describe how we proposed approach. applied it to the case of Polarsys, an Eclipse working group This paper is organised as follows: section2 takes an in- targeted at project quality assessment and improvement. ventory of the experience gained on software measurement programs and proposes in section3 a series of steps to be fol- lowed to preserve the consistency and validity of data mining Keywords programs. Section4 draws a landscape of the information Data Mining, Software Development, Software Metrics, Min- that can be gathered on software projects, pointing out the ing Repositories difficulties and pitfalls of data retrieval. The approach is then illustrated through the Eclipse use case in section5. 
1. INTRODUCTION Finally, section6 summarises the main ideas of the paper and proposes future directions for upcoming work. Software engineering data mining is at the crossroads of two developing fields that tremendously advanced in the recent years. From the software engineering perspective, 2. FROM MEASUREMENT TO DATA MIN- lessons learned from years of failing and successful software ING 2.1 Software measurement programs Permission to make digital or hard copies of all or part of this work for The art of measurement is very old and mature compared personal or classroom use is granted without fee provided that copies are to software engineering. Some of the concepts that seem not made or distributed for profit or commercial advantage and that copies natural for usual measures (such as the height of a per- bear this notice and the full citation on the first page. To copy otherwise, to son) become difficult to apply when it comes to software republish, to post on servers or to redistribute to lists, requires prior specific engineering [12, 22]. As an example, one of the issues en- permission and/or a fee. International Conference on Software Engineering 2014 Hyderabad, India countered with software measurement programs is the gap Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. between quality models and metrics. Although many pa- rameters are known to influence each quality factor (e.g. 2.3 Specific benefits of data mining the maintainability), there is no established, comprehensive Data mining offers a new perspective on the information and recognised mapping between metrics and attributes of available from software projects, thus unleashing new, pow- quality. erful tools and methods for our activity. While software Furthermore people have different ideas and experiences of measurement programs allow to assess aspects of the prod- software characteristics, which makes it difficult to work on uct or project quality, software data mining programs offer 1 the same footing . If the analysis process is too complex or enough information to build upon the assessment phase and not transparent enough, users will have trouble to retain the deliver practical recommendations for the improvement of meaningfulness of data and results. Practictioners setting the desired attributes of quality. In the context of software up a mining program should be familiar with the experience development this can be achieved through e.g. action lists and conclusions of software measurement programs [19, 14, for refactoring, maintainability warnings for overly complex 17, 26] to avoid most common pitfalls. files, or notice when too many support questions lie unan- To circumvent these problems, Victor Basili proposed in [5] swered. the Goal-Question-Metric approach to better understand Data mining methods offer a plethora of useful tools for and conduct a software measurement program. Linda West- this purpose. Outliers detection [1, 28, 31] gives precious in- fall et al. [34] further enhanced this approach with 12 steps to formation for artefacts or project characteristics that show be thoroughly followed for better chances of success. Many a deviation from so-called normal behaviour. Clustering [20, aspects of the approach we propose rely on the principles 27] allows to classify artefacts according to multi-dimensional they have codified. criteria2. Recommender systems [15,7, 30] can propose well- known patterns, code snippets or detect bug patterns. 
2.2 Notions of quality If one intends to improve software quality then one will first need to define it in one’s own specific context. Dozens 3. DEFINING THE MINING PROCESS of definitions have been proposed and widely debated for Considering the above-mentioned perils, we propose a few software quality, from conformance to requirements [9] to guidelines to be followed when setting up a data mining fitness for use [21], but none of them gained a definitive process. These allow to ensure the integrity and usability acceptance among the community or the industry – because of information and keep all actors synchronised on the same the notion of quality may vary. concerns and solution. Standards like ISO 9126 [18] or the 250XX series pro- pose interesting references for product quality and have been 3.1 Declare the intent quite well adopted by the software industry. The ISO 15504 The whole mining process is driven by its stated goals. and CMMi standards provide useful guidelines for process The quality model and attributes, means to measure it, and maturity assessment and improvement at the organisation presentation of the results will differ if the program is de- level. Kitchenham and Pfleeger [23], further citing Garvin’s signed as an audit-like, acceptance test for projects, or as teachings on product quality, conclude that “quality is a an incremental quality improvement program to ease evolu- complex and multifaceted concept that can be described tion and maintenance of projects. The users of the mining from five different perspectives”: program, who may be developers, managers, buyers, or end- users of the product, have to map its objectives to concepts The transcendental view sees quality as something that and needs they are familiar with. Including users in the def- • can be recognised but hardly defined. inition of the intent of the program also helps preventing counter-productive use of the metrics and quality models. The user view sees quality as fitness for purpose. The intent must be simple, clearly expressed in a few sen- • tences, and published for all considered users of the program. The manufacturing view sees quality as conformance 3.2 Identify quality attributes • to specification. The concerns identified in the intent are then decomposed into quality attributes. This firstly gives a structure to the The product view attaches quality to inherent charac- • quality model, and secondly allows to rely on well-defined teristics of the product. characteristics – which greatly simplifies the communication and exchange of views. Recognised standards and estab- The value-based view sees quality as the interest or lished practices provide useful frameworks and definitions • money users will put on it. for this step. One should strive for simplicity when elab- orating quality attributes and concepts. Common sense is Even in the restricted situation of a specific software project a good argument, and even actors that have no knowledge most people will implicitly have different meanings for qual- of the field associated to the quality attributes should be ity. For that reason the requirements and notion of quality able to understand them. Obscurity is a source of fear and need to be publicly discussed and explicitly stated. Rely- distrust and must be avoided. 
ing on well-known definitions and concepts issued from both The output of this step is a fully-featured quality model standards and experts in the domain greatly helps actors to that reflects all of the expected needs and views of the min- reach an agreement. ing program. The produced quality model is also a point of convergence for all actors: requirements of different origin 1As an example the idea of high complexity heavily differs when considered in critical embedded systems or in desktop 2As an example, maintainability is often empirically esti- products mated using control and data flow complexity. Figure 1: From quality definition to repository metrics. and nature are bound together and form a unified, consistent the entire process also helps people understand what qual- view. ity is in this context and how it is measured (i.e. no magic), making them more confident in the process. 3.3 Identify available metrics Once we precisely known what quality characteristics we 3.5 Presentation of results are looking for, we have to identify measures that reflect this Visualisation of results is of primary importance for the information need. Data retrieval is a fragile step of the min- efficiency of the information we want to transmit. In some ing process. Depending on the information we are looking cases (e.g. for incremental improvement of quality) a short for, various artefact types and measures may be available: list of action items will be enough because the human mind one has to select them carefully according to their intended feels more comfortable correcting a few warnings than hun- purpose. The different repositories available for the projects dreds of them. But if our goal is to assess the technical debt being analysed should be listed, with the measures that may of the product, it would be better to list them all to get be retrieved from them. Selected metrics have to be stable a good idea of the actual status of the product’s maintain- and reliable (i.e. their meaning to users must remain con- ability. Pictures and plots also show to be very useful to stant over time and usage), and their retrieval automated illustrate ideas and are sometimes worth a thousand words. (i.e. no human intervention is required). If we want to highlight unstable files, a plot showing the This step also defines how the metrics are aggregated up to number of recent modifications would immediately spot the the top quality characteristics. Since there is no universally most unstable of them and show how much volatile they are recognised agreement on these relationships one has to rely compared to the average. on local understanding and conventions. All actors, or at Literate data analysis is the application of Donald Knuth’s least a vast majority of them, should agree on the meaning literate programming principles to the analysis of data. Re- of the selected metrics and the links to the quality attributes. producible research is thus made possible through tools like Sweave [13] or Knitr [35], which allow to generate custom 3.4 Implementation reports from R data analysis scripts, with dynamically gen- The mining process must be fully automated, from data erated graphics and text. retrieval to results presentation, and transparently published. Automation allows to reliably and regularly collect the infor- mation, even when people are not available3 or not willing to 4. PRACTICAL MINING do it4. 
From the historical perspective, missing data poses a serious threat to the consistency of results and to the al- 4.1 Topology of a software project gorithms used to analyse them – information is often easier Repositories hold all of the assets and information avail- to get at the moment than afterwards. The publishing of able for a software project, from code to review reports and discussions. In the mining process it is necessary to find a 3Nobody has time for data retrieval before a release or dur- ing holidays. common base of mandatory repositories and measures for 4Kaner cites in [22] the interesting case of testers waiting all projects to be analysed: e.g. configuration management, for the release before submitting new bugs, pinning them on change management (bugs and change requests), communi- the wall for the time being. cation (e.g. forums or mailing lists) and publication (web- site, wiki). We list thereafter some common repositories and phisticated methods may either be set up to establish the how they can be used. link between bugs and commits [2]. It is useful to retrieve information on different time frames (e.g. yesterday, last week, last month, last three months) to 4.4 Change management better grasp the dynamics of the measures. Evolution is A tracker, or bug tracking system, allows to record any an important aspect of the analysis since it allows users to type of items with a defined set of associated attributes. understand the link between measures, practices and quality They are typically used to track bugs and enhancements re- attributes. It also makes them realise how better or worse quests, but may as well be used for requirements or support they do with time. requests. The comments posted on issues offer a bunch of useful information regarding the bug itself and people’s be- 4.2 Source code haviour; some studies even treat them as a communication Source code is generally extracted from the configuration channel. management repository at a specific date and for a specific A bias may be introduced by different tools and workflows, version (or branch) of the product. Source releases as pub- or even different interpretations of a same status. A common lished by open source projects can be used as well but may work-around is to map actual states to a minimalist set of heavily differ from direct extracts, because they represent useful canonical states: e.g. open, working, verifying, closed. a subset of the full project repository and may have been Almost all life cycles can be mapped to these simple steps modified during the release process. without twisting them too much. Static code analysis [24] tools usually provide metrics and Examples of metrics that can be retrieved from change findings. Measures target intrinsic characteristics of the management systems include the time to closure, number of code, like its size or the number of nested loops. Findings comments, or votes, depending on the system’s features. are occurrences of non-conformities to some standard rules or good practices. Tools like Checkstyle [8] or PMD [10] 4.5 Communication channels check for anti-patterns or violations of conventions, and thus Actors of a software project need to communicate to co- provide valuable information on acquired development prac- ordinate efforts, ensure some consistency in coding prac- tices [33]. Dynamic analysis gives more information on the tices, or simply get help [29]. 
This communication may product performance and behaviour [11, 32], but needs com- flow through different means: mailing lists, forums, or news piled, executable products – which is difficult to achieve au- servers. Every project is supposed to have at least two com- tomatically on a large scale. Dynamic and static analyses munication media, one for contributors to exchange on the are complementary techniques on completeness, scope, and product development (the developers mailing list) and an- precision [4]. other one for the product usage (the users mailing list). Local customs may impact the analysis of communication 4.3 Configuration management channels. In some cases low-activity projects send commits A configuration management system allows to record and or bugs notification to the developer mailing list as a con- restitute all changes on versioned artefacts, with some useful venience to follow repositories activity, or contributors may information about who did it, when, and (to some extent) use different email addresses or logins when posting – which why. It brings useful information on artefacts’ successive makes it difficult, if not impossible, to link their communi- modifications and on the activity and diversity of actors in cations to activity in other repositories. The different com- the project. munication media can all be parsed to retrieve a common The primary source of data is the verbose log of the repos- set of metrics like the number of threads, mails, authors or itory branch or trunk, as provided in various formats by all response ratio, and time data like the median time to first configuration management tools. The interpretation of met- response. rics heavily depends on the configuration management tool in use and its associated workflow: as an example, commits 4.6 Publication on a centralised subversion repository do not have the same A project has to make the final product available to its meaning than on a Git distributed repository because of users along with a number of artefacts such as documenta- the branching system and philosophy. A Subversion repos- tion, FAQ or How-to’s, or project-related information like itory with hundreds of branches is probably the sign of a team members, history of the project or advancement of bad usage of the branching system, while it can be consid- the next release. A user-edited wiki is a more sophisticated ered normal with Git. In such a context it may be useful communication channel often used by open source projects. to setup some scales to adapt ranges and compare tools to- The analysis of these repositories may be difficult because gether. The positioning of the analysis in the configuration of the variety of publishing engines available: some of them management branches also influences its meaning: working display the time of last modification and who did it, while on maintenance-only branches (i.e. with many bug fixes and other projects use static pages with almost no associated few new features) or on the trunk (next release, with mainly meta data. As an example user-edited wikis are quite easy new features and potentially large refactoring or big-bang to parse because they display a whole bouquet of informa- changes) does not yield the same results. tion about their history, but may be very time-consuming Example of metrics that can be gathered from there are to parse because of their intrinsic changeability. 
the number of commits, committers or fix-related commits Metrics that can be gathered from these repositories are on a given period of time. As a note, we defined in our commonly linked to the product documentation and commu- context a fix-related commit as having one of the terms fix, nity wealth characteristics. Examples include the number of bug, issue or error in the message associated to the revision. pages, recent modifications or entries in the FAQ, the num- Depending on the project tooling and conventions, more so- ber of different authors and readers, or the age of entries Figure 2: Proposed Eclipse quality model

(which may denote obsolete pages). In the spirit of the ISO/IEC 9126 standard[18] these con- cerns are decomposed in Analysability, Changeability 5. THE ECLIPSE CASE and Reusability. Another quality requirement is Relia- bility: Eclipse components are meant to be used in bundles 5.1 Context of the quality assessment program or stacks of software, and the failure of a single component The Eclipse foundation is an independent, non-profit cor- may threaten the whole application. For an industrial de- poration founded in 2004 to support the development of the ployment over thousands of people in worldwide locations Eclipse IDE. It is one of the most well-known open source this may have serious repercussions. success stories around, gathering a “vibrant, vendor-neutral The Eclipse foundation has a strong interest in IP manage- and transparent community” of both open source develop- ment and predictability of outputs. Projects are classified ers and industry actors. As a result Eclipse is the most used into phases, from proposal (defining the project) to incu- IDE in the world with tens of millions of downloads per year. bating (growing the project) and mature (ensuring project The Polarsys working group was launched in 2011 to fos- maintenance and vitality). Different constraints are imposed ter embedded systems development and ecosystem. Its goals on each phase: e.g. setting up reviews and milestones, pub- are to provide mature tools, with long-term support (as long lishing a roadmap, or documenting APIs backward compat- as 75 years in space or aeronautics) and compliance with ibility. In the first iteration of the quality model only IP safety-critical certification processes. The Maturity Assess- management and Planning management are addressed. ment task force was initiated in the beginning of 2013 to Communities really lie at the heart of the Eclipse way. address quality concerns and provide some guarantees on The Eclipse manifesto defines three communities: develop- the components embedded in the Polarsys stack of software. ers, adopters, and users. Projects must constitute, then grow and nurture their communities regarding their activ- 5.2 Declaration of intent ity, diversity, responsiveness and support capability. The objectives of the conducted program have been iden- In the context of our program we will mainly focus on de- tified as follows: veloper and user communities. Assess projects maturity as defined by the Polarsys 5.4 Metrics identification • working group stated goals. Is the project ready for large-scale deployments in high-constraints software or- Source code. ganisations? Does it conform to the Eclipse foundation Source code is extracted from Subversion or Git reposito- recommendations? ries and analysed with SQuORE [3]. Static analysis was pre- ferred because automatically compiling the different projects Help teams develope better software regarding the qual- was unsafe and unreliable. Further analyses may be inte- • ity requirements defined. Propose guidance to improve grated with continuous build servers to run dynamic analy- the management, process and product of projects. ses as well. Establish the foundation for a global agreement on qual- Metrics include common established measures like line- • ity conforming to the Eclipse way. 
A framework to counting (comment lines, source lines, effective lines, state- collaborate on the semantics of quality is the first step ments), control-flow complexity (cyclomatic complexity, max- to a better understanding and awareness of these con- imum level of nesting, number of paths, number of control- cerns in the Eclipse community. flow tokens) or Halstead’s software science metrics (number of distinct and total operators and operands). 5.3 Quality requirements Findings include violations from SQuORE, Checkstyle and Because of their open-source nature, Eclipse components PMD. All of these have established lists of rules that are have strong maintainability concerns, which are actually re- mapped to practices and classified according to their impact enforced for the Polarsys long-term support requirements. (e.g. reliability, readability, fault tolerance). Figure 3: Examples of results for the Polarsys mining program.

SCM metadata. several means to deliver the information: A data provider was developed to retrieve software config- A quality tree that lists the quality attributes and shows uration metadata from the two software configuration man- • the conformance to identified requirements (figure3, agement systems used inside Eclipse, Subversion and Git. left side). The evolution compared to the last analysis The metrics identified are the following: number of com- is also depicted. mits, committers, committed files, and fix-related commits. All measures are computed for the last week, last month and Action lists, classified according to the different con- last three months as well as in total. Doing so we get the • cerns identified: overly complex or untestable files, ability to catch recent evolutions of the project by giving naming conventions violations, or a refactoring wish more weight to the most recent modifications. list. An example is shown on the right of figure3. Nice, coloured graphics that immediately illustrate spe- Communication. • cific concepts. This proves to be especially useful when Communication channels used at Eclipse are NNTP news, the characteristics of an incriminated file make it sig- mailing lists and web forums. All of these can be mapped nificantly different than the average so it can be iden- and converted to the mbox format, which we then parse tified at first sight, as e.g. for highly complex files. to extract the metrics we need. Metrics retrieved are the number of threads, distinct authors, the response ratio and Finally the goals, definitions, metrics and resulting quality median time to first answer. Metrics are gathered for both model were presented at the EclipseCon France conference user and developer communities, on different time frames held in May, 2013 in Toulouse and received good and con- to better grasp the dynamics of the project evolution (one structive feedback. week, one month, three months). 6. CONCLUSION Process repository. In this study we propose an integrated approach to soft- The Eclipse foundation has recently started an initiative ware project management, assessment and improvement through to centralise and publish real-time process-related informa- data mining. Building upon measurement programs experi- tion. This information is available through a public API ence and good practices, software data mining unveils new returning JSON and XML data about every Eclipse compo- horizons of information, along with new methods and tools nent. Metrics defined are: number and status of milestones to extract useful knowledge from it. and reviews, and coverage of intellectual property logs. The proposed approach focuses on a few key concepts that aim to gather all actors around a common understanding of 5.5 Results the program’s objectives to produce a consensual evaluation Following the quality requirements established in step3, of the project characteristics and pragmatic recommenda- we proposed and documented a quality model to the work- tions for its improvement. When software engineering fails ing group. People could better visualise the decomposition to bring a unanimously accepted definition of some concepts of quality and relationships between quality attributes and we rely on local agreement and understanding to institute metrics. This led to new ideas and constructive feedback specific conventions. 
5.5 Results
Following the quality requirements established in step 3, we proposed and documented a quality model to the working group. People could better visualise the decomposition of quality and the relationships between quality attributes and metrics. This led to new ideas and constructive feedback, and after some minor changes and improvements the model represented in figure 2 was officially published on the working group wiki.

The prototype for this first iteration of the quality program has been implemented with SQuORE [3] and provides several means to deliver the information:

• A quality tree that lists the quality attributes and shows the conformance to identified requirements (figure 3, left side). The evolution compared to the last analysis is also depicted.
• Action lists, classified according to the different concerns identified: overly complex or untestable files, naming convention violations, or a refactoring wish list. An example is shown on the right of figure 3.
• Nice, coloured graphics that immediately illustrate specific concepts. This proves to be especially useful when the characteristics of an incriminated file make it significantly different from the average, so it can be identified at first sight, as e.g. for highly complex files.

Finally, the goals, definitions, metrics and resulting quality model were presented at the EclipseCon France conference held in May 2013 in Toulouse, and received good and constructive feedback.

6. CONCLUSION
In this study we propose an integrated approach to software project management, assessment and improvement through data mining. Building upon measurement programs' experience and good practices, software data mining unveils new horizons of information, along with new methods and tools to extract useful knowledge from it.

The proposed approach focuses on a few key concepts that aim to gather all actors around a common understanding of the program's objectives, to produce a consensual evaluation of the project characteristics and pragmatic recommendations for its improvement. When software engineering fails to bring a unanimously accepted definition of some concepts, we rely on local agreement and understanding to institute specific conventions. Successive iterations of the procedure and quality model will improve upon the experience and feedback of users to correct identified issues, address new concerns, and provide new recommendation types.

We also detail the information available in common software repositories, give practical advice as to how to extract it, and list some of the pitfalls that practitioners may encounter during data retrieval. The application of the proposed approach to a large-scale mining program with industrial participants illustrates how we could get to a common local agreement on quality definition and improvement, and produce a fully-featured prototype.

Forges are great environments for data mining programs, because they propose a restricted set of tools and ease the integration of information across them. The development of collaborative frameworks such as the Open Services for Lifecycle Collaboration (OSLC) [6] will also lead to tighter integration of information coming from different sources. Working with real-world data and situations also makes a great difference. We encourage future research work to set up software project mining in production environments to ensure its usability and relevance to pragmatic concerns. Open source software forges have a wealth of information available that can be transformed into useful knowledge, and numerous studies from the data mining field could benefit from this application, increasing their impact on software engineering knowledge and practices.

7. REFERENCES
[1] M. Al-Zoubi. An effective clustering-based approach for outlier detection. European Journal of Scientific Research, 2009.
[2] A. Bachmann, C. Bird, F. Rahman, P. Devanbu, and A. Bernstein. The Missing Links: Bugs and Bug-fix Commits. 2010.
[3] B. Baldassari. SQuORE: A new approach to software project quality measurement. In International Conference on Software & Systems Engineering and their Applications, 2012.
[4] T. Ball. The Concept of Dynamic Analysis. In Software Engineering—ESEC/FSE'99, pages 216–234, 1999.
[5] V. R. Basili, G. Caldiera, and H. D. Rombach. The Goal Question Metric approach, 1994.
[6] O. Berger, S. Labbene, M. Dhar, and C. Bac. Introducing OSLC, an open standard for interoperability of open source development tools. In International Conference on Software & Systems Engineering and their Applications, pages 1–8, 2011.
[7] M. Bruch and M. Mezini. Improving Code Recommender Systems using Boolean Factor Analysis and Graphical Models. 2010.
[8] O. Burn. Checkstyle, 2001.
[9] P. B. Crosby. Quality is free: The art of making quality certain, volume 94. McGraw-Hill New York, 1979.
[10] D. Dixon-Peugh. PMD, 2003.
[11] T. Eisenbarth. Aiding program comprehension by static and dynamic feature analysis. In International Conference on Software Maintenance, 2001.
[12] N. Fenton. Software Measurement: a Necessary Scientific Basis. IEEE Transactions on Software Engineering, 20(3):199–206, Mar. 1994.
[13] F. Leisch. Sweave. Dynamic generation of statistical reports using literate data analysis. Technical Report 69, SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, Vienna, 2002.
[14] A. Gopal, M. Krishnan, T. Mukhopadhyay, and D. Goldenson. Measurement programs in software development: determinants of success. IEEE Transactions on Software Engineering, 28(9):863–875, Sept. 2002.
[15] H.-J. Happel and W. Maalej. Potentials and challenges of recommendation systems for software development. In Proceedings of the 2008 international workshop on Recommendation systems for software engineering, pages 11–15. ACM, 2008.
[16] H. Hemmati, S. Nadi, O. Baysal, O. Kononenko, W. Wang, R. Holmes, and M. W. Godfrey. The MSR Cookbook. In 10th International Workshop on Mining Software Repositories, pages 343–352, 2013.
[17] H. Ircdenkscn, I. Vcj, and J. Iversen. Implementing Software Metrics Programs: A Survey of Lessons and Approaches. Information Technology and Organizations: Trends, Issues, Challenges and Solutions, 1:197–201, 2003.
[18] ISO. ISO/IEC 9126 Software Engineering - Product Quality - Parts 1-4. Technical report, ISO/IEC, 2005.
[19] J. Iversen and L. Mathiassen. Lessons from implementing a software metrics program. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, pages 1–11. IEEE, 2000.
[20] G. S. D. S. Jayakumar and B. J. Thomas. A New Procedure of Clustering Based on Multivariate Outlier Detection. Journal of Data Science, 11:69–84, 2013.
[21] J. M. Juran, A. B. Godfrey, R. E. Hoogstoel, and E. G. Schilling. Juran's Quality Handbook, volume 2. McGraw Hill New York, 1999.
[22] C. Kaner and W. P. Bond. Software engineering metrics: What do they measure and how do we know? In 10th International Software Metrics Symposium, METRICS 2004, pages 1–12, 2004.
[23] B. Kitchenham and S. Pfleeger. Software quality: the elusive target. IEEE Software, 13(1):12–21, 1996.
[24] P. Louridas. Static Code Analysis. IEEE Software, 23(4):58–61, 2006.
[25] M. R. Marri, S. Thummalapenta, and T. Xie. Improving Software Quality via Code Searching and Mining. Source, pages 33–36, 2009.
[26] T. Menzies, C. Bird, and T. Zimmermann. The inductive software engineering manifesto: Principles for industrial data mining. In Proceedings of the International Workshop on Machine Learning Technologies in Software Engineering, Lawrence, Kansas, USA, 2011.
[27] B. B. Naib. An Improved Clustering Approach for Software Quality Analysis. International Journal of Engineering, Applied and Management Sciences Paradigms, 05(01):96–100, 2013.
[28] M. S. D. Pachgade and M. S. S. Dhande. Outlier Detection over Data Set Using Cluster-Based and Distance-Based Approach. International Journal of Advanced Research in Computer Science and Software Engineering, 2(6):12–16, 2012.
[29] E. Raymond. The cathedral and the bazaar. Knowledge, Technology & Policy, 1999.
[30] M. Robillard and R. Walker. Recommendation systems for software engineering. IEEE Software, 27(4):80–86, 2010.
[31] G. Singh and V. Kumar. An Efficient Clustering and Distance Based Approach for Outlier Detection. International Journal of Computer Trends and Technology, 4(7):2067–2072, 2013.
[32] M.-A. Storey. Theories, methods and tools in program comprehension: Past, present and future. In 13th International Workshop on Program Comprehension (IWPC 2005). IEEE Computer Society, 2005.
[33] S. Wagner, F. Deissenboeck, M. Aichner, J. Wimmer, and M. Schwalb. An Evaluation of Two Bug Pattern Tools for Java. In 2008 International Conference on Software Testing, Verification, and Validation, pages 248–257. IEEE, Apr. 2008.
[34] L. Westfall and C. Road. 12 Steps to Useful Software Metrics. Proceedings of the Seventeenth Annual Pacific Northwest Software Quality Conference, 57 Suppl 1(May 2006):S40–3, 2005.
[35] Y. Xie. Knitr: A general-purpose package for dynamic report generation in R. Technical report, 2013.
[36] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl. Mining version histories to guide software changes. IEEE Transactions on Software Engineering, 31(6):429–445, 2005.

Understanding software evolution: The Maisqual Ant data set

Boris Baldassari
SQuORING Technologies
Toulouse, France
Email: [email protected]

Philippe Preux
SequeL, LIFL, CNRS, INRIA
Université de Lille
Email: [email protected]

Abstract—Software engineering is a maturing discipline which has seen many drastic advances in the last years. However, some studies still point to the lack of rigorous and mathematically grounded methods to raise the field to a new emerging science, with proper and reproducible foundations to build upon. Indeed, mathematicians and statisticians do not necessarily have software engineering knowledge, while software engineers and practitioners do not necessarily have a mathematical background. The Maisqual research project intends to fill the gap between both fields by proposing a controlled and peer-reviewed data set series ready to use and study. These data sets feature metrics from different repositories, from source code to mail activity and configuration management metadata. Metrics are described and commented, and all steps followed for their extraction and treatment are described with contextual information about the data and its meaning. This article introduces the Apache Ant weekly data set, featuring 636 extracts of the project over 12 years at different levels of artefacts: application, files, functions. By associating community- and process-related information to code extracts, this data set unveils interesting perspectives on the evolution of one of the great success stories of open source.

Index Terms—Data mining, Software Engineering, Software Metrics

I. Introduction

In the last 30 years, software has become ubiquitous both in the industry (from Internet to business intelligence to supply chain automation) and in our daily lives. As a consequence, characteristics of software such as reliability, performance or maintainability have become increasingly important, either for stakeholders, developers, or end-users.

Research on software metrics was introduced a long time ago to help and support the field, but despite some remarkable advances there are still many critics (most notably from Fenton [1] and Kaner [2]) as to the scientific approach and overall mathematical rigour needed to build scientific methods.

Most of the studies that cover software engineering concerns use their own retrieval process and work on unpublished, non-verifiable data. This fact definitely influences the credibility of studies, and lends credence to criticism about the relevance, reproducibility and usability of their conclusions. The proposed data set intends to establish a bedrock for upcoming studies by providing a consistent and peer-reviewed set of measures associated to various and unusual characteristics extracted from heterogeneous sources: mails, configuration management and coding rules. It provides a set of metrics gathered on a long-running, real-world project, and states their definitions and requirements. It is designed to be easily imported with any statistical program.

In section II, we describe the structure and contents of the data set, and give information on the project history. Section III enumerates the various types of measures retrieved and how they are integrated, and section IV lists the coding rules checked on code. Section V proposes a few examples of usage for the data set. Finally, we state our on-going and future work regarding these concerns in section VI.

II. Data set description

A. The Ant project

The early history of Ant begins in the late nineties with the donation of the Tomcat software from Sun to Apache. From a specific build tool, it evolved steadily through Tomcat contributions to be more generic and usable. James Duncan Davidson announced the creation of the Ant project on 13 January 2000, with its own mailing lists, source repository and issue tracking.

TABLE I: Major releases of Ant.

Date        Version  SLOC    Files  Functions
2000-07-18  1.1        9671     87        876
2000-10-24  1.2       18864    171       1809
2001-03-02  1.3       33347    385       3332
2001-09-03  1.4       43599    425       4277
2002-07-15  1.5       72315    716       6782
2003-12-18  1.6       97925    906       9453
2006-12-19  1.7      115973   1113      12036
2010-02-08  1.8      126230   1173      12964

There have been many versions since then: 8 major releases and 15 updates (minor releases). The data set ends in July 2012, and the last version officially released at that time is 1.8.4. Table I lists major releases of Ant with some characteristics of official builds as published. It should be noted that these characteristics may show inconsistencies with the data set, since the build process extracts and transforms a subset of the actual repository content.

Ant is arguably one of the most relevant examples of a successful open source project: from 2000 to 2003, the project attracted more than 30 developers whose efforts contributed to nominations for awards and to its recognition as a reliable, extendable and well-supported build standard for both the industry and the open source community.

An interesting aspect of the Ant project is the amount of information available on the lifespan of a project: from its early beginnings in 2000, activity had its climax around 2002-2003 and then decreased steadily. Although the project is actively maintained and still brings regular releases, the list of new features is decreasing with the years. It is still hosted by the Apache Foundation, which is known to have a high interest in software product and process quality.
B. Structure of the data set

The complete data set is a consistent mix of different levels of information corresponding to the application, files and functions artefact types. Hence three different subsets are provided. The application data set has 159 measures composed of 66 metrics and 93 rules extracted from source code, configuration management and communication channels. Each record is an application version. The files data set has 123 measures composed of 30 metrics and 93 rules extracted from source code and configuration management. Each record is a Java file with the .java extension. The functions data set has 117 measures composed of 24 metrics and 93 rules extracted from source code only, of which each record is a Java function with its arguments.

TABLE II: Sizing information for the CSV data sets.

                          App     File     Func
Size of flat files        312KB   232MB    2.4GB
Size of compressed files  68KB    12MB     89MB
Number of records         636     654 696  6 887 473

Each data set is composed of 636 exports of the Ant project, extracted on the Monday of every week since the beginning of the project until the end of July, 2012. The format of the files is plain text CSV and the separator used for all data sets is ! (exclamation mark). Some key sizing information is provided in table II.

C. Retrieval process

Data is retrieved from the project's official repositories:

• Source code is extracted from the Subversion repository's trunk at specific dates. Only files with a .java extension have been analysed. Code metrics are computed using SQuORE [3], and rule violations are extracted from SQuORE, Checkstyle [4], [5] and PMD [6], [7].
• Configuration management metadata is extracted from Subversion's svn log -v command and parsed with custom scripts.
• Communication measures are computed from the mailing lists' archives in mbox format.

To ensure consistency between all artefact measures we rely on SQuORE, a professional tool for software project quality evaluation and business intelligence [3]. It features a parser, which builds a tree of artefacts (application, files, functions) and an engine that associates measures to each node and aggregates data to upper levels. Users can write custom parsers to analyse specific types of data sources, which we did for the analysis of configuration management and communication channels.
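As an illustration of the import step, the following R sketch loads one of the weekly extracts. It is only a sketch: the file name ant_files.csv is hypothetical and the presence of a header line is assumed; the CSV format and the ! separator follow the description above.

files <- read.csv("ant_files.csv", sep = "!", header = TRUE, stringsAsFactors = FALSE)
dim(files)          # one record per .java file; 123 measures for the files subset
head(names(files))  # quick look at the available columns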
III. Metrics

The measures presented here are intended as real-world data: although they have been cross-checked for errors or inconsistencies, no transformation has been applied on values. As an example, the evolution of line-counting metrics shows a huge peak around the beginning of 2002 due to some large-scale configuration management actions which impacted many metrics. We deliberately kept raw data because this was actually the state of the Subversion repository at that time.

Migrations between tools often make it difficult to rely on a continuous measure of the characteristics. Some information of the former repository may not have been included or may have been wrongly migrated, and the meaning of metadata may have heavily differed depending on the tool. An example of such issues lies in erroneous dates of migrated code in the new configuration management system.

Another point is that there may be a huge difference between the source code of an official release and the actual configuration management state at the release time. The build and release process often extracts and packages source code, skipping some files and delivering other (potentially dynamically generated) information and artefacts. As an example, the common metrics shown in table I for Ant official releases cannot be confirmed by the repository information available in the data set.

The set of metrics for each artefact type is shown in tables III, IV and V. Some metrics are available only at specific levels (e.g. the distinct number of operands in a function), while others are available on more than one level (e.g. SCM commits). Please note that the relationship between levels varies: summing line counts on files gives the line count at the application level, which is not true for commits, since a commit often includes several files. Practitioners should check ecological inference [8] to reduce bias when playing with the different levels of information.

A. Source code

Most source code measures are borrowed from the literature: artefact- and line-counting metrics have their usual definition (a complete definition of the metrics is available on the Maisqual web site: maisqual.squoring.com/wiki/index.php/Data Sets), vg is from McCabe [9], and dopd, dopt, topd, topt are from Halstead [10]. ladd, lmod and lrem are differential measures that respectively count the number of lines added, modified and removed since the last analysis.

Some of the metrics considered have computational relationships among themselves. They are:

sloc = eloc + brac
lc   = sloc + blan + cloc - mloc
     = (eloc + brac) + blan + cloc - mloc
comr = ((cloc + mloc) * 100) / (eloc + cloc)
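These relationships can be checked directly on a loaded subset, as in the short sketch below. The data frame name and the lower-case column names are assumptions; comr is compared with a tolerance since the stored values may be rounded.

# files: a file-level data frame loaded as shown above.
stopifnot(all(files$sloc == files$eloc + files$brac))
stopifnot(all(files$lc   == files$sloc + files$blan + files$cloc - files$mloc))

# Files with eloc + cloc == 0 yield NaN here and can be ignored for the check.
comr_check <- (files$cloc + files$mloc) * 100 / (files$eloc + files$cloc)
summary(abs(files$comr - comr_check))   # should be close to zero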

TABLE III: Source code metrics.

Metric name                 Mnemo   App  File  Func
Blank lines                 blan     X    X     X
Braces lines                brac     X    X     X
Control flow tokens         cft      X    X     X
Number of classes           clas     X    X     X
Comment lines of code       cloc     X    X     X
Comment rate                comr     X    X     X
Depth of Inheritance Tree   ditm          X
Distinct operands           dopd                X
Distinct operators          dopt                X
Effective lines of code     eloc     X    X     X
Number of files             file     X
Number of functions         func     X    X
Lines added                 ladd     X    X     X
Line count                  lc       X    X     X
Lines modified              lmod     X    X     X
Lines removed               lrem     X    X     X
Mixed lines of code         mloc     X    X     X
Non conformities            ncc      X    X     X
Maximum nesting             nest                X
Number of parameters        nop                 X
Number of paths             npat                X
Acquired practices          rokr     X    X     X
Source lines of code        sloc     X    X     X
Number of statements        stat     X    X     X
Number of operands          topd                X
Number of operators         topt                X
Cyclomatic number           vg       X    X     X

B. Configuration management

All modern configuration management tools propose a log retrieval facility to dig into a file's history. In order to work on several different projects, we needed to go one step further and define a limited set of basic information that we could extract from all major tools (e.g. CVS, Subversion, Git): number of commits, committers, and files committed. The scm_fixes measure counts the number of commits that have one of the fix, issue, problem or error keywords in their associated commit message.

Measures are proposed in four time frames, by counting events that occurred during the last week (scm_*_1w), the last month (scm_*_1m), the last three months (scm_*_3m), and since the beginning of the project (scm_*_total). These enable users to better grasp recent variations in the measures, and give different perspectives on the project's evolution. Configuration management metrics are listed in table IV.

TABLE IV: SCM metrics.

Metric name           Mnemo             App  File  Func
SCM Fixes             scm_fixes          X    X
SCM Commits           scm_commits        X    X
SCM Committers        scm_committers     X    X
SCM Committed files   scm_commit_files   X

C. Communication channels

Open-source projects usually have at least two mailing lists: one for technical questions about the product's development itself (i.e. the developer mailing list) and another one for questions relative to the product's usage (i.e. the user mailing list). Historical data was extracted from old mbox archives, all of which are available on the project web site.

TABLE V: Communication metrics.

Metric name                            App  File  Func
Number of authors in developer ML       X
Median response time in developer ML    X
Volume of mails in developer ML         X
Number of threads in developer ML       X
Number of authors in user ML            X
Median response time in user ML         X
Volume of mails in user ML              X
Number of threads in user ML            X

Available measures are the number of distinct authors, the volume of mails exchanged on the mailing list, the number of different threads, and the median time between a question and its first answer on the considered time frame. These measures are proposed in three time frames spanning the last week (com_*_1w), the last month (com_*_1m) and the last three months (com_*_3m) of activity. Communication metrics are listed in table V.
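The following R sketch gives a rough idea of how such measures can be derived from an mbox archive. It is deliberately naive and is not the parser used for the data set: the file name dev.mbox is hypothetical, and headers are matched with simple regular expressions.

lines <- readLines("dev.mbox", warn = FALSE)

n_mails   <- length(grep("^From ", lines))      # mbox message separators
authors   <- sub("^From: *", "", grep("^From: ", lines, value = TRUE))
n_authors <- length(unique(authors))

# Messages carrying an In-Reply-To header are replies; the others open threads.
n_replies <- length(grep("^In-Reply-To:", lines, ignore.case = TRUE))
n_threads <- n_mails - n_replies

c(mails = n_mails, authors = n_authors, threads = n_threads)

Computing the response ratio and the median time to first answer additionally requires pairing Message-ID, In-Reply-To and Date headers, which is omitted from this sketch.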
IV. Rules

Rules are associated to coding conventions and practices. They deliver substantial information on the local customs in use in the project, and are usually linked to specific characteristics of quality. In the data sets, conformity to rules is displayed as a number of violations of the rule (non-conformity count, or ncc) for the given artefact.

Many of these rules are linked to coding conventions published by standardisation organisms like CERT (Carnegie Mellon's secure coding instance), which usually give a ranking on the remediation cost and the severity of the rule. There are also language-specific coding conventions, as is the case with Sun's coding conventions for the Java programming language [11].

A. SQuORE rules

We identified 21 rules from SQuORE 2013-C, targeting the most common and harmful coding errors. Examples of checked rules include fall-through in switch cases, missing default, backwards goto, and assignment in condition. The following families of rules are defined: fault tolerance (2 rules), analysability (7 rules), maturity (1 rule), stability (10 rules), changeability (12 rules) and testability (13 rules). The full rule set is described on the Maisqual project wiki (see http://maisqual.squoring.com/wiki/index.php/Rules).

B. Checkstyle rules

We identified 39 rules from the Checkstyle 5.6 rule set, corresponding to useful practices generally well adopted by the community. The quality attributes impacted by these rules are: analysability (23 rules), reusability (11 rules), reliability (5 rules), efficiency (5 rules), testability (3 rules), robustness (2 rules) and portability (1 rule). All rules are described on the Checkstyle web site (see http://checkstyle.sourceforge.net/config.html).

C. PMD rules

We selected 58 rules from the PMD 5.0.5 rule set. These are related to the following quality attributes: analysability (26 rules), maturity (31 rules), testability (13 rules), changeability (5 rules), and efficiency (5 rules). The full rule set is documented on the PMD web site (see http://pmd.sourceforge.net/pmd-5.0.5/rules/).

V. Possible uses of the data set

The introduction of data mining techniques in software engineering is quite recent, and there is still a vast field of possibilities to explore. Data may be analysed from an evolutionary or static perspective by considering either a time range or a single version. Since the different levels of data (application, file and function) are altogether consistent, one may as well consider studying relationships between them, or even the evolution of these relationships with time.

Communication and configuration management metrics give precious insights into the community's activity and process-related behaviour in development. Time-related measures (ladd, lmod, lrem, *_1w, *_1m, *_3m) are useful to grasp the dynamics of the project's evolution. Since rule violations are representative of coding practices, one may consider the links between the development practices and their impact on attributes of software.

The Maisqual project itself investigates the application of statistical techniques to software engineering data and relies on this data set series. Examples of usages are provided on the Maisqual web site, including evolution of metrics with time (e.g. time series analysis), analysis of coding rule violations, and basic exploration of a single version of the project (e.g. clustering of artefacts, principal components analysis).

We recommend the use of literate analysis tools like Sweave [12] and Knitr [13], which allow practitioners to embed R code chunks into LaTeX documents. These dynamic documents can then be safely applied to many sets of data with a similar structure to easily reproduce the results of an analysis on a large amount of data. Another added value of literate data analysis is to situate results in a semantic context, thus helping practitioners and end-users to understand both the computation and its results.

VI. Summary and future work

The contribution of the data set presented here is twofold: firstly, the long time range (12 years) spans from the very beginning of the project to an apogee of activity and to a stable state of maturity. Secondly, the introduction of unusual metrics (rule violations, configuration management, mailing lists) at different levels opens new perspectives on the evolution of software, the dynamics of its community, and the coding and configuration management practices.

Next versions of this data set will include new metrics, gathered on new sources (e.g. bug tracking system), with new data types (e.g. boolean, categorical) to foster usage of different algorithms working on non-numerical data. New projects will also be added from the open-source community (GCC, JMeter).
References

[1] N. Fenton, "Software Measurement: a Necessary Scientific Basis," IEEE Transactions on Software Engineering, vol. 20, no. 3, pp. 199–206, Mar. 1994.
[2] C. Kaner and W. P. Bond, "Software engineering metrics: What do they measure and how do we know?" in 10th International Software Metrics Symposium, METRICS 2004, 2004, pp. 1–12.
[3] B. Baldassari, "SQuORE: A new approach to software project quality measurement," in International Conference on Software & Systems Engineering and their Applications, Paris, France, 2012.
[4] P. Louridas, "Static Code Analysis," IEEE Software, vol. 23, no. 4, pp. 58–61, 2006.
[5] O. Burn, "Checkstyle," 2001. [Online]. Available: http://checkstyle.sf.net
[6] N. Ayewah, W. Pugh, D. Hovemeyer, J. D. Morgenthaler, and J. Penix, "Using static analysis to find bugs," IEEE Software, vol. 25, no. 5, pp. 22–29, 2008.
[7] D. Dixon-Peugh, "PMD," 2003. [Online]. Available: http://pmd.sf.net
[8] D. Posnett, V. Filkov, and P. Devanbu, "Ecological inference in empirical software engineering," in Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society, Nov. 2011, pp. 362–371.
[9] T. McCabe, "A complexity measure," IEEE Transactions on Software Engineering, no. 4, pp. 308–320, 1976.
[10] M. H. Halstead, Elements of Software Science. Elsevier Science Inc., 1977.
[11] Sun, "Code Conventions for the Java Programming Language," Tech. Rep., 1999.
[12] F. Leisch, "Sweave. Dynamic generation of statistical reports using literate data analysis," SFB Adaptive Information Systems and Modelling in Economics and Management Science, WU Vienna University of Economics and Business, Vienna, Tech. Rep. 69, 2002.
[13] Y. Xie, "Knitr: A general-purpose package for dynamic report generation in R," Tech. Rep., 2013.

Appendix B

Data sets

B.1 Apache Ant

The following table shows the releases considered for the Apache Ant project, and their date of publication.

Version  Date         Version  Date         Version  Date
1.1      2000-07-18   1.5.4    2003-08-12   1.7.1    2008-06-27
1.2      2000-10-24   1.6.0    2003-12-18   1.8.0    2010-02-08
1.3      2001-03-02   1.6.1    2004-02-12   1.8.1    2010-05-07
1.4      2001-09-03   1.6.2    2004-07-16   1.8.2    2010-12-27
1.5.0    2002-07-15   1.6.3    2005-04-28   1.8.3    2012-02-29
1.5.1    2002-10-03   1.6.4    2005-05-19   1.8.4    2012-05-23
1.5.2    2003-03-03   1.6.5    2005-06-02
1.5.3    2003-04-09   1.7.0    2006-12-19

B.2 Apache httpd

The following table shows the releases considered for the Apache httpd project, and their date of publication.

Version Date Version Date Version Date 2.0.35 2002-04-05 2.0.46 2003-05-28 2.0.54 2005-04-11 2.0.36 2002-05-01 2.0.47 2003-07-09 2.0.55 2005-10-10 2.0.39 2002-06-18 2.0.48 2003-10-24 2.0.58 2006-04-24 2.0.40 2002-08-09 2.0.49 2004-03-18 2.0.59 2006-07-27 2.0.42 2002-09-19 2.0.50 2004-06-29 2.0.61 2007-09-04 2.0.43 2002-10-03 2.0.51 2004-09-15 2.0.63 2008-12-26 2.0.44 2003-01-18 2.0.52 2004-09-23 2.0.64 2010-10-14 2.0.45 2003-03-31 2.0.53 2005-02-05 2.0.65 2013-06-28

255 Version Date Version Date Version Date 2.2.0 2005-11-29 2.2.10 2008-10-07 2.2.17 2010-10-14 2.2.2 2006-04-22 2.2.11 2008-12-06 2.2.18 2011-05-08 2.2.3 2006-07-27 2.2.12 2009-07-2 2.2.19 2011-05-20 2.2.4 2007-01-06 2.2.13 2009-08-06 2.2.20 2011-08-30 2.2.6 2007-09-04 2.2.14 2009-09-24 2.2.21 2011-09-09 2.2.8 2008-01-10 2.2.15 2010-03-02 2.2.22 2012-01-25 2.2.9 2008-06-10 2.2.16 2010-07-21

B.3 Apache JMeter

The following table shows the releases considered for the Apache JMeter project, and their date of publication.

Version Date Version Date 1.8.1 2003-02-14 2.3.2 2008-06-10 1.9.1 2003-08-07 2.3.3 2009-05-24 2.0.0 2004-04-04 2.3.4 2009-06-21 2.0.1 2004-05-20 2.4 2010-07-14 2.0.2 2004-11-07 2.5 2011-08-17 2.0.3 2005-03-22 2.5.1 2011-10-03 2.1.0 2005-08-19 2.6 2012-02-01 2.1.1 2005-10-02 2.7 2012-05-27 2.2 2006-06-13 2.8 2012-10-06 2.3 2007-07-10 2.9 2013-01-28 2.3.1 2007-11-28 2.10 2013-10-21

B.4 Apache Subversion

The following table shows the releases considered for the Apache Subversion project, and their date of publication.

256 Version Date Version Date Version Date 1.0.0 2004-02-23 1.4.4 2007-05-29 1.6.16 2011-03-03 1.0.1 2004-03-12 1.4.5 2007-08-18 1.6.17 2011-06-02 1.0.2 2004-04-19 1.4.6 2007-12-17 1.6.18 2012-03-23 1.0.3 2004-05-19 1.5.0 2008-06-19 1.6.19 2012-09-12 1.0.4 2004-05-22 1.5.1 2008-07-24 1.6.20 2012-12-27 1.0.5 2004-06-10 1.5.2 2008-08-28 1.6.21 2013-03-29 1.0.6 2004-07-20 1.5.3 2008-10-09 1.6.23 2013-05-23 1.0.7 2004-09-18 1.5.4 2008-10-22 1.7.0 2011-09-09 1.0.8 2004-09-23 1.5.5 2008-12-19 1.7.1 2011-10-20 1.0.9 2004-11-14 1.5.6 2009-02-25 1.7.2 2011-11-29 1.1.0 2004-09-30 1.5.7 2009-08-06 1.7.3 2012-02-10 1.1.1 2004-11-23 1.5.9 2010-12-02 1.7.4 2012-03-02 1.1.2 2004-12-21 1.6.0 2009-03-19 1.7.5 2012-05-10 1.1.3 2005-01-15 1.6.1 2009-04-09 1.7.6 2012-08-28 1.1.4 2005-04-02 1.6.2 2009-05-07 1.7.7 2012-10-04 1.2.0 2005-05-23 1.6.3 2009-06-18 1.7.8 2012-12-10 1.2.1 2005-07-06 1.6.4 2009-08-06 1.7.9 2013-03-29 1.2.3 2005-08-25 1.6.5 2009-08-20 1.7.10 2013-05-23 1.3.0 2006-01-14 1.6.6 2009-10-22 1.8.0 2013-06-13 1.3.1 2006-04-03 1.6.9 2010-01-20 1.8.1 2013-07-24 1.3.2 2006-05-23 1.6.11 2010-04-15 1.8.3 2013-08-30 1.4.0 2006-08-22 1.6.12 2010-06-18 1.8.4 2013-10-22 1.4.2 2006-11-02 1.6.13 2010-09-29 1.8.5 2013-11-25 1.4.3 2007-01-17 1.6.15 2010-11-23

B.5 Versions data sets

The following table lists the projects' versions generated for the data sets. OO metrics are clas (number of classes defined in the artefact) and ditm (Depth of Inheritance Tree). Diff metrics are ladd, lmod and lrem (number of lines added, modified and removed since the last analysis). Time metrics for SCM are scm_*_1w, scm_*_1m, and scm_*_3m. Total metrics for SCM are scm_commits_total, scm_committers_total, scm_commits_files_total and scm_fixes_total. Time metrics for Comm. are com_*_1w, com_*_1m, and com_*_3m.

Column groups: Code (Common, OO, Diff) | SCM (Time, Total) | Comm. (Time) | Rules (Squore, PMD, Checkstyle)

Project          Lang.  Version     Common OO Diff  Time Total  Time  Squore PMD Checkstyle
Epsilon          Java   2013-03-05  XX X XXX
Epsilon          Java   2013-09-05  XX X XXX
Subversion       C      2010-01-04  X X X
Subversion       C      2010-07-05  X X X
Topcased Gendoc  Java   4.0.0       XX X XXX
Topcased Gendoc  Java   4.2.0       XX X XXX
Topcased Gendoc  Java   4.3.0       XX X XXX
Topcased Gendoc  Java   5.0.0       XX X XXX
Topcased MM      Java   1.0.0       XX X XXX
Topcased MM      Java   2.0.0       XX X XXX
Topcased MM      Java   4.0.0       XX X XXX
Topcased MM      Java   4.2.0       XX X XXX
Topcased gPM     Java   1.2.7       XX X XXX
Topcased gPM     Java   1.3.0       XX X XXX

Appendix C

Knitr documents

C.1 Squore Lab Outliers

This Squore Lab knitr document was written to investigate various statistical methods for outlier detection and their application to todo lists. See section 8.3.1 for more information. We first show an extract of the knitr document featuring graphs and tables of boxplot outliers for combinations of metrics. Some pages extracted from the generated document follow.

\paragraph{3 metrics combination}

In the following plot, outliers on SLOC are plotted in dark green, outliers on VG are plotted in blue, and outliers on NCC are plotted in purple. Intersection between all three variables is plotted in light red (red3) and union is plotted in dark red (red4).

<<>>=
outs_sloc <- which(project_data[,c("SLOC")] %in% boxplot.stats(project_data[,c("SLOC")])$out)
outs_vg   <- which(project_data[,c("VG")]   %in% boxplot.stats(project_data[,c("VG")])$out)
outs_ncc  <- which(project_data[,c("NCC")]  %in% boxplot.stats(project_data[,c("NCC")])$out)

outs_i <- intersect(outs_sloc, outs_vg)
outs_i <- intersect(outs_i, outs_ncc)
outs_u <- union(outs_sloc, outs_vg)
outs_u <- union(outs_u, outs_ncc)       # union of the three sets of outliers

pchs <- rep(".", nrow(project_data))
pchs[outs_sloc] <- "o"
pchs[outs_vg]   <- "o"
pchs[outs_ncc]  <- "o"
cols <- rep("black", nrow(project_data))
cols[outs_sloc] <- "darkgreen"
cols[outs_vg]   <- "blue"
cols[outs_ncc]  <- "purple"
cols[outs_u]    <- "red4"               # union in dark red
cols[outs_i]    <- "red3"               # intersection in light red, set last so it is not overwritten

jpeg(file="figures/file_outliers_box_m3.jpg", width=2000, height=2000,
     quality=100, pointsize=12, res=100)
pairs(project_data[,c("COMR", "SLOC", "VG", "NCC")], pch=pchs, col=cols)
dev.off()
@

\begin{figure}[hbt]

\begin{center}
\includegraphics[width=\linewidth]{file_outliers_box_m3.jpg}
\end{center}
\caption{3-metrics combination}
\label{fig:file_outliers_box_m3}
\end{figure}

Table \ref{tab:file_outliers_box_sloc_vg_ncc} gives some of the artefacts that are outliers on SLOC, VG and NCC:

\vspace{10pt}
\begin{table}[hbt]
\begin{center}
<<>>=
a <- project[outs_i, 3]
a <- as.data.frame(a)
names(a) <- c("File name")
print( xtable(a, align="|r|l|"),
       tabular.environment="longtable",
       floating=FALSE,
       size="\\scriptsize" )
rm(a)
@
\end{center}
\caption{Artefacts with SLOC, VG and NCC outliers -- \texttt{boxplot.stats} method}
\label{tab:file_outliers_box_sloc_vg_ncc}
\end{table}
\vspace{10pt}

The following extracts show:

The list of figures and tables generated in the document.
The quick statistics summary at the beginning of the analysis.
Scatterplots of some metrics to visually grasp their shape.
The table of metrics sorted according to their number of outlying values.
The output of a multivariate hierarchical clustering on file artefacts using various linkage methods, with a dendrogram and 2-D and 3-D plots.
Another dendrogram showing multivariate hierarchical clustering with the Manhattan distance and single linkage.

List of Figures

1  Data retrieval process (p. 3)
2  Univariate outliers detection with boxplot.stats (p. 9)
3  Pairwise univariate outliers combination: SLOC and VG (p. 10)
4  Pairwise univariate outliers combination: SLOC and NCC (p. 11)
5  3-metrics combination (p. 14)
6  Multivariate outliers detection with boxplot.stats (p. 15)
7  Number of outliers with boxplot.stats (p. 17)
8  Univariate outliers detection with LOF (p. 32)
9  Univariate outliers detection with LOF (p. 33)
10 Outliers detection: LOF (p. 34)
11 Multivariate LOF (p. 35)
12 Euclidean distance, Ward method (p. 36)
13 Euclidean distance, Single method (p. 37)
14 Euclidean distance, Median method (p. 38)
15 Euclidean distance, average method (p. 39)
16 Euclidean distance, centroid method (p. 40)
17 Euclidean distance, McQuitty method (p. 41)
18 Euclidean distance, complete method (p. 42)
19 Manhattan distance, Ward method (p. 50)
20 Manhattan distance, Single method (p. 51)
21 Manhattan distance, Median method (p. 52)
22 Manhattan distance, average method (p. 53)
23 Manhattan distance, centroid method (p. 54)
24 Manhattan distance, McQuitty method (p. 55)
25 Manhattan distance, complete method (p. 56)

List of Tables

1  File metrics exploration (p. 8)
3  Artefacts with SLOC and VG outliers – boxplot.stats method (p. 12)
5  Artefacts with SLOC and NCC outliers – boxplot.stats method (p. 13)
7  Artefacts with SLOC, VG and NCC outliers – boxplot.stats method (p. 16)
*Todo list

2 Metric Min Max Mean Var Modes NAs BLAN 0 2844 53.612 15598.159 344 0 CFT 0 5825 164.872 526632.995 402 0 CLAS 0 200 2.467 30.326 46 0 CLOC 0 3808 122.48 54217.909 548 0 CLOR 0 100 15.915 1249.654 238 0 COMR 0 100 46.453 422.383 2162 0 ELOC 0 8530 269.594 678703.431 677 0 FUNC 0 747 21.856 3983.681 199 0 GOTO 0 0 0 0 1 0 LC 6 12382 529.108 1742786.111 993 0 NCC 0 9992 263.137 469418.279 760 0 NCC ANA 0 1589 15.294 4813.389 142 0 NCC CHAN 0 1529 24.439 4409.003 226 0 NCC REL 0 732 23.57 4200.059 212 0 NCC REUS 0 733 28.175 4502.406 240 0 NCC STAB 0 399 2.09 109.548 57 0 NCC TEST 0 2608 16.584 7171.917 141 0 NEST 0 0 0 0 1 0 NOP 0 0 0 0 1 0 PATH 0 0 0 0 1 0 RKO 0 44 11.511 43.458 37 0 RKO ANA 0 3 0.842 0.62 4 0 RKO CHAN 0 8 2.129 3.317 9 0 RKO REL 0 4 1.208 0.55 5 0 RKO REUS 0 8 2.312 0.773 9 0 RKO STAB 0 6 0.373 0.535 7 0 RKO TEST 0 8 2.1 4.913 9 0 ROKR 69.231 100 91.915 20.946 52 0 ROKR ANA 40 100 83.156 247.948 4 0 ROKR CHAN 38.462 100 83.425 194.418 11 0 ROKR REL 84.615 100 95.353 8.132 5 0 ROKR REUS 80 100 94.219 4.83 9 0 ROKR STAB 33.333 100 95.852 66.09 7 0 ROKR TEST 50 100 86.723 190.521 10 0 SCM COMMITS 1M 0 5 0.141 0.216 6 0 SCM COMMITS 1W 0 3 0.065 0.074 4 0 SCM COMMITS 3M 0 14 1.588 2.273 13 0 SCM COMMITS TOTAL 0 62 8.153 29.418 23 0 SCM COMMITTERS 1M 0 3 0.107 0.1 4 0 SCM COMMITTERS 1W 0 2 0.06 0.06 3 0 SCM COMMITTERS 3M 0 4 1.185 0.506 5 0 SCM COMMITTERS TOTAL 0 6 2.156 0.896 7 0 SCM FIXES 1M 0 5 0.125 0.192 6 0 SCM FIXES 1W 0 3 0.054 0.056 4 0 SCM FIXES 3M 0 13 0.632 1.794 11 0 SCM FIXES TOTAL 0 52 3.428 18.286 17 0 SLOC 0 10093 355.05 1375443.729 753 0 STAT 0 8763 263.778 1244857.345 526 0 VG 0 2925 96.095 134581.42 345 0

Table 1: File metrics exploration

3 metrics combination
In the following plot, outliers on SLOC are plotted in dark green, outliers on VG are plotted in blue, and outliers on NCC are plotted in purple. Intersection between all three variables is plotted in light red (red3) and union is plotted in dark red (red4).

Figure 5: 3-metrics combination

Table ?? gives some of the artefacts that are outliers both on SLOC and NCC:

All-metrics outliers combination
The global union and intersection of outliers collected on all individual metrics are plotted in figure 6. The union (2718 elements) is plotted in dark green crosses, while the intersection (0 elements) is plotted with red circles. It is a possibility that intersection is empty, considering

2.3 Univariate sorting
The following table 7 gives the number of outliers for each variable with the boxplot method:

Count of outliers Percent of whole set NCC STAB 1059 5.41 VG 942 4.81 CLOR 933 4.76 STAT 849 4.33 NCC CHAN 845 4.31 CFT 833 4.25 SLOC 794 4.05 NCC 736 3.76 ELOC 731 3.73 NCC ANA 718 3.67 BLAN 710 3.62 FUNC 708 3.61 NCC REUS 639 3.26 NCC REL 604 3.08 LC 600 3.06 SCM COMMITS 1M 523 2.67 SCM COMMITTERS 1M 523 2.67 RKO STAB 509 2.60 ROKR STAB 509 2.60 CLOC 497 2.54 NCC TEST 496 2.53 CLAS 490 2.50 ROKR CHAN 487 2.49 SCM FIXES 1M 467 2.38 SCM COMMITS 3M 440 2.25 SCM FIXES 3M 352 1.80 SCM FIXES TOTAL 333 1.70 SCM COMMITTERS TOTAL 298 1.52 SCM COMMITS 1W 291 1.49 SCM COMMITTERS 1W 291 1.49 SCM FIXES 1W 256 1.31 RKO REUS 231 1.18 ROKR REUS 231 1.18 RKO TEST 204 1.04 ROKR TEST 204 1.04 RKO ANA 58 0.30 ROKR ANA 58 0.30 SCM COMMITS TOTAL 52 0.27 RKO CHAN 39 0.20 RKO 16 0.08 ROKR 16 0.08 RKO REL 5 0.03 ROKR REL 5 0.03 SCM COMMITTERS 3M 5 0.03 COMR 1 0.01

Figure 7: Number of outliers with boxplot.stats
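Counts of this kind can be obtained with the same boxplot.stats approach used earlier in the document; the following one-liner is only a sketch and assumes the metric columns are numeric.

outs_per_metric <- sapply(project_data, function(x) length(boxplot.stats(x)$out))
sort(outs_per_metric, decreasing = TRUE)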

The next table shows, for a few files, the metrics that make them outliers.

2.5 Hierarchical clustering – Euclidean distance
Hierarchical clustering relies on distance to find similarities between points and to identify clusters. In figure 25 the Euclidean distance is used to find similarities.

Figure 12: Euclidean distance, Ward method

Number of items in each cluster:

Method used  Cl1   Cl2   Cl3  Cl4  Cl5  Cl6  Cl7  Total
Ward         2260  1579  864  83   139  22   65   5012
Average      4896  1     81   23   1    2    8    5012
Single       4905  1     80   22   1    1    2    5012
Complete     4697  81    182  23   1    2    26   5012
McQuitty     4879  1     80   23   1    2    26   5012
Median       4905  1     80   22   1    1    2    5012
Centroid     4896  1     81   23   1    2    8    5012
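A minimal sketch of how such a multivariate hierarchical clustering and the per-cluster counts can be produced is given below. The selected metrics, the scaling step and the cut into seven clusters are assumptions for illustration; the ward.D2 method of recent R versions replaces the older ward option.

metrics <- project_data[, c("SLOC", "VG", "NCC", "COMR")]   # hypothetical selection
d  <- dist(scale(metrics), method = "euclidean")
hc <- hclust(d, method = "ward.D2")
cl <- cutree(hc, k = 7)
table(cl)                  # number of items in each cluster
plot(hc, labels = FALSE)   # dendrogram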

Figure 20: Manhattan distance, Single method

C.2 Squore Lab Clustering

In the Clustering Squore Lab we needed to investigate clustering methods for automatic categorisation of artefacts. To achieve this we first wrote a generic knitr document that applied various clustering methods to file artefacts, trying to find natural clusters in measures. This generic document was then refined into a more specific version to provide consultants with a small and fast report to use during the calibration phase of a quality model. In the document presented below, each metric from the data set is individually treated to identify the natural threshold values that can be used to configure the Squore quality models. This document is entirely automated and relies on some heavy computations to mix R commands with LaTeX formatting directives. An extract of the Knitr source code is provided below.
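For context, the extract assumes that a k-means clustering has already been run for the metric under consideration (column i of project_data); a hypothetical version of that preliminary step could look as follows, with k = 7 matching the seven ranges A to G.

set.seed(42)   # assumption: fixed seed for reproducibility
project_data_kmeans_7 <- kmeans(project_data[, i], centers = 7, nstart = 25)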

ranges <- as.data.frame( array( c(1,1,2,2,3,3,4,4,5,5,6,6,7,7), dim=c(2,7) ) )
for (j in 1:7) {
  myranges <- range(project_data[project_data_kmeans_7$cluster == j, i])
  ranges[1,j] <- myranges[1]
  ranges[2,j] <- myranges[2]
}
ranges <- ranges[,order(ranges[1,])]
names(ranges) <- c("A", "B", "C", "D", "E", "F", "G")

print( xtable( ranges, digits=0 ), include.rownames=FALSE, floating=FALSE )

cat( ’\\end{center} \\vspace{5pt}

The following scale is proposed:

\\begin{verbatim}

2.14 VG
Automatic clustering proposes the following repartition of artefacts:

Cluster   1    2    3   4   5   6    7
Count    92  518  172   9  47  25  313

The associated ranges are the following:

        A   B   C   D    E    F    G
Min     0   9  22  42   76  126  223
Max     8  21  41  73  119  186  379

The following scale is proposed:

2.15 SCM_COMMITS_TOTAL
Automatic clustering proposes the following repartition of artefacts:

Cluster    1    2   3   4    5   6   7
Count    279  186   5  73  573  46  14

The associated ranges are the following:

        A    B    C    D    E    F    G
Min     0   12   28   48   78  124  212
Max    10   26   46   76  120  186  302

The following scale is proposed:

C.3 Squore Lab Correlations

In the Correlations Squore Lab, regression methods are applied to various sets of metrics and rules to unveil inter-relationships between them. See chapter 10, page 167, for more information on this work. The following pages have been extracted from the JMeter analysis document:

The table of the pairwise linear regression on metrics.
The table of the pairwise polynomial regression on metrics.
Three pages of results for the pairwise linear regression of metrics against rules; they represent respectively rules from Squore, PMD and Checkstyle.
The table of the pairwise polynomial regression of metrics against rules (Squore only).
A minimal sketch of the kind of pairwise regression behind these tables is given right after this list.
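The sketch below illustrates the computation reported in these tables: a pairwise linear regression between metrics, keeping the adjusted R^2 of each fit. It is an illustrative sketch, not the code used to generate the documents, and the metric selection is an assumption.

metrics <- c("SLOC", "VG", "NCC", "COMR")
adj_r2  <- outer(metrics, metrics, Vectorize(function(y, x)
  summary(lm(reformulate(x, y), data = project_data))$adj.r.squared))
dimnames(adj_r2) <- list(metrics, metrics)
round(adj_r2, 2)

The polynomial variant replaces the single predictor with a term such as poly(x, 2), and the metrics-against-rules tables use the rule violation counts as predictors.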

Table 3: Linear regression: adjusted R^2 for file metrics and clustering of results (pairwise adjusted R^2 between the file metrics BLAN, BRAC, CFT, CLAS, CLOC, COMR, ELOC, FUNC, LC, MLOC, NCC, ROKR, SCM COMMITS TOTAL, SCM COMMITTERS TOTAL, SCM FIXES TOTAL, SLOC, STAT and VG).

Table 7: Polynomial regression: adjusted R^2 for file metrics and clustering of results (same set of file metrics as table 3).

Figure 4: Linear regression of SQuORE rules: lm(metrics ~ rules).

Figure 5: Linear regression of PMD rules: lm(metrics ~ rules).

Figure 6: Linear regression of Checkstyle rules: lm(metrics ~ rules).

Figure 13: Polynomial regression of SQuORE rules: plm(metrics ~ rules).

Appendix D

Code samples

D.1 Ant > Javadoc.java > execute()

The execute() function is presented in section 6.1.2 (page 107) and figure 6.0 (page 111) of chapter 5 to illustrate some metrics on real code.

1 % style=JavaInputStyle] 2 public void execute() throws BuildException { 3 if ("javadoc2".equals(getTaskType())) { 4 log("!! javadoc2 is deprecated. Use javadoc instead. !!"); 5 } 6 7 Vector packagesToDoc = new Vector(); 8 Path sourceDirs = new Path(getProject()); 9 10 if (packageList != null && sourcePath == null) { 11 String msg = "sourcePath attribute must be set when " 12 +"specifying packagelist."; 13 throw new BuildException(msg); 14 } 15 16 if (sourcePath != null) { 17 sourceDirs.addExisting(sourcePath); 18 } 19 20 parsePackages(packagesToDoc, sourceDirs); 21 22 if (packagesToDoc.size() != 0 && sourceDirs.size() == 0) { 23 String msg = "sourcePath attribute must be set when " 24 +"specifying package names."; 25 throw new BuildException(msg); 26 } 27 28 Vector sourceFilesToDoc = (Vector) sourceFiles.clone(); 29 addFileSets(sourceFilesToDoc); 30 31 if (packageList == null && packagesToDoc.size() == 0 32 && sourceFilesToDoc.size() == 0) { 33 throw new BuildException("No source files and no packages have " 34 +"beenspecified."); 35 } 36 37 log("Generating Javadoc", Project.MSG_INFO); 38 39 Commandline toExecute = (Commandline) cmd.clone(); 40 toExecute.setExecutable(JavaEnvUtils.getJdkExecutable("javadoc")); 41 42 // −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− general javadoc arguments 43 if (doctitle != null) { 44 toExecute.createArgument().setValue("− doctitle "); 45 toExecute.createArgument().setValue(expand(doctitle.getText())); 46 } 47 i f ( header != n u l l ) { 48 toExecute.createArgument().setValue("− header " ) ; 49 toExecute.createArgument().setValue(expand(header.getText())); 50 } 51 i f ( f o o t e r != n u l l ) { 52 toExecute.createArgument().setValue("− f o o t e r " ) ; 53 toExecute.createArgument().setValue(expand(footer.getText()));

}
if (bottom != null) {
    toExecute.createArgument().setValue("-bottom");
    toExecute.createArgument().setValue(expand(bottom.getText()));
}

if (classpath == null) {
    classpath = (new Path(getProject())).concatSystemClasspath("last");
} else {
    classpath = classpath.concatSystemClasspath("ignore");
}

if (!javadoc1) {
    if (classpath.size() > 0) {
        toExecute.createArgument().setValue("-classpath");
        toExecute.createArgument().setPath(classpath);
    }
    if (sourceDirs.size() > 0) {
        toExecute.createArgument().setValue("-sourcepath");
        toExecute.createArgument().setPath(sourceDirs);
    }
} else {
    sourceDirs.append(classpath);
    if (sourceDirs.size() > 0) {
        toExecute.createArgument().setValue("-classpath");
        toExecute.createArgument().setPath(sourceDirs);
    }
}

if (version && doclet == null) {
    toExecute.createArgument().setValue("-version");
}
if (author && doclet == null) {
    toExecute.createArgument().setValue("-author");
}

if (javadoc1 || doclet == null) {
    if (destDir == null) {
        String msg = "destDir attribute must be set!";
        throw new BuildException(msg);
    }
}

// ---------------------------- javadoc2 arguments for default doclet

if (!javadoc1) {
    if (doclet != null) {
        if (doclet.getName() == null) {
            throw new BuildException("The doclet name must be "
                + "specified.", getLocation());
        } else {
            toExecute.createArgument().setValue("-doclet");
            toExecute.createArgument().setValue(doclet.getName());
            if (doclet.getPath() != null) {
                Path docletPath
                    = doclet.getPath().concatSystemClasspath("ignore");
                if (docletPath.size() != 0) {
                    toExecute.createArgument().setValue("-docletpath");
                    toExecute.createArgument().setPath(docletPath);
                }
            }
            for (Enumeration e = doclet.getParams();
                 e.hasMoreElements();) {
                DocletParam param = (DocletParam) e.nextElement();
                if (param.getName() == null) {
                    throw new BuildException("Doclet parameters must "
                        + "have a name");
                }

                toExecute.createArgument().setValue(param.getName());
                if (param.getValue() != null) {
                    toExecute.createArgument()
                        .setValue(param.getValue());
                }
            }
        }
    }
    if (bootclasspath != null && bootclasspath.size() > 0) {
        toExecute.createArgument().setValue("-bootclasspath");
        toExecute.createArgument().setPath(bootclasspath);
    }

    // add the links arguments
    if (links.size() != 0) {
        for (Enumeration e = links.elements(); e.hasMoreElements();) {
            LinkArgument la = (LinkArgument) e.nextElement();

            if (la.getHref() == null || la.getHref().length() == 0) {
                log("No href was given for the link - skipping",
                    Project.MSG_VERBOSE);
                continue;
            } else {
                // is the href a valid URL
                try {
                    URL base = new URL("file://.");
                    new URL(base, la.getHref());
                } catch (MalformedURLException mue) {
                    // ok - just skip
                    log("Link href \"" + la.getHref()
                        + "\" is not a valid url - skipping link",
                        Project.MSG_WARN);
                    continue;
                }
            }

            if (la.isLinkOffline()) {
                File packageListLocation = la.getPackagelistLoc();
                if (packageListLocation == null) {
                    throw new BuildException("The package list "
                        + " location for link " + la.getHref()
                        + " must be provided because the link is "
                        + "offline");
                }
                File packageListFile =
                    new File(packageListLocation, "package-list");
                if (packageListFile.exists()) {
                    try {
                        String packageListURL =
                            fileUtils.getFileURL(packageListLocation)
                                .toExternalForm();
                        toExecute.createArgument()
                            .setValue("-linkoffline");
                        toExecute.createArgument()
                            .setValue(la.getHref());
                        toExecute.createArgument()
                            .setValue(packageListURL);
                    } catch (MalformedURLException ex) {
                        log("Warning: Package list location was "
                            + "invalid " + packageListLocation,
                            Project.MSG_WARN);
                    }
                } else {
                    log("Warning: No package list was found at "
                        + packageListLocation, Project.MSG_VERBOSE);
                }
            } else {
                toExecute.createArgument().setValue("-link");
                toExecute.createArgument().setValue(la.getHref());
            }
        }
    }

    // add the single group arguments
    // Javadoc 1.2 rules:
    //   Multiple -group args allowed.
    //   Each arg includes 3 strings: -group [name] [packagelist].
    //   Elements in [packagelist] are colon-delimited.
    //   An element in [packagelist] may end with the * wildcard.

    // Ant javadoc task rules for group attribute:
    //   Args are comma-delimited.
    //   Each arg is 2 space-delimited strings.
    //   E.g., group="XSLT_Packages org.apache.xalan.xslt*,
    //                XPath_Packages org.apache.xalan.xpath*"
    if (group != null) {
        StringTokenizer tok = new StringTokenizer(group, ",", false);
        while (tok.hasMoreTokens()) {
            String grp = tok.nextToken().trim();
            int space = grp.indexOf(" ");
            if (space > 0) {
                String name = grp.substring(0, space);
                String pkgList = grp.substring(space + 1);
                toExecute.createArgument().setValue("-group");
                toExecute.createArgument().setValue(name);
                toExecute.createArgument().setValue(pkgList);
            }
        }
    }

    // add the group arguments
    if (groups.size() != 0) {
        for (Enumeration e = groups.elements(); e.hasMoreElements();) {
            GroupArgument ga = (GroupArgument) e.nextElement();
            String title = ga.getTitle();
            String packages = ga.getPackages();
            if (title == null || packages == null) {
                throw new BuildException("The title and packages must "
                    + "be specified for group "
                    + "elements.");
            }
            toExecute.createArgument().setValue("-group");
            toExecute.createArgument().setValue(expand(title));
            toExecute.createArgument().setValue(packages);
        }
    }

    // JavaDoc 1.4 parameters
    if (javadoc4) {
        for (Enumeration e = tags.elements(); e.hasMoreElements();) {
            Object element = e.nextElement();
            if (element instanceof TagArgument) {
                TagArgument ta = (TagArgument) element;
                File tagDir = ta.getDir(getProject());
                if (tagDir == null) {
                    // The tag element is not used as a fileset,
                    // but specifies the tag directly.
                    toExecute.createArgument().setValue("-tag");
                    toExecute.createArgument().setValue(ta.getParameter());
                } else {
                    // The tag element is used as a fileset. Parse all the files and
                    // create -tag arguments.
                    DirectoryScanner tagDefScanner = ta.getDirectoryScanner(getProject());
                    String[] files = tagDefScanner.getIncludedFiles();
                    for (int i = 0; i

PrintWriter srcListWriter = null;
try {

    /**
     * Write sourcefiles and package names to a temporary file
     * if requested.
     */
    if (useExternalFile) {
        if (tmpList == null) {
            tmpList = fileUtils.createTempFile("javadoc", "", null);
            tmpList.deleteOnExit();
            toExecute.createArgument()
                .setValue("@" + tmpList.getAbsolutePath());
        }
        srcListWriter = new PrintWriter(
            new FileWriter(tmpList.getAbsolutePath(),
                           true));
    }

    Enumeration e = packagesToDoc.elements();
    while (e.hasMoreElements()) {
        String packageName = (String) e.nextElement();
        if (useExternalFile) {
            srcListWriter.println(packageName);
        } else {
            toExecute.createArgument().setValue(packageName);
        }
    }

    e = sourceFilesToDoc.elements();
    while (e.hasMoreElements()) {
        SourceFile sf = (SourceFile) e.nextElement();
        String sourceFileName = sf.getFile().getAbsolutePath();
        if (useExternalFile) {
            if (javadoc4 && sourceFileName.indexOf(" ") > -1) {
                srcListWriter.println("\"" + sourceFileName + "\"");
            } else {
                srcListWriter.println(sourceFileName);
            }
        } else {
            toExecute.createArgument().setValue(sourceFileName);
        }
    }

} catch (IOException e) {
    tmpList.delete();
    throw new BuildException("Error creating temporary file",
        e, getLocation());
} finally {
    if (srcListWriter != null) {
        srcListWriter.close();
    }
}

if (packageList != null) {
    toExecute.createArgument().setValue("@" + packageList);
}
log(toExecute.describeCommand(), Project.MSG_VERBOSE);

log("Javadoc execution", Project.MSG_INFO);

JavadocOutputStream out = new JavadocOutputStream(Project.MSG_INFO);
JavadocOutputStream err = new JavadocOutputStream(Project.MSG_WARN);
Execute exe = new Execute(new PumpStreamHandler(out, err));
exe.setAntRun(getProject());

/*
 * No reason to change the working directory as all filenames and
 * path components have been resolved already.
 *
 * Avoid problems with command line length in some environments.
 */
exe.setWorkingDirectory(null);
try {
    exe.setCommandline(toExecute.getCommandline());
    int ret = exe.execute();
    if (ret != 0 && failOnError) {
        throw new BuildException("Javadoc returned " + ret, getLocation());
    }
} catch (IOException e) {
    throw new BuildException("Javadoc failed: " + e, e, getLocation());
} finally {
    if (tmpList != null) {
        tmpList.delete();
        tmpList = null;
    }

    out.logFlush();
    err.logFlush();
    try {
        out.close();
        err.close();
    } catch (IOException e) {
        // ignore
    }
}
}

D.2 Agar > sha1.c > SHA1Transform()

This function is flagged as hard-to-read code by the outlier detection mechanism, see section 8.4.1 on page 146. A small illustration of the underlying adjusted-boxplot screening follows the listing, and the full detection script is reproduced in section D.3.

/*
 * Hash a single 512-bit block. This is the core of the algorithm.
 */
void
AG_SHA1Transform(Uint32 state[5], const Uint8 buffer[AG_SHA1_BLOCK_LENGTH])
{
    Uint32 a, b, c, d, e;
    Uint8 workspace[AG_SHA1_BLOCK_LENGTH];
    typedef union {
        Uint8 c[64];
        Uint32 l[16];
    } CHAR64LONG16;
    CHAR64LONG16 *block = (CHAR64LONG16 *)workspace;

    (void)memcpy(block, buffer, AG_SHA1_BLOCK_LENGTH);

    /* Copy context->state[] to working vars */
    a = state[0];
    b = state[1];
    c = state[2];
    d = state[3];
    e = state[4];

    /* 4 rounds of 20 operations each. Loop unrolled. */
    R0(a,b,c,d,e, 0); R0(e,a,b,c,d, 1); R0(d,e,a,b,c, 2); R0(c,d,e,a,b, 3);
    R0(b,c,d,e,a, 4); R0(a,b,c,d,e, 5); R0(e,a,b,c,d, 6); R0(d,e,a,b,c, 7);
    R0(c,d,e,a,b, 8); R0(b,c,d,e,a, 9); R0(a,b,c,d,e,10); R0(e,a,b,c,d,11);
    R0(d,e,a,b,c,12); R0(c,d,e,a,b,13); R0(b,c,d,e,a,14); R0(a,b,c,d,e,15);
    R1(e,a,b,c,d,16); R1(d,e,a,b,c,17); R1(c,d,e,a,b,18); R1(b,c,d,e,a,19);
    R2(a,b,c,d,e,20); R2(e,a,b,c,d,21); R2(d,e,a,b,c,22); R2(c,d,e,a,b,23);
    R2(b,c,d,e,a,24); R2(a,b,c,d,e,25); R2(e,a,b,c,d,26); R2(d,e,a,b,c,27);
    R2(c,d,e,a,b,28); R2(b,c,d,e,a,29); R2(a,b,c,d,e,30); R2(e,a,b,c,d,31);
    R2(d,e,a,b,c,32); R2(c,d,e,a,b,33); R2(b,c,d,e,a,34); R2(a,b,c,d,e,35);
    R2(e,a,b,c,d,36); R2(d,e,a,b,c,37); R2(c,d,e,a,b,38); R2(b,c,d,e,a,39);
    R3(a,b,c,d,e,40); R3(e,a,b,c,d,41); R3(d,e,a,b,c,42); R3(c,d,e,a,b,43);
    R3(b,c,d,e,a,44); R3(a,b,c,d,e,45); R3(e,a,b,c,d,46); R3(d,e,a,b,c,47);
    R3(c,d,e,a,b,48); R3(b,c,d,e,a,49); R3(a,b,c,d,e,50); R3(e,a,b,c,d,51);
    R3(d,e,a,b,c,52); R3(c,d,e,a,b,53); R3(b,c,d,e,a,54); R3(a,b,c,d,e,55);
    R3(e,a,b,c,d,56); R3(d,e,a,b,c,57); R3(c,d,e,a,b,58); R3(b,c,d,e,a,59);
    R4(a,b,c,d,e,60); R4(e,a,b,c,d,61); R4(d,e,a,b,c,62); R4(c,d,e,a,b,63);
    R4(b,c,d,e,a,64); R4(a,b,c,d,e,65); R4(e,a,b,c,d,66); R4(d,e,a,b,c,67);
    R4(c,d,e,a,b,68); R4(b,c,d,e,a,69); R4(a,b,c,d,e,70); R4(e,a,b,c,d,71);
    R4(d,e,a,b,c,72); R4(c,d,e,a,b,73); R4(b,c,d,e,a,74); R4(a,b,c,d,e,75);
    R4(e,a,b,c,d,76); R4(d,e,a,b,c,77); R4(c,d,e,a,b,78); R4(b,c,d,e,a,79);

    /* Add the working vars back into context.state[] */
    state[0] += a;
    state[1] += b;
    state[2] += c;
    state[3] += d;
    state[4] += e;

    /* Wipe variables */
    a = b = c = d = e = 0;
}
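One of the two detectors used in the script of section D.3 relies on adjusted boxplot statistics, computed by adjboxStats() from the robustbase package. The fragment below is a minimal illustrative sketch, not the detection pipeline itself: it assumes robustbase is installed, and the per-function metric values are invented. A clearly extreme value such as the 160 below falls outside the adjusted fences, is returned in the $out component, and would therefore be flagged.

# Minimal sketch of adjusted-boxplot screening on a single, invented metric.
library(robustbase)

# Hypothetical per-function values of one readability-related metric.
metric <- c(12, 15, 9, 22, 14, 160, 11, 18)

# adjboxStats() computes boxplot statistics adjusted for skewed data;
# its $out component lists the values lying outside the fences.
flagged <- which(metric %in% adjboxStats(metric)$out)
flagged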

D.3 R code for hard-to-read files

# Outlier detection on file metrics: two detectors are computed and combined.
require('robustbase')

dim_cols <- ncol(project)
dim_rows <- nrow(project)

outs_all <- data.frame(matrix(FALSE, nrow = dim_rows, ncol = 4))
colnames(outs_all) <- c("Name", "SLOC", "Max", "Box")
outs_all$Name <- project$Name
outs_all$SLOC <- project$SLOC

outs_max <- data.frame(matrix(FALSE, nrow = dim_rows, ncol = dim_cols))
colnames(outs_max) <- names(project)
outs_max$Name <- project$Name
outs_max$SLOC <- project$SLOC

outs_box <- data.frame(matrix(FALSE, nrow = dim_rows, ncol = dim_cols))
colnames(outs_box) <- names(project)
outs_box$Name <- project$Name
outs_box$SLOC <- project$SLOC

# For each metric column (3 to 6), flag values above 85% of the column
# maximum (outs_max) and values outside the adjusted boxplot fences (outs_box).
for (i in 3:6) {
  seuil <- 0.85 * max(project[,i]);
  outs_max[project[,i] >= seuil, i] <- TRUE;
  myouts <- which(project[,i] %in% adjboxStats(project[,i])$out);
  outs_box[myouts, i] <- TRUE
}

# Count, for each file, how many metrics were flagged by each detector.
for (i in 1:nrow(outs_max)) {
  nb <- 0;
  for (j in 3:6) { nb <- nb + (if (outs_max[i,j] == TRUE) 1 else 0) };
  outs_max[i, c("NB")] <- nb
}
max_max <- max(outs_max$NB)
outs_max$Result <- FALSE
outs_max[outs_max$NB > 0, c("Result")] <- TRUE
outs_all$Max <- outs_max$Result

for (i in 1:nrow(outs_box)) {
  nb <- 0;
  for (j in 3:6) { nb <- nb + (if (outs_box[i,j] == TRUE) 1 else 0) };
  outs_box[i, c("NB")] <- nb
}
max_box <- max(outs_box$NB)
outs_box$Result <- FALSE
outs_box[outs_box$NB == max_box, c("Result")] <- TRUE
outs_all$Box <- outs_box$Result

# Combine the two detectors with AND and OR rules.
for (i in 1:nrow(outs_all)) {
  outs_all[i, c("ResultAND")] <- outs_max[i, c("Result")] && outs_box[i, c("Result")];
  outs_all[i, c("ResultOR")]  <- outs_max[i, c("Result")] || outs_box[i, c("Result")];
}

# Keep the files with more than 10 SLOC flagged by at least one detector.
outs <- outs_all[outs_all$SLOC > 10 & outs_all$ResultOR == TRUE, c("Name")]
rm(i, j, outs_max, outs_box, outs_all)
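The following usage sketch shows the shape of input the script above expects; everything in it is assumed rather than taken from the study. The project data frame is invented (the script only requires a Name column, a SLOC column and metric values in columns 3 to 6), the metric column names merely reuse mnemonics from the index, the values are made up, and the file name passed to source() is hypothetical. The robustbase package must be installed for adjboxStats() to be available.

# Invented input data frame in the shape expected by the script above:
# Name, SLOC, then four metric columns in positions 3 to 6.
project <- data.frame(
  Name = c("sha1.c", "main.c", "util.c", "map.c", "menu.c"),
  SLOC = c(160, 45, 30, 80, 55),
  vg   = c(48, 6, 3, 9, 5),
  ncc  = c(2, 10, 8, 20, 12),
  lc   = c(170, 60, 40, 100, 70),
  nest = c(6, 2, 1, 3, 2)
)

# Run the script above, saved here under an assumed file name.
source("hard_to_read_outliers.R")

# 'outs' now holds the names of the flagged files.
print(outs)

After execution, outs contains the names of the files with more than 10 SLOC that were flagged by at least one of the two detectors; the final rm() call removes all intermediate objects.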

Index

Boxplots, 54, 72, 139
Change management, 91, 95
Checkstyle, 100
    Bugs, 88
    Rules, 117
Clustering, 49, 75, 140, 157
Communication management, 91, 96
    Metrics, 114
Configuration management, 91, 94
    Metrics, 113
    Retrieval, 102
Data sets
    Ant, 120
    Httpd, 121
    JMeter, 122
    Subversion, 122
Distribution function, 60
Eclipse Foundation, 127
Ecological fallacy, 29
Ecological inference, 29
Git, 95, 130
    Using, 102, 103
Halstead, 23, 111, 130, 146
Hierarchical clustering, 50, 75, 157
ISO/IEC 9126, 27, 38, 115, 129
Jenkins, 103
K-means clustering, 50, 77, 157
Knitr documents, 90, 168, 259
Literate analysis, 70, 89
McCabe, 23, 78, 110, 149
Metrics
    dopd, 111
    dopt, 111
    lc, 107
    ncc, 112
    nest, 110
    npat, 110
    rokr, 112
    sloc, 107
    topd, 111
    topt, 111
    vg, 110
Outliers, 53
    Definition, 137
    Detection, 72, 137
Pareto distribution, 61
PMD, 100
    Bugs, 88
    Rules, 118
Polarsys, 17, 128
Practices, 25, 94, 100, 115, 167
Principal Component Analysis, 47, 74, 176
Quality definitions, 31
Quality models, 34
Regression analysis, 56, 73, 78
Source code, 94
    Metrics, 107
    Retrieval, 102
Squore, 87, 100, 101
    Rules, 116
Subversion, 95, 130
    Using, 102, 103
Survival Analysis, 77
Time series, 58, 83
    Applications, 158
    Data sets for, 124
Total Quality Management, 33
Weibull distribution, 62
