
Improving Software Maintenance Using Unsupervised Machine Learning Techniques


Università degli Studi di Napoli “Federico II”, PhD Programme (Dottorato) in Computational and Information Sciences (Scienze Computazionali e Informatiche), Cycle XXV


a dissertation presented by Valerio Maggio to The Department of Mathematics and Applications

in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computational and Information Sciences

Naples, Italy, March 2013

Thesis advisors: Sergio Di Martino and Anna Corazza

Improving Software Maintenance Using Unsupervised Machine Learning Techniques

Abstract

Software maintenance is an essential step in the evolution of software systems and represents one of the most expensive, time-consuming, and challenging phases of the whole development process. In particular, the cost and the effort necessary for both maintenance and evolution operations (e.g., corrective, adaptive, etc.) are mainly related to the effort necessary to comprehend the system and its source code. As a consequence, many reverse engineering tools and solutions have been proposed to support maintainers in their activities.

An important resource for maintainers is represented by the architectural information of the system. However, such information is usually not documented, or the documentation is outdated. Therefore, the existing code remains the most up-to-date source of information to exploit in order to automatically retrieve and reconstruct the architecture of a system.

Many research efforts are being devoted to support this task, in order to define solutions that are able to re-modularise a given software application. The main purpose of re-modularisation techniques is to automatically partition the system into meaningful subsystems, in order to locate and group together software components that are in some way related, e.g., they implement the same functionalities.

A number of these approaches generally attempt to discover these groups (or clusters) by exploiting the lexical information provided in the source code, such as terms in comments, as well as names of identifiers (e.g., variables, methods and classes).

Nevertheless, the source code lexicon has some specific peculiarities that make it conceptually different from a typical textual resource: identifiers are often created by concatenating multiple words (e.g., getAttribute, MINHEIGHT), which may be additionally shortened (e.g., getAttr, MINHGT) to avoid long names. As a consequence, tools and techniques that analyse the source code lexicon must integrate algorithms and solutions able to normalise its vocabulary.

Another well-known and largely investigated issue in software maintenance is clone detection, which is focused on the identification of source code duplications. Software clones might affect the reliability and the maintainability of large software systems. For example, errors affecting a fragment of code must be fixed in every one of its possible duplications.

Clones are usually not documented, and their identification is usually complicated since programmers adapt software copies by applying multiple modifications (e.g., adding new statements and renaming variables). Therefore, automatic and reliable approaches are required in order to tackle this problem.

In this thesis we propose new Machine Learning (ML) based approaches that mine the relevant information directly from the source code to cope with the three introduced issues, namely software re-modularisation, source code vocabulary normalisation, and clone detection. In particular, the proposed contributions leverage the benefits of ML algorithms, which have been properly tailored and customised in order to make them suitable for the considered domain.

All the presented approaches have been extensively assessed with empirical evaluations conducted on large software systems, and results have been compared with other related techniques, whenever possible. The achieved results outperform the state-of-the-art solutions for all three considered problems, thus confirming the benefits derived from the definition and the application of ML algorithms to maintenance tasks.

Contents

I Introduction
  I.a Thesis Motivations
  I.b Outline of the Thesis
  I.c Origin of the Chapters

A Background and Related Work

1 Software Maintenance and Evolution
  1.1 Issues in Maintaining Large Software Systems
  1.2 Source Code Vocabulary Normalisation
  1.3 Software Re-modularisation
  1.4 Clone Detection

2 Machine Learning and Pattern Matching Techniques
  2.1 Definitions and Notations
  2.2 Learning from examples
  2.3
  2.4 Kernel Methods

B Original Contributions

3 Weighting Source Code Lexical Information with a Probabilistic Model for Software Re-modularisation
  3.1 Investigating the Use of Source Code Lexical Information
  3.2 Probabilistic Models for Software Re-modularisation
  3.3 Clustering of Software Artefacts
  3.4 Experimental Settings
  3.5 Results and Discussions

4 An Efficient Approach to Split Identifiers and Expand Abbreviations
  4.1 The LINSEN
  4.2 Experimental Settings
  4.3 Results and Discussions

5 A Kernel Based Approach for Clone Detection
  5.1 Tree Kernels for Clone Detection
  5.2 Graph Kernels for Clone Detection
  5.3 Towards a Supervised Kernel Learning Approach for Clone Detection

6 Conclusions
  6.1 Software re-modularisation
  6.2 Source code vocabulary normalisation
  6.3 Clone detection

References

Listing of figures

1.1 UML Activity Diagram of the staged process model for evolution (adapted from [190])
1.2 Change Analysis Activity Diagram
1.3 Change Implementation Activity Diagram

2.1 An overview of how the different components of a machine learning application are organised (adapted from [68])
2.1 Scalar projection
2.2 Example of a labelled directed graph
2.3 Example of a labelled undirected graph
2.4 Example of a Tree (a), and two corresponding possible subtrees (b) and (c). In particular, (b) is a Subtree, while (c) is a Subset tree
2.1 An example of k-means clustering of 2D points organised in three clusters. Cluster centroids are marked as large green rings, while elements in the different clusters are dots, triangles, and stars respectively
2.2 Two examples of dendrograms representing the clustering results of the same set of data applying two different linkage strategies, namely the complete linkage (left) and the group average linkage (right)

3.1 The Overall Software Re-modularisation process
3.1 Activity Diagram of the Artefact Indexing Processing Step
3.2 Document representation as a set of six different zone buckets
3.1 Authoritativeness results for RQ1 (Flat system versus the unweighted system) obtained by the application of the K-medoid clustering algorithms
3.2 Authoritativeness results for RQ1 (Flat system versus the unweighted system) obtained by the application of the HAC clustering algorithms
3.3 Authoritativeness results comparison of the K-medoid and GAAC clustering algorithms on the flat system configuration
3.4 Authoritativeness results comparison of the K-medoid and GAAC clustering algorithms on the unweighted system configuration
3.5 Authoritativeness results for RQ2 (Unweighted system versus the Complete system) obtained by the application of the K-medoid clustering algorithms
3.6 Authoritativeness results for RQ2 (Unweighted system versus the Complete system) obtained by the application of the GAAC clustering algorithms
3.7 Authoritativeness results for RQ2 (Unweighted system versus the Complete system) obtained by the application of the K-medoid clustering algorithms, considering the Gaussian and the Bernoulli models

4.1 Example of matching graph for the identifier getpnt
4.1 Bar Chart of splitting results compared with Single-iteration results reported in [126] (Table 4.1)
4.2 Bar Chart of splitting results compared with best results attained by the GenTest splitting algorithm [113, 115] (Table 4.2)
4.3 Splitting results (F-measure) for systems gathered from [113, 114, 126]
4.4 Bar Chart of normalisation results (i.e., splitting of identifiers and expansion of abbreviations) compared with results reported in [126] (Table 4.3)
4.5 Bar Chart of normalisation results (i.e., splitting of identifiers and expansion of abbreviations) compared with results reported in [113] (Table 4.4)
4.6 Normalisation results (F-measure) for systems considered in [113, 126]
4.7 Expansion results for systems considered in [85]

5.1 Example of a partial AST generated from the above code example
5.2 Associated features to node of the AST reported in Figure 5.1
5.3 Editing Taxonomy Scenarios (extracted from [159])
5.4 Results for the qualitative evaluation of the proposed Tree Kernel KAST clone detection technique
5.5 Precision, Recall and F-Measure plot of achieved results for Type 3 clones
5.6 Bar chart summarising obtained results on Type 3 clones by CloneDigger [32] and Tree Kernel Clone Detector
5.1 Example of the PDG corresponding to the function reported in Listing 5.1. Nodes are labelled with their corresponding types, while data and control edges are depicted in solid and dashed lines
5.2 Results for the qualitative evaluation of the proposed Graph Kernel WDKPDG clone detection technique

List of Tables

1.1 Summary of the objectives and benefits of reverse engineering (adapted from [79])
1.1 State-of-the-art Identifier Splitting and Abbreviation Expansion Techniques
1.1 Overview of architecture recovery approaches
1.1 Overview of clone detection techniques

2.1 Linkage update strategies, according to the parametric formulation proposed by Lance and Williams [110]

3.1 EM computation for the three different models and with frequentist initialisation. G(µ, σ) denotes the Gaussian distribution of mean µ and standard deviation σ; x_{t,d} indicates the index for the token t in document d
3.1 Descriptive statistics of the considered dataset

4.1 Statistics of the analysed systems grouped by the different datasets to which they belong (from top to bottom: [126], [113], [114], [85])
4.1 RQ1: Percentage of correct splittings compared with Single-iteration results reported in [126]
4.2 RQ1: Percentage of correct splittings compared with best results attained by the GenTest splitting algorithm [113, 115]
4.3 RQ2: Percentage of correct splittings and expansions compared with results presented in [126]
4.4 RQ2: Percentage of correct splittings and expansions compared with results presented in [113]
4.5 RQ3: Percentage of correct expansions for each type of short form, compared with results presented in [85]

5.1 Instruction Type and Instruction features examples
5.2 Descriptive Statistics
5.1 Types for nodes in a PDG (adapted from [124])
5.1 Summary statistics of the results

List of Algorithms

1 The k-Means Algorithm
2 k-Means Initialisation Strategy
3 k-Means Update Strategy
4 HAC algorithm
5 GAAC Cutting Strategy
6 LINSEN Main Process
7 Identifier Splitting
8 (Single Word) Abbreviations Expansion
9 String matching algorithm
10 Clone Injection Algorithm
11 Clone Generation Algorithm

To V., my deepest love and my reason to smile (without a reason why)

Acknowledgments

First of all, I would like to express my gratitude to my tutors, Dr. Sergio Di Martino and Dr. Anna Corazza, for their guidance, advice, encouragement and support. I would like to thank all of my friends and other members of the KnomeLab, who have helped me in one way or another along the way. I also thank my family for its constant support and inspiration over these years. Finally I would like to thank everybody that has been on my side and helped me whenever I needed, and whenever I did not.

If a man will begin with certainties, he shall end in doubts; but if he will be content to begin with doubts he shall end in certainties.
Sir Francis Bacon

I Introduction

The development and the maintenance of large software systems in a changing environment is one of the major challenges for software engineering [193]. This issue has been eloquently discussed in Brooks’ classical paper No Silver Bullet [30], where the author affirms that developing large software involves essential difficulties related to its complexity, conformity, invisibility and changeability. In fact, changes are inherent to software [92]: they may be necessary to satisfy requests for performance improvements, or to deal with errors discovered in the system [79, 122]. In his famous Laws of Software Evolution [120], Lehman declares that software applications must necessarily evolve and grow to remain satisfactory (Lehman’s first law), but maintenance operations are required in order to reduce the complexity of the systems (Lehman’s second law). However, software maintenance arguably represents one of the most expensive and time-consuming software activities. Companies spend more time (and money)

to maintain and evolve existing software than to develop new software [92]. A number of studies have been undertaken to investigate the costs of software maintenance [60, 62, 88, 143, 151], and many of their findings indicate that these activities can account for up to 85-90% of the total software costs [79]. The cost and the effort necessary for maintenance activities (e.g., corrective or adaptive) are mainly related to the effort necessary to comprehend the system and its source code [135]. In fact, it is argued that up to 60% of the total maintenance effort is spent on such activity [6, 108]. The main reasons are: (I) some pieces of knowledge on the specific domain covered by the application are not explicitly stated in the documentation [108]; or (II) the documentation is missing or not up-to-date. Conversely, the source code represents a valuable source of information for program comprehension [126]. On the one hand it intrinsically provides the most up-to-date information about the system; on the other hand, its lexicon (i.e., identifiers and comments) embeds the domain knowledge provided by developers, thus bridging the gap left by the lack of reliable documentation. As a consequence, any solution that can improve maintenance productivity is bound to have a dramatic impact on software costs [92]. To this aim, in the last decades several approaches have been proposed in the literature to support maintainers, giving rise to the establishment of the software maintenance (and evolution) field as an accepted research area in the software engineering community.

I.a Thesis Motivations

Understanding the software is the fundamental and necessary activity that precedes any type of change to the system. The comprehension process requires a great deal of the total time spent on analysing and applying changes to the system under maintenance. This is mainly caused by the complexity of the system and the lack of sufficient domain knowledge, and it is usually aggravated by incorrect, outdated or non-existent documentation [79]. To reduce this effort, one possible solution is to gather relevant information about the system from the source code (e.g., the design or the architectural model).

Reverse engineering techniques provide effective solutions to face these issues, aiding the comprehension of the system and easing the implementation of the desired modifications [79]. Reverse engineering approaches are distinguished according to the specific kind of information they exploit during the analysis process. However, such techniques share the same goal, namely to support maintenance activities by allowing a large and usually complex system to be comprehended in terms of what it does, how it works and its architectural representation [79]. In fact, an important resource for software maintainers is represented by the architectural information [73]. Software architectures provide models and views representing the relationships among the different software components, according to a specific set of concerns [57, 155]. Several approaches have been proposed in the literature to support this task, known as architecture recovery [57]. The greater part of the approaches for architecture recovery [103, 132] aim at partitioning the system into meaningful subsystems by automatically locating and grouping together software components that are in some way related, e.g., they implement the same functionalities. This specific task is usually referred to as software re-modularisation (or software clustering). In particular, a re-modularised version of the system produces an architectural model that is easier to comprehend and to maintain [11, 184]. A number of these approaches generally attempt to discover these groups (or clusters) by analysing structural dependencies between software artefacts [11, 28, 141, 186]. However, if the analysis is based on the sole structural aspect, a key source of information about the analysed software system may be lost, i.e., the domain knowledge that developers embed in the source code lexicon. In fact, developers usually communicate their intent and their domain knowledge by means of significant terms in comments, as well as names of identifiers (e.g., variables, methods and classes). However, the source code vocabulary has some specific peculiarities that make it different from typical plain text: identifiers are often created by concatenating multiple words, separated by some special characters (e.g. to_string), by capitalising their first letters, or by any special convention (e.g. IRDocument, MAXSIZE) [61, 118]. In addition, a common programming habit is to shorten words of compound identifiers using abbreviations and/or acronyms

to avoid long names. As a result, tools and techniques that analyse the lexical information provided by the source code must integrate algorithms and solutions that are able to normalise the source code [113, 115, 117, 118], namely to split compound identifiers and to expand possible occurring abbreviations. To this aim, in recent years many research efforts are being devoted to dealing with the structure of identifiers, aiming at improving the performance of the so-called lexical-based software analysis/maintenance tools [61, 66, 80, 85, 113, 115, 117, 126]. Another issue that characterises large software systems is the problem of redundancies. Programs are often polluted by redundant code, namely similar code structures that are replicated in many places of the program. Duplicated source code is a phenomenon that occurs frequently in large software systems [20]. The reasons why programmers duplicate code are manifold. The most well-known is a common bad programming practice, the so-called copy and paste [159], that gives rise to software clones, or simply clones. Code clones may increase the difficulties in analysing and applying changes and thus they significantly contribute to increasing maintenance costs [79, 92]. This problem is a well-known and largely investigated issue in software maintenance, usually referred to as the problem of clone detection. The clone detection task is mainly focused on the analysis and the identification of code duplications, aiming at documenting the different redundancies detected in a system. In fact, maintenance problems aggravate during the evolution: the different duplications are merged with existing code and it becomes unclear which part of the code relates to which source of change, thus leaving duplications mostly undocumented. Moreover, programmers usually adapt the copies to the new context by applying multiple modifications such as adding new statements, renaming variables, and so forth. This aspect further complicates the detection process: on the one hand, it is likely that some clones are not detected; on the other hand, this affects the reliability of the system under investigation. As a consequence, maintainers need to rely on automatic tools and techniques that are able to support them in their detection duties, especially in the case of large and complex systems. The work presented in this thesis provides research contributions to all the three maintenance issues previously described, namely software re-modularisation,

source code vocabulary normalisation, and clone detection. In particular, the presented contributions combine different methods gathered from the machine learning (and pattern matching) field to automatically mine information from the source code. Machine learning (ML) algorithms have proven to be of great practical value in a variety of application domains, providing flexible solutions able to analyse large data sets with an affordable computational efficiency. Not surprisingly, the field of software engineering turns out to be a fertile ground where many development and maintenance tasks could be formulated as learning problems and handled in terms of learning algorithms [192, 193]. As a matter of fact, machine learning techniques have been applied by researchers to support a variety of software engineering activities, such as prediction of software development effort, program transformation, or reuse construction [131], producing some good results [3, 24, 82, 123, 125, 192, 193]. In the case of source code analysis, the aforementioned flexibility of ML approaches makes it possible to exploit the different kinds of information provided by the source code. In particular, the proposed approaches apply the so-called Kernel Methods [87, 146] to combine the structural/syntactical information of the source code with the lexical one, based on the assumption that the different kinds of information may contribute differently to the considered maintenance issues. However, a great effort is required to adapt and modify these approaches in order to make them suitable to analyse the source code. ML techniques are founded on a learning method that is usually referred to as inductive learning [142], since conclusions are derived by generalising from “observed” examples, namely the analysed data. These examples may be “labeled” or “unlabeled”, giving rise to the so-called supervised and unsupervised learning strategies respectively. Nevertheless, obtaining a reliable set of labeled data is usually difficult, or even impossible in some extreme cases. As a consequence, unsupervised machine learning techniques are typically preferred over supervised ones [131], especially in the case of software engineering problems. In particular, some of the advantages of unsupervised learning techniques are [59, 131]:

5 • There is no cost of collecting and labelling data.

• Unsupervised techniques may be used to identify characteristics of the samples which are useful for differentiating between them.

• Unsupervised techniques may be used for exploring the data and analysing its structure.

On the other hand, a trade-off is imposed on their effectiveness, as they rely solely on the quality of the analysed data.
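As a purely illustrative view of the kernel-based combination of structural and lexical information mentioned above, the sketch below mixes two precomputed Gram matrices with a convex combination; the matrices and the mixing weight are made-up assumptions, not the kernels actually developed in this thesis. Since a convex combination of valid kernels is itself a valid kernel, the resulting matrix can be fed to any kernel-based learning algorithm, supervised or unsupervised.

```python
# Illustrative sketch only: combining a structural and a lexical Gram matrix.
import numpy as np

def combine_kernels(k_structural, k_lexical, alpha=0.5):
    """Convex combination alpha*K1 + (1-alpha)*K2 of two Gram matrices."""
    assert 0.0 <= alpha <= 1.0
    return alpha * k_structural + (1.0 - alpha) * k_lexical

# Toy, hand-made Gram matrices over three hypothetical software artefacts.
K_struct = np.array([[1.0, 0.8, 0.1],
                     [0.8, 1.0, 0.2],
                     [0.1, 0.2, 1.0]])
K_lex = np.array([[1.0, 0.3, 0.7],
                  [0.3, 1.0, 0.4],
                  [0.7, 0.4, 1.0]])

print(combine_kernels(K_struct, K_lex, alpha=0.6))
```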

I.b Outline of the Thesis

This section describes the contents of the thesis, highlighting its original contributions. The thesis is divided into two parts. The first part outlines background concepts related to both software maintenance and machine learning. Chapter 1 provides a general description of software maintenance and evolution research themes, specifically focused on the description of the three maintenance issues related to the research contributions presented in this thesis, namely software re-modularisation (Section 1.3), source code (vocabulary) normalisation (Section 1.2), and clone detection (Section 1.4). Each of these sections provides a detailed specification of the analysed problem, together with an extensive description of the corresponding related work. Furthermore, Chapter 2 introduces the notation and basic concepts of Machine Learning used throughout the remaining chapters. Section 2.1 gives basic definitions about the structures used in the following chapters. Section 2.2 provides a general description of (machine) learning problems and definitions, outlining the differences between supervised and unsupervised learning problems. Section 2.3 gives an overview of existing clustering approaches and of probabilistic models. Section 2.4 introduces kernel functions and the convolution kernel framework, and provides a description of existing approaches for building kernel functions. Section 2.4.2 gives an overview of kernel functions for structured data such as trees and graphs.

The second part of the thesis is devoted to the presentation of the original contributions.

I.c Origin of the Chapters

Chapter 3 is based on the articles [43, 45] (Section 3.1) and contains some unpublished work (Section 3.2). The material presented in Chapter 4 is based on the article [4]. Chapter 5 is based on articles [44, 46] (Sections 5.1 and 5.3) and contains some unpublished work (Section 5.2).

Part A

Background and Related Work


Know how to solve every problem that has ever been solved.
Richard Feynman

1 Software Maintenance and Evolution

The classical view on software engineering, founded upon the well-known waterfall life-cycle process for software development [160], delegates to maintenance activities only bug fixing and minor adjustment operations [92, 139]. This view has long governed the industrial practice in software development and it even became part of the IEEE 1219 Standard for Software Maintenance [69], which defines software maintenance as “the modification of a software product after delivery to correct faults, to improve performance or other attributes, or to adapt the product to a modified environment” [139]. However, today we know that what actually happens to software after its first delivery is much more complicated than simply fixing errors, and the different changes to the system may involve corrective as well as adaptive modifications. For example, a system may need functional enhancements to take advantage of new technological changes in the production environment (e.g., new platforms or architectures), or it may require several modifications to fix errors or to improve performance.

In more detail, the ISO/IEC standard on software maintenance proposes a categorisation of maintenance activities based on four different classes [139]:

• Perfective Maintenance is any modification of a software product after de- livery to improve performance or maintainability.

• Corrective Maintenance is the reactive modification of a software product performed after delivery to correct discovered faults.

• Adaptive Maintenance is the modification of a software product performed after delivery to keep a computer program usable in a changed or changing environment.

• Preventive Maintenance refers to software modifications performed for the purpose of preventing problems before they occur.

For the sake of completeness, it is worth mentioning that the above classification has been further extended by Chapin et al. [39] based on objective evidence of maintainers’ activities [139]. As a result, the new term software evolution has been coined to better reflect this wide range of post-release processes on software systems [92]. The term was originally coined by Manny Lehman, who in the late seventies formulated his, now famous, Laws of Software (or Program) Evolution [119, 120], emphasising the importance of maintenance activities in the software life-cycle, and their great impact on the quality and the complexity of the systems. In particular, the first two of Lehman’s laws (out of eight) declare that:

Law of Continuing Change: systems must be continually adapted or they become progressively less satisfactory to use. (First Law of Software Evolution - Lehman, 1980 [120])

Law of Increasing Complexity: as a system evolves, its complexity in- creases unless work is done to maintain or reduce it. (Second Law of Software Evolution - Lehman, 1980 [120])

Nevertheless, the term software evolution gained its widespread acceptance only in the nineties [139], and the research on software evolution started to become popular [15, 150].

The overall software evolution process is depicted in Figure 1.1 by means of a UML Activity diagram. The picture represents the different activities included in the process to satisfy a new Change Request to the system. In particular, some of these activities, such as the Change Analysis (Figure 1.2) and the Change Implementation (Figure 1.3), are further specified to emphasise that they are far from being trivial tasks.

Nowadays, software evolution has become a very active and well-respected field of research in software engineering [139]. In fact, the 2004 ACM/IEEE Software Engineering Curriculum Guidelines list software evolution as one of the ten key areas of software engineering education [139].

Figure 1.1: UML Activity Diagram of the staged process model for evolution (adapted from [190])

Figure 1.2: Change Analysis Activity Diagram

Figure 1.3: Change Implementation Activity Diagram

1.1 Issues in Maintaining Large Software Systems

Software maintenance is about change. It may be a small change to fix a bug, or a larger enhancement to better satisfy users [92]. However, the size and complexity of programs make applying changes a very hard activity. In particular, the complexity of understanding and maintaining a program is proportional to its size and complexity [92]. In his famous paper, “No Silver Bullet: Essence and Accidents of Software Engineering” [30], Frederick Brooks argues that programming is inherently complex. In particular, he affirms that whether we adopt a machine language or a high-level programming language, we are not able to simplify a program below a certain threshold that he calls the essential program complexity. Brooks continues by declaring that the two main factors that determine the program complexity threshold are: (I) the complexity of a problem and its solution at the conceptual level, and (II) the fact that not only do programs express problem solutions, but they must also address a range of issues related to solving a problem by a computer. These two factors cannot be clearly separated one from the other in program components, which constrains maintainers’ effort in using the “conventional decomposition” strategy (i.e., divide and conquer) to handle the software complexity issue. In other words, what makes changes hard is the ripple effect of modifications [92], coupled with the difficulties in correctly identifying what are (and will be) all the different parts of the system to modify. As a matter of fact, a single change to the system usually affects multiple software components. For instance, a single change in user requirements may likely affect the design and the component architecture. Furthermore, any change that affects module interfaces or global system properties may unpredictably impact many other different components [92]. Therefore, any solution that could alleviate these problems is inevitably bound to improve the effectiveness of maintenance activities. To this aim, in recent years many research efforts have been devoted to the definition of tools and techniques aimed at supporting maintainers in their duties. This effort fed the growing interest of the research community in these topics, and has provided successful solutions for the industry. In fact, many companies adopt automated

maintenance tools to improve the quality of their software development process.

1.1.1 Reverse Engineering Techniques

An important theme within the research domain of software maintenance and evolution is reverse engineering [40]. Reverse engineering approaches build higher-level, more abstract software models in an automatic fashion, gathering information from the source code or any other available document [23, 36, 40, 92, 168]. In particular, reverse engineering is the process of analysing a subject system to [40, 131]:

• Identify the system’s components and their interrelationships.

• Create a representation of the system in another form, or at a higher level of abstraction.

As a matter of fact, reverse engineering solutions are usually employed to recover lost information, to improve and/or provide new documentation, to extract reusable components, or to reduce the overall maintenance effort [79]. A summary description of these goals and their corresponding benefits for maintenance is reported in Table 1.1. Moreover, within the overall software evolution process (Figure 1.1), the application of reverse engineering solutions immediately precedes program comprehension (or program understanding) [63] tasks (Figure 1.2). In fact, program comprehension approaches are specifically devoted to making sense of the different information produced by reverse engineering techniques [139].

Recovering lost information: The documentation of requirements and design of large systems is usually not up-to-date or, in extreme cases, may not even exist. As a consequence, the program code is the only reliable source of information about the system. To this aim, several reverse engineering tools have been developed to exploit the precious information embedded in the source code by developers, such as their domain knowledge.

Re-documentation: The re-documentation of a system is the recreation of a semantically equivalent representation within the same relative abstraction level [79]. The main goals of this process are to create alternative views of the system to enhance the understanding, to improve the currently available documentation, and to generate the documentation for a newly modified program [79].

Extracting reusable components: The source code of large software systems is polluted by a lot of redundancies and duplications. Indeed, programmers usually duplicate the code to overcome some possible limitations of the programming language or the pressures of an upcoming deadline. Duplications inevitably increase the size and the complexity of the code, as well as the total effort necessary to comprehend and maintain it. To this aim, several approaches have been defined to automatically identify the so-called “cloned code” [21, 159]. In particular, these techniques aim at documenting all the discovered redundancies and at refactoring possible reusable components, improving the design and the maintainability of the analysed system.

Reducing the maintenance effort: One of the main driving forces behind the increasing interest in reverse engineering has always been the intent to reduce the maintenance effort [79]. A large percentage of the total time required to make a change is spent in comprehending the system and its source code. This is mainly due to the lack of an appropriate and accurate documentation and of sufficient domain knowledge [36]. Reverse engineering techniques have the potential to alleviate these problems and thus reduce maintenance effort because they provide a means of obtaining the missing information [13, 79]. In conclusion, it is worth noting that reverse engineering techniques represent the first and the most important step within the re-engineering process [139]. In particular, such a process [13, 14] encompasses a two-step approach that aims at examining and altering a target system in order to implement the desired modifications [79]. In more detail, reverse engineering techniques are first used to analyse the target system in order to understand it and represent it in new forms [79, 139].

  Objective                          Benefits
  1 To recover lost information      Enhances understanding and aids the identification of errors.
  2 Re-documentation                 Improves the documentation of the system; provides alternative views of the system.
  3 To extract reusable components   Supports identification of duplications and refactoring of reusable knowledge.
  4 To reduce maintenance effort     Improves the quality and the comprehension of the system.

Table 1.1: Summary of the objectives and benefits of reverse engineering (adapted from [79])

Afterwards, forward engineering [154, 169] methods are applied to actually implement and integrate the modifications, thus leading to the “new” enhanced system [79]. The overall re-engineering process is typically represented in the literature by the well-known horseshoe model [98].

1.1.2 Typical Approaches to Reverse Engineering

When analysing a program for understanding, maintainers create multiple mental models of the system, aiming at identifying the concepts and their relationships [31, 92]. However, if such system models are not available in the documentation, maintainers repeatedly recover them from the source code and from other sources. Automated reverse engineering can ease the tedious and error-prone manual process of recovering this information about the system [92], giving rise to the definition of several approaches aimed at supporting practitioners in their duties. Reverse engineering techniques may be roughly classified into two distinct categories, according to the kind of analysis they apply to the software artefacts, namely the Static and the Dynamic analysis. On the one hand, static analysis-based techniques are performed without actually executing the program: all the relevant information is gathered directly from the source code or the object code [185]. Conversely, approaches based on dynamic analysis methods require the execution of the analysed program on “real” or “virtual” processors, in combination with some preliminary code instrumentation and an accurate selection of input data, in

order to produce interesting behaviours. The analysis of execution traces [22] and code coverage [140] are examples of approaches belonging to this second category. Furthermore, there are also techniques belonging to both classes, since they could be based on static or dynamic analysis methods as well. Program slicing is a good representative of this kind of approach, whose definition considers two different application variants [102, 183]. Nevertheless, despite the different kinds of analyses performed on the program, all reverse engineering approaches try to infer and to understand the behaviour and the architecture of a large software system [139]. As a matter of fact, an important discipline of reverse engineering is architecture recovery, which deals with recovering the subsystems of a software system and the dependencies between them [131]. In particular, the architecture of a software system involves the organisation of the system as a composition of artefacts, mainly focused on their interaction and relationships. The comprehension of a software system at an architectural level is required in many cases [131], including:

• determining whether a system has the ability to fulfil its requirements;

• adapting a system to changing requirements;

• estimating the costs and risks of a change;

• enabling the re-use of components across several projects;

• defining product family architectures.

The different approaches proposed in the literature for the recovery of architectural information exploit structural dependencies between software artefacts to group related artefacts [11, 28, 141, 186], or rely on the lexical information provided by programmers in the source code [42, 108, 163]. In particular, the source code lexicon represents a key source of information, since developers usually embed in its terms, namely identifiers and comments, their domain knowledge about the system. As a matter of fact, to date more and more automatic reverse engineering tools exploit the lexical information of the source code to support different

maintenance tasks, such as locating concepts [54], or recovering traceability links [12, 49]. Most of these lexical-based tools [108] usually complement typical static code analysis techniques with Information Retrieval (IR) approaches [130], with the often implicit assumption that the same words are used whenever describing a particular concept [115]. Nevertheless, programmers usually name the concept they want to represent by creating identifiers with multiple words, called multi-word identifiers (e.g., toString, DynamicTable) [61]. Therefore, lexical-based software maintenance tools require more advanced techniques that are able to normalise [25, 26, 115] the source code lexicon. In particular, such a normalisation process implies that identifiers composed of multiple words are correctly split and that all possible occurring abbreviations are mapped to the corresponding (dictionary) words.
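To make the IR step mentioned above concrete, the sketch below indexes the terms extracted from a few artefacts with TF-IDF weights and compares them with cosine similarity. It is purely illustrative: the artefact names and term lists are invented, and a realistic pipeline would first normalise the vocabulary (splitting and expanding identifiers) as discussed in the next section.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Compute simple TF-IDF vectors (as dicts) for a list of token lists."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in documents]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical artefacts, already reduced to their (normalised) terms.
artefacts = {
    "OrderValidator": ["order", "validate", "customer", "total", "amount"],
    "InvoicePrinter": ["invoice", "print", "customer", "total", "format"],
    "TreeUtils":      ["tree", "node", "traverse", "depth", "visit"],
}
vectors = dict(zip(artefacts, tfidf_vectors(list(artefacts.values()))))
print(cosine(vectors["OrderValidator"], vectors["InvoicePrinter"]))  # shared domain terms -> > 0
print(cosine(vectors["OrderValidator"], vectors["TreeUtils"]))       # no shared terms -> 0.0
```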

1.2 Source Code Vocabulary Normalisation

It is widely recognised that source code identifiers play a fundamental role in supporting software analysis/maintenance tasks [25, 37, 52, 54, 86, 116, 118]. Indeed, since the documentation of many software systems is often limited and/or outdated, the lexical information provided within the code represents one of the most valuable sources for program comprehension [126], since developers usually communicate their intent and their domain knowledge by means of significant names of identifiers. In addition, the code intrinsically provides the most up-to-date information about the latest changes of an evolving system [80]. As a matter of fact, this precious information is exploited by a number of tools suited to support analysis activities, such as software clustering [42, 108], concept location [153], source code summarisation [81, 170], or recovering traceability links [12]. Popular examples include ADAMS [49] or FLAT3 [162]. The common idea is to infer the “concepts” covered by a software artefact, given the lexical information provided in its code. Unfortunately, the effectiveness of these tools is strongly dependent on the way identifiers have been defined by programmers. In fact, identifiers are usually created by concatenating multiple words according to some naming conventions, such as capitalising the first letters of

each word (e.g. toString) and/or using special characters, like the underscore (e.g. to_string) [61, 116]. Thus, a splitting step is required by these tools in order to get all the basic words constituting an identifier [126]. However, in many cases programmers break these conventions (e.g. IRDocument, MAXSIZE) [61], making an automatic splitting algorithm difficult to implement. Moreover, another typical programming habit is to shorten identifier names by using abbreviations and/or acronyms. This practice is so widespread that Hill et al., in [85], report that within the source code abbreviated forms of words are more frequent than the expanded ones. This phenomenon heavily challenges the effectiveness of software maintenance tools in exploiting lexical information, and consequently word expansion approaches can be very valuable in this context.

1.2.1 Problem Statement

To split multi-word identifiers, most existing software maintenance techniques usually employ trivial algorithms that rely on naming conventions [61]. Typical examples are the Camel Case [2] (e.g., buildTree) and the Snake Case* [1] (e.g., build_tree) naming conventions. When simple conventions are used, the splitting of multi-word identifiers is straightforward. However, there are many cases where conventions break down [61], thus affecting the effectiveness and the applicability of these kinds of algorithms. In fact, developers may decide to alternate the case of entire words instead of single characters in order to improve the readability of the identifiers (e.g., UTFtoASCII) [61], or to not alternate letter cases at all, i.e., the same-case identifiers (e.g., MAXINT). In these cases, no cues are available to identify the different composing words, and smarter and more advanced algorithms are required to split this kind of identifiers. Another important aspect to consider is that developers make heavy use of abbreviations when composing identifier names (e.g., getNextElem). Therefore, lexical-based tools must embed an additional step in their processing to map every possibly occurring abbreviation, i.e., the short form, to the corresponding original (dictionary) word, i.e., the long form.

*This name derives from the snaky shape assumed by identifiers written using this style.
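As a minimal illustration of convention-based splitting (the function name and examples below are ours, not taken from any of the cited tools), the following sketch handles camel case and underscores, and shows exactly where such heuristics stop working: same-case identifiers come back unsplit, and mixed-case identifiers such as UTFtoASCII are split incorrectly.

```python
import re

def split_by_convention(identifier):
    """Split an identifier using underscores and letter-case cues only."""
    words = []
    for part in identifier.split("_"):
        # Tokenise into runs: all-caps acronyms, Capitalised words, lowercase runs, digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part))
    return [w.lower() for w in words]

print(split_by_convention("buildTree"))   # ['build', 'tree']
print(split_by_convention("build_tree"))  # ['build', 'tree']
print(split_by_convention("XMLParser"))   # ['xml', 'parser']
print(split_by_convention("MAXINT"))      # ['maxint']             -- no cues, left unsplit
print(split_by_convention("UTFtoASCII"))  # ['ut', 'fto', 'ascii'] -- conventions break down
```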

Hill et al. [85] group the short forms into two distinct categories: single-word and multi-word short forms, based on whether they can be mapped to a single word or to multiple words. Single-word short forms are further distinguished into (I) prefix short forms, if the abbreviation is obtained by dropping the last part of the corresponding expansion (e.g., obj standing for object); and (II) dropped-letters short forms, when some letters are removed, not including the first one (e.g., msg as for MesSaGe). Empirical studies conducted in [126] indicate that vowels are the most typically removed letters. On the other hand, multi-word abbreviations are distinguished into acronyms and combinations, respectively when only the first letter or at least one letter of each composing word is considered. Examples of these two types of short forms are awt for Abstract Windows Toolkit (acronym) and OID for Object IDentifier (combination). However, it is worth noting that often there is no unique or obvious way to expand a given abbreviation, as the correct expansion likely depends on the particular domain of the system. For instance, the pnt abbreviation could be expanded into terms such as paint, pointer or point, which are all equally plausible solutions. This observation also holds for the splitting of identifiers, where several consistent decompositions could be produced, due to the intrinsic ambiguity of natural language. For example, findent may be split into f|indent or find|ent, both being valid, but only one being correct for the given context. As a result, additional disambiguation strategies are needed in order to take the specific domain of the system into account. To this aim, in recent years several research efforts have been devoted to the definition of new source code vocabulary normalisation techniques [61, 66, 80, 85, 117, 126].
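The taxonomy above can be made operational with simple string predicates, as in the sketch below. This is only an illustration of the categories (it is not the classification procedure used by Hill et al.): each function checks whether a candidate long form explains a given short form as a prefix, a dropped-letters abbreviation, or an acronym, and the last line shows the ambiguity discussed above for pnt.

```python
def is_prefix(short, long):
    """obj -> object: the short form is an initial segment of the long form."""
    return long.startswith(short)

def is_dropped_letters(short, long):
    """msg -> message: the letters of the short form appear in order, first letter kept."""
    if not short or short[0] != long[0]:
        return False
    remaining = iter(long)
    return all(ch in remaining for ch in short)  # subsequence check

def is_acronym(short, words):
    """awt -> abstract windows toolkit: one initial letter per composing word."""
    return len(short) == len(words) and all(s == w[0] for s, w in zip(short, words))

print(is_prefix("obj", "object"))                              # True
print(is_dropped_letters("msg", "message"))                    # True
print(is_acronym("awt", ["abstract", "windows", "toolkit"]))   # True
print(is_dropped_letters("pnt", "point"),
      is_dropped_letters("pnt", "paint"))                      # True True -- ambiguous expansion
```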

1.2.2 Related Work

The important role of identifiers in program comprehension tasks motivates the large body of relevant work proposed in the literature. The work presented in [66] by Feild et al. represents the first approach to deal

with the problem of automatically splitting identifiers. In this work two approaches to same-case identifier splitting are provided, based on a greedy optimisation algorithm and on a set of different identifier metrics respectively. The former exploits an English dictionary (ispell†) and a list of known abbreviations to recursively search for the longest prefix and/or suffix of a term, in order to identify splitting markers. On the other hand, the latter relies on the combination of different metrics (e.g., word length) to determine splitting results. The approach has been evaluated on 4,000 randomly chosen identifiers gathered from 746,345 C, C++, Java and Fortran systems, and results have been compared with the ones achieved by a random splitting algorithm intended as a baseline. Another splitting approach, known as Samurai, has been proposed by Enslen et al. [61]. It is based on the assumption that an identifier is composed of words that should appear elsewhere in the code. Thus the algorithm determines likely identifier splittings according to the frequencies of words mined from the source code. The technique has been evaluated on over 8,000 identifiers extracted from open source Java systems. Interestingly, Samurai only relies on the domain knowledge embedded by programmers in the code, without exploiting any external dictionary to identify the different words composing a multi-word identifier. The first works focusing on the expansion of abbreviations (short forms) occurring in source code to their corresponding long forms are the ones by Lawrie et al. [117] and by Hill et al. [85]. The former exploits different lists of potential terms, gathered from code, comments, keywords of the programming language, and an English dictionary. On the other hand, the latter proposes an approach called AMAP, based on a set of multiple regular expressions applied on different scopes. In particular, the AMAP approach has been evaluated considering 250 abbreviations randomly selected from five open source Java systems. Another remarkable contribution provided by Hill et al. [85] is the definition of a detailed taxonomy of short forms, grouped in the two main categories described in the previous section (Section 1.2.1). The abbreviation expansion problem has also been investigated by Madani et al. [126], who, to the best of our knowledge, provide the first technique to focus on both splitting and expansion of identifiers. In particular, the proposed technique

†http://www.gnu.org/software/ispell/ispell.html

follows a two-step process: first, it determines the splitting of identifiers whose compound terms are dictionary words; then, it tries to reconstruct the considered identifier by inferring the set of potential word transformations originally applied by the developers. The approach is based on an adaptation of the Dynamic Time Warping (DTW) algorithm, aiming at finding near-optimal matchings between an identifier’s substrings and words in an English dictionary. The overall approach has been evaluated using a manually-built oracle gathered from two open source software systems written in C and Java respectively. Another splitting and expansion approach, named TIDIER, has been proposed in [80]. TIDIER applies the DTW matching algorithm in combination with a set of different dictionaries of terms, representing contextual information and specialised knowledge. The authors conducted an extended case study in which they assessed the effectiveness of their approach in both splitting and expanding identifiers, using a sample of more than 1,000 identifiers randomly selected from a set of 340 open source C projects. Finally, Lawrie et al. propose the Normalize algorithm [115], which applies a two-step approach to split identifiers into their constituent parts and to expand possible occurring abbreviations. The first step is accomplished by the “generate and test” algorithmic strategy, named GenTest: the algorithm generates all the possible splittings of an identifier compound name and then tests a scoring function for each splitting, returning the best result.

  Technique                           Identifier Splitting   Abbreviation Expansion
  Greedy Splitting Algorithm [66]     yes                    no
  Neural Networks Splitting [66]      yes                    no
  Samurai [61]                        yes                    no
  Abbreviation Expansion Alg. [117]   no                     yes
  AMAP [85]                           no                     yes
  Normalize [113, 115]                yes                    yes
  DTW-based Algorithm [126]           yes                    yes
  TIDIER [80]                         yes                    yes

Table 1.1: State-of-the-art Identifier Splitting and Abbreviation Expansion Techniques

The definition of such a scoring function is obtained by the combination of multiple metrics, whose weights are statistically estimated by a Generalized Linear Multiple Model [115]. On the other hand, the expansion of abbreviations is achieved by using wild card string matching and phrase finder tools. The same authors refine this abbreviation expansion technique in [113], where the Normalization algorithm integrates a strategy that determines the most likely expansion of a term. In particular, this strategy considers the co-occurrences of terms with (I) terms appearing in the analysed short forms, and (II) terms appearing in the function from which the identifier has been extracted. Table 1.1 summarises the related works, together with an indication of the problem each of them addresses.
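To illustrate the generate-and-test idea in isolation (this is not GenTest's actual scoring function, whose metrics and statistically estimated weights are described in [113, 115]), the sketch below enumerates every possible split of a same-case identifier and ranks the candidates with a toy score, namely the fraction of parts found in a small invented dictionary, preferring fewer parts on ties.

```python
from itertools import combinations

# Toy dictionary standing in for a full English word list.
DICTIONARY = {"f", "find", "indent", "ent", "build", "tree", "max", "int"}

def all_splittings(identifier):
    """Yield every decomposition obtained by placing split points between characters."""
    positions = range(1, len(identifier))
    for k in range(len(identifier)):          # 2^(n-1) candidates: fine for short identifiers
        for cuts in combinations(positions, k):
            bounds = (0,) + cuts + (len(identifier),)
            yield [identifier[i:j] for i, j in zip(bounds, bounds[1:])]

def score(parts):
    """Toy score: fraction of parts that are dictionary words, then fewer parts is better."""
    return sum(p in DICTIONARY for p in parts) / len(parts), -len(parts)

def generate_and_test(identifier):
    return max(all_splittings(identifier.lower()), key=score)

print(generate_and_test("buildtree"))  # ['build', 'tree']
print(generate_and_test("findent"))    # ['f', 'indent'] -- 'find'|'ent' ties; context is needed
```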

1.3 Software Re-modularisation

Software architecture plays an important role in at least six aspects of software development: understanding, reuse, construction, evolution, analysis and management [57]. Specifying the architectural structure of a system is a significant issue, especially in the case of large and complex systems [131]. As a result, several approaches have been proposed in the literature to support software architecture recovery (SAR) [57]. Many of these techniques derive architectural views of the subject system from the source code by applying clustering analysis techniques‡ to software artefacts, considered at different levels of granularity (e.g., at the class level) [57].

1.3.1 Problem Statement

Architectural information represents an important resource for software maintainers to aid the comprehension, the analysis, and the maintenance of large and complex systems [73]. In fact, software architectures provide models and views representing the relationships among different software components according to a particular set of concerns [57, 155]. However, unlike classes or packages, this information does not have an explicit representation in the source code. Moreover, the

‡A more detailed description of unsupervised machine learning algorithms in general, and of clustering algorithms in particular, is reported in Chapter 2.

external documentation is usually not present or outdated. Therefore, the existing code remains the most up-to-date source of information to exploit in order to automatically retrieve and reconstruct the architecture of a software system [131, 165]. One of the typical tasks for maintainers is to locate groups of software artefacts that deal with a specific topic, in order to modify them. For instance, a maintainer could be interested in grouping all the classes that handle a given concept in the application domain, or that provide related functionalities. As a result, the analysed system is re-modularised into meaningful subsystems that are easier to maintain and comprehend. In the literature this particular analysis (sub)task of SAR approaches is usually referred to as the software re-modularisation process. The greater part of the approaches for software re-modularisation apply clustering algorithms to large software systems, to partition them into meaningful subsystems [103, 132]. A number of these approaches generally attempt to discover clusters by analysing structural dependencies between software artefacts [11, 28, 141, 186]. However, if the analysis is based on the sole structural aspect, a key source of information about the analysed software system may be lost, i.e., the domain knowledge that developers embed in the source code lexicon. As a consequence, some effort is being devoted to investigating the use of lexical information, namely source code comments and identifiers, for software re-modularisation [42, 108, 164, 165]. Typical steps in a clustering-based re-modularisation process are [103]:

1. Select entities (artefacts) to be modularised and their corresponding granularity level (e.g., classes). The overall modularisation process is based on features (attributes) possessed by the entities, namely the previously cited structural dependencies or the source code lexicon.

2. Identify the similarity measures and algorithms to be employed in the clus- tering analysis.

3. Evaluate the different partitions of software artefacts generated by the selected clustering algorithm. A minimal illustrative sketch of these three steps is given below.
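The sketch that follows walks through the three steps on made-up data: each entity is described by the set of modules it depends on, the similarity between entities is the Jaccard coefficient of their dependency sets, and a naive agglomerative procedure merges the most similar groups (pooling their dependency sets) until no pair exceeds a similarity threshold. Entity names, dependencies, and the threshold are illustrative assumptions, not data from any system analysed in this thesis.

```python
def jaccard(a, b):
    """Jaccard coefficient between two dependency sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def agglomerate(entities, threshold=0.3):
    """Naive agglomerative clustering: repeatedly merge the two most similar clusters."""
    clusters = [({name}, deps) for name, deps in entities.items()]
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: pair[1])
        if best < threshold:
            break
        names_i, deps_i = clusters[i]
        names_j, deps_j = clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names_i | names_j, deps_i | deps_j))
    return [names for names, _ in clusters]

# Step 1: entities (classes) and the modules they depend on (invented example).
entities = {
    "OrderService":   {"Order", "Customer", "Db"},
    "InvoiceService": {"Invoice", "Customer", "Db"},
    "TreePrinter":    {"Tree", "Node", "Console"},
    "TreeWalker":     {"Tree", "Node"},
}
# Steps 2-3: run the clustering and inspect the resulting partition.
print(agglomerate(entities))  # two subsystems: the *Service classes and the Tree* classes
```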

25 1.3.2 Related Work

The definition of effective solutions for documenting software architectures is a longstanding and relevant research topic in the field of software maintenance [57, 103, 108, 176, 179]. The greater part of the approaches for SAR apply clustering algorithms to large software systems [103, 132], and their most relevant aspects will be described in the remainder of this section. A complete and extensive survey of SAR techniques is proposed by Ducasse et al. [57], where the authors provide an accurate taxonomy of the proposed approaches based on the analysis of five distinct aspects, namely the goals, the process, the inputs, the techniques and the outputs. Table 1.1 summarises the state of the art regarding software clustering for the recovery of software architectures. To provide a detailed overview of the different approaches, in the following the related literature is presented with respect to the information exploited in the clustering process, namely structural information, lexical information, and their combination.

Structural-based approaches: The works proposed by Wiggerts [186] and by Anquetil and Lethbridge [11] represent the first two contributions to semi-automatic approaches for the clustering of software entities. In particular, in [11] the authors present a comparative study of different hierarchical clustering algorithms based on structural information. However, the proposed solutions require human decisions (e.g., cutting points of the dendrograms) to get the best partition of software entities into clusters. Similarly, Tzerpos and Holt [177] present a comparative study of a number of software clustering algorithms aiming at investigating the stability of modularisation results obtained on a set of different software systems. The comparison is conducted by generating randomly “perturbed” versions of an example system. Differences between the partition identified by the clustering algorithms and the original partition of the system are measured by using the MoJo distance [178]. Maqbool and Babri in [132] highlight the features of hierarchical clustering research in the context of SAR. Special emphasis is placed on the analysis of different

Technique Type          Authors and Reference           Clustering Algorithm                        Approach Type
Structural              Anquetil and Lethbridge [11]    Hierarchical Clustering                     Semi-automatic
Structural              Mitchell and Mancoridis [141]   BUNCH                                       Automatic
Structural              Doval et al. [55]               Genetic algorithms                          Automatic
Structural              Mahdavi et al. [127]            Genetic algorithms                          Automatic
Structural              Bittencourt and Guerrero [28]   Edge Betweenness; K-means;                  Semi-automatic
                                                        Module quality; Design Matrix
Structural              Wu et al. [187]                 Hierarchical Clustering; Prog. Comp.        Semi-automatic
                                                        Patterns; BUNCH
Structural              Tzerpos and Holt [177]          Hierarchical Clustering                     Semi-automatic
Structural              Sartipi and Kontogiannis [161]  Data mining techniques                      Semi-automatic
Lexical                 Kuhn et al. [108]               K-means                                     Semi-automatic
Lexical                 Risi et al. [156]               K-means                                     Automatic
Lexical                 Scanniello et al. [165]         K-means                                     Automatic
Lexical and Structural  Maqbool and Babri [132]         Hierarchical Clustering                     Semi-automatic
Lexical and Structural  Maletic and Marcus [128]        Minimum Spanning Tree                       Semi-automatic
Lexical and Structural  Andritsos and Tzerpos [10]      LIMBO                                       Semi-automatic
Lexical and Structural  Scanniello et al. [163, 164]    K-means                                     Automatic

Table 1.1: Overview of architecture recovery approaches

similarity and distance measures that could be effectively used in clustering software artefacts. The main contribution of the paper is, however, the analysis of two clustering-based approaches and their experimental assessment. The discussed approaches try to reduce the number of decisions to be taken during the clustering. The authors also conducted an empirical evaluation of the clustering-based approaches on four large software systems. Mitchell and Mancoridis in [141] present a novel clustering algorithm, named Bunch. Bunch produces system decompositions by applying search-based techniques in combination with several heuristics, such as the coupling and cohesion of the produced partitions, specifically designed for the clustering of software artefacts. In particular, the cohesion and the coupling heuristics are defined in terms of intra- and inter-cluster dependencies, respectively. The evaluation of the produced partitions has been conducted according to qualitative and quantitative empirical investigations. Similarly, Doval et al. [55] propose a structural approach based on genetic algorithms to group software entities in clusters. A search-based approach is also proposed in [127]. In order to automate the software partitioning, the authors use dependencies between modules to maximise cohesion within each cluster and to minimise coupling between clusters. Clustering algorithms based on structural information have also been used in the analysis of software architecture evolution [28, 187]. Wu et al. in [187] present a comparative study of a number of clustering algorithms: (a) hierarchical agglomerative clustering algorithms based on the Jaccard coefficient [89] and the single/complete linkage update rules [130]; (b) an algorithm based on program comprehension patterns that tries to recover subsystems that are commonly found in manually-created decompositions of large software systems; and (c) a customised configuration of an algorithm implemented in Bunch [141]. Similarly, Bittencourt and Guerrero [28] present an empirical study to evaluate four widely known clustering algorithms on a number of software systems implemented in Java and C/C++. The analysed algorithms are: edge betweenness clustering, k-means clustering, modularisation quality clustering, and design structure matrix clustering. Sartipi and Kontogiannis [161] present an interactive approach composed of four phases to recover cohesive subsystems within C systems. In the first phase

relations between C programs are extracted. In the second phase these relationships are used to build an attributed relational graph, while in the third phase the graph is manually or automatically partitioned using data mining techniques. A case study is conducted to assess the validity of the approach.

Lexical-based approaches: Software clustering approaches exploiting lexical information are based on the idea that the lexicon provided by developers in the source code represents a key source of information. In particular, such techniques mine relevant information from source code identifiers and comments, based on the assumption that related artefacts are those that share the same vocabulary. The approach proposed by Kuhn et al. [108] constitutes one of the first proposals in this direction, defining an automatic technique based on the application of the Latent Semantic Indexing (LSI) method [51]. The approach is language independent and mines the lexical information gathered from source code comments. In addition, the approach enables software engineers to identify topics in the source code by means of a labelling of the identified clusters. To identify how the clusters are related to each other, a correlation matrix is used. The authors perform a qualitative analysis of the clustering results, while no quantitative analysis is executed. Similarly, Risi et al. [156] propose an approach that uses LSI and the k-means clustering algorithm to form groups of software entities that implement similar functionality. A variant based on fold-in and fold-out is introduced as well. Furthermore, this proposal provides an important contribution on the analysis of the computational costs necessary to assess the validity of a clustering process. Scanniello et al. [165] present an approach to automate software system partitioning. This approach first analyses the software entities (e.g., programs or classes) and uses LSI to get the dissimilarity between the entities, which are then iteratively grouped using the k-means clustering algorithm. The approach is implemented in a prototype of a supporting software system to partition Java and C/C++ software systems. To assess the validity of the approach, a case study on open source software systems has been conducted.
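As a rough illustration of the LSI-plus-clustering idea shared by these lexical approaches (and not a reimplementation of any of the cited tools), the sketch below builds an LSI space with scikit-learn and then groups toy "documents" of identifiers with k-means; the documents and the number of clusters are invented.

    # Rough sketch of LSI followed by k-means on invented identifier "documents".
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline

    # Each "document" is the bag of identifiers/comments extracted from one class.
    documents = [
        "order repository save order database connection",
        "order controller handle request order service",
        "invoice printer render pdf page layout",
        "pdf writer page font render stream",
    ]

    # LSI = tf-idf weighting + truncated SVD into a low-dimensional concept space.
    lsi = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
    concept_space = lsi.fit_transform(documents)

    # k-means groups classes whose vocabulary maps to nearby concept vectors.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(concept_space)
    print(labels)   # e.g. order-handling classes vs. PDF-rendering classes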

Structural and Lexical-based approaches: Maletic and Marcus in [128] propose an approach based on the combination of lexical and structural information to support comprehension tasks within the maintenance and reengineering of software systems. From the lexical point of view, they consider the problem and development domains. On the other hand, the structural dimension refers to the actual syntactic structure of the program, along with the control and data flow that it represents. Software entities are compared using LSI, while file organisation is used to get structural information. To group programs in clusters, a simple graph-theoretic algorithm is used. The algorithm takes as input an undirected graph (obtained from the cosine similarity between the vector representations of all the source code documents) and then constructs a Minimal Spanning Tree (MST). Clusters are identified by pruning the edges of the MST with a weight larger than a given threshold. To assess the effectiveness of the approach, some case studies on a version of Mosaic are presented and discussed. Andritsos and Tzerpos in [10] present LIMBO, a hierarchical algorithm for software clustering. The clustering algorithm considers both structural and non-structural attributes to reduce the complexity of a software system by decomposing it into clusters. The authors also apply LIMBO to three large software systems. Scanniello et al. [163] present a two-phase approach for recovering hierarchical software architectures of object-oriented software systems. The first phase uses structural information to identify software layers [164]. To this end, a customisation of the Kleinberg algorithm [100] is used. The second phase uses lexical information extracted from the source code to identify the similarity among pairs of classes, and then partitions each identified layer into software modules. The main limitation of this approach is that it is only suitable for software systems exhibiting a classical tiered architecture.

1.4 Clone Detection

To date, code duplications (also called software clones or simply clones) occur frequently in large systems: ad-hoc reuse through copy-and-paste is a common habit among developers that affects the maintainability of software systems [139]. Several authors report that, within large software systems, duplicated code usually accounts for between 7% and 23% of the total code [18, 19]. The presence of clones in a software system is considered risky for the execution of maintenance operations [21]. However, clones in software are usually not documented, and their identification is not a trivial task in the case of large systems. Therefore, automatic approaches are required to support the software maintainer. As a result, in recent years the identification of code clones has become a very active research area [21, 159]. In particular, the different approaches proposed in the literature are more or less automated and require different levels of expertise to let maintainers effectively use them. These approaches generally take into account either the syntactic structure (e.g., the abstract syntax tree) or lexical information (e.g., the signature of a function) [104]. However, code duplications represent an issue not only for the maintenance but also for the evolution of a system. In fact, the different duplications inevitably merge with existing code during the evolution, and it becomes unclear which part of the code relates to which source of change [92]. To this aim, much research effort is being devoted to maintaining software clones during the evolution [149], tracking the different changes in the code in order to distinguish copies from the originals [16, 56, 107]. This particular task is referred to in the literature as the problem of code provenance [107].

1.4.1 Problem Statement

The reasons why programmers duplicate code are manifold. The most well known is a common bad programming practice: copying and pasting [129, 172]. Duplications and redundancies make it hard to understand the many code variants, as they increase the size of the code and the effort necessary to maintain it [139]. In particular, software clones should be known and well documented in order to apply consistent changes.

The main issue in the management of clones is that errors in the original version must be fixed in every clone. Furthermore, clones also complicate the execution of other kinds of maintenance operations, e.g., extensions or adaptations. Thus, the presence of code clones is one of the factors that complicates the maintenance and evolution of a software system [58]. To make things worse, code clones are usually not documented, and so their location in the source code is not known. This implies that maintainers have to detect them [77] in order to properly perform maintenance operations. In the case of small software systems the detection of clones may be performed manually, but on large software systems it can be accomplished only by means of automatic or semi-automatic approaches [104]. More recently, a number of studies pointed out that the presence of clones may be not so risky for software maintenance and evolution [16, 96, 173]. Despite the controversy on the risks related to the presence of code clones, there is a large consensus on the need of detecting them [104]. There is no agreement in the research community on the exact notion of redundancy and cloning [139]. Ira Baxter's definition of clones expresses this vagueness [20]:

Clones are segments of code that are similar according to some definition of similarity. (Baxter, 1998)

In particular, this definition allows different notions of similarity, which can be based on the program text or on some notion of semantic aspects. Furthermore, similarities on the program text may be based on lexical or syntactic structures. On the other hand, semantic similarities relate to the observable behaviour: a fragment of code A is “semantically” similar to another fragment B if B subsumes the functionality of A, i.e., they have “similar” pre and post conditions [139]. Nevertheless, it is worth considering that the two different notions of similarity are not necessarily equivalent. In fact, it is not always true that “two fragments of code are semantically similar if and only if their program text is similar”. In other words, even if the program text of two fragments of code is similar, their behaviours are not necessarily equivalent or subsumed. For instance, two pieces of code may be identical at the textual level, including all variable names that occur within, but the variable names may be bound to different declarations in the different

contexts. In that case, the execution of the two fragments exhibits different behaviours [139].

 1  int sum = 0;
 2  void a_function(Iterator iter){
 3      for (Item item = iter.first(); iter.has_next(); item = iter.next()){
 4          sum = sum + item.value();
 5      }
 6  }
 7  void b_function(Iterator iter){
 8      int sum = 0;
 9      for (Item item = iter.first(); iter.has_next(); item = iter.next()){
10          sum = sum + item.value();
11      }
12  }

Listing 1.1: Example of code clones (adapted from [139])

The two functions reported in Listing 1.1 represent an example of two textually equivalent clones, in the line ranges 3-5 and 9-11, respectively. The two functions iterate on a collection of numbers, summing the value of each element to the variable sum. However, while the former (i.e., a_function) sets the global variable sum, the latter (i.e., b_function) sets a local variable with the same name. As a consequence, even if their program text is exactly the same, they are not semantically equivalent at a concrete level.

1.4.2 Clone Terminology and definitions

A first step towards a more formal definition of software clones is provided in [92], where the authors define software clones as:

program structures of considerable size that exhibit significant similarity. The actual size and similarity (which can be measured, for example, in terms of replicated lines of code) vary depending on the context. Clones may represent similar program structures of any kind and granularity.

In particular, two different types of clones are distinguished:

1. Simple clones: the same or similar segments of contiguous code, such as class methods or fragments of method/function implementation.

2. Structural clones: patterns of interrelated components (e.g., collaborating classes) representing design solutions repeatedly applied by programmers to solve similar problems, and similar program modules or subsystems comprising many components.

These two definitions give software clones a broader meaning, as they take into account not only duplications at the code level, but also at the design or architecture levels (e.g., common design patterns or duplicated subsystems). Although these definitions are rather general, they emphasise the difficulty of defining the concept of similarity (already expressed in Baxter's definition [20]) in precise and descriptive terms. In fact, the notion of similarity necessarily involves human judgement and, therefore, is inherently subjective [92]. To this aim, Kapser et al. [96] elicited judgements and discussions from world experts regarding what characteristics define a code clone. Their experiments concluded that less than half of the clone candidates they presented to these experts reached 80% agreement amongst the judges. However, judges appeared to differ primarily in their criteria for judgement rather than in their interpretation of the clone candidates [139]. More recently, a more precise definition of code clones has been proposed that is now largely accepted in the research community. This definition categorises the different kinds of duplications in a set of four different Types [21, 159]. The terms and definitions reported below refer to the survey on clone detection research provided by Roy et al. in [159].

Definition 1.1. Code Fragment. A code fragment (CF) is any sequence of code lines (with or without comments). It can be of any granularity, e.g., a function definition, a begin-end block, or a sequence of statements. A CF is identified by the name of its source file and its corresponding line numbers in the original code. Thus, a code fragment is denoted as a triple

CF = (CF.filename, CF.beginline, CF.endline)

Definition 1.2. Code Clone. A code fragment CF2 is a clone of another code fragment CF1 if they are similar by some given definition of similarity. Two fragments that are similar to each other form a clone pair (CF1, CF2), and when many fragments are similar, they form a clone class or clone group.

This definition is invariant to the different possible transformations applied to the two code fragments, i.e., if CF1 and CF2 form a clone pair, f(CF1) and f(CF2) still remain clones.
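Definitions 1.1 and 1.2 translate almost directly into a small data model; the sketch below is only illustrative (the file names, line numbers, and the similarity predicate are placeholders, not part of the original definitions).

    # Illustrative transcription of Definitions 1.1 and 1.2 (invented values).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class CodeFragment:
        filename: str
        beginline: int
        endline: int

    def is_clone_pair(cf1: CodeFragment, cf2: CodeFragment, similar) -> bool:
        """Two fragments form a clone pair under a given notion of similarity."""
        return similar(cf1, cf2)

    # A clone class (or clone group) is a set of pairwise-similar fragments.
    group = {CodeFragment("iterator.c", 3, 5), CodeFragment("iterator.c", 9, 11)}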

Definition 1.3. Clone Types. There are two main kinds of similarity between code fragments. Fragments can be similar based on the similarity of their program text, or they can be similar based on their functionality (i.e., independently of their text). In particular, four different Types of clones are distinguished, based on textual (Types 1 to 3) [21] and functional (Type 4) [71, 101] similarities:

Type 1: Identical code fragments except for variations in whitespace, layout and comments.

Type 2: Syntactically identical fragments except for variations in identifiers, literals, whitespace, layout and comments.

Type 3: Copied fragments with further modifications such as changed, added or removed statements, in addition to variations in identifiers, literals, whitespace, layout and comments.

Type 4: Two or more code fragments that perform the same computation but are implemented by different syntactic variants.

The taxonomy of textual-based clones (Types 1 to 3) has been further detailed and extended by Evans et al. [64], and is also reported in [174, 175]:

• Exact Clone (Type 1) is an exact copy of consecutive code fragments without modifications. That is, the transformation to the code is the identity.

• Parameter-substituted clone (Type 2) is a copy where only parameters (identifiers or literals) have been substituted. Given a suitable parameter substitution, the transformed copy is a Type 1 clone [64].

• Structure-substituted clone (Type 3) is a copy where program structures (complete subtrees in the syntax tree) have been substituted. Given a suitable structure substitution, the transformed copy is a Type 1 clone [64]. For parameter-substituted clones, only one leaf in the syntax tree can be replaced by another leaf; for structure-substituted clones, larger subtrees can be replaced.

• Modified clone (Type 3) is a copy whose modifications go beyond structure substitutions by added and/or deleted code.
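To make the textual Types concrete, the hypothetical Python fragment below is shown next to a Type 1, a Type 2, and a Type 3 variant; the function names and the extra guard are invented for illustration.

    # Hypothetical fragment and its textual variants (Types 1-3).

    def total(items):                # original fragment
        s = 0
        for it in items:
            s += it.value
        return s

    def total_copy(items):           # Type 1: identical up to layout/comments
        s = 0
        for it in items:   # copied verbatim, only this comment was added
            s += it.value
        return s

    def sum_prices(products):        # Type 2: identifiers systematically renamed
        acc = 0
        for p in products:
            acc += p.value
        return acc

    def sum_valid_prices(products):  # Type 3: renamed *and* a statement added
        acc = 0
        for p in products:
            if p.value < 0:          # extra guard not present in the original
                continue
            acc += p.value
        return acc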

Similarly, Kim et al. [99] define subtypes for Type 4 clones, i.e., functional clones:

• Control replacement with semantically equivalent control structures (Listing 1.2).

• Statement re-ordering without modifying the semantics (Listing 1.4).

• Statement insertion without changing the computation (Listing 1.5).

• Statement modification with preserving memory behaviour (Listing 1.3).

PyObject *PyBool_FromLong(long ok) {
    PyObject *result;
    if (ok)
        result = Py_True;
    else
        result = Py_False;
    Py_INCREF(result);
    return result;
}

static PyObject *get_pybool(int istrue) {
    PyObject *result = istrue ? Py_True : Py_False;
    Py_INCREF(result);
    return result;
}

Listing 1.2: Type 4 clone: control replacement from Python system. The if-else statement is substituted by the ternary conditional operator (adapted from [99])

void appendPQExpBufferChar(PQExpBuffer str, char ch) {
    /* Make more room if needed */
    if (!enlargePQExpBuffer(str, 1))
        return;
    /* OK, append the data */
    str->data[str->len] = ch;
    str->len++;
    str->data[str->len] = '\0';
}

void appendBinaryPQExpBuffer(PQExpBuffer str, const char *data, size_t datalen) {
    /* Make more room if needed */
    if (!enlargePQExpBuffer(str, datalen))
        return;
    /* OK, append the data */
    memcpy(str->data + str->len, data, datalen);
    str->len += datalen;
    str->data[str->len] = '\0';
}

Listing 1.3: Type 4 clone: preserving memory behaviour from PostgreSQL system. (adapted from [99])

static const char *set_access_name(cmd_parms *cmd, void *dummy, const char *arg) {
    void *sconf = cmd->server->module_config;
    core_server_config *conf = ap_get_module_config(sconf, &core_module);
    const char *err = ap_check_cmd_context(cmd, NOT_IN_DIR_LOC_FILE|NOT_IN_LIMIT);
    if (err != NULL) {
        return err;
    }
    conf->access_name = apr_pstrdup(cmd->pool, arg);
    return NULL;
}

static const char *set_protocol(cmd_parms *cmd, void *dummy, const char *arg) {
    const char *err = ap_check_cmd_context(cmd, NOT_IN_DIR_LOC_FILE|NOT_IN_LIMIT);
    core_server_config *conf = ap_get_module_config(cmd->server->module_config,
                                                    &core_module);
    char *proto;
    if (err != NULL) {
        return err;
    }
    proto = apr_pstrdup(cmd->pool, arg);
    ap_str_tolower(proto);
    conf->protocol = proto;
    return NULL;
}

Listing 1.4: Type 4 clone: statement reordering from Apache system (adapted from [99])

const char *GetVariable(VariableSpace space, const char *name) {
    struct _variable *current;
    if (!space)
        return NULL;
    for (current = space->next; current; current = current->next) {
        if (strcmp(current->name, name) == 0) {
            return current->value;
        }
    }
    return NULL;
}

const char *PQparameterStatus(const PGconn *conn, const char *paramName) {
    const pgParameterStatus *pstatus;
    if (!conn || !paramName)
        return NULL;
    for (pstatus = conn->pstatus; pstatus != NULL; pstatus = pstatus->next) {
        if (strcmp(pstatus->name, paramName) == 0)
            return pstatus->value;
    }
    return NULL;
}

Listing 1.5: Type 4 clone: statement insertion without changing computation from PostgreSQL system (adapted from [99])

1.4.3 Related Work

The different research contributions on clone detection can be grouped according to the particular kind of information exploited in the analysis of source code fragments. In particular, in Table 1.2 the different proposals are grouped according to the information they exploit and the types of clones they are able to detect. Note that our goal here is not to provide an extensive analysis of the clone detection approaches presented in the literature, but to provide an overview of the most relevant techniques together with a general background on the problem, necessary to introduce the proposal presented in Chapter 5. An exhaustive survey of clone detection tools and techniques is provided in [159].

Technique Type          Author(s) and Reference       Approach
Textual                 Ducasse et al. [58]           String matching
Textual                 Johnson [94]                  Pattern matching
Token                   Baker [18]                    Suffix-tree matching
Token                   Kamiya et al. [95]            Suffix-tree matching
Metrics                 Balazinska et al. [19]        Metric vectors distance
Syntax Tree (AST)       Yang [189]                    Dynamic programming
Syntax Tree (AST)       Baxter et al. [20]            Tree matching
Syntax Tree (AST)       Koschke et al. [105]          Suffix-tree matching
Syntax Tree (AST)       Bulychev et al. [33]          Anti-unification
Syntax Tree (AST)       Jiang et al. [93]             LSH
Dependency Graph (PDG)  Komondoor et al. [101]        Program slicing
Dependency Graph (PDG)  Gabel et al. [71]             PDG matching
Dependency Graph (PDG)  Krinke [106]                  PDG matching
Other                   Leitão [121]                  Software metrics
Other                   Wahler et al. [182]           Frequent item-sets
Other                   Marcus and Maletic [133]      LSI
Other                   Roy and Cordy [157, 158]      Code transformations
Other                   Yuan et al. [191]             Count Matrix
Other                   Kim et al. [99]               Memory state matching

Table 1.2: Overview of clone detection techniques

Textual-based approaches: Ducasse et al. [58] propose a language-independent approach to detect code clones, based on the following three steps: line-based string matching, visual presentation of the cloned code, and detailed textual reports of the clones. A different approach has been proposed by Johnson [94], where the author presents a prototype based on fingerprints to identify exact repetitions (i.e., Type 1 clones) of text in the source code of large software systems. Both these techniques leverage the efficiency and the scalability of string-matching algorithms, which make them perfectly suitable for the analysis of large software systems. However, their detection capabilities are very limited and restricted to very similar textual duplications (clones of Types 1 and 2). For the sake of completeness, it is worth mentioning that textual approaches are also used to detect cloned web pages, such as in [50]. Clone detection approaches for web applications have not been included in Table 1.2, as the types of identified clones do not follow the classification proposed in the literature.

Token-based approaches: Baker [18] suggests an approach to identify duplications and near-duplications (i.e., copies with slight modifications) in large software systems. The proposed approach is able to identify source code copies that are substantially the same except for global substitutions. Similarly, Kamiya et al. [95] use a suffix-tree matching algorithm to compute token-by-token matching among source code fragments. The authors adopt optimisation techniques that normalise token sequences, since the underlying algorithm may be expensive when used on large software systems. To assess the validity of their approach, the authors also propose a prototype of a supporting system and apply it to software systems implemented in C, C++, Java, and COBOL. The main drawback of these approaches is that they completely disregard the syntactical information of the source code, similarly to textual-based techniques. As a consequence, these solutions may detect a large number of false clones that do not correspond to any actual syntactical unit.
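The sketch below illustrates, on an invented one-line example, why token-based tools normalise identifiers and literals: after normalisation, a Type 2 copy yields exactly the same token sequence as the original. It uses Python's standard tokenize module merely as a convenient tokenizer and is not a reconstruction of the cited tools.

    # Toy illustration of token normalisation: identifiers and literals are
    # replaced by placeholders, so Type 2 copies yield identical token streams.
    import io
    import tokenize

    def normalised_tokens(source: str):
        stream = tokenize.generate_tokens(io.StringIO(source).readline)
        out = []
        for tok in stream:
            if tok.type == tokenize.NAME:
                out.append("ID")
            elif tok.type in (tokenize.NUMBER, tokenize.STRING):
                out.append("LIT")
            elif tok.type == tokenize.OP:
                out.append(tok.string)
        return out

    original = "total = price * quantity + 10\n"
    renamed  = "amount = cost * items + 99\n"      # Type 2 copy of the line above
    assert normalised_tokens(original) == normalised_tokens(renamed)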

Metrics-based approaches: Balazinska et al. in [19] suggest a classification schema based on software metrics to detect code clones. This approach builds metric vectors and then computes the distances between these vectors as a clue for similar code fragments. The approach detects clones of Types 1, 2, and 3. A similar approach to identify scripting functions within web pages is presented by Calefato et al. in [34].

Syntax Tree-based approaches: Syntactic approaches exploit the information provided by Abstract Syntax Trees (ASTs) to identify similar code fragments. These techniques are more robust than the previous ones and are able to deal with larger degrees of modification among the cloned code fragments. Nevertheless, they may fail in case the modifications concern the inversion or the substitution of entire code blocks: the so-called gapped clones [106]. Yang [189] defined a dynamic programming-based algorithm to detect differences between two versions of the same source file. Clones of Types 1 and 2 can be identified. A similar approach is presented by Baxter et al. [20]. It is based on a tree matching algorithm that compares the different sub-trees of the AST of a given software system. On the other hand, Koschke et al. [105] describe an approach to detect clones based on suffix trees of serialised ASTs. The main contribution of this work concerns the computational efficiency of the proposed solution. In fact, the approach is able to identify software clones in linear time and space. This approach has been extended in [65] to improve clone detection effectiveness for both Java and C software systems. A case study based on an extension of Bellon's benchmark [21] is also conducted to assess the validity of the approach. A different approach is presented by Bulychev et al. [33], where the authors propose a clone detection technique based on the anti-unification algorithm, widely used in natural language processing tasks. A novel technique for detecting similar trees has been presented by Jiang et al. [93]. The authors propose an automated algorithm and a corresponding prototype tool named Deckard. This algorithm defines specialised characteristic vectors for each code fragment to approximate the structure of ASTs in a Euclidean space.

Locality Sensitive Hashing (LSH) is then used to cluster similar vectors using the Euclidean distance metric.
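The following sketch conveys the characteristic-vector idea in a loose form (it is not the actual Deckard algorithm): each fragment is summarised by counts of a few AST node kinds, and fragments whose vectors are close in Euclidean distance become clone candidates. The chosen node kinds, the example functions, and the omission of LSH are simplifications.

    # Loose sketch of characteristic vectors for AST-based clone candidates.
    import ast
    import math
    from collections import Counter

    def characteristic_vector(source, kinds=("For", "If", "Assign", "Call", "Return")):
        counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
        return [counts[k] for k in kinds]

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    f1 = "def f(xs):\n    s = 0\n    for x in xs:\n        s = s + x\n    return s\n"
    f2 = "def g(ys):\n    t = 0\n    for y in ys:\n        t = t + y\n    return t\n"

    v1, v2 = characteristic_vector(f1), characteristic_vector(f2)
    print(v1, v2, euclidean(v1, v2))   # identical vectors -> distance 0.0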

Dependency Graph-based approaches: Dependency graph-based approaches generally use algorithms to identify isomorphic sub-graphs within a graph built considering the control and data flow dependencies (i.e., the program dependence graph, PDG) of the software system to analyse. The main advantage of these techniques is that they do not depend on the particular textual representation of the code, allowing them to detect also “functional” duplications (i.e., Type 4 clones), in addition to the textual-based ones considered by the previous approaches (Types 1-3). However, the identification of isomorphic sub-graphs is an NP-hard problem and only approximate solutions may be provided. Komondoor and Horwitz [101] propose an approach based on program slicing techniques, applied on a program dependence graph. Moreover, a heuristic to identify isomorphic sub-graphs is proposed by Krinke in [106]. More recently, Gabel et al. [71] propose a dependency graph-based technique that maps slices of PDGs to syntax subtrees and applies the Deckard clone detection algorithm [93].

Other Approaches: The literature also presents techniques that combine some of the aspects discussed so far. For example, Leitão [121] combines syntactic and semantic techniques using functions that consider various aspects of software systems (e.g., similar call sub-graphs, commutative operators, user-defined equivalences). This approach is defined to detect cloned source code of Types 1 and 2 in large software systems implemented in Lisp. An approach based on an information retrieval technique is presented by Marcus and Maletic [133]. This approach detects clones of Types 1 and 2. To this end, similarities among source code fragments (treated as plain text) are computed using a measure based on Latent Semantic Indexing [130]. Differently, Wahler et al. [182] present an approach based on data mining techniques to detect clones of Types 1 and 2. This approach uses the concept of frequent item-sets on the XML representation of the software system to be

analysed. On the other hand, Roy and Cordy [157, 158] present an approach based on source transformations and text line comparison to find clones of Types up to 3. The validity of the approach has been empirically verified on some C, Java and C# software systems. Yuan et al. [191] propose a language-independent approach named Count Matrix Clone Detector (CMCD). The key concept behind CMCD is the so-called Count Matrix algorithm, which is able to represent the characteristics of a code segment by counting the occurrence frequencies of every variable in pre-determined counting conditions [191]. The approach has been applied to the 16 clone scenarios proposed in the literature and to the JDK 1.6 project. Empirical results show that the approach is able to detect clones up to Type 4. Kim et al. [99] propose a semantic clone detection technique that is able to compare programs' abstract memory states, which are computed by a semantic-based static analyser. The approach has been evaluated in an experimental study on three large open source projects written in C, which assesses the ability of the technique to detect clones up to Type 4. Another important contribution of this paper concerns the proposal of a more refined characterisation of Type 4 clones.

I don't fear computers. I fear the lack of them.
Isaac Asimov

2 Machine Learning and Pattern Matching Techniques

Machine Learning (ML) is defined as “the systematic study of algorithms and systems that improve their knowledge or performance with experience” [68]. As a matter of fact, machine learning comprises algorithms and techniques that are able to gain insight from a (usually large) data set, in order to turn data into meaningful information [83]. This data-driven methodology [167] represents one of the most important characteristics of ML approaches, which differentiates them from classical Artificial Intelligence techniques. On the one hand, ML approaches do not require that all the necessary knowledge (i.e., facts and rules) be specified in the knowledge base from the very beginning [131]. On the other hand, ML algorithms are able to generalise, namely they are able to adapt their behaviour according to the input data and try to infer a solution. Conversely, classical knowledge-based systems are not capable of adapting their behaviour to new data, unless the necessary information

is incorporated into the knowledge base itself [131]. Therefore, such generalisation capabilities practically define the concept of learning through experience that is a specific peculiarity of ML techniques. More formally, learning may be defined as [142]:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”

Moreover, Flach in [68] declares that:

“Machine learning is concerned with using the right features to build the right models that achieve the right tasks”.

Differently from classical definitions mainly focused on the concept of learning, the one proposed by Flach emphasises the fundamental elements (called “ingredients” by the author) constituting ML approaches. In particular, such ingredients “may come in many different forms, and need to be chosen and combined carefully to create a successful meal, namely the machine learning application” [68]. Indeed, the development of a ML application encompasses the following steps [83, 193] (a brief sketch after the list illustrates steps 3-5):

1. Problem formulation: The first crucial step concerns the formulation of a given problem in terms of the selected ML approach. Actually, there are several learning methods that are based on different theoretical backgrounds and adopt different algorithmic strategies. All these aspects characterise the tasks of the different ML techniques, and must be taken into consideration during the problem formulation [193]. In fact, as observed in [112], the power of ML methods does not come from a particular induction method, but rather from the problem formulation.

2. Problem representation: The next step is to select an appropriate representation for both the data and the knowledge to be learned (i.e., the model). Different learning methods require different formalisms [193], e.g., some approaches are based on a vectorial representation of data, while others require processing structured data such as trees or graphs.

3. Data collection: Data represent the main “ingredient” of ML algorithms. In particular, ML approaches strongly depend on the quality and the quantity of the data available to carry out the learning process. Therefore, data must be properly sanitised to remove spurious values and converted to a uniform format. To this aim, several data manipulation and data crunching techniques are required to tackle such problems. Furthermore, data are fundamental for the definition of features, which determine much of the success of ML applications, because a model is only as good as its features [68]. Roughly speaking, a feature can be thought of as a function that maps the data from one domain to another, i.e., the feature space. This definition will be further detailed in Section 2.4, where Kernel methods will be presented.

4. Perform the learning process: Once the data have been collected, the learning process can be accomplished. This step represents the core of ML algorithms [83], where the data are actually analysed. Depending on the specific strategy, data may be separated into two different sets, namely the training and the test sets, exploited in an iterative process to train and evaluate the learning, respectively (Section 2.2).

5. Analyse and Evaluate the learned Knowledge: The final step of ML applications concerns the evaluation of the performance of the learning process. Depending on the type of application, i.e., automatic or semi-automatic, this step could also be an integral part of the learning process itself. Besides performance evaluation, this step is also devoted to the resolution of practical problems, such as overfitting or local optima [167], that may affect learning methods. Some possible causes could be data inadequacy, noise, or irrelevant attributes in the data.
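The following compact illustration covers steps 3-5 on a synthetic labelled data set, assuming scikit-learn is available; the data, the model, and the split ratio are arbitrary choices made only to show the train/evaluate cycle.

    # Compact illustration of steps 3-5 on an invented toy data set.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Step 3: data collection (synthetic here) and feature definition.
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # Step 4: learning on the training portion only.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    # Step 5: evaluation on held-out data, to spot problems such as overfitting.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))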

A summary representation of a classical ML application is depicted in Figure 2.1. In particular, a task (red box) requires an appropriate mapping, i.e., a model from data described by features to outputs. Obtaining such a mapping is what constitutes a learning model (blue box). Depending on the particular learning

strategy, such a mapping could possibly leverage some training data (dashed blue line).

Figure 2.1: An overview of how the different components of a machine learning application are organised (adapted from [68]).

Machine learning lies at the intersection of computer science, engineering, and statistics [83], and it can be applied to many problems. As a matter of fact, ML algorithms have proven to be of great practical value in a variety of application domains [193]: any field that needs to interpret and act on data can benefit from machine learning techniques [83]. Nevertheless, the application of ML approaches to a software engineering (SE) problem requires addressing certain issues [193]: on the one hand, it is necessary to (re-)formulate the SE task in terms of a learning problem, in order to allow the application of ML algorithms. On the other hand, it is important to select a technique that is appropriate to solve the problem under consideration [131]. Moreover, it is required to adapt and modify the selected ML techniques in order to make them suitable for analysing software data. An extensive survey of SE approaches that leverage ML techniques is provided in [193] and in [192].

2.1 Definitions and Notations

This section recalls basic definitions and notation that will be used in the following chapters. In particular, main linear algebra notions, such as dot product and norm,

are recalled in Section 2.1.1, while basic definitions for tree and graph structures are reported in Section 2.1.2.

2.1.1 Dot Products and Norms

Definition 2.1. (Field): Given a set F and two operations ∗ and +, the (algebraic) structure (F, ∗, +) is a field iff:

1. (F, +) is an Abelian group with identity element 0, namely: ∀ f ∈ F, f + 0 = 0 + f = f;

2. (F \ {0}, ∗) is an Abelian group with identity element 1, namely: ∀ f′ ∈ F \ {0}, f′ ∗ 1 = 1 ∗ f′ = f′;

3. (Distributivity of ∗ operator over + operator): ∀ a, b, c ∈ F, a ∗ (b + c) = (a ∗ b) + (a ∗ c).

Remark 2.1. Real Numbers: The set of all real numbers R is a Field, whose operations correspond to the standard numerical addition and multiplication.

Definition 2.2. (Vector Space): A vector space over a field F is a set V together with two binary operations, i.e., (V, +, ∗), such that:

(Notation): Elements of V are called vectors, and elements of F are called scalars.

(Vector Addition): The first operation, vector addition, takes any two vectors v and w and outputs a third vector v + w.

(Scalar Multiplication): The second operation, scalar multiplication, takes any scalar a and any vector v and outputs a new vector a ∗ v

Therefore, the structure [V, +, ∗] corresponds to a vector space over the field F .

The operations of additions and multiplication in a vector space satisfy the following axioms:

▶ Associativity of scalar multiplication:

∀a, b ∈ F, v ∈ V, a ∗ (b ∗ v) = (a ∗ b) ∗ v (2.1)

▶ Distributivity of scalar multiplication w.r.t. vector addition:

∀a ∈ F, u, v ∈ V, a ∗ (u + v) = a ∗ u + a ∗ v (2.2)

▶ Distributivity of scalar multiplication w.r.t. field addition:

∀a, b ∈ F, v ∈ V, (a + b) ∗ v = a ∗ v + b ∗ v (2.3)

▶ Identity element of scalar multiplication:

∀v ∈ V, 1 ∗ v = v, where 1 denotes the multiplicative identity in F (2.4)

▶ Identity element of addition (Null vector):

∃ 0 ∈ V, s.t.: v + 0 = v, ∀v ∈ V (2.5)

From now on, without loss of generality, a generic vector space [V, +, ∗] over a field F will be synthetically referred to as V. In particular, we will write R^N instead of [R^N, +, ∗] over the field R to indicate the vector space of real-valued N-dimensional vectors, whenever this does not lead to confusion. Moreover, vector elements will be typed in boldface, i.e., v, and 0 will indicate the null vector.

Definition 2.3. (Linear Combination): Let F be a field and V a vector space over F. If v1, . . . , vn are vectors and a1, . . . , an are scalars, the linear combination of those vectors with those scalars (as coefficients) is defined as:

a1 ∗ v1 + a2 ∗ v2 + · · · + an ∗ vn = ∑_{i=1}^{n} ai ∗ vi

where ∗ corresponds to the scalar multiplication operation.

Definition 2.4. (Dot Product): Let V be a vector space. A dot product (or scalar product) is an application · : V × V → R satisfying the following conditions:

a) v · w = w · v,                       ∀ v, w ∈ V
b) (v + w) · z = (v · z) + (w · z),     ∀ v, w, z ∈ V
c) (a v) · w = a (v · w),               ∀ v, w ∈ V, a ∈ R
d) v · v ≥ 0,                           ∀ v ∈ V
e) v · v = 0 ⇐⇒ v = 0
f) v · (w + z) = (v · w) + (v · z),     ∀ v, w, z ∈ V

Some remarks follow from the Definition 2.4:

Remark 2.2. Notation: The inner product of two vectors v and w may also be indicated using the symbol ⟨v, w⟩. Thus, from now on, the two notations will be used interchangeably, selecting the one that aids the readability and the comprehension of the different formulations.

Remark 2.3. Algebraic definition of the inner product:

The inner product of two vectors v = (v1, v2, . . . , vn) and w = (w1, w2, . . . , wn) is defined as:

⟨v, w⟩ = v · w = ∑_{i=1}^{n} vi ∗ wi = v1 ∗ w1 + · · · + vn ∗ wn     (2.6)

Remark 2.4. Axioms (a) and (b) imply (f). Moreover, using axiom (d), it is possible to associate to the inner product a quadratic form, called the norm, ∥·∥, such that ∥v∥² = ⟨v, v⟩.

More formally:

Definition 2.5. (Norm): The norm ∥·∥ : V → R; v ↦ ∥v∥ is a function with the following properties:

∥v∥ ≥ 0,                     ∀ v ∈ V
∥v∥ = 0 ⇐⇒ v = 0,            ∀ v ∈ V
∥a v∥ = |a| ∥v∥,             ∀ v ∈ V, a ∈ R
∥v + w∥ ≤ ∥v∥ + ∥w∥,         ∀ v, w ∈ V   (Triangle Inequality)

Remark 2.5. Following from Remarks 2.3 and 2.4, an equivalent algebraic formulation of the norm can be provided:

∥v∥ = √⟨v, v⟩ = √( ∑_{i=1}^{N} vi² )     (2.7)

Remark 2.6. Cauchy-Schwarz’s Inequality Norms and Inner Products are connected by the Cauchy-Schwarz’s Inequality:

|⟨v, w⟩| ≤ ∥v∥∥w∥

The definitions of dot product (Definition 2.4) and norm (Definition 2.5) allow us to formally define the notion of (internal) angle between two vectors v, w ∈ V.

Definition 2.6. (Angle between two vectors): Let v and w be vectors in a vector space V. The angle between v and w, indicated with θ, corresponds to:

θ = arccos( ⟨v, w⟩ / (∥v∥ ∥w∥) )     (2.8)

Figure 2.1: Scalar projection.

Remark 2.7. Geometric definition of the inner product: Equation 2.8 leads to the following formulation:

⟨v, w⟩ = ∥v∥∥w∥ cos θ (2.9)

Therefore, ∥v∥ cos θ corresponds to the scalar projection of the vector v onto the vector w (Figure 2.1).

Definition 2.7. (Distance): Let V be a vector space. A function ρ : V × V → R is called a distance on V if:
a) ρ(v, w) ≥ 0,           ∀ v, w ∈ V
b) ρ(v, w) = ρ(w, v),     ∀ v, w ∈ V
c) ρ(v, v) = 0,           ∀ v ∈ V
The pair (V, ρ) is called a distance space. In addition, if ρ also satisfies the triangle inequality
d) ρ(v, w) ≤ ρ(v, z) + ρ(w, z),     ∀ v, w, z ∈ V
then ρ is called a semimetric on V. Besides, if
e) ρ(v, w) = 0 ⇒ v = w
also holds, then ρ is called a metric, and (V, ρ) is a metric space.

Remark 2.8. Cosine Similarity: From Definition 2.6 and Equation 2.9, an important conclusion can be derived.

cos θ = ⟨v, w⟩ / (∥v∥ ∥w∥)                                                       (2.10)   [Equation 2.9]
      = (∑_{i=1}^{N} vi wi) / (∥v∥ ∥w∥)                                          (2.11)   [Equation 2.6]
      = (∑_{i=1}^{N} vi wi) / ( √(∑_{i=1}^{N} vi²) √(∑_{i=1}^{N} wi²) )          (2.12)   [Equation 2.7]
      = ∑_{i=1}^{N} (vi / ∥v∥) (wi / ∥w∥)                                        (2.13)
      = (v / ∥v∥) · (w / ∥w∥)                                                    (2.14)
      = ⟨ν(v), ν(w)⟩                                                             (2.15)

where ν(v) = v / ∥v∥ represents the vector v normalised by its norm, i.e., its corresponding unit vector, called the versor.

In the literature, this formulation is usually called the cosine similarity [130], as it represents a similarity computation (the dot product) normalised by the norms.

In conclusion,

Definition 2.8. (Euclidean Space): A set V is a Euclidean Space iff V is a metric space that is linear and finite-dimensional.

Remark 2.9. Euclidean Distance: It is easy to show that

ρ(v, w) = ∥v − w∥ = √( ∑_{i=1}^{N} (vi − wi)² )

is a metric.

Remark 2.10. Inner Product: In the context of Euclidean Space, the dot product is sometimes indicated also by the term inner product. Thus the two terms will be used as synonyms in case of Euclidean spaces.
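The following NumPy snippet gives a numerical illustration of Definitions 2.4-2.8 and Remark 2.8; the two vectors are arbitrary.

    # Numerical illustration of dot product, norm, angle, cosine similarity,
    # and Euclidean distance (arbitrary example vectors).
    import numpy as np

    v = np.array([3.0, 4.0, 0.0])
    w = np.array([1.0, 2.0, 2.0])

    dot = float(np.dot(v, w))                                   # <v, w>      (Eq. 2.6)
    norm_v = float(np.linalg.norm(v))                           # ||v||       (Eq. 2.7)
    theta = np.arccos(dot / (norm_v * np.linalg.norm(w)))       # angle       (Eq. 2.8)
    cosine = float(np.dot(v / norm_v, w / np.linalg.norm(w)))   # <nu(v), nu(w)> (Eq. 2.15)
    euclid = float(np.linalg.norm(v - w))                       # rho(v, w) = ||v - w||

    print(dot, norm_v, round(np.degrees(theta), 1), round(cosine, 3), round(euclid, 3))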

2.1.2 Graphs and Trees

Definition 2.9. (Directed Graph): A directed graph is a pair of sets G = (VG, EG), where VG = {v1, . . . , vn} is an ordered set of nodes and EG = {eij = (vi, vj), . . . , ekl = (vk, vl)} is an ordered set of pairs of nodes, called the edges.

From now on, a generic graph will be indicated by G = (V, E) (without the subscript G), to aid the readability of the notation. Moreover, the short form G may be used instead of G = (V, E), whenever this does not lead to confusion.

Proposition 2.1. Undirected graph: An undirected graph is a graph for which the following property holds:

eij ∈ E ↔ eji ∈ E.

In other words, the edge set E consists of unordered pairs of vertices, rather than ordered pairs [47].

Many definitions for directed and undirected graphs are the same, although certain terms have slightly different meanings in the two contexts [47]. Thus, unless differently specified, the following definitions hold for both directed and undirected graphs.

Definition 2.10. Labelled Graph: Let G = (V,E) be a graph, and Σ an alphabet of characters. If there exist functions

LV : V → Σ*;  vi ↦ LV(vi)     and     LE : E → Σ*;  eij ↦ LE(eij)

then G is a labelled graph.

Definition 2.11. (Directed Graph Product): Let G = (V, E, L) and G′ = (V′, E′, L′) be two labelled graphs, with labelling functions L and L′, respectively. Then, the directed graph product G× = G × G′ is defined as:

V× = {(vi, vi′) : vi ∈ V ∧ vi′ ∈ V′ ∧ L(vi) = L′(vi′)}     (2.16)
E× = {((vi, vi′), (vj, vj′)) : (vi, vj) ∈ E ∧ (vi′, vj′) ∈ E′ ∧ L((vi, vj)) = L′((vi′, vj′))}     (2.17)
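A small sketch of Definition 2.11 follows, with graphs encoded as dictionaries of node and edge labels; the encoding, the node names, and the labels are invented for illustration.

    # Sketch of the directed graph product of two labelled graphs (invented data).
    def graph_product(nodes1, edges1, nodes2, edges2):
        """nodesX: {node: label};  edgesX: {(u, v): label}."""
        v_prod = [(a, b) for a in nodes1 for b in nodes2 if nodes1[a] == nodes2[b]]
        e_prod = [((a, b), (c, d))
                  for (a, c) in edges1 for (b, d) in edges2
                  if (a, b) in v_prod and (c, d) in v_prod
                  and edges1[(a, c)] == edges2[(b, d)]]
        return v_prod, e_prod

    nodes1 = {1: "A", 2: "B"};      edges1 = {(1, 2): "x"}
    nodes2 = {"u": "A", "w": "B"};  edges2 = {("u", "w"): "x"}
    print(graph_product(nodes1, edges1, nodes2, edges2))
    # ([(1, 'u'), (2, 'w')], [((1, 'u'), (2, 'w'))])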

Definition 2.12. If (u, v) is an edge in a directed graph G, then we may say that the edge (u, v) is incident from node u, and it is incident to the node v [47]. Conversely, if G is undirected, the edge (u, v) is simply said to be incident on the considered node.

Definition 2.13. (Neighbourhood): Let G = (V, E) be a (directed) graph, and v ∈ V a node. There exist two functions δ+, δ− that define the neighbourhood of a given node. In particular:

δ+ : V → E;  v ↦ δ+(v) = {(v, u) ∈ E}     (2.18)
δ− : V → E;  v ↦ δ−(v) = {(u, v) ∈ E}     (2.19)

Therefore, δ+(v) represents the set of all edges incident from the node v, while δ−(v) is the set of all edges incident to the node v.

Remark 2.11. In case v is a node of an undirected graph G, δ+(v) = δ−(v) = δ(v)

Definition 2.14. (Degree of a Node) [47]: The degree of a node v in an undirected graph is the total number of edges incident on it, i.e., |δ(v)|. Conversely, in the case of directed graphs, there exist the notions of in-degree (|δ−(v)|) and out-degree (|δ+(v)|), corresponding to the number of edges that are incident to and from the node v, respectively.

Remark 2.12. A node whose degree (or in/out-degree) is equal to zero is isolated.

Remark 2.13. Maximal (In/Out) Degree of a graph: The maximal in-degree and out-degree of a directed graph G = (V,E) are defined as:

∆−(G) = max{ |δ−(v)| : v ∈ V }     (2.20)
∆+(G) = max{ |δ+(v)| : v ∈ V }     (2.21)

Definition 2.15. Path: Given a graph G, a path p in G is defined as a sequence of nodes in which each node is connected to the next one by an edge. Thus: p(vi, vj) = ⟨vi, . . . , vj⟩ such that ek,k+1 = (vk, vk+1) ∈ E, i ≤ k < j. The length of the path, usually indicated with lp, is defined in terms of the nodes that compose the path. If all the nodes composing a path are distinct, the path is said to be simple.
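The toy encoding below illustrates Definitions 2.13-2.15 on an invented directed graph (in the same flavour as the examples in Figures 2.2 and 2.3): neighbourhoods, degrees, maximal degrees, and a simple path check.

    # Toy directed graph with a self-loop on "a" and an isolated node "e".
    edges = [("a", "a"), ("a", "b"), ("b", "c"), ("c", "d")]
    nodes = {"a", "b", "c", "d", "e"}

    out_edges = {v: [e for e in edges if e[0] == v] for v in nodes}   # delta+(v)
    in_edges  = {v: [e for e in edges if e[1] == v] for v in nodes}   # delta-(v)

    out_degree = {v: len(out_edges[v]) for v in nodes}
    in_degree  = {v: len(in_edges[v]) for v in nodes}

    max_out = max(out_degree.values())      # Delta+(G)
    max_in  = max(in_degree.values())       # Delta-(G)

    # A path is a sequence of nodes where consecutive nodes are joined by an edge.
    def is_path(p):
        return all((p[k], p[k + 1]) in edges for k in range(len(p) - 1))

    print(out_degree["a"], in_degree["e"], max_out, max_in, is_path(["a", "b", "c", "d"]))
    # 2 0 2 1 True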

Definition 2.16. (Cycle) [47]:

(Directed graph): A path p = ⟨v0, . . . , vl⟩ forms a cycle if v0 = vl and the path contains at least one edge.

(Undirected graph): A path p = ⟨v0, . . . , vl⟩ forms a cycle if lp ≥ 3 and

v0 = vl.

In both cases, the cycle is simple if, in addition, the nodes v1, . . . , vl in the path are distinct.

Remark 2.14. A cycle of length lp = 1 is called a self-loop.

Definition 2.17. (Connected Graph) [47]:

(Directed graph): A directed graph G is strongly connected if every two nodes are reachable from each other.

(Undirected graph): An undirected graph G is connected if every node is reachable from all other nodes.

Figure 2.2: Example of a labelled directed graph.
Figure 2.3: Example of a labelled undirected graph.

Figures 2.2 and 2.3 represent two variants of the same labelled graph, namely directed and undirected respectively. In particular, in both graphs the node labelled with e is isolated, i.e., its (in/out-)degree is zero, and a self-loop is present on the node a.

Definition 2.18. Isomorphic Graphs [47]: Two graphs G = (V, E) and G′ = (V′, E′) are called isomorphic if there exists a bijection f : V → V′ such that (u, v) ∈ E iff (f(u), f(v)) ∈ E′.

In other words, two graphs G and G′ are isomorphic if and only if it is possible to define a biunique correspondence between the nodes of the two graphs, such that nodes in G are mapped to nodes in G′.

Definition 2.19. (Rooted Tree): Let G = (V, E) be a graph. G is a rooted tree iff:

a) G is an undirected acyclic graph;

b) ∃! r ∈ V : r has no incoming edges. * The node r is called the root of the Tree.

Proposition 2.2. Let G = (V, E) be an undirected graph. The following statements are equivalent [47]:

1. G is a Tree.

2. Any two nodes in G are connected by a unique simple path.

3. G is connected, but if any edge is removed from E, the resulting graph is disconnected.

4. G is connected, and |E| = |V | − 1.

5. G is acyclic, i.e., it does not contain cycles, and |E| = |V | − 1.

6. G is acyclic, but if any edge is added to E, the resulting graph contains a cycle.

Proposition 2.3. Tree Properties and Terminology: Let n be a node in a rooted Tree T with root r [47]:

• Any node m on the unique simple path from r to n is called an ancestor of n. Conversely, if m is an ancestor of n, then n is a descendant of m.

• If the last edge on the unique simple path from r to n is (m, n), then m is the parent of n, and similarly n is a child of m. In particular Ch(m) indicates the set of all the children nodes of the given node m.

*The notation ∃! stands for “There exists a unique”

• The degree of a node n corresponds to the number of its children, i.e., |Ch(n)|.

• The unique node with no parent in T is the root r.

• Every node li ∈ T with no children is a leaf or external node. A non-leaf node is called an internal node.

• The length of the unique simple path from the root node r to the node n is the depth of n in T.

• The height of a node n in T corresponds to the number of edges on the longest simple path from the node itself down to a leaf.

This Section concludes with a formal definition of substructures, namely subgraphs and subtrees, that will be frequently recalled in the description of Kernel methods for structured data (Section 2.4).

Definition 2.20. (Subgraph): Let G = (V, E) be a graph. The graph G′ = (V′, E′) is a subgraph of G if V′ ⊆ V and E′ ⊆ E.

In other words, given a set V′ ⊆ V, the subgraph of G induced by V′ is the graph G′ = (V′, E′), where E′ = {(u, v) ∈ E : u, v ∈ V′} [47]. For example, considering Figure 2.3, if we take the set of nodes V′ = V \ {e}, the induced subgraph is G′ = (V′, E′), where E′ = E, as the excluded node is isolated. It is worth noting that, in this case, the considered subgraph G′ is connected. Similarly, a subtree T′ of a tree T can be defined as:

Definition 2.21. (Subtree): Let T = (V, E) be a tree. T′ = (V′, E′) is a subtree of the tree T iff V′ ⊆ V and E′ ⊆ E, where E′ = {(u, v) ∈ E : u, v ∈ V′}.

Moreover, for tree structures, the following definition may be applied:

Definition 2.22. (Subset Tree): Let T = (V, E) be a tree. T′ = (V′, E′) is a subset tree of the tree T iff:

1. T′ is a subtree of T.

2. ∀ n ∈ V′: only one of the two following conditions may hold:

(a) Ch(n) ∩ V′ = Ch(n);

(b) Ch(n) ∩ V′ = ∅.

In other words, a subset tree is a subtree with an additional constraint imposing that, for every node in the subtree, either all or none of its children belong to the subtree.

Figure 2.4: Example of a Tree (a), and two corresponding possible subtrees (b) and (c). In particular, (b) is a Subtree, while (c) is a Subset tree.
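The distinction between a subtree and a subset tree can be checked mechanically; the sketch below mirrors the constraint of Definition 2.22 on an invented tree, assuming the kept node set already induces a subtree.

    # Check of the subset-tree constraint: for every kept node, either all of
    # its children are kept or none of them is (tree and node sets are invented).
    children = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

    def is_subset_tree(kept_nodes):
        for n in kept_nodes:
            kept_children = [c for c in children[n] if c in kept_nodes]
            if kept_children and len(kept_children) != len(children[n]):
                return False       # some, but not all, children were kept
        return True

    print(is_subset_tree({"r", "a", "b", "c", "d"}))  # True: the whole tree
    print(is_subset_tree({"a", "c", "d"}))            # True: a with all of its children
    print(is_subset_tree({"r", "a", "c"}))            # False: only one child kept at r and at a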

2.2 Learning from examples

Although ML systems may be classified according to different viewpoints [38], a common choice is to classify ML approaches according to the underlying learning

strategy [35]. In particular, in every learning situation, the learner transforms information provided by the environment into some new form in which it is stored for future use [9, 142]. The nature of this transformation determines the type of learning strategy used. Several basic strategies have been distinguished: rote learning, learning by instruction, learning by deduction, learning by analogy, and learning by induction [9]. The latter is further distinguished into learning by observation and learning from examples [9, 38]. For the sake of brevity, only the latter strategy will be discussed in the remainder of this Section, as it is the one closely related to the topics presented in this thesis. For a complete description of the different learning strategies and their corresponding features, the interested reader may refer to [9, 35, 38, 142]. Learning from examples is one of the most popular and widely employed strategies in ML approaches, and it is often simply called learning [35]. Similarly, the term “example” is usually treated as a synonym for “data”. Thus, from now on, the two terms will be used interchangeably. The learning problem that this strategy involves can be described as “finding a general rule that explains the data given only a sample of limited size” [35]. In particular, this strategy comprises a set of different learning techniques that are distinguished into three big families: Supervised Learning, Reinforcement Learning, and Unsupervised Learning [134].

2.2.1 Supervised Learning

In supervised learning, data are represented as tuples in the form of ⟨input, output⟩ patterns. This problem is called supervised learning because the objects under consideration are already associated with the target values [35]. In more detail, in the problem of supervised learning, a so-called training set of examples with the correct responses (targets) is provided and, based on such a training set, the learning algorithm generalises to respond correctly to all possible inputs [134].

More formally, the training set is usually written as a set of pairs (xi, ti), where the inputs are xi*, the targets are ti, and the index i suggests that there are many pairs, ranging from 1 to an upper limit N ∈ N [134].

*The boldface denotes a vector of values.

Typical examples of supervised learning approaches are Classification and Regression, which are distinguished according to the type of their outputs.

Classification: In the classification learning problem, the output space is composed of a finite number of discrete classes, and the corresponding learning algorithm is called a classifier. In particular, the learning problem is to assign to each input datum its corresponding class or set of classes.

Regression: If the output space of the learning problem corresponds to values of continuous variables, then the learning task is known as the problem of regression. Typical examples of regression include predicting the value of shares in the stock exchange market, or estimating the value of a physical measure in a section of a thermoelectric plant [35].

2.2.2 Reinforcement Learning

Reinforcement learning corresponds to a strategy that lies between supervised and unsupervised learning [134]. In fact, the algorithm gets told when the output is wrong, but is not told how to correct it [134]. Thus, the algorithm explores different possibilities until the correct output is discovered. In other words, the problem of reinforcement learning is to learn what to do, i.e., how to map situations to actions in order to maximise a given reward [35]. Such maximisation is what guides the learning algorithm through the different possible solutions. A comprehensive survey of reinforcement learning can be found in [171].

2.3 Unsupervised Learning

If the input data to the learning algorithm comprise a set of samples without associated target values, the problem is classified as unsupervised learning. In unsupervised learning, data do not contain any indication of the correct target; instead, the algorithm tries to identify similarities between the inputs, so that inputs that are in some way related are categorised together [134]. Duda et al. [59] indicate some of the advantages of unsupervised learning techniques with respect to supervised ones:

• There is no cost of collecting and labelling samples.

• Unsupervised techniques may be used to identify characteristics of the samples which are useful for differentiating between them.

• Unsupervised techniques may be used for exploring the data and analysing its structure.

Clustering algorithms represent a rich subclass of unsupervised learning techniques [35]. Nevertheless, even if the difference between supervised and unsupervised techniques apparently concerns only the input examples (labelled and unlabelled, respectively), the corresponding learning strategies lead to conceptually different problems. For example, let us consider classification and clustering algorithms as representatives of the two learning strategies. On the one hand, classification algorithms perform a two-step learning, namely training and test, aiming at identifying similarities between inputs that belong to the same class. On the other hand, clustering algorithms aim at identifying similarities by directly looking at the data, in order to discover patterns and relationships among objects [167]. In particular, the main goal of a clustering algorithm is to create groups of input data that are coherent internally, but clearly different from each other [130].

In the literature several clustering algorithms have been proposed [29], and they have been classified according to different viewpoints. Clustering techniques are generally classified as partitional clustering (Section 2.3.2) and hierarchical clustering (Section 2.3.3), based on the properties of the generated clusters [188]. Partitional clustering directly divides the data points into some pre-specified number of clusters without imposing any structure on them; as a consequence, partitional clustering algorithms are sometimes referred to as flat clustering [130]. On the other hand, hierarchical clustering groups data into a sequence of nested partitions, either from singleton clusters to a single cluster including all individuals (the so-called bottom-up strategy [130]), or vice versa (top-down strategy [130]) [188].

Another typical criterion for the classification of clustering algorithms considers the nature of the learning problem. Clustering strategies that allow a single object to belong to more than one cluster give rise to the so-called soft clustering problem [130]. Conversely, the hard clustering problem constrains objects to belong

to exactly one cluster. The interested reader may find many other classification criteria for clustering approaches in [72, 90, 91, 188].

2.3.1 Clustering Terminology

Roughly speaking, by data clustering we mean that, given a set of data points and a similarity measure, the data are grouped in such a way that points in the same group (i.e., cluster) are similar and points in different groups are dissimilar [72]. Thus, clustering analysis relies on these three main concepts. However, in the data clustering literature, different terms may be used to express the same thing [72]. To this aim, in the remainder of this section a brief description of such concepts is provided, in order to clarify the terminology used in the following sections. The terms data point, object [72], item [188], and pattern [91] are used to denote a single element in the data set. Moreover, for data points in a high-dimensional space, terms such as attributes [91] or features [68, 72, 134] are used to indicate the different scalar components characterising the data points [90]. More formally:

Definition 2.23. (Data) [72]: Let X = {x1, ..., xn} be a finite set, representing the set of input data points, or simply data, where xi = (xi1, xi2, ..., xif)ᵀ is a vector denoting the i-th object xi ∈ X, described by a set of f different features.

Distances and similarities play an important role in clustering analysis [72], as they guide the learning strategy in generating groups of data. As a matter of fact, every clustering algorithm is based on an index of similarity (or dissimilarity) between pairs of data points [91]. The two most employed measures are the Euclidean distance (Definition 2.7) and the so-called cosine similarity [130]. However, in the literature several distance and similarity functions have been proposed, which are employed according to the specific nature of the analysed data [72]. The terms group and cluster are typically used interchangeably and in an essentially intuitive manner [72]. Nevertheless, the common sense of the term cluster combines various plausible criteria and requires that all objects in a cluster satisfy the aforementioned properties imposed by clustering analysis [72].

For example, objects in the same cluster must share the same or closely related properties, and they must be clearly distinguishable from the rest of the objects in the data set [72, 90].
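As a purely illustrative aid, the following minimal Python sketch shows how the two measures mentioned above (Euclidean distance and cosine similarity) can be computed for feature vectors; it is not part of the approaches discussed in this thesis.

    import math

    def euclidean_distance(x, y):
        # Square root of the sum of squared component-wise differences.
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def cosine_similarity(x, y):
        # Inner product of the two vectors divided by the product of their norms.
        dot = sum(xi * yi for xi, yi in zip(x, y))
        norm_x = math.sqrt(sum(xi ** 2 for xi in x))
        norm_y = math.sqrt(sum(yi ** 2 for yi in y))
        return dot / (norm_x * norm_y)

    print(euclidean_distance([0.0, 1.0], [1.0, 1.0]))   # 1.0
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))    # ~0.707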

2.3.2 Partitional Clustering

The problem of partitional clustering can be formally stated as follows:

Definition 2.24. (Partitional Clustering) [91]: Given N data points xi ∈ Xᵈ, where Xᵈ is a d-dimensional metric space (see Definition 2.7), partitional clustering algorithms intend to determine a partition of the input data into a set of k clusters, {C1, ..., Ck}, which optimises a given criterion.

To date, one of the most commonly adopted strategies relies on the minimisation of the square-error [91].

In more detail, let X be a data set with N data points, and let C1, ..., Ck be the k disjoint clusters of X (i.e., hard clustering). Then, the error function is defined as [188]:

E = ∑_{i=1}^{k} ∑_{xj ∈ Ci} ρ(xj, µ(Ci))    (2.22)

where µ(Ci) represents the centroid of cluster Ci, and ρ(xj, µ(Ci)) denotes the distance between the input data point xj and µ(Ci) (see Definition 2.7). As a matter of fact, such an optimisation criterion characterises a family of partitional clustering algorithms usually called centroid-based algorithms [91]. The basic idea of this family of algorithms is to start with an initial partition and assign data to clusters with respect to their corresponding centroids, so as to reduce the square-error [91]. In particular, the general algorithmic framework of these techniques is reported below [91]:

Step 1. Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilises.

Step 2. Generate a new partition by assigning each data point to its closest cluster centre.

Step 3. Compute new cluster centres as the centroids of the clusters.

Step 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found.

Step 5. Adjust the number of clusters by merging and/or splitting existing clusters, or by removing the outliers.

Nevertheless, different implementations of some crucial decisions, such as the choice of the initial partition, the update strategy, and the halting criterion, lead to different clustering strategies [90]. The most famous centroid-based partitional clustering algorithm is certainly the k-means algorithm, which tries to minimise the total within-cluster variance. In other words, recalling Equation 2.22, the optimisation criterion is defined as:

E_kmeans = ∑_{i=1}^{k} ∑_{xj ∈ Ci} (xj − µi)²    (2.23)

The algorithmic description of the k-means clustering is reported in Algorithm 1. The algorithm starts by initialising the centroids (Line 2), picking a set of k different random points in the input space (Algorithm 2). Then the algorithm continues by assigning each item in the data set to the closest centroid, according to the adopted distance measure (Line 9). Afterwards, the new cluster configuration is computed, and the corresponding centres are updated accordingly (see Algorithm 3). The iteration ends when the cluster centres stop moving or a maximum number of iterations has been performed (Line 16). The asymptotic computational complexity of the k-means algorithm is O(NkT), where N is the total number of data points, k the number of clusters, and T the number of iterations. An example clustering result of the k-means algorithm is reported in Figure 2.1.

One possible variant of the k-means algorithm is the so-called k-medoids [97]. Differently from classical k-means, the k-medoids algorithm applies different initialisation and update strategies: instead of selecting random points in the input space as centroids, this algorithm selects random points in the data set, called medoids, that are considered as representatives of each generated cluster. Therefore, the whole partitioning strategy is performed with respect to the selected medoids, which are changed in the update step by selecting, for each cluster, the new element that leads to the configuration with the lowest cost. The main advantage of the k-medoids algorithm over the k-means is that it is more robust to noise in the data and to outliers [97].

Algorithm 1 The k-Means Algorithm
Input: k : The total number of partitions to generate.
Input: X : The input data set, namely X = {x1, ..., xn}
Input: T : Total number of iterations allowed.
Output: C : The set of the generated k clusters, namely C = {C1, ..., Ck}
 1: function k-Means(k, X)
 2:     µ ← InitialiseCentroids(k)                ▷ see Algorithm 2
 3:     C ← {∅1, ..., ∅k}                          ▷ Initialisation
 4:     iterCount ← 0
 5:     repeat                                     ▷ Learning
 6:         µ′ ← µ
 7:         for each xi ∈ X do
 8:             for each µj ∈ µ′ do
 9:                 dij ← ρ(xi, µj)                ▷ Calculate the distance ρ to each cluster centre
10:             end for
11:             l ← argmin_{1≤j≤k} dij
12:             Cl ← Cl ∪ {xi}                     ▷ Assign the data to the closest cluster centre
13:         end for
14:         µ ← UpdateCentroids(C)                 ▷ see Algorithm 3
15:         iterCount ← iterCount + 1
16:     until µ = µ′ ∨ iterCount = T
17:     return C
18: end function

Algorithm 2 k-Means Initialisation Strategy
Input: k : The total number of partitions.
Output: µ : The set of k different centroids
1: function InitialiseCentroids(k)
2:     µ ← ∅
3:     for i ← 1 to k do
4:         µi ← choose a random position in the input space
5:         µ ← µ ∪ {µi}
6:     end for
7:     return µ
8: end function

Algorithm 3 k-Means Update Strategy
Input: C : Partitions generated so far during clustering.
Output: µ : The set of k different centroids
1: function UpdateCentroids(C)
2:     µ ← ∅
3:     for i ← 1 to k do
4:         Ni ← |Ci|
5:         µi ← (1/Ni) ∑_{j=1}^{Ni} xj
6:         µ ← µ ∪ {µi}
7:     end for
8:     return µ
9: end function

Figure 2.1: An example of k-means clustering of 2D points organised in three clusters. Cluster centroids are marked as large green rings, while elements in the different clusters are dots, triangles, and stars respectively.
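To make the procedure of Algorithms 1–3 concrete, the following minimal NumPy sketch implements the same assign/update loop; it is an illustrative re-implementation written for this discussion, not the exact code used elsewhere in this thesis.

    import numpy as np

    def k_means(X, k, T=100, seed=0):
        """Cluster the rows of X into k groups (plain k-means, Euclidean distance)."""
        rng = np.random.default_rng(seed)
        # Initialisation: pick k random points in the region spanned by the data.
        low, high = X.min(axis=0), X.max(axis=0)
        centroids = rng.uniform(low, high, size=(k, X.shape[1]))
        for _ in range(T):
            # Assignment step: each point goes to its closest centroid.
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = distances.argmin(axis=1)
            # Update step: recompute each centroid as the mean of its cluster.
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):  # centres stopped moving
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage: three well-separated 2D blobs, as in Figure 2.1.
    X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [6, 0], [3, 5])])
    labels, centroids = k_means(X, k=3)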

2.3.3 Hierarchical Clustering

Differently from partitional “flat” clustering, hierarchical clustering algorithms impose a structure on the data. Indeed, instead of generating different disjoint partitions of the input data, these algorithms organise data into a sequence of nested partitions [90]. More formally:

Definition 2.25. (Nested Partition):

Let X = {x1, ..., xN} be the set of N input data points, and let C = {C1, ..., CM} be a clustering partition of X. A partition B is said to be nested in C iff ∀ Bi ∈ B, ∃ Cj ∈ C : Bi ⊆ Cj.

In other words, the partition B is nested in another partition C if and only if every component of the former is a subset of a component of the latter. Therefore, a hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence [91]. In particular, the so-called agglomerative hierarchical clustering (HAC) [72, 130] algorithms start with disjoint singleton clusters, one for each element of the data set X, and continue by repeatedly nesting the different partitions until a single cluster containing all the N elements remains. On the other hand, the so-called divisive hierarchical clustering algorithms perform the same task in reverse order [91]. In the following, the description will be limited to HAC algorithms, as they are closely related to the research contributions presented in the following Chapters. Hierarchical clustering results are usually represented by means of a tree structure, called a dendrogram [72, 188]:

Definition 2.26. (Dendrogram) [72]: A dendrogram T is a tree in which each internal node is associated with a height satisfying the following condition:

height(A) ≤ height(B) ⟺ A ⊆ B

for all subsets of data points A and B such that A ∩ B ≠ ∅.

Figure 2.2: Two examples of dendrograms representing the clustering results of the same set of data, applying two different linkage strategies, namely the complete linkage (left) and the group average linkage (right).

Two examples of dendrograms are represented in Figure 2.2. The algorithmic description of the HAC method is reported in Algorithm 4. The algorithm takes as input the data set X and the so-called proximity matrix [72, 188], which is a triangular matrix containing the distances between each pair of distinct points in X. As already mentioned, the clustering procedure starts with a set of singleton clusters (Line 2), which are iteratively updated: at each step, the algorithm looks for the two closest clusters to be joined. Afterwards, the two matched clusters are merged, and the similarities between all the other clusters are updated accordingly. Indeed, the similarity update strategy is applied by invoking the Linkage function (Line 20), which represents one of the most important aspects of HAC algorithms. In more detail, the so-called linkage method [130] defines the strategy to apply in updating the similarities between the new cluster and the remaining ones: different linkage strategies lead to different HAC algorithms. In the literature, several linkage methods have been proposed [148]. The most important and widely used ones are described below, according to the parametric formulation provided by Lance and Williams [110].

Proposition 2.4. (Linkage Rule Parametric Formulation): Let C be the set of clusters, let i, j be the indexes of the two clusters already joined, and let k be the index of another cluster Ck. Thus:

sim(i + j, k) = a(i)·sim(i, k) + a(j)·sim(j, k) + b·sim(i, j) + c·|sim(i, k) − sim(j, k)|    (2.24)

where the values of the parameters a, b, and c vary depending on the specific linkage method to apply.

Table 2.1 summarises the different values of such parameters, according to the

Algorithm 4 HAC algorithm

Input: X : The input data set, namely X = {x1, ..., xN}
Input: M : The proximity matrix, namely a triangular matrix containing the distances among all the data in X
Output: P : The list of nested partitions.
 1: function HAC(X, M)
 2:     P ← {{x1}, ..., {xN}}                ▷ Initialise partitions with singleton clusters.
 3:     while |P| > 1 do
        ▷ Look for the closest clusters
 4:         lowestPair ← (1, 2)
 5:         closest ← M(1, 2)
 6:         for i ← 1 to N do
 7:             for j ← i + 1 to N do
 8:                 d ← M(i, j)
 9:                 if d < closest then
10:                     lowestPair ← (i, j)
11:                     closest ← d
12:                 end if
13:             end for
14:         end for
        ▷ Join clusters and update distances
15:         left, right ← lowestPair
16:         leftCluster ← Pleft
17:         P ← P \ Pleft
18:         rightCluster ← Pright
19:         P ← P \ Pright
20:         newCluster ← Linkage(Pleft, Pright, M)
21:         P ← P ∪ newCluster
22:     end while
23:     return P
24: end function

Linkage Strategy            Parameter Values
Single linkage              a(i) = a(j) = 0.5, b = 0, c = 0.5
(shortly: max{sim(i, k), sim(j, k)})
Complete linkage            a(i) = a(j) = 0.5, b = 0, c = −0.5
(shortly: min{sim(i, k), sim(j, k)})
Group average linkage       a(i) = |i|/(|i| + |j|), a(j) = |j|/(|i| + |j|), b = 0, c = 0

Table 2.1: Linkage update strategies, according to the parametric formulation proposed by Lance and Williams [110].

particular linkage strategy. The computational complexity of the HAC algorithm (Algorithm 4) is Θ(N³), because the function exhaustively scans the N × N proximity matrix M for the closest clusters in each of the N − 1 iterations. Finally, it is worth mentioning that the two dendrograms shown in Figure 2.2 correspond to the clustering results obtained by applying two different linkage methods, namely complete and group average linkage, to the same set of input data. The two resulting partitions are quite different, thus confirming the key importance of linkage methods for HAC algorithms.
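As an illustration of how such linkage strategies are used in practice, the following sketch relies on SciPy's hierarchical clustering routines (an assumption of this example; the thesis does not prescribe a specific library) to build the kind of dendrograms shown in Figure 2.2.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Toy data set: ten points in the plane.
    X = np.random.rand(10, 2)

    # Agglomerative clustering with two different linkage strategies.
    Z_complete = linkage(X, method='complete')   # complete linkage (min similarity)
    Z_average = linkage(X, method='average')     # group average linkage

    # Cut the complete-linkage hierarchy into three flat clusters.
    labels = fcluster(Z_complete, t=3, criterion='maxclust')
    print(labels)

    # dendrogram(Z_complete) would plot the tree of nested partitions.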

2.4 Kernel Methods

The notion of similarity is crucial for the definition and the application of machine learning techniques. In particular, this consideration holds regardless of the underlying learning strategy applied, e.g., supervised or unsupervised learning. Roughly speaking, the most intuitive way to characterise the similarity computation between pairs of objects is simply to count the number of their common features, and this is in fact not very different from what is actually done. Nevertheless, more formal definitions are required in order to integrate a proper

notion of similarity into the computational framework of ML techniques. Thus, let X be the (usual) d-dimensional set of input data, as formalised in Definition 2.23. Moreover, let us consider a similarity measure of the form

k : X × X → ℝ;  (x, x′) ↦ k(x, x′)    (2.25)

that is, a function that, given two patterns x and x′, returns a real number characterising their similarity [166]. Without loss of generality, let us assume that k is symmetric, that is, k(x, x′) = k(x′, x) ∀ x, x′ ∈ X. For reasons that will become clearer later, the function k is called a kernel [27, 166, 167]. A very straightforward and simple type of similarity measure that is of particular mathematical appeal is the dot product (Definition 2.4). However, this measure strongly relies on a vectorial representation of the input data. In other words, the underlying assumption necessary to employ the dot product as a similarity measure is that the input data must live in a dot product space (Definition 2.2). Unfortunately, this assumption does not hold in the general case: for example, structured data, such as trees or graphs, do not have an explicit vectorial representation, so the learning algorithm may not leverage the dot product computation to compare them. As a result, to tackle this kind of issue, kernel functions, or simply kernels, have been proposed [8], and applied in combination with learning algorithms, e.g., Support Vector Machines (SVM) [48], that use kernels to compute inner products between data vectors. To date, a number of learning algorithms, called Kernel Methods, have been proposed in the literature [147, 166, 167]. As a matter of fact, the class of Kernel Methods comprises all those (learning) algorithms that do not require an explicit representation of the examples, but only information about the similarities among them. Any Kernel Method can be decoupled into two different components:

1. A problem-specific kernel function

2. A general purpose learning algorithm.

In particular, kernel functions have some interesting (mathematical) features:

• The space of kernel functions is closed under operations such as addition and multiplication, thus allowing the composition of different kernel functions to remain a kernel.

• Considering a finite input data set of n objects, kernel functions are represented by an n × n matrix, regardless of the size of each individual object. This property can be very useful when a small dataset of large objects (in terms of features) has to be analysed.

In the following Sections, some mathematical background and properties of kernels will be described (Section 2.4.1), together with a general overview of kernel methods defined for structured data (Section 2.4.2).

2.4.1 Dual Representation

As briefly emphasised above, the dot product approach is not sufficiently general to deal with many interesting problems [87, 166]. However, in order to be able to use the dot product as a similarity measure, we first need to represent the input data as vectors in some dot product space H (which need not coincide with ℝᴺ [166]).

Therefore, let X = {x1, . . . , xm} be a metric space (Definition 2.7). We may define:

Definition 2.27. (Mapping Function):

Φ : X → H;  x ↦ x    (2.26)

That is a function that maps a given input object to its vectorial representation.

In particular, the space H is called the feature space [166, 167]. To summarise, embedding the data into H by Φ has the following advantages [166]:

• It lets us define a similarity measure from the dot product in H, such that

k(x, x′) = ⟨x, x′⟩ = ⟨Φ(x), Φ(x′)⟩    (2.27)

• No assumption or limitation is imposed on the definition of the mapping function, and this freedom allows choosing mappings that may be used in a large variety of cases.

Before formally defining what a kernel function actually is, some preliminary definitions are required.

Definition 2.28. (Positive definite matrix): An n × n matrix A = (aij), aij ∈ ℝ, is called a positive definite matrix iff:

∑_{i=1}^{n} ∑_{j=1}^{n} ci cj aij ≥ 0    (2.28)

for all c1, ..., cn ∈ ℝ.

The basic properties of positive definiteness are stated by the following results.

Proposition 2.5. Let A be a symmetric matrix. Then the following two statements are equivalent:

• A is positive definite.

• All the eigenvalues of A are ≥ 0.

Definition 2.29. (Mercer Kernel): Let X be a non-empty set. A function k : X × X → ℝ is called a Positive Definite Kernel (or Mercer Kernel) iff:

∑_{i=1}^{n} ∑_{j=1}^{n} ci cj k(xi, xj) ≥ 0    (2.29)

for all n ∈ ℕ, x1, ..., xn ∈ X, and c1, ..., cn ∈ ℝ.

In the following, some important results underlying the basic properties of pos- itive definite matrices are reported. For the sake of brevity, the corresponding proofs will be omitted. Further details are provided in [166] and [167].

Proposition 2.6. Let k be a kernel on X × X. The following properties hold:

• k is positive definite iff k is symmetric.

• k is positive definite iff, for every finite subset X0 ⊆ X, the restriction of k to X0 × X0, namely k^[X0×X0], is positive definite.

Moreover, if k is positive definite, then k(x, x) ≥ 0 ∀ x ∈ X.

Remark 2.15. The inner product is a Positive Definite Kernel, namely it is a Mercer Kernel.

Proposition 2.7. (Cauchy-Schwarz Inequality): For any Positive Definite Kernel k, the Cauchy-Schwarz inequality holds:

|k(x, x′)|² ≤ k(x, x) · k(x′, x′)    (2.30)

The following result, due to Mercer, allows the use of kernel functions to compute dot products.

Proposition 2.8. Let K be a symmetric function such that:

K(x, y) = ⟨Φ(x), Φ(y)⟩  ∀ x, y ∈ X    (2.31)

where Φ : X → H, and H is the (dot product) feature space. K can be represented in the form of Equation 2.31 iff the matrix

K = (K(xi, xj))_{i,j=1}^{N}    (2.32)

is positive semidefinite, namely K is a Mercer Kernel. Moreover, K defines an explicit mapping if Φ is known in advance; otherwise the mapping is implicit.

Equation 2.32 defines a kernel function in terms of its corresponding matrix representation, namely the so-called Gram matrix. In fact, alternative definitions of kernel functions in terms of such a Gram matrix are also provided [167]:

Definition 2.30. (Gram Matrix): The Gram matrix Gᴷ related to the kernel K with respect to the input data set X is defined as:

Gᴷ_{i,j} = K(xi, xj)    (2.33)

Remark 2.16. K is a valid kernel function if and only if it is symmetric and positive semidefinite, namely if any of its Gram matrices is symmetric and positive semidefinite.

In particular, a matrix is symmetric iff K(xi, xj) = K(xj, xi) ∀ i, j. Besides, it is positive semidefinite if all of its eigenvalues are nonnegative.

In conclusion, it is worth repeating that kernel functions are closed under operations such as addition and multiplication. Therefore, the sum, the product, or a linear combination (Definition 2.3) of kernels still produces a valid kernel. More formally:

Proposition 2.9. Let K1, K2 : X × X → ℝ be two kernel functions on a metric space X = {x1, x2, ..., xn}. Then, the following properties hold [167]:

1. (Additive Property) K⁺(x, x′) = K1(x, x′) + K2(x, x′) is a valid kernel.

2. (Multiplicative Property) K*(x, x′) = K1(x, x′) · K2(x, x′) is a valid kernel.
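To make the closure properties and the Gram-matrix characterisation of Remark 2.16 concrete, the following minimal sketch builds two simple kernels on real vectors, combines them by sum and product, and verifies that the resulting Gram matrices are symmetric and positive semidefinite by inspecting their eigenvalues; the specific kernels chosen here are only illustrative assumptions.

    import numpy as np

    def linear_kernel(x, y):
        # Plain dot product between two feature vectors.
        return float(np.dot(x, y))

    def rbf_kernel(x, y, gamma=0.5):
        # Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2).
        return float(np.exp(-gamma * np.sum((x - y) ** 2)))

    def gram_matrix(kernel, X):
        # Gram matrix G[i, j] = kernel(x_i, x_j) over the data set X.
        n = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

    def is_valid_kernel_matrix(G, tol=1e-10):
        # Symmetric and positive semidefinite (all eigenvalues >= 0, up to tolerance).
        symmetric = np.allclose(G, G.T)
        eigenvalues = np.linalg.eigvalsh(G)
        return symmetric and bool(np.all(eigenvalues >= -tol))

    X = [np.random.randn(4) for _ in range(6)]
    G1 = gram_matrix(linear_kernel, X)
    G2 = gram_matrix(rbf_kernel, X)

    # Closure under addition and (element-wise) multiplication of the Gram matrices.
    print(is_valid_kernel_matrix(G1 + G2))   # True
    print(is_valid_kernel_matrix(G1 * G2))   # True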

2.4.2 Kernels for Structured Data

Kernel methods have been widely adopted in many machine learning techniques [87], thanks to their flexibility in providing an efficient and general mechanism to compute the similarity between objects (Proposition 2.8). As a matter of fact, to date an increasing number of kernels that perform efficient comparisons between structured data have been proposed [167]. An extensive survey of kernel methods for structured data is provided in [75] and in [74]. By structured data, we mean data that are formed by combining simpler components into more complex items, frequently involving a recursive use of simpler objects of the same type [167]. Examples of structured data include complex (discrete) objects such as trees and graphs [167]. Convolution Kernels represent a general methodology for computing kernels on complex discrete objects [84]. The basic idea is that a complex object may be split into parts, and its similarity calculation may be expressed in terms of its constituent subparts. Thus, assuming that a positive semidefinite kernel on the parts is available, Convolution Kernels describe a way of preserving the positive semidefiniteness of functions through a sum of kernels on the parts (Proposition 2.9).

Definition 2.31. (Convolution Kernel): Let X, X1, ..., XD be D + 1 non-empty separable metric spaces, x ∈ X a structure, and x = (x1, x2, ..., xD) the parts of x. Let R : X1 × X2 × ... × XD × X be a relation such that R(x, x) is true iff (x1, x2, ..., xD) are the parts of x, and let R*(x) be the set of all the subparts of x. Then the Convolution Kernel can be expressed as:

k(xi, xj) = ∑_{xi ∈ R*(xi)} ∑_{xj ∈ R*(xj)} ∏_{d=1}^{D} kd(xi^d, xj^d)    (2.34)

where the different kd are kernels defined on the substructures.

Remark 2.17. Haussler in [84] provides a complete proof demonstrating that, if the kd are positive semidefinite, then the kernel k in Equation 2.34 is also positive semidefinite, and thus a kernel itself (Proposition 2.8).

Remark 2.18. (R-convolution): The relationship R is called the R-convolution relation [84].

In particular, given R : X1 × X2 × ... × XD × X such that R(x, x) ↦ true iff x is a valid decomposition of x, it is always possible to define the reverse relationship, namely R⁻¹(x) = {x | R(x, x) = true}. Convolution Kernels have been successfully applied to a variety of problems involving structured data, thus appearing to be a valid strategy for dealing with structured data. In particular, in the following sections, a brief review of the main contributions for tree- and graph-structured data is provided.

Tree Kernels

Tree Kernels have been widely used where the information is represented by means of tree-based structures, such as in Natural Language Processing [41, 144, 146] and Bioinformatics [180], where they have been applied to parse trees and phylogenetic trees, respectively. Vishwanathan and Smola describe in [181] a fast kernel that is applicable to strings† and trees. In particular, when applied to tree structures, the proposed kernel considers proper subtrees (Definition 2.21) of the input tree. The kernel applies a weighted sum of the number of common subtrees, based on the assumption that the total number of matching subtrees is small compared to the size of the feature space. More formally, the proposed kernel is defined as:

K(Ti, Tj) = ∑_{t ∈ Ti} ∑_{u ∈ Tj} m(t, u) · wt

where t, u are subtrees of the input trees Ti and Tj, respectively. Moreover, wt is the weight associated with the subtree t, and m(·, ·) is a Boolean (filter) function defined as:

m : T × T → {0, 1}

Such a function returns 1 if the two input (sub)trees are identical, and 0 otherwise. One limitation of this kernel for tree structures is that it is not able to measure a similarity function based on the common subset trees (Definition 2.22). A kernel for trees based on counting matching subset trees has been proposed by Collins and Duffy in [41].

Let T be a tree, and let T = {T1, ..., Tn} be a set of input trees in which m different subset trees are present. Each feature, i.e., each subset tree, can thus be indexed by an integer number 1 ≤ k ≤ m. Then, the function hk(Tj) counts the number of times the subset tree indexed by k occurs in the tree Tj. Thus each tree Tj ∈ T is represented as Φ(Tj) = [h1(Tj), h2(Tj), ..., hm(Tj)]. Therefore, the final Subset Tree Kernel (SST) is defined as:

K(Ti, Tj) = ⟨Φ(Ti), Φ(Tj)⟩ = ∑_{k=1}^{m} hk(Ti) hk(Tj)    (2.35)

where the SST kernel defines a similarity measure between trees which corresponds to the number of shared subset trees. Nevertheless, while the definition of the kernel function is quite straightforward, the interesting part concerns the definition of a procedure to efficiently calculate the number of common subset trees. To this aim, the authors propose a recursive algorithm based on a function Ik(n), called the Indicator function, which returns 1 if the subset tree indexed by k is rooted at node n, and 0 otherwise. Thus, denoting with Ni the set of nodes of the tree Ti ∈ T, it follows that hk(Ti) = ∑_{n ∈ Ni} Ik(n). Therefore, the SST Kernel in Equation 2.35 can be rewritten as:

K(Ti, Tj) = ∑_{k=1}^{m} hk(Ti) hk(Tj)    (2.36)
          = ∑_{k=1}^{m} ( ∑_{n ∈ Ni} Ik(n) ) ( ∑_{m ∈ Nj} Ik(m) )    (2.37)
          = ∑_{n ∈ Ni} ∑_{m ∈ Nj} ∑_{k=1}^{m} Ik(n) Ik(m)    (2.38)

†Strings are considered structured data as well [167]

In the worst case, the overall computational complexity of the SST Kernel is O(|Ni||Nj|). However, the authors point out that this estimate is an upper bound, since a more accurate analysis shows that the actual complexity depends on the number of matching subset trees. In particular, considering the case of natural language parse trees [41], for which the SST kernel was originally defined, the recursive computation may be bounded by avoiding the comparison of subset trees rooted in different nodes. This observation has resulted in the Fast Tree Kernel algorithm [144], which has the same worst-case complexity but in practical applications may provide a relevant speed-up. In addition, Moschitti in [145] provides a strategy to add expressiveness to the SST kernel proposed by Collins and Duffy for natural language parse trees. This strategy modifies the recursive definition of the kernel function, in order to enlarge the corresponding feature space. To this aim, the author proposed the Partial Tree Kernel (PT), which is able to apply partial matchings between subtrees.

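As an illustration of how the subset-tree counting of Equations 2.36–2.38 can be computed recursively, the following sketch implements the classic Collins–Duffy style recursion on simple labelled trees (a node label plus ordered children); the tree representation, the treatment of leaves, and the decay factor λ are assumptions of this example, not details prescribed by the thesis.

    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def production(n):
        # A "production" is the node label together with the ordered labels of its children.
        return (n.label, tuple(c.label for c in n.children))

    def nodes(tree):
        stack, out = [tree], []
        while stack:
            n = stack.pop()
            out.append(n)
            stack.extend(n.children)
        return out

    def delta(n1, n2, lam=1.0):
        # Number of common subset trees rooted at n1 and n2 (recursive counting).
        if production(n1) != production(n2):
            return 0.0
        if not n1.children:          # matching leaves
            return lam
        prod = lam
        for c1, c2 in zip(n1.children, n2.children):
            prod *= 1.0 + delta(c1, c2, lam)
        return prod

    def sst_kernel(t1, t2, lam=1.0):
        # K(T1, T2) = sum over all node pairs of delta(n1, n2), cf. Equation 2.38.
        return sum(delta(n1, n2, lam) for n1 in nodes(t1) for n2 in nodes(t2))

    # Toy usage: two small trees sharing the common subtree b(c, d).
    t1 = Node("a", [Node("b", [Node("c"), Node("d")]), Node("e")])
    t2 = Node("f", [Node("b", [Node("c"), Node("d")])])
    print(sst_kernel(t1, t2))   # 6.0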
Graph Kernels

Similarly to trees, some research efforts are being devoted to the definition of proper kernel functions that are able to compare graph-structured data. More formally:

Definition 2.32. (Graph Kernels):[76] Let G denote the set of all graphs, and let Φ: G → H be a map from this set into a dot product space H. Furthermore, let k : G × G → R be such that ⟨Φ(G), Φ(Gi)⟩ = k(G, Gi). If Φ is injective, k is called a Complete Graph Kernel

However, in the general case, the computation of graph similarities is an intrinsically intractable problem:

Proposition 2.10. Computing any Complete Graph Kernel is at least as hard as deciding whether two graphs are isomorphic [76].

In more detail:

Proposition 2.11. (Subgraphs Kernel) [76]: Let G and G′ be two graphs on a metric space G, and let ⊑ be the relationship such that ⊑(S, G) is true iff S is a subgraph of G (Definition 2.20) (shortly, S ⊑ G). Then, the kernel is

ksg(G, G′) = ∑_{S ⊑ G} ∑_{S′ ⊑ G′} kismf(S, S′)    (2.39)

where

kismf(S, S′) = 1 if S = S′, and 0 otherwise    (2.40)

Nevertheless, the first condition in Equation 2.40 holds iff the two graphs are isomorphic, namely there exists an isomorphism between them (Definition 2.18). However, the problem of determining whether such an isomorphism exists has been proven to be an NP-hard problem [47]. As a consequence, most of the solutions proposed in the literature define Graph Kernels that are specifically suited for a particular class of graph structures, making some preliminary assumptions on the input and/or the considered feature

space. For example, the definitions of Tree Kernels consider “specialised” versions of graphs, i.e., trees (Definition 2.19). Moreover, Gärtner et al. in [76] propose the formulation of the so-called Product Graph Kernels, which leverage the definition of the directed graph product (Definition 2.11) to accomplish the matching of substructures.

Definition 2.33. (Product Graph Kernel) [76]: Let G and G′ be two graphs on a metric space G, and let G× = (V×, E×) be the direct product of G and G′, with adjacency matrix E× = E(G × G′) [47].

With a sequence of weights λ = λ0, λ1, ... (λi ∈ ℝ; λi ≥ 0 for all i ∈ ℕ), the Product Graph Kernel is defined as:

k×(G, G′) = ∑_{i,j=1}^{|V×|} [ ∑_{m=0}^{∞} λm E×^m ]_{ij}    (2.41)

if the limit

lim_{n→∞} ∑_{m=0}^{n} λm E×^m    (2.42)

exists.

Therefore, the existence and the computation of the whole kernel function strongly depend on the particular choice of the sequence of weights λm, and thus on the corresponding matrix power series induced by the selected sequence (Equation 2.42). In fact, if we select λm = λ^m, then ∑_m λm corresponds to the geometric series, which is known to converge if and only if |λ| < 1. In this case, the limit is given by lim_{n→∞} ∑_{m=0}^{n} λ^m = 1/(1 − λ). Similarly, the authors in [76] define the geometric series of a matrix as:

lim_{n→∞} ∑_{m=0}^{n} λ^m E^m

This series converges if λ < 1/a, where a ≥ min(∆⁺(G), ∆⁻(G)), the maximal out-degree and in-degree of the graph G, respectively (Remark 2.13). A feasible computation of the limit of a geometric matrix series is possible by inverting the matrix I − λE, which can be computed in roughly cubic time [76]. The proof is almost intuitive and is reported in [76].

Therefore, the Geometric Product Graph Kernel is defined as:

k×(G, G′) = ∑_{i,j=1}^{|V×|} [ ∑_{m=0}^{∞} λ^m E×^m ]_{ij} = ∑_{i,j=1}^{|V×|} [ (I − λE×)⁻¹ ]_{ij}    (2.43)

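To illustrate the closed form of Equation 2.43, the following sketch is a simplified, unlabelled instance of the geometric product graph kernel, written for this discussion rather than taken from [76]: it builds the direct product graph through a Kronecker product of adjacency matrices and sums the entries of (I − λE×)⁻¹.

    import numpy as np

    def geometric_product_graph_kernel(A1, A2, lam=0.1):
        """Geometric (random-walk) product graph kernel for unlabelled graphs.

        A1, A2 : adjacency matrices of the two input graphs.
        lam    : decay factor; must be small enough for the series to converge.
        """
        # Adjacency matrix of the direct product graph (Kronecker product).
        Ax = np.kron(A1, A2)
        n = Ax.shape[0]
        # Closed form of the geometric series: sum over all entries of (I - lam*Ax)^-1.
        K = np.linalg.inv(np.eye(n) - lam * Ax)
        return K.sum()

    # Toy usage: a triangle and a path on three nodes.
    A_triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
    A_path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
    print(geometric_product_graph_kernel(A_triangle, A_path, lam=0.1))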
Differently, Menchetti et al. in [138] propose a Graph Kernel that belongs to the family of Convolution Kernels (Definition 2.31), called the Weighted Decomposition Kernel (WDK). In particular, the WDK is computed by dividing objects into substructures indexed by a selector. Two substructures are then matched if their selectors satisfy an equality predicate, while the importance of the match is determined by a kernel on local distributions fitted on the substructures [138]. One of the key features of the WDK is that no prior assumption is made on the topology of the analysed graphs, making the kernel able to handle very large families of graph structures. The only restriction concerns the fact that graphs are supposed to be directed (Definition 2.15), acyclic (Definition 2.16), and labelled (Definition 2.10). The basic idea of the WDK is that, given a pre-determined R-convolution (Remark 2.18) relationship on the feature space, the Graph Kernel is obtained by the composition of two fundamental elements, namely the selector and the context. More formally, the WDK is characterised by:

Definition 2.34. (WDK Structures): Let G, G1, G2, ..., GD be D + 1 non-empty separable metric spaces, such that (G1, G2, ..., GD) is a D-tuple of non-empty subgraphs of G. Moreover, let k = (k1, ..., kD) be a D-tuple of positive definite kernels such that kd : Gd × Gd → ℝ, 1 ≤ d ≤ D, and let R be a convolution relationship. A Weighted Decomposition Kernel is characterised by the following decomposition structure:

R = ⟨X, R, (δ, k1, ..., kD)⟩    (2.44)

where R(G1, G2, ..., GD, G) is true iff G1 is a subgraph of G called the selector, and (G2, ..., GD) ∈ G2 × ... × GD is a tuple of subgraphs of G called the contexts of occurrence of G1 in G.

The main advantage of this formulation is that the WDK defines a general Graph Kernel computational framework, which can be easily tailored to the specific problem and domain by a proper definition of the selector and the contexts. However, in order to ensure an efficient kernel computation, some restrictions are placed on the sizes of the above entities [138]. The first assumption concerns the inverse R-convolution relationship R⁻¹(G), for which

|R⁻¹(G)| = O(|V_G| + |E_G|)

In other words, it is assumed that the number of ways a graph can be decomposed grows at most linearly with its size [138]. Furthermore, selectors are assumed to have constant size w.r.t. the graph G, i.e.,

R(G1, G2, ..., GD, G) ⇒ |V_{G1}| + |E_{G1}| = O(1)

Finally, the general definition of the WDK is completed by the kernel on parts:

K(G, G′) = ∑_{(G1, Ḡ) ∈ R⁻¹(G), (G1′, Ḡ′) ∈ R⁻¹(G′)} δ(G1, G1′) ∑_{d=2}^{D} kd(Gd, Gd′)    (2.45)

Part B

Original Contributions


Programming today is all about doing science on the parts you have to work with.
Gerald Jay Sussman

3
Weighting Source Code Lexical Information with a Probabilistic Model for Software Re-modularisation

In the literature, some approaches have been proposed to partition software systems into meaningful subsystems by exploiting the lexical information provided by programmers in the source code (Section 1.3). However, these techniques usually do not consider the programming language sections in which the lexicon appears (e.g., comments, class names, method names). Nevertheless, it is arguable that programmers may place different care in choosing terms for different constructs. For example, developers may choose differently the terms to use for code variables rather than for comments, according to many factors, such as their programming attitudes (e.g., coding conventions), the time-

to-market pressure, the language they used, and the development context. To investigate the conjecture that the lexical information provided by programmers may convey different levels of relevance, the contribution presented in this Chapter defines a novel approach to software clustering, considering separately the contribution of six different vocabularies. These are composed of terms extracted from the different code structures, referred to as zones, where a programmer can add lexical information, namely: (I) Class Names, (II) Attribute Names, (III) Function Names, (IV) Parameter Names, (V) Comments, and (VI) Source Code Statements. Thanks to this separation, we apply an automatic weighting mechanism to exploit the contribution of each vocabulary. Since each software system has its own development peculiarities, no general weighting schema can be imposed a-priori; rather, it should be tailored to each specific system at hand. To this aim, we introduce a probabilistic model of the data, whose parameters, including the zone weights, are optimised by means of an iterative algorithm, namely Expectation-Maximisation (EM) [136]. In more detail, the software clustering approach described in this Chapter aims at generating software partitions relying only on the lexical information contained within the source code of the analysed software system. Thus, in our definition, the generated system partitions (i.e., clusters) will contain lexically related software artefacts. To group related artefacts into different partitions, our approach consists of a pipeline composed of the following three steps:

1. The first step (Section 3.1) is responsible for indexing all the source files of the analysed system, in order to collect all the terms appearing in each considered zone. All these terms, together with their corresponding documents (i.e., artefacts), are then stored in an Information Retrieval (IR) index [130], to be further processed in the next steps.

2. The second step (Section 3.2) is devoted to automatically weighting each zone, thanks to the application of the EM algorithm. Such weights are applied as multipliers in a vector space model representation [130] of the software artefacts, used to compute the similarity among them.

3. In the final step (Section 3.3), all the related artefacts are grouped together by means of a clustering algorithm. In particular, two different well-known clustering algorithms are considered, namely K-Medoids [97] and Group Average Agglomerative Clustering (GAAC) [148], which have been properly customised to make them more suitable for the software domain.

A summary description of these processing steps is depicted in Figure 3.1.

Figure 3.1: The Overall Software Re-modularisation process.

To evaluate whether the introduction of the probabilistic model as well as the use of different clustering algorithms improved the resulting partitions, the approach has been assessed in a case study (Section 3.4). Since no “gold-standard” partition is available in the software modularisation domain [184], we selected a set of 19 well-known open source software systems implemented in Java and we assessed whether the proposed approach was able to automatically group classes in a fashion resembling an authoritative partition, i.e., the original partition defined for these systems. In particular, the authoritative reference partition is gathered from the organisation of the package structures, as already done in other works (e.g.: [28, 163, 165, 187]). Achieved results indicate that the introduction of the probabilistic model highly enhances the process, leading to clusters significantly more similar to the authori- tative partitions.

3.1 Investigating the Use of Source Code Lexical Information

The first task towards the definition of our software clustering approach regards the definition of a technique that is able (I) to extract the lexical information from the source code (Section 3.1.1), and (II) to organise it into some meaningful structures, suitable for further processing (Section 3.1.2).

3.1.1 Source Code Indexing

Since we are interested in the processing of lexical information, we assume that each source file can be treated as a common plain-text document, to which text mining and Information Retrieval techniques can be applied. The first operation required by IR techniques concerns the definition of the so-called document unit [130], namely the granularity at which source files have to be processed. This aspect is particularly important, as it corresponds to the precise definition of the artefacts to consider in the subsequent clustering process. Since we are interested in the clustering of object-oriented software systems, we choose the class as the atomic element, namely the artefact to be indexed. In other words, each class found in the analysed source files is regarded as a different document, according to the IR terminology [130]. From now on, we will use the terms class, artefact and document interchangeably. The extraction and the collection of the terms from the different zones is performed in the so-called (artefact) indexing process [130], whose operational steps are depicted in Figure 3.1.


Figure 3.1: Activity Diagram of the Artefact Indexing Processing Step.

For each class of the system, we analyse the relevance of the lexicon extracted from each of the five zones identified by Abebe et al. [5], namely Class Names (CN), Attribute Names (AN), Function Names (FN), Parameter Names (PN), and Comments (Co). Moreover, we also include the lexicon extracted from the Source Code Statements (SCS). In particular:

I) The CN zone contains all the terms appearing in the signature of a class, i.e., the name of the class and possibly the names of base classes and/or the names of implemented interfaces.

II) The AN zone contains the words in the names of attributes and constants of the class and their corresponding types (if they are not primitive).

III) The FN zone contains the words appearing in method names and in their return type (if it is not primitive).

IV) The PN zone contains the words that are in method parameters, including both the names and the types (if they are not primitive).

V) The Co zone contains all the terms extracted from comments. It is worth noting that possible copyright disclaimers placed as a beginning comment are removed, as already done in other works (e.g., [108]). In particular, such comments do not provide any specific significance to the source file in which they appear, and consequently they are simply discarded.

VI) The SCS zone contains all the lexicon occurring in the body of methods, such as the names of local variables, invoked methods, and passed parameters.

Therefore, to keep track of the different zones of code from which terms are gathered, every single document in the collection is represented as a set of six different buckets (Figure 3.2). To better explain the indexing process and all the applied operations, a sample Java class is reported in Listing 3.1. Given the different artefacts to be indexed, the next processing step involves the application of the so-called tokenization. Such operation consists in


Figure 3.2: Document representation as a set of six different zone buckets.

segmenting the input text into pieces, called tokens [130]. In general, such tokenization is performed by separating the different terms on blank and punctuation characters. However, in order to avoid the extraction of useless and noisy information, all the terms in comments are first filtered in order to remove possibly occurring HTML tags, as they do not provide any significance for the computation (Lines 1 - 10 in Listing 3.1).

Afterwards, all the terms extracted from documents are processed (i.e., normalised [130]) in order to reduce noise and redundancies. Such normalisation applies a set of linguistic pre-processing steps and transformations to the collected terms, plus an additional step to deal with the specific peculiarities of the source code vocabulary [26]. In more detail, the applied normalisation consists of the following operations:

1. All the terms written using different coding conventions are split. As a matter of fact, identifiers are usually composed by concatenating multiple words (Section 1.2). To date, the approach is able to handle only the Camel-case and Alternate-case conventions (capitalised letters used to divide words). For instance, in our example, the identifier NullHandle is split into two distinct words, i.e., Null and Handle. The integration of more refined techniques that are also able to deal with abbreviations, such as the one presented in Section 4.1, is future work we wish to address. Finally, all the remaining terms are lowercased.

1  /**
2   * A handle that doesn't change the owned figure. Its only purpose is
3   * to show feedback that a figure is selected.
4   *
5   * Design Patterns
6   *
7   * NullObject
8   * NullObject enables to treat handles that don't do
9   * anything in the same way as other handles.
10  */
11 public class NullHandle extends LocatorHandle {
12     /**
13      * The handle's locator.
14      */
15     protected Locator fLocator;
16
17     public NullHandle(Figure owner, Locator locator) {
18         super(owner, locator);
19     }
20     /**
21      * Draws the NullHandle. NullHandles are drawn as a
22      * red framed rectangle.
23      */
24     public void draw(Graphics g) {
25         Rectangle r = displayBox();
26         g.setColor(Color.black);
27         g.drawRect(r);
28     }
29 }

Listing 3.1: Sample Java class (extracted from JHotDraw 5.1)

2. All the words appearing in a list of common terms, known as stop words (e.g., the, a, is, etc.), are removed [130], because such terms do not provide any contribution to the analysis. To take into account the peculiarities of the considered domain, we apply different stop word lists to the six different zones. In particular, we remove the most common English terms* occurring in the first four zones; for the fifth and sixth zones, all the keywords of the programming language are also removed. Considering the sample class in Listing 3.1, the terms a, as, are, the, in, is, its, only and that are stopped.

3. All the terms are gathered in equivalence classes based on their morphological root, or stem. Stems allow discarding all the superficial differences in characters (such as plurals, verb conjugations, etc.), and grouping all the terms referring to the same “concept”. To generate the stem of each term, we apply the well-known Porter's algorithm [152]. As for the example class in Listing 3.1, the terms handles and handle appearing in the comments (Lines 1 - 10, Line 13, and Lines 21 - 22) are all reduced to the common stem handl and regarded as a unique term that occurs 6 times. A minimal sketch of the whole normalisation pipeline is given after this list.
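The following minimal sketch illustrates the three normalisation operations just described (identifier splitting, stop-word removal, and stemming); the use of NLTK's Porter stemmer and the tiny stop-word list are assumptions of this example, not the exact resources adopted in the approach.

    import re
    from nltk.stem import PorterStemmer  # assumed implementation of Porter's algorithm [152]

    STOP_WORDS = {"a", "as", "are", "the", "in", "is", "its", "only", "that"}  # excerpt only
    stemmer = PorterStemmer()

    def split_identifier(token):
        # Split Camel-case / Alternate-case identifiers, e.g. "NullHandle" -> ["Null", "Handle"].
        return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", token)

    def normalise(tokens):
        terms = []
        for token in tokens:
            for word in split_identifier(token):
                word = word.lower()
                if word in STOP_WORDS:
                    continue
                terms.append(stemmer.stem(word))   # e.g. "handles" -> "handl"
        return terms

    print(normalise(["NullHandle", "handles", "the", "fLocator"]))
    # ['null', 'handl', 'handl', 'f', 'locat']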

Once the normalisation process is completed, the final step of the indexing process involves the construction of the so-called inverted index [130]. This structure represents one of the central concepts of IR approaches, as it provides an efficient strategy to store the different terms and the references to their corresponding documents [130]. The basic idea of an inverted index is to keep a dictionary of

*http://www.textfixer.com/resources/common-english-words.txt

terms, whose entries correspond to the lists of documents containing the corresponding term, called postings [130]. However, in the proposed approach, we want to keep track of the particular zone from which a single term has been extracted, since terms from different zones are treated as different terms. To this aim, we append to each term a suffix corresponding to the name of its zone, namely _cn, _an, _fn, _pn, _co or _scs. For example, the term locator in Listing 3.1 appears at the same time in five out of the six zones, namely CN (Line 11), AN (Line 15), PN (Line 17), Co (Line 13), and SCS (Line 18), respectively. Therefore, in this case the dictionary will contain five different entries, corresponding to the five different zones in which the term locator appears, namely locator_cn, locator_an, and so forth. Finally, it is worth noting that such a modification of the terms added to the dictionary also changes the meaning of their corresponding postings lists. In fact, in our case, the definition of a postings list becomes: “the list of documents containing the corresponding term in a particular zone”. Thus, for example, the postings list of the term locator_an references all the documents where the term locator appears in the AN zone.
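A minimal sketch of such a zone-aware inverted index is shown below: each normalised term receives the suffix of its zone, and the postings list records the documents where the term occurs in that zone. The dictionary-based layout and the document identifiers are assumptions of this illustration.

    from collections import defaultdict

    ZONES = ("cn", "an", "fn", "pn", "co", "scs")

    # dictionary: zone-suffixed term -> postings list (set of document identifiers)
    inverted_index = defaultdict(set)

    def index_document(doc_id, zone_terms):
        """zone_terms maps each zone to the list of normalised terms found in it."""
        for zone, terms in zone_terms.items():
            for term in terms:
                inverted_index[f"{term}_{zone}"].add(doc_id)

    # Toy usage with the class of Listing 3.1.
    index_document("NullHandle.java", {
        "cn": ["null", "handl", "locat", "handl"],
        "an": ["locat", "f", "locat"],
        "co": ["handl", "locat", "figur"],
    })
    print(sorted(inverted_index["locat_an"]))   # ['NullHandle.java']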

3.1.2 Representation Model

Following an approach widely adopted by information retrieval systems, we represent each document (namely, class) of the collection as a bag-of-words, i.e., the multi-set of all the tokens contained in the document, given by all words and their occurrences. In this way, word order gets lost, and any further processing is only based on lexical information. Within the bag-of-words model [130], each document is represented by an array of real numbers whose elements are associated with the corresponding terms of the dictionary. This vectorial representation allows us to represent each document as a point in a multi-dimensional geometrical space: the so-called Vector Space model [130]. In this space, the similarity among different documents can be expressed in terms of the so-called cosine similarity [130], based on the computation of the inner product among normalised document vectors.

In more detail, for each term of the dictionary and for each document, a score following the tf-idf schema [130] is computed. Indeed, tf-idf is adopted in a large number of IR applications because of its good trade-off between simplicity and effectiveness in describing the relevance of the terms with respect to the documents. It is defined as follows: given a collection of N documents, namely the number of classes of the system under investigation, tf(t, d) (term frequency) is defined as the number of occurrences of the term t in the document d. On the other hand, df(t) (document frequency) indicates the number of documents in which the term t occurs, and its inverse gives an evaluation of the rareness of the term. The rationale underlying the idf component is that, when computing similarity among documents, if a term appears in almost all the documents of the collection, then its discriminative contribution is irrelevant. Conversely, the value of the idf is high when a term appears in few documents, attaining its maximum value (log N) when the term occurs in a unique document. Among the different formulations of the tf-idf schema, due to its numerically good behaviour, we adopted the following one:

tf-idf(t, d) = √(tf(t, d)) · ( log( N / (df(t) + 1) ) + 1 )    (3.1)

Concluding, each document is therefore represented by a vector whose size equals the dictionary size, where each element corresponds to the tf-idf score of the term in the document. For all the terms not belonging to the document, the corresponding element of the vector is zero.

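The following sketch shows how the weighting of Equation 3.1 and the cosine similarity between the resulting document vectors could be computed; it is an illustrative implementation written for this discussion, not the tool used in the evaluation.

    import math
    from collections import Counter

    def tf_idf_vectors(documents):
        """documents: list of token lists. Returns one {term: weight} vector per document."""
        N = len(documents)
        df = Counter(term for doc in documents for term in set(doc))
        vectors = []
        for doc in documents:
            tf = Counter(doc)
            vectors.append({
                term: math.sqrt(freq) * (math.log(N / (df[term] + 1)) + 1)   # Equation 3.1
                for term, freq in tf.items()
            })
        return vectors

    def cosine_similarity(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
        return dot / (norm(u) * norm(v)) if u and v else 0.0

    docs = [["handl_cn", "locat_an", "figur_co"],
            ["locat_an", "figur_co", "draw_fn"]]
    v = tf_idf_vectors(docs)
    print(cosine_similarity(v[0], v[1]))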
3.2 Probabilistic Models for Software Re-modularisation

In this approach, we investigate the conjecture that the considered zones of the code may convey information of different relevance, starting from the observation that developers may place different care in writing code as well as comments. Therefore, the informative contribution of the different zones should be correctly weighted to best exploit the conveyed information. Moreover, since these weights strongly depend on the specificities of each project, their choice cannot be made subjectively, but should be automatically estimated from the data. To this aim, we define an automatic technique that is able to estimate such “relevance” on the basis of the lexical characteristics of each considered project.

In particular, we are interested in determining the weights of the zones to be used as multiplicative factors for the tf-idf values of the terms during the similarity computation among documents.

A well founded framework to solve such a problem is given by probabilistic approaches, where different sources of information are combined by an a-priori probability distribution.

The Maximum Likelihood Estimation (MLE) is one of the most widely adopted approaches to estimate parameters of a probabilistic model.

This approach aims at finding the parameters of the considered model which maximise the probability of the set of samples. In simple words, we defined three probabilistic models describing how the words are generated and then we chose for these models the parameters which maximise the probability of our data.

The underlying idea is that each document is produced in two steps: first, a discrete random variable chooses the zone to which each token belongs; then, the token is generated according to the distribution associated with that zone. The probability distribution of this variable corresponds to the a-priori probability of each zone, and is multiplied by the probability of the document.

In more detail, if we look at the Z zones (six in our case) as a partition of each document, the probability of each document d is given by:

Pr(d) = ∏_{z=1}^{Z} Pr(d|z)    (3.2)

Then, the formulation of the different conditional probabilities of the documents is characterised by the particular probabilistic model assumed for the dictionary terms. In particular, without loss of generality, such terms are assumed to be statistically independent [130].

Therefore, our general probabilistic model is expressed by a mixture of multi- variate probabilistic distributions, that can be differently instantiated (and thus, formalised) according to the assumed distribution model of the data.

The main goal of this model is to estimate the values of the a-priori probabilities of the different mixture components, namely the αz, and the remaining model parameters, that maximise the (log-)likelihood of the project:

log L = ∑_{i=1}^{N} log ∑_{z=1}^{Z} αz Pr(di|z)    (3.3)

It is worth noting that this formulation is correct only under the assumption that all the documents in the collection are statistically independent, so that the probability of the project is given by the product of the probabilities of its documents. In this approach, three different probabilistic models are considered, namely (I) the Gaussian, (II) the (multivariate) Bernoulli, and (III) the multinomial model. In the first model, the tf-idf values of the terms in each zone are assumed to follow a Gaussian distribution, characterised by the values of mean (µ) and standard deviation (σ), one pair for each Gaussian distribution, namely one for each zone. In addition, we also consider the Bernoulli and multinomial probabilistic models [130]. In the former, the probability of the document is given by the product of the probabilities of all the dictionary terms of being (or not being) in the document. In the latter, only the probabilities of the terms appearing in the document are considered, taking into account the number of their occurrences. These two representations are conceptually quite different: the Bernoulli model completely disregards the number of occurrences of each term in the document, while in the multinomial model only the terms that appear in the input document contribute to the probability computation, with a factor that depends on the corresponding occurrences. A summary description of the three different formulations of the considered probabilistic models is reported in Table 3.1.

3.2.1 An EM algorithm for Parameter Estimation

As discussed above, we estimate the model parameters by maximising the probability of the set of examples, in our case the documents in the project, by applying the MLE criterion. To find the values maximising the likelihood, the iterative EM algorithm [53] is applied.

Document probabilities:
- Gaussian: \( \Pr(d \mid z) = \prod_{t=1}^{M} G(\mu_z, \sigma_z) \)
- Bernoulli: \( \Pr(d \mid z) = \prod_{t : t \in d[z]} \Pr(t \mid z) \prod_{t : t \notin d[z]} \bigl(1 - \Pr(t \mid z)\bigr) \)
- Multinomial: \( \Pr(d \mid z) = \prod_{t=1}^{M} \Pr(t \mid z)^{\mathrm{tf}(t, d[z])} \)
- All models: \( \Pr(d) = \sum_{z=1}^{Z} \alpha_z \Pr(d \mid z) \)

Initialisation:
- Gaussian: \( \mu_z = \frac{1}{N}\sum_{d=1}^{N} x_{t,d[z]} \), \( \sigma_z^2 = \frac{1}{N}\sum_{d=1}^{N} \bigl(x_{t,d[z]} - \mu_z\bigr)^2 \)
- Bernoulli: \( \Pr(t \mid z) = \frac{1}{N} \sum_{d : t \in d[z]} 1 \)
- Multinomial: \( \Pr(t \mid z) = \frac{\sum_{d=1}^{N} \mathrm{tf}(t, d[z])}{\sum_{d=1}^{N} \sum_{t'=1}^{M} \mathrm{tf}(t', d[z])} \)
- Priors (Gaussian, Multinomial): \( \alpha_z = \frac{\sum_{d=1}^{N}\sum_{t=1}^{M} \mathrm{tf}(t, d[z])}{\sum_{z'=1}^{Z}\sum_{d=1}^{N}\sum_{t=1}^{M} \mathrm{tf}(t, d[z'])} \); Priors (Bernoulli): \( \alpha_z = \frac{\sum_{t=1}^{M}\sum_{d : t \in d[z]} 1}{\sum_{z'=1}^{Z}\sum_{t=1}^{M}\sum_{d : t \in d[z']} 1} \)

Expectation step (all models):
\( r_{d,z} = \frac{\alpha_z \Pr(d \mid z)}{\sum_{z'=1}^{Z} \alpha_{z'} \Pr(d \mid z')} \)

Maximisation step:
- Gaussian: \( \mu_z = \frac{\sum_{d=1}^{N} r_{d,z}\, x_{t,d[z]}}{\sum_{d=1}^{N} r_{d,z}} \), \( \sigma_z^2 = \frac{\sum_{d=1}^{N} r_{d,z}\,\bigl(x_{t,d[z]} - \mu_z\bigr)^2}{\sum_{d=1}^{N} r_{d,z}} \)
- Bernoulli: \( \Pr(t \mid z) = \frac{\sum_{d : t \in d[z]} r_{d,z}}{\sum_{d=1}^{N} r_{d,z}} \)
- Multinomial: \( \Pr(t \mid z) = \frac{\sum_{d=1}^{N} r_{d,z}\,\mathrm{tf}(t, d[z])}{\sum_{d=1}^{N} r_{d,z} \sum_{t'=1}^{M} \mathrm{tf}(t', d[z])} \)
- Priors (all models): \( \alpha_z = \frac{\sum_{d=1}^{N} r_{d,z}}{\sum_{z'=1}^{Z}\sum_{d=1}^{N} r_{d,z'}} \)

Table 3.1: EM computation for the three different models with frequentist initialisation. G(µ, σ) denotes the Gaussian distribution with mean µ and standard deviation σ; x_{t,d[z]} indicates the index value for the token t in zone z of document d, and tf(t, d[z]) its term frequency.

EM alternates two main steps (see Table 3.1) during its execution: in the Expectation step, the weights corresponding to each (document, zone) pair are (re)computed on the basis of the current parameter values; in the Maximisation step, the model parameters are (re)computed in such a way that the likelihood does not decrease. The algorithm halts when the increase in likelihood at a given iteration is smaller than a given threshold, or when a maximum number of iterations has been performed.

Finally, among all the resulting parameters, the algorithm returns the values of the zone priors: a large value of α_z suggests that the contribution of the z-th zone is important for the model. Thus, we want to combine the zone scores with weights proportional to these priors. As both weights and priors ought to sum to one, we choose the priors exactly as weights.

It is worth noting that one of the problems of the EM algorithm is that it can attain a local maximum rather than a global one. Therefore, the choice of the initial values of the parameters is very critical for the optimisation results [137]. While in the case of the Bernoulli and multinomial models such initialisation is straightforward, as it is induced by the model formalisation, in the case of the Gaussian model different strategies may be applied to set the initial values of the model parameters, namely µ and σ. After a preliminary experimental investigation, we adopted an initialisation strategy that estimates the initial parameters by considering the ratio between the number of tokens in each zone and the total number of tokens. In particular, such a strategy starts by assigning "more importance" to the zones containing more lexical information.
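As an illustration of the estimation procedure, the sketch below implements the EM iterations of Table 3.1 for the Bernoulli model on a toy binary term-presence tensor. The sizes, variable names and random data are invented, and the actual implementation may differ (e.g., in how empty zones or ties are handled):

import numpy as np

rng = np.random.default_rng(0)

N, Z, M = 40, 6, 30                  # documents, zones, vocabulary size (toy values)
X = (rng.random((N, Z, M)) < 0.2).astype(float)   # X[d, z, t] = 1 iff term t occurs in zone z of d

# Frequentist initialisation (Table 3.1, Bernoulli column).
p = X.mean(axis=0)                   # p[z, t] ~ Pr(t|z)
alpha = X.sum(axis=(0, 2))
alpha = alpha / alpha.sum()          # zone priors

def log_pr_d_given_z(X, p, eps=1e-9):
    """log Pr(d|z) for every (document, zone) pair under the Bernoulli model."""
    p = np.clip(p, eps, 1 - eps)
    return (X * np.log(p) + (1 - X) * np.log(1 - p)).sum(axis=2)    # shape (N, Z)

prev_ll = -np.inf
for iteration in range(100):
    # E-step: responsibilities r[d, z], computed in log space for stability.
    log_joint = np.log(alpha) + log_pr_d_given_z(X, p)              # (N, Z)
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    r = np.exp(log_joint - log_norm)

    # M-step: re-estimate term probabilities and zone priors.
    p = (r[:, :, None] * X).sum(axis=0) / r.sum(axis=0)[:, None]
    alpha = r.sum(axis=0) / r.sum()

    # Halting criterion: negligible increase of the log-likelihood.
    ll = log_norm.sum()
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print("estimated zone weights:", np.round(alpha, 3))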

3.3 Clustering of Software Artefacts

As already mentioned, the software re-modularisation problem is also sometimes referred to in the literature as software clustering, as it regards the clustering of related software entities. Even if this definition is quite trivial, it emphasises that this problem has several aspects in common with a typical clustering problem as intended in the machine learning literature (Section 2.3). In more detail, the software re-modularisation problem belongs to the category

of hard clustering tasks, since each entity, namely each class of the system, can be associated with one and only one cluster (Section 2.3). Moreover, as in any other unsupervised machine learning approach, one of the key issues of the technique is the choice of the similarity measure, which is crucial for the clustering performance since it states the criteria to decide whether two software entities are similar enough to be put into the same cluster [131].

In the defined vector space model, the similarity between two classes is typically computed by applying the well-known cosine similarity (Remark 2.8), expressed as the cosine of the angle determined by the two vectors representing them (a minimal sketch is given below). Nevertheless, the clustering of software entities introduces some constraints imposed by the specific domain. The most important one is that an automatically produced partition should contain neither too large clusters (i.e., containing hundreds of software entities) nor too small ones (i.e., containing very few software entities) [187]. For this reason, standard algorithms may not be effective unless they are (slightly) modified to impose such constraints.

In the remainder of this Section, a description of the proposed customisation of two well-known clustering algorithms is reported: the K-Medoids clustering algorithm is described in Section 3.3.1, while the Group Average Agglomerative Clustering is discussed in Section 3.3.2.
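The following minimal sketch shows one way such a zone-aware similarity can be computed: each class is represented by one term vector per zone, the per-zone cosine scores are combined with the zone weights, and the result is a single similarity value. The zone names, the weights and the token lists are hypothetical, and the weighted combination is only one plausible reading of the scheme, not necessarily the exact one used in the experiments:

import math
from collections import Counter

# Hypothetical zone weights (e.g., as produced by the EM step).
ZONE_WEIGHTS = {"class_name": 0.30, "methods": 0.25, "attributes": 0.15,
                "parameters": 0.10, "comments": 0.15, "statements": 0.05}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity(doc1: dict, doc2: dict) -> float:
    """Weighted combination of the per-zone cosine scores of two classes."""
    return sum(w * cosine(Counter(doc1.get(z, [])), Counter(doc2.get(z, [])))
               for z, w in ZONE_WEIGHTS.items())

class_a = {"class_name": ["rectangle", "figure"],
           "methods": ["draw", "resize", "move"],
           "comments": ["draw", "rectangle", "shape"]}
class_b = {"class_name": ["ellipse", "figure"],
           "methods": ["draw", "resize"],
           "comments": ["draw", "ellipse", "shape"]}

print(f"similarity = {similarity(class_a, class_b):.3f}")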

3.3.1 K-medoids

As described in Section 2.3.2, the K-Medoids algorithm is a well-known variation of the classical K-Means algorithm, which is more robust with respect to noise and outliers. Since the resulting clustering strongly depends on the initial choice of the medoids, which are randomly selected, we introduced a novel halting criterion that avoids unbalanced solutions, i.e., the risk of obtaining extremely small or extremely large clusters, a constraint that makes sense in the context of software re-modularisation. Indeed, the original K-medoids algorithm starts with a random choice of the k medoids and iterates, assigning at each step all the entities to the most similar medoids and then recomputing the set of medoids. Finally, the algorithm returns the desired partition organised as a set of k different

clusters. An algorithmic description of the K-medoids algorithm is reported in Algorithm 1. However, the main drawback of the algorithm is that the resulting clusters strongly depend on the initial configuration. Thus, unlucky configurations could result in a partition including too small clusters: in the proposed variant of the algorithm, the whole procedure is repeated until a solution with no extreme clusters is attained or a maximum number of iterations is reached. Even when the procedure halts due to the latter condition, the algorithm provides the best solution among all the ones found across the iterations.
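The sketch below illustrates the proposed variant: a standard K-medoids loop wrapped in a restart procedure that keeps the best non-extreme solution found within a maximum number of attempts. The function and parameter names, the use of a precomputed similarity matrix, and the size thresholds (5 and 100, mirroring the limits used later for the NED measure) are illustrative and do not reproduce the actual implementation:

import random

def k_medoids(sim, k, max_iter=50, rng=random):
    """One run of K-medoids on a precomputed similarity matrix (higher = more similar)."""
    n = len(sim)
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        # Assign each entity to its most similar medoid.
        clusters = [[] for _ in medoids]
        for i in range(n):
            best = max(range(len(medoids)), key=lambda j: sim[i][medoids[j]])
            clusters[best].append(i)
        # Recompute each medoid as the member maximising intra-cluster similarity.
        new_medoids = [max(members, key=lambda c: sum(sim[c][j] for j in members))
                       if members else medoids[idx]
                       for idx, members in enumerate(clusters)]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [c for c in clusters if c]

def non_extreme(partition, lam=5, big=100):
    """Fraction of entities lying in clusters of acceptable size (a NED-like score)."""
    n = sum(len(c) for c in partition)
    return sum(len(c) for c in partition if lam <= len(c) <= big) / n

def k_medoids_with_restarts(sim, k, max_restarts=50):
    """Repeat K-medoids until a partition with no extreme clusters is found,
    keeping the best partition seen if the restart budget is exhausted."""
    best, best_score = None, -1.0
    for _ in range(max_restarts):
        partition = k_medoids(sim, k)
        score = non_extreme(partition)
        if score > best_score:
            best, best_score = partition, score
        if score == 1.0:            # no extreme clusters: halt
            break
    return best

# Toy usage on a random similarity matrix.
sim = [[1.0 if i == j else random.random() for j in range(60)] for i in range(60)]
print([len(c) for c in k_medoids_with_restarts(sim, k=6)])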

3.3.2 Group Average Agglomerative Clustering

In addition to the K-Medoids algorithm, the Group Average Agglomerative Clustering (GAAC) algorithm has also been considered, which belongs to the category of hierarchical clustering algorithms (Section 2.3.3). In particular, the GAAC algorithm employs a linkage strategy that aggregates two clusters based on the average similarity of all pairs of entities belonging to them (see Table 2.1). The main advantage of such a strategy is that it is more robust with respect to outliers and tends to produce more balanced dendrograms [130]. An example of a dendrogram resulting from the application of the GAAC algorithm is reported in Figure 2.2.

The main feature of hierarchical clustering algorithms is that they are deterministic and do not require several random initialisations (as partitional clusterings do, e.g., K-medoids). Moreover, although the asymptotic time complexity of the HAC approach is worse than that of K-medoids (Section 2.3.3), in the experiments we performed K-medoids turned out to be slower, because it was applied a large number of times with different initial points.

Conversely, from a software re-modularisation perspective, the main drawback of HAC is that it does not provide a flat partition of the system, due to its agglomerative nature. Therefore, to get such a partition, the dendrogram has to be properly cut [132]. To this aim, the proposed customisation of the HAC algorithm consists of a specialised cutting strategy. In particular, this strategy optimises the non-extremity distribution of the partitions, aiming at generating at most k clusters.

Algorithm 5 GAAC Cutting Strategy
Input: Λ : The maximum number of elements admitted in a single cluster.
Input: T : The dendrogram to be cut.
Input: k : The number of partitions to generate.
Output: P : The set of k different partitions.
1: function GAACutStrategy(Λ, T, k)
2:   r ← root(T)
3:   if r = null then
4:     return P
5:   end if
6:   if (isLeaf(r)) ∨ (|P| ≥ k) then
7:     P ← P ∪ {T}
8:     return P
9:   end if
10:  leftT ← subtree(left(T))    ▷ Get the left subtree rooted in T
11:  rightT ← subtree(right(T))  ▷ Get the right subtree rooted in T
12:  if (|leftT| ≥ Λ) ∧ (|rightT| ≥ Λ) then
13:    P ← P ∪ GAACutStrategy(Λ, leftT, k)
14:    P ← P ∪ GAACutStrategy(Λ, rightT, k)
15:  else if (|leftT| ≥ Λ) ∧ (|rightT| < Λ) then
16:    P ← P ∪ {rightT}
17:    P ← P ∪ GAACutStrategy(Λ, leftT, k)
18:  else if (|leftT| < Λ) ∧ (|rightT| ≥ Λ) then
19:    P ← P ∪ {leftT}
20:    P ← P ∪ GAACutStrategy(Λ, rightT, k)
21:  else                        ▷ Neither of the two partitions is extreme
22:    P ← P ∪ {leftT}
23:    P ← P ∪ {rightT}
24:  end if
25:  return P
26: end function

It is worth noting that this latter aspect is very important for the assessment of the approach, as it makes the two clustering solutions, namely K-medoids and HAC, fairly comparable. The algorithm for the proposed cutting strategy is reported in Algorithm 5.

3.4 Experimental Settings

The assessment of a clustering is typically based on an annotated test set, the gold standard [130], in which each item of the dataset is labelled with the corresponding cluster. In the case of software clustering tasks, this gold standard could be represented by a set of large and publicly available software systems with a well-understood decomposition that can be used as a benchmark [184]. However, on the one hand, there is no publicly available annotated dataset; on the other hand, the manual generation of such partitions by software architects may be too subjective to represent a benchmark.

Therefore, following other similar works [28, 163, 187], we adopted a fair and repeatable procedure for constructing the gold standard clustering, which is built on the original source folder structure of the system under investigation. The idea behind this protocol is the following: given the bunch of classes of a well-engineered system (such as, for instance, JHotDraw, widely used for teaching purposes) without any structure, if the approach is able to automatically arrange them in a partitioning that resembles the packages proposed by the developers of the system, then the approach will likely perform well also on other software systems. From the software engineering point of view, this measure is called Authoritativeness (Auth) [187].

The authoritative partition is automatically derived in accordance with the following three steps:

1. create the subsystem hierarchy based on the directory (package) structure (each directory represents a single subsystem);

2. merge a subsystem with its parent if it contains less than five source files;

3. create a cluster for each resulting subsystem.

Given such an authoritative partition, the next challenge is to determine a measure that is able to compare clustering results to this partition. Several researchers in the literature have attempted to tackle this problem [11, 103, 109, 184]. One of the first proposed approaches was the measure presented by Lakhotia and Gravely [109]. However, this measure could be used only on dendrograms of

hierarchical clusterings, which in practice strongly limits its applicability to non-hierarchical clustering algorithms.

Afterwards, Anquetil and Lethbridge proposed the use of the well-known measures of Precision and Recall for the evaluation of clustering results [11]. In particular, let A be the automatically identified source partition and B the authoritative partition: they defined Precision as the percentage of intra-pairs, i.e., pairs of items in the same cluster, in A that are also intra-pairs in B. On the other hand, Recall is defined as the percentage of intra-pairs in B that are also intra-pairs in A. The main drawback of this measure is that it is too sensitive to the number and the size of the considered clusters. As a consequence, a few misplaced entities in a cluster could produce very different results.
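A small sketch of this intra-pair formulation may help; the two example partitions are made up:

from itertools import combinations

def intra_pairs(partition):
    """Set of unordered pairs of items placed in the same cluster."""
    return {frozenset(p) for cluster in partition for p in combinations(cluster, 2)}

def precision_recall(automatic, authoritative):
    a, b = intra_pairs(automatic), intra_pairs(authoritative)
    precision = len(a & b) / len(a) if a else 0.0
    recall = len(a & b) / len(b) if b else 0.0
    return precision, recall

A = [["Line", "Rect", "Circle"], ["Parser", "Lexer"]]      # automatic partition
B = [["Line", "Rect"], ["Circle"], ["Parser", "Lexer"]]    # authoritative partition
print(precision_recall(A, B))   # one misplaced entity already halves the precision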

Koschke and Eisenbarth presented in [103] a complex measure which extends and removes limitations of the approach proposed by Lakhotia and Gravely, and which is loosely based on the Precision and Recall measures employed by Anquetil and Lethbridge. The KE measure is built on the definition of GOOD and OK matches. Assuming that p is a threshold parameter and that A_i and B_j are two clusters in the source and authoritative partition, respectively, the following two definitions hold:

\[
\text{(GOOD match)} \quad A_i \approx_p B_j \iff \frac{|A_i \cap B_j|}{|A_i \cup B_j|} \ge p
\]

\[
\text{(OK match)} \quad A_i \subseteq_p B_j \iff \frac{|A_i \cap B_j|}{|A_i|} \ge p
\]

These two matching definitions are then used to split the set of clusters into two distinct classes, one for each relationship. Next, once all the clusters have been classified, the GOOD class is enlarged by joining all the OK matches in which one of the two clusters is already in the GOOD class. All the remaining clusters that are in neither GOOD nor OK matches are referred to as false positives or true negatives, depending on whether they belong to the source or to the authoritative partition, respectively.

Finally, the overall similarity metric is defined as follows:

\[
KE(A, B) = \frac{\displaystyle\sum_{(a,b) \in GOOD} \frac{|a \cap b|}{|a \cup b|} + \sum_{(a,b) \in OK} \frac{|a \cap b|}{|a \cup b|}}{|GOOD| + |OK| + |\mathit{true\ negatives}|}
\]

The KE metric works particularly well when the source partition is close to the authoritative partition. Conversely, it is not as good in more extreme cases, as its definition takes into account only the union and the intersection between clusters, without applying any penalty for the join operations. Last but not least, it relies on the specification of a threshold parameter which could inevitably bias the results.

More recently, Tzerpos and Holt presented in [178] the MoJo distance, which is the measure this work builds on. In particular, let A be the automatically identified source partition and B the authoritative partition; MoJo(A, B) is defined as:

\[
\mathrm{MoJo}(A, B) = \min\bigl(\mathrm{mno}(A, B), \mathrm{mno}(B, A)\bigr)
\]
corresponding to the minimum number of Move and Join operations necessary to transform either the first partition A into the second partition B, or vice versa [178]. The lower the value of MoJo between two partitions, the more effective the clustering algorithm is in creating the software partition.

Differently from the KE measure, this metric explicitly introduces the calculation of a penalty for the join operations, but it has a couple of drawbacks that make its original formulation unsuitable for the assessment of our approach. First of all, we are interested in determining how well the automatically defined partition resembles the authoritative one, and not vice versa. Thus, we need to calculate only mno(A, B). Furthermore, the measure does not make the results comparable among different software systems, as its value strongly depends on the size of their authoritative partitions. Therefore, to overcome those limitations, we used a normalised version of MoJo, namely MoJoFM [184], defined as follows:

\[
\mathrm{MoJoFM}(A, B) = 1 - \frac{\mathrm{mno}(A, B)}{\max(\mathrm{mno}(\forall A, B))} \qquad (3.4)
\]

where max(mno(∀A, B)) calculates the maximal distance to partition B from every possible partition derived from the elements of A. In our case study we used the implementation of MoJo available at http://www.cse.yorku.ca/~bil/downloads.

In conclusion, the Auth measure gives an estimate of the similarity between the clustering proposed by our approach and the authoritative partition. In addition, we also require that the obtained clustering does not include too small or too large clusters, similarly to other related work [28, 163, 165]. To this aim, a measure called Non-Extremity cluster Distribution (NED) has been introduced in [187]. NED is defined as follows:

\[
\mathrm{NED} = \frac{1}{N} \sum_{c_i \in C \,:\, \lambda \le |c_i| \le \Lambda} |c_i|
\]
where N is the number of classes of the analysed software system, C is the set of clusters, and the parameters λ and Λ indicate the minimum and the maximum size allowed for clusters, respectively. In accordance with other similar research, we limited the cluster size to be included between λ = 5 and Λ = 100. In other words, clusters with fewer than 5 or more than 100 software entities are considered extreme with respect to the lower and upper limits, respectively [187]. The larger the NED value, the more non-extreme the size distribution of the clusters.

It is worth noting that the Authoritativeness provides an indicator of the "quality" of the clusters identified by the approach, while the NED only assesses the size distribution of the clustering produced by the approach.

The assessment phase aims at evaluating the contribution of the weighted zones to the clustering performance. However, the question is two-fold, because the improvement could come from the introduction of the zones alone, or from the zones when weighted by the probabilistic models we consider. Therefore, the following three system configurations are considered for the assessment:

complete system includes all the proposals, namely the lexical features with zones weighted by means of the EM algorithm, and exploited by one of the two clustering strategies considered;

flat system is the baseline where zones are not considered at all: the clustering algorithm is applied to the lexical features without zones and weights;

unweighted system represents an intermediate case which only disregards the probabilistic model: the clustering algorithm exploits the lexical features with zones, but without any weighting scheme.

Note that the flat and the unweighted configurations consider different idf scores. In fact, when no zone is considered, the tf is given by the number of occurrences of the term in any part of the document, while the document frequency is the number of documents in which the term occurs. On the other hand, when zones are introduced and the corresponding idf values are combined with a constant weight, the score is computed for each zone separately and the results are then summed. While the tf of the flat configuration can be obtained by summing the tf of each zone, this is not the case for the df, as a term can occur in a different number of documents depending on the considered zone.

According to the criteria, the measures, and the systems defined above, we formulated the following two research questions:

RQ1: Does the unweighted system outperform the flat approach?

RQ2: Does the complete system outperform the unweighted one?

3.4.1 The Dataset

To conduct the investigation presented here, we have used the following 19 open source Java software systems:

1. Apache Ant - a Java library and command-line tool aimed to define build files for software applications implemented in the Java language.

2. Apache Lucene - a Java framework that implements IR algorithms.

3. Apache Tomcat - a well-known Servlet/JSP container for Java web appli- cations.

4. Azureus - a Java-based client for sharing files using the BitTorrent file-sharing protocol.

5. Hibernate - an ORM (Object Relational Mapping) library for Java appli- cations.

6. iText.Net - an open source library for creating and manipulating PDF, RTF, and HTML files in Java.

7. jEdit - a text editor suited to support programming tasks.

8. jFreeChart - a tool supporting the visualisation of bar charts, pie charts, line charts, scatter plots, histograms, simple Gantt charts, bubble plots, and more.

9. jFTP - a graphical Java network and file transfer client.

10. jHotDraw - a GUI framework for technical and structured graphics.

11. jRefactory - a GUI application for the refactoring of Java applications.

12. jUnit - the Java version of the xUnit testing framework.

13. Liferay Portal - open source enterprise web platform for building business web-based solutions.

14. Pmd - a Java source code analyser. It finds unused variables, empty catch blocks, unnecessary object creation, and so forth.

15. Synapse - a lightweight and high-performance Enterprise Service Bus application, which provides support for XML, web services and REST applications.

16. Tiger Envelopes - an open source personal mail proxy that automatically encrypts and decrypts mail.

17. Velocity - a Java framework to build web and non-web applications.

18. Xalan - an XSLT processor in Java for transforming XML documents into HTML, text, and other XML document types.

19. Xerces - a collection of components and utilities to parse, validate, and serialise XML documents.

System            Classes   KLOCs   KCLOCs
Apache Ant        1452      103.5   89.6
Apache Lucene     1015      63.2    36.1
Apache Tomcat     1530      163.8   110.5
Azureus           4785      333.1   97.1
Hibernate         2267      156.0   95.8
iText.Net         1201      77.4    50.3
jEdit             869       88.4    36.1
jFreeChart        89        8.7     7.8
jFTP              469       23.5    4.7
jHotDraw          899       73.0    38.3
jRefactory        1522      110.7   91.3
jUnit             547       15.0    4.1
Liferay Portal    3961      379.1   137.3
Pmd               680       49.6    8.9
Synapse           613       45.7    20.9
Tiger Envelopes   917       73.4    25.9
Velocity          419       35.8    25.0
Xalan             915       123.7   128.3
Xerces            578       71.5    63.6

Table 3.1: Descriptive statistics of the considered dataset.

Some descriptive statistics of the dataset are reported in Table 3.1. In particular, this table shows the names of the software systems, together with the corresponding total number of classes, thousands of lines of code (KLOCs), and thousands of lines of comments (KCLOCs).

3.5 Results and Discussions

In this section, we present the achieved results and threats that could affect their validity.

3.5.1 Does the unweighted system outperform the flat approach?

Our first research question aims at assessing whether the approach based on the zones performs better than a flat set of lexemes. In particular, we want to investigate whether the lexical information provided by the six different vocabularies leads to better results. Figure 3.1 shows the results obtained after the application of the K-medoid clustering algorithm.

Figure 3.1: Authoritativeness results for RQ1 (flat system versus unweighted system) obtained by the application of the K-medoid clustering algorithm.

The results show that the introduction of the different vocabularies in the indexing of the lexical information provides far better results in terms of authoritativeness. Moreover, the same outcome has been obtained by applying the hierarchical clustering algorithm to the two different system configurations.

Figure 3.2: Authoritativeness results for RQ1 (flat system versus unweighted system) obtained by the application of the HAC clustering algorithm.

Figure 3.3: Comparison of the Authoritativeness results of the K-medoid and GAAC clustering algorithms on the flat system configuration.

The Authoritativeness results for the GAAC algorithm are shown in Figure 3.2. In particular, it is worth noting that in some cases the results of the GAAC algorithm in the unweighted configuration are by far better than the results in the flat one. This is the case of systems such as Hibernate, jFTP, Liferay Portal, and Azureus, where the improvement in Authoritativeness is up to 5 times. Conversely, the K-medoid algorithm produces results that tend to be quite acceptable even in the flat configuration, thanks to the ability of the algorithm to deal with outliers and to restart from different initialisation points.

A point-wise comparison of the two clustering algorithms on the considered system configurations is reported in Figure 3.3 and Figure 3.4. The results show that, on average, the K-medoid algorithm (the darkest bar in the

Figure 3.4: Comparison of the Authoritativeness results of the K-medoid and GAAC clustering algorithms on the unweighted system configuration.

Figures) produces better results for both configurations. In conclusion, when the six zones are introduced and exploited, we always obtained better results than with the flat vocabulary, and consequently we can positively answer our first research question. In particular, this result is confirmed regardless of the selected clustering algorithm.

3.5.2 Does the complete system outperform the unweighted one?

Having assessed this crucial point, we can start investigating whether a smarter combination of the zones can improve the results. Such a combination is achieved by automatically weighting the relevance of each zone, for each software system, by means of the EM algorithm. Moreover, we are also interested in verifying whether the same result is obtained regardless of the clustering algorithm applied and of the particular probabilistic model assumed on the data.

To this aim, we start the discussion by initially considering the Gaussian distribution as the model employed in the EM algorithm. Figure 3.5 shows the results obtained by applying the K-medoid clustering in the complete system configuration. In particular, it is worth noting that the results reported in the bar chart for the flat and the unweighted configurations are exactly the same discussed for the first research question. In this case, the improvement obtained by applying the automatically generated weights to the zones is less pronounced. However, the quality of the produced clusters is improved or, at least, not worsened. Conversely, a different picture emerges if we consider the application of the

Figure 3.5: Authoritativeness results for RQ2 (unweighted system versus complete system) obtained by the application of the K-medoid clustering algorithm.

GAAC algorithm (Figure 3.6). In fact, in some cases, applying the weights to the zones does not lead to any improvement in the results.

Figure 3.6: Authoritativeness results for RQ2 (unweighted system versus complete system) obtained by the application of the GAAC clustering algorithm.

This phenomenon again confirms the limitations of the hierarchical clustering algorithm for software re-modularisation: more advanced cutting strategies should be adopted in order to better partition the agglomerated clusters originally produced by the GAAC.

All the results discussed so far for the complete system configuration assume a Gaussian distribution as the probabilistic model on the data, applied during the iterations of the EM algorithm. However, as already mentioned, such a model introduces some issues in determining strategies to select the initial

configuration. As a consequence, to investigate the influence of the different probabilistic models defined on the produced clusters, the Bernoulli model has also been considered in the experimentation.

Figure 3.7: Authoritativeness results for RQ2 (unweighted system versus complete system) obtained by the application of the K-medoid clustering algorithm, considering the Gaussian and the Bernoulli models.

As reported in Figure 3.7, the application of the Bernoulli model (the brown bar in the chart) tends to positively affect the quality of the generated clusters. On the one hand, it confirms the improvements already achieved with the Gaussian model and the K-medoid clustering over the unweighted system configuration. On the other hand, this model tends to further improve the results over those obtained with the Gaussian model (the green bar in the chart). This is the case of systems such as Tiger Envelopes, jRefactory, Xalan, and jHotDraw (Figure 3.7).

One possible reason for such improvements is related to some numerical aspects concerning the iterations performed by the EM algorithm during the estimation of the model parameters. In fact, as the Bernoulli model does not consider the occurrences of the terms in the documents in its formulation (Section 3.2), its computation is less exposed to numerical issues such as underflow. The situation is different for the Gaussian model, where several tricks have been required to guarantee the (numerical) convergence of the EM algorithm (a standard remedy for underflow, the log-sum-exp computation, is sketched at the end of this subsection).

Thus, in conclusion, we may positively affirm that the complete system

configuration outperforms the unweighted one, and that the best results have been achieved with the Bernoulli model in combination with the K-medoid algorithm.
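As a concrete illustration of the numerical issue mentioned above, the following sketch shows the standard log-sum-exp device for computing the mixture probability of a document whose per-zone probabilities are extremely small. This is a generic remedy, not necessarily the exact set of tricks used in the experiments, and the numbers are invented:

import math

def log_sum_exp(values):
    """Numerically stable computation of log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

# Hypothetical per-zone log Pr(d|z) for a long document: the plain probabilities
# underflow to 0.0, while the log-space computation remains usable.
log_pr = [-900.0, -905.0, -910.0, -920.0, -950.0, -980.0]
log_alpha = [math.log(1.0 / 6.0)] * 6

naive = sum((1.0 / 6.0) * math.exp(lp) for lp in log_pr)
stable = log_sum_exp([la + lp for la, lp in zip(log_alpha, log_pr)])

print(f"naive  Pr(d)     = {naive}")        # 0.0, because exp(-900) underflows
print(f"stable log Pr(d) = {stable:.3f}")   # a finite, informative value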

3.5.3 Threats to Validity

To comprehend the strengths and limitations of the empirical investigation, the threats that could affect the results are presented and discussed.

In our case, the reliability of the used measures (e.g., Auth) represents a critical issue that may affect the generalisability of the results. In fact, the results could be strongly affected by the used authoritative partition. Also, the lower and upper limits used to compute the NED measure may affect the results. Future work will be devoted to adopting different approaches to further investigate this concern.

Another issue is represented by the implicit randomness of the K-medoid clustering algorithm, whose initialisation is performed by randomly selecting elements of the data set as medoids. To reduce biases, we performed 50 runs for each system under investigation, and then considered the mean NED and Authoritativeness values of these runs.

Also the software systems we have used in our empirical study may affect the results. Even if the number of software systems considered in this work is quite large, and their sizes vary considerably, it is also true that we exploited only open source software systems. This could threaten the validity of the results. In fact, in contrast with more centralised models of development, such as those typically used in commercial software companies, these kinds of systems are mainly developed according to a distributed, voluntary collaboration. In these communities, the efforts of a large number of developers are coordinated to create good quality software. This has the positive effect that the used authoritative partitions well approximate the decomposition performed by the original developers. Conversely, the overall quality of these software systems, and in particular of the employed vocabulary, may have positively affected the clustering results. To increase our confidence in the presented results, we plan to conduct a further investigation on different systems.

Finally, we did not perform a comparison of our approach with other clustering-based approaches on a public dataset. This was due to the lack of such a dataset and

may threaten the generalisation of the results of the presented empirical evaluation. To share the dataset used in this work with the community, and for replication purposes, we have made an experimental package available on-line.

Each problem that I solved became a rule which served afterwards to solve other problems.
René Descartes

4 An Efficient Approach to Split Identifiers and Expand Abbreviations

Information Retrieval (IR) techniques are being exploited by an increasing number of tools supporting Software Maintenance activities. Indeed, the lexical information embedded in the source code can be valuable for tasks such as concept location, clustering, or recovery of traceability links. The application of such IR-based techniques relies on the consistency of the lexicon available in the different artefacts, and their effectiveness can worsen if programmers introduce abbreviations (e.g., rect) and/or do not strictly follow naming conventions such as Camel Case (e.g., UTFtoASCII). Therefore, the processing of the lexical information embedded in the source code requires an additional normalisation step in order to automatically split multi-word identifiers and expand possible occurring abbreviations (further details are reported in Section 1.2).

In this Chapter an automatic approach for source code normalisation, i.e., LINSEN

(Linear IdeNtifier Splitting and Expansion), is presented (Section 4.1). The solution is able to deal with both the splitting and the expansion of identifiers, with the goal of defining a technique intended as a preprocessing step for the wide variety of IR-based software maintenance tools [26]. In particular, LINSEN applies the Baeza-Yates and Perleberg (BYP) algorithm [17], an approximate string matching technique running in linear worst-case time if some assumptions are verified. The main advantage provided by such efficiency is the possibility of exploiting a larger number of dictionaries for the matching. In fact, the approach exploits several dictionaries containing terms gathered from (I) the source code comments, (II) a dictionary of IT and programming terms, (III) an English dictionary, and (IV) a list of well-known abbreviations. These sources are prioritised from the most specific to the most general one, with the idea that, in the presence of ambiguities, the most specific, domain-dependent context should be preferred.

The effectiveness of the proposed approach has been experimentally assessed using 24 software systems mainly implemented in C/C++ and, whenever possible, we compared the achieved results with those reported in three state-of-the-art works [85, 113, 126] (Section 4.2). Results expressed in terms of accuracy and/or F-Measure [130] show that our approach outperforms state-of-the-art techniques, but the most distinguishing point is that our proposal is by far more efficient (asymptotically), having linear complexity in the size of the dictionary, compared with the cubic one of the solution proposed by Madani et al. [126] (Section 4.3).

4.1 The LINSEN Algorithm

The LINSEN approach aims at finding a mapping between each source code identifier and the corresponding set of dictionary words, by exploiting high-level and domain-dependent information gathered from different dictionaries (Section 4.1.1). The idea underlying the LINSEN approach is to adopt a graph-based representation of each input identifier, i.e., the Matching Graph, and to apply an approximate string matching algorithm following a two-step process. A description of

Algorithm 6 LINSEN Main Process
Input: Identifiers : The list of identifiers to analyse.
Output: IdMapping : The structure containing the mapping results for each identifier, i.e., the list of dictionary words associated to it.
1: function LINSEN(Identifiers)
2:   D ← {D1, ..., Dk}    ▷ The set of adopted Dictionaries
3:   for each: identifier ∈ Identifiers do
4:     tokens ← SplitMatching(identifier, D)
5:     dictionaryWords ← collectDictionaryWords(tokens, D)
6:     nonDictionaryWords ← tokens \ dictionaryWords
7:     longForms ← ExpansionMatching(nonDictionaryWords, D)
8:     IdMapping[identifier] ← dictionaryWords ∪ longForms
9:   end for
10:  return IdMapping
11: end function

such a process is depicted in Algorithm 6. The first step, implemented by the function SplitMatching in Algorithm 7, partitions each input identifier into tokens: all the tokens corresponding to dictionary words (i.e., appearing in at least one of the considered dictionaries) are considered correctly mapped and will not require further processing (Lines 4-5).

Algorithm 7 Identifier Splitting
Input: identifier : An arbitrary identifier;
Input: D = {Dict1, ..., Dictk} : Set of adopted dictionaries.
Output: Lmatch : List of tokens matching the given identifier.
1: function SplitMatching(identifier, D)
2:   φ : φ(word) = 0, ∀ word ∈ D
3:   for each: Di ∈ D do
4:     Lmatch ← StringMatching(identifier, Di, φ, cSPLIT)
5:     if Lmatch ≠ NoMatch then
6:       return Lmatch
7:     end if
8:   end for
9:   return Lmatch
10: end function

Then, all remaining tokens are treated as potential abbreviations and represent the input to the second step (Lines 5-7), the ExpansionMatching in Algorithm 8,

further discussed at the end of this section. In particular, Algorithm 8 reports the pseudocode for the expansion of single-word short forms. In the case of multi-word abbreviations, the process applies a further splitting step, based on the assumption that the strings composing a multi-word abbreviation in a given file are most likely to occur elsewhere in the same file, or in the same project [61]. In conclusion, for each identifier, the final output is given by the union of the tokens produced by both steps (Line 8).

Algorithm 8 (Single Word) Abbreviations Expansion
Input: nonDictionaryWords : List of non-dictionary word tokens;
Input: D = {D1, ..., Dk} : Set of adopted dictionaries.
Output: LongForms : List of resulting long forms.
1: function ExpansionMatching(nonDictionaryWords, D)
2:   LongForms ← expandKnownAbbr(nonDictionaryWords)
3:   toExpand ← nonDictionaryWords \ LongForms
4:   for each: token ∈ toExpand do
5:     for each: Di ∈ D do
6:       if vowels(token) > consonants(token) then
7:         φ : φPREFIX
8:       else
9:         φ : φEXP
10:      end if
11:      Lmatch ← StringMatching(token, Di, φ, cEXP)
12:      if Lmatch ≠ NoMatch then
13:        LongForms ← LongForms ∪ Lmatch
14:        break    ▷ Breaks the innermost loop
15:      end if
16:    end for
17:  end for
18:  return LongForms
19: end function

The approximate string matching applied in both steps is reported in Algorithm 9. The main part of this algorithm is devoted to the construction of the Matching Graph (MG) (Lines 2-8). The graph includes a node for each character in the input identifier, and an edge for each approximate matching with a dictionary word found by the Baeza-Yates and Perleberg (BYP) algorithm [17] (Line 4). Each edge is labelled with the matched word and its corresponding cost.

Therefore, the MG is defined as a directed labelled graph (Definition 2.10).

Algorithm 9 String matching algorithm

Input: token = ⟨ch1 ··· chN⟩ : An arbitrary token (identifier);
Input: Dictionary : A dictionary of words;
Input: φ(·) : The tolerance function;
Input: c(·) : The cost function.
Output: The list of labels in the minimum cost path.
1: function StringMatching(token, Dictionary, φ(·), c(·))
2:   G = (V, E) ← initializeMatchingGraph(token)
3:   for each: word ∈ Dictionary do
4:     matchingSequence ← BYP(token, word, φ(word))
5:     for each: (⟨chi ... chj⟩, word) ∈ matchingSequence do
6:       G(E) ← G(E) ∪ {⟨(i, j), word, c(word)⟩}
7:     end for
8:   end for
9:   bestPath ← Dijkstra(G)
10:  return getEdgeLabels(bestPath)
11: end function

The BYP algorithm exploits a so-called tolerance function φ(·) to apply an exact multiple pattern matching based on Aho-Corasick automata [7]. In particular, the value of the tolerance function φ(w) corresponds to the maximum number of errors allowed in the matching with an input dictionary word w. It is worth noting that too restrictive tolerances would allow only exact matchings, while too slack ones would provide solutions that can be too far from the input to be acceptable.

LINSEN considers two distinct φ(·) functions to control the acceptable matching errors. In the splitting step, it applies a null tolerance function, i.e., exact matching (Line 2 in Algorithm 7), since it aims at identifying the dictionary words composing an identifier. On the other hand, the tolerance functions exploited in the expansion phase bound the number of errors allowed for a matching word w to φ(w) = O(|w|/ log |w|) (Lines 7 and 9 in Algorithm 8). These functions guarantee a linear asymptotic complexity [17] with respect not only to the length of the input identifier, but also to the size of the considered dictionary, intended as the sum of the lengths of its entries.

The cost associated with each edge of the MG is different in the splitting and in the expansion steps, and is determined by the length and the total number

of occurrences of each word (in the case of the application-aware dictionaries; see Section 4.1.1). In particular, this cost is chosen in order to favour longer words, and words appearing in the application-aware dictionaries, over broader and more generic terms. However, the cost of each word is computed during dictionary construction and is accessed in constant time during identifier processing, with no effect on the overall asymptotic complexity. Finally, the minimum cost path starting from the initial node is built by applying the Dijkstra algorithm (Line 9), and the set of labels associated with the edges of this path, i.e., the matched dictionary words, is returned. In conclusion, the MG can be constructed in linear time with respect to both the length of the input identifier and the size of the considered dictionary.

Figure 4.1: Example of matching graph for the identifier getpnt. Edges are labelled with the matched dictionary words (e.g., "get", "getter", "point", "pointer", "entry") and their costs, while the dashed single-character edges carry the maximum cost C-MAX.

Figure 4.1 shows an example of a matching graph for the identifier getpnt. Along with the edges corresponding to matched dictionary words, the graph also includes a set of additional edges (Line 2 in Algorithm 9), represented by dashed lines in the figure. Their function is to ensure that the graph is always connected (Definition 2.17), so that a solution is always produced; however, their cost is so large that they are included in a solution only when no alternative matching exists.

In conclusion, some more details are required for the description of the ExpansionMatching function (Algorithm 8). This algorithm first checks whether the analysed token appears in a list of well-known short forms (Line 2), e.g., acronyms such as URL or XML, that do not require any further processing (see Section 4.1.1). Afterwards, each remaining token is processed to identify possible approximate matchings with words in the different dictionaries, analysed in an incremental

fashion, similarly to the splitting step (Algorithm 7). However, differently from the splitting phase, the algorithm selects the matchings that require only the operations typically applied when shortening words, namely character deletions. In particular, according to the empirical evidence reported by Madani et al. [126], only intermediate and final deletions are considered (Line 9). This evidence indicates that final deletions are more likely than intermediate ones, and that vowels are more likely to be deleted than consonants when they lie in the core of the string. Although excluding initial deletions is not always correct (e.g., the acronym XML, where X stands for eXtensible), the trade-off is in favour of this choice. Another heuristic assumption is based on the hypothesis that short forms with more vowels than consonants are prefixes, and in this case only final deletions are allowed (Line 7). Both these strategies are embedded in the definition of the two distinct tolerance functions considered in this phase (Lines 7 and 9 in Algorithm 8).

Similar considerations inspired the definition of the cost function. In fact, both the two types of deletions (i.e., intermediate and final) and the type of deleted characters (i.e., vowels or consonants) are taken into account in the cost function cEXP (Line 11 in Algorithm 8).
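The sketch below illustrates these heuristics in isolation: the admissible deletions depend on the vowel/consonant balance of the short form, initial deletions are rejected, and a simple cost penalises intermediate deletions more than final ones and deleted consonants more than deleted vowels. The concrete functions, constants and the tolerance bound are illustrative and are not the ones used by LINSEN:

import math

VOWELS = set("aeiou")

def is_prefix_like(short):
    """Short forms with more vowels than consonants are treated as prefixes."""
    vowels = sum(ch in VOWELS for ch in short)
    return vowels > len(short) - vowels

def tolerance(word):
    """Illustrative bound on the allowed deletions, in the spirit of phi(w) = O(|w|/log|w|)."""
    return int(len(word) / math.log(len(word) + 1)) + 1

def expansion_cost(short, long):
    """Cost of obtaining `short` from `long` by deletions, or None if not admissible."""
    if not short or not long or short[0] != long[0]:
        return None                                  # initial deletions are never allowed
    if is_prefix_like(short):
        if not long.startswith(short):
            return None                              # prefixes admit final deletions only
        cost = sum(0.5 if ch in VOWELS else 1.0 for ch in long[len(short):])
    else:
        cost, j = 0.0, 0
        for ch in long:
            if j < len(short) and ch == short[j]:
                j += 1                               # character of the short form matched
            else:
                final = j == len(short)              # deletion after the last match
                cost += (0.5 if ch in VOWELS else 1.0) * (1.0 if final else 1.5)
        if j < len(short):
            return None
    return cost if len(long) - len(short) <= tolerance(long) else None

for candidate in ["pointer", "point", "painter", "print"]:
    print(f"pnt -> {candidate}: cost = {expansion_cost('pnt', candidate)}")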

4.1.1 The adopted Dictionaries

Since a given identifier could have multiple, equally correct mapping solutions with different dictionary words, LINSEN considers a set of multiple dictionaries, which convey terms belonging to different contexts. The idea is to look for the dictionary words matching an identifier name by first considering the most specific contexts and then widening the search up to the most general ones. In particular, the set of considered dictionaries contains:

1. DFile: a dictionary of terms extracted from the comments of the source file containing the current identifier;

2. DSystem: a dictionary of terms extracted from the comments of all the source files of the analysed software system;

3. DIT: a dictionary of computer science and programming terms;

125 4. DENG: an English dictionary.

The first two correspond to the so-called application-aware dictionaries [80], which contain terms appearing in the comments of the analysed source files. In particular, DFile is restricted to terms appearing in the source file to which the identifier belongs, while DSystem gathers the terms from the comments of the whole system. All these terms are tokenised on non-alphabetical characters, removing any occurrence of stop words (e.g., the, a, is). In particular, we consider some common English words included in the Apache Lucene system*, together with a set of proper nouns of length 4 or longer, already exploited by Hill et al. [85]. The number of occurrences of each term in these two dictionaries is collected during their construction and is then used to compute the edge costs in the MG.
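A possible construction of the two application-aware dictionaries is sketched below: comments are tokenised on non-alphabetical characters, stop words are discarded, and the number of occurrences of each term is recorded for the later cost computation. The stop-word list, the regular expression and the data structures are illustrative and not those of the actual implementation:

import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "to", "and", "in"}   # illustrative subset

def tokenize_comments(comments):
    """Split comment text on non-alphabetical characters and drop stop words."""
    for text in comments:
        for token in re.split(r"[^a-zA-Z]+", text.lower()):
            if token and token not in STOP_WORDS:
                yield token

def build_dictionaries(project):
    """project: mapping from file name to the list of its comments.

    Returns one Counter per file (D_File) and one for the whole system (D_System);
    the counts are later used to weight the edges of the matching graph."""
    d_file = {name: Counter(tokenize_comments(comments))
              for name, comments in project.items()}
    d_system = Counter()
    for counter in d_file.values():
        d_system.update(counter)
    return d_file, d_system

project = {
    "Rectangle.java": ["Draws the rectangle figure", "// resize the figure"],
    "Parser.java": ["Parses the input tokens", "/* builds the syntax tree */"],
}
per_file, whole_system = build_dictionaries(project)
print(per_file["Rectangle.java"].most_common(3))
print(whole_system.most_common(3))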

The dictionary DIT contains terms related to computer science and programming, such as wireless or applet. In particular, the dictionary contains 22,940 computer science domain terms automatically extracted from 13,647 entries and definitions about Engineering and Information Technology provided by a glossary available on the Internet†.

Finally, for the most general context, DENG, we consider the ispell English dictionary‡. This dictionary contains 108,315 words, from which we removed the entries corresponding to person and country names, as well as uppercased terms.

Besides these four dictionaries, we also consider a list of widely used English and Information Technology abbreviations, including both the short forms and their corresponding long forms. The purpose of this list is to speed up the matching process, as it would be useless and time-consuming to process well-known abbreviations [80]. In fact, this list is exploited in both the splitting and the expansion steps to identify well-known abbreviations and to associate them with the corresponding long forms, respectively. These abbreviations have been collected with no prior knowledge about the identifiers to be processed. In particular, English abbreviations correspond to

* http://goo.gl/3MZSU
† http://www.computer-dictionary-online.org/
‡ http://www.gnu.org/software/ispell/ispell.html

single and multi-word short forms taken from a publicly available list§, discarding the ones with multiple possible long forms. On the other hand, Information Technology abbreviations have been automatically gathered from the definitions

included in the on-line glossary already exploited to build up the DIT dictionary.

4.2 Experimental Settings

In this Section, the design of the case study and the dataset used to assess the proposed approach are presented.

The goal of the study is to evaluate the ability of the proposed LINSEN approach to map a source code identifier onto the (dictionary) words composing it. The quality focus is the precision and recall of the approach, as defined by Guerrouj et al. [80], for both splitting and expansion, using freely available oracles. For the sake of comparison with other techniques, the accuracy rate measure will also be considered. The perspective is that of researchers interested in improving the effectiveness of software analysis/maintenance tools based on IR techniques applied to the source code.

Since LINSEN encompasses multiple processing phases, namely the splitting of identifiers and the expansion of abbreviations, the empirical study described in this section aims at analysing the effectiveness of the overall approach, and at providing a deeper understanding of the characteristics of each step. Finally, a preliminary quantitative insight into the computational performance of the LINSEN approach is presented.

The empirical assessment of the proposed approach has been conducted with the following three research questions in mind:

(RQ1) How does LINSEN compare with state-of-the-art approaches as for the split- ting of identifiers?

(RQ2) How does LINSEN compare with state-of-the-art approaches as for the map- ping of identifiers to dictionary words?

(RQ3) What is the ability of the LINSEN approach in dealing with different types of abbreviations?

§http://www.acs.utah.edu/acs/qa_standards/psstd02a.htm

4.2.1 Experiment Design

System            Version   Ids in Oracle   KLOC
JHotDraw          5.1       957             16
Lynx              2.8.5     3,085           174
a2ps              4.14      211             6
which             2.20      487             174
Mozilla-source    1.0       573             4,595
MySQL             5.0.17    194             2,028
Cinelerra         2.0       191             3,533
eMule             0.46      92              262
Quake3            1.32b     80              705
Gcc               2.95      70              1,289
Ghostscript       7.07      66              437
Samba             3.0.0     49              662
Asterisk          1.21      44              459
Minux             2.0       31              334
Mozilla-source    1.3       29              11,458
Mozilla-source    1.2       28              4,681
Mozilla-source    1.4       27              4,710
Mozilla-source    1.1       24              4,676
Httpd             2.0.48    22              558
Azureus           3.0       551             2,682
iText.Net         1.4.1     751             3,380
Liferay Portal    4.3.2     651             3,949
OOPortable        2.2.1     699             3,442
Tiger Envelopes   0.8.9     577             2,647

Table 4.1: Statistics of the analysed systems, grouped by the dataset to which they belong (from top to bottom: [126], [113], [114], [85]).

The choice of the works in the related literature to consider for designing the comparative study has been mainly guided by two aspects: (I) the replicability of the experimental settings and (II) the availability of an oracle to qualitatively evaluate the results. Given these criteria, the obtained results are compared with those provided by Madani et al. [126] and by Lawrie et al. [113] when investigating RQ1 and RQ2. In particular, in order to provide a more exhaustive empirical evaluation of the

proposed approach, the comparison of the LINSEN splitting results has also been conducted against the LUDISO oracle [114]. Finally, as for RQ3, the study conducted by Hill et al. [85] has been selected to investigate the effectiveness of the proposed approach in dealing with the different types of abbreviations appearing in the source code. The summary statistics of all the considered software systems are reported in Table 4.1. In order to allow other researchers to replicate our study, in the following as many details as possible on the experimental settings will be provided for RQ1 and RQ2 (Section 4.2.2), and for RQ3 (Section 4.2.3).

4.2.2 Experimental Settings for RQ1 and RQ2

As for the comparison with Madani et al. [126], results are compared against a manually-built oracle consisting of 957 and 3,085 identifiers extracted from the JHotDraw and Lynx systems, respectively. The effectiveness of the approach is measured in terms of the accuracy rate, defined as the number of correct results over the total number of inputs. However, since the oracle does not provide any indication of the source file from which each identifier has been extracted, our approach may produce different results according to the specific context in which the identifier appears. For instance, in the case of the JHotDraw identifier borddec, the LINSEN approach produces both b-ord-dec (an error) and bord-dec (correct) as splitting results. Consequently, in the evaluation we measure performance in terms of worst-case accuracy, where even a single splitting result different from the oracle counts as an error (in our example, borddec is counted as a misclassification).

Concerning the comparison with Lawrie et al. [113], the authors kindly provided the oracle they used in their experimentation. This oracle contains the splitting and the expansion of the identifiers appearing in two software systems, namely which and a2ps. In particular, it contains the list of all the 487 unique identifiers appearing in the former system, and a randomly selected sample of 211 identifiers (out of a total of 4,393) for the latter. In this case, the evaluation follows the two-level accuracy rate criteria described by Lawrie et al. [113]: the former, referred to as the identifier level, imposes that each expansion must be completely correct. On

the other hand, the latter, the soft-word level, gives "partial credit" to each word correctly expanded. It is worth noting that, in the case of the identifier-level evaluation, performance is again measured in terms of worst-case accuracy, as no indication of the source files is reported in the oracle.

Finally, in the evaluation of the splitting phase, the LUDISO oracle [114] has also been considered, which contains 2,663 identifiers gathered from 750 software systems, together with their splitting results. However, due to the unavailability of the complete software package, the evaluation has been restricted to the 15 software systems with the highest number of entries in the oracle, for a total of 1,520 identifiers (i.e., 58% of the total identifiers). The performance has been evaluated employing precision and recall, according to the definition reported by Guerrouj et al. [80]. In particular, given an identifier id_i to be processed, indicating by t_i = {term_{i,1}, ..., term_{i,n}} the set of terms obtained by the approach, and by o_i = {oracle_{i,1}, ..., oracle_{i,m}} the corresponding results in the oracle, the precision and recall values are computed as follows:

\[
\mathrm{precision}_i = \frac{|t_i \cap o_i|}{|t_i|} \qquad \mathrm{recall}_i = \frac{|t_i \cap o_i|}{|o_i|}
\]

To provide an aggregated, overall measure, the F-Measure (F1) has been further used, which actually computes the harmonic mean of precision and recall:

\[
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]
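The following sketch computes these per-identifier scores for a toy oracle; the example identifiers and expected terms are made up, and, as in the definition above, the terms are compared as sets:

def prf(produced, expected):
    """Per-identifier precision, recall and F-measure over sets of terms."""
    t, o = set(produced), set(expected)
    common = len(t & o)
    precision = common / len(t) if t else 0.0
    recall = common / len(o) if o else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

oracle = {
    "getpnt": ["get", "pointer"],
    "drawrect": ["draw", "rectangle"],
}
produced = {
    "getpnt": ["get", "point"],          # wrong expansion of the abbreviation
    "drawrect": ["draw", "rectangle"],   # fully correct
}

scores = {ident: prf(produced[ident], oracle[ident]) for ident in oracle}
for ident, (p, r, f1) in scores.items():
    print(f"{ident}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print("mean F1 =", sum(f1 for _, _, f1 in scores.values()) / len(scores))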

4.2.3 Experimental Settings for RQ3

Our expansion step has been evaluated against a gold set containing short forms together with their corresponding long forms. In particular, this gold set comprises 250 abbreviations randomly selected from five Java software systems (see Table 4.1), grouped by the different types of abbreviations, namely prefix (PR), dropped-letters (DL), acronyms (AC), combination words (CW), single-letters (SL), and others (OO). Further details about the distribution of the short forms among the different types of the gold set may be found in [85]. Moreover, differently from the previous ones, this oracle provides an indication of the source file in

which each extracted short form appears. Such extra information allowed us to precisely assess our approach, considering the exact occurrence of the identifier referenced in the oracle. In more detail, the evaluation protocol follows this methodology: we fed each identifier into our expansion module, assuming that it was produced by a previous splitting phase. We then compared the resulting output with the gold set, following the same evaluation criteria used by Hill et al. [85]: if the produced expansion has the same stem as the one appearing in the oracle, then the expansion is considered to be correct. Performance has been evaluated in terms of the accuracy rate, calculated for each individual type of short form and for the overall set of identifiers, according to the different distributions of the types in the oracle.
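A sketch of this stem-based scoring is shown below. A very crude suffix-stripping function stands in for a real stemmer (the original evaluation may rely on a proper one, e.g., Porter's), and the example entries of the gold set and the produced expansions are invented:

def crude_stem(word):
    """Very rough suffix stripping, used only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def accuracy_by_type(gold_set, expansions):
    """gold_set: list of (short_form, expected_long_form, abbreviation_type)."""
    correct, total = {}, {}
    for short, expected, abbr_type in gold_set:
        total[abbr_type] = total.get(abbr_type, 0) + 1
        produced = expansions.get(short, "")
        if crude_stem(produced) == crude_stem(expected):
            correct[abbr_type] = correct.get(abbr_type, 0) + 1
    overall = sum(correct.values()) / sum(total.values())
    per_type = {t: correct.get(t, 0) / n for t, n in total.items()}
    return per_type, overall

gold = [("attr", "attribute", "PR"), ("pntr", "pointer", "DL"), ("url", "url", "AC")]
produced = {"attr": "attributes", "pntr": "painter", "url": "url"}
print(accuracy_by_type(gold, produced))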

4.3 Results and Discussions

In this Section, the results for the considered research questions are presented (Sections 4.3.1, 4.3.2, and 4.3.3, respectively). A discussion of a quantitative evaluation of the approach and the threats to validity of the empirical investigation conclude the Section.

4.3.1 RQ1: How does LINSEN compare with state-of-the-art approaches as for the splitting of identifiers?

As for the first research question, results are compared with those provided by Madani et al. [126] and by Lawrie et al. [113]. It is worth noting that the results of the comparison with the Camel Case splitting are not reported, because it is always by far outperformed by any other technique. Even if this can seem a trivial remark, the Camel Case splitter is currently still one of the most used techniques in software maintenance tools for the preprocessing of identifiers. This definitely highlights the importance of new techniques to support IR-based approaches.

From the results of the first comparison, reported in Table 4.1, we can note that the accuracy values for the JHotDraw system are always better than those for Lynx, that the spread between the two approaches is smaller on the Java system than on the C one, and that the LINSEN approach performs better than the one proposed by

Madani et al. [126], improving the splitting accuracy by about 5% for JHotDraw and by about 14% for Lynx. The conclusions we can draw are that the quality of the naming has a high impact on the splitting step, no matter how good the used technique is, and that our approach turns out to be more robust than the one proposed in [126]. A summary description of these results is shown in Figure 4.1.

As for the comparison with the approach proposed by Lawrie et al. [113], a summary representation of the results is shown in Figure 4.2, while their detailed description is reported in Table 4.2. Here we can see that the LINSEN approach always achieves better results. In particular, it improves the splitting accuracy for a2ps by about 8% at the identifier level and by about 12% at the soft-word level, together with an improvement of about 44% for which at both evaluation levels.

Finally, to provide deeper insights into the obtained splitting results, we assess the LINSEN approach considering the previous four systems together with those appearing in the LUDISO dataset [114], evaluating the results in terms of F-Measure. However, since the F-measure values are calculated for each single identifier, instead of only considering the mean value for each analysed system, we decided to represent the results by using box plots. The box plot reported in Figure 4.3 shows that the

System          Unique Ids    DTW      LINSEN
JHotDraw 5.1    957           93.1%    94.9%
Lynx 2.8.5      3085          70.3%    80.3%

Table 4.1: RQ1: Percentage of correct Splitting compared with Single-iteration results reported in [126].

                    Identifier Level        Soft-word Level
System            GenTest    LINSEN       GenTest    LINSEN
which 2.20        58.0%      64.6%        70.0%      78.3%
a2ps 4.14         35.0%      50.3%        52.0%      75.1%

Table 4.2: RQ1: Percentage of correct Splitting compared with best results attained by the GenTest Splitting algorithm [113, 115].

The box plot reported in Figure 4.3 shows that the median values of the F-measure are equal to 1.0 for all projects (horizontal segments), while the corresponding mean values are depicted by dots.

Figure 4.1: Bar Chart of splitting results compared with Single-iteration results reported in [126] (Table 4.1).

Figure 4.2: Bar Chart of splitting results compared with best results attained by the GenTest splitting algorithm [113, 115] (Table 4.2).

Figure 4.3: Splitting results (F-measure) for systems gathered from [113, 114, 126]

As a matter of fact, the box plots for most systems are very skewed, thus reflecting the good performance of the splitting algorithm included in the LINSEN approach. In particular, this positive trend is also confirmed by the values of the means (dots), which are all greater than 0.7.

4.3.2 RQ2: How does LINSEN compare with state-of-the-art approaches as for the mapping of identifiers to dictionary words?

To address the second research question, we compared the results of the whole LINSEN approach again with those presented by Madani et al. [126] and by Lawrie et al. [113]. Results reported in Table 4.3 show a reduction in the number of errors for both techniques on the two case studies, especially considering the Lynx system. This can also be due to the experimental protocol adopted in [126], where the expansion module was fed only with the identifiers not correctly split, and not with the whole list of identifiers. A summary of the mapping results is reported in Figure 4.4. In particular, we got an improvement of 3% and of 2% for the two systems over the results proposed in [126], together with a further improvement of 1% and of 17% with respect to our splitting results (RQ1).

Figure 4.4: Bar Chart of normalisation results (i.e., splitting of identifiers and expansion of abbreviations) compared with results reported in [126] (Table 4.3).

System          Unique Ids    DTW      LINSEN
JHotDraw 5.1    957           96.1%    99.1%
Lynx 2.8.5      3085          92.9%    94.5%

Table 4.3: RQ2: Percentage of correct Splitting and Expansion compared with results presented in [126].

                    Identifier Level          Soft-word Level
System            Normalize    LINSEN      Normalize    LINSEN
which 2.20        53.0%        56.7%       68.0%        71.3%
a2ps 4.14         33.0%        56.6%       46.0%        83.5%

Table 4.4: RQ2: Percentage of correct splitting and expansions compared with results presented in [113].

In the first case, results prove the effectiveness of the LINSEN approach.

On the other hand, this empirical evidence reinforces our previous conclusion on the influence of the quality of identifiers, pointing out the importance of an abbreviation expansion technique used in combination with a splitting algorithm to get more accurate results. As for the comparison with the approach proposed by Lawrie et al. [113] (Figure 4.5), we can see interesting improvements in the results with respect to those presented in the first research question. Here the experimental protocol is different from [126], since the expansion module is always fed with all the identifiers in the source code. Again, the difference in the oracles is the key to understanding the pattern in the results, especially when compared with Table 4.3: while the expansion results reported in [113] are worse than the splitting ones, and the same trend is observed in our results for a2ps, this is not the case for which.

Figure 4.5: Bar Chart of normalisation results (i.e., splitting of identifiers and expansion of abbreviations) compared with results reported in [113] (Table 4.4).

An explanation could be that some identifiers are only partially correctly mapped w.r.t. the oracle. For example, the identifier S_IRUSR, standing for Status Is Readable User, is expanded by LINSEN into String Infrared User, as the substring IR appears as the acronym of Infra Red in the considered list. In addition, the which system contains many identifiers composed of typical C library abbreviations, such as argv or fprintf, which LINSEN maps to argument value and file print file

instead of argument vector and format print f, respectively. In fact, this trend in the results generalises to both levels of evaluation, namely the Identifier Level and the Soft-word Level. However, while the LINSEN results for the which system are better than the others by more than 7% at the Identifier Level and by about 5% at the Soft-word Level, for a2ps the difference is more than 71% at the Identifier Level and more than 81% at the Soft-word Level.

Figure 4.6: Normalisation results (F-measure) for systems considered in [113, 126]

In conclusion, the results of the evaluation of the overall approach on all the previous four systems, expressed in terms of F-measure, are shown in Figure 4.6. Also in this case, the box plots show the good performance on the two software systems considered by [126], while graphically highlighting the issues of the approach in processing the C-language identifiers appearing in the which system.

4.3.3 RQ3: What is the ability of the LINSEN approach in dealing with different types of abbreviations?

As for RQ3, the overall results of the expansion step are reported in Table 4.5 and depicted in Figure 4.7, both organised according to the type of short forms.

Type                        % of Types    AMAP     LINSEN
Combination Words (CW)      9.2%          17.4%    66.7%
Dropped Letters (DL)        3.6%          77.8%    77.8%
Others (OO)                 18.4%         47.8%    50.0%
Acronyms (AC)               19.6%         46.9%    36.0%
Prefix (PR)                 23.6%         79.7%    86.0%
Single Letters (SL)         25.6%         68.8%    53.1%
Total (unweighted mean)                   56.4%    61.6%
Total (weighted mean)                     58.8%    59.1%

Table 4.5: RQ3: Percentage of correct expansions for each type of short form, compared with results presented in [85].

On average, our approach performs better than AMAP, with an improvement of about 5%. Indeed, although there are two types of abbreviations where AMAP obtains better results, namely Acronyms (AC) and Single Letters (SL), the improvement obtained in the other cases, in particular on Combination Words, compensates for such deterioration.

Figure 4.7: Expansion results for systems considered in [85]

In particular, the worse results in the SL case are probably due to the fact that the approach by Hill et al. considers fine-grained scopes, entirely related to the source code. A similar consideration can be drawn for the AC case.

Indeed, in our case, if an acronym does not appear in the predefined list, we try to expand it as a generic multi-word abbreviation. On the other hand, in the AMAP approach acronym expansions are also entirely inferred from the code, and this can be an advantage with very technical abbreviations contained in identifiers. For example, they are able to correctly expand the identifier DHT, since in the considered source code it corresponds to a class name (Distributed Hash Table).

4.3.4 Computational Performance

In order to provide a preliminary investigation on how the asymptotic computational efficiency of the BYP algorithm reflects on the actual running time, we logged the time needed by LINSEN to split and expand each identifier in each analysed software system. Due to space constraints we are not able to provide detailed data, but in our current and quite unoptimised implementation the overall computation takes on average about 3 seconds per identifier on a 2.6 GHz AMD Opteron. On the one hand, we judge this result promising if compared with the 8 seconds per identifier (on average) required by the Normalize approach [113]. On the other hand, we think that this running time allows the LINSEN algorithm to be integrated into a software maintenance tool, in order to be applied in an off-line computational process.

4.3.5 Threats to Validity

To comprehend the strengths and limitations of our empirical investigation, the threats that could affect the results are presented and discussed. In this kind of work, the biggest threat to construct validity (i.e., the relation between the theory and the observation) can arise from errors in the oracles. However, we used the same oracles, together with the same versions of the software systems, exploited in the works we considered for comparison. It is worth noting that we found some minor typos in the oracles, which we fixed by adding (rather than replacing) a new possible interpretation of the expansion. Regarding the internal validity, the authors in [126] based their evaluation on a two-step experimental protocol: splitting performance is assessed considering the so-called zero-distance identifiers, i.e., identifiers directly mapped to dictionary

words. The remaining identifiers are further processed in the second (expansion) step, and the overall performance is finally attained. As a consequence, we used their results of the first step as a benchmark for our splitting phase, and their results of the second step for the evaluation of the combined splitting and expansion steps. Moreover, as for the comparison with the results reported in [113], we considered the GenTest results in the evaluation of the splitting phase (RQ1) and the Normalize results for RQ2. As for the threats to external validity, a potential issue may arise from the software systems we used in our empirical study. Indeed, even if they are written in different languages and their size varies considerably, we exploited only open source software systems. In this context, developers are usually aware that their code will be read by other people, and thus the quality of the employed vocabulary may be higher than in other development settings. To get a better insight on the presented results, we plan to conduct a further investigation on commercial software systems. Finally, as for conclusion validity, we are aware that our approach can provide splittings and/or expansions that may not reflect the original intent of the programmers. Indeed, our goal was to compare our solution with current state-of-the-art approaches.

If the only tool you have is a hammer, you tend to see every problem as a nail.
Abraham Maslow

5 A Kernel Based Approach for Clone Detection

Automatic approaches for clone detection intrinsically require the definition of a proper measure to compute the similarity between code fragments (Section 1.4.1). However, as reported in the classification of code clones in Section 1.4.2, only Type 1 clones are represented by exactly the same set of instructions, while the other three types involve lexical and syntactic variations between the two fragments. To cope with these issues, the similarity computation cannot rely solely on the lexical information provided by the code: additional information should be considered, such as the syntactic structure or instruction dependencies. Furthermore, the applied similarity measure should be able to combine such information in order to produce correct solutions. To this aim, two Kernel-based solutions for clone detection are presented in this Chapter (Sections 5.1 and 5.2) that exploit different structural representations

of the source code, namely the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), to define effective solutions for clone detection. Kernel Methods [87, 167] have been shown to be effective in approaches considering the similarity between complex input structures such as trees and graphs (see Section 2.4.2). However, these techniques have never been applied to software maintenance tasks, and this represents one of the main contributions of the presented approaches. In Section 5.1 the description of a Tree Kernel-based clone detection technique is provided. The proposed approach exploits both syntactical (AST) and lexical information (e.g., names of methods, variables, etc.) for the identification of software clones up to Type 3. The main drawback of this approach is that it cannot be effectively applied to the identification of Type 4 clones, as the definition of similarity it embeds mainly considers the program text of the compared code fragments. Therefore, to deal with Type 4 clones, the information about the program behaviour becomes particularly relevant. As a matter of fact, most of the clone detection techniques that are able to detect Type 4 clones [71, 101, 106] use Program Dependency Graphs (PDGs) to represent the source code (see Section 1.4.3 for further details). To this aim, we propose the application of Graph Kernels to PDGs, similarly to what has been done on ASTs, to detect meaningful similar subgraphs. However, as briefly described in Section 2.4.2, the main limitation of such approaches regards the computational effort they require, which is in fact much larger than what is needed by Tree Kernels. Thus, to find a good trade-off between such cost and the information considered in the (sub)graph comparison, we focus on the application of Weighted Decomposition Kernels (WDK) [138] (Section 2.4.2), as they enable the definition of specific criteria to reduce the total number of comparisons. A qualitative empirical evaluation has been conducted to assess the validity of the two proposed approaches, considering the scenario-based evaluation presented in [159]. Moreover, the Tree Kernel-based solution has also been quantitatively assessed, comparing the achieved results with those obtained by another related clone

detection tool, namely CloneDigger [32, 33]. Finally, the Chapter concludes with the proposal of a mutation-based algorithm, devoted to automatically generating clones, towards the definition of a supervised Kernel-based clone detection solution (Section 5.3).

5.1 Tree Kernels for Clone Detection

The key idea underlying our approach is to exploit a Tree Kernel to compute similarities of source code fragments represented by means of ASTs. In particular, to effectively define a novel Kernel-based solution for clone detection, it is required to identify a proper set of features characterising the information of the domain, and to define an appropriate kernel function that is able to exploit the structured nature of the data. All these aspects will be detailed in the next Sections. In particular, Section 5.1.1 provides a description of the adopted features, while the defined kernel functions are detailed in Section 5.1.2.

5.1.1 The Defined Features

The first crucial step to consider in the definition of ML algorithms concerns the definition of an effective representation of the input data, namely the source code fragments, and of their corresponding set of features to be exploited in the learning process. As briefly introduced in the previous Section, the basic idea of the proposed approach is to apply the similarity computation to source code fragments represented by means of their corresponding AST. The AST is a well-known and widely adopted code structure generated during parsing, which provides a tree-structured representation of the syntactic information of a fragment of code. Each node of the tree denotes a construct occurring in the source code. The syntax is “abstract” in the sense that it does not represent every detail appearing in the real syntax. For instance, grouping parentheses are implicit in the tree structure, comments are disregarded, and a syntactic construct like an if-then expression may be denoted by means of a

single node with two branches. Furthermore, nested statement blocks correspond to nested (sub)trees (Definition 2.21). An example of a (partial) AST is depicted in Figure 5.1, together with the corresponding C code function from which it has been generated (Listing 5.1).

 1 int function (int parameter){
 2     int k = 10;
 3
 4     printf("Hello, this is the function");
 5
 6     int i = 0;
 7     while (i < 7){
 8         i += 1;
 9         // do something cool
10     }
11 }

Listing 5.1: Excerpt of a C code

Figure 5.1: Example of a partial AST generated from the above code example.

However, it is worth noting that while internal nodes (represented in boldface

in the Figure) convey syntactical information, leaves (in light italic in the Figure) provide information about the names of variables and literals, called lexemes, used throughout the corresponding code. As a consequence, these different kinds of information can be exploited in the similarity computation in order to distinguish the different possible code variants. In fact, code fragments that share the same syntactical and lexical information (Type 1 clones) are different from clones that share only all or some of the syntactic information, namely Type 2 and Type 3 clones respectively. Therefore, to properly combine such information, the proposed Tree Kernel method is based on a set of four different features, which are used to annotate the nodes of the AST, namely (I) IC (Instruction Class); (II) I (Instruction); (III) C (Context); (IV) Ls (Lexemes). A minimal sketch of such an annotated node is given below, before each feature is discussed in detail.
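To make the role of the four features concrete, the following sketch shows one possible way of representing an annotated AST node in Python; the class and field names are illustrative assumptions, not the data structures used by the actual prototype. The example values correspond to the while node of Listing 5.1 (cf. Figure 5.2).

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class AnnotatedNode:
    """AST node annotated with the four features used by the Tree Kernel."""
    instruction_class: str              # IC, e.g. "Loop", "Conditional-Control"
    instruction: str                    # I, the lexer token, e.g. "while-loop"
    context: str                        # C, the IC of the enclosing statement
    lexemes: Set[str] = field(default_factory=set)      # Ls, propagated lexemes
    children: List["AnnotatedNode"] = field(default_factory=list)

# Example: the while loop of Listing 5.1, nested directly in the function body.
while_node = AnnotatedNode(
    instruction_class="Loop",
    instruction="while-loop",
    context="Function-Body",
    lexemes={"i", "7"},
)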

The Instruction Class represents the “class” of the instruction associated to a node, such as Loop, Conditional Control, Assignment, etc. Intuitively, this feature allows discriminating whether two instructions, i.e., nodes of the AST, are comparable. For instance, a while statement can be compared with do while or for statements, since they all belong to the same class of instructions, i.e., Loop. On the contrary, the same set of instructions is not comparable with an if statement, since the latter belongs to a different class of instructions, namely Conditional Control.

The next feature is Instruction. It contains the token provided by the lexer during the parsing process. This is the typical information considered by AST-based approaches [20, 104]. Table 5.1 reports some example instances of the features Instruction Type and Instruction for the Java programming language.

The Context feature indicates the Instruction Class of the statement in which the node is contained. The rationale behind this feature is to increase the similarity value of two nodes if they appear in the same Instruction Class. For example, considering the code reported in Listing 5.1, nodes corresponding

Language Instruction    Instruction Type        Instruction
for                     Loop                    for-loop
while                   Loop                    while-loop
if                      Conditional-Control     if-statement
switch                  Conditional-Control     switch-statement

Table 5.1: Instruction Type and Instruction features examples.

to the assignment expression i += 1 (Line 8) will be assigned Loop as their Context feature, since the whole expression appears in the body of a (while) loop statement. Thus, when comparing two different code fragments, in the case of exactly the same instruction (the assignment expression in this case), the similarity computation will take into account the two contexts of the code where the expressions appear. Nevertheless, the computation of this feature is based on the whole AST and therefore requires a post-processing phase after the AST construction.

Finally, the Lexemes feature associates to each node the set of lexemes of the subtree rooted in that node. This allows taking into account also the lexical information in the computation of the similarity among (sub)trees. More precisely, the Lexemes feature is defined recursively:

(Leaf Nodes): Ls corresponds to the lexeme associated to the node itself;

(Internal Nodes): Ls corresponds to the union of all the lexemes associated to its subtrees of minimum height. In particular, only the information of the minimum-height subtrees is considered, since they should convey the closest lexical information for an internal node. Moreover, this limits the amount of lexemes percolated through the tree. A sketch of this recursive propagation is given after this list.
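As a rough illustration of this recursive rule, the following sketch propagates lexemes bottom-up, keeping for each internal node only the lexemes coming from its children of minimum height; the node structure and helper names are assumptions made for this example.

from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Node:
    lexeme: Optional[str] = None                      # set only for leaves
    children: List["Node"] = field(default_factory=list)
    lexemes: Set[str] = field(default_factory=set)    # Ls, filled by annotate()

def annotate(node: Node) -> int:
    """Fill node.lexemes recursively and return the height of the subtree."""
    if not node.children:                             # leaf: Ls is its own lexeme
        node.lexemes = {node.lexeme} if node.lexeme is not None else set()
        return 0
    heights = [annotate(child) for child in node.children]
    min_height = min(heights)
    # Internal node: union of the lexemes of the minimum-height subtrees only.
    node.lexemes = set()
    for child, h in zip(node.children, heights):
        if h == min_height:
            node.lexemes |= child.lexemes
    return 1 + max(heights)

# Example: the condition subtree (i < 7) feeds {"i", "7"} up to the while node.
cond = Node(children=[Node(lexeme="i"), Node(lexeme="7")])
body = Node(children=[Node(children=[Node(lexeme="i"), Node(lexeme="1")])])
while_node = Node(children=[cond, body])
annotate(while_node)
print(while_node.lexemes)   # {"i", "7"}: cond is the minimum-height subtree

This matches the behaviour described for the while node of Listing 5.1, whose Lexemes set is propagated from its nethermost subtree.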

Figure 5.2 shows an example of features annotation considering some of the nodes in the AST depicted in Figure 5.1. In particular, some values are worth mentioning in order to better clarify the role of each of the defined features.

Figure 5.2: Features associated to nodes of the AST reported in Figure 5.1.

In particular, let us note that the Context of the while node corresponds to Function-Body, as the loop is not nested in any other block or statement. Differently, the context of the nodes labelled with < and = is equal to Loop, as expected. Furthermore, the Lexemes set associated to the while node contains the terms i and 7, which have been propagated from the leftmost (and nethermost) subtree (see Figure 5.1).

5.1.2 The Defined Tree Kernel

Tree Kernels evaluate the similarity between two trees in a recursive fashion: (I) computing similarities among nodes; (II) aggregating this information up the tree. Thus, they require a similarity measure on the nodes and a function that recursively traverses the tree combining these similarity values. In our approach, we define two primitive kernel functions m and k that operate on single nodes in terms of their features. To combine the results on the overall tree structure, we define the Tree Kernel K.

The binary function m (Eq. 5.1) determines whether two nodes (n1 and n2) are comparable according to their Instruction Class. The function k (Eq. 5.2) defines a value of similarity between compared nodes according to the remaining three features, namely Instruction, Context and Lexemes. Please note that we use the notation n.f to denote the feature f (e.g., IC or Ls) associated to the node n.

\[
m(n_1, n_2) =
\begin{cases}
1 & \text{if } n_1.IC = n_2.IC;\\
0 & \text{otherwise.}
\end{cases}
\qquad (5.1)
\]

\[
k(n_1, n_2) =
\begin{cases}
1    & \text{if } n_1.I = n_2.I \ \text{and}\ n_1.C = n_2.C \ \text{and}\ n_1.Ls = n_2.Ls\\[0.5ex]
0.8  & \text{if } n_1.I = n_2.I \ \text{and}\ n_1.C = n_2.C\\[0.5ex]
0.7  & \text{if } n_1.I = n_2.I \ \text{and}\ n_1.Ls \cap n_2.Ls \neq \emptyset\\[0.5ex]
0.5  & \text{if } n_1.C = n_2.C \ \text{and}\ n_1.Ls \cap n_2.Ls \neq \emptyset\\[0.5ex]
0.25 & \text{if } n_1.C = n_2.C \ \text{or}\ n_1.I = n_2.I \ \text{or}\ n_1.Ls \cap n_2.Ls \neq \emptyset\\[0.5ex]
0    & \text{otherwise (no match)}
\end{cases}
\qquad (5.2)
\]

Two nodes have maximal similarity (i.e., 1) if all the involved features have the same values, while their similarity is 0 if they are totally different. In case two nodes have some differences in their features, structural similarities are considered more important than the lexical ones for detecting clones. Accordingly, their similarity is 0.8 if they differ only in their lexeme lists. The similarity is 0.7 if the nodes are the same instruction and share lexical information but appear in different contexts. If two nodes are within the same context and share some lexemes, but are two

different instructions, their similarity is 0.5. Finally, if they share only one feature value, their similarity is 0.25. These similarity values have been chosen according to a pilot experiment conducted on the original system used in the case study. This may represent a possible threat for the empirical assessment of the approach. Therefore, we are going to investigate first the effect of using different similarity values and then define some heuristics to properly choose these values.
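A minimal sketch of the two node-level functions follows, assuming the annotated-node representation introduced earlier (field names are illustrative assumptions):

def m(n1, n2):
    """Eq. 5.1: nodes are comparable only if they share the Instruction Class."""
    return 1.0 if n1.instruction_class == n2.instruction_class else 0.0

def k(n1, n2):
    """Eq. 5.2: graded similarity based on Instruction, Context and Lexemes."""
    same_i = n1.instruction == n2.instruction
    same_c = n1.context == n2.context
    same_ls = n1.lexemes == n2.lexemes
    shared_ls = bool(n1.lexemes & n2.lexemes)
    if same_i and same_c and same_ls:
        return 1.0
    if same_i and same_c:
        return 0.8
    if same_i and shared_ls:
        return 0.7
    if same_c and shared_ls:
        return 0.5
    if same_c or same_i or shared_ls:
        return 0.25
    return 0.0

The ordering of the branches mirrors the case analysis of Eq. 5.2, so earlier (stronger) matches take precedence over the weaker ones.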

The Tree Kernel KAST is defined on the subtrees T1 and T2, whose roots are r1 and r2, respectively:

\[
K_{AST}(T_1, T_2) = m(r_1, r_2) \cdot \left( k(r_1, r_2) + \sum_{n_1 \in Ch(r_i)} \max_{n_2 \in Ch(r_j)} K(t(n_1), t(n_2)) \right)
\qquad (5.3)
\]

where t(n) indicates the subtree rooted in n, while Ch(ri) denotes the list of the children of the node ri. The index i, at each step, corresponds to the node between r1 and r2 having fewer children: in this way, the function K is symmetric. The rationale for defining the function K relies on the fact that we are interested in the identification of the maximum isomorphic subtrees. Since the size of two subtrees is not constant, their similarity value must be normalised to the range [0, 1]:

\[
K_{norm}(T_1, T_2) = \frac{K_{AST}(T_1, T_2)}{\sqrt{K_{AST}(T_1, T_1) \cdot K_{AST}(T_2, T_2)}}
\qquad (5.4)
\]

Knorm is also symmetric, and it is a Kernel [167].
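The recursion of Eq. 5.3 and the normalisation of Eq. 5.4 can be sketched as follows; this is a simplified reading of the formulas, reusing the m, k and node sketches above, and is not the actual prototype implementation:

import math

def k_ast(t1, t2):
    """Eq. 5.3: recursive similarity between two subtrees (t1, t2 are root nodes)."""
    if m(t1, t2) == 0.0:
        return 0.0
    # Iterate over the children of the node with fewer children (keeps K symmetric).
    small, large = sorted((t1, t2), key=lambda n: len(n.children))
    total = k(t1, t2)
    for c1 in small.children:
        total += max(k_ast(c1, c2) for c2 in large.children)
    return total

def k_norm(t1, t2):
    """Eq. 5.4: similarity normalised to [0, 1]."""
    denom = math.sqrt(k_ast(t1, t1) * k_ast(t2, t2))
    return k_ast(t1, t2) / denom if denom > 0 else 0.0

In a clone detection run, k_norm would be evaluated on the annotated ASTs of two candidate methods, and values above a chosen threshold would flag the pair as a clone.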

5.1.3 Experimental Settings

The description of the experimental settings is divided into two parts, one for each step considered in the evaluation protocol. The first part is devoted to the description of the so-called scenario-based evaluation [159], which is preliminarily performed in order to get useful insights and indications on the defined similarity measure. The second part is devoted to the description of the conducted comparative study.

Scenario-based Evaluation

Figure 5.3: Editing Taxonomy Scenarios (extracted from [159])

First of all, it is worth noting that the following evaluation protocol has also been considered in the assessment of the Graph Kernel approach described in Section 5.2, and thus its description will not be replicated there. In the survey by Roy et al. [159], the authors proposed a qualitative approach, called scenario-based evaluation, to compare and evaluate most of the well-known existing clone detection techniques. In particular, such an evaluation strategy defines a top-down editing taxonomy based on a set of hypothetical program editing scenarios that are representative of typical changes to copy/pasted code [191]. All the 16 proposed scenarios are represented in Figure 5.3. The defined scenarios are categorised in 4 different groups, corresponding to

Systems              Cleaned    Type 1    Type 2    Type 3
Methods              351        423       422       433
Methods per Class    6.38       7.82      7.84      8.40
KLOC                 7.6        8.8       9.4       8.9

Table 5.2: Descriptive Statistics.

the four Types of code clones. Then, each scenario comprises possible editing transformations applied to the source code to generate clones, according to those accepted by the definition of the corresponding clone Type. For example, in the case of the editing Scenario S1, corresponding to Type 1 clones, only changes to layout and to comments are applied. On the other hand, in the case of Scenario S3, added or changed statements are also considered. As a result, the authors compared the performance of each clone detection technique in each scenario, and the obtained results show that Scenario S1 is the easiest one, while, as expected, Scenario S4 is hard for most techniques [159]. Moreover, no existing technique performs very well in all the given scenarios [191].

Comparative Study

Although our approach is general and language-independent, the implemented prototype detects clones at the method level and supports systems written in Java. To assess the approach and the tool prototype, we considered a Java software system we are particularly familiar with. This system is a typical academic application, developed at the University of Naples “Federico II” by a Master student in Computer Science. The software system has been checked against the presence of clones up to Type 3 by one of the authors. The tool CloneDigger [32] and our prototype have been used to check the presence of clones. All the detected clones have been removed through refactoring operations. This produced the cleaned software system used in the case study. We manually and randomly injected into the code of the cleaned software system some artificially created clones, thus creating three mutations of the cleaned

software system. Subsequently, we verified whether our tool was able to detect all the injected clones. Some descriptive statistics of the cleaned system and the mutated ones are reported in Table 5.2. It is worth mentioning that we did not analyse the original system, since it contained only a small number of clones, thus making it not meaningful for assessing the proposed approach. Furthermore, we applied CloneDigger to get an indication of whether our approach outperforms an AST-based approach. We considered here clones at the method level, taking into account only methods that have at least one instruction, as in [157]. In this preliminary investigation we considered three different experimental trials, one for each type of clone. Thus, in each trial we injected only one type of clones at a time and then applied the two detection tools. Regarding Type 3 clones, we considered both the definitions of Structure-substituted clone and Modified clone [175]. We used the precision measure to assess the correctness of the results, while the completeness has been estimated by employing the recall measure. Precision is the fraction of real clones identified among all the retrieved ones, while recall is the fraction of real clones among all the actual ones:

\[
precision = \frac{\#(\text{actual clones identified})}{\#(\text{total candidate clones identified})}
\qquad (5.5)
\]

\[
recall = \frac{\#(\text{actual clones identified})}{\#(\text{total actual clones})}
\qquad (5.6)
\]

Since precision and recall measure two different concerns, we used F-measure

(F1) to get a trade-off between correctness and completeness:

\[
F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}
\qquad (5.7)
\]

F1 gives the same relevance to correctness and completeness, and may be interpreted as an estimation of the authoritativeness of the clones identified by the approach (i.e., how much the automatically identified clones resemble the group of clones identified by an expert).
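A direct reading of Eqs. 5.5-5.7 is sketched below, assuming that detected and actual clones are represented as sets of method pairs; the example data are invented for illustration.

def precision_recall_f1(candidates, actual):
    """Eqs. 5.5-5.7 over sets of clone pairs."""
    true_positives = len(candidates & actual)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: 4 of the 5 reported candidate pairs are actual clones, out of 6 actual pairs.
candidates = {("f1", "f2"), ("f1", "f3"), ("f2", "f3"), ("g1", "g2"), ("h1", "h2")}
actual = {("f1", "f2"), ("f1", "f3"), ("f2", "f3"), ("g1", "g2"), ("k1", "k2"), ("k1", "k3")}
print(precision_recall_f1(candidates, actual))   # (0.8, 0.666..., 0.727...)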

152 5.1.4 Results and Discussions

In the following, the results obtained for the qualitative and quantitative evaluation of the Tree Kernel based technique are presented. The results obtained for the scenario-based evaluation of the approach are reported in Figure 5.4.

Figure 5.4: Results for the qualitative evaluation of the proposed Tree Kernel KAST clone detection technique.

The obtained results show the effectiveness of the proposed technique in dealing with all the considered scenarios. Duplications in the first two Scenarios, namely S1 and S2, are all correctly identified as clones, with returned similarity values ranging from 0.9 up to 1.0. Moreover, very good results have also been obtained for the other two Types of clones, i.e., Type 3 and Type 4, showing good potential in correctly identifying clones in almost every considered scenario. Nevertheless, it is worth noting that the considered scenarios for Type 4 clones do not take into account in their definitions the extended classification proposed by Kim et al. [99]. Moreover, the first three of them, namely S4a, S4b, and S4c, correspond to very trivial modifications and are not very good representatives of “real” Type 4 clones. As a matter of fact, the presented approach assigns a similarity value equal to 1.0 for all three of them. This is because the definition

of the KAST Kernel function intrinsically takes into account the case of statement inversions during the traversal of the compared tree structures. Conversely, the worst result has been obtained with the S4d Scenario, corresponding to the Type 4 clones that involve replacements with semantically equivalent control structures, also indicated as Type 4a [99]. Therefore, the presented qualitative assessment confirms our expectations on the very good potential of the proposed Tree Kernel technique in dealing with clones up to Type 3. Furthermore, these expectations have also been met in the results achieved in the quantitative evaluation. In particular, for Type 1 clones our tool and CloneDigger were able to detect all the clones without any false positive, so for both tools we got F1 = 1. The same holds for Type 2 clones. Thus, both tools expressed the potential of AST-based approaches: independently from the used identifiers and literals, they were able to detect all the methods with the same syntactic structure. Some further considerations are needed for Type 3 clones. In particular, it is worth mentioning that it is not easy to find a crisp boundary to identify Type 3 clones, since the definition does not specify a limit on the number of differences that may exist between two fragments of code. To deal with this issue, we imposed different threshold values on the minimum similarity necessary to classify two code fragments as clones. This parameter can be easily tuned by a user during the clone detection process. Figure 5.5 summarises the results achieved in detecting Type 3 clones by using different threshold values. The results indicate that the choice of the threshold value strongly affects the detection performance. Indeed, high threshold values may miss clones with a large number of modifications, causing a drop in recall. On the other hand, for values larger than approximately 0.7 the precision is close or equal to 1, as such a conservative approach is not affected by false positives. Conversely, lower threshold values improve the recall, but at the cost of introducing false positives. The best trade-off between precision and recall corresponds to the maximum in F1 (i.e., 0.74) and is achieved for a threshold value equal to 0.7. Worse results were observed by applying CloneDigger on our data set.

Figure 5.5: Precision, Recall and F-Measure plot of achieved results for Type 3 clones

In particular, on the Type 3 clones we obtained precision = 0.35, recall = 0.41, and

F1 = 0.38. These values are far from our results and have been obtained after a heavy tuning of the many parameters of the considered clone detector. A summary of the obtained results is shown in Figure 5.6.

Figure 5.6: Bar chart summarising obtained results on Type 3 clones by CloneDigger [32] and Tree Kernel Clone Detector.

A possible drawback of Tree Kernels is the running time. This suggested to us to carry out a further investigation to compare the performance of our tool and CloneDigger. To this end, we ran both detectors, on the cleaned system and on the different mutations, on a laptop equipped with a 2 GHz Intel Core 2 Duo processor and 4 GB of RAM. The execution of CloneDigger took about 10 seconds on average, while 30 seconds was the average time of our tool. It is worth noting that the computation time of our tool is nearly the same on each mutated system. To preliminarily assess the scalability of our detector, we also used it on a bigger Java project, namely Eclipse-jdtcore (about 150K LOC). The time needed to analyse this system was about 10 minutes. This is acceptable in case the detection is performed off-line.

Threats to Validity

Construct validity threats concern the relationship between theory and observation. Precision, recall, and F-measure well reflect the performance of our approach. However, an issue may concern the systems used to assess the approach. We tried to mitigate this threat by injecting clones in a random and controlled way into a software system with which we were particularly familiar. Regarding the external validity, an important threat is related to the size of the software system and to the fact that it was developed by a student. The rationale for selecting this system relies on the fact that we could easily have an oracle to evaluate the cloning results. Future work will be devoted to assessing the approach on larger software systems, thus also assessing its scalability. The evaluation of the approach on commercial software systems represents a possible future direction for our work. The assessment of the approach on public benchmarks [21, 157] also represents a possible future direction.

5.2 Graph Kernels for Clone Detection

The second proposed technique for clone detection leverages the flexibility provided by the WDK (Graph) Kernel formulation (Definition 2.34) to define a novel Kernel-based function able to compute the similarity of code fragments represented by

means of a PDG. In particular, Section 5.2.1 introduces the main definitions of the PDG and the formulation of the proposed Graph Kernel. Then, Section 5.2.2 describes the preliminary empirical evaluation conducted to assess the presented technique.

5.2.1 WDK for Dependency Graphs

A Program Dependency Graph is a graph representation of the source code of a procedure [67]. Basic statements, such as variable declarations, assignments, and procedure calls, are represented as nodes in the PDG. Moreover, every node has a single associated type, corresponding to the particular code structure it refers to. The list of types for the nodes of a PDG is reported in Table 5.1, which illustrates how the source code is decomposed and mapped to program nodes [124]. On the other hand, edges in the graph correspond to data and control dependencies. In particular:

Definition 5.1. (Control Dependency Edge) [124]: There is a control dependency edge from a “control” node to a second program node if the truth of the condition controls whether the second node will be executed.

Definition 5.2. (Data Dependency Edge) [124]: There is a data dependency edge from program node n1 to n2 if there is some variable var such that:

• n1 may assign a value to var, either directly or indirectly through pointers.

• n2 may use the value in var, either directly or indirectly through pointers.

• There is an execution path in the program from the code corresponding to n1 to the code corresponding to n2 along which there is no assignment to var.

Definition 5.3. [124]: The Program Dependency Graph G for a procedure P is a 4-tuple G = (V, E, µ, δ), where:

• V is the set of program nodes in P ;

• E ⊆ V × V is the set of dependency edges in P ;

Type           Short description
assignment     Assignment expression
increment      Increment expression (++, +, −−, ...)
return         Function return expression
expression     General expression except return, assignment or increment
declaration    Declaration of a variable or a formal parameter
jump           Goto, break, or continue
call-site      Call to other procedures
control        Control structures (loops, conditionals)
switch-case    Case or Default
label          Program labels

Table 5.1: Types for nodes in a PDG (Adapted from [124])

• µ : V → S is a function assigning types to program nodes, i.e., n ↦ µ(n) = tn, with tn ∈ {call-site, return, ...};

• δ : E → T is a function assigning dependency types, either data or control, to edges, i.e., e = (n1, n2) ↦ δ(e) = te, with te ∈ {data, control}.

Therefore, a PDG is a directed, labelled graph (Definition 2.9) which represents the data and control dependencies within one single function. In other words, the PDG represents how the data flows between the statements, and how statements control or are controlled by other statements [124]. Figure 5.1 shows the PDG corresponding to the C function reported in Listing 5.1. In particular, program nodes are labelled with their corresponding type. Moreover, data and control edges are shown in solid and dashed lines, respectively. Therefore, every node and edge in a PDG is intrinsically characterised by a corresponding type. These types represent the features associated to the elements of the graph and considered by the defined kernel functions. The formulation of the WDK requires the definition of two kernel functions, namely the selector and the context (see Definition 2.34). In particular, the aim of the selector function is to filter the comparisons between all the possible substructures, while the context has the purpose of traversing the substructures for further comparison. A minimal sketch of the typed PDG representation assumed in the following is given below.
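The following sketch shows one possible in-memory representation of such a typed PDG; the class and field names are assumptions made for illustration, not the representation produced by the actual tooling.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NODE_TYPES = {"assignment", "increment", "return", "expression", "declaration",
              "jump", "call-site", "control", "switch-case", "label"}
EDGE_TYPES = {"data", "control"}

@dataclass
class PDG:
    """Directed, labelled graph: typed program nodes and typed dependency edges."""
    node_type: Dict[int, str] = field(default_factory=dict)          # mu: V -> S
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)  # delta: E -> T

    def add_node(self, node_id: int, n_type: str) -> None:
        assert n_type in NODE_TYPES
        self.node_type[node_id] = n_type

    def add_edge(self, src: int, dst: int, e_type: str) -> None:
        assert e_type in EDGE_TYPES
        self.edges[(src, dst)] = e_type

    def successors(self, node_id: int) -> List[int]:
        """Neighbourhood used later by the WDK context traversal."""
        return [dst for (src, dst) in self.edges if src == node_id]

# Tiny excerpt of Listing 5.1: the while 'control' node governs the increment of i.
pdg = PDG()
pdg.add_node(1, "declaration")   # int i = 0;
pdg.add_node(2, "control")       # while (i < 7)
pdg.add_node(3, "increment")     # i += 1;
pdg.add_edge(1, 2, "data")       # the value of i flows into the loop condition
pdg.add_edge(2, 3, "control")    # the loop condition controls the increment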

Figure 5.1: Example of the PDG corresponding to the function reported in Listing 5.1. Nodes are labelled with their corresponding types, while data and control edges are depicted in solid and dashed lines.

Similarly to what was defined for the Tree Kernel function, we define the selector kernel function δ̂* as a filtering function based on the types of the two compared nodes. More formally:

\[
\hat{\delta}(n_1, n_2) =
\begin{cases}
1 & \text{if } \mu(n_1) = \mu(n_2);\\
0 & \text{otherwise.}
\end{cases}
\qquad (5.8)
\]

On the other hand, the formulation of the context kernel function requires some other preliminary definitions:

Definition 5.4. (Neighbourhood Types Set) Let n be a program node of a PDG. The neighbourhood types set of the node n is given by the set of all the

*We indicate with δ̂ the selector kernel function considered in the WDK formulation, so as not to create confusion when referring to the neighbourhood of a node of the PDG.

types of the nodes in the neighbourhood of n, i.e.,

\[
M(\delta^{+}(n)) = \{\, M(u),\ \forall u \in \delta^{+}(n) \,\}
\qquad (5.9)
\]

Definition 5.5. (Neighbourhood restricted to a type) Let n be a program node of a PDG, and T a given program node type. The neighbourhood of the node n restricted to the type T corresponds to:

\[
\delta^{+}(n)[T] = \{\, u \mid u \in \delta^{+}(n) \wedge M(u) = T \,\}
\qquad (5.10)
\]

Proposition 5.1. (Program Nodes Intersection): Let V and V′ be the sets of program nodes of two different PDGs G and G′. The intersection set of their program nodes is defined as:

\[
V_{\cap} = V \cap V' = \{\, n \in V,\ n' \in V' \mid \sigma(n, n') = 1 \,\}
\qquad (5.11)
\]

where σ corresponds to the selector kernel function defined in Equation 5.8.

Remark 5.1. Proposition 5.1 leads to the following formulation:

\[
\forall n \in V,\ n' \in V', \quad M(\delta^{+}(n)) \cap M(\delta^{+}(n')) = M(\delta^{+}(n) \cap \delta^{+}(n'))
\qquad (5.12)
\]

Then, from Equation 5.12, we define

\[
M^{+}(n, n') = M(\delta^{+}(n) \cap \delta^{+}(n')), \quad \forall n \in V,\ n' \in V'
\qquad (5.13)
\]

In conclusion, the final formulation of the WDKPDG is reported:

Definition 5.6. (WDK for Dependency Graphs): Let n ∈ V and n′ ∈ V′ be nodes of two distinct PDGs, G and G′. The WDK for Dependency Graphs is defined as:

\[
WDK_{PDG}(n, n', l) = \lambda \cdot \hat{\delta}(n, n') + (1 - \lambda) \sum_{T \in M^{+}(n, n')} k_{max\_ctx}(n, n', T, l)
\qquad (5.14)
\]

where 0 ≤ λ ≤ 1 is a parameter that controls the penalisation factor of the traversed contexts, and l is a value indicating the length of the maximum traversable path in the context.

Moreover, the kernel function kmax_ctx on the contexts is defined as:

\[
k_{max\_ctx}(n, n', T, l) =
\begin{cases}
\dfrac{1}{|\delta^{+}(n)[T]|} \cdot \displaystyle\sum_{u \in \delta^{+}(n)[T]} \max\{\, WDK_{PDG}(u, u', l-1),\ \forall u' \in \delta^{+}(n')[T] \,\} & \text{if } l > 0\\[2ex]
0 & \text{otherwise}
\end{cases}
\qquad (5.15)
\]

assuming, without loss of generality, that |δ⁺(v)[T]| ≤ |δ⁺(v′)[T]|.

Therefore, the basic idea is to recursively iterate the exploration of the context, admitting at most a number l of recursions. This strategy has two main advantages: it allows bounding the total number of performed comparisons and, at the same time, it makes the whole kernel able to handle possible loops in the graph. Furthermore, the similarity computation recursively selects at each step the substructures that provide the highest similarity values.
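A compact sketch of Eqs. 5.14 and 5.15, built on the PDG representation assumed earlier, is reported below; the helper names and the default value of the λ parameter are illustrative assumptions, not the actual implementation.

def selector(pdg1, pdg2, n1, n2):
    """Eq. 5.8: nodes match only if they have the same program-node type."""
    return 1.0 if pdg1.node_type[n1] == pdg2.node_type[n2] else 0.0

def neighbours_of_type(pdg, n, t):
    """delta+(n)[T]: successors of n whose type is T."""
    return [u for u in pdg.successors(n) if pdg.node_type[u] == t]

def wdk_pdg(pdg1, pdg2, n1, n2, l, lam=0.5):
    """Eq. 5.14: weighted combination of the selector and the context kernel."""
    base = lam * selector(pdg1, pdg2, n1, n2)
    if l <= 0:                                   # Eq. 5.15: no context if l = 0
        return base
    shared_types = ({pdg1.node_type[u] for u in pdg1.successors(n1)} &
                    {pdg2.node_type[u] for u in pdg2.successors(n2)})
    ctx = 0.0
    for t in shared_types:                       # T in M+(n, n')
        ns1 = neighbours_of_type(pdg1, n1, t)
        ns2 = neighbours_of_type(pdg2, n2, t)
        if len(ns1) > len(ns2):                  # w.l.o.g. iterate over the smaller side
            ns1, ns2 = ns2, ns1
            g1, g2 = pdg2, pdg1
        else:
            g1, g2 = pdg1, pdg2
        # Eq. 5.15: average over the best match of each neighbour, with depth l - 1.
        ctx += sum(max(wdk_pdg(g1, g2, u, u2, l - 1, lam) for u2 in ns2)
                   for u in ns1) / len(ns1)
    return base + (1 - lam) * ctx

Bounding the recursion by l both limits the number of comparisons and prevents infinite traversals when the PDG contains cycles.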

5.2.2 Preliminary Evaluation

A preliminary evaluation of the proposed Graph Kernel clone detector has been performed, generating the analysed PDGs using the CodeSurfer tool.† The main purpose of this preliminary investigation is to analyse the quality of the clones, namely the Types, that the WDKPDG Kernel is able to detect. For the sake of

†http://www.grammatech.com

comparability, the empirical study considered the scenario-based evaluation already used for the qualitative assessment of the proposed Tree Kernel. The description of the experimental settings is reported in Section 5.1.3, while the obtained results are reported in Figure 5.2.

Figure 5.2: Results for the qualitative evaluation of the proposed Graph Kernel WDKPDG clone detection technique.

Results show the very good potential of the proposed Graph Kernel function for clone detection. In fact, the WDKPDG kernel outperforms the KAST Tree Kernel in dealing with the considered Type 3 and Type 4 clones. This is because the PDG is inherently unaffected by statement inversions, since the structure only contains information about the relationships among the different instructions. As a consequence, the WDKPDG seems to be a very promising technique for the identification of Type 4 clones, which are usually the hardest to detect [159] and represent the real “frontier” for researchers. However, the improvement in results is traded off by a penalisation in the overall (asymptotic) computational complexity, w.r.t. the Tree Kernel based solution. Nevertheless, to deal with this issue, the implemented prototype tool leverages the benefits of the Message Passing Interface (MPI) parallel framework. In particular, since the similarity computations can be performed in isolation, the Kernel applies the comparison of multiple PDGs in parallel, thus lowering

the total computation time necessary for the whole clone detection process.

5.3 Towards a Supervised Kernel Learning Approach for Clone Detection

To quantitatively evaluate the effectiveness of a clone detection approach, it is necessary to know exactly which clones occur in the source code of the analysed software system. However, it is widely recognised that this task is usually not feasible when considering an actual software system [104]. In other fields of empirical software engineering, a possible solution could consist in using a public benchmark. Unfortunately, this is not the case for clone detection, since to date there is no benchmark for a deep quantitative assessment of clone detection techniques. In fact, the public data sets used in [21] can only be employed to get an indication of the correctness of the cloning results, since only a small fraction of the actual clones is tracked. In more detail, Bellon et al. [21] defined a pooling process [130] where a limited set of results, gathered from different clone detection tools, has been manually cross-checked. However, the effect of such a procedure is that there is no guarantee of completeness, and so only an underestimation of the actual performance can be provided. From a ML perspective, in this scenario only unsupervised techniques, such as clustering (Section 3.3), can be applied. However, clustering approaches guide the learning process only through the similarity computation. Conversely, supervised techniques could be potentially more effective, as they could be trained in order to automatically learn and infer properties and characteristics of code clones. Nevertheless, a typical problem in developing supervised ML approaches regards the necessity to arrange two different sets of annotated data, namely the training and the assessment sets. This problem is particularly important in the case of clone detection, where these labelled data sets are even more difficult than usual to produce, since a manual annotation of large systems is infeasible. Furthermore, the examples gathered from the data set provided by Bellon et al. [21] are not so effective for training new algorithms, since they present a bias towards the clones detected by the solutions

used to produce the data set. As an alternative, in this Section an algorithm to construct the training data by simulation is presented. In particular, these data are automatically generated by modifying parts of the code of the analysed system, to be used afterwards to train a classifier for clone detection. To assess the effectiveness of the proposal, we analyse the results gathered by the application of the Tree Kernel-based clone detection approach (Section 5.1), considering different clone types and using different similarity thresholds. In particular, the empirical evaluation aims to investigate the validity of the generated data by comparing the achieved results with those reported in Section 5.1.4.

5.3.1 Automatically Injected Code Clones

The simulation process artificially produces clones by modifying parts of the input project and injects them following predefined probability distributions. In this way, the quality of the training set can be controlled without any need to impose restrictions on its size. A Kernel-based classifier is then trained on this data set. To this aim, we designed and implemented an algorithm able to inject clones into the source code. In particular, this algorithm allows us to automatically generate a training set and to apply a more reliable strategy in the definition of the supervised Kernel learning process. The core of our clone injection algorithm is represented by the function InjectClone, whose pseudocode is reported in Algorithm 10. This algorithm is able to generate function clones and to track their location in the source code, thus obtaining a labelled dataset of clones of the given input Type. In more detail, the algorithm starts its computation by parsing the source code of the analysed software system in order to extract all the target functions (Line 2). Afterwards, each function is processed one at a time, deciding whether or not it has to be cloned (Line 6) and how many clones should be generated (Line 9). In particular, we consider that each function has a probability probCloning of being cloned. Moreover, if a function has to be cloned, the number of clones to generate is randomly chosen according to a geometric probability distribution with parameter 0.5, namely Pr(nCopies = n) = 0.5^n (Lines 9 - 11); an equivalent direct sampling of this distribution is sketched after Algorithm 10.

Finally, the algorithm invokes the procedures Clone and Inject to perform the generation and the injection of the clones in the source code, respectively, and returns the tracking info of the generated data. The pseudocode of the Clone procedure is reported in Algorithm 11.

Algorithm 10 Clone Injection Algorithm
Input: sourceCode : Source code of the system under analysis;
Input: type : Type of the clones to generate;
Input: probCloning : (Constant) The probability of functions/methods to be cloned.
Output: The source code of the system with randomly injected clones;
Output: The tracking info of injected clones in the source code.
 1: function InjectClones(sourceCode, type)
 2:     functionList ← parseAndExtractFunctionsFrom(sourceCode)
 3:     clonesTrackInfo ← ∅
 4:     for each: function ∈ functionList do
 5:         probGenerateClone ← random(0,1)
 6:         if probGenerateClone ≤ probCloning then
 7:             nCopies ← 0
 8:             dice ← random(0,1)
 9:             while not (2^-(nCopies+1) ≤ dice ≤ 2^-nCopies) do
10:                 nCopies ← nCopies + 1
11:             end while
12:             for i = 1 to nCopies do
13:                 newClone ← Clone(function, type)
14:                 trackInfo ← Inject(sourceCode, newClone)
15:                 add(clonesTrackInfo, trackInfo)
16:             end for
17:         end if
18:     end for
19:     return clonesTrackInfo
20: end function
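As a side note, the sampling loop at Lines 7-11 can be read as drawing nCopies from a geometric distribution; the following sketch is an essentially equivalent direct formulation (boundary ties aside), with names chosen only for illustration.

import math
import random

def sample_n_copies():
    """Essentially equivalent to Lines 7-11 of Algorithm 10: nCopies = n iff
    2^-(n+1) <= dice <= 2^-n, i.e. a geometric distribution with parameter 0.5."""
    dice = random.random()
    return int(math.floor(-math.log2(dice))) if dice > 0 else 0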

The Clone procedure is able to generate clones up to Type 4 by employing a set of different procedures that apply specific modifications to the program text (mutations) of the target function. The invocation of such procedures is performed in accordance with the Type of the clone to generate. We do not report the pseudocode of such functions in the current document due to space limitations. The first mutation operation is performed by the CopyAndChangeLayout function (Line 2), which is always applied to the target function, regardless of the selected clone Type.

Algorithm 11 Clone Generation Algorithm
Input: function : Target function/method of the analysed system to clone;
Input: type : Type of the clone to generate;
Output: The artificially generated clone of the input function.
 1: function Clone(function, type)
        ▷ This mutation operation holds for every Type of clones
 2:     clone ← CopyAndChangeLayout(function)
 3:     if type ≥ 2 then
        ▷ This mutation operation is also applicable to Type 3 and Type 4 clones
 4:         SubstituteIdsAndLiterals(clone)
 5:     end if
 6:     if type = 3 then
 7:         AddOrDeleteStatements(clone)
 8:     else if type = 4 then
 9:         ReorderStatements(clone)
10:         SubstituteEquivalentControlStructures(clone)
11:     end if
12:     return clone
13: end function

This is because all four definitions of clones allow some modification in the layout of the program text. The substitution of identifiers and literals is performed for clones from Type 2 up to Type 4, by invoking the SubstituteIdsAndLiterals procedure (Line 4). In particular, such procedure processes every literal and identifier of the input function, each of which has a probability probSubstituteId of being substituted with a randomly generated identifier. When dealing with Type 3 clones, in addition to the mutations applied for Type 2, other operations should be considered. Indeed, in a Type 3 clone, two fragments of code may also differ in their statements, which can be added or removed (Line 7). Therefore, we assigned the same probability (i.e., 1/2 for each operation) to the insertion of a new statement, randomly extracted from the considered software system, and to the deletion of a statement. Furthermore, we impose an upper bound on the total number of operations, which is a randomly generated fraction of the total number of statements in the analysed function. In this way, we avoid the generation of a totally different function that would not be an actual clone of the target one. Finally, in the case of Type 4 clones, the mutation operations include the reorder-

ing of statements (Line 9) and the replacement of equivalent control structures (Line 10). In particular, the former is applied only to declarations and independent statements, while the latter substitutes possibly occurring control structures with other semantically equivalent ones. For instance, for loops may be replaced with while loops, and if-elseif conditions may be substituted by switch-case structures. A small sketch of the identifier-substitution mutation is given below.
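As an illustration of the Type 2 mutation step, the following sketch renames identifiers with probability probSubstituteId; it is a simplified token-level approximation (it may even rename keywords), not the actual implementation, which operates on the parsed function.

import random
import re
import string

def substitute_ids_and_literals(code: str, prob_substitute_id: float = 0.3,
                                seed: int = None) -> str:
    """Rename each distinct identifier with probability prob_substitute_id."""
    rng = random.Random(seed)
    identifiers = set(re.findall(r"\b[A-Za-z_][A-Za-z_0-9]*\b", code))
    renaming = {}
    for name in identifiers:
        if rng.random() <= prob_substitute_id:
            renaming[name] = "id_" + "".join(rng.choices(string.ascii_lowercase, k=6))
    # Replace whole-word occurrences only, so substrings of other names are untouched.
    for old, new in renaming.items():
        code = re.sub(r"\b%s\b" % re.escape(old), new, code)
    return code

# Example on a fragment of Listing 5.1 (keywords may be renamed too in this naive sketch).
print(substitute_ids_and_literals("int i = 0; while (i < 7) { i += 1; }", seed=42))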

5.3.2 Experimental Settings

In the preceding Section we discussed the limits of the existing data sets for clone detection and described how an artificial data set can be produced. Although such a data set is aimed at training, we used it also to assess the Tree Kernel technique described in Section 5.1. Even if we cannot assume that the performance obtained on artificial data will generalise to the real case, these experiments allow us to obtain a better understanding of the strengths and weaknesses of a Kernel based clone detection approach.

Dataset

The considered target system is an academic application implemented in Java and developed by a Master student in Computer Science at the University of Naples “Federico II”. As a first step, the system has been manually analysed to remove the clones introduced during implementation. Afterwards, we applied to this cleaned code the Tree Kernel-based clone detector (see Section 5.1), in order to verify whether it detected either any false clone or actual clones which had escaped the manual search. We then applied the clone generator to this system, limiting the generation of clones to Type 3, to make the results comparable with the ones reported in Section 5.1.4.

Research Questions and Variables

In this investigation we assessed the three following research questions:

RQ1 : Are the clones identified by the approach correct?

RQ2 : Is the group of clones identified by the approach complete?

Clone Type    Threshold    Precision    Recall    F1
Type 1        N.A.         1.0          1.0       1.0
Type 2        0.7          0.6          0.9       0.7
Type 2        0.8          0.7          0.6       0.6
Type 3        0.7          0.6          0.8       0.7
Type 3        0.8          0.6          0.8       0.7

Table 5.1: Summary statistics of the results

RQ3 : Does the group of clones identified by the approach achieve a good compromise between correctness and completeness?

Note that the definition of the third research question is motivated by the fact that the completeness requirement is opposite to the correctness one, as the former suggests outputting a large number of candidate clones, while the latter implies a more conservative approach, where only quite likely clones are reported. As already done in the previous experimental evaluation, Precision (Equation 5.5) has been used to measure the correctness of the results, while the completeness has been assessed by employing the Recall measure (Equation 5.6). Finally, to assess whether the approach is effective (RQ3), we computed the F1 measure (Equation 5.7).

5.3.3 Results and Discussion

Since the Tree Kernel based approach does not include any formatting detail in its internal source code representation, Type 1 clones include no variability, and thus no similarity threshold is necessary. With this kind of clones, it is easy to obtain an F-measure of 1.0. Regarding the other two types of clones, some modifications in the identifiers (Types 2 and 3) and in the statements (Type 3 only) have been performed. In these cases, larger values of the threshold (e.g., 0.9) produce a small number of candidates. As a consequence, the recall is low, since only code fragments which are very similar are considered as clones. This effect is particularly evident for Type 3 clones, where no clones at all are detected. On the other hand, threshold values

like 0.7 and 0.8 lead to better performance. In particular, the value 0.7 seems to improve completeness without affecting correctness, and is therefore preferable. The attained results are strongly comparable with those reported in Section 5.1.4 in terms of all the three indicators we are considering, namely correctness, completeness and effectiveness, thus confirming the validity of the artificially generated data.

Threats to Validity

In this Section we focused our attention on the construct validity and the external validity. Construct validity threats concern the relationship between theory and observation. Precision, Recall, and F1 well reflect the performance of the proposed approach. However, the used data set has been obtained by manually removing source clones and then introducing new clones of Types 1, 2 and 3 in a controlled way. The performed mutations may bias the results, since they could affect the values of these measures. However, the defined mutation approach has been conceived to reduce this effect on the results as much as possible. To increase our confidence in the achieved results, we also plan to assess the validity of the results using different measures to determine various aspects of detection quality [21]. External validity threats regard the generalisation of the results. An important threat is related to the studied software system. In particular, the size and the fact that the system was developed by a student may threaten the validity of the results. Also, the fact that this system was implemented in Java may affect the generalisation of the results. To this aim, we plan to conduct case study replications on commercial software systems implemented in different programming languages. This will increase our awareness of the validity of applying Kernel Methods in the detection of software clones. Regarding scalability, software systems with different sizes and clone densities will be studied in the future.

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell

6 Conclusions

Software maintenance is a key phase of the software development lifecycle, and consequently many research efforts are devoted to providing new solutions to improve its effectiveness. Nevertheless, it represents one of the most expensive and challenging phases of the whole development process. In particular, most of the effort and time required by maintenance activities is spent on understanding the system and its source code. Indeed, the documentation of large software systems is usually missing or not up-to-date, and so the source code remains the only valuable and reliable source of information for software maintainers. To this aim, most automated maintenance tools gather information directly from the source code in order to aid the comprehension of the system, and thus support practitioners in their duties. Among the different maintenance tasks, this thesis presented the proposed research contributions to three specific problems, namely software re-modularisation, source code vocabulary normalisation and clone detection. In particular, all the presented approaches are based on the definition and the application of unsupervised machine learning techniques, which provide powerful and flexible solutions to cope with the different considered problems. A summary of each proposed approach, together with an outline of possible future research directions, is given in the following Sections. In particular, Section 6.1 concerns the problem of software re-modularisation, Section 6.2 deals with the problem of source code vocabulary normalisation, and Section 6.3 summarises the contributions for clone detection and concludes the Chapter.

6.1 Software re-modularisation

Summary: A common scenario faced during the maintenance of a software system is the lack of reliable architectural documentation, which is often missing or not properly up-to-date. In this situation, reverse engineering tools have to be employed to align the documentation with the actual implemented software architecture [132, 164]. These tools usually rely on clustering-based approaches to group sets of related classes, exploiting some structural measure of similarity among software artefacts. In this thesis, the use of a similarity measure based on lexical information for clustering related software artefacts has been proposed. In particular, the solution investigates the effects of mining the lexical information gathered from six independent vocabularies: Class Names, Attribute Names, Function Names, Parameter Names, Comments, and Source Code Statements. The results of the clustering on these features have been evaluated using two criteria, Authoritativeness and Non-Extremity Distribution (NED), applied on 19 open source Java systems. The first interesting finding is that considering the lexical information gathered from the six zones always improves results over the “flat” configuration, namely all the terms from the zones merged together. The other key finding is that each project has its own peculiarities as for the distribution and the relevance of the terms within these zones. Thus, to exploit at its best the potential of the lexical information embedded in a software system, a mechanism to automatically weight the importance of each vocabulary is absolutely required. In particular, the presented contribution leverages the Expectation-Maximisation algorithm with different initialisation strategies and probabilistic models, namely Gaussian, Bernoulli and Multinomial. We obtained empirical evidence that the introduction of a weighting technique markedly improves results, with a mean enhancement of 40% in terms of Authoritativeness on the considered dataset. The last contribution concerns the adopted clustering strategy, based on the application of two well-known clustering algorithms, namely K-Medoid and Group Average Hierarchical Clustering, which have been properly customised to better suit the considered domain. The final results show that lexical information, if properly weighted, can be successfully employed for software clustering, since it provides better results than the unweighted configuration.
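As an illustrative sketch only (not the exact pipeline evaluated in the thesis), the following Python fragment shows how per-zone term vectors could be weighted and fed to a group-average agglomerative clustering; the zone names mirror the six vocabularies above, while the uniform weights are hypothetical stand-ins for the per-project relevance that the EM procedure would estimate.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import AgglomerativeClustering

ZONES = ["class_names", "attributes", "functions", "parameters", "comments", "statements"]
# Hypothetical uniform weights; in the approach described above they would be
# induced by Expectation-Maximisation rather than fixed by hand.
ZONE_WEIGHTS = {zone: 1.0 / len(ZONES) for zone in ZONES}

def zone_weighted_distance(classes):
    """classes: list of dicts mapping a zone name to the text extracted from that zone."""
    distance = np.zeros((len(classes), len(classes)))
    for zone in ZONES:
        docs = [c.get(zone, "") for c in classes]
        if not any(docs):          # skip zones with no lexical content at all
            continue
        tfidf = TfidfVectorizer().fit_transform(docs)
        distance += ZONE_WEIGHTS[zone] * cosine_distances(tfidf)
    return distance

def cluster_classes(classes, n_clusters):
    dist = zone_weighted_distance(classes)
    # group-average linkage over the precomputed weighted distance
    # (the keyword is `affinity=` instead of `metric=` in older scikit-learn versions)
    model = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average")
    return model.fit_predict(dist)
```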

Future research directions: Several future directions are possible. The first, and the easiest to attain, is to extend the preprocessing phase with the LINSEN algorithm, and to automatically associate labels to the clusters. Another direction is the investigation of software systems implemented in different object-oriented programming languages, which could be easily attained thanks to the flexibility provided by srcML. Moreover, we will investigate the use of commercial software systems, rather than open source ones. It will also be interesting to investigate the possibility of inferring potential relationships between the relevance of each zone and some process-specific elements, such as the adopted development methodology. Finally, the possibility of merging lexical information with the structural one, coming from the original organisation of the classes within the packages, will be investigated. To this aim, we plan to study the applicability of Kernel Methods to software re-modularisation.

6.2 Source code vocabulary normalisation

Summary: The lexical information provided by programmers in source code identifiers is crucial for many software analysis and maintenance tools, but in order to be properly exploited, identifiers must be processed to split concatenated words and to expand abbreviations. This thesis describes a novel technique, called LINSEN, that is able to map a given identifier to the set of corresponding dictionary words. The technique is based on a pattern matching algorithm, the BYP, and is suitable to deal with both the splitting of an identifier into its constituent tokens and the expansion of possible occurring abbreviations. The proposal has been implemented in a prototype that we have assessed using 24 open source software systems, mainly written in C/C++. Results highlight that our proposal outperforms three state-of-the-art solutions in terms of accuracy or F-Measure, showing also interesting computation times even in its currently unoptimised implementation. One of the main features of the proposed approach is the usage of multiple dictionaries of words, providing information from different domains, in order to tackle the problem of possible ambiguities.
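By way of illustration only (LINSEN itself relies on the BYP approximate string matching algorithm and on multiple, graded dictionaries), a naive Python sketch of the two sub-tasks, splitting and expansion, might look as follows; the tiny abbreviation dictionary is purely hypothetical.

```python
import re

# Hypothetical abbreviation dictionary; the actual approach combines several
# dictionaries (English words, programming terms, project-specific vocabulary).
ABBREVIATIONS = {"attr": "attribute", "hgt": "height", "min": "minimum", "msg": "message"}

def split_identifier(identifier):
    """Split camelCase / PascalCase / snake_case identifiers into soft words."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # split case transitions such as "getAttr" -> ["get", "Attr"]; "MINHGT" stays whole
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part))
    return [w.lower() for w in words if w]

def expand(tokens):
    """Expand known abbreviations; unknown tokens are returned unchanged."""
    return [ABBREVIATIONS.get(token, token) for token in tokens]

print(expand(split_identifier("getAttr")))   # ['get', 'attribute']
print(expand(split_identifier("MIN_HGT")))   # ['minimum', 'height']
```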

Future research directions: As future work, there are many directions that could be investigated to gain a better insight into the covered phenomena. The first concerns a more detailed investigation of the impact of the choice of the exploited dictionaries on the effectiveness of the LINSEN algorithm. Similarly, another interesting extension of the presented work concerns the introduction of more programming-oriented dictionaries, such as the ones exploited by Enslen [61] and Hill [85], in order to compensate for the current limitation of the approach in dealing with abbreviations related to specific code structures, such as custom data types (e.g., DHT for Dynamic Hash Table). A replicated study is also required, involving commercial software systems, in order to understand whether the open source development scenario could bias the results. Moreover, it would be very interesting to perform a psycho-linguistic study, involving a number of programmers, in order to understand the relation between abbreviated and expanded forms of identifiers in ambiguous cases. The outcome of this study could provide interesting indications which could be embedded in new versions of the proposed approach.

6.3 Clone detection

Summary: The presence of software clones may complicate maintenance tasks (e.g., corrective maintenance operations). Accordingly, there is a need for tools able to detect clones within software repositories [21]. Although many research efforts have been devoted to this end, there is still the need for approaches that combine multiple techniques and/or sources of information to effectively detect code clones [159]. The research contributions presented in this thesis for clone detection are focused on Kernel Methods, exploited as a powerful and flexible tool for measuring the “similarity” between code fragments. Indeed, Kernel Methods are a natural candidate for learning problems involving richly structured objects, and the two proposed solutions exploit structural representations of the source code, namely the Abstract Syntax Tree (AST) and the Program Dependency Graph (PDG), in order to compute the similarity between code fragments. To this aim, Tree and Graph Kernel based techniques have been proposed, respectively. In more detail, the first solution uses a proper set of features to annotate the nodes of the AST and a set of kernel functions to measure the similarity among subtrees, while the second defines a proper Weighted Decomposition Kernel (WDK) on graphs for clone detection, exploiting the peculiarities of nodes and edges of the PDG. Both techniques have been empirically assessed in a qualitative and quantitative evaluation. The qualitative evaluation considers the different editing scenarios proposed by Roy et al. [159], to get insights and indications on the potential capabilities of Kernel-based code similarities for clones, while the quantitative evaluation has been performed by comparing the achieved results with other state-of-the-art techniques. The promising results obtained in clone detection using kernels on ASTs and PDGs are encouraging, showing the potential of structured kernels in uncovering similarities between fragments. In more detail, the Tree Kernel approach has been applied to a Java software system manually cleaned of all the cloned methods. Subsequently, this cleaned system has been mutated to randomly introduce clones of Types 1, 2, and 3. Clones at the method granularity level have been considered in the experimentation, and the obtained results have been compared to those obtained by another pure AST-based technique on the same dataset. The results indicate that Tree Kernels are particularly suitable to detect up to Type 3 clones, outperforming the considered state-of-the-art technique. Conversely, the Graph Kernel algorithm has been applied to two open source projects, namely Apache and Python, and the results compared with another graph-based state-of-the-art technique. Results show an improvement over the existing technique on a publicly available oracle. In addition, a promising new method for the automatic generation of labelled clone benchmarks has been presented as well.
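To give a flavour of the underlying idea (this is not the kernel actually defined in Chapter 5, which operates on annotated Java ASTs with a richer feature set), the following Python sketch computes a crude normalised tree similarity between two functions by counting shared parent-child node-type pairs in their ASTs.

```python
import ast
from collections import Counter

def production_counts(tree):
    """Count (parent_type, child_type) pairs, a crude proxy for shared subtree structure."""
    counts = Counter()
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            counts[(type(node).__name__, type(child).__name__)] += 1
    return counts

def tree_similarity(code_a, code_b):
    """Kernel-style normalised similarity in [0, 1] between two code fragments."""
    ca, cb = production_counts(ast.parse(code_a)), production_counts(ast.parse(code_b))
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = (sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())) ** 0.5
    return dot / norm if norm else 0.0

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
renamed  = "def accumulate(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
print(tree_similarity(original, renamed))  # 1.0: identifiers differ but structure matches (a Type 2 clone)
```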

Future research directions: A first improvement would be a more extensive evaluation of the Graph Kernel approach for clones, in order to corroborate the promising results obtained in the performed qualitative evaluation. In particular, these results should also be compared with those of the Tree Kernel approach, in order to investigate the benefits provided by each of the two proposed solutions. Another important research direction is the one that motivated the development of the mutation algorithm presented at the end of Chapter 5 (Section 5.3). In particular, the main objective of this line of work is to set up a Kernel Learning framework based on a supervised learning strategy. The wider variability of code found within functional modules requires an adaptation of the kernels in order to effectively detect them. The problem can be addressed by combining kernel redesign with kernel learning approaches [78], where the similarity measure is not fully specified a priori, but is learned from examples as a combination of similarity patterns (e.g., involving different types of lexical and structural information). Logic kernels [70, 111] are particularly promising in this context, as they allow arbitrary domain knowledge about the relationships between code fragments to be encoded, from which similarity measures can then be learned. As a result, in this scenario no feature weighting schema would be necessary, since the “importance” of each feature is automatically induced by the training algorithm.
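A minimal sketch of this direction, assuming a handful of base similarity functions (e.g., one lexical and one structural) whose convex combination weights would be learned from labelled clone pairs rather than fixed by hand:

```python
def combined_kernel(base_kernels, weights):
    """Convex combination of base similarity functions, in the spirit of multiple kernel learning.

    `base_kernels` is a list of functions (fragment, fragment) -> float and
    `weights` their non-negative mixing coefficients; in a kernel learning
    setting the weights would be induced from labelled examples instead of
    being hand-tuned as they are here.
    """
    assert len(base_kernels) == len(weights)
    total = sum(weights)

    def kernel(x, y):
        return sum(w * k(x, y) for w, k in zip(weights, base_kernels)) / total

    return kernel

# Usage: similarity = combined_kernel([lexical_kernel, tree_kernel], [0.3, 0.7])(a, b)
# where lexical_kernel and tree_kernel are hypothetical base similarities.
```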

References

[1] Google’s Python Style Guide. URL http://google-styleguide.googlecode.com/svn/trunk/pyguide.html?showone=Naming#Naming.

[2] Java Coding Style Guide. URL http://developers.sun.com/sunstudio/products/archive/whitepapers/java-style..

[3] IEEE Trans. Softw. Eng., 14(12), 1988. ISSN 0098-5589.

[4] LINSEN: An Efficient Approach to Split Identifiers and Expand Abbreviations, Riva del Garda (Trento), Italy, 2012. IEEE Computer Society. doi: 10.1109/ICSM.2012.6405277.

[5] Surafel Lemma Abebe, Sonia Haiduc, Andrian Marcus, Paolo Tonella, and Giuliano Antoniol. Analyzing the evolution of the source code vocabulary. In 13th European Conference on Software Maintenance and Reengineering (CSMR 2009), pages 189–198, 2009.

[6] Alain Abran. Guide to the Software Engineering Body of Knowledge - SWEBOK. IEEE Computer Society, Piscataway, NJ, USA, 2001. ISBN 0769510000.

[7] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.

[8] A. Aizerman, E. M. Braverman, and L. I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[9] J.R. Anderson, R.S. Michalski, J.G.M. Carbonell, and T.M. Mitchell. Machine Learning: An Artificial Intelligence Approach. Number v. 2 in Machine Learning: An Artificial Intelligence Approach. Morgan Kaufmann, 1986. ISBN 9780934613002.

[10] Periklis Andritsos and Vassilios Tzerpos. Information-theoretic software clustering. IEEE Trans. Software Eng., 31(2):150–165, 2005.

[11] N. Anquetil, C. Fourrier, and T. C. Lethbridge. Experiments with clustering as a software remodularization method. In Proceedings of the 6th Working Conference on Reverse Engineering, pages 235–255, Washington, DC, USA, 1999. IEEE Computer Society.

[12] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, 2002.

[13] R. S. Arnold. Software Reengineering. IEEE Computer Society, 1993.

[14] R.S. Arnold. Software restructuring. Proceedings of the IEEE, 77(4):607–617, 1989. ISSN 0018-9219. doi: 10.1109/5.24146.

[15] Lowell Jay Arthur. Software evolution: the software maintenance challenge. Wiley-Interscience, New York, NY, USA, 1988. ISBN 0-471-62871-9.

[16] Lerina Aversano, Luigi Cerulo, and M. Di Penta. How clones are maintained: An empirical study. In Rene L. Krikhaar, Chris Verhoef, and Giuseppe A. Di Lucca, editors, CSMR, pages 81–90. IEEE Computer Society, 2007.

[17] Ricardo A. Baeza-Yates and Chris H. Perleberg. Fast and practical approximate string matching. In Third Annual Symposium on Combinatorial Pattern Matching, pages 185–192. Springer-Verlag, 1992.

[18] B. Baker. On finding duplication and near-duplication in large software systems. In IEEE Proceedings of the Working Conference on Reverse Engineering, 1995.

[19] Magdalena Balazinska, Ettore Merlo, Michel Dagenais, Bruno Lague, and Kostas Kontogiannis. Measuring Clone Based Reengineering Opportunities. In Proceedings of the IEEE Symposium on Software Metrics, pages 292–303. IEEE Computer Society, November 1999.

[20] Ira D. Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. Clone detection using abstract syntax trees. In Proceedings of the International Conference on Software Maintenance, pages 368–377. IEEE Press, 1998.

[21] Stefan Bellon, Rainer Koschke, Giulio Antoniol, Jens Krinke, and Ettore M. Merlo. Comparison and evaluation of clone detection tools. IEEE Trans. Software Eng., pages 577–591, 2007.

[22] Sanjay Bhansali, Wen-Ke Chen, Stuart de Jong, Andrew Edwards, Ron Murray, Milenko Drinić, Darek Mihočka, and Joe Chau. Framework for instruction-level tracing and analysis of program executions. In Proceedings of the 2nd International Conference on Virtual Execution Environments, VEE ’06, pages 154–163, New York, NY, USA, 2006. ACM. ISBN 1-59593-332-8. doi: 10.1145/1134760.1220164. URL http://doi.acm.org/10.1145/1134760.1220164.

[23] Ted J. Biggerstaff. Design recovery for maintenance and reuse. Computer, 22:36–49, July 1989. ISSN 0018-9162.

[24] R.V. Binder and J.J.-P. Tsai. KB/RMS: an intelligent assistant for requirement definition. In Tools for Artificial Intelligence, 1990. Proceedings of the 2nd International IEEE Conference on, pages 610–616, 1990. doi: 10.1109/TAI.1990.130407.

[25] D. Binkley, M. Davis, D. Lawrie, and C. Morrell. To camelCase or under_score. In IEEE 17th International Conference on Program Comprehension (ICPC), 2009, pages 158–167, 2009.

[26] David Binkley. Source code analysis: A road map. In 2007 Future of Software Engineering, FOSE ’07, pages 104–119, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2829-5. doi: 10.1109/FOSE.2007.27.

[27] C.M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, 2006. ISBN 9780387310732.

[28] R. A. Bittencourt and D. D. S. Guerrero. Comparison of graph clustering algorithms for recovering software architecture module views. In Proceedings of the European Conference on Software Maintenance and Reengineering, pages 251–254, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0-7695-3589-0.

[29] Max Bramer. Principles of Data Mining. Springer, 1 edition, March 2007. ISBN 1846287650.

[30] F.P. Brooks, Jr. No silver bullet: Essence and accidents of software engineering. Computer, 20(4):10–19, 1987. ISSN 0018-9162. doi: 10.1109/MC.1987.1663532.

[31] R. Brooks. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18(6):543–554, June 1983. ISSN 00207373. doi: 10.1016/s0020-7373(83)80031-5.

[32] Peter Bulychev. Clonedigger, 2008. http://clonedigger.sourceforge.net/.

[33] Peter Bulychev and Marius Minea. Duplicate code detection using anti-unification. In Spring/Summer Young Researcher’s Colloquium, 2008.

[34] Fabio Calefato, Filippo Lanubile, and Teresa Mallardo. Function clone detection in web applications: A semiautomated approach. Journal on Web Engineering, 3(1):3–21, 2004.

[35] F. Camastra. Kernel Methods for Unsupervised Learning. PhD thesis, Università degli studi di Genova, 2004.

[36] G. Canfora, A. Cimitile, and M. Munro. Re2: Reverse-engineering and reuse re-engineering. Journal of Software Maintenance: Research and Practice, 6 (2):53–72, 1994. ISSN 1096-908X. doi: 10.1002/smr.4360060202.

[37] Bruno Caprile and Paolo Tonella. Restructuring program identifier names. In Proceedings of the International Conference on Software Maintenance (ICSM), pages 97–. IEEE Computer Society, 2000. ISBN 0-7695-0753-0.

[38] Jaime G. Carbonell, Ryszard S. Michalski, and Tom M. Mitchell. Machine learning: A historical and methodological analysis. AI Magazine, 4(3):69–79, 1983.

[39] Ned Chapin, Joanne E. Hale, Khaled Md. Kham, Juan F. Ramil, and Wui- Gee Tan. Types of software evolution and software maintenance. Journal of Software Maintenance, 13(1):3–30, January 2001. ISSN 1040-550X.

[40] E.J. Chikofsky and J.H. Cross II. Reverse engineering and design recovery: a taxonomy. IEEE Software, 7(1):13–17, 1990. ISSN 0740-7459. doi: 10.1109/52.43044.

[41] Michael Collins and Nigel Duffy. Convolution kernels for natural language. In NIPS, pages 625–632, 2001.

[42] A. Corazza, S. Di Martino, and G. Scanniello. A probabilistic based approach towards software system clustering. In 14th European Conference on Software Maintenance and Reengineering (CSMR), pages 89–98, 2010.

[43] A. Corazza, S. Di Martino, V. Maggio, and G. Scanniello. Investigating the use of lexical information for software system clustering. In Proceedings of the 15th European Conference on Software Maintenance and Reengineering, CSMR ’11, pages 35–44, Washington, DC, USA, 2011. IEEE Computer Society.

[44] Anna Corazza, Sergio Di Martino, Valerio Maggio, and Giuseppe Scanniello. A tree kernel based approach for clone detection. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, ICSM ’10, pages 1–5, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-8630-4. doi: 10.1109/ICSM.2010.5609715.

[45] Anna Corazza, Sergio Di Martino, Valerio Maggio, and Giuseppe Scanniello. Combining machine learning and information retrieval techniques for software clustering. In Eternal Systems, volume 255 of Communications in Computer and Information Science, pages 42–60. Springer Berlin Heidelberg, 2012. ISBN 978-3-642-28033-7.

[46] Anna Corazza, Sergio Di Martino, Valerio Maggio, Andrea Moschitti, Andrea Passerini, Giuseppe Scanniello, and Fabrizio Silvestri. Using machine learning and information retrieval techniques to improve software maintainability. In Eternal Systems, Communications in Computer and Information Science, page in press. Springer Berlin Heidelberg, 2013.

[47] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction To Algorithms. MIT Press, 2001. ISBN 9780262032933.

[48] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, sep 1995. ISSN 0885-6125. doi: 10.1023/A:1022627411411.

[49] A. De Lucia, F. Fasano, R. Oliveto, and G. Tortora. Adams re-trace: A traceability recovery tool. In Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on, pages 32 – 41, march 2005. doi: 10.1109/CSMR.2005.7.

[50] A. De Lucia, R. Francese, G. Scanniello, and G. Tortora. Identifying cloned navigational patterns in web applications. Journal of Web Engineering, 5 (2):150–174, 2006.

[51] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990.

[52] Florian Deissenboeck and Markus Pizka. Concise and consistent naming. Control, 14:261–282, 2006.

[53] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, 39(1): 1–38, 1977. ISSN 0035-9246.

[54] B. Dit, Latifa Guerrouj, D. Poshyvanyk, and G. Antoniol. Can better identifier splitting techniques help feature location? In Program Comprehension (ICPC), 2011 IEEE 19th International Conference on, pages 11–20, 2011. doi: 10.1109/ICPC.2011.47.

[55] D. Doval, S. Mancoridis, and B. S. Mitchell. Automatic clustering of software systems using a genetic algorithm. In Proceedings of the Software Technology and Engineering Practice, pages 73–82, Washington, DC, USA, 1999. IEEE Computer Society.

[56] Ekwa Duala-Ekoko and Martin P. Robillard. Tracking code clones in evolving software. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, pages 158–167, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2828-7. doi: 10.1109/ICSE.2007.90.

[57] S. Ducasse and D. Pollet. Software architecture reconstruction: A process-oriented taxonomy. IEEE Transactions on Software Engineering, 35(4):573–591, july-aug. 2009. ISSN 0098-5589. doi: 10.1109/TSE.2009.19.

[58] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In Proceedings of the International Conference on Software Maintenance, pages 109–118, 1999.

[59] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Pattern Classification and Scene Analysis: Pattern Classification. Wiley, 2001. ISBN 9780471056690.

[60] A. Eastwood. Firm fires shots at legacy systems. Computing Canada, 19(2):17, 1993.

[61] Eric Enslen, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. In Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories (MSR), pages 71–80. IEEE Computer Society, 2009. ISBN 978-1-4244-3493-0.

[62] Len Erlikh. Leveraging dollars for e-business. IT Professional, 2:17–23, 2000.

[63] Letha H. Etzkorn and Carl G. Davis. Automated object-oriented reusable component identification. Knowl.-Based Syst., 9(8):517–524, 1996.

[64] W.S. Evans, C.W. Fraser, and Fei Ma. Clone detection via structural abstraction. In Reverse Engineering, 2007. WCRE 2007. 14th Working Conference on, pages 150–159, 2007. doi: 10.1109/WCRE.2007.15.

[65] Raimar Falke, Pierre Frenzel, and Rainer Koschke. Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering, 13(6):601–643, 2008. ISSN 1382-3256. doi: 10.1007/s10664-008-9073-9.

[66] Henry Feild, David Binkley, and Dawn Lawrie. An empirical comparison of techniques for extracting concept abbreviations from identifiers. In Proceedings of the IASTED International Conference on Software Engineering and Applications (SEA), 2006.

[67] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319–349, jul 1987. ISSN 0164-0925. doi: 10.1145/24039.24041.

[68] P. Flach. Machine Learning: The Art and Science of Algorithms that Make Sense of Data. Cambridge University Press, 2012. ISBN 9781107422223.

[69] Institute of Electrical and Electronics Engineers (IEEE). IEEE standard for software maintenance. IEEE Std 1219-1993, 2, 1999.

[70] P. Frasconi and A. Passerini. Learning with kernels and logical representations. In L. De Raedt, P. Frasconi, K. Kersting, and S. Muggleton, editors, Probabilistic Inductive Logic Programming: Theory and Application, volume LNAI 4911, pages 56–91. Springer, 2008.

[71] Mark Gabel, Lingxiao Jiang, and Zhendong Su. Scalable detection of semantic clones. In Proceedings of the 30th International Conference on Software Engineering, ICSE ’08, pages 321–330, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-079-1. doi: 10.1145/1368088.1368132.

[72] G. Gan and J. Wu. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM series on statistics and applied probability. Society for Industrial and Applied Mathematics (SIAM, 3600 Market Street, Floor 6, Philadelphia, PA 19104), 2007. ISBN 9780898718348.

[73] David Garlan. Software architecture: a roadmap. In Proceedings of the Conference on The Future of Software Engineering, ICSE ’00, pages 91–101, New York, NY, USA, 2000. ACM. ISBN 1-58113-253-0.

[74] T. Gärtner. Kernels for Structured Data. Series in machine perception and artificial intelligence. World Scientific Publishing Company, Incorporated, 2008. ISBN 9789812814562.

[75] Thomas Gärtner. A survey of kernels for structured data. SIGKDD Explor. Newsl., 5(1):49–58, jul 2003. ISSN 1931-0145. doi: 10.1145/959242.959248.

[76] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Computational Learning Theory and Kernel Machines — Proceedings of the 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop (COLT/Kernel 2003), August 24-27, 2003, Washington, DC, USA, volume 2777 of Lecture Notes in Computer Science, pages 129–143. Springer, Berlin–Heidelberg, Germany, August 2003.

[77] Reto Geiger, Beat Fluri, Harald Gall, and Martin Pinzger. Relation of code clones and change couplings. In Luciano Baresi and Reiko Heckel, editors, FASE, volume 3922 of Lecture Notes in Computer Science, pages 411–425. Springer, 2006.

[78] Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res., pages 2211–2268, jul 2011. ISSN 1532-4435.

[79] Penny Grubb and Armstrong A. Takang. Software Maintenance: Concepts and Practice. World Scientific, 2nd edition, 2003. ISBN 981-238-426-X.

[80] Latifa Guerrouj, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. TIDIER: an identifier splitting approach using speech recognition techniques. Journal of Software Maintenance and Evolution: Research and Practice, 2011.

[81] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), pages 223–226. ACM, 2010. ISBN 978-1-60558-719-6.

[82] M. Harman. The role of artificial intelligence in software engineering. In Realizing Artificial Intelligence Synergies in Software Engineering (RAISE), 2012 First International Workshop on, pages 1–6, 2012. doi: 10.1109/RAISE.2012.6227961.

[83] P. Harrington. Machine Learning in Action. Manning Publications Company, 2012. ISBN 9781617290183.

[84] David Haussler. Convolution kernels on discrete structures. Technical report, University of California, Santa Cruz, 1999.

[85] Emily Hill, Zachary P. Fry, Haley Boyd, Giriprasad Sridhara, Yana Novikova, Lori Pollock, and K. Vijay-Shanker. AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools. In Proceedings of the 2008 International Working Conference on Mining Software Repositories (MSR), pages 79–88. ACM, 2008. ISBN 978-1-60558-024-1.

[86] Emily Hill, Lori Pollock, and K. Vijay-Shanker. Automatically capturing source code context of NL-queries for software maintenance and reuse. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pages 232–242. IEEE Computer Society, 2009. ISBN 978-1-4244-3453-4.

[87] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.

[88] S. Huff. Firm fires shots at legacy systems. The Business Quarterly, (55):30–32, 1990.

[89] Paul Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

[90] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, sep 1999. ISSN 0360-0300. doi: 10.1145/331499.331504.

[91] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall advanced reference series. Prentice Hall, 1988. ISBN 9780130222787.

[92] Stanislaw Jarzabek. Effective Software Maintenance and Evolution - A Reuse-Based Approach. Auerbach Publ., 2007. ISBN 978-0-8493-3592-1.

[93] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, pages 96–105, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 0-7695-2828-7. doi: 10.1109/ICSE.2007.30.

[94] J. Howard Johnson. Identifying redundancy in source code using fingerprints. In Proc. Conf. Centre for Advanced Studies on Collaborative Research (CASCON), pages 171–183. IBM Press, 1993.

[95] Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Software Eng., 28(7):654–670, 2002.

[96] Cory Kapser and Michael W. Godfrey. “Cloning Considered Harmful” Considered Harmful. In WCRE ’06: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE 2006), pages 19–28, Washington, DC, USA, October 2006. IEEE Computer Society.

[97] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Interscience, 1990.

[98] Rick Kazman, Steven G. Woods, and S. Jeromy Carrière. Requirements for integrating software architecture and reengineering models: CORUM II. In Proceedings of the Working Conference on Reverse Engineering (WCRE ’98), pages 154–, Washington, DC, USA, 1998. IEEE Computer Society. ISBN 0-8186-8967-6.

[99] Heejung Kim, Yungbum Jung, Sunghun Kim, and Kwankeun Yi. MeCC: memory comparison-based clone detector. Software Engineering, International Conference on, pages 301–310, 2011. doi: 10.1145/1985793.1985835.

[100] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:604–632, September 1999. ISSN 0004-5411.

[101] R. Komondoor and S. Horwitz. Using slicing to identify duplication in source code. In Proceedings of the International Symposium on Static Analysis, pages 40–56, July 2001.

[102] Bogdan Korel and Janusz W. Laski. Dynamic slicing of computer programs. Journal of Systems and Software, 13(3):187–195, 1990.

[103] Rainer Koschke. Atomic Architectural Component Recovery for Program Understanding and Evolution. PhD thesis, University of Stuttgart, 2000.

[104] Rainer Koschke. Survey of research on software clones. In Rainer Koschke, Ettore Merlo, and Andrew Walenstein, editors, Duplication, Redundancy, and Similarity in Software, volume 06301 of Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany, 2006.

[105] Rainer Koschke, Raimar Falke, and Pierre Frenzel. Clone detection using abstract syntax suffix trees. In WCRE ’06: Proceedings of the 13th Working Conference on Reverse Engineering, pages 253–262, Washington, DC, USA, 2006. IEEE Computer Society. ISBN 0-7695-2719-1. doi: 10.1109/WCRE.2006.18.

[106] Jens Krinke. Identifying Similar Code with Program Dependence Graphs. In Proc. Working Conf. Reverse Engineering (WCRE), pages 301–309. IEEE Computer Society Press, 2001.

[107] Jens Krinke, Nicolas Gold, Yue Jia, and David Binkley. Distinguishing copies from originals in software clones. In IWSC, pages 41–48, 2010. doi: 10.1145/1808901.1808907.

[108] Adrian Kuhn, Stéphane Ducasse, and Tudor Gîrba. Semantic clustering: Identifying topics in source code. Information & Software Technology, 49(3): 230–243, 2007.

[109] Arun Lakhotia and John M. Gravley. Toward experimental evaluation of subsystem classification recovery techniques. In Working Conference on Reverse Engineering, pages 262–269, 1995.

[110] G. N. Lance and W. T. Williams. A general theory of classificatory sorting strategies 1. hierarchical systems. The Computer Journal, 9(4):373–380, 1967.

[111] Niels Landwehr, Andrea Passerini, Luc Raedt, and Paolo Frasconi. Fast learning of relational kernels. Mach. Learn., 78(3):305–342, March 2010. ISSN 0885-6125. doi: 10.1007/s10994-009-5163-1.

[112] Pat Langley and Herbert A. Simon. Applications of machine learning and rule induction. Commun. ACM, 38(11):54–64, nov 1995. ISSN 0001-0782. doi: 10.1145/219717.219768.

[113] D. Lawrie and D. Binkley. Expanding identifiers to normalize source code vocabulary. In 27th IEEE International Conference on Software Maintenance (ICSM), 2011, pages 113 –122, sept 2011. doi: 10.1109/ICSM.2011.6080778.

[114] D. Lawrie and D. Binkley. Loyola university of delaware identifier splitting oracle, 2012. URL http://www.cs.loyola.edu/~binkley/ludiso/.

[115] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In 17th Working Conference on Reverse Engineering (WCRE), 2010, pages 3 –12, oct. 2010. doi: 10.1109/WCRE.2010.10.

[116] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. What’s in a name? a study of identifiers. In Proceedings of the 14th Conference on Program Comprehension (ICPC), pages 3–12. IEEE Computer Society, 2006.

[117] Dawn Lawrie, Henry Feild, and David Binkley. Extracting meaning from abbreviated identifiers. In Proceedings of the Seventh IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 213–222. IEEE Computer Society, 2007. ISBN 0-7695-2880-5.

[118] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. Effective identifier names for comprehension and memory. Innovations in Systems and Software Engineering, 3:303–318, 2007.

[119] M. M. Lehman. On understanding laws, evolution, and conservation in the large-program life cycle. J. Syst. Softw., 1:213–221, September 1984. ISSN 0164-1212. doi: 10.1016/0164-1212(79)90022-0. URL http://dx.doi.org/10.1016/0164-1212(79)90022-0.

[120] Meir M. Lehman. Programs, life cycles, and laws of software evolution. Proc. IEEE, 68(9):1060–1076, September 1980.

[121] António Menezes Leitão. Detection of redundant code using R2D2. Software Quality Journal, 12(4):361–382, 2004.

[122] Bennett P. Lientz and E. Burton Swanson. Software Maintenance Management. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1980. ISBN 0201042053.

[123] A. Liu and J.J.-P. Tsai. A knowledge-based approach to requirements analysis. In Tools with Artificial Intelligence, 1995. Proceedings, Seventh International Conference on, pages 26–33, 1995. doi: 10.1109/TAI.1995.479375.

[124] Chao Liu, Chen Chen, Jiawei Han, and Philip S. Yu. GPLAG: Detection of software plagiarism by program dependence graph analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’06), pages 872–881. ACM Press, 2006.

[125] Michael R. Lowry. Software engineering in the twenty-first century. AI Mag., 13(3):71–87, Sept. 1992. ISSN 0738-4602.

[126] N. Madani, L. Guerrouj, M. Di Penta, Y. Gueheneuc, and G. Antoniol. Recognizing words from source code identifiers using speech recognition techniques. In 2010 14th European Conference on Software Maintenance and Reengineering (CSMR), pages 68–77, march 2010.

[127] Kiarash Mahdavi, Mark Harman, and Robert Mark Hierons. A multiple hill climbing approach to software module clustering. In ICSM ’03: Proceedings of the International Conference on Software Maintenance, page 315, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1905-9.

[128] J. I. Maletic and A. Marcus. Supporting program comprehension using semantic and structural information. In Proceedings of the 23rd International Conference on Software Engineering, ICSE ’01, pages 103–112, Washington, DC, USA, 2001. IEEE Computer Society.

[129] Zoltán Ádám Mann. Three public enemies: Cut, copy, and paste. IEEE Computer, 39(7):31–35, 2006.

[130] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

[131] O. Maqbool. Architecture Recovery of Legacy Software Systems Using Unsupervised Machine Learning Techniques. PhD thesis, LUMS, Lahore University of Management Sciences, 2006.

[132] O. Maqbool and H. Babri. Hierarchical clustering for software architecture recovery. IEEE Transactions on Software Engineering, 33(11):759–780, 2007.

[133] Andrian Marcus and Jonathan I. Maletic. Identification of high-level concept clones in source code. In ASE, pages 107–114, 2001.

[134] S. Marsland. Machine Learning: An Algorithmic Perspective. Taylor & Francis, 2011. ISBN 9781420067194.

[135] A. Von Mayrhauser. Program comprehension during software maintenance and evolution. IEEE Computer, 28:44–55, 1995. doi: 10.1109/2.402076.

[136] Geoffrey J. McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. Wiley-Interscience (Wiley Series in Probability and Statistics), 2 edition, March 2008. ISBN 0471201707.

[137] J. Mclachlan and T. Krishnan. The EM algorithm and Extensions. Wiley inter-science, 1996.

[138] Sauro Menchetti, Fabrizio Costa, and Paolo Frasconi. Weighted decomposition kernels. In Proceedings of the 22nd International Conference on Machine Learning, ICML ’05, pages 585–592, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5. doi: 10.1145/1102351.1102425.

[139] Tom Mens and Serge Demeyer. Software Evolution. Springer Publishing Company, Incorporated, 1 edition, 2008. ISBN 3540764399, 9783540764397.

[140] Joan C. Miller and Clifford J. Maloney. Systematic mistake analysis of digital computer programs. Commun. ACM, 6(2):58–63, February 1963. ISSN 0001-0782. doi: 10.1145/366246.366248. URL http://doi.acm.org/10.1145/366246.366248.

[141] B. S. Mitchell and S. Mancoridis. On the automatic modularization of software systems using the bunch tool. IEEE Transactions on Software Engineering, 32:193–208, March 2006. ISSN 0098-5589.

[142] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1 edition, 1997. ISBN 0070428077, 9780070428072.

[143] J. Moad. Maintaining the competitive edge. Datamation, 4(36):61–62, 1990.

[144] Alessandro Moschitti. Making tree kernels practical for natural language learning. In 11st Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, April 3-7, 2006, Trento, Italy, 2006.

[145] Alessandro Moschitti. Efficient convolution kernels for dependency and constituent syntactic trees. In ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, pages 318–329, 2006.

[146] Alessandro Moschitti, Roberto Basili, and Daniele Pighin. Tree Kernels for Semantic Role Labeling. In Computational Linguistics, pages 193–224, Cambridge, MA, USA, 2008. MIT Press. doi: 10.1162/coli.2008.34.2.193.

[147] K. Muller, S. Mika, G. Ratsch, K. Tsuda, and Bernhard Scholkopf. An introduction to kernel-based learning algorithms. Neural Networks, IEEE Transactions on, 12(2):181–201, 2001. ISSN 1045-9227. doi: 10.1109/72.914517.

[148] F. Murtagh. A survey of recent advances in hierarchical clustering algorithms. The Computer Journal, 26(4):354–359, 1983.

[149] Hoan Anh Nguyen, Tung Thanh Nguyen, N.H. Pham, J. Al-Kofahi, and T.N. Nguyen. Clone management for evolving software. Software Engineering, IEEE Transactions on, 38(5):1008–1026, 2012. ISSN 0098-5589. doi: 10.1109/TSE.2011.90.

[150] Paul W. Oman. Milestones in Software Evolution. IEEE Computer Society, Los Alamitos, CA, USA, 1990. ISBN 081869033X.

[151] O. Port. The software trap – automate or else. Business Week, 9(3051):142–154, 1998.

[152] M. F. Porter. An algorithm for suffix stripping, pages 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997. ISBN 1-55860- 454-5.

[153] D. Poshyvanyk and A. Marcus. Combining formal concept analysis with information retrieval for concept location in source code. In 15th IEEE International Conference on Program Comprehension (ICPC), pages 37–48, 2007.

[154] Roger S. Pressman. Software Engineering: A Practitioner’s Approach. McGraw-Hill Higher Education, 5th edition, 2001. ISBN 0072496681.

[155] Lih-ren Jen and Yuh-jye Lee. Working group: IEEE recommended practice for architectural description of software-intensive systems. IEEE Architecture, pages 1471–2000, 2000.

[156] Michele Risi, Giuseppe Scanniello, and Genoveffa Tortora. Using fold-in and fold-out in the architecture recovery of software systems. Formal Aspects of Computing, 24(3):307–330, 2012.

[157] Chanchal Kumar Roy and James R. Cordy. An empirical study of function clones in open source software. In WCRE, pages 81–90. IEEE, 2008.

[158] Chanchal Kumar Roy and James R. Cordy. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In ICPC, pages 172–181, 2008.

[159] Chanchal Kumar Roy, James R. Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Sci. Comput. Program., 74(7):470–495, 2009.

[160] W. W. Royce. Managing the development of large software systems: concepts and techniques. In Proceedings of the 9th International Conference on Software Engineering, ICSE ’87, pages 328–338, Los Alamitos, CA, USA, 1987. IEEE Computer Society. ISBN 0-89791-216-0.

[161] Kamran Sartipi and Kostas Kontogiannis. A user-assisted approach to component clustering. Journal of Software Maintenance, 15(4):265–295, 2003. ISSN 1040-550X. doi: 10.1002/smr.277.

[162] Trevor Savage, Meghan Revelle, and Denys Poshyvanyk. FLAT3: feature location and textual tracing tool. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, ICSE ’10, pages 255–258, New York, NY, USA, 2010. ACM. ISBN 978-1-60558-719-6. doi: 10.1145/1810295.1810345.

[163] G. Scanniello, A. D’Amico, C. D’Amico, and T. D’Amico. Using the Kleinberg algorithm and vector space model for software system clustering. In Proceedings of the IEEE 18th International Conference on Program Comprehension, ICPC ’10, pages 180–189, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4113-6. doi: 10.1109/ICPC.2010.17.

[164] G. Scanniello, A. D’Amico, C. D’Amico, and T. D’Amico. Architectural layer recovery for software system understanding and evolution. Software Practice and Experience, 40:897–916, September 2010. ISSN 0038-0644. doi: http://dx.doi.org/10.1002/spe.v40:10.

[165] Giuseppe Scanniello, Michele Risi, and Genoveffa Tortora. Architecture recovery using latent semantic indexing and k-means: an empirical evaluation. In SEFM ’10: Proceedings of the 2010 IEEE International Conference on Software Engineering and Formal Methods, pages 103–112. IEEE Computer Society, 2010.

[166] B. Scholkopf and A.J. Smola. Learning With Kernels: Support Vector Machines, Regularization, Optimization and Beyond. Adaptive Computation and Machine Learning series. The MIT Press, 2002. ISBN 9780262194754.

[167] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. ISBN 9780521813976.

[168] H.M. Sneed and E. Nyary. Extracting object-oriented specification from procedurally oriented programs. In Proceedings of 2nd Working Conference on Reverse Engineering, pages 217–226, 1995. doi: 10.1109/WCRE.1995.514710.

[169] I. Sommerville. Software Engineering. International Computer Science Series. Addison-Wesley, 2007. ISBN 9780321313799. URL http://books.google.it/books?id=B7idKfL0H64C.

[170] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for Java methods. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 43–52. ACM, 2010. ISBN 978-1-4503-0116-9.

[171] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. Bradford Book, 1998. ISBN 9780262193986.

[172] Robert Tairas, Jeff Gray, and Ira D. Baxter. Visualizing clone detection results. In ASE, pages 549–550, 2007.

[173] Suresh Thummalapenta, Luigi Cerulo, Lerina Aversano, and M. Di Penta. An empirical study on the maintenance of source code clones. Empirical Software Engineering, 15(1):1–34, 2010.

[174] Rebecca Tiarks, Rainer Koschke, and Raimar Falke. An assessment of type-3 clones as detected by state-of-the-art tools. In SCAM ’09: Proceedings of the 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, pages 67–76, Washington, DC, USA, 2009. IEEE Computer Society. ISBN 978-0-7695-3793-1. doi: 10.1109/SCAM.2009.16.

[175] Rebecca Tiarks, Rainer Koschke, and Raimar Falke. An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Quality Journal, 19(2):295–331, 2011. doi: 10.1007/s11219-010-9115-6.

[176] Paolo Tonella. Concept analysis for module restructuring. IEEE Trans. Softw. Eng., 27(4):351–363, 2001. ISSN 0098-5589. doi: 10.1109/32.917524.

[177] V. Tzerpos and R. C. Holt. On the stability of software clustering algorithms. In Proceedings of the 8th International Workshop on Program Comprehension, pages 211–218, 2000.

[178] Vassilios Tzerpos and Richard C. Holt. MoJo: A distance metric for software clusterings. In Proceedings of the Working Conference on Reverse Engineering, pages 187–193, 1999.

[179] Arie van Deursen, Christine Hofmeister, Rainer Koschke, Leon Moonen, and Claudio Riva. Symphony: View-driven software architecture reconstruction. In WICSA, pages 122–134, 2004.

[180] Jean-Philippe Vert. A Tree Kernel to analyse phylogenetic profiles. Bioinformatics, 18(suppl_1):S276–284, 2002. doi: 10.1093/bioinformatics/18.suppl_1.S276.

[181] S. V. N. Vishwanathan and Alex Smola. Fast kernels for string and tree matching. In Proceedings of Neural Information Processing Systems, 2002.

[182] Vera Wahler, Dietmar Seipel, Jurgen Wolff v. Gudenberg, and Gregor Fischer. Clone detection in source code by frequent itemset techniques. In SCAM ’04: Proceedings of the Source Code Analysis and Manipulation, Fourth IEEE International Workshop, pages 128–135, Washington, DC, USA, 2004. IEEE Computer Society.

[183] Mark Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4):352–357, 1984.

[184] Zhihua Wen and Vassilios Tzerpos. An effectiveness measure for software clustering algorithms. In IWPC, pages 194–203. IEEE Computer Society, 2004. ISBN 0-7695-2149-5.

[185] B.A. Wichmann, A. A. Canning, D.L. Clutterbuck, L. A. Winsborrow, N. J. Ward, and D. W R Marsh. Industrial perspective on static analysis. Software Engineering Journal, 10(2):69–75, 1995. ISSN 0268-6961.

[186] T. A. Wiggerts. Using clustering algorithms in legacy systems remodularization. In Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE ’97), pages 33–43, Washington, DC, USA, 1997. IEEE Computer Society. ISBN 0-8186-8162-4.

[187] A. E. Wu, J. Hassan and R. C. Holt. Comparison of clustering algorithms in the context of software evolution. In Proceedings of the 21st IEEE International Conference on Software Maintenance, pages 525–535. IEEE Computer Society, 2005.

[188] R. Xu and D. Wunsch. Clustering. IEEE Press Series on Computational Intelligence. Wiley, 2008. ISBN 9780470382783.

[189] Wuu Yang. Identifying syntactic differences between two programs. Software - Practice and Experience, 21(7):739–755, July 1991.

[190] S.S. Yau, J.S. Collofello, and T. MacGregor. Ripple effect analysis of software maintenance. In Computer Software and Applications Conference, 1978. COMPSAC ’78. The IEEE Computer Society’s Second International, pages 60–65. IEEE Computer Society, 1978.

[191] Yang Yuan and Yao Guo. CMCD: Count matrix based code clone detection. In Software Engineering Conference (APSEC), 2011 18th Asia Pacific, pages 250–257, 2011. doi: 10.1109/APSEC.2011.13.

[192] D. Zhang and J.J.P. Tsai. Advances in Machine Learning Applications in Software Engineering. Idea Group Pub., 2007. ISBN 9781591409410.

[193] Du Zhang and J.J.-P. Tsai. Machine learning and software engineering. In Tools with Artificial Intelligence, 2002. (ICTAI 2002). Proceedings. 14th IEEE International Conference on, pages 22–29, 2002. doi: 10.1109/TAI.2002.1180784.
