UC Irvine UC Irvine Electronic Theses and Dissertations
Title: Large-Scale Code Clone Detection
Permalink: https://escholarship.org/uc/item/45r2308g
Author: Sajnani, Hitesh
Publication Date: 2016
Peer reviewed | Thesis/dissertation

UNIVERSITY OF CALIFORNIA, IRVINE

Large-Scale Code Clone Detection

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY in Information and Computer Science

by Hitesh Sajnani

Dissertation Committee:
Professor Cristina Lopes, Chair
Professor André van der Hoek
Professor James A. Jones

2016

Portions of Chapter 3 © 2016 IEEE
Portions of Chapter 4 © 2016 IEEE
Portions of Chapter 5 © 2015 Wiley & Sons, Inc.
Portions of Chapter 6 © 2016 IEEE
Portions of Chapter 7 © 2014 IEEE
All other materials © 2016 Hitesh Sajnani

DEDICATION

To my parents, sisters, beloved wife, and Pramukh Swami Maharaj.

Contents

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CURRICULUM VITAE
ABSTRACT OF THE DISSERTATION

1 Introduction
  1.1 Motivation
  1.2 Terminology
    1.2.1 Code Clone Terms
    1.2.2 Code Clone Types
  1.3 Problem Statement
  1.4 Research Questions
  1.5 Thesis
  1.6 Contributions

2 Clone Detection: Background and Related Work
  2.1 Why Do Code Clones Exist?
  2.2 Issues Due to Code Cloning
  2.3 Applications of Clone Detection
  2.4 Clone Detection Techniques and Tools
  2.5 Measures to Evaluate Clone Detection Techniques
  2.6 Impact of Code Cloning on Software Systems
  2.7 Chapter Summary

3 SourcererCC: Accurate and Scalable Code Clone Detection
  3.1 Problem Formulation
  3.2 Overview of the Approach
  3.3 Filtering Heuristics to Reduce Candidate Comparisons
    3.3.1 Sub-block Overlap Filtering
    3.3.2 Token Position Filtering
  3.4 Clone Detection Algorithm
    3.4.1 Partial Index Creation
    3.4.2 Clone Detection
    3.4.3 Candidate Verification
    3.4.4 Revisiting the Research Questions
  3.5 Implementation
    3.5.1 Parser
    3.5.2 Indexer
    3.5.3 Searcher
  3.6 Chapter Summary

4 Evaluation of SourcererCC
  4.1 Execution Time and Scalability
  4.2 Experiment with Big IJaDataset
  4.3 Recall
    4.3.1 Recall Measured by the Mutation Framework
    4.3.2 Recall Measured by BigCloneBench
  4.4 Precision
  4.5 Summary of Recall and Precision Experiments
  4.6 Sensitivity Analysis of the Similarity Threshold Parameter
  4.7 Manual Inspection of Clones Detected by SourcererCC
  4.8 Threats to Validity
  4.9 Chapter Summary

5 SourcererCC-D: Parallel and Distributed SourcererCC
  5.1 Introduction
  5.2 Architecture
  5.3 Evaluation
    5.3.1 Evaluation Metrics
    5.3.2 Experiments to Measure the Speed-up
    5.3.3 Experiments to Measure the Scale-up
    5.3.4 Detecting Project Clones in the MUSE Repository
  5.4 Chapter Summary

6 SourcererCC-I: Interactive SourcererCC for Developers
  6.1 Introduction
  6.2 A Preliminary Survey
  6.3 SourcererCC-I's Architecture
  6.4 SourcererCC-I's Features
  6.5 Related Tools
  6.6 Tool Artifacts
  6.7 Chapter Summary

7 Empirical Applications of SourcererCC
  7.1 Introduction
  7.2 Study 1. A Comparative Study of Bug Patterns in Java Cloned and Non-cloned Code
    7.2.1 Research Questions
    7.2.2 Study Design
    7.2.3 Study Results
    7.2.4 Conclusion (Study 1)
  7.3 Study 2. A Comparative Study of Software Quality Metrics in Java Cloned and Non-cloned Code
    7.3.1 Research Questions
    7.3.2 Dataset
    7.3.3 Clone Detection
    7.3.4 Software Quality Metrics
    7.3.5 Summary of the Results
    7.3.6 Conclusion (Study 2)
  7.4 Threats to Validity
  7.5 Reproducibility
  7.6 Chapter Summary

8 Conclusions and Discussion
  8.1 Dissertation Summary
  8.2 The Surprising Effectiveness of the Bag-of-tokens model and Overlap Similarity Measure in Clone Detection
  8.3 Lessons Learned During SourcererCC's Development
  8.4 Going Forward

Bibliography

Appendices
  A Subject Systems
  B Running SourcererCC-D Using Amazon Web Services (AWS)
  C Experience Report on Using AWS
  D Cost of Running the Experiments Using AWS

List of Figures

1.1 Type 1 Example Clone-pair
1.2 Type 2 Example Clone-pair
1.3 Type 3 Example Clone-pair
1.4 Type 4 Example Clone-pair
3.1 Code blocks represented as a set of (token, frequency) pairs
3.2 Methods from Apache Cocoon Project
3.3 Growth in number of candidate comparisons with the increase in the number of code blocks
3.4 SourcererCC's clone detection process
3.5 Sample code fragment as input to the parser
3.6 Output produced by the parser
3.7 Delimiters used in the output file produced by the parser
3.8 (Token, Frequency) pair representation in the output format
4.1 Summary of Results. F-Measure is computed using Recall (BigCloneBench) and Precision
4.2 Change in number of clones reported (top-center), number of candidates compared (bottom-left), and number of tokens compared (bottom-right) with the change in similarity threshold.
4.3 Sample code clones observed in the subject systems. 1A & 1B: Cross-cutting Concerns; 2: Code Generation; 3: API/Library Protocols; 4A, 4B & 5A, 5B: Replicate and Specialize; 6A & 6B: Near-Exact Copy
5.1 Shared-disk Architectural Style
5.2 Shared-memory Architectural Style
5.3 SourcererCC-D's Clone Detection Process
5.4 Speed-up
5.5 Scale-up
5.6 Speed-up
5.7 The number of clone-pairs detected increases exponentially.