Leveraging Social Coding Websites DISSERTATION Submitted in Partial
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA, IRVINE Beyond Similar Code: Leveraging Social Coding Websites DISSERTATION submitted in partial satisfaction of the requirements for the degree of DOCTOR OF PHILOSOPHY in Software Engineering by Di Yang Dissertation Committee: Professor Cristina V. Lopes, Chair Professor James A. Jones Professor Sam Malek 2019 c 2019 Di Yang DEDICATION To my loving husband, Zikang. ii TABLE OF CONTENTS Page LIST OF FIGURES vi LIST OF TABLES viii ACKNOWLEDGMENTS ix CURRICULUM VITAE x ABSTRACT OF THE DISSERTATION xiv 1 Introduction 1 1.1 Overview . .1 1.1.1 Background . .1 1.1.2 Research Questions . .2 1.2 Thesis Statement . .3 1.3 Methodology and Key Findings . .3 2 Usability Analysis of Stack Overflow Code Snippets 6 2.1 Introduction . .6 2.2 Related Work . .9 2.3 Usability Rates . 11 2.3.1 Snippets Processing . 12 2.3.2 Findings . 15 2.3.3 Error Messages . 18 2.3.4 Heuristic Repairs for Java and C# Snippets . 19 2.4 Qualitative Analysis . 24 2.5 Google Search Results . 27 2.6 Conclusion . 30 3 File-level Duplication Analysis of GitHub 32 3.1 Introduction . 33 3.2 Related Work . 37 3.3 Analysis Pipeline . 40 3.3.1 Tokenization . 40 3.3.2 Database . 41 iii 3.3.3 SourcererCC . 42 3.4 Corpus . 43 3.5 Quantitative Analysis . 45 3.5.1 File-Level Analysis . 45 3.5.2 File-Level Analysis Excluding Small Files . 47 3.6 Mixed Method Analysis . 48 3.6.1 Most Duplicated Non-Trivial Files . 49 3.6.2 File Duplication at Different Levels . 52 3.7 Conclusions . 58 4 Method-level Duplication Analysis between Stack Overflow and GitHub 60 4.1 Introduction . 60 4.2 Related Work . 63 4.3 Methodology . 66 4.3.1 Method Extraction . 66 4.3.2 Tokenization . 67 4.3.3 Three Levels of Similarity . 68 4.4 Dataset . 70 4.5 Quantitative Analysis . 72 4.5.1 Block-hash Similarity . 74 4.5.2 Token-hash Similarity . 75 4.5.3 SCC Similarity . 76 4.6 Qualitative Analysis . 76 4.6.1 Step 1: Duplicated Blocks . 77 4.6.2 Step 2: Large Blocks Present in GitHub and SO . 86 4.7 Conclusion . 88 5 Analysis of Adaptations from Stack Overflow to GitHub 90 5.1 Introduction . 91 5.2 Related Work . 93 5.3 Dataset . 95 5.4 Adaptation Type Analysis . 97 5.4.1 Manual Inspection . 97 5.4.2 Automated Adaptation Categorization . 101 5.4.3 Accuracy of Adaptation Categorization . 102 5.5 Empirical Study . 103 5.5.1 How many edits are potentially required to adapt a SO example? . 103 5.5.2 What are common adaptation and variation types? . 104 5.6 Conclusion . 106 6 Statement-level Recommendation for Related Code 107 6.1 Introduction . 108 6.2 The Opportunity for Aroma .......................... 113 6.3 Related Work . 116 6.4 Algorithm . 118 iv 6.4.1 Definitions . 120 6.4.2 Featurization . 124 6.4.3 Recommendation Algorithm . 127 6.5 Evaluation of Aroma's Code Recommendation Capabilities . 137 6.5.1 Datasets . 137 6.5.2 Recommendation Performance on Partial Code Snippets . 138 6.5.3 Recommendation Quality on Full Code Snippets . 139 6.5.4 Comparison with Pattern-Oriented Code Completion . 142 6.6 Evaluation of Search Recall . 142 6.6.1 Comparison with Clone Detectors and Conventional Search Techniques 144 6.7 Conclusion . 146 7 Method-level Recommendation for Related Code 148 7.1 Introduction . 148 7.2 Related Work . 152 7.3 Data Collection Approach . 153 7.3.1 Retrieve similar methods . 154 7.3.2 Identify Co-occurring Code Fragments . 156 7.3.3 Clustering and Ranking . 157 7.4 Dataset . 158 7.5 Manual Analysis and Categorization . 161 7.6 Chrome Extension for Stack Overflow . 169 7.7 Comparison with code search engines . 171 7.8 Threats to Validity . 172 7.9 Conclusion . 173 8 Conclusion 174 Bibliography 177 v LIST OF FIGURES Page 2.1 Example of Stack Overflow question and answers. The search query was \java parse words in string". .7 2.2 Sequence of operations . 13 2.3 Parsable and compilable/runnable rates histogram . 16 2.4 Examples of C# Snippets. 17 2.5 Examples of Java Snippets. 17 2.6 Examples of JavaScript Snippets. 18 2.7 Examples of Python Snippets. 18 2.8 Most common error messages for C# . 19 2.9 Most common error messages for Java . 20 2.10 Most common error messages for JavaScript . 20 2.11 Most common error messages for Python . 20 2.12 Sequence of operations while applying repairs . 23 2.13 Example of an incomplete answer in Stack Overflow . 25 2.14 Example of Google Search Result . 28 3.1 Map of code duplication. The y-axis is the number of commits per project, the x-axis is the number of files in a project. The value of each tile is the percentage of duplicated files for all projects in the tile. Darker means more clones. 34 3.2 Analysis pipeline. 40 3.3 Files per project. 44 3.4 SLOC per file. 45 3.5 File-level duplication for entire dataset and excluding small files. 47 3.6 Distribution of file-hash clones. 54 3.7 Distribution of token-hash clones. 55 3.8 ∆ of file sizes in token hash groups. 56 4.1 Pipeline for file analysis. 65 4.2 Python GitHub projects. LOC means Lines Of source Code, and is calculated after removing empty lines. ..