Code Similarity and Clone Search in Large-Scale Source Code Data
Chaiyong Ragkhitwetsagul

A dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy of University College London.

Department of Computer Science
University College London
September 30, 2018

Declarations

I, Chaiyong Ragkhitwetsagul, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

The papers presented here are original work undertaken between September 2014 and July 2018 at University College London. They have been published or submitted for publication as listed below, along with a summary of my contributions to each paper.

1. C. Ragkhitwetsagul, J. Krinke, D. Clark, Similarity of source code in the presence of pervasive modifications, in Proceedings of the 16th International Working Conference on Source Code Analysis and Manipulation (SCAM '16), 2016.
Contribution: Dr. Krinke designed and started the initial experiment, which I later continued. I incorporated more tools, handled the tool executions, collected the results, and analysed and discussed the results with Dr. Krinke. The paper was written by the three authors.

2. C. Ragkhitwetsagul, J. Krinke, D. Clark, A comparison of code similarity analysers, Empirical Software Engineering (EMSE), 2017.
Contribution: This publication is an invited journal extension of the SCAM '16 paper for a special issue of Empirical Software Engineering. I substantially expanded the experiment by doubling the data set size, rerunning the first and second experimental scenarios, and introducing three additional experimental scenarios under the supervision of Dr. Krinke. The paper was written by the three authors.

3. C. Ragkhitwetsagul and J. Krinke, Using compilation/decompilation to enhance clone detection, in Proceedings of the 11th International Workshop on Software Clones (IWSC '17), 2017.
Contribution: I designed and performed the experiment and analysed the results with suggestions from Dr. Krinke. The paper was written by the two authors.

4. C. Ragkhitwetsagul and J. Krinke, Awareness and experience of developers to outdated and license-violating code on Stack Overflow: An online survey, UCL Research Note, 2017.
Contribution: I conducted the two surveys and analysed the results under the supervision of Dr. Krinke. The research note was written by me and revised by Dr. Krinke.

5. C. Ragkhitwetsagul, M. Paixao, J. Krinke, G. Bianco, R. Oliveto, Toxic code snippets on Stack Overflow, submitted to IEEE Transactions on Software Engineering (TSE), 2018 (after the first round of Revise and Resubmit as New).
Contribution: I led the project and performed all the experiments and surveys under the supervision of Dr. Krinke. G. Bianco, an exchange student from the University of Molise under the supervision of Dr. Oliveto, helped with the data collection from Stack Overflow. My PhD colleague, M. Paixao, was involved in the manual clone classification. The paper was written by me and revised by Dr. Krinke and Dr. Oliveto.

6. C. Ragkhitwetsagul, J. Krinke, Siamese: Scalable and incremental code clone search via multiple code representations, submitted to Empirical Software Engineering (EMSE), 2018.
Contribution: I performed the whole study under the supervision of Dr. Krinke. The paper was written by me and revised by Dr. Krinke.

7. R. White, J. Krinke, E. Barr, C. Ragkhitwetsagul, F. Sarro, M. Al-arfaj, Exploiting test-to-code traceability links for reuse, submitted to the 33rd International Conference on Automated Software Engineering (ASE '18).
Contribution: I provided and configured the Siamese clone search engine as the code similarity tool for the RELATEST test recommendation tool. I was also involved in the manual validation of the recommended tests. The paper was written by R. White, Dr. Krinke, me, and Dr. Barr.
In addition, the list below includes the papers that I authored or co-authored and published or submitted for peer review during the programme of my study. They are not featured in this thesis.

1. C. Ragkhitwetsagul, M. Paixao, M. Adham, S. Busari, J. Krinke, J. H. Drake, Searching for configurations in clone evaluation – A replication study, in Proceedings of the 8th Symposium on Search-Based Software Engineering (SSBSE '16), 2016.

2. M. Paixao, J. Krinke, D. Han, C. Ragkhitwetsagul, and M. Harman, Are developers aware of the architectural impact of their changes?, in Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE '17), 2017.

3. M. Alonaizan, D. Han, J. Krinke, C. Ragkhitwetsagul, D. Schwartz-Narbonne, and B. Zhu, Recommending related code reviews, submitted to IEEE Transactions on Software Engineering (TSE), 2018 (Major Revisions).

4. C. Ragkhitwetsagul, J. Krinke, B. Marnette, A picture is worth a thousand words: Code clone detection based on image similarity, in Proceedings of the 12th International Workshop on Software Clones (IWSC '18), 2018.

5. J. Wilkie, Z. Al Halabi, A. Karaoglu, J. Liao, G. Ndungu, C. Ragkhitwetsagul, M. Paixao, and J. Krinke, Who's this? Developer identification using IDE event data, in Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18), 2018.

Chaiyong Ragkhitwetsagul
Date

Abstract

Software development has benefited tremendously from the Internet: online code corpora enable instant sharing of source code, developer guides, and documentation. Nowadays, duplicated code (i.e., code clones) exists not only within or across software projects but also between online code repositories and websites. We call such clones "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and reuse of outdated code.
Unfortunately, they are difficult to locate and fix because the search space in online code corpora is large and no longer confined to a local repository.

This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them have become outdated or violate their original licenses and are potentially harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework for evaluating code similarity and clone search tools, called OCD, is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation.

Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique using multiple code representations. Our evaluation shows that Siamese scales to large source code data sets of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to that of seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.

Impact Statement

The knowledge and methods of software analysis presented in this thesis are beneficial to software quality improvement and could have impact on both software engineering research and industry. The phenomenon of code duplication, i.e., code cloning, is well known in both research and industry, although sometimes different terms are used to describe it.
There are two camps of belief: one holds that clones lead to bug propagation and degrade software maintenance; the other holds that clones rarely relate to bugs and do not affect software quality. However, both agree that clones should be made explicit.

This thesis strengthens the body of knowledge in code clone research by studying a new type of code clone, arising from cloning activity to and from online sources such as the Stack Overflow programming Q&A website. The thesis shows that the results of such cloning, which we call online code clones, suffer from at least two issues: outdated code and software license incompatibility. Since Stack Overflow is a popular website with 7.6 million users, the issues arising from online code cloning can affect a large number of programmers around the world. The findings point to an urgent need to mitigate these issues, both by researchers and by Stack Overflow itself.

On the research side, the thesis develops a scalable code clone search technique and a tool called Siamese to provide a scalable solution for locating online code clones in large source code corpora. The thesis demonstrates that Siamese can be used to efficiently locate clones between Stack Overflow and a hundred open source projects. The technique can incorporate automated license analysis to detect violations of software licenses and has the potential to be turned into a cloud service that offers real-time checks for online code clones in any software project.

On the industry side, the thesis calls for action from Stack Overflow to mitigate the issues of outdated code and software license violations. The survey results in the thesis clearly show that Stack Overflow users are aware of the issues, and some of them want guidelines from Stack Overflow. Moreover, the website should collect the origin of source code examples so that their newer versions and original software licenses can be checked.
Lastly, the comparison of 34 code similarity analysers presented in the thesis is the largest in existence and potentially an invaluable guide for future users of similarity detection in source code. The findings can be used both in academia and outside academia on any studies or projects related to code similarity. Acknowledgements Many people have helped me during the four years of my PhD at UCL, London. I would not have made it this far without them. Thus, I would like to express my gratitude to the following people. First and foremost, I would like to thank my supervisor, Dr.