Predicting the Programming Language of Questions and Snippets of Stack Overflow Using Natural Language Processing by Kamel Alrashedy
Predicting the Programming Language of Questions and Snippets of Stack Overflow Using Natural Language Processing

by

Kamel Alrashedy
B.Ed., University of Hail, 2013

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

© Kamel Alrashedy, 2018
University of Victoria

All rights reserved. This thesis may not be reproduced in whole or in part, by photocopying or other means, without the permission of the author.

Supervisory Committee

Dr. Venkatesh Srinivasan, Co-Supervisor (Department of Computer Science)
Dr. T. Aaron Gulliver, Co-Supervisor (Department of Electrical and Computer Engineering)

ABSTRACT

Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted on Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this thesis, a classifier is proposed to predict the programming language of questions posted on Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 77.7%. Thus, deploying ML techniques on the combination of text and code snippets of a question provides the best performance. These results demonstrate that it is possible to identify the programming language of a snippet of only a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, in order to identify some properties of the information inside the questions corresponding to these languages.

Contents

Supervisory Committee
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
  1.1 Research Questions
  1.2 Thesis Contributions
  1.3 Thesis Organization
2 Related Work
  2.1 Predicting Programming Languages
  2.2 Mining Stack Overflow
3 Dataset Extraction and Processing
  3.1 Stack Overflow Selection
  3.2 Extraction and Processing of Stack Overflow Questions
4 Methodology
  4.1 Classifiers
  4.2 The Performance Metrics
5 Results
  5.1 XGBoost Classifier
  5.2 Random Forest Classifier
6 Discussion and Threats to Validity
  6.1 Discussion
  6.2 The Features
  6.3 Threats to Validity
7 Conclusions and Future Work
  7.1 Conclusions
  7.2 Future Work
A Additional Information
Bibliography

List of Tables

Table 5.1 Performance of XGBoost trained on textual information and code snippet features.
Table 5.2 Performance of XGBoost trained on textual information features.
Table 5.3 Effect of the minimum number of characters in a code snippet on the accuracy.
Table 5.4 Performance of XGBoost trained on code snippet features.
Table 5.5 Performance of RFC trained on textual information and code snippet features.
Table 5.6 Performance of RFC trained on textual information features.
Table 5.7 Performance of RFC trained on code snippet features.
Table 5.8 A comparison of the classifier in Baquero [4] and the proposed classifiers.
Table 6.1 The top 50 features for each programming language.
Table 6.2 The top 50 features for each programming language.
Table 6.3 The top 50 features for each programming language.
Table 6.4 The top 50 features for each programming language.
Table 6.5 The top 50 features for each programming language.
Table A.1 The top 50 textual information features for each programming language.
Table A.2 The top 50 textual information features for each programming language.
Table A.3 The top 50 textual information features for each programming language.
Table A.4 The top 50 textual information features for each programming language.
Table A.5 The top 50 textual information features for each programming language.
Table A.6 The top 50 textual information features for each programming language.

List of Figures

Figure 1.1 An example of a Stack Overflow post.
Figure 3.1 An example of a Stack Overflow question.
  (a) Before applying NLP techniques.
  (b) After applying NLP techniques.
Figure 3.2 The dataset extraction process.
Figure 3.3 Box plots showing the number of lines of code in the extracted code snippets for all the languages. Note that there were at least 400 posts which had more than 200 lines of code and these were not included in this plot.
Figure 5.1 Confusion matrix for the XGBoost classifier trained on code snippet and textual information features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 5.2 Confusion matrix for the XGBoost classifier trained on textual information features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 5.3 Confusion matrix for the XGBoost classifier trained on code snippet features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 5.4 Confusion matrix for the Random Forest classifier trained on code snippet and textual information features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 5.5 Confusion matrix for the Random Forest classifier trained on textual information features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 5.6 Confusion matrix for the Random Forest classifier trained on code snippet features. The diagonal represents the percentage of the programming language that was correctly predicted.
Figure 6.1 Code snippet and textual information features of Java represented in two dimensions after using T-SNE on a trained Word2Vec model.
  (a) Java code snippet features.
  (b) Java textual information features.
Figure 6.2 Code snippet and textual information features of SQL represented in two dimensions using T-SNE on a trained Word2Vec model.
  (a) SQL code snippet features.
  (b) SQL textual information features.

ACKNOWLEDGEMENTS

First of all, I would like to thank my co-supervisor, Venkatesh Srinivasan, for his guidance and advice throughout my research. The opportunities he gave me to work on many projects expanded my knowledge and sparked my interest in the area of Applied Machine Learning. Frequent discussions with Venkatesh about my progress had a positive impact and motivated me to learn more. Besides research experience, I have learnt many things from him, such as communication skills, time management, facing challenges, planning for the future, work-life balance, learning from failure and managing failure. He was an extremely helpful supervisor.

I would like to thank my co-supervisor, Aaron Gulliver, for his guidance during my master's degree. His motivation and enthusiasm led me to work on the research problem in this thesis.
His high expectations of my work had a positive impact and made me work hard. Also, working on another project with him led to my first published paper. I learnt the process of doing research and publishing a scientific paper from him, and I am grateful that he motivated and helped me to start my degree at this university.

In addition, I would like to thank Daniel German for my favorite course, Mining Software Repositories, which highly influenced my current and future research work. Collaborating with him was a great opportunity, and I have received many comments and suggestions from him on how to improve this work. Dhanush Dharmaretnam, my classmate, officemate and friend, helped and supported me in working in the field of Natural Language Processing to solve the main problem considered in this thesis.

I would like to thank the many great friends and colleagues who encouraged and motivated me during this time. Special thanks to Rehan Sayeed for his encouragement and enthusiasm, which motivated me a lot during my master's degree.

I acknowledge the financial support of the Saudi Ministry of Education through a graduate scholarship.

DEDICATION

First and foremost, I would like to thank almighty God, without whose mercy this journey would have been nearly impossible. I would like to dedicate this thesis to my wonderful parents, Shima and Aali, whose constant encouragement throughout my journey has been truly awe-inspiring. This journey would remain incomplete without my aunt Shima, whose frequent phone calls and emails were the source of mental strength needed to complete my master's.

I would also like to dedicate this work to my aunt, Maha, who is the most influential person throughout my education. It was she who initially persuaded me to pursue a graduate degree. I cannot forget to thank my loving brothers Wafi and Ahmad for encouraging me to study Computer Science.
Many thanks to my brother Ayyad, my first programming teacher, who taught me HTML and Microsoft FrontPage when I was in grade nine.
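The abstract describes combining textual features (title and body) with code snippet features and feeding them to a supervised classifier such as Random Forest or XGBoost. The thesis's actual feature extraction and models are described in Chapter 4; as a rough, hypothetical illustration of that general approach only, the following sketch uses scikit-learn's TfidfVectorizer and RandomForestClassifier on invented toy data. The posts, labels and parameters here are made up for illustration and are not the author's implementation.

```python
# Hypothetical sketch: predict a question's programming language from
# its combined title, body and code snippet text, using TF-IDF features
# and a Random Forest (scikit-learn). Toy data, not the thesis dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Each entry is (title + body + snippet concatenated, language tag).
posts = [
    ("how to print a list for item in items print item", "python"),
    ("read a csv file import pandas as pd then call read_csv", "python"),
    ("public static void main string args println hello", "java"),
    ("arraylist add element list add new object", "java"),
    ("select rows from table select * from users where id = 1", "sql"),
    ("join two tables inner join orders on orders user_id", "sql"),
]
texts, labels = zip(*posts)

# One shared TF-IDF feature space over textual and snippet tokens.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Classify a new, unseen question containing only SQL-like tokens.
pred = clf.predict(vectorizer.transform(["select from users where id"]))
```

In the thesis itself, separate classifiers are also trained on textual features alone and on snippet features alone; in a sketch like this, that simply means fitting the vectorizer on a different slice of each post.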