Automatic Bug Triaging Techniques Using Machine Learning and Stack Traces and Submitted in Partial Fulfillment of the Requirements for the Degree Of
Total Page:16
File Type:pdf, Size:1020Kb
Automatic Bug Triaging Techniques Using Machine Learning and Stack Traces Korosh Koochekian Sabor A Thesis In The Department of Electrical and Computer Engineering Presented in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy (Electrical and Computer Engineering) at Concordia university Montreal, Quebec, Canada July 2019 © Korosh Koochekian Sabor, 2019 CONCORDIA UNIVERSITY SCHOOL OF GRADUATE STUDIES This is to certify that the thesis prepared By: Korosh Koochekian Sabor Entitled: Automatic bug triaging techniques using machine learning and stack traces and submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical and Computer Engineering) complies with the regulations of the University and meets the accepted standards with respect to originality and quality. Signed by the final examining committee: Chair Dr. Luis Amador External Examiner Dr. Ali Ouni External to Program Dr. Nikos Tsantalis Examiner Dr. Anjali Agarwal Examiner Dr. Nawwaf Kharma Thesis Supervisor Dr. Abdelwahab Hamou-Lhadj Approved by Dr. Rastko R. Selmic, Graduate Program Director September 27, 2019 Dr. Amir Asif, Dean Gina Cody School of Engineering & Computer Science Abstract Automatic Bug Triaging Techniques Using Machine Learning and Stack Traces Korosh Koochekian Sabor, Ph.D. Concordia University, 2019 When a software system crashes, users have the option to report the crash using automated bug tracking systems. These tools capture software crash and failure data (e.g., stack traces, memory dumps, etc.) from end-users. These data are sent in the form of bug (crash) reports to the software development teams to uncover the causes of the crash and provide adequate fixes. The reports are first assessed (usually in a semi-automatic way) by a group of software analysts, known as triagers. Triagers assign priority to the bugs and redirect them to the software development teams in order to provide fixes. The triaging process, however, is usually very challenging. The problem is that many of these reports are caused by similar faults. Studies have shown that one way to improve the bug triaging process is to detect automatically duplicate (or similar) reports. This way, triagers would not need to spend time on reports caused by faults that have already been handled. Another issue is related to the prioritization of bug reports. Triagers often rely on the information provided by the customers (the report submitters) to prioritize bug reports. However, this task can be quite tedious and requires tool support. Next, triagers route the bug report to the responsible development team based on the subsystem, which caused the crash. Since having knowledge of all the subsystems of an ever-evolving industrial system is impractical, having a tool to automatically identify defective subsystems can significantly reduce the manual bug triaging effort. iii The main goal of this research is to investigate techniques and tools to help triagers process bug reports. We start by studying the effect of the presence of stack traces in analyzing bug reports. Next, we present a framework to help triagers in each step of the bug triaging process. We propose a new and scalable method to automatically detect duplicate bug reports using stack traces and bug report categorical features. We then propose a novel approach for predicting bug severity using stack traces and categorical features, and finally, we discuss a new method for predicting faulty product and component fields of bug reports. We evaluate the effectiveness of our techniques using bug reports from two large open-source systems. Our results show that stack traces and machine learning methods can be used to automate the bug triaging process, and hence increase the productivity of bug triagers, while reducing costs and efforts associated with manual triaging of bug reports. iv Acknowledgments First and foremost, I want to express my profound and sincere gratitude to my respectful supervisor, Dr. Abdelwahab Hamou-Lhadj. Thanks for your trust in me, and thank you for proactively providing me all the academic, mental, and financial support I needed throughout the program. I would like to express my heartfelt gratitude for your mentorship, which helped me grow both professionally, personally and helped me to further develop my future career path. My deep gratitude for all the support I received from my PhD committee, Dr. Nikolaos Tsantalis, Dr. Nawwaf Kharma, Dr. Ali Ouni and Dr. Anjali Agarwal. Their suggestions significantly helped me in further improving the direction of my research. I would like to extend my thanks to Alf Larsson from Ericsson, Sweden, whose feedback and comments throughout this work have been very valuable. I would also like to thank Ericsson, MITACS, NSERC, and the Gina School of Engineering and Computer Science at Concordia University for their financial support. Many thanks to all my friends at Concordia University. I was thrilled to have been surrounded by such wonderful people. Special thanks to Mohammad Reza Rejali and Amir Bahador Gahroosi for being such wonderful friends, I would be indebted for what you did for me during our collaboration and friendship in Concordia. Thanks to all my friends outside of Concordia, including Mojataba Khomami Abadi and Parham Darbandi, for all their friendship and support. Words cannot express my gratitude and gratefulness enough to my beloved family members. My deepest appreciation for my Parents Anoushe and Saeedeh, for inspiring me to start the program and giving me heartwarming all the way through my program, my beloved sister, Camelia, who always supported my decisions from thousands miles away, and very special thanks to my wife, Camelia, I am very blessed to have you in my life, thanks for your extraordinary patience, care, kindness, love and all the sacrifices you made throughout the program. I could not have accomplished this work without your unconditional support. v Table of Contents List of Figures ..................................................................................................................... ix List of Tables ....................................................................................................................... xi 1 Introduction .......................................................................................................................... 1 1.1. Terminology ............................................................................................................ 4 1.2. Research Hypothesis ............................................................................................... 4 1.3. Thesis Contributions ................................................................................................ 5 1.3.1. Chapter 2: Background and Related Work ...................................................... 6 1.3.2. Chapter 3: Data Preparation ........................................................................... 6 1.3.3. Chapter 4: An Empirical Study on the Effectiveness of Stack traces ............... 7 1.3.4. Chapter 5: Detecting Duplicate Bug Reports ................................................... 7 1.3.5. Chapter 6: Predicting Severity of Bugs ............................................................ 7 1.3.6. Chapter 7: Predicting Faulty Product and Component Fields of Bug Reports . 7 1.3.7. Chapter 8: Conclusion and future work .......................................................... 7 2 Background and Related Work .............................................................................................. 8 2.1. Background ............................................................................................................. 8 2.1.1. Bug Report Description ................................................................................... 8 2.1.2. Stack Trace ...................................................................................................... 8 2.1.3. Categorical Information .................................................................................. 9 2.2. Related Work ......................................................................................................... 10 2.2.1. Usefulness of Stack Traces ............................................................................ 10 2.2.2. Detection of Duplicate Bug Reports .............................................................. 11 2.2.3. Predicting Bug Severity ................................................................................. 18 2.2.4. Bug Report Faulty Product and Component Field Prediction ....................... 24 3 Data Preparation ................................................................................................................. 28 3.1. Eclipse Dataset ...................................................................................................... 28 3.2. Gnome Dataset ..................................................................................................... 29 4 Empirical Study on the Effectiveness of Presence of Stack Traces ...................................... 31 4.1. Dataset Setup ........................................................................................................ 31 4.2. Statistical Analysis ................................................................................................. 32 4.3. Experiments .......................................................................................................... 32 4.4. Study Result ..........................................................................................................