
Open Research Online The Open University’s repository of research publications and other research outputs

Improving Information Retrieval Bug Localisation Using Contextual Heuristics

Thesis

How to cite:

Dilshener, Tezcan (2017). Improving Information Retrieval Bug Localisation Using Contextual Heuristics. PhD thesis The Open University.

For guidance on citations see FAQs.

© 2016 The Author

https://creativecommons.org/licenses/by-nc-nd/4.0/

Version: Version of Record

Link(s) to article on publisher's website: http://dx.doi.org/doi:10.21954/ou.ro.0000c41c

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online's data policy on reuse of materials please consult the policies page.

oro.open.ac.uk

Improving Information Retrieval Based Bug Localisation Using Contextual Heuristics

Tezcan Dilshener M.Sc.

A thesis submitted to

The Open University

in partial fulfilment of the requirements for the degree of Doctor of Philosophy in

Department of Computing and Communications Faculty of Mathematics, Computing and Technology The Open University

September 2016

Copyright © 2016 Tezcan Dilshener

Abstract

Software developers working on unfamiliar systems are challenged to identify where and how high-level concepts are implemented in the source code prior to performing maintenance tasks.

Bug localisation is a core program comprehension activity in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code?

Information retrieval (IR) approaches see the bug report as the query, and the source files as the documents to be retrieved, ranked by relevance. Current approaches rely on project history, in particular previously fixed bugs and versions of the source code. Existing IR techniques fall short of providing adequate solutions in finding all the source code files relevant for a bug. Without additional help, bug localisation can become a tedious, time-consuming and error-prone task.

My research contributes a novel algorithm that, given a bug report and the application’s source files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug without requiring past code and similar reports.

I study eight applications for which I had access to the user guide, the source code, and some bug reports. I compare the relative importance and the occurrence of the domain concepts in the project artefacts and measure the effectiveness of using only concept key words to locate files relevant for a bug compared to using all the words of a bug report.

Measuring my approach against six others, using their five metrics and eight projects, I position an affected file in the top-1, top-5 and top-10 ranks on average for 44%, 69% and 76% of the bug reports respectively. This is an improvement of 23%, 16% and 11% respectively over the best performing current state-of-the-art tool.

Finally, I evaluate my algorithm with a range of industrial applications in user studies, and find that it is superior to simple string search, as often performed by developers. These results show the applicability of my approach to software projects without history and offer a simpler, light-weight solution.

Acknowledgements

I express my thanks and sincere gratitude to my supervisors, Dr. Michel Wermelinger and Dr. Yijun Yu, for believing in my research and in my capabilities. Their dedicated support, direction and guidance made my journey through the research forest a conquerable experience.

Simon Butler for his assistance in using the JIM tool and for the countless proof readings of my work. As my research buddy, he always managed to motivate and encourage me.

Prof. Marian Petre for sharing her valuable experience during online seminar sessions and organising the best departmental conferences.

Prof. Hongyu Zhang for kindly providing BugLocator and its datasets, Ripon Saha for the BLUiR dataset, Laura Moreno for the LOBSTER dataset and Chu-Pan Wong for giving me access to the source code of his tool BRTracer.

Jana B. at our industrial partner, a global financial IT solutions provider located in southern Germany, for providing the artefacts of one of their proprietary applications that I used throughout my research, and for their input on the diverse information required.

Markus S., Georgi M. and Alan E. for being the proxy at their business locations while I conducted my user study. Also to Markus S. at our industrial partner for his countless hours of discussions on patterns in software applications and source code analysis.

My wife Anja for her endless patience, understanding and respect. With her love, encouragement and dedicated coaching, I withstood the emotional up and down phases of my research and thus completed my Ph.D. studies.

To my two daughters, Denise and Jasmin for allowing their play time with me to be spent on my research instead.

Last but not least, thank you to all those who wish to remain unnamed for contributing to my dedication.

Finally, no contradictory intentions are implied towards any individual or establishment.

Contents

Contents viii

List of Figures x

List of Tables xiii

1 Introduction 1
1.1 Academic Motivation: Software Maintenance Challenges ...... 1
1.2 Industry Motivation: Industry Needs in Locating Bugs ...... 3
1.3 An Example of Information Retrieval Process ...... 3
1.4 Existing Work and The Limitations ...... 7
1.5 Research Questions ...... 9
1.6 Overview of the Chapters ...... 11
1.7 Summary ...... 13

2 Landscape of Code Retrieval 15
2.1 Concept Assignment Problem ...... 16
2.2 Information Retrieval as Conceptual Framework ...... 17
2.2.1 Evaluation Metrics ...... 18
2.2.2 Vector Space Model and Latent Semantic Analysis ...... 18
2.2.3 Dynamic Analysis and Execution Scenarios ...... 20
2.2.4 Source Code Vocabulary ...... 21
2.2.5 Ontology Relations ...... 25
2.2.6 Call Relations ...... 27
2.3 Information Retrieval in Bug Localisation ...... 29
2.3.1 Evaluation Metrics ...... 31
2.3.2 Vocabulary of Bug Reports in Source Code Files ...... 32
2.3.3 Using Stack Trace and Structure ...... 34
2.3.4 Version History and Other Data Sources ...... 37
2.3.5 Combining Multiple Information Sources ...... 39
2.4 Current Tools ...... 43
2.5 User Studies ...... 46
2.6 Summary ...... 49
2.6.1 Conclusion ...... 52

3 Relating Domain Concepts and Artefact Vocabulary in Software 55
3.1 Introduction ...... 56
3.2 Previous Approaches ...... 57
3.3 My Approach ...... 59
3.3.1 Data Processing ...... 60
3.3.1.1 Data Extraction Stage ...... 61
3.3.1.2 Persistence Stage ...... 64
3.3.1.3 Search Stage ...... 66
3.3.2 Ranking Files ...... 68
3.4 Evaluation of the Results ...... 71
3.4.1 RQ1.1: How Does the Degree of Frequency Among the Common Concepts Correlate Across the Project Artefacts? ...... 71
3.4.1.1 Correlation of Common Concepts Across Artefacts ...... 75
3.4.2 RQ1.2: What is the Vocabulary Similarity Beyond the Domain Concepts, which may Contribute Towards Code Comprehension? ...... 76
3.4.3 RQ1.3: How can the Vocabulary be Leveraged When Searching for Concepts to Find the Relevant Files? ...... 79
3.4.3.1 Performance ...... 81
3.4.3.2 Contribution of VSM ...... 81
3.4.3.3 Searching for AspectJ and SWT in Eclipse ...... 83
3.5 Threats to Validity ...... 84
3.6 Discussion ...... 85
3.7 Concluding Remarks ...... 86

4 Locating Bugs without Looking Back 89
4.1 Introduction ...... 90
4.2 Existing Approaches ...... 93
4.3 Extended Approach ...... 95
4.3.1 Ranking Files Revisited ...... 96
4.3.1.1 Scoring with Key Positions (KP score) ...... 97
4.3.1.2 Scoring with Stack Traces (ST score) ...... 99
4.3.1.3 Rationale behind the Scoring Values ...... 100
4.4 Evaluation of the Results ...... 100
4.4.1 RQ2: Scoring with File Names in Bug Reports ...... 101
4.4.1.1 Scoring with Words in Key Positions (KP score) ...... 101
4.4.1.2 Scoring with Stack Trace Information (ST score) ...... 103
4.4.1.3 KP and ST Score Improvement when Searching with Concepts Only vs All Bug Report Vocabulary ...... 105
4.4.1.4 Variations of Score Values ...... 106
4.4.1.5 Overall Results ...... 108
4.4.1.6 Performance ...... 114
4.4.2 RQ2.1: Scoring without Similar Bugs ...... 116
4.4.3 RQ2.2: VSM's Contribution ...... 118
4.5 Discussion ...... 120
4.5.1 Threats to Validity ...... 122
4.6 Concluding Remarks ...... 123

5 User Studies 125
5.1 Background ...... 125
5.2 Study Design ...... 127
5.3 Results ...... 129
5.3.1 Pre-session Interview Findings ...... 129
5.3.2 Post-session Findings ...... 131
5.4 Evaluation of the Results ...... 134
5.4.1 Threats to Validity ...... 136
5.5 Concluding Remarks ...... 137

6 Conclusion and Future Directions 139
6.1 How the Research Problem is Addressed ...... 139
6.1.1 Relating Domain Concepts and Vocabulary ...... 140
6.1.2 Locating Bugs Without Looking Back ...... 142
6.1.3 User Studies ...... 145
6.1.4 Validity of My Hypothesis ...... 145
6.2 Contributions ...... 146
6.2.1 Recommendations for Practitioners ...... 146
6.2.2 Suggestions for Future Research ...... 147
6.3 A Final Rounding Off ...... 150

Bibliography 153

A Glossary 167

B Top-5 Concepts 170

C Sample Bug Report Descriptions 172

D Search Pattern for Extracting Stack Trace 173

E Domain Concepts 174

List of Figures

1.1 Information retrieval repository creation and search process ...... 4
1.2 Bug report with a very terse description in SWT project ...... 5
1.3 AspectJ project's bug report description with stack trace information ...... 6
1.4 Ambiguous bug report description found in SWT ...... 7
3.1 ConCodeSe Data Extraction, Storage and Search stages ...... 61
3.2 Block style comment example used in a source code file ...... 63
3.3 ConCodeSe table extensions in addition to JIM database tables ...... 65
3.4 Referential relations linking the information stored in ConCodeSe and JIM database tables ...... 66
3.5 Pseudo code example of an SQL select statement for finding domain concepts that occur in bug report #PMO-2012 ...... 66
3.6 Distribution of concepts among AspectJ, Eclipse, SWT and ZXing projects ...... 73
3.7 Distribution of concepts among Tomcat, ArgoUML, Pillar1 and Pillar2 projects ...... 74
4.1 Performance comparison of MAP and MRR values per tool for AspectJ, Eclipse and SWT (n=number of BRs analysed) ...... 109
4.2 Performance comparison of MAP and MRR values per tool for Tomcat, ArgoUML, Pillar1 and Pillar2 (n=number of BRs analysed) ...... 110
4.3 Performance comparison of the tools for AspectJ and Eclipse (n=BRs analysed) ...... 111
4.4 Performance comparison of the tools for SWT and ZXing (n=BRs analysed) ...... 112
4.5 Performance comparison of the tools for Tomcat, ArgoUML, Pillar1 and Pillar2 (n=number of BRs analysed) ...... 113
4.6 Recall performance of the tools for top-5 and top-10 for AspectJ, Eclipse, SWT and ZXing (n=number of BRs analysed) ...... 115
4.7 Recall performance of the tools for top-5 and top-10 for Tomcat, ArgoUML, Pillar1 and Pillar2 (n=number of BRs analysed) ...... 116
D.1 Search pattern for stack trace ...... 173

List of Tables

2.1 Project artefacts and previous studies that also used them ...... 44
3.1 Number of affected files per bug report (BR) in each project under study ...... 68
3.2 Sample of ranking achieved by all search types in SWT project ...... 69
3.3 Total and unique number of word occurrences in project artefacts ...... 72
3.4 Spearman rank correlation of common concepts between the project artefacts using exact concept terms ...... 75
3.5 Spearman rank correlation of common concepts between the project artefacts using stemmed concept terms ...... 76
3.6 Common terms found in source files of each application across all projects ...... 77
3.7 Common identifiers found in source files of each application across all projects ...... 78
3.8 Bug reports for which at least one affected file placed into top-N using concept terms only search ...... 80
3.9 Bug reports for which at least one affected file placed into top-N using all of the bug report vocabulary ...... 80
3.10 Recall performance at top-N with full bug report vocabulary search ...... 81
3.11 Bug reports for which at least one affected file placed into top-N using VSM vs lexical similarity scoring, using concept terms only search ...... 82
3.12 Bug reports for which at least one affected file placed into top-N using VSM vs lexical similarity scoring, using full bug report vocabulary search ...... 83
3.13 Search performance for SWT and AspectJ bug reports within Eclipse code base ...... 83
4.1 Comparison of IR model and information sources used in other approaches ...... 96
4.2 Sample of words in key positions showing presence of source file names ...... 98
4.3 Summary of file names at key positions in the analysed bug reports ...... 98
4.4 Stack trace information available in bug reports (BR) of each project ...... 99
4.5 Example of ranking achieved by leveraging key positions in SWT bug reports compared to other tools ...... 102
4.6 Performance comparison of scoring variations: Key Position (KP+TT) vs Stack Trace (ST+TT) vs Text Terms (TT only) ...... 103
4.7 Example of ranking achieved by leveraging stack trace information in AspectJ bug reports compared to other tools ...... 104
4.8 Performance comparison between using combination of all scores vs only TT score during search with only concept terms ...... 105
4.9 Performance comparison between using combination of Key Word and Stack Trace scoring On vs Off ...... 106
4.10 Performance comparison of using variable values for KP, ST and TT scores ...... 107
4.11 Number of files for each bug report placed in the top-10 by ConCodeSe vs BugLocator and BRTracer ...... 114
4.12 Example of ranking achieved by leveraging similar bug reports in other tools vs not in ConCodeSe ...... 117
4.13 Performance of ConCodeSe compared to BugLocator and BRTracer without using similar bug reports score across all projects ...... 118
4.14 Performance comparison between Lucene VSM vs lexical similarity scoring within ConCodeSe ...... 119
4.15 Wilcoxon test comparison for top-5 and top-10 performance of BugLocator and BRTracer vs ConCodeSe ...... 123
5.1 Business nature of the companies and professional experience of the participants involved in the user study ...... 128
5.2 Project artefact details used in the study at the participating companies ...... 128
B.1 Top-5 concepts occurring across the project artefacts ...... 170
B.2 Top-5 concepts occurring across the project artefacts (cont.) ...... 171
C.1 Sub-set of bug report descriptions for Pillar1 and Pillar2 ...... 172
E.1 Graphical User Interface (GUI) Domain Concepts ...... 174
E.2 Integrated Development Environment (IDE) Domain Concepts ...... 175
E.3 Bar Code (imaging/scanning) Domain Concepts ...... 176
E.4 Servlet Container Domain Concepts ...... 177
E.5 Aspect Oriented Programming (AOP) Domain Concepts ...... 177
E.6 Basel-II Domain Concepts ...... 178
E.7 Unified Modelling Language (UML) Domain Concepts ...... 178

Chapter 1

Introduction

As current software applications become more widely accepted and their lifespan stretches beyond its estimated expectancy, they evolve into legacy systems. These applications become valuable strategic assets to companies and require ongoing maintenance to continue providing functionality for the business. To reduce the cost of ownership, such companies may turn towards third-party developers to take over the task of maintaining these applications. These consultants, due to the impermanent nature of their job, could introduce high turnover and cause the knowledge of the applications to move further away from its source.

Each time a new developer is designated to perform a maintenance task, the associated high learning curve results in loss of precious time, incurring additional costs. As the documentation and other relevant project artefacts decay, to understand the current state of the application before implementing a change, the designated software developer has to go through the complex task of understanding the code.

In this Chapter, I describe the current challenges faced by software developers when performing maintenance tasks and my motivation for conducting this research. I then show examples of the information retrieval process in software maintenance and identify the limitations of the existing approaches. Subsequently, I define the hypothesis that I investigate and the research questions I aim to answer to address some of the highlighted limitations.

Finally, I summarise by giving an overview of the chapters contained in this thesis.

1.1 Academic Motivation: Software Maintenance Challenges

IEEE standard 1219 defines software maintenance as the task of modifying a software application after it has been delivered for use, to correct any unforeseen faults, apply improvements or enhance it for other environments. Corbi (1989) identified that program comprehension — i.e. the process of understanding a program — is an important activity a developer must undertake before performing any maintenance tasks.

Program comprehension forms a considerable component of the effort and costs of software maintenance. Between 40–60% of software maintenance effort is dedicated to understanding the software application (Pigoski, 1996), and the cost of change control — i.e. the process of managing maintenance — is estimated to make up 50–70% of the project life cycle costs (Hunt et al., 2008).

Prior to performing a maintenance task, the designated developer has to become familiar with the application concerned. If unfamiliar, she has to read the source code to understand how those program elements, e.g. source code class files (see Appendix A), interact with each other to accomplish the described use case. Even experienced developers face a tremendous learning curve in understanding the application's domain when they move to another area within their current projects (Davison et al., 2000). It is concluded that during maintenance 60–80% of the time tends to be spent on discovery, and it is argued that it is much harder to understand the problem than to implement a solution.

During application development, it would have been ideal for program comprehension to use the same words found in the requirements documentation when declaring program identifier names, i.e. source code entities (Appendix A). However, developers often choose abbreviated forms of the words and names found in the text documentation, as well as use nouns and verbs in compound format to capture the described actions. In addition, the layered multi-tier architectural guidelines (Parnas, 1972) cause the program elements implementing a domain concept (Appendix A) to be scattered across the application and create challenges during maintenance when linking the application source code to the text documentation.

Software maintainers can face further barriers during program comprehension because the knowledge recorded in source code and documentation can decay. Each change cycle causes degradation amongst the project artefacts, like program elements, architecture and its documentation, if the changes are not propagated to these artefacts (Lehman, 1980). The study conducted by Feilkas et al. (2009) reports that 70–90% of knowledge decay is caused by flaws in the documentation and that 9–19% of the implementations failed to conform to the documented architecture.

1.2 Industry Motivation: Industry Needs in Locating Bugs

As an independent software development consultant, I perform maintenance tasks in business applications at different companies and sometimes rotate between projects within the same company. Each time I start work on a project new to me, I spend a considerable amount of time trying to comprehend the application and come up to speed in locating relevant source code files prior to performing maintenance tasks.

In general, when I receive a new task to work on, e.g. documented as a bug report, I perform a search for the files to be changed using the tools available in my development environment. Firstly, these tools force me to specify precise search words, otherwise the results contain many irrelevant files, or no results are returned if I enter too many words. Secondly, even when the search words are precise, the results are displayed without any order.

As a new project team member, unfamiliar with the domain terminology (i.e. concepts) and project vocabulary, I am usually challenged to identify the key words or concepts to use in the search query, as well as forced to analyse a long list of files retrieved from an application with which I am unfamiliar.

I often wonder whether the rich content available in bug reports can be utilised during the search, and whether the resulting list of files can be displayed in ranked order of relevance, from most to least important. I could then focus on the files listed at the top of the list and avoid the time-consuming task of analysing all of the files. To reduce the time it takes to locate relevant source files and to improve the reliability of the search results, in my research I investigate how additional sources of information found in project artefacts can be leveraged and how the search results can be presented in an ordered list to benefit developers.

1.3 An Example of Information Retrieval Process

In a typical information retrieval (IR) approach for traceability between an application's source code and its textual documentation, like the user guide, the current techniques first analyse the project artefacts and build an abstract, high-level representation of the application in a repository, referred to as the corpus, where the program elements implementing the concepts can be searched. The existing IR techniques extract the traceability information from the program elements by parsing the application's source code files and store the extracted information in the corpus. During the extraction process the identifier names are transformed into individual words (i.e. terms) according to known Object Oriented Programming (OOP) coding styles like the camel-case naming pattern, where, for example, the identifier standAloneRisk is split into three single words: stand, alone and risk. After this transformation, the resulting terms are stored in the repository with a reference to their locations in the program elements, as illustrated in the top half of Figure 1.1.

Figure 1.1: Information retrieval repository creation and search process
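To illustrate the splitting step, here is a minimal sketch of camel-case identifier splitting as performed when building the corpus; the class name and the regular expression are illustrative assumptions and do not come from the ConCodeSe implementation.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Minimal sketch: split a Java identifier into lower-case terms for the corpus.
public class IdentifierSplitter {

    static List<String> split(String identifier) {
        // Break on lower-to-upper case transitions, acronym boundaries and underscores.
        String[] parts = identifier.split(
                "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|_");
        return Arrays.stream(parts)
                .filter(p -> !p.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("standAloneRisk")); // [stand, alone, risk]
    }
}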

Once the corpus is created, a developer initiates the concept location (Appendix A) process by entering the query terms representing the concepts being searched in a user interface.

The query terms may come from user guide documents or bug reports. The IR method performs the search by comparing the terms entered in the query against the terms found in the underlying repository. The matching score is calculated based on lexical similarity or probabilistic distance between the terms, explained in Chapter 2, depending on the underlying IR method. Finally the results are displayed to the developer in a list ranked by their relevance as defined in the IR model.

Now I will go through some examples to illustrate the limitations of the existing approaches. One of the projects I use to evaluate my approach, as we will see in Chapter 3 and Chapter 4, is a graphical user interface (GUI) project called SWT¹. In project SWT, bug report #92757 has a very terse description (see Figure 1.2) and, for a developer who is unfamiliar with SWT, raises the question of what concept terms to use when searching for the relevant files prior to performing a change to resolve the reported bug. The only concept this bug report refers to is the GUI domain concept text, which results in 142 source files being returned when queried. To reduce the number of files the developer has to analyse, alternative terms like style or caret found in the bug report summary may be used. However style returns 55 files and caret 13 files, none of which is the relevant one already changed to resolve the reported bug.

¹ http://www.eclipse.org/swt/

Bug id: 92757 Summary: Styled text: add caret listener Description: null

Figure 1.2: Bug report with a very terse description in SWT project

Since this is a closed bug report, i.e. resolved, looking at the content of the affected file revealed that it implements the GUI domain concept widget. However, when the concept widget is used as the search query term, 130 files are returned in the result list.

To address the challenges highlighted by this example, i.e. selecting search terms that result in a reduced number of files to be investigated, the current research approaches use all the vocabulary available in the bug report during the search (Zhou et al., 2012a; Wong et al., 2014; Saha et al., 2013; Wang and Lo, 2014; Ye et al., 2014). Then the results are enhanced by considering available historical data, e.g. previously changed source code files and similar bug reports. Finally, the results are presented in a list ranked by order of importance from high to low, based on the relevance to the entered search terms.

However the existing studies conclude that such supplemental information sources, i.e. past history and similar bugs, do not guarantee improved results. For example, the two existing tools, BugLocator (Zhou et al., 2012a) and BRTracer (Wong et al., 2014), still rank the relevant file for this bug report at 88th and 75th position in the result list respectively.

We will see in Chapter 4 that my proposed approach ranks the same file in the 5th position.

In general bug reports describe the observed erroneous behaviour and may contain detailed descriptions like stack trace information, which lists a sequence of files that were executing at the time of the error (Bettenburg et al., 2008). To improve the performance of the bug localisation approaches, current research also considered evaluating stack trace information included in the bug reports (Wong et al., 2014; Moreno et al., 2014).

One of the ways to utilise the stack trace is to consider the file names listed in the trace. For example, in Figure 1.3, AspectJ² bug #158624 contains a detailed stack trace of an exception (UnsupportedOperationException) produced by an erroneous condition, referred to as “the bug dump” in the description. AspectJ is another project I use to evaluate the performance of my approach and compare it against the other tools that also used it. The source code file name ResolvedMemberImpl.java (listed after the second “at” in the stack trace) is one of the relevant files modified to resolve this bug.

² http://eclipse.org/aspectj/

Bug id: 158624 Summary: Compiler Error: generics and arrays Description: OK, not sure what to report here or what info you need, but here’s the set up, message, and erroneous class. I don’t understand the errors from the compiler enough to parse down the erroneous file to something that contains only the bug, but I could if direction were given. Here’s my set up: Eclipse SDK Version: 3.2.0 Build id: M20060629-1905 With AJDT: Eclipse AspectJ Development Tools Version: 1.4.1.200608141223 AspectJ version: 1.5.3.200608210848 Here’s the bug dump from the compiler inside Eclipse: java.lang.UnsupportedOperationException at org.aspectj.weaver.UnresolvedType.parameterize(UnresolvedType.java:221) at org.aspectj.weaver.ResolvedMemberImpl.parameterize(ResolvedMemberImpl.java:680) at org.aspectj.weaver.ResolvedMemberImpl.parameterize(ResolvedMemberImpl.java:690) at org.aspectj.weaver.ReferenceType.getDeclaredMethods(ReferenceType.java:508) at org.aspectj.weaver.ResolvedType$4.get(ResolvedType.java:226) at ...

Figure 1.3: AspectJ project’s bug report description with stack trace information
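Extracting such candidate file names from a stack trace can be sketched as follows; the regular expression assumes the standard "at package.Class.method(Class.java:line)" frame layout of Java traces and is only an illustration, since the actual search pattern used in my approach is the one given in Appendix D.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: collect the distinct source file names mentioned in a stack trace.
public class StackTraceFileNames {

    private static final Pattern FRAME =
            Pattern.compile("at\\s+[\\w.$]+\\(([\\w$]+\\.java):\\d+\\)");

    static Set<String> fileNames(String bugReportText) {
        Set<String> names = new LinkedHashSet<>(); // keep order of first appearance
        Matcher m = FRAME.matcher(bugReportText);
        while (m.find()) {
            names.add(m.group(1)); // e.g. "ResolvedMemberImpl.java"
        }
        return names;
    }

    public static void main(String[] args) {
        String frame = "at org.aspectj.weaver.ResolvedMemberImpl.parameterize(ResolvedMemberImpl.java:680)";
        System.out.println(fileNames(frame)); // [ResolvedMemberImpl.java]
    }
}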

When the search results between the tool that takes advantage of the stack trace available in bug reports (Wong et al., 2014) and the other tool that does not (Zhou et al., 2012a) are compared, the ranking of the relevant file improves from 16th to 6th position in the result list.

Nevertheless the developer still has to analyse 5 irrelevant files prior to finding the relevant one. The reason for the low ranking in these tools is that the existing approaches (Wong et al., 2014; Moreno et al., 2014) consider all the files found in the stack trace list, which eventually causes irrelevant files to be ranked as more important than the relevant ones. Again we will see that my approach ranks ResolvedMemberImpl.java at 2nd position in the result list.

Further analysis showed that words in certain positions in the bug reports reveal file names. For example, as illustrated in Figure 1.4, the first word in the summary field of SWT bug #79268 is the relevant file name. The existing tools rank this file at 21st and 11th position respectively. One of the reasons for this is that the word Program is considered to be too ambiguous and gets a lower score. Although the current IR approaches also detect the file names available in the bug report, they continue to score other files based on word similarity, thus causing irrelevant files that have more words matching the bug report to be ranked higher. We will see that my approach ranks this file at the 1st position in the result list.

Bug id: 79268 Summary: Program API does not work with GNOME 2.8 (libgnomevfs-WARNING) Description: I200411170800- Not sure what triggers it, neither who is doing it. I get the following stderr output once in a while: (<unknown>:27693): libgnomevfs-WARNING **: Deprecated function. User modifications to the MIME database are no longer supported. In my development workbench, the output reads: (Gecko:11501): libgnomevfs-WARNING **: Deprecated function. User modifications to the MIME database are no longer supported. So I suspect the mozilla/gecko library is doing this, hence I punted it to SWT. Probably nothing we can do much about, but here’s the bug anyway.

Figure 1.4: Ambiguous bug report description found in SWT
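To make the key-position idea concrete, the sketch below boosts a file whose name matches the first word of the bug report summary; the class name, the score values and the single-position check are illustrative assumptions rather than the actual ConCodeSe scoring rules, which are defined in Chapter 4.

import java.util.List;

// Minimal sketch of a key-position heuristic for ranking candidate files.
public class KeyPositionScorer {

    // Combine a lexical similarity score with a boost when the summary starts with the file name.
    static double score(String fileName, String summary, double lexicalScore) {
        String baseName = fileName.replaceAll("\\.java$", "").toLowerCase();
        List<String> words = List.of(summary.toLowerCase().split("\\W+"));
        if (!words.isEmpty() && words.get(0).equals(baseName)) {
            return lexicalScore + 1.0; // illustrative boost value
        }
        return lexicalScore;
    }

    public static void main(String[] args) {
        // SWT bug #79268: the summary starts with "Program", matching Program.java.
        String summary = "Program API does not work with GNOME 2.8";
        System.out.println(score("Program.java", summary, 0.3)); // 1.3
        System.out.println(score("Widget.java", summary, 0.3));  // 0.3
    }
}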

1.4 Existing Work and The Limitations

Early attempts to aid developers in recovering traceability links between source code files and textual documentation used IR methods like the Vector Space Model (VSM), Salton and Buckley (1988), and managed to achieve high precision (the accuracy of the results, Marcus and Maletic 2003) or high recall (the completeness of the results, Antoniol et al. 2002). Nevertheless, the IR approaches do not consider words that are strongly related via structural information, e.g. OOP inheritance, available in software programs to be relevant and thus still perform poorly in some cases (Petrenko and Rajlich, 2013).

Further research recognised the need for combining multiple analysis approaches on top of IR to support program comprehension (Gethers et al., 2011). To determine the starting points for investigating relevant source code files for maintenance work, techniques combining dynamic (Wilde and Scully, 1995) and static (Marcus et al., 2005) analysis have been exploited (Poshyvanyk et al., 2007; Eisenbarth et al., 2001).

The application of dynamic analysis builds upon the trace information obtained by executing application usage scenarios. However, if a maintenance task describes a scenario that may not be executable, e.g. a new non-existing feature, then it is unsuitable for dynamic analysis (Poshyvanyk et al., 2007).

On the other hand, to cover a broader set of program elements, one of the techniques the static analysis utilises is a static call-graph where the interactions, i.e. call relations, between the source code files are organised. However the call-graphs usually contain many dependencies and very deep subtrees (i.e. nodes/edges) that make them impractical for search and navigation purposes.

To compensate for the impractical call-graph weaknesses, the previous approaches utilised pruning logic to selectively remove nodes while clustering (i.e. grouping) the call relations. However, one of the challenges pruning exposes is that a source code file may be moved to a different group/cluster due to the underlying pruning logic and cause inconsistencies when searching for the files implementing a concept.

Recent approaches utilise additional sources of information, e.g. previously fixed bug reports (i.e. similar bugs) and the number of times a source file is fixed (i.e. version history), to boost the scoring of the underlying IR model. However, it is claimed that considering similar bug reports earlier than 14 days adds no value (Nichols, 2010) and that version history older than 20 days decreases the performance (Wang and Lo, 2014). Besides, a large software project or one with a long history may require time-consuming analysis, making these approaches impracticable (Rao and Kak, 2011).

To overcome these limitations, recent studies utilised segmentation (Wong et al., 2014) and stack trace analysis (Moreno et al., 2014) as effective techniques to boost the performance of IR models. During segmentation the source code files are divided into smaller chunks containing a reduced number of words, to provide more advantage during the search. However, the effectiveness of segmentation is known to be sensitive to the number of words being included in each segment (Wong et al., 2014), which may reduce the relevance of a segment for the concept being searched. Moreover these approaches fail to consider non-application-specific files found in the stack trace, hence the precision of the results deteriorates.

In general the limitations in the current literature can be summarised as follows.

1. Due to poor VSM performance, a combination of multiple techniques is required.

2. Dynamic analysis improves VSM but fails to consider all the relevant files.

3. Static analysis using the application call-graph results in many irrelevant files.

4. Clustering first and then pruning causes relevant files to be removed.

5. Using similar bug reports earlier than 14 days provides no contribution.

6. Past history requires the analysis of huge amounts of data and is thus time-consuming and impractical.

7. Considering all files listed in the stack traces deteriorates the performance.

8. Segmentation causes files to be divided inefficiently, thus resulting in loss of information.

In summary, current state-of-the-art IR approaches in bug localisation rely on project history, in particular previously fixed bugs and previous versions of the source code. Existing studies (Nichols, 2010; Wang and Lo, 2014) show that considering similar bug reports up to 14 days and version history between 15–20 days does not add any benefit to the use of IR alone. This suggests that the techniques can only be used where a great deal of maintenance history is available; however, the same studies also show that considering history up to 50 days deteriorates the performance.

Besides, Bettenburg et al. (2008) argued that a bug report may contain a readily identifiable number of elements including stack traces, code fragments, patches and recreation steps, each of which should be treated separately. Previous studies also show that many bug reports contain the file names that need to be fixed (Saha et al., 2013) and that the bug reports have more terms in common with the affected files, which are present in the names of those affected files (Moreno et al., 2014).

Furthermore, the existing approaches treat source code comments as part of the vocabulary extracted from the code files, but the comments tend to be written in a sublanguage of English and, due to their imperfect nature, i.e. terse grammar, they may deteriorate the performance of the search results (Etzkorn et al., 2001; Arnaoudova et al., 2013).

The hypothesis I investigate is that superior results can be achieved without drawing on past history by utilising only the information, i.e. file names, available in the current bug report and considering source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

My research seeks to offer a more efficient and light-weight IR approach, which does not require any further analysis, e.g. to trace executed classes by re-running the scenarios described in the bug reports. Moreover I aim to provide a simple solution, which contributes to an ab-initio applicability, i.e. from the very first bug report submitted for the very first version, and can also be applied to new feature requests.

1.5 Research Questions

To address my hypothesis, I first undertake a preliminary investigation of eight applications to see whether vocabulary alone provides good enough leverage for maintenance. More precisely I am interested in comparing the vocabularies of project artefacts (i.e. text documentation, bug reports and source code) to determine whether (1) the source code identifier names properly reflect the domain concepts in developers' minds and (2) identifier names can be efficiently searched for concepts to find the relevant files for implementing a given bug report. Thus my first research question (RQ) is as follows.

RQ1: Do project artefacts share domain concepts and vocabulary that may aid code comprehension when searching to find the relevant files during software maintenance?

Insights from RQ1 revealed that despite good conceptual alignment among the artefacts, using concepts when searching for the relevant source files results in low precision and recall. I was able to improve the recall by simply using all of the vocabulary found in the bug report, but precision remained low. To improve the suggestion of relevant source files during bug localisation, I conclude that heuristics based on words in certain key positions within the context of bug reports should be developed.

Current state-of-the-art approaches for Java programs (Zhou et al., 2012a; Wong et al., 2014; Saha et al., 2013; Wang and Lo, 2014; Ye et al., 2014) rely on project history to improve the suggestion of relevant source files. In particular they use similar bug reports and recently modified files. The rationale for the former is that if a new bug report x is similar to a previously closed bug report y, the files affected by y may also be relevant to x. The rationale for the latter is that recent changes to a file may have led to the reported bug. However, the observed improvements using the history information have been small. I thus wonder whether file names mentioned in the bug report descriptions can replace the contribution of historical information in achieving comparable performance, and ask my second research question as follows.

RQ2: Can the occurrence of file names in bug reports be leveraged to replace project history and similar bug reports to achieve improved IR-based bug localisation?

Previous studies performed by Starke et al. (2009) and Sillito et al. (2008) reveal that text-based searches available in current integrated development environments (IDE) are inadequate because they require search terms to be precisely specified, otherwise irrelevant results or no results are returned. It is highlighted that the large result sets returned by the IDE tools cause developers to analyse several files before performing bug-fixing tasks.

Hence, interested in finding out how developers perceive the search results of my approach, which presents a ranked list of candidate source code files that may be relevant for the bug report at hand during software maintenance, I ask my third research question as follows.

RQ3: How does the approach perform in industrial applications and does it benefit developers by presenting the results ranked in the order of relevance for the bug report at hand?

The added substantial value of my proposed approach compared to existing work is that it does not require past information like version history or similar bug reports that have been closed, nor the tuning of any weight factors to combine scores, nor the use of .

1.6 Overview of the Chapters

This thesis is organised as follows: Chapter 2 describes the current research efforts related to my work, Chapter 3 describes the work on analysing the occurrence of concepts in project artefacts, Chapter 4 illustrates how file names can be leveraged to aid in bug localisation, Chapter 5 presents the results of a user study and Chapter 6 concludes with future recommendations.

Chapter 2. Landscape of code retrieval: Current research has proposed automated concept and bug location techniques. In this Chapter, I describe the relevant research approaches to introduce the reader to the landscape of code retrieval and summarise the limitations of the existing approaches that my research aims to address.

Chapter 3. Vocabulary analysis: In my previously published work (Dilshener and Wermelinger, 2011), which partially contributes to this Chapter, I identified the key concepts in a financial domain and searched for them in the application implementing these concepts, using the concepts referenced in bug reports; however, the number of relevant files found, i.e. the precision, was very low. I learnt that bug report descriptions are very terse and action oriented, and that the unit of work often stretches beyond single concept implementations.

In this Chapter, I address my first research question by studying eight applications for which I had access to the user guide, the source code, and some bug reports. I compare the relative importance of the domain concepts, as understood by developers, in the user manual and in the source code of all eight applications. I analyse the vocabulary overlap of identifiers among the project artefacts as well as between the source code of different applications.

I present my novel IR approach, which directly scores each current file against the given bug report by assigning a score to a source file based on where the search terms occur in the source code file, i.e. class file names or identifiers. The logic also treats the comments and word stemming independently from each other when assigning a score.

Moreover, I search the source code files for the concepts occurring in the bug reports and vary the search to use all of the words from the bug report, to see if they could point developers to the code places to be modified. I compare the performance of the results and discuss the implications for maintenance.

Chapter 4. Heuristic based positional scoring: In my previously published work (Dilshener et al., 2016), which partially contributes to this Chapter, I analysed how the file names occurring in bug reports can be leveraged to improve the performance without requiring past code and similar bug reports.

In this Chapter, I address my second research question by extending my approach presented in Chapter 3 to score files also based on where the search terms are located in the bug report, i.e. the summary or stack trace. I treat each bug report and file individually, using the summary, stack trace, and file names only when available and relevant, i.e. when they improve the ranking. Subsequently, I compare the performance of my approach to eight others, using their own five metrics on their own projects, and succeeded in placing an affected file among the top-1, top-5 and top-10 files for 44%, 69% and 76% of bug reports, on average.

I also look more closely at the contribution of past history, in particular of considering similar bug reports, and at the contribution of the IR model VSM, compared to a bespoke variant. I found that VSM is a crucial component to achieve the best performance for projects with a larger number of files, which makes the use of term and document frequency more meaningful, but that in smaller projects its contribution is rather small.

Chapter 5. User case study: In this Chapter, I address my third research question by reporting on user studies conducted at 3 different companies with professional developers. My approach managed to place at least one file into the top-10 for 88% of the bug reports on average, also with different commercial applications. Additionally, developers stated that since most of the relevant files were positioned in the top-5, they were able to avoid the error-prone task of browsing long result lists.

Chapter 6. Conclusion and future directions: After describing the key findings of my research and reflecting upon the results I presented, this Chapter highlights the conclusions I draw from my research and ends by emphasising its importance for practitioners and researchers.

1.7 Summary

In my research, I embarked on a journey to improve bug localisation in software maintenance, which I describe in this thesis. Based on current literature and my industrial work experience, developers are known to search for relevant source code files using the domain concepts when performing maintenance tasks to resolve a given bug report. Although the concepts may occur in project artefacts and correlate strongly between the textual artefacts, e.g. user guide, and the source code, they may not be effective during bug localisation.

To improve the performance of the results, current research has recognised the need to use additional information sources, like project history. However, certain relevant files still can not be located. I argue that in addition to a bug report containing specific terms that support search (i.e. words that may match those used in source files), the bug reports also often include contextual information (i.e. file names) that further supports (refines) the search.

Thus, heuristics leveraging the contextual information, i.e. file names in certain positions of the bug report summary and of the stack trace, are the main contribution of my research.

Research shows that techniques to improve program understanding also have practical relevance and hence my work contributes to that area as well. Firstly, this research may benefit industrial organisations by reducing the developer's time spent on program comprehension so that the cost of implementing modifications is justifiable. Secondly, the published efforts can be used to educate the next generation of software developers who are designated to take on the challenges of maintaining applications.

Chapter 2

Landscape of Code Retrieval

In this Chapter, I describe in detail the existing research approaches in information retrieval and summarise their limitations. I highlight some of the challenges associated with identifying all the program elements relevant for a maintenance task and discuss how the challenges are addressed in current research.

According to existing literature, software maintenance tasks are documented as change requests, which are often written using the domain vocabulary or better known as domain concepts, (Rajlich and Wilde, 2002; Marcus and Haiduc, 2013). The developers who are assigned to perform a change task would need to perform concept location tasks to find the program elements implementing the concepts relevant for the task at hand.

Additionally, change requests may describe an unexpected and unintended erroneous behaviour, known as a bug, of a software system. Identifying where to make changes in response to a bug report is called bug localisation, where the change request is expressed as a bug report and the end goal is to change the existing code to correct an undesired behaviour of the software (Moreno et al., 2014).

In this Chapter, Section 2.1 introduces the concept assignment process during program comprehension and the concept location activities while performing software maintenance tasks. In Section 2.2, I present some of the popular IR techniques used to locate the program elements implementing the concepts. Section 2.3 introduces the concept location techniques used in bug localisation. In Section 2.4, the current state-of-the-art tools against which I compare my approach are described. Section 2.5 summarises the existing literature on user studies relevant to my research. I conclude this Chapter in Section 2.6 by summarising the gaps which inspired me to conduct this research.

2.1 Concept Assignment Problem

A concept is a phrase or a single term that describes a unit of human knowledge (thought) existing within the domain of the software application (Rajlich and Wilde, 2002). Biggerstaff et al. (1993) argued that the comparison between computational and human interpretation of concepts confronts us with human oriented terms, which may be informal and descriptive, e.g. “deposit amount into savings account”, against compilation unit terms which are formal and structured, e.g. “if (accountType == ‘savings’) then balance += depositAmount”. Biggerstaff et al. (1993) refer to discovering human oriented concepts and assigning them to their counterparts in the application’s source code as “the concept assignment problem”.

In addition to human vs. computational concept characteristics, Marcus et al. (2005) identified two types of implementation categories to describe the characteristics of program elements (i.e. source code class files, methods etc.) implementing the concepts in an OOP language like Java. The composite functionality refers to a group of source code files and their referenced code files that implement a concept. The local functionality refers to files implementing the functionality of a concept in one file. For example, in a bank account management application, the concept of “account balance” may be implemented in the account source code file, however the concept of “balance transfer” may be implemented in multiple files.

A detailed study by Nunes et al. (2011) evaluated the implementation of the concepts by looking at their underlying characteristics with regard to their interaction (how they communicate with other class files), dependency (on other classes) and modularity (where they are located). The concept implementations are classified under the following six categories:

1. Multi partition: implementation is scattered in several modules (e.g., classes, methods).

2. Overly communicative: a set of class files implementing a concept heavily communicates with another set.

3. Sensitive: a method implementing a concept different from the one implemented by its class file.

4. Lattice: where concepts partially affect one another, like logging and debugging.

5. Overlapping: sharing program elements.

6. Code clone: duplicating program elements.

Nunes et al. (2011) found these categories are correlated among each other based on two relations: (1) can be related, representing the perspective from which the concept is viewed, e.g. overlapping vs overly communicative, and (2) can influence, referring to how the dependency between each other can result in the outcome of the other, e.g. multi partition vs. code clone. Finally, Nunes et al. (2011) observed the following common mistakes made by software developers during program comprehension due to these characteristics of the concepts:

1. Lack of explicit relations between class files.

2. Strong dependency causes confusion over what class represents what concept.

3. Difficulty in assigning a concept due to the mixture of its implementation.

4. Confusion over a block of code in a class not related to the main concept of the class.

5. Classes implementing multiple concepts are mapped to only one concept.

6. Missing concept functionality in packaging, focus on behaviour of methods.

Although Nunes et al. (2011) acknowledged certain assumptions, like bias due to programmer experience, as threats to the validity of the results, the study reveals that in certain applications, due to naming conventions, multiple program elements may seem to be implementing two independent concepts whereas they are the same concept. Conversely, two program elements may seem to be implementing one concept but, due to OOP language characteristics, e.g. overloading, they may actually be two independent concepts. In a further study, Deißenböck and Rațiu (2006) claimed that source code files may accurately reflect only a fraction of the concepts being implemented and argued that one cannot solely rely on the source code as the single source of information during concept location.

2.2 Information Retrieval as Conceptual Framework

Information retrieval deals with finding the relevant set of documents for a given query. In software maintenance, the project artefacts like the application’s source code files, i.e. classes, and sometimes the search query are treated as the documents of the IR system. IR systems use internal models like the Vector Space Model (VSM) and Latent Semantic Analysis (LSA) to aid in concept location during software maintenance. The internal model of an IR system defines certain configurable settings, for example whether the relevance and ranking of the terms, i.e. words existing in the set of documents, is calculated based on term frequency or term relations.

2.2.1 Evaluation Metrics

The results of an IR system are measured by precision, recall and in some cases by F-measure:

• Recall measures the completeness of the results and is calculated as the ratio of the number of relevant documents retrieved over the total number of relevant documents.

• Precision measures the accuracy of the results and is calculated as the ratio of the number of relevant documents retrieved over the total number of documents retrieved.

• F-measure is used as a single quality measure and is calculated as the harmonic mean of precision and recall, since it combines both; the formulas are given below.
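Written out (a standard formulation, not copied verbatim from the thesis), with relevant denoting the set of relevant documents and retrieved the set of documents returned for a query:

\[
\mathrm{recall} = \frac{|\mathit{relevant} \cap \mathit{retrieved}|}{|\mathit{relevant}|},
\qquad
\mathrm{precision} = \frac{|\mathit{relevant} \cap \mathit{retrieved}|}{|\mathit{retrieved}|},
\qquad
F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\]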

In VSM, each document and each query is mapped onto a vector in a multi-dimensional space, which is used for indexing and relevance ranking. The document vectors are ranked according to some distance function relative to the query vector. The similarities are obtained using a well-known classic metric, the term frequency (tf) and inverse document frequency (idf), introduced by Salton and Buckley (1988). The tf is the number of times a term occurs in a document and the idf is the ratio between the total number of documents and the number of documents containing the term. The idf is used to measure whether a term occurs more or less frequently across a number of documents. The product of these two measures (i.e. tf * idf) is used to score the weight of a term in a given collection of terms.
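In symbols (one common formulation; the thesis describes the plain ratio for idf, while many implementations apply a logarithm to it), the weight of term t in document d and the cosine similarity used to rank a document d against a query q are:

\[
w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_{t},
\qquad
\mathrm{idf}_{t} = \frac{N}{n_{t}},
\qquad
\mathrm{sim}(q,d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
\]

where N is the total number of documents and n_t is the number of documents containing term t.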

2.2.2 Vector Space Model and Latent Semantic Analysis

Antoniol et al. (2000a) demonstrated recovering traceability links between the source code files and its documentation by applying VSM. The aim of the study was to see if the source code class files could be traced back to the functional requirements. The study compared two different retrieval methods, VSM and a probabilistic model. In VSM the tf/idf score is used to calculate the similarity, whereas the probabilistic model computes the ranking scores as the probability that a document is related to the source code component (Antoniol et al., 2000b). The authors concluded that semi-automatically recovering traceability links between code and documentation is achievable, despite the fact that the developer has to manually analyse a number of information sources to obtain high recall values. Capobianco et al. (2009) argued that VSM has its limitations in that the similarity measure only takes into account the terms that precisely match between the search terms and the terms mapped onto the vector space.

According to Capobianco et al. (2009), large size documents produce poor similarity values due to increased dimensions of the search space. Furthermore, in VSM, the possible semantic term relationships, like the synonymy between the terms truck and lorry, are not taken into account when calculating their similarities.

To address the described limitations of VSM, Kuhn et al. (2007) demonstrated the use of Latent Semantic Analysis (LSA) to handle the semantic relationships by comparing documents at their topical (conceptual) similarity level. LSA is referred to as Latent Semantic Indexing (LSI) in software maintenance and is known to be the application of LSA to document indexing and retrieval (Kuhn et al., 2007). LSI organises the occurrences of project artefact terms in a repository, a term-by-document matrix, and applies the Singular Value Decomposition (SVD) principle. SVD is a technique used to reduce noise while keeping the relative distance between artefacts intact. According to Capobianco et al. (2009), this allows the storage of semantically related terms based on synonymy, like the association that exists between the terms engine and wheel and the terms truck or lorry, in its repository.
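As a brief reminder of the underlying linear algebra (the standard LSI formulation, not specific to this thesis), the term-by-document matrix X is factorised by SVD and truncated to a rank-k approximation:

\[
X = U \Sigma V^{T},
\qquad
X_{k} = U_{k} \Sigma_{k} V_{k}^{T}
\]

where U and V hold the term and document vectors, \Sigma holds the singular values in decreasing order, and keeping only the k largest singular values yields the reduced space in which semantically related terms and documents are compared.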

Marcus and Maletic (2003) argued that LSI exclude less frequently occurring terms from its vector space by assuming them to be less related. However, if the applications are pro- grammed using a subset of the English language then excluding terms will cause relations that may pose significance to be missed out during information retrieval. Furthermore, the model of language used in prose may not be the same as that found in source code files.

According to the authors, the selective terminology used in programming, often at an abstract level, introduces additional challenges in recovering traceability. So the issue that remains unanswered in the case study performed by Marcus and Maletic (2003) is how to map similarities between project artefacts when the pieces to link are described in broken prose.

Additionally, Marcus and Maletic (2003) found that when the underlying document space is large, improved results are obtained by using LSI, because the main principle of LSI is to reduce this large corpus to a manageable size without loss of information. However, if the corpus is small and the terms and concepts are distributed sparsely throughout the space, reducing the dimensionality results in a significant loss of information. Hence, the authors articulated the need for improving the results by combining structural and semantic information extracted from the source code and its associated documentation. Furthermore, Marcus and Maletic (2003) proposed to supplement the semantic similarity measure defined in LSI with program structural information like call-hierarchies.

2.2.3 Dynamic Analysis and Execution Scenarios

Among other software analysis approaches to concept location, a popular technique known as dynamic analysis builds upon the trace information obtained by executing application usage scenarios (Liu et al., 2007). The early work of Wilde and Scully (1995) pioneered concept location using the dynamic analysis technique. Their approach identifies the entry points of computational units for further analysis. However, the approach does not consider related elements within a group of concepts. Eisenberg and Volder (2005) extended the dynamic tracing approach by applying ranking heuristics to improve the accuracy of program elements’ relevance to concepts being searched.

Although dynamic analysis exposes a direct relation between the executed code and the scenario being run, it falls short of distinguishing between overlapping program elements, in that it does not guarantee that all the relevant elements are considered. On the other hand, static analysis organises and covers a broader spectrum of program elements, though it may fail to identify the contributing elements that belong to a specific scenario.

The current research has already recognised the need for combining both of these analysis approaches to aid in IR to support program comprehension. Eisenberg and Volder (2005) used static analysis (e.g. structural dependencies) and dynamic analysis (e.g. execution traces) as well as concept analysis techniques to relate concepts to program elements.

Poshyvanyk et al. (2007) combined LSI with Scenario Based Probabilistic (SBP) ranking to address the challenges of concept location and program comprehension. The aim was to improve the precision of finding relevant program elements implementing a concept. SBP allows the tracing of execution scenarios and lists program elements (i.e. source code classes and methods) ranked according to their similarity for a given concept when it is executed during a particular scenario. However, LSI represents how the elements relate to each other and requires results to be analysed manually based on domain knowledge. Poshyvanyk et al. (2007) claimed that in the SBP approach it is possible to replace this manual knowledge-based analysis, and argued that combining the results of both techniques improves the concept location and program comprehension process.

Based on three case studies, Poshyvanyk et al. (2007) concluded that the combined techniques perform better than either of the two techniques (LSI, SBP) independently.

Recently, Gethers et al. (2011) also compared the accuracy of different IR methods, including VSM, the Jensen and Shannon (JS) model, i.e. a similarity measurement representing the relations between data (Zhou et al., 2012b), and Relational Topic Modelling (RTM), i.e. a model of documents as collections of words and the relations between them (Chang and Blei, 2009). Gethers et al. (2011) concluded that no IR method consistently provides superior recovery of traceability links. The authors argued that IR-based techniques expose opportunities to improve the accuracy of finding program elements by combining different techniques.

Another approach to locating concepts in the source code is to apply Formal Concept Analysis (FCA). FCA defines two dimensions for a concept: extension, covering the objects (i.e. program elements) belonging to a concept, and intension, covering all the attributes that are shared by all the objects being considered. Poshyvanyk and Marcus (2007) combined the FCA technique with LSI-based IR, organising the search results in a concept lattice to aid software developers in concept location activities. The authors articulated that the intensional description of FCA allows the grouping of concepts to be better interpreted and provides a flexible way of exploring the lattice. Poshyvanyk and Marcus (2007) also claimed that the use of concept lattices in the approach offers additional clues, like the relationships between the concepts. The case study results indicate that FCA improves the concept ranking over those obtained by LSI. However, the trade-off is that when a higher number of attributes is used, the size of the nodes in the generated concept lattices increases. Thus the sheer amount of information (i.e. the increased number of attributes and nodes) may place an extra burden on developers to comprehend the concept lattice during maintenance.

2.2.4 Source Code Vocabulary

Caprile and Tonella (1999) analysed the structure and informativeness of source code identifiers by considering their lexical, syntactical, and semantic structure. This was done by first breaking the identifiers into meaningful single words and then classifying them into lexical categories. The authors concluded that identifiers reveal the domain concepts implemented by an application and thus provide valuable information to developers when performing maintenance.

Since identifiers often communicate a programmer's intent when writing pieces of source code, they are often the starting point for program comprehension. To help new programmers understand typical verb usage, Høst and Østvold (2007) automatically extracted a verb lexicon from source code identifiers. Butler et al. (2010) studied the quality of identifiers in relation to other code quality measures, for example the number of faults in the source code, cyclomatic complexity, and readability and maintainability metrics. Additionally,

Abebe et al. (2009) illustrated inconsistencies referred to as lexicon bad smells existing in source code due to excessive use of contractions, e.g. shortened usage of spoken words like arg instead of argument. The authors claimed that such inconsistencies may cause problems during software maintenance, especially when searching the source code.

In addition to identifiers, developers use lexical information found in program comments.

Even the comments, which are supposedly written in English, tend to gradually narrow into a lexically and semantically restricted subset of English (Etzkorn et al., 2001). Recently, a study by Padioleau et al. (2009) conducted with three operating systems revealed that 52.6% (roughly 736,1092) of the comments are more than just explanatory, and concluded that comments may be leveraged by various search techniques.

While the above studies (Abebe et al., 2009; Etzkorn et al., 2001) focused on smells related to identifiers and comments, the lexical anti-patterns described by Arnaoudova et al. (2013) introduced a more abstract perspective. The authors revealed inconsistencies between source code method names, parameters, return types and comments, as well as inconsistencies between source code attribute names, types, and comments, i.e. the comments did not correctly describe the intended purpose of the referenced program elements.

Existing code search approaches treat comments as part of the vocabulary extracted from the source code; however, since comments are a sublanguage of English, they can degrade the performance of the search results due to their imperfect nature, i.e. terse grammar (Etzkorn et al., 2001; Arnaoudova et al., 2013). Therefore, the terms extracted from the comments should be considered independently from those extracted from the source code identifiers, e.g. method names.

In the absence of textual documentation, one of the techniques software developers utilise during maintenance tasks is to search for the concepts over the application's source code, as Marcus et al. (2004) demonstrated by using LSI to discover the program elements implementing the concepts formulated in user queries. Kuhn et al. (2007) extended this approach to extract concepts from the source code based on the vocabulary usage and presented the semantic similarities that exist within the application's code. Both of these techniques focused on lexical similarities between the source code terms without considering the conceptual relations that exist in the application's domain, and neither of the techniques considers the linguistic properties of the terms (i.e. nouns, verbs).

OOP guidelines encourage developers to use nouns to name source files and verbs to describe methods. Shepherd et al. (2006) found that the natural language (NL) representation of an application's source code also plays an important role in program comprehension. On the role of nouns in recovering traceability, Capobianco et al. (2009) presented a method that uses only the nouns extracted from the project artefacts to improve the accuracy of IR techniques based on VSM or probabilistic models. The authors claimed that since software documentation is written using a subset of English, nouns play a semantic role and verbs play a connector role to assist in recovering traceability links between use-cases and program elements. Capobianco et al. (2009) argued that some programming languages like Java are more noun oriented while others like Smalltalk are more verb oriented.

Shepherd et al. (2005) presented an approach to discover concepts from the source code by using the lexical chaining (LC) technique. LC groups semantically related terms in a document by computing the semantic distance (i.e. strength of relationship) between two terms. To automatically calculate the distance, the path between the two terms as defined in the WordNet (Miller, 1995) dictionary and part-of-speech (PoS) tagging techniques are utilised. The approach in (Shepherd et al., 2005) exploits naming conventions in the source code by using the LC technique to discover related program elements. The authors claimed that, compared to other approaches, complex relations are also identified by relying on the semantics of a term rather than on lexical similarity.

One of the limitations of the approach presented in (Shepherd et al., 2005) is that it fails to consider that terms extracted from the source code may occur in WordNet but carry a different implied meaning within the source code. According to Sridhara et al. (2008), terms used to represent similar meanings in the computer science vocabulary could have different meanings in the English vocabulary, e.g. the term fire extracted from the identifier fireEvent is used synonymously with the term notify extracted from the identifier notifyListener, but fire and notify do not commonly relate to one another in English text.

In further work, Shepherd et al. (2006), like Parnas (1972), also argued that OOP promotes the decomposition of concepts into several source code files scattered across multiple layers of an application's architecture, as compared to procedural programming languages where all of the implementation of a concept is usually done in one source file. Shepherd et al. (2006) introduced the Action Oriented Identifier Graph (AOIG) approach, where the source code is formalised as a graphical virtual view so that relevant program elements implementing a concept can be viewed as being in one virtual file. It is stated that when concepts are grouped into a category, it is easier to understand and reason with them. One of the limitations of the AOIG approach is that it considers only those methods with a noun and a verb in their names, hence failing to process those that are declared using other combinations, like only nouns or indirect verbs.

To overcome the limitations of the AOIG approach, Hill et al. (2009) constructed a natural language representation of source code by extending the AOIG principle and applying linguistic phrase structures like nouns and semantic usage information obtained from mining the comments and term similarities. Furthermore, the authors generalised the AOIG principle by introducing indirect object usage, like noun and preposition phrases, e.g. "Calculate the interest rate", to assist concept location tasks. The introduced algorithm, called Software Word Usage Model (SWUM), automatically extracts and generates noun, verb, and prepositional phrases from method and field signatures to capture the term context of NL queries (Hill, 2010). During the information extraction to capture the word context, the approach evaluates method names and return type names with prepositions at the beginning, at the end and in the middle. By analysing the most frequently occurring identifiers in a set of 9,000 open source projects, common rules were created to determine whether an identifier is a noun, verb or preposition phrase. Once the NL phrases have been extracted from the source code, searching the generated corpus to group associated program elements is achieved by applying regular expressions. Using the common rules, terms are extracted from the method names and the signatures are transformed into NL representations, e.g. a class named "OneGenericCouponAndMultipleProductVoucherSelector" with a method signature "public List sortByActivationTime(Collections voucherCandidates)" would be transformed into "sort voucher candidates by activation time".
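The kind of transformation described above can be approximated with a small sketch. The following Python fragment is not the SWUM implementation; its splitting and phrasing rules are deliberately naive assumptions, but it illustrates how a camel-case method name and its parameter can be turned into a natural-language phrase:

    import re

    def split_identifier(identifier):
        # split camel case and underscores into lower-case terms, e.g.
        # "sortByActivationTime" -> ["sort", "by", "activation", "time"]
        parts = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', identifier.replace('_', ' '))
        return [t.lower() for t in parts.split()]

    def method_phrase(method_name, parameter_name):
        # naive verb/direct-object phrasing: the first term is treated as the verb,
        # the parameter supplies the object, and a trailing "by ..." clause is kept
        terms = split_identifier(method_name)
        verb, rest = terms[0], terms[1:]
        obj = " ".join(split_identifier(parameter_name))
        if "by" in rest:
            i = rest.index("by")
            return f"{verb} {obj} by {' '.join(rest[i + 1:])}"
        return f"{verb} {obj} {' '.join(rest)}".strip()

    print(method_phrase("sortByActivationTime", "voucherCandidates"))
    # -> "sort voucher candidates by activation time"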

A limitation of the approach by Hill (2010) is that the generated context falls short of providing strong clues for detecting all the relevant program elements, especially those where the search terms are not used in the declaration of the searched program elements. In a financial application, for example, searching for the term "covariance" using the approach detects the methods where the term covariance is present in the method signature, but fails to detect the related methods that calculate covariance where the term does not occur in the method signature.

Anquetil and Lethbridge (1998b), on assessing the relevance of program identifier names in legacy systems, stated that "being able to rely on the names of software artefacts to detect different implementations of the same concept would be very useful". To assist this, they proposed "reliable naming" conventions to enforce similarity between the program element names and the concepts they implement. However, Deißenböck and Pizka (2006) argued that conventions for program identifiers are misleading because they imply that an agreement between software developers exists, and demonstrated that developers, on the contrary, do not consistently follow one another to keep a unified standard during the development of a software project, which may impact program comprehension (Feilkas et al., 2009).

In addition, Haiduc and Marcus (2008) argued that the main challenge affecting program comprehension is the quality of program identifiers, i.e. variables in source code files, being far from perfect. The authors found that searching for program elements implementing concepts is tied to the quality of the terms used to construct these elements. It is concluded that both comments and program identifiers present a significant source of domain terms to aid developers in maintenance tasks. Haiduc and Marcus (2008) articulated that in IR it is essential for any concept mapping and location activity to look at the source code comments and program identifiers, since both provide clues to what is being programmed and what was on the developer's mind during implementation.

2.2.5 Ontology Relations

Ontology allows modelling a domain of knowledge by defining a set of basic elements, rendering shared vocabulary amongst them and describing the domain objects or concepts with their properties and relations (Arvidsson and Flycht-Eriksson, 2008). According to Gruber (1993), the definition of ontology within the context of computer science is the representation of knowledge to describe the concepts implemented in the source code files of the application's domain, the relations between the concepts and their attributes (or properties).

In a study, Hsi et al. (2003) demonstrated the recovery of core concepts implemented in an application's user interface by manually activating all the user interface elements to construct a list of terms and the interrelations among them. Subsequently, the concepts were discovered after manually analysing and identifying the nouns and indirect objects from the terms in the list. The approach established the relationships between the concepts using the ontology relation types isA (i.e. generalisation, hierarchical relation) and hasA (i.e. aggregation, part-of relation). The authors argued that ontology contributes to many software engineering tasks, for example, during reverse engineering to locate domain concepts or to aid in semantic program comprehension. They claimed that understanding an application's ontology benefits the developers in software maintenance tasks. Petrenko et al. (2008) also demonstrated the use of manually constructed ontology fragments to partially comprehend a program and formulated queries to locate concepts in the source code.

Raţiu et al. (2008) argued that despite the implementation of concepts being scattered across the source code (Nunes et al., 2011), it is possible to identify the concepts from the application and map them to predefined ontologies, because the program elements reflect the modelled concepts and the relations in program identifier names. Raţiu et al. (2008) presented a framework to establish mappings between the domain concepts and program elements. The approach uses similarity and path matching to map concepts to the program elements. It assumes that, given a set of program elements that refer to a concept, a single program element may also refer to multiple concepts, such as a source file implementing more than one concept, which is also observed by Anquetil and Lethbridge (1998a). One of the limitations of the framework presented by Raţiu et al. (2008) is that it assumes the identifier names and the concept names are made up of terms that can be directly mapped to one another. This is justified by describing that the names are organised according to the meanings that they represent, based on WordNet (Miller, 1995) synsets (i.e. isA and hasPart relations). However, a term extracted from a program identifier may not always have a synset (i.e. group of synonyms) representation.

Recently, Abebe and Tonella (2010) presented an approach that extracted domain concepts and relations from program identifiers using natural language (NL) parsing techniques. First, the terms were extracted from the program identifiers by performing common identifier splitting and NL parsing techniques. Second, linguistic sentences were constructed by feeding the extracted terms into sentence templates. The templates were manually created after analysing the file names and method signatures (i.e. method and argument names). Subsequently, the concepts and the relationships were discovered from the linguistic dependencies established by parsing these sentences with the linguistic dependency tool Minipar. Finally, these dependencies were utilised to automatically extract concepts, which were used in building the ontology, by analysing the nouns found among the terms.

Similar to Abebe and Tonella (2010), Hayashi et al. (2010) also demonstrated the use of ontology in the concept location process to recover the traceability links between a natural language sentence and the relevant program elements. The approach makes use of relational information defined in an ontology diagram and in call-graphs. The results of the search with ontological relations were compared against those without the ontology. However, in 4 out of 7 cases the precision values deteriorated when the search utilised the ontology relations. One of the reasons for this is the approach selecting false positives, possibly caused by the OOP inheritance hierarchy, where irrelevant files overriding the originally identified ones were also being selected.

2.2.6 Call Relations

In their approach to assign program elements to concepts, Robillard and Murphy (2002) claimed that structural information will only be useful when the program elements are directly related or linked to one another via call relations. The authors stated that structural information may be expressed in the source code but lost during dynamic compiler optimisations in the byte code. When performing static call analysis from byte code, overcoming issues like optimistic linking of class calls or resolving class calls by reflection requires that the source code analysis is performed over the complete code base of an application. However, in industrial environments, where an application has multiple external modules linked only during the full deployment stage, the complete source code analysis may not be feasible.

Robillard (2005) did further work on automatically generating suggestions for program investigation by extending the approach presented by Biggerstaff et al. (1993). The proposed technique ranks the program elements that may be of potential interest to the developers. It takes in a fuzzy set of elements (i.e. methods and variable names) and produces a set of methods and variables that might also be of interest. This is achieved by analysing the topology of the program's structural dependencies, like class hierarchies. The analysis algorithm works by calculating a suggestion set based on the relation type of the program elements. If the relation type specified at input is "calls", then the methods in the input set are analysed to produce a set of relevant methods that are "called by" the input set. One of the limitations of this approach is that if a novice developer has no experience with the system, he would not be in a position to provide a set of input elements. Hence the approach puts a lot of emphasis on knowledge of the application prior to using the algorithm.

Existing approaches also combine IR methods like LSI, execution tracing, and prune dependency analysis to locate concepts. In general, these approaches function by first performing a search using query terms and then navigating the call-graph of the application from the resulting list of methods to their neighbouring methods iteratively. Each method is assigned a weight to determine its relevance to the search query terms. The tf-idf score of a method is used to rank it in the results list. In addition, a weight is also assigned based on where the search term appears, i.e. in the method name, in the class file name, or both.

On the use of call relations to recover traceability links, Hill et al. (2007) and Shao and Smith (2009) proposed approaches in which a query is created with search terms, then the source code is searched for the matching methods by using LSI and a score is obtained. Subsequently, for each method on the result list, the call-graph of the application is utilised to evaluate the relevance of the neighbouring methods, and a relevance score is assigned based on where the query terms occur in the method names. Finally, both the LSI and relevance scores are combined to rank the methods in the final search result list.

In their study, Petrenko and Rajlich (2013) presented a technique called DepIR that combines Dependency Search (DepS) with IR. The technique uses an initial query to obtain a ranked list of methods. The 10 methods with the highest ranks are selected as the possible entry points to explore the shortest path on the call-graph to the method relevant for the query. The shortest path is calculated using Dijkstra's algorithm (Dijkstra, 1959) and the effort needed is calculated as the number of edges (call-relations) on the shortest path plus the IR rank. The study aims to indicate a best-case estimate of the effort needed to locate the relevant method for each bug report. The DepIR performance is compared against pure IR and DepS (a call-graph navigation technique) by using 5 systems and 10 bug reports. It is claimed that on average, DepIR required a significantly smaller effort to locate the relevant methods for each bug report. The study only aims to evaluate the theoretical effort reduction in finding the target method by combining lexical ranking with dependency navigation. DepIR does not make use of the dependencies to improve the ranking, and thus to further reduce the effort in finding the right artefacts to change.
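A best-case effort estimate of this kind can be sketched as follows (Python; a simplified reading of the DepIR idea rather than the authors' implementation, with the call-graph assumed to be an unweighted adjacency map from each method to the methods it calls):

    import heapq

    def shortest_path_length(call_graph, source, target):
        # Dijkstra over an unweighted call-graph given as {method: [called_methods]}
        dist = {source: 0}
        queue = [(0, source)]
        while queue:
            d, node = heapq.heappop(queue)
            if node == target:
                return d
            if d > dist.get(node, float("inf")):
                continue
            for neighbour in call_graph.get(node, []):
                if d + 1 < dist.get(neighbour, float("inf")):
                    dist[neighbour] = d + 1
                    heapq.heappush(queue, (d + 1, neighbour))
        return float("inf")

    def depir_effort(ranked_methods, call_graph, relevant_method, entry_points=10):
        # effort = IR rank of the chosen entry point + edges navigated to the target,
        # minimised over the top-ranked entry points
        best = float("inf")
        for rank, method in enumerate(ranked_methods[:entry_points], start=1):
            edges = shortest_path_length(call_graph, method, relevant_method)
            best = min(best, rank + edges)
        return best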

IR approaches may consider two program elements that are strongly related via structural information (i.e. OOP inheritance) to be mutually irrelevant (Petrenko and Rajlich, 2013). In turn, this may cause one method to be at the top of the result list and the other, related one at the bottom. If the relevant methods at the bottom of the list are below the threshold used as the cut-off point, they would be ignored when evaluating the performance of the approach. This may also place an additional burden on developers, because irrelevant files would need to be investigated before reaching the relevant ones at the bottom of the list. To cater for these structural relations, the static call-graph is utilised; however, call-graphs usually contain many dependencies and very deep nodes/edges that make them impractical for search and navigation purposes.

To compensate for the call-graph limitations, the presented approaches utilise pruning logic, which selectively removes subtrees (i.e. nodes) while clustering (i.e. grouping) the call relations (i.e. links/edges) found in the call-graph. On exploring the neighbouring methods, Hill et al. (2007) introduced two types of call relations: (1) methods that call many others and (2) methods called by others. It is proposed that (1) can be seen as delegating and (2) as performing functionality. Hill et al. (2007) argued that the number of edges coming into and going out of a method can also be used during clustering of a call-graph to decide if an edge should be pruned when moving a node from one cluster to another. However, one of the challenges pruning exposes is that a source code class or method may be removed completely or may be moved to a different group/cluster due to the underlying pruning logic, e.g. simulated annealing, which may cause inconsistencies when searching for the program elements implementing a concept.

2.3 Information Retrieval in Bug Localisation

Current software applications are valuable strategic assets to companies: they play a central role for the business and require continued maintenance. In recent years, research has focused on utilising concept location techniques in bug localisation to aid the challenges of performing maintenance tasks on such applications. In this section, I describe the existing research approaches and highlight their limitations.

In general, software applications consist of multiple components. When certain components of the application do not perform according to their predefined functionality, they are classified as being in error. These unexpected and unintended erroneous behaviours, also referred to as bugs, are known to be often the product of coding mistakes. Upon discovering such abnormal behaviour of the software, a developer or a user reports it in a document referred to as a bug report. Bug report documents may provide information that could help in fixing the bug by changing the relevant components, i.e. program elements of the application. Identifying where to make changes in response to a bug report is called bug localisation, where the change request is expressed as a bug report and the end goal is to change the existing program elements (e.g. source code files) to correct an undesired behaviour of the software.

Li et al. (2006) classified bugs into three categories: root cause, impact and the involved software components. The categories are further explained as follows.

• The root cause category includes memory and semantic related bugs, like improper handling of memory objects or semantic bugs, which are inconsistent with the original design requirements or the programmers' intention.

• The impact category includes performance and functionality related bugs, like the program keeps running but does not respond, halts abnormally, mistakenly changes user data, or functions correctly but runs/responds slowly.

• The software components category includes bugs related to the components implementing core functionality, graphical user interfaces, runtime environment and communication, as well as database handling.

The study in (Li et al., 2006), performed with two OSS projects, showed that 81% of all bugs in Mozilla and 87% of those in Apache are semantics related. These percentages increase as the applications mature, and they have a direct impact on system availability, contributing to 43–44% of crashes. Since it takes longer to locate and fix semantic bugs, more effort needs to be put into helping developers locate them.

Although bug localisation is an instance of concept location (Moreno et al., 2014) and the previously described VSM or LSI information retrieval models are utilised to locate the relevant files, it differs from concept location with regard to the number of relevant files being identified. One of the main principles of concept location is to identify all source code files implementing a concept, whereas in bug localisation only the files that are involved in the reported bug need to be identified, which according to Lucia et al. (2012) may be just a few of the thousands of files comprising an application.

Zhou et al. (2012a) proposed an approach consisting of the four traditional IR steps (corpus creation, indexing, query construction, retrieval & ranking) but using a revised Vector Space Model (rVSM) to score each source code file against the given bug report. In addition, each file gets a similarity score (SimiScore) based on whether the file was affected by one or more closed bug reports similar to the given bug report. Similarity between bug reports is computed using VSM and combined with rVSM into a final score, which is then used to rank files from the highest to the lowest.
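The final ranking step can be illustrated with a short Python sketch. The linear weighting below only illustrates the idea of combining the two scores; the alpha value and the score normalisation used by Zhou et al. (2012a) are not reproduced here:

    def combined_score(rvsm_score, simi_score, alpha=0.3):
        # alpha controls how much weight the similar-bug-report score receives;
        # the value 0.3 is an assumption for illustration only
        return (1 - alpha) * rvsm_score + alpha * simi_score

    def rank_files(scores):
        # scores: {file_name: (rvsm_score, simi_score)} -> file names ranked high to low
        return sorted(scores, key=lambda f: combined_score(*scores[f]), reverse=True)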

2.3.1 Evaluation Metrics

In bug localisation a search query is formulated from a bug report, which is substantially different in structure and content from other text documents, thus requiring special techniques to formulate the query (Dilshener et al., 2016). Many affected program elements may not relate to the bug directly but are rather results of the impact of the bug. For example, a bug report may describe an exception in the user interface component but its cause may lie in the component responsible for database access. Hence, measuring the performance of bug localisation with precision and recall would be inadequate and unfair; instead, the rank of the source file in the result list should be considered, because it indicates the number of files a developer would need to go through before finding the relevant one to start with, e.g. if the relevant file is ranked at the 5th position then the developer would have to investigate the four files ranked above it first (Marcus and Haiduc, 2013). Therefore, the existing research techniques measure the performance of the results as top-N Rank of Files (TNRF), Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).

• TNRF is the top-N rank, which gives the percentage of bug reports whose related files are listed in the top-N (which was set to 1, 5, and 10) of returned files.

• MAP provides a synthesised way to measure the quality of retrieved files when there is more than one related file retrieved for the bug report.

• MRR measures the overall effectiveness of retrieval for a set of bug reports.

The TNRF metric is used by Zhou et al. (2012a) as follows: given a bug report, if the top-N query results contain at least one file in which the bug should be fixed, the bug is considered to be located. The higher the metric value, the better the bug localisation performance. When a bug report has more than one related file, the highest ranked among all the related files is used to evaluate the performance.

MAP has been shown to have especially good discrimination and stability (Manning et al., 2008). The MAP is calculated as the sum of the average precision values for each bug report divided by the number of bug reports for a given project. The average precision for a bug report is the mean of the precision values obtained for all affected files listed in the result set. MAP is considered to be optimal for ranked results when the possible outcomes of a query are 5 or more (Lawrie, 2012).

MRR is a metric for evaluating a process that produces a list of possible responses to a query (Voorhees, 2001). The reciprocal rank for a query is the inverse rank of the first relevant document found. The mean reciprocal rank is the average of the reciprocal ranks of the results of a set of queries. MRR is known to measure the performance of ranked results better when the outcome of a query is less than 5 and best when just equal to one (Lawrie, 2012).
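The three metrics can be computed with a short sketch such as the following (Python, illustrative; it assumes that for each bug report the ranked file list and the set of affected files are available, and the average precision follows the per-report description above):

    def top_n(ranked, relevant, n):
        # 1 if at least one affected file appears in the first n results, else 0
        return int(any(f in relevant for f in ranked[:n]))

    def average_precision(ranked, relevant):
        # precision at the rank of each affected file found, averaged over those files
        hits, precisions = 0, []
        for i, f in enumerate(ranked, start=1):
            if f in relevant:
                hits += 1
                precisions.append(hits / i)
        return sum(precisions) / len(precisions) if precisions else 0.0

    def reciprocal_rank(ranked, relevant):
        # inverse rank of the first affected file found
        for i, f in enumerate(ranked, start=1):
            if f in relevant:
                return 1.0 / i
        return 0.0

    def evaluate(results):
        # results: list of (ranked_file_list, set_of_affected_files), one per bug report
        n_reports = len(results)
        tnrf10 = sum(top_n(r, rel, 10) for r, rel in results) / n_reports
        map_ = sum(average_precision(r, rel) for r, rel in results) / n_reports
        mrr = sum(reciprocal_rank(r, rel) for r, rel in results) / n_reports
        return tnrf10, map_, mrr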

2.3.2 Vocabulary of Bug Reports in Source Code Files

In general, a typical bug report document provides multiple fields where information pertaining to the reported issue may be described, such as a summary of the problem, a detailed description of the conditions observed, the date of the observed behaviour, and the names of the files changed to resolve the reported condition. Recent empirical studies provide evidence that many terms used in bug reports are also present in the source code files (Saha et al., 2013; Moreno et al., 2013). Such bug report terms are an exact or partial match of code constructs (i.e. class, method or variable names and comments) in at least one of the files affected by the bug report, i.e. those files actually changed to address the bug report. Moreno et al. (2013) show that (1) the bug report documents share more terms with the corresponding affected files than with other files in a system and (2) the shared terms are not randomly distributed in the patched source files; instead, the file names had more shared terms than other locations. The similarity between the terms extracted from the bug reports and those extracted from the source files shows that there is a correlation between the two, and it gets stronger when the files are the affected ones. In addition, the terms were similar between the file names and comments rather than other parts of the source file, e.g. method names.

Subsequently, Moreno et al. (2013) discovered that source code file names revealed more of the terms extracted from the bug reports than any other location of a source file, e.g. the method names or variables. The authors evaluated their study using 6 OSS projects, containing over 35,000 source files and 114 bug reports, which were solved by changing 116 files. For each bug report and source file combination (over 1 million), they discovered that on average 75% share between 1–13 terms, 22% share nothing and only 3% share more than 13 terms. Investigating the number of shared terms between the bug reports and the changed set of files also revealed that 99% shared vocabulary.

Furthermore, Moreno et al. (2013) focused on the code location of the terms in the changed pair set and found that the largest contributor to the source code vocabulary is the comments (27%), followed by the method names (20%) and then the identifier names (13%). Investigating the contribution of each code location to the total shared terms with the bug reports revealed that comments contributed 27%, methods 17% and variables 13% respectively. Additionally, the study revealed that source code file names are made up of a few terms, but all of them contributed to the number of shared terms between a bug report and its affected files. For example, file names are made up of only 2 or 3 terms, e.g. in a financial application the source code file name MarketRiskCalculator is made up of 3 terms, market, risk and calculator, after camel case splitting (Butler et al., 2011). Moreno et al. (2013) discovered that on average in 54% of the cases all of those file name terms are present in the bug reports. Therefore, the authors conclude that bug reports and source files share information, i.e. terms, and that the number of shared terms is directly related to the changed files.

A study done by Saha et al. (2013) also discovered that class file names are typically a combination of 2–4 terms and that the terms are present in more than 35% of the bug report summary fields and 85% of the bug report description fields of the OSS project AspectJ.

Furthermore, the exact file name is present in more than 50% of the bug descriptions. The authors concluded that when the terms from these locations are compared during a search, the noise is reduced automatically due to reduction in search space. For example, in 27

AspectJ bug reports, at least one of the file names of the fixed files was present as-is in the bug report summary, whereas in 101 bug reports at least one of the terms extracted from the file name was present. Although the summary field contained only 3% of the total terms found in the bug report, at least one of the terms extracted from the comments, file, method or variable names was found in the bug report summaries in 82%, 35%, 72% or 43% of the cases respectively.

Finally, Moreno et al. (2013) experimented with using the shared terms to search for relevant files using LSI, Lucene (a well-known Java-based framework providing tools to index and search documents, http://lucene.apache.org/java/docs/index.html) and shared-terms only, which revealed that the shared-terms search outperformed LSI in 63% of the cases. Although Lucene outperformed both LSI and shared-terms, there were a few cases where shared-terms performed 22% better than Lucene. One of the limitations of this approach is that, since the bug reports already documented the affected files, the number of shared terms was easily identified and constrained to the reported average (i.e. between 1–13). Subsequently, performing a search by utilising only those shared terms might have helped reduce the number of false positives, i.e. less relevant but accidentally matched source code class files.

While both studies (Saha et al., 2013; Moreno et al., 2013) have shown that there is some textual overlap between the terms used in bug reports and those in the source code files, this might not always be the case due to the difference between the source code terms used by developers and the ones used by those creating the bug reports; for example, Bettenburg et al. (2008) showed that bug reports may be tersely described and may contain test scenarios rather than developer oriented technical details.

2.3.3 Using Stack Trace and Structure

It has been argued that existing approaches treat the source code files as one whole unit and the bug report as one plain text. However, bug reports may contain useful information like stack traces, code fragments, patches and recreation steps (Bettenburg et al., 2008). A study by Schröter et al. (2010) explored the usefulness of the stack trace information found in bug reports. Figure 1.3 in Chapter 1 shows a stack trace, which lists a sequence of files that were executing at the time of the error. Schröter et al. (2010) investigated 161,500 bug reports in the Eclipse project bug repository and found that around 8% (12,947/161,500) of bug reports contained stack trace information. After manually linking 30% (3940/12,947) of those closed bug reports to their affected files, 60% (2321/3940) revealed that the file modified to solve the bug was one of the files listed in the stack trace or stack (ordered list of method/file names).

Schröter et al. (2010) note that stack traces identified in the study contained up to 1024 frames and a median of 25 frames. Out of 2321 bug reports with a stack trace, in 40% the affected file was in the very first position of the stack trace and in 80% it was within the first 6 positions. It is not clear if non-project specific file names are ignored during the evaluation of the stack frames. Investigating the affected files revealed that 70% of the time the affected file is found in the first trace and in 20% it is found in the second or third trace. Finally, the authors investigated whether including stack traces in bug reports speeds up the development process and concluded that fixing bugs with stack trace information requires 2–26 days and without a stack trace 4–32 days.

Moreno et al. (2014) presented an approach that leverages the stack trace information available in bug reports to suggest relevant source code files. The authors argue that if in 60% of the cases the stack trace contains at least one of the source code files that was changed to resolve the bug (Schröter et al., 2010), then in the rest of the cases (40%) the changed source file should be a neighbour of those present in the stack trace. Based on this argument, their approach calculates two scores: (1) a textual similarity between a bug report and a source file using VSM scoring and (2) a structural similarity score based on the call-graph shortest path between each file listed in the stack trace and the called/calling class file. Finally, both the VSM (1) and stack trace (2) scores are combined using an alpha factor, which determines the importance given to the stack trace score when considering it in the final score.
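The combination can be sketched as follows (Python, illustrative; the proximity decay shape, the lambda cut-off and the alpha value shown are assumptions rather than the exact formulation of Moreno et al. (2014)):

    def stack_trace_score(file_name, stack_trace_files, call_graph_distance, lambda_hops=2):
        # Proximity of a candidate file to the files named in the stack trace:
        # 1.0 if it appears in the trace, otherwise decaying with call-graph distance,
        # and 0 beyond lambda_hops neighbours. The decay shape is an assumption.
        if file_name in stack_trace_files:
            return 1.0
        d = min((call_graph_distance(file_name, t) for t in stack_trace_files),
                default=float("inf"))
        return 1.0 / (1 + d) if d <= lambda_hops else 0.0

    def combined_file_score(vsm_score, st_score, alpha=0.8):
        # alpha is the importance given to the stack trace score; 0.8 is used here
        # only because the text reports best results for alpha between 0.7 and 0.9
        return (1 - alpha) * vsm_score + alpha * st_score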

After evaluating the approach with only 155 bug reports containing a stack trace, the results reveal that 65% of the bugs were fixed in a file mentioned in the stack trace submitted in the bug report. Out of 314 files fixed to resolve the reported bugs, 35% (109/314) were at a distance of 1 from the files listed in the stack trace, 11% (36/314) at a distance of 2 and 8% (25/314) at a distance between 3–7. Moreno et al. (2014) first varied the stack trace score's lambda factor (which determines the number of neighbouring files to consider on the call-graph) between 1–3 and the alpha factor (which determines the weight given to the stack trace score) between 0–1, and subsequently assessed the overall performance using MAP and MRR. The authors observed that increasing the lambda value to consider up to the 3rd generation of neighbours decreases the performance, and they recommend that up to 2 neighbours be considered. The best performance was gained when the alpha factor value was set between 0.7–0.9, which indicates that the VSM score needs to be aided by additional sources of information at 70%–90% ratios to achieve satisfactory results.

Furthermore, Moreno et al. (2014) noted that when two files are structurally related by call relations and one of them has a higher textual similarity to a particular query, the irrelevant file is ranked higher. In these situations, adjusting the alpha value allowed the relevant file to be ranked higher instead. The authors observed that when parts of the code that need to be changed are not structurally related to the files listed in the stack trace, the effectiveness of the approach degrades. However, they conclude that overall 50% of the files to be changed were in the top-4 ranked files. Compared to a standard VSM (Lucene), the approach achieved 53% better, 29% equal and 18% worse performance results.

One of the limitations of the study in (Moreno et al., 2014) is that it does not differentiate between the two types of call-relations, i.e. caller: a class that calls others, and callee: a class called by others (Hill et al., 2007). Software developers may follow the call hierarchies to identify callers of the candidate class file and assign higher importance to them. While n callee class files of the one listed on the stack trace are evaluated by Moreno et al. (2014), no importance is given to the callers of the one found on the stack trace.

Wong et al. (2014) proposed a method that also considers the stack trace information and performs segmentation of source files to deal with the fact that source files may produce noise due to their varying lengths (Bettenburg et al., 2008). The method involves dividing each source file into sections, so-called segments, and matching each segment to a bug report to calculate a similarity score. After that, for each source file a length score is calculated. In addition, the stack trace is evaluated to calculate a stack trace (ST) boost score. Subsequently, for each source file, the segment with the highest score is taken as the relevant one for the bug report and multiplied by the length score to derive a similarity score. Finally, the ST score is added on top of the similarity score to get the overall score.

The segmentation involves dividing the extracted terms into equal partitions, e.g. if a source file has 11 terms and the segment size is set to 4, then 3 segments are created, with the first two having 4 terms and the last one having only 3 terms. The authors varied the number of terms used in each segment between 400–1200 and concluded that 800 terms per segment achieved an optimal effect on the performance of the results.
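The segment-based scoring can be sketched as follows (Python, illustrative; the similarity, length score and stack trace boost functions are assumed to be supplied by the surrounding approach, and only the combination pattern described above is shown):

    def segment(terms, size=800):
        # split a file's term list into consecutive segments of at most `size` terms
        return [terms[i:i + size] for i in range(0, len(terms), size)] or [[]]

    def file_score(file_terms, query_terms, similarity, length_score, st_boost, size=800):
        # best segment score, scaled by the length score, plus the stack trace boost
        best = max(similarity(seg, query_terms) for seg in segment(file_terms, size))
        return best * length_score(file_terms) + st_boost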

The length score is calculated by revising the logistic function used in (Zhou et al., 2012a) with a beta factor to regulate how much favour is given to larger files. The authors varied the beta value between 30 and 90 and concluded that 50 achieved an optimal effect.

The ST boost score is calculated by considering the files listed in the stack as well as the files referenced via import statements, known as the closely referred ones, which were first introduced by Schröter et al. (2010). Additionally, the SimiScore function defined by Zhou et al. (2012a) is utilised to calculate the similarity of the bug report to previously closed bugs and to favour the files closed by those similar reports.

The approach is evaluated by measuring the performance against (Zhou et al., 2012a) with three of the same datasets used in that study, to see whether the segmentation and stack trace make a difference. Three sets of search tasks are performed: first with both segmentation and stack trace enabled, second with only segmentation and third with only ST boosting. In all three cases, two variants of the evaluation criteria are used, one considering similar bug reports and the other without. It is concluded that the approach is able to significantly outperform the one introduced in (Zhou et al., 2012a) regardless of whether similar bug reports are considered or not. The authors claim that segmentation or stack trace analysis is an effective technique to boost bug localisation and that both are compatible with the use of similar bug reports.

One of the limitations of this approach, also acknowledged by the authors, is that the effectiveness of segmentation is very sensitive to how the source files are divided, e.g. based on the number of methods or based on the number of lines. The number of terms to be included in each segment has an impact on similarity, in that a code section may get split into different segments, thus reducing relevance, or including the section in one larger segment may produce false positives in certain cases. In addition, including closely referred source files, i.e. called files, via import statements in the search space may cause false positives, as the study by Moreno et al. (2014) highlights. Furthermore, the study in (Wong et al., 2014) fails to provide evidence on how much benefit the closely referred files bring to the performance of the results.

Furthermore, there is no indication in either of the studies (Moreno et al., 2014; Wong et al., 2014) of how the approaches handle the many non-application specific file names contained in the stack trace, like those from third-party vendors or those belonging to the programming language specific API, which may be irrelevant for the reported bug.

2.3.4 Version History and Other Data Sources

Nichols (2010) argues that utilising additional sources of information would greatly aid IR models in identifying relations between bug reports and source code files. One of the information sources available in well-maintained projects is the past bug details. To take advantage of the past bugs, the author proposed an approach that mines the past bug information automatically. The approach extracts semantic information from source code files, like identifier and method names, but excludes the comments, claiming that these would lead to inaccurate results. Subsequently, two documents (i.e. search repositories) are created, one containing a list of method signatures and the other containing all the terms extracted from those methods. Finally, information from previous bug reports (e.g. title, description and affected files) is collected and appended (augmented) into the document containing the method terms. The author also manually checked where exactly the code was changed by a previous bug report and appended its vocabulary to the vocabulary of that code section.

Nichols experimented with augmented and non-augmented search repositories and found that the search results from the repository augmented with up to 14 previous bug reports produced the same results as those from the non-augmented one. The best performance is obtained when 26–30 previous bugs were used in the augmented repository. It is observed that there is no guarantee that considering previous bug information contributes towards improving the rank of a source file in the result list, and it is concluded that quality bug reports, e.g. those with good descriptions, are more valuable than the number of bug reports included in the repository. One of the limitations of the presented approach is that augmenting the repository with bug reports requires manual effort, which may be prone to errors, and the author fails to describe how its validity was verified.

Ratanotayanon et al. (2010) performed a study on diversifying data sources to improve concept location, and asked whether having more diverse data in the repository where project artefacts are indexed always produces better results. They investigated the effects of combining information from (1) change set comments added by developers when interacting with source version control systems, (2) issue tracking systems and (3) source code file dependency relations, i.e. the call-graph. They experimented with four different combinations of using these three information sources together.

1. Change sets (i.e. commit messages and the fixed files).

2. Change sets and call-graph (only the calling class files of the fixed files).

3. Change sets and bug reports.

4. All of the above.

It is observed that (1) using change sets together with bug reports produces the best results, (2) including the referencing files, i.e. the calling class files available on the call-graph, increased recall but deteriorated precision and (3) adding bug report details into the commit comments had little effect on the performance of the results.

The authors argue that change sets provide vocabulary from the perspective of the problem domain, because developers add, on each commit, descriptive comments which are more aligned with the terms used when performing a search task. However, the success of utilising the change sets in search tasks is sensitive to the quality of the comments entered. One of the reasons for the third observation performing poorly, as stated above, could be that a similar vocabulary is used in commit comments as in bug reports; hence the search was already benefiting from the commit message vocabulary and the bug report descriptions did not provide additional benefit.

Additionally, Ratanotayanon et al. (2010) claimed that refactoring of the code base severely impacts the usability of commit comments, in that they no longer match the affected class files. Finally, the authors propose that when using a call-graph, the results be presented in ranked order based on a weighting scheme, to compensate for any possible deterioration in precision due to false positives.

In the approach presented by Ratanotayanon et al. (2010), the change sets are augmented with the information extracted from the revision history, which depends on the consistency and conciseness of the descriptions entered by the developers in commit messages while interacting with source code control systems. Since different variations of code control systems exist, the applicability of the approach in environments other than the one used may constrain its generalisability.

2.3.5 Combining Multiple Information Sources

One of the earlier works on combining multiple sources of information to aid bug localisation was to apply the concept of data fusion to feature location (Revelle, 2009). Data fusion involves integrating multiple sources of data by considering each source to be an expert (Hall and Llinas, 1997; Revelle, 2009). The proposed approach by Revelle (2009) combines the structural dependencies and textual information embedded in source code as well as dynamic execution traces to better identify a feature’s source code files.

The approach by Revelle (2009) initially allows users to formulate queries and ranks all the source code files by their similarity to the query using LSI. Subsequently, using dynamic analysis, traces are collected by running a feature of the application. The source code files absent in the traces, i.e. not executed, are pruned from the ranked list. Finally, using a program call-graph, both the ranked list and the dynamic information are optimised. For example, if a file's neighbour in the call-graph was not executed and did not meet a textual similarity threshold, then the dependency is not followed. The approach was later implemented in FLAT^3, a tool that integrates textual and dynamic feature location techniques (Savage et al., 2010).

One of the limitations of the proposed approach is that it depends on the execution scenarios, which require recreation data that may no longer be available. In addition, pruning may unintentionally remove relevant files, thus negatively affecting the performance of the results. Revelle et al. also claim that a change request, e.g. a bug report, may be relevant for only a small portion of a feature, hence may not fully represent all the files pertaining to the feature.

Saha et al. (2013) claim that dynamic approaches are more complicated, time consuming and expensive than static approaches, where some sort of recommendation logic is more appealing. The authors argued that existing techniques treat source code files as flat text, ignoring their rich structure, i.e. file names, method names, comments and variables. Saha et al. (2013) proposed an approach where 8 scores are calculated based on the similarity of terms from 2 parts of a bug report (summary and description) with the terms from 4 parts of a source code file (file name, method names, variable names and comments).

The approach adopts the built-in TF.IDF model of Indri (http://www.lemurproject.org), which is based on the BM25 (Okapi) model (http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html), instead of VSM. The authors argue that different parts of text documents, e.g. bug descriptions, may cause noise due to matching lots of irrelevant terms. Based on a preliminary analysis, they discovered 6 locations from where terms originate: bug summary and description as well as source file name, methods, comments and variables.
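The 2 x 4 scoring scheme can be sketched as follows (Python, illustrative only; the field names and the per-field similarity function are placeholders rather than the Indri-based implementation of Saha et al. (2013)):

    BUG_FIELDS = ("summary", "description")
    CODE_FIELDS = ("file_name", "method_names", "variable_names", "comments")

    def structured_score(bug_report, source_file, field_similarity):
        # bug_report and source_file are dicts of field name -> text;
        # field_similarity is any text retrieval score (e.g. a BM25-style score);
        # the final score is the sum of the 2 x 4 = 8 per-field similarities
        return sum(field_similarity(bug_report[b], source_file[c])
                   for b in BUG_FIELDS for c in CODE_FIELDS)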

One of the limitations of the approach presented by Saha et al. (2013) is that it assumes all features are equally important and ignores the lexical gap between bug reports and source code files, which is problematic due to potential contamination with bug-fixing information.

Davies and Roper (2013) also proposed a combined approach that integrates 4 sources of information and implemented by considering source code’s method names instead of file name. Each individual information source is calculated as follows: (1) textual similarity between the bug description and the source code file’s method; (2) probability output by a

2http://www.lemurproject.org 3http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-a-non-binary-model-1.html

classifier indicating a previously closed bug's similarity to the method, i.e. similar bug reports;

(3) number of times a file's method was fixed, divided by the total number of bugs; and (4) inverse position of the method in stack traces contained in the bug report. The authors evaluated the approach using three OSS projects (Ant4, JMeter5, JodaTime6) and compared the performance of each information source (i.e. 1–4) as an individual approach against the combined one. The results indicate that the combined approach outperforms any of the individual approaches in terms of both top-1 and top-10 results.

One of the limitations of the approach is that it depends upon the number of previously closed bug reports used to train the classifier. For example, Davies and Roper (2013) report poor performance results for JodaTime compared to the others (Ant, JMeter), which may be due to the small number of bug reports in that project. Additionally, the source code

files may contain several methods that clutter the results list, hence burdening developers when navigating the results. For example, suppose FileA has 10 methods, FileB 20 and FileC 8. When 5 methods of FileA and 5 of FileC are relevant, but 5 methods of FileB are falsely placed in the top-10, then there is only room for 5 more methods in the top-10, either from

FileA or FileC; the rest will be placed beyond the top-10, thus negatively impacting the precision of the results and forcing developers to investigate the irrelevant methods of FileB in the top-10.

Wang and Lo (2014) proposed an approach that also considers the files found in version history commits, i.e. the number of times a source file is fixed due to a bug, based on commit messages. Instead of considering all the previous bug reports (Sisman and Kak, 2012), only recent version history commits are considered. The approach looks at when the current bug was created and compares the number of days/hours since each past commit to assign a score reflecting the relevance of those previously modified files for the current bug at hand.

For example, if a file was changed 17 hours ago, it gets a higher score than if it was changed 17 days ago.

Additionally, Wang and Lo (2014) adopted the similarity score (SimiScore) introduced by Zhou et al. (2012a), which basically considers the term similarity between a current bug and all previously reported bugs. Based on the SimiScore, the affected files of all previous bug reports are assigned a suspiciousness score to reflect their relevance for the current bug.

Finally, the authors utilised the structure score introduced in Saha et al. (2013). Once all 3

4ant.apache.org 5jmeter.apache.org 6joda-time.sourceforge.net

scores (history, bug similarity and structure) are calculated, they are combined as follows.

The approach introduced by Wang and Lo (2014) first calculates a suspiciousness score by combining the structure score and the SimiScore using an alpha (a) value of 0.2, which is the same as the one used by Saha et al. (2013). So only 20% weight is given to files from previous bug reports, as follows.

SRscore(f) = (1 - a) * structure score of file f + a * SimiScore of file f

Subsequently, the history score is combined with the SRscore in a similar way by using a beta

(b) factor, which again determines how much consideration is given to history score compared to the combined score of structure and similar bug reports, as follows.

SRHscore(f) = (1 - b) * SRscore(f) + b * history score of file f

Wang and Lo (2014) experimented with different values of beta and concluded that when it is incremented by 0.1, i.e. 10% more weight is given to history, the performance steadily increases up to a beta value of 0.3, and beyond that the performance decreases. Compared to other history-based bug localisation techniques, the MAP performance is improved by 24% over the approach introduced in Zhou et al. (2012a) and 16% over the one introduced by Saha et al.

(2013). The evaluation of the approach shows that considering historical commits up to 15 days old increased the performance, 15 to 20 days did not make much difference, and considering commits up to 50 days old deteriorated the performance. Thus Wang and Lo (2014) claim that the most important part of the version history is in the commits of the last 15 to 20 days.
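The two-stage weighting just described can be pictured with a small Java sketch; the score values are placeholders and only the combination scheme follows the formulas quoted above.

// Sketch of the two-stage score combination used by AmaLgam-style approaches.
// structureScore, simiScore and historyScore are assumed to be pre-computed
// and normalised; alpha and beta follow the values quoted in the text.
public class TwoStageScore {

    static double srScore(double structureScore, double simiScore, double alpha) {
        return (1 - alpha) * structureScore + alpha * simiScore;
    }

    static double srhScore(double srScore, double historyScore, double beta) {
        return (1 - beta) * srScore + beta * historyScore;
    }

    public static void main(String[] args) {
        double structure = 0.7, simi = 0.4, history = 0.9;
        double sr = srScore(structure, simi, 0.2);   // 20% weight to similar bug reports
        double srh = srhScore(sr, history, 0.3);     // 30% weight to version history
        System.out.printf("SRscore=%.2f SRHscore=%.2f%n", sr, srh);
    }
}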

Recently, Ye et al. (2014) claimed that if a source file has recently been changed then it may still contain defects, and if a file is changed frequently then it may be more likely to have additional errors. The authors argued that there is a significant inherent mismatch between the descriptive natural language vocabulary of a bug report and the vocabulary of a programming language found in a source code file. They propose a method to bridge this gap by introducing a novel idea of enriching bug descriptions with terms from the API descriptions of library components used in the source code file.

The ranking model of the approach combines six features, i.e. scores, measuring the relationship between bug reports and source code files, using a learning-to-rank (LtR) technique:

(1) lexical similarity between bug reports and source code files; (2) API-enriched lexical similarity, using API documentation of the libraries used by the source code; (3) collaborative

filtering, using similar bug reports that were fixed previously; (4) bug fixing recency, i.e. time of the last fix (e.g. number of months passed since the last fix); (5) bug fixing frequency, i.e. how often a file got fixed; and (6) feature scaling, used to bring the previous scores (1–5) onto one scale. Their experimental evaluations show that the approach places a relevant file within the top-10 recommendations for over 70% of the bug reports of Tomcat7.

The source files are ranked based on the score obtained by evaluating each feature's weight.

Although improved results are obtained compared to existing tools, Ye et al. (2014) reported that in two datasets, AspectJ8 and Tomcat, existing tools also achieved very similar results.

One of the reasons is that in AspectJ the affected files were very frequently changed and in Tomcat the bug reports had very detailed and long descriptions. Evaluating the effectiveness of each feature on the performance shows that the best results are obtained by lexical similarity (1) and previous bug reports (2), which leaves open the question of whether considering features like version history and bug-fix frequency is really worth the effort, since other studies also report their effectiveness as rather poor.

One of the limitations the LtR approach exposes is that it makes use of machine learning, based on a set of training data, to learn the rank of each document in response to a given query. However, training data is often unavailable, particularly when developers work on new software. Furthermore, the bugs in the training set may not be representative enough of the incoming buggy files, which could limit the effectiveness of the approach in localising bugs.

2.4 Current Tools

Zhou et al. (2012a) implemented their approach in a tool called BugLocator, which was evaluated using over 3,400 reports of closed bugs (see Table 2.1) and their known affected

files from four OSS projects: the IDE tool Eclipse9, the aspect-oriented programming library

AspectJ, the GUI library SWT10 and the bar-code tool ZXing11. Eclipse and AspectJ are well-known large scale applications used in many empirical research studies for evaluating various IR models (Rao and Kak, 2011). SWT is a subproject of Eclipse and ZXing is an

Android project maintained by Google.

7http://tomcat.apache.org/ 8http://www.st.cs.uni-saarland.de/ibugs/ 9http://www.eclipse.org 10http://www.eclipse.org/swt/ 11http://code.google.com/p/zxing/

Table 2.1: Project artefacts and previous studies that also used them

Project  | Source files | Bug reports | Period            | Used also by
AspectJ  | 6485         | 286         | 2002/07 - 2006/10 | Zhou et al. (2012a), Saha et al. (2013), Wong et al. (2014), Wang and Lo (2014), Youm et al. (2015)
Eclipse  | 12863        | 3075        | 2004/10 - 2011/03 | Zhou et al. (2012a), Saha et al. (2013), Wong et al. (2014), Wang and Lo (2014)
SWT      | 484          | 98          | 2004/10 - 2010/04 | Zhou et al. (2012a), Saha et al. (2013), Wong et al. (2014), Wang and Lo (2014), Youm et al. (2015), Rahman et al. (2015)
ZXing    | 391          | 20          | 2010/03 - 2010/09 | Zhou et al. (2012a), Saha et al. (2013), Wang and Lo (2014), Rahman et al. (2015)
Tomcat   | 2038         | 1056        | 2002/07 - 2014/01 | Ye et al. (2014)
ArgoUML  | 1685         | 91          | 2002/01 - 2006/07 | Moreno et al. (2014)
Pillar1  | 4355         | 27          | 2012/03 - 2013/01 | -
Pillar2  | 337          | 12          | 2010/05 - 2011/01 | Dilshener and Wermelinger (2011)

The rVSM and similarity scores introduced by Zhou et al. (2012a) are each normalised to a value from 0 to 1 and combined into a final score:

(1 - w) * normalised rVSM + w * normalised SimiScore, where w is an empirically set weight.

The final score is then used to rank files from the highest to the lowest. For two of these projects w was set to 0.2 and for the other two w=0.3. The performance of BugLocator was evaluated with 5 metrics: Top-1, Top-5, Top-10, MAP and MRR. Across the four projects,

BugLocator achieved a top-10 of 60–80%, i.e. for each project at least 60% of its bug reports had at least one affected file among the first 10 suggested files. One of the limitations of this approach is that the parameter w is tuned on the same dataset that is used for evaluating the approach, which means that the reported results correspond to the training performance. It is therefore unclear how well their model generalises to bug reports that were not used to tune it.
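As an illustration of this final-score computation, the following Java sketch combines two per-file scores with the weight w; the min-max normalisation to the 0–1 range is an assumption made for the example rather than a statement of how BugLocator normalises its scores.

import java.util.HashMap;
import java.util.Map;

public class CombinedScore {

    // Min-max normalisation of a score map to the range [0, 1] (an assumption
    // made for this illustration).
    static Map<String, Double> normalise(Map<String, Double> scores) {
        double min = scores.values().stream().min(Double::compare).orElse(0.0);
        double max = scores.values().stream().max(Double::compare).orElse(1.0);
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            out.put(e.getKey(), max == min ? 0.0 : (e.getValue() - min) / (max - min));
        }
        return out;
    }

    // final score = (1 - w) * normalised rVSM + w * normalised SimiScore
    static Map<String, Double> combine(Map<String, Double> rvsm, Map<String, Double> simi, double w) {
        Map<String, Double> nRvsm = normalise(rvsm);
        Map<String, Double> nSimi = normalise(simi);
        Map<String, Double> combined = new HashMap<>();
        for (String file : nRvsm.keySet()) {
            combined.put(file, (1 - w) * nRvsm.get(file) + w * nSimi.getOrDefault(file, 0.0));
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Double> rvsm = Map.of("A.java", 2.0, "B.java", 1.0, "C.java", 0.5);
        Map<String, Double> simi = Map.of("A.java", 0.1, "B.java", 0.4, "C.java", 0.0);
        System.out.println(combine(rvsm, simi, 0.2));   // w = 0.2, as for two of the projects
    }
}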

Saha et al. (2013) presented the BLUiR tool, which leverages the structures inside a bug report and a source code file, as discussed earlier in Section 2.3.5. The results were evaluated using BugLocator's dataset and performance indicators. For all but one indicator for one project (ZXing's MAP), BLUiR matches or outperforms BugLocator, hinting that a different

IR approach might compensate for the lack of history information, namely the previously closed similar bug reports. Subsequently, BLUiR incorporated SimiScore, which did further improve its performance.

Wang and Lo (2014) implemented their approach (Section 2.3.5) in a tool called AmaLgam for suggesting relevant buggy source files by combining BugLocator’s SimiScore and BLUiR’s structured retrieval into a single score using a weight factor, which is then combined (using a

different weight) with a version history score that considers the number of bug-fixing commits that touch a file in the past k days. This final combined score is used to rank the files. The approach is evaluated in the same way as BugLocator and BLUiR, for various values of k.

AmaLgam matches or outperforms the other two in all indicators except one, again the MAP for ZXing.

Wong et al. (2014) implemented their proposed method, which considers stack trace information and performs segmentation of source files, in the BRTracer tool, extending the approach utilised in BugLocator. The approach is evaluated by measuring its performance against BugLocator, as described in Section 2.3.3. It is concluded that the approach is able to significantly outperform the one introduced by Zhou et al. (2012a) for all three projects, regardless of whether similar bug reports are considered.

Recently, Youm et al. (2015) introduced an approach where the scoring methods utilised in previous studies (Zhou et al., 2012a; Wong et al., 2014; Saha et al., 2013; Wang and Lo, 2014) are first calculated individually and then combined. The final score is a product of all the scores, and how much importance is given to each score is determined by varying alpha and beta parameters. The approach, implemented in a tool called BLIA, is compared against the performance of the other tools in which the original methods were first introduced.

For evaluation, only the three smaller BugLocator datasets (i.e. excluding Eclipse) are used.

Although BLIA improves the MAP and MRR values of BugLocator, BLUiR and BRTracer, it fails to outperform AmaLgam in AspectJ. The authors found that stack-trace analysis is the highest contributing factor among the analysed information for bug localisation.

In another recent study, Rahman et al. (2015) extended the approach introduced in Zhou et al. (2012a) by considering file fix frequency (fileFixFrequency) and file name match (file-

NameMatch) scores as BugLocator(rVSM + SimiScore) * fileFixFrequency + fileNameMatch.

The file fix frequency score is calculated based on the number of times a ranked file is mentioned in the bug repository as changed to resolve another bug. The file name match score is based on the file names that appear in the bug report. The file names were extracted from the bug report using very simple pattern matching techniques and collected in a list.

Subsequently during the search, if a ranked file is in the list of files, its score is boosted with a beta value.

The approach is evaluated with the SWT and ZXing projects from the BugLocator dataset

and one of their own: Guava12. The authors show improved MAP and MRR values over

BugLocator's. Rahman et al. (2015) (1) extract the file names from the bug report using very simple pattern matching techniques and (2) use a single constant value to boost a file's score when its name matches one of the extracted file names. One of the limitations of this approach is that treating all positions as equally important with only one constant value may be the reason why increasing its weight (the beta factor) beyond 30% deteriorates the results, because irrelevant files are scored higher.

2.5 User Studies

Sillito et al. (2008), investigating the questions developers ask and answer while implementing a change task, conducted two different studies: one in a laboratory setting with 9 developers who were new to the code base, and the other in an industrial setting with 16 developers who were already working with the code base. In both studies, developers were observed performing change tasks on multiple source files within a fairly complex code base using modern tools. In both cases the 30–45 minute sessions were documented with an audio recording and screenshots, and a post-session interview was conducted. The outcome of the studies allowed the authors to organise the types of questions that developers ask under the following four categories.

1. Finding focus points: identifying single classes as entry points

2. Expanding focus points: relating the single classes with their surrounding classes to find more information relevant to the task.

3. Understanding a sub-graph: understanding the functionality provided by those related classes in (2) and then building an understanding of the concepts.

4. Correlating the sub-graphs with other sub-graphs: how the functionality is mapped to another sub-set of classes providing another type of functionality.

Subsequently, the authors describe the behaviours observed while answering these questions and highlight the gaps that current development tools have when providing support to developers within the context of these questions. The authors conclude that questions of types

(1) through (3) are asked mostly by developers who are unfamiliar with the code base, while type (4) questions are asked by developers who already have expertise with the code base.

12https://github.com/google/guava/wiki

The questions asked in the third category were aimed at building an understanding of the control

flow among multiple classes as well as at understanding the functionality provided to address a domain concept. Since current IDEs do not consider information flow during the search stage, a developer would need to first get a result list of candidate files and then search for call relations. These steps, as highlighted by the authors, caused developers to lose track of where they were and often ended up starting a fresh search.

The authors also mention that developers who are more experienced with the code base often ask questions of type (4), which revolve around how one set of functionality provided by a set of class files (e.g. call relations) may map or relate to another functionality provided by another set of classes. This indicates that the control flow between these two sets may be important when ranking the class files.

Likewise, Starke et al. (2009) performed a study with 10 developers to find out how developers decide what to search for and what is relevant for the maintenance change task at hand. The study was conducted in a laboratory setting and involved 10 developers with prior Java development experience (8 of whom gained it in industrial environments) using the Eclipse IDE. Participants were randomly assigned one of 2 closed change tasks selected from the Sub-eclipse tool's issue repository. Each developer at each session was paired with a researcher (the second author) to collect data during the 30-minute study session. The researcher acted as the observer (at the keyboard), performing the instructions given by the developer

(the driver). This type of pairing allowed all dialogue exchanged between the observer and the driver to be recorded. At the end of each session, another researcher (the first author) conducted a 10-minute interview with the developer. To ensure that 30 minutes was sufficient, two pilot studies were carried out.

In each session, participants were instructed to carry out search tasks in Eclipse using the 8 available search tools. Out of 96 search queries performed, 35 returned ten or more results, 22 more than fifty, and 29 returned no results at all. Astonishingly, only 4 out of 10 developers decided to open one of the files on the result list to inspect its content for relevance. All others briefly browsed the result list and decided to perform a new search.

The authors summarise their key observations of the developers in the following five categories.

1. Forming hypotheses about what is wrong based on past experience.

2. Formulating search queries based on their hypotheses around naming conventions.

3. Looking for what might be relevant in the large search results with a very fuzzy idea.

4. Skimming through the result list rather than systematically investigating each result.

5. Opening a few files, briefly skimming through them and performing a new search rather than investigating the next result on the list.

Starke et al. (2009) state that developers look at the code base and use their experience in the domain when forming their hypotheses about what might be the problem. It is highlighted that developers found formulating search queries challenging because Eclipse search tools require the search terms to be precisely specified, otherwise no relevant results are returned.

The authors found that when large search results are returned, developers tend to lose confidence in the query and decide to search again rather than investigate what was returned.

Starke et al. (2009) propose future research on tool support for the developers to provide more contextual information by presenting the results in a ranked order, grouped within the context relevant for the search tasks at hand.

While developers ask focus-oriented questions to find entry points when starting their investigations, they use the text-based searches available in current IDEs, as highlighted in both studies (Sillito et al., 2008; Starke et al., 2009). The tools require the search terms to be precisely specified, otherwise irrelevant or no results are returned, thus causing an additional burden on developers due to repeatedly performed discovery tasks in a trial-and-error fashion. Sillito et al. (2008) also point out that additional challenges lie in assessing the high volume of information returned in the results, because most tools treat search queries in isolation and require developers to assemble the information needed by piecing together the results of multiple queries.

Recently, Kochhar et al. (2016) performed a study with practitioners about their expectations of automated fault localisation tools. The study explored several crucial parameters, such as trustworthiness, scalability and efficiency. Out of 386 responses, 30% rated fault localisation as an “essential” research topic. The study further reveals that around 74% of respondents did not consider a fault localisation session to be successful if it requires developers to inspect more than 5 program elements, and that 98% of developers indicated that inspecting more than 10 program elements is beyond their acceptability level. These findings show the importance of positioning relevant files in the top-10, especially in the top-5; otherwise the developers lose confidence.

Parnin and Orso (2011) studied users fixing 2 bugs with Tarantula, a tool that suggests a ranked list of code statements to fix a failed test. Users lost confidence if the ranked list was too long or had many false positives. Users did not inspect the ranked statements in order, often skipping several ranks and going up and down the list. For some users, the authors manually changed the ranked list, moving one relevant statement from rank 83 to 16, and another from rank 7 to rank 35. There was no statistically significant change in how fast the users found the faulty code statements. This may be further indication that the top-5 ranks are the important ones.

Xia et al. (2016) record the activity of 36 professionals debugging 16 bugs of 4 Java apps, each with 20+ KLOC. They divide the professionals into 3 groups: one gets the buggy statements in positions 1–5, another in positions 6–10, and the third gets no ranked list. On average, the groups fix each bug in 11, 16 and 26 minutes, respectively. The difference is statistically significant, which shows the ranked list is useful. Developers mention that they look first at the top-5, although some still use all of the top-10 if they are not familiar with the code. Some do an intersection of the list with where they think the bug is and only inspect those statements.

Although the bug localisation is based on failed/successful tests and the ranked list contains code lines, not files, the study emphasises the importance of the top-5 and how developers use ranked lists.

As I shall present in Chapter 4, my approach equals or improves the top-5 metric on all analysed projects, and as we shall see in Chapter 5, my user study on IR-based localisation of buggy files confirms some of the above findings on test-based localisation of buggy code lines.

2.6 Summary

Early attempts to recover traceability links using IR methods like the Vector Space Model (VSM) resulted in high recall; however, due to low precision, these approaches required manual effort in evaluating the results (Antoniol et al., 2000a). VSM has two limitations: (1) the similarity measure only takes into account the terms that precisely match between the search terms and the terms mapped onto the vector space; (2) large documents produce poor similarity values because the distance between the terms in the search space increases.

To address the limitations of VSM, previous research has applied Latent Semantic Indexing (LSI) models, which resulted in higher precision values compared to VSM

(Marcus and Maletic, 2003). Nevertheless, LSI is weak compared to VSM when the corpus created after parsing the source code is too small. Therefore, the need to improve results by considering additional sources of information, like structural source code file hierarchies, is suggested.

Current research also recognised the need for combining multiple analysis approaches with IR to support program comprehension. Techniques like dynamic and static analysis are exploited to determine the entry points from which to investigate the other program elements relevant for maintenance work (Wilde and Scully, 1995). However, no single IR method consistently provides superior recovery of traceability links (Gethers et al., 2011).

To improve the accuracy of IR models current research also exploited natural language

(NL) processing techniques, like indexing only the nouns extracted from the project artefacts, relying on the semantics of a term rather than lexical similarity (Shepherd et al., 2005).

Despite the advantages gained, one of the limitations of NL techniques lies in differentiating between how developers combine the terms during programming and the terms' meaning in the English language vocabulary. Therefore, the need to establish semantic context is raised and the use of contextual (domain specific) representations is proposed (Fry et al.,

2008).

Furthermore, to address the limitations of NL in providing the needed context, existing research exploited prepositional phrases (Hill et al., 2008) and use of reference dictionaries containing domain specific vocabulary (Hayase et al., 2011). However, these approaches fail to consider relational clues based on structural information (i.e. OOP inheritance) among program elements. The term relations also point to similarities but fall short in recovering conceptual relations in the absence of domain specific context (Sridhara et al., 2008).

The advantages of utilising ontologies to provide the required contextual information, by modelling the knowledge to aid program comprehension in software maintenance, are also exploited (Hsi et al., 2003; Petrenko et al., 2008). However, the existing approaches assume that the terms extracted from program identifiers can be directly mapped to one another without being further evaluated (Raţiu et al., 2008). Furthermore, these approaches fall short of utilising the available information, like navigating the call-graph efficiently (Hayashi et al., 2010), and expect that applications are programmed by strictly following OOP guidelines (Abebe and Tonella, 2010).

In recent years, research has focused on utilising concept location techniques in bug

localisation to address the challenges of performing maintenance tasks. Concept location also depends on the experience and intuition of the developers (Rajlich and Wilde, 2002). When either is in short supply, to address the challenges of locating program elements, current research has proposed automated concept and bug location techniques.

Bug reports may describe a situation from the perspective of a problem and may contain the vocabulary of that domain. On the other hand, software applications are coded to address a business domain by providing a solution and contain the vocabulary of that domain

(Ratanotayanon et al., 2010).

As the current literature highlights, IR approaches may consider two terms that are strongly related via structural information (i.e. OOP inheritance) to be irrelevant to each other (Petrenko and Rajlich, 2013). In turn, this may cause one source code file to be at the top of the result list and the other, related, one at the bottom. To cater for structural relations, the static call-graph is utilised; however, call-graphs usually contain many dependencies and very deep nodes/edges that make them impractical for search and navigation purposes.

To compensate for the call-graph limitations, the presented approaches utilise pruning logic to remove nodes/edges by clustering the call relations found in the call-graph (Hill et al., 2007). However, one of the challenges pruning exposes is that a source file may be removed or moved to a different group/cluster due to the underlying pruning logic, causing inconsistencies when searching for the program elements implementing a concept.

On the one hand, techniques like dynamic analysis require recreating the behaviour reported in the bug report to collect execution traces; on the other hand, techniques like mining software repositories require collecting and analysing large quantities of historical data, making both of these techniques time consuming and thus impractical in everyday industrial-scale projects (Rao and Kak, 2011).

Recent approaches utilise additional sources of information, e.g. previously fixed bug reports (i.e. similar bugs) and the number of times a source file is fixed (i.e. version history), to boost the scoring of the underlying IR model. However, it is claimed that considering similar bug reports older than 14 days adds no value (Nichols, 2010) and that version history older than

20 days decreases the performance (Wang and Lo, 2014).

In addition, recent studies utilised segmentation (Wong et al., 2014) and stack trace analysis (Moreno et al., 2014) as effective techniques to improve bug localisation. However, stack traces may often contain many non-application-specific file names, e.g. programming

language-specific APIs or files belonging to other vendors, thus deteriorating the precision of the results.

Furthermore, current studies (Zhou et al., 2012a; Saha et al., 2013; Moreno et al., 2014;

Wong et al., 2014; Wang and Lo, 2014) exposed additional challenges associated with bug localisation due to the complexity of OOP (e.g. structure) and the nature of bug report descriptions (e.g. tersely described scenarios), which require weight factors like alpha, beta and lambda values to be constantly adjusted to obtain optimal results.

2.6.1 Conclusion

In summary, current state-of-the-art IR approaches in bug localisation rely on project history, in particular previously fixed bugs and previous versions of the source code. Existing studies

(Nichols, 2010; Wang and Lo, 2014) show that considering similar bug reports older than 14 days and version history beyond 15–20 days does not add any benefit over the use of IR alone.

This suggests that the techniques can only be used where a great deal of maintenance history is available; however, the same studies also show that considering history up to 50 days deteriorates the performance.

Besides, Bettenburg et al. (2008) argued that a bug report may contain a readily identifiable number of elements, including stack traces, code fragments, patches and recreation steps, each of which should be treated separately. Previous studies also show that many bug reports contain the file names that need to be fixed (Saha et al., 2013) and that the bug reports have more terms in common with the affected files, terms which are present in the names of those affected files (Moreno et al., 2014).

Bettenburg et al. (2008) disagree with the treatment of a bug report as a single piece of text and source code files as one whole unit by existing approaches (Poshyvanyk et al., 2007; Ye et al., 2014; Kevic and Fritz, 2014; Abebe et al., 2009). Furthermore, the existing approaches treat source code comments as part of the vocabulary extracted from the code files, but since comments are a sublanguage of English, they may deteriorate the performance of the search results due to their imperfect nature, i.e. terse grammar (Etzkorn et al., 2001; Arnaoudova et al., 2013).

Therefore, as defined in Chapter 1, the hypothesis I investigate is that superior results can be achieved without drawing on past history by utilising only the information, i.e. file names, available in the current bug report and considering source code comments, stemming,

and a combination of both independently, to derive the best rank for each file.

My research seeks to offer a more efficient and light-weight IR approach, which does not require any further analysis, e.g. tracing executed classes by re-running the scenarios described in the bug reports. Moreover, it aims to provide simple usability, which contributes to ab-initio applicability, i.e. it can be used from the very first bug report submitted for the very first version and can also be applied to new feature requests, without the need to tune any weight factors to combine scores, nor the use of machine learning.

In this Chapter, I described the current IR-based research efforts relevant to software maintenance and highlighted the limitations that motivated my research. In the following

Chapter, I address my first research question by undertaking a preliminary investigation of eight applications to see whether vocabulary alone provides a good enough leverage for maintenance. I also introduce my novel approach, which directly scores each current file against the given bug report by assigning a score to a source file based on where the search terms occur in the source code file, i.e. class file names or identifiers.

Chapter 3

Relating Domain Concepts and Artefact Vocabulary in Software

In this Chapter, I address RQ1 by undertaking a preliminary study to compare the vocabularies extracted from the project artefacts (text documentation, bug reports and source code) of eight applications, to see whether vocabulary alone provides good enough leverage for maintenance. More precisely, I investigate whether (1) the source code identifier names properly reflect the domain concepts and (2) identifier names can be efficiently searched for concepts to find the relevant files for implementing a given bug report. I also introduce my novel approach, which directly scores each current file against the given bug report by assigning a score to a source file based on where the search terms occur in the source code file, i.e. class file names or identifiers.

Firstly, I compare the relative importance of the domain concepts, as understood by developers, in the user manual and in the source code. Subsequently, I search the code for the concepts occurring in bug reports, to see if they could point developers to files to be modified.

I also vary the searches (using exact and stem matching, discarding stop-words, etc.) and present the performance. Finally, I discuss the implication of my results for maintenance.

In this Chapter, Section 3.1 gives a brief introduction and defines how RQ1 is going to be addressed. In Section 3.2, I briefly describe the current research efforts related to my work. I present my approach in Section 3.3 and the results of my evaluation in Section 3.4. I discuss the results and highlight the threats to validity in Section 3.5. Finally, in Section 3.6, I end this Chapter with concluding remarks.

3.1 Introduction

Developers working on unfamiliar systems are challenged to accurately identify where and how high-level concepts are implemented in the source code. Without additional help, concept location can become a tedious, time-consuming and error-prone task.

The separation of concerns coupled with the abstract nature of OOP obscures the implementation and causes additional complexity for programmers during concept location and comprehension tasks (Shepherd et al., 2006). It is argued that OOP promotes the decomposition of concepts into several files scattered across multiple layers of an application's architecture, as opposed to procedural programming languages, where all of the implementation of a concept is usually done in one source file (Shepherd et al., 2006).

Comparison between computer and human interpretation of concepts confronts us with human-oriented terms, which may be informal and descriptive, for example, “deposit the amount into the savings account”, as opposed to compilation unit terms that tend to be more formal and structured, e.g. “if (accountType == 'savings') then balance += depositAmount”.

Furthermore, if a program is coded by using only abbreviations and no meaningful words such as the ones from its text documentation, then searching for the vocabulary found in the supporting documentation would produce no results. In such circumstances, the developer

— responsible for performing the corrective task to resolve a reported bug — is now confronted with the task of comprehending the words represented by the abbreviations before being able to link them to those described in the text document.

Motivated by the challenges described so far, in Chapter 1, I asked my first research question as follows.

RQ1: Do project artefacts share domain concepts and vocabulary that may aid code comprehension when searching to find the relevant files during software maintenance?

To address the first RQ, I investigate whether independent applications share key domain concepts and how the shared concepts correlate across the project artefacts, i.e. text documentation (user guide), bug reports and source code files. The correlation was computed pairwise between artefacts, over the instances of the concepts common to both artefacts, i.e. between the bug reports and the user guide, then between the bug reports and the source code, and finally between the user guide and the source code. I argue that identifying any agreement between the artefacts may lead to efficient software maintenance and thus ask my first

sub-research question as follows.

RQ1.1: How does the degree of frequency among the common concepts correlate across the project artefacts?

Furthermore, I am interested in discovering vocabulary similarity beyond the domain concepts that may contribute towards code comprehension. How vocabulary is distributed amongst the program elements of an application, as well as recovering traceability links between source code and textual documentation, has been recognised as an underestimated area (Deißenböck and Pizka, 2006). In addition to the words, i.e. terms, extracted from source code identifiers, I investigate the overall vocabulary of all the identifiers because my hypothesis is that, despite an overlap in the terms, developers of one application might combine them in completely different ways when creating identifiers, which may confuse other developers; hence I ask my second sub-research question as follows.

RQ1.2: What is the vocabulary similarity beyond the domain concepts, which may con- tribute towards code comprehension?

When searching for the source code files implementing the concepts at hand prior to performing software maintenance, the source code file names or bug reports reflecting the domain concepts have a direct impact on the precision and recall performance of the search results.

I aim to investigate this impact and tailor the search so that developers have fewer classes to inspect, and thus ask my third sub-research question as follows.

RQ1.3: How can the vocabulary be leveraged when searching for concepts to find the relevant files for implementing bug reports?

In answering these research questions, I compared the project artefacts of eight applications, of which two are conceptually related and address the same Basel-II Accord (Basel-II,

2006) in the finance domain. One of these applications is proprietary; its artefact vocabulary has been studied in my preliminary work (Dilshener and Wermelinger, 2011). The other application is open-source, which is fortunate, as datasets from the finance domain are seldom made available to the public. This enabled me to perform the study described in this Chapter.

3.2 Previous Approaches

The case study conducted by Lawrie et al. (2006) investigated how the usage of identifier naming styles (abbreviated, full words, single letters) assisted in program comprehension. The

authors concluded that although full words provide better results than single letters, abbreviations are just as relevant and are reported with high confidence. Additionally, the work experience and the education of the developers also play an important role in comprehension.

Lawrie et al. (2006) lobby for the use of standard dictionaries during information extraction from identifier names and argue that abbreviations must also be considered in this process. I investigate further whether, in an environment where recognisable names are used, the bug report and domain concept terms result in a higher quality of traceability between the source code and the text documentation.

Haiduc and Marcus (2008) created a list of graph theory concepts by manually selecting them from the literature and online sources. The authors first extracted identifier names and comments representing those concepts from the source code, and then checked if the words, i.e. terms, extracted from the comments are identifiable in the set of terms extracted from the identifiers. In addition, they measured the degree of lexical agreement between the terms existing in both sets. The authors concluded that although comments reflect more domain information, both comments and identifiers present a significant source of domain terms to aid developers in maintenance tasks. I also check whether independently elicited concepts, in my case from different domains, occur in identifiers, but I go further in my investigation: I compare different artefacts beyond code, and I check whether the concepts can be used to map bug reports to the code areas to be changed.

In that, my work is similar to that of Antoniol et al. (2002). Their aim was to see if the source code files could be traced back to the functional requirements. The terms from the source code were extracted by splitting the identifier names, and the terms from the documentation were extracted by normalising the text using transformation rules. They created a matrix listing the files to be retrieved by querying the terms extracted from the text documents, i.e. functional requirements. The method relied on vector space information retrieval and ranked the documents against a query. The authors compared VSM against a probabilistic IR model and measured the performance by applying precision/recall. In the vector space model (VSM), the tf/idf1 score is used to calculate the similarity, whereas the probabilistic model computes the ranking scores as the probability that a document is related to the source code component. Antoniol et al. conclude that semi-automatically recovering traceability links between code and documentation is achievable despite the fact that the developer has to

1tf/idf (term frequency/inverse document frequency) is explained in Chapter 2, Sub-section 2.2.1

analyse a large number of source files during a maintenance task. My work differs from that introduced by Antoniol et al. (2000b) as I aim towards maintenance by attempting to recover traceability between bug reports and source code files, instead of between requirements and code.

Abebe and Tonella (2010) presented an approach that extracted domain concepts and relations from program identifiers using natural language parsing techniques. To evaluate the effectiveness for concept location, they conducted a case study with two consecutive concept location tasks. In the first task, only the key terms identified from the existing bug reports were used to formulate the search query. In the second, the most relevant concept and its immediate neighbour from a concept relationship diagram were used. The results indicate improved precision when the search queries include the concepts. In this study, I also perform concept location tasks using the identified concepts referred to by the bug reports. However, I have found situations where the relevant concept terms are not adequate to retrieve the files affected by a bug report, thus I demonstrate the use of the bug report vocabulary together with the relevant concept terms, additively rather than independently from each other, to achieve improved search results.

3.3 My Approach

The subjects of this study are eight applications (see Table 2.1), of which seven are OSS projects: the IDE tool Eclipse, the aspect-oriented programming library AspectJ, the GUI library SWT, the bar-code tool ZXing, the UML diagramming tool ArgoUML and the servlet engine Tomcat. These applications were chosen because other studies had used them (see

Section 2.4 in Chapter 2), which allowed me to compare the performance of my results in Chapter

4 against the others. Additionally, I used an OSS financial application, Pillar1, and a proprietary financial application, referred to as Pillar2 due to confidentiality. Both Pillar1 and Pillar2 implement the financial regulations for credit and risk management defined by the Basel-II

Accord (Basel-II, 2006).

Pillar1² is a client/server application developed in Java and Groovy by Munich Re (a re-insurance company). The bug report documents are maintained with JIRA³, a public tracking tool.

2http://www.pillarone.org 3https://issuetracking.intuitive-collaboration.com

Pillar2 is a web-based application developed in Java at our industrial partner's site. It was in production for 9 years. The maintenance and further improvements were undertaken by five developers, including, in the past, myself; none of us were part of the initial team. The bug report documents were maintained in a proprietary tracking tool. As Table 2.1 shows, for both Pillar1 and Pillar2 I only have a small set of closed bug reports for which I also have the source code version on which they were reported. Neither application is maintained anymore.

Prior to starting the maintenance task, a developer who is new to the architecture and vocabulary of these applications is faced with the challenge of searching the application artefacts to identify the relevant sections. In order to assist the developer, I attempt to see what information can be obtained from the vocabulary of the application domain.

3.3.1 Data Processing

I obtained the datasets for AspectJ, Eclipse, SWT and ZXing from the authors of Zhou et al.

(2012a), Tomcat from the authors of Ye et al. (2014) and ArgoUML from the authors of

Moreno et al. (2014). For Pillar1 and Pillar2, I have manually constructed the datasets from their online resources. The datasets are also available online: AspectJ, SWT and ZXing at

BugLocator web-site4, Eclipse at BRTracer web-site5, ArgoUML at LOBSTER web-site6 and

Pillar1 at ConCodeSe web-site7. The Tomcat dataset is available directly from the authors of

Ye et al. (2014). Pillar2 is a proprietary software application, thus its dataset is not publicly available. Each dataset consists of source code files and bug reports with their known affected files, identified as described in Zhou et al. (2012a), Ye et al. (2014) and Moreno et al.

(2014).

I created a toolchain to process the datasets of all 8 applications. My tool, called Con-

CodeSe (Contextual Code Search Engine), consists of data extraction, persistence and search stages, as illustrated in Figure 3.1 and discussed in detail below (Sub-sections 3.3.1.1, 3.3.1.2 and 3.3.1.3). ConCodeSe utilises state-of-the-art data extraction, persistence and search APIs

(SQL, Lucene, Hibernate8). Each dataset was processed to extract words from project artefacts, e.g. source code files and bug reports, to create a corpus. Searches were then undertaken automatically for each dataset, using the domain concepts of each dataset and the vocabulary of

4https://code.google.com/archive/p/bugcenter/wikis/BugLocator. 5https://sourceforge.net/projects/brtracer/files/?source=navbar 6http://www.cs.wayne.edu/˜severe/icsme2014/#dataset 7http://www.concodese.com/research/dataset 8http://www.hibernate.org

the bug reports as search terms. I use the word ‘term’ because it also covers non-English words, prefixes, abbreviations and business terminology (Haiduc and Marcus, 2008).

Figure 3.1: ConCodeSe Data Extraction, Storage and Search stages

Figure 3.1 shows that the left hand side (1) represents the extraction of terms from the source code files and from the textual documents, e.g. bug reports and user guide. The middle part (2) shows the storage of the extracted terms. Finally, in the search stage (3), two types of search are performed: one for the occurrence of concepts in the project artefacts, i.e. bug reports, user guide and source files, and the other for the files affected by the bug reports.

3.3.1.1 Data Extraction Stage

In the first stage, information is extracted from the project artefacts, i.e. the source code files and the textual documentation (user guide, domain concepts and bug reports), in the following process steps.

Source code file processing: The Java source code files are parsed using the source code mining tool JIM9 (Butler, 2016), which automates the extraction and analysis of identifiers, i.e. class, method and field names, from source files. It parses the source code files, extracts the identifiers and splits them into terms, using the INTT10 tool (Butler et al., 2011) within

JIM. INTT uses camel case, separators and other heuristics to split at ambiguous boundaries, like digits and lower case letters. For example, the camel-case style identifier financeEntityGrantFirCorrelation, extracted from the Pillar2 source file K4MarketCapitalRisk.java, gets split into 5 terms: finance, entity, grant, fir and correlation.

9https://github.com/sjbutler/jim 10https://github.com/sjbutler/intt

Many studies have been done on identifier splitting techniques; they conclude that accurate natural language information depends on correctly splitting the identifiers into their component words

(Hill et al., 2013). The empirical study of state-of-the-art identifier splitting techniques conducted by Hill et al. independently shows that INTT is on par with other splitting techniques.
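To make the splitting step concrete, the following Java sketch applies a simple camel-case and separator heuristic; it is not INTT itself and ignores many of the ambiguous boundaries that INTT handles.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified identifier splitter: splits on separators, digit boundaries and
// lower-to-upper case transitions, then lower-cases the resulting terms.
// This approximates, but does not replicate, the heuristics used by INTT.
public class SimpleSplitter {

    static List<String> split(String identifier) {
        String spaced = identifier
                .replaceAll("[_$]+", " ")                     // separators
                .replaceAll("(?<=[a-z])(?=[A-Z])", " ")       // camel-case boundary
                .replaceAll("(?<=[A-Za-z])(?=[0-9])", " ")    // letter-digit boundary
                .replaceAll("(?<=[0-9])(?=[A-Za-z])", " ");   // digit-letter boundary
        List<String> terms = new ArrayList<>();
        for (String t : spaced.trim().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t.toLowerCase(Locale.ENGLISH));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [finance, entity, grant, fir, correlation]
        System.out.println(split("financeEntityGrantFirCorrelation"));
    }
}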

Since Pillar1 is partially programmed in Groovy, I developed a module in Java to parse, extract and split the identifiers from the Groovy source code files using steps similar to JIM and INTT as described earlier.

Bug reports processing: To process the text in the bug reports and extract terms from the vocabulary, I developed a text tokenisation module by reusing Lucene's StandardAnalyzer, because it tokenises alphanumerics, company names, email addresses, etc. The text tokenisation module includes stop-word removal to filter out those words without any significant meaning in English, e.g. ‘a’, ‘be’, ‘do’, ‘for’. I used a publicly available stop-words list11 to filter them out. For example, a bug report description such as “We need to create a component for calculating correlation.” would get tokenised into 9 terms, and after stop-word removal only 5 terms are considered: need, create, component, calculating and correlation.
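The tokenisation and stop-word removal can be sketched with Lucene as follows; the package names assume a recent Lucene release, and the stop-word list shown is a tiny stand-in for the full publicly available list used here.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch of tokenising a bug report description with Lucene's StandardAnalyzer
// and removing stop-words.
public class BugReportTokeniser {

    public static List<String> tokenise(String text) throws Exception {
        CharArraySet stopWords = new CharArraySet(
                Arrays.asList("a", "be", "do", "for", "we", "to"), true);
        List<String> terms = new ArrayList<>();
        try (StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws Exception {
        // prints [need, create, component, calculating, correlation]
        System.out.println(tokenise("We need to create a component for calculating correlation."));
    }
}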

User guide processing: Additionally, I obtained the user guide of each application from their online annex12 and saved it as a text file to ignore images, graphics, and tables.

I next extracted the words from the resulting text documents by applying the same process as described above in the bug reports processing step. For Pillar2, since it is proprietary, confidential information, names, email addresses and phone numbers were then manually removed from its text file.

Comments processing: The text tokenisation module (developed in the bug reports processing step) is also used to extract terms (by tokenising words and removing stop-words) from the source code file comments. Comments are generally coded in two styles: header and inline (Etzkorn et al., 2001). In addition, the OOP guidelines define two comment formats: implementation comments (i.e. block, single-line, trailing and end-of-line), delimited by “/*...*/” and “//...”, and documentation comments (known as “doc comments”), delimited by “/**...*/” (Oracle, 1999). Based on these delimiter conventions, a source code line is identified as a comment and the terms are extracted similarly to the process described above for the bug reports. For example, the terms here, block and comment are extracted from the block comment illustrated in Figure 3.2.

11http://xpo6.com/list-of-english-stop-words/ 12http://www.concodese.com/research/dataset

/*
 * Here is a block comment.
 */
if (customerAccountType == SAVINGS_ACCOUNT) {
    ....

Figure 3.2: Block style comment example used in a source code file
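A simplified sketch of recognising comment text by its delimiters, before passing the words on to the same tokenisation module, is shown below; it deliberately ignores corner cases such as comment delimiters inside string literals.

import java.util.ArrayList;
import java.util.List;

// Naive comment extractor: collects the text of // line comments and of
// /* ... */ or /** ... */ block comments, line by line.
public class CommentExtractor {

    public static List<String> extractComments(String source) {
        List<String> comments = new ArrayList<>();
        boolean inBlock = false;
        StringBuilder current = new StringBuilder();
        for (String line : source.split("\n")) {
            String trimmed = line.trim();
            if (inBlock) {
                current.append(' ').append(trimmed.replace("*/", "").replace("*", " "));
                if (trimmed.contains("*/")) {            // block comment ends here
                    comments.add(current.toString().trim());
                    inBlock = false;
                }
            } else if (trimmed.startsWith("/*")) {       // covers /* and /** styles
                current = new StringBuilder(trimmed.replace("/*", "").replace("*/", "").replace("*", " "));
                inBlock = !trimmed.contains("*/");
                if (!inBlock) comments.add(current.toString().trim());
            } else if (trimmed.contains("//")) {         // single-line or end-of-line comment
                comments.add(trimmed.substring(trimmed.indexOf("//") + 2).trim());
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        String src = "/*\n * Here is a block comment.\n */\nint x = 1; // end-of-line comment";
        System.out.println(extractComments(src));   // [Here is a block comment., end-of-line comment]
    }
}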

Domain concepts processing: Furthermore, I compiled the following list of concepts used in each application's domain, based on my experience and online sources. The full list of concepts is included in Appendix E.

1. For AspectJ, aspect oriented programming (AOP)13

2. For Eclipse, integrated development environment (IDE)14

3. For SWT, graphical user interface (GUI)15

4. For ZXing, barcode imaging/scanning16

5. For ArgoUML, unified modelling language17

6. For Tomcat, servlet container18

7. For Pillar1 and Pillar2, credit and risk management19

The concepts are made up of multiple words, e.g. “Investment Market Risk”, “Market

Value Calculation”, “Lambda Factors”. The list was distributed as a Likert-type survey with a “strongly agree”–“strongly disagree” scale amongst seven other developers and two business analysts who were experts within the domains, to rate each concept and to make suggestions, in order to reduce any bias I might have introduced. Only five developers and one business analyst completed the survey. The turnaround time for the whole process was less than

5 days. After evaluating the survey results and consolidating the suggestions, using the text tokenisation module, I extracted the unique single words (like ‘market’ and ‘lambda’)

13https://www.cs.utexas.edu/%7erichcar/aopwisr.html 14http://www.qnx.com/download/feature.html?programid=21031 15http://www.ugrad.cs.ubc.ca/%7ecs219/CourseNotes/Swing/intro-GUIConcepts.html 16http://www.amerbar.com/university/dictionary.asp 17http://www.uml-diagrams.org/uml-object-oriented-concepts.html 18https://docs.oracle.com/cd/E19276-01/817-2334/Glossary.html 19http://www.bis.org/publ/bcbs128.htm

occurring in the concepts as a basis for further analysis. Henceforth, those words are called concepts in this Chapter.

Word stem processing: During the previous processing steps, the text tokenisation module also performs stemming of the terms to their linguistic root in English, using Lucene's implementation of Porter's stemmer (Porter, 1997). For example, the term calculating extracted from the bug report description example given earlier in the bug reports processing step, and the term calculation extracted from the concept "Market Value Calculation" during the domain concepts processing step, are stemmed to their root calcu.
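The stemming step can be sketched with Lucene's Porter stemmer as follows; the class locations assume the analyzers-common module of a recent Lucene release.

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Sketch of reducing terms to their stems with Lucene's PorterStemFilter.
public class Stemming {

    public static void main(String[] args) throws Exception {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("calculating calculation correlation"));
        try (TokenStream stems = new PorterStemFilter(tokenizer)) {
            CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
            stems.reset();
            while (stems.incrementToken()) {
                System.out.println(term.toString());   // prints the stem of each term
            }
            stems.end();
        }
    }
}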

3.3.1.2 Persistence Stage

In the second stage, the information gathered in the previously described data extraction steps is persisted in the following steps.

Source code files: The information extracted from the source code files is stored in a Derby20 relational database within JIM. This includes the terms extracted from the identifiers (i.e. class, method and field names), the comments and their source code locations.

For example, recall that in the source code file processing step the terms "finance, entity, grant, fir and correlation" were extracted from the identifier "financeEntityGrantFirCorrelation" found in the source file "K4MarketCapitalRisk.java". These details, i.e. terms, identifier and source code file name, are stored in the JIM database tables Component Words, Identifiers and Files respectively (see Figure 3.3).

Textual documentation: The information about the textual artefacts, i.e. user guides, domain concepts and bug reports, is also stored in the same Derby database. For this,

I extended the JIM database with additional tables as illustrated in Figure 3.3. The terms extracted from the textual documentation are stored into the existing JIM Component Words table and the details (e.g. name, content) of the document being processed are saved into the respective database tables User Guide, Domain Concepts or Change Requests (see Figure 3.3).

For example, the financial domain concept “Investment Market Risk” is stored in the Domain

Concepts table and its terms investment, market and risk are stored in the Component Words table. This table also contains the source code comment terms.

Referential integrity: Subsequently, all of the terms are linked to their stem words and to their source of origin, e.g. user guide, domain concepts, and/or bug reports. This

20http://db.apache.org/derby

allows the Component Words table to sustain referential links among the artefacts to provide strong relational clues during search tasks, as seen in Figure 3.4. For example, finding all the concept terms occurring in a change request, i.e. a bug report, involves selecting all the terms of the bug report from the Component Words table and then reducing the resulting list of terms to those that have a referential link to the Domain Concepts table. Figure 3.5 illustrates the pseudo code of selecting only the domain concept terms that occur in Pillar1 bug report #PMO-2012, which results in 2 concept terms instead of 18 terms.

Figure 3.3: ConCodeSe database table extensions in addition to JIM database tables

Since the source code file names changed to resolve the reported bug are already documented in the bug reports processed, a referential link from the persisted bug reports in the

Change Requests table to the relevant file name(s) stored in the Files table is also created during this step.

To establish the referential constraints, the contextual model, i.e. the underlying database tables (see Appendix A), is created in the following order.

1. JIM – Java source code terms processing

2. Contextual model table creation

3. Groovy - source code file terms processing

4. Change requests/Bug reports processing

5. Domain concepts processing

Figure 3.4: Referential relations linking the information stored in ConCodeSe and JIM database tables.

6. User guide processing

7. Comments processing

8. Word stem processing

3.3.1.3 Search Stage

Finally, for the third stage of the process, I developed a search module in Java to read the concept terms stored, and then run SQL queries to (1) search for the occurrences of the concepts in the three artefacts (bug reports, user guide and code) and (2) search for all files that included the terms matching the concepts found in bug reports. The second search variant ranks all the files for each bug report as explained in the next Sub-section 3.3.2.

For this stage, I also coded a module in Java to select all the file names and their terms from the JIM database and create a repository of terms per source code file, which is then

select change request key, component word
from Change Requests table
join Component Words table using component word key and change request key
join Domain Concepts table using domain concept key and component word key
where change request key = ’PMO-2012’

Figure 3.5: Pseudo code example of an SQL select statement for finding domain concepts that occur in bug report #PMO-2012

indexed by using Lucene’s VSM. The idea behind VSM is that the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. Vectors represent queries

(concepts or bug reports) and documents (source code files). Each element of the vector corresponds to a word or term extracted from the query’s or document’s vocabulary. The relevance of a document to a query can be directly evaluated by calculating the similarity of their word vectors. I chose to use VSM over more advanced IR techniques such as LSI because it is simpler and known to be more effective, as shown by Wang et al. (2011).
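The sketch below shows the general pattern of indexing one Lucene document per source file and ranking the files against a bug report query with classic tf-idf (VSM) scoring. The field names, the example terms, and the use of ClassicSimilarity and ByteBuffersDirectory are assumptions for illustration; they are version-dependent and not necessarily what ConCodeSe uses internally.

```java
// Minimal, assumed sketch of VSM-style ranking of source files against a bug report query.
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class VsmRankingDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        EnglishAnalyzer analyzer = new EnglishAnalyzer();
        IndexWriterConfig cfg = new IndexWriterConfig(analyzer).setSimilarity(new ClassicSimilarity());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // one Lucene document per source file: its name plus the terms extracted from it
            Document doc = new Document();
            doc.add(new StringField("file", "K4MarketCapitalRisk.java", Field.Store.YES));
            doc.add(new TextField("terms", "finance entity grant fir correlation market capital risk", Field.Store.NO));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(new ClassicSimilarity()); // classic tf-idf cosine scoring
            Query q = new QueryParser("terms", analyzer)
                    .parse(QueryParser.escape("market value calculation is wrong for capital risk"));
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("file") + " score=" + hit.score);
            }
        }
    }
}
```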

The ranked search results are saved in a spreadsheet for additional statistical analysis like computing the Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR) values

(Boslaugh and Watters, 2008).

MAP provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability (Manning et al., 2008). The MAP is calculated as the sum of the average precision value for each bug report divided by the number of bug reports for a given project. The average precision (AP) for a bug report is the mean of the precision values obtained for all affected files listed in the result set and computed as in equation 3.1. Then the MAP for a set of queries is the mean of the average precision values for all queries, which is calculated as in equation 3.2.

AP(Relevant) = \frac{\sum_{r \in Relevant} Precision(Rank(r))}{|Relevant|} \qquad (3.1)

MAP(Queries) = \frac{\sum_{q \in Queries} AP(Relevant(q))}{|Queries|} \qquad (3.2)

MAP is considered to be optimal for ranked results when the possible outcomes of a query are 5 or more (Lawrie, 2012). As Table 3.1 shows, very few bug reports have at least 5 affected

files, but I still use the MAP to measure the performance of my approach for comparison with previous approaches in Chapter 4. As I will show in Sub-section 4.4.1.5, for all eight projects, I achieve the best MAP.

MRR is a metric for evaluating a process that produces a list of possible responses to a query (Voorhees, 2001). The reciprocal rank RR(q) for a query q is the inverse rank of the first relevant document found and computed as in equation 3.3. Then the MRR is the

average of the reciprocal ranks of results of a set of queries, and calculated as in equation 3.4.

RR(q) = \begin{cases} 0 & \text{if } q \text{ retrieves no relevant documents} \\ 1/TopRank(q) & \text{otherwise} \end{cases} \qquad (3.3)

MRR(Queries) = \frac{\sum_{q \in Queries} RR(q)}{|Queries|} \qquad (3.4)

On the other hand, MRR is known to measure the performance of ranked results better when the outcome of a query is less than 5 and best when just 1 (Lawrie, 2012), which is the majority of cases for the datasets under study (Table 3.1). The higher the MRR value, the better the bug localisation performance as I will show in my approach.
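As a concrete reading of equations 3.1 to 3.4, the sketch below computes AP, RR and their means from a ranked file list and the set of affected files of each bug report. The method and variable names are illustrative and not part of ConCodeSe.

```java
// Minimal sketch of the evaluation metrics in equations 3.1-3.4, assuming each query
// (bug report) comes with the ranked list of files returned by the tool and the set of
// files actually affected by the fix.
import java.util.List;
import java.util.Set;

public class RankingMetrics {

    // Equation 3.1: mean of the precision values at the rank of each affected file found.
    static double averagePrecision(List<String> ranked, Set<String> affected) {
        double sum = 0.0;
        int found = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (affected.contains(ranked.get(i))) {
                found++;
                sum += (double) found / (i + 1);   // precision at this rank
            }
        }
        return affected.isEmpty() ? 0.0 : sum / affected.size();
    }

    // Equation 3.3: inverse rank of the first affected file, 0 if none is retrieved.
    static double reciprocalRank(List<String> ranked, Set<String> affected) {
        for (int i = 0; i < ranked.size(); i++) {
            if (affected.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    // Equations 3.2 and 3.4: averages over all bug reports of a project.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        return values.length == 0 ? 0.0 : sum / values.length;
    }
}
```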

Table 3.1: Number of affected files per bug report (BR) in each project under study

Project     BRs with only 1 file   BRs with 2-4 files   BRs with >=5 files
AspectJ             71                    149                   66
Eclipse           1525                   1066                  484
SWT                 59                     32                    7
ZXing               14                      5                    1
Tomcat             690                    281                   85
ArgoUML             42                     32                   17
Pillar1             10                      6                   11
Pillar2              0                      1                   11

3.3.2 Ranking Files

In addition to searching for the occurrence of concepts in the project artefacts, I also use the concepts that may be referred to by a bug report to search for the files relevant for resolving the reported bug. The results of this search rank the files in order of relevance, from most to least likely, based on my scoring logic as follows.

Given a bug report and a source file, my approach computes two kinds of scores for the

file: (1) a lexical similarity score and (2) a probabilistic score given by VSM, as implemented by Lucene. The two scorings are done with four search types, each using a different set of terms indexed from the concept and the source file:

1. Full terms from the file’s code and the concept.

Table 3.2: Sample of ranking achieved by all search types in the SWT project

BR #     Affected Java file    ConCodeSe   1: Full/code   2: Full/all   3: Stem/code   4: Stem/all
100040   Menu                      2            484             3             29              2
79107    Combo                     3             26            29              3              8
84911    FileDialog                5              5            39              6             56
92757    StyledTextListener        3              4             3             75             72

2. Full terms from the file’s code and comments and the concept.

3. Stemmed terms from the file’s code and the concept.

4. Stemmed terms from the file’s code and comments and the concept.

For each of the 8 scorings, all files are ranked in descending order. Files with the same score are ordered alphabetically, e.g. if files F and G have the same score, F will be ranked above

G. Then, for each file I take the best of its 8 ranks. If two files have the same best rank, then

I compare the next best rank, and if that is tied, then their 3rd best rank, etc. For example, if file B’s two best ranks (of the 8 ranks each file has) are 1 and 1, and file A’s two best ranks are 1 and 2, then B will be ranked overall before A.
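The "best of 8 ranks" ordering with lexicographic tie-breaking could be implemented as in the sketch below, assuming the 8 ranks per file (4 search types times 2 scoring methods) have already been computed. The FileRanks record and its field names are illustrative, not ConCodeSe's actual types.

```java
// Sketch: order files by their sorted rank vectors (best rank first), then alphabetically.
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class BestRankOrdering {

    record FileRanks(String fileName, int[] ranks) {    // ranks.length == 8
        int[] sortedRanks() {
            int[] copy = ranks.clone();
            Arrays.sort(copy);                          // best (smallest) rank first
            return copy;
        }
    }

    static final Comparator<FileRanks> BY_BEST_RANKS =
        Comparator.comparing(FileRanks::sortedRanks, Arrays::compare)   // 1st, 2nd, 3rd best...
                  .thenComparing(FileRanks::fileName);                  // alphabetical last resort

    public static void main(String[] args) {
        List<FileRanks> files = List.of(
            new FileRanks("A.java", new int[] {1, 2, 7, 9, 12, 30, 44, 80}),
            new FileRanks("B.java", new int[] {1, 1, 5, 8, 15, 22, 61, 90}));
        // B.java is ranked before A.java because its two best ranks are 1 and 1
        System.out.println(files.stream().sorted(BY_BEST_RANKS).map(FileRanks::fileName).toList());
    }
}
```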

The rationale for this approach is that during experiments, I noticed that when a file could not be ranked among the top-10 using the full terms from the file’s code, it was often enough to use associated comments and/or stemming. As shown in Table 3.2, for SWT’s bug report (BR) #100040, the affected Menu.java file has a low rank (484th) using the first type of search. When the search includes comments, stemming or both, it is ranked at 3rd,

29th and 2nd place respectively. The latter is the best rank for this file and thus returned by

ConCodeSe (3rd column of the table).

There are cases when using comments or stemming could deteriorate the ranking, for example because it helps irrelevant files to match more terms with a bug report and thus pushes affected classes down the rankings. For example, in BR #92757, the affected file StyledTextListener.java is ranked much lower (75th and 72nd) when stemming is applied. However, by taking the best of the ranks, ConCodeSe is immune to such variations.

The lexical similarity scoring and ranking is done by a function, which takes as arguments a bug report and the project’s files and returns an ordered list of files. The function assigns a score to the source file based on where the search terms (without duplicates) occur in the source file. Search terms can be the domain concepts referred by the bug report or all the

Algorithm 3.1 scoreWithFileTerms: Each matching term in the source file increments the score slightly

    input:  file: File, queryTerms: concept or bug report terms
    output: score: float

    for each query term in queryTerms do
        if (query term = file.name) then
            score := score + 2
            return score
        else if (file.name contains query term) then
            score := score + 0.025
        else
            for each doc term in file.terms do
                if (doc term = query term) then
                    score := score + 0.0125
                end if
            end for
        end if
    end for
    return score

vocabulary of the bug report. The function is called 4 times, for each search type listed previously. Each occurrence of each search term in the file increments slightly the score, as shown in function scoreWithFileTerms (Algorithm 3.1). As explained before, the bug report or domain concept terms and the source file terms depend on whether stemming and/or comments are considered.

The rationale behind the scoring values is that file names are treated as the most important elements and are assigned the highest score. When a query term is identical to a file name, it is considered a full match (no further matches are sought for the file) and a relatively high score (adding 2) is assigned. The occurrence of the query term in the file name is considered to be more important (0.025) than in the terms extracted from identifiers, method signatures or comments (0.0125).
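A direct Java transcription of Algorithm 3.1 is sketched below. The SourceFile type and its fields are assumptions; only the scoring constants (2, 0.025, 0.0125) and the control flow, including the early return on a full file name match, follow the algorithm as described.

```java
// Sketch: lexical similarity scoring of one source file against the query terms.
import java.util.List;
import java.util.Set;

public class LexicalScorer {

    record SourceFile(String name, List<String> terms) {}   // assumed data holder

    static float scoreWithFileTerms(SourceFile file, Set<String> queryTerms) {
        float score = 0f;
        for (String queryTerm : queryTerms) {
            if (queryTerm.equals(file.name())) {
                // full match on the file name: add the highest reward and stop scoring
                return score + 2f;
            } else if (file.name().contains(queryTerm)) {
                score += 0.025f;                      // partial match on the file name
            } else {
                for (String docTerm : file.terms()) {
                    if (docTerm.equals(queryTerm)) {
                        score += 0.0125f;             // match on identifier/comment terms
                    }
                }
            }
        }
        return score;
    }
}
```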

Pure textual similarity may not be able to distinguish the actual buggy file from other

files that are similar but unrelated (Wang et al., 2015). Therefore, the scoring is stopped upon full match because continuing to score a file using all its terms may deteriorate the performance of the results: if a file has more matching terms with the bug report, it is scored higher and falsely ranked as more relevant, which is also observed by Moreno et al. (2014).

For example, one of the affected source code files (i.e. SWT.java) of SWT bug #103863 is ranked in the 2nd position, but if I continue with scoring all the terms available in the source

file, then SWT.java is ranked in the 3rd position. The reason is the irrelevant file Control.java

has more matching terms with the bug report and is ranked in the 1st position, causing the rank of the relevant file to degrade. This is prevented when the scoring is stopped: because the bug report search term “control” matches the file name early in the iterative scoring logic, the score is incremented by 2 and the file’s otherwise low accumulated score remains unchanged. In the case of scoring SWT.java, the search term “swt” matches the file name much later, hence the accumulated score is much higher prior to incrementing it by 2 and stopping. Thus the irrelevant file Control.java is ranked lower than the relevant file SWT.java.

The score values were chosen by applying the approach to a small training dataset, consisting of 51 randomly selected bug reports from SWT and 50 from AspectJ, i.e. (50+51)/4665 = 2.17% of all bug reports, and making adjustments to the scores in order to tune the results for the training dataset.

3.4 Evaluation of the results

In this Section, I present my analysis results obtained by searching the concept occurrences and the vocabulary coverage of the terms across the project artefacts. Using statistical

Spearman tests, I demonstrate how the concepts correlate in the project artefacts. In addition,

I show the results of searching for source code class file names matching the concepts referred to in the bug reports, as may be performed by developers while executing maintenance tasks.

3.4.1 RQ1.1: How Does the Degree of Frequency Among the Common Concepts Correlate Across the Project Artefacts?

To answer my first research question, I first investigated whether independent applications share key domain concepts. Again, identifying any agreement between the artefacts may lead to more effective maintenance. Table 3.3 shows the number of total words and of unique words found in each project’s artefacts. In the case of domain concepts, as they are made up of multiple words, e.g. “Investment Market Risk”, the table first reports the total number of concept words followed by the number of unique words; for example, the Pillar1 and Pillar2 applications implement the 50 financial domain concepts of the Basel-II accord, which are made up of 113 words in total and contain 46 unique words.

Searching for the exact occurrences of the concepts in the artefacts, I found the most frequently occurring (top-1) concept to be the same one in the bug reports (BR) as in the user guides (UG) for three projects: AspectJ, ArgoUML and Pillar2. However, in these

projects the top-1 concept in the source code (SRC) was different compared to BR and UG.

Table 3.3: Total and unique number of word occurrences in project artefacts

Number of words in    AspectJ   Eclipse    SWT    ZXing   Tomcat   ArgoUML   Pillar1   Pillar2
Source code (SRC)       80480    460949   28322   14016    71743     29270     92499     14190
Unique SRC words         5153      7770    2477    1772     4111      2420      4143       701
Bug reports (BR)        26233    186032    5841    1867    41331      5717      1031      1652
Unique BR words          3491     12402    1496     765     5848      1345       433       189
Concepts (DC)              18       144      94     103      114        66       113       113
Unique DC words            18       104      67      84       63        40        46        46
User guide (UG)          2534     58132    4022    3383    41271     55799     22541     14197
Unique UG words           811      3029    1068     989     3883      4427      3630       722

For example, in AspectJ the concept of aspect is the top-1 concept in BR and in UG but it is the 2nd most frequently occurring concept in SRC. Similarly in Pillar2, market is the top-1 concept in BR and UG but moves to 2nd position in SRC. For ArgoUML, diagram is the top-1 concept and becomes the 5th most frequently occurring in SRC. In the case of the other projects, all three artefacts (BR, UG, SRC) have a different top-1 most frequently occurring concept.

Subsequently, searching for stemmed occurrences of the concepts resulted in the same top-1 concept in BR and UG also for Eclipse (file) and ZXing (code); however, at the same time it changed the common top-1 concept for Pillar2. Analysing the reason why the concept of market was no longer the common top-1 concept in BR and UG revealed that stemming returned additional variations for the concept value in BR and calculation in UG, thus changing the relative ranking of the common concepts. For example, there are 513 exact occurrences of the concept calculation in the UG, but searching for the concept’s stem ‘calcul’ returns 207 additional instances (for calculate, calculated, calculating and calculation), thus changing the relative ranking of calculation to top-1 and moving market to 2nd position.

Interestingly, I noticed that Pillar1 and Pillar2, two independent applications implementing the concepts of the same financial domain, have the same most frequently occurring concepts, value and , ranked at the 3rd and 18th position in their source codes (SRC) respectively. In Appendix B, Tables B.1 and B.2 list the top-5 concepts occurring across all the projects’ artefacts.

The fact that the majority of projects share the same frequently occurring concepts between the BRs and the UGs indicates vocabulary alignment among these documents written in natural language, i.e. English. On the other hand, it is acceptable to have different frequently occurring terms between the SRC and the other artefacts, as programming languages have their own syntax and grammar.

I found that while each concept occurred in at least one artefact, only a subset of the

concepts occurred in all three artefacts (see Figures 3.6 and 3.7). The concept occurrence in single project artefacts, e.g. BR or UG on its own, is also very low, and for some projects none at all. This also applies to the combination of two project artefacts, e.g. BR and UG. However, for the Pillar1 and Pillar2 projects almost half of the concepts (21/46 or 46%, and 20/46 or 43%, respectively) occur in the intersecting set between the UG and the SRC (Figure 3.7). This is an indication of good alignment between the source code and the user guide.

Figure 3.6: Distribution of concepts among AspectJ, Eclipse, SWT and ZXing projects

On the contrary, in ZXing more than half of the barcode domain concepts are absent from the artefacts. One of the reasons for this is that ZXing is a small size bar-code application with 391 source code files, thus does not implement all of the bar-code domain concepts. The

number of common concepts between the BR and the SRC alone is also very low, and for ArgoUML and Pillar1 it is zero. This may lead to inconsistencies when searching for relevant source files unless the common concepts between those two artefacts are also the same ones as those shared with the UG, i.e. the intersecting concepts between the artefacts.

Figure 3.7: Distribution of concepts among Tomcat, ArgoUML, Pillar1 and Pillar2 projects

Additionally, I noticed that all three artefacts share a very high number of common domain concepts, e.g. 9/18 or 50% in AspectJ, 80/96 or 76% in Eclipse, 52/63 or 82% in Tomcat and 26/40 or 65% in ArgoUML. This is a good indicator for leveraging domain concepts to search for the source files that may be relevant for a bug report.

Furthermore, I noticed that 6 out of 46 Basel-II financial domain concepts are missing in

Pillar1. Through investigating the reasons, I found that this is due to the naming conventions,

for example, the financial concept ‘pd/lgd’ is referenced as ‘pdlgd’ and ‘sub-group’ as ‘subgroup’ in the project’s artefacts. In Pillar2 only 36/45 or 80% of the concepts also occur both in the code and the documentation. Since the latter is consulted during maintenance, this lack of full agreement between both artefacts may point to potential inefficiencies.

Table 3.4: Spearman rank correlation of common concepts between the project artefacts using exact concept terms (full concept term)

Project                   BR and User Guide   BR and Source   User Guide and Source
AspectJ   Correlation          0.2590             0.5247           -0.0344
          p value              0.3327             0.0369            0.8995
Eclipse   Correlation          0.6174             0.7237            0.5380
          p value              0                  0                 0
SWT       Correlation          0.5351             0.4261            0.3783
          p value              0.0002             0.0035            0.0104
ZXing     Correlation          0.4910             0.4096            0.1404
          p value              0.0011             0.0078            0.3814
Tomcat    Correlation          0.4596             0.5938            0.6375
          p value              0.0002             0                 0
ArgoUML   Correlation          0.7214             0.8037            0.6963
          p value              0                  0                 0
Pillar1   Correlation          0.1958             0.3136            0.4741
          p value              0.2260             0.0488            0.0020
Pillar2   Correlation          0.4146             0.4434            0.5064
          p value              0.0046             0.0023            0.0004

3.4.1.1 Correlation of common concepts across artefacts

Subsequently, I wanted to identify if the concepts also have the same degree of importance across artefacts, based on their occurrences. For example, among those concepts occurring both in the code and in the guide, if a concept is the n-th most frequent one in the code, is it also the n-th most frequent one in the guide? I applied Spearman’s rank correlation coefficient to determine how well the relationship between two variables can be described in terms of the ranking within each artefact (Boslaugh and Watters, 2008). The correlation was computed pair-wise between artefacts, over the instances of the concepts common to both artefacts, i.e. between the bug reports and the user guide, then between the bug reports and the source code, and finally between the user guide and the source code. Table 3.4 shows the results using the online Wessa statistical tool (Wessa, 2016).
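For illustration, Spearman's coefficient is the Pearson correlation of the rank positions, as in the sketch below. Ties receive simple (non-averaged) ranks here, which is an assumption; statistical packages such as the Wessa tool use tie-corrected ranks, so results may differ slightly. The example frequency values are made up.

```java
// Minimal sketch of Spearman's rank correlation between the occurrence counts of the
// common concepts in two artefacts.
import java.util.Arrays;
import java.util.Comparator;

public class SpearmanDemo {

    static double[] ranks(double[] values) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i]));
        double[] r = new double[values.length];
        for (int pos = 0; pos < order.length; pos++) r[order[pos]] = pos + 1;
        return r;
    }

    static double pearson(double[] x, double[] y) {
        double mx = Arrays.stream(x).average().orElse(0), my = Arrays.stream(y).average().orElse(0);
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Spearman correlation = Pearson correlation of the ranks.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] codeCounts  = {513, 120, 88, 40, 12};   // concept frequencies in the code (made up)
        double[] guideCounts = {430, 150, 60, 35, 20};   // the same concepts in the user guide (made up)
        System.out.println(spearman(codeCounts, guideCounts));
    }
}
```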

Except for AspectJ, the correlation between the artefacts is moderate and statistically significant (p-value < 0.05). In the case of AspectJ, the correlation among all of the artefacts is

weak and not statistically significant (p-value > 0.05), except between BR and SRC, where the correlation is moderate and statistically significant. I also note that for the exact search type the correlation between BR and UG for Pillar1, and between UG and SRC for ZXing, is weak and statistically not significant. This is because there are very few common concepts with exact occurrences in the bug reports.

Table 3.5: Spearman rank correlation of common concepts between the project artefacts using stemmed concept terms

Project                   BR and User Guide   BR and Source   User Guide and Source
AspectJ   Correlation          0.1643             0.4382            0.0284
          p value              0.5431             0.0895            0.9169
Eclipse   Correlation          0.5392             0.5526            0.5264
          p value              0                  0                 0
SWT       Correlation          0.5208             0.4063            0.4549
          p value              0.0002             0.0042            0.0012
ZXing     Correlation          0.4585             0.4741            0.1495
          p value              0.0020             0.0013            0.3385
Tomcat    Correlation          0.4424             0.5775            0.5611
          p value              0.0004             0                 0
ArgoUML   Correlation          0.7310             0.8291            0.7493
          p value              0                  0                 0
Pillar1   Correlation          0.3268             0.4225            0.5517
          p value              0.0370             0.0059            0.0002
Pillar2   Correlation          0.4544             0.5047            0.5897
          p value              0.0020             0.0005            3.0E-05

For stem search (Table 3.5), the correlation decreases in most cases, except for ArgoUML, Pillar1 and Pillar2, where the correlation is low for exact search but becomes stronger for stem search. This is because stemming increases the number of matched words, i.e. stemming words to their root increases their instances in the artefacts, hence it changed the relative ranking of the common concepts and the Spearman correlation became stronger and statistically more significant, as Table 3.5 shows compared to Table 3.4.

3.4.2 RQ1.2: What is the Vocabulary Similarity Beyond the Domain Con- cepts, which may Contribute Towards Code Comprehension?

By answering my first sub-research question in the previous Sub-section, I identified that the project artefacts share common domain concepts and that the conceptual agreement between the artefacts correlates well. My second sub-research question investigates any vocabulary similarity beyond the domain concepts that may contribute towards code comprehension.

Table 3.6: Common terms found in source files of each application across all projects

Application    Pillar2     Pillar1     ArgoUML     Tomcat      ZXing       SWT         Eclipse     AspectJ
Unique words       701        4143        2420        4111        1772        2477        7770        5153
AspectJ        531 (76%)  1899 (46%)  1503 (62%)  2166 (53%)  1117 (63%)  1333 (54%)  3128 (40%)      -
Eclipse        582 (83%)  2525 (61%)  1859 (77%)  2716 (66%)  1341 (76%)  2387 (96%)      -       3071 (60%)
SWT            416 (59%)  1172 (28%)  1030 (43%)  1280 (31%)   836 (47%)      -       2394 (31%)  1305 (25%)
ZXing          394 (56%)   530 (13%)   822 (34%)  1099 (27%)      -        837 (34%)  1344 (17%)  1096 (21%)
Tomcat         526 (75%)  1873 (45%)  1360 (56%)      -       1090 (62%)  1273 (51%)  2724 (35%)  2126 (41%)
ArgoUML        445 (63%)  1375 (33%)      -       1381 (34%)   832 (47%)   569 (23%)  1886 (24%)  1490 (29%)
Pillar1        520 (74%)      -       1374 (57%)  1903 (46%)  1048 (59%)  1193 (48%)  2563 (33%)  1889 (37%)
Pillar2            -       520 (13%)   437 (18%)   525 (13%)   390 (22%)   415 (17%)   577 (7%)    517 (10%)

To address RQ1.2, I searched the contextual model, i.e. the repository of source code terms, of each application using the source code terms of the other applications, excluding the concepts, to see if there is some common vocabulary outside the concepts. I identified several common terms between the applications, with varying frequency (see Table 3.6).
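The comparison behind Table 3.6 amounts to a set intersection, as sketched below: the unique source code terms of one application are looked up in the term repository of another, excluding the domain concept terms. The data structures, example terms and percentages are illustrative assumptions, not ConCodeSe internals.

```java
// Sketch of the cross-application vocabulary comparison (Table 3.6).
import java.util.HashSet;
import java.util.Set;

public class VocabularyOverlap {

    static Set<String> commonTerms(Set<String> sourceTerms, Set<String> targetRepository,
                                   Set<String> domainConcepts) {
        Set<String> common = new HashSet<>(sourceTerms);
        common.removeAll(domainConcepts);   // concepts are excluded from this comparison
        common.retainAll(targetRepository); // keep only terms also present in the other application
        return common;
    }

    public static void main(String[] args) {
        Set<String> swtTerms = Set.of("widget", "display", "version", "read", "update");
        Set<String> eclipseRepository = Set.of("widget", "display", "version", "read", "plugin");
        Set<String> concepts = Set.of("widget");
        Set<String> shared = commonTerms(swtTerms, eclipseRepository, concepts);
        // percentage relative to the source application's unique terms, as in Table 3.6
        System.out.printf("%d common terms (%.0f%%)%n", shared.size(),
                100.0 * shared.size() / swtTerms.size());
    }
}
```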

The search results show that a high percentage of common terms exists between the applications, especially when searching the contextual model of a large application with the source code terms of a smaller one. For example, out of 2477 unique SWT source code terms, 2387 or 96% exist in Eclipse. Investigating the reasons revealed that the SWT code base is included in the Eclipse dataset because SWT is a sub-project of Eclipse. I also noticed that

83% of Pillar2 terms are present in Eclipse. Curious to find out what kind of terms a proprietary application may share with a FLOSS application of a different domain, I performed a manual analysis and found that those common terms are the ones generally used by developers, e.g. CONFIG, version, read, update, etc.

I also cross-checked the validity of the identified common terms by repeating the search over the contextual model of the other application (i.e. reverse search) and obtained a varying number of common words (see the right cross-section of Table 3.6) due to the following two reasons.

1. When searching the repository of one application with the terms of another one, the

number of common terms may be the same but the percentage value may differ due

to the size of the applications. For example, Pillar1 and Pillar2 applications have 520

Table 3.7: Common identifiers found in source files of each application across all projects

Application    Pillar2     Pillar1     ArgoUML     Tomcat      ZXing       SWT          Eclipse      AspectJ
Identifiers       4394       27174       11435       25300        5550       10979       140991        27970
AspectJ        164 (4%)    1399 (5%)   1199 (10%)  1987 (8%)    592 (11%)   746 (7%)    4413 (3%)       -
Eclipse        326 (7%)    2583 (10%)  2137 (19%)  3534 (14%)  1072 (19%)  10902 (99%)     -        4380 (16%)
SWT            117 (3%)     645 (2%)    564 (5%)    717 (3%)    416 (7%)       -       10905 (8%)    736 (3%)
ZXing          101 (2%)     542 (2%)    419 (4%)    618 (2%)       -        419 (4%)    1069 (1%)    580 (2%)
Tomcat         180 (4%)    1584 (6%)   1088 (10%)      -        619 (11%)   708 (6%)    3534 (3%)   1962 (7%)
ArgoUML        134 (3%)     989 (4%)       -       1100 (4%)    426 (8%)    574 (5%)    2149 (2%)   1192 (4%)
Pillar1        187 (4%)        -        988 (9%)   1600 (6%)    550 (10%)   655 (6%)    2608 (2%)   1395 (5%)
Pillar2            -        187 (1%)    130 (1%)    178 (1%)    101 (2%)    114 (1%)     323 (0%)    161 (1%)

common terms so the distribution is 73% (520/701) when Pillar1 code base is searched

with Pillar2 terms and 13% (520/4143) when Pillar2 is searched with Pillar1 terms.

2. For certain applications, the reverse search resulted in a varying number of common terms. For example, searching Pillar2 with Tomcat terms found 525 common terms, but 526 when searching the other way around. This is because the term “action” is a concept in Tomcat and is excluded from the search. I found the same irregularities in other applications, which caused the common term counts to vary.

Subsequently, I wanted to find out if the common vocabulary also contributes towards ease of code understanding by developers. Since identifiers are made up of compound terms, I looked at the overall vocabulary of all the identifiers and not just the component terms that occur in the source. After all, even if there is an overlap, one application might combine them in completely different ways to make identifiers that other developers will find confusing. Indeed, all applications combine the words differently: only a small percentage of common identifiers exists between the applications, as seen in Table 3.7.

Again, when searching Eclipse with SWT identifiers, 99% (10,902/10,979) of them are found in Eclipse because all of SWT’s source files are included as a sub-project within Eclipse. Conversely, when the SWT code base is searched with Eclipse identifiers, 10,905 are found. The reason is that those 3 additional identifiers are SWT concepts and are excluded when searching the Eclipse code base, as explained previously.

Furthermore, during the search to find the number of common identifiers, I noticed that

when two applications did not share any concept terms in their identifier names, none of the identifiers were excluded from the search, which resulted in the same number of common identifiers between both applications. For example, Tomcat and Eclipse do not share any concepts, so no identifier was excluded during the search, which resulted in 3534 common identifiers. Conversely, when both applications share concept terms, the exact same number of identifiers is excluded from the comparison, resulting also in the same number of common identifiers, e.g. Pillar1 and Pillar2 implement the same concepts of the financial domain and thus have 187 common identifiers.

I also found that the common identifiers between the other applications, outside Eclipse and the financial domain, are made up of terms commonly used by developers during programming, e.g. array, index, config, insert, delete, etc. This is also the reason why Pillar2, a proprietary application not publicly available, shares terms with publicly available applications.

For example, entity is a common term shared between Eclipse and Pillar2, but in Pillar2 the entity is a financial concept that represents a risk object and in Eclipse the entity is a programming concept that represents a database entry.

3.4.3 RQ1.3: How can the Vocabulary be Leveraged When Searching for Concepts to Find the Relevant Files?

The results to my previous sub-research questions revealed that (1) the project artefacts and source code files have good conceptual alignment, which is a good indicator for leveraging domain concepts to search for the source files that may be relevant for a bug report, and that

(2) there is vocabulary similarity beyond domain concepts, which may contribute towards code comprehension. Henceforth, my third sub-research question seeks to answer whether searching with domain concepts only is adequate to find the relevant files for a given bug report.

As described in Sub-section 3.3.1.2, ConCodeSe stores referential links to aid in identifying relations among project artefacts, like the concept terms occurring in each bug report. To address my third research question, I used my tool to search the terms extracted from the source files with the domain concepts each bug report refers to. The retrieved files were compared to those that should have been returned, i.e. those that were affected by implementing the bug report as listed in the datasets of the projects (see Table 2.1).

Looking at the search results in Table 3.8, my approach achieved the best performance

Table 3.8: Bug reports for which at least one affected file placed into top-N using concept terms only search

Application   Top-1   Top-5   Top-10   MAP    MRR
AspectJ         2%     13%      26%    0.04   0.10
Eclipse         2%      6%      11%    0.03   0.06
SWT            19%     38%      45%    0.22   0.32
ZXing           0%     20%      25%    0.07   0.10
Tomcat          3%     11%      16%    0.06   0.08
ArgoUML         3%     11%      12%    0.04   0.10
Pillar1         4%      4%       7%    0.01   0.11
Pillar2        17%     58%      83%    0.18   0.63
On average      6%     20%      28%    0.08   0.19

in SWT and Pillar2. In the remainder of the projects, the performance is very poor. On average I find one or more affected files in the top-10 ranked files for 28% of the bug reports, due to the very terse distribution of the concepts in the bug reports (see Figures 3.6 and 3.7).

To improve precision (MAP), so that developers have to inspect fewer source code files for their relevance to the bug report, I ran my approach using the actual words of the bug report, rather than its associated concepts, because they better describe the concept’s aspects or actions to be changed.

Table 3.9 shows that in all projects, for over 30% of the bug reports at least one relevant source file was placed in the top-1. The simple fact of considering all the vocabulary of a bug report during the search ranked one or more relevant files in the top-10, on average, for 76% of the bug reports. In addition, the MAP and MRR values are substantially improved

(on average from 0.08 to 0.39 for MAP and from 0.19 to 0.69 for MRR) compared to Table

3.8. The improved results confirm that bug reports contain information that may boost the performance when properly evaluated.

Table 3.9: Bug reports for which at least one affected file placed into top-N using all of the bug report vocabulary

Project      Top-1   Top-5   Top-10   MAP    MRR
AspectJ       35%     67%      78%    0.30   0.62
Eclipse       35%     60%      70%    0.35   0.55
SWT           59%     89%      93%    0.62   0.85
ZXing         55%     75%      80%    0.55   0.68
Tomcat        49%     69%      75%    0.51   0.65
ArgoUML       32%     62%      66%    0.30   0.55
Pillar1       30%     59%      63%    0.22   0.69
Pillar2       33%     67%      83%    0.26   0.92
On average    41%     68%      76%    0.39   0.69

Table 3.10: Recall performance at top-N with full bug report vocabulary search

                 Top-5              Top-10
Project       Mean   Median      Mean   Median
AspectJ        36%     29%        46%     40%
Eclipse        44%     33%        53%     50%
SWT            73%    100%        83%    100%
ZXing          62%     75%        67%    100%
Tomcat         58%    100%        66%    100%
ArgoUML        39%     25%        43%     40%
Pillar1        35%     29%        43%     40%
Pillar2        19%     18%        31%     31%
On average     46%     51%        54%     63%

3.4.3.1 Performance

Although Table 3.9 shows that by using all the vocabulary of a bug report, on average for 76% of the bug reports at least one file was placed in the top-10, it does not actually tell the true recall performance, which is calculated as the number of relevant files placed in the top-N out of all the affected relevant files. To find the recall at top-N, I calculated the min, max, mean and median values of the results for each project. The recall is capped at 10 (N=10), which means that if a bug report has more than 10 relevant files, only the first 10 are considered.
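One reading of the capped recall computation described above is sketched below: the number of affected files ranked in the top N is divided by the number of affected files, with the denominator capped at N. The names are illustrative and the capping rule is an assumption based on the description in this paragraph.

```java
// Minimal sketch of capped recall@N for one bug report.
import java.util.List;
import java.util.Set;

public class RecallAtN {

    static double recallAtN(List<String> ranked, Set<String> affected, int n) {
        long hits = ranked.stream().limit(n).filter(affected::contains).count();
        int denominator = Math.min(affected.size(), n);   // cap at N, e.g. N = 10
        return denominator == 0 ? 0.0 : (double) hits / denominator;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("Menu.java", "Control.java", "SWT.java", "Combo.java");
        Set<String> affected = Set.of("Menu.java", "SWT.java");
        System.out.println(recallAtN(ranked, affected, 10)); // 1.0: both affected files in top-10
    }
}
```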

The min and max values were always 0 and 1 respectively because for each bug report my approach is either able to position all files in top-10 or none, therefore I only report the mean and median values in Table 3.10.

For Eclipse, SWT, ZXing and Tomcat my approach placed at least one of the relevant

files in the top-10 when all of the bug report vocabulary is used during the search. Also, on average, almost half (46%) of the relevant files of a bug report are placed in the top-5. These two findings show that my approach successfully positions on average 54% of the relevant files for a bug report in the top-10, thus providing a reliable focal point for developers to locate affected files without going through all of the files in the results list.

3.4.3.2 Contribution of VSM

As described in Section 3.3.2, out of the 8 rankings of a file, 4 are obtained with the VSM probabilistic method available in the Lucene library, and the other 4 with my lexical similarity scoring method (Algorithm 3.1); I then take the best of the 8 ranks. To find out the added value of VSM, I performed two runs using the domain concepts only: one with VSM scoring only and the other with lexical similarity scoring only.

Table 3.11: Bug reports for which at least one affected file placed into top-N using VSM vs lexical similarity scoring with concept terms only search

Project      Approach   Top-1   Top-5   Top-10   MAP    MRR
AspectJ      lexical      2%     12%      24%    0.04   0.10
             vsm          0%      1%       3%    0.02   0.03
Eclipse      lexical      1%      4%       7%    0.02   0.04
             vsm          1%      3%       5%    0.02   0.03
SWT          lexical     17%     36%      42%    0.20   0.30
             vsm          5%     11%      18%    0.09   0.11
ZXing        lexical      0%     10%      15%    0.05   0.06
             vsm          0%     10%      10%    0.04   0.06
Tomcat       lexical      2%      9%      12%    0.05   0.06
             vsm          1%      4%       6%    0.02   0.03
ArgoUML      lexical      2%      4%       7%    0.03   0.06
             vsm          1%      7%       7%    0.03   0.06
Pillar1      lexical      4%      4%       4%    0.01   0.08
             vsm          0%      4%       7%    0.01   0.04
Pillar2      lexical      8%     33%      50%    0.12   0.41
             vsm          8%     42%      58%    0.12   0.39
On average   lexical      5%     14%      20%    0.06   0.14
             vsm          2%     10%      14%    0.04   0.09

Table 3.11 shows that lexical scoring outperforms VSM in all projects except for Pillar1 in top-10 and Pillar2 in top-5. As to be expected, each scoring by itself has a poorer performance than the combined score, whilst the total number of bug reports located by ConCodeSe (Table

3.8) is not the sum of the parts. In other words, many files are ranked in the top-N by both scoring methods, and each is able to locate bug reports the other can’t. ConCodeSe literally takes the best of both worlds, which shows the crucial contribution of lexical similarity scoring for the improved performance of my approach.

VSM is known to achieve a bigger contribution for projects with a large number of files, which makes the use of term and document frequency more meaningful. However, in my case, using only the handful of concepts that a bug report refers to (see Figures 3.6 and 3.7) as search terms reveals the shortcomings of VSM compared to lexical search.

I repeated the search to compare the lexical match against VSM but this time using all of the vocabulary extracted from the bug report. As Table 3.12 shows, the lexical search performed better than VSM only in the smaller projects, i.e. SWT, ZXing, Tomcat and Pillar2. This confirms that VSM achieves a bigger contribution for projects with a large number of files only when the search terms are more than just a couple of concept terms.

I also note that the advantage of VSM is not paramount, since it only yields on average a 2.4% improvement across all performance metrics, i.e. top-N, MAP and MRR, over the lexical

search alone; however, lexical search on its own still has a weak performance.

Table 3.12: Bug reports for which at least one affected file placed into top-N using VSM vs lexical similarity scoring with full bug report vocabulary search

Project      Approach   Top-1   Top-5   Top-10   MAP    MRR
AspectJ      lexical     14%     44%      59%    0.17   0.34
             vsm         28%     47%      60%    0.22   0.49
Eclipse      lexical     18%     35%      48%    0.20   0.32
             vsm         23%     46%      57%    0.25   0.41
SWT          lexical     43%     77%      89%    0.50   0.69
             vsm         40%     71%      84%    0.46   0.66
ZXing        lexical     40%     65%      80%    0.45   0.53
             vsm         40%     60%      70%    0.41   0.50
Tomcat       lexical     34%     56%      64%    0.38   0.48
             vsm         28%     48%      58%    0.32   0.43
ArgoUML      lexical     12%     48%      56%    0.20   0.33
             vsm         24%     52%      58%    0.24   0.45
Pillar1      lexical      7%     33%      41%    0.10   0.36
             vsm         26%     48%      56%    0.18   0.59
Pillar2      lexical     25%     67%      75%    0.20   0.71
             vsm          8%     67%      75%    0.18   0.62
On average   lexical     24%     53%      64%    0.27   0.47
             vsm         27%     55%      65%    0.28   0.52

3.4.3.3 Searching for AspectJ and SWT in Eclipse

In Section 3.4.2, I found that Eclipse contains the SWT source files. Since AspectJ is also an Eclipse sub-project and 60% of AspectJ terms are present in Eclipse (see Table 3.6), I searched the source files of Eclipse using the actual words extracted from the bug reports of

SWT and AspectJ. The results for SWT are very close to those obtained by the search over the application’s own source files (last row in Table 3.13). This indicates that given a large search space with many source files, e.g. 12,863 in Eclipse, my approach successfully retrieves the relevant files for the bug reports of a smaller sub-set of the search space, e.g. SWT with

484 files, without retrieving many false positives.

Table 3.13: Search performance for SWT and AspectJ bug reports within the Eclipse code base

Project   Search    Top-1   Top-5   Top-10   MAP    MRR
AspectJ   Eclipse     2%      3%       4%    0.01   0.02
          AspectJ    35%     67%      78%    0.30   0.62
SWT       Eclipse    51%     81%      85%    0.55   0.71
          SWT        59%     89%      93%    0.62   0.85

In the case of AspectJ, the search performance was poor compared to the search over the application’s own source files (first row in Table 3.13). Analysing this revealed that some of

the relevant files listed in the bug reports of AspectJ are source code class files that test the actual code, e.g. MainTest.java that tests Main.java, and the test files are excluded from the AspectJ version that is embedded in Eclipse.

3.5 Threats to validity

The internal validity concerns claims of causality, to verify that the observed outcomes are the natural product of the implementation. I only used single-word concepts, while business concepts are usually compound terms. This threat to internal validity is partially addressed by using all the words found in the bug reports during the search task. My approach also relies on developers practicing good programming standards, particularly when naming source code identifiers; otherwise the performance of my approach may deteriorate. I minimised this threat by identifying the presence of common vocabulary beyond the domain concepts between different applications that were developed independently.

The construct validity addresses biases that might have been introduced during the operationalisation of the theories. As a software development consultant, I listed the concepts. This threat to construct validity was partly addressed by having the concepts validated by other stakeholders. During the creation of the searchable corpus, I developed my own modules to interact with different tools: (1) a parser to process source code comments and Groovy files; (2) the JIM tool to extract the terms from the identifiers; (3) Lucene’s Porter stemmer to find the root of a word; and (4) Lucene’s VSM module to index the repository of terms by source file.

It is possible that other tools to (1) parse, (2) extract, (3) stem and (4) index the artefacts may produce different results. I catered for this threat by using the same dataset originally provided by Zhou et al. (2012a) and used by others (see Table 2.1), to which I added two of my own. For fairness, I also used the same evaluation metrics as those used by others in their prior work.

The external validity concerns generalisation, i.e. the possibility of applying the study and results to other circumstances. The characteristics of the projects (the domain, the terse bug reports (see Table C.1 in Appendix C) with sometimes misspelled text21, the naming conventions, the kind of documentation available) are a threat to external validity. I addressed this by repeating the experiment with other projects and artefacts from different domains for comparison. Although ConCodeSe only works on Java projects for the purpose

21Pillar2 BR #2010 has incorrect spelling in the original document (see Appendix B)

of literature comparison, which may limit its generalisability, the principles are applicable to other object-oriented programming languages.

3.6 Discussion

From all the results shown, I can answer RQ1 affirmatively: project artefacts share domain concepts and vocabulary that may aid code comprehension when searching to find the relevant

files during software maintenance. I offer further explanations as follows.

Regarding my first aim, I note that, together, the three artefacts (user guide, bug reports and source code) explicitly include a large number of the domain concepts agreed upon by seven developers and a business analyst (see Figures 3.6 and 3.7). This indicates (a) full domain concept coverage, and (b) that in these projects abbreviations are not required to retrieve such concepts from the artefacts. Those are two good indicators for maintenance. However, only a handful of concepts occur both in the code and the documentation. Since the latter is consulted during maintenance, this lack of full agreement between both artefacts, regarding the concepts in the developers’ heads, may point to potential inefficiencies during maintenance.

On the other hand, and using stemming to account for lexical variations, except for AspectJ those common concepts correlate well (with a high statistical significance of p < 0.05 as per Table 3.4) in terms of relative frequency, taken as a proxy for importance, i.e. the more important concepts in the user guide tend to be the more important ones in the code. Conceptual alignment between documentation and implementation eases maintenance, especially for new developers. The weak conceptual overlap and correlation between bug reports and the other two artefacts is not a major issue. Bug reports are usually specific to a particular unit of work and may not necessarily reflect all implemented or documented concepts.

Regarding my second aim, I found that mapping a bug report’s wording to domain concepts and using those to search for files to be changed positioned a relevant file in the top-10 on average for only 28% of the bug reports and resulted in very poor MAP (0.08) and MRR (0.19) values, as seen in Table 3.8. I found that both MRR and MAP can be improved by using the actual bug report vocabulary (Table 3.9), which is enough to achieve a very good recall of 54% on average (Table 3.10). I note that such project-specific, simple and efficient techniques can drastically reduce the false positives a developer has to go through to find the files affected by a bug report.

My study showed that despite good concept coverage among the project artefacts and

vocabulary similarity between the source code files, it is challenging to find the files referred to by a bug report. Although reflecting the daily tasks of the developers, bug reports, being action-oriented in nature, are not a better source for traceability than user guides and domain concepts.

I found that these two types of documentation, i.e. bug reports and user guides, describe the usage scenarios and may account for the improved traceability.

In this study, I also investigated the vocabulary of two independent financial applications,

Pillar1 and Pillar2, covering the same domain concepts. I found that artefacts explicitly reflect the domain concepts and that paired artefacts have a good conceptual alignment, which helps maintenance when searching for files affected by a given bug report.

3.7 Concluding Remarks

This Chapter presented an efficient approach to relate the vocabulary of information sources for maintenance: bug reports, code, documentation, and the concepts in the stakeholders’ minds. The approach consists of first extracting and normalising (incl. splitting identifiers and removing stop words) the terms from the artefacts, while independently eliciting domain concepts from the stakeholders. Secondly, to account for lexical variations, I did exact and stemmed searches of the concepts within the terms extracted from artefacts. I found that one can check whether (a) the artefacts explicitly reflect the stakeholders’ concepts and (b) pairs of artefacts have good conceptual alignment. Both characteristics help maintenance, e.g. (b) facilitates locating code affected by given bug reports.

Although the investigation described in this Chapter is only a preliminary exploration of the vocabulary relationships between artefacts and the developers’ concepts, it highlights that better programming guidelines and tool support are needed beyond enforcing naming conventions within code, because that by itself doesn’t guarantee a good traceability between the concepts and the artefacts, which would greatly help maintenance tasks and communication within the team.

The importance of descriptive and consistent identifiers for program comprehension, and hence software maintenance, has been extensively argued for in the academic literature, e.g.

Deißenböck and Pizka (2006), and the professional literature, e.g. Martin (2008). I applied my approach to OSS projects and to industrial code that follows good naming conventions, in order to investigate whether they could be leveraged during maintenance. I observed that the conceptual alignment between documentation and code could be improved, and that descriptive identifiers support high recall of files affected by bug reports, but mean average precision (MAP) is low (on average 0.08), which is detrimental for maintenance. I found simple techniques to improve the MAP (on average 0.39); however, further research is needed, for example, heuristics based on words in key positions within the context of a bug report could be developed.

In the next Chapter, I investigate the additional sources of information available in project artefacts that may be leveraged to improve the traceability between the artefacts during bug localisation. I extend my approach to consider words in certain positions of the bug report summary and of the stack trace (if available in a bug report) as well as source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

I compare the results of my approach to eight existing ones, using the same bug reports, applications and evaluation criteria to measure the improvements.

Chapter 4

Locating Bugs without Looking

Back

In the previous Chapter, I explored the conceptual alignment of vocabulary in project artefacts and introduced a novel approach to search for the source code files that may be relevant for fixing a bug. I found good conceptual alignment among the artefacts. However, using domain concepts when searching for the source files revealed low precision and recall. I was able to improve the recall by simply using all of the vocabulary found in the bug report, but precision remains low. I concluded that to improve the traceability between the artefacts during bug localisation, heuristics based on words in certain key positions within the context of bug reports may be developed.

Most state-of-the-art IR approaches rely on additional sources of information like project history, in particular previously fixed bugs and previous versions of the source code, to improve the traceability. However, it is reported that neither of these sources guarantees an advantage. In fact, it is observed that considering past version history beyond 20 days or evaluating more than 30 previously fixed similar bug reports causes the performance of the search results to deteriorate (Wang and Lo, 2014; Nichols, 2010).

I take a more pragmatic approach to address RQ2 and hence investigate how file names found in bug reports can be leveraged during bug localisation to support software maintenance. I extend my approach presented in Chapter 3 to consider the file names in key positions when directly scoring each current file against the given report, thus without requiring past code or bug reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports.

I compare my approach to eight others, using their own five metrics on their own six open source projects and two new projects of my own. Out of 30 performance indicators, I improve 27 and equal 2. Over the projects analysed, on average my approach positions an affected file in the top-1, top-5 and top-10 ranks for 44%, 69% and 76% of the bug reports respectively.

This is an improvement of 23%, 16% and 11% respectively over the best performing current state-of-the-art tool. These results show the applicability of my approach also to software projects without history.

In this Chapter, Section 4.1 gives a brief introduction and defines how RQ2 is going to be addressed. Section 4.2 revisits the IR-based approaches against which I compare mine.

In Section 4.3, I explain how I extended my approach and present the results of addressing the research questions in Section 4.4. I discuss why the approach works and the threats to validity in Section 4.5, and conclude with remarks in Section 4.6.

4.1 Introduction

In Chapter 1, I defined the common view that the current state-of-the-art IR approaches see the bug report as the query, and the source files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, these approaches rely on project history, in particular previously fixed bugs and previous versions of the source code.

The study in (Li et al., 2006), performed with two OSS projects, showed that 81% of all bugs in Mozilla and 87% of those in Apache are semantics related. These percentages increase as the applications mature, and they have a direct impact on system availability, contributing to 43–44% of crashes. Since it takes a longer time to locate and fix semantic bugs, more effort needs to be put into helping developers locate the bugs.

Recent empirical studies provide evidence that many terms used in bug reports are also present in the source code files (Saha et al., 2013; Moreno et al., 2013). Such bug report terms are an exact or partial match of program elements (i.e. class, method or variable names and comments) in at least one of the files a↵ected by the bug report, i.e. those files actually changed to address the bug report.

Moreno et al. (2013) showed that (1) the bug report documents share more terms with the corresponding a↵ected files and (2) the shared terms were present in the source code file names. The authors evaluated six OSS projects, containing over 35K source files and 114

90 bug reports, which were solved by changing 116 files. For each bug report and source file combination (over 1 million), they discovered that on average 75% share between 1–13 terms,

22% share nothing and only 3% share more than 13 terms. Additionally, the study revealed that certain locations of a source code file, e.g. a file name instead of a method signature, may have only a few terms but all of them may contribute to the number of shared terms between a bug report and its affected files. The authors concluded that the bug reports have more terms in common with the affected files and the common terms are present in the names of those affected files.

Saha et al. (2013) claimed that although file names of classes are typically a combination of 2–4 terms, they are present in more than 35% of the bug report summary fields and 85% of the bug report description fields of the OSS project AspectJ. Furthermore, the exact file name is present in more than 50% of the bug descriptions. They concluded that when the terms from these locations are compared during a search, the noise is reduced automatically due to reduction in search space. For example, in 9% (27/286) of AspectJ bug reports, at least one of the file names of the fixed files was present as-is in the bug report summary, whereas in 35% (101/286) of the bug reports at least one of the file name terms was present.

Current state-of-the-art approaches for Java programs, e.g. BugLocator (Zhou et al.,

2012a), BRTracer (Wong et al., 2014), BLUiR (Saha et al., 2013), AmaLgam (Wang and Lo,

2014), LearnToRank (Ye et al., 2014), BLIA (Youm et al., 2015) and Rahman et al. (2015) rely on project history to improve the suggestion of relevant source files. In particular they use similar bug reports and recently modified files. The rationale for the former is that if a new bug report x is similar to a previously closed bug report y, the files affected by y may also be relevant to x. The rationale for the latter is that recent changes to a file may have led to the reported bug (Wang and Lo, 2014). However, the observed improvements using the history information have been small. Thus my hypothesis is that file names mentioned in the bug report descriptions can replace the contribution of historical information in achieving comparable performance, and I asked, in Chapter 1, my second research question as follows.

RQ2: Can the occurrence of file names in bug reports be leveraged to re- place project history and similar bug reports to achieve improved IR-based bug localisation?

To address the second RQ in this Chapter, I aim to check if the occurrence of file names in bug reports can be leveraged for IR-based bug localisation in Java programs. I restrict the

scope to Java programs, where each file is generally a class or interface, in order to directly match the class and interface names mentioned in the bug reports to the files retrieved by

IR-based bug localisation. If file name occurrence can’t be leveraged, I look more closely at the contribution of past history, in particular of considering similar bug reports, an approach introduced by BugLocator and adopted by others. Henceforth I ask in my first sub-research question:

RQ2.1: What is the contribution of using similar bug reports to the results performance in other tools compared to my approach, which does not draw on past history?

IR-based approaches to locating bugs use a base IR technique that is applied in a context-specific way or combined with bespoke heuristics. However, Saha et al. (2013) note that the exact variant of the underlying tf/idf1 model used may affect results. In particular they find that the off-the-shelf model used in their tool BLUiR already outperforms BugLocator, which introduced rVSM, a bespoke VSM variant. My approach also uses an off-the-shelf VSM tool, different from the one used by BLUiR. In comparing my results to theirs I aim to distinguish the contribution of file names, and the contribution of the IR model used. Thus I ask in my second sub-research question:

RQ2.2: What is the overall contribution of the VSM variant adopted in my approach, and how does it perform compared to rVSM?

To address RQ2, I extend my approach introduced in Chapter 3 and then evaluate it against existing approaches (Zhou et al., 2012a; Saha et al., 2013; Moreno et al., 2014; Wong et al., 2014; Wang and Lo, 2014; Ye et al., 2014; Youm et al., 2015; Rahman et al., 2015) on the same datasets and with the same performance metrics. Like other approaches, mine scores each file against a given bug report and then ranks the files in descending order of score, aiming for at least one of the files affected by the bug report to be among the top-ranked ones, so that it can serve as an entry point to navigate the code and find the other affected files. As we shall see, my approach outperforms the existing ones in the majority of cases. In particular it succeeds in placing an affected file among the top-1, top-5 and top-10 files for

44%, 69% and 76% of bug reports, on average.

My scoring scheme does not consider any historical information in the repository, which contributes to an ab-initio applicability of my approach, i.e. from the very first bug report submitted for the very first version. Moreover, my approach is efficient, because of

1tf/idf (term frequency/inverse document frequency) is explained in Chapter 2, Sub-section 2.2.1

the straightforward scoring, which doesn’t require any further processing like dynamic code analysis to trace executed classes by re-running the scenarios described in the bug reports.

To address RQ2.1, I compare the results of BugLocator and BRTracer using SimiScore

(the similar bug score), and the results of BLUiR according to the literature, showing that

SimiScore’s contribution is not as high as suggested. From my experiments, I conclude that my approach localises many bugs without using similar bug fix information, which were only localised by BugLocator, BRTracer or BLUiR using similar bug information.

As for RQ2.2, through experiments I found that VSM is a crucial component to achieve the best performance for projects with a larger number of files where the use of term and document frequency is more meaningful, but that in smaller projects its contribution is rather small. I chose the Lucene VSM, which performs in general better than the bespoke rVSM.

4.2 Existing Approaches

In this Section, I summarise the key points from the academic literature discussed in Chapter

2 that relate to my proposed approach. Nichols (2010) argued that utilising additional sources of information would greatly aid IR models to relate bug reports and source code files. One of the information sources available in well-maintained projects is the detail of past bugs and, to take advantage of this, the author proposes an approach that mines the past bug information automatically. The author concludes that search results from the repository augmented with up to 14 previous bug reports were the same as those from the non-augmented one (Sub-section 2.3.4).

Zhou et al. (2012a) proposed an approach consisting of the four traditional IR steps

(corpus creation, indexing, query construction, retrieval & ranking) but using a revised Vector

Space Model (rVSM) to score each source code file against the given bug report. In addition, each file gets a similarity score (SimiScore) based on whether the file was affected by one or more closed bug reports similar to the given bug report (Section 2.4). The approach, implemented in a tool called BugLocator, was evaluated using over 3,400 reports of closed bugs and their known affected files from four OSS projects (see Table 2.1). The authors report superior performance of the results compared to a traditional VSM-only approach.

Bettenburg et al. (2008) disagree with the treatment, by existing approaches, of a bug report as a single piece of text and of source code files as one whole unit, e.g. Poshyvanyk et al. (2007); Ye et al. (2014); Kevic and Fritz (2014); Abebe et al. (2009). Bettenburg et al. (2008) argue that a bug report may contain a readily identifiable number of elements including stack traces, code fragments, patches and recreation steps, each of which should be treated separately for analytical purposes (Sub-section 2.3.3).

Saha et al. (2013) presented BLUiR, which leverages the structures inside a bug report and a source code file by computing a similarity score between each of the 2 fields (summary and description) of a bug report and each of the 4 parts of a source file (class, method, variable names, and comments) as described in Sub-section 2.3.5. The results were evaluated using

BugLocator’s dataset and performance indicators. For all but one indicator for one project,

ZXing's mean average precision (MAP), BLUiR matches or outperforms BugLocator, hinting that a different IR approach might compensate for the lack of history information, namely the previously closed similar bug reports.

Moreno et al. (2014) presented LOBSTER, which leverages the stack trace information available in bug reports to suggest relevant source code files. The approach first calculates a textual similarity score between the words extracted by Lucene, which uses VSM, from bug reports and files. Second, a structural similarity score is calculated between each file and the file names extracted from the stack trace. The authors conclude that considering stack traces does improve the performance with respect to only using VSM (Sub-section 2.3.3). I also leverage stack traces and compare the performance of both tools using their dataset for

ArgoUML, a UML tool (Table 2.1).

Wong et al. (2014) proposed BRTracer, which also leverages the stack trace information and performs segmentation of source files to reduce any noise due to varying file lengths (Bettenburg et al., 2008). The stack trace score is calculated by evaluating the files listed in the stack trace and the files referenced in the source code file. The authors claim that segmentation or stack trace analysis is an effective technique to boost bug localisation (Sub-section 2.3.3). Since I also evaluate my approach with BugLocator's dataset and leverage stack traces, I compare the performance of my approach against BRTracer and report on the improvements gained.

Wang and Lo (2014) proposed AmaLgam for suggesting relevant buggy source files by combining BugLocator's SimiScore and BLUiR's structured retrieval into a single score using a weight factor, which is then combined (using a different weight) with a version history score that considers the number of bug-fixing commits that touch a file in the past k days. The evaluation shows that considering historical commits up to 15 days old increased the performance, 15 to 20 days did not make much difference, and considering up to 50 days deteriorated the performance. Thus they conclude that the most important part of the version history is the last 15–20 days (Sub-section 2.3.5).

Ye et al. (2014) defined a ranking model that combines six features measuring the relationship between bug reports and source files, using a learning-to-rank (LtR) technique (Sub-section 2.3.5). Their experimental evaluations show that the approach places the relevant files within the top-10 recommendations for over 70% of the bug reports of Tomcat

(Table 2.1). My approach is much more lightweight: it only uses the first of Ye et al.’s six features, lexical similarity, and yet provides better results on Tomcat.

Recently, Youm et al. (2015) introduced an approach where the scoring methods utilised in previous studies (Zhou et al., 2012a; Saha et al., 2013; Wong et al., 2014; Wang and Lo,

2014) are first calculated individually and then combined by varying alpha and beta parameter values. The approach, implemented in a tool called BLIA, is compared against the performance of the other tools in which the original scoring methods were first introduced. The authors found that stack-trace analysis is the highest contributing factor among the analysed information for bug localisation (Section 2.4).

In another recent study, Rahman et al. (2015) extended the approach introduced in Zhou et al. (2012a) by considering file fix frequency and file name matches (Section 2.4). Independently of Rahman et al. (2015) (of which I became aware only recently), I decided to use file names because of the study by Saha et al. (2013), which shows that many bug reports contain the names of the files that need to be fixed. The two main differences are that (1) in Rahman et al. (2015) the file names are extracted from the bug report using a very simple pattern matching technique and (2) only one constant value is used to boost a file's score when its name matches one of the extracted file names. In contrast, my approach (1) uses a more sophisticated file-matching regular expression pattern to extract file names from the bug report and (2) treats the extracted file name's position in the bug report with varying importance when scoring a

file, as we will see later.

4.3 Extended Approach

Each approach presented in the previous section incorporates an additional information source to improve results, as shown in Table 4.1. I list the tools against which I evaluate my approach (ConCodeSe), by using the same datasets (Table 2.1) and metrics.

Table 4.1: Comparison of IR model and information sources used in other approaches

Approach     Underlying IR         Version   Similar   Structure   File   Stack
             logic/model           History   Report                Name   Trace
BugLocator   rVSM                  no        yes       no          no     no
BRTracer     rVSM + segmentation   no        yes       yes         no     yes
BLUiR        Indri                 no        yes       yes         no     no
AmaLgam      Mixed                 yes       yes       yes         no     no
BLIA         rVSM + segmentation   yes       yes       yes         no     yes
Rahman       rVSM                  yes       yes       no          yes    no
LtR          VSM                   yes       yes       yes         no     no
LOBSTER      VSM                   no        no        yes         no     yes
ConCodeSe    lexicalScore + VSM    no        no        yes         yes    yes

So far my approach introduced in Chapter 3 ranks the source code files of an application based on lexical similarity between the vocabulary of the bug reports and the terms found in the source files. In this Section, I describe how I extend my approach to also consider additional information, like file names and stack trace details available in bug reports, during scoring.

4.3.1 Ranking Files Revisited

As explained in Chapter 3, Sub-section 3.3.2, given a bug report and a file, my approach computes two kinds of scores for the file: a lexical similarity score and a probabilistic score given by VSM, as implemented by Lucene.

The two scorings are done with four search types, each using a different set of terms indexed from the bug report and the file (a brief Lucene-based sketch follows the list):

1. Full terms from the bug report and from the file’s code.

2. Full terms from the bug report and the file’s code and comments.

3. Stemmed terms from the bug report and the file’s code.

4. Stemmed terms from the bug report, the file’s code and comments.
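To give a concrete flavour of the VSM side of the scoring, the following minimal Java sketch indexes the term set of one search type with an off-the-shelf Lucene VSM and scores it against a bug report. It is an illustration only, not the ConCodeSe implementation: the choice of EnglishAnalyzer, the ClassicSimilarity (a plain tf/idf VSM), the field names and the toy term sets are all assumptions made for the example, assuming a recent Lucene release.

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.similarities.ClassicSimilarity;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import java.util.Map;

public class VsmScorerSketch {
    public static void main(String[] args) throws Exception {
        // Toy term sets of two hypothetical source files for the "stemmed code + comments" type.
        Map<String, String> fileTerms = Map.of(
                "Program.java", "program launch find extension gnome desktop open",
                "Tree.java", "tree item virtual style widget expand");

        Directory dir = new ByteBuffersDirectory();          // in-memory index
        EnglishAnalyzer analyzer = new EnglishAnalyzer();     // tokenises and stems terms

        IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
        cfg.setSimilarity(new ClassicSimilarity());           // plain tf/idf VSM scoring
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            for (Map.Entry<String, String> e : fileTerms.entrySet()) {
                Document doc = new Document();
                doc.add(new StringField("file", e.getKey(), Field.Store.YES));
                doc.add(new TextField("terms", e.getValue(), Field.Store.NO));
                writer.addDocument(doc);
            }
        }

        // The bug report text acts as the query; escape() guards against Lucene syntax characters.
        String bugReport = "Program API does not work with GNOME 2.8";
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        searcher.setSimilarity(new ClassicSimilarity());
        QueryParser parser = new QueryParser("terms", analyzer);
        ScoreDoc[] hits = searcher.search(parser.parse(QueryParser.escape(bugReport)), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("file") + " scored " + hit.score);
        }
    }
}

The four search types listed above correspond to four such term sets, differing only in whether stemming is applied and whether comments are included; the lexical similarity score is computed over the same four term sets, giving the 8 rankings discussed next.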

Recall that for each of the 8 scorings, all files are ranked in descending order and then, for each file, the best of its 8 ranks is taken (Sub-section 3.3.2). To this end, I introduce a new function searchAndRankFiles (Algorithm 4.1), which encapsulates the function scoreWithFileTerms (Algorithm 3.1), and still performs the lexical similarity scoring and ranking, taking as arguments a bug report and the project's files and returning an ordered list of files.

Algorithm 4.1 searchAndRankFiles: Main logic for scoring and ranking of project's files per bug report
input: files: List, br: BR    // one bug report
output: ranked: List
for each file in files do
    file.score := scoreWithKeyPositionWord(file.name, br.summary)      // KP score
    if file.score = 0 and br.stackTrace exists then
        file.score := scoreWithStackTrace(file.name, br.stackTrace)    // ST score
    end if
    if file.score = 0 then
        file.score := scoreWithFileTerms(file, br.terms)               // TT score
    end if
end for
return files in descending order by score

The function is called 4 times, once for each search type listed previously, and goes through the following steps for each source code file.

1. Check if its name matches one of the words in key positions (KP) of the bug report’s

summary, and assign a score accordingly (Section 4.3.1.1).

2. If no score was assigned and a stack trace is available, check if the file name matches one

of the file names listed in the stack trace and assign a score (ST) accordingly (Section

4.3.1.2).

3. If there is still no score then assign a score based on the occurrence of the search terms,

i.e. the bug report text terms (TT) in the file (Algorithm 3.1 in Chapter 3).

Once all the files are scored against a bug report, the list is sorted in descending order so that the files with higher scores are ranked at the top. Ties are broken by alphabetical order.
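The steps above can be summarised in a compact Java rendering of Algorithm 4.1. The SourceFile and BugReport holder classes and the stubbed scoring functions are placeholders introduced only to make the example self-contained (the real scorers are Algorithms 4.2 and 4.3 below and Algorithm 3.1 in Chapter 3); only the control flow and the tie-breaking are meant to mirror the description.

import java.util.Comparator;
import java.util.List;

public class RankerSketch {

    static class SourceFile {
        String name;      // e.g. "Program.java"
        double score;     // best score for the current bug report
        SourceFile(String name) { this.name = name; }
    }

    static class BugReport {
        String summary;            // one-line summary sentence
        List<String> stackTrace;   // application files extracted from the trace (possibly empty)
        List<String> terms;        // remaining bug report terms used by TT scoring
    }

    // Placeholder scorers standing in for Algorithms 4.2, 4.3 and 3.1.
    static double scoreWithKeyPositionWord(String fileName, String summary) { return 0; }
    static double scoreWithStackTrace(String fileName, List<String> stackTrace) { return 0; }
    static double scoreWithFileTerms(SourceFile file, List<String> terms) { return 0; }

    static List<SourceFile> searchAndRankFiles(List<SourceFile> files, BugReport br) {
        for (SourceFile file : files) {
            // 1. KP score: file name found in a key position of the summary.
            file.score = scoreWithKeyPositionWord(file.name, br.summary);
            // 2. ST score: only consulted when KP gave nothing and a stack trace exists.
            if (file.score == 0 && br.stackTrace != null && !br.stackTrace.isEmpty()) {
                file.score = scoreWithStackTrace(file.name, br.stackTrace);
            }
            // 3. TT score: fall back to lexical similarity over all bug report terms.
            if (file.score == 0) {
                file.score = scoreWithFileTerms(file, br.terms);
            }
        }
        // Higher scores first; ties broken alphabetically by file name.
        files.sort(Comparator.comparingDouble((SourceFile f) -> f.score).reversed()
                             .thenComparing(f -> f.name));
        return files;
    }
}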

4.3.1.1 Scoring with Key Positions (KP score)

By manually analysing all SWT and AspectJ bug reports and 50 random Eclipse bug reports, i.e. (98+286+50)/4665=9.3% of all bug reports (Table 2.1), I found that the word in first, second, penultimate or last position of the bug report summary may correspond to the affected

file name. For example, Table 4.2 shows that in the summary sentence of bug report #79268, the first word is already the name of the affected source file, i.e. Program.java.

Overall, my manual analysis of SWT revealed that in 42% (42/98) of the bug reports the

first word and in 15% (15/98) of the bug reports the last word of the summary sentence was

the affected source file (see Table 4.3). I found similar patterns in AspectJ: 28% (81/286) as the first word and 5% (15/286) as the last word. The frequency for the second and penultimate words being the affected file was 4% and 11% respectively.

Table 4.2: Sample of words in key positions showing presence of source file names

BR#      Summary                                                              Position
79268    Program API does not work with GNOME 2.8 (libgnomevfs-WARNING)       First
78559    [consistency] fires two selection events before mouse down           Second
92341    DBR - Add SWT.VIRTUAL style to Tree widget                           Penultimate
100040   Slow down between 3.1 RC1 and N20050602 due to change to ImageList   Last

Table 4.3: Summary of file names at key positions in the analysed bug reports

Project   # of BRs    First       Second      Penultimate   Last
          Analysed    Position    Position    Position      Position
AspectJ   286         81 (28%)    12 (4%)     39 (11%)      15 (5%)
Eclipse   50          7 (14%)     4 (8%)      3 (6%)        4 (8%)
SWT       98          42 (42%)    7 (7%)      4 (4%)        15 (15%)

I also observed that some key position words come with method and package names in the bug report summary, e.g. Class.method() or package.subpackage.Class.method(), and hence required parsing using regular expressions. Based on these patterns I assign a high score to a source file when its name matches a word in one of the four key positions of the bug report summary sentence described above. The earlier the file name occurs, the higher the score: the word in first position gets a score of 10, the second 8, the penultimate 6 and the last 4

(see Algorithm 4.2).

Algorithm 4.2 scoreWithKeyPositionWord: Scores a file based on terms matching four key positions of the bug report summary sentence
input: fileName: String, summary: String    // BR summary sentence
output: score: int
n := numWords(summary)          // number of words
p := summary.find(fileName)     // position of the file name within the summary
if (p = 1) return 10
if (p = 2) return 8
if (p = n - 1) return 6
if (p = n) return 4
return 0
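As an illustration, the KP scoring could be realised in Java roughly as follows. The regular expression that strips package prefixes and trailing method calls from words such as package.subpackage.Class.method() is a simplified assumption on my part, not the exact pattern used by ConCodeSe.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyPositionScorerSketch {

    // Captures a capitalised class-like token, optionally preceded by a package path
    // and followed by a method call, e.g. "pkg.sub.Program.launch()" -> "Program".
    private static final Pattern CLASS_TOKEN =
            Pattern.compile("(?:[a-z_][\\w]*\\.)*([A-Z][\\w]*)(?:\\.[a-z]\\w*\\(\\))?");

    static int scoreWithKeyPositionWord(String fileName, String summary) {
        String className = fileName.replace(".java", "");
        String[] words = summary.trim().split("\\s+");
        int n = words.length;
        for (int i = 0; i < n; i++) {
            // Only the first, second, penultimate and last positions are considered.
            boolean keyPosition = i == 0 || i == 1 || i == n - 2 || i == n - 1;
            if (!keyPosition || !matchesClass(words[i], className)) continue;
            if (i == 0)     return 10;  // first word
            if (i == 1)     return 8;   // second word
            if (i == n - 2) return 6;   // penultimate word
            return 4;                   // last word
        }
        return 0;
    }

    private static boolean matchesClass(String word, String className) {
        Matcher m = CLASS_TOKEN.matcher(word);
        return m.matches() && m.group(1).equalsIgnoreCase(className);
    }

    public static void main(String[] args) {
        // Bug report #79268: the first summary word names the affected file, so the score is 10.
        System.out.println(scoreWithKeyPositionWord("Program.java",
                "Program API does not work with GNOME 2.8 (libgnomevfs-WARNING)"));
    }
}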

Note that the key positions are scored in a different order (1st, 2nd, penultimate, last) from their frequency (1st, penultimate, last, 2nd), because while experimenting with different score

values for SWT and AspectJ I found the 'natural' order to be more effective. Disregarding other positions in the summary prevents non-affected files that occur in those other positions from getting a high KP score and thus a higher ranking.

Table 4.4: Stack trace information available in bug reports (BR) of each project

Project    # of BRs    # of BRs with    % of BRs with
                       Stack Traces     Stack Traces
AspectJ    286         67               23%
Eclipse    3075        435              14%
SWT        98          4                4%
ZXing      20          1                5%
Tomcat     1056        83               8%
ArgoUML    91          5                5%
Pillar1    27          1                4%
Pillar2    12          0                0%

4.3.1.2 Scoring with Stack Traces (ST score)

Stack traces list the files that were executed when an error condition occurs. During manual analysis of the same bug reports as for the KP score, I found that several included a stack trace in the description field (see Table 4.4).

I noticed that especially for NullPointerException, the affected file was often the first one listed in the stack trace. For other exceptions such as UnsupportedOperationException or

IllegalStateException, however, the affected file was likely the 2nd or the 4th in the trace.

I first use regular expressions (see Figure D.1 in Appendix D) to extract from the stack trace the application-only source files, i.e. excluding third party and Java library files, and then put them into a list in the order in which they appeared in the trace. I score a file if its name matches one of the first four files occurring in the list. The file in first position gets a score of 9, the second 7, the third 5 and the fourth 3 (see Algorithm 4.3).

Algorithm 4.3 scoreWithStackTrace: Score a file based on terms matching one of the four files occurring in the stack trace
input: fileName: String, stackTrace: List    // BR stack trace
output: score: int
p := stackTrace.find(fileName)
if (p = 1) return 9
if (p = 2) return 7
if (p = 3) return 5
if (p = 4) return 3
return 0
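A corresponding Java sketch of the ST scoring, including a simplified frame-extraction step, is shown below. The frame pattern, the helper method names and the package prefixes filtered out are illustrative assumptions; the actual regular expressions used are those in Figure D.1 of Appendix D.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackTraceScorerSketch {

    // Matches frames such as "at org.aspectj.weaver.ReferenceType.getDeclaredMethods(ReferenceType.java:123)"
    // and captures the fully qualified class (group 1) and the source file name (group 2).
    private static final Pattern FRAME =
            Pattern.compile("at\\s+([\\w.$]+)\\.[\\w$<>]+\\((\\w+)\\.java:\\d+\\)");

    /** Extracts application-only file names from a raw stack trace, in order of appearance. */
    static List<String> extractApplicationFiles(String rawTrace) {
        List<String> files = new ArrayList<>();
        Matcher m = FRAME.matcher(rawTrace);
        while (m.find()) {
            String qualifiedName = m.group(1);
            // Skip JDK frames; a real filter would use the project's own package list.
            if (qualifiedName.startsWith("java.") || qualifiedName.startsWith("javax.")
                    || qualifiedName.startsWith("sun.")) continue;
            files.add(m.group(2) + ".java");
        }
        return files;
    }

    /** Scores a file by its position among the first four application files in the trace. */
    static int scoreWithStackTrace(String fileName, List<String> stackTraceFiles) {
        int p = stackTraceFiles.indexOf(fileName);   // zero-based; -1 if absent
        if (p == 0) return 9;
        if (p == 1) return 7;
        if (p == 2) return 5;
        if (p == 3) return 3;
        return 0;   // not listed, or listed after the fourth position
    }
}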

4.3.1.3 Rationale behind the scoring values

As mentioned in the previous subsections, values for the KP scoring (10, 8, 6, 4), ST scoring

(9, 7, 5, 3) and TT scoring (2, 0.025, 0.0125) were obtained heuristically, whilst reflecting the importance of the summary, the stack trace and the description, in this order, and of the different parts of each. The scoring values are weights that give more importance to certain positions in the summary and stack trace, or to certain kinds of words (class names). As such, they are in essence not different from other weights used by other authors, which they also obtained heuristically. For example, Hill et al. (2007) assigned weights 2.5 and 0.5 to terms based on whether they match those extracted from the file name or method name, respectively, and Uneno et al. (2016) summed the scores of three tools adjusted by weights

0.5, 0.3, and 0.1 or 1.0, based on manual experiments and observations. Methodologically, my approach therefore does not deviate from established practice.

Some authors automatically determine the best weight values, e.g. by using machine learning. However, the optimal weights obtained from a particular collection of projects are not necessarily the best weights for other projects. For example, in Tomcat the bug reports have very detailed and long descriptions but in Pillar2 they are tersely written. Optimising weights (via machine learning or some other technique) must therefore be done separately for each project, and only for projects that have sufficient history, i.e. sufficient past bug reports that can be used as a training set.

Since I am interested in a specific approach that does not rely on history to provide the best results for the 8 projects at hand, I aim to see whether, using the least information possible (bug reports and source code files) and a light-weight IR approach, I can achieve similar performance to other approaches that use more data sources (similar past bug reports, past code versions). As I will show in Sub-section 4.4.1.5, my approach surpasses the performance of other approaches.

4.4 Evaluation of the Results

In this Section, I address my research questions. Since they ask for the effects of various scoring components, I had to run ConCodeSe, BugLocator and BRTracer (the other tools were unavailable) in different 'modes', e.g. with and without stack trace information or with and without SimiScore (similar bug reports), to observe the effect on the ranking of individual

files. I ran BugLocator and BRTracer without SimiScore by setting the alpha parameter to zero, as described in (Zhou et al., 2012a; Wong et al., 2014). I also ran both tools with

SimiScore, by setting alpha to the value reported in the corresponding paper. I confirmed that I obtained the same top-N, MAP and MRR results as reported in the papers. This reassured me that I was using the same tool versions, datasets and alpha values as the authors had, and that the results reported in the rest of this section for BugLocator and BRTracer are correct.

As I derived my heuristics by manually analysing the bug reports of AspectJ and SWT, and some from Eclipse, to avoid bias, I evaluated my approach using additional OSS and industrial projects: ArgoUML, Tomcat, ZXing, Pillar1 and Pillar2. As already described in

Section 4.2, all but the last two projects were also used by the approaches I compare my tool’s performance against.

4.4.1 RQ2: Scoring with File Names in Bug Reports

As described throughout Section 4.3, my lexical similarity scoring mainly considers whether the name of the file being scored occurs in the bug report, giving more importance to certain positions in the summary or in the description’s stack trace. The rationale is of course that a file explicitly mentioned in the bug report is likely to require changes to fix the bug.

RQ2 asks whether such an approach, although seemingly sensible, is effective. To answer the question I compare my results to those of BugLocator, BRTracer, BLUiR, AmaLgam,

LtR, BLIA and Rahman using the same 5 metrics (top-1, top-5, top-10, MAP and MRR), and for LOBSTER using only the MAP and MRR metrics (as given in their paper). I look at the separate and combined effect of using file names for scoring.

4.4.1.1 Scoring with Words In Key Positions (KP score)

To see the impact of considering occurrences of the file name in certain positions of the bug report summary, I ran ConCodeSe with and without assigning a KP score, whilst keeping all the rest as described in Section 4.3.1. Table 4.5 shows how evaluating the words in key positions improved results for the affected classes of the bug reports given in Table 4.2. In the cases of BugLocator and BRTracer, I obtained the ranks by running their tools, and obtained the ones for BLUiR from their published results.

As Table 4.5 illustrates, in some cases (like for Program.java) the summary scoring can

make the difference between the file not even making it into the top-10 and making it into the top-5. In other cases, the change in ranking is small but can be significant, making the affected file become the top-ranked one, which is always the most desirable situation, as the developer will not have to inspect any irrelevant files.

Table 4.5: Example of ranking achieved by leveraging key positions in SWT bug reports compared to other tools

SWT      Affected file     Bug       BR        BLUiR   ConCodeSe
BR#                        Locator   Tracer            without   with
79268    Program.java      21        11        -       19        1
78559    Slider.java       2         4         1       5         1
92341    Tree.java         1         1         5       4         2
100040   ImageList.java    7         1         9       2         1

To have an overall view, I also ran ConCodeSe using just KP and TT scores together

(KP+TT) against only using the TT score, i.e. in both runs stack trace (ST score) and VSM scoring were not used. Table 4.6 shows that compared to the TT score alone, the KP+TT score provides an advantage in positioning files of a bug report in the top-1 for SWT and ZXing, and in the top-5 for AspectJ, Eclipse and Tomcat. In contrast, in the cases of Pillar1 and Pillar2 using the KP+TT score did not make a difference and the results remained the same as the TT score in all top-N categories. Further analysis revealed the reason: in Pillar2 the bug report summaries do not contain source code file names and in Pillar1 only 3 bug reports contain file names in their summaries, but they are not the files changed to resolve the reported bug and the TT score of the relevant files is higher.

On average, for 64% of bug reports a relevant file can be located in the top-10 by just assigning a high score to file names in certain positions of the bug report summary, confirming the studies cited in the introduction (Section 4.1) that found file names mentioned in a large percentage of bug reports (Saha et al., 2013; Schröter et al., 2010). The file name occurrences in other places of the bug report will also be scored by comparing bug report and file terms in function scoreWithFileTerms (see Sub-section 3.3.2 in Chapter 3), but irrelevant files that match several terms may accumulate a large score that pushes the affected classes down the ranking.

Recall from Table 4.2 that the first word in the summary field of bug #79268 in SWT is the relevant file name (see Figure 1.4 in Chapter 1). The existing tools rank this file at the 21st and 11th positions respectively. One of the reasons for this is that the word Program is considered as being too ambiguous and gets a lower score. However, based on the KP score logic in Algorithm 4.2, my approach matches the relevant file in the first position of the summary sentence and assigns a high score of 10, thus ranking it at the 1st position in the result list (see first row, last column in Table 4.5).

Table 4.6: Performance comparison of scoring variations: Key Position (KP+TT) vs Stack Trace (ST+TT) vs Text Terms (TT only)

Project    Scoring       Top-1    Top-5    Top-10   MAP    MRR
AspectJ    KP+TT only    12.9%    43.0%    59.1%    0.17   0.33
           ST+TT only    21.7%    45.1%    59.4%    0.20   0.40
           TT only       13.6%    43.7%    59.1%    0.17   0.34
Eclipse    KP+TT only    19.5%    35.2%    48.0%    0.21   0.32
           ST+TT only    19.9%    35.92%   48.0%    0.21   0.33
           TT only       18.3%    34.7%    48.0%    0.20   0.32
SWT        KP+TT only    62.2%    79.6%    89.8%    0.60   0.81
           ST+TT only    43.9%    76.5%    88.8%    0.50   0.70
           TT only       42.9%    76.5%    88.8%    0.50   0.69
ZXing      KP+TT only    40.0%    65.0%    80.0%    0.46   0.53
           ST+TT only    n/a      n/a      n/a      n/a    n/a
           TT only       25.0%    60.0%    75.0%    0.38   0.42
Tomcat     KP+TT only    36.2%    56.3%    64.1%    0.39   0.49
           ST+TT only    34.8%    56.3%    64.3%    0.39   0.49
           TT only       34.0%    55.7%    64.0%    0.38   0.48
ArgoUML    KP+TT only    12.1%    48.4%    56.0%    0.19   0.32
           ST+TT only    13.2%    48.4%    56.0%    0.20   0.33
           TT only       12.1%    48.4%    56.0%    0.19   0.32
Pillar1    TT only       7.4%     33.3%    40.7%    0.10   0.36
Pillar2    TT only       25.0%    66.7%    75.0%    0.20   0.71

4.4.1.2 Scoring with Stack Trace Information (ST score)

To see the impact of considering file names occurring in a stack trace, if it exists, I ran

ConCodeSe with and without assigning an ST score, but again leaving all the rest unchanged, i.e. using key position (KP) and text terms (TT) scoring. Table 4.7 shows results for some affected classes obtained by BugLocator, BRTracer and BLUiR.

Again, the rank differences can be small but significant, moving a file from top-10 to top-5

(ResolvedMemberImpl.java) or from top-5 to top-1 (EclipseSourceType.java). In some cases the file goes from not being in the top-10 to being in the top-1 (ReferenceType.java), even if it is in the lowest scoring fourth position in the stack.

Table 4.6 also shows the effect of running ConCodeSe just with ST and TT scores together

(ST+TT) against only using TT score, i.e. without KP and VSM scoring methods, except for

ZXing and Pillar1, which don't have any stack traces in their bug reports (Table 4.4). ST+TT scoring provides a significant advantage over the TT score alone in positioning affected files of a bug report in the top-1. In particular for AspectJ, Eclipse, Tomcat and ArgoUML, ST scoring places more bug reports in all top-N categories, indicating that giving file names found in the stack trace a higher score contributes to improving the performance of the results, which is also in line with the findings of previous studies (Schröter et al., 2010; Moreno et al., 2014; Wong et al., 2014).

Table 4.7: Example of ranking achieved by leveraging stack trace information in AspectJ bug reports compared to other tools

AspectJ  Affected file        Exception               Stack  Bug      BR       BLUiR   ConCodeSe
BR#                           Description             pos.   Locator  Tracer           without  with
138143   EclipseSourceType    NullPointerException    1st    1        2        5       5        1
158624   ResolvedMemberImpl   UnsupportedOperation    2nd    16       6        56      7        2
153490   ReferenceType        IllegalStateException   4th    122      3        74      11       1

Note that there is no significant difference between ST+TT and TT scoring for SWT and

ArgoUML. Only 4 of SWT's bug reports have a stack trace and it happens that in those cases the lower TT score value of 2 for the files occurring in the stack trace is still enough to rank them highly. For ArgoUML, 5 bug reports contain stack trace information and using ST+TT scoring adds only one more bug report to the top-1 compared to the other two scoring variations.

The small difference is due to the relevant file for the other 4 bug reports not being among the ones listed in the stack trace or being listed at a position after the 4th. Since ST scoring only considers the first 4 files, in that case the affected file gets a TT score, because its name occurs in the bug report description.

Again, for Pillar1 and Pillar2 using ST+TT scoring alone did not make a difference and the results remained constant in all top-N categories. None of the Pillar2 bug reports contains stack traces and in the case of Pillar1 only 1 bug report description contains a stack trace, but the relevant file is listed after the 4th position and gets an ST score of zero (Section 4.3.1.2).

Recall from Figure 1.3 in Chapter 1 that AspectJ bug #158624 contains a detailed stack trace. When the search results of the tool that utilises the stack trace in bug reports (Wong et al., 2014) and the one that does not (Zhou et al., 2012a) are compared, the ranking of the relevant file improves from 16th to 6th position. However, based on the ST score logic in Algorithm 4.3, my approach finds the relevant file listed in the second position of the stack trace and assigns a high score of 7, thus ranking it in the 2nd position.

Table 4.8: Performance comparison between using combination of all scores vs only TT score during search with only concept terms

Project    Scoring           Top-1   Top-5   Top-10   MAP    MRR
AspectJ    TT only           2%      13%     26%      0.04   0.10
           All (KP+ST+TT)    9%      19%     31%      0.09   0.18
Eclipse    TT only           2%      6%      11%      0.03   0.06
           All (KP+ST+TT)    11%     18%     22%      0.10   0.16
SWT        TT only           19%     38%     45%      0.22   0.32
           All (KP+ST+TT)    45%     54%     59%      0.39   0.54
Tomcat     TT only           3%      10%     11%      0.06   0.08
           All (KP+ST+TT)    15%     23%     27%      0.16   0.20
Average    TT only           7%      17%     23%      0.09   0.14
           All (KP+ST+TT)    16%     27%     33%      0.16   0.24

4.4.1.3 KP and ST Score Improvement when Searching with Concepts Only vs All Bug Report Vocabulary

Table 3.8 in Chapter 3 showed that the TT score, i.e. function scoreWithFileTerms (Algorithm 3.1), placed at least one affected file into the top-10, on average, for 28% of the bug reports when only the domain concept terms are used during the search. To find out any advantage the KP and ST scores provide, I repeated the search performed in Sub-section

3.4.3 again using only the concept terms. Table 4.8 shows the improvements when the KP and ST scores are also utilised together with the TT score to rank files during the search.

In the cases of AspectJ, Eclipse, SWT and Tomcat, on average an affected file is positioned in the top-10 for 10% more bug reports (23% vs 33%).

In the cases of ArgoUML, ZXing, Pillar1 and Pillar2, the KP and ST scores did not provide any performance advantage over the TT score and are thus not listed in Table 4.8. Investigating the reasons revealed that the Pillar2 bug reports don't contain any stack traces and thus cannot take advantage of the ST scoring, and the bug reports of Pillar1 and ZXing contain only one stack trace each, where the relevant file already gets scored in the top-10 via the TT score. In the case of ArgoUML only 5 bug reports contain a stack trace, but the affected files are either not those listed in the stack trace or they are listed after the 4th position. Furthermore, the bug summary sentences in Pillar1 and Pillar2 do not refer to any source code file names, thus the KP score has no effect. In the cases of ArgoUML and ZXing, the affected files of the bug reports are still ranked in the top-N via the TT score.

Recall that in Sub-section 3.4.3, to improve the performance, I enhanced my approach to search using the actual words of the bug report rather than its associated concepts. Table 3.9 in

Chapter 3 shows that the simple fact of considering all the vocabulary of a bug report during

the search ranked one or more relevant files in the top-10, on average for 76% of the bug reports. Thus, I end the analysis of the contributions of positional scoring of file names in bug report summaries (key position) and stack traces (ST) with Table 4.9, which shows the combined rank 'boosting' effect of positional scoring, i.e. using KP and ST scoring together vs not using it. For example, in the case of the SWT project, using summary and stack trace scoring places an affected source file in the top-1 for 72% of the bug reports compared to the base case (TT score) of 59%. This performance advantage remains noticeably high for all the projects except Pillar1 and Pillar2, for the reasons stated earlier, i.e. setting the KP+ST scoring to on or off did not provide any advantage. Therefore only one set of values is presented for those projects (see row 'any' in Table 4.9).

Table 4.9: Performance comparison between using combination of Key Word and Stack Trace scoring On vs Off

Project    KP + ST   Top-1    Top-5    Top-10   MAP    MRR
AspectJ    on        42.3%    68.2%    78.3%    0.32   0.67
           off       35.0%    67.1%    78.3%    0.30   0.62
Eclipse    on        37.6%    61.2%    69.9%    0.37   0.57
           off       34.6%    59.6%    69.7%    0.35   0.55
SWT        on        72.4%    89.8%    92.9%    0.68   0.94
           off       59.2%    88.8%    92.9%    0.62   0.85
ZXing      on        55.0%    75.0%    80.0%    0.55   0.68
           off       35.0%    70.0%    80.0%    0.45   0.52
Tomcat     on        51.5%    69.2%    75.4%    0.52   0.66
           off       49.1%    68.6%    75.3%    0.51   0.65
ArgoUML    any       31.9%    61.5%    65.9%    0.30   0.55
Pillar1    any       29.6%    59.3%    63.0%    0.22   0.69
Pillar2    any       33.3%    66.7%    83.3%    0.26   0.92

4.4.1.4 Variations of Score Values

The KP and ST scores are computed in very simple ways, which may affect the performance of the outcome. Variations to those scores may or may not further enhance the performance.

I experimented further to evaluate the effects of my scoring mechanism by assigning different scores to the four positions in the summary and in the stack trace. Table 4.10 shows the results obtained after performing the following changes (a brief sketch of the variant values follows the list):

1. The scores were halved, e.g. 10, 8, 6, 4 became 5, 4, 3, 2.

2. The scores were reversed, i.e. 4, 6, 8, 10 and 3, 5, 7, 9.

3. All 8 positions (4 in the BR summary and 4 in the ST) have a uniform score of 10.

4. The scores were made closer to those of TT (close2base):

(a) for the summary positions: 3.00, 2.80, 2.60, 2.40

(b) for the stack positions: 2.90, 2.70, 2.50, 2.30

Table 4.10: Performance comparison of using variable values for KP, ST and TT scores

Project    Approach      Top-1    Top-5    Top-10   MAP    MRR
AspectJ    halved        42.0%    68.5%    78.3%    0.33   0.67
           reversed      34.6%    65.0%    77.6%    0.30   0.61
           uniform       37.4%    68.2%    78.3%    0.31   0.64
           close2base    41.6%    68.2%    78.3%    0.33   0.67
Eclipse    halved        37.2%    61.1%    69.8%    0.37   0.57
           reversed      35.7%    61.2%    69.9%    0.36   0.56
           uniform       36.6%    61.2%    69.9%    0.36   0.56
           close2base    37.8%    67.8%    78.3%    0.36   0.57
SWT        halved        71.4%    89.8%    92.9%    0.68   0.93
           reversed      69.4%    88.8%    92.9%    0.66   0.92
           uniform       71.4%    89.8%    92.9%    0.68   0.94
           close2base    72.4%    89.8%    92.9%    0.68   0.94
Tomcat     halved        50.9%    68.8%    75.5%    0.52   0.65
           reversed      50.5%    68.8%    75.5%    0.52   0.65
           uniform       51.0%    69.2%    75.4%    0.52   0.66
           close2base    51.7%    69.1%    75.4%    0.52   0.66
ArgoUML    any           31.9%    61.5%    65.9%    0.30   0.55
ZXing      any           55.0%    75.0%    80.0%    0.51   0.63
Pillar1    any           29.6%    59.3%    63.0%    0.22   0.69
Pillar2    any           33.3%    66.7%    83.3%    0.26   0.92
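For clarity, the four variants can be written side by side as score vectors. This is only a restatement of the values above; the halved stack-trace values are my arithmetic halving of 9, 7, 5, 3, which the list does not spell out explicitly.

public class ScoreVariantsSketch {
    // Original scores for the four key summary positions (KP) and four stack-trace positions (ST).
    static final double[] KP_ORIGINAL   = {10, 8, 6, 4};
    static final double[] ST_ORIGINAL   = {9, 7, 5, 3};

    static final double[] KP_HALVED     = {5, 4, 3, 2};
    static final double[] ST_HALVED     = {4.5, 3.5, 2.5, 1.5};   // assumed: original ST values halved

    static final double[] KP_REVERSED   = {4, 6, 8, 10};
    static final double[] ST_REVERSED   = {3, 5, 7, 9};

    static final double[] KP_UNIFORM    = {10, 10, 10, 10};
    static final double[] ST_UNIFORM    = {10, 10, 10, 10};

    static final double[] KP_CLOSE2BASE = {3.00, 2.80, 2.60, 2.40};
    static final double[] ST_CLOSE2BASE = {2.90, 2.70, 2.50, 2.30};
}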

Halving and close2base keep the order of each set of 4 positions and the results obtained are similar to those obtained with the original score values, which are 10, 8, 6, 4 for the summary positions and 9, 7, 5, 3 for the stack trace positions. The reversed and uniform scorings break that order and lead to the worst results. This confirms that the relative importance of the various positions (especially the first position) found through inspection of SWT and AspectJ applies to most projects.

In the cases of Pillar1 and Pillar2 changing the KP and ST scoring doesn't make a difference because none of the Pillar2 bug reports contained stack traces and only 1 of the Pillar1 bug report descriptions contained a stack trace. In the case of ZXing none of the scoring variations made a difference, due to the small number of bug reports, so only one set of performance results is reported (see row 'any' in Table 4.10). In the case of SWT, it made little difference as it has only a few more bug reports than ArgoUML.

Looking closer at Table 4.10, I note that in the cases of Eclipse, SWT and Tomcat, close2base places more bug reports in top-1 than any other variation. Investigating more

closely, I found that one additional bug report for SWT and two for Tomcat are placed in the top-1. In the case of SWT, the only affected source code file, Spinner.java, for bug report #96053 achieved a TT score of 2.56 by function scoreWithFileTerms (Algorithm 3.1) and is ranked in the 2nd position, whereas the file Text.java achieved a KP score of 4 and is ranked 1st. Further analysis revealed that the last word "text" in the bug report summary sentence matches that file name, thus assigning a high KP score value of 4 to the file Text.java.

However, when close2base scores are used, the KP score value for the last word position is set to 2.40 (see point 4a in the variations list above), which is lower than the TT score (2.56), thus ranking Spinner.java as 1st and Text.java as 2nd. Similar patterns were discovered in

Eclipse and Tomcat.

Comparing Table 4.10 to the results achieved by other approaches (Figures 4.3, 4.4 and 4.5 in the next Sub-section), I note that the halved and close2base variations outperform the other approaches in most cases, showing that the key improvement is the higher and differentiated scoring of the 4 positions in the summary and stack trace, independently of the exact score.

To sum up, the four systematic transformations of the score values and the better performance of the halved and close2base transformations provide evidence that what matters is not the precise value of each score but their relative values. Moreover, the heuristics (the more likely occurrence of relevant file names in certain positions) were based on the analysis of only 10% of the bug reports. Especially for large projects like Eclipse, with many people reporting bugs, one can reasonably expect that many bug reports will deviate from the sample. The similar behaviour of the variants across all projects and all bug reports (close2base and halved are better than uniform, which is better than reversed) therefore provides reassurance that the chosen values capture the heuristics well and that the heuristics are valid beyond the small sample size used to obtain them.

4.4.1.5 Overall Results

Finally, I compare the performance of ConCodeSe against the state-of-the-art tools using their datasets and metrics (Figures 4.1, 4.2, 4.3, 4.4 and 4.5).

As mentioned before, I was only able to obtain BugLocator and BRTracer², which meant that for the other approaches I could only refer to the published results for the datasets they used. This means I could compare my Pillar1 and Pillar2 results only against those two approaches and couldn't, for example, run BLUiR and AmaLgam on the projects used by LOBSTER and LtR, and vice versa. LtR's top-N values for Tomcat were computed from the ranking results published in LtR's online annex (Ye et al., 2014). LtR also used AspectJ, Eclipse and SWT but with a different dataset to that of BugLocator. The online annex only included the Tomcat source code, so I was unable to rank the bug reports for the other projects with LtR.

² Although BLIA is available online, I was unable to run it on datasets other than the ones used by its authors (AspectJ, SWT and ZXing).

Figure 4.1: Performance comparison of MAP and MRR values per tool for AspectJ, Eclipse and SWT (n = number of BRs analysed)

Figures 4.1, 4.2, 4.3, 4.4 and 4.5 show that, except for AmaLgam's top-1 performance on

AspectJ, ConCodeSe outperforms or equals all tools on all metrics for all projects, including

BugLocator’s MAP for ZXing, which BLUiR and AmaLgam weren’t able to match. For

LOBSTER, the authors report MAP and MRR values obtained by varying the similarity distance in their approach, and I took their best values (0.17 and 0.25). LOBSTER only investigates the added value of stack traces, so to compare like for like, I ran ConCodeSe on their ArgoUML dataset using only the ST scoring and improved on their results (Figure 4.2

ConCodeSe-(ST) row).

Figure 4.2: Performance comparison of MAP and MRR values per tool for Tomcat, ArgoUML, Pillar1 and Pillar2 (n = number of BRs analysed)

I note that ConCodeSe always improves the MRR value, which is an indication of how many files a developer has at most to go through in the ranked list before finding one that needs to be changed. User studies (Kochhar et al., 2016; Xia et al., 2016) indicate that developers only look at the top-10, and especially the top-5, results. ConCodeSe has the best top-5 and top-10 results across all projects.
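As a reminder, these two metrics have the standard formulations below (this is the usual textbook definition rather than a new derivation), over a set Q of bug reports:

\[
\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{MAP} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{|R_i|}\sum_{k=1}^{N} P_i(k)\,\mathrm{rel}_i(k),
\]

where rank_i is the position of the first affected file in the ranked list for bug report i, R_i is the set of affected files of bug report i, P_i(k) is the precision at cut-off k, rel_i(k) is 1 if the file at rank k is affected (0 otherwise), and N is the number of ranked files.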

I also get distinctly better results than Rahman et al. (2015), the only other approach to explicitly use file names found in bug reports. Looking at the largest project, Eclipse (Figure 4.3), reveals that even a small 1.7% top-1 improvement over the second-best approach (BLUiR) represents 52 more bug reports for which the first recommended file is relevant, thus helping developers save time.

I also notice that my tool performs almost 2% worse than AmaLgam (42.3% vs 44.4%) for

AspectJ when placing relevant files in the top-1. Investigating the reasons revealed that in 2 of the

AspectJ bug reports a FileNotFound exception is reported and the changed file is ranked in

2nd position despite being listed in the stack trace. This is because the stack trace lists a utility

file in the 1st position but the affected file in the 2nd position. For example, in AspectJ bug report #123212, the file AjBuildManager.java uses FileUtil.java to write information to an external file and the FileNotFound exception is thrown by FileUtil first and then propagated to its users like AjBuildManager. Since ST scores are assigned based on the order of each file appearing in the stack trace, in the case of AspectJ bug report #123212, FileUtil gets a higher score than AjBuildManager. To handle this scenario, I experimented by adjusting my ST scoring values but the overall results deteriorated.

Figure 4.3: Performance comparison of the tools for AspectJ and Eclipse (n = BRs analysed)

In the case of Pillar1, my tool achieves significantly higher performance than BugLocator and BRTracer in all top-N categories (Figure 4.5). It is also interesting to see that BugLocator outperforms BRTracer in the top-1 and top-5 metrics, despite BRTracer being reported to be an improvement over BugLocator. In the case of Pillar2, although my approach achieves identical performance to the second best in the top-5 and top-10 metrics, it is far superior in the top-1 metric and thus outperforms the other approaches in terms of MAP and MRR (Figures 4.1 and 4.2).

Figure 4.4: Performance comparison of the tools for SWT and ZXing (n = BRs analysed)

Recall Figure 1.2 in Chapter 1: the SWT bug report #92757 has a very terse description; BugLocator (Zhou et al., 2012a) and BRTracer (Wong et al., 2014) rank the relevant file at the 88th and 75th positions respectively. However, my approach matches the relevant file based on the stemmed terms from the bug report and the file's code, and thus ranks it in the 5th position.

Pillar1 and Pillar2 are evidence that my approach performs well even if a bug report doesn’t mention file names. As a further example, SWT bug report #58185 mentions no

files and yet ConCodeSe places 3 of the 4 relevant files in the top-5, whereas BugLocator and

BRTracer only place 2 files. Similarly, AspectJ bug report #95529 contains no file names but out of the 11 relevant files, ConCodeSe ranks 3 files in the top-5 and 1 in the top-10 whereas

BugLocator and BRTracer only rank 1 relevant file in the top-5. In all these cases the KP and ST scores are zero and Algorithm 4.1 uses the lexical (TT) score. The TT and VSM scores are computed with and without using the file's comments, with and without stemming. Thus, even if a bug report doesn't mention file names, my approach still obtains 2 × 2 × 2 = 8 ranks for each file, to choose the best of them.

Figure 4.5: Performance comparison of the tools for Tomcat, ArgoUML, Pillar1 and Pillar2 (n = number of BRs analysed)

Always having 8 ranks to choose from also helps with the other extreme: the bug report includes many file names, but most are irrelevant. For example, SWT bug report #83699 mentions 14 files but only 1 is relevant. I rank it in the 4th position using the file's comments, whereas BugLocator and BRTracer rank it 9th and 14th, respectively. Similarly, AspectJ bug report #46298 has a very verbose description that mentions 9 files, none of them relevant. I list the only relevant file in the 2nd position using VSM and comments; BugLocator and BRTracer list it in 6th and 12th position respectively.

A bug report can be very terse, e.g. a short sentence in the summary and an empty description field, like SWT bug report #89533. In this example my tool ranks the only relevant file in 3rd position, by applying VSM and stemming to the bug report summary and the file’s comments, whereas BugLocator and BRTracer rank the same file in 305th and

19th position respectively. Similarly, for AspectJ bug report #39436, the only relevant file is ranked in the top-5, based on the comments in the file, whereas BugLocator and BRTracer rank the same file below the top-10.

Table 4.11: Number of files for each bug report placed in the top-10 by ConCodeSe vs BugLocator and BRTracer

                  BugLocator                   BRTracer
Project    Better    Same     Worse      Better    Same     Worse
AspectJ    45%       48%      7%         30%       55%      15%
Eclipse    23%       64%      13%        22%       65%      13%
SWT        24%       69%      6%         21%       74%      4%
ZXing      10%       85%      5%         20%       75%      5%
Tomcat     22%       66%      12%        56%       37%      7%
ArgoUML    25%       62%      13%        29%       59%      12%
Pillar1    48%       48%      4%         48%       48%      4%
Pillar2    58%       33%      8%         50%       33%      17%

4.4.1.6 Performance

Figures 4.3, 4.4 and 4.5 only count for how many bug reports at least one affected file was placed in the top-N. The MAP and MRR values indicate that ConCodeSe tends to perform better for each bug report compared to other tools, so I additionally analysed the per-bug-report performance to measure the number of files for each bug report placed in the top-10.

This analysis required access to per bug report results and the only publicly available tools that functioned once installed were BugLocator and BRTracer.

Table 4.11 shows, for example, that for 128/286=45% (resp. 85/286=30%) of AspectJ’s bug reports, my tool placed more files in the top-10 than BugLocator (resp. BRTracer). This includes bug reports in which ConCodeSe placed at least one and the other tools placed none.

The 'same' columns indicate the percentage of bug reports for which both tools placed the same number of affected files in the top-10. This includes cases where all approaches can be improved (because neither ranks an affected file in the top-10), and where none can be improved (because all tools place all the affected files in the top-10).

Neither Figures 4.3, 4.4 and 4.5 nor Table 4.11 show the recall, i.e. the number of relevant files, out of all the affected relevant files, placed in the top-N. To find out the recall at top-N, similar to Sub-section 3.4.3.1 in Chapter 3, I calculated the min, max, mean and median values of the results for each project (Table 3.10). Again, the recall is capped at 10 (N=10), which means that if a bug report has more than 10 relevant files only 10 are considered.

The min and max values were always 0 and 1 respectively because my approach is either able to position all the relevant files in the top-10 or none. I found that the top-10 recall performance remained the same between all combined scores (KP+ST+TT) compared to the

TT score alone in all of the projects as reported in Table 3.10.

In contrast, the recall at top-5 marginally improved (by 1%) for AspectJ, Eclipse, SWT and Tomcat. In the cases of ZXing, ArgoUML, Pillar1 and Pillar2, the recall at top-5 between the combined scores and the TT score alone remains the same due to the reasons explained in Sub-section 4.4.1.3, i.e. the absence of file names or stack traces in the summary or description of the bug reports.

Figure 4.6: Recall performance of the tools for top-5 and top-10 for AspectJ, Eclipse, SWT and ZXing (n = number of BRs analysed)

Furthermore, I compared the recall performance of ConCodeSe against BugLocator and

BRTracer as they were the only available tools. Once again, the min and max values were always 0 and 1 respectively because the tools are either able to position all the relevant files in the top-10 or not. Therefore, in Figures 4.6 and 4.7, I report only the average (mean) for the top-5 and the top-10 metrics.

The results reveal that ConCodeSe's performance is significantly superior to the other two current state-of-the-art tools even when a different metric, i.e. recall, is used to measure the performance instead of MAP and MRR as reported in their respective studies (Zhou et al.,

2012a; Wong et al., 2014).

I also note that BRTracer's performance is weaker than BugLocator's for Tomcat, Pillar1 and ZXing. Since BRTracer improves upon BugLocator by using segmentation and stack traces, in the case of smaller projects, i.e. Pillar1 and ZXing (Table 2.1), or when there are fewer stack traces in the bug reports (Table 4.4), the improvements over BugLocator fall short.

Figure 4.7: Recall performance of the tools for top-5 and top-10 for Tomcat, ArgoUML, Pillar1 and Pillar2 (n=number of BRs analysed)

From all the results shown I can answer RQ2 affirmatively: leveraging the occurrence of

file names in bug reports leads in almost every case to better performance than using project history.

4.4.2 RQ2.1: Scoring without Similar Bugs

In the previous Section, I confirmed that file names occurring in bug reports provide improved performance; in this Section, I therefore evaluate the contribution of using similar bug reports to the performance of other tools, compared to my approach, as asked by my

first sub-research question in Section 4.1. Earlier I described that BugLocator, BRTracer,

BLUiR, AmaLgam, BLIA and Rahman et al. (2015) utilise a feature called SimiScore, which

uses the bug report terms to find similar closed bug reports. The files changed to fix those bug reports are suggested as likely candidates for the current bug report being searched. To answer RQ2.1 I ran BugLocator and BRTracer with and without SimiScore, as explained at the start of Section 4.4.

Table 4.12: Example of ranking achieved by leveraging similar bug reports in other tools vs not in ConCodeSe

BR#     Affected        BugLocator             BRTracer               BLUiR                   ConCodeSe
        Java Files      rVSM      SimiScore    no Simi    SimiScore   structure   SimiScore
78856   OS.java         37        36           89         88          3           6           1
79419   Link.java       18        18           4          4           31          6           1
        OS.java         58        58           48         47          11          21          1
83262   RTFTransfer     224       224          214        214         43          79          35
        TextTransfer    202       202          198        197         -           -           36
87676   Tree.java       49        21           10         6           4           3           2

Unfortunately, I was unable to obtain Rahman et al.'s tool, BLUiR or AmaLgam to perform runs without SimiScore, but I do not consider this to be a handicap because from the published results it seems that SimiScore benefits mostly BugLocator and BRTracer.

I selected the SWT bug reports reported in the BLUiR paper (#78856, #79419, #83262 and #87676) and then ran BugLocator and BRTracer to compare their performance. As shown in Table 4.12, BugLocator placed the file Tree.java in the 49th and 21st positions in the ranked list, first by using their revised VSM (rVSM) approach and then by also considering similar bug reports. In the case of BRTracer, the introduced segmentation approach already ranked the file in the top-10 (10th position) and SimiScore placed the same file at an even higher position (6th). In the case of BLUiR, the same file is placed at the 4th and 3rd positions respectively. For the other cases in the table, SimiScore doesn't improve (or only slightly so) the scoring for BugLocator. In the case of BLUiR, apart from the great improvement for

Link.java, SimiScore leads to a lower rank than structural IR.

It can be seen in Table 4.13 that ConCodeSe achieves on average 46% (0.40/0.25) and

35% (0.57/0.40) superior results to BugLocator in terms of MAP and MRR respectively, making my approach more suitable for projects without closed bug reports similar to the new ones.

I ran BugLocator and BRTracer without SimiScore on all projects, to have a more like-for-like comparison with ConCodeSe in terms of input (no past bug reports). Comparing Figures 4.3, 4.4, 4.5 (with SimiScore) and Table 4.13 (without) shows a noticeable performance decline in BugLocator and BRTracer when not using similar bug reports, and thus an even

greater improvement achieved by ConCodeSe. BLUiR without SimiScore also outperforms BugLocator and BRTracer with SimiScore. Interestingly, BRTracer and ConCodeSe perform equally well in the top-5 and top-10 for Pillar2.

Table 4.13: Performance of ConCodeSe compared to BugLocator and BRTracer without using the similar bug reports score, across all projects

Project    Approach      Top-1    Top-5    Top-10   MAP    MRR
AspectJ    BugLocator    22.7%    40.9%    55.6%    0.19   0.18
           BRTracer      38.8%    58.7%    66.8%    0.27   0.47
           ConCodeSe     42.3%    68.2%    78.3%    0.33   0.67
Eclipse    BugLocator    24.4%    46.1%    55.9%    0.26   0.35
           BRTracer      29.6%    51.9%    61.8%    0.30   0.40
           ConCodeSe     37.6%    61.2%    69.9%    0.37   0.57
SWT        BugLocator    31.6%    65.3%    77.6%    0.40   0.47
           BRTracer      46.9%    79.6%    88.8%    0.53   0.59
           ConCodeSe     72.4%    89.8%    92.9%    0.68   0.94
ZXing      BugLocator    40.0%    55.0%    70.0%    0.41   0.48
           BRTracer      45.0%    70.0%    75.0%    0.46   0.55
           ConCodeSe     55.0%    75.0%    80.0%    0.55   0.68
Tomcat     BugLocator    42.1%    62.4%    71.0%    0.26   0.33
           BRTracer      36.6%    57.3%    65.6%    0.45   0.56
           ConCodeSe     49.9%    69.2%    75.4%    0.52   0.66
ArgoUML    BugLocator    18.7%    42.9%    54.9%    0.11   0.48
           BRTracer      18.7%    46.2%    54.9%    0.20   0.38
           ConCodeSe     31.9%    61.5%    65.9%    0.30   0.55
Pillar1    BugLocator    18.5%    29.6%    33.3%    0.17   0.37
           BRTracer      14.8%    29.6%    29.6%    0.14   0.31
           ConCodeSe     29.6%    59.3%    63.0%    0.22   0.69
Pillar2    BugLocator    16.7%    58.3%    66.7%    0.17   0.61
           BRTracer      16.7%    66.7%    83.3%    0.17   0.61
           ConCodeSe     33.3%    66.7%    83.3%    0.26   0.69
Average    BugLocator    27%      50%      61%      0.25   0.40
           BRTracer      31%      57%      65%      0.33   0.50
           ConCodeSe     44%      69%      76%      0.40   0.57

I answer RQ2.1 by saying that although the contribution of similar bug reports significantly improves the performance of BugLocator and BRTracer, it is not enough to outperform

ConCodeSe. The large contribution of SimiScore for BugLocator and BRTracer is mainly due to the lower baseline provided by rVSM, as reinforced by the results in the next section.

4.4.3 RQ2.2: VSM’s Contribution

As described in Sub-section 4.3.1, 4 of a file's 8 rankings are obtained with the VSM probabilistic method available in the Lucene library, and the other 4 with my lexical similarity ranking method (Algorithm 4.1). To find out the added value of VSM, my second sub-research question aims to evaluate the overall contribution of the VSM variant adopted in my approach, and how it performs compared to rVSM. I performed two runs, one with VSM only and the other with lexical similarity scoring only (Table 4.14).

Table 4.14: Performance comparison between Lucene VSM vs lexical similarity scoring within ConCodeSe

Project    Approach              Top-1    Top-5    Top-10   MAP    MRR
AspectJ    VSM                   28.0%    47.2%    60.1%    0.22   0.49
           lexical similarity    20.6%    44.4%    59.4%    0.20   0.39
Eclipse    VSM                   23.3%    45.7%    56.6%    0.25   0.41
           lexical similarity    18.6%    36.3%    48.2%    0.21   0.32
SWT        VSM                   39.8%    71.4%    82.7%    0.46   0.66
           lexical similarity    63.3%    79.6%    89.8%    0.60   0.82
ZXing      VSM                   35.0%    55.0%    65.0%    0.37   0.45
           lexical similarity    40.0%    70.0%    80.0%    0.47   0.54
Tomcat     VSM                   28.1%    47.7%    57.6%    0.32   0.43
           lexical similarity    34.0%    56.6%    64.0%    0.38   0.48
ArgoUML    VSM                   24.2%    51.6%    58.2%    0.24   0.45
           lexical similarity    13.2%    48.4%    56.0%    0.20   0.33
Pillar1    VSM                   25.9%    48.1%    55.6%    0.18   0.59
           lexical similarity    7.4%     33.3%    40.7%    0.10   0.36
Pillar2    VSM                   8.3%     66.7%    75.0%    0.18   0.62
           lexical similarity    25.0%    66.7%    75.0%    0.20   0.71

VSM outperforms the lexical scoring for the larger projects (AspectJ and Eclipse), i.e. those with the most code files (Table 2.1), and underperforms for small projects (SWT, ZXing and Pillar2). In the case of the medium-size projects (Tomcat and ArgoUML), lexical scoring outperforms VSM for Tomcat and underperforms for ArgoUML. As is to be expected, each scoring by itself has a poorer performance than ConCodeSe, which combines both, whilst the total number of bug reports located by ConCodeSe (Figures 4.3, 4.4 and 4.5) is not the sum of the parts.

In other words, many files are ranked in the top-N by both scoring methods, and each is able to locate bug reports the other can’t. ConCodeSe literally takes the best of both worlds.

For all projects except ArgoUML, VSM by itself performs worse than the best other approach (Figure 4.5), which shows the crucial contribution of lexical similarity scoring to the improved performance of my approach.

The VSM variant I adopt outperforms rVSM, introduced in BugLocator and also utilised in BRTracer, in many cases. Even ConCodeSe's lexical similarity by itself outperforms rVSM in most cases. This can be seen by comparing the VSM (or lexical similarity) rows of Table 4.14 against the BugLocator rows of Table 4.13, where SimiScore is turned off to only use rVSM. Lexical similarity alone in ConCodeSe also outperforms rVSM for the small and medium-size projects, except top-1 for ArgoUML.

I thus answer RQ2.2 by concluding that VSM provides a bigger contribution for projects with a large number of files, which makes the use of term and document frequency more meaningful. I also confirm that the exact IR variant used is paramount: Lucene's VSM and my simple lexical matching outperform BugLocator's bespoke rVSM in many cases, as well as BLUiR's Okapi, especially for SWT and ZXing. However, VSM on its own isn't enough to outperform other approaches.

4.5 Discussion

It has been argued that textual information in bug report documents is noisy (Zimmermann et al., 2010). My approach is partly based on focussing on one single type of information in the bug report: the occurrence of file names. The assumption is that if a file is mentioned in the bug report, it likely needs to be changed to fix the bug. Manual inspection of only

9.3% of the bug reports reinforced that assumption and revealed particular positions in the summary and in the stack trace where an affected file occurs. My improved results provide further evidence of the relevance of the assumption.

On the other hand, it has been argued that if bug reports already mention the relevant

files, then automated bug localisation provides little benefit (Kochhar et al., 2014). This would indeed be the case if almost all files mentioned in a bug report were true positives. As

Table 3.1 shows, most bug reports only affect a very small number of files, and yet they may mention many more files, especially if they contain stack traces (Table 4.4). In Section 4.4.1.5

I gave two examples, both placed in the top-5 by ConCodeSe: one bug report mentioning 9

files, all irrelevant, and another mentioning 14, of which only one was relevant. Developers are interested in finding out focus points and how files relate to each other, especially when working in unfamiliar applications (Sillito et al., 2008). Automated bug localisation can help developers to evaluate the information when looking at the files mentioned in bug reports, and even suggest relevant files not mentioned in the reports.

I offer some reasons why my approach works better.

First, other approaches combine scores using weight factors that are fixed. I instead always take the best of several ranks for each file. In this way, my approach does not fix a priori, for each file, whether the lexical scoring or the probabilistic VSM score should take precedence. I also make sure that stemming and comments are only taken into account for files where it matters. The use of the best of 8 scores is likely the reason for improving the key MRR metric across all projects.
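To make this first point concrete, the following minimal sketch illustrates the idea of keeping, for each file, the best rank it achieves across several scoring variants; the function and file names are illustrative and not taken from ConCodeSe’s actual implementation.

```python
def best_rank_per_file(rank_lists):
    """Combine several rankings of the same candidate files by keeping, for each
    file, the best (lowest) position it achieves in any of the variants."""
    best = {}
    for ranking in rank_lists:  # e.g. lexical vs. VSM, with/without stemming, with/without comments
        for position, file_name in enumerate(ranking, start=1):
            if position < best.get(file_name, float("inf")):
                best[file_name] = position
    return sorted(best, key=best.get)  # files ordered by their best achieved rank

# two hypothetical variants disagree; the combination keeps the better rank of each file
lexical = ["Shell.java", "Widget.java", "Display.java"]
vsm = ["Display.java", "Shell.java", "Table.java"]
print(best_rank_per_file([lexical, vsm]))
```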

Second, I leverage structure further than other approaches. Like BLUiR, I distinguish the bug report’s summary and description, but whereas BLUiR treats each bug report field in exactly the same way (both are scored by Indri against the parts of a file), I treat each field differently, through the key position and stack trace scoring.
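As an illustration of such field-specific treatment, the sketch below rewards a candidate file whose name occupies a key position of the summary or appears near the top of a stack trace; the chosen positions and score values are placeholders, not the tuned constants used in the thesis.

```python
def key_position_score(file_name, summary, trace_classes,
                       summary_bonus=5.0, trace_bonus=4.0, trace_depth=4):
    """Illustrative scoring: reward a file whose name is the first or last token of
    the bug report summary, or one of the first few classes named in the stack trace
    (trace_classes is assumed to be the ordered list of class names already
    extracted from the stack trace, without the .java suffix)."""
    base = file_name[:-5] if file_name.endswith(".java") else file_name
    tokens = summary.replace(".", " ").replace("(", " ").replace(")", " ").split()
    score = 0.0
    if tokens and base in (tokens[0], tokens[-1]):
        score += summary_bonus
    if base in trace_classes[:trace_depth]:
        score += trace_bonus
    return score
```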

Third, my approach simulates the selective way developers scan bug reports better than other approaches. It has been argued that developers may not necessarily need automated bug localisation tools when file names are present in the bug reports (Kochhar et al., 2014), because they may be able to select which information in the bug report is more relevant, based on the structure of the sentences and their understanding of the reported bug (Wang et al., 2015). ConCodeSe first looks at certain positions of the bug report and then, if unsuccessful, uses all terms in the bug report, stopping the scoring when a full file name match is found. However, the same cannot be said for most automated tools: they always score a file against all the bug report terms, which may deteriorate performance, since a file with more terms matching the bug report is scored higher and may be falsely ranked as more relevant (Moreno et al., 2014).

Fourth, my approach addresses both the developer nature and the descriptive nature of bug reports, which I observed in the analysis of the bug reports for these projects and in particular for Pillar1 and Pillar2 (Dilshener and Wermelinger, 2011). Bug reports of a developer nature tend to include technical details, like stack traces and class or method names, whereas bug reports of a descriptive nature tend to contain user domain vocabulary.

By leveraging file (class) names and stack traces when they occur in bug reports, and by otherwise falling back to VSM and a base lexical similarity scoring (Chapter 3, Sub-section 3.3.2), I cater for both types of bug reports. As the non-bold rows of Table 4.9 show, the fall-back scoring alone outperforms all other approaches in terms of MAP and MRR (except AmaLgam’s MAP score for AspectJ).

Although LOBSTER doesn’t use historical information either, it was a study on the value of stack traces over VSM, and thus only processes a subset of developer-type bug reports, those with stack traces.

To sum up, I treat each bug report and file individually, using the summary, stack trace, stemming, comments and file names only when available and relevant, i.e. when they improve the ranking. This enables my approach to deal with very terse bug reports and with bug reports that don’t mention files.

As for the efficiency of my approach, creating the corpus from the application’s source code and bug reports takes, on a machine with a 3 GHz i3 dual-core processor and 8GB RAM, 3 hours for Eclipse (the largest project, see Table 2.1). Ranking (8 times!) its 12,863 files for all 3075 bug reports takes about 1.5 hours, i.e. on average almost 2 seconds per bug report. I consider this to be acceptable since my tool is a proof of concept and developers locate one bug report at a time.

4.5.1 Threats to Validity

During creation of the searchable corpus, I relied on regular expressions for extracting the stack trace from the bug report descriptions; other extraction techniques may produce different results. Also, the queries (i.e. search terms) in my study were taken directly from bug reports. Developers may formulate their queries differently when locating bugs in an IDE, and queries with vocabularies more in line with the source code might yield better results. However, using the bug report summaries and descriptions as queries instead of manually formulated ones avoided introducing bias on my behalf, which mitigates these threats to construct validity.
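For illustration, a regular expression along the following lines can pull the ordered list of class names out of Java stack-trace frames embedded in a bug report description; it is a simplified sketch and not necessarily the exact expression used in my implementation.

```python
import re

# matches frames such as "at org.eclipse.swt.widgets.Shell.open(Shell.java:123)"
FRAME = re.compile(r"\bat\s+[\w.$]+\(([\w$]+)\.java:\d+\)")

def trace_classes(description):
    """Return the class names referenced by stack-trace frames, in order of
    appearance and without duplicates (without the .java extension)."""
    seen, ordered = set(), []
    for name in FRAME.findall(description):
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```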

I addressed internal validity by comparing the search performance of ConCodeSe like for like (i.e., using the same datasets and the same criteria) with eight existing bug localisation approaches (Zhou et al., 2012a; Saha et al., 2013; Wong et al., 2014; Moreno et al., 2014; Wang and Lo, 2014; Ye et al., 2014; Youm et al., 2015; Rahman et al., 2015), as well as assessing the contribution of the off-the-shelf Lucene library’s VSM. Therefore, the improvement in results can only be due to my approach. I also used fixed values to score the files, obtained by manually tuning the scoring on AspectJ and SWT. By trying out scoring variations, I confirmed that the rationale behind those values (namely, to distinguish certain positions and to assign much higher scores than base term matching) led to the best results.

I catered for conclusion validity by using a non-parametric Wilcoxon matched-pairs statistical test, since no assumptions were made about the distribution of the results. This test is quite robust and has been extensively used in the past to conduct similar analyses (Schröter et al., 2010; Moreno et al., 2014). Based on the values shown in Table 4.15, I conclude that on average ConCodeSe locates significantly (p < 0.05) more relevant source files in the top-N, which confirms that the improvement in locating relevant files for a bug report in the top-N positions by my tool over the state of the art is significant.

Table 4.15: Wilcoxon test comparison for top-5 and top-10 performance of BugLocator and BRTracer vs ConCodeSe

              BugLocator             BRTracer
Statistics    Top-1       Top-10     Top-1        Top-10
Z-value       -4.2686     -3.2351    -2.73934     -3.9656
W-value       254         576        487          543
p-value       0.047504    0.0326     0.043432     0.0067
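The statistics in Table 4.15 come from my experiments; purely to illustrate how such a matched-pairs comparison can be run, the snippet below applies SciPy’s Wilcoxon test to made-up paired top-10 percentages for the same set of projects under two tools.

```python
from scipy.stats import wilcoxon

# hypothetical paired top-10 percentages for the same eight projects under two tools
tool_a = [0.76, 0.81, 0.70, 0.79, 0.74, 0.83, 0.72, 0.77]
tool_b = [0.65, 0.72, 0.61, 0.70, 0.66, 0.74, 0.60, 0.69]

statistic, p_value = wilcoxon(tool_a, tool_b)  # non-parametric matched-pairs test
print(f"W = {statistic}, p = {p_value:.4f}")   # p below 0.05 would indicate a significant difference
```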

The characteristics of the projects (e.g. the domain, the identifier naming conventions, and the way comments and bug reports are written, including the positions where file names occur) are a threat to external validity. I reduced this threat by repeating the search experiments with different applications, developed independently of each other, except for SWT and Eclipse. Although ConCodeSe only works on Java projects for the purpose of literature comparison, the principles (take the best of various rankings, score file names occurring in the bug report higher than other terms, look for file names in particular positions of the summary and of the stack trace if it exists) are applicable to other object-oriented programming languages.

4.6 Concluding Remarks

This Chapter contributes a novel algorithm that, given a bug report and the application’s source code files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug. The algorithm extends my previously proposed approach introduced in Chapter 3 by considering words in certain positions of the bug report summary and of the stack trace (if available in a bug report) as well as source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

I compared the results to eight existing approaches, using their 5 evaluation criteria and their datasets (4626 bug reports, from 6 OSS applications), to which I added 39 bug reports from 2 other applications. I found that my approach improved the ranking of the affected files, increasing in a statistically significant way the percentage of bug reports for which a relevant file is placed among the top-1, 5, 10, which is respectively 44%, 69% and 76% on average.

This is an improvement of 23%, 16% and 11% respectively over the best-performing current state-of-the-art tool. I also improved, in certain cases substantially, the mean reciprocal rank value for all eight applications evaluated, thereby reducing the number of files to inspect before finding a relevant file.
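For reference, the top-N and MRR figures quoted above can be computed along the following lines; this is a generic sketch with hypothetical inputs, not the thesis’s evaluation scripts.

```python
def evaluate(ranked_lists, affected_sets, n_values=(1, 5, 10)):
    """Given, per bug report, the ranked candidate files and the set of actually
    affected files, compute top-N hit rates and the mean reciprocal rank (MRR)."""
    hits = {n: 0 for n in n_values}
    reciprocal_ranks = []
    for ranking, affected in zip(ranked_lists, affected_sets):
        first = next((pos for pos, f in enumerate(ranking, start=1) if f in affected), None)
        reciprocal_ranks.append(1.0 / first if first else 0.0)
        for n in n_values:
            if first is not None and first <= n:
                hits[n] += 1
    total = len(ranked_lists)
    top_n = {f"top-{n}": hits[n] / total for n in n_values}
    return top_n, sum(reciprocal_ranks) / total
```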

My approach not only outperforms other approaches, it does so in a simpler, faster, and more general way: it uses the fewest artefacts necessary (one bug report and the source codebase it was reported on), not requiring past information like version history or similar bug reports that have been closed, nor the tuning of any weight factors to combine scores, nor the use of machine learning.

Like previous studies, my approach shows that it is challenging to find the files affected by a bug report: in spite of my improvements, for larger projects, e.g. Eclipse, 30% of bugs are not located among the top-10 files. Adding history-based heuristics and learning-to-rank, as proposed by other approaches, will certainly further improve the bug localisation performance.

In order to help develop new search approaches, I offer, in an online companion to this thesis, the full results as an improved baseline for further bug localisation research.3

Since I used only one industrial application (Pillar2) to evaluate my approach, I conducted user studies in 3 different companies to gain a broader understanding of how it may perform with other commercial applications. In the following Chapter, I report on the performance of my tool in industrial settings and how the ranked result list, sorted in the order of relevance, is perceived by professional developers.

3 http://www.concodese.com

Chapter 5

User studies

In the previous Chapter, I evaluated my approach with a range of OSS projects and showed it outperformed current state-of-the-art tools in a simpler, faster, and more general way that does not require history. To investigate the generalisability of my approach, I conducted user studies in three different companies with professional developers. The first aim of the study was to demonstrate the applicability of my approach in industrial environments. Commercial applications and bug reports may have different characteristics to the OSS applications investigated in Chapter 4, thus impacting the performance of my approach.

Previous user studies reveal that long search result lists returned by integrated development environment (IDE) tools cause developers to analyse several files before performing bug-fixing tasks (Sillito et al., 2008; Starke et al., 2009). Thus the second aim of the study was to know how developers perceive the search results of ConCodeSe, which presents a ranked list of candidate source code files that may be relevant for a bug report at hand during software maintenance.

In this Chapter, Section 5.1 gives an introduction by describing the challenges faced by developers when searching for relevant source code files, as identified by current studies. Subsequently, I describe my study design in Section 5.2 and present the results of the study in Section 5.3. Finally, I evaluate the findings in Section 5.4, and conclude with remarks in Section 5.5.

5.1 Background

Sillito et al. (2008) conducted two different studies, one in a laboratory setting with 9 developers who were new to the code base and the other in an industrial setting with 16 developers who were already working with the code base. In both studies developers were observed performing change tasks to multiple source files within a fairly complex code base using modern tools. The findings reveal that text-based searches available in current IDEs are inadequate because they require search terms to be precisely specified, otherwise irrelevant or no results are returned. The study claims that developers repeatedly perform discovery tasks in a trial-and-error mode, which causes additional effort and often results in several failed attempts.

Starke et al. (2009) performed a study with 10 developers to find out how developers decide what to search for and what is relevant for the maintenance change task at hand. Participants were randomly assigned one of 2 closed bug descriptions selected from the SubEclipse tool’s issue repository and instructed to carry out search tasks in the Eclipse IDE using the available search tools. The findings highlight that formulating a search query is the most challenging task for the developers, since the Eclipse search tools require the search terms to be precisely specified, otherwise no relevant results are returned. The authors also state that when many search results are returned, developers tend to lose confidence in the query and decide to search again rather than investigate what was returned. They propose future research on tool support for developers to provide more contextual information and to present results in a ranked order.

Both studies (Sillito et al., 2008; Starke et al., 2009) articulate that providing developers with automated code search tools that present results in a ranked order would be of great benefit in performing their daily tasks. In addition, Xia et al. (2016) and Kochhar et al. (2016) found top-5 to be the number of results that developers deem acceptable to inspect. Furthermore, Kochhar et al. (2016) observed that almost all developers ‘refuse’ to inspect more than 10 items. None of those studies used IR-based bug localisation. Thus the second aim of my study was to see how developers perceive and use the ranked search results of ConCodeSe.

After obtaining better results than the state-of-the-art literature on bug localisation using open-source benchmark projects, I was interested in finding out how my approach performs in industrial environments with proprietary business applications. Hence I asked my research question in Chapter 1 as follows.

RQ3: How does my approach perform with industrial applications and does it benefit developers by presenting the search results in ranked order of relevance?

So that users would not have to modify ConCodeSe to interface with their software repositories and issue trackers, I implemented a simple front-end (GUI) panel for my tool where users can paste the summary and description of a bug report and search for the candidate source code files. The results are displayed in ranked order of relevance, from most to least likely.

5.2 Study Design

Based on the works of Sillito et al. and Starke et al., I designed a study with professional software developers, with their a priori consent on the following:

1. Data collected:

(a) How useful was the ranked list?

(b) How accurate was the ranking?

(c) How confident were you in the list?

(d) How intuitive was it compared to an IDE search tool?

(e) What worked best?

2. Data collection method:

(a) Pre-session interview

(b) Each search session: choosing the bug reports, running the tool, evaluating the results.

(c) Post-session interview

I contacted five different companies where I formerly worked as a freelance software developer, explaining the study. I sent each company an information leaflet explaining that I was looking for participants to take part in a study where my tool would be used to search the source code files of a software application with the descriptions available in a bug report document. The leaflet informed interested parties how the study would unfold:

1. I would conduct a 30–45min pre-session interview to explain how to use ConCodeSe.

2. Afterwards, the participants would be required to try the tool for 7–10 business days in their own time and document their experience on the relevance of the suggested source files for the search tasks performed.

3. Finally, at the end of the trial period a post-session interview lasting 30–45min would be conducted to collect details on their usage experience.

Table 5.1: Business nature of the companies and professional experience of the participants involved in the user study

Company   Business Nature     Study Dates (2015)   Participant Id   Years of Experience   Years of IDE Experience   Time spent on maintenance
U         Finance             29.06 - 03.07        1                20                    10                        30%
U         Finance             29.06 - 03.07        2                15                    10                        70%
S         Logistic            06.07 - 04.08        3                9                     9                         20%
A         Software Services   05.06 - 16.10        4                11                    11                        40%

The leaflet further explained that participation would be treated in strict confidence in accordance with the UK Data Protection Act and that no personal information could be passed to anyone outside the research team. It also stated that I would aim to publish the findings from the study, but that no individual would be identifiable. Participants were allowed to answer questions with “not applicable” if they did not wish to provide an answer during either interview session.

Out of the five contacted companies, only three agreed to participate and two of them agreed to have the post-session interview recorded on video. Upon receiving the participation agreement from each company, I obtained written consent from each participant prior to the study. Table 5.1 shows the participant profile information at each company and Table 5.2 details the artefacts used in the study: the size of the industrial applications is comparable to the medium-sized OSS projects (Table 2.1). All participants were professional software developers and did not require compensation since they were recruited on a voluntary basis.

Although I had worked for the three companies in the past, I did not have prior knowledge of the applications used in the case study nor had I previous contact with the participants.

I went to each company and presented ConCodeSe and the aim of the study, which was to identify the benefits provided by a ranked list of candidate source code files that may be relevant for a bug report at hand during software maintenance. Subsequently, I conducted a pre-session interview with the developers to collect information about their experience and their thought process on how they performed their tasks during daily work. One of the intentions of this pre-session interview was to make developers aware of their working habits so that they could document their experience with my tool more accurately.

Table 5.2: Project artefact details used in the study at the participating companies

Company   Application in production since   # of source code class files   # of bug reports used
U         2014                              2840                           10
S         2009                              2240                           10
A         2013                              4560                           10

5.3 Results

I present the information collected during the pre- and post-interviews. Throughout both sessions developers referred to a source code file as class and to a bug report as defect.

5.3.1 Pre-session Interview Findings

Based on the observations made by Starke et al., I designed the pre-interview questions in order to obtain a summative profile of developers and their working context. Below I list the questions asked and the general answers given to each of these questions.

1. How do you go about solving defects? All developers indicated that they read bug report descriptions and log files, looking for what to fix based on experience. They also try to reproduce the bug to debug the execution steps and look at the source code to see what is wrong. Only one developer indicated that he compares the version of the application where the bug has occurred against a version where the bug hasn’t been reported yet, to detect any changes that might have caused the reported bug.

2. How would you go about solving a defect in an unfamiliar application? In general, developers responded that they search the code using some words (e.g. nouns and action words) from the bug description to pick a starting point. They also read the user guide to understand the behaviour of the application and the technical guide to understand the architecture. One developer indicated that he would attempt to simulate the reported behaviour if he could understand the scenario described in the bug report and some test data is provided.

3. What about when you cannot reproduce or haven’t got a running system? At least to get an idea of a starting point, developers look for some hints in class and method names. One developer stated that he would include logging statements in certain classes to print out trace information and collect details of execution in production, so that he could see what was happening during run time.

4. In order to find starting points to investigate further, what kind of search tools do you use? Although all developers said that they use search functions available in IDEs, e.g. full text, references, inheritance chain and call hierarchies, one said that he prefers to use the Mac OS Spotlight desktop search because, in addition to source files, it indexes other available artefacts like the configuration files, GUI files (e.g. html, jsp and jsf) as well as the documentation of the application.

5. How do you evaluate a call hierarchy? Developers explained that they would start by performing a ‘reference’ search of a class and browse through the results. One said that “I also look to see if method and variable names that are surrounding the search words also make sense for the bug description that I am involved with”.

6. What do you consider as being important, callers of a class or the called classes of a class? Each reply started with “That depends on...”. It seems that each participant has a different way of assessing the importance of the call hierarchy. For some developers the importance is based on the bug description, e.g. if the bug description indicated that some back-end modules are the culprit, then they would look to see where the control flows are coming from, while for others, it is based on architecture, e.g. if a certain class is calling many others then they would consider this to be a bad architecture and ignore the caller.

7. How do you decide which classes to investigate further, i.e. open to look inside, and which to skip? Almost all developers answered first by saying “gut feeling” and then went on to describe that they quickly skim through the results list and look for clues in package or file names to determine whether it makes sense to further investigate the file contents or not.

8. Do you consider descriptive clues? Once again developers replied that they rely on their experience of the project to relate conceptual words found in the bug reports to file names.

9. When do you decide to stop going through the path and start from the beginning? All participants indicated that after browsing through the search results looking at file names, they may open 3 or 4 files to skim through their content and, when no clues are detected, they would decide to start a new search with different words.

10. What kind of heuristics do you use when considering concepts implemented by neighbouring classes? In general, developers indicated that they look at the context of a source file within the bug that they are working on, i.e. the relevance of a file based on the bug report vocabulary. One developer said that project vocabulary sometimes causes ambiguity because, in the application he works with, certain file names contain the word Exception, which refers to a business exception rule and not to an application exception, i.e. an error condition.

The pre-session interview answers confirmed the challenges highlighted by previous studies (Sillito et al., 2008; Starke et al., 2009). I asked developers to try out ConCodeSe by downloading, installing and performing search tasks in their own time. I decided to perform an uncontrolled study because I felt that this would provide a more realistic environment and also allow developers adequate time to utilise my tool without negatively impacting their daily workload.

I asked them to collect screenshots showing the information they entered as search text and the results listed. I instructed them to use closed bug reports as search queries, where the affected files were also documented, so that they could verify whether the results contained the relevant files or not.

I arranged to meet them in 7–10 business days for a 30–45min post-session interview to gather their experiences and to inspect the screen shots.

5.3.2 Post-session Findings

Below, I list the questions asked during the post-interview sessions and the general answers given to each of these questions.

1. How did you go on using the tool? All indicated that they first used a few bug reports with which they had previously worked, to become familiar with my tool and its performance. Afterwards they randomly chose closed bug reports that they had not previously worked with, performed searches using the words found in the summary and description fields of the bug reports and then compared the search results against the files identified as affected in the issue-tracking tool. One developer said, “I performed search tasks by selecting a few words, i.e. 2–3, from the bug report, which I knew would lead to relevant results. However this did not work very well so I have gradually increased the search terms with additional ones found in the bug report descriptions. This provided more satisfactory results and the relevant classes started appearing in the result list”.

2. Did the tool suggest relevant classes in the ranked list? Inspecting the screenshots of the results, I found that in Company-U, for 8 out of 10 bug reports at least one file was in the top-10 and for the remaining 2 no file made it to the top-10. In Company-S, for 9 out of 10 bug reports the tool ranked at least one affected file at top-1. The developer in Company-A said that “The relevant file was always ranked among the top-3. I never needed to investigate files which were ranked beyond top-3”.

3. Were there relevant classes in the result list, which you might not have thought of without the tool? All participants replied with ‘yes’. One developer said that “I have also opened up the files listed at top-2 and top-3, despite that they were not changed as part of the bug report at hand. However looking at those files would have also led me to the relevant one, assuming that the relevant file was not in the result list”. Another indicated that “a bug report description had a file name which got ranked at top-1 but did not get changed. However it was important to consider that class during bug fixing”.

4. Did the ranked list provide clues to formulate search descriptions differently? Developers indicated that most of the time they had a clue about what they were looking for but were not always certain. However, seeing file names with a rank allowed them to consider those files with a degree of importance as reflected by the ranking. In the case of one defect, the description caused a lot of noise and resulted in the top 5 files being irrelevant, but 3 relevant files were placed in the top-10.

5. What did you add to or remove from the description to enhance the search? Developers indicated that the search led to better results when the query included possible file names, exceptions and stack trace information. However, in one case ignoring the description and searching only with the summary resulted in 4 out of 8 relevant files being ranked in the top-5 and one file between top-5 and top-10. One developer indicated that in one case, despite changing the search words, the relevant file was still not found because the bug report did not provide any clues at all.

6. What would you consider to reduce false positives in search results? In general, all participants suggested that project-specific vocabulary mapping should be used to cover cases when concepts in file names cause misleading results. For example, batch jobs are named as Controller, so when a bug report describes a batch job without using the word controller, those classes are not found if there are no comments indicating additional clues. In one such case, only 5 out of 18 affected files were found.

7. How comfortable were you with the ranked list? All developers indicated that the tool was easy to use and, after performing 3–4 searches to become familiar with the tool, they were satisfied with the results since the files at the first 5 positions were most often relevant ones. Despite their comfort with the tool, all participants indicated that if a bug report description contained fewer technical details, e.g. fewer file names, and more descriptive information, e.g. test scenario steps, the tool was not useful. In fact they felt that in such cases any tool would fail to suggest the relevant file(s).

8. Were you able to get to the other relevant classes based on the suggested ones that were relevant? All participants expressed that the ranked list helped them to consider other focal points and gave them a feel for what else to look for that might be relevant. One said that “Most of the cases I browsed through the results and opened only the top-1 and top-2 files to see if they were the relevant ones or to see if they can lead me to the file that may be more appropriate”.

9. Would you consider such a tool to support you in the current challenges you have? In general, all indicated that the ranked list would benefit anyone who is new to a project since it could guide a novice developer to the parts of the application when searching for files that may be relevant to solve a defect. This would in turn allow novice developers to rapidly build up application knowledge. One said “The tool would definitely speed up the learning curve for a new team member who is not familiar with the architecture and code structure of our project. It would save him a lot of time during the first few weeks. After that due to the small size of the project (12000 LOC) it would not provide much significance because the developer would become familiar with the code anyway”. I was told that as a research product my prototype tool was very stable. The participants experienced no failures, i.e. crashes or error exceptions. One developer said that even when running the tool in a virtual machine (VM) environment, suspending and then resuming the VM, my tool continued to function.

To my final, 10th question, “What would you suggest and recommend for future improvements?”, I received a lot of valuable suggestions. First of all, I was told by all developers that the biggest help would be to write better bug descriptions and to introduce a defect template to solicit this. They noticed that defect descriptions containing test steps entered by first-level support are noise. For example, in the case of one defect, the relevant file was ranked in the top-5 and its test file at top-1, so the relevant files were obscured by test files.

It was also suggested that I include in the search the content of configuration files, e.g. config.xml, DB scripts and GUI files. One developer noticed that the words in the first search field (bug report summary) are given more importance. He wished that the second field (bug report description) were treated with the same importance as the first search field.

In addition, I received several cosmetic suggestions for presenting the results and interacting with the search fields, like proposing keywords, e.g. auto-completion based on terms found in the source code. Participants also felt that it would be helpful to display, next to the ranked files, the words in those files that match the bug report, so that the user can determine whether it really makes sense to investigate the content of the file or not. Interestingly, one developer said that sometimes he did not consider the ranking as an important factor and suggested grouping files into packages, based on words matching the package name, and then ranking the files within each group.

Most of the suggestions concern the developer interface and are outside the scope of ConCodeSe. I intend to investigate whether configuration files can be matched to bug reports in future work.

5.4 Evaluation of the Results

Modern IDE search tools offer limited lexical similarity functionality during a search. Developers are required to specify search words precisely in order to obtain accurate results, which may require them to be familiar with the terminology of the application and the domain. The success of modern IDE search tools depends on the clarity of the search terms, otherwise the results may contain many false positives. To compensate for these weaknesses, developers choose to specify on average 3 words (see post-session interview Answer 1) when searching for relevant files in IDEs instead of using all the words available in a bug report. Since current IDE search tools deprive developers of the advantage of utilising the full information available in bug reports, developers may search outside of the IDE (see pre-session interview Answer 4).

Furthermore, in current IDEs the search results are not displayed in a ranked order of relevance, causing developers to go through several irrelevant files before finding relevant ones as entry points for performing bug-fixing tasks. Since developers are faced with the challenge of manually analysing a possibly long list of files, they usually tend to quickly browse through the results and decide on their accuracy based on their gut feeling, as revealed during my pre-session interview. They also prefer to perform a new search query using different words rather than opening some files to investigate their content. These repetitive search tasks cost additional effort and burden the productivity of developers, causing them to lose focus and introduce errors due to fatigue or lack of application knowledge.

I set out to explore whether my ranking approach would benefit developers. Based on the post-session interview answers provided by developers working in different industrial environments with different applications, I confirm that developers welcomed the ranked result list and stated that since most of the relevant files were positioned in the top-5, they were able to avoid the error-prone tasks of browsing long result lists and repetitive search queries by focusing on the top-5 portion of the search results.

I was interested in finding out whether the ranked list would point developers to other files that might be of importance but were not initially thought to be relevant. After trying out my tool, at the post-session interview the developers said that the result list contained other relevant files that they would not have thought of on their own without my tool (see post-session interview Answer 3). Developers stated that those additional, not previously considered files would not have appeared in the results of the IDE search tool they use, because those files do not contain the search terms. However, my tool was able to localise those files because my approach combines VSM probabilistic scoring with lexical similarity matching. Hence I strongly advocate making a VSM-based IR model part of any modern search tool.

Finally, I wanted to see whether my tool, which leverages the textual information available in bug reports, encourages developers to use the full description of a bug report when formulating search queries. During the pre-session interview, developers told me that they use their gut feeling and experience when selecting words to use in the search query. At the post-session interview, I was told by developers that augmenting the search words with additional ones from the bug reports improved the results.

In software projects, a developer may get assistance from other team members or expert users when selecting the initial entry points to perform the assigned maintenance tasks. I was told that my tool complements this by providing a more sophisticated search during software maintenance. Thus the bug report vocabulary can be seen as the assistance provided by the expert team members and the file names in the search results can be seen as the initial entry points to investigate additional relevant files.

Furthermore, I was told that search results still depend on the quality of bug descriptions. In the case of tersely described bug reports, even experienced developers find it challenging to search for relevant files. In addition, I found that bug reports can be of (1) a developer nature, with technical details, i.e. references to code files and stack traces, or (2) a descriptive nature, with business terminology, i.e. use of test case scenarios. Since bug report documents may come from a group of people who are unfamiliar with the vocabulary used in the source code, I propose that bug report descriptions contain a section for describing the relevant domain vocabulary. For example, a list of domain terms implemented by an application can be semi-automatically extracted and imported into the bug report management tool. Subsequently, when creating a bug report, the user may choose from the list of relevant domain terms or the tool may intelligently suggest the terms for selection.

From all these results, I can answer my research question affirmatively: in the case of business applications, my tool also places at least one file in the top-10 for 88% of the bug reports (see post-session interview Answer 2 and Chapter 4, Figure 4.3: 83% in Pillar2, 80% in Company U, 90% in S and 100% in A). My study also confirms that users will focus on the first 10 suggestions and that therefore presenting the search results ranked in the order of relevance for the task at hand benefits developers.

5.4.1 Threats to Validity

The queries (i.e. search terms) in my studies performed in Chapter 3 and Chapter 4 were taken directly from the bug reports. This threat to construct validity is addressed by the user studies in this Chapter, which showed that developers formulate their queries differently when locating bugs in an IDE and that using queries with vocabularies more in line with the source code can yield better results.

In the user study, the bug localisation was uncontrolled, to avoid disturbing the daily activities of the developers. This may be a potential threat to construct validity. I partially catered for this threat by asking developers to take screenshots of the results, which they showed during the post-session interview.

Bug localisation poses a difficult challenge because of the inherent uncertainty and subjectivity: one programmer may think a source file is relevant to a feature while another may not. In my study I catered for this threat to internal validity by asking developers to use closed bugs that they had not previously worked on and instructing them to document their results to avoid any bias.

The small size of the user study (4 participants from 3 companies) and the characteristics of the projects (e.g. the domain, the identifier naming conventions, and the way comments and bug reports are written, including the positions where file names occur) are a threat to external validity. I reduced this threat by repeating the search experiments with developers at three different companies, who evaluated my tool using industrial applications developed independently of each other.

5.5 Concluding Remarks

Chapters 3 and 4 have introduced a novel algorithm that, given a bug report and the application’s source code files, uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug. The algorithm considers words in certain positions of the bug report summary and of the stack trace (if available in a bug report) as well as source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

In this Chapter, I evaluated the algorithm on four very different industrial applications (Pillar2 in Chapter 4 and those from the three user studies), and placed at least one file in the top-10 for 83% of bug reports on average, thus confirming the applicability of my approach also in commercial environments.

Text-based searches available in current integrated development environments (IDEs) are inadequate because they require search terms to be precisely specified, otherwise irrelevant or no results are returned (Sillito et al., 2008; Starke et al., 2009). Developers stated that since most of the relevant files were positioned in the top-5, they were able to avoid the error-prone tasks of browsing long result lists and performing repetitive search queries. This confirms that presenting the search results ranked in the order of relevance for the task at hand aids developers during maintenance.

Chapter 6

Conclusion and Future Directions

In Chapter 1, I argued that current state-of-the-art IR approaches in bug localisation rely on project history, in particular previously fixed bugs and previous versions of the source code. However, existing studies (Nichols, 2010; Wang and Lo, 2014) show that considering similar bug reports up to 14 days back and version history between 15–20 days does not add any benefit to the use of IR alone. Furthermore, these techniques can only be used where a great deal of maintenance history is available, yet the same studies also show that considering history up to 50 days deteriorates the performance.

In addition, Bettenburg et al. (2008) argued that a bug report may contain a number of readily identifiable elements, including stack traces, code fragments, patches and recreation steps, each of which should be treated separately. Previous studies also show that many bug reports contain the file names that need to be fixed (Saha et al., 2013) and that bug reports have more terms in common with the affected files, in particular terms present in the names of those affected files (Moreno et al., 2014).

In this Chapter, I revisit my hypothesis and describe how it was addressed by the arising research questions in Section 6.1. Subsequently, I highlight the contributions and give recommendations to practitioners and future research in Section 6.2. Finally, I end this Chapter, and thus my thesis, with concluding remarks in Section 6.3.

6.1 How the Research Problem is Addressed

Bettenburg et al. (2008) disagree with the treatment of a bug report as a single text document and of source code files as one whole unit by existing approaches (Poshyvanyk et al., 2007; Ye et al., 2014; Kevic and Fritz, 2014; Abebe et al., 2009). Furthermore, the existing approaches treat comments as part of the vocabulary extracted from the source code, but since comments are written in a sublanguage of English, they may deteriorate the performance of the search results due to their imperfect nature, i.e. terse grammar (Etzkorn et al., 2001; Arnaoudova et al., 2013).

The hypothesis I investigated is that superior results can be achieved without drawing on past history by utilising only the information, i.e. file names, available in the current bug report and considering source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

In this section I summarise how I addressed my research hypothesis as follows.

6.1.1 Relating Domain Concepts and Vocabulary

Motivated to address my hypothesis, I first intended to discover whether vocabulary alone provides enough leverage for maintenance. I was determined to explore whether (1) the source code identifier names properly reflect the domain concepts in developers’ minds and (2) identifier names can be efficiently searched for concepts to find the relevant files for implementing a given bug report. Thus I asked my first research question as follows.

RQ1: Do project artefacts share domain concepts and vocabulary that may aid code comprehension when searching to find the relevant files during software maintenance?

To address RQ1, in Chapter 3 I undertook a preliminary investigation of eight applications and compared the vocabularies of project artefacts (i.e. text documentation, bug reports and source code) by asking three sub-research questions as follows.

RQ1.1: How does the degree of frequency among the common concepts correlate across the project artefacts? I investigated whether independent applications share key domain concepts and how the shared concepts correlate across the project artefacts. The correlation was computed pairwise between artefacts, over the instances of the concepts common to both artefacts, i.e. between the bug reports and the user guide, then between the bug reports and the source code, and finally between the user guide and the source code. I argued that identifying any agreement between the artefacts may lead to efficient software maintenance.

I found that while each concept occurred in at least one artefact, only a subset of the concepts occurred in all three artefacts (Sub-section 3.4.1). Also, although the three artefacts together explicitly include all the domain concepts, only a handful of the concepts occur both in the code and in the documentation, which may point to potential inefficiencies during maintenance.

On the other hand, I found that the common concepts correlate well in terms of relative frequency, taken as proxy for importance, i.e. the more important concepts in the user guide tend to be the more important ones in the code (Sub-section 3.4.1.1). This good conceptual alignment between documentation and implementation may ease maintenance, especially for new developers.

RQ1.2: What is the vocabulary similarity beyond the domain concepts, which may contribute towards code comprehension? Subsequently, I aimed at discovering any common terms, i.e. words extracted from the source code identifiers, other than the domain concepts, between the source code files of different applications. I found that the application source files share common terms beyond the domain concepts (Sub-section 3.4.2) and manual analysis revealed that those common terms are the ones generally used by developers during programming, e.g. CONFIG, version, read, update, etc., which may contribute towards code comprehension.

Additionally, I investigated the overall vocabulary of all the identifiers because, despite an overlap in the terms, developers of one application might combine them in completely different ways when creating identifiers, which may confuse other developers. Indeed, all applications combine the words differently: only a small percentage of common identifiers exists between the applications.

Furthermore, I discovered that the Eclipse source code also contains the source files of two smaller applications, SWT and AspectJ. Searching the source code of Eclipse using the actual words extracted from the bug reports of the smaller subset revealed that, given a large search space with many source files, e.g. 12,863 in Eclipse, my approach successfully retrieves the relevant files for the bug reports of a smaller subset, e.g. SWT with 484 files, within the same search space without retrieving many false positives (Sub-section 3.4.3.3). This indicates that, like finding a needle in a haystack, my scoring performs independently of the artefact’s size.

RQ1.3: How can the vocabulary be leveraged when searching for concepts to find the relevant files for implementing bug reports? Finally, I investigated whether searching with domain concepts only is adequate to find relevant files for a given bug report. For this, I used the novel IR approach that I introduced in Sub-section 3.3.2, which directly scores each current file against the given bug report by assigning a score to a source file based on where the search terms occur in the source code file, i.e. class file names or identifiers.
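The following sketch conveys the flavour of that scoring: a query term earns more credit when it matches the class (file) name than when it only matches another identifier or a comment. The weights and inputs are illustrative placeholders; the actual scheme is the one defined in Sub-section 3.3.2.

```python
def lexical_score(query_terms, name_terms, identifier_terms, comment_terms,
                  name_weight=3.0, id_weight=2.0, comment_weight=1.0):
    """Score one source file against a bug report query, crediting a term most when
    it occurs in the file name, less when it occurs in other identifiers, and least
    when it only occurs in comments (all inputs are lower-cased term sets)."""
    score = 0.0
    for term in query_terms:
        if term in name_terms:
            score += name_weight
        elif term in identifier_terms:
            score += id_weight
        elif term in comment_terms:
            score += comment_weight
    return score
```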

The conceptual overlap and correlation between the bug reports and the other two artefacts revealed that searching with only the domain concepts referred to by a bug report achieved very poor MAP and MRR results (Sub-section 3.4.3). However, both can be improved by using all of the vocabulary found in the bug report, which achieved on average 54% recall when positioning the relevant files in the top-10 (Sub-section 3.4.3.1). Such a simple and efficient technique can drastically reduce the false positives a developer has to go through to find the files affected by a bug report, thus confirming that bug reports contain information that may boost the performance when properly evaluated.

In the data sets analysed, on average only 3 files were listed as being changed, but multiple files may implement a concept (Section 2.1). This suggests that bug reports are for a unit of work, and searching for the relevant files using concepts finds all the files implementing that concept. Nevertheless, a bug report may require only a subset of those files to be changed. Hence, in the context of a bug report, this may lead to many false positives, as seen in the poor MAP values obtained during the concept-only search (Sub-section 3.4.3).

Based on the answers to all sub-research questions, I answer RQ1 affirmatively: project artefacts share domain concepts and vocabulary that may aid code comprehension when searching for the relevant files during software maintenance.

6.1.2 Locating Bugs Without Looking Back

Answering RQ1 in Chapter 3 revealed that artefacts explicitly reflect the domain concepts and that paired artefacts have a good conceptual alignment, which should help maintenance when searching for files affected by a given bug report. However, the studies I conducted to address RQ1.1–RQ1.3 showed that despite good vocabulary coverage, it is challenging to find the files referred to by a bug report. To improve the suggestion of relevant source files during bug localisation, I argued that heuristics should be developed based on information available within the context of bug reports, e.g. words in certain key positions.

Current state-of-the-art approaches for Java programs (Zhou et al., 2012a; Wong et al., 2014; Saha et al., 2013; Wang and Lo, 2014; Ye et al., 2014) rely on project history to improve the suggestion of relevant source files. In particular they use similar bug reports and recently modified files. However, the observed improvements using the history information have been small (Sub-section 2.3.4). I thus argued that file names mentioned in the bug report descriptions can replace the contribution of historical information in achieving comparable performance, and asked my second research question as follows.

RQ2: Can the occurrence of file names in bug reports be leveraged to replace project history and similar bug reports to achieve improved IR-based bug localisation?

To address RQ2 in Chapter 4, I extended my approach introduced in Sub-section 3.3.2 to consider words in certain positions of the bug report summary and of the stack trace (if available in a bug report).

Results showed that my approach located a relevant file in the top-10 for 64% of the bug reports on average just by assigning a high score to file names in the bug report summary (see row ‘KP only’ in Table 4.6, Sub-section 4.4.1.1), confirming the studies cited in the introduction that found file names mentioned in a large percentage of bug reports (Saha et al., 2013; Schröter et al., 2010). In addition, evaluating the stack trace further increased the number of relevant files placed in all top-N categories, indicating that assigning a higher score to file names found in the stack trace improves the performance of the results (Sub-section 4.4.1.2), which is also in line with the findings of previous studies (Schröter et al., 2010; Moreno et al., 2014; Wong et al., 2014).

I compared the performance of my extended approach against the existing ones, i.e. BugLocator (Zhou et al., 2012a), BRTracer (Wong et al., 2014), BLUiR (Saha et al., 2013), AmaLgam (Wang and Lo, 2014), LearnToRank (Ye et al., 2014), BLIA (Youm et al., 2015), and Rahman et al. (2015), on the same datasets, using the same performance metrics, and outperformed them in the majority of cases without considering any historical information. In particular, my approach succeeded in placing an affected file among the top-1, top-5 and top-10 files on average for 44%, 69% and 76% of bug reports, improving on the best-performing current state-of-the-art tool by 23%, 16% and 11% respectively (Sub-section 4.4.1.5).

Additionally, I compared the recall1 performance of ConCodeSe against BugLocator and BRTracer. The results were significantly superior to the other two current state-of-the-art tools (Sub-section 4.4.1.6), providing further evidence that even when a different metric (i.e. recall instead of MAP and MRR) is used to measure the performance, my approach still outperforms the existing ones.

1 the number of relevant files placed in the top-N out of all the affected relevant files.

Furthermore, in Chapter 4 I investigated the answers to the following two questions.

RQ2.1: What is the contribution of using similar bug reports to the performance of other tools, compared to my approach, which does not draw on past history? I looked more closely at the contribution of past history, in particular of considering similar bug reports, an approach introduced by BugLocator and adopted by others. I compared the results of BugLocator and BRTracer using SimiScore (the similar bug reports score), and the results of BLUiR according to the literature, showing that SimiScore’s contribution is not as high as suggested.

Through my experiments, I found that ConCodeSe achieves on average superior results to BugLocator and BRTracer in terms of MAP and MRR respectively (Sub-section 4.4.2), making my approach more suitable for projects without closed bug reports similar to the new ones. Hence I concluded that my approach localises, without using similar bug fix information, many bugs that BugLocator, BRTracer or BLUiR could only localise using that information, thus confirming the applicability of my approach also to software projects without history.

RQ2.2: What is the overall contribution of the VSM variant adopted in my approach, and how does it perform compared to rVSM? IR-based approaches to locating bugs use a base IR technique that is applied in a context-specific way or combined with bespoke heuristics. However, the exact variant of the underlying tf/idf2 model used may affect results (Saha et al., 2013). The off-the-shelf VSM model used in BLUiR already outperforms BugLocator, which introduced rVSM, a bespoke VSM variant.
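For readers unfamiliar with the underlying model, the textbook tf/idf weighting and cosine similarity that VSM builds on can be sketched as follows; Lucene’s actual scoring adds further normalisations and boosts, so this only illustrates the basic idea.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Turn a list of token lists into sparse tf/idf weight vectors."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    n = len(documents)
    idf = {term: math.log(n / df[term]) for term in df}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in documents]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```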

Since my approach also uses an off-the-shelf VSM tool, different from the one used by BLUiR, I conducted experiments to determine the contribution of the file names, i.e. lexical similarity, and the contribution of the IR model, i.e. VSM. I found that VSM is a crucial component to achieve the best performance for projects with a larger number of files, which makes the use of term and document frequency more meaningful, but that its contribution in smaller projects is rather small, which showed the crucial contribution of lexical similarity scoring to the improved performance of my approach.

I also confirm that the exact IR variant used is paramount: Lucene’s VSM and my simple lexical matching outperform BugLocator’s bespoke rVSM in many cases, as well as BLUiR’s Okapi (Sub-section 4.4.3). However, VSM on its own isn’t enough to outperform other approaches.

2 tf/idf (term frequency/inverse document frequency) is explained in Chapter 2, Sub-section 2.2.1

6.1.3 User Studies

Evaluating my approach in Chapter 4 with one commercial and a range of open source projects showed it outperformed current state-of-the-art tools in a simpler, faster, and more general way that does not require history. To investigate the generalisability of my approach, particularly in other commercial environments, I asked my third research question as follows.

RQ3: How does the approach perform in industrial applications and does it benefit developers by presenting the results ranked in the order of relevance for the bug report at hand?

To address RQ3 in Chapter 5, I conducted user studies in three different companies with professional developers. The first part of RQ3 aimed to demonstrate the applicability of my approach in industrial environments. Commercial applications and bug reports may have different characteristics to the OSS applications investigated in Chapter 4, thus impacting the performance of my approach. The results of the user study showed that in the case of business applications, my tool also placed at least one file in the top-10 for 90% of bug reports on average (see Answer 4 in Sub-section 5.3.2), thus confirming the first part of RQ3 in that my approach is also applicable in commercial environments.

The second part of RQ3 aimed to investigate how developers perceive the search results of ConCodeSe, which presents a ranked list of candidate source code files that may be relevant for a bug report at hand during software maintenance. Developers stated that since most of the relevant files were positioned in the top-5, they were able to avoid the error-prone tasks of browsing long result lists and performing repetitive search queries (Sub-section 5.4). This confirms that presenting the results ranked in the order of relevance for the task at hand aids developers during maintenance, thus affirmatively answering the second part of RQ3.

6.1.4 Validity of My Hypothesis

From all the results obtained by answering my research questions, as summarised in the previous sub-sections, I conclude that leveraging the occurrence of file names in bug reports leads in most cases to better performance than using project history and the contribution of similar bug reports.

Therefore I can answer my hypothesis affirmatively: superior results can be achieved without drawing on past history, by utilising only the information, i.e. file names, available in the current bug report and considering source code comments, stemming, and a combination of both independently, to derive the best rank for each file.

6.2 Contributions

My research offers a more efficient and light-weight IR approach, which does not require any further analysis, e.g. to trace executed classes by re-running the scenarios described in the bug reports, and makes the following contributions.

1. A novel algorithm, which scores source code files relevant for a reported bug based on key positions and stack trace information found in the bug report, without requiring any project history.

2. Implementation of the algorithm in a tool, together with the datasets and the results, for example the findings of the vocabulary study showing the concept correlation among project artefacts and the vocabulary similarity, for future research.

Given a bug report and the application's source code files, my approach uses a combination of lexical and structural information to suggest, in a ranked order, files that may have to be changed to resolve the reported bug. The algorithm considers heuristics based on contextual information, i.e. words in certain positions of the bug report summary and of the stack trace (if available in a bug report), as well as source code comments, stemming, and a combination of both independently, to derive the best rank for each file.
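As an illustration only, the hypothetical sketch below shows one way such per-file evidence could be combined into a single ranked list: each file keeps the better of its lexical (file-name) score and its VSM score, so a file named in the bug report can rank highly even when its textual similarity is low. The class and method names are invented for this example, and the combination is deliberately simplified compared to the scoring described earlier in the thesis.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical combination of a lexical file-name score and a VSM score:
 *  each file keeps the better of its two scores before the final ranking. */
public class FileRanker {

    public static List<Map.Entry<String, Double>> rank(Map<String, Double> lexicalScores,
                                                       Map<String, Double> vsmScores) {
        Set<String> files = new HashSet<>(lexicalScores.keySet());
        files.addAll(vsmScores.keySet());

        Map<String, Double> combined = new HashMap<>();
        for (String file : files) {
            // The "best rank" idea: keep whichever source of evidence scores the file higher.
            combined.put(file, Math.max(lexicalScores.getOrDefault(file, 0.0),
                                        vsmScores.getOrDefault(file, 0.0)));
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(combined.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked; // best candidate files first
    }
}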

Moreover, my approach not only outperforms other approaches, it does so in a simpler, faster, and more general way: it does not require past information such as version history or similar bug reports that have been closed, nor the tuning of any weight factors to combine scores, nor the use of machine learning.

6.2.1 Recommendations for Practitioners

The search results still depend on the quality of bug descriptions. In the case of tersely described bug reports, even experienced developers find it challenging to search for relevant files (Sub-section 5.3.2). In addition to previous studies arguing that the textual descriptions in bug reports are noisy (Zimmermann et al., 2010), I found that bug reports exhibit one of two characteristics: (1) a developer nature, with technical details, i.e. references to code files, or (2) a descriptive nature, with business terminology, i.e. use of test case scenarios.

Since bug reports may come from a group of people who are unfamiliar with the vocabulary used in the source code, I propose that bug report descriptions contain a section for the relevant domain vocabulary. For example, a list of domain terms implemented by an application can be semi-automatically extracted and imported into the bug report management tool. Subsequently, when creating a bug report, the user may choose from the list of relevant domain terms, or the tool may intelligently suggest terms for selection.

To deal with crosscutting concerns, i.e. functionality implemented in multiple source code files of an application rather than in one isolated file, as illustrated by Shepherd et al. (2005), I recommend packaging the source files in architectural layers based on conceptual responsibility instead of technical functionality. This would, first, mediate improved communication with business users, since such communications take place at a conceptual rather than a technical level, and, second, assist in finding relevant files by considering the OOP architectural relations, e.g. abstraction and inheritance, when no call relations exist.

In my user studies, the developers stated that some files would not have appeared in the results of the IDE search tool they use because those files do not contain the search terms. Hence I advocate that combining lexical search with probabilistic methods such as VSM should be an integral part of any modern search tool, and that the search results should be presented in a ranked order of relevance, from most to least important.

6.2.2 Suggestions for Future Research

Based on the results of the experiments conducted during my study, I suggest the following strands for future research.

Genetic algorithm: In my algorithm, I used arbitrary values to score the files, obtained by manually tuning the scoring on two moderate-size projects: AspectJ and SWT. I confirmed the rationale behind those values (namely, distinguishing certain positions and assigning much higher scores than base term matching) that led to the best results after trying out scoring variations (Sub-section 4.4.1.4). Future research could apply a genetic algorithm (GA) to automatically derive the best-fit combination of score values for each key position, stack trace and text term location.

Asadi et al. (2010) define a GA as “an iterative procedure that searches for the best solution to a given problem among a constant-size population, represented by a finite string of symbols, the genome”. For example, an iterative search may start with an initial set of values (genomes) for the KP, ST and TT scores. At the end of each run, two utility functions, one for MAP and the other for MRR, could compare the performance of the results against the previous run and start a new search using a different combination of genomes. After a predetermined number of iterations, the utility functions could select the genome that resulted in the best MAP and MRR results. This would facilitate the automated generation of score values independently tailored for a project.
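A minimal sketch of such a search is given below, assuming a fitness function that re-runs the evaluation benchmark and returns, say, the sum of MAP and MRR for a candidate weight vector. The class name, the population handling and the mutation step are illustrative assumptions, not a worked-out design.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Simple genetic-algorithm sketch for tuning three scoring weights (KP, ST, TT). */
public class ScoreTuner {

    private static final Random RND = new Random(42);

    public static double[] tune(ToDoubleFunction<double[]> fitness,
                                int populationSize, int generations) {
        Comparator<double[]> byFitnessAsc = Comparator.comparingDouble(fitness::applyAsDouble);
        Comparator<double[]> byFitness = byFitnessAsc.reversed(); // best genomes first

        // Constant-size population of random weight vectors (genomes): {KP, ST, TT}.
        List<double[]> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            population.add(new double[] {
                    RND.nextDouble() * 10, RND.nextDouble() * 10, RND.nextDouble() * 10});
        }

        for (int g = 0; g < generations; g++) {
            population.sort(byFitness);
            List<double[]> next = new ArrayList<>(population.subList(0, populationSize / 2));
            while (next.size() < populationSize) {          // refill by crossover and mutation
                double[] a = next.get(RND.nextInt(next.size()));
                double[] b = next.get(RND.nextInt(next.size()));
                double[] child = new double[3];
                for (int i = 0; i < 3; i++) {
                    child[i] = (RND.nextBoolean() ? a[i] : b[i])   // crossover
                               + RND.nextGaussian() * 0.5;          // mutation
                }
                next.add(child);
            }
            population = next;
        }
        population.sort(byFitness);
        return population.get(0); // best-fit KP/ST/TT weights found
    }
}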

Feature requests: Although the evaluation datasets I used only include bug reports, my IR-based approach can also be applied to feature requests, as it does not depend on past bug reports. Feature requests do not necessarily include a stack trace, and they may not mention specific files. Hence future research investigating the performance of my approach with feature requests may benefit from evaluating the content of configuration files (e.g. config.xml, DB scripts, GUI files), as suggested during the user studies.

For example, in applications that utilise declarative configuration, the dependent class files are specified in a configuration file. This file could be analysed to discover how each file relates to the others within the context of the application's business domain. Also, in web-based applications, the control flow between the GUI and the class files that process the submitted information is declared in the configuration files. Subsequently, the configurative relations and the control flow relations can be considered during the search when scoring a file. Finally, showing the relational flows, e.g. how a class file is associated with a GUI file, in the result list may point the developer directly to the location of the implementation for the new feature.
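As a simple illustration of the first step, the sketch below scans a configuration file for fully qualified class names so that such configurative relations could later contribute to a file's score. The class name and the regular expression are assumptions made for this example, not part of the tool.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Collects class names declared in a configuration file (e.g. a config.xml)
 *  so that configurative relations can inform the scoring of candidate files. */
public class ConfigClassExtractor {

    // Matches fully qualified Java class names such as com.example.OrderService.
    private static final Pattern CLASS_REF =
            Pattern.compile("\\b([a-z][a-z0-9_]*\\.)+[A-Z][A-Za-z0-9_]*\\b");

    public static Set<String> declaredClasses(Path configFile) throws IOException {
        String text = Files.readString(configFile);
        Set<String> classes = new LinkedHashSet<>();
        Matcher m = CLASS_REF.matcher(text);
        while (m.find()) {
            classes.add(m.group());
        }
        return classes;
    }
}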

Term co-occurrences: I only used single-word concepts during the search, while business concepts are usually compound terms. Considering term co-occurrences, by also evaluating the words surrounding the search terms, may lead to improved results. To facilitate this, future research may introduce, for example, project-specific vocabulary dictionaries in the form of ontologies (Sub-section 2.2.5) that define how concepts relate to each other.

During the search, concepts related to the queried concept can then be considered to retrieve additional relevant files that have no terms matching the query but may still be relevant within the context of the business domain of the application for the bug report at hand.
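A minimal sketch of such an expansion is shown below, assuming the ontology is available as a simple map from a concept to its related concepts (a stand-in for a richer ontology representation); the class name is hypothetical.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Expands a query with related domain concepts taken from a project-specific ontology. */
public class QueryExpander {

    private final Map<String, List<String>> ontology;

    public QueryExpander(Map<String, List<String>> ontology) {
        this.ontology = ontology;
    }

    public Set<String> expand(Set<String> queryTerms) {
        Set<String> expanded = new LinkedHashSet<>(queryTerms);
        for (String term : queryTerms) {
            // Add every concept the ontology relates to the queried term.
            expanded.addAll(ontology.getOrDefault(term, List.of()));
        }
        return expanded;
    }
}

For instance, querying for "risk" could then also retrieve files that only mention "volatility" or "scenario", if the ontology relates those concepts.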

Furthermore, it is conceivable that an IR engine using, for example, the LSI model (Sub-section 2.2.2) instead of VSM to index the terms extracted from the source code identifiers may produce results that are more or less sensitive to the use of file names in bug reports.

Call relations: Additional future research could identify opportunities for using files in the result list as seed files to find other relevant files via call relations. It is acknowledged that during maintenance developers perform search tasks using lexical information as well as navigating the structural information (Hill et al., 2007), thus these two information sources can be combined to refine the search results by utilising any additional clues that call relations may provide.

For example, during the user studies, developers indicated that if the bug description revealed that some back-end modules are the culprit, then they would look to see where the control flows come from, i.e. the callers (Sub-section 5.3.1). They also said that if a certain class calls many others then they would consider this to be bad architecture and ignore the callees. Based on these two insights, call relations can be evaluated for importance and a score can be assigned to boost or penalise the rank of a file accordingly. Subsequently, the relevant files can be grouped together within the context of the search by leveraging the call relations without deteriorating the performance.
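One rough encoding of these two insights is sketched below; the boost, penalty and fan-out threshold are illustrative placeholders rather than empirically tuned values, and the class name is hypothetical.

import java.util.Map;
import java.util.Set;

/** Adjusts a file's base (lexical/VSM) score using caller/callee information. */
public class CallRelationScorer {

    private static final double CALLER_BOOST = 1.5;   // file calls a suspected back-end module
    private static final int    FAN_OUT_LIMIT = 30;    // "calls many others" threshold
    private static final double FAN_OUT_PENALTY = 0.5;

    public static double adjust(String file, double baseScore,
                                Set<String> suspectedCallees,
                                Map<String, Set<String>> callGraph) {
        double score = baseScore;
        Set<String> callees = callGraph.getOrDefault(file, Set.of());
        // Boost callers of the modules suspected in the bug description.
        if (callees.stream().anyMatch(suspectedCallees::contains)) {
            score += CALLER_BOOST;
        }
        // Penalise files with excessive fan-out, whose call relations developers said they distrust.
        if (callees.size() > FAN_OUT_LIMIT) {
            score -= FAN_OUT_PENALTY;
        }
        return score;
    }
}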

Furthermore, in the case of bug reports requiring multiple files to be changed, investigating whether the files scored in the top-5 are also in the order of importance (e.g. high priority) for the developers, compared to other tools, would greatly benefit further research. For example, files A, B and C may all be scored in the top-5, but file C may be the most important one within the context of the bug report at hand and should thus be positioned in the top-1.

GUI enhancements: Several cosmetic suggestions arose from the user studies for presenting the results and interacting with search fields, such as proposing keywords, e.g. auto-completion based on terms found in the source code (Sub-section 5.3.1). Participants felt that it would be helpful to display, next to the ranked files, the words in those files that match the bug report, so that the user can determine whether it really makes sense to investigate the content of the file. Interestingly, one developer said that sometimes he did not consider the ranking an important factor and suggested grouping files into packages, based on words matching the package name, and then ranking the files within each group.

Future research may investigate different approaches to ranking relevant files and presenting them in a user interface integrated into modern IDEs. Subsequently, for each ranked file, the matching words that resulted in the ranking could be displayed as a teaser (preview), so that developers would not need to open each file to examine its content.

Attachments in bug reports: Davies and Roper (2014) examined 1600 bug reports from Eclipse, Firefox (http://www.mozilla.org/firefox), Apache (http://apache.org) and the Facebook API (https://developers.facebook.com) to understand what information users provide. The authors found that three of the most important features are (1) observed behaviour: what the user saw happen in the application as a result of the bug; (2) expected behaviour: what the user expected to happen, usually contrasted with the observed behaviour; and (3) steps to reproduce: instructions that the developer can use to reproduce the bug.

However, the authors claim that these features are not sufficient for software maintenance, otherwise there would be no need for automated bug localisation because a developer would be able to directly utilise the information contained in the bug report. In my research, I successfully leveraged the file names and the stack traces found in bug reports during automated bug localisation. A future research direction would be to also evaluate the other types of information contained in bug reports, for example attachments in the form of log files, screen shots or links to external resources containing code examples, as identified by Davies and Roper (2014).

6.3 A Final Rounding Off

In my research, I focused on a single type of contextual information in the bug report: the occurrence of file names. The assumption is that if a file is mentioned in the bug report, it likely needs to be changed to fix the bug. Leveraging the file names found in key positions and in the stack trace, based on contextual heuristics, is thus the main contribution of my research.

As software applications become an integral part of our lives, technological advancements demand that existing applications be maintained to meet ongoing requirements. With information available at our fingertips, companies are required to deliver solutions in a timely manner in order to remain competitive.

Reducing the time it takes to identify the location of a reported bug is a key challenge: wouldn't it be nice to have a system where, upon creating a bug description, the source code files of the application are automatically searched and a list of candidate files is identified and linked to the bug report, so that the developer may focus on implementing the solution instead of first trying to figure out where to implement it? My research presents an important step in this direction.

In this closing paragraph of the closing chapter of my thesis, I advocate the importance of easing the daily tasks of software developers by providing them with a ranked list of search results, where lexical similarity during search is complemented with probabilistic methods, in a tool integrated into modern IDEs as well as into modern project management life cycle tools.

Bibliography

S. L. Abebe; S. Haiduc; P. Tonella; and A. Marcus (2009) Lexicon bad smells in software. In

2009 16th Working Conference on Reverse Engineering, pp. 95–99.

S. L. Abebe; S. Haiduc; P. Tonella; and A. Marcus (2011) The effect of lexicon bad smells

on concept location in source code. In Source Code Analysis and Manipulation (SCAM),

2011 11th IEEE International Working Conference on, pp. 125–134.

S. L. Abebe and P. Tonella (2010) Natural language parsing of program element names for

concept extraction. In Program Comprehension (ICPC), 2010 IEEE 18th International

Conference on, pp. 156–159.

N. Anquetil and T. Lethbridge (1998a) Assessing the relevance of identifier names in a legacy

software system. In Proceedings of the 1998 Conference of the Centre for Advanced Studies

on Collaborative Research, Toronto, Ontario, Canada, CASCON ’98, pp. 4–. IBM Press.

N. Anquetil and T. Lethbridge (1998b) Extracting concepts from file names; a new file clus-

tering criterion. In Software Engineering, 1998. Proceedings of the 1998 International

Conference on, pp. 84–93.

G. Antoniol; G. Canfora; G. Casazza; and A. D. Lucia (2000a) Information retrieval models

for recovering traceability links between code and documentation. In Software Maintenance,

2000. Proceedings. International Conference on, pp. 40–49.

G. Antoniol; G. Canfora; G. Casazza; A. D. Lucia; and E. Merlo (2000b) Tracing object-

oriented code into functional requirements. In Program Comprehension, 2000. Proceedings.

IWPC 2000. 8th International Workshop on, pp. 79–86.

G. Antoniol; G. Canfora; G. Casazza; A. D. Lucia; and E. Merlo (2002) Recovering traceab-

ility links between code and documentation. IEEE Transactions on Software Engineering,

pp. 970–983.

V. Arnaoudova; M. D. Penta; G. Antoniol; and Y. G. Guéhéneuc (2013) A new family of soft-

ware anti-patterns: Linguistic anti-patterns. In Software Maintenance and Reengineering

(CSMR), 2013 17th European Conference on, pp. 187–196.

F. Arvidsson and A. Flycht-Eriksson (2008) Ontologies i. http://www.ida.liu.se/janma/

SemWeb/Slides/ontologies1.pdf. Accessed: 2016-07.

F. Asadi; G. Antoniol; and Y. G. Guéhéneuc (2010) Concept location with genetic algorithms:

A comparison of four distributed architectures. In Search Based Software Engineering

(SSBSE), 2010 Second International Symposium on, pp. 153–162.

Basel-II (2006) International convergence of capital measurement and capital standards: A

revised framework - comprehensive version. http://www.bis.org/publ/bcbs128.htm. Ac-

cessed: 2016-08.

R. Bendaoud; A. M. Rouane Hacene; Y. Toussaint; B. Delecroix; and A. Napoli (2007) Text-

based ontology construction using relational concept analysis. In International Workshop

on Ontology Dynamics - IWOD 2007. Innsbruck, Austria.

K. H. Bennett and V. T. Rajlich (2000) Software maintenance and evolution: A roadmap. In

Proceedings of the Conference on The Future of Software Engineering, Limerick, Ireland,

ICSE ’00, pp. 73–87. ACM, New York, NY, USA.

N. Bettenburg; R. Premraj; T. Zimmermann; and S. Kim (2008) Extracting structural in-

formation from bug reports. In Proceedings of the 2008 International Working Conference

on Mining Software Repositories, Leipzig, Germany, MSR ’08, pp. 27–30. ACM, New York,

NY, USA.

T. J. Biggerstaff; B. G. Mitbander; and D. Webster (1993) The concept assignment problem

in program understanding. In Proceedings of the 15th International Conference on Software

Engineering, Baltimore, Maryland, USA, ICSE ’93, pp. 482–498. IEEE Computer Society

Press, Los Alamitos, CA, USA.

S. Boslaugh and P. Watters (2008) Statistics in a nutshell. O’Reilly Publishing, 1st edition.

S. Butler (2016) Analysing Java Identifier Names. Ph.D. thesis, The Open University.

S. Butler; M. Wermelinger; Y. Yu; and H. Sharp (2010) Exploring the influence of identifier

names on code quality: An empirical study. In Software Maintenance and Reengineering

(CSMR), 2010 14th European Conference on, pp. 156–165.

S. Butler; M. Wermelinger; Y. Yu; and H. Sharp (2011) Improving the Tokenisation of Iden-

tifier Names, pp. 130–154. Springer Berlin Heidelberg, Berlin, Heidelberg.

G. Capobianco; A. D. Lucia; R. Oliveto; A. Panichella; and S. Panichella (2009) On the role

of the nouns in ir-based traceability recovery. In Program Comprehension, 2009. ICPC

’09. IEEE 17th International Conference on, pp. 148–157.

C. Caprile and P. Tonella (1999) Nomen est omen: analyzing the language of function identifi-

ers. In Reverse Engineering, 1999. Proceedings. Sixth Working Conference on, pp. 112–122.

J. Chang and D. M. Blei (2009) Relational topic models for document networks. https:

//www.cs.princeton.edu/~blei/papers/ChangBlei2009.pdf. Accessed: 2016-08.

T. A. Corbi (1989) Program understanding: Challenge for the 1990's. IBM Syst. J., pp.

294–306.

S. Davies and M. Roper (2013) Bug localisation through diverse sources of information. In

Software Reliability Engineering Workshops (ISSREW), 2013 IEEE International Sym-

posium on, pp. 126–131.

S. Davies and M. Roper (2014) What’s in a bug report? In Proceedings of the 8th ACM/IEEE

International Symposium on Empirical Software Engineering and Measurement, Torino,

Italy, ESEM ’14, pp. 26:1–26:10. ACM, New York, NY, USA.

J. W. Davison; D. M. Mancl; and W. F. Opdyke (2000) Understanding and addressing the

essential costs of evolving systems. Bell Labs Technical Journal, 5(2):44–54.

F. Deißenböck and M. Pizka (2006) Concise and consistent naming. Software Quality Journal,

pp. 261–282.

F. Deißenböck and D. Raţiu (2006) A unified meta-model for concept-based reverse engin-

eering. In In Proceedings of the 3rd International Workshop on Metamodels, Schemas,

Grammars and Ontologies.

E. W. Dijkstra (1959) A note on two problems in connexion with graphs. Numerische Math-

ematik, 1(1):269–271.

T. Dilshener (2012) Improving information retrieval-based concept location using contextual

relationships. In 2012 34th International Conference on Software Engineering (ICSE), pp.

1499–1502.

T. Dilshener and M. Wermelinger (2011) Relating developers’ concepts and artefact vocab-

ulary in a financial software module. In Software Maintenance (ICSM), 2011 27th IEEE

International Conference on, pp. 412–417.

T. Dilshener; M. Wermelinger; and Y. Yu (2016) Locating bugs without looking back. In

Proceedings of the 13th International Conference on Mining Software Repositories, Austin,

Texas, MSR ’16, pp. 286–290. ACM, New York, NY, USA.

M. Eaddy; A. V. Aho; G. Antoniol; and Y. G. Guéhéneuc (2008) Cerberus: Tracing require-

ments to source code using information retrieval, dynamic analysis, and program analysis.

In Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference

on, pp. 53–62.

T. Eisenbarth; R. Koschke; and D. Simon (2001) Aiding program comprehension by static and

dynamic feature analysis. In Software Maintenance, 2001. Proceedings. IEEE International

Conference on, pp. 602–611.

A. D. Eisenberg and K. D. Volder (2005) Dynamic feature traces: finding features in unfa-

miliar code. In 21st IEEE International Conference on Software Maintenance (ICSM’05),

pp. 337–346.

L. H. Etzkorn; C. G. Davis; and L. L. Bowen (2001) The language of comments in computer

software: A sublanguage of english. Journal of Pragmatics, pp. 1731 – 1756.

M. Feilkas; D. Ra¸tiu; and E. Jurgens (2009) The loss of architectural knowledge during system

evolution: An industrial case study. In Program Comprehension, 2009. ICPC ’09. IEEE

17th International Conference on, pp. 188–197.

Z. P. Fry; D. Shepherd; E. Hill; L. Pollock; and K. Vijay-Shanker (2008) Analysing source

code: looking for useful verb-direct object pairs in all the right places. IET Software, pp.

27–36.

M. Gethers; R. Oliveto; D. Poshyvanyk; and A. D. Lucia (2011) On integrating orthogonal

information retrieval methods to improve traceability recovery. In Software Maintenance

(ICSM), 2011 27th IEEE International Conference on, pp. 133–142.

T. R. Gruber (1993) A translation approach to portable ontology specifications. Knowledge

Acquisition, pp. 199 – 220.

S. Haiduc and A. Marcus (2008) On the use of domain terms in source code. In Program

Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on, pp. 113–

122.

D. L. Hall and J. Llinas (1997) An introduction to multisensor data fusion. Proceedings of

the IEEE, pp. 6–23.

Y. Hayase; Y. Kashima; Y. Manabe; and K. Inoue (2011) Building domain specific dictionaries

of verb-object relation from source code. In Software Maintenance and Reengineering

(CSMR), 2011 15th European Conference on, pp. 93–100.

S. Hayashi; T. Yoshikawa; and M. Saeki (2010) Sentence-to-code traceability recovery with

domain ontologies. In 2010 Asia Pacific Software Engineering Conference, pp. 385–394.

E. Hill (2010) Integrating Natural Language and Program Structure Information to Improve

Software Search and Exploration. Ph.D. thesis, Newark, DE, USA. AAI3423409.

E. Hill; D. Binkley; D. Lawrie; L. Pollock; and K. Vijay-Shanker (2013) An empirical study

of identifier splitting techniques. Empirical Software Engineering, 19(6):1754–1780.

E. Hill; Z. P. Fry; H. Boyd; G. Sridhara; Y. Novikova; L. Pollock; and K. Vijay-Shanker (2008)

Amap: Automatically mining abbreviation expansions in programs to enhance software

maintenance tools. In Proceedings of the 2008 International Working Conference on Mining

Software Repositories, Leipzig, Germany, MSR ’08, pp. 79–88. ACM, New York, NY, USA.

E. Hill; L. Pollock; and K. Vijay-Shanker (2007) Exploring the neighborhood with dora to

expedite software maintenance. In Proceedings of the Twenty-second IEEE/ACM Interna-

tional Conference on Automated Software Engineering, Atlanta, Georgia, USA, ASE ’07,

pp. 14–23. ACM, New York, NY, USA.

E. Hill; L. Pollock; and K. Vijay-Shanker (2009) Automatically capturing source code con-

text of nl-queries for software maintenance and reuse. In 2009 IEEE 31st International

Conference on Software Engineering, pp. 232–242.

E. W. Høst and B. M. Østvold (2007) The programmer's lexicon, volume I: The verbs. In Sev-

enth IEEE International Working Conference on Source Code Analysis and Manipulation

(SCAM 2007), pp. 193–202.

I. Hsi; C. Potts; and M. Moore (2003) Ontological excavation: unearthing the core concepts

of the application. In Reverse Engineering, 2003. WCRE 2003. Proceedings. 10th Working

Conference on, pp. 345–353.

B. Hunt; B. Turner; and K. McRitchie (2008) Software maintenance implications on cost and

schedule. In Aerospace Conference, 2008 IEEE, pp. 1–6.

K. Kevic and T. Fritz (2014) Automatic search term identification for change tasks. In

Companion Proceedings of the 36th International Conference on Software Engineering,

Hyderabad, India, ICSE Companion 2014, pp. 468–471. ACM, New York, NY, USA.

G. Kiczales; E. Hilsdale; J. Hugunin; M. Kersten; J. Palm; and W. G. Griswold (2001) Getting

started with ASPECTJ. Commun. ACM, 44(10):59–65.

P. S. Kochhar; Y. Tian; and D. Lo (2014) Potential biases in bug localization: Do they

matter? In Proceedings of the 29th ACM/IEEE International Conference on Automated

Software Engineering, Vasteras, Sweden, ASE ’14, pp. 803–814. ACM, New York, NY,

USA. URL http://doi.acm.org/10.1145/2642937.2642997.

P. S. Kochhar; X. Xia; D. Lo; and S. Li (2016) Practitioners’ expectations on automated

fault localization. In Proceedings of the 25th International Symposium on Software Testing

and Analysis, Saarbrücken, Germany, ISSTA 2016, pp. 165–176. ACM, New York,

NY, USA. URL http://doi.acm.org/10.1145/2931037.2931051.

A. Kuhn; S. Ducasse; and T. Girba (2007) Semantic clustering: Identifying topics in source

code. Information and Software Technology, pp. 230 – 243. 12th Working Conference on

Reverse Engineering.

D. Lawrie (2012) Discussion of appropriate evaluation metrics, 1st workshop on text ana-

lysis in software maintenance. https://dibt.unimol.it/TAinSM2012/slides/dawn.pdf.

Accessed: 2016-04.

D. Lawrie; D. Binkley; and C. Morrell (2010) Normalizing source code vocabulary. In 2010

17th Working Conference on Reverse Engineering, pp. 3–12.

D. Lawrie; C. Morrell; H. Feild; and D. Binkley (2006) What's in a name? a study of

identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC’06),

pp. 3–12.

M. M. Lehman (1980) Programs, life cycles, and laws of software evolution. Proceedings of

the IEEE, pp. 1060–1076.

Z. Li; L. Tan; X. Wang; S. Lu; Y. Zhou; and C. Zhai (2006) Have things changed now?: An

empirical study of bug characteristics in modern open source software. In Proceedings of the

1st Workshop on Architectural and System Support for Improving Software Dependability,

San Jose, California, ASID ’06, pp. 25–33. ACM, New York, NY, USA.

D. Liu; A. Marcus; D. Poshyvanyk; and V. Rajlich (2007) Feature location via information

retrieval based filtering of a single scenario execution trace. In Proceedings of the Twenty-

second IEEE/ACM International Conference on Automated Software Engineering, Atlanta,

Georgia, USA, ASE ’07, pp. 234–243. ACM, New York, NY, USA.

Lucia; F. Thung; D. Lo; and L. Jiang (2012) Are faults localizable? In Proceedings of the

9th IEEE Working Conference on Mining Software Repositories, Zurich, Switzerland, MSR

’12, pp. 74–77. IEEE Press, Piscataway, NJ, USA.

C. D. Manning; P. Raghavan; and H. Schütze (2008) Introduction to Information Retrieval.

Cambridge University Press, New York, NY, USA.

A. Marcus and S. Haiduc (2013) Text Retrieval Approaches for Concept Location in Source

Code, pp. 126–158. Springer Berlin Heidelberg, Berlin, Heidelberg.

A. Marcus and J. I. Maletic (2003) Recovering documentation-to-source-code traceability

links using latent semantic indexing. In Proceedings of the 25th International Conference on

Software Engineering, Portland, Oregon, ICSE ’03, pp. 125–135. IEEE Computer Society,

Washington, DC, USA.

A. Marcus; V. Rajlich; J. Buchta; M. Petrenko; and A. Sergeyev (2005) Static techniques

for concept location in object-oriented code. In 13th International Workshop on Program

Comprehension (IWPC’05), pp. 33–42.

A. Marcus; A. Sergeyev; V. Rajlich; and J. I. Maletic (2004) An information retrieval ap-

proach to concept location in source code. In Reverse Engineering, 2004. Proceedings. 11th

Working Conference on, pp. 214–223.

R. C. Martin (2008) Clean Code: A Handbook of Agile Software Craftsmanship. Prentice

Hall, 1st edition.

G. A. Miller (1995) Wordnet: A lexical database for english. Commun. ACM, pp. 39–41.

L. Moreno; W. Bandara; S. Haiduc; and A. Marcus (2013) On the relationship between the

vocabulary of bug reports and source code. In Software Maintenance (ICSM), 2013 29th

IEEE International Conference on, pp. 452–455.

L. Moreno; J. J. Treadway; A. Marcus; and W. Shen (2014) On the use of stack traces

to improve text retrieval-based bug localization. In Software Maintenance and Evolution

(ICSME), 2014 IEEE International Conference on, pp. 151–160.

B. D. Nichols (2010) Augmented bug localization using past bug information. In Proceedings

of the 48th Annual Southeast Regional Conference, Oxford, Mississippi, ACM SE ’10, pp.

61:1–61:6. ACM, New York, NY, USA.

C. Nunes; A. Garcia; E. Figueiredo; and C. Lucena (2011) Revealing mistakes in concern

mapping tasks: An experimental evaluation. In Software Maintenance and Reengineering

(CSMR), 2011 15th European Conference on, pp. 101–110.

Oracle (1999) Code conventions for the java programming language, section 5 com-

ments. http://www.oracle.com/technetwork/java/codeconventions-141999.html.

Accessed: 2016-07.

Y. Padioleau; L. Tan; and Y. Zhou (2009) Listening to programmers taxonomies and char-

acteristics of comments in code. In Proceedings of the 31st International

Conference on Software Engineering, ICSE ’09, pp. 331–341. IEEE Computer Society,

Washington, DC, USA.

D. L. Parnas (1972) On the criteria to be used in decomposing systems into modules. Com-

mun. ACM, pp. 1053–1058.

C. Parnin and A. Orso (2011) Are automated debugging techniques actually helping pro-

grammers? In Proceedings of the 2011 International Symposium on Software Testing and

Analysis, Toronto, Ontario, Canada, ISSTA '11, pp. 199–209. ACM, New York, NY, USA.

URL http://doi.acm.org/10.1145/2001420.2001445.

M. Petrenko and V. Rajlich (2013) Concept location using program dependencies and in-

formation retrieval (depir). Information and Software Technology, pp. 651 – 659.

M. Petrenko; V. Rajlich; and R. Vanciu (2008) Partial domain comprehension in software

evolution and maintenance. In Program Comprehension, 2008. ICPC 2008. The 16th IEEE

International Conference on, pp. 13–22.

T. M. Pigoski (1996) Practical Software Maintenance: Best Practices for Managing Your

Software Investment. Wiley Publishing, 1st edition.

M. F. Porter (1997) Readings in information retrieval. Chapter An Algorithm for Suffix

Stripping, pp. 313–316. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

D. Poshyvanyk; Y. G. Guéhéneuc; A. Marcus; G. Antoniol; and V. Rajlich (2007) Feature loc-

ation using probabilistic ranking of methods based on execution scenarios and information

retrieval. IEEE Transactions on Software Engineering, pp. 420–432.

D. Poshyvanyk and A. Marcus (2007) Combining formal concept analysis with information

retrieval for concept location in source code. In 15th IEEE International Conference on

Program Comprehension (ICPC ’07), pp. 37–48.

D. Raţiu; M. Feilkas; and J. Jurjens (2008) Extracting domain ontologies from domain spe-

cific apis. In Software Maintenance and Reengineering, 2008. CSMR 2008. 12th European

Conference on, pp. 203–212.

S. Rahman; K. K. Ganguly; and K. Sakib (2015) An improved bug localization using struc-

tured information retrieval and version history. In 2015 18th International Conference on

Computer and Information Technology (ICCIT), pp. 190–195.

V. Rajlich and N. Wilde (2002) The role of concepts in program comprehension. In Program

Comprehension, 2002. Proceedings. 10th International Workshop on, pp. 271–278.

S. Rao and A. Kak (2011) Retrieval from software libraries for bug localization: A comparative

study of generic and composite text models. In Proceedings of the 8th Working Conference

on Mining Software Repositories, Waikiki, Honolulu, HI, USA, MSR ’11, pp. 43–52. ACM,

New York, NY, USA.

S. Ratanotayanon; H. J. Choi; and S. E. Sim (2010) My repository runneth over: An empirical

study on diversifying data sources to improve feature search. In Program Comprehension

(ICPC), 2010 IEEE 18th International Conference on, pp. 206–215.

M. Revelle (2009) Supporting feature-level software maintenance. In 2009 16th Working

Conference on Reverse Engineering, pp. 287–290.

M. Revelle; B. Dit; and D. Poshyvanyk (2010) Using data fusion and web mining to sup-

port feature location in software. In Program Comprehension (ICPC), 2010 IEEE 18th

International Conference on, pp. 14–23.

M. Revelle and D. Poshyvanyk (2009) An exploratory study on assessing feature location tech-

niques. In Program Comprehension, 2009. ICPC ’09. IEEE 17th International Conference

on, pp. 218–222.

M. P. Robillard (2005) Automatic generation of suggestions for program investigation. In

Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th

ACM SIGSOFT International Symposium on Foundations of Software Engineering, Lisbon,

Portugal, ESEC/FSE-13, pp. 11–20. ACM, New York, NY, USA.

M. R. Robillard and G. C. Murphy (2002) Concern graphs: finding and describing concerns

using structural program dependencies. In Software Engineering, 2002. ICSE 2002. Pro-

ceedings of the 24rd International Conference on, pp. 406–416.

R. K. Saha; M. Lease; S. Khurshid; and D. E. Perry (2013) Improving bug localization

using structured information retrieval. In Automated Software Engineering (ASE), 2013

IEEE/ACM 28th International Conference on, pp. 345–355.

G. Salton and C. Buckley (1988) Term-weighting approaches in automatic text retrieval.

Information Processing and Management, pp. 513 – 523.

T. Savage; M. Revelle; and D. Poshyvanyk (2010) Flat3: Feature location and textual tracing

tool. In Proceedings of the 32Nd ACM/IEEE International Conference on Software En-

gineering - Volume 2, Cape Town, South Africa, ICSE ’10, pp. 255–258. ACM, New York,

NY, USA.

A. Schröter; N. Bettenburg; and R. Premraj (2010) Do stack traces help developers fix bugs?

In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pp.

118–121.

P. Shao and R. K. Smith (2009) Feature location by ir modules and call graph. In Proceedings

of the 47th Annual Southeast Regional Conference, Clemson, South Carolina, ACM-SE 47,

pp. 70:1–70:4. ACM, New York, NY, USA.

D. Shepherd; Z. P. Fry; E. Hill; L. Pollock; and K. Vijay-Shanker (2007) Using natural lan-

guage program analysis to locate and understand action-oriented concerns. In Proceedings

of the 6th International Conference on Aspect-oriented Software Development, Vancouver,

British Columbia, Canada, AOSD ’07, pp. 212–224. ACM, New York, NY, USA.

D. Shepherd; L. Pollock; and T. Tourwé (2005) Using language clues to discover crosscutting

concerns. In Proceedings of the 2005 Workshop on Modeling and Analysis of Concerns in

Software, St. Louis, Missouri, MACS ’05, pp. 1–6. ACM, New York, NY, USA.

D. Shepherd; L. Pollock; and K. Vijay-Shanker (2006) Towards supporting on-demand virtual

remodularization using program graphs. In Proceedings of the 5th International Conference

on Aspect-oriented Software Development, Bonn, Germany, AOSD ’06, pp. 3–14. ACM,

New York, NY, USA.

J. Sillito; G. C. Murphy; and K. D. Volder (2008) Asking and answering questions during a

programming change task. IEEE Transactions on Software Engineering, pp. 434–451.

B. Sisman and A. C. Kak (2012) Incorporating version histories in information retrieval

based bug localization. In Mining Software Repositories (MSR), 2012 9th IEEE Working

Conference on, pp. 50–59.

G. Sridhara; E. Hill; L. Pollock; and K. Vijay-Shanker (2008) Identifying word relations in

software: A comparative study of semantic similarity tools. In Program Comprehension,

2008. ICPC 2008. The 16th IEEE International Conference on, pp. 123–132.

J. Starke; C. Luce; and J. Sillito (2009) Searching and skimming: An exploratory study. In

Software Maintenance, 2009. ICSM 2009. IEEE International Conference on, pp. 157–166.

Y. Uneno; O. Mizuno; and E. H. Choi (2016) Using a distributed representation of words in

localizing relevant files for bug reports. In 2016 IEEE International Conference on Software

Quality, Reliability and Security (QRS), pp. 183–190.

E. M. Voorhees (2001) The TREC question answering track. Nat. Lang. Eng., pp. 361–378.

Q. Wang; C. Parnin; and A. Orso (2015) Evaluating the usefulness of ir-based fault localiza-

tion techniques. In Proceedings of the 2015 International Symposium on Software Testing

and Analysis, Baltimore, MD, USA, ISSTA 2015, pp. 1–11. ACM, New York, NY, USA.

URL http://doi.acm.org/10.1145/2771783.2771797.

S. Wang and D. Lo (2014) Version history, similar report, and structure: Putting them

together for improved bug localization. In Proceedings of the 22nd International Conference

on Program Comprehension, Hyderabad, India, ICPC 2014, pp. 53–63. ACM, New York,

NY, USA.

S. Wang; D. Lo; Z. Xing; and L. Jiang (2011) Concern localization using information re-

trieval: An empirical study on linux kernel. In 2011 18th Working Conference on Reverse

Engineering, pp. 92–96.

M. Wermelinger; Y. Yu; and A. Lozano (2008) Design principles in architectural evolution:

A case study. In Software Maintenance, 2008. ICSM 2008. IEEE International Conference

on, pp. 396–405.

Wessa (2016) Free statistics software, office for research development and education, version

1.1.23-r7. http://www.wessa.net/. Accessed: 2016-03.

N. Wilde and M. C. Scully (1995) Software reconnaissance: Mapping program features to

code. Journal of Software Maintenance: Research and Practice, pp. 49–62.

C. P. Wong; Y. Xiong; H. Zhang; D. Hao; L. Zhang; and H. Mei (2014) Boosting bug-

report-oriented fault localization with segmentation and stack-trace analysis. In Software

Maintenance and Evolution (ICSME), 2014 IEEE International Conference on, pp. 181–

190.

X. Xia; L. Bao; D. Lo; and S. Li (2016) Automated debugging considered harmful - con-

sidered harmful: A user study revisiting the usefulness of spectra-based fault localization

techniques with professionals using real bugs from large systems. In 2016 IEEE Interna-

tional Conference on Software Maintenance and Evolution (ICSME), pp. 267–278.

X. Ye; R. Bunescu; and C. Liu (2014) Learning to rank relevant files for bug reports using

domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium

on Foundations of Software Engineering, Hong Kong, China, FSE 2014, pp. 689–699. ACM,

New York, NY, USA.

K. C. Youm; J. Ahn; J. Kim; and E. Lee (2015) Bug localization based on code change

histories and bug reports. In 2015 Asia-Pacific Software Engineering Conference (APSEC),

pp. 190–197.

J. Zhou; H. Zhang; and D. Lo (2012a) Where should the bugs be fixed? - more accurate

information retrieval-based bug localization based on bug reports. In Proceedings of the

34th International Conference on Software Engineering, Zurich, Switzerland, ICSE ’12, pp.

14–24. IEEE Press, Piscataway, NJ, USA.

K. Zhou; K. M. Varadarajan; M. Zillich; and M. Vincze (2012b) Robust multiple model

estimation with jensen-shannon divergence. In Pattern Recognition (ICPR), 2012 21st

International Conference on, pp. 2136–2139.

T. Zimmermann; R. Premraj; N. Bettenburg; S. Just; A. Schröter; and C. Weiss (2010) What

makes a good bug report? IEEE Transactions on Software Engineering, 36(5):618–643.

Appendix A

Glossary

This Appendix describes the terminology used throughout my thesis within the context of my research.

Bug localisation is the process of identifying where to make changes in response to a bug report (Moreno et al., 2014).

Bug report describes an unexpected and unintended erroneous behaviour, known as a bug, of a software system.

Concept is a phrase or a single word that describes a unit of human knowledge (thought) existing within the domain of the software application (Rajlich and Wilde, 2002); for example, in a financial domain dealing with risk, the phrase standalone risk describes a type of risk concept.

Concept location is the task of locating the program identifiers implementing the concepts at hand prior to performing software maintenance.

Contextual model is the repository, i.e. database, containing the vocabulary (terms) extracted from an application’s project artefacts, i.e. source code files, bug reports, concepts and user guide. It also provides referential information within the context of each artefact to determine the origin of the terms, for example, the source code file names from where the vocabulary was extracted.

Formal Concept Analysis (FCA) is a two-dimensional model for grouping concepts. It consists of the extension, covering all the objects (i.e. program elements) belonging to a concept, and the intension, covering all the attributes that are shared by all the objects being considered.

Free/Libre and open source software (FLOSS) refers to software developed through public collaboration and whose source code is publicly available for use and distribution without any charge.

Latent Semantic Indexing (LSI) is an information retrieval model. LSI organises the occurrences of project artefact terms in a term-by-document matrix and applies the Singular Value Decomposition (SVD) principle.

Lexical Chaining (LC) is a technique to group semantically related terms in a document by computing the semantic distance (i.e. strength of relationship) between two terms.

Mean Average Precision (MAP) provides a synthesised way to measure the quality of the retrieved files when more than one related file is retrieved for a bug report. The average precision for a bug report is the mean of the precision values obtained for all affected files listed in the result set. MAP is calculated as the sum of the average precision values of all bug reports divided by the number of bug reports for a given project.

Mean Reciprocal Rank (MRR) measures the overall effectiveness of retrieval for a set of bug reports. The reciprocal rank for a query is the inverse of the rank of the first relevant document found. The mean reciprocal rank is the average of the reciprocal ranks over a set of queries.
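In one common notation, consistent with the two definitions above, these metrics can be written as follows, where Q is the set of bug reports, R_q the set of affected files for report q, n the number of retrieved files, P(k) the precision at cut-off k, rel(k) an indicator of whether the file at rank k is affected, and rank_i the rank of the first affected file for report i:

\mathit{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} P(k)\,\mathit{rel}(k),
\qquad
\mathit{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathit{AP}(q),
\qquad
\mathit{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathit{rank}_i}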

Object Oriented Programming (OOP) is a programming paradigm.

Open Source Software (OSS) refers to software whose source code is publicly available for use and distribution.

Program identifiers (aka. identifier names) are the named source code entities that describe the classes, methods and fields of an OOP software application. In other words they are the given names of the program elements.

Project artefact is an output of software application development, for example the textual documentation of the application. In our research we use the following project artefacts: (1) the source code, (2) the user guide and (3) the bug report documents describing the implementation of a new, or the modification of an existing, business requirement in the software application.

Program elements are the programming language specific source code entities, such as class files, methods and fields, used in an OOP language like Java.

Relational Topic Modelling (RTM) is a model of documents as collections of words together with the relations between them.

Revised Vector Space Model (rVSM) is an information retrieval model, introduced by Zhou et al. (2012a), that scores each source code file against the given bug report. In addition, each file gets a similarity score (SimiScore) based on whether the file was affected by one or more closed bug reports similar to the given bug report.

Scenario Based Probabilistic (SBP) ranking allows the tracing of execution scenarios and lists program elements (i.e. source code classes and methods) ranked according to their similarity to a given concept when it is executed during a particular scenario.

Singular Value Decomposition (SVD) is a technique used to reduce noise in a term-by-document matrix while keeping the relative distances between artefacts intact.

Structural relations are inheritance relationships (i.e. class hierarchies) or caller-callee relations between the program elements. In the caller-callee relation, the caller is the method calling the current method and the callee is the method being called from the current method.

Term frequency (tf) and inverse document frequency (idf) (TF.IDF) are used to score the weight of a term in a given collection of documents. The tf is the number of times a term occurs in a document, and the idf is the ratio between the total number of documents and the number of documents containing the term; it measures whether a term occurs more or less frequently across the documents.
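In the usual formulation (the exact weighting scheme varies between IR engines; the variant relevant to this work is explained in Chapter 2, Sub-section 2.2.1), the combined weight of a term t in a document d, over a collection of N documents of which df(t) contain the term, is:

\mathit{tf\text{-}idf}(t, d) = \mathit{tf}(t, d) \times \log\frac{N}{\mathit{df}(t)}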

Vector Space Model (VSM) is an information retrieval model where vectors represent queries (bug reports, in the case of bug localisation) and documents (source code files). Each element of a vector corresponds to a word or term extracted from the query's or document's vocabulary. The relevance of a document to a query can be evaluated directly by calculating the similarity of their word vectors.
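This similarity is typically computed as the cosine of the angle between the query vector \vec{q} and the document vector \vec{d}:

\mathit{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}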

Appendix B

Top-5 Concepts

Tables B.1 and B.2 list the top-5 concepts occurring across all artefacts of the projects.

Table B.1: Top-5 concepts occuring across the project artefacts Project Search type Bug reports (BR) User guide (UG) Source code (SRC) aspect aspect type pointcut pointcut point normal type point aspect advice join source object crosscutting target AspectJ aspect (aspect) aspect (aspect) type (type) pointcut (pointcut) point (point) aspect (aspect) stemmed type (type) pointcut (pointcut) point (point) weav (weaving) crosscut (crosscutting) sourc (source) return (returning) join (join) target (target) debug file page file cvs text normal line workbench file build editor source ant ant label Eclipse file (file) file (file) page (page) view (views) view (views) file (file) stemmed debug (debug, debugging) project (projects) text (text) line (line) cv (cvs) line (line) build (build) editor (editor,editors) label (label) display widget text text display menu normal widget help tree layout list menu label file SWT displai (display) widget (widget) text (text) widget (widget) displai (display) menu (menu) stemmed text (text) layout (layout) window (window) window (window) window (window) list (list) menu (menu) help (help) file (file)

170 Table B.2: Top-5 concepts occuring across the project artefacts - cont. Project Search type Bug reports (BR) User guide (UG) Source code (SRC) code scanner data upc code size normal ean source matrix fixed core width size reader start ZXing code (code) code (code) decod (decoder) fix (fixed) scanner (scanner) data (data) stemmed decod (decoder) sourc (source) size (size) upc (upc) start (start) width (width) ean (ean) reader (reader) matrix (matrix) http web context request application value normal servlet context request session server session jsp value servlet Tomcat http (http, https) applic (application) valu (value) request (request) web (web) context (context) stemmed servlet (servlet) context (context) request (request) session (session) server (server) session (session) jsp (jsp) valu (value) listen (listener) diagram diagram action uml use state normal source class uml class case class activity uml diagram ArgoUML diagram (diagram) diagram (diagram) action (action) uml (uml) us (use) state (state) stemmed class (class) class (class) uml (uml) sourc (source) case (case) class (class) us (use) associ (association) gener (generalization) time risk risk current business base normal capital value value aggregated line time risk group index Pillar1 time (time) risk (risk) risk (risk) current (current) busi (business) base (base) stemmed aggreg (aggregated) line (line) valu (value) capit (capital) valu (value) time (time) label (label) calcul (calculation) index (index) market market index value calculation market normal calculation scenario value risk investment risk asset index scenario Pillar2 valu (value) calcul (calculation) index (index) market (market) market (market) valu (value) stemmed calcul (calculation) valu (value) market (market) risk (risk) scenario (scenario) risk (risk) asset (asset) index (index) scenario (scenario)

Appendix C

Sample Bug Report Descriptions

Table C.1 shows a sub-set of tersely described Pillar1 and Pillar2 bug reports. Note that Pillar2 BR #2010 has the incorrect spelling of the word greater as greather in the summary of the original document.

Table C.1: Sub-set of bug report descriptions for Pillar1 and Pillar2

BR    Description for Pillar1
1619  unrecoverable error for error parameter poisson.
1733  Wiring concept for two-phase components should be extended
2024  Handling IllegalAcceessException in Packets and Collectors.
2081  Check usage of enum classes
2093  Quota event limit not implemented.
2163  GIRAModel not compatible with current master branch
2200  NPE when changing result views

BR    Description for Pillar2
2002  Roundup export data to an importable excel format.
2003  Pdlgd export data to an importable excel format.
2010  Allow volatility values greather than 1.
2063  New reallocation method ”Use Asset Diversified Risk”.
2068  Show in both sub systems Market and PD/LGD all calculation states
2074  Dialog to distribute lambda factors similar to other module.
2081  Show approx. group values and diversification effects.

Appendix D

Search Pattern for Extracting Stack Trace

The regular expression in Figure D.1 is used to search for the application-only source files, i.e. excluding third-party and Java library files, in the stack trace.

at [^java.^sun.]([packageName]*).                       // exclude java packages
| ([className]*).($[innerClassName]*)?.                  // consider any inner class
([methodName]*)(((([fileName]*).java): [lineNumber])*?(UnknownSource)?(NativeMethod)?)

Figure D.1: Search pattern for stack trace
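For illustration, a simplified Java version of this extraction (not the exact pattern of Figure D.1) might look as follows; the class name and the exclusion of only java. and sun. packages are assumptions made for this sketch.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts the source file names referenced in stack trace frames of the form
 *  "at package.Class.method(File.java:123)", skipping Java runtime frames. */
public class StackTraceFileExtractor {

    private static final Pattern FRAME =
            Pattern.compile("at\\s+([\\w.$]+)\\.[\\w$<>]+\\((\\w+)\\.java:(\\d+)\\)");

    public static Set<String> fileNames(String stackTrace) {
        Set<String> files = new LinkedHashSet<>();
        Matcher m = FRAME.matcher(stackTrace);
        while (m.find()) {
            String qualifiedClass = m.group(1);
            // Skip frames from the Java runtime itself.
            if (qualifiedClass.startsWith("java.") || qualifiedClass.startsWith("sun.")) {
                continue;
            }
            files.add(m.group(2) + ".java");
        }
        return files;
    }
}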

Appendix E

Domain Concepts

Tables E.1 through E.7 list, in two columns, all the domain concepts used.

Table E.1: Graphical User Interface (GUI) Domain Concepts CONCEPTS - I CONCEPTS - II Heads-up display in computing Menu Heads-up display in video games Label Drop-down list Spinner Layout Toast Accordion About box Panel Alert Dialog box Window WIMP Frame Fieldset Breadcrumb Slider

174 Table E.2: Integrated Development Environment (IDE) Domain Concepts CONCEPTS CONCEPTS Workbench Java Views Perspectives Java Editor Editors Quick Fix and Assist Views Templates Java Search Markers Refactoring Support Bookmarks Debugger Label decorations Scrapbook Ant & External tools Local Debugging Team programming with CVS Remote Debugging Accessibility features Breakpoints Features String Externalization Java Projects Extensions and Extension Points Java Builder Feature Java Perspectives Fragment Application Frameworks Environment Configuration Component Frameworks Peer Review Support File Managers Collaboration Support Text Editors Deployment Support Execution Environments Role-Based Views Design and Modeling Graphical User Interfaces Documentation Generators Unit-Testing Frameworks Test Management Static Code Analyzers Plug-in JSP JSF Source Editing Tools Product Editor Update Site JSF Application Configuration Rhino Debug JSF Tag Registry JSDT Features JSF Component Tree JSDT Known Limitations Java API for XML-Based Web Services JSF Specification Command Line Interfaces JSF Facets Application Programming Interfaces JSF Libraries Compilation and Build Component Templates Task Lists Application Templates

175 Table E.3: Bar Code (imaging/scanning) Domain Concepts CONCEPTS CONCEPTS Aspect Ratio Dot Size (Printer) Background Dot Size (Scanner) Bar Code Character EAN Bar Code Density EAN International Bearer Bars EAN Bar Code (European Article Number) Bi-Directional Extended Code 39 Character Set Flat Bed Scanner Check Character Guard Bars Clear Area Light Pen Code 39 Human Readable Codabar Interleaved 2 of 5 code Code 11 Inter-Character Space Code 93 Keyboard Wedge Decoder Color Scheme Ladder Code Minimum Reflectivity Di↵erence Spectral Band Mil Verifier Stacked Codes Void Start/Stop Characters Wand TSR Wide to Narrow Ratio Verification X Dimension Slot Reader Print Contrast Signal (PCS) Mis-Read Print Quality Module Quiet Zone Modulo Check Character(s) Resolution Number System Character Self-Checking OCR Serial Decoder OCR-A Space Opacity Zero Suppression Picket Fence Code 2-Dimensional Symbology Decoder Data Identifier Demand Printer —

176 Table E.4: Servlet Container Domain Concepts CONCEPTS CONCEPTS Application event listener HTTP session Bean HTTPS Client J2EE Cookie J2EE application Context root J2EE web tier Custom tag JAR Deployment JavaBeans Deployment descriptor JSF Dispatcher JSTL Document root JDBC Filter JSP action Front Controller JSP element HTTP JSP expression HTTP Monitor JSP page HTTP response JSP scripting element HTTP request JSP tag Web module Web module group Servlet mapping Value object Session View Creation Helper Tag View Mapper Tag attribute WAR TLD URI JSP tag library scriptlet JSP technology Server listener Server plugin model object Servlet MIME Servlet container Scope Servlet context Scripting element Servlet event listener Scripting variable Servlet filter Web component Web server Web context Web container Web client XML

Table E.5: Aspect Oriented Programming (AOP) Domain Concepts CONCEPTS CONCEPTS Scattering join point Tangling Advice Crosscutting Pointcut inter-type Introduction Aspect Target object Around Proxy Weaving After Before After-returning After-throwing —

177 Table E.6: Basel-II Domain Concepts CONCEPTS CONCEPTS financial investment risk scenario volatility investment market scenario scenario volatility rules investment market risk scenario correlation index volatility scenario correlation rules index correlation value rules correction asset stand alone ( economic capital ) aggregated market value intra diversified ( economic capital ) investment market asset inter diversified ( economic capital ) investment market calculation legal entity time attribute division lambda factors sub group base calculation business line current calculation sender volatility scenario receiver portfolio granularity inter risk business line scenario rule business unit holding receiver legal entity market value legal entity receiver portfolio unit economic capital scenario financial investment pdlgd asset financial investment roundup subgroup sender public context label subgroup receiver correlation scenario stand alone risk value scenario intra diversified risk

Table E.7: Unified Modelling Language (UML) Domain Concepts CONCEPTS CONCEPTS Activity Diagram Hierarchical Statechart Diagram Action Include Relationship Actor Mealy Machine Analysis Method (of a Class or Object) Association Class Moore Machine Association Object Attribute (of a Class or Object) Pane Responsibility Sequence Diagram Scenario State System Statechart Diagram Collaboration Class Collaboration Diagram Class Diagram Collaborator System Sequence Diagram Vision Document Transition Waterfall Design Process UML Statechart Diagram Use Case Stereotypes and Stereotyping Use Case Diagram Critic Use Case Specification Extend Relationship Generalization Relationship —
