DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Evaluating methods for grouping and comparing crash dumps

MICHEL CUPURDIJA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: January 28, 2019
Supervisors: Alexander Baltatzis, Cyrille Artho
Examiner: Johan Håstad
School of Electrical Engineering and Computer Science
Swedish title: Utvärdering av metoder för att gruppera och jämföra krashdumpar


Abstract

Observations suggest that a high percentage of all reported software errors are reoccurrences, in certain cases as high as 75%. This high percentage of reoccurrences means that companies waste hours manually re-diagnosing errors that have already been diagnosed. The goal of this thesis was to eliminate or limit cases where errors have to be re-diagnosed through the use of automated grouping of crash dumps. In this study we constructed a series of tests. We evaluate both pre-existing methods and our newly proposed methods for comparing and matching crash dumps. A set of known errors was used as the basis for measuring the matching precision and grouping ability of each method. Our results show a large variation in accuracy between methods and that, generally, the more accurate a method is, the less it offers in terms of grouping ability. With an accuracy ranging from 50% to 90% and a reduction in manual diagnosis of up to 90%, we have shown that through automatic grouping of crash dumps we can accurately identify reoccurrences and reduce manual diagnosis.

Sammanfattning

The goal of this report was to investigate methods for grouping crash dumps. Reports on the subject have shown that up to 75% of reported bugs can be repeated occurrences of the same bug. The aim has therefore been to reduce the need for manual diagnostics by grouping crash dumps that share the same error source. In our study we constructed tests to objectively compare and evaluate the different methods. We evaluated both pre-existing grouping methods and methods that we propose. The tests evaluated the grouping methods' precision as well as their grouping ability. The evaluation showed a large variation in precision between the methods, but also a correlation between grouping ability and precision: methods with high precision have poor grouping ability. Our results show that it is possible to eliminate up to 90% of the manual debugging work, with a precision in the range of 50-90% depending on the choice of method.

Contents

1 Introduction
   1.1 Problem statement
   1.2 Methodology
   1.3 Delimitation
   1.4 Contribution
   1.5 Ethics and sustainability
   1.6 Thesis outline

2 Background
   2.1 Commercial software
   2.2 Building reliable software
   2.3 Programming languages and their effect on reliability
   2.4 Software verification
      2.4.1 Formal methods
      2.4.2 Software testing
      2.4.3 Automatic crash reporting systems
   2.5 Crash analysis and understanding failures
      2.5.1 System crashes
      2.5.2 Process crashes
   2.6 Understanding crash dumps
   2.7 Tools for crash dump analysis & information extraction
      2.7.1 Locating cause of SIGSEGV using mdb in process crash
   2.8 Machine learning
      2.8.1 Online and offline learning
      2.8.2 Classification approaches
      2.8.3 Measuring the performance of classifiers

3 Related work
   3.1 Automatic crash reporting systems
   3.2 Bucketing algorithms
      3.2.1 Analyzing call stacks
      3.2.2 Definition of an edit distance

4 Comparing crash dumps
   4.1 Symptoms
   4.2 Matching based on symptoms
   4.3 Comparing stack traces
      4.3.1 String comparison
      4.3.2 Recursion removal
      4.3.3 Edit distance
      4.3.4 Prefix matching
      4.3.5 Distance normalization
   4.4 Machine learning
      4.4.1 Nearest neighbor learning algorithm
   4.5 Summary
      4.5.1 Risk discussion

5 Evaluation and results
   5.1 Evaluation overview
      5.1.1 Test data
      5.1.2 Evaluation method for determining precision
      5.1.3 Evaluation method for determining grouping ability
      5.1.4 Evaluation method for machine learning precision
   5.2 Evaluation results
      5.2.1 Precision evaluation results
      5.2.2 Grouping evaluation results
      5.2.3 Machine learning evaluation results

6 Conclusion

7 Future work

Bibliography

Chapter 1

Introduction

Modern software is becoming larger and more complex than ever before. Source code repositories and user bases are growing larger every day. Although growth is inevitable, from a quality assurance perspective it means more errors being encountered more frequently.

It is important to understand that a high percentage of all encountered software errors are reoccurrences [1][2]. This high percentage is often due to the significant amount of time it takes to develop and deploy fixes; applications can therefore remain flawed for a long time. Manual re-diagnosis of old errors that are consistently reported as new bugs is a huge waste of resources.

Clearly there is an opportunity here to automate a process that could significantly reduce the amount of necessary manual diagnosis. If we could determine and pinpoint the exact symptoms of an error, and the symptoms of two or more errors coincide, then we could consider them equal. Considering two errors equal gives us the ability to categorize errors, and a newly reported error could be placed in an error group with similar symptoms that has already been diagnosed.

However, uniquely categorizing errors based on their symptoms is not trivial. Information is limited and often only presented in the form of a crash dump, which is the main focus of this thesis.

1.1 Problem statement

This thesis will investigate and evaluate methods for grouping errors based on their symptoms, i.e., indications of the cause of an error. The goal is to minimize the amount of manual diagnostics by removing the need for redundant re-diagnosis of already known errors. This thesis therefore aims to answer the following question:


Is it possible to group crashes with sufficient accuracy for an automated system to be trusted with the task of reducing the amount of necessary manual diagnosis?

1.2 Methodology

The overall focus was on exploring the field in order to later apply known theory and evaluate a number of existing methods. The first step was to establish a general understanding of the subject and to locate the information in crash dumps that would allow us to identify the symptoms of an error. The second step was to conduct a study of certain existing methods for comparing symptoms, specifically within stack trace analysis, and to assess their ability to group errors. These two steps were conducted in several iterations. The outcome of the study determined that four grouping methods were to be implemented and evaluated. The final step was to summarize the analysis of the suggested methods and evaluate concept reliability (see Figure 1.1 for an illustration of the research phase).

Figure 1.1: Workflow during research phase.

1.3 Delimitation

This work is limited to the analysis of crash dumps on Unix-like operating systems. The work is also limited to the analysis of crash dumps from a set of similar applications; therefore, the results may not be generalizable beyond our data set.


1.4 Contribution

This thesis primarily analyzes and compares existing methods for comparing stack traces. The results are in line with existing research for the individual comparison methods, and our evaluation indicates under which circumstances the different methods are ideal. We introduce a heuristic for removing recursion patterns in stack traces based on the removal of maximal repeats. It shows an increase in accuracy over existing techniques for the tested data set.

1.5 Ethics and sustainability

The training of machine learning algorithms can have a negative impact on sustainability in terms of energy consumption, in particular for large data sets where training over long periods of time is necessary. Furthermore, the required physical hardware is constructed with scarce natural resources.

From an ethical perspective, part of this work attempts to improve ways of fixing errors. With a very efficient way of fixing errors once a product has already been distributed to customers, companies could spend less time assuring the quality of a product prior to distribution. Companies could then be more inclined to release unfinished products, which in turn could have varying levels of impact depending on the sector in which the company operates.

1.6 Thesis outline

This thesis is organized as follows: Chapter 2 introduces the reader to crash dumps, what they contain, how they are provoked and how they can be analyzed; Chapter 3 introduces related work already done in the field; Chapter 4 defines how crashes can be categorized and describes methods for doing such categorizations; Chapter 5 explains the evaluation process of the different comparison methods and the results of the evaluation; Chapter 6 discusses the results from the evaluation and concludes the report; and lastly, Chapter 7 outlines future work.

Chapter 2

Background

Writing reliable software is hard, especially in today's landscape where the complexity of modern applications is ever increasing. History has shown that writing software without errors is close to impossible. Errors can be caused by flaws in the design, or by a correct design implemented in the wrong way. Programs not acting as expected contain bugs. Software bugs are often the source of abnormal program terminations, normally referred to as crashes.

2.1 Commercial software

In commercial software, with paying users and customers, there is an expectation that the software will work as intended and advertised. If the expectation is not met, customers will look for competing alternatives. Software companies therefore spend significant resources to mitigate the risk of bugs in their released programs. Despite these efforts, bugs are often not found until software is put into commercial use.

A unique aspect of software is that it is possible to distribute fixes at relatively low cost. This allows companies to retain or regain customer reputation through quick and continuous support.

Customers today expect their software to be regularly updated and new features to be added on a regular basis. These expectations are, in terms of reliability, opposing forces. To achieve reliability and robustness a program needs to be iterated on several times. New features and updates add additional code, code that is often re-used and re-purposed, which again needs iterating upon. A continuous development cycle like this makes it difficult to maintain quality.


2.2 Building reliable software

Software can have a long lifecycle and go through many stages, from an original design and implementation to testing and maintenance. At each stage appropriate measures can be taken to mitigate the risk of bugs. Various practices exist for what each stage should entail. Microsoft applies a methodology called SDL, the Security Development Lifecycle, a "seven-stage process" consisting of: training, requirements, design, implementation, verification, release and response [4]. Each stage is described in detail and acts as a standard for the development cycle throughout the company. Practices like these are not uncommon for larger companies, and audit trails for the process are sometimes applied.

2.3 Programming languages and their effect on reliability

Most programming languages can be categorized as languages that either produce native code, or languages that execute in a controlled environment. Languages that produce native code, such as C or C++, often allow the programmer direct access to the underlying architecture. In particular, management of memory is left to the programmer. Pointer arithmetic is the root cause of many classes of bugs, examples being buffer overflows, null pointer dereferences, double free and use-after-free bugs. Languages such as Java, executing inside virtual environments, may counter this by being purely object oriented and not allowing programmers to directly interact with memory. However, such languages are not significantly different from a reliability perspective, as object oriented languages often work with object references, references that can be null if they are not currently referring to an object. Dereferencing such a null reference will throw an exception which, if not handled, will terminate and crash the program. New languages are introduced and compilers are getting more advanced, but despite technical progress there is still room for programmer errors to introduce bugs [5].

2.4 Software verification

Verifying that a program will act as expected under all, or most, circumstances is a non-trivial task. In order to prove that it behaves as expected for all inputs, formal methods can be used.

A proof using formal methods is a costly, and often complex, approach only applied where necessary. To show that a program acts as expected in most circumstances, software testing can be used. Software testing is a more common requirement in the industry. Neither method is all-encompassing, and neither excludes the use of the other.

2.4.1 Formal methods

Using formal methods it is possible to prove the correctness of software, meaning a proof can be generated that it will always behave correctly, i.e., according to specification. This is especially important for safety-critical applications in industries that require it, examples being avionics or nuclear power. However, in practice, the use of formal methods is often limited to smaller parts of the software.

Formal methods have inherent scalability issues. This is mainly a concern when attempting to prove the correctness of an implementation derived from a formal specification outside a lab environment. Formal specifications on their own scale well, as they are a higher level abstraction and good tooling is available. When formal methods are used to verify an implementation, their scalability is limited by the complexities of programming languages, operating systems and computer architectures [5]. When a program is run outside a deterministic lab environment it interfaces with operating systems, user input and other potentially unknown parameters. Conducting formal analysis under these circumstances is hard and often impossible.

2.4.2 Software testing

With formal methods not always being applicable, an alternative is to put a program, or part of a program, through a number of tests to verify that it behaves correctly. Tests can cover both negative and positive behavior. Positive tests show that the program behaves correctly given valid input and normal user behavior. Negative tests show that the program can gracefully handle abnormal input or user behavior. The purpose of the tests is to verify behavior, find bugs and discover regressions when code is changed.

Manually written tests can generally only cover a fraction of the available inputs. The process of automatically generating invalid inputs and feeding them into the program is called fuzzing. It has been shown to be an effective way of finding bugs. The process can be time consuming, but as it is fully automated it can run continuously until a new version is ready for fuzzing. The method has been shown to be particularly effective at finding memory corruption bugs [6].

Memory corruption can often be identified and detected through program crashes. When a program terminates abnormally, it is possible to save the contents of memory at the time of the crash. This stored data is referred to as a crash dump.
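
To make the idea concrete, a minimal fuzzing loop can be sketched as follows (our own illustration in Python; the target binary ./a.out and the purely random input generation are placeholder assumptions, and real fuzzers are far more sophisticated). On Unix-likes, a negative return code from a child process indicates termination by a signal:

import random
import subprocess

def fuzz_once(target, max_len=256):
    # Feed one random input to the target program; a negative return
    # code means the process was killed by a signal (e.g., -11 for SIGSEGV).
    data = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
    proc = subprocess.run([target], input=data, capture_output=True)
    return proc.returncode, data

if __name__ == "__main__":
    for _ in range(1000):
        rc, data = fuzz_once("./a.out")  # hypothetical target binary
        if rc < 0:
            print("abnormal termination by signal", -rc)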

2.4.3 Automatic crash reporting systems

For commercial software targeted at the masses it is infeasible to test all permutations of hardware and operating system configurations. Companies attempt to mitigate this issue by having their programs automatically report back information when a crash occurs. The report can contain information about the system configuration as well as a crash dump. This allows developers to reproduce the problem and fix the issue [7]. For larger institutions with many users the number of reported crashes can be very large; Microsoft's error reporting system is designed to handle over a hundred million reports daily [5].

2.5 Crash analysis and understanding failures

Analyzing and understanding crashes is difficult, in particular when a program is written in native code and only machine-level information is available. Connecting machine-level information to source code is not trivial. For programs written in managed code1 the task can be significantly easier, as many modern managed platforms include information about where in the source code a crash originated.

A crash generally results in a crash dump that contains a stack trace. The stack trace gives insight into recent function calls and control flow. Often the stack trace in itself is sufficient information for a developer to locate an issue [8]. The topic of call stack analysis is covered in more detail in subsection 3.2.1.

An application typically crashes under two circumstances: if it performs an operation not allowed by the operating system, or if an intentional assertion was placed in the code. Some scenarios under which an application can crash are: accessing memory that is not allocated to the application; executing invalid or privileged instructions; illegal or incorrect I/O operations on hardware devices; incorrect usage of system calls.

1An encapsulating term coined by Microsoft for programming languages that execute code inside a controlled virtual environment, e.g., Java, C#, etc.

In certain scenarios crashes can be intentional. Assertions placed in programs can cause them to crash. Typically such assertions are put in place if the program is in an irrecoverable state, or in a state that is irrefutably wrong and could cause further damage to the system if not terminated immediately.

Aside from an application crashing, the entire operating system can crash. Modern operating systems usually handle application crashes gracefully and remain unharmed. However, some irrecoverable circumstances can cause the entire system to crash. Hardware faults are a common cause, but the operating system can also crash due to internal inconsistencies. Poorly written drivers are a source of such inconsistencies. Device drivers reside in the operating system's address space and could, through incorrect memory usage, crash the system. System crashes as well as process crashes are covered in more detail in Sections 2.5.1 and 2.5.2.

2.5.1 System crashes

As previously mentioned there are many ways an operating system can crash, anything from faulty drivers to hardware errors. They are all, however, treated in a similar way, and an illustration of the process can be seen in Figure 2.1. A significant difference between a system crash and a process crash, often called an application crash, is that if the system crashes, a reboot is likely necessary.

The specifics of how an OS handles system crashes differ, even between Unix-like systems. In general terms, however, the chain of events happens as follows. If a critical error is detected, an example being a fatal error in hardware, the panic() routine is called. The panic() routine proceeds by interrupting and suspending all other processes to minimize damage to user data. It then generates a dump file which is saved to a temporary dump device2 and later restored, when the system has rebooted, into an appropriate folder [9].

Forcing crashes can be helpful when learning new tools or wanting to explore crash dumps in general. On illumos this task is made trivial, as there is a parameter, -d, that can be passed to reboot() which forces the OS to create a system crash dump before rebooting [3].

2A logical volume where the dump is stored temporarily (often the swap disk) until the system restarts and places it in a permanent volume.

[Figure 2.1 shows two flowcharts. System crash (left): kernel crash requested (panic, user-requested) → OS suspends all processes → OS stores memory image on swap disk → system reboot → image is retrieved from the swap disk and stored. Process crash (right): process violates terms → OS signals the process (SIGSEGV) → OS handler terminates the process → if enabled, the OS dumps memory to a dump file.]

Figure 2.1: On the left is the flow during a system crash, and on the right a process crash.

2.5.2 Process crashes

The most common cause of process crashes is the operating system sending a terminating signal to the process after it violates certain terms. Typical violations that result in a process crash:

• Attempting to access memory that is not allocated for the process (segmentation violation).

• Attempting to execute invalid instructions or privileged instructions.

• Attempting to use a system resource for which the process has insufficient privileges.

• Invoking system calls with invalid parameters.

2.6 Understanding crash dumps

A crash dump, often called a core dump or just a dump, refers to a recording of the state of a program. Generally one is created when a program terminates abnormally, i.e., crashes.

The information that is recorded, or dumped3, consists of two key components: processor registers and memory information. The processor registers always contain a program counter and the stack pointer. Under good circumstances, the data registers can also contain some crucial parameters that might indicate where the fault was located. These registers are, however, prone to change, as a single processor instruction can change their contents. This volatility makes them an unreliable source of information, and one should not expect information there to be relevant.

The memory information component contains the information that is most relevant to our case. It contains the whole process address space in use. This means that we have access to the stack trace of the process at the time of the crash and any local or shared data, information which is essential when comparing symptoms of crashes.

2.7 Tools for crash dump analysis & information extraction

This section covers the basics of analyzing crash dumps. The information within core dumps is similar across all Unix-likes, meaning that even though the information extraction methods may differ between operating systems, it is still valuable to know what information is available.

Crash dumps generally contain large amounts of information, and extracting all of it is not always necessary. It is often more desirable to have portions of it displayed in a human-readable way. A tool that allows us to do this extraction is called mdb4, the Modular Debugger [10]. It is a general purpose debugging tool, much like the widely used GNU debugger, gdb. The Modular Debugger differs from many debugging tools which only allow the developer to execute programs in traditional controlled environments, where controlled execution and inspection of state using the source language are available. Using controlled execution is not always feasible; the mdb manual suggests four scenarios where a regular debugger does not suffice:

3Dump or dumped has become a term that implies storage of raw data.
4Note, this thesis does not concern itself with explaining mdb in full detail, but limits itself to the necessities of our problem. Please see [10] for further reading.

1) debugging an operating system, where bugs might not be reproducible and program state is often massive; 2) optimized programs with debug information stripped; 3) debugging low-level tools, such as another debugger; and 4) situations where only post-mortem information is available, an example being programs run at a customer location [10]. The Modular Debugger and similar tools allow for debugging of both live and post-mortem programs.

Proficiency with such tools generally requires familiarity with both the assembly programming language and an understanding of the relationship between higher level code and assembly code. This is, as previously mentioned, because code generally gets compiled into a lower level language (e.g., assembly or machine code), which during run-time gets loaded into memory. This means that, apart from function and variable names, which are preserved in dumps due to the symbol table5, all that is still available after compilation is assembly.

Listing 2.1 below is an example of a program that causes a SIGSEGV to occur by attempting to write to an illegal address (often referred to as a null pointer exception). This example is used to illustrate the information available in a crash dump by attempting to locate the cause of the SIGSEGV using mdb. Despite its simplicity, this example shows a large part of the information required to start tracing error symptoms.

Listing 2.1: Example program in C that causes a process crash due to segmentation fault and dumps core

int main() {
    int *_bar;
    _bar = 0;            // Point to illegal address
    *_bar = 0xAAAAAAAA;  // Null pointer exception
    return 0;
}

2.7.1 Locating cause of SIGSEGV using mdb in process crash

We assume the program in Listing 2.1 has been run and a corresponding dump exists. We start debugging using mdb (mdb-specific commands are shown in blue) and get the following:

5A data structure used by compilers to store information on names and types to aid certain processes during compilation. Debuggers often access and present this data to the user to simplify debugging. See Listing 2.3 for an illustration of this.

Listing 2.2: Starting debugging of dump using mdb and command ::status

root@m2:~# mdb core.a.out.15778.1398948161
Loading modules: [ libc.so.1 ld.so.1 ]
> ::status
debugging core file of a.out (32-bit) from m2
file: /root/a.out
initial argv: ./a.out ARGS
threading model: native threads
status: process terminated by SIGSEGV (Seg. Fault), addr=0

In Listing 2.2 we have opened the dump and run the mdb command ::status, which prints a summary of the target crash dump. The information here is limited but valuable when differentiating host machines, binaries and causes of termination. In our case we see that the process terminated due to a SIGSEGV. To investigate the cause further, the command ::stack is used; it prints the stack trace at the time of the crash. The ::stack command provides the most valuable information for our case, as it explains the program state at the time of the crash. We then proceed by disassembling the main() function.

Listing 2.3: Continued debugging of the dump using the mdb commands ::stack and ::dis

> ::stack
main+0x10(1, 8047dbc, 8047dc4, 8050d70, 0, 0)
_start+0x83(1, 8047e80, 0, 8047e88, 8047e98, 8047eac)
> main::dis
main:        pushl %ebp              ; push FP to stack
main+1:      movl %esp,%ebp          ; copy SP to FP
main+3:      subl $0x10,%esp         ; make room for pointer
main+6:      movl $0x0,-0x4(%ebp)    ; set pointer to NULL
main+0xd:    movl -0x4(%ebp),%eax    ; copy pointer to %eax
main+0x10:   movl $0xaaaaaaaa,(%eax) ; write value to address in %eax (NULL)
main+0x1b:   leave
main+0x1c:   ret

In Listing 2.3 the stack trace shows that main() was the last function to be called before termination. After disassembly, using ::dis, it becomes clear that there is an attempt at copying the value 0xAAAAAAAA to address 0x0 (NULL), which is a reserved address that the OS does not permit modifying.

As previously mentioned, symbol tables allow function names to be displayed in the stack trace. However, this only applies to programs compiled with gcc. Programs compiled with g++ use a more complicated name mangling6 system, and require additional predefined symbols in order for the stack trace to be demangled (made human-readable). The modular debugger, mdb, provides this functionality; Listing 2.4 and Listing 2.5 illustrate this. This can be valuable when comparing stack traces from different compilers, as name mangling is not standardized.

6A technique used to handle the problem of resolving unique identifiers in programs.

Listing 2.4: g++ stack trace without C++ demangling (function parameters removed for visibility)

libjvm.so`__1cCosFabort6Fb_v_+0x51()
libjvm.so`__1cLJvmtiExportTpost_vm_initialized6F_v_+0x436()
libjvm.so`__1cHThreadsJcreate_vm6FpnOJavaVMInitArgs_pb_i_+0xdc9()

Listing 2.5: g++ stack trace with C++ demangling (function parameters removed for visibility)

libjvm.so`void os::abort+0x51()
libjvm.so`void JvmtiExport::post_vm_initialized+0x436()
libjvm.so`int Threads::create_vm+0xdc9()

2.8 Machine learning

Machine learning is used to make data-driven decisions based on patterns in data. The domain spans a wide area of fields and approaches to learning. It can be anything from making predictions about the future based on patterns in the past, to simply classifying data points. The topic of machine learning is wide and complex, and only a small part of its domain is relevant to this thesis.

2.8.1 Online and offline learning

The term learning stems from the idea that existing data, or new data, is used to improve classification through analysis of patterns. Machine learning can be categorized in many ways, but a relevant categorization for this thesis is the separation of online learning and offline learning [11].

The concept of updating the model incrementally and instantly for each new training sample that arrives is referred to as online learning. This often requires specific algorithms that are not dependent on comparing a new sample to all existing samples, because that would not scale for large data sets [12]. Offline learning, on the other hand, does not include new training samples instantly. Instead the learning is done in batch, where the whole data set is used to compute a model. This can be a time consuming process. Offline models are therefore generally computed separately from the running classifier, and the classifier is only updated once a new model has been computed [13][12].
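
To make the distinction concrete, consider a minimal sketch (our own illustration, not taken from [11][12][13]) where the model is simply a class centroid:

import numpy as np

class OnlineCentroid:
    # Online learning: the model (a running mean) is updated for each
    # new sample as it arrives, without revisiting earlier samples.
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)

    def partial_fit(self, x):
        self.n += 1
        self.mean += (np.asarray(x) - self.mean) / self.n

def offline_centroid(samples):
    # Offline learning: the model is recomputed in batch over the whole
    # data set, typically separately from the running classifier.
    return np.mean(samples, axis=0)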

2.8.2 Classification approaches

Within the field of machine learning there are mainly two approaches: supervised and unsupervised. The main difference between them is that for supervised learning there exists a ground truth, some prior knowledge about the expected outcome. The goal of supervised learning then becomes to produce a classifier which, given a set of samples and the corresponding desired outcomes, is capable of classifying new samples. Unsupervised learning, on the other hand, does not have access to a ground truth but instead attempts to infer a natural structure given a set of data points [11].

When comparing crash dumps with the intent of reducing work load we want to group similar crashes. Grouping as a concept within machine learning is achieved using classification algorithms. Classification algorithms are a class of supervised machine learning algorithms where the algorithm is first fed training data, which in turn is used to learn how to classify, or group, new observations. The classification category consists mainly of four different approaches to grouping: linear classifiers, decision trees, neural networks and nearest neighbor [11].

Linear classifiers attempt to separate objects by making a linear combination of some characteristics, and are often illustrated using a vector or hyperplane separating objects in a two- or three-dimensional space. A classifier is considered linear if the decision boundary is a linear function. Classifiers using hyperplanes are often referred to as support vector machines.

Decision trees build classification models in the form of a tree structure. The core concept is to use an if-then ruleset that is mutually exclusive. The tree is sequentially built for each input of training data. A variation of decision trees is random forests, which, as the name implies, consist of multiple decision trees. Each tree within the forest evaluates a certain characteristic. The result is the merged outcome of the trees [11].

Artificial neural networks, or just neural networks, attempt to mimic the decision process in an organic brain. The network consists of neurons arranged in layers. Each layer converts an input vector into some output. The neurons in each layer apply a function (often non-linear) to the input and pass it to the next set of neurons in the next layer. The output from the last layer in the network is the result of the classification.

Lastly, nearest neighbor classifiers group objects based on a distance threshold for some common characteristic, or criterion. An object to be classified is compared to all existing objects and is simply classified by where the distance to its nearest neighbors is at a minimum [14].
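
Since nearest neighbor classification is the approach used later in this thesis, a minimal sketch may be helpful (our own illustration; the distance function and threshold are left as parameters):

def classify_nearest(sample, labeled_objects, distance, threshold=None):
    # Compare the sample to all existing objects and adopt the label of
    # the closest one; with a threshold, distant samples are rejected
    # and can instead start a new group.
    best_label, best_dist = None, float("inf")
    for obj, label in labeled_objects:
        d = distance(sample, obj)
        if d < best_dist:
            best_label, best_dist = label, d
    if threshold is not None and best_dist > threshold:
        return None
    return best_label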

2.8.3 Measuring the performance of classifiers

Many criteria can be used to measure the performance of a supervised classifier, and different criteria are appropriate in different settings. Two metrics that are often evaluated are precision and recall. Both are frequently used individually as evaluation metrics, but are also incorporated in other metrics. The terms can informally be described as follows: precision, how many of the selected samples were correct; recall, how many of the samples that were expected to be returned were actually returned [15].

Determining the correctness of a prediction is necessary in order to evaluate a classifier. A standard categorization of predictions has therefore been introduced. It defines the four following prediction categories: (1) true positive, the classifier predicted a positive match, and the prediction is deemed correct by external judgement; (2) false positive, the classifier predicted a positive match, while it is deemed incorrect by external judgement; (3) true negative, the classifier predicted a negative match, and the prediction is deemed correct by external judgement; (4) false negative, the classifier predicted a negative match, while it is deemed incorrect by external judgement. In more general terms, positive and negative refer to the prediction of the classifier, and the boolean terms true or false refer to whether the prediction is correct or not [15].

Formally, precision and recall are defined as fractions. Given the categorizations we can now define precision as

\[
\text{precision} = \frac{TP}{TP + FP},
\tag{2.1}
\]

where TP = true positive and FP = false positive. This is the fraction of relevant samples among all the retrieved samples. Recall is defined as

\[
\text{recall} = \frac{TP}{TP + FN},
\tag{2.2}
\]

where TP = true positive and FN = false negative. This is the fraction of samples that were retrieved among all the samples that were expected to be retrieved [15][11].
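
As a small executable companion to Equations 2.1 and 2.2, the following sketch (our own illustration) computes both metrics from parallel lists of predicted and actual matches:

def precision_recall(predicted, actual):
    # predicted/actual are parallel lists of booleans: did the classifier
    # predict a match, and was there actually a match?
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

Chapter 3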

Related work

This chapter details key existing research in areas relevant to this thesis. Implemented systems built on top of the concept of comparing crash dumps are covered, as well as an overview of existing methods for grouping and comparing individual crash dumps.

3.1 Automatic crash reporting systems

As mentioned in previous sections, to collect crash information companies can use programs that report back information when a crash occurs. Crash reporting systems such as Windows Error Reporting, Apple Crash Reporter and Mozilla Crash Reporter have been deployed widely. When a crash occurs at a deployed site, the systems collect crash information such as product version, product name, operating system, installed drivers, call stacks, and crash reason. The information can then be sent to the company servers if the user consents. Crash reporting systems for large institutions can accumulate large numbers of crash reports over time. Since many crash reports are likely caused by the same bug, systems can leverage existing reports and attempt to bucket duplicate or similar reports. Once a bucket reaches some threshold for the number of crashes it is often handed over to the developers [16][17].

Windows Error Reporting represents by far the largest system for reporting crashes. The platform is not only used for fixing issues after a release, it can also be used preemptively. Given the amount of reported crashes and corresponding fixes, the platform has become a resource for empirical data on bug fixes. The information can be used to reduce the amount of reoccurring bugs by identifying problematic patterns in both code and hardware [18].


Aside from allowing the company behind the automatic crash reporting system to receive crash information about their products, some systems allow third party application developers to use the service as well. Both Microsoft Windows and Apple's macOS allow third party applications to use the crash reporting system as a service. Applications can send crash information back to Apple's or Microsoft's servers and receive information about the stability of their product [5].

3.2 Bucketing algorithms

An ideal bucketing methodology would manage to assign exactly one bug to each bucket. The idea is that once a bug has been fixed, all unused crash reports within that bucket can be discarded without further analysis. Such a methodology would also group new reports into existing or new buckets. An ideal algorithm or methodology does not exist. Instead, heuristics are used in order to achieve bucketing that is useful in practice.

Varying heuristics have been shown to be useful. A common denominator, however, is to analyze call stacks and the crash location (module offset) to compare crashes. Meta information applied by the client can be used to separate operating systems, etc., but the binary comparison between two crashes is predominantly done using the call stack [5].

Tools available to the public exist today, such as !exploitable and CrashWrangler, that can analyze and group crashes [19]. Their primary use case, however, is determining whether a crash is a potentially exploitable bug. The grouping is done by hashing parts of the stack frames in the call stack and comparing hashes, a method that can be improved upon as suggested by Dhaliwal et al. [20]. Windows Error Reporting and similar systems use more sophisticated heuristics but are not as readily accessible by the public [18].

3.2.1 Analyzing call stacks

The call stack, often called the stack trace, is in the context of comparing crashes an important part of the crash dump. An empirical study by Schroter et al., analyzing bug reports containing stack traces, suggests that 60 percent of fixes involved a function in the stack trace [8]. A majority of those functions were within the top ten stack frames. Furthermore, the report suggests that multiple reports and multiple stack traces submitted for the same bug gave developers more insight into the underlying problem and have the potential to reduce lead times [8].

Expanding on the hashing solution implemented by !exploitable: to separate crashes from each other, !exploitable hashes two sections of the call stack. One hash is referred to as the major component and the other as the minor component. Hashing and comparison of call stacks is performed on the level of function names (as strings), including or excluding function offsets. This definition of a call stack comparison will be used throughout the thesis. The major component is by default configured to hash the five topmost stack frames. The function offsets are not included in the hash. The minor component hashes all stack frames, this time including the function offsets.

The major component is less strict and allows some deviation in the stack trace, while the minor component is more conservative and demands full equality. Having two components with different levels of allowed deviation lets !exploitable catch two relations between crashes. The minor component potentially allows crashes occurring in the same function to be differentiated, while the previous stack frames must be the same. The major component only concerns itself with the topmost stack frames, where previous events are irrelevant.

In addition to using a combination of components, as suggested above, Brodie et al. suggest using the different components separately depending on needs: full equality, similar to the minor component, where call stacks have to be identical, including or excluding function offsets; or only prefix matching, similar to the major component, where only the top n function names have to be identical, including or excluding offsets [21].

A separate approach suggested by Dhaliwal et al. and Modani et al. is calculating the distance between stack traces, primarily using the Levenshtein distance, or edit distance, and grouping them using that metric. The edit distance between two sequences is defined as the minimal number of changes needed to transform one sequence into the other [22][20]. Some average distance threshold in a group determines if a new group is to be created when a new crash is consumed. An edit distance approach has the potential to modify the bucketing characteristics, as the allowed deviation between crashes can be tuned through changes in the distance threshold. A definition of the Levenshtein distance is outlined in Section 3.2.2.

The study conducted by Schroter et al. argues that grouping crashes belonging to separate bugs has a detrimental effect on the time it takes to fix a bug. A conservative grouping method that divides a single bug into multiple buckets is argued to have less impact on bug fixing time. The article therefore suggests using an approach where the number of incorrect matchings, or misdiagnosed cases, is minimized.
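
A sketch of the two-component scheme is given below; the frame format follows the mdb traces shown in Chapter 2, and the helper names and hash choice are our assumptions, not !exploitable's actual implementation:

import hashlib

def _digest(frames):
    return hashlib.sha1("\n".join(frames).encode()).hexdigest()

def major_component(stack, top_n=5):
    # Hash the five topmost frames, function names only (offsets dropped).
    names = [frame.split("+")[0] for frame in stack[:top_n]]
    return _digest(names)

def minor_component(stack):
    # Hash every frame, function offsets included.
    return _digest(stack)

# Two crashes land in the same loose bucket if their major components
# agree, and are considered identical if the minor components agree too.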

3.2.2 Definition of an edit distance

An edit distance is used to quantify how dissimilar two strings are. It is achieved by counting the number of operations required to transform one string into the other. Transformations are applied at a chosen granularity based on needs. When comparing individual words in a natural language, a reasonable granularity level is comparing letters. When comparing sentences, the granularity level could be increased to comparing entire words. The most commonly used definition is the Levenshtein distance. It defines the following three operations [23]:

Insertion (+) inserts a single symbol x, uv → uxv

Deletion (−) deletes a single symbol x, uxv → uv

Substitution (↔) substitutes x for y, uxv → uyv

In Levenshtein's model each operation has a cost of 1, and the process is best illustrated by a textbook example. Consider the Levenshtein distance between 'thinker' and 'tailor':

− thinker → tinker

↔ tinker → tanker

↔ tanker → taiker

↔ taiker → tailer

↔ tailer → tailor

Given the operation cost of 1, the edit distance between 'thinker' and 'tailor' is at most five. This definition of edit distance is used throughout the thesis.
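
As an executable version of this definition (anticipating the dynamic programming formulation of Section 4.3.3), a minimal sketch at letter granularity with unit operation cost:

def levenshtein(a, b):
    # Row-wise dynamic programming; prev[j] holds the distance between
    # the first i-1 symbols of a and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # substitution
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]

print(levenshtein("thinker", "tailor"))
# prints 4: the five operations shown above are one valid transformation,
# i.e., an upper bound; the minimal sequence uses four.

Chapter 4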

Comparing crash dumps

Section 2.6 explained the basic concepts behind a crash dump and suggested means for extracting information from it. This chapter addresses two challenging problems: 1) uniquely characterizing errors by their symptoms, and 2) matching characterizations to identify common errors, both of which are key components of our problem.

4.1 Symptoms

A symptom of an error is not a defined term. However, in this thesis, when discussing symptoms, we refer to an indication of the cause of an error. An example of this is the stack trace leading up to the SIGSEGV in Listing 2.1. Here the whole stack leading up to the error can be considered a symptom of the error.

In order to solve the problem of uniquely characterizing errors by their symptoms, we first need to establish what types of symptoms are to be considered. As mentioned in previous sections, the amount of information contained within a crash dump is very large; it is therefore infeasible and out of scope for this project to consider all possible symptom sources. According to Lee et al., errors that share the same problem often share two types of symptoms: data oriented symptoms and code oriented symptoms [1]. Data oriented symptoms can refer to local data or shared data such as function parameters and global variables. These are often non-trivial to obtain, as they are hard to generalize: the data can range from primitive integers to nested complicated structures. Code oriented symptoms concern information such as the status message at the time of termination (see Listing 2.2) and the latest sequence of function calls, namely the stack trace (see Listing 2.3).


For Listing 2.1 it was clear that the symptom was inside the main function. It is, however, possible to reach a state in a program through different paths, causing ambiguity when attempting to determine the symptom of an error. Lee et al. suggest there are two scenarios of this problem, one scenario related to code oriented symptoms and the other to data oriented symptoms [1].

Figure 4.1 illustrates a problem related to code oriented symptoms where reaching an error, or a faulty part of the code, is possible through two paths, namely func_1 or func_2 invoking the faulty_function. When analyzing two crash dumps of this program (a unique code path for each one), the code oriented symptoms would differ, as the stack traces would not be identical.

Figure 4.1: Arriving at the same error through different code paths.

Figure 4.2 illustrates the execution of a program where the functions func_1 and func_2 access shared data (illustrated with dotted lines), later resulting in an error due to corruption in the database. This is an example of why this thesis only concerns itself with code oriented symptoms. The reasoning lies in the difficulty of connecting the symptoms to the errors when the error can lie within the shared data, or within the code.

Figure 4.2: Error caused by faulty data leads to ambiguity when attempting to pinpoint symptoms.

4.2 Matching based on symptoms

Matching different symptoms can be done in three ways: complete matching, partial matching and weighted matching. In a complete matching all considered symptoms have to be identical, or else it is a mismatch. A partial matching allows symptoms to deviate a certain distance1 from each other; a paper by Modani et al. suggests a method based on edit distance for this [22]. Weighted matching allows more deviation between error symptoms by placing value on certain areas while disregarding the value of others, i.e., a heuristic approach where previous experience can determine such values. This is just a general idea of how matching can be done; the sections below cover the different approaches in detail while focusing on code oriented symptoms.
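
The three modes can be summarized as predicates (a sketch; the weighted variant shown is only one of many conceivable heuristics, and its parameters are hypothetical):

def complete_match(symptoms_a, symptoms_b):
    # All considered symptoms must be identical.
    return symptoms_a == symptoms_b

def partial_match(symptoms_a, symptoms_b, distance, max_distance):
    # Allow the symptoms to deviate a bounded distance from each other.
    return distance(symptoms_a, symptoms_b) <= max_distance

def weighted_match(symptoms_a, symptoms_b, weights, threshold):
    # Score agreement per symptom, valuing some symptoms over others.
    score = sum(w for sa, sb, w in zip(symptoms_a, symptoms_b, weights)
                if sa == sb)
    return score >= threshold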

4.3 Comparing stack traces

Being a single piece of information containing the largest part of the program state during termination, the stack trace is ideal for crash dump comparison. A testament to this is that the developers of Mozilla Firefox use stack traces alone in 67 percent of their symptom grouping efforts [2], and related papers in the field suggest a 75 to 95 percent success rate in identifying common symptoms when comparing stack traces alone [1]. Naturally, this implies a need for comparison methods that can be applied to stack traces.

4.3.1 String comparison

In its simplest form, comparing stack traces can be reduced to a trivial string comparison problem. Many related papers in the field suggest grouping methods based on stack comparisons. There are other approaches to this problem, but in the case of complete matching, we could take n stack traces and compare function names line by line. If they are fully equal we consider them the same error. This is a sane assumption, based on tests done by [24], where 130 crash dumps were diagnosed (caused by 39 unique errors). The test managed to create 28 groups of crashes that mapped to 28 unique errors. The inability to group the other 11 was caused by minor function call deviations and incomplete stack traces. Most importantly, there were zero cases of crashes caused by different errors being grouped together using this method. Zero cases of incorrect groupings is a desirable feature in an automated system.

1Quantified by the number of operations required to turn one object into the other.

A crash is referred to as misdiagnosed if it is grouped with crashes of a different error. This methodology is applied in part by !exploitable, where the function names in a stack trace are hashed and then compared [19].

Considering we have discarded data oriented symptoms as a source of information, it would counteract the idea of only comparing code oriented symptoms if they were kept for comparisons. In the study by Inhwan Lee, a similar ideology was applied, where the author decided to strip away parameters and offsets from the stack traces (see Listing 4.1 for a before and after illustration) [24]. Parameters passed into a function are rarely static, and even if disregarded, we know a set of potential errors are caused by the same chain of events. The offsets describe which specific instruction in a function was part of the chain of events that caused the crash. More specifically, starting from where the function begins in the binary, adding the offset to the first instruction leads to the potentially erroneous instruction. The latter can be considered a code oriented symptom. It does, however, possess little value, as the offset value is not static and can change depending on platform, compiler and other variables. Therefore, all comparisons in this thesis are considered without parameters and offsets.

Listing 4.1: Original stack trace from mdb (top), stripped stack trace (bottom)

libc.so.1`_lwp_kill+0x15(1, 6, 4c, fef55000, fef55000, 8077fe0)
libc.so.1`raise+0x2b(6, 0, 8042400, feea9802, 0, 0)
libc.so.1`abort+0x10e(8061774, 8061727, 80621a4, 0, 805c4f0, 0)
usage+0x1e5(0, 80617f0, 8061b91, 8055022)
zpool_do_offline+0xb2(1, 8046538, 8077780, 801, 0, 0)
main+0x131(80464ec, fef5f8a8, 8046528, 8054e8b, 2, 8046534)
_start+0x83(2, 804697c, 804698c, 0, 8046994, 80469ad)
------
libc.so.1`_lwp_kill
libc.so.1`raise
libc.so.1`abort
usage
zpool_do_offline
main
_start
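
The stripping itself is mechanical; a minimal sketch using regular expressions, assuming the mdb frame format shown in Listing 4.1:

import re

def strip_frame(frame):
    # "zpool_do_offline+0xb2(1, 8046538, ...)" -> "zpool_do_offline"
    frame = re.sub(r"\(.*\)\s*$", "", frame)         # drop parameters
    frame = re.sub(r"\+0x[0-9a-fA-F]+$", "", frame)  # drop offset
    return frame

def same_error(trace_a, trace_b):
    # Complete matching on stripped traces: fully equal traces are
    # considered to stem from the same error.
    return [strip_frame(f) for f in trace_a] == [strip_frame(f) for f in trace_b]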

4.3.2 Recursion removal

To allow more stack traces to be matched, we can start introducing allowed deviations between stack traces. This will increase the probability of misdiagnosis, but if done well, it can be kept to a minimum. One idea suggested by [22] is to remove reoccurring patterns, most commonly seen when using recursion.

The argument for doing so is that it is irrelevant to consider recursion depth, as it varies based on function input, as long as the patterns indeed are repeating. This follows the previous reasoning for discarding data oriented symptoms where parameters are stripped (see Section 4.3.1 for details). The same logic can be applied to loops, where instead of focusing on single function calls we can point to the chain of events leading up to the error as an error cause.

Preserving some information as to where the recursion started can, however, be critical to accurately locating the error symptoms. Listing 4.2 displays such a repetition of patterns where the additional repetitions do not give us additional information. Three repetitions can be removed without significant information loss, as shown in Listing 4.3.

Listing 4.2: Example stack trace with repeating patterns.

.
.
purge
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
main
_start

Listing 4.3: Ideal removal of repeating patterns from stack trace.

.
.
purge
zpool_foo
zpool_bar
zpool_baz
main
_start

Modani et al. suggest a greedy algorithm for the problem: consider two iterating pointers, one pointing at the top-most function iterating downwards, the other at the bottom-most function iterating upwards. If the function names pointed to by the two pointers are equal, then all function calls between the pointers are removed, including the bottom-most one. The process is repeated until only unique function names are left (see Algorithm 1 for an outline).

The reason we refer to the algorithm as greedy is the following (even after the successful result in Listing 4.4): consider an arbitrary stack trace such as ABCDEAB2. The proposed algorithm in [22] would reduce the stack trace to AB, which is a significant reduction of information, as the CDE functions are assumed to be irrelevant. Removal of excess information from a stack trace before comparison could lead to misdiagnosis, causing incorrect groupings to occur. The upside is that the algorithm is easy to understand and implement, and has a time complexity of O(gm^2) for stack length m and number of reappearing functions g. As it greedily removes function calls it also has the potential to reduce the amount of necessary computation during comparison. This could be a significant factor when comparing large numbers of long stack traces.

Algorithm 1 Greedy recursion removal

procedure removeRecursion(st)
    if number of duplicate function names in st = 0 then
        return st
    i ← 1                                          ▷ top of stack
    while i ≤ st.length do
        j ← st.length                              ▷ bottom of stack
        while j > i + 1 do
            if st[i] = st[j] then
                st ← concat st[1 : i] with st[j + 1 : st.length]
                return removeRecursion(st)
            j ← j − 1
        i ← i + 1
    return st
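
For concreteness, a direct transcription of Algorithm 1 into Python (0-indexed; frames are given top-most first):

def remove_recursion(st):
    # Greedy removal: when the same name appears at positions i and j,
    # drop everything between them, including position j, and restart.
    i = 0
    while i < len(st):
        j = len(st) - 1
        while j > i + 1:
            if st[i] == st[j]:
                return remove_recursion(st[:i + 1] + st[j + 1:])
            j -= 1
        i += 1
    return st

print(remove_recursion(list("ABCDEAB")))  # ['A', 'B']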

We propose a less greedy approach involving the removal of maximal repeats [25]. A maximal repeat in a sequence S is a repeated substring with occurrences s_1, s_2, ..., s_n ⊂ S such that the occurrences cannot be extended in either direction and still agree. For example, AD is one of two maximal repeats of BADBBADCDADA: even though the first two occurrences can be extended into BAD, the first and the third occurrence differ on both sides, thus a maximal repeat. The second maximal repeat is BAD ⊂ BADBBADCDADA, where neither of the occurrences can be extended and still agree.

Now consider an algorithm that operates in two steps: 1) removal of single repeating function calls in sequence, and 2) removal of the smallest maximal

2Each letter represents a function call and left-most is top-most in the stack.

repeats. The process of removing single repeating function calls simply finds repeating single sequences and removes all but one occurrence of the repeated function call. The second process, involving removal of the shortest maximal repeat pattern, is repeated until only one occurrence remains, which is left untouched. Both steps are sequentially repeated until no removals remain.

The reasoning behind using the shortest maximal repeat over the longest lies in the fact that using the longest can result in a longer stack trace. The example above was chosen specifically to illustrate this point. Assume the algorithm was applied to BADBBADCDADA with the longest repeat option: after step 1) BADBBADCDADA → BADBADCDADA, and after step 2) BADBADCDADA → BADCDADA. Now again with the shortest repeat option: 1) BADBBADCDADA → BADBADCDADA, 2) BADBADCDADA → BADBCDA. This is an arguably better representation, even more so compared against the greedy algorithm above, which reduced the previous example ABCDEAB → AB, where our algorithm would reduce it to ABCDEAB → ABCDE. This minimizes information loss, as the chain of events before the crash is preserved. Furthermore, in the latter scenario of ABCDEAB, the CDE function calls could be the chain of events leading up to the error. The trailing AB calls could have been preceded by a sequence of CDE calls as well that were lost due to stack trace corruption. Our algorithm would better minimize information loss in such a scenario and improve the odds of grouping similar errors. An analysis of the complexity of finding maximal repeats is a complicated subject and beyond the scope of this work (see [26] for details and [27] for a quasi-linear Python implementation).

An important detail: both processes are done bottom up, i.e., starting at the bottom of the stack, in order to keep the non-deleted repetition as far up in the stack as possible. Function calls closer to the error hold more value, as they are more telling about the state during the crash. Heuristics such as this one are discussed in more detail in Section 4.3.3.

Listing 4.4: Stack trace in Listing 4.2 after application of Algorithm 1

    ...
    purge
    zpool_foo
    zpool_bar
    zpool_baz
    main
    _start

4.3.3 Edit distance

Another method for allowing deviations in stack traces is to simply allow them to differ by a number of function calls. This needs to be controlled and normalized for predictable results. Additionally, the distance between a set of stack traces needs to be quantifiable for us to be able to determine the level of similarity. Edit distance was designed for this purpose: to quantify how dissimilar two strings are by counting the number of operations required to transform one string into the other. Considering the definition of edit distance that was outlined in Section 3.2.2, the problem now needs to be modified so that it becomes applicable to our case, comparing stack traces. This is trivial, as we can adjust granularity and consider each row in the stack trace a symbol. Assume two stacks S and T (rows denoted by s_i and t_j); [22] then defines the process recursively as follows:

\[
d(S_i, T_j) = \min
\begin{cases}
d(S_{i-1}, T_{j-1}) + w(s_i, t_j) & \text{(substitution)}\\
d(S_{i-1}, T_j) + w(s_i, \emptyset) & \text{(deletion)}\\
d(S_i, T_{j-1}) + w(\emptyset, t_j) & \text{(insertion)}
\end{cases}
\tag{4.1}
\]

\[
w(s_i, t_j) =
\begin{cases}
0 & \text{if } s_i = t_j\\
1 & \text{otherwise}
\end{cases}
\tag{4.2}
\]

where d is the distance between two stack traces, ∅ denotes the empty symbol, and the function w returns the distance between two stack trace rows (by definition the only possible distance values are 0 and 1).

A recursive implementation as described in Equation 4.1 is, however, infeasible for comparing large stack traces, as evaluation would take exponential time. Instead, dynamic programming can be used. Assume we need to compute the edit distance d(a, b) for a string a of length m and a string b of length n. A matrix D_{m,n} is filled so that each element D_{i,j}, where 1 ≤ i ≤ m and 1 ≤ j ≤ n, holds the minimum number of operations needed to transform a_{1..i} into b_{1..j}. Table 4.1 illustrates this process.³ It also becomes clear that the algorithm has time complexity O(|a| · |b|). Notable is also the space complexity O(min(|a|, |b|)): when computing column by column, only the previous column has to be kept in memory. The process can be done row-wise or column-wise to minimize space [23].

In Section 4.3.2 heuristics were used to tune the outcome of a comparison, specifically by keeping the top-most repeating patterns, as they contain the most recent information about the state during termination.

³The x-axis represents a = 'sleep' and the y-axis b = 'deep'.

        S   L   E   E   P
    0   1   2   3   4   5
D   1   1   2   3   4   5
E   2   2   2   2   3   4
E   3   3   3   2   2   3
P   4   4   4   3   3   2

Table 4.1: Dynamic programming algorithm for the edit distance between 'sleep' and 'deep'. The bottom-right cell holds the answer (2); the bold numbers indicate the path taken.

The idea of adding value to function calls close to the top of the stack was proposed by [21], and [2] shows it is a more accurate heuristic than simple comparison. However, the paper also shows that for small applications the heuristic is lacking, as a single function call (taken out of a small pool) can carry too much weight and cause minimal grouping, forcing us to manually inspect all crash dumps regardless of the grouping effort. In order to test this, we now introduce a heuristic system for edit distance where a deviation can carry more (or less) weight depending on how far from the top of the stack it is located. The operation cost is no longer constant; instead it is directly associated with the position of the symbol. Equation 4.3 defines such a weight function:

\[
w(s_i, i, t_j, j) =
\begin{cases}
0 & \text{if } s_i = t_j\\
\max(i, j) + 1 & \text{otherwise}
\end{cases}
\tag{4.3}
\]

where positions are indexed from the bottom of the stack, so that the top-most frame carries the largest weight. In words, the regular edit distance charges cost 1 for every operation: deletion $s_i \xrightarrow{1} \emptyset$, insertion $\emptyset \xrightarrow{1} t_j$ and substitution $s_i \xrightarrow{1} t_j$. The weighted edit distance instead charges deletion $s_i \xrightarrow{i+1} \emptyset$, insertion $\emptyset \xrightarrow{j+1} t_j$ and substitution $s_i \xrightarrow{\max(i,j)+1} t_j$. Table 4.2 shows, through the use of dynamic programming, the difference between a regular edit distance and a weighted edit distance: a trivial but telling example of how additional deviation can be allowed when it is far from the top of the stack.

Papers in the field suggest removing uninformative functions, meaning,

functions that an expert in the field has deemed irrelevant for comparisons, e.g., error handlers that differ between software versions. This idea spans the whole subject of comparing stack traces, but is particularly relevant here. Instead of having a linear cost increase based on where in the stack trace a function is, one could use fixed weights attached to operations for certain functions. In practice, however, this is infeasible for most software projects: codebases are generally large and maintained by multiple teams, so maintaining a curated list of individual function weights would be impractical.

Left table (change at the top, WILD vs. MILD):

        W   I   L   D
    0   4   7   9  10
M   4   4   7   9  10
I   7   7   4   6   7
L   9   9   6   4   5
D  10  10   7   5   4

Right table (change at the bottom, WILD vs. WILT):

        W   I   L   D
    0   4   7   9  10
W   4   0   3   5   6
I   7   3   0   2   3
L   9   5   2   0   1
T  10   6   3   1   1

Table 4.2: Dynamic programming algorithm for weighted edit distance illustrated. The illustration shows the difference in cost when only one letter is changed, but at different locations. The table to the left holds a change at the top and the table to the right holds a change at the bottom.

The advantage of a weighted edit distance lies in the ability to tune the allowed deviation between stack traces more granularly. With the added benefit also comes a downside: all types of allowed deviations and heuristics introduce a risk of misdiagnosis. The cost of misdiagnosis is highly relevant when choosing a matching method and is covered in the summary of this chapter.
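To make the cost scheme concrete, here is a dynamic-programming sketch of the weighted edit distance. The function name and list representation are our own; the cost helpers encode Equation 4.3 under the bottom-up indexing noted above (a frame i steps from the top of an m-frame trace costs m − i), and replacing both helpers with a constant 1 recovers the plain edit distance of Equations 4.1 and 4.2.

    def weighted_edit_distance(s, t):
        """Positionally weighted edit distance; s and t are lists of
        function names with index 0 = top of the stack, so a deviation
        at the very top costs len(trace) and one at the bottom costs 1."""
        m, n = len(s), len(t)
        del_cost = lambda i: m - i               # cost of deleting s[i]
        ins_cost = lambda j: n - j               # cost of inserting t[j]
        # D[i][j] = minimum cost of transforming s[:i] into t[:j]
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + del_cost(i - 1)
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins_cost(j - 1)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = (0 if s[i - 1] == t[j - 1]
                       else max(del_cost(i - 1), ins_cost(j - 1)))
                D[i][j] = min(D[i - 1][j - 1] + sub,          # substitution
                              D[i - 1][j] + del_cost(i - 1),  # deletion
                              D[i][j - 1] + ins_cost(j - 1))  # insertion
        return D[m][n]

    print(weighted_edit_distance(list("WILD"), list("MILD")))  # 4: change at the top
    print(weighted_edit_distance(list("WILD"), list("WILT")))  # 1: change at the bottom

The two calls reproduce the bottom-right cells of the tables in Table 4.2.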

4.3.4 Prefix matching

Another suggested method is a comparison of the longest common substring starting from the top of the stack, i.e., the longest common prefix. This enforces a strict rule of no allowed deviations close to the top of the stack and was shown by [21] to be an effective method for grouping crash dumps. In order to quantify the closeness of two stacks S and T with row index i, we measure the size of the longest shared prefix, which can be defined formally

as follows:

\[
\text{prefix}(S, T) = \max \{\, i \in \mathbb{N} \mid S_k = T_k \ \text{for all } k \le i \,\}
\tag{4.4}
\]

One advantage prefix matching has over edit distance is time complexity. The longest common prefix can be found in linear time using suffix arrays [28]; when dealing with large numbers of stack traces, or where performance is critical, the prefix matching method could be considered.
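A minimal sketch of prefix matching under the normalization convention of Section 4.3.5 (the function names are ours, not an established API):

    def longest_common_prefix(s, t):
        """Number of rows shared from the top of both stacks (index 0 = top)."""
        i = 0
        while i < min(len(s), len(t)) and s[i] == t[i]:
            i += 1
        return i

    def prefix_match(s, t, threshold):
        """Match if the shared prefix covers at least `threshold` (in [0, 1])
        of the longer trace."""
        return longest_common_prefix(s, t) / max(len(s), len(t)) >= threshold

The linear scan above is already linear per pair of traces; the suffix-array construction of [28] pays off when a large collection of traces is compared repeatedly.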

4.3.5 Distance normalization

Edit distance and prefix matching share the attribute that a calculated distance can never be larger than the length of the longest stack trace. Our weighted edit distance algorithm, however, can yield a distance greater than that length, because the weight of a difference is directly correlated with its position in the stack. To exemplify: consider two stack traces S and T of equal length; if there is a difference in the top two function calls, the distance between them would be len(S) + (len(S) − 1), which is strictly larger than the length of S. The relevance of this lies in the common definition of distance function normalization: the distance between the objects divided by the maximal potential distance between them. For edit distance and prefix matching, scaling would by this definition look as follows:

\[
d'(S,T) = \frac{d(S,T)}{\max(\mathrm{len}(S), \mathrm{len}(T))}
\tag{4.5}
\]

In order to scale the weighted edit distance we need to define the maximal distance that two stack traces can be from each other. As mentioned previously, the weight of an operation is directly correlated with its position in the stack trace, so the maximum distance between two stack traces is the sum of the indices in the longer of the two. We define it as follows:

\[
d'(S,T) = \frac{d(S,T)}{\max\left(\sum_{n=1}^{\mathrm{len}(S)} n,\ \sum_{m=1}^{\mathrm{len}(T)} m\right)}
\tag{4.6}
\]
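Both scalings have simple closed forms; the triangular sum in Equation 4.6 equals len · (len + 1)/2, which is the cost of deleting every frame of the longer trace under the positional weights. A sketch, reusing the hypothetical distance functions above:

    def normalized_distance(d, s, t, weighted=False):
        """Scale a distance d(s, t) into [0, 1] per Equations 4.5 and 4.6."""
        if weighted:
            max_d = max(len(s) * (len(s) + 1) // 2,
                        len(t) * (len(t) + 1) // 2)    # Eq. 4.6
        else:
            max_d = max(len(s), len(t))                # Eq. 4.5
        return d(s, t) / max_d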

4.4 Machine learning

Within the category of classification algorithms there are a couple of methods particularly suited for classifying data where distance functions are employed, namely nearest neighbor algorithms and support vector machines. Nearest neighbor algorithms are generally intuitive and easy to apply: a new observation is compared to all neighbors, or to its nearest neighbors based on some criterion. The downside of this approach is that it can be slow, particularly for large data sets, as it scales linearly with the number of data points [14]. Support vector machines, or SVMs, are less intuitive and more complex. Instead of directly comparing an observation with previous ones to make a classification, an SVM attempts to find a vector, or hyperplane, that divides the data in a suitable way; where a data point lies relative to the hyperplane determines its classification. Given that we have established methods for comparing individual crash dumps using distance functions, it follows that either a nearest neighbor type of algorithm or a support vector machine is suitable. There is no established criterion for when one of the algorithms is better than the other, so we opt for the simpler concept of nearest neighbor algorithms [21]. Determining which learning algorithm is better suited for this particular use case is not in the scope of this thesis.

4.4.1 Nearest neighbor learning algorithm

When a new crash is encountered it needs to be compared with existing crashes to determine if it resembles already categorized ones. The assumption is that if the resemblance is within some threshold, the compared crashes are caused by the same bug. Comparing a new crash to every other known crash can be computationally expensive, as the number of stored crashes can be large. To mitigate this problem one could, instead of comparing a new crash to all existing crashes, only compare it against a particular reference, or signature, crash for each bug or existing group. The new crash would then only have to be compared to all the signature crashes in order to make a classification. A learning algorithm can be used to determine which crash in a group is the optimal reference crash, from a training data set with known bugs. In our case it would be the crash in a group with either the lowest average distance to the other crashes in the group, or the crash with, on average, the longest common

prefix with every other crash in the group. The algorithm is fed crashes grouped by known problem, compares them pairwise, picks the one with the lowest average distance and promotes it to signature crash. Whenever new crashes are added to an existing group, the group is re-evaluated, either in real time (online learning) or in a scheduled batch (offline learning), and a new crash is potentially promoted [21]. Handling cases where a crash does not match a single existing group can be done in various ways. Our suggested approach is to create a new group containing the single crash and automatically promote it to signature crash. When further crashes are placed in the group, it can be manually classified by a human to verify that it is indeed a new bug; or, if it is a duplicate, the group can be merged with an existing group (with a known bug) and the whole set re-evaluated to potentially better categorize future crashes. Further details on the algorithm are covered in the evaluation section. A sketch of both steps is given below.
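A minimal sketch of both steps, assuming any of the normalized distance functions sketched earlier (the names and signatures are ours):

    def pick_signature(group, distance):
        """Promote the crash with the lowest total (equivalently, average)
        distance to the rest of its group to signature crash."""
        if len(group) == 1:
            return group[0]
        return min(group, key=lambda c: sum(distance(c, o)
                                            for o in group if o is not c))

    def classify(crash, signatures, distance, threshold):
        """Return the best-matching signature within `threshold`, or None
        to signal that a new single-crash group should be created."""
        best = min(signatures, key=lambda s: distance(crash, s))
        return best if distance(crash, best) <= threshold else None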

4.5 Summary

In the prelude to this chapter we brought up two problems that were to be addressed: 1) uniquely characterizing errors by their symptoms and 2) matching characterizations to identify common errors. For the first problem we determined that characterization can be done by analyzing data-oriented symptoms and code-oriented symptoms. Data-oriented symptoms were dismissed due to complexity and ambiguity, while for the code-oriented symptoms we focused on the stack trace: same stack trace, same characterization. For the second problem we tried to identify common errors among the different unique characterizations by attempting to group them based on their stack trace. This was done using a number of matching methods suggested by various papers in the field, and our own variations of them, specifically:

• Complete matching, a matching method where stack traces have to fully match (when offsets and parameters are stripped).

• Recursion removal, a heuristic that allows deviation in the form of different recursion depths.

• Edit distance, a heuristic that allows variation by permitting a limited number of function calls to deviate.

• Prefix matching, a matching method which compares the longest common substring starting from the top of the stack.

• A machine learning approach to enhance the performance of the distance based methods.

There are advantages and disadvantages to the different methods, most of which revolve around the tradeoff between the risk and cost of misdiagnosis versus the cost of added manual inspection. The cost of misdiagnosis is subjective and has to be evaluated by each individual institution. For the evaluation we propose a simple set of guidelines: 1) If the number of expected crashes is low, and the risk of unique crashes is high, the cost of misdiagnosis is high. The idea behind this is that an error could be erroneously marked as fixed and remain in the system for a long period of time. 2) If the number of expected crashes is high (usually in systems with a large user base) where errors are hit frequently, the cost of misdiagnosis is low. An example of such a system is Mozilla Firefox, which receives 2.5 million crash reports every day; if an error is misdiagnosed, it is likely to be encountered again and handled differently [29]. Considering the guidelines, an institution agreeing with guideline 1) should avoid heuristic methods and minimize the risk of misdiagnosis, while an institution agreeing with guideline 2) could consider introducing heuristics, with the benefit of reducing the amount of manual diagnosis. Overall, the pre-study and our own studies suggest that the efficiency of using allowed deviations is limited to organizations where the crash frequency is high. This is because in real-world applications an undetected misdiagnosis has a potentially severe cost attached to it, for example a critical bug shipped to customers because it was incorrectly marked as fixed. The evaluation chapter below investigates the efficiency claims further.

4.5.1 Risk discussion

All methods but complete matching introduce a certain amount of risk. As tested in [24], complete matching (parameters and offsets removed) was able to group 130 crash dumps (caused by 39 unique errors) into 28 groups mapping to 28 unique errors. A possible reason why the stack trace comparison match rate is not higher in many papers is that stack traces from different operating systems are tested and considered equal. However, it has been shown that the same error can look different depending on which operating system is used; even different versions of the same operating system have been shown to produce different stack traces for the exact same error [2][30]. We

conducted a brief study to verify this statement by creating a set of simple erroneous programs and executing them on five major unix-like operating systems, among them FreeBSD, NetBSD, Mac OS X, and illumos (Solaris). The result of the study showed that the statement held true: the stack traces were different on all of the tested systems. Noteworthy is that in all cases the difference lay in the error handling, meaning, the functions called after the erroneous function. This study showed that it is not advisable to group multi-platform crash dumps using stack traces without sufficient knowledge about the underlying error handling on the different platforms, which is in line with the finding made by Kompotis et al. [30]

Chapter 5

Evaluation and results

This chapter is divided into two parts, evaluation method and experimental results. Its aim is to show that the amount of manual diagnosis can be reliably reduced by introducing automated grouping and comparison of crash dumps. The evaluation phase designs a framework for determining optimal heuristic thresholds, and benchmarks the methods against each other.

5.1 Evaluation overview

For the evaluation phase we tracked two key metrics, precision and grouping ability. Precision is defined as the percentage of crashes correctly predicted as belonging together, i.e., not misdiagnosed as previously defined. Grouping ability, similar to recall, is defined as how often the stack comparison methods managed to group similar crashes. Using these key metrics we strove to gain the following insights: firstly, determining the optimal equality thresholds for the edit distance (ED), weighted edit distance (WED) and prefix matching (PM) algorithms in the context of comparing stack traces; secondly, determining how the four methods of complete matching (CM), ED, WED and PM compare against each other; and lastly, determining the effects of applying recursion removal strategies. When evaluating or applying distance functions across heterogeneous data sets it is often required to normalize or scale the distances to a fixed range, such as our choice of [0, 1]. This is because object distance can be heavily influenced by certain attributes, in our case the length of the stack trace. Only in homogeneous data sets is it feasible to directly compare distances between objects. For the evaluation of ED, WED and PM we normalized the thresholds and defined them as a percentage of allowed deviation, in accordance with the


previous distance normalization definition.

5.1.1 Test data

For a realistic evaluation to take place, real-world stack traces from three proprietary server applications were used. All three applications are within the online stockbroking sector, a sector where low latency, high throughput and rigid service-level agreements¹ are the norm.

Application details

The first application is a stock market price intake, primarily concerned with high-throughput data consumption. It communicates with a number of stock markets, from which it receives up to a hundred thousand messages per second. The application consists of approximately 200 000 lines of code and will be referred to as application Medium. The second application is similar to the previous one, except it consumes news data from various news sources. News data is significantly larger in size but less frequent. This application consists of approximately 50 000 lines of code and will be referred to as application Small. Lastly, the third application is concerned with receiving customer orders and forwarding them to the stock markets. Each order needs to be validated according to business rules before being sent. It is the most complex application of the three, as it contains large amounts of business logic in addition to handling tens of thousands of requests per second in a transactional manner, e.g., making sure customers cannot double-spend². The codebase is also the largest, consisting of approximately 500 000 lines of code. It will be referred to as application Large. The three applications share source code via libraries, in particular code for error handling, logging, data persistence and a crash reporting plugin. Uptime is ensured by having multiple instances of the applications executing in parallel with distributed workloads. If an instance crashes, the others automatically take on its workload until it has been restarted. Once an application has crashed, the crash reporting plugin automatically creates a new issue in a bug tracking system, where the crash dump is uploaded. Due to the importance of

¹Service-level agreements are commitments between clients and a service provider; common aspects include availability, quality and responsibilities.
²Double-spending is a potential flaw within digital currency systems in which a single token can be spent more than once.

said applications, all crash reports are manually inspected, deduplicated and grouped in a best effort manner.

Data collection

The data was collected by manually picking a random set of 100 patched bugs that had 1461 crash dumps associated with them via bug reports in a bug tracking system. All bugs and crash reports are related to the three applications. The criterion for selection was that the bug reports needed to contain one or more crash dumps, and that they had been manually inspected. By knowing which crash dumps map to which bugs, we could determine how well the algorithm grouped the crashes, and whether any crashes were grouped incorrectly. Out of the 1461 crash dumps, 422 show signs of containing recursion. The number of crashes per application and additional details can be seen in Table 5.1.

Application   Lines of code   Number of bugs   Number of crash dumps   Crash dumps with recursion
Small                50 000               25                     366                          107
Medium              200 000               35                     514                          148
Large               500 000               40                     581                          167
Total               750 000              100                    1461                          422

Table 5.1: Number of bugs, crashes and recursion presence broken down per application.

5.1.2 Evaluation method for determining precision

Stack trace comparison method precision was determined through pairwise comparison of all crashes. Each binary prediction can have four outcomes: (1) True positive, both stacks are predicted equal and are truly equal; (2) False positive, both stacks are predicted equal but are not equal; (3) True negative, both stacks are predicted not equal and are truly not equal; (4) False negative, both stacks are predicted not equal but are truly equal. We calculate a commonly used performance metric for classifiers, namely precision, formally defined as follows:

\[
\text{precision} = \frac{TP}{TP + FP},
\tag{5.1}
\]

where TP = true positives and FP = false positives [15]. The pairwise comparison was done with and without recursion removal, and with varying thresholds for edit distance and prefix matching. The result of this should point us to the ideal thresholds for this data set and give us insight into the efficiency of allowing deviations.
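As a concrete reading of Equation 5.1, the pairwise precision computation can be sketched as follows (the two predicates are hypothetical placeholders for a matching method and the ground truth):

    from itertools import combinations

    def pairwise_precision(crashes, predicted_same, truly_same):
        """Precision over all crash pairs: TP / (TP + FP)."""
        tp = fp = 0
        for a, b in combinations(crashes, 2):
            if predicted_same(a, b):          # the method claims a shared bug
                if truly_same(a, b):
                    tp += 1
                else:
                    fp += 1
        return tp / (tp + fp) if tp + fp else 1.0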

5.1.3 Evaluation method for determining grouping ability

To determine the grouping ability of the different matching methods we propose an evaluation method which resembles the suggested machine learning algorithm and has some similarities with the standard definition for calculating recall. For each bug a reference crash dump is randomly selected. Each unique bug is represented by a bucket, which is pre-filled with its corresponding reference crash dump. We then attempt to place each remaining crash dump in each bucket: if its stack trace matches that of the reference crash dump in the bucket, it is considered a match (not necessarily a correct one, due to imprecision in the matching algorithms). Four matching scenarios are possible: (1) a single match in the correct bucket; (2) a single match in the wrong bucket, a misdiagnosis; (3) matches in multiple buckets, also a misdiagnosis; (4) no matches, a false negative. A single match, illustrated in Figure 5.1, means a crash dump is only placed in a single bucket. Note that this does not imply a correct match, as it could have been misplaced, i.e., a misdiagnosis. Multiple matches, illustrated in Figure 5.2, imply a misdiagnosis, as the crash dump is matched against two or more unique bugs. In the scenario where a crash dump does not match any bucket, a new bucket is created, as illustrated in Figure 5.3. We did not count such cases as misdiagnosed, since a misdiagnosis implies a greater cost than an added bucket, which only increases the amount of manual diagnosis. The similarity between the standard definition of recall and our evaluation method is best illustrated by observing the definition of recall in Equation 2.2. Our evaluation produces a number of groups while recall produces a fraction; however, the two numbers increase and decrease under similar conditions. Specifically, the recall value would decrease when the number of buckets would increase, and vice versa. As shown in Equation 2.2, the recall value is primarily affected by the number of False Negatives, which in our case is matching scenario number four: a crash dump not being placed

in an existing bucket. A high number of False Negatives relative to the True Positives would yield a low recall, while a high number of False Negatives in our evaluation would result in many additional buckets. If recall is described as a fraction detailing the number of relevant crash groups returned over the total number of actual crash groups, the similarity between the evaluation approaches can be observed.

Figure 5.1: Case A: Single match. If placed in the wrong bucket, a misdiagnosis; otherwise a correct match.

In practice, combining the case of a "single match" and the case of a "multiple match" could result in a crash dump being placed in two or more buckets without being a misdiagnosis: this happens if a crash dump is placed both in its correct bucket and in a bucket that was previously created due to a zero match. The algorithm is then accurate, but not precise. This scenario can be mitigated using the machine learning approach and re-training the model. For the actual testing we applied the evaluation method to the proposed matching algorithms by attempting to correctly place the crash dumps into their corresponding buckets. Once all the crash dumps have been placed, the number of buckets with one or more crash dumps is counted. This entire process was repeated 5000 times, with randomly selected reference crash dumps in each iteration, and the results were averaged; the average number of buckets is the final result. The process was done with and without recursion removal, and with varying thresholds for edit distance and prefix matching. The result should give an idea of how often new buckets have to be created due to grouping inability, and the consequences this might have on the amount of necessary manual diagnosis. A sketch of a single trial is given below.
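One randomized trial of the bucket evaluation could be sketched as follows, assuming a dictionary from each known bug to its crash dumps and a boolean match predicate (both are our own framing of the procedure above):

    import random

    def grouping_trial(crashes_by_bug, match):
        """Place every non-reference crash against randomly chosen
        reference crashes and return the resulting bucket count."""
        buckets, rest = [], []
        for dumps in crashes_by_bug.values():
            ref = random.choice(dumps)
            buckets.append([ref])               # one pre-filled bucket per bug
            rest.extend(c for c in dumps if c is not ref)
        for crash in rest:
            hits = [b for b in buckets if match(crash, b[0])]
            if hits:
                for b in hits:                  # several hits = misdiagnosis
                    b.append(crash)
            else:
                buckets.append([crash])         # zero matches: new bucket
        return len(buckets)

    # Averaged over many randomized trials, as in the evaluation:
    # sum(grouping_trial(data, match) for _ in range(5000)) / 5000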

Figure 5.2: Case B: Multiple matches. If placed in multiple buckets, a misdiagnosis.

Figure 5.3: Case C: Zero matches. A new bucket is created. Not considered a misdiagnosis, but would in practice increase the amount of manual diagnosis.

5.1.4 Evaluation method for machine learning precision

The evaluation method and learning approach are illustrated in Figure 5.4. First we needed to extract the stack traces from the crashes, strip irrelevant information and apply our variation of the recursion removal strategy. For the heuristic thresholds we selected the best performing value in terms of precision for each of the edit distance based methods. Prefix matching and complete matching were omitted from this evaluation as the concept of a signature crash is not applicable to them: all crashes within a group are equally comparable for a fixed threshold. The data set was randomly divided into a training set and a verification set. The algorithm was then trained on the training set, which results in a new set of signature crashes. Once the training was finished we started the testing by comparing each crash in the verification set to all the signature crashes in order to find the best scoring match, referred to as the algorithm's prediction. The predictions are then compared with the known causes to determine the overall precision of the predictions, which again is defined as the percentage of crashes correctly predicted as belonging together. To better determine the learning algorithm's performance, the process was repeated 5000 times and the results averaged. How the data is divided between the training set and the verification set can be varied. With a smaller training set the algorithm is likely to perform worse and be too coarse. With a larger training set the algorithm might perform better, but since the verification set gets smaller the outcome gets less trustworthy, and the risk of overfitting increases. Our approach was to divide the sets in various partitions ranging from 10 percent to 80 percent, i.e., for each group of known crashes, a percentage of the crashes was set aside for training, and the rest for verification. The result will tell us generally how accurate a machine learning approach is for this data set, and how to ideally partition the training set and the verification set. A sketch of one train-and-verify round is given below.
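The sketch reuses the hypothetical pick_signature helper from Section 4.4.1; the split logic and names are assumptions of ours, not the thesis's exact implementation.

    import random

    def ml_trial(crashes_by_bug, distance, train_frac=0.4):
        """Learn one signature crash per bug from a training split, then
        measure how often the best-scoring signature predicts the right
        bug on the verification split."""
        signatures, verification = {}, []
        for bug, dumps in crashes_by_bug.items():
            shuffled = random.sample(dumps, len(dumps))
            cut = max(1, int(len(shuffled) * train_frac))
            signatures[bug] = pick_signature(shuffled[:cut], distance)
            verification.extend((bug, c) for c in shuffled[cut:])
        correct = sum(
            bug == min(signatures, key=lambda b: distance(crash, signatures[b]))
            for bug, crash in verification)
        return correct / len(verification) if verification else 1.0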

5.2 Evaluation results

The results of the evaluation are presented in graphs; each of the four methods is evaluated on precision and grouping ability. For the edit-based methods, precision in combination with learning is also presented. What we hope to observe is that it is possible to group crashes with sufficient precision for an automated system to be trusted enough to allow for a reduction in necessary manual diagnosis.

[Flowchart: crashes with known cause pass through post-processing into processed stack traces, which are split into a training set and a verification set; signature crashes are determined from the training set, and each verification crash is matched against the signature crashes, comparing the prediction with the key.]

Figure 5.4: Machine learning methodology and evaluation.

An evaluation of three similar applications within a single sector does not provide enough data to establish statistical significance between the methods, and the results may, therefore, not be generalizable beyond the observations for this data set. However, the results may show a promising trend which future work could corroborate.

5.2.1 Precision evaluation results

The precision evaluation showed promising results, in particular for the methods allowing deviation. The highest combination of distance threshold and precision was achieved through the use of PM, which successfully matched up to 80 percent of the stack traces correctly with an equality threshold of 30 percent; results which are in line with other papers in the field [22].

The graphs illustrate how well precision is maintained while the distance threshold is increased. Precision is measured for four comparison methods: edit distance, ED; weighted edit distance, WED; prefix matching, PM; and complete matching, CM. Each method is also evaluated with three methods for recursion removal: no recursion removal, NORR; a greedy recursion removal, RR; and our approach involving removal of maximal repeats, denoted RR+.

[Three-panel plot: precision in % (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.5: Precision of edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Looking at the graphs individually and in more detail, we can gain further insight into the outcomes. In Figure 5.5 ED is evaluated. We can see that with a distance threshold of zero percent the precision is high, which is a common pattern for all the methods. The reason is that with zero deviation the matching is very strict and we only identify identical stack traces as being part of the same bug. This has, however, a significant impact on grouping ability, as the test data contains non-equal stack traces belonging to the same bug bucket; the effects of this are highlighted in the next section. Once we start introducing allowed deviation, the precision declines. For the edit distance the precision declines rapidly up until a relative distance threshold of 0.05, where the decline rate decreases. A reminder here is that the edit distance does not discriminate between differences at the top of the stack and at the bottom. When the relative distance threshold is increased beyond 0.05 the precision is further reduced, until eventually all stack traces are considered equal.

Another aspect is the recursion removal: when applying it we introduce allowed deviation by allowing recursion depth to differ. We can see that RR+, our variation of the recursion removal, performs generally better than no recursion removal, NORR, and strictly better than the recursion removal suggested by [22], RR. It is reasonable to assume that the greedier recursion removal removes an excessive amount of information, which in turn reduces precision for our data set, while our variation introduces enough deviation without being detrimental to the overall precision. Overall, the recursion removal had limited impact on the precision, within ±5 percent. The explanation could lie in the test data: recursions in the crashes could have been of consistent depth, in which case recursion removal has no benefits. However, given that recursion removal was not significantly detrimental to the precision, it could still be worth evaluating for a solution.

In Figure 5.6 the precision result for WED is shown. It performs better than ED, which is in line with the assumption that "guided" deviation could perform better than naive deviation. We can observe the difference by comparing where the precision decline rate decreases: for WED the rapid precision decline stops at a relative distance threshold of 0.1, while for ED it stops at 0.05. This difference is explained by WED allowing for potentially large deviations at the bottom of a stack trace, while allowing little to no deviation at the top. To justify this assumption, consider comparing two stack traces of length 15. If we modify the bottom two function calls, ED gives a deviation of $2/15 \approx 0.13$, ca. 13 percent, whereas WED gives $\frac{(15-14)+(15-13)}{\sum_{n=1}^{15} n} = \frac{3}{120} = 0.025$, a 2.5 percent deviation. If the same difference were among the top function calls, WED would give up to 25 percent deviation. WED could potentially be further tuned by having a non-linear weight increase, which could be useful in scenarios where large stack traces are compared, or any other variation. Similarly to ED, WED drops off linearly up until a point where all stack traces are considered equal.

The prefix matching outcome is shown in Figure 5.7, showing consistent performance past the equality threshold of 20 percent. It has a precision of up to 80 percent at an equality threshold of 30 percent, a high number that is in line with the evaluation of the method in [22]. The graph has a different characteristic than the edit distance based algorithms.

[Three-panel plot: precision in % (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.6: Precision of weighted edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Precision does not decrease past a certain point, because prefix matching becomes stricter the larger the equality threshold is. A significant upside of PM over ED and WED is that it is not as reliant on potentially volatile thresholds for where the precision decline slows down: the equality threshold can be set higher than where the peak values are, to be conservative and avoid the low-precision region, an option not available to ED and WED. An upside of the ED and WED algorithms is that they can be incorporated in a machine learning setup, which is covered in a later section.

Complete matching, as shown in Figure 5.8, is also largely covered by the other graphs by simply looking at the accuracies when thresholds are set for zero deviation. The interesting observation to be made here is that for such a simplistic measure as full equality, it performs well. For an institution where a misdiagnosis is never or rarely allowed to happen, this could be an ideal choice of method.

[Three-panel plot: precision in % (y-axis) against prefix equality threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.7: Precision of prefix matching with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

5.2.2 Grouping evaluation results

The results of the grouping evaluation show a correlation between precision and grouping ability. Our test data contains 100 bugs distributed across three applications; the ideal result is then 100 buckets in total at the point where the precision for an algorithm is as high as possible. The graphs in Figures 5.9-5.12 illustrate how many groups are created as the distance threshold is increased.

[Bar chart: precision in % of complete matching for RR, RR+ and NORR on applications Small, Medium and Large.]

Figure 5.8: Precision of complete matching.

The ideal point is marked with a dotted line along the y-axis; it is where the number of buckets is at its lowest and the precision at its highest. The number of unique bugs for an application is denoted by a dotted line along the x-axis. Grouping ability is measured for four comparison methods: edit distance, ED; weighted edit distance, WED; prefix matching, PM; and complete matching, CM. Each method is again evaluated with three methods for recursion removal: no recursion removal, NORR; a greedy recursion removal, RR; and our approach involving removal of maximal repeats, denoted RR+.

The correlation is not entirely surprising, but the results do serve to show that the scenario described in Section 5.1.3, where an excessive number of new buckets is created, is common when the distance threshold is set low. The peak grouping performance was achieved by the WED algorithm, which produced 110 buckets in total at a precision of 75 percent.

Despite the correlation between precision and grouping ability, it is still worthwhile going over each grouping result in more detail. For ED, in Figure 5.9, when deviation is set to a minimum the number of buckets is high and we clearly see the escalating effect of many new buckets being created, as described in Section 5.1.3. Once deviation is introduced, the number of buckets quickly declines and hovers near the original number of buckets. At the ideal point ED produced a total of 163 buckets. Once the relative distance threshold goes beyond 0.6, the number of buckets approaches exactly 100, which is explained by comparisons getting too inaccurate and crashes belonging to different bugs getting bundled together.

With similar characteristics to ED, WED's results are shown in Figure 5.10; it produced the number of buckets closest to 100 at peak precision, specifically 110 buckets.

[Three-panel plot: number of buckets (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.9: Grouping ability of edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

[Three-panel plot: number of buckets (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.10: Grouping ability of weighted edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Given the small variation in the number of groups at the ideal point between the methods, it is hard to attribute this performance to anything specific in WED, other than the same arguments made for WED's precision over ED. The PM results, again, show a different characteristic than the distance based methods. When the prefix equality threshold is close to zero the number of buckets is close to 100, which is attributed to the impreciseness of the comparisons. It produces 151 buckets at its ideal point, similar to the ED result. Complete matching, shown in Figure 5.12, is, similarly to before, covered by the other graphs by looking at the grouping ability when thresholds are set for zero deviation.

[Three-panel plot: number of buckets (y-axis) against prefix equality threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.11: Grouping ability of prefix matching with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Full equality produces 660 buckets, which is more than the other methods, but given the low likelihood of misdiagnosis it performs well, potentially reducing the amount of manual diagnosis by up to 60 percent. The results also show that deviating from the ideal point by 10 percent can reduce the number of groups by up to 20 percent, further decreasing the amount of necessary manual diagnosis. However, as this introduces a higher likelihood of misdiagnosis, since precision is lower, it is a strategy fitted for scenarios where a misdiagnosed crash dump is likely to be reported again.

[Bar chart: number of buckets produced by complete matching for RR, RR+ and NORR on applications Small, Medium and Large.]

Figure 5.12: Grouping ability of complete matching.

5.2.3 Machine learning evaluation results

The results of the machine learning evaluation are shown in Figure 5.13. As previously mentioned, only ED and WED were suited for learning. This is because determining an optimal signature crash is only applicable if one crash can be considered generally closest to all the other crashes in a bucket, with the goal of increasing the likelihood of further matches. For prefix matching and complete matching, determining closeness is irrelevant, as the stack traces have to be equal from the top of the stack down to some point; changing a signature crash has no effect on the likelihood of matching. With 10 percent of the data used for training, the learned algorithm performs only as well as an untrained ED and WED implementation. However, when the percentage is increased to 40 percent and beyond we see an improvement, with accuracies up to 90 percent, outperforming PM by 10 percentage points. Individual accuracies increased by up to 20 percent for both WED and ED. Past 50 percent of the data being used for training, no further precision increase is achieved; instead precision starts to decrease, which could be attributed to overfitting. The precision results and increases are in line with a similar experiment done by [21], where a different matching method was used.

[Three-panel plot: precision in % (y-axis) against training data distribution in % (x-axis); curves for ED and WED; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.13: Precision of the machine learning algorithm with a varying training-to-verification data ratio. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Chapter 6

Conclusion

In this thesis we evaluated methods for comparing crash dumps. We compared crash dumps based on their symptoms, with the purpose of minimizing re-diagnosis of already known errors. We approached the problem by investigating related work and extracting suggested methodology, with the goal of comparing the methods against certain criteria. As we progressed in our research it became clear that there is no single best method, but that certain methods are better suited for certain circumstances. In order to evaluate this and categorize the methods, we consider the concept of misdiagnosis cost, which allows us to make educated decisions about when to focus on grouping ability by choosing an approach with a higher likelihood of misdiagnosis, or when to choose an approach which lacks grouping ability but has a low risk of misdiagnosis. For this we introduced a set of encapsulating guidelines which help determine when the misdiagnosis cost is high or low. We reiterate:

1) If the number of expected crashes is low, and the risk of unique crashes is high, the cost of misdiagnosis is high as an error could be erroneously marked as fixed and remain in the system for a long period.

2) If the number of expected crashes is high (usually in systems with large user bases) where errors are hit frequently, the cost of misdiagnosis is low.

Analyzing the results it is clear that, while being the least likely to misdiagnose, complete matching is lacking in grouping ability. This is due to the strict nature of complete matching, where stack traces have to be identical, making it ideal for cases where the misdiagnosis cost is high. The other methods performed better in terms of grouping ability, where they could reduce


the amount of manual diagnosis by up to 90 percent with a precision of up to 90 percent. Our solution for a weighted edit distance showed an increase in precision and grouping ability over the non-weighted edit distance, which further solidifies the theory of giving function calls higher up in the stack additional value. Prefix matching was the top performer in precision when not taking the machine-learning-enhanced edit distance and weighted edit distance into consideration. With machine learning, accuracies as high as 90 percent were achieved; such a high precision might even allow institutions with a high misdiagnosis cost to consider this approach. A limitation of the distance based approaches is that the precision is only high at explicit distance thresholds, and they are, therefore, likely best suited for monolithic applications with a homogeneous code base and error handling. Given this detail, prefix matching stands out as the most robust solution, as it provides high precision, high grouping ability and has low complexity.

In addition to comparing the matching methods, we attempted to normalize the individual stack traces by removing repeating recursive calls before comparison. We did this using two removal methods: one suggested by Modani et al., and one designed by us, denoted RR+ in the graphs. In general we noticed a 5 percent increase in precision and grouping ability using our variation of recursion removal, while the approach suggested by Modani et al. decreased the same values by up to 5 percent [22]. However, in the experiments conducted by Modani et al., an increase in precision was observed, a variation potentially caused by differences in test data. Overall we suggest that the limited impact of recursion removal could be attributed to the crashes having a consistent recursion depth in our data set, as other papers have observed higher gains.

Chapter 7

Future work

Using automated systems to reduce the amount of manual diagnosis looks promising, and we have shown that one can accurately reduce the amount of required re-diagnosis by up to 90 percent. Going forward there are still several areas that require further investigation and analysis. In our work we omitted data-oriented symptoms, which could further increase the accuracy and reliability of automated systems. We also acknowledge that our evaluation was limited to three standalone programs; before the results can be generalized, a wider range of programs needs to be evaluated. It is also common for multiple programs, across multiple operating systems, to interact, and determining the cause of such an error might require information retrieval from multiple sources. Closely related to this is analyzing errors caused by hardware faults and how they could be distinguished from software errors. Furthermore, our work is limited to a linearly weighted edit distance algorithm; this study could be extended by investigating non-linear variations.

Our work provides an overview of existing methods. Given the problem of creating an accurate system for grouping crash dumps, we have shown that under the right circumstances it is possible, in turn allowing for a reduction in the necessary amount of manual diagnosis.

Bibliography

[1] Lee et al. "Identifying Software Problems Using Symptoms." In: FTCS. IEEE Computer Society, 1994, pp. 320–329. isbn: 0-8186-5520-8. url: http://dblp.uni-trier.de/db/conf/ftcs/ftcs94.html#LeeIM94.
[2] Wei Le and Daniel Krutz. How to Group Crashes Effectively: Comparing Manually and Automatically Grouped Crash Dumps. Tech. rep. Rochester Institute of Technology, 2012.
[3] About illumos. [Online; accessed 26-08-2014]. url: http://wiki.illumos.org/display/illumos/About+illumos.
[4] The Security Development LifeCycle. [Online; accessed 28-01-2019]. url: https://social.technet.microsoft.com/wiki/contents/articles/7100.the-security-development-lifecycle.aspx.
[5] Håkon Krohn-Hansen. Program Crash Analysis: Evaluation and Application of Current Methods. Tech. rep. University of Oslo, Department of Informatics, 2012.
[6] Patrice Godefroid, Michael Y. Levin, David A. Molnar, et al. "Automated whitebox fuzz testing." In: NDSS. Vol. 8. 2008, pp. 151–166.
[7] Scott Piper. System for automatically collecting and analyzing crash dumps. US Patent 9,378,368. June 2016.
[8] Adrian Schroter et al. "Do stack traces help developers fix bugs?" In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010). IEEE, 2010, pp. 118–121.
[9] 'panic.c' comments in illumos repository. [Online; accessed 26-08-2014]. url: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/panic.c.


[10] The Modular Debugger. [Online; accessed 26-08-2014]. url: http://illumos.org/books/mdb/intro-1.html.
[11] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.
[12] Avrim Blum. "On-line algorithms in machine learning". In: Online algorithms. Springer, 1998, pp. 306–325.
[13] Nick Littlestone. "From on-line to batch learning". In: Proceedings of the second annual workshop on Computational learning theory. 1989, pp. 269–284.
[14] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification". In: Advances in neural information processing systems. 2006, pp. 1473–1480.
[15] Rich Caruana and Alexandru Niculescu-Mizil. "Data mining in metric space: an empirical analysis of supervised learning performance criteria". In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 69–78.
[16] Hyunmin Seo and Sunghun Kim. "Predicting recurring crash stacks". In: Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on. IEEE, 2012, pp. 180–189.
[17] Rongxin Wu et al. "CrashLocator: locating crashing faults based on crash stacks". In: Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 2014, pp. 204–214.
[18] Kirk Glerum et al. "Debugging in the (very) large: ten years of implementation and experience". In: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 2009, pp. 103–116.
[19] !exploitable Crash Analyzer - MSEC Debugger Extensions. [Online; accessed 23-10-2018]. url: https://archive.codeplex.com/?p=msecdbg.
[20] Dhaliwal et al. "Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox". In: Software Maintenance (ICSM), 2011 27th IEEE International Conference on. IEEE, 2011, pp. 333–342.

[21] Mark Brodie et al. "Automated Problem Determination Using Call-Stack Matching". In: Journal of Network and Systems Management 13.2 (2005), pp. 219–237. issn: 1064-7570. doi: 10.1007/s10922-005-4443-8. url: http://dx.doi.org/10.1007/s10922-005-4443-8.
[22] Natwar Modani et al. "Automatically Identifying Known Software Problems". In: 2007 IEEE 23rd International Conference on Data Engineering Workshop (2007), pp. 433–441. doi: 10.1109/ICDEW.2007.4401026.
[23] Gonzalo Navarro. "A Guided Tour to Approximate String Matching". In: ACM Comput. Surv. 33.1 (Mar. 2001), pp. 31–88. issn: 0360-0300. doi: 10.1145/375360.375365. url: http://doi.acm.org/10.1145/375360.375365.
[24] Inhwan Lee. Software Dependability in the Operational Phase. Tech. rep. Department of Electrical and Computer Engineering, University of Illinois, 1995.
[25] ChenNa Lian, Mihail Halachev, and Nematollaah Shiri. "Searching for Supermaximal Repeats in Large DNA Sequences". In: Bioinformatics Research and Development. Vol. 13. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2008, pp. 87–101. isbn: 978-3-540-70598-7. doi: 10.1007/978-3-540-70600-7_7. url: http://dx.doi.org/10.1007/978-3-540-70600-7_7.
[26] Roman Kolpakov and Gregory Kucherov. Finding Repeats With Fixed Gap. Research report RR-3901. INRIA, 2000, p. 15. url: http://hal.inria.fr/inria-00072753.
[27] Detection of all maximal repeats in strings, a python implementation. [Online; accessed 26-08-2014]. url: https://code.google.com/p/py-rstr-max/.
[28] Toru Kasai et al. "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications". In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. CPM '01. London, UK: Springer-Verlag, 2001, pp. 181–192. isbn: 3-540-42271-4. url: http://dl.acm.org/citation.cfm?id=647820.736222.

[29] Socorro: Mozilla's Crash Reporting System. [Online; accessed 26-08-2014]. url: http://blog.mozilla.org/webdev/2010/05/19/socorro-mozilla-crash-reports/.
[30] Kompotis et al. Identifying and grouping program run time errors. US Patent 9,009,539. Apr. 2015.

TRITA-EECS-EX-2019:22

www.kth.se