DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

Evaluating methods for grouping and comparing crash dumps

MICHEL CUPURDIJA

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: January 28, 2019
Supervisors: Alexander Baltatzis, Cyrille Artho
Examiner: Johan Håstad
School of Electrical Engineering and Computer Science
Swedish title: Utvärdering av metoder för att gruppera och jämföra krashdumpar


Abstract

Observations suggest that a high percentage of all reported software errors are reoccurrences, in certain cases as high as 75%. This high percentage of reoccurrences means that companies waste hours manually re-diagnosing errors that have already been diagnosed. The goal of this thesis was to eliminate or limit cases where errors have to be re-diagnosed through the use of automated grouping of crash dumps. In this study we constructed a series of tests. We evaluate both pre-existing methods and our newly proposed methods for comparing and matching crash dumps. A set of known errors was used as the basis for measuring the matching precision and grouping ability of each method. Our results show a large variation in accuracy between methods and that, generally, the more accurate a method is, the less it offers in terms of grouping ability. With an accuracy ranging from 50% to 90% and a reduction in manual diagnosis of up to 90%, we have shown that through automatic grouping of crash dumps we can accurately identify reoccurrences and reduce manual diagnosis.

Sammanfattning

The goal of this report was to investigate methods for grouping crash dumps. Reports on the subject have shown that up to 75% of reported bugs can be repeated occurrences of the same bug. The aim has therefore been to reduce the need for manual diagnostics by grouping crash dumps that share the same error source. In our study we constructed tests to objectively compare and evaluate the different methods. We evaluated both pre-existing grouping methods and methods that we propose. The tests evaluated the grouping methods' precision as well as their grouping ability. The evaluation showed a large variation in precision between the methods, but also a correlation between grouping ability and precision: methods with high precision have poor grouping ability. Our results show that it is possible to eliminate up to 90% of the manual debugging work, with a precision in the range of 50-90% depending on the choice of method.

Contents

1 Introduction
   1.1 Problem statement
   1.2 Methodology
   1.3 Delimitation
   1.4 Contribution
   1.5 Ethics and sustainability
   1.6 Thesis outline

2 Background
   2.1 Commercial software
   2.2 Building reliable software
   2.3 Programming languages and their effect on reliability
   2.4 Software verification
      2.4.1 Formal methods
      2.4.2 Software testing
      2.4.3 Automatic crash reporting systems
   2.5 Crash analysis and understanding failures
      2.5.1 System crashes
      2.5.2 Process crashes
   2.6 Understanding crash dumps
   2.7 Tools for crash dump analysis & information extraction
      2.7.1 Locating cause of SIGSEGV using mdb in process crash
   2.8 Machine learning
      2.8.1 Online and offline learning
      2.8.2 Classification approaches
      2.8.3 Measuring the performance of classifiers

3 Related work
   3.1 Automatic crash reporting systems
   3.2 Bucketing algorithms
      3.2.1 Analyzing call stacks
      3.2.2 Definition of an edit distance

4 Comparing crash dumps
   4.1 Symptoms
   4.2 Matching based on symptoms
   4.3 Comparing stack traces
      4.3.1 String comparison
      4.3.2 Recursion removal
      4.3.3 Edit distance
      4.3.4 Prefix matching
      4.3.5 Distance normalization
   4.4 Machine learning
      4.4.1 Nearest neighbor learning algorithm
   4.5 Summary
      4.5.1 Risk discussion

5 Evaluation and results
   5.1 Evaluation overview
      5.1.1 Test data
      5.1.2 Evaluation method for determining precision
      5.1.3 Evaluation method for determining grouping ability
      5.1.4 Evaluation method for machine learning precision
   5.2 Evaluation results
      5.2.1 Precision evaluation results
      5.2.2 Grouping evaluation results
      5.2.3 Machine learning evaluation results

6 Conclusion

7 Future work

Bibliography

Chapter 1

Introduction

Modern software is becoming larger and more complex than ever before. Source code repositories and user bases are growing larger every day. Although growth is inevitable, from a quality assurance perspective it means more errors being encountered more frequently.

It is important to understand that a high percentage of all encountered software errors are reoccurrences [1][2]. This high percentage is often due to the significant amount of time it takes to develop and deploy fixes; applications can therefore remain flawed for a long time. Manual re-diagnosis of old errors that are consistently reported as new bugs is a huge waste of resources.

Clearly there is an opportunity here to automate a process that could significantly reduce the amount of necessary manual diagnosis. If we could determine and pinpoint the exact symptoms of an error, and the symptoms of two or more errors coincide, then we could consider them equal. Considering two errors equal gives us the ability to categorize errors, and a newly reported error could be placed in an error group with similar symptoms that has already been diagnosed.

However, uniquely categorizing errors based on their symptoms is not trivial. Information is limited and often only presented in the form of a crash dump, which is the main focus of this thesis.

1.1 Problem statement

This thesis will investigate and evaluate methods for grouping errors based on their symptoms, i.e., indications of the cause of an error. The goal is to minimize the amount of manual diagnostics by removing the need for redundant re-diagnosis of already known errors. This thesis therefore aims to answer the following question:


Is it possible to group crashes with sufficient accuracy for an automated system to be trusted with the task of reducing the amount of necessary manual diagnosis?

1.2 Methodology

The overall focus was on exploring the field in order to later apply known theory and evaluate a number of existing methods. The first step was to establish a general understanding of the subject and to locate the information in crash dumps that would allow us to identify the symptoms of an error. The second step was to conduct a study of certain existing methods for comparing symptoms, specifically within stack trace analysis, and to assess their ability to group errors. These two steps were conducted in several iterations. The outcome of the study determined that four grouping methods were to be implemented and evaluated. The final step was to summarize the analysis of the suggested methods and evaluate concept reliability (see Figure 1.1 for an illustration of the research phase).

Figure 1.1: Workflow during research phase.

1.3 Delimitation

This work is limited to the analysis of crash dumps on Unix-like operating systems. The work is also limited to the analysis of crash dumps from a set of similar applications; therefore, the results may not be generalizable beyond our data set.


1.4 Contribution

This thesis primarily analyzes and compares existing methods for comparing stack traces. The results are in line with existing research for the individual comparison methods, and our evaluation indicates under which circumstances the different methods are ideal. We introduce a heuristic for removing recursion patterns in stack traces based on the removal of maximal repeats. It shows an increase in accuracy over existing techniques for the tested data set.

1.5 Ethics and sustainability

The training of machine learning algorithms can have a negative impact on sustainability in terms of energy consumption, in particular for large data sets where training over long periods of time is necessary. Furthermore, the required physical hardware is constructed with scarce natural resources.

From an ethical perspective, part of this work attempts to improve ways of fixing errors. With a very efficient way of fixing errors once a product has already been distributed to customers, companies could spend less time assuring the quality of a product prior to distribution. Companies could then be more inclined to release unfinished products, which in turn could have varying levels of impact depending on the sector in which the company operates.

1.6 Thesis outline

This thesis is organized as follows: Chapter 2 introduces the reader to crash dumps, what they contain, how they are provoked and how they can be analyzed; Chapter 3 introduces related work already done in the field; Chapter 4 defines how crashes can be categorized and describes methods for doing such categorizations; Chapter 5 explains the evaluation process of the different comparison methods and the results of the evaluation; Chapter 6 discusses the results from the evaluation and concludes the report; and lastly, Chapter 7 outlines future work.

Chapter 2

Background

Writing reliable software is hard, especially in today's landscape where the complexity of modern applications is ever increasing. History has shown that writing software without errors is close to impossible. Errors can be caused by flaws in the design, or by a correct design implemented in the wrong way. Programs not acting as expected contain bugs. Software bugs are often the source of abnormal program terminations, normally referred to as crashes.

2.1 Commercial software

In commercial software, with paying users and customers, there is an expectation that the software will work as intended and advertised. If the expectation is not met, customers will look for competing alternatives. Software companies therefore spend significant resources to mitigate the risk of bugs in their released programs. Despite these efforts, bugs are often not found until software is put into commercial use.

A unique aspect of software is that it is possible to distribute fixes at relatively low cost. This allows companies to retain or regain customer reputation through quick and continuous support.

Customers today expect their software to be regularly updated and new features to be added on a regular basis. These expectations are, in terms of reliability, opposing forces. To achieve reliability and robustness a program needs to be iterated on several times. New features and updates add additional code, code that is often re-used and re-purposed, which again needs iterating upon. A continuous development cycle like this makes it difficult to maintain quality.


2.2 Building reliable software

Software can have a long lifecycle and go through many stages, from an original design and implementation to testing and maintenance. At each stage appropriate measures can be taken to mitigate the risk of bugs. Various practices exist for what each stage should entail. Microsoft applies a methodology called SDL, the Security Development Lifecycle, a "seven-stage process" consisting of: training, requirements, design, implementation, verification, release and response [4]. Each stage is described in detail and acts as a standard for the development cycle throughout the company. Practices like these are not uncommon for larger companies, and audit trails for the process are sometimes applied.

2.3 Programming languages and their effect on reliability

Most programming languages can be categorized as languages that either produce native code, or languages that execute in a controlled environment. Languages that produce native code, such as C or C++, often allow the programmer direct access to the underlying architecture. In particular, management of memory is left to the programmer. Pointer arithmetic is the root cause of many classes of bugs, examples being buffer overflows, null pointer dereferences, double free and use-after-free bugs. Languages such as Java, executing inside virtual environments, may counter this by being purely object oriented and not allowing programmers to directly interact with memory. However, such languages are not significantly different from a reliability perspective, as object oriented languages often work with object references, references that can be null if they are not currently referring to an object. Dereferencing such a null reference will throw an exception which, if not handled, will terminate and crash the program. New languages are introduced and compilers are getting more advanced, but despite technical progress there is still room for programmer errors to introduce bugs [5].

2.4 Software verification

Verifying that a program will act as expected under all, or most, circumstances is a non-trivial task. In order to prove that it behaves as expected for all inputs, formal methods can be used.

A proof using formal methods is a costly, and often complex, approach only applied where necessary. To show that a program acts as expected in most circumstances, software testing can be used. Software testing is a more common requirement in the industry. Neither method is all-encompassing, and neither excludes the use of the other.

2.4.1 Formal methods

Using formal methods it is possible to prove the correctness of software, meaning a proof can be generated that it will always behave correctly, i.e., according to specification. This is especially important for safety-critical applications in industries that require it, examples being avionics or nuclear power. However, in practice, the use of formal methods is often limited to smaller parts of the software.

Formal methods have inherent scalability issues. This is mainly a concern when attempting to prove the correctness of an implementation derived from a formal specification outside a lab environment. Formal specifications on their own scale well, as they are a higher level abstraction and good tooling is available. When formal methods are used to verify an implementation, their scalability is limited by the complexities of programming languages, operating systems and computer architectures [5]. When a program is run outside a deterministic lab environment it interfaces with operating systems, user input and other potentially unknown parameters. Conducting formal analysis under these circumstances is hard and often impossible.

2.4.2 Software testing

With formal methods not always being applicable, an alternative is to put a program, or part of a program, through a number of tests to verify that it behaves correctly. Tests can cover both negative and positive behavior. Positive tests show that the program behaves correctly given valid input and normal user behavior. Negative tests show that the program can gracefully handle abnormal input or user behavior. The purpose of the tests is to verify behavior, find bugs and discover regressions when code is changed.

Manually written tests can generally only cover a fraction of the available inputs. The process of automatically generating invalid inputs and feeding them into the program is called fuzzing. It has been shown to be an effective way of finding bugs. The process can be time consuming, but as it is fully automated it can run continuously until a new version is ready for fuzzing. The method has been shown to be particularly effective at finding memory corruption bugs [6].

Memory corruption can often be identified and detected through program crashes. When a program terminates abnormally, it is possible to save the contents of memory at the time of the crash. This stored data is referred to as a crash dump.
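
To make the idea concrete, a minimal fuzzing loop can be sketched as follows (our own illustration in Python; the target binary ./a.out and the purely random input generation are placeholder assumptions, and real fuzzers are far more sophisticated). On Unix-likes, a negative return code from a child process indicates termination by a signal:

import random
import subprocess

def fuzz_once(target, max_len=256):
    # Feed one random input to the target program; a negative return
    # code means the process was killed by a signal (e.g., -11 for SIGSEGV).
    data = bytes(random.randrange(256) for _ in range(random.randrange(1, max_len)))
    proc = subprocess.run([target], input=data, capture_output=True)
    return proc.returncode, data

if __name__ == "__main__":
    for _ in range(1000):
        rc, data = fuzz_once("./a.out")  # hypothetical target binary
        if rc < 0:
            print("abnormal termination by signal", -rc)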

2.4.3 Automatic crash reporting systems

For commercial software targeted at the masses it is infeasible to test all permutations of hardware and operating system configurations. Companies attempt to mitigate this issue by having their programs automatically report back information when a crash occurs. The report can contain information about the system configuration as well as a crash dump. This allows developers to reproduce the problem and fix the issue [7]. For larger institutions with many users the number of reported crashes can be very large; Microsoft's error reporting system is designed to handle over a hundred million reports daily [5].

2.5 Crash analysis and understanding failures

Analyzing and understanding crashes is difficult, in particular when a program is written in native code and only machine-level information is available. Connecting machine-level information to source code is not trivial. For programs written in managed code1 the task can be significantly easier, as many modern managed platforms include information about where in the source code a crash originated.

A crash generally results in a crash dump that contains a stack trace. The stack trace gives insight into recent function calls and control flow. Often the stack trace in itself is sufficient information for a developer to locate an issue [8]. The topic of call stack analysis is covered in more detail in subsection 3.2.1.

An application typically crashes under two circumstances: if it performs an operation not allowed by the operating system, or if an intentional assertion was placed in the code. Some scenarios under which an application can crash are: accessing memory that is not allocated to the application; executing invalid or privileged instructions; illegal or incorrect I/O operations on hardware devices; incorrect usage of system calls.

1An encapsulating term coined by Microsoft for programming languages that execute code inside a controlled virtual environment, e.g., Java, C#, etc.

In certain scenarios crashes can be intentional. Assertions placed in programs can cause them to crash. Typically such assertions are put in place if the program is in an irrecoverable state, or in a state that is irrefutably wrong and could cause further damage to the system if not terminated immediately.

Aside from an application crashing, the entire operating system can crash. Modern operating systems usually handle application crashes gracefully and remain unharmed. However, some irrecoverable circumstances can cause the entire system to crash. Hardware faults are a common cause, but the operating system can also crash due to internal inconsistencies. Poorly written drivers are a source of such inconsistencies. Device drivers reside in the operating system's address space and could, through incorrect memory usage, crash the system. System crashes as well as process crashes are covered in more detail in Sections 2.5.1 and 2.5.2.

2.5.1 System crashes

As previously mentioned there are many ways an operating system can crash, anything from faulty drivers to hardware errors. They are all, however, treated in a similar way, and an illustration of the process can be seen in Figure 2.1. A significant difference between a system crash and a process crash, often called an application crash, is that if the system crashes, a reboot is likely necessary.

The specifics of how an OS handles system crashes differ, even between Unix-like systems. In general terms, however, the chain of events happens as follows. If a critical error is detected, an example being a fatal error in hardware, the panic() routine is called. The panic() routine proceeds by interrupting and suspending all other processes to minimize damage to user data. It then generates a dump file which is saved to a temporary dump device2 and later restored, when the system has rebooted, into an appropriate folder [9].

Forcing crashes can be helpful when learning new tools or wanting to explore crash dumps in general. On illumos this task is made trivial, as there is a parameter, -d, that can be passed to reboot() which forces the OS to create a system crash dump before rebooting [3].

2A logical volume where the dump is stored temporarily (often the swap disk) until the system restarts and places it in a permanent volume.

[Figure 2.1 shows two flowcharts. System crash (left): kernel crash requested (panic, user-requested) → OS suspends all processes → OS stores memory image on swap disk → system reboot → image is retrieved from the swap disk and stored. Process crash (right): process violates terms → OS signals the process (SIGSEGV) → OS handler terminates the process → if enabled, the OS dumps memory to a dump file.]

Figure 2.1: On the left is the flow during a system crash, and on the right a process crash.

2.5.2 Process crashes

The most common cause of process crashes is the operating system sending a terminating signal to the process after it violates certain terms. Typical violations that result in a process crash:

• Attempting to access memory that is not allocated for the process (segmentation violation).

• Attempting to execute invalid instructions or privileged instructions.

• Attempting to use a system resource for which the process has insufficient privileges.

• Invoking system calls with invalid parameters.

2.6 Understanding crash dumps

A crash dump, often called a core dump or just a dump, refers to a recording of the state of a program. Generally one is created when a program terminates abnormally, i.e., crashes.

The information that is recorded, or dumped3, consists of two key components: processor registers and memory information. The processor registers always contain a program counter and the stack pointer. Under good circumstances, the data registers can also contain some crucial parameters that might indicate where the fault was located. These registers are, however, prone to change, as a single processor instruction can change their contents. This volatility makes them an unreliable source of information, and one should not expect information there to be relevant.

The memory information component contains the information that is most relevant to our case. It contains the whole process address space in use. This means that we have access to the stack trace of the process at the time of the crash and any local or shared data, information which is essential when comparing symptoms of crashes.

2.7 Tools for crash dump analysis & information extraction

This section covers the basics of analyzing crash dumps. The information within core dumps is similar across all Unix-likes, meaning that even though the information extraction methods may differ between operating systems, it is still valuable to know what information is available.

Crash dumps generally contain large amounts of information, and extracting all of it is not always necessary. It is often more desirable to have portions of it displayed in a human-readable way. A tool that allows us to do this extraction is called mdb4, the Modular Debugger [10]. It is a general purpose debugging tool, much like the widely used GNU debugger, gdb. The Modular Debugger differs from many debugging tools which only allow the developer to execute programs in traditional controlled environments, where controlled execution and inspection of state using the source language are available. Using controlled execution is not always feasible; the mdb manual suggests four scenarios where a regular debugger does not suffice:

3Dump or dumped has become a term that implies storage of raw data.
4Note, this thesis does not concern itself with explaining mdb in full detail, but limits itself to the necessities of our problem. Please see [10] for further reading.

1) debugging an operating system, where bugs might not be reproducible and program state is often massive; 2) optimized programs with debug information stripped; 3) debugging low-level tools, such as another debugger; and 4) situations where only post-mortem information is available, an example being programs run at a customer location [10]. The Modular Debugger and similar tools allow for debugging of both live and post-mortem programs.

Proficiency with such tools generally requires familiarity with both the assembly programming language and an understanding of the relationship between higher level code and assembly code. This is, as previously mentioned, because code generally gets compiled into a lower level language (e.g., assembly or machine code), which during run-time gets loaded into memory. This means that, apart from function and variable names, which are preserved in dumps due to the symbol table5, all that is still available after compilation is assembly.

Listing 2.1 below is an example of a program that causes a SIGSEGV to occur by attempting to write to an illegal address (often referred to as a null pointer exception). This example is used to illustrate the information available in a crash dump by attempting to locate the cause of the SIGSEGV using mdb. Despite its simplicity, this example shows a large part of the information required to start tracing error symptoms.

Listing 2.1: Example program in C that causes a process crash due to segmentation fault and dumps core

int main() {
    int *_bar;
    _bar = 0;            // Point to illegal address
    *_bar = 0xAAAAAAAA;  // Null pointer exception
    return 0;
}

2.7.1 Locating cause of SIGSEGV using mdb in process crash

We assume the program in Listing 2.1 has been run and a corresponding dump exists. We start debugging using mdb (mdb-specific commands are shown in blue) and get the following:

5A data structure used by compilers to store information on names and types to aid certain processes during compilation. Debuggers often access and present this data to the user to simplify debugging. See Listing 2.3 for an illustration of this.

Listing 2.2: Starting debugging of dump using mdb and command ::status

root@m2:~# mdb core.a.out.15778.1398948161
Loading modules: [ libc.so.1 ld.so.1 ]
> ::status
debugging core file of a.out (32-bit) from m2
file: /root/a.out
initial argv: ./a.out ARGS
threading model: native threads
status: process terminated by SIGSEGV (Seg. Fault), addr=0

In Listing 2.2 we have opened the dump and run the mdb command ::status, which prints a summary of the target crash dump. The information here is limited but valuable when differentiating host machines, binaries and causes of termination. In our case we see that the process terminated due to a SIGSEGV. To investigate the cause further, the command ::stack is used; it prints the stack trace at the time of the crash. The ::stack command provides the most valuable information for our case, as it explains the program state at the time of the crash. We then proceed by disassembling the main() function.

Listing 2.3: Continued debugging of the dump using the mdb commands ::stack and ::dis

> ::stack
main+0x10(1, 8047dbc, 8047dc4, 8050d70, 0, 0)
_start+0x83(1, 8047e80, 0, 8047e88, 8047e98, 8047eac)
> main::dis
main:        pushl %ebp              ; push FP to stack
main+1:      movl %esp,%ebp          ; copy SP to FP
main+3:      subl $0x10,%esp         ; make room for pointer
main+6:      movl $0x0,-0x4(%ebp)    ; set pointer to NULL
main+0xd:    movl -0x4(%ebp),%eax    ; copy pointer to %eax
main+0x10:   movl $0xaaaaaaaa,(%eax) ; write value to address in %eax (NULL)
main+0x1b:   leave
main+0x1c:   ret

In Listing 2.3 the stack trace shows that main() was the last function to be called before termination. After disassembly, using ::dis, it becomes clear that there is an attempt at copying the value 0xAAAAAAAA to address 0x0 (NULL), which is a reserved address that the OS does not permit modifying.

As previously mentioned, symbol tables allow function names to be displayed in the stack trace. However, this only applies to programs compiled with gcc. Programs compiled with g++ use a more complicated name mangling6 system, and require additional predefined symbols in order for the stack trace to be demangled (made human-readable). The modular debugger, mdb, provides this functionality; Listing 2.4 and Listing 2.5 illustrate this. This can be valuable when comparing stack traces from different compilers, as name mangling is not standardized.

6A technique used to handle the problem of resolving unique identifiers in programs.

Listing 2.4: g++ stack trace without C++ demangling (function parameters removed for visibility)

libjvm.so`__1cCosFabort6Fb_v_+0x51()
libjvm.so`__1cLJvmtiExportTpost_vm_initialized6F_v_+0x436()
libjvm.so`__1cHThreadsJcreate_vm6FpnOJavaVMInitArgs_pb_i_+0xdc9()

Listing 2.5: g++ stack trace with C++ demangling (function parameters removed for visibility)

libjvm.so`void os::abort+0x51()
libjvm.so`void JvmtiExport::post_vm_initialized+0x436()
libjvm.so`int Threads::create_vm+0xdc9()

2.8 Machine learning

Machine learning is used to make data-driven decisions based on patterns in data. The domain spans a wide area of fields and approaches to learning. It can be anything from making predictions about the future based on patterns in the past, to simply classifying data points. The topic of machine learning is wide and complex, and only a small part of its domain is relevant to this thesis.

2.8.1 Online and offline learning

The term learning stems from the idea that existing data, or new data, is used to improve classification through analysis of patterns. Machine learning can be categorized in many ways, but a relevant categorization for this thesis is the separation of online learning and offline learning [11].

The concept of updating the model incrementally and instantly for each new training sample that arrives is referred to as online learning. This often requires specific algorithms that are not dependent on comparing a new sample to all existing samples, because that would not scale for large data sets [12]. Offline learning, on the other hand, does not include new training samples instantly. Instead the learning is done in batch, where the whole data set is used to compute a model. This can be a time consuming process. Offline models are therefore generally computed separately from the running classifier, and the classifier is only updated once a new model has been computed [13][12].
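
To make the distinction concrete, consider a minimal sketch (our own illustration, not taken from [11][12][13]) where the model is simply a class centroid:

import numpy as np

class OnlineCentroid:
    # Online learning: the model (a running mean) is updated for each
    # new sample as it arrives, without revisiting earlier samples.
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)

    def partial_fit(self, x):
        self.n += 1
        self.mean += (np.asarray(x) - self.mean) / self.n

def offline_centroid(samples):
    # Offline learning: the model is recomputed in batch over the whole
    # data set, typically separately from the running classifier.
    return np.mean(samples, axis=0)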

2.8.2 Classification approaches

Within the field of machine learning there are mainly two approaches: supervised and unsupervised. The main difference between them is that for supervised learning there exists a ground truth, some prior knowledge about the expected outcome. The goal of supervised learning then becomes to produce a classifier which, given a set of samples and the corresponding desired outcomes, is capable of classifying new samples. Unsupervised learning, on the other hand, does not have access to a ground truth but instead attempts to infer a natural structure given a set of data points [11].

When comparing crash dumps with the intent of reducing work load we want to group similar crashes. Grouping as a concept within machine learning is achieved using classification algorithms. Classification algorithms are a class of supervised machine learning algorithms where the algorithm is first fed training data, which in turn is used to learn how to classify, or group, new observations. The classification category consists mainly of four different approaches to grouping: linear classifiers, decision trees, neural networks and nearest neighbor [11].

Linear classifiers attempt to separate objects by making a linear combination of some characteristics, and are often illustrated using a vector or hyperplane separating objects in a two- or three-dimensional space. A classifier is considered linear if the decision boundary is a linear function. Classifiers using hyperplanes are often referred to as support vector machines.

Decision trees build classification models in the form of a tree structure. The core concept is to use an if-then ruleset that is mutually exclusive. The tree is sequentially built for each input of training data. A variation of decision trees is random forests, which, as the name implies, consist of multiple decision trees. Each tree within the forest evaluates a certain characteristic. The result is the merged outcome of the trees [11].

Artificial neural networks, or just neural networks, attempt to mimic the decision process in an organic brain. The network consists of neurons arranged in layers. Each layer converts an input vector into some output. The neurons in each layer apply a function (often non-linear) to the input and pass it to the next set of neurons in the next layer. The output from the last layer in the network is the result of the classification.

Lastly, nearest neighbor classifiers group objects based on a distance threshold for some common characteristic, or criterion. An object to be classified is compared to all existing objects and is simply classified by where the distance to its nearest neighbors is at a minimum [14].
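
Since nearest neighbor classification is the approach used later in this thesis, a minimal sketch may be helpful (our own illustration; the distance function and threshold are left as parameters):

def classify_nearest(sample, labeled_objects, distance, threshold=None):
    # Compare the sample to all existing objects and adopt the label of
    # the closest one; with a threshold, distant samples are rejected
    # and can instead start a new group.
    best_label, best_dist = None, float("inf")
    for obj, label in labeled_objects:
        d = distance(sample, obj)
        if d < best_dist:
            best_label, best_dist = label, d
    if threshold is not None and best_dist > threshold:
        return None
    return best_label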

2.8.3 Measuring the performance of classifiers

Many criteria can be used to measure the performance of a supervised classifier, and different criteria are appropriate in different settings. Two metrics that are often evaluated are precision and recall. Both are frequently used individually as evaluation metrics, but are also incorporated in other metrics. The terms can informally be described as follows: precision, how many of the selected samples were correct; recall, how many of the samples that were expected to be returned were actually returned [15].

Determining the correctness of a prediction is necessary in order to evaluate a classifier. A standard categorization of predictions has therefore been introduced. It defines the four following prediction categories: (1) true positive, the classifier predicted a positive match, and the prediction is deemed correct by external judgement; (2) false positive, the classifier predicted a positive match, while it is deemed incorrect by external judgement; (3) true negative, the classifier predicted a negative match, and the prediction is deemed correct by external judgement; (4) false negative, the classifier predicted a negative match, while it is deemed incorrect by external judgement. In more general terms, positive and negative refer to the prediction of the classifier, and the boolean terms true or false refer to whether the prediction is correct or not [15].

Formally, precision and recall are defined as fractions. Given the categorizations we can now define precision as

\[
\text{precision} = \frac{TP}{TP + FP},
\tag{2.1}
\]

where TP = true positive and FP = false positive. This is the fraction of relevant samples among all the retrieved samples. Recall is defined as

\[
\text{recall} = \frac{TP}{TP + FN},
\tag{2.2}
\]

where TP = true positive and FN = false negative. This is the fraction of samples that were retrieved among all the samples that were expected to be retrieved [15][11].
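
As a small executable companion to Equations 2.1 and 2.2, the following sketch (our own illustration) computes both metrics from parallel lists of predicted and actual matches:

def precision_recall(predicted, actual):
    # predicted/actual are parallel lists of booleans: did the classifier
    # predict a match, and was there actually a match?
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

Chapter 3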

Related work

This chapter details key existing research in areas relevant to this thesis. Implemented systems built on top of the concept of comparing crash dumps are covered, as well as an overview of existing methods for grouping and comparing individual crash dumps.

3.1 Automatic crash reporting systems

As mentioned in previous sections, to collect crash information companies can use programs that report back information when a crash occurs. Crash reporting systems such as Windows Error Reporting, Apple Crash Reporter and Mozilla Crash Reporter have been deployed widely. When a crash occurs at a deployed site, the systems collect crash information such as product version, product name, operating system, installed drivers, call stacks, and crash reason. The information can then be sent to the company servers if the user consents. Crash reporting systems for large institutions can accumulate large numbers of crash reports over time. Since many crash reports are likely caused by the same bug, systems can leverage existing reports and attempt to bucket duplicate or similar reports. Once a bucket reaches some threshold for the number of crashes it is often handed over to the developers [16][17].

Windows Error Reporting represents by far the largest system for reporting crashes. The platform is not only used for fixing issues after a release, it can also be used preemptively. Given the amount of reported crashes and corresponding fixes, the platform has become a resource for empirical data on bug fixes. The information can be used to reduce the amount of reoccurring bugs by identifying problematic patterns in both code and hardware [18].


Aside from allowing the company behind the automatic crash reporting system to receive crash information about their products, some systems allow third party application developers to use the service as well. Both Microsoft Windows and Apple's macOS allow third party applications to use the crash reporting system as a service. Applications can send crash information back to Apple's or Microsoft's servers and receive information about the stability of their product [5].

3.2 Bucketing algorithms

An ideal bucketing methodology would manage to assign exactly one bug to each bucket. The idea is that once a bug has been fixed, all unused crash reports within that bucket can be discarded without further analysis. Such a methodology would also group new reports into existing or new buckets. An ideal algorithm or methodology does not exist. Instead, heuristics are used in order to achieve bucketing that is useful in practice.

Varying heuristics have been shown to be useful. A common denominator, however, is to analyze call stacks and the crash location (module offset) to compare crashes. Meta information applied by the client can be used to separate operating systems, etc., but the binary comparison between two crashes is predominantly done using the call stack [5].

Tools available to the public exist today, such as !exploitable and CrashWrangler, that can analyze and group crashes [19]. Their primary use case, however, is determining whether a crash is a potentially exploitable bug. The grouping is done by hashing parts of the stack frames in the call stack and comparing hashes, a method that can be improved upon as suggested by Dhaliwal et al. [20]. Windows Error Reporting and similar systems use more sophisticated heuristics but are not as readily accessible by the public [18].

3.2.1 Analyzing call stacks

The call stack, often called the stack trace, is in the context of comparing crashes an important part of the crash dump. An empirical study by Schroter et al., analyzing bug reports containing stack traces, suggests that 60 percent of fixes involved a function in the stack trace [8]. A majority of those functions were within the top ten stack frames. Furthermore, the report suggests that multiple reports and multiple stack traces submitted for the same bug gave developers more insight into the underlying problem and have the potential to reduce lead times [8].

Expanding on the hashing solution implemented by !exploitable: to separate crashes from each other, !exploitable hashes two sections of the call stack. One hash is referred to as the major component and the other as the minor component. Hashing and comparison of call stacks is performed on the level of function names (as strings), including or excluding function offsets. This definition of a call stack comparison will be used throughout the thesis. The major component is by default configured to hash the five topmost stack frames. The function offsets are not included in the hash. The minor component hashes all stack frames, this time including the function offsets.

The major component is less strict and allows some deviation in the stack trace, while the minor component is more conservative and demands full equality. Having two components with different levels of allowed deviation lets !exploitable catch two relations between crashes. The minor component potentially allows crashes occurring in the same function to be differentiated, while the previous stack frames must be the same. The major component only concerns itself with the topmost stack frames, where previous events are irrelevant.

In addition to using a combination of components, as suggested above, Brodie et al. suggest using the different components separately depending on needs: full equality, similar to the minor component, where call stacks have to be identical, including or excluding function offsets; or only prefix matching, similar to the major component, where only the top n function names have to be identical, including or excluding offsets [21].

A separate approach suggested by Dhaliwal et al. and Modani et al. is calculating the distance between stack traces, primarily using the Levenshtein distance, or edit distance, and grouping them using that metric. The edit distance between two sequences is defined as the minimal number of changes needed to transform one sequence into the other [22][20]. Some average distance threshold in a group determines if a new group is to be created when a new crash is consumed. An edit distance approach has the potential to modify the bucketing characteristics, as the allowed deviation between crashes can be tuned through changes in the distance threshold. A definition of the Levenshtein distance is outlined in Section 3.2.2.

The study conducted by Schroter et al. argues that grouping crashes belonging to separate bugs has a detrimental effect on the time it takes to fix a bug. A conservative grouping method that divides a single bug into multiple buckets is argued to have less impact on bug fixing time. The article therefore suggests using an approach where the number of incorrect matchings, or misdiagnosed cases, is minimized.
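
A sketch of the two-component scheme is given below; the frame format follows the mdb traces shown in Chapter 2, and the helper names and hash choice are our assumptions, not !exploitable's actual implementation:

import hashlib

def _digest(frames):
    return hashlib.sha1("\n".join(frames).encode()).hexdigest()

def major_component(stack, top_n=5):
    # Hash the five topmost frames, function names only (offsets dropped).
    names = [frame.split("+")[0] for frame in stack[:top_n]]
    return _digest(names)

def minor_component(stack):
    # Hash every frame, function offsets included.
    return _digest(stack)

# Two crashes land in the same loose bucket if their major components
# agree, and are considered identical if the minor components agree too.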

3.2.2 Definition of an edit distance

An edit distance is used to quantify how dissimilar two strings are. It is achieved by counting the number of operations required to transform one string into the other. Transformations are applied at a chosen granularity based on needs. When comparing individual words in a natural language, a reasonable granularity level is comparing letters. When comparing sentences, the granularity level could be increased to comparing entire words. The most commonly used definition is the Levenshtein distance. It defines the following three operations [23]:

Insertion (+) inserts a single symbol x, uv → uxv

Deletion (−) deletes a single symbol x, uxv → uv

Substitution (↔) substitutes x for y, uxv → uyv

In Levenshtein's model each operation has a cost of 1, and the process is best illustrated by a textbook example. Consider the Levenshtein distance between 'thinker' and 'tailor':

− thinker → tinker

↔ tinker → tanker

↔ tanker → taiker

↔ taiker → tailer

↔ tailer → tailor

Given the operation cost of 1, the edit distance between 'thinker' and 'tailor' is at most five. This definition of edit distance is used throughout the thesis.
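
As an executable version of this definition (anticipating the dynamic programming formulation of Section 4.3.3), a minimal sketch at letter granularity with unit operation cost:

def levenshtein(a, b):
    # Row-wise dynamic programming; prev[j] holds the distance between
    # the first i-1 symbols of a and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # substitution
                           prev[j] + 1,             # deletion
                           cur[j - 1] + 1))         # insertion
        prev = cur
    return prev[-1]

print(levenshtein("thinker", "tailor"))
# prints 4: the five operations shown above are one valid transformation,
# i.e., an upper bound; the minimal sequence uses four.

Chapter 4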

Comparing crash dumps

Section 2.6 explained the basic concepts behind a crash dump and suggested means for extracting information from it. This chapter addresses two challenging problems: 1) uniquely characterizing errors by their symptoms, and 2) matching characterizations to identify common errors, both of which are key components of our problem.

4.1 Symptoms

A symptom of an error is not a defined term. However, in this thesis, when discussing symptoms, we refer to an indication of the cause of an error. An example of this is the stack trace leading up to the SIGSEGV in Listing 2.1. Here the whole stack leading up to the error can be considered a symptom of the error.

In order to solve the problem of uniquely characterizing errors by their symptoms, we first need to establish what types of symptoms are to be considered. As mentioned in previous sections, the amount of information contained within a crash dump is very large; it is therefore infeasible and out of scope for this project to consider all possible symptom sources. According to Lee et al., errors that share the same problem often share two types of symptoms: data oriented symptoms and code oriented symptoms [1]. Data oriented symptoms can refer to local data or shared data such as function parameters and global variables. These are often non-trivial to obtain, as they are hard to generalize: the data can range from primitive integers to nested complicated structures. Code oriented symptoms concern information such as the status message at the time of termination (see Listing 2.2) and the latest sequence of function calls, namely the stack trace (see Listing 2.3).


For Listing 2.1 it was clear that the symptom was inside the main function. It is, however, possible to reach a state in a program through different paths, causing ambiguity when attempting to determine the symptom of an error. Lee et al. suggest there are two scenarios of this problem, one scenario related to code oriented symptoms and the other to data oriented symptoms [1].

Figure 4.1 illustrates a problem related to code oriented symptoms where reaching an error, or a faulty part of the code, is possible through two paths, namely func_1 or func_2 invoking the faulty_function. When analyzing two crash dumps of this program (a unique code path for each one), the code oriented symptoms would differ, as the stack traces would not be identical.

Figure 4.1: Arriving at the same error through different code paths.

Figure 4.2 illustrates the execution of a program where the functions func_1 and func_2 access shared data (illustrated with dotted lines), later resulting in an error due to corruption in the database. This is an example of why this thesis only concerns itself with code oriented symptoms. The reasoning lies in the difficulty of connecting the symptoms to the errors when the error can lie within the shared data, or within the code.

Figure 4.2: Error caused by faulty data leads to ambiguity when attempting to pinpoint symptoms.

4.2 Matching based on symptoms

Matching different symptoms can be done in three ways: complete matching, partial matching and weighted matching. In a complete matching all considered symptoms have to be identical, or else it is a mismatch. A partial matching allows symptoms to deviate a certain distance1 from each other; a paper by Modani et al. suggests a method based on edit distance for this [22]. Weighted matching allows more deviation between error symptoms by placing value on certain areas while disregarding the value of others, i.e., a heuristic approach where previous experience can determine such values. This is just a general idea of how matching can be done; the sections below cover the different approaches in detail while focusing on code oriented symptoms.
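
The three modes can be summarized as predicates (a sketch; the weighted variant shown is only one of many conceivable heuristics, and its parameters are hypothetical):

def complete_match(symptoms_a, symptoms_b):
    # All considered symptoms must be identical.
    return symptoms_a == symptoms_b

def partial_match(symptoms_a, symptoms_b, distance, max_distance):
    # Allow the symptoms to deviate a bounded distance from each other.
    return distance(symptoms_a, symptoms_b) <= max_distance

def weighted_match(symptoms_a, symptoms_b, weights, threshold):
    # Score agreement per symptom, valuing some symptoms over others.
    score = sum(w for sa, sb, w in zip(symptoms_a, symptoms_b, weights)
                if sa == sb)
    return score >= threshold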

4.3 Comparing stack traces

Being a single piece of information containing the largest part of the program state during termination, the stack trace is ideal for crash dump comparison. A testament to this is that the developers of Mozilla Firefox use stack traces alone in 67 percent of their symptom grouping efforts [2], and related papers in the field suggest a 75 to 95 percent success rate in identifying common symptoms when comparing stack traces alone [1]. Naturally, this implies a need for comparison methods that can be applied to stack traces.

4.3.1 String comparison

In its simplest form, comparing stack traces can be reduced to a trivial string comparison problem. Many related papers in the field suggest grouping methods based on stack comparisons. There are other approaches to this problem, but in the case of complete matching, we could take n stack traces and compare function names line by line. If they are fully equal we consider them the same error. This is a sane assumption, based on tests done by [24], where 130 crash dumps were diagnosed (caused by 39 unique errors). The test managed to create 28 groups of crashes that mapped to 28 unique errors. The inability to group the other 11 was caused by minor function call deviations and incomplete stack traces. Most importantly, there were zero cases of crashes caused by different errors being grouped together using this method. Zero cases of incorrect groupings is a desirable feature in an automated system.

1Quantified by the number of operations required to turn one object into the other.

A crash is referred to as misdiagnosed if it is grouped with crashes of a different error. This methodology is applied in part by !exploitable, where the function names in a stack trace are hashed and then compared [19].

Considering we have discarded data oriented symptoms as a source of information, it would counteract the idea of only comparing code oriented symptoms if they were kept for comparisons. In the study by Inhwan Lee, a similar ideology was applied, where the author decided to strip away parameters and offsets from the stack traces (see Listing 4.1 for a before and after illustration) [24]. Parameters passed into a function are rarely static, and even if disregarded, we know a set of potential errors are caused by the same chain of events. The offsets describe which specific instruction in a function was part of the chain of events that caused the crash. More specifically, starting from where the function begins in the binary, adding the offset to the first instruction leads to the potentially erroneous instruction. The latter can be considered a code oriented symptom. It does, however, possess little value, as the offset value is not static and can change depending on platform, compiler and other variables. Therefore, all comparisons in this thesis are considered without parameters and offsets.

Listing 4.1: Original stack trace from mdb (top), stripped stack trace (bottom)

libc.so.1`_lwp_kill+0x15(1, 6, 4c, fef55000, fef55000, 8077fe0)
libc.so.1`raise+0x2b(6, 0, 8042400, feea9802, 0, 0)
libc.so.1`abort+0x10e(8061774, 8061727, 80621a4, 0, 805c4f0, 0)
usage+0x1e5(0, 80617f0, 8061b91, 8055022)
zpool_do_offline+0xb2(1, 8046538, 8077780, 801, 0, 0)
main+0x131(80464ec, fef5f8a8, 8046528, 8054e8b, 2, 8046534)
_start+0x83(2, 804697c, 804698c, 0, 8046994, 80469ad)
------
libc.so.1`_lwp_kill
libc.so.1`raise
libc.so.1`abort
usage
zpool_do_offline
main
_start
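
The stripping itself is mechanical; a minimal sketch using regular expressions, assuming the mdb frame format shown in Listing 4.1:

import re

def strip_frame(frame):
    # "zpool_do_offline+0xb2(1, 8046538, ...)" -> "zpool_do_offline"
    frame = re.sub(r"\(.*\)\s*$", "", frame)         # drop parameters
    frame = re.sub(r"\+0x[0-9a-fA-F]+$", "", frame)  # drop offset
    return frame

def same_error(trace_a, trace_b):
    # Complete matching on stripped traces: fully equal traces are
    # considered to stem from the same error.
    return [strip_frame(f) for f in trace_a] == [strip_frame(f) for f in trace_b]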

4.3.2 Recursion removal

To allow more stack traces to be matched, we can start introducing allowed deviations between stack traces. This will increase the probability of misdiagnosis, but if done well, it can be kept to a minimum. One idea suggested by [22] is to remove reoccurring patterns, most commonly seen when using recursion.

The argument for doing so is that it is irrelevant to consider recursion depth, as it varies based on function input, as long as the patterns indeed are repeating. This follows the previous reasoning for discarding data oriented symptoms where parameters are stripped (see Section 4.3.1 for details). The same logic can be applied to loops, where instead of focusing on single function calls we can point to the chain of events leading up to the error as an error cause.

Preserving some information as to where the recursion started can, however, be critical to accurately locating the error symptoms. Listing 4.2 displays such a repetition of patterns where the additional repetitions do not give us additional information. Three repetitions can be removed without significant information loss, as shown in Listing 4.3.

Listing 4.2: Example stack trace with repeating patterns.

.
.
purge
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
zpool_foo
zpool_bar
zpool_baz
main
_start

Listing 4.3: Ideal removal of repeating patterns from stack trace.

.
.
purge
zpool_foo
zpool_bar
zpool_baz
main
_start

Modani et al. suggest a greedy algorithm for the problem: consider two iterating pointers, one pointing at the top-most function iterating downwards, the other at the bottom-most function iterating upwards. If the function names pointed to by the two pointers are equal, then all function calls between the pointers are removed, including the bottom-most one. The process is repeated until only unique function names are left (see Algorithm 1 for an outline).

The reason we refer to the algorithm as greedy is the following (even after the successful result in Listing 4.4): consider an arbitrary stack trace such as ABCDEAB2. The proposed algorithm in [22] would reduce the stack trace to AB, which is a significant reduction of information, as the CDE functions are assumed to be irrelevant. Removal of excess information from a stack trace before comparison could lead to misdiagnosis, causing incorrect groupings to occur. The upside is that the algorithm is easy to understand and implement, and has a time complexity of O(gm^2) for stack length m and number of reappearing functions g. As it greedily removes function calls it also has the potential to reduce the amount of necessary computation during comparison. This could be a significant factor when comparing large numbers of long stack traces.

Algorithm 1 Greedy recursion removal

procedure removeRecursion(st)
    if number of duplicate function names in st = 0 then
        return st
    i ← 1                                          ▷ top of stack
    while i ≤ st.length do
        j ← st.length                              ▷ bottom of stack
        while j > i + 1 do
            if st[i] = st[j] then
                st ← concat st[1 : i] with st[j + 1 : st.length]
                return removeRecursion(st)
            j ← j − 1
        i ← i + 1
    return st
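
For concreteness, a direct transcription of Algorithm 1 into Python (0-indexed; frames are given top-most first):

def remove_recursion(st):
    # Greedy removal: when the same name appears at positions i and j,
    # drop everything between them, including position j, and restart.
    i = 0
    while i < len(st):
        j = len(st) - 1
        while j > i + 1:
            if st[i] == st[j]:
                return remove_recursion(st[:i + 1] + st[j + 1:])
            j -= 1
        i += 1
    return st

print(remove_recursion(list("ABCDEAB")))  # ['A', 'B']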

We propose a less greedy approach involving the removal of maximal repeats [25]. A maximal repeat in a sequence S is a repeated substring with occurrences s_1, s_2, ..., s_n ⊂ S such that the occurrences cannot be extended in either direction and still agree. For example, AD is one of two maximal repeats of BADBBADCDADA: even though the first two occurrences can be extended into BAD, the first and the third occurrence differ on both sides, thus a maximal repeat. The second maximal repeat is BAD ⊂ BADBBADCDADA, where neither of the occurrences can be extended and still agree.

Now consider an algorithm that operates in two steps: 1) removal of single repeating function calls in sequence, and 2) removal of the smallest maximal

2Each letter represents a function call and left-most is top-most in the stack.

repeats. The process of removing single repeating function calls simply finds repeating single sequences and removes all but one occurrence of the repeated function call. The second process, involving removal of the shortest maximal repeat pattern, is repeated until only one occurrence remains, which is left untouched. Both steps are sequentially repeated until no removals remain.

The reasoning behind using the shortest maximal repeat over the longest lies in the fact that using the longest can result in a longer stack trace. The example above was chosen specifically to illustrate this point. Assume the algorithm was applied to BADBBADCDADA with the longest repeat option: after step 1) BADBBADCDADA → BADBADCDADA, and after step 2) BADBADCDADA → BADCDADA. Now again with the shortest repeat option: 1) BADBBADCDADA → BADBADCDADA, 2) BADBADCDADA → BADBCDA. This is an arguably better representation, even more so compared against the greedy algorithm above, which reduced the previous example ABCDEAB → AB, where our algorithm would reduce it to ABCDEAB → ABCDE. This minimizes information loss, as the chain of events before the crash is preserved. Furthermore, in the latter scenario of ABCDEAB, the CDE function calls could be the chain of events leading up to the error. The trailing AB calls could have been preceded by a sequence of CDE calls as well that were lost due to stack trace corruption. Our algorithm would better minimize information loss in such a scenario and improve the odds of grouping similar errors. An analysis of the complexity of finding maximal repeats is a complicated subject and beyond the scope of this work (see [26] for details and [27] for a quasi-linear Python implementation).

An important detail: both processes are done bottom up, i.e., starting at the bottom of the stack, in order to keep the non-deleted repetition as far up in the stack as possible. Function calls closer to the error hold more value, as they are more telling about the state during the crash. Heuristics such as this one are discussed in more detail in Section 4.3.3.

Listing 4.4: Stack trace in Listing 4.2 after application of Algorithm 1

    ...
    purge
    zpool_foo
    zpool_bar
    zpool_baz
    main
    _start

4.3.3 Edit distance

Another method for allowing deviations in stack traces is to simply allow them to differ by a number of function calls. This needs to be controlled and normalized for predictable results. Additionally, the distance between a set of stack traces needs to be quantifiable for us to be able to determine the level of similarity. Edit distance was designed for this purpose: to quantify how dissimilar two strings are by counting the number of operations required to transform one string into the other. Considering the definition of edit distance that was outlined in Section 3.2.2, the problem now needs to be modified so that it becomes applicable to our case, comparing stack traces. This is trivial, as we can adjust granularity and consider each row in the stack trace a symbol. Assume two stacks S and T (rows denoted by s_i and t_j); [22] then defines the process recursively as follows:

\[
d(S_i, T_j) = \min
\begin{cases}
d(S_{i-1}, T_{j-1}) + w(s_i, t_j) & \text{(substitution)}\\
d(S_{i-1}, T_j) + w(s_i, \emptyset) & \text{(deletion)}\\
d(S_i, T_{j-1}) + w(\emptyset, t_j) & \text{(insertion)}
\end{cases}
\tag{4.1}
\]

\[
w(s_i, t_j) =
\begin{cases}
0 & \text{if } s_i = t_j\\
1 & \text{otherwise}
\end{cases}
\tag{4.2}
\]

where d is the distance between two stack traces, ∅ denotes the empty symbol, and the function w returns the distance between two stack trace rows (by definition the only possible distance values are 0 and 1).

A recursive implementation as described in Equation 4.1 is, however, infeasible for comparing large stack traces, as evaluation would take exponential time. Instead, dynamic programming can be used. Assume we need to compute the edit distance d(a, b) for a string a of length m and a string b of length n. A matrix D_{m,n} is filled so that each element D_{i,j}, where 1 ≤ i ≤ m and 1 ≤ j ≤ n, holds the minimum number of operations needed to transform a_{1..i} into b_{1..j}. Table 4.1 illustrates this process.³ It also becomes clear that the algorithm has time complexity O(|a| · |b|). Notable is also the space complexity O(min(|a|, |b|)): when computing column by column, only the previous column has to be kept in memory. The process can be done row-wise or column-wise to minimize space [23].

In Section 4.3.2 heuristics were used to tune the outcome of a comparison, specifically by keeping the top-most repeating patterns, as they contain the most recent information about the state during termination.

³The x-axis represents a = 'sleep' and the y-axis b = 'deep'.

        S   L   E   E   P
    0   1   2   3   4   5
D   1   1   2   3   4   5
E   2   2   2   2   3   4
E   3   3   3   2   2   3
P   4   4   4   3   3   2

Table 4.1: Dynamic programming algorithm for the edit distance between 'sleep' and 'deep'. The bottom-right cell holds the answer (2); the bold numbers indicate the path taken.

The idea of adding value to function calls close to the top of the stack was proposed by [21], and [2] shows it is a more accurate heuristic than simple comparison. However, the paper also shows that for small applications the heuristic is lacking, as a single function call (taken out of a small pool) can carry too much weight and cause minimal grouping, forcing us to manually inspect all crash dumps regardless of the grouping effort. In order to test this, we now introduce a heuristic system for edit distance where a deviation can carry more (or less) weight depending on how far from the top of the stack it is located. The operation cost is no longer constant; instead it is directly associated with the position of the symbol. Equation 4.3 defines such a weight function:

\[
w(s_i, i, t_j, j) =
\begin{cases}
0 & \text{if } s_i = t_j\\
\max(i, j) + 1 & \text{otherwise}
\end{cases}
\tag{4.3}
\]

where positions are indexed from the bottom of the stack, so that the top-most frame carries the largest weight. In words, the regular edit distance charges cost 1 for every operation: deletion $s_i \xrightarrow{1} \emptyset$, insertion $\emptyset \xrightarrow{1} t_j$ and substitution $s_i \xrightarrow{1} t_j$. The weighted edit distance instead charges deletion $s_i \xrightarrow{i+1} \emptyset$, insertion $\emptyset \xrightarrow{j+1} t_j$ and substitution $s_i \xrightarrow{\max(i,j)+1} t_j$. Table 4.2 shows, through the use of dynamic programming, the difference between a regular edit distance and a weighted edit distance: a trivial but telling example of how additional deviation can be allowed when it is far from the top of the stack.

Papers in the field suggest removing uninformative functions, meaning,

functions that an expert in the field has deemed irrelevant for comparisons, e.g., error handlers that differ between software versions. This idea spans the whole subject of comparing stack traces, but is particularly relevant here. Instead of having a linear cost increase based on where in the stack trace a function is, one could use fixed weights attached to operations for certain functions. In practice, however, this is infeasible for most software projects: codebases are generally large and maintained by multiple teams, so maintaining a curated list of individual function weights would be impractical.

Left table (change at the top, WILD vs. MILD):

        W   I   L   D
    0   4   7   9  10
M   4   4   7   9  10
I   7   7   4   6   7
L   9   9   6   4   5
D  10  10   7   5   4

Right table (change at the bottom, WILD vs. WILT):

        W   I   L   D
    0   4   7   9  10
W   4   0   3   5   6
I   7   3   0   2   3
L   9   5   2   0   1
T  10   6   3   1   1

Table 4.2: Dynamic programming algorithm for weighted edit distance illustrated. The illustration shows the difference in cost when only one letter is changed, but at different locations. The table to the left holds a change at the top and the table to the right holds a change at the bottom.

The advantage of a weighted edit distance lies in the ability to tune the allowed deviation between stack traces more granularly. With the added benefit also comes a downside: all types of allowed deviations and heuristics introduce a risk of misdiagnosis. The cost of misdiagnosis is highly relevant when choosing a matching method and is covered in the summary of this chapter.
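To make the cost scheme concrete, here is a dynamic-programming sketch of the weighted edit distance. The function name and list representation are our own; the cost helpers encode Equation 4.3 under the bottom-up indexing noted above (a frame i steps from the top of an m-frame trace costs m − i), and replacing both helpers with a constant 1 recovers the plain edit distance of Equations 4.1 and 4.2.

    def weighted_edit_distance(s, t):
        """Positionally weighted edit distance; s and t are lists of
        function names with index 0 = top of the stack, so a deviation
        at the very top costs len(trace) and one at the bottom costs 1."""
        m, n = len(s), len(t)
        del_cost = lambda i: m - i               # cost of deleting s[i]
        ins_cost = lambda j: n - j               # cost of inserting t[j]
        # D[i][j] = minimum cost of transforming s[:i] into t[:j]
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            D[i][0] = D[i - 1][0] + del_cost(i - 1)
        for j in range(1, n + 1):
            D[0][j] = D[0][j - 1] + ins_cost(j - 1)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = (0 if s[i - 1] == t[j - 1]
                       else max(del_cost(i - 1), ins_cost(j - 1)))
                D[i][j] = min(D[i - 1][j - 1] + sub,          # substitution
                              D[i - 1][j] + del_cost(i - 1),  # deletion
                              D[i][j - 1] + ins_cost(j - 1))  # insertion
        return D[m][n]

    print(weighted_edit_distance(list("WILD"), list("MILD")))  # 4: change at the top
    print(weighted_edit_distance(list("WILD"), list("WILT")))  # 1: change at the bottom

The two calls reproduce the bottom-right cells of the tables in Table 4.2.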

4.3.4 Prefix matching

Another suggested method is a comparison of the longest common substring starting from the top of the stack, i.e., the longest common prefix. This enforces a strict rule of no allowed deviations close to the top of the stack and was shown by [21] to be an effective method for grouping crash dumps. In order to quantify the closeness of two stacks S and T with row index i, we measure the size of the longest shared prefix, which can be defined formally

as follows:

\[
\text{prefix}(S, T) = \max \{\, i \in \mathbb{N} \mid S_k = T_k \ \text{for all } k \le i \,\}
\tag{4.4}
\]

One advantage prefix matching has over edit distance is time complexity. The longest common prefix can be found in linear time using suffix arrays [28]; when dealing with large numbers of stack traces, or where performance is critical, the prefix matching method could be considered.
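A minimal sketch of prefix matching under the normalization convention of Section 4.3.5 (the function names are ours, not an established API):

    def longest_common_prefix(s, t):
        """Number of rows shared from the top of both stacks (index 0 = top)."""
        i = 0
        while i < min(len(s), len(t)) and s[i] == t[i]:
            i += 1
        return i

    def prefix_match(s, t, threshold):
        """Match if the shared prefix covers at least `threshold` (in [0, 1])
        of the longer trace."""
        return longest_common_prefix(s, t) / max(len(s), len(t)) >= threshold

The linear scan above is already linear per pair of traces; the suffix-array construction of [28] pays off when a large collection of traces is compared repeatedly.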

4.3.5 Distance normalization

Edit distance and prefix matching share the attribute that a calculated distance can never be larger than the length of the longest stack trace. Our weighted edit distance algorithm, however, can yield a distance greater than that length, because the weight of a difference is directly correlated with its position in the stack. To exemplify: consider two stack traces S and T of equal length; if there is a difference in the top two function calls, the distance between them would be len(S) + (len(S) − 1), which is strictly larger than the length of S. The relevance of this lies in the common definition of distance function normalization: the distance between the objects divided by the maximal potential distance between them. For edit distance and prefix matching, scaling would by this definition look as follows:

\[
d'(S,T) = \frac{d(S,T)}{\max(\mathrm{len}(S), \mathrm{len}(T))}
\tag{4.5}
\]

In order to scale the weighted edit distance we need to define the maximal distance that two stack traces can be from each other. As mentioned previously, the weight of an operation is directly correlated with its position in the stack trace, so the maximum distance between two stack traces is the sum of the indices in the longer of the two. We define it as follows:

\[
d'(S,T) = \frac{d(S,T)}{\max\left(\sum_{n=1}^{\mathrm{len}(S)} n,\ \sum_{m=1}^{\mathrm{len}(T)} m\right)}
\tag{4.6}
\]
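Both scalings have simple closed forms; the triangular sum in Equation 4.6 equals len · (len + 1)/2, which is the cost of deleting every frame of the longer trace under the positional weights. A sketch, reusing the hypothetical distance functions above:

    def normalized_distance(d, s, t, weighted=False):
        """Scale a distance d(s, t) into [0, 1] per Equations 4.5 and 4.6."""
        if weighted:
            max_d = max(len(s) * (len(s) + 1) // 2,
                        len(t) * (len(t) + 1) // 2)    # Eq. 4.6
        else:
            max_d = max(len(s), len(t))                # Eq. 4.5
        return d(s, t) / max_d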

4.4 Machine learning

Within the category of classification algorithms there are a couple of methods particularly suited for classifying data where distance functions are employed, namely nearest neighbor algorithms and support vector machines. Nearest neighbor algorithms are generally intuitive and easy to apply: a new observation is compared to all neighbors, or to its nearest neighbors based on some criterion. The downside of this approach is that it can be slow, particularly for large data sets, as it scales linearly with the number of data points [14]. Support vector machines, or SVMs, are less intuitive and more complex. Instead of directly comparing an observation with previous ones to make a classification, an SVM attempts to find a vector, or hyperplane, that divides the data in a suitable way; where a data point lies relative to the hyperplane determines its classification. Given that we have established methods for comparing individual crash dumps using distance functions, it follows that either a nearest neighbor type of algorithm or a support vector machine is suitable. There is no established criterion for when one of the algorithms is better than the other, so we opt for the simpler concept of nearest neighbor algorithms [21]. Determining which learning algorithm is better suited for this particular use case is not in the scope of this thesis.

4.4.1 Nearest neighbor learning algorithm

When a new crash is encountered it needs to be compared with existing crashes to determine if it resembles already categorized ones. The assumption is that if the resemblance is within some threshold, the compared crashes are caused by the same bug. Comparing a new crash to every other known crash can be computationally expensive, as the number of stored crashes can be large. To mitigate this problem one could, instead of comparing a new crash to all existing crashes, only compare it against a particular reference, or signature, crash for each bug or existing group. The new crash would then only have to be compared to all the signature crashes in order to make a classification. A learning algorithm can be used to determine which crash in a group is the optimal reference crash, from a training data set with known bugs. In our case it would be the crash in a group with either the lowest average distance to the other crashes in the group, or the crash with, on average, the longest common

prefix with every other crash in the group. The algorithm is fed crashes grouped by known problem, compares them pairwise, picks the one with the lowest average distance and promotes it to signature crash. Whenever new crashes are added to an existing group, the group is re-evaluated, either in real time (online learning) or in a scheduled batch (offline learning), and a new crash is potentially promoted [21]. Handling cases where a crash does not match a single existing group can be done in various ways. Our suggested approach is to create a new group containing the single crash and automatically promote it to signature crash. When further crashes are placed in the group, it can be manually classified by a human to verify that it is indeed a new bug; or, if it is a duplicate, the group can be merged with an existing group (with a known bug) and the whole set re-evaluated to potentially better categorize future crashes. Further details on the algorithm are covered in the evaluation section. A sketch of both steps is given below.
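A minimal sketch of both steps, assuming any of the normalized distance functions sketched earlier (the names and signatures are ours):

    def pick_signature(group, distance):
        """Promote the crash with the lowest total (equivalently, average)
        distance to the rest of its group to signature crash."""
        if len(group) == 1:
            return group[0]
        return min(group, key=lambda c: sum(distance(c, o)
                                            for o in group if o is not c))

    def classify(crash, signatures, distance, threshold):
        """Return the best-matching signature within `threshold`, or None
        to signal that a new single-crash group should be created."""
        best = min(signatures, key=lambda s: distance(crash, s))
        return best if distance(crash, best) <= threshold else None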

4.5 Summary

In the prelude to this chapter we brought up two problems that were to be addressed: 1) uniquely characterizing errors by their symptoms and 2) matching characterizations to identify common errors. For the first problem we determined that characterization can be done by analyzing data-oriented symptoms and code-oriented symptoms. Data-oriented symptoms were dismissed due to complexity and ambiguity, while for the code-oriented symptoms we focused on the stack trace: same stack trace, same characterization. For the second problem we tried to identify common errors among the different unique characterizations by attempting to group them based on their stack trace. This was done using a number of matching methods suggested by various papers in the field, and our own variations of them, specifically:

• Complete matching, a matching method where stack traces have to fully match (when offsets and parameters are stripped).

• Recursion removal, a heuristic that allows deviation in the form of different recursion depths.

• Edit distance, a heuristic that allows variation by permitting a limited number of function calls to deviate.

• Prefix matching, a matching method which compares the longest common substring starting from the top of the stack.

• A machine learning approach to enhance the performance of the distance based methods.

There are advantages and disadvantages to the different methods, most of which revolve around the tradeoff between the risk and cost of misdiagnosis versus the cost of added manual inspection. The cost of misdiagnosis is subjective and has to be evaluated by each individual institution. For the evaluation we propose a simple set of guidelines: 1) If the number of expected crashes is low, and the risk of unique crashes is high, the cost of misdiagnosis is high. The idea behind this is that an error could be erroneously marked as fixed and remain in the system for a long period of time. 2) If the number of expected crashes is high (usually in systems with a large user base) where errors are hit frequently, the cost of misdiagnosis is low. An example of such a system is Mozilla Firefox, which receives 2.5 million crash reports every day; if an error is misdiagnosed, it is likely to be encountered again and handled differently [29]. Considering the guidelines, an institution agreeing with guideline 1) should avoid heuristic methods and minimize the risk of misdiagnosis, while an institution agreeing with guideline 2) could consider introducing heuristics, with the benefit of reducing the amount of manual diagnosis. Overall, the pre-study and our own studies suggest that the efficiency of using allowed deviations is limited to organizations where the crash frequency is high. This is because in real-world applications an undetected misdiagnosis has a potentially severe cost attached to it, for example a critical bug shipped to customers because it was incorrectly marked as fixed. The evaluation chapter below investigates the efficiency claims further.

4.5.1 Risk discussion

All methods but complete matching introduce a certain amount of risk. As tested in [24], complete matching (parameters and offsets removed) was able to group 130 crash dumps (caused by 39 unique errors) into 28 groups mapping to 28 unique errors. A possible reason why the stack trace comparison match rate is not higher in many papers is that stack traces from different operating systems are tested and considered equal. However, it has been shown that the same error can look different depending on which operating system is used; even different versions of the same operating system have been shown to produce different stack traces for the exact same error [2][30]. We

conducted a brief study to verify this statement by creating a set of simple erroneous programs and executing them on five major unix-like operating systems, among them FreeBSD, NetBSD, Mac OS X, and illumos (Solaris). The result of the study showed that the statement held true: the stack traces were different on all of the tested systems. Noteworthy is that in all cases the difference lay in the error handling, meaning, the functions called after the erroneous function. This study showed that it is not advisable to group multi-platform crash dumps using stack traces without sufficient knowledge about the underlying error handling on the different platforms, which is in line with the finding made by Kompotis et al. [30]

Chapter 5

Evaluation and results

This chapter is divided into two parts, evaluation method and experimental results. Its aim is to show that the amount of manual diagnosis can be reliably reduced by introducing automated grouping and comparison of crash dumps. The evaluation phase designs a framework for determining optimal heuristic thresholds, and benchmarks the methods against each other.

5.1 Evaluation overview

For the evaluation phase we tracked two key metrics, precision and grouping ability. Precision is defined as the percentage of crashes correctly predicted as belonging together, i.e., not misdiagnosed as previously defined. Grouping ability, similar to recall, is defined as how often the stack comparison methods managed to group similar crashes. Using these key metrics we strove to gain the following insights: firstly, determining the optimal equality thresholds for the edit distance (ED), weighted edit distance (WED) and prefix matching (PM) algorithms in the context of comparing stack traces; secondly, determining how the four methods of complete matching (CM), ED, WED and PM compare against each other; and lastly, determining the effects of applying recursion removal strategies. When evaluating or applying distance functions across heterogeneous data sets it is often required to normalize or scale the distances to a fixed range, such as our choice of [0, 1]. This is because object distance can be heavily influenced by certain attributes, in our case the length of the stack trace. Only in homogeneous data sets is it feasible to directly compare distances between objects. For the evaluation of ED, WED and PM we normalized the thresholds and defined them as a percentage of allowed deviation, in accordance with the


previous distance normalization definition.

5.1.1 Test data

For a realistic evaluation to take place, real-world stack traces from three proprietary server applications were used. All three applications are within the online stockbroking sector, a sector where low latency, high throughput and rigid service-level agreements¹ are the norm.

Application details

The first application is a stock market price intake, primarily concerned with high-throughput data consumption. It communicates with a number of stock markets, from which it receives up to a hundred thousand messages per second. The application consists of approximately 200 000 lines of code and will be referred to as application Medium. The second application is similar to the previous one, except it consumes news data from various news sources. News data is significantly larger in size but less frequent. This application consists of approximately 50 000 lines of code and will be referred to as application Small. Lastly, the third application is concerned with receiving customer orders and forwarding them to the stock markets. Each order needs to be validated according to business rules before being sent. It is the most complex application of the three, as it contains large amounts of business logic in addition to handling tens of thousands of requests per second in a transactional manner, e.g., making sure customers cannot double-spend². The codebase is also the largest, consisting of approximately 500 000 lines of code. It will be referred to as application Large. The three applications share source code via libraries, in particular code for error handling, logging, data persistence and a crash reporting plugin. Uptime is ensured by having multiple instances of the applications executing in parallel with distributed workloads. If an instance crashes, the others automatically take on its workload until it has been restarted. Once an application has crashed, the crash reporting plugin automatically creates a new issue in a bug tracking system, where the crash dump is uploaded. Due to the importance of

¹Service-level agreements are commitments between clients and a service provider; common aspects include availability, quality and responsibilities.
²Double-spending is a potential flaw within digital currency systems in which a single token can be spent more than once.

said applications, all crash reports are manually inspected, deduplicated and grouped in a best effort manner.

Data collection

The data was collected by manually picking a random set of 100 patched bugs that had 1461 crash dumps associated with them via bug reports in a bug tracking system. All bugs and crash reports are related to the three applications. The criterion for selection was that the bug reports needed to contain one or more crash dumps, and that they had been manually inspected. By knowing which crash dumps map to which bugs, we could determine how well the algorithm grouped the crashes, and whether any crashes were grouped incorrectly. Out of the 1461 crash dumps, 422 show signs of containing recursion. The number of crashes per application and additional details can be seen in Table 5.1.

Application   Lines of code   Number of bugs   Number of crash dumps   Crash dumps with recursion
Small                50 000               25                     366                          107
Medium              200 000               35                     514                          148
Large               500 000               40                     581                          167
Total               750 000              100                    1461                          422

Table 5.1: Number of bugs, crashes and recursion presence broken down per application.

5.1.2 Evaluation method for determining precision

Stack trace comparison method precision was determined through pairwise comparison of all crashes. Each binary prediction can have four outcomes: (1) True positive, both stacks are predicted equal and are truly equal; (2) False positive, both stacks are predicted equal but are not equal; (3) True negative, both stacks are predicted not equal and are truly not equal; (4) False negative, both stacks are predicted not equal but are truly equal. We calculate a commonly used performance metric for classifiers, namely precision, formally defined as follows:

\[
\text{precision} = \frac{TP}{TP + FP},
\tag{5.1}
\]

where TP = true positives and FP = false positives [15]. The pairwise comparison was done with and without recursion removal, and with varying thresholds for edit distance and prefix matching. The result of this should point us to the ideal thresholds for this data set and give us insight into the efficiency of allowing deviations.
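As a concrete reading of Equation 5.1, the pairwise precision computation can be sketched as follows (the two predicates are hypothetical placeholders for a matching method and the ground truth):

    from itertools import combinations

    def pairwise_precision(crashes, predicted_same, truly_same):
        """Precision over all crash pairs: TP / (TP + FP)."""
        tp = fp = 0
        for a, b in combinations(crashes, 2):
            if predicted_same(a, b):          # the method claims a shared bug
                if truly_same(a, b):
                    tp += 1
                else:
                    fp += 1
        return tp / (tp + fp) if tp + fp else 1.0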

5.1.3 Evaluation method for determining grouping ability

To determine the grouping ability of the different matching methods we propose an evaluation method which resembles the suggested machine learning algorithm and has some similarities with the standard definition for calculating recall. For each bug a reference crash dump is randomly selected. Each unique bug is represented by a bucket, which is pre-filled with its corresponding reference crash dump. We then attempt to place each remaining crash dump in each bucket: if its stack trace matches that of the reference crash dump in the bucket, it is considered a match (not necessarily a correct one, due to imprecision in the matching algorithms). Four matching scenarios are possible: (1) a single match in the correct bucket; (2) a single match in the wrong bucket, a misdiagnosis; (3) matches in multiple buckets, also a misdiagnosis; (4) no matches, a false negative. A single match, illustrated in Figure 5.1, means a crash dump is only placed in a single bucket. Note that this does not imply a correct match, as it could have been misplaced, i.e., a misdiagnosis. Multiple matches, illustrated in Figure 5.2, imply a misdiagnosis, as the crash dump is matched against two or more unique bugs. In the scenario where a crash dump does not match any bucket, a new bucket is created, as illustrated in Figure 5.3. We did not count such cases as misdiagnosed, since a misdiagnosis implies a greater cost than an added bucket, which only increases the amount of manual diagnosis. The similarity between the standard definition of recall and our evaluation method is best illustrated by observing the definition of recall in Equation 2.2. Our evaluation produces a number of groups while recall produces a fraction; however, the two numbers increase and decrease under similar conditions. Specifically, the recall value would decrease when the number of buckets would increase, and vice versa. As shown in Equation 2.2, the recall value is primarily affected by the number of False Negatives, which in our case is matching scenario number four: a crash dump not being placed

in an existing bucket. A high number of False Negatives relative to the True Positives would yield a low recall, while a high number of False Negatives in our evaluation would result in many additional buckets. If recall is described as a fraction detailing the number of relevant crash groups returned over the total number of actual crash groups, the similarity between the evaluation approaches can be observed.

Figure 5.1: Case A: Single match. If placed in the wrong bucket, a misdiagnosis; otherwise a correct match.

In practice, combining the case of a "single match" and the case of a "multiple match" could result in a crash dump being placed in two or more buckets without being a misdiagnosis: this happens if a crash dump is placed both in its correct bucket and in a bucket that was previously created due to a zero match. The algorithm is then accurate, but not precise. This scenario can be mitigated using the machine learning approach and re-training the model. For the actual testing we applied the evaluation method to the proposed matching algorithms by attempting to correctly place the crash dumps into their corresponding buckets. Once all the crash dumps have been placed, the number of buckets with one or more crash dumps is counted. This entire process was repeated 5000 times, with randomly selected reference crash dumps in each iteration, and the results were averaged; the average number of buckets is the final result. The process was done with and without recursion removal, and with varying thresholds for edit distance and prefix matching. The result should give an idea of how often new buckets have to be created due to grouping inability, and the consequences this might have on the amount of necessary manual diagnosis. A sketch of a single trial is given below.
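One randomized trial of the bucket evaluation could be sketched as follows, assuming a dictionary from each known bug to its crash dumps and a boolean match predicate (both are our own framing of the procedure above):

    import random

    def grouping_trial(crashes_by_bug, match):
        """Place every non-reference crash against randomly chosen
        reference crashes and return the resulting bucket count."""
        buckets, rest = [], []
        for dumps in crashes_by_bug.values():
            ref = random.choice(dumps)
            buckets.append([ref])               # one pre-filled bucket per bug
            rest.extend(c for c in dumps if c is not ref)
        for crash in rest:
            hits = [b for b in buckets if match(crash, b[0])]
            if hits:
                for b in hits:                  # several hits = misdiagnosis
                    b.append(crash)
            else:
                buckets.append([crash])         # zero matches: new bucket
        return len(buckets)

    # Averaged over many randomized trials, as in the evaluation:
    # sum(grouping_trial(data, match) for _ in range(5000)) / 5000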

Figure 5.2: Case B: Multiple matches. If placed in multiple buckets, a misdiagnosis.

Figure 5.3: Case C: Zero matches. A new bucket is created. Not considered a misdiagnosis, but would in practice increase the amount of manual diagnosis.

5.1.4 Evaluation method for machine learning precision

The evaluation method and learning approach are illustrated in Figure 5.4. First we needed to extract the stack traces from the crashes, strip irrelevant information and apply our variation of the recursion removal strategy. For the heuristic thresholds we selected the best performing value in terms of precision for each of the edit distance based methods. Prefix matching and complete matching were omitted from this evaluation as the concept of a signature crash is not applicable to them: all crashes within a group are equally comparable for a fixed threshold. The data set was randomly divided into a training set and a verification set. The algorithm was then trained on the training set, which results in a new set of signature crashes. Once the training was finished we started the testing by comparing each crash in the verification set to all the signature crashes in order to find the best scoring match, referred to as the algorithm's prediction. The predictions are then compared with the known causes to determine the overall precision of the predictions, which again is defined as the percentage of crashes correctly predicted as belonging together. To better determine the learning algorithm's performance, the process was repeated 5000 times and the results averaged. How the data is divided between the training set and the verification set can be varied. With a smaller training set the algorithm is likely to perform worse and be too coarse. With a larger training set the algorithm might perform better, but since the verification set gets smaller the outcome gets less trustworthy, and the risk of overfitting increases. Our approach was to divide the sets in various partitions ranging from 10 percent to 80 percent, i.e., for each group of known crashes, a percentage of the crashes was set aside for training, and the rest for verification. The result will tell us generally how accurate a machine learning approach is for this data set, and how to ideally partition the training set and the verification set. A sketch of one train-and-verify round is given below.
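The sketch reuses the hypothetical pick_signature helper from Section 4.4.1; the split logic and names are assumptions of ours, not the thesis's exact implementation.

    import random

    def ml_trial(crashes_by_bug, distance, train_frac=0.4):
        """Learn one signature crash per bug from a training split, then
        measure how often the best-scoring signature predicts the right
        bug on the verification split."""
        signatures, verification = {}, []
        for bug, dumps in crashes_by_bug.items():
            shuffled = random.sample(dumps, len(dumps))
            cut = max(1, int(len(shuffled) * train_frac))
            signatures[bug] = pick_signature(shuffled[:cut], distance)
            verification.extend((bug, c) for c in shuffled[cut:])
        correct = sum(
            bug == min(signatures, key=lambda b: distance(crash, signatures[b]))
            for bug, crash in verification)
        return correct / len(verification) if verification else 1.0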

5.2 Evaluation results

The results of the evaluation are presented in graphs; each of the four methods is evaluated on precision and grouping ability. For the edit-based methods, precision in combination with learning is also presented. What we hope to observe is that it is possible to group crashes with sufficient precision for an automated system to be trusted enough to allow for a reduction in necessary manual diagnosis.

[Flowchart: crashes with known cause pass through post-processing into processed stack traces, which are split into a training set and a verification set; signature crashes are determined from the training set, and each verification crash is matched against the signature crashes, comparing the prediction with the key.]

Figure 5.4: Machine learning methodology and evaluation.

An evaluation of three similar applications within a single sector does not provide enough data to establish statistical significance between the methods, and the results may, therefore, not be generalizable beyond the observations for this data set. However, the results may show a promising trend which future work could corroborate.

5.2.1 Precision evaluation results

The precision evaluation showed promising results, in particular for the methods allowing deviation. The highest combination of distance threshold and precision was achieved through the use of PM, which successfully matched up to 80 percent of the stack traces correctly with an equality threshold of 30 percent; results which are in line with other papers in the field [22].

The graphs illustrate how well precision is maintained while the distance threshold is increased. Precision is measured for four comparison methods: edit distance, ED; weighted edit distance, WED; prefix matching, PM; and complete matching, CM. Each method is also evaluated with three methods for recursion removal: no recursion removal, NORR; a greedy recursion removal, RR; and our approach involving removal of maximal repeats, denoted RR+.

[Three-panel plot: precision in % (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.5: Precision of edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Looking at the graphs individually and in more detail, we can gain further insight into the outcomes. In Figure 5.5 ED is evaluated. We can see that with a distance threshold of zero percent the precision is high, which is a common pattern for all the methods. The reason is that with zero deviation the matching is very strict and we only identify identical stack traces as being part of the same bug. This has, however, a significant impact on grouping ability, as the test data contains non-equal stack traces belonging to the same bug bucket; the effects of this are highlighted in the next section. Once we start introducing allowed deviation, the precision declines. For the edit distance the precision declines rapidly up until a relative distance threshold of 0.05, where the decline rate decreases. A reminder here is that the edit distance does not discriminate between differences at the top of the stack and at the bottom. When the relative distance threshold is increased beyond 0.05 the precision is further reduced, until eventually all stack traces are considered equal.

Another aspect is the recursion removal: when applying it we introduce allowed deviation by allowing recursion depth to differ. We can see that RR+, our variation of the recursion removal, performs generally better than no recursion removal, NORR, and strictly better than the recursion removal suggested by [22], RR. It is reasonable to assume that the greedier recursion removal removes an excessive amount of information, which in turn reduces precision for our data set, while our variation introduces enough deviation without being detrimental to the overall precision. Overall, the recursion removal had limited impact on the precision, within ±5 percent. The explanation could lie in the test data: recursions in the crashes could have been of consistent depth, in which case recursion removal has no benefits. However, given that recursion removal was not significantly detrimental to the precision, it could still be worth evaluating for a solution.

In Figure 5.6 the precision result for WED is shown. It performs better than ED, which is in line with the assumption that "guided" deviation could perform better than naive deviation. We can observe the difference by comparing where the precision decline rate decreases: for WED the rapid precision decline stops at a relative distance threshold of 0.1, while for ED it stops at 0.05. This difference is explained by WED allowing for potentially large deviations at the bottom of a stack trace, while allowing little to no deviation at the top. To justify this assumption, consider comparing two stack traces of length 15. If we modify the bottom two function calls, ED gives a deviation of $2/15 \approx 0.13$, ca. 13 percent, whereas WED gives $\frac{(15-14)+(15-13)}{\sum_{n=1}^{15} n} = \frac{3}{120} = 0.025$, a 2.5 percent deviation. If the same difference were among the top function calls, WED would give up to 25 percent deviation. WED could potentially be further tuned by having a non-linear weight increase, which could be useful in scenarios where large stack traces are compared, or any other variation. Similarly to ED, WED drops off linearly up until a point where all stack traces are considered equal.

The prefix matching outcome is shown in Figure 5.7, showing consistent performance past the equality threshold of 20 percent. It has a precision of up to 80 percent at an equality threshold of 30 percent, a high number that is in line with the evaluation of the method in [22]. The graph has a different characteristic than the edit distance based algorithms.

[Three-panel plot: precision in % (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.6: Precision of weighted edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Precision does not decrease past a certain point, because prefix matching becomes stricter the larger the equality threshold is. A significant upside of PM over ED and WED is that it is not as reliant on potentially volatile thresholds for where the precision decline slows down: the equality threshold can be set higher than where the peak values are, to be conservative and avoid the low-precision region, an option not available to ED and WED. An upside of the ED and WED algorithms is that they can be incorporated in a machine learning setup, which is covered in a later section.

Complete matching, as shown in Figure 5.8, is also largely covered by the other graphs by simply looking at the accuracies when thresholds are set for zero deviation. The interesting observation to be made here is that for such a simplistic measure as full equality, it performs well. For an institution where a misdiagnosis is never or rarely allowed to happen, this could be an ideal choice of method.

[Three-panel plot: precision in % (y-axis) against prefix equality threshold (x-axis); curves for RR, RR+ and NORR; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.7: Precision of prefix matching with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

5.2.2 Grouping evaluation results

The results of the grouping evaluation show a correlation between precision and grouping ability. Our test data contains 100 bugs distributed across three applications; the ideal result is then 100 buckets in total at the point where the precision for an algorithm is as high as possible. The graphs in Figures 5.9-5.12 illustrate how many groups are created as the distance threshold is increased.

[Bar chart: precision in % of complete matching for RR, RR+ and NORR on applications Small, Medium and Large.]

Figure 5.8: Precision of complete matching.

The ideal point is marked with a dotted line along the y-axis; it is where the number of buckets is at its lowest and the precision at its highest. The number of unique bugs for an application is denoted by a dotted line along the x-axis. Grouping ability is measured for four comparison methods: edit distance, ED; weighted edit distance, WED; prefix matching, PM; and complete matching, CM. Each method is again evaluated with three methods for recursion removal: no recursion removal, NORR; a greedy recursion removal, RR; and our approach involving removal of maximal repeats, denoted RR+.

The correlation is not entirely surprising, but the results do serve to show that the scenario described in Section 5.1.3, where an excessive number of new buckets is created, is common when the distance threshold is set low. The peak grouping performance was achieved by the WED algorithm, which produced 110 buckets in total at a precision of 75 percent.

Despite the correlation between precision and grouping ability, it is still worthwhile going over each grouping result in more detail. For ED, in Figure 5.9, when deviation is set to a minimum the number of buckets is high and we clearly see the escalating effect of many new buckets being created, as described in Section 5.1.3. Once deviation is introduced, the number of buckets quickly declines and hovers near the original number of buckets. At the ideal point ED produced a total of 163 buckets. Once the relative distance threshold goes beyond 0.6, the number of buckets approaches exactly 100, which is explained by comparisons getting too inaccurate and crashes belonging to different bugs getting bundled together.

With similar characteristics to ED, WED's results are shown in Figure 5.10; it produced the number of buckets closest to 100 at peak precision, specifically 110 buckets.

[Three-panel plot: number of buckets (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.9: Grouping ability of edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

[Three-panel plot: number of buckets (y-axis) against distance threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.10: Grouping ability of weighted edit distance with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Given the small variation in the number of groups at the ideal point between the methods, it is hard to attribute this performance to anything specific in WED, other than the same arguments made for WED's precision over ED. The PM results, again, show a different characteristic than the distance based methods. When the prefix equality threshold is close to zero the number of buckets is close to 100, which is attributed to the impreciseness of the comparisons. It produces 151 buckets at its ideal point, similar to the ED result. Complete matching, shown in Figure 5.12, is, similarly to before, covered by the other graphs by looking at the grouping ability when thresholds are set for zero deviation.

[Three-panel plot: number of buckets (y-axis) against prefix equality threshold (x-axis); curves for RR, RR+ and NORR; dotted lines mark the ideal point and the number of unique bugs; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.11: Grouping ability of prefix matching with a varying equality threshold. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Full equality produces 660 buckets, which is more than the other methods, but given the low likelihood of misdiagnosis it performs well, potentially reducing the amount of manual diagnosis by up to 60 percent. The results also show that deviating from the ideal point by 10 percent can reduce the number of groups by up to 20 percent, further decreasing the amount of necessary manual diagnosis. However, as this introduces a higher likelihood of misdiagnosis, since precision is lower, it is a strategy fitted for scenarios where a misdiagnosed crash dump is likely to be reported again.

[Bar chart: number of buckets produced by complete matching for RR, RR+ and NORR on applications Small, Medium and Large.]

Figure 5.12: Grouping ability of complete matching.

5.2.3 Machine learning evaluation results

The results of the machine learning evaluation are shown in Figure 5.13. As previously mentioned, only ED and WED were suited for learning. This is because determining an optimal signature crash is only applicable if one crash can be considered generally closest to all the other crashes in a bucket, with the goal of increasing the likelihood of further matches. For prefix matching and complete matching, determining closeness is irrelevant, as the stack traces have to be equal from the top of the stack down to some point; changing a signature crash has no effect on the likelihood of matching. With 10 percent of the data used for training, the learned algorithm performs only as well as an untrained ED and WED implementation. However, when the percentage is increased to 40 percent and beyond we see an improvement, with accuracies up to 90 percent, outperforming PM by 10 percentage points. Individual accuracies increased by up to 20 percent for both WED and ED. Past 50 percent of the data being used for training, no further precision increase is achieved; instead precision starts to decrease, which could be attributed to overfitting. The precision results and increases are in line with a similar experiment done by [21], where a different matching method was used.

[Three-panel plot: precision in % (y-axis) against training data distribution in % (x-axis); curves for ED and WED; panels (a) Application Small, (b) Application Medium, (c) Application Large.]

Figure 5.13: Precision of the machine learning algorithm with a varying training-to-verification data ratio. One graph for each application: Small, top-left; Medium, top-right; Large, bottom.

Chapter 6

Conclusion

In this thesis we evaluated methods for comparing crash dumps. We compared crash dumps based on their symptoms, with the purpose of minimizing re-diagnosis of already known errors. We approached the problem by investigating related work and extracting suggested methodology, with the goal of comparing the methods against certain criteria. As we progressed in our research it became clear that there is no single best method, but that certain methods are better suited for certain circumstances. In order to evaluate this and categorize the methods, we consider the concept of misdiagnosis cost, which allows us to make educated decisions about when to focus on grouping ability by choosing an approach with a higher likelihood of misdiagnosis, or when to choose an approach which lacks grouping ability but has a low risk of misdiagnosis. For this we introduced a set of encapsulating guidelines which help determine when the misdiagnosis cost is high or low. We reiterate:

1) If the number of expected crashes is low, and the risk of unique crashes is high, the cost of misdiagnosis is high as an error could be erroneously marked as fixed and remain in the system for a long period.

2) If the number of expected crashes is high (usually in systems with large user bases) where errors are hit frequently, the cost of misdiagnosis is low.

Analyzing the results it is clear that, while being the least likely to misdiagnose, complete matching is lacking in grouping ability. This is due to the strict nature of complete matching, where stack traces have to be identical, making it ideal for cases where the misdiagnosis cost is high. The other methods performed better in terms of grouping ability, where they could reduce


the amount of manual diagnosis by up to 90 percent with a precision of up to 90 percent. Our solution for a weighted edit distance showed an increase in precision and grouping ability over the non-weighted edit distance, which further solidifies the theory of giving function calls higher up in the stack additional value. Prefix matching was the top performer in precision when not taking the machine-learning-enhanced edit distance and weighted edit distance into consideration. With machine learning, accuracies as high as 90 percent were achieved; such a high precision might even allow institutions with a high misdiagnosis cost to consider this approach. A limitation of the distance based approaches is that the precision is only high at explicit distance thresholds, and they are, therefore, likely best suited for monolithic applications with a homogeneous code base and error handling. Given this detail, prefix matching stands out as the most robust solution, as it provides high precision, high grouping ability and has low complexity.

In addition to comparing the matching methods, we attempted to normalize the individual stack traces by removing repeating recursive calls before comparison. We did this using two removal methods: one suggested by Modani et al., and one designed by us, denoted RR+ in the graphs. In general we noticed a 5 percent increase in precision and grouping ability using our variation of recursion removal, while the approach suggested by Modani et al. decreased the same values by up to 5 percent [22]. However, in the experiments conducted by Modani et al., an increase in precision was observed, a variation potentially caused by differences in test data. Overall we suggest that the limited impact of recursion removal could be attributed to the crashes having a consistent recursion depth in our data set, as other papers have observed higher gains.

Chapter 7

Future work

Using automated systems to reduce the amount of manual diagnosis looks promising, and we have shown that one can accurately reduce the amount of required re-diagnosis by up to 90 percent. Going forward there are still several areas that require further investigation and analysis. In our work we omitted data-oriented symptoms, which could further increase the accuracy and reliability of automated systems. We also acknowledge that our evaluation was limited to three standalone programs; before the results can be generalized, a wider range of programs needs to be evaluated. It is also common for multiple programs, across multiple operating systems, to interact, and determining the cause of such an error might require information retrieval from multiple sources. Closely related to this is analyzing errors caused by hardware faults and how they could be distinguished from software errors. Furthermore, our work is limited to a linearly weighted edit distance algorithm; this study could be extended by investigating non-linear variations.

Our work provides an overview of existing methods. Given the problem of creating an accurate system for grouping crash dumps, we have shown that under the right circumstances it is possible, in turn allowing for a reduction in the necessary amount of manual diagnosis.

Bibliography

[1] Lee et al. "Identifying Software Problems Using Symptoms." In: FTCS. IEEE Computer Society, 1994, pp. 320–329. isbn: 0-8186-5520-8. url: http://dblp.uni-trier.de/db/conf/ftcs/ftcs94.html#LeeIM94.
[2] Wei Le and Daniel Krutz. How to Group Crashes Effectively: Comparing Manually and Automatically Grouped Crash Dumps. Tech. rep. Rochester Institute of Technology, 2012.
[3] About illumos. [Online; accessed 26-08-2014]. url: http://wiki.illumos.org/display/illumos/About+illumos.
[4] The Security Development LifeCycle. [Online; accessed 28-01-2019]. url: https://social.technet.microsoft.com/wiki/contents/articles/7100.the-security-development-lifecycle.aspx.
[5] Håkon Krohn-Hansen. Program Crash Analysis: Evaluation and Application of Current Methods. Tech. rep. University of Oslo, Department of Informatics, 2012.
[6] Patrice Godefroid, Michael Y. Levin, David A. Molnar, et al. "Automated whitebox fuzz testing." In: NDSS. Vol. 8. 2008, pp. 151–166.
[7] Scott Piper. System for automatically collecting and analyzing crash dumps. US Patent 9,378,368. June 2016.
[8] Adrian Schroter et al. "Do stack traces help developers fix bugs?" In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010). IEEE, 2010, pp. 118–121.
[9] 'panic.c' comments in illumos repository. [Online; accessed 26-08-2014]. url: https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/panic.c.


[10] The Modular Debugger. [Online; accessed 26-08-2014]. url: http://illumos.org/books/mdb/intro-1.html.
[11] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.
[12] Avrim Blum. "On-line algorithms in machine learning". In: Online algorithms. Springer, 1998, pp. 306–325.
[13] Nick Littlestone. "From on-line to batch learning". In: Proceedings of the second annual workshop on Computational learning theory. 1989, pp. 269–284.
[14] Kilian Q. Weinberger, John Blitzer, and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification". In: Advances in neural information processing systems. 2006, pp. 1473–1480.
[15] Rich Caruana and Alexandru Niculescu-Mizil. "Data mining in metric space: an empirical analysis of supervised learning performance criteria". In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 69–78.
[16] Hyunmin Seo and Sunghun Kim. "Predicting recurring crash stacks". In: Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on. IEEE, 2012, pp. 180–189.
[17] Rongxin Wu et al. "CrashLocator: locating crashing faults based on crash stacks". In: Proceedings of the 2014 International Symposium on Software Testing and Analysis. ACM, 2014, pp. 204–214.
[18] Kirk Glerum et al. "Debugging in the (very) large: ten years of implementation and experience". In: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles. ACM, 2009, pp. 103–116.
[19] !exploitable Crash Analyzer - MSEC Debugger Extensions. [Online; accessed 23-10-2018]. url: https://archive.codeplex.com/?p=msecdbg.
[20] Dhaliwal et al. "Classifying field crash reports for fixing bugs: A case study of Mozilla Firefox". In: Software Maintenance (ICSM), 2011 27th IEEE International Conference on. IEEE, 2011, pp. 333–342.

[21] Mark Brodie et al. "Automated Problem Determination Using Call-Stack Matching". In: Journal of Network and Systems Management 13.2 (2005), pp. 219–237. issn: 1064-7570. doi: 10.1007/s10922-005-4443-8. url: http://dx.doi.org/10.1007/s10922-005-4443-8.
[22] Natwar Modani et al. "Automatically Identifying Known Software Problems". In: 2007 IEEE 23rd International Conference on Data Engineering Workshop (2007), pp. 433–441. doi: 10.1109/ICDEW.2007.4401026.
[23] Gonzalo Navarro. "A Guided Tour to Approximate String Matching". In: ACM Comput. Surv. 33.1 (Mar. 2001), pp. 31–88. issn: 0360-0300. doi: 10.1145/375360.375365. url: http://doi.acm.org/10.1145/375360.375365.
[24] Inhwan Lee. Software Dependability in the Operational Phase. Tech. rep. Department of Electrical and Computer Engineering, University of Illinois, 1995.
[25] ChenNa Lian, Mihail Halachev, and Nematollaah Shiri. "Searching for Supermaximal Repeats in Large DNA Sequences". In: Bioinformatics Research and Development. Vol. 13. Communications in Computer and Information Science. Springer Berlin Heidelberg, 2008, pp. 87–101. isbn: 978-3-540-70598-7. doi: 10.1007/978-3-540-70600-7_7. url: http://dx.doi.org/10.1007/978-3-540-70600-7_7.
[26] Roman Kolpakov and Gregory Kucherov. Finding Repeats With Fixed Gap. Research report RR-3901. INRIA, 2000, p. 15. url: http://hal.inria.fr/inria-00072753.
[27] Detection of all maximal repeats in strings, a python implementation. [Online; accessed 26-08-2014]. url: https://code.google.com/p/py-rstr-max/.
[28] Toru Kasai et al. "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications". In: Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching. CPM '01. London, UK: Springer-Verlag, 2001, pp. 181–192. isbn: 3-540-42271-4. url: http://dl.acm.org/citation.cfm?id=647820.736222.

[29] Socorro: Mozilla's Crash Reporting System. [Online; accessed 26-08-2014]. url: http://blog.mozilla.org/webdev/2010/05/19/socorro-mozilla-crash-reports/.
[30] Kompotis et al. Identifying and grouping program run time errors. US Patent 9,009,539. Apr. 2015.

TRITA-EECS-EX-2019:22

www.kth.se