Post Mortem Crash Analysis
Johan Heander & Magnus Malmborn
January 14, 2007

Abstract

To improve the quality and reliability of embedded systems it is important to gather information about errors in units already sold and deployed. To achieve this, a system for transmitting error information from the customer back to the developers is needed, and the developers must also have a set of tools to analyze the error reports.

The purpose of this master thesis was to develop a fully functioning demonstration system for collection, transmission and interpretation of error reports from Axis network cameras using the Linux operating system.

The system has been shown to handle both kernel and application errors and conducts automatic analysis of received data. It also uses the standard HTTP protocol for all network transfers, making it easy to use even on firewalled networks.

Acknowledgement

We would like to thank our LTH supervisor Jonas Skeppstedt for all he has taught us about computer science in general and operating systems and the C programming language in particular. We would also like to thank Mikael Starvik at Axis Communications for quickly providing us with all hardware and information we needed to complete this thesis, and for providing us with support during our implementation and writing. Finally we thank all the developers working at Axis Communications, many of whom have provided input and reflections on our work.

Contents

1 Introduction
  1.1 Problem description
  1.2 Problem analysis
2 Background
  2.1 Kernel crashes
  2.2 User space crashes
    2.2.1 Core dump generation under Linux
  2.3 Existing products
    2.3.1 netdump
  2.4 Other systems
    2.4.1 Windows XP
    2.4.2 Mac OS X
    2.4.3 Solaris 10
3 System
  3.1 Target platform
  3.2 Desired information
  3.3 User space errors
  3.4 Core dump compression
    3.4.1 Run length encoding
    3.4.2 LZMA
  3.5 The netcore file system
  3.6 Kernel errors
    3.6.1 Extending the kernel Oops
  3.7 Core dump analysis
    3.7.1 Linux core files
    3.7.2 Call trace and the CRIS architecture
    3.7.3 The coreinfo tool
  3.8 Fingerprinting
    3.8.1 Core dump fingerprinting
    3.8.2 Recursion errors
    3.8.3 Oops fingerprinting
  3.9 Customer server
  3.10 Axis server
4 Validation
  4.1 Validation of crash reports
    4.1.1 User space crashes
    4.1.2 Kernel crashes
  4.2 Compression
5 Conclusion
  5.1 Discussion
    5.1.1 Signal handlers
    5.1.2 Patents
  5.2 Further improvements
6 Summary
A Dictionary
Bibliography

Chapter 1 Introduction

To improve the long-term reliability of embedded systems it is important to follow up and analyze faults in deployed units. This can be difficult, since a deployed system generally is, and should be, inaccessible to the developers; thus it is necessary to allow the customer to report errors as they occur, at her discretion. Since software crashes are unfortunately fairly common, it is easy to be overloaded with debug information unless there is a system in place to automate the analysis.

The target products for this thesis are network cameras from Axis Communications. These cameras are embedded devices with limited hardware resources, running a standard Linux kernel and a number of applications handling video processing and user interaction.

Definitions of terms used in this thesis can be found in appendix A.

1.1 Problem description

This thesis project has derived:

• Means to save debug information from the cameras when a kernel or application error occurs and to propagate it back to Axis.
• Tools for automated analysis of collected debug information.

The thesis has also developed a prototype system demonstrating these steps, running on real camera hardware.

1.2 Problem analysis

The problem led to the following specific sub-tasks:

• Investigation of what kind of debug information the developers want from a crashed system, and whether any additional useful information can be extracted at a low additional cost.
• Development of a reliable fingerprinting algorithm to recognize multiple reports of the same error.
• Creation of a system for transferring debug information from the camera back to Axis.
• Development of an aggregation scheme for grouping and ranking error reports.

The prototype system consists of four different parts. One part runs on the camera to detect crashes, extract error information and transfer the error information to the customer. Somewhere on the customer's network there is software that receives error reports from the cameras and presents them to the user, so that she can authorize the sending of information to Axis. At Axis there is server software that receives error reports from all the customers, as well as various tools that can be used by the Axis developers to further analyze the reports. The general structure is shown in figure 1.1.

Figure 1.1: System network architecture
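The thesis only requires that all of these transfers use plain HTTP; it does not mandate a particular client implementation. Purely as an illustration, a report upload from the camera to the customer server could look roughly like the following libcurl-based sketch, where the endpoint URL and the request body are hypothetical placeholders, not taken from the thesis:

/* Illustrative sketch only: post a crash report to the customer server
 * over plain HTTP using libcurl. The URL and the body are placeholders. */
#include <curl/curl.h>

int main(void)
{
    CURL *curl;
    CURLcode res = CURLE_FAILED_INIT;
    /* In the real system the body would be the collected crash report;
     * a short placeholder string keeps this sketch small. */
    const char *body = "report-id=42&data=...";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    curl = curl_easy_init();
    if (curl != NULL) {
        /* Hypothetical customer-server endpoint, ordinary HTTP on port 80. */
        curl_easy_setopt(curl, CURLOPT_URL,
                         "http://customer-server.local/crash-report");
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}

Keeping the transfer to an ordinary HTTP POST is what allows the reporting to work even on firewalled networks that only permit web traffic.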
The prototype system demonstrates functional implementations of all four parts. The most work has been put into the parts running on the camera and the tools for report analysis. These parts should be reliable and ready to use in production systems. The servers at the customer and at Axis need to be integrated into the existing software infrastructure in order to be suitable for production systems, and no attempts have been made to develop this kind of integration. The goal was only to develop a fully functional demonstration system.

The software parts running on the camera should not use the processor at all during normal system operation, and should incur no extra overhead on normal program execution. During crashes, processor usage should be kept as low as possible to ensure fast error handling. Memory usage should be kept very low during both normal operation and crashes to avoid disturbing the video software. On the customer server, disk space usage should be kept low, but the server can be assumed to have abundant processor and memory resources. The Axis crash report server is dedicated to handling crash reports and has plenty of disk space, memory and processor resources, as well as access to source code and binary debugging information.

Chapter 2 Background

Software errors can be of many different types [12], e.g.:

logic errors: The program is valid, but does something other than what was intended, e.g. printing the wrong range of pages.

deadlocks: The system hangs when two or more parts stop to wait for each other. The situation is akin to two people meeting in a doorway; each needs to wait for the other to move out of the way, and, since neither computers nor people back up voluntarily, neither can move because the other is blocking the way.

illegal memory references: A program tries to access memory it is not allowed to, and gets terminated by the operating system.

memory corruption: A program accidentally writes to the wrong place in memory, destroying the previous data. Later, when the previous data is needed, the newly written data is read back instead, causing unpredictable errors.
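The last two categories are the kind of failures a post mortem system can catch at the moment they happen. The following intentionally buggy C program, added here purely as an illustration (it is not taken from the thesis), shows both a memory corruption that goes unnoticed until the damaged value is read back and an illegal memory reference that the operating system stops immediately:

/* Intentionally buggy illustration of two of the error classes above. */
#include <stdio.h>
#include <string.h>

struct record {
    char name[8];
    int checksum;      /* stored directly after name[] in memory */
};

int main(void)
{
    struct record r;
    int *p = NULL;

    r.checksum = 42;

    /* Memory corruption: "overflowing" is 11 characters plus the
     * terminating NUL, which does not fit in name[8]; the copy spills
     * into checksum and silently destroys it. The damage only becomes
     * visible later, when the corrupted value is read back. */
    strcpy(r.name, "overflowing");
    printf("checksum is now %d\n", r.checksum);

    /* Illegal memory reference: writing through a NULL pointer. The
     * operating system terminates the process with SIGSEGV, which is
     * what triggers core dump generation. */
    *p = 1;

    return 0;
}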
Some of these errors are detected automatically, e.g. illegal memory references, since they are so obviously illegal that the computer cannot execute them; other errors are more a question of whether the program behaves as the user expects it to or not. In these cases only the user can decide when an error has occurred.

The handling of illegal software operations differs significantly depending on whether the error is raised by the operating system kernel itself or by an ordinary user space process. During user space crashes the rest of the system is usually fully functional and can be used to process the crash data. Kernel crashes are more complicated, since no part of the system can be trusted to be in a consistent state.

2.1 Kernel crashes

Under Linux, if an error is detected while running in kernel mode, the kernel outputs a short plain-text report, called a kernel Oops, containing a short description of the error together with a dump of register values, the top of the stack, a short call trace and the bytes around the current value of the program counter; see the example in figure 2.1. Most other operating systems behave in a similar way, but the message is called something else. Under Linux the kernel Oops is printed using the general kernel debug printing function printk(). As can be seen, the format is fairly compact; the size of a kernel Oops is generally below 2 KiB but can, in the worst case, grow to about 20 KiB for recursion errors.

Oops: 0002
Modules linked in: oopsmod2[d0074000+788] artpec_2[d0094000+26544]
CPU: 0
ERP: d0074006  SRP: c001db88  CCS: 40028008  USP: 00000000  MOF: 00000140
 r0: d00745f0   r1: c31dd200   r2: 400000a8   r3: c31dd208
 r4: ffffe000   r5: d00745ec   r6: 00000000   r7: d0074000
 r8: 00000003   r9: 4000002a  r10: 00000000  r11: d00745f4
r12: c31dd210  r13: c31dd208  oR10: 00000000  acr: 00000000
 sp: c07ddf0c
Data MMU Cause: 000002a5
Instruction MMU Cause: d00740a5
Process oopsmod2queue/0 (pid: 1281, stackpage=c04a8780)
Stack from c07ddf6c:
    c31dd208 00000001 c31dd200 c0020ae2 c001dbee fffffffc c00075b6 00000000
    00000000 c001dcfc c00209f8 c05f9f30 c000ba62 c31dd200 ffffffff ffffffff
    00000001 00000000 c000b902 00010000 00000000 00000000 c04a8780 c000b902
Call Trace: [<c0020ae2>] [<c001dbee>] [<c00075b6>] [<c001dcfc>] [<c00209f8>]
    [<c000ba62>] [<c000b902>] [<c000b902>]
Code: -- -- -- -- -- -- 7f 86 4f 9e 2a 00 (cf) 9b 7f 9d 14 06 00 00 a9 0b 20 20

Figure 2.1: Typical kernel Oops

Analyzing a kernel Oops is the simplest form of kernel debugging, but some tools are available that, instead of saving just the kernel Oops, save a complete dump of the system memory.
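Whichever approach is used, the Oops text itself is emitted through printk() and therefore ends up in the kernel log ring buffer together with all other kernel messages, where a user space helper can read it back afterwards, provided the kernel survives long enough. How the prototype actually collects kernel error information is covered later in the thesis; the snippet below is merely a rough sketch of reading the ring buffer through glibc's klogctl() wrapper for the syslog(2) system call:

/* Rough sketch (not the thesis implementation): dump the kernel ring
 * buffer, where printk() output such as an Oops ends up. Requires
 * sufficient privileges, typically root. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/klog.h>

int main(void)
{
    /* 10 = SYSLOG_ACTION_SIZE_BUFFER: query the ring buffer size. */
    int size = klogctl(10, NULL, 0);
    char *buf;
    int len;

    if (size <= 0)
        return 1;
    buf = malloc(size + 1);
    if (buf == NULL)
        return 1;

    /* 3 = SYSLOG_ACTION_READ_ALL: read the whole buffer without
     * clearing it; any Oops that has occurred is part of this text. */
    len = klogctl(3, buf, size);
    if (len < 0) {
        free(buf);
        return 1;
    }
    buf[len] = '\0';
    fputs(buf, stdout);

    free(buf);
    return 0;
}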