
Windows XP Kernel Crash Analysis

Archana Ganapathi, Viji Ganapathi, and David Patterson – University of California, Berkeley

ABSTRACT

PC users have started viewing crashes as a fact of life rather than a problem. To improve dependability, systems designers and programmers must analyze and understand failure data. In this paper, we analyze Windows XP kernel crash data collected from a population of volunteers who contribute to the Berkeley Open Infrastructure for Network Computing (BOINC) project. We found that OS crashes are predominantly caused by poorly-written device driver code. Users as well as product developers will benefit from understanding the crash behaviors elaborated in this paper.

Introduction

Personal Computer (PC) reliability has become a rapidly growing concern for computer users as well as product developers. Personal computers running the Microsoft Windows operating system are often considered overly complex and difficult to manage. As modern operating systems serve as a confluence of a variety of hardware and software components, it is difficult to pinpoint unreliable components. Such unconstrained flexibility allows complex, unanticipated, and unsafe interactions that result in an unstable environment, often frustrating the user. To troubleshoot recurring problems, it is beneficial to data-mine, analyze, and document every interaction for erroneous behaviors. Such failure data provides insight into how computer systems behave under varied hardware and software configurations.

To improve dependability, systems designers and programmers must understand operating system failure data. In this paper, we analyze crash data from a small number of Windows machines. We collected our data from a population of volunteers who contribute to the Berkeley Open Infrastructure for Network Computing (BOINC) project. As our analysis is based on a small amount of data (with a self-selection bias due to the nature of BOINC), we acknowledge that our results do not represent the entire PC population. Nonetheless, the data reveals several useful results for PC users as well as researchers and product developers.

Most Windows users have experienced at least one "bluescreen" during the lifetime of their machine. A sophisticated PC user will accept Windows crashes as a fact and attempt to cope with them. However, a novice user will be terrified by the implications of a crash and will continue to be preoccupied with the thought of causing severe damage to the computer. Analyzing failure data can help users gauge the dependability of various products and understand the source of their crashes.

From a research perspective, the motivation behind failure data-mining is manifold. First, it reveals the dominant failure cause of popular systems. In particular, it identifies products that cause the most user frustration, thus facilitating our efforts to build stable, resilient systems. Furthermore, it enables product evaluation and development of benchmarks that rank product quality. These benchmarks can influence design prototypes for reliable systems.

Within an organization, analyzing failure data can improve quality of service. Often, corporations collect failure data to evaluate causes of downtime. In addition, they perform cost-benefit analysis to improve service availability. Some companies extend their analyses to client sites by gathering failure data at deployment locations.

For example, Microsoft Corporation collects crash data for their Windows operating system as well as applications used by their customers. Unfortunately, due to legal concerns, corporations such as Microsoft will usually not share their data with academic research groups. Companies do not wish to reveal their internal vulnerabilities, nor can they share third party products' potential weaknesses. In addition, many companies disable the reporting feature after viewing proprietary data in the report. While abundant failure data is generated on a daily basis, very little is readily sharable with the research community.

The remainder of this paper describes our data collection and analysis methodology, including related work in the areas of system dependability and failure data analysis, background information about Windows crash data and the data collection process, crash data analysis and results, a discussion of the merits of potential extensions to our work, and a conclusion.

Related Work

Jim Gray's work [Gra86, Gra90] serves as a model for most contemporary failure analysis work. Gray did not perform root cause analysis but rather outage cause analysis, which considers the last fault in the fault chain. In 1989, he found that the major source of outages was software, contributing about 55%, far outrunning its immediate successor, system operations, which contributed 15%. This observation led him to blame software for almost every failure. In an earlier study [G05, GP05], we analyzed Windows application crashes to understand causal relationships at the user level. Departing from Gray's outage cause analysis, in our study we perform root cause analysis under the assumption that the first crash in a sequence of crashes is responsible for all subsequent crashes within that event chain.

The past two decades have produced several studies in root-cause analysis for operating systems (OS), ranging from Guardian OS and Tandem NonStop-UX to VAX/VMS and Windows NT [Gra90, Kal98, LI95, SK+00, SK+02, TI92, TI+95]. In server environments, Tandem computers, VAX clusters, as well as several operating systems and file servers have been examined for software defects by several researchers. Lee and Iyer focused on software faults in the Tandem GUARDIAN operating system [LI95], Tang and Iyer considered two VAX clusters running the VAX/VMS operating system [TI92], and Sullivan and Chillarege examined software defects in MVS, DB2, and IMS [SC91]. Murphy and Gent also focused on system crashes in VAX systems over an extended period, almost a decade [MG95]. They concluded that system management was responsible for over 50% of failures, with software trailing at 20%, followed by hardware, which was responsible for about 10% of failures.

While examining NFS data availability in Network Appliance's NetApp filers, Lancaster and Rowe attributed power failures and software failures as the largest contributors to downtime; operator failure contributions were negligible [LR01]. Thakur and Iyer examined failures in a network of 69 SunOS workstations [TI96]. They divided problem root causes into network, non-disk, and disk-related machine problems. Kalyanakrishnam, et al. perused six months of event logs from a LAN comprising Windows NT workstations that delivered email [KK+99]. Using a state machine model of detailed system failure states to describe failure timelines on a single node, they concluded that most automatic system reboot problems are software-related; the average downtime is two hours. Similarly, Xu, et al. considered Windows NT event log entries related to system reboots for a network of workstations used for enterprise infrastructure, allowing operators to annotate event logs to indicate the reason for reboot [XK+99].

In this progression, our study of Windows' crash data gauges the evolution of PC reliability. Koopman, et al. test operating systems against the POSIX specification [KD00]. Our study is complementary to this work as we consider actual crash data that leads to OS unreliability. Recently, for Windows XP machines, Murphy deduced that display drivers were a dominant crash cause and that memory is the most frequently failing hardware component [Mur04]. We extend this work by studying actual crash instances experienced by users rather than injecting artificial faults as performed by fuzz testing [FM00]. Our study of crash data differs from error log analysis performed by Kalakech, et al. [KK+04]; we determine the cause of crashes in addition to time and frequency.

Several researchers have provided insights on benchmarking and failure data analysis [BC+02, BS97, OB+02, WM+02]. Wilson, et al. suggest evaluating the relationship between failures and service availability [WM+02]. Among other metrics, when evaluating dependability, system stability is a key concern. Ganapathi, et al. examine Windows XP registry problems and their effect on system stability [GW+04]. Levendel suggests using the catastrophic nature of failures to evaluate system stability [Lev89]. Brown, et al. provide a practical perspective on system dependability by incorporating users' experience in benchmarks [BC+02, BS97]. In our study of crashes, we consider these factors when evaluating various applications.

Overview of Crashes and Crashdumps

A crash is an event caused by a problem in the operating system (OS) or an application (app) that requires an OS or app restart. App crashes occur at user level and typically involve restarting the crashing application. An OS crash occurs at kernel level and is usually caused by memory corruption, bad drivers, or faulty system-level routines. OS crashes are more frustrating than application crashes as they require the user to kill and restart the Windows Explorer process at a minimum, more commonly forcing a full machine reboot. While there are a handful of crashes due to memory corruption and other common systems problems, a majority of these OS crashes are caused by device drivers. These drivers are related to various components such as display monitors, network and video cards.

Upon each OS crash or bluescreen generated by the operating system, Windows XP collects failure data as a minidump. Users have three different options for the amount of information that is collected upon a crash. We use the default (and smallest) option of collecting small dumps, which are only 64K in size. These small minidumps contain a partial snapshot of the computer's state at the time of crash. They include a list of loaded drivers, the names and timestamps of binaries that were loaded in the computer's memory at the time of crash, the processor context for the stopped process, and process information and kernel context for the stopped process and thread, as well as a brief stack trace. We do not collect personal data files for our study. However, portions of such data may be resident in memory at the time of crash and will consequently appear in our crash dumps. To prevent personal data from inadvertently being sent, crash reporting may be disabled or the user can choose not to send a particular crash report.

When an OS crash occurs, typically the entire machine must be rebooted. Any relevant information that can be captured before the reboot is saved in a .dmp file in the %windir%\Minidump directory. These minidumps are uniquely named with the date of the crash and a serial number to eliminate conflicting names for multiple crashes on the same day.

Overview of BOINC Crash Collector

Berkeley Open Infrastructure for Network Computing (BOINC) is a platform for pooling computer resources from volunteers to collect data and run distributed computations [And03]. A popular example of an application using this platform is SETI@home, which aggregates computing power to "search for extraterrestrial intelligence." BOINC provides services to send and receive data from its users via the HTTP protocol using XML-formatted files. It allows application writers to run and maintain a server that can communicate with numerous client machines through a specified Application Programmer Interface (API). Each subscribed user's machine, when idle, is used to run BOINC applications. Project groups can create project web sites with registration services for users to subscribe and facilitate a project. The web site can also display statistics for contributing users.

Taking advantage of these efforts, we have created a data collection application to run on this platform. BOINC provides a good opportunity to collect and aggregate data from users outside our department while addressing privacy concerns. BOINC anonymizes user information while allowing us to correlate data from the same user. We have written tools to read minidumps from users' machines and send the data to our BOINC server. The drawback of this mechanism is that we can only collect crash dumps that are stored in known locations on the user's computer, consequently excluding application crash dumps that are stored in unknown app-specific locations. Furthermore, configuring the BOINC server is a tedious and meticulous task. We must also monitor the number of work units we allot for the BOINC projects; if there are not enough work units, the application will not run on client machines.

An attractive aspect of using BOINC is that we can add more features to our application as and when necessary. We can also provide users with personalized feedback pages, consequently rewarding the users with an incentive for sharing data. However, we must verify the integrity of each crashdump we receive from the users; users often create files in the crashdump directory to inflate their crash contribution ranking.

We use a combination of Microsoft's analysis tools and custom-written scripts to parse, filter, and analyze the crash data. Received crash dumps are parsed using Microsoft's "Debugging Tools for Windows" (WinDbg), publicly available at http://www.microsoft.com/whdc/devtools/debugging/default.mspx. We retrieve debugging symbols from Microsoft's publicly available symbol server (http://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx). Parsing crash dumps using WinDbg reveals the module that caused the crash as well as the proximate cause of the crash via an error code of the crashing routine. The drawback of this approach is that we rely on the completeness and accuracy of Microsoft's symbols. For legal reasons, Microsoft does not make third party debugging symbols available, especially those related to antivirus and firewall software.

We have conducted experiments and noted that 10% of crashdumps parsed with publicly available symbols have different analysis results compared to results when parsed with Microsoft's internal symbols. Microsoft-written components such as ntoskrnl take the blame for several third party and antivirus/firewall-related crashes.
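To make the parsing step concrete, the following Python sketch shows how a batch of minidumps could be fed through the command-line debugger (cdb.exe) that ships with Debugging Tools for Windows and reduced to a few summary fields. The install path, symbol-server string, and the particular fields extracted are illustrative assumptions, not the exact scripts used in this study.

    import re
    import subprocess
    from pathlib import Path

    # Assumed locations; adjust for the local Debugging Tools install and dump directory.
    CDB = r"C:\Program Files\Debugging Tools for Windows\cdb.exe"
    SYMBOLS = r"srv*C:\symcache*http://msdl.microsoft.com/download/symbols"
    DUMP_DIR = Path(r"C:\Windows\Minidump")

    def analyze(dump):
        """Run '!analyze -v' on one minidump and pull out a few summary fields."""
        out = subprocess.run(
            [CDB, "-z", str(dump), "-y", SYMBOLS, "-c", "!analyze -v; q"],
            capture_output=True, text=True, timeout=300).stdout
        summary = {}
        for key in ("IMAGE_NAME", "MODULE_NAME", "BUGCHECK_STR", "DEFAULT_BUCKET_ID"):
            match = re.search(key + r":\s+(\S+)", out)
            summary[key] = match.group(1) if match else "unknown"
        return summary

    if __name__ == "__main__":
        for dump in sorted(DUMP_DIR.glob("*.dmp")):
            info = analyze(dump)
            print(dump.name, info["BUGCHECK_STR"], info["IMAGE_NAME"])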
Once crash dumps are parsed by WinDbg, the importance of filtering data is evident. When a computer crashes, the application or entire machine is rendered unstable for some time, during which a subsequent crash is likely to occur. Specifically, if a particular piece of hardware is broken, or part of memory is corrupt, repeated use is likely to reproduce the error. It is inaccurate to double-count subsequent crashes that occur within the same instability window. To avoid clustering unrelated events while capturing all related crash events, we cluster individual crash events from the same machine based on temporal proximity of the events. The data that is collected can be used to gather a variety of statistics. We can provide insight to the IT team about the dominant cause of crashes in the organization and how to increase product reliability. We can also use crash behavior to track any potential vulnerability, as frequent crashes may be a result of malware on the machine. In the long run, we may be able to develop a list of safe and unsafe hardware and software configurations and installation combinations that result in crashes.
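As a rough illustration of this temporal-proximity clustering, the sketch below groups one machine's crash timestamps so that crashes separated by less than a fixed gap fall into the same cluster. The 24-hour gap is an assumed threshold chosen to mirror the instability windows discussed here, not the exact rule used in our analysis.

    from datetime import datetime, timedelta

    GAP = timedelta(hours=24)  # assumed instability window; the real threshold may differ

    def cluster_crashes(timestamps, gap=GAP):
        """Group a single machine's crashes so that events separated by less than
        `gap` land in the same cluster and are not double-counted."""
        clusters = []
        for t in sorted(timestamps):
            if clusters and t - clusters[-1][-1] < gap:
                clusters[-1].append(t)   # still inside the instability window
            else:
                clusters.append([t])     # far enough apart: start a new cluster
        return clusters

    # Three crashes in quick succession followed by one a week later -> two clusters.
    crashes = [datetime(2006, 3, 1, 9, 0), datetime(2006, 3, 1, 9, 20),
               datetime(2006, 3, 1, 23, 0), datetime(2006, 3, 8, 12, 0)]
    print([len(c) for c in cluster_crashes(crashes)])   # [3, 1]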

Understanding Crash Data

To study a broad population of Windows users, we studied data from public-resource computing volunteers. Numerous people enthusiastically contribute data to projects on BOINC rather than to corporations, as they favor a research cause. Additionally, users appreciate incentives, either through statistics that compare their machine to an average BOINC user's machine, or through recognition as pioneering contributors to the project.

Currently, we have about 3500 BOINC users signed up to our project. Over the last year, we have received 2528 OS crashes from 617 of these users; several users experienced (and reported) multiple OS crashes while a majority of them reported zero or one crash. Users reporting no crashes most likely do not actively run the BOINC client on their machine.

According to the results shown in Figure 1, most users experienced (submitted) only one crash; however, several users suffered multiple OS crashes. One user appears to have experienced over 200 OS crashes over the last year! The number is staggering considering that this data is for kernel-level crashes. Perhaps the user's user-mode crash counts are as bad, if not worse, considering there is more opportunity for variability in user-mode components.

First we analyze each crash as a unique entity to determine statistics on what components cause the Windows OS to crash often. Then, to understand how crashes on the same machine relate to each other, we carefully examined machines that experienced more than 5 kernel crashes within a 24 hour time period. In several cases, we observed the same crash occurring repeatedly (i.e., the same fault in the same module). There were also scenarios with crashes in various components interleaved with one another. We examine user behavior, temporal patterns, and device driver software reliability to understand these crashes.
[Figure 1: A histogram of the number of crashes experienced by users over the last year (x-axis: number of crashes, 0-250; y-axis: number of users with that many crashes). One data point was omitted from the graph for clarity (443 users experienced only 1 crash each).]


A Human Perspective

The human user plays a huge role in the wear and tear of a computer. User interaction is among the most difficult patterns to quantify. We extracted three distinct user scenarios from examining crash sequences in our data:

• Case 1: The user retries the same action repeatedly, and consequently experiences the same crash multiple times. He believes the repetition will eventually resolve the problem (which may be true over a long period of time). In this scenario, the user's model of how things work is incomplete. He does not understand the complex dependencies within the system.

• Case 2: There is some underlying problem at a lower level that is causing various different crashes. For example, if the user has hardware problems, he is likely to have many more crashes in random components. In this case, the user is simply flustered with all the crashes, and fixing each driver involved in each crash still will not resolve his problem; he will have to fix the root cause.

• Case 3: The user knows what the problem is and simply does not see an incentive to fix it. For example, he might be using an old version of a driver for which an update is available. There are three conceivable explanations for not updating the crashing driver: a) fear of breaking other working components, b) laziness, and c) fear of getting caught with an illegal copy of software.

A Temporal Perspective

There are factors beyond end user behavior that demonstrate inter-crash relationships. Figure 2 shows a distribution of the uptime between a machine reboot and a crash event. We observe that 25% of crashes occur within 30 minutes of rebooting a machine, and 75% of crashes occur within a day of rebooting a machine. Perhaps shorter system uptime intervals indicate a trend of several consecutive related crashes.

Upon analyzing crash sequences on various machines, we observed several distinct temporal indicators of crash cause:

• <5 minute uptime: A crash that occurs within 5 minutes of rebooting a computer is most indicative of a boot-time crash. The crash is not likely to have been caused by a user action. These crashes are the most frustrating as there is very little the user can do between the time of reboot and the time of crash. The user may gain insight on such crashes by examining the boot log.

• 5 minute-1 hour uptime: These crashes are more likely to be caused by a specific sequence of events initiated by the user (e.g., accessing a particular file from a corrupt disk segment). They could be attributed to software problems, hardware problems, or memory corruption.

• Regular interval between crashes: Several users experienced crashes regularly at a particular time of day. Such crashes may be attributed to a periodic process resembling a cron job or an antivirus scan.

• Context-based: Various crashes are triggered by a logically preceding event. For example, every time a virus scanner runs, we may observe a failed disk access. In such scenarios, we cannot use exact time as an indicator.

• Random: Many crash sequences on users' machines did not fit in any of the above profiles. Several consecutive, seemingly unrelated crashes could suggest a hardware problem and/or memory corruption.

Temporal crash patterns are useful in narrowing down a machine's potential root cause problems. However, the underlying responsibility for causing the crash lies in the longevity and reliability of the hardware and software on the machine.

A Device Driver Reliability Perspective

Device drivers are a major contributor to kernel-level crashes. A device driver is a kernel-mode module that communicates operating system requests to the device and vice versa. These drivers are inherently complex in nature and consequently difficult to write. Among the many reasons for device driver complexity is that these drivers deal with asynchronous events. Since they interact heavily with the operating system, the code must follow kernel programming etiquette (which is difficult to master and follow). Furthermore, once device drivers are written, they are exceedingly difficult to debug, as the typical device driver failure is a combination of an OS event and a device problem, and thus very difficult to reproduce (see [SM+04] for a detailed description of device driver problems).

Figure 3 is largely based on the OS Crash Type field in analyzed crash reports. This field reveals graphics driver faults, common system faults (such as memory/pool corruption and hardware faults), and application faults. However, there were many instances where the OS Crash Type was not provided (or defaulted to "Driver Fault") for legal reasons. In the absence of details revealed by the analysis tools, we crawled the web to derive the type of each driver that caused a crash. Where we were unable to determine the driver type (for example, when the documentation was not in English), we defaulted to "unknown."

Figure 4 shows that a handful of organizations contribute a significant number of crash-causing drivers to our data. Drivers written by seven organizations (Microsoft, Intel, ATI Technologies, Nvidia, Symantec, Zone Labs, and McAfee) contributed 75% of all crashes in our data set. This trend suggests that crashes caused by poorly-written and/or commonly used drivers can be reduced significantly by approaching these top seven companies. On the other hand, the graph has a heavy tail, indicating that it would be extremely difficult to eliminate the remaining 25% of crashes, as they are caused by drivers written by several different organizations.

Subsequently, we study the image (i.e., the .exe, .sys, or .dll file) that caused these crashes and identify the organization that contributed the crash-causing code (see Figure 5).
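The attribution step behind Figures 4 and 5 amounts to mapping each crash-causing image to the organization that ships it and accumulating per-vendor counts. The sketch below is a minimal illustration of that bookkeeping; the image-to-vendor table is a tiny assumed subset, not the full mapping we derived by hand.

    from collections import Counter

    # Illustrative subset of an image-name -> organization mapping (assumed, not exhaustive).
    VENDOR = {
        "ialmdev5.dll": "Intel",
        "ntoskrnl.exe": "Microsoft",
        "win32k.sys": "Microsoft",
        "vsdatant.sys": "Zone Labs",
        "gdfshk.sys": "McAfee",
    }

    def cumulative_share(crash_images):
        """Count crashes per organization and report the running share of all crashes."""
        counts = Counter(VENDOR.get(img.lower(), "unknown") for img in crash_images)
        total = sum(counts.values())
        running = 0
        for vendor, n in counts.most_common():
            running += n
            yield vendor, n, running / total

    sample = ["ialmdev5.dll", "ntoskrnl.exe", "ialmdev5.dll", "vsdatant.sys"]
    for vendor, n, share in cumulative_share(sample):
        print("%-10s %3d crashes, %3.0f%% cumulative" % (vendor, n, 100 * share))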

[Figure 2: A cumulative frequency graph of system uptime (in hours) between reboot and crash events, plotted for all crashes, Microsoft-written drivers, and third party drivers. The dotted line extrapolates what the CFG would look like if Microsoft wrote all the drivers, while the dashed line suggests what the CFG would look like if Microsoft wrote none of the drivers that crashed.]
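The uptime indicators from the temporal perspective can be read as a simple classifier over the interval between the last reboot and the crash. The sketch below encodes them; the two short cutoffs (5 minutes, 1 hour) come directly from the discussion, while the handling of longer uptimes is an assumption added for illustration.

    from datetime import datetime, timedelta

    def classify_uptime(boot_time, crash_time):
        """Map reboot-to-crash uptime onto the temporal buckets discussed above."""
        uptime = crash_time - boot_time
        if uptime < timedelta(minutes=5):
            return "boot-time crash; unlikely to be user-triggered (check the boot log)"
        if uptime < timedelta(hours=1):
            return "likely triggered by a specific user-initiated sequence of events"
        if uptime < timedelta(days=1):
            return "software, hardware, or memory-corruption problem within a day of reboot"
        return "long uptime; look for periodic (cron-like) or context-based triggers"

    print(classify_uptime(datetime(2006, 3, 1, 9, 0), datetime(2006, 3, 1, 9, 3)))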


The top contender in Figure 5 is ialmdev5.dll, the Intel graphics driver. Recently, graphics drivers have become notorious for causing crashes, and ialmdev5.dll is perhaps one of the more commonly used drivers in this category due to the popularity of Intel processors.

The second highest contender in Figure 5 is ntoskrnl.exe, which constitutes the bare-bones Windows NT operating system kernel code. It is not surprising that this executable is responsible for a number of driver crashes because it interacts with every other operating system component and is thus the single most critical component, one that can never be perfect enough. Furthermore, other systems code might generate bad input parameters to the ntoskrnl functions that cause exceptions; ntoskrnl bears the blame for the resulting crash as it generated the exception. Also, as mentioned earlier, many antivirus/firewall-related crashes may have been mis-categorized, blaming ntoskrnl due to third party privacy concerns (hence the significantly high percentage of crashes attributed to Microsoft in Figure 3).

Other crash-causing images range from I/O drivers to multimedia drivers. It is difficult to debug or even analyze these crashes further as we do not have the code and/or symbols for these drivers.

OS Crash Type                                        Number of Crashes
OS Core                                              726
  Microsoft                                          488
  Unknown                                            238
Graphics Drivers                                     495
  Intel                                              287
  ATI Technologies                                    97
  Nvidia                                              67
  Other                                               44
Application Drivers                                  482
  Intel                                               89
  Microsoft                                           64
  Symantec                                            58
  McAfee                                              55
  Zone Labs                                           55
  Unknown                                             13
  Other                                              148
Networking                                           338
  Unknown                                            194
  Microsoft                                           51
  Conexant                                            17
  Other                                               76
Common System Fault (Hardware and
  Software Memory Corruption)                        136
Audio                                                130
  Avance Logic                                        44
  C-Media                                             33
  Microsoft                                           16
  Other                                               37
Storage                                              106
  Microsoft                                           82
  Other                                               24
Other                                                 95
Unknown                                               20

Figure 3: Number of OS crashes of each type based on 2528 crashes received from BOINC users. (We would need many more samples before it would be safe to generalize these results to a larger user community.) This table also shows the top few crash-causing driver writers in each category.

[Figure 4: Cumulative frequency graph of organizations responsible for crash-causing drivers in our data (x-axis: organization rank, 0-110; y-axis: cumulative percentage of crashes). This graph does not account for driver popularity. 113 companies are represented in this graph.]


With the increasing number of devices accompanying the PC, it does not scale for operating system developers to account for and write device driver code for each device; consequently, device drivers are written by device manufacturers, who are potentially inexperienced in kernel programming. Perhaps such lack of expertise is the most significant cause of driver-related OS crashes.

We also observed numerous OS crashes caused by memory corruption. Memory corruption-related crashes can often be attributed to hardware problems introduced by the type of memory used (e.g., non-ECC memory). In the event that the memory corruption was due to software, the problem cannot be tracked down to a single image.

To further understand driver crashes, we studied the type of fault that resulted in the crash. Figure 6 lists the number of crashes that were caused by the various fault types. These fault types are reported by Microsoft's analysis tools when analyzing each OS crash dump.

Crash Cause Image Name   Image Description                                 Num Crashes   % Crashes   Running % Total
Ialmdev5.DLL             Intel graphics driver                             275           11%         11%
ntoskrnl.exe             NT kernel and system                              187            8%         19%
CAPI20.SYS               ISDN modem driver                                 182            7%         26%
Win32k.sys               Multi-user Win32 driver                           114            5%         31%
IdeChnDr.sys             Intel Application Accelerator driver               89            4%         35%
ntkrnlmp.exe             Multi-processor version of NT kernel and system    87            4%         39%
vsdatant.sys             TrueVector device driver                           51            2%         41%
GDFSHK.SYS               McAfee Privacy Service file guardian               48            2%         43%
V7.SYS                   IBM V7 driver for Windows NT/2000                  45            2%         45%
ALCXWDM.SYS              Windows WDM driver for Realtek AC'97               44            2%         47%

Figure 5: Top 10 OS crash-causing images based on 2528 crashes received from BOINC users. (We would need many more samples before it would be safe to generalize these results to a larger user community.) A description of the crash-causing image is provided in addition to the percentage of crashes caused by each image.

Driver Fault Type                          Num Crashes
IRQL NOT LESS OR EQUAL                     657
THREAD STUCK IN DEVICE DRIVER              327
PAGE FAULT IN NONPAGED AREA                323
KERNEL MODE EXCEPTION NOT HANDLED          305
UNEXPECTED KERNEL MODE TRAP                 78
BAD POOL CALLER                             74
SYSTEM THREAD EXCEPTION NOT HANDLED         73
PFN LIST CORRUPT                            53
DRIVER CORRUPTED EXPOOL                     38
MACHINE CHECK EXCEPTION                     37

Figure 6: Top 10 crash-generating driver fault types.


While many of these fault types are straightforward to understand from the name, many others are abbreviations of the event they describe. Below, we enumerate each fault type and its significance (based on the descriptions provided in the parsed crash dumps):

• IRQL NOT LESS OR EQUAL – An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. The driver is most likely using an improper address. (The interrupt request level is the hardware priority level at which a given kernel-mode routine runs, masking off interrupts with an equivalent or lower IRQL on the processor. A routine can be preempted by an interrupt with a higher IRQL.)

• THREAD STUCK IN DEVICE DRIVER – The device driver is spinning in an infinite loop, most likely waiting for hardware to become idle. This usually indicates a problem with the hardware itself or with the device driver programming the hardware incorrectly.

• PAGE FAULT IN NONPAGED AREA – Invalid system memory was referenced, for example, due to a bad pointer.

• KERNEL MODE EXCEPTION NOT HANDLED – The exception address pinpoints the driver/function that caused the problem. However, the particular exception thrown by the driver/function was not handled.

• UNEXPECTED KERNEL MODE TRAP – A trap occurred in kernel mode, either because the kernel is not allowed to have/catch the trap (a bound trap) or because a double fault occurred.

• BAD POOL CALLER – The current thread is making a bad pool request. Typically this is a request at a bad IRQL level, a double free of the same allocation, etc.

• SYSTEM THREAD EXCEPTION NOT HANDLED – This fault type is similar to an unhandled kernel mode exception.

• PFN LIST CORRUPT – Typically caused by drivers passing bad memory descriptor lists.

• DRIVER CORRUPTED EXPOOL – An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This fault is caused by drivers that have corrupted the system pool.

• MACHINE CHECK EXCEPTION – A fatal machine-check exception occurred (due to hardware).

Studying these fault types reveals various programming errors that impact system behavior and indicates which OS problems to tackle with caution. However, this information is more useful to the software developer than the end user. From a user's perspective, the most useful piece of information is "what can I fix on my machine?"

There are three distinct trends we observed on machines with multiple crashes:

• The same driver causes most crashes: This scenario is very simple to resolve. Most likely, the crash-causing driver is an old version for which a newer, more stable version is available. There were other cases where a newly downloaded driver caused various crashes as a result of its incompatibility with other components installed on the machine. In both situations, updating or rolling back the driver's version will reduce crashes on the machine.

• Related drivers cause most crashes: Two drivers are considered related if they communicate with the same device or pertain to the same component. In this scenario, if different, yet related, drivers cause the machine's crashes, then perhaps the common underlying component or device is at fault and needs attention.

• Unrelated drivers cause the crashes: This scenario is the most difficult to comprehend. First, we understand what the drivers have in common – whether they perform similar actions or function calls, have similar resource requirements (e.g., requiring network connectivity), or access the same objects.

In the above scenarios, it is useful to understand inter-driver dependencies. We would also benefit from understanding the stability of specific driver versions and how diverse their install base is.

Discussion

Windows users have started viewing crashes as a fact of life rather than a problem. We have the single most valuable resource to design a system that helps users cope with crashes better – crash data. Microsoft's Online Crash Analysis provides users with feedback on each of their submitted crashes. However, many users suffer from multiple crashes, and individual per-crash analysis is not enough to identify the optimal solution to the root problem. There is a strong need to use historical data for each machine and use heuristics to determine the best fix for that machine.

The human, temporal, and device-driver reliability perspectives shed light on potential root causes for crashing behavior. There are numerous other factors we can include to refine root cause analysis. It would be very beneficial to scrape portions of the machine's event log when analyzing crashes. We can look for significant events preceding each crash (e.g., driver installed/removed, process started up, etc.), pinpointing likely sources of the machine's behavior.

It is also useful to collect various machine health metrics such as the frequency of prophylactic reboots and the frequency of virus scans. Such metrics will help us evaluate the relative healthiness of a machine (compared to the entire user population) and customize analysis responses on a per-machine basis. Ideally we would want our data analysis system to have a built-in feedback loop (as seen in Figure 7) so we can continuously adapt and improve our analysis engine. This framework is useful for performing accurate post-mortem analysis.

[Figure 7: Customer-centric kernel crash analysis framework. The diagram shows a data aggregation store (the OCA/Watson database) feeding in-house data analysis and rule generation by experts (analyzing results, driver toxicity, temporal patterns, bugcheck mapping, etc.), followed by rule validation. The resulting rules drive a customer-centric analysis engine that combines a machine's crash history, usage history, machine configuration, health metrics, and event log to produce a ranking of fixes to the root cause problem(s), with feedback on the success rate of suggested fixes flowing back into the analysis.]
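The three per-machine trends above lend themselves to a simple rule-based triage, in the spirit of the customer-centric analysis engine of Figure 7. The sketch below is an illustrative reduction of that idea; the component mapping, the 50% dominance threshold, and the driver names in the example are assumptions, not rules drawn from our data.

    from collections import Counter

    def triage(crashes):
        """Classify a machine's crash history of (driver, component) pairs into the
        three trends discussed above and suggest a next step."""
        drivers = Counter(driver for driver, _ in crashes)
        components = Counter(component for _, component in crashes)
        top_driver, driver_count = drivers.most_common(1)[0]
        top_component, component_count = components.most_common(1)[0]

        if driver_count / len(crashes) >= 0.5:        # assumed dominance threshold
            return "same driver dominates (%s): update or roll back that driver" % top_driver
        if component_count / len(crashes) >= 0.5:     # related drivers, shared component
            return "related drivers share a component (%s): suspect that device" % top_component
        return "unrelated drivers: suspect hardware or memory corruption; compare shared resources"

    # Hypothetical history: three different graphics drivers plus one kernel crash.
    history = [("nv4_disp.dll", "graphics"), ("ati2dvag.dll", "graphics"),
               ("ialmdev5.dll", "graphics"), ("ntoskrnl.exe", "os core")]
    print(triage(history))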


It is equally important to understand the manifestation of such problems on each machine. It is important to characterize inter-component interactions and model failure propagation patterns. Such analysis will help improve inter-component isolation, reducing their crash-likeliness. While post-mortem analysis and debugging help cure problems, it is also critical to prevent problems at their source. As an industry, we must work towards determining the characteristics of software that dictate software dependability.

Conclusion

Our crash-data related study, despite the small quantity of Windows XP data analyzed, has contributed several observations. The most notable reality is that the Windows operating system is not responsible for a majority of PC crashes in our data set. Poorly-written device drivers contribute most of the crashes in our data. It is evident that targeting a few companies to improve their driver quality would effectively eliminate 75% of our crashes. However, the remaining 25% of crashes are extremely difficult to eliminate due to the large number of organizations contributing the driver code.

Users can alleviate computer frustration through better usage discipline and by avoiding unsafe applications and drivers. With additional data collection and mining, we hope to make stronger claims about applications and also extract safe product design and usage methodologies that apply universally to all operating systems. Eventually, this research can gauge product as well as usage evolution.

Studying failure data is as important to the computing industry as it is to consumers. Product dependability evaluations help evolve the industry by reducing the quality differential between various products. Once product reliability data is publicized, users will use such information to guide their purchasing decisions and usage patterns. Product developers will react defensively, and the resulting competition will improve quality control.

In the future, we hope to refine the analysis engine and automate many of the background queries for each driver. We would like to improve our understanding of the dependencies between analysis categories such as the temporal and device driver perspectives. We also plan to investigate the relationship of various objects involved at the time of crash. Lastly, we would like to obtain more environmental metrics, draft more rules for analysis, and extend this work to other operating systems.

Author Biographies

Archana Ganapathi is a graduate student at the University of California at Berkeley. She completed her Masters degree in 2005 and is currently pursuing her Ph.D. in Computer Science. Her primary research interests include operating system dependability and system management.

Viji Ganapathi is an undergraduate at the University of California at Berkeley. She will complete her Computer Science Bachelors degree in December, 2006.

David A. Patterson has been Professor of Computer Science at the University of California, Berkeley since 1977, after receiving his A.B., M.S., and Ph.D. from UCLA. He is one of the pioneers of both RISC and RAID, both of which are widely used. He co-authored five books, including two on computer architecture with John L. Hennessy, which have been popular in graduate and undergraduate courses since 1990. Past chair of the Computer Science Department at U.C. Berkeley and the Computing Research Association, he was elected President of the Association for Computing Machinery (ACM) for 2004 to 2006 and served on the Information Technology Advisory Committee for the U.S. President (PITAC) from 2003 to 2005. His work was recognized by education and research awards from ACM and IEEE and by election to the National Academy of Engineering. In 2005 he shared Japan's Computer & Communication award with Hennessy and was named to the Silicon Valley Engineering Hall of Fame. In 2006 he was elected to the American Academy of Arts and Sciences and the National Academy of Sciences, and he received the Distinguished Service Award from the Computing Research Association.

Bibliography

[And03] Anderson, D., "Public Computing: Reconnecting People to Science," The Conference on Shared Knowledge and the Web, Residencia de Estudiantes, Madrid, Spain, Nov., 2003.

[BS+02] Broadwell, P., N. Sastry, and J. Traupman, "FIG: A Prototype Tool for Online Verification of Recovery Mechanisms," Workshop on Self-Healing, Adaptive and self-MANaged Systems (SHAMAN), New York, NY, June, 2002.

[BC+02] Brown, A., L. Chung, and D. Patterson, "Including the Human Factor in Dependability Benchmarks," Proc. 2002 DSN Workshop on Dependability Benchmarking, Washington, D.C., June, 2002.

[BS97] Brown, A. and M. Seltzer, "Operating System Benchmarking in the Wake of Lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture," Proc. 1997 ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, Seattle, WA, June, 1997.

[FM00] Forrester, J. and B. Miller, "An Empirical Study of the Robustness of Windows NT Applications Using Random Testing," Proc. 4th USENIX Windows System Symposium, Seattle, WA, Aug., 2000.

[G05] Ganapathi, A., "Why Does Windows Crash?" UC Berkeley Technical Report UCB//CSD-05-1393, May, 2005.


[GW+04] Ganapathi, A., Y. Wang, N. Lao, and J. Wen, "Why PCs are Fragile and What We Can Do About It: A Study of Windows Registry Problems," Proc. International Conference on Dependable Systems and Networks (DSN-2004), Florence, Italy, June, 2004.

[GP05] Ganapathi, A. and D. Patterson, "Crash Data Collection: A Windows Case Study," Proc. International Conference on Dependable Systems and Networks (DSN-2005), Yokohama, Japan, June, 2005.

[Gra86] Gray, J., "Why Do Computers Stop and What Can Be Done About It?" Symp. on Reliability in Distributed Software and Database Systems, pp. 3-12, 1986.

[Gra90] Gray, J., "A Census of Tandem System Availability Between 1985 and 1990," Tandem Computers Technical Report 90.1, 1990.

[GS04] Gray, J. and A. Szalay, "Where the Rubber Meets the Sky: Bridging the Gap Between Databases and Science," Microsoft Research TR-2004-110, 2004.

[KK+04] Kalakech, A., K. Kanoun, Y. Crouzet, and J. Arlat, "Benchmarking the Dependability of Windows NT4, 2000 and XP," Proc. International Conference on Dependable Systems and Networks (DSN-2004), Florence, Italy, June, 2004.

[Kal98] Kalyanakrishnam, M., "Analysis of Failures in Windows NT Systems," Masters Thesis, Technical Report CRHC 98-08, University of Illinois at Urbana-Champaign, 1998.

[KK+99] Kalyanakrishnam, M., Z. Kalbarczyk, and R. Iyer, "Failure Data Analysis of a LAN of Windows NT Based Computers," Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems, 1999.

[KD+05] King, S., G. Dunlap, and P. Chen, "Debugging Operating Systems with Time-Traveling Virtual Machines," Proceedings of the 2005 Annual USENIX Technical Conference, April, 2005.

[KD00] Koopman, P. and J. DeVale, "The Exception Handling Effectiveness of POSIX Operating Systems," IEEE Transactions on Software Engineering, Vol. 26, Num. 9, pp. 837-848, Sept., 2000.

[LR01] Lancaster, L. and A. Rowe, "Measuring Real-World Data Availability," Proceedings of LISA 2001, 2001.

[LI95] Lee, I. and R. Iyer, "Software Dependability in the Tandem GUARDIAN Operating System," IEEE Transactions on Software Engineering, Vol. 21, Num. 5, pp. 455-467, May, 1995.

[Lev89] Levendel, Y., "Defects and Reliability Analysis of Large Software Systems: Field Experience," Digest 19th Fault-Tolerant Computing Symposium, pp. 238-243, June, 1989.

[MR+04] Maniatis, P., M. Roussopoulos, T. Giuli, D. S. H. Rosenthal, and M. Baker, "The LOCKSS Peer-to-Peer Digital Preservation System," ACM Transactions on Computer Systems (TOCS), 2004.

[Mur04] Murphy, B., "Automating Software Failure Reporting," ACM Queue, Vol. 2, Num. 8, Nov., 2004.

[MG95] Murphy, B. and T. Gent, "Measuring System and Software Reliability Using an Automated Data Collection Process," Quality and Reliability Engineering International, Vol. 11, 1995.

[OB+02] Oppenheimer, D., A. Brown, J. Traupman, P. Broadwell, and D. Patterson, "Practical Issues in Dependability Benchmarking," Workshop on Evaluating and Architecting System dependabilitY (EASY '02), San Jose, CA, Oct., 2002.

[SS72] Schroeder, M. and J. Saltzer, "A Hardware Architecture for Implementing Protection Rings," Communications of the ACM, Vol. 15, Num. 3, pp. 157-170, March, 1972.

[SK+00] Shelton, C., P. Koopman, and K. DeVale, "Robustness Testing of the Microsoft Win32 API," Proc. International Conference on Dependable Systems and Networks (DSN-2000), New York, June, 2000.

[SK+02] Simache, C., M. Kaaniche, and A. Saidane, "Event Log Based Dependability Analysis of Windows NT and 2K Systems," Proc. 2002 Pacific Rim International Symposium on Dependable Computing (PRDC'02), pp. 311-315, Tsukuba, Japan, Dec., 2002.

[SC91] Sullivan, M. and R. Chillarege, "Software Defects and Their Impact on System Availability – A Study of Field Failures in Operating Systems," Proceedings of the 21st International Symposium on Fault-Tolerant Computing, 1991.

[SM+04] Swift, M., M. Annamalai, B. Bershad, and H. Levy, "Recovering Device Drivers," Proceedings of the 6th ACM/USENIX Symposium on Operating Systems Design and Implementation, San Francisco, CA, Dec., 2004.

[TI92] Tang, D. and R. Iyer, "Analysis of the VAX/VMS Error Logs in Multicomputer Environments – A Case Study of Software Dependability," International Symposium on Software Reliability Engineering, Research Triangle Park, North Carolina, Oct., 1992.

[TI96] Thakur, A. and R. Iyer, "Analyze-NOW – An Environment for Collection and Analysis of Failures in a Network of Workstations," IEEE Transactions on Reliability, Vol. R46, Num. 4, 1996.

[TI+95] Thakur, A., R. Iyer, L. Young, and I. Lee, "Analysis of Failures in the Tandem NonStop-UX Operating System," International Symposium on Software Reliability Engineering, Oct., 1995.

[WL+93] Wahbe, R., S. Lucco, T. Anderson, and S. Graham, "Efficient Software-Based Fault Isolation," Proc. Fourteenth ACM Symposium on Operating Systems Principles (SOSP), pp. 203-216, December, 1993.


[WC+01] Welsh, M., D. Culler, and E. Brewer, "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services," 18th Symposium on Operating Systems Principles, Chateau Lake Louise, Canada, October, 2001.

[WM+02] Wilson, D., B. Murphy, and L. Spainhower, "Progress on Defining Standardized Classes for Comparing the Dependability of Computer Systems," Proc. DSN 2002 Workshop on Dependability Benchmarking, Washington, D.C., June, 2002.

[XK+99] Xu, J., Z. Kalbarczyk, and R. Iyer, "Networked Windows NT System Field Failure Data Analysis," Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing, 1999.
