Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs

Edmund B. Nightingale (Microsoft Research, [email protected])
John R. Douceur (Microsoft Research, [email protected])
Vince Orgovan (Microsoft Corporation, [email protected])

Abstract

We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware-induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8 month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.

Categories and Subject Descriptors: D.4.5 [Operating Systems]: Reliability

General Terms: Measurement, reliability

Keywords: Fault tolerance, hardware faults

1. Introduction

We present the first large-scale analysis of hardware failure rates on consumer PCs by studying failures in the CPU, DRAM, and disk subsystems. We find that many failures are neither transient (one-off) nor independent (memoryless). Instead, a large portion of hardware-induced failures are recurrent. Hardware crashes increase in likelihood by up to two orders of magnitude after a first such crash occurs, and DRAM failures are likely to recur in the same location.
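To make the recurrence effect concrete, the following minimal sketch derives the quoted figures from per-machine crash counts. The machine counts below are hypothetical, chosen only to reproduce the 1-in-190 and 1-in-3.3 probabilities quoted in the abstract; they are not the study's data.

```python
# Hypothetical illustration of the recurrence figures quoted above.
# Only the two probabilities (1 in 190 and 1 in 3.3) come from the paper;
# the machine counts below are invented to reproduce them.

def crash_probabilities(total_machines, machines_with_first_crash,
                        machines_with_second_crash):
    """Return (P(first crash), P(second crash | first crash))."""
    p_first = machines_with_first_crash / total_machines
    p_recur = machines_with_second_crash / machines_with_first_crash
    return p_first, p_recur

p_first, p_recur = crash_probabilities(total_machines=190_000,
                                       machines_with_first_crash=1_000,  # ~1 in 190
                                       machines_with_second_crash=303)   # ~1 in 3.3
print(f"P(first CPU-subsystem crash)  = 1 in {1 / p_first:.0f}")
print(f"P(recurrence | first crash)   = 1 in {1 / p_recur:.1f}")
print(f"Relative increase after crash = {p_recur / p_first:.0f}x")      # ~58x
```

For this particular failure type the conditional probability of a second crash is roughly 58 times the baseline probability of a first crash, which is what the abstract summarizes as an increase of up to two orders of magnitude.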
Three conditions motivate a hardware-failure study particularly targeted at consumer machines. First, in contrast to servers, consumer machines tend to lack significant error-resiliency features, such as ECC for memory and RAID for disks, which increases the likelihood that a hardware error will have an impact on a machine's operation. Second, whereas the failure of a single server in a data center can often be masked by the application-level logic that partitions tasks among machines, the failure of a single consumer machine directly affects the user of that machine. Moreover, there are good reasons to expect that server-class and consumer-class machines will exhibit different failure characteristics: consumer machines are built with cheaper components, and they run in harsher environments, with wider temperature variation, greater mechanical stresses, more ambient dust, and more frequent power cycling. Third, although there now exists a significant body of work on analyzing hardware failures in server machines [Bairavasundaram et al. 2007, 2008; Kalyanakrishnam et al. 1999; Oppenheimer et al. 2003; Pinheiro et al. 2007; Schroeder and Gibson 2006, 2007; Schroeder et al. 2009; Xu et al. 1999], there is a dearth of information on consumer machines, even though they constitute the vast majority of machines sold and used each year.

Studying consumer machines brings new challenges not present when studying server-class machines. Consumer machines are geographically remote, widely distributed, and independently administered, which complicates data collection. There are no field reports from repair technicians or outage logs filed by customers, as used by other studies [Gray 1987, 1990; Kalyanakrishnam et al. 1999; Oppenheimer et al. 2003; Schroeder and Gibson 2006; Xu et al. 1999]. The general lack of ECC precludes tracking ECC errors as a means for determining memory failures, as used in server studies [Constantinescu 2003; Schroeder et al. 2009].

To address these challenges, we employ data sets from the Windows Error Reporting (WER) system [Glerum et al. 2009], which was built to support diagnosis of software faults that occur in the field. Through post hoc analysis of reports from roughly one million machines, we are able to isolate several narrow classes of failures that are highly likely to have a root cause in hardware: machine-check exceptions reported by the CPU, single-bit errors in the kernel region of DRAM, and OS-critical read failures in the disk subsystem.
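As a rough sketch of how a single-bit DRAM error in kernel code can be recognized post hoc, assume the kernel code page captured in a crash report can be compared against the corresponding page of the on-disk kernel image; a candidate DRAM failure is then a page that differs from its known-good contents by exactly one bit. This is an illustrative mechanism, not a description of the WER pipeline, and the function names are hypothetical.

```python
# Minimal sketch: flag a crash-dump kernel code page whose only difference
# from the known-good on-disk page is a single flipped bit.

def single_bit_error_offsets(dump_page: bytes, disk_page: bytes):
    """Return byte offsets where the two pages differ in exactly one bit."""
    offsets = []
    for i, (a, b) in enumerate(zip(dump_page, disk_page)):
        diff = a ^ b
        if diff and (diff & (diff - 1)) == 0:  # nonzero and a power of two
            offsets.append(i)
    return offsets

def looks_like_single_bit_flip(dump_page: bytes, disk_page: bytes) -> bool:
    """True if the pages differ in exactly one byte, by exactly one bit."""
    mismatched_bytes = sum(a != b for a, b in zip(dump_page, disk_page))
    return (mismatched_bytes == 1 and
            len(single_bit_error_offsets(dump_page, disk_page)) == 1)

# Example: a 4 KiB page with a single bit flipped at byte offset 100.
good_page = bytes(4096)
bad_page = bytearray(good_page)
bad_page[100] ^= 0x08
print(looks_like_single_bit_flip(bytes(bad_page), good_page))  # True
```

A check of this form only covers memory whose correct contents are known in advance, such as kernel code pages, which is consistent with the limitation discussed next.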
This methodology has two unfortunate limitations. First, since WER logs are generated only in response to a crash, our study is blind to hardware failures that do not lead to system crashes. In particular, the only DRAM bit errors that cause system crashes are those that occur within the roughly 1.5% of memory that is occupied by kernel code pages. Errors elsewhere in memory may cause application crashes or data corruption, but we cannot observe these.

Second, our study sheds no light on the relative frequency of hardware-induced versus software-induced crashes. The three types of failures we study constitute a minority of WER entries; however, most types of CPU, memory, and disk failures produce symptoms that are indistinguishable from kernel-software failures. On the other hand, such a result would be relevant only for Windows machines, whereas absolute measures of hardware failures are relevant for any operating system, or at least any OS that relies on a functioning CPU, error-free memory, and a responsive disk.

Despite these limitations, our study has found a number of interesting results. For instance, even small degrees of overclocking significantly degrade machine reliability, and small degrees of underclocking improve reliability over running at rated speed. We also find that faster CPUs tend to become faulty more quickly than slower CPUs, and that laptops have lower failure rates than desktops. Beyond our results, this study serves to inform the community that hardware faults on consumer machines are not rare, independent, or always transient. While prior work focused on measuring hardware faults that may or may not cause an OS failure, our study is the first to measure observed OS failure rates due to faulty hardware.

In the sequel, we analyze the probability of machine failure from each failure type, as well as the conditional probability that a failure will recur (§4). We analyze the spatial locality of DRAM failures, which informs conjectures about the underlying fault. We analyze the failure behavior of various machine classes by partitioning machines into populations of overclocked vs. non-overclocked, underclocked vs. rated-speed, white box (generic) vs. brand name, and desktops vs. laptops (§5).

2. Prior work

Ours is the first large-scale study of hardware failures in consumer machines, and the first study of CPU subsystem failures on any class of machines. Furthermore, many of our analyses have not been previously conducted, even for server machines. The effects on failure rates of overclocking and underclocking, brand name vs. white box, and memory size have not been previously studied. There is no prior work examining the relative frequency of intermittent vs. transient hardware faults, nor studying the spatial locality of DRAM failures. Some of our comparisons, such as desktop vs. laptop, are not even meaningful in server systems.

However, some of our findings mirror those in prior work. Like Schroeder et al. [2009], who conducted detailed studies of DRAM failure rates on server machines, we find that DRAM errors are far more likely to occur than would be expected from studies of error rates induced by active radiation sources [Constantinescu 2002; Micron 1997; Ziegler et al. 1998] or cosmic rays [Seifert et al. 2001; Ziegler et al. 1996]. Also, like Pinheiro et al. [2007] and Schroeder and Gibson [2007], who examined disk failure rates in large data-center installations, we find that disk MTTFs are much lower than those specified on disk data sheets. Our findings on conditional failure probability mirror the findings of recent studies of DRAM failures [Schroeder et al. 2009] and disk-subsystem failures [Jiang et al. 2008] on server machines: an increase of up to two orders of magnitude in observed failure rates after a first failure occurs.

Some of our findings contrast with those of prior studies. In studying the correlation between age and DRAM failure rates, Schroeder et al. [2009] found no infant mortality and no increasing error rates at very old ages; by contrast, we find minor evidence for both of these phenomena, consistent with a bathtub curve. Constantinescu [2003], in a study of ECC errors on 193 servers, concluded that the risk of intermittent faults increases as chip densities and frequencies increase; by contrast, we find that memory size (which roughly corresponds to density) has only weak correlation to failure rate, and CPU frequency has no effect on the rate of failures per CPU cycle for non-overclocked machines.

Much of the prior work in this area addresses issues that our study does not. Our data is unable to conclusively address the relative frequency of hardware versus software failures, but the coarse-grained division between hardware, soft-