Assessing Fault Sensitivity in MPI Applications ∗†

Charng-da Lu
Department of Computer Science
University of Illinois
Urbana, Illinois 61801

Daniel A. Reed
Renaissance Computing Institute
University of North Carolina
Chapel Hill, North Carolina 27514

Abstract

Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications.

Using software fault injection, we simulated single bit memory errors, register file upsets and MPI message payload corruption and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected.

1 Introduction

Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow to thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. This is especially true for systems built from very low cost components that lack error correcting memory or end-to-end error detection and correction for message transport.

Hardware failures are usually classified as either hard errors or soft (transient) errors. Hard errors are permanent physical defects whose repair normally requires component replacement (e.g., a power supply or fan failure). Conversely, soft errors (also known as single-event upsets) include both transient faults in semiconductor devices (e.g., memory or register bit errors) and recoverable errors in other devices (e.g., disk read retries).

In many cases, error detection and recovery mechanisms can mask the occurrence of transient errors. However, on some systems, error detection and correction support may be missing (e.g., due to price-sensitive marketing of commodity components) or disabled (e.g., for reduced latency on communication channels).

Non-recoverable hardware failures are exacerbated by programming models with limited support for fault-tolerance. For scientific applications, MPI [1] is the most popular parallel programming model. However, the MPI standard does not specify mechanisms or interfaces for fault-tolerance - normally, all of an MPI application's tasks are terminated when any of the underlying nodes fails or becomes inaccessible.

In this paper, we examine the impact of soft errors on MPI applications by injecting faults into registers, the address space, and MPI messages to simulate single-bit-flip errors. The remainder of this paper is organized as follows. In §2, we outline the rationale for fault assessment of commodity hardware and consider the common failure modes in memory and communication systems. This is followed by a description of our fault injection methodology in §3 and our experimental environment in §4. In §5, we describe the application suite and experimental results, followed by an overall assessment in §6 and §7. Finally, we discuss related work in §8 and conclude by summarizing results and future directions in §9.

∗ This work was supported in part by Contract 74837-001-0349 from the Regents of University of California (Los Alamos National Laboratory) to William Marsh Rice University and by the National Science Foundation under grant ACI 02-19597.
† 0-7695-2153-3/04 $20.00 © 2004 IEEE

2 COTS Failure Modes

The price/performance advantage of commercial off-the-shelf (COTS) components has led many groups to assemble clusters containing thousands of nodes. Because the COTS market is very price sensitive, there is great pressure on manufacturers to eliminate any features not necessary for the intended, commodity market. For example, error correcting memory is not used on many consumer PCs, nor are these systems subject to the same level of quality assurance as systems intended for mission critical commercial or scientific domains. Even when the individual systems are well engineered, the multiplicative coupling of large numbers of components can lead to low reliability for the aggregate.

As a motivating example, consider the Los Alamos ASCI Q system, with 33 TB of error correcting (ECC) memory. If one assumes one error every ten days for each 1 GB of memory and a 95 percent ECC coverage rate (see §2.1), the soft error rate is 33,000 × 0.05, or roughly 1,650 errors every ten days. If even some of these errors corrupt an application's data space or message payloads, then an application's outputs may be incorrect.

The Los Alamos Q system is constructed from high quality components, in contrast to those used to assemble many low cost, laboratory clusters. Hence, as low-cost COTS hardware becomes the standard building block, it is crucial that we understand the balance of component reliability and price relative to system reliability and application usability. In this light, we review the possible failure modes and probabilities for memory and message transmission, as a prelude to experimental analysis.

2.1 Memory Errors

In an analysis of system logs from workstation clusters, Lin and Siewiorek [2] reported that 90 percent of the crashes were due to soft memory errors. In practice, a single soft memory error rarely causes a system crash, unless it strikes a critical memory region at the right time. Hence, the actual frequency of soft errors is higher than that detected – most have no detectable effect.

Improved manufacturing processes and designs have continued to reduce the hard error rate (HER) for memory modules. Recent estimates range from a mean time before failure (MTBF) of 1,100 years for a 32 Mb DRAM [3] to between 159-713 years for 16 and 64 Mb DRAMs [4, 5]. Overall, the HER has remained roughly constant as memory densities have increased [3].

On the other hand, shrinking geometries, lower voltages, and higher clock frequencies contribute to the growing occurrence of soft errors – the associated decrease in noise margins increases signal sensitivity to transients. Intel reported that the soft error rate for SRAMs increased thirty fold when the process technology shifted from 0.25 to 0.18 micron features and the supply voltage dropped from 2 V to 1.6 V [3].

Soft errors can also arise due to environmental conditions. Poor power regulation and brownouts can induce soft errors because memory cells may not receive enough power to be refreshed. Cosmic rays can also lead to single bit upsets, particularly for systems located at high altitudes. IBM showed that the soft error rate in Denver was ten times higher than that at sea level [6].

Given these diverse conditions, the observed soft error rate (SER) can differ by as much as two orders of magnitude, based on manufacturing process and environmental conditions. Actel [7] reported that the SER for every Mb of memory manufactured using a 0.13 micron process technology corresponded to an MTBF of roughly 1-10 years. Tezzaron Semiconductor [8] surveyed recently published data on SER values and concluded that 1000 to 5000 FIT (Failure-In-Time; the number of failures in a billion hours) per Mb was typical for modern memory devices. However, even using a conservative soft error rate (500 FIT/Mb), a system with 1 GB of RAM can expect a soft error every 10 days.

Historically, parity and error correction codes (ECC) have been the primary protection against memory soft errors. SECDED (Single-Error-Correction, Double-Error-Detection) is the standard approach, with every 64 data bits protected by a set of 8 check bits. However, ECC does not eliminate all soft errors. Compaq reported that roughly 10 percent of errors are not caught by the on-chip ECC [9]. A hardware-based fault injection experiment by Constantinescu [10] showed that 18 percent of errors are uncovered by ECC memory. Moreover, ECC memory solutions generally require 20 percent more die area to fabricate, cost 10-25 percent more, and reduce memory performance by 3-4 percent [8, 11]. In a price sensitive consumer market, these marginal costs are substantial, and many vendors omit these features on consumer-grade products. Therefore, soft memory errors will remain an inevitable reliability problem for future COTS clusters.

2.2 Communication Errors

On parallel systems, transient errors can also occur when transmitting messages. Although the MPI 1.1 standard [1] specifies that it is the MPI implementor's responsibility to insulate the user from the unreliability of the underlying communication fabric, most MPI implementations assume the underlying communication substrate (e.g., TCP/IP or Myrinet [12]) handles all reliability issues.

However, library or application managed end-to-end communication reliability is not without cost – communication latency increases with each software-mediated verification. Indeed, OS-bypass mechanisms, with direct access to network interface cards, were introduced precisely to reduce buffer copying, context switch and interrupt handling overhead [13]. In such situations, message data integrity is dependent on hardware-implemented, link-level checksums.

Stone and Partridge [14] show that link-level checksums are insufficient to detect errors in messages. In theory, the chance that link-level checksums fail to catch an error should be as small as 1 in 4 billion packets. After analyzing a trace of 500,000 Ethernet packets that failed TCP's 16-bit checksum, Stone and Partridge found that a much higher fraction (1 in 1,100 to 1 in 32,000) should also have been caught by link-level checksums but was not.

The source of the errors proved to be host hardware, host software, router memory and links. Indeed, network hardware has been reported to be increasingly susceptible to soft errors [15]. For long-running, communication-intensive codes on large systems, even a small link error rate can have serious implications for application reliability.

3 Experimental Methodology

Given the importance of soft errors for both memory and communication systems, we used fault injection techniques to study MPI application responses to transient faults. Fault injection can be either hardware-based or software-based [16]. Each has associated advantages and disadvantages.

Hardware fault injection techniques range from subjecting chips to heavy ion radiation, to simulate the effects of alpha particles, to inserting a socket between the target chip and the circuit board, to simulate stuck-at (e.g., always 0 or 1), open, or more complex logic faults. Although effective, the cost of these techniques is high relative to the components being tested.

In contrast, software-implemented fault injection (SWIFI) does not require expensive equipment and can target specific software components, such as the operating system, software libraries or applications. Therefore, we chose the cost-effective SWIFI approach to simulate transient errors in memory and messages during runtime. Below we describe our memory and message fault injection models.

3.1 Software Environment

Our experimental target was Intel x86 systems running Linux 2.4, with the MPICH library [17] as the MPI communication toolkit. Software error injection targeted both registers and the application's address space, but not the MPI libraries. The latter was intended to maximize the independence of our results from a specific MPI implementation.

To inject faults, we linked the target applications with a custom fault injection library containing MPI wrapper functions. Each wrapper performs fault injection tasks and then calls the actual MPI function via the MPI profiling interface (PMPI).

    int MPI_Init(int *argc, char ***argv)
    {
        /* parse the fault configuration file and
           spawn the memory fault injector ... */
        return PMPI_Init(argc, argv);
    }

Our MPI_Init() wrapper, sketched above, parses a configuration file and spawns the memory fault injector. The fault injector awakens periodically and invokes the ptrace() UNIX system call to halt the target process and overwrite target process memory or register content to simulate the effect of transient errors. The target process is then allowed to resume execution, and its reaction to faults is recorded.

3.2 Memory Fault Injection

Memory fault injection targeted both registers and application memory regions. All registers (including regular and x87 floating-point ones) were targeted except the following: system control (CR0-CR4), debug and performance monitoring (DR0-DR7 and MSRs) and virtual memory management (GDTR, LDTR, IDTR, and TR). Modifications to these registers can cause system crashes, complicating application experiments. We also omitted the TLB and the L1 and L2 caches. Modifying the latter would have required a kernel implementation, something we sought to avoid for platform portability.

As we noted above, the memory region where we injected faults was confined to the address space of an MPI process: the text, stack and heap, as shown in Figure 1. We excluded other portions of memory because we wanted our results to be independent of the execution context, and we wanted to maximize the probability of application error effects. Injecting transient faults into unused memory has little effect on applications.

[Figure 1: diagram of the Linux process memory model — text, data, BSS and heap starting at 0x08000000; shared libraries at 0x40000000; the stack below kernel space; kernel space from 0xc0000000 to 0xffffffff.]

Figure 1: Linux Process Memory Model

To selectively inject faults into a user application's context and not the MPI library, our fault injector employs different techniques for different regions of the address space.

Text, Data and BSS. The identity and location of text, data and BSS memory objects are determined at compile time and are static. To separate the MPI library's memory objects from the user application's, we processed the library and application binaries to retrieve the respective lists of {symbolic name, address} pairs. We then constructed a fault dictionary containing several thousand addresses randomly selected from this list. Any address whose associated symbolic name also appears in the MPI library's list was removed as a possible injection point.

Heap. The heap stores data structures whose memory is dynamically allocated at runtime (i.e., by malloc, realloc and free in C and similar calls in C++). To distinguish the heap areas allocated by the MPI library, we implemented a customized memory allocator that wraps the standard malloc using the GNU C library's "memory allocation hooks." By pointing a hook at a user-defined function, we could change the behavior of the default malloc. For example, our malloc wrapper invokes the GNU C library's malloc to allocate eight bytes more than the caller requested. These extra eight bytes, set by our malloc wrapper, store a 32-bit string (an identifier) and the size of the allocated memory chunk. The identifier indicates whether a memory chunk was allocated by the user application or the MPI library. At entry to an MPI routine, a flag is set, and on exit, the flag is unset. Depending on the flag status, the malloc wrapper marks the allocated memory chunk as user or MPI. When the injector seeks to trigger a fault in the heap, starting at a random address, the injector looks for any memory chunk marked as user. Once located, a random bit in the chunk is flipped.

Stack. Like the heap, the stack also resizes dynamically. In the Intel x86 architecture, the stack is composed of stack frames. Each function call pushes a frame onto the stack, and each return pops a frame. Each frame contains saved registers, arguments, local variables, a return address, and a pointer to the next frame.

The stack frames in use by an application can be identified by a walk-through from the top to the bottom frames (using the EBP and ESP registers) and by examination of the "return address" field in each frame. If the return address falls within the user application's text region, then the frame immediately below is in the user application's context and is subject to our fault injection.

3.3 Message Fault Injection

For MPI message injections, we modified the payload received immediately from the underlying communication software, as shown in Figure 2. MPICH is implemented in three layers: (a) API, which connects the MPICH library to the user application, (b) ADI (Abstract Device Interface), which implements MPI functionality at a network-independent level and (c) Channel, which is the interface between MPICH and the underlying network-specific communication software.

We configured MPICH to use the ch_p4 channel and injected faults at the Channel level. We chose to inject the faults into incoming traffic immediately after MPICH invokes the recv socket routine. Although TCP/IP checksums, coupled with link-level CRCs, are very effective in preventing data corruption, our purpose was to simulate the effects of soft errors that go undetected in the transmission path when only a link-level CRC is present.

In reality, message errors can also originate from network hardware or operating systems. However, injecting faults there either requires special equipment or can cause instability. The fault injection process would also be more time consuming; after each injection, the system must be rebooted to restore a clean state. Because the systems on which we conducted tests are also used by others, operating system fault injection was eliminated.

Before performing message injections, we profiled the application to estimate the total message volume received by each MPI process during the execution.
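The heap-tagging scheme described above can be sketched compactly. Note that glibc's __malloc_hook interface, which the authors used, was later deprecated and removed (glibc 2.34), so this sketch uses an explicit wrapper instead of a hook; the tag values and names are our own, not the paper's.

```c
#include <stdlib.h>

#define TAG_USER 0x55534552u                /* ASCII "USER" */
#define TAG_MPI  0x4d504921u                /* ASCII "MPI!" */

static int inside_mpi_call = 0;             /* set by MPI wrappers on entry, cleared on exit */

/* Allocate len + 8 bytes; the first 8 hold a 32-bit tag and the chunk size. */
void *tagged_malloc(size_t len) {
    unsigned *hdr = malloc(len + 8);
    if (hdr == NULL)
        return NULL;
    hdr[0] = inside_mpi_call ? TAG_MPI : TAG_USER;
    hdr[1] = (unsigned)len;
    return hdr + 2;                         /* caller sees only the payload */
}

/* The injector flips bits only in chunks tagged as user allocations. */
int chunk_is_user(const void *payload) {
    return ((const unsigned *)payload)[-2] == TAG_USER;
}

unsigned chunk_size(const void *payload) {
    return ((const unsigned *)payload)[-1];
}
```

The injector can then scan allocated chunks, skip anything tagged TAG_MPI, and flip a random bit within a TAG_USER chunk's payload.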

During each injection experiment, we generated a uniform random number in this range. The modified MPICH library maintains a counter of received message volume and overwrites the payload when the counter value coincides with the random number.

[Figure 2: diagram of the MPICH software stack — User App, API, ADI, Channel (faults injected here), communication software (TCP/IP, GM, Elan, ...), and communication hardware (GigE, Myrinet, Quadrics, ...).]

Figure 2: Fault Injection for MPI Messages

4 Experimental Environment

The hardware experimental environment is a meta-cluster formed from two Linux PC clusters. The first cluster (Rhapsody) has 32 nodes connected by both 10/100 and Gigabit Ethernet. Each node has dual 930 MHz Pentium III processors and 1 GB of DRAM. The second, older cluster (Symphony) has 16 nodes connected by Ethernet and Myrinet; each node has dual 500 MHz Pentium II processors and 512 MB of RAM.

4.1 Test Applications

We used three scientific codes as test applications: Cactus Wavetoy [18], NAMD [19] and CAM [20]. Each of these codes is in use by computational scientists on a daily basis. To reduce the time needed to conduct experiments to tractable levels, we modified each application's input parameters such that each executed for only 2-5 minutes.

However, we ensured that each application executed several phases (i.e., loop iterations or time steps), as would be typical of normal execution. Despite these parameter modifications, the injection experiments consumed two months of time on the two target clusters.

4.2 Application Profiles

We profiled the three test applications to quantify their memory use and communication frequency and volume. The purpose of profiling was to provide a baseline for interpreting the experimental results and to explain the error behavior. Table 1 shows the per-process application profiles.

For memory, we used the objdump and nm UNIX utilities to obtain the sizes of the text, data and BSS sections. We also used the malloc wrapper mentioned in §3.2 to obtain the heap size. For each of the three applications, the heap grows until it reaches a stable point, with limited variation about this fixed value. In Table 1, we report this stable size. The stack size varied between 5-10 KB for all three applications.

For messages, we modified the MPICH library to measure and classify the incoming traffic at the Channel and ADI levels; see §3.3 and Figure 2. At these two levels, two kinds of messages are generated: control and data; both have 32-64 bytes of header. A control message contains only the header, because all control information is embedded within the header. A data message contains both header and payload. The payload carries the user data passed from MPI calls. As Table 1 shows, Wavetoy and NAMD exhibit one type of behavior, with the majority of the messages transmitting data. In contrast, CAM is dominated by control messages.

4.2.1 Cactus Wavetoy

Cactus [18] is a modular toolkit for developing scientific codes. Wavetoy is a test program from the Cactus package that solves hyperbolic PDEs. For our fault injection experiments, we used a problem size of 150x150x150 and 100 steps. At the end of an execution, the process of rank 0 writes the application results to output files in plain text format. For each execution, Wavetoy spawns 196 MPI processes, each processor serves two MPI processes, and the application executes for just under one minute.

4.2.2 NAMD

NAMD [19] is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. NAMD is based on Charm++, an object-oriented parallel programming library [21]. In our tests, we used NAMD version 2.5b2 and configured Charm++ to use MPICH for message passing. Thus, Charm++ is considered a part of the user application, and it is subjected to fault injection.

                   Cactus Wavetoy    NAMD          CAM
Memory (MB)        1.1               25-30         80
  Text Size        0.3               2             2
  Data Size        0.13              0.11          32
  BSS Size         0.1               0.6           38
  Heap Size        0.45-0.5          22-27         8
Message (MB)       2.4-4.8           13-33         125-150
  Distribution     Header  User      Header  User  Header  User
  Percentage       6       94        8       92    63      37

Table 1: Per-Process Profiles of Test Applications

                            Errors     Error Manifestations (Percent)
Region         Executions   (Percent)  Crash   Hang   Incorrect
Regular Reg.   508          62.8       44      56
FP Reg.        500          4.0        50      50
BSS            502          6.2        19      81
Data           500          2.4        50      50
Stack          980          12.7       65      35
Text           1000         6.7        73      18     9
Heap           933          5.0        8       72     20
Message        2000         3.1        26      42     32

Table 2: Fault Injection Results (Cactus Wavetoy)

We used a 92,000 atom "apoa1" input problem, whose data size was 20 MB. Each NAMD execution spawned 96 MPI processes and executed for 2.8 minutes.

As a baseline for output comparison, we used the NAMD console output, which shows total energies, temperature and pressures at each time step. In NAMD, each MPI process holds a portion of the input set of atoms. Each time step updates the atoms' positions and velocities. However, the order in which these updates occur depends on the MPI message arrival order.

As such, NAMD executions are nondeterministic, and the output files can differ across executions. The only reproducible output is the console output, which shows no noticeable deviation if the number of steps is less than 20, which is what we used in our experiments.

4.2.3 CAM

The Community Atmosphere Model (CAM) [20] is the atmospheric component of a larger, global climate simulation package called CCSM, the Community Climate System Model. In our experiments, we used CAM version 2.0.2 with the default test data sets and initial condition files as input, totalling 96 MB. Each CAM execution used 64 MPI processes. The input data specified 24 hours of simulated time and took 4 minutes of execution to complete. The 76 MB of output is written to disk by the process of rank 0 at the end.

4.3 Fault Sampling

We used sampling theory to determine the number of injections used in the experiments. The fault injection space has three axes: the bit in the memory/register file or message payload, the particular MPI process and the injection time. If we denote each axis by b, m, and t, respectively, the size of this space is b × m × t.

The number of bit targets, b, ranges from 512 (sixteen 32-bit registers) to 150 × 10^6 × 8 ≈ 10^9 (total message volume). In turn, m, the MPI process identifier, ranged from 64-192 for our experimental environment. Each of our test applications executed for a few minutes, so t lies between 120 and 300 seconds.

In total, the injection space is at least 512 × 64 × 120 ≈ 3.9 × 10^6. With constraints on the time one can devote to experiments, it is impossible to inject faults at each point in this space. As such, we used random sampling to choose a small number of points as our injection targets.

                            Errors     Error Manifestations (Percent)
Region         Executions   (Percent)  Crash   Hang   Incorrect   App Det.   MPI Det.
Regular Reg.   498          38.5       86      10     4
FP Reg.        500          7.6        39      11     3           47
BSS            497          1.8        78      22
Data           502          4.2        95      5
Stack          493          9.3        74      13     6           6
Text           498          8.4        79      7      7           7
Heap           500          5.2        81      8      8           3
Message        500          38.0       26      28     46

Table 3: Fault Injection Results (NAMD)

                            Errors     Error Manifestations (Percent)
Region         Executions   (Percent)  Crash   Hang   Incorrect   App Det.   MPI Det.
Regular Reg.   500          41.8       68      26     5           1
FP Reg.        422          8.0        33      15     26          26
BSS            500          3.2        62      25     13
Data           500          2.8        50      50
Stack          500          6.2        71      10     13          6
Text           500          14.8       78      11     7           4
Heap           500          2.6        31      69
Message        500          24.2       21      4      71          3

Table 4: Fault Injection Results (CAM)

Sampling theory defines a population of N elements and a randomly drawn subset of the population, called a sample, of size n. From a sample, the goal is to infer certain statistical properties of the population. In the simplest case, each element of the population belongs to one of two classes, C1 or C2. Let P be the proportion of C1 in the population and p be the proportion of C1 in the sample. We want to estimate P from p.

The quality of the estimate is determined by the sample size n. It should be chosen such that the desired confidence level (1 − α) and the estimation error d are satisfied. Here, α is the risk of not obtaining such a confidence interval. Mathematically, we have

    Pr(|P − p| < d) ≥ 1 − α

That is, with probability at least 1 − α, p is no farther from P than d. When N ≫ n and assuming p is normally distributed, an approximation for n to achieve 1 − α accuracy is [22]:

    n ≥ P(1 − P)(z_{α/2}/d)^2        (1)

where z_{α/2} is the double-tailed α-point of a standard normal distribution. Note that the above formula is independent of the population size N.

However, in (1) the right hand side depends on P, which is exactly the statistical property we want to estimate. This can be resolved by oversampling [22], that is, taking P = 0.5 to maximize the right-hand side of (1). Therefore, our sample size becomes

    n ≥ 0.25(z_{α/2}/d)^2

If the population is divided into k > 2 disjoint classes C1, C2, ..., Ck, the above equations still apply. If Pi and pi are the proportions of Ci in the population and sample, respectively, then we can replace P and p in (1) by Pi and pi and obtain the same result.

In our application of sampling theory, the population is the injection space and the classes represent different error manifestations; see §5.1. For each of the test applications, we performed 400-500 injections in most regions. With a confidence level of 95 percent (i.e., α = 5% and z_{α/2} = 1.96) and using oversampling, the estimation error d is 4.4-4.9 percent.

5 Experimental Results

We executed each of the three applications (Cactus, NAMD and CAM) on our test clusters with the fault injection methodology described in §3. From these experiments, we calculated the error rate, which is the ratio of manifestations to injected faults. For all manifested faults, we also observed the error manifestations and calculated the ratios of the different manifestations. Before analyzing the results, we summarize the range of error manifestations.

5.1 Error Manifestations

If the injected fault does not manifest, we labeled the outcome as correct. Otherwise, the induced errors are categorized into several disjoint classes.

Crash. Application crashes were detected by identifying MPICH error messages in the STDERR output. MPICH handles all critical signals (e.g., SIGSEGV and SIGBUS) due to abnormal termination of both the user application and itself.

Hang. Because our experimental environment was under our exclusive control, there was little variability in execution times. Hence, for each application execution, we waited for one minute beyond the expected execution completion time. If the application did not complete during this time, we terminated the application and labeled the outcome as an application hang.

Application Detected. Some of the applications in our suite implement internal consistency checks. After a consistency failure, these applications print error messages to the console and abort. Therefore, by examining the console output, we identified such errors.

MPI Detected. The MPI 1.1 standard specifies that, by default, an error during the execution of an MPI call causes the application to abort.¹ However, MPI provides mechanisms for users to handle recoverable errors by registering customized error handlers via the MPI_Errhandler_set call. Therefore, we registered such a handler, and whenever the handler was invoked, it labeled the outcome as "MPI detected."

Incorrect Output. After each execution, we compared the application output against the correct one to test for silent data corruption. We labeled the outcome of an injection as incorrect if the user application finished execution without reporting an error, but the output was incorrect. This is the most dangerous of all possible errors, because there is little sign during the execution that can alert the user.

¹ As mentioned in §2.2, MPI assumes the underlying network substrate is reliable, so the errors detectable by MPI concern the user application execution. See §6.2 for more details.

5.2 Results

Table 2 summarizes the results for Cactus Wavetoy. During our tests, no Application Detected or MPI Detected errors were encountered. Tables 3 and 4 show the results for NAMD and CAM, respectively.

6 Failure Mode Analysis

Given the experimental data of §5, we turn to an analysis of the commonalities and differences across the three test applications. This is followed, in §7, by an assessment of the implications for application design and system configuration.

6.1 Common Behaviors

6.1.1 Register Injections

Even a cursory examination of Tables 2-4 shows that the regular (integer) registers are the most vulnerable to transient errors, with an error rate ranging from 38.5 to 62.8 percent. Because the Intel x86 architecture has fewer than a dozen general-purpose registers, most contain live data at any given time. Single bit upsets in these registers are very likely to affect application behavior.

These effects are strongly dependent, however, on the quality of live register allocation and management (a function of the compiler) and the size of the register file. One would expect different sensitivity on systems with a larger register file. For example, Springer [23] investigated the register usage of an image processing kernel on a PowerPC 750 system and found that only 4-5 of the 64 available registers were used during execution. If the code was compiled with the optimization switch -O, the number of live registers jumped to 14-15. This suggests that a program could be made more robust if it is compiled without register optimizations, albeit with a possible performance loss.

The error rate for floating-point register fault injection is much lower than that for integer registers, at only 4-8 percent. The main reason is that not all floating-point registers are accessed, or the faults are overwritten before being accessed.

The Intel x87 FPU has seven special-purpose registers (CWD, SWD, TWD, FIP, FCS, FOO, and FOS) and eight FPU data registers, which are placeholders for floating-point numbers [24]. We found that most special-purpose register injections did not induce errors, except for the TWD register, which can cause NaN (Not a Number) errors.

cause NaN (Not a Number) errors. The TWD (tag word) register indicates the content of each of the eight FPU data registers. The content can be a valid number, zero, special (NaN, infinity, or denormal), or empty. Changing one bit can turn a valid number into NaN or zero.

The x87 FPU instructions treat the FPU data registers as a register stack, which is addressed relative to the register atop the stack. To understand the effects of code generation on fault manifestation, we examined the assembly code generated by the compilers and found that the generated x87 FPU instructions generally use only four of the registers in the stack.

Finally, because the FPU data registers are 80 bits long, based on the IEEE floating point standard, some bits are discarded when the value in an FPU data register is written to memory. In combination, these factors can explain the low observed error rates.

6.1.2 Memory Injections

The error rates for memory injections were consistently low, generally less than 10 percent, across the three applications. Table 1 also suggests that the error rate is largely independent of memory region size. For example, the data section sizes range widely from 130 KB to 38 MB, yet the error rates vary only between 2.4 and 4.2 percent.

Given temporal and spatial locality, we conjectured that either most of the memory is never accessed (i.e., faults are not within the spatial locality), or the faults are injected into memory locations that will not be accessed again or will be overwritten before being accessed (i.e., faults are not within the temporal locality).

To verify this conjecture, we used the open-source memory debugging tool Valgrind [25] to trace the memory accesses of the three applications. Valgrind works directly on executable binaries and can instrument each x86 instruction. We used Valgrind to collect the following run-time memory access data: text accesses, which are executed instructions, and data accesses, which are memory loads in the Data, BSS, and Heap sections.² We recorded snapshots of text and data accesses periodically to understand temporal and spatial locality and their relation to error rates.

² For measurement simplicity, this data is drawn from instrumentation of a randomly selected MPI process, with the application executed on a smaller number of processors. Given the characteristics of our application suite, we believe this data is representative.

Tables 5-7 show the results of these measurements. The memory address shown is the address relative to the beginning of the respective section. Due to instrumentation overhead, the applications run 2 to 5 times slower than normal. To establish a consistent time frame across executions, we used the basic block count to measure elapsed time.

Because injecting faults into unused memory has no effect, it is crucial to identify how much memory is actually accessed. To estimate this, we calculated the working set size, where the "working set size at time t" is the size of the memory accessed since t. The working set size, therefore, is a non-increasing function of t. To relate the error rate to the working set size, we plotted the percentage of working set size relative to the respective section sizes in Tables 5-7.

Based on these data, the following observations are apt. First, all three applications exhibit phase behavior in their memory accesses: an initialization phase and a computation phase. The phase shift occurs when there is a large drop in working set size because the working set has moved from startup code to the computation kernel (spatial locality). During the computation phase, memory accesses are very periodic and regular (temporal locality), and the working set remains unchanged.

Second, the working set size plots suggest the cause of the low error rate from fault injections. For the text section, the working set size at time 0 is 30 percent for Cactus and CAM and 15 percent for NAMD. Entering the computation phase, the working set size declines to 10, 8 and 13 percent for Cactus, NAMD, and CAM, respectively. Compared to the text injection error rates, which are 6.7, 8.4, and 14.8 percent, the small working set size is the cause of the low error rates. Our results are consistent with [23], where only a fraction of the heap was found to be used.

The working set analysis also shows that most memory in the Data, BSS, and Heap areas is either not accessed at all or is not accessed after the initialization phase. At time 0, the Data+BSS+Heap working set size is 28, 60, and 19 percent for Cactus, NAMD, and CAM, respectively. During the computation phase, this size drops to only 12, 22, and 16 percent. A closer look at the Data and BSS sections shows that their working sets are usually even smaller, mostly less than 10 percent. These results strongly correlate with the low error rates of Data+BSS+Heap injections.

Unlike the text section, the working set alone cannot completely explain the error rates for Data+BSS+Heap injections; the text is read-only, whereas Data+BSS+Heap can have many interleaving writes and reads. Although we did not record the most recent write to each memory location before a read, we conjecture that a corrupted memory cell is overwritten by the application before it is loaded and used again. In addition, a bit error in an instruction opcode can alter the instruction and halt the execution, whereas a bit error in the data can be more innocuous. We believe this also contributes to the low error rates in the Data, BSS, and Heap areas.

6.2 Application Differences

In addition to similarities, there were also substantial differences in application behaviors and sensitivities to injected faults. First, both NAMD and CAM include internal consistency checks for NaN (Not a Number) values in some key variables. Both codes reported many NaN errors as a consequence of our injecting faults into the floating-point registers.

When an error occurs, the tests within NAMD and CAM detect it with 47 percent and 26 percent probability, respectively; see Tables 3-4. After detecting NaN errors, both applications abort. In addition, both NAMD and CAM use sanity/bound checks and assertions on certain data structures to capture a fraction (3-7 percent and 4-13 percent, respectively) of faults that manifest in the Data, Text, Heap and Stack areas.

For example, in CAM, any moisture value below a minimum threshold can trigger a warning and abort the application. Although these tests still capture less than half of all injected errors, they highlight the critical importance of internal consistency checks when executing in environments where hardware errors may occur.

Both NAMD and CAM are quite sensitive to errors injected into message payloads, with 38 and 24 percent error rates, respectively. However, NAMD can detect 46 percent of these errors, while CAM detects only three percent of them. As with floating point errors, we attribute NAMD's high detection rate to its built-in message consistency checks,³ which CAM lacks. An instrumentation of the NAMD code shows that these internal checks increase the execution time by three percent, but can detect many errors.

We also observed a few "MPI Detected" error manifestations for NAMD and CAM. All were associated with memory errors in the stack contents. Recall that MPI allows the user application to register error handler callback functions. However, in MPICH, the callback is triggered only when incorrect arguments are passed to MPI routines (e.g., a non-existent destination specified for a send operation). Stack error injections trigger such errors because the stack holds the arguments to function calls. Other errors, such as abnormal termination of an MPI process due to fault injection, do not trigger the error handler. Instead, MPICH itself will abort the user application, which we labeled as an application Crash.

The MPI 1.1 standard gives implementors considerable liberty concerning those errors that can raise the error handler. To assess alternatives, we examined the source code for two other popular MPI 1.1 implementations, LAM/MPI [26] and LA-MPI [27]. We found that they also raise the user-registered error handler only when argument checks fail. Abnormal termination of peer MPI processes will abort the application without invoking the error handler, just as MPICH does.

In Cactus, only 3.1 percent of the perturbed message payloads have an adverse effect, which is much lower than NAMD and CAM. To understand this difference, we correlated Cactus's low perturbation sensitivity with the MPICH traffic structure, Cactus's message passing behavior, and Cactus's output format.

Recall from §4.2 that MPICH traffic can be roughly classified as headers and user data. For Cactus, 94 percent of its incoming MPI traffic is user data and 6 percent consists of headers only. A substantial fraction of Cactus data transfers are large arrays of floating-point numbers, whose perturbation does not crash or hang the application; rather, these errors are manifest in other ways. In contrast, perturbing the headers has about a 40 percent probability of corrupting the Cactus execution. Therefore, the combined Crash and Hang rate is 6% × 0.4, or roughly 2.4 percent.

Because user data is the majority of Cactus message traffic, message fault injection should induce many cases of incorrect output. However, we found experimentally that this was not true. The reason is the output data representation. As mentioned in §4.2.1, we configured Cactus Wavetoy to write its output textually, which has the advantage of portability: platform differences such as byte order are avoided. However, for Cactus Wavetoy, it also hides small changes in low-order decimal digits.

A detailed examination of Cactus message data showed that most transferred data are very close to zero. Only when faults occur in the significant bits of the exponent or mantissa will the output be incorrect. We also noted that executing more Cactus Wavetoy iterations will almost always yield incorrect outputs (i.e., the error amplifies as the computation continues). A binary output format would detect more cases of incorrect output.

7 Implications

³ These checks are implemented inside NAMD, not Charm++.

From our experiments, one can draw several conclusions. First and most importantly, soft bit errors can dramatically and adversely affect application reliability on commodity parallel systems. Without hardware checksums, ECC memory and application-specific error checking, soft errors, particularly on large systems, will trigger application crashes, hangs or incorrect results.

The definition of correctness is also often application specific, and different definitions can lead to different error manifestation results. For Cactus Wavetoy, results presented in plain-text format have lower precision and can mask some injected faults. In turn, NAMD execution is nondeterministic, making error identification difficult. The distinctions between memory fault induced errors and small variations due to numerical roundoff are subtle and difficult to detect.

A considerable fraction of the induced errors lead to execution modes that do not terminate. Although determining if an execution will terminate is undecidable, simple progress metrics (e.g., FLOPS, messages per second or loop iterations per minute) can provide practical detection mechanisms. If the application's performance drops below a user-defined threshold, it is very likely that the code is in a non-terminating mode.

In detecting communication errors, NAMD's message checksum is effective at low cost, only three percent overhead. However, NAMD's checksum tests only user data, not headers, which can be observed only inside the MPI library. As Table 1 shows, each NAMD process receives 13-33 MB of data during 2.8 minutes of execution. If an application transfers a larger volume of user data per unit time, the overhead for application-level message checksums can rise substantially.

Program assertions and sanity/consistency checks are usually used for debugging and are removed in production code. In our experiments, a fraction of the injected faults are captured by these checks. Use of internal checks is an important aspect of robust application implementation, but they must be used wisely because excessive checks can harm performance.

8 Related Work

There is a long and rich history of fault tolerance studies, ranging from hardware assessment through software measurement to algorithm-based fault tolerance. Below, we review representative work in related areas.

8.1 Fault Injection into Parallel Codes

In an early study of parallel systems, Carreira et al. [28] injected faults into the communication system of the T805 Transputer to study their impact on parallel applications. They used a kernel-mode fault injector and found that 5-30 percent of the injected faults caused the application to generate incorrect results. Mirroring our experience, bit errors in message packet headers almost always caused the applications to fail.

More recently, Constantinescu [10] used hardware-based fault injection to assess the reliability of the 9,000-processor ASCI Red system, which has ECC memory, parity, protocol checking, watchdog timers and message checksums to ensure data integrity. The injected faults were stuck-at-0/1's applied with a hardware probe at the IC pin level. During the fault experiments, the processors executed the Linpack benchmark, and the results were verified at the end of every iteration. Overall, Constantinescu found the error detection rate on the compute nodes was 80-84 percent, though error detection was dependent on the fault duration. Transients proved more difficult to detect, whereas longer faults led to application failures (hangs).

Our work differs in its use of a cost-effective user-mode software fault injector. We also used a suite of scientific codes and a large number of processes (64-192) to simulate a mid-sized cluster environment. We also used commodity Linux clusters and the MPI communication library, the de facto standard for parallel programming.

8.2 Software Solutions to Soft Errors

Many software-based reliability techniques have been developed to handle soft errors. To handle memory errors in the text regions of application code, control-flow checking can monitor branches to determine if they deviate from a pre-generated control-flow signature [29]. Simultaneous multithreading (SMT) has also been exploited: two threads execute the same code, with one running slightly ahead of the other. The trailing thread compares the values produced by both and triggers an error if there is a mismatch [30].

In general, result checking and self-correction [31] seek efficient result checkers or correctors by exploiting program structure. Algorithm-based fault tolerance (ABFT) [32] techniques exploit the algorithmic structure of codes to create efficient, domain-specific detection schemes. Silva [33] reports that ABFT can detect almost all injected faults with only a ten percent performance penalty.
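As a concrete illustration of the ABFT idea [32], row and column checksums added to a matrix can locate a single corrupted data element. The sketch below is our own minimal rendering of the Huang-Abraham scheme, not code from any of the cited systems, and it does not cover faults in the checksum elements themselves:

```python
def with_checksums(a):
    """Append a checksum column to each row and a checksum row (ABFT encoding)."""
    n = len(a)
    full = [row[:] + [sum(row)] for row in a]
    full.append([sum(full[i][j] for i in range(n)) for j in range(n + 1)])
    return full

def find_corruption(full):
    """Locate a single corrupted data element as (row, col), or None if consistent."""
    n = len(full) - 1
    bad_row = [i for i in range(n) if sum(full[i][:n]) != full[i][n]]
    bad_col = [j for j in range(n) if sum(full[i][j] for i in range(n)) != full[n][j]]
    if bad_row and bad_col:
        return bad_row[0], bad_col[0]
    return None

enc = with_checksums([[1, 2], [3, 4]])
assert find_corruption(enc) is None
enc[0][1] += 8                    # simulate a soft error in one data element
assert find_corruption(enc) == (0, 1)
```

The intersection of the inconsistent row and the inconsistent column pinpoints the faulty element, which is what makes the scheme cheap enough for the low overheads reported by Silva [33].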

In some cases, one can exploit naturally fault tolerant algorithms [34] whose outputs are resilient to perturbation during the calculations. For example, iterative algorithms for solving systems of linear equations use successive approximations to obtain more accurate solutions at each step. A small error or lost datum only slows convergence rather than leading to wrong results [35].

9 Conclusions

With the increasing use of COTS components to construct large parallel systems, it is crucial that we understand the interplay of hardware component reliability and parallel application execution. Even a small chance of memory errors or communication errors can lead to application crashes, hangs or wrong outputs.

In this paper, we examined the impact of soft memory and message errors on MPI codes. We performed thousands of injections into registers, the process address space, and MPI messages to simulate single-bit-flip errors. We found that registers and messages are particularly vulnerable to single-bit-flip faults, with an average fault manifestation rate of 34.7 percent. When a message fault manifests, the chance of producing a wrong output can be quite high, ranging from 28 to 71 percent for the three codes in our tests.

Application assertions and internal consistency checks can detect some of these errors, albeit at the expense of additional execution time. The MPI 1.1 library supports only very minimal error detection and recovery. Based on these results, we believe there should be a serious effort to redesign or enhance parallel applications and communication libraries with a renewed emphasis on fault tolerance, such that these applications can run successfully on large systems.

Acknowledgments

Special thanks go to Wei Deng for local cluster support and to NCSA for access to larger clusters.

References

[1] "MPI: A message-passing interface standard, version 1.1." Message Passing Interface Forum, 1995. http://www.mpi-forum.org/docs/.

[2] T. Lin and D. Siewiorek, "Error log analysis: Statistical modeling and heuristic trend analysis," IEEE Transactions on Reliability, vol. 39, no. 4, 1990.

[3] C. Constantinescu, "Impact of deep submicron technology on dependability of VLSI circuits," in Conference on Dependable Systems and Networks (DSN), 2002.

[4] Micron Technology, "DRAM soft error rate calculations," Tech. Rep. TN-04-28, 1994.

[5] Micron Technology, "Module mean time between failures (MTBF)," Tech. Rep. TN-04-45, 1997.

[6] J. F. Ziegler, "Terrestrial cosmic rays and soft errors," IBM Journal of Research and Development, vol. 40, no. 1, 1996.

[7] Actel Corporation, "Understanding soft and firm errors in semiconductor devices." http://www.actel.com/documents/SER FAQ.pdf.

[8] Tezzaron Semiconductor, "Soft errors in electronic memory - a white paper." http://www.tachyonsemi.com/about/papers/.

[9] "Data integrity for Compaq NonStop Himalaya servers (white paper)," 1999.

[10] C. Constantinescu, "Teraflops supercomputer: Architecture and validation of the fault tolerance mechanisms," IEEE Transactions on Computers, vol. 49, no. 9, 2000.

[11] T. Dell, "A white paper on the benefits of Chipkill-correct ECC for PC server main memory." IBM Microelectronics Division, 1997.

[12] N. J. Boden et al., "Myrinet: A gigabit-per-second local area network," IEEE Micro, vol. 15, no. 1, 1995.

[13] W. Feng, "The future of high-performance networking," in Workshop on New Visions for Large-Scale Networks: Research and Applications, 2001.

[14] J. Stone and C. Partridge, "When the CRC and TCP checksum disagree," in ACM SIGCOMM, 2000.

[15] A. Cataldo, "SRAM soft errors cause hard network problems." EE Times, August 17, 2001.

[16] M. Hsueh, T. Tsai, and R. Iyer, "Fault injection techniques and tools," IEEE Computer, vol. 30, no. 4, 1997.

[17] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, 1996.

[18] G. Allen et al., "The Cactus Code: A problem solving environment for the Grid," in IEEE Symposium on High Performance Distributed Computing (HPDC), 2000.

[19] L. Kale et al., "NAMD2: Greater scalability for parallel molecular dynamics," Journal of Computational Physics, vol. 151, 1999.

[20] E. Kluzek et al., "User's guide to NCAR CAM2.0," tech. rep., National Center for Atmospheric Research, Boulder, Colorado, 2002.

[21] L. Kale and S. Krishnan, "CHARM++: A portable concurrent object oriented system based on C++," in Conference on Object Oriented Programming Systems, Languages and Applications (OOPSLA), 1993.

[22] W. Cochran, Sampling Techniques. John Wiley & Sons, Inc., 1963.

[23] P. Springer, "Analysis of application behavior during fault injection." http://hpc.jpl.nasa.gov/PEP/pls/pubs.html.

[24] "IA-32 Intel architecture software developer's manual, volume 1: Basic architecture." Intel Corporation, 2003.

[25] N. Nethercote and J. Seward, "Valgrind: A program supervision framework," Electronic Notes in Theoretical Computer Science, vol. 89, no. 2, 2003.

[Plots: text accesses (working set) and Data+BSS+Heap loads (working set, with Data and BSS shown separately); y-axis: working set size (%); x-axis: time in basic block count, 0 to 4e+07.]

Table 5: Memory Trace of Cactus Wavetoy
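The working-set curves summarized in Tables 5-7 follow directly from the definition in §6.1.2: the working set size at time t is the amount of distinct memory touched from t onward, a non-increasing function of t. The routine below is an illustrative reconstruction over a hypothetical address trace, not the Valgrind instrumentation used in the study:

```python
def working_set_sizes(trace):
    """For each access index t, count distinct addresses touched from t onward.
    By construction this is a non-increasing function of t, matching Section 6.1.2."""
    seen = set()
    sizes = [0] * len(trace)
    for t in range(len(trace) - 1, -1, -1):   # scan the trace backwards
        seen.add(trace[t])
        sizes[t] = len(seen)
    return sizes

# toy trace: startup touches many addresses once, then a kernel loops over a few
trace = [0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x60, 0x50, 0x60, 0x50]
ws = working_set_sizes(trace)
assert ws == sorted(ws, reverse=True)   # non-increasing, as the paper notes
assert ws[0] == 6 and ws[-1] == 1
```

The sharp drop after the startup accesses mirrors the initialization-to-computation phase shift visible in the plots.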

[Plots: text accesses (working set) and Data+BSS+Heap loads (working set, with Data and BSS shown separately); y-axis: working set size (%); x-axis: time in basic block count, 0 to 8e+08.]

Table 6: Memory Trace of NAMD
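The progress-metric hang detection suggested in §7 (FLOPS, messages per second, or iterations per minute compared against a user-defined threshold) can be sketched as follows. All names, the polling scheme, and the thresholds are hypothetical illustrations, not tooling from the study:

```python
import time

def monitor_progress(read_counter, threshold, interval, max_checks):
    """Flag a likely hang when the progress rate (counter units per second)
    falls below `threshold` during any polling interval."""
    last = read_counter()
    for _ in range(max_checks):
        time.sleep(interval)
        now = read_counter()
        if (now - last) / interval < threshold:
            return "suspected hang"
        last = now
    return "progressing"

# toy stand-in for an application iteration counter that stops advancing
ticks = iter([0, 100, 200, 200, 200])
assert monitor_progress(lambda: next(ticks), threshold=50,
                        interval=0.01, max_checks=4) == "suspected hang"
```

As §7 notes, this cannot decide termination; it only converts "performance below a user-defined threshold" into an actionable signal.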

[Plots: text accesses (working set) and Data+BSS+Heap loads (working set, with Data and BSS shown separately); y-axis: working set size (%); x-axis: time in basic block count, 0 to 2.4e+09.]

Table 7: Memory Trace of CAM
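The §6.2 observation that Cactus Wavetoy's low-precision textual output masks perturbations of near-zero values, while exponent-bit faults do not go unnoticed, can be reproduced with a small experiment. The value and the `flip_bit` helper are our own illustrations, not data or code from the study:

```python
import struct

def flip_bit(x, k):
    """Flip bit k (0 = least significant) of the IEEE-754 binary64 encoding of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return y

x = 1.2345678901234567e-9          # a near-zero value, like most Cactus message data
low = flip_bit(x, 0)               # low-order mantissa bit (bits 0-51 are mantissa)
exp = flip_bit(x, 62)              # a high exponent bit (bits 52-62 are the exponent)

# a 7-significant-digit textual format hides the mantissa flip...
assert f"{low:.6e}" == f"{x:.6e}"
# ...but the exponent flip changes the value by many orders of magnitude
assert f"{exp:.6e}" != f"{x:.6e}"
```

This is why a binary output format, which preserves every bit, would classify more injections as incorrect output.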

[26] G. Burns, R. Daoud, and J. Vaigl, "LAM: An open cluster environment for MPI," in Supercomputing Conference (SC), 1994.

[27] R. Graham et al., "A network-failure-tolerant message-passing system for terascale clusters," in International Conference on Supercomputing (ICS), 2002.

[28] J. Carreira, H. Madeira, and J. Silva, "Assessing the effects of communication faults on parallel applications," in International Parallel and Distributed Processing Symposium (IPDPS), 1995.

[29] N. Oh, P. P. Shirvani, and E. J. McCluskey, "Control flow checking by software signatures," IEEE Transactions on Reliability, vol. 51, no. 2, 2002.

[30] T. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient-fault recovery using simultaneous multithreading," in International Symposium on Computer Architecture (ISCA), 2002.

[31] H. Wasserman and M. Blum, "Software reliability via run-time result-checking," Journal of the ACM, vol. 44, no. 6, 1997.

[32] K. Huang and J. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 33, no. 6, 1984.

[33] J. Silva, J. Carreira, H. Madeira, D. Costa, and F. Moreira, "Experimental assessment of parallel systems," in Symposium on Fault-Tolerant Computing (FTCS), 1996.

[34] A. Geist and C. Engelmann, "Development of naturally fault tolerant algorithms for computing on 100,000 processors." http://www.csm.ornl.gov/~geist/Lyon2002-geist.pdf.

[35] G. M. Baudet, "Asynchronous iterative methods for multiprocessors," Journal of the ACM, vol. 25, no. 2, 1978.
