Self-Repair of Uncore Components in Robust System-on-Chips: An OpenSPARC T2 Case Study

Yanjing Li¹,²    Eric Cheng¹    Samy Makar¹    Subhasish Mitra¹

¹Stanford University, Stanford, CA 94305 USA    ²Intel Corporation, Santa Clara, CA 95054 USA

Abstract

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are:
1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact.
2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance.
3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault.
Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

1. Introduction

Permanent faults or hard failures, such as those caused by early-life failures, circuit aging, and manufacturing defects and variations, pose major reliability challenges in advanced CMOS technologies [Agostinelli 05, Borkar 05, 07, Hicks 08, Nassif 12, Van Horn 05]. To enable robust systems with built-in tolerance to permanent faults, the following steps must work together in a holistic fashion during the manufacturing process and in the field:
• Detection of permanent faults. Permanent faults can be detected either during manufacturing test (including burn-in), or in the field using techniques such as concurrent error detection [Mitra 00], circuit failure prediction [Agarwal 07, Karl 08, Kim 10], and online self-test and diagnostics [Constantinides 07, Li 08, 10].
• Diagnosis to narrow down permanent fault location(s).
• Self-repair to replace/bypass faulty components (i.e., components with permanent faults), so that the system keeps functioning correctly even in the presence of permanent faults. Self-repair is distinct from self-tuning of system parameters (e.g., frequency, voltage, or body bias) to compensate for delay degradation due to circuit aging [Mintarno 11, Tiwari 08].
• If permanent faults are detected in the field, an additional recovery step may be required to correct corrupt system data and states, e.g., using checkpointing and rollback [Elnozahy 02, Nakano 06].

In this paper, we focus on self-repair. Although our primary objective is to overcome reliability challenges due to permanent faults, our techniques may also be used for yield improvement after manufacturing.

Previous work on self-repair mostly targets memories, processor cores, interconnection networks, and FPGAs. In contrast, we focus on uncore components¹ because they can account for a significant proportion of the overall area of a multi-core SoC. In this paper, we use this term to refer to non-processor logic components such as various controllers (e.g., cache / DRAM / I/O controllers) and accelerators (e.g., network offload engines). In the industrial OpenSPARC T2 SoC that supports 8 cores and 64 hardware threads [OpenSPARC], the logic area (excluding all SRAM modules, e.g., cache memories and queues/buffers) of uncore components is comparable to that of processor cores (Fig. 1). If any uncore component fails, the entire SoC can stop functioning correctly. For example, if a fault occurs in the logic that indicates whether a DRAM request is valid in a DRAM controller, requests to that controller can be dropped, which can result in system hang. Hence, self-repair of uncore components is essential.

[Figure 1. Area breakdown of OpenSPARC T2 [Li 10]: logic area of processor cores 11.8%, logic area of uncore components 12.2%, memories 76%.]

Self-repair techniques that utilize spare units, i.e., sparing-based techniques, can be expensive for uncore components. For example, a self-repair technique which uses one spare component for each uncore component "type" incurs high chip-level area cost of 16% (optimistic, before place-and-route) for OpenSPARC T2 (Sec. 2). We overcome this challenge of high self-repair costs of existing techniques, and make the following contributions:
1. We present two self-repair techniques, Enhanced Resource Reallocation and Sharing (ERRS) and Sparing through Hierarchical Exploration (SHE), and demonstrate their effectiveness and practicality using the open-source OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads.
2. ERRS and SHE enable effective self-repair of any single faulty uncore component with 7.5% chip-level area impact (after place-and-route), 3% chip-level power impact, and 5% application performance impact in the presence of a single faulty component. ERRS and SHE do not introduce any performance impact in fault-free systems.
3. ERRS is capable of self-repairing multiple faulty uncore components without incurring additional area costs. As the number of faulty components increases, the system experiences graceful performance degradation (i.e., it takes longer to execute target applications). For example, ERRS allows up to 4 L2 cache bank controllers and 2 DRAM controllers to be simultaneously faulty, with an application performance impact of 11.4%.
4. We quantify the effectiveness of ERRS and SHE using a self-repair coverage metric, which is defined as the probability that the uncore components function correctly for a given number of faults (Sec. 4.4). ERRS and SHE achieve 97.5% self-repair coverage in the presence of a single fault, and over 86.1% self-repair coverage even in the presence of a large number of faults.

¹ May also be referred to as "nest," "outside-core," or "northbridge" components.

Our self-repair techniques enable flexible tradeoffs between area, power, and performance costs and self-repair coverage. For example, as shown in Fig. 2, we achieve 74.9% single-fault self-repair coverage (self-repair coverage in the presence of a single fault, details in Sec. 4.4) with only 3.2% area (post-layout), 2.8% power, and 5% performance impact (in the presence of a single faulty component) using ERRS alone. With SHE and ERRS, single-fault self-repair coverage increases to 97.5%, while post-layout area, power, and performance impact (for a single faulty component) is 7.5%, 3%, and 5%, respectively.

[Figure 2. Single-fault self-repair coverage vs. post-layout chip-level area overhead of ERRS and SHE on OpenSPARC T2: ERRS alone achieves 74.9% coverage at 3.2% area overhead (2.8% power, 5% performance impact for a single faulty component); ERRS+SHE achieves 97.5% coverage at 7.5% area overhead (3.0% power, 5% performance impact for a single faulty component). ERRS and SHE do not introduce performance impact in fault-free systems.]

For self-repair during system operation, permanent faults must first be detected and localized. We achieve both objectives using a low-cost online self-test and diagnostics technique called CASP [Li 08, 10], which introduces only 1% chip-level area and power impact for OpenSPARC T2 (Appendix A). For ERRS and SHE, faults only need to be localized to hardware blocks that can be replaced/bypassed. Hence, we do not require highly fine-grained diagnosis techniques that localize faults to individual gates (e.g., for yield-learning purposes [Keim 06]). As discussed in [Beckler 12], such fine-grained gate-level diagnosis can be difficult and expensive to achieve during system operation.

In Sec. 2, we discuss the limitations of existing self-repair techniques. We present ERRS and SHE in Sec. 3. In Sec. 4, we quantify chip-level area, power, and performance costs, and self-repair coverage of ERRS and SHE, and present comparisons with existing techniques.

2. Limitations of Existing Self-Repair Techniques

Existing self-repair techniques mostly focus on processor cores, memories, networks-on-chip, and FPGAs. Spare cores are a well-known idea, and are used in commercial products (e.g., Cisco Metro [Shalf 09] and IBM BlueGene/Q [Morgan 11]). Other self-repair techniques for cores include microarchitectural block disabling using hardware [Shivakumar 03, Schuchman 05] or software [Scholzel 11] techniques, core cannibalization [Romanescu 08], and architectural core salvaging [Powell 09]. These techniques suffer from high area cost (e.g., 11%), performance cost (e.g., 20%), low self-repair coverage (e.g., 60%-80%), or limited applicability (e.g., only applicable for applications that seldom utilize "complex" instructions such as floating-point SIMD instructions). A considerable amount of research literature as well as commercial products exist for redundancy-based built-in self-repair of memories [Benso 02, Kim 98, Schober 01, Zorian 03], cache line disabling [Chang 07, Sanda 08, Turgeon 91], and reconfigurable cache architectures [Shirvani 99]. A variety of techniques also exist for fault-tolerant routing in interconnection networks, and, more recently, in networks-on-chip [Adams 87, Gomez 04]. Commercial FPGAs, e.g., from Altera, incorporate redundancy and repair for yield improvement [Altera 06].

In contrast, very little attention has been paid to self-repair of uncore components. Techniques that may be used for self-repair of such components include:
• Sparing-based techniques, which implement spare units to replace faulty hardware blocks. We discuss in detail three existing sparing-based techniques later in this section.
• Reconfigurable wrappers [Abramovici 06] for post-silicon validation can potentially be used to modify a faulty signal to "fix" a fault. However, to achieve high self-repair coverage, fine-grained diagnosis to localize faults to individual signals may be required.
• Roving Emulation [Breuer 86] was originally introduced in the context of fault detection. It uses an emulation engine to emulate the operations of a given component, and compares the outputs of the emulation engine with those of the component periodically for short intervals of time to detect permanent faults. The emulation engine may also be used to emulate the functionalities of a faulty component for self-repair. However, the area and performance costs for self-repairing arbitrary faulty components using an emulation engine may be high.
• ABFT (Algorithm-Based Fault Tolerance) utilizes special application properties, e.g., matrix operations [Huang 84], to achieve low-cost fault tolerance. However, ABFT is not generally applicable for arbitrary applications.

Given their generality, we mainly consider sparing-based techniques in this paper. We quantify the area costs of three existing sparing-based techniques (Fig. 3) for OpenSPARC T2 [OpenSPARC]. As discussed later in Sec. 4.1, power costs of sparing-based techniques can be low if proper power-gating techniques are used.
1. Component-type sparing (Fig. 3a) allocates one spare unit for each component type; i.e., a single spare is used for multiple identical instances of the same component. The spare unit includes all logic and SRAM modules (e.g., queues and buffers) inside a component instance. If multiple faulty components of the same type need to be tolerated, more spare units are required.
2. Logic sparing [Allsup 10, Mirza 12a] (Fig. 3b) duplicates the logic portion of each component (in contrast to each component type). It excludes SRAM modules, for which self-repair techniques exist (e.g., [Aitken 04]). Multiple faulty components may be tolerated using logic sparing since a spare unit is not shared by multiple components.
3. Shared-FF sparing [Mirza 12b] (Fig. 3c) is a special case of logic sparing which utilizes a few spare flip-flops for self-repairing all flip-flops, and only duplicates the combinational logic gates of each component. Multiple faulty components may be tolerated as long as the few spare flip-flops are capable of self-repairing all faulty flip-flops in a component.

[Figure 3. Existing sparing-based techniques. a) Component-type sparing. b) Logic sparing. c) Shared-FF sparing. Steering logic (multiplexers and interconnects) selects between each original unit and the spare.]

Figure 4 shows the uncore components considered in our case study, the breakdown of synthesis area of various components (area evaluation methodology in Sec. 4.1), and the estimated area costs required to implement the three sparing-based techniques. Area cost estimates in Fig. 4 are optimistic because we do not account for the area cost of steering logic (multiplexers and interconnects that select whether the original or the spare unit should be used). In reality, wire routing overheads of steering logic can be high. The high area impact of these techniques (12%-16% even with our optimistic estimation) motivates the need for new cost-effective uncore self-repair techniques. Note that the area costs of our self-repair techniques (reported in Sec. 4.1) are not optimistic, because we perform place-and-route to account for routing overheads.

Components (# instances)                  | % original chip area per instance | Component-type sparing | Logic sparing | Shared-FF sparing
Processor core (8)                        | 4.54% | N/A    | N/A    | N/A
L2 cache data/tag arrays (L2M) (8)        | 3.95% | N/A    | N/A    | N/A
Crossbar (CCX) sub-block (17)             | 0.15% | 0.30%  | 2.54%  | 2.41%
L2 cache bank controller (L2C) (8)        | 1.87% | 1.87%  | 6.02%  | 5.51%
DRAM controller (MCU) (4)                 | 0.41% | 0.41%  | 1.19%  | 1.03%
Non-cacheable unit (NCU) (1)              | 0.67% | 0.67%  | 0.30%  | 0.26%
System interface unit (SIU) (1)           | 1.38% | 1.38%  | 0.29%  | 0.25%
PCI-express controller (DMU) (1)          | 1.26% | 1.26%  | 0.70%  | 0.61%
Network controller/accelerator (NIU) (1)  | 9.65% | 9.65%  | 1.58%  | 1.32%
Total                                     | 100%  | 15.53% | 12.63% | 11.38%

Figure 4. OpenSPARC T2 components and area breakdown. The uncore components considered in this paper are the seven types for which sparing overheads are listed (CCX sub-block through NIU). The last three columns give optimistic estimates of the synthesis area overhead for uncore self-repair using the three existing sparing-based techniques (cost of steering logic not included).
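As a sanity check on the "Total" row of Fig. 4, the short sketch below (written for this discussion, not part of the original study) simply sums the per-component-type overheads; the results match the printed totals up to rounding of the individual entries.

```python
# Optimistic synthesis area overhead (% of original chip area) per uncore
# component type, copied from Fig. 4: (component-type, logic, shared-FF sparing).
overheads = {
    "CCX sub-block": (0.30, 2.54, 2.41),
    "L2C":           (1.87, 6.02, 5.51),
    "MCU":           (0.41, 1.19, 1.03),
    "NCU":           (0.67, 0.30, 0.26),
    "SIU":           (1.38, 0.29, 0.25),
    "DMU":           (1.26, 0.70, 0.61),
    "NIU":           (9.65, 1.58, 1.32),
}

totals = [sum(row[i] for row in overheads.values()) for i in range(3)]
for name, total in zip(("component-type", "logic", "shared-FF"), totals):
    print(f"{name} sparing: {total:.2f}% chip-level synthesis area overhead")
# Prints ~15.54%, ~12.62%, ~11.39%, i.e., the 15.53% / 12.63% / 11.38% totals
# of Fig. 4 up to rounding of the per-row entries.
```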

3. New Self-Repair Techniques for Uncore Components

To overcome the limitations of existing self-repair techniques, our approach is to analyze the architecture (i.e., functional properties) of each uncore component in addition to the structural information (i.e., organization of hardware blocks). This enables us to take advantage of existing architectural features to reduce area costs. In some cases, we can even eliminate spare units. Our new self-repair techniques include:
1. Enhanced Resource Reallocation and Sharing (ERRS).
2. Sparing through Hierarchical Exploration (SHE).
We demonstrate our techniques on OpenSPARC T2 (Fig. 4) as a case study, where ERRS is used for self-repairing CCX, L2Cs, and MCUs, and SHE is used for NCU, SIU, DMU, and NIU.

3.1. Enhanced Resource Reallocation and Sharing (ERRS)

To eliminate the need for spare units, we utilize the existence of multiple instances of uncore components that can accomplish "similar" functionalities. For example, if one of the L2 cache bank controllers (L2Cs) becomes faulty, the idea is to reallocate the workload of that (faulty) component to another already-existing and "similar" component, referred to as a helper. The helper then processes the workload of the faulty component in addition to its own. For example, an L2C helper will allocate entries for and respond to requests (from processor cores) to the data that are mapped to the faulty L2C.

We expect a wide variety of future SoCs to benefit from ERRS. For example, multiple cache controllers and DRAM controllers already exist in many commercial SoCs (e.g., Intel SandyBridge, IBM Power 7, Oracle UltraSPARC). Multiple processing units and controllers also exist in graphics ICs (e.g., Nvidia GT200) and network processors (e.g., C-5 digital communication processors). Moreover, as the trends of modular design and parallel execution models continue to prevail in SoC designs, it is likely that multiple "similar" non-processor controllers and accelerator engines will be prevalent in future SoCs.

The key challenge for such a sharing-based approach is that, due to resource sharing, application performance can degrade significantly after self-repair (i.e., after a fault occurs). For example, direct application of the RRS (Resource Reallocation and Sharing) approach [Li 10] (originally created for online self-test and diagnostics) to self-repair can result in 22% system performance impact with just a single faulty component (details in Sec. 4.3). The idea of ERRS is to minimize this performance impact without significant area costs. ERRS can be achieved by following three steps:
Step 1. Implement RRS [Li 10] for self-repair.
Step 2. Identify RRS performance bottlenecks.
Step 3. Overcome RRS performance bottlenecks via critical architectural enhancements with area vs. performance tradeoffs.
A key advantage of ERRS is that it is capable of self-repairing multiple faulty components (without any design changes) as long as a component and its helper are not both faulty.

Step 1. Implement RRS for Self-Repair

RRS serves as a starting point to enable resource sharing for self-repair. The general flow to enable resource sharing after a fault occurs (Fig. 6 shows an example; more details in a technical report [Li 13]) is:
1. Stall the faulty component so that no new requests are accepted.
2. Drain any outstanding requests in the faulty component.
3. Transfer "necessary states" from the faulty component to the helper. The necessary states are required for the helper to correctly process the requests that are reallocated from the faulty component (e.g., design configuration parameters such as refresh intervals in MCUs).
4. Reallocate workloads from the faulty component to the helper.
5. Enable the helper to process workloads originally mapped to the faulty component in addition to its own (using the same control signals that indicate whether uncore components are faulty).
6. Disregard outputs from the faulty component.

RRS for CCX sub-blocks

In OpenSPARC T2, a CCX sub-block (e.g., CCX sub-block 0 in Fig. 5) arbitrates and forwards requests from multiple sources (e.g., 8 processor cores in Fig. 5) to a single destination (e.g., L2C 0 in Fig. 5). If a CCX sub-block becomes faulty (e.g., CCX sub-block 0 in Fig. 6), the helper (e.g., CCX sub-block 1 in Fig. 6) arbitrates and forwards requests originally mapped to the faulty CCX sub-block along with its own. Hardware modifications to support RRS for CCX sub-blocks are shown in Fig. 6. Workload reallocation is done by modifying the valid bit logic, which indicates which CCX sub-block is responsible for routing an incoming request. We also modify the arbitration logic so that two requests targeting different destinations can be dispatched in the same cycle by the helper.

[Figure 5. CCX sub-blocks in OpenSPARC T2: each sub-block receives requests from cores 0-7 through the valid bit logic, arbitrates among them, and forwards them to its L2C.]

[Figure 6. RRS for self-repairing CCX sub-blocks in OpenSPARC T2. CCX sub-block 0 is faulty and CCX sub-block 1 is the helper. Annotated steps: 1. stall; 2. drain outstanding requests (wait until the queue is empty); 3. transfer necessary states (skipped -- not needed for CCX); 4. enable workload reallocation via modified valid bit logic; 5. enable resource sharing via modified arbitration logic; 6. select outputs from CCX sub-block 0 through added multiplexers.]

RRS for L2Cs

In OpenSPARC T2, the shared L2 cache is divided into 8 banks, each of which consists of an L2 cache bank controller (L2C) and cache memory/tag arrays (L2M). Existing memory self-repair techniques can be applied to L2Ms, and RRS is applied to the L2Cs. We implement two RRS schemes for L2Cs, denoted by RRS-1 and RRS-2.

In RRS-1, hardware support is added so that an L2C helper can share its own L2M with the data originally mapped to the faulty L2C. As a result, L2 cache capacity is effectively reduced. For example, in Fig. 7a, L2C 1 (serving as a helper) allocates entries for the data mapped to L2C 0 (which is faulty) in L2M 1, in addition to its own data.

In RRS-2, hardware support is added so that an L2C helper can access two sets of L2Ms (the second set is originally accessed by the faulty L2C). For example, in Fig. 7b, L2C 1 (the helper) accesses data in both L2M 0 and L2M 1. However, due to their physical locations, there is an 8-cycle overhead (in our implementation) for the L2C helper to access the second L2M. Implementation details of RRS-1 and RRS-2 can be found in [Li 13].

RRS-1 and RRS-2 introduce different performance tradeoffs for the average memory access time metric: cache hit time + cache miss rate × cache miss penalty. RRS-1 results in reduced cache capacity, which increases the cache miss rate. RRS-2 introduces additional cache latency (8 cycles in our implementation) for a helper to access the L2M originally associated with the faulty L2C, thereby increasing the corresponding hit time and miss penalty for the data that is mapped to the second L2M. We quantify these tradeoffs using simulation experiments in Sec. 4.3.

[Figure 7. RRS-1 vs. RRS-2 for self-repairing L2Cs in OpenSPARC T2 (8 cores, 64 threads). L2C 0 is faulty and L2C 1 is the helper. a) RRS-1: L2M 0 is not used; the capacity of L2M 1 is shared. b) RRS-2: L2C 1 controls both L2M 0 and L2M 1, with an 8-cycle overhead to reach L2M 0.]

RRS for MCUs

The four DRAM controllers (MCUs) in OpenSPARC T2 are responsible for interacting with off-chip DRAM modules (through DRAM channels) to process DRAM read/write requests from the L2Cs. Each MCU can keep track of the status of multiple DRAM banks to handle multiple DRAM requests at the same time. An MCU helper handles all requests originally mapped to the faulty MCU in addition to its own by interacting with two DRAM channels, one originally associated with the helper and the other originally associated with the faulty MCU (detailed implementation in [Li 13]). The principle is similar to RRS-2 for L2Cs.

In theory, the RRS-1 principle for L2Cs may also be used for MCUs. However, without appropriate operating system support, such an approach can disrupt normal operation because active physical memory pages may no longer be accessible.

Step 2. Identify RRS Performance Bottlenecks

The methodology to identify RRS performance bottlenecks consists of the following steps (details in [Li 13]):
1. For each uncore component, identify various request "types" targeting that uncore component (Fig. 8).
2. Construct 4 RTL models: a) two CCX sub-blocks with RRS support; b) two L2Cs with RRS-1 support; c) two L2Cs with RRS-2 support; d) two MCUs with RRS-2 support. For each RTL model, we perform RTL simulation for two scenarios: i) a fault-free scenario; ii) one component in the RTL model is faulty and the other serves as the helper. Such RTL models (instead of a full-system model) allow us to focus on the shared hardware resources while simplifying simulation details and subsequent analysis.
3. For each request type in each RTL model, generate "stress workloads" to fully utilize hardware resources of both components, thereby creating contention in shared resources if one component becomes faulty. In our case study, each stress workload generates random requests on behalf of every input source of both components in an RTL model for 16 consecutive cycles. The input sources for the two CCX sub-blocks and the two L2Cs are the 8 processor cores, and for the two MCUs, they are 4 L2Cs. We then study the simulation traces of these workloads to identify RRS performance bottlenecks.

We summarize in Fig. 8 the cycle count overheads (the number of simulation cycles required to complete all requests generated by stress workloads, normalized to the corresponding fault-free scenarios) of RRS for CCX, RRS-1 and RRS-2 for L2Cs, and RRS-2 for MCUs. The key results are discussed below.

RRS for CCX sub-blocks: Cycle count overheads are very small because we modify the arbitration logic so that a CCX sub-block helper is able to dispatch packets to L2Cs at the same rate as the fault-free scenario (Fig. 6).

RRS-1/RRS-2 for L2Cs: Cycle count overheads for all request types are high (~70% on average, and up to 100%). By analyzing simulation traces, we find that the reasons are:
a) For stress workloads that generate "load hit" and "store hit" requests, two such requests targeting two different L2Cs in the same cycle can be handled in parallel in the fault-free scenario. With RRS-1/RRS-2, however, these requests need to be serialized, which affects L2 hit processing bandwidth (the number of hit requests that can be processed concurrently).
b) Cycle counts for stress workloads that generate "load miss" and "store miss" requests are determined by the maximum number of outstanding misses, which is in turn determined by the number of entries in the miss fill buffer. Once the miss fill buffer is full, an additional request that misses in the L2 cache bank needs to wait until a previous miss is handled and an entry in the miss fill buffer is freed up; this can take a long time (e.g., 56 cycles). In the fault-free scenario, two L2Cs together allow 16 outstanding misses. With RRS-1/RRS-2 for a single faulty L2C, only 8 outstanding misses are allowed by the helper, which reduces L2 miss processing bandwidth, i.e., the maximum number of allowed outstanding misses.

RRS-2 for MCUs: Cycle count overheads of both DRAM reads and writes are high with RRS-2 for a faulty MCU (up to 100%). Simulation traces show that each MCU in the RTL model keeps track of up to 16 DRAM banks, for a total of 32 DRAM banks; this is because there are 16 DRAM bank control FSMs per MCU. With a single faulty MCU, up to 16 outstanding DRAM requests can be supported by the helper, which effectively reduces both DRAM read and write access bandwidth (i.e., the number of DRAM read and write requests that can be handled concurrently). Furthermore, the DRAM read return logic in each MCU is only able to process one DRAM read return data item at a time. With RRS-2 for MCUs, DRAM data can be returned from two DRAM channels at the same time, but they can only be processed by the helper one after the other. This also reduces DRAM read access bandwidth, because in the fault-free scenario the data can be processed in parallel by two separate MCUs.

[Figure 8. Cycle count overheads of RRS vs. fault-free scenarios: average, min, and max cycle count overheads of 20 stress workloads. CCX request type: any input packet (RRS). L2C request types: load hit, store hit, load miss, store miss (RRS-1 and RRS-2). MCU request types: DRAM read, DRAM write (RRS-2).]
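To make the RRS-1 vs. RRS-2 tradeoff on average memory access time (AMAT = hit time + miss rate × miss penalty, discussed above) concrete, the following sketch plugs in assumed, purely illustrative cache parameters; only the 8-cycle overhead for reaching the faulty L2C's L2M is taken from the text.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate x miss penalty.
    return hit_time + miss_rate * miss_penalty

# Assumed baseline L2 parameters (illustrative only, not OpenSPARC T2 data).
HIT, MISS_RATE, MISS_PENALTY = 20.0, 0.05, 150.0
EXTRA = 8.0  # extra cycles to reach the faulty L2C's L2M under RRS-2 (Sec. 3.1)

baseline = amat(HIT, MISS_RATE, MISS_PENALTY)

# RRS-1: cache capacity is effectively halved for the affected banks, which
# raises the miss rate (a +40% increase is assumed here for illustration).
rrs1 = amat(HIT, MISS_RATE * 1.4, MISS_PENALTY)

# RRS-2: capacity is preserved, but accesses mapped to the faulty bank pay the
# 8-cycle overhead on both hit time and miss penalty; assume half of the
# helper's traffic targets the remote L2M.
remote = 0.5
rrs2 = (1 - remote) * baseline + remote * amat(HIT + EXTRA, MISS_RATE, MISS_PENALTY + EXTRA)

print(f"baseline AMAT {baseline:.1f}, RRS-1 {rrs1:.1f}, RRS-2 {rrs2:.1f} cycles")
```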

Step 3. Critical Architectural Enhancements for ERRS

We first incorporate architectural enhancements to overcome the performance bottlenecks identified in Step 2. Next, we follow the simulation methodology in Step 2, together with area/power evaluation (details in Sec. 4.1), for tradeoff analysis of area vs. performance costs. This process can be repeated several times to obtain desirable area vs. performance tradeoffs based on system requirements and constraints.

For our case study, we provide architectural enhancements to preserve L2 miss/hit processing bandwidths and DRAM read/write access bandwidths, as summarized in Table 1. ERRS for L2Cs utilizes the same principle as RRS-2 for L2Cs, for which the L2 cache capacity is preserved, but additional cache latency (8 cycles in our implementation) is introduced for a helper to access the L2M originally associated with the faulty L2C. The post-layout area impact of ERRS with these enhancements is 3.2% (Sec. 4.1). As Fig. 9 shows, the improvements in performance over RRS are significant: from 70% to 6% on average for a single faulty L2C, and from 60% to 8% on average for a single faulty MCU. Further performance results using full-system simulation and realistic workloads show less than 5% performance impact using ERRS after self-repair (Sec. 4.3). Note that ERRS does not introduce any performance impact in fault-free scenarios.

We also perform sensitivity analysis by increasing the number of miss fill buffer entries in L2Cs and the number of DRAM bank control FSMs in MCUs by 50% (instead of doubling them as in Table 1), respectively, while still duplicating the modules that handle cache hits in L2Cs and the DRAM read return logic in MCUs. The performance impact, labeled ERRS_sensitivity_analysis in Fig. 9, can be as high as 60% vs. fault-free scenarios. Therefore, to minimize performance impact, we use the configurations in Table 1 for ERRS.

Table 1. ERRS critical architectural enhancements for different uncore components in OpenSPARC T2.
Uncore component | Critical architectural enhancements
CCX sub-block    | None
L2C              | 1. Duplicate the modules that handle cache hits. 2. Double the number of miss fill buffer entries.
MCU              | 1. Double the number of DRAM bank control FSMs. 2. Duplicate the DRAM read return logic.

[Figure 9. Cycle count overheads of ERRS vs. fault-free scenarios: average, min, and max cycle count overheads of 20 stress workloads for ERRS and ERRS_sensitivity_analysis, for L2C request types (load hit, store hit, load miss, store miss) and MCU request types (DRAM read, DRAM write).]

3.2. Sparing through Hierarchical Exploration (SHE)

ERRS is expected to be applicable to a wide variety of uncore components in future SoCs (Sec. 3.1). However, not all uncore components can utilize ERRS. For example, the I/O controllers (NCU, SIU, and DMU) and the network accelerator (NIU) in OpenSPARC T2 cannot utilize ERRS because no other existing components can accomplish similar functionalities.

The idea of SHE is to explore different levels of the design hierarchy of each uncore component to identify structural properties of sub-components (Fig. 10). Such exploration enables us to identify multiple identical sub-components inside an uncore component to reduce the costs of spare units. For example, consider the NIU component of OpenSPARC T2 (Fig. 4). All sub-components at Level 1 and Level 2 of the design hierarchy are different (Fig. 10). However, by looking one level deeper (Level 3), we find that the rdmc sub-component contains 16 identical DMA channels (chnl1-chnl16). We provide only one spare unit for all 16 identical instances, and one spare unit per remaining sub-component of the rdmc (Fig. 10), thereby reducing area overhead (vs. the area of the original rdmc sub-component) by 58% when compared to logic sparing. Similar to logic sparing and shared-FF sparing (Sec. 2), SHE is only applied to the logic portion of an uncore component, because self-repair techniques for memory modules inside the uncore components exist [Aitken 04]. Compared to component-type sparing, which provides spare units for both memory modules and logic (Sec. 2) and in this case is equivalent to replicating the rdmc sub-component, SHE reduces area overhead (vs. the area of the original rdmc sub-component) by 80%.

[Figure 10. First 3 levels of the design hierarchy of the NIU in OpenSPARC T2: Level 1 (a.k.a. components), Level 2 sub-components (e.g., clk, pio, pio_ucb, rdmc, dbg, mb4, SRAM ctrl), and Level 3 sub-components (e.g., the DMA channels chnl1-chnl16 inside rdmc). The SHE implementation is shown for the rdmc sub-component; the additional hardware consists of spare units and steering logic.]

SHE raises an important question: how deep in the design hierarchy should we explore? Intuitively, the smaller the sub-components (lower in the hierarchy), the more likely identical structures can be found (the extreme case being at the level of individual gates). However, self-repair of smaller sub-components requires more steering logic (multiplexers and interconnects), which not only imposes higher area costs, but also reduces self-repair coverage (details in Sec. 4.4). To analyze these tradeoffs, we create 34 different configurations of SHE for NCU, SIU, DMU, and NIU:
1. A sweet-spot configuration, which is obtained by following the methodology in Fig. 11.
2. A coarse-grained configuration, where a spare unit is provided for each unique instance of a Level 2 sub-component (i.e., one spare unit is provided for all instances of an identical sub-component, and sub-components with only a single instance are merely duplicated).
3. A fine-grained configuration, where a spare unit is provided for each unique instance of a Level 3 sub-component.
4. 31 other configurations obtained using a methodology similar to Fig. 11. We start by adding all unique Level 2 sub-components to the set SC. Instead of using the two heuristics that balance area savings and self-repair coverage degradation, we first check if the current sub-component is at Level 3. If so, we provide a spare unit for all identical instances of the sub-component. Otherwise, we randomly choose to either continue the flow or provide a spare unit for all identical instances of the sub-component (which will be at Level 2). Spare units are provided at only these two levels because self-repair coverage tends to be low beyond Level 3 (Sec. 4.4).

[Figure 11. Sweet-spot SHE configuration. Flow: add all unique sub-components of the top-level component to set SC; while SC is not empty, remove the next sub-component s from SC; if s contains multiple identical sub-components within the next 2 levels of hierarchy‡ and s is larger than 50,000 circuit nodes (i.e., inputs and outputs of logic gates)‡, continue exploration by adding all unique sub-components of s to SC; otherwise, halt exploration and provide one spare unit to be shared by all identical instances of s. ‡ Heuristics to balance area savings and self-repair coverage degradation.]

Note that, although ERRS may be used for identical sub-components identified by SHE (to eliminate the need for spare units), the cost and complexity associated with the architectural modifications and enhancements required for ERRS can outweigh the cost of simply providing a spare unit, since sub-components are relatively small.
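The flow of Fig. 11 can be paraphrased as the following recursive heuristic. This is a behavioral sketch based on our reading of the flowchart; the hierarchy data structure and the name-based test for identical instances are assumptions, not the authors' actual tool.

```python
from dataclasses import dataclass, field

@dataclass
class SubComponent:
    name: str                            # module name; identical instances share a name
    num_nodes: int                       # circuit nodes (inputs/outputs of logic gates)
    children: list = field(default_factory=list)

def has_identical_instances(s, depth=2):
    """Heuristic 1: identical sub-components within the next 2 hierarchy levels."""
    if depth == 0 or not s.children:
        return False
    names = [c.name for c in s.children]
    if len(names) != len(set(names)):
        return True
    return any(has_identical_instances(c, depth - 1) for c in s.children)

def sweet_spot_she(top, node_threshold=50_000):
    """Return the (sub-)components that each receive one shared spare unit."""
    spares, sc = [], list(top.children)      # all unique Level-1/2 sub-components
    while sc:
        s = sc.pop(0)
        # Explore deeper only if both heuristics of Fig. 11 say it is worthwhile.
        if has_identical_instances(s) and s.num_nodes > node_threshold and s.children:
            sc.extend(s.children)
        else:
            spares.append(s)                 # one spare shared by all identical instances of s
    return spares
```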

4. Results

4.1. Chip-Level Area, Power, and Clock Frequency Impact

We use the Synopsys Design Compiler with the Synopsys EDK 32nm library for synthesis. We also perform place-and-route (P&R) to assess additional routing and wire overheads. Using the Synopsys IC Compiler, we run P&R for each component. The physical design of each component is then assembled to build the entire SoC design.

To evaluate the chip-level power impact of ERRS, we first perform RTL simulations using Synopsys VCS to obtain switching activities for a mix of SPEC and synthetic programs (the synthetic programs perform random read and write operations to an 8MB address space). The switching activities obtained from RTL simulation are then used by the Synopsys Power Compiler to derive power numbers. For SHE, we implement power gating using the Synopsys IC Compiler to automatically insert power switch cells, which can connect or disconnect the power supply for a hardware structure via an enable signal. Since the spare units in SHE are not required to operate in a fault-free system, they can be disconnected from the power supply. Our detailed methodology to obtain area, power, and clock speed results can be found in [Li 13].

Table 2 summarizes the results. ERRS and SHE (sweet-spot configuration, Sec. 3.2) together introduce 5.6% synthesis area impact at the chip level, which is significantly lower than existing sparing-based techniques for which area impact is optimistically estimated (Sec. 2). After place-and-route, the area impact is 7.5%, which accounts for additional routing and interconnect overheads. In our implementation, the OpenSPARC T2 design with ERRS and SHE achieves the same clock frequency (post-P&R) as the native OpenSPARC T2 design (300MHz using the Synopsys EDK 32nm library; we also confirm that both designs report critical path timing violations at 325MHz). The ERRS and SHE techniques introduce 2.94% chip-level power impact.

Table 2. Chip-level area/power impacts of ERRS and SHE.
Self-repair technique | Area impact (post P&R) | Area impact (synthesis) | Power impact | Clock freq. impact
ERRS    | 3.16% | 2.61% | 2.72% | 0%
SHE†    | 4.32% | 2.97% | 0.22% | 0%
Overall | 7.48% | 5.58% | 2.94% | 0%
† The power impact of SHE is from steering logic, since the spare units of SHE are turned off via power gating. Power switches and additional routing costs are accounted for in the area impact of SHE.

Compared to ERRS (Table 2), the area, clock frequency, and power impact of the RRS techniques are shown in Table 3. Although the area/power impact of both RRS schemes is smaller than that of ERRS, RRS techniques can incur high application performance impact (up to 22% for stress applications, as detailed in Sec. 4.3) as soon as a single uncore component becomes faulty. Moreover, RRS-1 also introduces a large increase in off-chip DRAM access power after self-repair of a faulty system, which is not desirable (Sec. 4.2).

Table 3. Chip-level area/power impact of RRS for L2C, MCU, and CCX sub-blocks in OpenSPARC T2.
RRS*  | Area impact (post P&R) | Area impact (synthesis) | Power impact | Clock freq. impact
RRS-1 | 1.12% | 0.97% | 0.81% | 0%
RRS-2 | 1.41% | 1.17% | 1.15% | 0%
* RRS-1 and RRS-2 implementations differ only for L2Cs (Sec. 3.1).

4.2. Off-Chip DRAM Access Power Impact (After Self-Repair)

In addition to chip-level power impact, off-chip DRAM access power is also a major concern from a system's perspective; it depends on the L2 cache miss rate. For ERRS and RRS-2, DRAM access power in a faulty system (after self-repair) is similar to that of fault-free operation because the cache capacity is preserved [Li 13]. For RRS-1, however, the reduced L2 cache capacity can result in higher DRAM access power in a faulty system (after self-repair). To quantify this impact of RRS-1, we use the Micron DRAM power calculator [Micron] to estimate the power consumption of DDR2 DRAM chips. We derive the percentage of clock cycles for which DRAM banks process read or write requests using microarchitectural simulation results, which are obtained from the "PARSEC mix" workload executing on a 64-core chip multiprocessor (details in Sec. 4.3). The off-chip DRAM access power impact of RRS-1 (in a faulty system after self-repair) vs. fault-free operation is 11%, 14%, and 20% assuming 1, 2, and 4 faulty L2Cs, respectively, which suggests that RRS-1 may not be a desirable option for self-repair.

4.3. Application Performance Impact (After Self-Repair)

If a system is fault-free, ERRS and SHE do not introduce any application performance impact. In this section, we focus on evaluation of the application performance of ERRS after self-repair is performed in a faulty system. As explained in Sec. 3.2, our current implementation of SHE does not degrade application performance. We simulate two chip multiprocessors (CMPs), one with 8 cores and the other with 64 cores². Simulated system parameters are summarized in Table 4. The uncore configuration follows that of OpenSPARC T2 (Fig. 4).

Table 4. Simulated system parameters.
Parameter        | 8-core CMP                      | 64-core CMP
Simulator        | GEMS [Martin 05] with Simics    | gem5 [Binkert 11]
Processors       | 8 single-issue processor cores  | 64 single-issue processor cores
OS               | OpenSolaris 10                  | Linux 2.6
Memory hierarchy | Private L1 split instruction/data caches; 4MByte shared L2 cache (8 banks) (both systems)
Uncore           | 8 L2 bank controllers, 4 DRAM controllers (both systems)

The workloads used include:
1. Individual programs from the PARSEC benchmark suite³ that represent CMP workloads [Bienia 08].
2. A mix of programs from the PARSEC benchmark suite, referred to as "PARSEC mix," where each program imposes different demands on the uncore components. On the 64-core CMP, we run 6 instances of each program (except for dedup and streamcluster, for which we run 5 instances) for a total of 64 programs (one program per core). On the 8-core simulated CMP, we run the first 8 PARSEC programs in alphabetical order, one program per core.
3. Synthetic applications that stress processing bandwidth demands on L2Cs or MCUs, referred to as "L2C stress apps" or "MCU stress apps," on each core. These synthetic applications are created by tuning various strides (i.e., the difference in the memory addresses of two consecutive memory access operations) in the workload (similar to [Joshi 08]). Our synthetic applications mimic memory access behaviors of commercial database and web server applications (which can impose very high instruction miss rates in addition to data miss rates [Spracklen 05]).

In all our simulations, the baseline represents results obtained from fault-free simulations. For simulations in the presence of faults, we arbitrarily choose a component to be faulty.

² The 8-core system supports the SPARC Instruction Set Architecture (ISA), which matches the ISA of OpenSPARC T2. It mimics the design of the 8-core OpenSPARC T2 when multiple hardware threads are not enabled. The 64-core simulated CMP, on the other hand, only supports the x86 ISA. However, it allows us to examine the performance of uncore components with a large number of requests, corresponding to a scenario when all 64 hardware threads in OpenSPARC T2 execute memory-intensive applications concurrently. Although specific memory access patterns can differ for different ISAs, the uncore components are generally subject to similar memory access demands from the processor cores.
³ On the 64-core simulated CMP we are unable to run "freqmine" and "raytrace" due to known library issues [gem5] in the simulator.

For example, for the "PARSEC mix" workload, L2C 3 is faulty on the 8-core CMP and L2C 6 is faulty on the 64-core CMP. On the 8-core simulated CMP, we run all workloads to completion to obtain execution times. On the 64-core CMP, each simulation run continues until each program (on each core) executes at least 3 million instructions starting from the main body of the program (after the initial setup phase). We then extract CPI (cycles per instruction) for the last 2 million instructions (the first 1 million instructions are used for warming up the caches). This methodology allows us to capture performance behaviors across various program phases (intervals of a program that exhibit different architectural characteristics such as CPI, working sets, etc.) within reasonable simulation runtimes [Sanchez 10].

Performance Impact: Single Faulty L2C⁴

As shown in Figs. 12 and 13 (for the case of a single faulty L2C), the application performance overheads of RRS-1 and RRS-2 can be large (up to 18%). ERRS significantly reduces the performance impact to less than 5%, at the price of a slightly larger area cost (3.2% post-P&R chip-level area impact, as discussed in Sec. 4.1).

[Figure 12. Performance impact of RRS/ERRS for various workloads on the 8-core simulated CMP; one faulty L2C. Execution time overhead (%) of RRS-1, RRS-2, and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 L2C stress apps.]

[Figure 13. Performance impact of RRS/ERRS for various workloads on the 64-core simulated CMP; one faulty L2C. CPI overhead (%) of RRS-1, RRS-2, and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 L2C stress apps.]

Performance Impact: Single Faulty MCU

Figures 14 and 15 show the performance impact of the RRS and ERRS techniques for various workloads in the presence of a faulty MCU. For PARSEC, both RRS and ERRS (Figs. 14a and b, 15a and b) incur very small (< 1%) performance impact. However, for DRAM stress applications (Figs. 14c, 15c), ERRS significantly reduces (i.e., improves) the average performance impact: from 16% to 1% for the simulated 8-core CMP, and from 22% to 5% for the simulated 64-core CMP.

[Figure 14. Performance impact of RRS/ERRS for various workloads on the 8-core simulated CMP; one faulty MCU. Execution time overhead (%) of RRS-2 and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 MCU stress apps.]

[Figure 15. Performance impact of RRS/ERRS for various workloads on the 64-core simulated CMP; one faulty MCU assumed. CPI overhead (%) of RRS-2 and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 MCU stress apps.]

Performance Impact: Multiple Faulty L2Cs and MCUs

In Fig. 16, we present results on the performance impact of RRS/ERRS for multiple faulty components. In the worst case, 2 MCUs and 4 L2Cs can fail (MCUs 0 and 2, and L2Cs 0, 2, 4, and 6 are arbitrarily chosen to be faulty). Note that ERRS allows all of these components to be faulty because the helpers are all fault-free. As expected, application performance degrades as the number of faulty components increases, since more hardware resources need to be shared. ERRS benefits from the architectural enhancements (Sec. 3.1), and incurs the lowest performance overheads in most cases. The only exception is that, on the 8-core CMP, RRS-1 outperforms ERRS for the cases with 4 faulty L2Cs. To understand this result, we record the L2 miss rates of RRS-1, which are 2.6%, 3.9%, and 4.8% for one, two, and four faulty L2Cs, respectively. For ERRS, as the number of faulty L2Cs increases, the percentage of L2 requests with higher cache latencies increases at a faster rate, i.e., 11.4%, 26.8%, and 52.9% for one, two, and four faulty L2Cs, respectively. These trends suggest that the increased L2 hit time and miss penalty associated with ERRS may become more detrimental to application performance than the reduced L2 cache capacity associated with RRS-1 for a large number of faulty L2Cs. (Note that, if ERRS were implemented following the same principle as RRS-1, which preserves L2 hit time and miss penalty but reduces L2 cache capacity, ERRS would be expected to outperform RRS-1 for these cases.)

[Figure 16. Performance impact (%) of RRS/ERRS for the PARSEC mix workload with multiple faulty components (2 MCUs; 2 L2Cs; 4 L2Cs; 4 L2Cs + 2 MCUs) on the 8-core and 64-core simulated CMPs.]

⁴ Note that the performance impact results in this section actually correspond to a scenario with two faulty components: the faulty L2C and the corresponding CCX sub-block. If an L2C is faulty, it implies that the corresponding CCX sub-block is idle (when RRS-1, RRS-2, or ERRS is used).
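For reference, the overhead metrics plotted in Figs. 12-16 can be computed as shown below; treating the stress-application summary as a geometric mean of per-application ratios is our reading of the figure labels rather than something the text spells out.

```python
from math import prod

def overhead_pct(faulty, fault_free):
    # Execution-time or CPI overhead (%) of a faulty, self-repaired system
    # relative to the fault-free baseline (Figs. 12-16).
    return (faulty / fault_free - 1.0) * 100.0

def geomean_overhead_pct(faulty_runs, fault_free_runs):
    # Summary over the 100 L2C/MCU stress apps: geometric mean of the
    # per-application faulty/fault-free ratios, expressed as an overhead.
    ratios = [f / b for f, b in zip(faulty_runs, fault_free_runs)]
    return (prod(ratios) ** (1.0 / len(ratios)) - 1.0) * 100.0
```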


4.4. Self-Repair Coverage

Similar to other self-repair techniques, our techniques cannot guarantee that all failures will be correctly repaired. This is due to the existence of single points of failure, which are signals that, if failed, can result in incorrect system operation. For both ERRS and SHE, single points of failure include the primary inputs and primary outputs of the uncore components or sub-components being repaired, and the steering logic (multiplexers and interconnects). The "response comparators," which indicate whether a (sub-)component needs to be repaired or not, can also behave as single points of failure (Appendix A).

ERRS and SHE are capable of self-repairing a wide variety (e.g., stuck-at, transition, or delay) and a large number (single or multiple) of faults that may occur inside a (sub-)component being repaired. To take the single points of failure into account, we use the self-repair coverage metric, which is defined as the probability that the uncore components in an SoC function correctly for a given fault density. In our analysis, fault density is defined as the total number of single stuck-at faults per 100 million circuit nodes (i.e., the inputs and outputs of logic gates). 100 million is of the same order of magnitude as the number of circuit nodes in the original OpenSPARC T2 design. As a special case, we also define the single-fault self-repair coverage as the probability that the uncore components function correctly in the presence of a single fault (stuck-at for our analysis, but other fault types can also be used).

4.4.1. Single-Fault Self-Repair Coverage

Single-fault self-repair coverage can be calculated by dividing the total number of circuit nodes that are not single points of failure by the total number of nodes. As shown in Fig. 2, ERRS alone achieves 74.9% single-fault self-repair coverage in our case study. With both ERRS and SHE, single-fault self-repair coverage is 97.5%.

In Fig. 17, we present single-fault self-repair coverage vs. synthesis area impact for the 34 SHE configurations discussed in Sec. 3.2. For the coarse-grained configuration, few identical sub-components are found; hence, the area cost is quite high (95% at the component level, i.e., normalized to the synthesis area for the logic portion of NCU, SIU, DMU, and NIU). For the fine-grained configuration, area impact is reduced by 20% at the component level compared to the coarse-grained configuration; however, it also results in a noticeable (2.5%) reduction in self-repair coverage. The sweet-spot configuration provides a balanced self-repair coverage vs. area impact tradeoff, because we are able to identify several cases of identical sub-components and simultaneously achieve 97.5% self-repair coverage with only 2.97% chip-level (77% component-level) area impact.

[Figure 17. Single-fault self-repair coverage vs. synthesis area impact for the 34 SHE configurations, with the sweet-spot, coarse-grained, and fine-grained configurations highlighted.]

4.4.2. Self-Repair Coverage for Various Fault Densities

To calculate self-repair coverage for a given fault density (instead of a single fault as discussed in Sec. 4.4.1), single points of failure of all uncore components are considered as belonging to a non-repairable set. All non-single points of failure that belong to original (sub-)component(s) and the corresponding helper or spare unit belong to a unique unit of repair (RU). For example, the non-single points of failure for all 16 DMA channels and the single spare DMA channel (Fig. 10) belong to the same RU, and those for a faulty L2C along with its helper belong to the same RU.

We assume that all faults are independent and identically distributed. The probability that a set of circuit nodes is fault-free (P_set) is calculated using the Poisson model in Eq. 1, where n_set is the number of nodes in the set and d is the fault density. This model is used in yield modeling [Cunningham 90] for random and independent defects, which we adopt for random and independent faults. Uncore components function correctly only if all RUs function correctly (each with probability P_RU) and the non-repairable set is fault-free (with probability P_non-repairable-set, calculated using Eq. 1). Hence, self-repair coverage can be calculated using Eq. 2. For the self-repair techniques considered in this paper, at most one (sub-)component in an RU can become faulty. For example, for ERRS, a component and its helper cannot be faulty at the same time (Sec. 3.1). The derivation of P_RU is presented in Appendix B. Note that the P_RU terms in Eq. 2 are statistically independent, since any shared logic among multiple components is accounted for in P_non-repairable-set.

P_set = e^(−n_set × d)    (Eq. 1)

Self-repair coverage = P_non-repairable-set × Π_i P_RU,i    (Eq. 2)

Self-repair coverage results are shown in Fig. 18 for a range of fault densities, comparing our techniques (ERRS+SHE) with the component-type sparing, logic sparing, and shared-FF sparing techniques of Sec. 2 (the existing sparing-based techniques are applied to all uncore components considered in this paper). The "idealistic" upper-bound reference is very difficult (if not impossible) to achieve: it assumes that a spare gate is provided for every gate, but that the steering logic introduces no area costs or single points of failure.

[Figure 18. Self-repair coverage for a wide range of fault densities (faults per 100M nodes) and area impact of different self-repair techniques. Chip-level synthesis area overhead: ERRS+SHE 5.6% (7.5% post P&R); shared-FF sparing 11.3%§; logic sparing 12.6%§; component-type sparing 15.5%§. § Discounts steering logic costs.]

The key result from Fig. 18 is that the combination of ERRS and SHE achieves very high self-repair coverage (86.1%-99.7%) for all fault densities considered, while introducing the lowest area impact. We provide an analysis of this key result below:
1. Comparing ERRS with logic sparing, ERRS results in fewer RUs (14 vs. 30) since spare units are not needed for ERRS. Since P_RU < 1, fewer terms result in higher self-repair coverage values based on Eq. 2. Comparing SHE with logic sparing (which provides spare units for Level 1 components), SHE introduces ~2X more single points of failure since it requires more steering logic. Note that, even in the absence of identical sub-components in Level 2 of the design hierarchy, logic sparing at Level 1 is distinct from SHE at Level 2: the aggregate number of primary inputs and outputs for Level 2 sub-components is greater than that for Level 1 components, thus requiring additional multiplexers. However, SHE helps with self-repair coverage for high fault densities, since spare units are provided at lower levels of the design hierarchy; as a result, it is less likely for an RU to fail (since multiple faulty units can be repaired). A combination of these factors results in comparable self-repair coverage between our techniques (ERRS+SHE) and logic sparing, but our techniques introduce significantly less area impact.
2. Self-repair coverage for shared-FF sparing is relatively low, especially for high fault densities, since it requires substantially more steering logic (i.e., ~6X more single points of failure than ERRS+SHE).
3. For component-type sparing, allocating one spare component shared by multiple components is detrimental to self-repair coverage for high fault densities, since it increases the likelihood that at least two such components fail. Although SHE also shares a single spare unit among multiple identical sub-components, providing spare units at lower levels of the design hierarchy helps with self-repair coverage (since multiple faulty sub-components can be repaired). As a result, the worst-case values of the P_RU terms are 98% for SHE and 74% for component-type sparing at 20 faults per 100M circuit nodes.


3. For component-type sparing, allocating one spare component shared by multiple components is detrimental to self-repair coverage at high fault densities, since it increases the likelihood that at least two such components fail. Although SHE also shares a single spare unit among multiple identical sub-components, providing spare units at lower levels of the design hierarchy helps self-repair coverage, since multiple faulty sub-components can be repaired. As a result, at 20 faults per 100M circuit nodes, the worst-case values of the PRU terms are 98% for SHE but only 74% for component-type sparing.

4.5. Summary and Discussion
Table 5 compares our techniques to existing sparing-based techniques for the OpenSPARC T2 case study. The combination of ERRS and SHE incurs substantially lower area cost while achieving high self-repair coverage.

Table 5. Result comparison for the OpenSPARC T2 case study.
                                     ERRS+SHE                   Component-type sparing   Logic sparing   Shared-FF sparing
Synthesis area impact                5.6% (7.5% post-layout)    15.5%§                   12.6%§          11.3%§
Single-fault self-repair coverage    97.5%                      99.0%                    99.1%           90.0%
Application performance impact       ERRS: 0% for fault-free systems, 5% for a single faulty component, and graceful degradation for multiple faulty components. Sparing-based techniques (including SHE): 0%.
Clock speed impact                   0% for ERRS+SHE; may be introduced for the sparing-based techniques due to wire routing overheads.
Power impact                         ERRS+SHE: 3%. Sparing-based techniques: may be small if gating techniques are used for the spare units.
§ Optimistic since area cost for steering logic is discounted.

5. Conclusion
Uncore components are prevalent in SoCs. Self-repair of uncore components is essential for ensuring overall SoC reliability in the presence of permanent faults. We present two new self-repair techniques, ERRS and SHE, which utilize architectural features in SoCs to enable effective self-repair of uncore components at low cost. Our techniques can also be used for yield improvement during manufacturing.
Future research directions include: 1. Extension of our self-repair techniques to application domains with real-time constraints. 2. New recovery techniques that can be integrated with our detection, diagnosis, and self-repair techniques, with optimized area, power, and performance tradeoffs. 3. Use of emerging 3D stacking technologies to further reduce the costs of our self-repair techniques.

Acknowledgement
This work was supported in part by the FCRP Gigascale Systems Research Center (GSRC), the National Science Foundation (NSF), and Intel. We thank Dr. Farzan Fallah of Stanford and Dr. Jung Yun Choi of Samsung for insightful discussions.

References
[Abramovici 06] Abramovici, M., et al., "A Reconfigurable Design-for-Debug Infrastructure for SoCs," Proc. Design Automation Conf., pp. 7-12, 2006.
[Adams 87] Adams, G.B., III, D.P. Agrawal, and H.J. Siegel, "A Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks," Computer, vol. 20, no. 6, pp. 14-27, 1987.
[Agarwal 07] Agarwal, M., et al., "Circuit Failure Prediction and Its Application to Transistor Aging," Proc. VLSI Test Symp., pp. 277-286, 2007.
[Agostinelli 05] Agostinelli, M., et al., "Random Charge Effects for PMOS NBTI in Ultra-Small Gate Area Devices," Proc. Intl. Reliability Physics Symp., pp. 529-532, 2005.
[Aitken 04] Aitken, R., "A Modular Wrapper Enabling High Speed BIST and Repair for Small Wide Memories," Proc. Intl. Test Conf., pp. 997-1005, 2004.
[Allsup 10] Allsup, C., "Is Built-in Logic Redundancy Ready for Prime Time?" Proc. Intl. Symp. on Quality Electronic Design, pp. 299-306, 2010.
[Altera 06] "Altera's Strategy for Delivering the Benefits of the 65-nm Semiconductor Process," http://www.altera.com/literature/wp/wp-01002.pdf.
[Beckler 12] Beckler, M., and R.D. Blanton, "On-Chip Diagnosis for Early-Life and Wear-Out Failures," Proc. Intl. Test Conf., pp. 1-10, 2012.
[Benso 02] Benso, A., et al., "An On-Line BIST RAM Architecture with Self-Repair Capabilities," IEEE Trans. Reliability, vol. 51, no. 1, pp. 123-128, 2002.
[Bienia 08] Bienia, C., et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 72-81, 2008.
[Binkert 11] Binkert, N., et al., "The GEM5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.
[Borkar 05] Borkar, S., "Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, 2005.
[Borkar 07] Borkar, S., "Thousand Core Chips – A Technology Perspective," Proc. Design Automation Conf., pp. 272-278, 2007.
[Breuer 86] Breuer, M., and A. Ismaeel, "Roving Emulation as a Fault Detection Mechanism," IEEE Trans. Comp., vol. C-35, no. 11, pp. 933-939, 1986.
[Chang 07] Chang, J., et al., "The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series," IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 846-852, 2007.
[Constantinides 07] Constantinides, K., et al., "Software-Based On-line Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation," Proc. Intl. Symp. on Microarchitecture, pp. 97-108, 2007.
[Cunningham 90] Cunningham, J., "The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing," IEEE Trans. Semiconductor Manufacturing, vol. 3, no. 2, pp. 60-71, 1990.
[Elnozahy 02] Elnozahy, E.N., D.B. Johnson, and Y.M. Wang, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[gem5] "The gem5 Simulator System," http://www.m5sim.org/.
[Gomez 04] Gomez, M.E., et al., "An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori," IEEE Computer Architecture Letters, vol. 3, no. 1, p. 3, 2004.
[Hicks 08] Hicks, J., et al., "45nm Transistor Reliability," Intel Technology Journal, vol. 12, no. 2, pp. 131-144, 2008.
[Huang 84] Huang, K.H., and J.A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. Comp., vol. C-33, no. 6, pp. 518-528, 1984.
[Joshi 08] Joshi, A., et al., "Automated Stressmark Generation," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 229-239, 2008.
[Karl 08] Karl, E., et al., "Compact In-Situ Sensors for Monitoring Negative-Bias-Temperature-Instability Effect and Oxide Degradation," Proc. Intl. Solid-State Circuits Conf., pp. 410-623, 2008.
[Keim 06] Keim, M., et al., "A Rapid Yield Learning Flow Based on Production Integrated Layout-Aware Diagnosis," Proc. Intl. Test Conf., pp. 1-10, 2006.
[Kim 98] Kim, I., et al., "Built In Self Repair for Embedded High Density SRAM," Proc. Intl. Test Conf., pp. 1112-1119, 1998.
[Kim 10] Kim, Y.M., et al., "Low-Cost Gate-Oxide Early-Life Failure Detection in Robust Systems," Proc. Intl. Symp. VLSI Circuits, pp. 125-126, 2010.
[Li 08] Li, Y., S. Makar, and S. Mitra, "CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns," Proc. Design, Automation, and Test in Europe, pp. 885-890, 2008.
[Li 10] Li, Y., et al., "Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips," Proc. VLSI Test Symp., pp. 232-237, 2010.
[Li 13] Li, Y., "Online Self-Test, Diagnostics, and Self-Repair for Robust System Design," Doctoral Dissertation, Stanford University, 2013.
[Martin 05] Martin, M.M.K., et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92-99, 2005.
[Micron] "Calculating Memory System Power for DDR2," http://download.micron.com/pdf/technotes/ddr2/tn4704.pdf.
[Mintarno 11] Mintarno, E., et al., "Self-Tuning for Maximized Lifetime Energy-Efficiency in the Presence of Circuit Aging," IEEE Trans. CAD, vol. 30, no. 5, pp. 760-773, 2011.
[Mirza 12a] Mirza-Aghatabar, M., et al., "Theory of Redundancy for Logic Circuits to Maximize Yield/Area," Proc. Intl. Symp. Quality Electronic Design, pp. 663-671, 2012.
[Mirza 12b] Mirza-Aghatabar, M., et al., "A Design Flow to Maximize Yield/Area of Physical Devices via Redundancy," Proc. Intl. Test Conf., pp. 1-10, 2012.

[Mitra 00] Mitra, S., and E.J. McCluskey, "Which Concurrent Error Detection Schemes to Choose?" Proc. Intl. Test Conf., pp. 985-994, 2000.
[Morgan 11] Morgan, T.P., "IBM's BlueGene/Q Super Chip Grows 18th Core," http://insidehpc.com/2011/08/26/ibms-bluegeneq-super-chip-grows-18th-core/.
[Nakano 06] Nakano, J., et al., "ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 200-211, 2006.
[Nassif 12] Nassif, S.R., V.B. Kleeberger, and U. Schlichtmann, "Goldilocks Failures: Not Too Soft, Not Too Hard," Proc. Intl. Reliability Physics Symp., pp. 2F.1.1-2F.1.5, 2012.
[OpenSPARC] "OpenSPARC: World's First Free 64-bit Microprocessor," http://www.opensparc.net.
[Powell 09] Powell, M.D., et al., "Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance," Proc. Intl. Symp. on Computer Architecture, pp. 93-104, 2009.
[Romanescu 08] Romanescu, B.F., and D.J. Sorin, "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 43-51, 2008.
[Sanchez 10] Sanchez, D., and C. Kozyrakis, "The ZCache: Decoupling Ways and Associativity," Proc. Intl. Symp. on Microarchitecture, pp. 187-198, 2010.
[Sanda 08] Sanda, P.N., et al., "Fault-Tolerant Design of the IBM Power6 Microprocessor," IEEE Micro, vol. 28, no. 2, pp. 30-38, 2008.
[Schober 01] Schober, V., S. Paul, and O. Picot, "Memory Built-In Self-Repair Using Redundant Words," Proc. Intl. Test Conf., pp. 995-1001, 2001.
[Scholzel 11] Scholzel, M., "Fine-Grained Software-Based Self-Repair of VLIW Processors," Proc. Intl. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 41-49, 2011.
[Schuchman 05] Schuchman, E., and T.N. Vijaykumar, "Rescue: A Microarchitecture for Testability and Defect Tolerance," Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005.
[Shalf 09] Shalf, J., et al., "The Manycore Revolution: Will the HPC Community Lead or Follow?" SciDAC Review, pp. 40-49, 2009.
[Shirvani 99] Shirvani, P.P., and E.J. McCluskey, "PADded Cache: A New Fault-Tolerance Technique for Cache Memories," Proc. VLSI Test Symp., pp. 440-445, 1999.
[Shivakumar 03] Shivakumar, P., et al., "Exploiting Microarchitectural Redundancy for Defect Tolerance," Proc. Intl. Conf. on Computer Design, pp. 481-488, 2003.
[Spracklen 05] Spracklen, L., et al., "Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 225-236, 2005.
[Tiwari 08] Tiwari, A., and J. Torrellas, "Facelift: Hiding and Slowing Down Aging in Multicores," Proc. Intl. Symp. on Microarchitecture, pp. 129-140, 2008.
[Turgeon 91] Turgeon, P.R., A.R. Steel, and M.R. Charlebois, "Two Approaches to Array Fault Tolerance in the IBM Enterprise System/9000 Type 9121 Processor," IBM Journal of Research and Development, vol. 35, no. 3, pp. 382-389, 1991.
[Van Horn 05] Van Horn, J., "Towards Achieving Relentless Reliability Gains in a Server Marketplace of Teraflops, Laptops, Kilowatts, & Cost, Cost, Cost," Proc. Intl. Test Conf., pp. 1-8, 2005.
[Zorian 03] Zorian, Y., and S. Shoukourian, "Embedded-Memory Test and Repair: Infrastructure IP for SOC Yield," IEEE Design and Test of Computers, vol. 20, no. 3, pp. 58-66, 2003.

Appendix A. Online Self-Test and Diagnostics (OLST) for Detection and Online Diagnosis of Permanent Faults
The OLST approach we use is similar to CASP (Concurrent Autonomous chip self-test using Stored test Patterns) [Li 08, 10]. CASP achieves high online test coverage at low cost by: 1. utilizing on-chip test compression and off-chip storage (e.g., FLASH) to store thorough compressed test patterns; and 2. providing system-level support to fetch test patterns from off-chip storage and to apply them to different components through scan chains. CASP may be applicable for manufacturing test as well, and the details can be found in [Li 13].
Figure A.1 shows the hardware support for CASP for OpenSPARC T2. Independent scan chains are formed and a response comparator is added for each sub-component. A response comparator compares actual scan test responses with the golden responses to output a pass/fail signal. The pass/fail signals provide diagnosis capability at the sub-component level, since a sub-component is considered as fault-free only if the pass/fail signal indicates a "pass" for all test patterns. Since we only need to localize faults to a hardware structure that matches self-repair granularity, partitioning of sub-components is based on the specific self-repair technique. For example, for ERRS, which is performed at the component level for L2Cs, MCUs, and CCX, faults only need to be localized at the component level. For SHE, the sub-components of NCU, SIU, DMU, and NIU are formed based on the sweet-spot configuration as discussed in Sec. 3.2 to match self-repair granularity.

[Figure A.1 depicts, for each component i: local test logic (test data buffer, decompressor, compactor, and per-sub-component response comparators producing pass_j/fail_j signals for each sub-component j, where sub-components match self-repair granularity), a CASP controller handling test scheduling, pattern fetch, pre-processing, test application, and post-processing, off-chip FLASH holding compressed test data, and the OpenSPARC T2 floorplan (cores, L2Cs, L2 memories, MCUs, NCU, SIU, CCX, DMU, NIU).]
Figure A.1. Online self-test and diagnostics support in OpenSPARC T2.

Appendix B. Additional Details on Self-Repair Coverage
We derive the expressions for PRU (i.e., the probability that a unit of repair (RU) functions correctly) for the self-repair techniques presented in this paper. Figure B.1 depicts how RUs are formed for the various self-repair techniques. The probability that a RU functions correctly, PRU-ERRS, PRU-LS, PRU-SHE, or PRU-FF, is given in Fig. B.1, where PERRS, PLS, and PSHE denote the probability that the corresponding (sub-)component is fault-free. PERRS, PLS, and PSHE are obtained using Eq. 1, and m in Eq. B.3 is the number of original (sub-)components. Note that the component-type sparing case can be modeled and calculated using case (c) (SHE RU) and Eq. B.3 by simply considering each sub-component in the SHE case as a component. For shared-FF sparing, we optimistically assume that all logic (both combinational and sequential) can be perfectly repaired, since PRU-FF is dependent on how logic is partitioned into blocks for which shared-FF sparing is applied. Hence, the only single points of failure considered are for the steering logic for flip-flops, which are accounted for in the non-repairable set (Sec. 4.4.2).

(a) ERRS RU: a component and its helper component; the RU functions correctly unless both are faulty:
    PRU-ERRS = 1 - (1 - PERRS)^2.   (Eq. B.1)
(b) Logic sparing RU: a component and its spare copy; the RU functions correctly unless both are faulty:
    PRU-LS = 1 - (1 - PLS)^2.   (Eq. B.2)
(c) SHE RU: m identical sub-components (sub-component 1, ..., sub-component m) plus one spare sub-component, each fault-free with probability PSHE; the RU functions correctly if at most one of the m+1 sub-components is faulty:
    PRU-SHE = PSHE^(m+1) + (m+1) * PSHE^m * (1 - PSHE).   (Eq. B.3)
(d) Shared-FF RU: all logic is optimistically assumed repairable, so PRU-FF = 1; the steering logic for flip-flops is treated as part of the non-repairable set.
Figure B.1. Deriving the probability that a RU (unit of repair) functions correctly.