Self-Repair of Uncore Components in Robust System-on-Chips: An OpenSPARC T2 Case Study

Yanjing Li¹,²    Eric Cheng¹    Samy Makar¹    Subhasish Mitra¹

¹Stanford University, Stanford, CA 94305 USA    ²Intel Corporation, Santa Clara, CA 95054 USA

Abstract

Self-repair replaces/bypasses faulty components in a system-on-chip (SoC) to keep the system functioning correctly even in the presence of permanent faults. Such faults may result from early-life failures, circuit aging, and manufacturing defects and variations. Unlike on-chip memories, processor cores, and networks-on-chip, little attention has been paid to self-repair of uncore components (e.g., cache controllers, memory controllers, and I/O controllers) that occupy significant portions of multi-core SoCs. In this paper, we present new techniques that utilize architectural features to achieve self-repair of uncore components while incurring low area, power, and performance costs. We demonstrate the effectiveness and practicality of our techniques using the industrial OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads. Our key results are:
1. Our techniques enable effective self-repair of any single faulty uncore component with 7.5% post-layout chip-level area impact and 3% power impact. In contrast, existing redundancy techniques impose high (e.g., 16%) area costs. Our techniques do not incur any performance impact in fault-free systems. In the presence of a single faulty uncore component, there can be a 5% application performance impact.
2. Our techniques are capable of self-repairing multiple faulty uncore components without any additional area impact, but with graceful degradation of application performance.
3. Our techniques achieve high self-repair coverage of 97.5% in the presence of a single fault.
Our self-repair techniques also enable flexible tradeoffs between self-repair coverage and area costs. For example, 75% self-repair coverage can be achieved with 3.2% post-layout chip-level area impact.

1. Introduction

Permanent faults or hard failures, such as those caused by early-life failures, circuit aging, and manufacturing defects and variations, pose major reliability challenges in advanced CMOS technologies [Agostinelli 05, Borkar 05, 07, Hicks 08, Nassif 12, Van Horn 05]. To enable robust systems with built-in tolerance to permanent faults, the following steps must work together in a holistic fashion during the manufacturing process and in the field:
• Detection of permanent faults. Permanent faults can be detected either during manufacturing test (including burn-in), or in the field using techniques such as concurrent error detection [Mitra 00], circuit failure prediction [Agarwal 07, Karl 08, Kim 10], and online self-test and diagnostics [Constantinides 07, Li 08, 10].
• Diagnosis to narrow down permanent fault location(s).
• Self-repair to replace/bypass faulty components (i.e., components with permanent faults), so that the system keeps functioning correctly even in the presence of permanent faults. Self-repair is distinct from self-tuning of system parameters (e.g., frequency, voltage, or body bias) to compensate for delay degradation due to circuit aging [Mintarno 11, Tiwari 08].
• If permanent faults are detected in the field, an additional recovery step may be required to correct corrupt system data and states, e.g., using checkpointing and rollback [Elnozahy 02, Nakano 06].

In this paper, we focus on self-repair. Although our primary objective is to overcome reliability challenges due to permanent faults, our techniques may also be used for yield improvement after manufacturing.

Previous work on self-repair mostly targets memories, processor cores, interconnection networks, and FPGAs. In contrast, we focus on uncore components¹ because they can account for a significant proportion of the overall area of a multi-core SoC. In this paper, we use this term to refer to non-processor logic components such as various controllers (e.g., cache / DRAM / I/O controllers) and accelerators (e.g., network offload engines). In the industrial OpenSPARC T2 SoC that supports 8 cores and 64 hardware threads [OpenSPARC], the logic area (excluding all SRAM modules, e.g., cache memories and queues/buffers) of uncore components is comparable to that of processor cores (Fig. 1). If any uncore component fails, the entire SoC can stop functioning correctly. For example, if a fault occurs in the logic that indicates whether a DRAM request is valid in a DRAM controller, requests to that controller can be dropped, which can result in system hang. Hence, self-repair of uncore components is essential.

[Figure 1. Area breakdown of OpenSPARC T2 [Li 10]: logic area of processor cores 11.8%, logic area of uncore components 12.2%, memories 76%.]

Self-repair techniques that utilize spare units, i.e., sparing-based techniques, can be expensive for uncore components. For example, a self-repair technique which uses one spare component for each uncore component "type" incurs high chip-level area cost of 16% (optimistic, before place-and-route) for OpenSPARC T2 (Sec. 2). We overcome this challenge of high self-repair costs of existing techniques, and make the following contributions:
1. We present two self-repair techniques, Enhanced Resource Reallocation and Sharing (ERRS) and Sparing through Hierarchical Exploration (SHE), and demonstrate their effectiveness and practicality using the open-source OpenSPARC T2 SoC with 8 processor cores that support 64 hardware threads.
2. ERRS and SHE enable effective self-repair of any single faulty uncore component with 7.5% chip-level area impact (after place-and-route), 3% chip-level power impact, and 5% application performance impact in the presence of a single faulty component. ERRS and SHE do not introduce any performance impact in fault-free systems.
3. ERRS is capable of self-repairing multiple faulty uncore components without incurring additional area costs. As the number of faulty components increases, the system experiences graceful performance degradation (i.e., it takes longer to execute target applications). For example, ERRS allows up to 4 L2 cache bank controllers and 2 DRAM controllers to be simultaneously faulty, with an application performance impact of 11.4%.
4. We quantify the effectiveness of ERRS and SHE using a self-repair coverage metric, which is defined as the probability that the uncore components function correctly for a given number of faults (Sec. 4.4). ERRS and SHE achieve 97.5% self-repair coverage in the presence of a single fault, and over 86.1% self-repair coverage even in the presence of a large number of faults.

¹ May also be referred to as "nest," "outside-core," or "northbridge" components.

Our self-repair techniques enable flexible tradeoffs between area, power, and performance costs and self-repair coverage. For example, as shown in Fig. 2, we achieve 74.9% single-fault self-repair coverage (self-repair coverage in the presence of a single fault, details in Sec. 4.4) with only 3.2% area (post-layout), 2.8% power, and 5% performance impact (in the presence of a single faulty component) using ERRS alone. With SHE and ERRS, single-fault self-repair coverage increases to 97.5%, while post-layout area, power, and performance impact (for a single faulty component) is 7.5%, 3%, and 5%, respectively.

[Figure 2. Single-fault self-repair coverage vs. post-layout chip-level area overhead of ERRS and SHE on OpenSPARC T2: ERRS alone achieves 74.9% coverage at 3.2% area overhead (2.8% power, 5% performance impact for a single faulty component); ERRS+SHE achieves 97.5% coverage at 7.5% area overhead (3.0% power, 5% performance impact for a single faulty component). ERRS and SHE do not introduce performance impact in fault-free systems.]

For self-repair during system operation, permanent faults must first be detected and localized. We achieve both objectives using a low-cost online self-test and diagnostics technique called CASP [Li 08, 10], which introduces only 1% chip-level area and power impact for OpenSPARC T2 (Appendix A). For ERRS and SHE, faults only need to be localized to hardware blocks that can be replaced/bypassed. Hence, we do not require highly fine-grained diagnosis techniques that localize faults to individual gates (e.g., for yield-learning purposes [Keim 06]). As discussed in [Beckler 12], such fine-grained gate-level diagnosis can be difficult and expensive to achieve during system operation.

In Sec. 2, we discuss the limitations of existing self-repair techniques. We present ERRS and SHE in Sec. 3. In Sec. 4, we quantify chip-level area, power, and performance costs, and self-repair coverage of ERRS and SHE, and present comparisons with existing techniques.

2. Limitations of Existing Self-Repair Techniques

Existing self-repair techniques mostly focus on processor cores, memories, networks-on-chip, and FPGAs. Spare cores are a well-known idea, and are used in commercial products (e.g., Cisco Metro [Shalf 09] and IBM BlueGene/Q [Morgan 11]). Other self-repair techniques for cores include microarchitectural block disabling using hardware [Shivakumar 03, Schuchman 05] or software [Scholzel 11] techniques, core cannibalization [Romanescu 08], and architectural core salvaging [Powell 09]. These techniques suffer from high area cost (e.g., 11%), performance cost (e.g., 20%), low self-repair coverage (e.g., 60%-80%), or limited applicability (e.g., only applicable for applications that seldom utilize "complex" instructions such as floating-point SIMD instructions). A considerable amount of research literature as well as commercial products exist for redundancy-based built-in self-repair of memories [Benso 02, Kim 98, Schober 01, Zorian 03], cache line disabling [Chang 07, Sanda 08, Turgeon 91], and reconfigurable cache architectures [Shirvani 99]. A variety of techniques also exist for fault-tolerant routing in interconnection networks, and, more recently, in networks-on-chip [Adams 87, Gomez 04]. Commercial FPGAs, e.g., from Altera, incorporate redundancy and repair for yield improvement [Altera 06].

In contrast, very little attention has been paid to self-repair of uncore components. Techniques that may be used for self-repair of such components include:
• Sparing-based techniques, which implement spare units to replace faulty hardware blocks. We discuss in detail three existing sparing-based techniques later in this section.
• Reconfigurable wrappers [Abramovici 06] for post-silicon validation can potentially be used to modify a faulty signal to "fix" a fault. However, to achieve high self-repair coverage, fine-grained diagnosis to localize faults to individual signals may be required.
• Roving Emulation [Breuer 86] was originally introduced in the context of fault detection. It uses an emulation engine to emulate the operations of a given component, and compares the outputs of the emulation engine with those of the component periodically for short intervals of time to detect permanent faults. The emulation engine may also be used to emulate the functionalities of a faulty component for self-repair. However, the area and performance costs for self-repairing arbitrary faulty components using an emulation engine may be high.
• ABFT (Algorithm-Based Fault Tolerance) utilizes special application properties, e.g., matrix operations [Huang 84], to achieve low-cost fault tolerance. However, ABFT is not generally applicable for arbitrary applications.

Given their generality, we mainly consider sparing-based techniques in this paper. We quantify the area costs of three existing sparing-based techniques (Fig. 3) for OpenSPARC T2 [OpenSPARC]. As discussed later in Sec. 4.1, power costs of sparing-based techniques can be low if proper power-gating techniques are used.
1. Component-type sparing (Fig. 3a) allocates one spare unit for each component type; i.e., a single spare is used for multiple identical instances of the same component. The spare unit includes all logic and SRAM modules (e.g., queues and buffers) inside a component instance. If multiple faulty components of the same type need to be tolerated, more spare units are required.
2. Logic sparing [Allsup 10, Mirza 12a] (Fig. 3b) duplicates the logic portion of each component (in contrast to each component type). It excludes SRAM modules, for which self-repair techniques exist (e.g., [Aitken 04]). Multiple faulty components may be tolerated using logic sparing since a spare unit is not shared by multiple components.
3. Shared-FF sparing [Mirza 12b] (Fig. 3c) is a special case of logic sparing which utilizes a few spare flip-flops for self-repairing all flip-flops, and only duplicates the combinational logic gates of each component. Multiple faulty components may be tolerated as long as the few spare flip-flops are capable of self-repairing all faulty flip-flops in a component.

[Figure 3. Existing sparing-based techniques. a) Component-type sparing. b) Logic sparing. c) Shared-FF sparing. Steering logic (multiplexers and interconnects) selects between each original unit and the spare.]

Figure 4 shows the uncore components considered in our case study, the breakdown of synthesis area of various components (area evaluation methodology in Sec. 4.1), and the estimated area costs required to implement the three sparing-based techniques. Area cost estimates in Fig. 4 are optimistic because we do not account for the area cost of steering logic (multiplexers and interconnects that select whether the original or the spare unit should be used). In reality, wire routing overheads of steering logic can be high. The high area impact of these techniques (12%-16% even with our optimistic estimation) motivates the need for new cost-effective uncore self-repair techniques. Note that the area costs of our self-repair techniques (reported in Sec. 4.1) are not optimistic, because we perform place-and-route to account for routing overheads.

Components (# instances)                  | % original chip area per instance | Component-type sparing | Logic sparing | Shared-FF sparing
Processor core (8)                        | 4.54% | N/A    | N/A    | N/A
L2 cache data/tag arrays (L2M) (8)        | 3.95% | N/A    | N/A    | N/A
Crossbar (CCX) sub-block (17)             | 0.15% | 0.30%  | 2.54%  | 2.41%
L2 cache bank controller (L2C) (8)        | 1.87% | 1.87%  | 6.02%  | 5.51%
DRAM controller (MCU) (4)                 | 0.41% | 0.41%  | 1.19%  | 1.03%
Non-cacheable unit (NCU) (1)              | 0.67% | 0.67%  | 0.30%  | 0.26%
System interface unit (SIU) (1)           | 1.38% | 1.38%  | 0.29%  | 0.25%
PCI-express controller (DMU) (1)          | 1.26% | 1.26%  | 0.70%  | 0.61%
Network controller/accelerator (NIU) (1)  | 9.65% | 9.65%  | 1.58%  | 1.32%
Total                                     | 100%  | 15.53% | 12.63% | 11.38%

Figure 4. OpenSPARC T2 components and area breakdown. The uncore components considered in this paper are the seven types for which sparing overheads are listed (CCX sub-block through NIU). The last three columns give optimistic estimates of the synthesis area overhead for uncore self-repair using the three existing sparing-based techniques (cost of steering logic not included).
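As a sanity check on the "Total" row of Fig. 4, the short sketch below (written for this discussion, not part of the original study) simply sums the per-component-type overheads; the results match the printed totals up to rounding of the individual entries.

```python
# Optimistic synthesis area overhead (% of original chip area) per uncore
# component type, copied from Fig. 4: (component-type, logic, shared-FF sparing).
overheads = {
    "CCX sub-block": (0.30, 2.54, 2.41),
    "L2C":           (1.87, 6.02, 5.51),
    "MCU":           (0.41, 1.19, 1.03),
    "NCU":           (0.67, 0.30, 0.26),
    "SIU":           (1.38, 0.29, 0.25),
    "DMU":           (1.26, 0.70, 0.61),
    "NIU":           (9.65, 1.58, 1.32),
}

totals = [sum(row[i] for row in overheads.values()) for i in range(3)]
for name, total in zip(("component-type", "logic", "shared-FF"), totals):
    print(f"{name} sparing: {total:.2f}% chip-level synthesis area overhead")
# Prints ~15.54%, ~12.62%, ~11.39%, i.e., the 15.53% / 12.63% / 11.38% totals
# of Fig. 4 up to rounding of the per-row entries.
```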

3. New Self-Repair Techniques for Uncore Components

To overcome the limitations of existing self-repair techniques, our approach is to analyze the architecture (i.e., functional properties) of each uncore component in addition to the structural information (i.e., organization of hardware blocks). This enables us to take advantage of existing architectural features to reduce area costs. In some cases, we can even eliminate spare units. Our new self-repair techniques include:
1. Enhanced Resource Reallocation and Sharing (ERRS).
2. Sparing through Hierarchical Exploration (SHE).
We demonstrate our techniques on OpenSPARC T2 (Fig. 4) as a case study, where ERRS is used for self-repairing CCX, L2Cs, and MCUs, and SHE is used for NCU, SIU, DMU, and NIU.

3.1. Enhanced Resource Reallocation and Sharing (ERRS)

To eliminate the need for spare units, we utilize the existence of multiple instances of uncore components that can accomplish "similar" functionalities. For example, if one of the L2 cache bank controllers (L2Cs) becomes faulty, the idea is to reallocate the workload of that (faulty) component to another already-existing and "similar" component, referred to as a helper. The helper then processes the workload of the faulty component in addition to its own. For example, an L2C helper will allocate entries for and respond to requests (from processor cores) to the data that are mapped to the faulty L2C.

We expect a wide variety of future SoCs to benefit from ERRS. For example, multiple cache controllers and DRAM controllers already exist in many commercial SoCs (e.g., Intel SandyBridge, IBM Power 7, Oracle UltraSPARC). Multiple processing units and controllers also exist in graphics ICs (e.g., Nvidia GT200) and network processors (e.g., C-5 digital communication processors). Moreover, as the trends of modular design and parallel execution models continue to prevail in SoC designs, it is likely that multiple "similar" non-processor controllers and accelerator engines will be prevalent in future SoCs.

The key challenge for such a sharing-based approach is that, due to resource sharing, application performance can degrade significantly after self-repair (i.e., after a fault occurs). For example, direct application of the RRS (Resource Reallocation and Sharing) approach [Li 10] (originally created for online self-test and diagnostics) to self-repair can result in 22% system performance impact with just a single faulty component (details in Sec. 4.3). The idea of ERRS is to minimize this performance impact without significant area costs. ERRS can be achieved by following three steps:
Step 1. Implement RRS [Li 10] for self-repair.
Step 2. Identify RRS performance bottlenecks.
Step 3. Overcome RRS performance bottlenecks via critical architectural enhancements with area vs. performance tradeoffs.
A key advantage of ERRS is that it is capable of self-repairing multiple faulty components (without any design changes) as long as a component and its helper are not both faulty.

Step 1. Implement RRS for Self-Repair

RRS serves as a starting point to enable resource sharing for self-repair. The general flow to enable resource sharing after a fault occurs (Fig. 6 shows an example; more details in a technical report [Li 13]) is:
1. Stall the faulty component so that no new requests are accepted.
2. Drain any outstanding requests in the faulty component.
3. Transfer "necessary states" from the faulty component to the helper. The necessary states are required for the helper to correctly process the requests that are reallocated from the faulty component (e.g., design configuration parameters such as refresh intervals in MCUs).
4. Reallocate workloads from the faulty component to the helper.
5. Enable the helper to process workloads originally mapped to the faulty component in addition to its own (using the same control signals that indicate whether uncore components are faulty).
6. Disregard outputs from the faulty component.

RRS for CCX sub-blocks

In OpenSPARC T2, a CCX sub-block (e.g., CCX sub-block 0 in Fig. 5) arbitrates and forwards requests from multiple sources (e.g., 8 processor cores in Fig. 5) to a single destination (e.g., L2C 0 in Fig. 5). If a CCX sub-block becomes faulty (e.g., CCX sub-block 0 in Fig. 6), the helper (e.g., CCX sub-block 1 in Fig. 6) arbitrates and forwards requests originally mapped to the faulty CCX sub-block along with its own. Hardware modifications to support RRS for CCX sub-blocks are shown in Fig. 6. Workload reallocation is done by modifying the valid bit logic, which indicates which CCX sub-block is responsible for routing an incoming request. We also modify the arbitration logic so that two requests targeting different destinations can be dispatched in the same cycle by the helper.

[Figure 5. CCX sub-blocks in OpenSPARC T2: each sub-block receives requests from cores 0-7 through the valid bit logic, arbitrates among them, and forwards them to its L2C.]

[Figure 6. RRS for self-repairing CCX sub-blocks in OpenSPARC T2. CCX sub-block 0 is faulty and CCX sub-block 1 is the helper. Annotated steps: 1. stall; 2. drain outstanding requests (wait until the queue is empty); 3. transfer necessary states (skipped -- not needed for CCX); 4. enable workload reallocation via modified valid bit logic; 5. enable resource sharing via modified arbitration logic; 6. select outputs from CCX sub-block 0 through added multiplexers.]

RRS for L2Cs

In OpenSPARC T2, the shared L2 cache is divided into 8 banks, each of which consists of an L2 cache bank controller (L2C) and cache memory/tag arrays (L2M). Existing memory self-repair techniques can be applied to L2Ms, and RRS is applied to the L2Cs. We implement two RRS schemes for L2Cs, denoted by RRS-1 and RRS-2.

In RRS-1, hardware support is added so that an L2C helper can share its own L2M with the data originally mapped to the faulty L2C. As a result, L2 cache capacity is effectively reduced. For example, in Fig. 7a, L2C 1 (serving as a helper) allocates entries for the data mapped to L2C 0 (which is faulty) in L2M 1, in addition to its own data.

In RRS-2, hardware support is added so that an L2C helper can access two sets of L2Ms (the second set is originally accessed by the faulty L2C). For example, in Fig. 7b, L2C 1 (the helper) accesses data in both L2M 0 and L2M 1. However, due to their physical locations, there is an 8-cycle overhead (in our implementation) for the L2C helper to access the second L2M. Implementation details of RRS-1 and RRS-2 can be found in [Li 13].

RRS-1 and RRS-2 introduce different performance tradeoffs for the average memory access time metric: cache hit time + cache miss rate × cache miss penalty. RRS-1 results in reduced cache capacity, which increases the cache miss rate. RRS-2 introduces additional cache latency (8 cycles in our implementation) for a helper to access the L2M originally associated with the faulty L2C, thereby increasing the corresponding hit time and miss penalty for the data that is mapped to the second L2M. We quantify these tradeoffs using simulation experiments in Sec. 4.3.

[Figure 7. RRS-1 vs. RRS-2 for self-repairing L2Cs in OpenSPARC T2 (8 cores, 64 threads). L2C 0 is faulty and L2C 1 is the helper. a) RRS-1: L2M 0 is not used; the capacity of L2M 1 is shared. b) RRS-2: L2C 1 controls both L2M 0 and L2M 1, with an 8-cycle overhead to reach L2M 0.]

RRS for MCUs

The four DRAM controllers (MCUs) in OpenSPARC T2 are responsible for interacting with off-chip DRAM modules (through DRAM channels) to process DRAM read/write requests from the L2Cs. Each MCU can keep track of the status of multiple DRAM banks to handle multiple DRAM requests at the same time. An MCU helper handles all requests originally mapped to the faulty MCU in addition to its own by interacting with two DRAM channels, one originally associated with the helper and the other originally associated with the faulty MCU (detailed implementation in [Li 13]). The principle is similar to RRS-2 for L2Cs.

In theory, the RRS-1 principle for L2Cs may also be used for MCUs. However, without appropriate operating system support, such an approach can disrupt normal operation because active physical memory pages may no longer be accessible.

Step 2. Identify RRS Performance Bottlenecks

The methodology to identify RRS performance bottlenecks consists of the following steps (details in [Li 13]):
1. For each uncore component, identify various request "types" targeting that uncore component (Fig. 8).
2. Construct 4 RTL models: a) two CCX sub-blocks with RRS support; b) two L2Cs with RRS-1 support; c) two L2Cs with RRS-2 support; d) two MCUs with RRS-2 support. For each RTL model, we perform RTL simulation for two scenarios: i) a fault-free scenario; ii) one component in the RTL model is faulty and the other serves as the helper. Such RTL models (instead of a full-system model) allow us to focus on the shared hardware resources while simplifying simulation details and subsequent analysis.
3. For each request type in each RTL model, generate "stress workloads" to fully utilize hardware resources of both components, thereby creating contention in shared resources if one component becomes faulty. In our case study, each stress workload generates random requests on behalf of every input source of both components in an RTL model for 16 consecutive cycles. The input sources for the two CCX sub-blocks and the two L2Cs are the 8 processor cores, and for the two MCUs, they are 4 L2Cs. We then study the simulation traces of these workloads to identify RRS performance bottlenecks.

We summarize in Fig. 8 the cycle count overheads (the number of simulation cycles required to complete all requests generated by stress workloads, normalized to the corresponding fault-free scenarios) of RRS for CCX, RRS-1 and RRS-2 for L2Cs, and RRS-2 for MCUs. The key results are discussed below.

RRS for CCX sub-blocks: Cycle count overheads are very small because we modify the arbitration logic so that a CCX sub-block helper is able to dispatch packets to L2Cs at the same rate as the fault-free scenario (Fig. 6).

RRS-1/RRS-2 for L2Cs: Cycle count overheads for all request types are high (~70% on average, and up to 100%). By analyzing simulation traces, we find that the reasons are:
a) For stress workloads that generate "load hit" and "store hit" requests, two such requests targeting two different L2Cs in the same cycle can be handled in parallel in the fault-free scenario. With RRS-1/RRS-2, however, these requests need to be serialized, which affects L2 hit processing bandwidth (the number of hit requests that can be processed concurrently).
b) Cycle counts for stress workloads that generate "load miss" and "store miss" requests are determined by the maximum number of outstanding misses, which is in turn determined by the number of entries in the miss fill buffer. Once the miss fill buffer is full, an additional request that misses in the L2 cache bank needs to wait until a previous miss is handled and an entry in the miss fill buffer is freed up; this can take a long time (e.g., 56 cycles). In the fault-free scenario, two L2Cs together allow 16 outstanding misses. With RRS-1/RRS-2 for a single faulty L2C, only 8 outstanding misses are allowed by the helper, which reduces L2 miss processing bandwidth, i.e., the maximum number of allowed outstanding misses.

RRS-2 for MCUs: Cycle count overheads of both DRAM reads and writes are high with RRS-2 for a faulty MCU (up to 100%). Simulation traces show that each MCU in the RTL model keeps track of up to 16 DRAM banks, for a total of 32 DRAM banks; this is because there are 16 DRAM bank control FSMs per MCU. With a single faulty MCU, up to 16 outstanding DRAM requests can be supported by the helper, which effectively reduces both DRAM read and write access bandwidth (i.e., the number of DRAM read and write requests that can be handled concurrently). Furthermore, the DRAM read return logic in each MCU is only able to process one DRAM read return data item at a time. With RRS-2 for MCUs, DRAM data can be returned from two DRAM channels at the same time, but they can only be processed by the helper one after the other. This also reduces DRAM read access bandwidth, because in the fault-free scenario the data can be processed in parallel by two separate MCUs.

[Figure 8. Cycle count overheads of RRS vs. fault-free scenarios: average, min, and max cycle count overheads of 20 stress workloads. CCX request type: any input packet (RRS). L2C request types: load hit, store hit, load miss, store miss (RRS-1 and RRS-2). MCU request types: DRAM read, DRAM write (RRS-2).]
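To make the RRS-1 vs. RRS-2 tradeoff on average memory access time (AMAT = hit time + miss rate × miss penalty, discussed above) concrete, the following sketch plugs in assumed, purely illustrative cache parameters; only the 8-cycle overhead for reaching the faulty L2C's L2M is taken from the text.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time = hit time + miss rate x miss penalty.
    return hit_time + miss_rate * miss_penalty

# Assumed baseline L2 parameters (illustrative only, not OpenSPARC T2 data).
HIT, MISS_RATE, MISS_PENALTY = 20.0, 0.05, 150.0
EXTRA = 8.0  # extra cycles to reach the faulty L2C's L2M under RRS-2 (Sec. 3.1)

baseline = amat(HIT, MISS_RATE, MISS_PENALTY)

# RRS-1: cache capacity is effectively halved for the affected banks, which
# raises the miss rate (a +40% increase is assumed here for illustration).
rrs1 = amat(HIT, MISS_RATE * 1.4, MISS_PENALTY)

# RRS-2: capacity is preserved, but accesses mapped to the faulty bank pay the
# 8-cycle overhead on both hit time and miss penalty; assume half of the
# helper's traffic targets the remote L2M.
remote = 0.5
rrs2 = (1 - remote) * baseline + remote * amat(HIT + EXTRA, MISS_RATE, MISS_PENALTY + EXTRA)

print(f"baseline AMAT {baseline:.1f}, RRS-1 {rrs1:.1f}, RRS-2 {rrs2:.1f} cycles")
```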

Step 3. Critical Architectural Enhancements for ERRS

We first incorporate architectural enhancements to overcome the performance bottlenecks identified in Step 2. Next, we follow the simulation methodology in Step 2, together with area/power evaluation (details in Sec. 4.1), for tradeoff analysis of area vs. performance costs. This process can be repeated several times to obtain desirable area vs. performance tradeoffs based on system requirements and constraints.

For our case study, we provide architectural enhancements to preserve L2 miss/hit processing bandwidths and DRAM read/write access bandwidths, as summarized in Table 1. ERRS for L2Cs utilizes the same principle as RRS-2 for L2Cs, for which the L2 cache capacity is preserved, but additional cache latency (8 cycles in our implementation) is introduced for a helper to access the L2M originally associated with the faulty L2C. The post-layout area impact of ERRS with these enhancements is 3.2% (Sec. 4.1). As Fig. 9 shows, the improvements in performance over RRS are significant: from 70% to 6% on average for a single faulty L2C, and from 60% to 8% on average for a single faulty MCU. Further performance results using full-system simulation and realistic workloads show less than 5% performance impact using ERRS after self-repair (Sec. 4.3). Note that ERRS does not introduce any performance impact in fault-free scenarios.

We also perform sensitivity analysis by increasing the number of miss fill buffer entries in L2Cs and the number of DRAM bank control FSMs in MCUs by 50% (instead of doubling them as in Table 1), respectively, while still duplicating the modules that handle cache hits in L2Cs and the DRAM read return logic in MCUs. The performance impact, labeled ERRS_sensitivity_analysis in Fig. 9, can be as high as 60% vs. fault-free scenarios. Therefore, to minimize performance impact, we use the configurations in Table 1 for ERRS.

Table 1. ERRS critical architectural enhancements for different uncore components in OpenSPARC T2.
Uncore component | Critical architectural enhancements
CCX sub-block    | None
L2C              | 1. Duplicate the modules that handle cache hits. 2. Double the number of miss fill buffer entries.
MCU              | 1. Double the number of DRAM bank control FSMs. 2. Duplicate the DRAM read return logic.

[Figure 9. Cycle count overheads of ERRS vs. fault-free scenarios: average, min, and max cycle count overheads of 20 stress workloads for ERRS and ERRS_sensitivity_analysis, for L2C request types (load hit, store hit, load miss, store miss) and MCU request types (DRAM read, DRAM write).]

3.2. Sparing through Hierarchical Exploration (SHE)

ERRS is expected to be applicable to a wide variety of uncore components in future SoCs (Sec. 3.1). However, not all uncore components can utilize ERRS. For example, the I/O controllers (NCU, SIU, and DMU) and the network accelerator (NIU) in OpenSPARC T2 cannot utilize ERRS because no other existing components can accomplish similar functionalities.

The idea of SHE is to explore different levels of the design hierarchy of each uncore component to identify structural properties of sub-components (Fig. 10). Such exploration enables us to identify multiple identical sub-components inside an uncore component to reduce the costs of spare units. For example, consider the NIU component of OpenSPARC T2 (Fig. 4). All sub-components at Level 1 and Level 2 of the design hierarchy are different (Fig. 10). However, by looking one level deeper (Level 3), we find that the rdmc sub-component contains 16 identical DMA channels (chnl1-chnl16). We provide only one spare unit for all 16 identical instances, and one spare unit per remaining sub-component of the rdmc (Fig. 10), thereby reducing area overhead (vs. the area of the original rdmc sub-component) by 58% when compared to logic sparing. Similar to logic sparing and shared-FF sparing (Sec. 2), SHE is only applied to the logic portion of an uncore component, because self-repair techniques for memory modules inside the uncore components exist [Aitken 04]. Compared to component-type sparing, which provides spare units for both memory modules and logic (Sec. 2) and in this case is equivalent to replicating the rdmc sub-component, SHE reduces area overhead (vs. the area of the original rdmc sub-component) by 80%.

[Figure 10. First 3 levels of the design hierarchy of the NIU in OpenSPARC T2: Level 1 (a.k.a. components), Level 2 sub-components (e.g., clk, pio, pio_ucb, rdmc, dbg, mb4, SRAM ctrl), and Level 3 sub-components (e.g., the DMA channels chnl1-chnl16 inside rdmc). The SHE implementation is shown for the rdmc sub-component; the additional hardware consists of spare units and steering logic.]

SHE raises an important question: how deep in the design hierarchy should we explore? Intuitively, the smaller the sub-components (lower in the hierarchy), the more likely identical structures can be found (the extreme case being at the level of individual gates). However, self-repair of smaller sub-components requires more steering logic (multiplexers and interconnects), which not only imposes higher area costs, but also reduces self-repair coverage (details in Sec. 4.4). To analyze these tradeoffs, we create 34 different configurations of SHE for NCU, SIU, DMU, and NIU:
1. A sweet-spot configuration, which is obtained by following the methodology in Fig. 11.
2. A coarse-grained configuration, where a spare unit is provided for each unique instance of a Level 2 sub-component (i.e., one spare unit is provided for all instances of an identical sub-component, and sub-components with only a single instance are merely duplicated).
3. A fine-grained configuration, where a spare unit is provided for each unique instance of a Level 3 sub-component.
4. 31 other configurations obtained using a methodology similar to Fig. 11. We start by adding all unique Level 2 sub-components to the set SC. Instead of using the two heuristics that balance area savings and self-repair coverage degradation, we first check if the current sub-component is at Level 3. If so, we provide a spare unit for all identical instances of the sub-component. Otherwise, we randomly choose to either continue the flow or provide a spare unit for all identical instances of the sub-component (which will be at Level 2). Spare units are provided at only these two levels because self-repair coverage tends to be low beyond Level 3 (Sec. 4.4).

[Figure 11. Sweet-spot SHE configuration. Flow: add all unique sub-components of the top-level component to set SC; while SC is not empty, remove the next sub-component s from SC; if s contains multiple identical sub-components within the next 2 levels of hierarchy‡ and s is larger than 50,000 circuit nodes (i.e., inputs and outputs of logic gates)‡, continue exploration by adding all unique sub-components of s to SC; otherwise, halt exploration and provide one spare unit to be shared by all identical instances of s. ‡ Heuristics to balance area savings and self-repair coverage degradation.]

Note that, although ERRS may be used for identical sub-components identified by SHE (to eliminate the need for spare units), the cost and complexity associated with the architectural modifications and enhancements required for ERRS can outweigh the cost of simply providing a spare unit, since sub-components are relatively small.
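The flow of Fig. 11 can be paraphrased as the following recursive heuristic. This is a behavioral sketch based on our reading of the flowchart; the hierarchy data structure and the name-based test for identical instances are assumptions, not the authors' actual tool.

```python
from dataclasses import dataclass, field

@dataclass
class SubComponent:
    name: str                            # module name; identical instances share a name
    num_nodes: int                       # circuit nodes (inputs/outputs of logic gates)
    children: list = field(default_factory=list)

def has_identical_instances(s, depth=2):
    """Heuristic 1: identical sub-components within the next 2 hierarchy levels."""
    if depth == 0 or not s.children:
        return False
    names = [c.name for c in s.children]
    if len(names) != len(set(names)):
        return True
    return any(has_identical_instances(c, depth - 1) for c in s.children)

def sweet_spot_she(top, node_threshold=50_000):
    """Return the (sub-)components that each receive one shared spare unit."""
    spares, sc = [], list(top.children)      # all unique Level-1/2 sub-components
    while sc:
        s = sc.pop(0)
        # Explore deeper only if both heuristics of Fig. 11 say it is worthwhile.
        if has_identical_instances(s) and s.num_nodes > node_threshold and s.children:
            sc.extend(s.children)
        else:
            spares.append(s)                 # one spare shared by all identical instances of s
    return spares
```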

4. Results

4.1. Chip-Level Area, Power, and Clock Frequency Impact

We use the Synopsys Design Compiler with the Synopsys EDK 32nm library for synthesis. We also perform place-and-route (P&R) to assess additional routing and wire overheads. Using the Synopsys IC Compiler, we run P&R for each component. The physical design of each component is then assembled to build the entire SoC design.

To evaluate the chip-level power impact of ERRS, we first perform RTL simulations using Synopsys VCS to obtain switching activities for a mix of SPEC and synthetic programs (the synthetic programs perform random read and write operations to an 8MB address space). The switching activities obtained from RTL simulation are then used by the Synopsys Power Compiler to derive power numbers. For SHE, we implement power gating using the Synopsys IC Compiler to automatically insert power switch cells, which can connect or disconnect the power supply for a hardware structure via an enable signal. Since the spare units in SHE are not required to operate in a fault-free system, they can be disconnected from the power supply. Our detailed methodology to obtain area, power, and clock speed results can be found in [Li 13].

Table 2 summarizes the results. ERRS and SHE (sweet-spot configuration, Sec. 3.2) together introduce 5.6% synthesis area impact at the chip level, which is significantly lower than existing sparing-based techniques for which area impact is optimistically estimated (Sec. 2). After place-and-route, the area impact is 7.5%, which accounts for additional routing and interconnect overheads. In our implementation, the OpenSPARC T2 design with ERRS and SHE achieves the same clock frequency (post-P&R) as the native OpenSPARC T2 design (300MHz using the Synopsys EDK 32nm library; we also confirm that both designs report critical path timing violations at 325MHz). The ERRS and SHE techniques introduce 2.94% chip-level power impact.

Table 2. Chip-level area/power impacts of ERRS and SHE.
Self-repair technique | Area impact (post P&R) | Area impact (synthesis) | Power impact | Clock freq. impact
ERRS    | 3.16% | 2.61% | 2.72% | 0%
SHE†    | 4.32% | 2.97% | 0.22% | 0%
Overall | 7.48% | 5.58% | 2.94% | 0%
† The power impact of SHE is from steering logic, since the spare units of SHE are turned off via power gating. Power switches and additional routing costs are accounted for in the area impact of SHE.

Compared to ERRS (Table 2), the area, clock frequency, and power impact of the RRS techniques are shown in Table 3. Although the area/power impact of both RRS schemes is smaller than that of ERRS, RRS techniques can incur high application performance impact (up to 22% for stress applications, as detailed in Sec. 4.3) as soon as a single uncore component becomes faulty. Moreover, RRS-1 also introduces a large increase in off-chip DRAM access power after self-repair of a faulty system, which is not desirable (Sec. 4.2).

Table 3. Chip-level area/power impact of RRS for L2C, MCU, and CCX sub-blocks in OpenSPARC T2.
RRS*  | Area impact (post P&R) | Area impact (synthesis) | Power impact | Clock freq. impact
RRS-1 | 1.12% | 0.97% | 0.81% | 0%
RRS-2 | 1.41% | 1.17% | 1.15% | 0%
* RRS-1 and RRS-2 implementations differ only for L2Cs (Sec. 3.1).

4.2. Off-Chip DRAM Access Power Impact (After Self-Repair)

In addition to chip-level power impact, off-chip DRAM access power is also a major concern from a system's perspective; it depends on the L2 cache miss rate. For ERRS and RRS-2, DRAM access power in a faulty system (after self-repair) is similar to that of fault-free operation because the cache capacity is preserved [Li 13]. For RRS-1, however, the reduced L2 cache capacity can result in higher DRAM access power in a faulty system (after self-repair). To quantify this impact of RRS-1, we use the Micron DRAM power calculator [Micron] to estimate the power consumption of DDR2 DRAM chips. We derive the percentage of clock cycles for which DRAM banks process read or write requests using microarchitectural simulation results, which are obtained from the "PARSEC mix" workload executing on a 64-core chip multiprocessor (details in Sec. 4.3). The off-chip DRAM access power impact of RRS-1 (in a faulty system after self-repair) vs. fault-free operation is 11%, 14%, and 20% assuming 1, 2, and 4 faulty L2Cs, respectively, which suggests that RRS-1 may not be a desirable option for self-repair.

4.3. Application Performance Impact (After Self-Repair)

If a system is fault-free, ERRS and SHE do not introduce any application performance impact. In this section, we focus on evaluation of the application performance of ERRS after self-repair is performed in a faulty system. As explained in Sec. 3.2, our current implementation of SHE does not degrade application performance. We simulate two chip multiprocessors (CMPs), one with 8 cores and the other with 64 cores². Simulated system parameters are summarized in Table 4. The uncore configuration follows that of OpenSPARC T2 (Fig. 4).

Table 4. Simulated system parameters.
Parameter        | 8-core CMP                      | 64-core CMP
Simulator        | GEMS [Martin 05] with Simics    | gem5 [Binkert 11]
Processors       | 8 single-issue processor cores  | 64 single-issue processor cores
OS               | OpenSolaris 10                  | Linux 2.6
Memory hierarchy | Private L1 split instruction/data caches; 4MByte shared L2 cache (8 banks) (both systems)
Uncore           | 8 L2 bank controllers, 4 DRAM controllers (both systems)

The workloads used include:
1. Individual programs from the PARSEC benchmark suite³ that represent CMP workloads [Bienia 08].
2. A mix of programs from the PARSEC benchmark suite, referred to as "PARSEC mix," where each program imposes different demands on the uncore components. On the 64-core CMP, we run 6 instances of each program (except for dedup and streamcluster, for which we run 5 instances) for a total of 64 programs (one program per core). On the 8-core simulated CMP, we run the first 8 PARSEC programs in alphabetical order, one program per core.
3. Synthetic applications that stress processing bandwidth demands on L2Cs or MCUs, referred to as "L2C stress apps" or "MCU stress apps," on each core. These synthetic applications are created by tuning various strides (i.e., the difference in the memory addresses of two consecutive memory access operations) in the workload (similar to [Joshi 08]). Our synthetic applications mimic memory access behaviors of commercial database and web server applications (which can impose very high instruction miss rates in addition to data miss rates [Spracklen 05]).

In all our simulations, the baseline represents results obtained from fault-free simulations. For simulations in the presence of faults, we arbitrarily choose a component to be faulty.

² The 8-core system supports the SPARC Instruction Set Architecture (ISA), which matches the ISA of OpenSPARC T2. It mimics the design of the 8-core OpenSPARC T2 when multiple hardware threads are not enabled. The 64-core simulated CMP, on the other hand, only supports the x86 ISA. However, it allows us to examine the performance of uncore components with a large number of requests, corresponding to a scenario when all 64 hardware threads in OpenSPARC T2 execute memory-intensive applications concurrently. Although specific memory access patterns can differ for different ISAs, the uncore components are generally subject to similar memory access demands from the processor cores.
³ On the 64-core simulated CMP we are unable to run "freqmine" and "raytrace" due to known library issues [gem5] in the simulator.

For example, for the "PARSEC mix" workload, L2C 3 is faulty on the 8-core CMP and L2C 6 is faulty on the 64-core CMP. On the 8-core simulated CMP, we run all workloads to completion to obtain execution times. On the 64-core CMP, each simulation run continues until each program (on each core) executes at least 3 million instructions starting from the main body of the program (after the initial setup phase). We then extract CPI (cycles per instruction) for the last 2 million instructions (the first 1 million instructions are used for warming up the caches). This methodology allows us to capture performance behaviors across various program phases (intervals of a program that exhibit different architectural characteristics such as CPI, working sets, etc.) within reasonable simulation runtimes [Sanchez 10].

Performance Impact: Single Faulty L2C⁴

As shown in Figs. 12 and 13 (for the case of a single faulty L2C), the application performance overheads of RRS-1 and RRS-2 can be large (up to 18%). ERRS significantly reduces the performance impact to less than 5%, at the price of a slightly larger area cost (3.2% post-P&R chip-level area impact, as discussed in Sec. 4.1).

[Figure 12. Performance impact of RRS/ERRS for various workloads on the 8-core simulated CMP; one faulty L2C. Execution time overhead (%) of RRS-1, RRS-2, and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 L2C stress apps.]

[Figure 13. Performance impact of RRS/ERRS for various workloads on the 64-core simulated CMP; one faulty L2C. CPI overhead (%) of RRS-1, RRS-2, and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 L2C stress apps.]

Performance Impact: Single Faulty MCU

Figures 14 and 15 show the performance impact of the RRS and ERRS techniques for various workloads in the presence of a faulty MCU. For PARSEC, both RRS and ERRS (Figs. 14a and b, 15a and b) incur very small (< 1%) performance impact. However, for DRAM stress applications (Figs. 14c, 15c), ERRS significantly reduces (i.e., improves) the average performance impact: from 16% to 1% for the simulated 8-core CMP, and from 22% to 5% for the simulated 64-core CMP.

[Figure 14. Performance impact of RRS/ERRS for various workloads on the 8-core simulated CMP; one faulty MCU. Execution time overhead (%) of RRS-2 and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 MCU stress apps.]

[Figure 15. Performance impact of RRS/ERRS for various workloads on the 64-core simulated CMP; one faulty MCU assumed. CPI overhead (%) of RRS-2 and ERRS for (a) individual PARSEC programs, (b) PARSEC mix, and (c) the geometric mean of 100 MCU stress apps.]

Performance Impact: Multiple Faulty L2Cs and MCUs

In Fig. 16, we present results on the performance impact of RRS/ERRS for multiple faulty components. In the worst case, 2 MCUs and 4 L2Cs can fail (MCUs 0 and 2, and L2Cs 0, 2, 4, and 6 are arbitrarily chosen to be faulty). Note that ERRS allows all of these components to be faulty because the helpers are all fault-free. As expected, application performance degrades as the number of faulty components increases, since more hardware resources need to be shared. ERRS benefits from the architectural enhancements (Sec. 3.1), and incurs the lowest performance overheads in most cases. The only exception is that, on the 8-core CMP, RRS-1 outperforms ERRS for the cases with 4 faulty L2Cs. To understand this result, we record the L2 miss rates of RRS-1, which are 2.6%, 3.9%, and 4.8% for one, two, and four faulty L2Cs, respectively. For ERRS, as the number of faulty L2Cs increases, the percentage of L2 requests with higher cache latencies increases at a faster rate, i.e., 11.4%, 26.8%, and 52.9% for one, two, and four faulty L2Cs, respectively. These trends suggest that the increased L2 hit time and miss penalty associated with ERRS may become more detrimental to application performance than the reduced L2 cache capacity associated with RRS-1 for a large number of faulty L2Cs. (Note that, if ERRS were implemented following the same principle as RRS-1, which preserves L2 hit time and miss penalty but reduces L2 cache capacity, ERRS would be expected to outperform RRS-1 for these cases.)

[Figure 16. Performance impact (%) of RRS/ERRS for the PARSEC mix workload with multiple faulty components (2 MCUs; 2 L2Cs; 4 L2Cs; 4 L2Cs + 2 MCUs) on the 8-core and 64-core simulated CMPs.]

⁴ Note that the performance impact results in this section actually correspond to a scenario with two faulty components: the faulty L2C and the corresponding CCX sub-block. If an L2C is faulty, it implies that the corresponding CCX sub-block is idle (when RRS-1, RRS-2, or ERRS is used).
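For reference, the overhead metrics plotted in Figs. 12-16 can be computed as shown below; treating the stress-application summary as a geometric mean of per-application ratios is our reading of the figure labels rather than something the text spells out.

```python
from math import prod

def overhead_pct(faulty, fault_free):
    # Execution-time or CPI overhead (%) of a faulty, self-repaired system
    # relative to the fault-free baseline (Figs. 12-16).
    return (faulty / fault_free - 1.0) * 100.0

def geomean_overhead_pct(faulty_runs, fault_free_runs):
    # Summary over the 100 L2C/MCU stress apps: geometric mean of the
    # per-application faulty/fault-free ratios, expressed as an overhead.
    ratios = [f / b for f, b in zip(faulty_runs, fault_free_runs)]
    return (prod(ratios) ** (1.0 / len(ratios)) - 1.0) * 100.0
```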


4.4. Self-Repair Coverage

Similar to other self-repair techniques, our techniques cannot guarantee that all failures will be correctly repaired. This is due to the existence of single points of failure, which are signals that, if failed, can result in incorrect system operation. For both ERRS and SHE, single points of failure include the primary inputs and primary outputs of the uncore components or sub-components being repaired, and the steering logic (multiplexers and interconnects). The "response comparators," which indicate whether a (sub-)component needs to be repaired or not, can also behave as single points of failure (Appendix A).

ERRS and SHE are capable of self-repairing a wide variety (e.g., stuck-at, transition, or delay) and a large number (single or multiple) of faults that may occur inside a (sub-)component being repaired. To take the single points of failure into account, we use the self-repair coverage metric, which is defined as the probability that the uncore components in an SoC function correctly for a given fault density. In our analysis, fault density is defined as the total number of single stuck-at faults per 100 million circuit nodes (i.e., the inputs and outputs of logic gates). 100 million is of the same order of magnitude as the number of circuit nodes in the original OpenSPARC T2 design. As a special case, we also define the single-fault self-repair coverage as the probability that the uncore components function correctly in the presence of a single fault (stuck-at for our analysis, but other fault types can also be used).

4.4.1. Single-Fault Self-Repair Coverage

Single-fault self-repair coverage can be calculated by dividing the total number of circuit nodes that are not single points of failure by the total number of nodes. As shown in Fig. 2, ERRS alone achieves 74.9% single-fault self-repair coverage in our case study. With both ERRS and SHE, single-fault self-repair coverage is 97.5%.

In Fig. 17, we present single-fault self-repair coverage vs. synthesis area impact for the 34 SHE configurations discussed in Sec. 3.2. For the coarse-grained configuration, few identical sub-components are found; hence, the area cost is quite high (95% at the component level, i.e., normalized to the synthesis area for the logic portion of NCU, SIU, DMU, and NIU). For the fine-grained configuration, area impact is reduced by 20% at the component level compared to the coarse-grained configuration; however, it also results in a noticeable (2.5%) reduction in self-repair coverage. The sweet-spot configuration provides a balanced self-repair coverage vs. area impact tradeoff, because we are able to identify several cases of identical sub-components and simultaneously achieve 97.5% self-repair coverage with only 2.97% chip-level (77% component-level) area impact.

[Figure 17. Single-fault self-repair coverage vs. synthesis area impact for the 34 SHE configurations, with the sweet-spot, coarse-grained, and fine-grained configurations highlighted.]

4.4.2. Self-Repair Coverage for Various Fault Densities

To calculate self-repair coverage for a given fault density (instead of a single fault as discussed in Sec. 4.4.1), single points of failure of all uncore components are considered as belonging to a non-repairable set. All non-single points of failure that belong to original (sub-)component(s) and the corresponding helper or spare unit belong to a unique unit of repair (RU). For example, the non-single points of failure for all 16 DMA channels and the single spare DMA channel (Fig. 10) belong to the same RU, and those for a faulty L2C along with its helper belong to the same RU.

We assume that all faults are independent and identically distributed. The probability that a set of circuit nodes is fault-free (P_set) is calculated using the Poisson model in Eq. 1, where n_set is the number of nodes in the set and d is the fault density. This model is used in yield modeling [Cunningham 90] for random and independent defects, which we adopt for random and independent faults. Uncore components function correctly only if all RUs function correctly (each with probability P_RU) and the non-repairable set is fault-free (with probability P_non-repairable-set, calculated using Eq. 1). Hence, self-repair coverage can be calculated using Eq. 2. For the self-repair techniques considered in this paper, at most one (sub-)component in an RU can become faulty. For example, for ERRS, a component and its helper cannot be faulty at the same time (Sec. 3.1). The derivation of P_RU is presented in Appendix B. Note that the P_RU terms in Eq. 2 are statistically independent, since any shared logic among multiple components is accounted for in P_non-repairable-set.

P_set = e^(−n_set × d)    (Eq. 1)

Self-repair coverage = P_non-repairable-set × Π_i P_RU,i    (Eq. 2)

Self-repair coverage results are shown in Fig. 18 for a range of fault densities, comparing our techniques (ERRS+SHE) with the component-type sparing, logic sparing, and shared-FF sparing techniques of Sec. 2 (the existing sparing-based techniques are applied to all uncore components considered in this paper). The "idealistic" upper-bound reference is very difficult (if not impossible) to achieve: it assumes that a spare gate is provided for every gate, but that the steering logic introduces no area costs or single points of failure.

[Figure 18. Self-repair coverage for a wide range of fault densities (faults per 100M nodes) and area impact of different self-repair techniques. Chip-level synthesis area overhead: ERRS+SHE 5.6% (7.5% post P&R); shared-FF sparing 11.3%§; logic sparing 12.6%§; component-type sparing 15.5%§. § Discounts steering logic costs.]

The key result from Fig. 18 is that the combination of ERRS and SHE achieves very high self-repair coverage (86.1%-99.7%) for all fault densities considered, while introducing the lowest area impact. We provide an analysis of this key result below:
1. Comparing ERRS with logic sparing, ERRS results in fewer RUs (14 vs. 30) since spare units are not needed for ERRS. Since P_RU < 1, fewer terms result in higher self-repair coverage values based on Eq. 2. Comparing SHE with logic sparing (which provides spare units for Level 1 components), SHE introduces ~2X more single points of failure since it requires more steering logic. Note that, even in the absence of identical sub-components in Level 2 of the design hierarchy, logic sparing at Level 1 is distinct from SHE at Level 2: the aggregate number of primary inputs and outputs for Level 2 sub-components is greater than that for Level 1 components, thus requiring additional multiplexers. However, SHE helps with self-repair coverage for high fault densities, since spare units are provided at lower levels of the design hierarchy; as a result, it is less likely for an RU to fail (since multiple faulty units can be repaired). A combination of these factors results in comparable self-repair coverage between our techniques (ERRS+SHE) and logic sparing, but our techniques introduce significantly less area impact.
2. Self-repair coverage for shared-FF sparing is relatively low, especially for high fault densities, since it requires substantially more steering logic (i.e., ~6X more single points of failure than ERRS+SHE).
3. For component-type sparing, allocating one spare component shared by multiple components is detrimental to self-repair coverage for high fault densities, since it increases the likelihood that at least two such components fail. Although SHE also shares a single spare unit among multiple identical sub-components, providing spare units at lower levels of the design hierarchy helps with self-repair coverage (since multiple faulty sub-components can be repaired). As a result, the worst-case values of the P_RU terms are 98% for SHE and 74% for component-type sparing at 20 faults per 100M circuit nodes.


3. For component-type sparing, allocating one spare component shared by multiple components is detrimental to self-repair coverage at high fault densities, since it increases the likelihood that at least two such components fail. Although SHE also shares a single spare unit among multiple identical sub-components, providing spare units at lower levels of the design hierarchy helps self-repair coverage, since multiple faulty sub-components can be repaired. As a result, at 20 faults per 100M circuit nodes, the worst-case values of the PRU terms are 98% for SHE but only 74% for component-type sparing.

4.5. Summary and Discussion
Table 5 compares our techniques to existing sparing-based techniques for the OpenSPARC T2 case study. The combination of ERRS and SHE incurs substantially lower area cost while achieving high self-repair coverage.

Table 5. Result comparison for the OpenSPARC T2 case study.
                                     ERRS+SHE                   Component-type sparing   Logic sparing   Shared-FF sparing
Synthesis area impact                5.6% (7.5% post-layout)    15.5%§                   12.6%§          11.3%§
Single-fault self-repair coverage    97.5%                      99.0%                    99.1%           90.0%
Application performance impact       ERRS: 0% for fault-free systems, 5% for a single faulty component, and graceful degradation for multiple faulty components. Sparing-based techniques (including SHE): 0%.
Clock speed impact                   0% for ERRS+SHE; may be introduced for the sparing-based techniques due to wire routing overheads.
Power impact                         ERRS+SHE: 3%. Sparing-based techniques: may be small if gating techniques are used for the spare units.
§ Optimistic since area cost for steering logic is discounted.

5. Conclusion
Uncore components are prevalent in SoCs. Self-repair of uncore components is essential for ensuring overall SoC reliability in the presence of permanent faults. We present two new self-repair techniques, ERRS and SHE, which utilize architectural features in SoCs to enable effective self-repair of uncore components at low cost. Our techniques can also be used for yield improvement during manufacturing.
Future research directions include: 1. Extension of our self-repair techniques to application domains with real-time constraints. 2. New recovery techniques that can be integrated with our detection, diagnosis, and self-repair techniques, with optimized area, power, and performance tradeoffs. 3. Use of emerging 3D stacking technologies to further reduce the costs of our self-repair techniques.

Acknowledgement
This work was supported in part by the FCRP Gigascale Systems Research Center (GSRC), the National Science Foundation (NSF), and Intel. We thank Dr. Farzan Fallah of Stanford and Dr. Jung Yun Choi of Samsung for insightful discussions.

References
[Abramovici 06] Abramovici, M., et al., "A Reconfigurable Design-for-Debug Infrastructure for SoCs," Proc. Design Automation Conf., pp. 7-12, 2006.
[Adams 87] Adams, G.B., III, D.P. Agrawal, and H.J. Siegel, "A Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks," Computer, vol. 20, no. 6, pp. 14-27, 1987.
[Agarwal 07] Agarwal, M., et al., "Circuit Failure Prediction and Its Application to Transistor Aging," Proc. VLSI Test Symp., pp. 277-286, 2007.
[Agostinelli 05] Agostinelli, M., et al., "Random Charge Effects for PMOS NBTI in Ultra-Small Gate Area Devices," Proc. Intl. Reliability Physics Symp., pp. 529-532, 2005.
[Aitken 04] Aitken, R., "A Modular Wrapper Enabling High Speed BIST and Repair for Small Wide Memories," Proc. Intl. Test Conf., pp. 997-1005, 2004.
[Allsup 10] Allsup, C., "Is Built-in Logic Redundancy Ready for Prime Time?" Proc. Intl. Symp. on Quality Electronic Design, pp. 299-306, 2010.
[Altera 06] "Altera's Strategy for Delivering the Benefits of the 65-nm Semiconductor Process," http://www.altera.com/literature/wp/wp-01002.pdf.
[Beckler 12] Beckler, M., and R.D. Blanton, "On-Chip Diagnosis for Early-Life and Wear-Out Failures," Proc. Intl. Test Conf., pp. 1-10, 2012.
[Benso 02] Benso, A., et al., "An On-Line BIST RAM Architecture with Self-Repair Capabilities," IEEE Trans. Reliability, vol. 51, no. 1, pp. 123-128, 2002.
[Bienia 08] Bienia, C., et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 72-81, 2008.
[Binkert 11] Binkert, N., et al., "The GEM5 Simulator," SIGARCH Computer Architecture News, vol. 39, no. 2, pp. 1-7, 2011.
[Borkar 05] Borkar, S., "Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation," IEEE Micro, vol. 25, no. 6, pp. 10-16, 2005.
[Borkar 07] Borkar, S., "Thousand Core Chips – A Technology Perspective," Proc. Design Automation Conf., pp. 272-278, 2007.
[Breuer 86] Breuer, M., and A. Ismaeel, "Roving Emulation as a Fault Detection Mechanism," IEEE Trans. Comp., vol. C-35, no. 11, pp. 933-939, 1986.
[Chang 07] Chang, J., et al., "The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series," IEEE Journal of Solid-State Circuits, vol. 42, no. 4, pp. 846-852, 2007.
[Constantinides 07] Constantinides, K., et al., "Software-Based On-line Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation," Proc. Intl. Symp. on Microarchitecture, pp. 97-108, 2007.
[Cunningham 90] Cunningham, J., "The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing," IEEE Trans. Semiconductor Manufacturing, vol. 3, no. 2, pp. 60-71, 1990.
[Elnozahy 02] Elnozahy, E.N., D.B. Johnson, and Y.M. Wang, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
[gem5] "The gem5 Simulator System," http://www.m5sim.org/.
[Gomez 04] Gomez, M.E., et al., "An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori," IEEE Computer Architecture Letters, vol. 3, no. 1, p. 3, 2004.
[Hicks 08] Hicks, J., et al., "45nm Transistor Reliability," Intel Technology Journal, vol. 12, no. 2, pp. 131-144, 2008.
[Huang 84] Huang, K.H., and J.A. Abraham, "Algorithm-Based Fault Tolerance for Matrix Operations," IEEE Trans. Comp., vol. C-33, no. 6, pp. 518-528, 1984.
[Joshi 08] Joshi, A., et al., "Automated Stressmark Generation," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 229-239, 2008.
[Karl 08] Karl, E., et al., "Compact In-Situ Sensors for Monitoring Negative-Bias-Temperature-Instability Effect and Oxide Degradation," Proc. Intl. Solid-State Circuits Conf., pp. 410-623, 2008.
[Keim 06] Keim, M., et al., "A Rapid Yield Learning Flow Based on Production Integrated Layout-Aware Diagnosis," Proc. Intl. Test Conf., pp. 1-10, 2006.
[Kim 98] Kim, I., et al., "Built In Self Repair for Embedded High Density SRAM," Proc. Intl. Test Conf., pp. 1112-1119, 1998.
[Kim 10] Kim, Y.M., et al., "Low-Cost Gate-Oxide Early-Life Failure Detection in Robust Systems," Proc. Intl. Symp. VLSI Circuits, pp. 125-126, 2010.
[Li 08] Li, Y., S. Makar, and S. Mitra, "CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns," Proc. Design, Automation, and Test in Europe, pp. 885-890, 2008.
[Li 10] Li, Y., et al., "Concurrent Autonomous Self-Test for Uncore Components in System-on-Chips," Proc. VLSI Test Symp., pp. 232-237, 2010.
[Li 13] Li, Y., "Online Self-Test, Diagnostics, and Self-Repair for Robust System Design," Doctoral Dissertation, Stanford University, 2013.
[Martin 05] Martin, M.M.K., et al., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," SIGARCH Computer Architecture News, vol. 33, no. 4, pp. 92-99, 2005.
[Micron] "Calculating Memory System Power for DDR2," http://download.micron.com/pdf/technotes/ddr2/tn4704.pdf.
[Mintarno 11] Mintarno, E., et al., "Self-Tuning for Maximized Lifetime Energy-Efficiency in the Presence of Circuit Aging," IEEE Trans. CAD, vol. 30, no. 5, pp. 760-773, 2011.
[Mirza 12a] Mirza-Aghatabar, M., et al., "Theory of Redundancy for Logic Circuits to Maximize Yield/Area," Proc. Intl. Symp. Quality Electronic Design, pp. 663-671, 2012.
[Mirza 12b] Mirza-Aghatabar, M., et al., "A Design Flow to Maximize Yield/Area of Physical Devices via Redundancy," Proc. Intl. Test Conf., pp. 1-10, 2012.

[Mitra 00] Mitra, S., and E.J. McCluskey, "Which Concurrent Error Detection Schemes to Choose?" Proc. Intl. Test Conf., pp. 985-994, 2000.
[Morgan 11] Morgan, T.P., "IBM's BlueGene/Q Super Chip Grows 18th Core," http://insidehpc.com/2011/08/26/ibms-bluegeneq-super-chip-grows-18th-core/.
[Nakano 06] Nakano, J., et al., "ReViveI/O: Efficient Handling of I/O in Highly-Available Rollback-Recovery Servers," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 200-211, 2006.
[Nassif 12] Nassif, S.R., V.B. Kleeberger, and U. Schlichtmann, "Goldilocks Failures: Not Too Soft, Not Too Hard," Proc. Intl. Reliability Physics Symp., pp. 2F.1.1-2F.1.5, 2012.
[OpenSPARC] "OpenSPARC: World's First Free 64-bit Microprocessor," http://www.opensparc.net.
[Powell 09] Powell, M.D., et al., "Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance," Proc. Intl. Symp. on Computer Architecture, pp. 93-104, 2009.
[Romanescu 08] Romanescu, B.F., and D.J. Sorin, "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults," Proc. Intl. Conf. on Parallel Architectures and Compilation Techniques, pp. 43-51, 2008.
[Sanchez 10] Sanchez, D., and C. Kozyrakis, "The ZCache: Decoupling Ways and Associativity," Proc. Intl. Symp. on Microarchitecture, pp. 187-198, 2010.
[Sanda 08] Sanda, P.N., et al., "Fault-Tolerant Design of the IBM Power6 Microprocessor," IEEE Micro, vol. 28, no. 2, pp. 30-38, 2008.
[Schober 01] Schober, V., S. Paul, and O. Picot, "Memory Built-In Self-Repair Using Redundant Words," Proc. Intl. Test Conf., pp. 995-1001, 2001.
[Scholzel 11] Scholzel, M., "Fine-Grained Software-Based Self-Repair of VLIW Processors," Proc. Intl. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, pp. 41-49, 2011.
[Schuchman 05] Schuchman, E., and T.N. Vijaykumar, "Rescue: A Microarchitecture for Testability and Defect Tolerance," Proc. Intl. Symp. on Computer Architecture, pp. 160-171, 2005.
[Shalf 09] Shalf, J., et al., "The Manycore Revolution: Will the HPC Community Lead or Follow?" SciDAC Review, pp. 40-49, 2009.
[Shirvani 99] Shirvani, P.P., and E.J. McCluskey, "PADded Cache: A New Fault-Tolerance Technique for Cache Memories," Proc. VLSI Test Symp., pp. 440-445, 1999.
[Shivakumar 03] Shivakumar, P., et al., "Exploiting Microarchitectural Redundancy for Defect Tolerance," Proc. Intl. Conf. on Computer Design, pp. 481-488, 2003.
[Spracklen 05] Spracklen, L., et al., "Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications," Proc. Intl. Symp. on High-Performance Computer Architecture, pp. 225-236, 2005.
[Tiwari 08] Tiwari, A., and J. Torrellas, "Facelift: Hiding and Slowing Down Aging in Multicores," Proc. Intl. Symp. on Microarchitecture, pp. 129-140, 2008.
[Turgeon 91] Turgeon, P.R., A.R. Steel, and M.R. Charlebois, "Two Approaches to Array Fault Tolerance in the IBM Enterprise System/9000 Type 9121 Processor," IBM Journal of Research and Development, vol. 35, no. 3, pp. 382-389, 1991.
[Van Horn 05] Van Horn, J., "Towards Achieving Relentless Reliability Gains in a Server Marketplace of Teraflops, Laptops, Kilowatts, & Cost, Cost, Cost," Proc. Intl. Test Conf., pp. 1-8, 2005.
[Zorian 03] Zorian, Y., and S. Shoukourian, "Embedded-Memory Test and Repair: Infrastructure IP for SOC Yield," IEEE Design and Test of Computers, vol. 20, no. 3, pp. 58-66, 2003.

Appendix A. Online Self-Test and Diagnostics (OLST) for Detection and Online Diagnosis of Permanent Faults
The OLST approach we use is similar to CASP (Concurrent Autonomous chip self-test using Stored test Patterns) [Li 08, 10]. CASP achieves high online test coverage at low cost by: 1. utilizing on-chip test compression and off-chip storage (e.g., FLASH) to store thorough compressed test patterns; and 2. providing system-level support to fetch test patterns from off-chip storage and to apply them to different components through scan chains. CASP may be applicable for manufacturing test as well, and the details can be found in [Li 13].
Figure A.1 shows the hardware support for CASP for OpenSPARC T2. Independent scan chains are formed and a response comparator is added for each sub-component. A response comparator compares actual scan test responses with the golden responses to output a pass/fail signal. The pass/fail signals provide diagnosis capability at the sub-component level, since a sub-component is considered as fault-free only if the pass/fail signal indicates a "pass" for all test patterns. Since we only need to localize faults to a hardware structure that matches self-repair granularity, partitioning of sub-components is based on the specific self-repair technique. For example, for ERRS, which is performed at the component level for L2Cs, MCUs, and CCX, faults only need to be localized at the component level. For SHE, the sub-components of NCU, SIU, DMU, and NIU are formed based on the sweet-spot configuration as discussed in Sec. 3.2 to match self-repair granularity.

[Figure A.1 depicts, for each component i: local test logic (test data buffer, decompressor, compactor, and per-sub-component response comparators producing pass_j/fail_j signals for each sub-component j, where sub-components match self-repair granularity), a CASP controller handling test scheduling, pattern fetch, pre-processing, test application, and post-processing, off-chip FLASH holding compressed test data, and the OpenSPARC T2 floorplan (cores, L2Cs, L2 memories, MCUs, NCU, SIU, CCX, DMU, NIU).]
Figure A.1. Online self-test and diagnostics support in OpenSPARC T2.

Appendix B. Additional Details on Self-Repair Coverage
We derive the expressions for PRU (i.e., the probability that a unit of repair (RU) functions correctly) for the self-repair techniques presented in this paper. Figure B.1 depicts how RUs are formed for the various self-repair techniques. The probability that a RU functions correctly, PRU-ERRS, PRU-LS, PRU-SHE, or PRU-FF, is given in Fig. B.1, where PERRS, PLS, and PSHE denote the probability that the corresponding (sub-)component is fault-free. PERRS, PLS, and PSHE are obtained using Eq. 1, and m in Eq. B.3 is the number of original (sub-)components. Note that the component-type sparing case can be modeled and calculated using case (c) (SHE RU) and Eq. B.3 by simply considering each sub-component in the SHE case as a component. For shared-FF sparing, we optimistically assume that all logic (both combinational and sequential) can be perfectly repaired, since PRU-FF is dependent on how logic is partitioned into blocks for which shared-FF sparing is applied. Hence, the only single points of failure considered are for the steering logic for flip-flops, which are accounted for in the non-repairable set (Sec. 4.4.2).

(a) ERRS RU: a component and its helper component; the RU functions correctly unless both are faulty:
    PRU-ERRS = 1 - (1 - PERRS)^2.   (Eq. B.1)
(b) Logic sparing RU: a component and its spare copy; the RU functions correctly unless both are faulty:
    PRU-LS = 1 - (1 - PLS)^2.   (Eq. B.2)
(c) SHE RU: m identical sub-components (sub-component 1, ..., sub-component m) plus one spare sub-component, each fault-free with probability PSHE; the RU functions correctly if at most one of the m+1 sub-components is faulty:
    PRU-SHE = PSHE^(m+1) + (m+1) * PSHE^m * (1 - PSHE).   (Eq. B.3)
(d) Shared-FF RU: all logic is optimistically assumed repairable, so PRU-FF = 1; the steering logic for flip-flops is treated as part of the non-repairable set.
Figure B.1. Deriving the probability that a RU (unit of repair) functions correctly.