Boosting Performance of Chip Multiprocessors with Excess Cache

StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Dept. of Computer Science, Univ. of Pittsburgh fabraham,cho,[email protected] Abstract isters, this approach disables faulty cores and enables only functional ones. As long as there are enough sound cores, Technology advances continuously shrink on-chip devices. this technique produces many partially operating CMPs, Consequently, the number of cores in a single chip multi- which would otherwise be discarded without core disabling. processor (CMP) is expected to grow in coming years. Un- For instance, IBM's Cell processor reportedly has a yield of fortunately, with smaller device size and greater integration, only 10% to 20% with eight synergistic processor elements chip yield degrades significantly. Guaranteeing that all chip (SPEs). However, by disabling one (faulty) SPE, the yield components function correctly leads to an unrealistically jumps to nearly 40% [26]. AMD sells tri-core chips [1], low yield. Chip vendors have adopted a design strategy to which is a byproduct of a quad-core chip with a faulty core. market partially functioning processor chips to combat this The NVIDIA GeForce 8800 has three product derivatives problem. The two major components in a multicore chip with 128, 112, and 96 cores [19]. The GeForce chips with are compute cores and on-chip memory such as L2 cache. small core counts are believed to be partially disabled chips From the viewpoint of the chip yield, the compute cores have of the same 128-core design; all the designs have the same a much lower yield than the on-chip memory due to their transistor count. Lastly, the Sun UltraSPARC T1 has three logic complexity and well-established memory yield enhanc- different core counts: four, six, and eight cores [25]. ing techniques. Therefore, future CMPs are expected to have Inside a chip, logic and memory have very different yield more available on-chip memories than working cores. This characteristics. For the same physical defect size and pro- paper introduces a novel on-chip memory utilization scheme cess variation effect, memory may be more vulnerable than called StimulusCache, which decouples the L2 caches of logic due to small transistor size. However, various fault faulty compute cores and employs them to assist applications handling schemes have been successfully deployed to sig- on other working cores. Our extensive experimental evalua- nificantly improve the yield of memory, including parity, tion demonstrates that StimulusCache significantly improves ECC, and row/column redundancy [13]. Furthermore, a few the performance of both single-threaded and multithreaded cache blocks may be “disabled” or “remapped” to oppor- workloads. tunistically cover faults and improve yield without affecting chip functionality [4, 12, 15, 16]. In fact, ITRS reports that the primary issue for memory yield is to protect the support 1. Introduction logic, not the memory cells [13]. Continuous device scaling causes more frequent hard faults Traditional core disabling techniques take a core and its to occur in processor chips at manufacturing time [3, 24]. associated memory (e.g., private L2 cache) offline without Two major sources of faults are physical defects and process consideration for whether the core or its memory failed. variations. First, physical defects can cause a short or open, Thus, a failed core causes its associated memory to be un- which makes a circuit unusable. While technology advances available, although the memory may be functional. For improve the defect density of semiconductor materials and example, AMD's Phenom X3 processor disables one core the manufacturing environment, with ever smaller feature along with its 512KB L2 cache [1]. Due to the large asym- sizes, the critical defect size continues to shrink. Accord- metry in the yield of logic and memory, however, such ingly, physical defects remain a serious threat to achieving a coarse-grained disabling scheme will likely waste much profitable chip yield. Second, process variations can cause memory capacity in the future. We hypothesize that the per- mismatches in device coupling, unevenly degraded circuit formance of a CMP will be significantly improved if the speeds, and higher power consumption, increasing the prob- sound cache memories associated with faulty cores are uti- ability of circuit malfunction at nominal conditions. lized by other sound cores. To explore such a design ap- To improve chip yield, processor vendors have re- proach, this paper proposes StimulusCache, a novel archi- cently adopted “core disabling” for chip multiprocessors tecture that utilizes unemployed “excess” L2 caches. These (CMPs) [1, 19, 25, 26]. Using programmable fuses and reg- excess caches come from disabled cores where the cache is functional. This work was supported in part by NSF grants CCF-0811295, CCF- 0811352, CCF-0702236, and CCF-0952273. We answer two main technical questions for the proposed StimulusCache approach. First, what is the system require- In the above, YM is the material intrinsic yield, which we fix ment (including system software and microarchitecture) to to 1 and do not consider in this work. YS is the systematic enable StimulusCache? Second, what are the desirable ex- yield, which is generally assumed to be 90% for logic and cess cache management strategies under various workloads? 95% for memory [13]. α is a cluster parameter and assumed The ideas and results we present in this paper are also ap- to be 2 as in the ITRS report. Although technologies with plicable to chips without excess caches. For example, under smaller feature sizes are more vulnerable to defects, ITRS a low system load, advanced CMPs dynamically put some targets the same D0 for upcoming technologies when ma- cores into a deep sleep mode to save energy [11]. In such tured, due to process technology advances. a scenario, the cache capacity of the sleeping cores could To compute a realistic yield with equation (1) in the re- be borrowed by other active cores. We make the following mainder of this paper, we derive D0 from the published contributions in this paper: yield of the IBM Cell processor chip, which is 20% [26].1 For accurate calculation, we differentiate the logic portion • We A new yield model for processor components. whose geometric structure is irregular from the memory develop a “decoupled” yield model to accurately calcu- cell array that has a regular structure in each functional late the yield of various processor components having block. While the memory cell array may be more vulner- both logic and memory cell arrays. Based on compo- able to defects and process variability, it is well-protected nent yield modeling, we perform an availability study with robust fault masking techniques, such as redundancy for compute cores and low-level cache memory with and ECC [13, 16, 20]. current and future technology parameters to show that We use CACTI version 5.3 [30] to obtain the area of there will likely be more functional caches available the memory cell array in a memory-oriented function block. than cores in future CMPs (Section 2). StimulusCache From CACTI and die photo analysis, we determined that the aims to effectively utilize these excess caches. memory cell array of the PPE and the SPEs account for about • Architectural support for StimulusCache. We de- 8% and 14% of the total chip area, repectively.2 Based on velop the necessary architectural support to enable the above analysis, we determine the total memory cell array StimulusCache multicore architectures (Section 3). We area is 22% of the chip area (175mm2 in 65nm technology). find that the added datapath and control overhead to the We can derive D0 with equation (1) using the total non- 2 cache controllers is small for 8 and 32 core designs. memory chip area. We calculated D0 to be 0.0181/mm . • Strategies to utilize excess caches. We explore and study novel policies to utilize the available excess 2.2. Decoupled yield model caches (Section 4). We find that organizing excess Given multiple functional blocks in a chip and their individ- caches as a non-inclusive shared victim L3 cache is ual yields (Yblock), the chip yield can be computed as [7]: very effective. We also find it beneficial to monitor the cache usage of individual threads and limit certain N threads from using the excess caches if they cannot ef- YDie = Yblocki (2) i=1 fectively use the extra capacity. Y • An evaluation of StimulusCache. We perform a com- It is clear that the yield of a vulnerable functional block prehensive evaluation of our proposed architecture and can be a significant potential threat to the overall yield. excess cache policies to assess the benefit of Stimulus- Therefore, it becomes imperative to evaluate each functional Cache, which is compared with the latest private cache block's yield separately to prioritize and guide design tuning partitioning technique, DSR [21] (Section 5). We ex- activities, e.g., implementing isolation points and employing amine a wide range of workloads using an 8-core and a functional block salvaging techniques. To accurately evalu- 32-core CMP configurations. StimulusCache is shown ate the yield of individual functional blocks, as suggested in to consistently boost the performance of all programs the previous subsection, we propose to define their yield in (by up to 45%) with no performance penalty. terms of the logic yield and the memory cell array yield as 2. Decoupled Yield Model for Cores and Caches follows: 2.1. Baseline yield model and parameters Yblocki = Ylogici × Ymemoryi (3) Chip yield is generally dictated by defect density D0, area 1In Sperling [26] the yield for the Cell processor was vaguely given as 10%– A, and clustering factor α. We use a negative binomial yield 20%. While a lower yield makes an even stronger case for StimulusCache, model from the ITRS report [13], where the yield of the chip we conservatively use the highest yield estimate (20%).

Boosting Performance of Chip Multiprocessors with Excess Cache

Data Storage the CPU-Memory

Cache-Aware Roofline Model: Upgrading the Loft

ARM-Architecture Simulator Archulator

Multi-Tier Caching Technology™

CUDA 11 and A100 - WHAT’S NEW? Markus Hrywniak, 23Rd June 2020 TOPICS for TODAY

A Characterization of Processor Performance in the VAX-1 L/780

Instruction Set Innovations for Convey's HC-1 Computer

Fundamental CUDA Optimization

Solid-State Drive Caching with Differentiated Storage Services

Cache Generally Uses This Form of Ram

Cacheout: Leaking Data on Intel Cpus Via Cache Evictions

CUDA Memory Architecture