Process Variation in Near-Threshold Wide SIMD Architectures

Process Variation in Near-Threshold Wide SIMD Architectures Sangwon Seo∗1, Ronald G. Dreslinski1, Mark Woh1, Yongjun Park1, Chaitali Charkrabari2, Scott Mahlke1, David Blaauw1, Trevor Mudge1 1University of Michigan, Ann Arbor, MI 48109 2Arizona State University, Tempe, AZ 85287 ABSTRACT exacerbates the problem, providing many challenges for process Near-threshold operation has emerged as a competitive approach engineers and circuit designers [2]. These variation-induced tim- for energy-efficient architecture design. In particular, a combina- ing errors are much more critical in wide SIMD architectures for two reasons. First, the probability that all SIMD datapaths are tion of near-threshold circuit techniques and parallel SIMD compu- error-free decreases when variations are severe, because the num- tations achieves excellent energy efficiency for easy-to-parallelize ber of critical paths are multiplied by the SIMD width. Recent work applications. However, near-threshold operations suffer from delay also shows that there is a significant performance drop in SIMD ar- variations due to increased process variability. This is exacerbated chitectures as single-stage-error probabilities increase [3]. Second, in wide SIMD architectures where the number of critical paths are commonly used error-tolerating methods such as pipeline stalling multiplied by the SIMD width. This paper provides a systematic or re-execution result in greater performance and power penalties in-depth study of delay variations in near-threshold operations and due to problems in one lane impacting all other lanes. To tolerate shows that simple techniques such as structural duplication and variation-induced timing errors in near-threshold operations, com- supply voltage/frequency margining are sufficient to mitigate the plex architectural enhancements have been considered. For exam- timing variation problems in wide SIMD architectures at the cost ple, Synctium [3] proposed decoupled parallel SIMD pipelines and of marginal area and power overhead. pipeline weaving using decoupling queues and micro-barriers. In this paper, we investigate the effect of process variations in Categories and Subject Descriptors wide SIMD architectures operating at near-threshold voltages. De- C.1.2 [Processor Architectures]; C.1.4 [Parallel Architectures]; lay variations in the near-threshold regime are first analyzed for C.4 [Performance of Systems] present and future technology nodes (90nm, 45nm, 32nm, and 22nm). Our study shows that delay variations in near-threshold operations General Terms have been over-estimated in the past. In 90nm technology, although delay variation (3σ/μ) at 0.5V in a single gate increases by ∼2.5x Design, Experimentation, Reliability compared to that at 1V, the variation decreases in a chain of gates. Keywords For instance, the variation is only ∼1.5x for a chain of 50 gates. This is an example of mean-value theorem where the uncorrelated Near-threshold Computing, Wide SIMD, Process Variation variations are averaged out over the chain. Working against this effect is the fact that the datapath is a wide SIMD machine, thus in- 1. INTRODUCTION creasing the number of these critical paths. Nevertheless, the corre- An attractive approach for energy-efficient system design is the sponding performance degradation for such wide systems in 90nm combination of near-threshold operation [1] for reduced energy con- technology is less than 5%. Therefore, simple techniques are suffi- sumption and wide SIMD (Single Instruction Multiple Data) archi- cient to tolerate and mitigate the timing variation problems. Three tectures to improve parallel performance. This approach is partic- techniques are explored in this work: 1) structural duplication to ularly suited for hand-held devices running signal processing algo- replace underperforming modules, 2) voltage margining to reduce rithms for high throughput applications. However, near-threshold both average delay and its variation, and 3) frequency margining to designs are impacted greater by process variations than traditional increase delay margins. The analysis shows a combination of these designs, because the on-current (Ion) in the near-threshold voltage simple techniques can effectively reduce variation-induced timing region is highly sensitive to variations in threshold voltage, Vth. errors in wide SIMD architectures such as Diet SODA [4] with Increased process variations in advanced technology nodes further marginal area and power overhead. The rest of the paper is organized as follows. Section 2 intro- ∗ Currently at Qualcomm Incorporated, San Diego, CA duces near-threshold operation. Section 3 discusses variation issues at circuit- and architecture-levels. Section 4 explores techniques to tolerate and mitigate the variation-induced timing errors. Section 5 Permission to make digital or hard copies of part or all of this work for discusses the related work and Section 6 concludes the paper. personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components 2. NEAR-THRESHOLD OPERATION of this work owned by others than ACM must be honored. Abstracting with There are three regions of operating voltage: super-threshold, credit is permitted. To copy otherwise, to republish, to post on servers or to near-threshold and sub-threshold (See Figure 9 in Appendix A). In redistribute to lists, requires prior specific permission and/or a fee. the super-threshold region (Vdd >Vth), energy is highly sensitive DAC’12, June 03 – 07 2012, San Francisco, CA, USA V Copyright 2012 ACM 978-1-4503-1199-1/12/06 $10.00. to Vdd due to the quadratic scaling of switching energy with dd. 980 Hence, voltage scaling down to the near-threshold region (Vdd ∼ cantly increases as Vdd reduces; for example, 3σ/μ increases from Vth) yields an energy reduction on the order of 10x at the expense 15.58%@1.0V to 35.49%@0.5V. Although the delay variations in of approximately 10x performance degradation. However, the de- near-threshold voltage region cause large performance degradation pendence of energy on Vdd becomes more complex as voltage is on a single gate, the uncorrelated random within-die variations av- scaled below Vth. In the sub-threshold regime (Vdd <Vth), cir- erage out over a long chain of gates as shown in Figure 1(b). The cuit delay increases exponentially with Vdd, causing leakage en- delay variation (3σ/μ) of a chain of 50 FO4 inverters is only 9.43% ergy (the product of leakage current, Vdd, and delay) to increase in @0.5V compared to that of a single inverter (35.49%@0.5V). Thus a near-exponential fashion. This rise in leakage energy eventually the delay variation is not significant for medium to long chains and dominates any reduction in switching energy, creating an energy is expected to not be significant for datapath components. A similar minimum. observation was made in [7] which showed only 8.4%@0.5V de- Although the energy minimum is achieved in the sub-threshold lay variation for a 64-bit Kogge-Stone adder. Therefore, part of the region, the performance improves by 50∼100x when Vdd is scaled delay variation problem can be alleviated by implementing longer from the sub-threshold regime to the near-threshold regime while logic chains [5]. the energy increases by only 2x. Therefore, near-threshold opera- Although delay variations reduce as a chain length (N) increases, Δ3σ/μ tions achieve a good balance between performance and energy. The additional study shows the amount of reduction ( ΔN ) decreases near-threshold region offers an opportunity for applications that re- with N (see Figure 11 in Appendix C). Therefore, implementing quire high processing power with high energy efficiency. Further- the logic with a very long chain of gates will not solve all the timing more, data parallel architectures like SIMD can be used to com- variation problems. In addition, technology scaling exacerbates the pensate for the reduced performance when operating in the near- delay variations [2]; for example, technology scaling from 90nm to threshold regime for DLP (Data Level Parallelism)-intensive appli- 22nm increases delay variation of a chain of 50 FO4 inverters by cations. ∼2.5x when operating at 0.55V. 3. VARIATIONS IN NEAR-THRESHOLD " OPERATION As described in Section 2, near-threshold designs significantly reduce energy consumption. However, Ion is highly sensitive to ! variations in Vth, resulting in delay variations which diminish the advantage of near-threshold operations. RDFs (Random Dopant " Ion Fluctuations) are known to be the dominant factor of varia- % tions in near-threshold operation [5]. In addition, LER (Line Edge Roughness) is a significant factor for advanced technology nodes. ! " To evaluate the effect of cross chip variations in the near-threshold voltage regime, Monte Carlo simulations with Hspice are performed for 90nm/45nm commercially used GP (General Purpose) models σ/μ and 32nm/22nm PTM (Predictive Technology Model [6]) HP (High Figure 2: Delay variations (3 ) (%) of a chain of 50 FO4 V Performance) models. Two dominant variation sources, Vth and inverters vs. supply voltage ( dd) using four technology models LER, are represented by normal distributions and inserted into the (90nm GP, 45nm GP, 32nm PTM HP, and 22nm PTM HP). A 32nm/22nm PTM HP models. thousand samples for each data point are simulated. In this section, we examine how much delay

Process Variation in Near-Threshold Wide SIMD Architectures

Parallel Programming

Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations

Unit: 4 Processes and Threads in Distributed Systems

Gpu Concurrency

Hyper-Threading Technology Architecture and Microarchitecture

Scheduling Many-Task Workloads on Supercomputers: Dealing with Trailing Tasks

Intel's Hyper-Threading

COSC 6385 Computer Architecture - Multi-Processors (IV) Simultaneous Multi-Threading and Multi-Core Processors Edgar Gabriel Spring 2011

IMTEC-91-58 High-Performance Computing: Industry Uses Of

Introduction to Parallel Computing

Computer-System Architecture (Cont.) Symmetrically Constructed Clusters (Cont.) Advantages: 1

Lect. 9: Multithreading