Low-Power and Error-Resilient VLSI Circuits and Systems

Total Pages: 16

File Type: pdf, Size: 1020 KB

Low-Power and Error-Resilient VLSI Circuits and Systems

by Chia-Hsiang Chen

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering) in the University of Michigan, 2014.

Doctoral Committee:
  Assistant Professor Zhengya Zhang, Chair
  Associate Professor Achilleas Anastasopoulos
  Professor David Blaauw
  Assistant Professor Junjie Zhu

© Chia-Hsiang Chen 2014. All Rights Reserved.

DEDICATION

To my family and "TEA TIME" friends, for their love and support.

ACKNOWLEDGEMENTS

Many people have contributed to the completion of this dissertation, to whom I will always be grateful. My advisor, Professor Zhengya Zhang, provided me with exceptional scientific resources and guidance, and allowed me to grow intellectually and to explore exciting research over the years. I would also like to thank Professors David Blaauw, Achilleas Anastasopoulos, and Junjie Zhu for participating in my dissertation committee and providing valuable feedback. I appreciate my research sponsors, including Intel, DARPA under cooperative agreement HR0011-13-2-0006, NASA, and the University of Michigan.

I would like to thank all individuals who have inspired and mentored me throughout the doctoral training. Professor Michael Flynn provided insightful discussions over different ideas. Dr. Keith Bowman, Dr. Farhana Sheikh, Dr. Charles Augustine, and Jim Tschanz from Intel Circuit Research Labs inspired me and broadened my knowledge. Another group of people who assisted me through this journey were those in my research group, including Jungkuk Kim, Youn Sung Park, Phil Knag, Yaoyu Tao, Siming Song, Shuanghong, Thomas Chen, and Wei Tang. They shared their expertise and effort to accelerate my understanding of several research topics. In addition to those in my research group, I would also like to thank Dr. Wei-Hsiang Ma, Dr. Zhiyoong Foo, Tai-Chuan Ou, and Ye-Sheng Kuo for bringing enjoyable memories as circuit designers at the University of Michigan.

Finally, I want to express my deepest appreciation to my parents and family in Taiwan for their unconditional love and support.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT

I. Introduction
  1.1 Error characterization and impact analysis
    1.1.1 Data-retention failures and timing faults
    1.1.2 Soft errors
  1.2 Design and evaluation methodology for error detection techniques
    1.2.1 In situ error detection circuits
    1.2.2 FPGA-based transient simulator
  1.3 Confidence-driven computing architecture
  1.4 Low-power and error-resilient system with algorithm and architecture co-design
  1.5 Dissertation outline

II. Timing Faults and Data-Retention Failures
  2.1 Statistical circuit simulation methodology
  2.2 Data-retention failure and Vmin analysis
  2.3 Hold-time violation and Vmin analysis
  2.4 Vmin of sequential logic circuits
  2.5 Summary

III. Heavy-Ion Induced Soft Error Characterization
  3.1 Test chip designs and test setup
    3.1.1 Test chip 1
    3.1.2 Test chip 2
    3.1.3 Ion beam testing
  3.2 Circuit design considerations
    3.2.1 Radiation hardening by redundancy
    3.2.2 Increasing critical charge by upsizing
    3.2.3 Depth and sizing of combinational circuits
    3.2.4 Data dependence
    3.2.5 Latchup and total ionization dose
  3.3 Voltage and frequency dependence
    3.3.1 Supply voltage scaling
    3.3.2 Clock frequency effects
  3.4 Summary

IV. Error-Detection Circuits and Simulation Methodology
  4.1 Transient error detection circuits
  4.2 Flexible cross-edge checking
    4.2.1 Cross-edge circuitry
    4.2.2 Pre-edge circuitry
    4.2.3 Comparisons
  4.3 FPGA-based transient error simulator
  4.4 Transient error simulator
    4.4.1 Delay model and error model
    4.4.2 Transient simulation
  4.5 Evaluation of error-resilient designs
    4.5.1 Pre-edge technique
    4.5.2 Post-edge technique
  4.6 Summary

V. Confidence-Driven Architecture for Error-Resilient Computing
  5.1 Related work
  5.2 Confidence-driven computing
    5.2.1 Lockstep synchronization
    5.2.2 Speculative synchronization
    5.2.3 Spatial redundancy
  5.3 Reliability and performance evaluation using sample-based emulation
    5.3.1 Reliability evaluation
    5.3.2 Performance evaluation
    5.3.3 Implementation results and case study
  5.4 Early checking
    5.4.1 Early checking
    5.4.2 Functional analysis and comparison
  5.5 Reliability and performance evaluation using event-based simulation
    5.5.1 Error simulation
    5.5.2 Experimental evaluation
    5.5.3 Implementation results and case study
  5.6 Summary

VI. Low-Power Error-Resilient Systems for Wireless Communication
  6.1 Background
    6.1.1 MIMO system model
    6.1.2 Non-iterative MIMO detection algorithms
    6.1.3 Iterative MIMO detection algorithms
  6.2 Iterative detection-decoding design overview
  6.3 SISO MMSE algorithm and optimization
    6.3.1 Reduced-complexity MMSE-PIC algorithm
    6.3.2 Algorithmic optimizations for efficient MMSE detector implementation
  6.4 Architecture and circuit optimization
  6.5 Nonbinary interface and optimization
    6.5.1 Detector to decoder: LLR computation with L1-norm
    6.5.2 Decoder to detector: Efficient soft-symbol and variance computation
  6.6 Measurements
  6.7 Conclusion

VII. Conclusion and Outlook
  7.1 Conclusion
  7.2 Outlook

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Master-slave flip-flop (MSFF) schematic.
2.2 Probability density distributions from MC simulations with 10^4 samples for (a) normalized hold time at Vnorm and (b) normalized data-retention Vmin with Rg. The MC (a) hold time and (b) data-retention Vmin corresponding to a 3σ WID-variation probability are compared to MPP results.
2.3 Data-retention analysis for the slave latch of the MSFF while holding a logic 1. The PMOS process variation (circled) and gate-dielectric soft breakdown (Rg) on node S_I limit the data-retention Vmin.
2.4 (a) Butterfly curves for the static DC analysis of data retention. (b) Description of the retention windows for the dynamic transient analysis of data retention.
2.5 Normalized data-retention Vmin with the individual and combined contributions of WID variation and Rg for the conventional static DC analysis and the dynamic transient analysis that captures the cycle-time effect.
2.6 Hold time is defined as the CK-to-D delay where a 50% Vcc glitch occurs on node M_O.
2.7 Dynamic transient simulation description for the worst-case data-dependent hold-time analysis.
2.8 Hold time versus normalized Vcc with different placement of Rg while not including WID variations.
2.9 Normalized hold time with and without Rg for a 5σ WID-variation target and an FO4 inverter chain delay versus the normalized Vcc.
2.10 Hold time as a percentage of cycle time versus normalized Vcc. Hold-time Vmin is defined as the Vcc at which the hold time exceeds 10% of the cycle time.
2.11 Breakdown of the 5σ WID variation across all the transistors in the MSFF from the MPP simulations.
2.12 Vmin for data retention and hold time versus different combinations of Rg and 5σ WID variation.
2.13 Hold-time Vmin for the original MSFF in Fig. 2.1 and for the MSFF with a 2× larger first clock inverter.
3.1 Shift register chains implemented on test chip 1. Each chain consists of 500 flip-flops. Chains 6 to 11 each have a varying number and size of inverters.
3.2 Three types of D flip-flops: (a) DICE flip-flop [1], (b) TMR flip-flop [2], and (c) unprotected standard D flip-flop.
3.3 (a) Test chip 1 layout, and (b) microphotograph.
3.4 Radiation test setup.
3.5 Automated testing of chip 2.
3.6 Cross section per bit with ion energy for standard, DICE, and TMR flip-flops.
3.7 Cross section per bit after adding combinational circuits.
3.8 Latch design results in unequal upsets.
3.9 Cross section per bit of D flip-flop, DICE, and TMR flip-flops at supply voltages of 1.0 V and 0.7 V.
3.10 Cross section per bit of standard and radiation-hardened correlator cores at supply voltages of 1.0 V and 0.7 V (clock frequency of 100 MHz).
3.11 Cross section per bit of standard and radiation-hardened correlator cores at 100 MHz and 500 MHz (1.0 V supply voltage).
4.1 (a) Timing margin allocated for static and dynamic variations, and (b) pre-edge and post-edge error checking windows.
4.2 (a) Checking window positions depending on path criticality, (b) fixed post-edge checking window causing race conditions, and (c) fixed pre-edge checking window leading to performance degradation.
4.3 (a) Cross-edge checking design based on a conventional transmission-gate flip-flop [3], and (b) error detector design.
4.4 Waveforms illustrating the operation of the cross-edge circuitry.
4.5 (a) Pre-edge checking design, and (b) transition detector design.
4.6 Waveforms illustrating the operation of the pre-edge circuitry.
Recommended publications
  • Pipelining and Vector Processing
Chapter 8: Pipelining and Vector Processing.

8–1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling.

8–2 Pipeline stalls can be caused by three types of hazards: resource, data, and control hazards. Resource hazards result when two or more instructions in the pipeline want to use the same resource. Such resource conflicts can result in serialized execution, reducing the scope for overlapped execution. Data hazards are caused by data dependencies among the instructions in the pipeline. As a simple example, suppose that the result produced by instruction I1 is needed as an input to instruction I2. We have to stall the pipeline until I1 has written the result so that I2 reads the correct input. If the pipeline is not designed properly, data hazards can produce wrong results by using incorrect operands. Therefore, we have to worry about correctness first. Control hazards are caused by control dependencies. As an example, consider the flow control altered by a branch instruction. If the branch is not taken, we can proceed with the instructions in the pipeline. But if the branch is taken, we have to throw away all the instructions that are in the pipeline and fill the pipeline with instructions at the branch target.

8–3 Prefetching is a technique used to handle resource conflicts. Pipelining typically uses a just-in-time mechanism so that only a simple buffer is needed between the stages. We can minimize the performance impact if we relax this constraint by allowing a queue instead of a single buffer.
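Not part of the excerpt above — a minimal Python sketch, under the simplifying assumption of a toy three-address instruction format, showing how the RAW data hazard described in 8–2 (I2 reading a result that I1 has not yet written) can be detected by comparing source registers against the destinations of recent instructions still in flight.

```python
# Minimal sketch (not from the chapter): detect RAW data hazards in a toy
# in-order pipeline by comparing register operands of nearby instructions.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Instr:
    op: str
    dst: Optional[str]          # destination register, e.g. "R2" (None if nothing is written)
    srcs: Tuple[str, ...]       # source registers read by the instruction

def raw_hazard(producer: Instr, consumer: Instr) -> bool:
    """True if 'consumer' reads a register that 'producer' has not yet written back."""
    return producer.dst is not None and producer.dst in consumer.srcs

def hazards_in(program: List[Instr], window: int = 2) -> List[Tuple[int, int]]:
    """(producer, consumer) index pairs close enough to force a stall in a
    short pipeline without forwarding (the producer is still in flight)."""
    found = []
    for i, prod in enumerate(program):
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if raw_hazard(prod, program[j]):
                found.append((i, j))
    return found

program = [
    Instr("add", "R2", ("R3",)),       # R2 <- R3 + immediate
    Instr("sub", "R9", ("R2",)),       # reads R2 right after it is produced (RAW)
    Instr("or",  "R4", ("R5", "R6")),  # independent of the instructions above
]
print(hazards_in(program))  # [(0, 1)]
```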
  • Diploma Thesis
Faculty of Computer Science, Chair for Real Time Systems. Diploma Thesis: Timing Analysis in Software Development. Author: Martin Däumler. Supervisors: Jun.-Prof. Dr.-Ing. Robert Baumgartl, Dr.-Ing. Andreas Zagler. Date of Submission: March 31, 2008. Martin Däumler, "Timing Analysis in Software Development," Diploma Thesis, Chemnitz University of Technology, 2008.

Abstract: Rapid development processes and higher customer requirements lead to increasing integration of software solutions in the automotive industry's products. Today, several electronic control units communicate by bus systems like CAN and provide computation of complex algorithms. This increasingly requires a controlled timing behavior. The following diploma thesis investigates how the timing analysis tool SymTA/S can be used in the software development process of the ZF Friedrichshafen AG. Within the scope of several scenarios, the benefits of using it, the difficulties in using it and the questions that cannot be answered by the timing analysis tool are examined.

Contents: List of Figures; List of Tables; 1 Introduction; 2 Execution Time Analysis (2.1 Preface; 2.2 Dynamic WCET Analysis: 2.2.1 Methods, 2.2.2 Problems; 2.3 Static WCET Analysis: 2.3.1 Methods, 2.3.2 Problems; 2.4 Hybrid WCET Analysis; 2.5 Survey of Tools: State of the Art: 2.5.1 aiT, 2.5.2 Bound-T, 2.5.3 Chronos, 2.5.4 MTime, 2.5.5 Tessy, 2.5.6 Further Tools; 2.6 Examination of Methods: 2.6.1 Software Description
  • How Data Hazards Can Be Removed Effectively
International Journal of Scientific & Engineering Research, Volume 7, Issue 9, September 2016, ISSN 2229-5518. How Data Hazards Can Be Removed Effectively — Muhammad Zeeshan, Saadia Anayat, Rabia and Nabila Rehman.

Abstract — For fast processing of instructions in computer architecture, the most frequently used technique is pipelining, an important implementation technique used in computer hardware for multi-processing of instructions. Although multiple instructions can be executed at the same time with the help of pipelining, multi-processing can sometimes create a critical situation that alters normal CPU execution in unexpected ways; it may cause processing delays and produce incorrect computational results. This situation is known as a hazard. Pipelined processing increases the processing speed of the CPU, but the hazards that occur due to multi-processing may sometimes decrease it. Hazards need to be handled properly from the beginning, otherwise they can seriously damage pipelined processing or degrade overall computational performance. A data hazard is one of the three types of pipeline hazards. Ignoring a data hazard may result in a race condition, so it is essential to resolve data hazards properly. In this paper, we present some ideas for dealing with data hazards: why data hazards are harmful for processing, what causes them, why they occur, and how to remove them effectively. While pipelining is very useful, there are several complications and serious issues that may occur related to pipelining, i.e.
  • Pipelining: Basic Concepts and Approaches
International Journal of Scientific & Engineering Research, Volume 7, Issue 4, April 2016, ISSN 2229-5518. Pipelining: Basic Concepts and Approaches — Richa Baijal (Student, M.Tech, Computer Science and Engineering, Career Point University, Alaniya, Jhalawar Road, Kota-325003, Rajasthan).

Abstract — This paper is concerned with the pipelining principles while designing a processor. The basics of the instruction pipeline are discussed, and an approach to minimize a pipeline stall is explained with the help of an example. The main idea is to understand the working of a pipeline in a processor. Various hazards that cause pipeline degradation are explained and solutions to minimize them are discussed.

Index Terms — Data dependency, Hazards in pipeline, Instruction dependency, parallelism, Pipelining, Processor, Stall.

1 INTRODUCTION — To understand what pipelining is, let us consider the assembly-line manufacturing of a car. If you have ever gone to a machine workshop, you might have seen the different assemblies developed for developing its chassis, adding a part to [...] does the paint. Still, 2 rooms are idle. These rooms that I want to paint constitute my hardware. The painter and his skills are the objects, and the way I am using them refers to the stages. Now, it is quite possible I limit my resources, i.e., I just have two buckets of paint at a time; therefore, I have to wait until these two stages give me an output. Although these are independent tasks, what I am limiting is the resources. I hope, having this concept in mind, now the reader
  • PowerPC 601 RISC Microprocessor User's Manual
MPR601UM-01 / MPC601UM/AD — PowerPC™ 601 RISC Microprocessor User's Manual

CONTENTS
About This Book: Audience; Organization; Additional Reading; Conventions; Acronyms and Abbreviations; Terminology Conventions
Chapter 1 — Overview
  1.1 PowerPC 601 Microprocessor Overview
    1.1.1 601 Features
    1.1.2 Block Diagram
    1.1.3 Instruction Unit
      1.1.3.1 Instruction Queue
    1.1.4 Independent Execution Units
  • UNIT 8B: A Full Adder
UNIT 8B — Computer Organization: Levels of Abstraction (15110 Principles of Computing, Carnegie Mellon University – CORTINA)

A Full Adder. Truth table (inputs A, B, Cin; outputs Cout, S):
A B Cin | Cout S
0 0 0   |  0   0
0 0 1   |  0   1
0 1 0   |  0   1
0 1 1   |  1   0
1 0 0   |  0   1
1 0 1   |  1   0
1 1 0   |  1   0
1 1 1   |  1   1

S = A ⊕ B ⊕ Cin
Cout = ((A ⊕ B) ∧ Cin) ∨ (A ∧ B)

Full Adder (FA): a 1-bit full adder block with inputs A, B, Cin and outputs S, Cout. (Another full-adder figure: http://students.cs.tamu.edu/wanglei/csce350/handout/lab6.html)

8-bit Full Adder: eight 1-bit full adders chained so that each stage's carry out feeds the next stage's carry in; A and B are 8 bits wide, producing S (8 bits) and Cout.

Multiplexer (MUX): a multiplexer chooses between a set of inputs. With select lines A, B and data inputs D1–D4: A B = 00 → F = D1, 01 → D2, 10 → D3, 11 → D4. (Figure: http://www.cise.ufl.edu/~mssz/CompOrg/CDAintro.html)

Arithmetic Logic Unit (ALU): with operation select OP1 OP0 and carry in: 00 → A ∧ B, 01 → A ∨ B, 10 → A, 11 → A + B. (Figure: http://cs-alb-pc3.massey.ac.nz/notes/59304/l4.html)

Flip Flop: a flip flop is a sequential circuit that is able to maintain (save) a state.
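The Boolean equations above map directly to code; the following Python sketch (not part of the course notes) implements the 1-bit full adder from S = A ⊕ B ⊕ Cin and Cout = ((A ⊕ B) ∧ Cin) ∨ (A ∧ B), and chains eight of them into the ripple-carry adder shown on the slides.

```python
# Sketch: 1-bit full adder from the listed equations, chained into an
# 8-bit ripple-carry adder (the carry out of each stage feeds the next stage).

def full_adder(a: int, b: int, cin: int):
    s = a ^ b ^ cin                       # S = A xor B xor Cin
    cout = ((a ^ b) & cin) | (a & b)      # Cout = ((A xor B) and Cin) or (A and B)
    return s, cout

def ripple_carry_add(a_bits, b_bits, cin: int = 0):
    """a_bits/b_bits are LSB-first lists of 0/1 values; returns (sum bits, carry out)."""
    s_bits, carry = [], cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        s_bits.append(s)
    return s_bits, carry

def to_bits(x: int, n: int = 8):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, cout = ripple_carry_add(to_bits(100), to_bits(55))
print(from_bits(s), cout)   # 155 0
```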
  • Improving UNIX Kernel Performance Using Profile Based Optimization
Improving UNIX Kernel Performance using Profile Based Optimization — Steven E. Speer (Hewlett-Packard), Rajiv Kumar (Hewlett-Packard) and Craig Partridge (Bolt Beranek and Newman/Stanford University).

Abstract — Several studies have shown that operating system performance has lagged behind improvements in application performance. In this paper we show how operating systems can be improved to make better use of RISC architectures, particularly in some of the networking code, using a compiling technique known as Profile Based Optimization (PBO). PBO uses profiles from the execution of a program to determine how to best organize the binary code to reduce the number of dynamically taken branches and reduce instruction cache misses. In the case of an operating system, PBO can use profiles produced by instrumented kernels to optimize a kernel image to reflect patterns of use on a particular system. Tests applying PBO to an HP-UX kernel running on an HP9000/720 show that certain parts of the system code (most notably the networking code) achieve substantial performance improvements of up to 35% on micro benchmarks. Overall system performance typically improves by about 5%.

1. Introduction — Achieving good operating system performance on a given processor remains a challenge. Operating systems have not experienced the same improvement in transitioning from CISC to RISC processors that has been experienced by applications. It is becoming apparent that the differences between RISC and CISC processors have greater significance for operating systems than for applications. (This discovery should, in retrospect, probably not be very surprising. An operating system is far more closely linked to the hardware it runs on than is the average application.)
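As a rough illustration of the mechanism PBO relies on (not HP's actual tool or the paper's implementation), the sketch below performs Pettis–Hansen-style basic-block chaining: profiled taken-branch counts drive a code layout in which each block's hottest successor falls through, reducing dynamically taken branches and, indirectly, instruction-cache misses. The block names and counts are hypothetical.

```python
# Illustrative sketch only: greedy (Pettis–Hansen-style) basic-block chaining
# driven by profiled edge counts, the core idea behind profile-based layout.

def layout_blocks(blocks, edges):
    """blocks: list of block names; edges: dict {(src, dst): taken-branch count}."""
    chains = {b: [b] for b in blocks}   # chain head -> ordered blocks in that chain
    head_of = {b: b for b in blocks}    # block -> head of the chain containing it
    tail_of = {b: b for b in blocks}    # chain head -> tail block of that chain

    # Process the hottest edges first so the hottest successors fall through.
    for (src, dst), _count in sorted(edges.items(), key=lambda e: -e[1]):
        h_src, h_dst = head_of[src], head_of[dst]
        # Merge only if src ends its chain, dst starts its own chain, and they differ.
        if tail_of[h_src] == src and h_dst == dst and h_src != h_dst:
            chains[h_src].extend(chains.pop(h_dst))
            for b in chains[h_src]:
                head_of[b] = h_src
            tail_of[h_src] = tail_of.pop(h_dst)
    # Concatenate the remaining chains into a final layout.
    return [b for chain in chains.values() for b in chain]

edges = {("entry", "fast"): 9000, ("entry", "slow"): 100,
         ("fast", "exit"): 9000, ("slow", "exit"): 100}
print(layout_blocks(["entry", "fast", "slow", "exit"], edges))
# ['entry', 'fast', 'exit', 'slow'] -> the hot path falls through, the cold block is out of line
```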
  • Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach
IEICE Trans. Inf. & Syst., Vol. E101-D, No. 4, April 2018. PAPER: Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: A Practical Approach — Carlos Cesar Cortes Torres (Nonmember), Hayate Okuhara (Student Member), Nobuyuki Yamasaki (Member), and Hideharu Amano (Fellow).

SUMMARY — In the past decade, real-time systems (RTSs), which must maintain time constraints to avoid catastrophic consequences, have been widely introduced into various embedded systems and Internet of Things (IoTs). The RTSs are required to be energy efficient as they are used in embedded devices in which battery life is important. In this study, we investigated the RTS energy efficiency by analyzing the ability of body bias (BB) in providing a satisfying tradeoff between performance and energy. We propose a practical and realistic model that includes the BB energy and timing overhead in addition to idle region analysis. This study was conducted using accurate parameters extracted from a real chip using silicon on thin box (SOTB) technology. By using the BB control based on the proposed model, about 34% energy reduction was achieved.

[...] in RTSs. These techniques can improve energy efficiency; however, they often require a large amount of power since they must control the supply voltages of the systems. Body bias (BB) control is another solution that can improve RTS energy efficiency as it can manage the tradeoff between power leakage and performance without affecting the power supply [4], [5]. Its effect is further endorsed when systems are enabled with silicon on thin box (SOTB) technology [6], which is a novel and advanced fully depleted silicon
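The paper's own model is not reproduced in this excerpt; the sketch below is only a generic, hypothetical illustration of the kind of break-even analysis such a model formalizes: reverse body bias applied during an idle region pays off only when the leakage energy saved exceeds the energy and timing overhead of the bias transitions. All function names and numbers are made up.

```python
# Hypothetical illustration (not the model from the paper): decide whether applying
# reverse body bias (RBB) during an idle region of a real-time task saves energy,
# accounting for the energy and time overhead of switching the body voltage.

def rbb_saves_energy(idle_time_s: float,
                     leak_power_nominal_w: float,
                     leak_power_rbb_w: float,
                     transition_energy_j: float,
                     transition_time_s: float) -> bool:
    """True if biasing this idle region reduces total energy.
    The region must be long enough to fit both bias transitions."""
    if idle_time_s <= 2 * transition_time_s:
        return False                     # cannot switch in and out within the idle region
    biased_time = idle_time_s - 2 * transition_time_s
    baseline = leak_power_nominal_w * idle_time_s
    with_rbb = (leak_power_rbb_w * biased_time
                + leak_power_nominal_w * 2 * transition_time_s  # leakage during transitions
                + 2 * transition_energy_j)                      # bias-switching energy
    return with_rbb < baseline

# Example with made-up numbers: 2 ms idle region, 10x leakage reduction under RBB.
print(rbb_saves_energy(idle_time_s=2e-3,
                       leak_power_nominal_w=5e-3,
                       leak_power_rbb_w=5e-4,
                       transition_energy_j=1e-6,
                       transition_time_s=50e-6))   # True for these parameters
```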
  • With Extreme Scale Computing the Rules Have Changed
With Extreme Scale Computing the Rules Have Changed — Jack Dongarra, University of Tennessee / Oak Ridge National Laboratory / University of Manchester, 11/17/15.

• Overview of High Performance Computing
• With Extreme Computing the "rules" for computing have changed

• Create systems that can apply exaflops of computing power to exabytes of data.
• Keep the United States at the forefront of HPC capabilities.
• Improve HPC application developer productivity.
• Make HPC readily available.
• Establish hardware technology for future HPC systems.

[TOP500 performance-development charts: Sum, N=1, and N=500 performance from 1994 onward, spanning roughly 100 Mflop/s to a projected 1 Eflop/s, with reference points such as a laptop at about 70 Gflop/s and an iPhone at about 4 Gflop/s; recent values include Sum of 362–420 PFlop/s, N=1 at 33.9 PFlop/s, and N=500 at 166–206 TFlop/s.]

• Pflops (> 10^15 Flop/s) computing fully established with 81 systems.
  • Theoretical Peak FLOPS per Instruction Set on Modern Intel CPUs
Theoretical Peak FLOPS per instruction set on modern Intel CPUs — Romain Dolbeau, Bull – Center for Excellence in Parallel Programming. Email: [email protected]

Abstract — It used to be that evaluating the theoretical peak performance of a CPU in FLOPS (floating-point operations per second) was merely a matter of multiplying the frequency by the number of floating-point instructions per cycle. Today, however, CPUs have features such as vectorization, fused multiply-add, hyper-threading or "turbo" mode. In this paper, we look into this theoretical peak for recent full-featured Intel CPUs, taking into account not only the simple absolute peak, but also the relevant instruction sets and encoding and the frequency scaling behavior of current Intel CPUs. Index Terms — FLOPS.

I. INTRODUCTION — High performance computing thrives on fast computations and high memory bandwidth. But before any code or even benchmark is run, the very first number to evaluate a system is the theoretical peak: how many floating-point operations the system can theoretically execute in a given time. [...] and potentially multiple threads per core. Vectors of varying sizes. And more sophisticated instructions. Equation 2 describes a more realistic view, which we will explain in detail in the rest of the paper, first in general in Section II and then for the specific cases of Intel CPUs: first a simple one from the Nehalem/Westmere era in Section III and then the full complexity of the Haswell family in Section IV. A complement to this paper, titled "Theoretical Peak FLOPS per instruction set on less conventional hardware" [1], covers other computing devices.

  peak FLOPS = flop/operation × operations/instruction × instructions/cycle × cycles/second   (2)

(The per-instruction and per-cycle factors are set by the micro-architecture. Revision 1.41, 2016/10/04 08:49:16.)
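To make the decomposition concrete, here is a small Python sketch (not from the paper) that multiplies out the factors for a hypothetical AVX2/FMA core; the parameters are illustrative assumptions, not measured values.

```python
# Illustrative peak-FLOPS calculation for a hypothetical core:
# peak = flop/operation x operations/instruction x instructions/cycle x cycles/second.

def peak_gflops(freq_ghz: float,
                simd_width_bits: int,
                element_bits: int,
                fma: bool,
                fp_issue_ports: int,
                cores: int = 1) -> float:
    ops_per_instruction = simd_width_bits // element_bits   # SIMD lanes per instruction
    flop_per_operation = 2 if fma else 1                     # an FMA counts as mul + add
    instructions_per_cycle = fp_issue_ports                  # FP pipes that can issue each cycle
    return cores * freq_ghz * flop_per_operation * ops_per_instruction * instructions_per_cycle

# Hypothetical Haswell-class core: 3.0 GHz, 256-bit AVX2, double precision,
# FMA on 2 ports -> 2 * 4 * 2 * 3.0 = 48 GFLOP/s per core.
print(peak_gflops(freq_ghz=3.0, simd_width_bits=256, element_bits=64,
                  fma=True, fp_issue_ports=2))
```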
  • Lecture Topics: RAW Hazards and Stall Penalty
ECE 486/586 Computer Architecture, Lecture #11, Spring 2015, Portland State University. Reference: Appendix C, Section C.2.

Lecture Topics:
• Pipelining – Hazards and Stalls
• Data Hazards – Forwarding
• Control Hazards

RAW Hazards (clock cycles 1–8):
Add R2, R3, #100:       IF ID EX MEM WB
Subtract R9, R2, #30:      IF ID stall stall EX MEM WB
• Subtract is stalled in the decode stage for two additional cycles to delay reading R2 until the new value of R2 has been written by Add.
• How is this pipeline stall implemented in hardware? Control circuitry recognizes the data dependency when it decodes Subtract (comparing source register ids with destination register ids of prior instructions). During cycles 3 to 5, Add proceeds through the pipe while Subtract is held in the ID stage. In cycle 5, Add writes R2 in the first half cycle and Subtract reads R2 in the second half cycle.

Stall Penalty:
• If the dependent instruction follows right after the producer instruction, we need to stall the pipe for 2 cycles (stall penalty = 2).
• What happens if the producer and dependent instructions are separated by an intermediate instruction?
Add R2, R3, #100:       IF ID EX MEM WB
Or R4, R5, R6:             IF ID EX MEM WB
Subtract R9, R2, #30:         IF ID stall EX MEM WB
• In this case, stall penalty = 1.
• The stall penalty depends upon the distance between the producer and the dependent instruction.

Impact of Data Hazards:
• Frequent stalls caused by data hazards can impact performance significantly.

Operand Forwarding – Alleviating Data Hazards:
• Stalls due to data dependencies can be mitigated by forwarding.
• Consider the two instructions discussed in the previous example
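A minimal sketch (mine, not the lecture's) of the stall-penalty arithmetic above, assuming the classic 5-stage pipeline in which registers are written in the first half of WB and read in the second half of ID: the penalty without forwarding shrinks by one cycle per independent instruction between producer and consumer, and operand forwarding removes it entirely for ALU producers (a load-use dependence still costs one cycle).

```python
# Stall-penalty arithmetic for a classic 5-stage pipeline (write in first half of WB,
# read in second half of ID). 'distance' = instructions from producer to consumer
# (1 = the consumer immediately follows the producer).

def stall_penalty(distance: int, forwarding: bool, producer_is_load: bool = False) -> int:
    if not forwarding:
        return max(0, 3 - distance)   # 2 stalls back-to-back, 1 with one instruction between
    if producer_is_load:
        return max(0, 2 - distance)   # load-use: 1 stall remains even with forwarding
    return 0                          # ALU result forwarded EX -> EX: no stall

# The two examples from the slides, without forwarding:
print(stall_penalty(distance=1, forwarding=False))  # Add ; Subtract        -> 2
print(stall_penalty(distance=2, forwarding=False))  # Add ; Or ; Subtract   -> 1
# The same dependences with operand forwarding:
print(stall_penalty(distance=1, forwarding=True))                           # -> 0
print(stall_penalty(distance=1, forwarding=True, producer_is_load=True))    # -> 1
```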
  • Misleading Performance Reporting in the Supercomputing Field
    Misleading Performance Reporting in the Supercomputing Field — David H. Bailey. RNR Technical Report RNR-92-005, December 1, 1992. Ref: Scientific Programming, vol. 1, no. 2 (Winter 1992), pp. 141–151.

Abstract — In a previous humorous note, I outlined twelve ways in which performance figures for scientific supercomputers can be distorted. In this paper, the problem of potentially misleading performance reporting is discussed in detail. Included are some examples that have appeared in recent published scientific papers. This paper also includes some proposed guidelines for reporting performance, the adoption of which would raise the level of professionalism and reduce the level of confusion in the field of supercomputing. The author is with the Numerical Aerodynamic Simulation (NAS) Systems Division at NASA Ames Research Center, Moffett Field, CA 94035.

1. Introduction — Many readers may have read my previous article "Twelve Ways to Fool the Masses When Giving Performance Reports on Parallel Computers" [5]. The attention that this article received frankly has been surprising [11]. Evidently it has struck a responsive chord among many professionals in the field who share my concerns. The following is a very brief summary of the "Twelve Ways":
1. Quote only 32-bit performance results, not 64-bit results, and compare your 32-bit results with others' 64-bit results.
2. Present inner kernel performance figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs, and compare your assembly-coded results with others' Fortran or C implementations.
4. Scale up the problem size with the number of processors, but don't clearly disclose this fact.