Whitepaper Cache Speculation Side-Channels Date: June 2020 Version 2.5
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Data Storage the CPU-Memory
Data Storage •Disks • Hard disk (HDD) • Solid state drive (SSD) •Random Access Memory • Dynamic RAM (DRAM) • Static RAM (SRAM) •Registers • %rax, %rbx, ... Sean Barker 1 The CPU-Memory Gap 100,000,000.0 10,000,000.0 Disk 1,000,000.0 100,000.0 SSD Disk seek time 10,000.0 SSD access time 1,000.0 DRAM access time Time (ns) Time 100.0 DRAM SRAM access time CPU cycle time 10.0 Effective CPU cycle time 1.0 0.1 CPU 0.0 1985 1990 1995 2000 2003 2005 2010 2015 Year Sean Barker 2 Caching Smaller, faster, more expensive Cache 8 4 9 10 14 3 memory caches a subset of the blocks Data is copied in block-sized 10 4 transfer units Larger, slower, cheaper memory Memory 0 1 2 3 viewed as par@@oned into “blocks” 4 5 6 7 8 9 10 11 12 13 14 15 Sean Barker 3 Cache Hit Request: 14 Cache 8 9 14 3 Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Sean Barker 4 Cache Miss Request: 12 Cache 8 12 9 14 3 12 Request: 12 Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Sean Barker 5 Locality ¢ Temporal locality: ¢ Spa0al locality: Sean Barker 6 Locality Example (1) sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum; Sean Barker 7 Locality Example (2) int sum_array_rows(int a[M][N]) { int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } Sean Barker 8 Locality Example (3) int sum_array_cols(int a[M][N]) { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; } Sean Barker 9 The Memory Hierarchy The Memory Hierarchy Smaller On 1 cycle to access CPU Chip Registers Faster Storage Costlier instrs can L1, L2 per byte directly Cache(s) ~10’s of cycles to access access (SRAM) Main memory ~100 cycles to access (DRAM) Larger Slower Flash SSD / Local network ~100 M cycles to access Cheaper Local secondary storage (disk) per byte slower Remote secondary storage than local (tapes, Web servers / Internet) disk to access Sean Barker 10. -
Computer Science 246 Computer Architecture Spring 2010 Harvard University
Computer Science 246 Computer Architecture Spring 2010 Harvard University Instructor: Prof. David Brooks [email protected] Dynamic Branch Prediction, Speculation, and Multiple Issue Computer Science 246 David Brooks Lecture Outline • Tomasulo’s Algorithm Review (3.1-3.3) • Pointer-Based Renaming (MIPS R10000) • Dynamic Branch Prediction (3.4) • Other Front-end Optimizations (3.5) – Branch Target Buffers/Return Address Stack Computer Science 246 David Brooks Tomasulo Review • Reservation Stations – Distribute RAW hazard detection – Renaming eliminates WAW hazards – Buffering values in Reservation Stations removes WARs – Tag match in CDB requires many associative compares • Common Data Bus – Achilles heal of Tomasulo – Multiple writebacks (multiple CDBs) expensive • Load/Store reordering – Load address compared with store address in store buffer Computer Science 246 David Brooks Tomasulo Organization From Mem FP Op FP Registers Queue Load Buffers Load1 Load2 Load3 Load4 Load5 Store Load6 Buffers Add1 Add2 Mult1 Add3 Mult2 Reservation To Mem Stations FP adders FP multipliers Common Data Bus (CDB) Tomasulo Review 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 LD F0, 0(R1) Iss M1 M2 M3 M4 M5 M6 M7 M8 Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 M3 Wb SUBI R1, R1, 8 Iss Ex Wb BNEZ R1, Loop Iss Ex Wb LD F0, 0(R1) Iss Iss Iss Iss M Wb MUL F4, F0, F2 Iss Iss Iss Iss Iss Ex Ex Ex Ex Wb SD 0(R1), F0 Iss Iss Iss Iss Iss Iss Iss Iss Iss M1 M2 -
OS and Compiler Considerations in the Design of the IA-64 Architecture
OS and Compiler Considerations in the Design of the IA-64 Architecture Rumi Zahir (Intel Corporation) Dale Morris, Jonathan Ross (Hewlett-Packard Company) Drew Hess (Lucasfilm Ltd.) This is an electronic reproduction of “OS and Compiler Considerations in the Design of the IA-64 Architecture” originally published in ASPLOS-IX (the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems) held in Cambridge, MA in November 2000. Copyright © A.C.M. 2000 1-58113-317-0/00/0011...$5.00 Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full cita- tion on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permis- sion and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or [email protected]. ASPLOS 2000 Cambridge, MA Nov. 12-15 , 2000 212 OS and Compiler Considerations in the Design of the IA-64 Architecture Rumi Zahir Jonathan Ross Dale Morris Drew Hess Intel Corporation Hewlett-Packard Company Lucas Digital Ltd. 2200 Mission College Blvd. 19447 Pruneridge Ave. P.O. Box 2459 Santa Clara, CA 95054 Cupertino, CA 95014 San Rafael, CA 94912 [email protected] [email protected] [email protected] [email protected] ABSTRACT system collaborate to deliver higher-performance systems. -
Selective Eager Execution on the Polypath Architecture
Selective Eager Execution on the PolyPath Architecture Artur Klauser, Abhijit Paithankar, Dirk Grunwald [klauser,pvega,grunwald]@cs.colorado.edu University of Colorado, Department of Computer Science Boulder, CO 80309-0430 Abstract average misprediction latency is the sum of the architected latency (pipeline depth) and a variable latency, which depends on data de- Control-flow misprediction penalties are a major impediment pendencies of the branch instruction. The overall cycles lost due to high performance in wide-issue superscalar processors. In this to branch mispredictions are the product of the total number of paper we present Selective Eager Execution (SEE), an execution mispredictions incurred during program execution and the average model to overcome mis-speculation penalties by executing both misprediction latency. Thus, reduction of either component helps paths after diffident branches. We present the micro-architecture decrease performance loss due to control mis-speculation. of the PolyPath processor, which is an extension of an aggressive In this paper we propose Selective Eager Execution (SEE) and superscalar, out-of-order architecture. The PolyPath architecture the PolyPath architecture model, which help to overcome branch uses a novel instruction tagging and register renaming mechanism misprediction latency. SEE recognizes the fact that some branches to execute instructions from multiple paths simultaneously in the are predicted more accurately than others, where it draws from re- same processor pipeline, while retaining maximum resource avail- search in branch confidence estimation [4, 6]. SEE behaves like ability for single-path code sequences. a normal monopath speculative execution architecture for highly Results of our execution-driven, pipeline-level simulations predictable branches; it predicts the most likely successor path of a show that SEE can improve performance by as much as 36% for branch, and evaluates instructions only along this path. -
Igpu: Exception Support and Speculative Execution on Gpus
Appears in the 39th International Symposium on Computer Architecture, 2012 iGPU: Exception Support and Speculative Execution on GPUs Jaikrishnan Menon, Marc de Kruijf, Karthikeyan Sankaralingam Department of Computer Sciences University of Wisconsin-Madison {menon, dekruijf, karu}@cs.wisc.edu Abstract the evolution of traditional CPUs to illuminate why this pro- gression appears natural and imminent. Since the introduction of fully programmable vertex Just as CPU programmers were forced to explicitly man- shader hardware, GPU computing has made tremendous age CPU memories in the days before virtual memory, for advances. Exception support and speculative execution are almost a decade, GPU programmers directly and explicitly the next steps to expand the scope and improve the usabil- managed the GPU memory hierarchy. The recent release ity of GPUs. However, traditional mechanisms to support of NVIDIA’s Fermi architecture and AMD’s Fusion archi- exceptions and speculative execution are highly intrusive to tecture, however, has brought GPUs to an inflection point: GPU hardware design. This paper builds on two related in- both architectures implement a unified address space that sights to provide a unified lightweight mechanism for sup- eliminates the need for explicit memory movement to and porting exceptions and speculation on GPUs. from GPU memory structures. Yet, without demand pag- First, we observe that GPU programs can be broken into ing, something taken for granted in the CPU space, pro- code regions that contain little or no live register state at grammers must still explicitly reason about available mem- their entry point. We then also recognize that it is simple to ory. The drawbacks of exposing physical memory size to generate these regions in such a way that they are idempo- programmers are well known. -
Cache-Aware Roofline Model: Upgrading the Loft
1 Cache-aware Roofline model: Upgrading the loft Aleksandar Ilic, Frederico Pratas, and Leonel Sousa INESC-ID/IST, Technical University of Lisbon, Portugal ilic,fcpp,las @inesc-id.pt f g Abstract—The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%. Index Terms—Multicore computer architectures, Performance modeling, Application optimization F 1 INTRODUCTION in Tab. 1. The horizontal part of the Roofline model cor- Driven by the increasing complexity of modern applica- responds to the compute bound region where the Fp can tions, microprocessors provide a huge diversity of compu- be achieved. The slanted part of the model represents the tational characteristics and capabilities. While this diversity memory bound region where the performance is limited is important to fulfill the existing computational needs, it by BD. The ridge point, where the two lines meet, marks imposes challenges to fully exploit architectures’ potential. the minimum I required to achieve Fp. In general, the A model that provides insights into the system performance original Roofline modeling concept [10] ties the Fp and the capabilities is a valuable tool to assist in the development theoretical bandwidth of a single memory level, with the I and optimization of applications and future architectures. -
ARM-Architecture Simulator Archulator
Embedded ECE Lab Arm Architecture Simulator University of Arizona ARM-Architecture Simulator Archulator Contents Purpose ............................................................................................................................................ 2 Description ...................................................................................................................................... 2 Block Diagram ............................................................................................................................ 2 Component Description .............................................................................................................. 3 Buses ..................................................................................................................................................... 3 µP .......................................................................................................................................................... 3 Caches ................................................................................................................................................... 3 Memory ................................................................................................................................................. 3 CoProcessor .......................................................................................................................................... 3 Using the Archulator ...................................................................................................................... -
Multi-Tier Caching Technology™
Multi-Tier Caching Technology™ Technology Paper Authored by: How Layering an Application’s Cache Improves Performance Modern data storage needs go far beyond just computing. From creative professional environments to desktop systems, Seagate provides solutions for almost any application that requires large volumes of storage. As a leader in NAND, hybrid, SMR and conventional magnetic recording technologies, Seagate® applies different levels of caching and media optimization to benefit performance and capacity. Multi-Tier Caching (MTC) Technology brings the highest performance and areal density to a multitude of applications. Each Seagate product is uniquely tailored to meet the performance requirements of a specific use case with the right memory, NAND, and media type and size. This paper explains how MTC Technology works to optimize hard drive performance. MTC Technology: Key Advantages Capacity requirements can vary greatly from business to business. While the fastest performance can be achieved using Dynamic Random Access Memory (DRAM) cache, the data in the DRAM is not persistent through power cycles and DRAM is very expensive compared to other media. NAND flash data survives through power cycles but it is still very expensive compared to a magnetic storage medium. Magnetic storage media cache offers good performance at a very low cost. On the downside, media cache takes away overall disk drive capacity from PMR or SMR main store media. MTC Technology solves this dilemma by using these diverse media components in combination to offer different levels of performance and capacity at varying price points. By carefully tuning firmware with appropriate cache types and sizes, the end user can experience excellent overall system performance. -
CUDA 11 and A100 - WHAT’S NEW? Markus Hrywniak, 23Rd June 2020 TOPICS for TODAY
CUDA 11 AND A100 - WHAT’S NEW? Markus Hrywniak, 23rd June 2020 TOPICS FOR TODAY Ampere architecture – A100, powering DGX–A100, HGX-A100... and soon, FZ Jülich‘s JUWELS Booster New CUDA 11 Toolkit release Overview of features Talk next week: Third generation Tensor Cores GTC talks go into much more details. See references! 2 HGX-A100 4-GPU HGX-A100 8-GPU • 4 A100 with NVLINK • 8 A100 with NVSwitch 3 HIERARCHY OF SCALES Multi-System Rack Multi-GPU System Multi-SM GPU Multi-Core SM Unlimited Scale 8 GPUs 108 Multiprocessors 2048 threads 4 AMDAHL’S LAW serial section parallel section serial section Amdahl’s Law parallel section Shortest possible runtime is sum of serial section times Time saved serial section Some Parallelism Increased Parallelism Infinite Parallelism Program time = Parallel sections take less time Parallel sections take no time sum(serial times + parallel times) Serial sections take same time Serial sections take same time 5 OVERCOMING AMDAHL: ASYNCHRONY & LATENCY serial section parallel section serial section Split up serial & parallel components parallel section serial section Some Parallelism Task Parallelism Infinite Parallelism Program time = Parallel sections overlap with serial sections Parallel sections take no time sum(serial times + parallel times) Serial sections take same time 6 OVERCOMING AMDAHL: ASYNCHRONY & LATENCY CUDA Concurrency Mechanisms At Every Scope CUDA Kernel Threads, Warps, Blocks, Barriers Application CUDA Streams, CUDA Graphs Node Multi-Process Service, GPU-Direct System NCCL, CUDA-Aware MPI, NVSHMEM 7 OVERCOMING AMDAHL: ASYNCHRONY & LATENCY Execution Overheads Non-productive latencies (waste) Operation Latency Network latencies Memory read/write File I/O .. -
A Survey of Published Attacks on Intel
1 A Survey of Published Attacks on Intel SGX Alexander Nilsson∗y, Pegah Nikbakht Bideh∗, Joakim Brorsson∗zx falexander.nilsson,pegah.nikbakht_bideh,[email protected] ∗Lund University, Department of Electrical and Information Technology, Sweden yAdvenica AB, Sweden zCombitech AB, Sweden xHyker Security AB, Sweden Abstract—Intel Software Guard Extensions (SGX) provides a Unfortunately, a relatively large number of flaws and attacks trusted execution environment (TEE) to run code and operate against SGX have been published by researchers over the last sensitive data. SGX provides runtime hardware protection where few years. both code and data are protected even if other code components are malicious. However, recently many attacks targeting SGX have been identified and introduced that can thwart the hardware A. Contribution defence provided by SGX. In this paper we present a survey of all attacks specifically targeting Intel SGX that are known In this paper, we present the first comprehensive review to the authors, to date. We categorized the attacks based on that includes all known attacks specific to SGX, including their implementation details into 7 different categories. We also controlled channel attacks, cache-attacks, speculative execu- look into the available defence mechanisms against identified tion attacks, branch prediction attacks, rogue data cache loads, attacks and categorize the available types of mitigations for each presented attack. microarchitectural data sampling and software-based fault in- jection attacks. For most of the presented attacks, there are countermeasures and mitigations that have been deployed as microcode patches by Intel or that can be employed by the I. INTRODUCTION application developer herself to make the attack more difficult (or impossible) to exploit. -
A Characterization of Processor Performance in the VAX-1 L/780
A Characterization of Processor Performance in the VAX-1 l/780 Joel S. Emer Douglas W. Clark Digital Equipment Corp. Digital Equipment Corp. 77 Reed Road 295 Foster Street Hudson, MA 01749 Littleton, MA 01460 ABSTRACT effect of many architectural and implementation features. This paper reports the results of a study of VAX- llR80 processor performance using a novel hardware Prior related work includes studies of opcode monitoring technique. A micro-PC histogram frequency and other features of instruction- monitor was buiit for these measurements. It kee s a processing [lo. 11,15,161; some studies report timing count of the number of microcode cycles execute z( at Information as well [l, 4,121. each microcode location. Measurement ex eriments were performed on live timesharing wor i loads as After describing our methods and workloads in well as on synthetic workloads of several types. The Section 2, we will re ort the frequencies of various histogram counts allow the calculation of the processor events in 5 ections 3 and 4. Section 5 frequency of various architectural events, such as the resents the complete, detailed timing results, and frequency of different types of opcodes and operand !!Iection 6 concludes the paper. specifiers, as well as the frequency of some im lementation-s ecific events, such as translation bu h er misses. ?phe measurement technique also yields the amount of processing time spent, in various 2. DEFINITIONS AND METHODS activities, such as ordinary microcode computation, memory management, and processor stalls of 2.1 VAX-l l/780 Structure different kinds. This paper reports in detail the amount of time the “average’ VAX instruction The llf780 processor is composed of two major spends in these activities. -
Instruction Set Innovations for Convey's HC-1 Computer
Instruction Set Innovations for Convey's HC-1 Computer THE WORLD’S FIRST HYBRID-CORE COMPUTER. Hot Chips Conference 2009 [email protected] Introduction to Convey Computer • Company Status – Second round venture based startup company – Product beta systems are at customer sites – Currently staffing at 36 people – Located in Richardson, Texas • Investors – Four Venture Capital Investors • Interwest Partners (Menlo Park) • CenterPoint Ventures (Dallas) • Rho Ventures (New York) • Braemar Energy Ventures (Boston) – Two Industry Investors • Intel Capital • Xilinx Presentation Outline • Overview of HC-1 Computer • Instruction Set Innovations • Application Examples Page 3 Hot Chips Conference 2009 What is a Hybrid-Core Computer ? A hybrid-core computer improves application performance by combining an x86 processor with hardware that implements application-specific instructions. ANSI Standard Applications C/C++/Fortran Convey Compilers x86 Coprocessor Instructions Instructions Intel® Processor Hybrid-Core Coprocessor Oil & Gas& Oil Financial Sciences Custom CAE Application-Specific Personalities Cache-coherent shared virtual memory Page 4 Hot Chips Conference 2009 What Is a Personality? • A personality is a reloadable set of instructions that augment x86 application the x86 instruction set Processor specific – Applicable to a class of applications instructions or specific to a particular code • Each personality is a set of files that includes: – The bits loaded into the Coprocessor – Information used by the Convey compiler • List of