I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches


Xida Ren (University of Virginia), Logan Moody (University of Virginia), Mohammadkazem Taram (University of California, San Diego), Matthew Jordan (University of Virginia), Dean M. Tullsen (University of California, San Diego), Ashish Venkat (University of Virginia)

Abstract—Modern Intel, AMD, and ARM processors translate complex instructions into simpler internal micro-ops that are then cached in a dedicated on-chip structure called the micro-op cache. This work presents an in-depth characterization study of the micro-op cache, reverse-engineering many undocumented features, and further describes attacks that exploit the micro-op cache as a timing channel to transmit secret information. In particular, this paper describes three attacks – (1) a same thread cross-domain attack that leaks secrets across the user-kernel boundary, (2) a cross-SMT thread attack that transmits secrets across two SMT threads via the micro-op cache, and (3) transient execution attacks that have the ability to leak an unauthorized secret accessed along a misspeculated path, even before the transient instruction is dispatched to execution, breaking several existing invisible speculation and fencing-based solutions that mitigate Spectre.

I. INTRODUCTION

Modern processors feature complex microarchitectural structures that are carefully tuned to maximize performance. However, these structures are often characterized by observable timing effects (e.g., a cache hit or a miss) that can be exploited to covertly transmit secret information, bypassing sandboxes and traditional privilege boundaries.

Although numerous side-channel attacks [1]–[10] have been proposed in the literature, the recent influx of Spectre and related attacks [11]–[19] has further highlighted the extent of this threat, by showing how even transiently accessed secrets can be transmitted via microarchitectural channels. This work exposes a new timing channel that manifests as a result of an integral performance enhancement in modern Intel/AMD processors – the micro-op cache.

The x86 ISA implements a variety of complex instructions that are internally broken down into RISC-like micro-ops at the front-end of the processor pipeline to facilitate a simpler backend. These micro-ops are then cached in a small dedicated on-chip buffer called the micro-op cache. This feature is both a performance and a power optimization that allows the instruction fetch engine to stream decoded micro-ops directly from the micro-op cache when a translation is available, turning off the rest of the decode pipeline until a micro-op cache miss occurs. In the event of a miss, the decode pipeline is re-activated to perform the internal CISC-to-RISC translation; the resulting translation penalty is a function of the complexity of the x86 instruction, making the micro-op cache a ripe candidate for implementing a timing channel. However, in contrast to the rest of the on-chip caches, the micro-op cache has a very different organization and design; as a result, conventional cache side-channel techniques do not directly apply – we cite several examples. First, software is restricted from directly indexing into the micro-op cache, let alone access its contents, regardless of the privilege level. Second, unlike traditional data and instruction caches, the micro-op cache is designed to operate as a stream buffer, allowing the fetch engine to sequentially stream decoded micro-ops spanning multiple ways within the same cache set (see Section II for more details). Third, the number of micro-ops within each cache line can vary drastically based on various x86 nuances such as prefixes, opcodes, size of immediate values, and fusibility of adjacent micro-ops. Finally, the proprietary nature of the micro-op ISA has resulted in minimal documentation of important design details of the micro-op cache (such as its replacement policies); as a result, their security implications have been largely ignored.

In this paper, we present an in-depth characterization of the micro-op cache that not only allows us to confirm our understanding of the few features documented in Intel/AMD manuals, but further throws light on several undocumented features such as its partitioning and replacement policies, in both single-threaded and multi-threaded settings.

By leveraging this knowledge of the behavior of the micro-op cache, we then propose a principled framework for automatically generating high-bandwidth micro-op cache-based timing channel exploits in three primary settings – (a) across code regions within the same thread, but operating at different privilege levels, (b) across different co-located threads running simultaneously on different SMT contexts (logical cores) within the same physical core, and (c) two transient execution attack variants that leverage the micro-op cache to leak secrets, bypassing several existing hardware and software-based mitigations, including Intel's recommended LFENCE.

The micro-op cache as a side channel has several dangerous implications. First, it bypasses all techniques that mitigate caches as side channels. Second, these attacks are not detected by any existing attack or malware profile. Third, because the micro-op cache sits at the front of the pipeline, well before execution, certain defenses that mitigate Spectre and other transient execution attacks by restricting speculative cache updates still remain vulnerable to micro-op cache attacks.

Most existing invisible speculation and fencing-based solutions focus on hiding the unintended vulnerable side-effects of speculative execution that occur at the back end of the processor pipeline, rather than inhibiting the source of speculation at the front-end. That makes them vulnerable to the attack we describe, which discloses speculatively accessed secrets through a front-end side channel, before a transient instruction has the opportunity to get dispatched for execution. This eludes a whole suite of existing defenses [20]–[32]. Furthermore, due to the relatively small size of the micro-op cache, our attack is significantly faster than existing Spectre variants that rely on priming and probing several cache sets to transmit secret information, and is considerably more stealthy, as it uses the micro-op cache as its sole disclosure primitive, introducing fewer data/instruction cache accesses, let alone misses.
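To make the timing signal concrete, the sketch below (not the authors' code; the region size, the 8-byte NOP filler, and the hit/miss interpretation are illustrative assumptions) JITs a small code region and times two back-to-back calls. The first call must pull its micro-ops through the legacy decode pipeline, while the second can be streamed from the micro-op cache if the region fits, so the difference between the two timings roughly approximates the translation penalty that such attacks use as a disclosure primitive.

/* Minimal sketch: time one code region twice. On the first call its micro-ops
 * come through the legacy decode pipeline; on the second call, if the region
 * fits in the micro-op cache, the front end streams micro-ops from there and
 * the measured time drops. Region size and filler are illustrative only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define N_NOPS 256  /* assumed region size: 256 eight-byte NOPs, one micro-op each */

static const uint8_t NOP8[8] = {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};

static uint64_t time_call(void (*fn)(void)) {
    unsigned aux;
    _mm_lfence();                    /* serialize before reading the TSC */
    uint64_t start = __rdtsc();
    fn();
    uint64_t end = __rdtscp(&aux);   /* rdtscp orders the read after fn() */
    _mm_lfence();
    return end - start;
}

int main(void) {
    size_t len = N_NOPS * 8 + 1;
    uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
    for (int i = 0; i < N_NOPS; i++)
        memcpy(buf + (size_t)i * 8, NOP8, 8);   /* multi-byte NOPs keep the micro-op
                                                   count per 32-byte window small */
    buf[N_NOPS * 8] = 0xC3;                     /* RET */
    void (*region)(void) = (void (*)(void))buf;

    uint64_t cold = time_call(region);  /* legacy decode path (plus i-cache misses) */
    uint64_t warm = time_call(region);  /* micro-op cache hit, if the region fits */
    printf("cold: %llu cycles, warm: %llu cycles\n",
           (unsigned long long)cold, (unsigned long long)warm);
    return 0;
}

A careful measurement, as in the paper's characterization, additionally has to separate instruction-cache, TLB, and branch-predictor effects from the micro-op cache itself; this sketch makes no attempt to do so and is only meant to show why the hit/miss difference is visible to unprivileged code.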
In summary, we make the following major contributions.
• We present an in-depth characterization of the micro-op cache featured in Intel and AMD processors, reverse engineering several undocumented features, including its replacement and partitioning policies.
• We propose mechanisms to automatically generate exploit code that leaks secrets by leveraging the micro-op cache as a timing channel.
• We describe and evaluate four attack variants that exploit the novel micro-op cache vulnerability – (a) cross-domain same address-space attack, (b) cross-SMT thread attack, and (c) two transient execution attack variants.
• We comment on the extensibility of existing cache side-channel and transient execution attack mitigations to address the vulnerability we expose, and further suggest potential attack detection and defense mechanisms.

II. BACKGROUND AND RELATED WORK

This section provides relevant background on the x86 front-end and its salient components, including the micro-op cache and other front-end features, in reference to Intel Skylake and AMD Zen microarchitectures. We also briefly discuss related research on side-channel and transient execution attacks.

A. The x86 Decode Pipeline

[Fig. 1: x86 micro-op cache and decode pipeline. The figure shows the BPU predicting the next-instruction address, the instruction fetch unit reading 16 bytes per cycle from the cache hierarchy, the predecoder (length decode) delivering up to 6 macro-ops per cycle into the macro-op queue, which feeds the decoders at up to 5 fused macro-ops per cycle, and the decoders, the MSROM, and the 32-set, 8-way micro-op cache each delivering micro-ops into the Instruction Decode Queue (IDQ).]

Figure 1 shows the decoding process in an x86 processor. In the absence of any optimizations, the instruction fetch unit reads instruction bytes corresponding to a 16-byte code region from the L1 instruction cache into a small fetch buffer every cycle. The fetched bytes are then scanned by the predecoder (length decode), which extracts individual x86 instructions, called macro-ops, into the macro-op queue; this predecoding step may incur an additional penalty of three to six cycles when a length-changing prefix (LCP) is encountered, as the decoded instruction length is different from the default length. As the macro-ops are decoded and placed in the macro-op queue, particular pairs of consecutive macro-op entries may be fused together into a single macro-op to save decode bandwidth.
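To illustrate the LCP case just mentioned, the sketch below (illustrative only; the repeat count is arbitrary, and a single cold pass conflates the predecode penalty with other first-run costs) times a block of adds with 32-bit immediates against a block of adds whose 66H operand-size prefix and 16-bit immediate form a classic length-changing prefix.

/* Illustrative sketch of the LCP predecode penalty: each addw below carries a
 * 66H operand-size prefix with a 16-bit immediate, a textbook length-changing
 * prefix; the addl block is the LCP-free baseline. Timing a single (cold) pass
 * over each block is a crude way to expose the per-instruction predecode cost. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static uint64_t time_block_no_lcp(void) {
    unsigned aux;
    uint64_t t0 = __rdtsc();
    asm volatile(".rept 512\n\t"
                 "addl $0x12345678, %%eax\n\t"   /* imm32: no LCP */
                 ".endr" ::: "eax", "cc");
    return __rdtscp(&aux) - t0;
}

static uint64_t time_block_lcp(void) {
    unsigned aux;
    uint64_t t0 = __rdtsc();
    asm volatile(".rept 512\n\t"
                 "addw $0x1234, %%ax\n\t"        /* 66H prefix + imm16: LCP */
                 ".endr" ::: "eax", "cc");
    return __rdtscp(&aux) - t0;
}

int main(void) {
    printf("no-LCP block: %llu cycles\n", (unsigned long long)time_block_no_lcp());
    printf("LCP block:    %llu cycles\n", (unsigned long long)time_block_lcp());
    return 0;
}

Since each block is executed only once, both timings reflect a cold decode; repeated runs would instead be served from the micro-op cache and hide the predecode cost.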
Multiple macro-ops are read from the macro-op queue every cycle and distributed to the decoders, which translate each macro-op into internal RISC (Reduced Instruction Set Computing)-like micro-ops. The Skylake microarchitecture features: (a) multiple 1:1 decoders that can translate simple macro-ops that only decompose into one micro-op, (b) a 1:4 decoder that can translate complex macro-ops that can decompose into anywhere between one and four micro-ops, and (c) a microsequencing ROM (MSROM) that translates more complex microcoded instructions, where a single macro-op can translate into more than four micro-ops, potentially involving multiple branches and loops, taking up several decode cycles. The decode pipeline can provide a peak bandwidth of 5 micro-ops per cycle [33]. In contrast, AMD Zen features four 1:2 decoders and relegates to a microcode ROM when it encounters a complex instruction that translates into more than two micro-ops.

The translated micro-ops are then queued up in an Instruction Decode Queue (IDQ) for further processing by the rest of the pipeline. In a particular cycle, micro-ops can be delivered to the IDQ by the micro-op cache, or, if not available, the more expensive full decode pipeline is turned on and micro-ops are delivered that way.
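Representative instructions make the three decode paths concrete. The classification below is an assumption based on commonly published optimization guidance rather than a measurement from this paper, and on Linux the delivery path can usually be cross-checked with perf events such as idq.dsb_uops, idq.mite_uops, and idq.ms_uops (event names vary by microarchitecture).

/* Illustrative instruction classes for the decode paths described above.
 * The micro-op counts in the comments are assumptions drawn from common
 * optimization-manual guidance, not measurements. */
#include <stdint.h>

void decode_path_examples(uint64_t *mem, uint64_t a, uint64_t b) {
    /* Simple macro-op: register-register add, one micro-op, any 1:1 decoder. */
    asm volatile("addq %[b], %[a]" : [a] "+r"(a) : [b] "r"(b) : "cc");

    /* Complex macro-op: a read-modify-write to memory decomposes into several
     * micro-ops (load, add, store-address, store-data) and needs the 1:4
     * ("complex") decoder. */
    asm volatile("addq %[b], %[m]" : [m] "+m"(*mem) : [b] "r"(b) : "cc");

    /* Microcoded instruction: CPUID expands into many micro-ops and is served
     * by the microsequencing ROM (MSROM) over several cycles. */
    uint32_t eax = 0, ebx, ecx = 0, edx;
    asm volatile("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    (void)a; (void)ebx; (void)edx;
}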
In Intel architectures, the macro-op queue, IDQ, and the micro-op cache remain partitioned across different SMT threads running on the same physical core, while AMD allows the micro-op cache to be competitively shared between threads.
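As a starting point for reproducing the cross-SMT setting, the sketch below only pins a hypothetical sender and receiver to two logical CPUs assumed to be SMT siblings; CPU numbers 0 and 4 are placeholders (the real sibling pair is listed in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list), and the prime/probe bodies are left as stubs rather than reproducing the paper's exploit code.

/* Sketch of a cross-SMT harness: two threads pinned to sibling logical CPUs of
 * one physical core. The CPU numbers are assumptions for illustration; the
 * sender/receiver bodies are intentionally left as stubs. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *sender(void *arg) {
    (void)arg;
    pin_to_cpu(0);   /* assumed: first hardware thread of the shared core */
    /* ... repeatedly execute (or skip) code regions to modulate front-end state ... */
    return NULL;
}

static void *receiver(void *arg) {
    (void)arg;
    pin_to_cpu(4);   /* assumed: sibling logical CPU of CPU 0 on this machine */
    /* ... time its own code regions to observe the sender's modulation ... */
    return NULL;
}

int main(void) {
    pthread_t s, r;
    pthread_create(&s, NULL, sender, NULL);
    pthread_create(&r, NULL, receiver, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    printf("done\n");
    return 0;
}

Compile with -pthread; whether anything can be observed across the siblings depends on the sharing or partitioning behavior described above.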