Design and Development of a Heterogeneous Hardware Search Accelerator


This dissertation is submitted for the degree of Doctor of Philosophy

Tan, Shawn Ser Ngiap
Magdalene College
May 21, 2009

Abstract

Search is a fundamental computing problem and is used in any number of applications that are invading our everyday lives. However, it has not received as much attention as other fundamental computing problems. Historically, there have been several attempts at designing complex machines to accelerate search applications. However, with the cost of transistors falling dramatically, it may be useful to design a novel on-chip hardware accelerator for search applications.

A search application is any application that traverses a data set in order to find one or more records that meet certain fitting criteria. These applications can be broken down into several low-level operations, which can be accelerated by specialised hardware units. A special search stack can be used to visualise the different levels of a search operation.

Three hardware accelerator units were designed to work alongside a host processor. A significant speed-up in performance when compared against pure software solutions was observed under ideal simulation conditions. An unconventional method for virtually saving and loading search data was developed within the simulation construct to reduce simulation time.

This method of acceleration is not the only possible solution, as search can be accelerated at a number of levels. However, the proposed architecture is unique in the way that the accelerator units can be combined like LEGO bricks, giving this solution flexibility and scalability.

Search is memory intensive, but the performance of regular cache memory, which exploits temporal and spatial locality, was found wanting. A cache memory that exploits structural locality instead of temporal and spatial locality was therefore developed to improve performance.

As search is a fundamental computational operation, it is used in almost every application, not just obvious search applications. Therefore, the hardware accelerator units can be applied to almost every software application. Obvious examples include genetics and law enforcement, while less obvious examples include gaming and operating system software. In fact, it would be useful to integrate accelerator units with slower microprocessors to improve general search performance.

The accelerator units can be implemented using an off-the-shelf FPGA at speeds of around 200MHz, or in ASIC for 333MHz (0.35µm) and 1.0GHz (0.18µm) operation. A regular FPGA is able to accelerate up to five parallel simple queries, or two heterogeneous boolean queries, or a combination of each when used with regular DDR2 memory. This solution is particularly low-cost for accelerating search, avoiding the need for expensive system-level solutions.

Declaration

I hereby declare that my thesis entitled Design and Development of a Heterogeneous Hardware Search Accelerator is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other University. I further state that no part of my thesis has already been or is being concurrently submitted for any such degree, diploma or other qualification.

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text.

This dissertation does not exceed the limit of length prescribed by the Degree Committee of the Engineering Department. The length of my thesis is approximately 45,000 words with 41 figures and 25 listings.
Signed,
Shawn Tan

Acknowledgements

I would like to take this opportunity to express my gratitude to the following people who have helped me, in one way or another, throughout the duration of my research at Cambridge and the write-up at home in Malaysia.

Dr David Holburn, for being the nicest supervisor that one can hope for, without whom this work would be difficult to accomplish. I want to express my thanks for everything you've done for me in the past four years: welcoming me into your family, getting things done within the department and patiently reading through my thesis.

All the members of the department and division, for making it a nice place and easy environment to work in. Mr Stephen Mounsey, Mr John Norcott and Miss Eleanor Blair for technical assistance in setting up the various software tools that I needed. Mr Mick Furber for all the assistance in the electrical teaching lab.

Friends from college, for helping me through tough times and keeping me sane. Jack Nie for helping me print out my thesis and handling all of the administrative issues in submitting my thesis. Drs Ray Chan and Ming Yeong Lim for being my companions on my many travels. Zen Cho for being my shoulder to cry on when things were not going well.

All my friends and family in Malaysia, for their belief in me and support throughout the duration of this research. I would like to thank my sister and my parents for all the patience and tolerance that they have shown me during the final stretch of this work. My niece and nephews, Jarellynn, Jareick and Jarell, for lending me their bubbling energy when I needed a boost.

This thesis is dedicated to them.

Contents

1 Introduction
    1.1 Justifying Search Acceleration
    1.2 Historical Justification
    1.3 Objectives
2 Search Basics
    2.1 Search Stack
    2.2 Categorising Search
        2.2.1 Primary Search
        2.2.2 Secondary Search
    2.3 Data Structures & Algorithms
        2.3.1 Data Structures
        2.3.2 Algorithms
    2.4 Search Problems
3 Search Application
    3.1 Search Application
        3.1.1 Example Query
        3.1.2 Pipeline Breakdown
        3.1.3 Query Illustration
    3.2 Search Profile
        3.2.1 Key Search
        3.2.2 List Retrieval
        3.2.3 Result Collation
        3.2.4 Overall Profile
4 General Architecture
    4.1 Initial Considerations
    4.2 Hardware Architecture
        4.2.1 Multi-Core Processing
        4.2.2 Word Size
        4.2.3 Host Processor
    4.3 Software Architecture
        4.3.1 Software Toolchain
        4.3.2 Standard Libraries
        4.3.3 Custom Library
    4.4 Initial Architecture
        4.4.1 Stack Processors
5 Streamer Unit
    5.1 Introduction
        5.1.1 Design Considerations
    5.2 Architecture
        5.2.1 Configuration
        5.2.2 Operating Modes
        5.2.3 State Machine
    5.3 Streamer Simulation
        5.3.1 Kernel Functional Simulation
        5.3.2 Kernel Timing Simulation
        5.3.3 Kernel Performance Simulation
    5.4 Conclusion
6 Sieve Unit
    6.1 Introduction
        6.1.1 Design Considerations
    6.2 Architecture
        6.2.1 Configuration
        6.2.2 Modes
        6.2.3 Operation
    6.3 Simulation Results
        6.3.1 Kernel Functional Simulation
        6.3.2 Kernel Software Pump Timing
        6.3.3 Kernel Software Pump Performance
        6.3.4 Kernel Hardware Pipe Timing
        6.3.5 Kernel Hardware Pipe Performance
    6.4 Conclusion
7 Chaser Unit
    7.1 Introduction
        7.1.1 Design Considerations
    7.2 Chaser Architecture
        7.2.1 Configuration
        7.2.2 Operation
    7.3 Kernel Simulation Results
        7.3.1 Kernel Functional Simulation
        7.3.2 Kernel Single Key Timing
        7.3.3 Kernel Single Key Performance
        7.3.4 Kernel Multi Key Timing
        7.3.5 Kernel Multi Key Performance
    7.4 Conclusion
8 Memory Interface
    8.1 Introduction
    8.2 Cache Primer
    8.3 Cache Principles
        8.3.1 Instruction Cache
        8.3.2 Data Cache
    8.4 Cache Parameters
        8.4.1 Instruction Cache
        8.4.2 Data Cache Trends (Repeat Key)
        8.4.3 Data Cache Trends (Random Key)
    8.5 Data Cache Prefetching
        8.5.1 Static Prefetching
        8.5.2 Dynamic Prefetching
        8.5.3 Prefetched Data Cache
    8.6 Cache Integration
        8.6.1 Cache Size Ratio
        8.6.2 Structural Locality
    8.7 Conclusion
9 Search Pipelines
    9.1 Pipelines
        9.1.1 Primary Search
        9.1.2 Simple Query
        9.1.3 Range Query
        9.1.4 Boolean Query