Hardware-Accelerated Platforms and Infrastructures for Network Functions: a Survey of Enabling Technologies and Research Studies
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Summit™ M5x Protocol Analyzer / Jammer for PCI Express® 5.0
Summit™ M5x Protocol Analyzer / Jammer for PCI Express® 5.0 The Summit M5x is Teledyne LeCroy’s highest performance PCI Express analyzer, Key Features offering both inline probing and direct tapping approaches to protocol capture - the best in overall diversity in probing methods. Find errors fast • One button error check Inline operation provides advanced features such as: real-time error injection (jammer) • Fast upload speed for PCI Express 4.0 specification; support for data rates of 2.5 GT/s, 5.0 GT/s, 8.0 • Large trace memory GT/s, and 16.0 GT/s; full data capture on bidirectional link widths of x1, x2, x4, x8 and • Powerful triggering/filtering x16 for up to 16 GT/s; and up to 128GB of trace memory. See and understand the traffic ® • Get useful information Direct tapping operation supports PCIe 5.0 protocol analysis with speeds up to • More choices of data views 32GT/s at up to x8 link widths. The Summit M5x is ideal for high-performance protocol • More ways to analyze data development of add-in boards, switches, servers and workstations, and for customers • Custom decoding and reports currently working on PCIe. Data capture Flexible Hardware Analyzer for CXL (Compute Express Link) Links. • 100% data capture at 32.0 GT/s The Summit M5x PCIe 5.0 Protocol Analyzer is a Support is provided for CXL.io, CXL.mem and high-end analyzer that offers important analysis CXL.cache with full decoding from the FLIT layer Deep memory buffer features for new application development. to the CXL Message layers. -
Hardware Realization of an FPGA Processor – Operating System Call Offload and Experiences
Downloaded from orbit.dtu.dk on: Oct 02, 2021 Hardware Realization of an FPGA Processor – Operating System Call Offload and Experiences Hindborg, Andreas Erik; Schleuniger, Pascal; Jensen, Nicklas Bo; Karlsson, Sven Published in: Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (DASIP) Publication date: 2014 Link back to DTU Orbit Citation (APA): Hindborg, A. E., Schleuniger, P., Jensen, N. B., & Karlsson, S. (2014). Hardware Realization of an FPGA Processor – Operating System Call Offload and Experiences. In A. Morawiec, & J. Hinderscheit (Eds.), Proceedings of the 2014 Conference on Design and Architectures for Signal and Image Processing (DASIP) IEEE. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Users may download and print one copy of any publication from the public portal for the purpose of private study or research. You may not further distribute the material or use it for any profit-making activity or commercial gain You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Hardware Realization of an FPGA Processor – Operating System Call Offload and Experiences Andreas Erik Hindborg, Pascal Schleuniger Nicklas Bo Jensen, Sven Karlsson DTU Compute – Technical University of Denmark fahin,pass,nboa,[email protected] Abstract—Field-programmable gate arrays, FPGAs, are at- speedup of up to 64% over a Xilinx MicroBlaze based baseline tractive implementation platforms for low-volume signal and system. -
SC20를 통해 본 HPC 기술 동향 HPC Technology Through SC20
Electronics and Telecommunications Trends SC20를 통해 본 HPC 기술 동향 HPC Technology Through SC20 어익수 (I.S. Eo, [email protected]) 슈퍼컴퓨팅기술연구센터 책임연구원 모희숙 (H.S. Mo, [email protected]) 슈퍼컴퓨팅기술연구센터 책임연구원 박유미 (Y.M. Park, [email protected]) 슈퍼컴퓨팅기술연구센터 책임연구원/센터장 한우종 (W.J. Han, [email protected]) 인공지능연구소 책임연구원/연구위원 ABSTRACT High-performance computing (HPC) is the underpinning for many of today’s most exciting new research areas, to name a few, from big science to new ways of fighting the disease, to artificial intelligence (AI), to big data analytics, to quantum computing. This report captures the summary of a 9-day program of presentations, keynotes, and workshops at the SC20 conference, one of the most prominent events on sharing ideas and results in HPC technology R&D. Because of the exceptional situation caused by COVID-19, the conference was held entirely online from 11/9 to 11/19 2020, and interestingly caught more attention on using HPC to make a breakthrough in the area of vaccine and cure for COVID-19. The program brought together 103 papers from 21 countries, along with 163 presentations in 24 workshop sessions. The event has covered several key areas in HPC technology, including new memory hierarchy and interconnects for different accelerators, evaluation of parallel programming models, as well as simulation and modeling in traditional science applications. Notably, there was increasing interest in AI and Big Data analytics as well. With this summary of the recent HPC trend readers may find useful information to guide the R&D directions for challenging new technologies and applications in the area of HPC. -
Designware IP for Cloud Computing Socs
DesignWare IP for Cloud Computing SoCs Overview Hyperscale cloud data centers continue to evolve due to tremendous Internet traffic growth from online collaboration, smartphones and other IoT devices, video streaming, augmented and virtual reality (AR/VR) applications, and connected AI devices. This is driving the need for new architectures for compute, storage, and networking such as AI accelerators, Software Defined Networks (SDNs), communications network processors, and solid state drives (SSDs) to improve cloud data center efficiency and performance. Re-architecting the cloud data center for these latest applications is driving the next generation of semiconductor SoCs to support new high-speed protocols to optimize data processing, networking, and storage in the cloud. Designers building system-on-chips (SoCs) for cloud and high performance computing (HPC) applications need a combination of high-performance and low-latency IP solutions to help deliver total system throughput. Synopsys provides a comprehensive portfolio of high-quality, silicon-proven IP that enables designers to develop SoCs for high-end cloud computing, including AI accelerators, edge computing, visual computing, compute/application servers, networking, and storage applications. Synopsys’ DesignWare® Foundation IP, Interface IP, Security IP, and Processor IP are optimized for high performance, low latency, and low power, while supporting advanced process technologies from 16-nm to 5-nm FinFET and future process nodes. High-Performance Computing Today’s high-performance computing (HPC) solutions provide detailed insights into the world around us and improve our quality of life. HPC solutions deliver the data processing power for massive workloads required for genome sequencing, weather modeling, video rendering, engineering modeling and simulation, medical research, big data analytics, and many other applications. -
Bringing Virtualization to the X86 Architecture with the Original Vmware Workstation
12 Bringing Virtualization to the x86 Architecture with the Original VMware Workstation EDOUARD BUGNION, Stanford University SCOTT DEVINE, VMware Inc. MENDEL ROSENBLUM, Stanford University JEREMY SUGERMAN, Talaria Technologies, Inc. EDWARD Y. WANG, Cumulus Networks, Inc. This article describes the historical context, technical challenges, and main implementation techniques used by VMware Workstation to bring virtualization to the x86 architecture in 1999. Although virtual machine monitors (VMMs) had been around for decades, they were traditionally designed as part of monolithic, single-vendor architectures with explicit support for virtualization. In contrast, the x86 architecture lacked virtualization support, and the industry around it had disaggregated into an ecosystem, with different ven- dors controlling the computers, CPUs, peripherals, operating systems, and applications, none of them asking for virtualization. We chose to build our solution independently of these vendors. As a result, VMware Workstation had to deal with new challenges associated with (i) the lack of virtual- ization support in the x86 architecture, (ii) the daunting complexity of the architecture itself, (iii) the need to support a broad combination of peripherals, and (iv) the need to offer a simple user experience within existing environments. These new challenges led us to a novel combination of well-known virtualization techniques, techniques from other domains, and new techniques. VMware Workstation combined a hosted architecture with a VMM. The hosted architecture enabled a simple user experience and offered broad hardware compatibility. Rather than exposing I/O diversity to the virtual machines, VMware Workstation also relied on software emulation of I/O devices. The VMM combined a trap-and-emulate direct execution engine with a system-level dynamic binary translator to ef- ficiently virtualize the x86 architecture and support most commodity operating systems. -
TAKING OCP to the NEXT LEVEL – WORKLOAD OPTIMIZATION Krishna Paul Principal Engineer Data Center Group
TAKING OCP TO THE NEXT LEVEL – WORKLOAD OPTIMIZATION Krishna Paul Principal Engineer Data Center Group Consume. Collaborate. Collaborate. Contribute. Contribute. OPEN COMPUTE HISTORY OF COMMITMENT AND LEADERSHIP 2011 +25 +80 Founding Contributions and Products Board Member Enablements with Partners of OCP Consume. Collaborate. Collaborate. Contribute. Contribute. GETTING TO $10B IN OCP DEPLOYMENT Workload Open Management Integrated Optimized Solutions Consume. Collaborate. Collaborate. Contribute. Contribute. NEW OCP COMPUTE PLATFORMS THIS YEAR Mount Olympus Next Gen Platform Cooper Lake Processor Platforms for Cascade Lake Processor 2S 4S (2x 2S) 8S (4x 2S) Consume. Collaborate. Collaborate. Contribute. Contribute. *Other names and brands may be claimed as property of others. INTEL® HIGH-DENSITY, CLOUD-OPTIMIZED PLATFORM Cloud-Optimized Platform 2U 450mm x 780mm 4S Intel® Xeon® Scalable processors 48 DDR4 memory slots, SATA/SAS/NVMe 2.5” SSD drive bays Available in second half 2019 *Other names and brands may be claimed as property of others. Consume. Collaborate. Collaborate. Contribute. Contribute. OCP CARDS SUPPORT NEW AI ACCELERATORS INTEL® NERVANA™ NEURAL NETWORK PROCESSOR (NNP) FOR TRAINING: FOR INFERENCE: Nervana™ Dedicated deep learning training acceleration Dedicated deep learning inference acceleration Optimized memory and interconnects 10nm Intel® process node In production in 2019 In production in 2019 Intel is a proud partner of the community Consume. Collaborate. Collaborate. Contribute. Contribute. *Other names and brands may be claimed as property of others. ANNOUNCING A NEW CPU-TO-DEVICE INTERCONNECT STANDARD COMPUTE EXPRESS LINK (CXL) ▪ New CXL specification and consortium ▪ Memory coherent, high-speed interconnect ▪ Initial spec donated by Intel ▪ Intel® believes there is an opportunity for an OCP working group to define new form factors for CXL ▪ Optimized stack with x16 PCI Express Gen 5 physical and electrical connection (32 GT/s) ▪ Use cases include AI, networking, media, graphics and more ▪ Availability expected 2021 Consume. -
Intel FPGA Product Catalog Devices: 10 Nm Device Portfolio Intel Agilex FPGA and Soc Overview
• Cover TBD INTEL® FPGA PRODUCT CATALOG Version 19.3 CONTENTS Overview Acceleration Platforms and Solutions Intel® FPGA Solutions Portfolio 1 Intel FPGA Programmable Acceleration Overview 61 Devices Intel Acceleration Stack for Intel Xeon® CPU with FPGAs 62 Intel FPGA Programmable Acceleration Cards 63 10 nm Device Portfolio - Intel AgilexTM - Intel Programmable Acceleration Card with 63 FPGA and SoC Overview 2 Intel Arria 10 GX FPGA - Intel Agilex FPGA Features 4 - Intel FPGA Programmable Acceleration Card D5005 64 Generation 10 Device Portfolio - Intel FPGA Programmable Acceleration Card N3000 65 - Generation 10 FPGAs and SoCs 6 - Intel FPGA Programmable Acceleration Card 66 - Intel Stratix® 10 FPGA and SoC Overview 7 Comparison - Intel Stratix 10 FPGA Features 9 Accelerated Workload Solutions 67 - Intel Stratix 10 SoC Features 11 - Intel Stratix 10 TX Features 13 - Intel Stratix 10 MX Features 15 Design Tools, OS Support, and Processors - Intel Stratix 10 DX Features 17 Intel Quartus® Prime Software 68 - Intel Arria® 10 FPGA and SoC Overview 20 DSP Builder for Intel FPGAs 71 - Intel Arria 10 FPGA Features 21 Intel FPGA SDK for OpenCL™ 72 - Intel Arria 10 SoC Features 23 - Intel Cyclone® 10 FPGA Overview 25 Intel SoC FPGA Embedded Development Suite 73 - Intel Cyclone 10 GX FPGA Features 26 SoC Operating System Support 74 - Intel Cyclone 10 LP FPGA Features 27 Nios® II Processor 75 - Intel MAX® 10 FPGA Overview 29 - Intel MAX 10 FPGA Features 30 Nios II Processor Embedded Design Suite 76 Nios II Processor Operating System Support 28 -
Static Fault Attack on Hardware DES Registers
Static Fault Attack on Hardware DES Registers Philippe Loubet-Moundi, Francis Olivier, and David Vigilant Gemalto, Security Labs, France {philippe.loubet-moundi,david.vigilant,francis.olivier}@gemalto.com http://www.gemalto.com Abstract. In the late nineties, Eli Biham and Adi Shamir published the first paper on Differential Fault Analysis on symmetric key algorithms. More specifically they introduced a fault model where a key bit located in non-volatile memory is forced to 0=1 with a fault injection. In their scenario the fault was permanent, and could lead the attacker to full key recovery with low complexity. In this paper, another fault model is considered: forcing a key bit to 0=1 in the register of a hardware block implementing Data Encryption Stan- dard. Due to the specific location of the fault, the key modification is not permanent in the life of the embedded device, and this leads to apply a powerful safe-error like attack. This paper reports a practical validation of the fault model on two actual circuits, and discusses limitations and efficient countermeasures against this threat. Keywords: Hardware DES, fault attacks, safe-error, register attacks 1 Introduction Fault attacks against embedded systems exploit the disturbed execution of a given program to infer sensitive data, or to execute unauthorized parts of this program. Usually a distinction is made between permanent faults and transient faults.A permanent fault alters the behavior of the device from the moment it is injected, and even if the device is powered off [8]. On the other hand, a transient fault has an effect limited in time, only a few cycles of the process are affected. -
Validating Register Allocation and Spilling
Validating Register Allocation and Spilling Silvain Rideau1 and Xavier Leroy2 1 Ecole´ Normale Sup´erieure,45 rue d'Ulm, 75005 Paris, France [email protected] 2 INRIA Paris-Rocquencourt, BP 105, 78153 Le Chesnay, France [email protected] Abstract. Following the translation validation approach to high- assurance compilation, we describe a new algorithm for validating a posteriori the results of a run of register allocation. The algorithm is based on backward dataflow inference of equations between variables, registers and stack locations, and can cope with sophisticated forms of spilling and live range splitting, as well as many architectural irregular- ities such as overlapping registers. The soundness of the algorithm was mechanically proved using the Coq proof assistant. 1 Introduction To generate fast and compact machine code, it is crucial to make effective use of the limited number of registers provided by hardware architectures. Register allocation and its accompanying code transformations (spilling, reloading, coa- lescing, live range splitting, rematerialization, etc) therefore play a prominent role in optimizing compilers. As in the case of any advanced compiler pass, mistakes sometimes happen in the design or implementation of register allocators, possibly causing incorrect machine code to be generated from a correct source program. Such compiler- introduced bugs are uncommon but especially difficult to exhibit and track down. In the context of safety-critical software, they can also invalidate all the safety guarantees obtained by formal verification of the source code, which is a growing concern in the formal methods world. There exist two major approaches to rule out incorrect compilations. Com- piler verification proves, once and for all, the correctness of a compiler or compi- lation pass, preferably using mechanical assistance (proof assistants) to conduct the proof. -
V810 Familytm 32-Bit Microprocessor
V810 FAMILYTM 32-BIT MICROPROCESSOR ARCHITECTURE V805TM V810TM V820TM V821TM Document No. U10082EJ1V0UM00 (1st edition) Date Published October 1995P © 11995 Printed in Japan NOTES FOR CMOS DEVICES 1 PRECAUTION AGAINST ESD FOR SEMICONDUCTORS Note: Strong electric field, when exposed to a MOS device, can cause destruction of the gate oxide and ultimately degrade the device operation. Steps must be taken to stop generation of static electricity as much as possible, and quickly dissipate it once it has occurred. Environmental control must be adequate. When it is dry, humidifier should be used. It is recommended to avoid using insulators that easily build static electricity. Semiconductor devices must be stored and transported in an anti- static container, static shielding bag or conductive material. All test and measurement tools including work bench and floor should be grounded. The operator should be grounded using wrist strap. Semiconductor devices must not be touched with bare hands. Similar precautions need to be taken for PW boards with semiconductor devices on it. 2 HANDLING OF UNUSED INPUT PINS FOR CMOS Note: No connection for CMOS device inputs can be cause of malfunction. If no connection is provided to the input pins, it is possible that an internal input level may be generated due to noise, etc., hence causing malfunction. CMOS devices behave differently than Bipolar or NMOS devices. Input levels of CMOS devices must be fixed high or low by using a pull-up or pull-down circuitry. Each unused pin should be connected to VDD or GND with a resistor, if it is considered to have a possibility of being an output pin. -
Exploiting Cache Side Channels on CPU-FPGA Cloud Platforms
Exploiting Cache Side-Channels on CPU-FPGA Cloud Platforms Cache-basierte Seitenkanalangriffe auf CPU-FPGA Cloud Plattformen Masterarbeit im Rahmen des Studiengangs Informatik der Universität zu Lübeck vorgelegt von Thore Tiemann ausgegeben und betreut von Prof. Dr. Thomas Eisenbarth Lübeck, den 09. Januar 2020 Abstract Cloud service providers invest more and more in FPGA infrastructure to offer it to their customers (IaaS). FPGA use cases include the acceleration of artificial intelligence applica- tions. While physical attacks on FPGAs got a lot of attention by researchers just recently, nobody focused on micro-architectural attacks in the context of FPGA yet. We take the first step into this direction and investigate the timing behavior of memory reads on two Intel FPGA platforms that were designed to be used in a cloud environment: a Programmable Acceleration Card (PAC) and a Xeon CPU with an integrated FPGA that has an additional coherent local cache. By using the Open Programmable Acceleration Engine (OPAE) and the Core Cache Interface (CCI-P), our setup represents a realistic cloud scenario. We show that an Acceleration Functional Unit (AFU) on either platform can, with the help of a self-designed hardware timer, distinguish which of the different parts of the memory hierarchy serves the response for memory reads. On the integrated platform, the same is true for the CPU, as it can differentiate between answers from the FPGA cache, the CPU cache, and the main memory. Next, we analyze the effect of the caching hints offered by the CCI-P on the location of written cache lines in the memory hierarchy and show that most cache lines get cached in the LLC of the CPU. -
Register Allocation Deconstructed
Register Allocation Deconstructed David Ryan Koes Seth Copen Goldstein Carnegie Mellon University Carnegie Mellon University Pittsburgh, PA Pittsburgh, PA [email protected] [email protected] Abstract assignment. The coalescing component of register allocation Register allocation is a fundamental part of any optimiz- attempts to eliminate existing move instructions by allocat- ing compiler. Effectively managing the limited register re- ing the operands of the move instruction to identical loca- sources of the constrained architectures commonly found in tions. The spilling component attempts to minimize the im- embedded systems is essential in order to maximize code pact of accessing variables in memory. The move insertion quality. In this paper we deconstruct the register allocation component attempts to insert move instructions, splitting the problem into distinct components: coalescing, spilling, move live ranges of variables, with the goal of achieving a net im- insertion, and assignment. Using an optimal register alloca- provement in code quality by improving the results of the tion framework, we empirically evaluate the importance of other components. The assignment component assigns those each of the components, the impact of component integra- program variables that aren’t in memory to specific hard- tion, and the effectiveness of existing heuristics. We evalu- ware registers. Existing register allocators take different ap- ate code quality both in terms of code performance and code proaches in solving these various components. For example, size and consider four distinct instruction set architectures: the central focus of a graph-coloring based register allocator ARM, Thumb, x86, and x86-64. The results of our investiga- is to solve the assignment problem, while the spilling and tion reveal general principles for register allocation design.