Benchmarking High Bandwidth Memory on Fpgas
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
ASIC Implementation of DDR SDRAM Memory Controller
© 2016 IJEDR | Volume 4, Issue 4 | ISSN: 2321-9939 ASIC Implementation of DDR SDRAM Memory Controller Ramagiri Ramya, Naganaik M.Tech Scholar, ASST.PROF, HOD of ECE BRIG-IC, Hyderabad ________________________________________________________________________________________________________ Abstract - A Dedicated Memory Controller is of prime importance in applications that do not contain microprocessors (high-end applications). The Memory Controller provides command signals for memory refresh, read and write operation and initialization of SDRAM. Our work will focus on ASIC Design methodology of Double Data Rate (DDR) SDRAM Controller that is located between the DDR SDRAM and Bus Master. The Controller simplifies the SDRAM command interface to standard system read/write interface and also optimizes the access time of read/write cycle. Keywords - DDR SDRAM Controller, Read/Write Data path, Cadence RTL Compiler. ________________________________________________________________________________________________________ I. INTRODUCTION Memory devices are almost found in all systems and nowadays high speed and high performance memories are in great demand. For better throughput and speed, the controllers are to be designed with clock frequency in the range of megahertz. As the clock speed of the controller is increasing, the design challenges are also becoming complex. Therefore the next generation memory devices require very high speed controllers like double data rate and quad data rate memory controllers. In this paper, the double data rate SDRAM Controller is implemented using ASIC methodology. Synchronous DRAM (SDRAM) is preferred in embedded system memory design because of its speed and pipelining capability. In high-end applications, like microprocessors there will be specific built in peripherals to provide the interface to the SDRAM. But for other applications, the system designer must design a specific memory controller to provide command signals for memory refresh, read and write operation and initialization of SDRAM. -
Memory Interference Characterization Between CPU Cores and Integrated Gpus in Mixed-Criticality Platforms
Memory Interference Characterization between CPU cores and integrated GPUs in Mixed-Criticality Platforms Roberto Cavicchioli, Nicola Capodieci and Marko Bertogna University of Modena And Reggio Emilia, Department of Physics, Informatics and Mathematics, Modena, Italy [email protected] Abstract—Most of today’s mixed criticality platforms feature be put on the memory traffic originated by the iGPU, as it Systems on Chip (SoC) where a multi-core CPU complex (the represents a very popular architectural paradigm for computing host) competes with an integrated Graphic Processor Unit (iGPU, massively parallel workloads at impressive performance per the device) for accessing central memory. The multi-core host Watt ratios [1]. This architectural choice (commonly referred and the iGPU share the same memory controller, which has to to as General Purpose GPU computing, GPGPU) is one of the arbitrate data access to both clients through often undisclosed reference architectures for future embedded mixed-criticality or non-priority driven mechanisms. Such aspect becomes crit- ical when the iGPU is a high performance massively parallel applications, such as autonomous driving and unmanned aerial computing complex potentially able to saturate the available vehicle control. DRAM bandwidth of the considered SoC. The contribution of this paper is to qualitatively analyze and characterize the conflicts In order to maximize the validity of the presented results, due to parallel accesses to main memory by both CPU cores we considered different platforms featuring different memory and iGPU, so to motivate the need of novel paradigms for controllers, instruction sets, data bus width, cache hierarchy memory centric scheduling mechanisms. We analyzed different configurations and programming models: well known and commercially available platforms in order to i NVIDIA Tegra K1 SoC (TK1), using CUDA 6.5 [2] for estimate variations in throughput and latencies within various memory access patterns, both at host and device side. -
Performance Impact of Memory Channels on Sparse and Irregular Algorithms
Performance Impact of Memory Channels on Sparse and Irregular Algorithms Oded Green1,2, James Fox2, Jeffrey Young2, Jun Shirako2, and David Bader3 1NVIDIA Corporation — 2Georgia Institute of Technology — 3New Jersey Institute of Technology Abstract— Graph processing is typically considered to be a Graph algorithms are typically latency-bound if there is memory-bound rather than compute-bound problem. One com- not enough parallelism to saturate the memory subsystem. mon line of thought is that more available memory bandwidth However, the growth in parallelism for shared-memory corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization systems brings into question whether graph algorithms are of the memory system for graph algorithms is not necessarily still primarily latency-bound. Getting peak or near-peak the raw bandwidth or even the latency of memory requests. bandwidth of current memory subsystems requires highly Instead, we show that performance is proportional to the parallel applications as well as good spatial and temporal number of memory channels available to handle small data memory reuse. The lack of spatial locality in sparse and transfers with limited spatial locality. Using several widely used graph frameworks, including irregular algorithms means that prefetched cache lines have Gunrock (on the GPU) and GAPBS & Ligra (for CPUs), poor data reuse and that the effective bandwidth is fairly low. we evaluate key graph analytics kernels using two unique The introduction of high-bandwidth memories like HBM and memory hierarchies, DDR-based and HBM/MCDRAM. Our HMC have not yet closed this inefficiency gap for latency- results show that the differences in the peak bandwidths sensitive accesses [19]. -
A Cache-Efficient Sorting Algorithm for Database and Data Mining
A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations using Graphics Processors Naga K. Govindaraju, Nikunj Raghuvanshi, Michael Henson, David Tuft, Dinesh Manocha University of North Carolina at Chapel Hill {naga,nikunj,henson,tuft,dm}@cs.unc.edu http://gamma.cs.unc.edu/GPUSORT Abstract els [3]. These techniques have been extensively studied for CPU-based algorithms. We present a fast sorting algorithm using graphics pro- Recently, database researchers have been exploiting the cessors (GPUs) that adapts well to database and data min- computational capabilities of graphics processors to accel- ing applications. Our algorithm uses texture mapping and erate database queries and redesign the query processing en- blending functionalities of GPUs to implement an efficient gine [8, 18, 19, 39]. These include typical database opera- bitonic sorting network. We take into account the commu- tions such as conjunctive selections, aggregates and semi- nication bandwidth overhead to the video memory on the linear queries, stream-based frequency and quantile estima- GPUs and reduce the memory bandwidth requirements. We tion operations,and spatial database operations. These al- also present strategies to exploit the tile-based computa- gorithms utilize the high memory bandwidth, the inherent tional model of GPUs. Our new algorithm has a memory- parallelism and vector processing functionalities of GPUs efficient data access pattern and we describe an efficient to execute the queries efficiently. Furthermore, GPUs are instruction dispatch mechanism to improve the overall sort- becoming increasingly programmable and have also been ing performance. We have used our sorting algorithm to ac- used for sorting [19, 24, 35]. celerate join-based queries and stream mining algorithms. -
How to Solve the Current Memory Access and Data Transfer Bottlenecks: at the Processor Architecture Or at the Compiler Level?
Hot topic session: How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? Francky Catthoor, Nikil D. Dutt, IMEC , Leuven, Belgium ([email protected]) U.C.Irvine, CA ([email protected]) Christoforos E. Kozyrakis, U.C.Berkeley, CA ([email protected]) Abstract bedded memories are correctly incorporated. What is more controversial still however is how to deal with dynamic ap- Current processor architectures, both in the pro- plication behaviour, with multithreading and with highly grammable and custom case, become more and more dom- concurrent architecture platforms. inated by the data access bottlenecks in the cache, system bus and main memory subsystems. In order to provide suf- 2. Explicitly parallel architectures for mem- ficiently high data throughput in the emerging era of highly ory performance enhancement (Christo- parallel processors where many arithmetic resources can work concurrently, novel solutions for the memory access foros E. Kozyrakis) and data transfer will have to be introduced. The crucial question we want to address in this hot topic While microprocessor architectures have significantly session is where one can expect these novel solutions to evolved over the last fifteen years, the memory system rely on: will they be mainly innovative processor archi- is still a major performance bottleneck. The increasingly tecture ideas, or novel approaches in the application com- complex structures used for speculation, reordering, and piler/synthesis technology, or a mix. caching have improved performance, but are quickly run- ning out of steam. These techniques are becoming ineffec- tive in terms of resource utilization and scaling potential. -
Prefetching for Complex Memory Access Patterns
UCAM-CL-TR-923 Technical Report ISSN 1476-2986 Number 923 Computer Laboratory Prefetching for complex memory access patterns Sam Ainsworth July 2018 15 JJ Thomson Avenue Cambridge CB3 0FD United Kingdom phone +44 1223 763500 http://www.cl.cam.ac.uk/ c 2018 Sam Ainsworth This technical report is based on a dissertation submitted February 2018 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Churchill College. Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet: http://www.cl.cam.ac.uk/techreports/ ISSN 1476-2986 Abstract Modern-day workloads, particularly those in big data, are heavily memory-latency bound. This is because of both irregular memory accesses, which have no discernible pattern in their memory addresses, and large data sets that cannot fit in any cache. However, this need not be a barrier to high performance. With some data structure knowledge it is typically possible to bring data into the fast on-chip memory caches early, so that it is already available by the time it needs to be accessed. This thesis makes three contributions. I first contribute an automated software prefetch- ing compiler technique to insert high-performance prefetches into program code to bring data into the cache early, achieving 1.3 geometric mean speedup on the most complex × processors, and 2.7 on the simplest. I also provide an analysis of when and why this × is likely to be successful, which data structures to target, and how to schedule software prefetches well. Then I introduce a hardware solution, the configurable graph prefetcher. -
Eth-28647-02.Pdf
Research Collection Doctoral Thesis Adaptive main memory compression Author(s): Tuduce, Irina Publication Date: 2006 Permanent Link: https://doi.org/10.3929/ethz-a-005180607 Rights / License: In Copyright - Non-Commercial Use Permitted This page was generated automatically upon download from the ETH Zurich Research Collection. For more information please consult the Terms of use. ETH Library Doctoral Thesis ETH No. 16327 Adaptive Main Memory Compression A dissertation submitted to the Swiss Federal Institute of Technology Zurich (ETH ZÜRICH) for the degree of Doctor of Technical Sciences presented by Irina Tuduce Engineer, TU Cluj-Napoca born April 9, 1976 citizen of Romania accepted on the recommendation of Prof. Dr. Thomas Gross, examiner Prof. Dr. Lothar Thiele, co-examiner 2005 Seite Leer / Blank leaf Seite Leer / Blank leaf Seite Leer lank Abstract Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. Since fast memory is expensive, an economical solution to that desire is a memory hierarchy organized into several levels. Each level has greater capacity than the preceding but it is less quickly accessible. The goal of the memory hierarchy is to provide a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level. In the last decades, the processor performance improved much faster than the performance of the memory levels. The memory hierarchy proved to be a scalable solution, i.e., the bigger the perfomance gap between processor and memory, the more levels are used in the memory hierarchy. For instance, in 1980 microprocessors were often designed without caches, while in 2005 most of them come with two levels of caches on the chip. -
Detecting False Sharing Efficiently and Effectively
W&M ScholarWorks Arts & Sciences Articles Arts and Sciences 2016 Cheetah: Detecting False Sharing Efficiently andff E ectively Tongping Liu Univ Texas San Antonio, Dept Comp Sci, San Antonio, TX 78249 USA; Xu Liu Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA Follow this and additional works at: https://scholarworks.wm.edu/aspubs Recommended Citation Liu, T., & Liu, X. (2016, March). Cheetah: Detecting false sharing efficiently and effectively. In 2016 IEEE/ ACM International Symposium on Code Generation and Optimization (CGO) (pp. 1-11). IEEE. This Article is brought to you for free and open access by the Arts and Sciences at W&M ScholarWorks. It has been accepted for inclusion in Arts & Sciences Articles by an authorized administrator of W&M ScholarWorks. For more information, please contact [email protected]. Cheetah: Detecting False Sharing Efficiently and Effectively Tongping Liu ∗ Xu Liu ∗ Department of Computer Science Department of Computer Science University of Texas at San Antonio College of William and Mary San Antonio, TX 78249 USA Williamsburg, VA 23185 USA [email protected] [email protected] Abstract 1. Introduction False sharing is a notorious performance problem that may Multicore processors are ubiquitous in the computing spec- occur in multithreaded programs when they are running on trum: from smart phones, personal desktops, to high-end ubiquitous multicore hardware. It can dramatically degrade servers. Multithreading is the de-facto programming model the performance by up to an order of magnitude, significantly to exploit the massive parallelism of modern multicore archi- hurting the scalability. Identifying false sharing in complex tectures. -
A Modern Primer on Processing in Memory
A Modern Primer on Processing in Memory Onur Mutlua,b, Saugata Ghoseb,c, Juan Gomez-Luna´ a, Rachata Ausavarungnirund SAFARI Research Group aETH Z¨urich bCarnegie Mellon University cUniversity of Illinois at Urbana-Champaign dKing Mongkut’s University of Technology North Bangkok Abstract Modern computing systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in computing that cause performance, scalability and energy bottlenecks: (1) data access is a key bottleneck as many important applications are increasingly data-intensive, and memory bandwidth and energy do not scale well, (2) energy consumption is a key limiter in almost all computing platforms, especially server and mobile systems, (3) data movement, especially off-chip to on-chip, is very expensive in terms of bandwidth, energy and latency, much more so than computation. These trends are especially severely-felt in the data-intensive server and energy-constrained mobile systems of today. At the same time, conventional memory technology is facing many technology scaling challenges in terms of reliability, energy, and performance. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, at the expense of higher cost. The emergence of 3D-stacked memory plus logic, the adoption of error correcting codes inside the latest DRAM chips, proliferation of different main memory standards and chips, specialized for different purposes (e.g., graphics, low-power, high bandwidth, low latency), and the necessity of designing new solutions to serious reliability and security issues, such as the RowHammer phenomenon, are an evidence of this trend. -
Case Study on Integrated Architecture for In-Memory and In-Storage Computing
electronics Article Case Study on Integrated Architecture for In-Memory and In-Storage Computing Manho Kim 1, Sung-Ho Kim 1, Hyuk-Jae Lee 1 and Chae-Eun Rhee 2,* 1 Department of Electrical Engineering, Seoul National University, Seoul 08826, Korea; [email protected] (M.K.); [email protected] (S.-H.K.); [email protected] (H.-J.L.) 2 Department of Information and Communication Engineering, Inha University, Incheon 22212, Korea * Correspondence: [email protected]; Tel.: +82-32-860-7429 Abstract: Since the advent of computers, computing performance has been steadily increasing. Moreover, recent technologies are mostly based on massive data, and the development of artificial intelligence is accelerating it. Accordingly, various studies are being conducted to increase the performance and computing and data access, together reducing energy consumption. In-memory computing (IMC) and in-storage computing (ISC) are currently the most actively studied architectures to deal with the challenges of recent technologies. Since IMC performs operations in memory, there is a chance to overcome the memory bandwidth limit. ISC can reduce energy by using a low power processor inside storage without an expensive IO interface. To integrate the host CPU, IMC and ISC harmoniously, appropriate workload allocation that reflects the characteristics of the target application is required. In this paper, the energy and processing speed are evaluated according to the workload allocation and system conditions. The proof-of-concept prototyping system is implemented for the integrated architecture. The simulation results show that IMC improves the performance by 4.4 times and reduces total energy by 4.6 times over the baseline host CPU. -
A Hybrid Analytical DRAM Performance Model
A Hybrid Analytical DRAM Performance Model George L. Yuan Tor M. Aamodt Department of Electrical and Computer Engineering University of British Columbia, Vancouver, BC, CANADA fgyuan,[email protected] Abstract 2000 As process technology scales, the number of transistors that 1600 can ¯t in a unit area has increased exponentially. Proces- 1200 sor throughput, memory storage, and memory throughput 800 have all been increasing at an exponential pace. As such, 400 DRAM has become an ever-tightening bottleneck for appli- 0 Memory Access Latency cations with irregular memory access patterns. Computer BS FWT DG LPS LIB MMC MUM NN RAY RED SP SLA TRA WP architects in industry sometimes use ad hoc analytical mod- eling techniques in lieu of cycle-accurate performance sim- Figure 1: Average memory latency measured in perfor- ulation to identify critical design points. Moreover, ana- mance simulation lytical models can provide clear mathematical relationships for how system performance is a®ected by individual mi- croarchitectural parameters, something that may be di±cult cess behavior. In the massively multithreaded GPU Com- to obtain with a detailed performance simulator. Modern pute architecture that we simulate, threads are scheduled to DRAM controllers rely on Out-of-Order scheduling policies Single-Instruction Multiple-Data pipelines in groups of 32 to increase row access locality and decrease the overheads of called warps, each thread of which is capable of generating timing constraint delays. This paper proposes a hybrid ana- a memory request to any location in GPU main memory. lytical DRAM performance model that uses memory address (This can be done as long as the limit on the number of in- traces to predict the DRAM e±ciency of a DRAM system flight memory requests per GPU shader core has not been when using such a memory scheduling policy. -
ECE 571 – Advanced Microprocessor-Based Design Lecture 17
ECE 571 { Advanced Microprocessor-Based Design Lecture 17 Vince Weaver http://web.eece.maine.edu/~vweaver [email protected] 3 April 2018 Announcements • HW8 is readings 1 More DRAM 2 ECC Memory • There's debate about how many errors can happen, anywhere from 10−10 error/bit*h (roughly one bit error per hour per gigabyte of memory) to 10−17 error/bit*h (roughly one bit error per millennium per gigabyte of memory • Google did a study and they found more toward the high end • Would you notice if you had a bit flipped? • Scrubbing { only notice a flip once you read out a value 3 Registered Memory • Registered vs Unregistered • Registered has a buffer on board. More expensive but can have more DIMMs on a channel • Registered may be slower (if it buffers for a cycle) • RDIMM/UDIMM 4 Bandwidth/Latency Issues • Truly random access? No, burst speed fast, random speed not. • Is that a problem? Mostly filling cache lines? 5 Memory Controller • Can we have full random access to memory? Why not just pass on CPU mem requests unchanged? • What might have higher priority? • Why might re-ordering the accesses help performance (back and forth between two pages) 6 Reducing Refresh • DRAM Refresh Mechanisms, Penalties, and Trade-Offs by Bhati et al. • Refresh hurts performance: ◦ Memory controller stalls access to memory being refreshed ◦ Refresh takes energy (read/write) On 32Gb device, up to 20% of energy consumption and 30% of performance 7 Async vs Sync Refresh • Traditional refresh rates ◦ Async Standard (15.6us) ◦ Async Extended (125us) ◦ SDRAM -