High-Performance and Energy-Efficient Memory Scheduler Design for Heterogeneous Systems

Rachata Ausavarungnirun (1), Gabriel H. Loh (2), Lavanya Subramanian (3,1), Kevin Chang (4,1), Onur Mutlu (5,1)
(1) Carnegie Mellon University, (2) AMD Research, (3) Intel Labs, (4) Facebook, (5) ETH Zürich

This paper summarizes the idea of the Staged Memory Scheduler (SMS), which was published at ISCA 2012 [14], and examines the work's significance and future potential. When multiple processor cores (CPUs) and a GPU integrated together on the same chip share the off-chip DRAM, requests from the GPU can heavily interfere with requests from the CPUs, leading to low system performance and starvation of cores. Unfortunately, state-of-the-art memory scheduling algorithms are ineffective at solving this problem due to the very large amount of GPU memory traffic, unless a very large and costly request buffer is employed to provide these algorithms with enough visibility across the global request stream.

Previously-proposed memory controller (MC) designs use a single monolithic structure to perform three main tasks. First, the MC attempts to schedule together requests to the same DRAM row to increase row buffer hit rates. Second, the MC arbitrates among the requesters (CPUs and GPU) to optimize for overall system throughput, average response time, fairness and quality of service. Third, the MC manages the low-level DRAM command scheduling to complete requests while ensuring compliance with all DRAM timing and power constraints.

This paper proposes a fundamentally new approach, called the Staged Memory Scheduler (SMS), which decouples the three primary MC tasks into three significantly simpler structures that together improve system performance and fairness. Our three-stage MC first groups requests based on row buffer locality. This grouping allows the second stage to focus only on inter-application scheduling decisions. These two stages enforce high-level policies regarding performance and fairness, and therefore the last stage can use simple per-bank FIFO queues (i.e., there is no need for further command reordering within each bank) and straightforward logic that deals only with the low-level DRAM commands and timing.

We evaluated the design trade-offs involved and compared it against four state-of-the-art MC designs. Our evaluation shows that SMS provides 41.2% performance improvement and 4.8× fairness improvement compared to the best previous state-of-the-art technique, while enabling a design that is significantly less complex and more power-efficient to implement.

Our analysis and proposed scheduler have inspired significant research on (1) predictable and/or deadline-aware memory scheduling [91, 92, 194, 195, 197, 201, 202, 216] and (2) memory scheduling for heterogeneous systems [161, 201, 202, 207].

1. Introduction

As the number of cores continues to increase in modern chip multiprocessor (CMP) systems, the DRAM memory system has become a critical shared resource [139, 145]. Memory requests from multiple cores interfere with each other, and this inter-application interference is a significant impediment to individual application and overall system performance. Various works on application-aware memory scheduling [98, 99, 141, 142] address the problem by making the memory controller aware of application characteristics and appropriately prioritizing memory requests to improve system performance and fairness.

Recent heterogeneous CPU-GPU systems [1, 27, 28, 37, 76, 77, 78, 133, 152, 153, 167, 209] present an additional challenge by introducing integrated graphics processing units (GPUs) on the same die with CPU cores. GPU applications typically demand significantly more memory bandwidth than CPU applications due to the GPU's capability of executing a large number of concurrent threads [1, 2, 3, 4, 13, 23, 27, 37, 38, 40, 62, 68, 77, 78, 133, 149, 150, 151, 152, 153, 154, 167, 176, 178, 179, 188, 189, 199, 200, 206, 209]. GPUs use single-instruction multiple-data (SIMD) pipelines to concurrently execute multiple threads [53]. In a GPU, a group of threads executing the same instruction is called a wavefront or warp, and threads in a warp are executed in lockstep. When a wavefront stalls on a memory instruction, the GPU core hides this memory access latency by switching to another wavefront to avoid stalling the pipeline. Therefore, there can be thousands of outstanding memory requests from across all of the wavefronts. This is fundamentally more memory intensive than CPU memory traffic, where each CPU application has a much smaller number of outstanding requests due to the sequential execution model of CPUs.
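To make "thousands of outstanding requests" concrete, the back-of-envelope sketch below compares rough upper bounds on in-flight memory requests for an integrated GPU and for a small CPU cluster. All of the core, wavefront, and MSHR counts are illustrative assumptions chosen for this example, not parameters reported in the paper [14].

```python
# Rough, illustrative bounds on concurrent memory requests.
# Every count below is an assumption chosen only to show the orders of magnitude.
gpu_compute_units = 16        # GPU cores (compute units)
wavefronts_per_unit = 24      # resident wavefronts per compute unit
threads_per_wavefront = 32    # threads executing in lockstep

# If stalled wavefronts each expose up to one pending access per thread,
# the GPU can keep roughly this many requests in flight:
gpu_in_flight = gpu_compute_units * wavefronts_per_unit * threads_per_wavefront
print("GPU in-flight requests (upper bound):", gpu_in_flight)   # 12288

cpu_cores = 4
mshrs_per_core = 10           # miss-status holding registers bound per-core misses
cpu_in_flight = cpu_cores * mshrs_per_core
print("CPU in-flight requests (upper bound):", cpu_in_flight)   # 40
```

Even under these conservative, assumed counts, the GPU side sustains orders of magnitude more concurrent memory traffic than the CPU side, which is the imbalance the measurements below quantify.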
Figure 1 (a) shows the memory request rates for a representative subset of our GPU applications and the most memory-intensive SPEC2006 (CPU) applications, as measured by memory requests per thousand cycles when each application runs alone on the system. The raw bandwidth demands (i.e., memory request rates) of the GPU applications are often multiple times higher than the SPEC benchmarks. Figure 1 (b) shows the row buffer hit rates (also called row buffer locality or RBL [134]). The GPU applications show consistently high levels of RBL, whereas the SPEC benchmarks exhibit more variability. The GPU programs have high levels of spatial locality, often due to access patterns related to large sequential memory accesses (e.g., frame buffer updates). Figure 1(c) shows the bank-level parallelism (BLP) [109, 142], which is the average number of parallel memory requests that can be issued to different DRAM banks, for each application, with the GPU programs consistently making use of four banks at the same time.

Figure 1: GPU memory characteristics. (a) Memory intensity, measured by memory requests per thousand cycles, (b) row buffer locality, measured by the fraction of accesses that hit in the row buffer, and (c) bank-level parallelism. Reproduced from [14]. (Applications shown: GAME01, GAME03, GAME05, BENCH02, BENCH04, gcc, h264ref, astar, omnetpp, leslie3d, mcf.)
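The three metrics plotted in Figure 1 can be computed from a per-application memory request trace. The sketch below shows one possible way to do so; the trace format and field names are assumptions made for illustration and are not taken from the paper's simulation infrastructure.

```python
from collections import defaultdict

def memory_metrics(trace, cycles):
    """Compute memory intensity (MPKC), row buffer locality (RBL), and
    bank-level parallelism (BLP) from a request trace.

    Each request is assumed to be a dict: {"cycle": int, "bank": int, "row": int}.
    """
    mpkc = 1000.0 * len(trace) / cycles

    # RBL: fraction of requests that hit the currently open row of their bank,
    # approximating the row buffer as holding the most recently accessed row.
    open_row = {}                      # bank -> last-accessed row
    hits = 0
    for req in trace:
        if open_row.get(req["bank"]) == req["row"]:
            hits += 1
        open_row[req["bank"]] = req["row"]
    rbl = hits / len(trace) if trace else 0.0

    # BLP: average number of distinct banks with a request outstanding,
    # approximated here per cycle over cycles that have any request.
    banks_per_cycle = defaultdict(set)
    for req in trace:
        banks_per_cycle[req["cycle"]].add(req["bank"])
    blp = (sum(len(b) for b in banks_per_cycle.values()) / len(banks_per_cycle)
           if banks_per_cycle else 0.0)

    return mpkc, rbl, blp

# Example: two same-row requests to bank 0 in the same cycle over 1000 cycles
# trace = [{"cycle": 5, "bank": 0, "row": 12}, {"cycle": 5, "bank": 0, "row": 12}]
# memory_metrics(trace, cycles=1000) -> (2.0, 0.5, 1.0)
```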
In addition to the high-intensity memory traffic of GPU applications, there are other properties that distinguish GPU applications from CPU applications. Prior work [99] observed that CPU applications with streaming access patterns typically exhibit high RBL but low BLP, while applications with less uniform access patterns typically have low RBL but high BLP. In contrast, GPU applications have both high RBL and high BLP. The combination of high memory intensity, high RBL and high BLP means that the GPU will cause significant interference to other applications across all banks, especially when using a memory scheduling algorithm that preferentially favors requests that result in row buffer hits (e.g., [173, 220]).

Figure 2: Example of the limited visibility of the memory controller. (a) CPU-only information, (b) Memory controller's visibility, (c) Improved visibility. Adapted from [14].

Recent memory scheduling research has focused on memory interference between applications in CPU-only scenarios. These past proposals are built around a single centralized request buffer at each memory controller (MC). The scheduling algorithm implemented in the memory controller analyzes the stream of requests in the centralized request buffer to determine application memory characteristics, decides on a priority for each core, and then enforces these priorities. Observable memory characteristics may include the number of requests that result in row buffer hits, the bank-level parallelism of each core, memory request rates, overall fairness metrics, and other information. Figure 2(a) shows the CPU-only scenario where the request buffer only holds requests from the CPUs. In this case, the memory controller sees a number of requests from the CPUs and has visibility into their memory behavior. On the other hand, when the request buffer is shared between the CPUs and the GPU, as shown in Figure 2(b), the large volume of requests from the GPU occupies a significant fraction of the memory controller's request buffer, thereby limiting the memory controller's visibility of the CPU applications' memory characteristics.

One approach to increasing the memory controller's visibility across a larger window of memory requests is to increase the size of its request buffer. This allows the memory controller to observe more requests from the CPUs to better characterize their memory behavior, as shown in Figure 2(c). For instance, with a large request buffer, the memory controller can identify and service multiple requests from one CPU core to the same row such that they become row buffer hits; however, with a small request buffer as shown in Figure 2(b), the memory controller may not even see these requests at the same time because the GPU's requests have occupied the majority of the entries.

Unfortunately, very large request buffers impose significant implementation challenges, including the die area for the larger structures and the additional circuit complexity for analyzing so many requests, along with the logic needed for assignment and enforcement of priorities [194, 195]. Therefore, while building a very large, centralized memory controller request buffer could perhaps lead to reasonable memory scheduling decisions, the approach is unattractive due to the resulting area, power, timing and complexity costs.

In this work, we propose the Staged Memory Scheduler (SMS), a decentralized architecture for memory scheduling in the context of integrated multi-core CPU-GPU systems. The key idea in SMS is to decouple the various functional tasks of memory controllers and partition these tasks across several simpler hardware structures which operate in a staged fashion.
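As a rough illustration of this staged decomposition — per-source batching on row buffer locality, an inter-application batch scheduler, and in-order per-bank FIFOs — consider the software sketch below. It is a simplified model of the idea only; the class name, interfaces, and the trivial fixed-priority policy are assumptions for illustration, not the actual SMS hardware or scheduling policy from [14].

```python
from collections import deque

class StagedSchedulerSketch:
    """Toy software model of a staged memory scheduler (not the real SMS):
    stage 1 groups each source's requests into row-hit batches,
    stage 2 admits one source's batch at a time (inter-application policy),
    stage 3 drains per-bank FIFOs strictly in order (no reordering)."""

    def __init__(self, sources, num_banks):
        self.forming = {s: [] for s in sources}       # stage 1: batch being formed
        self.ready = {s: deque() for s in sources}    # stage 1: completed batches
        self.bank_fifos = {b: deque() for b in range(num_banks)}  # stage 3

    def enqueue(self, source, row, bank):
        """Stage 1: keep appending while the request targets the same row;
        otherwise close the current batch and start a new one."""
        batch = self.forming[source]
        if batch and batch[-1][0] != row:
            self.ready[source].append(batch)
            batch = []
        batch.append((row, bank))
        self.forming[source] = batch

    def schedule(self, priority_order):
        """Stage 2: pick the oldest ready batch of the highest-priority source
        and push its requests into the per-bank FIFOs."""
        for source in priority_order:
            if self.ready[source]:
                for row, bank in self.ready[source].popleft():
                    self.bank_fifos[bank].append((source, row))
                return source
        return None

    def issue(self, bank):
        """Stage 3: issue the request at the head of this bank's FIFO, if any."""
        fifo = self.bank_fifos[bank]
        return fifo.popleft() if fifo else None
```

The point of the sketch is only the decoupling: the batch scheduler never needs to reason about row buffer locality, and the per-bank FIFOs never reorder commands, which is what allows each structure to remain simple.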