The Effect of High Bandwidth and Low Latency Memory on Computers


Michael Kirzinger

Abstract—Adding high bandwidth and/or low latency memory to a computer results in improved performance. This performance gain is a result of widening the bottleneck caused by the increasing gap in speed between the processor and main system memory. This paper investigates how large a gain can be achieved by widening the bus between the CPU and memory, increasing the memory speed, and reducing the memory latency. Latencies of real modules are used in the simulation to see what is possible using available high performance memory modules.

Index Terms—DDR, DDR2, GDDR3, high bandwidth, low latency, memory

I. INTRODUCTION

One of the main bottlenecks in computer systems today is the speed of the memory available to the processor. This bottleneck can be further divided into two categories: a bottleneck caused by memory latencies, and a bottleneck caused by memory bandwidth.

This paper investigates the effects of increasing the memory bandwidth available to the processor. This is done by increasing the bus width between the memory and the processor and by increasing the speed at which the memory runs. It also compares widening the bus against increasing the data rate for a fixed bandwidth.

The effect of reduced latency is simulated as well, looking at values beyond what is currently available on the market.

II. SIMULATOR

A slightly modified version of the SimpleScalar tool set (a unified diff against SimpleScalar 3v0d is available at http://www.ece.ualberta.ca/~mjk3/ece510/burst_length.diff) was used for all the simulations. One modification was made to the memory system: a new parameter was added to set the minimum and maximum burst length for the memory. This adds one of the differences between DDR and DDR2 (and GDDR3) into the simulation.

III. SIMULATION PARAMETERS

The processor used in the simulation is similar to a socket 754 AMD Athlon 64. It has a 64KB, 2-way set associative L1 data cache[1]. The cache lines are 64 bytes wide and a least recently used replacement policy is used. The L1 instruction cache is the same. The processors are available with either a 512KB or a 1MB L2 cache; a 512KB L2 cache was used for the simulations. A clock speed of 2400 MHz was used.

A second cache configuration, consisting of 32KB L1 instruction and data caches with a 64KB unified L2 cache, was also used in order to stress the memory. This was necessary because there was insufficient time to run simulations on very large data sets. Using a smaller cache should have an effect similar to using a large data set: cache misses caused by insufficient capacity in the cache to hold all the data.

The RAM parameters in SimpleScalar were based upon the CAS and tRCD latencies of Samsung DDR[2], DDR2[3], and GDDR3[4] memory modules. The available burst lengths for DDR are 2, 4, and 8[2]. For DDR2, they are 4 and 8[3]. GDDR3 has burst lengths of 4 and 8[4]. A power of 2 cache line size will always be used, so that the line size always forms a power of 2 ratio to the burst length.

To convert from the CAS and tRCD latencies and the data rate to the SimpleScalar parameters -mem:lat <first_chunk> <inter_chunk>, equations (1) and (2) were used.

    first_chunk = CPU_Clock × (CAS + tRCD) / Memory_Clock    (1)

    inter_chunk = CPU_Clock / Memory_Data_Rate               (2)

For DDR memory, the memory data rate is twice the memory clock. All values for first_chunk and inter_chunk have been rounded up.

A summary of the modules used and their corresponding first and inter chunk values is shown in Table 1.

                 CAS   tRCD   first_chunk   inter_chunk
    DDR 266      2     3      91            10
    DDR 333      2.5   3      80            8
    DDR 400      3     3      72            6
    DDR2 533     4     4      72            5
    DDR2 667     5     5      73            4
    DDR2 800     5     5      60            3
    GDDR3 1000   7     8      72            3
    GDDR3 1200   9     10     76            2
    GDDR3 1400   10    10     69            2
    GDDR3 1600   11    12     69            2
    GDDR3 1800   11    12     62            2

Table 1: Latencies of the various memory modules used

There were 5 different programs used to test the effects of varying the memory parameters:

1) matrix: Multiplies two 100x100 matrices of double precision floating point numbers together.
2) sort: Does a quick sort of 40000 double precision floating point numbers.
3) fft: Performs a complex discrete fast Fourier transform on a 65536 element double precision floating point array, using a general purpose FFT package[5].
4) filter: Runs an 8192 sample double precision signal through a 200 tap finite impulse response filter.
5) alphaBlend: Performs an alpha blend of two 800x600 24 bit bitmaps.

IV. SIMULATION 1: EFFECT OF BANDWIDTH

The five test programs were run through the simulator for each memory module on a 64, 128, 256, and 512 bit memory bus. The results, sorted by bandwidth, for the small cache are shown in Figure 1. Results for the large cache show the same thing, except that the values are closer together because most of the data fits into the cache. Performance is relative to the 128 bit DDR 400. The fft program stresses the memory the most, with an 18.3% spread in performance.

Increasing the bandwidth improved performance in most cases. The small differences in performance between memory configurations with similar bandwidths are due to differences in latency.

The 256 bit DDR2 and GDDR3 configurations perform poorly. This is a result of the cache line being only twice the size of the bus width: the memory is unable to do a burst read of 2, so filling a line takes two expensive reads. At 512 bits, the cache line can be filled in one read. In general, there is only a small improvement in going to a wider bus when the wider bus does not allow the full burst length to be used.

Now that the memory system is being utilized more, the performance gains of a higher bandwidth system become more pronounced. Once again, the 256 bit DDR2/GDDR3 configurations suffer in performance for the same reason as before.

V. SIMULATION 2: EFFECT OF CACHE LINE SIZE ON EQUAL BANDWIDTH MEMORY CONFIGURATIONS

The cache line size affects the performance of the memory, as seen in the 256 bit bus case in the first simulation. Larger cache lines make better use of the memory's speed, but they can also harm performance for applications whose data is not accessed in sequential order.

For this simulation, a constant bandwidth of 12.8 GB per second was used. The memory configurations with this bandwidth are 256 bit DDR 400, 128 bit DDR2 800, and 64 bit GDDR3 1600. These are all at or near the top of their memory types for clock rate and overall latency.

The results are all relative to the 128 bit DDR2 800 with a 64 byte L2 cache line size. The results for the small cache processor are shown in Figure 2. The large cache simulation data shows the same results, except that the differences in performance are smaller.

The wider bus performs better than the faster memory running on a narrower bus. The slower memory ends up sending more data per burst read, while the extra first chunk access times that the faster memory incurs cause it to perform worse than the slower memory on a wider bus.
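The equal-bandwidth claim behind Simulation 2 is simple arithmetic: peak bandwidth is bytes per transfer times transfers per second. A minimal sketch (Python used purely for illustration; the paper itself works in SimpleScalar, and the helper name below is mine):

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_mt_s):
    """Peak transfer rate in decimal GB/s.

    bus_width_bits: width of the memory bus in bits.
    data_rate_mt_s: data rate in megatransfers per second (e.g. 800 for DDR2 800).
    """
    return (bus_width_bits / 8) * data_rate_mt_s / 1000

# The three equal-bandwidth configurations from Simulation 2:
for name, width, rate in [("256 bit DDR 400", 256, 400),
                          ("128 bit DDR2 800", 128, 800),
                          ("64 bit GDDR3 1600", 64, 1600)]:
    print(name, peak_bandwidth_gb_s(width, rate))  # each is 12.8 GB/s
```

This also shows why the GDDR3 1600 configuration is the "faster memory running on a narrower bus": to hold 12.8 GB/s constant, its higher data rate forces the smallest bus of the three.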
The cache line size creates improvements in the performance until …

[Figure 1: Performance of various memory modules in the small cache processor simulation]

[Figure 2: Relative performance of modules with a bandwidth of 12.8 GB per second with varying cache line size]

[Figure 3 (Effect of CAS and tRAS latency, per program: matrix, sort, fft, filter, alphaBlend): The effect of CAS latency using 128bit DDR2 800 memory modules]
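The latency sweep in Figure 3, like Table 1, rests on the conversion in equations (1) and (2) from module timings to SimpleScalar's -mem:lat <first_chunk> <inter_chunk> parameters. A minimal sketch of that conversion (Python for illustration only; the function name is mine, and the round-up matches the paper's stated convention):

```python
import math

def simplescalar_mem_lat(cpu_clock_mhz, memory_clock_mhz, cas, trcd):
    """Equations (1) and (2): first and inter chunk latencies in CPU cycles.

    For all three double data rate types considered (DDR, DDR2, GDDR3),
    the data rate is twice the memory clock; results are rounded up.
    """
    data_rate = 2 * memory_clock_mhz
    first_chunk = math.ceil(cpu_clock_mhz * (cas + trcd) / memory_clock_mhz)
    inter_chunk = math.ceil(cpu_clock_mhz / data_rate)
    return first_chunk, inter_chunk

# Reproducing Table 1 rows at the 2400 MHz simulated clock:
print(simplescalar_mem_lat(2400, 200, 3, 3))    # DDR 400    -> (72, 6)
print(simplescalar_mem_lat(2400, 400, 5, 5))    # DDR2 800   -> (60, 3)
print(simplescalar_mem_lat(2400, 800, 11, 12))  # GDDR3 1600 -> (69, 2)
```

Note how inter_chunk saturates at 2 cycles for the fastest modules: the CPU clock is only 1.5x the GDDR3 1600 data rate, so rounding up hides further data-rate gains, while first_chunk remains dominated by CAS + tRCD.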