Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory

Sung I. Hong, Sally A. McKee†, Maximo H. Salinas, Robert H. Klenke, James H. Aylor, Wm. A. Wulf

Dept. of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22903
†Dept. of Computer Science, University of Utah, Salt Lake City, Utah 84112

Sung Hong's current address: Lockheed Martin Federal Systems, 9500 Godwin Dr., Manassas, VA 20110, [email protected].

Abstract

Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6 Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits 44-76% of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25.

1. Introduction

As processors continue to become faster and to consume more bandwidth, conventional DRAM memory systems will have more and more difficulty keeping up. The kinds of applications that are particularly affected by the growing processor-memory performance gap include scientific computations, multi-media codecs, encryption, signal processing, and text searching. Although data caches perform well for some access patterns, the vectors used in these streaming computations are normally much too large to cache in their entirety, and each element is typically visited only once during lengthy portions of the computation. This lack of temporal locality of reference makes caching less effective, and performance becomes limited by the speed of the memory system.

The new Direct Rambus DRAMs (RDRAMs) propose to bridge the current performance gap with a pipelined microarchitecture that allows direct control of all DRAM row and column resources concurrently with data transfer operations [21]. The RDRAM memory architecture merits study for several reasons: its interface, architecture, and timing are unique; it advertises a peak bandwidth of 1.6 Gbytes/sec, a significant improvement over that of other currently available memory devices; all of the top thirteen DRAM suppliers are actively developing Direct RDRAMs; and Intel has selected the Direct Rambus technology to become its next PC main-memory standard [7].

Like nearly all modern DRAMs, Direct RDRAMs implement a form of page mode operation. In page mode, memory devices behave as if implemented with a single line, or page, of cache on chip. A memory access falling outside the address range of the current DRAM page forces a new page to be accessed. The overhead time required to do this makes servicing such a request significantly slower than one that hits the current page. The order of requests affects the performance of all such components. Access order also affects bus utilization and how well the available parallelism can be exploited in memories with multiple banks.

These three observations (the inefficiency of traditional, dynamic caching for streaming computations; the high advertised bandwidth of Direct Rambus DRAMs; and the order-sensitive performance of modern DRAMs) motivated our investigation of a hardware streaming mechanism that dynamically reorders memory accesses in a Rambus-based memory system.

This paper explains how the details of the Direct Rambus interface affect sustained, streaming accesses, and presents analytic and simulation results for inner loops of streaming kernels performed on a memory system composed of a single Direct RDRAM device. We evaluate two memory interleaving schemes, deriving bounds on the percentage of available memory bandwidth exploited when accessing streams via (a) cacheline accesses in the natural order of the computation, and (b) streaming hardware that dynamically reorders accesses. We find that the former approach generally fails to exploit much of the potential memory bandwidth. Adding hardware support for streaming improves the performance of these Direct RDRAM systems, allowing computations on streams of a thousand or more elements to utilize nearly all of the available memory bandwidth. A system with streaming support outperforms a traditional Rambus system by factors of up to 2.25 for stride one and 2.20 for strides bigger than a cacheline.

2. Background

To put the analysis and results in Section 5 and Section 6 in perspective, this section describes basic Dynamic Random Access Memory (DRAM) organization and operation, compares and contrasts the timing parameters of several types of current DRAMs, and explains how the new Direct RDRAMs work.

2.1 DRAM basics

DRAM storage cell arrays are typically rectangular, and thus a data access sequence consists of a row access (RAS, or row address strobe signal) followed by one or more column accesses (CAS, or column address strobe signal). During RAS, the row address is presented to the DRAM. In page mode, data in the storage cells of the decoded row are moved into a bank of sense amplifiers (the "sense amps" or page buffer), which serves as a row cache. During CAS, the column address is decoded and the selected data is read from the sense amps. Consecutive accesses to the current row, called page hits, require only a CAS, allowing data to be accessed at the maximum frequency.

The key timing parameters used to analyze DRAM performance include the row-access time (tRAC), column-access time (tCAC), page-mode cycle time (tPC), and random read/write cycle time (tRC). Typical timing parameter values for various common DRAMs are given in Figure 1 [14][21]. Extended Data Out (EDO) DRAMs are similar to fast-page mode DRAMs, except that data buffers are added between the column decoder and the input/output buffer. These buffers permit data transfer to extend beyond the CAS signal de-assertion, allowing a faster page-cycle time. Burst-EDO DRAMs transfer larger blocks of data by incorporating an internal counter. After the memory controller transfers the initial address, this internal counter generates the subsequent memory addresses in the block. SDRAMs synchronize all inputs and outputs to a system clock, allowing an even faster page-cycle time.

                   Fast-Page Mode   EDO   Burst-EDO   SDRAM   Direct RDRAM
tRAC (nsec)        50               50    52          50      50
tCAC (nsec)        13               13    10          9       20
tRC (nsec)         95               89    90          100     85
tPC (nsec)         30               20    15          10      10*
max freq. (MHz)    33               50    66          100     400

*The packet transfer time, since tPC doesn't apply here.

Figure 1 Typical DRAM timing parameters

2.2 Rambus DRAMs

Although the memory core (the banks and sense amps) of RDRAMs is similar to that of other DRAMs, the architecture and interface are unique. An RDRAM is actually an interleaved memory system integrated onto a single memory chip. Its pipelined microarchitecture supports up to four outstanding requests. Currently, all 64 Mbit RDRAMs incorporate at least eight independent banks of memory. Some RDRAM cores incorporate 16 banks in a "double bank" architecture, but two adjacent banks cannot be accessed simultaneously, making the total number of independent banks effectively eight [20].

First-generation Base RDRAMs use a 64-bit or 72-bit internal bus and a 64-to-8 or 72-to-9 bit multiplexer to deliver bandwidth of 500 to 600 Mbytes/sec. Second-generation Concurrent RDRAMs deliver the same peak bandwidth, but an improved protocol allows better bandwidth utilization by handling multiple concurrent transactions. Current, third-generation Direct RDRAMs double the external data bus width from 8/9-bits to 16/18-bits and increase the clock frequency from 250/300 MHz to 400 MHz.

... initiate a precharge operation. The smallest addressable data size is 128 bits (two 64-bit stream elements). The full memory bandwidth cannot be utilized unless all words in a DATA packet are used. Note the distinction between the RDRAM transfer rate (800 MHz), the RDRAM interface clock rate (400 MHz), and the packet transfer rate (100 MHz). All references to cycles in the following sections are in terms of the 400 MHz interface clock.

Timings for a Min -50 -800 Direct RDRAM Part*

Parameter   Cycles       Time      Definition
tCYCLE      -            2.5 ns    interface clock cycle time (400 MHz)
tPACK       4 tCYCLE     10 ns     packet transfer time
tRCD        11 tCYCLE    27.5 ns   min interval between ROW & COL packets
tRP         10 tCYCLE    25 ns     page precharge time: min interval between ROW precharge (PRER) & activate (ACT) packets
tCPOL       1 tCYCLE     2.5 ns    column/precharge overlap: max overlap between last COL packet & start of row PRER
tCAC        8 tCYCLE     20 ns     page hit latency: delay between start of COL packet & valid data
tRAC        20 tCYCLE    50 ns     page miss latency: delay between start of ROW ACT request & valid data (tRCD + tCAC + 1 extra cycle)
tRC         34 tCYCLE    85 ns     page miss cycle time: min interval between successive ROW ACT requests (random read/write cycle time for single device)
tRR         8 tCYCLE     20 ns     row/row packet delay: min delay between consecutive ROW accesses to the same RDRAM device
tRDLY       2 tCYCLE     5 ns      roundtrip bus delay: latency between start of COL packet & valid data (added to read page-hit times, since DATA packet travels in opposite direction of commands; no delay for writes)
tRW         6 tCYCLE     15 ns     read/write bus turnaround: interval between writing & reading (tPACK + tRDLY)

*These are the key parameters that affect our study. For complete timing parameters and most recent data sheets, see http://www.rambus.com/.

Figure 2 RDRAM timing parameter definitions
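The relationships among the Figure 2 parameters, and between the three clock rates noted above, can be checked with a few lines of arithmetic. The following Python fragment is a minimal sketch for illustration only, not part of the simulation infrastructure used in this study. The cycle counts are copied from Figure 2; the function and constant names are hypothetical, and the 16-bit data width assumes that only the data bits of the 16/18-bit external bus carry payload.

```python
# Illustrative arithmetic only; cycle counts are copied from Figure 2.
# All names below are hypothetical helpers, not part of any Rambus tool.

CLOCK_MHZ = 400           # RDRAM interface clock rate
TRANSFER_RATE_MHZ = 800   # RDRAM transfer rate (two transfers per clock)
DATA_BUS_BITS = 16        # assumption: data bits of the 16/18-bit bus
T_CYCLE_NS = 2.5          # interface clock cycle time at 400 MHz

# Timing parameters in interface clock cycles, as listed in Figure 2.
CYCLES = {
    "tPACK": 4, "tRCD": 11, "tRP": 10, "tCPOL": 1, "tCAC": 8,
    "tRAC": 20, "tRC": 34, "tRR": 8, "tRDLY": 2, "tRW": 6,
}


def to_ns(cycles):
    """Convert interface clock cycles to nanoseconds."""
    return cycles * T_CYCLE_NS


def derived_trac():
    """Page miss latency: tRCD + tCAC + 1 extra cycle."""
    return CYCLES["tRCD"] + CYCLES["tCAC"] + 1


def derived_trw():
    """Read/write bus turnaround: tPACK + tRDLY."""
    return CYCLES["tPACK"] + CYCLES["tRDLY"]


def peak_bandwidth_mbytes_per_sec():
    """Peak rate: transfers per microsecond times bytes per transfer."""
    return TRANSFER_RATE_MHZ * DATA_BUS_BITS // 8


def data_packet_bits():
    """Bits moved in one packet time at the peak transfer rate."""
    transfers_per_cycle = TRANSFER_RATE_MHZ // CLOCK_MHZ
    return CYCLES["tPACK"] * transfers_per_cycle * DATA_BUS_BITS


def packet_rate_mhz():
    """Packets per microsecond: one packet every tPACK cycles."""
    return CLOCK_MHZ / CYCLES["tPACK"]


if __name__ == "__main__":
    print("tRAC (cycles, ns):", derived_trac(), to_ns(derived_trac()))
    print("tRW  (cycles, ns):", derived_trw(), to_ns(derived_trw()))
    print("peak bandwidth (Mbytes/sec):", peak_bandwidth_mbytes_per_sec())
    print("DATA packet size (bits):", data_packet_bits())
    print("packet rate (MHz):", packet_rate_mhz())
```

Under these assumptions the derived values agree with the text: a 128-bit DATA packet every 10 ns corresponds to the advertised 1.6 Gbytes/sec peak, and the derived tRAC and tRW match the cycle counts listed in Figure 2.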