A Hybrid Analytical DRAM Performance Model

George L. Yuan and Tor M. Aamodt
Department of Electrical and Computer Engineering
University of British Columbia, Vancouver, BC, Canada
{gyuan,aamodt}@ece.ubc.ca

Abstract

As process technology scales, the number of transistors that can fit in a unit area has increased exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad hoc analytical modeling techniques in lieu of cycle-accurate performance simulation to identify critical design points. Moreover, analytical models can provide clear mathematical relationships for how system performance is affected by individual microarchitectural parameters, something that may be difficult to obtain with a detailed performance simulator. Modern DRAM controllers rely on out-of-order scheduling policies to increase row access locality and decrease the overheads of timing constraint delays. This paper proposes a hybrid analytical DRAM performance model that uses memory request address traces to predict the DRAM efficiency of a DRAM system when using such a memory scheduling policy. To stress our model, we use a massively multithreaded architecture based upon contemporary GPUs to generate our memory address trace. We test our techniques on a set of real CUDA applications and find our hybrid analytical model predicts the DRAM efficiency to within 15.2% absolute error when arithmetically averaged across all applications.

1. INTRODUCTION

As the gap between memory speed and microprocessor speed increases, efficient dynamic random access memory (DRAM) system design is becoming increasingly important as it becomes an increasingly limiting bottleneck. This is especially true in General Purpose Applications for Graphics Process Units (GPGPU), where applications do not necessarily make use of caches, which are both limited in size and capabilities relative to the processing throughput of GPUs. (As of NVIDIA's CUDA Programming Framework version 2, all caches are read-only [17].)

Modern memory systems can no longer be treated as having a fixed long latency. Figure 1 shows the measured latency for different applications when run on our performance simulator, GPGPU-Sim [3], averaged across all memory requests. As shown, average memory latency can vary greatly across different applications, depending on their memory access behavior. In the massively multithreaded GPU Compute architecture that we simulate, threads are scheduled to Single-Instruction Multiple-Data pipelines in groups of 32 called warps, each thread of which is capable of generating a memory request to any location in GPU main memory. (This can be done as long as the limit on the number of in-flight memory requests per GPU shader core has not been reached. In CPUs, this limit is determined by the number of Miss Status Holding Registers (MSHRs) [11].) In this microarchitecture, a warp cannot proceed as long as any single thread in it is still blocked by a memory access, making DRAM an even bigger bottleneck. As such, DRAM must be properly modeled to provide useful insight for microprocessor and memory designers alike.

[Figure 1: Average memory latency measured in performance simulation for each benchmark (BS, FWT, DG, LPS, LIB, MMC, MUM, NN, RAY, RED, SP, SLA, TRA, WP).]

Few analytical models for DRAM have been proposed before [2, 23], with Ahn et al.'s being the most recent [2]. Their first-order analytical DRAM model predicts the achievable bandwidth given a uniform memory access pattern with a constant fixed row access locality. (Row access locality is defined as the number of requests serviced in a DRAM row before the row is closed and a new row is opened. More details on modern DRAM systems are given in Section 2.) While this model does not try to handle any non-regular memory access behavior (behavior exhibited in our applications), it does show good accuracy for the microbenchmarks it considers, indicating that analytical modeling of DRAM in more demanding scenarios is promising.

In this paper, we introduce a hybrid analytical DRAM performance model that predicts DRAM efficiency, which is closely related to DRAM utilization. DRAM utilization is the percentage of time that a DRAM chip is transferring data across its data pins over a given period of time. DRAM efficiency is the percentage of time that a DRAM chip is transferring data across its data pins over a period of time in which the DRAM is not idle. In other words, only time when the DRAM has work to do is considered when calculating DRAM efficiency. In essence, both these metrics give a notion of how much useful work is being done by the DRAM.
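To make the distinction concrete, the following Python fragment computes both metrics from cycle counts; the numbers are hypothetical, chosen only to illustrate the definitions.

    # Hypothetical cycle counts over one observation window.
    T = 10000          # total observation period, in DRAM cycles
    T_active = 6000    # cycles in which the DRAM was transferring data or had
                       # requests waiting in its memory controller queue
    T_busy = 4200      # cycles in which data was actually on the data pins

    utilization = T_busy / T         # useful work over all time:      42.0%
    efficiency = T_busy / T_active   # useful work over non-idle time: 70.0%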

Our analytical model is "hybrid" [22] in the sense that it uses equations derived analytically in conjunction with a highly simplified simulation of DRAM request scheduling.

In order to keep up with high memory demands, modern DRAM chips are banked to increase efficiency by allowing requests to a ready bank to be serviced while other banks are unavailable. To use these banks more effectively, modern memory controllers re-order the requests waiting in the

memory controller queue [19]. Modeling the effects of the memory controller is thus crucial to accurately predicting DRAM efficiency.

[Figure 2: Memory system organization. Processor cores connect through an on-chip interconnection network to memory controllers, each of which drives off-chip DRAM chips composed of multiple banks with shared sense amplifiers, bus drivers, and address/control logic.]

This paper makes the following contributions:

• It presents a novel DRAM hybrid analytical model to model the overlapping of DRAM timing constraint delays of one bank by servicing requests to other banks, given an aggressive First-Ready First-Come First-Serve (FR-FCFS) [19] memory scheduler that re-orders memory requests to increase row access locality.

• It presents a method to use this hybrid analytical model to predict DRAM efficiency over the runtime of an application by profiling a memory request address trace. By doing so, it achieves an arithmetic mean of the absolute error across all benchmarks of 15.2%. (We use the arithmetic mean of the absolute value of error to validate the accuracy of our analytical model, which Chen et al. [6] argue is the correct measure since it always reports the largest error numbers and is thus conservative in not understating said errors.)

• It presents data for a set of applications simulated on a massively multithreaded architecture using GPGPU-Sim [3]. To the best of our knowledge, this is the first work that attempts to use an analytical DRAM model to predict the DRAM efficiency for a set of real applications.

The rest of this paper is organized as follows. Section 2 reviews the mechanics of the GDDR3 memory [21] that we are modeling. Section 3 introduces our first-order hybrid analytical DRAM model, capable of predicting DRAM efficiency for an arbitrary memory access pattern, and describes the two different heuristics of the algorithm we use in our results. Section 4 describes the experimental methodology and Section 5 presents and analyzes our results. Section 6 reviews related work and Section 7 concludes the paper.

2. BACKGROUND

Before explaining the details of our techniques introduced in Section 3, it is necessary to be familiar with the background of modern DRAM and the mechanics of the memory that we are modeling, GDDR3 [21], including the different timing constraints and various other terminology. A summary is provided in Table 2.

GDDR3-SDRAM is a modern graphics-specific DRAM architecture that improves upon older DRAM technology used in graphics processors, like DDR-SDRAM [16], by increasing clock speeds and lowering power requirements. Like DDR-SDRAM, it uses a double data rate (DDR) interface, meaning that it transfers data across its pins on both the positive and negative edge of the bus clock. More specific to the particular GDDR3 technology that we study, a 4n-prefetch architecture is used, meaning that a single read or write access consists of 4 bits transferred across each data pin over 2 cycles, or 4 half-cycles. (This is more commonly known as the burst length, BL.) For a 32-bit wide interface, this translates to a 16 byte transfer per access for a single GDDR3 chip.

Similar to other modern DRAM technology, GDDR3-SDRAM is composed of a number of memory banks. Each bank is composed of a 2D array that is addressed with a row address and a column address, both of which share the same address pins to reduce the pin count. In a typical memory access, a row address is first provided to a bank using an activate command that activates all of the memory cells in, or "opens", the row, connecting them using long wires to the sense amplifiers, which are subsequently connected to the data pins. A key property of the sense amplifiers is that, once they detect the values of the memory cells in a row, they hold on to the values so that subsequent accesses to the row do not force the sense amplifiers to re-read the row. This property allows intelligent memory controllers to schedule requests out of order to take advantage of this "row access locality" (as defined in Section 1) to improve efficiency. The organization of such a memory system is shown in Figure 2.

The process of "opening the row," which must be done before a column can be requested, takes a significant amount of time called the row address to column address delay, or tRCD [21]. After this delay, the column address can then be provided to the bank. The process of reading the column address and choosing the data in the specific column to be output across the data pins is not instantaneous, instead taking an amount of time called the column access strobe latency, or CL, from when the column address is provided to when the first bits of data appear on the data pins. However, since all of the memory cells in the opened row are connected to sense amplifiers, multiple column addresses can be provided back-to-back, limited only by the column to column delay, tCCD, and the operations will be pipelined. Once the initial CL cost has been paid, a bank can provide a continuous stream of data as long as all accesses are to the opened row.

If we want to access data in one row and another row is opened in the same bank, we must close the opened row by disconnecting it from the long wires connecting it to the sense amplifiers. Before these long wires, which represent a significant capacitance, can be connected to the new row, they must first be precharged in order for the sense amplifiers to work properly. This process is defined as the row precharge delay, or tRP [7]. To summarize, in order to service a request to a new row when a different row has already been opened, we must first close the row (by issuing a precharge command) and open the new row (by issuing an activate). This process takes an amount of time tRP + tRCD.

Furthermore, in a single bank, there is a minimum time required between issuing an activate command and then issuing a precharge (in other words, between opening and closing the row). This minimum time is defined as the row access strobe latency, tRAS, and is greater than tRCD. This is because tRAS also encapsulates the process of restoring the data from the sense amplifiers to the corresponding row cells, or "refreshing." (Unlike static RAM, which retains the values in its memory cells as long as it is provided power, dynamic RAM must be periodically refreshed since the bit values are stored in capacitors which leak charge over time [7].) Opening a new row too quickly may not allow adequate time to perform the refresh [23]. Accordingly, the minimum time required between issuing two successive activate commands to a single bank is tRAS + tRP, which is defined as tRC.

In addition to these timing constraints, there is also tRRD, the minimum time needed between successive activate commands to different banks, and tWTR and tRTW, the minimum times needed between a write and a read command or a read and a write command, respectively, since the data bus is bi-directional and requires time to switch modes.

In the specific GDDR3 technology that we study, each row must be refreshed every 32 ms, taking 8192 DRAM cycles each time. With an 800 MHz DRAM clock and 4096 rows per bank, this amounts to each bank having to perform refresh 4.2% of the time on average, assuming the DRAM is idle and all rows contain data [21]. As mentioned previously, the process of accessing a row also refreshes the data in the row, meaning only rows of unaccessed data need to be actively refreshed, possibly reducing the refreshing overhead from 4.2%. Due to this relatively small overhead, we model the effects of refresh on performance in neither our performance simulator nor our analytical model.
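The timing parameters above compose into the derived quantities used throughout this paper. The following Python snippet, a sketch for illustration only, records the Table 2 values and checks the relationships stated in this section.

    # GDDR3 timing parameters from Table 2, in DRAM clock cycles.
    timing = {
        "tCCD": 2,   # gap between successive reads or successive writes
        "tRRD": 8,   # minimum gap between activates to different banks
        "tRAS": 21,  # minimum time between opening and closing a row
        "tRCD": 12,  # activate to first read/write command
        "tRC":  34,  # minimum gap between activates to the same bank
        "tWTR": 5,   # write-to-read turnaround
        "tRP":  13,  # row precharge delay
        "CL":   9,   # column access strobe latency
    }

    # tRC is defined as tRAS + tRP (Section 2).
    assert timing["tRC"] == timing["tRAS"] + timing["tRP"]

    # Cost of switching rows in one bank: precharge then activate
    # (25 cycles), which is less than tRC, so tRC dominates the
    # denominator of Equation 1 introduced in Section 3.
    row_switch_cost = timing["tRP"] + timing["tRCD"]
    assert row_switch_cost < timing["tRC"]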
3. MODELING GDDR3 MEMORY

In this section, we describe our analytical model of GDDR3 memory. First, we present a more formal definition of DRAM efficiency in Section 3.1. Then, we present our baseline analytical model for predicting DRAM efficiency in Section 3.2. Finally, we show how to implement our model using a sliding window profiling technique on a memory request address trace in Section 3.3.

3.1 DRAM Efficiency

We first define DRAM utilization as the percentage of time that a DRAM chip is transferring data across its bus over an arbitrary length of time, T. We also define DRAM efficiency as the percentage of time that the DRAM chip is transferring data across its bus over only the period of time when the DRAM is active (Tactive). We say that a DRAM chip is active when it is either actively transferring data across its bus or when there are memory requests waiting in the memory controller queue for this DRAM and it is waiting to service the requests but cannot due to any of the timing constraints mentioned in Section 2. If the DRAM is always active, the DRAM utilization and DRAM efficiency metrics will evaluate to the same value for the same period of time (i.e., T = Tactive). While 100% DRAM efficiency and utilization are theoretically achievable, less-than-maximum throughput can occur for a variety of different reasons: there may be poor row access locality among the requests waiting in the memory controller queue, resulting in the DRAM having to pay the overhead cost of constantly closing and reopening different rows too frequently, such that the few requests per open row are not enough to hide the aforementioned overheads; the memory controller may also be causing the DRAM to frequently alternate between servicing reads and writes, thus having to pay the overhead cost of tWTR and tRTW each time; and, specific only to DRAM utilization, the memory controller queue may simply be empty, forcing the DRAM to sit idle.

3.2 Hybrid Analytical Model

As described in the previous section, both our DRAM utilization and DRAM efficiency metrics are essentially fractions. We use these definitions as the basis of our analytical model, replacing the numerators and denominators with an expression of variables that can be derived by analyzing a collected memory request address trace.

We develop our analytical model by first considering requests to a single bank j for a time period T, which starts from when a request to bank j must open a new row and ends when the last request to the new row in bank j present in the memory controller queue at the start of the period has been serviced. We use i to denote the requests to any bank of B banks, not just those to bank j.

    Eff_j = \frac{\mathrm{MIN}\left[\,\mathrm{MAX}(t_{RC},\ t_{RP} + t_{RCD} + t_j),\ \sum_{i=0}^{B-1} t_i\,\right]}{\mathrm{MAX}(t_{RC},\ t_{RP} + t_{RCD} + t_j)}    (1)

where

    t_i = (\text{number of requests to bank } i) \times T_{srvc\_req}    (2)

and

    T_{srvc\_req} = \frac{request\_size}{DRAMs\_per\_MC \times busW \times data\_rate}    (3)

To aid our description, we provide the example shown in Figure 3, which will be referred to as we elaborate our model.

The principle underlying Equation 1 is that a bank must first wait at least tRC before switching to a new row. This overhead can be hidden if there are requests to other banks that are to opened rows. In Figure 3, this is the case. In this example, j is 0 and the requests to banks 1, 2, and 3 help hide the precharge and activate delays.
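Equations 1 through 3 translate directly into code. The sketch below is our illustrative Python rendering (not the authors' implementation); parameter names mirror the equation symbols.

    def t_srvc_req(request_size, drams_per_mc, bus_width, data_rate):
        # Equation 3: cycles of data transfer contributed by one request.
        return request_size / (drams_per_mc * bus_width * data_rate)

    def period_efficiency(reqs_per_bank, j, t_rc, t_rp, t_rcd, t_service):
        # Equation 2: t_i for each bank i is its request count times the
        # per-request service time.
        t = [n * t_service for n in reqs_per_bank]
        # Equation 1: the row-switch time of bank j, possibly hidden by
        # transfers from all banks (including bank j itself), capped at 1.
        denominator = max(t_rc, t_rp + t_rcd + t[j])
        return min(denominator, sum(t)) / denominator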

[Figure 3: Timing diagram example. The command bus issues precharge (P), activate (A), and read (R) commands tagged with their destination banks. While Bank 0 precharges and activates its new row, reads to Banks 1 through 3 keep the data bus busy, and the reads to Bank 0 complete at the end of the period T.]

Basing our model on our defined metrics, we set the denominator of Equation 1 as the period of time starting from when a particular bank of a DRAM chip begins servicing a request to a new row (by first precharging, assuming a different row is opened, and then issuing an activate to open the new row) and ending when the bank issues a precharge to begin servicing a new request to a different row. To simplify our definition of T in this case, our model assumes that a precharge command is not issued until its completion can be immediately followed by an activate to the same bank. Under this assumption, the delay between successive precharge commands is constrained by the same delay as the delay between successive activate commands, tRC. This is shown as T in Figure 3.

The denominator is controlled by two different sets of factors, depending on how many requests need to be serviced in the new row. This is because switching to a new row is primarily constrained by the three timing parameters described in Section 2: tRC (row cycle time), tRP (row precharge delay), and tRCD (row address to column address delay). In the GDDR3 specification that we use, tRC is greater than the sum of tRP and tRCD, so the denominator must always be greater than or equal to tRC. This is expressed in our model using the MAX() function. In our example, since there is only one request (2 read commands) to bank 0, the denominator is dominated by tRC, so T equals tRC.

Given our microarchitectural parameters outlined in Table 1 and Table 2, our read and write requests take four "DRAM clock" cycles to complete, where each request is comprised of two commands.

When DRAM efficiency is less than 100%, the numerator in Equation 1 is the sum of all ti, where ti is defined as the product of two terms: the number of requests that can be serviced by bank i over the period of time defined in the denominator, multiplied by the time it takes to service each request (Tsrvc_req). Note that the sum of all ti also includes tj. We assume that all memory requests are for the same amount of data. The formula for calculating how much a single request to bank i contributes to ti is shown in Equation 3. Here, request_size is the size of the memory request in bytes, DRAMs_per_MC is the number of DRAM chips controlled by each memory controller (e.g., having two DRAM chips connected in parallel doubles the effective bandwidth per memory controller, meaning each single chip only needs to transfer half the data per request), busW is the bus width in bytes, and data_rate is the number of transfers per cycle, 2 for GDDR3's double data rate interface [21].

To obtain the number of requests to bank i, we count the requests waiting in the memory controller queue that are to the row currently open for the corresponding bank i. This is equivalent to doing so when the memory controller first chooses the request to the new row of bank j (in other words, right before it needs to precharge and activate). Requests to unopened rows are not counted because they must wait at least tRC before they can begin to be serviced, by which time they can no longer be used to hide the timing constraint overhead for bank j. Since requests to the new row in bank j itself cannot be used to hide the latency of precharging and activating the new row, tj appears both in the denominator and in the numerator as one of the terms of the sum of ti (when i = j). In our example, there are 2 read commands to bank 0 and 10 read commands issued to banks 1 through 3, so here tj = 2 * 2 = 4 and the sum of ti = 2 * (2 + 10) = 24. Therefore, for the time period T, Eff0 = 24/34 = 70.6%.

As previously described, ti is used to determine how much of the row switching overhead can be hidden by row-hit requests to other banks. With good DRAM row access locality, there may be more such requests than necessary, causing the sum of ti to be greater than the denominator in Equation 1. Using the MIN() function to take the minimum of the sum of ti and the expression identical to the denominator simply imposes a ceiling of unity on our predicted result, which we argue is accurate since a sum of ti greater than MAX(tRC, tRP + tRCD + tj) means that there are more than enough requests to completely hide the timing constraint delays imposed by opening the new row in bank j.

Moreover, the minimum granularity of a memory access, ReqGranularity, which determines how many read or write commands need to be issued per request, is defined in Equation 4:

    ReqGranularity = DRAMs\_per\_MC \times busW \times BL    (4)

Equation 4 depends on a parameter unused in the previous equations called the burst length, or BL. Our example corresponds to the baseline configuration described in Section 4, where each request is comprised of 2 read commands to a DRAM chip (2 read commands * 2 DRAM chips per memory controller * 4B bus width * burst length 4 = 64B request size in our baseline configuration).

As stated before, this equation assumes T = Tactive. In other words, it does not take into account the amount of time when the DRAM is inactive; therefore, it is more suitable for predicting DRAM efficiency than DRAM utilization. We expect that provided memory request address traces will not necessarily contain this sort of timing information, in which case we are limited to predicting only the DRAM efficiency. In order to find the DRAM efficiency over the entire runtime of a program, we sum the numerators obtained using Equation 1 over the time periods that make up the runtime of the program and divide that sum by the sum of the corresponding denominators:

    Eff_j = \frac{\sum_{n=1}^{N} \mathrm{MIN}\left[\,\mathrm{MAX}(t_{RC},\ t_{RP} + t_{RCD} + t_{j,n}),\ \sum_{i=0}^{B-1} t_{i,n}\,\right]}{\sum_{n=1}^{N} \mathrm{MAX}(t_{RC},\ t_{RP} + t_{RCD} + t_{j,n})}    (5)

Equation 5 shows the DRAM efficiency calculation for a program whose entire runtime is comprised of N periods, where the DRAM efficiency of each period can be calculated using Equation 1. How we determine which time periods to use is explained next in Section 3.3, where we detail how we use Equation 1 with a memory address trace to determine the DRAM efficiency.
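The worked example above can be verified in a few lines, using the baseline configuration of Section 4 (64-byte requests, two DRAM chips per controller, 4-byte bus, burst length 4 transfers, double data rate); a sketch for checking only.

    # Equation 4: minimum granularity of one read or write command.
    drams_per_mc, bus_width, burst_length = 2, 4, 4
    req_granularity = drams_per_mc * bus_width * burst_length     # 32 bytes
    commands_per_request = 64 // req_granularity                  # 2 commands

    # Equation 3: each 64-byte request occupies the data pins for 4 cycles.
    t_service = 64 / (drams_per_mc * bus_width * 2)               # 4.0 cycles

    # Figure 3: one request to bank 0 and five requests to banks 1-3.
    t_rc, t_rp, t_rcd = 34, 13, 12
    t_j = 1 * t_service                                           # 4 cycles
    total = 6 * t_service                                         # 24 cycles
    denom = max(t_rc, t_rp + t_rcd + t_j)                         # 34 cycles
    eff_0 = min(denom, total) / denom
    print(f"Eff_0 = {eff_0:.1%}")                                 # 70.6%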

[Figure 4: Sliding window profiling technique example. (a) The initial trace, with Row A open in Bank 0 and Row X open in Bank 1. (b) The window after servicing all requests to the opened rows, at which point the window holds four requests and is full. (c) Under no activate overlap, only the row of the next oldest request is opened (Bank 0 switches to Row B). (d) Under full activate overlap, the oldest request to each bank opens its row (Bank 1 also switches to Row Y).]

Figure 5 gives the sliding window profiling algorithm for processing memory request address traces:

    Start: Reset window_size to 0
    while (window_size < memory controller queue size) {
        // while memory controller queue is not full
        Read in newest request, req, to sliding window
        if (req.row == opened_row[req.bank]) {
            // request is to the opened row, so service it
            // (first-ready first-come first-serve policy)
            remove req from window and from trace
            t[req.bank] += T_srvc_req    // update t
        } else {
            // request is to a different row, so store it in the queue and
            // check if later requests can be serviced first
            window_size++    // add this req to the window
        }
    }
    Calculate_efficiency(t)    // uses the t[i] values to calculate efficiency
    Reset(t)
    if (mode == no_activate_overlap) {
        // find the oldest request, oldest_req
        opened_row[oldest_req.bank] = oldest_req.row
    }
    if (mode == full_activate_overlap) {
        // find the oldest request to each bank, oldest_req[]
        for (all i in banks) {
            opened_row[oldest_req[i].bank] = oldest_req[i].row
        }
    }
    Go back to Start
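For concreteness, the following is a runnable Python transcription of the Figure 5 algorithm as we have described it; it is an illustrative sketch rather than the authors' code. The trace is a list of (bank, row) pairs, T_srvc_req is a fixed per-request constant, and the row-selection step is parameterized so that the "Most Pending" variant discussed in Section 5.3 becomes a small change rather than a rewrite.

    from collections import deque

    def profile_trace(trace, queue_size, num_banks, t_service,
                      full_activate_overlap=False, most_pending=False,
                      initial_rows=None):
        """Sliding-window profiling of a (bank, row) request trace.
        Returns the per-bank t values (Equation 2) for each prediction period."""
        trace = deque(trace)
        window = []          # queued row-miss requests, oldest first
        opened_row = list(initial_rows) if initial_rows else [None] * num_banks
        t = [0.0] * num_banks
        periods = []

        while trace or window:
            # Service queued requests that hit the (newly) opened rows.
            still_waiting = []
            for bank, row in window:
                if row == opened_row[bank]:
                    t[bank] += t_service
                else:
                    still_waiting.append((bank, row))
            window = still_waiting
            # Read from the trace until the controller queue is full,
            # servicing row hits immediately (first-ready first-come
            # first-serve) and queueing row misses.
            while len(window) < queue_size and trace:
                bank, row = trace.popleft()
                if row == opened_row[bank]:
                    t[bank] += t_service
                else:
                    window.append((bank, row))

            periods.append(list(t))   # inputs to Equation 1 for this period
            t = [0.0] * num_banks
            if not window:
                continue

            if full_activate_overlap:
                # Heuristic 2: open the row of the oldest request to every
                # bank, assuming all row switches overlap.
                seen = set()
                for bank, row in window:
                    if bank not in seen:
                        opened_row[bank] = row
                        seen.add(bank)
            elif most_pending:
                # "Most Pending" (Section 5.3): open the row with the most
                # queued requests instead of the oldest request's row.
                bank, row = max(window, key=window.count)
                opened_row[bank] = row
            else:
                # Heuristic 1 (FR-FCFS): open only the oldest request's row.
                bank, row = window[0]
                opened_row[bank] = row

        return periods

Calling profile_trace with the Figure 4 trace, queue_size=4, num_banks=2, t_service=4.0, and initial_rows=["A", "X"] walks through the same servicing steps as the example in Section 3.3; each element of the returned list feeds Equation 1 for one prediction period.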

3.3 Processing Address Traces

We assume that the memory request address trace is captured at the point where the interconnection network from the processor feeds the requests into the memory controller queue, allowing the order in which the requests are inserted into the memory controller to be preserved. We assume that the trace is otherwise devoid of any sort of timing information, such as the time when requests enter the queue.

Consider a simple case of a memory controller with a queue size of 4 and a DRAM with only two banks, Bank 0 and Bank 1, where each bank has only two rows (Row A and Row B for Bank 0, and Row X and Row Y for Bank 1). Now assume the memory address trace shown in Figure 4(a), where the top request (Bank 0 Row A) is the oldest.

To process the trace, we use the algorithm in Figure 5 to determine the ti values needed for Equation 1. Figure 5 shows two different heuristics whose accuracy we quantify in Section 5. The first heuristic, which we call profiling + no activate overlap, is guarded by the "if (mode == no_activate_overlap)" statement, and the second heuristic, which we call profiling + full activate overlap, is guarded by the "if (mode == full_activate_overlap)" statement. While this algorithm applies only to First-Ready First-Come First-Serve (FR-FCFS), it can be modified to handle other memory scheduling algorithms. In Section 5.3, we modify our algorithm to predict the DRAM efficiency of another out-of-order scheduling algorithm, "Most Pending" [19]. Our approach is based on a sliding window profiling implementation where the window is the size of the memory controller queue.

We start with the assumption that initially Row A of Bank 0 and Row X of Bank 1 are open (Figure 4(a)). Following our algorithm, the first request that we encounter is to Bank 0 Row A, which is an opened row, so we can remove that request from the sliding window and increment t0 by Tsrvc_req (defined in Section 3.2). The next request that we read in is to the same bank but this time to Row B. We leave this request in the sliding window, i.e., our memory controller queue, in the hopes that, by looking ahead to newer requests, we can find ones that are to the opened row. Moving on, it can be seen that the next request is again to Bank 0 Row A, allowing us to service it right away. In this manner, we process all the requests up to Request 8, removing (servicing) requests that are to the opened rows (shown as shaded entries in Figure 4(b)), at which point our window size is 4, meaning the memory controller queue is now full. The ti values can now be used to calculate the efficiency for this period using Equation 1. The ti values are then reset to signify the start of a new prediction period, and we repeat our algorithm from the start. For profiling + no activate overlap, we then open the row of only the next oldest request, which is to Bank 0 Row B (Figure 4(c)). The DRAM efficiency over the entire length of the trace can also be computed using the method described at the end of Section 3.2.

While profiling + no activate overlap focuses on the prediction of efficiency by accounting for the requests in the memory controller queue that can help reduce the overhead of timing constraint delays, it does not account for the timing constraints of separate banks that can also be overlapped. More specifically, the process of precharging and activating rows in different banks of the same chip is constrained only by tRRD. With a tRC value of 34 and a tRRD of 8, as per our DRAM parameters in Table 2, we can completely overlap the activation of rows in all 4 banks (34/8 = 4.25 activate commands issuable per row cycle). Performance-wise, this means that even for a memory access pattern with minimal row access locality, an access pattern that finely interleaves requests to all 4 banks can achieve 4 times the bandwidth of an access pattern that has requests to only a single bank. As such, it is crucial to also take into account the overlapping of precharges and activates for applications with poor row access locality. Otherwise, the analytical model will significantly underestimate the DRAM efficiency. We model this using our second heuristic, profiling + full activate overlap, which instead opens the rows of the oldest requests to all banks, assuming that the delays associated with switching rows can be overlapped in all banks. This is shown in Figure 4(d) as Bank 1 also switching from Row X to Row Y, in accordance with the code guarded by "mode == full_activate_overlap" in Figure 5. Leaving the opened row for Bank 1 as Row X essentially forces the switching overhead of this bank to be paid separately from Bank 0's switching overhead from Row A to Row B (profiling + no activate overlap), potentially causing underestimation of DRAM efficiency. Conversely, switching the rows of all banks at the same time completely overlaps the switching overheads of all banks (profiling + full activate overlap), potentially causing overestimation of DRAM efficiency.

In the next section, we will describe our experimental methodology. In Section 5, we will show the accuracy of the two heuristics described in this section in comparison to the DRAM efficiency measured in performance simulation.

4. METHODOLOGY

To evaluate our hybrid analytical model, we modified GPGPU-Sim [3], a massively multi-threaded cycle-accurate performance simulator, to collect the memory request address streams that are fed to the memory controllers for use in our analytical model. Table 1 shows the microarchitectural parameters used in our study.

Table 1: Microarchitectural parameters (baseline values: 16 DRAM chips, 2 DRAM chips per controller, queue size 32, FR-FCFS scheduling)

    Shader Cores                                          32
    Threads per Shader Core                               1024
    Interconnection Network                               Full Crossbar
    Maximum Supported In-flight Requests per Shader Core  64
    Memory Request Size (Bytes)                           64
    DRAM Chips                                            8, 16, 32
    DRAM Controllers                                      8
    DRAM Chips per Controller                             1, 2, 4
    DRAM Controller Queue Size                            8, 16, 32, 64
    DRAM Controller Scheduling Policy                     First-Ready First-Come
                                                          First-Serve (FR-FCFS),
                                                          Most Pending [19]

Table 2: DRAM parameters (times in DRAM clock cycles)

    Name   Description                                                Value
    -      Number of Banks                                            4
    -      Bus Width (in Bytes)                                       4
    BL     Burst Length (in Bytes)                                    16
    tCCD   Delay between successive reads or successive writes       2
    tRRD   Minimum time between successive ACT commands
           to different banks                                         8
    tRAS   Minimum time between opening (ACT) and
           closing (PRE) a row                                        21
    tRCD   Minimum time between issuing ACT and issuing
           a Read or Write                                            12
    tRC    Minimum time between successive ACTs to different
           rows in the same bank                                      34
    tWTR   Write to Read command delay                                5
    tRP    Minimum time between closing a row and opening
           a new row                                                  13
    CL     Column Address Strobe Latency                              9

The advantage of studying a massively multi-threaded architecture that GPGPU-Sim is capable of simulating is that it allows us to stress the DRAM memory system due to the sheer number of memory requests to DRAM that can be in flight simultaneously at any given time. In our configuration, we allow for 64 in-flight requests per shader core. This amounts to a maximum of 2048 simultaneous in-flight memory requests to DRAM. In comparison, Prescott has only eight MSHRs [4] and Willamette has only four MSHRs [8]. With an eight memory controller configuration, we use our model on 8 separate memory request address streams per application. We test our model on the 14 different applications shown in Table 3, all of which have various levels of measured DRAM efficiency and utilization. We use six applications from NVIDIA's CUDA software development kit (SDK) [15] and seven applications from the set used by Bakhoda et al. in [3]. The last application, Matrix Multiply, is from Ryoo et al. [20]. Our only criterion for selecting applications from the two suites is that they must show greater than 10% DRAM utilization averaged across all DRAM chips for our given baseline microarchitecture configuration.

Table 3: Benchmarks

    Benchmark                          Label  Suite
    Black-Scholes option pricing       BS     CUDA SDK
    Fast Walsh Transform               FWT    CUDA SDK
    gpuDG                              DG     3rd Party
    3D Laplace Solver                  LPS    3rd Party
    LIBOR Monte Carlo                  LIB    3rd Party
    Matrix Multiply                    MMC    3rd Party
    MUMmerGPU                          MUM    3rd Party
    Neural Network Digit Recognition   NN     3rd Party
    Ray Tracing                        RAY    3rd Party
    Reduction                          RED    CUDA SDK
    Scalar Product                     SP     CUDA SDK
    Scan Large Array                   SLA    CUDA SDK
    Matrix Transpose                   TRA    CUDA SDK
    Weather Prediction                 WP     3rd Party

To illustrate the diversity of our application set, Figure 6 shows the DRAM efficiency and utilization measured by GPGPU-Sim in performance simulation averaged across all DRAM chips for each application. We simulate each application to completion in order to collect the full memory request address trace.

[Figure 6: Measured DRAM efficiency and measured DRAM utilization averaged across all DRAM chips for each benchmark.]

5. RESULTS

In Section 5.1, we first show our results for the two heuristics that we described in Section 3.3. We show that our model obtains a correlation of 68.8% with an arithmetic mean absolute error of 15.2%. In Section 5.2, we provide an in-depth analysis of the causes of our errors. In Section 5.3, we perform a sensitivity analysis of our analytical DRAM model across different microarchitecture configurations.

5.1 DRAM Efficiency Prediction Accuracy

Figure 7 compares the actual measured DRAM efficiency to the modeled DRAM efficiency when using the profiling technique described in Section 3.3. For clarity, we show only the arithmetic average of the efficiency across all DRAM controllers of each application.

[Figure 7: Comparison of measured DRAM efficiency to predicted DRAM efficiency averaged across all DRAM chips for each benchmark.]

[Figure 9: Comparison of actual DRAM efficiency to modeled DRAM efficiency, one data point per memory controller per benchmark: (a) profiling + no activate overlap; (b) profiling + full activate overlap.]

[Figure 8: Arithmetic mean of absolute error of predicted DRAM efficiency averaged across all DRAM chips for each benchmark (AM = arithmetic mean across all benchmarks).]

The arithmetic average of these absolute errors is 15.2% for profiling + no activate overlap and 27.2% for profiling + full activate overlap. The corresponding correlations are 68.8% and 41.6%, respectively. We described in Section 3.3 that, without accounting for the overlap of row activates, we expect our predictions to underestimate the measured efficiency (profiling + no activate overlap). (We explain why the results of some applications do not match these expectations in Section 5.2.) In order to help us determine this, we propose a new metric, polarity, which is defined as the arithmetic average of the non-absolute errors divided by the arithmetic average of the absolute errors. The value of polarity can range between -1 and 1, where -1 means that the test value (predicted efficiency) is always less than the reference value (measured efficiency), 1 means that the test value is always greater than the reference value, and 0 means that, on average, the error is not skewed positively or negatively in any way. We calculated the arithmetic mean of the non-absolute errors for profiling + no activate overlap to be -8.8%, giving a polarity of -0.579. This implies a moderate tendency to underestimate the DRAM efficiency, confirming our expectations. On the other hand, the polarity of the average arithmetic error of profiling + full activate overlap is 0.985, meaning it almost always overestimates the DRAM efficiency. We expect this to occur since the frequency of issuing activates is constrained by tRRD. Moreover, a row activate to a bank cannot be issued as long as there are still requests to the current row of that bank, further constraining the frequency of issuing activates. Our profiling + full activate overlap heuristic captures neither of these, essentially meaning that assuming full overlap of the row activates will always be optimistic in predicting the DRAM efficiency.

Figure 9(a) shows the scatter plot of predicted DRAM efficiency using our best profiling method, profiling + no activate overlap, against the measured efficiency, and Figure 9(b) shows the scatter plot of our other heuristic, profiling + full activate overlap. Each dot represents the measured versus predicted DRAM efficiency for a single DRAM controller. Several facts become immediately apparent. First of all, the data points in Figure 9(a) are closer to the X=Y line, further illustrating profiling + no activate overlap's improved accuracy over profiling + full activate overlap. Most of the data points for profiling + no activate overlap tend to be under the X=Y line, meaning that this modeling heuristic tends to underestimate the DRAM efficiency, while profiling + full activate overlap behaves the opposite way.

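The polarity metric introduced above is simple to compute from per-controller signed errors; a minimal sketch (the error values here are hypothetical):

    def polarity(errors):
        # Mean signed error divided by mean absolute error: -1 means the
        # model always underestimates, +1 always overestimates, 0 unskewed.
        mean_signed = sum(errors) / len(errors)
        mean_abs = sum(abs(e) for e in errors) / len(errors)
        return mean_signed / mean_abs

    errors = [-0.20, -0.15, 0.05, -0.12, -0.02]    # predicted minus measured
    print(round(polarity(errors), 3))               # -0.815: mostly underestimates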

[Figure 10: Average row access locality averaged across all DRAM chips for each benchmark (AM = arithmetic mean across all benchmarks, HM = harmonic mean across all benchmarks).]

[Figure 11: Arithmetic mean absolute error versus DRAM controller queue size (8, 16, 32, 64), averaged across all DRAM controllers and across all benchmarks.]

5.2 Modeling Error Analysis

Applications where one technique generates significantly poor predictions tend to be predicted much more accurately by the other technique (BS, FWT, DG, LPS, MMC, MUM, NN, RED, SLA, TRA). This is encouraging because it implies that a heuristic that predicts which technique to use may reduce the mean absolute error. We postulate that for applications with poor row access locality, the overlapping of row activates has a more significant impact. Consequently, profiling + full activate overlap should perform much better than profiling + no activate overlap, which serializes the precharge and activate delays and potentially underestimates the DRAM efficiency. Conversely, we postulate that applications with good row access locality will not benefit as much from row activate overlapping, which will make profiling + full activate overlap overestimate the efficiency.

To verify this, we obtain the average row access locality by dividing the total number of read and write requests by the number of activates per memory controller and arithmetically averaging these values across all memory controllers for each application. Figure 10 shows this average row access locality. MUM, TRA, and WP exhibit the lowest row access locality and, consistent with our hypothesis, they are all predicted more accurately by profiling + full activate overlap than by profiling + no activate overlap. A quick calculation showed that a "smart" heuristic that chooses profiling + full activate overlap when the average row access locality is less than 2.0 and profiling + no activate overlap otherwise would reduce the error by 24.8% relative to profiling + no activate overlap, from 15.2% to 11.4%. (The value 2.0 was chosen to minimize the error; small variations can increase this error. A sketch of this selector appears at the end of this subsection.)

Of the 5 applications where profiling + full activate overlap is more accurate (LIB, MUM, RAY, TRA, WP), 3 of them rank in the bottom 3 in terms of row access locality (MUM, TRA, WP). The other 2, LIB and RAY, both have average row access locality, less than the arithmetic mean but more than the harmonic mean. While BS, MMC, NN and SP have row access locality approximately equal to LIB and RAY, profiling + full activate overlap predicts less accurately than profiling + no activate overlap for these applications by severely overestimating the DRAM efficiency.

A closer look at SP shows that it is dominated by reads (99.8% reads and 0.2% writes) and that the DRAM is actually stalled during 48% of its total runtime on average due to congestion in the interconnect that sends data from DRAM to the processors. This congestion backs up to the memory controller itself, which cannot issue requests since there is no output buffer space. Therefore, while the DRAM is actually capable of utilizing the bus better and being more efficient, it is unable to due to this interconnect congestion.

For BS and MMC, we noticed that the utilization is quite low, at only 29% and 21% respectively. We measured the average occupancy of the memory controller queue for these applications, considering only the time when there is at least one request waiting in the memory controller queue. Over the course of their entire runtimes, BS and MMC show an average occupancy of only 10% and 19% respectively. (Furthermore, MMC also shows congestion at the interconnect from DRAM to the processors 28% of the time.) This essentially means that our profiling technique is too aggressive, in that the sliding window it uses (whose length is equal to the modeled memory controller queue size) is too large relative to the average memory controller queue occupancy in simulation. As such, the profiling technique finds row access locality that does not exist in simulation, since the majority of the requests have not yet arrived at the memory controller. (Note that the profiling technique does not take timing information into account, so it assumes that the memory controller (sliding window) can always look ahead in the memory request address trace by an amount equal to the memory controller queue size.)

NN is a unique case in that it has four compute kernels that each exhibit different memory access patterns. (A compute kernel is defined as a portion of the application capable of having a unique number and organization of parallel threads [17].) The first two compute kernels exhibit high row access locality but, like SP, they suffer from congestion at the interconnect from DRAM to the processor cores, stalling 77% of the time at the interconnect. The row access locality of the third kernel is much lower, as is the occupancy at 37% (compared to 72% for the first two kernels), so it exhibits the same problem as BS and MMC. Both problems cause profiling + full activate overlap to overestimate the DRAM efficiency, explaining the large error for this application. (The last kernel is significantly shorter than the first three, so it contributes little to the average measured and predicted efficiency.)
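The "smart" selector described above reduces to a threshold test on measured row access locality. A minimal sketch, with the 2.0 threshold taken from the text and the two predictor functions left as placeholders:

    def row_access_locality(reads_and_writes, activates):
        # Average number of requests serviced per row activation.
        return reads_and_writes / activates

    def smart_predict(stats, predict_no_overlap, predict_full_overlap):
        # Below the threshold, activate overlap dominates, so use the
        # full-overlap heuristic; otherwise serialize the row switches.
        if row_access_locality(stats["requests"], stats["activates"]) < 2.0:
            return predict_full_overlap(stats)
        return predict_no_overlap(stats)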

[Figure 12: Arithmetic mean absolute error versus number of DRAM chips per DRAM controller (AM = arithmetic mean across all benchmarks).]

[Figure 13: Arithmetic mean absolute error of the "Most Pending" memory scheduling algorithm versus FR-FCFS (AM = arithmetic mean across all benchmarks).]

We first present Figure 11, which shows the arithmetic mean absolute error of our two profiling heuristics, profiling + full activate overlap and profiling + no activate overlap, across four different DRAM controller queue sizes of 8, 16, 32 and 64. For clarity, we only present the values averaged across all DRAM controllers and all benchmarks. The naming convention of the different bars is "Q#H*", where Q# is the DRAM controller queue size and H* is the heuristic: HN is profiling + no activate overlap and HO is profiling + full activate overlap. In general, the error tends to increase as the queue size increases. We attribute this to the fact that, since the queue occupancy is not always 100% for some benchmarks even at our baseline DRAM controller queue size of 32 (as explained in Section 5.2), the occupancy will be even lower when the queue size is increased to 64. This means that the size of the profiling window that we use is too large compared to what occurs in simulation. Moreover, the efficiency of all DRAM controllers decreases as the queue size is reduced, meaning the absolute value of the difference decreases as well.

Figure 12 shows the modeling error as the number of DRAM chips connected in parallel (to increase bandwidth) to the DRAM controller is varied. The naming convention of the labels for this bar chart is "#H*", where the leading # specifies the number of DRAM chips per controller and H* again specifies the two profiling heuristics. Increasing the number of DRAM chips per controller is an effective way of increasing the total off-chip bandwidth to the chip without incurring much additional circuit logic on the chip (although it will increase the chip package size due to an increased number of pins needed for data transfer). Since our memory request sizes are fixed at 64B and each of our DRAM chips is capable of transferring 16 bytes of data per command, the limit of this form of parallelism is no more than 4 DRAM chips per controller. Furthermore, as this parallelism is increased, the memory access pattern becomes much more important in determining the DRAM efficiency. This is because more DRAM chips per controller means fewer read and write commands per request, thus reducing Tsrvc_req. The number of requests per row (i.e., the row access locality) thus needs to be higher to maintain the DRAM efficiency, as shown in Equation 1. Moreover, reducing Tsrvc_req can reduce the value of the denominator of Equation 1, MAX(tRC, tRP + tRCD + tj). This means that there is less time to overlap the row activate commands of different banks, implying that profiling + no activate overlap should be more accurate than profiling + full activate overlap. As we see in Figure 12, this is indeed the case.

Finally, we also try implementing our hybrid analytical model for another memory scheduling policy, "Most Pending" [19]. After all requests to a row have been serviced, our default memory scheduling policy, First-Ready First-Come First-Serve (FR-FCFS) [19], will open the row of the oldest request. "Most Pending," conversely, will open the row that has the most requests pending to it in the queue (essentially a greedy bandwidth maximization scheme). The algorithm to implement this in our analytical model is virtually the same as the one shown in Figure 5 except that, instead of finding the oldest request whenever we switch rows, we now find the oldest request corresponding to the row that has the most requests (the Python sketch following Figure 5 parameterizes this choice). Figure 13 shows the arithmetic mean absolute error for our baseline configuration. We see that for this scheduling policy, the error is comparable to that of FR-FCFS.

In conclusion, we see that our model is quite robust: the error stays comparable when varying key microarchitectural parameters and even for a different memory scheduling policy. With smarter heuristics, such as the simple one described in Section 5.2, this error can be reduced even further.

6. RELATED WORK

Ahn et al. [2] explore the design space of DRAM systems. They study the effects on throughput of various memory controller policies, memory controller queue sizes, and DRAM timing parameters such as the write-to-read delay and the burst length. More relevant to our work, they also present an analytical model for expected throughput assuming a "random indexed" access pattern and a fixed "record length". This essentially means that they assume a constant fixed number of requests accessed per row and a continuous stream of requests, whereas our model handles any memory access pattern. Furthermore, they only show the results of their analytical model for two micro-benchmarks, while we test a suite of applications with various memory access patterns. We leave a quantitative comparison between their model and ours to future work.

In his PhD thesis, Wang [23] presents an equation for calculating DRAM efficiency by accounting for the idle cycles in which the bus is not used. His model assumes that the request stream from the memory controller to DRAM is known, while we also account for optimizations to the request stream made by the memory controller to help hide timing constraint delays. He also considers only single-threaded workloads, for which he observes high degrees of access locality. Our massively multi-threaded workloads have a wide range of access localities.

There exist many analytical models proposed for a variety of microprocessors [14, 18, 12, 13, 10, 6, 5, 9], from out-of-order execution superscalar processors to in-order execution fine-grained multithreaded processors to even GPU architectures. Agarwal et al. [1] also present an analytical cache model estimating the cache miss rate when given cache parameters such as cache size, block size, and associativity; their model also requires an address trace of the program being analyzed. Of these microprocessor analytical models, only Karkhanis and Smith [10], Chen and Aamodt [6, 5], and Hong and Kim [9] model long, albeit fixed, memory latency.

7. CONCLUSIONS

In this paper we have proposed a novel hybrid analytical DRAM model which takes into account the effects of out-of-order memory scheduling and the hiding of DRAM timing delays. We showed that this model can be used with a sliding window profiling technique to predict the DRAM efficiency over the entire runtime of an application, given its full memory request address trace. We chose a massively multi-threaded architecture connected to a set of high-bandwidth GDDR3-SDRAM graphics memory as our simulation framework and evaluated the accuracy of our model on a set of real CUDA applications with diverse dynamic memory access patterns. Using our sliding window profiling approach with two different heuristics, we were able to predict the DRAM efficiency of a set of real applications to within an arithmetic mean of 15.2% absolute error using our best performing heuristic. We also showed that applications that are predicted poorly by one heuristic tend to be predicted much better by the other, implying the possibility for improvement by combining the two heuristics or developing more intelligent ones, which we leave to future work.

8. ACKNOWLEDGMENTS

We would like to thank Xi Chen, Ali Bakhoda, Johnny Kuan, Andrew Turner, Aaron Ariel, Hyunjung Lee, and the anonymous reviewers for their valuable comments on this work. This work was partly supported by the Natural Sciences and Engineering Research Council of Canada.

9. REFERENCES

[1] A. Agarwal, J. Hennessy, and M. Horowitz. An analytical cache model. ACM Trans. Comput. Syst., 7(2):184-215, 1989.
[2] J. H. Ahn, M. Erez, and W. J. Dally. The design space of data-parallel memory systems. In SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 80.
[3] A. Bakhoda, G. L. Yuan, W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS 2009: Proceedings of the 2009 Annual IEEE International Symposium on Performance Analysis of Systems and Software.
[4] D. Boggs, A. Baktha, J. Hawkins, D. T. Marr, J. A. Miller, P. Roussel, R. Singhal, B. Toll, and K. Venkatraman. The microarchitecture of the Intel Pentium 4 processor on 90nm technology. Intel Technology Journal, 8(1), 2004.
[5] X. Chen and T. Aamodt. A first-order fine-grained multithreaded throughput model. In HPCA 2009: Proceedings of the 15th Annual IEEE International Symposium on High Performance Computer Architecture.
[6] X. E. Chen and T. M. Aamodt. Hybrid analytical modeling of pending hits, data prefetching, and MSHRs. In MICRO 2008: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture.
[7] B. T. Davis. Modern DRAM Architectures. PhD thesis, University of Michigan, 2001.
[8] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 5(1), 2001.
[9] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. To appear in ISCA 2009: Proceedings of the 36th Annual International Symposium on Computer Architecture.
[10] T. S. Karkhanis and J. E. Smith. A first-order superscalar processor model. In ISCA 2004: Proceedings of the 31st Annual International Symposium on Computer Architecture, page 338.
[11] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA 1981: Proceedings of the 8th Annual Symposium on Computer Architecture, pages 81-87, 1981.
[12] P. Michaud, A. Seznec, and S. Jourdan. Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors. In PACT 1999: Proceedings of the Eighth International Conference on Parallel Architectures and Compilation Techniques, 1999.
[13] P. Michaud, A. Seznec, and S. Jourdan. An exploration of instruction fetch requirement in out-of-order superscalar processors. Int. J. Parallel Program., 29(1):35-58, 2001.
[14] D. B. Noonburg and J. P. Shen. Theoretical modeling of superscalar processor performance. In MICRO 27: Proceedings of the 27th Annual International Symposium on Microarchitecture, 1994.
[15] NVIDIA Corporation. NVIDIA CUDA SDK code samples. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html.
[16] NVIDIA Corporation. NVIDIA Quadro4 XGL. http://www.nvidia.com/page/quadro4xgl.html.
[17] NVIDIA Corporation. NVIDIA CUDA Programming Guide, 1.1 edition, 2007.
[18] D. J. Ofelt. Efficient Performance Prediction for Modern Microprocessors. PhD thesis, Stanford University, 1999.
[19] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In ISCA 2000: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 128-138.
[20] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W. M. W. Hwu. Program optimization space pruning for a multithreaded GPU. In CGO 2008: Proceedings of the Sixth Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 195-204.
[21] Samsung. 512Mbit GDDR3 SDRAM, Revision 1.0 (Part No. K4J52324QC-B). http://www.alldatasheet.com/datasheet-pdf/pdf/112724/SAMSUNG/K4J52324QC.html, March 2005.
[22] J. G. Shanthikumar and R. G. Sargent. A unifying view of hybrid simulation/analytic models and modeling. Operations Research, pages 1030-1052, 1983.
[23] D. T. Wang. Modern DRAM Memory Systems: Performance Analysis and a High Performance Power-Constrained DRAM Scheduling Algorithm. PhD thesis, University of Maryland at College Park, 2005.