
CHAPTER 13 DRAM Memory Controller

In modern computer systems, processors and I/O devices access data in the memory system through the use of one or more memory controllers. Memory controllers manage the movement of data into and out of DRAM devices while ensuring protocol compliance, accounting for DRAM-device-specific electrical characteristics, timing characteristics, and, depending on the specific system, even error detection and correction. DRAM memory controllers are often contained as part of the system controller, and the design of an optimal memory controller must include system-level considerations that ensure fairness in arbitration for access between different agents that read and store data in the same memory system.

The design and implementation of the DRAM memory controller determine the access latency and bandwidth efficiency characteristics of the DRAM memory system. The previous chapters provide a bottom-up approach to the design and implementation of a DRAM memory system. With the understanding of DRAM device operations and system-level organization provided by those chapters, this chapter proceeds to examine DRAM controller design and implementation considerations.

13.1 DRAM Controller Architecture

The function of a DRAM memory controller is to manage the flow of data into and out of DRAM devices connected to that DRAM controller in the memory system. However, due to the complexity of DRAM memory-access protocols, the large numbers of timing parameters, the innumerable combinations of memory system organizations, different workload characteristics, and different design goals, the design space of a DRAM memory controller for a given DRAM device has nearly as much freedom as the design space of a processor that implements a specific instruction-set architecture. In that sense, just as an instruction-set architecture defines the programming model of a processor, a DRAM-access protocol defines the interface protocol between a DRAM memory controller and the system of DRAM devices. In both cases, actual performance characteristics depend on the specific microarchitectural implementations rather than the superficial description of a programming model or interface protocol. That is, just as two processors that support the same instruction-set architecture can have dramatically different performance characteristics depending on the respective microarchitectural implementations, two DRAM memory controllers that support the same DRAM-access protocol can have dramatically different latency and sustainable bandwidth characteristics depending on the respective microarchitectural implementations. DRAM memory controllers can be designed to minimize die size, minimize power consumption, maximize system performance, or simply reach a reasonably optimal compromise among the conflicting design goals. Specifically, the row-buffer-management policy, the address mapping scheme, and the memory transaction and DRAM command ordering scheme are particularly important to the design and implementation of DRAM memory controllers.

Due to the increasing disparity between the operating frequency of modern processors and the access latency to main memory, there is a large body of active and ongoing research in the architectural community devoted to the performance optimization of the DRAM memory controller. Specifically, the address mapping scheme, designed to minimize bank address conflicts, has been studied by Lin et al. [2001] and Zhang et al. [2002a]. DRAM command and memory transaction ordering schemes have been studied by Briggs et al. [2002], Cuppu et al. [1999], Hur and Lin [2004], McKee et al. [1996a], and Rixner et al. [2000]. Due to the sheer volume of research into optimal DRAM controller designs for different types of DRAM memory systems and workload characteristics, this chapter is not intended as a comprehensive summary of all prior work. Rather, the text in this chapter describes the basic concepts of DRAM memory controller design in abstraction, and relevant research on specific topics is referenced as needed.

Figure 13.1 illustrates some basic components of an abstract DRAM memory controller. The memory controller accepts requests from one or more microprocessors and one or more I/O devices and provides the arbitration interface to determine which request agent will be able to place its request into the memory controller. From a certain perspective, the request arbitration logic may be considered as part of the system controller rather than the memory controller. However, as the cost of memory access continues to increase relative to the cost of data computation in modern processors, efforts in performance optimization are combining transaction scheduling and command scheduling policies and examining them in a collective context rather than as separate optimizations. For example, a low-priority request from an I/O device to an already open bank may be scheduled ahead of a high-priority request from a microprocessor to a different row of the same open bank, depending on the access history, respective priority, and state of the memory system. Consequently, a discussion on transaction arbitration is included in this chapter.

Figure 13.1 also illustrates that once a transaction wins arbitration and enters into the memory controller, it is mapped to a location in the memory system and converted to a sequence of DRAM commands. The sequence of commands is placed in queues that exist in the memory controller. The queues may be arranged as a generic queue pool, where the controller will select from pending commands to execute, or the queues may be arranged so that there is one queue per bank or per rank of memory. Then, depending on the DRAM command scheduling policy, commands are scheduled to the DRAM devices through the electrical signaling interface.

In the following sections, the various components of the memory controller illustrated in Figure 13.1 are separately examined, with the exception of the electrical signaling interface. Although the electrical signaling interface may be one of the most critical

[Figure: CPU and I/O request streams enter the DRAM memory controller through an arbiter; transactions flow through transaction scheduling, address translation, and command scheduling stages, with a queue pool and per-bank management (Bank 0, Bank 1, Bank 2, ...) feeding the electrical signaling interface to the DRAM DIMM.]

FIGURE 13.1: Illustration of an abstract DRAM memory controller.

components in modern, high data rate memory systems, the challenges of signaling are examined separately in Chapter 9. Consequently, the focus in this chapter is limited to the digital logic components of the DRAM memory controller.

13.2 Row-Buffer-Management Policy

In modern DRAM devices, the arrays of sense amplifiers can also act as buffers that provide temporary data storage. In this chapter, policies that manage the operation of sense amplifiers are referred to as row-buffer-management policies. The two primary row-buffer-management policies are the open-page policy and the close-page policy, and depending on the system, different row-buffer-management policies can be used to optimize performance or minimize the power consumption of the DRAM memory system.

13.2.1 Open-Page Row-Buffer-Management Policy

In commodity DRAM devices, data access to and from the DRAM storage cells is a two-step process that requires separate row activation commands and column access commands.1 In cases where the memory-access sequence possesses a high degree of temporal and spatial locality, memory system architects and design engineers can take advantage of the locality by directing temporally and spatially adjacent memory accesses to the same row of memory. The open-page row-buffer-management policy is designed to favor memory accesses to the same row of memory by keeping sense amplifiers open and holding a row of data for ready access. In a DRAM controller that implements the open-page policy, once a row of data is brought to the array of sense amplifiers in a bank of DRAM cells, different columns of the same row can be accessed again with the minimal latency of tCAS. In the case where another memory read access is made to the same row, that memory access can occur with minimal latency since the row is already active in the sense amplifiers and only a column access command is needed to move the data from the sense amplifiers to the memory controller. However, in the case where the access is to a different row of the same bank, the memory controller must first precharge the DRAM array, engage another row activation, and then perform the column access.

13.2.2 Close-Page Row-Buffer-Management Policy

In contrast to the open-page row-buffer-management policy, the close-page row-buffer-management policy is designed to favor accesses to random locations in memory and optimally supports memory request patterns with low degrees of access locality. The open-page policy and closely related variant policies are typically deployed in memory systems designed for low processor count, general-purpose computers. In contrast, the close-page policy is typically deployed in memory systems designed for large processor count, multiprocessor systems or specialty embedded systems. The reason that an open-page policy is typically deployed in memory systems of low processor count platforms while a close-page policy is typically deployed in memory systems of larger processor count platforms is that in large systems, the intermixing of memory request sequences from multiple, concurrent, threaded contexts reduces the locality of the resulting memory-access sequence. Consequently, the probability of a row hit decreases and the probability of a bank conflict increases in these systems, reaching a tipping point of sorts where a close-page policy provides better performance for the computer system. However, not all large processor

1 In some DRAM devices that strive to be SRAM-like, the row-activation command and the column-access command are coupled into a single read or write command. These devices do not support the open-page row-buffer-management policies. Typically, these DRAM devices are designed as low-latency, random-access memory and used in specialty embedded systems.

count systems use close-page memory systems. For example, the Alpha EV7's Direct RDRAM memory system uses an open-page policy to manage the sense amplifiers in the DRAM devices. The reason for this choice is that a fully loaded Direct RDRAM memory system has 32 ranks of DRAM devices per channel and 32 split banks per rank. The large number of banks in the Direct RDRAM memory system means that even in the case where a large number of concurrent processes are accessing memory from the same memory system, the probability of bank conflicts remains low. Consequently, the Alpha EV7's memory system demonstrates that the optimality in the choice of row-buffer-management policies depends on both the type and the number of processors, as well as the parallelism available in the memory system.

13.2.3 Hybrid (Dynamic) Row-Buffer-Management Policies

In modern DRAM memory controllers, the row-buffer-management policy is often neither a strictly open-page policy nor a strictly close-page policy, but a dynamic combination of the two policies. That is, the respective analyses of the performance and power consumption impact of row-buffer-management policies illustrate that the optimality of the row-buffer-management policy depends on the request rate and access locality of the memory request sequences. To support memory request sequences whose request rate and access locality can change dramatically depending on the dynamic, run-time behavior of the workload, DRAM memory controllers designed for general-purpose computing can utilize a combination of access history and timers to dynamically control the row-buffer-management policy for performance optimization or power consumption minimization.

Previously, the minimum ratio of memory read requests that must be row buffer hits for an open-page memory system to have lower read latency than a comparable close-page memory system was computed as tRP / (tRCD + tRP). The existence of this minimum ratio means that if a sequence of bank conflicts occurs in rapid succession and the ratio of memory read requests that are row buffer hits falls below a precomputed threshold, the DRAM controller can switch to a close-page policy for better performance. Similarly, if a rapid succession of memory requests to a given bank is made to the same row, the DRAM controller can switch to an open-page policy to improve performance. One simple mechanism used in modern DRAM controllers to improve performance and reduce power consumption is the use of a timer to control the sense amplifiers. That is, a timer is set to a predetermined value when a row is activated. The timer counts down with every clock tick, and when it reaches zero, a precharge command is issued to precharge the bank. In case of a row buffer hit to an open bank, the timer is reset to a higher value and the countdown repeats. In this manner, temporal and spatial locality present in a given memory-access sequence can be utilized without keeping rows open indefinitely.
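As a rough illustration, the timer mechanism described above can be sketched in C as follows. The tick granularity and the timer values are illustrative assumptions, not parameters of any particular controller.

/* Sketch of a per-bank row-closing timer (illustrative values). */
#include <stdbool.h>

#define T_INITIAL 16   /* ticks granted when a row is activated */
#define T_EXTEND  32   /* higher value granted on each row buffer hit */

typedef struct { bool row_open; int timer; } bank_state_t;

void on_activate(bank_state_t *b) { b->row_open = true; b->timer = T_INITIAL; }
void on_row_hit(bank_state_t *b)  { b->timer = T_EXTEND; }

/* called once per clock tick; returns true when the controller should
   issue a precharge command to close the row in this bank */
bool tick(bank_state_t *b)
{
    if (b->row_open && --b->timer == 0) {
        b->row_open = false;
        return true;
    }
    return false;
}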
Finally, row-buffer-management policies can be controlled on a channel-by-channel basis or on a bank-by-bank basis. However, the potential gains in performance and power savings must be traded off against the increase in hardware sophistication and design complexity of the memory controller. In cases where high performance or minimum power consumption is not required, a basic controller can be implemented to minimize the die size impact of the DRAM memory controller.

The definition of the row-buffer-management policy forms the foundation of the design of a DRAM memory controller. The choice of the row-buffer-management policy directly impacts the design of the address mapping scheme, the memory command reordering mechanism, and the transaction reordering mechanism in DRAM memory controllers. In the following sections, the address mapping scheme, the memory command reordering mechanism, and the transaction reordering mechanism are explored in the context of the row-buffer-management policy used.

13.2.4 Performance Impact of Row-Buffer-Management Policies

TABLE 13.1 Current specification for 16 256-Mbit Direct RDRAM devices in 32-bit RIMM modules

Condition                                                        Current Specification
One RDRAM device per channel in Read, balance in NAP mode        1195 mA
One RDRAM device per channel in Read, balance in standby mode    2548 mA
One RDRAM device per channel in Read, balance in active mode     3206 mA

A formal analysis that compares the performance of row-buffer-management policies requires an in-depth analysis of system-level queuing delays and the locality and rate of request arrival in the memory-access sequences. However, a first-order approximation of the performance benefits and trade-offs of different policies can be made through the analysis of memory read access latencies. Assuming nominally idle systems, the read latency in a close-page memory system is simply tRCD + tCAS. Comparably, the read latency in an open-page memory system is as little as tCAS or as much as tRP + tRCD + tCAS.2 In this context, tCAS is the row buffer hit latency, and tRP + tRCD + tCAS is the row buffer miss (bank conflict) latency. If x represents the percentage of memory accesses that hit in an open row buffer, and 1 - x represents the percentage of memory accesses that miss the row buffer, the average DRAM-access latency in an open-page memory system is x * (tCAS) + (1 - x) * (tRP + tRCD + tCAS). Taking this formula and equating it to the memory read latency of tRCD + tCAS in a close-page memory system, the minimum percentage of memory accesses that must be row buffer hits for an open-page memory system to have lower average memory-access latency than a close-page memory system can be solved for.

Solving for x, the minimum ratio of memory read requests that must be row buffer hits for an open-page memory system to have lower read latency is simply tRP / (tRCD + tRP). That is, as tRP approaches infinity,3 the percentage of row buffer hits in an open-page memory system must be nearly 100% for the open-page memory system to have lower (idle system) memory read latency than a comparable close-page memory system. Alternatively, as tRP approaches zero,4 an open-page system will have lower DRAM memory-access latency for any non-zero percentage of row buffer hits. Given specific values for tRCD and tRP, the required ratio of row buffer hits versus row buffer misses can be computed, and the resulting ratio can be used to aid design decisions for a row-buffer-management policy in a DRAM memory system.
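The break-even computation can be illustrated with a short C fragment. The timing values below are illustrative assumptions rather than parameters quoted in the text; with tRP equal to tRCD, the break-even row buffer hit ratio works out to 50%.

/* First-order latency model from the derivation above (sketch). */
#include <stdio.h>

int main(void)
{
    /* illustrative timing values in nanoseconds (assumptions) */
    double tRCD = 15.0, tCAS = 15.0, tRP = 15.0;

    /* open-page average latency: x*tCAS + (1-x)*(tRP + tRCD + tCAS);
       equating it to the close-page latency (tRCD + tCAS) and solving
       for x yields the break-even row buffer hit ratio */
    double x = tRP / (tRCD + tRP);

    printf("break-even row buffer hit ratio: %.0f%%\n", 100.0 * x);
    return 0;
}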

2 That is, assuming that the probability of having to wait for tRAS of the previous row activation is equal in open-page and close-page memory systems. If the probability of occurrence is the same, then the latency overheads are the same and can be ignored.
3 Or increase to a significantly higher multiple of tRCD.
4 Or decrease to a small fraction of tRCD.
5 Commodity DRAM devices such as DDR2 SDRAM devices consume approximately the same amount of power in active standby mode as they do in precharge standby mode, so the power optimality of the paging policy is device-dependent.

13.2.5 Power Impact of Row-Buffer-Management Policies

The previous section presents a simple mathematical exercise that compares idle system read latencies of an abstract open-page memory system to a comparable close-page memory system. In reality, the choice of the row-buffer-management policy in the design of a DRAM memory controller is a complex and multifaceted issue. A second factor that can influence the selection of the row-buffer-management policy may be the power consumption of DRAM devices.5 Table 13.1 illustrates the operating current ratings for a Direct RDRAM memory system that contains a total of 16 256-Mbit Direct RDRAM devices. The act of keeping the DRAM banks active and the DRAM device

in active standby mode requires a moderate amount of current draw in Direct RDRAM devices. Table 13.1 illustrates that a lower level of power consumption in Direct RDRAM devices can be achieved by keeping all the banks inactive and the DRAM device in a power-down NAP mode.

The power consumption characteristics of different DRAM device operating modes dictate that in cases where power consumption minimization is important, the optimality of the row-buffer-management policy can also depend on the memory request rate. That is, the close-page row-buffer-management policy is unambiguously better for memory request sequences with low access locality, but it is also better for power-sensitive memory systems designed for request sequences with relatively low request rates. In power-sensitive memory systems, Table 13.1 shows that for Direct RDRAM devices, it may be better to pay the cost of the row activation and precharge current for each column access than it is to keep the rows active for an indefinite amount of time waiting for more column accesses to the same open row.

13.3 Address Mapping (Translation)

Many factors can impact the latency and sustainable bandwidth characteristics of a DRAM memory system. Aside from the row-buffer-management policy, one factor that can directly impact DRAM memory system performance is the address mapping scheme. In this text, the address mapping scheme is used to denote the scheme whereby a given physical address is resolved into indices in a DRAM memory system in terms of channel ID, rank ID, bank ID, row ID, and column ID. The task of address mapping is also sometimes referred to as address translation.

In a case where the run-time behavior of the application is poorly matched with the address mapping scheme of the DRAM memory system, consecutive memory requests in the memory request sequence may be mapped to different rows of the same bank of a DRAM array, resulting in bank conflicts that degrade performance. On the other hand, an address mapping scheme that is better suited to the locality property of the same series of consecutive memory requests can map them to different rows of different banks, where accesses to different banks can occur with some degree of parallelism. Fundamentally, the task of an address mapping scheme is to minimize the probability of bank conflicts in temporally adjacent requests and maximize the parallelism in the memory system. To obtain the best performance, the choice of the address mapping scheme is often coupled to the row-buffer-management policy of the memory controller. However, unlike hybrid row-buffer-management policies that can be dynamically adjusted to support different types of memory request sequences with different request rates and access locality, address mapping schemes cannot be dynamically adjusted in conventional DRAM memory controllers.

Figure 13.2 illustrates and compares the conventional system architecture against a novel system architecture proposed by the Impulse memory controller research project. The figure shows that in the conventional system architecture, the processor operates in the virtual address space, and the TLB maps application addresses in the virtual address space to the physical address space without knowledge of or regard to the address mapping scheme used in the DRAM memory system. The Impulse memory controller from the University of Utah proposes a technique that allows a system to utilize part of the address space as shadow addresses. The shadow addressing scheme utilizes a novel virtual-to-physical address translation scheme and, in cooperation with the intelligent memory controller, eliminates bank conflicts between sequences of streaming requests by dynamically remapping address locations in the address space.

Essentially, the Impulse memory controller assumes a system architecture where the memory controller is tightly integrated with the TLB. With full understanding of the organization of the DRAM memory system, the Impulse memory controller is better able to minimize bank conflicts in mapping application request sequences from the virtual address space to the DRAM address space.

However, Impulse memory controller research is based on a novel system architecture, and the technique is not currently utilized in contemporary DRAM memory controllers within the context of

[Figure: Two panels. Conventional System Architecture: the processor's TLB translates virtual addresses to physical addresses, which the conventional memory controller then maps to DRAM addresses for the DRAM chips. Impulse Memory System Architecture: the TLB and the Impulse memory controller cooperate to map virtual addresses to DRAM addresses.]

FIGURE 13.2: Comparison of conventional system architecture and the Impulse memory controller architecture.

conventional system architectures.6 As a result, the discussion in the remaining sections of this chapter is focused on the examination of a DRAM controller in the context of conventional system architectures, and the address mapping schemes described herein are focused on the mapping from the physical address space into DRAM memory system organization indices.

13.3.1 Available Parallelism in Memory System Organization

In this section, the available parallelism of channels, ranks, banks, rows, and columns is examined. The examination of available parallelism in DRAM memory system organization is then used as the basis of discussion of the various address mapping schemes.

Channel

Independent channels of memory possess the highest degree of parallelism in the organization of DRAM memory systems. There are no restrictions from the perspective of the DRAM memory system on requests issued to different logical channels controlled by independent memory controllers. For performance-optimized designs, consecutive cacheline accesses are mapped to different channels.7

Rank

DRAM accesses can proceed in parallel in different ranks of a given channel subject to the availability of the shared address, command, and data busses. However, rank-to-rank switching penalties in high-frequency, globally synchronous DRAM memory

6 Fully Buffered DIMM memory systems present an interesting path to an unconventional system architecture that the Impulse memory controller or an Impulse-like memory controller may be able to take advantage of. That is, with multiple memory controllers controlling multiple, independent channels of Fully Buffered DIMMs, some channels may be designated for general-purpose access as in a conventional system architecture, while other channels may be used as high bandwidth, application-controlled, direct-access memory systems.
7 The exploration of parallelism in the memory system is an attempt to extract maximum performance. For low-power targeted systems, different criteria may be needed to optimize the address mapping scheme.

systems such as DDRx SDRAM memory systems limit the desirability of sending consecutive DRAM requests to different ranks.

Bank

Similar to the case of consecutive memory accesses to multiple ranks, consecutive memory accesses can proceed in parallel to different banks of a given rank subject to the availability of the shared address, command, and data busses. In contemporary DRAM devices, scheduling consecutive DRAM read accesses to different banks within a given rank is, in general, more efficient than scheduling consecutive read accesses to different ranks, since idle cycles are not needed to switch between different masters on the data bus. However, in most DRAM devices without a write buffer or separate internal datapaths for separate read and write data flow, a column-read command that follows a column-write command is more efficiently performed to different ranks of memory as compared to a column-read command that follows a column-write command to different banks of the same rank. In modern computer systems, read requests tend to have higher spatial locality than write requests due to the existence of write-back caches. Moreover, the number of column-read commands that immediately follow column-write commands can be minimized in advanced memory controllers by deferring individual write requests and instead group scheduling them as a sequence of consecutive write commands. Consequently, bank addresses are typically mapped lower than rank addresses in most controllers to favor the extraction of spatial locality from consecutive memory read accesses over the reduction of write-to-read turnaround times to open rows.8

Row

In conventional DRAM memory systems, only one row per bank can be active at any given instance in time, provided that additional ESDRAM-like or VCDRAM-like row buffers are not present in the DRAM device. The result of the forced serialization of accesses to different rows of the same bank means that row addresses are typically mapped to the highest memory address ranges to minimize the likelihood that spatially adjacent consecutive accesses are made to different rows of the same bank.

Column

In open-page memory systems, cachelines with sequentially consecutive addresses are optimally mapped to the same row of memory to support streaming accesses. As a result, column addresses are typically mapped to the lower address bits of a given physical address in open-page memory systems. In contrast, cachelines with sequentially consecutive addresses are optimally mapped to different rows and different banks of memory to support streaming accesses in close-page memory systems. The mapping of sequentially consecutive cachelines to different banks, different ranks, and different channels scatters requests in streaming accesses to different rows and favors parallelism in lieu of spatial locality. The result is that in close-page memory systems, the low address bits of the column address that denote the column offset within a cacheline are optimally mapped to the lowest address bits of the physical address, but the remainder of the column address is optimally mapped to the high address ranges comparable to the row addresses.

13.3.2 Parameters of Address Mapping Schemes

To facilitate the examination of address mapping schemes, parametric variables are defined in this section to denote the organization of memory systems. For the sake of simplicity, a uniform memory system is assumed throughout this chapter. Specifically, the memory system under examination is assumed to have K independent channels of memory, and each channel consists of L ranks per channel, B banks per rank, R rows per bank, C columns per row, and V bytes per column. The total size of physical memory in the system is simply K * L * B * R * C * V. Furthermore, it is assumed that each memory request moves data with the granularity of a cacheline. The length of a cacheline is defined as Z bytes, and the number of cachelines per row is denoted as N. The number of cachelines per row is a dependent variable that can be computed by multiplying the number of columns per row by the number of bytes per column and dividing through by the number of bytes per cacheline. That is, N = C * V / Z. The organization variables are summarized in Table 13.2.

8 Overhead-free scheduling of consecutive column accesses to different banks of a given rank of DRAM devices has long been the most efficient way to schedule memory commands. However, constraints such as burst chop in DDR3 SDRAM devices and tFAW constraints in 8-bank DDRx SDRAM devices are now shifting the overhead distribution. Consequently, rank parallelism may be more favorable than bank parallelism in future DDRx SDRAM memory systems.

TABLE 13.2 Summary of system configuration variables

Symbol   Variable Dependence   Description
K        Independent           Number of channels in system
L        Independent           Number of ranks per channel
B        Independent           Number of banks per rank
R        Independent           Number of rows per bank
C        Independent           Number of columns per row
V        Independent           Number of bytes per column
Z        Independent           Number of bytes per cacheline
N        Dependent             Number of cachelines per row

In general, the value of a system configuration parameter can be any positive integer. For example, a memory system can have six ranks of memory per channel and three channels of memory in the memory system. However, for the sake of simplicity, system parameters defined in this chapter are assumed to be integer powers of two, and the lowercase letters of the respective parameters are used to denote that power of two. For example, there are 2^b = B banks in each rank and 2^l = L ranks in each channel of memory. A memory system that has a capacity of K * L * B * R * C * V bytes can then be indexed with k + l + b + r + c + v address bits.

13.3.3 Baseline Address Mapping Schemes

In the previous section, the available parallelism of memory channels, ranks, banks, rows, and columns was examined in abstraction. In this section, two baseline address mapping schemes are established. In an abstract memory system, the total size of memory is simply K * L * B * R * C * V. The convention adopted in this chapter is that the colon (:) is used to denote separation in the address ranges. As a result, k:l:b:r:c:v not only denotes the size of the memory, but also the order of the respective address ranges in the address mapping scheme. Finally, for the sake of simplicity, C * V is replaced with N * Z. That is, instead of the number of bytes per column multiplied by the number of columns per row, the number of bytes per cacheline multiplied by the number of cachelines per row can be used equivalently. The size of the memory system is thus K * L * B * R * N * Z, and an address mapping scheme for this memory system can be denoted as k:l:b:r:n:z.

Open-Page Baseline Address Mapping Scheme

In performance-optimized, open-page memory systems, adjacent cacheline addresses are striped across different channels so that streaming bandwidth can be sustained across multiple channels and then mapped into the same row, same bank, and same rank.9 The baseline open-page address mapping scheme is denoted as r:l:b:n:k:z.

9 Again, assume a uniform memory system where all channels have identical configurations in terms of banks, ranks, rows, and columns.
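As an illustration, the baseline open-page scheme can be expressed as a simple bit-field decode. The sketch below assumes power-of-two organization parameters; the field widths chosen are illustrative, not drawn from any particular system.

/* Decode of the baseline open-page mapping r:l:b:n:k:z; the physical
   address is consumed from the least significant bits upward in the
   order z, k, n, b, l, r. */
#include <stdint.h>

enum { z_bits = 6, k_bits = 1, n_bits = 7, b_bits = 3, l_bits = 1, r_bits = 14 };

#define FIELD(pa, w) ((unsigned)((pa) & ((1u << (w)) - 1)))

typedef struct { unsigned row, rank, bank, cline, chan, offset; } dram_addr_t;

dram_addr_t decode_open_page(uint64_t pa)
{
    dram_addr_t d;
    d.offset = FIELD(pa, z_bits); pa >>= z_bits;  /* z: byte in cacheline */
    d.chan   = FIELD(pa, k_bits); pa >>= k_bits;  /* k: channel */
    d.cline  = FIELD(pa, n_bits); pa >>= n_bits;  /* n: cacheline in row */
    d.bank   = FIELD(pa, b_bits); pa >>= b_bits;  /* b: bank */
    d.rank   = FIELD(pa, l_bits); pa >>= l_bits;  /* l: rank */
    d.row    = FIELD(pa, r_bits);                 /* r: row, highest field */
    return d;
}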

Close-Page Baseline Address Mapping Scheme

Similar to the baseline address mapping scheme for open-page memory systems, consecutive cacheline addresses are mapped to different channels in a close-page memory system. However, unlike open-page memory systems, mapping cachelines with sequentially consecutive addresses to the same bank, same rank, and same channel of memory will result in sequences of bank conflicts and greatly reduce available memory bandwidth. To minimize the chances of bank conflict, adjacent cachelines are mapped to different channels, then to different banks, and then to different ranks in close-page memory systems. The baseline close-page address mapping scheme is denoted as r:n:l:b:k:z.

13.3.4 Parallelism vs. Expansion Capability

In modern computing systems, one capability that system designers often provide to end-users is the ability to configure the capacity of the memory system by adding or removing memory modules. In the context of address mapping schemes, the memory expansion capability means that the respective channel, row, column, rank, and bank address ranges must be flexibly adjustable in the address mapping scheme depending on the configuration of the DRAM modules inserted into the memory system. As an example, in contemporary desktop computer systems, system memory capacity can be adjusted by adding or removing memory modules with one or two ranks of DRAM devices per module. In these systems, rank indices are mapped to the highest address range in the DRAM memory system. The result of such a mapping scheme means that an application that utilizes only a subset of the memory address space would typically make use of fewer ranks of memory than are available in the system. An address mapping scheme that allows for expansion capability thus presents less rank parallelism to memory accesses. Similarly, in cases where multiple channels can be configured independently and the memory system supports asymmetrical channel configurations, channel indices are also mapped to the high address ranges, and the parallelism presented by multiple channels may not be available to individual applications. As a result, some high-performance systems enforce configuration rules that dictate symmetrical channel configurations.

In the respective baseline address mapping schemes described previously, channel and rank indices are mapped to the low-order address bits. However, in a flexible, user-configurable memory system, the channel and rank indices are moved to the high-order address bits. The result is that an expandable open-page memory system would utilize an address mapping scheme that is comparable to the ordering of k:l:r:b:n:z, and the k:l:r:n:b:z address mapping scheme would be used in an expandable, close-page memory system. In these address mapping schemes geared toward memory system expandability, some degree of channel and rank parallelism is lost to single threaded workloads that use only a subset of the contiguous physical address space. The loss of parallelism for single threaded workloads in memory systems designed for configuration flexibility is less of a concern for memory systems designed for large multi-processor systems. In such systems, concurrent memory accesses from different memory-access streams to different regions of the physical address space would make better use of the parallelism offered by multiple channels and multiple ranks than the single threaded workload.

13.3.5 Address Mapping in the 82955X MCH

In this section, Intel's 82955X Memory Controller Hub (MCH) is used as an example to illustrate address mapping schemes in a high-performance, multi-channel, multi-rank memory system. The 82955X MCH contains two memory controllers that can independently control two channels of DDR2 SDRAM devices. Each channel in the 82955X MCH supports up to four ranks of DRAM devices, and Table 13.3 summarizes six possible rank configurations supported by the 82955X MCH. In practical
The typically make use of fewer ranks of memory than 82955X MCH contains two memory controllers that are available in the system. The address mapping can independently control two channels of DDR2 scheme that allows for expansion capability thus SDRAM devices. Each channel in the 82955X MCH presents less rank parallelism to memory accesses. supports up to four ranks of DRAM devices, and Similarly, in cases where multiple channels can be Table 13.3 summarizes six possible rank confi gura- confi gured independently and the memory system tions supported by the 82955X MCH. In practical Chapter 13 DRAM MEMORY CONTROLLER 507

terms, system boards that utilize the 82955X MCH can support one or two memory modules in each channel, and each memory module is composed of one or two identically configured ranks listed in Table 13.3.

The 82955X MCH supports address mapping schemes that are optimized for an open-page memory system. The interesting aspect of the 82955X MCH is that it supports different mapping schemes that are respectively targeted to obtain higher performance or configuration flexibility, and the 82955X MCH can deploy different mapping schemes depending on the organization of the memory modules inserted into the system. Specifically, the 82955X MCH can support the two channels configured with symmetric or asymmetric organizations of memory modules. Moreover, the 82955X MCH uses rank configuration registers to perform address mapping on a rank-by-rank basis. The topics of channel symmetry and per-rank address mapping in the 82955X MCH are examined in detail in the following sections.

Symmetric and Asymmetric Dual Channel Modes

In the case that the two channels are populated with memory modules with symmetrically matched capacities, the 82955X MCH can operate in symmetric dual channel mode. In the symmetric dual channel mode, sequentially consecutive cacheline addresses are mapped to alternating channels so that requests from a streaming request sequence are mapped to both channels concurrently. In the case where the 82955X MCH is configured with different capacities of memory modules in the two channels, the 82955X MCH operates in asymmetric dual channel mode. In the asymmetric dual channel mode, the physical address space is mapped from 0 MB to the capacity of channel 0 and then to the full capacity of channel 1. In this manner, requests from a streaming request sequence are mapped to one channel at a time unless the array address space spans both channels.

Figure 13.3 illustrates a symmetric configuration and an asymmetric configuration for the 82955X MCH. The figure shows that in the symmetric configuration, both channel 0 and channel 1 are populated with a single-rank, 512-MB memory module that occupies rank 0 and a single-rank, 256-MB memory module that occupies rank 1. Although the 256-MB memory modules in rank 1 of both channels are not identical in organization, the fact that they are identical in capacity is sufficient for the 82955X MCH to utilize the symmetric dual channel mode. In contrast, the asymmetric configuration example shows that the two channels are populated with memory modules with different capacities and different numbers of ranks. In the asymmetric configuration, the physical address space extends from 0 to 512 MB in channel 0 and then from 512 to 1536 MB in channel 1.
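The two channel-mapping modes can be summarized in a brief sketch. The bit-6 interleave granularity for the symmetric mode is taken from the discussion of Figure 13.4 later in this section; the capacity argument reflects the example configuration.

/* Channel selection in the two dual channel modes (sketch). */
#include <stdint.h>

/* symmetric mode: consecutive cachelines alternate channels on bit 6 */
int channel_symmetric(uint64_t pa)
{
    return (int)((pa >> 6) & 1);
}

/* asymmetric mode: channel 0 owns addresses below its capacity,
   channel 1 owns the remainder (e.g., 0-512 MB, then 512-1536 MB) */
int channel_asymmetric(uint64_t pa, uint64_t chan0_capacity)
{
    return pa < chan0_capacity ? 0 : 1;
}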

TABLE 13.3 DDR2 SDRAM rank configurations

Rank      Device Configuration:      Rank Composition:   Rank Configuration:      Bank Address  Row Address  Column Address  Column Address
Capacity  bank count x row count x   device density x    bank count x row count   Bits (b)      Bits (r)     Bits (c)        Offset (v)
          col count x col size       device count        x col count x col size
          (col size in bytes)                            (B x R x C x V)

128 MB    4 x 8192 x 512 x 2         256 Mbit x 4        4 x 8192 x 512 x 8       2             13           9               3
256 MB    4 x 8192 x 1024 x 2        512 Mbit x 4        4 x 8192 x 1024 x 8      2             13           10              3
256 MB    4 x 8192 x 1024 x 1        256 Mbit x 8        4 x 8192 x 1024 x 8      2             13           10              3
512 MB    8 x 8192 x 1024 x 2        1 Gbit x 4          8 x 8192 x 1024 x 8      3             13           10              3
512 MB    4 x 16384 x 1024 x 1       512 Mbit x 8        4 x 16384 x 1024 x 8     2             14           10              3
1024 MB   8 x 16384 x 1024 x 1       1 Gbit x 8          8 x 16384 x 1024 x 8     3             14           10              3

[Figure: Two panels showing the Intel 82955X MCH and its two channels. Symmetric Configuration: channel 0 and channel 1 each populated with a 512-MB module in rank 0 and a 256-MB module in rank 1. Asymmetric Configuration: the two channels populated with modules of different capacities across ranks 0 through 2.]

FIGURE 13.3: Symmetric and asymmetric channel configurations in the 82955X MCH.

Address Mapping Configuration Registers

The 82955X MCH uses configuration registers to support different address mapping schemes for performance optimization or configuration flexibility. Two types of configuration registers are used in the 82955X MCH to aid it in the task of address mapping: rank address boundary registers and rank architectural registers.

The function of the rank address boundary register is to define the upper address boundary for a given rank of DRAM devices. There are four rank address boundary registers per channel. In asymmetrical channel mode, each register contains the highest addressable location of a single channel of memory. In symmetrical channel mode, cacheline addresses are interleaved between the two channels, and the rank address boundary registers contain the upper address boundaries for a given rank of DRAM devices for both channels of memory. The rank architectural registers identify the size of the row for the DRAM devices inserted into the system. In the 82955X MCH, there are four rank architectural registers per channel, with one register per rank.

Per-Rank Address Mapping Schemes

The address mapping configuration registers are set at system initialization time by the 82955X MCH, and they contain values that reflect the capacity and organization of the DRAM devices in the memory system. With the aid of the rank address boundary configuration registers, the 82955X MCH can unambiguously resolve a physical address location to a given rank of memory. Then, with the aid of the rank architectural registers, the organization of the DRAM devices in the given rank of memory is also known, and the physical address can be further separated into row, bank, and column addresses. That is, the address mapping scheme in the 82955X MCH is generated on a rank-by-rank basis.

Figure 13.4 illustrates the address mapping scheme of the 82955X MCH. The figure shows that in single channel or asymmetric dual channel mode, the 82955X MCH maps the three least significant bits of the physical address as the byte offset of the 8-byte-wide memory module. Then, depending on the number of columns on the memory module, the next 9 or 10 bits of the physical address are used to denote the column address field. The bank address then follows the column address, and the most significant address bits are used for the row address. In the 82955X MCH, the channel address is mapped to different address bit fields, depending on the operating mode of the memory controller. Figure 13.4 does not show the channel address field for the single channel or asymmetric channel mode, since each channel is a contiguous block of memory, and the channel address is mapped to the highest available physical address bit field.
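A minimal sketch of rank resolution with the rank address boundary registers follows. The register layout is an assumption for illustration and does not reproduce the actual 82955X register definitions; the boundary values model the 512-MB plus 256-MB example configuration, with unpopulated ranks repeating the last boundary.

/* Rank resolution via boundary registers (sketch). Each entry holds the
   first physical address beyond the rank. */
#include <stdint.h>

#define NUM_RANKS 4

static const uint64_t rank_boundary[NUM_RANKS] = {
    512ull << 20, 768ull << 20, 768ull << 20, 768ull << 20
};

int resolve_rank(uint64_t pa)
{
    for (int r = 0; r < NUM_RANKS; r++)
        if (pa < rank_boundary[r])
            return r;      /* first boundary above pa selects the rank */
    return -1;             /* address beyond populated memory */
}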

[Figure: For each rank configuration of Table 13.3 (128-MB through 1024-MB ranks), the mapping of physical address bits 31:0 into row, bank, and column address fields plus the byte offset of the 8-byte-wide memory module. The top half shows the per-channel, per-rank address mapping scheme for single/asymmetric channel mode; the bottom half shows the per-rank scheme for dual channel symmetric mode, in which physical address bit 6 denotes the channel ID and the higher fields are shifted left by one bit position.]

FIGURE 13.4: Per-rank address mapping schemes in the 82955X MCH.

However, in symmetric dual channel mode, the cacheline addresses are interleaved between the two channels, and the channel address is mapped to the low range of the address bit fields. Specifically, Figure 13.4 shows that the address mapping schemes for the single/asymmetric channel mode and the dual channel symmetric mode are essentially the same; the only difference between the two sets of address mapping schemes is that the 6th bit in the physical address is used to denote the channel address, and the respective address fields in the dual channel mode are shifted over by 1 bit position to the left.

Quick Summary of Address Mapping in the 82955X MCH

Figure 13.4 serves as a concrete example that illustrates some interesting aspects of address mapping schemes used in contemporary memory controllers. In particular, the address mapping schemes illustrated in Figure 13.4 for the 82955X MCH are classical

open-page-optimal address mapping schemes. For example, Figure 13.4 shows that in the single/asymmetric channel mode, the address mapping scheme in the 82955X MCH can be represented as k:l:r:b:n:z, and in the symmetric dual channel mode, the address mapping scheme can be represented as l:r:b:n:k:z. In both cases, the column address fields are mapped to the low address ranges so that spatially adjacent memory address locations can be directed to the same open page. Similarly, in the various address mapping schemes illustrated in Figure 13.4, the 82955X MCH shows that the side effect of granting end-users the ability to configure the memory system with differently organized memory modules is that rank parallelism to spatially adjacent memory accesses is lost. Although the rank address field is not explicitly illustrated in Figure 13.4, the use of the address boundary registers and per-rank address mapping schemes means that the rank address field is mapped to the high address ranges above the row address field.

Figure 13.4 shows that the 82955X MCH has been cleverly designed so that most of the bit positions are directed to the same address fields regardless of the organization of the memory modules in the memory system. For example, Figure 13.4 shows that physical address bits 16 through 26 are used to denote row addresses 0 through 10 in the single/asymmetric channel mode, regardless of the number and type of memory modules placed in the memory system. In this manner, only a few bit positions have to be dynamically adjusted depending on the organization of the memory system, and the bit positions shown with the grey background in Figure 13.4 are always directed to the same address fields.

Finally, the address mapping scheme in the 82955X MCH means that single threaded streaming applications often cannot take advantage of the parallelism afforded by multiple ranks and the two channels in asymmetric channel mode. Fortunately, multi-processor and multi-threaded processor systems with concurrently executing contexts can access different regions of memory and may be able to take advantage of the parallelism afforded by the multiple ranks and multiple channels in asymmetric channel mode. However, the amount of achievable parallelism depends on the specific access request sequences and the locations of the data structures accessed by the concurrently executing process contexts.

13.3.6 Bank Address Aliasing (Stride Collision)

One additional issue in the consideration of an address mapping scheme is the problem of bank address aliasing. The problem of bank address aliasing occurs when arrays whose respective sizes are relatively large powers-of-two are accessed concurrently with strided accesses to the same bank. Figure 13.4 shows that in a system that uses 1-GB DDR2 SDRAM memory modules with the 82955X MCH in dual channel mode, the bank address for each access is obtained from physical address bit positions 14 through 16. That is, in this system configuration, all contiguously allocated arrays that are aligned on address boundaries that are integer multiples of 2^17 bytes from each other would have array elements that map to identical banks for all corresponding array elements.

For example, the task of array summation, where the array elements of arrays A and B are added together and then stored into array C, requires that the corresponding elements of A, B, and C be accessed concurrently. In the case where arrays A, B, and C are contiguously allocated by the system and mapped to integer multiples of 128-kB address boundaries from each other, array elements A[i], B[i], and C[i] would be mapped to different rows within the same bank for all valid array indices i, resulting in multiple bank conflicts for each step of the array summation process in the system described above.

In general, the bank address aliasing problem can be alleviated by several different methods. One method that can alleviate the bank address aliasing problem is the conscientious application of padding or offsets to large arrays so that bank conflicts are not generated throughout concurrent array accesses to those large arrays10 (a sketch of this method follows the footnote below).

10 A simple offset insertion increased STREAM Triad bandwidth by 25% in a test system with an Intel i875P system controller.
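The padding method can be sketched as follows, using the example system described above in which the bank address occupies physical address bits 14 through 16. The sketch assumes that contiguous virtual addresses translate to contiguous physical addresses (true only for large pages or direct-mapped regions) and that the unpadded array spacing would be a multiple of 128 kB, the aliasing case; error handling is omitted.

/* Sketch of offset insertion to break bank aliasing. */
#include <stdlib.h>
#include <stdint.h>

#define BANK_SHIFT 14
#define BANK_PAD (1u << BANK_SHIFT)   /* a 16-kB offset selects the next bank */

unsigned bank_of(uintptr_t addr)
{
    return (unsigned)((addr >> BANK_SHIFT) & 0x7);  /* bits 14:16 */
}

/* allocate A, B, and C in one block, each offset by BANK_PAD so that
   corresponding elements A[i], B[i], and C[i] fall in different banks */
void alloc_padded(size_t n, double **a, double **b, double **c)
{
    char *base = malloc(3 * n * sizeof(double) + 2 * BANK_PAD);
    *a = (double *)base;
    *b = (double *)(base + (n * sizeof(double) + BANK_PAD));
    *c = (double *)(base + 2 * (n * sizeof(double) + BANK_PAD));
}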

A second method that can alleviate the bank address aliasing problem is the conscientious design of a virtual memory system that can purposefully allocate large arrays to non-contiguous pages in the physical address space. In this manner, the chance of a bank conflict changes from a guaranteed event that occurs for every single access to the array to a probabilistic event that depends on the number of banks and ranks in the memory system. Finally, improved address mapping schemes have been proposed to alleviate the bank address aliasing problem, and they are described in the following section.

Hardware Solution to the Address Aliasing Problem

The bank address aliasing problem has been investigated by Lin et al. [2001] and Zhang et al. [2000]. The schemes proposed by Lin and Zhang are similar schemes applied to different memory systems. The basic idea of taking the row address and bitwise-XORing it with the bank address to generate new bank addresses that are not aligned for concurrent accesses to large arrays is common to both designs. However, the generous rank and bank parallelism in the fully configured Direct RDRAM memory system allowed Lin to create a 1:1 mapping that permutes the available number of banks through the entire address space in the system configuration examined. In contrast, Zhang illustrated a more modest memory system where the page index was larger than the bank index. The mapping scheme described by Zhang is shown in Figure 13.5. Figure 13.5 shows that the problem for the scheme described by Zhang is that there are relatively few banks in contemporary SDRAM and DDRx SDRAM memory systems, and for a DRAM memory system with 2^b banks, there are only 2^b possible permutations in a 1:1 mapping that maps a physical address to the memory address. In the bank address permutation scheme for the conventional SDRAM-type memory system proposed by Zhang, the address aliasing problem is simply shifted to a larger granularity. That is, without the bank permutation scheme illustrated in Figure 13.5, arrays aligned on address boundaries of 2^(b+p) bytes would suffer a bank conflict on every pair of concurrent array accesses. The implementation of the bank permutation scheme means that arrays aligned on address boundaries of 2^(b+p) bytes no longer suffer from the same address aliasing problem, but arrays that are aligned on address boundaries of 2^(b+p+b) bytes continue to suffer a bank conflict on every pair of concurrent array accesses. Essentially, there are not enough banks to rotate through the entire address space in a contemporary memory system to completely avoid the memory address aliasing problem.

[Figure: The physical address is divided into a page index, a b-bit bank index, and a p-bit page offset. The low b bits of the page index are XORed with the bank index to form the new bank index; the page index and page offset pass through unchanged.]

FIGURE 13.5: Address mapping scheme proposed by Zhang et al. [2000].
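A minimal sketch of the permutation of Figure 13.5 follows, under the assumption that the b bank-index bits sit directly above a p-bit page offset; the parameter values (4-kB pages, 8 banks) are illustrative.

/* XOR bank permutation in the spirit of Figure 13.5 (sketch). */
#include <stdint.h>

enum { p_bits = 12, b_bits = 3 };

uint64_t permute_bank(uint64_t pa)
{
    uint64_t offset  = pa & ((1ull << p_bits) - 1);
    uint64_t bank    = (pa >> p_bits) & ((1ull << b_bits) - 1);
    uint64_t page    = pa >> (p_bits + b_bits);
    uint64_t newbank = bank ^ (page & ((1ull << b_bits) - 1));
    /* page index and page offset pass through unchanged */
    return (page << (p_bits + b_bits)) | (newbank << p_bits) | offset;
}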

13.4 Performance Optimization

The performance characteristics of a modern DRAM memory controller depend on implementation-specific DRAM command and memory transaction ordering policies. A DRAM controller can be designed to minimize complexity without regard to performance or designed to extract the maximum performance from the memory system by implementing aggressive DRAM command and memory transaction ordering policies. DRAM command and transaction ordering policies have been studied by Briggs et al. [2002], Cuppu et al. [1999], Hur and Lin [2004], McKee et al. [1996a], Lin et al. [2001], and Rixner et al. [2000]. In studies performed by Briggs et al., Cuppu et al., McKee et al., Lin et al., and Rixner et al., various DRAM-centric scheduling schemes are examined. In the study performed by Hur et al., the observation is noted that the ideal DRAM scheduling algorithm depends not only on the optimality of scheduling to the DRAM memory system, but also on the requirements of the application. In particular, the integration of DRAM memory controllers with the processor core onto the same silicon die means that the processor core can interact directly with the memory controller and provide direct feedback to select the optimal DRAM command scheduling algorithm.

The design of a high-performance DRAM memory controller is further complicated by the emergence of modern, high-performance, multi-threaded processors and multi-core processors. While the use of multi-threading has been promoted as a way to hide the effects of memory-access latency in modern computer systems, the net effect of multi-threaded and multi-core processors on a DRAM memory system is that the intermixed memory request stream from the multiple threaded contexts to the DRAM memory system disrupts the row locality of the request pattern and increases bank conflicts [Lin et al. 2001]. As a result, an optimal DRAM controller design has to account not only for the idiosyncrasies of specific DRAM memory systems and application-specific requirements, but also for the type and number of processing elements in the system.

The large number of design factors that a design engineer must consider underlines the complexity of a high-performance DRAM memory controller. Fortunately, some basic strategies exist in common for the design of high-performance DRAM memory controllers. Specifically, the strategies of bank-centric organization, write caching, and seniors first are common to many high-performance DRAM controllers, while specific adaptive arbitration algorithms are unique to specific DRAM controllers and systems.

13.4.1 Write Caching

One strategy deployed in many modern DRAM controllers is the strategy of write caching. The basic idea for write caching is that write requests are typically non-critical in terms of latency, but read requests are typically critical in terms of latency. As a result, it is typically desirable to defer write requests and allow read requests to proceed ahead of write requests, as long as the memory ordering model of the system supports this optimization and the functional correctness of programs is not violated. Furthermore, DRAM devices are typically poorly designed to support back-to-back read and write requests. In particular, a column read command that occurs immediately after a column write command typically incurs a large penalty in the data bus turnaround time in conventional DDRx SDRAM devices due to the fact that the column read command must await the availability of the internal datapath of the DRAM device that is shared between read and write commands.

[Figure: Command and data bus timing for a column read command that follows a column write command to open banks of the same rank. The write to bank i of rank m places its data burst on the bus after tCWD and then occupies the shared I/O gating resource during data restore; the read to bank j of the same rank must wait through the burst and the write-to-read turnaround time tWTR before its own data burst appears.]

FIGURE 13.6: Column read command following a column write command to open banks.

[Figure: Memory transaction requests from CPU and I/O request streams (e.g., col 0x60, row 0x19C, rank 0, bank 5, channel 0; col 0x8E, row 0x7E2, rank 0, bank 2, channel 0) are translated by DRAM address mapping (row:column:bank:offset) and sorted into per-bank queues (Bank 0 through Bank n-1), each of a given queue depth. Request scheduling then proceeds round-robin through the n banks, queue-depth weight based, or resource-availability based.]

FIGURE 13.7: Per-bank organization of DRAM request queues.

Figure 13.6 repeats the illustration of a column read command that follows a write command and shows that, due to the difference in the direction of data flow between read and write commands, significant overheads exist when column read and write commands are pipelined back to back. The strategy of write caching allows read requests that may be critical to application performance to proceed ahead of write requests, and write caching can also reduce read-write turnaround overheads when combined with a strategy that bursts multiple write requests to the memory system consecutively. One memory controller that utilizes the write caching strategy is Intel's i8870 system controller, which can buffer upwards of 8 kB of write data in order to prioritize read requests over write requests. However, in systems that implement write caching, significant overhead in terms of latency or hardware complexity may exist, since the address of every pending read request must be checked against the addresses of the cached writes, and the memory controller must provide the consistency guarantee that ensures the correctness of memory-access ordering.

13.4.2 Request Queue Organizations

To control the flow of data between the DRAM memory controller and DRAM devices, memory transactions are translated into sequences of DRAM commands in modern DRAM memory controllers. To facilitate the pipelined execution of these DRAM commands, the commands may be placed into a single queue or multiple queues. With the DRAM commands organized in the request queuing structure, the DRAM memory controller can then prioritize DRAM commands based on many different factors, including, but not limited to, the priority of the request, the availability of resources to a given request, the bank address of the request, the age of the request, and the access history of the agent that made the request.

One organization that can facilitate the pipelined execution of DRAM commands in a high-performance DRAM memory controller is the per-bank queuing organization.11 In the per-bank queuing structure, memory transaction requests, assumed to be of equal priority, are sorted and directed to different queues on a bank-by-bank basis.

11 The per-bank request queuing construct is an abstract construct. Memory controllers can utilize a unified queue with sophisticated hardware to perform the transaction reordering and bank rotation described herein, albeit with greater difficulty.

Figure 13.7 shows one organization of a set of request queues organized on a per-bank basis. In the organization illustrated in Figure 13.7, memory transaction requests are translated into memory addresses and directed into different request queues based on their respective bank addresses. In an open-page memory controller with request queues organized comparably to Figure 13.7, multiple column commands can be issued from a given request queue to a given bank if the column access commands are directed to the same open row. In the case where a given request queue has exhausted all pending requests to the same open row and all other pending requests in the queue are addressed to different rows, the request queue can then issue a precharge command and allow the next bank to issue commands into the memory system.12

Figure 13.7 shows that one way to schedule requests from the per-bank queuing structure is to rotate the scheduling priority in a round-robin fashion from bank to bank. The round-robin, bank-rotation command scheduling scheme can effectively hide DRAM bank conflict overhead to a given bank if there are sufficient numbers of pending requests to other banks that can be processed before the scheduling priority rotates back to the same bank. In a close-page memory controller, the round-robin bank-rotation scheme maximizes the temporal distance between requests to any given bank without requiring sophisticated logic circuits to guard against starvation. However, the round-robin scheme may not always produce optimal schedules, particularly for open-page memory controllers. In open-page memory controllers, the address mapping scheme maps spatially adjacent cachelines to open rows, and multiple requests to an open row may be pending in a given queue. As a result, a weight-based priority scheduling scheme, where the queue with the largest number of pending requests is prioritized ahead of queues with fewer pending requests, may be more effective than a strictly round-robin priority scheduling scheme.13

The per-bank queue organization may be favored in large memory systems where high request rates are directed to relatively few banks. In memory systems where relatively low access rates are spread across a large number of banks, dedicated per-bank queues are less efficient at organizing requests. In these memory systems, the queue structure may be better organized as a pool of general-purpose queue entries, where each queue entry can be directed to a different bank as needed.

12 The bank-centric organization assumes that all requests from different agents are of equal priority and that access rates from different agents are comparable. In practical terms, additional safeguards must be put in place to prevent the scenario where a constant stream of requests to an open row starves other requests to different rows. In some controllers, a limit is placed on the maximum number of consecutive column commands that can be scheduled to an open row before the open row is closed, to prevent starvation and to ensure some degree of fairness to requests made to different rows of the same bank.

13 Weight-based schemes must also be constrained by age considerations to prevent starvation.
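The sketch below puts the pieces of this subsection together: per-bank queues in the spirit of Figure 13.7, the two policies named there (round-robin bank rotation and queue-depth weighting), and a simplified per-bank variant of the anti-starvation safeguards described in footnotes 12 and 13. Class and parameter names are illustrative assumptions.

    from collections import deque

    class PerBankScheduler:
        """Illustrative per-bank queue pool (cf. Figure 13.7) with a
        consecutive-issue cap as a simplified anti-starvation guard."""

        def __init__(self, num_banks, policy="round_robin", max_consecutive=4):
            self.queues = [deque() for _ in range(num_banks)]
            self.policy = policy
            self.rr_pointer = 0            # next bank for round-robin priority
            self.max_consecutive = max_consecutive
            self.last_bank, self.streak = None, 0

        def enqueue(self, bank, request):
            self.queues[bank].append(request)

        def _pick_bank(self):
            n = len(self.queues)
            banks = [b for b in range(n) if self.queues[b]]
            if not banks:
                return None
            # Anti-starvation: after max_consecutive issues to one bank,
            # another non-empty bank (if any) must be selected.
            if self.streak >= self.max_consecutive and len(banks) > 1:
                banks = [b for b in banks if b != self.last_bank]
            if self.policy == "queue_depth":
                # Weight-based: deepest queue wins (see footnote 13).
                return max(banks, key=lambda b: len(self.queues[b]))
            # Round-robin: first non-empty bank at or after the pointer.
            return min(banks, key=lambda b: (b - self.rr_pointer) % n)

        def next_request(self):
            bank = self._pick_bank()
            if bank is None:
                return None
            self.streak = self.streak + 1 if bank == self.last_bank else 1
            self.last_bank = bank
            self.rr_pointer = (bank + 1) % len(self.queues)
            return self.queues[bank].popleft()

A pooled, general-purpose queue organization would replace the fixed per-bank deques with a shared entry pool tagged by bank, at the cost of the more sophisticated selection hardware noted in footnote 11.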

FIGURE 13.8: Priority-based scheduling for refresh requests. (Block diagram: the CPU request stream is split into read and write request queues, while the refresh request stream enters a refresh queue at a rate of one request per 7.8 µs; priority-based scheduling selects among the read, write, and refresh queues.)

13.4.3 Refresh Management

One issue that all modern DRAM controllers must deal with to ensure the integrity of data stored in DRAM devices is the refresh function. In the case where a modern memory system is inactive for a short period of time, all DRAM devices can make use of a device-controlled self-refresh mode, where the DRAM memory controller can be temporarily powered down and placed into a sleep state while the DRAM device controls its own refresh function. However, the entry into and exit out of the self-refresh mode is typically performed under the explicit control of the DRAM memory controller, and the self-refresh action is not engaged during normal operation in most modern DRAM devices.

One exception to the explicit management of the refresh function by the memory controller can be found in some pseudo-static DRAM devices such as MobileRAM, where temperature-compensated self-refresh is used as part of the normal operating mode to minimize refresh power consumption. Moreover, the hidden self-refresh removes the complexity of refresh control from the memory controller and contributes to the illusion of the MobileRAM DRAM device as a pseudo-static memory device. The timing and interval of the DRAM refresh action in these devices are hidden from the memory controller. Therefore, in the case where a memory request from the memory controller collides with the hidden refresh action, the pseudo-static DRAM device asserts a wait signal to inform the memory controller that the data return will be delayed until the self-refresh action within the device is completed and the wait signal is deasserted. However, a wait signal that can delay state transitions in the memory controller effectively introduces a slow signal path into the memory controller and limits its operating frequency. Consequently, the explicit wait-state signal that enables the hidden self-refresh function in normal operating mode is used only in relatively low-frequency, pseudo-static DRAM-based memory systems designed for battery-operated mobile platforms, and the refresh function in DRAM devices targeted for high-frequency DRAM memory systems remains under the purview of the DRAM memory controller.

To ensure the integrity of data stored in DRAM devices, each DRAM row that contains valid data must be refreshed at least once per refresh period, typically 32 or 64 ms in duration.14 For a DRAM device that requires 8192 refresh commands every refresh period, simple arithmetic dictates that an all-banks-concurrent refresh command must be issued to the DRAM device once every 7.8 µs to meet a 64-ms period requirement. Fortunately, DRAM refresh commands can be deferred for short periods of time to allow latency-critical memory read requests to proceed ahead of them. Consequently, the DRAM controller need not adhere to a strict requirement of sending an all-banks-concurrent refresh command to the DRAM device every 7.8 µs. To take advantage of the fact that refresh commands can be deferred within a reasonable timing window, Figure 13.8 shows an organization of the queuing structure in which the microprocessor request stream is separated into read and write request queues and refresh requests are placed into the refresh queue at a constant rate of one refresh command every 7.8 µs. In the structure illustrated in Figure 13.8, each refresh request is tagged with a count that denotes the number of cycles that the request has been deferred. In this manner, as long as a refresh request is below a preset deferral threshold, all read and write requests have priority over the refresh request.15 In the case where the system is idle with no other pending read or write requests, the refresh request can be sent to the DRAM devices immediately. In the case where the system is filled with pending read and write requests but a DRAM refresh request has nearly exceeded the maximum deferral time, that DRAM refresh request receives the highest scheduling priority to ensure that the refresh occurs within the required time period and data integrity is maintained in the memory system.

14 Some DRAM devices contain additional registers to define the ranges of rows that need to be refreshed. In these devices, the refresh action can be skipped for rows that do not contain valid data.

15 The maximum refresh request deferral time is defined in the device data sheet for DRAM devices. In modern DRAM devices such as DDR2 SDRAM devices, the periodic DRAM refresh command can be deferred for as long as nine 7.8-µs refresh command intervals.
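A minimal sketch of the deferral-counter mechanism of Figure 13.8 follows. The nine-interval bound matches footnote 15; everything else — the class name, the queue interfaces, tracking age in whole tREFI intervals rather than cycles — is a simplifying assumption for exposition.

    from collections import deque

    T_REFI_US = 64_000.0 / 8192   # 64 ms / 8192 commands = 7.8125 us
    MAX_DEFER = 9                 # DDR2-style bound (see footnote 15)

    class RefreshScheduler:
        """Pending refreshes carry a deferral count; below the threshold
        reads and writes win, and a nearly overdue refresh preempts all."""

        def __init__(self):
            self.refresh_ages = deque()   # one age counter per pending refresh

        def on_refresh_interval(self):
            # Called once per tREFI: age pending refreshes, inject a new one.
            for i in range(len(self.refresh_ages)):
                self.refresh_ages[i] += 1
            self.refresh_ages.append(0)

        def next_command(self, read_q, write_q):
            overdue = bool(self.refresh_ages) and \
                self.refresh_ages[0] >= MAX_DEFER - 1
            idle = not read_q and not write_q
            if self.refresh_ages and (overdue or idle):
                self.refresh_ages.popleft()
                return "REFRESH"            # overdue refresh outranks all
            if read_q:
                return read_q.popleft()     # reads are latency critical
            if write_q:
                return write_q.popleft()
            return None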

TABLE 13.4 Refresh cycle times of DDR2 SDRAM devices

Density     Bank Count   Row Count   Row Size (bits)   tRC: Row Cycle Time (ns)   tRFC: Refresh Cycle Time (ns)
256 Mbit    4            8192        8192              55                         75
512 Mbit    4            16384       8192              55                         105
1 Gbit      8            16384       8192              54                         127.5
2 Gbit      8            32768       8192              54                         197.5
4 Gbit      8            65536       8192              54                         327.5
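The numbers in Table 13.4 allow a first-order estimate of the benefit of the per-bank refresh proposal discussed below, since a per-bank refresh blocks only one bank for tRC rather than every bank for tRFC. The calculation is a rough illustration only and ignores scheduling effects.

    # First-order refresh overhead for the 4-Gbit DDR2 entry in Table
    # 13.4 (tRC = 54 ns, tRFC = 327.5 ns, 8 banks, tREFI = 7.8125 us).
    tRC_ns, tRFC_ns, banks = 54.0, 327.5, 8
    tREFI_ns = 7812.5

    # All-banks-concurrent refresh: every bank is unavailable for tRFC
    # out of each tREFI window.
    all_banks_overhead = tRFC_ns / tREFI_ns          # ~4.2% of bank-time

    # Per-bank refresh: eight separate commands replace one concurrent
    # command, but each blocks only its own bank for tRC, so the other
    # banks keep servicing requests in the meantime.
    per_bank_overhead = tRC_ns / tREFI_ns            # ~0.7% per bank

    print(f"all-banks: {all_banks_overhead:.1%} of bank-time lost")
    print(f"per-bank : {per_bank_overhead:.1%} of bank-time lost")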

One feature under consideration that could further increase the complexity of future DRAM memory controllers is per-bank refresh. Table 13.4 illustrates that tRFC, the refresh cycle time, increases with each generation of higher density DRAM devices, and the bandwidth overhead of the refresh function grows proportionally.16 One proposal to minimize the bandwidth impact of DRAM refresh is to replace or supplement the all-banks-concurrent refresh command with separate refresh commands that refresh one row in one bank at a time, as opposed to refreshing one row in all banks within a rank of DRAM devices concurrently.17

The performance benefit of the separate per-bank refresh command can be easily computed, since each per-bank refresh command need only respect the tRC row cycle time constraint rather than the tRFC refresh cycle time constraint. However, one caveat of the per-bank refresh proposal is that fine-grained control of the refresh function on a per-bank basis means that the complexity of the DRAM memory controller must increase proportionally to deal with separate refresh requests to each bank, and the queuing structure required may be far more complex than the sample queuing structure illustrated in Figure 13.7. Finally, the all-banks-concurrent refresh command also serves as a time period in which the DRAM device can perform housekeeping duties such as signal recalibration between the DRAM devices and the memory controller. Without the all-banks-concurrent refresh command, DRAM devices would lose the guaranteed time period during which circuits within the DRAM devices are active but the interface of the DRAM devices is idle, a window needed for these types of housekeeping duties.

13.4.4 Agent-Centric Request Queuing Organization

In previous sections, techniques to maximize bandwidth utilization and decrease effective read latency were examined. However, one issue left out of the previous discussions is that, regardless of the performance techniques deployed, an overriding consideration in the design of a modern DRAM memory controller is fairness.

16 There are more than 8192 rows in higher density DDR2 SDRAM devices, but the number of refresh commands per 64-ms time period remains constant. For example, the 4-Gbit DDR2 device must refresh 8 rows of data with each refresh command.

17 The refresh command is a relatively energy-intensive operation. The instantaneous power draw of large, multi-rank memory systems, where all ranks are refreshed concurrently with a single refresh command, could significantly increase the peak power consumption profile of the memory system. To limit the peak power profile, some DRAM memory controllers are designed to refresh each rank of DRAM devices separately, with the rank refreshes scheduled some time apart from each other.

That is, in any transaction request reordering mechanism, anti-starvation safeguards must exist to ensure that no request can be deferred for an indefinite period of time. The anti-starvation safeguards are particularly important in the context of multiple agents that share the use of the same memory system. In particular, the issues of fairness and performance optimization for multiple agents with drastically different request rates and address sequences must be carefully traded off against each other. For example, a microprocessor running a typical application may require relatively low bandwidth, but read requests from the processor must be considered latency critical. In contrast, a graphics processor connected to the same memory system may require a large amount of guaranteed bandwidth, but individual memory transaction requests from the graphics processor may be deferred in favor of requests from the microprocessor. Finally, memory transaction requests from relatively low-bandwidth and low-priority I/O devices may be deferred, but these requests cannot be deferred indefinitely without starving the I/O devices. That is, in the context of multiple agents that share a common memory system, better performance may be measured in terms of achieving an equitable balance in the usage of the memory system rather than obtaining the absolute maximum bandwidth from the DRAM devices.

The conflicting requirements of fairness and performance are important considerations that must be accounted for by the DRAM memory controller's scheduling mechanism. Fortunately, the scheduling mechanism used to deal with the DRAM device refresh requirement can be extended to deal with a broad range of agents that require low latency or guaranteed bandwidth. That is, DRAM refresh commands can be considered as a sequence of requests from an agent that requires some amount of guaranteed bandwidth from the DRAM memory system. This agent, along with a number of other agents that require differing amounts of guaranteed bandwidth, must share the memory system with agents that have no fixed bandwidth requirements but need low average access latency.

Figure 13.9 shows an organization of the queuing structure in which the memory controller selects between requests from low-latency agents and guaranteed-bandwidth agents. Requests from low-bandwidth and guaranteed-bandwidth agents are directed to a two-level queuing structure, where requests are first sent to a pending queue and then moved to a scheduling queue under respective rate-controlled conditions. The rate controls ensure that the non-latency-critical request agents cannot saturate the memory system with requests at the expense of other agents.

FIGURE 13.9: Sample system-level arbiter design. (Block diagram: low-latency agents feed priority-based scheduling directly, while guaranteed bandwidth agents (i.e., refresh) and low-bandwidth, low-priority agents pass through pending queues and rate control into scheduling queues before priority-based scheduling issues requests to DRAM.)
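The following sketch models the two-level structure of Figure 13.9. The token-based rate control is an assumed implementation of the figure's "rate control" stage (the text does not prescribe a mechanism), and all names and capacities are illustrative.

    from collections import deque

    class RateControlledAgent:
        """Two-level structure per Figure 13.9: pending queue, rate
        control, and a bounded scheduling queue."""

        def __init__(self, tokens_per_epoch, sched_capacity):
            self.pending = deque()
            self.sched = deque()
            self.tokens_per_epoch = tokens_per_epoch
            self.tokens = tokens_per_epoch
            self.sched_capacity = sched_capacity

        def new_epoch(self):
            self.tokens = self.tokens_per_epoch   # replenish the rate budget

        def promote(self):
            # Rate control: only token-holding requests reach the scheduling
            # queue, so this agent cannot saturate the memory system.
            while self.tokens > 0 and self.pending and \
                    len(self.sched) < self.sched_capacity:
                self.sched.append(self.pending.popleft())
                self.tokens -= 1

    def arbitrate(low_latency_q, guaranteed_bw_agents):
        # A guaranteed-bandwidth agent wins only when its scheduling queue
        # is full (its guarantee is at risk); otherwise low-latency wins.
        for agent in guaranteed_bw_agents:
            if len(agent.sched) >= agent.sched_capacity:
                return agent.sched.popleft()
        if low_latency_q:
            return low_latency_q.popleft()
        for agent in guaranteed_bw_agents:
            if agent.sched:
                return agent.sched.popleft()
        return None

Each scheduling epoch, new_epoch() and promote() run for every rate-controlled agent before arbitrate() selects a request. Low-bandwidth, low-priority agents can be modeled as rate-controlled agents with small token budgets, which bounds both their bandwidth share and their worst-case deferral.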

In the queuing structure illustrated in Figure 13.9, requests from the low-latency agents are typically scheduled with the highest priority, except when a scheduling queue for the guaranteed-bandwidth agents is full. To ensure that the bandwidth guarantees are met, the scheduling priority must favor the guaranteed-bandwidth agents in the case where their shared scheduling queue is full.

In the previous section, DRAM-centric request scheduling algorithms were examined in the context of obtaining the highest performance from the DRAM memory system, given that all requests are equal in importance and can be freely reordered for better performance. However, in a system where multiple agents must share the use of the memory system, not all requests from different agents are equal in importance. As a result, to obtain better performance for the system as a whole, both DRAM-centric and agent-centric algorithms must be considered. Figure 13.9 thus illustrates a two-level scheduling algorithm designed to ensure both fairness and system throughput.

13.4.5 Feedback-Directed Scheduling

In modern computer systems, memory access is performed by the memory controller on behalf of processors or intelligent I/O devices. Memory-access requests are typically encapsulated in the form of transaction requests that contain the type, the address, and, in the case of write requests, the data for the request. However, in the majority of systems, transaction requests do not contain information that would allow a memory controller to prioritize transactions based on the specific requirements of the workload. Rather, memory controllers typically rely on the request type, the access history, the requesting agent, and the state of the memory system to prioritize and schedule memory transactions. In one study performed by Hur and Lin [2004], the use of a history-based arbiter that selects among different scheduling policies is examined in detail. In this study, the memory-access request history is used to select dynamically from different prioritization policies, and speedups between 5 and 60% are observed on some benchmarks.
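A toy selector in the spirit of such a history-based arbiter is sketched below; it is emphatically not Hur and Lin's published algorithm. The policy names, the target read fractions, and the nearest-match selection rule are all assumptions made purely for illustration.

    from collections import deque

    class HistoryBasedArbiter:
        """Toy history-based policy selector: track the recent read/write
        command mix and pick the scheduling policy whose expected mix
        best matches the observed history."""

        def __init__(self, history_len=16):
            self.history = deque(maxlen=history_len)   # recent command types
            # Candidate policies and the read fraction each is tuned for
            # (hypothetical values, for illustration only).
            self.policies = {"read_first": 0.9,
                             "balanced": 0.5,
                             "drain_writes": 0.2}

        def record(self, cmd_type):
            self.history.append(cmd_type)              # "read" or "write"

        def choose_policy(self):
            if not self.history:
                return "balanced"
            read_frac = sum(c == "read" for c in self.history) / len(self.history)
            # Select the policy whose target mix is nearest the observed mix.
            return min(self.policies,
                       key=lambda p: abs(self.policies[p] - read_frac))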

The exploration of a history-based DRAM transaction and command scheduling algorithm by Hur and Lin was enabled by the use of a processor with an integrated DRAM controller, the IBM POWER5. The trend toward the integration of memory controllers and processors means that memory controllers will gain access to transaction scheduling information that they could not access as stand-alone controllers. As more processors are designed with integrated DRAM memory controllers, these processors can communicate directly with the DRAM memory controllers and schedule DRAM commands based not only on the availability of resources within the DRAM memory system, but also on the DRAM command-access history. In particular, as multi-threaded and multi-core processors are integrated with DRAM memory controllers, these DRAM memory controllers not only have to be aware of the availability of resources within the DRAM memory system, but must also be aware of the state and access history of the respective threaded contexts on the processor in order to achieve the highest possible performance.

13.5 Summary

An analogy may be drawn between the transaction queuing mechanism of a modern, high-performance DRAM memory controller and the instruction Reorder Buffer (ROB) of high-performance microprocessors that dynamically convert assembly instructions into internal microoperations that the processor executes out of order. In the ROB, the microprocessor accepts as input a sequence of assembly instructions that it converts to microoperations. In the transaction queue of the memory controller, the transaction queue accepts read and write requests that it must convert to DRAM commands that the memory controller then attempts to execute. Similar to the microoperations in the ROB, DRAM commands can be scheduled subject to the ordering constraints of the transaction requests and the availability of resources.