
Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor

The 12-core AMD Opteron processor, code-named "Magny Cours," combines advances in silicon, packaging, interconnect, cache coherence protocol, and server architecture to increase the compute density of high-volume commodity 2P/4P blade servers while operating within the same power envelope as earlier-generation AMD Opteron processors. A key enabling feature, the probe filter, reduces both the bandwidth overhead of traditional broadcast-based coherence and memory latency.

Pat Conway
Nathan Kalyanasundharam
Gregg Donley
Kevin Lepak
Bill Hughes
Advanced Micro Devices

Recent trends point to high and growing demand for increased compute density in large-scale data centers. Many popular server workloads exhibit abundant process- and thread-level parallelism, so they benefit directly from additional cores. One approach to exploiting thread-level parallelism is to integrate multiple simple, in-order cores, each with multithreading support. Such an approach has achieved high batch throughput on some commercial workloads, but has limitations when response time (latency), distribution of response times (quality of service, or QoS), and user experience are a concern.1,2 Examples of highly threaded applications requiring both high throughput and bounded latency include real-time trading, server-hosted game play, interactive simulations, and Web search.

Despite focused research efforts to simplify parallel programming,3 most server applications continue to be single-threaded and latency sensitive, which favors the use of high-performance cores. In fact, a common use of chip multiprocessor (CMP) servers is simply running multiple independent instances of single-threaded applications in multiprogrammed mode. In addition, for workloads with low parallelism, the highest performance will likely be achieved by a CMP built from more complex cores capable of exploiting instruction-level parallelism from the small number of available software threads. Similarly, many high-performance computing applications that operate on large data sets in main memory and use software pipelining to hide memory latency might run best with moderate numbers of high-performance cores.

The AMD Opteron processor has hardware support for virtualization.4 It lets multiple guest operating systems run on a single system, fully protected from the effects of other guest operating systems, each running


its own set of application workloads. A common usage scenario is to dedicate one core per guest operating system to provide hardware context (thread)-based QoS.

Data centers are consolidating legacy single-application servers onto high-core-count dual- and quad-processor (2P and 4P) blade and rack servers—the dominant segment of the server market. By doubling compute density, "Magny Cours" doubles the number of guest operating systems that can be run per server in this mode. By using larger servers in place of a pool of smaller servers (say, one 4P blade versus four 1P blades), the operating system or hypervisor can flexibly allocate memory and I/O resources across applications and guest operating systems as needed and on demand.

Efficient power management is another first-order consideration in data centers since power budget determines both the data center's maximum scale and its operating cost. Benchmarks such as SPECPower2008 measure power consumption and performance at different hardware utilizations. Such benchmarks reward designs that provide more performance within the same power envelope and conserve power when idle. The consolidation of multiple single-application servers with low hardware utilization levels onto high-core-count blade servers using virtualization results in significant power savings in the data center.

Processor overview
The basic building block in "Magny Cours" is a silicon design, a node that integrates six x86-64 cores, a shared 6-Mbyte level-3 (L3) cache, four HyperTransport3 ports, and two double data rate 3 (DDR3) memory channels (see Figure 1). We built the node using 45-nanometer silicon on insulator (SOI) process technology.

Figure 1. "Magny Cours" silicon block diagram. The node integrates six x86-64 cores, a shared L3 cache, two DDR3 memory channels, and four HT3 ports.

"Magny Cours" die
Each "Magny Cours" processor core is an aggressive out-of-order, three-way superscalar processor. It can fetch and decode up to three x86-64 instructions each cycle from the instruction cache. It turns variable-length x86-64 instructions into fixed-length macro-operations (mops) and dispatches them to two independent schedulers—one for integer and one for floating-point and multimedia operations. These schedulers can dispatch up to nine mops to the following execution resources:

- three integer pipelines, each containing an integer-execution unit and an address-generation unit; and
- three floating-point and multimedia pipelines.

The schedulers also dispatch load-and-store operations to the load/store unit, which can perform two loads or stores each cycle. The processor core can reorder as many as 72 mops. The core has separate instruction and data caches, each 64 Kbytes in size and backed by a large on-chip L2 cache, which is 512 Kbytes. All caches throughout the hierarchy (including the L3 cache) have 64-byte lines. The L1 caches are two-way associative and have a load-to-use latency of three clock cycles; the L2 cache is 16-way associative with a best-case load-to-use latency of 12 clock cycles. As in many prior-generation AMD processor cores, the L1 and L2 caches use an exclusive layout—that is, the L2 cache is a victim cache for the L1 instruction and data caches. Fills from outer cache or DRAM layers (system fills) go directly into the appropriate L1 cache and evict any existing L1 entry, which moves into the L2 cache. System fills typically are not placed into the L2 cache directly. Similarly, most common-case CPU core L2 cache hits are invalidated from the L2 cache and placed into the requesting L1 cache.5,6
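To make the exclusive L1/L2 relationship concrete, here is a minimal sketch of the fill and promotion flow just described. The cache structure and helper functions are invented for illustration; real L1/L2 arrays are set associative with LRU state, which the sketch deliberately omits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy caches: unordered pools of line addresses, sized in 64-byte lines. */
struct cache { uint64_t *lines; int count; int capacity; };

static bool cache_remove(struct cache *c, uint64_t addr)
{
    for (int i = 0; i < c->count; i++)
        if (c->lines[i] == addr) {
            c->lines[i] = c->lines[--c->count];
            return true;
        }
    return false;
}

/* Insert a line; if the cache is full, evict an arbitrary victim and
 * return it (0 means "no victim"). */
static uint64_t cache_insert(struct cache *c, uint64_t addr)
{
    uint64_t victim = 0;
    if (c->count == c->capacity)
        victim = c->lines[--c->count];     /* stand-in for LRU selection */
    c->lines[c->count++] = addr;
    return victim;
}

/* System fill (from L3 or DRAM): the line goes straight into L1, and any
 * displaced L1 line moves into L2, which acts as a victim cache. */
void system_fill(struct cache *l1, struct cache *l2, uint64_t addr)
{
    uint64_t l1_victim = cache_insert(l1, addr);
    if (l1_victim)
        cache_insert(l2, l1_victim);
}

/* Common-case L2 hit: the line is promoted into the requesting L1 and
 * invalidated in L2, preserving exclusivity between the two levels. */
void l2_hit_promote(struct cache *l1, struct cache *l2, uint64_t addr)
{
    cache_remove(l2, addr);
    system_fill(l1, l2, addr);   /* same fill path as an outer-level fill */
}
```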



The shared L3 design plays two distinct roles, exploiting the fact that the L3 cache and the memory controller are colocated on the same die:

- a traditional cache associated with the processor; and
- storage for a cache directory (probe filter), implemented in fast SRAM, associated with the memory controller.

Most silicon nodes will likely be packaged as uniprocessors for desktop systems. As such, we did not want to waste die area on a dedicated on-chip probe filter, which is useful only in multiprocessor configurations. Our implementation allows the probe filter to be enabled in a multiprocessor configuration or disabled in a uniprocessor desktop configuration. The probe filter is also known as HyperTransport Assist (HT Assist) at the platform level. This is a reference to one benefit of the probe filter, which is to conserve system bandwidth; a second benefit is to reduce memory latency.

Multichip module package
The processor packages two dies in a tightly coupled multichip module (MCM) configuration to yield a 12-core processor architecture. The processor interfaces to four DDR3 channels and four HT3 technology ports, as Figure 2 shows. The package is a 42.5 × 60-mm organic land grid array (LGA) and has 1,944 pins—735 more than the earlier-generation AMD Opteron 1,207-pin package. This new socket is known as G34 (Generation 3, four memory channels). It has 1,132 signal I/O, 341 power, and 471 ground pins organized as a 57 × 40 array of contacts on a 1-mm pitch.

Figure 2. Logical view of the G34 multichip module package. The MCM package has 12 cores, four HyperTransport ports, a 12-Mbyte L3 cache, and four memory channels. Each die has six cores, four HyperTransport ports, a 6-Mbyte L3 cache, and two memory channels.

The HT3 ports are ungangable since each ×16 HT3 port can operate as two independent ×8 HT3 ports. This allows us to build highly optimized 2P and 4P blade server topologies by configuring a node as a router with four to eight ports in the network. If pins had not been a constraint, we would have brought out six HT3 ports on the package—three from each node—to allow for total platform-level flexibility. However, pin constraints imposed a limit of four HT3 ports on the package, as Figure 2 shows. An additional design constraint was the requirement that a single high-bandwidth device, such as a GPU, have access to the full ×16 noncoherent HT3 (ncHT3) I/O bandwidth. A single wide ncHT3 link connects to the lower node. Thus, the MCM topology is asymmetric with respect to ncHT3 but symmetric with respect to coherent HT3 (cHT3). The lower node has four wide ×16 HT3 ports, which we allocate as follows:

- one ×16 link for I/O;
- one ×16 plus one ×8 HT3 link to connect to three other sockets in the 4P topology; and
- one ×16 plus one ×8 HT3 link to connect to the other node in the package.

We balance traffic across the pair of on-package HT3 links by routing all ordered traffic (such as I/O direct memory access [DMA] read and write) on the primary ×16 HT3 link, and all unordered traffic (such as cache probes and responses) across the least recently used of the link pair.
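The link-balancing rule for the on-package pair can be captured in a few lines. The sketch below is illustrative only, assuming a simple packet flag and a static least-recently-used indicator; it is not AMD's routing-table implementation.

```c
#include <stdbool.h>

/* Hypothetical model of the on-package HT3 link pair. */
enum ht3_link { LINK_X16_PRIMARY, LINK_X8_SECONDARY };

struct packet {
    bool ordered;   /* I/O DMA reads and writes must stay ordered */
};

/* Which of the two links was used least recently for unordered traffic
 * (ordered traffic is not tracked in this simplified sketch). */
static enum ht3_link lru_link = LINK_X8_SECONDARY;

/* Ordered traffic always uses the primary x16 link; unordered traffic
 * (probes, responses) is spread across the pair by taking the least
 * recently used link. */
enum ht3_link select_on_package_link(const struct packet *p)
{
    if (p->ordered)
        return LINK_X16_PRIMARY;

    enum ht3_link chosen = lru_link;
    lru_link = (chosen == LINK_X16_PRIMARY) ? LINK_X8_SECONDARY
                                            : LINK_X16_PRIMARY;
    return chosen;
}
```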

2P and 4P blade architectures
Network diameter and Xfire (crossfire) bandwidth are two useful metrics for evaluating and comparing topologies. Network diameter is the maximum number of hops to traverse between any pair of nodes in the system. Xfire bandwidth is the maximum coherent memory-read bandwidth in the system when each node accesses its own memory and that of every other node in a round-robin fashion, assuming that the HT3 links are the only limiting factor. The Xfire bandwidth metric is superior to the network-oriented bisection bandwidth metric7 because it measures useful bandwidth (that is, data bandwidth) and captures the interaction of topology, routing tables, and protocol overhead.

Figure 3a illustrates the recommended 2P blade topology. It has a diameter of 1—that is, minimum latency—a key benefit for commercial workloads (hence the name 2P Max Perf). Assuming uniform distribution of traffic (Xfire), the ×8 diagonal links determine the maximum system bandwidth. Using the two horizontal ×16 links reduces the likelihood of hot spotting. We could disable the on-package ×8 HT3 link in this configuration to conserve power without adversely affecting either Xfire bandwidth or average diameter.

Figure 3b shows the recommended 4P topology in platforms with four HT3 I/O channels (4P Max I/O). It consists of two fully connected planes, (P2, P3, P6, P7) and (P0, P1, P4, P5), interconnected by the on-package pair (×16 and ×8) of HT3 links. This topology has a diameter of 2 and an average diameter of 1.25. For a uniform traffic distribution memory access pattern such as Xfire, half of the traffic from each node traverses the on-package HyperTransport link pair and 1/8 of the traffic routes over the ×8 HyperTransport links in each fully connected plane.

Figure 3c shows an alternative 4P topology with two HT3 I/O channels (4P Max Perf). This topology trades I/O connectivity for a more fully connected topology, achieving a reduced average diameter of 1.19. This topology is less susceptible to hot spotting and has lower average latency because of the additional links.

Figure 3d shows yet another 4P topology built with two identical 2P blades. This topology provides a pay-as-you-go upgrade path for customers wanting to start small (2P) and increase the number of processors, memory capacity, and memory bandwidth by adding a second 2P blade. Table 1 shows the Xfire bandwidth and diameter metrics for the four topologies.




Figure 3. Dual processor (2P) blade and quad processor (4P) rack topologies: 2P Max Perf (a), 4P Max I/O (b), 4P Max Perf (c), and 4P Modular 2+2 (d).

Table 1. Xfire bandwidth and diameter metrics for the topologies in Figure 3.

Topology | Diameter (hops) | Average diameter (hops) | Bandwidth using broadcast (Gbytes/sec) | Bandwidth using the probe filter (Gbytes/sec) | Bandwidth increase using the probe filter (%) | DRAM bandwidth using DDR3-1333 (Gbytes/sec)
2P Max Perf | 1 | 0.75 | 76.5 | 125.9 | 64 | 85.2
4P Max Perf | 2 | 1.19 | 56.7 | 143.4 | 152 | 170.4
4P Max I/O | 2 | 1.25 | 56.7 | 143.4 | 152 | 170.4
4P Modular 2+2 | 2 | 1.25 | 38.4 | 71.7 | 86 | 170.4
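To make the diameter columns of Table 1 concrete, the short program below computes hop counts with Floyd-Warshall and derives the diameter and average diameter, averaging over all ordered node pairs including a node with itself (which is how the 0.75 value for the fully connected 2P Max Perf topology arises). The adjacency matrix is an assumed encoding of that four-node topology, not AMD data.

```c
#include <stdio.h>

#define N   4              /* four nodes: two G34 packages, two dies each */
#define INF 1000

int main(void)
{
    /* Assumed adjacency for the 2P Max Perf topology: every node has a
     * direct HT3 link to every other node (fully connected). */
    int adj[N][N] = {
        {0, 1, 1, 1},
        {1, 0, 1, 1},
        {1, 1, 0, 1},
        {1, 1, 1, 0},
    };

    int dist[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dist[i][j] = (i == j) ? 0 : (adj[i][j] ? 1 : INF);

    /* Floyd-Warshall all-pairs shortest hop counts. */
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (dist[i][k] + dist[k][j] < dist[i][j])
                    dist[i][j] = dist[i][k] + dist[k][j];

    int diameter = 0, total = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            if (dist[i][j] > diameter)
                diameter = dist[i][j];
            total += dist[i][j];
        }

    /* Average over all N*N ordered pairs, self-pairs included. */
    printf("diameter = %d hops, average diameter = %.2f hops\n",
           diameter, (double)total / (N * N));
    return 0;
}
```

Running this prints a diameter of 1 hop and an average diameter of 0.75 hops, matching the 2P Max Perf row of Table 1.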

Configurable L3 cache
The 6-Mbyte L3 cache is a victim cache, installing lines that are evicted from any of the core L2 caches. Each of its four subcaches contains tag and data macros to form a 1- or 2-Mbyte 16-way associative cache. We built the L3 cache from two 1-Mbyte subcaches and two 2-Mbyte subcaches. We chose this configuration purely for silicon area considerations. The architecture itself allows one, two, or four subcaches, and each can be 1 or 2 Mbytes. This allows flexibility in system-on-chip (SoC) layouts and lets us use different subcache building blocks in different SoCs with reduced effort. For example, some prior-generation processors used two 1-Mbyte subcaches; other processors might use four 2-Mbyte subcaches, with little overall impact.

We can apply cache probe operations to all subcaches in parallel, or restrict them to a specific subset based on address. The L3 cache allocates lines to an available subcache, accounting for any address-based restrictions and giving preference to any subcache containing invalid entries in the indexed set. If no invalid entries are present, a weighted round-robin algorithm distributes allocations among the subcaches in proportion to their size. On a valid replacement, the L3 cache chooses the subcache to victimize first (using weighted round-robin), then applies a pseudo least recently used (LRU) algorithm within the 16-way associative subcache for final victim selection.

The L3 cache supports a directed addressing mode in which the L3 cache can be sliced based on address. In this mode, both allocations and probes are directed to half of the cache (a pair of subcaches) based on a hash of the address bits. This reduces power and could reduce latency. Each tag probe only ties up resources in part of the cache, letting the L3 cache controller issue more speculative reads in parallel. However, some applications might be sensitive to the reduced effective associativity in this mode, so the benefit is workload dependent.

Architecting the L3 as a victim cache reduces overlap between the contents of the L3 and L2 caches, allowing more data to be cached. In addition, back invalidation of the L2 caches is not required when L3 lines are victimized. The L3 cache can retain a line after providing a copy to a requesting core when true sharing of a line is detected. The core that victimized a line is stored in a field in the L3 tag entry for the line and is used to detect sharing patterns by the L3 cache controller. If the next read is from the same core that allocated the L3 line, the data is deemed private and the L3 cache does not retain a copy. Instead, it passes responsibility for the line back to the core by providing the line in E or M state and invalidating the line in the L3 cache. Otherwise, if it is consistent with the core request type, the L3 cache will keep a copy in anticipation of further sharing. This mechanism lets a core evict E and M state lines to the L3 cache and later retrieve them in the same state if they are not shared.
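The private-versus-shared decision on an L3 hit can be sketched as follows. The tag record and function names are assumptions made for illustration; they show only the shape of the policy described above.

```c
#include <stdbool.h>

enum moesi { M, O, E, S, I };

/* Hypothetical L3 tag record: the allocating (victimizing) core is kept
 * alongside the coherence state so sharing patterns can be detected. */
struct l3_tag {
    enum moesi state;
    int        alloc_core;   /* core that evicted the line into the L3 */
    bool       valid;
};

/* On an L3 read hit, decide whether to keep a copy.  If the requester is
 * the same core that put the line here, the data is treated as private:
 * ownership passes back to the core (line returned in E or M state) and
 * the L3 copy is invalidated.  Otherwise the line looks shared and the
 * L3 keeps a copy in anticipation of further sharing, if the request
 * type allows it. */
bool l3_read_hit(struct l3_tag *tag, int req_core, bool request_allows_sharing)
{
    bool private_line = (req_core == tag->alloc_core);

    if (private_line || !request_allows_sharing) {
        tag->valid = false;          /* pass E/M responsibility back */
        tag->state = I;
        return false;                /* no copy retained in the L3 */
    }
    return true;                     /* copy retained for other cores */
}
```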

Subcache optimization. The tags in the L3 subcache are physically located near the L3 controller to provide hit/miss results with minimum possible latency. The data array within the subcache is divided into four regions, each providing 128 bits of data. Both the control signals to the data regions and the read data return path are pipelined, causing round-trip latency to each successive region to increase by one clock. As a result, the subcache can provide four consecutive 128-bit data values even though the last value might be located a significant distance from the tags. This provides flexibility in placing the data macros and facilitates efficient floorplanning.

Latency optimization. The L3 cache architecture dynamically optimizes both latency and bandwidth. When the cache is lightly loaded, it operates in a latency-reducing mode in which a processor-initiated tag probe assumes a hit and reserves the necessary data buses and buffers in advance. This allows the read of the data macro to be overlapped with the tag probe, which minimizes latency. However, when the request rate is high and the L3 cache does not contain enough resources, the L3 cache controller sends the request to the tags as a query only, which will determine the cache status without initiating a data transfer. If the line is not present in the cache, it forwards the request to DRAM with minimal latency. Otherwise, it issues an L3 data read to the subcache containing the line once the necessary resources are available.

Bandwidth. The L3 cache controller can issue one processor-initiated tag probe or tag update and one probe-initiated tag probe or tag update per clock. Each subcache provides access ports to the required two tags as well as two read data buses and one write data bus. A dedicated read buffer holds the data from each subcache until it can be returned to the requesting core. Data from any subcache can be returned to any core. One dedicated write buffer per core holds allocation data until it can be written to the cache. Data from any core can store to any subcache. Each data bus can sustain a bandwidth of 16 bytes per clock. The required tag accesses limit the combined L3 data read and write bandwidth to two 64-byte cache line accesses per clock.

Coherency. For requests that miss in the L3 cache, the memory controller is the ordering point between requests from different cores. When data movement is to or from the L3 cache, the L3 controller must ensure that none of the coherency rules are violated. In particular, it must ensure that there are no races between probes to a core and hit data being returned to the same core, or between probes to the L3 cache and victim data from the cores being allocated into the L3 cache. To achieve this goal, the L3 controller performs collision detection of probes against L3 cache data movements before delivering the probes to the core. This is a low-latency operation because it only probes the L3 queues; it performs the L3 cache probe after collision detection and in parallel with the core-cache probe. If a conflict exists, the probe can be delayed if the data movement is guaranteed to complete in a deadlock-free manner. Otherwise, the data movement is delayed and the probe is ordered ahead of the processor operation. When this occurs, the L3 controller applies the probe-state update to the L3 tags as well as to conflicting L3 allocations, which have yet to update the tags. Dependency tracking ensures that these conflicting operations are completed in the correct order to maintain coherency.

With multiprogrammed workloads, if a program running on one core has poor caching characteristics, it could negatively affect the cache performance of programs running on other cores. This situation can be detected when a core exhibits a high install rate and a low cache hit rate, which indicates that it is not benefiting much from the L3 cache and is probably polluting the cache for other cores. To address this situation, the L3 controller implements the block aggressive neighbors (BAN) replacement algorithm. The BAN algorithm computes each core's cache efficiency based on its allocation rate and L3 cache hit rate. It limits the allocations of those programs or cores determined to be getting little benefit from the cache while causing significant pollution in the L3 cache. Rather than promoting an allocated line to the most recently used (MRU) position, the BAN algorithm sets the program's lines to either the LRU position or half-way through the LRU stack (position 8). As a result, a poorly behaving program cannot negatively affect the most frequently accessed lines within the L3 cache.
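The BAN policy amounts to choosing an insertion position in the 16-entry LRU stack from a per-core efficiency estimate. A minimal sketch follows; the efficiency metric and thresholds are invented stand-ins, since the article does not give the actual parameters.

```c
/* 16-way set: LRU stack positions run from 0 (LRU) to 15 (MRU). */
#define WAYS          16
#define POS_LRU        0
#define POS_MIDSTACK   8
#define POS_MRU       (WAYS - 1)

/* Per-core counters sampled by the L3 controller (names assumed). */
struct core_stats {
    unsigned long l3_allocations;
    unsigned long l3_hits;
};

/* A core that allocates heavily but rarely hits is treated as an
 * aggressive neighbor: its fills are inserted at the LRU position or
 * half-way up the stack instead of at MRU, so they age out quickly and
 * cannot displace the hot lines of better-behaved cores.  The threshold
 * values are arbitrary stand-ins, not product parameters. */
int ban_insertion_position(const struct core_stats *s)
{
    if (s->l3_allocations == 0)
        return POS_MRU;

    double hit_rate = (double)s->l3_hits / (double)s->l3_allocations;

    if (hit_rate < 0.05)
        return POS_LRU;        /* clearly polluting: age out immediately */
    if (hit_rate < 0.20)
        return POS_MIDSTACK;   /* marginal benefit: insert at position 8 */
    return POS_MRU;            /* normal behavior: standard insertion */
}
```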



Memory controller and DRAM interface
"Magny Cours" continues the tradition of the AMD Opteron family with an on-die memory controller, but adds DDR3 capability. The DRAM channels are configured for unganged (independent, 72-bit versus combined 144-bit) operation for maximum throughput and DRAM efficiency. The memory controller supports full single-error correction and double-error detection (SECDED) and x4 Chipkill. It does not support ganged (144-bit) DRAM channel operation because the natural DDR3 burst length (8) with a ganged channel leads to 128 bytes of data return, but the AMD Opteron cache line size is 64 bytes, leading to 64 bytes of superfluous data. It does not use DDR3 burst four-chop mode to mitigate this because DRAM performance is significantly reduced in this mode. Maximum capacity configurations support up to three dual-rank, two quad-rank, or two dual-rank plus one quad-rank DDR3 device configurations per channel. At 1.5V, a maximum operating frequency of RDDR3-1333 will be available, subject to platform design, for lower capacities (up to two dual-rank registered dual in-line memory modules [RDIMMs] per channel). Maximum capacity configurations have operating frequencies of either 800 or 1,066 megatransfers per second, depending on platform routing and detailed DIMM characteristics. The memory interface supports 1.35V (DDR3L) operation, leading to lower speeds in some configurations.

The memory controller and DRAM interface have several notable features. First, they offer memory-controller-based prefetch for CPU and I/O traffic, allowing prefetch chaining with CPU core prefetchers. This includes adaptive throttling when the DRAM interface is heavily used.

In addition, they provide adaptive prefetch of DRAM requests in parallel with local L3 tag accesses to minimize latency for L3 misses to local DRAM, along with L3 hit/miss predictors. The L3 hit/miss predictors use a combination of per-page (4-Kbyte region) recent L3 access behavior and per-core local L3 hit/miss ratio to guide prefetch decisions.

A third feature is overlapped DRAM access and directory lookup, which minimizes DRAM latency when HT Assist is enabled. A directory hit to dirty (or potentially dirty) data in another cache in the system will cancel a DRAM data response to the requesting processor, which saves interconnect bandwidth. The memory controller issues probes as early as possible to minimize indirection latency.

The memory controller and DRAM interface also offer optimized DRAM page management and DRAM command bus utilization. Each memory channel has a dedicated DRAM controller that supports timeout-based and predictive page closing utilizing DRAM bank history. The DRAM controller allows aggressive reordering for high DRAM efficiency, with out-of-order scheduling of both the command (precharge, activate, and so on) phase and the data (CAS) phase between multiple DRAM banks simultaneously.

Finally, the memory controller and DRAM interface can collect multiple DRAM write transactions outside the DRAM schedulers until many writes are available to be handled in a burst (write bursting), to avoid costly DRAM read-to-write and write-to-read bus turnarounds.
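Write bursting can be pictured as a small staging buffer in front of the DRAM scheduler, drained only when enough writes have accumulated so the data-bus turnaround cost is paid once per burst. The structure names and threshold below are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WRITE_QUEUE_DEPTH 32
#define BURST_THRESHOLD   16   /* assumed value, not a product parameter */

struct dram_write { uint64_t addr; const void *data; };

struct write_buffer {
    struct dram_write pending[WRITE_QUEUE_DEPTH];
    size_t count;
};

/* Stand-in for handing a write to the DRAM scheduler proper. */
static void dram_schedule_write(const struct dram_write *w) { (void)w; }

/* Park writes outside the DRAM scheduler and drain them together, so the
 * read-to-write and write-to-read bus turnaround penalty is paid once per
 * burst instead of once per write. */
void post_write(struct write_buffer *wb, struct dram_write w)
{
    wb->pending[wb->count++] = w;

    if (wb->count >= BURST_THRESHOLD) {
        for (size_t i = 0; i < wb->count; i++)
            dram_schedule_write(&wb->pending[i]);
        wb->count = 0;
    }
}
```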


Cache coherence protocol
The cache coherence protocol is an important aspect of CMP systems. Generally, designers favor broadcast-based protocols when either overall protocol simplicity or latency for cache-to-cache transfers is more important than interconnection and probe bandwidth. Directory-based protocols can be easier to scale up to larger systems,7,8 but they also can be difficult to implement and verify. In addition, in many cases, the directory is stored in DRAM at the home node, which leads to relatively long indirection latencies, more complex protocols, or complex directory-caching mechanisms with performance policies defining which directory entries to cache.

HT Assist is a key innovation in "Magny Cours" and the AMD Opteron processor code-named "Istanbul." The HT Assist directory protocol gains many of the key scalability advantages of classic directories without using DRAM-based directories (for example, by repurposing ECC bits), and also maintains low indirection latencies. It does this by maintaining a cache directory, which fits naturally within the existing AMD Opteron processor broadcast transaction flow.

Review of broadcast protocol
Figure 4a illustrates the broadcast coherence transaction flow used by earlier generations of AMD Opteron processors.5

In the broadcast protocol, last-level cache misses go to the home node (where the DRAM for the requested cache line resides) and the memory controller determines the ordering for each request for the same cache line. Once a request becomes active, the home node broadcasts cache probe requests to all processors, and the memory controller initiates a DRAM access. All processors send probe responses directly to the requesting processor, and the memory controller (DRAM) sends a separate response when memory data is available. The requesting processor collects all responses, including cache data responses, and determines which data should be used to satisfy the original request. Once it receives all responses, the requesting processor delivers data to the CPU core and sends a transaction (not shown in the figure) to the home-node memory controller indicating that the cache line request is complete and another request for the same cache line can be activated.

Figure 4. Cache coherence protocol examples: broadcast protocol (a), HT Assist with clean data (b), and HT Assist with dirty data (c).

HyperTransport Assist directory protocol
HT Assist adds a cache directory. Each home node keeps track of which cache lines from its memory are cached by other processors in the system. The directory includes all cached data in the system. If a cache line is present in any cache, there must be an entry in the home node's directory to indicate that the line is cached within the system. If a directory is full or a mapping conflict occurs, a previous directory entry must be replaced (causing a potential writeback, plus invalidation of the previous entry's data from all caches) to accommodate the new request.

HT Assist's transaction flow is similar to the broadcast protocol. Initial requests travel to the home node, where the memory controller orders and activates them. Once a request is active, instead of broadcasting probes, it begins a DRAM access and a probe filter lookup in parallel to minimize DRAM latency. When the directory lookup is complete, the memory controller generates a broadcast probe, a directed probe (a probe targeting a single processor), or no probe, depending on directory state. The time to access the probe filter and generate a directed probe is the indirection latency.
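A control-flow sketch of this path for a read request at the home node, under the behavior described above: the DRAM access and the probe filter lookup start in parallel, clean or uncached lines are satisfied from memory with no probe, and a line that may be dirty elsewhere gets a single directed probe while the DRAM response is cancelled. All types and helpers are assumptions for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Directory (probe filter) states described in the text. */
enum pf_state { PF_I, PF_S1, PF_S, PF_O, PF_EM };

struct pf_result {
    bool          hit;
    enum pf_state state;
    int           owner;       /* owning node for S1, EM, and O entries */
};

/* Stand-ins for the real machinery; stubs so the sketch compiles. */
static void dram_read_start(uint64_t addr)          { (void)addr; }
static void dram_response_cancel(uint64_t addr)     { (void)addr; }
static void send_directed_probe(int n, uint64_t a)  { (void)n; (void)a; }
static struct pf_result probe_filter_lookup(uint64_t addr)
{
    (void)addr;
    return (struct pf_result){ .hit = false, .state = PF_I, .owner = -1 };
}

/* Handle an activated read (RdBlk) request at the home node. */
void home_node_read(uint64_t addr)
{
    /* DRAM access and directory lookup start in parallel so the
     * directory (indirection) latency hides behind the DRAM access. */
    dram_read_start(addr);
    struct pf_result r = probe_filter_lookup(addr);

    if (!r.hit || r.state == PF_I || r.state == PF_S || r.state == PF_S1) {
        /* Uncached or clean: memory data satisfies the read, no probe. */
        return;
    }

    /* EM or O: the line may be dirty in the owner's cache.  One directed
     * probe replaces the old broadcast, the owner supplies the data, and
     * the now-superfluous DRAM response is cancelled. */
    send_directed_probe(r.owner, addr);
    dram_response_cancel(addr);
}
```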



The protocol guarantees that a directed probe never requires a DRAM response, so any DRAM response is canceled. When no probe is generated, the memory sends a data response with an indication that this is the only response to expect and that the request can now be completed. Broadcast probes follow a similar flow to the previous broadcast protocol, except this generation's protocol guarantees that data will be returned from the owner node. Figures 4b and 4c show a simplified version of the transaction flow for cases in which no probes, and a single directed probe, are required.

For most requests, the directory state is immediately updated after the initial directory lookup based on the request type and directory state. For some requests, such as a store to an S-state line in the requestor's cache, the request is always treated as a broadcast, and directory lookup and update are postponed until after the request completes. In all cases, only a single directory lookup and update is required, and these are treated as an atomic read-modify-write action. Designing the protocol in this fashion greatly simplified microarchitecture design and reduced protocol complexity.

The directory protocol must also enable alternate coherence behavior of the CPU-side caches (L1, L2, and L3). Notably, external read probe requests must transition E-state cache lines to O-state, return data to the requester, and send eviction notifications of E-state cache lines to the directory.

Directory storage
The design supports multiple directory sizes through a combination of per-way and per-subcache mappings of L3 space assigned to the directory. However, in its production form, only a subset of sizes and configurations are available for selection.

Because the directory storage is held in the fast L3 SRAM arrays, directory indirection latency is low and we can achieve sufficient bandwidth to the arrays to not limit per-node coherent bandwidth. We used the existing L3 arrays in lieu of dedicated directory storage to maximize processor flexibility, for the discussed latency and bandwidth characteristics, and to avoid additional pressure on the DRAM ECC coding. With existing AMD Opteron cache line sizes, and continual pressure to improve error detection and tolerance surrounding DRAM devices, we did not pursue an in-memory directory strategy. Most directory actions involve read-modify-write accesses that are implemented much more efficiently, and with minimum port occupancy, in SRAM than in DRAM.

Our design's 64-byte cache line holds 16 directory entries, with 4 bytes per entry organized as a four-entry, four-way set-associative array (see Figure 5a). The tag field tracks normalized addresses, which are computed by subtracting the home node's DRAM base address (or, for node-interleaved addressing, removing appropriate bits of the system's physical address) before storing the resulting address. Using the normalized address reduces the number of tag bits required and lets us size the probe filter tag field based on the maximum DRAM per node, not the total DRAM across all nodes in a system.

By default, the basic input/output system (BIOS) will allocate 1 Mbyte of the 6-Mbyte L3 cache to directory storage. The directory holds 256K directory entries, which can cover 16 Mbytes of cache. This results in a directory coverage ratio of 16 Mbytes/(0.5 Mbytes × 6 cores + 5-Mbyte L3), or 2.0, which says there are at least twice as many directory entries in a system as cached lines, since a single directory entry can track multiple cached copies of a shared line.

Directory states and transitions
Earlier, we showed the behavior of the directory protocol using some examples. Also important are the protocol's states and transitions, including special consideration for the shared L3 cache and CMP nature of each processor node.

Directory behavior. Figure 5b lists the directory states. The directory supports the full MOESI protocol from all previous AMD Opteron processors. It observes all requests, many of which lead to state updates. To maintain the directory semantics discussed earlier, and to keep the directory up to date, the directory protocol informs the directory of any cache castouts of M, O, and E state lines.
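Figure 5a's entry format can be written down as a packed record: each 64-byte L3 line holds 16 4-byte entries arranged as four sets of four ways. The bit widths below are placeholders chosen only to total 32 bits; the article specifies the fields but not their sizes.

```c
#include <stdint.h>

/* One probe filter (directory) entry: 4 bytes.  Field widths are
 * placeholders; the article specifies the fields, not their widths. */
struct pf_entry {
    unsigned int tag   : 26;   /* normalized address tag */
    unsigned int state : 3;    /* EM, O, S, S1, or I */
    unsigned int owner : 3;    /* owning node for EM, O, and S1 entries */
};

/* A 64-byte L3 cache line holds 16 entries arranged as a four-entry,
 * four-way set-associative array (Figure 5a). */
struct pf_line {
    struct pf_entry set[4][4];
};

/* Normalized address: the home node's DRAM base is subtracted before the
 * tag is stored, so the tag only has to cover the DRAM attached to one
 * node rather than the total DRAM in the system. */
static inline uint64_t pf_normalize(uint64_t phys_addr, uint64_t node_dram_base)
{
    return phys_addr - node_dram_base;
}
```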

E state lines. Finally, each directory miss might find the directory index full of valid entries, one of which must be evicted to make space for the new entry. We refer to this final case as needing a downgrade probe. Such a probe causes a writeback (if dirty) and invalidates all existing cached copies of the downgraded cache line.

Figure 5. Probe filter entry format (a) and directory states (b), showing how probe filter entries map into the L3 cache line. Each 4-byte probe filter entry holds a tag, a state, and an owner field; a 64-byte L3 cache line holds 16 entries organized as four sets of four ways. The directory states are:

- EM: a copy of the cache line is present on a single node, recorded in the owner field. The state of the line in the cache is not known at the directory; it may be clean (Exclusive) or dirty (Modified, or Owned with all shared copies on the owner node).
- O: the cache line is cached on one node, the Owner node, and possibly multiple Sharing nodes. The line is dirty and is written back to memory when cast out of the node.
- S: the cache line may be present on multiple nodes and all copies are clean (Shared).
- S1: the cache line is clean and present on a single node, recorded in the owner field.
- I: the line is uncached.

Figure 6 shows some common transaction scenarios. The transactions in the figure have the following semantics:

- Fetch—an instruction fetch request; install in S state by default.
- Load—a data read request; install in E state by default.
- Store—a data store miss request (write-allocate); install in M state.

In each case, the "Directory hit" columns indicate that a request hits a pre-existing directory entry to the same cache line address. The "I," "O," "S," "S1," and "EM" columns indicate the entry's directory state, and the table entries indicate the type of probe generated, if any: no probe (filtered), a directed probe, a directed invalidate, or a broadcast invalidate. The "Directory miss" columns indicate scenarios in which the requested cache line has no pre-existing directory entry, the line is uncached, and no probe is necessary (filtered). In the directory miss case, the directory states indicate the state of the replaced line and the corresponding type of downgrade probe (broadcast downgrade or directed downgrade).

Figure 6. Probe filter transaction scenarios summarizing probe actions as a function of memory access type (Fetch, Load, or Store) and probe filter state.

Although a downgrade probe does not block system activity for the demand request that caused it, minimizing downgrades, both to achieve high probe filtering rates and to limit processor-side cache perturbation, is still important. Therefore, informing the directory on M, O, and E state castouts, in combination with directory size and mapping, ensures that most directory miss requests find an available invalid directory location. The protocol does not include notifications of S-state castouts; downgrade probes reclaim S-state lines in the directory. We omitted S-state evictions for several reasons, due both to performance (a potentially large number of eviction messages) and to various microarchitecture-specific implementation complexities. Performance evaluation showed that S-state eviction notifications were not critical for achieving acceptable directory performance.
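A sketch of the directory-miss path with a downgrade, as described above: if the indexed set has no invalid entry, a victim entry is chosen, its cached copies are invalidated (and written back if dirty) by a downgrade probe, and the freed way records the new line. The victim choice and helper names are assumptions.

```c
#include <stdint.h>

enum pf_state { PF_I, PF_S1, PF_S, PF_O, PF_EM };

struct pf_entry { uint32_t tag; enum pf_state state; int owner; };

#define PF_WAYS 4

/* Stubs standing in for probe generation; a real controller would send
 * HT3 probe packets and wait for their completion. */
static void downgrade_directed(int owner, uint32_t tag)  { (void)owner; (void)tag; }
static void downgrade_broadcast(uint32_t tag)            { (void)tag; }

/* Install a new directory entry for 'tag' into one set of PF_WAYS ways. */
void pf_install(struct pf_entry set[PF_WAYS], uint32_t tag,
                enum pf_state new_state, int requester)
{
    int victim = -1;

    /* Prefer an invalid way: castout notifications for M, O, and E lines
     * keep the directory up to date, so a miss usually finds one. */
    for (int w = 0; w < PF_WAYS; w++)
        if (set[w].state == PF_I) { victim = w; break; }

    if (victim < 0) {
        victim = 0;   /* stand-in for the real replacement choice */

        /* Downgrade probe: invalidate (and write back if dirty) every
         * cached copy of the line being displaced from the directory. */
        if (set[victim].state == PF_S || set[victim].state == PF_O)
            downgrade_broadcast(set[victim].tag);   /* sharers possible */
        else
            downgrade_directed(set[victim].owner, set[victim].tag);
    }

    set[victim] = (struct pf_entry){ .tag = tag, .state = new_state,
                                     .owner = requester };
}
```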



Because the directory is shared with the processor-side L3 cache, AMD Opteron processors with HT Assist enabled must trade off directory size (minimizing downgrades) with processor-side L3 cache performance (eliminating last-level cache misses). We achieved an appropriate balance via extensive modeling and hardware-based evaluations. This tradeoff is particularly noteworthy because the directory's associativity is much lower (four-way) than the net associativity of all CPU-side caches. As we noted earlier, a directory coverage ratio of at least 2.0 provides significant leverage. Additionally, the directory index mapping uses hashing for many common-case scenarios considering the possible x86 page sizes (4 Kbyte, 2 Mbyte, and 1 Gbyte). In addition, interaction with operating system/hypervisor page-coloring algorithms helps avoid pathological directory mapping collisions in multiprogrammed scenarios. Finally, directory replacement policies attempt to avoid victimizing lines that are cached in many CPUs to reduce CPU-side cache perturbation resulting from directory downgrades.

CMP considerations. Our description of the protocol and states has not specifically mentioned interactions with the various levels of private (L1 and L2) and shared (L3) caches on each processor node. The HT Assist directory treats each set of CPU cores, private caches, and shared L3 cache as a unit (node). All of the transaction flows and state transitions discussed previously are faithful representations of the directory protocol behavior for the AMD Opteron cache hierarchy. The directory is never involved (that is, a message is never sent to the directory) in any internal cache transitions or movement of a cache line between layers in the cache hierarchy within a processor node (for example, movement of a cache line from L3 to L1, or from L2 to L3).

Performance
We evaluated the HT Assist directory both in simulation and by making measurements on preproduction silicon.

The simulation-based performance results presented here use AMD-internal coherence performance models. The model used in this article has a detailed representation of the on-die northbridge (L3 cache, microarchitecture, HT3 links, DRAM controllers/devices, and so on), is designed to be cycle-accurate at the northbridge level, and is correlated against the register transfer level (RTL) design presilicon. The model can use detailed, cycle-accurate, correlated-to-RTL CPU core models (representing AMD's most precise CPU-to-system modeling environment) or abstract CPU core models when simulation efficiency is paramount. For practical reasons, we use the abstract CPU core model. One model input is coherence transaction traces, taken from existing AMD Opteron multiprocessor systems. These traces are essentially lists of L1 cache miss records for fetches, loads, and stores issued from the cores in the system. Each trace record contains metadata that specifies interthread ordering for accesses to shared data so that the simulator can enforce deterministic execution. The abstract core model consumes these traces when running a simulation. The abstract core model includes a representation of core caches (L1 and L2), CPI, memory-level parallelism, and core frequency. The simulator models the memory traffic from the abstract core model cycle accurately within the northbridge model, using total execution time for a constant set of references from each input trace on each CPU as the performance metric.

The hardware performance measurements used preproduction "Magny Cours" hardware in AMD's performance labs. The CPU and on-die northbridge operational frequencies and other system parameters are representative of final shipping configurations.
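The trace-driven methodology can be pictured with a minimal record format and an issue check. The field names and the ordering-tag scheme are assumptions for illustration; the article describes the properties of AMD's internal traces, not their format.

```c
#include <stdint.h>

enum access_type { ACC_FETCH, ACC_LOAD, ACC_STORE };

/* One L1 miss record from a coherence transaction trace.  The dep_tag
 * field is an assumed stand-in for the metadata that orders accesses to
 * shared data across threads. */
struct trace_record {
    uint64_t         addr;
    enum access_type type;
    uint8_t          core;      /* issuing core */
    uint64_t         seq;       /* per-core sequence number */
    uint64_t         dep_tag;   /* must not issue before this tag retires */
};

/* Each abstract core consumes its records in order; a record touching
 * shared data is held back until the record it depends on has retired,
 * which keeps the simulation deterministic. */
struct abstract_core {
    const struct trace_record *records;
    uint64_t next;
    uint64_t count;
};

static uint64_t retired_tag;    /* highest globally retired ordering tag */

int core_can_issue(const struct abstract_core *c)
{
    if (c->next >= c->count)
        return 0;
    return c->records[c->next].dep_tag <= retired_tag;
}
```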

Transaction scenario frequencies
Figure 7 shows the probe filter transaction scenario frequencies from early 4P hardware measurements for SPECJBB2005. Most requests lead to no probe or to a directed probe; therefore, the HT Assist directory eliminates a large amount of probe and response traffic.

Figure 7. Probe filter transaction scenario frequencies for SPECJBB2005. Each table entry is a percentage of all requests. Overall, 72.2 percent of requests require no probe (filtered), 1.6 percent a directed probe, 24.9 percent a directed invalidate, and 1.3 percent a broadcast invalidate.

If we attach a weight of 0 to filtered, 0.125 to directed (one probe in place of eight for broadcast), and 1 to broadcast, the protocol overhead with the probe filter is (72.2 percent × 0) + ([24.9 percent + 1.6 percent] × 0.125) + (1.3 percent × 1), or 4.6 percent of the broadcast coherence. Thus, the protocol is more than 95 percent effective in reducing probe traffic.

On the other hand, the directory protocol requires that clean victims be sent to the home node whenever E lines age out of the last-level cache (one clean victim in place of eight probes), which introduces 66 percent × 0.125 (or 8.25 percent) additional traffic overhead corresponding to table entry {Load, PF Miss, I}. So, the net effectiveness—or reduction in traffic associated with maintaining cache coherence in this theoretical design—is greater than 87 percent.

This example shows that the traditional cache-hit ratio is not an appropriate measure of the probe filter's effectiveness (in this example, the directory-hit ratio is only 14.6 percent).

Note that downgrades occur in the background and can be overlapped with new memory requests. Additionally, the reduced L3 victim traffic offsets the modest bandwidth overhead of downgrades.

NUMA software optimizations
Like previous generations of AMD Opteron processors, "Magny Cours" is a distributed shared memory machine that benefits from operating system/application tuning to optimize memory allocation and process scheduling. Modern operating systems, such as Windows and Linux, are aware of the underlying machine node topology via the Advanced Configuration and Power Interface (ACPI) static resource affinity table/system locality information table (SRAT/SLIT) supplied in BIOS.9 These tables associate memory with nodes and provide a matrix to assign a latency cost for accessing memory for all {source, destination} node pairs. The Linux nonuniform memory architecture (NUMA) library supports a command line utility, numactl, which defines the default memory allocation policy (local, interleaved, or user specified) for all threads spawned by a process and specifies on which cores they should run. The shared library libnuma provides applications with an API to control the process policy for allocating new or existing memory and scheduling threads on sets of nodes.

Having the probe filter on "Magny Cours" reduces the observed latency of memory accesses when probes and probe responses are the longest path. By significantly reducing local memory latency, the probe filter amplifies the benefit of NUMA software optimizations.
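For completeness, the snippet below shows the flavor of the libnuma API mentioned above: pin the calling thread to a node and allocate from that node's local memory so that, with HT Assist, most accesses hit low-latency local DRAM. The node number and buffer size are arbitrary example values.

```c
/* Build with: gcc numa_local.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA policy is not supported on this system\n");
        return EXIT_FAILURE;
    }

    int node = 0;                          /* arbitrary example node */
    size_t bytes = 64UL * 1024 * 1024;

    /* Run this thread on the chosen node and allocate from its local
     * memory, giving high processor-memory affinity. */
    numa_run_on_node(node);
    void *buf = numa_alloc_onnode(bytes, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return EXIT_FAILURE;
    }

    memset(buf, 0, bytes);                 /* touch pages so they are placed */
    printf("allocated %zu bytes on node %d\n", bytes, node);

    numa_free(buf, bytes);
    return EXIT_SUCCESS;
}
```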



Memory latency and bandwidth
Figure 8a shows preproduction hardware improvements in maximum main memory bandwidth when running STREAM triad, and Figure 8b shows improvements in local and one-hop memory latency in 2P and 4P (four-node and eight-node, respectively) system configurations. Latency measurements are for DRAM page hits. Each measurement is relative to HT Assist disabled for that platform (2P or 4P), and the 2P and 4P relative memory latencies cannot be compared because they are not normalized to the same baseline.

Figure 8. Memory bandwidth improvement (a) and relative memory latency (b) with and without HT Assist, for local and one-hop DRAM latency in 2P and 4P systems. Latency is normalized to HT Assist disabled as 100 percent (lower is better).

As these tests show, enabling HT Assist significantly increases memory bandwidth and reduces memory latency. Notably, the memory bandwidth and latency improvements are larger in 4P systems than in 2P systems, illustrating the greater benefit from HT Assist in these larger systems. As might be expected, the improvement in local DRAM latency is larger than one-hop (and two-hop in 4P) latency. We therefore expect NUMA-optimized workloads with high processor-memory affinity to benefit more from HT Assist, and that benefit will be greater in 4P systems.

Application benchmarks
Figure 9 shows early hardware measurements of the benefit of HT Assist for application-level workloads. HT Assist benefits 4P systems more than 2P systems because of the larger relative reduction in memory latency and bandwidth. Note that performance scaling from 2P to 4P (not shown) is greater than 95 percent in these cases, illustrating that HT Assist could enable superior system scalability.

Figure 9. Early hardware measurements of application improvement with HT Assist, for estimated SPEC2006INT-Rate, SPEC2006FP-Rate, and SPECJBB2005 results on 2P and 4P systems.

Figure 10 shows simulation-based studies of additional workloads with configurations similar to the hardware measurements. Although real-world results might vary, our simulations demonstrate that HT Assist improves performance more dramatically in 4P configurations than in 2P, and that it benefits the WebServing, Java Business, and Business workloads more significantly than the Database and Virtualization workloads. Detailed evaluation of the underlying performance data (not shown here) indicates that Java workloads have good processor-memory affinity and significant need for memory bandwidth, making them a good fit for HT Assist. The WebServing and Business workloads benefit primarily from the additional achieved memory and interconnection bandwidth, and less from the significant improvement in local memory latency, because they have less processor-memory affinity. The Database and Virtualization workloads benefit from reduced memory latency in 2P, but have less processor-memory affinity and less overall memory bandwidth need; therefore, the gains are smaller compared to other workloads. These workloads are more sensitive to reduced L3 capacity. In 4P, the larger relative improvement in average memory latency and the greater number of CPUs contending for effective memory bandwidth lead to larger relative gains compared to 2P.

Figure 10. Simulation-based studies of performance improvement with HT Assist across a diverse set of commercial workloads (Database, Java Business, Business, Virtualization, and WebServing) in 2P and 4P configurations.

The "Magny Cours" combination of superscalar cores, high core count, and virtualization support makes it an appropriate choice for running a heterogeneous mix of workloads in the data center. The G34 socket infrastructure provides sufficient interconnect and memory bandwidth headroom to accept upgrades of future generations of plug-compatible processors, which are already in development.

Acknowledgments
An exceptional team of AMD technical staff designed, verified, and tested the "Magny Cours" processor.

References
1. K. Olukotun, L. Hammond, and J. Laudon, Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency, Morgan & Claypool, 2007.
2. J. Brutlag, "Speed Matters for Google Web Search," Google, http://code.google.com/speed/files/delayexp.pdf.
3. K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, tech. report UCB/EECS-2006-183, Electrical Eng. and Computer Science Dept., Univ. of California, Berkeley, 2006.
4. C. Keltcher et al., "The AMD Opteron Processor for Shared Memory Multiprocessor Systems," IEEE Micro, vol. 23, no. 2, 2003, pp. 66-76.
5. P. Conway and W. Hughes, "The AMD Opteron Northbridge Architecture," IEEE Micro, vol. 27, no. 2, 2007, pp. 10-21.
6. C. Moore and P. Conway, "General Purpose Multiprocessors," Multicore Processors and Systems, S. Keckler, K. Olukotun, and P. Hofstee, eds., Springer, 2009.
7. D. Culler, J. Pal Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.
8. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th ed., Morgan Kaufmann, 2007.
9. A. Kleen, "A NUMA API for Linux," SUSE Labs white paper, Aug. 2004.

Pat Conway is a principal member of the technical staff at AMD. His technical interests include server design and accelerated computing. Conway has a master's in electrical engineering from University College Cork, Ireland, and an MBA from Golden Gate University. He is a member of the IEEE.

Nathan Kalyanasundharam is a principal member of AMD's technical staff. His technical interests include the design of the load/store unit, cache controller, and northbridge. Kalyanasundharam has an MS in electrical engineering from Texas A&M University.

Gregg Donley is a senior member of AMD's technical staff. His technical interests include third-level cache and system architecture. Donley has an MS in computer engineering from the University of Michigan.

Kevin Lepak is a principal member of AMD's technical staff. His technical interests include various aspects of next-generation server platforms and processor development. Lepak has a PhD in electrical and computer engineering from the University of Wisconsin.

Bill Hughes is a senior fellow at AMD, where he leads the Northbridge and HyperTransport Microarchitecture team. He has a PhD in electrical and electronic engineering from Leeds University, England.

Direct questions or comments about this article to Pat Conway, Advanced Micro Devices, MS 362, 1 AMD Place, Sunnyvale, CA 95070; [email protected].
