The AMD Opteron Processor for Multiprocessor Servers

Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway
Advanced Micro Devices

Representing AMD's entry into 64-bit computing, Opteron combines the backwards compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.

Advanced Micro Devices' Opteron processor is a 64-bit, x86-based microprocessor with an on-chip double-data-rate (DDR) memory controller and three HyperTransport links for glueless multiprocessing. AMD based Opteron on the x86-64 architecture,1 AMD's backward-compatible extension of the industry-standard x86 architecture. Thus, Opteron provides seamless, high-performance support for both the existing 32-bit x86 code base and new 64-bit software. Its core is an out-of-order, superscalar processor supported by large on-chip level-1 (L1) and level-2 (L2) caches. Figure 1 shows a block diagram. A key feature is the on-chip, integrated DDR memory controller, which leads to low memory latency and provides high-performance support to integer, floating-point, and multimedia instructions. The HyperTransport links let Opteron connect to neighboring Opteron processors and other HyperTransport devices without additional support chips.2

Why 64 bit?

The need for a 64-bit architecture arises from applications that address large amounts of virtual and physical memory, such as databases, computer-aided design (CAD) tools, scientific/engineering problems, and high-performance servers. For example, the Hammer design project, which includes Opteron, ran most of its design simulations on a compute farm of 3,000 processors running Linux. However, for some large problem sets, our CAD tools required greater than 4 Gbytes of memory and had to run on far more expensive platforms that supported larger addresses. The Hammer project could have benefited from the existence of low-cost 64-bit computing.

However, there exists a huge software code base built around the legacy 32-bit x86 instruction set. Many x86 workstation and server users now face the dilemma of how to transition to 64-bit systems and yet maintain compatibility with their existing code base. The AMD x86-64 architecture adds 64-bit addressing and expands register resources to support higher performance for recompiled 64-bit programs. At the same time, it also supports legacy 32-bit applications and operating systems without modification or recompilation. It thus supports both the vast body of existing software as well as new 64-bit software.

Figure 1. Opteron architecture: the execution core with 64-Kbyte L1 instruction and data caches and a 1-Mbyte L2 cache, the system request queue and crossbar, the integrated memory controller (5.3 Gbytes/s to PC2700 DDR memory), and three HyperTransport links (6.4 Gbytes/s each).

Figure 2. Programmer's model for the x86-64 architecture, showing the legacy x86 registers and the x86-64 extensions (eight new GPRs R8-R15, eight new SSE registers XMM8-XMM15, and 64-bit widening of the eight legacy GPRs and the instruction pointer).

The AMD x86-64 architecture supports 64-bit virtual and 52-bit physical addresses, although a given implementation might support less. The Opteron, for example, supports 48-bit virtual and 40-bit physical addresses. The x86-64 architecture extends all x86 integer arithmetic and logical instructions to 64 bits, including a 64 × 64-bit multiply with a 128-bit result.
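The widened multiply is easy to see in code. The minimal C sketch below assumes a GCC- or Clang-style compiler with the __int128 extension (a toolchain assumption, not something the article prescribes); on x86-64 the compiler lowers this to the architecture's single widening MUL instruction, leaving the low half in RAX and the high half in RDX.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t a = 0xFFFFFFFFFFFFFFFFull;   /* 2^64 - 1 */
    uint64_t b = 0xFFFFFFFFFFFFFFFFull;

    /* 64 x 64 -> 128-bit multiply, as x86-64 defines it. */
    unsigned __int128 p = (unsigned __int128)a * b;

    uint64_t lo = (uint64_t)p;            /* low 64 bits (RAX)  */
    uint64_t hi = (uint64_t)(p >> 64);    /* high 64 bits (RDX) */

    /* (2^64 - 1)^2 = 2^128 - 2^65 + 1:
       hi = 0xFFFFFFFFFFFFFFFE, lo = 0x0000000000000001 */
    printf("hi=0x%016llx lo=0x%016llx\n",
           (unsigned long long)hi, (unsigned long long)lo);
    return 0;
}
```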

The x86-64 architecture doubles the number of integer general-purpose registers (GPRs) and streaming SIMD extension (SSE) registers from 8 to 16 of each. It also extends the GPRs from 32 to 64 bits. The 64-bit GPR extensions overlay the existing 32-bit registers in the same way as in the x86 architecture's previous transition, from 16 to 32 bits. Figure 2 depicts the programmer's view of the x86-64 registers, showing the legacy x86 registers and the x86-64 register extensions. Programs address the new registers via a new instruction prefix byte called REX (register extension).

The extension to the x86 register set is straightforward and much needed. Current x86 software is highly optimized to work within the small number of legacy registers. Adding eight more registers satisfied the register allocation needs of more than 80 percent of functions appearing in typical application code, as Figure 3 shows.

Figure 3. Register usage in typical applications: percentage of functions requiring fewer than 8, 8 to 15, 16 to 32, or more than 32 registers, for benchmarks including gcc, perl, ijpeg, vortex, m88ksim, bytemark, a word processor, and a spreadsheet.
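Returning to the REX prefix mentioned above: the REX byte occupies the 0x40-0x4F encoding space in 64-bit mode, and its four low bits widen the operand size and supply the fourth register-select bit that reaches R8-R15 and XMM8-XMM15. The decoder sketch below is a minimal illustration written from the architecture description, not AMD's decoder.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool w;  /* REX.W: selects 64-bit operand size        */
    bool r;  /* REX.R: extends the ModRM reg field        */
    bool x;  /* REX.X: extends the SIB index field        */
    bool b;  /* REX.B: extends ModRM r/m, SIB base        */
} rex_t;

/* In 64-bit mode, a byte in 0x40-0x4F immediately before the opcode
 * (or escape bytes) acts as a REX prefix. */
static bool decode_rex(uint8_t byte, rex_t *rex)
{
    if ((byte & 0xF0) != 0x40)
        return false;                     /* not a REX prefix */
    rex->w = (byte & 0x08) != 0;
    rex->r = (byte & 0x04) != 0;
    rex->x = (byte & 0x02) != 0;
    rex->b = (byte & 0x01) != 0;
    return true;
}

int main(void)
{
    rex_t rex;
    /* 0x48 = REX.W: for example, "48 01 d8" encodes ADD RAX,RBX. */
    if (decode_rex(0x48, &rex))
        printf("REX.W=%d R=%d X=%d B=%d\n", rex.w, rex.r, rex.x, rex.b);
    return 0;
}
```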


Figure 4. Processor core block diagram, showing the fetch and branch-prediction units, the scan/align stage, the fastpath decoders and microcode engine, the 72-entry instruction control unit, integer decode/rename with three 8-entry reservation stations, the 36-entry floating-point scheduler feeding the FADD, FMUL, and FMISC units, three ALUs, three AGUs, the 44-entry load/store queue, the 64-Kbyte L1 instruction and data caches, the 1-Mbyte L2 cache, and the system request queue, crossbar, memory controller, and HyperTransport links.

Core microarchitecture

Figure 4 is a block diagram of the Opteron core, an aggressive, out-of-order, three-way superscalar processor. It reorders as many as 72 micro-ops, and it fetches and decodes three x86-64 instructions each cycle from the instruction cache. Like its predecessor, the AMD Athlon,3 Opteron turns variable-length x86-64 instructions into fixed-length micro-ops and dispatches them to two independent schedulers: one for integer operations, and one for floating-point and multimedia (MMX, 3DNow!, SSE, and SSE2) operations. In addition, it sends load and store micro-ops to the load/store unit. The schedulers and the load/store unit can dispatch 11 micro-ops each cycle to the following execution resources:

• three integer execution units,
• three address generation units,
• three floating-point and multimedia units, and
• two loads/stores to the data cache.

Pipeline

Opteron's pipeline, shown in Figure 5, is long enough for high frequency and short enough for good IPC (instructions per cycle). The pipeline is fully integrated from instruction fetch through DRAM access. It takes seven cycles for fetch and decode; excellent branch prediction covers this latency. The execute pipeline is typically 12 stages for integer and 17 stages for floating-point. Data cache access occurs in stage 11 for loads; the result returns to the execution units in stage 12.

In case of an L1 cache miss, the pipeline accesses the L2 cache and, in parallel, sends a request to the system request queue. The pipeline cancels the system request on an L2 cache hit. The memory controller schedules the request to DRAM.

The stages in the DRAM pipeline run at the same frequency as the core. Once data returns from DRAM, it takes several stages to route the data across the chip to the L1 cache and to perform error correction. The processor forwards both the L2 cache and system data, critical-word first, to the waiting load while updating the L1 cache.
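The parallel miss handling described above can be summarized in a small control-flow sketch. Everything named below (the request handle and helper functions) is hypothetical scaffolding for illustration; only the launch-both-paths and cancel-on-L2-hit behavior comes from the text.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { int id; } sysreq_t;             /* hypothetical handle */

extern bool     l2_lookup(uint64_t paddr);        /* true on an L2 hit   */
extern sysreq_t sysreq_enqueue(uint64_t paddr);   /* speculative enqueue */
extern void     sysreq_cancel(sysreq_t r);
extern void     dram_schedule(sysreq_t r);

void on_l1_miss(uint64_t paddr)
{
    /* Launch the system request in parallel with the L2 probe rather
     * than serializing the two, hiding the L2 lookup latency on a miss. */
    sysreq_t req = sysreq_enqueue(paddr);

    if (l2_lookup(paddr))
        sysreq_cancel(req);     /* L2 hit: the system request dies        */
    else
        dram_schedule(req);     /* L2 miss: the memory controller takes over */
}
```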

[Figure 5 diagrams four pipelines: integer (stages 1 through 12: fetch 1, fetch 2, pick, decode 1, decode 2, pack, pack/decode, dispatch, schedule, AGU/ALU, data cache access, and data cache response); floating-point (stages 8 through 17: dispatch, stack rename, register rename, scheduler write, schedule, register read, and FX0 through FX3); the L2 cache pipeline (stages 13 through 19: L2 tag, L2 data, route/multiplex/ECC, and write data cache and forward data); and the DRAM pipeline (stages 14 through 27, from address to the system request queue and GART/address decode through the crossbar, coherence and ordering checks, the memory controller, DRAM access, and data return).]

Figure 5. Opteron pipeline.

Caches

Opteron has separate L1 data and instruction caches, each 64 Kbytes. They are two-way set-associative, linearly indexed, and physically tagged, with a cache line size of 64 bytes. We banked them for speed, with each way consisting of eight 4-Kbyte banks. Backing these L1 caches is a large 1-Mbyte L2 cache whose data is mutually exclusive with respect to the data in the L1 caches. The L2 cache is 16-way set-associative and uses a pseudo-least-recently-used (LRU) replacement policy, grouping two ways into a sector and managing the sectors via an LRU criterion. This mechanism uses only half the number of LRU bits of a traditional LRU scheme, with similar performance. We used the standard MOESI (modified, owner, exclusive, shared, invalid) protocol for cache coherence.

The instruction and data caches each have independent L1 and L2 translation look-aside buffers (TLBs). The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations. The L2 TLB is four-way set-associative with 512 4-Kbyte entries. In a classical x86 TLB scheme, a change to the page directory base flushes the TLB. Opteron has a hardware flush filter that blocks unnecessary TLB flushes, flushing the TLBs only after it detects a modification of a paging data structure. Filtering TLB flushes can be a significant advantage in multiprocessor workloads because it lets multiple processes share the TLB.
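To make the stated L1 geometry concrete, the sketch below splits a physical address into tag, set index, and bank for a 64-Kbyte, two-way cache with 64-byte lines (512 sets, so index bits 14:6). The bank-select bits are an assumption: the article says only that each way comprises eight 4-Kbyte banks, and choosing bits 8:6 interleaves consecutive lines across banks so that paired loads rarely conflict.

```c
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS  6              /* 64-byte line                  */
#define SET_BITS   9              /* 512 sets per way              */
#define BANK_BITS  3              /* 8 banks per way (assumed map) */

static void split(uint64_t paddr)
{
    uint64_t offset = paddr & ((1u << LINE_BITS) - 1);
    uint64_t set    = (paddr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uint64_t bank   = (paddr >> LINE_BITS) & ((1u << BANK_BITS) - 1);
    uint64_t tag    = paddr >> (LINE_BITS + SET_BITS);

    printf("addr 0x%012llx -> tag 0x%llx set %3llu bank %llu offset %2llu\n",
           (unsigned long long)paddr, (unsigned long long)tag,
           (unsigned long long)set,   (unsigned long long)bank,
           (unsigned long long)offset);
}

int main(void)
{
    split(0x12345040);  /* two addresses one line apart ...  */
    split(0x12345080);  /* ... land in different banks here  */
    return 0;
}
```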


Figure 6. Branch prediction hardware block diagram (a), showing the branch selectors, global history counter (16K 2-bit counters), branch target array (2K targets), 12-entry return-address stack (RAS), and branch target address calculator (BTAC), and diagram of branch prediction schemes (b): a sequential fetch occurs every cycle; a predicted fetch incurs a one-cycle bubble to access the branch target array; a branch target address calculator fetch takes two cycles to recover from a mispredicted fetch address; and a mispredicted fetch takes nine cycles to recover. Each set of boxes in the diagram represents a sequence of cycles that occurs for each scheme.

Instruction fetch and decode

The instruction fetch unit, with the help of branch prediction, attempts to supply 16 instruction bytes each cycle to the scan/align unit, which scans this data and aligns it on instruction boundaries. This task is complicated because x86 instructions can vary in length from 1 to 15 bytes. To simplify the subsequent scan process, the instruction fetch unit marks instruction boundaries as it writes the line into the instruction cache.

The processor decodes variable-length instructions into internal fixed-length micro-ops using either the hardware decoders (called "fastpath" decoders) or the microcode engine. Fastpath decoders can issue as many as three x86 instructions per cycle, whereas microcode can issue only one x86 instruction per cycle. Most common x86 instructions, which decode to one or two micro-ops per instruction, use the fastpath decoders. For better performance, Opteron has more fastpath decoding resources than its predecessor, Athlon. For example, packed SSE instructions were microcoded in Athlon but are fastpath in Opteron. This change resulted in 8 percent fewer microcoded instructions for SPECint benchmarks and 28 percent fewer for SPECfp.

Branch prediction

AMD significantly improved Opteron's branch prediction over that of Athlon. As Figure 6 shows, Opteron uses a combination of branch prediction schemes.4 The branch selector array selects between static prediction and the history table.

The global history table uses 2-bit saturating up/down counters. The branch target array holds branch target addresses and is backed by a 2-cycle branch target address calculator that accelerates address calculations. The return-address stack optimizes CALL/RETURN pairs by storing the return address of each CALL and supplying it for use by the corresponding RETURN instruction. The mispredicted-branch penalty is 11 cycles.

When a line is evicted from the instruction cache, Opteron saves the branch prediction information and the end bits in the L2 cache error correction code field, because parity protection is sufficient for read-only instruction data. Thus, the fetch unit does not have to re-create branch history information when it brings a line back into the instruction cache from the L2 cache. This unique and simple trick improved Opteron's branch prediction accuracy on various benchmarks by 5 to 10 percent over the Athlon.
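A minimal sketch of the 2-bit saturating-counter mechanism may be useful here. The table size and XOR-of-history indexing below follow McFarling's gshare4 and are assumptions for illustration; the article confirms only the 2-bit up/down counters and the use of global history, and it omits the branch-selector and static-prediction paths shown in Figure 6.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 16384                 /* 16K 2-bit counters (assumed) */

static uint8_t  counters[TABLE_SIZE];    /* 0,1 = not taken; 2,3 = taken */
static uint16_t ghist;                   /* global taken/not-taken history */

static unsigned index_of(uint64_t pc)
{
    return (unsigned)((pc ^ ghist) & (TABLE_SIZE - 1));
}

static bool predict(uint64_t pc)
{
    return counters[index_of(pc)] >= 2;  /* weakly/strongly taken */
}

/* Called at branch resolution, before the history is shifted, so the
 * update hits the same entry the prediction used. */
static void update(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[index_of(pc)];
    if (taken  && *c < 3) (*c)++;        /* saturate at strongly taken     */
    if (!taken && *c > 0) (*c)--;        /* saturate at strongly not taken */
    ghist = (uint16_t)((ghist << 1) | taken);
}
```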

Integer and floating-point execution units

The decoders dispatch three micro-ops each cycle to the instruction control unit, which has a 72-entry reorder buffer and controls the retirement of all instructions. In parallel, the decoders issue micro-ops to the integer reservation stations and the floating-point scheduler. In turn, these units issue micro-ops to their respective execution units once the instructions' operands are available. Opteron has three units for each of the following functions: integer execution, address generation, and floating-point execution. The integer units have a full 64-bit data path, and the majority of the micro-ops (32- and 64-bit add, subtract, rotate, shift, logical, and so on) take a single cycle. The hardware multiplier takes 3 cycles for a 32-bit multiply and 5 cycles for a 64-bit multiply.

The floating-point data path is a full 80 bits of extended precision. All floating-point units are fully pipelined with a throughput of one instruction per cycle for most operations. Simple operations like compare and absolute value, as well as most MMX operations, take 2 cycles. MMX multiply and sum-of-average instructions take 3 cycles. Single- and double-precision add and multiply instructions take 4 cycles, and the latency of divide and square-root instructions depends on precision: the divide takes 16, 20, or 24 cycles for single-, double-, or extended-precision data, and the square root takes 19, 27, or 35 cycles.

Load/store unit

In addition to the integer and floating-point units, the decoders issue load/store micro-ops to the load/store unit, which manages load and store accesses to the data cache and system memory. Each cycle, two 64-bit loads or stores can access the data cache as long as they are to different banks. Load-to-use latency is 3 cycles. Loads require an extra cycle if they have a nonzero segment base, are misaligned, or create bank conflicts. Although loads can return results to the execution units out of order, stores cannot commit data until they retire. Stores that have not yet committed forward their data to dependent loads. A hardware prefetcher5 recognizes a pattern of cache misses to consecutive lines (ascending or descending) and prefetches the next cache line into the L2 cache. The L2 cache can handle 10 outstanding requests for cache misses, state changes, and TLB misses: eight from the data cache and two from the instruction cache.
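The prefetcher's consecutive-line trigger lends itself to a short sketch. Tracking a single miss stream, as below, is a simplification; the article does not say how many streams the real unit follows, and prefetch_into_l2 is a hypothetical hook.

```c
#include <stdint.h>

#define LINE 64

extern void prefetch_into_l2(uint64_t paddr);   /* hypothetical hook */

static uint64_t last_miss_line;   /* line address of the previous miss;
                                     initialization and multiple-stream
                                     handling are omitted */

void on_cache_miss(uint64_t paddr)
{
    uint64_t line = paddr / LINE;

    if (line == last_miss_line + 1)             /* ascending stream  */
        prefetch_into_l2((line + 1) * LINE);
    else if (line == last_miss_line - 1)        /* descending stream */
        prefetch_into_l2((line - 1) * LINE);

    last_miss_line = line;
}
```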
Memory controller and HyperTransport

The on-chip memory controller provides a dual-channel, 128-bit-wide interface to 333-MHz DDR memory composed of PC2700 dual in-line memory modules (DIMMs). The controller's peak memory bandwidth is 5.3 Gbytes/s. It supports eight registered or unbuffered DIMMs, for a memory capacity of 16 Gbytes using 2-Gbyte DIMMs. The controller is on a separate clock grid but runs at the same frequency as the core; thus, memory latency scales down with frequency improvements in the processor core. In low-power modes, the processor can turn off the core clock while keeping the memory controller's clock running to permit access to memory by graphics and other devices. Historically, the memory controller has been relegated to a separate chip called the Northbridge, as illustrated in Figures 7a and 7b. The integrated memory controller significantly reduces Opteron's memory latency and improves performance by 20 to 25 percent over Athlon.

Other on-chip Northbridge functionality includes the processing of I/O direct-mapped access, memory-mapped I/O, peripheral component interconnect (PCI) configuration requests, transaction ordering, and cache coherence. A broadcast cache coherence protocol maintains memory coherence by avoiding the serialization delay of directory-based systems and allowing the snooping of the processor caches concurrently with DRAM access. The snoop throughput scales with processor frequency and HyperTransport link speed.

Among Opteron's unique features are the three integrated HyperTransport links.



Figure 7. Multiprocessor server architectures: two-way Athlon (a), four-way Intel (b), and two-way (c), four-way (d), and eight-way (e) Opteron.

Each link is 16 bits wide (3.2 Gbytes/s per direction) and is configurable as either coherent HyperTransport (cHT) or noncoherent HyperTransport (HT). The cHT protocol is for connecting processors; the simpler HT protocol is for attaching I/O. The point-to-point HyperTransport links support flexible, configurable, and scalable I/O topologies using chains of I/O "tunnel" and "cave" devices. The programming model for the configuration space is identical to that of the PCI specification, version 2.2.6 The HyperTransport links are transparent to any OS that supports PCI.

Reliability features

Reliability is very important in enterprise-class machines, and Opteron has robust reliability features. Error correction code (ECC) or parity checking protects all large arrays. The L1 data cache, L2 cache, and DRAM are ECC protected. Eight ECC check bits are stored for each 64 bits of data, allowing the correction of single-bit errors and the detection of double-bit errors (SECDED, which stands for single error correction, double error detection). DRAM also supports chipkill ECC, which allows the correction of 4-bit errors in the same nibble; the processor accomplishes this by using 16 ECC check bits on 128 bits of data. Thus, chipkill ECC can correct errors affecting an entire x4-data-width DRAM chip. In addition, the L1 data cache, L2 cache tags, and DRAM implement hardware scrubbers that steal idle cycles and clean up single-bit ECC errors. This cleanup is particularly important for the L1 data cache, which detects but does not correct errors on the fly. Parity checking protects the instruction cache data and the L1 and L2 TLBs. ECC and parity errors are reported via a machine check architecture that reports failures with sufficient information to diagnose the error. The highly integrated Opteron design also results in a lower overall chip count, which further improves system reliability.
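The check-bit counts quoted above follow from the classic Hamming bound, which the small calculation below verifies: single-error correction needs the smallest r with 2^r >= m + r + 1, and one added parity bit provides double-error detection. (Chipkill's 16 check bits on 128 data bits come from a symbol-oriented code and are not derived from this bound.)

```c
#include <stdio.h>

/* Smallest r with 2^r >= m + r + 1 (single-error correction), plus one
 * overall parity bit to upgrade the code to SECDED. */
static int secded_check_bits(int m)
{
    int r = 0;
    while ((1 << r) < m + r + 1)
        r++;
    return r + 1;
}

int main(void)
{
    /* 64 data bits: r = 7 (since 2^7 = 128 >= 64 + 7 + 1 = 72),
     * plus the parity bit = the 8 check bits the article cites. */
    printf("64 data bits -> %d check bits\n", secded_check_bits(64));
    return 0;
}
```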

Processor core performance

An Opteron processor operating at 2.0 GHz using registered 333-MHz DDR memory has estimated scores of 1,202 for SPECint2000 and 1,170 for SPECfp2000.

Glueless multiprocessing

Here, we illustrate the multiprocessing design space for Opteron system architectures and present microbenchmark latency and bandwidth results for shared-memory accesses from processors and I/O nodes. These microbenchmark results show that coherent memory bandwidth scales; memory latencies are such that the programmer has a near-flat view of memory.

Figure 7 shows the architectures of traditional two- and four-way x86 servers, in which all the processors in the system share a single, external memory controller. In contrast, an integrated memory controller provides a 128-bit, 333-MHz DDR interface at each node in an Opteron system. A multiprocessor configuration can use memory on a node contiguously or interleave pages from one node's memory with those from other nodes in the system. Three on-chip HyperTransport links enable streaming server I/O and high-bandwidth, glueless two-, four-, and eight-way multiprocessor systems, as shown in Figures 7c, 7d, and 7e.

Multiprocessor performance analysis

To evaluate the performance of our multiprocessor systems, we ran the following register-transfer-level microbenchmarks, measuring latencies and bandwidths on one-, two-, and four-processor systems:

• Local memory access. In this microbenchmark, each processor generates a series of read-memory transactions to addresses in local memory only. We measured the isolated read latency as well as bandwidth under saturated load. Figure 8a shows this communication pattern.
• Xfire memory access. In the Xfire (crossfire) microbenchmark, all processors read data from all nodes in an interleaved manner, such that the generated system load corresponds to all-to-all communication. Again, we measured the isolated read latency as well as bandwidth under saturated load; Figure 8b shows this communication pattern.

Figure 8. Local memory access (a) and Xfire memory access (b) benchmarks measure system performance under varying communication patterns.

The number of DIMMs and the memory capacity double when we double the number of processors. There are three I/O HyperTransport links for one processor and four I/O HyperTransport links for two- and four-processor configurations, each link providing 6.4 Gbytes/s of bandwidth. The processors are gluelessly interconnected with coherent HyperTransport links in the efficient topologies shown in Figures 7c, 7d, and 7e. For topologies 7c and 7d, the average distance between nodes is less than or equal to 1 hop, whereas for 7e, it is 1.75 hops.

When you bisect a network (cut it into two parts with an equal number of processors) such that the number of links cut is a minimum, that cut is called a min cut. The bandwidth across a min cut is the bisection bandwidth, a measure of the bandwidth available in the system for interprocessor communication. Bisecting the four-processor square of Figure 7d, for example, cuts two coherent HyperTransport links, giving the 12.8-Gbytes/s bisection bandwidth listed in Table 1.

Table 1 summarizes these results. The memory-read bandwidth in a four-processor system accessing data in local memory is up to 15.59 Gbytes/s; bandwidth is 11.23 Gbytes/s when the system interleaves memory across all the nodes. The graph in Figure 9 shows that the read-memory bandwidth scales from one- to four-processor systems.
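A sketch of the two measurement styles may clarify what these numbers capture: dependent pointer chasing serializes reads and exposes latency, while independent sequential line reads expose bandwidth. The NUMA placement of the buffer (local node versus interleaved across nodes) is what distinguishes the Local and Xfire benchmarks and is platform-specific, so it is omitted here; the code is illustrative, not AMD's benchmark.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE   64
#define LINES  (1 << 20)                 /* 64-Mbyte working set */

/* Latency: each load's address depends on the previous load's data, so
 * reads cannot overlap and the loop time measures round-trip memory
 * latency. The chain should visit lines in a random permutation so a
 * stride prefetcher cannot help. */
uint64_t chase(void **chain, size_t steps)
{
    void *p = chain[0];
    while (steps--)
        p = *(void **)p;                 /* serialized dependent loads */
    return (uint64_t)(uintptr_t)p;       /* defeat dead-code removal   */
}

/* Bandwidth: independent sequential reads, one per cache line, let the
 * memory system stream at its full rate. */
uint64_t stream_read(const char *buf)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < (size_t)LINES * LINE; i += LINE)
        sum += buf[i];
    return sum;
}
```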


Table 1. Microbenchmark results.

System parameters                                   One processor   Two processors   Four processors
No. of DIMMs                                              8               16               32
Total main memory using 2-Gbyte DIMMs (Gbytes)           16               32               64
No. of HyperTransport I/O links                           3                4                4
Bisection bandwidth (Gbytes/s)                           NA              6.4             12.8
Diameter (no. of hops)                                   NA                1                2
Average distance (no. of hops)                           NA              0.5                1
Local memory read bandwidth (Gbytes/s)                  5.3            10.67            15.59
Local bandwidth per processor (Gbytes/s)                5.3              5.3              3.9
Xfire memory read bandwidth (Gbytes/s)                  5.3             7.06            11.23
Xfire bandwidth per processor (Gbytes/s)                5.3             3.53              2.8

The average unloaded memory latency with open DRAM pages is less than 50 ns for a one-processor system, less than 70 ns for two processors, and less than 110 ns for four. The latency difference between remote and local memory in a given system is comparable to the difference between a DRAM page hit and a DRAM page conflict. Latency under load increases slowly because of the availability of excess interconnect bandwidth, and latency shrinks quickly with increasing CPU clock speed and HyperTransport link speed. The aggregate memory-read bandwidths for one-, two-, and four-processor systems, where the data resides uniformly across the nodes, are 5.3, 7.06, and 11.23 Gbytes/s.

Figure 9. Memory bandwidth scalability: local and Xfire memory-read bandwidth (Gbytes/s) versus number of processors.

Scaling beyond eight-way multiprocessing

We designed Opteron for use as a building block for highly scalable and reliable enterprise servers.7,8 It is designed to scale beyond eight ways by combining the processor with an external HyperTransport switch, as illustrated in Figure 10. The average distance between any pair of nodes in this topology is 2.7 hops, and the maximum distance between any pair of nodes (the network diameter) is 5 hops. Figure 10 shows the quad configuration: a group of four processors, each with independent I/O and local memory. The four processors share two sets of remote caches and snoop filters; they connect to the rest of the system through two coherent switches (SW0 and SW1) and six HyperTransport links.

The switch contains a snoop filter, which tracks the state of cached lines in the system; requests access the snoop filter to locate the latest copy of a line. The switch can also contain a large cache to hold data that is remote to a quad. Both the snoop filter and the remote cache reduce traffic among the switches and help bandwidth scale with processor count.

Figure 10. Scaling beyond eight-way multiprocessing using an external switch. This configuration provides 256 DIMMs, 512 Gbytes of total memory built from 256-Mbyte DRAMs, 32 HyperTransport links, and a bisection bandwidth of 51.2 Gbytes/s.
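The snoop filter's role can be sketched as a lookup that decides whether a request must visit any quad's caches at all. The table organization and fields below are assumptions for illustration; the article specifies only that the filter tracks the state of cached lines and is consulted to locate the latest copy.

```c
#include <stdbool.h>
#include <stdint.h>

enum line_state { INVALID, CLEAN, DIRTY };

typedef struct {
    uint64_t        tag;        /* full line address, for this sketch   */
    enum line_state state;
    uint8_t         presence;   /* one bit per quad that may cache it   */
} sf_entry_t;

#define SF_SETS 4096            /* table size is an assumption */
static sf_entry_t filter[SF_SETS];

/* Returns true if some quad must be snooped; false means the home
 * node's memory can service the request directly, with no snoops. */
bool snoop_filter_lookup(uint64_t line_addr, uint8_t *quads_to_snoop)
{
    sf_entry_t *e = &filter[line_addr % SF_SETS];

    if (e->tag == line_addr && e->state != INVALID && e->presence) {
        *quads_to_snoop = e->presence;   /* forward only where cached */
        return true;
    }
    *quads_to_snoop = 0;                 /* filtered: no snoop traffic */
    return false;
}
```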


The integrated DDR memory controller, glueless multiprocessing capability, high-bandwidth HyperTransport links, 64-bit computing, and high-reliability features, combined with a core that provides leading-edge performance, make the AMD Opteron a great processor for server applications. We plan future microarchitectural improvements such as support for next-generation DDR technology, enhancements to the cache and memory subsystem, and support for larger linear and physical addresses. In addition to servers, the core architecture described here serves as a foundation for desktops, laptops, and workstations.

AMD targets Opteron processors for fabrication in a 130-nm silicon-on-insulator process with nine layers of copper interconnect, a process available at AMD's fab in Dresden, Germany. In the future, AMD plans to move to a 90-nm process. Although the design team has focused on eliminating unnecessary power dissipation without sacrificing performance through extensive use of clock gating, we plan further work to lower power requirements.

Acknowledgments
We thank all the members of the Opteron architecture team, including Fred Weber, Bill Hughes, Ramsey Haddad, Dave Christie, Scott White, Gerald Zuraski, Phil Madrid, Kelvin Goveas, Ashraf Ahmed, Michael Clark, Hongwen Gao, Bruce Holloway, Richard Klass, Dave Kroesche, and Tim Wood.


References
1. AMD x86-64 Architecture Manuals, http://www.amd.com.
2. HyperTransport I/O Link Specification, http://www.hypertransport.org/tech_specifications.html.
3. D. Meyer, "The AMD-K7 Processor: A First Look," Microprocessor Forum, 1998; http://www.mdronline.com/mpf/.
4. S. McFarling, "Combining Branch Predictors," WRL Technical Note TN-36, Compaq Western Research Laboratory, 1993.
5. T.-F. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Trans. Computers, vol. 44, no. 5, May 1995, pp. 609-623.
6. PCI Local Bus Specification, revision 2.2, 18 Dec. 1998; http://www.pcisig.com/specifications.
7. F. Aono and M. Kimura, "The AzusA 16-Way Itanium Server," IEEE Micro, vol. 20, no. 5, Sept.-Oct. 2000, pp. 54-60.
8. D. Watts et al., "Technical Description," IBM xSeries 440 Planning and Installation Guide, Oct. 2002; http://www.redbooks.ibm.com/redbooks/SG246196.html.

Chetana N. Keltcher is a member of the technical staff at Advanced Micro Devices. She worked on the K6 and Athlon microprocessors and is now part of the Opteron architecture team, where she is responsible for the load-store and data cache design. Keltcher has a BE in computer science from Bangalore University, India, and an MS and PhD in computer science from the Pennsylvania State University. She is a member of the IEEE.

Kevin J. McGrath is a fellow at AMD, California Microprocessor Division. He is the architect of AMD's 64-bit extensions and currently manages the Opteron architecture and RTL team. His work experience includes 20 years in CPU design and verification, first for Hewlett-Packard and later for ELXSI, eventually working on the microarchitecture of the Nx586, K6, and Athlon processors and leading the microcode team for those projects. McGrath has a BS in engineering technology from California Polytechnic University, San Luis Obispo.

Ardsher Ahmed was a senior member of the technical staff, design engineer, at AMD. His research interests include high-performance computer architectures, multiprocessor systems, server development, supercomputing and massively parallel processing, high-bandwidth low-latency interconnection network design, performance evaluation, and microbenchmarking. Ahmed has a BS in electrical engineering from NED University of Engineering and Technology, Karachi, Pakistan; an MS in electrical engineering from the California Institute of Technology, Pasadena; and a PhD in computer engineering from Binghamton University, New York. He is a member of the IEEE.

Pat Conway is a senior member of technical staff in the Computational Products Group at AMD, where he is responsible for developing scalable, high-performance server architectures. His work experience includes the design and development of server hardware, cache coherence protocols, user-level message passing, and clustering technology. Conway has an MEngSc from University College Cork, Ireland, and an MBA from Golden Gate University. He is a member of the IEEE.

For more information on AMD Opteron, see the Web site at http://www.amd.com/opteron/.
