The AMD Opteron Processor for Multiprocessor Servers
Representing AMD's entry into 64-bit computing, Opteron combines the backwards compatibility of the x86-64 architecture with a DDR memory controller and HyperTransport links to deliver server-class performance. These features also make Opteron a flexible, modular, and easily connectable component for various multiprocessor configurations.

Chetana N. Keltcher, Kevin J. McGrath, Ardsher Ahmed, and Pat Conway
Advanced Micro Devices

Advanced Micro Devices' Opteron processor is a 64-bit, x86-based microprocessor with an on-chip double-data-rate (DDR) memory controller and three HyperTransport links for glueless multiprocessing. AMD based Opteron on the x86-64 architecture,1 AMD's backward-compatible extension of the industry standard x86 architecture. Thus, Opteron provides seamless, high-performance support for both the existing 32-bit x86 code base and new 64-bit software. Its core is an out-of-order, superscalar processor supported by large on-chip level-1 (L1) and level-2 (L2) caches. Figure 1 shows a block diagram. A key feature is the on-chip, integrated DDR memory controller, which leads to low memory latency and provides high-performance support to integer, floating-point, and multimedia instructions. The HyperTransport links let Opteron connect to neighboring Opteron processors and other HyperTransport devices without additional support chips.2

Figure 1. Opteron architecture: execution core with 64-Kbyte L1 instruction and data caches and a 1-Mbyte L2 cache, connected through a system request queue and crossbar to the DDR memory controller (5.3 Gbytes/s, PC2700) and three 6.4-Gbytes/s HyperTransport links.

Why 64 bit?
The need for a 64-bit architecture arises from applications—such as databases, computer-aided design (CAD) tools, scientific/engineering problems, and high-performance servers—that address large amounts of virtual and physical memory. For example, the Hammer design project, which includes Opteron, ran most of its design simulations on a compute farm of 3,000 Athlons running Linux. However, for some large problem sets, our CAD tools required greater than 4 Gbytes of memory and had to run on far more expensive platforms that supported larger addresses. The Hammer project could have benefited from the existence of low-cost 64-bit computing.

However, there exists a huge software code base built around the legacy 32-bit x86 instruction set. Many x86 workstation and server users now face the dilemma of how to transition to 64-bit systems and yet maintain compatibility with their existing code base. The AMD x86-64 architecture adds 64-bit addressing and expands register resources to support higher performance for recompiled 64-bit programs. At the same time, it also supports legacy 32-bit applications and operating systems without modification or recompilation. It thus supports both the vast body of existing software as well as new 64-bit software.

The AMD x86-64 architecture supports 64-bit virtual and 52-bit physical addresses, although a given implementation might support less. The Opteron, for example, supports 48-bit virtual and 40-bit physical addresses.
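As a concrete illustration of what a 48-bit virtual-address implementation implies (this sketch is not from the article; it follows the standard x86-64 rule), pointers must be in canonical form: bits 63 through 48 must be copies of bit 47, so the upper 17 bits of a valid address are either all zeros or all ones. A minimal C check, assuming 48 virtual-address bits:

    #include <stdbool.h>
    #include <stdint.h>

    /* Canonical-form test for an x86-64 implementation with 48-bit
     * virtual addresses (such as Opteron): bits 63:48 must replicate
     * bit 47, so bits 63:47 are either all clear or all set. */
    static bool is_canonical_48bit(uint64_t va)
    {
        uint64_t upper = va >> 47;              /* bits 63 through 47 */
        return upper == 0 || upper == 0x1FFFF;  /* 17 bits: all 0s or all 1s */
    }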
The x86-64 architecture extends all x86 integer arithmetic and logical integer instructions to 64 bits, including 64 × 64-bit multiply with a 128-bit result.

The x86-64 architecture doubles the number of integer general-purpose registers (GPRs) and streaming SIMD extension (SSE) registers from 8 to 16 of each. It also extends the GPRs from 32 to 64 bits. The 64-bit GPR extensions overlay the existing 32-bit registers in the same way as they had in the x86 architecture's previous transition, from 16 to 32 bits.

Figure 2 depicts the programmer's view of the x86-64 registers, showing the legacy x86 registers and the x86-64 register extensions. Programs address the new registers via a new instruction prefix byte called REX (register extension).

Figure 2. Programmer's model for the x86-64 architecture: the legacy x86 registers (eight 32-bit GPRs, eight 128-bit XMM registers, the x87 stack, and the 32-bit EIP) alongside the x86-64 additions (GPRs widened to 64 bits, eight more GPRs R8 through R15, eight more SSE registers XMM8 through XMM15, and a 64-bit program counter).

The extension to the x86 register set is straightforward and much needed. Current x86 software is highly optimized to work within the small number of legacy registers. Adding eight more registers satisfied the register allocation needs of more than 80 percent of functions appearing in typical application code, as Figure 3 shows.

Figure 3. Register usage in typical applications: percentage of functions requiring less than 8, 8 to 15, 16 to 32, or greater than 32 registers, for gcc, perl, ijpeg, vortex, bytemark, m88ksim, a word processor, and a spreadsheet.
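As a rough sketch of how the REX prefix mentioned above reaches the new registers (this example is not from the article; it follows the published x86-64 encoding), the prefix byte has the form 0100WRXB: W requests a 64-bit operand size, while R, X, and B each supply the extra fourth register-number bit for the ModRM reg, SIB index, and ModRM rm/SIB base fields, which is what grows the addressable register file from 8 to 16 entries.

    #include <stdint.h>

    /* Compose a REX prefix byte (binary 0100WRXB). W selects 64-bit
     * operands; R, X, and B extend the reg, index, and rm/base register
     * fields by one high bit each, reaching R8 through R15 (and
     * XMM8 through XMM15 for SSE instructions). */
    static uint8_t rex_prefix(unsigned w, unsigned r, unsigned x, unsigned b)
    {
        return (uint8_t)(0x40 | ((w & 1) << 3) | ((r & 1) << 2)
                              | ((x & 1) << 1) | (b & 1));
    }

    /* Example: encoding "add rax, r9" needs REX.W for the 64-bit operand
     * size and REX.R because r9's fourth register bit travels in R,
     * giving rex_prefix(1, 1, 0, 0) = 0x4C. */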
Figure 4. Processor core block diagram: fetch and branch prediction feed scan/align and the fastpath and microcode decoders; a 72-entry instruction control unit feeds integer and floating-point decode and rename; three ALU/AGU pairs with 8-entry reservation stations, a 36-entry floating-point scheduler driving FADD, FMUL, and FMISC units, and a 44-entry load/store queue sit alongside the 64-Kbyte L1 caches, 1-Mbyte L2 cache, system request queue, crossbar, memory controller, and HyperTransport interfaces.

Core microarchitecture
Figure 4 is a block diagram of the Opteron core, an aggressive, out-of-order, three-way superscalar processor. It reorders as many as 72 micro-ops, and fetches and decodes three x86-64 instructions each cycle from the instruction cache. Like its predecessor, the AMD Athlon,3 Opteron turns variable-length x86-64 instructions into fixed-length micro-ops and dispatches them to two independent schedulers: one for integer, and one for floating-point and multimedia (MMX, 3DNow!, SSE, and SSE2) operations. In addition, it sends load and store micro-ops to the load/store unit. The schedulers and the load/store unit can dispatch 11 micro-ops each cycle to the following execution resources:

• three integer execution units,
• three address generation units,
• three floating-point and multimedia units, and
• two loads/stores to the data cache.

Pipeline
Opteron's pipeline, shown in Figure 5, is long enough for high frequency and short enough for good IPC (instructions per cycle). This pipeline is fully integrated from instruction fetch through DRAM access. It takes seven cycles for fetch and decode; excellent branch prediction covers this latency. The execute pipeline is typically 12 stages for integer and 17 stages for floating-point. The data cache access occurs in stage 11 for loads; the result returns to the execution units in stage 12.

In case of an L1 cache miss, the pipeline accesses the L2 cache and (in parallel) a request goes to the system request queue. The pipeline cancels the system request on an L2 cache hit. The memory controller schedules the request to DRAM.

The stages in the DRAM pipeline run at the same frequency as the core. Once data is returned from DRAM, it takes several stages to route data across the chip to the L1 cache and to perform error correction. The processor forwards both the L2 cache and system data, critical-word first, to the waiting load while updating the L1 cache.

Figure 5. Opteron pipeline: integer stages 1-12 (fetch 1, fetch 2, pick, decode 1, decode 2, pack, pack/decode, dispatch, schedule, AGU/ALU, data-cache access, data-cache response); floating-point stages 8-17 (dispatch, stack rename, register rename, schedule write, schedule, register read, FX0 through FX3); L2 cache pipeline, stages 13-19; DRAM pipeline, stages 14-27, spanning the system request queue, GART and address decode, crossbar, coherence/order check, memory controller, and DRAM access.

Caches
Opteron has separate L1 data and instruction caches, each 64 Kbytes. They are two-way set-associative, linearly indexed, and physically tagged with a cache line size of 64 bytes. We banked them for speed, with each way consisting of eight 4-Kbyte banks (see the address-breakdown sketch at the end of this section). Backing these L1 caches is a large 1-Mbyte L2 cache, whose data is mutually exclusive with respect to the data in the L1 caches. The L2 cache is 16-way set-associative and uses a pseudo-least-recently-used (LRU) replacement policy, grouping two ways into a sector and managing the sectors via an LRU criterion. This mechanism uses only half the number of LRU bits of a traditional LRU scheme, with similar performance.

The instruction and data caches each have independent L1 and L2 translation look-aside buffers (TLBs). The L1 TLB is fully associative and stores thirty-two 4-Kbyte page translations and eight 2-Mbyte/4-Mbyte page translations. The L2 TLB is four-way set-associative with 512 4-Kbyte entries. In a classical x86 TLB scheme, a change to the page directory base flushes the TLB. Opteron has a hardware flush filter that blocks unnecessary TLB flushes and flushes the TLBs only after it detects a modification of a paging data structure. Filtering TLB flushes can be a significant advantage in multiprocessor workloads because it lets multiple processes share the TLB.

Instruction fetch and decode
The instruction fetch unit, with the help of branch prediction, attempts to supply 16
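To make the L1 cache geometry above concrete, the following sketch (not from the article) splits a physical address for a 64-Kbyte, two-way set-associative cache with 64-byte lines: 32 Kbytes per way and 64-byte lines give 512 sets, so the offset occupies bits 5:0, the set index bits 14:6, and, with Opteron's 40-bit physical addresses, the tag bits 39:15. The bank-select mapping shown is an assumption for illustration; the text says only that each way comprises eight 4-Kbyte banks.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES  64u
    #define WAYS        2u
    #define CACHE_BYTES (64u * 1024u)
    #define SETS        (CACHE_BYTES / (WAYS * LINE_BYTES))   /* 512 sets */

    /* Field breakdown for a 64-Kbyte, two-way, 64-byte-line cache.
     * The bank field is hypothetical: it assumes the eight 4-Kbyte banks
     * per way are interleaved on 8-byte boundaries within a line
     * (address bits 5:3), so each bank holds 512 lines x 8 bytes = 4 Kbytes. */
    static void split_address(uint64_t pa)
    {
        uint64_t offset = pa % LINE_BYTES;             /* bits 5:0   */
        uint64_t set    = (pa / LINE_BYTES) % SETS;    /* bits 14:6  */
        uint64_t tag    = pa / (LINE_BYTES * SETS);    /* bits 39:15 */
        uint64_t bank   = (pa >> 3) & 0x7;             /* assumed mapping */
        printf("offset=%llu set=%llu bank=%llu tag=0x%llx\n",
               (unsigned long long)offset, (unsigned long long)set,
               (unsigned long long)bank, (unsigned long long)tag);
    }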