The AMD Opteron Northbridge Architecture
To increase performance while operating within a fixed power budget, the AMD Opteron processor integrates multiple x86-64 cores with a router and memory controller. AMD's experience with building a wide variety of system topologies using Opteron's HyperTransport-based processor interface has provided useful lessons that expose the challenges to be addressed when designing future system interconnect, memory hierarchy, and I/O to scale with both the number of cores and sockets in future x86-64 CMP architectures.

Pat Conway and Bill Hughes, Advanced Micro Devices

Published by the IEEE Computer Society. IEEE Micro, March–April 2007. 0272-1732/07/$20.00 © 2007 IEEE

In 2005, Advanced Micro Devices introduced the industry's first native 64-bit x86 chip multiprocessor (CMP) architecture combining two independent processor cores on a single silicon die. The dual-core Opteron chip featuring AMD's Direct Connect architecture provided a path for existing Opteron shared-memory multiprocessors to scale up from 4- and 8-way to 8- and 16-way while operating within the same power envelope as the original single-core Opteron processor [1,2]. The foundation for AMD's Direct Connect architecture is its innovative Opteron processor northbridge.

In this article, we discuss the wide variety of system topologies that use the Direct Connect architecture for glueless multiprocessing, the latency and bandwidth characteristics of these systems, and the importance of topology selection and virtual-channel-buffer allocation to optimizing system throughput. We also describe several extensions of the Opteron northbridge architecture, planned by AMD to provide significant throughput improvements in future products while operating within a fixed power budget. AMD has also launched an initiative to provide industry access to the Direct Connect architecture. The "Torrenza Initiative" sidebar summarizes the project's goals.

The x86 blade server architecture

Figure 1a shows the traditional front-side bus (FSB) architecture of a four-processor (4P) blade, in which several processors share a bus connected to an external memory controller (the northbridge) and an I/O controller (the southbridge). Discrete external memory buffer chips (XMBs) provide expanded memory capacity. The single memory controller can be a major bottleneck, preventing faster CPUs or additional cores from improving performance significantly.

In contrast, Figure 1b illustrates AMD's Direct Connect architecture, which uses industry-standard HyperTransport technology to interconnect the processors [3]. HyperTransport interconnect offers scalability, high bandwidth, and low latency. The distributed shared-memory architecture includes four integrated memory controllers, one per chip, giving it a fourfold advantage in memory capacity and bandwidth over the traditional architecture, without requiring the use of costly, power-consuming memory buffers. Thus, the Direct Connect architecture reduces FSB bottlenecks.

Figure 1. Evolution of x86 blade server architecture: traditional front-side bus architecture (a) and AMD's Direct Connect architecture (b). MCP: multichip package; Mem.: memory controller.

Torrenza Initiative (sidebar)

AMD's Torrenza is a multiyear initiative to create an innovation platform by opening access to the AMD64 Direct Connect architecture to enhance acceleration and coprocessing in homogeneous and heterogeneous systems. Figure A shows the Torrenza platform, illustrating how custom-designed accelerators, say for the processing of Extensible Markup Language (XML) documents or for service-oriented architecture (SOA) applications, can be tightly coupled with Opteron processors.

As the industry's first open, customer-centered x86 innovation platform, Torrenza capitalizes on the Direct Connect architecture and HyperTransport technology advances of the AMD64 platform. The Torrenza Initiative includes the following elements:

• Innovation Socket. In September 2006 AMD announced it would license the AMD64 processor socket and design specifications to OEMs to allow collaboration on specifications so that they can take full advantage of the x86 architecture. Cray, Fujitsu, Siemens, IBM, and Sun have publicly stated their support and are designing products for the Innovation Socket.
• Coprocessor enablement. Leveraging the strengths of HyperTransport, AMD is working with various partners to create an extensive partner ecosystem of tools, services, and software to implement coprocessors in silicon. HyperTransport is the only open, standards-based, extensible system bus.
• Direct Connect platform enablement. AMD is encouraging standards bodies and operating system suppliers to support accelerators and coprocessors directly connected to the processor. To help drive innovation across the industry, AMD is opening access to HyperTransport.

Torrenza is designed to create an opportunity for a global innovation community to develop and deploy application-specific coprocessors to work alongside AMD processors in multisocket systems. Its goal is to help accelerate industry innovation and drive new technology, which can then become mainstream. It gives users, original equipment manufacturers, and independent software vendors the ability to leverage billions in third-party investments.

Figure A. Torrenza platform.

Northbridge microarchitecture

In the Opteron processor, the northbridge consists of all the logic outside the processor core. Figure 2 shows an Opteron processor with a simplified view of the northbridge microarchitecture, including system request interface (SRI) and host bridge, crossbar, memory controller, DRAM controller, and HyperTransport ports.

Figure 2. Opteron 800 series processor architecture.

The northbridge is a custom design that runs at the same frequency as the processor core. The command flow starts in the processor core with a memory access that misses in the L2 cache, such as an instruction fetch. The SRI contains the system address map, which maps memory ranges to nodes. If the memory access is to local memory, an address map lookup in the SRI sends it to the on-chip memory controller; if the memory access is off-chip, a routing table lookup routes it to a HyperTransport port.

The northbridge crossbar has five ports: SRI, memory controller, and three HyperTransport ports. The processing of command packet headers and data packets is logically separated. There is a command crossbar dedicated to routing command packets, which are 4 or 8 bytes in size, and a data crossbar for routing the data payload associated with commands, which can be 4 or 64 bytes in size.

Figure 3 depicts the northbridge command flow. The command crossbar routes coherent HyperTransport commands. It can deliver an 8-byte HyperTransport packet header at a rate of 1 per clock (one every 333 ps with a 3-GHz CPU). Each input port has a pool of command-size buffers, which are divided between four virtual channels (VCs): Request, Posted request, Probe, and Response. A static allocation of command buffers occurs at each of the five crossbar input ports. (The next section of this article discusses how buffers should be allocated across different virtual channels to optimize system throughput.)

The data crossbar, shown in Figure 4, supports cut-through routing of data packets. The cache line size is 64 bytes, and all buffers are sized in multiples of 64 bytes to optimize the transfer of cache-line-size data packets. Data packets traverse on-chip data paths in 8 clock cycles. Transfers to different output ports are time multiplexed clock by clock to support high concurrency; for example, two concurrent transfers from CPU and memory controller input ports to different output ports are possible. The

market segment. The SCI protocol supports a single shared-address space for an arbitrary number of nodes in a distributed shared-
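The SRI lookup described in the northbridge microarchitecture section (address map first, routing table only for off-chip accesses) can be illustrated with a small model. This is a minimal sketch, not AMD's implementation: the `AddressRange` type, the `route_request` function, the port names, the 4-GiB-per-node layout, and the routing table contents are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1 << 30  # bytes per GiB


@dataclass(frozen=True)
class AddressRange:
    """One address-map entry: a physical memory range owned by a node."""
    base: int   # first address in the range (inclusive)
    limit: int  # last address in the range (inclusive)
    node: int   # node whose memory controller serves this range


def route_request(addr, local_node, address_map, routing_table):
    """Return the crossbar destination for a memory access that missed in L2.

    Step 1: the address map resolves the address to its home node.
    Step 2: local accesses go to the on-chip memory controller; remote
            accesses use the routing table to pick a HyperTransport port.
    """
    for entry in address_map:
        if entry.base <= addr <= entry.limit:
            if entry.node == local_node:
                return "memory_controller"
            return routing_table[entry.node]
    raise ValueError(f"address {addr:#x} is not covered by the address map")


# Hypothetical 4-node system with 4 GiB of DRAM behind each node.
EXAMPLE_ADDRESS_MAP = [
    AddressRange(n * 4 * GIB, (n + 1) * 4 * GIB - 1, n) for n in range(4)
]

# Hypothetical routing table as seen from node 0: destination node -> port.
# Node 3 is assumed to be two hops away, reached through the same port as node 1.
EXAMPLE_ROUTING_TABLE = {1: "ht_port_0", 2: "ht_port_1", 3: "ht_port_0"}
```

For example, `route_request(5 * GIB, 0, EXAMPLE_ADDRESS_MAP, EXAMPLE_ROUTING_TABLE)` resolves the address to node 1 and therefore selects a HyperTransport port, whereas any address below 4 GiB stays on the local memory controller.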
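The static division of each input port's command buffers among the four virtual channels can be sketched as follows. The class and method names are hypothetical, and the buffer counts in the usage note are invented for illustration; the article does not give the actual pool size at this point. The sketch shows only the key property of a static allocation: exhausting one virtual channel's buffers does not consume buffers reserved for another.

```python
from enum import Enum


class VC(Enum):
    """The four virtual channels named in the article."""
    REQUEST = "Request"
    POSTED_REQUEST = "Posted request"
    PROBE = "Probe"
    RESPONSE = "Response"


class InputPortBuffers:
    """Command buffers of one crossbar input port, statically split per VC."""

    def __init__(self, allocation):
        missing = set(VC) - set(allocation)
        if missing:
            raise ValueError(f"no allocation given for: {sorted(v.value for v in missing)}")
        self._capacity = dict(allocation)  # fixed at configuration time
        self._free = dict(allocation)      # buffers currently available

    def try_accept(self, vc):
        """Take one buffer for a command on `vc`; return False if none is free."""
        if self._free[vc] == 0:
            return False
        self._free[vc] -= 1
        return True

    def release(self, vc):
        """Return a buffer to `vc` once its command has left the crossbar."""
        if self._free[vc] >= self._capacity[vc]:
            raise RuntimeError(f"{vc.value}: released more buffers than allocated")
        self._free[vc] += 1

    def free(self, vc):
        """Number of buffers currently free on `vc`."""
        return self._free[vc]
```

With a hypothetical allocation such as `{VC.REQUEST: 2, VC.POSTED_REQUEST: 1, VC.PROBE: 2, VC.RESPONSE: 3}`, a third back-to-back Request is refused while Probe and Response commands are still accepted, which is why the choice of allocation affects system throughput.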
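The timing figures quoted for the crossbars follow from simple arithmetic, worked through below. The function names are hypothetical. The 8-bytes-per-clock data path width and the resulting bandwidth are inferences from the statement that a 64-byte packet traverses a data path in 8 clocks; they assume one transfer occupies the path for the full 8 clocks and are not figures stated in the article.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a clock frequency given in GHz."""
    return 1000.0 / freq_ghz


def data_path_bytes_per_clock(packet_bytes=64, clocks=8):
    """Bytes moved per clock if a packet takes `clocks` cycles to traverse."""
    return packet_bytes // clocks


def data_path_bandwidth_gb_per_s(freq_ghz, packet_bytes=64, clocks=8):
    """Inferred peak bandwidth of one on-chip data path, in GB/s (10^9 bytes/s)."""
    return data_path_bytes_per_clock(packet_bytes, clocks) * freq_ghz


if __name__ == "__main__":
    freq = 3.0  # GHz, the CPU frequency used in the article's example
    print(f"header slot: {clock_period_ps(freq):.0f} ps")
    print(f"data path width: {data_path_bytes_per_clock()} bytes/clock")
    print(f"inferred data path bandwidth: {data_path_bandwidth_gb_per_s(freq):.0f} GB/s")
```

At 3 GHz the clock period is 1000/3 ≈ 333 ps, matching the one-header-per-333-ps rate quoted for the command crossbar.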