THE AMD OPTERON NORTHBRIDGE ARCHITECTURE

To increase performance while operating within a fixed power budget, the AMD Opteron processor integrates multiple x86-64 cores with a router and memory controller. AMD's experience with building a wide variety of system topologies using Opteron's HyperTransport-based processor interface has provided useful lessons that expose the challenges to be addressed when designing future system interconnect, memory hierarchy, and I/O to scale with both the number of cores and sockets in future x86-64 CMP architectures.

Pat Conway
Bill Hughes
Advanced Micro Devices

In 2005, AMD introduced the industry's first native 64-bit x86 chip multiprocessor (CMP) architecture combining two independent processor cores on a single silicon die. The dual-core Opteron chip featuring AMD's Direct Connect architecture provided a path for existing Opteron shared-memory multiprocessors to scale up from 4- and 8-way to 8- and 16-way while operating within the same power envelope as the original single-core Opteron processor.1,2 The foundation for AMD's Direct Connect architecture is its innovative Opteron processor northbridge. In this article, we discuss the wide variety of system topologies that use the Direct Connect architecture for glueless multiprocessing, the latency and bandwidth characteristics of these systems, and the importance of topology selection and virtual-channel-buffer allocation to optimizing system throughput. We also describe several extensions of the Opteron northbridge architecture, planned by AMD to provide significant throughput improvements in future products while operating within a fixed power budget. AMD has also launched an initiative to provide industry access to the Direct Connect architecture. The "Torrenza Initiative" sidebar summarizes the project's goals.

The x86 blade server architecture
Figure 1a shows the traditional front-side bus (FSB) architecture of a four-processor (4P) blade, in which several processors share a bus connected to an external memory controller (the northbridge) and an I/O controller (the southbridge). Discrete external memory buffer chips (XMBs) provide expanded memory capacity. The single front-side bus can be a major bottleneck, preventing faster CPUs or additional cores from improving performance significantly.

In contrast, Figure 1b illustrates AMD's Direct Connect architecture, which uses industry-standard HyperTransport technology to interconnect the processors.3 HyperTransport interconnect offers scalability, high bandwidth, and low latency. The distributed shared-memory architecture includes four integrated memory controllers, one per chip, giving it a fourfold advantage in memory capacity and bandwidth over the traditional architecture, without requiring the use of costly, power-consuming memory buffers. Thus, the Direct Connect architecture reduces FSB bottlenecks.


Torrenza Initiative

AMD's Torrenza is a multiyear initiative to create an innovation platform by opening access to the AMD64 Direct Connect architecture to enhance acceleration and coprocessing in homogeneous and heterogeneous systems.

Torrenza is designed to create an opportunity for a global innovation community to develop and deploy application-specific coprocessors to work alongside AMD processors in multisocket systems. Its goal is to help accelerate industry innovation and drive new technology, which can then become mainstream. It gives users, original equipment manufacturers, and independent software vendors the ability to leverage billions in third-party investments.

Figure A shows the Torrenza platform, illustrating how custom-designed accelerators, say for the processing of Extensible Markup Language (XML) documents or for service-oriented architecture (SOA) applications, can be tightly coupled with Opteron processors. As the industry's first open, customer-centered x86 innovation platform, Torrenza capitalizes on the Direct Connect architecture and HyperTransport technology advances of the AMD64 platform.

The Torrenza Initiative includes the following elements:

• Innovation Socket. In September 2006, AMD announced it would license the AMD64 processor socket and design specifications to OEMs to allow collaboration on specifications so that they can take full advantage of the x86 architecture. Cray, Fujitsu, Siemens, IBM, and Sun have publicly stated their support and are designing products for the Innovation Socket.
• Enablement. Leveraging the strengths of HyperTransport, AMD is working with various partners to create an extensive partner ecosystem of tools, services, and software to implement coprocessors in silicon. HyperTransport is the only open, standards-based, extensible system bus.
• Direct Connect platform enablement. AMD is encouraging standards bodies and operating system suppliers to support accelerators and coprocessors directly connected to the processor. To help drive innovation across the industry, AMD is opening access to HyperTransport.

Figure A. Torrenza platform.


Northbridge

In the Opteron processor, the northbridge consists of all the logic outside the processor core. Figure 2 shows an Opteron processor with a simplified view of the northbridge microarchitecture, including system request interface (SRI) and host bridge, crossbar, memory controller, DRAM controller, and HyperTransport ports.

The northbridge is a custom design that runs at the same frequency as the processor core. The command flow starts in the processor core with a memory access that misses in the L2 cache, such as an instruction fetch. The SRI contains the system address map, which maps memory ranges to nodes. If the memory access is to local memory, an address map lookup in the SRI sends it to the on-chip memory controller; if the memory access is off-chip, a routing table lookup routes it to a HyperTransport port.

The northbridge crossbar has five ports: SRI, memory controller, and three HyperTransport ports. The processing of command packet headers and data packets is logically separated. There is a command crossbar dedicated to routing command packets, which are 4 or 8 bytes in size, and a data crossbar for routing the data payload associated with commands, which can be 4 or 64 bytes in size.

Figure 3 depicts the northbridge command flow. The command crossbar routes coherent HyperTransport commands. It can deliver an 8-byte HyperTransport packet header at a rate of 1 per clock (one every 333 ps with a 3-GHz CPU). Each input port has a pool of command-size buffers, which are divided between four virtual channels (VCs): Request, Posted request, Probe, and Response. A static allocation of command buffers occurs at each of the five crossbar input ports. (The next section of this article discusses how buffers should be allocated across different virtual channels to optimize system throughput.)

The data crossbar, shown in Figure 4, supports cut-through routing of data packets. The cache line size is 64 bytes, and all buffers are sized in multiples of 64 bytes to optimize the transfer of cache-line-size data packets. Data packets traverse on-chip data paths in 8 clock cycles. Transfers to different output ports are time multiplexed clock by clock to support high concurrency; for example, two concurrent transfers from CPU and memory controller input ports to different output ports are possible. The peak arrival rate from a HyperTransport port is 1 per 40 CPU clock cycles, or 16 ns (64 bytes at 4 Gbytes/s), and the on-chip service rate is 1 per 8 clock cycles, or 3 ns. HyperTransport routing is table driven to support arbitrary system topologies, and the crossbar provides separate routing tables for routing Requests, Probes, and Responses. Messages traveling in the Request and Response VCs are always point-to-point, whereas messages in the Probe VC are broadcast to all nodes in the system.
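To make the two lookups concrete, here is a minimal sketch of the routing decision just described: a system address map that maps physical-address ranges to home nodes, followed by a per-destination routing table that selects either the memory controller or a HyperTransport port. The table contents, field names, and the assumed four-node square topology are illustrative only, not the actual Opteron register layout.

```c
/* Hedged sketch of the request-routing decision described above: an SRI
 * address-map lookup picks the home node, then either the local memory
 * controller or a HyperTransport port is chosen from a routing table.
 * All tables, names, and the assumed 4-node topology are illustrative. */
#include <stdint.h>
#include <stdio.h>

enum port { PORT_MEMCTL, PORT_HT0, PORT_HT1, PORT_HT2 };

struct addr_range { uint64_t base, limit; int home_node; };

/* Assumed address map: 4 nodes, 4 Gbytes of DRAM each. */
static const struct addr_range sys_addr_map[] = {
    { 0x000000000ULL, 0x0FFFFFFFFULL, 0 },
    { 0x100000000ULL, 0x1FFFFFFFFULL, 1 },
    { 0x200000000ULL, 0x2FFFFFFFFULL, 2 },
    { 0x300000000ULL, 0x3FFFFFFFFULL, 3 },
};

/* Request-VC routing table for node 0 in a square topology (neighbors 1 and 3).
 * The crossbar keeps separate tables for Requests, Probes, and Responses. */
static const enum port request_route[4] = {
    PORT_MEMCTL,        /* node 0: local */
    PORT_HT0,           /* node 1: direct link */
    PORT_HT0,           /* node 2: two hops, routed via node 1 */
    PORT_HT1,           /* node 3: direct link */
};

static enum port route_request(int this_node, uint64_t paddr)
{
    int n = (int)(sizeof sys_addr_map / sizeof sys_addr_map[0]);

    for (int i = 0; i < n; i++) {
        if (paddr >= sys_addr_map[i].base && paddr <= sys_addr_map[i].limit) {
            int home = sys_addr_map[i].home_node;
            /* Local memory goes to the on-chip memory controller;
             * remote memory is forwarded out a HyperTransport port. */
            return home == this_node ? PORT_MEMCTL : request_route[home];
        }
    }
    return PORT_HT0;    /* unmapped addresses: illustrative default */
}

int main(void)
{
    printf("0x180000000 -> port %d\n", (int)route_request(0, 0x180000000ULL));
    printf("0x040000000 -> port %d\n", (int)route_request(0, 0x040000000ULL));
    return 0;
}
```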

Figure 1. Evolution of x86 blade server architecture: traditional front-side bus architecture (a) and AMD’s Direct Connect architecture (b). MCP: multichip package; Mem.: memory controller.


Figure 2. Opteron 800 series processor architecture.

Coherent HyperTransport protocol

The Opteron processor northbridge supports a coherent shared-memory address space. AMD's development of a coherent HyperTransport was strongly influenced by prior experience with the Scalable Coherent Interface (SCI),4 the Compaq EV6,5 and various symmetric multiprocessor systems.6 A key lesson guiding the development of the coherent HyperTransport protocol was that the high-volume segment of the server market is two to four processors, and although supporting more than four processors is important, it is not a high-volume market segment. The SCI protocol supports a single shared-address space for an arbitrary number of nodes in a distributed shared-memory architecture. It does so through the creation and maintenance of lists of sharers for all cached lines in doubly linked queues, with mechanisms for sharer insertion and removal. The protocol is more complex than required for the volume server market, and its wide variance in memory latency would require a lot of application tuning for nonuniform memory access. On the other hand, bus-based systems with snoopy bus protocols can achieve only limited transfer rates. The coherent HyperTransport protocol was designed to support cache coherence in a distributed shared-memory system with an arbitrary number of nodes using a broadcast-based coherence protocol. This provides good scaling in the one-, two-, four-, and even eight-socket range, while avoiding the serialization overhead, storage overhead, and complexity of directory-based coherence schemes.6

In general, cacheable processor requests are unordered with respect to one another in the coherent HyperTransport fabric. Each processor core must maintain the program order of its own requests. The Opteron processor core implements processor consistency, in which loads and stores are always ordered behind earlier loads, stores are ordered behind earlier stores, but loads can be ordered ahead of earlier stores (all requests to different addresses). Opteron-based systems implement a total-store-order memory-ordering model.7 The cores and the coherent HyperTransport protocol support the MOESI states: modified, owned, exclusive, shared, and invalid, respectively.6
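The MOESI behavior can be illustrated with a few lines of code. This is a simplified, assumption-laden model of how a probed node might respond: a node holding the line in the modified or owned state supplies the data (as the Figure 5 discussion below notes), and the next state depends on whether the probing request was a load or a store. Only the state names come from the article; the transition details are a common textbook choice, not necessarily AMD's exact protocol.

```c
/* Hedged sketch of MOESI probe handling.  Only the five state names come
 * from the article; the transition choices are a common textbook variant,
 * not necessarily the exact coherent HyperTransport behavior. */
#include <stdbool.h>
#include <stdio.h>

enum moesi { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED };

struct probe_result {
    bool supplies_data;     /* read response with data vs. clean probe response */
    enum moesi next_state;
};

static struct probe_result probe_line(enum moesi state, bool remote_is_store)
{
    struct probe_result r = { false, state };

    /* A cache holding the line Modified or Owned has the freshest copy,
     * so it supplies the data instead of a plain probe response. */
    if (state == MODIFIED || state == OWNED)
        r.supplies_data = true;

    if (remote_is_store)
        r.next_state = INVALID;               /* a store needs exclusive ownership */
    else if (state == MODIFIED || state == OWNED)
        r.next_state = OWNED;                 /* dirty copy is now also shared */
    else if (state == EXCLUSIVE || state == SHARED)
        r.next_state = SHARED;
    /* INVALID stays INVALID */

    return r;
}

int main(void)
{
    struct probe_result r = probe_line(MODIFIED, false);
    printf("supplies data: %d, next state: %d\n", r.supplies_data, (int)r.next_state);
    return 0;
}
```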


Figure 3. Northbridge command flow and virtual channels. All buffers are 64-bit command/address. The memory access buffers (MABs) hold outstanding processor requests to memory; the memory address map (MAP) maps address windows to nodes; the graphics aperture resolution table (GART) maps memory requests from graphics controllers.

Figure 5 is a transaction flow diagram illustrating the operation of the coherent HyperTransport protocol. It shows the message flow resulting from a cache miss for a processor fetch, load, or store on node 3. Initially, a request buffer is allocated in the SRI of source node 3. The SRI looks up the system address map on node 3, using the physical address to determine that node 0 is the home node for this physical address. The SRI then looks up the crossbar routing table, using destination node 0 to determine which HyperTransport port to forward the read request (RD) to. Node 2 forwards RD to home node 0, where the request is delivered to the memory controller. The memory controller starts a DRAM access and broadcasts a probe (PR) to nodes 1 and 2. Node 1 forwards the probe to source node 3. The probe is delivered to the SRI on each of the four nodes. The SRI probes the processor cores on each node and combines the probe responses from each core into a single probe response (RP), which it returns to source node 3 (if the line is modified or owned, the SRI returns a read response to the source node instead of a probe response).

Once the source node has received all probe and read responses, it returns the fill data to the requesting core. The request buffer in the SRI of source node 3 is deallocated, and a source done message (SD) is sent to home node 0 to signal that all the transaction's side effects, such as invalidating all cached copies for a store, have completed and the data is now globally visible. The memory controller is then free to process a subsequent request to the same address. The memory latency of a request is the longer of two paths: the time it takes to access DRAM and the time it takes to probe all caches in the system.
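The following sketch replays the same four-node read transaction as a printed message trace: the read request travels to the home node, the home node starts the DRAM access and broadcasts probes, every node's SRI returns a probe response (or a read response if it owns the line), and the source finishes with a source done message. Node numbering follows Figure 5; the code models no real protocol state.

```c
/* Hedged sketch of the Figure 5 read flow as a printed trace.  Node numbers
 * follow the figure (source node 3, home node 0); no protocol state is kept. */
#include <stdio.h>

#define NODES 4

int main(void)
{
    const int src = 3, home = 0;

    printf("node %d: L2 miss, allocate SRI request buffer, send RD toward home %d\n",
           src, home);
    printf("node %d: RD delivered, start DRAM access, broadcast PR to all nodes\n",
           home);

    /* Each node's SRI probes its local cores and collapses their answers into
     * one probe response (RP); a node caching the line Modified or Owned would
     * send a read response with data instead. */
    for (int n = 0; n < NODES; n++)
        printf("node %d: probe local caches, return RP to source %d\n", n, src);

    printf("node %d: DRAM data ready, send read response to source %d\n", home, src);
    printf("node %d: all responses in, fill data to core, free request buffer, "
           "send SD to home %d\n", src, home);
    printf("node %d: SD received, memory controller may service the next request "
           "to this line\n", home);
    return 0;
}
```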

Figure 4. Northbridge data flow. All buffers are 64-byte cache lines.

Figure 5. Traffic for an Opteron processor read transaction.


The coherent HyperTransport protocol message chain is essentially three messages long: Request → Probe → Response. The protocol avoids deadlock by having a dedicated VC per message class (Request, Probe, and Response). Responses are always unconditionally accepted, ensuring forward progress for probes, and in turn ensuring forward progress for requests.8

One unexpected lesson that emerged during the bring-up and performance tuning of Opteron multiprocessors was that improper buffer allocation across VCs has a surprising negative impact on performance. Why? The Opteron northbridge has a flexible command buffer allocation scheme. The buffer pool can be allocated across the four VCs in a totally arbitrary way, with only the requirement that each VC have at least one buffer allocated. The optimum allocation turns out to be a function of the number of nodes in a system, the system topology, the coherence protocol, the relative mix of different transaction types, and the routing tables. As a rule, the number of buffers allocated to the different VCs should be in the same proportion as the traffic on these VCs. After exhaustive traffic analysis, factoring in the cache coherence protocol, the topology, and the routing tables, we determined optimum BIOS settings for four- and eight-node topologies.
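A minimal sketch of the allocation rule just stated: divide a fixed per-port command-buffer pool across the four VCs in proportion to their traffic, while guaranteeing each VC at least one buffer. The traffic fractions and pool size below are placeholders, not measured Opteron numbers or shipping BIOS settings.

```c
/* Hedged sketch of the allocation rule above: split a per-port buffer pool
 * across the four VCs in proportion to traffic, with at least one buffer per
 * VC.  The traffic mix and pool size are placeholders, not AMD BIOS values. */
#include <stdio.h>

#define NUM_VCS 4
static const char *vc_name[NUM_VCS] = { "Request", "Posted", "Probe", "Response" };

static void allocate_buffers(int pool, const double traffic[NUM_VCS],
                             int alloc[NUM_VCS])
{
    double total = 0.0;
    int used = 0, busiest = 0;

    for (int v = 0; v < NUM_VCS; v++)
        total += traffic[v];

    for (int v = 0; v < NUM_VCS; v++) {
        alloc[v] = (int)(pool * traffic[v] / total);
        if (alloc[v] < 1)
            alloc[v] = 1;               /* every VC needs at least one buffer */
        used += alloc[v];
        if (traffic[v] > traffic[busiest])
            busiest = v;
    }
    alloc[busiest] += pool - used;      /* hand leftovers to the busiest VC */
}

int main(void)
{
    /* Placeholder mix: probes dominate in a broadcast coherence protocol. */
    const double traffic[NUM_VCS] = { 0.20, 0.10, 0.45, 0.25 };
    int alloc[NUM_VCS];

    allocate_buffers(16, traffic, alloc);
    for (int v = 0; v < NUM_VCS; v++)
        printf("%-8s : %d buffers\n", vc_name[v], alloc[v]);
    return 0;
}
```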
Opteron-based system topologies

The Opteron processor was designed to have low memory latency for one-, two-, and four-node systems. For example, a four-node machine's worst-case latency (two hops) is lower than that typically achievable with an external memory controller. Even so, a processor's performance is a strong function of system topology, as Figure 6 shows. The figure illustrates the performance scaling achieved for five commercial workloads on five common Opteron system topologies. The figure shows the topologies with different node counts, along with their average network diameter and memory latency. For example, the four-node topology ("4-node square") is a 2 × 2 2D mesh with a network diameter of 2 hops, an average diameter of 1 hop, and an average memory latency of x + 44 ns, where x is the latency of a one-node system using a 2.8-GHz processor, 400-MHz DDR2 PC3200 memory, and a HyperTransport-based processor interface operating at 2 giga-transfers per second (GT/s). The system performance is normalized to that of one node × one core. We see positive scaling from one to eight nodes, but the normalized processor performance decreases with increasing average diameter. The difference in normalized processor performance among this set of workloads is mainly due to differences in L2 cache miss rates. SPECjbb2000 has the lowest miss rate and the best performance scaling, whereas OLTP1 has the highest L2 miss rate and the worst performance scaling. It is worth noting that processor performance is a strong function of average diameter. For example, the processor performance in the eight-node twisted ladder, with a 1.5-hop average diameter, is about 10 percent higher than in the eight-node ladder (a 2 × 4 2D mesh), with a 1.8-hop average diameter. This observation strongly influenced our decision to consider fully connected 4- and 8-node topologies in AMD's next-generation processor architecture.

The most direct way to reduce memory latency and increase coherent memory bandwidth is to use better topologies and faster links. The argument for fully connected topologies is simple: the shortest distance between two points is a direct path, and fully connected topologies provide a direct path between all possible sources and destinations.

Future generations of Opteron processors will have a new socket infrastructure that will support HyperTransport 3.0 with data rates of up to 6.4 GT/s. We will enable fully connected four-socket systems by adding a fourth HyperTransport port, as Figure 7 shows. We will enable fully connected eight-socket systems by supporting a feature called HyperTransport link unganging, as shown in Figure 8.

Figure 6. Performance versus memory latency in five Opteron topologies (systems use a single 2.8-GHz core, a 400-MHz DDR2 PC3200, a 2-GT/s HyperTransport, and a 1-Mbyte L2 cache).

A HyperTransport link is typically 16 bits wide (denoted ×16) in each direction. These same HyperTransport pins can also be configured at boot time to operate as two logically independent links, each 8 bits wide (denoted ×8). Thus, the processor interface can be configured to provide a mix of ×16 and ×8 HyperTransport ports, each of which can be configured to be either coherent or noncoherent. Link unganging provides system builders with a high degree of flexibility by expanding the number of logical HyperTransport ports from 4 to 8.
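The boot-time choice described above can be pictured as a small configuration table: each physical ×16 link is either left ganged or split into two ×8 sublinks, and every resulting logical port is marked coherent or noncoherent. The structures, field names, and the example mix below are purely illustrative; real link configuration is performed through HyperTransport configuration registers.

```c
/* Hedged sketch of link unganging: four physical x16 links become up to
 * eight logical ports, each coherent or noncoherent.  The structures and the
 * example mix are illustrative; real configuration uses HyperTransport
 * configuration registers at boot. */
#include <stdbool.h>
#include <stdio.h>

struct phys_link {
    bool unganged;        /* false: one x16 link; true: two x8 sublinks */
    bool coherent[2];     /* second entry ignored when ganged */
};

int main(void)
{
    /* Example: three links split for coherent fabric connectivity, one x8
     * half used for noncoherent I/O, and one link kept ganged for I/O. */
    const struct phys_link links[4] = {
        { true,  { true,  true  } },
        { true,  { true,  true  } },
        { true,  { true,  false } },
        { false, { false, false } },
    };

    int logical = 0;
    for (int i = 0; i < 4; i++) {
        int sublinks = links[i].unganged ? 2 : 1;
        for (int s = 0; s < sublinks; s++, logical++)
            printf("link %d.%d: x%d, %s\n", i, s,
                   links[i].unganged ? 8 : 16,
                   links[i].coherent[s] ? "coherent" : "noncoherent");
    }
    printf("logical HyperTransport ports: %d (4 to 8 possible)\n", logical);
    return 0;
}
```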


Figure 7. Four-socket, 16-way topologies: The 4-node square topology (a) has a network diameter of 2 hops, an average diameter of 1.0 hop, and an Xfire bandwidth of 14.9 Gbytes/s using 2.0-GT/s HyperTransport. The 4-node fully connected topology (b), with two extra links and a fourth HyperTransport port, yields a network diameter of 1 hop, an average diameter of 0.75 hop, and an Xfire bandwidth of 29.9 Gbytes/s. Using HyperTransport 3.0 at 4.4 GT/s, that topology achieves an Xfire bandwidth of 65.8 Gbytes/s.

Fully connected topologies provide several benefits: network diameter (memory latency) is reduced to a minimum, links are more evenly utilized, packets traverse fewer links, and there are more links. Reduced link utilization lowers queuing delay, in turn reducing latency under load. Two simple metrics for memory latency and coherent memory bandwidth demonstrate the performance benefit of fully connected multiprocessor topologies:

• Average diameter—average number of hops between any two nodes in the network (network diameter is the maximum number of hops between any pair of nodes in the network).
• Xfire memory bandwidth—link-limited, all-to-all communication bandwidth (data only). All processors read data from all nodes in an interleaved manner.

We can statically compute these two metrics for any topology, given routing tables, message visit counts, and packet sizes.
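As a hedged illustration of the first metric, the sketch below derives network diameter and average diameter from a topology's adjacency matrix using all-pairs shortest hops, counting a node's access to its own memory as zero hops; with that convention it reproduces the 1.0-hop and 0.75-hop averages quoted in Figure 7. Computing Xfire bandwidth would additionally need the routing tables, message visit counts, packet sizes, and link rates, which the sketch omits.

```c
/* Hedged sketch: derive network diameter and average diameter from an
 * adjacency matrix with Floyd-Warshall all-pairs hop counts.  A node's access
 * to its own memory counts as 0 hops, which reproduces the 1.0-hop and
 * 0.75-hop averages quoted in Figure 7 for the two four-node topologies. */
#include <stdio.h>

#define N 4
#define INF 99

static void metrics(const int adj[N][N], const char *name)
{
    int d[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = (i == j) ? 0 : (adj[i][j] ? 1 : INF);

    for (int k = 0; k < N; k++)             /* Floyd-Warshall */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];

    int diameter = 0, total = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            total += d[i][j];
            if (d[i][j] > diameter)
                diameter = d[i][j];
        }
    printf("%-22s diameter = %d hops, average diameter = %.2f hops\n",
           name, diameter, (double)total / (N * N));
}

int main(void)
{
    const int square[N][N] = {   /* 2 x 2 mesh (ring 0-1-2-3-0) */
        {0,1,0,1}, {1,0,1,0}, {0,1,0,1}, {1,0,1,0}
    };
    const int full[N][N] = {     /* fully connected, with a fourth HT port */
        {0,1,1,1}, {1,0,1,1}, {1,1,0,1}, {1,1,1,0}
    };
    metrics(square, "4-node square");
    metrics(full, "4-node fully connected");
    return 0;
}
```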

Figure 8. Eight-socket, 32-way topologies. The 8-node twisted-ladder topology (a) has a network diameter of 3 hops, an average diameter of 1.62 hops, and an Xfire bandwidth of 15.2 Gbytes/s using HyperTransport 1 at 2.0 GT/s. The 8-node 2 × 4 topology (b) has a diameter of 2 hops, an average diameter of 1.12 hops, and an Xfire bandwidth of 72.2 Gbytes/s using HyperTransport 3.0 at 4.4 GT/s. The 8-node fully connected topology (c) has a diameter of 1 hop, an average diameter of 0.88 hop, and an Xfire bandwidth of 94.4 Gbytes/s using HyperTransport 3.0 at 4.4 GT/s.
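As a quick, hedged consistency check of the caption numbers, the snippet below applies the linear scaling of Xfire bandwidth with link transfer rate (4.4 GT/s versus 2.0 GT/s) to the figures quoted in Figures 7 and 8 and compares the resulting overall gains with the factors discussed in the text.

```c
/* Hedged arithmetic check of the caption figures: Xfire bandwidth scales
 * linearly with link rate (4.4 GT/s vs. 2.0 GT/s); inputs are the numbers
 * quoted in Figures 7 and 8. */
#include <stdio.h>

int main(void)
{
    const double rate_scale = 4.4 / 2.0;          /* HyperTransport 3.0 vs. 1 */

    /* Four sockets: square at 2.0 GT/s vs. fully connected. */
    const double square4 = 14.9, full4 = 29.9;
    printf("4-node fully connected at 4.4 GT/s ~ %.1f Gbytes/s (caption: 65.8)\n",
           full4 * rate_scale);
    printf("overall 4-node gain ~ %.1fx (text: roughly a factor of 4)\n",
           full4 * rate_scale / square4);

    /* Eight sockets: twisted ladder at 2.0 GT/s vs. fully connected at 4.4 GT/s. */
    const double ladder8 = 15.2, full8 = 94.4;
    printf("overall 8-node gain ~ %.1fx (text: roughly a factor of 6)\n",
           full8 / ladder8);
    return 0;
}
```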


In the four-node system in Figure 7, adding one extra HyperTransport port doubles the Xfire bandwidth. In addition, this number scales linearly with link frequency. Thus, with HyperTransport 3.0 running at 4.4 GT/s, the Xfire bandwidth increases by a factor of 4 overall. In addition, the average diameter (memory latency) decreases from 1 hop to 0.75 hop.

The benefit of fully connected topologies is even more dramatic for the eight-node topology in Figure 8. The Xfire bandwidth increases by a factor of 6 overall. In addition, the average diameter decreases significantly from 1.6 hops to 0.875 hop. Furthermore, this access pattern, which is typical of many multithreaded commercial workloads, evenly utilizes the links.

Next-generation processor architecture

AMD's next-generation processor architecture will be a native quad-core upgrade that is socket- and thermal-compatible with the Opteron processor 800 series. It will contain about 450 million transistors and will be manufactured in a 65-nm CMOS silicon-on-insulator process. At some point, AMD will introduce a four-HyperTransport-port version in a 1,207-contact organic package paired with a surface-mount LGA socket with a 1.1-mm pitch and a 40 × 40-mm body.

Core enhancements include out-of-order load execution, in which a load can pass other loads and stores that are known not to alias with the load. This mitigates L2 and L3 cache latency. The translation look-aside buffer adds support for 1-Gbyte pages and a 48-bit physical address. The TLB's size increases to 512 4-Kbyte page entries plus 128 2-Mbyte page entries for better support of virtualized workloads, large-footprint databases, and transaction processing.

The design provides a second independent DRAM controller to provide more concurrency, additional open DRAM banks to reduce page conflicts, and a longer burst length to improve efficiency. DRAM paging support in the controller uses history-based pattern prediction to increase the frequency of page hits and decrease page conflicts. The DRAM prefetcher tracks positive, negative, and non-unit strides and has a dedicated buffer for prefetched data. Write bursting minimizes read and write turnaround time.

The design has a three-level cache hierarchy, as shown in Figure 9. Each core has separate L1 data and instruction caches of 64 Kbytes each. These caches are two-way set-associative, linearly indexed, and physically tagged, with a 64-byte cache line. The L1 has the lowest latency and supports two 128-bit loads per cycle. Locality tends to keep the most critical data in the L1 cache. Each core also has a dedicated 512-Kbyte L2 cache, sized to accommodate most workloads. This cache is dedicated to eliminating conflicts common in shared caches and is better than shared caches for virtualization. All cores share a common L3 victim cache that resides logically in the northbridge SRI unit. Cache lines are installed in the L3 when they are cast out from the L2 in the processor core. The L3 cache is noninclusive, allowing a line to be present in an upper-level L1 or L2 cache and not be present in the L3. This increases the maximum number of unique cache lines that can be cached on a node to the sum of the individual L3, L2, and L1 cache capacities (in contrast, the maximum number of distinct cache lines that can be cached with an inclusive L3 is simply the L3 capacity). The L3 cache has a sharing-aware replacement policy to optimize the movement, placement, and replication of data for multiple cores.
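To make the noninclusive-versus-inclusive comparison above concrete, here is a small worked calculation of the maximum unique cacheable data per node. The per-core L1 and L2 sizes come from the text; the 2-Mbyte shared L3 size is an assumed example value, since the article does not state one.

```c
/* Hedged worked example of unique cacheable capacity per node with a
 * noninclusive vs. an inclusive L3.  L1 and L2 sizes come from the text;
 * the 2-Mbyte shared L3 is an assumed example value. */
#include <stdio.h>

int main(void)
{
    const int cores      = 4;          /* native quad-core design */
    const int l1_kbytes  = 64 + 64;    /* separate 64-Kbyte instruction and data caches */
    const int l2_kbytes  = 512;        /* dedicated per-core L2 */
    const int l3_kbytes  = 2048;       /* assumed shared L3 size */
    const int line_bytes = 64;

    int noninclusive = cores * (l1_kbytes + l2_kbytes) + l3_kbytes;
    int inclusive    = l3_kbytes;      /* every cached line must also be in the L3 */

    printf("noninclusive L3: up to %d Kbytes unique (%d cache lines)\n",
           noninclusive, noninclusive * 1024 / line_bytes);
    printf("inclusive L3   : up to %d Kbytes unique (%d cache lines)\n",
           inclusive, inclusive * 1024 / line_bytes);
    return 0;
}
```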
As Figure 10 shows, the next-generation design has seven clock domains (phase-locked loops) and two separate power planes for the northbridge and the core. Separate CPU core and northbridge power planes allow processors to reduce core voltage for power savings while the northbridge continues to run, thereby retaining system bandwidth and latency characteristics. For example, core 0 could be running at normal operating frequency, core 1 could be running at a lower frequency, and cores 2 and 3 could be halted and placed in a low-power state. It is also possible to apply higher voltage to the northbridge to raise its frequency for a performance boost in power-constrained platforms.

Fine-grained power management (enhanced AMD PowerNow! technology) provides the capability of dynamically and individually adjusting core frequencies to improve power efficiency.


In summary, the AMD Opteron processor integrates multiple x86-64 cores with an on-chip router, memory controller, and HyperTransport-based processor interface. The benefits of this system integration include lower latency, cost, and power use. AMD's next-generation processor extends the Opteron 800 series architecture by adding more cores with significant instructions per cycle (IPC) enhancements, an L3 cache, and fine-grained power management to create server platforms with improved memory latency, higher coherent memory bandwidth, and higher performance per watt. MICRO

Figure 9. AMD next-generation processor's three-level cache hierarchy.

Acknowledgments
We thank all the members of the AMD Opteron processor northbridge team, including Nathan Kalyanasundharam, Gregg Donley, Jeff Dwork, Bob Aglietti, Mike Fertig, Cissy Yuan, Chen-Ping Yang, Ben Tsien, Kevin Lepak, Ben Sander, Phil Madrid, Tahsin Askar, and Wade Williams.

Figure 10. Northbridge power planes and clock domains in AMD next-generation processor. VRM: voltage regulator module; SVI: serial voltage interface; VHT: HyperTransport termination voltage; VDDIO: I/O supply; VDD: core supply; VTT: DDR termination voltage; VDDNB: northbridge supply; VDDA: auxiliary supply; PLL: clock-domain phase-locked loop.


References
1. C.N. Keltcher et al., "The AMD Opteron Processor for Multiprocessor Servers," IEEE Micro, vol. 23, no. 2, Mar./Apr. 2003, pp. 66-76.
2. AMD x86-64 architecture manuals, http://www.amd.com.
3. HyperTransport I/O Link Specification, http://www.hypertransport.org/.
4. ISO/ANSI/IEEE Std. 1596-1992, Scalable Coherent Interface (SCI), 1992.
5. R.E. Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, vol. 19, no. 2, Mar./Apr. 1999, pp. 24-36.
6. D.E. Culler, J.P. Singh, and A. Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999.
7. S.V. Adve and K. Gharachorloo, "Shared Memory Consistency Models: A Tutorial," Computer, vol. 29, no. 12, Dec. 1996, pp. 66-76.
8. W.J. Dally and B.P. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004.

Pat Conway is a principal member of technical staff at AMD, where he is responsible for developing scalable, high-performance server architectures. His work experience includes the design and development of server hardware, cache coherence, and message-passing protocols. He has an M.Eng.Sc. from University College Cork, Ireland, and an MBA from Golden Gate University. He is a member of the IEEE.

Bill Hughes is a senior fellow at AMD. He was one of the initial Opteron architects, working on HyperTransport and the on-chip memory controller, and also worked on load-store and data cache units. He currently leads the Northbridge and HyperTransport microarchitecture and RTL team. He has a BS from Manchester University, England, and a PhD from Leeds University, England.

Direct questions and comments about this article to Pat Conway, Advanced Micro Devices, 1 AMD Place, Sunnyvale, CA 94085; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
