AMD’s Direct Connect Architecture
Pat Conway, Principal Member of Technical Staff

This poster illustrates the variety of Opteron system topologies built to date using HyperTransport and AMD's Direct Connect Architecture for glueless multiprocessing, and summarizes the latency/bandwidth characteristics of these systems. Building and using Opteron systems has provided useful, real-world lessons which expose the challenges to be addressed when designing future system interconnect, memory hierarchy and I/O to scale up both the number of cores and sockets in future CMP architectures.

4 Node MP Architecture

[Figure: four-node MP system. Each Opteron CMP node pairs Core 0 and Core 1 with a System Request Interface (SRI), crossbar (XBAR) and memory controller (MCT) driving a 128-bit DRAM interface at 12.8 GB/s. Coherent HyperTransport (cHT) links between nodes run at a 2 GT/s data rate, 4.0 GB/s per direction; non-coherent HyperTransport (ncHT) attaches I/O through a host bridge (HT-HB).]

Fully Connected 4 Node and 8 Node Topologies are becoming Feasible

[Figure: fully connected 4 node topology (four x16 HT links per node) and fully connected 8 node topology (eight x8 HT links per node).]

• Additional HyperTransport ports make glueless, fully connected 4 node (four x16 HT) and 8 node (eight x8 HT) topologies feasible
  – HyperTransport3 raises the link data rate to as much as 5.2 GT/s
• Reduced network diameter
  – Fewer hops to memory
• Increased coherent bandwidth
  – More links
  – cHT packets visit fewer links
• Benefits (see the sketch below)
  – Low latency because of lower diameter
  – Evenly balanced utilization of HyperTransport links
  – Low queuing delays
  – Low latency under load
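The Diam/Avg Diam figures quoted in the performance tables below follow directly from the topology graphs. A minimal sketch (the code and node numbering are mine, not the poster's) computes both by breadth-first search; note that the poster's average diameter averages over all ordered node pairs, self-pairs included.

from collections import deque

def bfs_hops(adj, src):
    # Hop count from src to every node, by breadth-first search.
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameters(adj):
    # Returns (diameter, average diameter) over all ordered node pairs,
    # self-pairs included, matching the poster's "Avg Diam" convention.
    d = [h for s in adj for h in bfs_hops(adj, s).values()]
    return max(d), sum(d) / len(d)

sq_4n = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}            # 4 node square
fc_4n = {i: [j for j in range(4) if j != i] for i in range(4)}  # 4 node fully connected
print(diameters(sq_4n))  # (2, 1.0)  -> Diam 2, Avg Diam 1.00
print(diameters(fc_4n))  # (1, 0.75) -> Diam 1, Avg Diam 0.75

The fully connected 8 node graph gives (1, 0.875) the same way, matching the Diam 1 / Avg Diam 0.88 entry in the 8 node table.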


Lessons Learned #1 - Memory Latency is the Key to Application Performance!

[Figure: Performance vs. Average Memory Latency (single 2.8 GHz core, 400 MHz DDR2 PC3200, 2 GT/s HyperTransport, 1 MB cache, in an MP system). Relative performance of the OLTP1, OLTP2, SW99, SSL, JBB and FLOPs workloads falls as the topology grows from 1N to 2N, 4N (SQ), 8N (TL) and 8N (L), tracking the rise in average memory latency.]

Topology                 AvgD       Avg memory latency (x = local latency)
1 Node                   0 hops     x + 0 ns
2 Node                   0.5 hops   x + 17 ns (47 cpuclk)
4 Node Square            1 hop      x + 44 ns (124 cpuclk)
8 Node Twisted Ladder    1.5 hops   x + 76 ns (214 cpuclk)
8 Node Ladder            1.8 hops   x + 105 ns (234 cpuclk)
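The table quotes each latency in both nanoseconds and core clocks; a one-line conversion at the quoted 2.8 GHz clock (my arithmetic, not part of the poster) ties the two columns together.

F_GHZ = 2.8  # core clock quoted in the figure caption
for cpuclk in (47, 124, 214, 234):
    print(f"{cpuclk} cpuclk = {cpuclk / F_GHZ:.1f} ns")
# -> 16.8 ns, 44.3 ns, 76.4 ns, 83.6 ns

The first three agree with the 17 ns, 44 ns and 76 ns entries; 234 cpuclk works out to about 84 ns rather than 105 ns, so one of those two figures appears to have been transcribed inconsistently.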

Fully Connected 4 Node Performance

[Figure: 4 Node Square (P0-P3 with attached I/O, 8 GB/s per link) vs. 4 Node Fully Connected, which adds 2 extra links with HyperTransport3.]

                                  Diam   Avg Diam   XFIRE BW
4N SQ (2 GT/s HyperTransport)     2      1.00       14.9 GB/s
4N FC (2 GT/s HyperTransport)     1      0.75       29.9 GB/s (2X)
4N FC (4.4 GT/s HyperTransport3)  1      0.75       65.8 GB/s (4X)

XFIRE ("crossfire") BW is the link-limited all-to-all communication bandwidth (data only).
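One way to make the XFIRE definition concrete is the following sketch (my formulation, not AMD's methodology): every ordered node pair streams data, split evenly across all shortest paths, and the busiest link direction caps the common per-pair rate. LINK_GBS, the effective data rate per link direction, is an assumption back-solved from the 29.9 GB/s entry; it is roughly 62% of the 4.0 GB/s raw rate at 2 GT/s, the remainder going to cHT command/probe overhead.

from collections import deque
from itertools import permutations

LINK_GBS = 2.49  # assumed effective data GB/s per link direction at 2 GT/s

def shortest_paths(adj, s, t):
    # BFS distances, then walk predecessors back from t to enumerate
    # every shortest path as a list of directed links.
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    def back(v):
        if v == s:
            return [[]]
        return [p + [(u, v)] for u in adj[v]
                if dist.get(u) == dist[v] - 1 for p in back(u)]
    return back(t)

def xfire_bw(adj):
    load = {}  # directed link -> pair-flows routed across it
    for s, t in permutations(adj, 2):
        paths = shortest_paths(adj, s, t)
        for p in paths:
            for link in p:
                load[link] = load.get(link, 0.0) + 1.0 / len(paths)
    pairs = len(adj) * (len(adj) - 1)
    return pairs * LINK_GBS / max(load.values())

sq_4n = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
fc_4n = {i: [j for j in range(4) if j != i] for i in range(4)}
print(f"{xfire_bw(sq_4n):.1f} GB/s")  # 14.9 GB/s (4N SQ)
print(f"{xfire_bw(fc_4n):.1f} GB/s")  # 29.9 GB/s (4N FC)

Scaling LINK_GBS by 4.4/2.0 for HyperTransport3 reproduces the 65.8 GB/s entry (29.9 x 2.2).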



Fully Connected 8 Node Performance

[Figure: 8 Node Twisted Ladder, 8 Node 6HT 2x4, and 8 Node Fully Connected topologies (processors P0-P7 with attached I/O).]

                                   Diam   Avg Diam   XFIRE BW
8N TL (2 GT/s HyperTransport)      3      1.62       15.2 GB/s
8N 2x4 (4.4 GT/s HyperTransport3)  2      1.12       72.2 GB/s (5X)
8N FC (4.4 GT/s HyperTransport3)   1      0.88       94.4 GB/s (6X)
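The (5X) and (6X) labels are relative to the 8N twisted ladder baseline; the rounding checks out (trivial arithmetic, mine):

base = 15.2                             # 8N TL at 2 GT/s HyperTransport
for name, bw in (("8N 2x4", 72.2), ("8N FC", 94.4)):
    print(name, f"{bw / base:.1f}x")    # 8N 2x4 4.8x (~5X), 8N FC 6.2x (~6X)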


HyperTransport-based Accelerators - Imagine it, Build it

[Figure: a native dual-core Opteron (SRQ, crossbar, memory controller, HyperTransport ports) linked over 8 GB/s HyperTransport to an SOA accelerator, an XML accelerator, a FLOPs-optimized accelerator, a PCI-E bridge and other silicon, plus an I/O hub carrying USB and PCI.]

• Open platform for system builders
  – 3rd party accelerators: Media, FLOPs, XML, SOA
• Attach via AMD Opteron™ socket or HTX slot
• The HyperTransport interface is an open standard (see www.hypertransport.org)
• The Coherent HyperTransport interface is available (under license) if the accelerator caches system memory
• Homogeneous/heterogeneous computing
• Architectural enhancements planned to support fine grain synchronization

2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems, 6-7 December 2006, Stanford, California