The AMD Opteron™ CMP NorthBridge Architecture: Now and in the Future
Pat Conway & Bill Hughes
August 2006

AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor

Integration:
  Two 64-bit CPU cores
  2MB L2 cache
  On-chip Router & Memory Controller

Bandwidth:
  Dual channel DDR (128-bit) memory bus
  3 HyperTransport™ (HT) links (16-bit each x 2 GT/sec x 2)

Usability and Scalability:
  Socket compatible: Platform and TDP!
  Glueless SMP up to 8 sockets
  Memory capacity & BW scale w/ CPUs

Power Efficiency:
  AMD PowerNow!™ Technology with optimized power management
  Industry-leading system level power efficiency
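The HT link figures quoted above (16-bit, 2 GT/sec, x 2 for the two directions) can be turned into bytes per second with a few lines of arithmetic. This is a minimal sketch; the function name and its parameters are illustrative and do not come from the slides.

```python
def ht_link_bandwidth_gbs(width_bits=16, transfers_per_sec=2e9, directions=2):
    """Peak bandwidth of one HyperTransport link in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_per_sec * directions / 1e9

one_way = ht_link_bandwidth_gbs(directions=1)   # one direction of one link
per_link = ht_link_bandwidth_gbs()              # both directions of one link
all_links = 3 * per_link                        # the three links on the chip

print(one_way, per_link, all_links)
```

Under these assumptions one direction works out to 4 GB/s and one link to 8 GB/s, which is consistent with the "8 GB/S" and "4.0GB/s per direction" annotations on the later diagrams.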
21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future
A Clean Break with the Past
[Figure: two block diagrams, one for the legacy front-side bus platform and one for the Direct Connect platform. Recoverable labels: Chip, MCP, XMB, SRQ, Crossbar, Mem. Ctrlr, HT, Memory, PCIe™ Bridge, I/O Hub, USB, PCI, and 8 GB/s link annotations.]
Legacy x86 Architecture
  20-year-old traditional front-side bus (FSB) architecture
  CPUs, Memory, I/O all share a bus
  Major bottleneck to performance
  Faster CPUs or more cores ≠ performance

AMD64’s Direct Connect Architecture
  Industry-standard technology
  Direct Connect Architecture reduces FSB bottlenecks
  HyperTransport™ interconnect offers scalable high bandwidth and low latency
  4 memory controllers – increases memory capacity and bandwidth
4P System: Board Layout
System Overview
[Figure: four dual-core nodes. Each node contains Core 0, Core 1, SRI, XBAR, MCT, HT links and a host bridge (HT-HB), with local DRAM on a 128-bit interface at 12.8 GB/s. The SRI, XBAR, MCT and HT blocks are labelled “NorthBridge”. Nodes are joined by cHT links at 4.0GB/s per direction @ 2GT/s Data Rate; ncHT links attach I/O.]
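The 12.8 GB/s DRAM figure and the 128-bit interface width together imply a transfer rate, which the short sketch below solves for. The helper name is illustrative, and the resulting rate is derived from the two slide figures rather than stated on the slide.

```python
def implied_transfer_rate_mts(peak_gbs, width_bits):
    """Transfers per second (in MT/s) needed to reach peak_gbs on a bus of width_bits."""
    bytes_per_transfer = width_bits / 8
    return peak_gbs * 1e9 / bytes_per_transfer / 1e6

dram_rate = implied_transfer_rate_mts(peak_gbs=12.8, width_bits=128)
print(round(dram_rate))  # MT/s across the 128-bit interface
```

The result is about 800 MT/s, which is what a dual-channel DDR2 interface clocked at 400 MHz would deliver, if that is the configuration the diagram assumes.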
Northbridge Microarchitecture Overview
[Figure: Core0 and Core1, each with an L2, connect through the System Request Interface (SRI) to the Crossbar (XBAR). Other labelled blocks: APIC (interrupts), Memory Controller (MCT), DRAM Controller with 2 DDR2 channels, and the HT0, HT1 and HT2 links.]

Virtual Channel   Use                                            Data
Request           Read, Non-posted Write, Cache Block Commands   Y
Posted Request    Posted Writes                                  Y
Probe             Broadcast probes                               N
Response          Read Response, Probe response, Completion      Y
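The virtual channel table can be captured as a small lookup structure. The sketch below is only an illustration of the table: the class, the dictionary and the helper are hypothetical names, and reading the "Data" column as "may carry a data payload" is an assumption.

```python
from enum import Enum

class VirtualChannel(Enum):
    # (label as on the slide, "Data" column)
    REQUEST = ("Request", True)
    POSTED_REQUEST = ("Posted Request", True)
    PROBE = ("Probe", False)
    RESPONSE = ("Response", True)

    def __init__(self, label, carries_data):
        self.label = label
        self.carries_data = carries_data

# "Use" column of the table, keyed by command type.
COMMAND_CHANNEL = {
    "Read": VirtualChannel.REQUEST,
    "Non-posted Write": VirtualChannel.REQUEST,
    "Cache Block Commands": VirtualChannel.REQUEST,
    "Posted Writes": VirtualChannel.POSTED_REQUEST,
    "Broadcast probes": VirtualChannel.PROBE,
    "Read Response": VirtualChannel.RESPONSE,
    "Probe response": VirtualChannel.RESPONSE,
    "Completion": VirtualChannel.RESPONSE,
}

def channel_for(command):
    """Return the virtual channel a command type travels on."""
    return COMMAND_CHANNEL[command]

data_channels = [vc.label for vc in VirtualChannel if vc.carries_data]
print(channel_for("Broadcast probes").label, data_channels)
```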
Northbridge Command Flow

All buffers are 64-bit command/address.

Sources: Core 0, Core 1, HT0 Input, HT1 Input, HT2 Input

Core 0 buffers:
  Victim Buffer (8-entry)
  Write Buffer (4-entry)
  Instruction MAB (2-entry)
  Data MAB (8-entry)

Queues:
  System Request Queue (24-entry), to Core
  Address MAP & GART
  Memory Command Queue (20-entry), to DCT

Five Routers feed the XBAR, with buffers of 10, 16, 16, 16 and 12 entries.

XBAR outputs: HT0 Output, HT1 Output, HT2 Output
Northbridge Data Flow

All buffers are 64-byte cache lines.

Sources: Core 0, Core 1, Host Bridge, HT0 input, HT1 input, HT2 input, DCT

Core 0 buffers:
  Victim Buffer (8-entry)
  Write Buffer (4-entry)

Five buffers feed the XBAR, with 8, 8, 8, 5 and 8 entries.

XBAR outputs:
  HT0 output, HT1 output, HT2 output
  System Request Data Queue (12-entry), to Core
  to Host Bridge
  Memory Data Queue (8-entry), to DCT
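The entry counts on the command-flow and data-flow slides translate into storage once the entry size is applied: 64 bits per command/address entry and 64 bytes per data entry. The sketch below does that sum for the five buffers in front of the XBAR on each slide. The constant and function names are illustrative, and treating the five listed buffers as the complete set is an assumption.

```python
COMMAND_ENTRY_BYTES = 8    # 64-bit command/address entries
DATA_ENTRY_BYTES = 64      # 64-byte cache-line entries

# Entry counts as listed on the command-flow and data-flow slides.
COMMAND_BUFFER_ENTRIES = [10, 16, 16, 16, 12]
DATA_BUFFER_ENTRIES = [8, 8, 8, 5, 8]

def storage_bytes(entries, entry_bytes):
    """Total storage for a set of buffers with the given entry counts."""
    return sum(entries) * entry_bytes

command_bytes = storage_bytes(COMMAND_BUFFER_ENTRIES, COMMAND_ENTRY_BYTES)
data_bytes = storage_bytes(DATA_BUFFER_ENTRIES, DATA_ENTRY_BYTES)

print(command_bytes, data_bytes)
```

With these counts the data buffers hold roughly four times as many bytes as the command buffers, even though they have about half as many entries.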
Lessons Learned #1

Allocation of XBAR Command buffer across Virtual Channels can have a big impact on performance.
MP traffic analysis gives the best allocation.

e.g. Opteron Read Transaction:
  Request (2 visits)
  Probe (3 visits)
  Response (8 visits)

[Figure: four-node system (P0 to P3, each with an L2 and a local memory, Memory 0 to Memory 3) tracing one read transaction in numbered steps 1 to 8, with messages labelled RD, PI0 to PI2, RP0 to RP2, D and SD.]
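One way to act on the visit counts above is to split a command buffer across virtual channels in proportion to how often each channel is visited. The sketch below is a hypothetical heuristic, not the allocation method used in the Opteron northbridge: it reserves one entry per channel so that no channel is starved, then shares the rest by visit count using largest-remainder rounding. The function name and the 16-entry example size are assumptions.

```python
def allocate_buffers(total_entries, visits):
    """Split total_entries across channels in proportion to visit counts.

    Every channel gets at least one entry; the remaining entries are shared
    proportionally, with leftovers going to the largest fractional shares.
    """
    channels = list(visits)
    total_visits = sum(visits.values())
    if total_entries < len(channels):
        raise ValueError("need at least one entry per channel")
    if total_visits <= 0:
        raise ValueError("visit counts must sum to a positive number")

    spare = total_entries - len(channels)
    shares = {ch: spare * visits[ch] / total_visits for ch in channels}
    alloc = {ch: 1 + int(shares[ch]) for ch in channels}

    # Hand out entries lost to rounding, largest fractional part first.
    leftover = total_entries - sum(alloc.values())
    by_remainder = sorted(channels, key=lambda ch: shares[ch] - int(shares[ch]),
                          reverse=True)
    for ch in by_remainder[:leftover]:
        alloc[ch] += 1
    return alloc

# Visit counts for one read transaction, taken from the slide.
read_visits = {"Request": 2, "Probe": 3, "Response": 8}
allocation = allocate_buffers(16, read_visits)
print(allocation)
```

A real allocation would weight many transaction types by their frequency in the measured MP traffic, which is the analysis the slide refers to.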
Lessons Learned #2

Memory Latency is the Key to Application Performance!

[Figure: Performance vs Average Memory Latency (single 2.8GHz core, 400MHz DDR2 PC3200, 2GT/s HT with 1MB cache in MP system). Axes: System Performance (0 to 7) and Processor Performance (0% to 120%), plotted for 1N, 2N, 4N (SQ), 8N (TL) and 8N (L). Series: OLTP1, OLTP2, SW99, SSL, JBB.]

Topology            AvgD        Latency
1 Node              0 hops      x + 0ns
2 Node              0.5 hops    x + 17ns (47 cpuclk)
4 Node Square       1 hops      x + 44ns (124 cpuclk)
8N Twisted Ladder   1.5 hops    x + 76ns (214 cpuclk)
8N Ladder           1.8 hops    x + 105ns (234 cpuclk)

[Figure: node diagrams for each topology, showing processors P0 to P7 and their I/O attachments.]
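The average hop counts quoted for each topology can be reproduced with a breadth-first search, under the assumption that the figure averages the distance over every (requesting node, home node) pair, including the local case of zero hops. The sketch below builds ladder-shaped topologies and computes that average. The function names and node numbering are illustrative, and the twisted ladder is left out because its wiring cannot be recovered from the slide text.

```python
from collections import deque

def average_hops(adjacency):
    """Mean shortest-path length over all ordered node pairs, self pairs included."""
    nodes = list(adjacency)
    total = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # plain breadth-first search
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist[n] for n in nodes)
    return total / (len(nodes) ** 2)

def ladder(rungs):
    """Ladder of 2 * rungs nodes: even nodes on one rail, odd nodes on the other."""
    adj = {i: set() for i in range(2 * rungs)}
    for r in range(rungs):
        top, bottom = 2 * r, 2 * r + 1
        adj[top].add(bottom)              # the rung
        adj[bottom].add(top)
        if r + 1 < rungs:                 # rail links to the next rung
            for a in (top, bottom):
                adj[a].add(a + 2)
                adj[a + 2].add(a)
    return adj

one_node = average_hops({0: set()})
two_node = average_hops(ladder(1))
four_node_square = average_hops(ladder(2))
eight_node_ladder = average_hops(ladder(4))
print(one_node, two_node, four_node_square, eight_node_ladder)
```

Under that assumption the results are 0, 0.5, 1 and 1.75 hops, the last of which the slide appears to round to 1.8.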
Looking Forward

HyperTransport™-based Accelerators
Imagine it, Build it