The AMD Opteron™ CMP Northbridge Architecture: Now and in the Future
Pat Conway & Bill Hughes, August 2006

AMD Opteron™ – The Industry's First Native Dual-Core 64-bit x86 Processor
Integration:
• Two 64-bit CPU cores
• 2MB L2 cache
• On-chip router and memory controller
Bandwidth:
• Dual-channel DDR (128-bit) memory bus
• 3 HyperTransport™ (HT) links (16-bit each x 2 GT/s x 2)
Usability and Scalability:
• Socket compatible: platform and TDP
• Glueless SMP up to 8 sockets
• Memory capacity and bandwidth scale with the number of CPUs
Power Efficiency:
• AMD PowerNow!™ technology with optimized power management
• Industry-leading system-level power efficiency

A Clean Break with the Past
(Figure: a legacy shared front-side-bus platform, with memory behind external memory bridges (XMB) and I/O behind an I/O hub, contrasted with a four-socket AMD64 Direct Connect platform in which each processor integrates an SRQ, crossbar, and memory controller and connects to its neighbors, PCI-E bridges, and the USB/PCI I/O hub over 8 GB/s HyperTransport links.)
Legacy x86 architecture:
• 20-year-old traditional front-side bus (FSB) architecture
• CPUs, memory, and I/O all share a bus
• Major bottleneck to performance
• Faster CPUs or more cores ≠ performance
AMD64 Direct Connect Architecture:
• Industry-standard technology
• Direct Connect Architecture reduces FSB bottlenecks
• HyperTransport™ interconnect offers scalable high bandwidth and low latency
• 4 memory controllers increase memory capacity and bandwidth

4P System – Board Layout
(Figure: board layout of a four-socket system.)

System Overview
(Figure: four nodes, each with two cores behind a System Request Interface (SRI), a crossbar (XBAR), a memory controller (MCT) with a 128-bit, 12.8 GB/s DRAM interface, and HyperTransport ports; the SRI, XBAR, MCT, and HT ports form the "Northbridge". Coherent HT links connect the nodes at 4.0 GB/s per direction at a 2 GT/s data rate; non-coherent HT (ncHT) links and the HT host bridge (HT-HB) attach I/O.)
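The link and memory bandwidth figures quoted above follow from simple arithmetic: a 16-bit HyperTransport link carries 2 bytes per transfer, so at a 2 GT/s data rate it delivers 4.0 GB/s per direction, or 8 GB/s per link counting both directions; the 12.8 GB/s per-node memory figure corresponds to a 128-bit interface at an 800 MT/s data rate (the rate is an inference from the quoted number, not stated on the slide). A minimal sketch in Python, with illustrative function names:

```python
# Illustrative arithmetic only; function names are not from the deck.

def ht_link_bandwidth_gbs(width_bits=16, rate_gts=2.0):
    """Per-direction and aggregate bandwidth (GB/s) of one HyperTransport link."""
    per_direction = width_bits / 8 * rate_gts   # bytes per transfer * GT/s
    return per_direction, 2 * per_direction     # counting both directions

def dram_bandwidth_gbs(bus_bits=128, rate_mts=800):
    """Peak bandwidth (GB/s) of a dual-channel, 128-bit DRAM interface."""
    return bus_bits / 8 * rate_mts / 1000

print(ht_link_bandwidth_gbs())   # (4.0, 8.0): 4 GB/s per cHT direction, 8 GB/s per link
print(dram_bandwidth_gbs())      # 12.8: the per-node figure, implying ~800 MT/s
```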
Northbridge Microarchitecture Overview
(Figure: Core0 and Core1, each with its own L2, feed the System Request Interface (SRI), which also provides the APIC interrupt interface; the SRI connects through the crossbar (XBAR) to the memory controller (MCT), the DRAM controller with 2 DDR2 channels, and the three HyperTransport ports HT0, HT1, and HT2.)
Virtual channels:
Virtual Channel | Use | Data
Request | Reads, non-posted writes, cache block commands | Y
Posted Request | Posted writes | Y
Probe | Broadcast probes | N
Response | Read responses, probe responses, completions | Y

Northbridge Command Flow
All buffers hold 64-bit command/address entries.
• Per core: Victim Buffer (8-entry), Write Buffer (4-entry), Instruction MAB (2-entry), Data MAB (8-entry)
• System Request Queue to the XBAR: 24 entries
• Memory Command Queue (to the Address MAP, DCT & GART): 20 entries
• Five XBAR command routers with 10-, 16-, 16-, 16-, and 12-entry buffers route traffic between the SRI, the memory command queue, and the HT0/HT1/HT2 inputs and outputs

Northbridge Data Flow
All buffers hold 64-byte cache lines.
• Per core: Victim Buffer (8-entry), Write Buffer (4-entry)
• XBAR data inputs from the host bridge, HT0/HT1/HT2, and the DCT pass through 8-, 8-, 8-, 5-, and 8-entry buffers
• System Request Data Queue (12-entry) delivers data to the cores; the Memory Data Queue (8-entry) delivers data to the DCT; further outputs feed the host bridge and the HT0/HT1/HT2 outputs

Lessons Learned #1
Allocation of the XBAR command buffer across virtual channels can have a big impact on performance; MP traffic analysis gives the best allocation.
Example: an Opteron read transaction in a four-node system makes 2 Request-channel visits, 3 Probe-channel visits, and 8 Response-channel visits.
(Figure: the numbered steps of the example read transaction, request, probes, probe responses, and data, flowing among processors P0-P3 and their memories.)

Lessons Learned #2
Memory latency is the key to application performance.
(Chart: "Performance vs Average Memory Latency", showing system performance and per-processor performance for OLTP1, OLTP2, SW99, SSL, and JBB across the 1N, 2N, 4N (SQ), 8N (TL), and 8N (L) configurations; single 2.8 GHz core, 400 MHz DDR2 PC3200, 2 GT/s HT, 1 MB cache, MP system.)
Average hops to memory and added latency by topology:
Topology | Avg hops | Added latency
1 node | 0 | x + 0 ns
2 node | 0.5 | x + 17 ns (47 cpuclk)
4-node square | 1 | x + 44 ns (124 cpuclk)
8N twisted ladder | 1.5 | x + 76 ns (214 cpuclk)
8N ladder | 1.8 | x + 105 ns (234 cpuclk)
(Figure: node and I/O placement for the 1-node, 2-node, 4-node square, 8N twisted ladder, and 8N ladder topologies.)
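The average hop counts in the table above are mean shortest-path distances from a node to every node's memory (counting local memory as 0 hops), so they can be checked mechanically from a topology's link list. A minimal sketch, assuming the ladder wiring shown in the figure (two rails of four nodes plus four rungs); the helper names and the breadth-first-search approach are illustrative, not from the deck. It reproduces the 1-hop figure for the 4-node square and roughly the 1.8-hop figure for the 8-node ladder, and the diameter it reports is the quantity the later 4-node comparison calls "Diam":

```python
from collections import deque

def distances(n_nodes, links):
    """All-pairs shortest-path hop counts via BFS over undirected HT links."""
    adj = {i: [] for i in range(n_nodes)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    dist = {}
    for src in range(n_nodes):
        d = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        dist[src] = d
    return dist

def diameter_and_avg_hops(n_nodes, links):
    """Network diameter and average hops from a node to any node's memory
    (local memory counts as 0 hops), matching the deck's "Avg Diam"."""
    dist = distances(n_nodes, links)
    all_d = [dist[s][t] for s in range(n_nodes) for t in range(n_nodes)]
    return max(all_d), sum(all_d) / len(all_d)

# 4-node square: P0-P1, P1-P3, P3-P2, P2-P0
square4 = [(0, 1), (1, 3), (3, 2), (2, 0)]
# 8-node ladder: rails P0-P2-P4-P6 and P1-P3-P5-P7, plus four rungs
ladder8 = [(0, 2), (2, 4), (4, 6), (1, 3), (3, 5), (5, 7),
           (0, 1), (2, 3), (4, 5), (6, 7)]

print(diameter_and_avg_hops(4, square4))   # (2, 1.0)  -> the x + 44 ns row
print(diameter_and_avg_hops(8, ladder8))   # (4, 1.75) -> ~1.8 hops, the x + 105 ns row
```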
Looking Forward: HyperTransport™-based Accelerators (Imagine it, Build it)
• Open platform for system builders ("Torrenza"): 3rd-party accelerators for media, FLOPs, XML, and SOA
• Accelerators attach through an AMD Opteron™ socket or an HTX slot
• The HyperTransport interface is an open standard (see hypertransport.org)
• A coherent HyperTransport interface is available under license if the accelerator caches system memory
(Figure: two native dual-core Opteron nodes linked by 8 GB/s HyperTransport to SOA and XML accelerators, a FLOPs-optimized accelerator, other optimized silicon, a PCI-E bridge, and the USB/PCI I/O hub.)

AMD's Next Generation Processor Technology
Ideal for 65nm SOI and beyond: native quad-core die, expandable shared L3 cache, Enhanced Direct Connect Architecture.
IPC-enhanced CPU cores:
• 32B instruction fetch
• Improved branch prediction
• Out-of-order load execution
• Up to 4 DP FLOPS/cycle
• Dual 128-bit SSE dataflow
• Dual 128-bit loads per cycle
• Bit-manipulation extensions (LZCNT/POPCNT)
• SSE extensions (EXTRQ/INSERTQ, MOVNTSD/MOVNTSS)
Enhanced Direct Connect Architecture and Northbridge:
• Four ungangable x16 HyperTransport™ links (up to 5.2 GT/s)
• Enhanced crossbar
• Next-generation memory support; FBDIMM when appropriate
• Enhanced power management and RAS

Balanced, Highly Efficient Cache Structure
Efficient memory handling reduces the need for "brute force" cache sizes.
(Figure: four cores, each with a dedicated 64KB L1 and 512KB L2, sharing an expandable 2+MB L3.)
Dedicated L1 (64KB per core):
• Locality keeps the most critical data in the L1 cache
• Low latency
• 2 x 128-bit data paths
• 2 loads per cycle
Dedicated L2 (512KB per core):
• Sized to accommodate the majority of today's working sets
• Dedicated, to help eliminate the conflicts common in shared caches
Shared L3 (2+MB), coming soon:
• Allocation policy optimizes movement, placement, and replication of data for multi-core
• Ready for expansion

Additional HyperTransport™ Ports
• Enable a fully connected 4-node topology (four x16 HT links) and a fully connected 8-node topology (eight x8 HT links)
• Reduced network diameter: fewer hops to memory
• Increased coherent bandwidth: more links, cHT packets visit fewer links, HyperTransport3
• Benefits: low latency because of the lower diameter; evenly balanced utilization of HyperTransport links; low queuing delays, hence low latency under load
(Figure: a processor with four x16 HT links, or alternatively eight x8 HT links.)

4 Node Performance
(Figure: the 4-node square and the 4-node fully connected topologies; HyperTransport3 supplies the 2 extra links needed for full connectivity.)
Configuration | Diam | Avg Diam | XFIRE BW
4N SQ (2 GT/s HyperTransport) | 2 | 1.00 | 14.9 GB/s
4N FC (2 GT/s HyperTransport) | 1 | 0.75 | 29.9 GB/s (2X)
4N FC (4.4 GT/s HyperTransport3) | 1 | 0.75 | 65.8 GB/s (4X)
XFIRE ("crossfire") BW is the link-limited all-to-all communication bandwidth (data only). (A sketch reproducing these figures follows the 8-node comparison below.)

8 Node Performance
(Figure: three 8-node topologies compared, with per-node hop counts annotated: the 8N twisted ladder, an 8-node "6HT 2x4" configuration, and an 8-node fully connected network.)
Configurations compared: 8N TL (2 GT/s HyperTransport), 8N 2x4 (4.4 GT/s HyperTransport3), and 8N FC (4.4 GT/s HyperTransport3).
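The XFIRE figures in the 4-node comparison can be reproduced with a simple link-limited model: under uniform all-to-all traffic spread evenly over shortest paths, the achievable data bandwidth is the total per-direction link capacity divided by the average hop count between distinct node pairs. The per-direction data payload capacity used below (about 2.49 GB/s at 2 GT/s) is not stated in the deck; it is inferred by dividing the quoted 29.9 GB/s fully connected figure by its 12 directed links, so this sketch is a consistency check rather than an independent derivation:

```python
from collections import deque
from itertools import combinations

def avg_pair_hops(n, links):
    """Average shortest-path hop count over distinct node pairs (BFS per source)."""
    adj = {i: [] for i in range(n)}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    total = 0
    for s in range(n):
        d = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    q.append(v)
        total += sum(d[t] for t in range(n) if t != s)
    return total / (n * (n - 1))

def xfire_bw_gbs(n, links, data_gbs_per_dir):
    """Link-limited all-to-all data bandwidth, assuming uniform traffic balanced
    over shortest paths: total directed-link capacity / average pair hop count."""
    return 2 * len(links) * data_gbs_per_dir / avg_pair_hops(n, links)

square4 = [(0, 1), (1, 3), (3, 2), (2, 0)]   # 4-node square
fc4 = list(combinations(range(4), 2))        # 4-node fully connected (6 links)

DATA_PER_DIR_2GT = 29.9 / 12                 # ~2.49 GB/s, inferred from the quoted 29.9 GB/s
DATA_PER_DIR_HT3 = DATA_PER_DIR_2GT * 4.4 / 2.0   # scaled to the 4.4 GT/s data rate

print(f"4N SQ, 2 GT/s:   {xfire_bw_gbs(4, square4, DATA_PER_DIR_2GT):.1f} GB/s")  # ~15.0
print(f"4N FC, 2 GT/s:   {xfire_bw_gbs(4, fc4, DATA_PER_DIR_2GT):.1f} GB/s")      # 29.9
print(f"4N FC, 4.4 GT/s: {xfire_bw_gbs(4, fc4, DATA_PER_DIR_HT3):.1f} GB/s")      # 65.8
```

Scaled to the 4.4 GT/s HyperTransport3 rate the model lands on the quoted 65.8 GB/s, while the square comes out near 15.0 GB/s against the quoted 14.9, presumably a rounding difference in the original analysis.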