The AMD ™ CMP NorthBridge Architecture: Now and in the Future

Pat Conway & Bill Hughes

August, 2006 AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor Integration: Two 64-bit CPU cores 2MB L2 cache On-chip Router &

Bandwidth: Dual channel DDR (128-bit) memory bus 3 HyperTransport™ (HT) links (16-bit each x 2 GT/sec x 2)

Usability and Scalability: Socket compatible: Platform and TDP! Glueless SMP up to 8 sockets Memory capacity & BW scale w/ CPUs Power Efficiency: AMD PowerNow!™ Technology with optimized power management Industry-leading system level power efficiency

2 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future AMD Opteron™ – The Industry’s First Native Dual-Core 64-bit x86 Processor

3 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future A Clean Break with the Past

Chip Chip Chip Chip Chip Chip Chip Chip X X X X X X X X

SRQ SRQ MCPMCP MCPMCP MCPMCP MCPMCP Crossbar Crossbar Mem.Ctrlr HT Mem.Ctrlr HT

8 GB/S PCI-EPCI-E USB Memory BridgeBridge USB Memory PCI-EPCI-E SRQ SRQ I/O Hub Controller I/OI/OI/O Hub HubHub Controller BridgeTM Crossbar Crossbar PCIeBridgeTM Hub PCIe Mem.Ctrlr HT Mem.Ctrlr HT PCIPCI Hub BridgeBridge

8 GB/S 8 GB/S

TM TM PCIePCIeTM PCIePCIeTM Bridge Bridge XMBXMB XMBXMB XMBXMB XMBXMB Bridge Bridge

8 GB/S

USBUSB I/OI/O HubHub PCIPCI

Legacy x86 Architecture AMD64’s Direct Connect Architecture 20-year old traditional front-side bus (FSB) architecture Industry-standard technology CPUs, Memory, I/O all share a bus Direct Connect Architecture reduces FSB bottlenecks Major bottleneck to performance HyperTransport™ interconnect offers scalable high bandwidth and low latency Faster CPUs or more cores ≠ performance 4 memory controllers – increases memory capacity and bandwidth

4 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future 4P System — Board Layout

5 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future System Overview

DRAM DRAM 12.8 GB/s 128-bit MCT Core 0 MCT Core 0 Core 1 Core 1 SRI SRI

ncHT XBAR XBAR I/O I/O I/O HT HT HT HT HT-HB HT-HB 4.0GB/s per cHT direction @ 2GT/s Data HT HT Rate I/O I/O HT HT HT-HB XBAR XBAR HT-HB

Core 0 Core 0 Core 1 Core 1 “NorthBridge” SRI SRI MCT MCT

DRAM DRAM

6 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Northbridge Overview

Core0 Core1 Virtual Channel Use Data Request Read Y L2 L2 Non-posted Write Cache Block Commands Posted Request Posted Writes Y System Probe Broadcast probes N Request APIC Response Read Response Y Interface (interrupts) Probe response (SRI) Completion

Memory DRAM Crossbar Controller Controller 2 DDR2 channels (XBAR) (MCT)

HT0 HT1 HT2

7 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Northbridge Command Flow Core 0

Victim Buffer (8-entry) Write Buffer (4-entry) Instruction MAB (2-entry) Data MAB (8-entry) All buffers are 64-bit Core 1 command/address

System Request to Queue Core 24-entry Memory Command to Address MAP HT0 Input HT1 Input HT2 Input Queue DCT & GART 20-entry

Router Router Router Router Router 10-entry Buffer 16-entry Buffer 16-entry Buffer 16-entry Buffer 12-entry Buffer

XBAR

HT0 Output HT1 Output HT2 Output

8 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Northbridge Data Flow

Core 0 All buffers are 64-byte cache lines Victim Buffer (8-entry) Write Buffer (4-entry) from Core 1 Host Bridge HT0 input HT1 input HT2 input from DCT

8-entry Buffer 8-entry Buffer 8-entry Buffer 5-entry Buffer 8-entry Buffer

XBAR XBAR

HT0 output HT1 output HT2 output System Request Memory Data Queue Data Queue 12-entry 8-entry

to Core to Host to DCT Bridge

9 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Lessons Learned #1 Allocation of XBAR Command buffer across Virtual Channels can have big impact on performance

Memory 0 L2 L2 Memory 2 5: D 4 : D 3 4: RP0 : R D 3: PI2 P 0 P 2 3: PI0 2: RD MP traffic analysis gives 4: RP2 4: 5: RP0 5:

the best allocation PI1 3: e.g. Opteron Read Transaction D 6: 8:SD Request (2 visits) 1: RD Probe (3 visits) 4: PI1 5: RP1 Response (8 visits) P 1 P 3 7: SD

Memory 1 L2 L2 Memory 3

10 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Lessons Learned #2 Memory Latency is the Key to Application Performance!

Performance vs Average Memory Latency Performance (single 2.8GHz core, 400MHz vs DDR2 Average PC3200, 2GT/s HT Memory with 1MB cache in MP Latency system) (single 2.8GHz7 core, 400MHz DDR2 PC3200, 2GT/s HT with 1MB cache120% in MP system)

7 6 100% 120% 5 80% OLTP1 6 4 100% OLTP2 60% 3 SW99 5 SSL 40% 80% OLTP12 JBB 4 OLTP21 20% System Performance 60% 3 SW990 0% Processor Performance SSL 1N 2N 4N (SQ) 8N (TL) 8N (L) 40% 2 JBB

4 Node Square 1 1 Node 8N Ladder 20% I/O I/O

System Performance I/O P0 P2 P4 P6 I/O P0 P0 P2 0 0% Performance Processor P1 P3 I/O I/O P1 P3 P5 P7 I/O 1N 2N 4NI/O (SQ)I/O 8N (TL) 8N (L) AvgD 0 hops 1 hops 1.8 hops Latency x + 0ns x + 44ns (124 cpuclk) x + 105ns (234 cpuclk)

8N Twisted Ladder 4 Node Square 8N Ladder 1 Node I/O P0 P2 P4 P6 I/O

I/O I/O I/O P0 P2 P4 P6 I/O P0 2 Node P0 P2 I/O P1 P3 P5 P7 I/O I/O P0 P1 I/O

0.5 hops 1.5 hops P1 P3 I/O x + 17ns (47 cpuclk) x + 76ns (214 cpuclk) I/O P1 P3 P5 P7 I/O

11 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Looking Forward HyperTransport™-based Accelerators Imagine it, Build it

Open platform for SOASOA XMLXML system builders (“Torrenza”) AcceleratorAccelerator AcceleratorAccelerator – 3rd Party Accelerators –Media Native Native Dual-Core Dual-Core –FLOPs 8 GB/S –XML SRQ SRQ –SOA Crossbar Crossbar Mem.Ctrlr HT Mem.Ctrlr HT

AMD Opteron™ Socket or HTX 8 GB/S 8 GB/S slot Other Other PCI-EPCI-E FLOPs Optimized Bridge FLOPs Optimized Bridge Accelerator HyperTransport interface is an SiliconSilicon Accelerator

open standard see 8 GB/S .org USBUSB I/OI/O HubHub Coherent HyperTransport PCIPCI interface available if the accelerator caches system memory (under license)

13 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future AMD’s Next Generation Processor Technology

Ideal for 65nm SOI and beyond

Native quad core die

Expandable shared L3 cache

Enhanced Direct IPC enhanced Connect Architecture CPU cores and Northbridge

32B instruction fetch Four ungangable x16 HyperTransportTM links Improved branch prediction (up to 5.2GT/sec) Out-of-order load execution Enhanced crossbar Up to 4 DP FLOPS/cycle Next-generation Dual 128-bit SSE dataflow memory support Dual 128-bit loads per cycle FBDIMM when appropriate Bit Manipulation extensions (LZCNT/POPCNT) Enhanced power SSE extensions (EXTRQ/INSERTQ, MOVNTSD/MOVNTSS) management and RAS

14 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Balanced, Highly Efficient Cache Structure Efficient memory handling reduces the need for “brute force” cache sizes

L1 L2 L3 Dedicated L1 64KB

512KB Locality keeps most critical 2+MB Cache Cache Core 1 Control data in the L1 cache Low latency 2 128 bit data paths L1 L2 2 loads per cycle C2

L1 L2 C3

L1 L2 C4

15 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Balanced, Highly Efficient Cache Structure Efficient memory handling reduces the need for “brute force” cache sizes

L1 L2 L3 64KB

512KB Dedicated L2 2+MB Cache Cache Core 1 Control Sized to accommodate the majority of working sets L1 L2 today C2 Dedicated to help eliminate conflicts common in shared L1 L2 caches C3

L1 L2 C4

16 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Balanced, Highly Efficient Cache Structure Efficient memory handling reduces the need for “brute force” cache sizes

L1 L2 L3 64KB 512KB 2+MB Cache Cache Core 1 Control Shared L3 – Coming Soon

L1 Allocation policy which L2 optimizes movement, C2 placement and replication

L1 of data for multi-core L2 Ready for expansion C3

L1 L2 C4

17 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Additional HyperTransport™ Ports

Enable Fully Connected 4 Node (four x16 HT) and 8 Node (eight x8 HT) Reduced network diameter – Fewer hops to memory Increased Coherent Bandwidth

–more links Four x16 HT – cHT packets visit fewer links –HyperTransport3 OR Benefits – Low latency because of lower diameter – Evenly balanced utilization of HyperTransport links – Low queuing delays

Low latency under load Eight x8 HT

18 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future 4 Node Performance

4 Node Square 4 Node fully connected

I/O I/O I/O I/O 16 16 P0 P2 P0 P2

P1 P3 P1 P3

I/O I/O I/O I/O

+ 2 EXTRA LINKS W/ HYPERTRANSPORT3 4N SQ (2GT/s 4N FC (2GT/s 4N FC (4.4GT/s HyperTransport) HyperTransport) HyperTransport3) Diam 2 Avg Diam 1.00 Diam 1 Avg Diam 0.75 Diam 1 Avg Diam 0.75 XFIRE BW 14.9GB/s XFIRE BW 29.9GB/s XFIRE BW 65.8GB/s (2X) (4X)

XFIRE (“crossfire”) BW is the link-limited all-to-all communication bandwidth (data only) 19 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future 8 Node Performance

8 Node Fully Connected

8 Node8 Node 6HT 2x4 2x4 I/O I/O 8N Twisted Ladder 8 P0 P2 P5 P4 16 8

I/O P0 P2 P4 P6 I/O I/OI/O 5 4 3 4 3 5 I/O P1 P3 I/O P0 P7 I/O P0 0 0 P2 1 1 OR 2 2

I/O P1 P6 I/O 4 3 4 I/O P1 P3 P5 P7 I/O 2 2 1 1 P4 P6 P1 0 0 P3 P2 P3 5 4 3 4 3 5 I/OI/O I/O P5 P7

I/O I/O

8N TL (2GT/s 8N 2x4 (4.4GT/s 8N FC (4.4GT/s HyperTransport) HyperTransport3) HyperTransport3) Diam 3 Avg Diam 1.62 Diam 2 Avg Diam 1.12 Diam 1 Avg Diam 0.88 XFIRE BW 15.2GB/s XFIRE BW 72.2GB/s XFIRE BW 94.4GB/s (5X) (6X)

20 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Why Quad-Core?

2 1.8 1.6 1.4 1.2 1.00x 1 0.8 0.6 Performance 0.4 0.2 Power 0 Max Frequency

Baseline is 2 Node x 2 Core blade running OLTP

21 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Increasing Frequency

2 1.8 1.6 1.51x 1.4 1.2 1.14x 1.00x 1 0.8 0.6 Performance 0.4 0.2 Power 0 Increased Freq Max Frequency (+16%)

Baseline is 2 Node x 2 Core blade running OLTP

22 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Decreasing Frequency

2 1.8 1.6 1.51x 1.4 1.2 1.14x 1.00x 1 0.87x 0.8 0.6 0.49x Performance 0.4 0.2 Power 0 Increased Freq Max Frequency Decreased Freq (+16%) (-16%)

Baseline is 2 Node x 2 Core blade running OLTP

23 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Quad-Core Higher Performance within a Fixed Power Budget

2 1.8 1.73x 1.6 1.51x 1.4 1.2 1.14x 1.00x 0.99x 1 0.8 Quad-Core 0.6 Performance 0.4 0.2 Power 0 Increased Freq Max Frequency Decreased Freq (+16%) (-16%)

Baseline is 2 Node x 2 Core blade running OLTP

24 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Clock and Power Planes

Core 0 Core 1

PLL PLL VDD Core 2 Core 3

PLL PLL

VRM DIMMs VDDIO, Northbridge VHT VTT SVI HyperTransport PLLs PLL PLLs Links Misc VDDA

VDDNB

25 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future DICE: Dynamic Independent Core Engagement Ability to dynamically and individually adjust core frequencies to improve power efficiency

100% 100% 100% Workload Workload Power State

100% 100% Workload Workload

26 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future DICE: Dynamic Independent Core Engagement Ability to dynamically and individually adjust core frequencies to improve power efficiency

100% 33% Workload Workload

60% Power State

33% 33% Workload Workload

27 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future DICE: Dynamic Independent Core Engagement Ability to dynamically and individually adjust core frequencies for improved power efficiency

100% 50% Workload Workload

45% Power Halted Halted State

28 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Enjoy the rest of the conference !

www.amd.com/power

29 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future Trademark Attribution

AMD, the AMD Arrow, AMD Opteron, AMD PowerNow logo and combinations thereof are trademarks of , Inc. HyperTransport and HTX are trademarks of the HyperTransport Consortium. PCIe is a trademark of the PCI-SIG. Other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

30 21 August 2006 The Opteron CMP NorthBridge Architecture, Now and in the Future