High-Performance Power-Efficient Bulldozer-Core X86-64 Server
Total Page:16
File Type:pdf, Size:1020Kb
HIGH-PERFORMANCE POWER- EFFICIENT X86-64 SERVER AND DESKTOP PROCESSORS Using the core codenamed “Bulldozer” Sean White 19 August 2011 THE DIE Overview, Floorplan, and Technology 2 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE DIE | Photograph 3 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE DIE | What’s on it? . Eight “Bulldozer” cores – High-performance, power-efficient AMD64 cores . First generation of a new execution core family from AMD (Family 15h) . Two cores in each Bulldozer module – 128 KB of Level1 Data Cache, 16 KB/core, 64-byte cacheline, 4-way associative, write-through – 256 KB of Level1 Instruction Cache, 64 KB/Bulldozer module, 64-byte cacheline, 2-way associative – 8 MB of Level2 Cache, 2 MB/Bulldozer module, 64-byte cacheline, 16- way associative . Integrated Northbridge which controls: – 8 MB of Level3 Cache, 64-byte cacheline, 16-way associative, MOESI – Two 72-bit wide DDR3 memory channels – Four 16-bit receive/16-bit transmit HyperTransport™ links 4 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE DIE | Floorplan (315 mm2) HyperTransport™ Phy MiscIO HyperTransport 2MB L2 2MB L2 Bulldozer Bulldozer Cache Cache Module Module ™ Phy 2MB L3 Cache 2MB L3 Cache DDR3 Northbridge Phy HyperTransport 2MB L3 Cache 2MB L3 Cache Bulldozer 2MB L2 2MB L2 Bulldozer ™ Module Cache Cache Module Phy HyperTransport™ Phy MiscIO 5 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE DIE | Process Technology Layer Type Pitch 32-nm Silicon-On-Insulator (SOI) Hi-K Metal Gate CPP1 - 130 nm (HKMG) process from GlobalFoundries M01 1x 104 nm M02 1x 104 nm 11-metal-layer-stack M03 1x 104 nm M04 1.3x 130 nm Low-k dielectric M05 1.3x 130 nm M06 2x 208 nm Dual strain liners and eSiGe to improve M07 2x 208 nm performance. M08 2x 208 nm M09 4x 416 nm Multiple VT (HVT, RVT, LVT) and long-channel transistors. M10 16x 1.6 µm M11 16x 1.6 µm 1 Contacted poly pitch 6 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE A few highlights 7 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | What’s different about this core? A Bulldozer module contains 2 cores “Bulldozer” Functions with high utilization that can’t be shared without significant Fetch compromises exist for each core Decode Ex: Integer pipelines, Level1 data Integer Integer caches Scheduler Scheduler Other functions are shared between the FP Scheduler cores Ex: Floating point pipelines, Level2 Pipeline Pipeline Pipeline Pipeline Pipeline Pipeline Pipeline Pipeline bit bit cache - - FMAC FMAC 128 128 This allows two cores to each use a L1 DCache L1 DCache larger, higher-performance function (ex: floating point unit) as they need Shared L2 Cache it for less total die area than having separate, smaller functions for each core 8 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | Microarch: Shared Frontend . Decoupled predict and fetch pipelines Prediction Queue ICache L1 BTB . Prediction-directed Fetch Queue L2 BTB instruction prefetch Ucode ROM 4 x86 Decoders . Instruction cache: 64K Byte, 2-way Int Int . 32-Byte fetch Scheduler FP Scheduler Scheduler . Instruction TLBs: Instr Instr Retire Retire bit bit - - MMX MMX FMAC FMAC AGen AGen AGen AGen 128 – Level1: 72 128 EX, DIV EX, DIV EX, MUL EX, MUL entries, mixed page sizes Ld/ST Unit L1 DTLB L1 DCache FP Ld Buffer Ld/ST Unit L1 DTLB L1 DCache – Level2: 512 entries, 4-way, Data Prefetcher Shared L2 Cache 4K pages . Branch fusion 9 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | Microarch: Separate Integer Units . Thread retire logic Prediction Queue ICache . Physical Register File L1 BTB (PRF)-based register Fetch Queue renaming L2 BTB Ucode ROM 4 x86 Decoders . Unified scheduler per core . Way-predicted 16K Byte Level1 Data cache Int Int Scheduler FP Scheduler Scheduler . Data TLB: 32-entry fully associative Instr Instr Retire Retire bit bit - . Fully out-of-order - MMX MMX FMAC FMAC AGen AGen AGen AGen 128 128 EX, DIV EX, DIV load/store EX, MUL EX, MUL – 2 128-bit loads/cycle Ld/ST Unit L1 DTLB L1 DCache FP Ld Buffer Ld/ST Unit L1 DTLB L1 DCache – 1 128-bit store/cycle – 40-entry Load queue Data Prefetcher Shared L2 Cache 24-entry Store queue 10 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | Microarch: Shared FPU . Co-processor organization Prediction Queue ICache L1 BTB Fetch Queue . Reports completion L2 BTB back to parent core Ucode ROM 4 x86 Decoders . Dual 128-bit Floating Point Int Int Multiply/Accumulate Scheduler FP Scheduler Scheduler (FMAC) pipelines Instr Instr Retire Retire bit bit - . Dual 128-bit Packed - MMX MMX FMAC FMAC AGen AGen AGen AGen 128 128 EX, DIV EX, DIV Integer pipelines EX, MUL EX, MUL . PRF-based register Ld/ST Unit L1 DTLB L1 DCache FP Ld Buffer Ld/ST Unit L1 DTLB L1 DCache renaming . A single floating point Data Prefetcher Shared L2 Cache scheduler for both cores 11 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | New Floating Point Instructions New Instructions Applications/Use Cases SSSE3, SSE4.1, • Video encoding and transcoding SSE4.2 • Biometrics algorithms (AMD and Intel) • Text-intensive applications AESNI • Application using AES encryption PCLMULQDQ • Secure network transactions (AMD and Intel) • Disk encryption (MSFT BitLocker) • Database encryption (Oracle) • Cloud security AVX Floating point intensive applications: (AMD and Intel) • Signal processing / Seismic • Multimedia • Scientific simulations • Financial analytics • 3D modeling FMA4 HPC applications (AMD Unique) XOP • Numeric applications (AMD Unique) • Multimedia applications • Algorithms used for audio/radio 12 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE BULLDOZER CORE | Microarch: Shared Level 2 Cache . 16-way unified L2 cache Prediction Queue ICache L1 BTB . L2 TLB and page Fetch Queue L2 BTB walker Ucode ROM 4 x86 Decoders – 1024-entry, 8-way – Services both Int Int FP Scheduler Instruction and Scheduler Scheduler Data requests Instr Instr Retire Retire bit bit - - MMX MMX FMAC FMAC . Multiple data AGen AGen AGen AGen 128 128 EX, DIV EX, DIV prefetchers EX, MUL EX, MUL . 23 outstanding L2 Ld/ST Unit L1 DTLB L1 DCache FP Ld Buffer Ld/ST Unit L1 DTLB L1 DCache cache misses for memory system Data Prefetcher Shared L2 Cache concurrency 13 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE NORTHBRIDGE A few of its highlights 14 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE NORTHBRIDGE | Simplified Block Diagram CPU CPU CPU 4 Bulldozer modules C0 CPUC1 L2 L2 L2 L2 8 MB L3 Cache L2 Cache Synchronization APIC APIC (interrupts)APIC L3 L3 System Request (interrupts) APIC(*2)(interrupts) Arrays CTL Queue (SRQ) 2 DRAM channels Crossbar Memory DRAM CTL1 DDR PHY CTL DRAM CTL0 DDR PHY HT HT HT HTCTLCTL CTL CTL 4 HyperTransport™ links HT/PCI-E HT/PCIHT PHY-E HT/PCIPHY-E PHY PHY 15 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE NORTHBRIDGE | SRQ, Crossbar, Memory Controller System Request Queue and Crossbar The northbridge SRQ and crossbar form the traffic hub of the design, routing from any requestor to the appropriate destination. Examples: A core request to the L3 cache; a request from another die or chip from a HyperTransport™ link to the memory controller, etc. Memory Controller Each memory controller enforces cache coherency and proper order of operations for the memory space it “owns”, distributing this responsibility across multi-die or multi-chip systems. This allows multi-die or multi-chip systems to scale rather than bottleneck at a single coherency/ordering choke point in the system. Obviously, the memory controller also sends the read/write requests to the DRAM controllers. 16 | High-Performance Power-Efficient x86-64 Server And Desktop Processors Using the core codenamed “Bulldozer” | 19 August 2011 | THE NORTHBRIDGE | DRAM Controllers, L3 Cache, Probe Filter DDR3 DRAM Controllers Two per die Supports Unbuffered (UDIMM), Registered (RDIMM) and Load- Reduced (LRDIMM) memory, 1.50V, 1.35V and 1.25V depending on the processor package, with frequencies up to DDR3-1866 New power-saving features: Low-voltage DDR3, fast self-refresh entry with internal clocks gated, aggressive precharge power down, tristate addr/cmd/bank w/chip-select deassertion, throttle activity for thermal or power reduction, put RX circuits in standby when no reads active, etc. L3 Cache and Probe Filter Up to 8 MB of Level3 cache, normally shared, can be partitioned L3 data ECC protected (single-bit correct, double-bit detect)