Quad-Core AMD Opteron™ • Long Term Roadmap • AMD HPC WW
AMD WW HPC 07
2007 Worldwide High Performance Computing

Agenda
• Quad-Core AMD Opteron™
  – Architecture
  – Benchmarks
  – Tools
• Long Term Roadmap
• AMD HPC WW

Quad-Core AMD Opteron™ Processors

More than just four cores
• Significant CPU Core Enhancements
• Significant Cache Enhancements

World-class performance
• Native Quad Core
  – Faster data sharing between cores
• Enhanced AMD-V™
  – Nested paging acceleration for virtual environments

Reducing total cost of ownership
• Performance/Watt leadership
  – Consistent 95W thermal design point
  – Low power 68W solutions
• Drop-in upgrade
  – Socket F compatibility
  – BIOS upgrade
  – Leverage existing platform infrastructure
• Common Core Architecture
  – One core technology top-to-bottom
  – Top-to-bottom platform feature consistency

AMD Quad-Core Processor: The Die

Comprehensive Upgrades for SSE128
  Can quadruple floating-point capabilities

Virtualization Enhancements
  New “Nested Paging” feature designed for near native performance on virtualization applications

New Highly Efficient Cache Structure with Shared L3 Cache
  Balance of dedicated and shared cache for optimum quad-core performance

Advanced Power Management
  Provides granular power management resulting in improved power efficiency

DRAM Controller Enhancements
  To improve overall memory performance with native quad-core processing

CPU Core Enhancements
  To benefit applications by improving overall efficiency and performance of cores

128-bit SSE and 128-bit Loads

Comprehensive set of upgrades for improved performance on floating point- and graphics-intensive applications

[Figure: execution pipeline block diagram. Labels: Instruction Dispatch; Integer Decode & Rename; FP Decode & Rename; integer schedulers; 36-entry FP scheduler; AGU; ALU; LS/ST queue; Data Cache; FADD; FMUL; FMISC; data paths marked 64b → 128b]
[Figure labels, continued: SSE; ALU; MULT; data paths marked 64b → 128b]

• Double vector SSE performance
• Both SSE Floating-point and SSE Packed Integer
• Avoid creating bottlenecks in instruction or data delivery

Comprehensive Enhancements for SSE128
AMD Dual-Core Opteron™ versus Quad-Core Opteron™

Parameter               AMD Opteron™ with DDR2       Quad-Core AMD Opteron™
SSE Exec Width          64                           128 + SSE MOVs
Instruction Fetch BW    16 Bytes/cycle               32 Bytes/cycle + Unaligned Ld-Ops
Data Cache Bandwidth    2 x 64-bit loads/cycle       2 x 128-bit loads/cycle
L2/NB Bandwidth         64 bits/cycle                128 bits/cycle
FP Schedule Depth       36 Dedicated x 64-bit ops    36 Dedicated x 128-bit ops

Can perform SSE MOVs in the FP “store” pipe
• Execute two generic SSE ops + SSE MOV each cycle (+ two 128-bit SSE loads)

SSE Unaligned Load-Execute mode
• Reduce alignment requirements for SSE ld-op instructions
• Minimize awkward pairs of separate load and compute instructions
• To improve instruction packing and decoding efficiency

‘Barcelona’ … Not Just Four Cores
Comprehensive 128-bit SSE Upgrades

Goal: Balanced                        64-bit Platforms   Intel Clovertown   AMD Barcelona
SSE Execution                         1x                 2x                 2x
Instruction Fetch Bandwidth           1x                 1x                 2x
Data Cache Bandwidth                  1x                 1x                 2x
L2 Cache / North-Bridge Bandwidth     1x                 2x                 2x

• Barcelona doubles Instruction and Data pipelines … Intel’s pipeline doesn’t
  – Helps keep 128-bit SSE pipeline full for optimum performance
• Dedicated 36-entry floating-point scheduler can reduce application latency
  – Intel 32-entry scheduler shared between floating-point and integer operations
• Incredible performance boost, per core, on target applications!

Balanced, Highly Efficient Cache Structure

Efficient memory handling reduces need for “brute force” cache sizes

Dedicated L1
• AMD’s 64KB/64KB vs. Intel’s 32KB/32KB
• Allows 2 loads per cycle
Handle Data Quickly and Efficiently.
[Figure: four cores (Core 1 to Core 4), each with its own Cache Control, a 64KB L1, and a 512KB L2, sharing a 2MB+ L3]

Dedicated L2
• Dedicated cache to eliminate conflicts of shared caches
• Designed for true working data sets
Avoid Thrashing. Minimize Latency.

Shared L3 - New
• Designed for optimum memory use and allocation for multi-core
• Ready for expansion at the right time for customers
Reduces Latency to Main Memory.

Barcelona Enhancements

CPU Core IPC Enhancements
✓ Advanced branch prediction
✓ 32B instruction fetch
✓ Sideband Stack Optimizer
✓ Out-of-order load execution
✓ TLB Optimizations (1G pages)
✓ Data-dependent divide latency
✓ More Fastpath instructions
  • CALL and RET-Imm instructions
  • Data movement between FP & INT
✓ Bit Manipulation extensions
  • LZCNT/POPCNT
✓ SSE extensions
  • EXTRQ/INSERTQ
  • MOVNTSD/MOVNTSS

Deliver more DRAM bandwidth
✓ Independent DRAM controllers
✓ Optimized DRAM paging
✓ Re-architect NB for higher BW
✓ Write bursting
✓ DRAM prefetcher
✓ Core prefetchers

Balanced, Highly Efficient Cache Structure
✓ Doubled L1 cache bandwidth & Ins. Decode
✓ Dedicated 512KB L2 cache
✓ Shared L3 cache

Improving Processor Power Management with AMD PowerNow!™ Technology enhancements

[Figure: per-core gauges with MHz and IDLE labels. Dual-Core: Core 0 at 75%, Core 1 at 35%. Native Quad-Core: Core 0 at 75%, Core 1 at 35%, Core 2 at 10%, Core 3 at 1%.]

Dual-Core
• MHz and voltage are locked to the highest utilized core’s p-state

Native Quad-Core
• MHz is adjusted independently per core
• Voltage is locked to the highest utilized core’s p-state

Native Quad-Core technology enables enhanced power management across all four cores

AMD CoolCore™ Technology
Turns off Blocks of CPU When Not in Use

Coarse Control (Core)
• Ex, FPU (hottest part of die)

Fine Control (Core)
• Incrementally Smaller Sections

Memory Controller
• Reads (turn off write logic)
• Writes (turn off read logic)

[Figure: die diagram showing FPU, L1, L2, L3, Core 1 to Core 4, and Memory Controller. Example only: does not reflect actual areas of clock gating]

AMD CoolCore™ is Automatic – No Drivers Needed!

Introducing Average CPU Power

Average CPU Power (ACP) - Measuring processor power draw on all CPU power rails while running accurate and relevant commercially useful high utilization workloads*

ACP     TDP
105W    120W
75W     95W
55W     68W

Each ACP value includes power for Cores, Memory Controller, and HyperTransport™ links

ACP values are considerably lower than TDP
• Because AMD’s TDP values are conservative engineering design limits
• ACP includes workloads such as TPC-C, SPECcpu2006, SPECjbb2005, STREAM

TDP will continue to be leveraged for engineering thermal design maximum limits

Overall platform power is most important

*See slide “Details around testing”
SPEC® and the benchmark names SPECcpu2006, SPECjbb2005 are registered trademarks of the Standard Performance Evaluation Corporation.
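As a rough illustration of the gap between the two ratings, the sketch below computes each ACP value as a share of its TDP, using only the three ACP/TDP pairs listed on the slide. The function name acp_share_of_tdp is hypothetical and introduced only for this example; it is not an AMD-defined metric.

```python
# (ACP, TDP) pairs in watts, as listed on the ACP slide.
ACP_TDP_WATTS = [(105, 120), (75, 95), (55, 68)]


def acp_share_of_tdp(acp_w, tdp_w):
    """Return ACP as a fraction of TDP (hypothetical helper for illustration)."""
    return acp_w / tdp_w


if __name__ == "__main__":
    for acp, tdp in ACP_TDP_WATTS:
        share = acp_share_of_tdp(acp, tdp)
        # Show the share and the absolute gap between the two ratings.
        print(f"ACP {acp}W / TDP {tdp}W = {share:.1%} (gap {tdp - acp}W)")
```

For the listed pairs, ACP comes out between roughly 79% and 88% of TDP.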
Dual Dynamic Power Management™ (DDPM)

Separate power planes for cores and memory controller for:
• Optimum power consumption - Enables cores to operate at reduced power consumption levels while memory controller continues to run at full speed
• Increased performance - Memory controller can operate at higher frequency for increased bandwidth and performance

[Figure: Unified Plane Systemboard compared with DDPM Systemboard]

Projected Infrastructure Impact of Quad-Core

• Second-Generation AMD Opteron™ processors with planned upgrade path to quad-core within existing power & thermal envelopes
• Clovertown raises power & thermal requirements within each power band
• Intel customers may be forced to choose between higher power & cooling costs or wasted rack space

            Intel TDP               AMD TDP
Pwr. Band   Dual    %+     Quad     Dual    %+    Quad
High        80W     50%    120W     120W    0%    120W
Std.        65W     23%    80W      95W     0%    95W
Low         40W     25%    50W      68W     0%    68W

[Figure: two racks, each with a 7 kW power budget, comparing 2U server counts. Labels as extracted: dual-core, 19 2U servers (76 total cores) and 20 2U servers (80 total cores); quad-core, 20 2U servers (160 total cores) and 18 2U servers (144 total cores); “Wasted” callouts of 10% and 14%; “Difficult Transition to Quad-core” and “Seamless Transition to Quad-core”]

AMD Opteron™ processors: Designed to maximize server density and minimize transitions
Intel Xeon: Can waste data center space and increase transition pain

Wattage based on 2P systems, 8 DIMMs. TDP wattage for ‘Dempsey’, ‘Woodcrest’ & ‘Clovertown’ is estimated based on current publicly available processor and chipset values, AMD estimates, and an incremental 100 watts for fans, storage, and power supply (see, e.g., http://techreport.com/etc/2006q2/woodcrest/index.x?pg=2) and is subject to change. The examples contained herein are intended for informational purposes only. Other factors will affect real-world power consumption.
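The core totals and the %+ column above can be checked with simple arithmetic. This sketch assumes the two-socket (2P) servers named in the footnote, so total cores = servers × 2 sockets × cores per socket, and it takes the %+ column to be the dual-to-quad TDP increase rounded to a whole percent. The helper names total_cores and tdp_increase_pct are hypothetical.

```python
SOCKETS_PER_SERVER = 2  # "2P systems" per the slide footnote


def total_cores(servers, cores_per_socket, sockets=SOCKETS_PER_SERVER):
    """Total cores in a rack of identical multi-socket servers."""
    return servers * sockets * cores_per_socket


def tdp_increase_pct(dual_w, quad_w):
    """Relative TDP increase from dual-core to quad-core, as a whole percent."""
    return round((quad_w - dual_w) / dual_w * 100)


if __name__ == "__main__":
    # (2U servers, cores per socket) as labeled in the rack figure.
    for servers, cores in [(19, 2), (20, 2), (20, 4), (18, 4)]:
        print(f"{servers} servers x {SOCKETS_PER_SERVER} sockets x {cores} cores "
              f"= {total_cores(servers, cores)} total cores")

    # (dual TDP, quad TDP) in watts from the Intel TDP columns of the table.
    for dual_w, quad_w in [(80, 120), (65, 80), (40, 50)]:
        print(f"{dual_w}W -> {quad_w}W: +{tdp_increase_pct(dual_w, quad_w)}%")
```

The results reproduce the slide's figures: 76, 80, 160, and 144 total cores, and increases of 50%, 23%, and 25%.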
Performance-Per-Watt Scalability

Greater Performance, Same Power

Consistent power and thermals help deliver better performance per watt

[Figure: CPU performance versus power, with labels Single Core Watts, Dual Core Watts, and Quad Core Watts, years 2003, 2005, and 2007, and a Performance-Per-Watt callout]

Quad-Core AMD Opteron™ Benchmarks
PROCESSOR PERFORMANCE BENCHMARKS
[Chart: Performance on GCC Compiler]
[Chart: SPECint_rate2006 Performance Scaling]

Quad-Core AMD Opteron™ Benchmarks
PROCESSOR PERFORMANCE BENCHMARKS – Floating Point
[Chart: SPECfp_rate2006 4P Servers]
[Chart: SPECfp_rate2006 2P Servers]

Quad-Core AMD Opteron™ Benchmarks
STREAM
[Chart: Memory Bandwidth - STREAM]

Quad-Core AMD Opteron™ Benchmarks
SHARED MEMORY PARALLEL PROCESSING
[Chart: SPEComp2001® Performance]