POWER7 Processors: the Beat Goes On
Total Page:16
File Type:pdf, Size:1020Kb
POWER7 Processors: The Beat Goes On Joel M. Tendler, Executive IT Architect [email protected] Acknowledgment: This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002 POWER7 Processors: The Beat Goes On IBM Power Systems value proposition Deliver business value by leveraging technology Performance Flexibility Reliability Affordability + – . the highest value at the lowest risk with leading technology 2 POWER7 Processors: The Beat Goes On 45nm Approaching 20 Years of POWER Processors 65nm Next Gen. POWER7 RS64IV Sstar 130nm -Multi-core POWER6TM RS64III Pulsar 180nm -Ultra High Frequency .18um RS64II North Star .25um POWER5TM -SMT RS64I Apache .35um BiCMOS .5um POWER4TM Major POWER® Innovation .5um Muskie A35 .22um -Dual Core -1990 RISC Architecture .5um -1994 SMP -Cobra A10 -1995 Out of Order Execution -64 bit -1996 64 Bit Enterprise Architecture POWER3TM -1997 Hardware Multi-Threading .35um -630 -2001 Dual Core Processors -2001 Large System Scaling -2001 Shared Caches .72um -2003 On Chip Memory Control POWER2TM P2SC -2003 SMT .25um -2006 Ultra High Frequency RSC .35um -2006 Dual Scope Coherence Mgmt 1.0um -2006 Decimal Float/VSX .6um -2006 Processor Recovery/Sparing 604e -2009 Balanced Multi-core Processor -603 POWER1 -2009 On Chip EDRAM -AMERICA’s -601 1990 1995 2000 2005 2010 * Dates represent approximate processor power-on dates, not system availability 3 POWER7 Processors: The Beat Goes On POWER Roadmap – The Only Reliable Server Roadmap 2001 2004 2007 2010 POWER4 POWER5 POWER6 POWER7* 5GHz Alti Advanced - 2 Cores Vec Core Design 2.3GHz 2.3GHz 1.9 Core1.5+ Core L2 Cache Cache 1.5+ GHz 1.5+ GHz GHz GHz Advanced Advanced 1+ GHz Core1+ GHzCore Core CoreShared L2 System Features System Features Core Core SharedDistributed L2 Switch 4-8 cores / die Shared L2 Highly threaded cores Distributed Switch SharedDistributed L2 Switch Very High Frequencies 4-5GHz Enhanced Virtualization 2.3 GHz POWER5+ Advanced Memory Subsystem Distributed Switch Enhanced Scaling Altivec Vector SIMD instructions Simultaneous Multi-Threading (SMT) Instruction Retry/Alternate Enhanced Distributed Switch Processor Recovery Enhanced Core Parallelism Chip Multi Processing Decimal Floating Point Improved FP Performance - Distributed Switch Dynamic Energy Management Increased memory bandwidth - Shared L2 Partition Mobility Micropartitions Dynamic LPARs (32) Memory Protection Keys Virtualized IO Advanced Memory Sharing Upgrades to be available First Dual core chip in industry First Quad core in industry Fastest chip in industry For Power 570 & Power 595 BINARY COMPATIBILITY *All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. 4 POWER7 Processors: The Beat Goes On POWER7 Processor Chip ¾ 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM ¾ 1.2B transistors Equivalent function of 2.7B eDRAM efficiency ¾ Eight processor cores 12 execution units per core 4 Way SMT per core 32 Threads per chip 256KB L2 per core ¾ 32MB on chip eDRAM shared L3 ¾ Dual DDR3 Memory Controllers 100GB/s Memory bandwidth per chip sustained ¾ Scalability up to 32 Sockets 360GB/s SMP bandwidth/chip 20,000 coherent operations in flight ¾ Advanced pre-fetching Data and Instruction * Statements regarding SMP servers ¾ Binary Compatibility with POWER6 and prior systems do not imply that IBM will introduce a system with this capability. 5 POWER7 Processors: The Beat Goes On Computer Systems Memory: DRAM ¾ Dynamic RAM Uses MOSFETs and Capacitors Charge , i.e., “0” or “1”, decays over time (capacitor discharges) losing its state To be useful, cell needs periodic refreshing (dynamic) Volatile memory (data lost when memory is not powered) ¾ Each cell stores 1 bit Memory cell: 1xMOSFET + 1xCapacitor ¾ Comparison with SRAM Less components, much more dense Higher latency, variable access timings 4x4 Dynamic RAM memory Charge refresh & amplification complexity Diagram courtesy of Wikipedia 7 POWER7 Processors: The Beat Goes On Computer Systems Memory: SRAM ¾ Static RAM Uses transistors only, e.g., MOSFET Charge does not need to be refreshed (static) Uses bi-stable latching circuitry Volatile memory (data lost when memory is not powered) Static RAM memory cell: 6 x MOSFET ¾ Each cell stores 1 bit Diagram courtesy of Wikipedia 6 or more MOSFETs ¾ Comparison with DRAM Uses more components per bit Lower latency Higher frequency access Simpler access interface 8 POWER7 Processors: The Beat Goes On POWER7 Design Principles: Traditional Performance View 10000 Multiple optimization Points 1000 ¾ Balanced Design 100 Multiple optimization points Improved energy efficiency 10 RAS improvements ¾ Improved Thread Performance 1 Dynamic allocation of resources Thread Core Socket 32 Chip System Shared L3 ¾ Increased Core parallelism Balanced View 4 Way SMT 10000 Aggressive out of order execution 32 Chip System POWER6 ¾ Extreme Increase in Socket POWER7 Throughput 1000 Continued growth in socket bandwidth Socket 100 Balanced core, cache, memory improvements Core System Thruput System 10 ¾ System Thread Scalable interconnect Reduced coherence traffic 1 * Statements regarding SMP servers do not imply that IBM will Single Thread Performance introduce a system with this capability. 10 Graphs for illustration purposes only (Not actual data) POWER7 Processors: The Beat Goes On POWER7 Design Principles: Flexibility and Adaptability ¾ Cores: 4, 6, and 8-core offerings with up to 32MB of L3 Cache Dynamically turn cores on and off, reallocating energy Dynamically vary individual core frequencies, reallocating energy Dynamically enable and disable up to 4 threads per core ¾ Memory Subsystem: 4 or 8 channel configurations ¾ System Topologies: Standard, half-width, and double-width SMP busses supported ¾ Multiple System Packages 2/4s Blades and Racks High-End and Mid-Range Compute Intensive Single Chip Organic Single Chip Glass Ceramic Quad-chip MCM 1 Memory Controller 2 Memory Controllers 8 Memory Controllers 3 4B local links 3 8B local links 3 16B local links (on MCM) 2 8B Remote links 11 * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability. POWER7 Processors: The Beat Goes On POWER7: Core ¾ Execution Units DFU 2 Fixed point units ISU 2 Load store units VSX FPU 4 Double precision floating FXU point 1 Vector unit 1 Branch 1 Condition register IFU 1 Decimal floating point unit CRU/BRU LSU 6 Wide dispatch/8 Wide Issue ¾ Recovery Function Distributed ¾ 1,2,4 Way SMT Support ¾ Out of Order Execution 256KB L2 ¾ 32KB I-Cache ¾ 32KB D-Cache ¾ 256KB L2 Tightly coupled to core 12 POWER7 Processors: The Beat Goes On Computer Systems Memory: Hierarchy Instructions Microprocessor Integrated Circuit OutputData InputData Typical Memory Hierarchy ¾ Memory wall Registers Differential in “performance” between addressable memory and the Instruction cache microprocessor L1 cache { ¾ Implement a memory hierarchy and Data cache intelligence to reduce access times/latency L2 cache ¾ Combination of memory-cell L3 cache technology and “distance” from the processor DIMM Buffer ASIC Increased Latency, greatercapacity Processor intimacy, smaller capacity Addressable memory 13 POWER7 Processors: The Beat Goes On Challenge: Beating Physics to Realize Multi-core Potential Local SMPLinks Remote SMP+I/O Links Core Core Core Core L2 Cache L2 Cache L2 Cache L2 Cache Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl L2 Cache L2 Cache L2 Cache L2 Cache Core Core Core Core POWER7™ is an 8-core, high performance Server chip. A solid chip is a good start. But to win the race, you need a balanced system. POWER7 enables that balance. 14 POWER7 Processors: The Beat Goes On Challenge: Beating Physics to R ealize Multi-core Potential Need to Amplify Effective Socket Throughput Compute Throughput Potential to Close Gap and Achieve Potential Socket Throughput Limitation (Physical signal economics) Multi-core evolution Multi-core evolution 15 POWER7 Processors: The Beat Goes On Trends in Server Evolution Single Image Virtualized/Cloud Emerging Entry Server - A simple matter of riding Virtualized/Cloud Platform the multi-core trend? - Add more cores to the die, beef up some interfaces, Enabled by: 8-core 8-core - Technology and scale to a large SMP? - Innovation Driven by: 8-core 8-core - IT Evolution - Economics 2 to 4 socket Time 16 to 32-way SMP Server Traditional Entry Server Traditional High-End Server Single Image Platform Virtualized Consolidation Platform 2-core 2-core 2-core 2-core 2 to 4 socket 8 to 32 socket * Statements regarding SMP servers 4 to 8-way SMP Server 16 to 64-way SMP Server do not imply that IBM will introduce a system with this capability. 16 POWER7 Processors: The Beat Goes On Trends in Server Evolution Single Image Virtualized/Cloud Emerging Entry Server - A simple matter of riding Virtualized/Cloud Platform the multi-core trend? - Add more cores to the die, beef up some interfaces, Enabled by: 8-core 8-core - Technology and scale to a large SMP? - Innovation Not so simple: 8-core 8-core Driven by: S - Emerging entry servers i - IT Evolution m have characteristics similar i - Economics l 2 to 4 socket a r to traditional high-end Time 16 to 32-way SMP ServerC h large SMP servers a l l e Traditional Entry Server Traditional High-Endn Server Single Image Platform Virtualized Consolidationg Platform e Achieving solid virtual machine performance 2-core 2-core requires