Title: an Inside Look at POWER7 Architecture
Total Page:16
File Type:pdf, Size:1020Kb
2011 IBM Power Systems Technical University October 10-14 | Fontainebleau Miami Beach | Miami, FL Title: An Inside Look at POWER7 Architecture Session ID: HE01 Joel M. Tendler [email protected] © Copyright IBM Corporation 2011 Materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.3 IBM Power Systems value proposition PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Deliver business value by leveraging technology Performance Flexibility Reliability Affordability + – . the highest value at the lowest risk with leading technology © Copyright IBM Corporation 2011 2 Over 20 Years of POWER Processors 45nm PowerPower SystemsSystems TechnicalTechnical UniversityUniversity 65nm Next Gen. POWER7 RS64IV Sstar 130nm -Multi-core POWER6TM RS64III Pulsar 180nm -Ultra High Frequency .18um RS64II North Star .25um POWER5TM -SMT RS64I Apache .35um BiCMOS .5um POWER4TM Major POWER® Innovation .5um Muskie A35 .22um -Dual Core -1990 RISC Architecture .5um -1994 SMP -Cobra A10 -1995 Out of Order Execution -64 bit -1996 64 Bit Enterprise Architecture POWER3TM -1997 Hardware Multi-Threading .35um -630 -2001 Dual Core Processors -2001 Large System Scaling -2001 Shared Caches .72um -2003 On Chip Memory Control POWER2TM P2SC -2003 SMT .25um -2006 Ultra High Frequency RSC .35um -2006 Dual Scope Coherence Mgmt 1.0um -2006 Decimal Float/VSX .6um -2006 Processor Recovery/Sparing 604e -2009 Balanced Multi-core Processor -603 POWER1 -2009 On Chip EDRAM -AMERICA’s -601 1990 1995 2000 2005 2010 * Dates represent approximate processor power-on dates, not system availability © Copyright IBM Corporation 2011 3 IBM POWER Processor Roadmap - PowerPower SystemsSystems TechnicalTechnical UniversityUniversity 3 Year Revolution POWER8 POWER7 POWER6/6+ Most POWER5/5+ POWERful & IBM is the leader Fastest Scalable in Processor Processor POWER4/4+ Hardware In Industry Processor in and Server Virtualization Dual Core Industry design High Frequencies 4,6,8 Core First Dual Core for Unix & Linux Virtualization + 32MB On-Chip eDRAM Dual Core & Quad Core Md Live Partition Migration Power Optimized Cores in Industry Micropartitioning Memory Subsystem + Mem Subsystem ++ Enhanced Scaling Altivec Advanced Memory Dual Core 2 Thread SMT Instruction Retry Expansion Chip Multi Processing Distributed Switch + Alt Proc Recovery 4 Thread SMT++ Distributed Switch Core Parallelism + Dyn Energy Mgmt Reliability + Shared L2 FP Performance + 2 Thread SMT + VSM & VSX Dynamic LPARs (32) Memory bandwidth + Protection Keys Protection Keys+ 180nm, 130nm, 90nm 65nm 45nm 2001 2004 2007 2010 Future © Copyright IBM Corporation 2011 4 POWER7 Processor Chip PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores –12 execution units per core –4 Way SMT per core –32 Threads per chip –256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors –Equivalent function of 2.7B –eDRAM efficiency • Dual DDR3 Memory Controllers –100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets –360GB/s SMP bandwidth/chip –20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior * Statements regarding SMP servers do not imply that IBM will introduce systems a system with this capability. © Copyright IBM Corporation 2011 6 Multi-threading Evolution PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Single thread Out of Order S80 Hardware Multi-thread FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL POWER5 2-way SMT POWER7 4-way SMT FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL No Thread Executing Thread 0 Thread 1 Executing Executing Thread 2 Thread 3 Executing Executing © Copyright IBM Corporation 2011 7 POWER7 Simultaneous Multithreading Support – Intelligent Threads PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • Requires POWER7 Mode Standard Cache Option – POWER6 Mode supports single thread (ST) and 2-way simultaneous multithreading All cores active (SMT2) 2 – POWER7 Mode also supports 4-way simultaneous multithreading (SMT4) • Operating System Support 1.5 – AIX 6.1 and AIX 7.1 – IBM i 6.1 and 7.1 –Linux 1 • Dynamic Runtime SMT scheduling – Spread work among cores to execute in appropriate threaded mode 0.5 – Can dynamical shift between modes as required: ST / SMT2 / SMT4 • LPAR-wide SMT controls 0 ST SMT2 SMT4 – ST, SMT2, SMT4 modes © Copyright IBM Corporation 2011 8 POWER7 Processor Chip PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores –12 execution units per core –4 Way SMT per core –32 Threads per chip –256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors –Equivalent function of 2.7B –eDRAM efficiency • Dual DDR3 Memory Controllers –100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets –360GB/s SMP bandwidth/chip –20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior * Statements regarding SMP servers do not imply that IBM will introduce systems a system with this capability. © Copyright IBM Corporation 2011 9 Computer Systems Memory: DRAM PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • Dynamic RAM – Uses MOSFETs and Capacitors – Charge , i.e., “0” or “1”, decays over time (capacitor discharges) losing its state – To be useful, cell needs periodic refreshing (dynamic) – Volatile memory (data lost when memory is not powered) • Each cell stores 1 bit – Memory cell: 1xMOSFET + 1xCapacitor • Comparison with SRAM – Less components, much more dense – Higher latency, variable access timings 4x4 Dynamic RAM memory – Charge refresh & amplification complexity Diagram courtesy of Wikipedia © Copyright IBM Corporation 2011 10 Computer Systems Memory: SRAM PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • Static RAM – Uses transistors only, e.g., MOSFET – Charge does not need to be refreshed (static) – Uses bi-stable latching circuitry – Volatile memory (data lost when memory is not powered) Static RAM memory cell: 6 x MOSFET • Each cell stores 1 bit Diagram courtesy of Wikipedia – 6 or more MOSFETs • Comparison with DRAM – Uses more components per bit – Lower latency – Higher frequency access – Simpler access interface © Copyright IBM Corporation 2011 11 POWER7 Processor Chip PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores – 12 execution units per core – 4 Way SMT per core – 32 Threads per chip – 256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors – Equivalent function of 2.7B – eDRAM efficiency • Dual DDR3 Memory Controllers – 100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets – 360GB/s SMP bandwidth/chip – 20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior systems * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability. © Copyright IBM Corporation 2011 13 POWER7 Processor Chip PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores – 12 execution units per core – 4 Way SMT per core – 32 Threads per chip – 256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors – Equivalent function of 2.7B – eDRAM efficiency • Dual DDR3 Memory Controllers – 100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets – 360GB/s SMP bandwidth/chip – 20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior systems * Statements regarding SMP servers do not imply that IBM will introduce a system with this capability. © Copyright IBM Corporation 2011 14 POWER7 Design Principles: Traditional Performance View Power Systems Technical University 10000 Power Systems Technical University Multiple optimization Points 1000 • Balanced Design – Multiple optimization points 100 – Improved energy efficiency 10 – RAS improvements • Improved Thread Performance 1 – Dynamic allocation of resources Thread Core Socket 32 Chip System – Shared L3 • Increased Core parallelism Balanced View – 4 Way SMT 10000 – Aggressive out of order execution 32 Chip System POWER6 • Extreme Increase in Socket POWER7 Throughput 1000 – Continued growth in socket bandwidth Socket – Balanced core, cache, memory 100 improvements Core • System Thruput System 10 – Scalable interconnect Thread – Reduced coherence traffic 1 Single Thread Performance © Copyright IBM Corporation 2011 Graphs for illustration purposes only (Not actual data15) POWER7 Design Principles: PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Flexibility and Adaptability • Cores: – 4, 6, and 8-core offerings with up to 32MB of L3 Cache – Dynamically turn cores on and off, reallocating energy – Dynamically vary individual core frequencies, reallocating energy – Dynamically enable and disable up to 4 threads per core • Memory Subsystem: – 4 or 8 channel configurations • System Topologies: – Standard, half-width, and double-width SMP busses supported • Multiple System Packages Power 70x,710-755 Power 770-795 Compute Intensive Single Chip Organic Single Chip Glass Ceramic Quad-chip MCM 1