2011 IBM Power Systems Technical University October 10-14 | Fontainebleau Miami Beach | Miami, FL 

Title: An Inside Look at POWER7 Architecture

Session ID: HE01 Joel M. Tendler jtendler@us..com

© Copyright IBM Corporation 2011 Materials may not be reproduced in whole or in part without the prior written permission of IBM. 5.3 IBM Power Systems value proposition  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Deliver business value by leveraging technology

Performance Flexibility Reliability Affordability + –

. . . the highest value at the lowest risk with leading technology

© Copyright IBM Corporation 2011 2 Over 20 Years of POWER Processors 45nm  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity 65nm Next Gen.

POWER7 RS64IV Sstar 130nm -Multi-core POWER6TM RS64III Pulsar 180nm -Ultra High Frequency .18um RS64II North Star .25um POWER5TM -SMT RS64I Apache .35um BiCMOS .5um POWER4TM Major POWER® Innovation .5um Muskie A35 .22um -Dual Core -1990 RISC Architecture .5um -1994 SMP -Cobra A10 -1995 Out of Order Execution -64 bit -1996 64 Bit Enterprise Architecture POWER3TM -1997 Hardware Multi-Threading .35um -630 -2001 Dual Core Processors -2001 Large System Scaling -2001 Shared Caches .72um -2003 On Chip Memory Control POWER2TM P2SC -2003 SMT .25um -2006 Ultra High Frequency RSC .35um -2006 Dual Scope Coherence Mgmt 1.0um -2006 Decimal Float/VSX .6um -2006 Processor Recovery/Sparing 604e -2009 Balanced Multi-core Processor -603 POWER1 -2009 On Chip EDRAM -AMERICA’s -601 1990 1995 2000 2005 2010

* Dates represent approximate processor power-on dates, not system availability © Copyright IBM Corporation 2011 3 IBM POWER Processor Roadmap - PowerPower SystemsSystems TechnicalTechnical UniversityUniversity 3 Year Revolution

POWER8 POWER7

POWER6/6+ Most POWER5/5+ POWERful & IBM is the leader Fastest Scalable in Processor Processor POWER4/4+ Hardware In Industry Processor in and Server Virtualization ƒ Dual Core Industry design ƒ High Frequencies ƒ 4,6,8 Core First Dual Core for Unix & Linux ƒ Virtualization + ƒ 32MB On-Chip eDRAM ƒDual Core & Quad Core Md ƒ Live Partition Migration ƒ Power Optimized Cores in Industry ƒMicropartitioning ƒ Memory Subsystem + ƒ Mem Subsystem ++ ƒEnhanced Scaling ƒ Altivec ƒ Advanced Memory ƒ Dual Core ƒ2 Thread SMT ƒ Instruction Retry Expansion ƒ Chip Multi Processing ƒDistributed Switch + ƒ Alt Proc Recovery ƒ 4 Thread SMT++ ƒ Distributed Switch ƒCore Parallelism + ƒ Dyn Energy Mgmt ƒ Reliability + ƒ Shared L2 ƒFP Performance + ƒ 2 Thread SMT + ƒ VSM & VSX ƒ Dynamic LPARs (32) ƒMemory bandwidth + ƒ Protection Keys ƒ Protection Keys+ ƒ180nm, ƒ130nm, 90nm ƒ 65nm ƒ 45nm 2001 2004 2007 2010 Future

© Copyright IBM Corporation 2011 4 POWER7 Processor Chip  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores –12 execution units per core –4 Way SMT per core –32 Threads per chip –256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors –Equivalent function of 2.7B –eDRAM efficiency • Dual DDR3 Memory Controllers –100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets –360GB/s SMP bandwidth/chip –20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior * Statements regarding SMP servers do not imply that IBM will introduce systems a system with this capability.

© Copyright IBM Corporation 2011 6 Multi-threading Evolution  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Single thread Out of Order S80 Hardware Multi-thread

FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL

POWER5 2-way SMT POWER7 4-way SMT FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL

No Thread Executing Thread 0 Thread 1 Executing Executing Thread 2 Thread 3 Executing Executing © Copyright IBM Corporation 2011 7 POWER7 Simultaneous Multithreading Support – Intelligent Threads

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• Requires POWER7 Mode Standard Cache Option – POWER6 Mode supports single thread (ST) and 2-way simultaneous multithreading All cores active (SMT2) – POWER7 Mode also supports 4-way simultaneous multithreading (SMT4) 2 • Operating System Support – AIX 6.1 and AIX 7.1 – IBM i 6.1 and 7.1 1.5 –Linux

• Dynamic Runtime SMT scheduling 1 – Spread work among cores to execute in appropriate threaded mode 0.5 – Can dynamical shift between modes as required: ST / SMT2 / SMT4

• LPAR-wide SMT controls 0 ST SMT2 SMT4 – ST, SMT2, SMT4 modes

© Copyright IBM Corporation 2011 8 POWER7 Processor Chip  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores –12 execution units per core –4 Way SMT per core –32 Threads per chip –256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors –Equivalent function of 2.7B –eDRAM efficiency • Dual DDR3 Memory Controllers –100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets –360GB/s SMP bandwidth/chip –20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior * Statements regarding SMP servers do not imply that IBM will introduce systems a system with this capability.

© Copyright IBM Corporation 2011 9 Computer Systems Memory: DRAM  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • Dynamic RAM – Uses MOSFETs and Capacitors – Charge , i.e., “0” or “1”, decays over time (capacitor discharges) losing its state – To be useful, needs periodic refreshing (dynamic) – Volatile memory (data lost when memory is not powered) • Each cell stores 1 bit – Memory cell: 1xMOSFET + 1xCapacitor • Comparison with SRAM – Less components, much more dense – Higher latency, variable access timings 4x4 Dynamic RAM memory – Charge refresh & amplification complexity Diagram courtesy of Wikipedia

© Copyright IBM Corporation 2011 10 Computer Systems Memory: SRAM  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity • Static RAM – Uses transistors only, e.g., MOSFET – Charge does not need to be refreshed (static) – Uses bi-stable latching circuitry – Volatile memory (data lost when memory is not powered)

Static RAM memory cell: 6 x MOSFET

• Each cell stores 1 bit Diagram courtesy of Wikipedia – 6 or more MOSFETs • Comparison with DRAM – Uses more components per bit – Lower latency – Higher frequency access – Simpler access interface

© Copyright IBM Corporation 2011 11 POWER7 Processor Chip  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores – 12 execution units per core – 4 Way SMT per core – 32 Threads per chip – 256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors – Equivalent function of 2.7B – eDRAM efficiency • Dual DDR3 Memory Controllers – 100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets – 360GB/s SMP bandwidth/chip – 20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior systems

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 13 POWER7 Processor Chip  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• 567mm2 Technology: 45nm lithography, Cu, SOI, eDRAM • Eight processor cores – 12 execution units per core – 4 Way SMT per core – 32 Threads per chip – 256KB L2 per core • 32MB on chip eDRAM shared L3 • 1.2B transistors – Equivalent function of 2.7B – eDRAM efficiency • Dual DDR3 Memory Controllers – 100GB/s Memory bandwidth per chip sustained • Scalability up to 32 Sockets – 360GB/s SMP bandwidth/chip – 20,000 coherent operations in flight • Advanced pre-fetching Data and Instruction • Binary Compatibility with POWER6 and prior systems

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 14 POWER7 Design Principles: Traditional Performance View  Power Systems Technical University 10000 Power Systems Technical University

Multiple optimization Points 1000 • Balanced Design – Multiple optimization points 100 – Improved energy efficiency 10 – RAS improvements • Improved Thread Performance 1 – Dynamic allocation of resources Thread Core Socket 32 Chip System – Shared L3 • Increased Core parallelism Balanced View – 4 Way SMT 10000 – Aggressive out of order execution 32 Chip System POWER6 • Extreme Increase in Socket POWER7 Throughput 1000 – Continued growth in socket bandwidth Socket – Balanced core, cache, memory 100 improvements Core • System Thruput System 10 – Scalable interconnect Thread – Reduced coherence traffic 1 Single Thread Performance

© Copyright IBM Corporation 2011 Graphs for illustration purposes only (Not actual data15) POWER7 Design Principles:  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Flexibility and Adaptability • Cores: – 4, 6, and 8-core offerings with up to 32MB of L3 Cache – Dynamically turn cores on and off, reallocating energy – Dynamically vary individual core frequencies, reallocating energy – Dynamically enable and disable up to 4 threads per core • Memory Subsystem: – 4 or 8 channel configurations • System Topologies: – Standard, half-width, and double-width SMP busses supported • Multiple System Packages

Power 70x,710-755 Power 770-795 Compute Intensive Single Chip Organic Single Chip Glass Ceramic Quad-chip MCM

1 Memory Controller 2 Memory Controllers 8 Memory Controllers 3 4B local links 3 8B local links 3 16B local links (on MCM) 2 8B Remote links

© Copyright IBM Corporation 2011 16 POWER7: Core  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• Execution Units – 2 Fixed point units DFU – 2 Load store units ISU – 4 Double precision floating point VSX FPU – 1 Vector unit FXU – 1 Branch – 1 Condition register – 1 Decimal floating point unit IFU – 6 Wide dispatch / 8 Wide Issue CRU/BRU • Recovery Function Distributed LSU • 1,2,4 Way SMT Support • Out of Order Execution • 32KB I-Cache • 32KB D-Cache 256KB L2 • 256KB L2 – Tightly coupled to core

© Copyright IBM Corporation 2011 17 Computer Systems Memory: Hierarchy  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Instructions

Microprocessor Integrated Circuit OutputData

InputData • Memory wall – Differential in “performance” Typical Memory Hierarchy

between addressable memory and Registers the microprocessor Instruction cache • Implement a memory hierarchy and L1 cache { intelligence to reduce access Data cache times/latency L2 cache • Combination of memory-cell technology and “distance” from the L3 cache processor DIMM Buffer ASIC Increased Latency, greatercapacity Processor intimacy, smaller capacity Addressable memory

© Copyright IBM Corporation 2011 18 Challenge: Beating Physics to Realize Multi-core Potential  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl Remote SMP + I/O Links Remote SMP+I/O

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

POWER7™ is an 8-core, high performance Server chip. A solid chip is a good start. But to win the race, you need a balanced system. POWER7 enables that balance. © Copyright IBM Corporation 2011 19 Challenge: Beating Physics to Realize Multi-core Potential  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Need to Amplify Effective Compute Throughput Potential Socket Throughput to Close Gap and Achieve Potential

Socket Throughput Limitation (Physical signal economics)

Multi-core evolution

Multi-core evolution © Copyright IBM Corporation 2011 20 Trends in Server Evolution

 Single Image Virtualized/Cloud PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Emerging Entry Server - A simple matter of riding Virtualized/Cloud Platform the multi-core trend? - Add more cores to the die, beef up some interfaces, Enabled by: 8-core 8-core - Technology and scale to a large SMP? - Innovation

Driven by: 8-core 8-core - IT Evolution - Economics 2 to 4 socket

Time 16 to 32-way SMP Server

Traditional Entry Server Traditional High-End Server Single Image Platform Virtualized Consolidation Platform

2-core 2-core

2-core 2-core

2 to 4 socket 8 to 32 socket * Statements regarding SMP servers 4 to 8-way SMP Server 16 to 64-way SMP Server do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 21 Trends in Server Evolution

 Single Image Virtualized/Cloud PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Emerging Entry Server - A simple matter of riding Virtualized/Cloud Platform the multi-core trend? - Add more cores to the die, beef up some interfaces, Enabled by: 8-core 8-core - Technology and scale to a large SMP? - Innovation Not so simple: 8-core 8-core Driven by: S - Emerging entry servers i - IT Evolution m have characteristics similar i - Economics l 2 to 4 socket a r to traditional high-end

Time 16 to 32-way SMP ServerC h large SMP servers a l l e Traditional Entry Server Traditional High-Endn Server Single Image Platform Virtualized Consolidationg Platform e Achieving solid virtual machine performance 2-core 2-core requires a Balanced System Structure.

2-core 2-core

2 to 4 socket 8 to 32 socket * Statements regarding SMP servers 4 to 8-way SMP Server 16 to 64-way SMP Server do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 22 Trends in Server Evolution

 Single Image Virtualized/CloudPowerPower SystemsSystems UltraScale TechnicalTechnical Cloud UniversityUniversity Emerging Entry Server Emerging High-End Server Virtualized/Cloud Platform UltraScale Cloud Platform

Enabled by: 8-core 8-core - Technology - Innovation

Driven by: 8-core 8-core - IT Evolution - Economics 2 to 4 socket 8 to 32 socket

Time 16 to 32-way SMP Server 64 to 256-way SMP Server

Traditional Entry Server Traditional High-End Server Same enablers and Single Image Platform Virtualized Consolidation Platform driving factors apply at larger scale

2-core 2-core

2-core 2-core

2 to 4 socket 8 to 32 socket * Statements regarding SMP servers 4 to 8-way SMP Server 16 to 64-way SMP Server do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 23 Challenge: How does POWER7 maintain the Balance?

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Need to Amplify Effective Compute Throughput Potential Socket Throughput to Close Gap and Achieve Potential

Cache Hierarchy Technology and Innovation

Socket Throughput Limitation (Physical signal economics)

Multi-core evolution © Copyright IBM Corporation 2011 24 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Cache Hierarchy Rqmt Challenge for POWER® Servers for Multi-core POWER7

Core . . . Core POWER4TM, POWER5TM, and POWER6TM systems derive huge benefit from high bandwidth access Low Low Latency Latency to large, off-chip cache. 2M to 4M 2M to 4M per Core per Core But socket pin count constraints Cache Cache prevent scaling the off-chip cache footprint footprint interface to support 8 cores.

Large, Shared, 30+ MB Cache footprint much closer than Local Memory

© Copyright IBM Corporation 2011 25 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Cache Hierarchy Rqmt Challenge for POWER Servers for Multi-core POWER7

Core . . . Core Need to satisfy both caching requirements with one cache.

Low Low Latency Latency 2M to 4M 2M to 4M per Core per Core Cache Cache footprint footprint

Large, Shared, 30+ MB Cache footprint much closer than Local Memory

© Copyright IBM Corporation 2011 26 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Cache Hierarchy Rqmt Challenge for POWER Servers for Multi-core POWER7

Core . . . Core Low power, dense eDRAM value enhanced with low latency, high bandwidth, Low Low fast SRAM structures Latency Latency 2M to 4M 2M to 4M per Core per Core Cache Cache footprint footprint IBM Custom Custom eDRAM Fast SRAM

Dense, low power High Area/power Lower speed/bandwidth High speed/bandwidth

Large, Shared, 30+ MB Cache footprint much closer than Local Memory On-processor Private core 30+ MB Cache Sub-MB Cache

© Copyright IBM Corporation 2011 27 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Solution: High speed eDRAM on the processor die

Conventional IBM ASIC IBM Custom Custom Custom Memory DRAM eDRAM eDRAM Dense SRAM Fast SRAM

Dense, low power Off On High Area/power Low speed/bandwidth uP uP High speed/bandwidth Chip Chip

Conventional Large, Off-chip On-processor On-processor Private core Memory DIMMs 30+ MB Cache 30+ MB Cache Multi-MB Cache Sub-MB Cache

Industry Standard Caching and Memory Technologies: Conventional DIMMs, Dense and Fast SRAM’s.

IBM’s POWER Servers have leveraged large off-chip eDRAM caches in POWER4, 5, and 6.

With POWER7, IBM introduces on-processor, high-speed, custom eDRAM, combining the dense, low power attributes of eDRAM with the speed and bandwidth of SRAM.

© Copyright IBM Corporation 2011 28 L3 Cache Structure

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMP Links Remote SMP + I/O Links SMP+I/O Remote Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

© Copyright IBM Corporation 2011 30 Hybrid L3 “Fluid” Cache Structure – Intelligent Cache

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMP Links Remote SMP + I/O Links SMP+I/O Remote Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

© Copyright IBM Corporation 2011 31 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Hybrid L3 “Fluid” Cache Structure – Intelligent Cache

Core Core Core Core Core Core Core Core

Private Private Shared Private Shared Private

Large, Shared Private Private 32M L3 Cache Private Shared Private Private

- Keeps multiple footprints at ~3X lower latency than local memory.

Working Set Footprints

© Copyright IBM Corporation 2011 32 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Hybrid L3 “Fluid” Cache Structure – Intelligent Cache

Core Core Core Core Core Core Core Core

Private Private ClonedCloned Shared Private Shared Private

Large, Shared Private Private 32M L3 Cache Private Shared Fast, Local Fast, Local Private Private L3 Region L3 Region

- Keeps multiple footprints at ~3X lower latency than local memory. - Automatically migrates private footprints (up to 4M) to fast local region (per core) at ~5X lower latency than full L3 cache. Working Set - Automatically clones shared data to multiple private regions. Footprints

© Copyright IBM Corporation 2011 33 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Hybrid L3 “Fluid” Cache Structure – Intelligent Cache

Core Core Core Core Core Core Core Core

Private Private Private Private Private Private Private Private Private Private Private Private Private Large, Shared Private Private Private 32M L3 Cache Private Fast, Local Private Private Private Private L3 Region Private Private

- Enables a subset of the cores to utilize the entire large shared L3 cache when the remaining cores are not using it.

© Copyright IBM Corporation 2011 34 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache Fast Local L3 Region Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl Remote SMP + I/O Links Remote SMP+I/O

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

© Copyright IBM Corporation 2011 35 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Solution: L2 “Turbo” Cache

Core Core Core Core Core Core Core Core

L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo Cache Cache Cache Cache Cache Cache Cache Cache

Private Private Cloned Private Cloned Private Private Cloned Large, Shared Cloned Private 32M L3 Cache Private Shared Fast, Local Fast, Local Private Private L3 Region L3 Region

- L2 “Turbo” cache keeps a tight 256K working set with extremely low latency (~3X lower than local L3 region) and high bandwidth, reducing L3 power and boosting performance.

© Copyright IBM Corporation 2011 36 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl Remote SMP + I/O Links Remote SMP+I/O

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

© Copyright IBM Corporation 2011 37 Cache Hierarchy Technology and Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Cache Hierarchy Summary

Core Core Core Core Core Core Core Core

L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo L2 Turbo Cache Cache Cache Cache Cache Cache Cache Cache

Private Private Cloned Private Cloned Private Private Cloned Large, Shared Cloned Private 32M L3 Cache Private Shared Fast, Local Fast, Local Private Private L3 Region L3 Region

Cache Level Capacity Array Policy Comment L1 Data 32K Fast SRAM Store-thru Local thread storage update Private L2 256K Fast SRAM Store-In De-coupled global storage update Fast L3 Region Up to 4M eDRAM Partial Victim Reduced power footprint (up to 4M) Shared L3 32M eDRAM Adaptive Large 32M shared footprint

© Copyright IBM Corporation 2011 38 Challenge: How does POWER7 maintain the Balance?

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Need to Amplify Effective Compute Throughput Potential Socket Throughput to Close Gap and Achieve Potential

Advances in Memory Subsystem

Cache Hierarchy Technology and Innovation

Socket Throughput Limitation (Physical signal economics)

Multi-core evolution © Copyright IBM Corporation 2011 39 Advances in Memory Subsystem

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Memory subsystem requirement for POWER7 processor-based servers POWER7 Requirements • Core: Core 10-20GB/s sustained – 10GB/s to 20GB/s sustained bandwidth per core memory bandwidth per core – 16GB to 32GB of cache • Socket: – 4 times growth in memory bandwidth & capacity 16-32GB • System: Energy storage per constraints core – Packaging more memory into similar volume, with similar energy and cooling constraints

© Copyright IBM Corporation 2011 40 Advances in Memory Subsystem

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Multi-faceted Solution

POWER7 Chip 1) Dual Integrated DDR3 Controllers - Massive 16KB scheduling window per POWER7 chip insures high channel and DIMM utilization - Sparse access acceleration Memory Memory - Advanced Energy Management Controller Controller - Numerous RAS advances

2) Eight high speed 6.4 GHz channels - New low power differential signaling - Sustained 100+ GB/s per socket Advanced Buffer 3) New DDR3 buffer chip architecture Chip - Larger capacity support (32 GB / core) - Energy Management support - RAS enablement

4) DDR3 DRAMs - Supports 800, 1066, 1333, and 1600 © Copyright IBM Corporation 2011 41 Advances in Memory Subsystem

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl Remote SMP + I/O Links Remote SMP+I/O

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

© Copyright IBM Corporation 2011 42 Challenge: How does POWER7 maintain the Balance?  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Need to Amplify Effective Compute Throughput Potential Socket Throughput to Close Gap and Achieve Potential

Advances in Off-Chip Signaling Technology

Advances in Memory Subsystem

Cache Hierarchy Technology and Innovation

Socket Throughput Limitation (Physical signal economics)

Multi-core evolution © Copyright IBM Corporation 2011 43 Advances in Off-chip Signaling Technology

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

1) Enhanced Signal-ended “Elastic Interface” Technology 2) New high speed, low power Differential Technology

Interface Signal Type Info Width Frequency Bandwidth

Off-chip Cache none none none none

Memory Channels Differential 28 bytes 6.4 Ghz 180 GB/s

I/O Bridge Single-ended 20 bytes 2.5 Ghz 50 GB/s

SMP Interconnect Single-ended 120 bytes 3.0 Ghz 360 GB/s

Total Bandwidth 590 GB/s

(Note that bandwidths shown are raw, peak signal bandwidths)

Moving L3 onto POWER7 along with advances in signaling technology enables significant raw bandwidth growth for both memory and I/O subsystems. Note that advanced scheduling improves POWER7’s ability to utilize memory bandwidth.

© Copyright IBM Corporation 2011 44 Challenge: How does POWER7 maintain the Balance?

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Need to Amplify Effective Compute Throughput Potential Socket Throughput to Close Gap and Exploit Long Term Investment Achieve Potential in Coherence Innovation

Advances in Off-Chip Signaling Technology

Advances in Memory Subsystem

Cache Hierarchy Technology and Innovation

Socket Throughput Limitation (Physical signal economics)

Multi-core evolution © Copyright IBM Corporation 2011 45 Exploit Long Term Investment in Coherence Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Local SMP Links Remote SMP + I/O Links SMP+I/O Remote Local SMPLinks

Core Core Core Core

L2 Cache L2 Cache L2 Cache L2 Cache

Mem CtrlL3 Cache and Chip Interconnect Mem Ctrl

L2 Cache L2 Cache L2 Cache L2 Cache

Core Core Core Core

Using local and remote SMP links, up to 32 POWER7 chips are connected © Copyright IBM Corporation 2011 46 Exploit Long Term Investment in Coherence Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

* Statements regarding SMP servers do not imply that IBM will introduce Up to 32 POWER7 chips form a massive SMP system. a system with this capability.

© Copyright IBM Corporation 2011 47 Exploit Long Term Investment in Coherence Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Coherence Protocol Features POWER7 Exploitation

- De-coupled global storage - POWER Servers can drive massive coherence throughput. - De-serialized storage updates allowing storage updates to be - 32-chip system can manage over reordered 20,000 concurrently reordered coherent storage operations - De-centralized coherence ~4X more POWER6 systems resolution, and bounded latency - Minimal tracking overhead per broadcast transport layer. operation.

- Decentralized coherence resolution, - Low latency intervention, advanced cache states, optimized - High performance locking constructs on-chip transport, and broadcast - robust scaling. free barriers.

* Statements regarding SMP servers do not imply that IBM will introduce a system with this capability.

© Copyright IBM Corporation 2011 48 Exploit Long Term Investment in Coherence Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Challenge: As system size grows, Coherence broadcast traffic increases

~5X Compute Throughput POWER6 High-End Server Global Scope POWER7 High-End Server Virtualized Consolidation Platform Coherence UltraScale Cloud Platform Broadcast

8 to 32 socket 8 to 32 socket 16 to 64-way SMP Server 64 to 256-way SMP Server 450 Compute Global Coherence Throughput 1X GB/s Throughput 320 Global Coherence GB/s Throughput

© Copyright IBM Corporation 2011 49 Exploit Long Term Investment in Coherence Innovation

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity Solution: Speculative limited scope Coherence broadcast - In 2003, recognized emerging trend - Developed Dual-Scope Broadcast Coherence Protocol for POWER6 - Utilizes 13 cache states and integrated scope indicator in memory

POWER6 High-End Server Global Scope POWER7 High-End Server Virtualized Consolidation Platform Coherence UltraScale Cloud Platform Broadcast

Nodal Scope Speculative Coherence 8 to 32 socket Broadcast 8 to 32 socket 16 to 64-way SMP Server 64 to 256-way SMP Server

Provides value for POWER6 - Latency reduction - Near Perfect Scaling for extreme memory intensive workloads

© Copyright IBM Corporation 2011 51 Summary: POWER7 maintains the Balance

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Achieves extreme Multi-core Compute Throughput Potential throughput while providing Balance and SMP scaling Exploit Long Term Investment by building on a foundation in Coherence Innovation of solid innovation.

Advances in Off-Chip Signaling Technology

Advances in Memory Subsystem

Cache Hierarchy Technology and Innovation

Socket Throughput Limitation (Physical signal economics)

IBM POWER chips: 1) History of large SMP leadership 2) Storage Architecture economics 3) High density packaging leadership Multi-core evolution © Copyright IBM Corporation 2011 52 Energy Management: Architected Idle Modes

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Two Design Points Chosen for Technology 4 PowerPC Architected States – Nap (optimized for wake-up time) RV Winkle • Turn off clocks to execution units Power gate • Reduce frequency to core • Caches and TLB remain coherent • Fast wake-Up Sleep

Nap – Sleep (optimized for power reduction) Doze

• Purge caches and TLB Wake-Up Latency • Turn off clocks to full core and caches • Reduce voltage to V-retention Energy Reduction – Leakage current reduced substantially • Voltage ramps-up on wake up • No core re-initialization required

© Copyright IBM Corporation 2011 53 Adaptive Energy Management: Energy ScaleTM

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

• Chip FO4 Tuned for Optimal Performance/Watt in Technology • DVFS (Dynamic Voltage and Frequency Slewing) – -50% to +10% frequency slew independent SPECPower: Mean System Power per Load per core Level – Frequency and voltage adjusted based on: • Work load and utilization. • On board activity monitors • Turbo-Mode

– Up to 10% frequency boost r – Leverages excess energy capacity from: • Non worst case work loads • Idle cores AC Powe Vmin • Processor and Memory Energy Usage can be independently Balanced. – Real time hardware performance monitors 100 80 60 40 20 0 used. Load Level (%) – On board power proxy logic estimates power • Power Capping Support – Allows budgeting of power to different parts of system

© Copyright IBM Corporation 2011 54  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Power Systems – Reliability, Availability, Serviceability (RAS)

© Copyright IBM Corporation 2011 55 OS Downtime Comparison Survey

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity 400 participants in 27 countries Hours

10 9 8

7 6 5

4 3 2

1 0 Win2000 Win2003 RHEL SOLARIS HP-UX SUSE AIX

The Yankee Group “2007-2008 Global Server Operating Systems Reliability Survey” as quoted in “Windows Server: The New King of Downtime” by Mark Joseph Edwards at www.windowsitpro.com/article/articleid/98475/windows-server-the-new-king-of-downtime.html, March 5, 2008 and in http://www.sunbeltsoftware.com/stu/Yankee-Group-2007-2008-Server-Reliability.pdf

© Copyright IBM Corporation 2011 56 ITIC Survey says Power Systems with AIX deliver 99.997% uptime

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity - 54% of IT executives and managers say that they require 99.99% or better availability for their applications • Power Systems with AIX delivers the best RAS of UNIX, Linux, Windows choices 1. Availability: The least amount of downtime • 15 minutes a year • 2.3 times better than the closest UNIX competitor • more than 10X better than Windows 2. Reliability: The fewest unscheduled outages • less than one outage per year 3. Serviceability: The fastest patch time • 11 minutes to apply a patch

Source: Network World, dated July 14, 2009, reports on the 2009 ITIC Global Server Hardware & Server OS Reliability Survey Results © Copyright IBM Corporation 2011 57 POWER7: Reliability and Availability Features

 PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Dynamic Oscillator OSC0 OSC1 Failover Fabric Bus Interface to other Chips and Fabric Interface Nodes ¾ ECC protected ¾ Node hot add /repair

Core Recovery BUF ¾ Leverage speculative execution resources to enable recovery BUF ¾ Error detected in GPRs FPRs VSR, flushed and retried ¾ Stacked latches to improve SER BUF Alternate Processor Recovery BUF ¾ Partition isolation for core checkstops L3 eDRAM ¾ ECC protected X8 Dimms IO Hub ¾ SUE handling ¾ Line delete ¾ 64 Byte ECC on Memory ¾ Spare rows and columns ƒ Corrects full chip kill on X8 dimms ƒ Spare X8 devices implemented GX IO Bus ¾ Dual memory chip failures do not cause PCI ¾ ECC protected outage Bridge ¾ Hot add ¾ Selective memory mirror capability to recover partition from dimm failures InfiniBand® Interface ¾ Hardware assisted scrubbing ¾ Redundant paths ¾ SUE handling

¾ Dynamic sparing on channel interface PCI Adapter ¾ PowerVM Hypervisor protected from full DIMM failures © Copyright IBM Corporation 2011 58 IBM Power Systems value proposition  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

Deliver business value by leveraging technology

Performance Flexibility Reliability Affordability + –

. . . the highest value at the lowest risk with leading technology

© Copyright IBM Corporation 2011 59 Session Evals in 3 Easy Steps  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

1 Go to: http://ibmtechu.com/up

2 Select Register button and complete the form (One time only)

© Copyright IBM Corporation 2011 60 Session Evals in 3 Easy Steps  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity

3 Select “Session Evals” button and complete the online form

Session Code: HE01

Fantastic

© Copyright IBM Corporation 2011 61 Stay Connected & Continue Skills Transfer via IBM Training  PowerPower SystemsSystems TechnicalTechnical UniversityUniversity ƒ Training paths ƒ Social media What to take, Join the when to take it conversation ƒ Custom catalog ƒ RSS feeds Create a catalog Up-to-date that meets your information on interest areas the training you need ƒ New to ƒ IBM Training Instructor Led News Online (ILO)? Targeted to Take a free your needs test drive!

ƒ Education Packs Online discount program for ALL IBM Training courses for your company Questions? Email Lisa Ryan ([email protected])

© Copyright IBM Corporation 2011 62