Bringing Heterogeneous Multiprocessors Into the Mainstream

July 27, 2009
Brian Flachs, Cell/B.E. Architect

Godfried Goldrian, Peter Hofstee, Jörg-Stephan Vogt
IBM Systems and Technology Group

Microprocessor Trends

[Chart: Log(Performance) vs. time (2005, 2015(?), 2025(??), 2035(???)), showing successive curves for Single Thread, Multi-Core, Hybrid, and Special Purpose designs, each extended by more active transistors and higher frequency.]

Page 2 Heterogeneity
. Spreading tasks within a multiprocessor
. Hope to divide tasks according to their characteristics
 – Dynamic branch prediction valuable vs. static prediction + conditional move
 – Caches effective vs. memory latency tolerance
 – Tight data dependencies vs. data-independent control flow/address streams
. Design the processor types differently for efficiency
 – Accelerators
 – Control plane: high voltage, small register file, short pipeline, low frequency, with dynamic branch prediction and large caches
 – Data plane: low voltage, large register file, long pipeline, high frequency, with memory latency tolerance

Page 3 SPE Highlights
. User-mode architecture
 – No need to run the O/S
 – No translation/protection within the SPU
 – DMA uses the full Power Architecture protection/translation
. Not just a coprocessor – has its own PC
 – RISC-like organization
 – 32-bit fixed-width instructions
 – Dual issue, 11-FO4 design
 – Broad set of operations (8/16/32/64 bit)
 – VMX-like SIMD dataflow
 – Graphics SP float, IEEE DP float
. Large unified register file
 – 128 entries x 128 bits (integer & FP)
 – Deep unrolling to cover unit latencies
. 256 KB Local Store
. Flexible DMA engine
 – Improves effective memory bandwidth by improving latency tolerance and the utilization of moved data
 – Vector load/store with scatter/gather
. 14.5 mm² die (90nm SOI)
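The flexible DMA engine's value comes from overlapping transfers with computation. A minimal double-buffering sketch of that pattern, with a plain memcpy standing in for an asynchronous mfc_get-style DMA so it runs anywhere (CHUNK, dma_get, and sum_stream are illustrative names, not Cell SDK APIs):

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4096

/* Stand-in for an asynchronous DMA get (mfc_get on a real SPE).
 * Here it is a plain memcpy so the sketch runs on any host. */
static void dma_get(void *local, const char *remote, size_t n) {
    memcpy(local, remote, n);
}

/* Double buffering: while the current chunk is processed out of one
 * local buffer, the next chunk is already being transferred into the
 * other, hiding memory latency behind computation. */
long sum_stream(const char *src, size_t n) {
    static char buf[2][CHUNK];
    long total = 0;
    size_t done = 0;
    int cur = 0;

    size_t first = n < CHUNK ? n : CHUNK;
    dma_get(buf[cur], src, first);            /* prime buffer 0 */

    while (done < n) {
        size_t len = (n - done) < CHUNK ? (n - done) : CHUNK;
        size_t next = (n - done - len) < CHUNK ? (n - done - len) : CHUNK;
        if (next)                             /* start fetching the next chunk */
            dma_get(buf[cur ^ 1], src + done + len, next);
        for (size_t i = 0; i < len; i++)      /* process the current chunk */
            total += buf[cur][i];
        done += len;
        cur ^= 1;                             /* swap buffers */
    }
    return total;
}
```

On a real SPE the DMA call would return immediately and the loop would wait on a tag group before touching the buffer; the loop structure is what carries over.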

Page 4 Cell Broadband Microprocessor Highlights
. SMP on a chip
 – 1 PowerPC core: L1 (32kB + 32kB), L2 (512kB)
 – 8 SPUs: 256 kB Local Store each, 4-way single-precision FP support
. High-bandwidth memory
 – 25.6 GB/s (data)
 – Dual Rambus XDR DRAM, 3.2 Gbit/s
. Two configurable I/O interfaces
 – Up to 35 GB/s out, up to 25 GB/s in
 – Coherent interface or I/O
 – 5 Gbit/s

Page 5 Memory Managing Processor vs. Traditional General Purpose Processor

[Diagram: the Cell/B.E. memory-managing processor (IBM) contrasted with traditional general-purpose processors (AMD, Intel).]

Page 6 Hybrid Systems: The Making of Roadrunner – Key Building Blocks

[Diagram: new software (SDK 3.1, Linux, DaCS and ALF libraries), new hardware (TriBlade: QS22 + QS22 + LS21 with LS21 expansion), open-source bring-up, test and benchmarking; 17 Connected Units, 278 racks.]

. Systems integration: firmware, device drivers
. A new programming model, extended from the Los Alamos National Laboratory standard for cluster computing (innovation derived through close collaboration with LANL)
. Hybrid and heterogeneous hardware
. Built around BladeCenter and industry-standard IB-DDR (InfiniBand)

Page 7 Efficiency Benefits

. More efficient use of memory bandwidth
. More performance per area
. More performance per watt
. Fewer modules, boards, racks, etc.
. Larger portion of the interconnect on chip
. More computation per soft fail / checkpoint

Page 8 System Level Efficiency

. Power efficiency is an important problem

. Key areas of attack:
 – Most efficient ops/watt operating point: voltage-frequency scaling; match the critical functional path to the low-speed Vmin; low-FO4 design maximizes performance per unit area and leakage
 – In-system voltage selection: reduce voltage
 – Lower-temperature operating point: improves circuit performance, reduces leakage
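The payoff of the voltage-related attack points follows from the dynamic-power relation P ≈ C·V²·f: a modest supply-voltage reduction pays off quadratically. A small sketch of that arithmetic (the numbers are illustrative, not Cell measurements):

```c
/* Dynamic CMOS power: P = C_eff * V^2 * f.
 * Returns watts given effective switched capacitance (farads),
 * supply voltage (volts), and clock frequency (hertz). */
double dynamic_power(double c_eff, double v, double f) {
    return c_eff * v * v * f;
}

/* Ratio of dynamic power after scaling voltage by factor k with the
 * frequency held fixed: scales as k^2, which is why shaving even a
 * little voltage margin in-system is worth the effort. */
double voltage_scaling_ratio(double k) {
    return k * k;
}
```

For example, running at 90% of nominal voltage at the same frequency leaves only 81% of the dynamic power.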

Page 9 Efficient Operating Point

. On the left side of the graph the voltage is fixed; on the right side the voltage increases to support the operating frequency
. Want to optimize the different processor types separately

Page 10 In-System Voltage Determination . Power systems have margin  Voltage is time variant  Voltage is different from voltage set point . Chip has voltage guard band  Noise  End-of-life  System-to-tester correlation margin . Don’t want the sum . Build card inc VRM + Processor  Set desired operating frequency . Start at high voltage and successively reduce voltage  Until challenging workload stops running . Add necessary guard band . Saves 16 W per processor

Page 11 QPACE Processor Node Card

[Photo: node card with IBM PowerXCell(TM) 8i processor, DDR2 memory (1 Gbit each), FPGA (Virtex 5), PMC Sierra PHYs, and VRMs.]

Dimensions: 330mm x 155mm

Page 12 QPACE Processor Node Cooling

[Diagram: heat flow from the components through thermal interface material (TIM) pads for the memory modules, VRMs, FPGA, and PHYs.]

. Height-adjustable heatpath for tolerance compensation
. TIM is applied on the processor module and on the small side of the heatpath
. TIM area for the thermal interface between the Processor Node and the Cold Plate

Page 13 QPACE Cold Plate, Processor Node, Root Card

[Diagram: water flow in and out of a cold plate serving 32 processor nodes; 16 Processor Nodes and 1 Root Card mount on each side of the Cold Plate.]

Page 14 QPACE Rack

. The QPACE project is a research collaboration of IBM Development and European universities and research institutes, with the goal of building a prototype of a Cell-processor-based supercomputer.

. The major part of this project is funded by the German Research Foundation (DFG – Deutsche Forschungsgemeinschaft) as part of a Collaborative Research Center (SFB – Sonderforschungsbereich TR55).

Page 15 Green 500

Water cooling: the temperature drop from processor junction to warm water is <30°C. No chiller is required!

Estimated power efficiency of QPACE: rack power 30 kW, Linpack 18 TFLOPS => 600 MFLOPS/W
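The quoted efficiency is just the ratio of the two rack figures; a one-function check of the unit conversion:

```c
/* Power efficiency from the QPACE rack figures quoted above:
 * Linpack throughput in TFLOPS over rack power in kilowatts,
 * expressed as MFLOPS per watt. */
double mflops_per_watt(double tflops, double kilowatts) {
    double mflops = tflops * 1e6;     /* 1 TFLOPS = 1e6 MFLOPS */
    double watts  = kilowatts * 1e3;  /* 1 kW = 1e3 W */
    return mflops / watts;
}
```

With the slide's numbers, mflops_per_watt(18.0, 30.0) gives the quoted 600 MFLOPS/W.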


Page 16 Future Compilation Strategy for Accelerators
. Idle accelerators are not efficient
 – Programmer effort can limit utilization of accelerators
 – ... which limits specialization of accelerators
 – ... and limits investment in accelerators

. Want to support standard languages: C, C++, Fortran, OpenMP, OpenCL
. Want compilation to be mostly transparent
 – Need to honor programmer directives
 – OK for programmers to sacrifice portability for better results in high-value/high-benefit codes

. Assume a different architecture/micro-architecture benefits from, or requires, different binaries
 – Instruction scheduling, execution latencies
 – Special instructions & operations

. Compile everything for all cores
 – Analyze compiler output for core-type assignment metrics; also have source-level pragmas
 – Might have different sources for different processor types

. Fat binaries can run anything anywhere
 – At least via RPC/migration
 – Cost metrics for execution migration
 – Load balancing & task-queue architecture

Page 17 Heterogeneous Code-Gen Problems

. Function pointers
 – Translation table from the function-pointer address space to actual entry points
 – All constructor code runs the same on all cores
 – Adds an extra path to C++ virtual method calls
 – Self-modifying code? Architecture-independent representation: LLVM

. Data pointers
 – Shared address space
 – Coherency management

. Data formats
 – Little endian / big endian, etc.
 – Conversions probably impact source
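One way to realize the translation-table idea for function pointers is to store a logical function id instead of a raw address, and index a per-core-type table of entry points. Constructor-written pointers then mean the same thing on every core, at the cost of one extra indirection, which is also the extra path on C++ virtual calls. A hypothetical sketch, not the actual Cell toolchain mechanism (all names are illustrative):

```c
/* Hypothetical core types in a heterogeneous fat binary. */
enum core_type { CORE_CONTROL = 0, CORE_DATA = 1, NUM_CORE_TYPES };

typedef int (*fn_t)(int);

/* Per-core-type entry points for the same logical function: the fat
 * binary carries one compiled body per core type. */
static int square_control(int x) { return x * x; }   /* control-plane build */
static int square_data(int x)    { return x * x; }   /* data-plane build */

/* Translation table: logical function id -> actual entry point per
 * core type. A "function pointer" stored in shared data is just the
 * logical id, so it is valid on every core type. */
static fn_t entry_table[NUM_CORE_TYPES][1] = {
    [CORE_CONTROL] = { square_control },
    [CORE_DATA]    = { square_data },
};

/* Indirect call: one extra table lookup on top of the normal call. */
int call_indirect(enum core_type running_on, int fn_id, int arg) {
    return entry_table[running_on][fn_id](arg);
}
```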

Page 18 Making Cell Easier to Program

. Software ICache

. Standard language support
 – OpenMP
 – OpenCL

Page 19 Soft I-Cache Summary
. Up to ½ GB of code
. Normal tool-chain flow
 – No detailed knowledge required on the part of the developer
. 'Small' changes to the ABI – good interoperability with old source
 – New virtual address space for code: 32-bit function pointers
 – Indirect branches require a tag check
. Runtime edits branches to go directly to their targets when they are in cache – no overhead in the hit case
 – Less than 10% total runtime penalty for running in small caches; still working to improve
 – Verified on QS22 & PXCAB hardware
. Support code lives outside the cache structure

[Diagram: tool-chain flow .c -> compiler -> .s -> assembler -> .o -> linker -> exe/lib -> runtime system. The compiler divides code into blocks smaller than the line size, ending each with an unconditional branch, and annotates branches with importance and return-stack information; the linker analyzes the whole program to determine a cache layout that minimizes cache conflicts on important paths.]

Page 20 A Cache Block

[Diagram: a cache block holds code ending in a branch, constants, and stub(s); each stub contains the fields isa, ptr, brsl, xor.]

. Linker creates a "stub" at the end of the cache line for each immediate branch
. Stub has 4 word-sized fields:
 – Address of the target
 – Pointer to the branch
 – Trampoline
 – XOR pattern
. When a line is brought from memory into the cache, its immediate branches branch to a brsl (the trampoline, T) in the stub, which jumps to the runtime entry point
 – The brsl records a pointer to the stub
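The four word-sized stub fields can be pictured as a small struct at the end of the cache line; the field names below are illustrative, not those of the actual runtime:

```c
#include <stdint.h>

/* Illustrative layout of the four word-sized fields the linker places
 * in each stub at the end of a cache line. */
struct icache_stub {
    uint32_t target_addr;   /* address of the branch target */
    uint32_t branch_ptr;    /* pointer back to the branch instruction */
    uint32_t trampoline;    /* brsl into the runtime miss handler */
    uint32_t xor_pattern;   /* turns branch-to-stub into branch-to-target */
};
```

Four 32-bit words, so each stub occupies exactly one 16-byte quadword.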

Page 21 Rewriting

. Runtime patches the branch to point to its target
 – Load & rotate the XOR pattern: converts the branch-to-stub into a branch-to-target
 – Load the quadword pointed to by the stub (it contains the branch)
 – XOR
 – Store back the quadword with the patched branch
. The next time, the branch goes directly to its intended target – no extra delay
. Easy because the cache is direct-mapped
. Branch hints work!
. Must un-rewrite if the target is evicted
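The rewrite works because XOR is its own inverse: the linker precomputes pattern = (branch-to-stub word) XOR (branch-to-target word), so a single load/XOR/store patches the branch, and applying the same pattern again un-rewrites it on eviction. A sketch on single 32-bit instruction words (the real runtime stores back a whole quadword; the bit patterns in the test are made up):

```c
#include <stdint.h>

/* Patch an in-cache branch instruction using the stub's precomputed
 * XOR pattern: pattern = insn_branch_to_stub ^ insn_branch_to_target. */
void rewrite_branch(uint32_t *insn, uint32_t xor_pattern) {
    *insn ^= xor_pattern;   /* branch-to-stub -> branch-to-target */
}

/* Applying the same pattern again restores the original word, which
 * is exactly the un-rewrite needed when the target is evicted. */
void unrewrite_branch(uint32_t *insn, uint32_t xor_pattern) {
    *insn ^= xor_pattern;   /* branch-to-target -> branch-to-stub */
}
```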

Page 22 Performance Relative to LS Static

. Miss penalty = 400-cycle access to main memory, performed in parallel with the cache-management software

Page 23 OpenMP and OpenCL

OpenMP                          OpenCL
CPUs                            GPUs
MIMD                            SIMD
Scalar code bases               Scalar/vector code bases
Parallel for loop               Data parallel
Shared memory model             Distributed shared memory model

OpenMP (CPU, SPE)
. Pros
 – Single source
 – Annotated scalar code base
 – Established in the market
. Cons
 – Doesn't always scale well
 – Only supports shared memory models
 – Relies heavily on the compiler

OpenCL (GPU)
. Pros
 – Single source
 – Portable vector language
 – Explicit memory control
. Cons
 – Unproven
 – Host API calls
 – GPU texture engine exposed

Page 24 OpenCL Memory Model

. Shared memory model
 – Release consistency
. Multiple distinct address spaces
 – Address spaces can be collapsed depending on the device's memory subsystem
. Address qualifiers
 – __private
 – __local
 – __constant and __global
. Example: __global float4 *p;

[Diagram: each WorkItem (1..M) has Private Memory; each Compute Unit (1..N) has Local Memory; the Compute Device has a Global/Constant Memory Data Cache backed by Global Memory in Compute Device Memory.]

Page 25 Power with VMX
. Unified host and device memory
 – Zero copies between them
. Collapsed local and private memory
 – Locked cache range
. Cached global memory

[Diagram: WorkItems 1..N map onto Power cores 1..N, one compute unit each, with a Global/Constant Memory Data Cache per core; host memory, device global memory, local memory, and private memory all reside in system memory.]

Page 26 PCIe-Attached GPU
. Separate host, device, and local address spaces
 – Explicit copies between them

[Diagram: an x86 host with host memory in system memory connects over PCIe to the GPU compute device; each WorkItem (1..N) has private memory and each compute unit has local memory on the device; device global memory resides in device memory, separate from system memory.]

Page 27 Cell Broadband Engine
. Unified host and device memory
 – Zero copies between them

[Diagram: the PPE is the host; SPEs 1..N are the compute units, with each WorkItem's private memory and each compute unit's local memory living in the SPE Local Store; host memory and device global memory share system memory.]

Page 28 Summary
. What we have done:
 – Broken new barriers in performance and efficiency: Top 500 #1, Green 500 #1
 – Proven commercial viability: Repsol, Mentor Graphics, Broadcast International, Hitachi, Hoplon, ...

. What is next:
 – Continued focus on efficiency
 – Increasing focus on standards-based programming: Software ICache for Cell/B.E.; OpenMP & OpenCL for Cell/B.E. and other processors; ...
 – Increasing focus on ease of use: make accelerators "invisible" for most customers; commercial applications, not just HPC; not an easy thing to do
 – Continue to broaden application reach for Cell and hybrid systems

Page 29 Next Era of Innovation – Hybrid Computing
The Next Bold Step in Innovation & Integration

[Diagram: the Symmetric Multiprocessing Era (today: p6, p7; "technology out", driven by cores/threads) evolving into the Hybrid Computing Era (pNext 1.0, pNext 2.0; "market in", driven by workload consolidation), converging the traditional, Cell (computational throughput), and BlueGene product lines.]

Page 30

All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.

Special notices

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area.

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.

All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

Page 31 Special notices (cont.)

The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L(logo), alphaWorks, AS/400, BladeCenter, Blue Gene, Blue Lightning, C Set++, CICS, CICS/6000, ClusterProven, CT/2, DataHub, DataJoiner, DB2, DEEP BLUE, developerWorks, DirectTalk, Domino, DYNIX, DYNIX/ptx, e business(logo), e(logo)business, e(logo)server, Enterprise Storage Server, ESCON, FlashCopy, GDDM, i5/OS, IBM, IBM(logo), .com, IBM Business Partner (logo), Informix, IntelliStation, IQ-Link, LANStreamer, LoadLeveler, Lotus, Lotus Notes, Lotusphere, Magstar, MediaStreamer, Micro Channel, MQSeries, Net.Data, Netfinity, NetView, Network Station, Notes, NUMA-Q, OpenPower, Operating System/2, Operating System/400, OS/2, OS/390, OS/400, Parallel Sysplex, PartnerLink, PartnerWorld, Passport Advantage, POWERparallel, Power PC 603, Power PC 604, PowerPC, PowerPC(logo), Predictive Failure Analysis, pSeries, PTX, ptx/ADMIN, RETAIN, RISC System/6000, RS/6000, RT Personal Computer, S/390, Scalable POWERparallel Systems, SecureWay, Sequent, ServerProven, SpaceBall, System/390, The Engines of e-business, THINK, Tivoli, Tivoli(logo), Tivoli Management Environment, Tivoli Ready(logo), TME, TotalStorage, TURBOWAYS, VisualAge, WebSphere, xSeries, z/OS, zSeries.

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: Advanced Micro-Partitioning, AIX 5L, AIX PVMe, AS/400e, Chiphopper, Chipkill, Cloudscape, DB2 OLAP Server, DB2 Universal Database, DFDSM, DFSORT, DS4000, DS6000, DS8000, e-business(logo), e-business on demand, eServer, Express Middleware, Express Portfolio, Express Servers, Express Servers and Storage, General Purpose File System, GigaProcessor, GPFS, HACMP, HACMP/6000, IBM TotalStorage Proven, IBMLink, IMS, Intelligent Miner, iSeries, Micro-Partitioning, NUMACenter, On Demand Business logo, POWER, PowerExecutive, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, PowerPC 603, PowerPC 603e, PowerPC 604, PowerPC 750, POWER2, POWER2 Architecture, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, pure XML, Redbooks, Sequent (logo), SequentLINK, Server Advantage, ServeRAID, Service Director, SmoothStart, SP, System i, System i5, System p, System p5, System Storage, System z, System z9, S/390 Parallel Enterprise Server, Tivoli Enterprise, TME 10, TotalStorage Proven, Ultramedia, VideoCharger, Virtualization Engine, Visualization Data Explorer, X-Architecture, z/Architecture, z/9.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. UNIX is a registered trademark of The Open Group in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Itanium, Pentium are registered trademarks and Xeon is a trademark of Intel Corporation or its subsidiaries in the United States, other countries or both. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries or both. TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). NetBench is a registered trademark of Ziff Davis Media in the United States, other countries or both. AltiVec is a trademark of Freescale Semiconductor, Inc. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product and service names may be trademarks or service marks of others.

Page 32