Bringing Heterogeneous Multiprocessors Into the Mainstream
July 27, 2009
Brian Flachs, Cell/B.E. Architect
Godfried Goldrian, Peter Hofstee, Jörg-Stephan Vogt
IBM Systems and Technology Group

Microprocessor Trends
[Chart: Log(Performance) vs. time (2005, 2015(?), 2025(??), 2035(???)). Successive curves for Single Thread, Multi-Core, Hybrid, and Special Purpose designs, each driven by more active transistors and higher frequency.]
Page 2
Heterogeneity
. Spreading tasks within a multiprocessor
. Hope to divide tasks according to their characteristics:
  Dynamic branch prediction valuable vs. static prediction + conditional move
  Caches effective vs. memory latency tolerance
  Tight data dependencies vs. data-independent control flow/address stream
. Design efficiency processors and accelerators differently:
  Control plane: high voltage, small register file, short pipeline, low frequency, with dynamic branch prediction and large caches
  Data plane: low voltage, large register file, long pipeline, high frequency, with memory latency tolerance
Page 3
SPE Highlights
. User-mode architecture
  No need to run the O/S
  No translation/protection within the SPU
  DMA uses full Power Architecture protection/translation
. Not just a coprocessor: has its own program counter
  RISC-like organization, 32-bit fixed-width instructions
  Dual issue, 11-FO4 design
  Broad set of operations (8/16/32/64-bit)
  VMX-like SIMD dataflow
  Graphics SP float, IEEE DP float
. Large unified register file
  128 entries x 128 bits (integer & FP)
  Deep unrolling to cover unit latencies
. 256 KB Local Store
  Vector load/store with scatter/gather
. Flexible DMA engine
  Improves effective memory bandwidth
  – improving latency tolerance
  – improving utilization of moved data
. 14.5 mm² (90 nm SOI)
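The latency-tolerance idea behind the flexible DMA engine is classically expressed as double buffering: the transfer of block i+1 overlaps computation on block i. A minimal sketch, assuming hypothetical `dma_get`/`dma_wait` stand-ins (here a plain memcpy) for the real asynchronous MFC DMA commands:

```c
#include <string.h>

enum { BLK = 4 };   /* elements per DMA block (illustrative size) */

/* Stand-ins for asynchronous DMA: real MFC commands return immediately
   and completion is awaited by tag. Here the "transfer" is synchronous. */
static void dma_get(float *dst, const float *src) {
    memcpy(dst, src, BLK * sizeof *dst);
}
static void dma_wait(int tag) { (void)tag; }

/* Sum an array in BLK-sized chunks using two local-store buffers:
   while block i is consumed, block i+1 is already in flight. */
float sum_double_buffered(const float *main_mem, int nblocks) {
    float buf[2][BLK];
    float total = 0.0f;
    dma_get(buf[0], main_mem);                         /* prefetch block 0 */
    for (int i = 0; i < nblocks; i++) {
        if (i + 1 < nblocks)                           /* start next fetch */
            dma_get(buf[(i + 1) & 1], main_mem + (i + 1) * BLK);
        dma_wait(i & 1);                               /* block i is ready */
        for (int j = 0; j < BLK; j++)
            total += buf[i & 1][j];
    }
    return total;
}

/* Small demo over two blocks of known data. */
float sum_demo(void) {
    float a[2 * BLK] = {1, 2, 3, 4, 5, 6, 7, 8};
    return sum_double_buffered(a, 2);
}
```

The buffer index alternates with the block parity, which is all a direct-mapped double buffer needs.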
Page 4
Cell Broadband Engine Microprocessor Highlights
. SMP on a chip
  1 PowerPC core
  – L1 (32 kB + 32 kB)
  – L2 (512 kB)
  8 SPUs
  – 256 kB Local Store each
  – 4-way single-precision FP support
. High-bandwidth memory
  25.6 GB/s (data)
  Dual Rambus XDR DRAM at 3.2 Gbit/s
. Two configurable I/O interfaces
  Up to 35 GB/s out, up to 25 GB/s in
  Coherent interface or I/O, 5 Gbit/s
Page 5
Memory Managing Processor vs. Traditional General Purpose Processor
[Comparison chart: Cell/B.E. (IBM) as a memory-managing processor vs. traditional general-purpose processors (AMD, Intel).]
Page 6
Hybrid Systems: The Making of Roadrunner – Key Building Blocks
. New software: SDK 3.1, Linux, libraries (DaCS, ALF), open source
. New hardware: TriBlade (QS22 + QS22 + LS21 + expansion), 17 Connected Units, 278 racks
. Bring-up, test, and benchmarking
. Systems integration: firmware, device drivers
. A new programming model, extended from the Los Alamos National Laboratory standard for cluster computing (innovation derived through close collaboration with LANL)
. Hybrid and heterogeneous hardware
. Built around BladeCenter and industry-standard IB-DDR
Page 7
Efficiency Benefits
. More efficient use of memory bandwidth
. More performance per area
. More performance per watt
. Fewer modules, boards, racks, etc.
. Larger portion of interconnect on chip
. More computation per soft-fail / checkpoint
Page 8
System Level Efficiency
. Power efficiency is an important problem
. Key areas of attack:
  Most efficient ops/watt operating point
  – Voltage/frequency scaling
  – Match critical functional path to low-speed Vmin
  – Low-FO4 design maximizes performance per unit area and leakage
  In-system voltage selection to reduce voltage
  Lower-temperature operating point
  – Improves circuit performance, reduces leakage
Page 9
Efficient Operating Point
. On the left side of the graph voltage is fixed; on the right side voltage increases to support the operating frequency
. Want to optimize the different processor types separately
Page 10
In-System Voltage Determination
. Power systems have margin
  Voltage is time-variant
  Voltage differs from the voltage set point
. Chip has a voltage guard band
  Noise, end-of-life, system-to-tester correlation margin
. Don't want to pay for the sum of both margins
. Build the card, including VRM + processor, and set the desired operating frequency
. Start at high voltage and successively reduce voltage until a challenging workload stops running
. Add back the necessary guard band
. Saves 16 W per processor
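The search loop above can be sketched in a few lines. This is a simulation only: `workload_passes` and the 900 mV pass threshold are hypothetical stand-ins for actually running a stress workload on the card at each voltage step.

```c
/* Assumed lowest voltage (mV) at which the stress workload still runs. */
#define VMIN_PASS_MV 900

/* Hypothetical stand-in for running the challenging workload on hardware. */
static int workload_passes(int mv) {
    return mv >= VMIN_PASS_MV;
}

/* Successively reduce voltage until the workload would fail, then add
   back the necessary guard band to the last passing voltage. */
int select_voltage_mv(int start_mv, int step_mv, int guard_mv) {
    int mv = start_mv;
    while (mv - step_mv > 0 && workload_passes(mv - step_mv))
        mv -= step_mv;
    return mv + guard_mv;
}
```

Because the guard band is added to the measured minimum rather than to a worst-case tester number, the per-part operating voltage (and hence power) drops, which is where the 16 W per processor comes from.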
Page 11
QPACE Processor Node Card
. Components: IBM PowerXCell(TM) 8i processor, DDR2 memory (1 Gbit each), Xilinx Virtex 5 FPGA, PMC Sierra PHYs, VRMs
. Dimensions: 330 mm x 155 mm
Page 12
QPACE Processor Node Cooling
. TIM (thermal interface material) for the memory modules, VRM, FPGA, and PHYs; heat flows into the cold plate
. Height-adjustable heatpath for tolerance compensation
. TIM is applied on the processor module and on the small side of the heatpath
. The TIM area forms the thermal interface between the Processor Node and the Cold Plate
Page 13
QPACE Cold Plate
. Water flows in and out of a cold plate that cools 32 processor nodes
. 16 Processor Nodes and 1 Root Card on each side of the Cold Plate
Page 14
QPACE Rack
. The QPACE project is a research collaboration of IBM Development and European universities and research institutes, with the goal of building a prototype of a Cell processor-based supercomputer.
. The major part of this project is funded by the German Research Foundation (DFG – Deutsche Forschungsgemeinschaft) as part of a Collaborative Research Center (SFB – Sonderforschungsbereich TR55).
Page 15
Green 500
. Water cooling: temperature drop from processor junction to warm water is <30°C. No chiller is required!
. Estimated power efficiency of QPACE: rack power 30 kW, Linpack 18 TFLOPS => 600 MFLOPS/W
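The efficiency figure follows directly from the two numbers above; a one-line check of the arithmetic:

```c
/* Rack efficiency in MFLOPS per watt: sustained FLOPS divided by rack
   power, scaled from FLOPS/W to MFLOPS/W. */
double mflops_per_watt(double flops, double watts) {
    return flops / watts / 1e6;
}
```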
Page 16
Future Compilation Strategy for Accelerators
. Idle accelerators are not efficient
  Programmer effort can limit utilization of accelerators,
  which in turn limits specialization of, and investment in, accelerators
. Want to support standard languages: C, C++, Fortran, OpenMP, OpenCL
. Want it mostly transparent
  Need to honor programmer directives
  OK for programmers to sacrifice portability for better results in high-value codes
. Assume each architecture / micro-architecture benefits from, or requires, different binaries
  Instruction scheduling, execution latencies
  Special instructions and operations
. Compile everything for all cores
  Analyze compiler output for core-type assignment metrics; allow source-level pragmas
  Might have different sources for different processor types
. Fat binaries can run anything anywhere
  At least via RPC/migration
  Cost metrics for execution migration
  Load balancing and a task-queue architecture
Page 17
Heterogeneous Code-Gen Problems
. Function pointers
  Translation table from function-pointer address space to actual entry points
  – All constructor code runs the same on all cores
  – Adds an extra path to C++ virtual method calls
  Self-modifying code?
  Architecture-independent IR: LLVM
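One way to read the translation-table idea: a function "pointer" becomes a small index into a table of per-core-type entry points, so the same fat binary can dispatch an indirect call on whichever core it runs on. A hypothetical sketch (the slot layout, core-type enum, and demo functions are illustrative, not the deck's actual ABI):

```c
#include <stddef.h>

/* Illustrative core types for a two-kind heterogeneous chip. */
enum core_type { CORE_PPE, CORE_SPE, NUM_CORE_TYPES };

typedef int (*fn_t)(int);

/* One table slot per logical function: an entry point per core type. */
struct fn_slot { fn_t entry[NUM_CORE_TYPES]; };

/* Tiny stand-ins for the per-core compiled versions of one function. */
static int twice_ppe(int x) { return 2 * x; }
static int twice_spe(int x) { return 2 * x; }

static struct fn_slot fn_table[] = {
    { { twice_ppe, twice_spe } },   /* slot 0: "twice" */
};

/* Indirect call: translate (slot, running core type) to the real entry. */
int call_indirect(size_t slot, enum core_type core, int arg) {
    return fn_table[slot].entry[core](arg);
}
```

This also shows why the slide flags the cost: every indirect call, including C++ virtual dispatch, now goes through one extra table lookup.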
. Data pointers
  Shared address space
  Coherency management
. Data formats
  Little endian / big endian, etc.
  Conversions probably impact source
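The endianness conversion in question is the familiar byte swap applied wherever data crosses between little-endian and big-endian cores, for example:

```c
#include <stdint.h>

/* Swap the byte order of a 32-bit value (little-endian <-> big-endian). */
uint32_t bswap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

The swap itself is cheap; the slide's point is that knowing *which* loads and stores need it generally requires type information, which is why conversions tend to leak into the source.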
Page 18
Making Cell Easier to Program
. Software ICache
. Standard language support
  OpenMP
  OpenCL
Page 19
Soft I-Cache Summary
. Up to ½ GB of code
. Normal tool-chain flow
  No detailed knowledge required on the part of the developer
. 'Small' changes to the ABI – good operability with old source
  New virtual address space for code
  – 32-bit function pointers
  – Indirect calls require a tag check
. Runtime edits branches to go directly to their targets when they are in cache – no overhead in the hit case
. Less than 10% total runtime penalty for running in small caches; still working to improve
. Verified on QS22 & PXCAB hardware
. Support code resides outside of the cache structure
Tool-chain flow: the compiler (.c -> .s) divides the code, using branch-always, into blocks smaller than the line size and annotates branches with importance and return-stack information; the assembler produces .o files; the linker analyzes the whole program to determine a cache layout that minimizes cache conflicts on important paths; the runtime system executes the resulting exe/lib.
Page 20
A Cache Block
[Line layout: code ... br | constants | stub(s); each stub holds isa (target address), ptr (branch pointer), brsl (trampoline), xor pattern]
. Linker creates a "stub" at the end of the cache line for each immediate branch
. Stub has 4 word-sized fields: address of the target, pointer to the branch, trampoline, XOR pattern
. When a line is brought from memory into the cache, its immediate branches branch to a brsl (the trampoline) in the stub, which jumps to the runtime entry point; the brsl records a pointer to the stub
Page 21
Rewriting
. Runtime patches the branch to point at its target
  Load the quadword pointed to by the stub (contains the branch)
  Load and rotate the XOR pattern
  XOR: converts the branch-to-stub into a branch-to-target
  Store back the quadword with the rewritten branch
. Next time the branch goes directly to its intended target with no extra delay
. Easy because the cache is direct-mapped
. Branch hints work!
. Must un-rewrite if the target is evicted
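The core of the trick is that XOR is its own inverse: if the linker precomputes the pattern as (branch-to-stub XOR branch-to-target), one XOR flips the instruction either way, which is exactly what both the rewrite and the un-rewrite on eviction need. A minimal sketch over bare 32-bit instruction words (the example encodings are arbitrary placeholders, not real SPU opcodes):

```c
#include <stdint.h>

typedef uint32_t insn_t;

/* Precomputed by the linker: the pattern stored in the stub. */
insn_t make_xor_pattern(insn_t branch_to_stub, insn_t branch_to_target) {
    return branch_to_stub ^ branch_to_target;
}

/* Applied by the runtime: flips branch-to-stub <-> branch-to-target.
   Applying it twice restores the original word (un-rewrite). */
insn_t rewrite(insn_t insn, insn_t pattern) {
    return insn ^ pattern;
}
```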
Page 22
Performance Relative to LS Static
. Miss penalty = 400-cycle access to main memory, in parallel with the cache-management software
Page 23
OpenMP and OpenCL
. OpenMP: CPUs; MIMD; scalar code bases; parallel for loop; shared memory model
. OpenCL: GPUs; SIMD; scalar/vector code bases; data parallel; distributed shared memory model
. Targets: CPU, SPE, GPU
. OpenMP pros: single source; annotated scalar code base; established in the market
. OpenMP cons: doesn't always scale well; only supports shared memory models; relies heavily on the compiler
. OpenCL pros: single source; portable vector language; explicit memory control
. OpenCL cons: unproven; host API calls; GPU texture engine exposed
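The "annotated scalar code base" on the OpenMP side can be seen in miniature: a plain C loop over shared memory becomes parallel with a single pragma (a minimal example, not from the deck; compile with OpenMP enabled, e.g. -fopenmp, or the pragma is simply ignored and the loop runs serially with the same result):

```c
/* Dot product: scalar code annotated for parallel execution.
   The reduction clause gives each thread a private partial sum that is
   combined at the end, so the shared-memory result is deterministic. */
double dot(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Small demo over fixed data. */
double dot_demo(void) {
    double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
    return dot(a, b, 4);
}
```

This is also the "cons" column in action: the annotation assumes one shared memory, and the compiler carries the whole burden of mapping the loop onto cores.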
Page 24
OpenCL Memory Model
. Shared memory model with release consistency
. Multiple distinct address spaces
  Address spaces can be collapsed depending on the device's memory subsystem
. Address qualifiers: __private, __local, __constant, __global
. Example: __global float4 *p;
[Diagram: a compute device with compute units 1..N; each compute unit has a local memory and work-items 1..M, each with private memory; a global/constant-memory data cache; global memory resides in compute device memory]
Page 25
Power with VMX
. Unified host and device memory: zero copies between them
. Collapsed local and private memory: a locked cache range
. Cached global memory
[Diagram: Power cores 1..N serve as the compute units, each running work-items with a global/constant-memory data cache; host memory, device global memory, local memory, and private memory all map onto system memory]
Page 26
PCIe-Attached GPU
. Separate host, device, and local address spaces: explicit copies between them
[Diagram: an x86 host with its system memory is connected over PCIe to the compute device; work-items run on compute units with private and local memory; device global memory resides in device memory on the card]
Page 27
Cell Broadband Engine
. Unified host and device memory: zero copies between them
[Diagram: the PPE is the host; SPEs 1..N are the compute units, each holding its local and private memory in its Local Store; host memory and device global memory both map onto system memory]
Page 28
Summary
. What we have done:
  Broken new barriers in performance and efficiency
  – Top 500 #1
  – Green 500 #1
  Proven commercial viability
  – Repsol, Mentor Graphics, Broadcast International, Hitachi, Hoplon, ...
. What is next:
  Continued focus on efficiency
  Increasing focus on standards-based programming
  – Software ICache for Cell/B.E.
  – OpenMP & OpenCL for Cell/B.E. and other processors
  Increasing focus on ease of use
  – Make accelerators "invisible" for most customers
  – Commercial applications, not just HPC
  – Not an easy thing to do
  Continue to broaden application reach for Cell and hybrid systems
Page 29
Next Era of Innovation – Hybrid Computing
The Next Bold Step in Innovation & Integration
[Roadmap: the Symmetric Multiprocessing Era (today: POWER6, POWER7), with traditional, throughput (BlueGene), and computational (Cell) lines, converging into the Hybrid Computing Era (pNext 1.0, pNext 2.0)]
. Symmetric Multiprocessing Era: technology out, driven by cores/threads
. Hybrid Computing Era: market in, driven by workload consolidation
Page 30
Special notices
All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients.
Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
Page 31 Special notices (cont.)
The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L(logo), alphaWorks, AS/400, BladeCenter, Blue Gene, Blue Lightning, C Set++, CICS, CICS/6000, ClusterProven, CT/2, DataHub, DataJoiner, DB2, DEEP BLUE, developerWorks, DirectTalk, Domino, DYNIX, DYNIX/ptx, e business(logo), e(logo)business, e(logo)server, Enterprise Storage Server, ESCON, FlashCopy, GDDM, i5/OS, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), Informix, IntelliStation, IQ-Link, LANStreamer, LoadLeveler, Lotus, Lotus Notes, Lotusphere, Magstar, MediaStreamer, Micro Channel, MQSeries, Net.Data, Netfinity, NetView, Network Station, Notes, NUMA-Q, OpenPower, Operating System/2, Operating System/400, OS/2, OS/390, OS/400, Parallel Sysplex, PartnerLink, PartnerWorld, Passport Advantage, POWERparallel, Power PC 603, Power PC 604, PowerPC, PowerPC(logo), Predictive Failure Analysis, pSeries, PTX, ptx/ADMIN, RETAIN, RISC System/6000, RS/6000, RT Personal Computer, S/390, Scalable POWERparallel Systems, SecureWay, Sequent, ServerProven, SpaceBall, System/390, The Engines of e-business, THINK, Tivoli, Tivoli(logo), Tivoli Management Environment, Tivoli Ready(logo), TME, TotalStorage, TURBOWAYS, VisualAge, WebSphere, xSeries, z/OS, zSeries.
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: Advanced Micro-Partitioning, AIX 5L, AIX PVMe, AS/400e, Chiphopper, Chipkill, Cloudscape, DB2 OLAP Server, DB2 Universal Database, DFDSM, DFSORT, DS4000, DS6000, DS8000, e-business(logo), e-business on demand, eServer, Express Middleware, Express Portfolio, Express Servers, Express Servers and Storage, General Purpose File System, GigaProcessor, GPFS, HACMP, HACMP/6000, IBM TotalStorage Proven, IBMLink, IMS, Intelligent Miner, iSeries, Micro-Partitioning, NUMACenter, On Demand Business logo, POWER, PowerExecutive, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, PowerPC 603, PowerPC 603e, PowerPC 604, PowerPC 750, POWER2, POWER2 Architecture, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER6+, pure XML, Redbooks, Sequent (logo), SequentLINK, Server Advantage, ServeRAID, Service Director, SmoothStart, SP, System i, System i5, System p, System p5, System Storage, System z, System z9, S/390 Parallel Enterprise Server, Tivoli Enterprise, TME 10, TotalStorage Proven, Ultramedia, VideoCharger, Virtualization Engine, Visualization Data Explorer, X-Architecture, z/Architecture, z/9.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org. UNIX is a registered trademark of The Open Group in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Itanium, Pentium are registered trademarks and Xeon is a trademark of Intel Corporation or its subsidiaries in the United States, other countries or both. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries or both. TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). NetBench is a registered trademark of Ziff Davis Media in the United States, other countries or both. AltiVec is a trademark of Freescale Semiconductor, Inc. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product and service names may be trademarks or service marks of others.
Page 32