Accelerators in Cray's Adaptive Supercomputing

NCSA’s Reconfigurable Systems Summer Institute

Dave Strenski, Application Analyst, [email protected]

What does a petaflop look like?

Theoretical Performance with 10 Gflop/s Nodes

[Figure: theoretical sustained Tflop/s (0 to 1000) versus number of nodes (1 to 100,000) for parallel fractions from 0.00000 to 1.00000, assuming 10 Gflop/s nodes.]
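The figure's message follows directly from Amdahl's Law. The sketch below is not from the original slides; it simply computes the sustained Tflop/s for 100,000 nodes of 10 Gflop/s each at several parallel fractions, with the node count and fractions taken from the plot axes.

    /* Amdahl's Law behind the plot: sustained Tflop/s for N nodes of
       10 Gflop/s each, given a parallel fraction p of the workload. */
    #include <stdio.h>

    static double sustained_tflops(double p, double nodes)
    {
        double speedup = 1.0 / ((1.0 - p) + p / nodes);  /* Amdahl's Law speedup */
        return 0.010 * speedup;                          /* one node = 10 Gflop/s = 0.010 Tflop/s */
    }

    int main(void)
    {
        const double fractions[] = { 1.0, 0.99999, 0.9999, 0.999, 0.99, 0.9 };
        const double nodes = 100000.0;                   /* 100,000 nodes -> 1 Pflop/s peak */

        for (int i = 0; i < 6; i++)
            printf("p = %.5f  ->  %8.1f Tflop/s\n",
                   fractions[i], sustained_tflops(fractions[i], nodes));
        return 0;
    }

Even a 0.001% serial fraction (p = 0.99999) roughly halves the sustained rate at this scale, which is why the next slide is about managing scalability.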

Supercomputing is increasingly about managing scalability

• Exponential increase with the advent of multi-core chips
• Systems with more than 100,000 processing cores
• 80+ core processors expected within the decade

[Chart: average number of processors per supercomputer (Top 20), 1993-2006, rising from 202 in 1993 to 10,073 in 2005 and 16,316 in 2006. Source: www.top500.org]

Opportunities to Exploit Heterogeneity

 Applications vary considerably in their demands
 Any HPC application contains some form of parallelism (the three cases are sketched in code below)
   • Many HPC apps have rich, SIMD-style data-level parallelism
      Can be significantly accelerated via vectorization
   • Those that don’t generally have rich thread-level parallelism
      Can be significantly accelerated via multithreading
   • Some parts of applications are not parallel at all
      Need fast serial scalar execution speed (Amdahl’s Law)
 Applications also vary in their communication needs
   • Required memory bandwidth and granularity
      Some codes work well out of cache, some don’t
   • Required network bandwidth and granularity
      Some are fine with message passing, some need shared memory
 No one processor/system design is best for all apps
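To make the three cases concrete, here is a minimal, hypothetical C sketch (not from the original slides; the function names and the OpenMP pragma are illustrative assumptions) of loop styles that favor vectorization, multithreading, and fast serial execution respectively.

    /* Illustrative only: three loop styles with very different hardware affinities. */

    /* 1) SIMD-style data-level parallelism: independent iterations, unit stride,
          no branches; a vectorizing compiler can map this onto vector hardware. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* 2) Thread-level parallelism: iterations are independent but irregular
          (indirect addressing), so multithreading pays off more than vectorization. */
    void gather_scale(int n, const int *idx, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            y[i] = 2.0 * x[idx[i]];
    }

    /* 3) Serial: a loop-carried dependence; only fast scalar execution helps
          (the Amdahl's Law term from the petaflop slide). */
    double running_product(int n, const double *x)
    {
        double p = 1.0;
        for (int i = 0; i < n; i++)
            p *= x[i];
        return p;
    }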

Increasingly Complex Application Requirements: Earth Sciences Example

Evolution of Computational Climate Simulation Complexity

Sources: Intergovernmental Panel on Climate Change, 2004, as updated by Washington, NCAR, 2005; NASA Report: Earth Sciences Vision 2030

Increased complexity and number of components lends itself well to a variety of processing technologies. Similar trends appear in astrophysics, nuclear engineering, CAE, etc.

“As scientific computing migrated toward commodity platforms, interconnect technology, both in terms of bandwidth and latency, became the limiting factor on application performance and continues to be a performance bottleneck.”
- James Hack (NCAR), quoted in a ComputerWorld article, 2/6/06

So, Can We Just Pack Chips with Flops?

 Key is making the system easily programmable
 Must balance peak computational power with generality
   • How easy is it to map high-level code onto the machine?
   • How easy is it for computation units to access global data?
 Some examples:
   • XD1 FPGAs
   • Clearspeed CSX600
   • IBM Cell
 Flop efficiency vs. generality/programmability spectrum:
   • Qualitative only; also influenced by the memory system

[Diagram: qualitative spectrum from more area/power-efficient to more programmable/general-purpose, placing FPGAs, Cell, Clearspeed, GP-GPU, streaming, vectors, BG/L, and multi-core µprocs along it.]

Cray XD1 FPGA Accelerators

Performance gains from FPGAs:
 RC5 Cipher Breaking
   • Implemented on Xilinx Virtex-II
   • 1000x faster than a 2.4 GHz P4
 Elliptic Curve Cryptography
   • Implemented on Xilinx Virtex-II
   • 895-1300x faster than a 1 GHz P3
 Vehicular Traffic Simulation
   • Implemented on Xilinx Virtex-II (XC2V6000) and Virtex-II Pro (XC2VP100)
   • 300x faster on the XC2V6000 than a 1.7 GHz Xeon
   • 650x faster on the XC2VP100 than a 1.7 GHz Xeon
 Smith-Waterman DNA matching
   • 28x faster than a 2.4 GHz Opteron

 Primary challenge is programming
 No general-purpose compiler available

Peak GFLOP/s per processor

            Opteron dual-core   Virtex-4 LX200   Virtex-5 LX330
            2.5 GHz             185 MHz          237 MHz
mult/add    10                  15.9             28.0
mult         5                  12.0             19.9
add          5                  23.9             55.3

www.fpgajournal.com/articles_2006/pdf/20061114_cray.pdf www.hpcwire.com/hpc/1195762.html
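As a rough sanity check on the table, peak GFLOP/s is just floating-point results per cycle times clock rate. The sketch below is not from the slides: the Opteron arithmetic is standard (2 cores, 1 add + 1 multiply per cycle, 2.5 GHz), while the FPGA unit count is back-derived from the table entry, since the real number depends on how many double-precision cores fit on the device (see the referenced papers).

    /* Peak GFLOP/s = (floating-point results per cycle) x (clock in GHz).
       The FPGA unit count below is back-derived from the table and is only
       illustrative; the Opteron figures are standard. */
    #include <stdio.h>

    static double peak_gflops(double results_per_cycle, double clock_ghz)
    {
        return results_per_cycle * clock_ghz;
    }

    int main(void)
    {
        /* Dual-core 2.5 GHz Opteron: 2 cores x (1 add + 1 mult) per cycle. */
        printf("Opteron  mult/add: %4.1f GFLOP/s\n", peak_gflops(2 * 2, 2.5));  /* 10.0 */

        /* Virtex-4 LX200 at 185 MHz: the 15.9 GFLOP/s entry implies roughly
           86 double-precision results per cycle fit on the device. */
        printf("Virtex-4 mult/add: %4.1f GFLOP/s\n", peak_gflops(86, 0.185));   /* ~15.9 */
        return 0;
    }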

Clearspeed CSX600

• 50 Gflop/s on the card
• 6 GB/s to on-card local memory (4 GB)
• 2+ GB/s to local host memory

• Doesn’t share memory with the host
• Mostly used for accelerating libraries (see the sketch below)
• No general-purpose compiler available
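To make “accelerating libraries” concrete: such a card is typically driven through host math libraries rather than user code, so an application keeps a standard call like the CBLAS DGEMM below and an accelerator-aware library decides whether to offload it. This is an illustrative sketch of the host-side pattern, not Clearspeed’s actual API; the matrix size and contents are arbitrary.

    /* Illustrative host-side use of an accelerated math library: the application
       calls standard BLAS, and an accelerator-aware implementation may offload
       the work to the card transparently. */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void)
    {
        const int n = 1024;
        double *a = calloc((size_t)n * n, sizeof *a);
        double *b = calloc((size_t)n * n, sizeof *b);
        double *c = calloc((size_t)n * n, sizeof *c);

        /* C = 1.0 * A * B + 0.0 * C  (double-precision matrix multiply) */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, a, n, b, n, 0.0, c, n);

        free(a); free(b); free(c);
        return 0;
    }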

Cell Processor

 Each chip contains:
   • One PowerPC core
   • Eight “synergistic processing elements” (SPEs)
 Targeted for: (1) Playstations, (2) HDTVs, (3) computing
 Lots of flops
   • 250 Gflops (32-bit)
   • ~25 Gflops (64-bit)
 25 GB/s to < 1 GB of memory
 Big challenge is programming (see the sketch below)
   • SPEs have no virtual memory
   • Can only access data in local 256 KB buffers
   • Requires alignment for good performance
 No general-purpose compiler available
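The local-store constraint shapes how SPE code has to be written. Below is a hedged sketch of that pattern only: stream data through a small, 128-byte-aligned buffer with explicit transfers. The dma_get/dma_put/dma_wait helpers are hypothetical stand-ins for the Cell MFC DMA intrinsics (stubbed out so the sketch compiles), not real SDK calls.

    /* Sketch of the programming pattern forced by the 256 KB SPE local store:
       all compute operates on a small local buffer filled by explicit DMA.
       dma_get/dma_put/dma_wait are hypothetical stand-ins, stubbed out here. */
    #include <stddef.h>

    #define CHUNK 16384   /* chunk must fit, with code and stack, in the 256 KB local store */

    static float buf[CHUNK] __attribute__((aligned(128)));   /* alignment matters for DMA performance */

    static void dma_get(void *local, unsigned long long ea, size_t bytes) { (void)local; (void)ea; (void)bytes; }
    static void dma_put(unsigned long long ea, const void *local, size_t bytes) { (void)ea; (void)local; (void)bytes; }
    static void dma_wait(void) { }

    void scale_stream(unsigned long long src_ea, unsigned long long dst_ea, size_t nchunks)
    {
        for (size_t c = 0; c < nchunks; c++) {
            dma_get(buf, src_ea + c * sizeof buf, sizeof buf);  /* pull a chunk into the local store */
            dma_wait();
            for (size_t i = 0; i < CHUNK; i++)                  /* compute only on local data */
                buf[i] *= 2.0f;
            dma_put(dst_ea + c * sizeof buf, buf, sizeof buf);  /* push the result back to main memory */
            dma_wait();
        }
    }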

Adaptive Supercomputing

Combines multiple processing architectures into a single, scalable system:
• Transparent interface: libraries, tools, compilers, scheduling, system management, runtime
• Processing technologies: scalar x86/64, vector, multithreaded, hardware accelerators
• Common interconnect, file systems, storage, and packaging

Adapt the system to the application – not the application to the system

Step 1 to Adaptive Supercomputing: Rainier Program -- Cray XT Infrastructure

[Diagram: cabinet layout mixing blade types; C = Compute, S = Service, A = Accelerator]

 Cray’s Rainier generation of products uses a common infrastructure:
   • Opteron-based service & I/O (SIO) blades
   • Cray SeaStar interconnect
   • Single global file system
   • Single point of login
   • Single point of administration
 Delivered with one or more types of compute resources:
   • Cray XT4 compute blades (scalar)
   • Cray XMT compute blades (multithreading)
   • “BlackWidow” compute cabinets (vector)
   • Hardware accelerators

The Cray XT Infrastructure allows customers to “mix-and-match” compute resources.

DARPA HPCS Program

Focused on providing a new generation of economically viable high-productivity computing systems for the national security and industrial user community in the 2010 timeframe

Performance (time-to-solution): speed up critical applications by factors of 10 to 40

Programmability (idea-to-first-solution): reduce cost and time for developing application solutions

Portability: insulate application software from system specifics

Robustness: protect applications from hardware faults and system software errors

“High productivity computing is a key technology enabler for meeting our national security and economic competitiveness requirements.”
- Dr. William Harrod, DARPA HPCS Program

CRAY SIGNS $250 MILLION AGREEMENT WITH DARPA TO DEVELOP BREAKTHROUGH ADAPTIVE SUPERCOMPUTER

SEATTLE, WA, November 21, 2006 -- Global supercomputer leader Cray Inc. announced today that it has been awarded a $250 million agreement from the U.S. Defense Advanced Research Projects Agency (DARPA). Under this agreement, Cray will develop a revolutionary new supercomputer based on the company's Adaptive Supercomputing vision, a phased approach to hybrid computing that integrates a range of processing technologies into a single scalable platform. […]

Motivation for Cascade

Why are HPC machines unproductive?

 Difficult to write parallel code (e.g., MPI)
   • Major burden for computational scientists
 Lack of programming tools to understand program behavior
   • Conventional models break with scale and complexity
 Time spent trying to modify code to fit the machine’s characteristics
   • For example, cluster machines have relatively low bandwidth between processors, and can’t directly access global memory…
   • As a result, programmers try hard to reduce communication, and have to bundle communication up in messages instead of simply accessing shared memory

If the machine doesn’t match your code’s attributes, it makes the programming job much more difficult.

And codes vary significantly in their requirements…

Cascade Approach

 Design an adaptive, configurable machine that can match the attributes of a wide variety of applications:
   • Serial (single-thread, latency-driven) performance
   • SIMD data-level parallelism (vectorizable)
   • Fine-grained MIMD parallelism (threadable)
   • Regular and sparse bandwidth of varying intensities
   ⇒ Increases performance
   ⇒ Significantly eases programming
   ⇒ Makes the machine much more broadly applicable

 Ease the development of parallel codes (the models are contrasted in the sketch below)
   • Legacy programming models: MPI, OpenMP
   • Improved variants: SHMEM, UPC, and Co-Array Fortran (CAF)
   • New alternative: Global View (Chapel)
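To illustrate the gap between legacy message passing and the one-sided, globally addressable style of the improved variants, here is a minimal sketch (not from the slides) of the same transfer written with two-sided MPI and with an OpenSHMEM-style put. The buffer name and size are arbitrary, and the two versions are shown together only for comparison; a real code would use one model or the other.

    /* Same data movement two ways: two-sided MPI (both sides participate in a
       matched send/receive) versus a one-sided SHMEM put into remotely
       addressable (symmetric) memory. Error handling and payload setup omitted. */
    #include <mpi.h>
    #include <shmem.h>

    #define N 1024
    static double data[N];              /* symmetric: remotely addressable under SHMEM */

    void mpi_version(int argc, char **argv)
    {
        double local[N];
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            MPI_Send(local, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);      /* sender packs a message */
        else if (rank == 1)
            MPI_Recv(local, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,       /* receiver must match it */
                     MPI_STATUS_IGNORE);
        MPI_Finalize();
    }

    void shmem_version(void)
    {
        double local[N];
        shmem_init();
        if (shmem_my_pe() == 0)
            shmem_double_put(data, local, N, 1);   /* PE 0 writes directly into PE 1's copy of data */
        shmem_barrier_all();                       /* PE 1 only synchronizes; no matching receive */
        shmem_finalize();
    }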

 Provide programming tools to ease debugging and tuning at scale
   • Automatic performance analysis; comparative debugging

Integrated Multi-Architecture System

[Diagram: service nodes connecting specialized-processor compute nodes and high-performance multi-core processor compute nodes]

Basic architecture separates system services from computational tasks
• Purpose-built service and compute nodes
• Sets the infrastructure for hybrid computing

Service nodes provide command and control for a variety of compute nodes
• Assignment and management of the applications
• I/O functions available to all nodes
• Consistent control layer for the underlying hardware
• Full system administration

Flexible execution environment
• Targeted to each compute node type
• Applications managed across different compute node types

Cascade System Architecture

Extremely flexible and configurable

[Diagram: Granite MVP compute nodes, Baker Opteron compute nodes, and Opteron SIO nodes, each with local memory, joined by the high-bandwidth interconnect into globally addressable memory with support for a partitioned or flat address space. Node detail boxes label AMD Opteron processors, a Scorpio co-processor on the Granite MVP node, Gemini communication accelerators, memory, network, and an I/O interface on the service I/O node.]

 Globally addressable memory with unified addressing architecture
 Configurable network, memory, processing, and I/O
 Heterogeneous processing across node types, and within MVP nodes
 Can adapt at configuration time, compile time, and run time

Integrated Multi-Architecture System

[Diagram: the software stack seen by programmers and HPC application programs. Programming models: library-based (pthreads, OpenMP, SHMEM, MPI) and language-based (UPC, CAF, and Chapel in the future), compiled by the Cascade compiler. Cray PE tools: debugger suite, CrayPat/Apprentice2. Libraries: COTS, enhanced, Granite, and FPGA scientific and MPI libraries, plus COTS tools (Fortran, gcc, gdb). Infrastructure: ALPS (Application Level Placement Scheduler) with PBSpro, LSF, and Moab. Runtimes: COTS, enhanced, Granite, FPGA. Operating systems: CNL on Baker Opteron, Granite, and FPGA compute nodes; Linux on Opteron SIO nodes. Hardware: multi-core Opteron sockets, FPGAs, PCIe2 I/O, and global shared memory segments spanning node memories.]

Example Application: Weather Research & Forecasting (WRF) Model

 Mesoscale numerical weather prediction system
   • Regional forecast model (meters to thousands of kilometers)
 Operational forecasting, environmental modeling, & atmospheric research
   • Key commercial application for Cray (both vector & scalar MPP systems)
 Accelerating WRF performance:
   • Part of the code is serial
      ⇒ Runs on the Opteron for best-of-class serial performance
   • Most of the code vectorizes really well
      ⇒ Dynamics and radiation physics
      ⇒ Runs on the specialized processor accelerator in vector mode
   • Cloud physics doesn’t vectorize (this case is sketched in code after the System Software bullets below)
      ⇒ Little FP, lots of branching and conditionals
      ⇒ Degrades performance on vector systems
      ⇒ Vertical columns above grid points are all independent
      ⇒ Runs on the specialized processor accelerator in multithreaded mode

System Software

 Maximize compute cycles for application performance
   • Dedicated compute nodes
   • Minimize OS impact (footprint, cycles, jitter) on compute nodes
   • Provide efficient access to the custom high-speed network
 Improve robustness with reliable services
   • Dedicated service nodes
   • Replicated, scalable services
   • Evolutionary management software stack
 Provide common interfaces for programmability
   • Familiar Linux user environment for programmers
   • Standard Linux execution environment for programs
   • Support for popular parallel programming paradigms
 Leverage standard interfaces for portability
   • Standard Linux and Linux derivatives to support a wide variety of applications
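A minimal, hypothetical illustration (not actual WRF code; the names, array shapes, and thresholds are invented) of why cloud physics favors the multithreaded mode: each vertical column is independent, so the column loop threads cleanly, while the branch-heavy, low-FP work inside a column defeats vectorization.

    /* Hypothetical cloud-physics-style kernel: the outer loop over independent
       vertical columns threads well, while the inner per-level work is
       conditional-heavy and does little floating point, so it vectorizes poorly. */
    #define NLEV 64

    void column_physics(int ncols, double q[][NLEV], const double t[][NLEV])
    {
        #pragma omp parallel for                      /* columns above grid points are independent */
        for (int c = 0; c < ncols; c++) {
            for (int k = 0; k < NLEV; k++) {          /* per-level branching */
                if (t[c][k] < 273.15) {
                    if (q[c][k] > 1.0e-6)
                        q[c][k] -= 1.0e-7;            /* e.g., deposit vapor as ice */
                } else if (q[c][k] > 2.0e-6) {
                    q[c][k] *= 0.5;                   /* e.g., rain out */
                }
            }
        }
    }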

Cascade Programming Environment

 Provides for a wide variety of HPC programming models

 Provides C, C++ and Fortran compilers that automatically adapt programs to different processor technologies

 Provides a user toolkit of math and scientific libraries that automatically optimizes the use of different processor technologies

 Provides a performance analysis system that expertly identifies potential program bottlenecks.

 Provides a set of debugging tools that scale to thousands of processors and go beyond traditional debugging using comparative data techniques

Chapel

A new parallel language developed by Cray for HPCS

 Themes:
   • Raise the level of abstraction and generality compared to SPMD approaches
   • Support prototyping of parallel codes + evolution to production-grade
   • Narrow the gap between parallel and mainstream languages
 Chapel’s Productivity Goals:
   • Vastly improve programmability over current languages/models
   • Support performance that matches or beats MPI
   • Improve portability over current languages/models (similar to MPI)
   • Improve code robustness via improved abstractions and semantics
 Status:
   • Draft language specification available
   • Portable prototype implementation underway
   • Performing application kernel studies to exercise Chapel
   • Working with the HPLS team to evaluate Chapel
   • Initial release made to the HPLS team in December 2006

Chapel Code Size Comparison

[Chart: code size comparison for the STREAM Triad, Random Access, and FFT benchmarks]

Cray Technology

Multiple Processing Technologies: vector, scalar, massive multithreading, and application accelerators
Network Communications: custom interconnect and communications network
Systems Administration & Management: software and tools to manage thousands of processors as a single system
Packaging: very high density, upgradeable, liquid and air cooling
Adaptive Supercomputing: single integrated system

Purpose-built & Balanced for Superior Scalability & Sustained Performance

Cascade Summary

 Performance
   • Configurable, very high bandwidth memory and interconnect
   • Globally addressable memory with fine-grain synchronization
   • Heterogeneous processing to match the application (serial, TLP, DLP)
 Programmability
   • Cray SHMEM, UPC, CAF, and the Chapel high-level parallel language
   • Automatic performance analysis and scalable debugging tools
   • Globally addressable memory with fine-grain synchronization
   • Heterogeneous processing supports a wide range of programming idioms
 Portability
   • Linux-based OS supports the standard POSIX API & Linux services
   • Support for mixed legacy languages and programming models
   • Chapel provides an architecturally neutral path forward for code
 Robustness
   • Central administration and management
   • Hardware Supervisory System
   • Transactional system state, virtualized failover

The Cray Roadmap

Realizing Our Adaptive Supercomputing Vision

[Roadmap diagram, 2006-2010: Cray X1E, Cray XT3, and Cray XD1 (Phase 0: individually architected machines, unique products serving individual market needs); Cray XT4, Cray XMT, and BlackWidow, followed by a Cray XT4 upgrade (Phase I: Rainier Program, multiple processor types with integrated infrastructure and user environment); leading to the Cascade Program adaptive hybrid system (Phase II).]

Thank You!

May 16, 2007 Copyright 2007 – Cray Inc.