Accelerators in Cray's Adaptive Supercomputing
NCSA’s Reconfigurable Systems Summer Institute
Dave Strenski, Application Analyst, [email protected]

What does a petaflop look like?
Theoretical Performance with 10 Gflop/s Nodes
[Figure: theoretical performance in Tflop/s (0 to 1000) plotted against Number of Nodes (1 to 100,000) and % Parallel (0.00000, 0.90000, 0.99000, 0.99900, 0.99990, 0.99999, 1.00000)]
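The slide does not state the formula behind the plot. A minimal sketch, assuming it follows Amdahl's Law with identical 10 Gflop/s nodes, is given here; the function name `amdahl_tflops` and its parameters are illustrative and not taken from the presentation.

```python
def amdahl_tflops(nodes, parallel_fraction, node_gflops=10.0):
    """Estimate sustained Tflop/s for `nodes` identical nodes.

    Assumption: Amdahl's Law, where only `parallel_fraction` of the
    work is spread across nodes and the rest runs on one node.
    """
    serial_fraction = 1.0 - parallel_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / nodes)
    return speedup * node_gflops / 1000.0  # Gflop/s -> Tflop/s


if __name__ == "__main__":
    # Sweep the parallel fractions that label the plot's axis.
    for p in (0.9, 0.99, 0.999, 0.9999, 0.99999, 1.0):
        print(f"{p:.5f} parallel, 100000 nodes: "
              f"{amdahl_tflops(100_000, p):8.1f} Tflop/s")
```

Under this model, 100,000 nodes reach the full 1000 Tflop/s only when the code is 100% parallel; at 99.999% parallel the same machine delivers roughly half of that.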
May 16, 2007 Copyright 2007 – Cray Inc.

Supercomputing is increasingly about managing scalability
• Exponential increase with advent of multi-core chips
• Systems with more than 100,000 processing cores
• 80+ core processor expected within the decade
[Figure: bar chart, Average Number of Processors Per Supercomputer (Top 20), 1993 to 2006. Bar labels: 202, 408, 722, 808, 1,073, 1,245, 1,644, 1,847, 2,230, 2,827, 3,093, 3,518, 10,073, 16,316. Source: www.top500.org]

Opportunities to Exploit Heterogeneity
Applications vary considerably in their demands
Any HPC application contains some form of parallelism
• Many HPC apps have rich, SIMD-style data-level parallelism
  Can significantly accelerate via vectorization
• Those that don’t generally have rich thread-level parallelism
  Can significantly accelerate via multithreading
• Some parts of applications are not parallel at all
  Need fast serial scalar execution speed (Amdahl’s Law)
Applications also vary in their communications needs
• Required memory bandwidth and granularity
  Some work well out of cache, some don’t
• Required network bandwidth and granularity
  Some OK with message passing, some need shared memory
No one processor/system design is best for all apps
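
The point that no single design suits every application can be illustrated with a small timing model that generalizes Amdahl's Law to several classes of work. This is only a sketch: the function `hetero_speedup`, the class names, and the example fractions and speedup factors are hypothetical and do not come from the presentation.

```python
def hetero_speedup(fractions, speedups):
    """Overall speedup when each class of work is accelerated separately.

    fractions: share of baseline runtime per work class (must sum to 1).
    speedups:  speedup factor per work class; classes not listed run
               at baseline speed (factor 1.0).
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("runtime fractions must sum to 1")
    new_time = sum(share / speedups.get(kind, 1.0)
                   for kind, share in fractions.items())
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical application profile, for illustration only.
    app = {"serial": 0.05, "vector": 0.70, "threaded": 0.25}
    print(hetero_speedup(app, {"vector": 10.0}))
    print(hetero_speedup(app, {"vector": 10.0, "threaded": 10.0}))
```

In this made-up profile, accelerating only the vectorizable part leaves the threaded and serial parts dominating the runtime, which is the motivation for matching each part of the code to a suitable processor type.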

Increasingly Complex Application Requirements
Earth Sciences Example
[Figure: Evolution of Computational Climate Simulation Complexity. Sources: International Intergovernmental Panel on Climate Change, 2004, as updated by Washington, NCAR, 2005; NASA Report: Earth Sciences Vision 2030]
Increased complexity and number of components lends itself well to a variety of processing technologies.
“As scientific computing migrated toward commodity platforms, interconnect technology, both in terms of bandwidth and latency, became the limiting factor on application performance and continues to be a performance bottleneck.”
- ComputerWorld article – 2/6/06
- James Hack (NCAR)
Similar trends in astrophysics, nuclear engineering, CAE, etc.

So, Can We Just Pack Chips with Flops?
Key is making the system easily programmable
Must balance peak computational power with generality
• How easy is it to map high level code onto the machine?
• How easy is it for computation units to access global data?
Some examples:
• XD1 FPGAs
• Clearspeed CSX600
• IBM Cell
Flop efficiency vs. generality/programmability spectrum:
• Qualitative only; also influenced by memory system
[Figure: spectrum running between "More area/power efficient" and "More programmable". Technologies shown: FPGAs, Cell, GP-GPU, Clearspeed, Streaming, Vectors, BG/L, Multi-core, General Purpose µprocs]

Cray XD1 FPGA Accelerators
Performance gains from FPGA:
RC5 Cipher Breaking
• Implemented on Xilinx Virtex II
• 1000x faster than 2.4 GHz P4
Elliptic Curve Cryptography
• Implemented on Xilinx Virtex II
• 895-1300x faster than 1 GHz P3
Vehicular Traffic Simulation
• Implemented on Xilinx Virtex II (XC2V6000) and Virtex II Pro (XC2VP100)
• 300x faster on XC2V6000 than 1.7 GHz Xeon
• 650x faster on XC2VP100 than 1.7 GHz Xeon
Smith-Waterman DNA matching
• 28x faster than 2.4 GHz Opteron

Primary challenge is programming
No general-purpose compiler available

Peak GFLOP/s per processor

            Opteron      Virtex4      Virtex5
            Dual-core    LX200        LX330
            2.5 GHz      185 MHz      237 MHz
mult/add    10           15.9         28.0
mult        5            12.0         19.9
add         5            23.9         55.3

www.fpgajournal.com/articles_2006/pdf/20061114_cray.pdf
www.hpcwire.com/hpc/1195762.html
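
The slide does not show how the peak figures were derived. A rough sketch of the arithmetic, assuming peak = units × clock × results per unit per cycle, is given below. The Opteron line assumes each core completes one add and one multiply per cycle, which is consistent with the 5/5/10 split in the table. The FPGA lines only back out how many floating-point results per cycle the quoted peaks imply; the function names are illustrative.

```python
def peak_gflops(units, clock_ghz, flops_per_unit_per_cycle=1):
    """Peak GFLOP/s assuming every unit produces results every cycle."""
    return units * clock_ghz * flops_per_unit_per_cycle


def implied_flops_per_cycle(peak_gflops_value, clock_ghz):
    """Floating-point results per clock cycle implied by a quoted peak."""
    return peak_gflops_value / clock_ghz


# Dual-core 2.5 GHz Opteron: assumed 1 add + 1 multiply per core per cycle.
opteron = peak_gflops(units=2, clock_ghz=2.5, flops_per_unit_per_cycle=2)

if __name__ == "__main__":
    print("Opteron mult/add peak:", opteron, "GFLOP/s")
    # Table values for the mult/add row, with clocks converted to GHz.
    for name, peak, ghz in (("Virtex4 LX200", 15.9, 0.185),
                            ("Virtex5 LX330", 28.0, 0.237)):
        print(name, "implies about",
              round(implied_flops_per_cycle(peak, ghz)),
              "results per cycle")
```

The comparison shows why the FPGA numbers can exceed the Opteron's despite a clock more than ten times slower: under this reading, the peak comes from many arithmetic units working in parallel rather than from clock rate.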

Clearspeed CSX600

• 50 Gflops on card
• 6 GB/s to on-card local memory (4GB)
• 2+ GB/s to local host memory

• Doesn’t share memory with host
• Mostly used for accelerating libraries
• No general-purpose compiler available

Cell Processor
Each chip contains:
• One PowerPC
• Eight “synergistic processing elements”
Targeted for:
• (1) Playstations, (2) HDTVs, (3) computing
Lots of flops
• 250 Gflops (32 bit)
• ~25 Gflops (64 bit)
25 GB/s to < 1GB memory
Big challenge is programming
• SPEs have no virtual memory
• Can only access data in local 256 KB buffers
• Requires alignment for good performance
No general-purpose compiler available
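
The 256 KB local-buffer restriction means large arrays have to be processed in tiles that are staged in and out explicitly. The sketch below shows only the index arithmetic, in Python for readability. It is not Cell code, and the double-buffering split, the `reserve_bytes` parameter, and the function names are assumptions added for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024  # local buffer size quoted on the slide


def tiles(n_elements, bytes_per_element=8, buffers_in_flight=2,
          reserve_bytes=0):
    """Yield (start, stop) index ranges that each fit in one local buffer.

    buffers_in_flight=2 models double buffering (an assumption): one
    tile is computed on while the next is being transferred.
    reserve_bytes models space set aside for code and stack.
    """
    budget = (LOCAL_STORE_BYTES - reserve_bytes) // buffers_in_flight
    per_tile = budget // bytes_per_element
    if per_tile <= 0:
        raise ValueError("no room left for data in the local buffer")
    for start in range(0, n_elements, per_tile):
        yield start, min(start + per_tile, n_elements)


def tiled_sum(values):
    """Sum a large list one tile at a time."""
    total = 0.0
    for start, stop in tiles(len(values)):
        local = values[start:stop]  # stands in for an explicit copy-in
        total += sum(local)
    return total
```

With 8-byte elements and two buffers in flight, each tile holds 16,384 elements, so even a modest array has to be split into many transfers. That bookkeeping is the programming burden the slide refers to.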

Adaptive Supercomputing
Combines multiple processing architectures into a single, scalable system

• Transparent Interface
• Libraries • Tools • Compilers • Scheduling • System Management • Runtime

• Scalar X86/64 • Vector • Multithreaded • HW Accelerators

• Interconnect • File Systems • Storage • Packaging

Adapt the system to the application – not the application to the system

Step 1 to Adaptive Supercomputing: Rainier Program -- Cray XT Infrastructure
Cray’s Rainier generation of products uses a common infrastructure:
• Opteron-based service & I/O (SIO) blades
• Cray SeaStar interconnect
• Single global file system
• Single point of login
• Single point of administration
Delivered with one or more types of compute resources
• Cray XT4 compute blades (scalar)
• Cray XMT compute blades (multithreading)
• “BlackWidow” compute cabinets (vector)
• Hardware Accelerators
[Figure: diagram of a mixed system of blades. Legend: C = Compute, S = Service, A = Accelerator]

The Cray XT Infrastructure allows customers to “mix-and-match” compute resources

DARPA HPCS Program
Focused on providing a new generation of economically viable high productivity computing systems for the national security and industrial user community in the 2010 timeframe
Performance (time-to-solution): speed up critical applications by factors of 10 to 40

Programmability (idea-to-first solution): reduce cost and time for developing application solutions

Portability: insulate application software from system specifics

Robustness: protect applications from hardware faults and system software errors

"High productivity computing is a key technology enabler for meeting our national security and economic competitiveness requirements."
- Dr. William Harrod, DARPA HPCS Program

CRAY SIGNS $250 MILLION AGREEMENT WITH DARPA TO DEVELOP BREAKTHROUGH ADAPTIVE SUPERCOMPUTER
SEATTLE, WA, November 21, 2006 -- Global supercomputer leader Cray Inc. announced today that it has been awarded a $250 million agreement from the U.S. Defense Advanced Research Projects Agency (DARPA). Under this agreement, Cray will develop a revolutionary new supercomputer based on the company's Adaptive Supercomputing vision, a phased approach to hybrid computing that integrates a range of processing technologies into a single scalable platform. […]

Motivation for Cascade

Why are HPC machines unproductive?
Difficult to write parallel code (e.g., MPI)
• Major burden for computational scientists
Lack of programming tools to understand program behavior
• Conventional models break with scale and complexity
Time spent trying to modify code to fit machine’s characteristics
• For example, cluster machines have relatively low bandwidth between processors, and can’t directly access global memory…
• As a result, programmers try hard to reduce communication, and have to bundle communication up in messages instead of simply accessing shared memory
If the machine doesn’t match your code’s attributes, it makes the programming job much more difficult.

And codes vary significantly in their requirements…

Cascade Approach
Design an adaptive, configurable machine that can match the attributes of a wide variety of applications:
• Serial (single thread, latency-driven) performance
• SIMD data level parallelism (vectorizable)
• Fine grained MIMD parallelism (threadable)
• Regular and sparse bandwidth of varying intensities
⇒ Increases performance
⇒ Significantly eases programming
⇒ Makes the machine much more broadly applicable

Ease the development of parallel codes
• Legacy programming models: MPI, OpenMP
• Improved variants: SHMEM, UPC and CoArray Fortran (CAF)
• New alternative: Global View (Chapel)

Provide programming tools to ease debugging and tuning at scale
• Automatic performance analysis; comparative debugging

Integrated Multi-Architecture System

Basic architecture separates system services from computational tasks
• Purpose-built service and compute nodes
• Sets the infrastructure for hybrid computing
Service nodes provide command and control for variety of compute nodes
• Assignment and management of the applications
• I/O functions available to all nodes
• Consistent control layer for the underlying hardware
• Full system administration
Flexible Execution Environment
• Targeted to each Compute node type
• Applications managed across different compute node types
[Figure: diagram showing Service Nodes, Specialized Processor Compute Nodes, and High Performance Multi-core Processor Compute Nodes]

Cascade System Architecture
[Figure: system diagram. A High Bandwidth Interconnect ("Extremely flexible and configurable") joins Granite MVP Compute Nodes, Baker Opteron Compute Nodes, and Opteron SIO Nodes, each with Local Memory, under Globally Addressable Memory ("Support for partitioned or flat address space"). Detail panels for the Granite MVP Compute Node, Baker Opteron Compute Node, and Opteron Service I/O Node carry the labels: AMD Processor, Scorpio Co-processor, Gemini Communication Accelerator, Memory, I/O Interface, Network]

Globally addressable memory with unified addressing architecture
Configurable network, memory, processing and I/O
Heterogeneous processing across node types, and within MVP nodes
Can adapt at configuration time, compile time, run time

Integrated Multi-Architecture System
[Figure: software stack diagram. Labels, top to bottom:
HPC Application Programs; Programmers
Programming Models: Library Based (pthreads, OpenMP, SHMEM, MPI); Language Based (UPC, CAF, Chapel (future))
Cray PE Tools: Cascade Compiler, Debugger suite, CrayPat/Apprentice2; COTS Tools: Fortran, gcc, gdb
Libraries: Scientific, MPI, Other; COTS Enhanced Libraries, Granite Libraries, FPGA Libraries
ALPS (Application Level Placement Scheduler) Infrastructure; PBSpro, LSF, Moab
Runtimes: COTS Enhanced Runtime, Granite Runtime, FPGA Runtime
Operating systems: CNL, Linux
Nodes: Baker, Granite, and FPGA compute nodes built on Opteron; Opteron SIO nodes
Hardware: Opteron cores, FPGA, PCIe2 I/O, Memory, Global Shared Memory Segments]

Example Application: Weather Research & Forecasting (WRF) Model
Mesoscale numerical weather prediction system
• Regional forecast model (meters to thousands of kilometers)
Operational forecasting, environmental modeling, & atmospheric research
• Key commercial application for Cray (both vector & scalar MPP systems)
Accelerating WRF performance:
• Part of the code is serial:
  ⇒ Runs on Opteron for best-of-class serial performance
• Most of the code vectorizes really well
  ⇒ Dynamics and radiation physics
  ⇒ Runs on specialized processor accelerator in vector mode
• Cloud physics doesn’t vectorize
  ⇒ Little FP, lots of branching and conditionals
  ⇒ Degrades performance on vector systems
  ⇒ Vertical columns above grid points are all independent
  ⇒ Runs on specialized processor accelerator in multithreaded mode

System Software
Maximize compute cycles for application performance
• Dedicated compute nodes
• Minimize OS impact (footprint, cycles, jitter) on compute nodes
• Provide efficient access to custom high-speed network
Improve robustness with reliable services
• Dedicated service nodes
• Replicated, scalable services
• Evolutionary management software stack
Provide common interfaces for programmability
• Familiar Linux user environment for programmers
• Standard Linux execution environment for programs
• Support for popular parallel programming paradigms
Leverage standard interfaces for portability
• Standard Linux and Linux derivatives to support wide variety of applications

Cascade Programming Environment
Provides for wide variety of HPC programming models
Provides C, C++ and Fortran compilers that automatically adapt programs to different processor technologies
Provides a user toolkit of math and scientific libraries that automatically optimizes the use of different processor technologies
Provides a performance analysis system that expertly identifies potential program bottlenecks.
Provides a set of debugging tools that scale to thousands of processors and go beyond traditional debugging using comparative data techniques

Chapel
A new parallel language developed by Cray for HPCS
Themes:
• Raise level of abstraction, generality compared to SPMD approaches
• Support prototyping of parallel codes + evolution to production-grade
• Narrow gap between parallel and mainstream languages
Chapel’s Productivity Goals:
• Vastly improve programmability over current languages/models
• Support performance that matches or beats MPI
• Improve portability over current languages/models (similar to MPI)
• Improve code robustness via improved abstractions and semantics
Status:
• Draft language specification available
• Portable prototype implementation underway
• Performing application kernel studies to exercise Chapel
• Working with HPLS team to evaluate Chapel
• Initial release made to HPLS team in December 2006

Chapel Code Size Comparison
[Figure: code size comparison for three benchmarks: STREAM Triad, Random Access, FFT]

Cray Technology

Multiple Processing Technologies: Vector, scalar, massive multi-threading and application accelerators

Network Communications: Custom interconnect and communications network

Systems Administration & Management: Software and tools to manage thousands of processors as a single system

Packaging: Very high density, upgradeable, liquid and air-cooling

Adaptive Supercomputing: Single integrated system

Purpose-built & Balanced for Superior Scalability & Sustained Performance

Cascade Summary
Performance
• Configurable, very high bandwidth memory and interconnect
• Globally addressable memory with fine-grain synchronization
• Heterogeneous processing to match application (serial, TLP, DLP)
Programmability
• Cray SHMEM, UPC, CAF, and Chapel high-level parallel language
• Automatic performance analysis and scalable debugging tools
• Globally addressable memory with fine-grain synchronization
• Heterogeneous processing supports wide range of programming idioms
Portability
• Linux-based OS supports standard POSIX API & Linux services
• Support for mixed legacy languages and programming models
• Chapel provides an architecturally-neutral path forward for code
Robustness
• Central administration and management
• Hardware Supervisory System
• Transactional system state, virtualized failover

The Cray Roadmap
Realizing Our Adaptive Supercomputing Vision
[Figure: roadmap timeline with year markers 2006, 2007, 2008, 2009, 2010. Products shown: Cray X1E, Cray XD1, Cray XT3, Cray XT4, Cray XMT, BlackWidow, Cray XT4 Upgrade]
• Phase 0: Individually Architected Machines
  Unique Products Serving Individual Market Needs
• Phase I: Rainier Program
  Multiple Processor Types with Integrated Infrastructure and User Environment
• Phase II: Cascade Program
  Adaptive Hybrid System

Thank You!