Accelerators in Cray's Adaptive Supercomputing
NCSA's Reconfigurable Systems Summer Institute, May 16, 2007
Dave Strenski, Application Analyst, [email protected]

What does a petaflop look like?
[Chart: theoretical performance (Tflop/s) of systems built from 10 Gflop/s nodes, plotted against number of nodes (1 to 100,000) and parallel fraction (0.0 to 1.0). Only codes that are very nearly 100% parallel approach a petaflop; an Amdahl's Law sketch of this limit follows these slides.]

Supercomputing is increasingly about managing scalability
• Exponential increase with the advent of multi-core chips
• Systems with more than 100,000 processing cores
• 80+ core processors expected within the decade
[Chart: average number of processors per supercomputer (Top 20), rising from roughly 200 in 1993 to more than 16,000 in 2006. Source: www.top500.org]

Opportunities to Exploit Heterogeneity
Applications vary considerably in their demands. Any HPC application contains some form of parallelism:
• Many HPC apps have rich, SIMD-style data-level parallelism, which can be significantly accelerated via vectorization
• Those that don't generally have rich thread-level parallelism, which can be significantly accelerated via multithreading
• Some parts of applications are not parallel at all and need fast serial scalar execution speed (Amdahl's Law)
Applications also vary in their communication needs:
• Required memory bandwidth and granularity: some codes work well out of cache, some don't
• Required network bandwidth and granularity: some are fine with message passing, some need shared memory
No one processor/system design is best for all applications.

Increasingly Complex Application Requirements
Earth sciences example: the evolution of computational climate simulation complexity (Intergovernmental Panel on Climate Change, 2004, as updated by Washington, NCAR, 2005; NASA report "Earth Sciences Vision 2030"). Increased complexity and a growing number of components lend themselves well to a variety of processing technologies. Similar trends hold in astrophysics, nuclear engineering, CAE, etc.
"As scientific computing migrated toward commodity platforms, interconnect technology, both in terms of bandwidth and latency, became the limiting factor on application performance and continues to be a performance bottleneck."
– ComputerWorld article, 2/6/06; James Hack (NCAR)

So, Can We Just Pack Chips with Flops?
The key is making the system easily programmable. Peak computational power must be balanced against generality:
• How easy is it to map high-level code onto the machine?
• How easy is it for computation units to access global data?
Some examples: XD1 FPGAs, ClearSpeed CSX600, IBM Cell.
[Figure: qualitative spectrum from "more general purpose / more programmable" to "more area/power efficient", spanning multi-core microprocessors, BG/L, vectors, GP-GPUs, Cell/streaming, ClearSpeed, and FPGAs; positions are qualitative only and also influenced by the memory system.]
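The petaflop chart above is essentially Amdahl's Law evaluated for 10 Gflop/s nodes. As an illustration (not part of the original slides), here is a minimal C sketch of that limit; the node count of 100,000 and the sampled parallel fractions are taken from the chart's axes:

#include <stdio.h>

/* Amdahl's Law: with a fraction p of the work parallelizable over n nodes,
 * speedup = 1 / ((1 - p) + p / n).  Each node here peaks at 10 Gflop/s,
 * matching the "Theoretical Performance with 10 Gflop/s Nodes" chart. */
static double tflops(double p, long n) {
    double node_gflops = 10.0;
    double speedup = 1.0 / ((1.0 - p) + p / (double)n);
    return node_gflops * speedup / 1000.0;   /* Gflop/s -> Tflop/s */
}

int main(void) {
    double fractions[] = { 0.90, 0.99, 0.999, 0.9999, 0.99999, 1.0 };
    long nodes = 100000;   /* 100,000 nodes x 10 Gflop/s = 1 Pflop/s peak */
    for (int i = 0; i < 6; i++)
        printf("p = %.5f -> %8.1f Tflop/s on %ld nodes\n",
               fractions[i], tflops(fractions[i], nodes), nodes);
    return 0;
}

Even at 99.9% parallel, the same 100,000 nodes deliver only about 10 Tflop/s; only codes that are very nearly 100% parallel approach the full petaflop.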
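The heterogeneity slide above distinguishes data-level parallelism (accelerated by vectorization) from thread-level parallelism (accelerated by multithreading). A generic OpenMP sketch of the two patterns, added here purely for illustration (OpenMP's simd directive postdates this 2007 talk):

#include <stdio.h>

#define N 1000000

/* Data-level parallelism: the same operation applied across an array,
 * the pattern a vector unit (or auto-vectorizing compiler) exploits. */
void axpy(int n, double a, const double *x, double *y) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Thread-level parallelism: independent chunks of work spread across
 * cores, the pattern a multithreaded processor exploits. */
double dot(int n, const double *x, const double *y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    axpy(N, 0.5, x, y);
    printf("dot = %.1f\n", dot(N, x, y));
    return 0;
}

How well loops like these run depends on exactly the processor attributes the slides discuss: vector width for the first, and hardware thread count and memory bandwidth for the second.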
Cray XD1 FPGA Accelerators
Performance gains from FPGAs:
• RC5 cipher breaking: implemented on a Xilinx Virtex-II; 1000x faster than a 2.4 GHz P4
• Elliptic curve cryptography: implemented on a Xilinx Virtex-II; 895–1300x faster than a 1 GHz P3
• Vehicular traffic simulation: implemented on a Xilinx Virtex-II (XC2V6000) and Virtex-II Pro (XC2VP100); 300x faster on the XC2V6000 and 650x faster on the XC2VP100 than a 1.7 GHz Xeon
• Smith-Waterman DNA matching: 28x faster than a 2.4 GHz Opteron (a reference CPU version of this kernel is sketched after these slides)
The primary challenge is programming; no general-purpose compiler is available.

Peak GFLOP/s per processor:
            Opteron dual-core   Virtex-4 LX200   Virtex-5 LX330
            (2.5 GHz)           (185 MHz)        (237 MHz)
  mult/add  10                  15.9             28.0
  mult       5                  12.0             19.9
  add        5                  23.9             55.3
References: www.fpgajournal.com/articles_2006/pdf/20061114_cray.pdf, www.hpcwire.com/hpc/1195762.html

ClearSpeed CSX600
• 50 Gflop/s on the card
• 6 GB/s to on-card local memory (4 GB)
• 2+ GB/s to local host memory
• Does not share memory with the host
• Mostly used for accelerating libraries
• No general-purpose compiler available

Cell Processor
Each chip contains:
• One PowerPC core
• Eight "synergistic processing elements" (SPEs)
Targeted for: (1) PlayStations, (2) HDTVs, (3) computing.
Lots of flops:
• 250 Gflop/s (32-bit)
• ~25 Gflop/s (64-bit)
• 25 GB/s to less than 1 GB of memory
The big challenge is programming:
• SPEs have no virtual memory
• They can only access data in local 256 KB buffers
• Alignment is required for good performance
No general-purpose compiler available.

Adaptive Supercomputing
Combines multiple processing architectures into a single, scalable system:
• Processing technologies: scalar x86/64, vector, multithreaded, hardware accelerators
• Transparent interface: libraries, tools, compilers, scheduling, system management, runtime, interconnect, file systems, storage, packaging
Adapt the system to the application – not the application to the system.

Step 1 to Adaptive Supercomputing: Rainier Program – Cray XT Infrastructure
Cray's Rainier generation of products uses a common infrastructure:
• Opteron-based service & I/O (SIO) blades
• Cray SeaStar interconnect
• Single global file system
• Single point of login
• Single point of administration
Delivered with one or more types of compute resources:
• Cray XT4 compute blades (scalar)
• Cray XMT compute blades (multithreading)
• "BlackWidow" compute cabinets (vector)
• Hardware accelerators
[Diagram: a cabinet mixing compute (C), service (S), and accelerator (A) blades.]
The Cray XT infrastructure allows customers to "mix-and-match" compute resources.

DARPA HPCS Program
Focused on providing a new generation of economically viable, high-productivity computing systems for the national security and industrial user community in the 2010 timeframe:
• Performance (time-to-solution): speed up critical applications by factors of 10 to 40
• Programmability (idea-to-first-solution): reduce the cost and time of developing application solutions
• Portability: insulate application software from system specifics
• Robustness: protect applications from hardware faults and system software errors
"High productivity computing is a key technology enabler for meeting our national security and economic competitiveness requirements."
– Dr. William Harrod, DARPA HPCS Program
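For context on the Smith-Waterman speedup quoted on the XD1 slide above: the kernel is a dynamic-programming recurrence over a scoring matrix, which maps naturally onto an FPGA's systolic parallelism. Below is a minimal CPU reference sketch with an assumed +2/-1 match/mismatch score and a linear -1 gap penalty; it is not Cray's FPGA design, only the recurrence such a design implements:

#include <stdio.h>
#include <string.h>

/* Smith-Waterman local alignment score (linear gap penalty).
 * H[i][j] = max(0,
 *               H[i-1][j-1] + s(a[i], b[j]),   match/mismatch
 *               H[i-1][j]   + gap,             deletion
 *               H[i][j-1]   + gap)             insertion
 * The best local alignment score is the maximum H over the whole matrix. */
#define MAXLEN 256

static int max4(int a, int b, int c, int d) {
    int m = a > b ? a : b;
    m = m > c ? m : c;
    return m > d ? m : d;
}

int smith_waterman(const char *a, const char *b, int match, int mismatch, int gap) {
    int la = (int)strlen(a), lb = (int)strlen(b);
    static int H[MAXLEN + 1][MAXLEN + 1];
    int best = 0;
    if (la > MAXLEN || lb > MAXLEN) return -1;
    memset(H, 0, sizeof H);
    for (int i = 1; i <= la; i++)
        for (int j = 1; j <= lb; j++) {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = max4(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] + gap,
                              H[i][j - 1] + gap);
            if (H[i][j] > best) best = H[i][j];
        }
    return best;
}

int main(void) {
    /* Hypothetical DNA fragments, just to exercise the kernel. */
    printf("score = %d\n", smith_waterman("ACACACTA", "AGCACACA", 2, -1, -1));
    return 0;
}

Because the cells along each anti-diagonal of H are independent, hardware can evaluate many of them per clock, which is the kind of parallelism an FPGA implementation exploits.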
CRAY SIGNS $250 MILLION AGREEMENT WITH DARPA TO DEVELOP BREAKTHROUGH ADAPTIVE SUPERCOMPUTER
SEATTLE, WA, November 21, 2006 – Global supercomputer leader Cray Inc. announced today that it has been awarded a $250 million agreement from the U.S. Defense Advanced Research Projects Agency (DARPA). Under this agreement, Cray will develop a revolutionary new supercomputer based on the company's Adaptive Supercomputing vision, a phased approach to hybrid computing that integrates a range of processing technologies into a single scalable platform. […]

Motivation for Cascade
Why are HPC machines unproductive?
• It is difficult to write parallel code (e.g., MPI), a major burden for computational scientists.
• Programming tools for understanding program behavior are lacking; conventional models break down with scale and complexity.
• Time is spent modifying code to fit the machine's characteristics. For example, cluster machines have relatively low bandwidth between processors and can't directly access global memory, so programmers try hard to reduce communication and have to bundle it up in messages instead of simply accessing shared memory.
If the machine doesn't match your code's attributes, the programming job becomes much more difficult. And codes vary significantly in their requirements.

Cascade Approach
Design an adaptive, configurable machine that can match the attributes of a wide variety of applications:
• Serial (single-thread, latency-driven) performance
• SIMD data-level parallelism (vectorizable)
• Fine-grained MIMD parallelism (threadable)
• Regular and sparse bandwidth of varying intensities
⇒ Increases performance
⇒ Significantly eases programming
⇒ Makes the machine much more broadly applicable
Ease the development of parallel codes:
• Legacy programming models: MPI, OpenMP
• Improved variants: SHMEM, UPC, and Co-Array Fortran (CAF)
• New alternative: Global View (Chapel)
Provide programming tools to ease debugging and tuning at scale:
• Automatic performance analysis; comparative debugging

Integrated Multi-Architecture System
The basic architecture separates system services from computational tasks:
• Purpose-built service and compute nodes
• Sets the infrastructure for hybrid computing
Service nodes provide command and control for a variety of compute nodes:
• Assignment and management of the applications
• I/O functions available to all nodes
• Consistent control layer for the underlying hardware
• Full system administration
[Diagram: specialized-processor compute nodes and service nodes forming a flexible execution environment.]