Realizing GPU Computation at Scale
CS-Storm: Building a Productive & Reliable GPU Platform

John K. Lee, Vice President, Cluster Products
Maria Iordache, PhD, Product Management Director

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following are system family marks and trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the trademarks of their respective owners. Copyright 2015 Cray Inc.

Agenda

● About Cray

● Building a Productive & Reliable dense GPU Platform – CS-Storm
● Realizing GPU computation at scale

K40 or K80 inside

About Cray

Seymour Cray founded Cray Research in 1972
• 1972-1996: Cray Research grew to leadership in supercomputing
• 1996-2000: Cray was a subsidiary of SGI
• 2000-present: Cray Inc., growing to $561M in revenue in 2014

Cray Inc.
• NASDAQ: CRAY
• Over 1,000 employees across 30 countries
• Headquartered in Seattle, WA

Three Focus Areas
• Computation
• Storage
• Analytics

Seven Major Development Sites:
• Austin, TX
• San Jose, CA
• Chippewa Falls, WI
• Seattle, WA
• Pleasanton, CA
• Bristol, UK
• St. Paul, MN

Cray’s Vision: The Fusion of Supercomputing and Big & Fast Data

Modeling The World: Cray is solving “grand challenges” in science, engineering and analytics

• Math Models: Modeling and simulation augmented with data to provide the highest fidelity virtual reality results
• Data Models: Integration of datasets and math models for search, analysis, predictive modeling and knowledge discovery
• Data-Intensive Processing: High throughput event processing & data capture from sensors, data feeds and instruments

Compute Store Analyze

Supercomputing Leadership: Top 500 Supercomputers in the World, November 2014

                Top 50    Top 100      Top 500
Cray Systems    16        28           62
Vendor Rank     #1        #1 (tied)    #3

Cray has vast experience building very large scale, GPU-based HPC systems

● Cray has more accelerated systems in the Top500 than anyone
● 75 of the Top500 are systems with accelerators
● 14 of these are made by Cray

● Cray systems have more GPU performance (Rmax) on Top500 than all others combined!

● GPUs supported on both Cray’s cluster systems (CS) and supercomputer (XC) lines – multiple of each in Top500

● CS-Storm dense-GPU computing platform, with 8 GPUs / server, launched in Aug 2014:
   ● #10 on Top500, Nov 2014
   ● #4 on Green500, Nov 2014

Background on CS-Storm

● Started as a custom engineering project in late 2013
● Customer-funded project to build the highest performing PCIe-attached accelerator platform
   ● Able to run the host processors at peak performance
   ● Able to run accelerators at peak performance for cards up to 300 W
● Successfully delivered a production system with over 4,480 K40 GPUs in a single system
   ● #10 on last year’s Top500 list
   ● Another system is #4 on the Green500 list
● Designed to support any full-height, double-width PCIe accelerators
● Currently supports K40 and K80 GPUs

CS-Storm Design: What makes it special?

Efficient air-cooled 2U design, with enough power and cooling capacity to operate eight current and future GPUs and CPUs at peak level without capping their performance
• Pull/push fan architecture efficiently cools the chassis, providing a consistent temperature across the system and its accelerators

Most efficient use of PCIe bandwidth to operate the eight GPUs at full performance

• Cray R&D performed a comprehensive signal integrity study to determine the electrical signaling capabilities of the PCIe bus

Innovative, easy-to-maintain server design, providing easy access to all components
• 3 x 1,630 W power supplies capable of 2+1 redundancy (a rough power-budget check is sketched below)
• Native 480 V and 208 V server power supplies – increasing power efficiency

Room-neutral cooling via optional rear-door heat exchangers
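As a rough sanity check on the 2+1 power-supply claim, the arithmetic below estimates a worst-case node power budget. The host-side figure is an illustrative assumption, not a Cray specification; only the 300 W card limit and the 3 x 1,630 W supplies come from the slides.

```python
# Rough 2+1 redundancy check for a CS-Storm node (illustrative numbers only).
GPU_COUNT = 8
GPU_MAX_W = 300          # slide: power/cooling sized for cards up to 300 W
HOST_EST_W = 700         # assumption: CPUs + DIMMs + SSDs + fans (not a Cray spec)

PSU_W = 1630
PSUS_TOTAL = 3
PSUS_AVAILABLE = PSUS_TOTAL - 1   # 2+1 redundancy: one supply may fail

node_load = GPU_COUNT * GPU_MAX_W + HOST_EST_W
redundant_capacity = PSUS_AVAILABLE * PSU_W

print(f"Estimated node load: {node_load} W")                     # 3100 W with these assumptions
print(f"Capacity with one PSU failed: {redundant_capacity} W")   # 3260 W
print("2+1 redundancy holds" if node_load <= redundant_capacity else "over budget")
```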

CS-Storm: Innovative Design

• Six local SSD drives
• Host processors: IVB / HSW (Intel Ivy Bridge / Haswell)

• 512 GB / 1,024 GB max, 16 DIMMs

• 2 x 4 NVIDIA K40 or K80 GPUs, 11.4 or 15 TF/node
• 2U form factor: 22 nodes / 48U rack, 176 GPUs / 48U rack
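The per-node and per-rack figures above follow from the per-card base-clock DP numbers quoted later in the deck (1.43 Tflops for K40, 1.87 Tflops for K80); a small sketch of that arithmetic, counting GPU flops only:

```python
# Node- and rack-level arithmetic behind the density figures (GPU contribution only).
CARDS_PER_NODE = 8          # 2 cages x 4 cards
NODES_PER_RACK = 22

dp_tf_per_card = {"K40": 1.43, "K80": 1.87}   # base-clock DP peak, from the spec table

for card, tf in dp_tf_per_card.items():
    node_tf = CARDS_PER_NODE * tf
    print(f"{card}: ~{node_tf:.1f} TF/node, "
          f"{NODES_PER_RACK * CARDS_PER_NODE} GPUs per 48U rack")
# K40: ~11.4 TF/node; K80: ~15.0 TF/node; 176 GPUs per rack
```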

CS-Storm: Server PCIe Layout (Haswell motherboard)

[Diagram: server PCIe layout showing two high-speed network connections (Network 1 and Network 2), the QPI link between host sockets, and performance-optimized vs. GPU-optimized PCIe configurations]

GPU Cage Serviceability

24”-wide chassis provides more room than standard 19” chassis to effectively service and cool the GPUs

Individual GPU Access

● Each cage holds 4 GPUs
● Each GPU assembly can be removed easily through a blind-mate quick disconnect

[Figure: GPU assembly]

Future-Proof Design

PCIe Gen3 standard size

K40 / K80

Maximum PCIe card dimensions supported: 39.06 mm x 132.08 mm x 313.04 mm

CS-Storm System: Cooling Performance

CS-Storm’s pull/push fan architecture efficiently cools the chassis, providing a consistent temperature across the system and its coprocessors. CS-Storm is designed to provide enough air flow to keep all GPUs cool under the most challenging compute workloads.

GPU #   Temperature (C)   Power Consumption
1       63                213 W
2       64                222 W
3       63                214 W
4       61                218 W
5       53                218 W
6       53                215 W
7       57                218 W
8       51                213 W

. Measured at the Chippewa Falls manufacturing facility, with 27 C ambient temperature
. DGEMM was running on all GPUs at the time of record capture
. The table shows the “hottest” of the 22 nodes in the rack

CS-Storm System: Efficient, Room-Neutral Cooling
Customized 48U RDHx with 64.2 kW max cooling

Rear-door heat exchangers are available in both 42U and 48U heights

Software at Scale: Partnerships deliver a complete software ecosystem
Essential software and management tools needed to build a powerful, flexible and highly available supercomputer

Development & Performance Tools: Intel® Parallel Studio XE Cluster Edition, PGI Cluster Development Kit®, NVIDIA® CUDA®, GNU toolchain, Cray PE on CS*

HPC Programming Tools:
• Application Libraries: Intel® MPI, MKL; Platform MPI; MVAPICH2; OpenMPI; Cray LibSci, LibSci_ACC
• Debuggers: Allinea DDT, MAP; Intel® IDB; PGI PGDBG®; GNU GDB; Rogue Wave TotalView®

Schedulers, File Systems and Management:
• Resource Management / Job Scheduling: SLURM; Adaptive Computing MOAB® / Maui / Torque; Altair PBSPro; Grid Engine; IBM Platform™ LSF®
• File Systems: Lustre®; NFS; GPFS; PanFS®; Local (ext3, ext4, XFS)
• Cluster Management: Cray® Advanced Cluster Engine (ACE™) Management Software**

Drivers and Operating Systems:
• Network Mgmt. Drivers: Accelerator software stack & drivers; OFED™
• Operating Systems: Linux® (RedHat, CentOS)**

Legend:
* Cray® PE on CCS v1.0 includes Cray Compiling Environment, Cray Scientific and Math Libraries and Cray Performance Measurement and Analysis Tools
** ACE Management Servers are delivered with Red Hat Linux (compute nodes: all operating systems; management nodes: RedHat only)

CS-Storm Performance… 8 x K80 or 8 x K40 at system level

What to expect from K80 vs K40?

Features                          Tesla K80 (1)                          Tesla K40
GPU                               2x Kepler GK210                        1 Kepler GK110B
Peak DP Flops                     2.91 Tflops (GPU Boost clocks)         1.66 Tflops (GPU Boost clocks)
                                  1.87 Tflops (base clocks)              1.43 Tflops (base clocks)
Peak SP Flops                     8.74 Tflops (GPU Boost clocks)         5 Tflops (GPU Boost clocks)
                                  5.6 Tflops (base clocks)               4.29 Tflops (base clocks)
Memory bandwidth (ECC off) (2)    480 GB/sec (240 GB/sec per GPU)        288 GB/sec
Memory size (GDDR5)               24 GB (12 GB per GPU)                  12 GB
CUDA cores                        4992 (2496 per GPU)                    2880
Max power per GPU                 300 W                                  235 W

From http://www.nvidia.com/object/tesla-servers.html and NVIDIA presentations
(1) Tesla K80 specifications are shown as the aggregate of two GPUs.
(2) With ECC on, 6.25% of the GPU memory is used for ECC bits. For example, 6 GB total memory yields 5.25 GB of user-available memory with ECC on.

● K80 is made of 2 GPU ASICs; the form factor is the same as K40, power capped at 300 W (each GPU is capped at 150 W)

Based on the K80 vs K40 (card) performance:

Good:
● 2x the memory in the same form factor is great for embarrassingly parallel applications that store data on the GPUs
● Some applications that our CS-Storm customers run take full advantage of the larger memory; 2x performance is possible (e.g. GIS)
● Single precision performance with K80 is great (for O&G seismic)
● NVIDIA auto boost technology in K80 vs manual boost with K40

TBD:
● For compute-intensive applications, the overall performance with K80 is expected to be ~30-70% higher than with K40 (not 2x!), depending on the application
● Not clear yet how applications that need lots of communication between GPUs and CPU / memory will perform
● Not enough experience with GPUDirect RDMA to understand its performance implications for specific applications
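To make the numbers concrete, here is a short sketch of the ECC memory overhead and the card-level ratios implied by the table; the 6.25% figure is NVIDIA's quoted ECC reservation, and everything else is taken directly from the spec values above.

```python
# ECC overhead and K80-vs-K40 card ratios, using the spec-table numbers above.
ECC_RESERVED = 0.0625     # NVIDIA: 6.25% of GDDR5 is reserved for ECC bits when ECC is on

def usable_memory_gb(total_gb, ecc_on=True):
    """Memory left for the application once the ECC reservation is taken out."""
    return total_gb * (1.0 - ECC_RESERVED) if ecc_on else total_gb

print(f"K80, ECC on: {usable_memory_gb(24):.2f} GB of 24 GB usable")   # 22.50 GB
print(f"K40, ECC on: {usable_memory_gb(12):.2f} GB of 12 GB usable")   # 11.25 GB

# Card-level ratios (base clocks) behind the "~30-70%, not 2x" expectation.
k80_dp, k40_dp = 1.87, 1.43
k80_bw, k40_bw = 480, 288
print(f"DP peak ratio (base clocks): {k80_dp / k40_dp:.2f}x")    # ~1.31x
print(f"Memory bandwidth ratio:      {k80_bw / k40_bw:.2f}x")    # ~1.67x
```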

[Charts: single precision and double precision test results]

What to expect from K80 vs K40 at node level?

● Understand GPU performance with and without boost
● Run HPL on a node w/ 8 GPU cards with boost, analyze the results… try boost off…
● Set run parameters and optimize HPL at node level… (it takes a lot of fiddling…)

Boost on (baseline: default settings, no clock setting):
            1 GPU card:          1 server = 8 GPUs:
            Peak TF w/ boost     R_max     R_peak    Efficiency
K40         1.66                 8.98      11.94     75.2%
K80         2.91                 12.62     15.46     81.6%
K80 vs K40  175%                 141%      129%

Boost off (fixed clock):
            1 GPU card:          1 server = 8 GPUs:
            Peak TF, no boost    R_max     R_peak    Efficiency
K40         1.43                 9.65      11.94     80.8%
K80         1.87                 13.10     15.46     84.7%
K80 vs K40  131%                 136%      129%

Observations:
● Can we do better? We know that w/ K40 we can…
● Lots of iterations to find the optimal settings at node level
● Looks very good at node level; iterate to get the best efficiency
● May be expected for an app that runs mostly on the GPU when the cores are not loaded equally; at this point it makes sense, in line with the peak ratio
● Driving more GPUs and sharing a path, so not expecting it to be the same number all the time – a “known secret” in performance reporting…
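The efficiency columns in the tables are simply R_max / R_peak; a minimal sketch of how they and the K80-vs-K40 ratios are derived from the measured values above:

```python
# HPL efficiency and K80/K40 ratios at node level, from the measured numbers above.
# Rmax = measured HPL TF for one 8-GPU server; Rpeak = theoretical node peak TF.
nodes = {
    # card: (Rmax boost-on defaults, Rmax boost-off fixed clock, Rpeak)
    "K40": (8.98, 9.65, 11.94),
    "K80": (12.62, 13.10, 15.46),
}

for card, (rmax_boost, rmax_fixed, rpeak) in nodes.items():
    print(f"{card}: boost-on efficiency = {rmax_boost / rpeak:.1%}, "
          f"boost-off efficiency = {rmax_fixed / rpeak:.1%}")

ratio = nodes["K80"][1] / nodes["K40"][1]
print(f"K80 vs K40 node Rmax (boost off): {ratio:.0%}")   # ~136%
```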

What to expect from K80 vs K40 at system level?

● Use the node-level parameters and expand: 4 nodes, 8 nodes… to system level
● We did have results for a similar 22-node CS-Storm system w/ K40s – our Green500 submission from November, “Storm1”

Boost off (fixed clock):
            1 GPU card:         1 server = 8 GPUs:               1 full rack = 22 servers = 176 GPUs:
            Peak TF, no boost   R_max    R_peak    Efficiency    R_max    R_peak    Efficiency
K40         1.43                9.65     11.94     80.8%         180      259       69.5%
K80         1.87                13.1     15.46     84.7%         233      339       68.7%
K80 vs K40  131%                136%     129%                    129%     131%

Note: expect somewhere around this rack-level efficiency if using the same HPL code/settings…

● Comparable system performance:
   ● reflects the difference in peak GPU card performance: ~30% more w/ K80 at node and at system level
   ● efficiency stays in the expected range
● HPL tuning makes a difference – we started with an HPL tuned for Cray XC40 with K20s (for the Kepler architecture / generation)
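A simple way to set expectations before a full-rack run is to scale the node-level result and compare against the efficiency actually achieved on the 22-node system. The sketch below reproduces the rack-level numbers from the tables above; the node-to-rack retention factor is derived from those measurements, not a separate input.

```python
# Projecting a 22-node rack from node-level HPL results (numbers from the tables above).
NODES = 22

racks = {
    # card: (node Rmax, node Rpeak, measured rack Rmax, rack Rpeak) in TF
    "K40": (9.65, 11.94, 180.0, 259.0),
    "K80": (13.10, 15.46, 233.0, 339.0),
}

for card, (node_rmax, node_rpeak, rack_rmax, rack_rpeak) in racks.items():
    ideal_rack = NODES * node_rmax              # perfect scaling of the node result
    retention = rack_rmax / ideal_rack          # fraction kept when going to 22 nodes
    print(f"{card}: ideal 22-node Rmax {ideal_rack:.0f} TF, "
          f"measured {rack_rmax:.0f} TF ({retention:.0%} retained), "
          f"rack efficiency {rack_rmax / rack_rpeak:.1%}")
# K40: ~85% of the ideal node scaling retained, 69.5% rack efficiency
# K80: ~81% retained, 68.7% rack efficiency
```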

CS-Storm: Compute Density and Efficiency Leader

Performance per rack:
. 176 x NVIDIA Tesla GPUs, 22 servers
. 180 TF Linpack performance w/ K40
. 233 TF Linpack performance w/ K80 (1)
. 3.96 TF/kW w/ K40 – #4 Green500 @ Nov ’14

Power and space efficiency:
. Performance density (flops / sqft):
   • ~30% more w/ K80 than K40 (2)
   • 6x compute-only blades (4)
. Power efficiency (flops / Watt):
   • ~14% more w/ K80 than K40 (3)
   • 4x compute-only blades (4)

For the right GPU-optimized applications, the performance improvements and power and floor space savings offered by the CS-Storm system are leading the industry

(1) Early result with K80; (2) based on measured Linpack performance w/ a 22-node rack; (3) estimated; (4) estimated using IVB processors

How do we start understanding a system’s behavior at the system sizes Cray deals with?

● Build up from what you know: use HPL, which has a “built-in correct answer”, to understand how the system behaves…
● Start at node level w/ default settings and measure the airflow, temperature variations, clock speeds and power variations on the GPUs within a node
● Understand at node level how to set parameters for HPL, then increase the size of the simulation…
● Look for the expected scaling behavior of HPL – normally the efficiency for GPU systems should stay constant / smooth as size increases if tuned correctly

● So… we build a very large system in manufacturing, and run the “safe” type of test on the whole system…
   ● What do you do if the performance is terrible? How do you find where the issue is?
   ● Stepwise: test nodes and pairs of nodes and look for any anomalies (a toy version of this scan is sketched below)
   ● Within nodes vs across nodes – study the patterns in the results
   ● You need good cluster management software with a GUI that lets you see what is happening in the system during the test, in real time (Cray ACE is good for this!)

● 560-node system after one night of work (not much time for tuning here!)…  #10 on Top500, Nov ‘14
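A hedged illustration of the "stepwise" debugging idea: run single-node HPL everywhere, then flag nodes whose efficiency falls noticeably below the pack. This is a generic sketch with hypothetical per-node results; it is not Cray ACE functionality or the actual bring-up scripts used on the 560-node system.

```python
# Toy anomaly scan over per-node HPL results: flag nodes well below the median.
# Generic illustration only; not the ACE tooling or the real bring-up scripts.
from statistics import median

NODE_RPEAK_TF = 11.94   # node R_peak from the K40 table above

def flag_slow_nodes(node_rmax_tf, tolerance=0.05):
    """Return nodes whose HPL efficiency is more than `tolerance` below the median."""
    eff = {node: rmax / NODE_RPEAK_TF for node, rmax in node_rmax_tf.items()}
    med = median(eff.values())
    return {node: e for node, e in eff.items() if e < med - tolerance}

# Hypothetical single-node HPL results (TF) for a handful of nodes:
results = {"node001": 9.61, "node002": 9.58, "node003": 8.02, "node004": 9.64}
print(flag_slow_nodes(results))   # node003 stands out and gets looked at first
```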

Many GPUs in Oil and Gas Applications
Almost linear scaling on the GPUs – CS-Storm, two-node test

SPECFEM3D Strong Scaling, using a complex model on multi-GPU CS-Storm servers

SPECFEM3D:
• Seismology community code, proxy for seismic applications
• CUDA version developed by Daniel Peter, ETH
• Data courtesy of BP & Princeton (3D elastic, isotropic model)

[Chart: SPECFEM3D speed-up vs. number of K40 GPUs (1, 2, 4, 8, 16) for the simple_model and BP Demo cases, compared against ideal scaling]

SpecFEM3D linear scaling on K40 and K80

SpecFEM3D Strong Scaling: K40 to K80 Performance Improvements

[Chart: wall clock time (sec) vs. number of GPU cards, for K40 and K80 nodes; linear scaling with the number of GPUs; ~½ the run time per node by using K80 nodes]

• Almost perfect scaling by adding more GPUs, and with K80 vs K40
• K80 nodes showing a clear performance advantage: ~2x the performance
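For readers reproducing the scaling plots, speed-up and parallel efficiency are just ratios of wall-clock times. A minimal sketch follows; the timings used are hypothetical placeholders, not the measured SPECFEM3D data shown in the charts.

```python
# Strong-scaling speed-up and parallel efficiency from wall-clock times.
# Timings below are hypothetical placeholders, not the measured SPECFEM3D results.
def strong_scaling(times_by_gpus):
    base_gpus = min(times_by_gpus)            # smallest GPU count is the baseline
    t_base = times_by_gpus[base_gpus]
    for n, t in sorted(times_by_gpus.items()):
        speedup = t_base / t
        efficiency = speedup / (n / base_gpus)
        print(f"{n:2d} GPUs: speed-up {speedup:5.2f}x, parallel efficiency {efficiency:.0%}")

strong_scaling({1: 400.0, 2: 205.0, 4: 104.0, 8: 53.0, 16: 28.0})
```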

Real-time Geospatial Visualization on CS-Storm
Use Case: GIS Federal – GPUdb

Challenge: Visualize fast changing geospatial data in response to a wide range of ad-hoc user queries.

Solution: GPUdb leverages CS-Storm by
● Holding partitioned geo- and time-coded data in GPU memory
● Leveraging the brute power of GPUs to respond to ad-hoc queries – without query optimization
● Scaling across multiple GPUs per node and multiple Cray CS-Storm nodes to handle billions of entries

For GPUdb, processing speed is proportional to the available GPU memory – K80 will be an advantage

[Chart: billions of Tweets handled – generic 2-GPU server vs. a single CS-Storm server (2U) and a 22-server rack]

“CS-Storm’s dense, tightly integrated GPU architecture enables us to process 4 times more data compared with commodity servers, with relatively low power consumption.”
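To illustrate the "brute force, no query optimizer" idea, here is a minimal sketch of an ad-hoc geo/time filter over data resident in GPU memory, written with CuPy. It is only an analogy: GPUdb's actual API, storage layout and kernels are not described in this deck, and all names and data below are made up for the example.

```python
# Brute-force ad-hoc query over geo/time-coded points held in GPU memory (CuPy sketch).
# Illustrative analogy only; this is not GPUdb's API or data layout.
import cupy as cp

N = 10_000_000
# Synthetic partition of records kept resident on one GPU: lon, lat, unix timestamp.
lon = cp.random.uniform(-180, 180, N)
lat = cp.random.uniform(-90, 90, N)
ts = cp.random.randint(1_400_000_000, 1_420_000_000, N)

def adhoc_query(lon_min, lon_max, lat_min, lat_max, t_min, t_max):
    # No index, no planner: evaluate the predicate over every record in parallel.
    mask = ((lon >= lon_min) & (lon <= lon_max) &
            (lat >= lat_min) & (lat <= lat_max) &
            (ts >= t_min) & (ts <= t_max))
    return int(mask.sum())   # count of matches; a real system would feed them to the viz layer

print(adhoc_query(-123, -121, 37, 38, 1_410_000_000, 1_411_000_000))
```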

What industries and applications would benefit from such a high density GPU system?

Industry – Applications:
• Defense / Security / Cybersecurity: Information and image processing, geospatial intelligence, pattern recognition (radar)
• Financial Markets: High-frequency trading (HFT)
• Oil and Gas: Seismic processing, simulation and modeling
• Life Sciences: Structural biology, medical imaging, genomics, biomarker discovery/analysis
• Media and Entertainment: Image rendering
• National Labs, Large Research Institutions (science): Modeling and analytics; physics and astrophysics; computer science and information science
• University Research Centers: Workloads limited to the CS-Storm application fit; research in computer science
• Weather, Climate and Remote Sensing: New or redeveloped climate and weather models developed specifically to use GPUs; remote sensing data processing applications
• Business Intelligence: Information processing, machine learning
• Across Industries: Signal processing (data/signals, voice, images, text), machine learning / “deep learning”

Note: Cray already has customers in these areas (all items in this column…)

CS-Storm systems shipped - Pictures from Cray Manufacturing @ Chippewa Falls, WI

• #10 system in Top500: 560 nodes w/ K40, shipped Oct 2014
• Financial services system: nodes w/ K40, shipped Dec 2014

• Newest system, w/ K80 nodes: just shipped last week to a university customer in the Bay Area!

. High density, accelerated system
. Power and space efficient
. Built and supported by Cray

Many thanks to:

• Kevin McMahon from Cray’s performance team

• NVIDIA alliances and benchmarking team

Cray CS-Storm: Uncompromising Performance

Powerful and Efficient
• Maximum performance in a single rack
• Power and cooling to spare
• Allows GPUs to run at full power; future-proof

Performance by Design
• Optimized for scalable GPU applications
• Full-system solution featuring Cray management software and Cray Programming Environment
• Designed for upgradeability to protect your investment

Cray Service and Reliability
• Reliability, redundancy and serviceability
• Cray expertise

Simply, CS-Storm was designed to be the best GPU system for the most demanding customers

Safe Harbor Statement

This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions, and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.
