AI for HPC and HPC for AI Workflows: The Differences, Gaps and Opportunities with Data Management

@SC Asia 2018

Rangan Sukumar, PhD, Office of the CTO, Cray Inc.

Safe Harbor Statement

This presentation may contain forward-looking statements that are based on our current expectations. Forward-looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

AI for HPC and HPC for AI @ Cray

Urika-GX: Integrated Analytics and AI. Scalable, high-performance platform for data preparation, analytics and machine learning.

CS-Storm 500NX / 500GT: Dense GPU systems with broad support for NVIDIA® Tesla® Accelerators and FPGAs, for AI/DL.

AI for HPC and HPC for AI: Today's talk

Big Four Consulting Firm (Urika-GX): "Teaming with Cray was a clear choice and allows for versatility combined with the speed to tackle big data and large machine learning and deep learning problems." -- Deloitte Advisory Cyber Risk Services

The Stanford Research Computing Facility (Cray CS-Storm Dense GPU Cluster "XStream"): Supporting core research and development in areas including astrophysics, structural biology, chem-informatics, bioinformatics, materials modeling, and climate. Recent breakthrough in 3D deep convolutional neural networks for amino acid environment similarity analysis.

Argonne National Laboratory (Cray XC40): Predicting how specific patients and tumors respond to different types of drugs using the scalable deep neural network code called CANDLE, the CANcer Distributed Learning Environment.

Leading Seismic Processing Services Company (Cray XC40): Machine learning at scale for Full Waveform Inversion. PGS applied machine learning optimization techniques such as regularization and steering to determine the velocity model in a Full Waveform Inversion seismic imaging workload.

Top 5 Global Pharma (Cray CS-Storm Dense GPU Cluster): Recent win to support machine learning and deep learning workloads.

Fortune 20 Global Company (CS-Storm Dense GPU technology): Machine learning and deep learning workloads, including autonomous vehicles.

Use-case: AI for HPC Application

"Ground truth" benchmark: synthetic benchmark with known subsurface geometry and seismic reflection data.

Conventional FWI: start with an initial velocity model to try to explain the reflected energy waveforms. Conventional FWI attempts to derive a more accurate velocity model.

FWI with Machine Learning: using machine learning (regularization and steering) to guide the convergence process.
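As a rough illustration of how regularization can "steer" an iterative inversion toward a plausible velocity model, the sketch below runs gradient descent on a toy 1-D model with a Tikhonov smoothness penalty. It is a minimal sketch only; the forward operator, data, regularization weight and step size are all hypothetical and it is not PGS's production FWI code.

```python
import numpy as np

# Toy "forward model": a random linear operator standing in for wave propagation.
rng = np.random.default_rng(0)
n_params, n_data = 50, 200
G = rng.normal(size=(n_data, n_params))

# Hypothetical "true" velocity model and the observed data it would produce.
m_true = np.linspace(1500.0, 4500.0, n_params)   # m/s
d_obs = G @ m_true

# First-difference operator used as a smoothness (Tikhonov) regularizer.
L = np.diff(np.eye(n_params), axis=0)

def objective(m, lam):
    """Data misfit plus the smoothness penalty that steers the inversion."""
    return 0.5 * np.sum((G @ m - d_obs) ** 2) + 0.5 * lam * np.sum((L @ m) ** 2)

m = np.full(n_params, 3000.0)    # initial velocity model
lam, step = 10.0, 1e-5           # hypothetical regularization weight and step size
for it in range(200):
    grad = G.T @ (G @ m - d_obs) + lam * (L.T @ (L @ m))
    m -= step * grad

print(f"final objective: {objective(m, lam):.3e}")
```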

Use-case: HPC for AI Application

CNTK is already MPI-tuned.

"Piz Daint is a Cray XC50 with Xeon E5-2690v3 12C 2.6GHz, Aries interconnect, and 4,888 NVIDIA Tesla P100 GPUs."
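CNTK's MPI tuning itself is not shown here, but the pattern it exploits, data-parallel gradient averaging over an MPI allreduce, can be sketched in a few lines with mpi4py. This is an illustrative skeleton with a made-up quadratic "loss", not CNTK code or the Piz Daint configuration.

```python
# Minimal data-parallel SGD skeleton: each rank computes a local gradient on its
# shard of the data, then gradients are averaged with MPI_Allreduce.
# Run with e.g.:  mpirun -np 4 python dataparallel_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)            # each rank owns a different data shard
x_local = rng.normal(loc=3.0, size=10_000)   # hypothetical local samples

w = np.zeros(1)                               # toy model: estimate the mean via SGD
lr = 0.1
for step in range(100):
    batch = rng.choice(x_local, size=256)
    local_grad = np.array([np.mean(w[0] - batch)])   # gradient of 0.5*(w - x)^2

    # Average gradients across all ranks: the communication step MPI tuning targets.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    global_grad /= size

    w -= lr * global_grad

if rank == 0:
    print(f"estimated mean after training: {w[0]:.3f}")
```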

The State of Practice: Tale of Two Ecosystems

Dimensions where Scientific Computing and Enterprise Computing differ:

Performance

Languages and tools

Programming models

Access policy

Resource manager and job-scheduling

Architectural differences

J. Dongarra et al., Exascale Computing and Big Data: The Next Frontier, Communications of the ACM, 2015.

The State of Practice: AI for HPC and HPC for AI

Scientific Computing: HPC Architectures
Enterprise Computing: Commodity Architectures

Each makes assumptions about the direction of data flow and the sizes of the data to be moved.

The State of Practice: Tale of Two Ecosystems

Dimension | Scientific Computing | Enterprise Computing
Primarily used for | Solving equations | Search/Query, Machine learning
Philosophy | Send data to compute | Send compute to data
Efficiency via | Parallelism | Distribution
Scaling expectation | Strong (scale-up) | Weak (scale-out)
Programming model | MPI, OpenMP, etc. | Map-reduce, SPMD, etc.
Popular languages | FORTRAN, C++, Python | Java, Scala, Python, R
Design strength | Multi-node communication using an interconnect | Built-in job fault tolerance over Ethernet
Access model | On-premise | Cloud
Preferred algebra | Dense Linear | Set-theoretic / Relational
Memory access | Predictable | Random
Storage | Centralized, POSIX/RAID | Decentralized, Duplication
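To make the "send data to compute" versus "send compute to data" row concrete, the toy sketch below computes the same global sum two ways: by gathering everything in one place and reducing it, and by reducing each partition where it lives and moving only small partial results. The partitions and sizes are hypothetical; this is only a sketch of the two philosophies, not a benchmark.

```python
from functools import reduce
from multiprocessing import Pool

# Hypothetical data partitions, standing in for shards on different nodes/disks.
partitions = [list(range(i, i + 1_000_000, 7)) for i in range(4)]

# "Send data to compute" (HPC flavour): move everything to one place, then reduce.
all_data = [x for part in partitions for x in part]      # simulated data movement
total_centralized = sum(all_data)

# "Send compute to data" (enterprise flavour): reduce each partition locally
# (here: in worker processes), then combine only the small partial sums.
def local_sum(part):
    return sum(part)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partial_sums = pool.map(local_sum, partitions)           # map phase
    total_mapreduce = reduce(lambda a, b: a + b, partial_sums)   # reduce phase
    print(total_centralized == total_mapreduce)   # same answer, different data motion
```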

Terminologies: Tale of Two Ecosystems

Term | Scientific Computing | Enterprise Computing
Data (Structured) | Vector, Matrix, Tensor | Table, Key-Values, Objects
Data (Unstructured) | Mesh, Images (Physics-based) | Documents, Images (Camera)
Visualization | Voxel, Surface, Point Clouds | Word Cloud, Parallel Coordinates, BI Tools
Validation | Cross-validation (ROC curves, statistical significance) | Manual / Subject-matter expert, A/B testing
Extract, Transform, Load | Fourier, Wavelet, Laplace, etc.; Cartesian, Radial, Toroidal, etc. | File-format transformations, e.g. CSV to VRML
Search (Query) | Properties such as periodicity, self-similarity, anomaly, etc. | SQL, SPARQL, etc. (Sum, Average, Group by)
Funding Model | Non-profit grand challenge (Answer matters) | Value-driven (Cost matters)

Sukumar, S. R., et al. (2016, December). Kernels for scalable data analysis in science: Towards an architecture-portable future. In Proc. of the 2016 IEEE International Conference on Big Data, pp. 1026-1031.

Deep Learning: Tale of Two Ecosystems

Aspect | Scientific Computing | Enterprise Computing
Model | Domain-specific | CNN, RNN, LSTM, GAN, etc.
Baseline | Theoretic, e.g. Navier-Stokes | Humans, other ML algorithms
Parallelism | Model, Ensemble | Data
Use Case | Computational steering, proxy models | Speech, text, image interpretation; hyper-personalization
Source File System | Lustre and GPFS | HDFS, S3, NFS, etc.
Figure of Merit | Interpretability, feasibility | Time-to-accuracy, model size
Training Data | O(GBs) per sample, O(10^3) samples, O(10) categories | O(KBs) per sample, O(10^6) samples, O(10^4) categories
Data Model | HDF5, NetCDF | Relational, Document, Key-Value
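The training-data row is the difference most practitioners hit first: a few thousand multi-gigabyte HDF5/NetCDF samples versus millions of kilobyte-scale records. The h5py sketch below shows the scientific-computing side, streaming large samples out of a single HDF5 file; the file name, dataset names and shapes are hypothetical and kept tiny so the example runs quickly.

```python
import numpy as np
import h5py

# Create a small stand-in for a scientific training set: a few large samples
# stored as one chunked dataset (real samples would be O(GB) each).
with h5py.File("simulation_samples.h5", "w") as f:
    f.create_dataset("fields", data=np.random.rand(8, 256, 256, 16),
                     chunks=(1, 256, 256, 16))
    f.create_dataset("labels", data=np.arange(8))

# A training loop reads one large sample at a time rather than millions of tiny rows.
with h5py.File("simulation_samples.h5", "r") as f:
    fields, labels = f["fields"], f["labels"]
    for i in range(fields.shape[0]):
        sample = fields[i]          # contiguous chunked read of one big sample
        label = labels[i]
        # ... feed (sample, label) to the model here ...
        print(i, sample.shape, int(label))
```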

The Convergence: AI for HPC

[Figure: Coupled simulation-and-AI workflow. Job1 (simulation) writes its results to Lustre via POSIX; Job2 (AI) reads them and steers a follow-on Job1' (simulation). Parallel I/O (the number of I/O channels) and the number of memory channels are the limiting resources; each job has its own time to solution, while the coupled workflow determines the time to insight.]
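A minimal sketch of the coupled pattern in the figure: a "simulation" step writes its state to a shared (for example Lustre-mounted) directory via ordinary POSIX I/O, an "AI" step reads it back and proposes new parameters, and the loop repeats. The directory name, the toy physics and the steering rule are all hypothetical.

```python
# Toy simulation <-> AI steering loop over a shared POSIX directory
# (on a real system this would be a Lustre path visible to both jobs).
from pathlib import Path
import numpy as np

workdir = Path("shared_scratch")
workdir.mkdir(exist_ok=True)

def simulate(param, step):
    """Stand-in for Job1/Job1': produce a field that depends on a parameter."""
    field = np.sin(param * np.linspace(0, 2 * np.pi, 1024))
    np.save(workdir / f"checkpoint_{step}.npy", field)    # POSIX write
    return field

def ai_steer(step):
    """Stand-in for Job2: read the checkpoint and suggest the next parameter."""
    field = np.load(workdir / f"checkpoint_{step}.npy")   # POSIX read
    # Hypothetical steering rule: nudge the parameter toward a target amplitude.
    return 1.0 + 0.5 * (0.8 - np.abs(field).mean())

param = 1.0
for step in range(5):
    simulate(param, step)
    param = ai_steer(step)
    print(f"step {step}: next parameter = {param:.3f}")
```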

The Convergence: HPC for AI

Opportunity for productivity with strong scaling (Source: NVIDIA)

ResNet-50 Success | Time-to-accuracy | How many GPUs? | Scalability Efficiency
Facebook (Caffe2) | 2 days / 1 hour | 352 / 256 GPUs | 90% (large-batch)
IBM PowerAI (Caffe) | 50 minutes | 256 GPUs | 95% (large-batch)
Google (TensorFlow) | ~24 hours | 64 TPUs | >90%
Preferred Networks (Chainer) | 15 minutes | 1000 GPUs | >90%
Cray @ CSCS (TensorFlow) | <14 minutes | 1000 GPUs | ~>95%

Productivity is performance, and performance translates to productivity... (Source: Baidu)
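Scalability efficiency in the table is usually computed as achieved speedup divided by the number of workers. The helper below does that arithmetic; the single-GPU baseline time plugged in is a hypothetical number for illustration, not a published figure.

```python
def scaling_efficiency(time_1gpu_hours, time_ngpu_hours, n_gpus):
    """Efficiency = actual speedup / ideal speedup (i.e. n_gpus)."""
    speedup = time_1gpu_hours / time_ngpu_hours
    return speedup / n_gpus

# Hypothetical example: if a single GPU needed ~230 hours for the same training
# run, then 256 GPUs finishing in 1 hour would be roughly 90% efficient.
print(f"{scaling_efficiency(230.0, 1.0, 256):.1%}")
```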

13 Bottlenecks Today Other Nodes

Potential off-node I/O requirement NFS Client w/CacheFS EDR IB Casual NAS (NetApp) Possible 1GB/s Copy In G G (future 8GB/s) From Data 10GbE S C From local SSD Sources F F F ?GbE S G G S G G F F F C S G G Stage 2GB/s 2TB SSD RAID5 (write b/w ~half) Bottlenecks Looking Ahead…
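The copy-in and staging bandwidths in the figure above determine how long data movement alone takes before training even starts. The small calculation below shows that arithmetic for a hypothetical 2 TB training set at the bandwidths quoted in the figure.

```python
def staging_time_hours(dataset_tb, bandwidth_gb_per_s):
    """Time to move a dataset at a sustained bandwidth (1 TB = 1000 GB here)."""
    return dataset_tb * 1000.0 / bandwidth_gb_per_s / 3600.0

dataset_tb = 2.0   # hypothetical training set that fills the node-local SSD
for bw in (1.0, 2.0, 8.0):   # NAS today, local SSD stage, projected future NAS
    print(f"{bw:>4.1f} GB/s -> {staging_time_hours(dataset_tb, bw):.2f} hours")
```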

Bottlenecks Looking Ahead…

Figures-of-merit | State-of-practice | Projected 2 years ahead
Training time to best accuracy | 5+ days | 2+ hours
Model cost per TB (AWS GPUs) | ~$25K (ResNet training on 80 GPUs for 5 days) | ~$10K
Hardware efficiency (network depth : flops :: 20x : 16x, based on AlexNet-2012 and ResNet-2015) | O(~25 Gflops) | O(Teraflops)
Statistical efficiency (depth : accuracy :: 20x : 13+, based on AlexNet-2012 and ResNet-2015) | O(~25 Gflops) | O(Teraflops)
Need for compute as data grows (data : flops : error :: 2x : 5x : 3+, based on DeepSpeech1 and DeepSpeech2) | O(~465 Gflops) | O(Petaflops)
Training cadence | ~Monthly | ~Daily
Number of models per organization | 1x | 10-100x
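The ~$25K figure in the table follows directly from GPU-hours times an hourly price. The sketch below reproduces that arithmetic; the $2.60/GPU-hour rate is an assumed on-demand price used for illustration, not a quoted AWS rate.

```python
def training_cost_usd(n_gpus, days, usd_per_gpu_hour):
    """Cloud training cost = GPU count x wall-clock hours x hourly GPU price."""
    return n_gpus * days * 24 * usd_per_gpu_hour

# ResNet-style run from the table: 80 GPUs for 5 days at an assumed ~$2.60/GPU-hour.
print(f"${training_cost_usd(80, 5, 2.60):,.0f}")   # roughly $25K
```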

Solution: System to Eco-system Thinking

Hardware, Software, Ecosystem

Facility Performance

System Performance

Multi-node Performance

Node Performance

Component Performance (I/O)

Solution: Communication-Aware Data Objects

[Figure: Communication-aware data-object patterns composed from source (S), processing (P) and analysis (A) stages: asynchronous DAG execution, dataflow semantics, data management and curation (e.g., staging data between KNL and Xeon nodes), data-driven notification, human-in-the-loop analysis, and data-dependent workflows.]
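As a minimal illustration of the asynchronous DAG-execution and dataflow pattern sketched above, the snippet below wires source (S), processing (P) and analysis (A) stages into a small dependency graph using concurrent.futures. The stage bodies are hypothetical placeholders, not the data-object API described in the talk.

```python
# Tiny async DAG: S -> (P1, P2) -> A, expressed with futures instead of
# synchronous file hand-offs. Stage bodies are placeholders.
from concurrent.futures import ThreadPoolExecutor

def source():
    return list(range(10))                # S: produce data

def process(data, scale):
    return [x * scale for x in data]      # P1/P2: transform data

def analyze(a, b):
    return sum(a) + sum(b)                # A: combine partial results

with ThreadPoolExecutor() as pool:
    s = pool.submit(source)
    p1 = pool.submit(lambda: process(s.result(), 2))   # runs once S completes
    p2 = pool.submit(lambda: process(s.result(), 3))
    a = pool.submit(lambda: analyze(p1.result(), p2.result()))
    print("analysis result:", a.result())
```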

Solution: HPC Best Practices for Data Management

Making data access and I/O methods available and relevant to all levels of the software stack.

Solution: End-to-End Thinking with Benchmarks

Training and inferencing pipeline: Data Acquisition -> Data Preparation -> Model Training -> Model Testing -> Model Deployment -> A/B testing in production.

Data preparation: training-data annotation (ground truth) with cleansing, shaping, and enrichment; the data are split into train, test, and validation sets.

Model training and testing: cross-validation to evaluate performance and optimize the model, with iterative model management.
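A minimal sketch of the split-and-validate step in the pipeline above, using scikit-learn on synthetic data; the dataset, the model choice and the split ratios are illustrative assumptions rather than a recommended recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the annotated (ground-truth) dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out a test set; keep the rest for training and cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training portion to evaluate and tune the model.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

# Final fit and evaluation on the held-out test set before deployment.
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```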

Future: Integration of Storage, Memory and Compute

● General purpose flexibility: commodity-like configurations
● Seamless heterogeneity: CPUs, GPUs, FPGAs, ASICs
● High-performance interconnects for data centers: MPI and TCP/IP collectives, compute on the network
● Unified software stack: programming environment for performance and productivity
● Workflow optimization: match growth in compute and data with I/O
