at OLCF

Bronson Messer, Scientific Group, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory

Mallikarjun “Arjun” Shankar, Group Leader – Advanced Data and Workflows Group, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

OLCF Data/Learning Strategy & Tactics

1. Engage with applications
   – Summit Early Science Applications (e.g., CANDLE)
   – INCITE projects (e.g., Co-evolutionary Networks: From Genome to 3D Proteome, Jacobson, et al.)
   – Director's Discretionary projects (e.g., Fusion RNN, MINERvA)
2. Create leadership-class analytics capabilities
   – Leadership analytics (e.g., frameworks: pbdR, TensorFlow + Horovod)
   – Algorithms requiring scale (e.g., non-negative matrix factorization)
3. Enable infrastructure for analytics/AI and data-intensive facilities
   – Workflows to include data from observations for analysis within OLCF
   – Analytics enabling technologies (e.g., container deployments for rapidly changing DL/ML frameworks, analytics notebooks, etc.)
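Item 2 names non-negative matrix factorization (NMF) as an algorithm that needs scale. As a rough illustration of what is being computed, here is a minimal single-process sketch using the classic multiplicative-update rules for the factorization V ≈ W H. The slides do not say which NMF variant or implementation is used at OLCF, so the function names, the update scheme, and the pure-Python matrix helpers are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the multiplicative updates can move.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frobenius_error(v, w, h)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
        errors.append(frobenius_error(v, w, h))
    return w, h, errors
```

Calling `nmf(v, rank=2)` on a small non-negative matrix returns the two factors and the per-iteration reconstruction error. The cost is dominated by the dense matrix products, which is why the method is a natural candidate for distributed linear-algebra libraries at scale.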

Data Science at the OLCF

Applications Supported through DD/ALCC: Selected Projects on Titan, 2016-2017

| Program | PI | PI Employer | Project Name | Allocation (Titan core-hrs) |
|---|---|---|---|---|
| ALCC | Robert Patton | ORNL | Discovering Optimal and Neuromorphic Network Structures using Evolutionary Approaches on High Performance Computers | 75,000,000 |
| ALCC | Gabriel Perdue | FNAL | Large scale deep neural network optimization for neutrino physics | 58,000,000 |
| ALCC | Gregory Laskowski | GE | High-Fidelity Simulations of Gas Turbine Stages for Model Development using Machine Learning | 30,000,000 |
| ALCC | Efthimios Kaxiras | Harvard U. | High-Throughput Screening and Machine Learning for Predicting Catalyst Structure and Designing Effective Catalysts | 17,500,000 |
| ALCC | Georgia Tourassi | ORNL | CANDLE Treatment Strategy Challenge for Deep Learning Enabled Surveillance | 10,000,000 |
| DD | Abhinav Vishnu | PNNL | Machine Learning on Extreme Scale GPU systems | 3,500,000 |
| DD | J. Travis Johnston | ORNL | Surrogate Based Modeling for Deep Learning Hyper-parameter Optimization | 3,500,000 |
| DD | Robert Patton | ORNL | Scalable Deep Learning Systems for Exascale Data Analysis | 6,500,000 |
| DD | William M. Tang | PPPL | Machine Learning for Fusion Energy Applications | 3,000,000 |
| DD | Catherine Schuman | ORNL | Scalable Neuromorphic Simulators: High and Low Level | 5,000,000 |
| DD | Boram Yoon | LANL | for Collider Physics | 2,000,000 |
| DD | Jean-Roch Vlimant | Caltech | HEP DeepLearning | 2,000,000 |
| DD | Arvind Ramanathan | ORNL | ECP Cancer Distributed Learning Environment | 1,500,000 |
| DD | John Cavazos | U. Delaware | Large-Scale Distributed and Deep Learning of Structured Graph Data for Real-Time Program Analysis | 1,000,000 |
| DD | Abhinav Vishnu | PNNL | Machine Learning on Extreme Scale GPU systems | 1,000,000 |
| DD | Gabriel Perdue | FNAL | MACHINE Learning for MINERvA | 1,000,000 |
| TOTAL | | | | 220,500,000 |

- Highlighted rows are Algorithm and Infrastructure Examples; Rest are Primarily Science Applications

Acknowledgement: J. Wells, SC 2017

Gordon Bell Prizes in 2018: Peak Performance on Summit

Attacking the Opioid Epidemic: Determining the Epistatic and Pleiotropic Genetic Architectures for Chronic Pain and Opioid Addiction
Wayne Joubert, Deborah Weighill, David Kainer, Sharlee Climer, Amy Justice, Kjiersten Fagnan, Daniel Jacobson

Exascale Deep Learning for Climate Analytics
Thorsten Kurth, Sean Treichler, Joshua Romero, Mayur Mudigonda, Nathan Luehr, Everett Phillips, Ankur Mahesh, Michael Matheson, Jack Deslippe, Massimiliano Fatica, Prabhat, Michael Houston

Methods: Leadership Data Analytics Capabilities

• Engage parallel math libraries at scale
• R language unchanged
• New distributed concepts
• New profiling capabilities
• New interactive SPMD parallel
• In-situ distributed capability
• In-situ staging capability via ADIOS
• Significantly improved throughput compared to, e.g., Apache Spark
  – “PCA of a 134 GB matrix: ‘hours on . . . Apache Spark, . . . less than a minute using R.’” – June 2016, HPCWire

http://pbdr.org

[Figure: HPC libraries and their R/pbdR connections]
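To make the SPMD idea concrete, the sketch below shows one way a PCA can be split across ranks: every rank computes partial sums over its own block of rows, a reduction combines them, and the covariance matrix and its leading eigenvector follow from the combined totals. This is not pbdR code and it uses no pbdR or MPI calls. It is written in Python, the ranks are simulated with a loop, and `allreduce_sum` is a hypothetical stand-in for a real collective operation. pbdR's own algorithm may well differ (for example, it may rely on a distributed SVD).

```python
def local_stats(rows):
    """Per-rank partial sums: row count, column sums, and sum of outer products."""
    d = len(rows[0])
    col_sum = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            col_sum[i] += r[i]
            for j in range(d):
                outer[i][j] += r[i] * r[j]
    return len(rows), col_sum, outer


def allreduce_sum(parts):
    """Stand-in for an MPI-style allreduce: element-wise sum over all ranks."""
    n = sum(p[0] for p in parts)
    d = len(parts[0][1])
    col_sum = [sum(p[1][i] for p in parts) for i in range(d)]
    outer = [[sum(p[2][i][j] for p in parts) for j in range(d)] for i in range(d)]
    return n, col_sum, outer


def covariance(n, col_sum, outer):
    """Sample covariance from the globally reduced sums."""
    mean = [s / n for s in col_sum]
    d = len(mean)
    return [[(outer[i][j] - n * mean[i] * mean[j]) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=500):
    """Power iteration for the top eigenvector, i.e. the first principal axis."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def distributed_pca(data, n_ranks):
    # Split rows into contiguous blocks, one block per simulated rank.
    block = (len(data) + n_ranks - 1) // n_ranks
    blocks = [data[k:k + block] for k in range(0, len(data), block)]
    parts = [local_stats(b) for b in blocks]   # runs independently on each rank
    cov = covariance(*allreduce_sum(parts))    # only small d x d sums are exchanged
    return cov, leading_component(cov)
```

The point of the design is that the data rows never leave their rank. Only a d × d matrix and a length-d vector are communicated, so the result is the same whether the data sits on one rank or many.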

Scaling Deep Learning for Science with ORNL’s MENNDL

ORNL-designed algorithm leverages Titan to create high-performing neural networks

The ORNL Data Analytics Group used Titan to develop an evolutionary algorithm to search for optimal hyper-parameters and topologies for ML networks.

[Figure: An image generated from neutrino scattering data captured by the MINERvA detector at Fermi National Accelerator Laboratory. Researchers are using MENNDL and the Titan supercomputer to generate deep neural networks that can classify high-energy physics data and improve the efficiencies of measurements.]
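The evolutionary search described above can be sketched in a few lines. The slide does not describe MENNDL's actual operators, encoding, or fitness measure, so everything here is an assumption: the genome (log10 learning rate, layer count, kernel size), the mutation scheme, and `toy_fitness`, which is a synthetic stand-in for "train the network and return its validation score".

```python
import random


def toy_fitness(genome):
    """Hypothetical stand-in for training a network and scoring it."""
    log_lr, layers, kernel = genome
    return -((log_lr + 3.0) ** 2) - 0.1 * (layers - 6) ** 2 - 0.2 * (kernel - 5) ** 2


def random_genome(rng):
    """Sample (log10 learning rate, number of layers, kernel size)."""
    return (rng.uniform(-6.0, 0.0), rng.randint(1, 12), rng.choice([1, 3, 5, 7, 9]))


def mutate(genome, rng):
    """Perturb each gene slightly and clamp it to its allowed range."""
    log_lr, layers, kernel = genome
    log_lr = min(0.0, max(-6.0, log_lr + rng.gauss(0.0, 0.5)))
    layers = min(12, max(1, layers + rng.choice([-1, 0, 1])))
    kernel = min(9, max(1, kernel + rng.choice([-2, 0, 2])))
    return (log_lr, layers, kernel)


def evolve(generations=30, population=16, survivors=4, seed=0):
    rng = random.Random(seed)
    pop = [random_genome(rng) for _ in range(population)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        # Each fitness evaluation is independent, so on an HPC system every
        # candidate network could be trained on its own node or GPU.
        ranked = sorted(pop, key=toy_fitness, reverse=True)
        history.append(toy_fitness(ranked[0]))
        elite = ranked[:survivors]  # elitism: the best candidates carry over
        children = [mutate(rng.choice(elite), rng)
                    for _ in range(population - survivors)]
        pop = elite + children
    best = max(pop, key=toy_fitness)
    return best, history
```

Because the elite survive unchanged, the best fitness per generation never decreases. The population size sets how many training runs can proceed in parallel, which is what makes this style of search a good fit for a machine with many nodes.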

Steven R. Young, Derek C. Rose, Travis Johnston, William T. Heller, Thomas P. Karnowski, Thomas E. Potok, Robert M. Patton, Gabriel Perdue, and Jonathan Miller, “Evolving Deep Networks Using HPC.” In Proceedings of the Machine Learning on HPC Environments. Paper presented at The International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado (November 6 2017), doi: 10.1145/3146347.3146355.

Key Data Science and Learning Methods

| CORAL 2 Benchmark Suite | Components |
|---|---|
| Big Data Analytic Suite (BDAS) | PCA, K-Means, and SVM (based on pbdR) |
| Deep Learning Suite (DLS) | CANDLE, CNN, RNN, and ResNet-50 (distributed memory) |

• PCA, K-Means, etc. excel on the “traditional HW” part of the node due to the node’s memory, CPU, and on-chip bandwidth.
• Deep Learning Codes (CNN; ResNet50; ..) excel here with NVM and GPUs enabling tensor operations.

Code suites are in the CORAL-2 (Collaboration of Oak Ridge, Argonne, and Livermore laboratories) benchmark suite: https://asc.llnl.gov/coral-2-benchmarks/

Big Data Analytic Suite

[Figure: Weak Scaling of Data Benchmarks on Titan. Wall time (seconds) versus # of nodes (16, 32, 64, 128) for PCA, K-Means, and SVM.]

[Figure: Strong Scaling of Data Benchmarks on SummitDev. Wall time (seconds) versus # of nodes (1, 2, 4) for PCA, K-Means, SVM, H2O-PCA, and H2O-K-Means.]

[Figure: Speedup Over Titan Baseline for CORAL-2 Big Data Benchmarks (based on pbdR). PCA, K-Means, and SVM at 1, 2, and 4 nodes on SummitDev and Summit.]
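For readers less familiar with the two kinds of scaling plot above, the following sketch shows how the usual metrics are derived from wall times. In strong scaling the total problem size is fixed, so ideal wall time falls as 1/nodes. In weak scaling the work per node is fixed, so ideal wall time stays flat. The function names are illustrative, and the measured values from the slides are not recoverable from the text, so any timings passed in are the caller's own.

```python
def strong_scaling(nodes, wall_times):
    """Speedup and parallel efficiency for a fixed total problem size.

    The first entry is treated as the baseline run.
    """
    n0, t0 = nodes[0], wall_times[0]
    rows = []
    for n, t in zip(nodes, wall_times):
        speedup = t0 / t                  # how much faster than the baseline
        ideal_time = t0 * n0 / n          # perfect scaling: time drops as 1/n
        efficiency = speedup / (n / n0)   # 1.0 means perfect scaling
        rows.append({"nodes": n, "speedup": speedup,
                     "ideal_time": ideal_time, "efficiency": efficiency})
    return rows


def weak_scaling_efficiency(wall_times):
    """Efficiency when work per node is fixed: ideal wall time stays constant."""
    t0 = wall_times[0]
    return [t0 / t for t in wall_times]
```

As a usage example with made-up numbers, `strong_scaling([1, 2, 4], [40.0, 22.0, 12.5])` reports a speedup of 3.2 and an efficiency of 0.8 at 4 nodes, against an ideal wall time of 10 seconds. The gap between actual and ideal time is the same quantity that an "actual vs. ideal" scaling plot displays.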

Deep Learning Suite

[Figure: Speedup Over Titan Baseline for CORAL-2 Deep Learning Benchmarks. Bars for SummitDev and Summit across CANDLE, RNN, CNN-googlenet, CNN-vgg, CNN-alexnet, and CNN-overfeat; vertical axis from 0 to 15. Bar annotations in extraction order: x6, x6, x6, x4, x6, x4, x4, x4, x5.9, x6, x3.5, x4.]

Strong Scaling of ResNet-50 on Summit

Scaling of ResNet-50 based on Keras (TensorFlow backend) and Horovod on ImageNet data

[Figure: seconds/epoch (axis ticks 20, 200, 2000) versus # of GPUs (24, 48, 96, 192, 384), actual and ideal curves.]

Infrastructure - Cross-Facility Design Pattern

From: “Policy Considerations when Federating Facilities for Experimental and Observational Data Analysis,” Mallikarjun (Arjun) Shankar, Suhas Somnath, Sadaf Alam, Derek Feichtinger, Leonardo Sala, and Jack Wells (2018, submitted book chapter)

CADES runs BEAM and Pycroscopy for SNS and CNMS

CADES: Compute and Data Environment for Science

[Figure: tiered architecture diagram. Tier labels: Interface/Visualization Tier, Logic Tier, Data Tier, Scientific Instrument Tier, Supercomputing Tier. Component labels: Web and Data Server; Access from anywhere; Supercomputing Compute Resource; PHP Web Service; Parallel Data Analysis Routines & Serial Data Analysis; HTTPS; MySQL Database; Data & Artifacts (local); CNMS Resources; High Speed Secure Data Transfer; Scanning Transmission Electron Microscopy (STEM); Scanning Probe Microscopy (SPM).]

Lingerfelt et al., Procedia Computer Science 80, 2276-2280 (2016).

Towards Data Service Offerings and Easier Data Access Across Facilities

• Categories of Data Services
  – Type 1 data repository program for “data-only” projects.
  – Type 2 data services program for user communities.
  – Type 3 computational and data science end station program.
• “DataFed” prototype to enable federated data access across facilities (currently being tested)

Data Services Program POC: Val Anantharaj; DataFed POC: Dale Stansberry