Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale

Atılım Güneş Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, Saeid Naderiparizi, Bradley Gram-Hansen, Gilles Louppe, Mingfei Ma, Xiaohui Zhao, Philip Torr, Victor Lee, Kyle Cranmer, Prabhat, and Frank Wood

SC19, Denver, CO, United States, 19 November 2019

Simulation and HPC

Computational models and simulation are key to scientific advance at all scales

Particle physics, nuclear physics, drug discovery, weather, climate science, cosmology

Introducing a new way to use existing simulators

Probabilistic programming (machine learning) + Simulation + Supercomputing

Simulators

Parameters → Simulator → Outputs (data)

Prediction:
● Simulate forward evolution of the system
● Generate samples of output

WE NEED THE INVERSE!

Inference:
● Find parameters that can produce (explain) observed data
● An inverse problem
● Often a manual process

Simulators

Parameters → Simulator → Outputs (data)
Inferred parameters ← Simulator ← Observed data

Examples (observed data → inferred parameters):
● Gene expression → gene network
● Seismometer readings → earthquake location & characteristics
● Particle detector readings → event analyses & new particle discoveries

Inverting a simulator

Parameters → Simulator → Outputs (data)
Inferred parameters ← Simulator ← Observed data

Probabilistic programming is a machine learning framework allowing us to
● write programs that define probabilistic models
● run automated Bayesian inference of parameters conditioned on observed outputs (data)
(see the toy sketch below)

Existing systems (e.g. Pyro, Edward, Stan):
● Have been limited to toy and small-scale problems
● Normally require one to implement a probabilistic model from scratch in the chosen language/system
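As a toy illustration of both steps (the model and numbers below are invented for illustration; real systems such as Pyro, Edward, and Stan automate the inference far more efficiently), a stochastic program can be inverted by brute-force rejection sampling:

    import random

    def simulator():
        """A tiny stochastic program: latent parameter -> observable output."""
        rate = random.gammavariate(2.0, 1.0)        # latent parameter (prior)
        hits = sum(random.random() < rate / 10.0    # observable output
                   for _ in range(100))
        return rate, hits

    def infer(observed_hits, num_runs=200_000):
        """Keep parameters whose simulated output matches the observation
        (rejection / ABC-style inference); return the posterior mean."""
        accepted = [rate for rate, hits in (simulator() for _ in range(num_runs))
                    if hits == observed_hits]
        return sum(accepted) / len(accepted)

    print(infer(observed_hits=30))   # parameters that explain 30 observed hits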

Inverting a simulator

Parameters → Simulator (code) → Outputs (data)
Inferred parameters ← Simulator (code) ← Observed data

Key idea: Many HPC simulators are stochastic, and they define probabilistic models by sampling random numbers.

Simulators are probabilistic programs!

We “just” need an infrastructure to execute them as such.

A new probabilistic programming system for simulators and HPC, based on PyTorch

Probabilistic execution

Parameters → Simulator (code) → Outputs (data)
Inferred parameters ← Simulator (code) ← Observed data

● Run forward & catch all random choices (“hijack” all calls to the RNG)
  ○ Probabilistic Programming eXecution protocol (PPX): C++, C#, Dart, Go, Java, JavaScript, Lua, Python, Rust and others
  ○ Uniquely label each choice at runtime by “addresses” derived from stack traces (see the sketch below)
● Record an execution trace: a record of all parameters, random choices, and outputs
  ○ Conditioning: compare simulated output and observed data
● Approximate the distribution of parameters that can produce (explain) the observed data, using inference engines such as Markov chain Monte Carlo (MCMC)
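As a rough sketch of what “hijacking” the RNG and recording an addressed trace can look like, the plain-Python snippet below stands in for the machinery that etalumis/PPX provide; the helper names (record_sample, address_of) and the toy simulator are hypothetical, and addresses are built from the interpreter's stack trace as described above:

    import random
    import traceback

    TRACE = []   # one execution trace: (address, sampled value) pairs

    def address_of():
        """Label a random choice by the stack frames that led to it."""
        frames = traceback.extract_stack()[:-2]   # drop address_of / record_sample
        return "/".join(f"{f.name}:{f.lineno}" for f in frames)

    def record_sample(dist_fn, *args):
        """Hijacked RNG call: draw the value, then record it in the trace."""
        value = dist_fn(*args)
        TRACE.append((address_of(), value))
        return value

    def simulator():
        # The simulator's code is unchanged except that its RNG calls are
        # routed through record_sample (hypothetical instrumentation).
        n = record_sample(random.randint, 1, 3)
        return [record_sample(random.gauss, 0.0, 1.0) for _ in range(n)]

    simulator()
    for addr, value in TRACE:
        print(addr, value)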

This is hard and computationally costly:
● Need to run the simulator up to millions of times
● Simulator execution and MCMC inference are sequential
● MCMC has a “burn-in period” and autocorrelation (see the sketch below)

But we can amortize the cost of inference using deep learning.
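To make the burn-in and autocorrelation concrete, here is a minimal random-walk Metropolis-Hastings loop for a single scalar parameter with a made-up Gaussian likelihood and a flat prior; the real inference engines operate over entire execution traces rather than one number:

    import math
    import random

    def log_likelihood(theta, observed):
        # Made-up model: the observation is theta plus unit-variance Gaussian noise.
        return -0.5 * (observed - theta) ** 2

    def metropolis_hastings(observed, num_samples=10_000, step=0.5):
        theta, samples = 0.0, []
        for _ in range(num_samples):
            proposal = theta + random.gauss(0.0, step)            # random-walk proposal
            log_alpha = log_likelihood(proposal, observed) - log_likelihood(theta, observed)
            if random.random() < math.exp(min(0.0, log_alpha)):   # accept/reject
                theta = proposal
            samples.append(theta)        # successive samples are correlated
        return samples[2000:]            # discard the "burn-in period"

    posterior = metropolis_hastings(observed=1.7)
    print(sum(posterior) / len(posterior))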

Distributed data generation (training data: execution traces) and distributed training

[Diagram: many simulator instances on HPC produce execution traces that feed distributed NN training, yielding a trained NN]

Training (recording simulator behavior)
● Deep recurrent neural network learns all random choices in the simulator
● Dynamic NN: grows with simulator complexity
  ○ Layers get created as we learn more of the simulator
  ○ 100s of millions of parameters in the particle physics simulation
● Costly, but amortized: we need to train only once per given model

Distributed inference

[Diagram: given observed data, the trained NN drives many simulator instances on HPC, producing traces that reproduce the observed data and the inferred parameters]

Inference (controlling simulator behavior)
● Trained deep NN makes intelligent choices given a data observation
● Embarrassingly parallel distributed inference
● No “burn-in period”
● No autocorrelation: every sample is independent
(a schematic of this amortized “inference compilation” idea follows below)
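The sketch below is a schematic of the amortized (“inference compilation”) idea in PyTorch, with a toy one-latent simulator and a small feed-forward proposal network standing in for the dynamic recurrent NN used in etalumis; training minimizes the negative log-probability of simulator-generated latents under a proposal conditioned on the simulated observation:

    import torch
    import torch.nn as nn

    def simulate(batch):
        """Toy simulator: latent mu ~ N(0, 1), observation x ~ N(mu, 0.1)."""
        mu = torch.randn(batch, 1)
        return mu, mu + 0.1 * torch.randn(batch, 1)

    # Proposal network: maps an observation to a Gaussian proposal over the latent.
    net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    for step in range(2000):                       # training: simulate, then fit
        mu, x = simulate(256)
        mean, log_std = net(x).chunk(2, dim=1)
        proposal = torch.distributions.Normal(mean, log_std.exp())
        loss = -proposal.log_prob(mu).mean()       # inference-compilation objective
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Amortized inference: for new observed data the trained network proposes
    # latents directly, with no burn-in and with independent samples.
    mean, log_std = net(torch.tensor([[0.42]])).chunk(2, dim=1)
    print(torch.distributions.Normal(mean, log_std.exp()).sample((1000,)).mean())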

Probabilistic programming software designed for HPC

https://github.com/pyprob/pyprob

● Probabilistic programming system for simulators and HPC, based on PyTorch
● Distributed training and inference; efficient support for multi-TB distributions
● Optimized memory usage, parallel trace processing and combination
(a usage sketch follows below)
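For flavor, a minimal usage sketch in the style of pyprob's documented examples; the toy model is ours, and exact method and engine names may differ between pyprob versions, so treat this as an approximation rather than the canonical API:

    import pyprob
    from pyprob import Model
    from pyprob.distributions import Normal

    class GaussianUnknownMean(Model):
        def forward(self):
            # Latent parameter and likelihood written as ordinary code.
            mu = pyprob.sample(Normal(0, 5))
            pyprob.observe(Normal(mu, 2), name='obs')
            return mu

    model = GaussianUnknownMean()
    # Condition on an observed value and approximate the posterior over mu.
    posterior = model.posterior_distribution(num_traces=2000, observe={'obs': 8.0})
    print(posterior.mean)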

https://github.com/pyprob/ppx

● Probabilistic Programming eXecution protocol
● Simulator and inference/NN executed in separate processes and machines across the network
● Uses flatbuffers to support C++, C#, Dart, Go, Java, JavaScript, Lua, Python, Rust and others
● A probabilistic programming analogue to the Open Neural Network Exchange (ONNX) for deep learning

Pyprob_cpp, RNG front end for C++ simulators https://github.com/pyprob/pyprob_cpp

Containerized workflow

Develop locally, deploy to HPC on many nodes for experiments

etalumis → | ← simulate (“etalumis” is “simulate” reversed)


Simulators for Inference

Parameters → Simulator → Outputs (data)
Inferred parameters ← Simulator ← Observed data

Example: particle detector readings (observed data) → event analyses & new particle discoveries (inferred parameters)

Use case: Large Hadron Collider (LHC)

● Seek to uncover secrets of the universe (new particles)
  ○ Produced by the LHC and measured in detectors such as “ATLAS” and “CMS”
  ○ 100s of millions of channels; 100s of petabytes of collision data
● Physicists compare observed data to detailed simulations
  ○ Billions of CPU hours for scans of simulation parameters
  ○ Currently compare, for example, histograms of summary statistics
    ■ Inefficient, labor-intensive, sometimes ad hoc: not adaptive to observations
● Etalumis replaces this with automated, efficient inference grounded in a statistical framework

Specific use case: tau lepton decay

● “Use the Higgs boson as a new tool for discovery”: a top science driver (https://www.usparticlephysics.org/brochure/science/)
  ○ One avenue is detailed study of Higgs decay to tau leptons

● Tau leptons decay to other particles observed in the detector
  ○ Rates for these ‘channels’ vary over orders of magnitude
  ○ Signs of new physics are subtle; we need a probabilistic approach

Physics expressed in simulator code

● We use the popular existing Sherpa¹ simulator (~1M lines of C++ code) and our own fast 3D detector simulation
● In this simulator, the execution trace (the record of all parameters, random choices, and outputs) represents particle physics collisions and decays

● So etalumis enables interpretability by relating a detector observation to possible traces that can be related back to the physics
  ○ These execution traces can be different-length paths through the code
  ○ Different ‘trace types’ (same address sequence with different sampled values) can occur at very different frequencies

¹ https://gitlab.com/sherpa-team/sherpa

[3D figure with momentum axes px, py, pz]

Etalumis gives access to all latent variables: allows answering any model-based question

Reaching supercomputing scale

We are bringing together probabilistic programming, simulation, and supercomputing

● Need for HPC resources and considerable optimization
  ○ For all components, but particularly NN training

Platforms and experimental setup

● NERSC Cori Cray XC40 (Data partition)
  ○ 16-core 2.3 GHz Intel® Xeon® Haswell (HSW)
  ○ 1.8 PB Burst Buffer (1.7 TB/s)
  ○ 30 PB Lustre filesystem (700 GB/s)
● NERSC Edison Cray XC30
  ○ 12-core 2.4 GHz Intel® Xeon® Ivy Bridge (IVB)
● Intel Diamond Cluster
  ○ 16-core 2.3 GHz Intel® Xeon® Broadwell (BDW)
  ○ 26-core 2.1 GHz Intel® Xeon® Skylake (SKL)
  ○ 24-core 2.1 GHz Intel® Xeon® Cascade Lake (CSL)

Challenges to optimization and scaling of NN training

We use a dataset of 15M Sherpa execution traces (1.7 TB).

Challenging neural network and data structures:

● Different ‘trace types’ can occur at very different frequencies ⇒ small effective batch size
● Variation in execution trace length ⇒ load imbalance
● Dynamic NN with different architecture on different nodes ⇒ scaling challenge
● Data loading initially took >50% of total runtime for single-node training
● Non-optimized 3D convolution for CPU in PyTorch

The initial software release would take ~90 days to train one pass of the 15M-trace dataset!

Single node performance

● PyTorch improvements
  ○ Enhancements for 3D CNN on CPU with Intel MKL-DNN: 8x improvement
  ○ Sort/batch same trace types together to improve parallelization and vectorization (see the sketch below)
  ○ Overall single-node improvement: 28x (Haswell)
● Data restructuring and parallel I/O
  ○ I/O reduced from 50% of time to less than 8%
  ○ 40% reduction in memory consumption

Multi node performance
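A hedged sketch of the sort/batch idea (the trace data structure here is a hypothetical simplification: a list of (address, value) pairs): grouping traces whose address sequences match means every sample in a mini-batch exercises the same network layers and vectorizes cleanly:

    from collections import defaultdict

    def trace_type(trace):
        """Traces with the same address sequence share a 'trace type'."""
        return tuple(addr for addr, _value in trace)

    def batches_by_trace_type(traces, batch_size):
        """Bucket traces by type, then emit batches within each bucket."""
        buckets = defaultdict(list)
        for trace in traces:
            buckets[trace_type(trace)].append(trace)
        for bucket in buckets.values():
            for i in range(0, len(bucket), batch_size):
                yield bucket[i:i + batch_size]

    # Toy traces: the first two share a trace type, the third does not.
    traces = [[("A", 0.1), ("B", 0.2)], [("A", 0.3), ("B", 0.4)], [("A", 0.5), ("C", 0.6)]]
    for batch in batches_by_trace_type(traces, batch_size=2):
        print(len(batch), trace_type(batch[0]))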

● Optimizations in the application and PyTorch to minimize tensors communicated and allreduce calls: >4x improvement in allreduce time

● Can reduce load imbalance with multi-bucketing (inspired by neural machine translation, NMT)
  ○ Up to 50% throughput increase, but a trade-off with time to convergence

Large scale training

● Fully synchronous data-parallel training on 1,024 nodes (see the sketch below)
  ○ Adam-LARC optimizer with polynomial decay (order 2) learning rate schedule
  ○ One of the largest-scale uses of built-in PyTorch-MPI
● 128k global mini-batch size
  ○ Largest for this NN architecture
● Overall 14,000x speedup (28x single node × 500x from scaling)
  ○ Months of training in minutes
  ○ Ability to retrain the model quickly is transformative for research

REMINDER: we are doing all this to perform inference
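A minimal sketch of fully synchronous data-parallel training with PyTorch's built-in MPI backend, using a toy model and random data; plain Adam stands in for the Adam-LARC optimizer and learning-rate schedule used in the actual runs, and the script assumes an MPI-enabled PyTorch build launched with something like mpirun -n 4 python train.py:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel

    # Each MPI rank is one data-parallel worker; gradients are allreduced
    # across ranks on every step (fully synchronous training).
    dist.init_process_group(backend="mpi")

    model = DistributedDataParallel(
        nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        x = torch.randn(128, 64)      # this rank's shard of the global mini-batch
        y = torch.randn(128, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()               # DDP overlaps the gradient allreduce here
        optimizer.step()
        if dist.get_rank() == 0 and step % 20 == 0:
            print(step, loss.item())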

Parameters → Simulator → Outputs (data)
Inferred parameters ← Simulator ← Observed data

Example: particle detector readings (observed data) → event analyses & new particle discoveries (inferred parameters)

Inference

● No current method in LHC physics analysis gives the full posterior
● So we compare the amortized (inference compilation) approach to full MCMC
  ○ Establish MCMC convergence using the Gelman-Rubin test (see the sketch below)
● MCMC: 115 hours on a single IVB node
● Inference compilation: 30 minutes on 64 HSW nodes (230x speedup)
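For reference, a small sketch of the Gelman-Rubin statistic (potential scale reduction factor) in its common form, applied to toy chains; in practice the diagnostic is computed per latent variable over the MCMC traces:

    import numpy as np

    def gelman_rubin(chains):
        """R-hat for m chains of length n; values near 1 indicate convergence."""
        chains = np.asarray(chains)                  # shape (m, n)
        m, n = chains.shape
        chain_means = chains.mean(axis=1)
        W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
        B = n * chain_means.var(ddof=1)              # between-chain variance
        var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
        return np.sqrt(var_hat / W)

    rng = np.random.default_rng(0)
    print(gelman_rubin(rng.normal(size=(2, 5000))))  # two well-mixed chains: ~1.0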

Amortized inference, once trained, can be run on unseen collision data without the autocorrelation and ‘burn-in’ needed for independent samples in MCMC.

Science results

● Achieving an MCMC “ground truth” is itself a new contribution in particle physics
● Amortized inference compilation (IC) closely matches MCMC (RMH)
● First tractable Bayesian inference for LHC physics
  ○ Full posterior and interpretability

[Plots: more physics events]

Conclusions

● Novel probabilistic software framework to execute & control existing HPC simulator code bases
● HPC features: multi-TB data, distributed training and inference
● Synchronous data-parallel training of a 3DCNN-LSTM neural network
● 1,024 nodes (32,768 cores); 128k mini-batch size (largest for this type of NN)
● One of the largest-scale uses of PyTorch built-in MPI
● Largest-scale inference in a universal probabilistic programming language (one capable of handling complex science models)
● LHC physics use case: ~25,000 latent variables; Sherpa code base of ~1M lines of C++

Probabilistic programming is for the first time practical for large-scale, real-world science models

This is just the beginning ...

Simulation of composite materials (Munk et al. 2019, in submission, arXiv:1910.11950)
Simulation of epidemics (Gram-Hansen et al. 2019, in preparation)

Intel notices and disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .

Intel Advanced Vector Extensions (Intel AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo .

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.