IN-SITU VIS IN THE AGE OF HPC+AI
Peter Messmer, WOIV – ISC 2018, Frankfurt, 6/28/18

EXCITING TIMES FOR VISUALIZATION
Huge datasets everywhere
Novel workflows for HPC + AI
Rendering technology leap
DEEP LEARNING COMES TO HPC
Accelerates Scientific Discovery
- UIUC & NCSA, astrophysics: 5,000x LIGO signal processing
- U. Florida & UNC, drug discovery: 300,000x molecular energetics prediction
- LBNL, high-energy particle physics: 100,000x faster simulated output by GAN
- Princeton & ITER, clean energy: 50% higher accuracy for fusion sustainment
- U.S. DoE, particle physics: 33% more accurate neutrino detection
- Univ. of Illinois, drug discovery: 2x accuracy for protein structure prediction
The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka!” but “That’s funny …”
— Isaac Asimov
ANOMALY DETECTION
Look at all the data and learn from it!
Vast amount of data produced by simulations and experiments
I/O limitations force throw-away
Human in the loop infeasible
“In-situ vis 2.0”
Run inference on the simulation data to detect anomalies
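As a minimal cartoon of this in-situ screening idea — a simple z-score test standing in for the trained network, with a synthetic field and an injected spike as illustrative assumptions:

```python
import numpy as np

# Stand-in "simulation output": a smooth field with one injected anomaly
field = np.sin(np.linspace(0.0, 4.0 * np.pi, 1000))
field[500] += 8.0  # anomalous spike

# In-situ style check: flag samples far from the field's statistics
z = np.abs((field - field.mean()) / field.std())
anomalies = np.flatnonzero(z > 5.0)  # indices of flagged samples
```

In a real in-situ pipeline, this check would run on the node holding the data, so only the flagged regions need to be written out.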
ANOMALY DETECTION IN CLIMATE DATA
Deep Learning for Climate Modeling
Identifying “extreme” weather events in multi-decadal datasets with a 5-layer convolutional neural network, reaching 99.98% detection accuracy. (Kim et al, 2017)
Dataset: visualization of historic cyclones from the JTWC hurricane report, 1979 to 2016. MNIST-like network structure; a systematic framework for detection and localization of extreme climate events.
DOI 10.1109/ICDMW.2017.476

OPTIMIZING PARAMETERIZED MODELS
Learning from simulations to accelerate simulations
Heuristics and semi-empirical models throughout simulation codes
- Often expensive, complex control flow
- Use trained approximate models instead
- Trained with past simulations

Examples:
- Parameterized models: atmospheric radiation transport, collisional cross-sections, chemical reaction chains, …
- Simulation parameters: grid refinement, load balancing, scheduling, …
Use simulations to train approximate models
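As an illustration of this idea — not any specific production code — a cheap approximate model can be fit to samples from an expensive parameterized model; both `expensive_model` and the polynomial surrogate below are hypothetical stand-ins:

```python
import numpy as np

# Stand-in for an expensive parameterized model (e.g. a reaction rate)
def expensive_model(x):
    return np.exp(-x) * np.sin(3.0 * x)

# "Past simulations": sampled inputs and outputs of the expensive model
x_train = np.linspace(0.0, 2.0, 200)
y_train = expensive_model(x_train)

# Train a cheap approximate model (here: a degree-9 polynomial fit)
coeffs = np.polyfit(x_train, y_train, deg=9)
surrogate = np.poly1d(coeffs)

# Use the surrogate in place of the expensive model inside the simulation
err = abs(surrogate(1.234) - expensive_model(1.234))
```

In practice the surrogate would be a trained network rather than a polynomial, but the workflow is the same: sample the expensive model offline, fit, then evaluate the cheap model in the inner loop.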
HPC SIMULATION OF THE FUTURE
Concurrent simulation, training, inference
Leads to
- Improved approximate models
- Better utilization of data
- Faster results
Coupling of HPC and DL software stack is needed
DATA SCIENCE DRIVES THE SOFTWARE STACK
Data science mission-critical to non-traditional HPC organizations
Deep learning, graph analytics, in-core databases,..
Sustainability and performance by scale
Frameworks supported by big corps, large communities
Big market, big support by all vendors
Economics drive performance portability and sustainability
This happened in the past too, but the situation is different today (GitHub stars: Trilinos 274, PyTorch 16,615)
Cast HPC algorithms in DL terms
EXAMPLE: WAVE EQUATION VIA CONV NEURAL NETWORK
Applying stencils = inference in a convolution layer

dv/dt = c Δu
du/dt = v

Discretized (1D): Du0 = u(x-1) – 2 u(x) + u(x+1)
v(3/2) = c dt Du0 + v(1/2)

layer {
  name: "laplace_u"
  type: "Convolution"
  bottom: "u_0"
  top: "laplace_u"
  convolution_param {
    num_output: 1
    kernel_size: 3
    stride: 1
    pad: 1
  }
}
layer {
  name: "velocity_update"
  type: "Eltwise"
  bottom: "laplace_u"
  top: "velocity_update"
}
u(1) = dt v(1/2) + u(0)

layer {
  name: "position_update"
  type: "Eltwise"
  bottom: "velocity_update"
  top: "position_update"
}

Simplified 1D example cartoon; the approach works in higher dimensions.
Other examples: stencils, spectral transforms, spectral elements, …

EXAMPLE: MARCHING CUBES (well.. squares)
[Pipeline diagram: Input Data → Threshold → Label Cells → Select Interpols → Interpolate X / Interpolate Y]
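The threshold and label-cells stages above can be expressed as plain array operations — a minimal NumPy sketch, where the function name and the 2x2-cell bit layout are illustrative assumptions, not the slide's actual implementation:

```python
import numpy as np

def cell_cases(data, iso):
    # Threshold: 1 where the scalar field exceeds the isovalue
    b = (data > iso).astype(np.uint8)
    # Label cells: pack the four corner bits of each 2x2 cell
    # into a 4-bit marching-squares case index (0..15)
    return (b[:-1, :-1]
            | (b[:-1, 1:] << 1)
            | (b[1:, 1:] << 2)
            | (b[1:, :-1] << 3))

data = np.array([[0.0, 1.0],
                 [0.0, 0.0]])
case = cell_cases(data, 0.5)  # single cell, only the top-right corner set
```

The resulting case index selects which edge interpolations to perform, which is exactly the kind of data-parallel, branch-free formulation that maps onto DL-framework primitives.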
HPC CAN CONTRIBUTE TO EMERGING DATA SCIENCE NEEDS
HPC solved lots of “new” problems in the past
DL will need distributed memory parallelism
New challenges for DL algorithms
HPC has probably hit those challenges in the past: better implementations, better algorithms
https://www.hpcwire.com/2017/02/21/hpc-technique-benefits-deep-learning/
Collaborate with DL framework developers, contribute to DL frameworks
First step: speak a common language
RENDERING TECHNOLOGIES

VISUALIZATION IN THE DATACENTER
Benefits of Rendering on the Supercomputer
- Scale with simulation: no need to scale a separate vis cluster
- Cheaper infrastructure: all heavy lifting performed on the server
- Interactive high-fidelity rendering: improves perception and scientific insight
VISUALIZATION REQUIREMENTS
- Accuracy
- Responsiveness
- Interpretability
REAL-TIME RAY-TRACING
Enabled by Volta V100
OptiX Ray-Tracing SDK
- State-of-the-art performance: 500M+ rays/sec
- Algorithm and hardware agnostic
- Shaders with single-ray programming model, recursion
- Free for private and commercial use
https://developer.nvidia.com/optix
OptiX 5 with the CNN-based AI Denoiser

[Bar chart: relative performance of 1x, 4x, 7x, 54x, and 157x across CPU (OptiX 4.1), Pascal GP100, Volta V100 (OptiX 5), and Volta V100 / DGX Station (OptiX 5 with AI denoiser)]

C.A. Chaitanya, A. Kaplanyan, C. Schied, M. Salvi, A. Lefohn, D. Nowrouzezahrai, T. Aila. “Interactive Reconstruction of Monte Carlo Image Sequences Using a Recurrent Denoising Autoencoder.” SIGGRAPH 2017

OPTIX FOR SCIENCE
Ray-Tracing for extra visual cues
- Enhanced visual perception
- Single workflow for science and outreach
- Fast rendering with AI denoiser
- Integrated into leading vis tools
[Image: with denoiser]

WILL HPC TRANSFORM AI OR WILL AI TRANSFORM HPC?
Combination of HPC and AI will lead to better exploitation of data, faster models
Both communities need each other
- AI will need distributed memory technologies at some point
- HPC needs AI for analysis and parameterized models
Economic interest in AI drives software and hardware development
Expressing HPC algorithms in AI terms helps leverage AI features
Applying HPC techniques to scale AI
=> Continue with joint events!
Simulation and streamline data courtesy of Christoph Garth

OptiX AI Denoiser in ParaView
[Images: without denoiser / with denoiser]

NVIDIA IndeX SDK
- Large-scale and distributed data rendering
- Scene management with volume data
- Transparent support for NVLink
- Higher-order filtering, advanced lighting & transfer functions
https://developer.nvidia.com/index
NVIDIA IndeX ParaView Plugin
- Distributed with ParaView (pvNvidiaIndeX)
- Mixing with ParaView primitives
- Support for multiple slice planes
- Structured and unstructured meshes
- In-situ capable via Catalyst
https://developer.nvidia.com/index
NVIDIA INDEX PARAVIEW PLUGIN
Growing user base
• Single GPU version bundled with ParaView
• Over 20,000 downloads in 2 months.
• Cluster version
• 10+ labs and institutions actively working with us.
https://www.nvidia.com/en-us/data-center/index-paraview-plugin/

NVIDIA INDEX PARAVIEW PLUGIN
Licensing and Availability
SINGLE GPU | ACADEMIC | COMMERCIAL
Free license | Free license | License fee
Supports single GPU | Multi-node, multi-GPU | Multi-node, multi-GPU
Plugin binaries part of ParaView v5.5 download | Plugin source part of ParaView v5.5 download | Plugin source with customer-specific features
New releases synced with ParaView release cycle | Latest features in ParaView master | Premium support
FLEXIBLE GPU ACCELERATION ARCHITECTURE
Independent CUDA Cores & Video Engines

Decode HW (NVDEC)*:
- Formats: MPEG-2, VC1, VP8, VP9, H.264, H.265, Lossless
- Bit depth: 8 bit, 10/12 bit
- Color**: YUV 4:2:0
- Resolution: up to 8K***

Encode HW (NVENC)*:
- Formats: H.264, H.265, Lossless
- Bit depth: 8 bit, 10 bit
- Color**: YUV 4:4:4, YUV 4:2:0
- Resolution: up to 8K***

* See support diagram for previous NVIDIA HW generations
** 4:2:2 is not natively supported on HW
*** Support is codec dependent

VIDEO CODEC SDK
Hardware Accelerated Video Encoding/Decoding
- Encode and Decode APIs
- Industry-standard codecs: H.265 (HEVC), H.264 (AVCHD), VP9, VP8, VC-1 & MPEG-2
- Supports up to 8K x 8K encoding
- Lossy & lossless compression
https://developer.nvidia.com/nvidia-video-codec-sdk

CASE STUDY: STREAMING OF LARGE TILE COUNTS
- Frame rates / latency / bandwidth
- Synchronization
- Comparison against CPU-based compressors
- Strong scaling (direct-send sort-first compositing)
Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization, Biedert et al, EGPGV 2018, https://doi.org/10.2312/pgv.20181093
CONCEPTUAL OVERVIEW
Asynchronous Pipelines
BENCHMARK SCENES
[Four 4K benchmark scenes — Space Orbit, NASA Synthesis, Ice, Streamlines — spanning low, medium, high, and extreme complexity]
HARDWARE LATENCY
Decreases with Resolution

[Chart: encode/decode latency in ms (0–16) vs. resolution from 3840x2160 down to 480x270, for Encode (HEVC), Encode (H.264), Decode (HEVC), Decode (H.264)]

N:N STREAMING
Pipeline Latencies
[Chart: pipeline latency in ms (0–45) vs. tiles (1–256). Server: Piz Daint; client: Piz Daint; scene: Ice 4K; H.264 at 32 Mbps; MPI-based synchronization. Components: Synchronize (Servers), Encode, Network, Decode, Synchronize (Clients)]

N:N STREAMING
Client-Side Frame Rates and Bandwidths
[Chart: client-side frame rate in Hz (0–90) and bandwidth in MB/s (0–900) vs. tiles (1–256). Server: Piz Daint; client: Piz Daint; scene: Ice 4K; H.264 at 32 Mbps, HEVC at 16 Mbps. Series: HEVC and H.264 bandwidth and frame rate]

N:1 STREAMING
Client-Side Frame Rate
[Chart: client-side frame rate in Hz (0–90) vs. tiles (1–256). Server: Piz Daint; clients: Site A (5 ms), Site B (25 ms); scene: Ice 4K; H.264 at 32 Mbps. Series: Site A (1x GP100), Site B (1x GP100), Site B (2x GP100)]

N:1 STRONG SCALING
Client-Side Frame Rate
[Chart: client-side frame rate in Hz (0–400) vs. tiles (1–256). Server: Piz Daint; clients: Site A (5 ms), Site B (25 ms); scene: Ice 4K; H.264 at 32 Mbps. Series: Site A (1x GP100), Site B (1x GP100), Site B (2x GP100)]

EXAMPLE: OPTIX PATHTRACER
Rendering Highly Sensitive to Tile Size

[Chart: frame rate in Hz (0–30) vs. tiles (1–256). Server: Piz Daint; client: Site A (5 ms); H.264. Series: Regular Tiling, Auto-Tuned Tiling]
STANDARD-COMPLIANT BITSTREAM
Web Browser Streaming Example
- EGL-based GLUT shim
- Stream the unmodified simpleGL example from a headless node to a web browser (with interaction!)
- JavaScript client
- WebSocket-based bidirectional communication
- On-the-fly MP4 wrapping of H.264
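A first step of such on-the-fly wrapping is splitting the H.264 Annex-B bitstream into NAL units at its start codes; a minimal sketch (the function name is illustrative):

```python
import re

def split_nals(stream):
    # Split an H.264 Annex-B elementary stream on start codes
    # (0x00000001 or 0x000001) into raw NAL units
    parts = re.split(b'\x00\x00\x00\x01|\x00\x00\x01', stream)
    return [p for p in parts if p]

# Two NAL units behind a 4-byte and a 3-byte start code
nals = split_nals(b'\x00\x00\x00\x01\x67\x42\x00\x00\x01\x68\xce')
```

The extracted NAL units can then be placed into length-prefixed MP4 sample data for playback by a standard browser media stack.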
In Situ Volume Compression using NVENC
Use NVENC + VPU (Titan and later) to compress volume data
In situ encoding runs overlapped with computation: essentially “free”.
At least 100:1 compression (over 1000:1 in many cases)
[Pipeline: int32 volume → int8 quantization → compressed int8 stream]
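The int32 → int8 stage above can be sketched as a simple min/max quantization before handing frames to the encoder; the function names and the stored (lo, hi) scheme are illustrative assumptions, not the paper's exact mapping:

```python
import numpy as np

def quantize_to_uint8(vol):
    # Map a floating-point (or int32) volume to 8-bit for the video
    # encoder, keeping (lo, hi) so values can be approximately restored
    vol = vol.astype(np.float64)
    lo, hi = vol.min(), vol.max()
    scale = 255.0 / max(hi - lo, 1e-30)
    q = np.round((vol - lo) * scale).astype(np.uint8)
    return q, (lo, hi)

def dequantize(q, lo, hi):
    # Approximate inverse for a posteriori rendering and analysis
    return lo + q.astype(np.float64) * (hi - lo) / 255.0

vol = np.linspace(-1.0, 1.0, 64).reshape(4, 4, 4)
q, (lo, hi) = quantize_to_uint8(vol)
rec = dequantize(q, lo, hi)
max_err = np.abs(rec - vol).max()  # bounded by half a quantization step
```

Each 8-bit slice can then be fed to NVENC as a video frame, which is where the 100:1-and-beyond compression ratios come from.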
Similar application of NvPipe for EM data at CVLab at EPFL
N. Leaf, B. Miller and K.-L. Ma. In Situ Video Encoding of Floating-Point Volume Data Using Special-Purpose Hardware for a Posteriori Rendering and Analysis. IEEE LDAV 2017

CONTAINERS
NVIDIA GPU CLOUD FOR HPC VISUALIZATION
ngc.nvidia.com

Containers: ParaView with NVIDIA IndeX, ParaView with NVIDIA OptiX, ParaView with NVIDIA Holodeck, VMD
CONTAINERS IN A DISTRIBUTED ENVIRONMENT
Client and Server Containers

[Diagram: client-side container (user interface) ↔ server-side container (pre-packaged plugins, GPU-accelerated encoding, GLX and EGL)]
NVIDIA GPU Cloud Containers
http://ngc.nvidia.com

LEADING EDGE CONTAINERS
Novel features
Latest version of ParaView IndeX plugin
ParaView+OptiX container, including AI denoiser and OptiX raytracing
Rapid release cycle
[Images: without denoiser / with denoiser]