GPU Enhanced Remote Collaborative Scientific Visualization
Benjamin Hernandez (OLCF), Tim Biedert
March 20th, 2019

ORNL is managed by UT-Battelle LLC for the US Department of Energy

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Contents

• Part I – GPU Enhanced Remote Collaborative Scientific Visualization

• Part II – Hardware-accelerated Multi-tile streaming for Realtime Remote Visualization

Oak Ridge Leadership Computing Facility (OLCF)

• Provides the computational and data resources required to solve the most challenging problems.

• Highly competitive user allocation programs (INCITE, ALCC).
  – OLCF provides 10x to 100x more resources than other centers

• We collaborate with users who are diverse in expertise and geographic location

How do these collaborations work?

• Collaborations
  – Extend through the life cycle of the data, from computation to analysis and visualization
  – Are structured around data

• Data analysis and visualization
  – An iterative and sometimes remote process involving students, visualization experts, PIs and stakeholders

(Diagram, iteration cycle: simulation team needs -> pre-visualization -> viz. expert feedback -> PI evaluation -> visualization feedback -> visualization -> discovery)

How do these collaborations work?

• The collaborative future must be characterized by [1]:
  – “Web-based immersive visualization tools with the ease of a virtual reality game.”
  – “Visualization ideally would be combined with user-guided as well as template-guided automated feature extraction, real-time annotation, and quantitative geometrical analysis.”
  – “Rapid data visualization and analysis to enable understanding in near real time by a geographically dispersed team.”
  – …

(Diagram, around Discovery: Discovery - resources easy to find; Connectivity - no resource is an island; Portability - resources widely and transparently usable; Centrality - resources efficiently and centrally supported)

1. U.S. Department of Energy. (2011). Scientific collaborations for extreme-scale science (Workshop Report). Retrieved from https://indico.bnl.gov/event/403/attachments/11180/13626/ScientificCollaborationsforExtreme-ScaleScienceReportDec2011_Final.pdf

How do these collaborations work?

• INCITE: “Petascale simulations of short pulse laser interaction with metals”
  PI: Leonid Zhigilei, University of Virginia
  – Laser ablation in vacuum and liquid environments
  – Hundreds-of-millions- to billion-scale atomistic simulations, dozens of time steps

• INCITE: “Molecular dynamics of motor-protein networks in cellular energy metabolism”
  PI(s): Abhishek Singharoy, Arizona State University
  – Hundreds-of-millions-scale atomistic simulations, hundreds of time steps

How do these collaborations work?

• The data is centralized at OLCF.
• SIGHT is a custom platform for interactive data analysis and visualization.
  – Support for collaborative features is needed:
    • Low-latency remote visualization streaming
    • Simultaneous and independent user views
    • Collaborative multi-display environments

SIGHT: Exploratory Visualization of Scientific Data

• Designed around user needs
  – Load your data
  – Perform exploratory analysis
  – Visualize/save results

• Remote visualization
  – Server/client architecture to provide high-end visualization on laptops, desktops, and powerwalls

• Lightweight tool
  – Multi-threaded I/O
  – Supports interactive/batch visualization
  – In-situ (some effort)

• Heterogeneous scientific visualization
  – Advanced shading to enable new insights during data exploration
  – Multicore and manycore support

• Designed with the OLCF infrastructure in mind

Publications

M. V. Shugaev, C. Wu, O. Armbruster, A. Naghilou, N. Brouwer, D. S. Ivanov, T. J.-Y. Derrien, N. M. Bulgakova, W. Kautek, B. Rethfeld, and L. V. Zhigilei, Fundamentals of ultrafast laser-material interaction, MRS Bull. 41 (12), 960-968, 2016.

C.-Y. Shih, M. V. Shugaev, C. Wu, and L. V. Zhigilei, Generation of subsurface voids, incubation effect, and formation of nanoparticles in short pulse laser interactions with bulk metal targets in liquid: Molecular dynamics study, J. Phys. Chem. C 121, 16549-16567, 2017.

C.-Y. Shih, R. Streubel, J. Heberle, A. Letzel, M. V. Shugaev, C. Wu, M. Schmidt, B. Gökce, S. Barcikowski, and L. V. Zhigilei, Two mechanisms of nanoparticle generation in picosecond laser ablation in liquids: the origin of the bimodal size distribution, Nanoscale 10, 6900-6910, 2018.

SIGHT’s System Architecture

(Diagram: SIGHT runs at OLCF. The Frame Server talks to clients over WebSockets and encodes frames either on the GPU (NVENC via NvPipe) or with a SIMD encoder (TurboJPEG). It sits on top of ray tracing backends¹ (NVIDIA OptiX, Intel OSPRay), data-parallel analysis², and a multi-threaded dataset parser. The SIGHT client runs in a web browser (JavaScript).)

¹ S7175, Exploratory Visualization of Petascale Particle Data in NVIDIA DGX-1, NVIDIA GPU Technology Conference 2017
² P8220, Heterogeneous Selection Algorithms for Interactive Analysis of Billion Scale Atomistic Datasets, NVIDIA GPU Technology Conference 2018

Design of Collaborative Infrastructure - Alternative 1

(Diagram: shared Data -> three SIGHT instances, each on its own node(s), one per user (Users 1-3))

Design of Collaborative Infrastructure - Alternative 1

(Diagram: shared Data; Jobs 1-3, each running a SIGHT instance on its own node(s), serving Users 1-3 - but which job owns the data?)

Design of Collaborative Infrastructure - Alternative 1

(Diagram: same setup, with the problems highlighted - co-scheduling of Jobs 1-3, job coordination, and data communication between the SIGHT instances)

Design of Collaborative Infrastructure - Alternative 2

(Diagram: shared Data -> a single Job 1 running one SIGHT instance on its node(s), serving Users 1, 2, and 3)

Design of Collaborative Infrastructure

• ORNL Summit system overview
  – 4,608 nodes
  – Dual-port Mellanox EDR InfiniBand network
  – 250 PB IBM file system transferring data at 2.5 TB/s

• Each node has
  – 2 IBM POWER9 processors
  – 6 NVIDIA V100 GPUs
  – 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
  – 1.6 TB of NV memory

Design of Collaborative Infrastructure

• Each NVIDIA Tesla V100 GPU
  – 3 NVENC chips
  – Unrestricted number of concurrent sessions

• NVPipe
  – Lightweight C API library for low-latency video compression
  – Easy access to NVIDIA's hardware-accelerated H.264 and HEVC video codecs

Enhancing SIGHT Frame Server: Low-Latency Encoding

Sharing the OptiX buffer (OptiXpp):

```cpp
// Create the OptiX output buffer the ray tracer renders into
Buffer frameBuffer = context->createBuffer(RT_BUFFER_OUTPUT,
                                           RT_FORMAT_UNSIGNED_BYTE4,
                                           m_width, m_height);
...
// Obtain the device pointer and hand it directly to the encoder
frameBufferPtr = frameBuffer->getDevicePointer(optixDevice);
...
compress(frameBufferPtr);
```

(Diagram: the SIGHT client (web browser, JavaScript) sends GUI events ((x, y, button), keyboard) over WebSockets to the Frame Server; the ray tracing backend's framebuffer is encoded by NVENC and the visualization stream is sent back to the client.)

Enhancing SIGHT Frame Server: Low-Latency Encoding

Opening the encoding session:

```cpp
myEncoder = NvPipe_CreateEncoder(NVPIPE_RGBA32, NVPIPE_H264, NVPIPE_LOSSY,
                                 bitrateMbps * 1000 * 1000, targetFps);
if (!myEncoder)
    return error;
myCompressedImg = new unsigned char[w * h * 4];
```

Framebuffer compression:

```cpp
bool compress(myFramebufferPtr) {
    ...
    myCompressedSize = NvPipe_Encode(myEncoder, myFramebufferPtr, w * 4,
                                     myCompressedImg, w * h * 4, w, h, true);
    if (myCompressedSize == 0)
        return error;
    ...
}
```

Closing encoding session:

```cpp
NvPipe_Destroy(myEncoder);
```

Enhancing SIGHT Frame Server: Low-Latency Encoding

• System configuration:
  – DGX-1 Volta
  – Connection bandwidth: 800 Mbps (ideally 1 Gbps)
  – NVENC encoder: H.264 baseline profile, 32 Mbps, 30 fps
  – TurboJPEG: SIMD instructions on, JPEG quality 50
  – Decoders: Broadway.js (Firefox 65), Media Source Extensions (Chrome 72), built-in JPEG decoder (Chrome 72)

Average             | NVENC  | NVENC+MP4 | TJPEG
Encoding HD (ms)    | 4.65   | 6.05      | 16.71
Encoding 4K (ms)    | 12.13  | 17.89     | 51.89
Frame size HD (KB)  | 116.00 | 139.61    | 409.76
Frame size 4K (KB)  | 106.32 | 150.65    | 569.04

Average             | Broadway.js | MSE   | Built-in JPEG
Decoding HD (ms)    | 43.28       | 39.97 | 78.15
Decoding 4K (ms)    | 87.40       | 53.10 | 197.63

Enhancing SIGHT Frame Server: Low-Latency Encoding

Enhancing SIGHT Frame Server: Simultaneous and Independent User Views

• A Summit node can produce 4K visualizations with NVIDIA OptiX at interactive rates.

(Diagram: the EVEREST 4K 32:9 display shared as four independent HD views (Users 1-4), or as two 16:9 views (Users A and B))

Enhancing SIGHT Frame Server: Simultaneous and Independent User Views

(Diagram: per-user frame servers. Threads 0, 1, and 2 each run a Frame Server for Users 1, 2, and 3. Each frame server receives GUI events ((x, y, button), keyboard) from its user and pushes camera parameters, tagged with the user id, onto a shared GUI event queue; the ray tracing backend consumes that queue, renders, and pushes each visualization frame, tagged with the user id, onto a frame queue read by the corresponding frame server.)

Enhancing SIGHT Frame Server: Simultaneous and Independent User Views

• Video

Discussion

• How AI could help improve remote visualization performance (NVIDIA NGX):
  – AI Up-Res
    • Improve compression rates
    • Different resolutions: render and stream at 720p, decode at HD, 2K, or 4K according to each user's display
  – AI Slow-Mo
    • Render at low frame rates
  – OptiX Denoiser
    • Ray tracing converges faster

• Further work
  – Multi-perspective/orthographic projections
  – Traditional case: stereoscopic projection
  – Annotations, sharing content between users, saving sessions

Thanks!

Benjamin Hernandez
Advanced Data and Workflows Group
Oak Ridge National Laboratory
[email protected]

• Acknowledgments
  This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Datasets provided by Cheng-Yu Shih and Leonid Zhigilei, Computational Materials Group at University of Virginia, INCITE award MAT130.

HARDWARE-ACCELERATED MULTI-TILE STREAMING FOR REALTIME REMOTE VISUALIZATION
Tim Biedert, 03/20/2019

INSTANT VISUALIZATION FOR FASTER SCIENCE

(Diagram, traditional workflow: CPU supercomputer -> data transfer (days) -> viz cluster; simulation ~1 week, viz ~1 day, multiple iterations; slower time to discovery = months.)

(Diagram, GPU-accelerated supercomputer with the Tesla platform: visualize while you simulate, without data transfers; restart simulation instantly; multiple iterations; faster time to discovery = weeks.)

Value proposition:
• Interactivity
• Scalability
• Flexibility

SUPPORTING MULTIPLE VISUALIZATION WORKFLOWS

LEGACY WORKFLOW               | PARTITIONED SYSTEM                    | CO-PROCESSING
Separate compute & vis system | Different nodes for different roles   | Compute and visualization on same GPU
Communication via file system | Communication via high-speed network  | Communication via host-device transfers or memcpy

VISUALIZATION-ENABLED SUPERCOMPUTERS

• CSCS Piz Daint - Galaxy formation
  http://blogs.nvidia.com/blog/2014/11/19/gpu-in-situ-milky-way/
• NCSA Blue Waters - Molecular dynamics
  http://devblogs.nvidia.com/parallelforall/hpc-visualization-nvidia-tesla-gpus/
• ORNL Titan - Cosmology
  http://www.sdav-scidac.org/29-highlights/visualization/66-accelerated-cosmology-data-anal.html

ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers

1.5X HPC Performance in 1 Year

(Chart: V100 performance normalized to P100 across applications; values 1.37 to 1.7, about 1.5x on average. System config: 2X Xeon E5-2690 v4, 2.6 GHz, with 2X Tesla P100 or V100.)

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

GEFORCE NOW
You listen to music on Spotify. You watch movies on . GeForce NOW lets you play games the same way. Instantly stream the latest titles from our powerful cloud-gaming supercomputers. Think of it as your game console in the sky. Gaming is now easy and instant.

VISUALIZATION TRENDS
New Approaches Required to Solve the Remoting Challenge

Increasing data set sizes

In-situ scenarios

Interactive workflows

New display technologies

Globally distributed user bases

STREAMING
Benefits of Rendering on the Supercomputer

• Scale with Simulation: no need to scale a separate vis cluster
• Cheaper Infrastructure: all heavy lifting performed on the server
• Interactive High-Fidelity Rendering: improves perception and scientific insight

FLEXIBLE GPU ACCELERATION ARCHITECTURE
Independent CUDA Cores & Video Engines

(Diagram: GPU with independent CUDA cores and dedicated video engines.)
* Diagram represents support for the NVIDIA Turing GPU family
** 4:2:2 is not natively supported in HW
*** Support is codec dependent

Topics:
• Streaming of large tile counts
• Frame rates / latency / bandwidth
• Synchronization
• Comparison against CPU-based compressors

CASE STUDY
Strong Scaling (Direct-Send Sort-First Compositing)

Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization

Tim Biedert, Peter Messmer, Tom Fogal, Christoph Garth

Eurographics Symposium on Parallel Graphics and Visualization (EGPGV) 2018 (Best Paper)

DOI: 10.2312/pgv.20181093

CONCEPTUAL OVERVIEW
Asynchronous Pipelines

BENCHMARK SCENES
NASA Synthesis 4K

• Space: low complexity
• Orbit: medium complexity
• Ice: high complexity
• Streamlines: extreme complexity

CODEC PERFORMANCE

BITRATE VS QUALITY
Structural Similarity Index (SSIM)

(Chart: SSIM vs bitrate at 90 Hz, 1 to 256 Mbps, for Space, Orbit, Ice, and Streamlines under H.264 and HEVC; SSIM rises from about 0.6 at 1 Mbps to above 0.95 at high bitrates.)

HARDWARE LATENCY

(Chart: encode and decode latency [ms, y-axis 0 to 20] vs bitrate at 90 Hz (1 to 256 Mbps) for HEVC and H.264 across the four scenes.)

HARDWARE LATENCY
Decreases with Resolution

(Chart: encode and decode latency [ms, y-axis 0 to 16] vs resolution, from 3840x2160 down to 480x270, for HEVC and H.264; latency decreases with resolution.)

CPU-BASED COMPRESSION
Encode/Decode Latency and Bandwidth

(Chart: encode and decode latency [ms, y-axis 0 to 120] and bandwidth [MB/s, y-axis 0 to 2400] for HEVC (GPU), H.264 (GPU), HEVC (CPU), H.264 (CPU), TurboJPEG, BloscLZ, LZ4, and Snappy.)

FULL TILES STREAMING

N:N WITH SIMULATED NETWORK DELAY
Mean Frame Rates + Min/Max Ranges

Setup: server and client on Piz Daint; Ice scene at 4K; H.264 at 32 Mbps, HEVC at 16 Mbps; simulated network delay + 10% jitter.

(Chart: mean client frame rate [Hz, y-axis 0 to 120] vs network delay (0, 50, 150, 500 ms) for 1 to 256 tiles.)

N:N STREAMING
Pipeline Latencies

Setup: server and client on Piz Daint; Ice scene at 4K; H.264 at 32 Mbps; MPI-based synchronization.

(Chart: per-stage pipeline latency [ms, y-axis 0 to 45] vs tile count (1 to 256): synchronize (servers), encode, network, decode, synchronize (clients).)

N:1 STREAMING
Client-Side

Synchronize (Servers) Encode Network Decode Synchronize (Clients) 19 N:1 STREAMING Client-Side

Server: Piz Daint 90 Clients: Site A (5 ms) Site B (25 ms) 80

Ice 4K 70 ]

z 60

H.264: 32 Mbps H

[

e

t 50

a

R

e 40

m

a r

F 30

20

10

0 1 2 4 8 16 32 64 128 256 Tiles

Site A (1x GP100) Site B (1x GP100) Site B (2x GP100) 20 STRONG SCALING

N:1 STRONG SCALING
Client-Side Frame Rate

Setup: server on Piz Daint; clients at Site A (5 ms) and Site B (25 ms); Ice scene at 4K; H.264 at 32 Mbps.

(Chart: client frame rate [Hz, y-axis 0 to 400] vs tile count (1 to 256) for Site A (1x GP100), Site B (1x GP100), and Site B (2x GP100).)

N:1 STRONG SCALING
Client-Side Frame Rate for Different Bitrates

Setup: server on Piz Daint; client at Site A (5 ms); Ice scene at 4K; H.264.

(Chart: client frame rate [Hz, y-axis 0 to 200] vs tile count (1 to 256) at 4, 72, and 256 Mbps, each at 90 Hz.)

EXAMPLE: OPTIX PATHTRACER

Rendering is highly sensitive to tile size. Setup: server on Piz Daint; client at Site A (5 ms); H.264.

(Chart: client frame rate [Hz, y-axis 0 to 30] vs tile count (1 to 256), regular tiling vs auto-tuned tiling.)

INTEROPERABILITY

STANDARD-COMPLIANT BITSTREAM
Web Browser Streaming Example

• EGL-based GLUT shim
• Stream unmodified simpleGL example from a headless node to a web browser (with interaction!)
• JavaScript client
• WebSocket-based bidirectional communication
• On-the-fly MP4 wrapping of H.264

RESOURCES

VIDEO CODEC SDK
APIs for Hardware-Accelerated Video Encode/Decode

What's new with Turing GPUs and Video Codec SDK 9.0:
• Up to 3x decode throughput with multiple decoders on professional cards ( & Tesla)
• Higher-quality encoding: H.264 & H.265
• Higher encoding efficiency (15% lower bitrate than Pascal)
• HEVC B-frames support
• HEVC 4:4:4 decoding support

NVIDIA GeForce NOW is made possible by leveraging NVENC in the datacenter and streaming the result to end clients.

https://developer.nvidia.com/nvidia-video-codec-sdk

NVPIPE
A Lightweight Video Codec SDK Wrapper

• Simple C API
• H.264, HEVC
• RGBA32, uint4, uint8, uint16
• Lossy, lossless
• Host/device memory, OpenGL textures/PBOs

https://github.com/NVIDIA/NvPipe
Issues? Suggestions? Feedback welcome!

Conclusion

GPU-accelerated video compression opens up novel and fast solutions to the large-scale remoting challenge.

Video Codec SDK https://developer.nvidia.com/nvidia-video-codec-sdk

NvPipe https://github.com/NVIDIA/NvPipe

We want to help you solve your large-scale vis problems on NVIDIA!

Tim Biedert [email protected]
