GPU Enhanced Remote Collaborative Scientific Visualization
Benjamin Hernandez (OLCF), Tim Biedert (NVIDIA)
March 20th, 2019
ORNL is managed by UT-Battelle LLC for the US Department of Energy
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Contents
• Part I – GPU Enhanced Remote Collaborative Scientific Visualization
• Part II – Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization
Oak Ridge Leadership Computing Facility (OLCF)
• Provide the computational and data resources required to solve the most challenging problems.
• Highly competitive user allocation programs (INCITE, ALCC). – OLCF provides 10x to 100x more resources than other centers
• We collaborate with users of diverse expertise and geographic locations.
What do these collaborations look like?

• Collaborations
  – Extend through the life cycle of the data, from computation to analysis and visualization.
  – Are structured around data.
• Data analysis and visualization
  – An iterative, sometimes remote process involving students, visualization experts, PIs, and stakeholders.

[Diagram: cycle of simulation team needs → pre-visualization → viz. expert feedback → PI evaluation → visualization feedback → discovery.]
What do these collaborations look like?

• The collaborative future must be characterized by [1]:
  – Discovery: resources are easy to find.
  – Connectivity: no resource is an island.
  – Portability: resources are widely and transparently usable.
  – Centrality: resources are efficiently and centrally supported.
• From the workshop report:
  – “Web-based immersive visualization tools with the ease of a virtual reality game.”
  – “Visualization ideally would be combined with user-guided as well as template-guided automated feature extraction, real-time annotation, and quantitative geometrical analysis.”
  – “Rapid data visualization and analysis to enable understanding in near real time by a geographically dispersed team.”
1. U.S. Department of Energy (2011). Scientific Collaborations for Extreme-Scale Science (Workshop Report). Retrieved from https://indico.bnl.gov/event/403/attachments/11180/13626/ScientificCollaborationsforExtreme-ScaleScienceReportDec2011_Final.pdf
What do these collaborations look like?
• INCITE “Petascale simulations of short pulse laser interaction with metals,” PI Leonid Zhigilei, University of Virginia
  – Laser ablation in vacuum and liquid environments
  – Hundreds-of-millions- to billion-atom atomistic simulations, dozens of time steps
• INCITE “Molecular dynamics of motor-protein networks in cellular energy metabolism,” PI Abhishek Singharoy, Arizona State University
  – Hundreds-of-millions-atom atomistic simulations, hundreds of time steps
What do these collaborations look like?

• The data is centralized at OLCF.
• SIGHT is a custom platform for interactive data analysis and visualization.
  – Support for collaborative features is needed:
    • Low-latency remote visualization streaming
    • Simultaneous and independent user views
    • Collaborative multi-display environments
SIGHT: Exploratory Visualization of Scientific Data

• Designed around user needs:
  – Load your data
  – Perform exploratory analysis
  – Visualize/save results
• Remote visualization
  – Server/client architecture providing high-end visualization on laptops, desktops, and powerwalls.
• Lightweight tool
  – Multi-threaded I/O
  – Supports interactive/batch visualization
  – In-situ (some effort)
• Heterogeneous scientific visualization
  – Advanced shading to enable new insights during data exploration.
  – Multicore and manycore support.
• Designed with the OLCF infrastructure in mind.
Publications
M. V. Shugaev, C. Wu, O. Armbruster, A. Naghilou, N. Brouwer, D. S. Ivanov, T. J.-Y. Derrien, N. M. Bulgakova, W. Kautek, B. Rethfeld, and L. V. Zhigilei, Fundamentals of ultrafast laser-material interaction, MRS Bull. 41 (12), 960-968, 2016.
C.-Y. Shih, M. V. Shugaev, C. Wu, and L. V. Zhigilei, Generation of subsurface voids, incubation effect, and formation of nanoparticles in short pulse laser interactions with bulk metal targets in liquid: Molecular dynamics study, J. Phys. Chem. C 121, 16549-16567, 2017.
C.-Y. Shih, R. Streubel, J. Heberle, A. Letzel, M. V. Shugaev, C. Wu, M. Schmidt, B. Gökce, S. Barcikowski, and L. V. Zhigilei, Two mechanisms of nanoparticle generation in picosecond laser ablation in liquids: the origin of the bimodal size distribution, Nanoscale 10, 6900-6910, 2018.
SIGHT’s System Architecture

[Diagram: the SIGHT frame server (GPU encoder: NVENC via NVPipe; SIMD encoder: TurboJPEG) streams over WebSockets to the SIGHT client (JavaScript, web browser). The frame server sits on top of multi-threaded, data-parallel analysis, a dataset parser, and ray tracing backends1,2 (NVIDIA OptiX, Intel OSPRay) running at OLCF.]

1 S7175: Exploratory Visualization of Petascale Particle Data in NVIDIA DGX-1, NVIDIA GPU Technology Conference 2017.
2 P8220: Heterogeneous Selection Algorithms for Interactive Analysis of Billion Scale Atomistic Datasets, NVIDIA GPU Technology Conference 2018.
Design of Collaborative Infrastructure - Alternative 1

Each user runs an independent SIGHT instance: a separate job on separate nodes, all reading the shared data.

[Diagram: Data → Job 1 / Job 2 / Job 3 → SIGHT on Node(s) → User 1 / User 2 / User 3.]

Problems: co-scheduling, job coordination, and data communication between the instances.
Design of Collaborative Infrastructure - Alternative 2

A single SIGHT job serves all users.

[Diagram: Data → Job 1 → SIGHT on Node(s) → User 1, User 2, User 3.]
Design of Collaborative Infrastructure

• ORNL Summit system overview:
  – 4,608 nodes
  – Dual-port Mellanox EDR InfiniBand network
  – 250 PB IBM file system transferring data at 2.5 TB/s
• Each node has:
  – 2 IBM POWER9 processors
  – 6 NVIDIA Tesla V100 GPUs
  – 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
  – 1.6 TB of NV memory
Design of Collaborative Infrastructure

• Each NVIDIA Tesla V100 GPU has:
  – 3 NVENC chips
  – An unrestricted number of concurrent encoding sessions
• NVPipe
  – Lightweight C API library for low-latency video compression
  – Easy access to NVIDIA's hardware-accelerated H.264 and HEVC video codecs
Enhancing SIGHT Frame Server - Low-Latency Encoding

Sharing the OptiX output buffer with the encoder (OptiXpp):

```cpp
Buffer frameBuffer = context->createBuffer(RT_BUFFER_OUTPUT,
                                           RT_FORMAT_UNSIGNED_BYTE4,
                                           m_width, m_height);
frameBufferPtr = frameBuffer->getDevicePointer(optixDevice);
...
compress(frameBufferPtr);
```

[Diagram: the SIGHT client (JavaScript, web browser) sends GUI events (x, y, button, keyboard) to the frame server over WebSockets; the frame server returns an NVENC-compressed visualization stream produced directly from the ray tracing backend's framebuffer.]
Enhancing SIGHT Frame Server - Low-Latency Encoding

Opening the encoding session:

```cpp
myEncoder = NvPipe_CreateEncoder(NVPIPE_RGBA32, NVPIPE_H264, NVPIPE_LOSSY,
                                 bitrateMbps * 1000 * 1000, targetFps);
if (!myEncoder)
    return error;
myCompressedImg = new unsigned char[w * h * 4];
```

Framebuffer compression:

```cpp
bool compress(myFramebufferPtr) {
    ...
    myCompressedSize = NvPipe_Encode(myEncoder, myFramebufferPtr, w * 4,
                                     myCompressedImg, w * h * 4, w, h, true);
    if (myCompressedSize == 0)
        return error;
    ...
}
```

Closing the encoding session:

```cpp
NvPipe_Destroy(myEncoder);
```
Enhancing SIGHT Frame Server - Low-Latency Encoding

• System configuration:
  – DGX-1 Volta
  – Connection bandwidth: 800 Mbps (ideally 1 Gbps)
  – NVENC encoder: H.264 baseline profile, 32 Mbps, 30 FPS
  – TurboJPEG: SIMD instructions on, JPEG quality 50
  – Decoders: Broadway.js (Firefox 65), Media Source Extensions (Chrome 72), built-in JPEG decoder (Chrome 72)

| Average            | NVENC  | NVENC+MP4 | TJPEG  |
|--------------------|--------|-----------|--------|
| Encoding HD (ms)   | 4.65   | 6.05      | 16.71  |
| Encoding 4K (ms)   | 12.13  | 17.89     | 51.89  |
| Frame size HD (KB) | 116.00 | 139.61    | 409.76 |
| Frame size 4K (KB) | 106.32 | 150.65    | 569.04 |

| Average          | Broadway.js | MSE   | Built-in JPEG |
|------------------|-------------|-------|---------------|
| Decoding HD (ms) | 43.28       | 39.97 | 78.15         |
| Decoding 4K (ms) | 87.40       | 53.10 | 197.63        |
Enhancing SIGHT Frame Server - Simultaneous and Independent User Views

• A Summit node can produce 4K visualizations with NVIDIA OptiX at interactive rates.
• The same pixel budget can instead serve several independent views, for example:
  – Four HD views (Users 1-4), or
  – One 4K 32:9 view on the EVEREST powerwall plus two 16:9 views (Users A and B).
Enhancing SIGHT Frame Server - Simultaneous and Independent User Views

[Diagram: the frame server spawns one thread per user (Threads 0-2 for Users 1-3). Each thread owns a frame queue fed with rendered frames. GUI events (x, y, button, keyboard) are tagged with a user id and pushed into a shared GUI event queue, which delivers per-user camera parameters to the ray tracing backend.]
Enhancing SIGHT Frame Server - Simultaneous and Independent User Views

• Video demo.
Discussion

• Further work:
  – Multi-perspective/orthographic projections
  – Traditional case: stereoscopic projection
  – Annotations, sharing content between users, saving sessions
• How AI could help improve remote visualization performance (NVIDIA NGX tech):
  – AI Up-Res: improve compression rates by rendering and streaming at 720p while decoding at HD, 2K, or 4K according to each user's display; different resolutions per user.
  – AI Slow-Mo: render at low frame rates.
  – OptiX Denoiser: ray tracing converges faster.
Thanks!

Benjamin Hernandez
Advanced Data and Workflows Group
Oak Ridge National Laboratory
[email protected]

• Acknowledgments
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Datasets provided by Cheng-Yu Shih and Leonid Zhigilei, Computational Materials Group at the University of Virginia, INCITE award MAT130.
HARDWARE-ACCELERATED MULTI-TILE STREAMING FOR REALTIME REMOTE VISUALIZATION
Tim Biedert, 03/20/2019

INSTANT VISUALIZATION FOR FASTER SCIENCE
• Traditional workflow (CPU supercomputer + separate viz cluster): simulation (~1 week), data transfer (days), visualization (~1 day), over multiple iterations. Slower time to discovery: months.
• GPU-accelerated supercomputer (Tesla platform value proposition: interactivity, scalability, flexibility): visualize while you simulate, without data transfers; restart the simulation instantly; multiple iterations. Faster time to discovery: weeks.

SUPPORTING MULTIPLE VISUALIZATION WORKFLOWS
• Legacy workflow: separate compute & visualization systems; communication via file system.
• Partitioned system: different nodes for different roles; communication via high-speed network.
• Co-processing: compute and visualization on the same GPU; communication via host-device transfers or memcpy.
VISUALIZATION-ENABLED SUPERCOMPUTERS

• CSCS Piz Daint - galaxy formation: http://blogs.nvidia.com/blog/2014/11/19/gpu-in-situ-milky-way/
• NCSA Blue Waters - molecular dynamics: http://devblogs.nvidia.com/parallelforall/hpc-visualization-nvidia-tesla-gpus/
• ORNL Titan - cosmology: http://www.sdav-scidac.org/29-highlights/visualization/66-accelerated-cosmology-data-anal.html
ROAD TO EXASCALE - Volta to Fuel Most Powerful US Supercomputers

• 1.5X HPC performance in 1 year.
  [Chart: V100 performance normalized to P100 across applications, roughly 1.4x-1.7x. System config: 2X Xeon E5-2690 v4, 2.6 GHz, with 2X Tesla P100 or V100.]
• Summit supercomputer: 200+ PetaFlops, ~3,400 nodes, 10 Megawatts.

VISUALIZATION TRENDS - New Approaches Required to Solve the Remoting Challenge
Increasing data set sizes
In-situ scenarios
Interactive workflows
New display technologies
Globally distributed user bases
STREAMING - Benefits of Rendering on the Supercomputer

• Scale with simulation: no need to scale a separate vis cluster.
• Cheaper infrastructure: all heavy lifting performed on the server.
• Interactive high-fidelity rendering: improves perception and scientific insight.
FLEXIBLE GPU ACCELERATION ARCHITECTURE - Independent CUDA Cores & Video Engines
* Diagram represents support for the NVIDIA Turing GPU family
** 4:2:2 is not natively supported in HW
*** Support is codec dependent

CASE STUDY - Streaming of Large Tile Counts

• Frame rates / latency / bandwidth
• Synchronization
• Comparison against CPU-based compressors
• Strong scaling (direct-send sort-first compositing)

Hardware-Accelerated Multi-Tile Streaming for Realtime Remote Visualization
Tim Biedert, Peter Messmer, Tom Fogal, Christoph Garth
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV) 2018 (Best Paper)
DOI: 10.2312/pgv.20181093
CONCEPTUAL OVERVIEW - Asynchronous Pipelines
BENCHMARK SCENES (NASA Synthesis, 4K)

• Space: low complexity
• Orbit: medium complexity
• Ice: high complexity
• Streamlines: extreme complexity
CODEC PERFORMANCE
BITRATE VS QUALITY - Structural Similarity Index (SSIM)

[Chart: SSIM (0.6-1.0) vs bitrate at 90 Hz (1-256 Mbps) for the Space, Orbit, Ice, and Streamlines scenes, H.264 and HEVC.]

HARDWARE LATENCY - Encode and Decode, HEVC and H.264
[Chart: encode and decode latency (0-20 ms) vs bitrate at 90 Hz (1-256 Mbps), per codec and scene.]
HARDWARE LATENCY - Decreases with Resolution

[Chart: encode and decode latency (0-16 ms) for H.264 and HEVC at resolutions from 3840x2160 down to 480x270.]

CPU-BASED COMPRESSION - Encode/Decode Latency and Bandwidth
[Chart: encode/decode latency (ms) and bandwidth (MB/s) for HEVC and H.264 (GPU and CPU), TurboJPEG, BloscLZ, LZ4, and Snappy.]

FULL TILES STREAMING
N:N WITH SIMULATED NETWORK DELAY - Mean Frame Rates + Min/Max Ranges

Setup: server and clients on Piz Daint; Ice scene at 4K; H.264 at 32 Mbps, HEVC at 16 Mbps; simulated delay + 10% jitter.

[Chart: mean frame rate (0-120 Hz) vs network delay (0, 50, 150, 500 ms) for 1-256 tiles.]

N:N STREAMING - Pipeline Latencies
Setup: server and clients on Piz Daint; Ice scene at 4K; H.264 at 32 Mbps; MPI-based synchronization.

[Chart: per-stage latency (0-45 ms) vs tile count (1-256), broken into Synchronize (Servers), Encode, Network, Decode, and Synchronize (Clients).]

N:1 STREAMING - Client-Side Frame Rate
Setup: server on Piz Daint; clients at Site A (5 ms) and Site B (25 ms); Ice scene at 4K; H.264 at 32 Mbps.

[Chart: client-side frame rate (0-90 Hz) vs tile count (1-256) for Site A (1x GP100), Site B (1x GP100), and Site B (2x GP100).]

STRONG SCALING
21 N:1 STRONG SCALING Client-Side Frame Rate
Server: Piz Daint 400 Clients: Site A (5 ms) Site B (25 ms) 350
Ice 4K 300
] z
H.264: 32 Mbps H 250
[
e
t a
R 200
e m
a 150
r F 100
50
0 1 2 4 8 16 32 64 128 256 Tiles
Site A (1x GP100) Site B (1x GP100) Site B (2x GP100) 22 N:1 STRONG SCALING Client-Side Frame Rate For Different Bitrates
Setup: server on Piz Daint; client at Site A (5 ms); Ice scene at 4K; H.264.

[Chart: client-side frame rate (0-200 Hz) vs tile count (1-256) at 4, 72, and 256 Mbps @ 90 Hz.]

EXAMPLE: OPTIX PATHTRACER
Rendering is highly sensitive to tile size. Setup: server on Piz Daint; client at Site A (5 ms); H.264.

[Chart: client-side frame rate (0-30 Hz) vs tile count (1-256) for regular tiling vs auto-tuned tiling.]

INTEROPERABILITY
STANDARD-COMPLIANT BITSTREAM - Web Browser Streaming Example

• Stream the unmodified simpleGL example from a headless node to a web browser (with interaction!)
• EGL-based GLUT shim
• JavaScript client
• WebSocket-based bidirectional communication
• On-the-fly MP4 wrapping of H.264
RESOURCES

VIDEO CODEC SDK - APIs for Hardware-Accelerated Video Encode/Decode

What's new with Turing GPUs and Video Codec SDK 9.0:
• Up to 3x decode throughput with multiple decoders on professional cards (Quadro & Tesla)
• Higher-quality encoding - H.264 & H.265
• Higher encoding efficiency (15% lower bitrate than Pascal)
• HEVC B-frames support
• HEVC 4:4:4 decoding support

NVIDIA GeForce Now is made possible by leveraging NVENC in the datacenter and streaming the result to end clients.

https://developer.nvidia.com/nvidia-video-codec-sdk
NVPIPE - A Lightweight Video Codec SDK Wrapper

• Simple C API
• H.264, HEVC
• RGBA32, uint4, uint8, uint16
• Lossy, lossless
• Host/device memory, OpenGL textures/PBOs

https://github.com/NVIDIA/NvPipe

Issues? Suggestions? Feedback welcome!
Conclusion

GPU-accelerated video compression opens up novel and fast solutions to the large-scale remoting challenge.
Video Codec SDK https://developer.nvidia.com/nvidia-video-codec-sdk
NvPipe https://github.com/NVIDIA/NvPipe
We want to help you solve your large-scale vis problems on NVIDIA!
Tim Biedert [email protected]