NVIDIA VIDEO TECHNOLOGIES
Abhijit Patait, 3/20/2019

AGENDA
➢ NVIDIA Video Technologies Overview
➢ Turing Video Enhancements
➢ Video Codec SDK Updates
➢ Benchmarks
➢ Roadmap

NVIDIA GPU VIDEO CAPABILITIES

              Decode HW* (NVDEC)                               Encode HW* (NVENC)
Formats       MPEG-2, VC1, VP8, VP9, H.264, H.265, Lossless    H.264, H.265, Lossless
Bit depth     8/10/12 bit                                      8 bit, 10 bit
Color**       YUV 4:2:0, YUV 4:4:4                             YUV 4:2:0, YUV 4:4:4
Resolution    Up to 8K***                                      Up to 8K***

[Figure: block diagram with CPU, NVDEC, Buffer, NVENC and CUDA Cores]

* See support diagram for previous NVIDIA HW generations
** 4:4:4 is supported only on HEVC for Turing; 4:2:2 is not natively supported on HW
*** Support is codec dependent

VIDEO CODEC SDK
A comprehensive set of APIs for GPU-accelerated video encode and decode
➢ NVENCODE API for video encode acceleration
➢ NVDECODE API for video & JPEG decode acceleration (formerly called NVCUVID API)
➢ Independent of CUDA/3D cores on GPU for pre-/post-processing

Use cases:
• Gamestream
• Video transcoding
• Remote desktop streaming
• Intelligent video analytics
• Video archiving
• Video editing

NVIDIA VIDEO TECHNOLOGIES
[Figure: software/hardware stack]
➢ Easy access to GPU video acceleration: DeepStream SDK, DALI, cuDNN, TensorRT, cuBLAS, cuSPARSE
SOFTWARE
➢ VIDEO CODEC, OPTICAL FLOW SDK: video encode and decode for Windows and Linux; CUDA, DirectX, OpenGL interoperability
➢ CUDA TOOLKIT: APIs, libraries, tools, samples
➢ NVIDIA DRIVER
HARDWARE
➢ NVENC: video encode
➢ NVDEC: video decode
➢ CUDA: high-performance computing on GPU

VIDEO CODEC SDK UPDATE

Release    Date       Highlights
SDK 7.x    2016       Pascal, 10-bit encode, FFmpeg, ME-only for VR, Quality++
SDK 8.0    2017       10/12-bit decode, OpenGL, Dec. optimizations, WP, AQ, Enc. Quality
SDK 8.1    Q1 2018    B-as-ref, QP/emphasis map, 4K60 HEVC encode, Reusable classes & new sample apps
SDK 8.2    Q3 2018    10-bit transcode, Decode + inference optimizations
SDK 9.0    2019       Turing, Multi-NVDEC, HEVC 4:4:4 decode, Encode quality++, HEVC B frames

VIDEO CODEC SDK 9.0

Feature                                 Who it benefits
HEVC B-frames, higher encode quality    Higher video encode quality: cloud gaming, game broadcasting (e.g. Twitch), video transcoding (e.g. YouTube, Facebook), OTT/M&E
HEVC 4:4:4 decode                       End-to-end high-quality remote desktop
Multiple NVDECs                         Higher decode + inference throughput
Direct output to vidmem                 Higher perf with post-processing
Power 9 + Tesla V100 SXM2               Video SDK for IBM platforms

TURING UPDATES - NVDEC

MULTIPLE NVDECS IN TURING

GPU                                Number of NVDECs per GPU
Volta, Pascal & earlier            1
Turing – GeForce (RTX)             1
Turing – Quadro & Tesla (TU106)    3
Turing – Quadro & Tesla (TU104)    2
Turing – others                    1

➢ Quadro & Tesla feature
➢ Auto-load-balanced by driver

PASCAL & EARLIER
Single NVDEC
[Figure: several compressed bitstreams feed one NVDEC (high-res decode, 1080p, 720p), which is the bottleneck in front of parallel Scale + Infer stages (low-res infer, e.g. 300 × 200)]

TURING
Multiple NVDECs
[Figure: compressed bitstreams are spread across NVDEC 0 … NVDEC N (high-res decode, 1080p, 720p), each feeding Scale + Infer stages (low-res infer, e.g. 300 × 200)]

END-TO-END 4:4:4 IN TURING
➢ Preserves chroma: text and thin lines
➢ Valuable in desktop streaming
[Figure: 4:2:0 vs. 4:4:4 image comparison]

END-TO-END 4:4:4 IN TURING
HEVC 4:4:4 HW encode & 4:4:4 HW decode
[Figure: pipeline Desktop render → Capture → HW encode → Stream over network → Decode (CPU on Pascal & earlier, HW on Turing)]

TURING NVENC ENHANCEMENTS

NVENC - ENCODING QUALITY
Focus for Turing NVENC

Enhancement                          How to use
Rate distortion optimization – RDO   Turing only – always ON
Multiple reference frames            Preset-dependent
HEVC B-frames                        NVENCODE API
Others

➢ Higher throughput at same quality as Pascal
➢ Turing GPUs have single NVENC engine with higher quality

TURING NVENC QUALITY
➢ Focus on quality – RDO, multi-ref, HEVC B-frames, …
➢ Quality vs performance trade-off
➢ Quality is content dependent
➢ 600+ videos of 10-20 secs each: natural, animation, gaming, video conference, movies
➢ 720p, 1080p, 4K, 8K
➢ Quality: PSNR, SSIM, VMAF, subjective
➢ Perf: fps, number of 1080p streams per GPU

H.264 ENCODE BENCHMARK
Non latency critical – Turing vs Pascal vs x264
[Figure: two panels titled "H.264 - non latency critical". Axes: bitrate ratio @ iso quality (lower ratio = higher bitrate savings) and #1080p30 streams (more streams = higher perf). Series: T4 medium, T4 fast, P4 slow, P4 medium, x264 slow, x264 medium, x264 fast]
"iso" quality = x264 medium

H.264 ENCODE BENCHMARK
Non latency critical – FFmpeg commands

NVENC slow     -preset slow -bufsize BITRATE*2 -maxrate BITRATE*1.5 -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0
x264 slow      -preset slow -tune psnr -vsync 0 -threads 4 -vsync 0
NVENC medium   -preset medium -rc vbr -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0
x264 medium    -preset medium -tune psnr -threads 4 -vsync 0
NVENC fast     -preset fast -rc vbr -profile:v high -bf 3 -b_ref_mode 2 -temporal-aq 1 -rc-lookahead 20 -vsync 0
x264 fast      -preset fast -tune psnr -vsync 0 -threads 4 -vsync 0

HEVC ENCODE BENCHMARK
Non latency critical – Turing vs Pascal vs x265
[Figure: two panels titled "HEVC – non latency critical". Axes: bitrate ratio @ iso quality (lower ratio = higher bitrate savings) and #1080p30 streams (more streams = higher perf). Series: T4 medium, T4 fast, P4 medium, x265 slow, x265 medium, x265 fast]
"iso" quality = x265 medium

HEVC ENCODE BENCHMARK
Non latency critical – FFmpeg commands

NVENC slow     -preset slow -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -rc-lookahead 20 -g 250 -vsync 0
x265 slow      -preset slow -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0
NVENC medium   -preset medium -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -rc-lookahead 20 -g 250 -vsync 0
x265 medium    -preset medium -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0
NVENC fast     -preset fast -rc vbr_hq -b:v BITRATE -profile:v 4 -bf 2 -temporal-aq 1 -rc-lookahead 20 -g 250 -vsync 0
x265 fast      -preset fast -b:v BITRATE -bf 2 -tune psnr -threads 4 -vsync 0

SOFTWARE UPDATES

RECONFIGURE DECODER
Video Codec SDK 8.2
No init time, reuse context, lowers memory fragmentation
Can be reconfigured:
✓ Input resolution
✓ Scaling resolution
✓ Cropping rectangle
Cannot be reconfigured:
✗ Codecs
✗ Bit-depth and chroma format
✗ Deinterlace mode
✗ Input resolution beyond max width or max height

DIRECT OUTPUT TO VIDMEM
Video Codec SDK 9.0
[Figure: SDK 8.2 & earlier vs. SDK 9.0 data paths. Labels: CUDA pre-process, NVENC, video memory, PCIe, host/system memory, CPU process, CUDA post-process]

OTHER UPDATES
➢ Video Codec SDK now supported on Power 9 + Tesla V100 SXM2
➢ High-level NVDEC error status

OPTICAL FLOW
New HW Functionality
➢ 4 × 4 optical flow vector, up to 4K × 4K
➢ Close to true motion
➢ Robust to intensity changes
➢ 10x faster than CPU; same quality
➢ New Optical Flow SDK
➢ Action recognition, object tracking, video inter/extrapolation, frame-rate upconversion
➢ Legacy ME-only mode support
More information: http://developer.nvidia.com/opticalflow-sdk

TIPS FOR NVENC OPTIMIZATION

OPTIMIZATION STRATEGIES
General Guidelines
➢ Minimize PCIe transfers
    ➢ Eliminate, if possible
    ➢ Use CUDA for video pre-/post-processing
➢ Multiple threads/processes to balance enc/dec utilization
    ➢ Monitor using nvidia-smi: nvidia-smi dmon -s uc -i <GPU_index>
    ➢ Analyze using GPUView on Windows
➢ Minimize disk I/O
➢ Optimize encoder settings for quality/perf balance

FFMPEG VIDEO TRANSCODING
Tips
➢ Look at FFmpeg users' guide in NVIDIA Video Codec SDK package
➢ Use -hwaccel keyword to keep entire transcode pipeline on GPU
➢ Run multiple 1:N transcode sessions to achieve M:N transcode at high perf

LOW LATENCY STREAMING (1/3)
Optimization tips
➢ Low latency ≠ low encoding time
➢ Latency determined by
    ➢ B-frames
    ➢ Look-ahead
    ➢ VBV buffer size & available bandwidth

LOW LATENCY STREAMING (2/3)
Optimization tips
➢ For 1-2 frame latency (e.g. cloud gaming), use
    ➢ RC_CBR_LOWDELAY_HQ & low VBV buffer size
        ➢ Minimizes frame-to-frame variations
    ➢ Any preset (Default, HQ, HP preferred)
        ➢ LL presets have resolution-dependent behavior
    ➢ No look-ahead
    ➢ No B-frames

LOW LATENCY STREAMING (3/3)
Optimization tips
➢ For higher (8-10 frames) latency (e.g. OTT, broadcast), use
    ➢ Any RC mode
    ➢ Any preset (Default, HQ, HP preferred)
    ➢ VBV buffer size as per channel bandwidth constraints
    ➢ Look-ahead depth < tolerable latency
    ➢ B-frames as needed
➢ Similar to HQ (non latency critical) encoding

VIDEO DL TRAINING
Typical Workflow
[Figure: Loader → NVDEC → Decoded frames → Color space, Resize, Crop, Reorder, Augment → Training]
➢ FFmpeg
➢ NVDECODE API
➢ CUDA post-processing

With DALI
Instantiate operator:
    self.input = ops.VideoReader(device="gpu", filenames=data,
                                 sequence_length=len)
Use it in the DALI graph:
    frames = self.input(name="Reader")
    output_frames = self.Crop(frames)
    return output_frames

ROADMAP
Video Codec SDK 9.1
➢ Q3 2018
➢ Error handling – Retrieve last error
➢ Perf/quality tuning
➢ Support for CUStream

RESOURCES
Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
FFmpeg GIT: https://git.ffmpeg.org/ffmpeg.git
FFmpeg builds with hardware acceleration: http://ffmpeg.zeranoe.com/builds/
Video SDK support: [email protected]
Video SDK forums: https://devtalk.nvidia.com/default/board/175/video-technologies/
Connect with Experts (CE9103): Wednesday, March 20, 2019, 3:00 pm
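
EXAMPLE: ASSEMBLING AN NVENC FFMPEG COMMAND
The H.264 NVENC flags from the FFmpeg benchmark commands, combined with the tip to use -hwaccel, can be assembled programmatically. The sketch below is illustrative only. The function name, the file names, the `h264_nvenc` encoder choice, the `-hwaccel cuda` and `-hwaccel_output_format cuda` values, and the decision to combine `-rc vbr` with `-b:v`, `-bufsize` and `-maxrate` in one command are assumptions, not taken from the slides. Only the encoder flags and the BITRATE*2 / BITRATE*1.5 ratios come from the benchmark commands.

```python
import shlex


def build_nvenc_h264_args(src, dst, bitrate_kbps, preset="medium"):
    """Return an FFmpeg argument list for a GPU H.264 transcode (hypothetical helper)."""
    if bitrate_kbps <= 0:
        raise ValueError("bitrate_kbps must be positive")
    return [
        "ffmpeg",
        # Assumption: decode on the GPU and keep frames in video memory,
        # following the tip to use -hwaccel for the whole pipeline.
        "-hwaccel", "cuda",
        "-hwaccel_output_format", "cuda",
        "-i", src,
        "-c:v", "h264_nvenc",  # assumption: FFmpeg's NVENC H.264 encoder
        # Encoder flags as listed in the H.264 benchmark commands.
        "-preset", preset,
        "-rc", "vbr",
        "-b:v", f"{bitrate_kbps}k",
        "-bufsize", f"{bitrate_kbps * 2}k",       # BITRATE*2
        "-maxrate", f"{bitrate_kbps * 3 // 2}k",  # BITRATE*1.5
        "-profile:v", "high",
        "-bf", "3",
        "-b_ref_mode", "2",
        "-temporal-aq", "1",
        "-rc-lookahead", "20",
        "-vsync", "0",
        dst,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so no GPU is needed to try this.
    print(shlex.join(build_nvenc_h264_args("input.mp4", "output.mp4", 4000)))
```

Returning a list rather than a shell string lets the result be passed directly to `subprocess.run` without quoting issues. Running several such commands in parallel is one way to approach the M:N transcode tip.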