NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, GTC 2020 AGENDA
Video/Image Processing on NVIDIA GPUs NVENC, NVDEC, NvOFA, NvJPEG
Software Updates and New Features Updates, benchmarks and end-to-end use-cases
Roadmap Upcoming features
2 Video/Image Processing on NVIDIA GPUs
3 NVIDIA VIDEO/IMAGE HARDWARE
CUDA Cores CPU System Memory NVDEC – Video Decoder (H.264, H.265, VP9,
PCIe MPEG-2…)
NVENC – Video Encoder (H.264, H.265) NVDEC JPEG decode GPU NVENC Optical Flow Optical Flow (Turing+) – Track pixels
JPEG Decoder (Ampere+ architecture) Video Memory Video CUDA Cores
4 SOFTWARE
All binaries in NVIDIA driver
SDKs Optical CUDA Video Codec SDK Flow SDK Toolkit APIs
Reusable samples Software NVIDIA Driver Documentation
Hardware Optical Linux & Windows NVENC NVDEC NvJPEG Flow CUDA, DirectX, OpenGL, Vulkan APIs
Binary backward compatibility
5 WHAT’S NEW IN GA100 GPUs?
NVDECs Scale Infer
Pascal: 1 … 0101100010011 … … 1001010111010 … Scale Infer … 0101100010011 … NVDEC 0 Turing: 1-2 … 1001010111010 … ...... GA100: 5 . . . . … 0101100010011 … … 1001010111010 … NVDEC N … 0101100010011 … Scale Infer … 1001010111010 …
High-res Decode Infer 1080p, 720p Scale
Low-res infer e.g. 300 × 200
Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 6 WHAT’S NEW IN GA100 GPUs?
NVDECs Scale Infer
Pascal: 1 … 0101100010011 … JPEG Scale Infer Turing: 1-2 … 0101100010011 … decoder 1 . . . . . … 1001010111010 … . GA100: 5 . . . … 0101100010011 … JPEG decoder 5 Scale Infer Dedicated 5-core JPEG decoder … 1111100010011 …
Full-res decode Scale Infer
Low-res infer e.g. 300 × 200
Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 7 WHAT’S NEW IN GA100 GPUs?
NVDECs
Pascal: 1 Turing Ampere Turing: 1-2 NVENC GA100: 5 Optical Flow Dedicated 5-core JPEG decoder
Improved optical flow engine, • Per-pixel flow vector independent of NVENC • Improved quality & perf • 8192 × 8192 • Region of interest
Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 8 SOFTWARE UPDATES Upcoming SDK Releases
Video Codec SDK 10.0
Optical Flow SDK 2.0
NvJPEG decode (part of CUDA 11.0)
9 Video Codec SDK Update
10 VIDEO CODEC SDK
SDK 9.0 SDK 10.0 SDK 8.1 Turing GA100 4K60 HEVC encode HEVC B-frame NVENC presets 2.0 Multi-NVDEC
2017 Q3 2018 Q3 2019
Q1 2018 Q1 2019 Q2 2020
SDK 9.1 SDK 8.2 SDK 8.0 True CBR Decode + inference 10-bit transcode CUDA/NVDEC optimizations parallelism
11 VIDEO SDK 10.0
GA100 Support (Ampere architecture)
Multi-NVDEC performance improvements Encoder (NVENC) presets 2.0 June 2020
12 VIDEO SDK 10.0
GA100 Support (Ampere architecture)
Multi-NVDEC performance improvements Encoder (NVENC) presets 2.0
13 TURING ARCHITECTURE Up to 2 NVDECs per GPU for Decode + Inference
180 168 160 144 140 129 120
100 83 80 71 68 GA100 (A100-42GB) 1080p30 streams 1080p30 60 49 TU104 (Tesla T4) 40 34 GP104 (Tesla P4) 22 23 22 22 23 17 21 20 16 12 15 14 16 GV100 (Quadro GV100) GM206 (Tesla M4) 0 H264 HEVC HEVC 10 VP9
14 GA100 WITH AMPERE ARCHITECTURE 5 NVDECs per GPU for Decode + Inference
180 168 160 144 140 129 120
100 83 80 71 68 GA100 (A100-42GB) 1080p30 streams 1080p30 60 49 TU104 (Tesla T4) 40 34 GP104 (Tesla P4) 22 23 22 22 23 17 21 20 16 12 15 14 16 GV100 (Quadro GV100) GM206 (Tesla M4) 0 H264 HEVC HEVC 10 VP9
15 VIDEO SDK 10.0
GA100 (Ampere microarchitecture) Support
Multi-NVDEC Encoder (NVENC) presets 2.0
16 NVENC CONFIGURATION Video Codec SDK 9.1 and earlier
Choose from
6 Presets: HQ, DEFAULT, HP, LLHQ, LL_DEFAULT, LLHP
6 Rate control modes: Constant QP, VBR, CBR, CBR_LOWDELAY_HQ, CBR_HQ, VBR_HQ
Advanced features: B-frames, B-as-reference, Look-ahead, AQ, weighted prediction, VBV…
Use-case specific configuration
Streaming: LL* + CBR_LOWDELAY_HQ + 1-frame VBV
Transcoding: HQ + VBR/CBR + B-frames + B-as-reference + AQ + high VBV
Resolution-dependent quality/performance
Extreme ends of quality/performance difficult to achieve
17 INTRODUCING NVENC PRESETS 2.0 Video Codec SDK 10.0
Choose
Use-case/Tuning Info: High-quality, Low Latency, Ultra Low Latency, Lossless
Preset: P1 (highest performance) to P7 (highest quality)
Rate Control Mode: Constant QP, CBR, VBR
Advanced Features: Built-in; tune only if required
18 LEGACY VS NEW NVENC PRESETS Comparison
Legacy Presets Presets 2.0 One-shot configuration No Yes Advanced features config Needed Rarely needed Use-case based No Yes Performance scales with resolution Sometimes Always Multi-pass rate control Indirect Direct Fine granularity of perf vs quality Low High Rate control modes Complex Easy to understand Extreme ends of quality/perf trade-off No Yes
19 NVENC PRESETS: NEW VS LEGACY HEVC, High quality (latency tolerant), VBR
1200 40% 27.45% 27.45% 27.71% 23.57% 25.86% 25.86% 20% 1000 14.29% 8.80% 0.00% 0.00% 0% 800
649 651 -20% 600 -40% 424 400 380
297 -60% Quality (Bitrate savings) Performance (fps) (fps) 1080p at Performance 230 230 200 132 119 132 -80%
0 -100% P1 P2 P3 P4 P5 P6 P7 HP Default HQ
New NVENC Presets Legacy NVENC Presets 20 NVENC PRESETS: NEW VS LEGACY H.264, High quality (latency tolerant), VBR
1200 30% 20.60% 23.78% 24.08% 20.60% 21.85% 20% 1000 16.36% 9.34% 10%
800 0.00% 0.16% 0.16% 0%
634 606 611 -10% 600 -20%
390 406 Bitrate savings 400 -30%
304 304 Performance (fps) (fps) 1080p at Performance -40% 167 200 162 140 -50%
0 -60% P1 P2 P3 P4 P5 P6 P7 HP Default HQ
New NVENC Presets Legacy NVENC Presets 21 NVENC PRESETS: NEW VS LEGACY HEVC, Ultra Low Latency (latency sensitive), CBR
800 8% 5.66% 5.96% 5.66% 6% 700 6.07% 5.96% 4.22% 4% 600 1.93% 2% 1.00% 500 0.00% 0% -1.46% 400 -2% 327 340 289 -4% 300 Bitrate savings 248 219 -6%
Performance (fps) (fps) 1080p at Performance 188 188 200 165 165 164 -8% 100 -10%
0 -12% P1 P2 P3 P4 P5 P6 P7 LL-HP LL-Default LL-HQ
New NVENC Presets Legacy NVENC Presets 22 MIGRATION TO NEW NVENC PRESETS Video Codec SDK 10.0
Strongly recommended to move to new presets
Mapping table in Video Codec SDK 10.0
Backward compatibility – Binary ; Source
Old presets will be removed from API in future SDK
SDK release – June 2020
23 Encode (NVENC) Benchmarks
24 ENCODE BENCHMARKS https://developer.nvidia.com/nvidia-video-codec-sdk
25 ENCODE BENCHMARK – H.264 Latency Tolerant H.264 Encoding vs x264 70 15%
8.59% 10% 60
5% 1.62% 50 0% 0% -4.48% 40 -5% -7.80%
-10% 30 -9.87% Reference -11.15% quality #1080p30 streams -15% 20 19 19 18 -20% better) (higher Bitrate Saving is
10 10 -25% 2 6 5 0 -30% T4 fast T4 medium P4 medium P4 slow x264 fast x264 medium x264 slow
Turing NVENC Pascal NVENC CPU
26 ENCODE BENCHMARK - HEVC Latency tolerant HEVC Encoding vs x265
60 20%
9% 50 0% 0%
-10% -8%
40 -21% -20%
-35% -35% 30 -40% Reference quality
#1080p30 streams #1080p30 20 -60%
13
12 better) (higher Bitrate Saving is 10 10 -80% 4 2 1 0.85 0 -100% T4 fast T4 medium P4 medium P4 slow x265 fast x265 medium x265 slow
Turing NVENC Pascal NVENC CPU
27 Optical Flow SDK Update
28 OPTICAL FLOW SDK
SDK 2.0 SDK 1.0 GA100 Turing 1x1, 2x2, 4x4, 8K 4x4 vectors, 4K Region of Interest Object tracker
Q3 2019
Q1 2019 Q2 2020
SDK 1.1 OpenCV Accuracy
29 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC
Turing Ampere
NVENC
Optical Flow
30 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC Ampere/Turing 83% Better performance and higher accuracy than Turing 82% 81% 80%
Up to 300 fps at 4K* Slow 3 3 pixels)
± 79% 78% 77% Medium 76% 75% Fast
Quality* (% accurateQuality*(% 74% 73% 72% 0 200 400 600 800 1000 1200 1400 Performance (fps for 1080p)
*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 31 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC 2 × 2
Better performance and higher accuracy than Turing
Up to 300 fps at 4K*
Granularity: 1×1, 2×2, 4×4 pixels 4 × 4
1 × 1
*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 32 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC
Better performance and higher accuracy than Turing Turing OF Up to 300 fps at 4K* 4096
Granularity: 1×1, 2×2, 4×4 pixels 4096 Ampere OF Resolution up to 8192 × 8192
8192 8192
*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 33 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC
Better performance and higher accuracy than Turing
Up to 300 fps at 4K*
Granularity: 1×1, 2×2, 4×4 pixels
Resolution up to 8192 × 8192
Flow vectors for region of Interest (ROI)
*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 34 OPTICAL FLOW SDK 2.0 What’s New?
Ampere architecture GPUs
Improved hardware, independent of NVENC
Better performance and higher accuracy than Turing
Up to 300 fps at 4K*
Granularity: 1×1, 2×2, 4×4 pixels
Resolution up to 8192 × 8192
Flow vectors for region of Interest (ROI)
¼ pixel accuracy
Improved confidence (cost)
Optical-Flow based object tracker
*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 35 OPTICAL FLOW SDK 2.0 Accuracy vs Performance
Ampere/Turing 83% 82% 81% 80%
3 pixels) 3 Slow ± 79% 78% 77% Medium 76% 75%
74% Fast Accuracy* (% Accuracy* accurate 73% 72% 0 200 400 600 800 1000 1200 1400 Performance (fps for 1080p)
*KITTI 2015 FL-ALL NOC 36 OPTICAL FLOW REGION OF INTEREST
Static background, moving foreground objects; e.g. video surveillance
Identify object of interest
Define ROI with extended bounds ±N pixels
Advantages: Improved performance, less noise in flow vectors
37 Main Functionality NV_OF_STATUS(NVOFAPI* PFNNVOFINIT) nvOpticalFlowCommon.h (NvOFHandle hOf, const NV_OF_INIT_PARAMS *initParams);
NV_OF_STATUS(NVOFAPI* PFNNVOFEXECUTE) CUDA and DirectX Buffer (NvOFHandle hOf, const Management NV_OF_EXECUTE_INPUT_PARAMS nvOpticalFlowCuda.h *executeInParams, NV_OF_EXECUTE_OUTPUT_PARAMS nvOpticalFlowD3D11.h *executeOutParams);
Reusable Classes NV_OF_STATUS(NVOFAPI* PFNNVOFDESTROY) (NvOFHandle hOf); NvOF.h NvOFCuda.h NvOFD3D11.h
38 USE VIA OPENCV
Mat frameL = imread(pathL, IMREAD_GRAYSCALE);
Mat frameR = imread(pathR, IMREAD_GRAYSCALE);
GpuMat d_flowL(frameL), d_flowR(frameR), d_flow;
Mat flowx, flowy, flowxy;
int gpuId = 0;
int width = frameL.size().width, height = frameL.size().height;
Ptr
39 USE VIA OPENCV
Mat frameL = imread(pathL, IMREAD_GRAYSCALE);
Mat frameR = imread(pathR, IMREAD_GRAYSCALE);
GpuMat d_flowL(frameL), d_flowR(frameR), d_flow;
Mat flowx, flowy, flowxy;
int gpuId = 0;
int width = frameL.size().width, height = frameL.size().height;
Ptr
40 OPTICAL FLOW APPLICATIONS
Object tracking
Video frame interpolation
Video frame extrapolation
Action recognition
41 OPTICAL FLOW APPLICATIONS
Object tracking - API and source code in Optical Flow SDK 2.0
Video frame interpolation
Video frame extrapolation
Action recognition
42 OBJECT TRACKING Using Feature Tracker or Correlation Filter
Camera 1
Stream 1 Object 1 Detector 1 NVDEC 1 Tracker 1
. frame th
. . K Object . . Tracker 2 NVDEC 2 . Detector M M . every Runs ...... 1 .
. . Feature Tracker NVDEC N . Object Tracker M Stream M M every frame Runs
Camera M Object classes, IDs and positions 43 OBJECT TRACKING WITH ZERO GPU USAGE Optical Flow is “Free”!!
Camera 1
Stream 1 Object 1 Detector 1 NVDEC 1 Tracker 1
. frame th
. . K Object . . Tracker 2 NVDEC 2 . Detector M M . every Runs ...... 1 .
. NVIDIA Optical . Flow Hardware NVDEC N . Object Tracker M Stream M M every frame Runs
Camera M Object classes, IDs and positions 44 OBJECT TRACKER ALGORITHM NvOFTracker in Optical Flow SDK 2.0
Decode video bitstream to frames, P-2, P-1, P (current)
Detect objects of interest (person, bicycles, cars, …)
Regions of Define regions of interest to track Decode Frames Detect Interest (NVDEC) (yolov3) th Object IDs Runs every K frame
Tracker Object Optical Flow (P, P-1) (CPU/ Classes CUDA) Flow Object Dense flow field ➔ Representative flow NvOF Vectors Positions Hardware Warp object position
Tracker minimizes cost
Cost = f(Centroid distance, IOU, OF cost, …)
45 NV OBJECT TRACKER API Source code in Optical Flow SDK 2.0
NvOFTracker.h COFTracker.h
C-API C++ Reusable Class
• NvOFTCreate • COFTracker::COFTracker
• NvOFTProcess • COFTracker::InitTracker
• NvOFTDestroy • COFTracker:: TrackObjects
46 Sample Code int main(int argc, char** argv) { // Instantiate a video decoder cv::VideoCapture videoIn(clFields.inputFile, cv::CAP_FFMPEG);
// Initialize and instantiate object detector CDetector objectDetector(detectorEngineFile, gpuId);
// Instantiate NvOFT Object tracker COFTracker objectTracker(w, dh, NvOFT_SURFACE_MEM_TYPE_SYSTEM, NvOFT_SURFACE_FORMAT_Y, gpuId);
while (frame_available) { // Read the next video frame cv::Mat frame; videoIn >> frame;
if (nFrame % N == 0) // Run detector every Nth frame and get bounding boxes boxes = objectDetector.Run(frame.data, frameProperties); else // Track detected objects in every frame trackedObjects = objectTracker.TrackObjects(frame.data, frameSize, frame.step[0], inputObjectsVector, FALSE);
DrawTrackedObjects(…) }
videoIn.release(); }
47 OBJECT TRACKER IN OPTICAL FLOW SDK Contents
Source code for end-to-end pipeline with
Video decoder
Object detector (non-NVIDIA, yoloV3)
Object tracker based on NVIDIA optical flow API for easy integration Sample application and documentation Integrable with NVIDIA DeepStream SDK
48 OBJECT TRACKER PROFILING
49 Roadmap
50 Roadmap
Video Codec SDK 11.0 (Q3 2020)
Deprecation of presets
Samples moved to GitHub, online documentation
DirectX 12 and Vulkan support via sample apps
Optical Flow SDK 3.0 (Q3 2020)
Ampere architecture GPUs
Object detector & tracker
Frame rate interpolation/extrapolation
Optical vector pre/post-processing APIs
51 RESOURCES
Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
Optical Flow SDK: https://developer.nvidia.com/nvidia-video-codec-sdk
Video Technologies Developer Forum: https://devtalk.nvidia.com/default/board/175/video-technologies/
Support: [email protected]
CWE21120: How to Use Video Codec and Optical Flow SDK on NVIDIA GPUs Effectively
52