NVIDIA VIDEO TECHNOLOGIES Abhijit Patait, GTC 2020 AGENDA

Video/Image Processing on NVIDIA GPUs NVENC, NVDEC, NvOFA, NvJPEG

Software Updates and New Features Updates, benchmarks and end-to-end use-cases

Roadmap Upcoming features

2 Video/Image Processing on NVIDIA GPUs

3 NVIDIA VIDEO/IMAGE HARDWARE

CUDA Cores CPU System Memory NVDEC – Video Decoder (H.264, H.265, VP9,

PCIe MPEG-2…)

NVENC – Video Encoder (H.264, H.265) NVDEC JPEG decode GPU NVENC Optical Flow Optical Flow (Turing+) – Track pixels

JPEG Decoder (Ampere+ architecture) Video Memory Video CUDA Cores

4 SOFTWARE

All binaries in NVIDIA driver

SDKs Optical CUDA Video Codec SDK Flow SDK Toolkit APIs

Reusable samples Software NVIDIA Driver Documentation

Hardware Optical Linux & Windows NVENC NVDEC NvJPEG Flow CUDA, DirectX, OpenGL, Vulkan APIs

Binary backward compatibility

5 WHAT’S NEW IN GA100 GPUs?

NVDECs Scale Infer

Pascal: 1 … 0101100010011 … … 1001010111010 … Scale Infer … 0101100010011 … NVDEC 0 Turing: 1-2 … 1001010111010 … ...... GA100: 5 . . . . … 0101100010011 … … 1001010111010 … NVDEC N … 0101100010011 … Scale Infer … 1001010111010 …

High-res Decode Infer 1080p, 720p Scale

Low-res infer e.g. 300 × 200

Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 6 WHAT’S NEW IN GA100 GPUs?

NVDECs Scale Infer

Pascal: 1 … 0101100010011 … JPEG Scale Infer Turing: 1-2 … 0101100010011 … decoder 1 . . . . . … 1001010111010 … . GA100: 5 . . . … 0101100010011 … JPEG decoder 5 Scale Infer Dedicated 5-core JPEG decoder … 1111100010011 …

Full-res decode Scale Infer

Low-res infer e.g. 300 × 200

Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 7 WHAT’S NEW IN GA100 GPUs?

NVDECs

Pascal: 1 Turing Ampere Turing: 1-2 NVENC GA100: 5 Optical Flow Dedicated 5-core JPEG decoder

Improved optical flow engine, • Per-pixel flow vector independent of NVENC • Improved quality & perf • 8192 × 8192 • Region of interest

Not all features will be in all GPUs. Please check NVIDIA developer zone web site for detailed information 8 SOFTWARE UPDATES Upcoming SDK Releases

Video Codec SDK 10.0

Optical Flow SDK 2.0

NvJPEG decode (part of CUDA 11.0)

9 Video Codec SDK Update

10 VIDEO CODEC SDK

SDK 9.0 SDK 10.0 SDK 8.1 Turing GA100 4K60 HEVC encode HEVC B-frame NVENC presets 2.0 Multi-NVDEC

2017 Q3 2018 Q3 2019

Q1 2018 Q1 2019 Q2 2020

SDK 9.1 SDK 8.2 SDK 8.0 True CBR Decode + inference 10-bit transcode CUDA/NVDEC optimizations parallelism

11 VIDEO SDK 10.0

GA100 Support (Ampere architecture)

Multi-NVDEC performance improvements Encoder (NVENC) presets 2.0 June 2020

12 VIDEO SDK 10.0

GA100 Support (Ampere architecture)

Multi-NVDEC performance improvements Encoder (NVENC) presets 2.0

13 TURING ARCHITECTURE Up to 2 NVDECs per GPU for Decode + Inference

180 168 160 144 140 129 120

100 83 80 71 68 GA100 (A100-42GB) 1080p30 streams 1080p30 60 49 TU104 (Tesla T4) 40 34 GP104 (Tesla P4) 22 23 22 22 23 17 21 20 16 12 15 14 16 GV100 (Quadro GV100) GM206 (Tesla M4) 0 H264 HEVC HEVC 10 VP9

14 GA100 WITH AMPERE ARCHITECTURE 5 NVDECs per GPU for Decode + Inference

180 168 160 144 140 129 120

15 VIDEO SDK 10.0

GA100 (Ampere microarchitecture) Support

Multi-NVDEC Encoder (NVENC) presets 2.0

16 NVENC CONFIGURATION Video Codec SDK 9.1 and earlier

Choose from

6 Presets: HQ, DEFAULT, HP, LLHQ, LL_DEFAULT, LLHP

6 Rate control modes: Constant QP, VBR, CBR, CBR_LOWDELAY_HQ, CBR_HQ, VBR_HQ

Advanced features: B-frames, B-as-reference, Look-ahead, AQ, weighted prediction, VBV…

Use-case specific configuration

Streaming: LL* + CBR_LOWDELAY_HQ + 1-frame VBV

Transcoding: HQ + VBR/CBR + B-frames + B-as-reference + AQ + high VBV

Resolution-dependent quality/performance

Extreme ends of quality/performance difficult to achieve

17 INTRODUCING NVENC PRESETS 2.0 Video Codec SDK 10.0

Choose

Use-case/Tuning Info: High-quality, Low Latency, Ultra Low Latency, Lossless

Preset: P1 (highest performance) to P7 (highest quality)

Rate Control Mode: Constant QP, CBR, VBR

Advanced Features: Built-in; tune only if required

18 LEGACY VS NEW NVENC PRESETS Comparison

Legacy Presets Presets 2.0 One-shot configuration No Yes Advanced features config Needed Rarely needed Use-case based No Yes Performance scales with resolution Sometimes Always Multi-pass rate control Indirect Direct Fine granularity of perf vs quality Low High Rate control modes Complex Easy to understand Extreme ends of quality/perf trade-off No Yes

19 NVENC PRESETS: NEW VS LEGACY HEVC, High quality (latency tolerant), VBR

1200 40% 27.45% 27.45% 27.71% 23.57% 25.86% 25.86% 20% 1000 14.29% 8.80% 0.00% 0.00% 0% 800

649 651 -20% 600 -40% 424 400 380

297 -60% Quality (Bitrate savings) Performance (fps) (fps) 1080p at Performance 230 230 200 132 119 132 -80%

0 -100% P1 P2 P3 P4 P5 P6 P7 HP Default HQ

New NVENC Presets Legacy NVENC Presets 20 NVENC PRESETS: NEW VS LEGACY H.264, High quality (latency tolerant), VBR

1200 30% 20.60% 23.78% 24.08% 20.60% 21.85% 20% 1000 16.36% 9.34% 10%

800 0.00% 0.16% 0.16% 0%

634 606 611 -10% 600 -20%

390 406 Bitrate savings 400 -30%

304 304 Performance (fps) (fps) 1080p at Performance -40% 167 200 162 140 -50%

0 -60% P1 P2 P3 P4 P5 P6 P7 HP Default HQ

New NVENC Presets Legacy NVENC Presets 21 NVENC PRESETS: NEW VS LEGACY HEVC, Ultra Low Latency (latency sensitive), CBR

800 8% 5.66% 5.96% 5.66% 6% 700 6.07% 5.96% 4.22% 4% 600 1.93% 2% 1.00% 500 0.00% 0% -1.46% 400 -2% 327 340 289 -4% 300 Bitrate savings 248 219 -6%

Performance (fps) (fps) 1080p at Performance 188 188 200 165 165 164 -8% 100 -10%

0 -12% P1 P2 P3 P4 P5 P6 P7 LL-HP LL-Default LL-HQ

New NVENC Presets Legacy NVENC Presets 22 MIGRATION TO NEW NVENC PRESETS Video Codec SDK 10.0

Strongly recommended to move to new presets

Mapping table in Video Codec SDK 10.0

Backward compatibility – Binary ; Source

Old presets will be removed from API in future SDK

SDK release – June 2020

23 Encode (NVENC) Benchmarks

24 ENCODE BENCHMARKS https://developer.nvidia.com/nvidia-video-codec-sdk

25 ENCODE BENCHMARK – H.264 Latency Tolerant H.264 Encoding vs x264 70 15%

8.59% 10% 60

5% 1.62% 50 0% 0% -4.48% 40 -5% -7.80%

-10% 30 -9.87% Reference -11.15% quality #1080p30 streams -15% 20 19 19 18 -20% better) (higher Bitrate Saving is

10 10 -25% 2 6 5 0 -30% T4 fast T4 medium P4 medium P4 slow x264 fast x264 medium x264 slow

Turing NVENC Pascal NVENC CPU

26 ENCODE BENCHMARK - HEVC Latency tolerant HEVC Encoding vs x265

60 20%

9% 50 0% 0%

-10% -8%

40 -21% -20%

-35% -35% 30 -40% Reference quality

#1080p30 streams #1080p30 20 -60%

12 better) (higher Bitrate Saving is 10 10 -80% 4 2 1 0.85 0 -100% T4 fast T4 medium P4 medium P4 slow x265 fast x265 medium x265 slow

Turing NVENC Pascal NVENC CPU

27 Optical Flow SDK Update

28 OPTICAL FLOW SDK

SDK 2.0 SDK 1.0 GA100 Turing 1x1, 2x2, 4x4, 8K 4x4 vectors, 4K Region of Interest Object tracker

Q3 2019

Q1 2019 Q2 2020

SDK 1.1 OpenCV Accuracy

29 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC

Turing Ampere

NVENC

Optical Flow

30 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC Ampere/Turing 83% Better performance and higher accuracy than Turing 82% 81% 80%

Up to 300 fps at 4K* Slow 3 3 pixels)

± 79% 78% 77% Medium 76% 75% Fast

Quality* (% accurateQuality*(% 74% 73% 72% 0 200 400 600 800 1000 1200 1400 Performance (fps for 1080p)

*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 31 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC 2 × 2

Better performance and higher accuracy than Turing

Up to 300 fps at 4K*

Granularity: 1×1, 2×2, 4×4 pixels 4 × 4

1 × 1

*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 32 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC

Better performance and higher accuracy than Turing Turing OF Up to 300 fps at 4K* 4096

Granularity: 1×1, 2×2, 4×4 pixels 4096 Ampere OF Resolution up to 8192 × 8192

8192 8192

*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 33 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC

Better performance and higher accuracy than Turing

Up to 300 fps at 4K*

Granularity: 1×1, 2×2, 4×4 pixels

Resolution up to 8192 × 8192

Flow vectors for region of Interest (ROI)

*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 34 OPTICAL FLOW SDK 2.0 What’s New?

Ampere architecture GPUs

Improved hardware, independent of NVENC

Better performance and higher accuracy than Turing

Up to 300 fps at 4K*

Granularity: 1×1, 2×2, 4×4 pixels

Resolution up to 8192 × 8192

Flow vectors for region of Interest (ROI)

¼ pixel accuracy

Improved confidence (cost)

Optical-Flow based object tracker

*Performance is dependent on clocks, available memory bandwidth and well-designed application pipelining 35 OPTICAL FLOW SDK 2.0 Accuracy vs Performance

Ampere/Turing 83% 82% 81% 80%

3 pixels) 3 Slow ± 79% 78% 77% Medium 76% 75%

74% Fast Accuracy* (% Accuracy* accurate 73% 72% 0 200 400 600 800 1000 1200 1400 Performance (fps for 1080p)

*KITTI 2015 FL-ALL NOC 36 OPTICAL FLOW REGION OF INTEREST

Static background, moving foreground objects; e.g. video surveillance

Identify object of interest

Define ROI with extended bounds ±N pixels

Advantages: Improved performance, less noise in flow vectors

37 Main Functionality NV_OF_STATUS(NVOFAPI* PFNNVOFINIT) nvOpticalFlowCommon.h (NvOFHandle hOf, const NV_OF_INIT_PARAMS *initParams);

NV_OF_STATUS(NVOFAPI* PFNNVOFEXECUTE) CUDA and DirectX Buffer (NvOFHandle hOf, const Management NV_OF_EXECUTE_INPUT_PARAMS nvOpticalFlowCuda.h *executeInParams, NV_OF_EXECUTE_OUTPUT_PARAMS nvOpticalFlowD3D11.h *executeOutParams);

Reusable Classes NV_OF_STATUS(NVOFAPI* PFNNVOFDESTROY) (NvOFHandle hOf); NvOF.h NvOFCuda.h NvOFD3D11.h

38 USE VIA OPENCV

Mat frameL = imread(pathL, IMREAD_GRAYSCALE);

Mat frameR = imread(pathR, IMREAD_GRAYSCALE);

GpuMat d_flowL(frameL), d_flowR(frameR), d_flow;

Mat flowx, flowy, flowxy;

int gpuId = 0;

int width = frameL.size().width, height = frameL.size().height;

Ptr OpticalFlow = cuda::FarnebackOpticalFlow::create(); OpticalFlow->calc(d_flowL, d_flowR, d_flow); d_flow.download(flowxy);

39 USE VIA OPENCV

Mat frameL = imread(pathL, IMREAD_GRAYSCALE);

Mat frameR = imread(pathR, IMREAD_GRAYSCALE);

GpuMat d_flowL(frameL), d_flowR(frameR), d_flow;

Mat flowx, flowy, flowxy;

int gpuId = 0;

int width = frameL.size().width, height = frameL.size().height;

Ptr OpticalFlow = cuda::NvidiaOpticalFlow::create(perfPreset, width, height, gpuId); OpticalFlow->calc(frameL, frameR, d_flow); d_flow.download(flowxy);

40 OPTICAL FLOW APPLICATIONS

Object tracking

Video frame interpolation

Video frame extrapolation

Action recognition

41 OPTICAL FLOW APPLICATIONS

Object tracking - API and source code in Optical Flow SDK 2.0

Video frame interpolation

Video frame extrapolation

Action recognition

42 OBJECT TRACKING Using Feature Tracker or Correlation Filter

Camera 1

Stream 1 Object 1 Detector 1 NVDEC 1 Tracker 1

. frame th

. . K Object . . Tracker 2 NVDEC 2 . Detector M M . every Runs ...... 1 .

. . Feature Tracker NVDEC N . Object Tracker M Stream M M every frame Runs

Camera M Object classes, IDs and positions 43 OBJECT TRACKING WITH ZERO GPU USAGE Optical Flow is “Free”!!

Camera 1

Stream 1 Object 1 Detector 1 NVDEC 1 Tracker 1

. frame th

. . K Object . . Tracker 2 NVDEC 2 . Detector M M . every Runs ...... 1 .

. NVIDIA Optical . Flow Hardware NVDEC N . Object Tracker M Stream M M every frame Runs

Camera M Object classes, IDs and positions 44 OBJECT TRACKER ALGORITHM NvOFTracker in Optical Flow SDK 2.0

Decode video bitstream to frames, P-2, P-1, P (current)

Detect objects of interest (person, bicycles, cars, …)

Regions of Define regions of interest to track Decode Frames Detect Interest (NVDEC) (yolov3) th Object IDs Runs every K frame

Tracker Object Optical Flow (P, P-1) (CPU/ Classes CUDA) Flow Object Dense flow field ➔ Representative flow NvOF Vectors Positions Hardware Warp object position

Tracker minimizes cost

Cost = f(Centroid distance, IOU, OF cost, …)

45 NV OBJECT TRACKER API Source code in Optical Flow SDK 2.0

NvOFTracker.h COFTracker.h

C-API C++ Reusable Class

• NvOFTCreate • COFTracker::COFTracker

• NvOFTProcess • COFTracker::InitTracker

• NvOFTDestroy • COFTracker:: TrackObjects

46 Sample Code int main(int argc, char** argv) { // Instantiate a video decoder cv::VideoCapture videoIn(clFields.inputFile, cv::CAP_FFMPEG);

// Initialize and instantiate object detector CDetector objectDetector(detectorEngineFile, gpuId);

// Instantiate NvOFT Object tracker COFTracker objectTracker(w, dh, NvOFT_SURFACE_MEM_TYPE_SYSTEM, NvOFT_SURFACE_FORMAT_Y, gpuId);

while (frame_available) { // Read the next video frame cv::Mat frame; videoIn >> frame;

if (nFrame % N == 0) // Run detector every Nth frame and get bounding boxes boxes = objectDetector.Run(frame.data, frameProperties); else // Track detected objects in every frame trackedObjects = objectTracker.TrackObjects(frame.data, frameSize, frame.step[0], inputObjectsVector, FALSE);

DrawTrackedObjects(…) }

videoIn.release(); }

47 OBJECT TRACKER IN OPTICAL FLOW SDK Contents

Source code for end-to-end pipeline with

Video decoder

Object detector (non-NVIDIA, yoloV3)

Object tracker based on NVIDIA optical flow API for easy integration Sample application and documentation Integrable with NVIDIA DeepStream SDK

48 OBJECT TRACKER PROFILING

49 Roadmap

50 Roadmap

Video Codec SDK 11.0 (Q3 2020)

Deprecation of presets

Samples moved to GitHub, online documentation

DirectX 12 and Vulkan support via sample apps

Optical Flow SDK 3.0 (Q3 2020)

Ampere architecture GPUs

Object detector & tracker

Frame rate interpolation/extrapolation

Optical vector pre/post-processing APIs

51 RESOURCES

Video Codec SDK: https://developer.nvidia.com/nvidia-video-codec-sdk

Optical Flow SDK: https://developer.nvidia.com/nvidia-video-codec-sdk

Video Technologies Developer Forum: https://devtalk.nvidia.com/default/board/175/video-technologies/

Support: [email protected]

CWE21120: How to Use Video Codec and Optical Flow SDK on NVIDIA GPUs Effectively