MAPPING VIDEO CODECSTO HETEROGENEOUS ARCHITECTURES Mauricio Alvarez- | Techische Universität Berlin - Spin Digital | MULTIPROG 2015 Video Codecs

– 70% of internet traffic will be video in 2018 [CISCO] – Video compression: significant improvements over the last years – Growing demand/offer for higher quality: HD to UHD – Video codecs require more and more computing power

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 2 Heterogeneous Architectures

Like them or not: they are available

– CPUs + GPUs + DSPs + HW accelerators + ...

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 3 Mapping: the Complete Application

– Video Player: Decoding + Rendering

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 4 Mapping: Where to Execute What?

Figure : Qualcomm Sanpdragon 810 SoC

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 5 Outline

Introduction

Hardware Acceleration of Video Decoding

OpenCL Acceleration of Video Decoding

CPU Decoding + GPU Rendering

Conclusions

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 6 Pros

– Low-power: 4K H.265 < 100 mW

Cons

– Flexibility: more than one codec – Programmability issues Figure : Tikekar et al., JSSC, Jan. 2014

Hardware Accelerators

Mapping:

– Decoder in HW: ASIC for video decoding – Rendering in GPU

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 7 Hardware Accelerators

Mapping:

– Decoder in HW: ASIC for video decoding – Rendering in GPU

Pros

– Low-power: 4K H.265 < 100 mW

Cons

– Flexibility: more than one codec – Programmability issues Figure : Tikekar et al., JSSC, Jan. 2014

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 7 HW Accelerators: the Software Issue

Multiple : accessing HW modules from SW player

“Supported codecs depend on the chosen API: which is selected at runtime depending on what is available on the system” Gstreamer Manual

API Vendor MPEG2 MPEG4 H.264 VC1 H.265 VAAPI XXXX? VDPAU XXXX? XVBA AMD XX? DXVA Microsoft X? VDA Apple X?

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 8 Outline

Introduction

Hardware Acceleration of Video Decoding

OpenCL Acceleration of Video Decoding

CPU Decoding + GPU Rendering

Conclusions

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 9 Pros

– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)

Cons

– Programming model mismatch – Overheads

OpenCL Video Decoding

Mapping

– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 Cons

– Programming model mismatch – Overheads

OpenCL Video Decoding

Mapping

– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU

Pros

– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 OpenCL Video Decoding

Mapping

– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU

Pros

– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)

Cons

– Programming model mismatch – Overheads

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 H.264 Video Decoding with GPUs

– H.264/AVC: one of the most widely used video codecs

Inverse H.264 Entropy Inverse Deblocking quanti- + stream decoding DCT filter zation

Intra YUV prediction video

Motion Frame compen- buffer sation

Kernels Entropy Inverse Intra- Motion De- Dec. DCT Pred. Comp. blocking Parallelism Low High Low High Low Divergence High Low High High High

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 11 H.264 Inverse Transform on GPUs

Kernel Only

4x4 8x8 30 SIMD 25 GT430 GTS450 20 GTX560Ti

15

10 Speedups

5

0 CRF37 CRF22 Intra CRF37 CRF22 Intra

– Speedups over scalar execution in CPU – Significant speedups: over 25×

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 12 H.264 Inverse Transform on GPUs

Kernel + Memory Transfers + OpenCL runtime

1080p 2160p 4.0 3.5 Non-Over Overlap 3.0 SIMD 2.5 2.0 1.5

Speedups 1.0 0.5 0.0 CRF37 CRF22 Intra CRF37 CRF22 Intra

– Entire application slower than CPU + SIMD with 1 core

B. Wang et al. An Optimized Parallel IDCT on Graphics Processing Units. LNCS 2012.

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 13 H.264 on GPUs

– discrete GPU attains the highest performance, 1.7 speedup – significant performance penalty when including the overhead

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 14 H.264 Motion Compensation on GPUs

kernel Mem Copy OCL runtime 1080p 2160p 1

0.8

0.6

0.4 MC Breakdown 0.2

0

AMD Nvidia Intel AMD Nvidia Intel

B. Wang et al. Parallel H.264/AVC Motion Compensation for GPUs using OpenCL. IEEE TCSVT.

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 15 H.264: Complete Application

Entropy: offload Entropy: IDCT IDCT ... frame N frame N+1 Signal Signal

Reconstruction: offload Reconstruction: MC MC ... frame N frame N+1

Pipelined Others Entropy Seq-GPU-IDCT-MC IDCT Seq-GPU-MC MC Seq-GPU-IDCT MBREC Seq-CPU-IDCT-MC Total Seq-CPU-MC Seq-CPU-IDCT Seq-CPU 0 5 10 15 20 25

– GPU + frame-level pipelining: faster than single threaded CPU – Required complete application redesign

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 16 HEVC/H.265 Video Coding Standard

– HEVC/H.265: 50% better compression compared to H.264/AVC – Add high-level parallelization tools: WP, Tiles → good for CPUs – Remove dependencies in deblocking filter – Add another filter: SAO

Inverse HEVC Entropy Inverse Deblocking quanti- + SAO filter stream decoding DCT filter zation

Intra YUV prediction video

Motion Reference compen- pictures sation

Kernels Entropy Dec. Inverse T Intra-Pred. Motion Comp. De-blocking SAO Parallelism Low High Low High High High Divergence High Low High High High Low

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 17 H.265 In-loop Filters on Discrete GPUs

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 18 H.265 In-loop Filters on Integrated GPUs

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 19 H.265 In-loop Filters on Mobile GPUs

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 20 OPenCL Decoding: Lessons Learned

– Programming model mismatch: – not all kernels can be offloaded – for kernels that can be offloaded: data Locality loss – Acceleration of kernels but not of complete application – Overheads: – CPU GPU memory transfers – OpenCL runtime – Effects on rendering of using the GPU for decoding not analyzed (yet)

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 21 Outline

Introduction

Hardware Acceleration of Video Decoding

OpenCL Acceleration of Video Decoding

CPU Decoding + GPU Rendering

Conclusions

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 22 Cons

– Low power efficiency: see Chi et al. TACO Jan 2015. – Low performance: in low power CPUs

CPU Decoding

CPU optimizations

– Multithreading: ++11 threads – SIMD: compiler intrinsics

Pros

– Programmability – High performance: in high performance CPUs

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 23 CPU Decoding

CPU optimizations

– Multithreading: C++11 threads – SIMD: compiler intrinsics

Pros

– Programmability – High performance: in high performance CPUs

Cons

– Low power efficiency: see Chi et al. TACO Jan 2015. – Low performance: in low power CPUs

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 23 4K CPU Decoding Performance

1-Thread 2-Threads 4-Threads fps sp. fps sp. fps sp Haswell 37.8 1.00 73.4 1.94 134.9 3.57 Bay Trail 6.2 1.00 12.2 1.95 23.6 3.79 Cortex-A9 2.8 1.00 5.3 1.90 8.6 3.05 Cortex-A15 6.0 1.00 11.4 1.91 19.9 3.33

– High performance CPUs fast enough for 4Kp50 – Low power CPUs not fast enough for 4Kp25 References

– Chi et al. SIMD acceleration for HEVC Decoding, IEEE TCSVT. 2015 – Chi et al. Parallel Scalability and Efficiency of HEVC Parallelization Approaches. IEEE TCSVT. Dec 2012 – Chi et al. Parallel HEVC Decoding on Multi- and Many-core Architectures. JSPS 2013

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 24 Cons

– 2D rendering is complex – Color Conversion is not the bottleneck – OPenGL has extensions, and has no performance guarantees

GPU Rendering using OpenGL

Pros

– GPU should be efficient for simple 2D rendering – Color conversion is just matrix vector multiplication – GPUs have HW units for color conversion – OPenGL is a mature language, well supported in many platforms

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 25 GPU Rendering using OpenGL

Pros

– GPU should be efficient for simple 2D rendering – Color conversion is just matrix vector multiplication – GPUs have HW units for color conversion – OPenGL is a mature language, well supported in many platforms

Cons

– 2D rendering is complex – Color Conversion is not the bottleneck – OPenGL has extensions, and has no performance guarantees

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 25 Video Rendering Stages

Frame in YUV Texture YUV Color Model RGBA Decoding Displaying Uploading Conversion

– Texture Uploading: upload/transform image from CPU to GPU – Color Model Conversion: YUV into RGBA conversion – Displaying: from framebuffer to screen

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 26 4K Video Play: Discrete GPU

4K Playback on GTX750-Ti

Decoding + Rendering 20 Decoding Rendering Texture Uploading CC + Displaying 15 Color Conversion

10

5 Performance (millisec/frame) Performance

0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 27 4K Video Play: Integrated GPUs

4K Playback on HD4600 Linux

Decoding + Rendering 60 Decoding Rendering Texture Uploading 50 CC + Displaying Color Conversion 40

30

20 Performance (millisec/frame) Performance 10

0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 28 4K Playback on HD4600 Win8.1

30 Decoding + Rendering Decoding Rendering 25 Texture Uploading CC + Displaying Color Conversion 20

15

10

Performance (millisec/frame) Performance 5

0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats

Integrated GPUs + Linear Texture

4K Texture Uploading on HD4600 Win8.1

30 glTexSubImage2D PBO + memcpy 25 PBOMAP INTEL_map_texture

20

15

10

Performance (millisec/frame) Performance 5

0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 29 Integrated GPUs + Linear Texture

4K Texture Uploading on HD4600 Win8.1 4K Playback on HD4600 Win8.1

30 30 glTexSubImage2D Decoding + Rendering PBO + memcpy Decoding Rendering 25 PBOMAP 25 Texture Uploading INTEL_map_texture CC + Displaying Color Conversion 20 20

15 15

10 10 Performance (millisec/frame) Performance 5 (millisec/frame) Performance 5

0 0 YUV420 YUV420 YUV422 YUV444 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB 11.9MB 23.7MB 31.6MB 47.5MB Video Formats Video Formats

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 29 Integrated GPUs

– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension

OpenGL extensions

– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention

Decoding + Rendering: lessons learned

Discrete GPUs

– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 OpenGL extensions

– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention

Decoding + Rendering: lessons learned

Discrete GPUs

– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies

Integrated GPUs

– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 Decoding + Rendering: lessons learned

Discrete GPUs

– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies

Integrated GPUs

– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension

OpenGL extensions

– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 Outline

Introduction

Hardware Acceleration of Video Decoding

OpenCL Acceleration of Video Decoding

CPU Decoding + GPU Rendering

Conclusions

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 31 Solve the system problem

– Complete video decoding: not just kernels – Rendering: is not just matrix multiplication – Video player is Decoding + Rendering

Conclusions

Video codecs and parallel architectures: conflicting objectives

– Video codecs: remove redundancy → introduce dependencies – Parallelism : remove dependencies → reduce compresssion

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 32 Conclusions

Video codecs and parallel architectures: conflicting objectives

– Video codecs: remove redundancy → introduce dependencies – Parallelism : remove dependencies → reduce compresssion

Solve the system problem

– Complete video decoding: not just kernels – Rendering: is not just matrix multiplication – Video player is Decoding + Rendering

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 32 Questions?

– Acknowledgements: Biao Wang, Haifeng Gao, Chi Ching Chi – More info: – http://www.aes.tu-berlin.de/ – http://www.spin-digital.com/

Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 33