MAPPING VIDEO CODECSTO HETEROGENEOUS ARCHITECTURES Mauricio Alvarez-Mesa | Techische Universität Berlin - Spin Digital | MULTIPROG 2015 Video Codecs
– 70% of internet traffic will be video in 2018 [CISCO] – Video compression: significant improvements over the last years – Growing demand/offer for higher quality: HD to UHD – Video codecs require more and more computing power
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 2 Heterogeneous Architectures
Like them or not: they are available
– CPUs + GPUs + DSPs + HW accelerators + ...
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 3 Mapping: the Complete Application
– Video Player: Decoding + Rendering
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 4 Mapping: Where to Execute What?
Figure : Qualcomm Sanpdragon 810 SoC
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 5 Outline
Introduction
Hardware Acceleration of Video Decoding
OpenCL Acceleration of Video Decoding
CPU Decoding + GPU Rendering
Conclusions
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 6 Pros
– Low-power: 4K H.265 < 100 mW
Cons
– Flexibility: more than one codec – Programmability issues Figure : Tikekar et al., JSSC, Jan. 2014
Hardware Accelerators
Mapping:
– Decoder in HW: ASIC for video decoding – Rendering in GPU
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 7 Hardware Accelerators
Mapping:
– Decoder in HW: ASIC for video decoding – Rendering in GPU
Pros
– Low-power: 4K H.265 < 100 mW
Cons
– Flexibility: more than one codec – Programmability issues Figure : Tikekar et al., JSSC, Jan. 2014
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 7 HW Accelerators: the Software Issue
Multiple APIs: accessing HW modules from SW player
“Supported codecs depend on the chosen API: which is selected at runtime depending on what is available on the system” Gstreamer Manual
API Vendor MPEG2 MPEG4 H.264 VC1 H.265 VAAPI Intel XXXX? VDPAU NVidia XXXX? XVBA AMD XX? DXVA Microsoft X? VDA Apple X?
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 8 Outline
Introduction
Hardware Acceleration of Video Decoding
OpenCL Acceleration of Video Decoding
CPU Decoding + GPU Rendering
Conclusions
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 9 Pros
– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)
Cons
– Programming model mismatch – Overheads
OpenCL Video Decoding
Mapping
– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 Cons
– Programming model mismatch – Overheads
OpenCL Video Decoding
Mapping
– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU
Pros
– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 OpenCL Video Decoding
Mapping
– Decoder: accelerate decoding on the GPU using OpenCL – Rendering: in GPU
Pros
– Programmability – Compatibility: write once, and compile/run everywhere (isn’t it?)
Cons
– Programming model mismatch – Overheads
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 10 H.264 Video Decoding with GPUs
– H.264/AVC: one of the most widely used video codecs
Inverse H.264 Entropy Inverse Deblocking quanti- + stream decoding DCT filter zation
Intra YUV prediction video
Motion Frame compen- buffer sation
Kernels Entropy Inverse Intra- Motion De- Dec. DCT Pred. Comp. blocking Parallelism Low High Low High Low Divergence High Low High High High
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 11 H.264 Inverse Transform on GPUs
Kernel Only
4x4 8x8 30 SIMD 25 GT430 GTS450 20 GTX560Ti
15
10 Speedups
5
0 CRF37 CRF22 Intra CRF37 CRF22 Intra
– Speedups over scalar execution in CPU – Significant speedups: over 25×
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 12 H.264 Inverse Transform on GPUs
Kernel + Memory Transfers + OpenCL runtime
1080p 2160p 4.0 3.5 Non-Over Overlap 3.0 SIMD 2.5 2.0 1.5
Speedups 1.0 0.5 0.0 CRF37 CRF22 Intra CRF37 CRF22 Intra
– Entire application slower than CPU + SIMD with 1 core
B. Wang et al. An Optimized Parallel IDCT on Graphics Processing Units. LNCS 2012.
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 13 H.264 Motion Compensation on GPUs
– discrete GPU attains the highest performance, 1.7 speedup – significant performance penalty when including the overhead
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 14 H.264 Motion Compensation on GPUs
kernel Mem Copy OCL runtime 1080p 2160p 1
0.8
0.6
0.4 MC Breakdown 0.2
0
AMD Nvidia Intel AMD Nvidia Intel
B. Wang et al. Parallel H.264/AVC Motion Compensation for GPUs using OpenCL. IEEE TCSVT.
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 15 H.264: Complete Application
Entropy: offload Entropy: IDCT IDCT ... frame N frame N+1 Signal Signal
Reconstruction: offload Reconstruction: MC MC ... frame N frame N+1
Pipelined Others Entropy Seq-GPU-IDCT-MC IDCT Seq-GPU-MC MC Seq-GPU-IDCT MBREC Seq-CPU-IDCT-MC Total Seq-CPU-MC Seq-CPU-IDCT Seq-CPU 0 5 10 15 20 25
– GPU + frame-level pipelining: faster than single threaded CPU – Required complete application redesign
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 16 HEVC/H.265 Video Coding Standard
– HEVC/H.265: 50% better compression compared to H.264/AVC – Add high-level parallelization tools: WP, Tiles → good for CPUs – Remove dependencies in deblocking filter – Add another filter: SAO
Inverse HEVC Entropy Inverse Deblocking quanti- + SAO filter stream decoding DCT filter zation
Intra YUV prediction video
Motion Reference compen- pictures sation
Kernels Entropy Dec. Inverse T Intra-Pred. Motion Comp. De-blocking SAO Parallelism Low High Low High High High Divergence High Low High High High Low
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 17 H.265 In-loop Filters on Discrete GPUs
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 18 H.265 In-loop Filters on Integrated GPUs
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 19 H.265 In-loop Filters on Mobile GPUs
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 20 OPenCL Decoding: Lessons Learned
– Programming model mismatch: – not all kernels can be offloaded – for kernels that can be offloaded: data Locality loss – Acceleration of kernels but not of complete application – Overheads: – CPU GPU memory transfers – OpenCL runtime – Effects on rendering of using the GPU for decoding not analyzed (yet)
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 21 Outline
Introduction
Hardware Acceleration of Video Decoding
OpenCL Acceleration of Video Decoding
CPU Decoding + GPU Rendering
Conclusions
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 22 Cons
– Low power efficiency: see Chi et al. TACO Jan 2015. – Low performance: in low power CPUs
CPU Decoding
CPU optimizations
– Multithreading: C++11 threads – SIMD: compiler intrinsics
Pros
– Programmability – High performance: in high performance CPUs
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 23 CPU Decoding
CPU optimizations
– Multithreading: C++11 threads – SIMD: compiler intrinsics
Pros
– Programmability – High performance: in high performance CPUs
Cons
– Low power efficiency: see Chi et al. TACO Jan 2015. – Low performance: in low power CPUs
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 23 4K CPU Decoding Performance
1-Thread 2-Threads 4-Threads fps sp. fps sp. fps sp Haswell 37.8 1.00 73.4 1.94 134.9 3.57 Bay Trail 6.2 1.00 12.2 1.95 23.6 3.79 Cortex-A9 2.8 1.00 5.3 1.90 8.6 3.05 Cortex-A15 6.0 1.00 11.4 1.91 19.9 3.33
– High performance CPUs fast enough for 4Kp50 – Low power CPUs not fast enough for 4Kp25 References
– Chi et al. SIMD acceleration for HEVC Decoding, IEEE TCSVT. 2015 – Chi et al. Parallel Scalability and Efficiency of HEVC Parallelization Approaches. IEEE TCSVT. Dec 2012 – Chi et al. Parallel HEVC Decoding on Multi- and Many-core Architectures. JSPS 2013
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 24 Cons
– 2D rendering is complex – Color Conversion is not the bottleneck – OPenGL has extensions, and has no performance guarantees
GPU Rendering using OpenGL
Pros
– GPU should be efficient for simple 2D rendering – Color conversion is just matrix vector multiplication – GPUs have HW units for color conversion – OPenGL is a mature language, well supported in many platforms
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 25 GPU Rendering using OpenGL
Pros
– GPU should be efficient for simple 2D rendering – Color conversion is just matrix vector multiplication – GPUs have HW units for color conversion – OPenGL is a mature language, well supported in many platforms
Cons
– 2D rendering is complex – Color Conversion is not the bottleneck – OPenGL has extensions, and has no performance guarantees
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 25 Video Rendering Stages
Frame in YUV Texture YUV Color Model RGBA Decoding Displaying Uploading Conversion
– Texture Uploading: upload/transform image from CPU to GPU – Color Model Conversion: YUV into RGBA conversion – Displaying: from framebuffer to screen
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 26 4K Video Play: Discrete GPU
4K Playback on GTX750-Ti Linux
Decoding + Rendering 20 Decoding Rendering Texture Uploading CC + Displaying 15 Color Conversion
10
5 Performance (millisec/frame) Performance
0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 27 4K Video Play: Integrated GPUs
4K Playback on HD4600 Linux
Decoding + Rendering 60 Decoding Rendering Texture Uploading 50 CC + Displaying Color Conversion 40
30
20 Performance (millisec/frame) Performance 10
0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 28 4K Playback on HD4600 Win8.1
30 Decoding + Rendering Decoding Rendering 25 Texture Uploading CC + Displaying Color Conversion 20
15
10
Performance (millisec/frame) Performance 5
0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats
Integrated GPUs + Linear Texture
4K Texture Uploading on HD4600 Win8.1
30 glTexSubImage2D PBO + memcpy 25 PBOMAP INTEL_map_texture
20
15
10
Performance (millisec/frame) Performance 5
0 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB Video Formats
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 29 Integrated GPUs + Linear Texture
4K Texture Uploading on HD4600 Win8.1 4K Playback on HD4600 Win8.1
30 30 glTexSubImage2D Decoding + Rendering PBO + memcpy Decoding Rendering 25 PBOMAP 25 Texture Uploading INTEL_map_texture CC + Displaying Color Conversion 20 20
15 15
10 10 Performance (millisec/frame) Performance 5 (millisec/frame) Performance 5
0 0 YUV420 YUV420 YUV422 YUV444 YUV420 YUV420 YUV422 YUV444 8-bit 10-bit 10-bit 10-bit 8-bit 10-bit 10-bit 10-bit 11.9MB 23.7MB 31.6MB 47.5MB 11.9MB 23.7MB 31.6MB 47.5MB Video Formats Video Formats
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 29 Integrated GPUs
– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension
OpenGL extensions
– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention
Decoding + Rendering: lessons learned
Discrete GPUs
– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 OpenGL extensions
– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention
Decoding + Rendering: lessons learned
Discrete GPUs
– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies
Integrated GPUs
– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 Decoding + Rendering: lessons learned
Discrete GPUs
– Performance drop: sharing memory for PCIe copy – Can be solved by using integrated GPUs: remove copies
Integrated GPUs
– Texture uploading takes more time than decoding – Can be solved with an OPenGL extension
OpenGL extensions
– INTEL_map_texture: remove memory layout conversion – Performance loss due to memory contention
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 30 Outline
Introduction
Hardware Acceleration of Video Decoding
OpenCL Acceleration of Video Decoding
CPU Decoding + GPU Rendering
Conclusions
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 31 Solve the system problem
– Complete video decoding: not just kernels – Rendering: is not just matrix multiplication – Video player is Decoding + Rendering
Conclusions
Video codecs and parallel architectures: conflicting objectives
– Video codecs: remove redundancy → introduce dependencies – Parallelism : remove dependencies → reduce compresssion
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 32 Conclusions
Video codecs and parallel architectures: conflicting objectives
– Video codecs: remove redundancy → introduce dependencies – Parallelism : remove dependencies → reduce compresssion
Solve the system problem
– Complete video decoding: not just kernels – Rendering: is not just matrix multiplication – Video player is Decoding + Rendering
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 32 Questions?
– Acknowledgements: Biao Wang, Haifeng Gao, Chi Ching Chi – More info: – http://www.aes.tu-berlin.de/ – http://www.spin-digital.com/
Mapping Video Codecs to Heterogeneous Architectures | M. Alvarez-Mesa | Multiprog 2015 Slide 33