Embedded Linux Conference Europe
Integrating HW-Accelerated Video Decoding with the Display Stack
Paul Kocialkowski <[email protected]>
© Copyright 2004-2019, Bootlin. Creative Commons BY-SA 3.0 license.
Corrections, suggestions, contributions and translations are welcome!

Bootlin: embedded Linux and kernel engineering
- Kernel, drivers and embedded Linux
- Development, consulting, training and support
- https://bootlin.com

Paul Kocialkowski
- Embedded Linux engineer at Bootlin
  - Embedded Linux expertise
  - Development, consulting and training
  - Strong open-source focus
- Open-source contributor
  - Co-maintainer of the cedrus VPU driver in V4L2
  - Contributor to the sun4i-drm DRM driver
  - Developed the "Displaying and rendering graphics with Linux" training
- Living in Toulouse, south-west of France

Outline and Introduction

Purpose of this talk
- Present our specific use case
  - Some basics about video decoding
  - How Linux supports dedicated hardware for it
  - Our hardware, driver and constraints
- Provide an overview of video pipeline integration
  - From source to sink
  - With efficient use of the hardware
  - Using the existing userspace software components
- Detail what went wrong
  - Things don't always pan out in the graphics world
  - Sharing the pain points we encountered
  - Constructive criticism: things could be a lot worse
    (always look on the bright side of life)

Purpose of this talk
Let's try and build a good pipeline, eh?

You said video decoding?
- Sequences of pictures take a huge amount of data to represent...
- So we compress them using a given codec:
  - Color compression: YUV sub-sampling
  - Spatial compression: frequency-space transform (DCT) and filtering
  - Temporal compression: multi-directional interpolation
  - Entropy compression: Huffman coding, arithmetic coding
- Add some meta-data to the mix to get the bitstream
- Encapsulate that bitstream with other things (audio, ...) in a container
- Then we have a reasonable amount of data for a fair result!

You said hardware video decoding?
- So now we need a significant number of operations to get back our frames
- Embedded systems don't have that much CPU time to spare
- Hardware to the rescue: fixed-function decoder block implementations
  - Digest the video bitstream to spit out decoded pictures
  - Implementations are per-codec (or per-generation)
- Two distinct types of hardware implementations:
  - Stateful: with an MCU to parse raw meta-data from the bitstream and keep track of buffers
  - Stateless: expect parsed meta-data and compressed data only

Hardware video decoding in Linux (Media/V4L2)
- In Linux, hardware video decoders (aka VPUs) are supported in V4L2
- Support for stateful VPUs landed with the V4L2 M2M framework
  - Adapted to memory-to-memory hardware
  - Source (output) is bitstream, destination (capture) is a decoded picture
- Support for stateless VPUs landed with the Media Request API
  - Meta-data is passed in per-codec V4L2 controls
  - Controls are synchronized with buffers under media requests
  - Source (output) is compressed data, destination (capture) is a decoded picture
- Decoded pictures are accessed:
  - By the CPU through mmap on the destination buffer
  - By other devices through dma-buf import of the destination buffer
The kind of expected result
[Figure: H.265 hardware video decoding with UI integration]

What to do with decoded pictures
Video decoding is just the tip of the iceberg...
- Colorspace conversion (CSC) from YUV is often needed
- Scaling and composition with the UI are also required
- These are awfully calculation-intensive, sometimes more than CPU-based video decoding
- But hey, we have hardware for that too:
  - The display engine usually supports all these operations via overlays/planes
  - Sometimes there are dedicated hardware blocks too
  - The GPU can do anything, so it can do that too (right?)
- Let's avoid copies and share buffers between devices:
  full-frame memory copies are just a big no-no for performance

Hardware video decoding on Allwinner platforms and display stack integration

Allwinner platforms
[Figure: community Allwinner boards from our friends at Olimex and Libre Computer]

Our situation: the Allwinner side of things
- Relevant multimedia blocks on Allwinner hardware:
  - Video decoder (VPU): fixed-function (stateless) implementation, supports
    MPEG-2/H.263/Xvid/H.264/VP6/VP8, plus H.265/VP9 on recent SoCs
  - Display engines: support multiple input overlays
  - GPU: Mali 400/450 in most cases
- First generation of devices (A10-A33) comes with constraints:
  - VPU can only map the lowest 256 MiB of RAM
  - VPU produces pictures in a specific tiled scan order (aka MB32)
  - Display engine
    supports MB32 tiling for planes/overlays
- Second generation (A33-A64+) doesn't have these constraints:
  - VPU still works with tiling internally, but the untiling block is in the VPU

Allwinner MB32 tiled video format
[Figure: linear (raster) scan order vs. MB32-tiled scan order]
- w: width, s: stride, h: height
- wt: tile-aligned width (stride), ht: tile-aligned height

Bootlin's contribution for hardware video decoding support
- On the DRM kernel side:
  - DRM_FORMAT_MOD_ALLWINNER_TILED modifier (merged in 5.1)
  - sun4i-drm support for linear/tiled YUV formats in overlay planes (merged in 5.1)
- On the V4L2 kernel side:
  - Cedrus base driver (merged in 5.1)
  - V4L2_PIX_FMT_SUNXI_TILED_NV12 pixel format (merged in 5.1)
  - Experimental stateless MPEG-2 API and cedrus support (merged in 5.1)
  - Experimental stateless H.264 API and cedrus support (merged in 5.3)
  - Experimental stateless H.265 API and cedrus support (to be merged in 5.5)
- On the userspace side:
  - A test utility: v4l2-request-test
    https://github.com/bootlin/v4l2-request-test
  - A VAAPI back-end: libva-v4l2-request
    https://github.com/bootlin/libva-v4l2-request

Investigated and/or implemented setups

Bare-metal pipeline setup
- Test scenario: standalone dedicated application (v4l2-request-test)
  - Talks to the kernel directly (both V4L2 and DRM)
- Uses dma-buf for zero-copy
  - Exported from V4L2 with the VIDIOC_EXPBUF ioctl
  - Imported to DRM with the DRM_IOCTL_PRIME_FD_TO_HANDLE ioctl
- CSC,
  scaling and composition offloaded using DRM planes
- Bottom line: all is well, but a very limited use-case (testing)

[Diagram: pipeline components — v4l2-request-test, V4L2, DRM]

X.org pipeline setup (GPU-less): investigation
- Scenario: usual media players (using VAAPI) under X
- Can we use a similar setup (dma-buf to DRM plane) under X?
  - X initially only knows about RGB formats
  - But extensions exist: Xv, DRI3
- The Xv extension allows supporting YUV and scaling, but...
  - Requires writing a hardware-specific DDX (e.g. to use planes)
  - Requires a buffer copy and doesn't support modifiers
  - Has synchronization issues and is deprecated anyway (in favor of GL)
- DRI3 supposedly can solve these points:
  - Supports dma-buf import (but no modifier support)
  - Currently apparently only implemented in glamor (GPU-backed)
  - Doesn't give us access to DRM planes

X.org pipeline setup (GPU-less): bottom line
- Scenario: usual media players (using VAAPI) under X
- What worked:
  - Software untiling (NEON-accelerated) in the VAAPI back-end
  - Software-based CSC, scaling and composition
  - Buffer copies through XCB
- As a result, performance ~~sucks~~ is still surprisingly good without scaling involved

[Diagram: pipeline components — FFmpeg/VLC, VAAPI, V4L2, X.org (XCB)]

Improving the X.org pipeline with a GPU in the mix
- Using the GPU