Embedded Linux Conference Europe
Integrating HW-Accelerated Video Decoding with the Display Stack
Paul Kocialkowski <[email protected]>
© Copyright 2004-2019, Bootlin. Creative Commons BY-SA 3.0 license.
Corrections, suggestions, contributions and translations are welcome!

Bootlin: embedded Linux and kernel engineering
- Kernel, drivers and embedded Linux
- Development, consulting, training and support
- https://bootlin.com

Paul Kocialkowski
- Embedded Linux engineer at Bootlin
  - Embedded Linux expertise
  - Development, consulting and training
  - Strong open-source focus
- Open-source contributor
  - Co-maintainer of the cedrus VPU driver in V4L2
  - Contributor to the sun4i-drm DRM driver
  - Developed the "Displaying and rendering graphics with Linux" training
- Living in Toulouse, south-west of France

Outline and Introduction

Purpose of this talk
- Present our specific use case
  - Some basics about video decoding
  - How Linux supports dedicated hardware for it
  - Our hardware, driver and constraints
- Provide an overview of video pipeline integration
  - From source to sink
  - With efficient use of the hardware
  - Using the existing userspace software components
- Detail what went wrong
  - Things don't always pan out in the graphics world
  - Sharing the pain points we encountered
  - Constructive criticism: things could be a lot worse
    (always look on the bright side of life)

Purpose of this talk
Let's try and build a good pipeline, eh?

You said video decoding?
- Sequences of pictures take a huge amount of data to represent...
- So we compress them using a given codec:
  - Color compression: YUV sub-sampling
  - Spatial compression: frequency-space transform (DCT) and filtering
  - Temporal compression: multi-directional interpolation
  - Entropy compression: Huffman coding, arithmetic coding
- Add some meta-data to the mix to get the bitstream
- Encapsulate that bitstream with other things (audio, ...) in a container
- Then we have a reasonable amount of data for a fair result!

You said hardware video decoding?
- So now we need a significant number of operations to get back our frames
- Embedded systems don't have that much CPU time to spare
- Hardware to the rescue: fixed-function decoder block implementations
  - Digest the video bitstream to spit out decoded pictures
  - Implementations are per-codec (or per-generation)
- Two distinct types of hardware implementations:
  - Stateful: with an MCU to parse raw meta-data from the bitstream and keep track of buffers
  - Stateless: expect parsed meta-data and compressed data only

Hardware video decoding in Linux (Media/V4L2)
- In Linux, hardware video decoders (aka VPUs) are supported in V4L2
- Support for stateful VPUs landed with the V4L2 M2M framework
  - Adapted to memory-to-memory hardware
  - Source (output) is bitstream, destination (capture) is a decoded picture
- Support for stateless VPUs landed with the Media Request API
  - Meta-data is passed in per-codec V4L2 controls
  - Controls are synchronized with buffers under media requests
  - Source (output) is compressed data, destination (capture) is a decoded picture
- Decoded pictures are accessed:
  - By the CPU through mmap on the destination buffer
  - By other devices through dma-buf import of the destination buffer
The kind of expected result
[Figure: H.265 hardware video decoding with UI integration]

What to do with decoded pictures
Video decoding is just the tip of the iceberg...
- Colorspace conversion (CSC) from YUV is often needed
- Scaling and composition with the UI are also required
- These are awfully calculation-intensive, sometimes more than CPU-based video decoding
- But hey, we have hardware for that too:
  - The display engine usually supports all these operations via overlays/planes
  - Sometimes there are dedicated hardware blocks too
  - The GPU can do anything, so it can do that too (right?)
- Let's avoid copies and share buffers between devices:
  full-frame memory copies are just a big no-no for performance

Hardware video decoding on Allwinner platforms and display stack integration

Allwinner platforms
[Figure: community Allwinner boards from our friends at Olimex and Libre Computer]

Our situation: the Allwinner side of things
- Relevant multimedia blocks on Allwinner hardware:
  - Video decoder (VPU): fixed-function (stateless) implementation, supports
    MPEG-2/H.263/Xvid/H.264/VP6/VP8, plus H.265/VP9 on recent SoCs
  - Display engines: support multiple input overlays
  - GPU: Mali 400/450 in most cases
- First generation of devices (A10-A33) comes with constraints:
  - VPU can only map the lowest 256 MiB of RAM
  - VPU produces pictures in a specific tiled scan order (aka MB32)
  - Display engine
    supports MB32 tiling for planes/overlays
- Second generation (A33-A64+) doesn't have these constraints:
  - VPU still works with tiling internally, but the untiling block is in the VPU

Allwinner MB32 tiled video format
[Figure: linear (raster) scan order vs. MB32-tiled scan order]
- w: width, s: stride, h: height
- wt: tile-aligned width (stride), ht: tile-aligned height

Bootlin's contribution for hardware video decoding support
- On the DRM kernel side:
  - DRM_FORMAT_MOD_ALLWINNER_TILED modifier (merged in 5.1)
  - sun4i-drm support for linear/tiled YUV formats in overlay planes (merged in 5.1)
- On the V4L2 kernel side:
  - Cedrus base driver (merged in 5.1)
  - V4L2_PIX_FMT_SUNXI_TILED_NV12 pixel format (merged in 5.1)
  - Experimental stateless MPEG-2 API and cedrus support (merged in 5.1)
  - Experimental stateless H.264 API and cedrus support (merged in 5.3)
  - Experimental stateless H.265 API and cedrus support (to be merged in 5.5)
- On the userspace side:
  - A test utility: v4l2-request-test
    https://github.com/bootlin/v4l2-request-test
  - A VAAPI back-end: libva-v4l2-request
    https://github.com/bootlin/libva-v4l2-request

Investigated and/or implemented setups

Bare-metal pipeline setup
- Test scenario: standalone dedicated application (v4l2-request-test)
  - Talks to the kernel directly (both V4L2 and DRM)
- Uses dma-buf for zero-copy
  - Exported from V4L2 with the VIDIOC_EXPBUF ioctl
  - Imported to DRM with the DRM_IOCTL_PRIME_FD_TO_HANDLE ioctl
- CSC,
  scaling and composition offloaded using DRM planes
- Bottom line: all is well, but a very limited use-case (testing)

[Diagram: pipeline components — v4l2-request-test, V4L2, DRM]

X.org pipeline setup (GPU-less): investigation
- Scenario: usual media players (using VAAPI) under X
- Can we use a similar setup (dma-buf to DRM plane) under X?
  - X initially only knows about RGB formats
  - But extensions exist: Xv, DRI3
- The Xv extension allows supporting YUV and scaling, but...
  - Requires writing a hardware-specific DDX (e.g. to use planes)
  - Requires a buffer copy and doesn't support modifiers
  - Has synchronization issues and is deprecated anyway (in favor of GL)
- DRI3 supposedly can solve these points:
  - Supports dma-buf import (but no modifier support)
  - Currently apparently only implemented in glamor (GPU-backed)
  - Doesn't give us access to DRM planes

X.org pipeline setup (GPU-less): bottom line
- Scenario: usual media players (using VAAPI) under X
- What worked:
  - Software untiling (NEON-accelerated) in the VAAPI back-end
  - Software-based CSC, scaling and composition
  - Buffer copies through XCB
- As a result, performance ~~sucks~~ is still surprisingly good without scaling involved

[Diagram: pipeline components — FFmpeg/VLC, VAAPI, V4L2, X.org (XCB)]

Improving the X.org pipeline with a GPU in the mix
- Using the GPU