Eindhoven University of Technology

MASTER

Optimising HEVC decoding using AVX512 SIMD extensions

Cao, Xu

Award date: 2018

Awarding institution: Technische Universität Berlin

Link to publication

Master Thesis

Optimising HEVC Decoding using Intel AVX512 SIMD Extensions

Xu CAO

Matriculation Number: 0367965

Technische Universität Berlin School IV · Electrical Engineering and Computer Science Department of Computer Engineering and Microelectronics Embedded Systems Architectures (AES) Einsteinufer 17 · D-10587 Berlin

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Engineering

according to the examination regulations of the Technische Universität Berlin for the Master's Degree in Computer Engineering.

Department of Computer Engineering and Microelectronics Embedded Systems Architectures (AES) Technische Universität Berlin Berlin

Author Xu CAO

Thesis period 07. May 2016 to 22. November 2016

Referees Prof. Dr. B. Juurlink, Embedded Systems Architectures (AES) Prof. Dr. R. Karnapke, Communication and Operating Systems (KBS)

Supervisor Prof. Dr. B. Juurlink, Embedded Systems Architectures (AES) Prof. Dr. R. Karnapke, Communication and Operating Systems (KBS) M.Sc. Chi Ching Chi, Embedded Systems Architectures (AES)

Declaration

Hiermit erkläre ich, dass ich die vorliegende Arbeit selbstständig und eigenhändig sowie ohne unerlaubte fremde Hilfe und ausschließlich unter Verwendung der aufgeführten Quellen und Hilfsmittel angefertigt habe.

Berlin, November 22, 2016

Xu CAO

Abstract

SIMD instructions have been commonly used to accelerate video codecs. The recently introduced HEVC codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to SIMD acceleration. Intel's new SIMD extension, AVX512, can process more data elements with a single instruction than previous SIMD extensions such as SSE, AVX and AVX2. This thesis presents the differences between AVX512 and earlier SIMD instruction sets such as SSE and AVX2, and the advantages of using the AVX512 SIMD extensions to optimise HEVC decoding. The AVX512 SIMD extensions are applied to the HEVC decoding kernels that can potentially benefit from them (inter prediction, inverse transform and deblocking filter). Performance evaluation on 1080p (HD) and 2160p (UHD) video sets under the Intel Software Development Emulator (SDE) indicates that with AVX512 the number of executed instructions can be reduced by up to 23% for 1080p 8-bit and by up to 31% for 2160p 10-bit compared to an HEVC decoder optimised with AVX2. Compared to the scalar implementation, the AVX512 optimisation reduces the executed instructions by 85% and 83% for 1080p and 2160p, respectively.


Zusammenfassung

SIMD-Befehle werden häufig verwendet, um Video-Codecs zu beschleunigen. Der neu eingeführte HEVC-Codec basiert wie seine Vorgänger auf dem Hybrid-Video-Codec-Prinzip und eignet sich daher gut, um mit SIMD beschleunigt zu werden. Als neue Intel-SIMD-Erweiterung ermöglicht AVX512 die Verarbeitung von mehr Datenelementen mit einer einzigen Anweisung als bisherige SIMD-Erweiterungen wie SSE, AVX und AVX2. In dieser Arbeit werden der Unterschied zwischen AVX512 und anderen SIMD-Befehlen wie AVX2 und SSE sowie der Vorteil der Verwendung von AVX512-SIMD-Erweiterungen zur Optimierung der HEVC-Decodierung dargestellt. Die AVX512-SIMD-Erweiterungen werden auf die potentiell profitierenden HEVC-Decodierungskerne (Inter-Prädiktion, inverse Transformation und Deblocking-Filter) angewendet. Die Ergebnisse der Leistungsanalyse auf Basis von Videosequenzen in 1080p- (HD) und 2160p- (UHD) Auflösung unter Nutzung des Intel Software Development Emulators (SDE) zeigen, dass durch die Verwendung der AVX512-SIMD-Erweiterungen die Anzahl der ausgeführten Befehle gegenüber dem auf AVX2 basierenden optimierten HEVC-Decoder für 1080p 8-Bit um bis zu 23% und für 2160p 10-Bit um bis zu 31% verringert werden kann. Beim Vergleich der AVX512-Optimierung mit der skalaren Implementierung können die ausgeführten Befehle für 1080p und 2160p um 85% beziehungsweise 83% reduziert werden.


Contents

List of Tables xi

List of Figures xiii

1 Introduction 1

2 State of the Art 5
2.1 Introduction of SIMD Extensions 5
2.2 Related Work 7
2.3 Introduction of HEVC 8

3 Theoretical Foundations 13
3.1 Overview of AVX512 SIMD Extensions 13
3.2 Baseline Optimised HEVC Decoder 16

4 Implementation 23
4.1 Inter prediction 23
4.2 Inverse Transform 26
4.3 Deblocking Filter 29

5 Results 35

6 Conclusions and Discussions 43

Bibliography 45


List of Tables

2.1 SIMD extensions to general purpose processors[8] 6
2.2 SIMD extensions for desktop-computer processors 8

3.1 Comparison among two released public CPUs with their projected next-gen counterparts supporting AVX512[1] 14
3.2 Multi-media (Fractal Generation) Benchmark: SIMD Native[1] 15
3.3 Multi-media (Fractal Generation) Benchmark: Cryptography Native[1] 16
3.4 Multi-media (Fractal Generation) Benchmark: Cache & Memory Transfer[1] 17
3.5 Average executed instructions per frame [minst/frame][8] 18
3.6 Speedup per stage on Haswell for different SIMD levels[8] 20

4.1 Coefficients for fractional positions for the HEVC interpolation filter. ... 24


List of Figures

2.1 HEVC decoding process 9
2.2 Coding Tree Unit 10

3.1 Single threaded performance at normalized frequency and IPC results of baseline HEVC decoder[8] 19
3.2 Decoder execution time breakdown on Haswell for scalar, , and avx2. The PSIDE and PCOEF stages represent the side information and coefficient parsing (CABAC entropy decoding). The PSIDE also includes the interpretation of the syntax elements and coding tree unit traversal[8] 20

4.1 Horizontal luminance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (1) 24
4.2 Horizontal luminance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (2) 25
4.3 Vertical chrominance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (1) 26
4.4 Vertical chrominance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (2) 27
4.5 HEVC 1D inverse transform for 32x32 TBs[8] 28
4.6 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (1) 28
4.7 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (2) 29
4.8 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (3) 30
4.9 Samples of one edge part around a block edge involved in the deblocking filter 30
4.10 Swizzle of the first and last samples of 16 edge parts into a vector 31
4.11 Luminance filter process in the deblocking filter kernel 32


5.1 Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementation compared with the scalar program for 1080p 8-bit resolution video sets 35
5.2 Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementation compared with the scalar program for 2160p 10-bit resolution video sets 36
5.3 Average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit and 2160p 10-bit resolution video sets 37
5.4 Proportion distribution of the number of executed instructions when applying AVX2 and AVX512 to decode the 1080p 8-bit resolution video sets 38
5.5 Proportion distribution of the number of executed instructions when applying AVX2 and AVX512 to decode the 2160p 10-bit resolution video sets 39
5.6 Proportion distribution of the average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit resolution video for each stage 40
5.7 Proportion distribution of the average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 2160p 10-bit resolution video for each stage 40

1 Introduction

Computer architecture has advanced year after year and has a strong mutual influence with video coding technology, particularly in the deployment of single instruction, multiple data (SIMD) instructions. The High Efficiency Video Coding (HEVC) standard is the most recent joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardisation organisations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC)[6]. Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy; it describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously[21]. SIMD extensions were originally introduced to enable software-only MPEG-1 real-time decoding on general purpose processors[12][13]. Since then, SIMD extensions have been particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio, and most modern CPU designs include SIMD instructions to improve multimedia performance. To allow more efficient video coding implementations, the SIMD capabilities of processors also need to be taken into account during the standardisation process, for instance through an explicit effort to define reasonable intermediate computation precision and to eliminate sample dependencies. The JCT-VC standardisation organisation has released the High Efficiency Video Coding (HEVC) standard[18]. Based on PSNR, HEVC showed a bit rate reduction of 35.4% compared to H.264/MPEG-4 AVC HP, 63.7% compared to MPEG-4 ASP, 65.1% compared to H.263 HLP, and 70.8% compared to H.262/MPEG-2[15]. Similar to previous standards, significant attention was paid to allowing the new standard to be accelerated with SIMD and custom hardware solutions.
Clearly, hardware-only solutions can achieve much higher energy efficiency than software solutions running on general purpose processors. However, when hardware acceleration is not available, optimised software solutions are required. Besides, starting from an optimised software solution, less effort is needed for the next product generation, so the time to market can be significantly reduced and the problem of hardware overspecialisation can be avoided[19].

HEVC is more complicated than previous video coding standards: it supports three different coding tree block sizes, more transform sizes, an additional loop filter, more intra prediction angles, and so on. Additionally, in recent years the instruction set architectures (ISAs) and their micro-architectures have become more and more diverse. Furthermore, SIMD ISAs have become more complex with multiple instruction set extensions, and even within the same ISA the instructions have different performance characteristics depending on the implementation. At the same time, HEVC has significant potential for SIMD acceleration, so much more investigation is needed to analyse the impact of SIMD acceleration on HEVC decoding. Recently, Intel released a new SIMD extension, named AVX512, which is a 512-bit extension of the 256-bit Advanced Vector Extensions (AVX) SIMD instruction set architecture. AVX512 instructions can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers, within 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction, and four times that of SSE, which can in theory improve the performance of the HEVC decoder considerably. It is therefore important to investigate how much improvement can be achieved for HEVC decoding by using the AVX512 SIMD extensions. The main contribution of this thesis is an investigation of the differences between AVX512 and other SIMD instruction sets such as AVX2 and SSE.
Then, the AVX512 SIMD extensions are applied to the HEVC decoding kernels that can potentially benefit from them (inter prediction, inverse transform and deblocking filter). The corresponding performance evaluation, based on 1080p (HD) and 2160p (UHD) resolution video sets under the Intel Software Development Emulator (SDE), indicates that with the AVX512 SIMD extensions the number of executed instructions can be reduced by up to 23% for 1080p 8-bit and up to 31% for 2160p 10-bit compared to the HEVC decoder optimised with AVX2. Compared to scalar HEVC decoding, 85% and 83% of the executed instructions can be eliminated for 1080p and 2160p, respectively, with the AVX512 SIMD ISAs. This thesis is organised as follows. First, Section II introduces SIMD extensions, related work and HEVC features, while Section III presents the theoretical foundations, including an overview of AVX512 and the advantages of using it for program optimisation, an introduction to the baseline optimised HEVC decoder, and an analysis of the HEVC decoding kernels that can potentially benefit from the AVX512 SIMD ISAs. Section IV describes the implementation of the HEVC decoding optimisation using AVX512 SIMD ISAs.

In Section V, performance results are discussed. Finally, in Section VI, conclusions and discussions are presented.


2 State of the Art

In this section, background knowledge is introduced. First, an introduction to SIMD extensions is presented. Then, related work is discussed. Finally, an introduction to the HEVC decoder is given.

2.1 Introduction of SIMD Extensions

SIMD is shorthand for single instruction, multiple data, a class of parallel computers in Flynn's taxonomy[21]. A SIMD program performs the same operation on multiple data elements at the same time. This is data level parallelism, not concurrency: multiple data elements are processed at the same time, but only one operation is performed. Programs that expose data level parallelism can be significantly accelerated by SIMD. The concept of SIMD was first introduced in the early 1970s as a "vector" of data processed with a single instruction, although vector processing is nowadays considered separate from SIMD. Afterwards, SIMD was adopted in machines with many limited-functionality processors, such as the Thinking Machines CM-1 and CM-2, and the era of SIMD processors shifted from the desktop computer to the supercomputer. However, since desktop computers became powerful enough to support real-time gaming and audio/video processing during the 1990s, SIMD extensions returned to the desktop computer. The first widely used desktop SIMD extension, named MMX, was introduced in 1996. Afterwards, more and more SIMD extensions were introduced, such as NEON, SSE2, SSSE3, SSE4.1, AVX/AVX2 and AVX512 (recent SIMD extensions for desktop computers are listed in Table 2.1[8]). Recent SIMD extensions are implemented by partitioning each register into sub-words and applying the same instruction to all of them, which can yield significant performance gains. Two kinds of application can take particular advantage of SIMD: one is when the same value is added to (or subtracted from) a large number of data points; the other is when all the loaded data are processed in a single operation.


SIMD ISA    Base ISA     Vendor    Year  SIMD Registers
MAX[14]     PA-RISC      HP        1994  31x32b
VIS         SPARC        Sun       1995  32x64b
MAX-2       PA-RISC      HP        1995  32x64b
MVI         Alpha        DEC       1996  31x64b
MMX[16]     x86          Intel     1996  8x64b
MDMX        MIPS-V       MIPS      1996  32x64b
3DNow!      x86          AMD       1998  8x64b
Altivec[9]  PowerPC      Motorola  1998  32x128b
MIPS-3D     MIPS-64      MIPS      1999  32x64b
SSE[20]     x86/x86-64   Intel     1999  8/16x128b
SSE2[17]    x86/x86-64   Intel     2000  8/16x128b
SSE3        x86/x86-64   Intel     2004  8/16x128b
NEON        ARMv7        ARM       2005  32x64b - 16x128b
SSSE3       x86/x86-64   Intel     2006  8/16x128b
SSE4        x86/x86-64   Intel     2007  8/16x128b
VSX         Power v2.06  IBM       2010  64x128b
AVX         x86/x86-64   Intel     2011  16x256b
XOP         x86/x86-64   AMD       2011  8/16x128b
AVX2        x86/x86-64   Intel     2013  16x256b
AVX-512     x86/x86-64   Intel     2016  32x512b

Table 2.1: SIMD extensions to general purpose processors[8]

Although SIMD is well suited to accelerating programs, it still has some limitations. First, not every program can be accelerated by SIMD, since not every program can be vectorised easily. Second, SIMD loads and stores are constrained, and the SIMD ISAs have different vector widths. Furthermore, accelerating a program with SIMD usually requires considerable human effort. Besides, some SIMD ISAs are more complicated to use than others because of, for example, restrictions on data alignment, and not all processors support all SIMD ISAs. Investigating the influence of a new SIMD extension on a program therefore still requires a lot of manual work. The next part presents related work on applying recent SIMD ISAs to the HEVC decoder (most of it collected by Chi et al.[8]). The results show that, despite these limitations of SIMD, the HEVC decoder can still be optimised considerably by applying


SIMD ISAs. This is also one of the motivations for this thesis: investigating the influence of the latest SIMD ISA, AVX512. The advantages of AVX512 are presented in Section 3.1.

2.2 Related Work

Optimising programs with SIMD extensions is quite common nowadays. Since the first SIMD extensions were introduced in the mid-1990s, different extensions such as SSE2, SSSE3, SSE4.1 and AVX2 have been used to accelerate video codecs such as MPEG-2, MPEG-4 Part 2 and, more recently, H.264/AVC and HEVC. A summary of works reporting SIMD optimisation for codecs before H.264/AVC can be found in [11]. H.264/AVC also benefited considerably from SIMD acceleration, including the luma and chroma interpolation filters, the inverse transform and the deblocking filter. For instance, Zhou et al.[23] and Chen et al.[7] reported that accelerating H.264/AVC with SSE2 yields speedups ranging from 2.0 to 4.0. Besides, Iverson et al.[10] reported real-time decoding of 720p content with SSE3 acceleration on a Pentium IV processor and with SSE2 acceleration on a low-power Pentium-M processor. Furthermore, some recent works have reported SIMD acceleration for HEVC decoding (collected by Chi et al.[8]): A decoder optimised with the SSE2 SIMD extensions was reported by L. Yan et al.[22]. The optimisation covers the luma and chroma interpolation filters, the adaptive loop filter, the deblocking filter and the inverse transform, with speedups of 6.08, 2.21, 5.21 and 2.98 for each kernel respectively, resulting in a speedup of 4.16 in total. In a real-time test on an Intel i5 processor running at 2.4 GHz, this decoder decodes 1080p videos at 27.7 to 44.4 fps depending on the content and bitrate. Bossen et al.[3] presented an optimised HEVC decoder using SSE4.1 and ARM NEON. In a real-time test with SSE4.1, the decoder processes more than 60 fps for 1080p video up to 7 Mbps on an Intel i7 processor running at a turbo frequency of 3.6 GHz. With ARM NEON, it decodes 480p videos at 30 fps at up to 2 Mbps on a Cortex-A9 processor running at 1.0 GHz, and 1080p sequences at 30 fps on an ARMv7 processor running at 1.3 GHz[2]. Another SSE4.1 optimisation was reported by Bross et al.[5]. In a real-time test they showed that their system processes, on average, 68.3 and 17.2 fps for 1080p and 2160p respectively on an Intel i7 processor working at 2.5 GHz, with overall speedups of 4.3 and 3.3 for 1080p and 2160p, respectively, on the same processor.


Application  Year  ISA     Resol.  fps
H.264[23]    2003  SSE2    480p    48
H.264[10]    2004  SSE2    720p    30
             2004  SSE3    720p    60
HEVC[22]     2012  SSE2    1080p   28-44
HEVC[3]      2012  NEON    480p    30
             2012  SSE4.1  1080p   60
HEVC[2]      2012  NEON    1080p   30
HEVC[5]      2013  SSE4.1  1080p   68.3
             2013  SSE4.1  2160p   17.3
HEVC[8]      2015  AVX2    1080p   133
             2015  AVX2    2160p   37.8

Table 2.2: SIMD extensions for desktop-computer processors

In 2015, Chi et al.[8] reported SIMD optimisation of the entire HEVC decoder for most of the major SIMD ISAs. The evaluation was performed on 14 mobile and PC platforms covering most major architectures released in recent years. With SIMD, up to 5x speedup can be achieved over the entire HEVC decoder, resulting in up to 133 fps and 37.8 fps on average on a single core for Main profile 1080p and Main10 profile 2160p sequences, respectively. Section 3.2 introduces this optimised HEVC decoder, based on recent SIMD extensions such as SSE2, SSSE3, SSE4.1 and AVX2, as reported by Chi et al.[8]; it serves as the baseline of this thesis. An overview of the related SIMD accelerations for video decoders is given in Table 2.2.

2.3 Introduction of HEVC

The High Efficiency Video Coding (HEVC) standard is the most recent joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardisation organisations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC)[6]. The video coding layer of HEVC employs the same hybrid approach (inter-/intra-picture prediction and 2-D transform coding) used in all video compression standards since H.261. In the following, we introduce some terminology and focus on the HEVC decoding process shown in Figure 2.1:


Figure 2.1: HEVC decoding process.

First, we discuss some block structure coding terminology. In the H.264 standard, the basic coding layer unit is the macroblock. A macroblock is 16x16 samples, consisting of a 16x16 luma block and, for the commonly used 4:2:0 sampling format, two 8x8 chroma blocks. The corresponding structure in HEVC is the coding tree unit (CTU). The size of the CTU is decided by the encoder; the maximum is 64x64 and the minimum is 16x16. For high-resolution video encoding, larger CTUs generally give better compression performance. Here we take the largest CTU size, 64x64, as an example to explain the HEVC coding tree structure. HEVC first divides the entire frame picture according to the CTU size; as shown in Figure 2.2, the size of each CTU is 64x64. One coding tree unit (CTU) contains three coding tree blocks (CTBs), namely one luma (Y) CTB and two chroma (Cb and Cr) CTBs, plus associated syntax elements. Since the CTU size can be 16x16, 32x32 or 64x64, the luma CTB size can also be 16x16, 32x32 or 64x64, and is always the same as the CTU size. Here the luma CTB size is 64x64 and the chroma CTB size is 32x32. A coding tree block (CTB) in HEVC can be used directly as one coding block (CB) or divided into smaller coding blocks in the form of a quadtree, so the coding block size can vary. When the CTU size is 64x64, the maximum luma CB size is 64x64 and the minimum is 8x8; for chroma CBs, the maximum is 32x32 and the minimum is 4x4. Larger CBs can greatly improve the coding efficiency for smooth regions, while smaller CBs make complex prediction of detailed regions more accurate. An image can be divided into several non-overlapping CTUs according to the CTU size. Within one CTU, based on the quadtree hierarchy, the coding units at the same


Figure 2.2: Coding Tree Unit

level have the same depth of segmentation. One CTU can contain either a single CU without being partitioned, or multiple CUs. Each CU contains prediction units (PUs) and transform units (TUs). For a 2Nx2N CU there are two intra-prediction unit sizes, 2Nx2N and NxN. For the inter-prediction mode there are 8 types: four symmetric ones, 2Nx2N, NxN, Nx2N and 2NxN, and four asymmetric ones, 2NxnU, 2NxnD, nLx2N and nRx2N (where U, D, L and R represent the four directions). nLx2N and nRx2N split the CU left to right in ratios of 1:3 and 3:1 respectively; 2NxnU and 2NxnD split it top to bottom in ratios of 1:3 and 3:1 respectively. There is one further type, the skip mode, for inter-prediction, in which the encoded motion information contains only the motion parameters and no residual information needs to be encoded. A prediction block comprises a luma prediction block, two chroma prediction blocks and associated syntax elements.


The transform unit (TU) is the unit that performs transformation and quantization independently, and its size varies flexibly: HEVC supports transforms from 4x4 up to 32x32. The size of the transform unit depends on the CU mode. Within one CU, TUs can span multiple PUs in the form of a quadtree partition. Larger TUs concentrate the energy better, while smaller TUs retain more detail and allow more flexible partitioning. One transform unit includes a luma transform block, two chroma transform blocks and associated syntax elements.

Next, we describe the HEVC decoding process shown in Figure 2.1:

a Entropy coding: The first step in the decoding process is to decode the syntax elements with different entropy coding algorithms. This contains three parts: network abstraction layer (NAL) unit parsing, slice header decoding and slice data decoding. The syntax structure of HEVC is placed into logical data packets called NAL units. Each NAL unit consists of a header and payload data. NAL units can be classified into video coding layer (VCL) units, containing the video picture sample data, and non-VCL units, containing associated additional information such as parameter sets and supplemental enhancement information. During the HEVC decoding process, a frame is divided into slices, slice segments and tiles, where a slice consists of slice segments. In HEVC, one frame picture can be divided into many slices, or one slice can be the entire frame picture. The encoded picture information is stored in the slices. The slice header contains general information about the picture, such as the prediction type of the current picture, the QP level and so on. The slice payload contains the prediction and residual information of the picture. One of the most important features of the slice is independence: each slice carries independent information, so that when one slice is lost the other slices are not affected and error propagation is reduced significantly. HEVC uses wavefront parallel processing (WPP) and tiles to make decoding more suitable for parallel processing, which means slices can be processed in parallel. In addition, slices are divided into many CTUs, and WPP allows each row of CTUs to be processed in parallel.

b Reconstruction: This step consists of inverse quantization (IQ) and inverse transform (IT). The decoding process of HEVC uses the discrete cosine transform (DCT). Although the DCT is not the best solution for data compression, it concentrates the spatial frequency amplitude in the low frequencies, which makes quantization easy. The purpose of quantization is to reduce the accuracy of the DCT coefficients, thereby increasing the compression ratio. As the quantization will cause

11 2 State of the Art

errors, HEVC uses intra-prediction to eliminate intra-frame redundancy and inter-prediction to eliminate inter-frame redundancy. Motion compensation has a similar effect: adjacent frames have many similarities, which is redundancy. Based on the prediction from the previous frame, motion compensation compensates the current frame to eliminate redundancy and improve the compression ratio.

c Deblocking Filter (DF): The DF and the Sample Adaptive Offset (SAO) in the next part operate on the reconstruction. Deblocking filtering consists of vertical boundary filtering and horizontal boundary filtering: first, the horizontal filter is applied to the vertical boundaries, and then the vertical filter is applied to the horizontal boundaries. Filtering that is too strong leads to excessive smoothing of image details, while insufficient filter strength reduces the image quality. Therefore, HEVC provides three filter strengths: 0, 1 and 2. If the strength is greater than 0, it has to be determined whether the current boundary needs filtering. Each boundary uses 8 samples around it for the filtering calculation, and four rows of pixels are used to decide whether filtering should be applied. There is a dependency between vertical and horizontal boundary filtering: the result of the horizontal boundary filtering is taken as the input of the vertical boundary filtering. However, the individual horizontal boundary filtering operations are independent of each other, as are the vertical ones, so they can be processed in parallel.

d Sample Adaptive Offset (SAO): SAO is a new feature of HEVC decoding and one of the important differences from H.264. SAO filtering is applied after the vertical filtering of the deblocking filter; it is not independent of the deblocking filter. The purpose of SAO is to eliminate the ringing and blocking artefacts caused by the sampling difference. By analysing the relationship between the deblocked data and the original image, SAO compensates the deblocked data to make it as close to the original image as possible. SAO uses different offsets depending on the class of the sample and the region. There are two types of offset: band offset and edge offset. The band offset does not need to refer to the information of adjacent pixels when adding the offset to the reconstructed pixels, so there is no data dependency, while the edge offset requires reference to the information of the eight neighbouring pixels.

These are the general HEVC decoding features and process. This thesis focuses on using the AVX512 SIMD extensions to optimise HEVC decoding. Section 4 presents the details of how the target kernels are optimised with the AVX512 SIMD ISAs.

3 Theoretical Foundations

In this section, the theoretical foundations are presented. First, an overview of the AVX512 SIMD extensions and their advantages is given. Then, the optimised HEVC decoder based on recent SIMD extensions, reported by Chi et al.[8], is presented. Based on the experimental results of this decoder, the kernels that can potentially benefit from the AVX512 SIMD ISAs are analysed; the details of the optimisation of these kernels are presented in the next section.

3.1 Overview of AVX512 SIMD Extensions

Intel has announced a new SIMD extension, the Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions, which support 512-bit SIMD operations. Using AVX512, a program can pack eight double-precision or sixteen single-precision floating-point numbers, or eight 64-bit or sixteen 32-bit integers, into a single 512-bit vector. AVX512 can therefore process four times as many data elements per instruction as SSE and twice as many as AVX/AVX2, making it one of the most powerful SIMD ISAs. It also offers good compatibility with AVX/AVX2, which makes it considerably stronger than the previous SIMD ISAs: whereas SSE and AVX cannot easily be mixed without performance penalties, AVX512 and AVX/AVX2 can be mixed freely. This is achieved by aliasing the AVX512 registers ZMM0-ZMM15 onto the AVX registers YMM0-YMM15, so that AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers when AVX512 is used in a program. Besides, AVX512 provides higher computational performance through a number of features: 32 vector registers, each 512 bits wide; eight dedicated mask registers; 512-bit operations on packed floating-point or packed integer data; embedded rounding control (overriding global settings); embedded broadcast; embedded floating-point and memory fault suppression; new operations; additional gather/scatter support; high-speed math instructions; compact representation of large displacement values; and optional capabilities beyond the foundation instructions.

To estimate the future performance advantage of AVX512, Sisoftware[1] compared two released public CPUs with their projected next-generation counterparts supporting AVX512, as shown in Table 3.1[1] (no major changes are expected in future AVX512-supporting architectures). By analysing this comparison, we can get a general picture of the advantages of the AVX512 SIMD extensions.

Processor: Intel i7-6700K (Skylake) | Projected next-gen | Intel i7-5820K (Haswell-E) | Projected next-gen (Skylake-E)
Cores/Threads: 4C/8T | 4C/8T | 6C/12T | 6C/12T
Clock Speeds (MHz) Min-Max-Turbo: 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same
Caches L1/L2/L3: 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same
Power TDP Rating (W): 91W | assumed same | 140W | assumed same
Instruction Set Support: AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc.

Table 3.1: Comparison among two released public CPUs with their projected next-gen counterparts supporting AVX512[1].

First, the multi-media (Fractal Generation) benchmark results (SIMD Native) for Core-i7 (4C/8T AVX512) Projected, Core i7-6700K (4C/8T AVX2/FMA), Core i7-6700K (4C/8T SSEx), Future Core i7-E (6C/12T AVX512) Projected, Core i7-5820K (6C/12T AVX2/FMA), and Core i7-5820K (6C/12T SSEx) are listed in Table 3.2[1].

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Integer SIMD (Mpix/s): 912.5 (+76% over AVX) | 516.2 (+76% over SSE) | 292 | 1020.7 (+76% over AVX) | 577.4 (+76% over SSE) | 327
Long SIMD (Mpix/s): 315.3 (+66% over AVX) | 190.1 (+66% over SSE) | 114.6 | 284.3 (+66% over AVX) | 171.4 (+66% over SSE) | 87.6
Single Float SIMD (Mpix/s): 916.8 (+2x over AVX) | 458.4 (+2.12x over SSE) | 216 | 1079 (+2x over AVX) | 539.5 (+2.12x over SSE) | 234.8
Double Float SIMD (Mpix/s): 545.8 (+2x over AVX) | 272.9 (+2.35x over SSE) | 116.1 | 622.4 (+2x over AVX) | 311.2 (+2.35x over SSE) | 126
Quad Float SIMD (Mpix/s): 20.3 (+94% over AVX) | 10.5 (+94% over SSE) | 5.4 | 622.4 (+94% over AVX) | 311.2 (+94% over SSE) | 126

Table 3.2: Multi-media (Fractal Generation) Benchmark: SIMD Native[1].

From these results, we can see that for integer and long-integer SIMD, AVX2 achieves around 76% and 66% improvement over SSE, respectively; for AVX512, we can expect a further improvement of more than 80%. For single-float SIMD, the improvement from SSE to AVX/AVX2 is over 2x, and although AVX512 adds somewhat less, it can still reach around 100%; for double and quad float, a further 2x can likewise be expected. With AVX512, the gap between Skylake-E and current GPGPUs can thus be narrowed.

Next, Table 3.3[1] shows the cryptography results. For hashing SHA2-256 and SHA1, there is a 2x improvement from SSE to AVX/AVX2, and a similar further improvement can be expected for AVX512 in both cases, although performance may be limited by memory bandwidth. The hashing SHA2-512 case is quite similar. With hashing showing even better results than fractal generation, we can expect AVX512 to improve performance by at least 100%, since the AVX2 improvement over SSE is already more than 2x.

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Hashing SHA2-256 (GB/s): 11.80 (+2x over AVX) | 5.90 (+2.36x over SSE) | 2.50 | 13.60 (+2x over AVX) | 6.80 (+2.26x over SSE) | 3
Hashing SHA1 (GB/s): 23 (+2x over AVX) | 11.5 (+2.16x over SSE) | 5.33 | 27.70 (+2x over AVX) | 13.85 (+2.04x over SSE) | 6.79
Hashing SHA2-512 (GB/s): 8.74 (+2x over AVX) | 4.37 (+2.33x over SSE) | 1.87 | 9.60 (+2x over AVX) | 4.80 (+2.20x over SSE) | 2.18

Table 3.3: Multi-media (Fractal Generation) Benchmark: Cryptography Native[1].

Then, regarding cache and memory transfer, the results are listed in Table 3.4[1]. With DDR4 there is no obvious improvement at the memory level. Moving up to the cache level, however, L3 bandwidth improves by about 10% with AVX2 over SSE, and similarly with AVX512; for the L2 and L1 caches, the improvements with AVX512 are over 20% and even 40-50%, respectively. AVX512 thus takes advantage of the widened data ports in Skylake and future architectures, with the L1D cache showing the best bandwidth improvement, while memory bandwidth remains limited by DDR4 speeds.

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Memory Bandwidth (GB/s): 31.30 | 31.30 (0%) | 31.30 | 42.00 (0%) | 42.30 (-1%) | 42.6
L3 Bandwidth (GB/s): 267.97 (+10%) | 243.30 (+10%) | 220.90 | 202.20 (+3%) | 195.90 (+3%) | 189.8
L2 Bandwidth (GB/s): 392.50 (+21%) | 323.30 (+21%) | 266.30 | 536.81 (+20%) | 444.10 (+20%) | 367.4
L1D Bandwidth (GB/s): 1,364.25 (+50%) | 909.50 (+2.11x) | 429.90 | 1,536.00 (+50%) | 1,024.00 (+2x) | 518

Table 3.4: Multi-media (Fractal Generation) Benchmark: Cache & Memory Transfer[1].

Intel plans to first ship AVX512 in the Intel Xeon Phi processor (coprocessor known by the code name Knights Landing). Since real products are not available yet, Intel has extended the Intel Software Development Emulator (Intel SDE) to support Intel AVX-512 and help with testing. The experiments and measurements in this thesis are therefore based on results obtained with the SDE. Owing to the limitations of the SDE, only the total number of executed instructions for the entire HEVC decoding process can be obtained in this thesis; the execution time, IPC and per-stage instruction counts are not available. The details of the experiments and measurement results are presented in Section 5.

3.2 Baseline Optimised HEVC Decoder

As discussed in Section 2.2, Chi et al.[8] presented an optimised HEVC decoder in 2015 that implements SIMD for all HEVC processing steps except bitstream parsing: inter prediction, intra prediction, inverse transform, deblocking filter, SAO filter, and various memory-movement operations. This SIMD optimisation covers almost all major SIMD ISAs, including NEON, SSE2, SSSE3, SSE4.1, XOP, and AVX2. The evaluation was performed on 14 mobile and PC platforms covering most major architectures released in recent years. With SIMD, up to a 5x speedup can be achieved over the entire HEVC decoder, resulting in up to 133 fps and 37.8 fps on average on a single core for Main profile 1080p and Main10 profile 2160p sequences, respectively. This thesis uses this optimised HEVC decoder as the baseline and optimises it further with AVX512 SIMD extensions; by comparing the performance of the AVX512-optimised decoder with the baseline, the advantage of applying AVX512 to the HEVC decoder is presented in Section 5. In this section, the features and performance of the baseline HEVC decoder are presented. Based on these results, the kernels of the HEVC decoder that can potentially benefit from the AVX512 SIMD extensions are identified; these kernels are then optimised with AVX512, and the implementation details are presented in Section 4. We now turn to the performance results of the baseline HEVC decoder.


First, the impact of ISA and architecture is shown in Table 3.5 (the runtimes of all videos and QPs are averaged for each resolution). Applying SIMD to the HEVC decoder clearly reduces the instruction count significantly: compared to scalar code, the instruction count is reduced by 4.8x to 8.1x, depending on the ISA and architecture. The instruction counts are reduced for both 1080p 8-bit and 2160p 10-bit; in the case of the 10-bit video sets, fewer scalar instructions are replaced by SIMD instructions. For the 8-bit video sets, the NEON implementation on the ARMv7 architecture executes fewer instructions per frame than SSE2 on x86, but more than the other x86 SIMD implementations; for the 10-bit video sets, the NEON implementation always executes more instructions per frame than the x86 architectures. Among the x86 SIMD implementations, the 64-bit architecture generally executes fewer instructions per frame than the 32-bit architecture. For both 32-bit and 64-bit, the instructions per frame decrease from SSE2 to AVX2, with a jump in the reduction from XOP to AVX2.

             scalar   neon   sse2  ssse3  sse4.1    avx    xop   avx2
1080p 8-bit
armv7         319.1   76.5      -      -       -      -      -      -
x86           385.4      -   79.1   68.5    67.9   62.7   60.9   49.4
x86-64        344.7      -   72.0   61.2    60.6   55.9   54.0   42.8
2160p 10-bit
armv7          1085  299.9      -      -       -      -      -      -
x86            1250      -  256.2  243.8   242.3  223.2  210.2  159.7
x86-64         1127      -  232.9  219.3   218.2  198.9  184.3  138.9

Table 3.5: Average executed instructions per frame [minst/frame][8].

Second, Figure 3.1 shows the normalised performance and instructions per cycle (IPC) for 1080p and 2160p, respectively. After applying SIMD to the decoding process, the performance is much better than scalar, with AVX2 performing best. For the IPC, it can be observed that the SIMD implementations always have a lower IPC than the scalar implementation, with AVX2 again the lowest.

Figure 3.1: Single threaded performance at normalized frequency and IPC results of baseline HEVC decoder[8].

Then, Table 3.6 presents the speedup per stage for the different SIMD levels. The AVX2 implementation has the best overall speedup, up to 4.95x for 1080p 8-bit videos. The kernels that improve the most are inter prediction and SAO, which speed up by 10.33x and 11.18x, respectively, with AVX2 for 1080p 8-bit videos. The inverse transform and deblocking filter kernels benefit less than the aforementioned kernels but still improve significantly, with speedups of up to 3.69x and 3.44x, respectively, for 1080p 8-bit videos. For 2160p 10-bit videos the conclusions are similar, except that the SAO kernel does not benefit as much as for the 1080p 8-bit video sets; the other three kernels, inter prediction, inverse transform and deblocking filter, are still improved considerably.

             Intra   Inter     IT     DF    SAO  Other  Overall
1080p 8-bit
sse2          1.23    6.03   2.68   2.80   4.03   1.24     3.67
ssse3         1.30    7.16   2.87   3.00   7.82   1.24     4.14
sse4.1        1.31    7.12   2.98   3.00   8.09   1.24     4.15
avx           1.33    7.58   3.23   3.04   8.26   1.24     4.29
avx2          1.33   10.33   3.69   3.44  11.18   1.18     4.95
2160p 10-bit
sse2          1.53    5.28   3.34   3.25   3.61   1.16     3.59
ssse3         1.63    5.53   3.34   3.41   4.22   1.16     3.74
sse4.1        1.64    5.51   3.42   3.45   4.27   1.16     3.75
avx           1.65    5.74   3.74   3.49   4.26   1.16     3.85
avx2          1.67    8.03   4.68   4.05   4.75   1.09     4.60

Table 3.6: Speedup per stage on Haswell for different SIMD levels[8].

Afterwards, Figure 3.2 presents the execution profile of the HEVC decoder. Based on this figure, the portions of execution time spent on parsing the side information and coefficients (PSide, PCoeff), intra prediction (Intra), and Other increase when SIMD is used, because SIMD has limited impact on these parts; SIMD is therefore not implemented for them. The figure also shows that inter prediction, inverse transform and deblocking filter are the kernels that benefit significantly from SIMD ISAs.

Figure 3.2: Decoder execution time breakdown on Haswell for scalar, sse2, and avx2. The PSIDE and PCOEF stages represent the side information and coefficient parsing (CABAC entropy decoding). The PSIDE includes also the interpretation of the syntax elements and coding tree unit traversal[8].

From the above results, it is clear that AVX2 performs best of all the SIMD ISAs used in [8], and, according to the previous section and the advantages of AVX512 over AVX2, even better performance can be expected when the HEVC decoding process is optimised with AVX512 SIMD extensions. This is why the AVX512 SIMD extensions are used to optimise HEVC decoding in this thesis. Among the kernels of the HEVC decoder, inter prediction and SAO clearly benefit the most from SIMD optimisation, and IT and DF are also improved considerably. However, considering the portion of execution time of the different kernels, SAO takes much less execution time than the other three kernels, so it is not optimised with AVX512 SIMD in this thesis. Therefore, the AVX512 SIMD extensions are applied to the inter prediction, inverse transform and deblocking filter kernels of the HEVC decoder.


4 Implementation

In this section, we present how the HEVC decoding is optimised using AVX512 SIMD ISAs. As discussed in Section 3.2, we optimise the inter prediction, inverse transform and deblocking filter kernels. In this thesis, we use AVX512F, AVX512BW and AVX512DQ instructions. The details are given below.

4.1 Inter prediction

In general, inter prediction is the most time-consuming step in HEVC decoding[3]. In this thesis, AVX512 SIMD is used to optimise the luma and chroma, horizontal and vertical interpolation functions for 8-bit and 10-bit input in the inter prediction kernel. Interpolation is needed when the horizontal or vertical motion vector component points to a fractional position, and is performed in HEVC with a 7/8-tap FIR filter whose coefficients are listed in Table 4.1. HEVC supports 8-bit and 10-bit unsigned input sample values. For the interpolation, the filter gain lies between -22 and +88, which adds 8 bits to the intermediate precision; therefore, for both 8-bit and 10-bit input samples, a 16-bit intermediate precision is needed. Since AVX512 SIMD ISAs are 512 bits wide, only blocks of width 32 can benefit from the AVX512-optimised interpolation. The details of the AVX512 optimisation are as follows. First, we discuss how the luma horizontal interpolation for 10-bit input is optimised with AVX512 SIMD ISAs in the inter prediction kernel (the 8-bit horizontal interpolation is implemented in a similar way). Figure 4.1 shows that the input samples are first loaded into three 512-bit vectors with the VMOVDQU32 instruction, each vector containing 32 input samples. These three vectors are then shuffled into two groups (Group A and Group B) of intermediate 512-bit vectors using VPSHUFB instructions (in each group, a vector drawn in the same colour as a source vector is shuffled from that source vector, as the arrows show).


position     C0   C1   C2   C3   C4   C5   C6   C7
0             0    0    0   64    0    0    0    0
0.25         -1    4  -10   58   17   -5    1    0
0.5          -1    4  -11   40   40  -11    4   -1
0.75          0    1   -5   17   58  -10    4   -1

Table 4.1: Coefficients for fractional positions for the HEVC interpolation filter.
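To make the arithmetic concrete, here is a scalar sketch of the per-sample computation that each 16-bit lane of the AVX512 interpolation performs, using the half-sample (0.5) taps from Table 4.1; the function name and layout are illustrative, not the decoder's actual code.

```c
#include <stdint.h>

/* Half-sample (position 0.5) luma filter taps from Table 4.1. */
static const int16_t qfilter[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

/* Scalar reference for one interpolated sample.  `stride` is 1 for
 * horizontal filtering and the line stride for vertical filtering; the
 * vectorised version computes 32 of these sums per instruction.  For 10-bit
 * input the products fit in 16-bit lanes and the sum in 32 bits. */
static int32_t interp_sample(const uint16_t *src, int stride)
{
    int32_t sum = 0;
    for (int k = 0; k < 8; k++)
        sum += qfilter[k] * (int32_t)src[(k - 3) * stride];
    return sum; /* intermediate value before shift/round/clip */
}
```

Since the taps sum to 64, a flat region is scaled by 64, which is what the subsequent shift and rounding steps undo before saturating and packing the result.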

Figure 4.1: Horizontal luminance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (1)

Then, as Figure 4.2 shows, we multiply the vectors in each group by the coefficients (the coefficients C0 and C1 from Table 4.1 are loaded into a 512-bit vector vcoeff0, and the other coefficients are loaded in a similar way) and add the results to obtain the intermediate results for each group. We then apply shift and rounding operations to each intermediate result. Finally, we saturate and pack the two intermediate results into one 512-bit vector with VPACKSSDW to obtain the final result, and loop these operations to process all input samples. The chroma horizontal interpolation is implemented very similarly to the luma optimisation; the only difference is that the number of intermediate results in groups A and B, as shown in Figure 4.1, is two instead of four.

Figure 4.2: Horizontal luminance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs(2)

Second, we describe how to optimise the chroma vertical interpolation, again for 10-bit input. As Figure 4.3 shows, we first load the input samples into four 512-bit vectors. In vertical interpolation, the calculation uses these four vectors directly, without shuffling. Then we pack the input vectors into intermediate vectors and multiply them by the coefficients from Table 4.1, as shown in Figure 4.4. Afterwards, we add the intermediate results and apply the shift and rounding operations to obtain the final results, similarly to the horizontal optimisation, and loop this process over all input samples. The luma vertical interpolation is optimised in the same way; the difference is that the number of input vectors is eight instead of four.

Figure 4.3: Vertical chrominance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (1)

4.2 Inverse Transform

As discussed in Section 3.2, the inverse transform kernel benefits noticeably when SIMD is applied to the HEVC decoding process, so we optimise this kernel with AVX512 SIMD as well. Since the block size of the inverse transform is up to 32x32 and AVX512 SIMD ISAs have a 512-bit vector size, we implement the AVX512 SIMD optimisation only for 32x32 TBs.

Figure 4.4: Vertical chrominance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (2)

Figure 4.5 shows the 1-D transform for 32x32 TBs[8]. Basically, the computation of a 1-D transform consists of a series of matrix-vector multiplications followed by adding/subtracting the partial results. A 2-D inverse transform is typically performed as two 1-D transforms, first on the columns and then on the rows of the transform block. Since the columns are independent of each other, SIMD is well suited to optimising this kernel. The AVX512 implementation performs a process similar to Figure 4.5, except that every element of the input vector x is no longer a single integer but a 512-bit vector. For example, on the right-hand side of Figure 4.5, the odd-position input elements of x are multiplied by a 16x16 matrix; in the AVX512 implementation, we multiply odd-position 512-bit input vectors by the 16x16 matrix. The other parts perform similar operations. We take this part as an example to explain in detail how the input elements are loaded and multiplied by the matrix.

Figure 4.5: HEVC 1D inverse transform for 32x32 TBs[8].

First, we pack the odd-position inputs two by two. For example, the first two odd-position input vectors (X1 and X3) are packed as shown in Figure 4.6 into two intermediate vectors (X13LOW and X13HIGH) using the VPUNPCKLWD and VPUNPCKHWD instructions.

Figure 4.6: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (1).

Then we multiply these two intermediate vectors by a new matrix, shown on the right-hand side of Figure 4.7. To be exact, this new 16x16 matrix is obtained by transposing the 1-D inverse transform 16x16 matrix and packing each pair of elements in a row in the way shown in Figure 4.7; the purpose of pre-reorganising the matrix is to simplify the calculation. We then broadcast every two elements of this new 16x16 matrix (the elements inside one circle in Figure 4.7) into a 512-bit vector using VPBROADCASTD, and multiply this vector by the two intermediate input vectors, respectively. Afterwards, by looping this process over the new 16x16 matrix, column first and then row, we obtain the partial results for the odd-position inputs. The partial results for the other positions are obtained in a similar way. The difference between the 1-D transform and our AVX512 implementation is that in our implementation there are two partial results for each position, one "low" and one "high", since the input is packed into "low" and "high" vectors as shown in Figure 4.6. The method of combining these partial results therefore also differs from the scalar 1-D inverse transform, as shown in Figure 4.8 (the add/subtract computation is the same as in Figure 4.5, but the last step differs). Finally, we transpose the 32x32 vectors obtained in Figure 4.8 and loop this process for the entire inverse transform.

Figure 4.7: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (2).
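The even/odd add/subtract structure described above can be illustrated with a scalar sketch of the small 4-point case (the 32x32 transform optimised here follows the same pattern with larger matrices, and the AVX512 version replaces each scalar input with a 512-bit vector). This is an illustrative sketch under those assumptions, not the thesis code, and the final clipping stage is omitted.

```c
#include <stdint.h>

/* Scalar sketch of HEVC's even/odd (partial butterfly) 1-D inverse
 * transform, shown for the 4-point case: the even part is computed from
 * the even-position inputs, the odd part from the odd-position inputs,
 * and the outputs are formed by adding/subtracting the two. */
static void inv_transform4(const int16_t src[4], int16_t dst[4], int shift)
{
    const int add = 1 << (shift - 1);        /* rounding offset */
    /* even part: rows {64,64} and {64,-64} applied to src[0], src[2] */
    int e0 = 64 * src[0] + 64 * src[2];
    int e1 = 64 * src[0] - 64 * src[2];
    /* odd part: rows {83,36} and {36,-83} applied to src[1], src[3] */
    int o0 = 83 * src[1] + 36 * src[3];
    int o1 = 36 * src[1] - 83 * src[3];
    /* combine: first half is e+o, second (reversed) half is e-o */
    dst[0] = (int16_t)((e0 + o0 + add) >> shift);
    dst[1] = (int16_t)((e1 + o1 + add) >> shift);
    dst[2] = (int16_t)((e1 - o1 + add) >> shift);
    dst[3] = (int16_t)((e0 - o0 + add) >> shift);
}
```

Because the odd part reuses the same two products for the first and last outputs (with a sign flip), SIMD can compute the matrix-vector products once and form both halves by a single add and subtract, which is exactly what the AVX512 implementation does across 512-bit columns.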

4.3 Deblocking Filter

In the HEVC decoding process, the deblocking filter also benefits considerably from SIMD. In this part, we explain how to optimise this kernel with AVX512 SIMD ISAs. The deblocking filter can be optimised by processing multiple edge parts simultaneously[8], as shown in Figure 4.9. Since the computation precision of the deblocking filter is within 16 bits (for bit depths below 11), 32 16-bit computation lanes are available, so 16 edge parts can be processed simultaneously with AVX512 SIMD ISAs. For this kernel, we optimise the luma filter with AVX512 SIMD ISAs.

Figure 4.8: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (3).

Figure 4.9: Samples of one edge part around a block edge involved in the deblocking filter.

First, when the boundary strength is larger than 0, the samples of 16 edge parts are loaded for the horizontal edge filter (for the vertical edge filter, these samples are transposed after loading). When optimising the deblocking filter kernel with AVX512 SIMD ISAs, we found that only the first and last samples of each edge part are needed for the filter decision. Therefore, after loading all samples of the 16 edge parts, we swizzle their first and last samples (shown in red in Figure 4.9) into vectors as shown in Figure 4.10. Figure 4.10 shows the loading and swizzling process for the P3 samples of the 16 edge parts in Figure 4.9; we repeat this process for all edge-part samples from P3 to Q3.

Figure 4.10: Swizzle first and last samples of 16 edge parts into a vector.

Afterwards, for each edge part, we evaluate the following filtering expressions 4.1 to 4.7[8] for the first and last samples. Using AVX512 SIMD ISAs, these expressions can be evaluated for 16 edge parts at a time, as mentioned before.

dp0 = |p2,0 − 2p1,0 + p0,0|   (4.1)
dp3 = |p2,3 − 2p1,3 + p0,3|   (4.2)
dq0 = |q2,0 − 2q1,0 + q0,0|   (4.3)
dq3 = |q2,3 − 2q1,3 + q0,3|   (4.4)
dpq0 = dp0 + dq0   (4.5)
dpq3 = dp3 + dq3   (4.6)
filter = dpq0 + dpq3 < β   (4.7)

where px,y and qx,y are the samples indicated in red in Figure 4.9, and the value of β depends on the QP of the p and q samples. Then, if the filter decision evaluates to true, we use the same swizzled results to decide whether the filter is strong or normal, using the following expressions[8]:

strong0 = (|p3,0 − p0,0| + |q0,0 − q3,0| < β/8) ∧ (dpq0 < β/8) ∧ (|p0,0 − q0,0| < 2.5tc)   (4.8)
strong3 = (|p3,3 − p0,3| + |q0,3 − q3,3| < β/8) ∧ (dpq3 < β/8) ∧ (|p0,3 − q0,3| < 2.5tc)   (4.9)
strong = strong0 ∧ strong3   (4.10)
normal = ¬strong   (4.11)
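A scalar sketch of the filter decision of Equations 4.1-4.7 for one edge part may help; the array layout and names are illustrative, and the AVX512 code evaluates this for 16 edge parts in parallel using the swizzled first and last rows (y = 0 and y = 3).

```c
#include <stdlib.h>

/* Scalar sketch of the deblocking filter decision (Equations 4.1-4.7)
 * for one edge part.  p[x][y] and q[x][y] hold the samples on either
 * side of the edge, x being the distance from the edge and y the row. */
static int filter_decision(int p[4][4], int q[4][4], int beta)
{
    int dp0 = abs(p[2][0] - 2 * p[1][0] + p[0][0]);   /* (4.1) */
    int dp3 = abs(p[2][3] - 2 * p[1][3] + p[0][3]);   /* (4.2) */
    int dq0 = abs(q[2][0] - 2 * q[1][0] + q[0][0]);   /* (4.3) */
    int dq3 = abs(q[2][3] - 2 * q[1][3] + q[0][3]);   /* (4.4) */
    int dpq0 = dp0 + dq0;                             /* (4.5) */
    int dpq3 = dp3 + dq3;                             /* (4.6) */
    return dpq0 + dpq3 < beta;                        /* (4.7) */
}
```

The second differences dp and dq measure local curvature, so a flat block passes the test (and gets filtered), while a genuinely textured block fails it and is left alone.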

The results we obtain are a strong/normal mask, a normal second-sample mask, and a lossless mask. Then we switch back to the original input samples (Luma0 and Luma1 in Figure 4.10) and apply the actual filter. When applying the actual filter with AVX512 SIMD ISAs, we process 8 edge parts at once. According to the filter decision, we apply the strong, normal or lossless filter to the input samples. The entire process is shown in Figure 4.11; finally, we loop this process over all input samples.

Figure 4.11: Luminance filter process in deblocking filter kernel.
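For the normal (weak) filter branch, the HEVC specification applies a clipped correction Δ to the two samples nearest the edge. The following scalar sketch follows the specification's formula rather than the thesis code; the AVX512 version applies it to the samples of 8 edge parts at once under the decision masks described above.

```c
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

/* Scalar sketch of the HEVC normal (weak) luma filter for one sample
 * pair across the edge: p0/q0 are the edge samples, p1/q1 their
 * neighbours, tc the clipping threshold, max_val the sample maximum. */
static void normal_filter(int16_t *p0, int16_t *q0,
                          int16_t p1, int16_t q1, int tc, int max_val)
{
    int delta = (9 * (*q0 - *p0) - 3 * (q1 - p1) + 8) >> 4;
    if (abs(delta) >= 10 * tc)
        return;                       /* step too large: leave unfiltered */
    delta = clip3(-tc, tc, delta);
    *p0 = (int16_t)clip3(0, max_val, *p0 + delta);
    *q0 = (int16_t)clip3(0, max_val, *q0 - delta);
}
```

The tc-bounded clipping is what the lossless mask must override: for lossless-coded blocks the samples are left untouched rather than corrected.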


5 Results

In this thesis, the HEVC decoder implemented by Chi et al.[8] in 2015 is used as the baseline. The operating system is the Ubuntu 15.10 distribution with Linux kernel 4.2, and the program is compiled with GCC 5.2 at -O3 optimisation. Two video test sets were selected for the experiments: one includes three 1080p 8-bit sequences from the JCT-VC test set, and the other includes two 2160p 10-bit sequences[4]. All videos were encoded with the HM-8.0 reference encoder using 4 QP points (22, 27, 32, 37). The 1080p sequences were encoded with the random access main (8-bit) configuration and the 2160p sequences with the random access main10 (10-bit) configuration.

Figure 5.1: Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementations compared with the scalar program for 1080p 8-bit resolution video sets.

Hardware supporting the AVX512 SIMD ISAs is not available yet; as mentioned in Section 3.1, the Intel Software Development Emulator (Intel SDE), version 7.49, has been extended to support the AVX512 SIMD extensions. Therefore, in this thesis, we use the SDE to run our optimised HEVC decoder and obtain the instruction-mix histograms used to present our experimental results. However, since this emulator is aimed at debugging, the available measurements are quite limited (for example, IPC counters, execution time, and per-stage measurements are not available in the SDE). Based on the capabilities of the emulator, the results therefore focus on the number of executed instructions. In the following, we mainly discuss by what percentage the number of executed instructions is reduced compared to the AVX2-optimised HEVC decoder and to the scalar HEVC decoder.

A Overall optimisation performance:

* Compared with the scalar HEVC decoder: Figures 5.1 and 5.2 show the average percentage of reduced executed instructions with AVX2 and AVX512 compared with the scalar program for the 1080p 8-bit and 2160p 10-bit resolution video sets, respectively.

Figure 5.2: Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementations compared with the scalar program for 2160p 10-bit resolution video sets.

From these two figures, we can see that, in general, with either AVX2 or AVX512, the number of executed instructions is reduced by more than 80% compared with scalar HEVC decoding. We can also see that the average percentage of reduced executed instructions increases with the QP level; for both 8-bit and 10-bit videos, at QP 37 the reduction can even exceed 90%. Based on Figures 5.1 and 5.2, we can also see that optimising the HEVC decoding with AVX512 reduces a higher average percentage of executed instructions than AVX2. The comparison between AVX2 and AVX512 is presented in the next part.

Figure 5.3: Average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit and 2160p 10-bit resolution video sets.

* Compared with the HEVC decoder optimised by AVX2 SIMD ISAs: Figure 5.3 shows that optimising the HEVC decoding with AVX512 SIMD ISAs reduces the average number of executed instructions by up to 23% for 1080p 8-bit videos and 31% for 2160p 10-bit videos. Considering the different QP levels, the average percentage of reduced executed instructions again increases with the QP level: at high QP the optimisation is much more pronounced. Therefore, in the next part, we take the highest-QP (37) video sets as the reference test sets to show the distribution of the number of executed instructions when AVX2 and AVX512 SIMD ISAs are applied to the HEVC decoding, and we investigate how this distribution changes between the AVX2 and AVX512 optimisations.

B Proportion distribution of the number of executed instructions:

In Figure 5.4 and Figure 5.5, we can see that the distributions change when the HEVC decoding is optimised with AVX512 SIMD ISAs instead of AVX2: the AVX512 instructions now form the main part of the SIMD instructions. The total number of executed instructions is reduced because an AVX512 register is twice as wide as an AVX2 register, so twice as many data elements can be processed by one instruction, which is consistent with the previous part.

Figure 5.4: Proportion distribution of the execution instructions’ number when applying AVX2 and AVX512 to decode the 1080p 8bit resolution video sets.

However, these two figures also show that some AVX2 instructions remain in the HEVC decoding process. This is because only the most beneficial kernels are optimised with AVX512 SIMD ISAs in this thesis. Since the AVX512 SIMD extensions double the register size compared to AVX2, loading and storing data from memory for AVX512 instructions takes more time; a kernel that executes few instructions and consumes little execution time would not benefit noticeably from AVX512, so we do not optimise all kernels with AVX512 SIMD ISAs. This is also why some AVX2 instructions remain.

Figure 5.5: Proportion distribution of the execution instructions’ number when applying AVX2 and AVX512 to decode the 2160p 10bit resolution video sets.

C Per-stage performance:

We have discussed the overall performance of the AVX512 SIMD optimisation. In this part, we discuss the per-stage results. Figure 5.6 and Figure 5.7 show that, in general, the average percentage of executed instructions reduced by the AVX512 optimisation compared with the AVX2 optimisation increases with the QP level for both the 1080p 8-bit and the 2160p 10-bit video sets, which is consistent with part A of this section. The distribution figures show that the interpolation and deblocking filters behave similarly: both are more effective when the QP level is high. The inverse transform, however, behaves differently; based on the results, it is more effective when the QP level is low. This is because the coefficient scan pattern of HEVC concentrates the coefficients in the top-left corner of each 32x32 TB, and the rest of the transform block often contains many zeros, especially at high QP levels. Since there are more non-zero transform coefficients at low QP levels, the inverse transform benefits more when the QP level is low. Ideally, the IPC, the execution time, and a system profile should also be included. However, because of the limitations of the SDE, these results are not available; such measurements can only be performed once a real product containing the AVX512 SIMD ISAs becomes available.

Figure 5.6: Proportion distribution of the average percentage of executed instructions reduced by the AVX512 program compared with the AVX2 program for the 1080p 8-bit video sets, for each stage.

Figure 5.7: Proportion distribution of the average percentage of executed instructions reduced by the AVX512 program compared with the AVX2 program for the 2160p 10-bit video sets, for each stage.
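The QP dependence of the inverse transform can be illustrated with a small sketch. A decoder can skip inverse-transform work for coefficient regions that are entirely zero, which happens more often at high QP. The helper below is a hypothetical scalar illustration of that test (a real decoder would use the coded-sub-block flags from the bitstream rather than scanning the coefficients):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if any coefficient in a w x h region of a transform block
   is non-zero, 0 otherwise. `stride` is the width of the full block.
   At high QP, most regions outside the top-left corner are all zero,
   so the inverse transform for them can be skipped entirely. */
int region_has_coeffs(const int16_t *coeff, size_t stride,
                      size_t w, size_t h) {
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            if (coeff[y * stride + x] != 0)
                return 1;
    return 0;
}
```

For a typical high-QP 32x32 TB, only the top-left region reports coefficients, so the SIMD inverse transform has little non-zero data to accelerate, which matches the per-stage results above.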


6 Conclusions and Discussions

As with previous SIMD optimisations of HEVC decoding, the AVX512 SIMD extensions also improve HEVC decoding. Compared to the AVX2 SIMD optimisation, the number of executed instructions can be reduced by up to 23% for the 1080p 8-bit video sets and by up to 31% for the 2160p 10-bit video sets. Meanwhile, compared to scalar HEVC decoding, 85% and 83% of the executed instructions can be eliminated for the 1080p 8-bit and 2160p 10-bit video sets, respectively. The acceleration solutions in this thesis allow HEVC decoding to be accelerated easily once the next-generation Intel products containing the AVX512 SIMD extensions become available.

The initial goal of this thesis was to investigate the influence of the AVX512 SIMD extensions on optimising HEVC decoding. Based on the experimental results, HEVC decoding does benefit considerably from AVX512 SIMD. The large reduction in executed instructions, however, can only be achieved with high programming complexity and effort. Moreover, not all kernels are suitable for AVX512 SIMD: although AVX512 doubles the register size compared to AVX2, so that twice as many elements can be processed at the same time, it also restricts the block sizes that can be processed efficiently (the optimised HEVC decoder in this thesis is only effective for blocks that are 32 samples wide). This is also why not all video sets benefit from the optimised HEVC decoder and why not all kernels were implemented.

In general, the performance gain of the AVX512-optimised HEVC decoder is more pronounced when the QP level is high. This indicates that low-QP videos do not benefit from AVX512 SIMD as much as high-QP videos. At the same time, not all kernels benefit more from AVX512 as the QP level increases; the inverse transform kernel, for example, does not, since at high QP levels many transform coefficients are zero and cannot benefit from AVX512 SIMD.
This indicates a limitation of AVX512 SIMD optimisation for HEVC decoding and reminds us to be more careful when applying new SIMD extensions to inverse transform kernels in the future.

In this thesis, all measurements and results are based on the number of executed instructions, because real products containing the AVX512 SIMD extensions are not yet available and the SDE is limited (at the moment, only instruction counts can be measured). Therefore, more detailed performance measurements, such as memory access time, IPC, and total decoding time, cannot be performed for now. In the future, when real products are released, real-time measurements such as decoding time and IPC should be performed. Based on those measurements, we can further investigate the relation between the number of reduced instructions and the decoding time, learn how to take advantage of the AVX512 SIMD extensions more effectively, and avoid risks when using them to optimise HEVC decoding.
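The block-width restriction discussed above could, for instance, be handled with a simple run-time kernel dispatch. The sketch below is a hypothetical illustration (the enum and function names are ours, not from the decoder): only blocks at least 32 samples wide can fill a whole 512-bit register with 16-bit lanes per row, so narrower blocks fall back to narrower kernels.

```c
/* Hypothetical kernel selection by block width, assuming 16-bit
   intermediate samples: a 512-bit zmm register holds 32 lanes and a
   256-bit ymm register holds 16 lanes, so a block must be at least
   that wide to fill one register per row. */
typedef enum { KERNEL_SCALAR, KERNEL_AVX2, KERNEL_AVX512 } kernel_t;

kernel_t pick_kernel(int block_width) {
    if (block_width >= 32) return KERNEL_AVX512; /* full zmm row */
    if (block_width >= 16) return KERNEL_AVX2;   /* full ymm row */
    return KERNEL_SCALAR;                        /* too narrow */
}
```

Under this scheme, only the 32-wide (and larger) blocks reach the AVX512 path, which is consistent with the observation that the optimised decoder is only effective for 32-wide blocks.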
