Eindhoven University of Technology

MASTER

Optimising HEVC decoding using AVX512 SIMD extensions

Cao, Xu

Award date: 2018

Awarding institution: Technische Universität Berlin

Link to publication

Master Thesis

Optimising HEVC Decoding using Intel AVX512 SIMD Extensions

Xu CAO

Matriculation Number: 0367965

Technische Universität Berlin School IV · Electrical Engineering and Computer Science Department of Computer Engineering and Microelectronics Embedded Systems Architectures (AES) Einsteinufer 17 · D-10587 Berlin

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Engineering

according to the examination regulations of the Technische Universität Berlin for the Master's Degree in Computer Engineering.

Department of Computer Engineering and Microelectronics Embedded Systems Architectures (AES) Technische Universität Berlin Berlin

Author Xu CAO

Thesis period 07. May 2016 to 22. November 2016

Referees Prof. Dr. B. Juurlink, Embedded Systems Architectures (AES) Prof. Dr. R. Karnapke, Communication and Operating Systems (KBS)

Supervisor Prof. Dr. B. Juurlink, Embedded Systems Architectures (AES) Prof. Dr. R. Karnapke, Communication and Operating Systems (KBS) M.Sc. Chi Ching Chi, Embedded Systems Architectures (AES)

Declaration

Hiermit erkläre ich, dass ich die vorliegende Arbeit selbstständig und eigenhändig sowie ohne unerlaubte fremde Hilfe und ausschließlich unter Verwendung der aufgeführten Quellen und Hilfsmittel angefertigt habe.

Berlin, November 22, 2016

Xu CAO

Abstract

SIMD instructions have been commonly used to accelerate video codecs. The recently introduced HEVC codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to SIMD acceleration. Intel's new SIMD extension, AVX512, can process more data elements with a single instruction than previous SIMD extensions such as SSE, AVX and AVX2. This thesis presents the differences between AVX512 and earlier SIMD instruction sets such as SSE and AVX2, and the advantages of using the AVX512 SIMD extensions to optimise HEVC decoding. The AVX512 SIMD extensions are applied to the HEVC decoding kernels that can potentially benefit from them (inter prediction, inverse transform and deblocking filter). Performance evaluation on 1080p (HD) and 2160p (UHD) video sets under the Intel Software Development Emulator (SDE) indicates that with AVX512 the number of executed instructions can be reduced by up to 23% for 1080p 8-bit and by up to 31% for 2160p 10-bit compared to an HEVC decoder optimised with AVX2. Compared to the scalar implementation, the AVX512 optimisation reduces the executed instructions by 85% and 83% for 1080p and 2160p, respectively.


Zusammenfassung

SIMD-Befehle werden häufig verwendet, um Video-Codecs zu beschleunigen. Der neu eingeführte HEVC-Codec basiert wie seine Vorgänger auf dem Hybrid-Video-Codec-Prinzip und eignet sich daher gut, um mit SIMD beschleunigt zu werden. Als neue Intel-SIMD-Erweiterung ermöglicht AVX512 die Verarbeitung von mehr Datenelementen mit einer einzigen Anweisung als bisherige SIMD-Erweiterungen wie SSE, AVX und AVX2. In dieser Arbeit werden der Unterschied zwischen AVX512 und anderen SIMD-Befehlen wie AVX2 und SSE sowie der Vorteil der Verwendung von AVX512-SIMD-Erweiterungen zur Optimierung der HEVC-Decodierung dargestellt. Die AVX512-SIMD-Erweiterungen werden auf die potentiell profitierenden HEVC-Decodierungskerne (Inter-Prädiktion, inverse Transformation und Deblocking-Filter) angewendet. Die Ergebnisse der Leistungsanalyse auf Basis von Videosequenzen in 1080p- (HD) und 2160p- (UHD) Auflösung unter Nutzung des Intel Software Development Emulators (SDE) zeigen, dass durch die Verwendung der AVX512-SIMD-Erweiterungen die Anzahl der ausgeführten Befehle gegenüber dem auf AVX2 basierenden optimierten HEVC-Decoder für 1080p 8-Bit um bis zu 23% und für 2160p 10-Bit um bis zu 31% verringert werden kann. Beim Vergleich der AVX512-Optimierung mit der skalaren Implementierung können die ausgeführten Befehle für 1080p und 2160p um 85% beziehungsweise 83% reduziert werden.


Contents

List of Tables xi

List of Figures xiii

1 Introduction 1

2 State of the Art 5
2.1 Introduction of SIMD Extensions 5
2.2 Related Work 7
2.3 Introduction of HEVC 8

3 Theoretical Foundations 13
3.1 Overview of AVX512 SIMD Extensions 13
3.2 Baseline Optimised HEVC Decoder 16

4 Implementation 23
4.1 Inter prediction 23
4.2 Inverse Transform 26
4.3 Deblocking Filter 29

5 Results 35

6 Conclusions and Discussions 43

Bibliography 45


List of Tables

2.1 SIMD extensions to general purpose processors[8] 6
2.2 SIMD extensions for desktop-computer processors 8

3.1 Comparison among two released public CPUs with their projected next-gen counterparts supporting AVX512[1] 14
3.2 Multi-media (Fractal Generation) Benchmark: SIMD Native[1] 15
3.3 Multi-media (Fractal Generation) Benchmark: Cryptography Native[1] 16
3.4 Multi-media (Fractal Generation) Benchmark: Cache & Memory Transfer[1] 17
3.5 Average executed instructions per frame [minst/frame][8] 18
3.6 Speedup per stage on Haswell for different SIMD levels[8] 20

4.1 Coefficients for fractional positions for the HEVC interpolation filter. ... 24


List of Figures

2.1 HEVC decoding process 9
2.2 Coding Tree Unit 10

3.1 Single threaded performance at normalized frequency and IPC results of baseline HEVC decoder[8] 19
3.2 Decoder execution time breakdown on Haswell for scalar, , and avx2. The PSIDE and PCOEF stages represent the side information and coefficient parsing (CABAC entropy decoding). The PSIDE also includes the interpretation of the syntax elements and coding tree unit traversal[8] 20

4.1 Horizontal luminance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (1) 24
4.2 Horizontal luminance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (2) 25
4.3 Vertical chrominance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (1) 26
4.4 Vertical chrominance interpolation optimisation for 10-bit input samples using AVX512 SIMD ISAs (2) 27
4.5 HEVC 1D inverse transform for 32x32 TBs[8] 28
4.6 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (1) 28
4.7 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (2) 29
4.8 HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (3) 30
4.9 Samples of one edge part around a block edge involved in the deblocking filter 30
4.10 Swizzle of the first and last samples of 16 edge parts into a vector 31
4.11 Luminance filter process in the deblocking filter kernel 32


5.1 Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementation compared with the scalar program for 1080p 8-bit resolution video sets 35
5.2 Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementation compared with the scalar program for 2160p 10-bit resolution video sets 36
5.3 Average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit and 2160p 10-bit resolution video sets 37
5.4 Proportion distribution of the number of executed instructions when applying AVX2 and AVX512 to decode the 1080p 8-bit resolution video sets 38
5.5 Proportion distribution of the number of executed instructions when applying AVX2 and AVX512 to decode the 2160p 10-bit resolution video sets 39
5.6 Proportion distribution of the average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit resolution video for each stage 40
5.7 Proportion distribution of the average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 2160p 10-bit resolution video for each stage 40

1 Introduction

Computer architecture has advanced year after year and has a strong mutual influence with video coding technology, particularly in the deployment of single instruction, multiple data (SIMD) instructions. The High Efficiency Video Coding (HEVC) standard is the most recent joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardisation organisations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC)[6]. Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy; it describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously[21]. SIMD extensions were originally introduced to enable software-only MPEG-1 real-time decoding on general purpose processors[12][13]. Since then, SIMD extensions have been particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio, and most modern CPU designs include SIMD instructions to improve multimedia performance. To allow more efficient video coding implementations, the SIMD capabilities of processors also need to be taken into account during the standardisation process, for instance through an explicit effort to define reasonable intermediate computation precision and to eliminate sample dependencies. The JCT-VC standardisation organisation has released the High Efficiency Video Coding (HEVC) standard[18]. Based on PSNR, HEVC showed a bit rate reduction of 35.4% compared to H.264/MPEG-4 AVC HP, 63.7% compared to MPEG-4 ASP, 65.1% compared to H.263 HLP, and 70.8% compared to H.262/MPEG-2[15]. Similar to previous standards, significant attention was paid to allowing the new standard to be accelerated with SIMD and custom hardware solutions.
Clearly, hardware-only solutions can achieve much higher energy efficiency than software solutions running on general purpose processors. However, when hardware acceleration is not available, optimised software solutions are required. Besides, starting from an optimised software solution, less effort is needed for the next product generation, so the time to market can be significantly reduced and the problem of hardware overspecialisation can be avoided[19].

HEVC is more complicated than previous video coding standards: it supports three different coding tree block sizes, more transform sizes, an additional loop filter, more intra prediction angles, and so on. Additionally, in recent years the instruction set architectures (ISAs) and their micro-architectures have become more and more diverse. Furthermore, SIMD ISAs have become more complex with multiple instruction set extensions, and even within the same ISA the instructions have different performance characteristics depending on the implementation. At the same time, HEVC has significant potential for SIMD acceleration, so much more investigation is needed to analyse the impact of SIMD acceleration on HEVC decoding. Recently, Intel released a new SIMD extension, named AVX512, which is a 512-bit extension of the 256-bit Advanced Vector Extensions (AVX) SIMD instruction set architecture. AVX512 instructions can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers, within 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction, and four times that of SSE, which can in theory improve the performance of the HEVC decoder considerably. It is therefore important to investigate how much improvement can be achieved for HEVC decoding by using the AVX512 SIMD extensions. The main contribution of this thesis is an investigation of the differences between AVX512 and other SIMD instruction sets such as AVX2 and SSE.
Then, the AVX512 SIMD extensions are applied to the HEVC decoding kernels that can potentially benefit from them (inter prediction, inverse transform and deblocking filter). The corresponding performance evaluation, based on 1080p (HD) and 2160p (UHD) resolution video sets under the Intel Software Development Emulator (SDE), indicates that with the AVX512 SIMD extensions the number of executed instructions can be reduced by up to 23% for 1080p 8-bit and up to 31% for 2160p 10-bit compared to the HEVC decoder optimised with AVX2. Compared to scalar HEVC decoding, 85% and 83% of the executed instructions can be eliminated for 1080p and 2160p, respectively, with the AVX512 SIMD ISAs. This thesis is organised as follows. First, Section II introduces SIMD extensions, related work and HEVC features, while Section III presents the theoretical foundations, including an overview of AVX512 and the advantages of using it for program optimisation, an introduction to the baseline optimised HEVC decoder, and an analysis of the HEVC decoding kernels that can potentially benefit from the AVX512 SIMD ISAs. Section IV describes the implementation of the HEVC decoding optimisation using AVX512 SIMD ISAs.

In Section V, performance results are discussed. Finally, in Section VI, conclusions and discussions are presented.


2 State of the Art

In this section, background knowledge is introduced. First, an introduction to SIMD extensions is presented. Then, related work is discussed. Finally, an introduction to the HEVC decoder is given.

2.1 Introduction of SIMD Extensions

SIMD is shorthand for single instruction, multiple data, a class of parallel computers in Flynn's taxonomy[21]. A SIMD program performs the same operation on multiple data elements at the same time. This is data level parallelism, not concurrency: multiple data elements are processed at the same time, but only one operation is performed. Programs that expose data level parallelism can be significantly accelerated by SIMD. The concept of SIMD was first introduced in the early 1970s as a "vector" of data processed with a single instruction, although vector processing is nowadays considered separate from SIMD. Afterwards, SIMD was adopted in machines with many limited-functionality processors, such as the Thinking Machines CM-1 and CM-2, and the era of SIMD processors shifted from the desktop computer to the supercomputer. However, since desktop computers became powerful enough to support real-time gaming and audio/video processing during the 1990s, SIMD extensions returned to the desktop computer. The first widely used desktop SIMD extension, named MMX, was introduced in 1996. Afterwards, more and more SIMD extensions were introduced, such as NEON, SSE2, SSSE3, SSE4.1, AVX/AVX2 and AVX512 (recent SIMD extensions for desktop computers are listed in Table 2.1[8]). Recent SIMD extensions are implemented by partitioning each register into sub-words and applying the same instruction to all of them, which can yield significant performance gains. Two kinds of application can take particular advantage of SIMD: one is when the same value is added to (or subtracted from) a large number of data points; the other is when all the loaded data are processed in a single operation.


SIMD ISA    Base ISA     Vendor    Year  SIMD Registers
MAX[14]     PA-RISC      HP        1994  31x32b
VIS         SPARC        Sun       1995  32x64b
MAX-2       PA-RISC      HP        1995  32x64b
MVI         Alpha        DEC       1996  31x64b
MMX[16]     x86          Intel     1996  8x64b
MDMX        MIPS-V       MIPS      1996  32x64b
3DNow!      x86          AMD       1998  8x64b
Altivec[9]  PowerPC      Motorola  1998  32x128b
MIPS-3D     MIPS-64      MIPS      1999  32x64b
SSE[20]     x86/x86-64   Intel     1999  8/16x128b
SSE2[17]    x86/x86-64   Intel     2000  8/16x128b
SSE3        x86/x86-64   Intel     2004  8/16x128b
NEON        ARMv7        ARM       2005  32x64b - 16x128b
SSSE3       x86/x86-64   Intel     2006  8/16x128b
SSE4        x86/x86-64   Intel     2007  8/16x128b
VSX         Power v2.06  IBM       2010  64x128b
AVX         x86/x86-64   Intel     2011  16x256b
XOP         x86/x86-64   AMD       2011  8/16x128b
AVX2        x86/x86-64   Intel     2013  16x256b
AVX-512     x86/x86-64   Intel     2016  32x512b

Table 2.1: SIMD extensions to general purpose processors[8]

Although SIMD is well suited to accelerating programs, it still has some limitations. First, not every program can be accelerated by SIMD, since not every program can be vectorised easily. Second, SIMD loads and stores are constrained, and the SIMD ISAs have different vector widths. Furthermore, accelerating a program with SIMD usually requires considerable human effort. Besides, some SIMD ISAs are more complicated to use than others because of, for example, restrictions on data alignment, and not all processors support all SIMD ISAs. Investigating the influence of a new SIMD extension on a program therefore still requires a lot of manual work. The next part presents related work on applying recent SIMD ISAs to the HEVC decoder (most of it collected by Chi et al.[8]). The results show that, despite these limitations of SIMD, the HEVC decoder can still be optimised considerably by applying


SIMD ISAs. This is also one of the motivations for this thesis: investigating the influence of the latest SIMD ISA, AVX512. The advantages of AVX512 are presented in Section 3.1.

2.2 Related Work

Optimising programs with SIMD extensions is quite common nowadays. Since the first SIMD extensions were introduced in the mid-1990s, different extensions such as SSE2, SSSE3, SSE4.1 and AVX2 have been used to accelerate video codecs such as MPEG-2, MPEG-4 Part 2 and, more recently, H.264/AVC and HEVC. A summary of works reporting SIMD optimisation for codecs before H.264/AVC can be found in [11]. H.264/AVC also benefited considerably from SIMD acceleration, including the luma and chroma interpolation filters, the inverse transform and the deblocking filter. For instance, Zhou et al.[23] and Chen et al.[7] reported that accelerating H.264/AVC with SSE2 yields speedups ranging from 2.0 to 4.0. Besides, Iverson et al.[10] reported real-time decoding of 720p content with SSE3 acceleration on a Pentium IV processor and with SSE2 acceleration on a low-power Pentium-M processor. Furthermore, some recent works have reported SIMD acceleration for HEVC decoding (collected by Chi et al.[8]): A decoder optimised with the SSE2 SIMD extensions was reported by L. Yan et al.[22]. The optimisation covers the luma and chroma interpolation filters, the adaptive loop filter, the deblocking filter and the inverse transform, with speedups of 6.08, 2.21, 5.21 and 2.98 for each kernel respectively, resulting in a speedup of 4.16 in total. In a real-time test on an Intel i5 processor running at 2.4 GHz, this decoder decodes 1080p videos at 27.7 to 44.4 fps depending on the content and bitrate. Bossen et al.[3] presented an optimised HEVC decoder using SSE4.1 and ARM NEON. In a real-time test with SSE4.1, the decoder processes more than 60 fps for 1080p video up to 7 Mbps on an Intel i7 processor running at a turbo frequency of 3.6 GHz. With ARM NEON, it decodes 480p videos at 30 fps at up to 2 Mbps on a Cortex-A9 processor running at 1.0 GHz, and 1080p sequences at 30 fps on an ARMv7 processor running at 1.3 GHz[2]. Another SSE4.1 optimisation was reported by Bross et al.[5]. In a real-time test they showed that their system processes, on average, 68.3 and 17.2 fps for 1080p and 2160p respectively on an Intel i7 processor working at 2.5 GHz, with overall speedups of 4.3 and 3.3 for 1080p and 2160p, respectively, on the same processor.


Application  Year  ISA     Resol.  fps
H.264[23]    2003  SSE2    480p    48
H.264[10]    2004  SSE2    720p    30
             2004  SSE3    720p    60
HEVC[22]     2012  SSE2    1080p   28-44
HEVC[3]      2012  NEON    480p    30
             2012  SSE4.1  1080p   60
HEVC[2]      2012  NEON    1080p   30
HEVC[5]      2013  SSE4.1  1080p   68.3
             2013  SSE4.1  2160p   17.3
HEVC[8]      2015  AVX2    1080p   133
             2015  AVX2    2160p   37.8

Table 2.2: SIMD extensions for desktop-computer processors

In 2015, Chi et al.[8] reported SIMD optimisation of the entire HEVC decoder for most of the major SIMD ISAs. The evaluation was performed on 14 mobile and PC platforms covering most major architectures released in recent years. With SIMD, up to 5x speedup can be achieved over the entire HEVC decoder, resulting in up to 133 fps and 37.8 fps on average on a single core for Main profile 1080p and Main10 profile 2160p sequences, respectively. Section 3.2 introduces this optimised HEVC decoder, based on recent SIMD extensions such as SSE2, SSSE3, SSE4.1 and AVX2, as reported by Chi et al.[8]; it serves as the baseline of this thesis. An overview of the related SIMD accelerations for video decoders is given in Table 2.2.

2.3 Introduction of HEVC

The High Efficiency Video Coding (HEVC) standard is the most recent joint video project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) standardisation organisations, working together in a partnership known as the Joint Collaborative Team on Video Coding (JCT-VC)[6]. The video coding layer of HEVC employs the same hybrid approach (inter-/intra-picture prediction and 2-D transform coding) used in all video compression standards since H.261. In the following, we introduce some terminology and focus on the HEVC decoding process shown in Figure 2.1:


Figure 2.1: HEVC decoding process.

First, we discuss some block structure coding terminology. In the H.264 standard, the basic coding layer unit is the macroblock. A macroblock is 16x16 samples, consisting of a 16x16 luma block and, for the commonly used 4:2:0 sampling format, two 8x8 chroma blocks. The corresponding structure in HEVC is the coding tree unit (CTU). The size of the CTU is decided by the encoder; the maximum is 64x64 and the minimum is 16x16. For high-resolution video encoding, larger CTUs generally give better compression performance. Here we take the largest CTU size, 64x64, as an example to explain the HEVC coding tree structure. HEVC first divides the entire frame picture according to the CTU size; as shown in Figure 2.2, the size of each CTU is 64x64. One coding tree unit (CTU) contains three coding tree blocks (CTBs), namely one luma (Y) CTB and two chroma (Cb and Cr) CTBs, plus associated syntax elements. Since the CTU size can be 16x16, 32x32 or 64x64, the luma CTB size can also be 16x16, 32x32 or 64x64, and is always the same as the CTU size. Here the luma CTB size is 64x64 and the chroma CTB size is 32x32. A coding tree block (CTB) in HEVC can be used directly as one coding block (CB) or divided into smaller coding blocks in the form of a quadtree, so the coding block size can vary. When the CTU size is 64x64, the maximum luma CB size is 64x64 and the minimum is 8x8; for chroma CBs, the maximum is 32x32 and the minimum is 4x4. Larger CBs can greatly improve the coding efficiency for smooth regions, while smaller CBs make complex prediction of detailed regions more accurate. An image can be divided into several non-overlapping CTUs according to the CTU size. Within one CTU, based on the quadtree hierarchy, the coding units at the same


Figure 2.2: Coding Tree Unit

level have the same depth of segmentation. One CTU can contain either a single CU without being partitioned, or multiple CUs. Each CU contains prediction units (PUs) and transform units (TUs). For a 2Nx2N CU there are two intra-prediction unit sizes, 2Nx2N and NxN. For the inter-prediction mode there are 8 types: four symmetric ones, 2Nx2N, NxN, Nx2N and 2NxN, and four asymmetric ones, 2NxnU, 2NxnD, nLx2N and nRx2N (where U, D, L and R represent the four directions). nLx2N and nRx2N split the CU left to right in ratios of 1:3 and 3:1 respectively; 2NxnU and 2NxnD split it top to bottom in ratios of 1:3 and 3:1 respectively. There is one further type, the skip mode, for inter-prediction, in which the encoded motion information contains only the motion parameters and no residual information needs to be encoded. A prediction block comprises a luma prediction block, two chroma prediction blocks and associated syntax elements.


The transform unit (TU) is the unit that performs transformation and quantization independently, and its size varies flexibly: HEVC supports transforms from 4x4 up to 32x32. The size of the transform unit depends on the CU mode. Within one CU, TUs can span multiple PUs in the form of a quadtree partition. Larger TUs concentrate the energy better, while smaller TUs retain more detail and allow more flexible partitioning. One transform unit includes a luma transform block, two chroma transform blocks and associated syntax elements.

Next, we describe the HEVC decoding process shown in Figure 2.1:

a Entropy coding: The first step in the decoding process is to decode the syntax elements with different entropy coding algorithms. This contains three parts: network abstraction layer (NAL) unit parsing, slice header decoding and slice data decoding. The syntax structure of HEVC is placed into logical data packets called NAL units. Each NAL unit consists of a header and payload data. NAL units can be classified into video coding layer (VCL) units, containing the video picture sample data, and non-VCL units, containing associated additional information such as parameter sets and supplemental enhancement information. During the HEVC decoding process, a frame is divided into slices, slice segments and tiles, where a slice consists of slice segments. In HEVC, one frame picture can be divided into many slices, or one slice can be the entire frame picture. The encoded picture information is stored in the slices. The slice header contains general information about the picture, such as the prediction type of the current picture, the QP level and so on. The slice payload contains the prediction and residual information of the picture. One of the most important features of the slice is independence: each slice carries independent information, so that when one slice is lost the other slices are not affected and error propagation is reduced significantly. HEVC uses wavefront parallel processing (WPP) and tiles to make decoding more suitable for parallel processing, which means slices can be processed in parallel. In addition, slices are divided into many CTUs, and WPP allows each row of CTUs to be processed in parallel.

b Reconstruction: This step consists of inverse quantization (IQ) and inverse transform (IT). The decoding process of HEVC uses the discrete cosine transform (DCT). Although the DCT is not the best solution for data compression, it concentrates the spatial frequency amplitude in the low frequencies, which makes quantization easy. The purpose of quantization is to reduce the accuracy of the DCT coefficients, thereby increasing the compression ratio. As the quantization will cause

11 2 State of the Art

errors, HEVC uses intra-prediction to eliminate intra-frame redundancy and inter-prediction to eliminate inter-frame redundancy. Motion compensation has a similar effect: adjacent frames have many similarities, which is redundancy. Based on the prediction from the previous frame, motion compensation compensates the current frame to eliminate redundancy and improve the compression ratio.

c Deblocking Filter (DF): The DF and the Sample Adaptive Offset (SAO) in the next part operate on the reconstruction. Deblocking filtering consists of vertical boundary filtering and horizontal boundary filtering: first, the horizontal filter is applied to the vertical boundaries, and then the vertical filter is applied to the horizontal boundaries. Filtering that is too strong leads to excessive smoothing of image details, while insufficient filter strength reduces the image quality. Therefore, HEVC provides three filter strengths: 0, 1 and 2. If the strength is greater than 0, it has to be determined whether the current boundary needs filtering. Each boundary uses 8 samples around it for the filtering calculation, and four rows of pixels are used to decide whether filtering should be applied. There is a dependency between vertical and horizontal boundary filtering: the result of the horizontal boundary filtering is taken as the input of the vertical boundary filtering. However, the individual horizontal boundary filtering operations are independent of each other, as are the vertical ones, so they can be processed in parallel.

d Sample Adaptive Offset (SAO): SAO is a new feature of HEVC decoding and one of the important differences from H.264. SAO filtering is applied after the vertical filtering of the deblocking filter; it is not independent of the deblocking filter. The purpose of SAO is to eliminate the ringing and blocking artefacts caused by the sampling difference. By analysing the relationship between the deblocked data and the original image, SAO compensates the deblocked data to make it as close to the original image as possible. SAO uses different offsets depending on the class of the sample and the region. There are two types of offset: band offset and edge offset. The band offset does not need to refer to the information of adjacent pixels when adding the offset to the reconstructed pixels, so there is no data dependency, while the edge offset requires reference to the information of the eight neighbouring pixels.

These are the general HEVC decoding features and process. This thesis focuses on using the AVX512 SIMD extensions to optimise HEVC decoding. Section 4 presents the details of how the target kernels are optimised with the AVX512 SIMD ISAs.

3 Theoretical Foundations

In this section, the theoretical foundations are presented. First, an overview of the AVX512 SIMD extensions and their advantages is given. Then, the optimised HEVC decoder based on recent SIMD extensions, reported by Chi et al.[8], is presented. Based on the experimental results of this decoder, the kernels that can potentially benefit from the AVX512 SIMD ISAs are analysed; the details of the optimisation of these kernels are presented in the next section.

3.1 Overview of AVX512 SIMD Extensions

Intel has announced a new SIMD extension, the Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions, which support 512-bit SIMD operations. Using AVX512, a program can pack eight double-precision or sixteen single-precision floating-point numbers, or eight 64-bit or sixteen 32-bit integers, into a single 512-bit vector. AVX512 can therefore process four times as many data elements per instruction as SSE and twice as many as AVX/AVX2, making it one of the most powerful SIMD ISAs. It also offers good compatibility with AVX/AVX2, which makes it considerably stronger than the previous SIMD ISAs: whereas SSE and AVX cannot easily be mixed without performance penalties, AVX512 and AVX/AVX2 can be mixed freely. This is achieved by aliasing the AVX512 registers ZMM0-ZMM15 onto the AVX registers YMM0-YMM15, so that AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers when AVX512 is used in a program. Besides, AVX512 provides higher computational performance through a number of features: 32 vector registers, each 512 bits wide; eight dedicated mask registers; 512-bit operations on packed floating-point or packed integer data; embedded rounding control (overriding global settings); embedded broadcast; embedded floating-point and memory fault suppression; new operations; additional gather/scatter support; high-speed math instructions; compact representation of large displacement values; and optional capabilities beyond the foundation instructions.

To estimate the future performance advantage of AVX512, Sisoftware[1] compared two released public CPUs with their projected next-generation counterparts supporting AVX512, as shown in Table 3.1[1] (no major changes are expected in future AVX512-supporting architectures). By analysing this comparison, we can get a general picture of the advantages of the AVX512 SIMD extensions.

Processor: Intel i7-6700K (Skylake) | Projected next-gen | Intel i7-5820K (Haswell-E) | Projected next-gen (Skylake-E)
Cores/Threads: 4C/8T | 4C/8T | 6C/12T | 6C/12T
Clock Speeds (MHz) Min-Max-Turbo: 800-4000-4200 | assumed same | 1200-3300-3600 | assumed same
Caches L1/L2/L3: 4x 32kB, 4x 256kB, 8MB | assumed same | 6x 32kB, 6x 256kB, 15MB | assumed same
Power TDP Rating (W): 91W | assumed same | 140W | assumed same
Instruction Set Support: AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc. | AVX2, FMA3, AVX, etc. | AVX512 + AVX2, FMA3, AVX, etc.

Table 3.1: Comparison among two released public CPUs with their projected next-gen counterparts supporting AVX512[1].

First, the multi-media (Fractal Generation) benchmark results (SIMD Native) for Core-i7 (4C/8T AVX512) Projected, Core i7-6700K (4C/8T AVX2/FMA), Core i7-6700K (4C/8T SSEx), Future Core i7-E (6C/12T AVX512) Projected, Core i7-5820K (6C/12T AVX2/FMA), and Core i7-5820K (6C/12T SSEx) are listed in Table 3.2[1].

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Integer SIMD (Mpix/s): 912.5 (+76% over AVX) | 516.2 (+76% over SSE) | 292 | 1020.7 (+76% over AVX) | 577.4 (+76% over SSE) | 327
Long SIMD (Mpix/s): 315.3 (+66% over AVX) | 190.1 (+66% over SSE) | 114.6 | 284.3 (+66% over AVX) | 171.4 (+66% over SSE) | 87.6
Single Float SIMD (Mpix/s): 916.8 (+2x over AVX) | 458.4 (+2.12x over SSE) | 216 | 1079 (+2x over AVX) | 539.5 (+2.12x over SSE) | 234.8
Double Float SIMD (Mpix/s): 545.8 (+2x over AVX) | 272.9 (+2.35x over SSE) | 116.1 | 622.4 (+2x over AVX) | 311.2 (+2.35x over SSE) | 126
Quad Float SIMD (Mpix/s): 20.3 (+94% over AVX) | 10.5 (+94% over SSE) | 5.4 | 622.4 (+94% over AVX) | 311.2 (+94% over SSE) | 126

Table 3.2: Multi-media (Fractal Generation) Benchmark: SIMD Native[1].

From these results, we can see that for integer and long-integer SIMD, AVX2 achieves around 76% and 66% improvement over SSE, respectively; for AVX512, we can expect a further improvement of more than 80%. For single-float SIMD, the improvement from SSE to AVX/AVX2 is over 2x, and although AVX512 adds somewhat less, it can still reach around 100%; for double and quad float, a further 2x can likewise be expected. With AVX512, the gap between Skylake-E and current GPGPUs can thus be narrowed.

Next, Table 3.3[1] shows the cryptography results. For hashing SHA2-256 and SHA1, there is a 2x improvement from SSE to AVX/AVX2, and a similar further improvement can be expected for AVX512 in both cases, although performance may be limited by memory bandwidth. The hashing SHA2-512 case is quite similar. With hashing showing even better results than fractal generation, we can expect AVX512 to improve performance by at least 100%, since the AVX2 improvement over SSE is already more than 2x.

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Hashing SHA2-256 (GB/s): 11.80 (+2x over AVX) | 5.90 (+2.36x over SSE) | 2.50 | 13.60 (+2x over AVX) | 6.80 (+2.26x over SSE) | 3
Hashing SHA1 (GB/s): 23 (+2x over AVX) | 11.5 (+2.16x over SSE) | 5.33 | 27.70 (+2x over AVX) | 13.85 (+2.04x over SSE) | 6.79
Hashing SHA2-512 (GB/s): 8.74 (+2x over AVX) | 4.37 (+2.33x over SSE) | 1.87 | 9.60 (+2x over AVX) | 4.80 (+2.20x over SSE) | 2.18

Table 3.3: Multi-media (Fractal Generation) Benchmark: Cryptography Native[1].

Then, regarding cache and memory transfer, the results are listed in Table 3.4[1]. With DDR4 there is no obvious improvement at the memory level. Moving up to the cache level, however, L3 bandwidth improves by about 10% with AVX2 over SSE, and similarly with AVX512; for the L2 and L1 caches, the improvements with AVX512 are over 20% and even 40-50%, respectively. AVX512 thus takes advantage of the widened data ports in Skylake and future architectures, with the L1D cache showing the best bandwidth improvement, while memory bandwidth remains limited by DDR4 speeds.

Columns (left to right): Future Core-i7 (4C/8T AVX512) Projected; Core i7-6700K (4C/8T AVX2/FMA); Core i7-6700K (4C/8T SSEx); Future Core i7-E (6C/12T AVX512) Projected; Core i7-5820K (6C/12T AVX2/FMA); Core i7-5820K (6C/12T SSEx).

Memory Bandwidth (GB/s): 31.30 | 31.30 (0%) | 31.30 | 42.00 (0%) | 42.30 (-1%) | 42.6
L3 Bandwidth (GB/s): 267.97 (+10%) | 243.30 (+10%) | 220.90 | 202.20 (+3%) | 195.90 (+3%) | 189.8
L2 Bandwidth (GB/s): 392.50 (+21%) | 323.30 (+21%) | 266.30 | 536.81 (+20%) | 444.10 (+20%) | 367.4
L1D Bandwidth (GB/s): 1,364.25 (+50%) | 909.50 (+2.11x) | 429.90 | 1,536.00 (+50%) | 1,024.00 (+2x) | 518

Table 3.4: Multi-media (Fractal Generation) Benchmark: Cache & Memory Transfer[1].

Intel plans to first ship AVX512 in the Intel Xeon Phi processor (coprocessor known by the code name Knights Landing). Since real products are not available yet, Intel has extended the Intel Software Development Emulator (Intel SDE) to support Intel AVX-512 and help with testing. The experiments and measurements in this thesis are therefore based on results obtained with the SDE. Owing to the limitations of the SDE, only the total number of executed instructions for the entire HEVC decoding process can be obtained in this thesis; the execution time, IPC and per-stage instruction counts are not available. The details of the experiments and measurement results are presented in Section 5.

3.2 Baseline Optimised HEVC Decoder

As discussed in Section 2.2, Chi et al.[8] presented an optimised HEVC decoder in 2015 that implements SIMD for all HEVC processing steps except bitstream parsing: inter prediction, intra prediction, inverse transform, deblocking filter, SAO filter, and various memory-movement operations. This SIMD optimisation covers almost all major SIMD ISAs, including NEON, SSE2, SSSE3, SSE4.1, XOP, and AVX2. The evaluation was performed on 14 mobile and PC platforms covering most major architectures released in recent years. With SIMD, up to a 5x speedup can be achieved over the entire HEVC decoder, resulting in up to 133 fps and 37.8 fps on average on a single core for Main profile 1080p and Main10 profile 2160p sequences, respectively. This thesis uses this optimised HEVC decoder as the baseline and optimises it further with AVX512 SIMD extensions; by comparing the performance of the AVX512-optimised decoder with the baseline, the advantage of applying AVX512 to the HEVC decoder is presented in Section 5. In this section, the features and performance of the baseline HEVC decoder are presented. Based on these results, the kernels of the HEVC decoder that can potentially benefit from the AVX512 SIMD extensions are identified; these kernels are then optimised with AVX512, and the implementation details are presented in Section 4. We now turn to the performance results of the baseline HEVC decoder.


First, the impact of ISA and architecture is shown in Table 3.5 (the runtimes of all videos and QPs are averaged for each resolution). Applying SIMD to the HEVC decoder clearly reduces the instruction count significantly: compared to scalar code, the instruction count is reduced by 4.8x to 8.1x, depending on the ISA and architecture. The instruction counts are reduced for both 1080p 8-bit and 2160p 10-bit; in the case of the 10-bit video sets, fewer scalar instructions are replaced by SIMD instructions. For the 8-bit video sets, the NEON implementation on the ARMv7 architecture executes fewer instructions per frame than SSE2 on x86, but more than the other x86 SIMD implementations; for the 10-bit video sets, the NEON implementation always executes more instructions per frame than the x86 architectures. Among the x86 SIMD implementations, the 64-bit architecture generally executes fewer instructions per frame than the 32-bit architecture. For both 32-bit and 64-bit, the instructions per frame decrease from SSE2 to AVX2, with a jump in the reduction from XOP to AVX2.

             scalar   neon   sse2  ssse3  sse4.1    avx    xop   avx2
1080p 8-bit
armv7         319.1   76.5      -      -       -      -      -      -
x86           385.4      -   79.1   68.5    67.9   62.7   60.9   49.4
x86-64        344.7      -   72.0   61.2    60.6   55.9   54.0   42.8
2160p 10-bit
armv7          1085  299.9      -      -       -      -      -      -
x86            1250      -  256.2  243.8   242.3  223.2  210.2  159.7
x86-64         1127      -  232.9  219.3   218.2  198.9  184.3  138.9

Table 3.5: Average executed instructions per frame [minst/frame][8].

Second, Figure 3.1 shows the normalised performance and instructions per cycle (IPC) for 1080p and 2160p, respectively. After applying SIMD to the decoding process, the performance is much better than scalar, with AVX2 performing best. For the IPC, it can be observed that the SIMD implementations always have a lower IPC than the scalar implementation, with AVX2 again the lowest.

Figure 3.1: Single threaded performance at normalized frequency and IPC results of baseline HEVC decoder[8].

Then, Table 3.6 presents the speedup per stage for the different SIMD levels. The AVX2 implementation has the best overall speedup, up to 4.95x for 1080p 8-bit videos. The kernels that improve the most are inter prediction and SAO, which speed up by 10.33x and 11.18x, respectively, with AVX2 for 1080p 8-bit videos. The inverse transform and deblocking filter kernels benefit less than the aforementioned kernels but still improve significantly, with speedups of up to 3.69x and 3.44x, respectively, for 1080p 8-bit videos. For 2160p 10-bit videos the conclusions are similar, except that the SAO kernel does not benefit as much as for the 1080p 8-bit video sets; the other three kernels, inter prediction, inverse transform and deblocking filter, are still improved considerably.

             Intra   Inter     IT     DF    SAO  Other  Overall
1080p 8-bit
sse2          1.23    6.03   2.68   2.80   4.03   1.24     3.67
ssse3         1.30    7.16   2.87   3.00   7.82   1.24     4.14
sse4.1        1.31    7.12   2.98   3.00   8.09   1.24     4.15
avx           1.33    7.58   3.23   3.04   8.26   1.24     4.29
avx2          1.33   10.33   3.69   3.44  11.18   1.18     4.95
2160p 10-bit
sse2          1.53    5.28   3.34   3.25   3.61   1.16     3.59
ssse3         1.63    5.53   3.34   3.41   4.22   1.16     3.74
sse4.1        1.64    5.51   3.42   3.45   4.27   1.16     3.75
avx           1.65    5.74   3.74   3.49   4.26   1.16     3.85
avx2          1.67    8.03   4.68   4.05   4.75   1.09     4.60

Table 3.6: Speedup per stage on Haswell for different SIMD levels[8].

Afterwards, Figure 3.2 presents the execution profile of the HEVC decoder. Based on this figure, the portions of execution time spent on parsing the side information and coefficients (PSide, PCoeff), intra prediction (Intra), and Other increase when SIMD is used, because SIMD has limited impact on these parts; SIMD is therefore not implemented for them. The figure also shows that inter prediction, inverse transform and deblocking filter are the kernels that benefit significantly from SIMD ISAs.

Figure 3.2: Decoder execution time breakdown on Haswell for scalar, sse2, and avx2. The PSIDE and PCOEF stages represent the side information and coefficient parsing (CABAC entropy decoding). The PSIDE includes also the interpretation of the syntax elements and coding tree unit traversal[8].

From the above results, it is clear that AVX2 performs best of all the SIMD ISAs used in [8], and, according to the previous section and the advantages of AVX512 over AVX2, even better performance can be expected when the HEVC decoding process is optimised with AVX512 SIMD extensions. This is why the AVX512 SIMD extensions are used to optimise HEVC decoding in this thesis. Among the kernels of the HEVC decoder, inter prediction and SAO clearly benefit the most from SIMD optimisation, and IT and DF are also improved considerably. However, considering the portion of execution time of the different kernels, SAO takes much less execution time than the other three kernels, so it is not optimised with AVX512 SIMD in this thesis. Therefore, the AVX512 SIMD extensions are applied to the inter prediction, inverse transform and deblocking filter kernels of the HEVC decoder.


4 Implementation

In this section, we present how the HEVC decoding is optimised using AVX512 SIMD ISAs. As discussed in Section 3.2, we optimise the inter prediction, inverse transform and deblocking filter kernels. In this thesis, we use AVX512F, AVX512BW and AVX512DQ instructions. The details are given below.

4.1 Inter prediction

In general, inter prediction is the most time-consuming step in HEVC decoding[3]. In this thesis, AVX512 SIMD is used to optimise the luma and chroma, horizontal and vertical interpolation functions for 8-bit and 10-bit input in the inter prediction kernel. Interpolation is needed when the horizontal or vertical motion vector component points to a fractional position, and is performed in HEVC with a 7/8-tap FIR filter whose coefficients are listed in Table 4.1. HEVC supports 8-bit and 10-bit unsigned input sample values. For the interpolation, the filter gain lies between -22 and +88, which adds 8 bits to the intermediate precision; therefore, for both 8-bit and 10-bit input samples, a 16-bit intermediate precision is needed. Since AVX512 SIMD ISAs are 512 bits wide, only blocks of width 32 can benefit from the AVX512-optimised interpolation. The details of the AVX512 optimisation are as follows. First, we discuss how the luma horizontal interpolation for 10-bit input is optimised with AVX512 SIMD ISAs in the inter prediction kernel (the 8-bit horizontal interpolation is implemented in a similar way). Figure 4.1 shows that the input samples are first loaded into three 512-bit vectors with the VMOVDQU32 instruction, each vector containing 32 input samples. These three vectors are then shuffled into two groups (Group A and Group B) of intermediate 512-bit vectors using VPSHUFB instructions (in each group, a vector drawn in the same colour as a source vector is shuffled from that source vector, as the arrows show).


position     C0   C1   C2   C3   C4   C5   C6   C7
0             0    0    0   64    0    0    0    0
0.25         -1    4  -10   58   17   -5    1    0
0.5          -1    4  -11   40   40  -11    4   -1
0.75          0    1   -5   17   58  -10    4   -1

Table 4.1: Coefficients for fractional positions for the HEVC interpolation filter.
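To make the arithmetic concrete, here is a scalar sketch of the per-sample computation that each 16-bit lane of the AVX512 interpolation performs, using the half-sample (0.5) taps from Table 4.1; the function name and layout are illustrative, not the decoder's actual code.

```c
#include <stdint.h>

/* Half-sample (position 0.5) luma filter taps from Table 4.1. */
static const int16_t qfilter[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

/* Scalar reference for one interpolated sample.  `stride` is 1 for
 * horizontal filtering and the line stride for vertical filtering; the
 * vectorised version computes 32 of these sums per instruction.  For 10-bit
 * input the products fit in 16-bit lanes and the sum in 32 bits. */
static int32_t interp_sample(const uint16_t *src, int stride)
{
    int32_t sum = 0;
    for (int k = 0; k < 8; k++)
        sum += qfilter[k] * (int32_t)src[(k - 3) * stride];
    return sum; /* intermediate value before shift/round/clip */
}
```

Since the taps sum to 64, a flat region is scaled by 64, which is what the subsequent shift and rounding steps undo before saturating and packing the result.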

Figure 4.1: Horizontal luminance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (1)

Then, as Figure 4.2 shows, we multiply the vectors in each group by the coefficients (the coefficients C0 and C1 from Table 4.1 are loaded into a 512-bit vector vcoeff0, and the other coefficients are loaded in a similar way) and add the results to obtain the intermediate results for each group. We then apply shift and rounding operations to each intermediate result. Finally, we saturate and pack the two intermediate results into one 512-bit vector with VPACKSSDW to obtain the final result, and loop these operations to process all input samples. The chroma horizontal interpolation is implemented very similarly to the luma optimisation; the only difference is that the number of intermediate results in groups A and B, as shown in Figure 4.1, is two instead of four.

Figure 4.2: Horizontal luminance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs(2)

Second, we describe how to optimise the chroma vertical interpolation, again for 10-bit input. As Figure 4.3 shows, we first load the input samples into four 512-bit vectors. In vertical interpolation, the calculation uses these four vectors directly, without shuffling. Then we pack the input vectors into intermediate vectors and multiply them by the coefficients from Table 4.1, as shown in Figure 4.4. Afterwards, we add the intermediate results and apply the shift and rounding operations to obtain the final results, similarly to the horizontal optimisation, and loop this process over all input samples. The luma vertical interpolation is optimised in the same way; the difference is that the number of input vectors is eight instead of four.

Figure 4.3: Vertical chrominance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (1)

4.2 Inverse Transform

As discussed in Section 3.2, the inverse transform kernel benefits noticeably when SIMD is applied to the HEVC decoding process, so we optimise this kernel with AVX512 SIMD as well. Since the block size of the inverse transform is up to 32x32 and AVX512 SIMD ISAs have a 512-bit vector size, we implement the AVX512 SIMD optimisation only for 32x32 TBs.

Figure 4.4: Vertical chrominance interpolation optimisation for 10-bit input sample by using AVX512 SIMD ISAs (2)

Figure 4.5 shows the 1-D transform for 32x32 TBs[8]. Basically, the computation of a 1-D transform consists of a series of matrix-vector multiplications followed by adding/subtracting the partial results. A 2-D inverse transform is typically performed as two 1-D transforms, first on the columns and then on the rows of the transform block. Since the columns are independent of each other, SIMD is well suited to optimising this kernel. The AVX512 implementation performs a process similar to Figure 4.5, except that every element of the input vector x is no longer a single integer but a 512-bit vector. For example, on the right-hand side of Figure 4.5, the odd-position input elements of x are multiplied by a 16x16 matrix; in the AVX512 implementation, we multiply odd-position 512-bit input vectors by the 16x16 matrix. The other parts perform similar operations. We take this part as an example to explain in detail how the input elements are loaded and multiplied by the matrix.

Figure 4.5: HEVC 1D inverse transform for 32x32 TBs[8].

First, we pack the odd-position inputs two by two. For example, the first two odd-position input vectors (X1 and X3) are packed as shown in Figure 4.6 into two intermediate vectors (X13LOW and X13HIGH) using the VPUNPCKLWD and VPUNPCKHWD instructions.

Figure 4.6: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (1).

Then we multiply these two intermediate vectors by a new matrix, shown on the right-hand side of Figure 4.7. To be exact, this new 16x16 matrix is obtained by transposing the 1-D inverse transform 16x16 matrix and packing each pair of elements in a row in the way shown in Figure 4.7; the purpose of pre-reorganising the matrix is to simplify the calculation. We then broadcast every two elements of this new 16x16 matrix (the elements inside one circle in Figure 4.7) into a 512-bit vector using VPBROADCASTD, and multiply this vector by the two intermediate input vectors, respectively. Afterwards, by looping this process over the new 16x16 matrix, column first and then row, we obtain the partial results for the odd-position inputs. The partial results for the other positions are obtained in a similar way. The difference between the 1-D transform and our AVX512 implementation is that in our implementation there are two partial results for each position, one "low" and one "high", since the input is packed into "low" and "high" vectors as shown in Figure 4.6. The method of combining these partial results therefore also differs from the scalar 1-D inverse transform, as shown in Figure 4.8 (the add/subtract computation is the same as in Figure 4.5, but the last step differs). Finally, we transpose the 32x32 vectors obtained in Figure 4.8 and loop this process for the entire inverse transform.

Figure 4.7: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (2).
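The even/odd add/subtract structure described above can be illustrated with a scalar sketch of the small 4-point case (the 32x32 transform optimised here follows the same pattern with larger matrices, and the AVX512 version replaces each scalar input with a 512-bit vector). This is an illustrative sketch under those assumptions, not the thesis code, and the final clipping stage is omitted.

```c
#include <stdint.h>

/* Scalar sketch of HEVC's even/odd (partial butterfly) 1-D inverse
 * transform, shown for the 4-point case: the even part is computed from
 * the even-position inputs, the odd part from the odd-position inputs,
 * and the outputs are formed by adding/subtracting the two. */
static void inv_transform4(const int16_t src[4], int16_t dst[4], int shift)
{
    const int add = 1 << (shift - 1);        /* rounding offset */
    /* even part: rows {64,64} and {64,-64} applied to src[0], src[2] */
    int e0 = 64 * src[0] + 64 * src[2];
    int e1 = 64 * src[0] - 64 * src[2];
    /* odd part: rows {83,36} and {36,-83} applied to src[1], src[3] */
    int o0 = 83 * src[1] + 36 * src[3];
    int o1 = 36 * src[1] - 83 * src[3];
    /* combine: first half is e+o, second (reversed) half is e-o */
    dst[0] = (int16_t)((e0 + o0 + add) >> shift);
    dst[1] = (int16_t)((e1 + o1 + add) >> shift);
    dst[2] = (int16_t)((e1 - o1 + add) >> shift);
    dst[3] = (int16_t)((e0 - o0 + add) >> shift);
}
```

Because the odd part reuses the same two products for the first and last outputs (with a sign flip), SIMD can compute the matrix-vector products once and form both halves by a single add and subtract, which is exactly what the AVX512 implementation does across 512-bit columns.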

4.3 Deblocking Filter

In the HEVC decoding process, the deblocking filter also benefits considerably from SIMD. In this part, we explain how to optimise this kernel with AVX512 SIMD ISAs. The deblocking filter can be optimised by processing multiple edge parts simultaneously[8], as shown in Figure 4.9. Since the computation precision of the deblocking filter is within 16 bits (for bit depths below 11), 32 16-bit computation lanes are available, so 16 edge parts can be processed simultaneously with AVX512 SIMD ISAs. For this kernel, we optimise the luma filter with AVX512 SIMD ISAs.

Figure 4.8: HEVC inverse transform optimisation for 32x32 TBs by AVX512 SIMD ISAs (3).

Figure 4.9: Samples of one edge part around a block edge involved in the deblocking filter.

First, when the boundary strength is larger than 0, the samples of 16 edge parts are loaded for the horizontal edge filter (for the vertical edge filter, these samples are transposed after loading). When optimising the deblocking filter kernel with AVX512 SIMD ISAs, we found that only the first and last samples of each edge part are needed for the filter decision. Therefore, after loading all samples of the 16 edge parts, we swizzle their first and last samples (shown in red in Figure 4.9) into vectors as shown in Figure 4.10. Figure 4.10 shows the loading and swizzling process for the P3 samples of the 16 edge parts in Figure 4.9; we repeat this process for all edge-part samples from P3 to Q3.

Figure 4.10: Swizzle first and last samples of 16 edge parts into a vector.

Afterwards, for each edge part, we evaluate the following filtering expressions 4.1 to 4.7[8] for the first and last samples. Using AVX512 SIMD ISAs, these expressions can be evaluated for 16 edge parts at a time, as mentioned before.

dp0 = |p2,0 − 2p1,0 + p0,0|   (4.1)
dp3 = |p2,3 − 2p1,3 + p0,3|   (4.2)
dq0 = |q2,0 − 2q1,0 + q0,0|   (4.3)
dq3 = |q2,3 − 2q1,3 + q0,3|   (4.4)
dpq0 = dp0 + dq0   (4.5)
dpq3 = dp3 + dq3   (4.6)
filter = dpq0 + dpq3 < β   (4.7)

where px,y and qx,y are the samples indicated in red in Figure 4.9, and the value of β depends on the QP of the p and q samples. Then, if the filter decision evaluates to true, we use the same swizzled results to decide whether the filter is strong or normal, using the following expressions[8]:

strong0 = (|p3,0 − p0,0| + |q0,0 − q3,0| < β/8) ∧ (dpq0 < β/8) ∧ (|p0,0 − q0,0| < 2.5tc)   (4.8)
strong3 = (|p3,3 − p0,3| + |q0,3 − q3,3| < β/8) ∧ (dpq3 < β/8) ∧ (|p0,3 − q0,3| < 2.5tc)   (4.9)
strong = strong0 ∧ strong3   (4.10)
normal = ¬strong   (4.11)
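A scalar sketch of the filter decision of Equations 4.1-4.7 for one edge part may help; the array layout and names are illustrative, and the AVX512 code evaluates this for 16 edge parts in parallel using the swizzled first and last rows (y = 0 and y = 3).

```c
#include <stdlib.h>

/* Scalar sketch of the deblocking filter decision (Equations 4.1-4.7)
 * for one edge part.  p[x][y] and q[x][y] hold the samples on either
 * side of the edge, x being the distance from the edge and y the row. */
static int filter_decision(int p[4][4], int q[4][4], int beta)
{
    int dp0 = abs(p[2][0] - 2 * p[1][0] + p[0][0]);   /* (4.1) */
    int dp3 = abs(p[2][3] - 2 * p[1][3] + p[0][3]);   /* (4.2) */
    int dq0 = abs(q[2][0] - 2 * q[1][0] + q[0][0]);   /* (4.3) */
    int dq3 = abs(q[2][3] - 2 * q[1][3] + q[0][3]);   /* (4.4) */
    int dpq0 = dp0 + dq0;                             /* (4.5) */
    int dpq3 = dp3 + dq3;                             /* (4.6) */
    return dpq0 + dpq3 < beta;                        /* (4.7) */
}
```

The second differences dp and dq measure local curvature, so a flat block passes the test (and gets filtered), while a genuinely textured block fails it and is left alone.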

The results we obtain are a strong/normal mask, a normal second-sample mask, and a lossless mask. Then we switch back to the original input samples (Luma0 and Luma1 in Figure 4.10) and apply the actual filter. When applying the actual filter with AVX512 SIMD ISAs, we process 8 edge parts at once. According to the filter decision, we apply the strong, normal or lossless filter to the input samples. The entire process is shown in Figure 4.11; finally, we loop this process over all input samples.

Figure 4.11: Luminance filter process in deblocking filter kernel.
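For the normal (weak) filter branch, the HEVC specification applies a clipped correction Δ to the two samples nearest the edge. The following scalar sketch follows the specification's formula rather than the thesis code; the AVX512 version applies it to the samples of 8 edge parts at once under the decision masks described above.

```c
#include <stdint.h>
#include <stdlib.h>

static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

/* Scalar sketch of the HEVC normal (weak) luma filter for one sample
 * pair across the edge: p0/q0 are the edge samples, p1/q1 their
 * neighbours, tc the clipping threshold, max_val the sample maximum. */
static void normal_filter(int16_t *p0, int16_t *q0,
                          int16_t p1, int16_t q1, int tc, int max_val)
{
    int delta = (9 * (*q0 - *p0) - 3 * (q1 - p1) + 8) >> 4;
    if (abs(delta) >= 10 * tc)
        return;                       /* step too large: leave unfiltered */
    delta = clip3(-tc, tc, delta);
    *p0 = (int16_t)clip3(0, max_val, *p0 + delta);
    *q0 = (int16_t)clip3(0, max_val, *q0 - delta);
}
```

The tc-bounded clipping is what the lossless mask must override: for lossless-coded blocks the samples are left untouched rather than corrected.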


5 Results

In this thesis, the HEVC decoder implemented by Chi et al.[8] in 2015 is used as the baseline. The operating system is the Ubuntu 15.10 distribution with Linux kernel 4.2, and the program is compiled with GCC 5.2 at -O3 optimisation. Two video test sets were selected for the experiments: one includes three 1080p 8-bit sequences from the JCT-VC test set, and the other includes two 2160p 10-bit sequences[4]. All videos were encoded with the HM-8.0 reference encoder using 4 QP points (22, 27, 32, 37). The 1080p sequences were encoded with the random access main (8-bit) configuration and the 2160p sequences with the random access main10 (10-bit) configuration.

Figure 5.1: Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementations compared with the scalar program for 1080p 8-bit resolution video sets.

Hardware supporting the AVX512 SIMD ISAs is not available yet; as mentioned in Section 3.1, the Intel Software Development Emulator (Intel SDE), version 7.49, has been extended to support the AVX512 SIMD extensions. Therefore, in this thesis, we use the SDE to run our optimised HEVC decoder and obtain the instruction-mix histograms used to present our experimental results. However, since this emulator is aimed at debugging, the available measurements are quite limited (for example, IPC counters, execution time, and per-stage measurements are not available in the SDE). Based on the capabilities of the emulator, the results therefore focus on the number of executed instructions. In the following, we mainly discuss by what percentage the number of executed instructions is reduced compared to the AVX2-optimised HEVC decoder and to the scalar HEVC decoder.

A Overall optimisation performance:

* Compared with the scalar HEVC decoder: Figures 5.1 and 5.2 show the average percentage of reduced executed instructions with AVX2 and AVX512 compared with the scalar program for the 1080p 8-bit and 2160p 10-bit resolution video sets, respectively.

Figure 5.2: Average proportion of the number of reduced executed instructions of the AVX2/AVX512 implementations compared with the scalar program for 2160p 10-bit resolution video sets.

From these two figures, we can see that, in general, with either AVX2 or AVX512, the number of executed instructions is reduced by more than 80% compared with scalar HEVC decoding. We can also see that the average percentage of reduced executed instructions increases with the QP level; for both 8-bit and 10-bit videos, at QP 37 the reduction can even exceed 90%. Based on Figures 5.1 and 5.2, we can also see that optimising the HEVC decoding with AVX512 reduces a higher average percentage of executed instructions than AVX2. The comparison between AVX2 and AVX512 is presented in the next part.

Figure 5.3: Average percentage of the reduced executed instructions by AVX512 compared with the AVX2 program for 1080p 8-bit and 2160p 10-bit resolution video sets.

* Compared with the HEVC decoder optimised by AVX2 SIMD ISAs: Figure 5.3 shows that optimising the HEVC decoding with AVX512 SIMD ISAs reduces the average number of executed instructions by up to 23% for 1080p 8-bit videos and 31% for 2160p 10-bit videos. Considering the different QP levels, the average percentage of reduced executed instructions again increases with the QP level: at high QP the optimisation is much more pronounced. Therefore, in the next part, we take the highest-QP (37) video sets as the reference test sets to show the distribution of the number of executed instructions when AVX2 and AVX512 SIMD ISAs are applied to the HEVC decoding, and we investigate how this distribution changes between the AVX2 and AVX512 optimisations.

B Proportion distribution of the number of executed instructions:

In Figure 5.4 and Figure 5.5, we can see that the distributions change when the HEVC decoding is optimised with AVX512 SIMD ISAs instead of AVX2: the AVX512 instructions now form the main part of the SIMD instructions. The total number of executed instructions is reduced because an AVX512 register is twice as wide as an AVX2 register, so twice as many data elements can be processed by one instruction, which is consistent with the previous part.

Figure 5.4: Proportion distribution of the execution instructions’ number when applying AVX2 and AVX512 to decode the 1080p 8bit resolution video sets.

However, these two figures also show that some AVX2 instructions remain in the HEVC decoding process. This is because only the most beneficial kernels are optimised with AVX512 SIMD ISAs in this thesis. Since the AVX512 SIMD extensions double the register size compared to AVX2, loading and storing data from memory for AVX512 instructions takes more time; a kernel that executes few instructions and consumes little execution time would not benefit noticeably from AVX512, so we do not optimise all kernels with AVX512 SIMD ISAs. This is also why some AVX2 instructions remain.

Figure 5.5: Proportion distribution of the execution instructions’ number when applying AVX2 and AVX512 to decode the 2160p 10bit resolution video sets.

C Per-stage performance:

We have discussed the overall performance of the AVX512 SIMD optimisation. In this part, we discuss the per-stage results. Figure 5.6 and Figure 5.7 show that, in general, the average percentage of executed instructions reduced by the AVX512 optimisation compared with the AVX2 optimisation increases with the QP level for both the 1080p 8-bit and the 2160p 10-bit video sets, which is consistent with part A of this section. The distribution figures show that the interpolation and deblocking filters behave similarly: both are more effective when the QP level is high. The inverse transform, however, behaves differently; based on the results, it is more effective when the QP level is low. This is because the coefficient scan pattern of HEVC concentrates the coefficients in the top-left corner of each 32x32 TB, and the rest of the transform block often contains many zeros, especially at high QP levels. Since there are more non-zero transform coefficients at low QP levels, the inverse transform benefits more when the QP level is low. Ideally, the IPC, the execution time, and a system profile should also be included. However, because of the limitations of the SDE, these results are not available; such measurements can only be performed once a real product containing the AVX512 SIMD ISAs becomes available.

Figure 5.6: Proportion distribution of the average percentage of executed instructions reduced by the AVX512 program compared with the AVX2 program for the 1080p 8-bit video sets, for each stage.

Figure 5.7: Proportion distribution of the average percentage of executed instructions reduced by the AVX512 program compared with the AVX2 program for the 2160p 10-bit video sets, for each stage.
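The QP dependence of the inverse transform can be illustrated with a small sketch. A decoder can skip inverse-transform work for coefficient regions that are entirely zero, which happens more often at high QP. The helper below is a hypothetical scalar illustration of that test (a real decoder would use the coded-sub-block flags from the bitstream rather than scanning the coefficients):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if any coefficient in a w x h region of a transform block
   is non-zero, 0 otherwise. `stride` is the width of the full block.
   At high QP, most regions outside the top-left corner are all zero,
   so the inverse transform for them can be skipped entirely. */
int region_has_coeffs(const int16_t *coeff, size_t stride,
                      size_t w, size_t h) {
    for (size_t y = 0; y < h; ++y)
        for (size_t x = 0; x < w; ++x)
            if (coeff[y * stride + x] != 0)
                return 1;
    return 0;
}
```

For a typical high-QP 32x32 TB, only the top-left region reports coefficients, so the SIMD inverse transform has little non-zero data to accelerate, which matches the per-stage results above.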


6 Conclusions and Discussions

As with previous SIMD optimisations of HEVC decoding, the AVX512 SIMD extensions also improve HEVC decoding. Compared to the AVX2 SIMD optimisation, the number of executed instructions can be reduced by up to 23% for the 1080p 8-bit video sets and by up to 31% for the 2160p 10-bit video sets. Meanwhile, compared to scalar HEVC decoding, 85% and 83% of the executed instructions can be eliminated for the 1080p 8-bit and 2160p 10-bit video sets, respectively. The acceleration solutions in this thesis allow HEVC decoding to be accelerated easily once the next-generation Intel products containing the AVX512 SIMD extensions become available.

The initial goal of this thesis was to investigate the influence of the AVX512 SIMD extensions on optimising HEVC decoding. Based on the experimental results, HEVC decoding does benefit considerably from AVX512 SIMD. The large reduction in executed instructions, however, can only be achieved with high programming complexity and effort. Moreover, not all kernels are suitable for AVX512 SIMD: although AVX512 doubles the register size compared to AVX2, so that twice as many elements can be processed at the same time, it also restricts the block sizes that can be processed efficiently (the optimised HEVC decoder in this thesis is only effective for blocks that are 32 samples wide). This is also why not all video sets benefit from the optimised HEVC decoder and why not all kernels were implemented.

In general, the performance gain of the AVX512-optimised HEVC decoder is more pronounced when the QP level is high. This indicates that low-QP videos do not benefit from AVX512 SIMD as much as high-QP videos. At the same time, not all kernels benefit more from AVX512 as the QP level increases; the inverse transform kernel, for example, does not, since at high QP levels many transform coefficients are zero and cannot benefit from AVX512 SIMD.
This indicates a limitation of AVX512 SIMD optimisation for HEVC decoding and reminds us to be more careful when applying new SIMD extensions to inverse transform kernels in the future.

In this thesis, all measurements and results are based on the number of executed instructions, because real products containing the AVX512 SIMD extensions are not yet available and the SDE is limited (at the moment, only instruction counts can be measured). Therefore, more detailed performance measurements, such as memory access time, IPC, and total decoding time, cannot be performed for now. In the future, when real products are released, real-time measurements such as decoding time and IPC should be performed. Based on those measurements, we can further investigate the relation between the number of reduced instructions and the decoding time, learn how to take advantage of the AVX512 SIMD extensions more effectively, and avoid risks when using them to optimise HEVC decoding.
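The block-width restriction discussed above could, for instance, be handled with a simple run-time kernel dispatch. The sketch below is a hypothetical illustration (the enum and function names are ours, not from the decoder): only blocks at least 32 samples wide can fill a whole 512-bit register with 16-bit lanes per row, so narrower blocks fall back to narrower kernels.

```c
/* Hypothetical kernel selection by block width, assuming 16-bit
   intermediate samples: a 512-bit zmm register holds 32 lanes and a
   256-bit ymm register holds 16 lanes, so a block must be at least
   that wide to fill one register per row. */
typedef enum { KERNEL_SCALAR, KERNEL_AVX2, KERNEL_AVX512 } kernel_t;

kernel_t pick_kernel(int block_width) {
    if (block_width >= 32) return KERNEL_AVX512; /* full zmm row */
    if (block_width >= 16) return KERNEL_AVX2;   /* full ymm row */
    return KERNEL_SCALAR;                        /* too narrow */
}
```

Under this scheme, only the 32-wide (and larger) blocks reach the AVX512 path, which is consistent with the observation that the optimised decoder is only effective for 32-wide blocks.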
