Vili Viitamäki High-Level Synthesis Implementation of Hevc Intra Encoder
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by TUT DPub VILI VIITAMÄKI HIGH-LEVEL SYNTHESIS IMPLEMENTATION OF HEVC INTRA ENCODER Master of Science Thesis Examiner: Ass. Prof. Jarno Vanne Examiner and topic approved by the Council of the Faculty of Computing and Electrical Engineering on 28th March 2018 ABSTRACT VILI VIITAMÄKI: High-Level Synthesis Implementation of HEVC Intra Encoder Tampere University of Technology Master of Science Thesis, pages 45 October 2018 Master’s Degree Programme in Information Technology Major: Embedded Systems Examiner: Assistant Professor Jarno Vanne Keywords: High Efficiency Video Coding (HEVC), High-Level Synthesis (HLS), Intra coding, Video encoder, Field programmable gate array (FPGA) High Efficiency Video Coding (HEVC) is the latest video coding standard that aims to alleviate the increasing transmission and storage needs of modern video applications. Compared with its predecessor, HEVC is able to halve the bit rate required for high qual- ity video, but at the cost of increased complexity. High complexity makes HEVC video encoding slow and resource intensive but also ideal for hardware acceleration. With increasingly more complex designs, the effort required for traditional hardware de- velopment at register-transfer level (RTL) grows substantially. High-Level Synthesis (HLS) aims to solve this by raising the abstraction level through automatic tools that gen- erate RTL-level code from general programming languages like C or C++. In this Thesis, we made use of Catapult-C HLS tool to create an intra coding accelerator for an HEVC encoder on a Field Programmable Gate Array (FPGA). We used the C source code of Kvazaar open-source HEVC encoder as a reference model for accelerator implementation. Over 90 % of the implementation including all major intra coding tools were implemented with HLS, with the rest being ready made IP blocks and hand-written RTL components. The accelerator was synthesized into an Arria 10 FPGA chip that was able to accommo- date three accelerators and associated interface components. With two FPGAs connected to a high-end PC, our encoder was able to encode 2160p Ultra-High definition (UHD) video at 123 fps. Total FPGA resource usage was around 80 % with 346k Adaptive logic modules (ALMs) and 1227 Digital signal processors (DSPs). TIIVISTELMÄ Vili VIITAMÄKI: HEVC intra kooderin toteuttaminen korkean tason synteesillä Tampereen teknillinen yliopisto Diplomityö, 45 sivua Lokakuu 2018 Tietotekniikan DI-tutkinto-ohjelma Pääaine: Sulautetut Järjestelmät Tarkastaja: Apulaisprofessori Jarno Vanne Avainsanat: High Efficiency Video Coding (HEVC), High-Level Synthesis (HLS), Intra koodaus, Videonpakkaus, Field programmable gate array (FPGA) High Efficiency Video Coding (HEVC) on viimeisin videonpakkausstandardi, joka on kehitetty uusien multimediasovellusten kasvaviin tarpeisiin. HEVC:n tavoitteena on puo- littaa videon tarvitsema bittinopeus edeltävään videonpakkausstandardiin verrattuna hei- kentämättä kuvanlaatua. HEVC:n suuri kompleksisuus tekee HEVC videon pakkaami- sesta käytännön sovelluksissa hidasta ja resursseja vaativaa. Tästä johtuen HEVC on hyvä kandidaatti rautakiihdytettäväksi. Entistä monimutkaisempien ja kasvavien mallien vuoksi perinteinen RTL (Register- Transfer Level) tason laitteistokehitys on yhä vaativampaa. High-Level Synthesis (HLS) -työkalut yrittävät ratkaista tämän nostamalla laitteistokehityksen abstraktiotasoa ja auto- matisoimalla RTL-tason koodin generoinnin korkeamman tason mallista, kuten esimer- kiksi C ja C++ koodista. Tässä työssä käytettiin Catapult-C HLS -työkalua luomaan ohjelmoitavalle logiikkapii- rille (FPGA) kiihdytin, joka nopeuttaa HEVC:n intra pakkausta. Lähtökohtana työssä käytettiin Kvazaar HEVC kooderin avointa C kielistä lähdekoodia. Kaikki merkittävim- mät HEVC:n koodaustyökalut toteutettiin HLS-työkalua käyttäen niin, että FPGA:lla kyetään suorittamaan kaikki raskas laskenta. Toteutettu kiihdytin syntesoitiin Arria 10 FPGA piirille. Tälle piirille oli mahdollista si- sällyttää kolme kiihdytintä rajapintakomponenttien lisäksi. Järjestelmää testattiin käyttä- mällä kahta FPGA-korttia ja yhteensä kuutta kiihdytintä tehokkaassa testikoneessa. Tällä kokoonpanoilla saavutimme pakkausnopeudeksi 123 kuvaa sekunnissa, joka on melkein seitsemän kertaa nopeampi kuin kiihdyttämätön järjestelmä. Kolmella kiihdyttimellä yh- den FPGA:n resurssikäyttö oli noin 80 %, sisältäen 346k logiikkaelementtiä ja 1227 sig- naaliprosessoria. PREFACE This Master of Science Thesis was written in the Laboratory of Pervasive Computing at Tampere University of Technology as a part of a research. I want to thank my examiner Jarno Vanne for providing me the opportunity to work in the university and for guidance during and before this Thesis. I would also like to thank Panu Sjövall and all others who help me during this Thesis. Tampere, 22.10.2018 Vili Viitamäki TABLE OF CONTENTS 1. INTRODUCTION .................................................................................................... 1 2. BACKGROUND ...................................................................................................... 2 2.1 High-Level Synthesis (HLS) .......................................................................... 2 2.2 Kvazaar........................................................................................................... 3 2.3 Verification..................................................................................................... 3 2.4 Design flow .................................................................................................... 4 2.5 Field Programmable Gate Arrays (FPGAs) ................................................... 5 3. HIGH EFFICIENCY VIDEO CODING ................................................................... 6 3.1 Overview ........................................................................................................ 6 3.2 Intra Prediction ............................................................................................... 8 3.3 Transform ....................................................................................................... 9 3.4 Quantization ................................................................................................. 10 3.5 Inverse Transform ........................................................................................ 10 3.6 Entropy Coding ............................................................................................ 11 4. SYSTEM OVERVIEW ........................................................................................... 12 4.1 User Space .................................................................................................... 12 4.2 Kernel Space ................................................................................................ 13 4.3 PCIe and DMAs ........................................................................................... 14 4.4 FPGA Memories .......................................................................................... 15 5. INTRA CODING ACCELERATOR ...................................................................... 17 5.1 Intra coding control ...................................................................................... 18 5.2 Get Border .................................................................................................... 20 5.3 Intra Prediction ............................................................................................. 22 5.3.1 Control ........................................................................................... 23 5.3.2 Predictions ...................................................................................... 23 5.3.3 Mode selection ............................................................................... 25 5.3.4 Prediction Buffer ............................................................................ 26 5.4 Transform ..................................................................................................... 27 5.4.1 32-point DCT block ....................................................................... 27 5.4.2 4-point DST block .......................................................................... 29 5.5 Quantization & Dequantization .................................................................... 30 5.6 Inverse Transform ........................................................................................ 30 5.7 Reconstruction .............................................................................................. 31 5.8 Coefficient Cost............................................................................................ 32 5.9 CU Stack ...................................................................................................... 33 5.9.1 Stack Push ...................................................................................... 34 5.9.2 Stack Pull ....................................................................................... 35 5.10 Transpose ..................................................................................................... 36 5.10.1 Transpose Push .............................................................................. 37 5.10.2 Transpose Pull ................................................................................ 38 6. ANALYSIS ............................................................................................................