Simulation-Based HW/SW Co-Exploration of the Concurrent Execution of HEVC Intra Encoding Algorithms for Heterogeneous Multi-Core Architectures
Accepted Manuscript

PII: S1383-7621(16)30277-6
DOI: 10.1016/j.sysarc.2016.12.009
Reference: SYSARC 1408
To appear in: Journal of Systems Architecture
Received date: 1 February 2016
Revised date: 22 December 2016
Accepted date: 23 December 2016

Please cite this article as: Jens Brandenburg, Benno Stabernack, Simulation-based HW/SW Co-Exploration of the Concurrent Execution of HEVC Intra Encoding Algorithms for Heterogeneous Multi-Core Architectures, Journal of Systems Architecture (2016), doi: 10.1016/j.sysarc.2016.12.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Jens Brandenburg, Benno Stabernack
Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Video Coding & Analytics Department, Embedded Systems Group, Einsteinufer 37, 10587 Berlin, Germany

Abstract

The high efficiency video coding (HEVC) standard shows enhanced video compression efficiency at the cost of high performance requirements. To address these requirements, different approaches such as algorithmic optimization, parallelization and hardware acceleration can be used, leading to a complex design space. In order to find an efficient solution, early design verification and performance evaluation are crucial.
Here, the prevailing methodology is the simulation of the complex HW/SW architecture. When targeting heterogeneous designs, different simulation models have different performance evaluation capabilities, making a combined HW/SW co-analysis of the entire system a cumbersome task. To facilitate this co-analysis, we propose a non-intrusive instrumentation methodology for simulation models, which automatically adapts to the model under observation. With the help of this instrumentation methodology we perform the analysis and exploration of different design aspects of a SystemC-based heterogeneous multi-core model of an HEVC intra encoder. In the course of this HW/SW co-analysis, various aspects of the parallelization and hardware acceleration of the video coding algorithms are presented and further improved. Due to its cycle-accurate nature, the developed model is well suited to facilitate various performance evaluations and to drive HW/SW co-optimizations of the explored system, as discussed in this paper.

Keywords: HEVC, heterogeneous, multi-core, HW/SW co-design, performance analysis, SystemC

Email addresses: [email protected] (Jens Brandenburg), [email protected] (Benno Stabernack)

Preprint submitted to Elsevier, December 24, 2016

1. Introduction

Today's trend towards heterogeneous multi-core systems has led to a huge increase in the complexity of the HW/SW co-design process. Early systems implemented simple functions on platforms typically using a single processing element and a dedicated hardware accelerator [1, 2]. In today's systems, multiple processing elements, hardware accelerators and memory blocks communicate with each other via synchronous and/or asynchronous interconnects [3, 4]. Mapping and optimizing a complex algorithm, like a state-of-the-art video codec, onto such a heterogeneous system requires an in-depth performance and bottleneck analysis.
Typically, simulation is the prevailing methodology for evaluating developed designs during the HW/SW co-design process [5]. An advantage of these simulation-based evaluation methodologies is their accuracy with respect to functional as well as performance-related properties. Of course, this accuracy may vary during the HW/SW co-design process, because high-accuracy simulation models of hardware components may not be available from the beginning and have to be developed first, which leads to a successive refinement of the platform simulation model and its accuracy in the typical top-down design process [6]. As a result, the platform simulation model evolves over time, which requires an ongoing adaptation of the observation functionalities. In addition, there exist many different vendor-specific analysis tools for various processor architectures and platform components, addressing different aspects of the overall co-optimization problem [7], which makes the combination and integration of different analysis data a challenging task, especially for heterogeneous platforms. Moreover, it is important to combine the profiling results with information gathered from dedicated components, like interrupts, signals and/or synchronization events, representing the actual hardware platform.

To overcome these issues we propose a flexible analysis methodology capable of non-intrusively tracing all hardware aspects of the modeled simulation platform. Based on this methodology we developed a tool which gives a comprehensive overview of the software tasks running on the various processing elements of the particular execution platform. Furthermore, our tool provides detailed memory access and performance analyses based on SystemC virtual platform simulation models for heterogeneous embedded multi-core platforms.
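The idea of attaching observation from outside the model can be illustrated with a minimal sketch. It is written in plain Python rather than SystemC, and all names (`Signal`, `Module`, `Tracer`) are hypothetical; it is not the tool developed in this work, only an illustration under these assumptions of how a tracer might discover observables by walking a model hierarchy and merge events from different components into one timeline.

```python
class Signal:
    """Hypothetical observable: a named value inside a simulation model."""

    def __init__(self, name):
        self.name = name
        self.value = 0
        self.observers = []  # callbacks attached from outside the model

    def write(self, time, value):
        # Model code only calls write(); it contains no tracing logic.
        self.value = value
        for notify in self.observers:
            notify(time, value)


class Module:
    """Hypothetical model component holding signals and child modules."""

    def __init__(self, name, signals=(), children=()):
        self.name = name
        self.signals = list(signals)
        self.children = list(children)


class Tracer:
    """Walks a model hierarchy and attaches itself to every signal found."""

    def __init__(self):
        self.events = []  # (time, hierarchical signal name, value)

    def attach(self, module, prefix=""):
        path = prefix + module.name
        for sig in module.signals:
            full_name = path + "." + sig.name
            # Bind full_name as a default so each callback keeps its own name.
            sig.observers.append(
                lambda t, v, full_name=full_name: self.events.append((t, full_name, v))
            )
        for child in module.children:
            self.attach(child, path + ".")

    def timeline(self):
        # Merge events from all components into one time-ordered view.
        return sorted(self.events, key=lambda event: event[0])


# Assumed example platform: one processor and one accelerator.
irq = Signal("irq")
busy = Signal("busy")
top = Module("top", children=[
    Module("cpu0", signals=[irq]),
    Module("accel", signals=[busy]),
])

tracer = Tracer()
tracer.attach(top)  # the model classes themselves are not modified

busy.write(20, 1)
irq.write(10, 1)
busy.write(30, 0)
print(tracer.timeline())
```

Because the tracer in this sketch finds signals by traversing the hierarchy, adding a component to the hypothetical model requires no change to the tracing code.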
With the help of this simulation-based analysis tool we perform the HW/SW co-exploration of a complex, state-of-the-art video coding algorithm, namely a high efficiency video coding (HEVC) intra encoder.

The HEVC standard [8, 9] has been designed to supersede the very successful and well-established H.264/AVC standard [10]. To this end, HEVC aims to achieve subjective video quality equivalent to that of its predecessor while doubling the video compression efficiency [11]. This increased video compression efficiency is a result of novel encoding algorithms, which accordingly increase the algorithm complexity and computational resource requirements. A comparison of the reference encoder models of HEVC and H.264/AVC shows a 23% bit-rate decrease at an equivalent objective video compression quality, in conjunction with a 3.2× increase in execution time for the HEVC main intra configuration [12]. Of course, the future success of HEVC will also depend on the availability of fast and cost-effective encoder and decoder solutions.

Massively parallel approaches seem capable of real-time HEVC video encoding, with the disadvantage of high costs for system acquisition and power consumption. Heterogeneous designs with HW-accelerated components, on the other hand, promise a huge decrease in power consumption, but HW/SW co-exploration and co-optimization in particular is an expensive task. In order to benefit from both design approaches we propose a parallel and heterogeneous HEVC encoder model, which can be used for functional as well as performance evaluations.

The proposed encoder model supports the HEVC intra profile, which can be used for a wide range of application domains such as low latency applications, contribution and studio applications, or applications with increased fault tolerance.
For example, low latency and increased fault tolerance are typical requirements of automotive applications, like rear view cameras and other camera-based obstacle detection systems. When looking at parallelization strategies for HEVC encoders, most approaches focus on a combination of frame level and coding tree unit (CTU) level parallelization. Of course, this contradicts the low latency or low memory requirements of certain intra-only applications. Therefore we propose a Sub-CTU level parallelization approach and show possible design improvements to benefit from such a parallelization scheme.

Starting from a serial software-only description of the algorithm, the use case shows the HW/SW co-design of a parallel and HW-accelerated HEVC intra encoder. Based on a SystemC simulation platform of the developed encoder we perform different performance evaluations and show possible optimizations targeting hardware as well as software aspects of the entire system. For these performance evaluations we use a non-intrusive instrumentation methodology, which supports the analysis of all SystemC simulation model internal observables and facilitates the combination of analysis data from different platform components to provide a comprehensive evaluation of the complete system.

The remainder of this paper is organized as follows. In Sec. 2 a discussion of related work for simulation-based performance analysis on the one hand and HEVC encoder implementations on the other is presented. In Sec. 3 we describe the simulation-based HW/SW co-analysis framework used in the following co-exploration. The details of the explored HEVC video encoder algorithms are presented in Sec. 4. The initially sequential algorithm is mapped onto a simple embedded platform with a single processing element to provide a starting point for the co-exploration as described in Sec.