Accepted Manuscript

Simulation-based HW/SW Co-Exploration of the Concurrent Execution of HEVC Intra Encoding Algorithms for Heterogeneous Multi-Core Architectures

Jens Brandenburg, Benno Stabernack

PII: S1383-7621(16)30277-6
DOI: 10.1016/j.sysarc.2016.12.009
Reference: SYSARC 1408

To appear in: Journal of Systems Architecture

Received date: 1 February 2016
Revised date: 22 December 2016
Accepted date: 23 December 2016

Please cite this article as: Jens Brandenburg, Benno Stabernack, Simulation-based HW/SW Co- Exploration of the Concurrent Execution of HEVC Intra Encoding Algorithms for Heterogeneous Multi- Core Architectures, Journal of Systems Architecture (2016), doi: 10.1016/j.sysarc.2016.12.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Simulation-based HW/SW Co-Exploration of the Concurrent Execution of HEVC Intra Encoding Algorithms for Heterogeneous Multi-Core Architectures

Jens Brandenburg a, Benno Stabernack a

a Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, Video Coding & Analytics Department, Embedded Systems Group, Einsteinufer 37, 10587 Berlin, Germany

Abstract

High efficiency video coding (HEVC) shows enhanced video compression efficiency at the cost of high performance requirements. To address these requirements, different approaches, such as algorithmic optimization, parallelization and hardware acceleration, can be used, leading to a complex design space. In order to find an efficient solution, early design verification and performance evaluation are crucial. Here the prevailing methodology is the simulation of the complex HW/SW architecture. Targeting heterogeneous designs, different simulation models have different performance evaluation capabilities, making a combined HW/SW co-analysis of the entire system a cumbersome task. To facilitate this co-analysis, we propose a non-intrusive instrumentation methodology for simulation models, which automatically adapts to the model under observation. With the help of this instrumentation methodology we perform the analysis and exploration of different design aspects of a SystemC-based heterogeneous multi-core model of an HEVC intra encoder. In the course of this HW/SW co-analysis various aspects of the parallelization and hardware acceleration of the video coding algorithms are presented and further improved. Due to its cycle accurate nature the developed model is well suited to facilitate various performance evaluations and to drive HW/SW co-optimizations of the explored system, as discussed in this paper.

Keywords: HEVC, heterogeneous, multi-core, HW/SW co-design, performance analysis, SystemC

Email addresses: [email protected] (Jens Brandenburg), [email protected] (Benno Stabernack)

Preprint submitted to Elsevier, December 24, 2016

1. Introduction

The current trend towards heterogeneous multi-core systems has led to a huge increase in the complexity of the HW/SW co-design process. Early systems have implemented simple functions on platforms typically using a single processing element and a dedicated hardware accelerator [1, 2]. In today's systems multiple processing elements, hardware accelerators and memory blocks communicate with each other via synchronous and/or asynchronous interconnects [3, 4]. Mapping and optimizing a complex algorithm, like a state of the art video coding algorithm, to such a heterogeneous system requires an in-depth performance and bottleneck analysis.

Typically, simulation is the prevailing methodology to evaluate developed designs during the HW/SW co-design process [5]. Advantages of these simulation-based evaluation methodologies are their accuracy with respect to functional as well as performance related properties. Of course this accuracy may vary during the HW/SW co-design process, because high accuracy simulation models of hardware components may not be available from the beginning and have to be developed first, which leads to a successive refinement of the platform simulation model and its accuracy in the typical top-down design process [6]. As a result the platform simulation model evolves over time, which requires an ongoing adaption of the observation functionalities. In addition, there exist many different vendor-specific analysis tools for various processor architectures and platform components, addressing different aspects of the overall co-optimization problem [7], which makes the combination and integration of different analysis data a challenging task, especially for heterogeneous platforms. Moreover, it is important to combine the profiling results with information gathered from dedicated components, like interrupts, signals and/or synchronization events, representing the actual hardware platform.

To overcome these issues we propose a flexible analysis methodology capable of non-intrusively tracing all hardware aspects of the modeled simulation platform. Based on this methodology we developed a tool, which gives a comprehensive overview of the software tasks running on the various processing elements of the particular execution platform. Furthermore, our tool provides detailed memory access and performance analyses based on SystemC virtual platform simulation models for heterogeneous embedded multi-core platforms. With the help of this simulation-based analysis tool we perform the HW/SW co-exploration of a complex and state of the art video coding algorithm, namely a high efficiency video coding (HEVC) intra encoder.

The HEVC standard [8, 9] has been designed to supersede the very successful and well established H.264/AVC standard [10]. To this end, HEVC aims to achieve subjective video compression quality equivalent to its predecessor while doubling the video compression efficiency [11]. This increased video compression efficiency is a result of novel encoding algorithms, which accordingly increase the algorithm complexity and computational resource requirements. A comparison of the reference encoder models of HEVC and H.264/AVC shows a 23% bit-rate decrease at an equivalent objective video compression quality in conjunction with a 3.2× increased execution time for the HEVC main intra configuration [12]. Of course the future success of HEVC will also depend on the availability of fast and cost effective encoder and decoder solutions.

Massively parallel approaches seem to be capable of real-time HEVC video encoding, with the disadvantage of high costs for system acquisition and power consumption. Heterogeneous designs with HW-accelerated components, on the other hand, promise a huge decrease in power consumption, but especially their HW/SW co-exploration and co-optimization is an expensive task. In order to benefit from both design approaches we propose a parallel and heterogeneous HEVC encoder model, which can be used for functional as well as for performance evaluations.

The proposed encoder model supports the HEVC intra profile, which can be used in a wide range of application domains, such as low latency applications, contribution and studio applications or applications with increased fault tolerance. For example, low latency and increased fault tolerance are typical requirements of automotive applications, like rear view cameras and other camera-based obstacle detection systems. When looking at parallelization strategies for HEVC encoders, most approaches focus on a combination of frame level and coding tree unit (CTU) level parallelization. Of course this contradicts the low latency or low memory requirements of certain intra-only applications. Therefore we propose a Sub-CTU level parallelization approach and show possible design improvements to benefit from such a parallelization scheme.

Starting from a serial software-only description of the algorithm, the use case shows the HW/SW co-design of a parallel and HW-accelerated HEVC intra encoder. Based on a SystemC simulation platform of the developed encoder we perform different performance evaluations and show possible optimizations targeting hardware as well as software aspects of the entire system. For these performance evaluations we use a non-intrusive instrumentation methodology, which supports the analysis of all SystemC simulation model internal observables and facilitates the combination of analysis data from different platform components to provide a comprehensive evaluation of the complete system.

The remainder of this paper is organized as follows. In Sec. 2 related work on simulation-based performance analysis on the one hand and HEVC encoder implementations on the other is discussed. In Sec. 3 we describe the simulation-based HW/SW co-analysis framework used in the following co-exploration. The details of the explored HEVC video encoder algorithms are presented in Sec. 4. The initially sequential algorithm is mapped onto a simple embedded platform with a single processing element to provide a starting point for the co-exploration, as described in Sec. 5. The Sub-CTU level parallelization scheme is presented in Sec. 6 in conjunction with a co-analysis and a co-optimization of the proposed implementation. In Sec. 7 we present the design of the hardware accelerator components and their optimization. Finally, in Sec. 8 we explore the combined parallel HW/SW HEVC intra encoder developed in this work.

2. Related Work

2.1. Simulation-based HW/SW Co-Analysis

There exist several simulation environments from different vendors targeting various processor architectures for performance analysis and architectural exploration. For example, the Gem5 simulator [13] provides instruction set simulators for Alpha, ARM, SPARC, and x86 processor architectures at different accuracy levels, ranging from simple instruction accurate models up to timing and cycle accurate out-of-order models. These simulation environments are based on a proprietary simulation approach, which directly combines an internal simulation methodology with the model of the simulated architecture. Typically these approaches provide some kind of architecture configuration mechanism to enable various architectural explorations, like the analysis of different cache replacement policies, different cache configurations or different memory access latencies. On the other hand, more complex tasks, like the addition of dedicated hardware accelerators or the integration of different processor architecture models from different simulation environments, are not natively supported and would require the modification of the complex simulation environment code base.

For architectural exploration and modeling of entire systems, including the modeling of heterogeneous designs, the SystemC [14, 15] system level description language has been proposed and standardized. SystemC provides a simulation mechanism based on discrete event simulation to enable system simulation at different levels of accuracy. Furthermore, SystemC defines standardized interfaces to model different hardware aspects and to facilitate the integration of simulation models from different vendors. Based on the SystemC simulation framework, various simulation models for platform components and instruction set simulators have been developed and integrated. For example, the MPARM [16] simulation platform integrates the cycle accurate SWARM [17] instruction set simulator with a bus model and memory components to simulate and explore a SystemC-based multi-core ARM platform. By simulating the entire platform, SystemC provides the means to analyze all aspects of the designed system and to combine software with hardware events of all kinds.

For different analysis purposes different SystemC-based approaches have been developed, which can be distinguished by the way the analysis data is collected from the simulation model. On the one hand there exist manual approaches, where the simulation model has to be instrumented by a user or a developer; on the other hand automatic approaches have been proposed. Example manual instrumentation approaches for the analysis of SystemC-based simulation models can be found in [18, 19, 20] or in [21, 22]. Hereby [18, 19, 20] favor the modification of the components of the platform simulation model, whilst [21, 22] propose an additional SystemC analysis component to be inserted into the platform simulation model to observe the communication between existing platform components. Of course any kind of manual instrumentation requires the modification of the simulation model by a developer, who has to have specific knowledge of the model implementation as well as of the implementation of the instrumentation functions. This requires additional development effort, especially as the platform simulation model may evolve during the HW/SW co-design process.

Contrastingly, automatic approaches promise to simplify the instrumentation process and to provide a more generic instrumentation methodology. One of these approaches is based on the aspect oriented programming (AOP) paradigm [23] to separate the implementation of the instrumentation functions from the implementation of the simulation model, as discussed in [24] and [25]. In the AOP approach the instrumentation functions are inserted by an automatic code transformation into the code of the simulation model, before the entire SystemC-based platform is compiled to create the simulator executable. Although the instrumentation functions are inserted automatically, this approach still requires the adaption of the instrumentation functions in case the implementation of the simulation model is changed, because the instrumentation functions access simulation model internals, which may have been changed by the modification.

A different and more generic automatic approach uses reflexive mechanisms to instrument the platform simulation model [26, 27]. These approaches are based on the SEAL [28] C++ reflection library to generate reflective wrapper components, which encapsulate the components of the platform simulation model and allow direct access to all internal component data. Because the reflection library has to know the layout of these internal data for the automatic wrapper generation, the source code of the observed components has to be analyzed by an additional source code parser beforehand. In order to enable the reflective instrumentation approach, the new wrapper components have to be instantiated at some point, which in the end requires a modification of the platform simulation model. This modification can be done automatically, but nevertheless the platform simulation model has to be changed and these changes introduce some overhead, increasing the execution time of the simulation model.

In [26] each wrapper component is instantiated as an additional platform component, which impacts the SystemC simulation time, especially when the instrumentation functions are called at every clock cycle to closely monitor the entire platform simulation. In [27] the wrapper component is implemented in Python, which additionally impacts the execution time due to the slower execution of the Python script code. Therefore this approach mainly targets a simplified platform assembly for design space exploration and a flexible mechanism to branch into emulation functions for the emulation of the operating system or of hardware functions that are not available. In case simulation speed is important, an alternative approach is discussed, which manually instruments the instruction set simulators and sacrifices flexibility for simulation speed.

All of these instrumentation approaches modify the SystemC platform simulation model, either manually or automatically, to add the required instrumentation functionality. In comparison we propose a methodology which does not modify the platform simulation model at any point and works fully non-intrusively with respect to the hardware as well as to the software of the platform simulation model. The proposed SystemC-based methodology allows the configuration and modification of instrumentation functions at runtime and requires no access to the source code of the observed simulation models. Additionally, the overhead of the implemented instrumentation framework is small in comparison to the execution time of the entire platform simulation model. Based on this non-intrusive instrumentation methodology we will discuss in this paper different analysis functions to provide a generic and model independent performance analysis framework for HW/SW co-exploration, which is used to design and evolve a heterogeneous multi-core platform running a complex and resource demanding HEVC video encoder.

Overall, the proposed methodology enables the development of a generic instrumentation framework, which can be used in conjunction with any SystemC simulation model. Moreover, due to the broad acceptance of SystemC, other simulation environments partially support their integration into SystemC, which in turn provides the opportunity to analyze these other environments as well. For example, the Gem5 simulator can be used within SystemC, which enables the analysis of the combined platform simulation model with the help of the developed framework. On the other hand, using Gem5 as a standalone simulator, our SystemC-based framework will have no access to this simulation environment.

2.2. HEVC Encoding Algorithms for Heterogeneous Multi-Core Architectures

In order to provide a fast or even real-time capable HEVC encoder solution, different approaches have been explored so far. Besides a full hardware-only solution, which is out of the scope of this work, the most common approaches exploit parallelization, algorithmic optimizations or a combination of both. Algorithmic optimizations often introduce a heuristic to speed up the expensive estimation algorithms in the HEVC encoder. For example, [29] proposes a fast prediction unit depth selection algorithm, which yields a reduction of the HEVC intra encoding time of 55% in conjunction with a 1.22% BD-rate loss. A different approach is presented in [30], which reduces the number of analyzed intra prediction modes by exploiting the correlation between the prediction modes and the gradient variance of the encoded video content. Common to all these heuristic-based optimization approaches is an algorithm speedup at the cost of a BD-rate loss, due to the reduced estimation space.

Parallelization approaches rely on task level and/or data level parallel execution. Task level parallel (TLP) schemes split the algorithm into different functional parts, which can be executed independently from each other. Due to data dependencies between input and output of the different functional parts, this parallelization approach typically leads to a pipelined execution of the video coding algorithm [31, 32]. Whilst TLP-based approaches have the advantage of a low processing latency, the functional partitioning often leads to an unbalanced workload distribution, which reduces the overall parallelization speedup.

Contrastingly, data level parallel (DLP) approaches execute the same function on different parts of the input data. Here various parallelization schemes at different granularity levels have been proposed [33], ranging from coarse-grained group of picture (GOP) level parallel execution, over CTU level parallelization, down to fine-grained SIMD optimizations. For example, [34] is based on a combination of GOP level parallelism, slice level parallelism and SIMD optimizations. The implemented HEVC encoder achieves real-time encoding of the Main10 HEVC profile for 4K UHD video at 60 Hz by using two PCs with 32 cores and a GPU each. Known disadvantages of these parallelization approaches are the increased latency of the GOP level parallelism and the increased bit-rate of the slice level parallelism.

In [35] data level parallel execution is explored by implementing SIMD optimizations of different modules, like motion estimation, Hadamard transform, SAD/SSD calculation and integer transform. Achieved time savings of 56%–85% are reported for the HM-6.2 reference encoder model. The work in [36] implements a parallel HEVC intra encoder scheme, using up to 16 threads for the x265 HEVC software encoder. This scheme exploits CTU level parallel encoding and shows an average speedup gain of 5×. Additionally, Sub-CTU level parallelism is discussed and a theoretical speedup is calculated, but no implementation of this Sub-CTU level parallelization scheme is presented.

A different approach to speed up video encoders is the use of hardware accelerator components, which are designed to replace the most time consuming modules in the encoder algorithm. Such an approach is more often used in the embedded domain, due to its superior power characteristics compared to a massively parallel approach. Recent research has focused on the implementation of single hardware accelerators to be used in future heterogeneous designs. For example, [37] and [38] present HEVC intra prediction components and [39] presents an HEVC IDCT component implemented on an FPGA.

The Reconfigurable Video Coding (RVC) framework [40] provides video codec specifications at the level of library components based on the usage of an actor oriented programming language called CAL, which enables the explicit expression of parallelism and communication between the different library components. Available tools provide a complete synthesis framework for hardware and software components, which can be used for HW/SW co-design, co-analysis and early performance evaluations [41, 42, 43, 44]. Example implementations and analyses of an HEVC video decoder based on the RVC framework can be found in [32] and [45]. In [45] a maximum speedup of 7.6× is given for a SIMD optimized parallel HEVC video decoder running on five processor cores. In [32] an MPEG-4 and an HEVC video decoder have been analyzed and mapped onto an abstract platform based on an instruction set simulator for a so-called transport-triggered architecture (TTA). By developing advanced synthesis techniques, which tackle communication and memory access issues, the performance of both decoders could be improved in comparison to previous dataflow-based implementations. Although there have been some discussions on using the RVC framework also for video encoder implementations, the main focus of the standardization process is on the definition of library components necessary to build video decoders.

In the area of HW/SW co-exploration and co-analysis, video coding approaches have been an important object of research and a source of complex use cases. Especially for the predecessor standard H.264/AVC, several co-explorations for different heterogeneous architectures have been published. Some of these co-explorations focus on the most time consuming part of the encoder, which is in the case of H.264/AVC the motion estimation [46, 47]. In [46] the H.264/AVC motion estimation algorithm is mapped to a heterogeneous platform with a DSP and a specifically designed VLSI co-processor. The developed platform enables real-time motion estimation for HD sequences and is used to study complexity vs. quality trade-offs. In [47] the entire motion estimation algorithm is mapped to a separate hardware accelerator. The developed platform achieves a speedup of up to 3.6×, depending on the chosen motion estimation search strategy.

Other studies use high-level design frameworks to map and exploit parallelized H.264/AVC algorithms on embedded multi-core platforms [48, 31]. In [48] a pipelined version of an H.264/AVC encoder is evaluated at different abstraction levels with the help of the Sesame modeling and simulation framework. For performance evaluations a trace-driven co-simulation of the application event queues mapped to the architectural components is carried out. These evaluations show a near linear speedup of up to 4.1× on an embedded platform with six processing elements. In [31] a Simulink-based design flow is used to evaluate parallel Motion-JPEG and H.264/AVC decoders at different abstraction levels. With the help of a cycle accurate virtual prototype the task partitioning of the pipelined H.264/AVC decoder is refined to determine the best trade-off between cost and performance.

Due to its widespread use for system level modeling and architectural exploration, SystemC has also been exploited to analyze H.264/AVC encoder designs [49, 50]. For this purpose, [49] implements a hardware-only solution with no soft-cores and performs a communication and timing analysis of this model. The analysis data is collected from an abstract platform model by adding measurement probes to appropriate hardware components of the simulated architecture. Contrastingly, [50] explores different parallelization schemes of an H.264/AVC encoder running on an ARM-based multi-core platform. The performance evaluation is carried out with the help of a co-verification environment, which combines a virtual SystemC platform model with a hardware processor core running on a separate evaluation board. For data level parallelism at the H.264/AVC macroblock level a maximum speedup of 2.36× is achieved on a platform with four ARM cores.

In summary, explorations of video coding algorithms focus either on parallel software-based approaches or on heterogeneous HW-accelerated designs. Also model-based performance analyses and co-exploration studies discuss only one or the other aspect, even though the use of a virtual platform simulation model provides all means to develop and analyze a combined approach. For this purpose we propose a parallel and heterogeneous HW/SW model of an HEVC intra encoder, which can be used for HW/SW co-design, co-analysis and early performance evaluations. The chosen encoder parallelization is based on a concurrent execution of the recursive coding unit (CU) size estimation algorithm and enables the parallel processing of a single CTU at a fine granularity below the CTU level. Advantages of the chosen parallelization scheme are a low temporal latency and low memory requirements with respect to other coarse-grained data level parallelization approaches, as well as a better workload balance with respect to the pipelined task level parallelization approaches discussed above.

In this work we present a modified version of a previously published HW/SW model of our HEVC intra encoder [51]. In comparison to [51], a memory access conflict analysis is performed for different data structures of the parallel encoder algorithm. With the help of this analysis different memory mappings are explored in conjunction with a hardware-based synchronization scheme to provide a trade-off analysis between necessary resources and achievable speedup. The entire exploration of the encoder is based on a generic and model independent performance analysis framework for virtual SystemC platform simulation models, which will be presented in the following section.

3. Simulation-based HW/SW Co-Analysis

3.1. Non-intrusive Instrumentation

In order to mimic a real platform design, SystemC uses building blocks and concepts similar to real hardware components. Amongst others, the most important basic blocks used in SystemC are modules to model hardware components, ports to provide interfaces for accessing these modules, and channels respectively signals to model the communication between the modules. Together with an event-driven simulation kernel these building blocks are implemented as a set of C++ classes and class interfaces in the SystemC library. By introducing the instrumentation code into the SystemC library, the observation of simulation models, or at least of parts thereof, becomes feasible. For example, adding an instrumentation function into the SystemC signal update function enables the observation of each signal change in the simulation model. In terms of real hardware this is similar to adding a measuring probe to each signal in the platform, or more precisely to building a platform whose signals each already contain a measuring probe.

Whilst the implementation of such instrumentation is straightforward and requires only a few lines of additional code, the analysis of the trace data collected this way is impossible without further support. First of all, tracing all signal changes in a design with thousands of signals will create a huge amount of trace data and severely impact the overall execution time. Second, for an analysis it is necessary to know the role or purpose of a signal in the platform simulation model, because otherwise the analysis will only see the signal change, not knowing whether this signal change belongs to an interrupt, a reset, a data bus or something different. Third, not all interesting parts of the platform might be accessible from inside the SystemC building blocks. A typical example of such a hidden element is a register in an instruction set simulator model, which at a higher abstraction level is usually implemented by a C++ member variable inside the instruction set simulator class. Of course this instruction set simulator class is derived from the SystemC module class by inheritance, but in C++ accesses to the members of a derived class (the instruction set simulator) are not foreseen from inside the base class (the SystemC module).

The amount of trace data can be reduced simply by enabling only the observation of signals needed for the later analysis, which in turn can be achieved by selecting the signals based on their role and purpose. This means that, to address the first two points, some kind of meta-information or description of the platform simulation model is needed. To address the third point, i.e. accessing a derived class member from the base class, C/C++ pointer arithmetic can be used. The base class subobject and the derived class object share the same base pointer address, and therefore a derived class member variable can be accessed by adding a fixed, member variable dependent offset to the object base pointer address. Hence, to perform an analysis of the platform simulation model based on an instrumentation of the SystemC library only, two pieces of information are needed. First, knowledge about the roles of the SystemC building blocks is required, and second, the memory layout of each platform component class has to be known. To provide both pieces of information, the trace data collection module generates a platform description at runtime by performing an automatic design analysis, as will be described in the following.

3.2. Runtime Design Analysis

The execution of a SystemC-based platform simulation model can be split into two phases. The first phase is the elaboration phase, where all components of the platform simulation model are instantiated. The second phase is the simulation phase, where the signals and the internal states of the platform components are constantly updated to mimic the progression of time. To register each instantiated platform component in the design analysis module, a function call is added to the SystemC basic building blocks. Due to the fact that every building block originates from the SystemC object class by inheritance, this requires only a modification of the SystemC object class. By this modification the design analysis module gains access to the complete list of all platform components instantiated during the elaboration phase.

In order to obtain the needed C++ class memory layout, one approach would be to parse the class specification files with a C++ parser, but this would require access to the relevant source files. Another approach is to extract the class information from the compiled binary of the platform simulation model. Of course this information has to be added first, but this can be achieved automatically by enabling the addition of debug symbol information during the compilation process. Once part of the simulation model binary, this information can be parsed at startup with the help of a parser library, like the DWARF [52] debugging information parser library [53].

Both approaches can be used interchangeably and show only slight differences when it comes to usability and accessibility of source code. The main advantage of the debugging information approach is that the relevant information is already part of the platform simulation model and no additional source code files are needed, which simplifies the distribution of platform simulation models for distributed analyses of multiple models in parallel. Please note that the additional debugging information neither impacts the execution time of the simulation model nor prevents the usage of optimizations during the compilation process, because only datatype-related debugging information is explored.

The final SystemC-based runtime design analysis algorithm is shown in Fig. 1. Besides the collection of the needed information, a central point in this algorithm is the combination of the DWARF debugging information and the registered SystemC basic building blocks, as will be explained in the following.

In the first step all registered building blocks are sorted into a tree, which preserves the SystemC parent-child relation. This relation implies that a child node is part of the parent building block's platform component. With the help of this building blocks tree all SystemC ports and exports attached to a SystemC module can be identified. An example of such an initial building blocks tree is shown on the left side of Fig. 2. In this example the two SystemC modules PE and Bus each have a port and an export attached


Figure 1: SystemC-based runtime design analysis algorithm to create platform description.

respectively, which are connected with each other via the SystemC signal Sig1. In the second step all ports and exports connected to the same SystemC signal are identified, to associate the signal with the modules it connects. For the given example this means that the modules PE and Bus are connected to each other by the signal Sig1, as shown in the middle of Fig. 2. Additionally, pairs of SystemC signals with the same identifiers for signals as well as for ports and exports are matched in the second step to find transaction level modelling (TLM) connections. In a platform model these TLM connections implement packet-based data exchanges between SystemC modules at higher abstraction levels. In the third and last step the class-related debugging information is mapped to each SystemC module parent class. Therefore the parent class


Figure 2: Example transformation of registered SystemC building blocks into platform description: 1. basic building blocks tree; 2. connected platform components; 3. connected platform components with member variables (platform description).

type has to be identified at runtime, which can be achieved with the C++ runtime type information facilities. During this mapping, inheritance and composite class relations are explored recursively to also enable access to all member variables of child classes and of nested member classes, respectively. A graphical representation of the final platform description for the example is shown on the right side of Fig. 2. This final platform description contains the graph of the connected platform components and a set of member variables VarX for the two SystemC modules. For each of these member variables the relative memory offset to the base address of the given module can be extracted from the DWARF debugging information as well, which enables the memory-based tracking of the variable contents without the need for a dedicated access function. To perform the transformation of the registered building blocks tree into the final platform description no additional modification of the SystemC library is necessary. All needed information can be obtained through existing SystemC interfaces during the building blocks registration process. The


parent-child relation is maintained internally by SystemC, and the connection between ports and signals is based on a shared SystemC interface object. Therefore, finding ports, exports and signals with the same shared interface object resolves the connections between these building blocks.

3.3. Analysis Framework

Besides trace data collection and automatic runtime design analysis, the proposed analysis framework also comprises data analysis, export and visualization, as shown in Fig. 3. The trace data collection module is embedded into the SystemC library via a plug-in mechanism. To increase the flexibility of the analysis framework, the remaining parts can be connected to the trace data collection module either via a plug-in mechanism or via remote access. The latter option enables remote distributed simulation, whilst the plug-in mechanism enables the easy exchange and modification of the trace data analysis functions.

3.3.1. SystemC Interface
The interface between the SystemC library and the trace data collection module has been designed with the goal of keeping the necessary modifications of the SystemC library as small as possible. As explained above, every building block in SystemC derives from the SystemC object base class sc_object, and therefore the design analysis requires only a registration call-back function register() within this base class to record all building blocks, as shown in the overview of Fig. 4. Furthermore, function call-backs at the end of the elaboration and simulation phases are needed, to signal the start of the design analysis and



Figure 3: The analysis framework connected to a SystemC-based platform simulation model.

to enable the final wrap-up of the data analysis, respectively. Therefore the functions elaborate() and do_sc_stop_action() in the SystemC simulation kernel sc_simcontext have been instrumented. After the end of the elaboration phase is signaled, the runtime design analysis process described in Sec. 3.2 is started with the set of previously registered SystemC building blocks. To trace the change of a signal value, the signal update function update() has been instrumented as well. In addition the blocking b_transport() as well as the non-blocking forward nb_transport_fw() and backward nb_transport_


Figure 4: Overview of the SystemC class library and the necessary modifications to instrument SystemC.

bw() TLM transport functions have been instrumented, to trace all TLM data exchanges at runtime. A specific processing is necessary for non-blocking transactions, because in the non-blocking case a transaction can be split into multiple phases. Hence multiple calls to the non-blocking forward and backward transport functions have to be combined into a single transaction, to detect the start and the end of a transaction as well as the overall transaction duration. Other TLM properties, like the accessed memory address


or the amount of accessed data, can be observed by evaluating the transferred TLM packet object. The last instrumentation call-back function is added to the SystemC delta cycle function cycle() to signal the execution of a single simulation cycle. In case the trace data collection module has to observe member variables of platform components, the memory addresses of these member variables are monitored after each simulation cycle to track changes of their contents. All observed events are sorted in a windowed approach with respect to the global simulation time, which is given by the SystemC internal simulation cycle count. Overall, less than 100 lines of code have been added to the SystemC core library to include all required instrumentation functions. The entire processing and evaluation of the collected trace data as well as the runtime design analysis are performed outside the SystemC class library, in the trace data collection module and the runtime design analysis module respectively, as shown in Fig. 3. The connection between SystemC and the trace data collection module is managed by the additional analysis context object, which forwards all instrumentation function calls to the external analysis framework.

3.3.2. Data Analysis
The data analysis has been designed to run different analysis filters in accordance with the current user requirements. Common input for all analysis filters is a sequence of sorted trace events from the trace data collection module and the platform description to support the analysis at the platform abstraction level. For this purpose, only traces of commonly used design elements such as program counter registers or memory buses are used, without the need for special hardware performance counters. The most basic analysis filter is the fine-grained software component runtime analysis, which is used by other analysis filters to correlate software and hardware events. In principle this analysis filter follows the executed software instructions within each instruction set simulator by tracing the current program counter. Based on the executed instructions this analysis filter measures the execution time of basic blocks and detects function calls as well as function returns to maintain a function call-graph and to enable a hotspot analysis at function level. On top of this basic filter a function group analysis filter has been implemented, to provide an estimate of the execution time of virtual software components. For this purpose a set of functions can be defined as entry points to a virtual software component. The measurement of the execution time for this virtual software component starts whenever an entry point function is called and ends when this function returns or a different virtual software component is entered. A different analysis filter counts memory accesses from instruction set simulators or hardware components. In case the memory accesses have been initiated by an instruction set simulator, they can be attributed to different software functions and/or virtual software components, as well as to different data structures. Therefore this filter has to monitor all memory accesses, which, depending on the chosen abstraction level, are implemented either by TLM connections or by a set of signals. Especially for concurrent designs a bottleneck analysis based on memory


access conflicts can provide important hints for optimizing the design. Based on the traces of all memory accesses, memory access conflicts are detected whenever the access times of parallel accesses overlap at the shared memory block. In this case one access is delayed due to a conflict, which can be attributed either to pairs of functions, to pairs of virtual software components, or to a mix of software and hardware components. All these analysis filters have been implemented as part of the trace data analysis module, as depicted in Fig. 3, and can be configured at runtime to match the current platform implementation. For example, for the function level runtime analysis the program counter registers of all instruction set simulators have to be traced. These program counter member variables can be identified in the generated platform description by the SystemC module names of the instruction set simulators in conjunction with the C++ member variable name. Given both names, the automatic tracing of the program counters can be enabled in an XML-based configuration file. Additionally, the appropriate trace events, which are generated by monitoring the changes of the identified program counter variables, are configured in the same XML file to be used as input for the function level runtime analysis filter. To simplify the configuration of the analysis filters, the analysis framework provides a platform view, which allows the configuration based on a graphical abstract representation of the observed platform simulation model. After identifying the member variables, all changes are traced automatically at the granularity of a SystemC simulation cycle, without the need to find and instrument the appropriate parts in the source code. The configuration


of the analysis filters can be changed at runtime to enable the execution of different filters in different simulation phases. This reconfiguration can be triggered automatically when a certain event is recognized during the analysis, e. g. a certain function is entered. Alternatively the simulation can be paused to allow a reconfiguration within the provided platform view. To facilitate the interoperability of the presented analysis framework with other tools, the automatically generated platform description and the used configuration file are based on the XML format [54]. In addition the generated analysis data is stored in the standardized CSV file format [55].

4. HEVC

4.1. Overview

The high efficiency video coding standard is an example of the block-based hybrid video coding scheme. In this scheme the input frame is split into a set of smaller blocks, and each block is encoded by a combination of prediction, transform and entropy coding algorithms. For the HEVC main intra profile the prediction uses only reconstructed samples from the current frame, exploiting the spatial correlation between adjacent samples. The following description of the HEVC encoder algorithm is based on the HM reference model software implementation [56]. The main part of the HM intra encoder model is the encoder loop composed of forward transform/quantization, inverse transform/quantization, intra estimation and intra prediction, as shown in Fig. 5. Internally the HM encoder maintains a reconstruction of the current frame, to perform intra mode estimation and intra prediction. Hereby the intra mode estimation has to find


the best prediction mode out of 35 possible modes [57]. The estimated best mode is then used in the intra prediction to generate a prediction signal, which is subtracted from the samples of the current input frame, to exploit spatial redundancies between adjacent samples. Outside of this encoder loop context adaptive binary arithmetic coding (CABAC) [58] is used to perform the entropy coding of all transmitted video data, e. g. quantized transform coefficients, intra prediction mode and other side information.

Figure 5: HEVC intra encoder model.

In comparison to the predecessor standard H.264/AVC, the HEVC standard introduces some novel features to increase the overall coding gain. One of these features is the concept of the coding tree unit (CTU), as shown on the left side of Fig. 5. In order to match different possible block and prediction sizes ranging from 64×64 to 8×8 samples, a coding unit (CU) can be split recursively into four smaller coding units, which are encoded and predicted separately. Once the CTU partitioning has been determined by the HM encoder, the residual transform is also split recursively into smaller transform units forming the residual quad-tree (RQT) [59]. To find the best intra prediction mode and to determine the best CTU/RQT layout the HM encoder uses the rate distortion optimization (RDO) algorithm [60]. The key idea behind this algorithm is to determine a single estimate for the best variant by combining the required bit-rate with the reconstructed quality of this coding variant. This in turn means that whenever the encoder has to choose between different options, it has to encode all of these options to estimate the best encoding decision, which results in a complex encoder algorithm.

4.2. HM Reference Encoder Algorithm

Combining all possible encoder choices forms a complex and huge decision space, which cannot be searched as a whole. Therefore each decision is handled independently, forming a decision tree less complex than the complete decision space, but nevertheless too complex for real time applications. According to [61] the HM encoder test model may exceed 1000× real time for some HD sequences running on a Xeon-based server platform. At the top level the HM encoder performs the CTU partitioning. This partitioning is handled recursively by comparing the bit-rate and distortion of the non-split compressed CU with the split compressed CU at different tree depth levels, as shown in Fig. 6. In addition to a CU split, the prediction unit (PU) can also be split into smaller units. Such a split of the intra


prediction is restricted to the smallest CU level only. Therefore the smallest CU size is compressed twice, without a PU split, depicted by D3 (2N×2N), and with a PU split, depicted by D3.N (N×N), in the bottom lines of Fig. 6.

Figure 6: Coding unit tree partitioning. The left column shows for each tree depth the number of coding units which have to be compressed within one CTU: D0, 64×64: 1 (PU: 2N×2N); D1, 32×32: 4 (PU: 2N×2N); D2, 16×16: 16 (PU: 2N×2N); D3, 8×8: 64 (PU: 2N×2N); D3.N, 8×8: 64 (PU: N×N).

To calculate the bit-rate and distortion of a compressed CU, the luma intra mode estimation and residual quad-tree partitioning are performed, as depicted in Fig. 7. In a first step the number of possible intra prediction modes is reduced by a heuristic. For this heuristic each mode is predicted first, and then the Hadamard (HAD) transform costs of the residual signals, which are calculated by subtracting the predicted from the original samples, are compared with each other. After this first stage a set containing the best N candidates is forwarded to the second stage to determine the final best intra prediction mode. In case the N best modes do not contain the most probable intra prediction mode


Figure 7: Algorithm for luma intra mode estimation and luma residual partitioning (HAD-based selection of the N best out of 35 modes, with N=3 for 64×64, 32×32 and 16×16 and N=8 for 8×8, followed by full RQT coding of the final best mode).

derived from adjacent coding units, this most probable mode is added to the list of the N best modes, which provides a slightly better coding gain. As a result, the second stage has to process either {3, 4} best modes in case of the 64×64, 32×32 and 16×16 CU sizes, or {8, 9} best modes in case of the 8×8 CU size, respectively. In the second stage the residual is transform coded without RQT partitioning. Here the sum of squared errors (SSE) between the reconstructed and the original signal is used as distortion metric. Finally, the third stage performs the RQT partitioning of the residual for the best intra prediction mode. For this RQT partitioning the bit-rate and distortion of a non-split transform unit are compared recursively with those of the corresponding split transform unit, in a scheme similar to the recursive CTU partitioning. After the luma signal of a coding unit has been compressed, the chroma signals are compressed as well. The chroma prediction mode and residual estimation is less complex than the luma estimation, because only a subset


of five prediction modes has to be compared and no additional chroma RQT partitioning is required.

5. Experimental Setup

To explore the HEVC intra encoder execution we use a SystemC-based multi-core simulation platform, which can easily be extended with hardware accelerator blocks. The platform uses cycle accurate ARMv6 instruction set simulators [62] provided by the SoCLib project [63]. The ARM processing elements are connected via a micro-network with 16 MB of shared SRAM memory and IO components for frame data input and bit-stream output, also made available by the SoCLib project. The SoCLib micro-network acts as a wormhole network-on-chip (NoC) and is internally based on two independent packet switched networks for virtual component interface (VCI) memory access commands and VCI response data [64]. Each ARMv6 core has separate 8-way associative L1 instruction and data caches of 32 KB each. The analysis and comparison of execution time and speedup is based on the number of simulated clock cycles, which is independent of a specific clock frequency. At the software layer the Mutek operating system is used, which provides a scalable and extensible solution for embedded system design [65]. The HEVC encoder implementation is based on the HM-15.0 reference encoder implementation, which has been ported to the simulated embedded ARM platform with the help of the gcc-4.8.2 cross-compiler for ARM.


6. Task Level Parallel Execution

6.1. Parallel CU Size Estimation

The parallel segmentation of an algorithm is a non-trivial task, which depends on different parameters such as granularity, scalability, workload balancing, latency, synchronization requirements and/or data locality. For example, encoding multiple frames in parallel yields a high granularity and, in case of intra-only encoding, has no special synchronization requirements between the different frames. On the downside, the encoder latency is greatly increased, because the encoder has to process multiple input frames in parallel before finishing the first one. Another parallelization approach for the HEVC encoder is CTU level parallel encoding. In this approach different CTUs are encoded in parallel. For a correct encoder algorithm execution, these parallel tasks have to be synchronized. Due to the nature of the intra prediction algorithm the left, top and top-right adjacent CTUs have to be encoded first. Additionally the bit-stream generation has to be synchronized in CTU order. To alleviate the bit-stream generation dependency, the wavefront parallel processing scheme [66] has been introduced in HEVC, which enables multiple less dependent bit-streams at the cost of a higher overall bit-rate. Benefits of the CTU level parallelization approach are the decreased encoder latency and an increased data locality. In a sub-CTU level parallelization approach the processing of a single CTU is split into multiple data and/or functional parts, which are executed concurrently. Such an approach has the advantage of a low encoder latency and a high data locality. On the downside, especially the synchronization costs may exceed the performance gains of the parallel execution. Nevertheless, in the embedded domain this may be an interesting opportunity, because memory constraints and encoder latency are the more demanding requirements. Moreover, by using a low-cost customized operating system, the synchronization costs can be kept at an acceptable level. As shown in Fig. 6, the CTU partitioning is a recursive algorithm, which exposes the opportunity to perform the compression and cost calculation of the non-split and split CU in parallel. In the split case the CU compression is performed for four CUs with a quarter of the number of samples each. Therefore the non-split and the split CU compression have to process the same number of samples, which should require similar processing resources. To examine this assumption we use the basic serial HEVC encoder implementation running on a single-core platform to perform an analysis of the average encoder execution time distribution over the different CU compression stages. For this purpose the function group analysis filter proposed in Sec. 3.3.2 is used to specify a virtual software component for each CU compression stage. This ensures that the execution time spent in shared sub-functions is attributed to the appropriate CU compression stage. For this analysis we encoded the first frames of the class D 416×240 BasketballPass test sequence from the HEVC standardization process with different quantization parameter (QP) values (QP ∈ {22, 27, 32, 37}). In Fig. 8(a), the averaged execution time distribution in percent over these different QP values is shown for each CU compression stage. As can be seen, the execution time for the remaining part, containing mainly IO and final entropy coding of the compressed CTU, is quite small


Figure 8: Serial HEVC encoder execution time distribution in percent for different CU compression stages: (a) with transform skip: D0 11%, D1 13%, D2 14%, D3 20%, D3.N 41%, Remain 1%; (b) without transform skip: D0 14%, D1 16%, D2 24%, D3 17%, D3.N 28%, Remain 1%.

(approx. 1%) in comparison to the CU compression stages. Unfortunately, in contrast to the assumption above, the balancing is limited by the increased execution time for smaller CU levels, especially for the D3.N compression stage. A more detailed function level analysis of the D3 and D3.N compression stages shows that this increase is mainly a result of the transform skip (TS) feature, which is enabled only in the lowest D3.N compression stage. With TS enabled, a second residual coding pass has to be examined, in which the forward and inverse transforms are bypassed. This mode has been adopted to improve the coding efficiency for specific video contents such as computer-generated graphics. Consequently, this coding option has little to no influence on the coding efficiency of the class A, B, C, D and E test sequences [67]. Without TS enabled the distribution is better balanced between the different CU depth levels, as can be seen in Fig. 8(b). Given the numbers for the execution time distribution, a theoretical parallel speedup of approx. 2.4× (100/41) with TS enabled and of 3.6× (100/28) without TS enabled could be expected when running each of the five CU depth compression stages on a different processing element. Of course synchronization between the parallel CU depth compression stages is required before each RDO cost comparison, to ensure the correct execution of the encoder algorithm. As depicted in Fig. 6 this RDO cost comparison is performed whenever a CU has been compressed in split as well as in non-split mode. Therefore a barrier-style synchronization before the RDO cost comparison is sufficient to ensure that both cost values have been computed in parallel. Furthermore, some data structures had to be adjusted within the parallel encoder implementation, to prevent concurrent accesses to otherwise serially shared data structures. After these modifications the implemented parallel encoder generates the same output bit-stream as the serial encoder, without any peak signal to noise ratio (PSNR) loss in comparison to the reference software model. In order to gain a better understanding of the differences between the theoretical and the achieved encoder speedup, we use two different simulation model configurations for the following simulations. In one configuration, called the ideal configuration, a system without memory access conflicts or cache misses has been implemented. In the other configuration, called the non-ideal configuration, memory access conflicts and cache misses are enabled, limiting the achieved parallel encoder speedup. For both configurations the speedup of the proposed parallel HEVC encoder algorithm running on five ARM cores is shown in Fig. 9 for different QP values, with and without TS enabled. The speedup values have been computed by comparing the serial and the parallel HEVC encoder with otherwise identical configurations. This means that in case transform skip has been disabled, the option is disabled in the serial as well as in the parallel setup, showing the relative parallel speedup for each configuration.

Figure 9: Parallel encoder speedup with and without transform skip (TS) enabled for ideal and non-ideal memory configuration.

As can be seen in Fig. 9, for both configurations the parallel encoder speedup is approx. 43% higher when the TS option is disabled. Because disabling TS has little impact on the coding efficiency, as explained above, this option is disabled for all the following experiments. Furthermore, without TS enabled, an average speedup of 3.4× is achieved for the ideal memory configuration. For the non-ideal memory configuration, on the other hand, the average speedup of 2.5× is approx. 26.9% smaller. In order to reduce this gap between the ideal and non-ideal memory configurations, a bottleneck analysis of the platform memory sub-system is performed in the following.


6.2. Memory Access Conflicts

So far the designed multi-core platform contains five ARM cores with caches, an NoC interconnect and a single memory component. In this configuration all memory accesses are routed to the same memory component, which becomes the major bottleneck for concurrent memory accesses. Besides modifying the cache configuration to reduce the number of memory accesses, another optimization option is to improve the performance of the shared main memory. In case multiple memory components are available in the designed platform, concurrent memory accesses can be realized via the network-based interconnect. This requires the segmentation and mapping of the data structures to the different separate memory blocks. A metric to guide such a segmentation and mapping is the distribution of memory access conflicts over the various data structures of the implemented software. With the help of the memory access conflict analysis filter described above, the distribution of memory access conflicts for different heap allocated and global data structures is summarized in Fig. 10. Similar to the execution time analysis, multiple data structures have been combined into groups for the different CU compression stages and the remaining part. Due to the fact that less than 50% of all memory access conflicts could be located in CU compression stage data structures, the remaining part has been split into additional groups. The first additional group is the Runtime group, containing data structures used for synchronization as well as task context data managed internally by the runtime. The second additional group is the Shared group, containing shared data structures like the reconstructed picture buffer or the globally shared CU data buffer, accessed by all CU

39 ACCEPTED MANUSCRIPT

compression stages in parallel.
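The conflict attribution performed by such an analysis filter can be sketched as follows. This is a minimal illustration under our own assumptions: the region table, the (cycle, address) trace format and the group names are illustrative and not the actual interface of the framework.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One named address range, e.g. the memory block backing a data-structure group.
struct Region { uint64_t base, size; std::string group; };

// Attribute each conflicting access (any access after the first one hitting the
// shared memory in the same cycle) to the data-structure group owning its address.
std::map<std::string, unsigned> count_conflicts(
        const std::vector<Region>& regions,
        const std::vector<std::pair<uint64_t, uint64_t>>& trace) { // (cycle, addr)
    std::map<uint64_t, unsigned> accesses_per_cycle;
    std::map<std::string, unsigned> conflicts;
    for (const auto& [cycle, addr] : trace) {
        if (accesses_per_cycle[cycle]++ > 0)        // not the first access this cycle
            for (const auto& r : regions)
                if (addr >= r.base && addr < r.base + r.size)
                    ++conflicts[r.group];           // charge conflict to owning group
    }
    return conflicts;
}
```

Summing the per-group counters over a whole simulation run yields a distribution of the kind shown in Fig. 10.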

[Figure: bar chart of conflict share in percent (Conflicts [%], 0–40) for the groups D0, D1, D2, D3, D3.N, Runtime, Shared and Remain.]

Figure 10: Distribution of memory access conflicts in percent for different heap allocated and global data structures.
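The large Runtime share in Fig. 10 can be illustrated with a toy model of a co-operative scheduler. The structure and names below are our own simplification, not the actual runtime: the point is that polling an unsignalled synchronization primitive yields to another ready task, and every such context switch moves task-context data through the shared main memory.

```cpp
#include <deque>
#include <functional>

// Toy model of a co-operative multi-tasking runtime (illustrative only):
// polling a not-yet-signalled synchronization primitive yields to the next
// ready task; each context switch loads/stores task-context data in shared
// memory, which is where the Runtime-group conflicts originate.
struct CoopRuntime {
    std::deque<std::function<void()>> ready;   // tasks ready to run
    unsigned context_switches = 0;             // proxy for context memory traffic

    // Check a synchronization primitive; co-operatively yield while waiting.
    bool poll(const bool& signalled) {
        if (!signalled && !ready.empty()) {
            auto next = ready.front();
            ready.pop_front();
            ++context_switches;                // save + restore task context
            next();                            // run the other task
        }
        return signalled;
    }
};
```

With the fine-grained Sub-CTU parallelization, `poll` is invoked very frequently, so the switch counter, and with it the context-related memory traffic, grows quickly.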

The high number of memory access conflicts in the Runtime group can be explained by the fine-grained Sub-CTU level parallelization in combination with the underlying runtime, which implements a co-operative multi-tasking scheme. This means that whenever one of the parallel tasks has to check a synchronization primitive, the runtime scheduler additionally checks the task queue to see if other tasks are ready to run. This involves multiple switches of task contexts, which in turn requires task context data to be loaded and stored in the shared main memory. Even worse, the synchronization primitives have to be polled quite frequently due to the fine-grained parallelization scheme implemented within the HEVC encoder algorithm.

In accordance with the distribution of memory access conflicts, the first group of data structures that should be mapped to a separate memory block is the Runtime group, followed by data structures exclusively accessed by a single CU compression stage. Mapping the Shared group to a separate memory block will provide only a small benefit, because these data structures are accessed by all CU compression stages in a random access pattern, which depends on the result of the RDO cost comparisons. Hereby the winning CU compression stage stores its local configuration into the global shared data structures to keep track of the best encoder choices and to facilitate the prediction of adjacent CU blocks. With the exception of the Shared group, all other data structure groups require less than 1 MB of memory each. Therefore the multi-core platform is extended by additional memory components of 1 MB each, connected to the NoC, to successively map the data structure groups to these separate memory blocks.

Alternatively, the conflicts of the Runtime group could be reduced by using a hardware synchronization component for inter-task synchronization. This hardware synchronization component provides barriers to stop the execution of a task until an event has been signaled by another task. The synchronization component is connected to every processing core to enable synchronization between concurrent tasks running on different processing elements. Because with the split and non-split mode always two tasks are checked at each RDO cost comparison, we implemented a simple memory-mapped array of fixed-size barriers to be used as the hardware synchronization component. For both implementation alternatives in combination with additional memory components for the CU compression data structures, the speedup improvements are shown in Fig. 11.

[Figure: bar chart of parallel encoder speedup (2.4–3.0) for the configurations Parallel, +SepMem.Runtime, +SepMem.D3.N, +SepMem.D3, +SepMem.D2, +SepMem.D1, +SepMem.D0 (left) and Parallel, +HwSync, +SepMem.D3.N, +SepMem.D3, +SepMem.D2, +SepMem.D1, +SepMem.D0 (right).]

Figure 11: Parallel encoder speedup for different mappings of data structures to separate memory blocks. On the left side the runtime data structures have been mapped to a separate memory component first. Alternatively, on the right side a hardware synchronization component (HwSync) has been used. The Parallel configuration corresponds to the parallel HEVC encoder with transform skip disabled.

As can be seen, using the hardware synchronization block (HwSync) provides a higher speedup than mapping the involved data structures to a separate memory component. Furthermore, after mapping the runtime and the D3.N data structures to separate memory components, the following speedup improvements become small. This is due to the fact that, by removing these conflicts, conflicts are removed equally from the remaining groups as well. Hence the impact of mapping another data structure group to a separate memory block is reduced, and a solution near the most aggressive one can be reached by using the hardware synchronization component in combination with one additional memory component for the D3.N compression stage data structures. With such a platform configuration the parallel encoder speedup could be improved from 2.5× to 2.9×, which reduces the gap between the ideal and non-ideal memory configurations by approx. 50.0%. Besides improving the encoder speedup, each additional memory component also increases the overall energy consumption and introduces an area overhead to the designed system. Therefore adding only one additional memory component is also a trade-off between speedup and energy/area metrics within the multi-dimensional design exploration problem.
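A register-level sketch of the memory-mapped barrier array described above might look as follows. Only the two-tasks-per-barrier arrival (split and non-split mode at each RDO cost comparison) is taken from the text; the number of barriers and the auto-reset behaviour are our own assumptions.

```cpp
#include <array>
#include <cstdint>

// Sketch of the hardware synchronization component: a fixed-size array of
// barriers, each released once both tasks of an RDO cost comparison (split
// and non-split mode) have arrived. Array size and auto-reset are assumed.
class HwSyncBlock {
    static constexpr int kBarriers = 16;        // assumed number of barriers
    static constexpr uint8_t kTasksPerBarrier = 2;
    std::array<uint8_t, kBarriers> arrived_{};  // per-barrier arrival counters
public:
    // A core signals its arrival at a barrier (modelled as a register write);
    // returns true when the barrier opens, i.e. both tasks have arrived.
    bool arrive(int barrier) {
        if (++arrived_[barrier] == kTasksPerBarrier) {
            arrived_[barrier] = 0;              // auto-reset for the next check
            return true;
        }
        return false;
    }
};
```

Because arrival is a single register write instead of a polled variable in shared main memory, the Runtime-group memory traffic is avoided entirely.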

7. Hybrid HW/SW Encoder

7.1. HEVC Co-Processor Design

In accordance with the HEVC encoder algorithm description presented in Sec. 4 we split the different encoder functions into a set of virtual software components, which represent possible candidates for hardware acceleration. Identified components are the intra prediction, the inverse/forward transformation/quantization and the HAD/SSE cost calculation functions. For comparison reasons we split the remaining functions, which we deem not well suited for HW-accelerated execution, into three additional groups, namely entropy coding related functions Entropy, memory copy functions Mem and block addition/subtraction functions AddSub. The average encoder execution times in clock cycles per compressed CTU for these different function groups for the serial HEVC encoder are shown in Fig. 12.

[Figure: bar chart of execution time in 10^6 clock cycles for QP22, QP27, QP32 and QP37 for the groups AddSub, Entropy, Cost, FwdTrQuant, InvTrQuant, IPred, Mem and Remain.]

Figure 12: Average encoder execution times in clock cycles per compressed CTU for different function groups for the serial HEVC encoder.

As can be seen in Fig. 12, the forward transform and quantization group FwdTrQuant is the most time consuming function group. A more detailed analysis of this group reveals that the rate distortion optimized quantization (RDOQ) part consumes the most clock cycles. Although we currently found no hardware implementation of the RDOQ algorithm in the literature, [68] discusses a promising approach, which is ready to be used for a hardware implementation. In case an RDOQ hardware implementation cannot be realized in the near future, the RDOQ option could be disabled in the HEVC encoder, which in turn imposes a bit-rate increase of more than 6%, as reported in [69].

The second most time consuming function group is the Entropy coding group, containing a set of different entropy coding functions for bit-rate estimation and final entropy coding. To provide a better understanding of the involved functions, Fig. 13 shows a refinement of the execution time distribution for this special function group. For this purpose, the entropy coding group has been split into common data copy functions Copy as well as functions for encoding flags Flags, coefficients Coefficients, header information Header and prediction mode data Mode.

[Figure: bar chart of execution time in 10^6 clock cycles for QP22, QP27, QP32 and QP37 for the sub-groups Copy, Flags, Coefficients, Header and Mode.]

Figure 13: Refined distribution of encoder execution times in clock cycles per compressed CTU for the entropy function group.

As can be seen in this diagram, the most time consuming part is a set of several data copy functions. The only interesting block for hardware acceleration would be the coefficient encoding block, which is called multiple times during the bit-rate estimation in the recursive residual coding RQT algorithm. But in order to execute this part in a separate hardware accelerator, the internal entropy coding state has to be copied from software to hardware and vice versa. Especially for small block sizes this imposes a huge overhead in comparison to the raw coefficient data, and therefore the expected benefit is rather small. Please note that in case of an HEVC decoder implementation the design would be much simpler, because there no alternative entropy coding states have to be managed and copied for the different estimation options. Furthermore, the execution time of the coefficient encoding block depends highly on the QP value, because for higher values the coefficients are quantized more strongly. This increases the probability of single coefficients and/or entire coefficient blocks being zero valued, which in turn reduces the benefit of a hardware coefficient encoder for higher QP values.

Based on the function group profile and the discussion above, we decided to implement the following five HW-accelerated components: intra prediction, forward transform and quantization, inverse transform and quantization, HAD cost calculation and SSE cost calculation. Each component is connected independently to the micro-network and includes a DMA engine for easy programming and fast memory accesses. The implemented DMA engine supports block copies of two-dimensional data arrays and facilitates burst memory access for successive sample data up to the size of a single cache line (32 B).

To obtain an upper bound for the algorithm speedup achievable by using the hardware accelerators, we implemented, similar to Sec. 6, an ideal configuration, where the DMA engine provides immediate access to the requested memory regions. For the ideal memory configuration we found an encoder speedup of 3.7× in contrast to a much smaller speedup of 2.4× for the non-ideal memory configuration. A possible optimization of the non-ideal memory configuration is discussed in the next sub-section.
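The 2D block-copy behaviour of the DMA engine can be sketched in software as follows. Function name and signature are illustrative; only the row-wise two-dimensional copy between differently strided buffers and the 32 B burst limit are taken from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Software sketch of the accelerator DMA engine: a width x height block of
// samples is copied row by row between buffers with different line strides,
// and each row is split into bursts of successive samples of at most one
// cache line (32 B).
void dma_copy_2d(uint8_t* dst, std::size_t dst_stride,
                 const uint8_t* src, std::size_t src_stride,
                 std::size_t width, std::size_t height) {
    constexpr std::size_t kBurst = 32;              // cache-line sized burst
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; x += kBurst)
            std::memcpy(dst + y * dst_stride + x,   // one burst transfer
                        src + y * src_stride + x,
                        std::min(kBurst, width - x));
}
```

In the hardware implementation each `memcpy` corresponds to one burst on the micro-network, so wide blocks amortize the per-transaction overhead.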

7.2. Hardware Accelerator Pipeline

Each hardware accelerator component is called independently by the software running on the ARM cores, even if the output of one hardware component is directly used by the next hardware component. In such a case it is more beneficial when the hardware accelerators communicate directly with each other, without involving the software. For the proposed set of hardware accelerator components we identified two separate hardware accelerator pipelines.
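The benefit of direct accelerator-to-accelerator communication can be made concrete with a simple transfer-count model. This is our own simplification, assuming one input read and one output write of the processed block per software-mediated accelerator call:

```cpp
// Simplified memory-transfer model: if software invokes each accelerator
// separately, every stage output travels through shared memory and is read
// back for the next stage; in a direct HW pipeline only the pipeline input
// and the final result cross the memory interface.
unsigned block_transfers(unsigned stages, bool direct_hw_pipeline) {
    if (direct_hw_pipeline)
        return 2;           // one read at the head, one write at the tail
    return 2 * stages;      // per stage: read input + write output
}
```

For the four-stage residual pipeline below, this model predicts a reduction from eight block transfers to two, which is the intuition behind the DirectHW configuration.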


In the first pipeline the intra prediction output is sent directly to the following HAD cost calculation. This pipeline is used in the first stage of the intra mode estimation algorithm, as depicted in Fig. 7. In the second pipeline the intra prediction output is sent to the forward transform/quantization, followed by the inverse transform/quantization, followed by the SSE cost calculation. This pipeline is used in the intra residual estimation, as described in Sec. 4.2. The design of the HW-accelerated HEVC encoder with both of these pipelines is depicted in Fig. 14.

[Figure: block diagram of the PE (with I$ and D$), the RAM and the directly connected accelerator blocks IPred, Cost HAD, FwdTrQuant, InvTrQuant and Cost SSE.]

Figure 14: HW-accelerated encoder architecture with directly connected hardware components.

A single-core speedup comparison of the ideal, the non-ideal and the pipelined encoder is given in Fig. 15. The pipelined encoder configuration is denoted DirectHW in the diagram. By using the pipelined implementation the gap between the ideal and the non-ideal memory configuration could be reduced by 26.8%. The final average encoder speedup of the single-core HW-accelerated HEVC encoder is 2.7× over all QP values.


[Figure: bar chart of speedup (0–5) for QP22, QP27, QP32, QP37 and Avg. for the Ideal Memory, Non-Ideal Memory and DirectHW configurations.]

Figure 15: HW-accelerated single-core encoder speedup comparison.

8. Experimental Results

8.1. Heterogeneous Parallel HEVC Encoder

For the following experiments the parallel CU estimation algorithm from Sec. 6 has been combined with the hardware accelerator components implemented in Sec. 7. A comparison of the heterogeneous parallel encoder speedup for the different HW/SW optimization options discussed above is given in Fig. 16 for the class D BasketballPass test sequence. The average encoder speedup over all QP values for the non-ideal memory configuration is 3.5×, which could be increased to 5.2× by enabling the optimizations discussed in the previous sections. These optimizations include, first of all, the additional memory block for data structures with a high number of memory access conflicts SepMem.D3.N and, second, the hardware synchronization component HwSync presented in Sec. 6.2. The third optimization is the pipeline between the hardware accelerators DirectHW, as discussed in Sec. 7.2. Overall, an additional encoder speedup of 1.7× (from 3.5× to 5.2×) could be enabled by combining these HW/SW co-optimizations. Thus the analysis shows that parallelization schemes and hardware accelerators require a careful system design to exploit their potential benefits.

[Figure: bar chart of speedup (2–7) for QP22, QP27, QP32, QP37 and Avg. for the configurations Parallel + HW-accel., +SepMem.D3.N, +HwSync and +DirectHW.]

Figure 16: Heterogeneous parallel encoder speedup comparison for different optimization options.

A comparison of the average speedup gains of the optimized parallel software-only encoder (2.9×), the optimized single-core HW-accelerated encoder (2.7×) and the optimized combined parallel HW-accelerated HEVC encoder (5.2×) is depicted in Fig. 17. For the parallel as well as for the HW-accelerated encoder their respective optimizations have been enabled again. This comparison shows that both approaches yield similar gains and that a combination of both approaches is beneficial, due to their disjoint acceleration schemes.

[Figure: bar chart of speedup (0–7) for QP22, QP27, QP32, QP37 and Avg. for the configurations Parallel optimized, HW-accel. optimized and Parallel + HW-accel. optimized.]

Figure 17: Comparison of parallel, HW-accelerated and combined encoder speedup.

A summary of the different encoder configurations used during the previous explorations and their average speedup is given in Tab. 1. As explained above in Sec. 6.1, the HEVC transform skip (TS) option has been disabled for all listed configurations, including the serial configuration. Furthermore, the mandatory 16 MB shared-memory SRAM component, the NoC interconnect and the IO components are also part of every analyzed configuration.

Name                             Configuration                                      Speedup
Serial                           1 core                                             1.00
Parallel                         5 cores                                            2.47
Parallel optimized               5 cores, second memory and HW-synchronization      2.92
HW-accel.                        1 core and HW-accelerators                         2.34
HW-accel. optimized              1 core, HW-accelerators and separate HW-pipeline   2.71
Parallel + HW-accel.             5 cores and HW-accelerators                        3.50
Parallel + HW-accel. optimized   5 cores, HW-accelerators, second memory,           5.21
                                 HW-synchronization and separate HW-pipeline

Table 1: Average encoder speedup for different platform configurations.

Comparing the analyses of multiple frames of the same test sequence for the intra-only use case has shown no relevant deviations between these frames, which can be explained by the similar image characteristics of frames within the same sequence. Nevertheless, between frames of different test sequences some deviations in the image characteristics could be found. For this reason we encoded the first frame of each class B, C and D test sequence with all optimization options enabled for a final performance evaluation. The results of these experiments are shown in Tab. 2.

On average over all test sequences and QP values we found an encoder speedup of 5.3×. Due to the fact that the parallelization approach, the HW-accelerated components and the proposed HW/SW co-optimizations do not modify the encoder algorithm, this speedup could be realized without any PSNR loss in comparison to the reference software model implementation. For lower QP values a higher encoder speedup could be found, because for lower QP values more significant transform coefficients are left over after the quantization, which results in a longer execution time of the software RDOQ algorithm (see also the group profile in Fig. 12). Therefore replacing the software RDOQ algorithm by a much faster hardware accelerator provides a higher speedup for the lower QP values. As can be seen in Tab. 2, the highest encoder speedup could be found for the class C PartyScene test sequence, with an average speedup of 6.3× and a maximum speedup of 6.7× for QP22.

8.2. Instrumentation Overhead

Finally the overhead of the proposed non-intrusive instrumentation approach should be discussed. For this purpose, the execution time of the final heterogeneous multi-core platform simulation model, running on an Intel Xeon [email protected] GHz workstation, has been measured for different trace configurations. A comparison of these execution times with the execution time of an unmodified SystemC platform simulation model is shown in Fig. 18.

[Figure: simulation time overhead in % (0–25) over the number of traced variables/signals (0–20), with separate curves for member variables and signals.]

Figure 18: Simulation time overhead in percent for instrumentation functions.

Adding the proposed instrumentation functions to the SystemC simulation kernel introduces an overhead of only 1.7%, despite the fact that all signals and TLM connections in the design have been modified to provide the required tracing functionality. Enabling the tracing of signals or the tracing of member variables yields a linear increase of the execution time, by approx. 0.1% for each additional signal and by approx. 0.6% for each additional member variable. For example, for the function level execution time analysis shown above, it is sufficient to trace only five program counter member variables, which results in an overall simulation time overhead of approx. 7% compared to an unmodified SystemC simulation framework.


9. Conclusion

In this work we describe a non-intrusive instrumentation methodology for SystemC platform simulation models. Based on this instrumentation methodology an analysis framework has been developed to realize different performance and bottleneck analysis functions. Besides support for automatic and non-intrusive instrumentation, the simulation time overhead of the instrumentation functions has been kept small with respect to the overall platform model simulation time.

With the help of the presented analysis framework a parallel and heterogeneous HEVC intra-only encoder has been iteratively developed and optimized. To this end, in a first step a parallel CU level compression scheme has been introduced into the HM-15.0 HEVC reference encoder software implementation. In a second step, the time consuming parts of the encoder algorithm have been replaced with hardware accelerators, where appropriate. For the system design process a SystemC simulation model of an ARM-based multi-core platform has been developed and extended with the proposed hardware accelerators. During the entire design process this model has been exploited for several different tasks, such as functional evaluation, guiding the parallelization and HW/SW partitioning decisions as well as evaluating the system performance.

In future work we plan to extend the analysis framework to provide a set of optimization options by automatically identifying functional as well as data structure related partitions for parallelization and hardware acceleration. Furthermore, we plan to annotate the encoder simulation platform with a hardware cost model to obtain a generic multi-dimensional and configurable HEVC encoder model, with which we can analyze the trade-offs between different design decisions, e.g. hardware resources vs. execution time vs. video compression ratio.

Besides optimizing the execution time, energy consumption is also a major concern for the HW/SW co-design of embedded systems. In order to extend the analysis framework with an estimation of the energy consumption, we plan to annotate the cycle accurate trace data collected from the platform simulation model with generic energy models for the different platform components. Depending on the required accuracy, the most important platform components we plan to annotate are the instruction set simulators, the memory components, the interconnect and the hardware accelerators. For this purpose we plan to examine different annotation schemes, like instruction level annotation for the instruction set simulators or access pattern annotation for the interconnect and the memory components.

References

[1] R. K. Gupta, G. D. Micheli, Hardware-software cosynthesis for dig- ital systems, IEEE Design Test of Computers 10 (3) (1993) 29–41. doi:10.1109/54.232470.

[2] R. Ernst, J. Henkel, T. Benner, Hardware-software cosynthesis for mi- crocontrollers, IEEE Design Test of Computers 10 (4) (1993) 64–75. doi:10.1109/54.245964.

ACCEPTED[3] L. Benini, E. Flamand, D. Fuin, MANUSCRIPT D. Melpignano, P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded com-

54 ACCEPTED MANUSCRIPT

puting accelerator, in: Proceedings of the Conference on Design, Au- tomation and Test in Europe, DATE ’12, EDA Consortium, San Jose, CA, USA, 2012, pp. 983–987.

[4] F. Conti, D. Rossi, A. Pullini, I. Loi, L. Benini, PULP: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision, Journal of Signal Processing Systems 84 (3) (2016) 339–354. doi:10.1007/s11265-015-1070-9.

[5] L. Eeckhout, Computer architecture performance evaluation methods, Synthesis Lectures on Computer Architecture, Morgan & Claypool Pub- lishers, 2010.

[6] J. Teich, Hardware/software codesign: The past, the present, and pre- dicting the future, Proceedings of the IEEE 100 (Special Centennial Issue) (2012) 1411–1430. doi:10.1109/JPROC.2011.2182009.

[7] D. Densmore, R. Passerone, A platform-based taxonomy for ESL design, IEEE Design Test of Computers 23 (5) (2006) 359–374. doi:10.1109/MDT.2006.112.

[8] G. Sullivan, J. Ohm, W.-J. Han, T. Wiegand, Overview of the high effi- ciency video coding HEVC standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1649–1668. doi:10.1109/TCSVT.2012.2221191.

[9] High efficiency video coding, Rec. ITU-T H.265 v2 and ISO/IEC 23008-2 ACCEPTEDMPEG-H Part 2: HEVC (2014) MANUSCRIPT 1–540. [10] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the

55 ACCEPTED MANUSCRIPT

H.264/AVC video coding standard, IEEE Trans. Circuits Syst. Video Technol. 13 (7) (2003) 560–576. doi:10.1109/TCSVT.2003.815165.

[11] J. Ohm, G. Sullivan, H. Schwarz, T. K. Tan, T. Wiegand, Comparison of the coding efficiency of video coding standards — including high effi- ciency video coding (HEVC), IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1669–1684. doi:10.1109/TCSVT.2012.2221192.

[12] J. Vanne, M. Viitanen, T. Hamalainen, A. Hallapuro, Comparative rate-distortion-complexity analysis of HEVC and AVC video codecs, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1885–1898. doi:10.1109/TCSVT.2012.2223013.

[13] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, D. A. Wood, The Gem5 simulator, SIGARCH Comput. Archit. News 39 (2) (2011) 1–7. doi:10.1145/2024716.2024718.

[14] P. Panda, SystemC - a modeling platform supporting multiple design abstractions, in: Proceedings of the 14th International Symposium on System Synthesis, 2001, pp. 75–80. doi:10.1109/ISSS.2001.156535.

[15] IEEE standard for standard SystemC language reference manual, IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005) (2012) 1– ACCEPTED638doi:10.1109/IEEESTD.2012.6134619. MANUSCRIPT [16] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, M. Olivieri, MPARM: Exploring the multi-processor SoC design space with SystemC, Journal

56 ACCEPTED MANUSCRIPT

of VLSI signal processing systems for signal, image and video technology 41 (2) (2005) 169–182. doi:10.1007/s11265-005-6648-1.

[17] M. Dales, SWARM - Software ARM (Feb. 2003). URL http://www.cl.cam.ac.uk/~mwd24/phd/swarm.html

[18] D. Hedde, F. Petrot, A non intrusive simulation-based trace system to analyse multiprocessor systems-on-chip software, in: 22nd IEEE Inter- national Symposium on Rapid System Prototyping, RSP ’11, 2011, pp. 106–112. doi:10.1109/RSP.2011.5929983.

[19] S. Lagraa, A. Termier, F. P´etrot,Data mining MPSoC simulation traces to identify concurrent memory access patterns, in: Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’13, EDA Consortium, San Jose, CA, USA, 2013, pp. 755–760.

[20] S. Lagraa, A. Termier, F. P´etrot,Scalability bottlenecks discovery in MPSoC platforms using data mining on simulation traces, in: Pro- ceedings of the Conference on Design, Automation and Test in Europe, DATE ’14, European Design and Automation Association, 3001 Leuven, Belgium, Belgium, 2014, pp. 186:1–186:6.

[21] D. Genius, N. Pouillon, Monitoring communication channels on a shared memory multi-processor system on chip, in: 6th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, ReCoSoC, ACCEPTED2011, pp. 1–8. doi:10.1109/ReCoSoC.2011.5981502. MANUSCRIPT [22] D. Genius, Measuring memory access latency for software objects in a NUMA system-on-chip architecture, in: 8th International Workshop

57 ACCEPTED MANUSCRIPT

on Reconfigurable and Communication-Centric Systems-on-Chip, Re- CoSoC, 2013, pp. 1–8. doi:10.1109/ReCoSoC.2013.6581525.

[23] G. Kiczales, E. Hilsdale, Aspect-oriented programming, SIGSOFT Softw. Eng. Notes 26 (5) (2001) 313–. doi:10.1145/503271.503260.

[24] D. D´eharbe, S. Medeiros, Aspect-oriented design in SystemC: Imple- mentation and applications, in: Proceedings of the 19th Annual Sym- posium on Integrated Circuits and Systems Design, SBCCI ’06, ACM, New York, NY, USA, 2006, pp. 119–124. doi:10.1145/1150343.1150378.

[25] M. Kallel, Y. Lahbib, R. Tourki, A. Baganne, Verification of Sys- temC transaction level models using an aspect-oriented and generic approach, in: 5th International Conference on Design and Technol- ogy of Integrated Systems in Nanoscale Era, DTIS, 2010, pp. 1–6. doi:10.1109/DTIS.2010.5487605.

[26] B. Albertini, S. Rigo, G. Araujo, C. Araujo, E. Barros, W. Azevedo, A computational reflection mechanism to support platform debugging in SystemC, in: 5th IEEE/ACM/IFIP International Conference on Hard- ware/Software Codesign and System Synthesis, CODES+ISSS, 2007, pp. 81–86.

[27] G. Beltrame, L. Fossati, D. Sciuto, ReSP: A nonintrusive transaction- level reflective MPSoC simulation platform for design space exploration, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 28 (12) (2009) ACCEPTED1857–1869. doi:10.1109/TCAD.2009.2030268. MANUSCRIPT

58 ACCEPTED MANUSCRIPT

[28] S. Roiser, P. Mato, The SEAL C++ reflection system, in: Proceedings of International Conference on Computing in High Energy and Nuclear Physics, CHEP ’04, CERN, 2004, pp. 437–440.

[29] X. Li, X. Zhang, Y. Shi, Z. Gao, Prediction unit depth selection based on statistic distribution for HEVC intra coding, in: Proceedings of the IEEE International Conference on Multimedia and Expo Workshops, ICMEW, 2014, pp. 1–6. doi:10.1109/ICMEW.2014.6890719.

[30] S. Wang, S. Ma, X. Jiang, J. Fan, D. Zhao, W. Gao, A fast intra optimization algorithm for HEVC, in: IEEE Visual Com- munications and Image Processing Conference, 2014, pp. 241–244. doi:10.1109/VCIP.2014.7051549.

[31] K. Huang, S. i. Han, K. Popovici, L. Brisolara, X. Guerin, L. Li, X. yan, S. I. Chae, L. Carro, A. A. Jerraya, Simulink-based MPSoC design flow: Case study of Motion-JPEG and H.264, in: 44th ACM/IEEE Design Automation Conference, 2007, pp. 39–42.

[32] H. Yviquel, A. Sanchez, P. Jskelinen, J. Takala, M. Raulet, E. Casseau, Embedded multi-core systems dedicated to dynamic dataflow pro- grams, Journal of Signal Processing Systems 80 (1) (2015) 121–136. doi:10.1007/s11265-014-0953-5.

[33] C. C. Chi, M. Alvarez-Mesa, B. Juurlink, G. Clare, F. Henry, S. Pateux, T. Schierl, Parallel scalability and efficiency of HEVC parallelization ACCEPTEDapproaches, IEEE Trans. Circuits MANUSCRIPT Syst. Video Technol. 22 (12) (2012) 1827–1838. doi:10.1109/TCSVT.2012.2223056.

59 ACCEPTED MANUSCRIPT

[34] T. K. Heng, W. Asano, T. Itoh, A. Tanizawa, J. Yamaguchi, T. Matsuo, T. Kodama, A highly parallelized H.265/HEVC real-time UHD software encoder, in: Proceedings of the IEEE International Conference on Image Processing, ICIP, 2014, pp. 1213–1217. doi:10.1109/ICIP.2014.7025242.

[35] K. Chen, Y. Duan, L. Yan, J. Sun, Z. Guo, Efficient SIMD optimization of HEVC encoder over X86 processors, in: Asia-Pacific Signal Informa- tion Processing Association Annual Summit and Conference, APSIPA ASC, 2012, pp. 1–4.

[36] Y. Zhao, L. Song, X. Wang, M. Chen, J. Wang, Efficient realization of parallel HEVC intra encoding, in: Proceedings of the IEEE International Conference on Multimedia and Expo Workshops, ICMEW, 2013, pp. 1– 6. doi:10.1109/ICMEW.2013.6618415.

[37] A. Abramowski, G. Pastuszak, A double-path intra prediction archi- tecture for the hardware H.265/HEVC encoder, in: 17th International Symposium on Design and Diagnostics of Electronic Circuits Systems, 2014, pp. 27–32. doi:10.1109/DDECS.2014.6868758.

[38] E. Kalali, Y. Adibelli, I. Hamzaoglu, A high performance and low energy intra prediction hardware for high efficiency video coding, in: 22nd In- ternational Conference on Field Programmable Logic and Applications, FPL, 2012, pp. 719–722. doi:10.1109/FPL.2012.6339161.

[39] R. Conceicao, J. Souza, R. Jeske, M. Porto, J. Mattos, L. Agostini, Hard- ACCEPTEDware design for the 32x32 IDCT MANUSCRIPT of the HEVC video coding standard, in:

60 ACCEPTED MANUSCRIPT

Proceedings of the 26th Symposium on Integrated Circuits and Systems Design, SBCCI ’13, 2013, pp. 1–6. doi:10.1109/SBCCI.2013.6644881.

[40] S. S. Bhattacharyya, J. Eker, J. W. Janneck, C. Lucarz, M. Mattavelli, M. Raulet, Overview of the MPEG reconfigurable video coding frame- work, J. Signal Process. Syst. 63 (2) (2011) 251–263. doi:10.1007/s11265- 009-0399-3.

[41] J. W. Janneck, I. D. Miller, D. B. Parlour, G. Roquier, M. Wipliez, M. Raulet, Synthesizing hardware from dataflow programs, Journal of Signal Processing Systems 63 (2) (2011) 241–249. doi:10.1007/s11265- 009-0397-5.

[42] F. Palumbo, N. Carta, D. Pani, P. Meloni, L. Raffo, The multi- dataflow composer tool: generation of on-the-fly reconfigurable plat- forms, Journal of Real-Time Image Processing 9 (1) (2014) 233–249. doi:10.1007/s11554-012-0284-3.

[43] E. Bezati, R. Thavot, G. Roquier, M. Mattavelli, High-level dataflow de- sign of signal processing systems for reconfigurable and multicore hetero- geneous platforms, Journal of Real-Time Image Processing 9 (1) (2014) 251–262. doi:10.1007/s11554-013-0326-5.

[44] C. Sau, P. Meloni, L. Raffo, F. Palumbo, E. Bezati, S. Casale-Brunet, M. Mattavelli, Automated design flow for multi-functional dataflow- based platforms, Journal of Signal Processing Systems 85 (1) (2016) ACCEPTED143–165. doi:10.1007/s11265-015-1026-0. MANUSCRIPT

61 ACCEPTED MANUSCRIPT

[45] K. Jerbi, D. Renzi, D. De Saint Jorre, H. Yviquel, M. Raulet, C. Alberti, M. Mattavelli, Development and optimization of high level dataflow programs: The HEVC decoder design case, in: 48th Asilomar Con- ference on Signals, Systems and Computers, 2014, pp. 2155–2159. doi:10.1109/ACSSC.2014.7094857.

[46] F. Urban, R. Poullaouec, J. F. Nezan, O. Deforges, A flexible heteroge- neous hardware/software solution for real-time HD H.264 motion esti- mation, IEEE Trans. Circuits Syst. Video Technol. 18 (12) (2008) 1781– 1785. doi:10.1109/TCSVT.2008.2004927.

[47] T. Dias, N. Roma, L. Sousa, H.264/AVC framework for multi-core em- bedded video encoders, in: International Symposium on System on Chip (SoC), 2010, pp. 89–92. doi:10.1109/ISSOC.2010.5625538.

[48] H. K. Zrida, A. C. Ammari, A. Jemai, M. Abid, System-level performance evaluation of a H.264/AVC encoder targeting multiprocessors architectures, in: Proceedings of the International Conference on Microelectronics, ICM, 2009, pp. 169–172. doi:10.1109/ICM.2009.5418661.

[49] B. Zatt, C. Diniz, L. V. Agostini, S. Bampi, Timing and interface communication analysis of H.264/AVC encoder using SystemC model, in: 18th IEEE/IFIP International Conference on VLSI and System-on-Chip, 2010, pp. 235–240. doi:10.1109/VLSISOC.2010.5642666.

[50] S. Jo, S. H. Jo, Y. H. Song, Exploring parallelization techniques based on OpenMP in H.264/AVC encoder for embedded multi-core processor, Journal of Systems Architecture 58 (9) (2012) 339–353. doi:10.1016/j.sysarc.2012.06.005.

[51] J. Brandenburg, B. Stabernack, Exploring the concurrent execution of HEVC intra encoding algorithms for heterogeneous multi core architectures, in: Proceedings of the Conference on Design and Architectures for Signal and Image Processing, DASIP, 2015, pp. 1–8. doi:10.1109/DASIP.2015.7367268.

[52] DWARF Debugging Information Format Committee, DWARF debugging information format, Version 4 (Jun. 2010).

[53] DA’s DWARF Page (2016). URL http://www.prevanders.net/dwarf.html

[54] T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, F. Yergeau, Extensible Markup Language (XML), World Wide Web Consortium Recommendation REC-xml-19980210 (1998). URL http://www.w3.org/TR/1998/REC-xml-19980210

[55] Y. Shafranovich, Common format and MIME type for comma-separated values (CSV) files, RFC 4180 (Informational) (Oct. 2005). URL http://www.ietf.org/rfc/rfc4180.txt

[56] Fraunhofer Heinrich Hertz Institute, High Efficiency Video Coding (HEVC) — JCT-VC (2016). URL https://hevc.hhi.fraunhofer.de/

[57] J. Lainema, F. Bossen, W.-J. Han, J. Min, K. Ugur, Intra coding of the HEVC standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1792–1801. doi:10.1109/TCSVT.2012.2221525.

[58] V. Sze, M. Budagavi, High throughput CABAC entropy coding in HEVC, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1778– 1791. doi:10.1109/TCSVT.2012.2221526.

[59] I.-K. Kim, J. Min, T. Lee, W.-J. Han, J. Park, Block partitioning structure in the HEVC standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1697–1706. doi:10.1109/TCSVT.2012.2223011.

[60] G. J. Sullivan, T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag. 15 (6) (1998) 74–90. doi:10.1109/79.733497.

[61] F. Bossen, B. Bross, K. Suhring, D. Flynn, HEVC complexity and implementation analysis, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1685–1696. doi:10.1109/TCSVT.2012.2221255.

[62] N. Pouillon, A. Becoulet, A. de Mello, F. Pecheux, A. Greiner, A generic instruction set simulator API for timed and untimed simulation and debug of MP2-SoCs, in: IEEE/IFIP International Symposium on Rapid System Prototyping, RSP '09, 2009, pp. 116–122. doi:10.1109/RSP.2009.11.

[63] LIP6, SoCLib (2016). URL http://www.soclib.fr/trac/dev

[64] VSI Alliance, Virtual component interface standard version 2.0 (OCB 2 2.0), On-Chip Bus Development Working Group (2001) 1–132.

[65] X. Guerin, F. Petrot, A system framework for the design of embedded software targeting heterogeneous multi-core SoCs, in: 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP, 2009, pp. 153–160. doi:10.1109/ASAP.2009.9.

[66] G. Clare, F. Henry, S. Pateux, Wavefront parallel processing for HEVC encoding and decoding, document JCTVC-F274, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Torino, IT (Jul. 2011).

[67] C. Lan, J. Xu, G. Sullivan, F. Wu, Intra transform skipping, document JCTVC-I0408, Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, Geneva, CH (Apr. 2012).

[68] H. B. Yin, E.-H. Yang, X. Yu, Z. Xia, Fast soft decision quantization with adaptive preselection and dynamic trellis graph, IEEE Trans. Circuits Syst. Video Technol. 25 (8) (2015) 1362–1375. doi:10.1109/TCSVT.2014.2380232.

[69] M. Karczewicz, Y. Ye, I. Chong, Rate distortion optimized quantization, document VCEG-AH21, ITU-T SG16/Q.6 Video Coding Experts Group (VCEG), Antalya, Turkey (Jan. 2008).

Sequence                    QP 22  QP 27  QP 32  QP 37  Avg.
Class B (1920×1080)
  BasketballDrive             5.3    4.7    4.4    4.4   4.7
  BQTerrace                   5.5    5.3    4.9    4.7   5.1
  Cactus                      5.7    4.6    4.2    4.1   4.7
  Kimono1                     5.2    4.8    4.6    4.5   4.8
  ParkScene                   6.1    5.5    5.0    4.7   5.3
  Tennis                      5.4    4.8    4.6    4.5   4.8
  Class B Avg.                5.6    4.9    4.6    4.5   4.9
Class C (832×480)
  BasketballDrill             5.5    5.1    4.8    4.5   5.0
  BQMall                      5.0    4.4    4.4    4.3   4.5
  PartyScene                  6.7    6.5    6.1    5.6   6.3
  RaceHorses                  6.3    6.2    5.7    5.1   5.9
  Class C Avg.                6.0    5.6    5.3    4.9   5.5
Class D (416×240)
  BasketballPass              5.7    5.3    5.0    4.7   5.2
  BlowingBubbles              6.2    5.6    5.1    4.7   5.4
  BQSquare                    6.6    6.4    6.0    5.6   6.2
  RaceHorses                  6.3    5.9    5.5    4.9   5.7
  Class D Avg.                6.2    5.8    5.4    5.0   5.6
All                           5.9    5.4    5.1    4.7   5.3

Table 2: Heterogeneous parallel encoder speedup for different test sequences.

Biography of Authors

Jens Brandenburg received his Dipl.-Ing. degree in Computer Engineering from the Technical University of Berlin, Germany in 2004. In 2006 he joined the Embedded Systems Group of the Video Coding & Analytics Department of the Fraunhofer HHI, Berlin, Germany. Since then he has worked on the implementation of video decoder and encoder solutions for the H.264/AVC, H.264/SVC, and H.265/HEVC standards. His research interests include the optimization of image processing algorithms for real-time processing and the HW/SW co-analysis of embedded platforms.

Benno Stabernack received his Diploma and Dr.-Ing. degrees in electrical engineering from the Technical University of Berlin, Germany in 1996 and 2004, respectively. In 1996 he joined the Fraunhofer Heinrich Hertz Institute, Berlin, Germany, where, as head of the Embedded Systems Group of the Video Coding & Analytics Department, he is currently responsible for research projects focused on hardware and software architectures for image processing algorithms. Since summer 2005 he has lectured on the design of application-specific processors at the Technical University of Berlin, Germany. His research interests include VLSI architectures for video signal processing, processor architectures for embedded media signal processing, and System-on-Chip (SoC) designs.
