Mutual Impact Between Clock Gating and High Level Synthesis in Reconﬁgurable Hardware Accelerators

electronics Article Mutual Impact between Clock Gating and High Level Synthesis in Reconfigurable Hardware Accelerators Francesco Ratto 1 , Tiziana Fanni 2 , Luigi Raffo 1 and Carlo Sau 1,* 1 Dipartimento di Ingegneria Elettrica ed Elettronica, Università degli Studi di Cagliari, Piazza d’Armi snc, 09123 Cagliari, Italy; [email protected] (F.R.); [email protected] (L.R.) 2 Dipartimento di Chimica e Farmacia, Università degli Studi di Sassari, Via Vienna 2, 07100 Sassari, Italy; [email protected] * Correspondence: [email protected] Abstract: With the diffusion of cyber-physical systems and internet of things, adaptivity and low power consumption became of primary importance in digital systems design. Reconfigurable heterogeneous platforms seem to be one of the most suitable choices to cope with such challenging context. However, their development and power optimization are not trivial, especially considering hardware acceleration components. On the one hand high level synthesis could simplify the design of such kind of systems, but on the other hand it can limit the positive effects of the adopted power saving techniques. In this work, the mutual impact of different high level synthesis tools and the application of the well known clock gating strategy in the development of reconfigurable accelerators is studied. The aim is to optimize a clock gating application according to the chosen high level synthesis engine and target technology (Application Specific Integrated Circuit (ASIC) or Field Programmable Gate Array (FPGA)). Different levels of application of clock gating are evaluated, including a novel multi level solution. Besides assessing the benefits and drawbacks of the clock gating application at different levels, hints for future design automation of low power reconfigurable accelerators through high level synthesis are also derived. Citation: Ratto, F.; Fanni, T.; Raffo, L.; Keywords: high level synthesis; power management; clock gating; hardware acceleration; reconfig- Sau, C. Mutual Impact between Clock urable computing; design automation Gating and High Level Synthesis in Reconfigurable Hardware Accelerators. Electronics 2021, 10, 73. https://doi.org/10.3390/electronics 1. Introduction 10010073 With the advent of Internet of Things (IoT) and wearable devices, connectivity and Received: 16 November 2020 portability became core features for numerous embedded systems. Connectivity and Accepted: 30 December 2020 system integration led to an explosion of the amount and complexity of functionalities that Published: 3 January 2021 can be potentially supported, while portability often implies stringent constraints on power consumption, which need to be tackled since the first design steps to be met. Cyber-Physical Publisher’s Note: MDPI stays neu- Systems (CPS) era brought sensors and actuators within embedded and IoT systems in tral with regard to jurisdictional clai- many different markets [1], making them interactive with physical processes, environment, ms in published maps and institutio- and humans, as well as reactive to their status/requirements [2,3]. Application Specific nal affiliations. Integrated Circuits (ASICs) constitute one of the best options to tackle CPS performance constraints as area reduction, energy consumption minimization and speed maximization, but they generally lack in flexibility/adaptivity. Latest Field Programmable Gate Arrays Copyright: © 2021 by the authors. Li- (FPGAs) constitute a valuable alternative, embedding on the same chip flexible general censee MDPI, Basel, Switzerland. purpose processing resources and efficient programmable logic. In any case, to meet the This article is an open access article adaptivity needs of CPS, heterogeneous and flexible platforms where different kinds of distributed under the terms and con- processing elements, interconnects and memories can be exploited, became the preferred ditions of the Creative Commons At- choice. However, from the designer perspective, heterogeneity does not come for free and tribution (CC BY) license (https:// multiple complementary competencies, ranging from hardware (HW) to software (SW) creativecommons.org/licenses/by/ skills, are required to develop such systems. 4.0/). Electronics 2021, 10, 73. https://doi.org/10.3390/electronics10010073 https://www.mdpi.com/journal/electronics Electronics 2021, 10, 73 2 of 20 On the application side there is also an increasing demand for execution efficiency. In video technology, resolution is constantly augmenting, for example, going from 4 K to 8 K format required 24.9 M more pixels (+300%), with common rates that go from 60 to 120 fps. The information growth and the stringent time constraints are translated into an increased complexity for video algorithms, for example, video codecs. Dedicated HW has become an appealing solution for such issues, where computation intensive applications lie under strict constraints. Thus, heterogeneous adaptive CPS often involve dedicated HW accelerators to efficiently tackle onerous and constrained tasks within applications. However, designing dedicated HW requires advanced digital design skills and longer development time with respect to the standard SW design flow, and it is even harder when reconfigurable computing is adopted within HW accelerators to achieve the adaptivity required by CPS. Design automation has been conceived to simplify the design process when complexity and time are raised up, and High Level Synthesis (HLS) has become very popular in the last years—starting from a high level description of a functionality, HLS tools derive HW specifications, usually in terms of Hardware Description Language (HDL), ready to be processed by digital design tools for FPGA or ASIC design. The increased need for HW acceleration pushed the main FPGA and ASIC stakeholders to put huge effort in HLS tools [4,5], and the number of HLS solutions has growth in the last years also on the academical side [6]. It is true though that HLS-derived solutions are not as good as manually optimized ones, neither in terms of timing performance nor in terms of consumed power. Often, designers are required to make code refactoring or to use specific pragmas in the input high level specification to get better solutions. Moreover, the adopted HLS tool has a great impact on the resulting HW specification in terms of adopted resources and performance, and it is strictly related to the targeted technology. HLS usually does not provide any support for adaptivity/reconfiguration, nor for system-level low power management. Power consumption is managed at physical level, through the power management features provided by digital design tools. Moreover, in HW design it is not possible to know the optimal trade-off among different metrics a priori, and design iterations in the implementation steps might be necessary to meet the desired performance. System-level power management would help when a low power accelerator is needed, but it requires to be aware of the effectiveness of the applied technique, which in turn may depend on the adopted HLS tool. Some works tried to address the power management at system-level [7–9] and in some cases they also tackled design automation for HW acceleration and adaptivity [10,11]. However, to the best of our knowledge, there are no works in the literature that study the mutual impact of the chosen HLS and the adopted the power management strategy. Within this context, focusing on reconfigurable accelerators, the main contributions of this work are: • the analysis of the effectiveness of several HLS tools in the implementation of low- power HW accelerators for heterogeneous and adaptive systems; • the exploration of the mutual impact between different HLS strategies and a simple low power technique, clock gating, applied at various levels. As far as we know, this is the first time that a similar study, investigating HLS and clock gating peculiarities together, is proposed in literature. Traditionally, clock gating is either applied at region or actor level, as it will be better explained in Section3. Nevertheless, to cover a wider range of possibilities, in this paper we also propose a novel multi level clock gating approach for dataflow based reconfigurable HW accelerators; • the assessment and comparison of the different clock gating solutions, including the new one, on a video coding use case while adopting multiple HLS tools and target technologies; • the derivation of hints and guidelines for an optimal application of clock gating and for its future integration on an automatic design flow for dataflow based HW accelerators. Electronics 2021, 10, 73 3 of 20 The main objective of this work is to understand the relationships between HLS tools and CG application at different levels considering dataflow based reconfigurable accelerators. In particular, on one hand peculiarities related to HLS tools are considered, like implemented models of computation, hardware communication protocols, latency of the generated hardware, while on the other CG related aspects, such as level of application and target technology, are considered. The conducted study is intended as a preliminary work under the view of a future design automation for the addressed devices, so that hints and guidelines for driving the process in order to achieve an optimal behavior under different metrics, namely resource occupancy, maximum frequency and power, are investigated and highlighted. From the conducted study and experiments it will be clear as the new proposed multi level clock gating technique ensures

Mutual Impact Between Clock Gating and High Level Synthesis in Reconﬁgurable Hardware Accelerators

Instruction-Level Distributed Processing

Power Management 24

Power Management Using FPGA Architectural Features Abu Eghan, Principal Engineer Xilinx Inc

Clock Gating for Power Optimization in ASIC Design Cycle: Theory & Practice

Saber Eletrônica, Designers Pois Precisamos Comprovar Ao Meio Anunciante Estes Números E, Assim, Carlos C

Computer Architecture Techniques for Power-Efficiency

Analysis of Body Bias Control Using Overhead Conditions for Real Time Systems: a Practical Approach∗

Dynamic Voltage/Frequency Scaling and Power-Gating of Network-On-Chip with Machine Learning

Register Allocation and VDD-Gating Algorithms for Out-Of-Order

Power Reduction Techniques for Microprocessor Systems

A 65 Nm 2-Billion Transistor Quad-Core Itanium Processor

Happy: Hyperthread-Aware Power Profiling Dynamically