
GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs

Markus Vogt, Gerald Hempel, Jeronimo Castrillon
Technische Universität Dresden
Faculty of Computer Science
Dresden, Germany
Email: {forename.surname}@tu-dresden.de

Christian Hochberger
Computer Systems Group
Technische Universität Darmstadt
Darmstadt, Germany
Email: [email protected]

Abstract: In recent years, architectures combining a reconfigurable fabric and a general purpose processor on a single chip have become increasingly popular. Such hybrid architectures allow extending embedded software with application specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers in handling the complexity of the required process of hardware/software (HW/SW) partitioning is an important issue. Current methods are often restricted, either to bare-metal systems or to subsets of mainstream programming languages, or they require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite.

I. INTRODUCTION

Today, embedded hybrid platforms combining field programmable gate arrays (FPGA) and high performance RISC processing cores give the user the freedom to implement specialized peripherals in the FPGA fabric while still relying on the execution power of the RISC processor(s). The Xilinx Zynq system on chip (SoC) family and the Altera Cyclone/Arria V SoC are prominent examples for this approach.

Such devices pave the path for the integration of arbitrary hardware accelerators in complex applications. However, most software developers are not familiar with hardware description languages (HDL). Thus, they are unable to develop application specific accelerators on their own. This problem has been addressed in the past by many researchers. Yet, the proposed solutions are not satisfactory. The user still has to write their own HDL code, has to take care of the HW/SW partitioning (often by annotating the existing code) and has to create the required SW/HW interfaces.

Our approach aims to notably lower the entry barrier for software developers to hardware-accelerated program execution. This particularly means using plain unannotated C, which is a popular and established language, as input. In this way, we bring hardware acceleration to a broader range of general applications. We envision a transparent workflow ideally not demanding any HDL skills or knowledge about the underlying hardware platform from the developer, providing seamless integration with the software environment.

The contributions of this paper are:
• Automated HW/SW partitioning using a GCC-plugin that extracts accelerators from C code and generates synthesizable HDL code.
• Automated and platform agnostic code patching that enables seamless integration with the software environment. Accelerator invocation remains completely transparent with optional fall-back to software execution.
• Support for legacy application code without annotations.

The rest of this paper is organized as follows. Section II presents the related work. In Section III we introduce the target hardware platforms of our proposed workflow. Section IV describes our workflow, the compiler plugin and its integration into GCC. Sections V and VI present the evaluation of our approach and discuss results. The remaining sections draw conclusions and point out future work.

II. RELATED WORK

Since the emergence of FPGAs, many efforts have been made to exploit the performance gain offered by reconfigurable logic with customized hardware accelerators. This especially holds true for hybrid FPGA architectures tightly coupling a general purpose processor with reconfigurable logic.

The most obvious and flexible, but also the most challenging, way is to write accelerators by hand using an HDL and manually perform all required integration with the software environment. An example is shown in [1]. Designing accelerator-based systems that way requires strong skills in HDL as well as deep knowledge of the underlying hardware platform. The development process usually is time consuming and error-prone. Hence, the ability to implement such systems is left to the relatively small community of FPGA developers.

A number of approaches have been presented that reduce or even completely eliminate the necessity of writing HDL. The goal is to generate synthesizable code for accelerators from a more abstract problem description. LegUp [2] is an open source high-level synthesis (HLS) tool for FPGA based hybrid systems. The HW/SW partitioning is determined by profiling the C program on a self-profiling processor and altering the software binary afterwards in order to run it on the hybrid system. In [3] the authors present basic support for ARM-FPGA hybrid SoCs. In [4] the authors present Nymble, a system based on the techniques introduced by COMRADE [5]. It allows a much larger scope for accelerators by supporting a mechanism for back-delegation of unsuitable code sections into software. For HW/SW partitioning, Nymble requires additional code annotations using pragmas.

Nymble as well as LegUp use the Low Level Virtual Machine (LLVM) as compiler framework. As shown in [6], the GNU Compiler Collection (GCC) has been used for HLS workflows as well. The authors show a customized GCC compiler for generation of hardware accelerators for a bare-metal soft-core processor. Our work extends C-to-HDL transformations for better integration in more complex systems.

The Delft Workbench [7] is a toolset providing semi-automatic HW/SW partitioning as well as HLS for FPGA. The targeted Molen machine architecture can be regarded as a hybrid FPGA-processor architecture. The candidate kernels for hardware acceleration are determined by profiling but must be extracted manually.

Xilinx provides Vivado [8], [9], one of the most popular commercial HLS tools. It supports translating C, SystemC or C++ code directly into hardware. Vivado aims at mapping the whole application to hardware, which requires manual HW/SW partitioning by the user. Similar to Vivado, other HLS tools like ROCCC [10] or CATAPULT [11] provide sophisticated hardware synthesis for hardware-only solutions, with no support for a hybrid HW/SW translation. In [12] the authors present a framework that matches portions of C code (algorithmic skeletons) exposing specific memory access patterns against a library of known accelerator templates. In [13] the authors particularly address the integration of accelerators with the software domain. They present a linker that creates an executable by transparently linking functions implemented in software objects and/or hardware accelerators. With the runtime environment provided, programs can be executed on a Zynq platform running embedded Linux.

Most of the approaches mentioned so far address a certain task related to accelerator generation or integration, but the user still has to perform manual work. This requires, even though to a lesser extent, knowledge of HDL and the underlying hardware platform. In contrast, the work in [14] raises the [...] still object-oriented manner. Lime programs can be compiled either into pure software binaries or into software and a set of hardware accelerators. All interfacing is done automatically by the runtime environment. Introducing a well tailored language circumvents limitations that arise from using existing languages. However, adopting a new language is a high entry barrier for most programmers and existing software must be ported to benefit from hardware acceleration.

III. PLATFORM

The work presented in this paper especially addresses recent hybrid platforms combining embedded processors with a reconfigurable fabric. In this section we briefly describe one such system, namely, the ZedBoard evaluation kit containing a Xilinx Zynq-7000 [17] device.

The programmable logic (PL) in the Zynq-7000 device is a full Artix-7 FPGA fabric, while the processing system (PS) is a complete ARM subsystem featuring a Cortex-A9 dual core processor and a comprehensive set of peripherals. The PS provides four 32-bit general purpose (GP) AXI interfaces, which allow connecting peripherals from the PL, as well as four full-duplex 64-bit high performance (HP) interfaces for connecting AXI masters residing in the PL. The Zynq architecture provides one special high performance interface connected to the Accelerator Coherency Port (ACP). The ACP is internally connected to the ARM Snoop Control Unit and can be used for cache coherent accesses to the ARM subsystem.

It should be noted that the specific handling of these different AXI interfaces depends on the hardware residing in the PL, which presumes a profound understanding of the hardware accelerator.

IV. WORKFLOW

The workflow for transparent HW/SW partitioning and compilation is composed of four steps as shown in Figure 1. (1) Loop data collection performs a whole-program analysis collecting information about all loops across all compilation units. (2) Loop analysis uses that information to select loops for potential HW acceleration, using a cost model of the target platform.