FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs*

Alexandros Papakonstantinou1, Karthik Gururaj2, John A. Stratton1, Deming Chen1, Jason Cong2, Wen-Mei W. Hwu1
1Electrical & Computer Eng. Dept., University of Illinois, Urbana-Champaign, IL, USA
2Computer Science Dept., University of California, Los Angeles, CA, USA
1{apapako2, stratton, dchen, w-hwu}@illinois.edu, 2{karthikg, cong}@cs.ucla.edu

Abstract— As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively improve the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API; it is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread-blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

I. INTRODUCTION

Even though parallel processing has been a major contributor to application speedups achieved by the high performance computing community, its adoption in mainstream computing domains has lagged due to the relative simplicity of enhancing application speed through frequency scaling and transistor shrinking. However, the power wall encountered by traditional single-core processors has forced a global industry shift to the multi-core paradigm. As a consequence of the rapidly growing interest in parallelism at a wider and coarser level than feasible in traditional processors, the potential of GPUs and FPGAs has been realized. GPUs consist of hundreds of processing cores clustered within streaming multiprocessors (SMs) that can handle intensive compute loads with a high degree of data-level parallelism. FPGAs, on the other hand, offer efficient application-specific parallelism extraction through the flexibility of their reconfigurable fabric. Besides, heterogeneity in high performance computing (HPC) has been gaining great momentum, as can be inferred from the proliferation of heterogeneous multiprocessors ranging from Multi-Processor Systems on Chip (MPSoC) like the IBM Cell [21] to HPC clusters with GPU/FPGA accelerated nodes such as the NCSA AC Cluster [20]. The diverse characteristics of these compute cores/platforms render them optimal for different types of application kernels. Currently, the performance and power advantages of heterogeneous multi-processors are offset by the difficulty involved in their programming. Moreover, the use of different parallel programming models in these heterogeneous compute systems often complicates development. In the case of kernel acceleration on FPGAs, the programming effort is further inflated by the need to interface with hardware at the RTL level.

A significant milestone towards the use of the massively parallel compute power of GPUs in non-graphics applications has been the release of CUDA by NVIDIA. CUDA enables general purpose computing on the GPU (GPGPU) through a C-like API and is gaining considerable popularity. In this work we explore the use of CUDA as the programming interface for a new FPGA programming flow (Fig. 1), which is designed to efficiently map the coarse- and fine-grained parallelism expressed in CUDA kernels onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs the state-of-the-art high-level synthesis tool AutoPilot [5], which enables high-abstraction FPGA programming. The flow is enabled by a source-to-source compilation phase, FCUDA, which transforms the SPMD (Single-Program-Multiple-Data) CUDA code into C code for AutoPilot with annotated coarse-grained parallelism. AutoPilot maps the annotated parallelism onto parallel cores (a "core" in this context is an application-specific processing engine) and generates a corresponding RTL description, which is subsequently synthesized and downloaded onto the FPGA.

Fig. 1. CUDA-to-FPGA Flow: CUDA code, FCUDA annotation, annotated CUDA code, FCUDA compilation, AutoPilot C code, AutoPilot synthesis, RTL design

* This work is partially supported by MARCO/DARPA GSRC and NSF CCF 07-46608. The authors would like to acknowledge the equipment donation from Intel and software donation from AutoESL.

The selection of CUDA as the programming interface for our FPGA programming flow offers three main advantages. First, it provides a high-level API for expressing coarse-grained parallelism in a concise fashion within application kernels that are going to be executed on an acceleration device. Even though CUDA is driven by the GPU computing domain, we show that CUDA kernels can indeed be translated with FCUDA into efficient, customized multi-core compute engines on the FPGA. Second, it bridges the programmability gap between homogeneous and heterogeneous platforms by providing a common programming model for clusters with nodes that include GPUs and FPGAs. This simplifies application development and enables efficient evaluation of alternative kernel mappings onto the heterogeneous acceleration devices without time-consuming kernel code re-writing. Third, the wide adoption of the CUDA programming model and its popularity render a large body of existing applications available to FPGA acceleration.

In the next section we discuss important characteristics of the FPGA and GPU platforms along with previous related work. Section III explains the characteristics of the CUDA and AutoPilot programming models and provides insight into the suitability of the CUDA API for programming FPGAs. The FCUDA translation details are presented in Section IV, while Section V presents experimental results and shows that our high-level synthesis based flow can efficiently exploit the computational resources of top-tier FPGAs in a customized fashion. Finally, Section VI concludes the paper and discusses future work.

II. THE FPGA PLATFORM

With increasing transistor densities, the computational capabilities of commercial FPGAs provided by Xilinx [16] and Altera [17] have greatly increased. Modern FPGAs are technologically in sync with the rest of the IC industry, employing the latest manufacturing process technologies and supporting high-bandwidth IO interfaces such as PCIe, Intel's FSB [6] and AMD's HyperTransport [8]. By embedding fast DSP macros, memory blocks and 32-bit microprocessor cores into the reconfigurable fabric, a complete SoC platform is available for applications which require high-throughput computation at a low power footprint.

The flexibility of the reconfigurable fabric provides a versatile platform for leveraging different types of application-specific parallelism: i) coarse- and fine-grained, ii) data- and task-level and iii) different pipelined configurations. Reconfigurability, though, has an impact on the clock frequency achievable on the FPGA platform. Synthesis-generated wire-based communication between parallel modules may limit the throughput of designs with wider parallelism compared to smaller but faster clocked architectures. In our flow we leverage the CUDA programming model to build multi-core acceleration designs with a low count of inter-core communication interconnect.

FPGA devices reportedly offer a significant advantage (4X-12X) in power consumption over GPUs. J. Williams et al. [1] showed that the computational density per Watt in FPGAs is much higher than in GPUs. This is true even for 32-bit integer and floating-point arithmetic (6X and 2X respectively), for which the raw computational density of GPUs is higher.

A. Application Domains

FPGAs have been employed in different projects for the acceleration of compute intensive applications. Examples range from data parallel kernels [11, 13] to entire applications such as face detection [9]. Although they allow flexible customization of the architecture to the application, the physical constraints of their configurable fabric favor certain kernels over others in terms of performance. In particular, J. Williams [1] describes that FPGAs offer higher computational densities for bit operations and 16-bit integer arithmetic (up to 16X and 2.7X respectively) over GPUs, but may not compete as well at wider bitwidths, such as 32-bit integer and single-precision floating-point operations (0.98X and 0.34X respectively). The performance degradation at large bitwidths comes from the utilization of extra DSP units per operation, which results in limited parallelism. Floating-point arithmetic implementation on FPGA is inefficient for the same reason [12]. Often, a careful decision among alternative algorithms is necessary for optimal performance [7].

B. Programmability

Programming FPGAs often requires hardware design expertise, as it involves interfacing with the hardware at the RTL level. However, the advent of several academic and commercial Electronic System Level (ESL) design tools [2-5, 22-23] for High-Level Synthesis (HLS) has raised the level of abstraction in FPGA design. Most of these tools use high-level languages (HLLs) as their programming interface. Some of the earlier HLS tools [2, 3] can only extract fine-grained parallelism at the operation level by using data dependence analysis techniques. Extraction of coarse-grained parallelism is usually much harder in traditional HLLs, which are designed to express sequential execution. To overcome this obstacle, some HLS tools [4, 5, 22] have resorted to language extensions that allow the programmer to explicitly annotate coarse-grained parallelism in the form of parallel streams [4], tasks [5] or object-oriented structures [22]. In a different approach, special high-level languages that model parallelism with streaming dataflows have been employed in HLS tools [23]. In this work we use the popular CUDA programming model to concisely express the coarse-level parallelism of compute intensive kernels. CUDA kernels are then efficiently translated into AutoPilot input code with annotated coarse-grained parallelism, as discussed in the following sections.

III. DETAILS OF PROGRAMMING MODELS

A. CUDA

The CUDA programming model exposes parallelism through a data-parallel SPMD kernel function. Each kernel implicitly describes multiple CUDA threads that are organized in groups called thread-blocks. Thread-blocks are further organized into a grid structure (Fig. 2). Threads within a thread-block are executed by the streaming processors (SPs) of a single GPU streaming multiprocessor (SM) and are allowed to synchronize and share data through the SM shared memory. On the other hand, synchronization of thread-blocks is not supported. Thread-block threads are launched in SIMD bundles called warps. Warps consisting of threads with highly diverse control flow result in low-performance execution. Thus, for successful GPU acceleration it is critical that threads are organized in warps based on their control flow characteristics.
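To make the SPMD model concrete, the sketch below shows a generic CUDA kernel (not one of the paper's benchmarks); the kernel name, body and launch configuration are illustrative assumptions. Each thread identifies its data element from the built-in block and thread indices, while the grid of thread-blocks carries the coarse-grained parallelism.

    // Illustrative CUDA kernel (not from the paper): each thread processes one
    // element; threads form thread-blocks, and thread-blocks form a grid.
    __global__ void scale_kernel(const float *in, float *out, float alpha, int n) {
        // Built-in variables locate the thread within its block and the block
        // within the grid; together they select this thread's element.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = alpha * in[tid];   // per-thread (fine-grained) work
    }

    // Host-side launch: a grid of independent thread-blocks (coarse-grained
    // parallelism), each with 256 threads (fine-grained parallelism):
    //   scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);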

Fig. 2. CUDA Programming Model
Fig. 3. AutoPilot C Programming Model

The CUDA memory model leverages separate memory spaces with diverse characteristics. Shared memory refers to on-chip SRAM blocks, with each block being accessible by a single SM (Fig. 2). Global memory, on the other hand, is the off-chip DRAM that is accessible by all SMs. Shared memory is fast but small, whereas global memory is long-latency but abundant. There are also two read-only off-chip memory spaces, constant and texture, which are cached and provide special features for kernels executed on the GPU. More details on the CUDA memory spaces are provided in Section IV.

B. AutoPilot C

AutoPilot's programming model conforms to a subset of C which may be annotated with pragmas that convey information on different implementation details. Synthesis is performed at the function level, producing corresponding RTL descriptions for each function. The RTL description of each function corresponds to an FPGA core (Fig. 3) which consists of a private datapath and FSM-based control logic. Attached to each core's FSM are start and done signals that enable cross-function synchronization (including function calls and returns).

The front-end engine of AutoPilot (based on the LLVM compiler [18]) uses dependence analysis techniques to extract instruction-level parallelism (ILP) within basic blocks. Coarser parallelism, such as loop iteration parallelism, can also be exploited by injecting AUTOPILOT UNROLL pragmas in the code (assuming there are no loop-carried dependencies). Note that unrolling and executing loop iterations in parallel impacts FPGA resource allocation proportionally to the unroll factor.

Concurrency at the function level is specified by the AUTOPILOT PARALLEL pragma within a code region (Fig. 3). The affected functions are launched concurrently by the parent function, which stalls until every child function has returned. Thus it is possible to implement an MPMD (Multi-Program Multi-Data) execution model with a configuration of heterogeneous FPGA cores (i.e. parallel cores corresponding to different functions). Note that AutoPilot will schedule two functions (cores) to execute in parallel only when they cause no hazards. A hazard arises when two functions access the same memory block (resource hazard) or pass data from one function to another (data hazard).

With regard to memory spaces, AutoPilot may map variables onto local (on-chip) or external (off-chip) memories. By default, all arrays get mapped onto local BRAMs, while scalar variables are mapped onto configurable fabric logic. Pointers may also be used (with some limitations) in the input code and, combined with the AUTOPILOT INTERFACE pragma, they can infer off-chip memory accesses.
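As a rough illustration of this programming model, the sketch below shows a function synthesized as a core, with an unrolled inner loop and a region of parallel function calls. The pragma spellings and parameters (UNROLL factor, REGION PARALLEL) are assumptions made for illustration rather than the exact AutoPilot syntax, and the function and array names are invented.

    /* Illustrative AutoPilot-style C; pragma spellings are assumed, not verbatim.
       Each C function is synthesized into an FPGA core with its own datapath/FSM. */
    void vec_add_core(int a[256], int b[256], int c[256]) {
        for (int i = 0; i < 256; i++) {
            #pragma AUTOPILOT UNROLL factor=4   /* fine-grained: 4 iterations in parallel */
            c[i] = a[i] + b[i];
        }
    }

    void top(int a0[256], int b0[256], int c0[256],
             int a1[256], int b1[256], int c1[256]) {
        /* Coarse-grained: the two calls below touch disjoint arrays (no data or
           resource hazards), so a parallel region lets the tool instantiate two
           concurrent cores driven by the parent's FSM. */
        #pragma AUTOPILOT REGION PARALLEL
        {
            vec_add_core(a0, b0, c0);
            vec_add_core(a1, b1, c1);
        }
    }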
C. CUDA-to-FPGA Flow Advantages

The advantages offered by the CUDA programming model in our FPGA design flow are multifold. First, both CUDA and AutoPilot's programming model are based on the C language. CUDA extends the language with some GPU-specific constructs, while AutoPilot uses a subset of C augmented with synthesis pragma annotations (ignored during gcc compilation). Thus, FCUDA source-to-source compilation does not require translation between fundamentally different languages. Second, even though CUDA incorporates more memory spaces than AutoPilot, they both distinguish between on-chip and off-chip memory spaces, and leverage programmer-specified data transfers between off- and on-chip memory storage.

Coarse-grained parallelism in CUDA is expressed in the form of thread-blocks that execute independently on the SMs. Moreover, the number of thread-blocks in CUDA kernels is typically in the order of hundreds or thousands. Thus, thread-blocks constitute an excellent candidate for FPGA core implementation in terms of both lack of synchronization requirements and workload granularity. Mapping thread-blocks onto parallel cores on the FPGA minimizes inter-core communication without limiting parallelism extraction. Low inter-core communication helps achieve higher execution frequencies and eliminates synchronization overhead. As a final point, CUDA provides a very concise programming model for expressing coarse-grained parallelism through the single-thread kernel model. AutoPilot (as most existing HLS tools), on the other hand, employs a programming model that expresses coarse-grained parallelism explicitly in the form of multiple function calls annotated with appropriate pragmas (Fig. 3). FCUDA automates the extraction of the parallelism inferred in CUDA code into explicitly-annotated parallelism in AutoPilot input code, while handling data partitioning and FPGA core synchronization. Thus, it eliminates the tedious and error-prone task of directly expressing the coarse-grained parallelism in C for AutoPilot. Our FPGA design flow allows the programmer to describe the parallelism in a more compact and efficient way through the CUDA programming model, regardless of the implemented number of FPGA cores.

IV. FCUDA: CUDA-TO-FPGA FLOW

Our CUDA-to-FPGA flow (Fig. 1) is based on a code transformation process, FCUDA (currently targeting the AutoPilot HLS tool), which is guided by preprocessor directives (FCUDA pragmas) inserted by the FPGA programmer into the CUDA kernel. These directives control the FCUDA translation of the parallelism expressed in CUDA code into explicitly-expressed coarse-grained parallelism in the generated AutoPilot code. The FCUDA pragmas describe various FPGA implementation dimensions, which include the number, type and granularity of tasks, the type of task synchronization and scheduling, and the data storage within on- and off-chip memories. AutoPilot subsequently maps the FCUDA-specified tasks onto concurrent cores and generates the corresponding RTL description. Moreover, AutoPilot uses LLVM's [18] dependence analysis techniques and its own SDC-based scheduling engine [5] to extract fine-grained instruction-level parallelism within each task. Finally, Xilinx FPGA synthesis tools are leveraged to map the generated RTL onto the reconfigurable fabric. We demonstrate that the accelerators generated by our FPGA design flow can efficiently exploit the computational resources of top-tier FPGAs in a customized fashion and provide better performance compared to the GPU implementation for a range of applications.
A. FCUDA Philosophy

Concurrency in CUDA is inferred through a single-thread kernel with built-in variables that are used to distinguish the tasks of each thread. Application parallelism is expressed in the form of fine-granularity threads that are further bunched into coarse-granularity thread-blocks (Fig. 2). Even though thread-level parallelism can improve performance, thread-blocks offer higher potential for an efficient multi-core implementation on the FPGA. As discussed previously, CUDA thread-blocks comprise autonomous tasks that operate on independent data sets and do not need synchronization. Conversely, CUDA threads within a thread-block usually reference shared data, which often results in synchronization overhead and/or shared memory access conflicts.

Parallelism in C code for FPGA synthesis by AutoPilot is explicitly expressed through parallel function calls (Fig. 3). A single callee function with a different set of arguments in each call may be used to infer a homogeneous multi-core configuration similar to the GPU organization, whereas different callee functions may model a heterogeneous multi-core configuration on the FPGA. Therefore, the core task of the FCUDA source-to-source translation can be simply described as converting thread-blocks into C functions and invoking parallel calls of the generated functions with appropriate argument sets. Having extracted the coarse-granularity parallelism at the thread-block level, fine-granularity parallelism at the thread level may also be extracted, provided that non-allocated resources exist on the FPGA. This disparity in the thread parallelism extraction scheme between GPU and FCUDA may lead to different combinations of concurrently executing threads in the two devices. Nevertheless, the degree of parallelism will not differ in typical CUDA kernels, which comprise hundreds of threads per thread-block and thousands of thread-blocks per grid.

Another important feature of the FCUDA philosophy consists of decoupling off-chip data transfers from the rest of the thread-block operations. The main goal is to prevent long-latency references from impacting the efficiency of the multi-core execution. This is particularly important in the absence of GPU-like fine-grained multi-threading support in FPGAs. Moreover, by aggregating all of the off-chip accesses into DMA burst transfers from/to on-chip BRAMs, the off-chip memory bandwidth can be utilized more efficiently.

FCUDA also leverages synchronization of data transfer and computation tasks based on the FCUDA annotation injected by the FPGA programmer. The selection of the synchronization scheme often incurs a tradeoff between performance and resource requirements. The FPGA programmer needs to consider the characteristics of the accelerated kernel in order to make an educated decision. A simple and resource-efficient scheme is the simple DMA synchronization (Fig. 4a), which serializes data communication and computation tasks. This scheme is memory-overhead free and can be a good fit for kernels that are compute intensive and incur low data communication traffic. At the opposite end, the ping-pong synchronization scheme overlaps data communication with computation by doubling the number of BRAM blocks (Fig. 4b). The interconnection logic interchangeably connects each BRAM block to the compute logic and the DMA controller, ensuring that each BRAM block is actively connected to only one of the two modules in each cycle. However, this scheme may result in BRAM utilization overhead, impacting the number of cores that can be instantiated on the FPGA.

Fig. 4. Scheduling schemes: a) simple scheme, b) ping-pong scheme (DMA controller and compute logic connected to BRAM blocks through interconnect logic; active and idle connections alternate in the ping-pong scheme)

B. FCUDA Pragma Directives

Fig. 5a depicts the FCUDA pragma annotation of the coulombic potential (CP) kernel. The kernel function is wrapped within GRID pragmas that define the sub-array of thread-blocks that can be computed by the available FPGA cores within one iteration (in Fig. 5a, two thread-blocks with sequential x coordinates and the same y coordinate). The BLOCK pragma determines the sub-grid of all thread-blocks that this kernel is assigned to compute. By splitting the original CUDA grid of thread-blocks into sub-grids, FPGA cores can be split into clusters, with each cluster being assigned a sub-grid of thread-blocks. This can help further eliminate long wire interconnections between compute and synchronization cores and could enable asynchronous operation of different clusters. The SYNC pragma sets the type of synchronization scheme (currently a choice between simple and ping-pong) to be implemented by the cluster synchronization core. COMPUTE and TRANSFER pragmas are used to wrap the computation and the data communication tasks of the kernel, respectively. In Fig. 5a, two TRANSFER sections are used: one for fetching off-chip data into the atominfo array and one for storing results to the energygrid off-chip storage. The following section describes how FCUDA leverages the translation of the CUDA code into properly crafted C code for AutoPilot. FCUDA compilation is based on the Cetus source-to-source compiler framework [19] and consists of two major stages.
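Since Fig. 5a is not reproduced here, the sketch below suggests how such an annotated CP kernel might look. Only the COMPUTE directive's cores=2 / name="cp_block" values (visible in the Fig. 7 excerpt) and the "fetch"/"write" TRANSFER names are taken from the text; the remaining pragma parameters, the extra kernel argument, the atominfo layout and the kernel body are assumptions (the surrounding GRID, BLOCK and SYNC pragmas described above are omitted).

    // Illustrative FCUDA-annotated CUDA kernel, loosely modeled on the Fig. 5a description.
    #define MAXATOMS 128                          // assumed bound on atoms per burst
    __global__ void cenergy(int numatoms, float gridspacing, float *energygrid,
                            const float4 *atominfo_in /* assumed off-chip source */) {
        __shared__ float4 atominfo[MAXATOMS];
        int xindex = blockIdx.x * blockDim.x + threadIdx.x;
        int yindex = blockIdx.y * blockDim.y + threadIdx.y;

        #pragma FCUDA TRANSFER begin name="fetch"   // DMA burst: off-chip -> on-chip BRAM
        if (threadIdx.y == 0 && threadIdx.x < numatoms)   // assumes numatoms <= blockDim.x
            atominfo[threadIdx.x] = atominfo_in[threadIdx.x];
        __syncthreads();
        #pragma FCUDA TRANSFER end name="fetch"

        #pragma FCUDA COMPUTE cores=2 begin name="cp_block"
        float energyval = 0.0f;
        for (int n = 0; n < numatoms; n++) {        // contribution of every atom
            float dx = gridspacing * xindex - atominfo[n].x;
            float dy = gridspacing * yindex - atominfo[n].y;
            float dz = atominfo[n].z;               // grid plane at z = 0 (assumption)
            energyval += atominfo[n].w / sqrtf(dx*dx + dy*dy + dz*dz);
        }
        #pragma FCUDA COMPUTE end name="cp_block"

        #pragma FCUDA TRANSFER begin name="write"   // DMA burst: on-chip BRAM -> off-chip
        energygrid[yindex * gridDim.x * blockDim.x + xindex] += energyval;
        #pragma FCUDA TRANSFER end name="write"
    }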


Fig. 5. Coulombic Potential (CP) kernel transformation through FCUDA: a) FCUDA-annotated CUDA kernel (compute and write tasks), b) generated AutoPilot code

C. FCUDA Front-End Transformation

The front-end engine of FCUDA aims to transform the single-thread kernel into semantically equivalent C code which explicitly expresses the execution of all the kernel threads in a serialized fashion. This is achieved by converting the CUDA built-in variables that hold the thread (and thread-block) IDs into regular C variables which are used as induction variables in thread-loops (and block-loops). Fig. 6 illustrates the transformation of the CUDA kernel threads into thread-loops by the FCUDA front-end engine (serialization of thread-blocks, though not depicted for space and clarity reasons, also takes place during this FCUDA phase).

As shown in Fig. 6, synchronization directives within the CUDA kernel need to be considered during the front-end transformation phase of FCUDA in order to maintain the ordering semantics of thread execution within the serialized thread-blocks. Synchronization points are indicated by CUDA sync directives, FCUDA COMPUTE and TRANSFER pragmas, and irregular control flow statements (i.e. break, continue and return). A loop-fission technique proposed in [10] is used to break the initially generated kernel-wide thread-loop into localized thread-loops which do not cross any of the synchronization directives encountered in the code. Fig. 6 depicts the result of loop fission in a kernel with a single synchronization point. The initial kernel thread-loop is split into two thread-loops: the first thread-loop implements the thread operations preceding the sync point, while the second thread-loop implements the thread operations following the sync point. This way, serialized execution of threads maintains the thread-block synchronization semantics. FCUDA extends the MCUDA [10] implementation of loop-fission by adding COMPUTE and TRANSFER pragmas to the list of synchronization directives. COMPUTE and TRANSFER pragmas are used by the FPGA programmer to annotate computation and off-chip data communication tasks. Thus, synchronization of threads between tasks is required. Synchronization primitives can be removed after loop-fission, except for FCUDA pragmas, which carry implementation information used by the back-end engine of FCUDA.

Thread serialization creates the opportunity for variable sharing among threads. However, there are cases in which each thread must have its private copy of a kernel variable. This usually happens for variables that are accessed across synchronization points (e.g. energyval in Fig. 5a). Scalar variable expansion [10] is applied for such variables in order to create thread-private copies of the variable. Fig. 7 depicts how the computation task of CP (Fig. 5a) is transformed after the front-end processing of FCUDA. Loop-fission forms a thread-loop within the COMPUTE pragmas, whereas selective scalar expansion results in vectorization of energyval, which is referenced across thread-loops.

Fig. 6. Extracting the CUDA coarse-grained parallelism in FCUDA
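The following sketch illustrates the MCUDA-style transformation described above on a generic kernel fragment (not the paper's generated output): the implicit CUDA threads become explicit thread-loops, and a __syncthreads() point forces loop fission into two thread-loops.

    /* Original CUDA kernel body (implicit per-thread execution):
     *     shared[threadIdx.x] = in[threadIdx.x];
     *     __syncthreads();
     *     out[threadIdx.x] = shared[blockDim.x - 1 - threadIdx.x];
     *
     * Illustrative front-end output: thread IDs become induction variables of
     * thread-loops, and the sync point splits the kernel-wide loop in two. */
    void kernel_block(int blockDim_x, const int *in, int *out) {
        int shared[256];                      /* block-private (BRAM-mapped) storage */
        int tIdx_x;

        for (tIdx_x = 0; tIdx_x < blockDim_x; tIdx_x++)   /* thread-loop 1: before sync */
            shared[tIdx_x] = in[tIdx_x];

        /* __syncthreads() needs no runtime action once threads are serialized */

        for (tIdx_x = 0; tIdx_x < blockDim_x; tIdx_x++)   /* thread-loop 2: after sync */
            out[tIdx_x] = shared[blockDim_x - 1 - tIdx_x];
    }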

Fig. 7. FCUDA front-end processed CP compute task: the COMPUTE region (#pragma FCUDA COMPUTE cores=2 begin name="cp_block") encloses thread-loops over tIdx, with energyval expanded into an array of thread-private copies
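A plausible shape of that processed compute task is sketched below. The pragma line, the array-expanded energyval and the tIdx thread-loop come from the Fig. 7 excerpt; the array dimensions, the dim2 struct and the cp_point() helper are assumptions introduced only to make the fragment self-contained.

    /* Illustrative reconstruction of the Fig. 7 compute task (not verbatim). */
    typedef struct { int x, y; } dim2;            /* assumed thread-index struct */
    int cp_point(int tx, int ty, int n);          /* hypothetical per-atom contribution */

    void cp_block(int numatoms, int blockDim_x, int blockDim_y,
                  int energyval[16][16]) {        /* assumes blockDim_x, blockDim_y <= 16 */
        dim2 tIdx;
        #pragma FCUDA COMPUTE cores=2 begin name="cp_block"
        for (tIdx.y = 0; tIdx.y < blockDim_y; tIdx.y++)       /* serialized threads (y) */
            for (tIdx.x = 0; tIdx.x < blockDim_x; tIdx.x++)   /* serialized threads (x) */
                for (int n = 0; n < numatoms; n++)
                    energyval[tIdx.y][tIdx.x] += cp_point(tIdx.x, tIdx.y, n);
        #pragma FCUDA COMPUTE end name="cp_block"
    }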

D. FCUDA Back-End Transformation

The back-end engine of FCUDA leverages the implementation information annotated in the FCUDA pragma directives to guide the translation of the kernel coarse-grained parallelism into the function-level type of parallelism supported by AutoPilot (Fig. 6). Tasks annotated through FCUDA COMPUTE and TRANSFER pragmas are transformed into newly generated task functions which are called from the original kernel function, referred to hereafter as the parent function. Multiple calls of the task functions, wrapped within AUTOPILOT REGION and PARALLEL directives in the parent function (Fig. 5b), drive the synthesis tool to instantiate parallel processing cores on the configurable fabric. The degree of parallelism is specified by the parameter information included in the COMPUTE and TRANSFER pragmas (Fig. 5a) and is used to adjust the stride length of the block-loop in the parent function (Fig. 5b). One of the critical tasks of this transformation is the facilitation of data communication between the different task functions and the parent function. Variable analysis is performed to determine which variables are private to the task and which ones need to be communicated to/from the task function. Communication of both scalar and array variables is implemented through the task function parameters. Apart from the type and number of cores, the FPGA programmer can also extract thread parallelism (provided available resources exist) by injecting AUTOPILOT UNROLL and PIPELINE pragmas within the FCUDA COMPUTE annotated tasks, to specify thread-loop unrolling and pipelining, respectively.

As discussed previously, FCUDA TRANSFER pragmas are used to annotate data communication tasks to off-chip addresses. According to the FCUDA philosophy, off-chip data communication usually infers DMA burst transfers of data between off-chip memory storage and on-chip BRAM arrays. The FCUDA back-end engine is also responsible for instantiating array variables which will infer BRAM block allocation during synthesis by AutoPilot. BRAM-associated arrays are instantiated at the parent function and their number is determined by the degree of parallelism annotated in the compute tasks that reference them (Fig. 5b). BRAM-associated arrays may be passed as arguments to compute and transfer functions similarly to the rest of the variables. A challenging task of the BRAM array instantiation is the determination of their dimensions. There are two ways this is accomplished: i) through variable access analysis and consideration of the containing thread-loop induction variable range space (the "write" TRANSFER in Fig. 5a), and ii) through FCUDA DATA parameter information (the "fetch" TRANSFER in Fig. 5a). More details on the leveraging of the different CUDA memory spaces are provided in the following subsection.

Fig. 8 shows how the CP parent function is transformed for a ping-pong synchronization scheme. An if-else structure is used to implement the switching of the accessed BRAM block in each iteration of the block-loop.

Fig. 8. CP parent function for the ping-pong scheme (excerpt): cenergy(int numatoms, int gridspacing, int *energygrid, dim3 blockDim, dim3 gridDim) declares per-core atominfo and energyval BRAM arrays (c11_, c12_, c21_, c22_) and a pingpong flag that is toggled across iterations of the block-loop over bIdx
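Based on the excerpt summarized above, a rough reconstruction of such a ping-pong parent function is sketched below. The c11_/c12_/c21_/c22_ arrays, the pingpong flag and the cenergy signature follow the excerpt; the array sizes, the dim3 definition, the task-function names and signatures, the pragma spellings and the exact schedule (including the serial write-back) are assumptions.

    /* Illustrative ping-pong parent function (not the paper's generated code). */
    #define ATOMS   128                      /* assumed BRAM array sizes */
    #define THREADS 256
    typedef struct { int x, y, z; } dim3;    /* assumed plain-C stand-in for dim3 */

    /* Generated task functions (names and signatures assumed) */
    void transfer_fetch(int numatoms, int a0[ATOMS], int a1[ATOMS]);
    void cp_block(int bx, int by, dim3 blockDim, int numatoms, int gridspacing,
                  int atominfo[ATOMS], int energyval[THREADS]);
    void transfer_write(int *energygrid, int e0[THREADS], int e1[THREADS]);

    void cenergy(int numatoms, int gridspacing, int *energygrid,
                 dim3 blockDim, dim3 gridDim) {
        int c11_atominfo[ATOMS], c12_atominfo[ATOMS], c21_atominfo[ATOMS], c22_atominfo[ATOMS];
        int c11_energyval[THREADS], c12_energyval[THREADS], c21_energyval[THREADS], c22_energyval[THREADS];
        int pingpong = 0;
        dim3 bIdx;

        for (bIdx.y = 0; bIdx.y < gridDim.y; bIdx.y++)
            for (bIdx.x = 0; bIdx.x < gridDim.x; bIdx.x += 2) {   /* two thread-blocks per iteration */
                if (pingpong == 0) {
                    /* DMA fills the c1x BRAMs while the two cores compute on the c2x BRAMs:
                       no shared memory blocks, so the calls may be scheduled concurrently */
                    #pragma AUTOPILOT REGION PARALLEL
                    {
                        transfer_fetch(numatoms, c11_atominfo, c12_atominfo);
                        cp_block(bIdx.x + 0, bIdx.y, blockDim, numatoms, gridspacing,
                                 c21_atominfo, c21_energyval);
                        cp_block(bIdx.x + 1, bIdx.y, blockDim, numatoms, gridspacing,
                                 c22_atominfo, c22_energyval);
                    }
                    transfer_write(energygrid, c21_energyval, c22_energyval);
                } else {
                    /* buffer roles swapped: compute on c1x while DMA fills c2x */
                    #pragma AUTOPILOT REGION PARALLEL
                    {
                        transfer_fetch(numatoms, c21_atominfo, c22_atominfo);
                        cp_block(bIdx.x + 0, bIdx.y, blockDim, numatoms, gridspacing,
                                 c11_atominfo, c11_energyval);
                        cp_block(bIdx.x + 1, bIdx.y, blockDim, numatoms, gridspacing,
                                 c12_atominfo, c12_energyval);
                    }
                    transfer_write(energygrid, c11_energyval, c12_energyval);
                }
                pingpong = 1 - pingpong;   /* if-else above switches the accessed BRAM set */
            }
    }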
E. CUDA Memory Spaces

The different memory spaces leveraged in the CUDA programming model ultimately have to be mapped onto local BRAM memories, as described in the previous sections. The simplest memory space to handle is shared memory, due to its common semantics with BRAM memories. Both of them refer to local memory blocks that are private to the threads of a thread-block. Thus, direct mapping of shared memory arrays onto BRAM memory is feasible. A distinguishing characteristic of shared memory is its 16-bank organization, which allows 16 concurrent accesses by parallel threads. BRAM memories, on the other hand, only support dual-access concurrency. However, the serialization of block threads in the FCUDA flow eliminates the potential latency overhead of increased BRAM access conflicts in kernels that are engineered to take advantage of the multi-bank organization of shared memory. Besides, FPGA configurability offers the flexibility of organizing BRAM blocks in a multi-bank scheme if necessary (though this is not supported by our current implementation). Moreover, the BRAM block size customizability enables flexible tuning of kernels without the constraining restrictions imposed by the small size of shared memory on GPUs.

The constant memory space is shared by all the thread-blocks running on the GPU; it is read-only and is used for references that exhibit locality, since it is cached. These attributes make it a good match for the different DMA burst schemes described earlier. A portion of the off-chip DRAM will serve as constant memory, and the BRAMs will be used as read-only buffers that are filled with the corresponding block of data before the thread-block execution. It may be possible to share BRAM blocks that contain constant memory data among a few compute cores on the FPGA, to reduce BRAM resource requirements per core, if it does not severely impact execution frequency.

Global memory corresponds to the off-chip memory of the GPU, which is globally accessible at a high latency but with abundant capacity.


Fig. 9. GPU – FPGA performance comparison: relative speedup for the matmul, cp and rc5-72 kernels at 32-, 16- and 8-bit precisions

VI. CONCLUSIONS

In this paper, we present a new FPGA design flow that takes annotated CUDA code as input and generates C code for AutoPilot with task-level parallelism, which is synthesized into customized multi-core accelerators on the FPGA. We demonstrate that the user can indeed use a single starting point for efficient acceleration, irrespective of whether the target platform is GPU or FPGA. CUDA allows the user to express

REFERENCES

[14] http://www.nvidia.com/page/geforce_8800.html
[15] E. J. Kelmelis, J. Durbano, J. Humphrey, and F. Ortiz, "Modeling and simulation of nanoscale devices with a desktop supercomputer," Proc. of the Int. Society for Optical Engineering, 2006.
[16] Xilinx Inc., http://www.xilinx.com
[17] Altera Inc., http://www.altera.com
[18] LLVM compiler, http://www.llvm.org
[19] S. Lee, T. Johnson, and R. Eigenmann, "Cetus - an extensible compiler infrastructure for source-to-source transformation," Languages and Compilers for Parallel Computing, 2003.
[20] M. Showerman, W.-M. W. Hwu, J. Enos, A. Pant, V. Kindratenko, C. Steffen, and R. Pennington, "QP: A Heterogeneous Multi-Accelerator Cluster," Int. Conf. on High-Performance Cluster Computing, 2009.
[21] IBM Cell Processor, http://www.research.ibm.com/cell/
[22] S. Huang, A. Hormati, D. F. Bacon, and R. M. Rabbah, "Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary," ECOOP, 2008.
[23] A. Hormati, M. Kudlur, S. A. Mahlke, D. F. Bacon, and R. M. Rabbah, "Optimus: Efficient Realization of Streaming Applications on FPGAs," CASES, 2008.