FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs*

Alexandros Papakonstantinou1, Karthik Gururaj2, John A. Stratton1, Deming Chen1, Jason Cong2, Wen-Mei W. Hwu1
1Electrical & Computer Eng. Dept., University of Illinois, Urbana-Champaign, IL, USA
2Computer Science Dept., University of California, Los Angeles, CA, USA
1{apapako2, stratton, dchen, w-hwu}@illinois.edu, 2{karthikg, cong}@cs.ucla.edu

Abstract— As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively improve the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API; it is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread-blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

I. INTRODUCTION

Even though parallel processing has been a major contributor to application speedups achieved by the high performance computing community, its adoption in mainstream computing domains has lagged due to the relative simplicity of enhancing application speed through frequency scaling and transistor shrinking. However, the power wall encountered by traditional single-core processors has forced a global industry shift to the multi-core paradigm. As a consequence of the rapidly growing interest in parallelism at a wider and coarser level than feasible in traditional processors, the potential of GPUs and FPGAs has been realized. GPUs consist of hundreds of processing cores clustered within streaming multiprocessors (SMs) that can handle intensive compute loads with a high degree of data-level parallelism. FPGAs, on the other hand, offer efficient application-specific parallelism extraction through the flexibility of their reconfigurable fabric. Besides, heterogeneity in high performance computing (HPC) has been gaining great momentum, as can be inferred from the proliferation of heterogeneous multiprocessors ranging from Multi-Processor Systems on Chip (MPSoC) like the IBM Cell [21] to HPC clusters with GPU/FPGA accelerated nodes such as the NCSA AC Cluster [20]. The diverse characteristics of these compute cores/platforms render them optimal for different types of application kernels. Currently, the performance and power advantages of heterogeneous multi-processors are offset by the difficulty involved in their programming. Moreover, the use of different parallel programming models in these heterogeneous compute systems often complicates development. In the case of kernel acceleration on FPGAs, the programming effort is further inflated by the need to interface with hardware at the RTL level.

A significant milestone towards the use of the massively parallel compute power of GPUs in non-graphics applications has been the release of CUDA by NVIDIA. CUDA enables general purpose computing on the GPU (GPGPU) through a C-like API and is gaining considerable popularity. In this work we explore the use of CUDA as the programming interface for a new FPGA programming flow (Fig. 1), which is designed to efficiently map the coarse- and fine-grained parallelism expressed in CUDA kernels onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs the state-of-the-art high-level synthesis tool AutoPilot [5], which enables high-abstraction FPGA programming. The flow is enabled by a source-to-source compilation phase, FCUDA, which transforms the SPMD (Single-Program-Multiple-Data) CUDA code into C code for AutoPilot with annotated coarse-grained parallelism. AutoPilot maps the annotated parallelism onto parallel cores (a "core" in this context is an application-specific processing engine) and generates a corresponding RTL description, which is subsequently synthesized and downloaded onto the FPGA.

Fig. 1. CUDA-to-FPGA Flow: CUDA code, FCUDA annotation, annotated CUDA code, FCUDA compilation, AutoPilot C code, AutoPilot synthesis, RTL design

* This work is partially supported by MARCO/DARPA GSRC and NSF CCF 07-46608. The authors would like to acknowledge the equipment donation from Intel and software donation from AutoESL.

The selection of CUDA as the programming interface for our FPGA programming flow offers three main advantages. First, it provides a high-level API for expressing coarse-grained parallelism in a concise fashion within application kernels that are going to be executed on an acceleration device. Even though CUDA is driven by the GPU computing domain, we show that CUDA kernels can indeed be translated with FCUDA into efficient, customized multi-core compute engines on the FPGA. Second, it bridges the programmability gap between homogeneous and heterogeneous platforms by providing a common programming model for clusters with nodes that include GPUs and FPGAs. This simplifies application development and enables efficient evaluation of alternative kernel mappings onto the heterogeneous acceleration devices without time-consuming kernel code re-writing. Third, the wide adoption of the CUDA programming model and its popularity render a large body of existing applications available to FPGA acceleration.

In the next section we discuss important characteristics of the FPGA and GPU platforms along with previous related work. Section III explains the characteristics of the CUDA and AutoPilot programming models and provides insight into the suitability of the CUDA API for programming FPGAs. The FCUDA translation details are presented in Section IV, while Section V presents experimental results and shows that our high-level synthesis based flow can efficiently exploit the computational resources of top-tier FPGAs in a customized fashion. Finally, Section VI concludes the paper and discusses future work.

II. THE FPGA PLATFORM

With increasing transistor densities, the computational capabilities of commercial FPGAs provided by Xilinx [16] and Altera [17] have greatly increased. Modern FPGAs are technologically in sync with the rest of the IC industry, employing the latest manufacturing process technologies and supporting high-bandwidth IO interfaces such as PCIe, Intel's FSB [6] and AMD's HyperTransport [8]. By embedding fast DSP macros, memory blocks and 32-bit microprocessor cores into the reconfigurable fabric, a complete SoC platform is available for applications which require high-throughput computation at a low power footprint.

The flexibility of the reconfigurable fabric provides a versatile platform for leveraging different types of application-specific parallelism: i) coarse- and fine-grained, ii) data- and task-level and iii) different pipelined configurations. Reconfigurability, though, has an impact on the clock frequency achievable on the FPGA platform. Synthesis-generated wire-based communication between parallel modules may limit the throughput of designs with wider parallelism compared to smaller but faster clocked architectures. In our flow we leverage the CUDA programming model to build multi-core acceleration designs with a low count of inter-core communication interconnect.

FPGA devices reportedly offer a significant advantage (4X-12X) in power consumption over GPUs. J. Williams et al. [1] showed that the computational density per Watt in FPGAs is much higher than in GPUs. This is true even for 32-bit integer and floating-point arithmetic (6X and 2X respectively), for which the raw computational density of GPUs is higher.

A. Application Domains

FPGAs have been employed in different projects for the acceleration of compute intensive applications. Examples range from data parallel kernels [11, 13] to entire applications such as face detection [9]. Although they allow flexible customization of the architecture to the application, the physical constraints of their configurable fabric favor certain kernels over others in terms of performance. In particular, J. Williams [1] describes that FPGAs offer higher computational densities for bit operations and 16-bit integer arithmetic (up to 16X and 2.7X respectively) over GPUs, but may not compete as well at wider bitwidths, such as 32-bit integer and single-precision floating-point operations (0.98X and 0.34X respectively). The performance degradation at large bitwidths comes from the utilization of extra DSP units per operation, which results in limited parallelism. Floating-point arithmetic implementation on FPGA is inefficient for the same reason [12]. Often, a careful decision among alternative algorithms is necessary for optimal performance [7].

B. Programmability

Programming FPGAs often requires hardware design expertise, as it involves interfacing with the hardware at the RTL level. However, the advent of several academic and commercial Electronic System Level (ESL) design tools [2-5, 22-23] for High-Level Synthesis (HLS) has raised the level of abstraction in FPGA design. Most of these tools use high-level languages (HLLs) as their programming interface. Some of the earlier HLS tools [2, 3] can only extract fine-grained parallelism at the operation level by using data dependence analysis techniques. Extraction of coarse-grained parallelism is usually much harder in traditional HLLs, which are designed to express sequential execution. To overcome this obstacle, some HLS tools [4, 5, 22] have resorted to language extensions that allow the programmer to explicitly annotate coarse-grained parallelism in the form of parallel streams [4], tasks [5] or object-oriented structures [22]. In a different approach, special high-level languages that model parallelism with streaming dataflows have been employed in HLS tools [23]. In this work we use the popular CUDA programming model to concisely express the coarse-level parallelism of compute intensive kernels. CUDA kernels are then efficiently translated into AutoPilot input code with annotated coarse-grained parallelism, as discussed in the following sections.

III. DETAILS OF PROGRAMMING MODELS

A. CUDA

The CUDA programming model exposes parallelism through a data-parallel SPMD kernel function. Each kernel implicitly describes multiple CUDA threads that are organized in groups called thread-blocks. Thread-blocks are further organized into a grid structure (Fig. 2). Threads within a thread-block are executed by the streaming processors (SPs) of a single GPU streaming multiprocessor (SM) and are allowed to synchronize and share data through the SM shared memory. On the other hand, synchronization of thread-blocks is not supported. Thread-block threads are launched in SIMD bundles called warps. Warps consisting of threads with highly diverse control flow result in low-performance execution. Thus, for successful GPU acceleration it is critical that threads are organized in warps based on their control flow characteristics.
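To make the SPMD model concrete, the sketch below shows a generic CUDA kernel (not one of the paper's benchmarks); the kernel name, body and launch configuration are illustrative assumptions. Each thread identifies its data element from the built-in block and thread indices, while the grid of thread-blocks carries the coarse-grained parallelism.

    // Illustrative CUDA kernel (not from the paper): each thread processes one
    // element; threads form thread-blocks, and thread-blocks form a grid.
    __global__ void scale_kernel(const float *in, float *out, float alpha, int n) {
        // Built-in variables locate the thread within its block and the block
        // within the grid; together they select this thread's element.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = alpha * in[tid];   // per-thread (fine-grained) work
    }

    // Host-side launch: a grid of independent thread-blocks (coarse-grained
    // parallelism), each with 256 threads (fine-grained parallelism):
    //   scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);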

Fig. 2. CUDA Programming Model
Fig. 3. AutoPilot C Programming Model

The CUDA memory model leverages separate memory spaces with diverse characteristics. Shared memory refers to on-chip SRAM blocks, with each block being accessible by a single SM (Fig. 2). Global memory, on the other hand, is the off-chip DRAM that is accessible by all SMs. Shared memory is fast but small, whereas global memory is long-latency but abundant. There are also two read-only off-chip memory spaces, constant and texture, which are cached and provide special features for kernels executed on the GPU. More details on the CUDA memory spaces are provided in Section IV.

B. AutoPilot C

AutoPilot's programming model conforms to a subset of C which may be annotated with pragmas that convey information on different implementation details. Synthesis is performed at the function level, producing corresponding RTL descriptions for each function. The RTL description of each function corresponds to an FPGA core (Fig. 3) which consists of a private datapath and FSM-based control logic. Attached to each core's FSM are start and done signals that enable cross-function synchronization (including function calls and returns).

The front-end engine of AutoPilot (based on the LLVM compiler [18]) uses dependence analysis techniques to extract instruction-level parallelism (ILP) within basic blocks. Coarser parallelism, such as loop iteration parallelism, can also be exploited by injecting AUTOPILOT UNROLL pragmas in the code (assuming there are no loop-carried dependencies). Note that unrolling and executing loop iterations in parallel impacts FPGA resource allocation proportionally to the unroll factor.

Concurrency at the function level is specified by the AUTOPILOT PARALLEL pragma within a code region (Fig. 3). The affected functions are launched concurrently by the parent function, which stalls until every child function has returned. Thus it is possible to implement an MPMD (Multi-Program Multi-Data) execution model with a configuration of heterogeneous FPGA cores (i.e. parallel cores corresponding to different functions). Note that AutoPilot will schedule two functions (cores) to execute in parallel only when they cause no hazards. A hazard arises when two functions access the same memory block (resource hazard) or pass data from one function to another (data hazard).

With regard to memory spaces, AutoPilot may map variables onto local (on-chip) or external (off-chip) memories. By default, all arrays get mapped onto local BRAMs, while scalar variables are mapped onto configurable fabric logic. Pointers may also be used (with some limitations) in the input code and, combined with the AUTOPILOT INTERFACE pragma, they can infer off-chip memory accesses.
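As a rough illustration of this programming model, the sketch below shows a function synthesized as a core, with an unrolled inner loop and a region of parallel function calls. The pragma spellings and parameters (UNROLL factor, REGION PARALLEL) are assumptions made for illustration rather than the exact AutoPilot syntax, and the function and array names are invented.

    /* Illustrative AutoPilot-style C; pragma spellings are assumed, not verbatim.
       Each C function is synthesized into an FPGA core with its own datapath/FSM. */
    void vec_add_core(int a[256], int b[256], int c[256]) {
        for (int i = 0; i < 256; i++) {
            #pragma AUTOPILOT UNROLL factor=4   /* fine-grained: 4 iterations in parallel */
            c[i] = a[i] + b[i];
        }
    }

    void top(int a0[256], int b0[256], int c0[256],
             int a1[256], int b1[256], int c1[256]) {
        /* Coarse-grained: the two calls below touch disjoint arrays (no data or
           resource hazards), so a parallel region lets the tool instantiate two
           concurrent cores driven by the parent's FSM. */
        #pragma AUTOPILOT REGION PARALLEL
        {
            vec_add_core(a0, b0, c0);
            vec_add_core(a1, b1, c1);
        }
    }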
C. CUDA-to-FPGA Flow Advantages

The advantages offered by the CUDA programming model in our FPGA design flow are multifold. First, both CUDA and AutoPilot's programming model are based on the C language. CUDA extends the language with some GPU-specific constructs, while AutoPilot uses a subset of C augmented with synthesis pragma annotations (ignored during gcc compilation). Thus, FCUDA source-to-source compilation does not require translation between fundamentally different languages. Second, even though CUDA incorporates more memory spaces than AutoPilot, they both distinguish between on-chip and off-chip memory spaces, and leverage programmer-specified data transfers between off- and on-chip memory storage.

Coarse-grained parallelism in CUDA is expressed in the form of thread-blocks that execute independently on the SMs. Moreover, the number of thread-blocks in CUDA kernels is typically in the order of hundreds or thousands. Thus, thread-blocks constitute an excellent candidate for FPGA core implementation in terms of both lack of synchronization requirements and workload granularity. Mapping thread-blocks onto parallel cores on the FPGA minimizes inter-core communication without limiting parallelism extraction. Low inter-core communication helps achieve higher execution frequencies and eliminates synchronization overhead. As a final point, CUDA provides a very concise programming model for expressing coarse-grained parallelism through the single-thread kernel model. AutoPilot (as most existing HLS tools), on the other hand, employs a programming model that expresses coarse-grained parallelism explicitly in the form of multiple function calls annotated with appropriate pragmas (Fig. 3). FCUDA automates the extraction of the parallelism inferred in CUDA code into explicitly-annotated parallelism in AutoPilot input code, while handling data partitioning and FPGA core synchronization. Thus, it eliminates the tedious and error-prone task of directly expressing the coarse-grained parallelism in C for AutoPilot. Our FPGA design flow allows the programmer to describe the parallelism in a more compact and efficient way through the CUDA programming model, regardless of the implemented number of FPGA cores.

IV. FCUDA: CUDA-TO-FPGA FLOW

Our CUDA-to-FPGA flow (Fig. 1) is based on a code transformation process, FCUDA (currently targeting the AutoPilot HLS tool), which is guided by preprocessor directives (FCUDA pragmas) inserted by the FPGA programmer into the CUDA kernel. These directives control the FCUDA translation of the parallelism expressed in CUDA code into explicitly-expressed coarse-grained parallelism in the generated AutoPilot code. The FCUDA pragmas describe various FPGA implementation dimensions, which include the number, type and granularity of tasks, the type of task synchronization and scheduling, and the data storage within on- and off-chip memories. AutoPilot subsequently maps the FCUDA-specified tasks onto concurrent cores and generates the corresponding RTL description. Moreover, AutoPilot uses LLVM's [18] dependence analysis techniques and its own SDC-based scheduling engine [5] to extract fine-grained instruction-level parallelism within each task. Finally, Xilinx FPGA synthesis tools are leveraged to map the generated RTL onto the reconfigurable fabric. We demonstrate that the accelerators generated by our FPGA design flow can efficiently exploit the computational resources of top-tier FPGAs in a customized fashion and provide better performance compared to the GPU implementation for a range of applications.
A. FCUDA Philosophy

Concurrency in CUDA is inferred through a single-thread kernel with built-in variables that are used to distinguish the tasks of each thread. Application parallelism is expressed in the form of fine-granularity threads that are further bunched into coarse-granularity thread-blocks (Fig. 2). Even though thread-level parallelism can improve performance, thread-blocks offer higher potential for an efficient multi-core implementation on the FPGA. As discussed previously, CUDA thread-blocks comprise autonomous tasks that operate on independent data sets and do not need synchronization. Conversely, CUDA threads within a thread-block usually reference shared data, which often results in synchronization overhead and/or shared memory access conflicts.

Parallelism in C code for FPGA synthesis by AutoPilot is explicitly expressed through parallel function calls (Fig. 3). A single callee function with a different set of arguments in each call may be used to infer a homogeneous multi-core configuration similar to the GPU organization, whereas different callee functions may model a heterogeneous multi-core configuration on the FPGA. Therefore, the core task of the FCUDA source-to-source translation can be simply described as converting thread-blocks into C functions and invoking parallel calls of the generated functions with appropriate argument sets. Having extracted the coarse-granularity parallelism at the thread-block level, fine-granularity parallelism at the thread level may also be extracted, provided that non-allocated resources exist on the FPGA. This disparity in the thread parallelism extraction scheme between GPU and FCUDA may lead to different combinations of concurrently executing threads in the two devices. Nevertheless, the degree of parallelism will not differ in typical CUDA kernels, which comprise hundreds of threads per thread-block and thousands of thread-blocks per grid.

Another important feature of the FCUDA philosophy consists of decoupling off-chip data transfers from the rest of the thread-block operations. The main goal is to prevent long-latency references from impacting the efficiency of the multi-core execution. This is particularly important in the absence of GPU-like fine-grained multi-threading support in FPGAs. Moreover, by aggregating all of the off-chip accesses into DMA burst transfers from/to on-chip BRAMs, the off-chip memory bandwidth can be utilized more efficiently.

FCUDA also leverages synchronization of data transfer and computation tasks based on the FCUDA annotation injected by the FPGA programmer. The selection of the synchronization scheme often incurs a tradeoff between performance and resource requirements. The FPGA programmer needs to consider the characteristics of the accelerated kernel in order to make an educated decision. A simple and resource-efficient scheme is the simple DMA synchronization (Fig. 4a), which serializes data communication and computation tasks. This scheme is memory-overhead free and can be a good fit for kernels that are compute intensive and incur low data communication traffic. At the opposite end, the ping-pong synchronization scheme overlaps data communication with computation by doubling the number of BRAM blocks (Fig. 4b). The interconnection logic interchangeably connects each BRAM block to the compute logic and the DMA controller, ensuring that each BRAM block is actively connected to only one of the two modules in each cycle. However, this scheme may result in BRAM utilization overhead, impacting the number of cores that can be instantiated on the FPGA.

Fig. 4. Scheduling schemes: a) simple scheme, b) ping-pong scheme (DMA controller and compute logic connected to BRAM blocks through interconnect logic; active and idle connections alternate in the ping-pong scheme)

B. FCUDA Pragma Directives

Fig. 5a depicts the FCUDA pragma annotation of the coulombic potential (CP) kernel. The kernel function is wrapped within GRID pragmas that define the sub-array of thread-blocks that can be computed by the available FPGA cores within one iteration (in Fig. 5a, two thread-blocks with sequential x coordinates and the same y coordinate). The BLOCK pragma determines the sub-grid of all thread-blocks that this kernel is assigned to compute. By splitting the original CUDA grid of thread-blocks into sub-grids, FPGA cores can be split into clusters, with each cluster being assigned a sub-grid of thread-blocks. This can help further eliminate long wire interconnections between compute and synchronization cores and could enable asynchronous operation of different clusters. The SYNC pragma sets the type of synchronization scheme (currently a choice between simple and ping-pong) to be implemented by the cluster synchronization core. COMPUTE and TRANSFER pragmas are used to wrap the computation and the data communication tasks of the kernel, respectively. In Fig. 5a, two TRANSFER sections are used: one for fetching off-chip data into the atominfo array and one for storing results to the energygrid off-chip storage. The following section describes how FCUDA leverages the translation of the CUDA code into properly crafted C code for AutoPilot. FCUDA compilation is based on the Cetus source-to-source compiler framework [19] and consists of two major stages.
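Since Fig. 5a is not reproduced here, the sketch below suggests how such an annotated CP kernel might look. Only the COMPUTE directive's cores=2 / name="cp_block" values (visible in the Fig. 7 excerpt) and the "fetch"/"write" TRANSFER names are taken from the text; the remaining pragma parameters, the extra kernel argument, the atominfo layout and the kernel body are assumptions (the surrounding GRID, BLOCK and SYNC pragmas described above are omitted).

    // Illustrative FCUDA-annotated CUDA kernel, loosely modeled on the Fig. 5a description.
    #define MAXATOMS 128                          // assumed bound on atoms per burst
    __global__ void cenergy(int numatoms, float gridspacing, float *energygrid,
                            const float4 *atominfo_in /* assumed off-chip source */) {
        __shared__ float4 atominfo[MAXATOMS];
        int xindex = blockIdx.x * blockDim.x + threadIdx.x;
        int yindex = blockIdx.y * blockDim.y + threadIdx.y;

        #pragma FCUDA TRANSFER begin name="fetch"   // DMA burst: off-chip -> on-chip BRAM
        if (threadIdx.y == 0 && threadIdx.x < numatoms)   // assumes numatoms <= blockDim.x
            atominfo[threadIdx.x] = atominfo_in[threadIdx.x];
        __syncthreads();
        #pragma FCUDA TRANSFER end name="fetch"

        #pragma FCUDA COMPUTE cores=2 begin name="cp_block"
        float energyval = 0.0f;
        for (int n = 0; n < numatoms; n++) {        // contribution of every atom
            float dx = gridspacing * xindex - atominfo[n].x;
            float dy = gridspacing * yindex - atominfo[n].y;
            float dz = atominfo[n].z;               // grid plane at z = 0 (assumption)
            energyval += atominfo[n].w / sqrtf(dx*dx + dy*dy + dz*dz);
        }
        #pragma FCUDA COMPUTE end name="cp_block"

        #pragma FCUDA TRANSFER begin name="write"   // DMA burst: on-chip BRAM -> off-chip
        energygrid[yindex * gridDim.x * blockDim.x + xindex] += energyval;
        #pragma FCUDA TRANSFER end name="write"
    }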


Fig. 5. Coulombic Potential (CP) kernel transformation through FCUDA: a) FCUDA-annotated CUDA kernel (compute and write tasks), b) generated AutoPilot code

C. FCUDA Front-End Transformation

The front-end engine of FCUDA aims to transform the single-thread kernel into semantically equivalent C code which explicitly expresses the execution of all the kernel threads in a serialized fashion. This is achieved by converting the CUDA built-in variables that hold the thread (and thread-block) IDs into regular C variables which are used as induction variables in thread-loops (and block-loops). Fig. 6 illustrates the transformation of the CUDA kernel threads into thread-loops by the FCUDA front-end engine (serialization of thread-blocks, though not depicted for space and clarity reasons, also takes place during this FCUDA phase).

As shown in Fig. 6, synchronization directives within the CUDA kernel need to be considered during the front-end transformation phase of FCUDA in order to maintain the ordering semantics of thread execution within the serialized thread-blocks. Synchronization points are indicated by CUDA sync directives, FCUDA COMPUTE and TRANSFER pragmas, and irregular control flow statements (i.e. break, continue and return). A loop-fission technique proposed in [10] is used to break the initially generated kernel-wide thread-loop into localized thread-loops which do not cross any of the synchronization directives encountered in the code. Fig. 6 depicts the result of loop fission in a kernel with a single synchronization point. The initial kernel thread-loop is split into two thread-loops: the first thread-loop implements the thread operations preceding the sync point, while the second thread-loop implements the thread operations following the sync point. This way, serialized execution of threads maintains the thread-block synchronization semantics. FCUDA extends the MCUDA [10] implementation of loop-fission by adding COMPUTE and TRANSFER pragmas to the list of synchronization directives. COMPUTE and TRANSFER pragmas are used by the FPGA programmer to annotate computation and off-chip data communication tasks. Thus, synchronization of threads between tasks is required. Synchronization primitives can be removed after loop-fission, except for FCUDA pragmas, which carry implementation information used by the back-end engine of FCUDA.

Thread serialization creates the opportunity for variable sharing among threads. However, there are cases in which each thread must have its private copy of a kernel variable. This usually happens for variables that are accessed across synchronization points (e.g. energyval in Fig. 5a). Scalar variable expansion [10] is applied for such variables in order to create thread-private copies of the variable. Fig. 7 depicts how the computation task of CP (Fig. 5a) is transformed after the front-end processing of FCUDA. Loop-fission forms a thread-loop within the COMPUTE pragmas, whereas selective scalar expansion results in vectorization of energyval, which is referenced across thread-loops.

Fig. 6. Extracting the CUDA coarse-grained parallelism in FCUDA
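The following sketch illustrates the MCUDA-style transformation described above on a generic kernel fragment (not the paper's generated output): the implicit CUDA threads become explicit thread-loops, and a __syncthreads() point forces loop fission into two thread-loops.

    /* Original CUDA kernel body (implicit per-thread execution):
     *     shared[threadIdx.x] = in[threadIdx.x];
     *     __syncthreads();
     *     out[threadIdx.x] = shared[blockDim.x - 1 - threadIdx.x];
     *
     * Illustrative front-end output: thread IDs become induction variables of
     * thread-loops, and the sync point splits the kernel-wide loop in two. */
    void kernel_block(int blockDim_x, const int *in, int *out) {
        int shared[256];                      /* block-private (BRAM-mapped) storage */
        int tIdx_x;

        for (tIdx_x = 0; tIdx_x < blockDim_x; tIdx_x++)   /* thread-loop 1: before sync */
            shared[tIdx_x] = in[tIdx_x];

        /* __syncthreads() needs no runtime action once threads are serialized */

        for (tIdx_x = 0; tIdx_x < blockDim_x; tIdx_x++)   /* thread-loop 2: after sync */
            out[tIdx_x] = shared[blockDim_x - 1 - tIdx_x];
    }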

Fig. 7. FCUDA front-end processed CP compute task: the COMPUTE region (#pragma FCUDA COMPUTE cores=2 begin name="cp_block") encloses thread-loops over tIdx, with energyval expanded into an array of thread-private copies
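A plausible shape of that processed compute task is sketched below. The pragma line, the array-expanded energyval and the tIdx thread-loop come from the Fig. 7 excerpt; the array dimensions, the dim2 struct and the cp_point() helper are assumptions introduced only to make the fragment self-contained.

    /* Illustrative reconstruction of the Fig. 7 compute task (not verbatim). */
    typedef struct { int x, y; } dim2;            /* assumed thread-index struct */
    int cp_point(int tx, int ty, int n);          /* hypothetical per-atom contribution */

    void cp_block(int numatoms, int blockDim_x, int blockDim_y,
                  int energyval[16][16]) {        /* assumes blockDim_x, blockDim_y <= 16 */
        dim2 tIdx;
        #pragma FCUDA COMPUTE cores=2 begin name="cp_block"
        for (tIdx.y = 0; tIdx.y < blockDim_y; tIdx.y++)       /* serialized threads (y) */
            for (tIdx.x = 0; tIdx.x < blockDim_x; tIdx.x++)   /* serialized threads (x) */
                for (int n = 0; n < numatoms; n++)
                    energyval[tIdx.y][tIdx.x] += cp_point(tIdx.x, tIdx.y, n);
        #pragma FCUDA COMPUTE end name="cp_block"
    }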

D. FCUDA Back-End Transformation

The back-end engine of FCUDA leverages the implementation information annotated in the FCUDA pragma directives to guide the translation of the kernel coarse-grained parallelism into the function-level type of parallelism supported by AutoPilot (Fig. 6). Tasks annotated through FCUDA COMPUTE and TRANSFER pragmas are transformed into newly generated task functions which are called from the original kernel function, referred to hereafter as the parent function. Multiple calls of the task functions, wrapped within AUTOPILOT REGION and PARALLEL directives in the parent function (Fig. 5b), drive the synthesis tool to instantiate parallel processing cores on the configurable fabric. The degree of parallelism is specified by the parameter information included in the COMPUTE and TRANSFER pragmas (Fig. 5a) and is used to adjust the stride length of the block-loop in the parent function (Fig. 5b). One of the critical tasks of this transformation is the facilitation of data communication between the different task functions and the parent function. Variable analysis is performed to determine which variables are private to the task and which ones need to be communicated to/from the task function. Communication of both scalar and array variables is implemented through the task function parameters. Apart from the type and number of cores, the FPGA programmer can also extract thread parallelism (provided available resources exist) by injecting AUTOPILOT UNROLL and PIPELINE pragmas within the FCUDA COMPUTE annotated tasks, to specify thread-loop unrolling and pipelining, respectively.

As discussed previously, FCUDA TRANSFER pragmas are used to annotate data communication tasks to off-chip addresses. According to the FCUDA philosophy, off-chip data communication usually infers DMA burst transfers of data between off-chip memory storage and on-chip BRAM arrays. The FCUDA back-end engine is also responsible for instantiating array variables which will infer BRAM block allocation during synthesis by AutoPilot. BRAM-associated arrays are instantiated at the parent function and their number is determined by the degree of parallelism annotated in the compute tasks that reference them (Fig. 5b). BRAM-associated arrays may be passed as arguments to compute and transfer functions similarly to the rest of the variables. A challenging task of the BRAM array instantiation is the determination of their dimensions. There are two ways this is accomplished: i) through variable access analysis and consideration of the containing thread-loop induction variable range space (the "write" TRANSFER in Fig. 5a), and ii) through FCUDA DATA parameter information (the "fetch" TRANSFER in Fig. 5a). More details on the leveraging of the different CUDA memory spaces are provided in the following subsection.

Fig. 8 shows how the CP parent function is transformed for a ping-pong synchronization scheme. An if-else structure is used to implement the switching of the accessed BRAM block in each iteration of the block-loop.

Fig. 8. CP parent function for the ping-pong scheme (excerpt): cenergy(int numatoms, int gridspacing, int *energygrid, dim3 blockDim, dim3 gridDim) declares per-core atominfo and energyval BRAM arrays (c11_, c12_, c21_, c22_) and a pingpong flag that is toggled across iterations of the block-loop over bIdx
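Based on the excerpt summarized above, a rough reconstruction of such a ping-pong parent function is sketched below. The c11_/c12_/c21_/c22_ arrays, the pingpong flag and the cenergy signature follow the excerpt; the array sizes, the dim3 definition, the task-function names and signatures, the pragma spellings and the exact schedule (including the serial write-back) are assumptions.

    /* Illustrative ping-pong parent function (not the paper's generated code). */
    #define ATOMS   128                      /* assumed BRAM array sizes */
    #define THREADS 256
    typedef struct { int x, y, z; } dim3;    /* assumed plain-C stand-in for dim3 */

    /* Generated task functions (names and signatures assumed) */
    void transfer_fetch(int numatoms, int a0[ATOMS], int a1[ATOMS]);
    void cp_block(int bx, int by, dim3 blockDim, int numatoms, int gridspacing,
                  int atominfo[ATOMS], int energyval[THREADS]);
    void transfer_write(int *energygrid, int e0[THREADS], int e1[THREADS]);

    void cenergy(int numatoms, int gridspacing, int *energygrid,
                 dim3 blockDim, dim3 gridDim) {
        int c11_atominfo[ATOMS], c12_atominfo[ATOMS], c21_atominfo[ATOMS], c22_atominfo[ATOMS];
        int c11_energyval[THREADS], c12_energyval[THREADS], c21_energyval[THREADS], c22_energyval[THREADS];
        int pingpong = 0;
        dim3 bIdx;

        for (bIdx.y = 0; bIdx.y < gridDim.y; bIdx.y++)
            for (bIdx.x = 0; bIdx.x < gridDim.x; bIdx.x += 2) {   /* two thread-blocks per iteration */
                if (pingpong == 0) {
                    /* DMA fills the c1x BRAMs while the two cores compute on the c2x BRAMs:
                       no shared memory blocks, so the calls may be scheduled concurrently */
                    #pragma AUTOPILOT REGION PARALLEL
                    {
                        transfer_fetch(numatoms, c11_atominfo, c12_atominfo);
                        cp_block(bIdx.x + 0, bIdx.y, blockDim, numatoms, gridspacing,
                                 c21_atominfo, c21_energyval);
                        cp_block(bIdx.x + 1, bIdx.y, blockDim, numatoms, gridspacing,
                                 c22_atominfo, c22_energyval);
                    }
                    transfer_write(energygrid, c21_energyval, c22_energyval);
                } else {
                    /* buffer roles swapped: compute on c1x while DMA fills c2x */
                    #pragma AUTOPILOT REGION PARALLEL
                    {
                        transfer_fetch(numatoms, c21_atominfo, c22_atominfo);
                        cp_block(bIdx.x + 0, bIdx.y, blockDim, numatoms, gridspacing,
                                 c11_atominfo, c11_energyval);
                        cp_block(bIdx.x + 1, bIdx.y, blockDim, numatoms, gridspacing,
                                 c12_atominfo, c12_energyval);
                    }
                    transfer_write(energygrid, c11_energyval, c12_energyval);
                }
                pingpong = 1 - pingpong;   /* if-else above switches the accessed BRAM set */
            }
    }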
E. CUDA Memory Spaces

The different memory spaces leveraged in the CUDA programming model ultimately have to be mapped onto local BRAM memories, as described in the previous sections. The simplest memory space to handle is shared memory, due to its common semantics with BRAM memories. Both of them refer to local memory blocks that are private to the threads of a thread-block. Thus, direct mapping of shared memory arrays onto BRAM memory is feasible. A distinguishing characteristic of shared memory is its 16-bank organization, which allows 16 concurrent accesses by parallel threads. BRAM memories, on the other hand, only support dual-access concurrency. However, the serialization of block threads in the FCUDA flow eliminates the potential latency overhead of increased BRAM access conflicts in kernels that are engineered to take advantage of the multi-bank organization of shared memory. Besides, FPGA configurability offers the flexibility of organizing BRAM blocks in a multi-bank scheme if necessary (though this is not supported by our current implementation). Moreover, the BRAM block size customizability enables flexible tuning of kernels without the constraining restrictions imposed by the small size of shared memory on GPUs.

The constant memory space is shared by all the thread-blocks running on the GPU; it is read-only and is used for references that exhibit locality, since it is cached. These attributes make it a good match for the different DMA burst schemes described earlier. A portion of the off-chip DRAM will serve as constant memory, and the BRAMs will be used as read-only buffers that are filled with the corresponding block of data before the thread-block execution. It may be possible to share BRAM blocks that contain constant memory data among a few compute cores on the FPGA, to reduce BRAM resource requirements per core, if it does not severely impact execution frequency.

Global memory corresponds to the off-chip memory of the GPU, which is globally accessible at a high latency but with abundant capacity.


Fig. 9. GPU – FPGA performance comparison: relative speedup for the matmul, cp and rc5-72 kernels at 32-, 16- and 8-bit precisions

VI. CONCLUSIONS

In this paper, we present a new FPGA design flow that takes annotated CUDA code as input and generates C code for AutoPilot with task-level parallelism, which is synthesized into customized multi-core accelerators on the FPGA. We demonstrate that the user can indeed use a single starting point for efficient acceleration, irrespective of whether the target platform is GPU or FPGA. CUDA allows the user to express

REFERENCES

[14] http://www.nvidia.com/page/geforce_8800.html
[15] E. J. Kelmelis, J. Durbano, J. Humphrey, and F. Ortiz, "Modeling and simulation of nanoscale devices with a desktop supercomputer," Proc. of the Int. Society for Optical Engineering, 2006.
[16] Xilinx Inc., http://www.xilinx.com
[17] Altera Inc., http://www.altera.com
[18] LLVM compiler, http://www.llvm.org
[19] S. Lee, T. Johnson, and R. Eigenmann, "Cetus - an extensible compiler infrastructure for source-to-source transformation," Languages and Compilers for Parallel Computing, 2003.
[20] M. Showerman, W.-M. W. Hwu, J. Enos, A. Pant, V. Kindratenko, C. Steffen, and R. Pennington, "QP: A Heterogeneous Multi-Accelerator Cluster," Int. Conf. on High-Performance Cluster Computing, 2009.
[21] IBM Cell Processor, http://www.research.ibm.com/cell/
[22] S. Huang, A. Hormati, D. F. Bacon, and R. M. Rabbah, "Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary," ECOOP, 2008.
[23] A. Hormati, M. Kudlur, S. A. Mahlke, D. F. Bacon, and R. M. Rabbah, "Optimus: Efficient Realization of Streaming Applications on FPGAs," CASES, 2008.