Generating Configurable Hardware from Parallel Patterns
Raghu Prabhakar, David Koeplinger, Kevin Brown, HyoukJoong Lee, Christopher De Sa, Christos Kozyrakis, Kunle Olukotun
Pervasive Parallelism Laboratory, Stanford University
{raghup17, dkoeplin, kjbrown, hyouklee, cdesa, kozyraki, kunle}@stanford.edu

Abstract

In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising, as they offer significant performance and energy improvements compared to CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use either low-level hardware description languages like Verilog and VHDL, which have none of the productivity features of modern software development languages but produce very efficient designs, or low-level software languages like C and OpenCL coupled with high-level synthesis (HLS) tools, which typically produce designs that are far less efficient.

Functional languages with parallel patterns are a better fit for hardware generation because they both provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two optimizations that are important when using parallel patterns to generate hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups of up to 40× on a set of benchmarks from the data analytics domain.

1. Introduction

The slowdown of Moore's law and the end of Dennard scaling have forced a radical change in the architectural landscape. Computing systems are becoming increasingly parallel and heterogeneous, relying on larger numbers of cores and specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising as an acceleration technology, as they can offer performance and energy improvements for a wide class of applications while also providing the reprogrammability and flexibility of software. Applications which exhibit large degrees of spatial and temporal locality and which contain relatively small amounts of control flow, such as those in the image processing [21, 6], financial analytics [29, 16, 48], and scientific computing domains [42, 1, 11, 50], can especially benefit from hardware acceleration with FPGAs. FPGAs have also recently been used to accelerate personal assistant systems [23] and machine learning algorithms like deep belief networks [31, 32].

The performance and energy advantages of FPGAs are now motivating the integration of reconfigurable logic into data center computing infrastructures. Both Microsoft [37] and Baidu [31] have recently announced such systems. These systems have initially taken the form of banks of FPGA accelerators which communicate with CPUs through Infiniband or PCIe [28]. Work is also being done on heterogeneous motherboards with shared CPU-FPGA memory [22]. The recent acquisition of Altera by Intel suggests that systems with tighter, higher-performance on-chip integration of CPUs and FPGAs are now on the horizon.

The chief limitation of using FPGAs as general purpose accelerators is that the programming model is currently inaccessible to most software developers. Creating custom accelerator architectures on an FPGA is a complex task, requiring the coordination of large numbers of small, local memories, communication with off-chip memory, and the synchronization of many compute stages. Because of this complexity, attaining the best performance on FPGAs has traditionally required detailed hardware design using hardware description languages (HDLs) like Verilog and VHDL. This low-level programming model has largely limited the creation of efficient custom hardware to experts in digital logic and hardware design.

In the past ten years, FPGA vendors and researchers have attempted to make reconfigurable logic more accessible to software programmers through the development of high-level synthesis (HLS) tools, which are designed to automatically infer register transfer level (RTL) specifications from higher-level software programs. To better tailor these tools to software developers, HLS work has typically focused on imperative languages like C/C++, SystemC, and OpenCL [45]. Unfortunately, there are numerous challenges in inferring hardware from imperative programs. Imperative languages are inherently sequential and effectful. C programs in particular pose a number of challenges in alias analysis and in detecting false dependencies [18], typically requiring numerous user annotations to help HLS tools discover parallelism and determine when various hardware structures can be used. Achieving efficient hardware with HLS tools often requires an iterative process to determine which user annotations are necessary, especially for software developers less familiar with the intricacies of hardware design [15].

[Figure 1: System diagram. Pattern transformations (fusion, pattern tiling, code motion) turn the parallel pattern IR into a tiled parallel pattern IR; hardware generation (memory allocation, template selection, metapipeline analysis, MaxJ HGL generation) then produces the bitstream for the FPGA configuration.]

Functional languages are a much more natural fit for high-level hardware generation, as they have limited to no side effects and more naturally express a dataflow representation of applications which can be mapped directly to hardware pipelines [5]. Furthermore, the order of operations in functional languages is defined only by data dependencies rather than by sequential statement order, exposing significant fine-grained parallelism that can be exploited efficiently in custom hardware.

Parallel patterns like map and reduce are an increasingly popular extension to functional languages which adds semantic information about memory access patterns and inherent data parallelism that is highly exploitable by both software and hardware. Previous work [19, 3] has shown that compilers can use parallel patterns to generate C- or OpenCL-based HLS programs and add certain annotations automatically. However, just like hand-written HLS code, the quality of the generated hardware is still highly variable. Apart from the practical advantage of building on existing tools, generating imperative code from a functional language only to have the HLS tool attempt to re-infer a functional representation of the program is a suboptimal solution, because higher-level semantic knowledge in the original program is easily lost. In this paper, we describe a series of compilation steps which automatically generate a low-level, efficient hardware design from an intermediate representation (IR) based on parallel patterns. As shown in Figure 1, these steps fall into two categories: high-level parallel pattern transformations (Section 4) and low-level analyses and hardware generation optimizations (Section 5).
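To make this concrete, here is a minimal sketch of the kind of nested parallel pattern program we target, written with plain Scala collections as a stand-in for a parallel pattern language (the gemv name and the data are invented for illustration; this is not the syntax of our compiler's frontend):

```scala
// Dense matrix-vector multiply expressed purely with parallel patterns.
// Every map and reduce declares its data access and parallelism explicitly,
// which is the semantic information a hardware generator can exploit.
object PatternSketch {
  def gemv(matrix: Array[Array[Float]], vector: Array[Float]): Array[Float] =
    matrix.map { row =>                    // outer map: one output element per row
      row.zip(vector)
        .map { case (a, x) => a * x }      // inner map: elementwise multiply
        .reduce(_ + _)                     // reduce: sum with an associative operator
    }

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1f, 2f), Array(3f, 4f))
    val v = Array(5f, 6f)
    println(gemv(m, v).mkString(", "))     // prints: 17.0, 39.0
  }
}
```

Note that nothing in this program fixes an evaluation order beyond the data dependencies: the outer map is free to process rows in parallel, and the reduce is parallelizable because its operator is associative.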
One of the challenges in generating efficient hardware from high-level programs is handling arbitrarily large data structures. FPGAs have a limited amount of fast local memory, and accesses to main memory are expensive in terms of both performance and energy. We address this by automatically tiling parallel patterns to improve locality. Because our tiling rules do not require analysis of the entire program, they can be used even on programs which contain random and data-dependent accesses.

Our tiled intermediate representation exposes memory regions with high data locality, making them ideal candidates to be allocated on-chip. Parallel patterns provide rich semantic information on the nature of the parallel computation at multiple levels of nesting, as well as memory access patterns at each level. In this work, we preserve certain semantic properties of memory regions and analyze memory access patterns in order to automatically infer hardware structures like FIFOs, double buffers, and caches. We exploit parallelism at multiple levels by automatically inferring and generating metapipelines: hierarchical pipelines where each stage can itself be composed of pipelines and other parallel constructs. Our code generation approach maps parallel IR constructs to a set of parameterizable hardware templates, where each template exploits a specific parallel pattern or memory access pattern. These hardware templates are implemented using a low-level Java-based hardware generation language (HGL) called MaxJ.
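As a rough preview of what tiling does to the sketch above (again illustrative plain Scala rather than our IR or generated hardware; the TileSize constant and all names are invented), the dot product inside each row is blocked into fixed-size tiles so that each tile can live in on-chip memory, and the per-tile stages are exactly what a metapipeline overlaps:

```scala
// Hypothetical tiled form of the earlier gemv sketch (illustrative only).
// A metapipeline would double-buffer the tile storage so that fetching
// tile i+1 from main memory overlaps with computing on tile i.
object TiledPatternSketch {
  val TileSize = 2  // invented tile size, chosen so a tile fits in local memory

  def gemvTiled(matrix: Array[Array[Float]], vector: Array[Float]): Array[Float] =
    matrix.map { row =>
      row.indices
        .grouped(TileSize)                    // stage 1: select the next tile of indices
        .map { tile =>                        //   (grouped also handles a ragged last tile)
          tile.map(j => row(j) * vector(j))   // stage 2: compute on the buffered tile
            .reduce(_ + _)
        }
        .reduce(_ + _)                        // stage 3: accumulate per-tile partial sums
    }

  def main(args: Array[String]): Unit = {
    val m = Array(Array(1f, 2f, 3f, 4f), Array(5f, 6f, 7f, 8f))
    val v = Array(1f, 1f, 1f, 1f)
    println(gemvTiled(m, v).mkString(", "))   // prints: 10.0, 26.0
  }
}
```

In the compiler this split is produced by rewrite rules over the pattern IR rather than by hand, and the resulting tile-level structure is what drives the memory allocation and metapipeline analysis described above.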
In this paper we make the following contributions:

• We describe a systematic set of rules for tiling parallel patterns, including a single, general pattern used to tile all patterns with fixed output size. Unlike previous automatic tiling work, these rules are based on pattern matching and therefore do not restrict all memory accesses within the program to be affine.

• We demonstrate a method for automatically inferring complex hardware structures like double buffers, caches, CAMs, and banked BRAMs from a parallel pattern IR. We also show how to automatically generate metapipelines, which are a generalization of pipelines that greatly increase design throughput.

• We present experimental results for a set of benchmark applications from the data analytics domain running on an