White Paper | FPGA Inline Acceleration

Flexibility: FPGAs and CAD in Deep Learning Acceleration

Authors

Andrew C. Ling
[email protected]

Mohamed S. Abdelfattah
[email protected]

Andrew Bitar
[email protected]

Davor Capalija
[email protected]

Gordon R. Chiu
[email protected]

Intel® Corporation, Programmable Solutions Group, Toronto, Canada

Abstract

Deep learning inference has become the key workload to accelerate in our artificial intelligence (AI)-powered world. FPGAs are an ideal platform for the acceleration of deep learning inference by combining low-latency performance, power efficiency, and flexibility. This paper examines flexibility, and its impact on FPGA design methodology, physical design tools, and computer-aided design (CAD). We describe the degrees of flexibility required to create efficient deep learning accelerators. We quantify the effects of precision, vectorization, and buffering on performance and accuracy, and show how an FPGA can yield superior performance through architecture customization tuned for a specific neural network. We describe the need for abstraction and propose solutions for modern FPGA design flows to enable the rapid creation of customized accelerator architectures for deep learning inference acceleration. Finally, we examine the implications on physical design tools and CAD.

Table of Contents

Abstract
Introduction
Why FPGAs for Deep Learning Acceleration
FPGA Deep Learning Accelerator
Degrees of Flexibility
Compute Precision
Accelerator Geometry and Customization
Memory Architecture
Implications of Flexibility
High and Higher-Level Design
Abstracting Design Entry with OpenCL™
Future Work: Even Higher-Level Design
Physical Design Implications
Packing of Heterogeneous Resources
Systolic Arrays for Scalable Circuits
Floorplanning for Improving Performance
Designer Productivity
Conclusion
References

Introduction

Over the past few years, deep learning techniques have been increasingly applied to solve problems in many fields, starting with image and speech recognition, and growing to include natural language processing, machine translation, autonomous vehicles, and smart medicine. As of 2016, analytics was the fastest growing workload in the data center, and predicted to be the largest workload by compute cycles in 2020, mostly due to the rapid increase in adoption and deployment of deep learning [4].

Unlike deep learning training, predominantly hosted in data centers and the cloud, deep learning inference – scoring a trained neural network model against unknown input data – is often performed in line with data collection, in either a data center context (web analytics, image or media processing) or an embedded context (autonomous driving or surveillance systems). In both contexts, system architects are looking to offload inference compute cycles from valuable CPUs by leveraging compute accelerators, such as General-Purpose Graphics Processing Units (GPGPUs), FPGAs, or fixed-function ASICs. The enhanced data parallelism available in an accelerator can both improve power efficiency and reduce compute latency. This reduced latency manifests itself in better system performance (such as response time to web user requests, or reaction time to external events in an autonomous driving system).

When data scientists attempt to solve a problem with deep learning, they typically begin with a "standard" neural network topology (typically an ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-winning network [10], such as GoogLeNet [9] or ResNet [13]), and iterate from there. With the rapid change in this space, the final network is often not defined at the beginning of the project, and morphs as compute, accuracy, and application requirements change. While general-purpose compute technologies such as CPUs are flexible enough to adapt to neural network topology changes, the performance of a fixed-function accelerator varies profoundly with topology choice. This leads to a phase-ordering problem, as the acceleration hardware and system design is often locked down well before the workload is finalized, leading to subpar accelerator performance.

Figure 1. System-Level Diagram of Our Neural Network Inference Accelerator: off-chip DDR memory feeds an on-chip stream buffer and per-PE filter caches; a K_VEC-wide array of PEs (PE-0 … PE-N) consumes C_VEC × Q_VEC × P_VEC data per cycle, and a DRAIN_VEC × Q_VEC × P_VEC drain path feeds the auxiliary kernels (Activation, Max Pool, LRN, each with its own AUX_VEC) through a custom interconnect with width adapters.

As research continues, the industry may consolidate on standard network topologies. Until then, with new topologies and innovations emerging on a daily basis, accelerator flexibility is critical for supporting a wide gamut of networks. This flexibility has implications on the physical design flows, tools, and CAD required to efficiently define and create accelerators.

Why FPGAs for Deep Learning Acceleration?

The FPGA architecture is naturally amenable to deep learning inference acceleration. Arithmetic datapaths can be synthesized and mapped to reconfigurable logic for greater power efficiency and lower latency than executing instructions on a CPU or GPGPU. System integration options through flexible I/O interfaces allow tighter integration into embedded or streaming applications (such as directly processing the output of a camera module). Most importantly, flexibility in the reconfigurable logic and routing enables many different accelerator architectures to be efficiently implemented on the same FPGA fabric. This paper focuses on FPGA flexibility, argues the need for higher-level design, and presents the implications on CAD (high-level and physical):

• Flexibility inherent in the FPGA fabric enables construction of many accelerator architectures. The upside from using a bespoke acceleration architecture, tuned to a specific network, can outweigh the cost of using a reconfigurable FPGA fabric, and exposes a very large design space.
• Higher-level design solutions are required to take full advantage of this architecture flexibility and design space. Our current solution consists of an acceleration stack and a high-level synthesis compiler, providing abstraction. Further abstraction is necessary to generate arbitrary accelerators.
• New challenges arise for CAD and physical design when exposed to higher levels of design abstraction and the domain's high-performance requirements. We describe some solutions to problems we see today, as well as propose areas of future research.

The paper is organized as follows: the next sections describe our Deep Learning Accelerator (DLA), the degrees of flexibility present when implementing an accelerator on an FPGA, and quantitatively analyze the benefit of customizing the accelerator for specific deep learning workloads (as opposed to a fixed-function accelerator). We then describe our current high-level design abstraction for creating the Deep Learning Accelerator, and propose higher levels of abstraction needed to enable future success in this space. Finally, we discuss details of the challenges of mapping generated designs into physical implementations, the workarounds employed, and proposals for future research.

FPGA Deep Learning Accelerator

We presented an architecture and implementation for high-performance execution of a deep learning inference accelerator [2] on an FPGA (Figure 1). An accelerator core reads input image and filter data from external memory (DDR), and stores the data in on-chip caches built of RAM blocks. Data is read from the on-chip caches and fed to a set of Processing Elements (PEs), which perform the dot product calculations (executed in parallel) comprising the bulk of the workload. A drain network collects the dot product output, which is fed to auxiliary kernels that perform the non-linear computations (such as activation, pooling, and normalization). The resulting output is either fed as input to the next convolution layer (executed sequentially), or written back to external memory.

Degrees of Flexibility

Though our original work focused on a single optimized instance of an accelerator for the AlexNet neural network topology [2], our architecture is flexible enough for high-performance acceleration of other network topologies. The design can be parameterized to create a class of neural network accelerator designs specialized for specific network topologies. In this section, we examine the degrees of flexibility available, and quantify the benefits provided by FPGA flexibility with respect to compute precision, accelerator geometry, and memory architecture.


Compute Precision

Deep learning applications often have large memory requirements, which pose a hurdle to their acceleration. Recent work has shown that reducing the accelerator's precision from full-precision floating-point can shrink the neural network model size while maintaining its required accuracy.

Fixed-point representation has been employed by NVIDIA [8], Xilinx [12], and Google's TPU* [18] for convolutional neural network (CNN) and recurrent neural network (RNN) acceleration, while Microsoft* recently announced the use of reduced-precision floating-point on an Intel® Stratix® 10 FPGA in their acceleration of Gated Recurrent Units (GRUs) [7]. We measured the impact of using reduced-precision floating-point on overall compute performance for an Intel® Stratix® 10 FPGA. Reduced precision requires fewer FPGA resources per compute unit, resulting in increased throughput (Figure 2).†

Figure 2. Normalized Peak Chip Throughput vs. Floating-Point Precision (FP32, FP16, FP11, FP8) on Intel® Stratix® 10 FPGA.

The DLA architecture leverages the FPGA platform to provide reduced precision by tuning the mantissa width (Figure 3). In addition to data width reduction, the DLA architecture uses the FPGA's bit-level granularity to improve resource usage efficiency by organizing floating-point numbers into "blocks" [2], defined as a group of FP numbers sharing a common exponent. The "block size" defines the number of mantissas sharing the common exponent. Placing a number into a block requires bit-shifting the mantissa, resulting in a loss of precision; however, this loss comes with the benefit of better resource utilization, as storage requirements shrink as more numbers are placed in a block (Figure 4). Tuning block size can effectively be used to trade off accuracy for performance.

Figure 3. Reduced-Precision Floating-Point Formats (FP16 down to FP8): Breakdown of Sign/Exponent/Mantissa.

Figure 4. Organizing Floating-Point Numbers into "Blocks" by Aligning Mantissas to a Common Exponent: 8 FP16 values require 128 bits and 8 FP11 values require 88 bits, while 8 BlockFP16 values (block size = 8) require 101 bits and 8 BlockFP11 values require 61 bits.

While many solutions have been proposed for achieving good accuracy at low precision, there is no one-precision-fits-all solution. The ability to trade off precision for performance is heavily dependent on the application and topology. Every application has its own accuracy requirements, and thus values the performance-accuracy trade-offs differently. Each topology has a different tolerance to reduced data precision. For example, SqueezeNet was designed to achieve AlexNet-level accuracy while reducing the model size by 50X [14];† so it is not surprising that it is more sensitive to accuracy losses at reduced precision than AlexNet (Figure 5). With its fine-grained, bit-level customizability, an FPGA provides the flexibility needed to tune precision and accuracy suitably for individual applications and topologies.
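To make the blocking scheme described above concrete, here is a minimal Python sketch of packing a group of values into a shared-exponent block format. The function names, the rounding behavior, and the mantissa width are illustrative assumptions for exposition, not the DLA's hardware implementation; the storage savings it models correspond to Figure 4, where eight FP16 values need 128 bits but a block-size-8 BlockFP16 group needs only 101 bits.

import math

def pack_block(values, mantissa_bits):
    """Illustrative block floating-point packing: all values in the block share
    the exponent of the largest-magnitude value, and each mantissa is shifted
    (losing low-order bits) to align to that shared exponent."""
    # Shared exponent: exponent of the largest-magnitude value in the block.
    shared_exp = max(math.frexp(v)[1] for v in values if v != 0.0) if any(values) else 0
    scale = 2 ** mantissa_bits
    mantissas = []
    for v in values:
        # Align each value to the shared exponent, then truncate the mantissa.
        aligned = v / (2.0 ** shared_exp)        # magnitude in [0.5, 1) for the largest value
        mantissas.append(int(aligned * scale))   # keep 'mantissa_bits' bits (plus sign)
    return shared_exp, mantissas

def unpack_block(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate values from the shared-exponent block."""
    scale = 2 ** mantissa_bits
    return [(m / scale) * (2.0 ** shared_exp) for m in mantissas]

if __name__ == "__main__":
    block = [0.52, -0.031, 0.0009, 0.25, -0.125, 0.07, 0.9, -0.4]   # block size = 8
    exp, mants = pack_block(block, mantissa_bits=10)                # FP16-like mantissa
    print(exp, unpack_block(exp, mants, mantissa_bits=10))

Values much smaller than the block's largest member lose low-order mantissa bits when aligned to the shared exponent, which is exactly the accuracy-for-storage trade-off the text describes.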


To investigate the accuracy impact, different precisions and block sizes were tested on AlexNet, GoogLeNet, SqueezeNet, ResNet-18, and VGG-16 [16]. Each topology tolerates different levels of precision loss (Figures 5 and 6), further illustrating how different topologies benefit from reduced precision and larger block sizes to varying degrees. By leveraging FPGA flexibility, the DLA can be tuned to the level of precision that best suits an application.

Figure 5. Neural Network Tolerance to Reduced Precision: Normalized Top-1 Classification Accuracy vs. Precision (FP16 through FP8) for AlexNet, GoogLeNet, SqueezeNet, ResNet-18, and VGG-16, Measured Using Block Size = 8.

Figure 6. Increasing Block Size Improves Resource Usage Efficiency at the Cost of Accuracy: Normalized Top-1 Classification Accuracy vs. FP Block Size (4 to 16), Measured Using BlockFP9 Precision.

At a specific precision, there remains the challenge of implementing a PE architecture that maximizes throughput per area for that precision. The optimal PE for each precision level is dependent on the FPGA platform; digital signal processing (DSP) and logic element (LE) architecture strongly influence PE design to achieve the highest throughput in the least area. Choosing the right combination of DSPs and LEs for a given precision is a complex problem that can be facilitated by CAD tools, described in later sections. Scaling the design, once precision and PE architecture are chosen, is done through compute parallelism, which is explored next.

Accelerator Geometry and Customization

To leverage the flexibility of FPGAs, we can customize the accelerator vectorization (degree of parallelism) to suit different classes of neural networks. A change to the vectorization manifests itself as a different accelerator geometry – the number of processing elements and the widths of the data buses between them. Figure 1 shows the degrees of parallelism available in the accelerator, configurable via vectorization.

The problem size at each convolution stage is defined by the neural network topology:

• W, H, C are the input tensor width, height, and depth.
• S, R, C are the filter width, height, and depth.
• Q, P, K are the output tensor width, height, and depth.

Architectural parameters in the accelerator control the degree of parallelism employed in the core compute, which impacts both the number of compute units used and the width of datapaths (a rough illustrative sketch of these parameters follows below):

• Q_VEC, P_VEC, and K_VEC define the window of the output tensor computed in parallel.
• C_VEC defines the depth of the input tensor read and used in parallel.
• S_VEC and R_VEC are the filter width and height read and computed in parallel.
• DRAIN_VEC is the width of the drain network from the PE array. This determines the subset of K_VEC that can be extracted from the PE array in one clock cycle.
• Each auxiliary kernel has an AUX_VEC, which describes the subset of the DRAIN_VEC that is processed in one clock cycle in each of the auxiliary kernels.

For each of our marquee neural networks, we customize the architecture to maximize throughput-per-area, then analyze the geometry considerations for architectures tuned for various graphs.
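As a rough illustration of how these vectorization parameters define an accelerator geometry, the following Python sketch (ours, not part of the DLA tool flow) collects the knobs in one structure and estimates peak parallel multiply-accumulates per cycle as the product of the vectorization factors. Treat this first-order product as an assumption for intuition rather than a cycle-accurate model of the accelerator.

from dataclasses import dataclass

@dataclass
class Vectorization:
    """Accelerator geometry knobs described in the text (illustrative container)."""
    q_vec: int = 1      # output width computed in parallel
    p_vec: int = 1      # output height computed in parallel
    k_vec: int = 64     # output depth computed in parallel (number of PEs)
    c_vec: int = 8      # input depth read and used in parallel
    s_vec: int = 1      # filter width computed in parallel
    r_vec: int = 1      # filter height computed in parallel
    drain_vec: int = 8  # PE outputs drained per clock cycle

    def peak_macs_per_cycle(self) -> int:
        # Assumed first-order estimate: each of the K_VEC PEs performs
        # Q_VEC x P_VEC dot products over C_VEC x S_VEC x R_VEC inputs per cycle.
        return (self.k_vec * self.q_vec * self.p_vec *
                self.c_vec * self.s_vec * self.r_vec)

if __name__ == "__main__":
    geom = Vectorization(q_vec=1, p_vec=4, k_vec=16, c_vec=8)
    print(geom.peak_macs_per_cycle())  # 512 parallel MACs in this configuration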

a. Smaller Filters Require Faster Drains.

Each PE consists of (Q_VEC · P_VEC) dot product operations (each of which multiplies and sums C_VEC data), followed by accumulators, which keep the dot product results until the PE has iterated over the entire filter tensor, after which a single output pixel is produced. Therefore, the speed at which the PEs produce results differs depending on the ratio between the filter size and the vectorization of each PE. Between the GoogLeNet and SqueezeNet networks, SqueezeNet, on average, has smaller filters, causing convolution results to be produced more quickly, and therefore requires a higher-bandwidth drain network. Figure 7 shows the effect of increasing the drain network parallelism (DRAIN_VEC) on the AlexNet, GoogLeNet, and SqueezeNet networks. While GoogLeNet throughput saturates at a DRAIN_VEC of 8, SqueezeNet requires a DRAIN_VEC of 16 for 95% of maximum throughput.† AlexNet only needs DRAIN_VEC=2 to achieve 95% performance since its filters are the largest. Scaling DRAIN_VEC beyond these values yields diminishing returns. Supporting increased drain network parallelization requires scaling up the area of the auxiliary kernels (Table 1). With a fixed-function accelerator, a designer has to choose, a priori, between a 21% area penalty (across all networks) or a 70% performance penalty on SqueezeNet-like networks.† With a reconfigurable FPGA, the designer can instantiate the larger DRAIN_VEC only when required by the network being accelerated.

Figure 7. Effect of Drain Network Vectorization (DRAIN_VEC) on Throughput for AlexNet, GoogLeNet, and SqueezeNet, Normalized to Maximum Performance.

Table 1. Area Cost of Increasing DRAIN_VEC

DRAIN_VEC    AUX KERNELS AREA (% OF FULL SYSTEM)
2            4%
8            14%
16           25%

b. Interconnect Customization and Width Adapters Allow Building Only What is Needed.

Another degree of flexibility exists in the auxiliary kernels domain – the flexible interconnect allows building only exactly what is needed for each graph. For example, the SqueezeNet graph has no local response normalization (LRN) layers, so we can remove that kernel completely. The interconnection pattern within the interconnect is also customizable based on the order of the auxiliary operations. For example, the AlexNet graph has both MaxPool and LRN layers, but LRN always comes first; whereas the GoogLeNet graph has some layers in which MaxPool precedes LRN, so we need to support both connection patterns by adding more muxing and arbitration logic. The width of each of the auxiliary kernels can be customized separately based on how much bandwidth is required for each operation. Finally, the DLA can also leverage FPGAs enhanced with hardened interconnects – such as embedded Networks-on-Chip [1, 3] – for high-bandwidth inter-kernel communication.

c. Balance Vectorization to Minimize Quantization Inefficiencies.

In general, scaling up tensor vectorization increases throughput at the expense of area. Initially, the design was scaled by increasing K_VEC – this was relatively simple, since increasing K_VEC only entails adding more PEs. However, this scaling method saw diminishing returns, as quantization inefficiencies become more pronounced as vectorization dimensions increase.

For example, if the output depth (K) of a layer is 96 and K_VEC is 64, two complete iterations are required, so the output depth is snapped up to 128, with only 96 (75%) of the computations being useful. On the other hand, if K_VEC is 32, the output depth divides perfectly into three iterations for 100% efficiency. To mitigate this quantization effect, it is possible to balance the scaling of the design across multiple dimensions besides just K_VEC (e.g., P_VEC, Q_VEC, C_VEC). The optimal balance of vectorization depends on the graph's layer dimensions. Figure 8 demonstrates this by comparing the throughput of two architectures of similar area for multiple different graphs. As can be seen, the optimal balance of scaling between P_VEC and K_VEC varies based on the neural network topology used. With a flexible FPGA platform, a custom instance of a deep learning accelerator can be generated with vectorization parameters tuned for each application.

Figure 8. Normalized Throughput/Area on Two Architectures with Different P_VEC and K_VEC Vectorization (P_VEC=1, K_VEC=64 vs. P_VEC=4, K_VEC=16) for AlexNet, GoogLeNet, SqueezeNet, VGG-16, and ResNet-101.
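The snapping effect described above can be estimated directly: each tensor dimension is rounded up to the next multiple of its vectorization factor, and the ratio of useful to executed work gives the quantization efficiency. The short Python sketch below reproduces the K = 96 example from the text; the helper name and the extra multi-dimension example are ours.

from math import ceil

def quantization_efficiency(dims, vecs):
    """Fraction of executed work that is useful when each dimension in `dims`
    is snapped up to a multiple of the corresponding vectorization factor."""
    useful = padded = 1
    for d, v in zip(dims, vecs):
        useful *= d
        padded *= ceil(d / v) * v
    return useful / padded

# Example from the text: output depth K = 96.
print(quantization_efficiency(dims=[96], vecs=[64]))  # 96/128 = 0.75
print(quantization_efficiency(dims=[96], vecs=[32]))  # 96/96  = 1.00

# Balancing vectorization across dimensions changes the picture, e.g. an
# output width P = 13 with P_VEC = 4 combined with K = 96 and K_VEC = 16:
print(quantization_efficiency(dims=[13, 96], vecs=[4, 16]))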


Memory Architecture

With their different models and layer sizes, neural networks have a wide range of memory requirements, and creating an efficient design around these requirements poses a complex problem. To support such requirements, the DLA architecture uses both the on-chip and off-chip memory provided by the FPGA platform.

On-chip memory provides a fast means to store and access data, avoiding the relatively long latency of communicating with off-chip memory. The DLA uses on-chip memory for both filter and tensor data (Figure 1). Filter data is stored in a "filter cache" (FC) contained in each PE. While the PEs compute data, filters are pre-loaded from external memory into the filter caches for the next set of PE computations. The "stream buffer" is used to store intermediate tensors between layers on chip. If the stream buffer is too small to hold a given tensor, the DLA falls back to using external memory. Given a limited DDR bandwidth, excessive use of external memory can bottleneck performance.

The balance between using on-chip and off-chip memory creates an important design trade-off. Using more on-chip memory for the stream buffer alleviates DDR bottlenecks but consumes more chip resources that could have been used for more PEs. On the other hand, reducing the on-chip memory used in the stream buffer, in favor of more PEs, provides more compute power at the cost of relying heavily on off-chip memory. The optimal balance of resources between on-chip memory and processing elements that maximizes performance is dependent on the graph's compute and memory requirements. Finding this optimal point requires accurate design modeling as well as knowledge of the target neural network and platform.

To illustrate this trade-off, we modeled the performance of AlexNet, GoogLeNet, and ResNet-101 at different stream buffer memory and PE sizes for a given FPGA platform. Figure 9 shows how the optimal allocation of resources to memory and compute differs for each of the graphs, highlighting the benefit of having a flexible FPGA fabric capable of tailoring a design to achieve optimal performance for any application. A resource trade-off optimal for one neural network can cause as much as 40%, or more,† performance degradation in another.

Figure 9. Impact of the Stream Buffer Memory vs. Compute Trade-off: Normalized Throughput vs. Normalized On-Chip Memory Size for AlexNet, GoogLeNet, and ResNet-101.

Implications of Flexibility

In general, one size does not fit all when designing deep learning accelerators. Using a flexible accelerator on a flexible FPGA platform can improve performance by tuning architecture parameters (compute precision, accelerator geometry, and memory architecture) for a specific neural network topology. We have presented several dimensions of flexibility (by no means an exhaustive list); customization can deliver a performance benefit (or area savings) in each dimension. In aggregate, the benefit across the fully customized FPGA accelerator can outweigh the cost of the programmable fabric. Flexibility of the deep learning acceleration architecture must be matched with a flexible design-entry flow and CAD tools, to allow designers to specify optimized deep learning accelerators quickly.

High and Higher-Level Design

With the large design space of possible architecture parameterizations, it becomes infeasible for a designer to manually configure the deep learning accelerator using traditional FPGA register transfer level (RTL) design techniques. We propose an extra level of abstraction to enable automatic specification and generation of custom deep learning acceleration. The Physical Design Implications section discusses the implications of using such high-level design entry.

Abstracting Design Entry with OpenCL

To achieve the desired flexibility, our deep learning accelerator architecture is described in a generic form that can be easily modified with a set of architectural parameters. We describe our architecture using a set of pre-defined kernels, written in OpenCL, that can be modified with specific architectural parameters such as vectorization parameters, stream buffer dimensions, memory port widths, number of processing elements, and configuration of auxiliary kernels. The combination of these kernels forms a generic architecture template. Once configured, the resulting architecture is compiled and executed through the Intel FPGA SDK for OpenCL. This design flow is illustrated on the right side of Figure 10.

Figure 10. Design Flow for Neural Network Accelerators. Right side (current high-level design flow): a parameterized architecture template (e.g., Kvec = 48, Cvec = 16) is compiled through the OpenCL™ compiler and traditional FPGA CAD into an FPGA bitstream. Left side (proposed higher-level design flow): a design space exploration (DSE) step takes the data scientist's neural network model and an analytical model (e.g., T_all = f_max / (Σ_conv(N_real) + Σ_fc(N_real / S_batch))) and selects the architecture parameters.
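To illustrate how a parameterized architecture template plus an analytical model enables the design space exploration sketched on the left of Figure 10, here is a hedged Python sketch. The parameter names follow the text, but the area and performance estimators are deliberately simplistic placeholders of our own, not the analytical model of [2].

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DlaParams:
    """Architecture-template knobs named in the text (values are examples)."""
    k_vec: int          # number of PEs (output-depth parallelism)
    c_vec: int          # input-depth parallelism
    stream_buf_kb: int  # on-chip stream buffer size
    precision: str      # e.g., "FP16", "FP11", "BlockFP9"

def estimate_area(p: DlaParams) -> float:
    # Placeholder cost model: area grows with compute parallelism and buffering.
    return p.k_vec * p.c_vec * 1.0 + p.stream_buf_kb * 0.05

def estimate_throughput(p: DlaParams, layer_tensor_sizes_kb) -> float:
    # Placeholder performance model: parallel MACs discounted by how often
    # intermediate tensors spill to DDR because the stream buffer is too small.
    spills = sum(1 for t_kb in layer_tensor_sizes_kb if t_kb > p.stream_buf_kb)
    ddr_penalty = 1.0 / (1.0 + 0.5 * spills)
    return p.k_vec * p.c_vec * ddr_penalty

def explore(layer_tensor_sizes_kb, area_budget):
    """Toy design space exploration: pick the parameter set with the best
    estimated throughput that fits within the area budget."""
    space = [DlaParams(k, c, sb, "FP11")
             for k, c, sb in product([16, 32, 48, 64], [8, 16], [256, 512, 1024])]
    feasible = [p for p in space if estimate_area(p) <= area_budget]
    return max(feasible, key=lambda p: estimate_throughput(p, layer_tensor_sizes_kb))

if __name__ == "__main__":
    tensors_kb = [300, 800, 600, 200]   # hypothetical intermediate tensor sizes
    print(explore(tensors_kb, area_budget=1100.0))

In a real flow, the placeholder estimators would be replaced by the deterministic resource and performance model described under Future Work below, and the swept space would cover all of the degrees of flexibility discussed earlier.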


High-level design tools like the Intel FPGA SDK for OpenCL compile untimed, unscheduled OpenCL language into optimized RTL implementations, then apply optimizations (pipelining, vectorization, compute unrolling) as necessary, according to predicted system performance and knowledge of the physical FPGA architecture [11]. This abstraction is invaluable when creating complex, parameterizable designs such as the deep learning accelerator, as it alleviates the need for time-consuming detailed physical design optimization and verification across a wide parameter space.

Future Work: Even Higher-Level Design

Once we have an architecture template that can accept a wide range of architectural parameters to modify its structure, we envision future users of the deep learning accelerator will rely on even higher-level design techniques to select and configure architectures. An additional design space exploration (DSE) step is envisioned to find the optimal architecture to run a given neural network model. The DSE step can be automated using traditional objective-optimization CAD techniques [15]. This would require a high-fidelity model representing the physical characteristics of an architecture and its resulting performance. Using the model, the DSE flow would efficiently search the architecture parameter space to identify the optimal configuration for a given neural network model.

Each design point (a unique instance of the deep learning accelerator) can be accurately modeled because our architecture is deterministic in nature, where all memory accesses, data movements, and compute times can be calculated a priori. The deterministic architecture and analytical model are presented in [2], where both the resource utilization of the FPGA and the final performance of the design can be calculated prior to runtime.

An illustration of this proposed future approach is shown on the left side of Figure 10. The DSE tool takes as input the neural network model, architecture template, analytical model of the architecture, and objective function (such as performance or latency). The DSE tool compiles the neural network into a graph and maps it to the architecture template core primitives. After sweeping through a wide range of architecture parameters, and generating performance estimates from the analytical model, a final set of parameters is chosen to satisfy the objective function. These parameters can then be used to configure the architecture template, and the resulting design is compiled through OpenCL or traditional FPGA compilers to generate a bitstream for the FPGA.

This higher-level design flow, where neural network models are compiled to generate custom FPGA-based accelerators, has many implications on FPGA CAD, and introduces several new challenges to resolve to ensure high performance while enabling flexibility.

Physical Design Implications

Today, state-of-the-art FPGA CAD tools provide automated placement, routing, and packing of homogeneous resources – the three foundational CAD algorithms that account for the physical aspects of the FPGA fabric. Hierarchical physical design techniques, such as floorplanning, packing of heterogeneous resources, and partitioning, are currently done manually. We studied the manual application of these hierarchical techniques to our DLA architecture, and consequently, we advocate for the automation of these techniques in future work on FPGA CAD (both high-level and traditional design). This is required to support automatic exploration of the large accelerator design spaces exposed by our higher-level design tools. Our study showed the following benefits:†

• Significantly improved quality of results (QoR), with improved scalability to larger circuits, higher performance, and lower resource utilization.
• Improved repeatability of compilation results, through the reduction of seed noise.
• Improved designer productivity on a large design via reuse of partitions and incremental compilations.

In the following sections, we describe the techniques we manually prototyped in our physical design study and applied to the DLA.

Packing of Heterogeneous Resources for Efficient Implementation of PEs

The PEs occupy the majority of the FPGA area. The first challenge is to achieve high utilization of heterogeneous resources within that area (i.e., utilize nearly all available logic elements (LEs), DSP blocks, and block memory (M20Ks)). To achieve this, each PE (or a small number of PEs) should consume a mix of LE, DSP, and M20K resources that matches the resources within the physical region where they are implemented. Alternatively, this challenge can be formulated as a packing problem: given a mix of heterogeneous resources in a small region of a device, maximize the number of PEs that can be packed into that region.

One approach employed to address this challenge utilizes multiple variants of dot products, which are the building blocks of PEs. Each dot product variant uses a different tradeoff of LE and DSP resources. Figure 11 shows the tradeoff curves for multiple variants of equivalent dot products at different precisions. We can then solve an integer linear programming problem to select a mix of dot product variants that maximizes utilization of the available resources given a region constraint. Today, the design of the different dot product variants, as well as the formulation of the linear system, is done manually at specific architecture design points. To enable the higher-level abstractions, next-generation tools should strive to automate this process – from the user description of the dot product mathematics, the tool should generate and optimize a mix of dot product variants that maximizes resource usage.
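The packing formulation described above can be viewed as a small integer optimization: choose how many instances of each dot-product variant (each with its own LE/DSP/M20K footprint) to place so that the number of dot products packed into a region is maximized without exceeding the region's resource budget. The exhaustive search below is a Python stand-in for the integer linear programming step mentioned in the text, and the variant footprints are invented numbers for illustration only.

from itertools import product

# Hypothetical dot-product variants: (LEs, DSPs, M20Ks) consumed per dot product.
VARIANTS = {
    "dsp_heavy":   (400, 16, 2),
    "balanced":    (2500, 8, 2),
    "logic_heavy": (6000, 2, 2),
}

REGION_BUDGET = (60000, 180, 60)   # (LEs, DSPs, M20Ks) available in one region

def best_mix(variants, budget, max_each=30):
    """Exhaustively choose how many of each variant to instantiate so that the
    total dot-product count is maximized without exceeding the region budget.
    (A stand-in for the ILP formulation; fine for a handful of variants.)"""
    names = list(variants)
    best = (0, None)
    for counts in product(range(max_each + 1), repeat=len(names)):
        usage = [sum(c * variants[n][r] for c, n in zip(counts, names)) for r in range(3)]
        if all(u <= b for u, b in zip(usage, budget)):
            total = sum(counts)
            if total > best[0]:
                best = (total, dict(zip(names, counts)))
    return best

if __name__ == "__main__":
    count, mix = best_mix(VARIANTS, REGION_BUDGET)
    print(count, mix)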


Figure 11. LE vs. DSP Resource Trade-off Curves for Equivalent Implementations of PEs Across Different Precisions (FP7 through FP11; DSPs used per PE vs. LEs used per PE).

Systolic Arrays for Scalable Circuits

The use of systolic arrays and mesh-like structures on FPGAs is motivated by the need to exploit the massive parallelism of modern FPGAs in a scalable manner [5], [17], [20]. Systolic arrays are circuits with a regular structure and nearest-neighbor connections, primarily used in algorithms where the same piece of data needs to be broadcast and reused in a large number of PEs. In our DLA systolic array design, the broadcast structure is designed as a linear data-forwarding pipeline in which data flows from one PE to another. This avoids a large fan-out of a wide datapath from a central location, and efficiently leverages local routing without introducing routing congestion.

In current tools, the user manually expresses the systolic array structure explicitly in their code by designing the data forwarding mechanism between the neighboring PEs. Future work could explore techniques for inferring large broadcast structures in user code, and automatically converting these structures to systolic arrays, to support the large design space of accelerators.

Floorplanning for Improving Performance of Large Circuits

Although the systolic array of PEs is a regular and repeated structure, we observe inefficiencies when compiling large systolic arrays that fill entire FPGAs. Analysis of the placement of PEs by the CAD tool shows that in some cases neighboring PEs are not placed close to each other. This results in long routes between PEs and a drop in frequency and performance.

We can address this challenge by manually creating floorplans for the systolic array, which is feasible due to its regular structure. Our floorplans comprise a number of physical regions; we then assign PEs to physical regions such that any two neighboring PEs are either in the same region or in two adjacent regions. This results in localized PE placement and mitigates the drop in performance for larger systolic arrays. We measure, on average, a 30% difference in performance between unfloorplanned designs and designs floorplanned in this fashion. A similar benefit has been shown in earlier work on floorplanning of mesh-of-functional-units overlays [6].†

Although manual floorplanning is feasible for a single architecture parameterization on a single FPGA, it is not feasible across the full architecture design space and target devices. Automating this class of manual floorplanning effort remains a key outstanding physical design challenge for us, and research efforts showing promise have already started in academia [19].

Designer Productivity: Partitioning, Reuse, and Incremental Compilation

As the sizes of FPGAs grow exponentially, the sizes of the circuits implemented on them grow at the same pace. In addition to the QoR scaling challenges, these large sizes lead to long compile times, of up to a day. This poses two significant challenges for designer productivity. The first is the development cycle, as the designer can only do a single iteration per day. The second is a reduced repeatability of results, as even a small design change can result in performance changes due to seed noise inherently present in the CAD tool.

To enable scale, we employ approaches that leverage design partitioning. The DLA is composed of multiple OpenCL kernels, each encapsulating a distinct functionality of the design. The use of design partitioning can improve designer productivity by both reducing compile time and improving repeatability. By preserving the post-place-and-route netlist of partitions that have not been modified, only the kernels the designer modified need to be recompiled. Since, in most cases, only a small part of the design needs to be recompiled, it is more likely that a single-seed compile will close timing.

When considering the large DLA architecture design space, the ideal partitioning may not be possible to define a priori, due to changing kernel size and number across all the architecture variants. Further research on automating the process of dynamic design partitioning (for an arbitrary set of interconnected kernels of varying sizes) is a key direction for future work. In addition to improving development speed and the repeatability of QoR, it would relieve designers from the burden of manual design partitioning.

Conclusion

We have shown through a case study of our deep learning accelerator that flexibility to customize an architecture is essential to create future-proof accelerators. The resulting design space requires higher-level abstractions for efficient design and exploration. We advocate for new FPGA design flows that generate custom architectures from a high-level neural network description, and improved physical design flows to infer systolic arrays, optimize for resource constraints, and automate floorplanning and partitioning. With these future enhancements, FPGAs can continue to scale, and be the accelerator of choice for deep learning applications.


References

[1] Mohamed S. Abdelfattah, Andrew Bitar, and Vaughn Betz. 2015. Take the Highway: Design for Embedded NoCs on FPGAs. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 98–107.

[2] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 55–64.

[3] Andrew Bitar, Mohamed S. Abdelfattah, and Vaughn Betz. 2015. Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA. In 2015 International Conference on Field-Programmable Technology (FPT). IEEE, 24–31.

[4] Diane M. Bryant. Keynote at Intel Developer's Forum, San Francisco. (August 2016).

[5] D. Capalija and T. S. Abdelrahman. 2013. A High-Performance Overlay Architecture for Pipelined Execution of Data Flow Graphs. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1–8.

[6] D. Capalija and T. S. Abdelrahman. 2014. Tile-Based Bottom-up Compilation of Custom Mesh-of-Functional-Units FPGA Overlays. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL). 1–8.

[7] Eric Chung et al. 2017. Accelerating Persistent Neural Networks at Datacenter Scale. HotChips.

[8] NVIDIA Corporation. 2017. NVIDIA TensorRT. (2017).

[9] C. Szegedy et al. 2015. Going Deeper with Convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–9.

[10] Olga Russakovsky et al. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.

[11] T. S. Czajkowski et al. 2012. From OpenCL to High-Performance Hardware on FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL). 531–534.

[12] Yao Fu et al. 2016. Deep Learning with INT8 Optimization on Xilinx Devices. White paper, Xilinx (2016).

[13] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.

[14] Forrest N. Iandola et al. 2016. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv:1602.07360 (2016).

[15] Jacopo Panerati, Donatella Sciuto, and Giovanni Beltrame. 2017. Handbook of Hardware/Software Codesign: Optimization Strategies in Design Space Exploration. Springer, Netherlands.

[16] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014).

[17] Xuechao Wei et al. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17). ACM, New York, NY, USA, Article 29, 6 pages.

[18] Yonghui Wu et al. 2016. Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv:1609.08144 (2016).

[19] Xiaodong Xu, Qi Xu, Jinglei Huang, and Song Chen. 2017. An Integrated Optimization Framework for Partitioning, Scheduling and Floorplanning on Partially Dynamically Reconfigurable FPGAs. In Proceedings of the Great Lakes Symposium on VLSI 2017 (GLSVLSI '17). ACM, New York, NY, USA, 403–406.

[20] Jialiang Zhang and Jing Li. 2017. Improving the Performance of OpenCL-Based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 25–34.

† Tests measure performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

© Intel Corporation. All rights reserved. Intel, the Intel logo, the Intel Inside mark and logo, Altera, Arria and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. Other marks and brands may be claimed as the property of others.

Please Recycle WP-01283-1.0