White Paper | FPGA Inline Acceleration

Flexibility: FPGAs and CAD in Deep Learning Acceleration

Authors

Andrew C. Ling
[email protected]

Mohamed S. Abdelfattah
[email protected]

Andrew Bitar
[email protected]

Davor Capalija
[email protected]

Gordon R. Chiu
[email protected]

Intel® Corporation, Programmable Solutions Group, Toronto, Canada

Abstract

Deep learning inference has become the key workload to accelerate in our artificial intelligence (AI)-powered world. FPGAs are an ideal platform for the acceleration of deep learning inference by combining low-latency performance, power efficiency, and flexibility. This paper examines flexibility, and its impact on FPGA design methodology, physical design tools, and computer-aided design (CAD). We describe the degrees of flexibility required to create efficient deep learning accelerators. We quantify the effects of precision, vectorization, and buffering on performance and accuracy, and show how an FPGA can yield superior performance through architecture customization tuned for a specific neural network. We describe the need for abstraction and propose solutions for modern FPGA design flows to enable the rapid creation of customized accelerator architectures for deep learning inference acceleration. Finally, we examine the implications on physical design tools and CAD.

Table of Contents

Abstract
Introduction
Why FPGAs for Deep Learning Acceleration
FPGA Deep Learning Accelerator
Degrees of Flexibility
Compute Precision
Accelerator Geometry and Customization
Memory Architecture
Implications of Flexibility
High and Higher-Level Design
Abstracting Design Entry with OpenCL™
Future Work: Even Higher-Level Design
Physical Design Implications
Packing of Heterogeneous Resources
Systolic Arrays for Scalable Circuits
Floorplanning for Improving Performance
Designer Productivity
Conclusion
References

Introduction

Over the past few years, deep learning techniques have been increasingly applied to solve problems in many fields, starting with image and speech recognition, and growing to include natural language processing, machine translation, autonomous vehicles, and smart medicine. As of 2016, analytics was the fastest growing workload in the data center, and predicted to be the largest workload by compute cycles in 2020, mostly due to the rapid increase in adoption and deployment of deep learning [4].

Unlike deep learning training, predominantly hosted in data centers and the cloud, deep learning inference – scoring a trained neural network model against unknown input data – is often performed in line with data collection, in either a data center context (web analytics, image or media processing) or an embedded context (autonomous driving or surveillance systems). In both contexts, system architects are looking to offload inference compute cycles from valuable CPUs by leveraging compute accelerators, such as General-Purpose Graphics Processing Units (GPGPUs), FPGAs, or fixed-function ASICs. The enhanced data parallelism available in an accelerator can both improve power efficiency and reduce compute latency. This reduced latency manifests itself in better system performance (such as response time to web user requests, or reaction time to external events in an autonomous driving system).

When data scientists attempt to solve a problem with deep learning, they typically begin with a "standard" neural network topology (typically an ImageNet Large Scale Visual Recognition Challenge (ILSVRC)-winning network [10], such as GoogLeNet [9] or ResNet [13]), and iterate from there. With the rapid change in this space, the final network is often not defined at the beginning of the project, and morphs as compute, accuracy, and application requirements change. While general-purpose compute technologies such as CPUs are flexible enough to adapt to neural network topology changes, the performance of a fixed-function accelerator varies profoundly with topology choice. This leads to a phase-ordering problem, as the acceleration hardware and system design is often locked down well before the workload is finalized, leading to subpar accelerator performance.

Figure 1. System-Level Diagram of Our Neural Network Inference Accelerator: off-chip DDR memory feeds an on-chip stream buffer and per-PE filter caches; a K_VEC-wide array of PEs (PE-0 … PE-N) consumes C_VEC × Q_VEC × P_VEC data per cycle, and a DRAIN_VEC × Q_VEC × P_VEC drain path feeds the auxiliary kernels (Activation, Max Pool, LRN, each with its own AUX_VEC) through a custom interconnect with width adapters.

As research continues, the industry may consolidate on standard network topologies. Until then, with new topologies and innovations emerging on a daily basis, accelerator flexibility is critical for supporting a wide gamut of networks. This flexibility has implications on the physical design flows, tools, and CAD required to efficiently define and create accelerators.

Why FPGAs for Deep Learning Acceleration?

The FPGA architecture is naturally amenable to deep learning inference acceleration. Arithmetic datapaths can be synthesized and mapped to reconfigurable logic for greater power efficiency and lower latency than executing instructions on a CPU or GPGPU. System integration options through flexible I/O interfaces allow tighter integration into embedded or streaming applications (such as directly processing the output of a camera module). Most importantly, flexibility in the reconfigurable logic and routing enables many different accelerator architectures to be efficiently implemented on the same FPGA fabric. This paper focuses on FPGA flexibility, argues the need for higher-level design, and presents the implications on CAD (high-level and physical):

• Flexibility inherent in the FPGA fabric enables construction of many accelerator architectures. The upside from using a bespoke acceleration architecture, tuned to a specific network, can outweigh the cost of using a reconfigurable FPGA fabric, and exposes a very large design space.
• Higher-level design solutions are required to take full advantage of this architecture flexibility and design space. Our current solution consists of an acceleration stack and a high-level synthesis compiler, providing abstraction. Further abstraction is necessary to generate arbitrary accelerators.
• New challenges arise for CAD and physical design when exposed to higher levels of design abstraction and the domain's high-performance requirements. We describe some solutions to problems we see today, as well as propose areas of future research.

The paper is organized as follows: the next sections describe our Deep Learning Accelerator (DLA), the degrees of flexibility present when implementing an accelerator on an FPGA, and quantitatively analyze the benefit of customizing the accelerator for specific deep learning workloads (as opposed to a fixed-function accelerator). We then describe our current high-level design abstraction for creating the Deep Learning Accelerator, and propose higher levels of abstraction needed to enable future success in this space. Finally, we discuss details of the challenges of mapping generated designs into physical implementations, the workarounds employed, and proposals for future research.

FPGA Deep Learning Accelerator

We presented an architecture and implementation for high-performance execution of a deep learning inference accelerator [2] on an FPGA (Figure 1). An accelerator core reads input image and filter data from external memory (DDR), and stores the data in on-chip caches built of RAM blocks. Data is read from the on-chip caches and fed to a set of Processing Elements (PEs), which perform the dot product calculations (executed in parallel) comprising the bulk of the workload. A drain network collects the dot product output, which is fed to auxiliary kernels that perform the non-linear computations (such as activation, pooling, and normalization). The resulting output is either fed as input to the next convolution layer (executed sequentially), or written back to external memory.

Degrees of Flexibility

Though our original work focused on a single optimized instance of an accelerator for the AlexNet neural network topology [2], our architecture is flexible enough for high-performance acceleration of other network topologies. The design can be parameterized to create a class of neural network accelerator designs specialized for specific network topologies. In this section, we examine the degrees of flexibility available, and quantify the benefits provided by FPGA flexibility with respect to compute precision, accelerator geometry, and memory architecture.


Compute Precision

Deep learning applications often have large memory requirements, which pose a hurdle to their acceleration. Recent work has shown that reducing the accelerator's precision from full-precision floating-point can shrink the neural network model size while maintaining its required accuracy.

Fixed-point representation has been employed by NVIDIA [8], Xilinx [12], and Google's TPU* [18] for convolutional neural network (CNN) and recurrent neural network (RNN) acceleration, while Microsoft* recently announced the use of reduced-precision floating-point on an Intel® Stratix® 10 FPGA in their acceleration of Gated Recurrent Units (GRUs) [7]. We measured the impact of using reduced-precision floating-point on overall compute performance for an Intel® Stratix® 10 FPGA. Reduced precision requires fewer FPGA resources per compute unit, resulting in increased throughput (Figure 2).†

Figure 2. Normalized Peak Chip Throughput vs. Floating-Point Precision (FP32, FP16, FP11, FP8) on Intel® Stratix® 10 FPGA.

The DLA architecture leverages the FPGA platform to provide reduced precision by tuning the mantissa width (Figure 3). In addition to data width reduction, the DLA architecture uses the FPGA's bit-level granularity to improve resource usage efficiency by organizing floating-point numbers into "blocks" [2], defined as a group of FP numbers sharing a common exponent. The "block size" defines the number of mantissas sharing the common exponent. Placing a number into a block requires bit-shifting the mantissa, resulting in a loss of precision; however, this loss comes with the benefit of better resource utilization, as storage requirements shrink as more numbers are placed in a block (Figure 4). Tuning block size can effectively be used to trade off accuracy for performance.

Figure 3. Reduced-Precision Floating-Point Formats (FP16 down to FP8): Breakdown of Sign/Exponent/Mantissa.

Figure 4. Organizing Floating-Point Numbers into "Blocks" by Aligning Mantissas to a Common Exponent: 8 FP16 values require 128 bits and 8 FP11 values require 88 bits, while 8 BlockFP16 values (block size = 8) require 101 bits and 8 BlockFP11 values require 61 bits.

While many solutions have been proposed for achieving good accuracy at low precision, there is no one-precision-fits-all solution. The ability to trade off precision for performance is heavily dependent on the application and topology. Every application has its own accuracy requirements, and thus values the performance-accuracy trade-offs differently. Each topology has a different tolerance to reduced data precision. For example, SqueezeNet was designed to achieve AlexNet-level accuracy while reducing the model size by 50X [14];† so it is not surprising that it is more sensitive to accuracy losses at reduced precision than AlexNet (Figure 5). With its fine-grained, bit-level customizability, an FPGA provides the flexibility needed to tune precision and accuracy suitably for individual applications and topologies.
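To make the blocking scheme described above concrete, here is a minimal Python sketch of packing a group of values into a shared-exponent block format. The function names, the rounding behavior, and the mantissa width are illustrative assumptions for exposition, not the DLA's hardware implementation; the storage savings it models correspond to Figure 4, where eight FP16 values need 128 bits but a block-size-8 BlockFP16 group needs only 101 bits.

import math

def pack_block(values, mantissa_bits):
    """Illustrative block floating-point packing: all values in the block share
    the exponent of the largest-magnitude value, and each mantissa is shifted
    (losing low-order bits) to align to that shared exponent."""
    # Shared exponent: exponent of the largest-magnitude value in the block.
    shared_exp = max(math.frexp(v)[1] for v in values if v != 0.0) if any(values) else 0
    scale = 2 ** mantissa_bits
    mantissas = []
    for v in values:
        # Align each value to the shared exponent, then truncate the mantissa.
        aligned = v / (2.0 ** shared_exp)        # magnitude in [0.5, 1) for the largest value
        mantissas.append(int(aligned * scale))   # keep 'mantissa_bits' bits (plus sign)
    return shared_exp, mantissas

def unpack_block(shared_exp, mantissas, mantissa_bits):
    """Reconstruct approximate values from the shared-exponent block."""
    scale = 2 ** mantissa_bits
    return [(m / scale) * (2.0 ** shared_exp) for m in mantissas]

if __name__ == "__main__":
    block = [0.52, -0.031, 0.0009, 0.25, -0.125, 0.07, 0.9, -0.4]   # block size = 8
    exp, mants = pack_block(block, mantissa_bits=10)                # FP16-like mantissa
    print(exp, unpack_block(exp, mants, mantissa_bits=10))

Values much smaller than the block's largest member lose low-order mantissa bits when aligned to the shared exponent, which is exactly the accuracy-for-storage trade-off the text describes.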


To investigate the accuracy impact, different precisions and block sizes were tested on AlexNet, GoogLeNet, SqueezeNet, ResNet-18, and VGG-16 [16]. Each topology tolerates different levels of precision loss (Figures 5 and 6), further illustrating how different topologies benefit from reduced precision and larger block sizes to varying degrees. By leveraging FPGA flexibility, the DLA can be tuned to the level of precision that best suits an application.

Figure 5. Neural Network Tolerance to Reduced Precision: Normalized Top-1 Classification Accuracy vs. Precision (FP16 through FP8) for AlexNet, GoogLeNet, SqueezeNet, ResNet-18, and VGG-16, Measured Using Block Size = 8.

Figure 6. Increasing Block Size Improves Resource Usage Efficiency at the Cost of Accuracy: Normalized Top-1 Classification Accuracy vs. FP Block Size (4 to 16), Measured Using BlockFP9 Precision.

At a specific precision, there remains the challenge of implementing a PE architecture that maximizes throughput per area for that precision. The optimal PE for each precision level is dependent on the FPGA platform; digital signal processing (DSP) and logic element (LE) architecture strongly influence PE design to achieve the highest throughput in the least area. Choosing the right combination of DSPs and LEs for a given precision is a complex problem that can be facilitated by CAD tools, described in later sections. Scaling the design, once precision and PE architecture are chosen, is done through compute parallelism, which is explored next.

Accelerator Geometry and Customization

To leverage the flexibility of FPGAs, we can customize the accelerator vectorization (degree of parallelism) to suit different classes of neural networks. A change to the vectorization manifests itself as a different accelerator geometry – the number of processing elements and the widths of the data buses between them. Figure 1 shows the degrees of parallelism available in the accelerator, configurable via vectorization.

The problem size at each convolution stage is defined by the neural network topology:

• W, H, C are the input tensor width, height, and depth.
• S, R, C are the filter width, height, and depth.
• Q, P, K are the output tensor width, height, and depth.

Architectural parameters in the accelerator control the degree of parallelism employed in the core compute, which impacts both the number of compute units used and the width of datapaths (a rough illustrative sketch of these parameters follows below):

• Q_VEC, P_VEC, and K_VEC define the window of the output tensor computed in parallel.
• C_VEC defines the depth of the input tensor read and used in parallel.
• S_VEC and R_VEC are the filter width and height read and computed in parallel.
• DRAIN_VEC is the width of the drain network from the PE array. This determines the subset of K_VEC that can be extracted from the PE array in one clock cycle.
• Each auxiliary kernel has an AUX_VEC, which describes the subset of the DRAIN_VEC that is processed in one clock cycle in each of the auxiliary kernels.

For each of our marquee neural networks, we customize the architecture to maximize throughput-per-area, then analyze the geometry considerations for architectures tuned for various graphs.
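As a rough illustration of how these vectorization parameters define an accelerator geometry, the following Python sketch (ours, not part of the DLA tool flow) collects the knobs in one structure and estimates peak parallel multiply-accumulates per cycle as the product of the vectorization factors. Treat this first-order product as an assumption for intuition rather than a cycle-accurate model of the accelerator.

from dataclasses import dataclass

@dataclass
class Vectorization:
    """Accelerator geometry knobs described in the text (illustrative container)."""
    q_vec: int = 1      # output width computed in parallel
    p_vec: int = 1      # output height computed in parallel
    k_vec: int = 64     # output depth computed in parallel (number of PEs)
    c_vec: int = 8      # input depth read and used in parallel
    s_vec: int = 1      # filter width computed in parallel
    r_vec: int = 1      # filter height computed in parallel
    drain_vec: int = 8  # PE outputs drained per clock cycle

    def peak_macs_per_cycle(self) -> int:
        # Assumed first-order estimate: each of the K_VEC PEs performs
        # Q_VEC x P_VEC dot products over C_VEC x S_VEC x R_VEC inputs per cycle.
        return (self.k_vec * self.q_vec * self.p_vec *
                self.c_vec * self.s_vec * self.r_vec)

if __name__ == "__main__":
    geom = Vectorization(q_vec=1, p_vec=4, k_vec=16, c_vec=8)
    print(geom.peak_macs_per_cycle())  # 512 parallel MACs in this configuration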

a. Smaller Filters Require Faster Drains.

Each PE consists of (Q_VEC · P_VEC) dot product operations (each of which multiplies and sums C_VEC data), followed by accumulators, which keep the dot product results until the PE has iterated over the entire filter tensor, after which a single output pixel is produced. Therefore, the speed at which the PEs produce results differs depending on the ratio between the filter size and the vectorization of each PE. Between the GoogLeNet and SqueezeNet networks, SqueezeNet, on average, has smaller filters, causing convolution results to be produced more quickly, and therefore requires a higher-bandwidth drain network. Figure 7 shows the effect of increasing the drain network parallelism (DRAIN_VEC) on the AlexNet, GoogLeNet, and SqueezeNet networks. While GoogLeNet throughput saturates at a DRAIN_VEC of 8, SqueezeNet requires a DRAIN_VEC of 16 for 95% of maximum throughput.† AlexNet only needs DRAIN_VEC=2 to achieve 95% performance since its filters are the largest. Scaling DRAIN_VEC beyond these values yields diminishing returns. Supporting increased drain network parallelization requires scaling up the area of the auxiliary kernels (Table 1). With a fixed-function accelerator, a designer has to choose, a priori, between a 21% area penalty (across all networks) or a 70% performance penalty on SqueezeNet-like networks.† With a reconfigurable FPGA, the designer can instantiate the larger DRAIN_VEC only when required by the network being accelerated.

Figure 7. Effect of Drain Network Vectorization (DRAIN_VEC) on Throughput for AlexNet, GoogLeNet, and SqueezeNet, Normalized to Maximum Performance.

Table 1. Area Cost of Increasing DRAIN_VEC

DRAIN_VEC    AUX KERNELS AREA (% OF FULL SYSTEM)
2            4%
8            14%
16           25%

b. Interconnect Customization and Width Adapters Allow Building Only What is Needed.

Another degree of flexibility exists in the auxiliary kernels domain – the flexible interconnect allows building only exactly what is needed for each graph. For example, the SqueezeNet graph has no local response normalization (LRN) layers, so we can remove that kernel completely. The interconnection pattern within the interconnect is also customizable based on the order of the auxiliary operations. For example, the AlexNet graph has both MaxPool and LRN layers, but LRN always comes first; whereas the GoogLeNet graph has some layers in which MaxPool precedes LRN, so we need to support both connection patterns by adding more muxing and arbitration logic. The width of each of the auxiliary kernels can be customized separately based on how much bandwidth is required for each operation. Finally, the DLA can also leverage FPGAs enhanced with hardened interconnects – such as embedded Networks-on-Chip [1, 3] – for high-bandwidth inter-kernel communication.

c. Balance Vectorization to Minimize Quantization Inefficiencies.

In general, scaling up tensor vectorization increases throughput at the expense of area. Initially, the design was scaled by increasing K_VEC – this was relatively simple, since increasing K_VEC only entails adding more PEs. However, this scaling method saw diminishing returns, as quantization inefficiencies become more pronounced as vectorization dimensions increase.

For example, if the output depth (K) of a layer is 96 and K_VEC is 64, two complete iterations are required, so the output depth is snapped up to 128, with only 96 (75%) of the computations being useful. On the other hand, if K_VEC is 32, the output depth divides perfectly into three iterations for 100% efficiency. To mitigate this quantization effect, it is possible to balance the scaling of the design across multiple dimensions besides just K_VEC (e.g., P_VEC, Q_VEC, C_VEC). The optimal balance of vectorization depends on the graph's layer dimensions. Figure 8 demonstrates this by comparing the throughput of two architectures of similar area for multiple different graphs. As can be seen, the optimal balance of scaling between P_VEC and K_VEC varies based on the neural network topology used. With a flexible FPGA platform, a custom instance of a deep learning accelerator can be generated with vectorization parameters tuned for each application.

Figure 8. Normalized Throughput/Area on Two Architectures with Different P_VEC and K_VEC Vectorization (P_VEC=1, K_VEC=64 vs. P_VEC=4, K_VEC=16) for AlexNet, GoogLeNet, SqueezeNet, VGG-16, and ResNet-101.
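The snapping effect described above can be estimated directly: each tensor dimension is rounded up to the next multiple of its vectorization factor, and the ratio of useful to executed work gives the quantization efficiency. The short Python sketch below reproduces the K = 96 example from the text; the helper name and the extra multi-dimension example are ours.

from math import ceil

def quantization_efficiency(dims, vecs):
    """Fraction of executed work that is useful when each dimension in `dims`
    is snapped up to a multiple of the corresponding vectorization factor."""
    useful = padded = 1
    for d, v in zip(dims, vecs):
        useful *= d
        padded *= ceil(d / v) * v
    return useful / padded

# Example from the text: output depth K = 96.
print(quantization_efficiency(dims=[96], vecs=[64]))  # 96/128 = 0.75
print(quantization_efficiency(dims=[96], vecs=[32]))  # 96/96  = 1.00

# Balancing vectorization across dimensions changes the picture, e.g. an
# output width P = 13 with P_VEC = 4 combined with K = 96 and K_VEC = 16:
print(quantization_efficiency(dims=[13, 96], vecs=[4, 16]))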


Memory Architecture

With their different models and layer sizes, neural networks have a wide range of memory requirements, and creating an efficient design around these requirements poses a complex problem. To support such requirements, the DLA architecture uses both the on-chip and off-chip memory provided by the FPGA platform.

On-chip memory provides a fast means to store and access data, avoiding the relatively long latency of communicating with off-chip memory. The DLA uses on-chip memory for both filter and tensor data (Figure 1). Filter data is stored in a "filter cache" (FC) contained in each PE. While the PEs compute data, filters are pre-loaded from external memory into the filter caches for the next set of PE computations. The "stream buffer" is used to store intermediate tensors between layers on chip. If the stream buffer is too small to hold a given tensor, the DLA falls back to using external memory. Given a limited DDR bandwidth, excessive use of external memory can bottleneck performance.

The balance between using on-chip and off-chip memory creates an important design trade-off. Using more on-chip memory for the stream buffer alleviates DDR bottlenecks but consumes more chip resources that could have been used for more PEs. On the other hand, reducing the on-chip memory used in the stream buffer, in favor of more PEs, provides more compute power at the cost of relying heavily on off-chip memory. The optimal balance of resources between on-chip memory and processing elements that maximizes performance is dependent on the graph's compute and memory requirements. Finding this optimal point requires accurate design modeling as well as knowledge of the target neural network and platform.

To illustrate this trade-off, we modeled the performance of AlexNet, GoogLeNet, and ResNet-101 at different stream buffer memory and PE sizes for a given FPGA platform. Figure 9 shows how the optimal allocation of resources to memory and compute differs for each of the graphs, highlighting the benefit of having a flexible FPGA fabric capable of tailoring a design to achieve optimal performance for any application. A resource trade-off optimal for one neural network can cause as much as 40%, or more,† performance degradation in another.

Figure 9. Impact of the Stream Buffer Memory vs. Compute Trade-off: Normalized Throughput vs. Normalized On-Chip Memory Size for AlexNet, GoogLeNet, and ResNet-101.

Implications of Flexibility

In general, one size does not fit all when designing deep learning accelerators. Using a flexible accelerator on a flexible FPGA platform can improve performance by tuning architecture parameters (compute precision, accelerator geometry, and memory architecture) for a specific neural network topology. We have presented several dimensions of flexibility (by no means an exhaustive list); customization can deliver a performance benefit (or area savings) in each dimension. In aggregate, the benefit across the fully customized FPGA accelerator can outweigh the cost of the programmable fabric. Flexibility of the deep learning acceleration architecture must be matched with a flexible design-entry flow and CAD tools, to allow designers to specify optimized deep learning accelerators quickly.

High and Higher-Level Design

With the large design space of possible architecture parameterizations, it becomes infeasible for a designer to manually configure the deep learning accelerator using traditional FPGA register transfer level (RTL) design techniques. We propose an extra level of abstraction to enable automatic specification and generation of custom deep learning acceleration. The Physical Design Implications section discusses the implications of using such high-level design entry.

Abstracting Design Entry with OpenCL

To achieve the desired flexibility, our deep learning accelerator architecture is described in a generic form that can be easily modified with a set of architectural parameters. We describe our architecture using a set of pre-defined kernels, written in OpenCL, that can be modified with specific architectural parameters such as vectorization parameters, stream buffer dimensions, memory port widths, number of processing elements, and configuration of auxiliary kernels. The combination of these kernels forms a generic architecture template. Once configured, the resulting architecture is compiled and executed through the Intel FPGA SDK for OpenCL. This design flow is illustrated on the right side of Figure 10.

Figure 10. Design Flow for Neural Network Accelerators. Right side (current high-level design flow): a parameterized architecture template (e.g., Kvec = 48, Cvec = 16) is compiled through the OpenCL™ compiler and traditional FPGA CAD into an FPGA bitstream. Left side (proposed higher-level design flow): a design space exploration (DSE) step takes the data scientist's neural network model and an analytical model (e.g., T_all = f_max / (Σ_conv(N_real) + Σ_fc(N_real / S_batch))) and selects the architecture parameters.
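To illustrate how a parameterized architecture template plus an analytical model enables the design space exploration sketched on the left of Figure 10, here is a hedged Python sketch. The parameter names follow the text, but the area and performance estimators are deliberately simplistic placeholders of our own, not the analytical model of [2].

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class DlaParams:
    """Architecture-template knobs named in the text (values are examples)."""
    k_vec: int          # number of PEs (output-depth parallelism)
    c_vec: int          # input-depth parallelism
    stream_buf_kb: int  # on-chip stream buffer size
    precision: str      # e.g., "FP16", "FP11", "BlockFP9"

def estimate_area(p: DlaParams) -> float:
    # Placeholder cost model: area grows with compute parallelism and buffering.
    return p.k_vec * p.c_vec * 1.0 + p.stream_buf_kb * 0.05

def estimate_throughput(p: DlaParams, layer_tensor_sizes_kb) -> float:
    # Placeholder performance model: parallel MACs discounted by how often
    # intermediate tensors spill to DDR because the stream buffer is too small.
    spills = sum(1 for t_kb in layer_tensor_sizes_kb if t_kb > p.stream_buf_kb)
    ddr_penalty = 1.0 / (1.0 + 0.5 * spills)
    return p.k_vec * p.c_vec * ddr_penalty

def explore(layer_tensor_sizes_kb, area_budget):
    """Toy design space exploration: pick the parameter set with the best
    estimated throughput that fits within the area budget."""
    space = [DlaParams(k, c, sb, "FP11")
             for k, c, sb in product([16, 32, 48, 64], [8, 16], [256, 512, 1024])]
    feasible = [p for p in space if estimate_area(p) <= area_budget]
    return max(feasible, key=lambda p: estimate_throughput(p, layer_tensor_sizes_kb))

if __name__ == "__main__":
    tensors_kb = [300, 800, 600, 200]   # hypothetical intermediate tensor sizes
    print(explore(tensors_kb, area_budget=1100.0))

In a real flow, the placeholder estimators would be replaced by the deterministic resource and performance model described under Future Work below, and the swept space would cover all of the degrees of flexibility discussed earlier.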


High-level design tools like the Intel FPGA SDK for OpenCL compile untimed, unscheduled OpenCL language into optimized RTL implementations, then apply optimizations (pipelining, vectorization, compute unrolling) as necessary, according to predicted system performance and knowledge of the physical FPGA architecture [11]. This abstraction is invaluable when creating complex, parameterizable designs such as the deep learning accelerator, as it alleviates the need for time-consuming detailed physical design optimization and verification across a wide parameter space.

Future Work: Even Higher-Level Design

Once we have an architecture template that can accept a wide range of architectural parameters to modify its structure, we envision future users of the deep learning accelerator will rely on even higher-level design techniques to select and configure architectures. An additional design space exploration (DSE) step is envisioned to find the optimal architecture to run a given neural network model. The DSE step can be automated using traditional objective-optimization CAD techniques [15]. This would require a high-fidelity model representing the physical characteristics of an architecture and its resulting performance. Using the model, the DSE flow would efficiently search the architecture parameter space to identify the optimal configuration for a given neural network model.

Each design point (a unique instance of the deep learning accelerator) can be accurately modeled because our architecture is deterministic in nature, where all memory accesses, data movements, and compute times can be calculated a priori. The deterministic architecture and analytical model are presented in [2], where both the resource utilization of the FPGA and the final performance of the design can be calculated prior to runtime.

An illustration of this proposed future approach is shown on the left side of Figure 10. The DSE tool takes as input the neural network model, architecture template, analytical model of the architecture, and objective function (such as performance or latency). The DSE tool compiles the neural network into a graph and maps it to the architecture template core primitives. After sweeping through a wide range of architecture parameters, and generating performance estimates from the analytical model, a final set of parameters is chosen to satisfy the objective function. These parameters can then be used to configure the architecture template, and the resulting design is compiled through OpenCL or traditional FPGA compilers to generate a bitstream for the FPGA.

This higher-level design flow, where neural network models are compiled to generate custom FPGA-based accelerators, has many implications on FPGA CAD, and introduces several new challenges to resolve to ensure high performance while enabling flexibility.

Physical Design Implications

Today, state-of-the-art FPGA CAD tools provide automated placement, routing, and packing of homogeneous resources – the three foundational CAD algorithms that account for the physical aspects of the FPGA fabric. Hierarchical physical design techniques, such as floorplanning, packing of heterogeneous resources, and partitioning, are currently done manually. We studied the manual application of these hierarchical techniques to our DLA architecture, and consequently, we advocate for the automation of these techniques in future work on FPGA CAD (both high-level and traditional design). This is required to support automatic exploration of the large accelerator design spaces exposed by our higher-level design tools. Our study showed the following benefits:†

• Significantly improved quality of results (QoR), with improved scalability to larger circuits, higher performance, and lower resource utilization.
• Improved repeatability of compilation results, through the reduction of seed noise.
• Improved designer productivity on a large design via reuse of partitions and incremental compilations.

In the following sections, we describe the techniques we manually prototyped in our physical design study and applied to the DLA.

Packing of Heterogeneous Resources for Efficient Implementation of PEs

The PEs occupy the majority of the FPGA area. The first challenge is to achieve high utilization of heterogeneous resources within that area (i.e., utilize nearly all available logic elements (LEs), DSP blocks, and block memory (M20Ks)). To achieve this, each PE (or a small number of PEs) should consume a mix of LE, DSP, and M20K resources that matches the resources within the physical region where they are implemented. Alternatively, this challenge can be formulated as a packing problem: given a mix of heterogeneous resources in a small region of a device, maximize the number of PEs that can be packed into that region.

One approach employed to address this challenge utilizes multiple variants of dot products, which are the building blocks of PEs. Each dot product variant uses a different tradeoff of LE and DSP resources. Figure 11 shows the tradeoff curves for multiple variants of equivalent dot products at different precisions. We can then solve an integer linear programming problem to select a mix of dot product variants that maximizes utilization of the available resources given a region constraint. Today, the design of the different dot product variants, as well as the formulation of the linear system, is done manually at specific architecture design points. To enable the higher-level abstractions, next-generation tools should strive to automate this process – from the user description of the dot product mathematics, the tool should generate and optimize a mix of dot product variants that maximizes resource usage.
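The packing formulation described above can be viewed as a small integer optimization: choose how many instances of each dot-product variant (each with its own LE/DSP/M20K footprint) to place so that the number of dot products packed into a region is maximized without exceeding the region's resource budget. The exhaustive search below is a Python stand-in for the integer linear programming step mentioned in the text, and the variant footprints are invented numbers for illustration only.

from itertools import product

# Hypothetical dot-product variants: (LEs, DSPs, M20Ks) consumed per dot product.
VARIANTS = {
    "dsp_heavy":   (400, 16, 2),
    "balanced":    (2500, 8, 2),
    "logic_heavy": (6000, 2, 2),
}

REGION_BUDGET = (60000, 180, 60)   # (LEs, DSPs, M20Ks) available in one region

def best_mix(variants, budget, max_each=30):
    """Exhaustively choose how many of each variant to instantiate so that the
    total dot-product count is maximized without exceeding the region budget.
    (A stand-in for the ILP formulation; fine for a handful of variants.)"""
    names = list(variants)
    best = (0, None)
    for counts in product(range(max_each + 1), repeat=len(names)):
        usage = [sum(c * variants[n][r] for c, n in zip(counts, names)) for r in range(3)]
        if all(u <= b for u, b in zip(usage, budget)):
            total = sum(counts)
            if total > best[0]:
                best = (total, dict(zip(names, counts)))
    return best

if __name__ == "__main__":
    count, mix = best_mix(VARIANTS, REGION_BUDGET)
    print(count, mix)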


Figure 11. LE vs. DSP Resource Trade-off Curves for Equivalent Implementations of PEs Across Different Precisions (FP7 through FP11; DSPs used per PE vs. LEs used per PE).

Systolic Arrays for Scalable Circuits

The use of systolic arrays and mesh-like structures on FPGAs is motivated by the need to exploit the massive parallelism of modern FPGAs in a scalable manner [5], [17], [20]. Systolic arrays are circuits with a regular structure and nearest-neighbor connections, primarily used in algorithms where the same piece of data needs to be broadcast and reused in a large number of PEs. In our DLA systolic array design, the broadcast structure is designed as a linear data-forwarding pipeline in which data flows from one PE to another. This avoids a large fan-out of a wide datapath from a central location, and efficiently leverages local routing without introducing routing congestion.

In current tools, the user manually expresses the systolic array structure explicitly in their code by designing the data forwarding mechanism between the neighboring PEs. Future work could explore techniques for inferring large broadcast structures in user code, and automatically converting these structures to systolic arrays, to support the large design space of accelerators.

Floorplanning for Improving Performance of Large Circuits

Although the systolic array of PEs is a regular and repeated structure, we observe inefficiencies when compiling large systolic arrays that fill entire FPGAs. Analysis of the placement of PEs by the CAD tool shows that in some cases neighboring PEs are not placed close to each other. This results in long routes between PEs and a drop in frequency and performance.

We can address this challenge by manually creating floorplans for the systolic array, which is feasible due to its regular structure. Our floorplans comprise a number of physical regions; we then assign PEs to physical regions such that any two neighboring PEs are either in the same region or in two adjacent regions. This results in localized PE placement and mitigates the drop in performance for larger systolic arrays. We measure, on average, a 30% difference in performance between unfloorplanned designs and designs floorplanned in this fashion. A similar benefit has been shown in earlier work on floorplanning of mesh-of-functional-units overlays [6].†

Although manual floorplanning is feasible for a single architecture parameterization on a single FPGA, it is not feasible across the full architecture design space and target devices. Automating this class of manual floorplanning effort remains a key outstanding physical design challenge for us, and research efforts showing promise have already started in academia [19].

Designer Productivity: Partitioning, Reuse, and Incremental Compilation

As the sizes of FPGAs grow exponentially, the sizes of the circuits implemented on them grow at the same pace. In addition to the QoR scaling challenges, these large sizes lead to long compile times, of up to a day. This poses two significant challenges for designer productivity. The first is the development cycle, as the designer can only do a single iteration per day. The second is a reduced repeatability of results, as even a small design change can result in performance changes due to seed noise inherently present in the CAD tool.

To enable scale, we employ approaches that leverage design partitioning. The DLA is composed of multiple OpenCL kernels, each encapsulating a distinct functionality of the design. The use of design partitioning can improve designer productivity by both reducing compile time and improving repeatability. By preserving the post-place-and-route netlist of partitions that have not been modified, only the kernels the designer modified need to be recompiled. Since, in most cases, only a small part of the design needs to be recompiled, it is more likely that a single-seed compile will close timing.

When considering the large DLA architecture design space, the ideal partitioning may not be possible to define a priori, due to changing kernel size and number across all the architecture variants. Further research on automating the process of dynamic design partitioning (for an arbitrary set of interconnected kernels of varying sizes) is a key direction for future work. In addition to improving development speed and the repeatability of QoR, it would relieve designers from the burden of manual design partitioning.

Conclusion

We have shown through a case study of our deep learning accelerator that flexibility to customize an architecture is essential to create future-proof accelerators. The resulting design space requires higher-level abstractions for efficient design and exploration. We advocate for new FPGA design flows that generate custom architectures from a high-level neural network description, and improved physical design flows to infer systolic arrays, optimize for resource constraints, and automate floorplanning and partitioning. With these future enhancements, FPGAs can continue to scale, and be the accelerator of choice for deep learning applications.


References

[1] Mohamed S. Abdelfattah, Andrew Bitar, and Vaughn Betz. 2015. Take the Highway: Design for Embedded NoCs on FPGAs. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 98–107.

[2] Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. 2017. An OpenCL Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 55–64.

[3] Andrew Bitar, Mohamed S. Abdelfattah, and Vaughn Betz. 2015. Bringing Programmability to the Data Plane: Packet Processing with a NoC-Enhanced FPGA. In 2015 International Conference on Field-Programmable Technology (FPT). IEEE, 24–31.

[4] Diane M. Bryant. Keynote at Intel Developer's Forum, San Francisco. (August 2016).

[5] D. Capalija and T. S. Abdelrahman. 2013. A High-Performance Overlay Architecture for Pipelined Execution of Data Flow Graphs. In 2013 23rd International Conference on Field Programmable Logic and Applications. 1–8.

[6] D. Capalija and T. S. Abdelrahman. 2014. Tile-Based Bottom-up Compilation of Custom Mesh-of-Functional-Units FPGA Overlays. In 2014 24th International Conference on Field Programmable Logic and Applications (FPL). 1–8.

[7] Eric Chung et al. 2017. Accelerating Persistent Neural Networks at Datacenter Scale. HotChips.

[8] NVIDIA Corporation. 2017. NVIDIA TensorRT. (2017).

[9] C. Szegedy et al. 2015. Going Deeper with Convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–9.

[10] Olga Russakovsky et al. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.

[11] T. S. Czajkowski et al. 2012. From OpenCL to High-Performance Hardware on FPGAs. In 22nd International Conference on Field Programmable Logic and Applications (FPL). 531–534.

[12] Yao Fu et al. 2016. Deep Learning with INT8 Optimization on Xilinx Devices. White paper, Xilinx (2016).

[13] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.

[14] Forrest N. Iandola et al. 2016. SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size. arXiv:1602.07360 (2016).

[15] Jacopo Panerati, Donatella Sciuto, and Giovanni Beltrame. 2017. Handbook of Hardware/Software Codesign: Optimization Strategies in Design Space Exploration. Springer, Netherlands.

[16] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 (2014).

[17] Xuechao Wei et al. 2017. Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017 (DAC '17). ACM, New York, NY, USA, Article 29, 6 pages.

[18] Yonghui Wu et al. 2016. Google's Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation. arXiv:1609.08144 (2016).

[19] Xiaodong Xu, Qi Xu, Jinglei Huang, and Song Chen. 2017. An Integrated Optimization Framework for Partitioning, Scheduling and Floorplanning on Partially Dynamically Reconfigurable FPGAs. In Proceedings of the Great Lakes Symposium on VLSI 2017 (GLSVLSI '17). ACM, New York, NY, USA, 403–406.

[20] Jialiang Zhang and Jing Li. 2017. Improving the Performance of OpenCL-Based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17). ACM, New York, NY, USA, 25–34.

† Tests measure performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

© Intel Corporation. All rights reserved. Intel, the Intel logo, the Intel Inside mark and logo, Altera, Arria and Stratix words and logos are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. Intel reserves the right to make changes to any products and services at any time without notice. Intel assumes no responsibility or liability arising out of the application or use of any information, product, or service described herein except as expressly agreed to in writing by Intel. Intel customers are advised to obtain the latest version of device specifications before relying on any published information and before placing orders for products or services. Other marks and brands may be claimed as the property of others.

Please Recycle WP-01283-1.0