Generating Systolic Array Accelerators with Reusable Blocks
Theme Article: Agile and Open-Source Hardware

Liancheng Jia and Liqiang Lu, Peking University
Yun Liang, Peking University
Xuechao Wei, Alibaba, China

Abstract: Systolic array architecture is widely used in spatial hardware and is well suited to many tensor processing algorithms. Many systolic array architectures are implemented with a high-level synthesis (HLS) design flow. However, existing HLS tools do not favor modular and reusable design, which makes design iteration inefficient. In this article, we analyze the systolic array design space and identify the common structures of different systolic dataflows. We build hardware module templates using the Chisel infrastructure, which can be reused for different dataflows and computation algorithms. This remarkably improves productivity in the development and optimization of systolic architectures. We further build a systolic array generator that transforms a tensor algorithm definition into a complete systolic hardware architecture. Experiments show that we can implement systolic array designs for different applications and dataflows with little engineering effort, and that their throughput exceeds that of HLS designs.

Tensor algebra is a prevalent tool of modern computer applications and is increasingly deployed onto various embedded devices. Such a trend demands specialized hardware accelerators. Systolic array architecture, which offers high computation parallelism and data reuse through an array of processing elements (PEs), is widely adopted in accelerator designs. Google's TPU uses a systolic array for the matrix multiply unit.[1] Systolic architectures are also used in many other applications such as convolution, FFT, and matrix decomposition.

Digital Object Identifier 10.1109/MM.2020.2997611
Date of publication 26 May 2020; date of current version 30 June 2020.
July/August 2020. Published by the IEEE Computer Society. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

In general, the systolic array architecture is composed of a grid of PEs that run in parallel and buffers that transfer data between the PEs and memory. Each PE is further composed of a compute cell and registers that store and transfer intermediate data. Each PE reads operands from its neighbors and performs the calculation in the compute cell. The result and operands are transferred to the neighboring PEs at the next time step. The systolic array architecture avoids nonlocal interconnection and reduces bandwidth pressure because only the edge PEs connect to the memory, which makes it well suited to modern spatial architectures.

The systolic array architecture has three important features. First, the structure is highly modular, but the functionality of each module is complex. The same PE structure is replicated and connected regularly to build the systolic array. The complexity mainly comes from the controller for data transfer in the PEs and memory buffers. Second, there is a large design space for systolic arrays. At the PE level, the choices include the dataflow, data type, compute cell algorithm, and data vectorization. The PE-array level configuration includes the array shape, buffer reuse strategy, and number of time steps. Finally, the systolic architecture is reusable across designs. The same computation logic can be reused for different systolic dataflows of the same algorithm, and the same systolic array structure can be reused with different compute cell logic to implement various algorithms.

Prior works often use three approaches to implement systolic architectures. 1) Low-level HDL implementation.[2] The systolic array is manually implemented for a certain algorithm. This gives high performance, but the development is tedious and time-consuming. It also limits the design space that can be explored and does not scale to other algorithms. 2) High-level synthesis (HLS).[3-5] HLS allows programmers to use a software programming language such as C for hardware development. It promotes productivity, but the design is still difficult. The computation logic is tightly bound to the dataflow, making it hard to scale to different dataflows and algorithms. 3) There has been growing interest in building a domain-specific language (DSL) as a front-end of HLS. The advantage of a DSL over pure HLS is the separation of computation and dataflow definition, which was first proposed by the image processing DSL Halide.[6] Subsequent works extend Halide to generate HLS code for hardware synthesis.[7,8] With a DSL, the dataflow can be modified flexibly and with little effort, without changing the algorithm description. However, existing DSL compilers often generate unreadable HLS code, which limits the optimization opportunity. DSL+HLS solutions are often slower than manual HDL designs because they lack low-level optimizations.

In this article, we propose to generate systolic array designs that address both productivity and performance. For performance, we use RTL-level code rather than HLS generation. For productivity, we build reusable hardware component templates that can be used by different systolic designs. Our proposed generator can build different systolic arrays by combining different templates. Specifically, we use the Chisel infrastructure,[9] which enables cycle-level control as well as parameterized and modular hardware generation. We build configurable templates for the reusable components such as pipeline controllers, data feeders (DFs), and collectors. The generated components are connected together to form the complete systolic array design. The complete systolic array can be generated from the computation algorithm, the dataflow definition, and the predefined templates.

The contributions of this article can be summarized as follows.

- We extract reusable modules of systolic arrays and build parameterized and modular hardware module templates to express different functionality and configurations.
- We build a systolic array generator that generates Chisel implementations of systolic arrays from the compute cell and algorithm-level configurations.
- We evaluate our systolic array generator using several tensor applications and compare it with prior works to show its performance and productivity.

Experiments on a Xilinx VU9P FPGA show that our generated systolic architecture for GEMM achieves a frequency of 322 MHz on integer data and 264 MHz on floating-point data, which is faster than most prior works on the same device. Our generator is able to generate a complete systolic array architecture with less than 10% of the input code required by an HDL implementation, which remarkably promotes productivity.

SYSTOLIC ARRAY ARCHITECTURE

Our generator framework targets a general systolic architecture that covers most design alternatives for systolic arrays. In this section, we describe the major components of a systolic array, which fall into four categories.

- The topology of the systolic array architecture.
- The compute cell inside each PE, which contains the core algorithm.
- The dataflow configuration, including PE size, data movement direction, number of time steps, and tensor index addressing.
- The controller that controls the behavior of the systolic array, including data validity and PE execution status.

We illustrate the building process of the systolic array with the example of a matrix multiplication algorithm, but it can be extended to other systolic algorithms.

Systolic Architecture Topology

Figure 1(a) shows a typical topology of a systolic array. The major part is the 2-D grid of PEs that are connected locally. The PEs on the edge connect to DFs and data collectors (DCs), which contain SRAM buffers. DFs send data from memory to the PEs, and DCs receive data from the PEs. The first DF/DC modules connect to off-chip memory. Since the bandwidth between off-chip memory and the systolic accelerator is often limited, input data is reused multiple times in the DFs. The reuse of data can be represented with loop tiling. The innermost loop nest is directly mapped onto the systolic array execution, and the outer loop nest uses the buffers in the DF modules for data reuse.

Figure 1. Systolic array and PE structure for matrix multiplication. (a) Systolic array topology. (b) Structural parameters for GEMM.

PE Compute Cell Design

The core component of a systolic array is the compute cell inside each PE, which performs the actual computation. The compute cell can be configured with an arithmetic function, a data type, and a data vectorization degree, which enables SIMD processing of vectorized data. Additionally, the compute cell can use external IPs provided by third parties, such as the floating-point IP for Xilinx FPGAs.

Systolic Dataflow

The systolic dataflow is the mapping between the computation loop nest and the systolic array execution. The execution inside the array can be represented as a dataflow tuple $(x, y, t)$, which denotes the calculation at index $(x, y)$ of the PE array at cycle $t$. The mapping process is a linear transformation, and previous works have extensively studied the mapping algorithm.[3] We use an open-source systolic dataflow compiler[10] as the front-end of our systolic generator. It generates the dataflow tuple from the loop definition and the linear transformation function. For matrix multiplication, there exist two well-known dataflows: output stationary (OS) and weight stationary (WS). The multiplication of an $M \times K$ matrix $A$ and a $K \times N$ matrix $B$ can be defined as

$$C[m, n] \mathrel{+}= A[m, k] \times B[k, n] \tag{1}$$

where $0 \le m < M$, $0 \le n < N$, and $0 \le k < K$.

Figure 2. Implementation of systolic PE based on dataflow templates.
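As a rough illustration of how a linear transformation turns the loop nest of Eq. (1) into dataflow tuples (x, y, t), the following Python sketch may help. The two matrices are common textbook choices for output-stationary and weight-stationary mappings and are an assumption here, as are the names `OUTPUT_STATIONARY`, `WEIGHT_STATIONARY`, `to_tuple`, `build_schedule`, and `run`; the dataflow compiler used in the article may produce different transformations.

```python
# Hypothetical space-time matrices (an assumption, not taken from the article):
# each row maps the loop index (m, n, k) to one component of (x, y, t).
OUTPUT_STATIONARY = ((1, 0, 0),   # x = m
                     (0, 1, 0),   # y = n
                     (1, 1, 1))   # t = m + n + k
WEIGHT_STATIONARY = ((0, 0, 1),   # x = k
                     (0, 1, 0),   # y = n
                     (1, 1, 1))   # t = m + n + k


def to_tuple(transform, index):
    """Apply the linear transformation to one loop index (m, n, k)."""
    return tuple(sum(c * i for c, i in zip(row, index)) for row in transform)


def build_schedule(M, N, K, transform):
    """Map every iteration of Eq. (1) to its dataflow tuple (x, y, t)."""
    schedule = {}
    for m in range(M):
        for n in range(N):
            for k in range(K):
                key = to_tuple(transform, (m, n, k))
                # A valid mapping never gives one PE two iterations in a cycle.
                if key in schedule:
                    raise ValueError(f"conflict at {key}")
                schedule[key] = (m, n, k)
    return schedule


def run(A, B, transform):
    """Replay the schedule in cycle order and accumulate C[m][n]."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    schedule = build_schedule(M, N, K, transform)
    for key in sorted(schedule, key=lambda xyt: xyt[2]):
        m, n, k = schedule[key]
        C[m][n] += A[m][k] * B[k][n]
    return C
```

With these assumed matrices, the OS mapping pins each C[m][n] to PE (m, n), while the WS mapping pins each B[k][n] to PE (k, n). Both replay the same multiply-accumulate operations and differ only in where and when each one executes.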