
Theme Article: Agile and Open-Source Hardware

Generating Systolic Array Accelerators With Reusable Blocks

Liancheng Jia and Liqiang Lu, Peking University
Yun Liang, Peking University
Xuechao Wei, Alibaba, China

Abstract—Systolic array architecture is widely used in spatial hardware and well suited for many tensor processing applications. Many systolic array architectures are implemented with a high-level synthesis (HLS) design flow. However, existing HLS tools do not favor modular and reusable design, which makes design iteration inefficient. In this article, we analyze the systolic array design space and identify the common structures of different systolic dataflows. We build hardware module templates using the Chisel infrastructure, which can be reused across different dataflows and computation algorithms. This remarkably improves the productivity of the development and optimization of systolic architectures. We further build a systolic array generator that transforms a tensor definition into a complete systolic hardware architecture. Experiments show that we can implement systolic array designs for different applications and dataflows with little engineering effort, and that their throughput outperforms HLS designs.

TENSOR ALGEBRA IS a prevalent tool of modern computer applications and is increasingly deployed onto various embedded devices. Such a trend demands specialized hardware accelerators. Systolic array architectures, which feature high computation parallelism and data reusability using an array of processing elements (PEs), are widely adopted in accelerator designs. Google's TPU uses a systolic array for its matrix multiply unit.1 Systolic architectures are also used in many other applications like convolution, FFT, and matrix decomposition.


In general, the systolic array architecture is composed of a grid of PEs that run in parallel and buffers that transfer data between PEs and memory; each PE is further composed of a compute cell and registers that store and transfer intermediate data. Each PE reads operands from its neighbors and performs the calculation in the compute cell. The result and operands are transferred to the neighbor PEs at the next time step. The systolic array architecture avoids nonlocal interconnection and reduces bandwidth pressure because only the edge PEs connect to memory, which makes it friendly for modern spatial architectures.

The systolic array architecture has three important features. First, the structure is highly modular, but the functionality of each module is complex. The same PE structure is replicated and connected regularly to build the systolic array. The complexity mainly comes from the controllers for data transfer in PEs and memory buffers. Second, there is a large design space for systolic arrays. At the PE level, there are different dataflows, data types, compute cell algorithms, data vectorization degrees, etc. The PE-array level configuration includes array shape, buffer reuse strategy, number of time steps, etc. Finally, the systolic architecture is reusable across designs. The same computation logic can be reused for different systolic dataflows of the same algorithm, and the same systolic array structure can be reused with different compute cell logics to implement various algorithms.

Prior works typically take one of three approaches to implement systolic architectures. 1) Low-level HDL implementation.2 The systolic array is manually implemented for a certain algorithm. This gives high performance, but the development is tedious and time-consuming. Meanwhile, it limits the design space that can be explored and cannot scale to other algorithms. 2) High-level synthesis (HLS).3-5 HLS allows programmers to use a software programming language such as C for hardware development. It promotes productivity, but the design is still difficult. The computation logic is tightly bound with the dataflow, making it hard to scale to different dataflows and algorithms. 3) Domain-specific languages (DSLs). There has been growing interest in building a DSL as a front-end of HLS. The advantage of a DSL over pure HLS is the separation of computation and dataflow definition in DSL programming, which was first proposed by the image processing DSL Halide.6 Following works extend Halide to generate HLS code for hardware synthesis.7,8 By using DSLs, the dataflow can be flexibly modified with little effort and without changing the algorithm description. However, existing DSL compilers often generate unreadable HLS code, which limits the optimization opportunity. The performance of DSL+HLS solutions is often lower than manual HDL designs because of the lack of low-level optimizations.

In this article, we propose to generate systolic array designs that consider both productivity and performance. For performance, we choose to use RTL-level code and avoid HLS generation. For productivity, we build reusable hardware component templates that can be used by different systolic designs. Our proposed generator can build different systolic arrays by combining different templates. Specifically, we use the Chisel infrastructure,9 which enables cycle-level control as well as parameterized and modular hardware generation. We build configurable templates for the reusable components such as pipeline controllers, data feeders (DFs), and collectors. The generated components are connected together to form the complete design of the systolic array. The complete systolic array can be generated from the computation algorithm, the dataflow definition, and the predefined templates.

The contribution of this article can be summarized as follows.

- We extract reusable modules of systolic arrays and build parameterized and modular hardware module templates to express different functionality and configurations.

- We build a systolic array generator to generate Chisel implementations of systolic arrays from the compute cell and algorithm-level configurations.

- We evaluate our systolic array generator using several tensor applications and compare it with prior works to show its performance and productivity.

- Experiments on a Xilinx VU9P FPGA show that our generated systolic architecture for GEMM achieves a frequency of 322 MHz on integer data and 264 MHz on floating-point data, which is faster than most prior works on the same device. Our generator is able to generate a complete systolic array architecture with less than 10% of the input code compared to HDL, which remarkably promotes productivity.

SYSTOLIC ARRAY ARCHITECTURE

Our generator framework targets a general systolic architecture that covers most design alternatives for systolic arrays. In this section, we show the major components of a systolic array in four categories.

- The topology of the systolic array architecture.
- The compute cell inside each PE that contains the core algorithm.
- The dataflow configurations, including PE array size, data movement direction, number of time steps, and tensor index addressing.
- The controller that controls the behavior of the systolic array, including data validity and PE execution status.

We illustrate the building process of the systolic array with the example of a matrix multiplication algorithm, but it can be extended to other systolic algorithms.

Systolic Architecture Topology

Figure 1(a) shows a typical topology of a systolic array. The major part is the 2-D grid of PEs that connect locally. The PEs on the edge connect to DFs and data collectors (DCs), which contain SRAM buffers. DFs send data from memory to PEs, and DCs receive data from PEs. The first DF/DC modules connect to off-chip memory. Since the bandwidth between off-chip memory and the systolic accelerator is often limited, input data is reused multiple times in DFs. The reuse of data can be represented with loop tiling. The innermost loop nest is directly mapped onto the systolic array execution, and the outer loop nest uses buffers in the DF modules for data reuse.

Figure 1. Systolic array and PE structure for matrix multiplication. (a) Systolic array topology. (b) Structural parameters for GEMM.

PE Compute Cell Design

The core component of a systolic array is the compute cell inside each PE that performs the actual computation. The compute cell can be configured with an arithmetic function, a data type, and a data vectorization degree, which enables SIMD processing of vectorized data. Additionally, the compute cell can use external IPs provided by third parties, such as the floating-point IP for Xilinx FPGAs.

Systolic Dataflow

The systolic dataflow is the mapping between the computation loop nest and the systolic array execution. The execution inside the array can be represented as a dataflow tuple (x, y, t), which means the calculation at index (x, y) of the PE array at cycle t. The mapping process is a linear transformation, and previous works have extensively studied the mapping algorithm.3 We use an open-source systolic dataflow compiler10 as the front-end of our systolic generator. It generates the dataflow tuple from the loop definition and a linear transformation function. For matrix multiplication, there exist two well-known dataflows: output stationary (OS) and weight stationary (WS). The matrix multiplication of an M x N matrix and an N x K matrix can be defined as

    C[m, n] += A[m, k] * B[k, n]    (1)

where 0 <= m < M, 0 <= k < N, and 0 <= n < K.
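To make the mapping concrete, the following plain-Scala sketch shows how a dataflow might be described as a linear map from the GEMM loop indices (m, n, k) to the dataflow tuple (x, y, t). The names and the particular skewed transformation for the OS dataflow are illustrative assumptions for this example, not the actual interface or output of the dataflow compiler.

    object DataflowSketch {
      // (x, y): PE coordinates; t: the cycle at which that PE processes the index.
      final case class Dataflow(name: String, map: (Int, Int, Int) => (Int, Int, Int))

      // Output-stationary GEMM: PE (m, n) accumulates C[m][n], and the operands are
      // skewed in time so that A[m][k] and B[k][n] meet at cycle m + n + k. This is
      // one common linear transformation, shown here purely for illustration.
      val osGemm = Dataflow("OS", (m, n, k) => (m, n, m + n + k))

      def main(args: Array[String]): Unit = {
        // The k = 3 step of C[1][2] maps to PE (1, 2) at cycle 1 + 2 + 3 = 6.
        println(osGemm.map(1, 2, 3))
      }
    }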

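The compute cell described in the "PE Compute Cell Design" subsection above can be pictured with a minimal Chisel sketch such as the following: a multiply-accumulate cell parameterized by data width and vectorization degree. The module name, ports, and simplified accumulator width are assumptions made for illustration; a floating-point variant would wrap a vendor IP as a BlackBox instead.

    import chisel3._

    // Hypothetical SIMD multiply-accumulate compute cell: multiplies vecLen operand
    // pairs, reduces them, and adds the incoming partial sum. The accumulator width
    // is simplified to 2x the operand width for brevity.
    class MacCell(val width: Int, val vecLen: Int) extends Module {
      val io = IO(new Bundle {
        val a    = Input(Vec(vecLen, SInt(width.W)))
        val b    = Input(Vec(vecLen, SInt(width.W)))
        val cIn  = Input(SInt((2 * width).W))
        val cOut = Output(SInt((2 * width).W))
      })
      // Element-wise products of the vectorized operands, then reduce and accumulate.
      val products = io.a.zip(io.b).map { case (x, y) => x * y }
      io.cOut := io.cIn + products.reduce(_ + _)
    }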

Figure 2. Implementation of systolic PE based on dataflow templates. (a) PE for OS dataflow. (b) PE for WS dataflow. (c) PE and controller templates. (d) Unified PE implementation.

From the mapping between the loop indexes and the dataflow tuple, the structural parameters, such as the array size, the direction of data movement, the number of time steps, and the tensor indexes that are read and written at each cycle, can also be generated by the compiler. Figure 1(b) shows some of the structural parameters for both dataflows, including array size, time steps, and data dependence. The data dependence is expressed with a tuple d_x = (p, q, r) for each tensor element x, where p, q stand for the spatial dependence and r is the temporal dependence. For the OS dataflow, each C element depends on the result of the same PE at the previous cycle, so d_C = (0, 0, 1). Elements of matrix A receive data from the PE on the left, and B receives data from the upper PE. They do not depend on the result, so there is no temporal dependence: d_A = (0, 1, 0) and d_B = (1, 0, 0). For the WS dataflow, matrix B stays in the PE and matrix C flows vertically. Finally, the compiler also generates the tensor index that each PE executes at every cycle, which affects the DF and collector addressing.

Systolic Array Controller

The data movement schemes between systolic array PEs are different for each tensor element, and they are controlled separately. For example, the systolic tensors transfer to adjacent PEs every cycle, but the stationary tensors stay in the PE during the calculation pipeline and use double buffering to transfer data simultaneously. There are complex data dependencies in the PE execution pipeline, and a subtle cycle-level controller is required to perfectly overlap the pipelines in order to maximize the throughput. The controller should determine whether a data element should stay inside a particular PE or transfer to its neighbor PE. It also controls the execution status of each PE at every cycle. The controller requires careful design and optimization, and automatically generated HLS code often fails to perform well here.

GENERATING SYSTOLIC ARRAYS WITH REUSABLE COMPONENTS

The design space of systolic arrays involves a complex set of configurations. The configurations are divided into two categories indicating whether each of them can be parameterized. The parameterizable configurations can be directly implemented as module class parameters, which is naturally supported by Chisel. There remain two important parts that are not parameterizable: the compute cell and the PE controller. The compute cell algorithm must be manually implemented because it often relies on external IPs or a customized algorithm implementation. The controllers of different dataflows might vary a lot, which brings great complexity for implementation and optimization.
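As a rough illustration of the parameterizable part of this design space, the sketch below gathers such configurations into a plain case class and passes it to a Chisel module as a constructor parameter. The class name, field list, and the trivial module body are hypothetical; they only show the mechanism, not the generator's real templates.

    import chisel3._

    // Hypothetical bundle of parameterizable configurations (names illustrative).
    case class SysArrayConfig(
      rows: Int,      // PE array shape
      cols: Int,
      dataWidth: Int, // operand width in bits
      vecLen: Int,    // data vectorization degree
      timeSteps: Int  // number of time steps of the innermost loop nest
    )

    // A Chisel template can take the configuration as a class parameter, so one
    // source description yields many differently sized arrays.
    class PEArrayShell(cfg: SysArrayConfig) extends Module {
      val io = IO(new Bundle {
        val in  = Input(Vec(cfg.rows, UInt(cfg.dataWidth.W)))
        val out = Output(Vec(cfg.cols, UInt(cfg.dataWidth.W)))
      })
      // The real generator would instantiate a rows x cols grid of PEs here;
      // this shell only shows how the parameters propagate.
      io.out := VecInit(Seq.tabulate(cfg.cols)(i => io.in(i % cfg.rows)))
    }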

Decoupled Generation of Controller and PE Structure

Figure 2(a) and (b) shows the PE diagrams for the OS and WS dataflows in GEMM computation. The I/O ports of the two PEs and compute cells are identical, but they are connected with different controller modules, and the controller modules are related to each operand's data dependence. This enables the decoupled generation of pipeline controllers and the other systolic array components, since the pipeline controllers are independent of the application and the PE structure is independent of the systolic dataflow.

The data dependences of tensors in the systolic array can be divided into four types based on whether the spatial part and the temporal part of the dependence tuple are equal to zero. If the spatial part is zero, the tensor element stays inside the PE during each pipeline stage; otherwise, it flows through PEs. If the temporal part is zero, the calculation does not generate a partial result for the element to be used at the next time step, and vice versa.
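The four-way classification can be written down directly from the dependence tuple, as in the plain-Scala sketch below. The template numbering and one-line descriptions are inferred from the discussion of templates (1)-(4) that follows and are meant as an illustration rather than the exact template set.

    object ControllerSelect {
      final case class Dep(p: Int, q: Int, r: Int) {
        def spatialZero: Boolean  = p == 0 && q == 0
        def temporalZero: Boolean = r == 0
      }

      def controllerTemplate(d: Dep): String = (d.spatialZero, d.temporalZero) match {
        case (false, true)  => "(1) flows through PEs, no partial result"
        case (false, false) => "(2) flows through PEs, carries a partial result"
        case (true,  true)  => "(3) stays inside the PE, no partial result"
        case (true,  false) => "(4) stays inside the PE, accumulates a partial result"
      }

      def main(args: Array[String]): Unit = {
        // Dependence tuples of OS GEMM from the "Systolic Dataflow" section:
        // A = (0, 1, 0) and B = (1, 0, 0) select template (1); C = (0, 0, 1) selects (4).
        Seq(Dep(0, 1, 0), Dep(1, 0, 0), Dep(0, 0, 1)).foreach(d => println(controllerTemplate(d)))
      }
    }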
We implement four templates of PE pipeline controllers corresponding to these dependence types. The circuit diagrams and pseudocode are presented in Figure 2(c). The controllers for systolic data movement, (1) and (2), are straightforward because the tensor elements always transfer to the neighbor PEs every cycle. For stationary data, (3) and (4), the controller becomes complex. One register is used to transfer data with the compute cell (reg_s), and the other is used to transfer data with adjacent PEs (reg_f). The two data transfers are concurrent: while the computation pipeline is executing, reg_s only transfers data with the compute cell and reg_f transfers data with other PEs at the same time. When the pipeline finishes, the two registers communicate to update their data for the next computation pipeline.

Since the dataflow compiler can automatically generate the dependence tuples, no extra manual effort is required to choose the pipeline controllers for each tensor. According to the dependence tuples in the "Systolic Dataflow" section, the PE of the OS dataflow contains two modules of type (1) and one module of type (4). The PE of the WS dataflow contains one (1), one (2), and one (3).

The complete PE architecture can be built by connecting the controllers for each tensor with the PE I/O ports and the compute cell ports. Both the compute cell and PE ports are identical for the same application regardless of dataflow. Figure 2(d) shows the pseudocode of the PE module structure of the generator. First, the generator instantiates a compute cell, and then it chooses the controller for each tensor according to the dataflow expressed in the dependence tuple. Finally, the inner modules are connected together to form the complete PE.

Reusable Design of Other Components

Similar to the modular design of the controller inside each PE, we also build DF/DC module templates according to the four data dependence types. DFs use double-buffered RAM for concurrent data transactions with the PEs and off-chip memory. We also define new parameters for the DF class to express different data reuse methodologies for loop tiling. There is no data reuse in DC modules, and buffers are used for data serialization since the off-chip memory has low bandwidth. We use the functional programming features provided by Chisel to implement the data addressing for both on-chip and off-chip memory in the DF/DC modules. We define a function f(x, t) which indicates the address that the xth DF/DC module reads or writes at cycle t. The function f is generated by the dataflow compiler and is used as a parameter of the DF/DC templates. In this way, our generator can support arbitrary tensor accessing schemes.

Complete Generation of Systolic Array

After the PE architecture and on-chip memory buffers are generated, we still need a controller at the top level to maintain the data synchronization of each operand. It connects the on-chip buffers with the edge PEs and determines whether each PE's input data is valid at each cycle. For example, the WS dataflow requires one operand to stay inside all PEs before the other operand is pushed into the PE array, and the controller maintains this dependence relationship.
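Figure 2(d) is only described as pseudocode in the article; the Chisel-style sketch below shows the general shape such a PE assembly might take: instantiate a compute cell, create one controller per tensor, and wire the inner modules to the PE ports. All module and port names are hypothetical, the controller internals are reduced to a single register, and every tensor gets the same placeholder controller, so this illustrates the assembly pattern rather than the actual implementation.

    import chisel3._

    // Hypothetical dependence record, used only to pick a controller template.
    final case class DepTuple(p: Int, q: Int, r: Int) {
      def stationary: Boolean = p == 0 && q == 0
    }

    // Placeholder single-tensor pipeline controller: registers the incoming value,
    // forwards it to the next PE, and exposes it to the compute cell. The real
    // templates add the reg_s/reg_f pair and accumulation handling described above.
    class TensorCtrl(width: Int) extends Module {
      val io = IO(new Bundle {
        val fromPrev = Input(UInt(width.W))
        val toNext   = Output(UInt(width.W))
        val toCell   = Output(UInt(width.W))
      })
      val r = RegNext(io.fromPrev, 0.U(width.W))
      io.toNext := r
      io.toCell := r
    }

    // Placeholder compute cell: one multiply-accumulate, truncated to `width` bits.
    class Cell(width: Int) extends Module {
      val io = IO(new Bundle {
        val a    = Input(UInt(width.W))
        val b    = Input(UInt(width.W))
        val cIn  = Input(UInt(width.W))
        val cOut = Output(UInt(width.W))
      })
      io.cOut := (io.cIn + io.a * io.b)(width - 1, 0)
    }

    // PE assembly in the spirit of Figure 2(d): instantiate the compute cell, pick a
    // controller per tensor from its dependence tuple, and connect the inner modules.
    class PE(width: Int, deps: Map[String, DepTuple]) extends Module {
      require(deps.keySet == Set("A", "B", "C"))
      val io = IO(new Bundle {
        val aIn  = Input(UInt(width.W))
        val aOut = Output(UInt(width.W))
        val bIn  = Input(UInt(width.W))
        val bOut = Output(UInt(width.W))
        val cIn  = Input(UInt(width.W))
        val cOut = Output(UInt(width.W))
      })
      // A full generator would select among the four controller templates using
      // deps(name); here every tensor gets the same placeholder controller.
      val cell  = Module(new Cell(width))
      val ctrlA = Module(new TensorCtrl(width))
      val ctrlB = Module(new TensorCtrl(width))
      ctrlA.io.fromPrev := io.aIn
      ctrlB.io.fromPrev := io.bIn
      io.aOut := ctrlA.io.toNext
      io.bOut := ctrlB.io.toNext
      cell.io.a   := ctrlA.io.toCell
      cell.io.b   := ctrlB.io.toCell
      cell.io.cIn := io.cIn
      io.cOut := RegNext(cell.io.cOut, 0.U(width.W))
    }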

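The address function f(x, t) lends itself to Chisel's functional style: the compiler-generated function can be passed to a data feeder template as an ordinary Scala parameter. The sketch below assumes a hypothetical DataFeeder interface with a single RAM and no double buffering, purely to show how f(x, t) becomes a module parameter.

    import chisel3._
    import chisel3.util._

    // Hypothetical data feeder template. `addr` stands in for the compiler-generated
    // f(x, t): given this feeder's index x and the current cycle t, it returns the
    // buffer address to read. Double buffering and tiling control are omitted.
    class DataFeeder(x: Int, width: Int, depth: Int,
                     addr: (Int, UInt) => UInt) extends Module {
      val io = IO(new Bundle {
        val wrEn   = Input(Bool())
        val wrAddr = Input(UInt(log2Ceil(depth).W))
        val wrData = Input(UInt(width.W))
        val toPE   = Output(UInt(width.W))
      })
      val mem   = SyncReadMem(depth, UInt(width.W))
      val cycle = RegInit(0.U(log2Ceil(depth).W))
      cycle := cycle + 1.U
      when(io.wrEn) { mem.write(io.wrAddr, io.wrData) }
      // The address read by the xth feeder at cycle t is f(x, t).
      io.toPE := mem.read(addr(x, cycle))
    }

    // Example instantiation (inside another Module) with a toy affine address
    // function f(x, t) = t + x, shown only for illustration:
    //   val df0 = Module(new DataFeeder(0, 16, 256, (x, t) => t + x.U))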

Finally, we integrate the generation process of the systolic array into a complete compilation flow. A user needs to specify the loop definition (with tiling) and the compute cell definition. The generation flow first uses the dataflow compiler to generate the dataflow-related configurations. The compiler generates different dataflows by using different transformation functions. Then, the compiler uses the configurations to generate the PE controllers, the PE structure, and the DF/DC modules, as well as the top-level controller. Connection wires are linked between them according to the data movement directions to generate the complete implementation of the systolic array. Chisel's toolchain compiles it into Verilog code for FPGA synthesis.

EVALUATION

Experimental Setup

We evaluate the performance and programming efficiency of our systolic generator with GEMM and other tensor applications, and compare the results for GEMM with several existing HLS-based works.3,7,8 The systolic array designs are synthesized and implemented on a Xilinx VU9P FPGA platform with Xilinx Vivado 2018.2. For floating-point multiplication, we use Xilinx's Floating-Point IP and integrate it into the Chisel implementation as a BlackBox module.

Figure 3. Experiment results. (a) GEMM performance and resource utilization compared with prior works. (b) Performance and resource comparison of data types. (c) Evaluation of different applications.

FPGA Performance Comparison for GEMM

We compare the performance and resource utilization of our generated systolic array architecture with several state-of-the-art systolic array implementations in Figure 3. We use the problem size of M = N = K = 256. On the Xilinx VU9P platform, our implementation with the floating-point data type achieves 677.3 GOp/s, which is a 1.2x and 2.7x speedup compared with the work by Cong and Wang3 and Lai et al.,7 respectively. The improvement comes from two aspects. First, our implementation uses Chisel's ready-valid interface for data communication, which avoids the unified HLS programming interface that leads to extra data dependences, as well as the complex finite state machines generated by the HLS compiler. Second, we optimize frequency by checking the critical path and locating the Chisel code that causes the timing delay. We optimize our design by simplifying combinational logic, inserting registers, and removing unnecessary dependences between registers. Consequently, our systolic array design is highly optimized at the RTL level.

The manually optimized HDL solution2 still has the best performance at 800 GOp/s. High-performance HLS code requires manual loop transformations such as loop flattening, loop perfectization, etc. These optimizations are complex at the HLS level, but straightforward at the HDL level. Although the DSL+HLS solutions have high development productivity, these frameworks lack the low-level optimizations and only show limited performance.

Performance Comparison of Different Dataflows and Data Types

We evaluate the performance of different dataflows and data types for GEMM.

Dataflow. We do not observe great performance variation between the OS and WS dataflows, and their frequencies are similar when the other configurations are the same. However, the WS dataflow is not as flexible as OS because the reduction size is fixed to the size of the PE array.

Data Type. We evaluate the DSP usage and frequency of four different data types: INT8, INT16, INT32, and FP32. The result is shown in Figure 3(b). The 8-bit calculation has the best frequency, and it uses LUTs instead of DSPs for multiplication. Floating point has the lowest frequency because it consumes the most LUT and FF resources, which affects FPGA place-and-route.

Evaluation of Generation Productivity

To evaluate the productivity of systolic array generation, we choose three systolic array applications and compare the lines of code (LOC) of our proposed systolic generator and manual HDL design. Dynamic GEMM refers to GEMM with dynamic bit width,11 and SW refers to the Smith-Waterman algorithm.12 Since SW uses a one-dimensional systolic array for only a 2-level loop, we do not evaluate its performance here. Our proposed generator only requires 0.03x to 0.1x the LOC of the generated HDL, which proves the productivity of our proposed design.

CONCLUSION

We propose a high-performance and reusable systolic array generator. We extract reusable hardware components that appear in different dataflows and applications. We decouple the generation of the complex pipeline controller from the PE and compute cell structure, and build parameterized hardware templates for each of them. With the integration of a systolic dataflow compiler, our generator uses the algorithm definitions and compute cells to instantiate the hardware templates and connect them together to build the complete systolic architecture. Experiments show that our proposed generator achieves better performance than existing HLS-based solutions with a 1.2x speedup, and efficiently simplifies the generation process of systolic array architectures.

ACKNOWLEDGMENTS

This work was supported in part by the Beijing Natural Science Foundation (No. JQ19014, L172004) and in part by the Beijing Academy of Artificial Intelligence (BAAI).

REFERENCES

1. N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1-12.
2. D. J. M. Moss et al., "A customizable matrix multiplication framework for the Intel HARPv2 Xeon+FPGA platform: A deep learning case study," in Proc. Int. Symp. Field-Program. Gate Arrays, 2018, pp. 107-116.
3. J. Cong and J. Wang, "PolySA: Polyhedral-based systolic array auto-compilation," in Proc. Int. Conf. Comput.-Aided Des., 2018, pp. 1-8.
4. X. Wei et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proc. 54th ACM/EDAC/IEEE Des. Autom. Conf., 2017, pp. 1-6.
5. X. Wei, Y. Liang, and J. Cong, "Overcoming data transfer bottlenecks in FPGA-based DNN accelerators via layer conscious memory management," in Proc. 56th Annu. Des. Autom. Conf., 2019, Art. no. 125.
6. J. Ragan-Kelley et al., "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," in Proc. 34th ACM SIGPLAN Conf. Program. Lang. Des. Implementation, 2013, pp. 519-530.
7. Y. Lai et al., "HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing," in Proc. Int. Symp. Field-Program. Gate Arrays, 2019, pp. 242-251.
8. N. K. Srivastava et al., "T2S-Tensor: Productively generating high-performance spatial hardware for dense tensor computations," in Proc. 27th Annu. Int. Symp. Field-Program. Custom Comput. Mach., 2019, pp. 181-189.
9. J. Bachrach et al., "Chisel: Constructing hardware in a Scala embedded language," in Proc. DAC Des. Autom. Conf., 2012, pp. 1212-1221.
10. H. Genc, "A DSL for systolic arrays," 2018. [Online]. Available: https://github.com/hngenc/systolic-array
11. H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. 45th Annu. Int. Symp. Comput. Archit., 2018, pp. 764-775.
12. C. W. Yu, K. Kwong, K.-H. Lee, and P. H. W. Leong, "A Smith-Waterman systolic cell," in Proc. Int. Conf. Field Program. Logic Appl., 2003, pp. 375-384.


Liancheng Jia is working toward the Ph.D. degree with the Center for Energy-Efficient Computing and Applications, Peking University. His current research interests include high-performance computation in GPUs and embedded systems. Jia received the B.S. degree in computer science from Peking University. Contact him at [email protected].

Yun (Eric) Liang is currently an Associate Professor (with tenure) with the School of Electrical Engineering and Computer Science, Peking University. His research focuses on heterogeneous computing (GPUs, FPGAs, ASICs) for emerging applications such as AI and big data, compilation techniques, programming models and program analysis, and embedded system design. He has authored more than 80 scientific publications in premier international journals and conferences in related domains. His research has been recognized by best paper awards at FCCM 2011 and ICCAD 2017 and best paper nominations at PPoPP 2019, DAC 2017, ASPDAC 2016, DAC 2012, FPT 2011, and CODES+ISSS 2008. He serves as an Associate Editor for the ACM Transactions on Embedded Computing Systems and the Embedded Systems Letters, and serves on the program committees of premier conferences in the related domain, including HPCA, MICRO, DAC, ASPLOS, PACT, PPoPP, CGO, ICCAD, ICS, FPGA, and FCCM. He is the corresponding author of this article. Contact him at [email protected].

Liqiang Lu is currently working toward the Ph.D. degree with the School of Electrical Engineering and Computer Science, Peking University. His research focuses on algorithm-level and architecture-level optimizations of FPGAs for machine learning applications. Lu received the B.S. degree from the Institute of Microelectronics, Peking University, in 2017. Contact him at [email protected].

Xuechao Wei is currently a Software Engineer with Alibaba, Hangzhou, China. His research focuses on computer architecture, accelerator design, and synthesis. Wei received the Ph.D. degree from the Center for Energy-Efficient Computing and Applications, Peking University. Contact him at [email protected].
