BINARY RECURRENT UNIT:

USING FPGA HARDWARE TO ACCELERATE INFERENCE

IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Thomas C. Mealey

UNIVERSITY OF DAYTON

Dayton, Ohio

May, 2018

BINARY RECURRENT UNIT: USING FPGA HARDWARE TO ACCELERATE

INFERENCE IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Name: Mealey, Thomas C.

APPROVED BY:

Tarek Taha, Ph.D.
Advisor Committee Chairman
Associate Professor, Electrical and Computer Engineering

Vijayan Asari, Ph.D.
Committee Member
Professor, Electrical and Computer Engineering

Eric Balster, Ph.D.
Committee Member
Associate Professor, Electrical and Computer Engineering

Robert J. Wilkens, Ph.D., P.E.
Associate Dean for Research & Innovation, Professor
School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering

© Copyright by

Thomas C. Mealey

All rights reserved

2018

ABSTRACT

BINARY RECURRENT UNIT: USING FPGA HARDWARE TO ACCELERATE

INFERENCE IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Name: Mealey, Thomas C.
University of Dayton

Advisor: Dr. Tarek Taha

Long Short-Term Memory (LSTM) is a powerful neural network algorithm that has been shown to provide state-of-the-art performance in various sequence learning tasks, including natural language processing, video classification, and speech recognition. Once an LSTM model has been trained on a dataset, the utility it provides comes from its ability to then infer information from completely new data. Due to the high complexity of LSTM models, the so-called inference stage of LSTM can require significant computing power and memory resources in order to keep up with a real-time workload. Many approaches have been taken to accelerate inference, from offloading computations to GPU or other specialized hardware, to reducing the number of computations and memory footprint required by compressing model parameters.

This work takes a two-pronged approach to accelerating LSTM inference. First, a model compression scheme called binarization is identified to both reduce the storage size of model parameters and to simplify computations. This technique is applied to training LSTM for two separate sequence learning tasks, and it is shown to provide prediction performance comparable to the uncompressed model counterparts. Then, a digital processor architecture, called Binary Recurrent Unit (BRU), is proposed to accelerate inference for binarized LSTM models. Specifically targeted for FPGA implementation, this accelerator takes advantage of binary model weights and on-chip memory resources in order to parallelize LSTM inference computations.

The BRU architecture is implemented and tested on a Xilinx Z7020 device clocked at

200 MHz. Inference computation time for BRU is evaluated against the performance of

CPU and GPU inference implementations. BRU is shown to outperform CPU by as much as 39× and GPU by as much as 3.8×.

This work is dedicated to my wife, Michelle.

ACKNOWLEDGMENTS

I have greatly enjoyed my foray into the field of deep learning over the past year. First and foremost, I would like to thank my wife, Michelle, for her support and encouragement throughout the process. Without your help, this thesis would not have been possible.

I would also like to thank my advisor, Dr. Tarek Taha, who introduced me to the intersection of deep learning and digital hardware design. It has been a pleasure to work with you, and I hope to continue our collaboration in the future.

TABLE OF CONTENTS

ABSTRACT...... iii

DEDICATION...... v

ACKNOWLEDGMENTS...... vi

LIST OF FIGURES...... x

LIST OF TABLES...... xii

I. INTRODUCTION...... 1

1.1 Deep Learning Inference...... 1
1.2 This Work...... 3

II. BACKGROUND...... 5

2.1 Sequence Learning with Neural Networks...... 5
2.1.1 Feed-forward Neural Networks...... 6
2.1.2 Recurrent Neural Networks...... 9
2.1.3 Long Short-Term Memory Networks...... 11
2.2 Inference Acceleration with Hardware...... 13
2.2.1 Software Implementation...... 14
2.2.2 Hardware Implementation...... 16
2.3 Motivation...... 19

III. RELATED WORK...... 22

3.1 Model Compression...... 22
3.2 LSTM Accelerators...... 26
3.3 Other Hardware Accelerators...... 29

IV. TRADE STUDY...... 32

4.1 Analysis of Related Work...... 32
4.1.1 Compression...... 32
4.1.2 Datatype...... 35
4.1.3 Memory...... 36
4.2 Effectiveness of Binarization...... 38
4.2.1 ...... 39
4.2.2 Speech Recognition...... 41

V. SYSTEM DESIGN...... 46

5.1 Datapath...... 46
5.1.1 Gate Pre-activation...... 48
5.1.2 Cell Calculations...... 53
5.2 Memory...... 56
5.2.1 Input Vector...... 57
5.2.2 Parameter Matrices...... 58
5.2.3 Hidden Output Vector...... 62
5.2.4 Cell State Vector...... 63
5.3 Theoretical Performance...... 64

VI. HARDWARE ARCHITECTURE...... 66

6.1 Control Logic...... 68
6.1.1 Control Unit...... 68
6.1.2 Control Registers...... 71
6.2 External Memory...... 71
6.2.1 Read Unit...... 72
6.2.2 Write Unit...... 73
6.3 Internal Memory...... 74
6.3.1 Input Memory...... 75
6.3.2 Parameter Memory...... 75
6.3.3 Hidden State Memory...... 77
6.3.4 Cell Calculation Unit Memory...... 77
6.4 Matrix-Vector Product Unit...... 79
6.4.1 Controller...... 79
6.4.2 Processing Element Array...... 80
6.4.3 Stream Conversion Unit...... 81
6.5 Cell Calculation Unit...... 82
6.5.1 Unit...... 83
6.5.2 Elem-Mult Unit...... 84
6.5.3 Elem-Add Unit...... 85

VII. IMPLEMENTATION & RESULTS...... 87

7.1 Methods...... 87
7.1.1 Tools...... 88
7.1.2 Design...... 88
7.1.3 Verification...... 91
7.2 Hardware...... 92
7.2.1 Device...... 92
7.2.2 Implementation Parameters...... 93
7.2.3 Resource Utilization...... 94
7.3 Performance Evaluation...... 95
7.3.1 Results...... 96
7.3.2 Analysis...... 97

VIII. CONCLUSION...... 101

BIBLIOGRAPHY...... 104

LIST OF FIGURES

2.1 Feed-forward Neural Network ...... 8

2.2 Recurrent Neural Network ...... 11

2.3 Long Short-Term Memory Cell ...... 12

2.4 Field-Programmable Gate Array (FPGA) Device Architecture ...... 19

4.1 Example Data from MNIST Stroke Sequence Dataset ...... 40

4.2 Fully-Connected LSTM Architecture ...... 41

4.3 Bidirectional Fully-Connected LSTM Architecture ...... 42

4.4 Example Data from TIMIT Dataset ...... 43

5.1 Row-wise Matrix-Vector Multiply Processing Element ...... 50

5.2 Matrix-Vector Product Unit (MVPU) Architecture ...... 51

5.3 Binary-Weight Row-wise Matrix-Vector Multiply Processing Element . . . . 53

5.4 Cell Calculation Unit (CCU) Architecture ...... 56

5.5 In-Memory Organization for Parameter Data ...... 61

6.1 Inference Accelerator System Architecture ...... 67

6.2 Binary Recurrent Unit Architecture ...... 67

6.3 Control Unit Batch Computation State Flow ...... 70

7.1 Vivado IP Integrator Block Design...... 89

7.2 MVPU Processing Element Implemented in Simulink...... 90

7.3 Zedboard Zynq-7000 Development Board...... 92

LIST OF TABLES

4.1 Testing set accuracy for the MNIST Stroke Sequence dataset...... 41

4.2 Testing set accuracy for the TIMIT dataset...... 44

5.1 Example row-wise matrix-vector multiplication procedure...... 52

5.2 Cell Calculation Unit (CCU) schedule...... 55

7.1 BRU Implementation Parameters, Targeted for Z7020...... 93

7.2 FPGA Resource Utilization for BRU on the Z7020...... 95

7.3 Run Time Performance Comparison of BRU, CPU, and GPU Running Inference for MNIST Stroke Sequence Model...... 96

7.4 Run Time Performance Comparison of BRU, CPU, and GPU Running Inference for TIMIT Model...... 97

CHAPTER I

INTRODUCTION

1.1 Deep Learning Inference

Today is a very exciting time to be in machine learning. Over the past decade, the field has exploded in terms of both its applications in society and the amount of research being done to advance the state of the art. These advances have been made possible in large part due to rapid increases in computing power as well as unprecedented accessibility of data, but nonetheless it is the passion and excitement around machine learning that has motivated scientists and engineers to continue to push the boundaries of what is possible in modern computing.

Broadly speaking, machine learning is a field of computer science that enables computers to do something without explicitly being programmed how to do it. Given a specific task, information about the task, and a desired outcome, a computer program can learn how to perform the task on its own by using a machine learning algorithm. In this way, the objectives are clearly defined, but the strategy for achieving the objective is not; it is up to the computer to find the strategy.

Deep learning refers to a subset of machine learning algorithms which use a neural network-based approach to learning. More specifically, deep learning models are composed of multiple layers of artificial neurons. These layers are connected to each other in a cascaded manner, in which the output of one layer becomes the input to the next one. This forms a deep hierarchical structure, hence the term deep learning. The hierarchy corresponds to an increasing level of abstraction from input to output. That is, each successive layer may understand concepts which are the combination of the features from the layer below.

To better illustrate this concept, consider how the human brain might recognize a face in an image. First, the contours and edges in the image are identified. This represents the understanding of the first (shallowest) layer in a deep network. Then, different facial features, such as a mouth or a nose, are formed from the contours and identified as such. This represents the activity of an intermediate layer of the network. Finally, since these facial features exist in close proximity to one another, we recognize that what we are seeing must be a face.

This represents the final (deepest) layer of the network.

In reality, this is not how facial recognition works in the brain. However, the example serves to illustrate the concept of hierarchical features that deep learning seeks to utilize.

Because of this, deep learning models have been found to be most effective when applied to tasks that involve data from which a hierarchical understanding can be formed. For example, deep learning has been widely applied to the field of speech recognition. It is straightforward to see how a hierarchy of understanding could be built from raw audio → phonetic sounds → words → whole phrase.

Because of their deep structure, however, deep learning models can be quite complex, containing millions of parameters. This fact makes it difficult to design and deploy applications that take advantage of their prediction power.

Deep learning takes place in two stages: training and inference. During training, a large amount of data is shown to the model in a recurring fashion. The model performs

some task with the input data (e.g. identify a phrase from raw audio) using its current state of understanding. By examining the error between the desired output and the actual output, small, incremental adjustments are made to the model’s parameters until it is able to perform the desired task with an acceptable level of error over the training dataset. Then, during the inference stage, new data is shown to the model. Although this data has not been seen before, the model has learned enough from the training data that it can infer the

(hopefully) correct understanding of the new data.

It is the inference stage that carries the utility of a deep learning model—it is what is implemented and deployed to the end-user. The details of this implementation are determined by the application. Often, the application brings demanding real-time requirements in terms of latency and number of concurrent users. Complex models require a large amount of memory resources. The enormous number of computations being performed results in high energy consumption. For all of these reasons, designing a deep learning inference system can be a challenging task.

1.2 This Work

This work is concerned with the design of a high-performance, energy-efficient solution to implementing deep learning inference. More specifically, a digital processor architecture is designed for accelerating inference in sequence-learning applications, such as speech recognition. The architecture is synthesized for hardware and tested on a field-programmable gate array (FPGA) platform.

The work is structured as follows: Chapter 2 provides background information on neural networks for sequence-learning as well as different hardware platforms for implementing inference. The motivation for the design of a custom hardware architecture is presented.

Chapter 3 provides an overview of recent academic literature on the topics of model compression and hardware accelerator design.

Chapter 4 investigates the various trades associated with model compression and hardware accelerator design. A compression approach, called binarization, is selected and explored in order to further demonstrate its validity.

Chapter 5 walks through the design process for the accelerator architecture, considering computational dataflow and memory resource usage strategies. An expression for theoretical performance of the architecture is derived.

Chapter 6 dives into the details of the accelerator architecture, providing an overview of its major subsystems and their functions.

Chapter 7 discusses the implementation of the architecture, including the design workflow, the software tools used, and the hardware platform used for implementation and testing. Actual performance for two models is reported and compared with the performance of their software implementations.

Finally, Chapter 8 concludes the work, restating the motivation for the hardware accelerator design and summarizing the findings of the work.

CHAPTER II

BACKGROUND

This work explores the intersection of two topics: neural networks for sequence learning and inference acceleration through custom hardware design. This chapter provides background information on both of these topics, which will help to provide context for the remainder of the work. Then, the motivation for the design of a custom hardware accelerator architecture is presented.

2.1 Sequence Learning with Neural Networks

Neural networks have become a widely used tool in the fields of data analytics and artificial intelligence. Inspired by how neurons in the human brain work, this family of machine learning algorithms is based on a collection of artificial neurons, connected to each other in a weighted graph. Neural networks can be used to approximate complex, nonlinear functions with many inputs. There are many variants of neural networks, each of which has an architecture that has been tuned to best handle a particular task.

Sequence learning refers to a subset of machine learning tasks that deal with data for which order is important—that is, the data is arranged in a specific order, and this order is relevant to the task at hand. For example, a sequence learning task might be to predict the next-day closing price of a stock, given the closing price of that stock from the past 60 days. This is a regression task, in which the goal is to predict an unknown, continuous-valued output. Another example of a sequence learning task would be to predict the next word in a sentence, given the phrase “I went to the store to buy”. This is a classification task, where the goal is to predict an unknown, but discrete-valued output. Another example would be to label the word being spoken in a segment of audio; this is also a classification task, but the goal is to produce the correct label for the entire sequence, rather than for the next item in the sequence. There is a wide range of sequence learning problems, and neural network-based approaches have been shown to deliver state-of-the-art results in many of these areas.

One of the most commonly used and effective neural network variants for sequence learning is called Long Short-Term Memory. This architecture is derived from a basic modification to neural networks for handling sequential data, called a recurrent neural network. To understand the details of these architectures, first we will examine how the most basic neural network, called a feed-forward network, is structured.

2.1.1 Feed-forward Neural Networks

A feed-forward neural network is composed of multiple, interconnected layers of artificial neurons. These connections are fed forward from an input layer, to zero or more intermediate layers (also called “hidden” layers), to an output layer. The output of each neuron in the hidden and output layers is equal to the weighted sum of its connections to the neurons in the layer below, plus a bias value, with an activation function applied. Mathematically, the output h of a neuron can be written as:

h = g( Σ_{i=1}^{M} w_i x_i + b ),    (2.1)

where

w_i is the connection weight to the i-th neuron in the previous layer,

x_i is the output of the i-th neuron in the previous layer,

M is the number of neurons in the previous layer,

b is the bias value of the neuron, and

g(·) is the activation function.

For convenience, we can write the output of all the neurons in a single layer using vector notation:

h = g(W x + b), (2.2)

where

h is the N-dimensional output vector,

x is the M-dimensional input vector,

W is the N × M weight matrix,

b is the N-dimensional bias vector,

M is the number of neurons in the previous layer,

N is the number of neurons in the current layer, and

g(·) is the activation function.

Figure 2.1: A simple feed-forward neural network with a single hidden layer.

The weight matrix W contains the parameters of the neural network model. Its values are initially unknown, but through the process of training, the model learns connection-weight values that are appropriate for the task at hand (i.e. they approximate the desired output function). The bias values are also learnable parameters, found through training. For neural networks, training is performed using a technique called backpropagation. With this method, the gradient of the output error, or loss, function is propagated backwards through the network from output to input, and small, gradual adjustments are made to the connection weights in order to minimize the loss. This procedure is computationally very expensive, and thus the training process can require significant time and computational resources.
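As a minimal illustration of the update step just described, a single gradient-descent adjustment could be sketched as follows (an assumption-laden sketch: the learning rate, function name, and gradient arguments are placeholders, not values taken from this thesis):

```python
import numpy as np

def gradient_step(W, b, dW, db, lr=0.01):
    """One small, gradual parameter adjustment using gradients obtained from
    backpropagation (dW, db). The learning rate lr is an illustrative value."""
    W_new = W - lr * dW   # move the weights against the loss gradient
    b_new = b - lr * db   # and likewise for the bias terms
    return W_new, b_new
```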

The activation function is simply some predefined function, the purpose of which is

usually to constrain the output to some desired range. The choice of activation function is

determined by the application. Often, it is chosen to be a nonlinear function. The activation function used may also differ for each layer. A popular choice of activation function is the sigmoid function, which constrains the output between 0 and 1:

sigm(x) = 1 / (1 + e^{-x})    (2.3)

The hyperbolic tangent function, which has a range from -1 to 1, is another commonly used activation function:

tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})    (2.4)
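As a concrete software reference point, Equations 2.2–2.4 can be evaluated with a few lines of NumPy (a sketch only; the layer sizes and variable names below are arbitrary choices, not values from this work):

```python
import numpy as np

def sigm(x):
    """Sigmoid activation, Equation 2.3."""
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_layer(x, W, b, g=np.tanh):
    """Single-layer forward pass h = g(Wx + b), Equation 2.2."""
    return g(W @ x + b)

# Example with M = 4 inputs and N = 3 hidden neurons (sizes chosen arbitrarily).
x = np.random.randn(4)
W = np.random.randn(3, 4)
b = np.zeros(3)
h = feedforward_layer(x, W, b, g=sigm)
```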

A standard feed-forward neural network can be used for sequence learning tasks. A fixed context window size C is determined, then inputs x_t, x_{t+1}, ..., x_{t+C-1} are concatenated row-wise and fed to the input layer of the network as a single input vector. The sequence order is preserved, and the model should learn temporal relationships in the input. However, this is not the best approach to sequence learning for a couple reasons. For one, the dimensionality of the input vector (and thus the number of connection weights) grows by a factor of C, which can become unwieldy quickly for high-dimensional inputs and long context windows. Secondly, because the context window is fixed, we are limited in what the model can learn; temporal relationships existing beyond the scope of the window will not be captured. Ideally, we want to learn representations of arbitrary length. Recurrent neural networks solve this issue and are a more natural fit for sequence learning tasks.
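The fixed-window input construction described above can be sketched as follows (illustrative only; the sequence length, input dimension, and window size are assumptions):

```python
import numpy as np

def window_input(sequence, t, C):
    """Concatenate x_t through x_{t+C-1} into one input vector for a
    feed-forward network with a fixed context window of size C."""
    return np.concatenate(sequence[t:t + C])

# A toy sequence of ten 8-dimensional inputs, windowed with C = 3.
seq = [np.random.randn(8) for _ in range(10)]
x_windowed = window_input(seq, t=0, C=3)   # 24-dimensional input vector
```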

2.1.2 Recurrent Neural Networks

The Recurrent Neural Network (RNN) was developed to solve the issue of learning arbitrary-length temporal dependencies. In its most basic form, an RNN differs from a feed-forward network in that its hidden layers contain connections from the output of each neuron back to the input of each neuron. This recurrent connection forms a memory of previous inputs, allowing the model to learn relationships in a sequence. Mathematically, the output h_t of a hidden recurrent layer at time t can be written as:

h_t = g(W_x x_t + W_h h_{t-1} + b),    (2.5)

where

h_t is the N-dimensional output vector,

x_t is the M-dimensional input vector,

W_x is the N × M input weight matrix,

W_h is the N × N recurrent weight matrix,

b is the N-dimensional bias vector,

M is the number of neurons in the previous layer,

N is the number of neurons in the current layer, and

g(·) is the activation function.
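A compact sketch of Equation 2.5 applied across a whole input sequence is shown below (NumPy; the function and variable names are illustrative choices, not taken from the thesis):

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b, g=np.tanh):
    """Apply Equation 2.5 step by step, carrying the hidden state forward."""
    h = np.zeros(W_h.shape[0])            # h_0 initialized to zero
    outputs = []
    for x_t in x_seq:                     # inputs are fed one at a time, in order
        h = g(W_x @ x_t + W_h @ h + b)    # h_t depends on x_t and h_{t-1}
        outputs.append(h)
    return outputs
```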

Inputs are fed to an RNN one at a time, in sequential order. The network produces an

output for every input, but for some tasks the output may only be relevant after a certain

number of inputs has been seen. In principle, this modified neural network architecture

is a simple and powerful tool for sequence learning. However, in practice, RNNs can be

difficult to train properly due to the problems of vanishing and exploding gradients [1]. As

a result, they often have difficulty maintaining long-term contextual information. A variant

of the RNN, called Long Short-Term Memory, was created to solve this issue.

Figure 2.2: A recurrent neural network with a single hidden layer.

2.1.3 Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) networks are a variation of the basic recurrent neural network architecture. They are composed of “cells”, which are computational units built from the basic artificial neuron structure. LSTM solves the vanishing and exploding gradients problem by adding an explicit memory unit, called the cell state, inside each of its computational units. Each LSTM cell has four gates, which control the input to the cell state as well as the degree to which the cell state influences the current output.¹ Each of these gates is, in effect, a small RNN layer itself—it has weighted connections from the input and the previous output.

The input gate controls which of the present inputs (new information) are considered in updating the cell state:

i_t = sigm(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (2.6)

¹There have been numerous variations of LSTM, which differ in the number of gates used and where recurrent connections are placed. This work uses the implementation described by Zaremba et al. in [2].

Figure 2.3: Architecture of the Long Short-Term Memory cell.

The new-input gate controls how the new information will affect the update to the cell state:

j_t = tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j)    (2.7)

The forget gate controls which of the present inputs will affect how much of the previous state is retained:

f_t = sigm(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (2.8)

The output gate controls which of the present inputs will affect the output of the cell:

o_t = sigm(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (2.9)

These gates are combined elementwise to compute an update to the cell state:

c_t = i_t ⊙ j_t + f_t ⊙ c_{t-1}    (2.10)

Finally, the hidden layer output is computed as the elementwise product of the output gate and the activated cell state:

h_t = o_t ⊙ tanh(c_t)    (2.11)
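For reference, Equations 2.6–2.11 map directly onto a few lines of NumPy. The sketch below stores the per-gate parameters in dictionaries purely for brevity; that packaging is a notational assumption, not the parameter storage scheme used later in this work:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following Equations 2.6-2.11."""
    i_t = sigm(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])     # input gate      (2.6)
    j_t = np.tanh(W['xj'] @ x_t + W['hj'] @ h_prev + b['j'])  # new-input gate  (2.7)
    f_t = sigm(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])     # forget gate     (2.8)
    o_t = sigm(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])     # output gate     (2.9)
    c_t = i_t * j_t + f_t * c_prev                            # cell state      (2.10)
    h_t = o_t * np.tanh(c_t)                                  # hidden output   (2.11)
    return h_t, c_t
```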

By using the gates, LSTM can decide what information to discard and what to keep in

a given context. This mechanism allows the model to learn both long- and short-term de-

pendencies. For this reason, LSTM has become one of the most widely used machine learn-

ing algorithms (neural network-based or otherwise) for sequence learning tasks—including

state-of-the-art speech recognition systems [3]. This work focuses on LSTM as the specific

network architecture for which to accelerate inference.

2.2 Inference Acceleration with Hardware

As we have seen, implementing a deep learning application is a two step process: first

train the model, then implement model inference. The training phase, as in all machine

learning applications, requires the meticulous data cleaning and setup. Thanks to machine

learning frameworks like [4] and TensorFlow [5], neural network training can be

implemented relatively easily in Python. However, the actual process of training can take a

significant amount of time—days, or even weeks—to complete. This is due to the complexity

of the backpropagation algorithm used in training; specifically, the “backward pass” is

responsible for the majority of the computational burden. The “forward pass” refers to the

propagation of data from the input layer of the network to the output. The “backward

pass” consists of the propagation of gradients from the output to the input layer, and it is

the calculation of these gradients that requires significant computation time. The training

process is often iterative, as well. The system designer makes an educated guess when

setting hyperparameter values such as layer sizes and learning rate, but often these values

need to be tweaked, and the whole training process is rerun. However, once a model has been sufficiently trained, this step is finished and usually does not need to be revisited. The long computation time for training can be thought of as a large “one-time cost”.

Inference, on the other hand, carries the “recurring costs” of the deep learning application. In this stage, new data is fed to the model, and only the forward pass through the network is computed. The latency of the inference system, then, is determined by the time it takes to compute the forward pass. How the forward pass is implemented will determine the magnitude of its recurring cost. “Cost”, in this case, carries two implications. The

first is the actual monetary cost due to the energy consumption of the system. It is in the best interest of the application maintainer to minimize the amount of energy required to compute inference because this cost is incurred every time the system processes a new input. Secondly, there is the cost of time to the user of the application. For an analytics application, this equates to the amount of new data that the system can process in an hour.

For a mobile application, this means the time the user waits for a response from the system.

In either case, it is beneficial to minimize system latency.

Fortunately, once a model has been trained, inference computations become fixed. While the inputs to the model differ, the model parameters and the number of underlying com- putations remains the same. Because of this, inference can be optimized for both energy consumption and computational latency. However, this optimization can come at the cost of development time and complexity. There are a number of different strategies and hardware platforms that support different areas of this trade space.

2.2.1 Software Implementation

The first route that many people take when implementing inference is to use a pure-

CPU, software-based approach. This strategy is appealing because it is the simplest, both in terms of development time and design complexity. It also provides flexibility, as a high-level software solution can run on any platform, and no special hardware is required. The develop-debug feedback loop for software is short compared to that of hardware design, so a robust solution can be implemented in relatively short order. When using a machine learning framework like Theano or TensorFlow, it is feasible to simply take the same code used for training and adapt it to implement inference. TensorFlow Lite [6] was developed specifically for the purpose of easily adapting TensorFlow code to mobile applications. The barriers to entry for a software-based inference implementation are low, and so oftentimes this approach should be the first one taken by the application designer—especially if real-time requirements for system latency and throughput are modest.

However, for applications demanding low-latency, high-throughput requirements, a pure software-based approach may be insufficient. To understand why this is the case, we must look at the underlying computations being performed in the forward pass. For a single hidden layer of a feed-forward network, the output of the forward pass is computed by

Equation 2.2. With N hidden units and M-dimensional input, this computation requires

M·N multiply operations, M·N addition operations, and N activation function calculations.

For a basic scalar processor, these computations must be performed sequentially, and the M ·

N term dominates the complexity. Modern superscalar, multicore processor architectures, as well as low-level, optimized linear algebra libraries such as BLAS [7], introduce parallelism that helps to speed up matrix-vector computations, but fundamentally there is a limit to computational performance when using a pure software-based approach.

One solution for speeding up inference computation is to augment a software approach with a Graphics Processing Unit (GPU). Containing thousands of processing cores, GPUs provide massive parallelism that can speed up matrix-vector operations by multiple orders

15 of magnitude. GPUs have become a popular tool in the deep learning community; in fact, hardware manufacturers have begun targeting the design of their GPUs specifically for deep learning applications [8]. Perhaps the key reason for their popularity is the ease at which they can be integrated into an existing software solution. Many machine learning frameworks offer seamless GPU integration through their back-end computational engines.

At run time, if the engine sees that a GPU is present on the machine, it will automatically offload computations to it. The high-level code describing the model need not be changed in order to take advantage of the GPU. Of course, it would be possible to manually target the

GPU by writing custom CUDA code to implement inference, but thanks to highly-optimized

GPU libraries for deep learning like cuDNN [9], any improvements in performance may be negligible.

There are a few disadvantages to the GPU-based approach, however. The first is the high cost of the device—high-end GPUs can cost thousands of dollars [10]. While this is a non-recurring cost, it may simply be too high of a barrier to entry, especially for individuals. The other main disadvantage of using a GPU is its high power requirement.

For a complex neural network-based speech recognition algorithm, a GPU implementation required almost twice the amount of power required by its CPU-only counterpart [11]. While total energy consumption for a particular task may be lower on a GPU than on a CPU, the high power requirement of a GPU may be prohibitive for some platforms—especially embedded systems. In order to meet performance requirements while minimizing energy consumption, it may be necessary to design custom hardware for accelerating inference.

2.2.2 Hardware Implementation

While software provides design flexibility and simplicity, we have seen its limitations as a platform for implementing inference for high-performance and embedded deep learning

applications. In order to handle demanding workloads, as well as in the interest of lowering long-term energy costs, designing custom hardware may be a desirable strategy.

In general terms, a custom hardware solution is called an Application-Specific Integrated

Circuit (ASIC). In contrast with general purpose processors like CPUs, which are designed to handle a large variety of tasks, ASICs are designed to perform a single function. ASICs contain only the resources required to perform their specified function, and for this reason,

ASICs can be highly optimized for computational throughput and energy efficiency. For example, an ASIC designed for accelerating a neural network-based application achieved 13× faster computation time and 3,400× less energy consumption than a GPU-based solution [12].

Broadly speaking, the less variability in an algorithm, the greater the opportunity there is to optimize its implementation with hardware. Inference in all neural network-based algorithms has a few properties that make it well-suited for a hardware-optimized implementation. Firstly, there are a fixed number of computations performed in each forward pass through the network. This allows the hardware designer to choose the appropriate amount and type of computational units like adders and multipliers. Secondly, matrix-vector multiplication is easily parallelizable—each element of the vector product relies on a separate row of the parameter matrix, and so these output terms can be computed independently of one another. Thirdly, the model parameter values are fixed and known at run time. This opens up the opportunity for various compression schemes and quantization approaches, which reduce the amount and size of the parameters in memory. Finally, in addition to the number of computations being fixed, the order in which they are performed, and thus the order in which data is accessed, can be fixed. Having a predictable data-access pattern

17 allows memory operations to be optimized, either through the interface to off-chip DRAM, or in the properties of on-chip SRAM.

The process of designing digital hardware involves writing in a hardware description language (HDL) to describe the structure and behavior of a circuit. Unlike most software languages, which describe a sequential procedure to be executed by a CPU, HDL can describe operations being done in parallel. The process of writing and testing HDL can be both difficult and time-consuming, requiring a special skill set and domain knowledge.

Because of this, designing custom hardware to accelerate inference carries a high non-recurring engineering cost. Additionally, once an ASIC has been fabricated, its design is

fixed. Any modifications to the design require a full re-fabrication cycle and replacement of existing devices, which can be expensive.

Field-Programmable Gate Arrays (FPGAs) present a solution to hardware design that offers more flexibility than an ASIC. These devices contain a large, interconnected array of configurable logic blocks (CLBs), which can be reconfigured to implement complex digital circuits. They also contain memory (RAM) and multiply-add units (DSPs) that are connected to the CLBs (see Figure 2.4). In this way, FPGAs are analogous to a breadboard for integrated circuit design. While they do not offer the energy-efficiency and performance of ASICs, they still bring significantly better energy efficiency than CPU and GPU and, if the algorithm can be sufficiently optimized for hardware, as good or better speed. One

FPGA-based hardware accelerator for a speech recognition application achieved 43× and

197× better speed and energy-efficiency, respectively, than a CPU implementation; and 3× and 14× better speed and energy-efficiency, respectively, than a GPU implementation [11].

In this work, a custom hardware architecture is designed for implementing LSTM inference. It is possible that a future adopter of the architecture could decide to implement

Figure 2.4: Field-Programmable Gate Array (FPGA) device architecture.

the system as an ASIC in order to maximize energy-efficiency and performance. However, due to the rapid-prototyping nature of this project, the design specifically targets an FPGA platform.

2.3 Motivation

Sequence learning is a broad field with many practical applications, from speech recognition to video classification. Long Short-Term Memory networks have been shown to be a very effective approach to handling these types of problems. However, implementing inference for such applications can be challenging due to the computational complexity of LSTM.

A pure-CPU software-based implementation will have limited computational throughput and poor energy efficiency. While this limitation may be acceptable for some applications, oftentimes inference is augmented with GPUs or other specialized hardware in order to improve performance.

In a cloud-based analytics setting, the benefits of augmenting servers with efficient hardware are clear. For one, faster computation improves system throughput, lowering the amount of server resources required to handle the workload, and/or increasing the maximum workload the system can handle. Secondly, better energy-efficiency lowers the operating cost of the system. Companies like Google and Microsoft have begun deploying custom ASICs and FPGAs to their data centers to enhance the performance of their analytics applications

[13][14].

For mobile and Internet-of-Things (IoT) deep learning applications, many times inference computation is not performed on the embedded device. Rather, data is offloaded to a cloud server to be processed. However, there are many benefits to performing inference on the device instead. As a case study, consider Amazon Alexa, a home voice assistant service.

Using an Alexa-enabled device, users can speak voice commands to do various things such as play music, check the weather, order food, and control other smart devices. The system processes voice commands by having the local device send a recorded voice audio stream to Amazon’s servers, where it is then decoded using a speech recognition algorithm [15].

If voice recognition were to take place on the device instead, there would be a number of benefits. The service provider (Amazon) benefits from decreased server load, and thus lower operating costs. The user would benefit from increased privacy, as only the directives from their voice commands, rather than the raw audio, would be shared with the server. Additionally, the system could provide limited functionality even without an Internet connection if information could be downloaded to the device in advance.

The challenge of implementing inference on-device for a voice recognition system is twofold. For one, voice recognition is a near real-time task; the user will not tolerate a long delay for processing commands. However, inference in state-of-the-art voice recognition

models, like ones that use LSTM, is computationally complex and may require significant computational resources in order to meet the system’s latency requirements. At the same time, embedded devices are required to be relatively low-cost and low-power. It would not be feasible to simply replace an embedded processor with a more capable, but more costly, one. A low-cost, low-power solution to accelerating inference for sequence learning tasks is required. Therefore, this work proposes an FPGA-based architecture for LSTM inference acceleration.

CHAPTER III

RELATED WORK

To date, there have been several studies examining the problem of implementing deep learning inference in a hardware-friendly manner. Complete solutions to this problem come in two stages: first, reduce the amount of data needed for computations by compressing the model, then optimize dataflow for the target hardware platform. This chapter provides an overview of academic literature in both of these areas.

3.1 Model Compression

For many deployed deep learning applications, the most costly operation in terms of both time and power consumption is off-chip memory access. Han et al. propose a 3-stage model compression and quantization method that dramatically reduces the storage requirement of model parameters and allows many deep learning models to be stored in on-chip SRAM [16]. The authors show that, for a small number of example feedforward and convolutional networks, this method can reduce parameter storage requirements by 35 to

49× without loss in prediction accuracy. The three compression stages are as follows. First, pruning removes all model parameters which have an absolute value below some threshold.

Next, parameters are quantized through weight sharing, in which weight values are binned

22 through k-means clustering. This allows only the weight indices to be stored. Finally, the pruned and quantized parameters are stored in memory using Huffman coding.
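To make the weight-sharing step concrete, the sketch below bins a weight matrix with k-means and keeps only a small codebook plus per-weight indices (an illustration of the idea in [16], assuming scikit-learn is available; the cluster count is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(W, n_clusters=16):
    """Cluster weights and return a codebook of shared values plus the
    per-weight indices that would actually be stored."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(W.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()      # shared weight values
    indices = km.labels_.reshape(W.shape)       # small integer index per weight
    return codebook, indices

# Reconstruction at inference time is a table lookup: W_hat = codebook[indices]
```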

In [11], Han et al. present a model compression method called load balance-aware pruning. This method extends the standard pruning technique by optimizing the workload across parallel processing elements, resulting in reduced computation time. The technique works by first dividing the parameter matrix into submatrices based on the number of processing elements available. These submatrices are then constrained to have the same sparsity, such that they contain an equal number of non-zero parameters after pruning. The authors show that for an example speech recognition model, the prediction performance of load balance-aware pruning does not differ much from that of standard pruning. Linear quantization is used to further compress the example model, with the fixed-point fraction length set by analyzing the dynamic range of each layer. The authors found that quantizing to 12-bits did not reduce the prediction performance from that of the 32-bit floating point model. However, 8-bit fixed point quantization caused prediction performance to deteriorate significantly.

In [17], Qiu et al. explore compression of fully-connected layers in a convolutional network using singular value decomposition (SVD). For an example CNN model, SVD is performed on the first fully-connected layer weights. The first 500 singular values are chosen.

This results in a compression rate of 7× and a prediction accuracy loss of only 0.04%. The authors also propose a data quantization flow in which the fixed-point fractional length of both the model parameters and the layer inputs is optimized. First, the dynamic range of parameter matrices for each layer is analyzed in order to determine the fractional length which yields the smallest total error between the fixed-point and floating-point versions of the parameters.

Then, the same process is applied to the input data at each layer. Fractional lengths

23 are static in a single layer but dynamic in-between layers. The authors found that for an example CNN model, 16-bit static fixed-point precision resulted in negligible prediction performance loss compared to that of the floating-point model. 8-bit static precision resulted in significantly worse performance, but an 8-bit dynamic precision model resulted in a performance loss of only 1.52%.

Prabhavalkar et al. apply SVD for LSTM model compression in [18]. First, the authors present the compression technique for a general recurrent network. Input and recurrent parameter matrices for each layer are factored jointly to produce a recurrent projection matrix. The compression ratio for a given layer can then be controlled by setting the rank of the projection matrix. This method is extended to LSTM by concatenating the four gate parameter matrices and treating them as a single matrix. For an example speech recognition model, the authors showed that they were able to compress the original network to a third of its original size with only a small degradation in accuracy.
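The basic low-rank idea behind these SVD approaches can be sketched in a few lines of NumPy (illustrative only; this is not the joint factorization procedure of [18]):

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate an N x M weight matrix by two thin factors, keeping only
    the largest `rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # N x rank
    B = Vt[:rank, :]               # rank x M
    return A, B                    # W is approximated by A @ B

# Inference computes A @ (B @ x), costing rank*(N+M) multiplies instead of N*M.
```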

Chen et al. investigate the effect of various quantization strategies during training and inference in [19]. The authors find that for a CNN trained on MNIST using 32-bit floating point parameters, quantizing to 16-bit fixed point for inference results in a loss of only

0.01% in prediction accuracy.
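For reference, a simple linear fixed-point quantizer of the kind evaluated in these studies might look like the following (a sketch under assumed word and fraction lengths, not the exact procedure of [19]):

```python
import numpy as np

def quantize_fixed_point(x, frac_bits=8, total_bits=16):
    """Quantize to signed fixed point, e.g. Q8.8 when total_bits=16 and
    frac_bits=8, then return the dequantized values used for evaluation."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale
```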

In [20], Shin et al. apply a weight-sharing quantization approach to large-scale LSTM models. After an initial model training stage using floating-point parameters, uniform quantization and retraining are applied iteratively in order to find an optimal quantization step size. This approach is applied to two multilayer LSTMs for natural language processing applications. The authors find that due to the wide dynamic range across layers of the model, an optimal approach is to group parameter values by layer and connection type (i.e.

separate feedforward and recurrent parameter matrices) and perform quantization of each group separately.

In [21], Courbariaux et al. introduce a method for training very low-precision (1-bit) neural networks. So-called binary neural networks drastically reduce both memory and computational requirements for deployed models. With this model compression scheme, weights are constrained to values of +1 or -1 during training. Thus, inference computation includes only addition and subtraction, which is simpler and more efficient to implement in hardware than multiplication. The authors apply this technique to a 3-layer feedforward network for MNIST classification and achieved 0.01% better prediction accuracy than that of a full-precision version of the model.

Hubara et al. propose an alternative approach for model binarization in [22]. This training method constrains not only the network weights to binary values, but layer activations as well. This provides even more of an advantage for hardware implementation, as all computational operations involve only +1 or -1. By mapping -1 to 0, multiply-accumulate operations can be replaced with XNOR and bit-count operations, which can be computed very quickly and efficiently in hardware. As in [21], the authors apply this approach to a

3-layer feedforward network for MNIST classification. This demonstrated only a 0.1% loss in prediction accuracy compared to that of the full-precision implementation of the model.
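The XNOR/bit-count replacement for multiply-accumulate can be demonstrated in a few lines (a software illustration of the arithmetic identity, not the hardware mapping itself; the 8-bit example values are arbitrary):

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two length-n vectors of +1/-1 values packed as bits
    (bit 1 -> +1, bit 0 -> -1): result = 2 * popcount(XNOR(w, x)) - n."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR limited to n valid bits
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n

# Example with n = 8 binary weights and activations.
w = 0b10110010
x = 0b10010110
print(binary_dot(w, x, 8))   # prints 4
```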

Building upon previous works on model binarization, Hou et al. propose an algorithm called loss-aware binarization in [23]. Unlike previous approaches, this algorithm directly minimizes training loss with respect to the binarized weights. The algorithm can also be extended to binarizing both weights and activations by using a simple sign function for binarizing activations. The authors experiment with a 4-layer feedforward network for

MNIST classification. Compared to the prediction accuracy of the full-precision version of

the model, results showed a 0.01% improvement using binary weights only, and a 0.19% loss in accuracy using binary weights and activations. Unique to this work, a method for binarizing recurrent neural networks is proposed. An example LSTM model for text prediction is used to evaluate the method. Compared to the full-precision model, the binarized model achieves 0.02% loss in accuracy using binary weights only and 0.11% loss in accuracy using binary weights and activations. Compared to two other cited binarization methods for both weight-only and weight & activation binarization, the approach proposed in this work achieves the best prediction performance.

3.2 LSTM Accelerators

Chang et al. implement a 2-layer LSTM on a Xilinx XC7Z020 in [24]. The FPGA design contains two main computational subsystems: gate modules and elementwise modules. Gate modules perform the matrix-vector multiplication and activation function computation for

LSTM cell gates. Elementwise modules combine gate module outputs with elementwise multiplication, addition, and activation functions. All weights and input are quantized to

16-bit fixed point (Q8.8) format. Activation function units use piecewise linear approximation and can be configured at run time to perform either the hyperbolic tangent or sigmoid functions. Input data is fed to gate modules from a Direct Memory Access

(DMA) streaming unit, which has independent streams for input and hidden layer data. To synchronize the independent input streams, gate modules have a sync unit that contains a FIFO. Elementwise modules also contain a sync unit to align the data coming from the gate units. Results from the elementwise module are written back to memory through the

DMA unit.
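As a rough software reference for the run-time-configurable activation units described above, a coarse piecewise linear tanh approximation might look like the following (segment boundaries and slopes here are illustrative, not the coefficients used in [24]):

```python
def tanh_pwl(x):
    """Five-segment piecewise linear approximation of tanh(x)."""
    if x <= -2.375:
        return -1.0                 # negative saturation
    if x >= 2.375:
        return 1.0                  # positive saturation
    if x <= -0.5:
        return 0.25 * x - 0.375     # shallow segment toward saturation
    if x >= 0.5:
        return 0.25 * x + 0.375
    return x                        # near-linear region around zero
```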

Ferreira and Fonseca propose an FPGA accelerator for LSTM in [25]. This design trades off scalability and flexibility for throughput: network size is fixed at HDL compilation time, but weight and bias parameters are stored in on-chip RAM rather than imported from off-chip memory. Matrix-vector multiplication is tiled using counter and multiplexer logic, which saves resources at the cost of increased computation time (by a factor of the resource sharing ratio). Input and recurrent matrix-vector multiplication are calculated in parallel.

Because the size of the input vector is often smaller than the recurrent vector size, the bias-add operation is performed after the input matrix-vector multiply operation has completed, while the recurrent matrix-vector multiply operation is still being computed in parallel.

Elementwise operations are also multiplexed: there is one each of a hyperbolic tangent unit, a sigmoid unit, and elementwise multiply and add units. Activation functions are computed using piecewise second order approximations, evaluated using Horner’s Rule in order to save multiplier resources. All data is quantized to a signed 18-bit fixed point representation (Q7.11) in order to make full use of Xilinx DSP48E1 slices.

Guan et al. propose an FPGA accelerator for LSTM in [26]. This paper explores optimization of communication with memory as well as computational performance. To optimize memory access time, the authors propose organizing the model parameter data in memory such that it can be accessed sequentially for tiled computation. The memory organization is done offline prior to inference. In terms of architectural optimization, the design contains a data dispatch unit, which handles all memory transactions separately from the

LSTM accelerator. Additionally, the accelerator uses a ping-pong buffering scheme at its input and output so that new computations can take place while data is being transferred to/from memory. To optimize computation performance, the accelerator unit performs tiled

matrix-vector multiplication in parallel for each of the four LSTM gates. A separate functional unit performs the activation function and elementwise operations. This unit contains an on-chip memory buffer to hold the cell’s current state. To evaluate the accelerator’s performance, the authors implement a known LSTM model for speech recognition. This model contains three stacked LSTM layers. The base design of the FPGA accelerator uses few memory resources, so the authors also experimented with storing the first layer’s parameters in on-chip memory and found that this approach resulted in an overall speedup of about 1.5×.

In addition to the compression method proposed in [11], Han et al. present an FPGA accelerator designed to operate on sparse LSTM models. The accelerator operates directly on a compressed model by encoding the sparse matrices in memory using compressed sparse column (CSC) format. A control unit fetches this data from memory and schedules computational operations. Operations that do not depend on each other (e.g. the activation function of the input gate and the pre-activation of the forget gate) are scheduled to run in parallel in the accelerator. The accelerator unit is composed of multiple processing elements, a single elementwise unit, and a single activation function unit. Processing elements each read from their own dedicated FIFO that is fed by the control unit. They contain a sparse matrix read unit, which decodes the CSC-formatted parameter data. Matrix-vector product accumulation is accomplished via a single adder and buffer per processing element. The elementwise unit contains 16 multipliers and an adder tree. The activation function unit is composed of lookup tables for hyperbolic tangent and sigmoid functions, both containing

2048 samples and quantized to 16-bit (Q1.15) format.
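The arithmetic that such a sparse accelerator performs can be summarized by a compressed sparse column matrix-vector product, sketched below in software form (illustrative only; it does not reflect the scheduling or parallelism of the hardware in [11]):

```python
import numpy as np

def csc_matvec(values, row_idx, col_ptr, x, n_rows):
    """Compute y = W @ x with W stored in CSC form: values/row_idx hold the
    non-zeros of each column, and col_ptr marks where each column starts."""
    y = np.zeros(n_rows)
    for j, x_j in enumerate(x):                  # walk the columns of W
        if x_j == 0.0:
            continue                             # zero inputs contribute nothing
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x_j     # accumulate into output rows
    return y
```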

3.3 Other Hardware Accelerators

Li et al. proposed an FPGA accelerator for the basic RNN model in [27]. The authors note that for natural language processing applications, the number of nodes in the output layer (i.e. the vocabulary size) is usually much larger than the number of hidden-layer nodes. As such, the computation of the output layer is often the dominating factor in the computational complexity of RNNs. To balance the computational workload of processing elements in an RNN accelerator, the proposed architecture unrolls the network in time: the computation of the hidden layer is done serially, while the output layer computation for a window of time is done in parallel. Additionally, the authors employ two design strategies to optimize FPGA resources: quantize network weights to 16-bit fixed point format and approximate non-linear activation functions using piecewise linear approximation.

Nurvitadhi et al. study techniques for optimization and acceleration of the Gated Recurrent Unit (GRU) network in [28]. First, they propose a memoization optimization method that reduces computation time by an average of 46%. Because the authors target Natural Language Processing applications, the GRU input is a one-hot encoded vector the size of the vocabulary. Therefore, there is a finite number of possible results from the input matrix-vector multiply operation, and these results can be precomputed and cached to avoid expensive computations at run time. Recurrent matrix-vector multiply operations, on the other hand, cannot be precomputed, and so the authors also propose an FPGA architecture for accelerating the remaining calculations. The accelerator consists of tiled column blocks, which evenly divide up the computation of a dot product between the input vector and a weight matrix row. Column blocks are fed input and weight values from a memory read unit. Each block consists of a floating-point multiply-accumulate unit. Partially accumulated results from the tiled computation are summed in a row reduction unit,

and output vector results are sent to a memory write unit. The accelerator only performs matrix-vector operations; the element-wise addition and multiplication as well as activation function computations required by GRU are presumably performed on the host CPU.
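The memoization idea exploited in [28] rests on the fact that multiplying a weight matrix by a one-hot vector simply selects a column, so results can be cached by vocabulary index. A minimal sketch (the class and variable names are assumptions made for illustration):

```python
import numpy as np

class OneHotMatvecCache:
    """Cache W @ x for one-hot inputs x, keyed by the index of the hot element."""
    def __init__(self, W):
        self.W = W
        self.cache = {}

    def __call__(self, token_index):
        if token_index not in self.cache:
            # For a one-hot input, the matrix-vector product is a column lookup.
            self.cache[token_index] = self.W[:, token_index].copy()
        return self.cache[token_index]
```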

In [12], Han et al. propose a general-purpose hardware accelerator for fully-connected neural network layers. Implemented in 45nm CMOS technology, this chip is suitable for accelerating the matrix-vector multiplications found in many neural networks (including

LSTMs). This design utilizes sparsity coding and weight sharing in order to store all network parameters in on-chip SRAM, resulting in significant power savings and speedup.

The architecture also takes advantage of the sparsity of the input, dynamically scanning the vectors and broadcasting non-zero elements, along with their row index, to an array of processing elements. Non-zero input vector elements are multiplied by non-zero elements in the corresponding column of the weight matrix and then sent to a row accumulator. In order to balance workload distribution, each processing element has a queue at its input.

Input broadcast is halted when any processing element has a full queue.

In [29], Andri et al. propose a custom CMOS architecture for accelerating binary weight convolutional networks. The architecture is highly scalable and supports arbitrary input size and multiple kernel sizes. In this design, inputs are quantized to 12-bit

fixed-point and multiply-accumulate operations are implemented using fixed-point adders and multiplexers. Computational units are instantiated in parallel to support up to 32 input channels. A single fixed point multiply-accumulator unit is used to apply channel scaling and biasing, which processes output channels in an interleaved manner. Input and

filter weights are read from off-chip memory. Because of the high level of data reuse in

CNN computations, the architecture contains on-chip cache to store recently used input

30 and filter data as well as intermediate computational results, thus reducing communication with off-chip memory.

Li et al. propose an FPGA accelerator architecture for convolutional networks with binary weights and activations in [30]. The basic processing element of this design is composed of XNOR gates, which are implemented by lookup tables, and bit counters, which are implemented by an adder tree. Weights and layer activations are stored solely in on-chip memory. Weights and intermediate computational results are stored using block RAM, while feature maps are stored in distributed RAM. The authors choose an example CNN model with 6 convolutional layers to implement using the accelerator architecture (computation of the fully-connected layers is omitted from this design). This implementation is not scalable to other CNN models—network structure, layer sizes, and weight values are fixed at HDL compilation time.

CHAPTER IV

TRADE STUDY

Hardware design can be both time-consuming and costly. Therefore, it is critical to have an understanding of the targeted application as well as the design trade space before developing a hardware architecture. This chapter first provides an analysis of the FPGA accelerator design trade space. Then, a model compression strategy is selected and evaluated using two sequential datasets.

4.1 Analysis of Related Work

In this section, various FPGA design trades are discussed within the context of the results of previous, related work.

4.1.1 Compression

An important first step in optimizing a neural network model for hardware implementation is compression. Parameter compression can significantly reduce the memory footprint and computational complexity of inference computations. Several methods have been proposed for neural network model compression. These methods are generally classified into one of two approaches: parameter reduction or parameter quantization. The former approach reduces the number of parameters needed to compute model inference, while the latter reduces the number of bits needed to represent model parameters.

The pruning technique reduces the number of parameters by removing (i.e. setting to zero) unnecessary weight connections. After training, model parameters with an absolute value below some threshold are removed. Then, the sparse model is retrained using the remaining connections. The retrained model may be pruned and retrained again in order to achieve the desired level of sparsity. Using this technique, Han et al. demonstrated 9× and 13× reductions in the number of parameters for two well-known convolutional networks [16]. Building upon this approach, Han et al. also proposed a hardware-friendly pruning method, called load-balance-aware pruning [11]. This method facilitates the distribution of the processing workload by constraining regions of the parameter matrices to contain an equal number of non-zero connections. After pruning, the memory footprint of the model parameters is reduced using sparse matrix encoding, such as the compressed sparse column (CSC) or compressed sparse row (CSR) format. Accelerator hardware has been designed to operate directly on sparse encoded models and efficiently skip zero-valued connections [11, 12].
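To make the prune-and-retrain loop concrete, the following Python sketch illustrates magnitude-based pruning with iterative retraining. NumPy, the quantile-based threshold, and the retrain_fn placeholder are illustrative assumptions here; they are not the exact procedures used in [16] or [11].

    import numpy as np

    def magnitude_prune(weights, sparsity):
        # Zero out the smallest-magnitude weights until `sparsity` fraction are zero.
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    def prune_and_retrain(weights, retrain_fn, sparsity_schedule):
        # Iteratively prune, then retrain the surviving connections.
        # `retrain_fn(weights, mask)` is a stand-in for a framework-specific training
        # step that only updates weights where mask is True.
        mask = np.ones_like(weights, dtype=bool)
        for sparsity in sparsity_schedule:          # e.g. [0.5, 0.7, 0.9]
            weights, step_mask = magnitude_prune(weights, sparsity)
            mask &= step_mask
            weights = retrain_fn(weights, mask)     # retrain remaining connections
            weights = weights * mask                # keep pruned weights at zero
        return weights, mask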

Singular Value Decomposition (SVD) is another technique for reducing the number of model parameters. By keeping only the largest singular values, the dimensionality of the model parameter matrices can be reduced while maintaining an acceptable level of prediction accuracy. Giu et al. apply SVD to a fully-connected layer of a convolutional network, achieving a compression rate of 7× with only a 0.04% loss in prediction accuracy [17]. Prabhavalkar et al. apply this technique to a 5-layer LSTM for speech processing and were able to reduce model size by 3× with only a 0.5% loss in prediction accuracy [18]. An advantage of this approach is that it does not require special encoding (and consequently more memory) to store the compressed model; this translates to simplified accelerator hardware because the parameter matrix structure does not have to be decoded.
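A low-rank factorization of a single fully-connected weight matrix along these lines can be sketched as follows; the truncation rank is an assumed tuning parameter, not a value taken from [17] or [18].

    import numpy as np

    def svd_compress(W, rank):
        # Approximate an N x M weight matrix by two factors of the given rank,
        # replacing y = W @ x with y = A @ (B @ x).
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]      # N x rank (singular values folded into A)
        B = Vt[:rank, :]                # rank x M
        return A, B

    # A 1024 x 1024 layer truncated to rank 128 stores 128 * (1024 + 1024) = 262,144
    # values instead of 1,048,576, a 4x reduction in parameter count.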

One technique for quantizing parameters is known as weight sharing. With this approach, a small number of effective weight values are used [16]. After training, similar weights are identified using k-means clustering. The centroid of a cluster is then chosen to be the shared weight value. The quantized model can then be retrained using only the shared weights. This technique allows parameter matrices to be stored as indices into the shared weight codebook, thus reducing the memory footprint. For example, a weight sharing scheme with 32 shared weights requires only 5 bits to store an individual weight. Han et al. utilize weight sharing to enable on-chip parameter storage in their hardware accelerator for fully-connected layers [12]. A benefit to this approach is that it does not restrict the datatype of the actual weights; high-precision datatypes can be used for computation with minimal storage requirements.
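The clustering and codebook lookup described above can be sketched as follows. The use of scikit-learn's KMeans and the choice of 32 clusters (5-bit indices) are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def weight_share(W, n_clusters=32):
        # Cluster the weights and store each one as an index into a small codebook.
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(W.reshape(-1, 1))
        codebook = km.cluster_centers_.ravel()     # the shared weight values
        indices = km.labels_.reshape(W.shape)      # 5-bit index per weight (32 clusters)
        return codebook, indices.astype(np.uint8)

    def reconstruct(codebook, indices):
        # Recover the quantized weight matrix for computation.
        return codebook[indices]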

Another quantization technique is called binarization. This method constrains all weight values during training to +1 or −1, thus requiring only a single bit to represent each weight. An extension of this technique places the same constraints on layer activations as well. Various approaches have been proposed for training a neural network with binary weights and activations [21, 22, 23]. The approach detailed in [23] was applied to an LSTM model for text prediction and resulted in negligible loss in prediction accuracy compared to that of the full-precision model. Binarization comes with significant advantages for hardware implementation. In a weight-only binarization scheme, multiplication is reduced to a two's complement operation, thus eliminating the need for expensive hardware multipliers. Weight & activation binarization simplifies hardware implementation even further, requiring only XNOR and bit-count operations in place of multiply-accumulation. In both schemes, the model parameter storage size is drastically reduced. Efficient hardware accelerators for binary convolutional networks have been implemented [29, 30], but an accelerator architecture for binary LSTM has yet to be published.
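A minimal NumPy sketch of the two binarization variants is given below. It shows only the inference-time arithmetic, not the loss-aware training procedure of [23]; the function names are chosen here for illustration.

    import numpy as np

    def binarize_weights(W):
        # Deterministic sign binarization: each weight becomes +1 or -1 (one bit of storage).
        return np.where(W >= 0, 1, -1).astype(np.int8)

    def binary_weight_matvec(Wb, x):
        # Weight-only binarization: every "multiplication" is just x or -x.
        return np.where(Wb > 0, x, -x).sum(axis=1)

    def xnor_popcount_dot(w_bits, x_bits):
        # Weight & activation binarization: a dot product of +/-1 vectors encoded as
        # 0/1 bits reduces to XNOR followed by a bit count.
        xnor = np.logical_not(np.logical_xor(w_bits, x_bits))
        return 2 * np.count_nonzero(xnor) - xnor.size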

Finally, a commonly used quantization technique is to use fixed-point datatypes. This approach is discussed in more detail below.

4.1.2 Datatype

The fundamental datatype used for hardware implementation affects many of the decisions made during the FPGA design process, ultimately impacting energy efficiency and processing throughput. While full-precision floating-point data is often used during the training phase, research has shown that fixed-point datatypes with reduced word length can be used for inference with minimal loss in accuracy [19, 31, 32]. Besides the benefit of reduced memory footprint compared to floating-point datatypes, this quantization approach has a number of advantages. Firstly, it is simple to implement using a linear quantization approach, and tools that automate this process are freely available [31]. Secondly, fixed-point quantization can be used in conjunction with a parameter-reduction compression approach, such as pruning or SVD [17, 12]. Finally, fixed-point multiplication and addition map directly to dedicated DSP hardware units, such as the Xilinx DSP48 and its variants. When using a short enough word length, data can be packed such that DSP units are able to process two multiply-accumulate operations concurrently [33].

When specifying a fixed-point datatype, there is a trade between word length (the total number of bits), fraction length (the location of the decimal point), and the dynamic range that can be represented. In general, larger word lengths allow for larger dynamic range but come at the cost of increased memory footprint. It can be a challenge to find a balance between reducing word length and maintaining computational accuracy. A simple approach is to use a static fixed-point representation for all datatypes. Given a large enough word length, an appropriately set fraction length can allow enough room for the integer component to grow while maintaining an acceptable level of fractional precision. Many accelerator architectures have taken this approach due to its straightforward implementation in hardware [24, 25, 27, 12]. However, a multilayer model with a large number of parameters may have a wide dynamic range that is difficult to represent using a reduced word length.

A solution to this issue is to use a dynamic fixed-point representation [31]. With this approach, the dynamic range of each layer is analyzed separately in order to select an appropriate fraction length. In hardware implementation, this translates to extra bit shift operations between layers to align decimal points. The impact of dynamic fixed-point on convolutional networks has been widely studied and shown to yield minimal loss in prediction accuracy, even with word lengths as short as 4 or 8 bits [17, 31]. Studies related to LSTM are limited, but Shin et al. found that dynamic quantization of multilayer LSTMs should be separated not only by layer but also by connection type (i.e. feedforward vs. recurrent). Dynamic fixed-point has successfully been implemented in FPGA hardware accelerators [11, 17].
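The per-layer analysis described above amounts to choosing a fraction length from the observed dynamic range and then applying linear quantization, as in the following sketch; the rounding and saturation policy shown here is an assumption, not the exact scheme of [31].

    import numpy as np

    def choose_fraction_length(layer_data, word_length=8):
        # Reserve enough integer bits (plus sign) to cover the layer's dynamic range;
        # the remaining bits become fractional precision.
        max_abs = np.max(np.abs(layer_data))
        int_bits = int(np.ceil(np.log2(max_abs + 1e-12))) + 1
        return word_length - int_bits

    def quantize(data, word_length, frac_length):
        # Round-to-nearest, saturating fixed-point quantization; the returned values
        # are the real numbers the fixed-point words actually represent.
        scale = 2.0 ** frac_length
        qmin, qmax = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
        return np.clip(np.round(data * scale), qmin, qmax) / scale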

4.1.3 Memory

The physical storage of model parameters is an important trade for hardware accelerator design. Many state-of-the-art neural network models contain hundreds of megabytes' worth of parameters. FPGAs present a unique challenge for implementing a neural network accelerator because on-chip memory size and I/O bandwidth are often limited [34]. At the same time, off-chip DRAM access requires an order of magnitude more energy than on-chip memory access [16]. Additionally, off-chip parameter storage can be a processing bottleneck because the speed of matrix-vector operations is often limited by memory access time [17]. Indeed, in LSTM the bulk of computations are matrix-vector operations, so memory bandwidth is positively correlated with system throughput.

Due to the energy and access-time requirements of off-chip memory access, it would seem that on-chip parameter storage is the optimal design choice, given that the parameters can fit in the limited on-chip memory capacity. For many models, compression techniques can shrink parameter storage requirements enough to allow the entire model to be stored on-chip. In addition to the efficiency and performance benefits, on-chip parameter storage vastly simplifies the design process by eliminating the need for a memory interface and buffering schemes. In the case of models compressed using sparsity encoding, extra decoding logic is required, but the implementation of such logic may be simpler than implementing an external memory interface. Many hardware accelerators have been designed to take advantage of on-chip storage [25, 12, 30].

While on-chip parameter storage is both simple to implement and more efficient than using off-chip memory, it is also less scalable than its counterpart. Many accelerators using on-chip storage have been designed to support a single model; reconfiguring the design to support another model may be difficult or impossible given resource constraints.

Accelerators that utilize off-chip memory, on the other hand, must be designed to support an arbitrary model size and structure. Although off-chip memory access times can be relatively long, various design techniques, such as double-buffering, can be used to perform computation during memory downtime [11, 26]. Additionally, a hybrid memory approach that utilizes on-chip memory to cache recently used values can be used to reduce the amount of memory access [26, 29]. For LSTM, cell memories are a good candidate for on-chip caching because of the relatively small storage requirement. For example, with an output word size of 32 bits, a cell memory cache size of only 16 kB would support a layer size of up to 4096 cells. Additional on-chip memory could be used to store a portion of (or, if the model were small enough, all of) the model parameters. While not optimally energy-efficient, many FPGA accelerators that read parameters from off-chip have achieved significant power savings and better or comparable computational speedup compared to CPU and GPU counterparts [11, 24, 26, 28, 29]. These designs are more complex than those that rely on on-chip memory, but they provide more utility as general-purpose model accelerators.

4.2 Effectiveness of Binarization

The first step in defining an accelerator hardware architecture is selecting a model compression strategy; this choice informs the entire design process. Not only does this strategy determine the format of the data being processed, it also determines the required computational elements, the memory requirements, and ultimately the performance and throughput of the system.

For this work, binarization was selected as the model compression strategy. Specifically, the weight-only loss-aware binarization training method was used [23]. This strategy is compelling for three reasons. First, it drastically reduces the width of model parameters. Since memory access is the computational bottleneck in LSTM, a strategy that increases the number of parameters contained in each memory word is desirable. Second, it eliminates multiplication operations from the computation of matrix-vector products, which simplifies the design of a hardware accelerator. Third, a binary LSTM accelerator has not yet been studied. Indeed, only one published work to date has even tested the efficacy of binarization with LSTM [23]. However, this work only examined one application of LSTM, namely text prediction. In order to motivate the design of a hardware accelerator, two additional applications are considered.

4.2.1 Handwriting Recognition

Automatic recognition of handwritten characters is a task that has been thoroughly explored in the machine learning community. The MNIST dataset, which contains 70,000 images of handwritten digits, has been evaluated using dozens of classifier models [35].

While many feed-forward- and convolutional-neural-network-based approaches have been shown to work well with MNIST, the static nature of the images in this dataset does not lend itself well to a recurrent-neural-network-based classifier. However, a variant of MNIST, called MNIST Stroke Sequence, is a more natural fit for LSTM [36].

As its name suggests, MNIST Stroke Sequence is composed of sequences that illustrate the path a pen could take to produce a handwritten digit. The pen stroke sequences are extracted from the original MNIST data using thresholding and thinning techniques (see Figure 4.1). The data has four features: dx and dy, which are integers representing the horizontal and vertical movement, respectively, in pixels from one timestep to the next; an end-of-stroke indicator, which takes the value 1 at the last point of a stroke and 0 otherwise; and an end-of-digit indicator, which takes the value 1 at the last timestep in the sequence and 0 otherwise. Thus, MNIST Stroke Sequence is not a particularly complex dataset and should be fairly easy to learn. Even so, it brings value as another test case for the efficacy of binarization.

Figure 4.1: An example MNIST Stroke Sequence source image (left) and extracted sequence (right).

A fully-connected LSTM architecture (shown in Figure 4.2) was used to create a classifier model for the handwritten digit sequence-labeling task. In this architecture, the cells of the last LSTM layer are connected to an output layer of the same size as the number of classes (in this case, 10) and activated with a softmax layer. The training objective is to minimize cross-entropy loss over all target sequences. We only care about the output prediction at the end of a sequence, so classification accuracy is evaluated only at the last timestep in a sequence.
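The classifier just described can be expressed compactly in a modern deep learning framework. The PyTorch sketch below is an illustration only; the thesis does not specify its training framework, and the layer count and size shown are placeholders drawn from the configurations in Table 4.1.

    import torch
    import torch.nn as nn

    class StrokeClassifier(nn.Module):
        # Fully-connected LSTM classifier for the stroke-sequence labeling task.
        def __init__(self, num_layers=2, layer_size=128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=4, hidden_size=layer_size,
                                num_layers=num_layers, batch_first=True)
            self.fc = nn.Linear(layer_size, 10)      # one output unit per digit class

        def forward(self, x):                        # x: (batch, timesteps, 4 features)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1, :])            # logits at the last timestep only

    # Training minimizes nn.CrossEntropyLoss (softmax + cross-entropy) on these logits.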

To test the impact of binarization on classification accuracy, training was performed using both full-precision floating-point and binary weights. Various configurations of the number of layers and layer size were tested. For the case of multiple layers, each layer contains the same number of cells. The results of these experiments are shown in Table 4.1. As can be seen, the difference in classification accuracy due to binarization is minimal for this task. In fact, the binarized model performs slightly better than its full-precision counterpart in the case of 2 layers of size 256. These results suggest that a binarization scheme can be used to compress a model for the handwritten sequence recognition task with little to no loss in prediction accuracy.

Figure 4.2: Fully-Connected LSTM Architecture (Input → LSTM → FC → Softmax → Output).

Table 4.1: Testing set accuracy for the MNIST Stroke Sequence dataset.

Hidden Layers            1       2       1       2       1       2
Layer Size               64      64      128     128     256     256
Accuracy (%):
  Full-Precision         97.78   98.11   98.24   98.23   98.12   98.18
  Binary                 96.01   97.49   97.29   97.84   98.01   98.22

4.2.2 Speech Recognition

While binarization showed favorable results for the relatively simple task of handwritten digit recognition, a more complex test case is desired. Therefore, a speech recognition application was examined, specifically the task of phoneme labeling. A phoneme is the smallest unit of sound from which spoken words are composed. In contrast with the written English alphabet, the phonetic alphabet has unique labels for the specific sound being made within the context of a word (e.g. /uh/ in book versus /uw/ in boot). Thus, the task of phoneme labeling seeks to assign a label to the phoneme being spoken at each moment in time in an audio recording. The output produced by a phoneme labeling model could be useful by itself, or it could be fed into a separate speech recognition model which understands full words or sentences from phonetic labels.

Figure 4.3: Bidirectional Fully-Connected LSTM Architecture (forward and reverse input streams feeding parallel LSTM blocks, followed by a shared FC and Softmax output layer).

For this work, the TIMIT dataset was used as training material for a phoneme labeling model [37]. This dataset contains raw audio recordings of English speech. It includes 6300 sentences (10 sentences spoken by each of 630 speakers) for a total speech time of 5.4 hours. TIMIT consists of 2342 unique, phonetically balanced sentences. Each sentence is annotated with timestamped labels of the phoneme being spoken at each moment in time. TIMIT uses 61 unique phonemes for its alphabet.

A bidirectional, fully-connected LSTM architecture (shown in Figure 4.3) was used to create a classifier for the phoneme labeling task. Just as in the architecture used for the handwritten digit recognition task, the cells of the last LSTM layer are fully connected to an output layer consisting of 61 units. However, unique to this architecture is the use of two identically sized blocks of LSTM cells operating in parallel within each layer. One block processes the input in the forward direction of the sequence (i.e. from the first timestep to the last) as usual, while the second block processes the input sequence in reverse order (i.e. from the last timestep to the first); hence the term bidirectional.

Figure 4.4: Raw audio signal (top) and short-time power spectrum (bottom) for an example phrase from the TIMIT training set.

Raw audio was not fed directly to the model; rather, the short-time power spectrum of the signal was used as input. This work followed the pre-processing approach outlined in [38]. Specifically, the TIMIT audio was split into overlapping frames with a duration of 25 ms and a 10 ms shift between frames. Then, each frame is converted into its frequency-domain representation, and the Mel-frequency cepstral coefficients (MFCCs) and their regression coefficients are extracted to be used as input to the model. This results in an input feature vector of size 39 (13 MFCCs, plus first- and second-order regression coefficients). Timestamped phoneme labels were converted to frame-wise labels to align with the input. For this task, we are interested in the accuracy of the predicted label for each frame of input. The training objective, therefore, is to minimize the categorical cross-entropy at each timestep (frame) in the input sequence.
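The frame-wise feature extraction can be reproduced with standard audio tooling. The sketch below uses librosa, which is an assumption; the thesis follows [38] but does not name the preprocessing library it used.

    import numpy as np
    import librosa

    def timit_features(wav_path):
        # 39-dimensional input per 10 ms frame: 13 MFCCs plus first- and second-order
        # regression (delta) coefficients, computed over 25 ms analysis windows.
        signal, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),       # 25 ms window
                                    hop_length=int(0.010 * sr))  # 10 ms shift
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T           # shape: (num_frames, 39)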

To test the impact of binarization on classification accuracy, training was performed using both full-precision floating-point and binary weights. Various configurations of the number of layers and layer size were tested. For the case of multiple layers, each layer contains the same number of cells. The results of these experiments are shown in Table 4.2. These results demonstrate a larger discrepancy in prediction accuracy between the full-precision and binary versions of the model. For smaller model sizes, binarization can result in nearly a 30% drop in accuracy (1 layer / 256 cells). However, as model size grows, the binarized version begins to catch up with its full-precision counterpart. For 2 layers / 512 cells we see only a 3.18% discrepancy in accuracy, with the binary model achieving 72.30% accuracy.

Table 4.2: Testing set accuracy for the TIMIT dataset.

Hidden Layers                2       4       1       2       1       2
Layer Size [2]               128     128     256     256     512     512
Accuracy (%) [3]:
  Full-Precision             73.33   71.18   73.27   75.31   74.38   75.48
  Binary                     50.20   53.55   43.51   65.81   62.01   72.30

[2] Because of the bidirectional LSTM architecture used, the Layer Size parameter refers to the number of LSTM cells used for a single direction within the layer. Thus, a layer of size 128 contains 256 total cells.
[3] Frame-wise accuracy, i.e. the percentage of frames in an input sequence that were assigned the correct phoneme label, averaged over all sequences in the testing set.

It seems that for both training schemes, a deeper architecture adds more value than a wider one. However, the effect of additional layers is more pronounced in the results of the binary model. For example, a model with 2 layers / 128 cells achieves 50.20% accuracy, while 1 layer / 256 cells achieves only 43.51%. Similarly, 2 layers / 256 cells achieves 65.81% accuracy, while 1 layer / 512 cells achieves only 62.01%. Both pairs of models contain the same number of parameters; the difference lies in how the cells are connected. This suggests that a hierarchical representation of the input features can be learned, which assists the task of phoneme labeling.

The best-case results of the binary model suggest that binarization may be a viable approach to model compression, even for complex applications like speech recognition. The judgment of the suitability of this approach now falls on the application designer. It could be that 72.30% is an acceptable level of phoneme labeling accuracy for a particular application. If this is the case, a binarized model provides an immediate 32× reduction in parameter memory footprint (compared with full-precision weights), in addition to the benefit of eliminating the multiplication operation in computing matrix-vector products. It is possible that even better accuracies may be achieved by tuning the number of layers and layer sizes. This work will leave the exploration of this possibility to another researcher; for the sake of motivating the design of a hardware accelerator system for binary-weight LSTM, these results have served their purpose. In the following chapter, the design of such a system will be explored.

CHAPTER V

SYSTEM DESIGN

When designing a hardware accelerator, there are two important design areas to consider: datapath and memory. The former refers to how computations will be implemented and the order in which they take place. The latter is concerned with where the data required for computations will come from and how it will be accessed. Before one can dive into the details of design implementation, a high-level architectural strategy concerning both of these areas must be decided. This chapter walks through the design process and the high-level decisions made for the implementation of the binary-weight LSTM accelerator.

Throughout the chapter, the following terms will be used to define the size of a model:

M – the dimensionality of the input signal, i.e. the length of the input vector

N – the number of cells in the LSTM layer, i.e. the length of the output vector

5.1 Datapath

In order to determine the computations required for implementing LSTM inference, let us revisit the LSTM equations presented in Chapter 2:

i_t = sigm(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (5.1)

j_t = tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j)    (5.2)

f_t = sigm(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (5.3)

o_t = sigm(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (5.4)

c_t = i_t ⊙ j_t + f_t ⊙ c_{t-1}    (5.5)

h_t = o_t ⊙ tanh(c_t)    (5.6)

As we can see, there are three types of computations being performed in these equations: matrix-vector operations (e.g. W_{xj} x_t), activation functions (e.g. sigm, tanh), and element-wise operations (e.g. i_t ⊙ j_t). Matrix-vector operations are the basis of the term inside the activation function in the gate equations (5.1, 5.2, 5.3, 5.4), so we will refer to this computation as "gate pre-activation". As we have seen, it is this computation that is responsible for the majority of the computational complexity in LSTM, having O(M · N) complexity. Therefore, the first step in the design process will be to determine a strategy for speeding up gate pre-activation calculations. The other computation types, activation functions and element-wise operations, require only O(N) complexity. We will group these computations together and refer to them as "cell calculations". These computations require the output of the gate pre-activations as their input, so the implementation strategy for cell calculations will follow from the strategy for implementing gate pre-activation.
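For reference, the six equations map directly onto the following NumPy sketch of a single inference step; the dictionary-based parameter layout is an illustrative choice, not the storage format used by the accelerator.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W[g], U[g], b[g]: input weights, recurrent weights, and bias for gate g.
        i = sigm(W['i'] @ x_t + U['i'] @ h_prev + b['i'])       # Eq. 5.1
        j = np.tanh(W['j'] @ x_t + U['j'] @ h_prev + b['j'])    # Eq. 5.2
        f = sigm(W['f'] @ x_t + U['f'] @ h_prev + b['f'])       # Eq. 5.3
        o = sigm(W['o'] @ x_t + U['o'] @ h_prev + b['o'])       # Eq. 5.4
        c = i * j + f * c_prev                                  # Eq. 5.5 (element-wise)
        h = o * np.tanh(c)                                      # Eq. 5.6 (element-wise)
        return h, c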

5.1.1 Gate Pre-activation

As seen in equations 5.1, 5.2, 5.3, & 5.4, the gate pre-activation consists of the sum of three terms: input connections (e.g. W_{xi} x_t), hidden state connections (e.g. W_{hi} h_{t-1}), and bias (e.g. b_i). Rather than explicitly compute this sum, we observe that we can combine terms by concatenating the input and hidden state vectors as well as the parameter matrices and the bias vector. We now rewrite equation 5.1 as

i_t = sigm(ĩ_t) = sigm(W_i x̃_t)    (5.7)

where

ĩ_t denotes the pre-activation term for the input gate,

x̃_t is the concatenation of x_t, h_{t-1}, and 1, i.e.

x̃_t = [ x_1(t)  x_2(t)  ...  x_M(t)  h_1(t-1)  h_2(t-1)  ...  h_N(t-1)  1 ]^T

and W_i is the N × (M + N + 1) concatenated weight and bias matrix of the input gate parameters, i.e.

W_i = [ w_{x_1 i_1}  w_{x_2 i_1}  ...  w_{x_M i_1}  w_{h_1 i_1}  w_{h_2 i_1}  ...  w_{h_N i_1}  b_{i_1}
        w_{x_1 i_2}  w_{x_2 i_2}  ...  w_{x_M i_2}  w_{h_1 i_2}  w_{h_2 i_2}  ...  w_{h_N i_2}  b_{i_2}
        ...
        w_{x_1 i_N}  w_{x_2 i_N}  ...  w_{x_M i_N}  w_{h_1 i_N}  w_{h_2 i_N}  ...  w_{h_N i_N}  b_{i_N} ]

and similarly for the other three gate equations. To generalize, we will denote the gate pre-activation term g̃_t = W_g x̃_t, where W_g represents one of the four gate parameter (i.e. concatenated weight and bias) matrices W_i, W_j, W_f, and W_o for the input, new-input, forget, and output gates, respectively. We will refer to W_g as the parameter matrix and the concatenated input/hidden state/unity vector x̃_t simply as the input vector.

We see that for each gate pre-activation, the computation of the matrix-vector product requires N · (M + N + 1) multiplications and N · (M + N) additions. We also note that each element of the parameter matrix W_g is required only once, for a single multiplication operation with the corresponding element of the input vector x̃_t. Conversely, each element of x̃_t is required N times, once per row of W_g. A good design strategy will be to consider the re-use of the input vector data.

In order to accelerate the computation of the matrix-vector product, we consider how the operation can be parallelized. We note first that the computation of an individual output term requires the entire input vector and a single row of the parameter matrix. This is made explicit by Equation 5.8 below:

g̃_n(t) = w_{x_1 g_n} x_1(t) + w_{x_2 g_n} x_2(t) + ... + w_{x_M g_n} x_M(t)
        + w_{h_1 g_n} h_1(t-1) + w_{h_2 g_n} h_2(t-1) + ... + w_{h_N g_n} h_N(t-1) + b_{g_n}    (5.8)

where

g̃_n(t) is the n-th element of the N-dimensional pre-activation vector g̃_t.

As noted, the computation of an output vector element depends only on a single row of the parameter matrix. Therefore, we may compute each of these results in parallel to one another using row-wise matrix-vector multiplication. This approach is described in the following paragraphs.

Figure 5.1: Row-wise Matrix-Vector Multiply Processing Element (PE).

The basic processing element (PE) used by this technique consists of a cascaded multiplier, adder, and single-cycle delay unit (see Figure 5.1). The inputs to the multiplier are the input term and parameter term; the inputs to the adder are the output of the multiplier and the output of the delay; the input to the delay is the output of the adder. The delay unit has a reset signal which will set the memory state, and thus the current output, to zero. The Matrix-Vector Product Unit (MVPU) consists of an array of P parallel processing elements, where P is the parallelization factor. The MVPU has P weight inputs, each of which is fed to an individual PE, and a single input vector element input, which is fed to all PEs (see Figure 5.2).

The row-wise algorithm proceeds as follows. On the first clock cycle, rows 1 through P from the first column of W_g, along with x̃_1, are fed to the PE array inside the MVPU. The reset signal to the delay units is held high during this first cycle. Because of this, the result of the multiplication is added with zero, and this result is then stored in the delay unit. On the next cycle, rows 1 through P from the second column of W_g, along with x̃_2, are fed to the PE array. Now, the multiplication result of the current cycle is accumulated with the multiplication result from the previous cycle, which was fed to the addition unit by the output of the delay unit. This proceeds until the (M + N + 1)-th cycle, in which rows 1 through P from the (M + N + 1)-th (last) column of W_g, along with the last element of x̃, are fed to the PE array. All of the row-element multiplication results have been accumulated, so the output of the addition unit in this cycle is the valid pre-activation vector element.

Figure 5.2: Matrix-Vector Product Unit (MVPU) Architecture.

The procedure then starts over, with rows P + 1 through 2P from the first column of W_g, along with x̃_1, being fed to the MVPU, and so on, until all N output vector elements have been computed. If N is not an integer multiple of P, then the last P − (N mod P) results are ignored. This procedure is illustrated by Table 5.1 below, in which

W_g = [ 8 1 3
        2 3 5
        5 2 2
        4 7 5 ]

is multiplied by x̃_t = [ 6  4  1 ]^T, for a result g̃_t = [ 55  29  40  57 ]^T, with parallelization factor P = 2.

Table 5.1: Example row-wise matrix-vector multiplication procedure (P = 2). Output vector results are found in the Add_p columns when valid = 1.

Time  x̃_t  w_1  Mult_1  Add_1  w_2  Mult_2  Add_2  reset  valid
 1     6    8    48      48     2    12      12     1      0
 2     4    1    4       52     3    12      24     0      0
 3     1    3    3       55     5    5       29     0      1
 4     6    5    30      30     4    24      24     1      0
 5     4    2    8       38     7    28      52     0      0
 6     1    2    2       40     5    5       57     0      1

Using the row-wise implementation, P elements of the pre-activation vector can be computed every M + N + 1 clock cycles. This results in a maximum speedup of P times faster than a scalar implementation (maximum speedup is achieved when N is an integer multiple of P ). The approach is simple but effective—especially when P is chosen to be large. The control logic required to implement the procedure is straightforward, which allows for faster clock rates when implemented in hardware. There is also very little memory resource usage, requiring only a single register per processing element. Only one multiplier and adder are used per processing element as well.
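The procedure of Table 5.1 can be checked against a simple software model of the MVPU. The sketch below iterates over columns exactly as described, with the accumulator array standing in for the P delay registers; it is a behavioral model, not the HDL implementation.

    import numpy as np

    def row_wise_matvec(Wg, x_tilde, P):
        # Behavioral model of the row-wise MVPU: one pass of (M + N + 1) "cycles"
        # per group of P output rows, with zero-padding when N is not a multiple of P.
        N, K = Wg.shape
        n_blocks = -(-N // P)                       # ceil(N / P)
        Wp = np.zeros((n_blocks * P, K), dtype=Wg.dtype)
        Wp[:N] = Wg
        result = np.zeros(n_blocks * P)
        for blk in range(n_blocks):
            acc = np.zeros(P)                       # the P delay registers (reset to 0)
            for col in range(K):                    # one clock cycle per column
                acc += Wp[blk * P:(blk + 1) * P, col] * x_tilde[col]
            result[blk * P:(blk + 1) * P] = acc
        return result[:N]

    # Reproducing Table 5.1: row_wise_matvec(np.array([[8,1,3],[2,3,5],[5,2,2],[4,7,5]]),
    # np.array([6,4,1]), P=2) returns [55, 29, 40, 57] after two passes of three cycles.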

In Table 5.1, the row-wise matrix-vector multiplication procedure is demonstrated using integer weights. However, with binary-weight LSTM, the implementation can be optimized for hardware by modifying the architecture of the processing element shown in Figure 5.1. Multiplication can be implemented by simply replacing the multiplier with a 2-to-1 multiplexer, where the 0th input to the multiplexer is the additive inverse of the input vector element, and the 1st input to the multiplexer is just the input vector element. The weight element, which in a binary-weight scheme is a single bit, is then fed to the multiplexer as the control signal, passing the unmodified input when high (i.e. w = +1) and passing the additive inverse of the input when low (i.e. w = −1). This approach both eliminates the need for a hard multiplier unit and allows for faster clocked logic. The binary-weight PE architecture is shown in Figure 5.3.

Figure 5.3: Binary-Weight Row-wise Matrix-Vector Multiply Processing Element (PE).
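One cycle of the binary-weight PE then reduces to a multiplexer select and an add, as in this small behavioral sketch (the signal names are illustrative).

    def binary_pe_cycle(acc, x_elem, w_bit, reset):
        # The 2-to-1 mux passes +x when the weight bit is 1 (w = +1) and -x when it
        # is 0 (w = -1); the adder accumulates into the delay register.
        product = x_elem if w_bit else -x_elem
        return product if reset else acc + product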

5.1.2 Cell Calculations

With an implementation strategy decided for the gate pre-activations, we now look into how to implement the cell calculations, which consist of activation functions and element-wise multiplication and addition. Using the row-wise matrix-vector multiplication algorithm, we have P new results available every M + N + 1 clock cycles. We could maintain parallelism, processing the results in P separate streams. However, this would be both a waste of resources and an under-utilization of clock cycles. We must process the pre-activation results in M + N + 1 clock cycles, but there is very little benefit to finishing this workload early, as the computation time will still ultimately be dominated by the matrix-vector product. Therefore, we implement an approach that uses a single data stream.

The first function required for the single-stream approach is a parallel-to-serial stream conversion unit. This unit latches the P results when they are ready from the pre-activation unit. Then, over P cycles, it iterates through the pre-activation results sequentially, outputting one item per cycle. The output of the parallel-to-serial unit is fed to the cell calculation unit (CCU). The CCU consists of an activation function unit, a multiplication unit, an addition unit, and two memory units. In order to illustrate the purpose of each of these units, we refer back to equations 5.1–5.6. Here, we note a few things. Firstly, the pre-activation results must be fed through an activation function, either sigmoid or hyperbolic tangent, in order to complete the computation of equations 5.1–5.4. Therefore, the activation function unit should be the first unit in the stream, and it should be capable of performing either activation function, depending on the gate being calculated. Secondly, we look at the three element-wise multiplication terms:

(1) i_t ⊙ j_t,

(2) f_t ⊙ c_{t-1}, and

(3) o_t ⊙ tanh(c_t).

Because we have a serial data stream, there are a few dependencies to consider. Term 1 requires that either i_t or j_t has been computed, so we will have to store the activated result of either the input or new-input gate in a temporary memory. Term 2 requires the previous cell state, which, by definition, will already be available from the computation of the previous input. However, we will require additional temporary storage for this term to be consumed later on in the computation of c_t. Term 3 requires that c_t has been computed, so we must compute the output gate last. This term also implies that a second activation function unit will be required, because the primary one will be occupied in computing o_t. This secondary unit need only compute the hyperbolic tangent function. Finally, we note that there is only one element-wise addition operation, namely for the computation of c_t, which takes two element-wise products as its input. Therefore, the addition unit must be placed downstream from the multiplication unit.

Given all of these observations, we require that the gates be calculated in the following order: (1) new-input, (2) forget, (3) input, (4) output. Within the CCU, the behavior of the activation function unit, the inputs to the multiplication and addition units, and the temporary memory destination are then determined by the gate being calculated. This schedule and specification are detailed below in Table 5.2:

Table 5.2: Cell Calculation Unit (CCU) schedule. Mult Operand indicates the second operand; the first operand is the result of the activation unit. Similarly for the Add Operand, the first operand comes from the result of the multiplication unit.

Order  Gate  Activation  Mult Operand  Add Operand  Destination
1      j_t   tanh        1             0            MEM1
2      f_t   sigm        c_{t-1}       0            MEM2
3      i_t   sigm        MEM1          MEM2         c_t
4      o_t   sigm        tanh(c_t)     0            h_t

The required CCU functionality is accomplished by the architecture shown in Figure 5.4. Note that the gate schedule is not explicitly implemented in this unit; rather, the control logic for the gate pre-activation unit is required to compute the pre-activation results in the specified order. The upstream logic must also encode a gate identifier, along with the vector index, of the pre-activation data element being processed. Both of these items are used to decode a read address, write address, and write enable signal for the temporary memory units. The control signal for the multiplexers is determined solely from the gate identifier.
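The schedule of Table 5.2 can be traced for a single cell index with the following sketch; MEM1 and MEM2 are modeled as plain variables, and the activation functions are applied in software rather than by dedicated hardware units.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ccu_schedule(pre_act, c_prev):
        # pre_act maps gate name -> pre-activation value, arriving in order j, f, i, o.
        mem1 = np.tanh(pre_act['j']) * 1 + 0          # order 1: j_t  -> MEM1
        mem2 = sigm(pre_act['f']) * c_prev + 0        # order 2: f_t  -> MEM2
        c_t = sigm(pre_act['i']) * mem1 + mem2        # order 3: i_t  -> cell state c_t
        h_t = sigm(pre_act['o']) * np.tanh(c_t) + 0   # order 4: o_t  -> output h_t
        return h_t, c_t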

The CCU architecture implies the use of two on-chip RAMs to implement the temporary result storage. The mechanism of storage for the cell state and hidden layer output (c_t and h_t, respectively) has not been specified thus far, though. In the next section, we will examine a strategy for memory usage, and the nature of the storage of these two terms will be made explicit.

Figure 5.4: Cell Calculation Unit (CCU) Architecture.

5.2 Memory

With a strategy set for implementing the datapath, we must now decide on a strategy for how memory will be used in the design. This strategy will determine how input data is fed to the accelerator and where output results are stored. Broadly, there are two choices for memory usage: internal (on-chip) and external (off-chip). The former is significantly faster to access and uses less energy. However, on-chip memory resources are limited, so the use of external memory will be necessary. The datapath architecture may require certain items to be available immediately; these cases will necessitate the use of internal memory, as we saw with the RAM units used for CCU temporary results. However, any large group of data items will likely need to be stored off-chip.

In this section, we will examine the categories of data for which memory will be used: input vector, parameter matrices, hidden layer output vector, and cell state vector. For each of these categories, a strategy for using internal memory, external memory, or a hybrid configuration will be identified.

5.2.1 Input Vector

Let us consider the characteristics of the input to the inference system, x_t. It is an M-dimensional vector containing unknown values. These values come from a source that is external to the accelerator, most likely a CPU or possibly directly from a sensor. Since the design of the accelerator is targeted for an embedded application, whatever device is producing the input to the system will most likely be sharing a communication bus and/or memory interface with the accelerator device. The rate at which the input is being produced or sampled is unknown, however, and it may be variable. Therefore, a simple approach to transmitting input data to the accelerator is to store the values in external memory. In this way, it does not matter where the input data comes from or when it was first stored; control registers simply tell the accelerator where to find the input data in memory and how much data to read. The accelerator communicates with external memory through a memory interface controller, which streams the input data sequentially from memory into the accelerator's datapath. Depending on the width of the memory interface, multiple input words may be able to be delivered in a single clock cycle.

At the same time, we have seen that input vector elements are reused by the matrix-vector computation unit. The input data must persist for at least ⌈N/P⌉ clock cycles, and therefore it cannot be streamed directly to the datapath. And while the dimensionality of the input vector is fixed for a given model and known at run time, we would like the model dimensions to be parameters of the system so that it can support models of different sizes. So, we require the use of on-chip addressable memory, rather than a fixed-size register file, to implement a "holding area" for the input vector.

Transactions with external memory are costly, so it would not be efficient to perform a read from external memory for each vector in the input sequence. Instead, we would like to allow for "batching" of input: bringing in multiple input vectors from off-chip memory to be processed at once. To implement this, we simply make the "holding area" memory unit large enough to support multiple input vectors (i.e. a batch), add registers to specify the batch size, and implement control logic to support iterating through the batch. This on-chip cache memory is loaded from external memory at the start of a batch sequence, and when all items in the batch have been consumed, the next batch is fetched. This technique reduces the number of external memory transactions, allowing for higher overall processing throughput. However, it also increases the single-input processing latency. Batch size, therefore, is a parameter of the system that the user will have to adjust according to the requirements of the application.

The memory size requirement for the input cache unit is reasonable due to the moderate input vector size for most sequence learning models. For example, the speech recognition model in Section 4.2.2 has an input size M = 39. With a 16-bit input word size, a 128 kB input cache would allow for a maximum batch size of 1680 for this model. A greater cache size can be used to support larger batch sizes if the target hardware platform has sufficient on-chip memory resources.

5.2.2 Parameter Matrices

The gate parameter matrices, W_i, W_j, W_f, and W_o, bring two topics to consider: memory implementation and in-memory organization.

Implementation

The parameter matrices contain the majority of the data that the accelerator must handle. Between the four gates, there are 4 · N · (M + N + 1) total parameter values for a single LSTM layer. Even though each parameter of a binary-weight model requires only a single bit of storage, the total size of some models may be too large to be stored completely in on-chip memory. For example, the best-performing speech recognition model variant in Section 4.2.2 contains 6.16 MB of parameter data in the LSTM layers. This requirement is too large for most of the Xilinx Virtex-7 devices, a mid-to-high-end FPGA family that is commonly used in high-performance computing systems [39]. Therefore, a memory strategy for dealing with model parameters must use external memory storage.

When considering how parameter data is used in the matrix-vector product computation, we see that P data items are consumed on each step (as illustrated in Table 5.1). There is no reuse of parameters (for a single-input computation), and a continuous computational flow depends on P new parameters being available every clock cycle. Because of this, the matrix-vector computation time, and thus the computation time of the entire system, is memory bandwidth-limited.

In a fully-external memory scheme for parameter storage, there is no benefit to making P wider than the memory bus width. To illustrate this fact, consider a system with a memory bus width MemWidth_external = 8 (i.e. 8 parameter values are delivered from external memory per clock cycle). With a configuration of P = 8, 8 parameter data items would be consumed per clock cycle. With a configuration of P = 32, the memory bus would require 4 clock cycles to provide all 32 items needed to perform a computation step. Thus, we have the same average performance of 8 items per cycle. The extra parallelism beyond the memory bus width is just a waste of chip resources.

However, if some portion of the model parameters can be stored in on-chip memory, then we can benefit from a partial speedup by making the width of the on-chip memory interface, as well as the parallelism factor, larger than the width of the off-chip memory bus. To illustrate, consider a system with MemWidth_external = 8 and MemWidth_internal = P = 32. If just 25% of the model parameters can fit in on-chip memory, then the average data consumption of the matrix-vector calculation is 0.25 · (32 items/cycle) + 0.75 · (8 items/cycle) = 14 items/cycle. This is a performance improvement of 1.75× over a fully-external memory approach.

Given the favorable performance impact, we implement a hybrid memory approach for the accelerator architecture. An on-chip cache for the model parameters is implemented to have a relatively large storage capacity and a wide output width (i.e. cache line size), which matches the matrix-vector parallelism factor. Once during system startup, whatever portion of the model parameters can fit on-chip is written to the cache via reading from external memory. A register is used to indicate the largest address contained in the cache, after which subsequent data items will have to be read from external memory. For the portion of the pre-activation calculations that uses parameters found in on-chip memory, computation can be performed continuously. For the portion that uses parameters found off-chip, computation will proceed every P / MemWidth_external clock cycles.

In-Memory Organization

The order in which the gate parameters are stored in memory is an important factor to consider before implementing a memory controller to read the data. We want the logic of the memory controller to be as simple as possible, so ideally the parameter data can be arranged in memory in the order in which it will be accessed. Based on the datapath strategy identified, we can determine this order.

For the gate pre-activation calculations, we know that the row-wise matrix-vector multiplication algorithm iterates through the columns of the parameter matrix, using P rows per step. At the last column, the algorithm moves on to the next P rows and starts iterating again from the first column to the last column. Therefore, the in-memory layout of the parameter matrix data should start with the data at index (1,1), followed by (2,1), and so on until (P,1). Then, (1,2), (2,2), and so on until (P,2). This order continues until (P, M+N+1), after which we have the data at index (P+1, 1), then (P+1, 2), and so on. If N is not an integer multiple of P, then P − (N mod P) zero-valued filler rows are padded to the bottom of the matrix before converting to in-memory order. The in-memory organization for an example parameter matrix, with P = 2, is shown in Figure 5.5 below.

W_g = [ 8 1 3
        2 3 5
        5 2 2
        4 7 5
        9 6 1 ]   ⇒   in-memory order (indices 1–18):  8 2 1 3 3 5 5 4 2 7 2 5 9 0 6 0 1 0

Figure 5.5: In-memory organization for parameter data with parallelism factor P = 2. Parameters consumed by the row-wise matrix-vector multiplication algorithm on the same cycle are adjacent in the memory stream. An imaginary last row is padded with zeros before the matrix is converted to in-memory order.

For cell calculations, we have established the required gate order, which is shown in Table 5.2. To simplify control logic, we want to organize the four gate parameter matrices in memory such that they are fetched according to the schedule automatically. We can do this simply by forming one large parameter matrix W, which is a vertical concatenation of the four gate parameter matrices. The gate parameter matrices are stacked from top to bottom in the required order:

W = [ W_j
      W_f
      W_i
      W_o ]
  = [ w_{x_1 j_1} ... w_{x_M j_1}  w_{h_1 j_1} ... w_{h_N j_1}  b_{j_1}
      ...
      w_{x_1 j_N} ... w_{x_M j_N}  w_{h_1 j_N} ... w_{h_N j_N}  b_{j_N}
      w_{x_1 f_1} ... w_{x_M f_1}  w_{h_1 f_1} ... w_{h_N f_1}  b_{f_1}
      ...
      w_{x_1 f_N} ... w_{x_M f_N}  w_{h_1 f_N} ... w_{h_N f_N}  b_{f_N}
      w_{x_1 i_1} ... w_{x_M i_1}  w_{h_1 i_1} ... w_{h_N i_1}  b_{i_1}
      ...
      w_{x_1 i_N} ... w_{x_M i_N}  w_{h_1 i_N} ... w_{h_N i_N}  b_{i_N}
      w_{x_1 o_1} ... w_{x_M o_1}  w_{h_1 o_1} ... w_{h_N o_1}  b_{o_1}
      ...
      w_{x_1 o_N} ... w_{x_M o_N}  w_{h_1 o_N} ... w_{h_N o_N}  b_{o_N} ]    (5.9)

Then, by following the in-memory organization scheme for the parameter matrix, the parameter data can be retrieved from memory already in the correct order for computation.
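The reordering of Figure 5.5 and the gate stacking of Equation 5.9 can be combined into a single flattening routine, sketched below; the NumPy reshape/transpose idiom is an illustrative software model of the layout, not the tool actually used to generate the memory image.

    import numpy as np

    def to_memory_order(W_stacked, P):
        # W_stacked is the vertical concatenation [Wj; Wf; Wi; Wo] of Equation 5.9.
        rows, cols = W_stacked.shape
        pad = (-rows) % P                            # zero-valued filler rows
        Wp = np.vstack([W_stacked, np.zeros((pad, cols), dtype=W_stacked.dtype)])
        blocks = Wp.reshape(-1, P, cols)             # groups of P consecutive rows
        # Within each group the MVPU walks column by column, consuming P rows per cycle,
        # so each block is transposed to (cols, P) before flattening.
        return blocks.transpose(0, 2, 1).reshape(-1)

    # For the matrix of Figure 5.5 and P = 2 this yields the stream
    # 8 2 1 3 3 5 5 4 2 7 2 5 9 0 6 0 1 0.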

5.2.3 Hidden Layer Output Vector

Due to the recurrent nature of LSTM, the output of the accelerator system is also used as input for the next item in the input sequence. Like the input vector x_t, we want h_t to be immediately available during matrix-vector product computation. Therefore, like the input vector, the hidden layer output must be stored in on-chip addressable memory. This memory unit need not support the equivalent input batch size; it need only be large enough to hold the maximum number of LSTM cells per layer (i.e. N) supported by the system. This is a meager requirement: for an output word size of 32 bits, 16 kB would support a layer size of up to 4096. However, we must also consider that the current hidden state must be maintained while next-state results are being computed. So, we require a second, identically sized memory in which to store output results. The memories are used in a "ping-pong" fashion: one is written to while the other is read, and these functions swap at the start of a new input computation.

We are also interested in using the hidden layer output, as it is the result of the computation that we are accelerating. So, output results must be written to off-chip memory for use by the application. An output-write unit that meets this requirement can be implemented using a FIFO memory. In addition to being routed to the hidden layer cache, the output stream of the CCU is also fed to a FIFO for temporary storage of results. A "write-packet size" threshold is set safely lower than the capacity of the FIFO, and a controller in the output-write unit will perform a burst write when this threshold is reached and the external memory interface is available.

5.2.4 Cell State Vector

The cell state vector c_t is not needed on every clock cycle during computation, but it still requires moderately frequent access. In the schedule given by Table 5.2, it is used in 75% of the active CCU cycles: read during the forget gate, written to during the input gate, then read again during the output gate. Depending on the size of the model, the cell state vector is not required to be available immediately on every cycle; some slack is acceptable before computation would need to be halted. However, it would not make sense to store the cell state vector in off-chip memory. Extra logic would be required to implement the communication with external memory, and the additional bus traffic would interfere with the parameter reads. Furthermore, only a minimal amount of on-chip memory is required to hold it. As with the hidden layer output, a 16 kB cell state cache would support a layer size of up to 4096 with 32-bit data words. Unlike the hidden layer cache, however, the cell state does not require a ping-pong buffering scheme, because the computation schedule never requires c_t and c_{t-1} concurrently.

Therefore, we implement the storage of the cell state vector as a simple RAM, identically to the temporary memory units (MEM1, MEM2) used by the CCU.

5.3 Theoretical Performance

Given the design strategies identified for implementing the datapath and accessing memory, we can characterize the theoretical performance of the binary-weight LSTM accelerator.

First, we derive an expression for the total time required to complete a single input computation; that is, how long it takes to produce one result h_t from one input x_t. We start with a general expression for the computation time, which is simply the number of clock cycles required to complete the operation divided by the clock rate of the system:

ComputationTime = TotalCycles / ClockRate    (5.10)

The clock rate is dependent on the hardware platform selected for implementation, but we can characterize the total number of clock cycles required based on the datapath and memory architecture. As we have seen, computation time is dominated by the pre-activation matrix-vector product calculation. The number of cycles required to complete these calculations is given by the ceiling of the total number of parameters in the model divided by the rate at which the parameter data are consumed by the MVPU. Then, we have a small additional delay from the number of cycles it takes the CCU to process the final MVPU results. Thus, the total number of clock cycles is given by:

TotalCycles = ⌈ NumParam / ParamConsumptionRate ⌉ + CCUDelay    (5.11)

For LSTM, the total number of model parameters is the count of all the elements contained in the four gate parameter matrices, W_i, W_j, W_f, and W_o:

NumParam = 4 · N · (M + N + 1)    (5.12)

The delay from the cell calculations, as we have seen, is P cycles due to the parallel-to-serial conversion that takes place. However, in reality, the implementation of this architecture will require pipeline registers in the datapath to allow for a fast clock rate. We denote the pipeline delay as C_p and write the total delay from the CCU as:

CCUDelay = P + C_p    (5.13)

The consumption rate of the MVPU is, at its maximum, P parameters per cycle. This is the case when model parameters are contained in the on-chip cache and can be read without delay. However, when parameters are stored off-chip, the consumption rate is determined by the width of the external memory interface, i.e. MemWidth_external parameters consumed per cycle. The average consumption rate, then, is given by the percentage of model parameters in on-chip memory, denoted by p, multiplied by the on-chip consumption rate, plus the percentage of model parameters stored in off-chip memory multiplied by the off-chip consumption rate:

ParamConsumptionRate = p · P + (1 − p) · MemWidth_external    (5.14)

Putting this all together, we have:

TotalCycles = ⌈ 4 · N · (M + N + 1) / (p · P + (1 − p) · MemWidth_external) ⌉ + P + C_p    (5.15)

This expression serves as a reasonable estimate for the total number of clock cycles it will take the accelerator to process a single input vector. It can be used to find an appropriate value for P given a specific model size and real-time processing constraint. In practice, however, the estimate will be low due to the additional delays incurred by control logic and memory transaction overhead. We will see this to be the case in later chapters.
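Equation 5.15 is easy to evaluate for candidate configurations. In the sketch below, the pipeline delay C_p is an assumed placeholder value, and the example model dimensions are taken from the TIMIT experiments only for illustration.

    import math

    def bru_cycles(M, N, P, mem_width_ext, frac_on_chip, Cp=8):
        # Equation 5.15: estimated cycles to process one input vector.
        num_params = 4 * N * (M + N + 1)                              # Eq. 5.12
        rate = frac_on_chip * P + (1 - frac_on_chip) * mem_width_ext  # Eq. 5.14
        return math.ceil(num_params / rate) + P + Cp                  # Eq. 5.15

    # Example: a layer with M = 39 and N = 512, P = 32, an 8-parameter-wide external
    # bus, and 25% of parameters cached on-chip needs about
    # ceil(1,130,496 / 14) + 32 + 8 = 80,790 cycles, i.e. roughly 0.40 ms at 200 MHz.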

CHAPTER VI

HARDWARE ARCHITECTURE

Using the design strategies identified in the previous chapter, a digital hardware architecture is developed for accelerating LSTM inference. The accelerator core, called Binary Recurrent Unit (BRU), is targeted specifically for a Xilinx FPGA implementation. It requires access to external DDR memory, which it uses to read input from and write results to. It is configured through a set of control registers, which allows a wide range of model sizes to be supported. The core is intended to be used as a co-processor to a CPU host, which runs the main inference software application and offloads LSTM layer computations to BRU. Figure 6.1 shows the full inference accelerator system architecture.

This chapter covers the main components of the BRU architecture: the control logic, external memory read & write units, internal memory units, the Matrix-Vector Product Unit (MVPU), and the Cell Calculation Unit (CCU). The purpose and mode of operation of each component and its subsystems will be explained in detail. The BRU architecture is depicted in Figure 6.2.

Figure 6.1: Inference accelerator system architecture (a CPU host running the software application, external DDR3 memory and its controller, and the BRU core with its control registers implemented on the FPGA).

Figure 6.2: Binary Recurrent Unit architecture (control logic, external memory read and write units, internal parameter, input, and hidden state memories, the Matrix-Vector Product Unit, and the Cell Calculation Unit with its stream conversion logic and CCU memory).

6.1 Control Logic

Control logic is required to implement the flow of operations in BRU and to configure subsystems to behave correctly. It consists of a global state machine, called the Control Unit, and various registers that are written to and read from by the application.

6.1.1 Control Unit

The Control Unit directs the computational flow of BRU.

Inputs

  load_param_mem – Indicates a request to load the on-chip parameter memory.
  start_batch – Indicates a request to start a batch processing sequence.
  batch_size – The number of input vectors in the batch.
  seq_len – The length of an input sequence.
  input_rd_first – Indicates that the first input vector of a batch has been read.
  param_rd_last – Indicates the end of a parameter read sequence.
  ccu_busy – Indicates that the Cell Calculation Unit is performing computations.

Outputs

  ctrl_start – Indicates the start of a batch processing sequence.
  input_rd_start – Indicates the start of an input read sequence.
  param_rd_start – Indicates the start of a parameter read sequence.
  batch_idx – The position in the batch of the current computation.
  begin_new_input – Indicates the start of computation for a new input vector.
  first_in_seq – Indicates the first input vector in a sequence.
  batch_rd_done – Indicates that memory reads for the batch sequence are done.

The Control Unit is implemented as a finite-state machine. Specifically, it is a Mealy machine, meaning that its output is determined both by its current state and its inputs. It acts as a “director” for the other subsystems—using control signals to initiate an action in another component, waiting on the completion status of the action, then deciding the next action to direct based on a schedule. Primarily, these actions are related to reading memory.

The Control Unit communicates with the Input and Parameter Memory Read units using one input and one output signal per Read unit. It asserts rd start to begin a read sequence. For parameter reads, it monitors param rd last to know when the entire read sequence has finished. For input data, however, only the first input vector needs to be present in order to begin computation. Therefore, input rd first is monitored to know

when the first input vector resides in on-chip memory, and the rest of the input batch read

sequence continues concurrently with the parameter reads.

Two control registers, load param mem and start batch, are used to send requests to

the Control Unit to initiate two routines: fill the internal parameter memory, and perform

computation for a batch of inputs. The batch size is specified by a register. Sequence length

is also specified, which informs the Control Unit when to reset the cell and hidden layer

states to zero, i.e. at the beginning of a sequence. The batch idx output, along with an input vector size register, are used to derive the appropriate internal memory addresses for reading the input data.

The state flow for batch computation is shown in Figure 6.3. As can be seen, after the first input in the batch has been fetched, the parameter matrix is fetched. Although not shown explicitly, parameters are fetched in the gate order specified by Table 5.2 due to their organization in memory, as explained in detail in Section 5.2.2. At the end of the

first parameter read cycle, a new parameter read cycle is issued if there is work remaining, i.e. the count of read cycles is less than the batch size. A batch is done when batch size parameter reads have been issued and the last parameter data item has been read. Then, a new batch can start after the start batch signal has been deasserted.

Figure 6.3: Control Unit Batch Computation State Flow. State transition conditions are shown in black. Output signal assertions which occur on state transitions are shown in blue.

The unit does not explicitly tell the processing units (MVPU and CCU) when to perform computations. Rather, data flows through the system directly from the Memory Read Units into the datapath, qualified by a valid signal. The Control Unit counts the number of gate fetch cycles it performs in order to know when to stop issuing parameter memory reads.

It does not explicitly keep track of when the last computational result has been produced by the Cell Calculation Unit, which would indicate the actual end of a batch. Instead, it checks ccu busy before starting a new batch to make sure that the CCU has completed computation.
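As an illustration, the batch-computation flow of Figure 6.3 can be expressed as a small Mealy-style step function in Python. This is a behavioral sketch of the schedule only; the signal names mirror the inputs and outputs listed above, and the work-remaining check is assumed to be a simple comparison of issued read cycles against the batch size.

def control_unit_step(state, inputs, read_cycles, batch_size):
    """One evaluation of the batch-computation state flow in Figure 6.3.
    Returns (next_state, outputs, read_cycles). Behavioral sketch only."""
    out = {"input_rd_start": False, "param_rd_start": False}
    if state == "IDLE":
        if inputs["start_batch"]:
            out["input_rd_start"] = True
            state = "FETCH_INPUT"
    elif state == "FETCH_INPUT":
        if inputs["input_rd_first"]:          # first input vector is now on-chip
            out["param_rd_start"] = True
            read_cycles = 1
            state = "FETCH_PARAM"
    elif state == "FETCH_PARAM":
        if inputs["param_rd_last"]:
            if read_cycles < batch_size:      # work remaining: issue another parameter read cycle
                out["param_rd_start"] = True
                read_cycles += 1
            else:
                state = "DONE"
    elif state == "DONE":
        if not inputs["start_batch"]:         # wait for start_batch to deassert
            state = "IDLE"
    return state, out, read_cycles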

6.1.2 Control Registers

Registers are used to parameterize different aspects of BRU and control its operation. BRU also writes to two registers to provide information back to the application. Registers are accessed via the AXI4-Lite protocol.

Read load param mem – Indicates a request to load the on-chip parameter memory.

start batch – Indicates a request to start a batch processing sequence.

batch size – The number of input vectors in the batch.

input size – The length of an input vector.

hidden size – The number of hidden units; the length of an output vector.

seq len – The length of an input sequence.

input rd num bursts – Number of bursts required to read a batch of input data from external memory.

input rd last burst len – The length of the last read burst for the input batch.

input rd base addr – The base address from which to read input data.

param rd num bursts – Number of bursts required to read the full parameter matrix from external memory.

param rd last burst len – The length of the last read burst for a gate parameter matrix.

param rd base addr – The base address from which to read parameter data.

Write batch done – Indicates that a batch computation has finished and all results have been written to external memory.

run time – Counter value indicating the number of clock cycles elapsed in the current batch; stopped at the end of computation.

6.2 External Memory

External memory, or “main memory”, is accessed using AXI4 protocol, an open-standard protocol for on-chip interconnect communication [40]. BRU acts as an AXI Master, which

controls a memory interface acting as an AXI Slave. AXI4 is a burst protocol, meaning that a contiguous burst of data is transmitted to/from the device after a simple handshaking protocol. AXI4 has a maximum burst size of 256, so BRU must perform memory reads/writes in length-256 chunks or smaller.

The External Read and Write Units are designed to provide a simple interface to main memory for the other components in the architecture. Three external memory unit instances are used: one for reading input data, one for reading model parameter data, and one for writing output result data. Each one is assigned a unique region of memory, specified by a base address and region size. The addresses used within the architecture are offsets to the actual base address used in main memory, which simplifies the addressing scheme. By having separate units handle the different memory access functions, the control logic within the architecture is simplified, and memory access arbitration can be handled externally. The specifics of the Read and Write Unit architectures are explained in the following sections.

The external memory base address for parameter reads is set with a register.

6.2.1 Read Unit

The Read Unit is responsible for reading data from external memory.

Inputs rd in – Bus signal coming from the AXI Slave Memory Interface.

ack – Indicates that a read request was accepted.

data – The data returned from the read request.

valid – Indicates the validity of the data.

rd start – Indicates a read request from the Control Unit.

base addr – The base address from which to start a read.

num bursts – Total number of bursts to read.

last burst len – The length of the last burst.

Outputs rd out – Bus signal going out to the AXI Slave Memory Interface.

addr – The starting address for the burst read transaction.

len – The number of values to be transferred.

valid – Indicates the validity of addr and len.

rdy – Indicates that the unit is ready to receive data.

dout – The data returned from the read request.

dvalid – Indicates the validity of the data.

didx – The index of the current data item within the read sequence.

tlast – Indicates the last data item in the read sequence.

This subsystem provides an interface that abstracts away the details of the AXI4 pro-

tocol, allowing large reads to be issued with a single control message. It accomplishes this

with the num bursts and last burst len inputs, which are used by the control logic in- side the Read Unit to perform multiple burst transactions in succession. It does this by issuing num bursts-1 read bursts of length 256, followed by a final read burst of length last burst len. The appropriate values for these inputs are calculated in software and written to the corresponding control registers, which are connected directly to the Read

Unit. The data output of this unit will not be a contiguous stream of valid data, so down- stream units must watch dvalid to qualify dout. The didx signal can be used as an address for on-chip memory (like the input cache). Two Read Units are used in the BRU architecture: one for reading input data, and one for reading parameter data.
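For illustration, the register values could be derived in host software roughly as follows. The word-packing in the example is an assumption based on the 16-bit input format described in Chapter 7; it is not taken from the thesis software.

def burst_registers(total_words, max_burst=256):
    """Compute num_bursts and last_burst_len for a read of total_words data items,
    given the AXI4 burst limit of 256."""
    num_bursts = (total_words + max_burst - 1) // max_burst     # ceiling division
    last_burst_len = total_words - (num_bursts - 1) * max_burst
    return num_bursts, last_burst_len

# Example: 7735 input vectors of 4 elements, two 16-bit inputs packed per 32-bit word.
words = 7735 * 4 // 2
print(burst_registers(words))   # -> (61, 110)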

6.2.2 Write Unit

The Write Unit is responsible for writing data to external memory.

Inputs wr in – Bus signal coming from the AXI Slave Memory Interface.

ack – Indicates that a write request was accepted.

rdy – Indicates that data can be sent.

err – AXI Bus error indicator.

data – Data to be written to memory.

valid – Indicates the validity of the data.

batch start – Indicates that a new computation batch has started.

batch done – Indicates that the last result for the batch has been calculated.

Outputs wr out – Bus signal going out to the AXI Slave Memory Interface.

addr – The starting address for the burst write transaction.

len – The number of values to be transferred.

data – The data to be transferred.

valid – Indicates the validity of the data.

tlast – Indicates the end of the write sequence.

Like the Read Unit, this subsystem provides an interface that abstracts away the details of the AXI4 protocol. When a batch computation starts, the control logic inside the unit begins monitoring the data input and stores valid samples in a FIFO. When the

FIFO reaches a threshold and the memory bus is available, the unit writes all the data contained in the FIFO out to external memory. When the batch done signal goes high to indicate that all the computation results for the batch have been produced, the Write

Unit will empty the FIFO out to memory regardless of how much data it currently contains.

Only one Write Unit is used in the BRU architecture, for writing computational results from the Cell Calculation Unit to memory.

6.3 Internal Memory

Internal memory is used in various capacities in order to support low-latency data access and storage. All memory implementations are based on either a single-port (shared read/write address port) or a dual-port (separate read/write address ports) RAM with one data output port. These implementations target the Block RAM (BRAM) resources available on Xilinx FPGA architectures. There are four internal memory units: Input Memory,

Parameter Memory, Hidden State Memory, and Cell Calculation Unit Memory.

6.3.1 Input Memory

The Input Memory stores a batch of input data.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

Outputs rd dout – The data read at rd addr.

Write data for this unit comes from the Input Read Unit (external memory) and read data is sent to the Matrix-Vector Product Unit. It is implemented as a dual-port RAM, which simplifies the control logic for reading and writing. The write cycle is initiated at the start of a batch. The Input Memory should be moderately sized (a few hundred kB) in order to support larger batch sizes, but the amount of memory resources dedicated to this unit will depend on how much is available on the target FPGA device. Parameter Memory should take priority for allocation of on-chip memory resources.

6.3.2 Parameter Memory

The Parameter Memory stores a portion (or all, depending on model size) of the model parameters.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

Outputs rd dout – The data read at rd addr.

Write data for this unit comes from the Parameter Read Unit (external memory) and read data is sent to the Matrix-Vector Product Unit. It is implemented as a dual-port RAM.

Parameter Memory is written to once during system startup by signaling the Control Unit with the load param mem register. The param mem addr limit register tells the Control

Unit how much data to write into the on-chip memory and, subsequently during calculation, the highest address at which valid parameter data can be found. Like main memory, this

unit is byte-addressed. The width of the data input to this unit matches the width of the

Parameter Read Unit data output. The output width for this unit, called the line size, matches the width of the Matrix-Vector Product Unit datapath, P . Reads are aligned to the line size. Therefore, a read address that represents an offset into a line will return the entire line, with the data from the requested address located at the corresponding offset within the returned line. For example, with a 128 bit (16 byte) line size, a read address of 0x22 would return the line starting at 0x20, with the contents of 0x22 located at the third byte in the returned line. The Control Unit is responsible for converting the desired parameter indices to the corresponding memory address. Parameter Memory should be made as large as possible given the available memory resources on the target FPGA device.
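To make the line-aligned read behavior concrete, the following Python fragment models it. The 16-byte line size matches the example above, and the memory contents are dummy values; this is an illustration of the addressing rule, not the RTL.

def read_parameter_line(mem, byte_addr, line_bytes=16):
    """Model a line-aligned Parameter Memory read: any byte address returns the whole
    line containing it, plus the offset of the requested byte within that line."""
    base = (byte_addr // line_bytes) * line_bytes     # e.g. 0x22 -> 0x20
    offset = byte_addr - base                         # e.g. 0x22 -> 2 (third byte)
    return mem[base:base + line_bytes], offset

mem = bytes(range(256))                # dummy memory contents
line, off = read_parameter_line(mem, 0x22)
print(hex(line[off]))                  # 0x22, the third byte of the line starting at 0x20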

6.3.3 Hidden State Memory

The Hidden State Memory stores the hidden state used as input to the current computation as well as the newly computed hidden state that becomes the input for the next computation.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

src toggle – When high, toggles the read/write state of the two internal RAMs.

Outputs rd dout – The data read at rd addr.

This unit supports the recurrent nature of the LSTM. Write data comes from the output of the Cell Calculation Unit, and read data is sent to the Matrix-Vector Product Unit.

Because the hidden state needs to persist while the next hidden state values are being produced, a ping-pong buffering scheme is implemented. Two identically sized single-port

RAMs are used. While one is read from, the other is written to, and vice versa. The src toggle input signal is used to swap the function of each RAM.
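The ping-pong scheme can be summarized with a small behavioral model. The class below is a sketch under the assumption that src toggle simply swaps which RAM is the read source; it is not the actual Simulink or HDL implementation.

class PingPongHiddenState:
    """Behavioral sketch of the two-RAM ping-pong scheme: one RAM is read (the current
    hidden state) while the other is written (the next hidden state)."""
    def __init__(self, size):
        self.rams = [[0] * size, [0] * size]
        self.read_sel = 0                      # which RAM is currently the read source

    def read(self, addr):
        return self.rams[self.read_sel][addr]

    def write(self, addr, value):
        return self.rams[1 - self.read_sel].__setitem__(addr, value)

    def src_toggle(self):
        """Swap roles at the start of the next time step."""
        self.read_sel = 1 - self.read_sel

hs = PingPongHiddenState(4)
hs.write(0, 7)
hs.src_toggle()
print(hs.read(0))    # 7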

6.3.4 Cell Calculation Unit Memory

The Cell Calculation Unit Memory stores intermediate computational results.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

gate id – Identifier for the current gate being computed.

Outputs mem1 rd dout – The data read at rd addr in MEM1.

mem2 rd dout – The data read at rd addr in MEM2.

cell rd dout – The data read at rd addr in CELL.

This unit is read from and written to by the Cell Calculation Unit. It contains three separate, identically sized dual-port RAMs: MEM1 stores the intermediate result tanh(j_t); MEM2 stores the intermediate result sigm(f_t) ⊙ c_{t−1}; and CELL stores the full result of the cell state vector computation i_t ⊙ j_t + f_t ⊙ c_{t−1}. Input data is routed to all three memories,

but write behavior is controlled by decoding gate id and ANDing with wr en to produce

three separate write control signals. The following decode table is used:

Gate        ID   Action       Write Control
New Input   0    Write MEM1   001
Forget      1    Write MEM2   010
Input       2    Write CELL   100
Output      3    Write None   000

Address inputs refer to the element index in a gate vector, i.e. when computing the k-th element of the cell state vector, the values MEM1[k] and MEM2[k] are read, and the

result is written to CELL[k]. The memory units must have separate write and read ports due to pipelining in the Cell Calculation Unit; write results are not available in the same cycle that read values are used. Three separate data output ports are used as well, both in order to simplify routing to the CCU functional units and because MEM1 and MEM2 contents are needed at the same time during the Input Gate computation.
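A plain-Python rendering of the decode table may help make the write control explicit. The gate numbering follows the table above; the function is only a behavioral stand-in for the decode-and-AND logic.

def ccu_memory_write_controls(gate_id, wr_en):
    """Decode gate_id and AND with wr_en to produce the three write enables
    (MEM1, MEM2, CELL), matching the decode table above."""
    decode = {0: (1, 0, 0),   # new input gate -> write tanh(j_t) into MEM1
              1: (0, 1, 0),   # forget gate    -> write sigm(f_t) * c_{t-1} into MEM2
              2: (0, 0, 1),   # input gate     -> write cell state result into CELL
              3: (0, 0, 0)}   # output gate    -> no write
    m1, m2, cell = decode[gate_id]
    return m1 & wr_en, m2 & wr_en, cell & wr_en

print(ccu_memory_write_controls(1, 1))   # (0, 1, 0): forget gate writes MEM2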

6.4 Matrix-Vector Product Unit

The Matrix-Vector Product Unit (MVPU) implements the row-wise matrix-vector mul- tiplication algorithm, described in Section 5.1.1. It is composed of an array of P processing elements (PEs), which perform a multiply-accumulate operation along the rows of the pa- rameter matrix. It is fed parameter data in a streaming fashion, and it reads a corresponding input vector element from either the Input or Hidden State Memories. The MVPU contains its own controller unit to handle input data and manage the multiply-accumulate operation.

It also has a stream conversion unit, which handles sending computation results to the Cell

Calculation Unit.

6.4.1 Controller

The MVPU Controller is responsible for implementing the row-wise matrix-vector multiplication algorithm.

Inputs param rd start – Indicates the start of a parameter read sequence.

param rd last – Indicates the end of a parameter read sequence.

param dvalid – Indicates the validity of the parameter data.

input size – The length of an input vector.

hidden size – The length of a hidden state vector.

Outputs rd src – Selects a memory unit to read from, either input or hidden state memory.

rd addr – The input or hidden state memory address to read from.

mac reset – Indicates that the accumulate register in the PEs should be reset.

mac done – Indicates the end of a row accumulation.

This unit monitors the control signals from the parameter data stream in order to align input data for the PE array. The param rd start signal from the Control Unit tells the

MVPU controller to begin the row-wise computation sequence. When param dvalid is high, the controller reads the appropriate value from either the Input Memory or Hidden State

Memory. It does this by keeping a running index into the input and hidden state vectors, iteratively reading items 1 through input size of the current input vector and then items

1 through hidden size of the hidden state vector. The controller asserts mac reset on the

first valid read cycle of the row, and mac done is asserted on the last valid read cycle.
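The controller's read sequencing can be sketched as a generator. Bias handling and pipeline timing are omitted, so this is a simplified behavioral model rather than the actual controller.

def mvpu_read_schedule(input_size, hidden_size):
    """Yield (rd_src, rd_addr, mac_reset, mac_done) for the valid parameter cycles of one
    row accumulation, following the running-index scheme described above."""
    total = input_size + hidden_size
    for i in range(total):
        rd_src = "input" if i < input_size else "hidden"
        rd_addr = i if i < input_size else i - input_size
        yield rd_src, rd_addr, i == 0, i == total - 1

for step in mvpu_read_schedule(input_size=4, hidden_size=8):
    print(step)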

6.4.2 Processing Element Array

The array of processing elements performs the binary multiply-accumulate operation for the row-wise matrix-vector multiplication algorithm.

Inputs input data – Input vector data read from internal memory.

hidden data – Hidden state vector data read from internal memory.

rd src – Indicates the input value, input data or hidden data, to feed to the PEs.

param data – An array of P parameter data values.

valid in – Indicates the validity of the input data.

reset – Sets the value of the accumulate register in the PEs to zero.

done – Indicates the end of a row accumulation.

Outputs result data – An array of P pre-activation vector computation results.

valid out – Indicates the validity of the result data.

The PE array contains P parallel datapaths. Before the input of the array, Input and

Hidden State Memory data are sent to a 2-to-1 multiplexer, the output of which is selected by the rd src signal. Then, the multiplexer output is fed to each of the P processing elements, along with valid in and reset. The parameter data input is split into P separate signals, each routed to an individual PE. Inside the PE, the binary multiply-accumulate operation takes place. The “multiply” is accomplished with a two’s-complement unit and a

eration takes place. The “multiply” is accomplished with a two’s-complement unit and a

2-to-1 multiplexer, as shown in Figure 5.3. An adder takes the result of the multiply and

sums with the output of an accumulate register. The result of the addition is then stored

in the accumulate register, qualified by valid in. The PE contains two pipeline registers in its datapath, so the done signal is sent to a corresponding two-cycle delay register, then outputted as the valid out signal.
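The binary multiply-accumulate performed in each PE amounts to a conditional negate followed by an add. The snippet below models one PE cycle and accumulates a short example row; it ignores the two pipeline registers and the fixed-point word widths.

def binary_pe_cycle(acc, operand, weight_bit, valid, reset):
    """One multiply-accumulate step: the binary weight selects the operand or its
    two's complement, and the sum is stored only when the input is valid."""
    if reset:
        acc = 0
    product = operand if weight_bit == 1 else -operand   # "multiply" by +1 / -1
    return acc + product if valid else acc

weights = [1, 0, 0, 1]        # bit value 0 encodes a -1 weight
inputs = [3, -1, 2, 5]
acc = 0
for i, (x, w) in enumerate(zip(inputs, weights)):
    acc = binary_pe_cycle(acc, x, w, valid=True, reset=(i == 0))
print(acc)                    # 3 + 1 - 2 + 5 = 7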

6.4.3 Stream Conversion Unit

The Stream Conversion Unit converts the parallel output of the processing element array to a serial data stream.

Inputs result data – The array of P results from the PE array.

result valid – Indicates the validity of the result data

num hidden – The number of hidden units.

reset – Resets the control state of the unit.

Outputs data out – The serialized result data.

data valid – Indicates the validity of the data.

data idx – The vector element index of the data.

gate id – Identifier for the gate vector the data belongs to.

last – Indicates the last index in the last gate.

The control logic inside the Stream Conversion Unit contains three counter values, which are all initialized to zero: local count, idx count, and gate count. When result valid is high, the P items in result data are saved into registers. Then, the controller begins counting up for P cycles, incrementing local count and idx count on each cycle. The local count is used to index into the saved result vector, and the indexed value is sent to data out. This counter is reset with each new delivery of results from the PE array, but idx count is maintained and sent to the output as data idx. When idx count reaches the value num hidden-1, it is reset to zero, and gate count is incremented to signify the computation of the next gate.

This counter is outputted as gate id. When idx count reaches the value num hidden-1 and gate id equals 3 (the ID of the output gate), last is asserted to signify the last data item in the computation.
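The counter behavior can be illustrated with a short generator. The example values are arbitrary; in the real unit the block width equals P and the data are fixed-point pre-activation results.

def stream_convert(result_blocks, num_hidden):
    """Serialize P-wide result blocks into (data, data_idx, gate_id, last) tuples,
    mirroring the local/idx/gate counters described above. Behavioral sketch."""
    idx_count, gate_count = 0, 0
    for block in result_blocks:                    # one block per delivery from the PE array
        for value in block:                        # local counter walks the saved registers
            last = (gate_count == 3) and (idx_count == num_hidden - 1)
            yield value, idx_count, gate_count, last
            if idx_count == num_hidden - 1:
                idx_count, gate_count = 0, gate_count + 1
            else:
                idx_count += 1

# Example: num_hidden = 4 and P = 2, so each gate arrives as two blocks of two results.
for item in stream_convert([[10, 11], [12, 13]] * 4, num_hidden=4):
    print(item)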

6.5 Cell Calculation Unit

The Cell Calculation Unit (CCU) performs the activation functions and element-wise operations in LSTM.

Inputs data in – The pre-activation vector element data.

data valid in – Indicates the validity of the data.

data idx in – The vector element index of the data.

gate id – Identifier for the gate vector the data belongs to.

first in seq – Indicates the first input vector in a sequence.

last – Indicates the last index in the last gate.

begin new input – Indicates the start of computation for a new input vector.

ctrl start – Indicates the start of a batch processing sequence.

Outputs data out – The cell calculation result data.

data valid out – Indicates the validity of the data.

data idx out – The vector element index of the data.

busy – Indicates that the Cell Calculation Unit is performing computations.

This unit processes the serial stream of result data from the Matrix-Vector Product Unit.

Its architecture is similar to the one shown in Figure 5.4, but it contains pipeline registers before and after each of its functional units in order to support a high clock rate. In addition to the memory unit described in Section 6.3.4, the CCU contains three major subsystems: the Activation Function Unit, the Elem-Mult Unit, and the Elem-Add Unit, which are described in detail in the following sections. Several control signals are used: first in seq tells the unit when to reset the cell state memory for the beginning of a sequence; last, begin new input, and ctrl start are used to derive the busy state of the unit.

6.5.1 Activation Function Unit

The Activation Function Unit performs the sigmoid and hyperbolic tangent activation functions.

Inputs data in – The pre-activation vector element data.

gate id – Identifier for the gate vector the data belongs to.

Outputs data out – The activated result data.

Both activation functions are implemented using a lookup table (LUT) method. A few techniques are used to optimize the implementation for hardware and yield best accuracy.

Firstly, symmetry is used to reduce the number of points in the LUT. We observe that the hyperbolic tangent function is symmetric about the origin. The sigmoid function is not symmetric about the origin, so instead the biased sigmoid function f(x) = sigm(x) − 1/2, which is symmetric about the origin, is used. Thus, only points from x ≥ 0 are sampled for both of these functions.

Next, we note that lim_{x→∞} tanh(x) = 1 and lim_{x→∞} (sigm(x) − 1/2) = 1/2. Cutoff points at tanh(4) = 0.9993 and sigm(6) − 1/2 = 0.4975 are used, above which output values of 1 and 0.5, respectively, are used.

Finally, the shapes of the activation functions are considered. Both functions have nearly-linear regions that can be approximated with coarsely-grained sampling, but some regions of the functions require high-granularity sampling. It would be a waste of LUT resources to use the same high granularity for the entire range of the functions, so four sep- arate ranges are sampled for each function. For the hyperbolic tangent, region breakpoints of 0, 0.5, 1.5, 3, and 4 were used. For the biased sigmoid, breakpoints of 0, 0.5, 1.5, 3, and

6 were used. Each region contains 16 evenly spaced samples.

Inside the Activation Function Unit, the absolute value of the input is taken. A flag is asserted if the input was negative. The absolute value is sent to the four region LUTs.

Each LUT asserts a flag if the input is in its range, and these four flags are used to select the proper LUT output. If the negative input flag was asserted, then the sign of the LUT output is flipped. Finally, the linear bias is added (for sigmoid only).

The input data in is sent to both the hyperbolic tangent and sigmoid LUT units, and gate id is decoded to select the appropriate function: hyperbolic tangent for the new input gate, and sigmoid for all others.
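The following sketch reproduces the LUT construction and lookup procedure in floating-point Python. The breakpoints, cutoffs, and 16-sample regions come from the description above, but the sampling rule (left-edge samples, nearest-sample selection) is an assumption, and the hardware uses fixed-point tables rather than floating point.

import numpy as np

TANH_BREAKS = [0.0, 0.5, 1.5, 3.0, 4.0]
SIGM_BREAKS = [0.0, 0.5, 1.5, 3.0, 6.0]

def build_luts(breaks, fn):
    """Sample fn over each [lo, hi) region with 16 evenly spaced points."""
    return [(lo, hi, fn(np.linspace(lo, hi, 16, endpoint=False)))
            for lo, hi in zip(breaks[:-1], breaks[1:])]

tanh_luts = build_luts(TANH_BREAKS, np.tanh)
sigm_luts = build_luts(SIGM_BREAKS, lambda x: 1.0 / (1.0 + np.exp(-x)) - 0.5)  # biased sigmoid

def lut_activation(x, luts, saturate, add_bias=False):
    """Absolute value -> region select -> nearest LUT sample -> sign restore (-> bias)."""
    neg = x < 0
    ax = abs(x)
    y = saturate                                   # value used beyond the last breakpoint
    for lo, hi, table in luts:
        if lo <= ax < hi:
            step = (hi - lo) / 16
            y = table[min(int((ax - lo) / step), 15)]
            break
    y = -y if neg else y
    return y + 0.5 if add_bias else y              # re-add the 1/2 bias for sigmoid only

print(lut_activation(0.8, tanh_luts, saturate=1.0))                    # approximates tanh(0.8)
print(lut_activation(-2.0, sigm_luts, saturate=0.5, add_bias=True))    # approximates sigm(-2.0)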

6.5.2 Elem-Mult Unit

The Elem-Mult Unit performs the element-wise multiplication function.

Inputs 84 data in – The activated gate vector element data.

gate id – Identifier for the gate vector the data belongs to.

mem1 rd dout – The data read from the corresponding element index in MEM1.

cell rd dout – The data read from the corresponding element index in CELL.

Outputs data out – The multiplication result data.

This unit contains a two-input multiply unit, a hyperbolic tangent LUT unit, and several multiplexers that implement the operand routing as specified in Table 5.2. Depending on the value of gate id, either mem1 rd dout, cell rd dout, or tanh(cell rd dout) is routed to

the input of the multiplier along with data in. A multiplexer on the output of the multiplier is used to select between the multiplication result and the unmodified input, rather than explicitly implementing data in*1. The hyperbolic tangent LUT unit is implemented the same as in the Activation Function Unit.

6.5.3 Elem-Add Unit

The Elem-Add Unit performs the element-wise addition function.

Inputs data in – The multiplied gate vector element data.

gate id – Identifier for the gate vector the data belongs to.

mem2 rd dout – The data read from the corresponding element index in MEM2.

Outputs data out – The addition result data.

This unit contains a two-input addition unit and a multiplexer that implements the operand routing as specified in Table 5.2. The inputs data in and mem2 rd dout are routed to the adder. A multiplexer on the output of the adder is used to select between the addition result and the unmodified input, rather than explicitly implementing data in+0.

CHAPTER VII

IMPLEMENTATION & RESULTS

In order to evaluate the performance of the inference accelerator, the BRU architecture is implemented on an FPGA. This chapter presents the details of the hardware implemen- tation process, including the design methods and target device considerations. Finally, computational speedup of BRU is evaluated against pure-CPU and GPU inference.

7.1 Methods

This work followed a model-based design method for implementing the hardware system.

With this method, implementation stems from developing a simulation model, rather than jumping straight into writing HDL. A plant model is first developed to simulate interaction with external systems. Then, the target architecture is implemented using a behavioral approach. After verifying the correctness of the behavioral implementation, the model is converted piece-by-piece to a structural implementation, which more closely maps to the hardware architecture. HDL code is automatically generated for the target system directly from the simulation model, then deployed to the hardware and tested. The code generation step is a key enabler of this methodology—the behavior of the generated HDL version of the model exactly matches the behavior in the simulation environment. Still, the final step of the process is to verify the correct operation of the deployed model in hardware.

7.1.1 Tools

MathWorks MATLAB & Simulink were used as a development environment for the model-based design workflow. MATLAB is a software tool that provides a programming environment for numerical computing. Simulink is a companion tool to MATLAB that provides a block-diagram-based graphical environment for modeling and simulation of dy- namical systems. It is tightly integrated with MATLAB, allowing simulation data to be viewed and analyzed in MATLAB. The code generation step of the model-based design method is provided by HDL Coder, an add-on toolbox that converts both MATLAB code and Simulink models into HDL implementations. The output of the code generation process is called an IP (intellectual property) core, which is the generated HDL code packaged into subsystem for use as a building block in a larger FPGA design.

Xilinx Vivado Design Suite was used to build the FPGA target. Vivado is a software tool used for synthesizing an HDL design and compiling it for a specific Xilinx FPGA device. The Vivado IP Integrator tool was used to construct the FPGA design from the generated BRU IP core and other IP cores providing various functions, such as interfacing with external memory. A screenshot of the Vivado IP Integrator block design is shown in

Figure 7.1.

7.1.2 Design

As described previously, the model-based design methodology was followed for imple- menting the BRU architecture. The first step in the process was the development of a plant model in Simulink to represent the external memory interface to DDR. The memory plant model implements a simplified version of the AXI4 protocol, which BRU uses to initiate memory read & write transactions with the actual memory interface in hardware. Next, a

Figure 7.1: Vivado IP Integrator block design. The BRU IP core is labeled “system top ip” in the lower-right of the diagram.

behavioral version of the architecture was implemented in Simulink. In this initial version,

the subsystems detailed in the previous chapter were implemented as MATLAB Function

Blocks, which allow MATLAB code to be executed in a Simulink model. After verifying

correct computational output and operational flow, this version of the model is saved to be

used as a “golden reference” model.

While the MATLAB Function Blocks made it easy to construct a system that was functionally correct, in order to be able to generate HDL code for the model, the system must be

implemented using blocks that are compatible for code generation. HDL code can be gen-

erated for a subset of MATLAB functions—all finite state machine logic was implemented

as MATLAB code. However, most of the design was replaced with a structural Simulink

block implementation in order to map more effectively to hardware. The conversion to a

structural implementation was an iterative process, in which subsystems were modified one

Figure 7.2: MVPU Processing Element implemented in Simulink. Pipeline delay registers are highlighted in blue.

at a time. The Fixed-Point Designer toolbox for MATLAB was employed to convert floating-point signals into a hardware-friendly fixed-point implementation. All changes to the design were verified by comparing the system output to that of the golden reference model.

Once the structural implementation was finished, HDL code (specifically Verilog) was generated for the model. The resulting IP core was integrated into a larger Vivado design containing other IP cores that perform various functions outside of BRU, such as interfacing with external memory or handling access to BRU’s control registers. The design was then synthesized, placed & routed for the Z7020 device, and compiled into an FPGA configuration bitstream file.

7.1.3 Verification

The design required verification in two areas: computational correctness and hardware implementation. To verify that the accelerator design produces the computationally correct

LSTM inference result, a “dummy” LSTM model was implemented in Python code, using the Theano framework. This model was never trained, and its parameters were randomly generated. A test input dataset, the values of which approximate a uniform distribution over the full dynamic range of BRU’s fixed-point input datatype, was also generated. In- ference for the full test input was run in Theano, and the outputs were recorded. The verification data items—dummy model parameters, test input, and recorded output—were then imported to the MATLAB environment for testing the computational correctness of the Simulink implementation. The verification data is in floating-point format, while the

BRU architecture uses fixed-point datatypes, so the output of BRU will not exactly match the verification output. Therefore, a fixed-point error tolerance of 5% of the dynamic range of the output was used.

To verify the hardware implementation, the same verification data was used. The JTAG interface on the ZedBoard was used to program the input and parameter data into DDR3 and to set the BRU control registers. After running a batch computation, the output result data was read off of the board and into MATLAB, where it was compared with the output of the Simulink model to verify the hardware implementation. Unlike in the previous verification step, the output of the hardware should match the simulation output exactly, signifying the behavior of the generated HDL code matches the simulation behavior. The run time register was also checked and compared to the run time in simulation.

Figure 7.3: The ZedBoard Zynq-7000 development board used as the target hardware platform.

7.2 Hardware

The Avnet Zedboard development board was used as a platform for implementation of the BRU architecture. This board houses a Xilinx Zynq-7000 device and has 512 MB of

DDR3 memory. A USB-JTAG interface is provided for interacting with the Zynq device.

7.2.1 Device

The targeted device was a Xilinx Zynq-7000 System-on-Chip (SoC), a chip which con- tains both an ARM CPU and an FPGA. The Zynq family of SoCs is a popular platform for high-performance embedded systems, due to the versatility it provides in terms of software, hardware, and IO. For this work, only the FPGA on the device was used. However, a fully-embedded inference application would employ software running on the CPU that uses

BRU as a co-processor to offload the LSTM computations.

The specific Zynq device on-board the ZedBoard is the Z7020. The resource specifications for the FPGA on this device are shown in the table below.

Resource             Available
Lookup Tables (K)    53.2
Flip-Flops (K)       106.4
Block RAM (Mb)       4.9
DSP Slices           220

As can be seen, Block RAM is the most constraining resource for BRU on this device—4.9

Mb is equivalent to only 627 kB. To ensure that the design can be placed & routed by the

Vivado tool, a good rule of thumb for FPGA design is to not use more than 80% of any one resource category. This effectively brings the usable amount of Block RAM down to 500 kB. The total size of all on-chip memory units in BRU must be below this value.

7.2.2 Implementation Parameters

The BRU architecture has several implementation parameters that should be customized for the target FPGA platform, depending on the available resources. The values that were chosen for implementation on the Z7020 are shown below in Table 7.1.

Table 7.1: BRU implementation parameters, targeted for Z7020.

Parameter                              Value
Width (bits)       External Memory     32
                   Datapath            256
Fixed-Point        Input               Q5.10
Datatype           Output              Q11.20
Memory Size (kB)   Input               64
                   Parameter           256
                   Hidden State        16
                   CCU                 48

The width of the interface to external memory was limited to 32 bits, due to a constraint in the memory interface IP core used. The width of the matrix-vector multiplication datapath, and thus the output width of the parameter cache, was chosen to be 256 bits.

This choice means that BRU can only handle models for which M + N + 1 ≥ 256, due to how the matrix-vector product results are consumed serially.

Fixed-point datatypes of Q5.10 and Q11.20 were chosen for input and output, respec- tively. This means that the input has a range of -32 to 31.999, with a precision of 9.765e−4, and output has a range of -2048 to 2047.999, with a precision of 9.537e−7. This also implies, along with the 32-bit memory interface width, that two input data words can be read from external memory per cycle, and one output word can be written to external memory per cycle.
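The stated ranges and precisions follow directly from the Q-format definitions, as the short check below illustrates (assuming one sign bit plus m integer and n fractional bits).

def q_format_spec(int_bits, frac_bits):
    """Range and precision of a signed fixed-point Qm.n value (1 sign + m integer + n fraction bits)."""
    lsb = 2.0 ** -frac_bits
    lo = -(2.0 ** int_bits)
    hi = 2.0 ** int_bits - lsb
    return lo, hi, lsb

print(q_format_spec(5, 10))    # (-32.0, 31.999..., 9.765e-4)   -> Q5.10 input datatype
print(q_format_spec(11, 20))   # (-2048.0, 2047.999..., 9.537e-7) -> Q11.20 output datatype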

The 64 kB Input Memory can store up to 32,768 input data words, and the Parameter

Memory has the capacity for 2,097,152 1-bit parameters. Two 8 kB RAMs are used for

Hidden State Memory, and three 16 kB RAMs are used for the CCU Memory. This corresponds to a maximum of 4096 hidden units per LSTM layer. The total memory required for all on-chip units is 392 kB, or 3.1 Mb.

7.2.3 Resource Utilization

The specified BRU implementation has the following resource utilization for the Z7020 device:

Table 7.2: FPGA resource utilization for BRU on the Z7020 device.

Resource             Utilized   % of Available
Lookup Tables (K)    8.3        15.64
Flip-Flops (K)       12.5       11.75
Block RAM (Mb)       3.45       70
DSP Slices           1          0.45

As shown in the table, BRU uses a significant portion of the Block RAM resources but not much of the other resource categories. The additional memory requirements beyond the 3.1 Mb specified for the BRU internal memory units are due to a JTAG interface unit, external to BRU. The fact that only one DSP slice was used makes sense: the binary-weight

PEs do not use a multiplier, and the 16-bit input adders can be implemented using LUT resources. The Elem-Mult and Elem-Add units in the CCU, on the other hand, map well to a single DSP slice.

BRU has an estimated power requirement of 0.471 Watts. This estimate does not include the requirements for a CPU host processor or DDR3 memory, however, so it is not representative of the power requirement for a full BRU-based inference system.

7.3 Performance Evaluation

The BRU hardware was used to implement inference for two LSTM models that were shown in Section 4.2. BRU was clocked at 200 MHz for these experiments. Elapsed run time was measured by reading the run time register, which begins counting up from 0 when a batch starts and stops counting when the final result output is written to external memory. The BRU run time is compared with the run time of CPU and GPU inference, both of which were implemented using Theano. Since BRU only computes the LSTM layers and not the fully-connected layers, the Theano code did not include fully-connected layer

computations. To further simplify the comparison, only a single LSTM layer was computed.

The CPU test platform was an Intel Core i3-2370M CPU at 2.40 GHz, and the GPU test platform consisted of an Intel Xeon CPU E5-2680 v4 at 2.40 GHz with one NVIDIA Tesla

P100 GPU. For these platforms, inference was evaluated for both the full-precision and binary-weight versions of the models. Input batches were run five times per model and hardware configuration, and the average run time was recorded for each.

7.3.1 Results

Inference run time for the handwriting recognition model with the MNIST Stroke Sequence dataset was measured. Specifically, a configuration of 1 layer, 256 hidden units was used. A batch size of 200 input sequences was evaluated, where input sequence lengths ranged from 15 to 61. In total, 7735 input vectors of size M = 4 were processed.

Table 7.3: Run time performance comparison of BRU, CPU, and GPU running inference for the MNIST Stroke Sequence model.

Platform              CPU                  GPU                  BRU
Weight precision      Full       Binary    Full       Binary    Binary
Run time (ms)         1650.03    1617.91   48.33      51.34     42.48
Rel. Speedup (×)      39         38        1.1        1.2       1

Inference run time for the speech recognition model with the TIMIT dataset was also measured. Specifically, a configuration of 1 layer, 512 hidden units was used. A batch size of 20 input sequences was evaluated, where input sequence lengths ranged from 153 to 498.

In total, 6064 input vectors of size M = 39 were processed.

Table 7.4: Run time performance comparison of BRU, CPU, and GPU running inference for the TIMIT model.

Platform              CPU                  GPU                  BRU
Weight precision      Full       Binary    Full       Binary    Binary
Run time (ms)         6539.13    6410.56   1082.62    1054.62   282.45
Rel. Speedup (×)      23         23        3.8        3.7       1

7.3.2 Analysis

The run time performance results demonstrate that BRU computes inference consider- ably faster than its CPU counterpart in all cases. For the MNIST Stroke Sequence model, as seen in Table 7.3, BRU achieved a best-case speedup of 39× compared to the full-precision weight model run on CPU. Compared to the binary-weight model on CPU, it achieved a

38× speedup. In Table 7.4, we see that for the TIMIT model, the performance improve- ment from CPU to BRU is about 23× for both full-precision and binary weights. While this speedup is not as dramatic as it was for the MNIST Stroke Sequence model, it is still a considerable improvement over a pure software implementation.

There are a couple of things to note about the CPU performance results. For one, binary weight models appear to require slightly less computation time than full-precision for both datasets. However, this difference is most likely inconsequential. The reason for this is that the Theano framework does not include optimization for the binarized weights—during inference, they are treated as 32-bit floating point numbers, just as they are in the non-binarized model. Therefore, with more trial runs, the run time averages would be expected to even out between full-precision and binary. The second, and more interesting, aspect of the results to note is the smaller speedup from CPU to BRU for the TIMIT models than for the MNIST Stroke Sequence models. Just based on the difference in the raw amount of multiply-accumulate operations for the two model batches, computation for the TIMIT batch should require around a 6.5× longer run time than that of the MNIST Stroke Sequence batch. However, the TIMIT run time is only around 4× that of MNIST Stroke Sequence. Evidently, the software is able to perform matrix multiplication more efficiently when using larger matrices.

To speculate about why this may be the case, we can consider how matrix-vector multiplication may be implemented in the CPU using Single Instruction Multiple Data (SIMD) instructions. This instruction type takes advantage of data-level parallelism and performs an operation on multiple data items with a single instruction. Any optimized low-level matrix multiplication routine at the core of the Theano engine almost surely uses SIMD instructions. A scalar multiply SIMD instruction multiplies multiple data items by a single value. Matrix-vector multiplication, then, could be implemented in an efficient manner by performing a scalar multiply with the items in a matrix column and the corresponding element of the input vector, then accumulating values along the rows of the matrix. The input vector element is reused as many times as it takes to complete the scalar multiplication along the column, remaining in the processor’s cache for the duration of its use. On the next column sequence, the next vector element must be fetched from main memory, which is expensive. The column length of the TIMIT model parameter matrix is twice that of the MNIST Stroke Sequence model (since it contains twice as many hidden units). However, the relative cost of operating along the longer column is low, due to the input vector element remaining in the cache. Therefore, the CPU would be able to more effectively utilize

SIMD instructions and reuse data when performing matrix-vector multiplication with a taller matrix.
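The access pattern being described can be illustrated with a simplified NumPy sketch of column-major accumulation. This is only an illustration of the speculated strategy, not Theano's actual BLAS-backed implementation.

import numpy as np

def mv_column_major(W, x):
    """Accumulate y += W[:, j] * x[j] one column at a time: each input element x[j] is
    reused across the whole column (a scalar-times-vector pattern), and the row-wise
    accumulation happens in the running y vector."""
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]          # one scalar-broadcast multiply-accumulate per column
    return y

W = np.random.randn(8, 5)
x = np.random.randn(5)
print(np.allclose(mv_column_major(W, x), W @ x))   # True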

Moving on to the GPU results, we see in Table 7.3 that BRU performs slightly better than the GPU for the MNIST Stroke Sequence models, and about 3.75× better for the TIMIT model.

98 We see that, as was the case with the CPU results, there is not a significant difference in performance between full-precision and binary weight variants. Again, this is to be expected due to the fact that Theano handles binary weight values as full-precision floating point values. An interesting result to note is that, unlike in the case of the CPU, there is a larger speedup from GPU to BRU for the TIMIT models than for the MNIST Stroke

Sequence models. Two possible reasons for this are suggested, both stemming from the difference in complexity between the models. Firstly, there is likely a slowdown due to data transfer from CPU to GPU—GPU cards do not have access to main CPU memory, so data is transferred to the card over a PCIe bus. While the inference computations can be parallelized, this transfer of data cannot be. The TIMIT model contains significantly more parameters than the MNIST Stroke Sequence model (2.3 million vs. 267 thousand). Taking into account the number of input and output data items for the two batches, the TIMIT batch requires around 4× more data transfer than the MNIST Stroke Sequence batch.

While increased data transfer time is likely to be responsible for the majority of the relative slowdown between the two models, there is a second possible source of the discrep- ancy to consider. The GPU memory hierarchy consists of a global memory, shared block memories, and local thread memories. As the memories increase in locality, their access time becomes shorter, but their size becomes smaller. Effective use of GPU memory, then, keeps the data required for computation as local as possible. The parameter matrix for the MNIST Stroke Sequence model may be small enough to be used in this manner. In the case of the TIMIT model, however, the large parameter matrix may not fit completely into the local and shared memory segments of the GPU, and thus portions of it have to be cycled in from global memory. BRU provides an advantage in the way that its parameters are packed tightly in-memory—using 1-bit weights, the 2.3 million parameters require only

276 KB of memory, and thus 93% of the model parameters can be stored in the 256 KB on-chip parameter memory. BRU then fully benefits from the parallelism of its datapath for the majority of computation time.

A caveat to the BRU performance numbers should be considered, however. The run time for BRU only included the duration of computation performed on the device—from the time BRU is notified of a batch start to the time it writes its last result to external memory.

The run time for GPU, on the other hand, also included the time required for the host CPU to transfer data to and from device. A more realistic performance measurement for BRU would include the memory activity of the host CPU as well. The reason that this was not measured in this work was because main memory on-board the ZedBoard was programmed using the USB-JTAG interface, which is slow and not representative of the bandwidth of memory transfer in a real embedded system. A possible future effort could be to implement embedded software on the ARM CPU of the Zynq that uses BRU as a co-processor. This would provide for a better estimate of the real-time throughput of a BRU-based accelerator system. However, it is not expected that a full system characterization would show a significant drop in performance. The reason for this is that, unlike in the case of a GPU,

BRU shares physical memory with the host CPU. The CPU would simply have to copy data from the local process memory into the region of memory for BRU input data, rather than transfer the data over a bus.

CHAPTER VIII

CONCLUSION

Long Short-Term Memory neural networks are a powerful tool for sequence modeling.

However, due to their computational complexity, it can be difficult to implement LSTM inference to be able to handle real-time workloads. This is especially true for embedded systems, which offer minimal power and hardware resources. This work addressed the need for highly efficient LSTM inference with a two-pronged approach: compressing model parameters by using a binary-weight scheme, and designing a specialized digital processor architecture to compute inference for binary-weight LSTM models.

In Chapter 4, various approaches to hardware accelerator design and model compression were discussed. For hardware design, memory utilization was identified to be a primary con- cern due to large storage size requirements for model parameters as well as the performance bottleneck presented by memory access times. Binarization was identified as a compres- sion scheme that can significantly alleviate parameter storage requirements by constraining weights to +1 and −1 during training, thus needing only a single bit to store in memory.

The effectiveness of binarization was evaluated on two separate sequence learning tasks,

handwritten digit recognition and speech recognition. Prediction performance was shown

to be comparable to that of uncompressed LSTM models for these tasks, with accuracy

losses of 0.02% and 3.18% for the handwriting and speech recognition tasks, respectively.

With binarization validated as an effective compression scheme, the design of a hardware accelerator for binary-weight LSTM could proceed. Chapter 5 explored strategies in two primary areas of consideration for hardware design, datapath and memory. Since matrix-vector multiplication makes up the majority of the computations in LSTM, an effective hardware accelerator will contain a parallelized datapath to speed up these computations. The row-wise matrix-vector multiplication algorithm was proposed to accomplish this parallelization. To handle the cell calculations of LSTM, a serial implementation that requires minimal hardware resources was proposed. Then, memory utilization strategies were discussed for input, parameter, hidden state, and cell state data. While limited in its availability, on-chip memory should be used as much as possible to maximize computational throughput. A sizable on-chip parameter cache was determined to be the greatest factor in enabling full parallelism in the datapath.

In Chapter 6, a digital hardware architecture for accelerating inference in binary-weight

LSTM was presented. The accelerator, called Binary Recurrent Unit (BRU), is targeted for FPGA implementation. In a complete embedded system context, BRU is meant to act as a co-processor to a CPU, which offloads inference computations through shared DDR3 memory. The architecture specification contains concrete subsystems which perform the

LSTM inference computations, but it also includes various design parameters, such as the parallelism of the datapath and the size of on-chip memory units, which should be set according to the resources available on the target FPGA device.

To test the effectiveness of the BRU architecture, it was implemented on the pro- grammable logic of a Xilinx Zynq Z7020 device. Chapter 7 detailed the implementation process and the specific design parameters applied for the target device. The BRU im- plementation, clocked at 200 MHz, was used to implement inference for the two binarized

LSTM models presented in Chapter 4. Inference computation time for BRU was then evaluated against the performance of CPU and GPU inference implementations. BRU was shown to outperform CPU by as much as 39× and GPU by as much as 3.8×.

This work demonstrated the feasibility of running LSTM inference in real-time on an embedded system. The BRU architecture could be adapted to a number of different con- sumer applications, such as home voice assistants or mobile platforms, to enable on-device speech recognition in real-time, without the assistance of cloud servers. Besides decreasing server load and bandwidth requirements, moving computations to the local embedded de- vice has a number of benefits to the consumer, including increased user privacy, as well as device usability without an Internet connection. This work contributes one possible solution for enabling deep learning inference in the embedded computing context.

BIBLIOGRAPHY

[1] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” vol. 5, no. 2, pp. 157–166.

[2] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization.” [Online]. Available: http://arxiv.org/abs/1409.2329

[3] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu, “Deep speech 2: End-to-end speech recognition in english and mandarin.” [Online]. Available: http://arxiv.org/abs/1512.02595

[4] Theano. [Online]. Available: http://deeplearning.net/software/theano/

[5] TensorFlow. [Online]. Available: https://www.tensorflow.org/

[6] Introduction to TensorFlow lite. [Online]. Available: https://www.tensorflow.org/ mobile/tflite/

[7] BLAS (basic linear algebra subprograms). [Online]. Available: http://www.netlib. org/blas/

[8] William Wong, “A deeper look at deep-learning frameworks,” vol. 64, no. 8, pp. 28–28.

[9] NVIDIA cuDNN. [Online]. Available: https://developer.nvidia.com/cudnn

[10] E. Eshelman. NVIDIA tesla p100 price analysis. [Online]. Available: https: //www.microway.com/hpc-tech-tips/nvidia-tesla-p100-price-analysis/

[11] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA.” [Online]. Available: http://arxiv.org/abs/1612.00694

[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.

[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a .” [Online]. Available: http://arxiv.org/abs/1704.04760

[14] P. Hernandez, “Microsoft’s project brainwave tackles real-time AI workloads with FP- GAs,” p. 1.

[15] X. Lei, G.-H. Tu, A. X. Liu, C.-Y. Li, and T. Xie, “The insecurity of home digital voice assistants - amazon alexa as a case study.” [Online]. Available: http://arxiv.org/abs/1712.03327

[16] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.” [Online]. Available: http://arxiv.org/abs/1510.00149

[17] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’16. ACM, pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265

[18] R. Prabhavalkar, O. Alsharif, A. Bruguier, and L. McGraw, “On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embed- ded speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5970–5974.

[19] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622.

[20] S. Shin, K. Hwang, and W. Sung, “Fixed-point performance analysis of recurrent neural networks,” vol. 32, no. 4, pp. 158–158. [Online]. Available: http://arxiv.org/abs/1512.01322

[21] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations.” [Online]. Available: http://arxiv.org/abs/1511.00363

[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations.” [Online]. Available: http://arxiv.org/abs/1609.07061

[23] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks.” [Online]. Available: http://arxiv.org/abs/1611.01600

[24] A. X. M. Chang, B. Martini, and E. Culurciello, “Recurrent neural networks hardware implementation on FPGA.” [Online]. Available: http://arxiv.org/abs/1511.05552

[25] J. C. Ferreira and J. Fonseca, “An FPGA implementation of a long short-term memory neural network,” in 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–8.

[26] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short- term memory recurrent neural networks,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634.

[27] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “FPGA acceleration of recurrent neural network based language model,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 111–118.

[28] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr, “Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4.

[29] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An architecture for ultra-low power binary-weight CNN acceleration.” [Online]. Available: http: //arxiv.org/abs/1606.05487

[30] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, “A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks.” [Online]. Available: http://arxiv.org/abs/1702.06392

[31] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented approximation of convolutional neural networks.” [Online]. Available: http://arxiv.org/abs/1604.03168

[32] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks.” [Online]. Available: http://arxiv.org/abs/1511.06393

[33] Y. Fu, E. Wu, A. Sirasao, S. Attia, K. Khan, and R. Wittig, “Deep learning with INT8 optimization of Xilinx devices.” [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf

[34] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to FPGAs,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12.

[35] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/

[36] E. De Jong. MNIST sequence data. [Online]. Available: https://edwin-de-jong.github. io/blog/mnist-sequence-data/

[37] J. S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT acoustic-phonetic continuous speech corpus. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1

[38] Georgios N. Evangelopoulos, “Efficient hardware mapping of long short-term memory neural networks for automatic speech recognition.”

[39] Virtex-7 FPGA family product table. [Online]. Available: https://www.xilinx.com/ products/silicon-devices/fpga/virtex-7.html#productTable

[40] AMBA AXI4 interface protocol. [Online]. Available: https://www.xilinx.com/ products/intellectual-property/axi.html
