BINARY RECURRENT UNIT:

USING FPGA HARDWARE TO ACCELERATE INFERENCE

IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Thomas C. Mealey

UNIVERSITY OF DAYTON

Dayton, Ohio

May, 2018

BINARY RECURRENT UNIT: USING FPGA HARDWARE TO ACCELERATE

INFERENCE IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Name: Mealey, Thomas C.

APPROVED BY:

Tarek Taha, Ph.D.
Advisor Committee Chairman
Associate Professor, Electrical and Computer Engineering

Vijayan Asari, Ph.D.
Committee Member
Professor, Electrical and Computer Engineering

Eric Balster, Ph.D.
Committee Member
Associate Professor, Electrical and Computer Engineering

Robert J. Wilkens, Ph.D., P.E.
Associate Dean for Research & Innovation, Professor
School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering

© Copyright by

Thomas C. Mealey

All rights reserved

2018

ABSTRACT

BINARY RECURRENT UNIT: USING FPGA HARDWARE TO ACCELERATE

INFERENCE IN LONG SHORT-TERM MEMORY NEURAL NETWORKS

Name: Mealey, Thomas C.
University of Dayton

Advisor: Dr. Tarek Taha

Long Short-Term Memory (LSTM) is a powerful neural network algorithm that has been shown to provide state-of-the-art performance in various sequence learning tasks, including natural language processing, video classification, and speech recognition. Once an LSTM model has been trained on a dataset, the utility it provides comes from its ability to then infer information from completely new data. Due to the high complexity of LSTM models, the so-called inference stage of LSTM can require significant computing power and memory resources in order to keep up with a real-time workload. Many approaches have been taken to accelerate inference, from offloading computations to GPU or other specialized hardware, to reducing the number of computations and memory footprint required by compressing model parameters.

This work takes a two-pronged approach to accelerating LSTM inference. First, a model compression scheme called binarization is identified to both reduce the storage size of model parameters and to simplify computations. This technique is applied to training LSTM for two separate sequence learning tasks, and it is shown to provide prediction performance comparable to the uncompressed model counterparts. Then, a digital processor architecture, called Binary Recurrent Unit (BRU), is proposed to accelerate inference for binarized LSTM models. Specifically targeted for FPGA implementation, this accelerator takes advantage of binary model weights and on-chip memory resources in order to parallelize LSTM inference computations.

The BRU architecture is implemented and tested on a Xilinx Z7020 device clocked at

200 MHz. Inference computation time for BRU is evaluated against the performance of

CPU and GPU inference implementations. BRU is shown to outperform CPU by as much as 39× and GPU by as much as 3.8×.

This work is dedicated to my wife, Michelle.

ACKNOWLEDGMENTS

I have greatly enjoyed my foray into the field of deep learning over the past year. First and foremost, I would like to thank my wife, Michelle, for her support and encouragement throughout the process. Without your help, this thesis would not have been possible.

I would also like to thank my advisor, Dr. Tarek Taha, who introduced me to the intersection of deep learning and digital hardware design. It has been a pleasure to work with you, and I hope to continue our collaboration in the future.

TABLE OF CONTENTS

ABSTRACT...... iii

DEDICATION...... v

ACKNOWLEDGMENTS...... vi

LIST OF FIGURES...... x

LIST OF TABLES...... xii

I. INTRODUCTION...... 1

1.1 Deep Learning Inference...... 1
1.2 This Work...... 3

II. BACKGROUND...... 5

2.1 Sequence Learning with Neural Networks...... 5
2.1.1 Feed-forward Neural Networks...... 6
2.1.2 Recurrent Neural Networks...... 9
2.1.3 Long Short-Term Memory Networks...... 11
2.2 Inference Acceleration with Hardware...... 13
2.2.1 Software Implementation...... 14
2.2.2 Hardware Implementation...... 16
2.3 Motivation...... 19

III. RELATED WORK...... 22

3.1 Model Compression...... 22
3.2 LSTM Accelerators...... 26
3.3 Other Hardware Accelerators...... 29

IV. TRADE STUDY...... 32

4.1 Analysis of Related Work...... 32
4.1.1 Compression...... 32
4.1.2 Datatype...... 35
4.1.3 Memory...... 36
4.2 Effectiveness of Binarization...... 38
4.2.1 ...... 39
4.2.2 Speech Recognition...... 41

V. SYSTEM DESIGN...... 46

5.1 Datapath...... 46
5.1.1 Gate Pre-activation...... 48
5.1.2 Cell Calculations...... 53
5.2 Memory...... 56
5.2.1 Input Vector...... 57
5.2.2 Parameter Matrices...... 58
5.2.3 Hidden Output Vector...... 62
5.2.4 Cell State Vector...... 63
5.3 Theoretical Performance...... 64

VI. HARDWARE ARCHITECTURE...... 66

6.1 Control Logic...... 68
6.1.1 Control Unit...... 68
6.1.2 Control Registers...... 71
6.2 External Memory...... 71
6.2.1 Read Unit...... 72
6.2.2 Write Unit...... 73
6.3 Internal Memory...... 74
6.3.1 Input Memory...... 75
6.3.2 Parameter Memory...... 75
6.3.3 Hidden State Memory...... 77
6.3.4 Cell Calculation Unit Memory...... 77
6.4 Matrix-Vector Product Unit...... 79
6.4.1 Controller...... 79
6.4.2 Processing Element Array...... 80
6.4.3 Stream Conversion Unit...... 81
6.5 Cell Calculation Unit...... 82
6.5.1 Unit...... 83
6.5.2 Elem-Mult Unit...... 84
6.5.3 Elem-Add Unit...... 85

VII. IMPLEMENTATION & RESULTS...... 87

7.1 Methods...... 87
7.1.1 Tools...... 88
7.1.2 Design...... 88
7.1.3 Verification...... 91
7.2 Hardware...... 92
7.2.1 Device...... 92
7.2.2 Implementation Parameters...... 93
7.2.3 Resource Utilization...... 94
7.3 Performance Evaluation...... 95
7.3.1 Results...... 96
7.3.2 Analysis...... 97

VIII. CONCLUSION...... 101

BIBLIOGRAPHY...... 104

LIST OF FIGURES

2.1 Feed-forward Neural Network ...... 8

2.2 Recurrent Neural Network ...... 11

2.3 Long Short-Term Memory Cell ...... 12

2.4 Field-Programmable Gate Array (FPGA) Device Architecture ...... 19

4.1 Example Data from MNIST Stroke Sequence Dataset ...... 40

4.2 Fully-Connected LSTM Architecture ...... 41

4.3 Bidirectional Fully-Connected LSTM Architecture ...... 42

4.4 Example Data from TIMIT Dataset ...... 43

5.1 Row-wise Matrix-Vector Multiply Processing Element ...... 50

5.2 Matrix-Vector Product Unit (MVPU) Architecture ...... 51

5.3 Binary-Weight Row-wise Matrix-Vector Multiply Processing Element . . . . 53

5.4 Cell Calculation Unit (CCU) Architecture ...... 56

5.5 In-Memory Organization for Parameter Data ...... 61

6.1 Inference Accelerator System Architecture ...... 67

6.2 Binary Recurrent Unit Architecture ...... 67

6.3 Control Unit Batch Computation State Flow ...... 70

7.1 Vivado IP Integrator Block Design...... 89

7.2 MVPU Processing Element Implemented in Simulink...... 90

7.3 Zedboard Zynq-7000 Development Board...... 92

LIST OF TABLES

4.1 Testing set accuracy for the MNIST Stroke Sequence dataset...... 41

4.2 Testing set accuracy for the TIMIT dataset...... 44

5.1 Example row-wise matrix-vector multiplication procedure...... 52

5.2 Cell Calculation Unit (CCU) schedule...... 55

7.1 BRU Implementation Parameters, Targeted for Z7020...... 93

7.2 FPGA Resource Utilization for BRU on the Z7020...... 95

7.3 Run Time Performance Comparison of BRU, CPU, and GPU Running Inference for MNIST Stroke Sequence Model...... 96

7.4 Run Time Performance Comparison of BRU, CPU, and GPU Running Inference for TIMIT Model...... 97

CHAPTER I

INTRODUCTION

1.1 Deep Learning Inference

Today is a very exciting time to be in machine learning. Over the past decade, the field has exploded in terms of both its applications in society and the amount of research being done to advance the state of the art. These advances have been made possible in large part due to rapid increases in computing power as well as unprecedented accessibility of data, but nonetheless it is the passion and excitement around machine learning that has motivated scientists and engineers to continue to push the boundaries of what is possible in modern computing.

Broadly speaking, machine learning is a field of computer science that enables computers to do something without explicitly being programmed how to do it. Given a specific task, information about the task, and a desired outcome, a computer program can learn how to perform the task on its own by using a machine learning algorithm. In this way, the objectives are clearly defined, but the strategy for achieving the objective is not; it is up to the computer to find the strategy.

Deep learning refers to a subset of machine learning algorithms which use a neural network-based approach to learning. More specifically, deep learning models are composed of multiple layers of artificial neurons. These layers are connected to each other in a cascaded manner, in which the output of one layer becomes the input to the next one. This forms a deep hierarchical structure, hence the term deep learning. The hierarchy corresponds to an increasing level of abstraction from input to output. That is, each successive layer may understand concepts which are the combination of the features from the layer below.

To better illustrate this concept, consider how the human brain might recognize a face in an image. First, the contours and edges in the image are identified. This represents the understanding of the first (shallowest) layer in a deep network. Then, different facial features, such as a mouth or a nose, are formed from the contours and identified as such. This represents the activity of an intermediate layer of the network. Finally, since these facial features exist in close proximity to one another, we recognize that what we are seeing must be a face.

This represents the final (deepest) layer of the network.

In reality, this is not how facial recognition works in the brain. However, the example serves to illustrate the concept of hierarchical features that deep learning seeks to utilize.

Because of this, deep learning models have been found to be most effective when applied to tasks that involve data from which a hierarchical understanding can be formed. For example, deep learning has been widely applied to the field of speech recognition. It is straightforward to see how a hierarchy of understanding could be built from raw audio → phonetic sounds → words → whole phrase.

Because of their deep structure, however, deep learning models can be quite complex, containing millions of parameters. This fact makes it difficult to design and deploy applications that take advantage of their prediction power.

Deep learning takes place in two stages: training and inference. During training, a large amount of data is shown to the model in a recurring fashion. The model performs

some task with the input data (e.g. identify a phrase from raw audio) using its current state of understanding. By examining the error between the desired output and the actual output, small, incremental adjustments are made to the model’s parameters until it is able to perform the desired task with an acceptable level of error over the training dataset. Then, during the inference stage, new data is shown to the model. Although this data has not been seen before, the model has learned enough from the training data that it can infer the

(hopefully) correct understanding of the new data.

It is the inference stage that carries the utility of a deep learning model—it is what is implemented and deployed to the end-user. The details of this implementation are determined by the application. Often, the application brings demanding real-time requirements in terms of latency and number of concurrent users. Complex models require a large amount of memory resources. The enormous number of computations being performed results in high energy consumption. For all of these reasons, designing a deep learning inference system can be a challenging task.

1.2 This Work

This work is concerned with the design of a high-performance, energy-efficient solution to implementing deep learning inference. More specifically, a digital processor architecture is designed for accelerating inference in sequence-learning applications, such as speech recognition. The architecture is synthesized for hardware and tested on a field-programmable gate array (FPGA) platform.

The work is structured as follows: Chapter 2 provides background information on neural networks for sequence-learning as well as different hardware platforms for implementing inference. The motivation for the design of a custom hardware architecture is presented.

Chapter 3 provides an overview of recent academic literature on the topics of model compression and hardware accelerator design.

Chapter 4 investigates the various trades associated with model compression and hardware accelerator design. A compression approach, called binarization, is selected and explored in order to further demonstrate its validity.

Chapter 5 walks through the design process for the accelerator architecture, considering computational dataflow and memory resource usage strategies. An expression for theoretical performance of the architecture is derived.

Chapter 6 dives into the details of the accelerator architecture, providing an overview of its major subsystems and their functions.

Chapter 7 discusses the implementation of the architecture, including the design workflow, the software tools used, and the hardware platform used for implementation and testing. Actual performance for two models is reported and compared with the performance of their software implementations.

Finally, Chapter 8 concludes the work, restating the motivation for the hardware accelerator design and summarizing the findings of the work.

CHAPTER II

BACKGROUND

This work explores the intersection of two topics: neural networks for sequence learning and inference acceleration through custom hardware design. This chapter provides background information on both of these topics, which will help to provide context for the remainder of the work. Then, the motivation for the design of a custom hardware accelerator architecture is presented.

2.1 Sequence Learning with Neural Networks

Neural networks have become a widely used tool in the fields of data analytics and artificial intelligence. Inspired by how neurons in the human brain work, this family of machine learning algorithms is based on a collection of artificial neurons, connected to each other in a weighted graph. Neural networks can be used to approximate complex, nonlinear functions with many inputs. There are many variants of neural networks, each of which has an architecture that has been tuned to best handle a particular task.

Sequence learning refers to a subset of machine learning tasks that deal with data for which order is important—that is, the data is arranged in a specific order, and this order is relevant to the task at hand. For example, a sequence learning task might be to predict the next-day closing price of a stock, given the closing price of that stock from the past 60 days. This is a regression task, in which the goal is to predict an unknown, continuous-valued output. Another example of a sequence learning task would be to predict the next word in a sentence, given the phrase “I went to the store to buy”. This is a classification task, where the goal is to predict an unknown, but discrete-valued output. Another example would be to label the word being spoken in a segment of audio; this is also a classification task, but the goal is to produce the correct label for the entire sequence, rather than for the next item in the sequence. There is a wide range of sequence learning problems, and neural network-based approaches have been shown to deliver state-of-the-art results in many of these areas.

One of the most commonly used and effective neural network variants for sequence learning is called Long Short-Term Memory. This architecture is derived from a basic modification to neural networks for handling sequential data, called a recurrent neural network. To understand the details of these architectures, first we will examine how the most basic neural network, called a feed-forward network, is structured.

2.1.1 Feed-forward Neural Networks

A feed-forward neural network is composed of multiple, interconnected layers of artificial neurons. These connections are fed forward from an input layer, to zero or more intermediate layers (also called “hidden” layers), to an output layer. The output of each neuron in the hidden and output layers is equal to the weighted sum of its connections to the neurons in the layer below, plus a bias value, with an activation function applied. Mathematically, the output h of a neuron can be written as:

h = g( Σ_{i=1}^{M} w_i x_i + b ),    (2.1)

where

w_i is the connection weight to the i-th neuron in the previous layer,

x_i is the output of the i-th neuron in the previous layer,

M is the number of neurons in the previous layer,

b is the bias value of the neuron, and

g(·) is the activation function.

For convenience, we can write the output of all the neurons in a single layer using vector notation:

h = g(W x + b), (2.2)

where

h is the N-dimensional output vector,

x is the M-dimensional input vector,

W is the N × M weight matrix,

b is the N-dimensional bias vector,

M is the number of neurons in the previous layer,

N is the number of neurons in the current layer, and

g(·) is the activation function.

Figure 2.1: A simple feed-forward neural network with a single hidden layer.

The weight matrix W contains the parameters of the neural network model. Its values are initially unknown, but through the process of training, the model learns connection-weight values that are appropriate for the task at hand (i.e. they approximate the desired output function). The bias values are also learnable parameters, found through training. For neural networks, training is performed using a technique called backpropagation. With this method, the gradient of the output error, or loss, function is propagated backwards through the network from output to input, and small, gradual adjustments are made to the connection weights in order to minimize the loss. This procedure is computationally very expensive, and thus the training process can require significant time and computational resources.
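As a minimal illustration of the update step just described, a single gradient-descent adjustment could be sketched as follows (an assumption-laden sketch: the learning rate, function name, and gradient arguments are placeholders, not values taken from this thesis):

```python
import numpy as np

def gradient_step(W, b, dW, db, lr=0.01):
    """One small, gradual parameter adjustment using gradients obtained from
    backpropagation (dW, db). The learning rate lr is an illustrative value."""
    W_new = W - lr * dW   # move the weights against the loss gradient
    b_new = b - lr * db   # and likewise for the bias terms
    return W_new, b_new
```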

The activation function is simply some predefined function, the purpose of which is

usually to constrain the output to some desired range. The choice of activation function is

determined by the application. Often, it is chosen to be a nonlinear function. The activation function used may also differ for each layer. A popular choice of activation function is the sigmoid function, which constrains the output between 0 and 1:

sigm(x) = 1 / (1 + e^{-x})    (2.3)

The hyperbolic tangent function, which has a range from -1 to 1, is another commonly used activation function:

tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})    (2.4)
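As a concrete software reference point, Equations 2.2–2.4 can be evaluated with a few lines of NumPy (a sketch only; the layer sizes and variable names below are arbitrary choices, not values from this work):

```python
import numpy as np

def sigm(x):
    """Sigmoid activation, Equation 2.3."""
    return 1.0 / (1.0 + np.exp(-x))

def feedforward_layer(x, W, b, g=np.tanh):
    """Single-layer forward pass h = g(Wx + b), Equation 2.2."""
    return g(W @ x + b)

# Example with M = 4 inputs and N = 3 hidden neurons (sizes chosen arbitrarily).
x = np.random.randn(4)
W = np.random.randn(3, 4)
b = np.zeros(3)
h = feedforward_layer(x, W, b, g=sigm)
```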

A standard feed-forward neural network can be used for sequence learning tasks. A fixed context window size C is determined, then inputs x_t, x_{t+1}, ..., x_{t+C-1} are concatenated row-wise and fed to the input layer of the network as a single input vector. The sequence order is preserved, and the model should learn temporal relationships in the input. However, this is not the best approach to sequence learning for a couple reasons. For one, the dimensionality of the input vector (and thus the number of connection weights) grows by a factor of C, which can become unwieldy quickly for high-dimensional inputs and long context windows. Secondly, because the context window is fixed, we are limited in what the model can learn; temporal relationships existing beyond the scope of the window will not be captured. Ideally, we want to learn representations of arbitrary length. Recurrent neural networks solve this issue and are a more natural fit for sequence learning tasks.
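The fixed-window input construction described above can be sketched as follows (illustrative only; the sequence length, input dimension, and window size are assumptions):

```python
import numpy as np

def window_input(sequence, t, C):
    """Concatenate x_t through x_{t+C-1} into one input vector for a
    feed-forward network with a fixed context window of size C."""
    return np.concatenate(sequence[t:t + C])

# A toy sequence of ten 8-dimensional inputs, windowed with C = 3.
seq = [np.random.randn(8) for _ in range(10)]
x_windowed = window_input(seq, t=0, C=3)   # 24-dimensional input vector
```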

2.1.2 Recurrent Neural Networks

The Recurrent Neural Network (RNN) was developed to solve the issue of learning arbitrary-length temporal dependencies. In its most basic form, an RNN differs from a feed-forward network in that its hidden layers contain connections from the output of each neuron back to the input of each neuron. This recurrent connection forms a memory of previous inputs, allowing the model to learn relationships in a sequence. Mathematically, the output h_t of a hidden recurrent layer at time t can be written as:

h_t = g(W_x x_t + W_h h_{t-1} + b),    (2.5)

where

h_t is the N-dimensional output vector,

x_t is the M-dimensional input vector,

W_x is the N × M input weight matrix,

W_h is the N × N recurrent weight matrix,

b is the N-dimensional bias vector,

M is the number of neurons in the previous layer,

N is the number of neurons in the current layer, and

g(·) is the activation function.
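A compact sketch of Equation 2.5 applied across a whole input sequence is shown below (NumPy; the function and variable names are illustrative choices, not taken from the thesis):

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b, g=np.tanh):
    """Apply Equation 2.5 step by step, carrying the hidden state forward."""
    h = np.zeros(W_h.shape[0])            # h_0 initialized to zero
    outputs = []
    for x_t in x_seq:                     # inputs are fed one at a time, in order
        h = g(W_x @ x_t + W_h @ h + b)    # h_t depends on x_t and h_{t-1}
        outputs.append(h)
    return outputs
```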

Inputs are fed to an RNN one at a time, in sequential order. The network produces an

output for every input, but for some tasks the output may only be relevant after a certain

number of inputs has been seen. In principle, this modified neural network architecture

is a simple and powerful tool for sequence learning. However, in practice, RNNs can be

difficult to train properly due to the problems of vanishing and exploding gradients [1]. As

a result, they often have difficulty maintaining long-term contextual information. A variant

of the RNN, called Long Short-Term Memory, was created to solve this issue.

Figure 2.2: A recurrent neural network with a single hidden layer.

2.1.3 Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) networks are a variation of the basic recurrent neural network architecture. They are composed of “cells”, which are computational units built from the basic artificial neuron structure. LSTM solves the vanishing and exploding gradients problem by adding an explicit memory unit, called the cell state, inside each of its computational units. Each LSTM cell has four gates, which control the input to the cell state as well as the degree to which the cell state influences the current output.¹ Each of these gates is, in effect, a small RNN layer itself—it has weighted connections from the input and the previous output.

The input gate controls which of the present inputs (new information) are considered in updating the cell state:

i_t = sigm(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (2.6)

¹There have been numerous variations of LSTM, which differ in the number of gates used and where recurrent connections are placed. This work uses the implementation described by Zaremba et al. in [2].

Figure 2.3: Architecture of the Long Short-Term Memory cell.

The new-input gate controls how the new information will affect the update to the cell state:

j_t = tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j)    (2.7)

The forget gate controls which of the present inputs will affect how much of the previous state is retained:

f_t = sigm(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (2.8)

The output gate controls which of the present inputs will affect the output of the cell:

o_t = sigm(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (2.9)

These gates are combined elementwise to compute an update to the cell state:

c_t = i_t ⊙ j_t + f_t ⊙ c_{t-1}    (2.10)

Finally, the hidden layer output is computed as the elementwise product of the output gate and the activated cell state:

h_t = o_t ⊙ tanh(c_t)    (2.11)
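For reference, Equations 2.6–2.11 map directly onto a few lines of NumPy. The sketch below stores the per-gate parameters in dictionaries purely for brevity; that packaging is a notational assumption, not the parameter storage scheme used later in this work:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update following Equations 2.6-2.11."""
    i_t = sigm(W['xi'] @ x_t + W['hi'] @ h_prev + b['i'])     # input gate      (2.6)
    j_t = np.tanh(W['xj'] @ x_t + W['hj'] @ h_prev + b['j'])  # new-input gate  (2.7)
    f_t = sigm(W['xf'] @ x_t + W['hf'] @ h_prev + b['f'])     # forget gate     (2.8)
    o_t = sigm(W['xo'] @ x_t + W['ho'] @ h_prev + b['o'])     # output gate     (2.9)
    c_t = i_t * j_t + f_t * c_prev                            # cell state      (2.10)
    h_t = o_t * np.tanh(c_t)                                  # hidden output   (2.11)
    return h_t, c_t
```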

By using the gates, LSTM can decide what information to discard and what to keep in

a given context. This mechanism allows the model to learn both long- and short-term de-

pendencies. For this reason, LSTM has become one of the most widely used machine learn-

ing algorithms (neural network-based or otherwise) for sequence learning tasks—including

state-of-the-art speech recognition systems [3]. This work focuses on LSTM as the specific

network architecture for which to accelerate inference.

2.2 Inference Acceleration with Hardware

As we have seen, implementing a deep learning application is a two step process: first

train the model, then implement model inference. The training phase, as in all machine

learning applications, requires the meticulous data cleaning and setup. Thanks to machine

learning frameworks like [4] and TensorFlow [5], neural network training can be

implemented relatively easily in Python. However, the actual process of training can take a

significant amount of time—days, or even weeks—to complete. This is due to the complexity

of the backpropagation algorithm used in training; specifically, the “backward pass” is

responsible for the majority of the computational burden. The “forward pass” refers to the

propagation of data from the input layer of the network to the output. The “backward

pass” consists of the propagation of gradients from the output to the input layer, and it is

the calculation of these gradients that requires significant computation time. The training

process is often iterative, as well. The system designer makes an educated guess when

setting hyperparameter values such as layer sizes and learning rate, but often these values

need to be tweaked, and the whole training process is rerun. However, once a model has been sufficiently trained, this step is finished and usually does not need to be revisited. The long computation time for training can be thought of as a large “one-time cost”.

Inference, on the other hand, carries the “recurring costs” of the deep learning application. In this stage, new data is fed to the model, and only the forward pass through the network is computed. The latency of the inference system, then, is determined by the time it takes to compute the forward pass. How the forward pass is implemented will determine the magnitude of its recurring cost. “Cost”, in this case, carries two implications. The

first is the actual monetary cost due to the energy consumption of the system. It is in the best interest of the application maintainer to minimize the amount of energy required to compute inference because this cost is incurred every time the system processes a new input. Secondly, there is the cost of time to the user of the application. For an analytics application, this equates to the amount of new data that the system can process in an hour.

For a mobile application, this means the time the user waits for a response from the system.

In either case, it is beneficial to minimize system latency.

Fortunately, once a model has been trained, inference computations become fixed. While the inputs to the model differ, the model parameters and the number of underlying com- putations remains the same. Because of this, inference can be optimized for both energy consumption and computational latency. However, this optimization can come at the cost of development time and complexity. There are a number of different strategies and hardware platforms that support different areas of this trade space.

2.2.1 Software Implementation

The first route that many people take when implementing inference is to use a pure-

CPU, software-based approach. This strategy is appealing because it is the simplest, both in terms of development time and design complexity. It also provides flexibility, as a high-level software solution can run on any platform, and no special hardware is required. The develop-debug feedback loop for software is short compared to that of hardware design, so a robust solution can be implemented in relatively short order. When using a machine learning framework like Theano or TensorFlow, it is feasible to simply take the same code used for training and adapt it to implement inference. TensorFlow Lite [6] was developed specifically for the purpose of easily adapting TensorFlow code to mobile applications. The barriers to entry for a software-based inference implementation are low, and so oftentimes this approach should be the first one taken by the application designer—especially if real-time requirements for system latency and throughput are modest.

However, for applications demanding low-latency, high-throughput requirements, a pure software-based approach may be insufficient. To understand why this is the case, we must look at the underlying computations being performed in the forward pass. For a single hidden layer of a feed-forward network, the output of the forward pass is computed by

Equation 2.2. With N hidden units and M-dimensional input, this computation requires

M·N multiply operations, M·N addition operations, and N activation function calculations.

For a basic scalar processor, these computations must be performed sequentially, and the M ·

N term dominates the complexity. Modern superscalar, multicore processor architectures, as well as low-level, optimized linear algebra libraries such as BLAS [7], introduce parallelism that helps to speed up matrix-vector computations, but fundamentally there is a limit to computational performance when using a pure software-based approach.

One solution for speeding up inference computation is to augment a software approach with a Graphics Processing Unit (GPU). Containing thousands of processing cores, GPUs provide massive parallelism that can speed up matrix-vector operations by multiple orders

15 of magnitude. GPUs have become a popular tool in the deep learning community; in fact, hardware manufacturers have begun targeting the design of their GPUs specifically for deep learning applications [8]. Perhaps the key reason for their popularity is the ease at which they can be integrated into an existing software solution. Many machine learning frameworks offer seamless GPU integration through their back-end computational engines.

At run time, if the engine sees that a GPU is present on the machine, it will automatically offload computations to it. The high-level code describing the model need not be changed in order to take advantage of the GPU. Of course, it would be possible to manually target the

GPU by writing custom CUDA code to implement inference, but thanks to highly-optimized

GPU libraries for deep learning like cuDNN [9], any improvements in performance may be negligible.

There are a few disadvantages to the GPU-based approach, however. The first is the high cost of the device—high-end GPUs can cost thousands of dollars [10]. While this is a non-recurring cost, it may simply be too high of a barrier to entry, especially for individuals. The other main disadvantage of using a GPU is its high power requirement.

For a complex neural network-based speech recognition algorithm, a GPU implementation required almost twice the amount of power required by its CPU-only counterpart [11]. While total energy consumption for a particular task may be lower on a GPU than on a CPU, the high power requirement of a GPU may be prohibitive for some platforms—especially embedded systems. In order to meet performance requirements while minimizing energy consumption, it may be necessary to design custom hardware for accelerating inference.

2.2.2 Hardware Implementation

While software provides design flexibility and simplicity, we have seen its limitations as a platform for implementing inference for high-performance and embedded deep learning

applications. In order to handle demanding workloads, as well as in the interest of lowering long-term energy costs, designing custom hardware may be a desirable strategy.

In general terms, a custom hardware solution is called an Application-Specific Integrated

Circuit (ASIC). In contrast with general purpose processors like CPUs, which are designed to handle a large variety of tasks, ASICs are designed to perform a single function. ASICs contain only the resources required to perform their specified function, and for this reason,

ASICs can be highly optimized for computational throughput and energy efficiency. For example, an ASIC designed for accelerating a neural network-based application achieved 13× faster computation time and 3,400× less energy consumption than a GPU-based solution [12].

Broadly speaking, the less variability in an algorithm, the greater the opportunity there is to optimize its implementation with hardware. Inference in all neural network-based algorithms has a few properties that make it well-suited for a hardware-optimized implementation. Firstly, there are a fixed number of computations performed in each forward pass through the network. This allows the hardware designer to choose the appropriate amount and type of computational units like adders and multipliers. Secondly, matrix-vector multiplication is easily parallelizable—each element of the vector product relies on a separate row of the parameter matrix, and so these output terms can be computed independently of one another. Thirdly, the model parameter values are fixed and known at run time. This opens up the opportunity for various compression schemes and quantization approaches, which reduce the amount and size of the parameters in memory. Finally, in addition to the number of computations being fixed, the order in which they are performed, and thus the order in which data is accessed, can be fixed. Having a predictable data-access pattern

17 allows memory operations to be optimized, either through the interface to off-chip DRAM, or in the properties of on-chip SRAM.

The process of designing digital hardware involves writing in a hardware description language (HDL) to describe the structure and behavior of a circuit. Unlike most software languages, which describe a sequential procedure to be executed by a CPU, HDL can describe operations being done in parallel. The process of writing and testing HDL can be both difficult and time-consuming, requiring a special skill set and domain knowledge.

Because of this, designing custom hardware to accelerate inference carries a high non-recurring engineering cost. Additionally, once an ASIC has been fabricated, its design is

fixed. Any modifications to the design require a full re-fabrication cycle and replacement of existing devices, which can be expensive.

Field-Programmable Gate Arrays (FPGAs) present a solution to hardware design that offers more flexibility than an ASIC. These devices contain a large, interconnected array of configurable logic blocks (CLBs), which can be reconfigured to implement complex digital circuits. They also contain memory (RAM) and multiply-add units (DSPs) that are connected to the CLBs (see Figure 2.4). In this way, FPGAs are analogous to a breadboard for integrated circuit design. While they do not offer the energy-efficiency and performance of ASICs, they still bring significantly better energy efficiency than CPU and GPU and, if the algorithm can be sufficiently optimized for hardware, as good or better speed. One

FPGA-based hardware accelerator for a speech recognition application achieved 43× and

197× better speed and energy-efficiency, respectively, than a CPU implementation; and 3× and 14× better speed and energy-efficiency, respectively, than a GPU implementation [11].

In this work, a custom hardware architecture is designed for implementing LSTM inference. It is possible that a future adopter of the architecture could decide to implement

Figure 2.4: Field-Programmable Gate Array (FPGA) device architecture.

the system as an ASIC in order to maximize energy-efficiency and performance. However, due to the rapid-prototyping nature of this project, the design specifically targets an FPGA platform.

2.3 Motivation

Sequence learning is a broad field with many practical applications, from speech recognition to video classification. Long Short-Term Memory networks have been shown to be a very effective approach to handling these types of problems. However, implementing inference for such applications can be challenging due to the computational complexity of LSTM.

A pure-CPU software-based implementation will have limited computational throughput and poor energy efficiency. While this limitation may be acceptable for some applications, oftentimes inference is augmented with GPUs or other specialized hardware in order to improve performance.

In a cloud-based analytics setting, the benefits of augmenting servers with efficient hardware are clear. For one, faster computation improves system throughput, lowering the amount of server resources required to handle the workload, and/or increasing the maximum workload the system can handle. Secondly, better energy-efficiency lowers the operating cost of the system. Companies like Google and Microsoft have begun deploying custom ASICs and FPGAs to their data centers to enhance the performance of their analytics applications

[13][14].

For mobile and Internet-of-Things (IoT) deep learning applications, many times inference computation is not performed on the embedded device. Rather, data is offloaded to a cloud server to be processed. However, there are many benefits to performing inference on the device instead. As a case study, consider Amazon Alexa, a home voice assistant service.

Using an Alexa-enabled device, users can speak voice commands to do various things such as play music, check the weather, order food, and control other smart devices. The system processes voice commands by having the local device send a recorded voice audio stream to Amazon’s servers, where it is then decoded using a speech recognition algorithm [15].

If voice recognition were to take place on the device instead, there would be a number of benefits. The service provider (Amazon) benefits from decreased server load, and thus lower operating costs. The user would benefit from increased privacy, as only the directives from their voice commands, rather than the raw audio, would be shared with the server. Additionally, the system could provide limited functionality even without an Internet connection if information could be downloaded to the device in advance.

The challenge of implementing inference on-device for a voice recognition system is twofold. For one, voice recognition is a near real-time task; the user will not tolerate a long delay for processing commands. However, inference in state-of-the-art voice recognition

models, like ones that use LSTM, is computationally complex and may require significant computational resources in order to meet the system’s latency requirements. At the same time, embedded devices are required to be relatively low-cost and low-power. It would not be feasible to simply replace an embedded processor with a more capable, but more costly, one. A low-cost, low-power solution to accelerating inference for sequence learning tasks is required. Therefore, this work proposes an FPGA-based architecture for LSTM inference acceleration.

CHAPTER III

RELATED WORK

To date, there have been several studies examining the problem of implementing deep learning inference in a hardware-friendly manner. Complete solutions to this problem come in two stages: first, reduce the amount of data needed for computations by compressing the model, then optimize dataflow for the target hardware platform. This chapter provides an overview of academic literature in both of these areas.

3.1 Model Compression

For many deployed deep learning applications, the most costly operation in terms of both time and power consumption is off-chip memory access. Han et al. propose a 3-stage model compression and quantization method that dramatically reduces the storage requirement of model parameters and allows many deep learning models to be stored in on-chip SRAM [16]. The authors show that, for a small number of example feedforward and convolutional networks, this method can reduce parameter storage requirements by 35 to

49× without loss in prediction accuracy. The three compression stages are as follows. First, pruning removes all model parameters which have an absolute value below some threshold.

Next, parameters are quantized through weight sharing, in which weight values are binned

22 through k-means clustering. This allows only the weight indices to be stored. Finally, the pruned and quantized parameters are stored in memory using Huffman coding.
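To make the weight-sharing step concrete, the sketch below bins a weight matrix with k-means and keeps only a small codebook plus per-weight indices (an illustration of the idea in [16], assuming scikit-learn is available; the cluster count is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

def share_weights(W, n_clusters=16):
    """Cluster weights and return a codebook of shared values plus the
    per-weight indices that would actually be stored."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(W.reshape(-1, 1))
    codebook = km.cluster_centers_.ravel()      # shared weight values
    indices = km.labels_.reshape(W.shape)       # small integer index per weight
    return codebook, indices

# Reconstruction at inference time is a table lookup: W_hat = codebook[indices]
```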

In [11], Han et al. present a model compression method called load balance-aware pruning. This method extends the standard pruning technique by optimizing the workload across parallel processing elements, resulting in reduced computation time. The technique works by first dividing the parameter matrix into submatrices based on the number of processing elements available. These submatrices are then constrained to have the same sparsity, such that they contain an equal number of non-zero parameters after pruning. The authors show that for an example speech recognition model, the prediction performance of load balance-aware pruning does not differ much from that of standard pruning. Linear quantization is used to further compress the example model, with the fixed-point fraction length set by analyzing the dynamic range of each layer. The authors found that quantizing to 12-bits did not reduce the prediction performance from that of the 32-bit floating point model. However, 8-bit fixed point quantization caused prediction performance to deteriorate significantly.

In [17], Qiu et al. explore compression of fully-connected layers in a convolutional network using singular value decomposition (SVD). For an example CNN model, SVD is performed on the first fully-connected layer weights. The first 500 singular values are chosen.

This results in a compression rate of 7× and a prediction accuracy loss of only 0.04%. The authors also propose a data quantization flow in which the fixed-point fractional length of both the model parameters and the layer inputs is optimized. First, the dynamic range of parameter matrices for each layer is analyzed in order to determine the fractional length which yields the smallest total error between the fixed-point and floating-point versions of the parameters.

Then, the same process is applied to the input data at each layer. Fractional lengths

23 are static in a single layer but dynamic in-between layers. The authors found that for an example CNN model, 16-bit static fixed-point precision resulted in negligible prediction performance loss compared to that of the floating-point model. 8-bit static precision resulted in significantly worse performance, but an 8-bit dynamic precision model resulted in a performance loss of only 1.52%.

Prabhavalkar et al. apply SVD for LSTM model compression in [18]. First, the authors present the compression technique for a general recurrent network. Input and recurrent parameter matrices for each layer are factored jointly to produce a recurrent projection matrix. The compression ratio for a given layer can then be controlled by setting the rank of the projection matrix. This method is extended to LSTM by concatenating the four gate parameter matrices and treating them as a single matrix. For an example speech recognition model, the authors showed that they were able to compress the original network to a third of its original size with only a small degradation in accuracy.
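The basic low-rank idea behind these SVD approaches can be sketched in a few lines of NumPy (illustrative only; this is not the joint factorization procedure of [18]):

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate an N x M weight matrix by two thin factors, keeping only
    the largest `rank` singular values."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]     # N x rank
    B = Vt[:rank, :]               # rank x M
    return A, B                    # W is approximated by A @ B

# Inference computes A @ (B @ x), costing rank*(N+M) multiplies instead of N*M.
```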

Chen et al. investigate the effect of various quantization strategies during training and inference in [19]. The authors find that for a CNN trained on MNIST using 32-bit floating point parameters, quantizing to 16-bit fixed point for inference results in a loss of only

0.01% in prediction accuracy.
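For reference, a simple linear fixed-point quantizer of the kind evaluated in these studies might look like the following (a sketch under assumed word and fraction lengths, not the exact procedure of [19]):

```python
import numpy as np

def quantize_fixed_point(x, frac_bits=8, total_bits=16):
    """Quantize to signed fixed point, e.g. Q8.8 when total_bits=16 and
    frac_bits=8, then return the dequantized values used for evaluation."""
    scale = 2 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), lo, hi)
    return q / scale
```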

In [20], Shin et al. apply a weight-sharing quantization approach to large-scale LSTM models. After an initial model training stage using floating-point parameters, uniform quantization and retraining are applied iteratively in order to find an optimal quantization step size. This approach is applied to two multilayer LSTMs for natural language processing applications. The authors find that due to the wide dynamic range across layers of the model, an optimal approach is to group parameter values by layer and connection type (i.e.

separate feedforward and recurrent parameter matrices) and perform quantization of each group separately.

In [21], Courbariaux et al. introduce a method for training very low-precision (1-bit) neural networks. So-called binary neural networks drastically reduce both memory and computational requirements for deployed models. With this model compression scheme, weights are constrained to values of +1 or -1 during training. Thus, inference computation includes only addition and subtraction, which is simpler and more efficient to implement in hardware than multiplication. The authors apply this technique to a 3-layer feedforward network for MNIST classification and achieved 0.01% better prediction accuracy than that of a full-precision version of the model.

Hubara et al. propose an alternative approach for model binarization in [22]. This training method constrains not only the network weights to binary values, but layer activations as well. This provides even more of an advantage for hardware implementation, as all computational operations involve only +1 or -1. By mapping -1 to 0, multiply-accumulate operations can be replaced with XNOR and bit-count operations, which can be computed very quickly and efficiently in hardware. As in [21], the authors apply this approach to a

3-layer feedforward network for MNIST classification. This demonstrated only a 0.1% loss in prediction accuracy compared to that of the full-precision implementation of the model.
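The XNOR/bit-count replacement for multiply-accumulate can be demonstrated in a few lines (a software illustration of the arithmetic identity, not the hardware mapping itself; the 8-bit example values are arbitrary):

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of two length-n vectors of +1/-1 values packed as bits
    (bit 1 -> +1, bit 0 -> -1): result = 2 * popcount(XNOR(w, x)) - n."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR limited to n valid bits
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n

# Example with n = 8 binary weights and activations.
w = 0b10110010
x = 0b10010110
print(binary_dot(w, x, 8))   # prints 4
```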

Building upon previous works on model binarization, Hou et al. propose an algorithm called loss-aware binarization in [23]. Unlike previous approaches, this algorithm directly minimizes training loss with respect to the binarized weights. The algorithm can also be extended to binarizing both weights and activations by using a simple sign function for binarizing activations. The authors experiment with a 4-layer feedforward network for

MNIST classification. Compared to the prediction accuracy of the full-precision version of

the model, results showed a 0.01% improvement using binary weights only, and a 0.19% loss in accuracy using binary weights and activations. Unique to this work, a method for binarizing recurrent neural networks is proposed. An example LSTM model for text prediction is used to evaluate the method. Compared to the full-precision model, the binarized model achieves 0.02% loss in accuracy using binary weights only and 0.11% loss in accuracy using binary weights and activations. Compared to two other cited binarization methods for both weight-only and weight & activation binarization, the approach proposed in this work achieves the best prediction performance.

3.2 LSTM Accelerators

Chang et al. implement a 2-layer LSTM on a Xilinx XC7Z020 in [24]. The FPGA design contains two main computational subsystems: gate modules and elementwise modules. Gate modules perform the matrix-vector multiplication and activation function computation for

LSTM cell gates. Elementwise modules combine gate module outputs with elementwise multiplication, addition, and activation functions. All weights and input are quantized to

16-bit fixed point (Q8.8) format. Activation function units use piecewise linear approximation and can be configured at run time to perform either the hyperbolic tangent or sigmoid functions. Input data is fed to gate modules from a Direct Memory Access

(DMA) streaming unit, which has independent streams for input and hidden layer data. To synchronize the independent input streams, gate modules have a sync unit that contains a FIFO. Elementwise modules also contain a sync unit to align the data coming from the gate units. Results from the elementwise module are written back to memory through the

DMA unit.
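As a rough software reference for the run-time-configurable activation units described above, a coarse piecewise linear tanh approximation might look like the following (segment boundaries and slopes here are illustrative, not the coefficients used in [24]):

```python
def tanh_pwl(x):
    """Five-segment piecewise linear approximation of tanh(x)."""
    if x <= -2.375:
        return -1.0                 # negative saturation
    if x >= 2.375:
        return 1.0                  # positive saturation
    if x <= -0.5:
        return 0.25 * x - 0.375     # shallow segment toward saturation
    if x >= 0.5:
        return 0.25 * x + 0.375
    return x                        # near-linear region around zero
```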

Ferreira and Fonseca propose an FPGA accelerator for LSTM in [25]. This design trades off scalability and flexibility for throughput: network size is fixed at HDL compilation time, but weight and bias parameters are stored in on-chip RAM rather than imported from off-chip memory. Matrix-vector multiplication is tiled using counter and multiplexer logic, which saves resources at the cost of increased computation time (by a factor of the resource sharing ratio). Input and recurrent matrix-vector multiplication are calculated in parallel.

Because the size of the input vector is often smaller than the recurrent vector size, the bias-add operation is performed after the input matrix-vector multiply operation has completed, while the recurrent matrix-vector multiply operation is still being computed in parallel.

Elementwise operations are also multiplexed: there is one each of a hyperbolic tangent unit, a sigmoid unit, and elementwise multiply and add units. Activation functions are computed using piecewise second order approximations, evaluated using Horner’s Rule in order to save multiplier resources. All data is quantized to a signed 18-bit fixed point representation (Q7.11) in order to make full use of Xilinx DSP48E1 slices.

Guan et al. propose an FPGA accelerator for LSTM in [26]. This paper explores optimization of communication with memory as well as computational performance. To optimize memory access time, the authors propose organizing the model parameter data in memory such that it can be accessed sequentially for tiled computation. The memory organization is done offline prior to inference. In terms of architectural optimization, the design contains a data dispatch unit, which handles all memory transactions separately from the

LSTM accelerator. Additionally, the accelerator uses a ping-pong buffering scheme at its input and output so that new computations can take place while data is being transferred to/from memory. To optimize computation performance, the accelerator unit performs tiled

matrix-vector multiplication in parallel for each of the four LSTM gates. A separate functional unit performs the activation function and elementwise operations. This unit contains an on-chip memory buffer to hold the cell’s current state. To evaluate the accelerator’s performance, the authors implement a known LSTM model for speech recognition. This model contains three stacked LSTM layers. The base design of the FPGA accelerator uses few memory resources, so the authors also experimented with storing the first layer’s parameters in on-chip memory and found that this approach resulted in an overall speedup of about 1.5×.

In addition to the compression method proposed in [11], Han et al. present an FPGA accelerator designed to operate on sparse LSTM models. The accelerator operates directly on a compressed model by encoding the sparse matrices in memory using compressed sparse column (CSC) format. A control unit fetches this data from memory and schedules computational operations. Operations that do not depend on each other (e.g. the activation function of the input gate and the pre-activation of the forget gate) are scheduled to run in parallel in the accelerator. The accelerator unit is composed of multiple processing elements, a single elementwise unit, and a single activation function unit. Processing elements each read from their own dedicated FIFO that is fed by the control unit. They contain a sparse matrix read unit, which decodes the CSC-formatted parameter data. Matrix-vector product accumulation is accomplished via a single adder and buffer per processing element. The elementwise unit contains 16 multipliers and an adder tree. The activation function unit is composed of lookup tables for hyperbolic tangent and sigmoid functions, both containing

2048 samples and quantized to 16-bit (Q1.15) format.
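The arithmetic that such a sparse accelerator performs can be summarized by a compressed sparse column matrix-vector product, sketched below in software form (illustrative only; it does not reflect the scheduling or parallelism of the hardware in [11]):

```python
import numpy as np

def csc_matvec(values, row_idx, col_ptr, x, n_rows):
    """Compute y = W @ x with W stored in CSC form: values/row_idx hold the
    non-zeros of each column, and col_ptr marks where each column starts."""
    y = np.zeros(n_rows)
    for j, x_j in enumerate(x):                  # walk the columns of W
        if x_j == 0.0:
            continue                             # zero inputs contribute nothing
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x_j     # accumulate into output rows
    return y
```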

3.3 Other Hardware Accelerators

Li et al. proposed an FPGA accelerator for the basic RNN model in [27]. The authors note that for natural language processing applications, the number of nodes in the output layer (i.e. the vocabulary size) is usually much larger than the number of hidden-layer nodes. As such, the computation of the output layer is often the dominating factor in the computational complexity of RNNs. To balance the computational workload of processing elements in an RNN accelerator, the proposed architecture unrolls the network in time: the computation of the hidden layer is done serially, while the output layer computation for a window of time is done in parallel. Additionally, the authors employ two design strategies to optimize FPGA resources: quantize network weights to 16-bit fixed point format and approximate non-linear activation functions using piecewise linear approximation.

Nurvitadhi et al. study techniques for optimization and acceleration of the Gated Recurrent Unit (GRU) network in [28]. First, they propose a memoization optimization method that reduces computation time by an average of 46%. Because the authors target Natural Language Processing applications, the GRU input is a one-hot encoded vector the size of the vocabulary. Therefore, there is a finite number of possible results from the input matrix-vector multiply operation, and these results can be precomputed and cached to avoid expensive computations at run time. Recurrent matrix-vector multiply operations, on the other hand, cannot be precomputed, and so the authors also propose an FPGA architecture for accelerating the remaining calculations. The accelerator consists of tiled column blocks, which evenly divide up the computation of a dot product between the input vector and a weight matrix row. Column blocks are fed input and weight values from a memory read unit. Each block consists of a floating-point multiply-accumulate unit. Partially accumulated results from the tiled computation are summed in a row reduction unit,

and output vector results are sent to a memory write unit. The accelerator only performs matrix-vector operations; the element-wise addition and multiplication as well as activation function computations required by GRU are presumably performed on the host CPU.
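The memoization idea exploited in [28] rests on the fact that multiplying a weight matrix by a one-hot vector simply selects a column, so results can be cached by vocabulary index. A minimal sketch (the class and variable names are assumptions made for illustration):

```python
import numpy as np

class OneHotMatvecCache:
    """Cache W @ x for one-hot inputs x, keyed by the index of the hot element."""
    def __init__(self, W):
        self.W = W
        self.cache = {}

    def __call__(self, token_index):
        if token_index not in self.cache:
            # For a one-hot input, the matrix-vector product is a column lookup.
            self.cache[token_index] = self.W[:, token_index].copy()
        return self.cache[token_index]
```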

In [12], Han et al. propose a general-purpose hardware accelerator for fully-connected neural network layers. Implemented in 45nm CMOS technology, this chip is suitable for accelerating the matrix-vector multiplications found in many neural networks (including

LSTMs). This design utilizes sparsity coding and weight sharing in order to store all network parameters in on-chip SRAM, resulting in significant power savings and speedup.

The architecture also takes advantage of the sparsity of the input, dynamically scanning the vectors and broadcasting non-zero elements, along with their row index, to an array of processing elements. Non-zero input vector elements are multiplied by non-zero elements in the corresponding column of the weight matrix and then sent to a row accumulator. In order to balance workload distribution, each processing element has a queue at its input.

Input broadcast is halted when any processing element has a full queue.

In [29], Andri et al. propose a custom CMOS architecture for accelerating binary weight convolutional networks. The architecture is highly scalable and supports arbitrary input size and multiple kernel sizes. In this design, inputs are quantized to 12-bit

fixed-point and multiply-accumulate operations are implemented using fixed-point adders and multiplexers. Computational units are instantiated in parallel to support up to 32 input channels. A single fixed point multiply-accumulator unit is used to apply channel scaling and biasing, which processes output channels in an interleaved manner. Input and

filter weights are read from off-chip memory. Because of the high level of data reuse in

CNN computations, the architecture contains on-chip cache to store recently used input

30 and filter data as well as intermediate computational results, thus reducing communication with off-chip memory.

Li et al. propose an FPGA accelerator architecture for convolutional networks with binary weights and activations in [30]. The basic processing element of this design is composed of XNOR gates, which are implemented by lookup tables, and bit counters, which are implemented by an adder tree. Weights and layer activations are stored solely in on-chip memory. Weights and intermediate computational results are stored using block RAM, while feature maps are stored in distributed RAM. The authors choose an example CNN model with 6 convolutional layers to implement using the accelerator architecture (computation of the fully-connected layers is omitted from this design). This implementation is not scalable to other CNN models—network structure, layer sizes, and weight values are fixed at HDL compilation time.

CHAPTER IV

TRADE STUDY

Hardware design can be both time-consuming and costly. Therefore, it is critical to have an understanding of the targeted application as well as the design trade space before developing a hardware architecture. This chapter first provides an analysis of the FPGA accelerator design trade space. Then, a model compression strategy is selected and evaluated using two sequential datasets.

4.1 Analysis of Related Work

In this section, various FPGA design trades are discussed within the context of the results of previous, related work.

4.1.1 Compression

An important first step in optimizing a neural network model for hardware implementation is compression. Parameter compression can significantly reduce the memory footprint and computational complexity of inference computations. Several methods have been proposed for neural network model compression. These methods are generally classified into one of two approaches: parameter reduction or parameter quantization. The former approach reduces the number of parameters needed to compute model inference, while the latter reduces the number of bits needed to represent model parameters.

The pruning technique reduces the number of parameters by removing (i.e. setting to zero) unnecessary weight connections. After training, model parameters with an absolute value below some threshold are removed. Then, the sparse model is retrained using the remaining connections. The retrained model may be pruned and retrained again in order to achieve the desired level of sparsity. Using this technique, Han et al. demonstrated 9× and 13× reductions in the number of parameters for two well-known convolutional networks [16]. Building upon this approach, Han et al. also proposed a hardware-friendly pruning method, called load-balance-aware pruning [11]. This method facilitates the distribution of the processing workload by constraining regions of the parameter matrices to contain an equal number of non-zero connections. After pruning, the memory footprint of the model parameters is reduced using sparse matrix encoding, such as the compressed sparse column (CSC) or compressed sparse row (CSR) format. Accelerator hardware has been designed to operate directly on sparse encoded models and efficiently skip zero-valued connections [11, 12].
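To make the prune-and-retrain loop concrete, the following Python sketch illustrates magnitude-based pruning with iterative retraining. NumPy, the quantile-based threshold, and the retrain_fn placeholder are illustrative assumptions here; they are not the exact procedures used in [16] or [11].

    import numpy as np

    def magnitude_prune(weights, sparsity):
        # Zero out the smallest-magnitude weights until `sparsity` fraction are zero.
        threshold = np.quantile(np.abs(weights), sparsity)
        mask = np.abs(weights) >= threshold
        return weights * mask, mask

    def prune_and_retrain(weights, retrain_fn, sparsity_schedule):
        # Iteratively prune, then retrain the surviving connections.
        # `retrain_fn(weights, mask)` is a stand-in for a framework-specific training
        # step that only updates weights where mask is True.
        mask = np.ones_like(weights, dtype=bool)
        for sparsity in sparsity_schedule:          # e.g. [0.5, 0.7, 0.9]
            weights, step_mask = magnitude_prune(weights, sparsity)
            mask &= step_mask
            weights = retrain_fn(weights, mask)     # retrain remaining connections
            weights = weights * mask                # keep pruned weights at zero
        return weights, mask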

Singular Value Decomposition (SVD) is another technique for reducing the number of model parameters. By keeping only the largest singular values, the dimensionality of the model parameter matrices can be reduced while maintaining an acceptable level of prediction accuracy. Giu et al. apply SVD to a fully-connected layer of a convolutional network, achieving a compression rate of 7× with only a 0.04% loss in prediction accuracy [17]. Prabhavalkar et al. apply this technique to a 5-layer LSTM for speech processing and were able to reduce model size by 3× with only a 0.5% loss in prediction accuracy [18]. An advantage of this approach is that it does not require special encoding (and consequently more memory) to store the compressed model; this translates to simplified accelerator hardware because the parameter matrix structure does not have to be decoded.
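A low-rank factorization of a single fully-connected weight matrix along these lines can be sketched as follows; the truncation rank is an assumed tuning parameter, not a value taken from [17] or [18].

    import numpy as np

    def svd_compress(W, rank):
        # Approximate an N x M weight matrix by two factors of the given rank,
        # replacing y = W @ x with y = A @ (B @ x).
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :rank] * S[:rank]      # N x rank (singular values folded into A)
        B = Vt[:rank, :]                # rank x M
        return A, B

    # A 1024 x 1024 layer truncated to rank 128 stores 128 * (1024 + 1024) = 262,144
    # values instead of 1,048,576, a 4x reduction in parameter count.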

One technique for quantizing parameters is known as weight sharing. With this approach, a small number of effective weight values are used [16]. After training, similar weights are identified using k-means clustering. The centroid of a cluster is then chosen to be the shared weight value. The quantized model can then be retrained using only the shared weights. This technique allows parameter matrices to be stored as indices into the shared weight codebook, thus reducing the memory footprint. For example, a weight sharing scheme with 32 shared weights requires only 5 bits to store an individual weight. Han et al. utilize weight sharing to enable on-chip parameter storage in their hardware accelerator for fully-connected layers [12]. A benefit to this approach is that it does not restrict the datatype of the actual weights; high-precision datatypes can be used for computation with minimal storage requirements.
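The clustering and codebook lookup described above can be sketched as follows. The use of scikit-learn's KMeans and the choice of 32 clusters (5-bit indices) are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def weight_share(W, n_clusters=32):
        # Cluster the weights and store each one as an index into a small codebook.
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(W.reshape(-1, 1))
        codebook = km.cluster_centers_.ravel()     # the shared weight values
        indices = km.labels_.reshape(W.shape)      # 5-bit index per weight (32 clusters)
        return codebook, indices.astype(np.uint8)

    def reconstruct(codebook, indices):
        # Recover the quantized weight matrix for computation.
        return codebook[indices]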

Another quantization technique is called binarization. This method constrains all weight values during training to +1 or −1, thus requiring only a single bit to represent each weight. An extension of this technique places the same constraints on layer activations as well. Various approaches have been proposed for training a neural network with binary weights and activations [21, 22, 23]. The approach detailed in [23] was applied to an LSTM model for text prediction and resulted in negligible loss in prediction accuracy compared to that of the full-precision model. Binarization comes with significant advantages for hardware implementation. In a weight-only binarization scheme, multiplication is reduced to a two's complement operation, thus eliminating the need for expensive hardware multipliers. Weight & activation binarization simplifies hardware implementation even further, requiring only XNOR and bit-count operations in place of multiply-accumulation. In both schemes, the model parameter storage size is drastically reduced. Efficient hardware accelerators for binary convolutional networks have been implemented [29, 30], but an accelerator architecture for binary LSTM has yet to be published.
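A minimal NumPy sketch of the two binarization variants is given below. It shows only the inference-time arithmetic, not the loss-aware training procedure of [23]; the function names are chosen here for illustration.

    import numpy as np

    def binarize_weights(W):
        # Deterministic sign binarization: each weight becomes +1 or -1 (one bit of storage).
        return np.where(W >= 0, 1, -1).astype(np.int8)

    def binary_weight_matvec(Wb, x):
        # Weight-only binarization: every "multiplication" is just x or -x.
        return np.where(Wb > 0, x, -x).sum(axis=1)

    def xnor_popcount_dot(w_bits, x_bits):
        # Weight & activation binarization: a dot product of +/-1 vectors encoded as
        # 0/1 bits reduces to XNOR followed by a bit count.
        xnor = np.logical_not(np.logical_xor(w_bits, x_bits))
        return 2 * np.count_nonzero(xnor) - xnor.size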

Finally, a commonly used quantization technique is to use fixed-point datatypes. This approach is discussed in more detail below.

4.1.2 Datatype

The fundamental datatype used for hardware implementation affects many of the decisions made during the FPGA design process, ultimately impacting energy efficiency and processing throughput. While full-precision floating-point data is often used during the training phase, research has shown that fixed-point datatypes with reduced word length can be used for inference with minimal loss in accuracy [19, 31, 32]. Besides the benefit of reduced memory footprint compared to floating-point datatypes, this quantization approach has a number of advantages. Firstly, it is simple to implement using a linear quantization approach, and tools that automate this process are freely available [31]. Secondly, fixed-point quantization can be used in conjunction with a parameter-reduction compression approach, such as pruning or SVD [17, 12]. Finally, fixed-point multiplication and addition map directly to dedicated DSP hardware units, such as the Xilinx DSP48 and its variants. When using a short enough word length, data can be packed such that DSP units are able to process two multiply-accumulate operations concurrently [33].

When specifying a fixed-point datatype, there is a trade between word length (the total number of bits), fraction length (the location of the decimal point), and the dynamic range that can be represented. In general, larger word lengths allow for larger dynamic range but come at the cost of increased memory footprint. It can be a challenge to find a balance between reducing word length and maintaining computational accuracy. A simple approach is to use a static fixed-point representation for all datatypes. Given a large enough word length, an appropriately set fraction length can allow enough room for the integer component to grow while maintaining an acceptable level of fractional precision. Many accelerator architectures have taken this approach due to its straightforward implementation in hardware [24, 25, 27, 12]. However, a multilayer model with a large number of parameters may have a wide dynamic range that is difficult to represent using a reduced word length.

A solution to this issue is to use a dynamic fixed-point representation [31]. With this approach, the dynamic range of each layer is analyzed separately in order to select an appropriate fraction length. In hardware implementation, this translates to extra bit shift operations between layers to align decimal points. The impact of dynamic fixed-point on convolutional networks has been widely studied and shown to yield minimal loss in prediction accuracy, even with word lengths as short as 4 or 8 bits [17, 31]. Studies related to LSTM are limited, but Shin et al. found that dynamic quantization of multilayer LSTMs should be separated not only by layer but also by connection type (i.e. feedforward vs. recurrent). Dynamic fixed-point has successfully been implemented in FPGA hardware accelerators [11, 17].
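The per-layer analysis described above amounts to choosing a fraction length from the observed dynamic range and then applying linear quantization, as in the following sketch; the rounding and saturation policy shown here is an assumption, not the exact scheme of [31].

    import numpy as np

    def choose_fraction_length(layer_data, word_length=8):
        # Reserve enough integer bits (plus sign) to cover the layer's dynamic range;
        # the remaining bits become fractional precision.
        max_abs = np.max(np.abs(layer_data))
        int_bits = int(np.ceil(np.log2(max_abs + 1e-12))) + 1
        return word_length - int_bits

    def quantize(data, word_length, frac_length):
        # Round-to-nearest, saturating fixed-point quantization; the returned values
        # are the real numbers the fixed-point words actually represent.
        scale = 2.0 ** frac_length
        qmin, qmax = -(2 ** (word_length - 1)), 2 ** (word_length - 1) - 1
        return np.clip(np.round(data * scale), qmin, qmax) / scale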

4.1.3 Memory

The physical storage of model parameters is an important trade for hardware accelerator design. Many state-of-the-art neural network models contain hundreds of megabytes' worth of parameters. FPGAs present a unique challenge for implementing a neural network accelerator because on-chip memory size and I/O bandwidth are often limited [34]. At the same time, off-chip DRAM access requires an order of magnitude more energy than on-chip memory access [16]. Additionally, off-chip parameter storage can be a processing bottleneck because the speed of matrix-vector operations is often limited by memory access time [17]. Indeed, in LSTM the bulk of computations are matrix-vector operations, so memory bandwidth is positively correlated with system throughput.

Due to the energy and access-time requirements of off-chip memory access, it would seem that on-chip parameter storage is the optimal design choice, given that the parameters can fit in the limited on-chip memory capacity. For many models, compression techniques can shrink parameter storage requirements enough to allow the entire model to be stored on-chip. In addition to the efficiency and performance benefits, on-chip parameter storage vastly simplifies the design process by eliminating the need for a memory interface and buffering schemes. In the case of models compressed using sparsity encoding, extra decoding logic is required, but the implementation of such logic may be simpler than implementing an external memory interface. Many hardware accelerators have been designed to take advantage of on-chip storage [25, 12, 30].

While on-chip parameter storage is both simple to implement and more efficient than using off-chip memory, it is also less scalable than its counterpart. Many accelerators using on-chip storage have been designed to support a single model; reconfiguring the design to support another model may be difficult or impossible given resource constraints.

Accelerators that utilize off-chip memory, on the other hand, must be designed to support an arbitrary model size and structure. Although off-chip memory access times can be relatively long, various design techniques, such as double-buffering, can be used to perform computation during memory downtime [11, 26]. Additionally, a hybrid memory approach that utilizes on-chip memory to cache recently used values can be used to reduce the amount of memory access [26, 29]. For LSTM, cell memories are a good candidate for on-chip caching because of the relatively small storage requirement. For example, with an output word size of 32 bits, a cell memory cache size of only 16 kB would support a layer size of up to 4096 cells. Additional on-chip memory could be used to store a portion of (or, if the model were small enough, all of) the model parameters. While not optimally energy-efficient, many FPGA accelerators that read parameters from off-chip have achieved significant power savings and better or comparable computational speedup compared to CPU and GPU counterparts [11, 24, 26, 28, 29]. These designs are more complex than those that rely on on-chip memory, but they provide more utility as general-purpose model accelerators.

4.2 Effectiveness of Binarization

The first step in defining an accelerator hardware architecture is selecting a model compression strategy; this choice informs the entire design process. Not only does this strategy determine the format of the data being processed, it also determines the required computational elements, the memory requirements, and ultimately the performance and throughput of the system.

For this work, binarization was selected as the model compression strategy. Specifically, the weight-only loss-aware binarization training method was used [23]. This strategy is compelling for three reasons. First, it drastically reduces the width of model parameters. Since memory access is the computational bottleneck in LSTM, a strategy that increases the number of parameters contained in each memory word is desirable. Second, it eliminates multiplication operations from the computation of matrix-vector products, which simplifies the design of a hardware accelerator. Third, a binary LSTM accelerator has not yet been studied. Indeed, only one published work to date has even tested the efficacy of binarization with LSTM [23]. However, this work only examined one application of LSTM, namely text prediction. In order to motivate the design of a hardware accelerator, two additional applications are considered.

4.2.1 Handwriting Recognition

Automatic recognition of handwritten characters is a task that has been thoroughly explored in the machine learning community. The MNIST dataset, which contains 70,000 images of handwritten digits, has been evaluated using dozens of classifier models [35].

While many feed-forward- and convolutional-neural-network-based approaches have been shown to work well with MNIST, the static nature of the images in this dataset does not lend itself well to a recurrent-neural-network-based classifier. However, a variant of MNIST, called MNIST Stroke Sequence, is a more natural fit for LSTM [36].

As its name suggests, MNIST Stroke Sequence is composed of sequences that illustrate the path a pen could take to produce a handwritten digit. The pen stroke sequences are extracted from the original MNIST data using thresholding and thinning techniques (see Figure 4.1). The data has four features: dx and dy, which are integers representing the horizontal and vertical movement, respectively, in pixels from one timestep to the next; an end-of-stroke indicator, which takes the value 1 at the last point of a stroke and 0 otherwise; and an end-of-digit indicator, which takes the value 1 at the last timestep in the sequence and 0 otherwise. Thus, MNIST Stroke Sequence is not a particularly complex dataset and should be fairly easy to learn. Even so, it brings value as another test case for the efficacy of binarization.

Figure 4.1: An example MNIST Stroke Sequence source image (left) and extracted sequence (right).

A fully-connected LSTM architecture (shown in Figure 4.2) was used to create a classifier model for the handwritten digit sequence-labeling task. In this architecture, the cells of the last LSTM layer are connected to an output layer of the same size as the number of classes (in this case, 10) and activated with a softmax layer. The training objective is to minimize cross-entropy loss over all target sequences. We only care about the output prediction at the end of a sequence, so classification accuracy is evaluated only at the last timestep in a sequence.
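The classifier just described can be expressed compactly in a modern deep learning framework. The PyTorch sketch below is an illustration only; the thesis does not specify its training framework, and the layer count and size shown are placeholders drawn from the configurations in Table 4.1.

    import torch
    import torch.nn as nn

    class StrokeClassifier(nn.Module):
        # Fully-connected LSTM classifier for the stroke-sequence labeling task.
        def __init__(self, num_layers=2, layer_size=128):
            super().__init__()
            self.lstm = nn.LSTM(input_size=4, hidden_size=layer_size,
                                num_layers=num_layers, batch_first=True)
            self.fc = nn.Linear(layer_size, 10)      # one output unit per digit class

        def forward(self, x):                        # x: (batch, timesteps, 4 features)
            out, _ = self.lstm(x)
            return self.fc(out[:, -1, :])            # logits at the last timestep only

    # Training minimizes nn.CrossEntropyLoss (softmax + cross-entropy) on these logits.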

To test the impact of binarization on classification accuracy, training was performed using both full-precision floating-point and binary weights. Various configurations of the number of layers and layer size were tested. For the case of multiple layers, each layer contains the same number of cells. The results of these experiments are shown in Table 4.1. As can be seen, the difference in classification accuracy due to binarization is minimal for this task. In fact, the binarized model performs slightly better than its full-precision counterpart in the case of 2 layers of size 256. These results suggest that a binarization scheme can be used to compress a model for the handwritten sequence recognition task with little to no loss in prediction accuracy.

Figure 4.2: Fully-Connected LSTM Architecture (Input → LSTM → FC → Softmax → Output).

Table 4.1: Testing set accuracy for the MNIST Stroke Sequence dataset.

Hidden Layers            1       2       1       2       1       2
Layer Size               64      64      128     128     256     256
Accuracy (%):
  Full-Precision         97.78   98.11   98.24   98.23   98.12   98.18
  Binary                 96.01   97.49   97.29   97.84   98.01   98.22

4.2.2 Speech Recognition

While binarization showed favorable results for the relatively simple task of handwritten digit recognition, a more complex test case is desired. Therefore, a speech recognition application was examined, specifically the task of phoneme labeling. A phoneme is the smallest unit of sound from which spoken words are composed. In contrast with the written English alphabet, the phonetic alphabet has unique labels for the specific sound being made within the context of a word (e.g. /uh/ in book versus /uw/ in boot). Thus, the task of phoneme labeling seeks to assign a label to the phoneme being spoken at each moment in time in an audio recording. The output produced by a phoneme labeling model could be useful by itself, or it could be fed into a separate speech recognition model which understands full words or sentences from phonetic labels.

Figure 4.3: Bidirectional Fully-Connected LSTM Architecture (forward and reverse input streams feeding parallel LSTM blocks, followed by a shared FC and Softmax output layer).

For this work, the TIMIT dataset was used as training material for a phoneme labeling model [37]. This dataset contains raw audio recordings of English speech. It includes 6300 sentences (10 sentences spoken by each of 630 speakers) for a total speech time of 5.4 hours. TIMIT consists of 2342 unique, phonetically balanced sentences. Each sentence is annotated with timestamped labels of the phoneme being spoken at each moment in time. TIMIT uses 61 unique phonemes for its alphabet.

A bidirectional, fully-connected LSTM architecture (shown in Figure 4.3) was used to create a classifier for the phoneme labeling task. Just as in the architecture used for the handwritten digit recognition task, the cells of the last LSTM layer are fully connected to an output layer consisting of 61 units. However, unique to this architecture is the use of two identically sized blocks of LSTM cells operating in parallel within each layer. One block processes the input in the forward direction of the sequence (i.e. from the first timestep to the last) as usual, while the second block processes the input sequence in reverse order (i.e. from the last timestep to the first); hence the term bidirectional.

Figure 4.4: Raw audio signal (top) and short-time power spectrum (bottom) for an example phrase from the TIMIT training set.

Raw audio was not fed directly to the model; rather, the short-time power spectrum of the signal was used as input. This work followed the pre-processing approach outlined in [38]. Specifically, the TIMIT audio was split into overlapping frames with a duration of 25 ms and a 10 ms shift between frames. Then, each frame is converted into its frequency-domain representation, and the Mel-frequency cepstral coefficients (MFCCs) and their regression coefficients are extracted to be used as input to the model. This results in an input feature vector of size 39 (13 MFCCs, plus first- and second-order regression coefficients). Timestamped phoneme labels were converted to frame-wise labels to align with the input. For this task, we are interested in the accuracy of the predicted label for each frame of input. The training objective, therefore, is to minimize the categorical cross-entropy at each timestep (frame) in the input sequence.
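The frame-wise feature extraction can be reproduced with standard audio tooling. The sketch below uses librosa, which is an assumption; the thesis follows [38] but does not name the preprocessing library it used.

    import numpy as np
    import librosa

    def timit_features(wav_path):
        # 39-dimensional input per 10 ms frame: 13 MFCCs plus first- and second-order
        # regression (delta) coefficients, computed over 25 ms analysis windows.
        signal, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),       # 25 ms window
                                    hop_length=int(0.010 * sr))  # 10 ms shift
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T           # shape: (num_frames, 39)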

To test the impact of binarization on classification accuracy, training was performed using both full-precision floating-point and binary weights. Various configurations of the number of layers and layer size were tested. For the case of multiple layers, each layer contains the same number of cells. The results of these experiments are shown in Table 4.2. These results demonstrate a larger discrepancy in prediction accuracy between the full-precision and binary versions of the model. For smaller model sizes, binarization can result in nearly a 30% drop in accuracy (1 layer / 256 cells). However, as model size grows, the binarized version begins to catch up with its full-precision counterpart. For 2 layers / 512 cells we see only a 3.18% discrepancy in accuracy, with the binary model achieving 72.30% accuracy.

Table 4.2: Testing set accuracy for the TIMIT dataset.

Hidden Layers                2       4       1       2       1       2
Layer Size [2]               128     128     256     256     512     512
Accuracy (%) [3]:
  Full-Precision             73.33   71.18   73.27   75.31   74.38   75.48
  Binary                     50.20   53.55   43.51   65.81   62.01   72.30

[2] Because of the bidirectional LSTM architecture used, the Layer Size parameter refers to the number of LSTM cells used for a single direction within the layer. Thus, a layer of size 128 contains 256 total cells.
[3] Frame-wise accuracy, i.e. the percentage of frames in an input sequence that were assigned the correct phoneme label, averaged over all sequences in the testing set.

It seems that for both training schemes, a deeper architecture adds more value than a wider one. However, the effect of additional layers is more pronounced in the results of the binary model. For example, a model with 2 layers / 128 cells achieves 50.20% accuracy, while 1 layer / 256 cells achieves only 43.51%. Similarly, 2 layers / 256 cells achieves 65.81% accuracy, while 1 layer / 512 cells achieves only 62.01%. Both pairs of models contain the same number of parameters; the difference lies in how the cells are connected. This suggests that a hierarchical representation of the input features can be learned, which assists the task of phoneme labeling.

The best-case results of the binary model suggest that binarization may be a viable approach to model compression, even for complex applications like speech recognition. The judgment of the suitability of this approach now falls on the application designer. It could be that 72.30% is an acceptable level of phoneme labeling accuracy for a particular application. If this is the case, a binarized model provides an immediate 32× reduction in parameter memory footprint (compared with full-precision weights), in addition to the benefit of eliminating the multiplication operation in computing matrix-vector products. It is possible that even better accuracies may be achieved by tuning the number of layers and layer sizes. This work will leave the exploration of this possibility to another researcher; for the sake of motivating the design of a hardware accelerator system for binary-weight LSTM, these results have served their purpose. In the following chapter, the design of such a system will be explored.

CHAPTER V

SYSTEM DESIGN

When designing a hardware accelerator, there are two important design areas to consider: datapath and memory. The former refers to how computations will be implemented and the order in which they take place. The latter is concerned with where the data required for computations will come from and how it will be accessed. Before one can dive into the details of design implementation, a high-level architectural strategy concerning both of these areas must be decided. This chapter walks through the design process and the high-level decisions made for the implementation of the binary-weight LSTM accelerator.

Throughout the chapter, the following terms will be used to define the size of a model:

M – the dimensionality of the input signal, i.e. the length of the input vector

N – the number of cells in the LSTM layer, i.e. the length of the output vector

5.1 Datapath

In order to determine the computations required for implementing LSTM inference, let us revisit the LSTM equations presented in Chapter 2:

i_t = sigm(W_{xi} x_t + W_{hi} h_{t-1} + b_i)    (5.1)

j_t = tanh(W_{xj} x_t + W_{hj} h_{t-1} + b_j)    (5.2)

f_t = sigm(W_{xf} x_t + W_{hf} h_{t-1} + b_f)    (5.3)

o_t = sigm(W_{xo} x_t + W_{ho} h_{t-1} + b_o)    (5.4)

c_t = i_t ⊙ j_t + f_t ⊙ c_{t-1}    (5.5)

h_t = o_t ⊙ tanh(c_t)    (5.6)

As we can see, there are three types of computations being performed in these equations: matrix-vector operations (e.g. W_{xj} x_t), activation functions (e.g. sigm, tanh), and element-wise operations (e.g. i_t ⊙ j_t). Matrix-vector operations are the basis of the term inside the activation function in the gate equations (5.1, 5.2, 5.3, 5.4), so we will refer to this computation as "gate pre-activation". As we have seen, it is this computation that is responsible for the majority of the computational complexity in LSTM, having O(M · N) complexity. Therefore, the first step in the design process will be to determine a strategy for speeding up gate pre-activation calculations. The other computation types, activation functions and element-wise operations, require only O(N) complexity. We will group these computations together and refer to them as "cell calculations". These computations require the output of the gate pre-activations as their input, so the implementation strategy for cell calculations will follow from the strategy for implementing gate pre-activation.
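For reference, the six equations map directly onto the following NumPy sketch of a single inference step; the dictionary-based parameter layout is an illustrative choice, not the storage format used by the accelerator.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, b):
        # W[g], U[g], b[g]: input weights, recurrent weights, and bias for gate g.
        i = sigm(W['i'] @ x_t + U['i'] @ h_prev + b['i'])       # Eq. 5.1
        j = np.tanh(W['j'] @ x_t + U['j'] @ h_prev + b['j'])    # Eq. 5.2
        f = sigm(W['f'] @ x_t + U['f'] @ h_prev + b['f'])       # Eq. 5.3
        o = sigm(W['o'] @ x_t + U['o'] @ h_prev + b['o'])       # Eq. 5.4
        c = i * j + f * c_prev                                  # Eq. 5.5 (element-wise)
        h = o * np.tanh(c)                                      # Eq. 5.6 (element-wise)
        return h, c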

5.1.1 Gate Pre-activation

As seen in equations 5.1, 5.2, 5.3, & 5.4, the gate pre-activation consists of the sum of three terms: input connections (e.g. W_{xi} x_t), hidden state connections (e.g. W_{hi} h_{t-1}), and bias (e.g. b_i). Rather than explicitly compute this sum, we observe that we can combine terms by concatenating the input and hidden state vectors as well as the parameter matrices and the bias vector. We now rewrite equation 5.1 as

i_t = sigm(ĩ_t) = sigm(W_i x̃_t)    (5.7)

where

ĩ_t denotes the pre-activation term for the input gate,

x̃_t is the concatenation of x_t, h_{t-1}, and 1, i.e.

x̃_t = [ x_1(t)  x_2(t)  ...  x_M(t)  h_1(t-1)  h_2(t-1)  ...  h_N(t-1)  1 ]^T

and W_i is the N × (M + N + 1) concatenated weight and bias matrix of the input gate parameters, i.e.

W_i = [ w_{x_1 i_1}  w_{x_2 i_1}  ...  w_{x_M i_1}  w_{h_1 i_1}  w_{h_2 i_1}  ...  w_{h_N i_1}  b_{i_1}
        w_{x_1 i_2}  w_{x_2 i_2}  ...  w_{x_M i_2}  w_{h_1 i_2}  w_{h_2 i_2}  ...  w_{h_N i_2}  b_{i_2}
        ...
        w_{x_1 i_N}  w_{x_2 i_N}  ...  w_{x_M i_N}  w_{h_1 i_N}  w_{h_2 i_N}  ...  w_{h_N i_N}  b_{i_N} ]

and similarly for the other three gate equations. To generalize, we will denote the gate pre-activation term g̃_t = W_g x̃_t, where W_g represents one of the four gate parameter (i.e. concatenated weight and bias) matrices W_i, W_j, W_f, and W_o for the input, new-input, forget, and output gates, respectively. We will refer to W_g as the parameter matrix and the concatenated input/hidden state/unity vector x̃_t simply as the input vector.

We see that for each gate pre-activation, the computation of the matrix-vector product requires N · (M + N + 1) multiplications and N · (M + N) additions. We also note that each element of the parameter matrix W_g is required only once, for a single multiplication operation with the corresponding element of the input vector x̃_t. Conversely, each element of x̃_t is required N times, once per row of W_g. A good design strategy will be to consider the re-use of the input vector data.

In order to accelerate the computation of the matrix-vector product, we consider how the operation can be parallelized. We note first that the computation of an individual output term requires the entire input vector and a single row of the parameter matrix. This is made explicit by Equation 5.8 below:

g̃_n(t) = w_{x_1 g_n} x_1(t) + w_{x_2 g_n} x_2(t) + ... + w_{x_M g_n} x_M(t)
        + w_{h_1 g_n} h_1(t-1) + w_{h_2 g_n} h_2(t-1) + ... + w_{h_N g_n} h_N(t-1) + b_{g_n}    (5.8)

where

g̃_n(t) is the n-th element of the N-dimensional pre-activation vector g̃_t.

As noted, the computation of an output vector element depends only on a single row of the parameter matrix. Therefore, we may compute each of these results in parallel to one another using row-wise matrix-vector multiplication. This approach is described in the following paragraphs.

Figure 5.1: Row-wise Matrix-Vector Multiply Processing Element (PE).

The basic processing element (PE) used by this technique consists of a cascaded multiplier, adder, and single-cycle delay unit (see Figure 5.1). The inputs to the multiplier are the input term and parameter term; the inputs to the adder are the output of the multiplier and the output of the delay; the input to the delay is the output of the adder. The delay unit has a reset signal which will set the memory state, and thus the current output, to zero. The Matrix-Vector Product Unit (MVPU) consists of an array of P parallel processing elements, where P is the parallelization factor. The MVPU has P weight inputs, each of which is fed to an individual PE, and a single input vector element input, which is fed to all PEs (see Figure 5.2).

The row-wise algorithm proceeds as follows. On the first clock cycle, rows 1 through P from the first column of W_g, along with x̃_1, are fed to the PE array inside the MVPU. The reset signal to the delay units is held high during this first cycle. Because of this, the result of the multiplication is added with zero, and this result is then stored in the delay unit. On the next cycle, rows 1 through P from the second column of W_g, along with x̃_2, are fed to the PE array. Now, the multiplication result of the current cycle is accumulated with the multiplication result from the previous cycle, which was fed to the addition unit by the output of the delay unit. This proceeds until the (M + N + 1)-th cycle, in which rows 1 through P from the (M + N + 1)-th (last) column of W_g, along with the last element of x̃, are fed to the PE array. All of the row-element multiplication results have been accumulated, so the output of the addition unit in this cycle is the valid pre-activation vector element.

Figure 5.2: Matrix-Vector Product Unit (MVPU) Architecture.

The procedure then starts over, with rows P + 1 through 2P from the first column of W_g, along with x̃_1, being fed to the MVPU, and so on, until all N output vector elements have been computed. If N is not an integer multiple of P, then the last P − (N mod P) results are ignored. This procedure is illustrated by Table 5.1 below, in which

W_g = [ 8 1 3
        2 3 5
        5 2 2
        4 7 5 ]

is multiplied by x̃_t = [ 6  4  1 ]^T, for a result g̃_t = [ 55  29  40  57 ]^T, with parallelization factor P = 2.

Table 5.1: Example row-wise matrix-vector multiplication procedure (P = 2). Output vector results are found in the Add_p columns when valid = 1.

Time  x̃_t  w_1  Mult_1  Add_1  w_2  Mult_2  Add_2  reset  valid
 1     6    8    48      48     2    12      12     1      0
 2     4    1    4       52     3    12      24     0      0
 3     1    3    3       55     5    5       29     0      1
 4     6    5    30      30     4    24      24     1      0
 5     4    2    8       38     7    28      52     0      0
 6     1    2    2       40     5    5       57     0      1

Using the row-wise implementation, P elements of the pre-activation vector can be computed every M + N + 1 clock cycles. This results in a maximum speedup of P times faster than a scalar implementation (maximum speedup is achieved when N is an integer multiple of P ). The approach is simple but effective—especially when P is chosen to be large. The control logic required to implement the procedure is straightforward, which allows for faster clock rates when implemented in hardware. There is also very little memory resource usage, requiring only a single register per processing element. Only one multiplier and adder are used per processing element as well.
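The procedure of Table 5.1 can be checked against a simple software model of the MVPU. The sketch below iterates over columns exactly as described, with the accumulator array standing in for the P delay registers; it is a behavioral model, not the HDL implementation.

    import numpy as np

    def row_wise_matvec(Wg, x_tilde, P):
        # Behavioral model of the row-wise MVPU: one pass of (M + N + 1) "cycles"
        # per group of P output rows, with zero-padding when N is not a multiple of P.
        N, K = Wg.shape
        n_blocks = -(-N // P)                       # ceil(N / P)
        Wp = np.zeros((n_blocks * P, K), dtype=Wg.dtype)
        Wp[:N] = Wg
        result = np.zeros(n_blocks * P)
        for blk in range(n_blocks):
            acc = np.zeros(P)                       # the P delay registers (reset to 0)
            for col in range(K):                    # one clock cycle per column
                acc += Wp[blk * P:(blk + 1) * P, col] * x_tilde[col]
            result[blk * P:(blk + 1) * P] = acc
        return result[:N]

    # Reproducing Table 5.1: row_wise_matvec(np.array([[8,1,3],[2,3,5],[5,2,2],[4,7,5]]),
    # np.array([6,4,1]), P=2) returns [55, 29, 40, 57] after two passes of three cycles.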

In Table 5.1, the row-wise matrix-vector multiplication procedure is demonstrated using integer weights. However, with binary-weight LSTM, the implementation can be optimized for hardware by modifying the architecture of the processing element shown in Figure 5.1. Multiplication can be implemented by simply replacing the multiplier with a 2-to-1 multiplexer, where the 0th input to the multiplexer is the additive inverse of the input vector element, and the 1st input to the multiplexer is just the input vector element. The weight element, which in a binary-weight scheme is a single bit, is then fed to the multiplexer as the control signal, passing the unmodified input when high (i.e. w = +1) and passing the additive inverse of the input when low (i.e. w = −1). This approach both eliminates the need for a hard multiplier unit and allows for faster clocked logic. The binary-weight PE architecture is shown in Figure 5.3.

Figure 5.3: Binary-Weight Row-wise Matrix-Vector Multiply Processing Element (PE).
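One cycle of the binary-weight PE then reduces to a multiplexer select and an add, as in this small behavioral sketch (the signal names are illustrative).

    def binary_pe_cycle(acc, x_elem, w_bit, reset):
        # The 2-to-1 mux passes +x when the weight bit is 1 (w = +1) and -x when it
        # is 0 (w = -1); the adder accumulates into the delay register.
        product = x_elem if w_bit else -x_elem
        return product if reset else acc + product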

5.1.2 Cell Calculations

With an implementation strategy decided for the gate pre-activations, we now look into how to implement the cell calculations, which consist of activation functions and element-wise multiplication and addition. Using the row-wise matrix-vector multiplication algorithm, we have P new results available every M + N + 1 clock cycles. We could maintain parallelism, processing the results in P separate streams. However, this would be both a waste of resources and an under-utilization of clock cycles. We must process the pre-activation results in M + N + 1 clock cycles, but there is very little benefit to finishing this workload early, as the computation time will still ultimately be dominated by the matrix-vector product. Therefore, we implement an approach that uses a single data stream.

The first function required for the single-stream approach is a parallel-to-serial stream conversion unit. This unit latches the P results when they are ready from the pre-activation unit. Then, over P cycles, it iterates through the pre-activation results sequentially, outputting one item per cycle. The output of the parallel-to-serial unit is fed to the cell calculation unit (CCU). The CCU consists of an activation function unit, a multiplication unit, an addition unit, and two memory units. In order to illustrate the purpose of each of these units, we refer back to equations 5.1–5.6. Here, we note a few things. Firstly, the pre-activation results must be fed through an activation function, either sigmoid or hyperbolic tangent, in order to complete the computation of equations 5.1–5.4. Therefore, the activation function unit should be the first unit in the stream, and it should be capable of performing either activation function, depending on the gate being calculated. Secondly, we look at the three element-wise multiplication terms:

(1) i_t ⊙ j_t,

(2) f_t ⊙ c_{t-1}, and

(3) o_t ⊙ tanh(c_t).

Because we have a serial data stream, there are a few dependencies to consider. Term 1 requires that either i_t or j_t has been computed, so we will have to store the activated result of either the input or new-input gate in a temporary memory. Term 2 requires the previous cell state, which, by definition, will already be available from the computation of the previous input. However, we will require additional temporary storage for this term to be consumed later on in the computation of c_t. Term 3 requires that c_t has been computed, so we must compute the output gate last. This term also implies that a second activation function unit will be required, because the primary one will be occupied in computing o_t. This secondary unit need only compute the hyperbolic tangent function. Finally, we note that there is only one element-wise addition operation, namely for the computation of c_t, which takes two element-wise products as its input. Therefore, the addition unit must be placed downstream from the multiplication unit.

Given all of these observations, we require that the gates be calculated in the following order: (1) new-input, (2) forget, (3) input, (4) output. Within the CCU, the behavior of the activation function unit, the inputs to the multiplication and addition units, and the temporary memory destination are then determined by the gate being calculated. This schedule and specification are detailed below in Table 5.2:

Table 5.2: Cell Calculation Unit (CCU) schedule. Mult Operand indicates the second operand; the first operand is the result of the activation unit. Similarly for the Add Operand, the first operand comes from the result of the multiplication unit.

Order  Gate  Activation  Mult Operand  Add Operand  Destination
1      j_t   tanh        1             0            MEM1
2      f_t   sigm        c_{t-1}       0            MEM2
3      i_t   sigm        MEM1          MEM2         c_t
4      o_t   sigm        tanh(c_t)     0            h_t

The required CCU functionality is accomplished by the architecture shown in Figure 5.4. Note that the gate schedule is not explicitly implemented in this unit; rather, the control logic for the gate pre-activation unit is required to compute the pre-activation results in the specified order. The upstream logic must also encode a gate identifier, along with the vector index, of the pre-activation data element being processed. Both of these items are used to decode a read address, write address, and write enable signal for the temporary memory units. The control signal for the multiplexers is determined solely from the gate identifier.
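The schedule of Table 5.2 can be traced for a single cell index with the following sketch; MEM1 and MEM2 are modeled as plain variables, and the activation functions are applied in software rather than by dedicated hardware units.

    import numpy as np

    def sigm(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ccu_schedule(pre_act, c_prev):
        # pre_act maps gate name -> pre-activation value, arriving in order j, f, i, o.
        mem1 = np.tanh(pre_act['j']) * 1 + 0          # order 1: j_t  -> MEM1
        mem2 = sigm(pre_act['f']) * c_prev + 0        # order 2: f_t  -> MEM2
        c_t = sigm(pre_act['i']) * mem1 + mem2        # order 3: i_t  -> cell state c_t
        h_t = sigm(pre_act['o']) * np.tanh(c_t) + 0   # order 4: o_t  -> output h_t
        return h_t, c_t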

The CCU architecture implies the use of two on-chip RAMs to implement the temporary result storage. The mechanism of storage for the cell state and hidden layer output (c_t and h_t, respectively) has not been specified thus far, though. In the next section, we will examine a strategy for memory usage, and the nature of the storage of these two terms will be made explicit.

Figure 5.4: Cell Calculation Unit (CCU) Architecture.

5.2 Memory

With a strategy set for implementing the datapath, we must now decide on a strategy for how memory will be used in the design. This strategy will determine how input data is fed to the accelerator and where output results are stored. Broadly, there are two choices for memory usage: internal (on-chip) and external (off-chip). The former is significantly faster to access and uses less energy. However, on-chip memory resources are limited, so the use of external memory will be necessary. The datapath architecture may require certain items to be available immediately; these cases will necessitate the use of internal memory, as we saw with the RAM units used for CCU temporary results. However, any large group of data items will likely need to be stored off-chip.

In this section, we will examine the categories of data for which memory will be used: input vector, parameter matrices, hidden layer output vector, and cell state vector. For each of these categories, a strategy for using internal memory, external memory, or a hybrid configuration will be identified.

5.2.1 Input Vector

Let us consider the characteristics of the input to the inference system, x_t. It is an M-dimensional vector containing unknown values. These values come from a source that is external to the accelerator, most likely a CPU or possibly directly from a sensor. Since the design of the accelerator is targeted for an embedded application, whatever device is producing the input to the system will most likely be sharing a communication bus and/or memory interface with the accelerator device. The rate at which the input is being produced or sampled is unknown, however, and it may be variable. Therefore, a simple approach to transmitting input data to the accelerator is to store the values in external memory. In this way, it does not matter where the input data comes from or when it was first stored; control registers simply tell the accelerator where to find the input data in memory and how much data to read. The accelerator communicates with external memory through a memory interface controller, which streams the input data sequentially from memory into the accelerator's datapath. Depending on the width of the memory interface, multiple input words may be able to be delivered in a single clock cycle.

At the same time, we have seen that input vector elements are reused by the matrix-vector computation unit. The input data must persist for at least ⌈N/P⌉ clock cycles, and therefore it cannot be streamed directly to the datapath. And while the dimensionality of the input vector is fixed for a given model and known at run time, we would like the model dimensions to be parameters of the system so that it can support models of different sizes. So, we require the use of on-chip addressable memory, rather than a fixed-size register file, to implement a "holding area" for the input vector.

Transactions with external memory are costly, so it would not be efficient to perform a read from external memory for each vector in the input sequence. Instead, we would like to allow for "batching" of input: bringing in multiple input vectors from off-chip memory to be processed at once. To implement this, we simply make the "holding area" memory unit large enough to support multiple input vectors (i.e. a batch), add registers to specify the batch size, and implement control logic to support iterating through the batch. This on-chip cache memory is loaded from external memory at the start of a batch sequence, and when all items in the batch have been consumed, the next batch is fetched. This technique reduces the number of external memory transactions, allowing for higher overall processing throughput. However, it also increases the single-input processing latency. Batch size, therefore, is a parameter of the system that the user will have to adjust according to the requirements of the application.

The memory size requirement for the input cache unit is reasonable due to the moderate input vector size for most sequence learning models. For example, the speech recognition model in Section 4.2.2 has an input size M = 39. With a 16-bit input word size, a 128 kB input cache would allow for a maximum batch size of 1680 for this model. A greater cache size can be used to support larger batch sizes if the target hardware platform has sufficient on-chip memory resources.

5.2.2 Parameter Matrices

The gate parameter matrices, W_i, W_j, W_f, and W_o, bring two topics to consider: memory implementation and in-memory organization.

Implementation

The parameter matrices contain the majority of the data that the accelerator must handle. Between the four gates, there are 4 · N · (M + N + 1) total parameter values for a single LSTM layer. Even though each parameter of a binary-weight model requires only a single bit of storage, the total size of some models may be too large to be stored completely in on-chip memory. For example, the best-performing speech recognition model variant in Section 4.2.2 contains 6.16 MB of parameter data in the LSTM layers. This requirement is too large for most of the Xilinx Virtex-7 devices, a mid-to-high-end FPGA family that is commonly used in high-performance computing systems [39]. Therefore, a memory strategy for dealing with model parameters must use external memory storage.

When considering how parameter data is used in the matrix-vector product computation, we see that P data items are consumed on each step (as illustrated in Table 5.1). There is no reuse of parameters (for a single-input computation), and a continuous computational flow depends on P new parameters being available every clock cycle. Because of this, the matrix-vector computation time, and thus the computation time of the entire system, is memory bandwidth-limited.

In a fully-external memory scheme for parameter storage, there is no benefit to making P wider than the memory bus width. To illustrate this fact, consider a system with a memory bus width MemWidth_external = 8 (i.e. 8 parameter values are delivered from external memory per clock cycle). With a configuration of P = 8, 8 parameter data items would be consumed per clock cycle. With a configuration of P = 32, the memory bus would require 4 clock cycles to provide all 32 items needed to perform a computation step. Thus, we have the same average performance of 8 items per cycle. The extra parallelism beyond the memory bus width is just a waste of chip resources.

However, if some portion of the model parameters can be stored in on-chip memory, then we can benefit from a partial speedup by making the width of the on-chip memory interface, as well as the parallelism factor, larger than the width of the off-chip memory bus. To illustrate, consider a system with MemWidth_external = 8 and MemWidth_internal = P = 32. If just 25% of the model parameters can fit in on-chip memory, then the average data consumption of the matrix-vector calculation is 0.25 · (32 items/cycle) + 0.75 · (8 items/cycle) = 14 items/cycle. This is a performance improvement of 1.75× over a fully-external memory approach.

Given the favorable performance impact, we implement a hybrid memory approach for the accelerator architecture. An on-chip cache for the model parameters is implemented to have a relatively large storage capacity and a wide output width (i.e. cache line size), which matches the matrix-vector parallelism factor. Once during system startup, whatever portion of the model parameters can fit on-chip is written to the cache via reading from external memory. A register is used to indicate the largest address contained in the cache, after which subsequent data items will have to be read from external memory. For the portion of the pre-activation calculations that uses parameters found in on-chip memory, computation can be performed continuously. For the portion that uses parameters found off-chip, computation will proceed every P / MemWidth_external clock cycles.

In-Memory Organization

The order in which the gate parameters are stored in memory is an important factor to consider before implementing a memory controller to read the data. We want the logic of the memory controller to be as simple as possible, so ideally the parameter data can be arranged in memory in the order in which it will be accessed. Based on the datapath strategy identified, we can determine this order.

For the gate pre-activation calculations, we know that the row-wise matrix-vector multiplication algorithm iterates through the columns of the parameter matrix, using P rows per step. At the last column, the algorithm moves on to the next P rows and starts iterating again from the first column to the last column. Therefore, the in-memory layout of the parameter matrix data should start with the data at index (1,1), followed by (2,1), and so on until (P,1). Then, (1,2), (2,2), and so on until (P,2). This order continues until (P, M+N+1), after which we have the data at index (P+1, 1), then (P+1, 2), and so on. If N is not an integer multiple of P, then P − (N mod P) zero-valued filler rows are padded to the bottom of the matrix before converting to in-memory order. The in-memory organization for an example parameter matrix, with P = 2, is shown in Figure 5.5 below.

W_g = [ 8 1 3
        2 3 5
        5 2 2
        4 7 5
        9 6 1 ]   ⇒   in-memory order (indices 1–18):  8 2 1 3 3 5 5 4 2 7 2 5 9 0 6 0 1 0

Figure 5.5: In-memory organization for parameter data with parallelism factor P = 2. Parameters consumed by the row-wise matrix-vector multiplication algorithm on the same cycle are adjacent in the memory stream. An imaginary last row is padded with zeros before the matrix is converted to in-memory order.

For cell calculations, we have established the required gate order, which is shown in Table 5.2. To simplify control logic, we want to organize the four gate parameter matrices in memory such that they are fetched according to the schedule automatically. We can do this simply by forming one large parameter matrix W, which is a vertical concatenation of the four gate parameter matrices. The gate parameter matrices are stacked from top to bottom in the required order:

W = [ W_j
      W_f
      W_i
      W_o ]
  = [ w_{x_1 j_1} ... w_{x_M j_1}  w_{h_1 j_1} ... w_{h_N j_1}  b_{j_1}
      ...
      w_{x_1 j_N} ... w_{x_M j_N}  w_{h_1 j_N} ... w_{h_N j_N}  b_{j_N}
      w_{x_1 f_1} ... w_{x_M f_1}  w_{h_1 f_1} ... w_{h_N f_1}  b_{f_1}
      ...
      w_{x_1 f_N} ... w_{x_M f_N}  w_{h_1 f_N} ... w_{h_N f_N}  b_{f_N}
      w_{x_1 i_1} ... w_{x_M i_1}  w_{h_1 i_1} ... w_{h_N i_1}  b_{i_1}
      ...
      w_{x_1 i_N} ... w_{x_M i_N}  w_{h_1 i_N} ... w_{h_N i_N}  b_{i_N}
      w_{x_1 o_1} ... w_{x_M o_1}  w_{h_1 o_1} ... w_{h_N o_1}  b_{o_1}
      ...
      w_{x_1 o_N} ... w_{x_M o_N}  w_{h_1 o_N} ... w_{h_N o_N}  b_{o_N} ]    (5.9)

Then, by following the in-memory organization scheme for the parameter matrix, the parameter data can be retrieved from memory already in the correct order for computation.
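The reordering of Figure 5.5 and the gate stacking of Equation 5.9 can be combined into a single flattening routine, sketched below; the NumPy reshape/transpose idiom is an illustrative software model of the layout, not the tool actually used to generate the memory image.

    import numpy as np

    def to_memory_order(W_stacked, P):
        # W_stacked is the vertical concatenation [Wj; Wf; Wi; Wo] of Equation 5.9.
        rows, cols = W_stacked.shape
        pad = (-rows) % P                            # zero-valued filler rows
        Wp = np.vstack([W_stacked, np.zeros((pad, cols), dtype=W_stacked.dtype)])
        blocks = Wp.reshape(-1, P, cols)             # groups of P consecutive rows
        # Within each group the MVPU walks column by column, consuming P rows per cycle,
        # so each block is transposed to (cols, P) before flattening.
        return blocks.transpose(0, 2, 1).reshape(-1)

    # For the matrix of Figure 5.5 and P = 2 this yields the stream
    # 8 2 1 3 3 5 5 4 2 7 2 5 9 0 6 0 1 0.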

5.2.3 Hidden Layer Output Vector

Due to the recurrent nature of LSTM, the output of the accelerator system is also used as input for the next item in the input sequence. Like the input vector x_t, we want h_t to be immediately available during matrix-vector product computation. Therefore, like the input vector, the hidden layer output must be stored in on-chip addressable memory. This memory unit need not support the equivalent input batch size; it need only be large enough to hold the maximum number of LSTM cells per layer (i.e. N) supported by the system. This is a meager requirement: for an output word size of 32 bits, 16 kB would support a layer size of up to 4096. However, we must also consider that the current hidden state must be maintained while next-state results are being computed. So, we require a second, identically sized memory in which to store output results. The memories are used in a "ping-pong" fashion: one is written to while the other is read, and these functions swap at the start of a new input computation.

We are also interested in using the hidden layer output, as it is the result of the computation that we are accelerating. So, output results must be written to off-chip memory for use by the application. An output-write unit that meets this requirement can be implemented using a FIFO memory. In addition to being routed to the hidden layer cache, the output stream of the CCU is also fed to a FIFO for temporary storage of results. A "write-packet size" threshold is set safely lower than the capacity of the FIFO, and a controller in the output-write unit will perform a burst write when this threshold is reached and the external memory interface is available.

5.2.4 Cell State Vector

The cell state vector c_t is not needed on every clock cycle during computation, but it still requires moderately frequent access. In the schedule given by Table 5.2, it is used in 75% of the active CCU cycles: read during the forget gate, written to during the input gate, then read again during the output gate. Depending on the size of the model, the cell state vector is not required to be available immediately on every cycle; some slack is acceptable before computation would need to be halted. However, it would not make sense to store the cell state vector in off-chip memory. Extra logic would be required to implement the communication with external memory, and the additional bus traffic would interfere with the parameter reads. Furthermore, only a minimal amount of on-chip memory is required to hold it. As with the hidden layer output, a 16 kB cell state cache would support a layer size of up to 4096 with 32-bit data words. Unlike the hidden layer cache, however, the cell state does not require a ping-pong buffering scheme, because the computation schedule never requires c_t and c_{t-1} concurrently.

Therefore, we implement the storage of the cell state vector as a simple RAM, identically to the temporary memory units (MEM1, MEM2) used by the CCU.

5.3 Theoretical Performance

Given the design strategies identified for implementing the datapath and accessing memory, we can characterize the theoretical performance of the binary-weight LSTM accelerator.

First, we derive an expression for the total time required to complete a single input computation; that is, how long it takes to produce one result h_t from one input x_t. We start with a general expression for the computation time, which is simply the number of clock cycles required to complete the operation divided by the clock rate of the system:

ComputationTime = TotalCycles / ClockRate    (5.10)

The clock rate is dependent on the hardware platform selected for implementation, but we can characterize the total number of clock cycles required based on the datapath and memory architecture. As we have seen, computation time is dominated by the pre-activation matrix-vector product calculation. The number of cycles required to complete these calculations is given by the ceiling of the total number of parameters in the model divided by the rate at which the parameter data are consumed by the MVPU. Then, we have a small additional delay from the number of cycles it takes the CCU to process the final MVPU results. Thus, the total number of clock cycles is given by:

TotalCycles = ⌈ NumParam / ParamConsumptionRate ⌉ + CCUDelay    (5.11)

For LSTM, the total number of model parameters is the count of all the elements contained in the four gate parameter matrices, W_i, W_j, W_f, and W_o:

NumParam = 4 · N · (M + N + 1)    (5.12)

The delay from the cell calculations, as we have seen, is P cycles due to the parallel-to-serial conversion that takes place. However, in reality, the implementation of this architecture will require pipeline registers in the datapath to allow for a fast clock rate. We denote the pipeline delay as C_p and write the total delay from the CCU as:

CCUDelay = P + C_p    (5.13)

The consumption rate of the MVPU is, at its maximum, P parameters per cycle. This is the case when model parameters are contained in the on-chip cache and can be read without delay. However, when parameters are stored off-chip, the consumption rate is determined by the width of the external memory interface, i.e. MemWidth_external parameters consumed per cycle. The average consumption rate, then, is given by the percentage of model parameters in on-chip memory, denoted by p, multiplied by the on-chip consumption rate, plus the percentage of model parameters stored in off-chip memory multiplied by the off-chip consumption rate:

ParamConsumptionRate = p · P + (1 − p) · MemWidth_external    (5.14)

Putting this all together, we have:

TotalCycles = ⌈ 4 · N · (M + N + 1) / (p · P + (1 − p) · MemWidth_external) ⌉ + P + C_p    (5.15)

This expression serves as a reasonable estimate for the total number of clock cycles it will take the accelerator to process a single input vector. It can be used to find an appropriate value for P given a specific model size and real-time processing constraint. In practice, however, the estimate will be low due to the additional delays incurred by control logic and memory transaction overhead. We will see this to be the case in later chapters.
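Equation 5.15 is easy to evaluate for candidate configurations. In the sketch below, the pipeline delay C_p is an assumed placeholder value, and the example model dimensions are taken from the TIMIT experiments only for illustration.

    import math

    def bru_cycles(M, N, P, mem_width_ext, frac_on_chip, Cp=8):
        # Equation 5.15: estimated cycles to process one input vector.
        num_params = 4 * N * (M + N + 1)                              # Eq. 5.12
        rate = frac_on_chip * P + (1 - frac_on_chip) * mem_width_ext  # Eq. 5.14
        return math.ceil(num_params / rate) + P + Cp                  # Eq. 5.15

    # Example: a layer with M = 39 and N = 512, P = 32, an 8-parameter-wide external
    # bus, and 25% of parameters cached on-chip needs about
    # ceil(1,130,496 / 14) + 32 + 8 = 80,790 cycles, i.e. roughly 0.40 ms at 200 MHz.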

CHAPTER VI

HARDWARE ARCHITECTURE

Using the design strategies identified in the previous chapter, a digital hardware architecture is developed for accelerating LSTM inference. The accelerator core, called Binary Recurrent Unit (BRU), is targeted specifically for a Xilinx FPGA implementation. It requires access to external DDR memory, which it uses to read input from and write results to. It is configured through a set of control registers, which allows a wide range of model sizes to be supported. The core is intended to be used as a co-processor to a CPU host, which runs the main inference software application and offloads LSTM layer computations to BRU. Figure 6.1 shows the full inference accelerator system architecture.

This chapter covers the main components of the BRU architecture: the control logic, external memory read & write units, internal memory units, the Matrix-Vector Product Unit (MVPU), and the Cell Calculation Unit (CCU). The purpose and mode of operation of each component and its subsystems will be explained in detail. The BRU architecture is depicted in Figure 6.2.

Figure 6.1: Inference accelerator system architecture (a CPU host running the software application, external DDR3 memory and its controller, and the BRU core with its control registers implemented on the FPGA).

Figure 6.2: Binary Recurrent Unit architecture (control logic, external memory read and write units, internal parameter, input, and hidden state memories, the Matrix-Vector Product Unit, and the Cell Calculation Unit with its stream conversion logic and CCU memory).

6.1 Control Logic

Control logic is required to implement the flow of operations in BRU and to configure subsystems to behave correctly. It consists of a global state machine, called the Control Unit, and various registers that are written to and read from by the application.

6.1.1 Control Unit

The Control Unit directs the computational flow of BRU.

Inputs

  load_param_mem – Indicates a request to load the on-chip parameter memory.
  start_batch – Indicates a request to start a batch processing sequence.
  batch_size – The number of input vectors in the batch.
  seq_len – The length of an input sequence.
  input_rd_first – Indicates that the first input vector of a batch has been read.
  param_rd_last – Indicates the end of a parameter read sequence.
  ccu_busy – Indicates that the Cell Calculation Unit is performing computations.

Outputs

  ctrl_start – Indicates the start of a batch processing sequence.
  input_rd_start – Indicates the start of an input read sequence.
  param_rd_start – Indicates the start of a parameter read sequence.
  batch_idx – The position in the batch of the current computation.
  begin_new_input – Indicates the start of computation for a new input vector.
  first_in_seq – Indicates the first input vector in a sequence.
  batch_rd_done – Indicates that memory reads for the batch sequence are done.

The Control Unit is implemented as a finite-state machine. Specifically, it is a Mealy machine, meaning that its output is determined both by its current state and its inputs. It acts as a “director” for the other subsystems—using control signals to initiate an action in another component, waiting on the completion status of the action, then deciding the next action to direct based on a schedule. Primarily, these actions are related to reading memory.

The Control Unit communicates with the Input and Parameter Memory Read units using one input and one output signal per Read unit. It asserts rd start to begin a read sequence. For parameter reads, it monitors param rd last to know when the entire read sequence has finished. For input data, however, only the first input vector needs to be present in order to begin computation. Therefore, input rd first is monitored to know

when the first input vector resides in on-chip memory, and the rest of the input batch read

sequence continues concurrently with the parameter reads.

Two control registers, load param mem and start batch, are used to send requests to

the Control Unit to initiate two routines: fill the internal parameter memory, and perform

computation for a batch of inputs. The batch size is specified by a register. Sequence length

is also specified, which informs the Control Unit when to reset the cell and hidden layer

states to zero, i.e. at the beginning of a sequence. The batch idx output, along with an input vector size register, are used to derive the appropriate internal memory addresses for reading the input data.

The state flow for batch computation is shown in Figure 6.3. As can be seen, after the first input in the batch has been fetched, the parameter matrix is fetched. Although not shown explicitly, parameters are fetched in the gate order specified by Table 5.2 due to their organization in memory, as explained in detail in Section 5.2.2. At the end of the

first parameter read cycle, a new parameter read cycle is issued if there is work remaining, i.e. the count of read cycles is less than the batch size. A batch is done when batch size parameter reads have been issued and the last parameter data item has been read. Then, a new batch can start after the start batch signal has been deasserted.

Figure 6.3: Control Unit Batch Computation State Flow. State transition conditions are shown in black. Output signal assertions which occur on state transitions are shown in blue.

The unit does not explicitly tell the processing units (MVPU and CCU) when to perform computations. Rather, data flows through the system directly from the Memory Read Units into the datapath, qualified by a valid signal. The Control Unit counts the number of gate fetch cycles it performs in order to know when to stop issuing parameter memory reads.

It does not explicitly keep track of when the last computational result has been produced by the Cell Calculation Unit, which would indicate the actual end of a batch. Instead, it checks ccu busy before starting a new batch to make sure that the CCU has completed computation.
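As an illustration, the batch-computation flow of Figure 6.3 can be expressed as a small Mealy-style step function in Python. This is a behavioral sketch of the schedule only; the signal names mirror the inputs and outputs listed above, and the work-remaining check is assumed to be a simple comparison of issued read cycles against the batch size.

def control_unit_step(state, inputs, read_cycles, batch_size):
    """One evaluation of the batch-computation state flow in Figure 6.3.
    Returns (next_state, outputs, read_cycles). Behavioral sketch only."""
    out = {"input_rd_start": False, "param_rd_start": False}
    if state == "IDLE":
        if inputs["start_batch"]:
            out["input_rd_start"] = True
            state = "FETCH_INPUT"
    elif state == "FETCH_INPUT":
        if inputs["input_rd_first"]:          # first input vector is now on-chip
            out["param_rd_start"] = True
            read_cycles = 1
            state = "FETCH_PARAM"
    elif state == "FETCH_PARAM":
        if inputs["param_rd_last"]:
            if read_cycles < batch_size:      # work remaining: issue another parameter read cycle
                out["param_rd_start"] = True
                read_cycles += 1
            else:
                state = "DONE"
    elif state == "DONE":
        if not inputs["start_batch"]:         # wait for start_batch to deassert
            state = "IDLE"
    return state, out, read_cycles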

6.1.2 Control Registers

Registers are used to parameterize different aspects of BRU and control its operation. BRU also writes to two registers to provide information back to the application. Registers are accessed via the AXI4-Lite protocol.

Read load param mem – Indicates a request to load the on-chip parameter memory.

start batch – Indicates a request to start a batch processing sequence.

batch size – The number of input vectors in the batch.

input size – The length of an input vector.

hidden size – The number of hidden units; the length of an output vector.

seq len – The length of an input sequence.

input rd num bursts – Number of bursts required to read a batch of input data from external memory.

input rd last burst len – The length of the last read burst for the input batch.

input rd base addr – The base address from which to read input data.

param rd num bursts – Number of bursts required to read the full parameter matrix from external memory.

param rd last burst len – The length of the last read burst for a gate parameter matrix.

param rd base addr – The base address from which to read parameter data.

Write batch done – Indicates that a batch computation has finished and all results have been written to external memory.

run time – Counter value indicating the number of clock cycles elapsed in the current batch; stopped at the end of computation.

6.2 External Memory

External memory, or “main memory”, is accessed using AXI4 protocol, an open-standard protocol for on-chip interconnect communication [40]. BRU acts as an AXI Master, which

controls a memory interface acting as an AXI Slave. AXI4 is a burst protocol, meaning that a contiguous burst of data is transmitted to/from the device after a simple handshaking protocol. AXI4 has a maximum burst size of 256, so BRU must perform memory reads/writes in length-256 chunks or smaller.

The External Read and Write Units are designed to provide a simple interface to main memory for the other components in the architecture. Three external memory unit instances are used: one for reading input data, one for reading model parameter data, and one for writing output result data. Each one is assigned a unique region of memory, specified by a base address and region size. The addresses used within the architecture are offsets to the actual base address used in main memory, which simplifies the addressing scheme. By having separate units handle the different memory access functions, the control logic within the architecture is simplified, and memory access arbitration can be handled externally. The specifics of the Read and Write Unit architectures are explained in the following sections.

The external memory base address for parameter reads is set with a register.

6.2.1 Read Unit

The Read Unit is responsible for reading data from external memory.

Inputs rd in – Bus signal coming from the AXI Slave Memory Interface.

ack – Indicates that a read request was accepted.

data – The data returned from the read request.

valid – Indicates the validity of the data.

rd start – Indicates a read request from the Control Unit.

base addr – The base address from which to start a read.

num bursts – Total number of bursts to read.

last burst len – The length of the last burst.

Outputs rd out – Bus signal going out to the AXI Slave Memory Interface.

addr – The starting address for the burst read transaction.

len – The number of values to be transferred.

valid – Indicates the validity of addr and len.

rdy – Indicates that the unit is ready to receive data.

dout – The data returned from the read request.

dvalid – Indicates the validity of the data.

didx – The index of the current data item within the read sequence.

tlast – Indicates the last data item in the read sequence.

This subsystem provides an interface that abstracts away the details of the AXI4 pro-

tocol, allowing large reads to be issued with a single control message. It accomplishes this

with the num bursts and last burst len inputs, which are used by the control logic in- side the Read Unit to perform multiple burst transactions in succession. It does this by issuing num bursts-1 read bursts of length 256, followed by a final read burst of length last burst len. The appropriate values for these inputs are calculated in software and written to the corresponding control registers, which are connected directly to the Read

Unit. The data output of this unit will not be a contiguous stream of valid data, so down- stream units must watch dvalid to qualify dout. The didx signal can be used as an address for on-chip memory (like the input cache). Two Read Units are used in the BRU architecture: one for reading input data, and one for reading parameter data.
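For illustration, the register values could be derived in host software roughly as follows. The word-packing in the example is an assumption based on the 16-bit input format described in Chapter 7; it is not taken from the thesis software.

def burst_registers(total_words, max_burst=256):
    """Compute num_bursts and last_burst_len for a read of total_words data items,
    given the AXI4 burst limit of 256."""
    num_bursts = (total_words + max_burst - 1) // max_burst     # ceiling division
    last_burst_len = total_words - (num_bursts - 1) * max_burst
    return num_bursts, last_burst_len

# Example: 7735 input vectors of 4 elements, two 16-bit inputs packed per 32-bit word.
words = 7735 * 4 // 2
print(burst_registers(words))   # -> (61, 110)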

6.2.2 Write Unit

The Write Unit is responsible for writing data to external memory.

Inputs wr in – Bus signal coming from the AXI Slave Memory Interface.

ack – Indicates that a write request was accepted.

rdy – Indicates that data can be sent.

err – AXI Bus error indicator.

data – Data to be written to memory.

valid – Indicates the validity of the data.

batch start – Indicates that a new computation batch has started.

batch done – Indicates that the last result for the batch has been calculated.

Outputs wr out – Bus signal going out to the AXI Slave Memory Interface.

addr – The starting address for the burst write transaction.

len – The number of values to be transferred.

data – The data to be transferred.

valid – Indicates the validity of the data.

tlast – Indicates the end of the write sequence.

Like the Read Unit, this subsystem provides an interface that abstracts away the details of the AXI4 protocol. When a batch computation starts, the control logic inside the unit begins monitoring the data input and stores valid samples in a FIFO. When the

FIFO reaches a threshold and the memory bus is available, the unit writes all the data contained in the FIFO out to external memory. When the batch done signal goes high to indicate that all the computation results for the batch have been produced, the Write

Unit will empty the FIFO out to memory regardless of how much data it currently contains.

Only one Write Unit is used in the BRU architecture, for writing computational results from the Cell Calculation Unit to memory.

6.3 Internal Memory

Internal memory is used in various capacities in order to support low-latency data access and storage. All memory implementations are based on either a single-port (shared read/write address port) or a dual-port (separate read/write address ports) RAM with one data output port. These implementations target the Block RAM (BRAM) resources available on Xilinx FPGA architectures. There are four internal memory units: Input Memory,

Parameter Memory, Hidden State Memory, and Cell Calculation Unit Memory.

6.3.1 Input Memory

The Input Memory stores a batch of input data.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

Outputs rd dout – The data read at rd addr.

Write data for this unit comes from the Input Read Unit (external memory) and read data is sent to the Matrix-Vector Product Unit. It is implemented as a dual-port RAM, which simplifies the control logic for reading and writing. The write cycle is initiated at the start of a batch. The Input Memory should be moderately sized (a few hundred kB) in order to support larger batch sizes, but the amount of memory resources dedicated to this unit will depend on how much is available on the target FPGA device. Parameter Memory should take priority for allocation of on-chip memory resources.

6.3.2 Parameter Memory

The Parameter Memory stores a portion (or all, depending on model size) of the model parameters.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

Outputs rd dout – The data read at rd addr.

Write data for this unit comes from the Parameter Read Unit (external memory) and read data is sent to the Matrix-Vector Product Unit. It is implemented as a dual-port RAM.

Parameter Memory is written to once during system startup by signaling the Control Unit with the load param mem register. The param mem addr limit register tells the Control

Unit how much data to write into the on-chip memory and, subsequently during calculation, the highest address at which valid parameter data can be found. Like main memory, this

unit is byte-addressed. The width of the data input to this unit matches the width of the

Parameter Read Unit data output. The output width for this unit, called the line size, matches the width of the Matrix-Vector Product Unit datapath, P . Reads are aligned to the line size. Therefore, a read address that represents an offset into a line will return the entire line, with the data from the requested address located at the corresponding offset within the returned line. For example, with a 128 bit (16 byte) line size, a read address of 0x22 would return the line starting at 0x20, with the contents of 0x22 located at the third byte in the returned line. The Control Unit is responsible for converting the desired parameter indices to the corresponding memory address. Parameter Memory should be made as large as possible given the available memory resources on the target FPGA device.
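To make the line-aligned read behavior concrete, the following Python fragment models it. The 16-byte line size matches the example above, and the memory contents are dummy values; this is an illustration of the addressing rule, not the RTL.

def read_parameter_line(mem, byte_addr, line_bytes=16):
    """Model a line-aligned Parameter Memory read: any byte address returns the whole
    line containing it, plus the offset of the requested byte within that line."""
    base = (byte_addr // line_bytes) * line_bytes     # e.g. 0x22 -> 0x20
    offset = byte_addr - base                         # e.g. 0x22 -> 2 (third byte)
    return mem[base:base + line_bytes], offset

mem = bytes(range(256))                # dummy memory contents
line, off = read_parameter_line(mem, 0x22)
print(hex(line[off]))                  # 0x22, the third byte of the line starting at 0x20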

6.3.3 Hidden State Memory

The Hidden State Memory stores the hidden state used as input to the current computation as well as the newly computed hidden state that becomes the input for the next computation.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

src toggle – When high, toggles the read/write state of the two internal RAMs.

Outputs rd dout – The data read at rd addr.

This unit supports the recurrent nature of the LSTM. Write data comes from the output of the Cell Calculation Unit, and read data is sent to the Matrix-Vector Product Unit.

Because the hidden state needs to persist while the next hidden state values are being produced, a ping-pong buffering scheme is implemented. Two identically sized single-port

RAMs are used. While one is read from, the other is written to, and vice versa. The src toggle input signal is used to swap the function of each RAM.
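The ping-pong scheme can be summarized with a small behavioral model. The class below is a sketch under the assumption that src toggle simply swaps which RAM is the read source; it is not the actual Simulink or HDL implementation.

class PingPongHiddenState:
    """Behavioral sketch of the two-RAM ping-pong scheme: one RAM is read (the current
    hidden state) while the other is written (the next hidden state)."""
    def __init__(self, size):
        self.rams = [[0] * size, [0] * size]
        self.read_sel = 0                      # which RAM is currently the read source

    def read(self, addr):
        return self.rams[self.read_sel][addr]

    def write(self, addr, value):
        return self.rams[1 - self.read_sel].__setitem__(addr, value)

    def src_toggle(self):
        """Swap roles at the start of the next time step."""
        self.read_sel = 1 - self.read_sel

hs = PingPongHiddenState(4)
hs.write(0, 7)
hs.src_toggle()
print(hs.read(0))    # 7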

6.3.4 Cell Calculation Unit Memory

The Cell Calculation Unit Memory stores intermediate computational results.

Inputs wr din – The input data to be written.

wr en – Indicates the validity of the write data.

wr addr – The address to write to.

rd addr – The address to read data from.

gate id – Identifier for the current gate being computed.

Outputs mem1 rd dout – The data read at rd addr in MEM1.

mem2 rd dout – The data read at rd addr in MEM2.

cell rd dout – The data read at rd addr in CELL.

This unit is read from and written to by the Cell Calculation Unit. It contains three separate, identically sized dual-port RAMs: MEM1 stores the intermediate result tanh(j_t); MEM2 stores the intermediate result sigm(f_t) ⊙ c_{t−1}; and CELL stores the full result of the cell state vector computation i_t ⊙ j_t + f_t ⊙ c_{t−1}. Input data is routed to all three memories,

but write behavior is controlled by decoding gate id and ANDing with wr en to produce

three separate write control signals. The following decode table is used:

Gate        ID   Action       Write Control
New Input   0    Write MEM1   001
Forget      1    Write MEM2   010
Input       2    Write CELL   100
Output      3    Write None   000

Address inputs refer to the element index in a gate vector, i.e. when computing the k-th element of the cell state vector, the values MEM1[k] and MEM2[k] are read, and the

result is written to CELL[k]. The memory units must have separate write and read ports due to pipelining in the Cell Calculation Unit; write results are not available in the same cycle that read values are used. Three separate data output ports are used as well, both in order to simplify routing to the CCU functional units and because MEM1 and MEM2 contents are needed at the same time during the Input Gate computation.
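A plain-Python rendering of the decode table may help make the write control explicit. The gate numbering follows the table above; the function is only a behavioral stand-in for the decode-and-AND logic.

def ccu_memory_write_controls(gate_id, wr_en):
    """Decode gate_id and AND with wr_en to produce the three write enables
    (MEM1, MEM2, CELL), matching the decode table above."""
    decode = {0: (1, 0, 0),   # new input gate -> write tanh(j_t) into MEM1
              1: (0, 1, 0),   # forget gate    -> write sigm(f_t) * c_{t-1} into MEM2
              2: (0, 0, 1),   # input gate     -> write cell state result into CELL
              3: (0, 0, 0)}   # output gate    -> no write
    m1, m2, cell = decode[gate_id]
    return m1 & wr_en, m2 & wr_en, cell & wr_en

print(ccu_memory_write_controls(1, 1))   # (0, 1, 0): forget gate writes MEM2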

6.4 Matrix-Vector Product Unit

The Matrix-Vector Product Unit (MVPU) implements the row-wise matrix-vector mul- tiplication algorithm, described in Section 5.1.1. It is composed of an array of P processing elements (PEs), which perform a multiply-accumulate operation along the rows of the pa- rameter matrix. It is fed parameter data in a streaming fashion, and it reads a corresponding input vector element from either the Input or Hidden State Memories. The MVPU contains its own controller unit to handle input data and manage the multiply-accumulate operation.

It also has a stream conversion unit, which handles sending computation results to the Cell

Calculation Unit.

6.4.1 Controller

The MVPU Controller is responsible for implementing the row-wise matrix-vector multiplication algorithm.

Inputs param rd start – Indicates the start of a parameter read sequence.

param rd last – Indicates the end of a parameter read sequence.

param dvalid – Indicates the validity of the parameter data.

input size – The length of an input vector.

hidden size – The length of a hidden state vector.

Outputs rd src – Selects a memory unit to read from, either input or hidden state memory.

rd addr – The input or hidden state memory address to read from.

mac reset – Indicates that the accumulate register in the PEs should be reset.

mac done – Indicates the end of a row accumulation.

This unit monitors the control signals from the parameter data stream in order to align input data for the PE array. The param rd start signal from the Control Unit tells the

MVPU controller to begin the row-wise computation sequence. When param dvalid is high, the controller reads the appropriate value from either the Input Memory or Hidden State

Memory. It does this by keeping a running index into the input and hidden state vectors, iteratively reading items 1 through input size of the current input vector and then items

1 through hidden size of the hidden state vector. The controller asserts mac reset on the

first valid read cycle of the row, and mac done is asserted on the last valid read cycle.
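The controller's read sequencing can be sketched as a generator. Bias handling and pipeline timing are omitted, so this is a simplified behavioral model rather than the actual controller.

def mvpu_read_schedule(input_size, hidden_size):
    """Yield (rd_src, rd_addr, mac_reset, mac_done) for the valid parameter cycles of one
    row accumulation, following the running-index scheme described above."""
    total = input_size + hidden_size
    for i in range(total):
        rd_src = "input" if i < input_size else "hidden"
        rd_addr = i if i < input_size else i - input_size
        yield rd_src, rd_addr, i == 0, i == total - 1

for step in mvpu_read_schedule(input_size=4, hidden_size=8):
    print(step)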

6.4.2 Processing Element Array

The array of processing elements performs the binary multiply-accumulate operation for the row-wise matrix-vector multiplication algorithm.

Inputs input data – Input vector data read from internal memory.

hidden data – Hidden state vector data read from internal memory.

rd src – Indicates the input value, input data or hidden data, to feed to the PEs.

param data – An array of P parameter data values.

valid in – Indicates the validity of the input data.

reset – Sets the value of the accumulate register in the PEs to zero.

done – Indicates the end of a row accumulation.

Outputs result data – An array of P pre-activation vector computation results.

valid out – Indicates the validity of the result data.

The PE array contains P parallel datapaths. Before the input of the array, Input and

Hidden State Memory data are sent to a 2-to-1 multiplexer, the output of which is selected by the rd src signal. Then, the multiplexer output is fed to each of the P processing elements, along with valid in and reset. The parameter data input is split into P separate signals, each routed to an individual PE. Inside the PE, the binary multiply-accumulate operation takes place. The “multiply” is accomplished with a two’s-complement unit and a

eration takes place. The “multiply” is accomplished with a two’s-complement unit and a

2-to-1 multiplexer, as shown in Figure 5.3. An adder takes the result of the multiply and

sums with the output of an accumulate register. The result of the addition is then stored

in the accumulate register, qualified by valid in. The PE contains two pipeline registers in its datapath, so the done signal is sent to a corresponding two-cycle delay register, then outputted as the valid out signal.
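The binary multiply-accumulate performed in each PE amounts to a conditional negate followed by an add. The snippet below models one PE cycle and accumulates a short example row; it ignores the two pipeline registers and the fixed-point word widths.

def binary_pe_cycle(acc, operand, weight_bit, valid, reset):
    """One multiply-accumulate step: the binary weight selects the operand or its
    two's complement, and the sum is stored only when the input is valid."""
    if reset:
        acc = 0
    product = operand if weight_bit == 1 else -operand   # "multiply" by +1 / -1
    return acc + product if valid else acc

weights = [1, 0, 0, 1]        # bit value 0 encodes a -1 weight
inputs = [3, -1, 2, 5]
acc = 0
for i, (x, w) in enumerate(zip(inputs, weights)):
    acc = binary_pe_cycle(acc, x, w, valid=True, reset=(i == 0))
print(acc)                    # 3 + 1 - 2 + 5 = 7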

6.4.3 Stream Conversion Unit

The Stream Conversion Unit converts the parallel output of the processing element array to a serial data stream.

Inputs result data – The array of P results from the PE array.

result valid – Indicates the validity of the result data

num hidden – The number of hidden units.

reset – Resets the control state of the unit.

Outputs data out – The serialized result data.

data valid – Indicates the validity of the data.

data idx – The vector element index of the data.

gate id – Identifier for the gate vector the data belongs to.

last – Indicates the last index in the last gate.

The control logic inside the Stream Conversion Unit contains three counter values, which are all initialized to zero: local count, idx count, and gate count. When result valid is high, the P items in result data are saved into registers. Then, the controller begins counting up for P cycles, incrementing local count and idx count on each cycle. The local count is used to index into the saved result vector, and the indexed value is sent to data out. This counter is reset with each new delivery of results from the PE array, but idx count is maintained and sent to the output as data idx. When idx count reaches the value num hidden-1, it is reset to zero, and gate count is incremented to signify the computation of the next gate.

This counter is outputted as gate id. When idx count reaches the value num hidden-1 and gate id equals 3 (the ID of the output gate), last is asserted to signify the last data item in the computation.
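The counter behavior can be illustrated with a short generator. The example values are arbitrary; in the real unit the block width equals P and the data are fixed-point pre-activation results.

def stream_convert(result_blocks, num_hidden):
    """Serialize P-wide result blocks into (data, data_idx, gate_id, last) tuples,
    mirroring the local/idx/gate counters described above. Behavioral sketch."""
    idx_count, gate_count = 0, 0
    for block in result_blocks:                    # one block per delivery from the PE array
        for value in block:                        # local counter walks the saved registers
            last = (gate_count == 3) and (idx_count == num_hidden - 1)
            yield value, idx_count, gate_count, last
            if idx_count == num_hidden - 1:
                idx_count, gate_count = 0, gate_count + 1
            else:
                idx_count += 1

# Example: num_hidden = 4 and P = 2, so each gate arrives as two blocks of two results.
for item in stream_convert([[10, 11], [12, 13]] * 4, num_hidden=4):
    print(item)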

6.5 Cell Calculation Unit

The Cell Calculation Unit (CCU) performs the activation functions and element-wise operations in LSTM.

Inputs data in – The pre-activation vector element data.

data valid in – Indicates the validity of the data.

data idx in – The vector element index of the data.

gate id – Identifier for the gate vector the data belongs to.

first in seq – Indicates the first input vector in a sequence.

last – Indicates the last index in the last gate.

begin new input – Indicates the start of computation for a new input vector.

ctrl start – Indicates the start of a batch processing sequence.

Outputs data out – The cell calculation result data.

data valid out – Indicates the validity of the data.

data idx out – The vector element index of the data.

busy – Indicates that the Cell Calculation Unit is performing computations.

This unit processes the serial stream of result data from the Matrix-Vector Product Unit.

Its architecture is similar to the one shown in Figure 5.4, but it contains pipeline registers before and after each of its functional units in order to support a high clock rate. In addition to the memory unit described in Section 6.3.4, the CCU contains three major subsystems: the Activation Function Unit, the Elem-Mult Unit, and the Elem-Add Unit, which are described in detail in the following sections. Several control signals are used: first in seq tells the unit when to reset the cell state memory for the beginning of a sequence; last, begin new input, and ctrl start are used to derive the busy state of the unit.

6.5.1 Activation Function Unit

The Activation Function Unit performs the sigmoid and hyperbolic tangent activation functions.

Inputs data in – The pre-activation vector element data.

gate id – Identifier for the gate vector the data belongs to.

Outputs data out – The activated result data.

Both activation functions are implemented using a lookup table (LUT) method. A few techniques are used to optimize the implementation for hardware and yield best accuracy.

Firstly, symmetry is used to reduce the number of points in the LUT. We observe that the hyperbolic tangent function is symmetric about the origin. The sigmoid function is not symmetric about the origin, so instead the biased sigmoid function f(x) = sigm(x) − 1/2, which is symmetric about the origin, is used. Thus, only points from x ≥ 0 are sampled for both of these functions.

Next, we note that lim_{x→∞} tanh(x) = 1 and lim_{x→∞} (sigm(x) − 1/2) = 1/2. Cutoff points at tanh(4) = 0.9993 and sigm(6) − 1/2 = 0.4975 are used, above which output values of 1 and 0.5, respectively, are used.

Finally, the shapes of the activation functions are considered. Both functions have nearly-linear regions that can be approximated with coarsely-grained sampling, but some regions of the functions require high-granularity sampling. It would be a waste of LUT resources to use the same high granularity for the entire range of the functions, so four sep- arate ranges are sampled for each function. For the hyperbolic tangent, region breakpoints of 0, 0.5, 1.5, 3, and 4 were used. For the biased sigmoid, breakpoints of 0, 0.5, 1.5, 3, and

6 were used. Each region contains 16 evenly spaced samples.

Inside the Activation Function Unit, the absolute value of the input is taken. A flag is asserted if the input was negative. The absolute value is sent to the four region LUTs.

Each LUT asserts a flag if the input is in its range, and these four flags are used to select the proper LUT output. If the negative input flag was asserted, then the sign of the LUT output is flipped. Finally, the linear bias is added (for sigmoid only).

The input data in is sent to both the hyperbolic tangent and sigmoid LUT units, and gate id is decoded to select the appropriate function: hyperbolic tangent for the new input gate, and sigmoid for all others.
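The following sketch reproduces the LUT construction and lookup procedure in floating-point Python. The breakpoints, cutoffs, and 16-sample regions come from the description above, but the sampling rule (left-edge samples, nearest-sample selection) is an assumption, and the hardware uses fixed-point tables rather than floating point.

import numpy as np

TANH_BREAKS = [0.0, 0.5, 1.5, 3.0, 4.0]
SIGM_BREAKS = [0.0, 0.5, 1.5, 3.0, 6.0]

def build_luts(breaks, fn):
    """Sample fn over each [lo, hi) region with 16 evenly spaced points."""
    return [(lo, hi, fn(np.linspace(lo, hi, 16, endpoint=False)))
            for lo, hi in zip(breaks[:-1], breaks[1:])]

tanh_luts = build_luts(TANH_BREAKS, np.tanh)
sigm_luts = build_luts(SIGM_BREAKS, lambda x: 1.0 / (1.0 + np.exp(-x)) - 0.5)  # biased sigmoid

def lut_activation(x, luts, saturate, add_bias=False):
    """Absolute value -> region select -> nearest LUT sample -> sign restore (-> bias)."""
    neg = x < 0
    ax = abs(x)
    y = saturate                                   # value used beyond the last breakpoint
    for lo, hi, table in luts:
        if lo <= ax < hi:
            step = (hi - lo) / 16
            y = table[min(int((ax - lo) / step), 15)]
            break
    y = -y if neg else y
    return y + 0.5 if add_bias else y              # re-add the 1/2 bias for sigmoid only

print(lut_activation(0.8, tanh_luts, saturate=1.0))                    # approximates tanh(0.8)
print(lut_activation(-2.0, sigm_luts, saturate=0.5, add_bias=True))    # approximates sigm(-2.0)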

6.5.2 Elem-Mult Unit

The Elem-Mult Unit performs the element-wise multiplication function.

Inputs 84 data in – The activated gate vector element data.

gate id – Identifier for the gate vector the data belongs to.

mem1 rd dout – The data read from the corresponding element index in MEM1.

cell rd dout – The data read from the corresponding element index in CELL.

Outputs data out – The multiplication result data.

This unit contains a two-input multiply unit, a hyperbolic tangent LUT unit, and several multiplexers that implement the operand routing as specified in Table 5.2. Depending on the value of gate id, either mem1 rd dout, cell rd dout, or tanh(cell rd dout) is routed to

the input of the multiplier along with data in. A multiplexer on the output of the multiplier is used to select between the multiplication result and the unmodified input, rather than explicitly implementing data in*1. The hyperbolic tangent LUT unit is implemented the same as in the Activation Function Unit.

6.5.3 Elem-Add Unit

The Elem-Add Unit performs the element-wise addition function.

Inputs data in – The multiplied gate vector element data.

gate id – Identifier for the gate vector the data belongs to.

mem2 rd dout – The data read from the corresponding element index in MEM2.

Outputs data out – The addition result data.

This unit contains a two-input addition unit and a multiplexer that implements the operand routing as specified in Table 5.2. The inputs data in and mem2 rd dout are routed to the adder. A multiplexer on the output of the adder is used to select between the addition result and the unmodified input, rather than explicitly implementing data in+0.

CHAPTER VII

IMPLEMENTATION & RESULTS

In order to evaluate the performance of the inference accelerator, the BRU architecture is implemented on an FPGA. This chapter presents the details of the hardware implemen- tation process, including the design methods and target device considerations. Finally, computational speedup of BRU is evaluated against pure-CPU and GPU inference.

7.1 Methods

This work followed a model-based design method for implementing the hardware system.

With this method, implementation stems from developing a simulation model, rather than jumping straight into writing HDL. A plant model is first developed to simulate interaction with external systems. Then, the target architecture is implemented using a behavioral approach. After verifying the correctness of the behavioral implementation, the model is converted piece-by-piece to a structural implementation, which more closely maps to the hardware architecture. HDL code is automatically generated for the target system directly from the simulation model, then deployed to the hardware and tested. The code generation step is a key enabler of this methodology—the behavior of the generated HDL version of the model exactly matches the behavior in the simulation environment. Still, the final step of the process is to verify the correct operation of the deployed model in hardware.

7.1.1 Tools

MathWorks MATLAB & Simulink were used as a development environment for the model-based design workflow. MATLAB is a software tool that provides a programming environment for numerical computing. Simulink is a companion tool to MATLAB that provides a block-diagram-based graphical environment for modeling and simulation of dy- namical systems. It is tightly integrated with MATLAB, allowing simulation data to be viewed and analyzed in MATLAB. The code generation step of the model-based design method is provided by HDL Coder, an add-on toolbox that converts both MATLAB code and Simulink models into HDL implementations. The output of the code generation process is called an IP (intellectual property) core, which is the generated HDL code packaged into subsystem for use as a building block in a larger FPGA design.

Xilinx Vivado Design Suite was used to build the FPGA target. Vivado is a software tool used for synthesizing an HDL design and compiling it for a specific Xilinx FPGA device. The Vivado IP Integrator tool was used to construct the FPGA design from the generated BRU IP core and other IP cores providing various functions, such as interfacing with external memory. A screenshot of the Vivado IP Integrator block design is shown in

Figure 7.1.

7.1.2 Design

As described previously, the model-based design methodology was followed for imple- menting the BRU architecture. The first step in the process was the development of a plant model in Simulink to represent the external memory interface to DDR. The memory plant model implements a simplified version of the AXI4 protocol, which BRU uses to initiate memory read & write transactions with the actual memory interface in hardware. Next, a

Figure 7.1: Vivado IP Integrator block design. The BRU IP core is labeled “system top ip” in the lower-right of the diagram.

behavioral version of the architecture was implemented in Simulink. In this initial version,

the subsystems detailed in the previous chapter were implemented as MATLAB Function

Blocks, which allow MATLAB code to be executed in a Simulink model. After verifying

correct computational output and operational flow, this version of the model is saved to be

used as a “golden reference” model.

While the MATLAB Function Blocks made it easy to construct a system that was functionally correct, in order to be able to generate HDL code for the model, the system must be

implemented using blocks that are compatible for code generation. HDL code can be gen-

erated for a subset of MATLAB functions—all finite state machine logic was implemented

as MATLAB code. However, most of the design was replaced with a structural Simulink

block implementation in order to map more effectively to hardware. The conversion to a

structural implementation was an iterative process, in which subsystems were modified one

Figure 7.2: MVPU Processing Element implemented in Simulink. Pipeline delay registers are highlighted in blue.

at a time. The Fixed-Point Designer toolbox for MATLAB was employed to convert floating-point signals into a hardware-friendly fixed-point implementation. All changes to the design were verified by comparing the system output to that of the golden reference model.

Once the structural implementation was finished, HDL code (specifically Verilog) was generated for the model. The resulting IP core was integrated into a larger Vivado design containing other IP cores that perform various functions outside of BRU, such as interfacing with external memory or handling access to BRU’s control registers. The design was then synthesized, placed & routed for the Z7020 device, and compiled into an FPGA configuration bitstream file.

7.1.3 Verification

The design required verification in two areas: computational correctness and hardware implementation. To verify that the accelerator design produces the computationally correct

LSTM inference result, a “dummy” LSTM model was implemented in Python code, using the Theano framework. This model was never trained, and its parameters were randomly generated. A test input dataset, the values of which approximate a uniform distribution over the full dynamic range of BRU’s fixed-point input datatype, was also generated. In- ference for the full test input was run in Theano, and the outputs were recorded. The verification data items—dummy model parameters, test input, and recorded output—were then imported to the MATLAB environment for testing the computational correctness of the Simulink implementation. The verification data is in floating-point format, while the

BRU architecture uses fixed-point datatypes, so the output of BRU will not exactly match the verification output. Therefore, a fixed-point error tolerance of 5% of the dynamic range of the output was used.

To verify the hardware implementation, the same verification data was used. The JTAG interface on the ZedBoard was used to program the input and parameter data into DDR3 and to set the BRU control registers. After running a batch computation, the output result data was read off of the board and into MATLAB, where it was compared with the output of the Simulink model to verify the hardware implementation. Unlike in the previous verification step, the output of the hardware should match the simulation output exactly, signifying the behavior of the generated HDL code matches the simulation behavior. The run time register was also checked and compared to the run time in simulation.

Figure 7.3: The ZedBoard Zynq-7000 development board used as the target hardware platform.

7.2 Hardware

The Avnet Zedboard development board was used as a platform for implementation of the BRU architecture. This board houses a Xilinx Zynq-7000 device and has 512 MB of

DDR3 memory. A USB-JTAG interface is provided for interacting with the Zynq device.

7.2.1 Device

The targeted device was a Xilinx Zynq-7000 System-on-Chip (SoC), a chip which con- tains both an ARM CPU and an FPGA. The Zynq family of SoCs is a popular platform for high-performance embedded systems, due to the versatility it provides in terms of software, hardware, and IO. For this work, only the FPGA on the device was used. However, a fully-embedded inference application would employ software running on the CPU that uses

BRU as a co-processor to offload the LSTM computations.

The specific Zynq device on-board the ZedBoard is the Z7020. The resource specifications for the FPGA on this device are shown in the table below.

Resource             Available
Lookup Tables (K)    53.2
Flip-Flops (K)       106.4
Block RAM (Mb)       4.9
DSP Slices           220

As can be seen, Block RAM is the most constraining resource for BRU on this device—4.9

Mb is equivalent to only 627 kB. To ensure that the design can be placed & routed by the

Vivado tool, a good rule of thumb for FPGA design is to not use more than 80% of any one resource category. This effectively brings the usable amount of Block RAM down to 500 kB. The total size of all on-chip memory units in BRU must be below this value.

7.2.2 Implementation Parameters

The BRU architecture has several implementation parameters that should be customized for the target FPGA platform, depending on the available resources. The values that were chosen for implementation on the Z7020 are shown below in Table 7.1.

Table 7.1: BRU implementation parameters, targeted for Z7020.

Parameter                              Value
Width (bits)       External Memory     32
                   Datapath            256
Fixed-Point        Input               Q5.10
Datatype           Output              Q11.20
Memory Size (kB)   Input               64
                   Parameter           256
                   Hidden State        16
                   CCU                 48

The width of the interface to external memory was limited to 32 bits, due to a constraint in the memory interface IP core used. The width of the matrix-vector multiplication datapath, and thus the output width of the parameter cache, was chosen to be 256 bits.

This choice means that BRU can only handle models for which M + N + 1 ≥ 256, due to how the matrix-vector product results are consumed serially.

Fixed-point datatypes of Q5.10 and Q11.20 were chosen for input and output, respec- tively. This means that the input has a range of -32 to 31.999, with a precision of 9.765e−4, and output has a range of -2048 to 2047.999, with a precision of 9.537e−7. This also implies, along with the 32-bit memory interface width, that two input data words can be read from external memory per cycle, and one output word can be written to external memory per cycle.
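The stated ranges and precisions follow directly from the Q-format definitions, as the short check below illustrates (assuming one sign bit plus m integer and n fractional bits).

def q_format_spec(int_bits, frac_bits):
    """Range and precision of a signed fixed-point Qm.n value (1 sign + m integer + n fraction bits)."""
    lsb = 2.0 ** -frac_bits
    lo = -(2.0 ** int_bits)
    hi = 2.0 ** int_bits - lsb
    return lo, hi, lsb

print(q_format_spec(5, 10))    # (-32.0, 31.999..., 9.765e-4)   -> Q5.10 input datatype
print(q_format_spec(11, 20))   # (-2048.0, 2047.999..., 9.537e-7) -> Q11.20 output datatype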

The 64 kB Input Memory can store up to 32,768 input data words, and the Parameter

Memory has the capacity for 2,097,152 1-bit parameters. Two 8 kB RAMs are used for

Hidden State Memory, and three 16 kB RAMs are used for the CCU Memory. This corresponds to a maximum of 4096 hidden units per LSTM layer. The total memory required for all on-chip units is 392 kB, or 3.1 Mb.

7.2.3 Resource Utilization

The specified BRU implementation has the following resource utilization for the Z7020 device:

Table 7.2: FPGA resource utilization for BRU on the Z7020 device.

Resource             Utilized   % of Available
Lookup Tables (K)    8.3        15.64
Flip-Flops (K)       12.5       11.75
Block RAM (Mb)       3.45       70
DSP Slices           1          0.45

As shown in the table, BRU uses a significant portion of the Block RAM resources but not much of the other resource categories. The additional memory requirements beyond the 3.1 Mb specified for the BRU internal memory units are due to a JTAG interface unit, external to BRU. The fact that only one DSP slice was used makes sense: the binary-weight

PEs do not use a multiplier, and the 16-bit input adders can be implemented using LUT resources. The Elem-Mult and Elem-Add units in the CCU, on the other hand, map well to a single DSP slice.

BRU has an estimated power requirement of 0.471 Watts. This estimate does not include the requirements for a CPU host processor or DDR3 memory, however, so it is not representative of the power requirement for a full BRU-based inference system.

7.3 Performance Evaluation

The BRU hardware was used to implement inference for two LSTM models that were shown in Section 4.2. BRU was clocked at 200 MHz for these experiments. Elapsed run time was measured by reading the run time register, which begins counting up from 0 when a batch starts and stops counting when the final result output is written to external memory. The BRU run time is compared with the run time of CPU and GPU inference, both of which were implemented using Theano. Since BRU only computes the LSTM layers and not the fully-connected layers, the Theano code did not include fully-connected layer

computations. To further simplify the comparison, only a single LSTM layer was computed.

The CPU test platform was an Intel Core i3-2370M CPU at 2.40 GHz, and the GPU test platform consisted of an Intel Xeon CPU E5-2680 v4 at 2.40 GHz with one NVIDIA Tesla

P100 GPU. For these platforms, inference was evaluated for both the full-precision and binary-weight versions of the models. Input batches were run five times per model and hardware configuration, and the average run time was recorded for each.

7.3.1 Results

Inference run time for the handwriting recognition model with the MNIST Stroke Sequence dataset was measured. Specifically, a configuration of 1 layer, 256 hidden units was used. A batch size of 200 input sequences was evaluated, where input sequence lengths ranged from 15 to 61. In total, 7735 input vectors of size M = 4 were processed.

Table 7.3: Run time performance comparison of BRU, CPU, and GPU running inference for the MNIST Stroke Sequence model.

Platform              CPU                  GPU                  BRU
Weight precision      Full       Binary    Full       Binary    Binary
Run time (ms)         1650.03    1617.91   48.33      51.34     42.48
Rel. Speedup (×)      39         38        1.1        1.2       1

Inference run time for the speech recognition model with the TIMIT dataset was also measured. Specifically, a configuration of 1 layer, 512 hidden units was used. A batch size of 20 input sequences was evaluated, where input sequence lengths ranged from 153 to 498.

In total, 6064 input vectors of size M = 39 were processed.

Table 7.4: Run time performance comparison of BRU, CPU, and GPU running inference for the TIMIT model.

Platform              CPU                  GPU                  BRU
Weight precision      Full       Binary    Full       Binary    Binary
Run time (ms)         6539.13    6410.56   1082.62    1054.62   282.45
Rel. Speedup (×)      23         23        3.8        3.7       1

7.3.2 Analysis

The run time performance results demonstrate that BRU computes inference consider- ably faster than its CPU counterpart in all cases. For the MNIST Stroke Sequence model, as seen in Table 7.3, BRU achieved a best-case speedup of 39× compared to the full-precision weight model run on CPU. Compared to the binary-weight model on CPU, it achieved a

38× speedup. In Table 7.4, we see that for the TIMIT model, the performance improve- ment from CPU to BRU is about 23× for both full-precision and binary weights. While this speedup is not as dramatic as it was for the MNIST Stroke Sequence model, it is still a considerable improvement over a pure software implementation.

There are a couple of things to note about the CPU performance results. For one, binary weight models appear to require slightly less computation time than full-precision for both datasets. However, this difference is most likely inconsequential. The reason for this is that the Theano framework does not include optimization for the binarized weights—during inference, they are treated as 32-bit floating point numbers, just as they are in the non-binarized model. Therefore, with more trial runs, the run time averages would be expected to even out between full-precision and binary. The second, and more interesting, aspect of the results to note is the smaller speedup from CPU to BRU for the TIMIT models than for the MNIST Stroke Sequence models. Just based on the difference in the raw amount of multiply-accumulate operations for the two model batches, computation for the TIMIT batch should require around a 6.5× longer run time than that of the MNIST Stroke Sequence batch. However, the TIMIT run time is only around 4× that of MNIST Stroke Sequence. Evidently, the software is able to perform matrix multiplication more efficiently when using larger matrices.

To speculate about why this may be the case, we can consider how matrix-vector multiplication may be implemented in the CPU using Single Instruction Multiple Data (SIMD) instructions. This instruction type takes advantage of data-level parallelism and performs an operation on multiple data items with a single instruction. Any optimized low-level matrix multiplication routine at the core of the Theano engine almost surely uses SIMD instructions. A scalar multiply SIMD instruction multiplies multiple data items by a single value. Matrix-vector multiplication, then, could be implemented in an efficient manner by performing a scalar multiply with the items in a matrix column and the corresponding element of the input vector, then accumulating values along the rows of the matrix. The input vector element is reused as many times as it takes to complete the scalar multiplication along the column, remaining in the processor’s cache for the duration of its use. On the next column sequence, the next vector element must be fetched from main memory, which is expensive. The column length of the TIMIT model parameter matrix is twice that of the MNIST Stroke Sequence model (since it contains twice as many hidden units). However, the relative cost of operating along the longer column is low, due to the input vector element remaining in the cache. Therefore, the CPU would be able to more effectively utilize

SIMD instructions and reuse data when performing matrix-vector multiplication with a taller matrix.
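The access pattern being described can be illustrated with a simplified NumPy sketch of column-major accumulation. This is only an illustration of the speculated strategy, not Theano's actual BLAS-backed implementation.

import numpy as np

def mv_column_major(W, x):
    """Accumulate y += W[:, j] * x[j] one column at a time: each input element x[j] is
    reused across the whole column (a scalar-times-vector pattern), and the row-wise
    accumulation happens in the running y vector."""
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]          # one scalar-broadcast multiply-accumulate per column
    return y

W = np.random.randn(8, 5)
x = np.random.randn(5)
print(np.allclose(mv_column_major(W, x), W @ x))   # True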

Moving on to the GPU results, we see in Table 7.3 that BRU performs slightly better than the GPU for the MNIST Stroke Sequence models, and about 3.75× better for the TIMIT model.

98 We see that, as was the case with the CPU results, there is not a significant difference in performance between full-precision and binary weight variants. Again, this is to be expected due to the fact that Theano handles binary weight values as full-precision floating point values. An interesting result to note is that, unlike in the case of the CPU, there is a larger speedup from GPU to BRU for the TIMIT models than for the MNIST Stroke

Sequence models. Two possible reasons for this are suggested, both stemming from the difference in complexity between the models. Firstly, there is likely a slowdown due to data transfer from CPU to GPU—GPU cards do not have access to main CPU memory, so data is transferred to the card over a PCIe bus. While the inference computations can be parallelized, this transfer of data cannot be. The TIMIT model contains significantly more parameters than the MNIST Stroke Sequence model (2.3 million vs. 267 thousand). Taking into account the number of input and output data items for the two batches, the TIMIT batch requires around 4× more data transfer than the MNIST Stroke Sequence batch.

While increased data transfer time is likely to be responsible for the majority of the relative slowdown between the two models, there is a second possible source of the discrep- ancy to consider. The GPU memory hierarchy consists of a global memory, shared block memories, and local thread memories. As the memories increase in locality, their access time becomes shorter, but their size becomes smaller. Effective use of GPU memory, then, keeps the data required for computation as local as possible. The parameter matrix for the MNIST Stroke Sequence model may be small enough to be used in this manner. In the case of the TIMIT model, however, the large parameter matrix may not fit completely into the local and shared memory segments of the GPU, and thus portions of it have to be cycled in from global memory. BRU provides an advantage in the way that its parameters are packed tightly in-memory—using 1-bit weights, the 2.3 million parameters require only

276 KB of memory, and thus 93% of the model parameters can be stored in the 256 KB on-chip parameter memory. BRU then fully benefits from the parallelism of its datapath for the majority of computation time.

A caveat to the BRU performance numbers should be considered, however. The run time for BRU only included the duration of computation performed on the device—from the time BRU is notified of a batch start to the time it writes its last result to external memory.

The run time for GPU, on the other hand, also included the time required for the host CPU to transfer data to and from device. A more realistic performance measurement for BRU would include the memory activity of the host CPU as well. The reason that this was not measured in this work was because main memory on-board the ZedBoard was programmed using the USB-JTAG interface, which is slow and not representative of the bandwidth of memory transfer in a real embedded system. A possible future effort could be to implement embedded software on the ARM CPU of the Zynq that uses BRU as a co-processor. This would provide for a better estimate of the real-time throughput of a BRU-based accelerator system. However, it is not expected that a full system characterization would show a significant drop in performance. The reason for this is that, unlike in the case of a GPU,

BRU shares physical memory with the host CPU. The CPU would simply have to copy data from the local process memory into the region of memory for BRU input data, rather than transfer the data over a bus.

CHAPTER VIII

CONCLUSION

Long Short-Term Memory neural networks are a powerful tool for sequence modeling.

However, due to their computational complexity, it can be difficult to implement LSTM inference to be able to handle real-time workloads. This is especially true for embedded systems, which offer minimal power and hardware resources. This work addressed the need for highly efficient LSTM inference with a two-pronged approach: compressing model parameters by using a binary-weight scheme, and designing a specialized digital processor architecture to compute inference for binary-weight LSTM models.

In Chapter 4, various approaches to hardware accelerator design and model compression were discussed. For hardware design, memory utilization was identified to be a primary con- cern due to large storage size requirements for model parameters as well as the performance bottleneck presented by memory access times. Binarization was identified as a compres- sion scheme that can significantly alleviate parameter storage requirements by constraining weights to +1 and −1 during training, thus needing only a single bit to store in memory.

The effectiveness of binarization was evaluated on two separate sequence learning tasks,

handwritten digit recognition and speech recognition. Prediction performance was shown

to be comparable to that of uncompressed LSTM models for these tasks, with accuracy

losses of 0.02% and 3.18% for the handwriting and speech recognition tasks, respectively.

With binarization validated as an effective compression scheme, the design of a hardware accelerator for binary-weight LSTM could proceed. Chapter 5 explored strategies in two primary areas of consideration for hardware design, datapath and memory. Since matrix-vector multiplication makes up the majority of the computations in LSTM, an effective hardware accelerator will contain a parallelized datapath to speed up these computations. The row-wise matrix-vector multiplication algorithm was proposed to accomplish this parallelization. To handle the cell calculations of LSTM, a serial implementation that requires minimal hardware resources was proposed. Then, memory utilization strategies were discussed for input, parameter, hidden state, and cell state data. While limited in its availability, on-chip memory should be used as much as possible to maximize computational throughput. A sizable on-chip parameter cache was determined to be the greatest factor in enabling full parallelism in the datapath.

In Chapter 6, a digital hardware architecture for accelerating inference in binary-weight

LSTM was presented. The accelerator, called Binary Recurrent Unit (BRU), is targeted for FPGA implementation. In a complete embedded system context, BRU is meant to act as a co-processor to a CPU, which offloads inference computations through shared DDR3 memory. The architecture specification contains concrete subsystems which perform the

LSTM inference computations, but it also includes various design parameters, such as the parallelism of the datapath and the size of on-chip memory units, which should be set according to the resources available on the target FPGA device.

To test the effectiveness of the BRU architecture, it was implemented on the pro- grammable logic of a Xilinx Zynq Z7020 device. Chapter 7 detailed the implementation process and the specific design parameters applied for the target device. The BRU im- plementation, clocked at 200 MHz, was used to implement inference for the two binarized

LSTM models presented in Chapter 4. Inference computation time for BRU was then evaluated against the performance of CPU and GPU inference implementations. BRU was shown to outperform CPU by as much as 39× and GPU by as much as 3.8×.

This work demonstrated the feasibility of running LSTM inference in real-time on an embedded system. The BRU architecture could be adapted to a number of different con- sumer applications, such as home voice assistants or mobile platforms, to enable on-device speech recognition in real-time, without the assistance of cloud servers. Besides decreasing server load and bandwidth requirements, moving computations to the local embedded de- vice has a number of benefits to the consumer, including increased user privacy, as well as device usability without an Internet connection. This work contributes one possible solution for enabling deep learning inference in the embedded computing context.

BIBLIOGRAPHY

[1] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” vol. 5, no. 2, pp. 157–166.

[2] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization.” [Online]. Available: http://arxiv.org/abs/1409.2329

[3] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel, L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin, S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao, D. Yogatama, J. Zhan, and Z. Zhu, “Deep speech 2: End-to-end speech recognition in english and mandarin.” [Online]. Available: http://arxiv.org/abs/1512.02595

[4] Theano. [Online]. Available: http://deeplearning.net/software/theano/

[5] TensorFlow. [Online]. Available: https://www.tensorflow.org/

[6] Introduction to TensorFlow lite. [Online]. Available: https://www.tensorflow.org/ mobile/tflite/

[7] BLAS (basic linear algebra subprograms). [Online]. Available: http://www.netlib. org/blas/

[8] William Wong, “A deeper look at deep-learning frameworks,” vol. 64, no. 8, pp. 28–28.

[9] NVIDIA cuDNN. [Online]. Available: https://developer.nvidia.com/cudnn

[10] E. Eshelman. NVIDIA tesla p100 price analysis. [Online]. Available: https: //www.microway.com/hpc-tech-tips/nvidia-tesla-p100-price-analysis/

[11] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA.” [Online]. Available: http://arxiv.org/abs/1612.00694

[12] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 243–254.

[13] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a .” [Online]. Available: http://arxiv.org/abs/1704.04760

[14] P. Hernandez, “Microsoft’s project brainwave tackles real-time AI workloads with FP- GAs,” p. 1.

[15] X. Lei, G.-H. Tu, A. X. Liu, C.-Y. Li, and T. Xie, “The insecurity of home digital voice assistants - amazon alexa as a case study.” [Online]. Available: http://arxiv.org/abs/1712.03327

[16] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.” [Online]. Available: http://arxiv.org/abs/1510.00149

[17] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA ’16. ACM, pp. 26–35. [Online]. Available: http://doi.acm.org/10.1145/2847263.2847265

[18] R. Prabhavalkar, O. Alsharif, A. Bruguier, and L. McGraw, “On the compression of recurrent neural networks with an application to LVCSR acoustic modeling for embed- ded speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5970–5974.

[19] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622.

[20] S. Shin, K. Hwang, and W. Sung, “Fixed-point performance analysis of recurrent neural networks,” vol. 32, no. 4, pp. 158–158. [Online]. Available: http://arxiv.org/abs/1512.01322

[21] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations.” [Online]. Available: http://arxiv.org/abs/1511.00363

[22] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations.” [Online]. Available: http://arxiv.org/abs/1609.07061

[23] L. Hou, Q. Yao, and J. T. Kwok, “Loss-aware binarization of deep networks.” [Online]. Available: http://arxiv.org/abs/1611.01600

[24] A. X. M. Chang, B. Martini, and E. Culurciello, “Recurrent neural networks hardware implementation on FPGA.” [Online]. Available: http://arxiv.org/abs/1511.05552

[25] J. C. Ferreira and J. Fonseca, “An FPGA implementation of a long short-term memory neural network,” in 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–8.

[26] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short- term memory recurrent neural networks,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634.

[27] S. Li, C. Wu, H. Li, B. Li, Y. Wang, and Q. Qiu, “FPGA acceleration of recurrent neural network based language model,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 111–118.

[28] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr, “Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–4.

[29] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An architecture for ultra-low power binary-weight CNN acceleration.” [Online]. Available: http: //arxiv.org/abs/1606.05487

[30] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, “A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks.” [Online]. Available: http://arxiv.org/abs/1702.06392

[31] P. Gysel, M. Motamedi, and S. Ghiasi, “Hardware-oriented approximation of convolutional neural networks.” [Online]. Available: http://arxiv.org/abs/1604.03168

[32] D. D. Lin, S. S. Talathi, and V. S. Annapureddy, “Fixed point quantization of deep convolutional networks.” [Online]. Available: http://arxiv.org/abs/1511.06393

[33] Y. Fu, E. Wu, A. Sirasao, S. Attia, K. Khan, and R. Wittig, “Deep learning with INT8 optimization of Xilinx devices.” [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf

[34] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to FPGAs,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12.

[35] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/

[36] E. De Jong. MNIST sequence data. [Online]. Available: https://edwin-de-jong.github. io/blog/mnist-sequence-data/

[37] J. S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT acoustic-phonetic continuous speech corpus. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93S1

[38] Georgios N. Evangelopoulos, “Efficient hardware mapping of long short-term memory neural networks for automatic speech recognition.”

[39] Virtex-7 FPGA family product table. [Online]. Available: https://www.xilinx.com/ products/silicon-devices/fpga/virtex-7.html#productTable

[40] AMBA AXI4 interface protocol. [Online]. Available: https://www.xilinx.com/ products/intellectual-property/axi.html
