DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Programmable Address Generation Unit for Deep Neural Network Accelerators

MUHAMMAD JAZIB KHAN

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

KTH ROYAL INSTITUTE OF TECHNOLOGY Electrical Engineering and Computer Science

Programmable Address Generation Unit for Deep Neural Network Accelerators

Muhammad Jazib Khan

Master in Electrical Engineering
Supervisor KTH: Yu Yang
Examiner: Prof. Ahmed Hemani
School of Electrical Engineering and Computer Science
Host company: Robert Bosch GmbH
Supervisors Bosch: Sebastian Vogel and Dr. Leonardo Ecco

Abstract

Convolutional Neural Networks are getting more and more popular due to their applications in revolutionary technologies like Autonomous Driving, Biomedical Imaging, and Natural Language Processing. With this increase in adoption, the complexity of the underlying algorithms is also increasing. This trend entails implications for the computation platforms as well, i.e., GPU, FPGA, or ASIC based accelerators, especially for the Address Generation Unit (AGU), which is responsible for the memory access. Existing accelerators typically have Parametrizable Datapath AGUs, which have minimal adaptability towards evolution in algorithms. Hence new hardware is required for new algorithms, which is a very inefficient approach in terms of time, resources, and reusability. In this research, six algorithms with different implications for hardware are evaluated for address generation, and a fully Programmable AGU (PAGU) is presented, which can adapt to these algorithms. These algorithms are Standard, Strided, Dilated, Upsampled, and Padded convolution, and MaxPooling. The proposed AGU architecture is a Very Long Instruction Word based Application Specific Instruction Processor, which has specialized components like hardware counters and zero-overhead loops and a powerful Instruction Set Architecture (ISA) that can model static and dynamic constraints as well as affine and non-affine Address Equations. The target has been to minimize the flexibility vs. area, power, and performance trade-off. For a working test network for Semantic Segmentation, results have shown that PAGU achieves close to the ideal performance of one cycle per address for all the algorithms under consideration except Upsampled Convolution, for which it is 1.7 cycles per address. The area of PAGU is approx. 4.6 times larger than that of the Parametrizable Datapath approach, which is still reasonable considering the high flexibility benefits. The potential of PAGU is not limited to neural network applications but extends to more general digital signal processing areas, which can be explored in the future.

Keywords

Address Generation Unit; Deep Neural Network Accelerators; Very Long Instruction Word; Application Specific Instruction Processor; Hardware-Software Co-design

Sammanfattning

Convolutional Neural Networks are becoming more and more popular due to their applications in revolutionary technologies such as autonomous driving, biomedical imaging, and natural language processing. With this increase in adoption, the complexity of the underlying algorithms is also increasing. This entails implications for the computation platforms as well, i.e., GPU, FPGA, or ASIC based accelerators, especially for the Address Generation Unit (AGU), which is responsible for memory access. Existing accelerators normally have Parametrizable Datapath AGUs, which have very limited adaptability to evolution in algorithms. Therefore, new hardware is required for new algorithms, which is a very inefficient approach in terms of time, resources, and reusability. In this research, six algorithms with different implications for hardware are evaluated for address generation, and a fully programmable AGU (PAGU) is presented, which can adapt to these algorithms. These algorithms are Standard, Strided, Dilated, Upsampled, and Padded convolution, and MaxPooling. The proposed AGU architecture is a Very Long Instruction Word based Application Specific Instruction Processor, which has specialized components such as hardware counters and zero-overhead loops and a powerful Instruction Set Architecture (ISA) that can model static and dynamic constraints as well as affine and non-affine address equations. The goal has been to minimize the trade-off between flexibility and area, power, and performance. For a working test network for semantic segmentation, the results have shown that PAGU achieves close to the ideal performance, 1 cycle per address, for all the algorithms under consideration except Upsampled Convolution, for which it is 1.7 cycles per address. The area of PAGU is approximately 4.6 times larger than that of the Parametrizable Datapath approach, which is still reasonable considering the large flexibility benefits. The potential of PAGU is not limited to neural network applications but also extends to more general digital signal processing areas, which can be explored in the future.

Nyckelord

Address Generation Unit; Deep Neural Network Accelerators; Very Long Instruction Word; Application Specific Instruction Processor; Hardware-Software Co-design

Acknowledgment

This research work has been carried out at the Robert Bosch GmbH Corporate Research Center, Renningen, Germany, in collaboration with KTH Royal Institute of Technology, Stockholm, as the affiliated academic institute. I would like to thank my Examiner, Prof. Ahmed Hemani, and my supervisors, Sebastian Vogel, Leonardo Ecco, and Yu Yang, for providing me constant support on all technical and non-technical hurdles throughout the thesis.

This dissertation is dedicated to my parents and my siblings, M. Muzaffar Khan, A. Moiz Khan, and Maham Khan, who have supported me my whole career and made me who I am today.

“ALLAH DOES NOT BURDEN A SOUL BEYOND THAT IT CAN BEAR”

(Al-Quran; Surah al Baqarah: 286)


Table of Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Goal
  1.5 Methodology
  1.6 Scope
  1.7 Outline
2 Theoretical Framework & Related Work
  2.1 Convolutional Neural Networks
    2.1.1 Classification
    2.1.2 Object Localization and Detection
    2.1.3 Semantic Segmentation
  2.2 Hardware Acceleration of DNNs
    2.2.1 GPU
    2.2.2 FPGA
    2.2.3 ASICs
  2.3 Address Generation Unit
    2.3.1 Algorithmic Components of Address Generation
    2.3.2 Address Equation Classification
      Affine, Piece-wise Affine, Non-Linear
    2.3.3 AGU Classification and Examples
      Lookup Table Based
      Datapath Based – Parametrizable, Programmable
  2.4 Memory Hierarchy
    Temporal / Vector
    Distributed / Dataflow
3 Methodology and Design Approach
  3.1 Assumptions
    3.1.1 Compiler/Static Optimizations
    3.1.2 Target Architecture of Processing Units and Memory Hierarchy
    3.1.3 Memory Layout
  3.2 Algorithm Level Modeling
    3.2.1 Standard Convolution
    3.2.2 Special Cases
      Strided Convolution
      Dilated Convolution
      Up-sampled Convolution
      MaxPooling
      Padded Convolution
    3.2.3 Data Flow / Computational Graphs of Address Equations
  3.3 Atomic Operations
  3.4 Low-Level Modeling / Hardware Architecture
    3.4.1 Instruction Level Parallelism (ILP) in ASIPs
4 Implementation
  4.1 VLIW ASIP
    4.1.1 Micro-Architecture Design
      Register Bank
      Hardware Counters vs. Zero Overhead (Hardware) Loops
      Dedicated Hardwired Registers
      Atomic Pipeline
      Special Arithmetic Operators
      Pipeline Stages
    4.1.2 Instruction Set Architecture (ISA) Design
      Atomic Instruction
        4.1.2.1.1 Single Operation mode
        4.1.2.1.2 Atomic ZOL – Sequential
        4.1.2.1.3 Atomic ZOL – Parallel & Waitfor instruction
      Instruction Slots and Dimensions
      Addressing Modes
  4.2 Programming Model
    4.2.1 Static Loop Constraints – Standard, Strided and Dilated Convolution
    4.2.2 Variable Loop Constraints – Upsampled Convolution and MaxPooling
      Loop Unrolling
      Dynamic Constraint Computation
    4.2.3 Padded Convolution
  4.3 Overall Architecture
  4.4 Assembler
  4.5 Reference Parametrizable Datapath Approach
5 Results and Analysis
  5.1 Systems under Evaluation
    5.1.1 Programmable VLIW Approach
    5.1.2 Parametrizable Datapath Approach
    5.1.3 Previous System in Host Company
    5.1.4 Relevant Architectures in Literature
  5.2 Experimental Setup
  5.3 Use Case Neural Network
  5.4 Evaluation Metrics
    5.4.1 Performance
      Throughput
      Latency
      Critical Path
    5.4.2 Flexibility
      Dynamic Loop Bound Computation
      Affine vs. Non-Affine Address Equation
      Number of Variables in Address Equation
      Upper and Lower Bounds of HW Counters
    5.4.3 Code Size
    5.4.4 Resource Utilization
    5.4.5 Area & Power
  5.5 Limitations
6 Summary and Conclusion
7 Future Research
  7.1 Compiler Design
  7.2 Code Compression
  7.3 Code Generation
  7.4 Further AI Algorithms
  7.5 Conventional Digital Signal Processing Algorithms
References


Table of Figures

Figure 1.1: ImageNet winner network classification accuracy over the years [19]
Figure 2.1: AlexNET Classification Network [85]
Figure 2.2: Example of Semantic Segmentation Network [84]
Figure 2.3: Comparison of GPU, FPGA and ASIC
Figure 2.4: High-level view of AGU with memory and processing units in CNN accelerator
Figure 2.5: Algorithmic example of Address Generation
Figure 2.6: Example of Non-Affine Address Generation
Figure 2.7: Temporal Architecture [36]
Figure 2.8: Spatial Architecture [36]
Figure 3.1: Target architecture for integrating programmable AGU [74]
Figure 3.2: Filter layout
Figure 3.3: Input scratchpad memory layout
Figure 3.4: Dimensions of 3D Input and Filter for Address Generation
Figure 3.5: Pseudo-code for Address Generation
Figure 3.6: Strided Convolution with S=2 downsampling the 7 x 7 image to 3 x 3
Figure 3.7: Dilated convolution with D = 2 [76]
Figure 3.8: Up-sampled convolution with U=2 [76]
Figure 3.9: MaxPooling convolution steps to obtain second output
Figure 3.10: MaxPooling convolution steps to obtain first output
Figure 3.11: Nine sections of input feature map with unique algorithmic properties
Figure 3.12: Pseudo-code for padded convolution (9 branches)
Figure 3.13: Computation graph of address equation for exploitation of parallelism
Figure 3.14: Computation graph of address equation for exploitation of update frequency
Figure 3.15: Computation graph of output address with atomic operations marked
Figure 3.16: Computation graph of weight address with atomic operations marked
Figure 3.17: Computation graph of Input address with atomic operations marked
Figure 3.18: Very Long Instruction Word Pipelined Processor
Figure 3.19: Simple Scalar Pipelined Processor
Figure 3.20: Super Scalar Pipelined Processor
Figure 4.1: Register bank with hardware counters. Hardwired registers colored in red
Figure 4.2: Top-level view of Register Bank
Figure 4.3: Top-level view of the atomic pipeline component
Figure 4.4: Atomic pipeline for Weight address
Figure 4.5: Atomic pipeline for Input address
Figure 4.6: Atomic Pipeline for Output address
Figure 4.7: Code example valid for Standard, Strided and Dilated convolution
Figure 4.8: Example of Upsampled convolution step
Figure 4.9: Timing sequence of Sequential Dynamic Constraint computation
Figure 4.10: Timing sequence of Parallel Dynamic Constraint computation
Figure 4.11: Code example of Parallel Dynamic Constraint computation
Figure 4.12: The overall architecture of Programmable AGU
Figure 4.13: The overall architecture of Reference Parametrizable Datapath AGU
Figure 5.1: Convolution-Deconvolution Network under consideration
Figure 7.1: The hierarchy of algorithms


1 Introduction

1.1 Background

Artificial Intelligence (AI), and more specifically Deep Learning, is considered one of the flag bearers of the fourth Industrial Revolution, which the human race is undergoing today [1]. AI is a classic example of Kurzweil's Law of Accelerating Returns [2], with more and more innovative applications transforming diverse industries faster than expected [3]. One huge area that is experiencing this revolution is Computer Vision. Advancements in industries like Biomedical Imaging [4], Agriculture [5], Environmental and Climate change [6], Advertising [7], and Security [8] owe a great deal to the improvements that Deep Learning has brought to the Computer Vision domain in the last decade. However, arguably the most hyped and anticipated of these innovations is the introduction of Self-Driving cars, powered by Deep Learning, in the ~3.5 trillion dollar [9] Automotive sector. It is projected that the world will see 90 million autonomous vehicles on the roads by 2030 [10]. AI has been around since the 20th century [11], but its true potential has been achieved only very recently, in the last decade. This phenomenon is attributed to three significant factors which served as the enablers for this disruption:

1. Massive access to Data: Three billion images are shared on the internet every day [12]. Three hundred hours of video are uploaded to YouTube every minute [13]. These and other such facts are the results of the immense pervasiveness of the internet and social media, which is a very recent phenomenon. Apart from this, the introduction of the megatrend of IoT also paved the way for the presence of sensors collecting data in ubiquitous devices. Additionally, the general digitization of records, for example, in healthcare, news, business, government, etc., also brought a lot of information to the digital domain. Data storage capacity has increased while its cost has declined following Moore's Law [14]. At the same time, better data mining and management tools, libraries, and platforms such as the R Statistical Environment [15] and Pandas [16] also emerged. This better availability, access, acquisition, communication, storage, and management of data is a luxury that is very specific to the age we are living in. Deep Learning algorithms have an inherent need for a large amount of data for the training stage, making the network ready for the inference stage. Thus, the data became an enabler for the trial and adoption of Deep Learning algorithms in the last decade.

2. Advanced Algorithms: The Convolutional Neural Network (CNN) is the fundamental algorithm that enabled Deep Learning to revolutionize Computer Vision. AlexNet [17] is regarded as the pioneering CNN that won the ImageNet ILSVRC [18] competition in 2012 with a classification error rate of 15.45%. Since then, much has evolved in this domain to capture more and more functionality addressing advanced applications. AlexNet [17], which topped the classification accuracy not long ago in 2012, today stands at 129th rank [19]. The classification errors today fall below 5% [19]. The computer vision algorithms today are not just limited to the classification of images but also capture more advanced methods of object localization, detection, Semantic Segmentation, and Instance Segmentation. This rapid improvement in algorithms, discussed in more detail in Section 2.1, also made the widespread adoption of Deep Learning possible.

Figure 1.1: ImageNet winner network classification accuracy over the years [19]

3. Compute Power: The third pillar responsible for the take-off of Deep Learning is the availability of much more sophisticated hardware resources. The history of Neural Networks is intertwined with the history of hardware. The period of the gradual end of Moore's Law and the shift towards multiprocessor systems serendipitously overlapped with the rise of Deep Learning algorithms and Big Data. This scenario proved to be an ideal breeding ground, since Deep Learning algorithms are parallel in nature and thus work best with parallel computing resources. A wide range of computing platforms with the prime target of exploiting this parallelism then emerged. The broad categories of these platforms, also known as accelerators, are GPU, FPGA, and ASIC, in the order of decreasing flexibility. A detailed description of hardware accelerators is presented in Section 2.2. This hardware acceleration is the focus of this thesis.

1.2 Problem

An integral component of every hardware accelerator is an Address Generation Unit (AGU). The AGU is a dedicated component which is responsible for generating the address patterns required by the processing units to fetch the input and filter weights from the memory buffer or to store the output to the memory buffer after computation. The design and functionality of the AGU highly depend upon the algorithm it implements. The algorithmic landscape of Deep Neural Networks is perpetually evolving. Every year marks the introduction of new algorithms addressing diverse applications. Most of the AGUs that are part of current hardware accelerators are of the Parametrizable Datapath type. This means that they provide minimal flexibility towards changes in the algorithms. Thus, whenever an algorithm with different implications for hardware is required to be implemented, the hardware architecture of the AGU needs to be changed. Although there is some literature on Programmable AGUs, its scope is limited to conventional Digital Signal Processing algorithms and does not discuss Deep Neural Networks at all. Hence there is a need for a fully flexible programmable approach that can cater to any variation in algorithms at the abstraction level of software instead of changing the hardware at the RTL level.

1.3 Purpose

This research aims to serve multiple purposes in the domain of hardware accelerator design for Deep Neural Networks.
1. Firstly, to present a comprehensive survey or case study to analyze and compare various Address Generation Unit architectures in DNN accelerators. This has been done previously for the general digital signal processing domain, such as in [20] and [21], but not for DNNs.
2. After laying this foundation, to identify the gaps explained in the previous section, and then narrow down to a solution in the form of a VLIW ASIP based Programmable AGU.
3. In the end, to support the proposal with results and analysis demonstrating the usefulness of the proposed design.

1.4 Goal

The ultimate goal of this thesis is to deliver a fully flexible Programmable Address Generation Unit that can capture virtually any Convolutional Neural Network algorithm, stands the test of functional correctness, and demonstrates non-functional properties such as performance, area, and power that meet the highly constrained requirements of embedded AI applications like Autonomous Driving.


1.5 Methodology

After establishing the problem clearly in Chapter 2, a hands-on experimental design methodology has been taken, in order to achieve the goal of a flexible AGU, in the following steps:
1. Firstly, six algorithms that are crucial for an end-to-end implementation of a Semantic Segmentation network are mathematically modeled into concrete parameters and equations. These algorithms can be classified into three groups based on differences in their implications towards hardware. The mathematical expressions of these algorithms are then modeled in Python to verify their functionality.
2. A hardware-software codesign based VLIW processor architecture is identified as the starting point for the solution, based on the reasons provided in Chapter 3. To approach its implementation, first, a generic custom-designed VLIW architecture is implemented at RTL level in SystemVerilog, with specifications set with intuitive guesses about the problem and its solution.
3. This starting point is then augmented with hardware components in some respects and chiseled down to remove components in other respects, to end up with a final implementation that copes with the requirements. In the end, we have an architecture with a software-oriented VLIW processor tightly coupled with hardware components such as counters and dedicated arithmetic units.
4. An Instruction Set Architecture, with specialized instructions like the atomic instruction, is designed and implemented in parallel in the processor and in a Python-based assembler.
5. Towards the end, assembly scripts in accordance with the ISA and implementing the algorithmic needs are set up for each layer of a working Semantic Segmentation network, and results are extracted.
6. Quantitative analysis is provided for measurable metrics like performance and area with the help of ModelSim, Synopsys Design Compiler, and Cadence Innovus. Qualitative evaluation is presented for the non-measurable parameters like flexibility.

1.6 Scope

Hardware accelerator design for Deep Neural Networks is a vast topic in general. Hence it is vital to draw a line differentiating what this thesis will and will not address. Although a broader description of the algorithms, as well as the hardware, is presented in Chapter 2 to build a perspective, the design, implementation, and results are strictly limited to the algorithmic and hardware components that relate to address generation. Thus, ideal conditions such as a throughput of one output per cycle are assumed for the functional parts of the accelerator. Convolutional Neural Network algorithms are investigated for the proof of concept; other DNN networks or DSP algorithms are out of the scope of this thesis. In addition to that, the design of the compiler is not in the scope of this thesis, although compilation-level optimizations are assumed to be in play.


1.7 Outline

The thesis starts off by providing a comprehensive theoretical foundation and literature review of related work at the algorithmic and hardware level in Chapter 2, along with the identification of the gap and pointers towards a possible solution. Chapter 3 then attempts to bolster the solution identification process further by presenting the mathematical modeling of the algorithms into concrete parameters and equations, and then by translating this into hardware architecture requirements from a high level. Chapter 4 then delves down to the RTL level and provides specifics of the hardware architecture. This chapter presents two architectures, one as the main pitch and the other as a reference for comparison. This brings us to Chapter 5, which provides comprehensive results along with analysis and discussion on the benefits and tradeoffs. The last two, Chapters 6 and 7, bookend the thesis by providing a conclusion and directions that can be taken in the future to further this research.


2 Theoretical Framework & Related Work

2.1 Convolutional Neural Networks

Convolutional Neural Networks are Deep Neural Networks that contain Convolution Layers. CNNs were first introduced in the computer vision domain in the late 1990s [22]. Nevertheless, their true potential has only been exploited in the last decade, using more and more complex algorithms to increase the accuracy of the outcome in sophisticated applications. These applications are discussed below in the order of increasing complexity. In general, these are very vast topics, so after a brief introduction, only those aspects will be presented which are relevant for the hardware architecture and, more specifically, the memory access.

2.1.1 Classification

Image Classification is the process in which an image is given as an input to the NN, and it identifies the class of an object that is present in the image. To make the network ready for classification, it is trained with labeled data. Figure 2.1 shows what a typical Classification CNN looks like. The input image is 3-dimensional data with x, y, and channel depth dimensions. On each layer of the CNN, 3-dimensional filters/kernels are applied, which operate on the input according to the type of the layer and produce an output, which is the input to the next layer [23].

Figure 2.1: AlexNET Classification Network [85]


Each layer attempts to identify certain nuances in the image which it learns in the training process. For the address generation, the type of layer is very crucial. Typical CNNs for classification contain convolution layers in the beginning, followed by further convolution or pooling layers, and at the end the fully connected layers. The standard convolution operation is always applied to the padded input so that the size of the output in the x and y dimensions remains the same as the input [24]. It is later described, in Section 3.2, how these algorithms are parameterized for the address generation. Figure 2.1, in fact, shows a high-level structure of AlexNet [17]. The network was made up of 5 convolution layers, max-pooling layers, dropout layers, and three fully-connected layers. In the subsequent years, networks with even further improved performance were introduced, e.g., ZFNet [25] with an error rate of 11.2%, VGGNet [26] (7.3%), GoogleNet [27] (6.7%), ResNet [28] (3.6%), etc. The first thing to note here is that as the error rate of the networks decreases, the size of the networks increases, from AlexNet with eight layers to GoogleNet with 22 layers [27]. Thus, any processing speed-up at the unit pixel level would give a tremendous net performance enhancement. Secondly, the variation and complexity of the algorithms are also increasing; for instance, downsampling is done by Pooling (MaxPooling [24], AveragePooling [24]) in some cases and by Strided convolution in other cases, such as in the first layer of ResNet [28] (Stride factor = 2).

2.1.2 Object Localization and Detection

Localization means, in addition to identifying the class of an object, drawing a bounding box around the single object that is classified. The same CNN models used for classification can be adapted for localization just by adding fully connected layers at the end of the network, which treat the prediction of the four coordinates of the bounding box as a regression problem [26]. Object detection means identifying multiple objects in the image and putting a bounding box around each of them. The first research to successfully present this was R-CNN [29]. It employs a selective search methodology by hierarchically grouping small regions to form the final box depending on some similarity metrics. This was later improved to Fast R-CNN [30] and then Faster R-CNN [31], which were much faster than their predecessors.

2.1.3 Semantic Segmentation

Semantic Segmentation is the most fine-grained inference on the image, which means assigning each pixel to its class. Some pioneering methods to achieve this are Fully Convolutional Semantic Segmentation [32] and Convolutional and Deconvolutional Networks [33]. In general, the network for semantic segmentation comprises two networks, as shown in Figure 2.2: first, an encoder network, which is like standard classification networks such as VGG and ResNet, followed by a decoder network that projects the inference onto the pixel space.

Figure 2.2: Example of Semantic Segmentation Network [84]

From the computational point of view, which is, in fact, what is relevant for the address generation, the decoder network contains operations that are unique compared to the classification networks. To assign the inference to each pixel, the network contains Upsampling layers in the second half. Upsampling can also be seen as a fractional stride convolution [32] and is sometimes also called deconvolution. Another prominent example of a network employing Upsampling for segmentation is [34]. From the processing perspective, the Upsampling is generally combined with the subsequent convolution layer.

Dilation is another crucial algorithm that is distinctive to Semantic Segmentation. Dilation is specifically designed for dense prediction, which means labeling each pixel with an inference [35]. Dilation enables the expansion of the receptive field without loss of resolution. Dilation is used to keep the output resolution high and avoid the need for Upsampling [34].

Summarizing the algorithmic landscape of CNNs presented here, it can be concluded that while more and more sophisticated applications are being captured, the size of the NNs and the complexity of the algorithmic computations are also increasing. The computation operations, which started from Standard Convolution, have for functional advancement grown to Padded Convolution, Pooling, Strided Convolution, Dilated Convolution, Upsampled Convolution, and much more. Hence, hardware platforms are required that can adapt to the fast-moving pace of the algorithmic needs. In the later sections, the computation patterns at the pixel level will be the focus for the algorithms mentioned above.

2.2 Hardware Acceleration of DNNs

Hardware acceleration means the usage of specialized hardware considering the processing requirements of the algorithm, which in this case is DNNs. This hardware differs from conventional processors in that it consists of highly parallel architectures which aim to exploit the parallelism in the algorithm. As a trade-off, they are less general-purpose and more domain-specific than the processors. DNN accelerators are majorly GPU, FPGA, ASIC, and sometimes DSP based platforms. One big trend in this area has been the increasing shift from cloud to edge computing, and there are strong reasons for it [36]. Firstly, the increasing concern towards data privacy and confidentiality, especially in applications like health care and finance, has led to more credibility towards localized processing instead of having the vulnerability of sending data to a remote processor and then getting the response back. Secondly, the communication infrastructure itself is not as reliable in all parts of the world as required by these technologies. Lastly, there are safety-critical applications like autonomous driving [10], which have such strict latency requirements that cloud computing simply cannot cope with them. Hence more focus is put on embedded or edge accelerators in this section.

However, this shift comes with a cost. Edge or embedded processing is a much more constrained paradigm as compared to cloud computing. Notably, the power and area budget is meager in edge devices such as phones, autonomous vehicles, and other IoT devices. In the past, this problem used to be tackled by Moore's Law [14] and Dennard Scaling [37]: transistor sizes kept getting smaller while the power density stayed the same, in turn reducing the area and power consumption. However, the mid-2000s marked the end of this serendipitous phenomenon due to short channel effects, subthreshold leakage, and gate-oxide leakage [38]. Consequently, more focus has been channeled towards architectural enhancements instead of just depending on transistor scaling. The architectural overview of the platforms used for DNN processing is provided below.

2.2.1 GPU

A Graphics Processing Unit (GPU) was originally designed only for graphics rendering, but due to a similarly high level of parallelism in DNNs, it serves an excellent purpose in this domain too. GPUs generally consist of massively parallel lightweight processing cores, the number of which ranges in the hundreds and sometimes thousands. Frameworks like Caffe [39], PyCUDA [40], and TensorFlow [41] provide an abstraction to distribute the workload to the cores intuitively. Some differentiating factors among GPUs are the number of cores, memory bandwidth, on-chip and off-chip memory size, and clock speed. In general, GPUs see the DNN operations as matrix multiplications using a Toeplitz Matrix [36]. They employ SIMD (Single Instruction Multiple Data) or SIMT (Single Instruction Multiple Threads) oriented processing of data. Nvidia is arguably the biggest contender in GPU acceleration, followed by AMD and Google [42]. It provides flexible opportunities to program at the CUDA core level [43] [44]. Some famous examples of GPU product lines used for DNN acceleration are Nvidia Titan [45], Nvidia GTX [46], and Nvidia Pascal [47].

2.2.2 FPGA

The perpetually evolving DNN algorithms, as well as hardware architectures, make the reconfigurable platform, particularly Field Programmable Gate Arrays (FPGAs), a viable option. GPUs are more general-purpose with respect to the software programming interface, but FPGAs are more flexible in terms of hardware reconfigurability. Apart from this reconfigurability benefit, FPGAs are generally more power-efficient than GPUs [48]. This is because an FPGA based accelerator can be designed to be more closely coupled with the DNN to avoid excessive logic [49]. At the same time, FPGAs have some drawbacks as well. Firstly, they generally have lower clock speeds than GPUs, and secondly, it is harder to map the network onto an FPGA due to relatively less support from frameworks and the inherent difficulties of hardware programming [48]. To tackle these issues, the leading players in FPGA development, Altera and Xilinx, are introducing more support for DNNs. One such example is the Xilinx reVision stack, which provides development resources for the platform, algorithm, and application development for Deep Learning applications [50]. Some examples of FPGA based DNN accelerators are Microsoft Brainwave [51], [52] [53] [54].

2.2.3 ASICs

Application-Specific Integrated Circuits (ASICs) are the most specialized platform for DNN acceleration, working for a single or a limited range of algorithms. ASIC designs provide the best state-of-the-art results for throughput, area, and energy efficiency [55]. Such custom-designed architectures also provide an opportunity for the co-design of the Neural Network and the hardware architecture, enabling better coupling of techniques like Network Compression, Quantization, and Network Pruning [55]. However, on the other hand, the flexibility towards capturing a range of DNN algorithms is generally poor. Also, it is generally more financially costly to have network-specific chips instead of relatively general-purpose hardware. Some popular examples of ASIC based accelerators are the Google TPU [56], Intel Nervana [57], IBM TrueNorth [58], and [59] [60].

Figure 2.3 summarizes the comparison of GPUs, FPGA, and ASICs with respect to performance, power, and flexibility.

Figure 2.3: Comparison of GPU, FPGA and ASIC

2.3 Address Generation Unit

An Address Generation Unit (AGU) is a dedicated component in hardware accelerators and digital signal processors, which is responsible for generating the addresses that the processing units require for fetching and storing data. This component generally works in parallel to the rest of the functional units. It is commonly present, especially in high throughput applications, so that the processing units do not have to sequentially compute the address of the next data to be fetched, which would compromise throughput. A high-level architectural view of an AGU in an accelerator can be seen in Figure 2.4.

Figure 2.4: High-level view of AGU with memory and processing units in CNN accelerator

A comprehensive research work [61] lays down a framework for formalizing different aspects of an AGU at the algorithmic and hardware level; this framework has been used in Sections 2.3.1, 2.3.2, and 2.3.3 to characterize different kinds of AGUs.

2.3.1 Algorithmic Components of Address Generation

At the algorithmic level, address generation has the components that can be seen in the example code in Figure 2.5.

• Constants: The constants are the parameters that stay the same for one layer of the CNN. From the perspective of CNNs, they are the same as hyperparameters like the dimensions of input, filter, and output, and parameters like the Padding, Upsampling, and Dilation factors, etc.
• Loops: To traverse over the complete data at the level of a single pixel or vector, multiple nestings of loops are used.
• Variables / Loop counters: The variables are the loop counters of the loops that are responsible for traversing over the complete dimensions of the input, filter, and output in the x, y, and channel dimensions.
• Constraints / Loop Bounds: The constraints are the bounds of the variables and are the dimensions of the input, filter, and output tensors. Constraints can be of two types:
  o Static Constraint: When the bound of a variable is a constant or a constant function of the constants/hyperparameters, it is said to be a Static Constraint. Static Constraints can be statically computed and fed to the system once for the complete layer.
  o Dynamic Constraint: When the bound of a variable is a function of at least one of the higher loop variables, it is called a Dynamic Constraint. A dynamic constraint changes its value over the dimensions of the input layer. Hence it is better to compute it dynamically; otherwise, many values would need to be stored.
• Address Equation (AE): The address equation is the core of address generation and is a function of constants, variables, and sometimes constraints.

A specific description of these components for the considered algorithms will be presented in Section 3.2.

Figure 2.5: Algorithmic example of Address Generation
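To make these components concrete, the following is a minimal Python sketch of how they fit together, complementing the example code of Figure 2.5. All constants and the helper f3_dynamic are illustrative values invented for this sketch, not parameters taken from the thesis.

```python
# Minimal sketch of the algorithmic components of address generation.
# Constants, bounds, and the dynamic-constraint helper are illustrative only.

I_BASE, CS, I_X = 0, 4, 16      # constants (hyperparameters, fixed per layer)
F1, F2 = 6, 6                   # static constraints: bounds known per layer

def f3_dynamic(oy):
    """Dynamic constraint: a bound that depends on a higher loop variable."""
    return 3 if oy % 2 == 0 else 2

addresses = []
for oy in range(F1):            # loops with their loop counters (variables)
    for ox in range(F2):
        for fy in range(f3_dynamic(oy)):
            # Address equation: a function of constants and variables
            addresses.append(I_BASE + CS * I_X * oy + CS * ox + CS * I_X * fy)

print(addresses[:8])
```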

2.3.2 Address Equation Classification

Different algorithms have different manifestations towards address generation; thus, the address equation is classified in the following manner in the literature [61].

Affine, Piece-wise Affine, Non-Linear

An address equation is called Affine when it is a linear function of the constants and variables. It is classified as Piece-wise Affine when the equation is a piecewise linear function of the constants and variables. This case happens when there is some conditional behavior in the algorithmic description of the AGU. Lastly, an AE is said to be Non-Linear when it is a nonlinear function of the constants and variables. This scenario is the least prevalent in the CNN domain; only the first two kinds are seen. Examples of Affine and Piece-wise Affine equations will be presented in Section 3.2. An example of non-affine address generation can be seen in Figure 2.6.


Figure 2.6: Example of Non-Affine Address Generation
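The three classes can be illustrated with small Python functions; the constants and the border condition below are made up purely for illustration and do not correspond to any layer discussed in the thesis.

```python
C0, C1, C2 = 100, 16, 4          # illustrative constants

def affine(oy, ox):
    # Linear in the variables oy and ox
    return C0 + C1 * oy + C2 * ox

def piecewise_affine(oy, ox, border=2):
    # Linear, but with a different linear piece under a condition,
    # the kind of behaviour that padded convolution introduces
    if ox < border:
        return C0 + C1 * oy
    return C0 + C1 * oy + C2 * (ox - border)

def non_linear(oy, ox):
    # Contains a product of variables, hence non-linear
    return C0 + C1 * oy * ox

print(affine(1, 1), piecewise_affine(1, 1), non_linear(1, 1))
```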

2.3.3 AGU Classification and Examples

Now that the components of address generation are established, let us delve deeper into what kinds of AGUs are present in existing research and what needs to be improved. While there is sufficient literature available investigating AGUs for conventional multimedia and signal processing algorithms, there is no single source which provides a comprehensive analysis of the AGUs in hardware accelerators for the Deep Learning or CNN domain, although there are some papers that present overall accelerator architectures and discuss address generation in a brief section. A survey of such architectures, optimizing certain aspects of an AGU, is as follows:

Lookup Table Based

This type of AGU is employed when the address patterns are very simple in nature and the address space is small. The complete set of addresses is saved in memory, and either a simple increment/decrement or FSM based logic is used to alternate between the saved addresses. This precomputation of addresses at compile time is also known as loop unrolling. This lookup table based address generation must not be confused with lookup table based loop bound/constraint computation, which can occur in the following datapath based category as well. It does not make sense to have dynamic bound computation in lookup table based address generation.

Some examples of lookup table based AGUs are [62] [63], but as suggested by [21], loop unrolling gives better performance yet is not scalable due to high memory and communication requirements. In addition to that, these architectures lack flexibility, which is disadvantageous considering the fast evolution of algorithms, as explained in Section 3.2.


Datapath Based – Parametrizable, Programmable

The other, more prevalent style is the Datapath (FSMD) oriented approach. The Datapath implements the actual address equation while an FSM controls it. The FSM can be driven by control signals or by a clearly defined ISA. This approach is again sub-divided into two implementation styles.

• Parametrizable AGU

A parametrizable AGU means that there is a static Datapath that implements a particular address equation and that can be fed with different values of constants and variables. Hence, the hyperparameters, variables, and bounds are changeable, but the address equation itself is static. This kind of AGU caters to a range of predefined algorithms but is not expected to accommodate new algorithms. A recent DNN accelerator [64] contains a parametrizable AGU, which was found very useful for the presented architecture. Instead of modeling the nested loops of address generation with a conventional FSM modeling approach, it implements them with five hardware loops, or in simple terms cascaded registers with upper bounds, enable/disable signals, and max flags. This provides some flexibility, but there are some drawbacks. The first is that the bounds are statically computed, and there is no support for dynamic computation of constraints. Secondly, although it does have a small ISA, it is still not fully programmable because the address equation is fixed to a linear equation with five variables. Thirdly, it does not have lower bound registers for the counters, which are critical for implementing algorithms like MaxPooling and Tiling. In addition to that, it is incapable of handling any non-affine address equations.

Another example of a parametrizable AGU is [65]. This research provides a design automation tool which generates RTL for an FPGA customized to a specified NN model. Apart from the accelerator, the tool produces FSMD based AGUs which only generate predetermined patterns of addresses with parametrizable options like base address, layer size, etc. Some other examples of parametrizable AGUs are [66] [67] [68] [69].

• Programmable AGU

A programmable AGU (PAGU) is the most generic implementation style of AGU. In a Programmable AGU, all algorithmic components of address generation discussed in Section 2.3.1 are modifiable. This is achieved by designing a processor-like architecture with registers, generic Arithmetic Logic Units, and some form of an Instruction Set Architecture to provide programmability. This implementation style is the prime focus of this research.

In the programmable style of implementation, the research in [21] has been found very useful. This research, apart from providing a very comprehensive analysis of the available styles of AGUs, presents an AGU architecture. It targets AGUs for parallel distributed coarse-grained architectures and presents a technique for localized address generation with loop bound computation. The AGU can also cater for non-affine address generation and dynamic loop bound computation. While this AGU shows some very promising results, there are still some arguable aspects. Firstly, while presenting the related work, it rightly highlights the problems of VLIW based approaches, i.e., the FSM entanglement problem and the lowering of the parallelism potential of the functional unit, but this case is only valid if the address generation resources are shared with the functional resources of the VLIW. The research is silent on the case where the VLIW itself is a dedicated AGU working in parallel with the functional units, which, as later explained, is the core of this thesis. Secondly, if a centralized processing architecture is considered as opposed to a distributed one, the tradeoffs around the static computation of the loop bounds diminish, because one central memory can now hold the bounds, as opposed to communication and storage in distributed memories. Thirdly, the research is restricted to conventional DSP algorithms like FFT, matrix multiplication, and 2D convolution, while this thesis is focused on DNNs. Some other examples of programmable AGUs are [70] [71] [72] [73], which pose similar problems with affine/non-affine equations, the extent of programmability, or static/dynamic loop bound computation.

2.4 Memory Hierarchy

The memory hierarchy is the organization of memory with respect to the processing units. The pattern of the data accesses performed by the processing engines for a certain algorithm directly depends upon the memory hierarchy of the architecture. Although similar address generation concepts can be extended to the off-chip memory, in the case of high throughput applications like DNN accelerators, data is always first loaded from the off-chip memory to scratchpad oriented on-chip buffers to have fast access. Based on the hierarchy of the on-chip memories, the DNN accelerator architectures are classified into two broad categories, as described in [36]:

Temporal / Vector

In Temporal, also known as Vector processing architectures, such as [74], there are one or multiple central buffers which hold the input, weight, and output, and there are no localized memories associated with the individual computation units in the processing array. So, the computation units do not store data or communicate data among other units. The development of an AGU for this kind of architecture is the focus of this thesis. A representation of such an architecture can be seen in Figure 2.7.

Distributed / Dataflow

Dataflow or distributed architectures, such as [75], have processing units with local memory associated with them. Thus, the data can be stored and communicated amongst the units in the processing array. Such an architecture is also called a systolic architecture, as it resembles how the heart pumps the blood in the human body. Although similar address generation concepts can be extended to dataflow architectures, they pose additional constraints which are out of scope for this thesis. A representation of such an architecture can be seen in Figure 2.8.

Figure 2.7: Temporal Architecture [36]
Figure 2.8: Spatial Architecture [36]


3 Methodology and Design Approach

3.1 Assumptions

3.1.1 Compiler/Static Optimizations

Although compiler design or automatic code generation is not in the scope of this thesis, static optimizations like constant propagation and constant folding are assumed to be in play. So, for example, if the address equation contains a multiplication between two parameters that remain constant per layer of the neural network, this computation is not considered while designing the hardware/processor.

3.1.2 Target Architecture of Processing Units and Memory Hierarchy

The design of an AGU, both at the algorithmic and hardware level, depends upon the architecture of the processing engines as well as the memory hierarchy. For this research, a vector processing architecture is considered as presented in [74]. The architecture has separate scratchpad memory-based on-chip buffers for input, weight, and output, which can all be accessed in parallel. There is a two-dimensional array of processing engines that are fed with vectors of data with the help of an address generation unit, and the output is stored in the output buffer.

Figure 3.1: Target architecture for integrating programmable AGU [74]


3.1.3 Memory Layout

The algorithmic description of the access pattern for the input, weight, and output directly depends upon how all these three quantities are present or are to be stored in the on-chip buffer, i.e., the scratchpad memory. Hence an assumption about the memory layout must be made before actually deriving the address equations and loop bound equations. The memory layout that is considered is quite intuitive and straightforward and is shown in Figure 3.3 and Figure 3.2. For all three kinds of data, the data is present in memory first channel-wise, then in the x dimension, and then in the y dimension, starting from the top left corner. The access pattern for all the algorithms also has the same dimensional order. Considering a convolution algorithm, this sort of memory layout entails a pretty straightforward access pattern for the filter data, as the memory is accessed in consecutive order. For the input data, it is a bit trickier, because for processing one convolution output there is a jump in memory access after every FILTER_X positions in the input memory. All three data types can be in the same or different on-chip buffers, in which case a memory offset parameter is required to locate the starting point of the required data in memory.

Figure 3.2: Filter scratchpad memory layout
Figure 3.3: Input scratchpad memory layout
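The assumed layout can be summarized by a small Python sketch that maps a (y, x, channel-step) coordinate to a linear buffer address; the function name and the example dimensions are illustrative, not taken from the thesis.

```python
# Sketch of the assumed scratchpad layout: channel-step-wise first,
# then along x, then along y, starting at a base offset.

def linear_address(base, y, x, step, dim_x, channel_steps):
    """Word address of element (y, x, step) under the assumed layout."""
    return base + channel_steps * dim_x * y + channel_steps * x + step

# Filter data is read consecutively, while the input jumps by one image row
# (CHANNEL_STEPS * INPUT_X) after every FILTER_X positions of a window.
INPUT_X, CHANNEL_STEPS, I_BASE = 8, 2, 0
row0 = [linear_address(I_BASE, 0, x, s, INPUT_X, CHANNEL_STEPS)
        for x in range(3) for s in range(CHANNEL_STEPS)]
row1 = [linear_address(I_BASE, 1, x, s, INPUT_X, CHANNEL_STEPS)
        for x in range(3) for s in range(CHANNEL_STEPS)]
print(row0, row1)   # note the jump between the two rows
```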

3.2 Algorithm Level Modeling

The functional implementation, as well as the desired nonfunctional properties associated with it, depends highly upon the high-level modeling of the address generation. The low-level implementation can only be as effective and powerful as the high-level modeling. As discussed in Section 2.3.1, the algorithmic components of address generation that need to be known are the constants, variables, loop bounds/constraints, and the address equations for the input, weight, and output addresses. The selection of the use case algorithms is based upon the following criteria:


1. As one of the core innovations to be presented by the proposed architecture is its high flexibility to adapt to a range of algorithms, a diverse set of algorithms must be considered to demonstrate unique and challenging implications for hardware that conventional systems fail to capture.
2. The considered algorithms must have widespread adoption in applications of Convolutional Neural Networks, proving their usefulness and impact. This criterion is supported by Section 2.1, where the use of each of the below-mentioned algorithms in real Convolutional Neural Networks is explained.

The following is the algorithm level description of the use cases.

3.2.1 Standard Convolution

The standard convolution operation of a three-dimensional input tensor and a three-dimensional filter tensor can be parametrized as depicted in Figure 3.4.

Figure 3.4: Dimensions of 3D Input and Filter for Address Generation

FILTER_X and FILTER_Y are the dimensions of the filter in the x and y dimensions. INPUT_X and INPUT_Y are the dimensions of the input in the x and y dimensions. In the channel or depth dimension, a new parameter has been introduced, named CHANNEL_STEPS, which has great significance in the address generation arithmetic. The individual feature maps or channels are grouped to form what is here called a Channel Step.

CHANNEL_STEPS = Total Input Channels / Depth of one Channel Step

The depth of a channel step depends on the width of the memory that is required to be addressed/accessed/stored or, in certain cases, the width of the data that the processing engines of the accelerator can handle at a time. The general form of convolution at the algorithmic level can be represented by five nested loops with the address equations inside the nestings, as shown in Figure 3.5. The innermost loop is in the depth or channel step dimension; above that are the x and y dimensions within the filter space, and then the x and y dimensions of the output, because one filter location produces one output. So, changing output x and y effectively moves the whole 3D filter to a new location. Another thing to note here is that the input and weight address equations must be computed in every iteration, as they are in the innermost loop, while the output equation must be computed only when the output x or y changes, which happens after a number of iterations equal to the product of the bounds of the inner three loops. However, it will be seen later that computing all three every iteration is more convenient in a flexible system.

Figure 3.5: Pseudo-code for Address Generation

Address Equations:

Input Address = I_BASE + CS * I_X * o_y + CS * o_x + CS * I_X * f_y + CS * f_x + step
Weight Address = F_BASE + CS * F_X * f_y + CS * f_x + step
Output Address = O_BASE + O_X * o_y + o_x

Here I_BASE, F_BASE, and O_BASE are the memory offsets where the input, filter, or output data starts in the memory. The lowercase letters are the variables, i.e., the counters of the loops, and the capital letters are the constants, which in this case are the input, filter, and output dimensions (CS = CHANNEL_STEPS, I_X = INPUT_X, F_X = FILTER_X, O_X = OUTPUT_X).

Loop Bounds/Constraints: The loop bounds in standard convolution are constant for the entire Neural Network layer; hence they can be statically computed and need not be left to the hardware. The equations that drive these static computations are as follows:

f1 = OUTPUT_Y = INPUT_Y − FILTER_Y + 1
f2 = OUTPUT_X = INPUT_X − FILTER_X + 1
f3 = FILTER_Y
f4 = FILTER_X
f5 = CHANNEL_STEPS
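The five-loop structure and the equations above can be collected into a short Python reference model; it is a sketch that mirrors the pseudo-code of Figure 3.5, with illustrative function and parameter names.

```python
# Python reference model of address generation for standard convolution,
# following the address equations and static loop bounds given above.

def standard_conv_addresses(I_BASE, F_BASE, O_BASE, IX, IY, FX, FY, CS):
    OY, OX = IY - FY + 1, IX - FX + 1            # f1, f2 (static)
    for oy in range(OY):                         # f1
        for ox in range(OX):                     # f2
            for fy in range(FY):                 # f3
                for fx in range(FX):             # f4
                    for step in range(CS):       # f5
                        in_addr = I_BASE + CS*IX*oy + CS*ox + CS*IX*fy + CS*fx + step
                        w_addr = F_BASE + CS*FX*fy + CS*fx + step
                        out_addr = O_BASE + OX*oy + ox
                        yield in_addr, w_addr, out_addr

# Example: 5 x 5 input, 3 x 3 filter, 2 channel steps
for addrs in list(standard_conv_addresses(0, 0, 0, 5, 5, 3, 3, 2))[:4]:
    print(addrs)
```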

3.2.2 Special Cases

Apart from the standard convolution operation, Neural Networks, especially in the domain of Semantic Segmentation, also employ more complex variants of convolution. All these cases follow the same algorithmic structure with five nested loops, as shown in Figure 3.5. The differences lie in the addition of some constants, in modified constant/variable loop bounds, and in the address equations, which are discussed case by case as follows:

Strided Convolution

Strided convolution is one of the methods used as a downsampling layer. The filter, instead of moving a single output position at a time, convolves in jumps with a factor S. Figure 3.6 shows an example where this factor is 2.

Figure 3.6: Strided Convolution with S=2 downsampling the 7 x 7 image to 3 x 3

Address Equations: The equations for the weight and output addresses are exactly the same as for standard convolution. For the input address, a scaling factor S is included with the output x and y variables.

Input Address = I_BASE + S * CS * I_X * o_y + S * CS * o_x + CS * I_X * f_y + CS * f_x + step

Loop Bounds/Constraints: f3, f4, and f5 are precisely the same as for standard convolution. The static computation functions of f1 and f2 are modified as follows:

f1 = OUTPUT_Y = ⌊(INPUT_Y − FILTER_Y) / S⌋ + 1
f2 = OUTPUT_X = ⌊(INPUT_X − FILTER_X) / S⌋ + 1
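As a sketch, only the input address and the output-dimension bounds change relative to the standard-convolution model above; the function names are illustrative.

```python
def strided_bounds(IX, IY, FX, FY, S):
    return (IY - FY) // S + 1, (IX - FX) // S + 1      # f1, f2

def strided_input_address(I_BASE, CS, IX, oy, ox, fy, fx, step, S):
    return I_BASE + S*CS*IX*oy + S*CS*ox + CS*IX*fy + CS*fx + step

print(strided_bounds(7, 7, 3, 3, 2))   # 7 x 7 input, 3 x 3 filter, S=2 -> (3, 3)
```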

Dilated Convolution

Dilated convolution operates by expanding the filter in the x and y dimensions, creating gaps in between and hence increasing the receptive field of the filter by a factor. As an example, the dilation factor D is two in Figure 3.7.


Figure 3.7: Dilated convolution with D = 2 [76]

Address Equations: Dilation adds a scaling factor to the filter x and y variables in the input address equation. The weight and output equations are again the same as for standard convolution.

Input Address = I_BASE + CS * I_X * o_y + CS * o_x + D * CS * I_X * f_y + D * CS * f_x + step

Loop Bounds/Constraints: f3, f4, and f5 are again the same as for standard convolution. The static computation functions of f1 and f2 are modified as follows:

f1 = OUTPUT_Y = ⌊(INPUT_Y − FILTER_Y + 1) / D⌋
f2 = OUTPUT_X = ⌊(INPUT_X − FILTER_X + 1) / D⌋
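A corresponding sketch for the dilated case, again with illustrative names, simply moves the scaling factor from the output variables to the filter variables:

```python
def dilated_input_address(I_BASE, CS, IX, oy, ox, fy, fx, step, D):
    return I_BASE + CS*IX*oy + CS*ox + D*CS*IX*fy + D*CS*fx + step

def dilated_bounds(IX, IY, FX, FY, D):
    return (IY - FY + 1) // D, (IX - FX + 1) // D      # f1, f2 as given above
```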

Up-sampled Convolution

Upsampling is always followed by a convolution layer in a Convolutional Neural Network [32]. Upsampling does not consume any functional processing elements. Additionally, it is not desirable to have the intermediate zeros in the case of zero-filled Upsampling, or the repeated elements in the case of nearest-neighbour Upsampling. Hence it is useful to combine Upsampling with the subsequent convolution operation and fetch only the non-zero or unique input elements during the convolution, as shown in Figure 3.8.


Figure 3.8: Up-sampled convolution with U=2 [76]

Address Equations: The address equations for weight and output are the same as for standard convolution. For the input address, the equation contains ceiling divisions of the output x and y variables by the Upsampling factor, which is assumed to be a power of two in the later sections to make it feasible for hardware:

Input Address = I_BASE + CS * I_X * ⌈o_y / U⌉ + CS * ⌈o_x / U⌉ + CS * I_X * f_y + CS * f_x + step

Loop Bounds/Constraints: The loop bounds for output x, output y, and step are statically computable functions. On the other hand, the bounds of the filter x and y loops have a unique property, as they are variable functions of the higher loop variables output x and y. Thus, the relatively non-conventional operators of modulo and ceiling division are added to capture this recursive behavior in the loop bound computation. These variable loop bound functions create extraordinary implications for hardware, which will be discussed in later sections.

f1 = OUTPUT_Y = INPUT_Y * UPSAMPLING − FILTER_Y + 1
f2 = OUTPUT_X = INPUT_X * UPSAMPLING − FILTER_X + 1

f3 = ⌈((o_y % U) + F_Y) / U⌉ - ⌈(o_y % U) / U⌉
f4 = ⌈((o_x % U) + F_X) / U⌉ - ⌈(o_x % U) / U⌉
f5 = CHANNEL_STEPS
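As an illustration of the variable bounds, the following Python sketch (a direct transcription of the equations above, not the thesis prototype) generates the input addresses for upsampled convolution, recomputing f3 and f4 inside the output loops.

import math

# Symbols as in the equations: U (upsampling factor), F_X/F_Y (filter dims),
# I_X/I_Y (input width/height), C_S (channel-step size).
def upsampled_input_addresses(I_BASE, C_S, I_X, I_Y, F_X, F_Y, U, CHANNEL_STEPS):
    OUTPUT_Y = I_Y * U - F_Y + 1          # f1 (static)
    OUTPUT_X = I_X * U - F_X + 1          # f2 (static)
    addresses = []
    for o_y in range(OUTPUT_Y):
        # f3: variable bound, recomputed from the outer-loop variable o_y
        f3 = math.ceil(((o_y % U) + F_Y) / U) - math.ceil((o_y % U) / U)
        for o_x in range(OUTPUT_X):
            f4 = math.ceil(((o_x % U) + F_X) / U) - math.ceil((o_x % U) / U)
            for f_y in range(f3):
                for f_x in range(f4):
                    for step in range(CHANNEL_STEPS):
                        addr = (I_BASE
                                + C_S * I_X * math.ceil(o_y / U)
                                + C_S * math.ceil(o_x / U)
                                + C_S * I_X * f_y
                                + C_S * f_x
                                + step)
                        addresses.append(addr)
    return addresses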


MaxPooling

MaxPooling2D with some stride and size value is generally similar to the standard convolution in terms of address generation, because the AGU is only responsible for fetching and storing data; the operation itself, whether a multiply-accumulate in the case of convolution or a comparison for the highest value in the case of MaxPooling, is not the concern of the AGU. In terms of the address pattern, it differs from standard convolution because the filter moves over a small tile of the input tensor and then advances to the next tile, as shown in Figure 3.10 and Figure 3.9. These figures represent a MaxPooling with stride = size = 2.

Figure 3.10: MaxPooling convolution steps to obtain first output
Figure 3.9: MaxPooling convolution steps to obtain second output

Address Equations: The address equations for input, weight, and output are all precisely the same as for standard convolution.

Loop Bound/Constraints: MaxPooling differs from convolution in this aspect because its output x and y loops require not only an upper bound but also a lower bound, from which the loop must restart once it hits the upper bound. So again, like Upsampling, we have a variable loop bound computation function which cannot be resolved statically.
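The following Python sketch illustrates one way to read this tiled traversal (an interpretation for clarity, not the thesis RTL or prototype): the x/y counters run between an advancing lower and upper bound, assuming pool size = stride = 2 and a row-major single-channel input.

# Illustrative tile traversal for MaxPooling: both bounds of the x/y counters
# advance by the stride once a tile has been fully visited.
def maxpool_tile_addresses(I_X=8, I_Y=8, size=2, stride=2, I_BASE=0):
    addresses = []
    for low_y in range(0, I_Y - size + 1, stride):      # tile lower bound in y
        for low_x in range(0, I_X - size + 1, stride):  # tile lower bound in x
            tile = []
            # Within a tile the counters run from the lower to the upper bound.
            for y in range(low_y, low_y + size):
                for x in range(low_x, low_x + size):
                    tile.append(I_BASE + I_X * y + x)
            addresses.append(tile)
    return addresses

# 16 tiles of 4 addresses each for an 8x8 input with 2x2 / stride-2 pooling.
print(len(maxpool_tile_addresses()), len(maxpool_tile_addresses()[0]))  # 16 4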

Padded Convolution

Padded convolution is another complicated case because it exhibits different behavior on the edges and corners than in the middle part of the input. Two approaches have been proposed to cater to this discontinuous behavior.


Section-wise AE and Loop Bounds: The first approach is to divide the whole image into nine sections and derive address equations and loop bounds for all the sections separately. The nine sections can be seen in Figure 3.11.

Figure 3.11: Nine sections of input feature map with unique algorithmic properties

A high-level code example with two branches, resolving the address equations and loop bounds for the top left corner and the top edge, can be seen in Figure 3.12.

Figure 3.12: Pseudo-code for padded convolution (9 branches)

This is an example of a piece-wise affine address equation. The problem with this approach is that it has too many branches to handle, which would degrade the throughput, in addition to the variable loop bounds for the filter loops in the branches that correspond to the edges and corners.


Transformation Function

The other method proposed for padded convolution uses a transformation function that takes the addresses computed over the padded dimensions and maps them to actual memory locations.

The function in case of Padding factor 1 is:

New_Input_Address = Input_Address - I_X - 2 * (Input_Address // I_X) + 1

For any arbitrary Padding factor the function is:

New_Address = Address - Padding * INPUT_X - 2 * Padding * (Address // INPUT_X) + 2 * Padding^2

The benefit of this approach is that it resolves the padding without the use of branches and conditions. In addition to that, this enables zero skipping in terms of memory so that we do not need to have padded zeros in the memory. The disadvantage is that there is no zero skipping in terms of throughput, and the system wastes a cycle on a padded zero.
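As a quick sanity check of the transformation, the following Python snippet (illustrative only) verifies the padding-1 formula against directly computed unpadded addresses, under the assumption that I_X here denotes the padded image width.

# Map an address in the padded image (width I_X, padding 1) to the address in
# the unpadded image (width I_X - 2) stored without the zeros.
def unpadded_address(padded_address, I_X):
    # New_Input_Address = Input_Address - I_X - 2 * (Input_Address // I_X) + 1
    return padded_address - I_X - 2 * (padded_address // I_X) + 1

I_X = 6                       # padded width (unpadded width 4, padding 1)
for row in range(1, 5):       # interior rows of the padded image
    for col in range(1, 5):   # interior columns
        padded = row * I_X + col
        expected = (row - 1) * (I_X - 2) + (col - 1)   # direct unpadded address
        assert unpadded_address(padded, I_X) == expected
print("transformation verified for all interior positions")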

3.2.3 Data Flow / Computational Graphs of Address Equations

Summarizing the algorithmic description of all the discussed algorithms, it can be stated that the address equation for input, weight, and output is always a linear or piece-wise linear equation of five loop variables with statically computable coefficients. These equations operate inside a loop nest whose loop bound functions can be either static or variable. Hence, the input address equation can be written as:

Input Address = s + c4 * fx + c3 * fy + c2 * ox + c1 * oy + c0

Here the constant coefficients c0 to c4 are statically computable functions of the dimensions of input, weight, and output and/or the parameters associated with the algorithms. The rest are the loop variables. Starting from left to right, the variables are arranged in decreasing order of update frequency: the leftmost variable is the loop counter of the innermost loop, the rightmost variable is the loop counter of the outermost loop, and the rest follow this pattern. This also implies that any operation associated with the leftmost variable must be computed every cycle if a throughput of one output per cycle is required. Moving to the right, the number of cycles within which the operations associated with a variable must be computed is equal to the product of the bounds of the loops nested inside it, i.e., the bounds of the variables to its left. This can be written as:

Upper bound of cycles operation can take = ∏ (Bounds of variables nested in the loop variable that the operation is associated with)
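The following Python sketch shows my reading of how the per-algorithm input-address equations from Section 3.2.2 map onto the generic coefficients c0 to c4 (illustrative only; S is the stride and D the dilation factor):

# Generic form: Input Address = s + c4*fx + c3*fy + c2*ox + c1*oy + c0
# C_S is the channel-step size, I_X the input width, I_BASE the base address.
def input_coefficients(algorithm, C_S, I_X, I_BASE, S=1, D=1):
    if algorithm == "standard":
        return dict(c0=I_BASE, c1=C_S * I_X, c2=C_S, c3=C_S * I_X, c4=C_S)
    if algorithm == "strided":
        return dict(c0=I_BASE, c1=S * C_S * I_X, c2=S * C_S, c3=C_S * I_X, c4=C_S)
    if algorithm == "dilated":
        return dict(c0=I_BASE, c1=C_S * I_X, c2=C_S, c3=D * C_S * I_X, c4=D * C_S)
    raise ValueError(algorithm)

def input_address(coeffs, s, fx, fy, ox, oy):
    # s is the step (innermost, fastest) variable; oy is the outermost, slowest.
    return (s + coeffs["c4"] * fx + coeffs["c3"] * fy
              + coeffs["c2"] * ox + coeffs["c1"] * oy + coeffs["c0"])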


The computational graph of the address equation can be drawn in the two ways shown in Figure 3.13 and Figure 3.14. In the first figure, the computations are arranged for maximum parallelism among the operations, hence taking fewer cycles. In the second figure, the operations are ordered such that the multiply-and-add operations associated with a particular variable inherit that variable's update frequency. Hence, if this graph is sliced vertically at any point, the computations divide into a high-frequency and a low-frequency group of operations.

Figure 3.13: Computation graph of address equation for exploitation of parallelism
Figure 3.14: Computation graph of address equation for exploitation of update frequency

3.3 Atomic Operations

If the whole computation graph of the address equation is implemented in hardware, there is no flexibility to adapt it to any variations. On the other hand, if the whole computation graph is computed in software, the throughput of the system will be very low because all the operations will be done sequentially. So, a middle ground must be established in the form of a hardware-software co-design that fulfills both criteria. It can be seen in Figure 3.14 that the left-most operations must be computed more frequently than the rightmost operations. Hence the operations associated with the variables s, fx, and fy are offloaded to hardware, and this group of operations is called the atomic operations. The other operations are left for software because there is more time to compute them. This partitioning can be seen in Figure 3.17. A similar approach has been taken for the weight address and output address, which can be seen in Figure 3.16 and Figure 3.15.
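A minimal sketch of this partitioning (an illustration of the idea, not the hardware) separates the equation into a non-atomic part, recomputed once per output in software, and an atomic part combined with it every cycle in hardware:

def non_atomic_part(c0, c1, c2, ox, oy):
    # software: evaluated once per (ox, oy), i.e. once per convolution output
    return c0 + c1 * oy + c2 * ox

def atomic_part(non_atomic, c3, c4, fy, fx, s):
    # hardware (Atomic Pipeline): evaluated every cycle
    return non_atomic + c3 * fy + c4 * fx + s

# For a 3x3 filter and one channel step, the software has nine cycles per
# output to refresh the non-atomic result while the hardware emits one
# address per cycle.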


Figure 3.17: Computation graph of input address with atomic operations marked
Figure 3.16: Computation graph of weight address with atomic operations marked

Figure 3.15: Computation graph of output address with atomic operations marked

3.4 Low-Level Modeling / Hardware Architecture

Considering the computations required for address generation, coupled with the target throughput of three addresses per cycle (one each for input, weight, and output data), it can be deduced that apart from the hardware parallelism, high parallelism is also required in the software part of the system, which is discussed below.

3.4.1 Instruction Level Parallelism (ILP) in ASIPs

The instruction-level parallelism in processors can be achieved in three ways: Simple Scalar pipelining, Super Scalar pipelining, and Very Long Instruction Word based pipelining. True ILP is, in fact, found in the latter two kinds of architecture because, in those, processor components are duplicated to process several instructions in parallel. A Super Scalar processor has dynamic dependency checks to parallelize the instructions at run time. A VLIW processor, in contrast, needs the parallelization to be done statically and takes already parallelized instructions. Super Scalar is relevant in applications where some kind of non-determinism is involved, for example in the form of interrupts, which VLIW cannot handle. However, the problem at hand involves the processing of data with completely deterministic behavior, as all the dimensions and parameters are known to the system before computation. Hence a VLIW processor is a sufficient system to handle the address generation processing. Now that the basic sketch of the approach has been laid down, the next section will delve deeper into the specifics of the solution.

Figure 3.19: Simple Scalar Pipelined Processor
Figure 3.18: Very Long Instruction Word Pipelined Processor

Figure 3.20: Super Scalar Pipelined Processor



4 Implementation

4.1 VLIW ASIP

A Very Long Instruction Word Application Specific Instruction Processor [77] is a processor architecture with true Instruction Level Parallelism. The following sections discuss the Micro-Architecture Design and Instruction Set Architecture Design of the processor. Since the overall architecture is a standard two-slot VLIW processor, only the specialized parts will be discussed in depth here.

4.1.1 Micro-Architecture Design

Register Bank

The register bank of PAGU is one of the core components and has specialized elements for address generation. It has the standard inputs and outputs of a register bank, comprising read ports, write ports, read/write indexes, and write enable. In addition, the register bank has counter registers, counter upper and lower bound registers, and dedicated registers hardwired to the output, which are discussed below:

Hardware Counters vs. Zero Overhead (Hardware) Loops

As seen in the algorithmic level modeling, multiple nested loops drive the address generation. If these loops were implemented with the standard branching mechanism, the throughput would be highly degraded. Apart from the conventional method, there are two ways in which the loops can be realized in the processor. One method is to employ Zero Overhead Loops (ZOL) [78]. This kind of loop takes, as register values, the number of iterations and the number of subsequent instructions to loop over, and iterates without any overhead. There are several problems with this approach in this scenario. Firstly, a ZOL in a VLIW processor is bound to loop over the whole instruction word, since the instruction pointer accesses the instruction as a whole and not the slots individually; thus, there is no way to decouple the looping among the slots. The second issue is that the looping requirement in address generation is just maintaining a variable that increments on every iteration, so a ZOL is excessive for such an implementation. The other option is hardware counters, which are the ideal fit. To model the five nested loops in the address generation algorithm, five hardware counter registers are included in the design with upper and lower bound registers. They are cascaded in such a way that when one counter overflows, it resets to its lower bound and advances the next counter by one. The first counter in the cascade is advanced by one every clock cycle when a specific input signal, called i_decrement in this architecture, is high. The five counters are hardwired to the output to make the variable values available to the Atomic Pipeline unit.
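The following behavioral Python model (an illustration, not the SystemVerilog RTL) captures the described cascade: each counter runs between its lower and upper bound registers, and an overflow resets it and carries into the next counter.

class CascadedCounters:
    def __init__(self, lower, upper):
        self.lower = list(lower)          # lower-bound registers
        self.upper = list(upper)          # upper-bound registers
        self.value = list(lower)          # counter registers (start at lower bound)

    def tick(self, i_decrement=True):
        """One clock cycle; advance the cascade when i_decrement is high."""
        if not i_decrement:
            return
        for i in range(len(self.value)):
            if self.value[i] < self.upper[i]:
                self.value[i] += 1        # no overflow: stop the carry here
                return
            self.value[i] = self.lower[i] # overflow: reset and carry upward

# Example: step, fx, fy, ox, oy for a 3x3 filter, 1 channel step, 3x3 outputs.
ctr = CascadedCounters(lower=[0, 0, 0, 0, 0], upper=[0, 2, 2, 2, 2])
for _ in range(5):
    print(ctr.value)
    ctr.tick()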

Dedicated Hardwired Registers

Now that the variables are incorporated, the next concern is the coefficients that multiply these variables in the address equation. For this, registers r4 to r8 are dedicated to the coefficients and are hardwired to the output. Three other registers, r1 to r3, are dedicated to the non-atomic computation results and are hardwired to the register bank output. The top-level view and the layout can be seen in the following figures. The output hardwired registers are outlined in red in the layout.

Figure 4.1: Register bank with hardware counters. Hardwired registers colored in red


Figure 4.2: Top-level view of Register Bank

Atomic Pipeline

The Atomic Pipeline is the component responsible for the computations inside the Atomic instruction. It takes hardwired inputs from the register bank for the counter, coefficient, and non-atomic result values and combines them in the form of a linear polynomial equation. This computation is itself pipelined, and this pipeline is independent of the processor pipeline. This component generates three addresses: the input, weight, and output address. The top-level diagram and pipelined computational graphs can be seen below.

Figure 4.3: Top-level view of the atomic pipeline component

Figure 4.5: Atomic pipeline for Input address
Figure 4.4: Atomic pipeline for Weight address

Figure 4.6: Atomic Pipeline for Output address

Special Arithmetic Operators

In the ALU of the VLIW ASIP, there are some special operators that are required for the address generation, especially for the Upsampling case. One is the modulo operator. This operation is very expensive in hardware if implemented for arbitrary values, but when the second operand is a power of two, the modulo simplifies to an AND of the first operand with the decremented second operand. The second special operator is the ceiling division. This, again, is implemented only for a second operand that is a power of two, using a right-shift, for the same reason.
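The two shortcuts can be written out as follows (a small Python illustration; the add-before-shift in the ceiling division is one standard realization and is assumed here rather than taken from the thesis):

def mod_pow2(x, U):
    return x & (U - 1)            # valid only when U is a power of two

def ceil_div_pow2(x, U):
    k = U.bit_length() - 1        # U = 2**k
    return (x + U - 1) >> k       # add (U - 1), then shift right by k

assert all(mod_pow2(x, 8) == x % 8 for x in range(50))
assert all(ceil_div_pow2(x, 4) == -(-x // 4) for x in range(50))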

Pipeline Stages

The ASIP has three pipeline stages: Instruction Fetch (IF), Instruction Decode (ID), and Instruction Execute + Writeback (IE). Since no memory access is required for the address generation, there is no memory stage at the moment. However, this can be extended to MIPS-like five stages if need be.

4.1.2 Instruction Set Architecture (ISA) Design

Atomic Instruction

The Atomic instruction is the interface for the programmer to execute the address generation. When the atomic instruction reaches the IE stage, it drives the i_decrement input of the register bank high. This advances the cascaded counters by one. These counters are hardwired to the Atomic Pipeline unit, which computes the addresses and outputs them. The atomic instruction has two register fields, based on which it can operate in three modes.


4.1.2.1.1 Single Operation mode

When both the registers in the atomic instruction are r0, which is a hardwired zero value, the decrement of the counters in the register bank is done only once. This mode is useful in use cases of Static Loop Constraint, which will be discussed in Section 4.2.1.

4.1.2.1.2 Atomic ZOL – Sequential

When the first register in the atomic instruction is a register other than r0 and the second register is r0, this instruction works like a one-instruction Zero Overhead Loop. This means the pipeline stalls, and the Atomic instruction is executed the number of times defined by the value of the first register. The process is sequential because no other instruction can be executed in parallel, and the pipeline stalls for that number of cycles. This mode of operation was devised for the Dynamic Constraint Computation use cases. It gives functionally correct results but performs worse than the approach presented in the next section; hence its use is not recommended.

4.1.2.1.3 Atomic ZOL – Parallel & Waitfor instruction

When the first register of the atomic instruction is r0 and the second is a register other than r0, the atomic instruction executes the number of times defined by that register value, in parallel with the rest of the system. This mode can also be called a co-processing mode because it de-couples the atomic operations from the processor pipeline, and both work in parallel for the given number of cycles. Since this does not stall the pipeline, it requires a Waitfor instruction in most (though not all) cases for synchronization. If, after this instruction, the instruction pointer should at some point in the code wait for the Atomic ZOL – Parallel to finish, then a Waitfor statement is placed at that position in the code. This is very useful for implementing the Dynamic Constraint Computation use cases, which will be discussed further in Section 4.2.2.2.
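To see why the parallel mode pays off, the following rough cycle-count model (my simplification, not a simulation of the actual processor) contrasts the two modes for a sequence of outputs with varying receptive-field sizes and a fixed constraint/non-atomic computation cost:

def sequential_cycles(bounds, calc):
    # Sequential mode: constraint computation and atomic looping add up
    return sum(calc + b for b in bounds)

def parallel_cycles(bounds, calc):
    # Parallel mode: atomics run as a co-process; Waitfor only pays for
    # whichever of the two takes longer per output
    return sum(max(calc, b) for b in bounds)

bounds = [4, 2, 2, 1] * 8          # repetitive receptive-field sizes (U = 2)
print(sequential_cycles(bounds, calc=5), parallel_cycles(bounds, calc=5))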

Instruction Slots and Dimensions

The VLIW has two slots that make up one instruction and are processed in parallel. The branch instruction is fixed to the first slot, as it looks for a condition flag set by the previous instruction in the same slot. Each slot is composed of 20 bits; hence the complete instruction is 40 bits wide.

Addressing Modes The standard immediate and register direct addressing modes are available for the instructions.

4.2 Programming Model

The address generation of the use case algorithms can, from the implementation perspective, be divided into three major classes that are discussed below.

4.2.1 Static Loop Constraints – Standard, Strided and Dilated Convolution

All five loops for address generation of the Standard, Strided, and Dilated Convolution have constant bound functions. Hence these functions are computed at compile time, and only the constant value is provided in the assembly. A code example of this class of algorithms is shown in Figure 4.7. The code can be divided into four blocks based on the function each performs. The first block is the Register Bank Initialization Block, in which the counter registers, bound registers, and coefficient registers are given the required values in immediate mode. The second block computes the non-atomic part of the address equation for the first address. The third and fourth blocks are executed in parallel in separate slots: in the second slot, the addresses for the current convolution output are generated every cycle/instruction, while in the first slot, the non-atomic part of the equation is computed for the next convolution output. In this specific example, because the filter size is 3x3 and there are no Channel Steps, there are 9 occurrences of the atomic instruction. This number varies if the filter size or the number of channel steps is changed; the rest of the code structure remains the same. This code produces the addresses for one channel. For generating addresses for all the output channels, software looping with branch instructions should be used.

Figure 4.7: Code example valid for Standard, Strided and Dilated convolution

4.2.2 Variable Loop Constraints – Upsampled Convolution and MaxPooling

Algorithms like Upsampled Convolution and MaxPooling require a different approach from the previously discussed model. Both of these algorithms have variable loop bounds. This means that we do not know beforehand how many occurrences of the atomic instruction are needed for one convolution output. To solve this problem, two approaches are presented below.

Loop Unrolling

The loop unrolling approach means that the values of the varying loop bounds are all given beforehand in the assembly code as immediate values. This method is typically considered impractical [21] because of the huge number of bounds, but it is quite feasible for Upsampled convolution: in Upsampling, the variable loop bounds take values that are repetitive in nature, so only a very basic repetitive set of values needs to be given in assembly, and the code is then looped over to traverse the whole input. For example, in Figure 4.8, when the filter convolves over the input, the number of inputs in the receptive field is 4,2,4,2,4,2… in the first row and 2,1,2,1,2,1… in the second row, and this repeats for the subsequent rows. Hence the values are repetitive, and when programming such an example, only the fundamental values 4,2,2,1 must be provided in code, which is then looped over.

Figure 4.8: Example of Upsampled convolution step.
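The repetitive pattern can be checked directly from the bound functions of Section 3.2.2; the small Python snippet below (illustrative only) reproduces the 4,2,4,2… / 2,1,2,1… sequence for U = 2 and a 3x3 filter:

import math

U, F = 2, 3
def bound(o):
    # f3 / f4 for one coordinate
    return math.ceil(((o % U) + F) / U) - math.ceil((o % U) / U)

for o_y in range(2):
    sizes = [bound(o_y) * bound(o_x) for o_x in range(6)]
    print(f"output row {o_y}: {sizes}")
# output row 0: [4, 2, 4, 2, 4, 2]
# output row 1: [2, 1, 2, 1, 2, 1]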

The advantage of an unrolling approach is that it gives better throughput in most cases. Its disadvantage is that the code size increases as the Upsampling factor increases.

Dynamic Constraint Computation

Dynamic constraint computation means computing the loop bound formula at runtime and then running the loop up to the value obtained by the computation. This can be done in two forms:


Sequential

In Sequential Dynamic Constraint Computation, the loop bound is computed first, and only then is the loop run up to that bound. A one-instruction Zero Overhead Loop is integrated into the Atomic instruction to avoid the looping overhead. The timing behavior of this approach is shown in Figure 4.9.

Figure 4.9: Timing sequence of Sequential Dynamic Constraint computation

Parallel

In Parallel Dynamic Constraint Computation, the loop bounds are computed in parallel with the loop: while the current iteration of the loop is being executed, the next loop bound is computed. To implement this, the Atomic ZOL Parallel mode of the atomic instruction is employed. The atomic instruction takes the loop bound value from a register and executes in parallel with the subsequent instructions. The timing behavior can be seen in Figure 4.10, and the code example in Figure 4.11.

Figure 4.10: Timing sequence of Parallel Dynamic Constraint computation


Figure 4.11: Code example of Parallel Dynamic Constraint computation

4.2.3 Padded Convolution For the padded convolution, an additional hardware component is coupled with the VLIW implementation, which implements the transformation equation discussed in Section 3.2.2.5.

4.3 Overall Architecture

Combining all the previous sections, the overall architecture of the VLIW ASIP looks as in Figure 4.12.


Figure 4.12: The overall architecture of Programmable AGU

4.4 Assembler

A Python-based assembler to convert the VLIW assembly code to machine code was provided by the host company; its development is not in the scope of this thesis.

4.5 Reference Parametrizable Datapath Approach

After the algorithmic level modeling, it was found that even the Parametrizable Datapath based approach has a lot of room for improvement. Hence, to have a valid reference architecture to compare with the programmable design, an implementation of the parametrizable approach has also been proposed. The architecture at its core contains five cascaded counters for capturing the behavior of the nested loops in the algorithm, similar to the counters in the Register Bank of the VLIW ASIP. These counters are bounded by upper and lower bound registers, which are updated by external inputs. In addition to these bounds, the system takes five coefficients as inputs. The address equation and the loop bound equation are hardcoded; hence the system is parametrizable but not programmable. The advantages of this design are better throughput, area, and power. The disadvantage is the poor flexibility towards any variation in the algorithms. Another disadvantage is its poor critical path, because the loop bound equations are not pipeline-able in this architecture: these equations form a closed loop, and it is impossible to pipeline them without compromising the throughput. This architecture can be seen in Figure 4.13.


Figure 4.13: The overall architecture of Reference Parametrizable Datapath AGU



5 Results and Analysis

5.1 Systems under Evaluation

The following are the systems that are tested with a working NN, analyzed, and compared against specific metrics. The first two are the new approaches presented in this thesis. The third is the one that was used as the starting point at the host company, and the last ones are example architectures from the literature that exhibit similar functionality. It is not possible to compare all five systems for all metrics due to the limited availability of relevant data.

5.1.1 Programmable VLIW Approach

The VLIW ASIP based AGU presented in this thesis is the central focus. It is referred to as the Programmable VLIW AGU in this chapter.

5.1.2 Parametrizable Datapath Approach

The parametrizable approach is presented in this thesis as a reference architecture for comparison in Section 4.5. It is referred to in this chapter as the Parametrizable AGU.

5.1.3 Previous System in Host Company

This is also a parametrizable FSM-with-Datapath approach, which already existed in the accelerator under test at the host company. Not all specifics can be presented here for confidentiality reasons, but the main difference between this approach and the parametrizable approach presented in this thesis is that this one is incapable of dynamic loop bound computation. It is referred to as the Company AGU.

5.1.4 Relevant Architectures in Literature

Two architectures are referred to from the literature, one being RACCU [21] and the other NTX AGU [64]. Features of both have been discussed in Section 2.3.


5.2 Experimental Setup

As the first step, all six algorithms were modeled in Python for address generation. This prototypical implementation was verified against the equations and dry runs. In the next step, an end-to-end implementation of both the VLIW processor-based programmable AGU and the parametrizable AGU was carried out at RTL level in SystemVerilog. The host company provided a Python-based assembler for the VLIW processor. Both RTL implementations were then simulated in ModelSim, and assembly codes were written for the algorithms. In the first stage of the simulation, the functional correctness of the addresses generated for the six algorithms was verified against the output from the Python prototype. In the second stage, the results for non-functional properties, such as throughput and latency, were extracted. Lastly, for ASIC-based preliminary results for the area and critical path, the RTL of the processor was synthesized with Synopsys Design Compiler and Cadence Innovus.

5.3 Use Case Neural Network

To evaluate the performance of the proposed architectures, a working neural network used for Semantic Segmentation in self-driving cars is taken as a test case. The exact specifics of the network are not included here for confidentiality reasons, but in general, it is a Convolution-Deconvolution style NN with 19 layers. It takes a video input of 256 x 512 x 3 resolution and outputs a feature map of 256 x 512 resolution with each pixel labeled with a specific class of object. In total, there are 14 convolution layers, with 2 MaxPooling layers in the first half and 2 Upsampled convolution layers in the second half. This network is referred to as the Dilated Model.

Figure 5.1: Convolution-Deconvolution Network under consideration


5.4 Evaluation Metrics

5.4.1 Performance

Performance is the primary non-functional property due to which hardware accelerators for Deep Learning came into being in the first place. Even before analyzing flexibility, it is important to establish that the performance is at least within the range of acceptability for all algorithms under consideration. Processor-like architectures are highly likely to implement a wide range of algorithms in a functionally correct fashion, but do not necessarily cope with the non-functional metrics; this is generally balanced by directing the hardware/software partitioning towards optimality. The following are the results and analysis of the three components of performance, assuming that the processing part of the accelerator acts ideally:

Throughput

Since the processing of DNNs is a highly deterministic process, specific numbers can be determined for the overheads encountered for each kind of algorithm, and the throughput can be evaluated from that. To evaluate this overhead, it is intuitive to start from the highest granularity and move to the lowest, i.e., layer, channel, output address, input/weight address. Table 5.1 shows the overheads incurred, along with the source of each overhead, when convolution layer addresses are generated with the Programmable AGU. These values are the same for the Standard, Strided, and Dilated Convolution layers.

Layer Type: Standard, Strided or Dilated Convolution
  Per Layer Overhead: 13 cycles (HW counter register initialization, coefficient initialization, first address non-atomic computation, SW counter initialization)
  Per Output Channel Overhead: 3 cycles (SW loop counter decrement: 1, branch: 1, delay slot: 1)
  Per Output Address Overhead: 0 cycles
  Per Input/Weight Address: 1 cycle (cycle consumed in address generation)

Table 5.1: Overheads for address generation of Standard, Strided or Dilated Convolution Layer with Programmable VLIW AGU

Table 5.2 shows the overheads incurred when MaxPooling layer addresses are generated with the Programmable AGU. The Per Layer overhead and Per Output Channel overhead are the same as for Standard Convolution in Table 5.1; hence they are not shown in this table.

Layer Type: MaxPooling2D
  Per Output Address Overhead: 8 cycles (HW counter update: 3, non-atomic part: 3, branch + delay slot: 2)
  Per Input/Weight Address: 1 cycle (cycle consumed in address generation)

Table 5.2: Overheads for address generation of MaxPooling with Programmable VLIW AGU

Table 5.3 shows the overheads incurred when Upsampled Convolution layer addresses are generated with the Programmable AGU, with loop unrolling, sequential dynamic constraint computation, and parallel dynamic constraint computation. The Per Layer overhead is similar to the Standard convolution, except that the two hardware counter initializations with variable bounds and the first-address non-atomic computation are counted in the Per Output Address overhead and not in the Per Layer overhead; the Per Layer overhead is therefore lower here, i.e., 9. The Per Output Channel overhead is the same as for Standard Convolution in Table 5.1.

Layer Type: Upsampled Convolution (Per Input/Weight Address: 1 cycle, consumed in address generation)
  Static Constraint Computation or Loop Unrolling: Per Output Address Overhead of 5 cycles (non-atomic part computation: 3, update of the constraint with an immediate value: 2)
  Dynamic Constraint Computation, Atomic ZOL Sequential: Per Output Address Overhead of 13 cycles (non-atomic part computation + dynamic constraint computation + branch)
  Dynamic Constraint Computation, Atomic ZOL Parallel + WaitAtomic: Per Output Address Overhead varies case to case, max(constraint and non-atomic computation, addresses under receptive field); the overhead is parallelized over the atomic computations and differs for outputs with varying receptive fields

Table 5.3: Overheads for address generation of Upsampled Convolution with Programmable VLIW AGU

Table 5.4 combines the numbers in the above tables to give an idea of the layer-wise throughput for the Programmable, Parametrizable, and Company AGU approaches. The number represents the inverse of the throughput, i.e., average cycles per address generated, which is more intuitive here. The per-layer overhead is ignored because it is negligible compared to the cycles spent in processing the layer. It can be seen that for Standard, Strided, and Dilated Convolution, the average cycles per address are ideal, i.e., 1 per address. For the MaxPooling layers, this is again close to ideal for the Programmable approach (not precisely ideal due to a minor overhead after every output address) and exactly ideal for the Parametrizable and Company AGU approaches. Upsampled convolution is the most interesting part: the Parametrizable approach behaves ideally because of no overhead per output address, the Programmable AGU follows due to some overhead per output address, and the Company AGU, although itself a parametrizable approach, stands last because it does not do the dynamic loop bound computation that suits the Upsampled convolution algorithm.

Layer Type                                 Programmable VLIW AGU   Parametrizable AGU   Company AGU
Standard, Strided or Dilated Convolution   1                       1                    1
MaxPooling                                 1.0005                  1                    1
Upsampled Convolution                      1.7                     1                    3.5

Table 5.4: Layer-wise average cycles/address comparison for specifications of Dilated Model

The tables above show the generalized characteristics of address generation in Programmable AGU for each kind of layer. Now extending this to the Dilated Model network, Table 5.5 shows the inverse throughput comparison for the complete dilated model. Here all three approaches are close to ideal because the Dilated Model network is dominated by convolution layers, in which case, all three behave ideally.


Neural Network           Programmable VLIW AGU   Parametrizable AGU   Company AGU
Complete Dilated Model   1.08                    1.000001             1.3

Table 5.5: Average cycles/address comparison for complete Dilated Model processing

Latency

Latency is another performance metric, especially important for real-time and time-critical applications. Lower latency means better response time and reactiveness to any stimulus from sensors; applications like autonomous driving require the AI accelerator to be ultra-low latency. The latency of a digital system between input and output is determined by its pipeline stages. Generally, a latency estimate would also include the functional components of an accelerator apart from the AGU, but here only the latency for generating a single address with an isolated AGU is presented. Latency in time units can then be determined by multiplying the critical path, or the operating clock period, with the latency in number of cycles. Applying this to the proposed Programmable VLIW AGU, the processor part of the AGU has three pipeline stages followed by the three stages of the Atomic Pipeline unit. For a realistic estimate, the additional impact of the programming aspects should also be counted. Taking the example of Standard Convolution, Section 4.2 suggests that about nine VLIW instructions are required for the initialization of counters and coefficients before the first Atomic instruction, which marks the first address generation. Hence the total latency in this example becomes 15. In the case of the Parametrizable AGU, there are two architectural pipeline stages. The cycles spent initializing the counters and coefficients depend on whether this is done sequentially or in parallel: sequentially it takes ten cycles, otherwise a single cycle is required. Table 5.6 summarizes the architectural latency of both systems.

Evaluation Metric: Latency (cycles)
  Programmable VLIW AGU: VLIW pipeline stages + Atomic pipeline stages = 6; initialization example = 9
  Parametrizable AGU: pipeline stages = 2; initializations = 1 or 10 cycles

Table 5.6: Latency comparison of Programmable VLIW AGU and Parametrizable AGU


Critical Path

The third performance metric, equally crucial for any digital system, is the critical path. It determines the minimum allowable clock period and thus the clock frequency of the accelerator, and it mainly depends upon the pipeline-ability of the architecture. This is the metric where the Programmable VLIW AGU stands as the clear winner compared to the Parametrizable AGU: the Programmable AGU can be pipelined down to the fundamental level both in the VLIW processor part and in the Atomic pipeline part.

On the other side, the Parametrizable AGU presented in Figure 4.13 has an inherent architectural limitation when it comes to pipelining. Similar to the Programmable AGU, the address equation computation part can be pipelined to the fundamental level, but the arithmetic units implementing the loop bound computation equation cannot be pipelined. This is because these computations form a closed loop: the loop bounds are associated with the counters, which trigger the cascaded counters downstream, and these cascaded counters are in turn variables in the loop bound computation equation. Hence, if the loop bound computation equation were pipelined, it would cause a delayed update of the bound, resulting in a mismatch in the synchronization of the cascaded counters. The critical path is therefore the complete loop bound computation equation, which, for example, in the case of Upsampled Convolution, is four arithmetic computations long. Preliminary results have shown that for a 22nm technology node, the critical path for the Programmable VLIW AGU comes out to be approximately 30 ns as compared to 60 ns in the case of the Parametrizable AGU.

5.4.2 Flexibility

Flexibility is the key competitive advantage of the Programmable AGU, but it can only be demonstrated and evaluated qualitatively. The goal of the proposed Programmable VLIW AGU is to act as a general-purpose or, at the very least, domain-specific platform on which a broad range of current and future algorithms can be implemented with only software modifications. One straightforward indication is that the proposed architecture stands the test of six diverse algorithms, implementing them correctly and with reasonable performance metrics, as already demonstrated. More formal criteria against which flexibility can be assessed are discussed below.

Dynamic Loop Bound Computation

Tackling variable loop bound computation is one of the features demonstrating flexibility in which the Programmable VLIW AGU stands out. For algorithms like Standard, Strided, and Dilated Convolution, the loop bounds are constant functions of constants that are static for a layer; for such algorithms, static loop bound computation at compile time is sufficient. In algorithms like Upsampled Convolution and MaxPooling, however, some of the inner loops have bounds that are a variable function of the variables/loop counters of the outer loops.


There are two ways to tackle such cases: Loop Unrolling or Dynamic Constraint Computation. Loop unrolling entails a large code size and thus a large memory, which is a luxury that embedded accelerators in particular cannot afford. Hence the approach of Dynamic Constraint Computation is implemented in the Programmable VLIW AGU. This is done with the special instructions Atomic-ZOL-Parallel and WaitAtomic, which de-couple the counter advancement, and thus the address generation, from the main pipeline of the VLIW; the main processor pipeline is then free to compute the next loop bound (and non-atomic part) in parallel with the rest of the system. This gives benefits in terms of throughput, which is discussed in Section 5.4.1.1. The reference Parametrizable AGU also has dynamic loop bound computation support, but this computation is carried out with a hardwired loop bound equation, as opposed to the Programmable VLIW AGU, which does it with the VLIW ALUs. Thus, only one description of the loop bound equation is supported at a time, and any change in algorithm requires a change in hardware, which is evidently an expensive approach in terms of time, cost, and ease of integration. The Company AGU, as well as the NTX AGU [64] from the literature, does not support dynamic constraint computation at all. The other literature example, RACCU [21], does support dynamic constraint computation but does not provide any evidence of applicability to Deep Neural Network algorithms.

Affine vs. Non-Affine Address Equation

Affine vs. non-affine address equation support is another facet in the light of which flexibility can be evaluated. Although none of the CNN algorithms considered require non-affine address equation computation, a system that claims to be a platform capturing a broad range of current and future algorithms should have some support for such computation; hence a brief description of this possibility is presented here. The Programmable VLIW AGU implements non-affine computations by using the coefficients of the equation: the outer loop variable that multiplies with an inner loop variable (forming a non-affine term) is multiplied with its coefficient beforehand and stored in the corresponding dedicated coefficient register that is hardwired to the Atomic pipeline component. This means that such a variable must have a lower update frequency than the innermost loop variable, because the innermost loop variable updates every cycle; a multiplication would then be required every cycle, which would block the VLIW slot and could not be implemented without compromising throughput. The variables beyond the innermost loop can be handled in the non-affine scenario in the described manner. The reference Parametrizable AGU, as well as the NTX AGU [64], does not support non-affine address equations at all. RACCU [21] does have support for non-affine address equation computation, but again it is adapted towards conventional DSP algorithms.


Number of Variables in Address Equation

Conventional image processing used to consider images as two-dimensional data. With the advent of DNNs, the number of dimensions in which data is processed has grown [36]: apart from the x and y dimensions, the data grows in the depth dimension in the form of Channels deeper in the layers of the DNN. To generate addresses traversing such data, the Programmable VLIW AGU assigns the highest frequency hardware counter (representing the innermost loop) to the Channel Step dimension variable, followed by the counters for the filter x and y dimensions and the output x and y dimensions, in that order. Apart from these five hardware counters, the Programmable VLIW AGU has the potential to accommodate further, higher loop variables if needed by future algorithms. This can be done by handling those variables with conventional SW looping. The higher loop variables are, by design, of lower update frequency; that is why HW counters are not required to implement higher variables without compromising throughput. The Parametrizable AGU, Company AGU, NTX AGU [64], and RACCU [21] all support a fixed set of address equation variables with no support for additional variables.

Upper and Lower Bounds of HW Counters

A relatively simple but powerful aspect is the implementation of a lower bound register in addition to the upper bound register, holding the value of the lower limit of the counters. This is useful in algorithms where the address space over which the filter convolves must be confined to a small patch of data, with this patch then advanced, such as in MaxPooling, explained in Section 3.2.2.4. Both the Programmable VLIW AGU and the Parametrizable AGU have hardware counters with upper as well as lower bounds implemented as fields of the register bank, with hardware handling the counter behavior with respect to the bounds. The Company AGU, NTX AGU [64], and RACCU [21] do not have such extensive support.

5.4.3 Code Size

Code size directly impacts the size of the required instruction memory and, thus, the area of the design. Hence it is critical especially for embedded applications where the memory and area budget is limited. VLIW processors have an inherent tendency towards larger code compared to Simple Scalar processors due to the potential presence of No Operation instructions. The Parametrizable AGU approach might also require code to configure its registers, but that has the same implications as for any standard host co-processor and is not discussed here. The Programmable AGU can have either a variable or a fixed code size depending on the type of algorithm it is implementing. For Standard, Strided, or Dilated Convolution, the code size is variable. After the first 13 instructions for counter configuration and the first non-atomic computation, and the three instructions for software looping, the number of instructions containing the atomic slot is variable. The number of these instructions depends upon the filter dimensions, i.e., FILTER_X, FILTER_Y, and CHANNEL_STEPS. For example, for a 3x3 filter size and one channel step, the number of atomic instructions is 3x3x1, which is 9. For higher dimensions, the number

can be reduced below the full product by employing a SW loop in between the atomic instructions. For example, for filter dimensions 3x3x4, the number of atomic instructions does not need to be 36; instead, it is 12 when SW looping is employed. MaxPooling follows a similar trend of variable code size depending on the filter dimensions as well as the MaxPooling size and stride. Upsampled Convolution exhibits different code sizes depending on the implementation style. If Upsampled convolution is implemented with loop unrolling, it has a variable code size, which depends on the filter dimensions as well as the Upsampling factor: the larger the filter dimensions, the larger the code size, and the larger the Upsampling factor, the larger the code size, because the smallest block of repetitive instructions that needs to be looped over becomes larger, as can be further understood from Section 4.2. When Upsampled convolution is implemented with the Atomic single-instruction Zero Overhead Loop, in both the Sequential and Parallel cases, the code size is a fixed number regardless of the Upsampling factor or filter dimensions, because hardware takes care of looping the atomic instruction, taking the number of repetitions as an argument. The ZOL Parallel approach drastically reduces the code size compared to the loop unrolling approach. For example, for filter dimensions of 3x3x4 and an Upsampling factor of 2, the loop-unrolled code has 52 instructions, while the Atomic ZOL Parallel always has 24 instructions. The above description of the code size is summarized in Table 5.7.

Standard, Strided or Dilated Convolution: 13 + x + 3 VLIW instructions. Variable number of VLIWs depending upon FILTER_X, FILTER_Y, CHANNEL_STEPS; e.g., 28 for (3,3,4).
MaxPooling: 13 + x + 3 VLIW instructions. Variable number of VLIWs depending upon FILTER_X, FILTER_Y, CHANNEL_STEPS, MAXPOOLING_FACTOR; e.g., 45 for (3,3,4,2).
Upsampled Convolution, Loop Unrolling: 9 + x + 3 VLIW instructions. Variable number of VLIWs depending upon FILTER_X, FILTER_Y, CHANNEL_STEPS, UPSAMPLING_FACTOR; e.g., 52 for (3,3,4,2).
Upsampled Convolution, Atomic ZOL Sequential: 9 + 1 + 13 + 3 = 26 VLIW instructions (fixed): HW counter and coefficient initialization: 9; Atomic ZOL: 1; dynamic constraint and non-atomic part computation: 13; SW loop: 3.
Upsampled Convolution, Atomic ZOL Parallel: 11 + 1 + 10 + 2 = 24 VLIW instructions (fixed): HW counters + first NA: 11; Atomic ZOL: 1; NA + dynamic constraint + WaitAtomic: 10; SW loop + NA result: 2.

Table 5.7: Layer-wise code size comparison for Programmable VLIW AGU

5.4.4 Resource Utilization

Resource utilization is not as relevant for Simple Scalar processors as it is for VLIW processors. Resource utilization means how much of the time the processing units are actually in operation throughout the complete processing. In VLIWs, the introduction of No Operation slots is the source of degraded resource utilization. Hence, for VLIWs, the resource utilization is calculated as the ratio of operational instruction slots to the total number of instruction slots (NOPs + operational slots) throughout the execution of the algorithm.

To calculate the resource utilization of the Programmable VLIW AGU, refer back to Section 4.2, where the programming model is discussed. The possibility of a NOP only arises after the atomic instruction because, in the preceding code block, there is no dependency hazard in the counter initializations or the first non-atomic computation; for that section, the resource utilization is 100%. Considering the Standard convolution case of Figure 4.7, after the first atomic instruction, the program starts calculating the next non-atomic result for the following output, while in the second slot, the addresses are being generated for the current convolution output. The number of addresses required for one convolution output depends on the dimensions of the filter. For example, for a 3x3 filter, nine addresses are required for one convolution output, and thus nine slots are free for the non-atomic computation. The non-atomic result computation requires five operations. Adding one jump instruction, there is a total of 6 operational instructions that involve the main pipeline components, while the total number of available slots is 18. Hence the resource utilization for this block of code is 6/18, which is 33%. This degrades further as the receptive field of the filter increases.

For the scenario of Upsampled convolution implemented with Atomic-ZOL-Parallel in Figure 4.11, the situation is much better for two reasons. Firstly, there are additional computations for the dynamic bound computation on top of the non-atomic result computation, so these operations take up the previous NOPs. Secondly, the second slot is now freed up by the use of the Atomic ZOL Parallel and can be used for useful operations. The resulting resource utilization is approximately 87%. Again, this degrades as the receptive field increases. The resource utilization of the Parametrizable AGU is 100%, since there is no possibility of a no-operation. This is one of the trade-offs of the Programmable approach compared to the Parametrizable approach.


5.4.5 Area & Power

Although physical mapping has not been the main focus of this research, some preliminary results for area have been extracted, from which power can also be extrapolated. The Programmable VLIW AGU and the Parametrizable AGU were synthesized with a 22nm technology node; the approximate area of both designs can be seen in Table 5.8. As expected, the area of the Programmable approach is about 4.6 times higher than the Parametrizable approach. Although a power analysis is out of the scope of this thesis, power generally follows the same trend as area. This is where the trade-off of the programmable approach lies. Hence, whether the Programmable VLIW AGU is suitable depends on the area and power budget of the accelerator in which programmability is desired. If the overall processing fabric of the accelerator is much larger than the AGU, then it does make sense to integrate a Programmable VLIW AGU; otherwise not. For the same reason, it does not seem practical to integrate Programmable VLIW AGUs in distributed architectures where localized address generation is required, because having a large number of copies of the Programmable AGU would increase the area and power by a considerable amount.

Evaluation Metric   Programmable VLIW AGU   Parametrizable AGU
Area (µm2)          14000                   3000

Table 5.8: Area comparison of Programmable VLIW AGU and Parametrizable AGU

5.5 Limitations

After discussing what the Programmable AGU is capable of, let us now discuss where it still falls short and whether and how this can be addressed.

1. 1x1 convolution: The core principle that the hardware-software partitioning of the Programmable VLIW AGU is based upon is that less frequent operations are left to software, and high-frequency operations are dealt with in hardware. The number of cycles or instructions that the software can consume to compute its part is equal to the product of the dimensions of the filter. For example, a 3x3x1 filter spares 9 cycles for software to compute its part. This design therefore puts a lower bound on the dimensions of the filter. There are essentially 5 operations that compute the software non-atomic part of the equation, which can be executed in 3 VLIW instructions sparing one slot, considering the ZOL Parallel scenario. Hence the lower bound for the product of the filter dimensions is 4. This means that scenarios of 1x1 convolution, such as in GoogleNET [27], cannot be captured by the current design of the Programmable VLIW AGU without compromising on throughput.

2. Division and modulo operations: For implementing the Upsampled Convolution address generation, the address equation and

loop bound computation equations contain division and modulo operations. These operations are highly expensive to implement in hardware and require special algorithms with multicycle pipelined components. The proposed workaround is to assume that the Upsampling factor involved in these division and modulo operations is always a power of two. With a power of two, the division is simply a right shift, and the modulo can be implemented using an 'and' operator. Hence the current design of the Programmable VLIW AGU cannot deal with an Upsampling factor that is not a power of two.

3. Forwarding Logic: The Atomic instruction advances the counters by one unit. These counters are present in the register file and do not have any copy in the processor pipeline registers. Typically, forwarding logic works by checking whether the immediately previous instruction modifies the operands required by the current instruction and, if so, fetching those operands from the pipeline registers. Thus, there is a minor architectural limitation: if the immediately previous instruction is an atomic instruction and the operands used by the current instruction are hardware counters, then forwarding is not possible, and the programmer must take care of this. This could possibly be rectified by adding special checks for operands that are counter registers.



6 Summary and Conclusion

Convolutional Neural Networks have made remarkable progress, especially in the field of Computer Vision, in the last decade. The algorithms that have made this advancement possible have also grown more sophisticated and complicated at the same pace. To help hardware accelerators keep up with this pace, this thesis has provided an end-to-end modeling, design, and implementation of an Address Generation Unit for Convolutional Neural Networks, achieving a high degree of flexibility with only a reasonable compromise on throughput, area, and power.

To accomplish this, the thesis first provides comprehensive mathematical modeling of six Convolutional Neural Network algorithms typical for Semantic Segmentation applications. The six algorithms can be aggregated into three groups considering their implications for hardware. The first group, comprising Standard, Strided, and Dilated convolution, has static loop bound functions and affine address equations with five variables. The second group, containing MaxPooling and Upsampled convolution, was modeled with variable loop bound equations in addition to the affine address equation. Lastly, the third group, containing only Padded convolution, was found to pose implications unique from the other algorithms in that it has a piece-wise affine address equation; a transformation function has been proposed to avoid the hardware expense of implementing the piece-wise equation. This algorithm-level modeling concludes with the deduction that all the algorithms under consideration have either constant or variable loop bound functions, and address equations that simplify down to five-variable affine equations whose variables follow an update frequency order dictated by the order of the nested loops. The three high-frequency variable terms of the address equation are combined into what is called here the atomic operations, to be offloaded to hardware; the rest of the equation is handled in the main VLIW pipeline sequentially.

The second step towards the target of a flexible AGU has been the design and implementation of hardware. Since full programmability is targeted together with a high throughput requirement, a two-slot VLIW processor with three pipeline stages has been chosen as the starting point for the design. The final form of the Programmable VLIW AGU with hardware-software co-design contains, apart from conventional VLIW components, a specialized Register Bank, an Atomic Pipeline unit, a customized branching mechanism, and specialized instructions in the ISA to support this hardware. The register bank contains cascaded hardware counters with configurable upper and lower bounds implementing the looping mechanism. The counters, along

with dedicated coefficient registers, are hardwired to the Atomic Pipeline unit, which has a hardwired pipeline for the atomic operations along with a register field for the non-atomic result. The customized branching mechanism provides dynamic coupling and decoupling (Atomic ZOL Parallel) of the atomic pipeline. The specialized Atomic and WaitAtomic instructions provide an interface to the hardware components mentioned. Apart from the Programmable VLIW AGU, a Parametrizable AGU has also been designed and implemented to provide an accurate reference for evaluation. This approach is an entirely hardware approach with cascaded counters and fixed address and loop bound equations.

The Programmable VLIW AGU provides very promising results when evaluated against relevant metrics and compared with the Parametrizable AGU. With respect to throughput, the Programmable AGU performs ideally for Standard, Strided, and Dilated Convolution, with a throughput of three addresses (input, weight, output) per cycle. For MaxPooling as well, it performs very close to ideal. For Upsampled convolution, the Parametrizable AGU shows 1.7 times more throughput than the Programmable AGU, but this factor shrinks to about 1.1 when a complete neural network is evaluated, because convolution layers dominate a neural network. The Programmable AGU implements dynamic loop bound computation, captures non-affine address equations, can support an increasing number of variables, and has configurable counters, all in software, making it a versatile platform. The trade-offs lie in the code size, the resource utilization, and the area (indicating power as well), which is 4.6 times higher than for the Parametrizable approach.

Based on the results, the final verdict that the thesis draws is that the trade-offs of the Programmable VLIW AGU design are small compared to the benefits it provides in the form of flexibility and throughput. However, the usage of the architecture is still subject to the requirements of the accelerator. For accelerators with a temporal architecture, this approach is very appropriate, but for accelerators with massively distributed architectures and localized address generation, this approach would most probably have unacceptable area and power overhead. Similarly, for accelerators that aim to serve a broad range of existing and future algorithms, the Programmable VLIW AGU is an apt fit, but for accelerators optimized for high area, power, and performance efficiency regardless of flexibility, a Parametrizable approach is more fitting.


7 Future Research

7.1 Compiler Design

At present, the Programmable AGU is programmed directly in two-slot instruction assembly. A VLIW processor does not have any hardware-level intelligence to identify and exploit parallelism; instead, it depends upon compile-time optimizations for operation dependency checking and for the recognition and utilization of instruction-level parallelism. A compiler therefore has great significance in a complete VLIW design flow, and there is great potential for a compiler that converts a high-level language into VLIW assembly code with a suitable instruction scheduling mechanism such as software pipelining [79].
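As an illustration of the kind of compile-time work such a compiler would take over, the following sketch packs a linear instruction stream into two-slot bundles while respecting read-after-write and write-after-write dependencies inside a bundle. The three-tuple instruction representation is a simplifying assumption for this example and is not the PAGU ISA.

def schedule_two_slot(instrs):
    # instrs: list of (mnemonic, dest_reg, set_of_source_regs) in program order.
    bundles, pending = [], list(instrs)
    while pending:
        bundle, written = [], set()
        for instr in list(pending):
            mnemonic, dest, srcs = instr
            # Stop at the first instruction that reads or rewrites a value produced
            # in this bundle, so program order is never violated.
            if srcs & written or dest in written:
                break
            bundle.append(instr)
            written.add(dest)
            pending.remove(instr)
            if len(bundle) == 2:        # both issue slots filled
                break
        while len(bundle) < 2:          # pad the unused slot with a NOP
            bundle.append(("NOP", None, set()))
        bundles.append(bundle)
    return bundles

program = [("ADD", "r1", {"r0"}),       # r1 depends on r0
           ("MUL", "r2", {"r1"}),       # depends on the ADD above
           ("ADD", "r3", {"r0"})]       # independent of the MUL
for bundle in schedule_two_slot(program):
    print(" || ".join(op[0] for op in bundle))   # ADD || NOP, then MUL || ADD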

7.2 Code Compression

Due to the presence of multiple slots in a VLIW instruction and the possibility of NOPs in unused slots, the code size is generally larger than that of a simple scalar processor. A large code size results in larger memory, area, and power requirements, which are expensive luxuries, especially in embedded computing. Thus, there is potential for applying code compression techniques. Example techniques found in the literature are LZW-based compression [80] and mask-based encoding [81].
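As a rough indication of how much redundancy a NOP-heavy VLIW binary offers, the following is a minimal, generic LZW compressor over a byte stream. It only sketches the general technique; the instruction-word-aware dictionary handling of [80] and the bitmask encoding of [81] are not reproduced here.

def lzw_compress(data):
    # Classic LZW over bytes: the dictionary starts with all single-byte strings.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for value in data:
        wc = w + bytes([value])
        if wc in dictionary:
            w = wc                        # extend the current match
        else:
            out.append(dictionary[w])     # emit the longest known prefix
            dictionary[wc] = next_code    # learn the new string
            next_code += 1
            w = bytes([value])
    if w:
        out.append(dictionary[w])
    return out

bundles = bytes.fromhex("00ff00ff00ff00ff")   # a repetitive, NOP-heavy instruction stream
codes = lzw_compress(bundles)
print(len(bundles), "bytes ->", len(codes), "dictionary codes")   # 8 bytes -> 5 codes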

7.3 Code Generation

Another future direction, which also relates to static optimization, is the development of an environment for automatic code generation. It can be deduced from Section 4.2 that, although there are some variations in programming style from one algorithm to another, the basic structure of the code remains the same. The code can be divided into distinct blocks that implement certain functionality in a specific order: in the examples provided, loop register initializations always come first, followed by the initial non-atomic computations, and then the atomic operations in one slot with the subsequent non-atomic computations in the other slot. This paves the way for automatic code generation in which only the hyperparameters need to be inserted, and the code is generated automatically, as sketched below.
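A possible shape of such a generator is sketched here: it fills the fixed block structure (counter bounds, address-equation coefficients, steady-state bundles) from the layer hyperparameters. All mnemonics, register names, and the '||' bundle notation are invented placeholders for illustration and do not correspond to the actual PAGU instruction set.

from dataclasses import dataclass

@dataclass
class ConvParams:
    in_h: int
    in_w: int
    channels: int
    kernel: int
    stride: int = 1

def generate(p):
    out_h = (p.in_h - p.kernel) // p.stride + 1
    out_w = (p.in_w - p.kernel) // p.stride + 1
    lines = []
    # Block 1: loop register (hardware counter) initialization.
    for reg, bound in [("LC0", p.channels), ("LC1", out_h), ("LC2", p.kernel),
                       ("LC3", out_w), ("LC4", p.kernel)]:
        lines.append("SETLOOP {}, 0, {}".format(reg, bound - 1))
    # Block 2: coefficients of the affine address equation.
    lines.append("SETCOEF C0, {}".format(p.in_h * p.in_w))
    lines.append("SETCOEF C1, {}".format(p.in_w))
    # Block 3: steady state - atomic work in one slot, non-atomic work in the other.
    lines.append("ATOMIC || ADD A0, A1")
    lines.append("WAITATOMIC || NOP")
    return "\n".join(lines)

print(generate(ConvParams(in_h=64, in_w=64, channels=16, kernel=3)))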


7.4 Further AI Algorithms

Convolutional Neural Networks are a subset of Deep Neural Networks, which are a subset of Machine Learning, which in turn is a subset of Artificial Intelligence (Figure 7.1). Hence this thesis is only a proof of concept implemented on CNNs, while a vast AI landscape remains to be explored. The algorithms selected for demonstrating the flexibility of the AGU, i.e., Standard, Strided, Dilated, Padded, and Upsampled Convolution and MaxPooling, cover a wide range of use cases, from simple image classification to much more complex Semantic Segmentation. However, there is still much more to convolutional neural networks; for example, this thesis only covered the algorithms crucial to inference, and it would be interesting to implement the training algorithms as well, before moving up the hierarchy to other AI algorithms.

Figure 7.1: The hierarchy of Artificial Intelligence algorithms

7.5 Conventional Digital Signal Processing Algorithms

While it is sometimes argued that Machine Learning will render conventional Digital Signal Processing extinct, this is not happening any time soon; among the reasons are the long-established, reliable, relatively less power-hungry, and standardized practices of DSP, which ML still lacks [82]. Even if it does happen at some point, ML itself is built on DSP algorithms at its fundamental levels [83]. Hence it is far from archaic to keep DSP algorithms in consideration while demonstrating the flexibility of hardware accelerators for Deep Learning applications. Considering the open-platform nature of the proposed Programmable AGU and the strong resemblance between the requirements of DSP and DNN algorithms, it should not be hard to implement DSP algorithms without any significant architectural changes. For instance, the Fast Fourier Transform (FFT) closely resembles convolution in CNNs, as both perform multiply-accumulate operations over a particular window of data, so it should be implementable with the Programmable AGU. Similarly, further research can be carried out to demonstrate the coverage of the proposed architecture for other DSP algorithms.
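To make the FFT claim concrete, the sketch below enumerates the data addresses touched by one radix-2 decimation-in-time FFT stage. They are affine functions of two loop counters plus a constant stage-dependent offset, i.e., exactly the counter-and-coefficient form already handled by the PAGU. This is an illustrative assumption about how an FFT could be mapped, not a result implemented in the thesis.

def fft_stage_addresses(n, stage):
    # Yield (top, bottom) butterfly address pairs for one stage of a length-n,
    # radix-2 decimation-in-time FFT; stage runs from 1 to log2(n).
    half = 1 << (stage - 1)                   # butterfly span of this stage
    for group in range(n // (2 * half)):      # outer counter
        for j in range(half):                 # inner counter
            top = group * 2 * half + j        # affine: coefficients 2*half and 1
            yield top, top + half             # constant offset within the stage

print(list(fft_stage_addresses(8, stage=1)))  # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(list(fft_stage_addresses(8, stage=3)))  # [(0, 4), (1, 5), (2, 6), (3, 7)]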


References

[1] "Big Data, AI & IoT Part Two: Driving Industry 4.0 One Step At A Time," [Online]. Available: https://www.forbes.com/sites/charlestowersclark/2019/02/20/big- data-ai-iot-part-two-driving-industry-4-0-one-step-at-a- time/#1eeb662f23a0. [2] R. Kurzweil, "The Law of Accelerating Returns," in Alan Turing: Life and Legacy of a Great Thinker, Springer Berlin Heidelberg, 2004, pp. 381- 416. [3] "AI Economy Will Further Accelerate The Pace Of Innovation," [Online]. Available: https://www.forbes.com/sites/cognitiveworld/2019/03/04/ai-economy- will-further-accelerate-the-pace-of-innovation/#31e985492f29. [4] M. I. Razzak, S. Naz and A. Zaib, "Deep learning for medical image processing: Overview, challenges and the future," in Lecture Notes in Computational Vision and Biomechanics, vol. 26, Springer Netherlands, 2018, pp. 323-350. [5] A. Kamilaris and F. X. Prenafeta-Boldú, Deep learning in agriculture: A survey, vol. 147, Elsevier B.V., 2018, pp. 70-90. [6] D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. Lacoste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont, N. Jaques, A. Waldman- Brown, A. Luccioni, T. Maharaj, E. D. Sherwin, S. K. Mukkavilli, K. P. Kording, C. Gomes, A. Y. Ng, D. Hassabis, J. C. Platt, F. Creutzig, J. Chayes and Y. Bengio, "Tackling Climate Change with Machine Learning," 10 6 2019. [7] M. Fire and J. Schler, "Exploring online Ad images using a deep convolutional neural network approach," in Proceedings - 2017 IEEE International Conference on , IEEE Green Computing and Communications, IEEE Cyber, Physical and Social Computing, IEEE Smart Data, iThings-GreenCom-CPSCom-SmartData 2017, 2018. [8] G. Apruzzese, M. Colajanni, L. Ferretti, A. Guido and M. Marchetti, "On the effectiveness of machine and deep learning for cyber security," in International Conference on Cyber Conflict, CYCON, 2018. [9] "Automotive revolution – perspective towards 2030 | McKinsey," [Online]. Available: https://www.mckinsey.com/industries/automotive- and-assembly/our-insights/disruptive-trends-that-will-transform-the- auto-industry/de-de. [10] "Connected & Autonomous Cars Have Arrived, And They Are Forcing Car Companies To Build New Vehicle Architectures," [Online]. Available: https://www.forbes.com/sites/sarwantsingh/2019/11/11/connected-- autonomous-cars-have-arrived-and-they-are-forcing-car-companies-to- build-new-vehicle-architectures/#60eab4192cb1. 63

[11] "A Very Short History Of Artificial Intelligence (AI)," [Online]. Available: https://www.forbes.com/sites/gilpress/2016/12/30/a-very-short- history-of-artificial-intelligence-ai/#415e4f5a6fba. [12] "Everything You Ever Wanted To Know About Computer Vision.," [Online]. Available: https://towardsdatascience.com/everything-you- ever-wanted-to-know-about-computer-vision-heres-a-look-why-it-s-so- awesome-e8a58dfb641e. [13] "YouTube is 10 years old: the evolution of online video | Technology | The Guardian," [Online]. Available: https://www.theguardian.com/technology/2015/feb/13/youtube-10- years-old-evolution-of-online-video?CMP=fb_gu. [14] Gordon E. Moore, "Cramming more components onto integrated circuits," 1965. [15] "R: The R Project for Statistical Computing," [Online]. Available: https://www.r-project.org/. [16] "Python Data Analysis Library — pandas: Python Data Analysis Library," [Online]. Available: https://pandas.pydata.org/. [17] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 1 6 2017. [18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 1 12 2015. [19] "State-of-the-art table for Image Classification on ImageNet," [Online]. Available: https://paperswithcode.com/sota/image-classification-on- imagenet. [20] G. Talavera, A. Portero and F. Catthoor, "Impact of address generation on multimedia embedded VLIW processors," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018. [21] N. Farahini, A. Hemani, H. SohofiB.Sc.CE M.Sc.CE, S. M. Jafri, M. A. Tajammul and K. Paul, "Parallel distributed scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric," and Microsystems, vol. 38, no. 8, pp. 788-802, 2014. [22] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2323, 1998. [23] J. Wu, "Introduction to Convolutional Neural Networks," 2017. [24] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," 23 3 2016. [25] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in 64

Bioinformatics), 2014. [26] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015. [27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. [28] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. [29] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014. [30] R. Girshick, "Fast R-CNN," 30 4 2015. [31] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real- Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149, 1 6 2017. [32] J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. [33] B. H. Hyeonwoo Noh, Seunghoon Hong, "Learning Deconvolution Network for Semantic Segmentation," 2015. [Online]. Available: https://arxiv.org/abs/1505.04366. [34] A. L. Y. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, "SEMANTIC IMAGE SEGMENTATION WITH DEEP CON- VOLUTIONAL NETS AND FULLY CONNECTED CRFS," pp. 1-12, 2012. [35] F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," 23 11 2015. [36] V. Sze, Y. H. Chen, T. J. Yang and J. S. Emer, Efficient Processing of Deep Neural Networks: A Tutorial and Survey, vol. 105, Institute of Electrical and Electronics Engineers Inc., 2017, pp. 2295-2329. [37] "Dennard Scaling - an overview | ScienceDirect Topics," [Online]. Available: https://www.sciencedirect.com/topics/computer- science/dennard-scaling. [38] T. N. Theis and H. S. Philip Wong, "The End of Moore's Law: A New Beginning for Information Technology," Computing in Science and Engineering, vol. 19, no. 2, pp. 41-50, 1 3 2017. [39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in MM 2014 - Proceedings of the 2014 ACM Conference on Multimedia, 2014. 65

[40] "Welcome to PyCUDA’s documentation! — PyCUDA 2019.1.2 documentation," [Online]. Available: https://documen.tician.de/pycuda/. [41] "TensorFlow," [Online]. Available: https://www.tensorflow.org/. [42] "Is NVIDIA Unstoppable In AI?," [Online]. Available: https://www.forbes.com/sites/moorinsights/2018/05/14/is-nvidia- unstoppable-in-ai/#7d34736d759e. [43] "Programming Tensor Cores in CUDA 9 | NVIDIA Developer Blog," [Online]. Available: https://devblogs.nvidia.com/programming-tensor- cores-cuda-9/. [44] S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng and J. S. Vetter, "NVIDIA Tensor Core Programmability, Performance & Precision," 11 3 2018. [45] "The World’s Most Powerful Graphics Card | NVIDIA TITAN V," [Online]. Available: https://www.nvidia.com/en-us/titan/titan-v/. [46] "Geforce GTX 10 Series | NVIDIA," [Online]. Available: https://www.nvidia.com/en-us/geforce/products/. [47] "Pascal GPU Architecture | NVIDIA GeForce," [Online]. Available: https://www.nvidia.com/en- us/geforce/products/10series/architecture/. [48] S. Mittal, "A survey of FPGA-based accelerators for convolutional neural networks," Neural Computing and Applications, 2018. [49] K. Guo, S. Zeng, J. Yu, Y. Wang and H. Yang, "A Survey of FPGA-Based Neural Network Accelerator," 24 12 2017. [50] "reVISION Zone | Machine Learning | Computer Vision," [Online]. Available: https://www.xilinx.com/products/design-tools/embedded- vision-zone.html. [51] "Project Brainwave - Microsoft Research," [Online]. Available: https://www.microsoft.com/en-us/research/project/project-brainwave/. [52] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao and J. Cong, "Optimizing FPGA- based accelerator design for deep convolutional neural networks," in FPGA 2015 - 2015 ACM/SIGDA International Symposium on Field- Programmable Gate Arrays, 2015. [53] S. Chakradhar, M. Sankaradas, V. Jakkula and S. Cadambi, "A dynamically configurable for convolutional neural networks," in Proceedings - International Symposium on , 2010. [54] S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, "Deep Learning with Limited Numerical Precision," 9 2 2015. [55] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y. K. Cheung and G. A. Constantinides, "Deep Neural Network Approximation for Custom Hardware: Where We've Been, Where We're Going," 21 1 2019. [56] "Cloud TPU | Google Cloud," [Online]. Available: https://cloud.google.com/tpu/.

66

[57] "Nervana Neural - Intel AI," [Online]. Available: https://www.intel.ai/nervana-nnp/. [58] "truenorth Archives | IBM Research Blog," [Online]. Available: https://www.ibm.com/blogs/research/tag/truenorth/. [59] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz and W. J. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," 3 2 2016. [60] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, O. Temam, T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen and O. Temam, "DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning," ACM SIGPLAN Notices, vol. 49, no. 4, pp. 269-284, 2014. [61] G. Talavera, M. Jayapala, J. Carrabina and F. Catthoor, "Address generation optimization for embedded high-performance processors: A survey," Journal of Signal Processing Systems, vol. 53, no. 3, pp. 271- 284, 12 2008. [62] G. Lu, H. Singh, M. H. Lee, N. Bagherzadeh, F. J. Kurdahi, E. M. Filho and V. Castro-Alves, "The MorphoSys dynamically reconfigurable system-on-chip," in Proceedings of the 1st NASA/DoD Workshop on Evolvable Hardware, 1999. [63] U. J. Kapasi, W. J. Dally, S. Rixner, J. D. Owens and B. Khailany, "The imagine stream processor," in Proceedings - IEEE International Conference on Computer Design: VLSI in and Processors, 2002. [64] F. Schuiki, M. Schaffner, F. K. Gürkaynak and L. Benini, "A Scalable Near- for Training Deep Neural Networks on Large In-Memory Datasets," 19 2 2018. [65] "DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family - Proceedings of the 53rd Annual Design Automation Conference, {DAC} 2016, Austin, TX, USA, June 5-9, 2016," 2016. [66] L. Li and A. M. Wyrwicz, "Modularized architecture of address generation units suitable for real-time processing MR data on an FPGA," Review of Scientific Instruments, vol. 87, no. 6, 1 6 2016. [67] K. Sethi and R. Panda, "A New Approach to Design of an Address Generation Unit in a DSP Processor". [68] M. Ramesh Kini and S. Sumam David, "Address generation for DSP Kernels," in ICCSP 2011 - 2011 International Conference on Communications and Signal Processing, 2011. [69] M. Ilic and M. Stojcev, "Address generation unit as accelerator block in DSP," in 2011 10th International Conference on Telecommunications in Modern Satellite, Cable and Broadcasting Services, TELSIKS 2011 - Proceedings of Papers, 2011. [70] T. Hussain, M. Shafiq, M. Pericàs, N. Navarro and E. Ayguadé, "PPMC: A programmable pattern based ," in Lecture Notes in

67

Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012. [71] F. Balasa, N. Abuaesh, C. V. Gingu, I. I. Luican and H. Zhu, "Energy- aware for embedded multidimensional signal processing applications," Eurasip Journal on Embedded Systems, vol. 2017, no. 1, 1 12 2017. [72] M. Moreno-Berengue, G. Talavera, A. Rodriguez-Alsina and J. Carrabina, "Address generation unit for multimedia applications on application specific instruction set processors," in IECON Proceedings (Industrial Electronics Conference), 2010. [73] M. R. Kini and S. Sumam David, "Comprehensive address generator for digital signal processing," in ICIIS 2009 - 4th International Conference on Industrial and Information Systems 2009, Conference Proceedings, 2009. [74] C. Zhang, Z. Fang, P. Zhou, P. Pan and J. Cong, "Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks," in IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, 2016. [75] Y. H. Chen, J. Emer and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016, 2016. [76] "GitHub - vdumoulin/conv_arithmetic: A technical report on convolution arithmetic in the context of deep learning," [Online]. Available: https://github.com/vdumoulin/conv_arithmetic. [77] H. Brunst, A. Knüpfer, V. Salapura, J. A. Fisher, P. Faraboschi, C. Young and F. P. Preparata, "VLIW Processors," in Encyclopedia of Parallel Computing, Springer US, 2011, pp. 2135-2142. [78] "Zero Overhead Loops - Developer Help," [Online]. Available: https://microchipdeveloper.com/dsp0201:zero-overhead-loops#top-of- page. [79] M. S. Lam, "Software pipelining: An effective scheduling technique for VLIW machines," in ACM SIGPLAN Notices, 2004. [80] C. H. Lin, Y. Xie and W. Wolf, "LZW-based code compression for VLIW embedded systems," in Proceedings -Design, Automation and Test in Europe, DATE, 2004. [81] S.-W. Seong and P. Mishra, "A bitmask-based code compression technique for embedded systems," 2006. [82] "Deep Learning and Digital Signal Processing: A Conversation with Cadence's Chris Rowen | Berkeley Design Technology, Inc," [Online]. Available: https://www.bdti.com/InsideDSP/2016/04/14/Cadence. [83] "DSP Takes on Deep Neural Networks | Electronic Design," [Online]. Available: https://www.electronicdesign.com/industrial- automation/article/21805000/dsp-takes-on-deep-neural-networks. [84] V. Badrinarayanan, A. Kendall and R. Cipolla, "SegNet: A Deep 68

Convolutional Encoder-Decoder Architecture for Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481-2495, 1 12 2017. [85] "5 Advanced CNN Architectures - Deep Learning for Vision Systems MEAP V07," [Online]. Available: https://livebook.manning.com/book/grokking-deep-learning-for- computer-vision/chapter-5/v-3/9.


TRITA-EECS-EX-2020:22

www.kth.se