ACCELERATOR ARCHITECTURES FOR APPLICATIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Sang Kyun Kim
January 2013

© 2013 by Sang Kyun Kim. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/nn963tk4553

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Oyekunle Olukotun, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christoforos Kozyrakis

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Andrew Ng

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Matrices are a well-known data representation extensively used in a wide range of applications. Numerous applications across various domains use matrix operations to represent and perform their core algorithms. Improving matrix operation performance is therefore critical to a vast variety of fields: it not only allows existing applications to run faster, but also enables computations with larger matrices.

Modern GPUs and CPUs with SIMD support have been very effective at accelerating matrix operations. However, these architectures only work well on dense, fat matrices. Skinny dense matrices tend to underutilize SIMD resources when the width of the matrix is less than the number of SIMD lanes, and they may limit scalability since they have a smaller amount of computation with which to hide communication overhead. Sparse matrices are also difficult to accelerate on current architectures, because their memory accesses are irregular and their workload imbalance is severe. This thesis introduces two specialized hardware designs, targeting narrow dense and sparse matrices.

The first part of this thesis focuses on accelerating the Restricted Boltzmann Machine (RBM), a popular machine learning algorithm used in deep learning. The RBM accelerator was designed using a modular approach to achieve linear scalability across transistor technologies, as well as across chip boundaries. The accelerator was implemented on FPGAs to demonstrate its performance improvements over high-end CPUs and GPUs. Both fat and skinny matrices were shown to fully utilize the computation resources in the learning process, which allows the training algorithm to converge in fewer iterations.

The second part of this thesis describes how applications can be accelerated with domain-specific hardware. We studied three sparse matrix applications that conventional hardware cannot easily accelerate. Based on our findings, we devised an accelerator architecture that targets certain sparse and dense matrix operations. The accelerator is capable of exploiting the fine-grained parallelism within sparse matrices, despite their irregularity, through buffering and work-stealing. In order to cover a wider range of applications, a small general-purpose core was added to the accelerator for non-critical execution flows. The sparse matrix accelerator was implemented on an FPGA board as an ASIC prototype to evaluate its performance using real-world data. Our accelerator shows performance comparable to GPUs on dense matrix operations, and outperforms conventional hardware on sparse matrix operations.

Acknowledgements

My PhD journey has been supported by family, colleagues, mentors, and friends. I couldn't have done it without them. I would like to use this opportunity to express my gratitude for their advice, efforts, kindness, patience, and care.

First of all, I would like to say "Thank you so much!" to my wife Eun Jung Lee for her patience and support. I am truly blessed to have met her at Stanford, as she has greatly enriched my graduate life with an abundance of joyful events. We share many invaluable memories, which enabled me to endure and push forward in my research.

I also want to thank my advisor Kunle Olukotun, who has been a great advisor to me. During our weekly 1:1 meetings, he always guided me with insightful advice. He has been patient with my research despite the delays from some difficulties in hardware. Kunle also kindly supported me financially with a research assistantship after my scholarship expired so that I could finish my PhD work. I also thank Darlene Hadding, who helped me with administrative work during my studies. My degree program had very few administrative issues thanks to her awesome support.

I would also like to thank the other Oral Exam Committee members: Christos Kozyrakis, Andrew Ng, and Yoshio Nishi. I thank Christos for serving as my associate advisor, reading committee member, and oral committee member. His computer architecture classes were one of the reasons why I decided to go further down the computer architecture path. I thank Andrew for serving as my reading committee member and oral committee member. The Brain-in-Box meetings with Andrew were very helpful in designing an RBM accelerator. I thank Yoshio for serving as my oral exam chair. He very kindly and happily agreed to be the chair, even though it was the first time we had met.

I want to thank all my colleagues and friends. Every one of them positively influenced my research work. Thanks to Lawrence McAfee and Peter McMahon, who worked together with me on designing the RBM architecture on FPGAs. Thanks to Honglak Lee for his help in understanding machine learning algorithms and for providing his source code. Thanks to Sungpack Hong and Frank Liu, who wrote some RTL code for me, which helped me save time in implementing the sparse matrix accelerator. Special thanks to Sungpack for our conversations, which were particularly helpful in debugging the sparse matrix accelerator. Thanks to Hyoukjoong Lee for his help with using the graphics card. Thanks to Jared Casper for helping me with Altera license issues, and thanks to Jacob Leverich for his help with using the CPU cluster. I also want to thank the rest of Kunle's group, as they always gave constructive feedback on my research.

I also want to thank the Stanford Korean Christian Fellowship (KCF), which was a spiritual home for me from my first year of graduate school. Thanks to all the caring friends at KCF for their loving prayers and support.

I would like to express my deep gratitude to the Kwanjeong Educational Foundation, which financially supported my first five years of graduate studies. Thanks to Chong-hwan Lee, the president of the foundation, and the rest of the foundation staff, I was able to study at Stanford without worrying about the expensive tuition and high cost of living. By securing my financial needs, I was able to stay more productive and focused.

Finally, I want to thank my parents in Korea for their warm encouragement and for having faith in me. Even at times when my grades were below expectations in Korea, my parents never gave up on me, and they still believe I can do well today. They helped me build a positive character, which really helped me avoid getting stressed out by the long debugging sessions and keep my pace in research.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Microprocessor Trends
  1.2 Specialized Architecture for Matrix-oriented Applications
  1.3 Previous Work On Accelerating Matrix Operations
  1.4 Thesis Outline

2 Accelerator Architectures for Restricted Boltzmann Machines
  2.1 Introduction
  2.2 Background on Restricted Boltzmann Machines
  2.3 Target FPGA Development Platform
  2.4 Single Chip Implementation
    2.4.1 Overall System Architecture
    2.4.2 Core Components
    2.4.3 Experimental Results
  2.5 Extension to Large Scale RBM Architecture
    2.5.1 Scalable Multi-chip Architecture
    2.5.2 Streaming Weights From DRAM and Associated Trade-offs
    2.5.3 Locally Dense Sparse Network
    2.5.4 Limitations of Multi-chip RBM
    2.5.5 Experimental Results

  2.6 Related Work
  2.7 Summary

3 Sparse Matrix Accelerator
  3.1 Introduction
    3.1.1 Sparse Matrix Applications
  3.2 Sparse Matrix Accelerator Architecture
    3.2.1 Overview
    3.2.2 Supported Matrix Operations
    3.2.3 Sparse Matrix Format
    3.2.4 Decode Unit
    3.2.5 Thread Block Unit
    3.2.6 Encode Unit
    3.2.7 Sparsification Modules
  3.3 Sparse Matrix Accelerator Implementation Details
    3.3.1 Target Development Platform
    3.3.2 Software Stack
    3.3.3 Resource Usage
    3.3.4 Challenges and Limitations
  3.4 Results and Analysis
    3.4.1 Experimental Platform
    3.4.2 Matrix Operation Performance Analysis
    3.4.3 Application Performance Analysis
    3.4.4 Summary

4 Concluding Remarks and Future Directions
  4.1 Contributions of Research
  4.2 Future Work
    4.2.1 Addressing the Limitations of the FPGA Implementation Platforms
    4.2.2 Future Directions in Accelerator Research

A Pseudo-code of Sparse Matrix Applications Used for SMA

B ISA for Sparse Matrix Accelerator FPGA
  B.1 Overview of Instruction Set Architecture for SMA-F
  B.2 Matrix Descriptor Format
  B.3 Detailed Instruction Format
  B.4 List of Instructions
    B.4.1 ADDH, ADDS - Floating Point Scalar Addition
    B.4.2 CSR - Control Status Register
    B.4.3 EADD - Element-wise Matrix Addition
    B.4.4 EGT - Element-wise Greater-than Comparison
    B.4.5 ELT - Element-wise Less-than Comparison
    B.4.6 EMAX - Element-wise Matrix Maximum
    B.4.7 EMIN - Element-wise Matrix Minimum
    B.4.8 EMULT - Element-wise Matrix Multiplication
    B.4.9 ESUB - Element-wise Matrix Subtraction
    B.4.10 EXP2 - Element-wise Exponential Base 2
    B.4.11 LOG2 - Element-wise Log Base 2
    B.4.12 MOVHS, MOVSH - Change Floating Point Precision
    B.4.13 MOVRx - Move Register
    B.4.14 MULH, MULS - Floating Point Scalar Multiplication
    B.4.15 MULT - Matrix Multiplication
    B.4.16 RCP - Element-wise Reciprocal
    B.4.17 RSQRT - Element-wise Reverse Square Root
    B.4.18 SGT, SLT, SMAX, SMIN - Matrix-Scalar Comparison
    B.4.19 SMULT, SADD, SSUB - Matrix-Scalar Arithmetic
    B.4.20 SUBH, SUBS - Floating Point Scalar Subtraction
    B.4.21 XCHG - Exchange

Bibliography

List of Tables

2.1 Single FPGA RBM Resource Utilization
2.2 Performance of Single Core CPU, GPU, and Multi-FPGA
2.3 Scalability: Speedup Against One Node of Each Platform
2.4 Power Consumption of Each Platform (Watt)

3.1 Density of Sparse Matrices in BC and MCL
3.2 List of Matrix Operations Supported
3.3 Prototype Implementation vs Projected Specification
3.4 Platform Specifications For Performance Evaluation and Analysis
3.5 CPU and GPU Libraries For Matrix Operations

B.1 Matrix Descriptor Fields and Format
B.2 Scalar Operations Opcode
B.3 Matrix Arithmetic and Math Function Opcode
B.4 Matrix Comparison Opcode

List of Figures

1.1 A Block Diagram for Future Microprocessor

2.1 Illustration of a Deep Belief Network and Restricted Boltzmann Machine
2.2 RBM Training Algorithm Pseudo-code
2.3 DE3 Development Board Used for RBM
2.4 System Architecture for Single Chip Restricted Boltzmann Machine
2.5 RBM Module Architecture Detail
2.6 RBM Matrix Multiplication Unit
2.7 Speedup for Single FPGA RBMs
2.8 High-level View of Multi-chip RBM Interconnect
2.9 Simplified Version of the Communication Between FPGAs
2.10 DRAM Bandwidth With Different Batch Sizes
2.11 Memory Bandwidth vs. I/O Bandwidth vs. On-chip Storage Trade-off
2.12 Buffering Weights and Partial Results
2.13 Locally Dense Sparse Network
2.14 Bit Error Rate Influence on Convergence

3.1 CPU Execution Time Breakdown on Sparse Matrix Applications
3.2 Non-zero Element Distribution
3.3 Sparse Matrix Accelerator (SMA) Architecture Overview
3.4 Compressed Sparse Block Format
3.5 SMA Decode Unit Block Diagram
3.6 SMA Thread Block Unit Block Diagram
3.7 SMA Floating Point ALU Block Diagram

3.8 Encode Unit Block Diagram
3.9 Sparsifying Modules
3.10 DE4 Development Board Used for SMA-F
3.11 SMA-F Software Stack
3.12 SMA-F Resource Utilization
3.13 Memory Blocks for Queues
3.14 Matrix Multiplication Performance
3.15 SMA-A Latency Sweep
3.16 SMA-A Work-stealing Performance
3.17 Element-wise Performance
3.18 SMA-A Restricted Boltzmann Machine Performance
3.19 SMA-A RBM Performance Analysis
3.20 SMA-A Markov Clustering Performance
3.21 SMA-A Betweenness Centrality Performance

4.1 Example of 2-D Sparse RBM Configurations

A.1 Pseudo-code for Markov Clustering
A.2 Pseudo-code for Sparse Restricted Boltzmann Machine
A.3 Pseudo-code for Betweenness Centrality

B.1 Nios II Custom Instruction Format

Chapter 1

Introduction

1.1 Microprocessor Trends

Due to transistor scaling and microarchitecture advances, microprocessors have shown exponential growth in logic resource capacity and tremendous performance enhancements over the past several decades¹. The exponential shrinking of semiconductor devices, known as Moore's Law [45], still holds true and is expected to continue for at least several more years. However, due to power limitations, clock frequency no longer scales with transistor size. Sequential performance gains from exploiting instruction-level parallelism (ILP) have significantly diminished as well, due to the inherently limited amount of ILP in applications, the worsening of wire delays, and the large energy consumption of increasingly complex hardware structures (e.g., super-scalar, out-of-order architectures with branch prediction).

Since the early 2000s, microprocessors have shifted their focus from increasing single-threaded performance to energy-efficient multi-core architectures. By utilizing the data-level and task-level parallelism in applications, a multi-core microprocessor can typically achieve better performance at the same level of energy. Multithreaded multicore microprocessors (e.g., the Sun UltraSPARC T1 [35]) were also introduced during this time for throughput-oriented applications; they utilize a large number of threads to hide long memory latencies. Recent microprocessors typically range from two to eight cores, based on performance and energy requirements.

¹ The first commercially available microprocessor, the Intel 4004, was introduced in 1971 [30].

Figure 1.1: A Block Diagram for Future Microprocessor (a multicore CPU and L3 cache alongside a GPU, FPGA fabric, and video/audio accelerators)

Heterogeneous computing was another microarchitectural trend seen during this time. The IBM Cell processor [31] was one of the first successful microprocessors with heterogeneous cores. NVIDIA CUDA [46] is an example of heterogeneous computing using the graphics processor. Intel Sandy Bridge added a graphics processor on the same die as an x86 processor. In general, a heterogeneous computer architecture typically consists of one or more powerful general-purpose cores that target sequential code, and many weak cores for energy-efficient data-parallel execution.

Despite the multi-core effort, without scaling of the supply voltage, future microprocessors will continue to be limited by power constraints. Recently, there have been concerns that future processors will be forced to turn off a large portion of the chip at any given time just to stay within the power budget. Esmaeilzadeh et al. predicted that more than 50% of a chip must be powered off at the 8nm technology node, the so-called dark-silicon apocalypse [21]. On the other hand, it has also been shown that using application-specific logic is far more energy efficient than using general-purpose processors for the same application at the same level of performance. Hameed et al. reported that their ASIC design was 500x more energy efficient than a four-core CMP design [25]. Consequently, it is believed that a promising approach for future microprocessor designs is to include more specialized logic blocks that are powered only when they are required for their specific functionality [10, 50]. Chien et al. argue that the traditional approach of optimizing for the 90% case no longer applies, and propose the "10×10" paradigm, which uses the growing transistor budget for 10 hardware accelerators, each covering 10% of the execution time. Such an approach can considerably improve the overall performance and energy efficiency of a wide variety of applications.

Based on these observations, Figure 1.1 shows an example of what the block diagram of a future microprocessor may look like. A future microprocessor is likely to have a powerful multicore CPU for general-purpose computing, but will also have accelerators for specific application domains. In fact, graphics, video, and audio accelerators can already be found in modern SoCs for portable consumer devices with tight energy constraints, such as mobile phones. There have also been proposals and early implementations of microprocessors with reprogrammable FPGA fabric. However, it is still an open question what other accelerators are needed, especially for the desktop and high-performance computing domains.

1.2 Specialized Architecture for Matrix-oriented Applications

Most accelerators in current microprocessors are related to security and multimedia. As the number of accelerators increases in future microprocessors, a wider range of applications will run more efficiently using customized hardware modules. This research studies the design of specialized hardware for emerging applications that use matrix operations as their primary computation. The Restricted Boltzmann Machine (RBM) is an example of a matrix-oriented application, and is the first focus of this thesis. An RBM is a two-layer network used as a building block for Deep Belief Networks [28], multi-layer neural networks that have been extremely popular over the last several years. This thesis discusses how we designed efficient, highly scalable RBM hardware and demonstrated its performance and scalability by implementing the architecture in FPGAs.

As an RBM scales to a very large size, many of the connections between the layers can be made sparse without much loss of accuracy. We show how we extend the multi-FPGA RBM hardware to support sparse networks that are locally dense.

The second part of this thesis investigates how we can generalize the RBM accelerator architecture to support a wider range of sparse matrix applications. As a case study, we chose three specific sparse matrix applications that existing hardware cannot easily accelerate: the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. We designed a sparse matrix accelerator architecture focused on the matrix operations used in these three applications, and implemented the design on an FPGA prototyping board. Assuming that an ASIC implementation would deliver roughly 10 times the performance of the FPGA prototype, we show that our customized sparse matrix hardware can give large speedups over current high-end CPUs and GPUs for applications dominated by the targeted matrix operations.

1.3 Previous Work On Accelerating Matrix Operations

Since matrix operations are very common in numerous applications, they are also among the most highly optimized types of computation. Most optimization efforts have focused on dense matrix operations for general-purpose processors, utilizing SIMD units to exploit the data parallelism and regular memory access patterns in matrix operations. More recently, graphics processors have also been used extensively to accelerate dense matrix operations, as they offer a large number of SIMD cores. The SIMD units in conventional hardware share instruction decode units to reduce resource overheads. As a result, resources may be underutilized if the matrix size is smaller than the SIMD width. Machine learning algorithms, such as the Restricted Boltzmann Machine, can benefit from using smaller batch sizes, since doing so helps the algorithm converge in fewer iterations. However, small batch sizes are rarely used in practice, because matrix operations on long, skinny matrices very often result in poor performance due to SIMD resource underutilization. Customized hardware

can more easily overcome this issue, since the overhead of sharing a general-purpose instruction unit is not needed. In Chapter 2, we show that our RBM accelerator can support long, skinny matrices without performance penalties. We also show that the accelerator scales well to a very large RBM network, which was one of the original motivations for building a custom RBM system.

Conventional SIMD hardware also provides limited support for sparse matrices, mostly focused on sparse matrix-vector multiplication (SpMV), which aligns relatively well with SIMD-style computation. However, highly irregular sparse matrix operations, such as sparse matrix - sparse matrix operations, are not well supported. Software solutions exist for both CPUs and GPUs [16, 13, 8], but are not capable of fully utilizing SIMD resources.

There has been previous work on implementing custom hardware accelerators for sparse matrix operations. Elgindy and Shue [20] demonstrated a fixed-point sparse matrix-vector multiplication accelerator using FPGAs. In 2005, floating-point implementations of sparse matrix-vector multiplication were shown in multiple publications [58, 18]. In 2010, Lin et al. explored the design space of implementing sparse matrix - sparse matrix multiplication on FPGAs using a systolic array architecture [40]. However, to the best of our knowledge, there has not been a sparse matrix accelerator design that targets multiple sparse matrix operations, providing both adequate speedup on full applications and flexibility. Many sparse matrix applications are composed of multiple types of matrix operations; thus, accelerating a single matrix operation is typically not sufficient to obtain good overall performance. In Chapter 3, the matrix operations in sparse matrix applications are studied to show that the accelerator must support multiple matrix operations in order not to be limited by Amdahl's Law. Our sparse matrix accelerator is capable of accelerating multiple types of matrix operations for both dense and sparse matrices, and exhibits considerable speedup over conventional hardware for representative sparse matrix applications.
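To make the irregularity concrete, the sketch below performs sparse matrix-vector multiplication over the common compressed sparse row (CSR) format. It is illustrative only and is not taken from the dissertation (the accelerator in Chapter 3 uses a compressed sparse block layout instead); the data-dependent gather from the dense vector and the uneven number of non-zeros per row are exactly the properties that frustrate fixed-width SIMD units.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix A.

    values, col_idx, row_ptr are the standard CSR arrays. The gather
    x[col_idx[j]] has a data-dependent access pattern, and the number of
    nonzeros per row varies, which causes SIMD underutilization and
    workload imbalance on conventional hardware.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]   # irregular gather
        y[i] = acc
    return y

# Tiny example: a 3x3 sparse matrix with an uneven number of nonzeros per row.
values  = np.array([2.0, 1.0, 3.0, 4.0])
col_idx = np.array([0, 2, 1, 2])
row_ptr = np.array([0, 2, 3, 4])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(values, col_idx, row_ptr, x))   # [ 5.  6. 12.]
```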

1.4 Thesis Outline

The outline of the remaining chapters of the thesis is as follows. Chapter 2 describes the Restricted Boltzmann Machine algorithm and shows how very large scale RBMs can be accelerated using specialized hardware. The chapter explains the details of implementing the architecture on FPGAs and compares its performance and scalability with existing computer architectures. Chapter 3 gives a brief background on sparse matrix applications, and illustrates in detail our sparse matrix accelerator architecture as well as its FPGA implementation. The proposed accelerator architecture is prototyped on an FPGA to demonstrate its performance using real-world data. Chapter 4 concludes the thesis with a summary of the accomplishments of this research and suggestions for future work.

Chapter 2

Accelerator Architectures for Restricted Boltzmann Machines

2.1 Introduction

A Deep Belief Network (DBN) is a multilayer generative model that is trained to extract the essential features of the input data by maximizing the likelihood of its training data. DBNs have recently gained great popularity in the machine learning community due to their potential for solving previously difficult learning problems. Introduced in 2006 by Hinton et al. [28], DBNs use Restricted Boltzmann Machines (RBMs) to efficiently train each layer of a deep network. DBNs have been successfully demonstrated in various applications, such as handwritten digit recognition [37] and human motion modeling [49].

Although DBNs appear to be a promising tool, investigations are limited by the significant amount of processing that RBMs require; existing software implementations, including those for multi-core CPUs and GPUs, have long running times even for relatively small nets [47]. The primary issue is that conventional processors do not efficiently exploit the fine-grain parallelism present in RBM training algorithms, which are dominated by large, skinny matrix multiplications. Graphics processors have significant performance benefits over CPUs for matrix operations, but do not scale to very large networks due to I/O bandwidth limitations.


We seek to address this problem in both the short and the long term by developing a scalable, highly optimized custom computer architecture for DBN processing. In the near term, systems implementing this architecture may be able to achieve considerable speedups over conventional CPUs. Longer term, future generations of many-core processors may contain cores that are optimized for specific classes of applications, as mentioned in Section 1.1. An architectural exploration in this area using our ideas could lead to future processors that are better suited to DBN processing.

We describe an FPGA-based system that accelerates the training of DBNs. The ability to conveniently program logic provides considerable advantages in exploring the architectural design space of RBMs. Modern FPGAs contain a large number of configurable logic elements, which allow custom designs for complicated algorithms to be built. Abundant logic resources and the customizable nature of FPGAs allow us to fully exploit the fine-grain parallelism in the DBN training algorithm. More details of the FPGA board used are discussed in Section 2.3. Although FPGAs are used to implement the RBM algorithm, the architecture itself does not rely on the reconfigurable nature of FPGAs and may migrate to ASICs for better performance or energy efficiency.

The remainder of this chapter explains our RBM accelerator architecture as follows. Section 2.2 briefly reviews the Restricted Boltzmann Machine training algorithm. Section 2.3 describes the FPGA platform we used to implement the RBM accelerator designs. Section 2.4 illustrates our initial single FPGA design of a fully configurable Restricted Boltzmann Machine; the single chip RBM architecture was carefully constructed in a modular manner such that the design is scalable across multiple semiconductor technology generations. Section 2.5 extends the single FPGA design to a multi-FPGA version, in which each FPGA computes a subset of the visible and hidden layers and communicates data and intermediate results with the other FPGAs in a ring topology.


Figure 2.1: Illustration of a Deep Belief Network and Restricted Boltzmann Machine

2.2 Background on Restricted Boltzmann Machines

In this section, we briefly summarize the algorithm by Hinton et al. [27] for training Restricted Boltzmann Machines, which we seek to accelerate. A Restricted Boltzmann Machine (RBM) is a probabilistic generative model that is able to automatically extract features of its input data using an unsupervised learning algorithm. RBMs consist of a layer of hidden neurons and a layer of visible neurons, with the connection strengths between hidden and visible neurons represented by an array of weights (see Figure 2.1). A Deep Belief Network (DBN) [28, 9] is a multi-layer neural network that can be viewed as a stack of RBMs, with the hidden units of one RBM used as the visible inputs to the next higher RBM. DBNs learn the weights by applying the RBM training algorithm one layer at a time. Ideally, given enough neurons and layers, the user can learn very abstract features of the training set, with the intention of modeling the hierarchical learning structure of the brain; recent related work includes a comparison of sparse DBN output to the V2 area of the visual cortex [38].
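For reference, the two sampling directions used by the training procedure below draw from the standard RBM conditional distributions; writing the weights as $W$, the hidden biases as $b$, and the visible biases as $c$ (symbols chosen here for exposition, not the dissertation's notation):

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_j W_{ij} h_j\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$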

- Visible neurons initially set to a batch of training examples, denoted vis_batch_0
- Repeat until convergence {
    1) Sample hid_batch_0 from P(h|vis_batch_0)
        a) tmp_matrix_1 = vis_batch_0 * weights
        b) tmp_matrix_2 = tmp_matrix_1 + hid_biases
        c) tmp_matrix_3 = sigmoid(tmp_matrix_2)
        d) hid_batch_0 = tmp_matrix_3 > rand()
    2) Sample vis_batch_1 from P(v|hid_batch_0)
    3) Sample hid_batch_1 from P(h|vis_batch_1)
    4) Update parameters:
        a) weights += α(vis_batch_0^T * hid_batch_0 - vis_batch_1^T * hid_batch_1)
        b) vis_biases += α(vis_batch_0^T * 1 - vis_batch_1^T * 1)
        c) hid_biases += α(hid_batch_0^T * 1 - hid_batch_1^T * 1)
  }

Figure 2.2: RBM Training Algorithm Pseudo-code

To train an RBM, samples from a training set are used as input to the RBM through the visible neurons, and the network then alternately samples back and forth between the visible and hidden neurons. The goal of training is to learn the visible-hidden connection weights and neuron activation biases such that the RBM learns to reconstruct the input data during the phase where it samples the visible neurons from the hidden neurons. Figure 2.2 shows the pseudo-code for the RBM training algorithm. Each sampling process is essentially a matrix-matrix multiply between a batch of training examples and the weight matrix, followed by a neuron activation function, which in many cases is the sigmoid function $1/(1 + e^{-x})$. The sampling between the hidden and visible layers is followed by a slight modification of the parameters (controlled by the learning rate $\alpha$) and repeated for each data batch in the training set, and for as many epochs as necessary to reach convergence.
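As a software reference point, the following NumPy sketch transcribes one pass of the pseudo-code in Figure 2.2 (a single CD-1 update on one batch). The toy dimensions, random seed, and learning rate value are illustrative assumptions, not parameters taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, weights, vis_biases, hid_biases, alpha=0.1):
    """One contrastive-divergence (CD-1) step on a batch v0 of shape
    (batch_size, n_visible), mirroring steps 1-4 of Figure 2.2."""
    # 1) Sample hid_batch_0 from P(h | vis_batch_0)
    h0_prob = sigmoid(v0 @ weights + hid_biases)
    h0 = (h0_prob > rng.random(h0_prob.shape)).astype(float)
    # 2) Sample vis_batch_1 from P(v | hid_batch_0) -- note the transpose of W
    v1_prob = sigmoid(h0 @ weights.T + vis_biases)
    v1 = (v1_prob > rng.random(v1_prob.shape)).astype(float)
    # 3) Sample hid_batch_1 from P(h | vis_batch_1)
    h1_prob = sigmoid(v1 @ weights + hid_biases)
    h1 = (h1_prob > rng.random(h1_prob.shape)).astype(float)
    # 4) Update parameters (positive phase minus negative phase)
    weights += alpha * (v0.T @ h0 - v1.T @ h1)
    vis_biases += alpha * (v0.sum(axis=0) - v1.sum(axis=0))
    hid_biases += alpha * (h0.sum(axis=0) - h1.sum(axis=0))
    return weights, vis_biases, hid_biases

# Toy run: 8 visible units, 4 hidden units, a batch of 16 binary examples.
n_vis, n_hid, batch = 8, 4, 16
weights = 0.01 * rng.standard_normal((n_vis, n_hid))
vis_biases, hid_biases = np.zeros(n_vis), np.zeros(n_hid)
vis_batch_0 = (rng.random((batch, n_vis)) > 0.5).astype(float)
cd1_update(vis_batch_0, weights, vis_biases, hid_biases)
```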

2.3 Target FPGA Development Platform

Both single chip and multi-chip RBM designs are implemented on the Altera DE3 development board from Terasic [51], which features an Altera Stratix III FPGA with a DDR2 SDRAM SODIMM and an SD card interface, as shown in Figure 2.3. The DE3 board also has four high speed connectors for communication with external devices, one SMA connector for an external clock input, and a USB Blaster interface to program the FPGA via JTAG. The JTAG interface is also used for debugging and interacting with the user application.

Figure 2.3: DE3 Development Board Used for RBM

The Stratix III EP3SL340 has 135,000 ALMs (Adaptive Logic Modules)¹, 16,272 kbits of embedded RAM, and 288 embedded 18x18 multipliers. With this number of multipliers, we are capable of processing approximately 256 neurons per clock cycle. A Nios II processor [1], Altera's proprietary soft processor, is instantiated in the FPGA to communicate with the user and configure the RBM. The Nios II receives input from the user (via JTAG), reads data from the SD card into DRAM, controls the overall flow of the RBM module, and displays the internal state of the RBM module. Altera's DDR2 memory controller is instantiated to support reading from and writing to DRAM at a memory clock frequency of 267MHz. The RBM core of the single FPGA implementation is clocked at 200MHz, while the multi-FPGA RBM implementation runs at 150MHz.

For inter-FPGA communication, the DE3 has four HSTC (High Speed Terasic Connector) connectors, which are Terasic's customized version of Altera's popular HSMC (High Speed Mezzanine Card) [2] interface.

¹ ALMs are essentially two 6-input ALUTs combined with two dedicated registers.

HSTC-compatible flex cables from Samtec were used to allow high speed communication between the FPGAs. The multi-FPGA RBM uses four LVDS pairs in each HSTC interface, which yields a data rate of 4.8 Gbps per direction. Details of the connection topology and communication protocol are discussed in Section 2.5.1.

The multi-FPGA implementation also requires a common clock for all the FPGAs. A separate dedicated device is used to create and distribute the common clock to the FPGAs via SMA cables. Using their embedded PLLs, the FPGAs generate all necessary clocks from the common clock, including the system clock and the DDR2 memory clock. Since all clocks are derived from the same source, each clock has exactly the same frequency in every FPGA and may differ only in phase. Being able to guarantee that the clock rates are exactly the same across FPGAs greatly simplifies the inter-chip communication protocol. Although enforcing a common clock may not be desirable in some industrial large-scale settings, as discussed later, we believe it suffices to serve the purpose of this research.

2.4 Single Chip Implementation

2.4.1 Overall System Architecture

Before designing a multi-chip large scale Restricted Boltzmann Machine architecture, we first investigated how to implement an RBM on a single FPGA. This allows us to conduct architectural experiments on relatively small-scale RBMs, and later provides the core building block for a multi-chip RBM system. In addition, the single FPGA implementation gives some important insights into building a scalable multi-chip RBM, as discussed in Section 2.5.1.

Figure 2.4 summarizes the structure of our single chip RBM architecture. The system consists of a Nios II processor, a DDR2 SDRAM controller, and an RBM module. The processor is clocked at a frequency of 100MHz and functions as the interface between the user and the RBM module via JTAG-UART. The CPU also initializes the weights, reads the visible neurons into SDRAM, initiates the algorithm, and returns the results to the user.


Figure 2.4: System Architecture for Single Chip Restricted Boltzmann Machine

The RBM module, operating at 200MHz, is the key component that executes the algorithm with the configuration chosen by the user. At a high level, the RBM module has an array of weights and neurons that are fed into an array of multipliers, and then into adders, to perform the matrix multiplication. After that, the RBM computes the sigmoid to obtain the probability of firing a neuron and fires the neuron using a comparator and a random number generator. After the positive and negative phases, the module continues to iterate until it meets the stopping condition given by the user. Every step is pipelined, which results in a throughput of approximately 256 multiply-and-add (MADD) operations per cycle.

The choice of arithmetic precision was a critical one, since the logic and multiplier resource utilization depends directly on the data width and format. Since FPGAs have considerably fewer logic resources than ASICs [36], data precision needs to be limited in order to fit enough computational units and achieve the targeted performance. Neural networks exhibit soft computing [55] characteristics, which refers to a collection of software techniques that exploit tolerance to noise for better performance and power efficiency.


Figure 2.5: RBM Module Architecture Detail

To determine the optimal precision, we simulated the DBN using the Fixed-Point MATLAB Toolbox for several fixed-point formats. From the simulation results, 16-bit fixed-point numbers were chosen to represent the weights and the training data set. Previous studies [29] also demonstrate that 16-bit precision is sufficient for a large range of neural network benchmarks².

As shown in Figure 2.5, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are stored in the embedded RAM distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups.

² Although 16-bit fixed-point is sufficient for the purpose of this research and the neural networks in consideration, it may no longer be sufficient if we expand the network to a very large scale unless we introduce some notion of sparsity. Section 2.5.3 gives an example of a sparse RBM network.

The rationale for this partitioning is that wire delay increases as semiconductor technology scales, so wire delay becomes the performance bottleneck if placement and routing are not performed efficiently. Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern devices. Signals that must communicate with other groups are appropriately buffered.

Partitioning the design into multiple groups also makes the design scalable. Since most of the algorithm is performed within each group, the design can easily be migrated to a future device by instantiating more of these groups, without having to worry about wire delays or routing. This is because most of the wiring is localized and the global signals are buffered. This also applies when extending the system to multiple boards. Each group was sized to match the DDR2 bus width of 256 bits, allowing for 16 multipliers per group with 16 bits of data precision.

A significant goal of this project is to facilitate research on large DBNs. To provide sufficient speedup, flexibility (relative to a software implementation) had to be sacrificed. Nonetheless, our system provides configurable parameters to allow wide-ranging experiments without the need to modify the FPGA design. The most significant parameter is one that allows the user to specify the number of neurons in each layer. This is in contrast to the RBM implementation by Ly and Chow [42], which requires the network size to be fixed and symmetric. However, due to the pipeline structure of our implementation, the multipliers are only fully utilized when the number of neurons is a multiple of 256. Although this is not a significant limitation when the objective is to accelerate large DBNs, it considerably restricts the range of experiments for the current single board implementation, since the on-chip memory only supports a weight matrix of size up to 512x512. This limitation is addressed in Section 2.5.2, which proposes a multi-chip architecture capable of accepting more nodes per FPGA by loading weights from DRAM. The system allows the user to specify other parameters as well, such as the learning rate. The representation of neurons as either fixed-point or binary numbers is also configurable, whereas the RBM implementation by Ly and Chow [42] only supported binary neurons. This widens the exploration space to non-binary numbers, which can be found in some software RBM implementations [27]. This generalization requires the use of multipliers instead of the simple AND gates used in Ly and Chow's implementation [42].

Modern FPGAs embed hard-wired multipliers that take up die space whether or not they are actually used. Thus, utilizing the multipliers did not increase the logic usage significantly compared to using AND gates.
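The 16-bit fixed-point format adopted in this section was selected from fixed-point simulations of the training algorithm (the dissertation used the Fixed-Point MATLAB Toolbox). The NumPy sketch below illustrates that style of experiment; the specific integer/fraction bit split and the toy matrices are assumptions made for this example, not the format documented here.

```python
import numpy as np

def quantize_fixed(x, int_bits=4, frac_bits=12):
    """Round x onto a signed fixed-point grid with int_bits + frac_bits = 16;
    values are clipped to the representable two's-complement range."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
weights = 0.1 * rng.standard_normal((512, 512))          # toy weight matrix
visible = (rng.random((16, 512)) > 0.5).astype(float)    # toy binary batch

exact = sigmoid(visible @ weights)
quant = sigmoid(quantize_fixed(visible) @ quantize_fixed(weights))
print("max |activation error| with 16-bit operands:", np.abs(exact - quant).max())
```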

2.4.2 Core Components

Matrix multiplication occurs in all three phases: the hidden neuron sampling phase, the visible neuron sampling phase, and the weight update phase. Thus, the inputs for the multiplication operations, which are the weights and the neurons, should reside in the embedded memories distributed across the FPGA. Although locating the inputs close to the multipliers is desirable, distribution of the weights is non-trivial due to a transpose operation that occurs during the visible neuron sampling phase.

The design by Ly and Chow [42] avoids the transpose problem by distributing the data such that no embedded RAM simultaneously reads out two or more elements from the same row with the same address, and no embedded RAM contains two or more elements of the same column. Then, by using a carefully designed addressing scheme, a column or row of the matrix is read directly out of the memory each cycle and no additional communication is required for the transpose. Although this approach eliminates communication for the transpose operation, it has two major drawbacks. One is that the weight and data matrices must be shifted before being written into the on-chip memories, requiring a sophisticated routing scheme from each RAM to the appropriate multiplier, since no row or column vector of the weight matrix contains the same index number. A more critical problem is that this approach assumes that the weight matrix fits on-chip and that the number of RAM blocks for the weight matrix is equal to the number of neurons, or at least O(n). However, if the network size scales to a point where the weight matrix no longer fits on-chip, then the weight matrix has to be streamed in from off-chip memory. Since the number of embedded RAMs can no longer equal the number of neurons, different weight matrix routing logic is required each time a portion of the weight matrix is streamed in, severely limiting scalability.

Figure 2.6: Matrix Multiplication for Computing (a) Hidden Neurons and (b) Visible Neurons

Although our single chip RBM implementation also assumes that the weight matrix fits on-chip, our approach solves the transpose problem in a way that scales to large networks where the weight matrix can be stored in off-chip DRAM. Figure 2.6 illustrates our approach.

To understand how our module works, the key observation is that a matrix multiplication can be viewed in several different ways; a matrix multiplication $C = A \cdot B$ (with $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$) can be considered as multiple linear combinations of vectors (2.1), as multiple vector inner products (2.2), or as a sum of vector outer products (2.3).

$$\begin{bmatrix} C_{1,j} \\ C_{2,j} \\ \vdots \\ C_{m,j} \end{bmatrix} = \sum_{i=1}^{k} B_{i,j} \begin{bmatrix} A_{1,i} \\ A_{2,i} \\ \vdots \\ A_{m,i} \end{bmatrix} \qquad (2.1)$$

$$C_{i,j} = \begin{bmatrix} A_{i,1} & A_{i,2} & \cdots & A_{i,k} \end{bmatrix} \cdot \begin{bmatrix} B_{1,j} \\ B_{2,j} \\ \vdots \\ B_{k,j} \end{bmatrix} \qquad (2.2)$$

$$C = \sum_{i=1}^{k} \begin{bmatrix} A_{1,i} \\ A_{2,i} \\ \vdots \\ A_{m,i} \end{bmatrix} \times \begin{bmatrix} B_{i,1} & B_{i,2} & \cdots & B_{i,n} \end{bmatrix} \qquad (2.3)$$

The matrix multiplication in the reconstruction phase ($HW^\top$) can be viewed as vector inner products (Eq. 2.2), where each row of $H$ and each column of $W^\top$ are multiplied element-wise, followed by a sum reduction. This suggests that each column of $W^\top$ and each row of $H$ should be spread out across separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Figure 2.6(b). For the hidden computation phase ($VW$), consider the transposed matrix operation ($W^\top V^\top$), and view the operation as a linear combination of vectors (Eq. 2.1). This requires that the $j$-th column vector of $W^\top$ be multiplied by the $j$-th element in a column vector of $V^\top$. This gives the structure of Figure 2.6(a), which computes multiple partial sums of hidden neurons in parallel. Since at each cycle we only need to read a column vector of $W^\top$ in both cases, the memory layout for the weights can remain the same, and no additional communication or routing is required for a transposed matrix multiplication.

Our approach requires more adders, since the adder requirements of the two phases are different. The reconstruction phase uses an adder tree to compute the reconstructed visible neurons; the pipelined structure yields one visible neuron every $\lceil m/256 \rceil$ cycles, where $m$ is the number of visible neurons. The hidden neuron computation phase requires accumulators instead, holding 256 partial sums; this also computes (on average) one hidden neuron every $\lceil n/256 \rceil$ cycles, where $n$ is the number of hidden neurons. This approach provides a scalable method for matrix multiply operations, with or without transpose, at the cost of additional hardware.
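The three factorizations above can be checked numerically; the short NumPy sketch below does so for a small random example. It mirrors only the algebra that motivates the hardware layout (reading one column of the transposed weight matrix per cycle serves both phases) and makes no claim about the RTL itself.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 5, 4, 3
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = A @ B

# Eq. (2.1): column j of C is a linear combination of the columns of A,
# weighted by the entries of column j of B.
C1 = np.stack([sum(B[i, j] * A[:, i] for i in range(k)) for j in range(n)], axis=1)

# Eq. (2.2): entry (i, j) of C is the inner product of row i of A and column j of B.
C2 = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# Eq. (2.3): C is the sum over i of the outer product of column i of A with row i of B.
C3 = sum(np.outer(A[:, i], B[i, :]) for i in range(k))

assert np.allclose(C, C1) and np.allclose(C, C2) and np.allclose(C, C3)
print("Eqs. (2.1)-(2.3) all reproduce A @ B")
```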

Resource       ALUTs (Combinational)   ALUTs (Memory)   Registers         Block Memory (kbits)
RBM Group      3893 (1.44%)            2205 (1.63%)     9552 (3.54%)      590 (3.54%)
Global         1564 (0.58%)            0 (0.00%)        2238 (0.83%)      0 (0.00%)
SOPC           6670 (2.47%)            320 (0.24%)      5740 (2.13%)      123 (0.74%)
System Total   71494 (26.48%)          35600 (26.37%)   160996 (59.63%)   9560 (57.38%)

Table 2.1: Single FPGA RBM Resource Utilization

The update phase involves multiplying the transpose of the visible neuron matrix by the hidden neuron matrix, which is essentially the sum of outer products between the visible and hidden neurons (Eq. 2.3). Since the visible data already has a datapath that broadcasts as in Figure 2.6(a), and a number of hidden neurons can be read simultaneously as in Figure 2.6(b), the update phase multiplication can easily reuse the structure in Figure 2.6(a), where hidden neuron values take the place of the weights.

The matrix multiplication results are provided to an activation function, which in our case is the widely used sigmoid function. Since the sigmoid function is expensive to implement in hardware (it requires exponentiation and division), we instead used an approximate sigmoid design called PLAN (Piecewise Linear Approximation of Nonlinear function) [4], which requires only a minimal number of addition and shift operations. In software simulations, we found that the convergence properties were not degraded by the use of this approximate sigmoid function. The stochastic characteristics of an RBM are also greatly influenced by the quality of the random number generator (RNG). We used the RNG described in [52], which is a combination of a 43-bit LFSR (Linear Feedback Shift Register) and a 37-bit CASR (Cellular Automata Shift Register), providing good statistical properties along with a cycle length of 2^80, which is sufficient for our application.
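For illustration, the sketch below compares an exact sigmoid with a PLAN-style piecewise linear approximation. The breakpoints and slopes shown are the commonly cited PLAN values (all slopes are powers of two, so the multiplications reduce to shifts); the dissertation does not list its exact table, so treat these numbers as representative rather than as the implemented design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plan_sigmoid(x):
    """PLAN-style piecewise linear sigmoid approximation.

    Slopes of 1/4, 1/8, and 1/32 reduce to shift-and-add in hardware,
    which is what makes this scheme attractive on an FPGA.
    """
    ax = np.abs(x)
    y = np.where(ax >= 5.0, 1.0,
        np.where(ax >= 2.375, 0.03125 * ax + 0.84375,
        np.where(ax >= 1.0,   0.125   * ax + 0.625,
                              0.25    * ax + 0.5)))
    return np.where(x < 0.0, 1.0 - y, y)

xs = np.linspace(-8, 8, 2001)
print("max |PLAN - sigmoid| on [-8, 8]:",
      np.abs(plan_sigmoid(xs) - sigmoid(xs)).max())
```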

2.4.3 Experimental Results

Table 2.1 summarizes the resource utilization of our single FPGA implementation, including the RBM computation engine and our instantiation of Altera's System-On-a-Programmable-Chip (SOPC) module. The SOPC includes a Nios II processor, a DDR2 memory controller, peripheral components, and the Avalon interconnect.

It should be noted that partitioning the design may increase the total logic count, since synthesis optimizations do not apply across partition boundaries. Thus, partitioning can be seen as a trade-off of scalability and performance against silicon area.

The FPGA implementation was verified by comparing its results with the reference MATLAB implementation from Hinton et al. [28]. The MATLAB RBM code was modified to use fixed-point representation; the FPGA version of the RBM was only deemed correct when its results matched the MATLAB output stream, given the same input.

Performance measurement was done in comparison against an Intel Core 2 processor clocked at 2.4GHz running a single-threaded version of the RBM application. MATLAB was used for the comparison since MATLAB is highly optimized for matrix operations and usually performs at least comparably to C implementations, if not better. For a fair comparison, both single and double precision versions of the MATLAB RBM were used. A fixed-point MATLAB version of the RBM was not considered for performance evaluation, since MATLAB does not currently support efficient fixed-point matrix operations. Three network sizes, 256x256, 512x512, and 256x1024, were tested and compared to see how our system performs on small, large, and asymmetric networks. Performance measurement covered only the execution of the algorithm itself; the time for data transfer between the SD card and the onboard DRAM was not taken into account³.

Figure 2.7 shows the speedup achieved by our implementation. Our RBM system runs 25 times as fast as the single precision software implementation, and 30 times as fast as the double precision implementation. Although power consumption was not measured in this experiment, we can infer from the multi-FPGA results in Section 2.5.5 that the estimated power for a single FPGA is around 10W, considerably lower than the Core 2 processor's power consumption (65W). This shows that the energy-efficient nature of custom logic enabled our RBM implementation to offer better performance using less power than modern general-purpose processors. Graphics processor performance was not evaluated in this experiment.

³ Data may also be transferred from the host computer to the FPGA via the JTAG-UART interface. Although this approach is useful for debugging the RBM module on small training datasets, the JTAG-UART interface is unreasonably slow for sending large, real-world datasets.

Figure 2.7: Speedup for 50 Epochs on 256x256, 256x1024, and 512x512 Networks (baselines: single and double precision CPU implementations)

Section 2.5.5 compares GPU performance with the multi-FPGA RBM implementation runtime, from which we can infer that each FPGA, running at 150MHz, performs comparably to GPUs running at 1GHz.

2.5 Extension to Large Scale RBM Architecture

As mentioned in the previous section, the size of RBMs in a single FPGA implementation is limited by the on-chip memory capacity. In order to support large RBMs, the accelerator design needs to be extended to a scalable multi-chip system. The following subsections describe the details of our multi-chip RBM architecture. Section 2.5.1 describes how multiple RBM modules can be interconnected to provide linear scalability. Section 2.5.2 discusses the trade-offs of two different strategies for placing the weights in DRAM, which is necessary to accommodate the quadratically growing weight matrix. Section 2.5.3 introduces a restricted form of sparse RBM that can be easily supported by the multi-RBM architecture with one tweak.


Figure 2.8: A High-level View of Inter-chip Communications Between Three FPGAs

Section 2.5.4 explains some of the limitations of the current multi-RBM implementation and their workarounds. Section 2.5.5 analyzes the performance results of our multi-FPGA RBM implementation.

2.5.1 Scalable Multi-chip Architecture

The modular approach used in our single-FPGA RBM design divides the work into multiple groups, localizing most operations such as matrix multiplication, the sigmoid function, and weight updates. Localization enables the user to easily migrate the same design to a future technology and take advantage of integration density improvements by adding more modules. The few operations that do require global communication are appropriately buffered to avoid long wiring.

Although our modular design approach is scalable within a chip, we cannot directly extend it to multi-chip systems. The main issue is how to deal with the global communication across chips, which includes the visible neuron broadcast (Figure 2.6(a)) in the hidden neuron computation phase and the tree add reduction (Figure 2.6(b)) for the visible reconstruction phase. Figure 2.8 illustrates our multi-FPGA system architecture for a network of three FPGAs. Our novel design builds on top of our previous single FPGA architecture.


Figure 2.9: Simplified Version of the Communication Between FPGAs

We see no inherent obstacles to extending our design to hundreds or thousands of FPGAs (although practical challenges are expected to arise when constructing such large systems). Our key insight is that it is possible to evenly distribute the computation across multiple FPGAs and require only two nearest-neighbor communication links for each FPGA, including a connection from the first FPGA to the last FPGA. The resulting interconnect topology is a ring network, as shown in Figure 2.8.

Let us first consider the hidden node computation phase. In the single FPGA implementation, visible neurons are broadcast during the computation of the hidden neurons. However, broadcasting to multiple FPGAs would severely limit the scalability of our system. Thus, broadcasting the neurons is only done within each FPGA (with the appropriate buffering). Instead of broadcasting to all FPGAs, each FPGA passes the visible neurons it has read or received to its neighbor in one direction. To avoid any initial idle cycles waiting for data, each FPGA first reads its own portion of the visible data from local memory and multiplies it with the appropriate weights, illustrated as the bold lines in Figure 2.9(a). In the meantime, each FPGA passes along the visible data it has processed and consumes new incoming visible data, as shown in Figure 2.9(b), until all the visible data has completely traversed the ring.

Reconstruction of the visible neurons on multiple FPGAs is done in a similar manner. Visible neuron computation requires a global add reduction. If we were to implement the global add reduction across multiple FPGAs with a method similar to the one used in the single FPGA implementation, then either the connections between the FPGAs would need to be almost all-to-all, or the global reduction would have to be performed and transferred at a slow rate due to shared wire contention, limiting the overall performance. Instead of performing the global add reduction all at once, we have each FPGA calculate the partial reduction for its final destination FPGA and pass this result to the neighboring FPGA⁴. Figures 2.9(c) and 2.9(d) illustrate how the partial reductions are passed to neighboring FPGAs. In Figure 2.9(c), each FPGA starts by computing the partial reduction for the furthest FPGA and passes its result to its neighbor. Then, in Figure 2.9(d), each FPGA computes the partial sum for the next furthest FPGA and adds it to the incoming partial sum. This continues until the partial sums accumulate at the final destination FPGA, at which point the visible node is reconstructed.

Since the hidden neuron computation requires one visible neuron broadcast per cycle, each FPGA only needs to send at most one visible neuron to its neighbor per cycle⁵; the data rate may be lower if there are more neurons than the number of multipliers per FPGA. The partial sums from the visible data reconstruction also require at most one partial sum communication per cycle. Communications flow in only one direction during a particular phase of the computation, although the direction changes periodically (hence a physical interconnect that supports only a single direction is not sufficient, but a full duplex link is not necessary).

4We chose the communication direction to be opposite of the hidden computation phase direction to support sparse RBMs, which we explain in Section 2.5.3. For fully dense networks, the direction of communication does not matter.
5The I/O bandwidth requirement may increase if weights are streamed from DRAM, as explained in Section 2.5.2.

than the time required to compute n neurons, where n is the number of neurons per FPGA. Thus, when deciding the number of neurons per chip, this latency factor must also be considered. Fortunately, since this study targets large-scale RBMs on FPGAs using high-speed LVDS connections, the communication latency is very unlikely to be a problem6. However, one may want a sufficiently large n when implementing the RBM design in ASICs, since ASICs typically support higher clock frequencies.

A parallel computation that requires only a ring topology for connecting FPGAs has several advantages. Modern ASICs and FPGAs have large off-chip I/O bandwidth, provided by many pins that can be clocked at high frequencies. However, the number of pins is limited, so a ring topology, as opposed to one with a higher number of connections per chip, is one of the few that allows the logical connections to be implemented directly as physical connections. This enables higher bandwidth and is cheaper; solutions involving high-bandwidth switches (such as 10GbE) can be prohibitively costly.
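To make the ring schedule concrete, the following C++ sketch models the two phases in software for a small ring of chips. The data layout and all names are our own illustration (activation functions and sampling are omitted); this is not code from the FPGA design.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Toy model of the ring schedule: P chips each own n visible and n hidden
    // neurons; chip c also stores the weight rows feeding its hidden slice.
    int main() {
        const int P = 4, n = 3, N = P * n;
        std::vector<std::vector<double>> W(N, std::vector<double>(N));
        std::vector<double> v(N), h(N, 0.0), v_rec(N, 0.0);
        for (int i = 0; i < N; ++i) {
            v[i] = 0.01 * (i + 1);
            for (int j = 0; j < N; ++j) W[i][j] = 0.001 * (i + 2 * j + 1);
        }

        // Phase 1: hidden computation.  At hop 0 each chip consumes its local
        // visible slice; at hop k it consumes the slice forwarded k times
        // along the ring (visible data travels in one fixed direction).
        for (int hop = 0; hop < P; ++hop)
            for (int c = 0; c < P; ++c) {
                int src = (c + hop) % P;              // owner of the slice seen this hop
                for (int i = c * n; i < (c + 1) * n; ++i)
                    for (int j = src * n; j < (src + 1) * n; ++j)
                        h[i] += W[i][j] * v[j];
            }

        // Phase 2: visible reconstruction.  The partial sum destined for chip d
        // starts at the furthest chip and accumulates contributions as it moves
        // around the ring in the opposite direction, arriving complete at d.
        for (int d = 0; d < P; ++d) {
            std::vector<double> acc(n, 0.0);          // travelling partial sums
            for (int step = 1; step <= P; ++step) {
                int c = (d + step) % P;               // step P is chip d itself
                for (int jj = 0; jj < n; ++jj)
                    for (int i = c * n; i < (c + 1) * n; ++i)
                        acc[jj] += W[i][d * n + jj] * h[i];
            }
            for (int jj = 0; jj < n; ++jj) v_rec[d * n + jj] = acc[jj];
        }

        // Sanity check against a direct global reduction.
        double err = 0.0;
        for (int j = 0; j < N; ++j) {
            double ref = 0.0;
            for (int i = 0; i < N; ++i) ref += W[i][j] * h[i];
            err = std::max(err, std::fabs(ref - v_rec[j]));
        }
        std::printf("max reconstruction mismatch: %g\n", err);
        return 0;
    }

The final check confirms that the hop-by-hop partial sums reproduce the result of a direct global reduction.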

2.5.2 Streaming Weights From DRAM and Associated Trade-offs

A major issue that was not addressed in Section 2.5.1 is that the weight matrix is assumed to fit on-chip. For a single-FPGA implementation, that assumption is reasonable as modern FPGAs provide ample on-chip memory. However, the weight matrix grows as O(n²), where n is the number of neurons. As we increase the number of FPGAs, the number of multipliers increases proportionally, so the number of neurons that can be computed per cycle also grows linearly. However, the embedded memory only increases linearly, while the weight matrix increases quadratically. To overcome this issue and scale to large systems, the weight matrix must be streamed in from off-chip memory.

However, the RBM architecture discussed so far assumes that M weights are consumed each cycle, where M is the number of multipliers in each FPGA. This may require a huge memory bandwidth that is beyond the

6Let us consider an FPGA with n = 256 running at 150MHz. Then the time to process n neurons is 256 × 6.7ns = 1715.2ns, which is much larger than the I/O link latency.

Figure 2.10: Computation of Processing (a) One Input Vector and (b) Two Input Vectors Per Cycle

limits of current FPGAs. Suppose we have 256 multipliers in a chip. To fully utilize the multipliers, we need to stream in 256 weights per cycle. Assuming a 200MHz RBM with 16-bit precision weights, the required memory bandwidth is around 100GB/sec, which is not supported by the Stratix III FPGA7.

One way to alleviate the memory bandwidth problem is to exploit the additional parallelism in the matrix multiplication by blocking the neuron matrix to process multiple training examples at once. This allows each weight to be fed into several multipliers rather than just one. Figure 2.10 compares processing one training example per cycle with processing two training examples per cycle. The two-input case halves the number of weights needed per cycle while still utilizing all the multipliers. In general, parallel processing of K training examples reduces the weight bandwidth requirement by a factor of K. Let us return to the previous example where 100GB/sec of bandwidth was required. The bandwidth requirement can be reduced to 6.25GB/sec when K = 16. This bandwidth is achievable on the DE3 board by using a DDR2 400MHz SDRAM, which can supply 128 bits at 400MHz, i.e., 6400MB/sec.

However, such a dramatic decrease in the DRAM bandwidth requirement comes with costs. One trade-off is that the on-chip storage for visible and hidden neurons

7The latest high-end FPGAs, such as the Altera Stratix V, now support such memory bandwidth.

Figure 2.11: Trade-off Between (a) Memory Bandwidth and I/O Bandwidth (b) Memory Bandwidth and On-chip Storage

increases, since multiple training examples are read from the local memory instead of one. For the range of K (the number of training examples processed in parallel) we consider in our design, this only takes up a fraction of the available embedded memory. A more critical trade-off is the increase in I/O bandwidth requirements. Recall that each chip sends one visible neuron to its neighbor each cycle. However, since the RBM is now processing multiple visible training examples at once, the I/O bandwidth requirement increases linearly with K. The same applies to the partial sums in the visible computation phase. Thus, the total memory and I/O bandwidth cost for M multipliers per FPGA is

    M/K + K                                                            (2.4)

where the unit of the cost is 16 bits of data per cycle. To minimize the total bandwidth requirement, we simply set M/K = K, which leads to K = √M.

Figure 2.11(a) illustrates the memory and I/O bandwidth trade-off as K is changed. The numbers of multipliers were selected to be 256 and 1024 to reflect the latest FPGAs on the market. As can be seen in the plot, only the memory bandwidth requirement depends on the number of multipliers. The I/O bandwidth remains constant while we change

the number of multipliers. However, the memory and I/O bandwidth requirements both increase linearly with the clock frequency. Therefore, for a given communication capacity, it is generally more efficient to use FPGAs with more multipliers at a reduced clock frequency than to attempt to gain performance by increasing the clock frequency.
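The cost model in Eq. (2.4) is easy to tabulate. The short C++ sketch below prints the DRAM and chip-to-chip I/O rates implied by a few values of K under assumed parameters (M = 256 multipliers, 16-bit words, a 200MHz clock); it is only a back-of-the-envelope aid, not part of the design.

    #include <cmath>
    #include <cstdio>

    int main() {
        const int M = 256;                  // multipliers per FPGA (assumed)
        const double f_hz = 200e6;          // RBM clock frequency (assumed)
        const double bits = 16.0;           // weight / neuron precision

        std::printf("optimal K = sqrt(M) = %.0f\n", std::sqrt((double)M));
        std::printf("%6s %14s %14s\n", "K", "DRAM (Gbps)", "I/O (Gbps)");
        for (int K = 1; K <= M; K *= 4) {
            double dram = (double)M / K * bits * f_hz / 1e9;  // weights streamed per second
            double io   = (double)K     * bits * f_hz / 1e9;  // neurons/partial sums per second
            std::printf("%6d %14.1f %14.1f\n", K, dram, io);
        }
        return 0;
    }

With these assumptions the K = 16 row lands close to the roughly 6.25GB/sec figure quoted above, and the printed optimum K = √M = 16 matches the analysis.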

(1) (1) VW v1 v : first training example (a) Multiply Weight Buffer Array W1,1 1 W2,1 W1,1 v (1)W (1) 1 1,1 vi Wi,1 i1

W 1,2 1 W2,2 W (1) (1) 1,2 v1 W1,2 vi Wi,2

i1 DRAM

W1,3 1 W1,3 v (1)W (1) 1 1,3 vi Wi,3 i1

W1,4 1 W1,4 (1) (1) v1 W1,4 vi Wi,4 i1

(2) (2) VW v1 v : second training example (b) Multiply Weight Buffer Array W1,1 1 W2,1 W1,1 v (2)W (2) 1 1,1 vi Wi,1

i1

W 1,2 1 W2,2 W (2) (2) 1,2 v1 W1,2 vi Wi,2

DRAM i1

W1,3 1 W2,3 W1,3 v (2)W (2) 1 1,3 vi Wi,3 i1

W1,4 1 W2,4 W1,4 (2) (2) v1 W1,4 vi Wi,4 i1

Figure 2.12: Buffering Weights and Partial Results

Although this memory-I/O bandwidth trade-off is a reasonable approach for modern FPGAs, there may be situations where I/O bandwidth is severely limited. In such

cases, one may instead trade off the memory bandwidth requirement against the on-chip memory storage requirement. Recall that the reduction of memory bandwidth was made possible by reusing the weights on several multipliers. Instead of simultaneously processing multiple training examples, we can still reuse weights by alternating between training examples each cycle and buffering the weights and partial results. Figure 2.12 illustrates how buffering can help reduce the weight bandwidth. Figure 2.12(a) processes the first training example and buffers the first half of the weights, while Figure 2.12(b) processes the second training example and buffers the second half of the weights. The same weights are used for these two training examples; thus, the weights for each hidden variable only need to stream in every other cycle, halving the bandwidth requirement. The I/O bandwidth remains one neuron per cycle, but the on-chip storage is doubled to buffer two partial results instead of one. In general, alternating between K training examples requires a memory bandwidth of M/K and on-chip storage of M · K, where M is the number of multipliers. The trade-off between memory bandwidth and on-chip buffering is shown in Figure 2.11(b). As can be seen, both the memory bandwidth and storage requirements increase linearly with the number of multipliers, while the clock frequency only affects the memory bandwidth. Therefore, for a given on-chip storage capacity, it is more effective to increase the clock frequency than to use FPGAs with more multipliers.

In conclusion, increasing the number of training examples (K) helps reduce the memory bandwidth requirement. However, K is limited by the on-chip memory space, and increasing it may also increase the I/O bandwidth requirement. The right balance for each system depends on the specific values of the chip-to-chip and memory bandwidth.
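The two weight-reuse schemes can be compared side by side with a few lines of arithmetic; the sketch below assumes M = 256 multipliers and 16-bit words, and simply lists, for each K, the per-cycle DRAM traffic common to both schemes, the extra I/O words required by batching, and the extra on-chip buffer words required by alternating.

    #include <cstdio>

    int main() {
        const int M = 256;  // multipliers per FPGA (assumed)
        std::printf("%4s %16s %16s %16s\n",
                    "K", "DRAM words/cyc", "I/O words/cyc", "buffer words");
        for (int K = 1; K <= 64; K *= 2) {
            int dram = M / K;           // both schemes stream M/K weights per cycle
            int io = K;                 // batching: K neurons / partial sums per cycle
            long buf = (long)M * K;     // alternating: M*K buffered partial results
            std::printf("%4d %16d %16d %16ld\n", K, dram, io, buf);
        }
        return 0;
    }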

2.5.3 Locally Dense Sparse Network

Although RBMs have all-to-all connections between the visible and hidden layers, it is unlikely that all connections will be actively used. Training RBMs for applications where locality is important may result in a sparse representation of the weight matrix. We can exploit this sparseness of the weight matrix to increase the efficiency of computation.


Figure 2.13: Locally Dense Sparse Network (a) Implementation (b) Conceptual Figure

One simple way to make use of the sparseness is to limit the fanout of each neuron to a constant number C. In addition, for simplicity, we restrict the connections of each neuron to its C nearest neighboring neurons. Then dense connections, if any, will occur only at the neighboring nodes. This is over-restrictive in the sense that the algorithm does not necessarily converge to such a locally dense representation in general. However, we believe that this setting fits well in applications where locality plays an important role, such as visual recognition [39, 48].

This restricted sparse configuration also maps well to our architecture. If we set C to be a multiple of the number of neurons per chip, we only need to configure the number of chips the data is passed down before stopping the computation phase. Figure 2.13 illustrates this approach. The boxes in Figure 2.13(a) represent the chip borders and the circles are the neurons. The left and right ends of the network are wrapped around such that all nodes have constant fanout. Figure 2.13(b) is an equivalent diagram of the network in Figure 2.13(a), leaving out the chip borders. As shown in Figure 2.13(a), the two layers are not completely connected to each other, but have a constant fanout of C = 2 × 3 = 6. Although the network shown in Figure 2.13 is not itself “sparse”, conceptually it shows how a locally dense sparse network will operate when many chips are connected to the system. By adding this one control, we can easily implement the locally dense sparse network.

The visible neurons and hidden partial sums are communicated in opposite directions, as mentioned in Section 2.5.1. This is because the hidden partial sums are used to calculate the reconstructed visible neurons, so the final summation must occur on

the FPGA where the target visible neurons are located. Visible neurons, on the other hand, are used to compute the hidden variables, and are passed along the network until they reach the FPGA with the furthest connected hidden neuron.

Another reason the locally dense sparse network is attractive is that all-to-all dense RBMs are costly to scale due to the O(n²) memory requirement for storing the weight matrix. Assuming each chip has 256 neurons and 1GB of dedicated DRAM, the maximum number of chips (B) that can fit the weight matrix is around 4000, assuming weights occupy half of the DRAM space8. For a locally dense sparse network, the weight matrix can be considerably smaller, since it grows only as O(n) with the number of neurons due to the constant fanout. This allows us to build a very large scale network beyond 4000 chips.

2.5.4 Limitations of Multi-chip RBM

Architectural Limitations

Although the multi-FPGA system described in the previous sections allows linear scalability, there are some practical issues that may prohibit such scaling.

One issue is synchronization between the chips. A simple synchronization method is to use a common clock. This is possible because the running time of each phase in the RBM algorithm is completely deterministic, so as long as the chips are synchronized to each clock edge, the chips are executing the same computational phase. However, for some very large systems, distributing a common clock to every chip may not be feasible. In such cases, multiple clock domains have to be defined, in which case the boards may not be completely in sync. Thus, each chip of the multi-chip RBM architecture must stall whenever its incoming queue is empty or its outgoing queue is back-pressured, affecting the overall performance.

Another issue is that the system described above does not provide any fault tolerance mechanism. If one chip fails, then the entire system fails. One solution is to group identical chips and checkpoint the state of the system periodically. During

8Derived from B · 2^29 bytes > (256 · B)^2 × 16 bits, i.e., B < 2^29/2^17 = 4096.

The figure plots reconstruction error against training epochs for simulated bit error rates between 6 × 10−12 and 6 × 10−8.
Figure 2.14: Bit Error Rate Influence on Convergence

normal execution, a group of chips would stop and dump its state into a shared memory. In addition, each group may have a certain number of redundant RBM chips that are connected to the ring but are normally bypassed. When the network determines that a chip has failed, the group reverts to its previous checkpoint state and replaces the failed chip with a redundant chip. The network will skip the set of neural data of the phase in which the failure occurred, exploiting the fact that neural networks, including the RBM algorithm, generally exhibit the soft computing property [55]. Thus, loss of data in the middle of training an RBM is not critical, although it may increase the amount of time required to reach convergence.

The current RBM system does not provide any protection against errors in data transmission either; however, the soft computing property enables the RBM algorithm to tolerate bit errors to a certain degree. Thus, as long as the transmission-medium bit error rate is low enough, one may send data without any protection to eliminate communication overhead. Our simulation experiments show that the RBM manages to converge for a bit error rate9

9The generated bit error rates are approximate rather than exact, since the simulation assumed that at most one bit of each 16-bit word can be flipped.

of less than 6 × 10−9, as shown in Figure 2.14.

Platform Restrictions

At the time of developing the multi-FPGA RBM, six DE3 boards were available to us. One board was used to generate a common clock, and another board became unresponsive, leaving us with four usable DE3 boards. We used these four FPGAs to implement the multi-FPGA RBM; this is small compared to the design's potential scalability, but it is enough to demonstrate the ideas mentioned earlier.

Another limitation is due to the communication between FPGAs. Although the DE3 board supports up to 30 LVDS pairs per HSTC connector10, we had difficulty communicating data successfully over more than four LVDS pairs. The problem seemed to reside in the PLL configuration, although we never had a chance to find its root cause. Due to time constraints, we decided to use only four LVDS pairs, resulting in a total datarate of 4.8 Gbps per direction.

Limiting the communication bandwidth only allows sending one 24-bit partial sum per cycle. In addition, the RBM clock frequency had to be reduced to 150MHz to stay within the I/O bandwidth constraints11, such that the computation pipeline never stalls due to an overflow of data and does not require complicated control signals. Since one data element is communicated per cycle in each direction, the weight datarate from DRAM can only be reduced below the DE3 DDR2 bandwidth by buffering at least 16 elements per multiplier. Fitting the additional logic resources required for buffering into an already crammed FPGA is time-consuming and may not be feasible. Thus, the current RBM implementation does not support streaming weights from DRAM; this is left as future work. In fact, the sparse matrix accelerator, introduced in Chapter 3, applies similar DRAM streaming concepts in an Altera Stratix IV FPGA, which has more logic and memory resources than the FPGA used on the DE3

10Nine of the 30 LVDS pairs are true LVDS, while the other 21 pairs are emulated (max 300Mbps). This gives a theoretical maximum I/O bandwidth of 17.1Gbps.
11Although theoretically the required I/O bandwidth is exactly the partial-sum data rate (24 × 200MHz = 4.8Gbps), the communication between FPGAs involves not only the payload data but also the packet headers. Due to this overhead, 24-bit partial sums cannot be sent every cycle unless the clock frequency is reduced.

board. Our current multi-FPGA RBM implementation instead utilizes the internal memory of each FPGA. Fortunately, the embedded memory capacity is large enough to store the weights for four FPGAs. Unfortunately, the embedded memory capacity only allows the minimum of 256 neurons per FPGA, since the weight matrix stored in each FPGA includes the connections to the three other FPGAs (a total of 4 × 256 × 256 connections per FPGA). Thus, the flexibility to adjust the number of neurons, as described in Section 2.4.1, is lost.

Although 16-bit fixed point numbers are adequate for the purposes of this study, this may no longer be true for very large scale systems. In such cases, the arithmetic units should support a format with better precision, such as single-precision floating point numbers12. Although the computation capacity per chip may be reduced due to the larger arithmetic units, the architectural features discussed in the previous sections, as well as the near-infinite linear scalability, still hold true.

2.5.5 Experimental Results

To evaluate our implementation, we used the FPGA boards to train on the MNIST handwritten digits dataset. To verify that the results are indeed correct, we modified a reference MATLAB implementation from Hinton et al. [28] into a 16-bit fixed point version so that we could compare the output of the algorithm given the same input. The implementation was considered correct only when the hardware and software implementations gave the same outputs for the same input.

Performance was compared against a 2.3GHz Intel Xeon E5345 processor and an NVIDIA GeForce GTX 275 GPU, which has 240 CUDA processor cores running at 1.4GHz. The RBM module, written in C++, used the GotoBLAS2 [23] library and the NVIDIA CUBLAS library for optimized matrix operations. However, these BLAS libraries only support floating point numbers, so single-precision routines were used.

Using the four FPGA boards available to us, we tested our multi-FPGA prototype system with the following network configurations: dense 768x768, dense 1024x1024, and “sparse” 1024x1024 with connections between two neighboring boards. Although

12The sparse matrix accelerator in Chapter 3 supports half-precision floating point numbers.

(a) Batch size K=100

            Network size     768x768    1024x1024   1024x1024(s)
  CPU       runtime (s)      3578.52    5424.8      4332.91
            Gmult/s          2.47       2.90        1.82
  GPU       runtime (s)      152.91     236.13      185.10
            speedup          23.40      22.98       23.41
            Gmult/s          57.85      66.61       42.49
  FPGAs     runtime (s)      76.97      102.58      51.36
            speedup          46.49      52.88       84.37
            Gmult/s          114.95     153.33      153.13

(b) Batch size K=16

            Network size     768x768    1024x1024   1024x1024(s)
  CPU       runtime (s)      4582.09    7863.75     5755.13
            Gmult/s          1.93       2.00        1.37
  GPU       runtime (s)      309.87     569.11      522.46
            speedup          14.79      13.82       11.02
            Gmult/s          28.55      27.64       15.05
  FPGAs     runtime (s)      76.97      102.58      51.36
            speedup          59.53      76.67       112.07
            Gmult/s          114.95     153.33      153.13

Table 2.2: Performance of Single Core CPU, GPU, and Multi-FPGA

these are not large networks, they are sufficient to demonstrate our architecture in practice. The experiments were conducted for a fixed number of 50 epochs and for two batch sizes, 16 and 100. For the multi-FPGA implementation, a batch size of 100 was not tested since the on-chip memory cannot hold both large batches and the weight matrix at the same time. The sparse RBM was run on the CPU and GPU by computing only the required operations. Table 2.2 shows the results for the 768x768 and 1024x1024 dense networks and the 1024x1024 locally dense sparse network.

Runtime was measured to compare the speedup of the GPU and multi-FPGA systems against a single CPU core. The average number of multiplications per second (mult/s) was used as a universally comparable performance metric that does not depend on the problem size; this is similar to the widely used metric CUPS (connection updates per second), but mult/s does not depend on the batch size.

As seen in Table 2.2(b), the CPU and GPU perform poorly with a batch size of 16. This is because small batch sizes tend to perform less well on SIMD-like cores, since matrix-vector multiplication does not provide enough parallelism. A batch size of 16 is sufficiently small that this lack of available parallelism becomes apparent in the results. However, the multi-FPGA implementation does not exhibit this problem, since the overhead to initiate a matrix multiplication is very small and the design always uses all the multipliers to achieve maximum performance. This is a practical advantage for the multi-FPGA architecture since smaller batches may require fewer epochs to converge, so small batch sizes may be favored in practice by end users. Our experiments show that the error level at epoch 50 for batch size 100 can be achieved with only 36 epochs for batch size 16. Thus, if we were to run the algorithm until a certain error rate is reached, the speedup for the multi-FPGA implementation would be even higher.

It is important to note that the networks we are experimenting with are not large scale. Since graphics processors tend to perform better with larger matrices, the speedup may differ when scaled to larger networks. Therefore, Table 2.2 is only a reference that shows how our design's performance scales with problem size. In fact, if we increase the network size to 3072x3072, then the graphics processor shows around a 52.7X speedup compared to the Intel Xeon processor. However, graphics processors in high-end discrete graphics cards have a limited amount of on-board DRAM, which limits the size of the network to approximately 10K – 20K neurons per layer. As we will see shortly, graphics processors also do not scale very well to a large number of nodes, which may be an issue for investigating very large RBMs. In addition, our RBM architecture does not rely on FPGA technology and can be implemented as an ASIC at a higher clock frequency; based on the numbers in Table 2.2, we can expect to get similar per-chip performance by using a 500MHz clock on the RBM accelerator13. Therefore, although graphics processors may show better performance than our 150MHz FPGA implementation on a per-chip basis for larger networks, this does not invalidate the efficiency of our multi-chip RBM architecture

13This estimate assumes that the RBM performance will increase linearly with the clock frequency, up to the point where memory bandwidth becomes the bottleneck.

Table 2.3 (speedup over one node):
  Platform     2 nodes   3 nodes   4 nodes
  CPU          1.59      1.80      2.14
  GPU          1.64      0.61      0.62
  FPGA         2.01      3.02      4.03

Table 2.4 (power, Watt):
  Platform     1 node    2 nodes   3 nodes   4 nodes
  CPU          225       450       675       900
  GPU          434       619       1104      1340
  FPGA         9.96      19.92     29.88     39.84

Table 2.3: Scalability: Speedup Against One Node of Each Platform
Table 2.4: Power Consumption of Each Platform (Watt)

for large scale networks, nor the claim that custom logic provides better performance and energy-efficiency.

Table 2.3 illustrates how CPUs, GPUs, and our FPGA architecture scale with the number of nodes. Four CPU machines, using one core per node, and two GPU machines, with two NVIDIA GTX 275 cards each, were fully connected via a Gigabit Ethernet switch to perform the scalability test. OpenMPI was used for communication, and the data was carefully distributed to minimize communication. Network sizes for the CPUs and GPUs were chosen such that the matrices were not so small as to be inefficient, but not so large as to cause overwhelming communication overhead14. Our FPGAs, on the other hand, currently have a fixed configuration of 256 neurons per node, so the network size varies with the number of boards (768x768 for 3 FPGAs, 1024x1024 for 4 FPGAs). Since the total number of neurons increases with the number of FPGAs in our architecture, we also increase the number of neurons for the CPUs and GPUs with the number of nodes to directly compare their scalability with the FPGA. To ensure a fair measurement of scalability despite the differences in network size, we use mult/s as the metric to compare the multi-node performance against the single-node performance of each platform.

As shown in Table 2.3, the CPU shows sublinear scalability as we increase the number of nodes. The GPU also showed a sublinear speedup for 2 nodes, but revealed a major performance loss when crossing machine boundaries as the number of nodes increases from two to three. This implies that communication becomes the bottleneck for large GPU systems [54]. The multi-FPGA RBM, on the other hand, showed good scalability up to four nodes, and is expected to scale well to a

14The CPU processed a 768x768 RBM per node, and the GPU processed 1536x1536 per node.

very large number, as communication only occurs between neighboring nodes at a consistently reasonable datarate.

Table 2.4 shows the power consumption of each platform. The numbers displayed for the CPU and GPU nodes include the power consumption of the system components, such as the motherboard and DRAM. The power consumption of the FPGA system is measured for the FPGA alone, and does not include the power consumption of the on-board DRAM or the controlling host computer. Although the numbers in Table 2.4 cannot be directly compared, one can still easily infer that the multi-FPGA RBM is a more energy-efficient solution than the other two general purpose platforms. In general, the configurable nature of FPGAs and custom ASICs allows energy- and area-efficient computation, such as fixed point arithmetic instead of full floating point operations. The energy-efficient nature of custom logic [26], in addition to the scalability of our design, makes our approach desirable for large-scale DBN implementations, as well as for special DBN cores in future heterogeneous processors.

2.6 Related Work

There has been considerable interest in accelerating the training of neural networks using customized hardware. In 1992, Cox and Blanz [15] demonstrated an FPGA implementation of a layered neural network for performing classification tasks. In 1994, Lysaght et al. [43] showed that dynamic reconfiguration of FPGAs could be used to train larger layered networks. Zhu and Sutton [57] provide a survey of FPGA implementations of neural networks trained using backpropagation. Graf et al. [24] introduced a single-FPGA design optimized for support vector machine (SVM) training and convolutional neural network processing. Systolic arrays have also been explored to exploit the parallelism in neural networks [56].

Since the introduction in 2006 of Hinton et al.'s fast learning algorithm [28] for DBNs (Deep Belief Nets), there has been renewed interest in neural networks. Ly and Chow [42] introduced an FPGA architecture for training DBNs. Our single chip

RBM work [33] improved the single-FPGA architecture by generalizing the data representation and adding runtime flexibility for the major learning parameters. In addition, our RBM architecture addresses the scalability issues in [42].

Ly and Chow extended their work to multiple FPGAs [41], where a partitioning algorithm is used to distribute the work amongst multiple FPGAs while minimizing the communication. However, their inter-chip network requires communication resources that increase quadratically with the number of neurons, making it difficult to scale to large networks. Instead of focusing on minimizing the amount of communication, our multi-FPGA work [34] localizes the communication to allow scalability.

2.7 Summary

Deep Belief Nets are popular machine learning tools that are built from Restricted Boltzmann Machines. The computation-intensive nature of RBMs has made it difficult to investigate very large scale Deep Belief Nets. We introduced a specialized architecture for RBMs that enables building a scalable and large DBN. The RBM core is carefully modularized such that it can easily scale in future devices. Inter-chip communication is performed over a ring topology and exhibits linear scalability. In addition, the multi-chip RBM architecture supports a restricted type of sparse RBM, where locality of connections is important. We demonstrated our ideas by implementing the RBM architecture on Altera Stratix III FPGAs. The single-FPGA implementation has shown a 25X speedup compared to a single-precision software implementation running on an Intel Core 2 processor. Our four-FPGA implementation has shown 46X-112X speedups compared to an Intel Xeon E5345 processor. In comparison to an NVIDIA GTX 275, the speedup is up to 5.5X. In addition, the four-FPGA implementation has shown linear scalability, while the CPU and GPU implementations suffered sublinear scalability, especially when crossing machine boundaries. We expect that our scalable architecture can be used to tackle very large machine learning applications that may have previously been difficult to approach. This is in contrast to previous RBM architectures, whose required communication resources scale with the square of the network size and hence are infeasible to implement for large networks.

Chapter 3

Sparse Matrix Accelerator

3.1 Introduction

Sparse matrices have long been an interesting subject to researchers in many fields. According to Davis and Hu [17], both the number and the sizes of the sparse matrices in the University of Florida Sparse Matrix Collection have continually increased since 1970. In addition, a great number of the sparse matrices in the collection appear to have been created within the last decade, which reflects the increasing importance of sparse matrices.

CPUs and GPUs are known to have good performance on certain sparse matrix operations. For example, the SpMV (Sparse Matrix - Vector Multiplication) routines found in the Intel MKL and CUSPARSE libraries are highly optimized for Intel SSE and NVIDIA CUDA cores, respectively. Bell and Garland [7] demonstrated how SpMV can effectively be mapped to CUDA cores.

However, due to the SIMD nature of the optimizations used in these libraries, they suffer greatly from load imbalance when the number of non-zeros varies considerably across the rows or columns of a sparse matrix. In addition, these math libraries lack support for sparse matrix - sparse matrix multiplication (SSMM1). Due to the irregular computational patterns in multiplying two sparse matrices, it is challenging

1We use this abbreviation for sparse matrix - sparse matrix multiplication, since some publications use SpMM for sparse matrix - dense matrix multiplication.


to map the computation efficiently onto a SIMD architecture. The CSparse library [16] is a highly cited and widely used sparse matrix library which includes an SSMM routine, but it does not utilize the underlying SIMD or multi-threading capabilities. The CUSP library [8] is capable of performing sparse matrix - sparse matrix multiplication on CUDA hardware, but in our experiments it rarely showed performance improvements over a single-threaded CPU implementation.

Given the increasing importance of sparse matrix computation, specialized hardware is needed to support the increasing number of sparse, irregular matrix operations. We propose an accelerator architecture that effectively exploits the fine-grain parallelism in sparse matrix computation. The accelerator tolerates the irregularity and load imbalance of sparse matrix computation with an efficient buffering and work-stealing mechanism, at the cost of limiting the acceleration to several fixed types of matrix operations. The feasibility and performance of the accelerator were demonstrated by implementing a prototype design on an FPGA board. Details of the accelerator architecture and its implementation are discussed in Sections 3.2 and 3.3.

3.1.1 Sparse Matrix Applications

As mentioned earlier, recent math libraries for CPUs and GPUs support a number of sparse matrix operations that match well with SIMD-style computation. Thus, our focus is on three sparse matrix applications that CPUs and GPUs cannot easily accelerate: the Sparse RBM [48], Betweenness Centrality [22], and Markov Clustering [53]. Appendix A lists the pseudo-code for these applications using MATLAB-style syntax.

Sparse RBM (sRBM) is a variation of the Restricted Boltzmann Machine, the basic building block of Deep Belief Networks, as described in Chapter 2. sRBM differs from the original algorithm by forcibly limiting the connections between layers to randomly selected receptive fields. The enforced sparseness improves robustness to noise in visual recognition.
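The receptive-field restriction can be pictured as a fixed binary mask over the weight matrix. The C++ sketch below builds such a mask with randomly placed contiguous fields; the field width and layout are our own assumptions for illustration, not the configuration used in [48].

    #include <cstdio>
    #include <random>
    #include <vector>

    // Illustrative construction of a sparse visible-to-hidden connection mask:
    // each hidden unit is wired to a randomly chosen contiguous receptive field
    // of `field` visible units; all other weights are forced to zero.
    int main() {
        const int n_visible = 1024, n_hidden = 1024, field = 64;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, n_visible - field);

        std::vector<std::vector<bool>> mask(n_hidden, std::vector<bool>(n_visible, false));
        for (int h = 0; h < n_hidden; ++h) {
            int start = pick(rng);                          // receptive field origin
            for (int j = start; j < start + field; ++j) mask[h][j] = true;
        }

        long nnz = 0;
        for (auto& row : mask) for (bool b : row) nnz += b;
        std::printf("mask density: %.2f%%\n",
                    100.0 * nnz / ((double)n_visible * n_hidden));
        return 0;
    }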

The figure breaks down runtime into matrix multiplication (M*M), EXP, DIV, SUM/MAX, and element-wise operations for the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering.

Figure 3.1: CPU Execution Time Breakdown on Sparse Matrix Applications

Betweenness Centrality (BC) is a well-known graph algorithm in the social networking domain. BC measures the centrality of every node by finding the number of shortest paths that pass through each node. Mathematically, the betweenness centrality of vertex v is defined as

    BC(v) = Σ_{s≠v≠t∈V} σst(v) / σst                                   (3.1)

where σst is the total number of shortest paths from vertex s to vertex t, and σst(v) is the number of shortest paths from s to t that pass through v.

Markov Clustering (MCL) is a graph clustering algorithm frequently found in biological publications. MCL emulates a random walk by expanding a transition probability matrix M, i.e., computing M². After each expansion, strong transitions are reinforced and weak transitions are pruned to maintain sparseness. By repeating this two-phase computation until convergence, most of the less-likely transitions are eliminated, forming clusters of strongly connected nodes.
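As a concrete reference for the expand/prune loop described above, here is a minimal dense-matrix version of one MCL iteration in C++; the inflation exponent and pruning threshold are assumed parameters, and a real implementation would keep M in a sparse format throughout.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // One MCL iteration on a column-stochastic transition matrix M (dense here
    // purely for clarity): expand (M = M*M), inflate (element-wise power, then
    // re-normalize columns), and prune entries below a threshold.
    Matrix mcl_step(const Matrix& M, double r = 2.0, double prune = 1e-4) {
        const int n = (int)M.size();
        Matrix E(n, std::vector<double>(n, 0.0));
        for (int i = 0; i < n; ++i)                        // expansion: E = M * M
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    E[i][j] += M[i][k] * M[k][j];

        for (int j = 0; j < n; ++j) {                      // inflation, column by column
            double colsum = 0.0;
            for (int i = 0; i < n; ++i) { E[i][j] = std::pow(E[i][j], r); colsum += E[i][j]; }
            for (int i = 0; i < n; ++i) {
                E[i][j] = (colsum > 0.0) ? E[i][j] / colsum : 0.0;
                if (E[i][j] < prune) E[i][j] = 0.0;        // pruning maintains sparseness
            }
        }
        return E;
    }

    int main() {
        // Tiny 3-node example: two strongly connected nodes plus a weak link.
        Matrix M = {{0.5, 0.45, 0.1}, {0.45, 0.5, 0.1}, {0.05, 0.05, 0.8}};
        for (int it = 0; it < 10; ++it) M = mcl_step(M);
        for (auto& row : M) {
            for (double x : row) std::printf("%6.3f ", x);
            std::printf("\n");
        }
        return 0;
    }

Repeating mcl_step until M stops changing yields the clusters; the pruning step is what maintains sparseness across iterations.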

As can be seen from the pseudo-code in Appendix A, most of the computation in the three applications consists of basic operations. The execution breakdown in terms of basic linear algebra operations for the sparse matrix applications is shown in Figure 3.1, which summarizes the types of matrix operations used and their respective runtimes. As shown in the figure, matrix multiplication dominates the execution time. Therefore, it is natural for the accelerator to focus on increasing matrix multiplication performance, which has the greatest impact on overall performance. However, sRBM and BC also spend considerable amounts of time on other matrix operations. By Amdahl's law, the accelerator must accelerate these other matrix operations as well to achieve a significant overall speedup.

Studying the applications in more depth revealed a few other requirements the accelerator must satisfy to reduce inefficiencies and realize better performance. Support for most of these requirements was difficult to find in existing hardware and software.

First, the accelerator should be able to efficiently write matrices to memory in a sparse format. Although this may seem like a simple task, doing so in a parallel manner is non-trivial. Most compressed sparse formats impose some type of ordering on the non-zero elements. During a sparse matrix computation, the offset of the non-zero elements for each row (or column) is computed on-the-fly according to the ordering of the format. Thus, computation results cannot be written to memory until the number of non-zero elements is known for all previous rows (or columns). This requires a considerable amount of synchronization between computational threads, which may incur significant overheads on conventional CPUs and GPUs. In addition, stalling of computational threads may deteriorate performance further if adequate on-chip buffering per thread is not provided; such problems are especially challenging for GPUs, which have a limited amount of on-chip storage per thread.

Writing to a dense matrix format and then converting it back to a sparse format is also not a feasible option, because the dense format may exceed the available memory capacity for very large matrices. Even if the memory is sufficiently large to accommodate the matrices of interest, the number of memory accesses increases enormously due to the dense format. For example, performing a sparse matrix multiplication (SSMM) on sparse matrices that are relatively dense (2% density for

  App     1        2       3        4        5        6       7
  BC      0.414%   6.80%   30.82%   32.98%   11.74%   2.42%   0.51%
  MCL     0.19%    2.24%   41.40%   10.00%   1.00%    0.20%   0.10%

Table 3.1: Density of Sparse Matrices in BC and MCL

source matrices, 3% density for the result matrix) showed that writing the matrix in dense format increased the number of write memory accesses by 15 times2.

Our accelerator was designed to efficiently write both sparse and dense matrices with dedicated encoding hardware. Although stalls due to dependencies are inevitable, the sparse matrix hardware tries to minimize idle cycles by providing adequate buffering together with a work-stealing mechanism among threads. In addition, the use of customized logic allows offsets to be calculated in parallel with light-weight synchronization, as explained in Section 3.2.6.

Table 3.1 illustrates the non-zero density of the sparse matrices used in Betweenness Centrality (BC) and Markov Clustering (MCL). The matrices shown for BC were sampled from the first inner loop shown in Figure A.3 of Appendix A, and the matrices for MCL correspond to the first seven M matrices in Figure A.1. As can be seen from the table, the density varies greatly across iterations for each application. Matrices can range from fairly dense (e.g., 41.4% non-zeros) to very sparse (e.g., 0.1% non-zeros). Denser matrices tend to take significantly more time; for example, the two densest matrices in BC make up approximately 65% of the runtime. However, the execution time of the other, sparser matrices adds up to the remaining 35% of the runtime, which is still a considerable portion when accelerating an application. Therefore, the hardware must perform well on a wide range of sparseness.

In addition, the graphs used in these applications tend to show large variance within their sparse matrices. Figure 3.2 shows the non-zero distribution for the densest matrix in Table 3.1 for BC and MCL. The number of non-zero elements for each row and column has been sorted and lumped into groups for better visualization. Non-zero elements are usually not evenly distributed across rows or columns, but are likely to be concentrated in a certain number of rows or columns, as shown in the figure. Large

2The write memory accesses were about 7% of all data memory accesses for the sparse matrix case. This increased to 73% when writing a dense matrix instead.

The figure has four panels: (a) BC row density, (b) BC column density, (c) MCL row density, and (d) MCL column density, each plotting the number of non-zero elements per row or column against the sorted row or column indices.
Figure 3.2: Non-zero Element Distribution

variation of the number of non-zero elements implies load imbalance across the rows and columns. Thus, such a distribution may adversely influence SIMD performance, depending on the non-zero variance between rows/columns and the sparse matrix format used. For instance, an SSMM operation on the matrix shown in Figure 3.2(a) and (b), using the CUSP library in CSR format on an NVIDIA Tesla C2050 GPU, takes 60% more execution time than computing on a uniformly distributed matrix with the same number of non-zeros. Though the aforementioned experiment appears to be one of the more extreme cases, similar performance degradation due to load imbalance was studied by Bell and Garland [7] for SpMV operations in CUDA. We address these issues in

designing the accelerator in the following sections.

3.2 Sparse Matrix Accelerator Architecture

This section explains the architectural aspects of the sparse matrix accelerator (SMA) design. Conventional SIMD architectures consist of an instruction decode unit, multiple arithmetic logic units (ALUs), and memory access units. Each SIMD ALU processes different data, but the ALUs execute identical instructions synchronously since they share an instruction decode unit. Our architecture is similar to SIMD in the sense that multiple ALUs share an execution flow. The main difference from SIMD, however, is that our architecture allows the ALUs to execute asynchronously with respect to each other. This is done by restricting the instructions to predefined, repetitive, coarse-grain operations. The coarse granularity requires each ALU to compute a bulk of data, while the predefined repetitive computation pattern allows each ALU to configure its datapaths only once for the entirety of the operation. Basic linear algebra operations, including sparse matrix arithmetic, fit perfectly into this model.

Asynchronous ALU execution is essential for high-performance sparse matrix operations, especially if the data structures are irregular. Queuing therefore becomes a key factor in sustaining asynchronous operation, as data is received by each ALU at a different rate.

For the following sections of this chapter, we define a thread to be the asynchronous computation flow that each ALU3 is responsible for. This includes the ALU, the surrounding control logic, and the data storage that supports the asynchronous execution. Although all threads execute the same instruction stream, they have independent data streams that need not be synchronous with each other. Details of our accelerator architecture are explained in the following subsections.
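A toy software model of this execution style may help: each thread is configured once with the same coarse-grain operation, but drains its own input queue at its own rate, so an irregular distribution of work never forces lockstep stalls. Everything below (queue sizes, the accumulate operation) is invented purely for illustration.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    int main() {
        const int threads = 4;
        std::vector<std::deque<double>> queue(threads);
        std::vector<double> acc(threads, 0.0);

        // Irregular work distribution: thread t receives 2^t elements.
        for (int t = 0; t < threads; ++t)
            for (int k = 0; k < (1 << t); ++k) queue[t].push_back(1.0);

        // One configuration step for the whole operation (here: accumulate).
        std::function<void(int, double)> op = [&](int t, double x) { acc[t] += x; };

        // Each "cycle", every non-empty thread consumes one element; no thread
        // waits for the slowest one before taking its next element.
        int cycles = 0;
        bool busy = true;
        while (busy) {
            busy = false;
            for (int t = 0; t < threads; ++t)
                if (!queue[t].empty()) {
                    op(t, queue[t].front());
                    queue[t].pop_front();
                    busy = true;
                }
            if (busy) ++cycles;
        }
        for (int t = 0; t < threads; ++t) std::printf("thread %d sum = %.0f\n", t, acc[t]);
        std::printf("finished in %d cycles\n", cycles);
        return 0;
    }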

The figure shows the device processor with its instruction fetch unit, two DRAM controllers, and the PCI Express module connected through an interconnect to the OP decode/dispatch unit and DRAM arbiter, which feed the decode module, encode module, and ALU main control, and ultimately an array of thread blocks (ALU arrays with local memories and SPUs).
Figure 3.3: Sparse Matrix Accelerator (SMA) Architecture Overview

3.2.1 Overview

Figure 3.3 illustrates the architecture of SMA. The figure depicts the sparse matrix accelerator as a discrete device; however, an embedded sparse matrix accelerator would have essentially the same block diagram, except that it would not need a separate device processor and PCI Express module.

The top modules in Figure 3.3 are interfaces to external components. Many matrix applications are memory-bound, requiring a large DRAM bandwidth, so the accelerator may have multiple DRAM controllers to supply that bandwidth. The PCI Express module is used for DMA data transfers between the host and the device. It also provides a means of communicating with the host processor via memory-mapped I/O to control the DMA data flow. A small device processor is included mainly to fetch and forward matrix instructions to the accelerator. Since data transfers between the host and the device may be costly, the device processor also executes

3A thread may actually have multiple ALUs, as described in Section 3.2.5.

non-critical general purpose code to avoid communication overhead. This general purpose capability of the device processor may be excluded if the accelerator is used as an integrated functional unit of a general purpose processor.

The main matrix acceleration modules are the decode module, the encode module, and a large array of thread blocks. The decode module is responsible for generating memory addresses based on the matrix format and matrix operation, and for decoding the data into a format uniformly used by the thread blocks. The encode module takes the output of the thread blocks and writes it to the main memory as a dense or sparse matrix. Thread blocks consist of a number of computational threads, each of which has floating point ALUs. All computational threads are configured to use identical datapaths based on the matrix operation being performed, but each thread operates asynchronously with respect to the others. Each thread block also has one special purpose unit that performs commonly used mathematical functions, such as the logarithm and exponential functions.

The accelerator operates as follows. The device is held in a reset state except for the memory and host communication modules. Once the host downloads an executable binary into the device main memory, the device is released from reset and the device processor starts executing the binary. The device processor offloads each matrix instruction it encounters to the matrix acceleration modules while communicating with the host to coordinate DMA data transfers. The decode, encode, and thread block modules each have an instruction queue from which the module fetches matrix operation parameters and configures its control and datapath accordingly. Once the datapaths are configured in each module, data starts to flow from the device main memory to the decode module, then to the thread blocks, and finally to the encode module, where the output data is encoded and written to memory.
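The dispatch step of this sequence can be summarized in a short control-flow sketch. The opcode split, structure layout, and queue handling below are hypothetical and only name the steps described in the text; they are not the actual device firmware.

    #include <cstdint>
    #include <cstdio>
    #include <queue>
    #include <vector>

    // Hypothetical sketch of the dispatch flow on the device processor.
    struct MatrixOp { uint32_t opcode; uint64_t srcA, srcB, dst; };

    int main() {
        // Toy instruction stream that the host placed in device memory.
        std::vector<MatrixOp> program = {
            {0x01, 0x1000, 0x2000, 0x3000},   // matrix multiply  (offloaded)
            {0x05, 0x3000, 0x0000, 0x4000},   // element-wise op  (offloaded)
            {0x90, 0x4000, 0x0000, 0x0000},   // bookkeeping code (runs on device CPU)
        };

        // Per-module instruction queues; each module later pops its copy and
        // configures its own control logic and datapaths.
        std::queue<MatrixOp> decode_q, thread_q, encode_q;

        for (const MatrixOp& op : program) {
            if (op.opcode < 0x80) {            // matrix instruction: offload it
                decode_q.push(op);
                thread_q.push(op);
                encode_q.push(op);
            } else {
                std::printf("scalar op 0x%02x handled on the device processor\n", op.opcode);
            }
        }
        std::printf("queued %zu matrix instructions for the accelerator\n", decode_q.size());
        return 0;
    }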

3.2.2 Supported Matrix Operations

Table 3.2 summarizes the matrix operations that are accelerated by our architecture. A more comprehensive list of instructions supported by our FPGA implementation

The figure shows an example matrix and its CSB encoding: Value = {1, 2, 3, 4, 5, 6, 7}, Row Index = {0, 2, 1, 1, 3, 0, 2}, Col Index = {0, 1, 1, 4, 2, 0, 2}, Block Ptr = {0, 2, 5, 5, 7}.
Figure 3.4: Compressed Sparse Block Format

can be found in Appendix B. As can be seen in Table 3.2, the behavior of each operation is described using MATLAB [44] notation. The left-hand column of Table 3.2 lists the types of operands used for each operation category. Matrices can be in sparse or dense format, while vectors can only be in dense format. Each operation can involve purely sparse matrices, purely dense matrices, or a mixture of both4. Matrix transpose is supported for the operands of the accumulative operations, i.e., matrix multiplication and matrix reduction. Matrix transpose is supported for the other operations only if all matrices in the operation are dense. This is because the sparse matrix format used in our architecture requires a special ordering, which becomes difficult to preserve if transpose is allowed for element-wise operations.

Table 3.2 includes pseudo-instructions, which are two or more instructions combined to behave as one instruction. For example, our architecture calculates a matrix element-wise division C = A ./ B as two native instructions: an element-wise inverse R = 1 ./ B, and an element-wise multiplication C = A .* R. The use of pseudo-instructions can save area for operations that are usually not on the critical path.

  Matrix Operation                         MATLAB-like Notation
  Matrix Multiplication
      Matrix - Matrix                      A * B
      Matrix - Vector                      A * v
  Matrix Element-wise Arithmetic
      Matrix - Matrix                      A + B, A - B, A .* B, A ./ B†
      Matrix - Vector                      A + repmat(v,M,1)
      Matrix - Scalar                      A + 3, A .^ 3†
  Special Functions                        log2(A), 2 .^ A, sqrt(A)†,
                                           log(A)†, exp(A)†, 1 ./ sqrt(A)
  Matrix Element-wise Comparison
      Matrix - Matrix                      A > B, A < B, min(A,B), max(A,B)
      Matrix - Vector                      A > repmat(v,M,1)
      Matrix - Scalar                      A > 3, min(A,3)
  Matrix Reduction
      Reduction                            min(A), max(A), sum(A), min(A_i,k + B_k,j)†

Table 3.2: List of Matrix Operations Supported. † Note: Pseudo-instructions, made of two or more native instructions.
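As an illustration of how a pseudo-instruction is expanded into native instructions at dispatch time (mirroring the C = A ./ B example above), here is a small C++ sketch; the opcode mnemonics and the Instr layout are our own, not the accelerator's actual instruction encoding.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Illustrative expansion of a pseudo-instruction into native instructions:
    // C = A ./ B becomes R = 1 ./ B followed by C = A .* R.
    struct Instr { std::string op, dst, srcA, srcB; };

    std::vector<Instr> expand(const Instr& in) {
        if (in.op == "EDIV")                    // pseudo-instruction
            return { {"EINV", "R_tmp", in.srcB, ""},        // R = 1 ./ B
                     {"EMUL", in.dst,  in.srcA, "R_tmp"} };  // C = A .* R
        return { in };                          // already a native instruction
    }

    int main() {
        for (const Instr& i : expand({"EDIV", "C", "A", "B"}))
            std::printf("%s %s, %s, %s\n", i.op.c_str(), i.dst.c_str(),
                        i.srcA.c_str(), i.srcB.c_str());
        return 0;
    }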

3.2.3 Sparse Matrix Format

As mentioned in Section 3.1, our architecture targets applications that make extensive use of sparse matrix operations. Thus, deciding which sparse matrix formats to support is a crucial design factor, since the sparse matrix format can greatly influence the performance of these applications. Two of the applications shown in Section 3.1.1 spend the majority of their time in sparse matrix - sparse matrix multiplication (SSMM), which can be a very challenging operation to implement in hardware depending on the sparse matrix format. Another frequent computational pattern we found in the targeted applications was the use of transpose in conjunction with matrix multiplication. Therefore, it is preferable to choose a sparse matrix format that can efficiently carry out these types of operations in hardware.

Compressed Sparse Column (CSC) is a popular sparse matrix format widely used

4The actual FPGA implementation restricts mixing types of matrices for certain matrix operations.

in various applications. This format generally allows smaller memory footprints and is convenient for per-column computations. Existing software solutions, such as CSparse [16], implement SSMM by accumulating the weighted column vectors of the first operand matrix, where the weights are the elements of the corresponding column vector of the second operand matrix (see Eq. 2.1 in Chapter 2). However, implementing this in hardware can be complicated, since a considerable amount of indexing and indirection is required. If a dense column vector cannot fit on-chip, which is possible when the column vector size exceeds roughly one million elements, the accelerator would also need to access DRAM for temporary values and would require some caching mechanism, since it is difficult to predict which elements will be reused soon. In addition, any transpose operation must be performed prior to the multiplication, which adds further complications to the hardware. Although the CSC format may show good performance in many applications, we decided to postpone support for this format to future implementations and start with a simpler sparse format for our design.

Compressed Sparse Block [12] is a sparse matrix format which enables efficient transpose operations and provides a more straightforward approach to sparse matrix multiplication. Compressed Sparse Block is illustrated in Figure 3.4. As shown in the figure, the sparse matrix is divided into fixed-size blocks. In this chapter, we use β to represent the number of elements in a matrix block row. The non-zero elements are stored in a 3-tuple list, where each tuple consists of a row index, a column index, and the value of the element. The row index and column index are relative to the top-left corner of the matrix block in which the element is positioned. A separate matrix block pointer list is used to locate the beginning of each matrix block within the 3-tuple non-zero element list, as shown in Figure 3.4.

Although the ordering of elements within a block is not fixed in CSB, we enforce a certain ordering in our architecture to ease the implementation of element-wise matrix operations and the sparsification of matrices. In addition, we give the user the option to indicate whether the omitted values of a sparse matrix represent zero or infinity. Providing this option allows us to extend our applicability to certain graph algorithms, such as All Pairs Shortest Paths.
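The format just described maps naturally onto a handful of flat arrays. The sketch below is a minimal C++ representation with our own field names and assumed index/value widths; it is not the memory layout used by the hardware, only an illustration of the block-pointer indirection.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Minimal in-memory layout for the Compressed Sparse Block format: non-zeros
    // are stored as (row, col, value) tuples with indices relative to their
    // block, and blk_ptr[b] gives the first tuple of block b.
    struct CSBMatrix {
        uint32_t rows = 0, cols = 0;
        uint32_t beta = 0;                       // block dimension (fixed per design)
        std::vector<uint16_t> row_idx, col_idx;  // intra-block indices
        std::vector<float>    val;               // non-zero values
        std::vector<uint64_t> blk_ptr;           // number of blocks + 1 entries

        // Number of non-zeros in block b; empty blocks can be skipped in O(1).
        uint64_t block_nnz(size_t b) const { return blk_ptr[b + 1] - blk_ptr[b]; }

        // Reading a transposed matrix only requires swapping the roles of the
        // row and column index streams; the tuples themselves are untouched.
    };

    int main() {
        CSBMatrix A;                             // toy 8x8 matrix, 2x2 grid of 4x4 blocks
        A.rows = A.cols = 8; A.beta = 4;
        A.val     = {1.f, 2.f, 3.f};
        A.row_idx = {0, 2, 1};
        A.col_idx = {0, 1, 3};
        A.blk_ptr = {0, 2, 3, 3, 3};             // block 0 has 2 nnz, block 1 has 1
        for (size_t b = 0; b + 1 < A.blk_ptr.size(); ++b)
            std::printf("block %zu: %llu non-zeros\n", b,
                        (unsigned long long)A.block_nnz(b));
        return 0;
    }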

The figure shows the decode module's main control and memory arbiter (connected to the Nios II and DRAM) feeding the BlockPtr Addr Gen, Data/Index Addr Gen, Sparse Crossbar (XBAR), Dense Mux, Batch Transpose, and Vector Forward units, which drive the unit data stream and batch stream outputs.
Figure 3.5: SMA Decode Unit Block Diagram

One disadvantage of using CSB is that the performance depends greatly on β, whose optimal value differs for each sparse matrix. Thus, it would be natural to support multiple β values in the accelerator architecture. However, we decided to fix β to a specific value for the current version of our architecture so that we could focus on the architectural aspects during our initial design and development. It is also noteworthy that nothing in the architecture inherently requires the sparse format to be CSB, and different sparse matrix formats could be incorporated in future designs. In fact, Kestur et al. designed efficient universal sparse matrix logic that can decode various sparse formats into one representation [32], which could also be used in our architecture.

3.2.4 Decode Unit

The decode unit of SMA is responsible for reading data from DRAM and forwarding the data to the ALU arrays. This includes generating addresses based on the matrix format, decoding the data streamed in, converting the data format, and assembling packet headers to correctly route the data.

Figure 3.5 shows the block diagram of the decode unit. The decode unit fetches one or more command packets from the central OP dispatch unit and configures its finite state machines (FSMs) and datapaths accordingly.

The BlockPtr Addr Gen module in Figure 3.5 generates the addresses of the matrix block pointers of both matrix operands and issues the memory requests to the DRAM arbiter. The DRAM arbiter forwards the memory requests to the DRAM memory controllers and passes the DRAM data from the memory controllers to the Data/Index Addr Gen module. The BlockPtr Addr Gen module only requests memory accesses when the matrix format is sparse; when the matrix format is dense, the BlockPtr Addr Gen module produces the block pointers itself and sends them directly to the Data/Index Addr Gen.

The Data/Index Addr Gen module produces the addresses of the elements within the current matrix block based on the information from the BlockPtr Addr Gen module. The elements to be read include both indices and values in the case of a sparse matrix, and only the values in the case of a dense matrix. The Data/Index Addr Gen module skips empty matrix blocks based on the matrix block information. Depending on the type of the operation, non-empty matrix blocks may also be skipped if computation is unnecessary. Such a case occurs when the matrix operation is multiplicative and the corresponding matrix block of the other operand matrix is empty.

The addresses of the matrix elements generated by the Data/Index Addr Gen module are passed to the DRAM arbiter. The DRAM arbiter reads the matrix elements from DRAM and sends them to one of four destination modules: the Sparse Crossbar module, the Dense Mux module, the Batch Transpose module, or the Vector Forward module. In addition to the matrix elements, certain matrix block information from the Data/Index Addr Gen module is passed to the destination modules to assist in interpreting and processing the matrix elements.

The Sparse Crossbar module routes the elements of sparse matrices based on their row or column indices. The column indices are used for a transposed matrix, and the row indices are used otherwise. The crossbar is capable of switching multiple elements simultaneously, ideally as many elements as fit in a memory data bus so as to fully utilize the DRAM bandwidth, unless there is a conflict at one of the destination ports, in which case stalling one of the inputs is inevitable. This module

is used for both matrix operands of element-wise operations, and for the first matrix operand of accumulative operations. The Dense Mux module routes the elements of dense matrices, where the destination of each element is determined by its row or column number. The Dense Mux achieves this by utilizing buffers and shift registers. Similar to the Sparse Crossbar module, the Dense Mux module is used for both matrix operands of element-wise operations, and for the first matrix operand of accumulative operations5. The outputs of the Sparse Crossbar and Dense Mux modules are sent to the unit data stream interconnect network. In our SMA implementation explained in Section 3.3, these are point-to-point buffered connections to the thread blocks. The Batch Transpose is responsible for handling the second matrix operand of accumulative computations. Unlike element-wise matrix operations, accumulative matrix operations reuse data for multiple rows. To exploit this computation pattern, the architecture sends a small portion of the second operand matrix to all thread blocks to be cached for data reuse. Thus, the interconnect from the Batch Transpose to the thread block arrays is essentially a broadcast, with wiring and buffering structured as a tree network to meet timing constraints. In addition to sending batches of data, the Batch Transpose module performs transposes of matrices and densification of sparse matrices. The transpose of a small batch of data can be implemented in multiple ways; we chose the method used in [42] since it makes less use of multiplexers, which are costly in FPGAs6. The Batch Transpose module densifies sparse matrix rows that are not completely empty. Densifying rows, which essentially shifts the elements to their predetermined memory locations, reduces indexing and routing complications within the thread blocks. Depending on how dense the matrix block is, this approach may somewhat degrade performance, especially if the application is memory-bound. Section 3.4.2 discusses the performance of the sparse matrix accumulative operations in more detail.

5The Dense Mux module was included for practical reasons. For the same memory bandwidth, approximately twice as many dense matrix elements are read as sparse matrix elements, since dense matrix elements do not carry indices. Reusing the Sparse Crossbar module for dense matrix elements would require more crossbar ports, quadratically increasing the logic requirements. Instead, we included a separate Dense Mux module for more efficient logic use. 6However, as explained in Section 2.4.2, this transpose method may have scalability issues for future device generations. Different methods may be used in ASICs or future FPGAs.
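The densification performed by the Batch Transpose module amounts to scattering each non-zero element of a sparse row into its predetermined slot within the block. A minimal software model of this step, assuming per-element column offsets within the block are available, is shown below.

```c
/* Densify one sparse row of a matrix block: every non-zero lands at the slot
 * given by its column offset, and the remaining slots are zero-filled.
 * A behavioral sketch only; the hardware operates on streamed batches. */
void densify_row(const uint16_t *col_idx, const float *val, int nnz,
                 float *dense_row, int beta)
{
    for (int j = 0; j < beta; j++)
        dense_row[j] = 0.0f;              /* clear the block-wide row buffer */
    for (int k = 0; k < nnz; k++)
        dense_row[col_idx[k]] = val[k];   /* element lands at its predetermined slot */
}
```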

Figure 3.6: SMA Thread Block Unit Block Diagram

The Vector Forward module is used to forward vector data chunks to the thread blocks. Since each thread block receives the entire vector chunk and caches it, the same broadcast network is used for streaming the vector data.
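To summarize the block-traversal policy of the decode unit, the sketch below walks the block pointers of the first operand, skips empty blocks, and, for an element-wise multiplication, also skips blocks whose counterpart block in the second operand is empty. It reuses the csb_matrix sketch shown earlier; emit_read is a stand-in for issuing a DRAM read request to the arbiter, not a real interface.

```c
/* Software model of the Data/Index Addr Gen skipping policy (sketch).
 * Uses csb_matrix and csb_block_nnz() from the earlier sketch. */
#include <stdint.h>

static void emit_read(const void *addr, uint64_t n_elems)
{
    /* Stand-in for a DRAM read request routed through the arbiter. */
    (void)addr; (void)n_elems;
}

void generate_block_reads(const csb_matrix *a, const csb_matrix *b, int elementwise_mult)
{
    for (uint32_t bi = 0; bi < a->blocks_per_col; bi++) {
        for (uint32_t bj = 0; bj < a->blocks_per_row; bj++) {
            uint64_t nnz = csb_block_nnz(a, bi, bj);
            if (nnz == 0)
                continue;                           /* empty block: no DRAM traffic */
            if (elementwise_mult && b && csb_block_nnz(b, bi, bj) == 0)
                continue;                           /* partner block empty: result block is zero */
            uint64_t base = a->block_ptr[(uint64_t)bi * a->blocks_per_row + bj];
            emit_read(a->idx + base, nnz);          /* indices of the block ... */
            emit_read(a->val + base, nnz);          /* ... followed by its values */
        }
    }
}
```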

3.2.5 Thread Block Unit

Figure 3.6 shows the block diagram of a thread block. A thread block consists of a number of threads, a shared cache, and common control logic. The number of thread blocks and the number of threads per thread block may vary across different implementations. As a rule of thumb, since a thread block can consume one element of data (of the first matrix operand) per cycle, the number of thread blocks is roughly equal to the number of elements read from DRAM each cycle. The number of threads per thread block is then determined based on the available logic, memory, and routing resources on chip. The implementation discussed in Section 3.3 has four threads per thread block, which is also the case shown in Figure 3.6. Each thread in the thread block configures the control logic and datapaths at the beginning of each matrix operation. The configuration information is given by

the OP dispatch unit in multiple command packets stored in a queue (not shown in Figure 3.6). The thread configuration takes several cycles to complete, which is negligible compared to the number of cycles a computation-intensive matrix operation would take. Logic configured during this phase includes the data routing policy, the ALU datapath multiplexers, various state machines, and the work-stealing engine. Data sent from the decode unit is stored in one of the queues for routing within the thread block. The routing logic determines which queues to fetch data from and which thread to send the data to. Depending on the operation type, this module also compares the indices of the data and discards them if it finds computation is not needed. For example, element-wise multiplication only requires computation on elements that have the same indices, and may ignore elements that do not overlap between the two matrix operands. Data from the decode unit also carries useful control information, such as an indicator bit that tells when the matrix operation is completed. The routing logic module adjusts its states accordingly, and forwards the control data to other modules in the thread block. The thread block also has a cache module shared by all the threads within the thread block. The cache is used for accumulative matrix operations, where data may be reused across multiple rows and columns. The cache module uses a double buffering scheme to hide the latency of streaming in matrix rows. As mentioned in Section 3.2.4, the Batch Transpose unit sends batches of matrix data for reuse, which are stored in the cache module of the thread block. While the thread block is busy with computations on this batch, the cache module reads the next batch into the embedded memory. This approach requires twice as much memory compared to a single-buffer cache, but can significantly improve performance when enough workload is present in each non-empty matrix block. For accumulative matrix operations, each thread performs the computation required for each row of the resulting matrix. Each cycle, one of the threads receives a single srcA element and multiple srcB elements (srcA and srcB denote the two operand matrices). The srcA element is taken from one of the thread block input queues, while the srcB elements are read from the cache. These computations take multiple cycles; the exact number depends on the number of srcB elements and the number of floating

point ALUs per thread7. On the next cycle, if available, another batch of input data is sent to one of the threads according to the index value. If the data is sent to a busy thread, the data is stored in that thread's input queue. Otherwise, the thread immediately starts the computation. For element-wise matrix operations, only one thread within the thread block is utilized due to the DRAM bandwidth. Since the number of thread blocks is roughly the same as the number of elements read from DRAM each cycle, only one thread in a thread block can receive data per cycle. Unlike accumulative matrix operations, the per-cycle input data of element-wise matrix operations is one srcA element and one srcB element. Such operations can be processed by a thread at a throughput of one element per cycle (all ALU components are fully pipelined). Therefore, one thread is enough to sustain the element-wise data flow. Energy may be saved in these computations by turning off unused threads via clock gating. In case one of the thread input queues becomes full, other threads can be used to perform work-stealing. Work-stealing is only used for accumulative operations, because element-wise operations and special math functions only utilize one thread. Our sparse matrix accelerator architecture pairs two threads for work-stealing8, as shown in Figure 3.6. When one input queue is full or above a certain threshold, the routing module detects this and sends data to its partner thread instead. The sent data has a bit in its header indicating that this data is stolen and must be returned to its owner thread after computation. After the partner thread finishes the computation, it emits the accumulated results to a queue which the owner thread reads from when ready. Since accumulative operations require internal storage for accumulated values, each thread must have storage for its own accumulated values and for the partner thread's accumulated values. Having the ability to dynamically balance the workload is important for obtaining good performance on sparse matrices with severe workload imbalance, as shown in Section 3.4.2. Since many sparse matrices tend to be locally (somewhat) dense, the rows are interleaved among the thread blocks to avoid contention within the thread block.

7Suppose there are N srcB elements and M ALUs; then it takes N/M cycles to complete the computation using these inputs. 8Although it is possible to group more threads, we only group two threads for simplicity.
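The queue-threshold work-stealing decision made by the routing logic can be modeled as shown below. The queue depth, the threshold, and the extra condition that the partner itself is not saturated are illustrative assumptions; in hardware, the stolen flag travels in the packet header and the result is returned through a dedicated queue.

```c
/* Behavioral sketch of the routing logic's work-stealing decision. */
#include <stdbool.h>

#define QUEUE_DEPTH      16   /* assumed per-thread input queue depth */
#define STEAL_THRESHOLD  12   /* assumed "nearly full" watermark */

typedef struct {
    int occupancy[4];         /* current entries in each thread's input queue */
} thread_block_state;

/* Returns the thread that should receive an element destined for `owner`;
 * *stolen is set when the element is redirected to the partner thread. */
int route_element(const thread_block_state *tb, int owner, int partner, bool *stolen)
{
    if (tb->occupancy[owner] >= STEAL_THRESHOLD &&
        tb->occupancy[partner] < STEAL_THRESHOLD) {
        *stolen = true;       /* marked in the packet header in hardware */
        return partner;       /* partner accumulates and later returns the result */
    }
    *stolen = false;
    return owner;
}
```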

Figure 3.7: SMA Floating Point ALU Block Diagram

For example, thread 1 of thread block 1 computes the first row, thread 1 of thread block 2 computes the second row, and so on. This way, we reduce the chance of data being concentrated on a specific thread block for a locally dense sparse matrix. In addition, work-stealing falls apart when the workload is concentrated on two threads in a thread block that happen to be partner threads, while the other threads are idle. The probability of such cases can be reduced when the partner thread is the one responsible for the rows farthest away. For the thread block shown in Figure 3.6, Thread1 is paired with Thread3, and Thread2 is paired with Thread49. SMA thread blocks also have a shared special math unit, called the Special Purpose Unit (SPU). This unit performs the element-wise special functions listed in Table 3.2, i.e., the log base-2 function, the exponential base-2 function, the reciprocal function, and the inverse square root function. This unit acts as a separate thread and has its own input queue. As with other element-wise operations, only one thread is needed due to the memory bandwidth limitation. In addition, most of the special math operations are iterative and not pipelined, resulting in a throughput of less than one; the only operation that is fully pipelined is the log base-2.

9This deterministic allocation approach reduces resources and routing complications. A fully dynamic allocation of rows based on thread availability would require substantially more resources, and is left as future work.
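The deterministic interleaving of result rows across thread blocks, together with the pairing of each thread with the thread responsible for the rows farthest away, can be sketched as follows. The number of thread blocks is an assumed parameter; four threads per thread block matches the implementation in Section 3.3.

```c
/* Row-to-thread mapping and partner pairing (sketch). Threads are zero-based
 * here, so thread 0 <-> 2 and 1 <-> 3 correspond to Thread1 <-> Thread3 and
 * Thread2 <-> Thread4 in the text. NUM_THREAD_BLOCKS is an assumption. */
#define NUM_THREAD_BLOCKS 16
#define THREADS_PER_BLOCK 4

typedef struct { int block; int thread; int partner; } row_map;

row_map map_row(int row)
{
    row_map m;
    m.block   = row % NUM_THREAD_BLOCKS;                          /* rows interleaved across blocks */
    m.thread  = (row / NUM_THREAD_BLOCKS) % THREADS_PER_BLOCK;    /* then across threads in a block */
    m.partner = (m.thread + THREADS_PER_BLOCK / 2) % THREADS_PER_BLOCK; /* farthest-away pairing */
    return m;
}
```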

Figure 3.7 shows the Floating Point Arithmetic Logic Unit (FP ALU) block diagram. The FP ALU consists of a floating point multiplication unit, a floating point addition/subtraction unit, a floating point comparison unit, and a feedback unit. The floating point comparison unit can be further divided into a greater-than unit, a less-than unit, a max unit, and a min unit. The feedback unit contains multiple queues and control logic. The queues in the feedback unit are used as temporary storage for the accumulative operations. Half of the queues in the feedback unit are used for the owner thread, and the other half is used for the partner thread. Figure 3.7 also shows the multiplexers for the ALU datapaths. The input of each arithmetic submodule can come from the input matrices or from the output of other modules. INPUT A and INPUT B in Figure 3.7 are the srcA and srcB matrix elements, and DATA OUT is the output of the FP ALU. Which data to pass in each multiplexer depends on the matrix operation and is configured when the OP dispatch commands are read. When to pass the data may depend on the current phase of the matrix operation. For example, an accumulative operation makes use of the Feedback Queue module, whose output is mostly used as the input of the FP ADD module. However, when the end of a row is reached, the output of the Feedback Queue is no longer used as the input of the FP ADD module, and is instead selected as DATA OUT. Depending on the available resources, a thread may have multiple FP ALUs. This can be particularly useful for dense accumulative matrix operations, as long as the number of ALUs per thread does not exceed the number of srcB elements that fit in a row of the thread block cache. As the number of FP ALUs increases, the amount of feedback buffering decreases, as mentioned in Section 2.5.2. (Although Section 2.5.2 focuses on Restricted Boltzmann Machines, the principles and trade-offs of feedback buffering also generally apply to the accumulative matrix operations used in SMA.) However, feedback queues with shallow depths may cause pipeline stalls, since the round-trip latency from the input of the first ALU module to the output of the feedback queue requires that many srcB elements to keep the pipeline constantly busy.
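The role of the feedback queue during an accumulative operation can be modeled in software as a set of independent partial sums, one per in-flight addition, that are folded together at the end of a row. The adder latency below is an assumed value; the point is that the queue must hold at least that many partial sums to keep a fully pipelined FP ADD busy.

```c
/* Functional (not cycle-accurate) model of accumulation through a pipelined
 * FP adder with a feedback queue of depth ADD_LATENCY (assumed value). */
#define ADD_LATENCY 8

float accumulate_row(const float *products, int n)
{
    /* One partial sum per pipeline slot, so a new addition can issue every
     * cycle without waiting for the previous result to emerge. */
    float partial[ADD_LATENCY] = {0};
    for (int i = 0; i < n; i++)
        partial[i % ADD_LATENCY] += products[i];

    float sum = 0.0f;
    for (int k = 0; k < ADD_LATENCY; k++)   /* final reduction at end of row */
        sum += partial[k];
    return sum;
}
```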

Figure 3.8: Encode Unit Block Diagram

3.2.6 Encode Unit

The SMA encode unit stores the outcome of the matrix operation in DRAM. The output data from the thread blocks are combined and converted to a dense matrix format or the CSB matrix format, as specified by the application. As was the case for the other SMA units, the encode unit begins by reading command packets from the OP dispatch unit. Based on the information in the dispatch commands, the encode unit configures its finite state machines and datapaths. For dense matrices, the data from each thread corresponds to a portion of a row in the result matrix. Since each row is deterministically allocated to a thread, the output addresses for each thread can be calculated beforehand. If both source matrices are dense, then the order of elements within each row also arrives at the encode unit deterministically. All elements are guaranteed to be present and arrive cycle by cycle. The encode unit deserializes the output data of each row such that the data width matches the data unit for DRAM. The deserialization is done in parallel for each thread, thus requiring shift registers for each thread. Then, depending on the phase of the address generator, one of the deserialized rows is selected to be sent to DRAM. If either of the source matrices is a sparse matrix, then the order of elements

is deterministic, but it may not include all elements and the timing is non-deterministic. Thus, it is logical to send data to DRAM as soon as the data is available, since waiting for data that may never arrive while stalling other rows is not efficient. However, it also makes sense to coalesce memory accesses, since coalescing requires less memory bandwidth. Thus, the encode module does not send data to DRAM immediately, but shifts and buffers data until one of the following conditions is met:

1. another row from the same thread block is output
2. the shift register or buffer is full
3. the thread block signals the end of a matrix block

Meanwhile, the encode unit can emit any row that is ready. For sparse matrices, the output data from each row is non-deterministic. Thus, a separate address generation module is required. In addition, the offset of the data for each row needs to be calculated. Since the non-zero elements are stored in a specific order within the matrix block, the data from each thread needs to be shifted depending on the offset of the previous thread and the number of elements of the previous row. How this can be done in parallel is explained in Section 3.2.7.

3.2.7 Sparsification Modules

Support for sparse matrix writes is a must, as discussed in Section 3.1.1. Depending on the size of the matrix and its sparsity, the write memory access can take up to 99% of the execution time. In addition, applications require the ability to dynamically create a sparse matrix. The main issue with writing a matrix in a sparse format is that it requires an ordering of elements and pointers that depend on previous rows. A naïve approach may wait for all the non-zero elements of previous rows to be calculated, count how many elements there are for that particular row, write the elements starting from the offset calculated by the previous row, and calculate the offset of the data for the next row. Also, if either one of the operands is a dense matrix, reordering of the incoming elements is required. The key to writing a sparse matrix in parallel is to distribute the sparsification process across multiple modules, starting from the output of the thread block and



ending within the encode unit. This section describes the details of writing the data/index portion of a sparse matrix. The matrix block pointer list of a sparse matrix only needs to keep track of the number of elements per matrix block, and can be implemented within the encode module in a relatively simple manner.

Figure 3.9: Sparsifying Modules: (a) Thread Block Output (b) Sort & Count Logic (c) Offset Calculation & Shift Logic

Figure 3.9 illustrates the sparsification modules. Figure 3.9(a) shows the thread block output logic for sparsification. Before the thread block emits data to the encode unit, the data is serialized and packetized. One reason for this is the limited routing resources. Thread blocks are likely to use the majority of the logic resources, and thus are spread throughout the chip. Since the encode unit is a centralized module, using wide channels that consume a lot of routing resources and associated buffering logic may not be feasible in some hardware implementations. Therefore, to reduce the routing resource pressure, we limit the bus width by serialization and packetization. Even though data is serialized for each thread block, each thread block runs independently, providing adequate parallelism to fully exploit the available resources and memory bandwidth. A part of the sparsification process takes place in this packetization step. Before sending data out to the encode unit, data whose values are below a user-defined threshold may optionally be eliminated. This enables true sparsification,

as opposed to some sparse matrix routines that only preserve the non-zero locations regardless of their actual values. The thread block output FSMs control which thread output to emit and when to emit it. The reason for this is twofold: it provides the memory coalescing mechanism mentioned in Section 3.2.6, and it enforces an ordering of elements by dividing the data into multiple segments that are re-ordered in the Sort & Count module shown in Figure 3.9(b). The Sort & Count module is positioned between the thread block unit and the encode unit. This module not only provides the interconnect buffering between the thread block unit and the encode unit, but also performs sorting and counting of non-zero elements. The Sort & Count module reads the row and column index information from the header of each packet, and queues the data in the corresponding bucket, assuming that data within each bucket is already in order. While the data is stored in the buckets, the module counts the number of elements for each bucket. After a bucket contains a complete segment of data, the module FSM starts emitting the segment to the encode unit. The output segment begins with a count packet, followed by data packets. Figure 3.9(c) shows the block diagram of the Sparse Offset Calc module of the encode unit. Since adjacent rows are computed by adjacent thread blocks, the offset of one thread is determined by the offset of the previous thread plus the count of the previous thread. This calculation is shown as the chain of adders in Figure 3.9(c). Assuming it is possible to perform n additions of the adder chain in one cycle, an optimistic scenario would be having at least n count packets simultaneously present for adjacent thread blocks, and computing the offsets for the n adjacent thread blocks each cycle. The Deserialize/Shift module performs the shifts according to the offsets and combines the data.
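The adder chain of the Sparse Offset Calc module is, in effect, a prefix sum over the per-thread non-zero counts. A serial software equivalent, under the assumption that counts arrive in thread order, is shown below; the hardware performs n of these additions per cycle.

```c
/* Prefix-sum view of the Sparse Offset Calc adder chain (sketch):
 * offset[t] = base + sum of the counts of all preceding threads. */
void compute_offsets(const int *count, int n_threads, int base, int *offset)
{
    int acc = base;                 /* offset inherited from the previous group of rows */
    for (int t = 0; t < n_threads; t++) {
        offset[t] = acc;            /* where thread t's non-zeros start */
        acc += count[t];            /* carried into the next thread's offset */
    }
}
```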

3.3 Sparse Matrix Accelerator Implementation Details

This section describes the implementation of the Sparse Matrix Accelerator. We implement the design to demonstrate the feasibility of the SMA architecture, as well as

to evaluate the performance benefits of using this architecture. Although implementing the architecture as an ASIC would be the preferred approach for efficient usage of silicon area and faster clock frequencies, the complexity and cost of developing a full-scale accelerator ASIC are very large compared to other approaches, such as using FPGAs and software simulation. In particular, we found that the time-consuming debugging and verification process, as well as the high cost of taping out the design, was difficult to justify merely to test feasibility and performance when other low-cost approaches can serve that purpose. Software simulation offers the highest flexibility in programming and debugging at a very low cost. However, a typical simulation time is extremely long compared to the execution time of other hardware approaches. Due to the long simulation time, this approach may be restricted to tiny workloads in order to complete in a reasonable amount of time. Small workloads, however, often do not accurately reflect the characteristics of larger, real-world workloads. Another disadvantage of simulation is that the software implementation may not consider the design in realistic detail. Since the SMA is a new architecture with several unique and original modules, estimating the area and routing requirements is not a trivial task. In addition, it may sometimes be challenging to determine whether a module can be executed within a certain number of clock cycles while meeting the timing constraints, unless it is actually implemented in hardware. Due to these practical difficulties, evaluation by software simulation may not suffice to convincingly demonstrate the feasibility and performance of our SMA architecture. Therefore, we take a compromise approach by implementing the architecture on an FPGA. FPGAs have the advantage of carrying out computation several orders of magnitude faster than software. This allows evaluating the runtime of larger and more practical workloads. In addition, the reconfigurability of FPGAs provides sufficient flexibility to dramatically reduce debugging efforts compared to ASICs. Since FPGAs can use the same or similar RTL code as ASICs to program the hardware, FPGAs also enable us to better predict the area and routing requirements of the architecture compared to the software approach. Although FPGAs may be more expensive than utilizing commodity desktop computers and servers, the costs are

considerably lower than the costs of printing the design on a die, unless done in mass production. These aspects of FPGAs are a good fit for the purpose of prototyping the SMA architecture, which is to test the feasibility and performance of the architecture at a low cost and with a minimal amount of effort. However, FPGAs have significantly less logic available on die and lower clock frequencies compared to ASICs. Since these deficiencies can greatly influence the overall performance of the accelerator, FPGAs are not intended as the primary implementation target, but as a prototype development platform. In other words, the SMA architecture is implemented on an FPGA with the intention of eventually designing an ASIC. The purpose of the FPGA design is to accurately emulate the ASIC behavior, cycle by cycle, at a lower clock frequency. The lower logic density of FPGAs requires a scaled-down design, or the use of multiple FPGAs. Because communicating between FPGAs may cause additional complications, SMA is implemented on a single FPGA, scaled down compared to the ASIC. To fit in a single FPGA, some features need to be sacrificed, such as using fewer ALUs or reducing the floating point precision. In our FPGA implementation, we only reduce the floating point precision. Section 3.3.1 gives a more detailed comparison between the target specification and the prototype FPGA implementation. For the following sections, we refer to the target ASIC accelerator as SMA-A (short for Sparse Matrix Accelerator-ASIC), and the FPGA prototype as SMA-F (short for Sparse Matrix Accelerator-FPGA).

3.3.1 Target Development Platform

Figure 3.10 shows the DE4 development board used for the FPGA accelerator prototype. The DE4 board from Terasic features an Altera Stratix IV FPGA, two DDR2 SO-DIMM sockets, and PCI Express 2.0. The Altera Stratix IV GX EP4SGX530C2 has 212,480 ALMs (Adaptive Logic Modules), 20,736 kbits of embedded memory, and 512 embedded 18x18 multipliers. In comparison, the previous-generation Stratix III FPGA used for the Restricted Boltzmann Machine implementation has 135,000 ALMs, 16,272 kbits of embedded memory, and 288 18x18 multipliers.

Figure 3.10: DE4 Development Board Used for SMA-F

As was the case for the DE3 board, the DE4 uses the USB-JTAG interface to program the FPGA and debug the logic. However, the DE4 no longer relies on an SD card for user input and instead utilizes the PCI Express interface. As mentioned in Section 3.2, the SMA architecture can be used as an integrated functional unit or as a discrete device. The SMA-F was designed as a programmable discrete device that interacts with a host processor, similar to many modern GPUs. We instantiate Nios II processors, Altera's 32-bit soft processors, to serve as the instruction decoder of the accelerator, as well as the tiny device processor for non-critical general-purpose computation, to avoid excessive data transfers between the host and the device. Our instantiation of the Nios II processors has a 2KB instruction cache and a 2KB data cache for each processor, performs at a rate of 1.13 MIPS/MHz, and uses approximately 1% of the available programmable logic. Using the Nios II processor gives us the advantage of using Altera's C compiler for implementing the applications. However, since the accelerator is implemented as a discrete device, the user needs to program and compile two versions of code: one for the host processor and one for the Nios II processor. At host run-time, the Nios II executable binary is downloaded onto the FPGA main memory, and the Nios II processors are released from the reset state. Although this may not be a preferable way

of programming, it serves our purpose of running the discrete device, and similar programming practices are used in modern GPUs. The SMA-F utilizes the Nios II custom instruction interface [1] to implement the instructions for the matrix operations shown in Table 3.2. The Nios II custom instruction interface provides a simple and efficient way of communicating between the Nios II processors and the matrix accelerator core. Appendix B describes the SMA-F ISA in great detail. As was the case with the Restricted Boltzmann Machine implementations, the SMA-F takes advantage of Altera's intellectual property blocks for Stratix IV FPGAs. The Nios II processor is one example of these IP blocks. The SMA-F also uses a number of other IP blocks, such as the QSYS interconnect, the DDR2 memory controller, and the PCI Express hard IP. The DDR2 memory controller supports up to a 400MHz memory clock frequency, although the SMA-F operates at a lower memory clock frequency, as discussed later. The SMA-F instantiates two DDR2 memory controllers for both DRAM slots and assumes a 1GB DIMM for each slot (2GB of device memory in total). The PCI Express hard IP supports up to 8 lanes, with a throughput of 500MB/s per lane (PCI Express 2.0 standard). Since this is a scaled-down prototype of the targeted ASIC, several assumptions had to be made. Table 3.3 summarizes these assumptions. First, we assume that the FPGA core clock frequency, 100MHz, corresponds to a clock frequency of 1GHz in the targeted ASIC environment. To ensure such projections are reasonable, the number of global wires was minimized, and any long wire was appropriately buffered. Most wires are local due to the modular design, which helps meet timing and resource constraints. The number of ALUs is fixed at 256 due to the limited number of multipliers on the FPGA and routing issues. In addition, we reduce the data width of the ALUs to a 16-bit floating point format, while the target ASIC is assumed to use the 32-bit single precision floating point format. Such a reduction was inevitable due to the limited number of programmable logic elements; the average ratio of silicon area required to implement logic in an FPGA versus an ASIC is approximately 21 [36]. Thus, although we are restricted to 256 half-precision floating point units, we believe that more than 256 single-precision units will comfortably fit in an ASIC. For the memory interface, we

                     FPGA Prototype (SMA-F)         Targeted ASIC Accelerator (SMA-A)
Core Clock Freq.     100 MHz                        1 GHz
Memory Clock Freq.   200 MHz                        1.3 GHz
Memory Interface     128-bit DDR2                   384-bit GDDR5
Memory Bandwidth     6.2 GB/s                       125 GB/s
Number of ALUs       256 (16-bit floating point)    256 (32-bit floating point)

Table 3.3: Prototype Implementation vs Projected Specification

assume that the device-dedicated memory of SMA-A has the bandwidth of modern GPUs. Because the number of DDR2 memory pins is fixed and the DDR2 memory clock frequency must operate within a certain range, we adjusted the memory clock frequency such that the aggregate memory bandwidth approximates the memory bandwidth seen in current GPUs. This adjustment also reflects the fact that we are using half precision in our implementation, but targeting single precision in the ASIC. The 16-bit floating point format SMA-F uses does not comply with the IEEE 754-2008 standard, but follows a custom 16-bit format to match the embedded multiplier width (9 bits). Our half-precision floating point format consists of one sign bit, 7 exponent bits, and 9 mantissa bits. We do not consider denormalization in our ALUs. Due to the limited precision of the mantissa, some applications may not produce the correct answer when precision error accumulates. For these applications, we verify the correctness using small-scale data and use large-scale data only for performance measurements. More details of the evaluation methodology are discussed in Section 3.4.3.
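For illustration, the sketch below converts an IEEE 754 single-precision value into a narrow format with the field widths stated above (one sign bit, 7 exponent bits, 9 mantissa bits) and no denormal support. The exponent bias, the truncating rounding, and the flush-to-zero behavior on underflow are assumptions; they are not necessarily the rules implemented in the SMA-F ALUs.

```c
/* Hedged sketch of conversion into a custom narrow float format. */
#include <stdint.h>
#include <string.h>

#define EXP_BITS 7
#define MAN_BITS 9
#define EXP_BIAS ((1 << (EXP_BITS - 1)) - 1)      /* assumed bias of 63 */
#define EXP_MAX  ((1 << EXP_BITS) - 1)

uint32_t float_to_custom(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);                     /* reinterpret the IEEE-754 bits */
    uint32_t sign = u >> 31;
    int32_t  e    = (int32_t)((u >> 23) & 0xFF);  /* IEEE exponent field */
    uint32_t man  = (u >> (23 - MAN_BITS)) & ((1u << MAN_BITS) - 1); /* truncate mantissa */
    int32_t  exp  = e - 127 + EXP_BIAS;           /* rebias the exponent */

    if (e == 0 || exp <= 0)                       /* denormal or underflow: flush to zero */
        return sign << (EXP_BITS + MAN_BITS);
    if (exp >= EXP_MAX) {                         /* overflow: clamp to the largest value */
        exp = EXP_MAX;
        man = (1u << MAN_BITS) - 1;
    }
    return (sign << (EXP_BITS + MAN_BITS)) | ((uint32_t)exp << MAN_BITS) | man;
}
```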

3.3.2 Software Stack

Figure 3.11 illustrates the software stack of our matrix acceleration framework. As shown in the figure, we use Microsoft Windows XP Professional (32-bit) as the host operating system. Although any operating system can be used with the SMA-F, we chose this particular OS due to some practical issues. The PCI Express driver builds on the Microsoft Windows Driver Foundation (WDF 7600.16385), which provides a (relatively) simple and straightforward way to write driver code compared to Windows


Figure 3.11: SMA-F Software Stack

Driver Model (WDM). The PCI Base Address Register (BAR) 0 is used for the address space of the device DRAM. Data transfers between the host and device DRAM can be done via DMA (Direct Memory Access) and MMIO (Memory-Mapped I/O). BAR 1 and BAR 2 are used for accessing the control registers of the device. The host common library operates at the user level, and behaves as a wrapper for the kernel-mode driver routines. In addition, the host common library provides common routines, such as downloading the Nios II binary and generating matrices. As mentioned earlier, applications are divided into code that executes on the host and routines that run on the SMA-F. In general, the host routines perform general-purpose computing, and offload matrix-specific computation to the SMA-F via the host common library. Since the applications mentioned in Section 3.1.1 mainly consist of matrix operations, most computation takes place on the SMA-F, and the host is only responsible for generating the matrices and reading back the results.

On the device side, the lowest layer of the software stack is Altera's Nios II Board Support Package (BSP), which supports basic libc functionality, such as printf and malloc. The device common library builds on the BSP to provide commonly used routines, which include the macros for the matrix-related custom instructions and other matrix manipulation functions. The device common library also allows device applications to control the PCI Express interface as well as receive notification of any incoming data. The top layer of the device software stack is the code that expresses the application using matrix operations.

3.3.3 Resource Usage

Figure 3.12 illustrates the resource usage of the SMA-F implementation. Altera Quartus II 11.0 was used for synthesis, fitting, routing, and timing analysis of the design. The SMA-F uses approximately 80% of the entire FPGA. The numbers in Figure 3.12 display how much resource each module consumes. As shown in Figure 3.12(a), the floating point ALUs take the majority of the logic resources. Since there are 256 FP ALUs on the FPGA, each with a floating point adder, multiplier, and comparison unit, it is reasonable that the ALUs make up a significant portion of the die area. Another significant portion of the logic resources is used by the thread block routing, which mostly consists of multiplexers and buffers. One should note that the resource utilization in Figure 3.12(a) is for the FPGA and may differ in the ASIC implementation SMA-A. For example, since the ALUs utilize the FPGA embedded multipliers, which are not included in the FPGA ALUT statistics, the ALU portion of the resource utilization for SMA-A may increase to include the resources used in the multipliers. In addition, although the area of other modules may increase linearly with the data precision, the ALU area may increase quadratically depending on the actual implementation, further increasing the ALU portion. If denormalized numbers are handled in SMA-A, even more resources would be required in the ALUs. On the other hand, the resource usage for routing may decrease in SMA-A, since multiplexers tend to be much more efficient in ASICs than in FPGAs. The decode and encode units use 11% and 8% of the logic resources, respectively.

Figure 3.12: SMA-F Resource Utilization: (a) Logic Utilization (b) Memory Block Utilization

The majority of the logic in these modules is used for multiplexers and shift registers that manipulate DRAM data. The non-zero count module for sparsification adds 4% of the logic resources required for encoding data. The Nios II processors, which are in-order processors used for decoding instructions and general-purpose computing, use only 1% of the logic. Figure 3.12(b) shows the memory usage of the SMA-F. A considerable portion of the memory blocks is used for buffering data. As shown in the chart, a large portion of the memory is used in thread block routing. This is due to the buffers required to

tolerate the irregularity and load imbalance. A significant amount of memory is used for buffering data in the Altera Qsys interconnect, marked as Other in Figure 3.12(b). This corresponds to the auto-generated buffered interconnect between the memory controller and the other modules shown in Figure 3.3. The majority of the memory blocks used for the PCI Express and Memory Controller are also buffers, some of which are due to data transfers between asynchronous clock domains, while others are embedded in Altera's IP and reference design code, which we use without modification. The memory used for the non-zero count for sparsification also consists of buffers for sorting and counting non-zero elements. The total amount of cache for data reuse consumes 7% of the memory blocks. The ALUs and SPUs use 3% of the embedded memory, most of which is for the feedback queues in the FP ALUs. About 1% of the memory is used in the Nios II processors. Of this memory, 60% is used for the instruction and data caches, 15% for the register file, and 15% for hardware support for debugging Nios II software applications. Approximately 903KB of embedded memory is used in total for the SMA-F. This corresponds to 35% of the embedded memory bits available in the Stratix IV FPGA. The SMA-F actually uses all the embedded memory blocks (i.e., 1280 M9K memory blocks and 64 M144K memory blocks), but does not fully utilize the depth of each memory block, since many buffers do not require the full depth. The SMA-A will thus require at most 1.76 MB, assuming the worst case where all storage needs twice as much space as in SMA-F due to the doubling of data precision, although it is very likely that the required storage would be much less than that.

3.3.4 Challenges and Limitations

This subsection discusses the challenges confronted while implementing the SMA-F. Due to limited time (and manpower), many parts of the design were simplified. Minor details not essential to the purpose of evaluating the accelerator architecture were mostly omitted. Simplifications include enforcing memory alignment to 128 bytes, enforcing matrix sizes to be a multiple of 32 or 64, and using less accurate floating point units. Since our focus was on demonstrating the feasibility of our

approach, we argue that these simplifications do not invalidate our claims, since it is unlikely that adding these features would improve the feasibility and efficiency of our architecture in any significant way. There are also limitations that may negatively impact the performance of SMA-F. Since these limitations can only worsen SMA-F performance, they do not automatically negate the validity of our performance claims. However, such limitations should be avoided whenever possible, as they mask the full potential of our accelerator. Such limitations include using small fixed-size matrix blocks for the CSB format, supporting only one sparse matrix format, the fixed mapping of matrix rows to threads, and work-stealing only between a pair of predetermined partner threads within a thread block. As shown in Section 3.4.2, small CSB matrix blocks have a detrimental impact on performance for very sparse matrices. Inefficiencies occur when a matrix operation spends most of its time sending out memory requests to probe empty small matrix blocks. Also, parallelism is currently extracted on a per-matrix-block basis, thus requiring a sufficient number of non-zero elements within each matrix block to fully utilize the underlying hardware; small matrix blocks most likely have few non-zeros when the matrix in consideration is very sparse. An RMAT matrix, for example, is considered a very sparse matrix. Using the current fixed size for CSB matrix blocks, more than 99.9% of the matrix blocks can contain fewer than 40 non-zero elements, and more than half of these matrix blocks are empty. These types of matrices typically result in suboptimal performance. Our experiments show that a non-zero density of at least 1% within a CSB matrix block is required to keep the computational resources busy. Section 3.4.2 discusses this matter in more detail. The solution to this issue is either allowing the CSB matrix block size to be adjustable or supporting more sparse matrix formats. Both options are subject to future work. In addition to the limitations from time constraints, there were challenges in implementing an efficient prototype due to the resource usage restrictions in FPGAs. One example of such a restriction is using embedded memory blocks for shallow buffers. The embedded memory blocks in FPGAs offer an efficient way to implement RAMs and FIFOs without relying on programmable logic. However, the depth and width of


each memory block can only be configured to one of a few fixed settings10. To fully utilize a memory block, the width and depth of the instantiated RAM must closely match one of the configurations; otherwise, memory bits are wasted. In the SMA architecture, many of the buffers between modules and submodules are shallow queues used for simple backpressure and for improving throughput. However, since memory blocks in the Altera Stratix IV have a minimum depth of 256, a wide and shallow queue requires many memory blocks and wastes most of the memory bits.

Figure 3.13: Memory Blocks for Queues

Figure 3.13 describes this problem. In the left example shown in the figure, the buffer requires two M9K blocks to match the incoming data width, but the queue depth only requires a fraction of the embedded memory. The shaded area represents the memory bits actually used for the queue. As can be seen, a large portion of the memory is unused. To alleviate this problem, we emulate shallow queues using a faster clock frequency. In the right example shown in Figure 3.13, a clock frequency twice as fast as in the previous example is used to time-multiplex the incoming data within a single normal clock cycle. By using a faster clock, we can pack the wide data into one memory block. The memory bit requirement remains the same, but the number of memory blocks needed is halved.

10For example, an M9K block can be configured as the following (depth × width): 8Kx1, 4Kx2, 2Kx4, 1Kx8, 1Kx9, 512x16, 512x18, 256x32, and 256x36.
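The trade-off can be quantified with a small back-of-the-envelope helper: for a queue of a given width and depth, it estimates how many M9K blocks are needed when mapped directly versus when the data is time-multiplexed at twice the clock to halve the physical width. It assumes the widest M9K configuration (256 × 36) from the footnote above; the example values are illustrative.

```c
/* Estimate M9K block usage for a wide, shallow queue (sketch). */
#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int m9k_blocks(int width_bits, int depth, int clock_mult)
{
    int phys_width = ceil_div(width_bits, clock_mult);  /* time-multiplexed physical width */
    int phys_depth = depth * clock_mult;                /* words stored per logical entry */
    int blocks_w   = ceil_div(phys_width, 36);          /* widest config: 36 bits */
    int blocks_d   = ceil_div(phys_depth, 256);         /* minimum depth: 256 entries */
    return blocks_w * blocks_d;
}

int main(void)
{
    /* e.g. a 64-bit wide, 16-deep queue: 2 M9Ks direct vs. 1 M9K at a 2x clock */
    printf("direct: %d blocks, 2x clock: %d blocks\n",
           m9k_blocks(64, 16, 1), m9k_blocks(64, 16, 2));
    return 0;
}
```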

                    SMA-A                    CPU                                     GPU
Processor           Emulated (using SMA-F)   Intel Xeon X5550 (Dual-socket)          NVIDIA Tesla C2050
Process Tech.       N/A                      45 nm                                   40 nm
Core Clock Freq.    1 GHz                    2.67 GHz                                1.15 GHz
Memory Clock Freq.  1.3 GHz                  1.33 GHz                                1.5 GHz
Memory Interface    384-bit GDDR5            384-bit DDR3 (3 channels per socket)    384-bit GDDR5
Memory Capacity     4 GB                     96 GB                                   3 GB
Memory Bandwidth    125 GB/s                 64 GB/s                                 144 GB/s
Number of FP ALUs   256                      64                                      448
Max FLOPS           512 GFLOPS               171 GFLOPS                              1030 GFLOPS
On-chip memory      1906 KB                  18944 KB                                3456 KB

Table 3.4: Platform Specifications For Performance Evaluation and Analysis

3.4 Results and Analysis

3.4.1 Experimental Platform

As mentioned in Section 3.3, the SMA-F is used for estimating the performance of SMA-A, the ASIC implementation of the Sparse Matrix Accelerator. To evaluate the accelerator architecture, we compare its performance with a high-end CPU and GPU, detailed in Table 3.4. The table lists the chip specifications of each platform. A dual-socket system with Nehalem-based Intel Xeon processors running at 2.67 GHz was used for evaluating an SMP multi-core CPU system. The Xeon processors support SSE4.2 and yield a peak floating point performance of 171 GFLOPS across all eight cores11. AVX (Advanced Vector Extensions) technology, which was first supported in Intel Sandy Bridge processors in 2011, is expected to double the floating point performance; however, AVX is not supported by the Nehalem architecture and is not included in the CPU evaluation. The NVIDIA Tesla C2050 was used for evaluating the performance of a programmable GPU. It features a large memory bandwidth (144 GB/s) and many SIMD-like cores (448), giving a peak single-precision floating point

11Each core can compute up to 8 single-precision floating point operations per cycle, which gives a total performance of 8 (cores) × 8 (FLOP) × 2.67 (GHz) = 170.88 GFLOPS.

Matrix Operation Type           CPU             GPU
Dense Matrix – Dense Matrix     MKL + OpenMP    CUBLAS + CUDA
Sparse Matrix – Dense Matrix    MKL + OpenMP    CUSPARSE + CUDA
Sparse Matrix – Sparse Matrix   CSparse         CUSP

Table 3.5: CPU and GPU Libraries For Matrix Operations

performance of 1030 GFLOPS12. Table 3.4 also shows the projected specifications of SMA-A. The SMA-A hardware specifications highly resemble the specifications of modern GPUs, which implies that the SMA-A specifications are realistic and realizable. As mentioned in Section 3.3.1, the FPGA implementation (SMA-F) uses an Altera Stratix IV GX FPGA. Since the SMA core modules run at a clock frequency of 100 MHz, the SMA-A performance is estimated by measuring the runtime of SMA-F in number of cycles. The SMA-A is assumed to run the exact same number of cycles at a faster clock frequency (1 GHz). Thus, multiplying the number of cycles by 1 ns gives the SMA-A runtime. The measurement on SMA-F is done with hardware counters supported by the Nios II processors. The CPU performance is evaluated using the gettimeofday function. The GPU uses cudaEvent functions to accurately measure the GPU runtime. Table 3.5 summarizes the software libraries used in the CPU and GPU applications. The applications for the CPU utilize the Intel Math Kernel Library (MKL 10.3.4) and CSparse [16] for optimized matrix operations. OpenMP is used to parallelize element-wise matrix operations, and also enables the multi-threading capability in MKL routines. Since the CSparse library uses double precision floating point numbers for the matrix element values by default, we modified the CSparse library to use single precision floating point numbers instead for a fair performance comparison with the GPU and SMA-A. The GPU codes use CUBLAS 4.0, CUSPARSE 4.0, and CUSP 0.2.0 for most matrix operations. Element-wise matrix operations not supported by the GPU libraries are hand-coded as CUDA kernels.
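As a concrete illustration of the CPU-side timing methodology, the fragment below wraps a matrix routine with gettimeofday calls. The run_matrix_op function is a hypothetical placeholder for the MKL or CSparse call being measured, not part of any library used here.

```c
/* Wall-clock timing of a CPU matrix routine with gettimeofday (sketch). */
#include <sys/time.h>

static void run_matrix_op(void)
{
    /* Placeholder for the MKL/CSparse routine under measurement. */
}

double time_op_seconds(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);         /* timestamp before the operation */
    run_matrix_op();
    gettimeofday(&t1, NULL);         /* timestamp after the operation */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}
```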

Figure 3.14: Matrix Multiplication Performance: (a) Dense Matrix * Dense Matrix (b) Uniform 1% Sparse Matrix * Dense Matrix (c) RMAT Sparse Matrix * Dense Matrix (d) Uniform 1% Sparse Matrix * Sparse Matrix (e) RMAT Sparse Matrix * Sparse Matrix (f) RMAT32 Sparse Matrix * Sparse Matrix

3.4.2 Matrix Operation Performance Analysis

In this section, the performance of individual matrix operations is measured and analyzed. As shown in Section 3.1.1, since the majority of the execution time is spent in matrix multiplication, the matrix multiplication performance is evaluated in more detail than that of the element-wise matrix operations. Four types of matrices are used in this analysis: dense, uniform 1%, RMAT, and RMAT32. A dense matrix is a matrix whose number of non-zero elements is very close to the total number of elements of the matrix. The format used for dense matrices is a dense array, where the elements are stored in row-major order. Uniform 1% matrices are sparse matrices whose non-zero elements are drawn from a uniform

12A core can issue a multiply-add floating point operation each cycle; thus, the peak floating point performance is 448 (cores) × 2 (FLOP) × 1.15 (GHz) = 1030.4 GFLOPS.

random distribution with a probability of 0.01. RMAT [14] matrices are also synthetically generated sparse matrices, representing realistic graphs that follow a power-law degree distribution. RMAT graphs resemble many real-world networks, such as the internet, social networks, and protein-protein interaction networks. RMAT and RMAT32 are generated using the same algorithm, except that RMAT32 has a denser distribution due to a larger fanout. The maximum number of edges in RMAT is eight times the number of nodes, while the maximum number of edges in RMAT32 is 32 times the number of nodes. The format used for the sparse matrices is Compressed Sparse Block in SMA-A, while Compressed Sparse Column was used on the CPU and GPU to utilize the highly optimized sparse matrix libraries.
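For reference, a uniform p-density sparse matrix of the kind described above can be generated with a few lines of C. The generator below stores the matrix densely for clarity; the actual experiments store the operands in CSB (SMA-A) or CSC (CPU/GPU) form, and the function name and use of rand() are illustrative.

```c
/* Generate an n x n matrix where each element is non-zero with probability p. */
#include <stdlib.h>

void gen_uniform_sparse(float *a, int n, double p, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n * n; i++)
        a[i] = ((double)rand() / RAND_MAX < p)
             ? (float)rand() / RAND_MAX    /* arbitrary non-zero value in [0, 1] */
             : 0.0f;
}
```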

Matrix Multiplication

Figure 3.14 shows the matrix multiplication performance comparison between different hardware platforms and matrix types. Only square matrices were used in the performance measurements. The x-axes represent the number of elements per row or column of each matrix. The y-axes show the speedup in comparison to the single-threaded performance of the Nehalem-based Xeon processor. Figure 3.14(a) displays the dense matrix multiplication performance. As can be seen in the figure, SMA-A shows floating point performance close to the theoretical peak FLOPS (512 GFLOPS) for the dense matrix multiplication operation. The GPU and CPU also show relatively good floating point performance, although they fail to reach their theoretical maximum FLOPS (1030 GFLOPS and 171 GFLOPS, respectively). Nonetheless, this implies all three platforms make efficient use of their on-chip memory to stay within the memory bandwidth while fully utilizing the hardware floating point units. Figure 3.14(b) and Figure 3.14(c) illustrate the speedup for sparse matrix times dense matrix multiplication (SDMM). In both plots, the GPU and SMA-A benefit from their large memory bandwidth due to the memory-intensive nature of sparse matrix multiplication. In the case of SMA-A, the average memory throughput measures between 44 GB/s and 75 GB/s, and simulation results reveal that the peak throughput may be much higher. This indicates that SMA-A very often requires and utilizes a memory

bandwidth larger than that of the CPU platform to sustain its performance advantage over the CPU. In the case of the GPU, it is difficult to estimate how many DRAM memory transactions took place, since the data reuse algorithm in the CUSPARSE library is unknown. However, since the GPU performance is significantly better than that of the CPU, even though the average GPU FLOPS (approximately 10 GFLOPS) is considerably lower than the maximum floating point performance (1030 GFLOPS), we can deduce that a large portion of the GPU performance gain comes from the memory bandwidth. Thus, due to the large memory bandwidth and abundant parallel resources, both the GPU and SMA-A show huge speedups compared to the multi-core CPU platform. The uniform sparse matrix multiplication performance in Figure 3.14(b) shows that the GPU performance for smaller matrices is lower than that of the SMA-A. This is due to the SIMT structure of the GPU, which requires a certain number of non-zero elements clustered in a row or column to achieve good performance. Since 1% of the elements are non-zero, a row or column has about 20, 40, and 80 non-zero elements on average for matrix sizes of 2048, 4096, and 8192, respectively. The Fermi GPU, whose warp size is 32, suffers from load imbalance among threads when the warps are not fully utilized. Thus, a uniform sparse matrix of size 2048 is likely to underutilize the computation resources, and a matrix of size 4096 would only fully utilize the warps every other cycle13. In contrast, SMA-A does not suffer from such SIMD-like constraints. Figure 3.14(c) shows that SMA-A suffers a performance drop on larger RMAT matrices. To identify the reason for the performance drop, we conducted an RTL simulation of the RMAT sparse matrix multiplication. Our simulation analysis shows that a large portion of the inefficiency is due to the issues related to the CSB format mentioned in Section 3.3.4. In particular, the large overhead of accessing small CSB matrix blocks, as well as the limited amount of parallelism within each CSB matrix block, greatly contributed to the performance degradation of SMA-A, as described in Section 3.3.4. For the RMAT matrix of size 2048, 244 out of 1024 CSB matrix blocks have a density of

13The first of the alternating cycles utilizes all 32 cores in a warp to process 32 of the 40 non-zero elements, but the next cycle makes use of only 8 cores for the remaining 8 elements.

greater than 1%. Also, only 1.5% of the CSB matrix blocks were empty. However, for the RMAT matrix of size 8192, only 474 out of 16384 CSB matrix blocks have a density greater than 1%, and 25% of the CSB matrix blocks were empty. Thus, the RMAT 8192 matrix is a considerably worse environment for the SMA-A due to the large number of memory accesses probing empty CSB matrix blocks and the large number of CSB matrix blocks that are incapable of keeping the floating point resources busy. Figure 3.14(d), Figure 3.14(e), and Figure 3.14(f) plot the performance of the sparse matrix - sparse matrix multiplication (SSMM) operation for uniform 1%, RMAT, and RMAT32 sparse matrices, respectively. The CPU speedups are always 1 in these figures, since the CSparse library does not support multi-threading. The GPU, which uses the CUSP library, shows SSMM performance similar to the single-threaded CPU performance for most sparse matrices. We believe the performance drop of the GPU comes from the irregularity of memory accesses in SSMM operations, which makes it difficult to efficiently coalesce memory accesses and operate in a SIMD manner. Also, since the result of the SSMM is a sparse matrix, the ordering requirement of the non-zero elements in a sparse matrix format may also limit the parallelism by introducing synchronization and serialization. Figure 3.14(d) shows the SSMM performance using uniform 1% sparse matrices. Although only 1% of the elements are non-zero, the result matrix can be somewhat dense; the resulting matrices of size 2048, 4096, 8192, and 16384 have densities of 18.49%, 33.63%, 55.92%, and 80.58%, respectively. Thus, the GPU performance number for matrix size 16384 is not shown in Figure 3.14, since the result matrix is too large and dense to fit in the 3GB graphics memory. As can be seen from the figure, the SMA-A outperforms both the GPU and CPU by a factor of 5. Although the SMA-A speedup for SSMM operations may not be as impressive as the performance gains seen for SDMM and DDMM operations, the plot still demonstrates the benefits of using specialized hardware for sparse matrix operations. However, as mentioned in Section 3.3.4, certain limitations may prohibit SMA-A from fully utilizing the memory bandwidth or the computation resources. Figure 3.14(e) and Figure 3.14(f) use RMAT matrices with different non-zero densities to compare the performance of the hardware platforms for SSMM operations. For example, the RMAT 2048 matrix has an overall density of 0.3%. As explained above, 1.5% of


the CSB matrix blocks are empty, and 24% of the CSB matrix blocks have a density above 1%. The RMAT32 2048 matrix has four times the number of non-zeros, resulting in an overall density of 1.4%. 496 out of 1024 matrix blocks have a density greater than 1%, and only 1 out of the 1024 matrix blocks is empty. Due to more parallelism and less format overhead, using RMAT32 matrices yields better performance than using the original RMAT graphs. In general, the discrete memory bandwidth has a greater impact on the SSMM performance than the memory latency, since many memory requests are determined and streamed out without any additional dependency. This is especially true if each matrix block has enough non-zero elements to hide the latency of reading the matrix

In comparison, GPUs have specialized DRAMs that are tweaked to have large bandwidth at the cost of latency. For instance, the NVIDIA Tesla C2050 is known to have a round-trip GDDR5 memory latency of around 500 cycles14. On the other hand, DDR3 memory for CPUs tends to have lower round-trip latency (several tens to a few hundred cycles). Figure 3.15 shows how sparse matrix multiplication (SSMM) performance on SMA-A changes with memory latency. In this experiment, the round-trip memory latency was swept from 32 core clock cycles to 500 cycles. Since FPGA compilation times are usually very long, RTL simulation was used to measure the performance. Because RTL simulation runs much slower than actual hardware, matrix sizes had to be reduced to 256 or 512. Three different sparse matrices were tested: a uniform 1% sparse matrix, a uniform 7% sparse matrix, and a clustered 1% sparse matrix. As mentioned earlier, the product of larger uniform 1% sparse matrices tends to be denser. For example, an 8192x8192 uniform 1% sparse matrix multiplication results in a matrix with 56% non-zero elements. However, the SSMM operation on a uniform 1% sparse matrix of size 256 results in a 1% sparse matrix, which has considerably different memory write characteristics compared to larger uniform matrices. Thus, to mimic similar write patterns in SSMM, we also tested 7% uniform sparse matrices, where the result matrix has around 50% non-zero elements. The clustered 1% sparse matrix represents sparse matrices whose non-zero elements are mostly grouped into several relatively dense chunks (approximately 7% non-zero elements within each chunk). As can be seen in Figure 3.15(a), the uniform 1% sparse matrix multiplication performance drops by 45% as the latency is increased from 32 to 500. The reason for this is twofold. First, the total runtime is only 2708 cycles, so adding a 500-cycle latency to the total runtime already greatly degrades performance. Second, there is the overhead of each matrix block. Approximately 1% of the elements per block are non-zero, but the matrix block pointers are not read fast enough due to the long round-trip latency. Since the decode unit cannot generate addresses for non-zero elements without knowing the number of non-zeros within a matrix block,

14Source: various NVIDIA discussion boards on the web.

the decode unit will stall until the memory requests for the matrix block pointers are served. As a side note, the floating point throughput is much lower than in the other two examples, Figure 3.15(b) and Figure 3.15(c), since the probability of non-zero elements in each matrix actually being multiplied is much lower in the former case. In the case of Figure 3.15(b) and Figure 3.15(c), the performance drop due to long memory latency is relatively small. In the case of uniform 7% sparse matrix multiplication, the number of non-zero elements is sufficiently large to hide the memory latency for reading the matrix blocks. In the case of Figure 3.15(c), the sparse matrix dimension is 512x512. The number of non-zero elements is about the same as in the uniform 7% sparse matrix case, although they are clustered together to form a few relatively dense matrix blocks. Since address generation for non-zero elements is seldom stalled, the overall throughput is about the same, although the clustered case has additional overhead due to many empty matrix blocks (75% of the matrix blocks are empty). To summarize, memory latency has little effect as long as a sufficient number of non-zero elements is supplied within matrix blocks to hide the memory latency for reading matrix block pointers. However, if the number of non-zero elements per matrix block is too small, the overhead of reading a matrix block pointer may be overwhelming and cause considerable performance degradation, as mentioned above and in Section 3.3.4. This is a fundamental problem of using the Compressed Sparse Block format with small fixed-size blocks and may not be a problem for other sparse matrix formats. Figure 3.16 illustrates the effect of work-stealing. The matrices used in Figure 3.16(a) are synthetic sparse matrices whose non-zero elements are mostly concentrated in certain rows. To conveniently turn the work-stealing mechanism on and off, we performed the test in RTL simulation. To compare the impact of work-stealing across different non-zero element concentrations, we varied the number of non-zero elements per concentrated row and the number of concentrated rows while keeping the total number of non-zeros constant, i.e. maintaining the sparseness of 1% non-zeros. A matrix size of 512x512 was used so that simulations finish in a reasonable amount of time. Each pair of bars in Figure 3.16(a) represents the execution time for a particular non-zero concentration.

[Figure 3.16 appears here: bar charts of execution time in cycles with and without work-stealing, for (a) concentration on 1% of the rows, (b) concentration on 2% of the rows, (c) the Betweenness Centrality (BC) SSMM example, and (d) aggregate BC performance.]

Figure 3.16: SMA-A Work-stealing Performance Using Skewed Matrices

The first two bars in Figure 3.16(a) are the extreme case where 75% of the non-zero elements reside within four randomly chosen rows and the remaining 25% are evenly distributed among the other 508 rows. The next two bars in the plot have 50% of the non-zeros in the randomly chosen rows, while the remaining 50% of the non-zeros are evenly distributed among the other rows. The third pair of bars shows the result of having 25% of the non-zeros in four rows. Finally, the last two bars use a uniform random 1% sparse matrix for performance comparison. The y-axis indicates the runtime in this plot. The numbers on top of each bar indicate the speedup over the cases without work-stealing.

As can be seen in the figure, the performance benefits are greater for the more skewed matrices. In addition, no performance overhead was seen from work-stealing when running tests on the uniform sparse matrix. From the figure, we can also see the negative effects of only stealing work between designated partner threads, which basically limits the amount of parallelism. Note that the 75% concentration case runtime is twice as long as the uniform random sparse matrix runtime. Ideally, work-stealing would evenly distribute the workload among threads such that the resulting performance would be similar to that of using the uniform sparse matrix, since the total number of non-zero elements is about the same. However, the performance gains from using work-stealing in SMA are at most 2X since the workload can only be shared between partner threads. Since the non-zero elements are concentrated in four rows, only eight threads (including partner threads) would be actively processing data. Figure 3.16(b) shows the work-stealing performance when concentrating the non-zeros in eight rows instead of four. The performance gains of work-stealing are less dramatic than in Figure 3.16(a) since the matrices are less skewed. However, it also shows that the performance approaches the uniform sparse matrix runtime more quickly, since 16 threads are now mostly active via work-stealing. Figure 3.16(c) uses matrices generated from Betweenness Centrality to see how work-stealing affects the performance of an actual application. The x-axis shows the fringe matrices created after each iteration of the first inner loop shown in Figure A.3. Although work-stealing does not show benefits for all iterations, one particular iteration shows a 50% performance improvement. Figure 3.16(d) aggregates the matrix operation runtimes in Figure 3.16(c) and shows that work-stealing in BC can result in a 10% performance gain.
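To make the partner-thread restriction concrete, the following sketch models two threads that each own a queue of rows and steal only from their designated partner when their own queue runs dry. It is a simplified software analogy of the mechanism described above, not the SMA RTL; the queue structure and function names are invented for illustration.

#include <stdio.h>

#define QCAP 64

typedef struct {
    int rows[QCAP];
    int head, tail;              /* pop own work from the head, partner steals from the tail */
} row_queue;

static int pop_own(row_queue *q)            { return (q->head < q->tail) ? q->rows[q->head++] : -1; }
static int steal_from_partner(row_queue *p) { return (p->head < p->tail) ? p->rows[--p->tail] : -1; }

int main(void) {
    row_queue t[2] = {{{0}, 0, 0}, {{0}, 0, 0}};

    /* Skewed assignment: thread 0 owns 12 rows, thread 1 owns only 2. */
    for (int r = 0; r < 12; r++) t[0].rows[t[0].tail++] = r;
    for (int r = 12; r < 14; r++) t[1].rows[t[1].tail++] = r;

    int processed = 0;
    while (processed < 14) {
        for (int id = 0; id < 2; id++) {
            int row = pop_own(&t[id]);
            if (row < 0) row = steal_from_partner(&t[id ^ 1]);   /* designated partner only */
            if (row >= 0) { printf("thread %d processes row %d\n", id, row); processed++; }
        }
    }
    return 0;
}

With the skewed assignment in main(), thread 1 drains its two rows quickly and then steals from the tail of thread 0's queue, which is exactly the partner-limited sharing that caps the gain at roughly 2X.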

Element-wise Operations

The execution time of element-wise operations is typically much less than the execution time of matrix multiplication. However, as seen in Figure 3.1, element-wise operations also need to be accelerated to achieve a significant overall speedup. Figure 3.17(a) shows the performance of selected element-wise operations on dense matrices. The evaluation was performed using a 4096x4096 dense matrix, and speedup is relative to single-threaded CPU performance.

[Figure 3.17 appears here: speedup bars for the GPU and the FPGA relative to a single-threaded CPU, for (a) dense element-wise operations (ADD, LOG2, EXP2, RCP) and (b) sparse matrix addition (uniform 1% and RMAT matrices).]

Figure 3.17: Element-wise Performance

Multi-threaded CPU performance is not shown, since our experiments with multi-threaded versions of element-wise operations using OpenMP actually degraded the performance. As can be seen in the figure, the FPGA performance generally lags behind the GPU performance, with the exception of the LOG2 function. One reason is that the dense element-wise operations have been less tuned in SMA-F compared to matrix multiplication, since the focus of our accelerator is sparse matrix operations; there are some known inefficiencies in element-wise operations that have not been resolved yet. Another major reason for the performance difference is that EXP2 and RCP are not pipelined and use an iterative algorithm in order to save area. Figure 3.17(b) shows the sparse matrix addition performance. Element-wise special functions are not tested for sparse matrices since LOG2 and RCP cannot have zero as an input for SMA-F. The CSparse cs_add function was used as the CPU baseline performance, and the cusp::add function was used for the GPU sparse matrix addition evaluation. As shown in the figure, SMA-A shows 5X to 20X speedups over the CPU implementation, depending on the matrix type. However, the GPU tends to suffer in performance due to the irregularity of adding two sparse matrices and creating a new sparse matrix.
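The irregularity comes from the merge-like structure of sparse addition: which element is consumed next depends on the data itself. The sketch below adds one row of two sparse matrices stored as sorted (column, value) lists, a CSR-style view chosen for brevity rather than the accelerator's CSB format; the function and variable names are illustrative.

#include <stdio.h>

static int add_sparse_row(const int *colA, const float *valA, int nA,
                          const int *colB, const float *valB, int nB,
                          int *colC, float *valC) {
    int a = 0, b = 0, c = 0;
    while (a < nA || b < nB) {
        if (b >= nB || (a < nA && colA[a] < colB[b])) {          /* only A has this column */
            colC[c] = colA[a]; valC[c++] = valA[a++];
        } else if (a >= nA || colB[b] < colA[a]) {               /* only B has this column */
            colC[c] = colB[b]; valC[c++] = valB[b++];
        } else {                                                 /* both: add the values   */
            colC[c] = colA[a]; valC[c++] = valA[a++] + valB[b++];
        }
    }
    return c;   /* number of non-zeros in the result row */
}

int main(void) {
    int   colA[] = {1, 4, 7};   float valA[] = {1.f, 2.f, 3.f};
    int   colB[] = {2, 4, 9};   float valB[] = {5.f, 6.f, 7.f};
    int   colC[6];              float valC[6];
    int   n = add_sparse_row(colA, valA, 3, colB, valB, 3, colC, valC);
    for (int i = 0; i < n; i++) printf("(%d, %.1f) ", colC[i], valC[i]);
    printf("\n");
    return 0;
}

Every iteration takes a different branch depending on the column indices, and the output length is unknown in advance, which is why coalesced SIMD execution and preallocated output buffers are hard to arrange on a GPU.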

[Figure 3.18 appears here: speedup bars for (a) RBM performance results, comparing CPU (8 threads), GPU, and SMA-A, and (b) sparse RBM (sRBM) performance results, comparing CPU (8 threads) and SMA-A, for training image lengths 48, 64, and 96.]

Figure 3.18: SMA-A Restricted Boltzmann Machine Performance

3.4.3 Application Performance Analysis

This subsection presents the performance results of SMA-A running on applications using realistic data. Detailed analysis is performed on each application to better understand the performance gains at the application level. Both the original Restricted Boltzmann Machine and the sparse Restricted Boltzmann Machine were used to test the hardware platforms. Figure 3.18 shows the performance of running the original and sparse Restricted Boltzmann Machine on each hardware platform. Although different configurations can greatly impact the accuracy of the model, we always set the number of visible and hidden variables to be equal for simplicity, since these tests were conducted for runtime performance analysis. The MNIST handwritten digits were used as the training data set, where each training example is a 28x28 image. To test larger RBM network sizes, we scaled each handwritten digit via simple image processing. Training image length in Figure 3.18 represents the scaled image data and the weight matrix sizes. For example, training image length 48 indicates that 48x48 images (2304 visible variables in total) were used for training a 2304x2304 weight matrix. Figure 3.18(a) compares the multi-threaded CPU, GPU, and SMA-A performance over a single-threaded CPU for running the original dense RBM algorithm.

[Figure 3.19 appears here: (a) RBM execution time breakdown and (c) sRBM execution time breakdown over the operations MM, EXP, DIV, ELEMWISE, and SUM for training image lengths 48, 64, and 96; (b) RBM matrix operation speedup and (d) sRBM matrix operation speedup for CPU (OpenMP), GPU, and SMA-A.]

Figure 3.19: SMA-A RBM Performance Analysis

As shown in the figure, the GPU and SMA-A show comparable performance with a speedup close to 20, while the multi-threaded CPU reaches the maximum performance obtainable from utilizing all eight cores. Figure 3.19 further analyzes the dense RBM performance in more detail. Figure 3.19(a) shows the RBM execution time breakdown of a single-threaded CPU, and Figure 3.19(b) illustrates the speedups for each matrix operation using training image length 64. More than 90% of the runtime is consumed by the matrix multiplication operation, followed by the element-wise exponential operation. As shown in Figure 3.19(b), the SMA-A and GPU exhibit 23.5X and 19X speedups from

dense matrix multiplication, which accounts for the majority of the RBM execution. The GPU also shows a 180X performance gain in EXP operations, considerably reducing the exponential runtime and slightly increasing its overall speedup to 19.4X. On the other hand, the SMA-A EXP function showed an 18X performance gain, resulting in an overall speedup of 22.5X, slightly below its matrix multiplication speedup. Figure 3.18(b) contrasts the performance gain of using a multi-threaded CPU and the SMA-A for the sparse RBM. The CPU used the Intel MKL instead of CSparse for this test since only sparse matrix - dense matrix multiplication (SDMM) and dense - dense matrix multiplication (DDMM) were involved. The GPU was not tested due to a few difficulties in sparsifying the weight matrix and allocating threads to the receptive fields. The multi-threaded CPU steadily yields an approximately 5X speedup, while the SMA-A performance gain ranges from 10X to 25X. This is somewhat related to the sparseness issue discussed in Section 3.3.4 and Section 3.4.2. Since the size of the receptive fields (7x7) is fixed for all three tests, the sparseness of the weight matrix increases with the matrix size, i.e. the first weight matrix has 2.1% non-zeros, the second has 1% non-zeros, and the largest has 0.5% non-zeros. The single-threaded CPU execution time breakdown is shown in Figure 3.19(c), and the performance gains of the matrix operations are compared in Figure 3.19(d). The element-wise arithmetic and special operations make up around 40% of the entire execution, which is significantly more than in the case of the dense RBM. Figure 3.19(d) shows that SMA-A has approximately 18X speedups for EXP operations and element-wise addition/multiplication operations, but displays low performance gains for inverse (DIV) and summation operations. Fortunately, the runtime portion of the DIV and summation operations is much smaller than the EXP and ELEMWISE runtime, having negligible impact on the overall performance. One may note that the RBM implementation in the previous chapter matches GPU performance with an FPGA running at a 150MHz – 200MHz clock frequency. However, the SMA FPGA implementation (SMA-F) is approximately 10X slower than the GPU. Several factors contribute to the slowdown. First, the NVIDIA C2050 used in this test has 448 CUDA cores, which is nearly twice as many as the 240 CUDA cores of the NVIDIA GTX 275 used in the previous chapter. Also, SMA-F is running at

a 100MHz clock frequency, which is much slower than the clock frequency used for the RBM FPGA implementation. In addition, many of the element-wise computations were pipelined in the hardware RBM, effectively eliminating a large portion of the memory I/O. Pipelining allowed distinct operations to be computed in parallel. For example, the positive phase computation H = 1/(1 + exp(−W · V − bias)) is pipelined in such a way that the matrix multiplication, bias addition, and sigmoid are all executed in parallel, resulting in a throughput of one hidden neuron per cycle. On the other hand, SMA-F implements this phase in six different matrix operations15 (a small numeric sketch of this sequence follows the list):

1. matrix multiplication
2. bias addition
3. matrix scale operation (scale by -log2(e))
4. exponential base-2 operation (e^(−x) = 2^(−log2(e)·x))
5. matrix addition by one
6. RCP (inverse) operation.
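The sketch below walks the same six steps on a single value in plain C, mirroring the SMA-F sequence element by element; the constant and function names are ours, and real SMA-F code would issue the corresponding matrix instructions instead of scalar math.

#include <math.h>
#include <stdio.h>

/* Six-step sigmoid, following the decomposition above for one element x = (W*V)_i. */
static float sigmoid_via_exp2(float wx, float bias) {
    const float LOG2E = 1.44269504f;     /* log2(e) */
    float t = wx + bias;                 /* steps 1-2: matrix multiply result, bias add */
    t = t * -LOG2E;                      /* step 3: scale by -log2(e)                   */
    t = exp2f(t);                        /* step 4: 2^(-log2(e)*t) equals e^(-t)        */
    t = t + 1.0f;                        /* step 5: add one                             */
    return 1.0f / t;                     /* step 6: reciprocal (RCP)                    */
}

int main(void) {
    float x = 0.7f, bias = -0.2f;
    printf("decomposed: %f  direct: %f\n",
           sigmoid_via_exp2(x, bias),
           1.0f / (1.0f + expf(-(x + bias))));
    return 0;
}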

There are also other minor differences, such as increased ALU latency due to the use of floating point, streaming data from DRAM at a reduced memory bandwidth, etc. We believe these differences add up to the 10X difference in the accelerator vs. GPU performance comparison. The performance test for Markov Clustering (MCL) was done using RMAT graphs, modified by adding self-loops to help with convergence. Since there are no dense matrices involved in the algorithm (although some of the sparse matrices may contain enough non-zeros to be considered dense), we use the CSparse library for the CPU evaluation; thus, all CPU tests are single-threaded. Figure 3.20(a) shows the SMA-A performance of running MCL in comparison to the single-threaded CPU performance. The x-axis displays the matrix sizes used; for example, MCL RMAT 8K indicates a graph with 8192 nodes. As can be seen in the figure, SMA-A yields a considerable speedup, ranging from 50X to 76X, over the CPU implementation.

15Compared to the RBM implementation, SMA has flexibility but loses some efficiency.

[Figure 3.20 appears here: (a) Markov Clustering speedup of SMA-A over a single-threaded CPU, (b) CPU execution time breakdown, (c) matrix operation speedup for SSMM and element-wise operations, and (d) SMA-A execution time breakdown (SSMM, DIV, SUM/MAX, ELEM-WISE) for MCL RMAT 2K, 4K, and 8K.]

Figure 3.20: SMA-A Markov Clustering Performance and Analysis

To better understand where the performance gains come from, Figure 3.20(b) shows the single-threaded CPU execution time breakdown. Approximately 90% of the execution time is dominated by matrix multiplications, and the remaining 10% is consumed by element-wise addition/multiplication operations. The performance gains of using the accelerator for these matrix operations are shown in Figure 3.20(c). The greatest speedups were seen in the SSMM operation, ranging from 59X to 124X. The huge performance difference between SMA-A and the CPU is mostly due to the denser sparse matrices. As noted in Section 3.1.1, the density of MCL matrices varies greatly, and the majority of the execution time is spent in computing the denser

matrices. For these types of matrices, SMA-A is capable of fully utilizing the floating point units via asynchronous threads with buffering support and work-stealing. The CSparse library, on the other hand, is highly optimized for very sparse matrices and computes in a sequential manner; thus, the CPU implementation of MCL is expected to suffer in performance on the denser matrices. Although huge performance improvements are seen in the SSMM operations, the other element-wise operations do not see as much runtime benefit. Thus, as shown in Figure 3.20(d), a considerable portion of the execution time is spent in these element-wise operations, affecting the overall MCL performance. Figure 3.20(d) also shows that the SSMM portion of the execution time increases with the matrix size. This can be explained by two different factors. First, a matrix multiplication has a runtime of O(n^3), while the element-wise operations have O(n^2) runtime complexity. Thus, the execution time of a matrix multiplication tends to increase faster than that of the other element-wise matrix operations, although the exact runtime complexity varies with the number and distribution of non-zero elements. Another reason is that the SSMM speedups decrease as the matrix size increases, as shown in Figure 3.20(c). As the MCL matrix size increases, the density of the densest matrix has been shown to decrease: MCL 2K has 54% non-zero density, MCL 4K has 41% density, and MCL 8K has 30% density. Since the CSparse library performs better on sparser matrices, the relative MCL speedup decreases. Like Markov Clustering, Betweenness Centrality performance was measured on the CPU and SMA-A using RMAT graphs. The RMAT graphs were generated exactly in the same manner as shown in [5]. Again, we use the CSparse library for the CPU evaluation due to the extensive use of the SSMM operation in the application. Figure 3.21(a) shows the SMA-A performance compared to the single-threaded CPU performance. The x-axis represents the different matrix sizes used in the evaluation; for example, BC RMAT 13 denotes a graph with 2^13 nodes. The SMA-A achieves good performance compared to the single-threaded CPU, but not as impressive as the MCL results. To better analyze the performance bottleneck, Figure 3.21(b) breaks down the CPU execution time, and Figure 3.21(c) compares the accelerated performance between matrix operations. In contrast to MCL, a considerable amount of computation takes place within the element-wise matrix operations.

[Figure 3.21 appears here: (a) Betweenness Centrality speedup of SMA-A over a single-threaded CPU, (b) CPU execution time breakdown, (c) matrix operation speedup for SSMM and element-wise operations, and (d) SMA-A execution time breakdown (SSMM, DIV, ELEM-WISE, OTHERS) for BC RMAT-13, RMAT-14, and RMAT-15.]

Figure 3.21: SMA-A Betweenness Centrality Performance and Analysis

Thus, accelerating the element-wise matrix operations is likely to have a large impact on the overall BC performance. In fact, we can infer from Figure 3.21(a), (b), and (c) that most of the performance enhancement is due to the acceleration of these element-wise matrix operations. Figure 3.21(d) further supports this explanation, as it shows that the runtime percentage of SSMM in SMA-A increased to 90%, compared to 55% in the CPU evaluation. Similar to MCL, the sparseness of the BC fringe matrices dynamically changes per iteration. However, the SSMM operations in BC have very different characteristics

from MCL. The SSMM operations in MCL compute the square of the transition probability matrices, i.e., M^2. Thus, the computation time quickly increases as the transition probability matrices become denser. As seen earlier, the majority of the MCL execution time was spent computing matrix multiplications of the denser transition probability matrices. On the other hand, the fringe matrices in BC are always multiplied by the adjacency matrix, which is constant throughout the iterations and is very sparse; e.g., the “BC RMAT 15” matrix used in the evaluation has a non-zero density of 0.02%. As seen in Section 3.3.4 and Section 3.4.2, SMA-A suffers from the inefficiencies of the CSB format with small matrix block sizes when the matrix is very sparse. To overcome this limitation, SMA-A must either support multiple sparse matrix formats or flexible matrix block sizes. Such improvements are subject to future work.
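A quick calculation shows why such densities hurt the fixed-size blocking. The sketch below assumes, purely for illustration, 64x64 CSB blocks; the block dimension is not a quoted SMA-F parameter.

#include <stdio.h>

int main(void) {
    const double block_elems = 64.0 * 64.0;
    const double densities[] = {0.30, 0.02};   /* percent: an MCL-like matrix vs. BC RMAT 15 */

    for (int i = 0; i < 2; i++) {
        double nnz_per_block = block_elems * densities[i] / 100.0;
        printf("density %.2f%% -> about %.1f non-zeros per 64x64 block\n",
               densities[i], nnz_per_block);
    }
    return 0;
}

At 0.02% density the average block holds less than one non-zero, so almost every block-pointer read is pure overhead, which is the behavior described above.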

3.4.4 Summary

The significance of sparse matrices has been recognized in many fields, and the number of sparse matrix applications are growing. However, conventional hardware cannot easily accelerate sparse matrix operations with their SIMD computational resources. We propose a novel sparse matrix accelerator, which focuses on enhancing sparse matrix operations that do not match SIMD-style computation. Restricting hardware acceleration only to specific matrix operations was decided as a compromise between flexibility and efficiency. We considered three applications in our design: Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. Our initial analysis revealed that these applications spend majority of the execution time in matrix multiplication; however, other matrix operations cannot be neglected as they would quickly become the new bottleneck if not accelerated. In addition, the accelerator must tolerate workload imbalance observed in many of the matrices used in these applications. The sparse matrix accelerator architecture consists of a decode unit, an encode unit, a large array of thread blocks. The decode unit streams in matrix data and decodes it to a unified format before forwarding it to the thread blocks. Each thread block consists of multiple threads, where each thread consists of multiple floating CHAPTER 3. SPARSE MATRIX ACCELERATOR 95

point ALUs. Threads execute asynchronously with adequate input buffering, which reduces the chance of underutilizing the floating point resources seen in lock-step SIMD computations. In addition, threads support fine-grain work-stealing in order to tolerate the variances in workloads. The encode unit writes the computation result into memory in sparse or dense matrix format, depending on the user configuration. Since the applications we focused on require sparse matrix - sparse matrix multipli- cation results to be written in a sparse matrix format, we designed the sparsification modules in the encode unit to compute the offsets of the non-zero elements for each row in parallel, despite the ordering dependencies between and within rows. A prototype of the sparse matrix accelerator was implemented on an FPGA board, to validate the design and evaluate the potential efficiency of an ASIC implementa- tion. The development board used for the FPGA prototype (SMA-F) features an Altera Stratix IV FPGA with DDR2 DRAM and PCI Express interface. The Al- tera Nios II processor interfaces with the SMA-F accelerator core via the custom instruction interface, and was used to execute sparse matrix applications using SMA- F instructions. By measuring the SMA-F runtime, we demonstrated that the estimated perfor- mance of SMA-A, the ASIC version of the sparse matrix accelerator, would be compa- rable to GPUs on dense matrix operations, and considerably faster than conventional hardware on sparse matrix operations. The dense matrix multiplication performance of the SMA-A reaches near the peak floating point performance (512 GFLOPS) and shows a consistent 27X speedup over a high-end single-threaded CPU performance. The SMA-A speedup on sparse matrix operations varies greatly, ranging 1.2X – 120X, depending on the distribution of non-zero elements. In general, sparse matrices with sufficient number of non-zero elements per fixed-sized matrix block exhibited greater performance gains than sparse matrices with very few non-zero elements. SMA-A also delivered large performance improvements in the three targeted sparse matrix applications. The SMA-A was able to run the Sparse Restricted Boltzmann Machine 26 times faster than a single-threaded CPU on smaller training images, while the speedup was around 10X for larger training images, assuming the receptive field size remains fixed across different image sizes. Markov Clustering runs 50 to CHAPTER 3. SPARSE MATRIX ACCELERATOR 96

76 times faster using SMA-A, greatly benefiting from the acceleration hardware for sparse matrix multiplication. The speedup for Betweenness Centrality ranges from 3.1X to 5.7X, where most of the performance gains come from accelerating the element-wise operations.

Chapter 4

Concluding Remarks and Future Directions

We are now in an era where multi-core microprocessors are the mainstream. Multi- core processors are found everywhere, from energy-stingy mobile platforms to high- end computing clusters. At the same time, we are also entering an era in which more application-specific accelerators are being integrated into microprocessors. Ap- plication specific hardware has proven to be efficient both in terms of energy and performance, and is a strong candidate solution for overcoming the emerging dark silicon apocalypse. This thesis investigates designing custom hardware to accelerate emerging ap- plications that are mostly composed of matrix operations. Although conventional SIMD processors perform well on matrix operations using dense and wide matrices, the SIMD resource utilization may significantly drop on other types of matrices. We observed less floating point throughput on matrix operations using matrices with narrow widths, and found it challenging to efficiently map irregular sparse matrix operations onto SIMD computations. By designing a custom accelerator specifically tailored to these types of matrix operations, we were able to see significant perfor- mance improvements.


4.1 Contributions of Research

The Restricted Boltzmann Machine (RBM), a useful tool for training Deep Belief Net- works (DBN), was first studied with the objective of overcoming the computational barrier in increasing the number of neurons. We designed a Restricted Boltzmann Machine accelerator using a modular approach to provide good performance and scalability. We demonstrated that the RBM accelerator, implemented on FPGAs, performs well with both narrow and wide matrices, enabling the algorithm to con- verge faster with fewer iterations. Using a ring interconnect, the multi-FPGA RBM system was shown to scale linearly with the number of FPGAs, which allows training of very large RBMs. The key contributions of this research are:

• a modular RBM design that scales across multiple silicon technology generations • a novel solution to efficiently perform matrix multiplication with or without transpose operation • inter-chip communication mechanism and partitioning of RBM computation that allows linear scalability • understanding the trade-offs between DRAM bandwidth, I/O bandwidth, and on-chip storage for multi-chip RBMs • supporting sparse RBMs with locally dense connections

Sparse matrix operations are an important component of the computation in many scientific and information management domains. However, due to irregular data patterns observed in sparse matrices, conventional SIMD hardware does not accelerate a number of sparse matrix operations. We studied three sparse matrix applications that conventional SIMD hardware cannot easily accelerate: the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. Based on our observations, we devised a sparse matrix accelerator architecture which focuses on specific matrix operations. The sparse matrix accelerator was implemented on an FPGA board as a prototype to evaluate the feasibility and performance by running the entire application using realistic data. The key contributions of this research are: CHAPTER 4. CONCLUDING REMARKS AND FUTURE DIRECTIONS 99

• demonstrating the limitations of SIMD architectures using case study applica- tions • understanding how restricting acceleration to a specific set of matrix operations can balance flexibility and efficiency • a novel asynchronous computational thread model that tolerates workload im- balance via buffering and work-stealing • a complete RTL implementation of sparse matrix accelerator in FPGA to demon- strate the feasibility and performance potential of our accelerator design

4.2 Future Work

4.2.1 Addressing the Limitations of the FPGA Implementation Platforms

Our accelerator architectures delivered considerable performance improvements for their targeted applications, fulfilling our objective to demonstrate the practicability and efficiency of using custom accelerators for matrix-oriented applications. However, there were limitations due to FPGA board restrictions and time constraints which prevented a more comprehensive performance evaluation. In the case of the multi-FPGA RBM implementation, communication between the FPGA boards was the main issue in testing with larger RBM networks. Although the I/O ports of the FPGA had a theoretical bandwidth of 17.1Gbps per direction, we were only able to utilize 4.8Gbps of the bandwidth. Since the I/O and DRAM bandwidth requirements are in a trade-off relationship, the proposed DRAM streaming mechanism could not be tested due to its huge DRAM memory requirements, which severely limited the RBM network size so that it would fit in the on-chip memory. Since debugging the inter-FPGA communication performance is time-consuming, future work includes finding an efficient way to obtain better I/O bandwidth. For the Sparse Matrix Accelerator FPGA (SMA-F) implementation, cramming a large amount of logic into limited programmable resources took the majority of the development time, as it required many iterations of modifying the RTL code, synthesizing, and debugging.

[Figure 4.1 appears here: panels (a), (b), and (c); labels include Hidden Group 1 and Hidden Group 2.]

Figure 4.1: Example of 2-D Sparse RBM Configurations

This may have negated our original intention of using a single FPGA to save design time by avoiding I/O communication. Because the FPGA was effectively full (80% resource utilization) with few opportunities to optimize resource usage further, the only way to expand our design is by using multiple FPGAs. Future work will involve deciding how to effectively partition the design. In addition, the small fixed-size CSB blocks were a major hindrance to getting good performance. Future work includes allowing CSB block sizes to change dynamically depending on the matrix size and type. Since adding the flexibility to adjust CSB block sizes may significantly increase resource usage, a stable multi-FPGA environment must be in place to pursue this future work.

4.2.2 Future Directions in Accelerator Research

In Chapter 2, we presented a simple, novel way of implementing Restricted Boltzmann Machines with sparse connections that are locally dense. However, this approach only captures locality in one dimension, i.e. neighboring nodes are defined by distance from each node in a one dimensional space. Figure 4.1(a) illustrates an RBM where locality is defined in 2-D space, which may provide a better fit for applications training on CHAPTER 4. CONCLUDING REMARKS AND FUTURE DIRECTIONS 101

images [39]. In order to support locality beyond 1-D, a ring interconnect structure may no longer be as efficient in scaling to a very large network due to the long latency of traversing the ring. Figure 4.1(b) adds bypass links to the ring interconnect to overcome such communication inefficiency. This approach, however, requires two bi-directional links to be added to each FPGA for each additional dimension, which would result in a significant amount of physical wiring. Figure 4.1(c) suggests an alternative hierarchical approach, which uses high-speed routers and cables to share connections between clusters of nodes. In this approach, the bandwidth of the high-speed cables and the throughput of the routers are the limiting factors to scalability. Future research may include finding the optimal interconnect configuration for sparse RBMs with multi-dimensional locality.

In Chapter 3, our sparse matrix accelerator was shown to deliver large performance gains in the three targeted applications. Future research may explore applications using different types of matrix operations to investigate the possibility of extending the accelerator to support a wider range of applications. The key challenge is identifying the core matrix computations and designing an architecture that can efficiently accelerate all time-consuming matrix computations. Certain applications may consist of matrix operations that have considerably different computation patterns and require two or more distinct accelerators to get a reasonable speedup. In such cases, finding the minimum set of accelerators required is another potential research topic. Within the current SMA architecture, future research may also explore supporting multiple sparse matrix formats and comparing the performance of each format for different matrix applications. Another possible way of enhancing the current SMA architecture is to investigate the use of advanced architectural techniques, such as chaining of matrix operations to reduce the number of memory requests.

As can be seen, there are many open questions in the hardware acceleration domain for matrix-oriented applications, with substantial room for improvement. As specialized accelerators become increasingly important in microprocessors, we expect that more research will be actively conducted in this area. We hope that this thesis will serve as a baseline reference for future research in designing and analyzing hardware accelerators for matrix applications.

Appendix A

Pseudo-code of Sparse Matrix Applications Used for SMA

This section lists the pseudo-codes of the applications investigated in Chapter 3. The algorithms are expressed in a MATLAB-like syntax.

Markov Clustering

The following Markov Clustering pseudo-code is based on the implementation found in [13], with a few modifications.

% Initialization
A = A + eye                          % Add self-loops
M = A ./ repmat(sum(A),N,1)          % normalize A

until M converges
    M = M * M                        % expand
    M = M .^ power                   % inflate
    M = M ./ repmat(sum(M),N,1)      % normalize
    chaos = max(max(M) - sum(M .^ 2))        % determine convergence
    M = M .* (M > prunethreshold)            % prune for sparseness

Figure A.1: Pseudo-code for Markov Clustering


Sparse Restricted Boltzmann Machine

This pseudo-code was based on the MATLAB code from Hinton et al. [28], and modified according to [48] for sparse weights.

% Initialize receptive field matrix
F = zeros(numvis,numhid)
for i = 1 to numhid
    vrange = randomly choose fixed size receptive field
    F[vrange] = 1.0

% Initialize weight matrix and biases
W = F .* 0.1 .* randn(numvis,numhid)
visbias = zeros(1,numvis)
hidbias = zeros(1,numhid)

% Main Loop
until convergence
    foreach batch in dataset
        % Positive Phase
        PV = batch                                  % visible neurons
        PH = 1 ./ (1 + exp( -W*PV - hidbias))       % hidden neurons
        PP = PV' * PH                               % products of PV and PH
        posvisact = sum(PV)
        poshidact = sum(PH)

        % Negative Phase
        NV = 1 ./ (1 + exp( -W'*PH - visbias))      % visible neurons
        NH = 1 ./ (1 + exp( -W*NV - hidbias))       % hidden neurons
        NP = NV' * NH                               % products of NV and NH
        negvisact = sum(NV)
        neghidact = sum(NH)

        % Weight & Bias Update
        W = W + F .* (epsilon/numcases) .* (PP - NP)
        visbias = visbias + (epsilon/numcases) .* (posvisact - negvisact)
        hidbias = hidbias + (epsilon/numcases) .* (poshidact - neghidact)

Figure A.2: Pseudo-code for Sparse Restricted Boltzmann Machine

Betweenness Centrality

The following betweenness centrality pseudo-code is a simplified version of the MATLAB implementation of Brandes' algorithm [11] in the HPC Scalable Graph Benchmark [5], which is based on the parallel algorithm devised by Bader et al. [6].

% Initialization
A = logical(adjmatrix)          % initialize unweighted adjacency matrix
bc = zeros(1,N)                 % initialize betweenness centrality

% Main Loop
foreach batch in A
    % Initialize Variables & Matrices
    depth = 0                   % depth: iterator for each BFS step
    % nsp(i,j) : number of shortest paths betw. node i and j
    nsp = [zeros;ones(batchSz);zeros]   % set 1 to nsp of root vertices
    % fringe(i,j): NSP discovered from node i to j at current BFS depth
    fringe = batch              % initially neighbors of root vertices
    % bfs(depth)(i,j): 1 if node j is found at current BFS depth from node i
    bfs = []

    % Breadth-First Search
    while nnz(fringe) > 0
        bfs(depth++) = logical(fringe)      % save found vertices
        nsp += fringe                       % save NSP for found vertices
        fringe = (fringe * A) .* not(nsp)   % BFS: add NSPs of neighbors of fringe

    % Pre-compute nspinv and bcu
    nspinv = 1 ./ nsp for non-zero elements only
    bcu = ones(batchSize,N)                 % bc update

    % Reverse Breadth-First Search
    for depth = depth:-1:2
        w = bfs(depth) .* nspinv .* bcu             % weights to be applied to predecessors
        bcu += (w * A') .* bfs(depth-1) .* nsp      % RBFS: apply weights reverse dir.

    % Update BC for current batch
    bc += sum(bcu)

Figure A.3: Pseudo-code for Betweenness Centrality

Appendix B

ISA for Sparse Matrix Accelerator FPGA

B.1 Overview of Instruction Set Architecture for SMA-F

The Sparse Matrix Accelerator FPGA (SMA-F) uses Altera Nios II processors [1] for executing applications, as described in Section 3.3. The Nios II/f core was selected to include and support a variety of features, such as instruction and data caches, hardware integer multiply, divide, and barrel shifter units, and dynamic branch prediction. Each Nios II processor takes up approximately 1% of the FPGA logic resources, 0.8% of the DSP resources, and 2% of the M9K embedded memory blocks. The Nios II processors in SMA-F are not instantiated with the Memory Management Unit (MMU) or the Memory Protection Unit (MPU) enabled. Thus, each Nios II processor has a 31-bit address space [1]. To overcome the address space limitation and fully cover both memory-mapped I/O and 2GB of DRAM memory, SMA-F uses two Nios II processors. The Nios II processors are mainly used to control and execute the matrix accelerator module, but can also be used for non-critical general purpose computing. Some short data parallel operations, such as vector inner-products, are implemented in software and exploit the dual-core feature of SMA-F.


    bits 31-27: A | 26-22: B | 21-17: C | 16: ra | 15: rb | 14: rc | 13-6: N | 5-0: opcode (CUSTOM)
Figure B.1: Nios II Custom Instruction Format

These operations are implemented as a Nios II software library and are mainly composed of vector operations that take a substantially smaller amount of time compared to the matrix operations supported by the SMA-F hardware.
SMA-F utilizes the Nios II custom instruction interface [3] for the accelerator instructions. Figure B.1 illustrates the Nios II custom instruction format. The opcode of custom instructions is always fixed to 0x32. Instruction fields A, B, and C are 5-bit register indices, and the ra, rb, and rc bit fields indicate whether the Nios II general purpose registers are used (as opposed to internal custom registers) for this instruction. The 8-bit N field is user-defined and is commonly used as a sub-opcode for multi-function custom logic. SMA-F uses the 8-bit N field to identify the requested operation. Section B.3 lists the supported accelerator instructions and their associated N field. When ra, rb, or rc is not set, the corresponding A, B, or C field represents the matrix descriptor index to be used in the matrix operation. SMA-F supports 32 hardware matrix descriptors, which are used to hold essential matrix information such as the matrix dimensions and format. More detail on the matrix descriptor can be found in Section B.2.
The accelerator instructions take a variable, non-deterministic number of cycles to execute and retire. Some instructions are blocking, in which case the Nios II processor is stalled until the instruction completes. Such instructions are mostly scalar instructions used for controlling the accelerator and for simple floating point arithmetic. However, the majority of the instructions are non-blocking, and the Nios II processor recovers execution flow control almost immediately after issuing the instruction. These instructions are mainly matrix operations that take more than a few hundred cycles to execute. The non-blocking property of these instructions is realized by using a separate accelerator input queue for asynchronous execution. This allows the Nios II processors to perform useful work while the SMA-F accelerator logic is executing its queued matrix operations. The STATUS register is updated

when a new instruction is inserted or when the active instruction retires. The Nios II software uses the STATUS register to resolve dependencies by manually waiting upon potentially conflicting instructions to complete and retire. Section B.4.2 describes the STATUS register in more detail.
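For illustration only, the bit packing in Figure B.1 can be written out as a small C helper. This is not how applications emit custom instructions in practice (the toolchain does that); the helper is ours, and it assumes the standard Nios II custom instruction convention that A and B are the source indices and C the destination.

#include <stdint.h>
#include <stdio.h>

/* Pack the fields of Figure B.1 into a 32-bit custom instruction word. */
static uint32_t encode_custom(unsigned a, unsigned b, unsigned c,
                              unsigned ra, unsigned rb, unsigned rc, unsigned n) {
    return ((a  & 0x1Fu) << 27) | ((b  & 0x1Fu) << 22) | ((c  & 0x1Fu) << 17) |
           ((ra & 0x1u)  << 16) | ((rb & 0x1u)  << 15) | ((rc & 0x1u)  << 14) |
           ((n  & 0xFFu) << 6)  | 0x32u;   /* fixed CUSTOM opcode */
}

int main(void) {
    /* Example: EADD(NN), N = 0x54, with ra/rb/rc = 0 so that A, B, C select
       matrix descriptors 0 and 1 as sources and descriptor 2 as destination. */
    printf("0x%08x\n", encode_custom(0, 1, 2, 0, 0, 0, 0x54));
    return 0;
}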

B.2 Matrix Descriptor Format

SMA-F stores important matrix information in a data structure called matrix de- scriptor. Table B.1 illustrates the fields of a matrix descriptor in detail. Each matrix descriptor has five fields: P, M, N, D, W. Each field is 32-bit wide, and stores some piece of information that is needed to correctly interpret, process, or encode the ma- trix. SMA-F supports up to 32 hardware matrix descriptors that can be accessed using the register indices embedded in the Nios II custom instruction. If an applica- tion requires more than 32 matrix descriptors, some will be required to overflow to memory using techniques similar to register spilling in modern compilers. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 108

Field Description P The P field defines the matrix properties. • P[0]: sparse bit. 1 indicates sparse format, 0 indicates dense format. • P[1]: vector bit. 1 indicates vector, 0 indicates matrix. • P[3]: column major bit. 1 indicates column major, 0 indicates row major. • P[4]: block column major bit. 1 indicates block column major, 0 indicates block row major. • P[5]: sparse INF bit. 1 indicates sparse format omits INF, 0 indicates sparse format omits zeros. • P[7]: random bit. 1 indicates random matrix, 0 indicates . • P[8]: work-steal bit. 1 enables work-stealing, 0 disables work-stealing. Used by src0 matrix only in accumulative operations. • P[22:16]: sparse exponent. The threshold exponent value for sparsifying matrix. Used by the destination matrix only.

M: The M field defines the size of the first dimension of the matrix. This field must be a multiple of 64.
N: The N field defines the size of the second dimension of the matrix. This field must be 1 for vectors, and a multiple of 64 otherwise.
D: The D field stores the address of the matrix data. The address must be aligned to 64 bytes.
W: The W field stores the address of the matrix block pointer array. This field is not used for dense matrices. The address must be aligned to 64 bytes.

Table B.1: Matrix Descriptor Fields and Format
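As a concrete illustration of Table B.1, a host-side program might mirror a descriptor with a plain struct like the sketch below. The struct, its field names, and the helper are ours for exposition; the hardware descriptors themselves live in the accelerator and are written through the MOVRx instructions.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t p;   /* property bits: [0] sparse, [1] vector, [3] column major,
                     [4] block column major, [5] sparse INF, [7] random,
                     [8] work-steal, [22:16] sparse exponent                 */
    uint32_t m;   /* first dimension, multiple of 64                          */
    uint32_t n;   /* second dimension, 1 for vectors, else multiple of 64     */
    uint32_t d;   /* address of the matrix data, 64-byte aligned              */
    uint32_t w;   /* address of the matrix block pointer array (sparse only)  */
} sma_matrix_desc;

/* Example: a 4096x4096 sparse, row-major matrix with work-stealing enabled. */
static sma_matrix_desc example_desc(uint32_t data_addr, uint32_t blkptr_addr) {
    sma_matrix_desc md = {0, 0, 0, 0, 0};
    md.p = (1u << 0) | (1u << 8);   /* sparse + work-steal */
    md.m = 4096;
    md.n = 4096;
    md.d = data_addr;               /* must be 64-byte aligned */
    md.w = blkptr_addr;             /* must be 64-byte aligned */
    return md;
}

int main(void) {
    sma_matrix_desc md = example_desc(0x10000000u, 0x20000000u);
    printf("P=0x%08x M=%u N=%u\n", md.p, md.m, md.n);
    return 0;
}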

B.3 Detailed Instruction Format

As mentioned in Section B.1, SMA-F uses the 8-bit user-defined N field to identify which function the application is requesting. The N field is referred to as the opcode1 in this section. SMA-F accelerator instructions can be categorized into three groups: the scalar operations, matrix arithmetic operations, and matrix comparison operations. Table B.2 lists the scalar operations supported by the SMA-F. Scalar operations are mostly used to control the accelerator, and also include scalar support for simple half- and single-precision floating point operations. These instructions, with the exception of XCHG, have a fixed latency. The latency of an XCHG instruction depends on the execution flow of the other Nios II processor. All scalar operations are blocking

1Not to be confused with the Nios II opcode, which is fixed to 0x32 for custom instructions. The N field of the custom instruction can be seen as the opcode from the accelerator’s point of view.

instructions. Scalar instructions have the most significant two bits N[7:6] set to zero. When applicable, bit N[1] indicates the floating point precision of the source registers, and bit N[0] indicates the floating point precision of the destination register.

Table B.3 lists the matrix arithmetic operations supported by the SMA-F. These instructions perform basic matrix operations such as matrix multiplication. The format of each matrix is encoded in the matrix descriptor indicated by the corresponding register index of the instruction. All matrix arithmetic operations are non-blocking; thus, manual dependency control using synchronization instructions is required. Matrix arithmetic operations have the most significant two bits N[7:6] set to 01. When applicable, bit N[1] indicates transpose of src0 and bit N[0] indicates transpose of src1.

Table B.4 lists the matrix comparison operations supported by the SMA-F. These instructions perform basic comparison operations such as matrix greater-than comparison. The format of each matrix is encoded in the matrix descriptor indicated by the corresponding register index of the instruction. All matrix comparison operations are non-blocking; thus, manual dependency control using synchronization instructions is required. Matrix comparison operations have the most significant two bits N[7:6] set to 10. When applicable, bit N[1] indicates transpose of src0 and bit N[0] indicates transpose of src1.
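The grouping by N[7:6] and the transpose flags in N[1:0] can be decoded with a few shifts, as in the small sketch below; the helper names are ours, and the example opcode 0x56 is EADD(TN) from Section B.4.3.

#include <stdint.h>
#include <stdio.h>

static const char *group_of(uint8_t n) {
    switch (n >> 6) {              /* N[7:6] selects the instruction group */
        case 0:  return "scalar";
        case 1:  return "matrix arithmetic";
        case 2:  return "matrix comparison";
        default: return "reserved";
    }
}

int main(void) {
    uint8_t eadd_tn = 0x56;        /* EADD(TN): src0 transposed, src1 not */
    printf("N=0x%02x: %s, transpose0=%u, transpose1=%u\n",
           eadd_tn, group_of(eadd_tn), (eadd_tn >> 1) & 1u, eadd_tn & 1u);
    return 0;
}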

Operation N[7:6] N[5:4] N[3:0] CSR 0000 00 XCHG 1000 MOVRP 0000 MOVRM 0010 MOVRN 01 0011 MOVRD 0100 MOVRW 0101 00 MOVHS 01 10 00 MOVSH 10 MULH 00 00 MULS 11 ADDH 00 11 01 ADDS 11 SUBH 00 10 SUBS 11

Table B.2: Scalar Operations Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 111

Operation N[7:6] N[5:4] N[3:0] MULT(NN) 00 MULT(NT) 01 00 00 MULT(TN) 10 MULT(TT) 11 EMULT(NN) 00 EMULT(NT) 01 00 EMULT(TN) 10 EMULT(TT) 11 EADD(NN) 00 EADD(NT) 01 01 01 EADD(TN) 10 EADD(TT) 11 ESUB(NN) 00 ESUB(NT) 01 01 10 ESUB(TN) 10 ESUB(TT) 11 SMULT(N) 00 00 SMULT(T) 10 SADD(N) 00 10 01 SADD(T) 10 SSUB(N) 00 10 SSUB(T) 10 LOG2(N) 00 00 LOG2(T) 10 EXP2(N) 00 01 EXP2(T) 11 10 RCP(N) 00 10 RCP(T) 10 RSQRT(N) 00 11 RSQRT(T) 10

Table B.3: Matrix Arithmetic and Math Function Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 112

Operation N[7:6] N[5:4] N[3:0] MAX(NN) 00 MAX(NT) 01 10 MAX(TN) 10 MAX(TT) 00 11 MIN(NN) 00 MIN(NT) 01 11 MIN(TN) 10 MIN(TT) 11 EGT(NN) 00 EGT(NT) 01 00 EGT(TN) 10 EGT(TT) 11 ELT(NN) 00 ELT(NT) 01 01 ELT(TN) 10 10 ELT(TT) 01 11 EMAX(NN) 00 EMAX(NT) 01 10 EMAX(TN) 10 EMAX(TT) 11 EMIN(NN) 00 EMIN(NT) 01 10 EMIN(TN) 10 EMIN(TT) 11 SGT(N) 00 00 SGT(T) 10 SLT(N) 00 01 SLT(T) 10 10 SMAX(N) 00 10 SMAX(T) 10 SMIN(N) 00 11 SMIN(T) 10

Table B.4: Matrix Comparison Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 113

B.4 List of Instructions

B.4.1 ADDH, ADDS - Floating Point Scalar Addition

Format:

Instruction   Opcode   Dst   Src0   Src1
ADDH          0x34     R     R      R
ADDS          0x37     R     R      R

Syntax:

ADDH ADDS

Description:

The ADDH performs scalar half-precision floating point addition of src0 and src1. ADDS performs scalar single-precision floating point addition of src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

ADDS internally converts the operands into half-precision floating point numbers, performs half-precision addition, and converts the result back to a single-precision floating point number.

B.4.2 CSR - Control Status Register

Format:

Instruction   Opcode   Dst   Src0   Src1
CSR           0x00     R     -      -

Syntax:

CSR APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 114

Description:

The CSR returns the current STATUS register of the accelerator Nios II instruction interface. The format of the STATUS register is shown below.

    bits 31-20: Unused | 19-16: PendingOp | 15-0: OPID

The PendingOp field represents the number of instructions pending in the instruction queue, and the OPID field is the identifier of the currently active instruction. These two fields together are used for synchronizing between dependent instructions. More specifically, the Nios II application waits for the OPID it depends on to complete (i.e., for the OPID to increment) before issuing a conflicting instruction. The PendingOp field can be used to determine whether the pipeline is completely flushed. The CSR currently has no access to the CONTROL register of the SMA-F’s Nios II instruction interface.
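A sketch of the software-side synchronization this enables is shown below. read_csr() here is a stand-in stub that simulates the STATUS word; in a real application it would issue the CSR custom instruction, and the helper names are ours.

#include <stdint.h>
#include <stdio.h>

/* Stub for the CSR custom instruction: simulates an accelerator whose OPID
 * advances and whose pending count drains as instructions retire. */
static uint32_t read_csr(void) {
    static uint32_t opid = 0, pending = 3;
    if (pending > 0) pending--;
    opid++;
    return ((pending & 0xFu) << 16) | (opid & 0xFFFFu);
}

static uint32_t status_opid(uint32_t s)    { return s & 0xFFFFu; }        /* bits [15:0]  */
static uint32_t status_pending(uint32_t s) { return (s >> 16) & 0xFu; }   /* bits [19:16] */

/* Wait until the instruction identified by 'opid' has retired (OPID has moved past it). */
static void wait_for_opid(uint16_t opid) {
    while ((uint16_t)status_opid(read_csr()) <= opid)
        ;
}

/* Wait until the accelerator pipeline is completely flushed. */
static void wait_for_idle(void) {
    while (status_pending(read_csr()) != 0)
        ;
}

int main(void) {
    wait_for_opid(5);
    wait_for_idle();
    printf("dependencies resolved\n");
    return 0;
}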

The PendingOp field represents the number of instructions pending in the instruc- tion queue, and the OPID field is the identifier of the current active instructions. These two fields together are used for synchronizing between dependent instructions. More specifically, Nios II application will await for the depending OPID to complete (i.e. the OPID to increment) before issuing a conflicting instruction. The PendingOp can be used to determine if the pipeline is completely flushed. The CSR currently has no access to the CONTROL register of the SMA-F’s Nios II instruction interface.

B.4.3 EADD - Element-wise Matrix Addition

Format:

Instruction   Opcode   Dst   Src0   Src1
EADD(NN)      0x54     C     C      C
EADD(NT)      0x55     C     C      C
EADD(TN)      0x56     C     C      C
EADD(TT)      0x57     C     C      C

Syntax:

EADD

Description:

The EADD performs element-wise matrix addition of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 115

fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EADD currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.4 EGT - Element-wise Greater-than Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
EGT(NN)       0x90     C     C      C
EGT(NT)       0x91     C     C      C
EGT(TN)       0x92     C     C      C
EGT(TT)       0x93     C     C      C

Syntax:

EGT

Description:

The EGT performs element-wise greater-than matrix comparison of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = 1.0,  if src0(i,j) > src1(i,j)
           0.0,  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 116

Restriction:

EGT currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.5 ELT - Element-wise Less-than Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
ELT(NN)       0x94     C     C      C
ELT(NT)       0x95     C     C      C
ELT(TN)       0x96     C     C      C
ELT(TT)       0x97     C     C      C

Syntax:

ELT

Description:

The ELT performs element-wise less-than matrix comparison of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = 1.0,  if src0(i,j) < src1(i,j)
           0.0,  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

ELT currently cannot perform transpose on source matrices if both of them are sparse matrices. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 117

B.4.6 EMAX - Element-wise Matrix Maximum

Format:

Instruction   Opcode   Dst   Src0   Src1
EMAX(NN)      0x98     C     C      C
EMAX(NT)      0x99     C     C      C
EMAX(TN)      0x9A     C     C      C
EMAX(TT)      0x9B     C     C      C

Syntax:

EMAX

Description:

The EMAX performs element-wise maximum between src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = src0(i,j),  if src0(i,j) > src1(i,j)
           src1(i,j),  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMAX currently cannot perform transpose on source matrices if both of them are sparse matrices. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 118

B.4.7 EMIN - Element-wise Matrix Minimum

Format:

Instruction   Opcode   Dst   Src0   Src1
EMIN(NN)      0x9C     C     C      C
EMIN(NT)      0x9D     C     C      C
EMIN(TN)      0x9E     C     C      C
EMIN(TT)      0x9F     C     C      C

Syntax:

EMIN

Description:

The EMIN performs element-wise minimum between src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. The value of dst is determined as follows:

dst(i,j) = src0(i,j),  if src0(i,j) < src1(i,j)
           src1(i,j),  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMIN currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.8 EMULT - Element-wise Matrix Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
EMULT(NN)     0x50     C     C      C
EMULT(NT)     0x51     C     C      C
EMULT(TN)     0x52     C     C      C
EMULT(TT)     0x53     C     C      C

Syntax:

EMULT

Description:

The EMULT performs element-wise matrix multiplication of src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMULT currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.9 ESUB - Element-wise Matrix Subtraction

Format:

Instruction   Opcode   Dst   Src0   Src1
ESUB(NN)      0x58     C     C      C
ESUB(NT)      0x59     C     C      C
ESUB(TN)      0x5A     C     C      C
ESUB(TT)      0x5B     C     C      C

Syntax:

ESUB

Description:

The ESUB performs element-wise matrix subtraction between src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

ESUB currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.10 EXP2 - Element-wise Exponential Base 2

Format:

Instruction   Opcode   Dst   Src0   Src1
EXP2(N)       0x74     C     C      -
EXP2(T)       0x76     C     C      -

Syntax:

EXP2

Description:

The EXP2 performs element-wise exponential base 2 on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. EXP2 internally uses an iterative approximation algorithm developed and implemented by Frank Liu.
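The exact iterative algorithm is not reproduced here; the sketch below only illustrates one common way to approximate 2^x in a hardware-friendly form, by splitting the input into integer and fractional parts, evaluating a short polynomial on the fraction, and rescaling. The coefficients and the name exp2_approx are assumptions, not the accelerator's implementation.

#include <math.h>

/* Hedged sketch of a split-and-scale base-2 exponential: 2^x = 2^k * 2^f
 * with k = floor(x) and f in [0, 1). */
static float exp2_approx(float x)
{
    float k = floorf(x);
    float f = x - k;                                   /* f in [0, 1) */
    /* Truncated Taylor series for 2^f = e^(f ln 2):
     * coefficients are ln2, (ln2)^2/2, (ln2)^3/6. */
    float p = 1.0f + f * (0.6931f + f * (0.2402f + f * 0.0555f));
    return ldexpf(p, (int)k);                          /* p * 2^k */
}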

Restriction:

Currently, only dense matrices are supported for EXP2.

B.4.11 LOG2 - Element-wise Log Base 2

Format:

Instruction   Opcode   Dst   Src0   Src1
LOG2(N)       0x70     C     C      -
LOG2(T)       0x72     C     C      -

Syntax:

LOG2

Description:

The LOG2 performs element-wise log base 2 on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. LOG2 internally uses an approximation algorithm based on look-up tables, developed and implemented by Frank Liu.
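As a rough illustration of a table-driven log2 (not the accelerator's exact tables or indexing), the sketch below reads the integer part of the result directly from the exponent field of a normal, positive float and approximates the fractional part with a small look-up table indexed by the top mantissa bits. The table size and names are assumptions.

#include <stdint.h>
#include <string.h>
#include <math.h>

#define LOG2_LUT_BITS 6
static float log2_lut[1 << LOG2_LUT_BITS];

/* Call once at startup: tabulate log2 of the mantissa range [1, 2). */
static void log2_lut_init(void)
{
    for (int i = 0; i < (1 << LOG2_LUT_BITS); i++) {
        float mant = 1.0f + (float)i / (float)(1 << LOG2_LUT_BITS);
        log2_lut[i] = log2f(mant);
    }
}

/* Assumes x > 0 and normal; subnormals, zero, Inf, and NaN are not handled. */
static float log2_approx(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exp = (int)((bits >> 23) & 0xFF) - 127;        /* integer part */
    uint32_t idx = (bits >> (23 - LOG2_LUT_BITS)) & ((1u << LOG2_LUT_BITS) - 1);
    return (float)exp + log2_lut[idx];                 /* fractional part from LUT */
}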

Restriction:

Currently, only dense matrices are supported for LOG2.

B.4.12 MOVHS, MOVSH - Change Floating Point Precision

Format:

Instruction   Opcode   Dst   Src0   Src1
MOVHS         0x21     R     R      -
MOVSH         0x22     R     R      -

Syntax:

MOVHS MOVSH

Description:

The MOVHS converts a scalar half-precision floating point value src0 into a single-precision floating point value and stores it in dst. The MOVSH converts a scalar single-precision floating point value src0 into a half-precision floating point value and stores it in dst. Register values are read from and stored to the Nios II general purpose register file.
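The minimal sketch below shows the MOVHS-style widening conversion for normal numbers and zero, assuming IEEE 754 half precision (1 sign, 5 exponent, 10 mantissa bits) and single precision (1, 8, 23 bits). Subnormals, infinities, and NaNs are omitted for brevity; MOVSH is the inverse conversion with rounding of the narrowed mantissa.

#include <stdint.h>
#include <string.h>

/* Hedged sketch: widen a half-precision bit pattern to single precision.
 * Only normal numbers and signed zero are handled here. */
static float half_to_single(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) & 0x1;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1F;
    uint32_t mant = (uint32_t)h & 0x3FF;
    uint32_t bits;

    if (exp == 0 && mant == 0) {
        bits = sign << 31;                    /* signed zero */
    } else {
        bits = (sign << 31)
             | ((exp - 15 + 127) << 23)       /* re-bias the exponent */
             | (mant << 13);                  /* widen the mantissa */
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}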

B.4.13 MOVRx - Move Register

Format:

Instruction   Opcode   Dst   Src0   Src1
MOVRP         0x10     C     R      -
                       R     C      -
MOVRM         0x12     C     R      -
                       R     C      -
MOVRN         0x13     C     R      -
                       R     C      -
MOVRD         0x14     C     R      -
                       R     C      -
MOVRW         0x15     C     R      -
                       R     C      -

Syntax:

MOVRx

Description:

The MOVRx moves data from a Nios II register to a matrix descriptor field and vice versa. src0 indicates the index of the matrix descriptor, while the low-order bits of the opcode select the field.

• MOVRP moves data from/to the P field.

• MOVRM moves data from/to the M field.

• MOVRN moves data from/to the N field.

• MOVRD moves data from/to the D field.

• MOVRW moves data from/to the W field.

Refer to Section B.2 for more information on each field.
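A hedged host-side view of this instruction is sketched below: the descriptor is modeled as a plain struct with one member per field named in Section B.2, and the register-to-descriptor direction is a field-selected store. The field widths and names here are placeholders; only Section B.2 is authoritative.

#include <stdint.h>

enum movr_field { FIELD_P, FIELD_M, FIELD_N, FIELD_D, FIELD_W };

struct matrix_descriptor {
    uint32_t p, m, n, d, w;                   /* placeholder widths */
};

/* Model of the register-to-descriptor direction of MOVRx. */
static void movr_write(struct matrix_descriptor *desc,
                       enum movr_field field, uint32_t reg)
{
    switch (field) {
    case FIELD_P: desc->p = reg; break;
    case FIELD_M: desc->m = reg; break;
    case FIELD_N: desc->n = reg; break;
    case FIELD_D: desc->d = reg; break;
    case FIELD_W: desc->w = reg; break;
    }
}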

B.4.14 MULH, MULS - Floating Point Scalar Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
MULH          0x30     R     R      R
MULS          0x33     R     R      R

Syntax:

MULH MULS

Description:

The MULH performs scalar half-precision floating point multiplication of src0 and src1. MULS performs scalar single-precision floating point multiplication of src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

MULS internally converts the operands into half-precision floating point numbers, performs half-precision multiplication, and converts the result back to a single-precision floating point number.

B.4.15 MULT - Matrix Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
MULT(NN)      0x40     C     C      C
MULT(NT)      0x41     C     C      C
MULT(TN)      0x42     C     C      C
MULT(TT)      0x43     C     C      C

Syntax:

MULT

Description:

The MULT performs matrix multiplication of src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.
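For reference, a dense single-precision model of MULT with the four transpose variants is sketched below; it is a host-side check, not the accelerator's blocked, half-precision datapath, and the function name is illustrative.

#include <stddef.h>

/* dst (M x N) = op(src0) * op(src1), where op() optionally transposes a
 * source as selected by t0/t1 (the transpose0/transpose1 fields).  src0 is
 * M x K and src1 is K x N after transposition. */
static void matmul(float *dst, const float *src0, const float *src1,
                   size_t m, size_t n, size_t k, int t0, int t1)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; l++) {
                float a = t0 ? src0[l * m + i] : src0[i * k + l];
                float b = t1 ? src1[j * k + l] : src1[l * n + j];
                acc += a * b;
            }
            dst[i * n + j] = acc;
        }
    }
}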

B.4.16 RCP - Element-wise Reciprocal

Format:

Instruction   Opcode   Dst   Src0   Src1
RCP(N)        0x78     C     C      -
RCP(T)        0x7A     C     C      -

Syntax:

RCP

Description:

The RCP performs the element-wise reciprocal function on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. The RCP implementation is based on the Newton-Raphson division algorithm. The average error rate of each reciprocal compared to the double-precision result is 0.15%, and the maximum error rate is 0.48%.
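A minimal sketch of the Newton-Raphson idea is shown below: start from a crude bit-pattern seed and refine with x <- x(2 - ax), which roughly doubles the number of correct bits per step. The seed constant and the iteration count are illustrative assumptions, not the accelerator's exact parameters.

#include <stdint.h>
#include <string.h>

/* Hedged sketch of a Newton-Raphson reciprocal.  Assumes a > 0 and normal. */
static float reciprocal_nr(float a)
{
    /* Crude seed: reflect the bit pattern to roughly negate the exponent
     * (constant is illustrative). */
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x7EEEEEEEu - bits;
    float x;
    memcpy(&x, &bits, sizeof x);

    x = x * (2.0f - a * x);                   /* refinement 1 */
    x = x * (2.0f - a * x);                   /* refinement 2 */
    x = x * (2.0f - a * x);                   /* refinement 3 */
    return x;
}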

Restriction:

Currently, only dense matrices are supported for RCP.

B.4.17 RSQRT - Element-wise Reverse Square Root

Format:

Instruction   Opcode   Dst   Src0   Src1
RSQRT(N)      0x7C     C     C      -
RSQRT(T)      0x7E     C     C      -

Syntax:

RSQRT

Description:

RSQRT performs the element-wise reverse square root function on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. The RSQRT implementation is based on the Fast Inverse Square Root algorithm [19]. The average error rate of each inverse square root compared to the double-precision result is 0.14%, and the maximum error rate is 0.45%.
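For context, the cited fast inverse square root technique [19] can be sketched as follows: a magic constant applied to the bit pattern yields a first guess, and one Newton-Raphson step refines it. The accelerator's seed and iteration count may differ.

#include <stdint.h>
#include <string.h>

/* Classic fast inverse square root sketch.  Assumes x > 0. */
static float rsqrt_fast(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5F3759DFu - (bits >> 1);         /* initial guess */
    float y;
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);        /* one Newton-Raphson step */
    return y;
}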

Restriction:

Currently, only dense matrices are supported for RSQRT.

B.4.18 SGT, SLT, SMAX, SMIN - Matrix-Scalar Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
SGT(N)        0xA0     C     C      R
SGT(T)        0xA2     C     C      R
SLT(N)        0xA4     C     C      R
SLT(T)        0xA6     C     C      R
SMAX(N)       0xA8     C     C      R
SMAX(T)       0xAA     C     C      R
SMIN(N)       0xAC     C     C      R
SMIN(T)       0xAE     C     C      R

Syntax:

SGT SLT SMAX SMIN

Description:

SGT, SLT, SMAX, and SMIN compare the value of src1 with each element in src0, and store the result in dst, where src0 and dst are the indices of the corresponding matrix descriptors. src1 is a Nios II general purpose register, which holds a half-precision floating point value. For SGT, dst_{i,j} = 1.0 if src0_{i,j} > src1, and 0.0 otherwise. For SLT, dst_{i,j} = 1.0 if src0_{i,j} < src1, and 0.0 otherwise. For SMAX, dst_{i,j} = src0_{i,j} if src0_{i,j} > src1, and src1 otherwise. For SMIN, dst_{i,j} = src0_{i,j} if src0_{i,j} < src1, and src1 otherwise.
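A compact dense model of the scalar-broadcast semantics is sketched below for SMAX only; SGT, SLT, and SMIN differ only in the per-element expression. The function name is illustrative.

#include <stddef.h>

/* The scalar s (from a Nios II register) is broadcast against every element
 * of src0, here over a flat array of count elements. */
static void smax(float *dst, const float *src0, float s, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = (src0[i] > s) ? src0[i] : s;
}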

Restriction:

SGT, SLT, SMAX, and SMIN currently cannot perform transpose on the source matrix if both src0 and dst are sparse matrices.

B.4.19 SMULT, SADD, SSUB - Matrix-Scalar Arithmetic

Format:

Instruction   Opcode   Dst   Src0   Src1
SMULT(N)      0x60     C     C      R
SMULT(T)      0x62     C     C      R
SADD(N)       0x64     C     C      R
SADD(T)       0x66     C     C      R
SSUB(N)       0x68     C     C      R
SSUB(T)       0x6A     C     C      R

Syntax:

SMULT SADD SSUB

Description:

SMULT multiplies the value of src1 with each element in src0, and stores the result in dst, where src0 and dst are the indices of the corresponding matrix descriptors. src1 is a Nios II general purpose register, which holds a half-precision floating point value. SADD adds the value of src1 to each element in src0, and stores the result in dst. SSUB subtracts the value of src1 from each element in src0, and stores the result in dst. Matrix src0 can be transposed using the transpose0 field. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0 and dst.

Restriction:

SMULT, SADD, and SSUB currently cannot perform transpose on the source matrix if both src0 and dst are sparse matrices.

B.4.20 SUBH, SUBS - Floating Point Scalar Subtraction

Format:

Instruction   Opcode   Dst   Src0   Src1
SUBH          0x3C     R     R      R
SUBS          0x3F     R     R      R

Syntax:

SUBH SUBS

Description:

SUBH performs scalar half-precision floating point subtraction between src0 and src1. SUBS performs scalar single-precision floating point subtraction between src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

SUBS internally converts the operands into half-precision floating point numbers, performs half-precision subtraction, and converts the result back to a single-precision floating point number.

B.4.21 XCHG - Exchange

Format:

Instruction   Opcode   Dst   Src0   Src1
XCHG          0x08     R     R      -

Syntax:

XCHG

Description:

The XCHG sends the value of src0 to the other Nios II processor and stores the value received from the other Nios II processor in dst. This instruction blocks until the other Nios II processor has also executed the XCHG instruction. This instruction is the primary means of communication and synchronization between the two Nios II processors in SMA-F.
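The sketch below shows the synchronization pattern only; sma_xchg() is a hypothetical host-side wrapper, not a real API, since the actual accessor is a Nios II custom-instruction macro generated by the tool flow.

#include <stdint.h>

/* Assumed wrapper around the XCHG custom instruction: both processors call
 * it, each blocks until the peer arrives, and the two values cross. */
extern uint32_t sma_xchg(uint32_t value);     /* hypothetical name */

/* Example: both cores agree on the larger of two locally computed row counts. */
static uint32_t agree_on_rows(uint32_t my_rows)
{
    uint32_t other_rows = sma_xchg(my_rows);  /* blocks until the peer's XCHG */
    return (my_rows > other_rows) ? my_rows : other_rows;
}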

Bibliography

[1] Altera. Nios II Processor Reference Handbook, 2011.

[2] Altera Corporation. High Speed Mezzanine Card (HSMC) Specification, June 2009.

[3] Altera Corporation. Nios II Custom Instruction User Guide, January 2011.

[4] H. Amin, K.M. Curtis, and B.R. Hayes-Gill. Piecewise linear approximation applied to nonlinear function of a neural network. Circuits, Devices and Systems, IEE Proceedings -, 144(6):313 –317, dec 1997.

[5] David A. Bader, John Feo, John Gilbert, Jeremy Kepner, David Koester, Eugene Loh, Kamesh Madduri, Bill Mann, Theresa Meuse, and Eric Robinson. HPC scalable graph analysis benchmark v1.0, 2009.

[6] David A. Bader and Kamesh Madduri. Parallel algorithms for evaluating centrality indices in real-world networks. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP ’06, pages 539–550, Washington, DC, USA, 2006. IEEE Computer Society.

[7] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, December 2008.

[8] Nathan Bell and Michael Garland. Cusp: Generic parallel algorithms for sparse matrix and graph computations, 2012. Version 0.3.0.


[9] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.

[10] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, May 2011.

[11] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.

[12] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA ’09, pages 233–244, New York, NY, USA, 2009. ACM.

[13] Aydin Buluç and John R. Gilbert. The combinatorial BLAS: design, implementation, and applications. International Journal of High Performance Computing Applications, 25(4):496–509, 2011.

[14] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-mat: A recursive model for graph mining. In SDM, 2004.

[15] C.E. Cox and W.E. Blanz. Ganglion - a fast field-programmable gate array implementation of a connectionist classifier. Solid-State Circuits, IEEE Journal of, 27(3):288–299, Mar 1992.

[16] Timothy A. Davis. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2006.

[17] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[18] Michael deLorimier and André DeHon. Floating-point sparse matrix-vector multiply for fpgas. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, pages 75–85, New York, NY, USA, 2005. ACM.

[19] David Eberly. Fast inverse square root (revisited), 2010.

[20] H. Elgindy and Yen-Liang Shue. On sparse matrix-vector multiplication with fpga-based system. In Field-Programmable Custom Computing Machines, 2002. Proceedings. 10th Annual IEEE Symposium on, pages 273 – 274, 2002.

[21] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceeding of the 38th annual international symposium on Computer architecture, pages 365–376. ACM, 2011.

[22] Linton C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.

[23] Kazushige Goto and Robert Van De Geijn. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35(1):1–14, 2008.

[24] Hans Peter Graf, Srihari Cadambi, Igor Durdanovic, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, and Srimat Chakradhar. A Massively Parallel Digital Learning Processor. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 529–536. 2009.

[25] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B.C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, volume 38. ACM Press, 2010.

[26] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, pages 37–47, New York, NY, USA, 2010. ACM.

[27] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[28] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.

[29] J.L. Holt and T.E. Baker. Back propagation simulations using limited precision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume ii, pages 121 –126 vol.2, jul 1991.

[30] J. Impagliazzo, M. Campbell-Kelly, G. Davies, and J.A.N. Lee. History in the computing curriculum. Annals of the History of Computing, IEEE, 21(1):4 –16, jan-mar 1999.

[31] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, July 2005.

[32] Srinidhi Kestur, John Davis, and Eric Chung. Towards a universal fpga matrix-vector multiplication architecture. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, 2012.

[33] Sang Kyun Kim, Lawrence Christopher McAfee, Peter Leonard McMahon, and Kunle Olukotun. A Highly Scalable Restricted Boltzmann Machine Implementation. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Sept. 2009.

[34] Sang Kyun Kim, Peter Leonard McMahon, and Kunle Olukotun. A large-scale architecture for restricted boltzmann machines. In Proceedings of the 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM ’10, pages 201–208, Washington, DC, USA, 2010. IEEE Computer Society.

[35] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded sparc processor. Micro, IEEE, 25(2):21 – 29, march-april 2005.

[36] Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA ’06, pages 21–30, New York, NY, USA, 2006. ACM.

[37] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 –2324, nov 1998.

[38] Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area v2. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 873–880. MIT Press, Cambridge, MA, 2008.

[39] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Léon Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 609–616, Montreal, June 2009. Omnipress.

[40] C.Y. Lin, Zheng Zhang, Ngai Wong, and H.K.-H. So. Design space exploration for sparse matrix-matrix multiplication on fpgas. In Field-Programmable Technology (FPT), 2010 International Conference on, pages 369 –372, dec. 2010.

[41] Daniel L. Ly and Paul Chow. A Multi-FPGA Architecture for Stochastic Restricted Boltzmann Machine. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Sept. 2009.

[42] Daniel L. Ly and Paul Chow. A high-performance fpga architecture for restricted boltzmann machines. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’09, pages 73–82, New York, NY, USA, 2009. ACM.

[43] P. Lysaght, J. Stockwood, J. Law, and D. Girma. Artificial neural network implementation on a fine-grained FPGA. Field-Programmable Logic Architectures, Synthesis and Applications, 849:421–431, 1994.

[44] MATLAB. version 7.13.0 (R2011b). The MathWorks Inc., Natick, Massachusetts, 2011.

[45] G.E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114 –117, apr. 1965.

[46] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.

[47] Rajat Raina, Anand Madhavan, and Andrew Ng. Large-Scale Deep Unsupervised Learning using Graphics Processors. In Léon Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 873–880, Montreal, June 2009. Omnipress.

[48] Yichuan Tang and Chris Eliasmith. Deep networks for robust visual recognition. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1055–1062, Haifa, Israel, June 2010. Omnipress.

[49] Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems 19, pages 1345–1352. MIT Press, 2007.

[50] M.B. Taylor. Is dark silicon useful? In Proceedings of the 39th annual Design Automation Conference, 2012.

[51] Terasic Technologies Inc. DE3 User Manual, 2009.

[52] Thomas E. Tkacik. A hardware random number generator. In Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, CHES ’02, pages 450–453, London, UK, UK, 2003. Springer-Verlag.

[53] Stijn Van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[54] Vasily Volkov and James W. Demmel. Benchmarking gpus to tune dense linear algebra. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.

[55] Lotfi A. Zadeh. Fuzzy logic, neural networks, and soft computing. Commun. ACM, 37(3):77–84, March 1994.

[56] David Zhang and Sankur K. Pal, editors. Neural Networks and Systolic Array Design. World Scientific Publishing Co. Pte. Ltd., Farrer Road, Singapore, 2002.

[57] J. Zhu and P. Sutton. FPGA Implementations of Neural Networks: a Survey of a Decade of Progress. In Proc. 13th International Conference on Field-Programmable Logic and Applications, pages 1062–1066, September 2003.

[58] Ling Zhuo and Viktor K. Prasanna. Sparse matrix-vector multiplication on fpgas. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, pages 63–74, New York, NY, USA, 2005. ACM.