SCALABLE GRAPH PROCESSING ON RECONFIGURABLE SYSTEMS

By ROBERT G. KIRCHGESSNER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Robert G. Kirchgessner

To my parents, Robert and Janette, and my wife Minjeong

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to all those who have helped me down this long road toward completing my doctoral studies. I thank my advisor, Dr. Alan George, for his wisdom, guidance, and support, both academically and personally, throughout my graduate studies; and my co-advisor, Dr. Greg Stitt, whose invaluable academic insights helped shape my research. I thank Vitaliy Gleyzer for his invaluable feedback, suggestions, and guidance, which made me a better researcher, and MIT/LL for the support and resources which made this work possible. I would also like to thank my committee members, Dr. Herman Lam and Dr. Darin Acosta, for their important suggestions, advice, and feedback on my work. Additionally, I thank my friends and colleagues: Kenneth Hill, Bryant Lam, Abhijeet Lawande, Adam Lee, Barath Ramesh, and Gongyu Wang, who have always provided me with support, both in my research and in my personal life. Furthermore, I would like to thank my loving wife, without whom this would not have been possible, and my parents, who knew I could achieve this well before I knew it myself. Last but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022. I would like to gratefully acknowledge equipment, tools, and source code provided by Altera (now part of Intel), Xilinx, GiDEL, Nallatech, and DRC Computing.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
   1.1 Field-Programmable Gate Arrays
   1.2 Graph Processing using Linear-Algebra Primitives
2 PRODUCTIVITY AND PORTABILITY MIDDLEWARE FOR FPGA APPLICATIONS
   2.1 Background and Related Research
   2.2 Approach
      2.2.1 Hardware Abstraction
      2.2.2 Software Abstraction
      2.2.3 Metadata and Extensible Core Library
      2.2.4 RC Middleware Toolchain
   2.3 Results and Analysis
      2.3.1 Convolution Case Study
      2.3.2 Analysis of Performance and Area Overhead
      2.3.3 Analysis of Productivity
      2.3.4 Analysis of Portability
   2.4 Summary and Conclusions
3 EFFICIENT STORAGE FORMATS FOR SCALABLE FPGA GRAPH PROCESSING
   3.1 Background and Related Research
      3.1.1 Coordinate Format (COO)
      3.1.2 Compressed Sparse-Column/Row Format (CSC/R)
      3.1.3 Doubly Compressed Sparse-Column/Row (DCSC/R)
      3.1.4 ELLPACK Format
      3.1.5 Jagged Diagonal Format (JDS/TJDS)
      3.1.6 Minimal Quadtree Format (MQT)
   3.2 Approach
      3.2.1 Hashed-Index Sparse-Column/Row (HISC/R)
      3.2.2 Hashed-Indexing Vector
      3.2.3 HISC/R Nonzero Storage
      3.2.4 Non-zero Lookups and Insertions
      3.2.5 Storage Analysis
   3.3 Results and Analysis
      3.3.1 Storage comparison
      3.3.2 Performance Comparison
   3.4 Summary and Conclusions
4 EXTENSIBLE FPGA ARCHITECTURE FOR SCALABLE GRAPH PROCESSING
   4.1 Background and Related Research
      4.1.1 Accelerating Sparse-Matrix Operations on FPGAs
      4.1.2 Standards for Graph Processing using Linear Algebra
      4.1.3 Linear-Algebra Formulation of Breadth-First Search
   4.2 Extensible Graph-Processor Architecture
      4.2.1 Merge-Sorter Architecture
         4.2.1.1 Sorting-pipeline architecture
         4.2.1.2 Merge-sorter controller
         4.2.1.3 Merge-sorter performance analysis
      4.2.2 ALU Architecture
      4.2.3 HISC/R Storage Controller
      4.2.4 FPGA Resource Analysis
   4.3 Experimental Setup
   4.4 Case Study: Sparse Generalized Matrix-Matrix Multiplication
   4.5 Case Study: Breadth-First Search
   4.6 Graph-Processor Architecture Scalability Analysis
   4.7 Summary and Conclusions
5 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Currently supported RC Middleware platforms.
2-2 Comparison of lines of code required when using RC Middleware.
2-3 Total map-generation time, estimated area and latency, and actual area and execution time for convolution case study optimizing for performance or area.
2-4 Execution time and area for various applications and kernels on each supported RC Middleware platform.
3-1 Definition of variables for sparse-matrix complexity analysis.
3-2 Analysis of popular sparse-matrix storage formats.
4-1 Merge-sorter PE next-state logic.
4-2 Graph-processor resource analysis.
4-3 Summary of parameters used to simulate SpGEMM scalability.

LIST OF FIGURES

1-1 FPGA architecture overview.
2-1 Overview of RC Middleware design flow.
2-2 RC Middleware hardware-abstraction layers enabling application portability between GiDEL PROCStar III and Pico M501.
2-3 Overview of RC Middleware's hardware presentation layer.
2-4 Overview of RC Middleware's software stack and generated C++ application stub.
2-5 Example of application-description XML format.
2-6 Overview of RC Middleware toolchain from application specification to vendor-specific project generation.
2-7 Example of RC Middleware mapping two application read interfaces to a single physical memory.
2-8 Area- and performance-optimized mapping results for mapping convolution application on PROCStar III/IV.
2-9 Host and FPGA read and write performance to external memory for PROCStar III/IV, M501, and PCIe-385N.
2-10 Host and FPGA read and write overhead to external memory for PROCStar III/IV and M501.
2-11 FPGA resource analysis for vendor, application, and RC Middleware components.
3-1 Comparison of the indexing techniques used by CSC/R, DCSC/R, and HISC/R.
3-2 Comparison of average probes required for row/column lookups vs. load factor for different hash table types.
3-3 Overview of HISC/R with segmented storage vectors using initial segment size L0 and growth factor k.
3-4 Pseudocode for HISC column lookups.
3-5 Pseudocode for HISC non-zero insertions.
3-6 Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by CSC/R for randomly generated scale-30 Kronecker matrices.
3-7 Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by DCSC/R for randomly generated scale-30 Kronecker matrices.
3-8 Comparison of total reads required to perform sparse matrix/matrix multiplication using HISC/R compared with CSC/R and DCSC/R.
4-1 Pseudocode for vertex-centric breadth-first search.
4-2 Graph adjacency-matrix representation.
4-3 Overview of graph-processor architecture.
4-4 Architecture diagram of merge-sorter PE.
4-5 Architecture diagram of merge-sorter pipeline.
4-6 Pseudocode for systolic-array priority function.
4-7 Merge-sorter architecture overview.
4-8 Overview of merge-sorter sorting modes.
4-9 Pipelined merge-sorter performance analysis.
4-10 Design of our ALU supporting various semirings.
4-11 Comparison of tabulation-hash quality metric for different PRNGs.
4-12 Controller architecture for HISC/R storage format.
4-13 Comparison of our architecture running SpGEMM with CombBLAS and SuiteSparse baselines.
4-14 Comparison of our architecture running BFS with state-of-the-art designs on the Convey HC-1/HC-2.
4-15 Scalability simulation approach.
4-16 Simulated SpGEMM speedup.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SCALABLE GRAPH PROCESSING ON RECONFIGURABLE SYSTEMS

By Robert G. Kirchgessner

December 2016

Chair: Alan D. George
Cochair: Greg Stitt
Major: Electrical and Computer Engineering

Graphs are ubiquitous, capable of modeling the relationships between entities in any system. This flexibility has led to graphs becoming key data structures for computing, with scientific, commercial, and defense applications. Graph-processing applications, however, do not map well to conventional cache-based computing architectures. These cache-based architectures exploit data locality and maximize computational throughput, whereas graph-processing applications are typically memory-bound and data-driven, with highly irregular datasets. Recent advances in graph processing have opened the door to new methods of data analysis, bringing with them new scientific discoveries and applications, as well as increasingly larger datasets of interest. These increasing computational needs have driven the exploration of new methods and unique system architectures for graph processing. Reconfigurable computing (RC) on field-programmable gate arrays (FPGAs) provides a unique opportunity to develop custom architectures tuned for graph-processing applications, maximizing performance while also minimizing power. FPGA development, however, carries difficulties similar to those of hardware design, requiring developers to iterate through register-transfer-level (RTL) designs with cycle-level accuracy. Furthermore, the lack of hardware and software standards between FPGA platforms limits developer productivity and application portability, making porting applications and scaling to larger systems a time-consuming and challenging process. In this work, we first address the portability and scalability challenges of FPGA application development by developing a novel RC Middleware (RCMW). The RCMW provides an

application-centric development environment, exposing only the resources and interfaces required by an application, independent of the underlying platform. Next, we explore efficient sparse-matrix storage formats for FPGA graph processing, developing a new hash-based storage format which provides up to 40% performance improvement while requiring 23% less storage compared to competing storage formats. Finally, we leverage RCMW and our new storage format to develop a scalable graph-processor architecture, capable of providing over 200× performance-per-watt improvement compared to optimized CPU baselines. We demonstrate that our architecture, running BFS spanning-tree calculations, outperforms state-of-the-art designs on the Convey HC-1/HC-2 after adjusting for platform memory bandwidth. We present a scalability analysis of our architecture running SpGEMM on the Novo-G# multi-FPGA system using a combination of hardware experimentation and discrete-event network simulation.

CHAPTER 1
INTRODUCTION

Graphs are arguably the most powerful data structures in modern computing, capable of modeling any relation, either abstract or concrete, between entities. This flexibility has positioned graphs as central data structures in data analytics and scientific research, and has opened the door to new methods of data analysis and understanding through modern graph-processing techniques. Large-scale graph processing is a key component in modern scientific computing and data analytics [1], with many commercial [2] and defense applications [3], [4]. The increasing scale of graph datasets, and the computational complexity of graph algorithms, have led to the development of various specialized graph-processing frameworks on conventional distributed systems. Pregel [5] uses a bulk-synchronous parallel message-passing model where vertex-program computation is broken down into a series of synchronous super-steps. GraphLab [6] eliminates the explicit synchronization step of Pregel, providing an asynchronous shared-memory view of graph data, and a processing model similar to MapReduce. PowerGraph [7] builds on these existing frameworks but optimizes data distribution for power-law graphs. Finally, a more recent graph-processing framework known as GraphX [8] has been developed on the Apache Spark distributed-database engine. Graph-processing applications, however, do not map well to these conventional system architectures and programming platforms. Whereas conventional systems focus on computational throughput and data locality and reuse, graph-processing problems are typically memory-bound and data-driven, with highly irregular datasets [9]. Cache-based architectures are a liability for these applications, adding latency to computation and wasting power and chip resources [10]. These problems are further compounded in distributed systems, where the unstructured nature of graph datasets leads to inefficient data partitioning and load imbalances [11], [12]. The mismatch of graph-processing workloads on conventional computing systems, and the need to analyze increasingly larger graph datasets, have driven the exploration of new methods, algorithms, and distributed system architectures for graph processing [13], [14], [15], [16].

Reconfigurable computing technologies such as field-programmable gate arrays (FPGAs) offer a unique platform for exploring the design and development of novel architectures and techniques for graph processing. Leveraging the configurability of FPGAs, we can explore specialized architectures for graph processing which exploit the properties of graph datasets. Furthermore, the bit-level customizability and power efficiency of FPGAs enable us to develop architectures with competitive performance per watt compared to state-of-the-art throughput-oriented and vector-processing systems.

1.1 Field-Programmable Gate Arrays

FPGAs are a reconfigurable integrated-circuit technology which allows developers to specify the internal configuration of the chip programmatically. Hardware developers, armed with specialized hardware-description languages (HDLs), specify the internal circuit configuration of the FPGA down to individual logic gates, allowing them to create any behavior they require. The configuration information for each FPGA design is stored in a bitstream file which can be stored in non-volatile memory and loaded on power-up, or programmed into the FPGA over a serial interface such as JTAG. New hardware configurations can be loaded at any time, allowing developers to modify the FPGA's behavior to meet current processing requirements. FPGAs typically consist of a series of programmable logic elements, or configurable logic blocks (CLBs), connected together by a programmable interconnect network as shown in Figure 1-1. Although the CLB architecture varies between FPGA designs, a CLB often consists of at least one programmable lookup table (LUT), a register, and a MUX to select between an asynchronous or synchronous output. In modern FPGA architectures, some of the CLBs are replaced by specialized hardware components such as phase- or delay-locked loops, on-chip RAM, digital signal-processing cores, or microprocessors. The structure and behavior of each CLB, as well as how CLBs, other resources, and I/O blocks are routed together, can be programmed by the hardware developer to create their desired application. To create an FPGA configuration, a developer begins by capturing their design behavior using a hardware-description language such as VHDL or Verilog, or more recently a high-level

synthesis language such as OpenCL [17]. Next, the design is converted from a high-level description to a register-transfer-level (RTL) representation by a hardware-synthesis tool. A vendor-specific mapping tool then takes the RTL design and maps it onto the specific hardware architecture of the target FPGA, making sure the timing requirements of the design can be met. Finally, after placing and routing resources on the target FPGA, a bitstream file is generated which contains the configuration bits for the CLBs and the on-chip routing fabric.

Figure 1-1. FPGA architecture overview: configurable logic blocks (CLBs), each containing an N-input LUT, a register, and an output MUX, connected by a configurable interconnect fabric and surrounded by I/O blocks (IOBs).

FPGAs have recently been integrated into computing systems as hardware accelerators in high-performance reconfigurable computing (HPRC) applications. For these applications, FPGAs are typically integrated into a coprocessor board consisting of one or more FPGAs and on-board memory, and are typically connected to a host processor through a bus such as PCI Express. These FPGA accelerator cards typically come with a vendor-supplied development environment, enabling application developers to program the board and write custom application-specific software drivers. Due to a lack of standards between platforms, however, developers must tailor their application to a specific vendor's software and hardware interfaces. This platform-specific development cycle prevents portability, requiring significant developer time and effort to port applications to new platforms. Additionally, vendor-specific procedural APIs further limit portability. Procedural APIs embed platform-specific parameters into application code, including data marshalling and the physical location of application

resources. When porting an application to a new platform, this embedding forces developers not only to change APIs, but also to handle new platform-specific restrictions which may require significant changes to their application. In order to overcome the portability and productivity hurdles of FPGA-application development, we present a novel framework called the RC Middleware (RCMW). RCMW is a layered middleware which enables application and tool portability by creating an application-specific development environment. This application-specific view of resources allows developers to focus on the ideal resource configuration for their application, without worrying about where those resources exist on the underlying platform architecture. The RC Middleware, its design, use, and benefits are discussed in detail in Chapter 2.

1.2 Graph Processing using Linear-Algebra Primitives

The difficulty of designing parallel graph-processing algorithms using conventional vertex- and edge-centric approaches has led to the development of new techniques such as the linear-algebra formulation of graph processing [18]. This approach brings with it the benefits of the predictable access patterns of linear-algebra operations, and a higher level of abstraction that simplifies the implementation and parallelization of many graph algorithms [18], [19]. In order to maximize the scalability and performance of this approach, however, several key challenges must be overcome, such as how to map irregular graph datasets to distributed systems, and how to efficiently store and access graph datasets. Graph-adjacency matrices are typically sparse, having a total number of non-zero elements on the order of the dimension of the matrix, and follow a power-law degree distribution, where only a few rows or columns contain the majority of the non-zero elements [20]. When computing on these sparse datasets in a distributed system, they become hypersparse, having fewer than one non-zero per row/column on average [18]. Despite this degree of sparsity, large-scale graph datasets still require significant storage space, requiring several terabytes even for small problem sizes [21].
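To see where hypersparsity comes from, consider the two-dimensional block decomposition commonly used when distributing an adjacency matrix across processors. The following is an illustrative back-of-the-envelope calculation under that assumption (an n-by-n matrix with roughly c*n non-zeros spread over a sqrt(p)-by-sqrt(p) grid of p processors); it is not a result taken from the cited works. The average number of non-zeros per local column is

\[
  \frac{\mathit{nnz}/p}{\;n/\sqrt{p}\;} \;=\; \frac{\mathit{nnz}}{n\sqrt{p}} \;\approx\; \frac{c}{\sqrt{p}},
\]

which drops below one as soon as p exceeds c^2. Each local submatrix therefore becomes hypersparse long before the global matrix stops being merely sparse, which is why index structures whose size scales with the matrix dimension rather than the non-zero count become increasingly wasteful at scale.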

To maximize the scalability and performance of these graph-processing algorithms, sparse-matrix storage formats capable of providing scalable, low-overhead storage with low-latency access to data are critical [22]. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R) [23], which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off between storage and lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. Chapter 3 presents a detailed analysis of existing sparse-matrix storage formats, and presents a novel storage format called Hashed-Index Sparse-Column/Row (HISC/R), which is optimized for distributed graph processing with linear-algebra primitives. In order to perform a wide variety of graph algorithms using the linear-algebra primitives for graph processing discussed in [24], [25], the typical sparse-matrix operations must be extended to support an arbitrary semiring. Many graph algorithms require that multiple sets of data be computed while traversing the vertex set of a graph, such as the parent of each node in a breadth-first search tree. To handle this, we define our matrices over an arbitrary semiring which is specific to the algorithm being performed; without this semiring abstraction, we would need to couple our scalar sparse-matrix operations with operations on multiple datasets. (A brief illustrative sketch of a semiring-parameterized sparse operation is given at the end of this chapter.) This dissertation begins by addressing the portability and scalability limitations of high-performance reconfigurable computing (HPRC) applications on FPGA platforms in Chapter 2. In Chapter 3, we explore optimizations for graph processing, and propose a new hypersparse-matrix storage format for distributed graph processing on FPGAs. In Chapter 4, we combine what we learned about graph processing and storage formats in Chapter 3 with the FPGA-application portability framework discussed in Chapter 2, and present a scalable graph-processing architecture on FPGAs. Finally, Chapter 5 summarizes this work and presents our conclusions and insights.
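Returning to the semiring formulation above, the following minimal C++ sketch illustrates the idea of parameterizing a sparse operation by a semiring. The struct layout, CSR storage, and names below are assumptions made only for this example; they are not the interfaces of this work or of the libraries cited above.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Illustrative only: a sparse matrix-vector product over an arbitrary semiring.
    template <typename T>
    struct Semiring {
        T (*add)(T, T);  // semiring "addition"       (e.g., min, logical OR)
        T (*mul)(T, T);  // semiring "multiplication" (e.g., +,  logical AND)
        T zero;          // additive identity         (e.g., +infinity, false)
    };

    struct CsrMatrix {                    // conventional CSR layout for the sketch
        std::size_t n;                    // number of rows
        std::vector<std::size_t> rowPtr;  // size n + 1
        std::vector<std::size_t> colIdx;  // column index of each non-zero
        std::vector<double>      val;     // value of each non-zero
    };

    // y[i] = add over j of mul(A(i,j), x[j]), computed over the given semiring.
    std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x,
                             const Semiring<double>& sr) {
        std::vector<double> y(A.n, sr.zero);
        for (std::size_t i = 0; i < A.n; ++i)
            for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                y[i] = sr.add(y[i], sr.mul(A.val[k], x[A.colIdx[k]]));
        return y;
    }

    // Example instantiation: the (min, +) semiring used for shortest-path-style relaxation.
    const Semiring<double> minPlus{
        [](double a, double b) { return a < b ? a : b; },
        [](double a, double b) { return a + b; },
        std::numeric_limits<double>::infinity()
    };

Swapping minPlus for a Boolean (OR, AND) semiring turns the same loop into the frontier-expansion step used by the linear-algebra formulation of BFS; the hardware analogue of this substitution is the semiring-configurable ALU described in Chapter 4.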

CHAPTER 2
PRODUCTIVITY AND PORTABILITY MIDDLEWARE FOR FPGA APPLICATIONS

Field-programmable gate arrays (FPGAs) enable developers to create application-specific hardware architectures, providing several orders of magnitude performance improvement [26], [27] while also improving computational efficiency [28], [29] for applications which do not map well to conventional CPU and GPU architectures. This flexibility and efficiency has made FPGAs ideal for various applications, from embedded systems [30] to supercomputers [31]. These benefits, however, come with the added complexity of hardware design, limiting developer productivity relative to fixed-logic devices, and preventing widespread usage of FPGAs. The difficulty of RTL-level design coupled with a lack of standards between FPGA accelerator platforms, herein referred to as platforms, complicates application development and limits code reusability. Due to a lack of standards between platforms, developers must tailor their application to a specific vendor's software and hardware interfaces. This platform-specific development cycle prevents portability, requiring significant developer time and effort to port applications to new platforms. Additionally, vendor-specific procedural APIs further limit portability. Procedural APIs embed platform-specific parameters into application code, including data marshalling and the physical location of application resources. These portability issues extend to high-level synthesis (HLS) tools and languages, which intend to improve developer productivity. Although HLS tools typically provide support for at least one platform out-of-the-box, the growing number of HLS tools and FPGA platforms outpaces the ability of tool vendors to provide platform support, leaving the challenge of supporting new platforms to application developers. These problems ultimately reduce HLS tool performance and usability, and end up costing tool vendors and application developers valuable time which could better be spent on developing their tools and applications. In order to help overcome the portability and productivity hurdles of FPGA-application development, we present the RC Middleware (RCMW). RCMW is a layered middleware which enables application and tool portability by creating an application-specific platform

abstraction. Developers specify their application's required resources and interfaces at design time, customizing the number, type, size, and data types of interfaces. Using this specification, RCMW provides a portable application-specific hardware and software interface. One major research challenge for enabling application portability is providing standardized interfaces to application-specific resources which are independent of the underlying platform, while also minimizing overhead. Platform details, such as the number and type of FPGAs and the size and performance of external memories, require careful consideration when mapping an application onto a target platform. To address these challenges, the RCMW toolchain determines the application-to-platform mapping at compile time, selecting an appropriate mapping based on a user-customizable cost function. Using RCMW, developers can focus on their application or tool rather than implementing their designs onto a specific platform. In this chapter, we present and evaluate RCMW using four platforms from three vendors: the PROCStar III [32] and PROCStar IV [33] from GiDEL; the M501 [34] from Pico Computing; and the PCIe-385N [35] from Nallatech. We demonstrate the ability to quickly explore different application-to-platform mappings with the RCMW toolchain using a representative convolution case study. We show that the benefits of RCMW can be achieved with minimal overhead: less than 7% in performance and 3% in area in the common case. We also demonstrate RCMW's productivity benefits by showing that RCMW requires less development time and fewer lines of code for deploying applications compared to the recommended vendor approaches. Finally, we demonstrate application portability using RCMW by executing the same application hardware and software source, for several applications and kernels, across each supported platform. The remainder of this chapter is organized as follows. Section 2.1 presents background and related work. Section 2.2 presents the RC Middleware framework and toolchain. Section 2.3 presents our experiments, results, and analysis. Section 2.4 provides a summary and concludes this chapter.

2.1 Background and Related Research

The lack of FPGA-accelerator standards has resulted in the development of vendor-specific APIs which limit application portability and developer productivity. To address this issue, OpenFPGA proposed a procedural C-based API standard for managing RC accelerators [36]. The OpenFPGA standard defines functions for initializing, managing, and communicating with FPGA accelerators, but requires developers to embed platform-specific application details such as the physical location of application resources. The Simple Interface for Reconfigurable Computing (SIRC) [37] is an object-oriented communication interface which provides functionality similar to the OpenFPGA standard, but enables portability using platform-specific subclasses. Although OpenFPGA and SIRC define comprehensive APIs, both require embedding platform-specific application details, which limits application portability. RCMW also provides a portable API standard, but overcomes this limitation by providing an object-oriented representation of application resources which encapsulates platform-specific details and enables application portability. High-Level Synthesis (HLS) tools address the productivity hurdles of FPGA-application design by providing high-level software-style development environments, but typically have limited platform support. HLS tools such as ROCCC [38] and Impulse-C [39] provide a C-style development environment and stream-optimized programming model, but take different approaches to platform support. ROCCC generates RTL cores with streaming interfaces, but requires developers to handle the platform-specific implementation. Impulse-C generates synthesizable HDL cores and an application driver from a single application source, and enables portability using platform-support packages (PSPs). PSPs wrap platform-specific interfaces to enable portability; however, due to their complexity and the large number of available platforms, developing PSPs is typically left as a challenge for the end user. Recent efforts such as Altera OpenCL [40] enable developers to create portable FPGA-application kernels using OpenCL. Similar to Impulse-C's PSPs, Altera OpenCL uses board-support packages (BSPs) to target a specific platform.

The FUSE framework [41] provides an OS-level abstraction of hardware accelerator resources, transparently scheduling software tasks on available hardware accelerators. Similarly, SPREAD [42] provides a unified hardware and software threading model, but takes advantage of partial reconfiguration to dynamically schedule hardware tasks. Liquid Metal (Lime) [43] also defines a unified hardware and software threading model, but enables developers to create mixed FPGA and CPU applications using Java. Similar to Lime, hthreads [44] enables developers to create mixed applications, but instead uses a C-based POSIX threading model. In order to target a platform, these tools and frameworks must provide a custom platform-specific hardware and software support package. RCMW is a complementary approach and could be leveraged by these tools and frameworks to generate a customized portable support package, allowing tool developers to focus on improving their tools instead of platform support. System-design tools such as SpecC [45] assist developers with design-space exploration and partitioning applications across multiple devices. SpecC enables developers to create a high-level application specification and refine it to select an architecture model, communication model, and finally create synthesizable RTL. OpenCPI [46] is a component-based application middleware for heterogeneous systems which enables seamless communication between components across devices including FPGAs, GPUs, and CPUs. Similarly, the System Coordination Framework [47] simplifies task communication between heterogeneous devices including CPUs and FPGAs by creating a partitioned global address space. SIMPPL [48] and IMORC [49] provide frameworks for creating applications from networks of components on a single FPGA. SIMPPL wraps IP cores with a core-specific network controller and enables asynchronous communication. IMORC also creates a network of components, but uses a multi-bus interconnect architecture. Although these approaches simplify the development of component-based applications, they still require significant developer time and effort to port applications to new platforms. RCMW is a related approach which automatically handles mapping application components and resources onto a target platform using a customizable mapping algorithm.

RCMW could be leveraged by these tools to handle FPGA-component mapping and provide portable hardware and software interfaces to components. An alternative approach to enabling FPGA-application portability is to create virtual-FPGA overlays of application-specific resources. Intermediate fabrics [50] are coarse-grained virtual-FPGA fabrics customized for a particular application domain. Similarly, [51] presents a device-level middleware with customizable resources for software-defined radio applications. These approaches enable device-level portability by providing the same coarse-grained resources independent of the target device. RCMW is a complementary approach, and could provide portable resource interfaces to these virtual-FPGA fabrics. Platform vendors typically provide tools to assist with application development. Two notable examples are Nallatech's DIMEtalk [52] and GiDEL's PROCWizard [53]. DIMEtalk provides a graphical interface to create networks of components and generate FPGA bitfiles. PROCWizard generates an HDL wrapper and C++ interface based on the developer's specified clocks, registers, and customized physical-memory interfaces. Since developers design their applications by customizing platform-specific resources, effort is still required when porting between platforms. To overcome this limitation, RCMW enables developers to configure application-specific resources without assuming any knowledge of the underlying platform. LEAP Scratchpads [54] provide cached virtual-memory interfaces and simplify FPGA-application memory management. Altera's Avalon [55] and ARM's AXI [56] protocols were created to enable component interoperability, and define streaming and memory-mapped interfaces. RCMW defines interfaces optimized for streaming applications, but can be extended to support any interface using the extensible core library. LEAP Scratchpads, Avalon, and AXI could be added to RCMW, allowing developers to request the ideal interface for each application resource while maximizing performance and minimizing design area. The RC Middleware enables application portability by providing an application-specific view of available hardware and software interfaces, independent of the underlying platform. Using RCMW, developers specify the required resources and interfaces needed by their

applications at design time, and RCMW handles determining the application-to-platform mapping at compile time. RCMW is extensible, allowing support for new interface and resource types to be added by extending the RCMW core library. An earlier version of this work can be found in [57], in which we demonstrate a previous version of RCMW. Since that work, we have developed an RCMW driver which enables us to support platforms without vendor support packages. Leveraging our driver, we added support for the Nallatech PCIe-385N featuring an Altera Stratix-V FPGA and explored this platform in our experiments. Additionally, we have extended the RCMW toolchain to include a best-first search algorithm to select the application-to-platform mapping. This algorithm provides a faster alternative to the exhaustive approach presented in our previous work. In addition to a case study demonstrating the RCMW design methodology, we have included results for the platforms from our earlier work in this chapter for completeness.

2.2 Approach

In order to enable FPGA-application portability, we must provide a standardized view of application resources independent of the underlying target platform's hardware configuration and software API. RCMW enables this standardized view using customizable hardware and software middleware consisting of three layers of abstraction, as shown in Figure 2-1A. From the bottom up, these layers are the translation layer, the presentation layer, and the application layer. First, the translation layer translates platform-specific hardware and software interfaces to standardized RCMW interfaces. Next, the presentation layer leverages these standardized interfaces, creating the application-specific hardware and software interfaces specified by the developer. Finally, these application-specific resources and interfaces are presented to the developer in the application layer, independent of the underlying platform. Figure 2-1B overviews RCMW's design methodology. Using RCMW, the application developer only needs to develop their application and create an XML-based description of the resources and interfaces needed by their application. This description contains details about application components, required memories, and memory-mapped registers.

Figure 2-1. Overview of RC Middleware design flow. A) Layered model of hardware and software abstractions enabling application/tool-generated source portability across heterogeneous FPGA platforms. B) Overview of design methodology for executing applications on specific platforms.

When ready
to execute their application on a specific platform, the developer provides the application description to the RCMW toolchain and specifies a supported target platform. RCMW selects an application-to-platform resource mapping based on a user-definable cost function optimizing for parameters such as minimal device area or interface latency. Using the selected mapping, RCMW generates a ready-to-compile project file for generating the FPGA bitfiles, and a C++ class which provides both interfaces to application resources and a stub function in which developers write their application code. This generated C++ class and stub function are herein referred to as the application stub. By enabling developers to focus on the resources and interfaces needed by their application, RCMW improves productivity by simplifying application development, and enables application portability by customizing hardware and software middleware layers to create application-specific interfaces. Although RCMW is intended to enable application portability regardless of the application class, the RCMW core library currently provides cores optimized for streaming applications, which are the focus of our case studies in this chapter.
Figure 2-2. RC Middleware hardware-abstraction layers enabling application portability between GiDEL PROCStar III and Pico M501.
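To make the application-stub concept concrete, the sketch below shows roughly what the developer-written portion of such a stub might look like for the convolution example of Figure 2-2, which uses go and done registers and kernel, signal, and result memories. The class and resource names follow the generated stub shown later in Figure 2-4, but the specific method names (write(), read(), and the like) are illustrative assumptions rather than RCMW's actual API.

    // Hypothetical sketch of the developer-written body of an RCMW-generated stub.
    // Resource members (go, done, kernel, signal, result) and the Application,
    // Register, and Memory classes come from Figure 2-4; the method names used on
    // them are assumptions, not RCMW's documented signatures.
    void Convolution::execute()
    {
        std::vector<float> hostKernel(64), hostSignal(1 << 20), hostResult(1 << 20);
        // ... fill hostKernel and hostSignal with application data ...

        kernel.write(0, hostKernel.data(), hostKernel.size());  // load kernel memory
        signal.write(0, hostSignal.data(), hostSignal.size());  // load signal memory

        go.write(1);                             // assert the go register
        while (done.read() == 0) { /* poll */ }  // wait for the done register

        result.read(0, hostResult.data(), hostResult.size());   // read back results
    }

Because the stub exposes only application-level resources, code like this never names a physical memory bank or vendor API call; the same source could run unchanged whether the toolchain mapped the three memories to separate DDR banks (as on the PROCStar III) or multiplexed them onto a single bank and BRAM (as on the M501).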

The remainder of this section is organized as follows. Section 2.2.1 presents the RCMW hardware abstraction layers. Section 2.2.2 presents the RCMW software abstraction layers. Section 2.2.3 discusses the RCMW XML metadata formats and extensible core library. Finally, Section 2.2.4 presents the RCMW toolchain and mapping algorithm. 2.2.1 Hardware Abstraction

Figure 2-2 illustrates an example of the three layers of hardware abstraction enabling portability for a convolution application. In this example, the developer has specified two input memories, one for the convolution kernel and one for the input signal, and one output memory for the convolution results. Since the GiDEL PROCStar III board has three external memories, each application memory can be assigned to a separate physical memory. The Pico M501, however, only has one external memory, requiring that the three application memories either share a single physical memory or make use of on-chip block RAM (BRAM). The developer also requested go and done memory-mapped registers for triggering the application and waiting for it to finish.

The physical layer consists of the low-level hardware interface controllers for external memories and host communication. We have leveraged vendor-supplied components for these interfaces wherever possible to avoid recreating existing interfaces without any significant benefits. In cases where no vendor components are provided, such as for the Nallatech PCIe-385N, we leveraged Altera/Xilinx IP cores and created custom HDL components. The translation layer is responsible for converting the platform-specific interfaces from the physical layer into a standardized interface that the rest of the RCMW toolchain understands. This layer is generated by the RCMW toolchain leveraging the RCMW core library, and depends on the application-to-platform mapping. The presentation layer handles creating the application-specific interfaces requested by the developer using the standardized interfaces exposed by the translation layer. The presentation layer is customized by the RCMW toolchain based on available platform resources and requested application resources, and is generated at compile time. The HDL cores leveraged in generating this layer are stored in an extensible core library, which is discussed later in Section 2.2.3.

In order to enable a configurable number of developer-requested interfaces to platform resources, the RCMW core library includes a configurable arbitration controller. This arbitration controller can handle multiplexing any number and type of application resources to a physical resource or BRAM. Although having too many application resources mapped to a single platform resource could degrade performance, this controller is required to enable application portability between platforms with different resource configurations. When available physical-memory bandwidth is greater than the required application bandwidth, the middleware can saturate multiple application interfaces without significant loss in performance. RCMW's customizable arbitration controller uses a request-to-send (RTS) and clear-to-send (CTS) protocol to arbitrate between application interface controllers. This protocol can be used to implement any arbitration scheme, from simple round-robin to adaptive arbitration schemes such as Bandwidth-Sensitivity-Aware arbitration [58], allowing RCMW to optimize design area and performance depending on application configuration. In the case that no arbitration logic is required, the interface controller can be directly mapped to the translation-layer interface, minimizing area overhead.

RCMW currently provides two standardized interface protocols: the burst interface and the FIFO interface. These protocols were selected since they are commonly used in streaming applications, but additional interface protocols can be supported by extending the RCMW core library. The burst interface enables applications to address an application memory sequentially. The interface word size can be any power-of-two number of bytes. The application specifies the starting byte-aligned address and the size in memory words, and asserts the start signal to begin a transfer. The interface will transfer the requested amount of data and assert the done signal. The FIFO interface enables application software to read or write data streams to application hardware in first-in, first-out order. The FIFO word size can be any power-of-two number of bytes. The application first toggles the reset signal to reset the FIFO buffer and read/write pointers. Then the application reads/writes data to the interface, asserting the flush signal for write interfaces when the stream is empty. When the read or write stream is complete, the EOS signal is asserted, indicating the end of the data stream. Both interface types require the enable and read valid or write ready signals for flow control. Flow control is required by all interfaces due to differences in performance between platforms.

Figure 2-3 provides a detailed illustration of the presentation layer. Each application memory has one or more interfaces. Using the configurable arbitration module described previously, any number of application memories and interfaces can be mapped by the RCMW toolchain to a physical memory. In the case that multiple application memories are mapped by RCMW to the same physical resource, there must be a virtual separation to prevent resources from affecting each other. This virtual separation is created by the RCMW toolchain using the generic parameters, including the base address and memory size, of each HDL interface controller. The base address corresponds to the address in physical memory where the application memory begins. The size of the memory is used to calculate address-wrapping conditions. In addition to memory interfaces, RCMW provides a separate memory-mapped interface to the application. This interface maps application resources, such as memory-mapped registers, to a host-controlled bus. The application layer presents the application-specific HDL interfaces specified in the application description to each application core, and generates a vendor-specific project for each FPGA where an application core is mapped.

Figure 2-3. Overview of RC Middleware's hardware presentation layer.

2.2.2 Software Abstraction

Each hardware abstraction layer described in the previous section has a corresponding layer in software. Figure 2-4 illustrates the layered software model and RCMW-generated application stub. RCMW uses a portable object-oriented software API which provides standardized interfaces to application resources. The physical layer corresponds to the software driver interface. Although we try to leverage vendor-supplied drivers wherever possible in order to minimize development overhead, we developed an RCMW PCI-Express driver for platforms without a vendor-provided driver, such as the Nallatech PCIe-385N. The software translation layer wraps platform-specific APIs and provides a standardized software interface to platform resources. RCMW requires that each supported platform have a subclass of the RCMW Board class. This Board class defines

    // RCMW-generated application stub for the convolution example (Figure 2-4)
    class Convolution : public Application
    {
    public:
        void bind(Board &board);
        void execute(); // User stub
    private:
        // User-defined resources
        WriteRegister go;
        ReadRegister done;
        Memory kernel, signal, result;
    };

    void Convolution::execute()
    {
        // User application stub
    }

(Figure 2-4 also depicts the accompanying software stack, from the RCMW user-application API at the top, through the RCMW runtime library, down to the vendor API and platform driver interface.)

Figure 2-4. Overview of RC Middleware’s software stack and generated C++ application stub

the required interfaces for the upper API levels, such as: blocking and non-blocking DMA read/write; board enumeration and initialization; clock configuration; and bitfile programming. The Board class encapsulates FPGA and Memory objects which represent physical platform components. The presentation layer handles mapping the application-specific resource interfaces onto the Board class interface provided by the translation layer. This layer is generated by the RCMW toolchain as a subclass of the RCMW Application class. Figure 2-4 illustrates the application-specific interface for a convolution example, with two registers: go and done, and three memories: kernel, signal and result. The RCMW toolchain-generated Application subclass encapsulates an instance of each resource specified by the application description. It provides Register objects, which are mapped onto the memory-mapped interface, Memory objects, which correspond to hardware sequential interfaces, and FIFO objects, which corre- spond to FIFO hardware interfaces. The Application class provides two functions: bind(...) and execute(...). The bind function is generated by the RCMW toolchain along with the application stub. The bind function handles the mapping of application resources onto a target platform at runtime, based on the RCMW-selected application-to-platform mapping. The execute function is the stub where developers implement their application software using the resources exposed

28 by the application class. RCMW provides a concurrent API which allows developers to allocate, manage, and communicate with application resources concurrently and portably. At runtime, RCMW handles detection of available platforms, selecting which platform will execute each application, initializing and configuring FPGAs, as well as managing threads for concurrent transfers and memory consistency. If a bitfile for an application is not available for a particular platform, the developer must first compile the RCMW toolchain-generated project using the vendor toolchain before being able to execute it. If a bitfile is found, the bind function is called on the selected Board object instance, and the application execute function is assigned to an idle software thread. When the application completes, RCMW releases platform resources. The application layer exposes an application-customized subclass of an Application class generated by the RCMW toolchain. This approach provides a portable programming model and allows developers to launch multiple application instances with RCMW handling platform configuration and scheduling. Developers are provided with standardized application interfaces without having to worry about where or how they are mapped onto a target platform. 2.2.3 Metadata and Extensible Core Library

This section provides an overview of the various XML-based metadata formats used in RCMW. There are three different metadata formats: the application description; platform description; and core description. The application description is used by developers to specify their application’s required resources and interfaces. The application description contains information about each core in an application, including HDL source files and any register or memory resources required. Application cores can specify any number of register or memory resources, with any number and type of resource interfaces. The application description also contains information about the structure of the application, including any core instances, and how those instances are interconnected. An excerpt of the application description from a convolution example can be seen in Figure 2-5. Although details have been excluded, the overall application description can be understood. In this example, the developer specifies
a core called main, composed of two source files: Convolution.vhd and Datapath.vhd. The developer specifies a memory with a burst read interface for storing the kernel data.

Figure 2-5. Example of application-description XML format. The listing declares a core named main built from Convolution.vhd and Datapath.vhd, go and done registers, and a 256MB memory with an rcmw_sequential_rd burst-read interface.

The platform description is used to describe a platform's resources such as FPGAs and memories, as well as their interfaces and physical connections. This description enables the RCMW toolchain to understand the available resources and how they are connected. The platform description contains the hardware details of a platform, with the software details captured by the platform-specific Board class as described in Section 2.2.2. Developers or platform vendors can easily extend RCMW to include a new platform by creating a platform description. In cases where the RCMW core library contains all required HDL components, no additional coding is needed. If the platform requires device-specific IP instances or interfaces not supported by RCMW, the core library must be extended with the necessary components. The core description describes the interfaces and function of the HDL core components in the RCMW core library. These cores are used by the RCMW toolchain to resolve the connections between application interfaces and platform resources. The core metadata
describes core generics, clocks, resets, interfaces, and the dataflow between interfaces. Additionally, the core metadata contains information about device-specific area and performance costs in terms of LUTs and average latency. The area-cost data is based on previous post-fit results reported by the vendor toolchain. The core metadata format is more complex than the other formats since we have to handle generic interfaces with generic port widths, or even a generic number of interfaces as in the case of a generic MUX. Cores can also specify device-specific architectures to optimize cores for a particular FPGA. To allow for generic attributes, the values of attributes for interfaces in a core are allowed to be algebraic functions of the core's generics. These attributes are then resolved when the generic interface is bound to another interface during the mapping step of the RCMW toolchain. In the case of a core with a generic number of interfaces, the core entity declaration is generated by a core-specific script with RCMW-toolchain-determined generic values. The RCMW core library contains interface adapters, arbitration controllers, and MUXes. Developers are able to add additional components to the RCMW core library by providing the core HDL, XML metadata, and generate script if required. Once the core is added to the core library, the RCMW toolchain will automatically include it when mapping application resources. Applications can also request virtual-core instances from the RCMW core library in their application description. For example, if an application requires an FFT core, the core library can be extended to include an FFT core with different architectures optimized for Altera or Xilinx FPGAs. During the mapping process, the RCMW toolchain will replace the virtual-core instance with the optimal core from the RCMW library.

2.2.4 RC Middleware Toolchain

The previous sections presented the hardware and software abstraction layers which enable application portability between heterogeneous platforms using RCMW. In order to provide an application-specific view of resources, the RCMW toolchain generates customized translation and presentation layers based on the target platform and required application resources. This section presents an overview of the RCMW toolchain and discusses our application-to-platform mapping approach.

Figure 2-6 depicts the toolchain flow in four steps: (1) the developer creates the application resource description; (2) the developer executes the RCMW toolchain, specifying the application description and target platform; (3) the mapper creates the application-to-platform mapping using the core database and generates Mapping.xml; and (4) additional RCMW tools use Mapping.xml to generate the vendor-tool project files, HDL, and the C++ software stub.
Figure 2-6. Overview of RC Middleware toolchain from application specification to vendor-specific project generation.

Figure 2-6 illustrates the RCMW toolchain. In order to use the RCMW toolchain, an application developer only needs to develop their application logic, and describe the required resources and interfaces in the application description. The developer then executes the RCMW toolchain providing the application description, and specifying a target platform from the RCMW platform database. The RCMW toolchain determines an application-to-platform mapping based on a configurable cost function. Using the mapping results, the RCMW toolchain calls a C++ stub generator, which creates an application-specific stub similar to Figure 2-4, an HDL generator, which instantiates the required HDL entities and connects them together as shown in Figure 2-2, and a vendor-specific project generator for compiling the FPGA bitfile(s). The mapping process consists of two stages: (1) determine how to map each application resource to platform resources, such as application memories to physical memories, and application cores to FPGAs; and (2) determine how to connect each application interface to

the platform resource selected in (1). In the previous version of RCMW described in [57], we used an exhaustive search to explore every valid application-to-platform mapping. For applications with few components, this approach is acceptable. However, for large applications and multi-FPGA platforms, the number of possible mappings grows considerably. To overcome this limitation, our updated mapping approach uses heuristics to guide the mapping process. The first stage generates a list of candidate application-to-platform resource mappings to be considered. Each candidate mapping is generated by first selecting an FPGA for each application core, and then selecting an appropriate platform resource for each application resource. For example, an application memory could be mapped to a block RAM or an external memory bank. In order to reduce the number of candidate mappings to be explored in stage two, we use the number of FPGA boundaries a datapath must pass through as a heuristic which estimates the cost of the path. Candidate mappings that place connected application cores and resources on the same FPGA, or in local memories, are favored over mappings that spread components across multiple FPGAs. Once the list of candidate mappings has been generated, the next stage determines how to implement each candidate mapping, and calculates the associated mapping cost using a customizable cost function. After calculating the cost for each candidate mapping, the minimum-cost mapping is selected. The second stage explores each candidate mapping and determines how to connect each application interface to the selected platform resource using customizable cores from the RCMW core library. Each core in the RCMW core library is characterized by an XML-based core description which represents the core's interfaces, generics, and data-flow graph between each interface. An interface in RCMW is characterized by a unique type, data-flow direction, and collection of ports. Each port in an interface is characterized by a width, direction, and a type such as clock, reset, or data. Interfaces with compatible types, direction, and ports can be mapped together. Mapping interfaces together may require binding core generics to a particular value, such as the data port width. The core description can optionally include a Python-based

script which allows for automated generation of an HDL entity declaration in cases where the entity provides a generic number of interfaces, such as the configurable arbitration controller. The process of determining what cores from the RCMW core library are needed to connect an application interface to a target platform resource is similar to path finding, where the starting position is an application interface and the goal is the target platform resource. Each node in the path to the goal corresponds to a core instance from the RCMW core library, including cores which convert interface types, cross clock domains, or merge multiple datapaths using a resource arbiter. At each iteration of this mapping process, there is a set of interfaces that need to be resolved to their target resource and a set of candidate cores which match those interface types. The set of candidate cores plus the current path create a new set of paths which need to be explored. The mapper uses a best-first search algorithm, selecting the next candidate path to explore using a knowledge-plus-heuristic cost function. The knowledge-based cost is the cost of the current core instances in the selected path, such as the estimated FPGA resources or latency. The heuristic cost estimates the cost for any application interfaces which have not yet been resolved in the current path, and is estimated by weighting the current knowledge-based cost by the number of unresolved interfaces.

c(p) = g(p) + h(p) (2–1)

h(p) = g(p)(N0 − n)u(N0 − n) (2–2)

The overall cost function is presented in Equation 2–1 where g(p) is the knowledge-based cost, and h(p) is the heuristic cost. We define the heuristic-cost function in Equation 2–2 where N0 is the number of application interfaces that need to connect to the target physical resource, n is the number of application interfaces the current path resolves, and u is the unit step function. The heuristic function provides a lower bound on the path cost by estimating that the unresolved application interfaces will likely require a similar set of core instances as the current path. The step function acts to remove the heuristic cost once the current path

can support all application interfaces. This cost function can be modified to improve mapping results, or optimize for different parameters, and will be explored further in our future work. While this approach does not guarantee an optimal solution, it guarantees that we efficiently find a mapping that can resolve all application interfaces. Figure 2-7 illustrates the second stage of the mapping process for resolving two application read interfaces, signal and kernel, to a single DDR2 memory bank. Each panel in Figure 2-7 shows one step of the mapping process. The rectangles refer to cores from the RCMW core library. This example uses a simplified set of cores and area costs to illustrate the mapping process; the cores and their respective costs can be found at the bottom of each mapping step.
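To make the search mechanics concrete, the following self-contained C++ sketch mimics one iteration of this best-first exploration using the illustrative LUT costs from Figure 2-7. The types, expansion rules, and termination check are simplified stand-ins for exposition, not the actual RCMW mapper implementation.

#include <cstdio>
#include <queue>
#include <vector>

// A partial mapping path: its knowledge-based cost and how many application
// interfaces it can currently service.
struct PartialPath {
    double g;        // estimated LUTs of core instances selected so far
    int resolved;    // n, the number of application interfaces resolved
};

// c(p) = g(p) + h(p), where h(p) = g(p)(N0 - n) while n < N0 (Equations 2-1 and 2-2).
double pathCost(const PartialPath& p, int n0) {
    double h = (p.resolved < n0) ? p.g * (n0 - p.resolved) : 0.0;
    return p.g + h;
}

struct CostOrder {
    int n0;
    bool operator()(const PartialPath& a, const PartialPath& b) const {
        return pathCost(a, n0) > pathCost(b, n0);   // min-heap on c(p)
    }
};

int main() {
    const int n0 = 2;   // signal and kernel read interfaces share one DDR2 bank
    std::priority_queue<PartialPath, std::vector<PartialPath>, CostOrder> open(CostOrder{n0});

    // Seed with the burst-read controller from Figure 2-7A: 100 LUTs, resolves one interface.
    open.push({100.0, 1});

    while (!open.empty()) {
        PartialPath p = open.top();
        open.pop();
        // Simplified goal test; the real mapper checks that the path reaches the
        // target platform resource through the low-level memory controller.
        if (p.resolved >= n0) {
            std::printf("selected path, c(p) = %.0f\n", pathCost(p, n0));
            break;
        }
        // Expand with candidate cores from the library (Figure 2-7B): a generic
        // rd_arbiter at 50 LUTs per interface, which resolves both interfaces ...
        open.push({p.g + 50.0 * n0, n0});
        // ... or an additional read-interface core at 150 LUTs for the second interface.
        open.push({p.g + 150.0, 1});
    }
    return 0;
}

With these numbers the arbiter path costs c(p) = 200 while the alternative costs 500, so the arbiter is popped first and selected, matching the minimum-cost choice shown in Figure 2-7B.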

[Figure 2-7 data: the example core library contains rcmw_burst_rd (100 LUTs), rd_arbiter (50·n LUTs), ddr2_cntl (300 LUTs), and mem_rd_if (150 LUTs); the completed mapping in panel D has a total cost of c(p) = 750.]

Figure 2-7. Example of RC Middleware mapping two application read interfaces to a single physical memory. A) Instantiating application burst-read controller. B) Instantiating generic read-arbitration controller. C) Instantiating remaining IP to close path to memory. D) Mapping complete.


In each step of the mapping, there are multiple open paths which need to be considered, each consisting of a set of instances from the RCMW core library, and a node which needs to be expanded next. In the first step in Figure 2-7, the signal interface is expanded first. The RCMW library is searched for candidate cores which provide the required interface type for the signal interface. One matching core is found, and is added to the current path. In the second step, the previous path is expanded to find two candidate cores: an arbitration core which supports a generic number of interfaces, and a low-level controller for interfacing with memory. Since the arbitration controller supports a generic number of interfaces, it reduces the heuristic cost and is selected as part of the minimum-cost path. Steps three and four combine several iterations using the same method to illustrate the remaining cores being selected to complete the mapping. In our experiments, we explore area- and performance-optimizing cost functions. The area-optimizing g(p) is equal to the total estimated lookup tables (LUTs)

Table 2-1. Currently supported RC Middleware platforms.
Platform      Vendor          FPGA(s)              Memory/FPGA                  Host Interface
PROCStar III  GiDEL           4x Stratix III E260  1x 256MB DDR2, 2x 2GB DDR2   PCIe Gen1 8x
PROCStar IV   GiDEL           4x Stratix IV E530   1x 512MB DDR2, 2x 4GB DDR2   PCIe Gen1 8x
M501          Pico Computing  1x Virtex-6 LX240T   1x 512MB DDR3                PCIe Gen2 8x
PCIe-385N     Nallatech       1x Stratix V SGSMD5  2x 4GB DDR3                  PCIe Gen2 8x

of core instances in the current path. The performance-optimizing g(p) is equal to the maximum estimated latency from any application interface to the target platform resource.

2.3 Results and Analysis

In this section, we present experiments which demonstrate the portability and productivity benefits of RCMW. We evaluate these benefits using four platforms from three vendors, detailed in Table 2-1. First, we begin with a convolution application as a case study. We use RCMW to map the application to each supported platform, and look at the differences in application-to-platform mapping results for both area- and performance-optimizing cost functions. Next, we evaluate the performance and area overhead incurred when using RCMW compared to native vendor interfaces. Finally, we evaluate the productivity and portability benefits of using RCMW using several streaming applications. In our experiments, we compiled Altera bitfiles using Quartus II v13.0sp1. We used GiDEL driver version 8.9.3.0. Bitfiles for the Pico M501 were generated using Xilinx ISE 14.7. We used Pico driver version 5.2.0.0. RCMW's software API was compiled using GCC v4.7.2 with C++11 support. All software was compiled using optimization flag -O3.

2.3.1 Convolution Case Study

This case study explores the complete development cycle for a convolution application using RCMW. Although this example requires a relatively simple resource configuration, it is representative of using the RCMW toolchain for more complex applications. We examine the required developer effort in terms of hardware and software lines of code, and lines of

[Figure 2-8 data: mapping diagrams showing the application's signal, kernel, and result memories and the "convolve" core mapped onto the platform FPGA and memory banks A-C.]

Figure 2-8. Area- and performance-optimized mapping results for mapping the convolution application on the PROCStar III/IV. A) Area-optimized mapping. B) Performance-optimized mapping.

XML. We explore the RCMW toolchain results for both area- and performance-optimizing cost functions for each platform. We examine the toolchain execution time, estimated area in LUTs, estimated interface latency in clock cycles, actual post-fit area in LUTs, and execution time for each platform. The estimated area and interface latency are used by the RCMW toolchain to evaluate each cost function. The convolution application performs 1-D convolution of 32-bit integers. We use a randomly generated 2-million point signal and 96-point kernel. Figure 2-8 illustrates the area-optimized (Figure 2-8A) and performance-optimized (Figure 2-8B) mappings generated by the RCMW toolchain for the convolution application on the PROCStar III and IV. The circles represent platform and application resources, with the solid lines indicating interfaces between resources. The dotted lines indicate the platform resource to which it is mapped. The area-optimizing cost function selects the mapping with the minimum area in LUTs, and does not take into account on-chip block RAM (BRAM). Using this cost function, the RCMW toolchain maps all application memories to the 2 and 4 GB banks of the PROCStar III and IV, respectively. This mapping minimizes the number of LUTs by minimizing the number of memory controller instances, which require significantly more area than the

Table 2-2. Comparison of lines of code required when using RC Middleware.
Source            Type  Lines of Code
Hardware          HDL   456
Software          C++   22
App. Description  XML   111

Table 2-3. Total map-generation time, estimated area and latency, and actual area and execution time for the convolution case study optimizing for performance or area.
Platform      Optimization  Map Time  Est. Area (LUTs)  Est. Latency (CCs)  Area (LUTs)  Exec. Time
PROCStar III  Area          17.66 ms  17 857            26                  15 348       39.3 ms
PROCStar III  Performance   18.75 ms  26 619            24                  23 934       38.9 ms
PROCStar IV   Area          14.27 ms  17 125            26                  16 938       39.1 ms
PROCStar IV   Performance   17.47 ms  25 669            24                  25 194       38.8 ms
M501          Area          2.17 ms   12 330            38                  12 396       35.1 ms
M501          Performance   2.46 ms   17 403            36                  16 843       34.7 ms
PCIe-385N     Area          7.16 ms   11 762            16                  10 843       29.9 ms
PCIe-385N     Performance   6.21 ms   15 862            14                  12 997       28.9 ms

RCMW arbitration logic. The performance-optimizing cost function selects the mapping where the greatest latency of all application interfaces is minimized. Using this cost function, the RCMW toolchain maps each application memory to a separate physical memory bank, minimizing the latency for each application interface. Since the kernel is sufficiently small it is mapped to BRAM. Table 2-2 presents the total hardware and software lines of code, and the lines of XML in the application description written by the application developer. We only include lines of code written by the developer, not including spacing or comments. Due to the differences in coding styles and developer experience, this table is meant to compare the relative effort for creating each component of the application. Table 2-3 presents the results of using the RCMW toolchain to map the convolution application onto each supported platform for both performance- and area-optimizing cost functions. The map time is the required execution time for the RCMW toolchain to finish mapping the application to each platform, and generate the associated FPGA project file and C++ software stub. We ran the RCMW toolchain on a quad-core Xeon E5520. The

estimated area in LUTs is the area calculated by the RCMW mapper using the post-fit area results reported by the FPGA-vendor toolchain for each individual core. The estimated latency in clock cycles (CCs) is the estimated maximum latency of all application interfaces in the selected mapping. The latency of each interface is calculated by adding the estimated latency of each core instance in the path from the application interface to the platform interface. The calculated latencies for the area- and performance-optimized mappings differ by only a few cycles since the RCMW MUX component only estimates a single clock cycle for each multiplexed interface. This estimate could be improved by taking into account the type of arbitration used, the number of interfaces, and the average transfer length. The area in LUTs is the post-fit area reported by the vendor toolchain. This area is similar to the estimated area, which was calculated using post-fit results for each core individually, but is not equal due to optimizations made during the analysis-and-synthesis and fitter stages of the vendor toolchain. The execution time is the total time to transfer the input signal and kernel data, perform the convolution, and transfer the results. We selected an application clock frequency of 150 MHz for each platform. The execution times are similar for both performance and area optimization, since the available memory bandwidth is sufficiently higher than the bandwidth required by the convolution core. Depending on the target platform hardware configuration and required application resources, the performance optimization may or may not give significant performance improvements. As illustrated in Figure 2-8, the area-optimized mapping results in all application memories being mapped to a single external memory bank, requiring only a single memory-controller instance. The performance-optimized mapping, however, required two memory-controller instances, and additional logic for the kernel BRAM. We were able to reduce the post-fit logic usage of our application by 36% by selecting an area-optimizing cost function, which could enable applications to fit additional processing elements on an FPGA and increase application performance. In our previous work, we used an exhaustive mapping algorithm which explored all possible application-to-platform resource mappings. Given the simplicity of the resource configuration for this case study, both the exhaustive and heuristic mapping algorithms

converge to the same mappings for both the area- and performance-optimizing cost functions. The exhaustive algorithm, however, requires over an order of magnitude longer to find the same mapping even for this simple example.

2.3.2 Analysis of Performance and Area Overhead

In this section, we analyze the interface and area overhead introduced by RCMW. First, we measure the overhead introduced by RCMW’s software API, by transferring data between host and FPGA for varying transfer sizes. We measure the time required to complete each transfer, and calculate the overhead as a percentage reduction in effective bandwidth compared to the vendor-specific API. Next, we measure the overhead introduced by RCMW when transferring data between application and platform memory. We compare the effective bandwidth when transferring data for a single RCMW read/write interface to the vendor-specific interface. To measure the FPGA to external memory bandwidth, we count the total number of clock cycles required to perform a transfer of a given size, and use the known application clock frequency to calculate the effective bandwidth. We calculate overhead as a percentage reduction in effective bandwidth. Finally, we measure the RCMW area overhead by comparing the relative logic usage of RCMW, vendor, and application components for several simple applications. The area percentages were obtained using the post-fit device usage report provided by Altera Quartus II and Xilinx ISE. Figure 2-9 presents the effective read and write bandwidth of the RCMW host to FPGA, and FPGA to external memory transfers. We find that the M501 and PCIe-385N lead the GiDEL PROCStar III and IV by a factor of two in host/FPGA bandwidth due to the newer generations of PCI Express. The maximum bandwidth of writing from the FPGA to external- memory is approximately the same for each platform due to the fixed word size of 128 bytes at 150 MHz in our benchmarks. The FPGA to external-memory read performance is also approximately the same for each platform, with the exception being the M501. The fixed latency for each 4KB read of the M501’s AXI memory interface results in the effective read bandwidth plateauing around 1 GB/s.
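Stated explicitly, the quantities used in this comparison can be written as follows; the symbols S (transfer size in bytes), N_cc (measured cycle count), and f_clk (the application clock, 150 MHz in these measurements) are our notation for the procedure described above.

BW_{eff} = \frac{S}{N_{cc}/f_{clk}}, \qquad \text{overhead} = \left(1 - \frac{BW_{eff,RCMW}}{BW_{eff,vendor}}\right) \times 100\%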

[Plot: effective bandwidth (MB/s) vs. transfer size (512 B to 128 MB) for the M501, PCIe-385N, PROCStar III, and PROCStar IV, with separate read- and write-performance curves for FPGA-to-external-memory and FPGA-to-host transfers.]

Figure 2-9. Host and FPGA read and write performance to external memory for PROCStar III/IV, M501, and PCIe-385N. FPGA-to-memory bandwidth was measured with a word size of 128 bytes at 150 MHz.

[Plot: RCMW overhead (%) vs. transfer size (512 B to 128 MB) for the M501, PROCStar III, and PROCStar IV, with separate read- and write-overhead curves for FPGA-to-external-memory and FPGA-to-host transfers.]

Figure 2-10. Host and FPGA read and write overhead to external memory for PROCStar III/IV, and M501. FPGA-to-memory overhead was measured with a word size of 128 bytes at 150 MHz.

Figure 2-10 presents the overhead incurred by using RCMW compared to vendor-only baseline interfaces. The Nallatech PCIe-385N is not included in this figure since the vendor provides no baseline with which to compare. The left side of Figure 2-10 presents the RCMW overhead for transfers between the FPGA and external memory. The peak FPGA/memory write overhead was similar for each platform: 50%, 43%, and 43% overhead for the M501, PROCStar III, and PROCStar IV, respectively. For large transfers, this overhead quickly becomes less than 1% for each platform. The peak read overhead is less than 10% for each platform, and similarly becomes less than 1% for large transfers. For small transfers, an overhead of 50% equates to only tens of clock cycles, which is relatively insignificant. The peak write overhead is greater than the peak read overhead due to the additional cycles required to flush memory buffers and ensure read-after-write consistency. The right side of Figure 2-10 presents the RCMW overhead for transfers between the FPGA and host. The high variance found in these graphs for small transfers is due to the variance in the host's OS scheduler. The greatest FPGA/host transfer overhead is incurred by the PROCStar III, which peaks at approximately 80% for reads and 70% for writes. This seemingly high overhead is due to the additional features provided by the RCMW software API, including thread-safety and user-memory buffer management. Although we could disable these features and significantly reduce this overhead, they are vital to RCMW's concurrent API and are therefore included in our results. Furthermore, this high overhead occurs at small transfer sizes, and accounts for approximately 1-2 ms of overhead. Since the M501 provides thread-safety for some of its API calls by default, the peak overhead is less, approximately 20% for both reads and writes. Large overheads are restricted to small transfer sizes, resulting in only a few additional microseconds for each transfer. For increasing transfer sizes, this overhead is quickly amortized, resulting in less than 1%, 5% and 7% read and write overhead for the M501, PROCStar III, and PROCStar IV, respectively. Figure 2-11 presents the logic usage of RCMW, application, and vendor components. The bottom layer in the stacked bar graph represents the vendor logic usage, the middle layer represents the application logic usage, and the top layer represents RCMW overhead.

[Stacked bar chart: device logic usage (%) of vendor, application, and RCMW components for AES128, Needle-Distance, SAD, and Smith-Waterman on each platform.]

Figure 2-11. FPGA resource analysis for vendor, application, and RC Middleware components.

Each set of bars in Figure 2-11 represents the area breakdown for each platform for a specific application. The PCIe-385N does not have a vendor component, since there are no vendor-provided hardware components. From this figure, we see that RCMW accounts for a very small fraction of the overall design area, typically less than 1% of the total device resources. The largest RCMW area overhead was less than 3%, for the Sum of Absolute Differences (SAD) on the Pico M501. For the PCIe-385N, RCMW handles the PCIe and external memory interfaces, resulting in a relatively larger RCMW area usage. Although the total area required by RCMW is platform and application specific, it is important to note that at least a portion of the area resulting from

mapping multiple application resources to a single physical memory would be necessary even for non-RCMW implementations. For a performance comparison of these applications and kernels, refer to Table 2-4.

2.3.3 Analysis of Productivity

In this section, we analyze the productivity benefits of RCMW by comparing software lines of code (SLoC), hardware lines of code (HLoC), and total development time required by the developer. Although lines of code and development time are commonly used for measuring software development productivity, it is worth mentioning that these measures are heavily influenced by developer-specific factors such as coding style [59]. To explore the productivity benefits of RCMW, a developer familiar with the Pico Computing M501, GiDEL PROCStar III/IV, and with RCMW, implemented five cores from OpenCores [60] using both vendor-specific and RCMW-specific design flows for each platform. The cores used included an AES128 encryption core, a JPEG encoder, a SHA256 hashing core, an FIR filter, and a 3DES encryption core. Each core was implemented using both the vendor's recommended design flow, and the RCMW toolchain. We included all code written by the developer, excluding comments and whitespace. The Nallatech PCIe-385N was excluded from this experiment due to the lack of a vendor-specific design flow. GiDEL and Pico Computing provide different approaches for developers to interface their applications with platform resources. GiDEL provides a graphical tool called PROCWizard, which enables developers to customize GiDEL-provided IP cores and resource interfaces. Pico Computing takes a different approach, providing developers with a Xilinx AXI bus interface to platform memory. Pico Computing provides a streaming abstraction in both hardware and software, enabling efficient transfer of data between host and FPGA. Our experiments indicated that, on average, RCMW required 65% less SLoC, 41% less HLoC, and 53% less development time than the GiDEL-specific design flow; and 66% less SLoC, 59% less HLoC, and 69% less development time than the Pico-specific design flow. Since these numbers are averaged for a single developer, we cannot draw conclusions as to an exact

improvement for all developers, but we can support the argument that RCMW improves productivity. These improvements are expected, since RCMW handles many development tasks typically left to the developer. By providing standardized, application-specific hardware and software interfaces, RCMW enables seamless application portability between heterogeneous platforms, while also reducing application complexity. This reduction in complexity leads to a reduction in the hardware and software lines of code that the developer must write. In hardware, RCMW handles resource arbitration and clock-domain crossing. In software, RCMW manages platform initialization, cleanup, and application multithreading. RCMW also provides API validation, monitoring driver calls to prevent the platform from entering an invalid state, alerting users via C++ exceptions when necessary. The abstractions provided by RCMW, standardized interfaces, and object-oriented APIs maximize developer productivity, while also enabling code reuse. In our experiments, the Pico M501 required relatively high HLoC due to the generic interfaces exposed to developers. Unlike PROCWizard and RCMW, the Pico M501 does not assist developers in customizing platform resources, forcing developers to handle arbitration and clock-domain crossing (CDC). GiDEL's approach required less HLoC due to PROCWizard assisting developers in customizing IP interfaces for their application. GiDEL also handles CDC for memory interfaces, reducing HLoC. RCMW required the least HLoC, generating the specific interfaces required by the application, and handling all required arbitration and CDC. Similar results were found for total SLoC, with Pico requiring the most SLoC, followed by GiDEL, and then RCMW. Both vendor APIs require developers to manage buffers and platform-specific restrictions such as data alignment and transfer size. In order to reduce developer overhead, RCMW provides a variety of different features, such as templated read and write functions which can handle any data type. Additionally, RCMW handles buffer management, data alignment, and garbage collection internally, further simplifying application development.
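As a hypothetical illustration of how such a templated interface removes type- and buffer-management burden from the developer, consider the sketch below; the class and method names are invented for this example and are not RCMW's actual API.

#include <cstddef>
#include <vector>

// Hypothetical templated read/write wrapper in the spirit of the RCMW software
// API; the Platform type and its rawRead/rawWrite hooks are assumptions.
template <typename Platform>
class MemoryHandle {
public:
    MemoryHandle(Platform& p, const char* name) : platform_(p), name_(name) {}

    // Any trivially copyable element type can be transferred; the middleware
    // layer would internally handle alignment, buffering, and size restrictions.
    template <typename T>
    void write(const std::vector<T>& host_data, std::size_t offset = 0) {
        platform_.rawWrite(name_, host_data.data(),
                           host_data.size() * sizeof(T), offset * sizeof(T));
    }

    template <typename T>
    void read(std::vector<T>& host_data, std::size_t offset = 0) {
        platform_.rawRead(name_, host_data.data(),
                          host_data.size() * sizeof(T), offset * sizeof(T));
    }

private:
    Platform& platform_;
    const char* name_;   // application resource name, e.g. "signal" or "kernel"
};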

2.3.4 Analysis of Portability

Table 2-4 presents the execution time and logic usage of various streaming applications and kernels for all four supported platforms. Each application was executed using the same application source code with a clock frequency of 150 MHz, and required between two and four streaming interfaces. The OpenCores JPEG encoder required a random-access memory interface to integrate with its on-chip peripheral bus (OPB) interface. This table demonstrates the same application hardware and software source executing across heterogeneous platforms. Porting applications across each platform was accomplished with almost no effort, requiring only that the RCMW toolchain be executed once for each application and platform. This table is not meant to be a comparison of platform performance, since the maximum device area and achievable clock frequency were not used for each application. In our tested applications and kernels, however, we found that the PCIe-385N outperforms the PROCStar III and IV for the given implementations and input datasets. The M501 and PCIe-385N use a newer PCIe generation, enabling higher peak bandwidth from host to FPGA, making transfer-heavy applications like Smith-Waterman, which streams a large database from host to FPGA, perform better. For applications that require more memory interfaces, such as Image Segmentation, the two additional memory banks of the PROCStar III and PROCStar IV provide an advantage over the M501. It is important to note that with newer platforms using state-of-the-art FPGAs, the increase in FPGA resources enables more application cores to fit on a single FPGA. This trend is illustrated in Table 2-4, as the device-logic usage greatly decreases from the PROCStar III, which uses a Stratix III, to the PCIe-385N, which uses a Stratix V. For HPC applications with data-level parallelism, RCMW could enable significant performance improvements by targeting existing applications to newer FPGA platforms with little to no developer effort.

Table 2-4. Execution time and area for various applications and kernels on each supported RC Middleware platform.
                                                    M501           PCIe-385N      PROCStar III   PROCStar IV
Kernel/Application   Parameters                     Exec.    Area  Exec.    Area  Exec.    Area  Exec.    Area
1D Convolution       96-point kernel; 2M points     32.9 ms  3 %   32.9 ms  3 %   39.3 ms  14 %  38.9 ms  7 %
2D Convolution       27x27 kernel; 640x480 image    12.8 ms  13 %  12.8 ms  13 %  13.2 ms  44 %  16.3 ms  25 %
Image Segmentation   320x480 image                  1.24 s   11 %  1.24 s   11 %  1.41 s   53 %  1.39 s   26 %
Needle-Distance      150 PEs; 215 characters        173 ms   10 %  173 ms   10 %  194 ms   38 %  203 ms   21 %
OpenCores AES128     2M hashes                      19.3 ms  6 %   19.3 ms  6 %   25.3 ms  22 %  24.3 ms  11 %
OpenCores FIR        10 taps; 1M points             21.3 ms  3 %   21.3 ms  3 %   24.5 ms  15 %  24.0 ms  8 %
OpenCores JPEG Enc.  640x480 image                  14.3 ms  3 %   14.3 ms  3 %   15.3 ms  15 %  19.6 ms  8 %
OpenCores SHA256     64K blocks                     52.3 ms  4 %   52.3 ms  4 %   64.1 ms  12 %  63.3 ms  5 %
Smith-Waterman       150 PEs; 215 characters        104 ms   4 %   104 ms   4 %   116 ms   11 %  119 ms   6 %
Sum of Abs. Diff.    49x49 feature; 640x480 image   13.8 ms  18 %  13.8 ms  18 %  14.7 ms  75 %  19.1 ms  38 %

2.4 Summary and Conclusions

Despite performance and power advantages over conventional many-core CPU and GPU architectures, FPGAs have had limited acceptance in HPC and HPEC applications due to their portability and productivity challenges. To help overcome these challenges, we introduced the RC Middleware (RCMW). RCMW provides an extensible framework which abstracts away platform-specific details to provide an application-centric hardware and software development environment. This environment is customized by the RCMW toolchain using the developer-provided application description, and allows developers to focus on the ideal resources and interfaces for their application, without worrying about the underlying platform configuration. To create this environment, the RCMW toolchain first selects an application-to-platform mapping using a customizable cost function, and then generates the required hardware and software interfaces. We evaluated RCMW's performance and productivity benefits for four platforms from three vendors. We demonstrated RCMW's ability to quickly explore different application-to-platform mappings using a convolution application case study for both area- and performance-optimizing cost functions. We demonstrated that the benefits of RCMW can be achieved with less than 1% FPGA/memory and 7% host/FPGA transfer overhead in the common case. We also demonstrated that RCMW has relatively low area overhead, requiring less than 3% of logic resources for several applications across all four platforms. We presented evidence that RCMW improves developer productivity, by showing that RCMW requires fewer lines of code and less total development time for deploying several kernels than vendor-specific approaches. Finally, we demonstrated that RCMW enables portability by showing that the same application source was able to execute without change across each supported platform. Leveraging the productivity and portability benefits of the RC Middleware, we focus on developing our scalable graph-processing architecture without relying on the underlying FPGA-accelerator platform hardware and software interfaces. Given the time and effort required to develop this architecture, the RC Middleware provides us with a vehicle to take advantage

of new reconfigurable system architectures and next-generation accelerators with minimal to no changes to our source code. In the following chapter, we survey state-of-the-art graph-processing methodologies and provide an overview of popular sparse-matrix storage formats for graph processing. We identify limitations in current storage formats, and present a novel hypersparse-matrix storage format optimized for distributed graph processing on FPGAs.

CHAPTER 3
EFFICIENT STORAGE FORMATS FOR SCALABLE FPGA GRAPH PROCESSING

Large-scale graph processing is a key component in modern scientific computing and data analytics, with many commercial and defense applications [3], [4]. Graph-processing applications, however, do not map well to traditional system architectures and programming platforms. Whereas traditional systems focus on computational throughput and data locality and reuse, graph-processing problems are typically memory-bound and data-driven, with highly irregular datasets [9]. Cache-based architectures are a liability for these applications, adding latency to computation and wasting power and chip resources [10]. These problems are further compounded in distributed systems, where the unstructured nature of graph datasets leads to inefficient data partitioning and load imbalances. The need to analyze increasingly larger graph datasets has driven the exploration of new methods, algorithms, and distributed system architectures for graph processing. One such method moves away from the typical edge- and vertex-centric approaches and describes graph algorithms in terms of linear-algebra primitives operating on graph adjacency matrices [18]. This approach brings with it the benefits of the predictable access patterns of linear-algebra operations, and a higher level of abstraction simplifying the implementation and parallelization of many graph algorithms [18], [19]. In order to maximize the scalability and performance of this approach, however, several key challenges must be addressed, such as how to map irregular graph datasets to distributed systems, and how to efficiently store and access sparse- and hypersparse-matrix datasets. In this chapter, we address the challenges of storing graph adjacency matrices to maximize graph-processing application performance while minimizing storage overhead. Graph adjacency matrices are typically sparse, having a total number of non-zero elements on the order of the dimension of the matrix, and follow a power-law degree distribution, where only a few rows or columns contain the majority of the non-zero elements [20]. When computing on these sparse datasets in a distributed system, they become hypersparse, having less than one non-zero per

row/column on average [18]. Despite this degree of sparsity, large-scale graph datasets still require significant storage space, with several terabytes required even for small problem sizes [21]. In order to maximize the scalability and performance of these graph-processing algorithms, sparse-matrix storage formats, herein referred to as formats, capable of providing scalable, low-overhead storage with low-latency access to data are critical [22]. There is a wide variety of formats optimized for different non-zero distributions, such as diagonal or banded matrices, and for different platform architectures, such as vector processors or GPUs. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R) [23], which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off between storage and lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. In order to overcome these limitations, we propose a novel sparse-matrix storage format called Hashed-Index Sparse Column/Row (HISC/R). HISC/R replaces the dense indexing vector in CSC/R, and the sparse indexing vectors in DCSC/R, with a hashed indexing vector, enabling constant-time accesses to rows or columns of a matrix. Additionally, HISC/R optimizes the storage of hypersparse matrices by allowing non-zero elements to be stored directly in the hashed indexing vector when no additional space is required. For dense matrices, HISC/R uses a novel segmented-storage scheme which enables online non-zero insertions and deletions, eliminating the need for expensive intermediate storage formats. We demonstrate the storage and lookup performance of HISC/R over CSC/R and DCSC/R using randomly generated power-law graphs. We show that HISC/R requires up to 40% fewer memory reads compared to DCSC/R when performing SpGEMM, and uses up to 19% less storage for hypersparse datasets. Finally, we present an FPGA architecture for an HISC/R controller, identifying key architecture components and optimizations to maximize lookup performance.

Table 3-1. Definition of variables for sparse-matrix complexity analysis.
Variable  Description
N         Matrix columns
M         Matrix rows
nze       Non-zero element
nnz       Number of nze
nzc       Non-zero columns
nzr       Non-zero rows
B_ptr     Bytes per pointer
B_idx     Bytes per index
B_val     Bytes per value
α         Hash-table load factor

The remainder of this chapter is organized as follows. Section 3.1 provides an overview of related work on sparse-matrix storage formats and their suitability for scalable graph processing. Section 3.2 presents the details of our new Hashed-Index Sparse Column/Row (HISC/R) format, analyzes its expected storage and lookup performance, and provides an overview of our HISC/R-controller architecture. Section 3.3 presents our experimental results comparing the storage and lookup performance of HISC/R against competing formats using randomly generated power-law graphs. Finally, Section 3.4 summarizes and concludes the chapter.

3.1 Background and Related Research

In this section we present a brief overview of the most common sparse-matrix storage formats in terms of their lookup complexity, storage performance, and amenability to distributed graph processing. Although a multitude of storage formats exists, most are based on the ones presented here. We define an optimal format as one that enables constant-time O(1) lookup complexity for row or column elements while maintaining O(nnz) storage. In practice, however, formats must compromise between maximizing lookup performance and minimizing storage overhead for sparse datasets. Table 3-2 provides an overview of the storage and performance complexity for popular sparse-matrix storage formats. For a reference of variables used in our discussion, see Table 3-1.

Table 3-2. Analysis of popular sparse-matrix storage formats.
Format    Storage (Bytes)                                                             Lookup Complexity
COO       (2B_idx + B_val) nnz                                                        O(nnz) / O(lg nnz)
CSC       (B_val + B_idx) nnz + B_ptr (N + 1)                                         O(1)
CSR       (B_val + B_idx) nnz + B_ptr (M + 1)                                         O(1)
DCSC      (B_val + B_idx) nnz + (2B_ptr + B_idx) nzc + B_ptr                          O(lg nzc)
DCSR      (B_val + B_idx) nnz + (2B_ptr + B_idx) nzr + B_ptr                          O(lg nzr)
ELLPACK   (B_val + B_idx) M max_i{A(i,:)}                                             O(1)
JDS       (B_val + B_idx) nnz + B_ptr (max_i{A(i,:)} + M + 1)                         O(1)
TJDS      (B_val + B_idx) nnz + B_ptr (max_j{A(:,j)} + 1)                             O(1)
MQT       (1/2)(lg N - ⌈k⌉) nnz + ([k ∈ Z] + 1/3) 4^⌊k⌋ - 1/3 + B_ptr nnz  (1)        O(lg N)
(1) Maximum bytes required, where k = log_4 nnz

3.1.1 Coordinate Format (COO)

The coordinate format [61] consists of a list of tuples, each with row, column, and value fields. The list of tuples has a storage complexity of O(nnz), and may be stored unsorted or sorted lexicographically. When unsorted, COO has a lookup complexity of O(nnz) but allows for constant-time element inserts by appending to the end of the list. When sorted, it has a lookup complexity of O(log nnz) but then requires O(nnz) insert complexity to maintain the sorted ordering. The fast insert and low storage complexity make COO an ideal candidate as an intermediate format for the distribution stages of non-zeros in scalable graph-processing systems, but not as the primary storage format for graph datasets.

3.1.2 Compressed Sparse-Column/Row Format (CSC/R)

CSC/R and its variants such as blocked CSC/R are the most commonly used sparse-matrix storage formats for vector processors due to their simplicity and good performance [61]. CSC/R encodes non-zeros in three vectors: the pointer, index, and value vectors. The value and index vectors are sparse vectors which store the corresponding values and indices of non-zeros in column/row-major order for CSC/CSR, respectively. The pointer vector is a dense vector which contains an offset into the index and value vectors for the start of each row/column for CSR/CSC, respectively. The dense pointer vector enables constant-time indexing into the start of rows and columns at the expense of significant storage overhead when

dealing with sparse or hypersparse matrices. As shown in Table 3-2, the dense pointer vector causes the storage requirement to be dependent on the dimensions of the matrix, making it unsuitable as a scalable storage format.

3.1.3 Doubly Compressed Sparse-Column/Row (DCSC/R)

DCSC/R [23] was proposed to overcome the scalability limitations of CSC/R. DCSC/R is similar to CSC/R but replaces the dense pointer vector with a sparse pointer vector, only storing entries for non-zero columns and rows. Since the pointer vector is sparse, another index vector is used to store the row/column index associated with each pointer. By using a sparse pointer vector, we must now search for each row/column, increasing the lookup complexity to O(nzc/r). In order to minimize the search overhead, the format introduces an AUX array which breaks the non-zero rows/columns into blocks and stores a pointer to the first non-zero of each block, giving the storage complexity shown in Table 3-2. Although DCSC/R solves the scalability issues of CSC/R by eliminating the dense pointer vector, the introduction of a sparse vector requires a search on lookup and may significantly increase the lookup latency and limit performance.

3.1.4 ELLPACK Format

ELLPACK [62] was proposed to maximize the performance of sparse-matrix computations on throughput-oriented processors such as GPUs. In ELLPACK, the non-zeros of each row are grouped together and right-padded with zeros to make each row the same size. These non-zero entries are stored in a value vector, with their associated column indices stored in a separate column vector. By forcing each row to be a similar size, memory accesses to non-zero data can be coalesced to maximize performance on GPUs. ELLPACK enables constant access time to rows, but poor storage overhead when rows are not similar size as indicated by the

M maxi{A(i, :)} term in the storage requirement shown in Table 3-2, making it a poor choice in terms of storage overhead for scalable systems.

3.1.5 Jagged Diagonal Format (JDS/TJDS)

The Jagged Diagonal Format (JDS) [63] was developed for iterative methods on vectorized processors. In JDS, each row is first packed similarly to ELLPACK, sorted by length, and then stored column-wise in a value array, with associated column indices stored in an index array, and a permutation vector storing the original row ordering. The permutation vector is proportional to the number of rows in the matrix, leading to high storage overhead and limited scalability. To reduce this overhead, the transposed JDS (TJDS) format was developed to eliminate the need for a permutation vector by re-ordering rows of the input datasets. Although TJDS reduces the storage requirement to O(nnz), the requirement of re-ordering the inputs makes it unsuitable for scalable systems.

3.1.6 Minimal Quadtree Format (MQT)

The MQT format [64] was developed to minimize the storage requirement of sparse matrices. MQT encodes sparse-matrix data using a structure vector and a value vector. The structure vector contains a series of four-bit masks which break up the matrix recursively into quadrants. Each bit in the masks indicates which quadrant has at least one non-zero element. The value vector stores the individual non-zeros ordered by quadrant. The equation in Table 3-2 provides an upper bound for a specific matrix with a given number of non-zero elements. The lookup complexity is O(lg N), the depth of the tree, since we need to iterate over the structure vector for each non-zero element. Although MQT requires relatively little storage, the lookup complexity makes it unsuitable for scalable graph-processing systems.

3.2 Approach

In this section, we present the design and implementation details of our novel sparse-matrix storage format, Hashed-Index Sparse-Column/Row (HISC/R). We provide an analysis of HISC/R's expected storage and lookup performance compared to CSC/R and DCSC/R for varying degrees of matrix sparsity. We also present an HISC/R-storage-controller architecture, describing key architecture features and optimizations identified to maximize performance.

[Figure 3-1 data: for an example 8x8 matrix, CSR uses a dense pointer vector (row lookup O(1), storage O(N)); DCSR uses sparse pointer, index, and AUX vectors (storage O(nnz), lookup O(lg nzr)); HISR uses a hashed-index vector of (key, ptr, size) buckets (storage O(nnz), lookup O(1)).]

Figure 3-1. Comparison of the indexing techniques used by CSC/R, DCSC/R, and HISC/R.

3.2.1 Hashed-Index Sparse-Column/Row (HISC/R)

The problem of storing sparse matrices can be reformulated as the problem of storing the set of non-zero row or column indices from the set of all possible indices in a way that enables constant-time lookups. When the set of non-zero indices is small compared to the set of all possible indices, as in the case of sparse or hypersparse matrices, we see that this problem is analogous to one solved by hashing. HISC/R approaches the problem of sparse-matrix storage by using a hashed-indexing vector rather than the dense vector used in CSC/R, or the sparse vectors used in DCSC/R, as illustrated in Figure 3-1. HISC/R is a general sparse-matrix storage format and can be used for either column- or row-major accesses. When looking up the non-zero values of a column or row, we use a hash function to generate an offset into the hashed-index vector, verify the key, and use the pointer and size entries to iterate over the non-zeros. Since we are dealing with hashing, however, it is important to note that the storage performance depends on the achieved load factor, α, of the hashed-index vector. Similarly, lookup performance depends on the type of hash function used, and the lookup and collision-resolution policies employed by our hash table. To guarantee performance, it is important we choose a hash function that is uniformly random and easy to compute, and a hash table which provides fast lookups at high loads.

3.2.2 Hashed-Indexing Vector

Each bucket in the hashed-indexing vector consists of three entries: the key (column/row index for HISC/R respectively), pointer into the non-zero value/index vectors, and the size of

the current column/row. In order for HISC/R to achieve good performance, it is critical we choose a hash function which distributes the non-zero indices uniformly in the hashed-indexing vector. Therefore we need a function h : M → {0, ..., B − 1}, for a hash table with B buckets, that provides sufficient uniformity regardless of the non-zero distribution.

The class of strongly universal_k hash functions [65], H, guarantees that for some randomly chosen hash function, h ∈ H, and a distinct set of k keys, x_i ∈ M, the values are hashed independently over B buckets. Many implementations of strongly universal_k hash functions have been proposed but, for k > 3, they often involve computationally prohibitive polynomial calculations and prime-modulo arithmetic. For this reason, we focus on simple tabulation hashing [66], a strongly universal_3 family of hash functions that can be tuned for uniformity or space efficiency. Tabulation hashing relies heavily on bitwise manipulation of keys, and fast lookups into small tables of memory, making it well-suited for FPGAs. Using tabulation hashing, we achieve constant-time expected lookup performance for load factors less than 60% when using simple collision-resolution techniques such as linear and quadratic probing, and double hashing. As the load factor increases for these simple hash tables, the number of buckets probed to find a particular key increases significantly, requiring us to turn to more complex hash-table designs. Cuckoo Hashing [67] is a hash-table design which requires at most two probes to find any key, but at the price of maintaining a load factor less than 50%. Although the lookup performance of Cuckoo Hashing is ideal for HISC/R, maintaining a load factor of less than 50% would greatly impact our storage performance. Hopscotch Hashing [68] is a hash-table design which combines techniques from Cuckoo Hashing, linear probing, and chaining to provide a compromise between lookup performance and load factor. Our experiments show that Hopscotch Hashing outperforms the other explored hash-table types by achieving a load factor of up to 83%, with an average of 1.4 probes per lookup. Hopscotch Hashing, however, requires a significantly more complex insertion process than other hashing methods. The results of our hash table comparison are summarized in Figure 3-2.
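For readers unfamiliar with simple tabulation hashing, a minimal software sketch follows; the 32-bit key width, four 256-entry tables, and power-of-two bucket count are assumptions chosen for illustration, not the parameters of the actual HISC/R controller.

#include <array>
#include <cstdint>
#include <random>

// Minimal simple-tabulation hash for 32-bit row/column indices, reduced to a
// table of 2^bucket_bits buckets (bucket_bits assumed < 32).
class TabulationHash {
public:
    explicit TabulationHash(uint32_t bucket_bits, uint64_t seed = 1)
        : mask_((1u << bucket_bits) - 1) {
        std::mt19937_64 rng(seed);
        // One 256-entry table of random words per key byte.
        for (auto& table : tables_)
            for (auto& entry : table)
                entry = static_cast<uint32_t>(rng());
    }

    // Split the key into bytes, look each byte up in its table, and XOR the
    // results; only shifts, small-table reads, and XORs are required, which is
    // why the scheme maps naturally onto FPGA block RAM and LUT logic.
    uint32_t operator()(uint32_t key) const {
        uint32_t h = 0;
        for (int i = 0; i < 4; ++i)
            h ^= tables_[i][(key >> (8 * i)) & 0xFF];
        return h & mask_;   // reduce to the bucket range
    }

private:
    std::array<std::array<uint32_t, 256>, 4> tables_;
    uint32_t mask_;
};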

[Plot: average hash-table probes per row/column lookup vs. load factor (0.5 to 0.9) for linear probing, quadratic probing, double hashing, and hopscotch hashing.]

Figure 3-2. Comparison of average hash table probes required for row/column lookups (1σ) vs. load factor for different hash table types.

3.2.3 HISC/R Nonzero Storage

Although HISC/R could use a single index/value array similar to CSC/R and DCSC/R as shown in Figure 3-1, we instead use a novel segmented-storage vector which enables online insertions and deletions. The segmented storage vector breaks rows/columns with more than one non-zero into variable-sized sublists with initial size L0. Each sublist contains either null

or a pointer to the next sublist of size k^d · L0, where k is a customizable multiplier and d is the current sublist depth. Unused elements of each sublist are initialized to zero to indicate that they can be inserted into, as shown in Figure 3-3. The parameters L0 and k provide a method to tune the segmented storage vector either for storage performance or to minimize the number of memory accesses. The optimal values for these parameters depend on the properties of the matrix being stored.

[Figure 3-3 data: rows with more than one non-zero point to segmented sublists of sizes L0, kL0, ...; an extended bit in each hashed-index bucket indicates whether the bucket holds a pointer to a segmented vector or a single non-zero stored in place.]

Figure 3-3. Overview of HISC/R with segmented storage vectors using initial segment size L0 and growth factor k.

In cases where there is only one non-zero in a row or column, as is the common case for hypersparse matrices, we store the non-zero value directly in the hash table. The extended bits, as shown in Figure 3-3, are used to indicate whether we are storing a non-zero value or the start of a segmented vector in the hash table. If we are storing a non-zero directly, we store the major index in the key, the minor index in the pointer location, and the value in the size position.

3.2.4 Non-zero Lookups and Insertions

To access a column/row in HISC/R, we first look up the entry, if any, for the column/row in the hashed-indexing vector, as described in Figure 3-4. We check the extended flag of the entry to determine if the bucket contains a non-zero or the start of a segmented vector. If the extended flag is false, we can return the non-zero element found. If the flag is true, we use the address stored in the hash table to iterate over the segmented vector. When a zero is encountered while reading the segmented vector, we have hit unused storage and can stop iterating. Inserting non-zero elements first requires looking up any existing bucket entries in the hashed-indexing vector. If no entry exists, we can insert the current non-zero value directly into the bucket found and return, as shown in Figure 3-5. If the entry exists and it already contains a single non-zero value, we allocate a new segment of size L0, copying both the stored non-zero element and the element we are inserting into the newly allocated segment. We then update the extended bit, size, and pointer fields of the hash table. If the entry already contains

1: procedure HISC Lookup(column)
2:   L0, k ← 2
3:   tuples ← ∅
4:   entry ← HashTableLookup(column)
5:   if entry ≠ null then
6:     if entry.extended = false then
7:       tuples ← {(entry.row, entry.key, entry.value)}
8:     else
9:       depth ← 0
10:      cur_ptr ← entry.ptr
11:      while cur_ptr ≠ null do
12:        s ← SegmentLookup(cur_ptr)
13:        sz ← SegmentSize(L0, k, depth)
14:        for i ← 1...sz do
15:          if s[i].value = 0 then
16:            return tuples
17:          end if
18:          tuples ← tuples ∪ {(s[i].row, column, s[i].value)}
19:        end for
20:        depth ← depth + 1
21:        cur_ptr ← s.next
22:      end while
23:    end if
24:  end if
25:  return tuples
26: end procedure

Figure 3-4. Pseudocode for HISC column lookups.

the start of a segmented vector, we first index into the last sublist by reading only the pointer entries. We then search for the first zero-valued entry and insert into it. If no such entries exist, we allocate a new storage segment, insert the non-zero, and update the pointer field of the previous segment. If faster inserts are required, we can keep a next-free pointer in the pointer position of the last sublist.

3.2.5 Storage Analysis

The HISC/R storage requirement when using a single value/index vector is similar to DCSC/R but replaces nzc/r with the number of buckets, B = α^{-1} nzc/r, as shown in Equation 3–1. The following analysis assumes without loss of generality that B_ptr = B_val = B_idx.

B_{HISC} = (B_{val} + B_{idx})\,nnz + (2B_{idx} + B_{ptr})\,\alpha^{-1}\,nzc    (3–1)

1: procedure HISC Insert(row, column, value)
2:   L0, k ← 2
3:   depth ← 0
4:   entry ← HashTableLookup(column)
5:   if entry = null then
6:     HashTableInsert(column, row, value)
7:   else if entry.extended = false then
8:     s ← SegmentAllocate(L0, k, depth)
9:     SegmentInsert(s, entry.row, entry.value)
10:    SegmentInsert(s, row, value)
11:    HashTableInsert(column, s.ptr, 2)
12:  else
13:    s ← SegmentLookup(entry.ptr)
14:    while s.next ≠ null do
15:      s ← SegmentLookup(s.next)
16:      depth ← depth + 1
17:    end while
18:    if SegmentInsert(s, row, value) = false then
19:      s.next ← SegmentAllocate(L0, k, depth)
20:      s ← SegmentLookup(s.next)
21:      SegmentInsert(s, row, value)
22:      entry.size ← entry.size + 1
23:    end if
24:  end if
25:  return
26: end procedure

Figure 3-5. Pseudocode for HISC non-zero insertions.
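Both procedures operate on hashed-index buckets holding a key, pointer, size, and extended flag, as described in Sections 3.2.2 and 3.2.3. A minimal C++ rendering of one bucket is sketched below; the field types and widths are illustrative assumptions, not the controller's exact bit-level layout.

#include <cstdint>

// Illustrative layout of one hashed-index bucket; widths are assumptions.
struct HiscBucket {
    uint32_t key;       // major (column/row) index used as the hash key
    uint32_t ptr;       // address of the first storage segment, or the minor index when !extended
    uint32_t size;      // non-zero count of the row/column, or the stored value when !extended
    bool     extended;  // true when ptr references a segmented storage vector
};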

Equation 3–2 compares HISC/R without segmented storage and CSC/R. When the matrix is hypersparse, M ≫ nnz and the storage ratio approaches zero. For denser matrices, nnz ≫ nzc and nzc = M, causing the storage ratio to approach unity.

\frac{B_{HISC}}{B_{CSC}} = \frac{2 + 3\,\frac{nzc}{nnz}\,\alpha^{-1}}{2 + \frac{M+1}{nnz}}    (3–2)

Equation 3–3 compares HISC/R without segmented storage and DCSC/R. When the matrix is hypersparse, nzc = nnz and it can be shown that the storage ratio converges to 0.4 + 0.6α^{-1}. For denser matrices, nnz ≫ nzc and the storage ratio approaches unity.

( ) B 2 + 3 nzc α−1 HISC = nnz( ) (3–3) B nzc DCSC 2 + 3 nnz Equation 3–4 gives the storage ratio compared with HISC/R when using segments and storing non-zeros directly into the hashed-indexing vector. For hypersparse matrices, nzc = nnz

63 and the storage ratio approaches 0.6α−1 asymptotically, giving a theoretical maximum storage improvement of 40% over DCSC/R. When compared with CSC/R, the storage ratio approaches zero since M ≫ nzc.

( ) B 3 nzc α−1 HISC(Segmented) = nnz( ) (3–4) B nzc DCSC 2 + 3 nnz The dense case for HISC/R with segmented storage is difficult to calculate since the storage depends on the distribution of nonzeros in each row/column, and the parameters L0 and k. The space needed for each column/row can be calculated as the sum of a geometric series with base k, plus a depth term representing the number of pointers as shown in Equation 3–5.

− M∑1 1 − kD(|A(:,i)|) (B + B ) + B D(|A(:, i)|) (3–5) val idx 1 − k ptr i=0 The bounds for the geometric series depends on the number of non-zeros in each row or column and is equivalent to the maximum depth of the segmented vector as calculated in Equation 3–6.

⌈ ( )⌉ − −1 D(n) = logk 1 + (k 1)L0 n (3–6)

For dense matrices, the second term in the summation of Equation 3–5 can be ignored,

−1 and the segmented storage can then be modelled in terms of an efficiency parameter κeff as −1 shown in Equation 3–7. The parameter κeff represents how efficiently a matrix is stored for a given L0 and k for a particular matrix. In our experiments κeff ranged between 0.8 and 1.0.

−1 −1 (Bval + Bidx)κeff nnz + (2Bidx + Bptr)α nzc (3–7)

3.3 Results and Analysis

In this section we compare the storage and lookup performance of HISC/R with CSC/R and DCSC/R. All of the datasets used in our experiments were randomly generated using a

64 Kronecker-graph generator that we developed based on the work of Leskovec et al. [20]. By randomly generating our graphs, we were able to control properties of our datasets, such as average row/column density, to better explore HISC/R’s performance characteristics. We use

HISC/R with hopscotch hashing and segmented storage with the parameter values L0 = k = 2. Unsegmented HISC/R uses a single value and index array similar to CSC/R and DCSC/R, and does not store non-zero elements in the hashed pointer vector. The figures are divided into hypersparse and sparse regions in order to illustrate the storage format behavior for different levels of sparsity. 3.3.1 Storage comparison

To analyze the storage performance of HISC/R we used our Kronecker-graph generator to generate random scale-30 graphs (adjacency matrix of size 230 by 230) with edge factors from 10−3 to 103. We generated several random graphs for each edge factor using the gnutella-25 initiator matrix [20] and stored them using CSC/R, DCSC/R, HISC/R, and HISC/R with segmented storage. We then measured the average storage requirement for each in terms of average bytes per non-zero, and calculated their storage ratios. Figure 3-6 compares the storage performance of HISC/R with CSC/R. As expected from the result of Equation 3–2, the storage ratio approaches zero asymptotically with increasing sparsity, indicating that HISC/R and outperforms CSC/R for hypersparse datasets. As the matrix density approaches an average of one non-zero per row/column, the overhead from the dense pointer vector of CSC/R decreases, causing the storage ratio to increase. The storage ratio of HISC/R peaks around 1.4 at an average of 10 non-zero elements per row/column and then asymptotically approaches 1.25 as the matrix becomes denser. This overhead is due to the parameters of the segmented storage vector, L0 and k, not being optimized for the dataset being stored. The peak overhead of approximately 40% in the sparse region comes with the added benefit of runtime nonzero insertions and deletions, and can be disabled if unneeded. For HISC/R without segmented storage, the storage ratio peaks around 1.2 at an average of

65 1.5 α = 0.714286

● 1.0 ● ● ● ● ●

Storage Ratio Storage 0.5 ●

● 0.0 ● Hypersparse Region Sparse Region

10−3 10−2 10−1 100 101 102 103 Average Row/Column Density

HISC/R ● HISC/R (Unsegmented)

Figure 3-6. Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by CSC/R for randomly generated scale-30 Kronecker matrices.

10 non-zeros per row/column and then asymptotically approaches zero as the matrix becomes denser. Figure 3-6 compares the storage performance of HISC/R with DCSC/R. In the hypersparse region, we see the HISC/R without segments approaching an asymptote around 1.25 as predicted by Equation 3–3 due to the load factor of the hash table. As the number of non- zeros increases, this overhead is amortized and the storage ratio approaches unity. For HISC/R with segmented storage, the storage ratio a approaches 0.84 as predicted by Equation 3–4 in the hypersparse region. HISC/R achieves a storage ratio of 0.85 in the hypersparse region by storing non-zero elements directly in the hashed pointer vector. Unsegmented HISC/R approaches a storage ratio of 1.25 in the hypersparse region due to the unused buckets in the hashed pointer vector. As the matrix becomes denser, the number of

66 1.5 α = 0.714286

− 0.4 + 0.6 α 1 ● ● ●

● 1.0 ● ● ● ● ● ●

− 0.6 α 1

Storage Ratio Storage 0.5

0.0 Hypersparse Region Sparse Region

10−3 10−2 10−1 100 101 102 103 Average Row/Column Density

HISC/R ● HISC/R (Unsegmented)

Figure 3-7. Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by DCSC/R for randomly generated scale-30 Kronecker matrices. rows and columns with more than one non-zero element increases, increasing the number of storage segments. Due to the unused elements in the segmented-storage vectors, the storage ratio peaks around 1.35 in the sparse region at 10 non-zero elements per row/column, and then approaches 1.25 asymptotically as the matrix becomes denser. By disabling segmented storage, this overhead would be eliminated and the ratio would equal the non-segmented HISC/R curve in the sparse region. Optimizing the parameters L0 and k for the dataset being stored would minimize this overhead. 3.3.2 Performance Comparison

To analyze the lookup performance of HISC/R, we again used our Kronecker-graph generator to generate random scale-30 graphs (adjacency matrix of size 230 by 230) with edge factors from 10−3 to 102. We evaluate the lookup performance of HISC/R using sparse

67 40%

20%

● ● ● ●

0% ● ●

Total Memory Read Accesses (Percent Improvement) Memory Read Accesses (Percent Total ● Hypersparse Region Sparse Region −20%

10−3 10−2 10−1 100 101 102 Average Row/Column Density

DCSC/R CSC/R

Figure 3-8. Comparison of total reads required to perform sparse matrix/matrix multiplication using HISC/R compared with CSC/R and DCSC/R.

generalized matrix-matrix multiplication (SpGEMM). SpGEMM is a key kernel used for many graph-processing applications including all-nodes shortest paths, and betweenness centrality. We measure the total memory read operations required for SpGEMM when using CSC/R, DCSC/R, and HISC/R, for varying degrees of sparsity. We measure only the total memory reads needed to perform SpGEMM, and do not assume a particular hardware architecture. We compare the percent improvement of HISC/R over CSC/R and DCSC/R in terms of the reduction in total memory accesses required to perform the computation. Figure 3-8 presents the percent improvement in total number of memory reads required by HISC/R compared to CSC/R and DCSC/R. HISC/R provides an improvement of up to 40% and 14%, compared to DCSC/R and CSC/R for hypersparse datasets, respectively. The improvement of HISC/R in the hypersparse region is a result of storing non-zero elements

68 directly in the hashed pointer vector, reducing the number of indirect memory accesses compared to CSC/R and DCSC/R. CSC/R and DCSC/R require significantly more indirect memory accesses when dealing with hypersparse matrices than HISC/R. In the sparse region, HISC/R requires up to 16% more memory accesses than CSC/R due to having to decode additional pointers for the segmented-storage vectors. As the matrices become denser, this overhead decreases as the additional segment lookups are amortized by the increasing segment size. HISC/R outperforms DCSC/R in the sparse region, with the benefit asymptotically approaching zero as matrix density increases. This asymptotic behavior is because as the matrix becomes denser the number of memory accesses needed by DCSC/R decreases as the AUX array size increases. Additionally, as the matrix becomes denser, the total number of memory accesses is dominated by the number of non-zero elements being read. 3.4 Summary and Conclusions

In this chapter, we presented Hashed-Index Sparse-Column/Row (HISC/R), a novel sparse- matrix storage format optimized for graph-processing applications. HISC/R provides O(1) lookup complexity and O(nnz) storage complexity while also enabling runtime insert and delete operations, enabling matrices to be constructed directly without using expensive intermediate storage formats. We show that HISC/R requires significantly less storage than CSC/R and up to 19% less than DCSC/R for hypersparse datasets when maintaining an average hash table load factor of 71%. Finally, we show HISC/R provides a 14% and 40% improvement in terms of memory reads compared to CSC/R and DCSC/R respectively when performing matrix multiplication with hypersparse datasets. The reduction in the total number of memory accesses and favorable storage performance for hypersparse datasets makes HISC/R uniquely suited for scalable graph processing. In the following chapter, we develop a scalable graph-processor architecture on FPGAs leveraging the RC Middleware and HISC/R storage format. Using the linear-algebra approach to graph processing, our architecture is designed to perform sparse-matrix operations over user-definable semirings. We present the design of our architecture and analyze it using kernels

69 and applications key to graph-processing algorithms. Furthermore, we analyze the scalability of our architecture for different network architectures and degrees of parallelization.

70 CHAPTER 4 EXTENSIBLE FPGA ARCHITECTURE FOR SCALABLE GRAPH PROCESSING Graphs are arguably one of the most powerful data structures in modern computing, capable of modeling relations, either abstract or concrete, between entities. This flexibility has positioned graphs as a central data structure in data analytics and scientific research, and has opened the door to new methods of data analysis and applications through cutting-edge graph-processing algorithms. Graph processing is becoming more important in all fields of research, with key roles in commercial [69], [70] and defense applications [3], [4]. The growing popularity of graphs has increased the demand for systems capable of processing increasingly larger graph datasets while reducing latency. Conventional processors and system architectures, however, are inefficient at executing graph-processing applications due to the sparse nature of graph datasets, and data-driven nature of graph algorithms [9]. These conventional architectures are designed to exploit the locality of data by providing multiple levels of caching, from the processor to network architecture. Graph-processing applications, however, typically have highly random-access memory patterns with little data reuse. Cache-based architectures for these applications are a liability; adding latency to computation, and wasting power and chip resources [10]. These applications are memory bounded, and often require minimal computational throughput, spending most of the execution time on memory accesses and data manipulation. Recent advances such as the linear-algebra formulation of graph algorithms has opened the door to new opportunities to increase graph-processing performance [18]. Traditional approaches to graph-processing rely on the vertex- or edge-centric formulation of algorithms. These formulations are often difficult to parallelize, having to determine for each graph- processing algorithm how to partition the vertex and edge sets, and how to access required data efficiently [18]. Furthermore, the vertex- and edge-centric approaches share minimal code re-use between different applications on different architectures, making porting these algorithms on new systems difficult and time consuming. By using the linear-algebra formulation of graph

71 algorithms, we can represent a variety of graph-processing algorithms using a small set of parallelized linear-algebra primitives. Any system which implements these primitives efficiently for the types of datasets being computed can then perform these graph algorithms [19]. In this chapter, we explore the design of a scalable graph-processor architecture which uses the linear-algebra formulation of graph algorithms. We develop our graph-processor architecture leveraging the customizability of Field-Programmable Gate Arrays (FPGAs). One major challenge that we must address in developing our architecture is how to handle various matrix operations, including matrix-matrix multiplication, over an application-specific semiring [24], [25]. In order to support a wide variety of graph algorithms using linear-algebra primitives, we present an extensible graph-processor framework which allows the ALU to be exchanged with an application-specific ALU and semiring. We demonstrate that our architecture achieves more than a 20×/40× speedup for sparse/hypersparse generalized matrix-matrix multiplication (GEMM) compared with optimized CPU baselines running on a Xeon E5620, while requiring only 12% of the power. Additionally, we explore the performance of breadth-first search (BFS) leveraging our SpGEMM kernel and compare with state-of-the-art BFS FPGA architectures running on the Convey HC-1/HC-2. After adjusting for aggregate memory bandwidth, we find that our architecture performs better on average than the compared BFS approaches. Finally, we analyze the scalability of our architecture running SpGEMM on the Novo-G# multi-FPGA system for both 2D- and 3D-torus networks, leveraging the discrete-event simulation model presented in [71]. Using workload profiling data collected for RMAT degree-26 SpGEMM, and the Novo-G# network simulation, we predict a speedup of up to 500× for a 6×6 2D-torus, and 980× for a 4×4×4 3D-torus, giving a parallel efficiency of approximately 0.64 and 0.70, respectively. The remainder of this chapter is organized as follows. Section 4.1 presents background and related works in sparse-matrix and graph-processing architectures, and overviews the linear-algebra formulation of breadth-first search. Section 4.2 provides the design of our graph- processor architecture, providing design details and analysis for each architecture component.

72 Section 4.3-4.6 presents our experimental setup, case studies, and scalability analysis for our architecture. Section 4.7 provides a summary of our work and concludes this chapter. 4.1 Background and Related Research

This section presents background and related works on accelerating sparse-matrix com- putation on FPGAs. Although the focus of these sparse-matrix accelerators are typically on the design of efficient floating-point datapaths for sparse-matrix vector computations (SpMV) rather than graph processing, the lessons learned for handling the storage and efficient access of matrices from memory are relevant to our work. Additionally, we present related research on defining standards for graph processing using linear-algebra operations, which is impor- tant to the design of our graph-processor architecture. Finally, we present an overview of the breadth-first search (BFS) kernel and its linear-algebra formulation. 4.1.1 Accelerating Sparse-Matrix Operations on FPGAs

Fowers et al. presents the design of a bandwidth-optimized SpMV implementation [72], which leverages a new storage format called Condensed-Interleaved Sparse Representation (CISR) format. CISR resolves the starvation issue of their SpMV pipelines, which results from the sparse nature of the matrix, by interleaving multiple non-zeros from each row into a single memory word. Each memory word is broken into multiple equal-sized chunks which stores a non-zero value from different rows. Each part of the memory word is assigned a channel which performs the dot product for a particular row. In this fashion, as long as the rows have an approximately equal number of non-zero elements, their memory bandwidth is maximized. Additionally, they present a banked-vector buffer (BVB) which stores a single copy of their dense vector using on-chip block ram (BRAM). Storing one copy of the vector using the BVB minimizes the vector-storage memory requirement and allows storing larger vectors. In [73], the authors focus on developing a universal SpMV architecture which can be used with either dense or sparse matrix datasets. The authors present a new storage format, called compressed bitvector (CBV), which encodes the non-zero positions of a matrix in a dense bit vector. The authors also add a vector cache to their architecture exploit data locality in the

73 vector for the dense case. Their universal SpMV architecture was designed to support several sparse and dense storage formats, citing the differences in formats based on dataset properties. The efficient use of external-memory bandwidth is key factor to maximizing the perfor- mance of sparse algorithms. Although CISR [72] provides a novel solution to maximize memory bandwidth in cases where the average number of non-zeros per row is approximately the same, for graph processing this is often not the case. The sparse-adjacency matrices of graphs follow a power-law degree distribution, meaning that the majority of rows have relatively few non-zero elements, and some rows have disproportional large number of non-zeros. For CISR, this means many of their channel slots will unused increasing their storage overhead and reducing their effective memory performance. For storage formats like CBV, storing the matrix in a dense form is good for fast lookups, but it is not scalable due to the O(N) storage requirement. A key insight from [73] is that the authors include support for multiple storage formats, since there is no one storage format to rule them all. Since FPGAs are capable of provide customized, bit-level manipulation of data as in the case of CISR, they are able to provide significantly improved power efficiency without sacrificing performance. In [74], the authors explore the performance and power tradeoffs of sparse- matrix multiplication (SpMM) running on FPGAs relative to CPU and GPU architectures. In particular, the authors look at the performance and power trade-offs for modern nVidia GPUs, Intel XeonPhis, and the Nallatech PCIe-385n FPGA accelerator when using an OpenCL SpMM kernel on each. The authors find that while FPGAs are the most power efficient, they typically have the lowest absolute performance when compared to the GPU and XeonPhi, as a result of the relatively low aggregate memory bandwidth available to the FPGA. The authors also note that their is no perfect platform, citing that the performance was dependent on the properties of the matrix datasets. On the graph processing side of things, two noteworthy architecture for accelerating graph-processing algorithms using custom FPGA architectures can be found in [16], [75]. In these papers, the authors accelerate the traditional vertex- and edge-centric formulations of

74 graph algorithms using novel architectures. In [16], the authors present an FPGA architecture called CyGraph for a parallel breadth-first search (BFS) using a vertex-centric formulation. Their architecture breaks the BFS algorithm into a kernel consisting of four processes which run simultaneously: a current-node queue process, a neighbor-fetch process, a neighbor-lookup process, and a next-level process. Multiple of these kernels are then combined together and share a high-bandwidth external memory interface (80 GB/s) on the Convey HC-1/HC- 2. By having each of the four processes running simultaneously in multiple kernels, the design maintains a large number of in-flight memory requests increasing memory throughput. Although the memory architecture is optimized for handling a large number of in-flight requests, the unpredictable memory-access pattern limits effective memory bandwidth. In [75], the authors present a many-core, soft-processor architecture with operations optimized for graph processing. The authors identify that graph-processing applications typically require a small subset of computational operations, and therefore they design a soft- core processor which includes a reduced instruction set of the operations they need. They combine multiple soft-core processors on a single FPGA in a 2D mesh SoC and store a local copy of the graph for each processor. By limiting their architecture to a single FPGA, and requiring that each FPGA store a local copy of the dataset, the achievable performance and scalability of this architecture is limited. In our approach, we fit a single processor per FPGA, and allocate all external memory resources for that processor. Each processor is assigned a subset of the graph edge-list in order to maximize the problem sizes that can be computed. In [76], Song et al. present a novel architecture for performing graph algorithms using linear-algebra primitives. A key component to their architecture is the merge sorter logic which is responsible for re-ordering results for index matching with various operations. Our graph-processor architecture is similar, but provides an extensible framework for modifying our datapath to support various graph algorithms. Furthermore, we provide support for three different storage controllers to maximize performance based on the dataset properties.

75 4.1.2 Standards for Graph Processing using Linear Algebra

In [24], the authors present a set of standard linear-algebra primitives for graph processing. They cite the Sparse Basic Linear Algebra Subprograms (BLAS) as a key set of operations for graph processing, but extend the sparse matrix-matrix multiplication (SpMM) operation over an arbitrary semiring. In the outer-product formulation of SpMM (C = AB) there are two key operations: first, the elements of each column of A are multiplied with the elements of the corresponding rows of B to form a set of partial-product matrices. These partial product matrices are then accumulated using the addition operation. Using the syntax presented in

[24] standard SpMM can be represented as C = Aop0.op1B where op0 = + and op1 = ∗. This formulation is known as sparse generalized matrix-matrix multiplication (SpGEMM).

The authors cite the following pairs of operations (op0, op1) as examples for graph-processing applications: (max, +), (min, max), (∨, ∧), and (f(), g()). These operations can be used to perform a wide variety of graph algorithms [18]. An implementation of this approach was created as part of Combinatorial BLAS [25] which defines several key operations: matrix/matrix multiplication, matrix/vector multiplication, element-wise operation, reduction, extraction, subset assignment, subset extraction, construct, and enumeration. In developing our graph- processor architecture, we follow the GraphBLAS standards presented in [24], [25]. We develop our architecture to support the GraphBLAS standard operations, and enable semiring customization for different applications by exposing an extensible interface to modify the ALU. 4.1.3 Linear-Algebra Formulation of Breadth-First Search

Breadth-first search (BFS) is key kernel for many graph-processing applications [18]. Using BFS, we generate a BFS spanning tree which consists of a vector of parent nodes for each node in the graph. In the typical vertex-centric graph-processing methodology, BFS is calculated by iteratively expanding a frontier set of nodes until all vertices have been visited. For each node in the current frontier set, the neighbors of the node are expanded an appended to the frontier set. The first time a node is expanded, we set its parent node to the node that

76 1: procedure BFS(G, Start) 2: frontier ← Start 3: for n ∈ G do 4: n.parent ← null 5: end for 6: while frontier.size() > 0 do 7: v ← frontier.pop front() 8: for n ∈ v.neighbors do 9: if n.parent = null then 10: n.parent ← v 11: frontier.push back(n) 12: end if 13: end for 14: end while 15: end procedure

Figure 4-1. Pseudocode for vertex-centric breadth-first search.

A A B   0 1 1 0 0 B C   0 0 0 0 1 0 0 0 1 0 D A =   1 0 0 0 0 0 0 1 0 0 E

Figure 4-2. Graph adjacency-matrix representation. A) Graph structure. B) Adjacency matrix. put it onto the frontier edge set. To begin the BFS algorithm, we put the starting node into the frontier set and expand each neighbor in the following iteration. When performing BFS using linear algebra, however, we can calculate the same BFS tree using matrix-vector multiplication and vector addition over specialized semirings. Figure 4-1 presents an example graph and its associated adjacency matrix. We define a frontier vector, x, where non-zero positions denote the current vertices in our breadth first search. For example, if we wanted to perform a BFS from node A in the graph in Figure 4-2A we would define the starting frontier as x(0) = [1, 0, 0, 0, 0]T. We can then calculate the kth frontier using Equation 4–1. In order to calculate the BFS spanning tree, we find the parents for each BFS iteration and combine it with the previous iteration’s parent vector. To calculate the parents of the nodes expanded in the kth BFS iteration, we multiply AT by x(k−1) using a special semiring. We define the outer-product multiplication operator as the matrix argument multiplied with

77 the row-index of the non-zero in the vector. The value of the element in x(k−1) does not matter since we only care if the element is non-zero. We define the addition operator of the outer-product multiplication as the min function to select the minimum node label as parent, in case multiple nodes expand the same child. After performing the operation shown in 4–1 using this semiring for matrix/matrix multiplication, x(k) contains the parent node labels of

each node in the current frontier. We also define the vector-addition operator, op⊕(a, b) as a conditional select operator which returns b(i) if a(i) is zero, and a(i) otherwise. This allows us to add only newly discovered nodes to the BFS spanning tree vector. Using these semirings, we can calculate the BFS spanning tree, P(k), for the kth iteration as shown in Equation 4–2. This definition of single-frontier BFS can be extended to multiple frontiers by using the same operator definitions mentioned above, but with matrix-matrix operations.

x(k) = ATx(k−1) (4–1)

(k) (k−1) (k) P = op⊕(P , x ) (4–2)

In the following sections we present a detailed analysis of our graph-processor architecture. Our architecture was designed to be extensible, allowing support for new graph applications by replacing our ALU with an ALU which supports a different semiring. In our approach, we build on the concepts presented in [76], integrating our optimized hypersparse-matrix storage format, HISC/R, presented in Chapter 3. We leverage the RC Middleware presented in Chapter 2 to enable scalability and portability to future FPGA systems. 4.2 Extensible Graph-Processor Architecture

In this section we present a detailed overview of our scalable graph-processor architecture. Figure 4-3 presents a high-level diagram of the graph-processor architecture. The graph processor consists of a matrix datapath, which handles accelerated hardware operations on matrix datasets, and a controller datapath, which provides general computing instructions and datapath controls. By providing both datapaths we enable accelerated sparse-matrix

78 rp-rcso rhtcue h arxotrpoutoeaingnrtsintermediate generates operation outer-product matrix The architecture. graph-processor Architecture interface. Merge-Sorter loopback a as 4.2.1 acts and interface stub a provides architecture (Section controller storage (Section merge-sorter systolic-array Middleware, RC the on [ information see more For please processor. ports sparse-matrix access the DMA of arbitrated of components round-robin memories to provides external Middleware the RC to The space platform. address targeted unified the a providing memory, consisted application processor single sparse-matrix a the of of supported definition any application to The portability platform. enabling Middleware resources, RC platform underlying to access with us provide DMA and bits configuration setting by scalar datapath parameters. compute sparse-matrix directly, the datasets direct stored or on values, operate sparse-matrix address to of or used out break be and can directly controller memory The access operations. to way a provide also but clarity). operations, for omitted signals (some architecture graph-processor of Overview 4-3. Figure nti eto epeettedsg n nlsso h ieie eg-otro our of merge-sorter pipelined the of analysis and design the present we section this In to Middleware RC the leveraged we architecture processor sparse-matrix our developing In 77 .Tekycmoet ftesas-arxpoesracietr r:the are: architecture processor sparse-matrix the of components key The ]. HOST DMA PHY RC Middleware 4.2.3 Tuple Bus .Tentokitraecmoetfrtesingle-processor the for component interface network The ). 4.2.1 Network LoopbackNetwork Interface Storage Cntl(s) Storage Sorter DCSC/R HISC/R Tuple CSC/R ALU ,teaihei oi nt(Section unit logic arithmetic the ), Memory Bus 79 Registers Matrix Controller Registers General ALU

Controller Bus 4.2.2 ,adthe and ), products, known as partial products, with duplicate row and column indices. In order to combine these duplicates efficiently, we sort the generated partial products by their indices and then perform our accumulation operation as defined by our semiring. By including a sorter component with a DMA interface to memory, we can provide custom sorting instructions in our processor to sort tuples efficiently. The approach for our sorter is based on the design presented in [78], however, we generalize and greatly improve the design for our purposes. The pipelined, merge-sorter architecture is based on combining two-way merge-sorter

processing elements (PEs) in a systolic array. Each PE contains two registers: the Rs register,

which is the high-priority element register, and the Rb register, which is lower priority, as shown in Figure 4-4. The PE sorts sets of two elements by ensuring that the higher priority element

(the one that should come out of the PE first) is always placed in the Rs register, while the

lower priority element is always placed in the Rb register. Each PE has two inputs: the next

high-priority element Rsn+1 , which should be lower priority than the current Rsn but higher

priority than Rbn , and Rbn-1 which should have lower priority than Rbn . Each clock cycle the elements in the PE will be shifted towards the output, choosing their next-states in order to maintain the aforementioned priorities. Since there are four possible signals that can influence ( ) 4 the next PE state, and the relative priority of each signal must be calculated, there are 2 = 6 comparisons that must be performed. By combining the results of these comparisons we can choose the next register states and the PE output for each clock cycle. We generalize the next-state rules defined in [78], and present them in terms of a generic priority function as shown in Table 4-1. By generalizing the next-state rules in terms of an arbitrary priority function, we have now developed a model for sorting elements in a variety of ways, not only lexicographically. For example, by specifying a priority function based on the absolute difference between the row and column values, we can order elements of our matrix based on their distance to the center diagonal. In order to create a functioning sorter from our PE, we need to define additional elements, such as one with an infinitely-high priority, which provides a useful initial PE state, and

80 Rs Reg. value {valid,∞,-∞} set_id R Flags R sn+1 Value sn

MUX Flags

Rs_Sel Output MUX

Rb Reg. Out_Sel R Value R bn-1 bn MUX Flags

R _Sel Set b Set Reg. Set n-1 Value n

Setn Pr(R ,R ) Pr(Rb,Rs) s sn+1 R _Sel Rsn s Pr(Rbn-1,Rs) Pr(Rbn-1,Rb) Rbn Rb_Sel R Pr(R ,R ) Pr(R ,R ) sn+1 b sn+1 bn-1 sn+1 Out_Sel Rbn-1 Combinatorial Logic

Figure 4-4. Architecture diagram of merge-sorter PE.

Table 4-1. Merge-sorter PE next-state logic.

Condition Next Rs Next Rb Output ∧ ∧ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rbn ) Pr(Rbn ,Rsn+1 ) Rsn Rbn Rbn−1 ¬ ∧ ∧ Pr(Rbn−1 ,Rsn ) Pr(Rbn−1 ,Rbn ) Pr(Rbn ,Rsn+1 )Rbn−1 Rbn Rsn ¬ ∧ Pr(Rbn−1 ,Rbn ) Pr(Rbn ,Rsn+1 ) Rbn Rbn−1 Rsn ∧ ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rsn Rsn+1 Rbn−1 ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rsn+1 ) Rsn+1 Rsn Rbn−1 ¬ ∧ ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rbn−1 ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rbn−1 Rsn+1 Rsn ∧ ¬ ∧ ¬ Pr(Rsn ,Rsn+1 ) Pr(Rbn−1 ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rsn+1 Rbn−1 Rsn

one with an infinitely-low priority. By setting Rnn+1 to the infinitely-low priority element, we effectively make sure that no element is ever shifted in from that signal. In order to represent these values, we introduce an additional flag register which follows each value in the PE to indicate if it is a negative- or positive-infinity priority element, or a valid data element for comparison as shown in Figure 4-4. To create a 2K-way merge-sorter pipeline from the two-way PEs, we combine K PEs in

th a systolic array. The Rb output of the n PE is provided as an input to the Rbn-1 port of the

th th th n+1 PE. The Rs output of the n+1 PE is provided as an input to the Rsn+1 port of the n PE. These signals pass both the current values and flags of each register. The kth PE has the

81 Tuple-Value Cache (Dual-Input BRAM+Free Table) value ptr

value ptr 2K-way Merge Sorter Pipeline tuple_in in -∞ Rsn+1 Rsn Rsn+1 Rsn Rsn+1 Rsn Rsn+1 tuple_out out tuple_en en PE0 PE1 PEk tuple_valid valid Flagsn Flagsn-1 Setn Flagsn-1 Setn Flagsn-1 Setn tuple_flush flush Rbn Rbn-1 Rbn Rbn-1 Rbn Rbn-1 Rbn tuple_swap swap clk rst clk rst clk rst Controller clk rst Figure 4-5. Architecture diagram of merge-sorter pipeline.

Rsn+1 port permanently tied to a negative-infinity priority element, effectively ending the systolic array. This architecture can be seen in the 2K-way merge-sorter pipelines of Figure 4-5. Using the 2K-way merge-sorter we can sort any set of elements by 2K elements at a time. The pipeline is initialized on reset to all infinitely-high priority elements. We then shift an

th element of the set we want to sort into the sorter pipeline through the 0 PE’s Rbn-1 port each clock cycle. As we shift our set into the sorter pipeline, the infinitely high priority elements are shifted out. When 2K elements of our set is pushed into the sorter pipeline, we begin to push out the 2K-wise sorted subsets of elements. It is important to note that any elements pushed in after 2K elements will not be sorted with respect to the element that was pushed out, meaning we need to perform additional steps to sort sets greater than 2K elements. Once our set has been pushed into the array, we now need to flush the values out. The default way would be to push in elements of infinitely low priority, pushing out elements in the pipeline until it is empty. This method will require that the sorter is reset after each use, and also wastes 2K clock cycles that could be used to perform additional sorting. One limitation of the design of the merge sorter presented in [78] is that once the pipeline is full we have to flush the elements out by pushing infinitely-low priority elements in, and then resetting the pipeline. To overcome this limitation, we introduce the concept of a current-set register for each PE which provides a way to identify elements that are currently being sorted. Each PE in the sorter pipeline has a current-set register which is connected directly to the

82 1: procedure Pr(A, B) 2: if A.set ≠ B.set then ▷ Check if elements are in same set 3: if A.set ≠ pe.set then ▷ If A’s set does not match the PE it has priority 4: return true 5: else 6: return false 7: end if 8: else if A.flags.value = ∞ ∨ B.flags.value = −∞ then 9: return true 10: else if A.flags.value = −∞ ∨ B.flags.value = ∞ then 11: return false 12: else 13: return Pr’(A, B) ▷ Priority depends on comparative function Pr’(A, B) 14: end if 15: end procedure

Figure 4-6. Pseudocode for systolic-array priority function. set output of the previous PE in the array as shown in Figure 4-5. The set input of the 0th PE is tied to a register which can be toggled at any time to change the current set for the pipeline. Each element pushed into the pipeline is then assigned the set which matches the value of the current-set register at the time it is pushed into the pipeline. We must redefine the priority function to a take into account the set-register values. Figure 4-6 illustrates the rules for determining priority with sets; returning true means that element A has higher priority. When comparing elements of different sets, the element whose set does not match the current PE set register is given priority, pushing those values out of the pipeline first. If two values have the same set, they are compared normally regardless of the current-set of the PE. For example, for a single-bit set register we have two sets with values zero and one. If the current-set register is set to zero, each element pushed into the sorter pipeline will be a member of set zero. Once all elements have been pushed in, and we want to flush the sorter pipeline, we can change the current-set register and begin sorting a different set. Each clock cycle, the current-set value will propagate down the sorter pipeline and begin pushing out all elements of the previous set. Using this approach, we can sort multiple sets without ever having to flush and reset our merge-sorter pipeline, maximizing our sorter throughput.

83 4.2.1.1 Sorting-pipeline architecture

The merge-sorter architecture discussed so far is meant to sort values with a fixed-width field by passing it through the sorter pipeline with its indices. When dealing with arbitrary semirings, as in the case with different graph-processing applications, we need to provide a method of sorting values which may vary in size. One approach would be to allocate an extra-wide value register in each PE, but this would waste significant logic resources. In our approach, we take advantage of the on-chip BRAM of FPGAs to act as cache for the value field of tuples. We then sort tuples using their address in the BRAM cache as the value field in the sorter pipeline. When a tuple is inserted into the sorter pipeline, we look up the next free slot in the BRAM cache and assign the value to that address. We then pass the address into the sorter pipeline as the value of the tuple. When a value comes out of the pipeline, we lookup the associated value in the BRAM cache and output it, marking the slot in the BRAM as free. This approach also has the added benefit of greatly reducing the number of register resources required for long sorter pipelines. A block diagram of our sorter architecture with tuple-value cache is presented in Figure 4-5. One challenge to the fully-associative BRAM cache approach is keeping track of the available free slots. Since the order of the elements coming out of the merge sorter is arbitrary, we must quickly lookup and assign free slots to avoid fragmentation and pipeline stalls. To achieve this, we use an array of 32-bit registers to represent the available slots in the cache. Each bit of the register array corresponds to an address within the tuple-value cache. In order to determine the next-free value, we use a tree-based decoder which selects the lowest- order register with at least one zero bit. We then use a 32-bit barrel shifter and a logical-OR operation to set the bit in the selected register. Similarly, when an element comes out of the sorter pipeline, we can quickly lookup the register index using a barrel shifter and mask off the specific bit corresponding to that address.

84 DMA Interface DMA Interface

src_addr Set Read DMA Cntl Sorter Controller Set Write DMA Cntl dst_addr swap Row/Column Swap swap Row/Column Swap buf_addr row size FIFO 0 row done row row

Merge Sort Op. Merge col FIFO 1 2K-Way col ...... tuple_in value Merge value FIFO ready row row write Sorter col col tuple_out FIFO M-1 col col valid read M set buffers Output Buffer Tuple Stream Tuple set_sel

Figure 4-7. Top-level sorter architecture (some signals omitted for clarity).

4.2.1.2 Merge-sorter controller

The final component of the merge-sorter architecture is the set-merge controller. The merge-sorter pipeline allows us to efficiently sort lists of 2K elements or fewer. In order to sort an arbitrarily large set of tuples, however, we must merge presorted subsets of 2K-element into one sorted set. A high-level architecture for our merge controller is presented in Figure 4-7. The merge-sorter controllers has two modes of operation as shown in Figure 4-8. The streaming mode (Figure 4-8A) pushes values directly through the 2K-way merge-sorter pipeline writing the results directly back to memory. This results in our arbitrarily-large set of tuples being sorted in 2K-way sorted subsets. The merge mode (Figure 4-8B) allows us to take the sorted subsets and merge them M sets at a time, where M is a configurable number of set buffers up to the length of the pipeline. In Figure 4-8B we are taking the four sorted sublists of 2K elements and combining them into two 4K element subsets since M equals two. In order to merge multiple subsets, we begin by reading values from each of the M subsets into their respective subset buffers. A DMA read controller handles all memory read requests from the set-buffer controllers, which keeps track of the address of each subset, shown in Figure 4-7. While the sorter pipeline is not full, we push values into the array in a round- robin fashion, also inserting with each element a subset identifier field which keeps track of the subset it belongs to. When the first valid element comes out of the array, we use the

85 2k 2k 2k 2k

Systolic-Array Sorter Systolic-Array Sorter (Streaming Mode) (Merge Mode;M=2)

4k 4k

2k 2k A 2k B

Figure 4-8. Overview of the sorter in (A) streaming mode, which sorts streaming data into 2K-wise sorted subsets, and (B) merge mode, which merges M subsets.

subset identifier to determine which subset queue to read the next non-zero value from. This guarantees that the relative sorted order of the subsets are maintained. 4.2.1.3 Merge-sorter performance analysis

In this section, we provide an analysis of the performance of the merge-sorter component. When operating in streaming mode the k-way sorter requires only N + k clock cycles to sort N elements. The total streaming time is shown in Equation 4–3. In this equation, k refers to the depth of the sorter pipeline.

−1 TStreaming = (N + k)fclk (4–3)

To determine the total merge time for a set of N elements larger than k, we must first determine the total number of sorting stages required. If we configure our merge-sorter controller to have M sorting buffers, we are able to merge up to M sets at a time. For a k-way ⌈ N ⌉ sorter pipeline, that means we have sets of size k and therefore have k sets. We can then calculate the total number of sorting stages as shown in Equation 4–4. To determine the total merge time, we multiply the number of stages by the amount of work performed per stage. Since in every stage we need to move the entire set through the pipeline, this means we require at least N cycles per stage. Although there are some cases where a stage may consist of single set that does not need to be sorted, we still need to copy it in order to make room for the next sorting step. Using our set register optimization we only incur the penalty of k cycles to flush

86 the pipeline once, at the end of our merge sorting. The total merge time is shown in Equation 4–5.

⌈ ⌈ ⌉⌉ N Stages = log (4–4) m k

( ⌈ ⌈ ⌉⌉ ) N T = N log + k f−1 (4–5) Merge m k clk In order to sort any arbitrary array of N numbers, we fist must order it into sorted sub- arrays and then merge sort those sub-arrays. The first sort is done through a streaming sort, and may have been performed when copying elements directly from the network interface or from memory to the sorting buffer. After the sub-arrays are presorted, they are then merge sorted together M subsets at a time until the final sorted array is calculated. In order to sort an array of N numbers, we use an in-memory buffer of size 2N and ping-pong the data between N-sized buffers at each sorting step until the final sorted array is generated. Equation 4–6 show the total sorting time including both the streaming- and merge-sort stages. Note that in this equation, the penalty of k cycles is only counted once due to the set buffers. The average number of cycles-per-element required to sort sets of increasing sizes in illustrated in Figure 4-9. We see that for different values of k, the minimum number of cycles per element occurs when the set size is equal to k, and steadily increases as the number of merge steps required increases. When increasing M, the number of sets that can be merged at a time, there is a significant reduction in the number of cycles per element required. To maximize sorter performance, M should be as made as large as possible, up to k sets. The limiting factors on the value of M is the available memory bandwidth, available buffer size, and frequency of the sorter pipeline.

[ ( ⌈ ⌈ ⌉⌉) ] N T = T + T = N 1 + log + k f−1 (4–6) Sort Streaming Merge m k clk

87 4096 k 4096 M 1024 2 2048 16 4096 128

512 512

A 64 64 B

8 Average Cycles per Element Average 8

0 5 10 15 20 25 30 20 25 210 215 220 225 230 2 2 2 2 2 2 2 Number of Elements Sorted Number of Elements Sorted Figure 4-9. Pipelined merge-sorter performance analysis when varying (A) value of k, and (B) value of M.

4.2.2 ALU Architecture

In this section we overview the design of the arithmetic logic unit (ALU) for our graph- processor architecture for SpGEMM and breadth-first search. The ALU architecture supports a variable-width input which can range from a single word to multiple words wide, configurable at compile time. Using configuration bits for the ALU, we can specify a variety of operations common for semirings used in graph-processing, including the operations defined in Section 4.1.3. The ALU can be used to perform binary or unary operation on a arbitrary tuple, up to a predefined maximum determined at compile time for the architecture. In unary operation mode, the operation to be performed on each field is selected by the configuration bits of the ALU, and the appropriate registers are selected by the ALU instruction. In binary operation mode, each ALU field can select a conditional operator such as a comparator, and an operation applied to each field based on the result of that condition. The current condition operators include: less than, greater than, equal to, greater-than scalar, less-than scalar as shown in Figure 4-10. The current binary operators supported are: min, max, select, and scalar multiply- add (the scalars are specified in general registers). One advantage of using FPGAs for the

88 tuple0 tuple1

(row,col,value) (row,col,value)

lhs.swap_idx rhs.swap_idx lhs.swap_val_row Index/Value Select Index/Value Select rhs.swap_val_row lhs.swap_val_col rhs.swap_val_col row' col' value' row' col' value'

opcode Index Equality Op(Min,Max,+,*,Sel) zero_select

(row,col,value) Figure 4-10. Design of our ALU supporting various semirings. design of our architecture is the ability for developers to replace our ALU hardware unit with one optimized for their application. The partial reconfiguration ability of FPGAs would be particularly useful for replacing the ALU while the graph processor is running, allowing users to change their application-specific semiring at any time. 4.2.3 HISC/R Storage Controller

In this section, we present an overview of the controller architecture for the Hashed-Index, Sparse-Column/Row(HISC/R) storage format. The storage controller handles accessing matrix data from main memory through the RC Middleware’s DMA interfaces. As discussed in our previous work presenting the HISC/R storage format, a novel hypersparse matrix storage format [79], the sparse-matrix storage format and the properties of the matrix being stored greatly impacts the storage and lookup overhead of the architecture. Since we are focusing on distributed graph datasets which are typically hypersparse, we choose HISC/R as our primary storage format. However, since operations between hypersparse matrices may become sparse or even dense, we also provide specialized controllers for CSC/R and DCSC/R. As a hashed-based storage format, the performance of HISC/R depends on the hash function chosen. We use tabulation hashing, a hash function based on lookup tables, Ti, initialized with random bits which are then XORed together as shown in Equation 4–7. The address for each randomized lookup table is generated using bit fields from the key, usually consisting of one to four bits. In order to guarantee good performance of our hash function, we need to select them uniform randomly from the set of all possible hash functions. If any

89 correlation exists between the bits of our hash function, it could detrimentally impact the distribution of values, and thus the performance of HISC/R. To guarantee good performance of HISC/R, we compare various psuedo random-number generation (PRNG) techniques on FPGAs when used with tabulation hashing.

h(x) = T1[x] ⊕ T2[x] ⊕ · · · ⊕ Tc[x] (4–7)

The Mersenne Twister (MT) [80] is a high-quality PRNG with favorable properties such as a period of 219937 − 1. It is often used as the PRNG in software programming systems, however its computational complexity and memory requirements make it relatively expensive to implement on FPGAs. A more common approach to generating PRNGs on FPGA is the linear-feedback shift register (LFSR). The LFSR is a linear-recurrence generator based on the mathematical properties of finite fields. The next state is determined wholly by its current state and specific bits which generate a feedback bit, making them extremely fast and cheap to compute. An LFSR is represented by a characteristic polynomial which chooses specific bits and either provides feedback at those bits (Galois LFSR), or uses those bits to calculate feedback which is then shifted in at the end of the register (Fibonacci LFSR). The characteristic polynomial of the LFSR determines its period and sequence of PRNG values it cycles through. A maximal LFSR is described by a characteristic polynomial which maximizes its sequence length for the number of bits in the register. Although maximal LFSRs are commonly used for generating random numbers on FPGAs, the properties of the random numbers generated make them unsuited for applications such as Monte Carlo simulations, where an unbiased distribution of numbers are required, without special considerations [81]. We compared the performance of various PRNGs used to generate the random table data when used with tabulation hashing for hashing both sequential and randomized sequences of keys. We compared the MT19937, maximal LFSR-32, 64, and 128, and XORSHIFT [82] PRNGs when used with tabulation hashing. We use the hash function quality metric defined

90 ●

1.008

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.005 ● ●

● 1.002 ● Hash Function Quality ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● 0.999 ●

MT19937 LFSR32 LFSR64 LFSR128 XORSHIFT PRNG Method Figure 4-11. Comparison of tabulation-hash quality metric for different PRNGs.

th in [83] which is shown in Equation 4–8, where bj is the number of items in the j bucket, m is the number of buckets, and n is the number of items. This metric compares the distribution of non-zeros with the expected behavior of a random function, and generates a decimal value which should be between 0.95 and 1.05. We classify any values above 1.05 as failing. Figure 4-11 presents a box plot of the achieved hash quality metric for many different sequences of keys. We use a four-bit key partitioning of a 32-bit key resulting in eight randomized lookup tables of 16 randomized words each. Each PRNG was seeded randomly and was subsequently used to generate random bits to initialize the tabulation-hash hash tables. According to our results, although the 32-bit maximal LFSR performs poorly compared to the other LFSR sizes, MT, and XORSHIFT, it is still within the allowed range of quality. The XORSHIFT PRNG provides performance closest to the MT PRNG, but the 128-bit LFSR provides similar performance with the simplest implementation. Based on our results, we

91 PRNG[0] IDX[0]PRNG[N-1]... IDX[N-1]

D Q D Q H[0] HISC/R Controller Tabulation Hash en en row_in index hash col_in XOR Array D Q D Q H[1]

FIFO en en ...... val_in ... cmd seed wr Command Queue D Q D Q H[M-1] Tuple Interface Tuple init Controller PRNG en en format Memory Controller seed Rd Control rd_data base State Machine rd_address

Con fi guration base FIFO rd_valid Controller rd_en hash Wr Control wr_data wr_address

tuple FIFO wr_ready DMA Interface Lookaside wr_en Hash Table row_out Tuple Output Buffer col_out clk val_out rst (BRAM) FIFO valid enable Tuple Interface Tuple

Figure 4-12. Controller architecture for HISC/R storage format.

choose a 128-bit LFSR for the PRNG in our sparse-matrix processor, giving us good random performance with the simplest implementation.

− m∑1 b (b + 1)/2 q = j j (4–8) (n/2m)(n + 2m − 1) j=0 Figure 4-12 presents a block diagram of the hardware HISC/R storage controller. The HISC/R storage controller is composed of four primary units: the HISC/R controller logic, the pipelined tabulation-hashing unit, the memory controller, and tuple output buffers. The controller logic is responsible for accepting non-zero row or column lookup requests and scheduling the appropriate memory accesses. The Lookaside Hash Table (LHT) is a secondary hash table implemented in BRAM to optimize memory performance. A primary function of the LHT is to cache indices of columns/rows that were previously accessed and determined to be empty, reducing accesses to main memory. By filtering these requests, we reduce the number of random accesses and improve the memory throughput and latency of HISC/R. The tabulation hashing unit uses the method described in [66] to map a k-bit index

⌈ k ⌉ c to a j-bit address using c lookup tables. Each table has 2 entries of j bits, which can be implemented in BRAM or registers. The bits of the tables are initialized with randomized data from a pseudo-random number generator to create a random hash function. Each set of c bits ⌈ k ⌉ from the index select a j bit word from a table. All c j-bit words are then XORed together to

92 Table 4-2. Graph-processor resource analysis. Logic Usage Block-Ram Bits DSP-18 Component Total % Total % Total % Controller 97 741 23 131 072 0.6 16 1.6 Merge Sorter 135 978 32 1 048 576 5 0 0 ALU 67 994 16 2048 0 64 6.3 Total 301 713 71 1 181 426 5.6 80 7.9

generate the bucket address. We use an XOR tree with configurable latency to maximize the achievable clock frequency for large index and address sizes. 4.2.4 FPGA Resource Analysis

Table 4-2 provides a post-fit FPGA resource usage summary when our architecture was compiled on an Altera Stratix-IV E530 FPGA. The merge-sorter (K=4096, M=32) component required significantly more logic and memory resources than other components due to the sorter-pipeline registers and set input buffers. Since for denser graphs the merge sorter performance dominates the computation time, it is important we maximize the depth of the sorter and number of sets it can merge. The second largest component was the controller, which consists of the control-logic state machine and matrix/general registers, and DMA controllers and buffers. Finally, the ALU which consists of wide tuple-input registers, control bits, and several different mathematical operations is the third largest component. Although the network loopback interface does require some resources for the FIFO message buffer, it is not included in our resource analysis. In total our single-PE architecture requires approximately 71% of Stratix-IV logic resources, and only 5.6% of BRAM bits. 4.3 Experimental Setup

The case studies presented in Sections 4.4-4.6 are all performed using a single FPGA bitstream containing the architecture presented in Section 4.2. The bitstream was compiled for the GiDEL PROCStar IV E530 platform [33] in Quartus 14.1 with RC Middleware version 2.1. The PROCStar IV has four Stratix-IV E530 FPGAs connected to the host computer through a PCIe x8 interface. Each FPGA of the PROCStar IV has three memory banks, 2x4 GB DDR2

SODIMMs with an aggregate bandwidth of 8 GB/s, and a 512 MB DDR2 SDRAM with 4 GB/s of bandwidth. All software, including the software baselines, was compiled with GCC 4.9.2 using the default optimization levels included by the software build scripts. All R-MAT graph datasets were generated using GTGraph [84] by adding random noise to the initiator matrix. Our software baselines were executed on dual Xeon E5620 CPUs running at 2.40 GHz.
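For reference, the sketch below shows one common way R-MAT edges are generated: recursive quadrant selection with a perturbed initiator. The initiator probabilities shown are the Graph 500 defaults and the noise model is illustrative; GTGraph's exact parameters may differ.

    #include <cstdint>
    #include <random>
    #include <utility>

    // Illustrative R-MAT edge generator: at each of 'degree' levels, choose
    // one of four quadrants with probabilities (a, b, c, d), perturbing the
    // initiator slightly per level ("random noise"). Graph 500 default
    // initiator assumed: a = 0.57, b = 0.19, c = 0.19, d = 0.05.
    std::pair<uint64_t, uint64_t> rmat_edge(int degree, std::mt19937_64 &rng) {
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        uint64_t row = 0, col = 0;
        for (int level = 0; level < degree; ++level) {
            double noise = 0.05 * (uni(rng) - 0.5);                 // illustrative noise
            double a = 0.57 + noise, b = 0.19 - noise / 3, c = 0.19 - noise / 3;
            double r = uni(rng);
            uint64_t bit = 1ULL << (degree - 1 - level);
            if (r < a)              { /* top-left quadrant: neither bit set */ }
            else if (r < a + b)     { col |= bit; }
            else if (r < a + b + c) { row |= bit; }
            else                    { row |= bit; col |= bit; }
        }
        return {row, col};   // one non-zero of a 2^degree x 2^degree matrix
    }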

4.4 Case Study: Sparse Generalized Matrix-Matrix Multiplication

In this section we present the results of our SpGEMM case study comparing our architecture with two optimized CPU baselines: SuiteSparse [85], the optimized sparse-matrix library at the heart of MATLAB, and CombBLAS [25], a parallel, MPI-based BLAS toolkit designed for graph processing. To compare SpGEMM performance, we explore varying matrix degrees, which determine the total number of vertices, at different levels of sparsity, which determines the number of non-zeros in the matrix. For each test we generate two random R-MAT matrices with the desired degree and sparsity, multiply them using our architecture and the CPU baselines, and record the raw execution time, not including input/result read/write times. In addition to the CPU baselines, we record the execution times for each test for our architecture (HW) using each of the three supported storage formats. Figure 4-13 presents the results of our SpGEMM case studies, varying the matrix densities for different R-MAT matrix degrees. When compared with CombBLAS, our architecture consistently gives an order of magnitude or greater performance improvement, with the exception of our degree-20 test case at about 0.1 non-zeros per row/column. The peak speedup for the degree-26 case study is more than a 20× improvement for both the hypersparse and sparse cases. When compared with SuiteSparse, our architecture provides a performance improvement similar to that achieved against CombBLAS. At lower matrix degrees, SuiteSparse provides an execution-time advantage over CombBLAS, which does reduce our achieved speedup. This improved performance is likely due to the sparse-accumulator data structure used by the SuiteSparse library.
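To make the explored parameter space concrete, assume the standard R-MAT/Graph 500 convention that a degree-d matrix has n = 2^d vertices; the number of non-zeros then follows from the average row/column density r:

    n = 2^{d}, \qquad \mathrm{nnz} \approx r \cdot n .

For example, a degree-26 matrix at a density of 10 non-zeros per row/column has roughly 10 \cdot 2^{26} \approx 6.7 \times 10^{8} non-zeros, while the same matrix at a hypersparse density of 10^{-3} has only about 6.7 \times 10^{4}.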

[Figure 4-13: six log-log panels of execution time (s) versus average row/column density; series: SuiteSparse, CombBLAS, HW (CSC/R), HW (DCSC/R), HW (HISC/R).]

Figure 4-13. Comparison of our architecture running SpGEMM with CombBLAS and SuiteSparse baselines. R-MAT degrees: (A) 16 (B) 18 (C) 20 (D) 22 (E) 24 (F) 26.

[Figure 4-14: BFS throughput in billions of traversed edges per second (GTEPS) versus R-MAT matrix degree (20-23); series: Betkaoui et al., Attia et al., Graph Processor, Graph Processor (Scaled).]

Figure 4-14. Comparison of our architecture running BFS with state-of-the-art designs on the Convey HC-1/HC-2.

At larger matrix sizes, the overhead of allocating the sparse accumulator leads to SuiteSparse having a plateau in performance for the hypersparse cases. The peak speedup compared to SuiteSparse for the hypersparse cases is several orders of magnitude for the degree-24/26 test cases. For the sparse region, we achieve a peak speedup of more than 40× for degree-24 matrices. We also measured the power required for executing our architecture on the CPU host and the PROCStar IV. We measured the power required by the PROCStar IV by subtracting the static power of the host alone from that of the host plus PROCStar IV. Our results indicate that our FPGA architecture requires less than 12% of the power of the CPU host, even when including the static power of the three unused FPGAs on the PROCStar IV board. Taking this power efficiency into account, our architecture achieves more than a 200×/400× performance-per-watt improvement when compared to CombBLAS/SuiteSparse.

4.5 Case Study: Breadth-First Search

In this section we analyze the performance of our architecture performing the BFS spanning-tree calculation using the linear-algebra formulation presented in Section 4.1.3. We compare our BFS performance with state-of-the-art BFS architectures running on the Convey

HC-1/HC-2, including CyGraph [16] and another HC-1 architecture described in [86]. We generate random R-MAT matrices with an average edge factor of eight for scales 20, 21, 22, and 23, matching the datasets used by the works we compare against. We generate the BFS spanning tree from a randomized starting node in the graph and execute until completion, recording the number of edges traversed and the execution time. We report the performance as the number of edges traversed per second. Figure 4-14 presents our BFS performance compared with the works presented in [16] and [86]. One problem with comparing the HC-1/HC-2 architectures with our design is the available memory bandwidth. The PROCStar IV provides an order of magnitude less aggregate memory bandwidth than the HC-1 and HC-2 systems (8 GB/s vs. 80 GB/s), making it difficult to compare the efficiency of the approaches. Therefore, we scale our measured throughput in billions of edge traversals per second (GTEPS) for BFS by a factor of 10 to adjust for the difference in memory bandwidth. We present both the raw and scaled throughput in Figure 4-14. Based on Figure 4-14, we see that our architecture performs better than the compared approaches when scaling for bandwidth. Our architecture provides a peak performance of around 0.64 GTEPS (scaled) for a scale-20 R-MAT matrix. It is important to note, however, that directly scaling the performance by a memory-bandwidth factor does not take into account other performance factors such as the achievable architecture throughput. For memory-bound algorithms such as graph processing, however, it is reasonable to assume that an increase in bandwidth would result in a similar increase in performance.
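For clarity, the reported figures follow directly from the measured traversals and the stated bandwidth ratio; the raw value implied by the scaled peak below is an arithmetic consequence, not a separately measured number:

    \mathrm{GTEPS}_{\mathrm{raw}} = \frac{\text{edges traversed}}{10^{9} \cdot t_{\mathrm{exec}}}, \qquad
    \mathrm{GTEPS}_{\mathrm{scaled}} = \mathrm{GTEPS}_{\mathrm{raw}} \cdot \frac{80\,\mathrm{GB/s}}{8\,\mathrm{GB/s}} = 10 \cdot \mathrm{GTEPS}_{\mathrm{raw}},

so the 0.64 GTEPS (scaled) peak corresponds to roughly 0.064 GTEPS measured on the PROCStar IV.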

4.6 Graph-Processor Architecture Scalability Analysis

In this section we present our approach to projecting the SpGEMM performance scalability of our architecture leveraging the Novo-G# multi-FPGA reconfigurable supercomputer [71]. To predict the performance of our architecture running on multiple FPGAs in Novo-G#, we leverage the discrete-event network simulation model presented in [71]. One challenge in predicting the performance scalability of our application is that the workload and performance are data driven and highly irregular.

[Figure 4-15 diagram: the workload simulator (algorithm, partition, and storage modules running as threads on an MPI cluster) profiles memory accesses, operation counts, and message counts per PE, which feed the Novo-G# network-simulator model (VCT switching, dimension-order routing, link contention).]

Figure 4-15. Scalability simulation approach combining experimental results with the Novo-G# simulation model.

In order to capture the actual performance of each node in the network, we developed a C++ workload simulator which executes actual SpGEMM operations and profiles the computation and communication of each simulated network node. Our simulator records metrics including the number of memory reads, writes, and ALU operations, and the number of messages sent between each pair of processors. We then pass these statistics into the graph-processor stimulus model running on the Novo-G# network simulator, as shown in Figure 4-15. We record the total simulated time for the computation to complete and compare it with the CPU baseline execution time. We run our profiler at a fixed problem size of 10 non-zeros per row/column for a randomly generated degree-26 R-MAT matrix with an increasing level of parallelism for both 2D- and 3D-torus networks, and calculate the speedup. For both the 2D- and 3D-torus cases we use a block-cyclic decomposition of our input datasets and assume the data is randomly permuted. A summary of the simulation parameters can be found in Table 4-3. Each processing element (PE) in the simulation runs four tasks. Each PE starts by reading its local graph data and redistributing it to nodes in the network. At the same time, incoming edge data is used to index into the locally stored matrix and calculate the resulting partial products, if any. The partial products are then redistributed and accumulated by the owning processes. The number of messages sent and received, and the total memory read time for each process, were determined through the workload simulator. The time to accumulate the partial products owned by each node was determined by the analytical performance model of the pipelined sorter.

Table 4-3. Summary of parameters used to simulate SpGEMM scalability.
Parameter            Value
Data Partitioning    2D block cyclic
Block Size           2^16 by 2^16
Mapping              Modulo
Dataset              RMAT-26
Sparsity             10 elements/row
Channel Width        4 bits
Channel Rate         10 Gbps
Channel Delay        40 ns
Router Frequency     250 MHz
Routing Algorithm    Dimension Order
Switch               Cut Through
Routing Cycles       2
Flit Width           256 bits
Header Flits         1
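A minimal sketch of the partitioning in Table 4-3, assuming block (bi, bj) of the 2^16-by-2^16-blocked matrix is assigned to PE (bi mod P, bj mod Q) on a P-by-Q grid; the function and struct names are illustrative, and the sketch does not capture how the third dimension of the 3D-torus configurations is used.

    #include <cstdint>

    // Illustrative 2D block-cyclic, modulo mapping of a non-zero to its
    // owning processing element on a P x Q grid.
    struct Owner { uint32_t pe_row, pe_col; };

    Owner owner_of(uint64_t row, uint64_t col, uint32_t P, uint32_t Q,
                   uint64_t block = 1ULL << 16) {   // 2^16 x 2^16 blocks (Table 4-3)
        uint64_t bi = row / block;                  // block-row index
        uint64_t bj = col / block;                  // block-column index
        return Owner{ static_cast<uint32_t>(bi % P),   // cyclic (modulo) mapping
                      static_cast<uint32_t>(bj % Q) };
    }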

In our model, we assume that each receive channel can be routed simultaneously to the appropriate output channel as long as there is no write contention. Figure 4-16 presents the simulated speedup compared to the CombBLAS single-CPU baseline for increasing levels of parallelism. Our results include up to a 6×6 (36-node) 2D-torus configuration and up to a 4×4×4 (64-node) 3D-torus configuration. The limited cross-sectional bandwidth of the 2D-torus configuration limits the scalability of our processor architecture, causing the speedup to fall off beyond 25 processors. We achieve up to a 500× simulated speedup for a 6×6 2D torus, giving a parallel efficiency of approximately 0.64. The improved cross-sectional bandwidth of the 3D-torus configuration allows it to maintain its performance better as the number of processors increases. We achieve up to a 980× simulated speedup for a 4×4×4 3D torus, giving a parallel efficiency of approximately 0.70.
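As a sanity check, assuming parallel efficiency is defined as E_N = S_N/(N S_1) relative to the single-PE speedup S_1, the reported values imply

    E_N = \frac{S_N}{N \, S_1} \;\Rightarrow\; S_1 \approx \frac{500}{36 \times 0.64} \approx 22 \quad \text{and} \quad S_1 \approx \frac{980}{64 \times 0.70} \approx 22,

which is consistent with the roughly 20× single-FPGA SpGEMM speedup reported in Section 4.4.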

4.7 Summary and Conclusions

In this chapter we presented an extensible FPGA graph-processor architecture designed for the linear-algebra formulation of graph applications, focusing on sparse generalized matrix-matrix multiplication and breadth-first search kernels.

[Figure 4-16: simulated speedup versus number of processors for 2D-torus (2×2 through 6×6) and 3D-torus (2×2×2 through 4×4×4) topologies.]

Figure 4-16. Simulated SpGEMM speedup for an increasing number of nodes for different network topologies on the Novo-G# system model.

We presented a detailed look at the design of our architecture, identifying key architecture components and providing a detailed overview and analysis of each. We provided a method of extending our architecture to support various graph-processing applications by replacing our BFS ALU with an ALU which supports an application-specific semiring. Using a single bitstream for our graph-processor architecture compiled on the GiDEL PROCStar IV platform, we achieved up to a 20×/40× speedup compared to CombBLAS/SuiteSparse at 12% of the power over multi-threaded CPU baselines when performing SpGEMM with scale-26/24 R-MAT matrices. We achieved competitive performance to state-of-the-art BFS architectures on the Convey HC-1/HC-2 after adjusting for the available memory bandwidth of a single processor. Finally, we presented our methodology for exploring the scalability of our architecture performing SpGEMM on the Novo-G# multi-FPGA supercomputer for 2D- and 3D-torus networks. Our simulation model projects up to a 500× speedup over the CPU baseline for a 6×6 2D torus, and up to a 980× speedup for a 4×4×4 3D torus.

CHAPTER 5
CONCLUSIONS

In this work we have addressed the portability and productivity challenges associated with HPRC application design, tackled the challenges of efficient sparse-matrix storage formats for graph processing on FPGAs, and applied what we learned and developed to engineer a scalable graph-processor architecture on FPGAs. We explored the performance of our architecture using sparse generalized matrix-matrix multiplication (SpGEMM) and breadth-first search (BFS) kernels. We compared the performance of our architecture using SpGEMM and demonstrated over a 26× speedup at only 12% of the power when compared to a multi-threaded CPU baseline. We also compared the BFS performance of our architecture to current state-of-the-art approaches and determined that our architecture performs competitively after accounting for the differences in aggregate bandwidth.

To address the scalability, portability, and productivity hurdles of FPGA-application development, we proposed a novel RC Middleware (RCMW) layer which presents an application-specific view of platform resources. RCMW allows users to define the resources and interfaces required by their specific application, and the RC Middleware will map those resources onto a target platform at compile time. This approach enables seamless portability of FPGA applications across FPGA platforms supported by the RC Middleware. RCMW currently supports four heterogeneous platforms from three vendors: the GiDEL PROCStar III and IV, the Pico Computing M501, and the Nallatech PCIe-385N. Platform support in the RC Middleware is enabled by a descriptive XML format which describes the platform interfaces and resources, and an IP core library which contains the interface controllers for the platform, allowing new platforms to be added easily. We evaluated RCMW’s performance and productivity benefits for four platforms from three vendors. We demonstrated RCMW’s ability to quickly explore different application-to-platform mappings using a convolution application case study for both area- and performance-optimizing cost functions. We demonstrated that the benefits of RCMW can be achieved with less than 1% FPGA/memory and 7% host/FPGA transfer overhead in the common case.

We also demonstrated that RCMW has relatively low area overhead, requiring less than 3% of logic resources for several applications across all four platforms. We presented evidence that RCMW improves developer productivity by showing that RCMW requires fewer lines of code and less total development time for deploying several kernels than vendor-specific approaches. Finally, we demonstrated that RCMW enables portability by showing that the same application source was able to execute without change across each supported platform.

We addressed the challenges of storing graph adjacency matrices to maximize graph-processing application performance while minimizing storage overhead. There are a wide variety of formats which are optimized for different non-zero distributions, such as diagonal or banded matrices, and for different platform architectures, such as vector processors or GPUs. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R), which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off storage against lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. In order to overcome these limitations, we proposed a novel sparse-matrix storage format called Hashed-Index Sparse-Column/Row (HISC/R). HISC/R replaces the dense indexing vector in CSC/R, and the sparse indexing vectors in DCSC/R, with a hashed indexing vector, enabling constant-time accesses to rows or columns of a matrix. Additionally, HISC/R optimizes the storage of hypersparse matrices by allowing non-zero elements to be stored directly in the hashed indexing vector when no additional space is required. HISC/R provides O(1) lookup complexity and O(nnz) storage complexity while also enabling runtime insert and delete operations, allowing matrices to be constructed directly without using expensive intermediate storage formats. We showed that HISC/R requires significantly less storage than CSC/R and up to 19% less than DCSC/R for hypersparse datasets while maintaining an average hash-table load factor of 71%. Additionally, we showed that HISC/R provides a 14% and 40% improvement in terms of memory reads compared to CSC/R and DCSC/R, respectively, when performing matrix multiplication with hypersparse datasets.

The reduction in the total number of memory accesses and the favorable storage performance for hypersparse datasets make HISC/R uniquely suited for scalable graph processing.

Leveraging our work on the RC Middleware portability framework and the HISC/R storage format, we developed an extensible FPGA graph-processor architecture designed for the linear-algebra formulation of graph applications, focusing on sparse generalized matrix-matrix multiplication and breadth-first search kernels. We presented a detailed look at the design of our architecture, identifying key architecture components and providing a detailed overview and analysis of each. We provided a method of extending our architecture to support various graph-processing applications by replacing our BFS ALU with an ALU which supports an application-specific semiring. Using a single bitstream for our graph-processor architecture compiled on the GiDEL PROCStar IV platform, we achieved up to a 20×/40× speedup compared to CombBLAS/SuiteSparse at 12% of the power over multi-threaded CPU baselines when performing SpGEMM with scale-26/24 R-MAT matrices. We achieved competitive performance to state-of-the-art BFS architectures on the Convey HC-1/HC-2 after adjusting for the available memory bandwidth of a single processor. Finally, we presented our methodology for exploring the scalability of our architecture performing SpGEMM on the Novo-G# multi-FPGA supercomputer for 2D- and 3D-torus networks. Our simulation model projects up to a 500× speedup over the CPU baseline for a 6×6 2D torus, and up to a 980× speedup for a 4×4×4 3D torus.

REFERENCES

[1] L. C. Monerris, E. T. Serrano, J. D. S. Quilis, and I. B. Espert, “Gpf4med: A large-scale graph processing system applied to the study of breast cancer,” in Computational Science and Engineering, 2015 IEEE 18th International Conference on, Oct 2015, pp. 27–34.

[2] S. A. Jacobs and A. Dagnino, “Large-scale industrial alarm reduction and critical events mining using graph analytics on spark,” in 2016 IEEE Second International Conference on Big Data Computing Service and Applications, March 2016, pp. 66–71.

[3] F. Riaz and K. M. Ali, “Applications of graph theory in computer science,” in Computational Intelligence, Communication Systems and Networks (CICSyN), 2011 Third International Conference on, July 2011, pp. 142–145.

[4] L. Ball, “Automating social network analysis: A power tool for counter-terrorism,” Security Journal, vol. 29, no. 2, pp. 147–168, 2016.

[5] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 135–146.

[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: A framework for machine learning and data mining in the cloud,” Proc. VLDB Endow., vol. 5, no. 8, pp. 716–727, Apr. 2012.

[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 17–30.

[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica, “Graphx: Graph processing in a distributed dataflow framework,” in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’14. Berkeley, CA, USA: USENIX Association, 2014, pp. 599–613.

[9] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry, “Challenges in parallel graph processing,” Parallel Processing Letters, vol. 17, no. 1, pp. 5–20, 2007.

[10] D. A. Bader, G. Cong, and J. Feo, “On the architectural requirements for efficient execution of graph algorithms,” in Proceedings of the 2005 International Conference on Parallel Processing, ser. ICPP ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 547–556.

[11] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “Nxgraph: An efficient graph processing system on a single machine,” in 2016 IEEE 32nd International Conference on Data Engineering (ICDE), May 2016, pp. 409–420.

[12] S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and L. K. John, “Proxy-guided load balancing of graph processing workloads on heterogeneous clusters,” in 2016 45th International Conference on Parallel Processing (ICPP), Aug 2016, pp. 77–86.

[13] R. Elshawi, O. Batarfi, A. Fayoumi, A. Barnawi, and S. Sakr, “Big graph processing systems: State-of-the-art and open challenges,” in Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on, March 2015, pp. 24–33.

[14] S. Zhou, C. Chelmis, and V. K. Prasanna, “High-throughput and energy-efficient graph processing on fpga,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2016, pp. 103–110.

[15] N. Engelhardt and H. K. H. So, “Gravf: A vertex-centric distributed graph processing framework on fpgas,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Aug 2016, pp. 1–4.

[16] O. G. Attia, T. Johnson, K. Townsend, P. Jones, and J. Zambreno, “CyGraph: A reconfigurable architecture for parallel breadth-first search,” Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS, pp. 228–235, 2014.

[17] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh, “From opencl to high-performance hardware on fpgas,” in 22nd International Conference on Field Programmable Logic and Applications (FPL), Aug 2012, pp. 531–534.

[18] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2011.

[19] M. M. Wolf, J. W. Berry, and D. T. Stark, “A task-based linear algebra building blocks approach for scalable graph analytics,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE, Sept 2015, pp. 1–6.

[20] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, “Kronecker graphs: An approach to modeling networks,” J. Mach. Learn. Res., vol. 11, pp. 985–1042, Mar. 2010.

[21] Graph 500 Steering Committee. (2010) Graph 500 benchmark specification. Accessed 2016-08-21. [Online]. Available: http://www.graph500.org/specifications

[22] A. Eisenman, L. Cherkasova, G. Magalhaes, Q. Cai, P. Faraboschi, and S. Katti, “Parallel graph processing: Prejudice and state of the art,” in Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, ser. ICPE ’16. New York, NY, USA: ACM, 2016, pp. 85–90.

[23] A. Buluç and J. R. Gilbert, “On the representation and multiplication of hypersparse matrices,” pp. 1–11, April 2008.

[24] T. Mattson, D. Bader, J. Berry, A. Buluç, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo, “Standards for graph algorithm primitives,” in High Performance Extreme Computing Conference (HPEC), 2013 IEEE, Sept 2013, pp. 1–2.

[25] A. Buluç and J. R. Gilbert, “The combinatorial blas: Design, implementation, and applications,” Int. J. High Perform. Comput. Appl., vol. 25, no. 4, pp. 496–509, Nov. 2011.

[26] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, “The promise of high-performance reconfigurable computing,” Computer, vol. 41, no. 2, pp. 69–76, Feb 2008.

[27] C. Pascoe, A. Lawande, H. Lam, A. George, W. F. Sun, and M. Herbordt, “Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing,” in Proc. of Intl. Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, NV, Jul. 2010.

[28] J. Williams, C. Massie, A. D. George, J. Richardson, K. Gosrani, and H. Lam, “Characterization of fixed and reconfigurable multi-core devices for application acceleration,” ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 4, pp. 19:1–19:29, Nov. 2010.

[29] B. Betkaoui, D. B. Thomas, and W. Luk, “Comparing performance and energy efficiency of fpgas and gpus for high productivity computing,” in Field-Programmable Technology (FPT), 2010 International Conference on, Dec 2010, pp. 94–101.

[30] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, “An overview of reconfigurable hardware in embedded systems,” EURASIP J. Embedded Syst., vol. 2006, no. 1, pp. 13–13, Jan. 2006.

[31] A. George, H. Lam, and G. Stitt, “Novo-g: At the forefront of scalable reconfigurable supercomputing,” Computing in Science Engineering, vol. 13, no. 1, pp. 82–86, Jan 2011.

[32] GiDEL Ltd. (2009) PROCStar III Product Brief. Accessed 2016-09-30. [Online]. Available: http://www.gidel.com/pdf/PROCStarIII%20Product%20Brief.pdf

[33] GiDEL Ltd. (2010) PROCStar IV Product Brief. Accessed 2016-09-26. [Online]. Available: http://www.gidel.com/pdf/PROCStarIV%20Product%20Brief.pdf

[34] Pico Computing. (2013) M-501 product brief. Accessed 2016-09-30. [Online]. Available: http://picocomputing.com/wp-content/uploads/2013/09/M-501-Product-Brief1.pdf

[35] Nallatech. (2014) Nallatech pcie-385. Accessed 2016-09-30. [Online]. Available: http://www.nallatech.com/wp-content/uploads/pcie 385pb v1 21.pdf

[36] OpenFPGA Inc. (2008) OpenFPGA. Accessed 2014-07-23. [Online]. Available: www.openfpga.org/

[37] K. Eguro, “Sirc: An extensible reconfigurable computing communication api,” in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010, pp. 135–138.

[38] J. Villarreal, A. Park, W. Najjar, and R. Halstead, “Designing modular hardware accelerators in c with roccc 2.0,” in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010, pp. 127–134.

[39] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel programming standard for heterogeneous computing systems,” Computing in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.

[40] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh, “From opencl to high-performance hardware on fpgas,” in 22nd International Conference on Field Programmable Logic and Applications (FPL), Aug 2012, pp. 531–534.

[41] A. Ismail and L. Shannon, “Fuse: Front-end user framework for o/s abstraction of hardware accelerators,” in Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May 2011, pp. 170–177.

[42] Y. Wang, X. Zhou, L. Wang, J. Yan, W. Luk, C. Peng, and J. Tong, “Spread: A streaming-based partially reconfigurable architecture and programming model,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 12, pp. 2179–2192, Dec 2013.

[43] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah, “Liquid metal: Object-oriented programming across the hardware/software boundary,” in Proceedings of the 22nd European Conference on Object-Oriented Programming, ser. ECOOP ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 76–103.

[44] D. Andrews, R. Sass, E. Anderson, J. Agron, W. Peck, J. Stevens, F. Baijot, and E. Komp, “Achieving programming model abstractions for reconfigurable computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 34–44, Jan 2008.

[45] L. Cai, D. Gajski, and M. Olivarez, “Introduction of system level architecture exploration using the specc methodology,” in Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on, vol. 5, 2001, pp. 9–12.

[46] J. Kulp. (2010, May) OpenCPI Technical Summary. Accessed 2014-03-31. [Online]. Available: http://opencpi.org

[47] V. Aggarwal, G. Stitt, A. George, and C. Yoon, “SCF: a framework for task-level coordination in reconfigurable, heterogeneous systems,” ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 2, pp. 7:1–7:23, Jun. 2012.

[48] L. Shannon and P. Chow, “Simplifying the integration of processing elements in computing systems using a programmable controller,” in 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05), April 2005, pp. 63–72.

[49] T. Schumacher, C. Plessl, and M. Platzner, “Imorc: Application mapping, monitoring and optimization for high-performance reconfigurable computing,” in Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, April 2009, pp. 275–278.

[50] G. Stitt and J. Coole, “Intermediate fabrics: Virtual architectures for near-instant fpga compilation,” IEEE Embedded Systems Letters, vol. 3, no. 3, pp. 81–84, Sept 2011.

[51] X. Reves, V. Marojevic, R. Ferrus, and A. Gelonch, “Fpga’s middleware for software defined radio applications,” in International Conference on Field Programmable Logic and Applications, 2005, Aug 2005, pp. 598–601.

[52] Nallatech Ltd. (2007) Dimetalk v3.0. Accessed 2014-07-24. [Online]. Available: http://www.nallatech.com

[53] GiDEL Ltd. (2014) Procwizard. Accessed 2014-07-24. [Online]. Available: http://www.gidel.com/procwizard.htm

[54] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer, “Leap scratchpads: Automatic memory and cache management for reconfigurable logic,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA ’11. New York, NY, USA: ACM, 2011, pp. 25–28.

[55] Altera Corp. (2007) Avalon memory-mapped interface specification. Accessed 2014-07-24. [Online]. Available: https://www.altera.com/literature/manual/mnl avalon spec.pdf

[56] ARM. (2013) Amba axi and ace protocol specification. Accessed 2015-07-24. [Online]. Available: https://silver.arm.com/download/download.tm?pv=1377613

[57] R. Kirchgessner, A. D. George, and H. Lam, “Reconfigurable computing middleware for application portability and productivity,” in IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors, June 2013, pp. 211–218.

[58] L. Hao and G. Stitt, “Bandwidth-sensitivity-aware arbitration for fpgas,” Embedded Systems Letters, IEEE, vol. 4, no. 3, pp. 73–76, Sept 2012.

[59] J. Schofield, “The statistically unreliable nature of lines of code,” CrossTalk: The Journal of Defense Software Engineering, vol. 18, no. 4, pp. 29–33, April 2005.

[60] OpenCores.org. (2014) Opencores. Accessed 2014-02-02. [Online]. Available: http://opencores.org/

[61] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2003.

[62] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.

[63] E. Montagne and A. Ekambaram, “An optimal storage format for sparse matrices,” Inf. Process. Lett., vol. 90, no. 2, pp. 87–92, Apr. 2004.

[64] I. Šimeček, D. Langr, and P. Tvrdík, “Minimal quadtree format for compression of sparse matrices storage,” pp. 359–364, 2012.

[65] M. N. Wegman and J. Carter, “New hash functions and their use in authentication and set equality,” Journal of Computer and System Sciences, vol. 22, no. 3, pp. 265–279, 1981.

[66] M. Pătrașcu and M. Thorup, “The power of simple tabulation hashing,” J. ACM, vol. 59, no. 3, pp. 14:1–14:50, Jun. 2012.

[67] R. Pagh and F. F. Rodler, “Cuckoo hashing,” J. Algorithms, vol. 51, no. 2, pp. 122–144, May 2004.

[68] M. Herlihy, N. Shavit, and M. Tzafrir, Hopscotch Hashing, ser. DISC ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 350–364.

[69] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick, “Parallel de bruijn graph construction and traversal for de novo genome assembly,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 437–448. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.41

[70] J. Riedy and D. A. Bader, “Multithreaded community monitoring for massive streaming graph data,” in Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, pp. 1646–1655.

[71] A. G. Lawande, A. D. George, and H. Lam, “Novo-g#: a multidimensional torus-based reconfigurable cluster for molecular dynamics,” Concurrency and Computation: Practice and Experience, vol. 28, no. 8, pp. 2374–2393, 2016, cpe.3565.

[72] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication,” in Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, May 2014, pp. 36–43.

[73] S. Kestur, J. D. Davis, and E. S. Chung, “Towards a universal fpga matrix-vector multiplication architecture,” in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, April 2012, pp. 9–16.

[74] H. Giefers, P. Staar, C. Bekas, and C. Hagleitner, “Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of gpu, xeon phi and fpga,” in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2016, pp. 46–56.

[75] N. Kapre, “Custom fpga-based soft-processors for sparse graph acceleration,” in 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2015, pp. 9–16.

[76] W. S. Song, V. Gleyzer, A. Lomakin, and J. Kepner, “Novel graph processor architecture, prototype system, and results,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Sept 2016.

[77] R. Kirchgessner, A. D. George, and G. Stitt, “Low-overhead fpga middleware for application portability and productivity,” ACM Trans. Reconfigurable Technol. Syst., vol. 8, no. 4, pp. 21:1–21:22, Sep. 2015.

[78] W. S. Song, “Systolic merge sorter,” May 29 2012, US Patent No. 8,190,943. [Online]. Available: https://www.google.com/patents/US20100235674

[79] R. Kirchgessner, G. D. L. Torre, A. George, and V. Gleyzer, “Hisc/r: An efficient hypersparse storage format for scalable graph processing,” in Proceedings of the 6th Workshop on Irregular Applications: Architectures and Algorithms, ser. IA3 ’16, 2016.

[80] M. Matsumoto and T. Nishimura, “Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Trans. Model. Comput. Simul., vol. 8, no. 1, pp. 3–30, Jan. 1998.

[81] H. Bauke and S. Mertens, “Random numbers for large scale distributed monte carlo simulations,” CoRR, vol. abs/cond-mat/0609584, 2006.

[82] G. Marsaglia, “Xorshift rngs,” Journal of Statistical Software, vol. 8, no. 1, pp. 1–6, 2003.

[83] A. V. Aho, M. S. Lam, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison Wesley, Sep. 2006.

[84] K. Madduri and D. A. Bader. (2006) GTgraph: a suite of synthetic random graph generators. Accessed 2016-07-21. [Online]. Available: http://www.cse.psu.edu/~kxm85/software/GTgraph/

[85] T. A. Davis. (2016) Suitesparse: A suite of sparse matrix software. Accessed 2016-10-15. [Online]. Available: http://faculty.cse.tamu.edu/davis/suitesparse.html

[86] B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk, “A reconfigurable computing approach for efficient and scalable parallel graph exploration,” in Proceedings of the 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors, ser. ASAP ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 8–15.

BIOGRAPHICAL SKETCH

Robert Kirchgessner is a Ph.D. graduate from the Department of Electrical and Computer Engineering at the University of Florida. He received his Master of Science in electrical and computer engineering from the University of Florida in 2011. He graduated cum laude in 2009 from the University of Florida with a Bachelor of Science in electrical engineering and a Bachelor of Science in computer engineering. His research focuses on high-performance reconfigurable computing, tools and design methodologies for FPGA application development, and high-performance graph-processing methodologies and architectures. During his work as a doctoral student in the NSF Center for High-Performance Reconfigurable Computing (CHREC), he had the opportunity to lead several research projects investigating high-performance image processing, high-level synthesis and design tools, and many-core architectures and applications.
