SCALABLE GRAPH PROCESSING ON RECONFIGURABLE SYSTEMS

By ROBERT G. KIRCHGESSNER

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Robert G. Kirchgessner

To my parents, Robert and Janette, and my wife Minjeong

ACKNOWLEDGMENTS

I would like to express my deepest gratitude to all those who have helped me down this long road toward completing my doctoral studies. I thank my advisor, Dr. Alan George, for his wisdom, guidance, and support, both academically and personally, throughout my graduate studies; and my co-advisor, Dr. Greg Stitt, whose invaluable academic insights helped shape my research. I thank Vitaliy Gleyzer for his invaluable feedback, suggestions, and guidance, which made me a better researcher, and MIT/LL for the support and resources which made this work possible. I would also like to thank my committee members, Dr. Herman Lam and Dr. Darin Acosta, for their important suggestions, advice, and feedback on my work. Additionally, I thank my friends and colleagues: Kenneth Hill, Bryant Lam, Abhijeet Lawande, Adam Lee, Barath Ramesh, and Gongyu Wang, who have always provided me with support, both in my research and in my personal life. Furthermore, I would like to thank my loving wife, without whom this would not have been possible, and my parents, who knew I could achieve this well before I knew it myself. Last but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022. I would like to gratefully acknowledge equipment, tools, and source code provided by Altera (now part of Intel), Xilinx, GiDEL, Nallatech, and DRC Computing.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
   1.1 Field-Programmable Gate Arrays
   1.2 Graph Processing using Linear-Algebra Primitives
2 PRODUCTIVITY AND PORTABILITY MIDDLEWARE FOR FPGA APPLICATIONS
   2.1 Background and Related Research
   2.2 Approach
      2.2.1 Hardware Abstraction
      2.2.2 Software Abstraction
      2.2.3 Metadata and Extensible Core Library
      2.2.4 RC Middleware Toolchain
   2.3 Results and Analysis
      2.3.1 Convolution Case Study
      2.3.2 Analysis of Performance and Area Overhead
      2.3.3 Analysis of Productivity
      2.3.4 Analysis of Portability
   2.4 Summary and Conclusions
3 EFFICIENT STORAGE FORMATS FOR SCALABLE FPGA GRAPH PROCESSING
   3.1 Background and Related Research
      3.1.1 Coordinate Format (COO)
      3.1.2 Compressed Sparse-Column/Row Format (CSC/R)
      3.1.3 Doubly Compressed Sparse-Column/Row (DCSC/R)
      3.1.4 ELLPACK Format
      3.1.5 Jagged Diagonal Format (JDS/TJDS)
      3.1.6 Minimal Quadtree Format (MQT)
   3.2 Approach
      3.2.1 Hashed-Index Sparse-Column/Row (HISC/R)
      3.2.2 Hashed-Indexing Vector
      3.2.3 HISC/R Nonzero Storage
      3.2.4 Non-zero Lookups and Insertions
      3.2.5 Storage Analysis
   3.3 Results and Analysis
      3.3.1 Storage comparison
      3.3.2 Performance Comparison
   3.4 Summary and Conclusions
4 EXTENSIBLE FPGA ARCHITECTURE FOR SCALABLE GRAPH PROCESSING
   4.1 Background and Related Research
      4.1.1 Accelerating Sparse-Matrix Operations on FPGAs
      4.1.2 Standards for Graph Processing using Linear Algebra
      4.1.3 Linear-Algebra Formulation of Breadth-First Search
   4.2 Extensible Graph-Processor Architecture
      4.2.1 Merge-Sorter Architecture
         4.2.1.1 Sorting-pipeline architecture
         4.2.1.2 Merge-sorter controller
         4.2.1.3 Merge-sorter performance analysis
      4.2.2 ALU Architecture
      4.2.3 HISC/R Storage Controller
      4.2.4 FPGA Resource Analysis
   4.3 Experimental Setup
   4.4 Case Study: Sparse Generalized Matrix-Matrix Multiplication
   4.5 Case Study: Breadth-First Search
   4.6 Graph-Processor Architecture Scalability Analysis
   4.7 Summary and Conclusions
5 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Currently supported RC Middleware platforms.
2-2 Comparison of lines of code required when using RC Middleware.
2-3 Total map-generation time, estimated area and latency, and actual area and execution time for convolution case study optimizing for performance or area.
2-4 Execution time and area for various applications and kernels on each supported RC Middleware platform.
3-1 Definition of variables for sparse-matrix complexity analysis.
3-2 Analysis of popular sparse-matrix storage formats.
4-1 Merge-sorter PE next-state logic.
4-2 Graph-processor resource analysis.
4-3 Summary of parameters used to simulate SpGEMM scalability.

LIST OF FIGURES

1-1 FPGA architecture overview.
2-1 Overview of RC Middleware design flow.
2-2 RC Middleware hardware-abstraction layers enabling application portability between GiDEL PROCStar III and Pico M501.
2-3 Overview of RC Middleware's hardware presentation layer.
2-4 Overview of RC Middleware's software stack and generated C++ application stub.
2-5 Example of application-description XML format.
2-6 Overview of RC Middleware toolchain from application specification to vendor-specific project generation.
2-7 Example of RC Middleware mapping two application read interfaces to a single physical memory.
2-8 Area- and performance-optimized mapping results for mapping convolution application on PROCStar III/IV.
2-9 Host and FPGA read and write performance to external memory for PROCStar III/IV, M501, and PCIe-385N.
2-10 Host and FPGA read and write overhead to external memory for PROCStar III/IV and M501.
2-11 FPGA resource analysis for vendor, application, and RC Middleware components.
3-1 Comparison of the indexing techniques used by CSC/R, DCSC/R, and HISC/R.
3-2 Comparison of average probes required for row/column lookups vs. load factor for different hash table types.
3-3 Overview of HISC/R with segmented storage vectors using initial segment size L0 and growth factor k.
3-4 Pseudocode for HISC column lookups.
3-5 Pseudocode for HISC non-zero insertions.
3-6 Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by CSC/R for randomly generated scale-30 Kronecker matrices.
3-7 Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by DCSC/R for randomly generated scale-30 Kronecker matrices.
3-8 Comparison of total reads required to perform sparse matrix/matrix multiplication using HISC/R compared with CSC/R and DCSC/R.
4-1 Pseudocode for vertex-centric breadth-first search.
4-2 Graph adjacency-matrix representation.
4-3 Overview of graph-processor architecture.
4-4 Architecture diagram of merge-sorter PE.
4-5 Architecture diagram of merge-sorter pipeline.
4-6 Pseudocode for systolic-array priority function.
4-7 Merge-sorter architecture overview.
4-8 Overview of merge-sorter sorting modes.
4-9 Pipelined merge-sorter performance analysis.
4-10 Design of our ALU supporting various semirings.
4-11 Comparison of tabulation-hash quality metric for different PRNGs.
4-12 Controller architecture for HISC/R storage format.
4-13 Comparison of our architecture running SpGEMM with CombBLAS and SuiteSparse baselines.
4-14 Comparison of our architecture running BFS with state-of-the-art designs on the Convey HC-1/HC-2.
4-15 Scalability simulation approach.
4-16 Simulated SpGEMM speedup.

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SCALABLE GRAPH PROCESSING ON RECONFIGURABLE SYSTEMS

By Robert G. Kirchgessner

December 2016

Chair: Alan D. George
Cochair: Greg Stitt
Major: Electrical and Computer Engineering

Graphs are ubiquitous, capable of modeling the relationships between entities in any system. This flexibility has led to graphs becoming key data structures for computing, with scientific, commercial, and defense applications. Graph-processing applications, however, do not map well to conventional cache-based computing architectures. These cache-based architectures exploit data locality and maximize computational throughput, whereas graph-processing applications are typically memory-bound and data-driven, with highly irregular datasets. Recent advances in graph processing have opened the door to new methods of data analysis, bringing with them new scientific discoveries and applications, as well as increasingly larger datasets of interest. These increasing computational needs have driven the exploration of new methods and unique system architectures for graph processing. Reconfigurable computing (RC) on field-programmable gate arrays (FPGAs) provides a unique opportunity to develop custom architectures tuned for graph-processing applications, maximizing performance while also minimizing power. FPGA development, however, carries difficulties similar to those of hardware design, requiring developers to iterate through register-transfer-level (RTL) designs with cycle-level accuracy. Furthermore, the lack of hardware and software standards between FPGA platforms limits developer productivity and application portability, making porting applications and scaling to larger systems a time-consuming and challenging process. In this work, we first address the portability and scalability challenges of FPGA application development by developing a novel RC Middleware (RCMW). The RCMW provides an

application-centric development environment, exposing only the resources and interfaces required by an application, independent of the underlying platform. Next, we explore efficient sparse-matrix storage formats for FPGA graph processing, developing a new hash-based storage format which provides up to 40% performance improvement while requiring 23% less storage compared to competing storage formats. Finally, we leverage RCMW and our new storage format to develop a scalable graph-processor architecture, capable of providing over 200× performance-per-watt improvement compared to optimized CPU baselines. We demonstrate that our architecture, running BFS spanning-tree calculations, outperforms state-of-the-art designs on the Convey HC-1/HC-2 after adjusting for platform memory bandwidth. We present a scalability analysis of our architecture running SpGEMM on the Novo-G# multi-FPGA system using a combination of hardware experimentation and discrete-event network simulation.

CHAPTER 1
INTRODUCTION

Graphs are arguably the most powerful data structures in modern computing, capable of modeling any relation, either abstract or concrete, between entities. This flexibility has positioned graphs as central data structures in data analytics and scientific research, and has opened the door to new methods of data analysis and understanding through modern graph-processing techniques. Large-scale graph processing is a key component in modern scientific computing and data analytics [1], with many commercial [2] and defense applications [3], [4]. The increasing scale of graph datasets, and the computational complexity of graph algorithms, have led to the development of various specialized graph-processing frameworks on conventional distributed systems. Pregel [5] uses a bulk-synchronous parallel message-passing model where vertex-program computation is broken down into a series of synchronous super-steps. GraphLab [6] eliminates the explicit synchronization step of Pregel, providing an asynchronous shared-memory view of graph data, and a processing model similar to MapReduce. PowerGraph [7] builds on these existing frameworks but optimizes data distribution for power-law graphs. Finally, a more recent graph-processing framework known as GraphX [8] has been developed on the Apache Spark distributed-database engine. Graph-processing applications, however, do not map well to these conventional system architectures and programming platforms. Whereas conventional systems focus on computational throughput and data locality and reuse, graph-processing problems are typically memory-bound and data-driven, with highly irregular datasets [9]. Cache-based architectures are a liability for these applications, adding latency to computation and wasting power and chip resources [10]. These problems are further compounded in distributed systems, where the unstructured nature of graph datasets leads to inefficient data partitioning and load imbalances [11], [12]. The mismatch of graph-processing workloads on conventional computing systems, and the need to analyze increasingly larger graph datasets, have driven the exploration of new methods, algorithms, and distributed system architectures for graph processing [13], [14], [15], [16].

Reconfigurable computing technologies such as field-programmable gate arrays (FPGAs) offer a unique platform for exploring the design and development of novel architectures and techniques for graph processing. Leveraging the configurability of FPGAs, we can explore specialized architectures for graph processing which exploit the properties of graph datasets. Furthermore, the bit-level customizability and power efficiency of FPGAs enable us to develop architectures with competitive performance per watt compared to state-of-the-art throughput-oriented and vector-processing systems.

1.1 Field-Programmable Gate Arrays

FPGAs are a reconfigurable integrated-circuit technology which allows developers to specify the internal configuration of the chip programmatically. Hardware developers, armed with specialized hardware-description languages (HDLs), specify the internal circuit configuration of the FPGA down to individual logic gates, allowing them to create any behavior they require. The configuration information for each FPGA design is stored in a bitstream file which can be stored in non-volatile memory and loaded on power-up, or programmed into the FPGA over a serial interface such as JTAG. New hardware configurations can be loaded at any time, allowing developers to modify the FPGA's behavior to meet current processing requirements. FPGAs typically consist of a series of programmable logic elements, or configurable logic blocks (CLBs), connected together by a programmable interconnect network as shown in Figure 1-1. Although the CLB architecture varies between FPGA designs, a CLB often consists of at least one programmable lookup table (LUT), a register, and a MUX to select between an asynchronous or synchronous output. In modern FPGA architectures, some of the CLBs are replaced by specialized hardware components such as phase- or delay-locked loops, on-chip RAM, digital signal-processing cores, or microprocessors. The structure and behavior of each CLB, as well as how CLBs, other resources, and I/O blocks are routed together, can be programmed by the hardware developer to create their desired application. To create an FPGA configuration, a developer begins by capturing their design behavior using a hardware-description language such as VHDL or Verilog, or more recently a high-level

synthesis language such as OpenCL [17]. Next, the design is converted from a high-level description to a register-transfer-level (RTL) representation by a hardware-synthesis tool. A vendor-specific mapping tool then takes the RTL design and maps it onto the specific hardware architecture of the target FPGA, making sure the timing requirements of the design can be met. Finally, after placing and routing resources on the target FPGA, a bitstream file is generated which contains the configuration bits for the CLBs and the on-chip routing fabric.

Figure 1-1. FPGA architecture overview: configurable logic blocks (CLBs), each containing an N-input LUT, a register, and an output MUX, connected by a configurable interconnect fabric and surrounded by I/O blocks (IOBs).

FPGAs have recently been integrated into computing systems as hardware accelerators in high-performance reconfigurable computing (HPRC) applications. For these applications, FPGAs are typically integrated into a coprocessor board consisting of one or more FPGAs and on-board memory, and are typically connected to a host processor through a bus such as PCI Express. These FPGA accelerator cards typically come with a vendor-supplied development environment, enabling application developers to program the board and write custom application-specific software drivers. Due to a lack of standards between platforms, however, developers must tailor their application to a specific vendor's software and hardware interfaces. This platform-specific development cycle prevents portability, requiring significant developer time and effort to port applications to new platforms. Additionally, vendor-specific procedural APIs further limit portability. Procedural APIs embed platform-specific parameters into application code, including data marshalling and the physical location of application

resources. When porting an application to a new platform, this embedding forces developers not only to change APIs, but also to handle new platform-specific restrictions which may require significant changes to their application. In order to overcome the portability and productivity hurdles of FPGA-application development, we present a novel framework called the RC Middleware (RCMW). RCMW is a layered middleware which enables application and tool portability by creating an application-specific development environment. This application-specific view of resources allows developers to focus on the ideal resource configuration for their application, without worrying about where those resources exist on the underlying platform architecture. The RC Middleware, its design, use, and benefits are discussed in detail in Chapter 2.

1.2 Graph Processing using Linear-Algebra Primitives

The difficulty of designing parallel graph-processing algorithms using conventional vertex- and edge-centric approaches has led to the development of new techniques such as the linear-algebra formulation of graph processing [18]. This approach brings with it the benefits of the predictable access patterns of linear-algebra operations, and a higher level of abstraction that simplifies the implementation and parallelization of many graph algorithms [18], [19]. In order to maximize the scalability and performance of this approach, however, several key challenges must be overcome, such as how to map irregular graph datasets to distributed systems, and how to efficiently store and access graph datasets. Graph-adjacency matrices are typically sparse, having a total number of non-zero elements on the order of the dimension of the matrix, and follow a power-law degree distribution, where only a few rows or columns contain the majority of the non-zero elements [20]. When computing on these sparse datasets in a distributed system, they become hypersparse, having fewer than one non-zero per row/column on average [18]. Despite this degree of sparsity, large-scale graph datasets still require significant storage space, requiring several terabytes even for small problem sizes [21].
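To see where hypersparsity comes from, consider the two-dimensional block decomposition commonly used when distributing an adjacency matrix across processors. The following is an illustrative back-of-the-envelope calculation under that assumption (an n-by-n matrix with roughly c*n non-zeros spread over a sqrt(p)-by-sqrt(p) grid of p processors); it is not a result taken from the cited works. The average number of non-zeros per local column is

\[
  \frac{\mathit{nnz}/p}{\;n/\sqrt{p}\;} \;=\; \frac{\mathit{nnz}}{n\sqrt{p}} \;\approx\; \frac{c}{\sqrt{p}},
\]

which drops below one as soon as p exceeds c^2. Each local submatrix therefore becomes hypersparse long before the global matrix stops being merely sparse, which is why index structures whose size scales with the matrix dimension rather than the non-zero count become increasingly wasteful at scale.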

To maximize the scalability and performance of these graph-processing algorithms, sparse-matrix storage formats capable of providing scalable, low-overhead storage with low-latency access to data are critical [22]. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R) [23], which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off between storage and lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. Chapter 3 presents a detailed analysis of existing sparse-matrix storage formats, and presents a novel storage format called Hashed-Index Sparse-Column/Row (HISC/R), which is optimized for distributed graph processing with linear-algebra primitives. In order to perform a wide variety of graph algorithms using the linear-algebra primitives for graph processing discussed in [24], [25], the typical sparse-matrix operations must be extended to support an arbitrary semiring. Many graph algorithms require that multiple sets of data be computed while traversing the vertex set of a graph, such as the parent of each node in a breadth-first search tree. To handle this, we define our matrices over an arbitrary semiring which is specific to the algorithm being performed; without this semiring abstraction, we would need to couple our scalar sparse-matrix operations with operations on multiple datasets. (A brief illustrative sketch of a semiring-parameterized sparse operation is given at the end of this chapter.) This dissertation begins by addressing the portability and scalability limitations of high-performance reconfigurable computing (HPRC) applications on FPGA platforms in Chapter 2. In Chapter 3, we explore optimizations for graph processing, and propose a new hypersparse-matrix storage format for distributed graph processing on FPGAs. In Chapter 4, we combine what we learned about graph processing and storage formats in Chapter 3 with the FPGA-application portability framework discussed in Chapter 2, and present a scalable graph-processing architecture on FPGAs. Finally, Chapter 5 summarizes this work and presents our conclusions and insights.
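Returning to the semiring formulation above, the following minimal C++ sketch illustrates the idea of parameterizing a sparse operation by a semiring. The struct layout, CSR storage, and names below are assumptions made only for this example; they are not the interfaces of this work or of the libraries cited above.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Illustrative only: a sparse matrix-vector product over an arbitrary semiring.
    template <typename T>
    struct Semiring {
        T (*add)(T, T);  // semiring "addition"       (e.g., min, logical OR)
        T (*mul)(T, T);  // semiring "multiplication" (e.g., +,  logical AND)
        T zero;          // additive identity         (e.g., +infinity, false)
    };

    struct CsrMatrix {                    // conventional CSR layout for the sketch
        std::size_t n;                    // number of rows
        std::vector<std::size_t> rowPtr;  // size n + 1
        std::vector<std::size_t> colIdx;  // column index of each non-zero
        std::vector<double>      val;     // value of each non-zero
    };

    // y[i] = add over j of mul(A(i,j), x[j]), computed over the given semiring.
    std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x,
                             const Semiring<double>& sr) {
        std::vector<double> y(A.n, sr.zero);
        for (std::size_t i = 0; i < A.n; ++i)
            for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                y[i] = sr.add(y[i], sr.mul(A.val[k], x[A.colIdx[k]]));
        return y;
    }

    // Example instantiation: the (min, +) semiring used for shortest-path-style relaxation.
    const Semiring<double> minPlus{
        [](double a, double b) { return a < b ? a : b; },
        [](double a, double b) { return a + b; },
        std::numeric_limits<double>::infinity()
    };

Swapping minPlus for a Boolean (OR, AND) semiring turns the same loop into the frontier-expansion step used by the linear-algebra formulation of BFS; the hardware analogue of this substitution is the semiring-configurable ALU described in Chapter 4.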

CHAPTER 2
PRODUCTIVITY AND PORTABILITY MIDDLEWARE FOR FPGA APPLICATIONS

Field-programmable gate arrays (FPGAs) enable developers to create application-specific hardware architectures, providing several orders of magnitude performance improvement [26], [27] while also improving computational efficiency [28], [29] for applications which do not map well to conventional CPU and GPU architectures. This flexibility and efficiency has made FPGAs ideal for various applications, from embedded systems [30] to supercomputers [31]. These benefits, however, come with the added complexity of hardware design, limiting developer productivity relative to fixed-logic devices, and preventing widespread usage of FPGAs. The difficulty of RTL-level design coupled with a lack of standards between FPGA accelerator platforms, herein referred to as platforms, complicates application development and limits code reusability. Due to a lack of standards between platforms, developers must tailor their application to a specific vendor's software and hardware interfaces. This platform-specific development cycle prevents portability, requiring significant developer time and effort to port applications to new platforms. Additionally, vendor-specific procedural APIs further limit portability. Procedural APIs embed platform-specific parameters into application code, including data marshalling and the physical location of application resources. These portability issues extend to high-level synthesis (HLS) tools and languages, which intend to improve developer productivity. Although HLS tools typically provide support for at least one platform out-of-the-box, the growing number of HLS tools and FPGA platforms outpaces the ability of tool vendors to provide platform support, leaving the challenge of supporting new platforms to application developers. These problems ultimately reduce HLS tool performance and usability, and end up costing tool vendors and application developers valuable time which could better be spent on developing their tools and applications. In order to help overcome the portability and productivity hurdles of FPGA-application development, we present the RC Middleware (RCMW). RCMW is a layered middleware which enables application and tool portability by creating an application-specific platform

abstraction. Developers specify their application's required resources and interfaces at design time, customizing the number, type, size, and data types of interfaces. Using this specification, RCMW provides a portable application-specific hardware and software interface. One major research challenge for enabling application portability is providing standardized interfaces to application-specific resources which are independent of the underlying platform, while also minimizing overhead. Platform details, such as the number and type of FPGAs and the size and performance of external memories, require careful consideration when mapping an application onto a target platform. To address these challenges, the RCMW toolchain determines the application-to-platform mapping at compile time, selecting an appropriate mapping based on a user-customizable cost function. Using RCMW, developers can focus on their application or tool rather than implementing their designs onto a specific platform. In this chapter, we present and evaluate RCMW using four platforms from three vendors: the PROCStar III [32] and PROCStar IV [33] from GiDEL; the M501 [34] from Pico Computing; and the PCIe-385N [35] from Nallatech. We demonstrate the ability to quickly explore different application-to-platform mappings with the RCMW toolchain using a representative convolution case study. We show that the benefits of RCMW can be achieved with minimal overhead: less than 7% in performance and 3% in area in the common case. We also demonstrate RCMW's productivity benefits by showing that RCMW requires less development time and fewer lines of code for deploying applications compared to the recommended vendor approaches. Finally, we demonstrate application portability using RCMW by executing the same application hardware and software source, for several applications and kernels, across each supported platform. The remainder of this chapter is organized as follows. Section 2.1 presents background and related work. Section 2.2 presents the RC Middleware framework and toolchain. Section 2.3 presents our experiments, results, and analysis. Section 2.4 provides a summary and concludes this chapter.

2.1 Background and Related Research

The lack of FPGA-accelerator standards has resulted in the development of vendor-specific APIs which limit application portability and developer productivity. To address this issue, OpenFPGA proposed a procedural C-based API standard for managing RC accelerators [36]. The OpenFPGA standard defines functions for initializing, managing, and communicating with FPGA accelerators, but requires developers to embed platform-specific application details such as the physical location of application resources. The Simple Interface for Reconfigurable Computing (SIRC) [37] is an object-oriented communication interface which provides functionality similar to the OpenFPGA standard, but enables portability using platform-specific subclasses. Although OpenFPGA and SIRC define comprehensive APIs, both require embedding platform-specific application details, which limits application portability. RCMW also provides a portable API standard, but overcomes this limitation by providing an object-oriented representation of application resources which encapsulates platform-specific details and enables application portability. High-Level Synthesis (HLS) tools address the productivity hurdles of FPGA-application design by providing high-level software-style development environments, but typically have limited platform support. HLS tools such as ROCCC [38] and Impulse-C [39] provide a C-style development environment and stream-optimized programming model, but take different approaches to platform support. ROCCC generates RTL cores with streaming interfaces, but requires developers to handle the platform-specific implementation. Impulse-C generates synthesizable HDL cores and an application driver from a single application source, and enables portability using platform-support packages (PSPs). PSPs wrap platform-specific interfaces to enable portability; however, due to their complexity and the large number of available platforms, developing PSPs is typically left as a challenge for the end user. Recent efforts such as Altera OpenCL [40] enable developers to create portable FPGA-application kernels using OpenCL. Similar to Impulse-C's PSPs, Altera OpenCL uses board-support packages (BSPs) to target a specific platform.

The FUSE framework [41] provides an OS-level abstraction of hardware accelerator resources, transparently scheduling software tasks on available hardware accelerators. Similarly, SPREAD [42] provides a unified hardware and software threading model, but takes advantage of partial reconfiguration to dynamically schedule hardware tasks. Liquid Metal (Lime) [43] also defines a unified hardware and software threading model, but enables developers to create mixed FPGA and CPU applications using Java. Similar to Lime, hthreads [44] enables developers to create mixed applications, but instead uses a C-based POSIX threading model. In order to target a platform, these tools and frameworks must provide a custom platform-specific hardware and software support package. RCMW is a complementary approach and could be leveraged by these tools and frameworks to generate a customized portable support package, allowing tool developers to focus on improving their tools instead of platform support. System-design tools such as SpecC [45] assist developers with design-space exploration and partitioning applications across multiple devices. SpecC enables developers to create a high-level application specification and refine it to select an architecture model, communication model, and finally create synthesizable RTL. OpenCPI [46] is a component-based application middleware for heterogeneous systems which enables seamless communication between components across devices including FPGAs, GPUs, and CPUs. Similarly, the System Coordination Framework [47] simplifies task communication between heterogeneous devices including CPUs and FPGAs by creating a partitioned global address space. SIMPPL [48] and IMORC [49] provide frameworks for creating applications from networks of components on a single FPGA. SIMPPL wraps IP cores with a core-specific network controller and enables asynchronous communication. IMORC also creates a network of components, but uses a multi-bus interconnect architecture. Although these approaches simplify the development of component-based applications, they still require significant developer time and effort to port applications to new platforms. RCMW is a related approach which automatically handles mapping application components and resources onto a target platform using a customizable mapping algorithm.

RCMW could be leveraged by these tools to handle FPGA-component mapping and provide portable hardware and software interfaces to components. An alternative approach to enabling FPGA-application portability is to create virtual-FPGA overlays of application-specific resources. Intermediate fabrics [50] are coarse-grained virtual-FPGA fabrics customized for a particular application domain. Similarly, [51] presents a device-level middleware with customizable resources for software-defined radio applications. These approaches enable device-level portability by providing the same coarse-grained resources independent of the target device. RCMW is a complementary approach, and could provide portable resource interfaces to these virtual-FPGA fabrics. Platform vendors typically provide tools to assist with application development. Two notable examples are Nallatech's DIMEtalk [52] and GiDEL's PROCWizard [53]. DIMEtalk provides a graphical interface to create networks of components and generate FPGA bitfiles. PROCWizard generates an HDL wrapper and C++ interface based on the developer's specified clocks, registers, and customized physical-memory interfaces. Since developers design their applications by customizing platform-specific resources, effort is still required when porting between platforms. To overcome this limitation, RCMW enables developers to configure application-specific resources without assuming any knowledge of the underlying platform. LEAP Scratchpads [54] provide cached virtual-memory interfaces and simplify FPGA-application memory management. Altera's Avalon [55] and ARM's AXI [56] protocols were created to enable component interoperability, and define streaming and memory-mapped interfaces. RCMW defines interfaces optimized for streaming applications, but can be extended to support any interface using the extensible core library. LEAP Scratchpads, Avalon, and AXI could be added to RCMW, allowing developers to request the ideal interface for each application resource while maximizing performance and minimizing design area. The RC Middleware enables application portability by providing an application-specific view of available hardware and software interfaces, independent of the underlying platform. Using RCMW, developers specify the required resources and interfaces needed by their

applications at design time, and RCMW handles determining the application-to-platform mapping at compile time. RCMW is extensible, allowing support for new interface and resource types to be added by extending the RCMW core library. An earlier version of this work can be found in [57], in which we demonstrate a previous version of RCMW. Since that work, we have developed an RCMW driver which enables us to support platforms without vendor support packages. Leveraging our driver, we added support for the Nallatech PCIe-385N featuring an Altera Stratix-V FPGA and explored this platform in our experiments. Additionally, we have extended the RCMW toolchain to include a best-first search algorithm to select the application-to-platform mapping. This algorithm provides a faster alternative to the exhaustive approach presented in our previous work. In addition to a case study demonstrating the RCMW design methodology, we have included results for the platforms from our earlier work in this chapter for completeness.

2.2 Approach

In order to enable FPGA-application portability, we must provide a standardized view of application resources independent of the underlying target platform's hardware configuration and software API. RCMW enables this standardized view using customizable hardware and software middleware consisting of three layers of abstraction, as shown in Figure 2-1A. From the bottom up, these layers are the translation layer, the presentation layer, and the application layer. First, the translation layer translates platform-specific hardware and software interfaces to standardized RCMW interfaces. Next, the presentation layer leverages these standardized interfaces, creating the application-specific hardware and software interfaces specified by the developer. Finally, these application-specific resources and interfaces are presented to the developer in the application layer, independent of the underlying platform. Figure 2-1B overviews RCMW's design methodology. Using RCMW, the application developer only needs to develop their application and create an XML-based description of the resources and interfaces needed by their application. This description contains details about application components, required memories, and memory-mapped registers.

Figure 2-1. Overview of RC Middleware design flow. A) Layered model of hardware and software abstractions enabling application/tool-generated source portability across heterogeneous FPGA platforms. B) Overview of design methodology for executing applications on specific platforms.

When ready
to execute their application on a specific platform, the developer provides the application description to the RCMW toolchain and specifies a supported target platform. RCMW selects an application-to-platform resource mapping based on a user-definable cost function optimizing for parameters such as minimal device area or interface latency. Using the selected mapping, RCMW generates a ready-to-compile project file for generating the FPGA bitfiles, and a C++ class which provides both interfaces to application resources and a stub function in which developers write their application code. This generated C++ class and stub function are herein referred to as the application stub. By enabling developers to focus on the resources and interfaces needed by their application, RCMW improves productivity by simplifying application development, and enables application portability by customizing hardware and software middleware layers to create application-specific interfaces. Although RCMW is intended to enable application portability regardless of the application class, the RCMW core library currently provides cores optimized for streaming applications, which are the focus of our case studies in this chapter.
Figure 2-2. RC Middleware hardware-abstraction layers enabling application portability between GiDEL PROCStar III and Pico M501.
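To make the application-stub concept concrete, the sketch below shows roughly what the developer-written portion of such a stub might look like for the convolution example of Figure 2-2, which uses go and done registers and kernel, signal, and result memories. The class and resource names follow the generated stub shown later in Figure 2-4, but the specific method names (write(), read(), and the like) are illustrative assumptions rather than RCMW's actual API.

    // Hypothetical sketch of the developer-written body of an RCMW-generated stub.
    // Resource members (go, done, kernel, signal, result) and the Application,
    // Register, and Memory classes come from Figure 2-4; the method names used on
    // them are assumptions, not RCMW's documented signatures.
    void Convolution::execute()
    {
        std::vector<float> hostKernel(64), hostSignal(1 << 20), hostResult(1 << 20);
        // ... fill hostKernel and hostSignal with application data ...

        kernel.write(0, hostKernel.data(), hostKernel.size());  // load kernel memory
        signal.write(0, hostSignal.data(), hostSignal.size());  // load signal memory

        go.write(1);                             // assert the go register
        while (done.read() == 0) { /* poll */ }  // wait for the done register

        result.read(0, hostResult.data(), hostResult.size());   // read back results
    }

Because the stub exposes only application-level resources, code like this never names a physical memory bank or vendor API call; the same source could run unchanged whether the toolchain mapped the three memories to separate DDR banks (as on the PROCStar III) or multiplexed them onto a single bank and BRAM (as on the M501).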

The remainder of this section is organized as follows. Section 2.2.1 presents the RCMW hardware abstraction layers. Section 2.2.2 presents the RCMW software abstraction layers. Section 2.2.3 discusses the RCMW XML metadata formats and extensible core library. Finally, Section 2.2.4 presents the RCMW toolchain and mapping algorithm. 2.2.1 Hardware Abstraction

Figure 2-2 illustrates an example of the three layers of hardware abstraction enabling portability for a convolution application. In this example, the developer has specified two input memories, one for the convolution kernel and one for the input signal, and one output memory for the convolution results. Since the GiDEL PROCStar III board has three external memories, each application memory can be assigned to a separate physical memory. The Pico M501, however, only has one external memory, requiring that the three application memories either share a single physical memory or make use of on-chip block RAM (BRAM). The developer also requested go and done memory-mapped registers for triggering the application and waiting for it to finish.

The physical layer consists of the low-level hardware interface controllers for external memories and host communication. We have leveraged vendor-supplied components for these interfaces wherever possible to avoid recreating existing interfaces without any significant benefits. In cases where no vendor components are provided, such as for the Nallatech PCIe-385N, we leveraged Altera/Xilinx IP cores and created custom HDL components. The translation layer is responsible for converting the platform-specific interfaces from the physical layer into a standardized interface that the rest of the RCMW toolchain understands. This layer is generated by the RCMW toolchain leveraging the RCMW core library, and depends on the application-to-platform mapping. The presentation layer handles creating the application-specific interfaces requested by the developer using the standardized interfaces exposed by the translation layer. The presentation layer is customized by the RCMW toolchain based on available platform resources and requested application resources, and is generated at compile time. The HDL cores leveraged in generating this layer are stored in an extensible core library, which is discussed later in Section 2.2.3.

In order to enable a configurable number of developer-requested interfaces to platform resources, the RCMW core library includes a configurable arbitration controller. This arbitration controller can handle multiplexing any number and type of application resources to a physical resource or BRAM. Although having too many application resources mapped to a single platform resource could degrade performance, this controller is required to enable application portability between platforms with different resource configurations. When available physical-memory bandwidth is greater than the required application bandwidth, the middleware can saturate multiple application interfaces without significant loss in performance. RCMW's customizable arbitration controller uses a request-to-send (RTS) and clear-to-send (CTS) protocol to arbitrate between application interface controllers. This protocol can be used to implement any arbitration scheme, from simple round-robin to adaptive arbitration schemes such as Bandwidth-Sensitivity-Aware arbitration [58], allowing RCMW to optimize design area and performance depending on application configuration. In the case that no arbitration logic is required, the interface controller can be directly mapped to the translation-layer interface, minimizing area overhead.

RCMW currently provides two standardized interface protocols: the burst interface and the FIFO interface. These protocols were selected since they are commonly used in streaming applications, but additional interface protocols can be supported by extending the RCMW core library. The burst interface enables applications to address an application memory sequentially. The interface word size can be any power-of-two number of bytes. The application specifies the starting byte-aligned address and the size in memory words, and asserts the start signal to begin a transfer. The interface will transfer the requested amount of data and assert the done signal. The FIFO interface enables application software to read or write data streams to application hardware in first-in, first-out order. The FIFO word size can be any power-of-two number of bytes. The application first toggles the reset signal to reset the FIFO buffer and read/write pointers. Then the application reads/writes data to the interface, asserting the flush signal for write interfaces when the stream is empty. When the read or write stream is complete, the EOS signal is asserted, indicating the end of the data stream. Both interface types require the enable and read valid or write ready signals for flow control. Flow control is required by all interfaces due to differences in performance between platforms.

Figure 2-3 provides a detailed illustration of the presentation layer. Each application memory has one or more interfaces. Using the configurable arbitration module described previously, any number of application memories and interfaces can be mapped by the RCMW toolchain to a physical memory. In the case that multiple application memories are mapped by RCMW to the same physical resource, there must be a virtual separation to prevent resources from affecting each other. This virtual separation is created by the RCMW toolchain using the generic parameters, including the base address and memory size, of each HDL interface controller. The base address corresponds to the address in physical memory where the application memory begins. The size of the memory is used to calculate address-wrapping conditions. In addition to memory interfaces, RCMW provides a separate memory-mapped interface to the application. This interface maps application resources, such as memory-mapped registers, to a host-controlled bus. The application layer presents the application-specific HDL interfaces specified in the application description to each application core, and generates a vendor-specific project for each FPGA where an application core is mapped.

Figure 2-3. Overview of RC Middleware's hardware presentation layer.

2.2.2 Software Abstraction

Each hardware abstraction layer described in the previous section has a corresponding layer in software. Figure 2-4 illustrates the layered software model and RCMW-generated application stub. RCMW uses a portable object-oriented software API which provides standardized interfaces to application resources. The physical layer corresponds to the software driver interface. Although we try to leverage vendor-supplied drivers wherever possible in order to minimize development overhead, we developed an RCMW PCI-Express driver for platforms without a vendor-provided driver, such as the Nallatech PCIe-385N. The software translation layer wraps platform-specific APIs and provides a standardized software interface to platform resources. RCMW requires that each supported platform have a subclass of the RCMW Board class. This Board class defines

    // RCMW-generated application stub for the convolution example (Figure 2-4)
    class Convolution : public Application
    {
    public:
        void bind(Board &board);
        void execute(); // User stub
    private:
        // User-defined resources
        WriteRegister go;
        ReadRegister done;
        Memory kernel, signal, result;
    };

    void Convolution::execute()
    {
        // User application stub
    }

(Figure 2-4 also depicts the accompanying software stack, from the RCMW user-application API at the top, through the RCMW runtime library, down to the vendor API and platform driver interface.)

Figure 2-4. Overview of RC Middleware’s software stack and generated C++ application stub

the required interfaces for the upper API levels, such as: blocking and non-blocking DMA read/write; board enumeration and initialization; clock configuration; and bitfile programming. The Board class encapsulates FPGA and Memory objects which represent physical platform components. The presentation layer handles mapping the application-specific resource interfaces onto the Board class interface provided by the translation layer. This layer is generated by the RCMW toolchain as a subclass of the RCMW Application class. Figure 2-4 illustrates the application-specific interface for a convolution example, with two registers: go and done, and three memories: kernel, signal and result. The RCMW toolchain-generated Application subclass encapsulates an instance of each resource specified by the application description. It provides Register objects, which are mapped onto the memory-mapped interface, Memory objects, which correspond to hardware sequential interfaces, and FIFO objects, which corre- spond to FIFO hardware interfaces. The Application class provides two functions: bind(...) and execute(...). The bind function is generated by the RCMW toolchain along with the application stub. The bind function handles the mapping of application resources onto a target platform at runtime, based on the RCMW-selected application-to-platform mapping. The execute function is the stub where developers implement their application software using the resources exposed

28 by the application class. RCMW provides a concurrent API which allows developers to allocate, manage, and communicate with application resources concurrently and portably. At runtime, RCMW handles detection of available platforms, selecting which platform will execute each application, initializing and configuring FPGAs, as well as managing threads for concurrent transfers and memory consistency. If a bitfile for an application is not available for a particular platform, the developer must first compile the RCMW toolchain-generated project using the vendor toolchain before being able to execute it. If a bitfile is found, the bind function is called on the selected Board object instance, and the application execute function is assigned to an idle software thread. When the application completes, RCMW releases platform resources. The application layer exposes an application-customized subclass of an Application class generated by the RCMW toolchain. This approach provides a portable programming model and allows developers to launch multiple application instances with RCMW handling platform configuration and scheduling. Developers are provided with standardized application interfaces without having to worry about where or how they are mapped onto a target platform. 2.2.3 Metadata and Extensible Core Library

This section provides an overview of the various XML-based metadata formats used in RCMW. There are three different metadata formats: the application description; platform description; and core description. The application description is used by developers to specify their application’s required resources and interfaces. The application description contains information about each core in an application, including HDL source files and any register or memory resources required. Application cores can specify any number of register or memory resources, with any number and type of resource interfaces. The application description also contains information about the structure of the application, including any core instances, and how those instances are interconnected. An excerpt of the application description from a convolution example can be seen in Figure 2-5. Although details have been excluded, the overall application description can be understood. In this example, the developer specifies
a core called main, composed of two source files: Convolution.vhd and Datapath.vhd. The developer specifies a memory with a burst read interface for storing the kernel data.

Figure 2-5. Example of application-description XML format. The listing declares a core named main built from Convolution.vhd and Datapath.vhd, go and done registers, and a 256MB memory with an rcmw_sequential_rd burst-read interface.

The platform description is used to describe a platform's resources such as FPGAs and memories, as well as their interfaces and physical connections. This description enables the RCMW toolchain to understand the available resources and how they are connected. The platform description contains the hardware details of a platform, with the software details captured by the platform-specific Board class as described in Section 2.2.2. Developers or platform vendors can easily extend RCMW to include a new platform by creating a platform description. In cases where the RCMW core library contains all required HDL components, no additional coding is needed. If the platform requires device-specific IP instances or interfaces not supported by RCMW, the core library must be extended with the necessary components. The core description describes the interfaces and function of the HDL core components in the RCMW core library. These cores are used by the RCMW toolchain to resolve the connections between application interfaces and platform resources. The core metadata
describes core generics, clocks, resets, interfaces, and the dataflow between interfaces. Additionally, the core metadata contains information about device-specific area and performance costs in terms of LUTs and average latency. The area-cost data is based on previous post-fit results reported by the vendor toolchain. The core metadata format is more complex than the other formats since we have to handle generic interfaces with generic port widths, or even a generic number of interfaces as in the case of a generic MUX. Cores can also specify device-specific architectures to optimize cores for a particular FPGA. To allow for generic attributes, the values of attributes for interfaces in a core are allowed to be algebraic functions of the core's generics. These attributes are then resolved when the generic interface is bound to another interface during the mapping step of the RCMW toolchain. In the case of a core with a generic number of interfaces, the core entity declaration is generated by a core-specific script with RCMW-toolchain-determined generic values. The RCMW core library contains interface adapters, arbitration controllers, and MUXes. Developers are able to add additional components to the RCMW core library by providing the core HDL, XML metadata, and generate script if required. Once the core is added to the core library, the RCMW toolchain will automatically include it when mapping application resources. Applications can also request virtual-core instances from the RCMW core library in their application description. For example, if an application requires an FFT core, the core library can be extended to include an FFT core with different architectures optimized for Altera or Xilinx FPGAs. During the mapping process, the RCMW toolchain will replace the virtual-core instance with the optimal core from the RCMW library.

2.2.4 RC Middleware Toolchain

The previous sections presented the hardware and software abstraction layers which enable application portability between heterogeneous platforms using RCMW. In order to provide an application-specific view of resources, the RCMW toolchain generates customized translation and presentation layers based on the target platform and required application resources. This section presents an overview of the RCMW toolchain and discusses our application-to-platform mapping approach.

Figure 2-6 depicts the toolchain flow in four steps: (1) the developer creates the application resource description; (2) the developer executes the RCMW toolchain, specifying the application description and target platform; (3) the mapper creates the application-to-platform mapping using the core database and generates Mapping.xml; and (4) additional RCMW tools use Mapping.xml to generate the vendor-tool project files, HDL, and the C++ software stub.
Figure 2-6. Overview of RC Middleware toolchain from application specification to vendor-specific project generation.

Figure 2-6 illustrates the RCMW toolchain. In order to use the RCMW toolchain, an application developer only needs to develop their application logic, and describe the required resources and interfaces in the application description. The developer then executes the RCMW toolchain providing the application description, and specifying a target platform from the RCMW platform database. The RCMW toolchain determines an application-to-platform mapping based on a configurable cost function. Using the mapping results, the RCMW toolchain calls a C++ stub generator, which creates an application-specific stub similar to Figure 2-4, an HDL generator, which instantiates the required HDL entities and connects them together as shown in Figure 2-2, and a vendor-specific project generator for compiling the FPGA bitfile(s). The mapping process consists of two stages: (1) determine how to map each application resource to platform resources, such as application memories to physical memories, and application cores to FPGAs; and (2) determine how to connect each application interface to

the platform resource selected in (1). In the previous version of RCMW described in [57], we used an exhaustive search to explore every valid application-to-platform mapping. For applications with few components, this approach is acceptable. However, for large applications and multi-FPGA platforms, the number of possible mappings grows considerably. To overcome this limitation, our updated mapping approach uses heuristics to guide the mapping process. The first stage generates a list of candidate application-to-platform resource mappings to be considered. Each candidate mapping is generated by first selecting an FPGA for each application core, and then selecting an appropriate platform resource for each application resource. For example, an application memory could be mapped to a block RAM or an external memory bank. In order to reduce the number of candidate mappings to be explored in stage two, we use the number of FPGA boundaries a datapath must pass through as a heuristic which estimates the cost of the path. Candidate mappings that place connected application cores and resources on the same FPGA, or in local memories, are favored over mappings that spread components across multiple FPGAs. Once the list of candidate mappings has been generated, the next stage determines how to implement each candidate mapping, and calculates the associated mapping cost using a customizable cost function. After calculating the cost for each candidate mapping, the minimum-cost mapping is selected. The second stage explores each candidate mapping and determines how to connect each application interface to the selected platform resource using customizable cores from the RCMW core library. Each core in the RCMW core library is characterized by an XML-based core description which represents the core's interfaces, generics, and data-flow graph between each interface. An interface in RCMW is characterized by a unique type, data-flow direction, and collection of ports. Each port in an interface is characterized by a width, direction, and a type such as clock, reset, or data. Interfaces with compatible types, direction, and ports can be mapped together. Mapping interfaces together may require binding core generics to a particular value, such as the data port width. The core description can optionally include a Python-based

script which allows for automated generation of an HDL entity declaration in cases where the entity provides a generic number of interfaces, such as the configurable arbitration controller. The process of determining what cores from the RCMW core library are needed to connect an application interface to a target platform resource is similar to path finding, where the starting position is an application interface and the goal is the target platform resource. Each node in the path to the goal corresponds to a core instance from the RCMW core library, including cores which convert interface types, cross clock domains, or merge multiple datapaths using a resource arbiter. At each iteration of this mapping process, there is a set of interfaces that need to be resolved to their target resource and a set of candidate cores which match those interface types. The set of candidate cores plus the current path create a new set of paths which need to be explored. The mapper uses a best-first search algorithm, selecting the next candidate path to explore using a knowledge-plus-heuristic cost function. The knowledge-based cost is the cost of the current core instances in the selected path, such as the estimated FPGA resources or latency. The heuristic cost estimates the cost for any application interfaces which have not yet been resolved in the current path, and is estimated by weighting the current knowledge-based cost by the number of unresolved interfaces.

c(p) = g(p) + h(p) (2–1)

h(p) = g(p)(N0 − n)u(N0 − n) (2–2)

The overall cost function is presented in Equation 2–1 where g(p) is the knowledge-based cost, and h(p) is the heuristic cost. We define the heuristic-cost function in Equation 2–2 where N0 is the number of application interfaces that need to connect to the target physical resource, n is the number of application interfaces the current path resolves, and u is the unit step function. The heuristic function provides a lower bound on the path cost by estimating that the unresolved application interfaces will likely require a similar set of core instances as the current path. The step function acts to remove the heuristic cost once the current path

can support all application interfaces. This cost function can be modified to improve mapping results, or optimize for different parameters, and will be explored further in our future work. While this approach does not guarantee an optimal solution, it guarantees that we efficiently find a mapping that can resolve all application interfaces. Figure 2-7 illustrates the second stage of the mapping process for resolving two application read interfaces, signal and kernel, to a single DDR2 memory bank. Each panel in Figure 2-7 shows one step of the mapping process. The rectangles refer to cores from the RCMW core library. This example uses a simplified set of cores and area costs to illustrate the mapping process; the cores and their respective costs can be found at the bottom of each mapping step.
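To make the search mechanics concrete, the following self-contained C++ sketch mimics one iteration of this best-first exploration using the illustrative LUT costs from Figure 2-7. The types, expansion rules, and termination check are simplified stand-ins for exposition, not the actual RCMW mapper implementation.

#include <cstdio>
#include <queue>
#include <vector>

// A partial mapping path: its knowledge-based cost and how many application
// interfaces it can currently service.
struct PartialPath {
    double g;        // estimated LUTs of core instances selected so far
    int resolved;    // n, the number of application interfaces resolved
};

// c(p) = g(p) + h(p), where h(p) = g(p)(N0 - n) while n < N0 (Equations 2-1 and 2-2).
double pathCost(const PartialPath& p, int n0) {
    double h = (p.resolved < n0) ? p.g * (n0 - p.resolved) : 0.0;
    return p.g + h;
}

struct CostOrder {
    int n0;
    bool operator()(const PartialPath& a, const PartialPath& b) const {
        return pathCost(a, n0) > pathCost(b, n0);   // min-heap on c(p)
    }
};

int main() {
    const int n0 = 2;   // signal and kernel read interfaces share one DDR2 bank
    std::priority_queue<PartialPath, std::vector<PartialPath>, CostOrder> open(CostOrder{n0});

    // Seed with the burst-read controller from Figure 2-7A: 100 LUTs, resolves one interface.
    open.push({100.0, 1});

    while (!open.empty()) {
        PartialPath p = open.top();
        open.pop();
        // Simplified goal test; the real mapper checks that the path reaches the
        // target platform resource through the low-level memory controller.
        if (p.resolved >= n0) {
            std::printf("selected path, c(p) = %.0f\n", pathCost(p, n0));
            break;
        }
        // Expand with candidate cores from the library (Figure 2-7B): a generic
        // rd_arbiter at 50 LUTs per interface, which resolves both interfaces ...
        open.push({p.g + 50.0 * n0, n0});
        // ... or an additional read-interface core at 150 LUTs for the second interface.
        open.push({p.g + 150.0, 1});
    }
    return 0;
}

With these numbers the arbiter path costs c(p) = 200 while the alternative costs 500, so the arbiter is popped first and selected, matching the minimum-cost choice shown in Figure 2-7B.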

[Figure 2-7 data: the example core library contains rcmw_burst_rd (100 LUTs), rd_arbiter (50·n LUTs), ddr2_cntl (300 LUTs), and mem_rd_if (150 LUTs); the completed mapping in panel D has a total cost of c(p) = 750.]

Figure 2-7. Example of RC Middleware mapping two application read interfaces to a single physical memory. A) Instantiating application burst-read controller. B) Instantiating generic read-arbitration controller. C) Instantiating remaining IP to close path to memory. D) Mapping complete.


In each step of the mapping, there are multiple open paths which need to be considered, each consisting of a set of instances from the RCMW core library, and a node which needs to be expanded next. In the first step in Figure 2-7, the signal interface is expanded first. The RCMW library is searched for candidate cores which provide the required interface type for the signal interface. One matching core is found, and is added to the current path. In the second step, the previous path is expanded to find two candidate cores: an arbitration core which supports a generic number of interfaces, and a low-level controller for interfacing with memory. Since the arbitration controller supports a generic number of interfaces, it reduces the heuristic cost and is selected as part of the minimum-cost path. Steps three and four combine several iterations using the same method to illustrate the remaining cores being selected to complete the mapping. In our experiments, we explore area- and performance-optimizing cost functions. The area-optimizing g(p) is equal to the total estimated lookup tables (LUTs)

Table 2-1. Currently supported RC Middleware platforms.
Platform      Vendor          FPGA(s)              Memory/FPGA                  Host Interface
PROCStar III  GiDEL           4x Stratix III E260  1x 256MB DDR2, 2x 2GB DDR2   PCIe Gen1 8x
PROCStar IV   GiDEL           4x Stratix IV E530   1x 512MB DDR2, 2x 4GB DDR2   PCIe Gen1 8x
M501          Pico Computing  1x Virtex-6 LX240T   1x 512MB DDR3                PCIe Gen2 8x
PCIe-385N     Nallatech       1x Stratix V SGSMD5  2x 4GB DDR3                  PCIe Gen2 8x

of core instances in the current path. The performance-optimizing g(p) is equal to the maximum estimated latency from any application interface to the target platform resource.

2.3 Results and Analysis

In this section, we present experiments which demonstrate the portability and productivity benefits of RCMW. We evaluate these benefits using four platforms from three vendors, detailed in Table 2-1. First, we begin with a convolution application as a case study. We use RCMW to map the application to each supported platform, and look at the differences in application-to-platform mapping results for both area- and performance-optimizing cost functions. Next, we evaluate the performance and area overhead incurred when using RCMW compared to native vendor interfaces. Finally, we evaluate the productivity and portability benefits of using RCMW using several streaming applications. In our experiments, we compiled Altera bitfiles using Quartus II v13.0sp1. We used GiDEL driver version 8.9.3.0. Bitfiles for the Pico M501 were generated using Xilinx ISE 14.7. We used Pico driver version 5.2.0.0. RCMW's software API was compiled using GCC v4.7.2 with C++11 support. All software was compiled using optimization flag -O3.

2.3.1 Convolution Case Study

This case study explores the complete development cycle for a convolution application using RCMW. Although this example requires a relatively simple resource configuration, it is representative of using the RCMW toolchain for more complex applications. We examine the required developer effort in terms of hardware and software lines of code, and lines of

[Figure 2-8 data: mapping diagrams showing the application's signal, kernel, and result memories and the "convolve" core mapped onto the platform FPGA and memory banks A-C.]

Figure 2-8. Area- and performance-optimized mapping results for mapping the convolution application on the PROCStar III/IV. A) Area-optimized mapping. B) Performance-optimized mapping.

XML. We explore the RCMW toolchain results for both area- and performance-optimizing cost functions for each platform. We examine the toolchain execution time, estimated area in LUTs, estimated interface latency in clock cycles, actual post-fit area in LUTs, and execution time for each platform. The estimated area and interface latency are used by the RCMW toolchain to evaluate each cost function. The convolution application performs 1-D convolution of 32-bit integers. We use a randomly generated 2-million point signal and 96-point kernel. Figure 2-8 illustrates the area-optimized (Figure 2-8A) and performance-optimized (Figure 2-8B) mappings generated by the RCMW toolchain for the convolution application on the PROCStar III and IV. The circles represent platform and application resources, with the solid lines indicating interfaces between resources. The dotted lines indicate the platform resource to which it is mapped. The area-optimizing cost function selects the mapping with the minimum area in LUTs, and does not take into account on-chip block RAM (BRAM). Using this cost function, the RCMW toolchain maps all application memories to the 2 and 4 GB banks of the PROCStar III and IV, respectively. This mapping minimizes the number of LUTs by minimizing the number of memory controller instances, which require significantly more area than the

Table 2-2. Comparison of lines of code required when using RC Middleware.
Source            Type  Lines of Code
Hardware          HDL   456
Software          C++   22
App. Description  XML   111

Table 2-3. Total map-generation time, estimated area and latency, and actual area and execution time for the convolution case study optimizing for performance or area.
Platform      Optimization  Map Time  Est. Area (LUTs)  Est. Latency (CCs)  Area (LUTs)  Exec. Time
PROCStar III  Area          17.66 ms  17 857            26                  15 348       39.3 ms
PROCStar III  Performance   18.75 ms  26 619            24                  23 934       38.9 ms
PROCStar IV   Area          14.27 ms  17 125            26                  16 938       39.1 ms
PROCStar IV   Performance   17.47 ms  25 669            24                  25 194       38.8 ms
M501          Area          2.17 ms   12 330            38                  12 396       35.1 ms
M501          Performance   2.46 ms   17 403            36                  16 843       34.7 ms
PCIe-385N     Area          7.16 ms   11 762            16                  10 843       29.9 ms
PCIe-385N     Performance   6.21 ms   15 862            14                  12 997       28.9 ms

RCMW arbitration logic. The performance-optimizing cost function selects the mapping where the greatest latency of all application interfaces is minimized. Using this cost function, the RCMW toolchain maps each application memory to a separate physical memory bank, minimizing the latency for each application interface. Since the kernel is sufficiently small it is mapped to BRAM. Table 2-2 presents the total hardware and software lines of code, and the lines of XML in the application description written by the application developer. We only include lines of code written by the developer, not including spacing or comments. Due to the differences in coding styles and developer experience, this table is meant to compare the relative effort for creating each component of the application. Table 2-3 presents the results of using the RCMW toolchain to map the convolution application onto each supported platform for both performance- and area-optimizing cost functions. The map time is the required execution time for the RCMW toolchain to finish mapping the application to each platform, and generate the associated FPGA project file and C++ software stub. We ran the RCMW toolchain on a quad-core Xeon E5520. The

estimated area in LUTs is the area calculated by the RCMW mapper using the post-fit area results reported by the FPGA-vendor toolchain for each individual core. The estimated latency in clock cycles (CCs) is the estimated maximum latency of all application interfaces in the selected mapping. The latency of each interface is calculated by adding the estimated latency of each core instance in the path from the application interface to the platform interface. The calculated latencies for the area- and performance-optimized mappings differ by only a few cycles since the RCMW MUX component only estimates a single clock cycle for each multiplexed interface. This estimate could be improved by taking into account the type of arbitration used, the number of interfaces, and the average transfer length. The area in LUTs is the post-fit area reported by the vendor toolchain. This area is similar to the estimated area, which was calculated using post-fit results for each core individually, but is not equal due to optimizations made during the analysis-and-synthesis and fitter stages of the vendor toolchain. The execution time is the total time to transfer the input signal and kernel data, perform the convolution, and transfer the results. We selected an application clock frequency of 150 MHz for each platform. The execution times are similar for both performance and area optimization, since the available memory bandwidth is sufficiently higher than the bandwidth required by the convolution core. Depending on the target platform hardware configuration and required application resources, the performance optimization may or may not give significant performance improvements. As illustrated in Figure 2-8, the area-optimized mapping results in all application memories being mapped to a single external memory bank, requiring only a single memory-controller instance. The performance-optimized mapping, however, required two memory-controller instances, and additional logic for the kernel BRAM. We were able to reduce the post-fit logic usage of our application by 36% by selecting an area-optimizing cost function, which could enable applications to fit additional processing elements on an FPGA and increase application performance. In our previous work, we used an exhaustive mapping algorithm which explored all possible application-to-platform resource mappings. Given the simplicity of the resource configuration for this case study, both the exhaustive and heuristic mapping algorithms

converge to the same mappings for both the area- and performance-optimizing cost functions. The exhaustive algorithm, however, requires over an order of magnitude longer to find the same mapping even for this simple example.

2.3.2 Analysis of Performance and Area Overhead

In this section, we analyze the interface and area overhead introduced by RCMW. First, we measure the overhead introduced by RCMW’s software API, by transferring data between host and FPGA for varying transfer sizes. We measure the time required to complete each transfer, and calculate the overhead as a percentage reduction in effective bandwidth compared to the vendor-specific API. Next, we measure the overhead introduced by RCMW when transferring data between application and platform memory. We compare the effective bandwidth when transferring data for a single RCMW read/write interface to the vendor-specific interface. To measure the FPGA to external memory bandwidth, we count the total number of clock cycles required to perform a transfer of a given size, and use the known application clock frequency to calculate the effective bandwidth. We calculate overhead as a percentage reduction in effective bandwidth. Finally, we measure the RCMW area overhead by comparing the relative logic usage of RCMW, vendor, and application components for several simple applications. The area percentages were obtained using the post-fit device usage report provided by Altera Quartus II and Xilinx ISE. Figure 2-9 presents the effective read and write bandwidth of the RCMW host to FPGA, and FPGA to external memory transfers. We find that the M501 and PCIe-385N lead the GiDEL PROCStar III and IV by a factor of two in host/FPGA bandwidth due to the newer generations of PCI Express. The maximum bandwidth of writing from the FPGA to external- memory is approximately the same for each platform due to the fixed word size of 128 bytes at 150 MHz in our benchmarks. The FPGA to external-memory read performance is also approximately the same for each platform, with the exception being the M501. The fixed latency for each 4KB read of the M501’s AXI memory interface results in the effective read bandwidth plateauing around 1 GB/s.
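Stated explicitly, the quantities used in this comparison can be written as follows; the symbols S (transfer size in bytes), N_cc (measured cycle count), and f_clk (the application clock, 150 MHz in these measurements) are our notation for the procedure described above.

BW_{eff} = \frac{S}{N_{cc}/f_{clk}}, \qquad \text{overhead} = \left(1 - \frac{BW_{eff,RCMW}}{BW_{eff,vendor}}\right) \times 100\%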

[Plot: effective bandwidth (MB/s) vs. transfer size (512 B to 128 MB) for the M501, PCIe-385N, PROCStar III, and PROCStar IV, with separate read- and write-performance curves for FPGA-to-external-memory and FPGA-to-host transfers.]

Figure 2-9. Host and FPGA read and write performance to external memory for PROCStar III/IV, M501, and PCIe-385N. FPGA-to-memory bandwidth was measured with a word size of 128 bytes at 150 MHz.

[Plot: RCMW overhead (%) vs. transfer size (512 B to 128 MB) for the M501, PROCStar III, and PROCStar IV, with separate read- and write-overhead curves for FPGA-to-external-memory and FPGA-to-host transfers.]

Figure 2-10. Host and FPGA read and write overhead to external memory for PROCStar III/IV, and M501. FPGA-to-memory overhead was measured with a word size of 128 bytes at 150 MHz.

Figure 2-10 presents the overhead incurred by using RCMW compared to vendor-only baseline interfaces. The Nallatech PCIe-385N is not included in this figure since the vendor provides no baseline with which to compare. The left side of Figure 2-10 presents the RCMW overhead for transfers between the FPGA and external memory. The peak FPGA/memory write overhead was similar for each platform: 50%, 43%, and 43% overhead for the M501, PROCStar III, and PROCStar IV, respectively. For large transfers, this overhead quickly becomes less than 1% for each platform. The peak read overhead is less than 10% for each platform, and similarly becomes less than 1% for large transfers. For small transfers, an overhead of 50% equates to only tens of clock cycles, which is relatively insignificant. The peak write overhead is greater than the peak read overhead due to the additional cycles required to flush memory buffers and ensure read-after-write consistency. The right side of Figure 2-10 presents the RCMW overhead for transfers between the FPGA and host. The high variance found in these graphs for small transfers is due to the variance in the host's OS scheduler. The greatest FPGA/host transfer overhead is incurred by the PROCStar III, which peaks at approximately 80% for reads and 70% for writes. This seemingly high overhead is due to the additional features provided by the RCMW software API, including thread-safety and user-memory buffer management. Although we could disable these features and significantly reduce this overhead, they are vital to RCMW's concurrent API and are therefore included in our results. Furthermore, this high overhead occurs at small transfer sizes, and accounts for approximately 1-2 ms of overhead. Since the M501 provides thread-safety for some of its API calls by default, the peak overhead is less, approximately 20% for both reads and writes. Large overheads are restricted to small transfer sizes, resulting in only a few additional microseconds for each transfer. For increasing transfer sizes, this overhead is quickly amortized, resulting in less than 1%, 5% and 7% read and write overhead for the M501, PROCStar III, and PROCStar IV, respectively. Figure 2-11 presents the logic usage of RCMW, application, and vendor components. The bottom layer in the stacked bar graph represents the vendor logic usage, the middle layer represents the application logic usage, and the top layer represents RCMW overhead.

[Stacked bar chart: device logic usage (%) of vendor, application, and RCMW components for AES128, Needle-Distance, SAD, and Smith-Waterman on each platform.]

Figure 2-11. FPGA resource analysis for vendor, application, and RC Middleware components.

Each set of bars in Figure 2-11 represents the area breakdown for each platform for a specific application. The PCIe-385N does not have a vendor component, since there are no vendor-provided hardware components. From this figure, we see that RCMW accounts for a very small fraction of the overall design area, typically less than 1% of the total device resources. The largest RCMW area overhead was less than 3%, for the Sum of Absolute Differences (SAD) on the Pico M501. For the PCIe-385N, RCMW handles the PCIe and external memory interfaces, resulting in a relatively larger RCMW area usage. Although the total area required by RCMW is platform and application specific, it is important to note that at least a portion of the area resulting from

mapping multiple application resources to a single physical memory would be necessary even for non-RCMW implementations. For a performance comparison of these applications and kernels, refer to Table 2-4.

2.3.3 Analysis of Productivity

In this section, we analyze the productivity benefits of RCMW by comparing software lines of code (SLoC), hardware lines of code (HLoC), and total development time required by the developer. Although lines of code and development time are commonly used for measuring software development productivity, it is worth mentioning that these measures are heavily influenced by developer-specific factors such as coding style [59]. To explore the productivity benefits of RCMW, a developer familiar with the Pico Computing M501, GiDEL PROCStar III/IV, and with RCMW, implemented five cores from OpenCores [60] using both vendor-specific and RCMW-specific design flows for each platform. The cores used included an AES128 encryption core, a JPEG encoder, a SHA256 hashing core, an FIR filter, and a 3DES encryption core. Each core was implemented using both the vendor's recommended design flow, and the RCMW toolchain. We included all code written by the developer, excluding comments and whitespace. The Nallatech PCIe-385N was excluded from this experiment due to the lack of a vendor-specific design flow. GiDEL and Pico Computing provide different approaches for developers to interface their applications with platform resources. GiDEL provides a graphical tool called PROCWizard, which enables developers to customize GiDEL-provided IP cores and resource interfaces. Pico Computing takes a different approach, providing developers with a Xilinx AXI bus interface to platform memory. Pico Computing provides a streaming abstraction in both hardware and software, enabling efficient transfer of data between host and FPGA. Our experiments indicated that, on average, RCMW required 65% less SLoC, 41% less HLoC, and 53% less development time than the GiDEL-specific design flow; and 66% less SLoC, 59% less HLoC, and 69% less development time than the Pico-specific design flow. Since these numbers are averaged for a single developer, we cannot draw conclusions as to an exact

improvement for all developers, but we can support the argument that RCMW improves productivity. These improvements are expected, since RCMW handles many development tasks typically left to the developer. By providing standardized, application-specific hardware and software interfaces, RCMW enables seamless application portability between heterogeneous platforms, while also reducing application complexity. This reduction in complexity leads to a reduction in the hardware and software lines of code that the developer must write. In hardware, RCMW handles resource arbitration and clock-domain crossing. In software, RCMW manages platform initialization, cleanup, and application multithreading. RCMW also provides API validation, monitoring driver calls to prevent the platform from entering an invalid state, alerting users via C++ exceptions when necessary. The abstractions provided by RCMW, standardized interfaces, and object-oriented APIs maximize developer productivity, while also enabling code reuse. In our experiments, the Pico M501 required relatively high HLoC due to the generic interfaces exposed to developers. Unlike PROCWizard and RCMW, the Pico M501 does not assist developers in customizing platform resources, forcing developers to handle arbitration and clock-domain crossing (CDC). GiDEL's approach required less HLoC due to PROCWizard assisting developers in customizing IP interfaces for their application. GiDEL also handles CDC for memory interfaces, reducing HLoC. RCMW required the least HLoC, generating the specific interfaces required by the application, and handling all required arbitration and CDC. Similar results were found for total SLoC, with Pico requiring the most SLoC, followed by GiDEL, and then RCMW. Both vendor APIs require developers to manage buffers and platform-specific restrictions such as data alignment and transfer size. In order to reduce developer overhead, RCMW provides a variety of different features, such as templated read and write functions which can handle any data type. Additionally, RCMW handles buffer management, data alignment, and garbage collection internally, further simplifying application development.
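As a hypothetical illustration of how such a templated interface removes type- and buffer-management burden from the developer, consider the sketch below; the class and method names are invented for this example and are not RCMW's actual API.

#include <cstddef>
#include <vector>

// Hypothetical templated read/write wrapper in the spirit of the RCMW software
// API; the Platform type and its rawRead/rawWrite hooks are assumptions.
template <typename Platform>
class MemoryHandle {
public:
    MemoryHandle(Platform& p, const char* name) : platform_(p), name_(name) {}

    // Any trivially copyable element type can be transferred; the middleware
    // layer would internally handle alignment, buffering, and size restrictions.
    template <typename T>
    void write(const std::vector<T>& host_data, std::size_t offset = 0) {
        platform_.rawWrite(name_, host_data.data(),
                           host_data.size() * sizeof(T), offset * sizeof(T));
    }

    template <typename T>
    void read(std::vector<T>& host_data, std::size_t offset = 0) {
        platform_.rawRead(name_, host_data.data(),
                          host_data.size() * sizeof(T), offset * sizeof(T));
    }

private:
    Platform& platform_;
    const char* name_;   // application resource name, e.g. "signal" or "kernel"
};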

2.3.4 Analysis of Portability

Table 2-4 presents the execution time and logic usage of various streaming applications and kernels for all four supported platforms. Each application was executed using the same application source code with a clock frequency of 150 MHz, and required between two and four streaming interfaces. The OpenCores JPEG encoder required a random-access memory interface to integrate with its on-chip peripheral bus (OPB) interface. This table demonstrates the same application hardware and software source executing across heterogeneous platforms. Porting applications across each platform was accomplished with almost no effort, requiring only that the RCMW toolchain be executed once for each application and platform. This table is not meant to be a comparison of platform performance, since the maximum device area and achievable clock frequency were not used for each application. In our tested applications and kernels, however, we found that the PCIe-385N outperforms the PROCStar III and IV for the given implementations and input datasets. The M501 and PCIe-385N use a newer PCIe generation, enabling higher peak bandwidth from host to FPGA, making transfer-heavy applications like Smith-Waterman, which streams a large database from host to FPGA, perform better. For applications that require more memory interfaces, such as Image Segmentation, the two additional memory banks of the PROCStar III and PROCStar IV provide an advantage over the M501. It is important to note that with newer platforms using state-of-the-art FPGAs, the increase in FPGA resources enables more application cores to fit on a single FPGA. This trend is illustrated in Table 2-4, as the device-logic usage greatly decreases from the PROCStar III, which uses a Stratix III, to the PCIe-385N, which uses a Stratix V. For HPC applications with data-level parallelism, RCMW could enable significant performance improvements by targeting existing applications to newer FPGA platforms with little to no developer effort.

Table 2-4. Execution time and area for various applications and kernels on each supported RC Middleware platform.
                                                    M501           PCIe-385N      PROCStar III   PROCStar IV
Kernel/Application   Parameters                     Exec.    Area  Exec.    Area  Exec.    Area  Exec.    Area
1D Convolution       96-point kernel; 2M points     32.9 ms  3 %   32.9 ms  3 %   39.3 ms  14 %  38.9 ms  7 %
2D Convolution       27x27 kernel; 640x480 image    12.8 ms  13 %  12.8 ms  13 %  13.2 ms  44 %  16.3 ms  25 %
Image Segmentation   320x480 image                  1.24 s   11 %  1.24 s   11 %  1.41 s   53 %  1.39 s   26 %
Needle-Distance      150 PEs; 215 characters        173 ms   10 %  173 ms   10 %  194 ms   38 %  203 ms   21 %
OpenCores AES128     2M hashes                      19.3 ms  6 %   19.3 ms  6 %   25.3 ms  22 %  24.3 ms  11 %
OpenCores FIR        10 taps; 1M points             21.3 ms  3 %   21.3 ms  3 %   24.5 ms  15 %  24.0 ms  8 %
OpenCores JPEG Enc.  640x480 image                  14.3 ms  3 %   14.3 ms  3 %   15.3 ms  15 %  19.6 ms  8 %
OpenCores SHA256     64K blocks                     52.3 ms  4 %   52.3 ms  4 %   64.1 ms  12 %  63.3 ms  5 %
Smith-Waterman       150 PEs; 215 characters        104 ms   4 %   104 ms   4 %   116 ms   11 %  119 ms   6 %
Sum of Abs. Diff.    49x49 feature; 640x480 image   13.8 ms  18 %  13.8 ms  18 %  14.7 ms  75 %  19.1 ms  38 %

2.4 Summary and Conclusions

Despite performance and power advantages over conventional many-core CPU and GPU architectures, FPGAs have had limited acceptance in HPC and HPEC applications due to their portability and productivity challenges. To help overcome these challenges, we introduced the RC Middleware (RCMW). RCMW provides an extensible framework which abstracts away platform-specific details to provide an application-centric hardware and software development environment. This environment is customized by the RCMW toolchain using the developer-provided application description, and allows developers to focus on the ideal resources and interfaces for their application, without worrying about the underlying platform configuration. To create this environment, the RCMW toolchain first selects an application-to-platform mapping using a customizable cost function, and then generates the required hardware and software interfaces. We evaluated RCMW's performance and productivity benefits for four platforms from three vendors. We demonstrated RCMW's ability to quickly explore different application-to-platform mappings using a convolution application case study for both area- and performance-optimizing cost functions. We demonstrated that the benefits of RCMW can be achieved with less than 1% FPGA/memory and 7% host/FPGA transfer overhead in the common case. We also demonstrated that RCMW has relatively low area overhead, requiring less than 3% of logic resources for several applications across all four platforms. We presented evidence that RCMW improves developer productivity, by showing that RCMW requires fewer lines of code and less total development time for deploying several kernels than vendor-specific approaches. Finally, we demonstrated that RCMW enables portability by showing that the same application source was able to execute without change across each supported platform. Leveraging the productivity and portability benefits of the RC Middleware, we focus on developing our scalable graph-processing architecture without relying on the underlying FPGA-accelerator platform hardware and software interfaces. Given the time and effort required to develop this architecture, the RC Middleware provides us with a vehicle to take advantage

of new reconfigurable system architectures and next-generation accelerators with minimal to no changes to our source code. In the following chapter, we survey state-of-the-art graph-processing methodologies and provide an overview of popular sparse-matrix storage formats for graph processing. We identify limitations in current storage formats, and present a novel hypersparse-matrix storage format optimized for distributed graph processing on FPGAs.

CHAPTER 3
EFFICIENT STORAGE FORMATS FOR SCALABLE FPGA GRAPH PROCESSING

Large-scale graph processing is a key component in modern scientific computing and data analytics, with many commercial and defense applications [3], [4]. Graph-processing applications, however, do not map well to traditional system architectures and programming platforms. Whereas traditional systems focus on computational throughput and data locality and reuse, graph-processing problems are typically memory-bound and data-driven, with highly irregular datasets [9]. Cache-based architectures are a liability for these applications, adding latency to computation and wasting power and chip resources [10]. These problems are further compounded in distributed systems, where the unstructured nature of graph datasets leads to inefficient data partitioning and load imbalances. The need to analyze increasingly larger graph datasets has driven the exploration of new methods, algorithms, and distributed system architectures for graph processing. One such method moves away from the typical edge- and vertex-centric approaches and describes graph algorithms in terms of linear-algebra primitives operating on graph adjacency matrices [18]. This approach brings with it the benefits of the predictable access patterns of linear-algebra operations, and a higher level of abstraction simplifying the implementation and parallelization of many graph algorithms [18], [19]. In order to maximize the scalability and performance of this approach, however, several key challenges must be addressed, such as how to map irregular graph datasets to distributed systems, and how to efficiently store and access sparse- and hypersparse-matrix datasets. In this chapter, we address the challenges of storing graph adjacency matrices to maximize graph-processing application performance while minimizing storage overhead. Graph adjacency matrices are typically sparse, having a total number of non-zero elements on the order of the dimension of the matrix, and follow a power-law degree distribution, where only a few rows or columns contain the majority of the non-zero elements [20]. When computing on these sparse datasets in a distributed system, they become hypersparse, having less than one non-zero per

row/column on average [18]. Despite this degree of sparsity, large-scale graph datasets still require significant storage space, with several terabytes required even for small problem sizes [21]. In order to maximize the scalability and performance of these graph-processing algorithms, sparse-matrix storage formats, herein referred to as formats, capable of providing scalable, low-overhead storage with low-latency access to data are critical [22]. There is a wide variety of formats optimized for different non-zero distributions, such as diagonal or banded matrices, and for different platform architectures, such as vector processors or GPUs. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R) [23], which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off between storage and lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. In order to overcome these limitations, we propose a novel sparse-matrix storage format called Hashed-Index Sparse Column/Row (HISC/R). HISC/R replaces the dense indexing vector in CSC/R, and the sparse indexing vectors in DCSC/R, with a hashed indexing vector, enabling constant-time accesses to rows or columns of a matrix. Additionally, HISC/R optimizes the storage of hypersparse matrices by allowing non-zero elements to be stored directly in the hashed indexing vector when no additional space is required. For dense matrices, HISC/R uses a novel segmented-storage scheme which enables online non-zero insertions and deletions, eliminating the need for expensive intermediate storage formats. We demonstrate the storage and lookup performance of HISC/R over CSC/R and DCSC/R using randomly generated power-law graphs. We show that HISC/R requires up to 40% fewer memory reads compared to DCSC/R when performing SpGEMM, and uses up to 19% less storage for hypersparse datasets. Finally, we present an FPGA architecture for an HISC/R controller, identifying key architecture components and optimizations to maximize lookup performance.

Table 3-1. Definition of variables for sparse-matrix complexity analysis.
Variable  Description
N         Matrix columns
M         Matrix rows
nze       Non-zero element
nnz       Number of nze
nzc       Non-zero columns
nzr       Non-zero rows
B_ptr     Bytes per pointer
B_idx     Bytes per index
B_val     Bytes per value
α         Hash-table load factor

The remainder of this chapter is organized as follows. Section 3.1 provides an overview of related work on sparse-matrix storage formats and their suitability for scalable graph processing. Section 3.2 presents the details of our new Hashed-Index Sparse Column/Row (HISC/R) format, analyzes its expected storage and lookup performance, and provides an overview of our HISC/R-controller architecture. Section 3.3 presents our experimental results comparing the storage and lookup performance of HISC/R against competing formats using randomly generated power-law graphs. Finally, Section 3.4 summarizes and concludes the chapter.

3.1 Background and Related Research

In this section we present a brief overview of the most common sparse-matrix storage formats in terms of their lookup complexity, storage performance, and amenability to distributed graph processing. Although a multitude of storage formats exists, most are based on the ones presented here. We define an optimal format as one that enables constant-time O(1) lookup complexity for row or column elements while maintaining O(nnz) storage. In practice, however, formats must compromise between maximizing lookup performance and minimizing storage overhead for sparse datasets. Table 3-2 provides an overview of the storage and performance complexity for popular sparse-matrix storage formats. For a reference of variables used in our discussion, see Table 3-1.

Table 3-2. Analysis of popular sparse-matrix storage formats.
Format    Storage (Bytes)                                                             Lookup Complexity
COO       (2B_idx + B_val) nnz                                                        O(nnz) / O(lg nnz)
CSC       (B_val + B_idx) nnz + B_ptr (N + 1)                                         O(1)
CSR       (B_val + B_idx) nnz + B_ptr (M + 1)                                         O(1)
DCSC      (B_val + B_idx) nnz + (2B_ptr + B_idx) nzc + B_ptr                          O(lg nzc)
DCSR      (B_val + B_idx) nnz + (2B_ptr + B_idx) nzr + B_ptr                          O(lg nzr)
ELLPACK   (B_val + B_idx) M max_i{A(i,:)}                                             O(1)
JDS       (B_val + B_idx) nnz + B_ptr (max_i{A(i,:)} + M + 1)                         O(1)
TJDS      (B_val + B_idx) nnz + B_ptr (max_j{A(:,j)} + 1)                             O(1)
MQT       (1/2)(lg N - ⌈k⌉) nnz + ([k ∈ Z] + 1/3) 4^⌊k⌋ - 1/3 + B_ptr nnz  (1)        O(lg N)
(1) Maximum bytes required, where k = log_4 nnz

3.1.1 Coordinate Format (COO)

The coordinate format [61] consists of a list of tuples, each with row, column, and value fields. The list of tuples has a storage complexity of O(nnz), and may be stored unsorted or sorted lexicographically. When unsorted, COO has a lookup complexity of O(nnz) but allows for constant-time element inserts by appending to the end of the list. When sorted, it has a lookup complexity of O(log nnz) but then requires O(nnz) insert complexity to maintain the sorted ordering. The fast insert and low storage complexity make COO an ideal candidate as an intermediate format for the distribution stages of non-zeros in scalable graph-processing systems, but not as the primary storage format for graph datasets.

3.1.2 Compressed Sparse-Column/Row Format (CSC/R)

CSC/R and its variants such as blocked CSC/R are the most commonly used sparse-matrix storage formats for vector processors due to their simplicity and good performance [61]. CSC/R encodes non-zeros in three vectors: the pointer, index, and value vectors. The value and index vectors are sparse vectors which store the corresponding values and indices of non-zeros in column/row-major order for CSC/CSR, respectively. The pointer vector is a dense vector which contains an offset into the index and value vectors for the start of each row/column for CSR/CSC, respectively. The dense pointer vector enables constant-time indexing into the start of rows and columns at the expense of significant storage overhead when

dealing with sparse or hypersparse matrices. As shown in Table 3-2, the dense pointer vector causes the storage requirement to be dependent on the dimensions of the matrix, making it unsuitable as a scalable storage format.

3.1.3 Doubly Compressed Sparse-Column/Row (DCSC/R)

DCSC/R [23] was proposed to overcome the scalability limitations of CSC/R. DCSC/R is similar to CSC/R but replaces the dense pointer vector with a sparse pointer vector, only storing entries for non-zero columns and rows. Since the pointer vector is sparse, another index vector is used to store the row/column index associated with each pointer. By using a sparse pointer vector, we must now search for each row/column, increasing the lookup complexity to O(nzc/r). In order to minimize the search overhead, the format introduces an AUX array which breaks the non-zero rows/columns into blocks and stores a pointer to the first non-zero of each block, giving the storage complexity shown in Table 3-2. Although DCSC/R solves the scalability issues of CSC/R by eliminating the dense pointer vector, the introduction of a sparse vector requires a search on lookup and may significantly increase the lookup latency and limit performance.

3.1.4 ELLPACK Format

ELLPACK [62] was proposed to maximize the performance of sparse-matrix computations on throughput-oriented processors such as GPUs. In ELLPACK, the non-zeros of each row are grouped together and right-padded with zeros to make each row the same size. These non-zero entries are stored in a value vector, with their associated column indices stored in a separate column vector. By forcing each row to be a similar size, memory accesses to non-zero data can be coalesced to maximize performance on GPUs. ELLPACK enables constant access time to rows, but poor storage overhead when rows are not similar size as indicated by the

M maxi{A(i, :)} term in the storage requirement shown in Table 3-2, making it a poor choice in terms of storage overhead for scalable systems.

3.1.5 Jagged Diagonal Format (JDS/TJDS)

The Jagged Diagonal Format (JDS) [63] was developed for iterative methods on vectorized processors. In JDS, each row is first packed similarly to ELLPACK, sorted by length, and then stored column-wise in a value array, with associated column indices stored in an index array, and a permutation vector storing the original row ordering. The permutation vector is proportional to the number of rows in the matrix, leading to high storage overhead and limited scalability. To reduce this overhead, the transposed JDS (TJDS) format was developed to eliminate the need for a permutation vector by re-ordering rows of the input datasets. Although TJDS reduces the storage requirement to O(nnz), the requirement of re-ordering the inputs makes it unsuitable for scalable systems.

3.1.6 Minimal Quadtree Format (MQT)

The MQT format [64] was developed to minimize the storage requirement of sparse matrices. MQT encodes sparse-matrix data using a structure vector and a value vector. The structure vector contains a series of four-bit masks which break up the matrix recursively into quadrants. Each bit in the masks indicates which quadrant has at least one non-zero element. The value vector stores the individual non-zeros ordered by quadrant. The equation in Table 3-2 provides an upper bound for a specific matrix with a given number of non-zero elements. The lookup complexity is O(lg N), the depth of the tree, since we need to iterate over the structure vector for each non-zero element. Although MQT requires relatively little storage, the lookup complexity makes it unsuitable for scalable graph-processing systems.

3.2 Approach

In this section, we present the design and implementation details of our novel sparse-matrix storage format, Hashed-Index Sparse-Column/Row (HISC/R). We provide an analysis of HISC/R's expected storage and lookup performance compared to CSC/R and DCSC/R for varying degrees of matrix sparsity. We also present an HISC/R-storage-controller architecture, describing key architecture features and optimizations identified to maximize performance.

[Figure 3-1 data: for an example 8x8 matrix, CSR uses a dense pointer vector (row lookup O(1), storage O(N)); DCSR uses sparse pointer, index, and AUX vectors (storage O(nnz), lookup O(lg nzr)); HISR uses a hashed-index vector of (key, ptr, size) buckets (storage O(nnz), lookup O(1)).]

Figure 3-1. Comparison of the indexing techniques used by CSC/R, DCSC/R, and HISC/R.

3.2.1 Hashed-Index Sparse-Column/Row (HISC/R)

The problem of storing sparse matrices can be reformulated as the problem of storing the set of non-zero row or column indices from the set of all possible indices in a way that enables constant-time lookups. When the set of non-zero indices is small compared to the set of all possible indices, as in the case of sparse or hypersparse matrices, we see that this problem is analogous to one solved by hashing. HISC/R approaches the problem of sparse-matrix storage by using a hashed-indexing vector rather than the dense vector used in CSC/R, or the sparse vectors used in DCSC/R, as illustrated in Figure 3-1. HISC/R is a general sparse-matrix storage format and can be used for either column- or row-major accesses. When looking up the non-zero values of a column or row, we use a hash function to generate an offset into the hashed-index vector, verify the key, and use the pointer and size entries to iterate over the non-zeros. Since we are dealing with hashing, however, it is important to note that the storage performance depends on the achieved load factor, α, of the hashed-index vector. Similarly, lookup performance depends on the type of hash function used, and the lookup and collision-resolution policies employed by our hash table. To guarantee performance, it is important we choose a hash function that is uniformly random and easy to compute, and a hash table which provides fast lookups at high loads.

3.2.2 Hashed-Indexing Vector

Each bucket in the hashed-indexing vector consists of three entries: the key (column/row index for HISC/R respectively), pointer into the non-zero value/index vectors, and the size of

the current column/row. In order for HISC/R to achieve good performance, it is critical we choose a hash function which distributes the non-zero indices uniformly in the hashed-indexing vector. Therefore we need a function h : M → {0, ..., B − 1}, for a hash table with B buckets, that provides sufficient uniformity regardless of the non-zero distribution.

The class of strongly universal_k hash functions [65], H, guarantees that for some randomly chosen hash function, h ∈ H, and a distinct set of k keys, x_i ∈ M, the values are hashed independently over B buckets. Many implementations of strongly universal_k hash functions have been proposed but, for k > 3, they often involve computationally prohibitive polynomial calculations and prime-modulo arithmetic. For this reason, we focus on simple tabulation hashing [66], a strongly universal_3 family of hash functions that can be tuned for uniformity or space efficiency. Tabulation hashing relies heavily on bitwise manipulation of keys, and fast lookups into small tables of memory, making it well-suited for FPGAs. Using tabulation hashing, we achieve constant-time expected lookup performance for load factors less than 60% when using simple collision-resolution techniques such as linear and quadratic probing, and double hashing. As the load factor increases for these simple hash tables, the number of buckets probed to find a particular key increases significantly, requiring us to turn to more complex hash-table designs. Cuckoo Hashing [67] is a hash-table design which requires at most two probes to find any key, but at the price of maintaining a load factor less than 50%. Although the lookup performance of Cuckoo Hashing is ideal for HISC/R, maintaining a load factor of less than 50% would greatly impact our storage performance. Hopscotch Hashing [68] is a hash-table design which combines techniques from Cuckoo Hashing, linear probing, and chaining to provide a compromise between lookup performance and load factor. Our experiments show that Hopscotch Hashing outperforms the other explored hash-table types by achieving a load factor of up to 83%, with an average of 1.4 probes per lookup. Hopscotch Hashing, however, requires a significantly more complex insertion process than other hashing methods. The results of our hash table comparison are summarized in Figure 3-2.
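For readers unfamiliar with simple tabulation hashing, a minimal software sketch follows; the 32-bit key width, four 256-entry tables, and power-of-two bucket count are assumptions chosen for illustration, not the parameters of the actual HISC/R controller.

#include <array>
#include <cstdint>
#include <random>

// Minimal simple-tabulation hash for 32-bit row/column indices, reduced to a
// table of 2^bucket_bits buckets (bucket_bits assumed < 32).
class TabulationHash {
public:
    explicit TabulationHash(uint32_t bucket_bits, uint64_t seed = 1)
        : mask_((1u << bucket_bits) - 1) {
        std::mt19937_64 rng(seed);
        // One 256-entry table of random words per key byte.
        for (auto& table : tables_)
            for (auto& entry : table)
                entry = static_cast<uint32_t>(rng());
    }

    // Split the key into bytes, look each byte up in its table, and XOR the
    // results; only shifts, small-table reads, and XORs are required, which is
    // why the scheme maps naturally onto FPGA block RAM and LUT logic.
    uint32_t operator()(uint32_t key) const {
        uint32_t h = 0;
        for (int i = 0; i < 4; ++i)
            h ^= tables_[i][(key >> (8 * i)) & 0xFF];
        return h & mask_;   // reduce to the bucket range
    }

private:
    std::array<std::array<uint32_t, 256>, 4> tables_;
    uint32_t mask_;
};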

[Plot: average hash-table probes per row/column lookup vs. load factor (0.5 to 0.9) for linear probing, quadratic probing, double hashing, and hopscotch hashing.]

Figure 3-2. Comparison of average hash table probes required for row/column lookups (1σ) vs. load factor for different hash table types.

3.2.3 HISC/R Nonzero Storage

Although HISC/R could use a single index/value array similar to CSC/R and DCSC/R as shown in Figure 3-1, we instead use a novel segmented-storage vector which enables online insertions and deletions. The segmented storage vector breaks rows/columns with more than one non-zero into variable-sized sublists with initial size L0. Each sublist contains either null

or a pointer to the next sublist of size k^d · L0, where k is a customizable multiplier and d is the current sublist depth. Unused elements of each sublist are initialized to zero to indicate that they can be inserted into, as shown in Figure 3-3. The parameters L0 and k provide a method to tune the segmented storage vector either for storage performance or to minimize the number of memory accesses. The optimal values for these parameters depend on the properties of the matrix being stored.

[Figure 3-3 data: rows with more than one non-zero point to segmented sublists of sizes L0, kL0, ...; an extended bit in each hashed-index bucket indicates whether the bucket holds a pointer to a segmented vector or a single non-zero stored in place.]

Figure 3-3. Overview of HISC/R with segmented storage vectors using initial segment size L0 and growth factor k.

In cases where there is only one non-zero in a row or column, as is the common case for hypersparse matrices, we store the non-zero value directly in the hash table. The extended bits, as shown in Figure 3-3, are used to indicate whether we are storing a non-zero value or the start of a segmented vector in the hash table. If we are storing a non-zero directly, we store the major index in the key, the minor index in the pointer location, and the value in the size position.

3.2.4 Non-zero Lookups and Insertions

To access a column/row in HISC/R, we first look up the entry, if any, for the column/row in the hashed-indexing vector, as described in Figure 3-4. We check the extended flag of the entry to determine if the bucket contains a non-zero or the start of a segmented vector. If the extended flag is false, we can return the non-zero element found. If the flag is true, we use the address stored in the hash table to iterate over the segmented vector. When a zero is encountered while reading the segmented vector, we have hit unused storage and can stop iterating. Inserting non-zero elements first requires looking up any existing bucket entries in the hashed-indexing vector. If no entry exists, we can insert the current non-zero value directly into the bucket found and return, as shown in Figure 3-5. If the entry exists and it already contains a single non-zero value, we allocate a new segment of size L0, copying both the stored non-zero element and the element we are inserting into the newly allocated segment. We then update the extended bit, size, and pointer fields of the hash table. If the entry already contains

1: procedure HISC Lookup(column)
2:   L0, k ← 2
3:   tuples ← ∅
4:   entry ← HashTableLookup(column)
5:   if entry ≠ null then
6:     if entry.extended = false then
7:       tuples ← {(entry.row, entry.key, entry.value)}
8:     else
9:       depth ← 0
10:      cur_ptr ← entry.ptr
11:      while cur_ptr ≠ null do
12:        s ← SegmentLookup(cur_ptr)
13:        sz ← SegmentSize(L0, k, depth)
14:        for i ← 1...sz do
15:          if s[i].value = 0 then
16:            return tuples
17:          end if
18:          tuples ← tuples ∪ {(s[i].row, column, s[i].value)}
19:        end for
20:        depth ← depth + 1
21:        cur_ptr ← s.next
22:      end while
23:    end if
24:  end if
25:  return tuples
26: end procedure

Figure 3-4. Pseudocode for HISC column lookups.

the start of a segmented vector, we first index into the last sublist by reading only the pointer entries. We then search for the first zero-valued entry and insert into it. If no such entries exist, we allocate a new storage segment, insert the non-zero, and update the pointer field of the previous segment. If faster inserts are required, we can keep a next-free pointer in the pointer position of the last sublist.

3.2.5 Storage Analysis

The HISC/R storage requirement when using a single value/index vector is similar to DCSC/R but replaces nzc/r with the number of buckets, B = α^{-1} nzc/r, as shown in Equation 3–1. The following analysis assumes without loss of generality that B_ptr = B_val = B_idx.

B_{HISC} = (B_{val} + B_{idx})\,nnz + (2B_{idx} + B_{ptr})\,\alpha^{-1}\,nzc    (3–1)

1: procedure HISC Insert(row, column, value)
2:   L0, k ← 2
3:   depth ← 0
4:   entry ← HashTableLookup(column)
5:   if entry = null then
6:     HashTableInsert(column, row, value)
7:   else if entry.extended = false then
8:     s ← SegmentAllocate(L0, k, depth)
9:     SegmentInsert(s, entry.row, entry.value)
10:    SegmentInsert(s, row, value)
11:    HashTableInsert(column, s.ptr, 2)
12:  else
13:    s ← SegmentLookup(entry.ptr)
14:    while s.next ≠ null do
15:      s ← SegmentLookup(s.next)
16:      depth ← depth + 1
17:    end while
18:    if SegmentInsert(s, row, value) = false then
19:      s.next ← SegmentAllocate(L0, k, depth)
20:      s ← SegmentLookup(s.next)
21:      SegmentInsert(s, row, value)
22:      entry.size ← entry.size + 1
23:    end if
24:  end if
25:  return
26: end procedure

Figure 3-5. Pseudocode for HISC non-zero insertions.
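Both procedures operate on hashed-index buckets holding a key, pointer, size, and extended flag, as described in Sections 3.2.2 and 3.2.3. A minimal C++ rendering of one bucket is sketched below; the field types and widths are illustrative assumptions, not the controller's exact bit-level layout.

#include <cstdint>

// Illustrative layout of one hashed-index bucket; widths are assumptions.
struct HiscBucket {
    uint32_t key;       // major (column/row) index used as the hash key
    uint32_t ptr;       // address of the first storage segment, or the minor index when !extended
    uint32_t size;      // non-zero count of the row/column, or the stored value when !extended
    bool     extended;  // true when ptr references a segmented storage vector
};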

Equation 3–2 compares HISC/R without segmented storage and CSC/R. When the matrix is hypersparse, M ≫ nnz and the storage ratio approaches zero. For denser matrices, nnz ≫ nzc and nzc = M, causing the storage ratio to approach unity.

\frac{B_{HISC}}{B_{CSC}} = \frac{2 + 3\,\frac{nzc}{nnz}\,\alpha^{-1}}{2 + \frac{M+1}{nnz}}    (3–2)

Equation 3–3 compares HISC/R without segmented storage and DCSC/R. When the matrix is hypersparse, nzc = nnz and it can be shown that the storage ratio converges to 0.4 + 0.6α^{-1}. For denser matrices, nnz ≫ nzc and the storage ratio approaches unity.

( ) B 2 + 3 nzc α−1 HISC = nnz( ) (3–3) B nzc DCSC 2 + 3 nnz Equation 3–4 gives the storage ratio compared with HISC/R when using segments and storing non-zeros directly into the hashed-indexing vector. For hypersparse matrices, nzc = nnz

63 and the storage ratio approaches 0.6α−1 asymptotically, giving a theoretical maximum storage improvement of 40% over DCSC/R. When compared with CSC/R, the storage ratio approaches zero since M ≫ nzc.

( ) B 3 nzc α−1 HISC(Segmented) = nnz( ) (3–4) B nzc DCSC 2 + 3 nnz The dense case for HISC/R with segmented storage is difficult to calculate since the storage depends on the distribution of nonzeros in each row/column, and the parameters L0 and k. The space needed for each column/row can be calculated as the sum of a geometric series with base k, plus a depth term representing the number of pointers as shown in Equation 3–5.

− M∑1 1 − kD(|A(:,i)|) (B + B ) + B D(|A(:, i)|) (3–5) val idx 1 − k ptr i=0 The bounds for the geometric series depends on the number of non-zeros in each row or column and is equivalent to the maximum depth of the segmented vector as calculated in Equation 3–6.

⌈ ( )⌉ − −1 D(n) = logk 1 + (k 1)L0 n (3–6)

For dense matrices, the second term in the summation of Equation 3–5 can be ignored,

−1 and the segmented storage can then be modelled in terms of an efficiency parameter κeff as −1 shown in Equation 3–7. The parameter κeff represents how efficiently a matrix is stored for a given L0 and k for a particular matrix. In our experiments κeff ranged between 0.8 and 1.0.

−1 −1 (Bval + Bidx)κeff nnz + (2Bidx + Bptr)α nzc (3–7)

3.3 Results and Analysis

In this section we compare the storage and lookup performance of HISC/R with CSC/R and DCSC/R. All of the datasets used in our experiments were randomly generated using a

64 Kronecker-graph generator that we developed based on the work of Leskovec et al. [20]. By randomly generating our graphs, we were able to control properties of our datasets, such as average row/column density, to better explore HISC/R’s performance characteristics. We use

HISC/R with hopscotch hashing and segmented storage with the parameter values L0 = k = 2. Unsegmented HISC/R uses a single value and index array similar to CSC/R and DCSC/R, and does not store non-zero elements in the hashed pointer vector. The figures are divided into hypersparse and sparse regions in order to illustrate the storage format behavior for different levels of sparsity. 3.3.1 Storage comparison

To analyze the storage performance of HISC/R we used our Kronecker-graph generator to generate random scale-30 graphs (adjacency matrix of size 230 by 230) with edge factors from 10−3 to 103. We generated several random graphs for each edge factor using the gnutella-25 initiator matrix [20] and stored them using CSC/R, DCSC/R, HISC/R, and HISC/R with segmented storage. We then measured the average storage requirement for each in terms of average bytes per non-zero, and calculated their storage ratios. Figure 3-6 compares the storage performance of HISC/R with CSC/R. As expected from the result of Equation 3–2, the storage ratio approaches zero asymptotically with increasing sparsity, indicating that HISC/R and outperforms CSC/R for hypersparse datasets. As the matrix density approaches an average of one non-zero per row/column, the overhead from the dense pointer vector of CSC/R decreases, causing the storage ratio to increase. The storage ratio of HISC/R peaks around 1.4 at an average of 10 non-zero elements per row/column and then asymptotically approaches 1.25 as the matrix becomes denser. This overhead is due to the parameters of the segmented storage vector, L0 and k, not being optimized for the dataset being stored. The peak overhead of approximately 40% in the sparse region comes with the added benefit of runtime nonzero insertions and deletions, and can be disabled if unneeded. For HISC/R without segmented storage, the storage ratio peaks around 1.2 at an average of

65 1.5 α = 0.714286

● 1.0 ● ● ● ● ●

Storage Ratio Storage 0.5 ●

● 0.0 ● Hypersparse Region Sparse Region

10−3 10−2 10−1 100 101 102 103 Average Row/Column Density

HISC/R ● HISC/R (Unsegmented)

Figure 3-6. Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by CSC/R for randomly generated scale-30 Kronecker matrices.

10 non-zeros per row/column and then asymptotically approaches zero as the matrix becomes denser. Figure 3-6 compares the storage performance of HISC/R with DCSC/R. In the hypersparse region, we see the HISC/R without segments approaching an asymptote around 1.25 as predicted by Equation 3–3 due to the load factor of the hash table. As the number of non- zeros increases, this overhead is amortized and the storage ratio approaches unity. For HISC/R with segmented storage, the storage ratio a approaches 0.84 as predicted by Equation 3–4 in the hypersparse region. HISC/R achieves a storage ratio of 0.85 in the hypersparse region by storing non-zero elements directly in the hashed pointer vector. Unsegmented HISC/R approaches a storage ratio of 1.25 in the hypersparse region due to the unused buckets in the hashed pointer vector. As the matrix becomes denser, the number of

66 1.5 α = 0.714286

− 0.4 + 0.6 α 1 ● ● ●

● 1.0 ● ● ● ● ● ●

− 0.6 α 1

Storage Ratio Storage 0.5

0.0 Hypersparse Region Sparse Region

10−3 10−2 10−1 100 101 102 103 Average Row/Column Density

HISC/R ● HISC/R (Unsegmented)

Figure 3-7. Average storage ratio normalizing HISC/R and HISC/R (unsegmented) by DCSC/R for randomly generated scale-30 Kronecker matrices. rows and columns with more than one non-zero element increases, increasing the number of storage segments. Due to the unused elements in the segmented-storage vectors, the storage ratio peaks around 1.35 in the sparse region at 10 non-zero elements per row/column, and then approaches 1.25 asymptotically as the matrix becomes denser. By disabling segmented storage, this overhead would be eliminated and the ratio would equal the non-segmented HISC/R curve in the sparse region. Optimizing the parameters L0 and k for the dataset being stored would minimize this overhead. 3.3.2 Performance Comparison

To analyze the lookup performance of HISC/R, we again used our Kronecker-graph generator to generate random scale-30 graphs (adjacency matrix of size 230 by 230) with edge factors from 10−3 to 102. We evaluate the lookup performance of HISC/R using sparse

67 40%

20%

● ● ● ●

0% ● ●

Total Memory Read Accesses (Percent Improvement) Memory Read Accesses (Percent Total ● Hypersparse Region Sparse Region −20%

10−3 10−2 10−1 100 101 102 Average Row/Column Density

DCSC/R CSC/R

Figure 3-8. Comparison of total reads required to perform sparse matrix/matrix multiplication using HISC/R compared with CSC/R and DCSC/R.

generalized matrix-matrix multiplication (SpGEMM). SpGEMM is a key kernel used for many graph-processing applications including all-nodes shortest paths, and betweenness centrality. We measure the total memory read operations required for SpGEMM when using CSC/R, DCSC/R, and HISC/R, for varying degrees of sparsity. We measure only the total memory reads needed to perform SpGEMM, and do not assume a particular hardware architecture. We compare the percent improvement of HISC/R over CSC/R and DCSC/R in terms of the reduction in total memory accesses required to perform the computation. Figure 3-8 presents the percent improvement in total number of memory reads required by HISC/R compared to CSC/R and DCSC/R. HISC/R provides an improvement of up to 40% and 14%, compared to DCSC/R and CSC/R for hypersparse datasets, respectively. The improvement of HISC/R in the hypersparse region is a result of storing non-zero elements

68 directly in the hashed pointer vector, reducing the number of indirect memory accesses compared to CSC/R and DCSC/R. CSC/R and DCSC/R require significantly more indirect memory accesses when dealing with hypersparse matrices than HISC/R. In the sparse region, HISC/R requires up to 16% more memory accesses than CSC/R due to having to decode additional pointers for the segmented-storage vectors. As the matrices become denser, this overhead decreases as the additional segment lookups are amortized by the increasing segment size. HISC/R outperforms DCSC/R in the sparse region, with the benefit asymptotically approaching zero as matrix density increases. This asymptotic behavior is because as the matrix becomes denser the number of memory accesses needed by DCSC/R decreases as the AUX array size increases. Additionally, as the matrix becomes denser, the total number of memory accesses is dominated by the number of non-zero elements being read. 3.4 Summary and Conclusions

In this chapter, we presented Hashed-Index Sparse-Column/Row (HISC/R), a novel sparse- matrix storage format optimized for graph-processing applications. HISC/R provides O(1) lookup complexity and O(nnz) storage complexity while also enabling runtime insert and delete operations, enabling matrices to be constructed directly without using expensive intermediate storage formats. We show that HISC/R requires significantly less storage than CSC/R and up to 19% less than DCSC/R for hypersparse datasets when maintaining an average hash table load factor of 71%. Finally, we show HISC/R provides a 14% and 40% improvement in terms of memory reads compared to CSC/R and DCSC/R respectively when performing matrix multiplication with hypersparse datasets. The reduction in the total number of memory accesses and favorable storage performance for hypersparse datasets makes HISC/R uniquely suited for scalable graph processing. In the following chapter, we develop a scalable graph-processor architecture on FPGAs leveraging the RC Middleware and HISC/R storage format. Using the linear-algebra approach to graph processing, our architecture is designed to perform sparse-matrix operations over user-definable semirings. We present the design of our architecture and analyze it using kernels

69 and applications key to graph-processing algorithms. Furthermore, we analyze the scalability of our architecture for different network architectures and degrees of parallelization.

70 CHAPTER 4 EXTENSIBLE FPGA ARCHITECTURE FOR SCALABLE GRAPH PROCESSING Graphs are arguably one of the most powerful data structures in modern computing, capable of modeling relations, either abstract or concrete, between entities. This flexibility has positioned graphs as a central data structure in data analytics and scientific research, and has opened the door to new methods of data analysis and applications through cutting-edge graph-processing algorithms. Graph processing is becoming more important in all fields of research, with key roles in commercial [69], [70] and defense applications [3], [4]. The growing popularity of graphs has increased the demand for systems capable of processing increasingly larger graph datasets while reducing latency. Conventional processors and system architectures, however, are inefficient at executing graph-processing applications due to the sparse nature of graph datasets, and data-driven nature of graph algorithms [9]. These conventional architectures are designed to exploit the locality of data by providing multiple levels of caching, from the processor to network architecture. Graph-processing applications, however, typically have highly random-access memory patterns with little data reuse. Cache-based architectures for these applications are a liability; adding latency to computation, and wasting power and chip resources [10]. These applications are memory bounded, and often require minimal computational throughput, spending most of the execution time on memory accesses and data manipulation. Recent advances such as the linear-algebra formulation of graph algorithms has opened the door to new opportunities to increase graph-processing performance [18]. Traditional approaches to graph-processing rely on the vertex- or edge-centric formulation of algorithms. These formulations are often difficult to parallelize, having to determine for each graph- processing algorithm how to partition the vertex and edge sets, and how to access required data efficiently [18]. Furthermore, the vertex- and edge-centric approaches share minimal code re-use between different applications on different architectures, making porting these algorithms on new systems difficult and time consuming. By using the linear-algebra formulation of graph

71 algorithms, we can represent a variety of graph-processing algorithms using a small set of parallelized linear-algebra primitives. Any system which implements these primitives efficiently for the types of datasets being computed can then perform these graph algorithms [19]. In this chapter, we explore the design of a scalable graph-processor architecture which uses the linear-algebra formulation of graph algorithms. We develop our graph-processor architecture leveraging the customizability of Field-Programmable Gate Arrays (FPGAs). One major challenge that we must address in developing our architecture is how to handle various matrix operations, including matrix-matrix multiplication, over an application-specific semiring [24], [25]. In order to support a wide variety of graph algorithms using linear-algebra primitives, we present an extensible graph-processor framework which allows the ALU to be exchanged with an application-specific ALU and semiring. We demonstrate that our architecture achieves more than a 20×/40× speedup for sparse/hypersparse generalized matrix-matrix multiplication (GEMM) compared with optimized CPU baselines running on a Xeon E5620, while requiring only 12% of the power. Additionally, we explore the performance of breadth-first search (BFS) leveraging our SpGEMM kernel and compare with state-of-the-art BFS FPGA architectures running on the Convey HC-1/HC-2. After adjusting for aggregate memory bandwidth, we find that our architecture performs better on average than the compared BFS approaches. Finally, we analyze the scalability of our architecture running SpGEMM on the Novo-G# multi-FPGA system for both 2D- and 3D-torus networks, leveraging the discrete-event simulation model presented in [71]. Using workload profiling data collected for RMAT degree-26 SpGEMM, and the Novo-G# network simulation, we predict a speedup of up to 500× for a 6×6 2D-torus, and 980× for a 4×4×4 3D-torus, giving a parallel efficiency of approximately 0.64 and 0.70, respectively. The remainder of this chapter is organized as follows. Section 4.1 presents background and related works in sparse-matrix and graph-processing architectures, and overviews the linear-algebra formulation of breadth-first search. Section 4.2 provides the design of our graph- processor architecture, providing design details and analysis for each architecture component.

72 Section 4.3-4.6 presents our experimental setup, case studies, and scalability analysis for our architecture. Section 4.7 provides a summary of our work and concludes this chapter. 4.1 Background and Related Research

This section presents background and related works on accelerating sparse-matrix com- putation on FPGAs. Although the focus of these sparse-matrix accelerators are typically on the design of efficient floating-point datapaths for sparse-matrix vector computations (SpMV) rather than graph processing, the lessons learned for handling the storage and efficient access of matrices from memory are relevant to our work. Additionally, we present related research on defining standards for graph processing using linear-algebra operations, which is impor- tant to the design of our graph-processor architecture. Finally, we present an overview of the breadth-first search (BFS) kernel and its linear-algebra formulation. 4.1.1 Accelerating Sparse-Matrix Operations on FPGAs

Fowers et al. presents the design of a bandwidth-optimized SpMV implementation [72], which leverages a new storage format called Condensed-Interleaved Sparse Representation (CISR) format. CISR resolves the starvation issue of their SpMV pipelines, which results from the sparse nature of the matrix, by interleaving multiple non-zeros from each row into a single memory word. Each memory word is broken into multiple equal-sized chunks which stores a non-zero value from different rows. Each part of the memory word is assigned a channel which performs the dot product for a particular row. In this fashion, as long as the rows have an approximately equal number of non-zero elements, their memory bandwidth is maximized. Additionally, they present a banked-vector buffer (BVB) which stores a single copy of their dense vector using on-chip block ram (BRAM). Storing one copy of the vector using the BVB minimizes the vector-storage memory requirement and allows storing larger vectors. In [73], the authors focus on developing a universal SpMV architecture which can be used with either dense or sparse matrix datasets. The authors present a new storage format, called compressed bitvector (CBV), which encodes the non-zero positions of a matrix in a dense bit vector. The authors also add a vector cache to their architecture exploit data locality in the

73 vector for the dense case. Their universal SpMV architecture was designed to support several sparse and dense storage formats, citing the differences in formats based on dataset properties. The efficient use of external-memory bandwidth is key factor to maximizing the perfor- mance of sparse algorithms. Although CISR [72] provides a novel solution to maximize memory bandwidth in cases where the average number of non-zeros per row is approximately the same, for graph processing this is often not the case. The sparse-adjacency matrices of graphs follow a power-law degree distribution, meaning that the majority of rows have relatively few non-zero elements, and some rows have disproportional large number of non-zeros. For CISR, this means many of their channel slots will unused increasing their storage overhead and reducing their effective memory performance. For storage formats like CBV, storing the matrix in a dense form is good for fast lookups, but it is not scalable due to the O(N) storage requirement. A key insight from [73] is that the authors include support for multiple storage formats, since there is no one storage format to rule them all. Since FPGAs are capable of provide customized, bit-level manipulation of data as in the case of CISR, they are able to provide significantly improved power efficiency without sacrificing performance. In [74], the authors explore the performance and power tradeoffs of sparse- matrix multiplication (SpMM) running on FPGAs relative to CPU and GPU architectures. In particular, the authors look at the performance and power trade-offs for modern nVidia GPUs, Intel XeonPhis, and the Nallatech PCIe-385n FPGA accelerator when using an OpenCL SpMM kernel on each. The authors find that while FPGAs are the most power efficient, they typically have the lowest absolute performance when compared to the GPU and XeonPhi, as a result of the relatively low aggregate memory bandwidth available to the FPGA. The authors also note that their is no perfect platform, citing that the performance was dependent on the properties of the matrix datasets. On the graph processing side of things, two noteworthy architecture for accelerating graph-processing algorithms using custom FPGA architectures can be found in [16], [75]. In these papers, the authors accelerate the traditional vertex- and edge-centric formulations of

74 graph algorithms using novel architectures. In [16], the authors present an FPGA architecture called CyGraph for a parallel breadth-first search (BFS) using a vertex-centric formulation. Their architecture breaks the BFS algorithm into a kernel consisting of four processes which run simultaneously: a current-node queue process, a neighbor-fetch process, a neighbor-lookup process, and a next-level process. Multiple of these kernels are then combined together and share a high-bandwidth external memory interface (80 GB/s) on the Convey HC-1/HC- 2. By having each of the four processes running simultaneously in multiple kernels, the design maintains a large number of in-flight memory requests increasing memory throughput. Although the memory architecture is optimized for handling a large number of in-flight requests, the unpredictable memory-access pattern limits effective memory bandwidth. In [75], the authors present a many-core, soft-processor architecture with operations optimized for graph processing. The authors identify that graph-processing applications typically require a small subset of computational operations, and therefore they design a soft- core processor which includes a reduced instruction set of the operations they need. They combine multiple soft-core processors on a single FPGA in a 2D mesh SoC and store a local copy of the graph for each processor. By limiting their architecture to a single FPGA, and requiring that each FPGA store a local copy of the dataset, the achievable performance and scalability of this architecture is limited. In our approach, we fit a single processor per FPGA, and allocate all external memory resources for that processor. Each processor is assigned a subset of the graph edge-list in order to maximize the problem sizes that can be computed. In [76], Song et al. present a novel architecture for performing graph algorithms using linear-algebra primitives. A key component to their architecture is the merge sorter logic which is responsible for re-ordering results for index matching with various operations. Our graph-processor architecture is similar, but provides an extensible framework for modifying our datapath to support various graph algorithms. Furthermore, we provide support for three different storage controllers to maximize performance based on the dataset properties.

75 4.1.2 Standards for Graph Processing using Linear Algebra

In [24], the authors present a set of standard linear-algebra primitives for graph processing. They cite the Sparse Basic Linear Algebra Subprograms (BLAS) as a key set of operations for graph processing, but extend the sparse matrix-matrix multiplication (SpMM) operation over an arbitrary semiring. In the outer-product formulation of SpMM (C = AB) there are two key operations: first, the elements of each column of A are multiplied with the elements of the corresponding rows of B to form a set of partial-product matrices. These partial product matrices are then accumulated using the addition operation. Using the syntax presented in

[24] standard SpMM can be represented as C = Aop0.op1B where op0 = + and op1 = ∗. This formulation is known as sparse generalized matrix-matrix multiplication (SpGEMM).

The authors cite the following pairs of operations (op0, op1) as examples for graph-processing applications: (max, +), (min, max), (∨, ∧), and (f(), g()). These operations can be used to perform a wide variety of graph algorithms [18]. An implementation of this approach was created as part of Combinatorial BLAS [25] which defines several key operations: matrix/matrix multiplication, matrix/vector multiplication, element-wise operation, reduction, extraction, subset assignment, subset extraction, construct, and enumeration. In developing our graph- processor architecture, we follow the GraphBLAS standards presented in [24], [25]. We develop our architecture to support the GraphBLAS standard operations, and enable semiring customization for different applications by exposing an extensible interface to modify the ALU. 4.1.3 Linear-Algebra Formulation of Breadth-First Search

Breadth-first search (BFS) is key kernel for many graph-processing applications [18]. Using BFS, we generate a BFS spanning tree which consists of a vector of parent nodes for each node in the graph. In the typical vertex-centric graph-processing methodology, BFS is calculated by iteratively expanding a frontier set of nodes until all vertices have been visited. For each node in the current frontier set, the neighbors of the node are expanded an appended to the frontier set. The first time a node is expanded, we set its parent node to the node that

76 1: procedure BFS(G, Start) 2: frontier ← Start 3: for n ∈ G do 4: n.parent ← null 5: end for 6: while frontier.size() > 0 do 7: v ← frontier.pop front() 8: for n ∈ v.neighbors do 9: if n.parent = null then 10: n.parent ← v 11: frontier.push back(n) 12: end if 13: end for 14: end while 15: end procedure

Figure 4-1. Pseudocode for vertex-centric breadth-first search.

A A B   0 1 1 0 0 B C   0 0 0 0 1 0 0 0 1 0 D A =   1 0 0 0 0 0 0 1 0 0 E

Figure 4-2. Graph adjacency-matrix representation. A) Graph structure. B) Adjacency matrix. put it onto the frontier edge set. To begin the BFS algorithm, we put the starting node into the frontier set and expand each neighbor in the following iteration. When performing BFS using linear algebra, however, we can calculate the same BFS tree using matrix-vector multiplication and vector addition over specialized semirings. Figure 4-1 presents an example graph and its associated adjacency matrix. We define a frontier vector, x, where non-zero positions denote the current vertices in our breadth first search. For example, if we wanted to perform a BFS from node A in the graph in Figure 4-2A we would define the starting frontier as x(0) = [1, 0, 0, 0, 0]T. We can then calculate the kth frontier using Equation 4–1. In order to calculate the BFS spanning tree, we find the parents for each BFS iteration and combine it with the previous iteration’s parent vector. To calculate the parents of the nodes expanded in the kth BFS iteration, we multiply AT by x(k−1) using a special semiring. We define the outer-product multiplication operator as the matrix argument multiplied with

77 the row-index of the non-zero in the vector. The value of the element in x(k−1) does not matter since we only care if the element is non-zero. We define the addition operator of the outer-product multiplication as the min function to select the minimum node label as parent, in case multiple nodes expand the same child. After performing the operation shown in 4–1 using this semiring for matrix/matrix multiplication, x(k) contains the parent node labels of

each node in the current frontier. We also define the vector-addition operator, op⊕(a, b) as a conditional select operator which returns b(i) if a(i) is zero, and a(i) otherwise. This allows us to add only newly discovered nodes to the BFS spanning tree vector. Using these semirings, we can calculate the BFS spanning tree, P(k), for the kth iteration as shown in Equation 4–2. This definition of single-frontier BFS can be extended to multiple frontiers by using the same operator definitions mentioned above, but with matrix-matrix operations.

x(k) = ATx(k−1) (4–1)

(k) (k−1) (k) P = op⊕(P , x ) (4–2)

In the following sections we present a detailed analysis of our graph-processor architecture. Our architecture was designed to be extensible, allowing support for new graph applications by replacing our ALU with an ALU which supports a different semiring. In our approach, we build on the concepts presented in [76], integrating our optimized hypersparse-matrix storage format, HISC/R, presented in Chapter 3. We leverage the RC Middleware presented in Chapter 2 to enable scalability and portability to future FPGA systems. 4.2 Extensible Graph-Processor Architecture

In this section we present a detailed overview of our scalable graph-processor architecture. Figure 4-3 presents a high-level diagram of the graph-processor architecture. The graph processor consists of a matrix datapath, which handles accelerated hardware operations on matrix datasets, and a controller datapath, which provides general computing instructions and datapath controls. By providing both datapaths we enable accelerated sparse-matrix

78 rp-rcso rhtcue h arxotrpoutoeaingnrtsintermediate generates operation outer-product matrix The architecture. graph-processor Architecture interface. Merge-Sorter loopback a as 4.2.1 acts and interface stub a provides architecture (Section controller storage (Section merge-sorter systolic-array Middleware, RC the on [ information see more For please processor. ports sparse-matrix access the DMA of arbitrated of components round-robin memories to provides external Middleware the RC to The space platform. address targeted unified the a providing memory, consisted application processor single sparse-matrix a the of of supported definition any application to The portability platform. enabling Middleware resources, RC platform underlying to access with us provide DMA and bits configuration setting by scalar datapath parameters. compute sparse-matrix directly, the datasets direct stored or on values, operate sparse-matrix address to of or used out break be and can directly controller memory The access operations. to way a provide also but clarity). operations, for omitted signals (some architecture graph-processor of Overview 4-3. Figure nti eto epeettedsg n nlsso h ieie eg-otro our of merge-sorter pipelined the of analysis and design the present we section this In to Middleware RC the leveraged we architecture processor sparse-matrix our developing In 77 .Tekycmoet ftesas-arxpoesracietr r:the are: architecture processor sparse-matrix the of components key The ]. HOST DMA PHY RC Middleware 4.2.3 Tuple Bus .Tentokitraecmoetfrtesingle-processor the for component interface network The ). 4.2.1 Network LoopbackNetwork Interface Storage Cntl(s) Storage Sorter DCSC/R HISC/R Tuple CSC/R ALU ,teaihei oi nt(Section unit logic arithmetic the ), Memory Bus 79 Registers Matrix Controller Registers General ALU

Controller Bus 4.2.2 ,adthe and ), products, known as partial products, with duplicate row and column indices. In order to combine these duplicates efficiently, we sort the generated partial products by their indices and then perform our accumulation operation as defined by our semiring. By including a sorter component with a DMA interface to memory, we can provide custom sorting instructions in our processor to sort tuples efficiently. The approach for our sorter is based on the design presented in [78], however, we generalize and greatly improve the design for our purposes. The pipelined, merge-sorter architecture is based on combining two-way merge-sorter

processing elements (PEs) in a systolic array. Each PE contains two registers: the Rs register,

which is the high-priority element register, and the Rb register, which is lower priority, as shown in Figure 4-4. The PE sorts sets of two elements by ensuring that the higher priority element

(the one that should come out of the PE first) is always placed in the Rs register, while the

lower priority element is always placed in the Rb register. Each PE has two inputs: the next

high-priority element Rsn+1 , which should be lower priority than the current Rsn but higher

priority than Rbn , and Rbn-1 which should have lower priority than Rbn . Each clock cycle the elements in the PE will be shifted towards the output, choosing their next-states in order to maintain the aforementioned priorities. Since there are four possible signals that can influence ( ) 4 the next PE state, and the relative priority of each signal must be calculated, there are 2 = 6 comparisons that must be performed. By combining the results of these comparisons we can choose the next register states and the PE output for each clock cycle. We generalize the next-state rules defined in [78], and present them in terms of a generic priority function as shown in Table 4-1. By generalizing the next-state rules in terms of an arbitrary priority function, we have now developed a model for sorting elements in a variety of ways, not only lexicographically. For example, by specifying a priority function based on the absolute difference between the row and column values, we can order elements of our matrix based on their distance to the center diagonal. In order to create a functioning sorter from our PE, we need to define additional elements, such as one with an infinitely-high priority, which provides a useful initial PE state, and

80 Rs Reg. value {valid,∞,-∞} set_id R Flags R sn+1 Value sn

MUX Flags

Rs_Sel Output MUX

Rb Reg. Out_Sel R Value R bn-1 bn MUX Flags

R _Sel Set b Set Reg. Set n-1 Value n

Setn Pr(R ,R ) Pr(Rb,Rs) s sn+1 R _Sel Rsn s Pr(Rbn-1,Rs) Pr(Rbn-1,Rb) Rbn Rb_Sel R Pr(R ,R ) Pr(R ,R ) sn+1 b sn+1 bn-1 sn+1 Out_Sel Rbn-1 Combinatorial Logic

Figure 4-4. Architecture diagram of merge-sorter PE.

Table 4-1. Merge-sorter PE next-state logic.

Condition Next Rs Next Rb Output ∧ ∧ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rbn ) Pr(Rbn ,Rsn+1 ) Rsn Rbn Rbn−1 ¬ ∧ ∧ Pr(Rbn−1 ,Rsn ) Pr(Rbn−1 ,Rbn ) Pr(Rbn ,Rsn+1 )Rbn−1 Rbn Rsn ¬ ∧ Pr(Rbn−1 ,Rbn ) Pr(Rbn ,Rsn+1 ) Rbn Rbn−1 Rsn ∧ ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rsn Rsn+1 Rbn−1 ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rsn ,Rsn+1 ) Rsn+1 Rsn Rbn−1 ¬ ∧ ∧ ¬ Pr(Rbn−1 ,Rsn ) Pr(Rbn−1 ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rbn−1 Rsn+1 Rsn ∧ ¬ ∧ ¬ Pr(Rsn ,Rsn+1 ) Pr(Rbn−1 ,Rsn+1 ) Pr(Rbn ,Rsn+1 )Rsn+1 Rbn−1 Rsn

one with an infinitely-low priority. By setting Rnn+1 to the infinitely-low priority element, we effectively make sure that no element is ever shifted in from that signal. In order to represent these values, we introduce an additional flag register which follows each value in the PE to indicate if it is a negative- or positive-infinity priority element, or a valid data element for comparison as shown in Figure 4-4. To create a 2K-way merge-sorter pipeline from the two-way PEs, we combine K PEs in

th a systolic array. The Rb output of the n PE is provided as an input to the Rbn-1 port of the

th th th n+1 PE. The Rs output of the n+1 PE is provided as an input to the Rsn+1 port of the n PE. These signals pass both the current values and flags of each register. The kth PE has the

81 Tuple-Value Cache (Dual-Input BRAM+Free Table) value ptr

value ptr 2K-way Merge Sorter Pipeline tuple_in in -∞ Rsn+1 Rsn Rsn+1 Rsn Rsn+1 Rsn Rsn+1 tuple_out out tuple_en en PE0 PE1 PEk tuple_valid valid Flagsn Flagsn-1 Setn Flagsn-1 Setn Flagsn-1 Setn tuple_flush flush Rbn Rbn-1 Rbn Rbn-1 Rbn Rbn-1 Rbn tuple_swap swap clk rst clk rst clk rst Controller clk rst Figure 4-5. Architecture diagram of merge-sorter pipeline.

Rsn+1 port permanently tied to a negative-infinity priority element, effectively ending the systolic array. This architecture can be seen in the 2K-way merge-sorter pipelines of Figure 4-5. Using the 2K-way merge-sorter we can sort any set of elements by 2K elements at a time. The pipeline is initialized on reset to all infinitely-high priority elements. We then shift an

th element of the set we want to sort into the sorter pipeline through the 0 PE’s Rbn-1 port each clock cycle. As we shift our set into the sorter pipeline, the infinitely high priority elements are shifted out. When 2K elements of our set is pushed into the sorter pipeline, we begin to push out the 2K-wise sorted subsets of elements. It is important to note that any elements pushed in after 2K elements will not be sorted with respect to the element that was pushed out, meaning we need to perform additional steps to sort sets greater than 2K elements. Once our set has been pushed into the array, we now need to flush the values out. The default way would be to push in elements of infinitely low priority, pushing out elements in the pipeline until it is empty. This method will require that the sorter is reset after each use, and also wastes 2K clock cycles that could be used to perform additional sorting. One limitation of the design of the merge sorter presented in [78] is that once the pipeline is full we have to flush the elements out by pushing infinitely-low priority elements in, and then resetting the pipeline. To overcome this limitation, we introduce the concept of a current-set register for each PE which provides a way to identify elements that are currently being sorted. Each PE in the sorter pipeline has a current-set register which is connected directly to the

82 1: procedure Pr(A, B) 2: if A.set ≠ B.set then ▷ Check if elements are in same set 3: if A.set ≠ pe.set then ▷ If A’s set does not match the PE it has priority 4: return true 5: else 6: return false 7: end if 8: else if A.flags.value = ∞ ∨ B.flags.value = −∞ then 9: return true 10: else if A.flags.value = −∞ ∨ B.flags.value = ∞ then 11: return false 12: else 13: return Pr’(A, B) ▷ Priority depends on comparative function Pr’(A, B) 14: end if 15: end procedure

Figure 4-6. Pseudocode for systolic-array priority function. set output of the previous PE in the array as shown in Figure 4-5. The set input of the 0th PE is tied to a register which can be toggled at any time to change the current set for the pipeline. Each element pushed into the pipeline is then assigned the set which matches the value of the current-set register at the time it is pushed into the pipeline. We must redefine the priority function to a take into account the set-register values. Figure 4-6 illustrates the rules for determining priority with sets; returning true means that element A has higher priority. When comparing elements of different sets, the element whose set does not match the current PE set register is given priority, pushing those values out of the pipeline first. If two values have the same set, they are compared normally regardless of the current-set of the PE. For example, for a single-bit set register we have two sets with values zero and one. If the current-set register is set to zero, each element pushed into the sorter pipeline will be a member of set zero. Once all elements have been pushed in, and we want to flush the sorter pipeline, we can change the current-set register and begin sorting a different set. Each clock cycle, the current-set value will propagate down the sorter pipeline and begin pushing out all elements of the previous set. Using this approach, we can sort multiple sets without ever having to flush and reset our merge-sorter pipeline, maximizing our sorter throughput.

83 4.2.1.1 Sorting-pipeline architecture

The merge-sorter architecture discussed so far is meant to sort values with a fixed-width field by passing it through the sorter pipeline with its indices. When dealing with arbitrary semirings, as in the case with different graph-processing applications, we need to provide a method of sorting values which may vary in size. One approach would be to allocate an extra-wide value register in each PE, but this would waste significant logic resources. In our approach, we take advantage of the on-chip BRAM of FPGAs to act as cache for the value field of tuples. We then sort tuples using their address in the BRAM cache as the value field in the sorter pipeline. When a tuple is inserted into the sorter pipeline, we look up the next free slot in the BRAM cache and assign the value to that address. We then pass the address into the sorter pipeline as the value of the tuple. When a value comes out of the pipeline, we lookup the associated value in the BRAM cache and output it, marking the slot in the BRAM as free. This approach also has the added benefit of greatly reducing the number of register resources required for long sorter pipelines. A block diagram of our sorter architecture with tuple-value cache is presented in Figure 4-5. One challenge to the fully-associative BRAM cache approach is keeping track of the available free slots. Since the order of the elements coming out of the merge sorter is arbitrary, we must quickly lookup and assign free slots to avoid fragmentation and pipeline stalls. To achieve this, we use an array of 32-bit registers to represent the available slots in the cache. Each bit of the register array corresponds to an address within the tuple-value cache. In order to determine the next-free value, we use a tree-based decoder which selects the lowest- order register with at least one zero bit. We then use a 32-bit barrel shifter and a logical-OR operation to set the bit in the selected register. Similarly, when an element comes out of the sorter pipeline, we can quickly lookup the register index using a barrel shifter and mask off the specific bit corresponding to that address.

84 DMA Interface DMA Interface

src_addr Set Read DMA Cntl Sorter Controller Set Write DMA Cntl dst_addr swap Row/Column Swap swap Row/Column Swap buf_addr row size FIFO 0 row done row row

Merge Sort Op. Merge col FIFO 1 2K-Way col ...... tuple_in value Merge value FIFO ready row row write Sorter col col tuple_out FIFO M-1 col col valid read M set buffers Output Buffer Tuple Stream Tuple set_sel

Figure 4-7. Top-level sorter architecture (some signals omitted for clarity).

4.2.1.2 Merge-sorter controller

The final component of the merge-sorter architecture is the set-merge controller. The merge-sorter pipeline allows us to efficiently sort lists of 2K elements or fewer. In order to sort an arbitrarily large set of tuples, however, we must merge presorted subsets of 2K-element into one sorted set. A high-level architecture for our merge controller is presented in Figure 4-7. The merge-sorter controllers has two modes of operation as shown in Figure 4-8. The streaming mode (Figure 4-8A) pushes values directly through the 2K-way merge-sorter pipeline writing the results directly back to memory. This results in our arbitrarily-large set of tuples being sorted in 2K-way sorted subsets. The merge mode (Figure 4-8B) allows us to take the sorted subsets and merge them M sets at a time, where M is a configurable number of set buffers up to the length of the pipeline. In Figure 4-8B we are taking the four sorted sublists of 2K elements and combining them into two 4K element subsets since M equals two. In order to merge multiple subsets, we begin by reading values from each of the M subsets into their respective subset buffers. A DMA read controller handles all memory read requests from the set-buffer controllers, which keeps track of the address of each subset, shown in Figure 4-7. While the sorter pipeline is not full, we push values into the array in a round- robin fashion, also inserting with each element a subset identifier field which keeps track of the subset it belongs to. When the first valid element comes out of the array, we use the

85 2k 2k 2k 2k

Systolic-Array Sorter Systolic-Array Sorter (Streaming Mode) (Merge Mode;M=2)

4k 4k

2k 2k A 2k B

Figure 4-8. Overview of the sorter in (A) streaming mode, which sorts streaming data into 2K-wise sorted subsets, and (B) merge mode, which merges M subsets.

subset identifier to determine which subset queue to read the next non-zero value from. This guarantees that the relative sorted order of the subsets are maintained. 4.2.1.3 Merge-sorter performance analysis

In this section, we provide an analysis of the performance of the merge-sorter component. When operating in streaming mode the k-way sorter requires only N + k clock cycles to sort N elements. The total streaming time is shown in Equation 4–3. In this equation, k refers to the depth of the sorter pipeline.

−1 TStreaming = (N + k)fclk (4–3)

To determine the total merge time for a set of N elements larger than k, we must first determine the total number of sorting stages required. If we configure our merge-sorter controller to have M sorting buffers, we are able to merge up to M sets at a time. For a k-way ⌈ N ⌉ sorter pipeline, that means we have sets of size k and therefore have k sets. We can then calculate the total number of sorting stages as shown in Equation 4–4. To determine the total merge time, we multiply the number of stages by the amount of work performed per stage. Since in every stage we need to move the entire set through the pipeline, this means we require at least N cycles per stage. Although there are some cases where a stage may consist of single set that does not need to be sorted, we still need to copy it in order to make room for the next sorting step. Using our set register optimization we only incur the penalty of k cycles to flush

86 the pipeline once, at the end of our merge sorting. The total merge time is shown in Equation 4–5.

⌈ ⌈ ⌉⌉ N Stages = log (4–4) m k

( ⌈ ⌈ ⌉⌉ ) N T = N log + k f−1 (4–5) Merge m k clk In order to sort any arbitrary array of N numbers, we fist must order it into sorted sub- arrays and then merge sort those sub-arrays. The first sort is done through a streaming sort, and may have been performed when copying elements directly from the network interface or from memory to the sorting buffer. After the sub-arrays are presorted, they are then merge sorted together M subsets at a time until the final sorted array is calculated. In order to sort an array of N numbers, we use an in-memory buffer of size 2N and ping-pong the data between N-sized buffers at each sorting step until the final sorted array is generated. Equation 4–6 show the total sorting time including both the streaming- and merge-sort stages. Note that in this equation, the penalty of k cycles is only counted once due to the set buffers. The average number of cycles-per-element required to sort sets of increasing sizes in illustrated in Figure 4-9. We see that for different values of k, the minimum number of cycles per element occurs when the set size is equal to k, and steadily increases as the number of merge steps required increases. When increasing M, the number of sets that can be merged at a time, there is a significant reduction in the number of cycles per element required. To maximize sorter performance, M should be as made as large as possible, up to k sets. The limiting factors on the value of M is the available memory bandwidth, available buffer size, and frequency of the sorter pipeline.

[ ( ⌈ ⌈ ⌉⌉) ] N T = T + T = N 1 + log + k f−1 (4–6) Sort Streaming Merge m k clk

87 4096 k 4096 M 1024 2 2048 16 4096 128

512 512

A 64 64 B

8 Average Cycles per Element Average 8

0 5 10 15 20 25 30 20 25 210 215 220 225 230 2 2 2 2 2 2 2 Number of Elements Sorted Number of Elements Sorted Figure 4-9. Pipelined merge-sorter performance analysis when varying (A) value of k, and (B) value of M.

4.2.2 ALU Architecture

In this section we overview the design of the arithmetic logic unit (ALU) for our graph- processor architecture for SpGEMM and breadth-first search. The ALU architecture supports a variable-width input which can range from a single word to multiple words wide, configurable at compile time. Using configuration bits for the ALU, we can specify a variety of operations common for semirings used in graph-processing, including the operations defined in Section 4.1.3. The ALU can be used to perform binary or unary operation on a arbitrary tuple, up to a predefined maximum determined at compile time for the architecture. In unary operation mode, the operation to be performed on each field is selected by the configuration bits of the ALU, and the appropriate registers are selected by the ALU instruction. In binary operation mode, each ALU field can select a conditional operator such as a comparator, and an operation applied to each field based on the result of that condition. The current condition operators include: less than, greater than, equal to, greater-than scalar, less-than scalar as shown in Figure 4-10. The current binary operators supported are: min, max, select, and scalar multiply- add (the scalars are specified in general registers). One advantage of using FPGAs for the

88 tuple0 tuple1

(row,col,value) (row,col,value)

lhs.swap_idx rhs.swap_idx lhs.swap_val_row Index/Value Select Index/Value Select rhs.swap_val_row lhs.swap_val_col rhs.swap_val_col row' col' value' row' col' value'

opcode Index Equality Op(Min,Max,+,*,Sel) zero_select

(row,col,value) Figure 4-10. Design of our ALU supporting various semirings. design of our architecture is the ability for developers to replace our ALU hardware unit with one optimized for their application. The partial reconfiguration ability of FPGAs would be particularly useful for replacing the ALU while the graph processor is running, allowing users to change their application-specific semiring at any time. 4.2.3 HISC/R Storage Controller

In this section, we present an overview of the controller architecture for the Hashed-Index, Sparse-Column/Row(HISC/R) storage format. The storage controller handles accessing matrix data from main memory through the RC Middleware’s DMA interfaces. As discussed in our previous work presenting the HISC/R storage format, a novel hypersparse matrix storage format [79], the sparse-matrix storage format and the properties of the matrix being stored greatly impacts the storage and lookup overhead of the architecture. Since we are focusing on distributed graph datasets which are typically hypersparse, we choose HISC/R as our primary storage format. However, since operations between hypersparse matrices may become sparse or even dense, we also provide specialized controllers for CSC/R and DCSC/R. As a hashed-based storage format, the performance of HISC/R depends on the hash function chosen. We use tabulation hashing, a hash function based on lookup tables, Ti, initialized with random bits which are then XORed together as shown in Equation 4–7. The address for each randomized lookup table is generated using bit fields from the key, usually consisting of one to four bits. In order to guarantee good performance of our hash function, we need to select them uniform randomly from the set of all possible hash functions. If any

89 correlation exists between the bits of our hash function, it could detrimentally impact the distribution of values, and thus the performance of HISC/R. To guarantee good performance of HISC/R, we compare various psuedo random-number generation (PRNG) techniques on FPGAs when used with tabulation hashing.

h(x) = T1[x] ⊕ T2[x] ⊕ · · · ⊕ Tc[x] (4–7)

The Mersenne Twister (MT) [80] is a high-quality PRNG with favorable properties such as a period of 219937 − 1. It is often used as the PRNG in software programming systems, however its computational complexity and memory requirements make it relatively expensive to implement on FPGAs. A more common approach to generating PRNGs on FPGA is the linear-feedback shift register (LFSR). The LFSR is a linear-recurrence generator based on the mathematical properties of finite fields. The next state is determined wholly by its current state and specific bits which generate a feedback bit, making them extremely fast and cheap to compute. An LFSR is represented by a characteristic polynomial which chooses specific bits and either provides feedback at those bits (Galois LFSR), or uses those bits to calculate feedback which is then shifted in at the end of the register (Fibonacci LFSR). The characteristic polynomial of the LFSR determines its period and sequence of PRNG values it cycles through. A maximal LFSR is described by a characteristic polynomial which maximizes its sequence length for the number of bits in the register. Although maximal LFSRs are commonly used for generating random numbers on FPGAs, the properties of the random numbers generated make them unsuited for applications such as Monte Carlo simulations, where an unbiased distribution of numbers are required, without special considerations [81]. We compared the performance of various PRNGs used to generate the random table data when used with tabulation hashing for hashing both sequential and randomized sequences of keys. We compared the MT19937, maximal LFSR-32, 64, and 128, and XORSHIFT [82] PRNGs when used with tabulation hashing. We use the hash function quality metric defined

90 ●

1.008

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1.005 ● ●

● 1.002 ● Hash Function Quality ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● 0.999 ●

MT19937 LFSR32 LFSR64 LFSR128 XORSHIFT PRNG Method Figure 4-11. Comparison of tabulation-hash quality metric for different PRNGs.

th in [83] which is shown in Equation 4–8, where bj is the number of items in the j bucket, m is the number of buckets, and n is the number of items. This metric compares the distribution of non-zeros with the expected behavior of a random function, and generates a decimal value which should be between 0.95 and 1.05. We classify any values above 1.05 as failing. Figure 4-11 presents a box plot of the achieved hash quality metric for many different sequences of keys. We use a four-bit key partitioning of a 32-bit key resulting in eight randomized lookup tables of 16 randomized words each. Each PRNG was seeded randomly and was subsequently used to generate random bits to initialize the tabulation-hash hash tables. According to our results, although the 32-bit maximal LFSR performs poorly compared to the other LFSR sizes, MT, and XORSHIFT, it is still within the allowed range of quality. The XORSHIFT PRNG provides performance closest to the MT PRNG, but the 128-bit LFSR provides similar performance with the simplest implementation. Based on our results, we

91 PRNG[0] IDX[0]PRNG[N-1]... IDX[N-1]

D Q D Q H[0] HISC/R Controller Tabulation Hash en en row_in index hash col_in XOR Array D Q D Q H[1]

FIFO en en ...... val_in ... cmd seed wr Command Queue D Q D Q H[M-1] Tuple Interface Tuple init Controller PRNG en en format Memory Controller seed Rd Control rd_data base State Machine rd_address

Con fi guration base FIFO rd_valid Controller rd_en hash Wr Control wr_data wr_address

tuple FIFO wr_ready DMA Interface Lookaside wr_en Hash Table row_out Tuple Output Buffer col_out clk val_out rst (BRAM) FIFO valid enable Tuple Interface Tuple

Figure 4-12. Controller architecture for HISC/R storage format.

choose a 128-bit LFSR for the PRNG in our sparse-matrix processor, giving us good random performance with the simplest implementation.

− m∑1 b (b + 1)/2 q = j j (4–8) (n/2m)(n + 2m − 1) j=0 Figure 4-12 presents a block diagram of the hardware HISC/R storage controller. The HISC/R storage controller is composed of four primary units: the HISC/R controller logic, the pipelined tabulation-hashing unit, the memory controller, and tuple output buffers. The controller logic is responsible for accepting non-zero row or column lookup requests and scheduling the appropriate memory accesses. The Lookaside Hash Table (LHT) is a secondary hash table implemented in BRAM to optimize memory performance. A primary function of the LHT is to cache indices of columns/rows that were previously accessed and determined to be empty, reducing accesses to main memory. By filtering these requests, we reduce the number of random accesses and improve the memory throughput and latency of HISC/R. The tabulation hashing unit uses the method described in [66] to map a k-bit index

⌈ k ⌉ c to a j-bit address using c lookup tables. Each table has 2 entries of j bits, which can be implemented in BRAM or registers. The bits of the tables are initialized with randomized data from a pseudo-random number generator to create a random hash function. Each set of c bits ⌈ k ⌉ from the index select a j bit word from a table. All c j-bit words are then XORed together to

92 Table 4-2. Graph-processor resource analysis. Logic Usage Block-Ram Bits DSP-18 Component Total % Total % Total % Controller 97 741 23 131 072 0.6 16 1.6 Merge Sorter 135 978 32 1 048 576 5 0 0 ALU 67 994 16 2048 0 64 6.3 Total 301 713 71 1 181 426 5.6 80 7.9

generate the bucket address. We use an XOR tree with configurable latency to maximize the achievable clock frequency for large index and address sizes. 4.2.4 FPGA Resource Analysis

Table 4-2 provides a post-fit FPGA resource usage summary when our architecture was compiled on an Altera Stratix-IV E530 FPGA. The merge-sorter (K=4096, M=32) component required significantly more logic and memory resources than other components due to the sorter-pipeline registers and set input buffers. Since for denser graphs the merge sorter performance dominates the computation time, it is important we maximize the depth of the sorter and number of sets it can merge. The second largest component was the controller, which consists of the control-logic state machine and matrix/general registers, and DMA controllers and buffers. Finally, the ALU which consists of wide tuple-input registers, control bits, and several different mathematical operations is the third largest component. Although the network loopback interface does require some resources for the FIFO message buffer, it is not included in our resource analysis. In total our single-PE architecture requires approximately 71% of Stratix-IV logic resources, and only 5.6% of BRAM bits. 4.3 Experimental Setup

The case studies presented in Sections 4.4-4.6 are all performed using a single FPGA bitstream containing the architecture presented in Section 4.2. The bitstream was compiled for the GiDEL PROCStar IV E530 platform [33] in Quartus 14.1 with RC Middleware version 2.1. The PROCStar IV has four Stratix-IV E530 FPGAs connected to the host computer through a PCIe x8 interface. Each FPGA of the PROCStar IV has three memory banks, 2x4 GB DDR2

SODIMMs with an aggregate bandwidth of 8 GB/s, and a 512 MB DDR2 SDRAM with 4 GB/s of bandwidth. All software, including the software baselines, was compiled with GCC 4.9.2 using the default optimization levels included by the software build scripts. All R-MAT graph datasets were generated using GTGraph [84] by adding random noise to the initiator matrix. Our software baselines were executed on dual Xeon E5620 CPUs running at 2.40 GHz.
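For reference, the sketch below shows one common way R-MAT edges are generated: recursive quadrant selection with a perturbed initiator. The initiator probabilities shown are the Graph 500 defaults and the noise model is illustrative; GTGraph's exact parameters may differ.

    #include <cstdint>
    #include <random>
    #include <utility>

    // Illustrative R-MAT edge generator: at each of 'degree' levels, choose
    // one of four quadrants with probabilities (a, b, c, d), perturbing the
    // initiator slightly per level ("random noise"). Graph 500 default
    // initiator assumed: a = 0.57, b = 0.19, c = 0.19, d = 0.05.
    std::pair<uint64_t, uint64_t> rmat_edge(int degree, std::mt19937_64 &rng) {
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        uint64_t row = 0, col = 0;
        for (int level = 0; level < degree; ++level) {
            double noise = 0.05 * (uni(rng) - 0.5);                 // illustrative noise
            double a = 0.57 + noise, b = 0.19 - noise / 3, c = 0.19 - noise / 3;
            double r = uni(rng);
            uint64_t bit = 1ULL << (degree - 1 - level);
            if (r < a)              { /* top-left quadrant: neither bit set */ }
            else if (r < a + b)     { col |= bit; }
            else if (r < a + b + c) { row |= bit; }
            else                    { row |= bit; col |= bit; }
        }
        return {row, col};   // one non-zero of a 2^degree x 2^degree matrix
    }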

4.4 Case Study: Sparse Generalized Matrix-Matrix Multiplication

In this section we present the results of our SpGEMM case study comparing our architecture with two optimized CPU baselines: SuiteSparse [85], the optimized sparse-matrix library at the heart of MATLAB, and CombBLAS [25], a parallel, MPI-based BLAS toolkit designed for graph processing. To compare SpGEMM performance, we explore varying matrix degrees, which determine the total number of vertices, at different levels of sparsity, which determines the number of non-zeros in the matrix. For each test we generate two random R-MAT matrices with the desired degree and sparsity, multiply them using our architecture and the CPU baselines, and record the raw execution time, not including input/result read/write times. In addition to the CPU baselines, we record the execution times for each test for our architecture (HW) using each of the three supported storage formats. Figure 4-13 presents the results of our SpGEMM case studies, varying the matrix densities for different R-MAT matrix degrees. When compared with CombBLAS, our architecture consistently gives an order of magnitude or greater performance improvement, with the exception of our degree-20 test case at about 0.1 non-zeros per row/column. The peak speedup for the degree-26 case study is more than a 20× improvement for both the hypersparse and sparse cases. When compared with SuiteSparse, our architecture provides a performance improvement similar to that achieved against CombBLAS. At lower matrix degrees, SuiteSparse provides an execution-time advantage over CombBLAS, which does reduce our achieved speedup. This improved performance is likely due to the sparse-accumulator data structure used by the SuiteSparse library.
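To make the explored parameter space concrete, assume the standard R-MAT/Graph 500 convention that a degree-d matrix has n = 2^d vertices; the number of non-zeros then follows from the average row/column density r:

    n = 2^{d}, \qquad \mathrm{nnz} \approx r \cdot n .

For example, a degree-26 matrix at a density of 10 non-zeros per row/column has roughly 10 \cdot 2^{26} \approx 6.7 \times 10^{8} non-zeros, while the same matrix at a hypersparse density of 10^{-3} has only about 6.7 \times 10^{4}.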

[Figure 4-13: six log-log panels of execution time (s) versus average row/column density; series: SuiteSparse, CombBLAS, HW (CSC/R), HW (DCSC/R), HW (HISC/R).]

Figure 4-13. Comparison of our architecture running SpGEMM with CombBLAS and SuiteSparse baselines. R-MAT degrees: (A) 16 (B) 18 (C) 20 (D) 22 (E) 24 (F) 26.

[Figure 4-14: BFS throughput in billions of traversed edges per second (GTEPS) versus R-MAT matrix degree (20-23); series: Betkaoui et al., Attia et al., Graph Processor, Graph Processor (Scaled).]

Figure 4-14. Comparison of our architecture running BFS with state-of-the-art designs on the Convey HC-1/HC-2.

At larger matrix sizes, the overhead of allocating the sparse accumulator leads to SuiteSparse having a plateau in performance for the hypersparse cases. The peak speedup compared to SuiteSparse for the hypersparse cases is several orders of magnitude for the degree-24/26 test cases. For the sparse region, we achieve a peak speedup of more than 40× for degree-24 matrices. We also measured the power required for executing our architecture on the CPU host and the PROCStar IV. We measured the power required by the PROCStar IV by subtracting the static power of the host alone from that of the host plus PROCStar IV. Our results indicate that our FPGA architecture requires less than 12% of the power of the CPU host, even when including the static power of the three unused FPGAs on the PROCStar IV board. Taking this power efficiency into account, our architecture achieves more than a 200×/400× performance-per-watt improvement when compared to CombBLAS/SuiteSparse.

4.5 Case Study: Breadth-First Search

In this section we analyze the performance of our architecture performing the BFS spanning-tree calculation using the linear-algebra formulation presented in Section 4.1.3. We compare our BFS performance with state-of-the-art BFS architectures running on the Convey

HC-1/HC-2, including CyGraph [16] and another HC-1 architecture described in [86]. We generate random R-MAT matrices with an average edge factor of eight for scales 20, 21, 22, and 23, matching the datasets used by the works we compare against. We generate the BFS spanning tree from a randomized starting node in the graph and execute until completion, recording the number of edges traversed and the execution time. We report the performance as the number of edges traversed per second. Figure 4-14 presents our BFS performance compared with the works presented in [16] and [86]. One problem with comparing the HC-1/HC-2 architectures with our design is the available memory bandwidth. The PROCStar IV provides an order of magnitude less aggregate memory bandwidth than the HC-1 and HC-2 systems (8 GB/s vs. 80 GB/s), making it difficult to compare the efficiency of the approaches. Therefore, we scale our measured throughput in billions of edge traversals per second (GTEPS) for BFS by a factor of 10 to adjust for the difference in memory bandwidth. We present both the raw and scaled throughput in Figure 4-14. Based on Figure 4-14, we see that our architecture performs better than the compared approaches when scaling for bandwidth. Our architecture provides a peak performance of around 0.64 GTEPS (scaled) for a scale-20 R-MAT matrix. It is important to note, however, that directly scaling the performance by a memory-bandwidth factor does not take into account other performance factors such as the achievable architecture throughput. For memory-bound algorithms such as graph processing, however, it is reasonable to assume that an increase in bandwidth would result in a similar increase in performance.
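For clarity, the reported figures follow directly from the measured traversals and the stated bandwidth ratio; the raw value implied by the scaled peak below is an arithmetic consequence, not a separately measured number:

    \mathrm{GTEPS}_{\mathrm{raw}} = \frac{\text{edges traversed}}{10^{9} \cdot t_{\mathrm{exec}}}, \qquad
    \mathrm{GTEPS}_{\mathrm{scaled}} = \mathrm{GTEPS}_{\mathrm{raw}} \cdot \frac{80\,\mathrm{GB/s}}{8\,\mathrm{GB/s}} = 10 \cdot \mathrm{GTEPS}_{\mathrm{raw}},

so the 0.64 GTEPS (scaled) peak corresponds to roughly 0.064 GTEPS measured on the PROCStar IV.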

4.6 Graph-Processor Architecture Scalability Analysis

In this section we present our approach to projecting the SpGEMM performance scalability of our architecture leveraging the Novo-G# multi-FPGA reconfigurable supercomputer [71]. To predict the performance of our architecture running on multiple FPGAs in Novo-G#, we leverage the discrete-event network simulation model presented in [71]. One challenge in predicting the performance scalability of our application is that the workload and performance are data driven and highly irregular.

[Figure 4-15 diagram: the workload simulator (algorithm, partition, and storage modules running as threads on an MPI cluster) profiles memory accesses, operation counts, and message counts per PE, which feed the Novo-G# network-simulator model (VCT switching, dimension-order routing, link contention).]

Figure 4-15. Scalability simulation approach combining experimental results with the Novo-G# simulation model.

In order to capture the actual performance of each node in the network, we developed a C++ workload simulator which executes actual SpGEMM operations and profiles the computation and communication of each simulated network node. Our simulator records metrics including the number of memory reads, writes, and ALU operations, and the number of messages sent between each pair of processors. We then pass these statistics into the graph-processor stimulus model running on the Novo-G# network simulator, as shown in Figure 4-15. We record the total simulated time for the computation to complete and compare it with the CPU baseline execution time. We run our profiler at a fixed problem size of 10 non-zeros per row/column for a randomly generated degree-26 R-MAT matrix with an increasing level of parallelism for both 2D- and 3D-torus networks, and calculate the speedup. For both the 2D- and 3D-torus cases we use a block-cyclic decomposition of our input datasets and assume the data is randomly permuted. A summary of the simulation parameters can be found in Table 4-3. Each processing element (PE) in the simulation runs four tasks. Each PE starts by reading its local graph data and redistributing it to nodes in the network. At the same time, incoming edge data is used to index into the locally stored matrix and calculate the resulting partial products, if any. The partial products are then redistributed and accumulated by the owning processes. The number of messages sent and received, and the total memory read time for each process, were determined through the workload simulator. The time to accumulate the partial products owned by each node was determined by the analytical performance model of the pipelined sorter.

Table 4-3. Summary of parameters used to simulate SpGEMM scalability.
Parameter            Value
Data Partitioning    2D block cyclic
Block Size           2^16 by 2^16
Mapping              Modulo
Dataset              RMAT-26
Sparsity             10 elements/row
Channel Width        4 bits
Channel Rate         10 Gbps
Channel Delay        40 ns
Router Frequency     250 MHz
Routing Algorithm    Dimension Order
Switch               Cut Through
Routing Cycles       2
Flit Width           256 bits
Header Flits         1
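A minimal sketch of the partitioning in Table 4-3, assuming block (bi, bj) of the 2^16-by-2^16-blocked matrix is assigned to PE (bi mod P, bj mod Q) on a P-by-Q grid; the function and struct names are illustrative, and the sketch does not capture how the third dimension of the 3D-torus configurations is used.

    #include <cstdint>

    // Illustrative 2D block-cyclic, modulo mapping of a non-zero to its
    // owning processing element on a P x Q grid.
    struct Owner { uint32_t pe_row, pe_col; };

    Owner owner_of(uint64_t row, uint64_t col, uint32_t P, uint32_t Q,
                   uint64_t block = 1ULL << 16) {   // 2^16 x 2^16 blocks (Table 4-3)
        uint64_t bi = row / block;                  // block-row index
        uint64_t bj = col / block;                  // block-column index
        return Owner{ static_cast<uint32_t>(bi % P),   // cyclic (modulo) mapping
                      static_cast<uint32_t>(bj % Q) };
    }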

In our model, we assume that each receive channel can be routed simultaneously to the appropriate output channel as long as there is no write contention. Figure 4-16 presents the simulated speedup compared to the CombBLAS single-CPU baseline for increasing levels of parallelism. Our results include up to a 6×6 (36-node) 2D-torus configuration and up to a 4×4×4 (64-node) 3D-torus configuration. The limited cross-sectional bandwidth of the 2D-torus configuration limits the scalability of our processor architecture, causing the speedup to fall off beyond 25 processors. We achieve up to a 500× simulated speedup for a 6×6 2D torus, giving a parallel efficiency of approximately 0.64. The improved cross-sectional bandwidth of the 3D-torus configuration allows it to maintain its performance better as the number of processors increases. We achieve up to a 980× simulated speedup for a 4×4×4 3D torus, giving a parallel efficiency of approximately 0.70.
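As a sanity check, assuming parallel efficiency is defined as E_N = S_N/(N S_1) relative to the single-PE speedup S_1, the reported values imply

    E_N = \frac{S_N}{N \, S_1} \;\Rightarrow\; S_1 \approx \frac{500}{36 \times 0.64} \approx 22 \quad \text{and} \quad S_1 \approx \frac{980}{64 \times 0.70} \approx 22,

which is consistent with the roughly 20× single-FPGA SpGEMM speedup reported in Section 4.4.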

4.7 Summary and Conclusions

In this chapter we presented an extensible FPGA graph-processor architecture designed for the linear-algebra formulation of graph applications, focusing on sparse generalized matrix-matrix multiplication and breadth-first search kernels.

[Figure 4-16: simulated speedup versus number of processors for 2D-torus (2×2 through 6×6) and 3D-torus (2×2×2 through 4×4×4) topologies.]

Figure 4-16. Simulated SpGEMM speedup for an increasing number of nodes for different network topologies on the Novo-G# system model.

We presented a detailed look at the design of our architecture, identifying key architecture components and providing a detailed overview and analysis of each. We provided a method of extending our architecture to support various graph-processing applications by replacing our BFS ALU with an ALU which supports an application-specific semiring. Using a single bitstream for our graph-processor architecture compiled on the GiDEL PROCStar IV platform, we achieved up to a 20×/40× speedup compared to CombBLAS/SuiteSparse at 12% of the power over multi-threaded CPU baselines when performing SpGEMM with scale-26/24 R-MAT matrices. We achieved competitive performance to state-of-the-art BFS architectures on the Convey HC-1/HC-2 after adjusting for the available memory bandwidth of a single processor. Finally, we presented our methodology for exploring the scalability of our architecture performing SpGEMM on the Novo-G# multi-FPGA supercomputer for 2D- and 3D-torus networks. Our simulation model projects up to a 500× speedup over the CPU baseline for a 6×6 2D torus, and up to a 980× speedup for a 4×4×4 3D torus.

CHAPTER 5
CONCLUSIONS

In this work we have addressed the portability and productivity challenges associated with HPRC application design, tackled the challenges of efficient sparse-matrix storage formats for graph processing on FPGAs, and applied what we learned and developed to engineer a scalable graph-processor architecture on FPGAs. We explored the performance of our architecture using sparse generalized matrix-matrix multiplication (SpGEMM) and breadth-first search (BFS) kernels. We compared the performance of our architecture using SpGEMM and demonstrated over a 26× speedup at only 12% of the power when compared to a multi-threaded CPU baseline. We also compared the BFS performance of our architecture to current state-of-the-art approaches and determined that our architecture performs competitively after accounting for the differences in aggregate bandwidth.

To address the scalability, portability, and productivity hurdles of FPGA-application development, we proposed a novel RC Middleware (RCMW) layer which presents an application-specific view of platform resources. RCMW allows users to define the resources and interfaces required by their specific application, and the RC Middleware will map those resources onto a target platform at compile time. This approach enables seamless portability of FPGA applications across FPGA platforms supported by the RC Middleware. RCMW currently supports four heterogeneous platforms from three vendors: the GiDEL PROCStar III and IV, the Pico Computing M501, and the Nallatech PCIe-385N. Platform support in the RC Middleware is enabled by a descriptive XML format which describes the platform interfaces and resources, and an IP core library which contains the interface controllers for the platform, allowing new platforms to be added easily. We evaluated RCMW’s performance and productivity benefits for four platforms from three vendors. We demonstrated RCMW’s ability to quickly explore different application-to-platform mappings using a convolution application case study for both area- and performance-optimizing cost functions. We demonstrated that the benefits of RCMW can be achieved with less than 1% FPGA/memory and 7% host/FPGA transfer overhead in the common case.

We also demonstrated that RCMW has relatively low area overhead, requiring less than 3% of logic resources for several applications across all four platforms. We presented evidence that RCMW improves developer productivity by showing that RCMW requires fewer lines of code and less total development time for deploying several kernels than vendor-specific approaches. Finally, we demonstrated that RCMW enables portability by showing that the same application source was able to execute without change across each supported platform.

We addressed the challenges of storing graph adjacency matrices to maximize graph-processing application performance while minimizing storage overhead. There are a wide variety of formats which are optimized for different non-zero distributions, such as diagonal or banded matrices, and for different platform architectures, such as vector processors or GPUs. General formats such as Compressed Sparse-Column/Row (CSC/R) and Doubly Compressed Sparse-Column/Row (DCSC/R), which do not assume any inherent non-zero structure, are commonly used in graph-processing applications. These formats, however, trade off storage against lookup complexity, providing either fast lookups at the expense of high storage overhead for sparse datasets, or low storage overhead at the expense of increased access time for unfavorable non-zero distributions. In order to overcome these limitations, we proposed a novel sparse-matrix storage format called Hashed-Index Sparse-Column/Row (HISC/R). HISC/R replaces the dense indexing vector in CSC/R, and the sparse indexing vectors in DCSC/R, with a hashed indexing vector, enabling constant-time accesses to rows or columns of a matrix. Additionally, HISC/R optimizes the storage of hypersparse matrices by allowing non-zero elements to be stored directly in the hashed indexing vector when no additional space is required. HISC/R provides O(1) lookup complexity and O(nnz) storage complexity while also enabling runtime insert and delete operations, allowing matrices to be constructed directly without using expensive intermediate storage formats. We showed that HISC/R requires significantly less storage than CSC/R and up to 19% less than DCSC/R for hypersparse datasets while maintaining an average hash-table load factor of 71%. Additionally, we showed that HISC/R provides a 14% and 40% improvement in terms of memory reads compared to CSC/R and DCSC/R, respectively, when performing matrix multiplication with hypersparse datasets.

The reduction in the total number of memory accesses and the favorable storage performance for hypersparse datasets make HISC/R uniquely suited for scalable graph processing.

Leveraging our work on the RC Middleware portability framework and the HISC/R storage format, we developed an extensible FPGA graph-processor architecture designed for the linear-algebra formulation of graph applications, focusing on sparse generalized matrix-matrix multiplication and breadth-first search kernels. We presented a detailed look at the design of our architecture, identifying key architecture components and providing a detailed overview and analysis of each. We provided a method of extending our architecture to support various graph-processing applications by replacing our BFS ALU with an ALU which supports an application-specific semiring. Using a single bitstream for our graph-processor architecture compiled on the GiDEL PROCStar IV platform, we achieved up to a 20×/40× speedup compared to CombBLAS/SuiteSparse at 12% of the power over multi-threaded CPU baselines when performing SpGEMM with scale-26/24 R-MAT matrices. We achieved competitive performance to state-of-the-art BFS architectures on the Convey HC-1/HC-2 after adjusting for the available memory bandwidth of a single processor. Finally, we presented our methodology for exploring the scalability of our architecture performing SpGEMM on the Novo-G# multi-FPGA supercomputer for 2D- and 3D-torus networks. Our simulation model projects up to a 500× speedup over the CPU baseline for a 6×6 2D torus, and up to a 980× speedup for a 4×4×4 3D torus.

REFERENCES

[1] L. C. Monerris, E. T. Serrano, J. D. S. Quilis, and I. B. Espert, “Gpf4med: A large-scale graph processing system applied to the study of breast cancer,” in Computational Science and Engineering, 2015 IEEE 18th International Conference on, Oct 2015, pp. 27–34.

[2] S. A. Jacobs and A. Dagnino, “Large-scale industrial alarm reduction and critical events mining using graph analytics on spark,” in 2016 IEEE Second International Conference on Big Data Computing Service and Applications, March 2016, pp. 66–71.

[3] F. Riaz and K. M. Ali, “Applications of graph theory in computer science,” in Computational Intelligence, Communication Systems and Networks (CICSyN), 2011 Third International Conference on, July 2011, pp. 142–145.

[4] L. Ball, “Automating social network analysis: A power tool for counter-terrorism,” Security Journal, vol. 29, no. 2, pp. 147–168, 2016.

[5] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’10. New York, NY, USA: ACM, 2010, pp. 135–146.

[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: A framework for machine learning and data mining in the cloud,” Proc. VLDB Endow., vol. 5, no. 8, pp. 716–727, Apr. 2012.

[7] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’12. Berkeley, CA, USA: USENIX Association, 2012, pp. 17–30.

[8] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica, “Graphx: Graph processing in a distributed dataflow framework,” in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’14. Berkeley, CA, USA: USENIX Association, 2014, pp. 599–613.

[9] A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry, “Challenges in parallel graph processing,” Parallel Processing Letters, vol. 17, no. 1, pp. 5–20, 2007.

[10] D. A. Bader, G. Cong, and J. Feo, “On the architectural requirements for efficient execution of graph algorithms,” in Proceedings of the 2005 International Conference on Parallel Processing, ser. ICPP ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 547–556.

[11] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “Nxgraph: An efficient graph processing system on a single machine,” in 2016 IEEE 32nd International Conference on Data Engineering (ICDE), May 2016, pp. 409–420.

[12] S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and L. K. John, “Proxy-guided load balancing of graph processing workloads on heterogeneous clusters,” in 2016 45th International Conference on Parallel Processing (ICPP), Aug 2016, pp. 77–86.

[13] R. Elshawi, O. Batarfi, A. Fayoumi, A. Barnawi, and S. Sakr, “Big graph processing systems: State-of-the-art and open challenges,” in Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on, March 2015, pp. 24–33.

[14] S. Zhou, C. Chelmis, and V. K. Prasanna, “High-throughput and energy-efficient graph processing on fpga,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2016, pp. 103–110.

[15] N. Engelhardt and H. K. H. So, “Gravf: A vertex-centric distributed graph processing framework on fpgas,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Aug 2016, pp. 1–4.

[16] O. G. Attia, T. Johnson, K. Townsend, P. Jones, and J. Zambreno, “CyGraph: A reconfigurable architecture for parallel breadth-first search,” Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS, pp. 228–235, 2014.

[17] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh, “From opencl to high-performance hardware on fpgas,” in 22nd International Conference on Field Programmable Logic and Applications (FPL), Aug 2012, pp. 531–534.

[18] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2011.

[19] M. M. Wolf, J. W. Berry, and D. T. Stark, “A task-based linear algebra building blocks approach for scalable graph analytics,” in High Performance Extreme Computing Conference (HPEC), 2015 IEEE, Sept 2015, pp. 1–6.

[20] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, “Kronecker graphs: An approach to modeling networks,” J. Mach. Learn. Res., vol. 11, pp. 985–1042, Mar. 2010.

[21] Graph 500 Steering Committee. (2010) Graph 500 benchmark specification. Accessed 2016-08-21. [Online]. Available: http://www.graph500.org/specifications

[22] A. Eisenman, L. Cherkasova, G. Magalhaes, Q. Cai, P. Faraboschi, and S. Katti, “Parallel graph processing: Prejudice and state of the art,” in Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering, ser. ICPE ’16. New York, NY, USA: ACM, 2016, pp. 85–90.

[23] A. Buluç and J. R. Gilbert, “On the representation and multiplication of hypersparse matrices,” pp. 1–11, April 2008.

[24] T. Mattson, D. Bader, J. Berry, A. Buluç, J. Dongarra, C. Faloutsos, J. Feo, J. Gilbert, J. Gonzalez, B. Hendrickson, J. Kepner, C. Leiserson, A. Lumsdaine, D. Padua, S. Poole, S. Reinhardt, M. Stonebraker, S. Wallach, and A. Yoo, “Standards for graph algorithm primitives,” in High Performance Extreme Computing Conference (HPEC), 2013 IEEE, Sept 2013, pp. 1–2.

[25] A. Buluç and J. R. Gilbert, “The combinatorial blas: Design, implementation, and applications,” Int. J. High Perform. Comput. Appl., vol. 25, no. 4, pp. 496–509, Nov. 2011.

[26] T. El-Ghazawi, E. El-Araby, M. Huang, K. Gaj, V. Kindratenko, and D. Buell, “The promise of high-performance reconfigurable computing,” Computer, vol. 41, no. 2, pp. 69–76, Feb 2008.

[27] C. Pascoe, A. Lawande, H. Lam, A. George, W. F. Sun, and M. Herbordt, “Reconfigurable supercomputing with scalable systolic arrays and in-stream control for wavefront genomics processing,” in Proc. of Intl. Conference on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, NV, Jul. 2010.

[28] J. Williams, C. Massie, A. D. George, J. Richardson, K. Gosrani, and H. Lam, “Characterization of fixed and reconfigurable multi-core devices for application acceleration,” ACM Trans. Reconfigurable Technol. Syst., vol. 3, no. 4, pp. 19:1–19:29, Nov. 2010.

[29] B. Betkaoui, D. B. Thomas, and W. Luk, “Comparing performance and energy efficiency of fpgas and gpus for high productivity computing,” in Field-Programmable Technology (FPT), 2010 International Conference on, Dec 2010, pp. 94–101.

[30] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu, “An overview of reconfigurable hardware in embedded systems,” EURASIP J. Embedded Syst., vol. 2006, no. 1, pp. 13–13, Jan. 2006.

[31] A. George, H. Lam, and G. Stitt, “Novo-g: At the forefront of scalable reconfigurable supercomputing,” Computing in Science Engineering, vol. 13, no. 1, pp. 82–86, Jan 2011.

[32] GiDEL Ltd. (2009) PROCStar III Product Brief. Accessed 2016-09-30. [Online]. Available: http://www.gidel.com/pdf/PROCStarIII%20Product%20Brief.pdf

[33] GiDEL Ltd. (2010) PROCStar IV Product Brief. Accessed 2016-09-26. [Online]. Available: http://www.gidel.com/pdf/PROCStarIV%20Product%20Brief.pdf

[34] Pico Computing. (2013) M-501 product brief. Accessed 2016-09-30. [Online]. Available: http://picocomputing.com/wp-content/uploads/2013/09/M-501-Product-Brief1.pdf

[35] Nallatech. (2014) Nallatech pcie-385. Accessed 2016-09-30. [Online]. Available: http://www.nallatech.com/wp-content/uploads/pcie 385pb v1 21.pdf

[36] OpenFPGA Inc. (2008) OpenFPGA. Accessed 2014-07-23. [Online]. Available: www.openfpga.org/

[37] K. Eguro, “Sirc: An extensible reconfigurable computing communication api,” in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010, pp. 135–138.

[38] J. Villarreal, A. Park, W. Najjar, and R. Halstead, “Designing modular hardware accelerators in c with roccc 2.0,” in Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on, May 2010, pp. 127–134.

[39] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel programming standard for heterogeneous computing systems,” Computing in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.

[40] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yiannacouras, and D. P. Singh, “From opencl to high-performance hardware on fpgas,” in 22nd International Conference on Field Programmable Logic and Applications (FPL), Aug 2012, pp. 531–534.

[41] A. Ismail and L. Shannon, “Fuse: Front-end user framework for o/s abstraction of hardware accelerators,” in Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May 2011, pp. 170–177.

[42] Y. Wang, X. Zhou, L. Wang, J. Yan, W. Luk, C. Peng, and J. Tong, “Spread: A streaming-based partially reconfigurable architecture and programming model,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 12, pp. 2179–2192, Dec 2013.

[43] S. S. Huang, A. Hormati, D. F. Bacon, and R. Rabbah, “Liquid metal: Object-oriented programming across the hardware/software boundary,” in Proceedings of the 22nd European Conference on Object-Oriented Programming, ser. ECOOP ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 76–103.

[44] D. Andrews, R. Sass, E. Anderson, J. Agron, W. Peck, J. Stevens, F. Baijot, and E. Komp, “Achieving programming model abstractions for reconfigurable computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 16, no. 1, pp. 34–44, Jan 2008.

[45] L. Cai, D. Gajski, and M. Olivarez, “Introduction of system level architecture exploration using the specc methodology,” in Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on, vol. 5, 2001, pp. 9–12.

[46] J. Kulp. (2010, May) OpenCPI Technical Summary. Accessed 2014-03-31. [Online]. Available: http://opencpi.org

[47] V. Aggarwal, G. Stitt, A. George, and C. Yoon, “SCF: a framework for task-level coordination in reconfigurable, heterogeneous systems,” ACM Trans. Reconfigurable Technol. Syst., vol. 5, no. 2, pp. 7:1–7:23, Jun. 2012.

[48] L. Shannon and P. Chow, “Simplifying the integration of processing elements in computing systems using a programmable controller,” in 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05), April 2005, pp. 63–72.

[49] T. Schumacher, C. Plessl, and M. Platzner, “Imorc: Application mapping, monitoring and optimization for high-performance reconfigurable computing,” in Field Programmable Custom Computing Machines, 2009. FCCM ’09. 17th IEEE Symposium on, April 2009, pp. 275–278.

[50] G. Stitt and J. Coole, “Intermediate fabrics: Virtual architectures for near-instant fpga compilation,” IEEE Embedded Systems Letters, vol. 3, no. 3, pp. 81–84, Sept 2011.

[51] X. Reves, V. Marojevic, R. Ferrus, and A. Gelonch, “Fpga’s middleware for software defined radio applications,” in International Conference on Field Programmable Logic and Applications, 2005, Aug 2005, pp. 598–601.

[52] Nallatech Ltd. (2007) Dimetalk v3.0. Accessed 2014-07-24. [Online]. Available: http://www.nallatech.com

[53] GiDEL Ltd. (2014) Procwizard. Accessed 2014-07-24. [Online]. Available: http://www.gidel.com/procwizard.htm

[54] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer, “Leap scratchpads: Automatic memory and cache management for reconfigurable logic,” in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA ’11. New York, NY, USA: ACM, 2011, pp. 25–28.

[55] Altera Corp. (2007) Avalon memory-mapped interface specification. Accessed 2014-07-24. [Online]. Available: https://www.altera.com/literature/manual/mnl avalon spec.pdf

[56] ARM. (2013) Amba axi and ace protocol specification. Accessed 2015-07-24. [Online]. Available: https://silver.arm.com/download/download.tm?pv=1377613

[57] R. Kirchgessner, A. D. George, and H. Lam, “Reconfigurable computing middleware for application portability and productivity,” in IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors, June 2013, pp. 211–218.

[58] L. Hao and G. Stitt, “Bandwidth-sensitivity-aware arbitration for fpgas,” Embedded Systems Letters, IEEE, vol. 4, no. 3, pp. 73–76, Sept 2012.

[59] J. Schofield, “The statistically unreliable nature of lines of code,” CrossTalk: The Journal of Defense Software Engineering, vol. 18, no. 4, pp. 29–33, April 2005.

[60] OpenCores.org. (2014) Opencores. Accessed 2014-02-02. [Online]. Available: http://opencores.org/

[61] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2003.

[62] N. Bell and M. Garland, “Efficient sparse matrix-vector multiplication on CUDA,” NVIDIA Corporation, NVIDIA Technical Report NVR-2008-004, Dec. 2008.

[63] E. Montagne and A. Ekambaram, “An optimal storage format for sparse matrices,” Inf. Process. Lett., vol. 90, no. 2, pp. 87–92, Apr. 2004.

[64] I. Šimeček, D. Langr, and P. Tvrdík, “Minimal quadtree format for compression of sparse matrices storage,” pp. 359–364, 2012.

[65] M. N. Wegman and J. Carter, “New hash functions and their use in authentication and set equality,” Journal of Computer and System Sciences, vol. 22, no. 3, pp. 265–279, 1981.

[66] M. Pătrașcu and M. Thorup, “The power of simple tabulation hashing,” J. ACM, vol. 59, no. 3, pp. 14:1–14:50, Jun. 2012.

[67] R. Pagh and F. F. Rodler, “Cuckoo hashing,” J. Algorithms, vol. 51, no. 2, pp. 122–144, May 2004.

[68] M. Herlihy, N. Shavit, and M. Tzafrir, Hopscotch Hashing, ser. DISC ’08. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 350–364.

[69] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick, “Parallel de bruijn graph construction and traversal for de novo genome assembly,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 437–448. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.41

[70] J. Riedy and D. A. Bader, “Multithreaded community monitoring for massive streaming graph data,” in Parallel and Distributed Processing Symposium Workshops PhD Forum (IPDPSW), 2013 IEEE 27th International, May 2013, pp. 1646–1655.

[71] A. G. Lawande, A. D. George, and H. Lam, “Novo-g#: a multidimensional torus-based reconfigurable cluster for molecular dynamics,” Concurrency and Computation: Practice and Experience, vol. 28, no. 8, pp. 2374–2393, 2016, cpe.3565.

[72] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication,” in Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, May 2014, pp. 36–43.

[73] S. Kestur, J. D. Davis, and E. S. Chung, “Towards a universal fpga matrix-vector multiplication architecture,” in Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, April 2012, pp. 9–16.

[74] H. Giefers, P. Staar, C. Bekas, and C. Hagleitner, “Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of gpu, xeon phi and fpga,” in 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2016, pp. 46–56.

[75] N. Kapre, “Custom fpga-based soft-processors for sparse graph acceleration,” in 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), July 2015, pp. 9–16.

[76] W. S. Song, V. Gleyzer, A. Lomakin, and J. Kepner, “Novel graph processor architecture, prototype system, and results,” in High Performance Extreme Computing Conference (HPEC), 2016 IEEE, Sept 2016.

[77] R. Kirchgessner, A. D. George, and G. Stitt, “Low-overhead fpga middleware for application portability and productivity,” ACM Trans. Reconfigurable Technol. Syst., vol. 8, no. 4, pp. 21:1–21:22, Sep. 2015.

[78] W. S. Song, “Systolic merge sorter,” May 29 2012, US Patent No. 8,190,943. [Online]. Available: https://www.google.com/patents/US20100235674

[79] R. Kirchgessner, G. D. L. Torre, A. George, and V. Gleyzer, “Hisc/r: An efficient hypersparse storage format for scalable graph processing,” in Proceedings of the 6th Workshop on Irregular Applications: Architectures and Algorithms, ser. IA3 ’16, 2016.

[80] M. Matsumoto and T. Nishimura, “Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Trans. Model. Comput. Simul., vol. 8, no. 1, pp. 3–30, Jan. 1998.

[81] H. Bauke and S. Mertens, “Random numbers for large scale distributed monte carlo simulations,” CoRR, vol. abs/cond-mat/0609584, 2006.

[82] G. Marsaglia, “Xorshift rngs,” Journal of Statistical Software, vol. 8, no. 1, pp. 1–6, 2003.

[83] A. V. Aho, M. S. Lam, and J. D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. Addison Wesley, Sep. 2006.

[84] K. Madduri and D. A. Bader. (2006) GTgraph: a suite of synthetic random graph generators. Accessed 2016-07-21. [Online]. Available: http://www.cse.psu.edu/~kxm85/software/GTgraph/

[85] T. A. Davis. (2016) Suitesparse: A suite of sparse matrix software. Accessed 2016-10-15. [Online]. Available: http://faculty.cse.tamu.edu/davis/suitesparse.html

[86] B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk, “A reconfigurable computing approach for efficient and scalable parallel graph exploration,” in Proceedings of the 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors, ser. ASAP ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 8–15.

BIOGRAPHICAL SKETCH

Robert Kirchgessner is a Ph.D. graduate from the Department of Electrical and Computer Engineering at the University of Florida. He received his Master of Science in electrical and computer engineering from the University of Florida in 2011. He graduated cum laude in 2009 from the University of Florida with a Bachelor of Science in electrical engineering and a Bachelor of Science in computer engineering. His research focuses on high-performance reconfigurable computing, tools and design methodologies for FPGA application development, and high-performance graph-processing methodologies and architectures. During his work as a doctoral student in the NSF Center for High-Performance Reconfigurable Computing (CHREC), he had the opportunity to lead several research projects investigating high-performance image processing, high-level synthesis and design tools, and many-core architectures and applications.
