PGAS Communication for Heterogeneous Clusters with FPGAs

by Varun Sharma

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2020 by Varun Sharma

Abstract

This work presents a heterogeneous communication library for generic clusters of processors and FPGAs. This library, Shoal, supports the partitioned global address space (PGAS) memory model for applications. PGAS is a shared memory model for clusters that creates a distinction between local and remote memory accesses. Through Shoal and its common application programming interface for hardware and software, applications can be more freely migrated to the optimal platform and deployed onto dynamic cluster topologies. The library is tested using a thorough suite of microbenchmarks to establish latency and throughput performance. We also show an implementation of the Jacobi method that demonstrates the ease with which applications can be moved between platforms to yield faster run times.

Acknowledgements

It takes a village to raise a child and about as many people to raise a thesis as well. Foremost, I would like to thank my supervisor, Paul Chow, for all that he’s done for me over the course of the four years that I’ve known him thus far. Thank you for accepting me as your student for my undergraduate thesis and not immediately rejecting me when I asked to pursue a Masters under your supervision as well. The freedom and independence you provided over the course of this work was intimidating at times but also comforting as a testament to your confidence in me.

I’d like to thank all my colleagues in Paul’s group.
Our weekly meetings of Vivado Anonymous were a constant reminder that we were in this together. This work was also made easier by the many great grad students, especially my fellow denizens of PT477, of whom there are too many to name. Among them, I’d like to mention in particular Rose Li, Naif Tarafdar, Daniel Rozhko, Daniel Ly-Ma, Thomas Lin and Marco Merlini. Thanks for everything, both in and outside of work.

Thank you to Ruediger Willenberg and Sanket Pandit for the inspiration, prior work and help that you provided. Ruedi, the casual confidence you exuded as a graduating PhD was a revelation to me as an undergrad and I hope to be able to emulate that someday.

Finally, I’d like to thank my family for their endless support, concern and love. My parents took a chance on coming to Canada with a small family in tow. I would not be here if not for them.

Contents

1 Introduction
1.1 Motivation
1.2 Research Contributions
1.3 Thesis Organization

2 Background
2.1 Field-Programmable Gate Arrays
2.2 Hardware vs. Software
2.3 Memory Models
2.3.1 Shared
2.3.2 Distributed
2.3.3 Partitioned Global Address Space
2.4 AXI Interfaces
2.4.1 AXI-Stream
2.4.2 AXI-Full and AXI-Lite
2.5 Galapagos
2.5.1 A Layered Approach
2.5.2 The Galapagos Model
2.5.3 Why Galapagos?
2.6 Related Work
2.6.1 SHMEM and its Successors
2.6.2 GASNet
2.6.3 HUMboldt

3 Shoal
3.1 Previous Work: THeGASNets
3.1.1 THeGASNet
3.1.2 THe_GASNet Extended
3.1.3 Limitations
3.2 Rationale for Shoal
3.2.1 Compatibility
3.2.2 Scalability
3.2.3 Freedom
3.2.4 Maintainability and Extensibility
3.2.5 Usability
3.3 Communication API
3.3.1 Packet Format
3.4 Software Implementation
3.4.1 libGalapagos
3.4.2 Making a Node in Shoal
3.4.3 Handler Thread
3.5 Hardware Implementation
3.5.1 GAScore: a remote DMA engine
3.5.2 Integration with Galapagos
3.6 Shoal Kernels

4 Sonar
4.1 Background
4.1.1 Forms of Testing
4.1.2 Related Work
4.2 Motivation
4.2.1 Difficulties with Hardware Testbenches
4.2.2 Simulation in HLS
4.3 Introducing Sonar
4.4 Writing a Testbench
4.4.1 DUT
4.4.2 Test Vectors
4.5 Case Studies
4.5.1 UMass RCG HDL Benchmark Collection
4.5.2 cocotb
4.6 Comparing Different Tools
4.6.1 Controllability
4.6.2 Ease of Use
4.6.3 Capability
4.6.4 Readability
4.6.5 Compatibility
4.6.6 Heterogeneity
4.6.7 Summary
4.7 Future Work

5 Evaluation
5.1 Experimental Setup
5.1.1 Hardware
5.1.2 Software
5.2 Hardware Usage
5.3 Microbenchmarks
5.3.1 libGalapagos
5.3.2 Shoal
5.4 Stencil Codes
5.4.1 Baseline
5.4.2 Porting from THe_GASNet Extended
5.4.3 Software Performance
5.4.4 Hardware Performance

6 Conclusions
6.1 Future Work
6.1.1 Quick Improvements