CoRAM: An In-Fabric Memory Architecture for FPGA-Based Computing

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering

Eric S. Chung

B.S., Electrical Engineering and Computer Science, University of California, Berkeley

Carnegie Mellon University, Pittsburgh, PA

August 2011

Abstract

Field Programmable Gate Arrays (FPGAs) have been used in many applications to achieve orders-of-magnitude improvements in absolute performance and energy efficiency relative to conventional microprocessors. Despite their newfound potency in both processing performance and energy efficiency, FPGAs have not gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing can be traced to the FPGA's lack of a common, scalable memory abstraction. When developing for FPGAs, application writers are often responsible for crafting the application-specific infrastructure logic that transports data to and from the processing kernels, which are the ultimate producers and consumers within the fabric. Very often, this infrastructure logic not only increases design time and effort but also inflexibly locks a design to a particular FPGA product line, hindering scalability and portability. To create a common, scalable memory abstraction, this thesis proposes a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the edge memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment, as seen by the hardware kernels, that simplifies application development and improves an application's scalability and portability.

Acknowledgements

I owe all of my success to my academic advisor, Professor James Hoe, who taught me how to think, speak, and write. It has been a tremendous privilege to work with James, who generously provided the resources and the comfortable environment that enabled me to pursue my ideas to fruition. I am sincerely grateful for his consistent encouragement, insight, and support throughout all these years.

I also thank Professor Ken Mai for his constant source of feedback throughout my time at CMU and for inviting me into his office when I had yet-another-circuit-question. I thank Professor Babak Falsafi for his intellectual contributions to my research projects, his genuine support for those all around him, and his encouragement and advice even all the way from Switzerland. I also thank my external committee members Professor Guy Blelloch, Professor Joel Emer, and Professor John Wawrzynek for serving on my defense. Their helpful comments and constructive criticisms significantly enhanced the quality of this thesis. I also wish to thank various members of the RAMP community for the friendships and the interactions that I have benefited from over so many years: Professor Derek Chiou, Hari Angepat, Nikhil Patil, Michael Pellauer, Professor Mark Oskin, Andrew Putnam, and Professor Martha Mercaldi-Kim. I would also like to thank my mentors at Microsoft Research, Chuck Thacker and John Davis.

I would like to acknowledge the National Science Foundation (CCF-1012851), C2S2, CSSI, Sun Microsystems, Microsoft Research, and John and Claire Bertucci for providing the generous funding that supported my work at CMU. I would like to thank Bluespec and Xilinx for their tool donations.

There are many fellow graduate students and friends whom I wish to acknowledge. Jared Smolens, Brian Gold, and Jangwoo Kim were valuable mentors to me when I joined the TRUSS project. Thanks to Jared for being a great friend, for his feedback, for maintaining our cluster, and for his ability to proofread anything I could throw at him. Brian always gave valuable opinions and was never too busy to answer my questions; thanks also to Brian and his wife Robin for making me feel welcome when I first arrived at CMU. Jangwoo was always supportive and provided helpful late-night discussions and advice on demand. Nikos Hardavellas was a sounding board for many ideas in the wee hours of the night. I hope Nikos will forgive me for the events of ISCA'05.

Michael Papamichael provided the network-on-chip generator that helped to make this thesis possible. Thanks to Michael for being a constant source of feedback, for introducing me to the best Pad-See-Ew in all of existence, and for accompanying me on our adventure to New York's finest Brazilian BBQ. Peter Milder is a great friend and collaborator who always found time to stop by and chat. Peter Klemperer was an entertaining officemate and friend who kept me up-to-date with his endless stories and projects. Yoongu Kim was my mutual commiserator and kept me company during the late work hours. I am grateful to Tom Wenisch, who frequently brought me out to half-priced margaritas at Mad Mex and gave me good advice on many subjects. I am also grateful to Mike Ferdman for his feedback on many of my research projects; thanks to Mike and Alisa Ferdman for also inviting me to their parties. Thanks to Gabriel Weisz, who endlessly tolerated my design bugs and also shared his homebrew with us. Thanks go out to Eriko Nurvitadhi, who tolerated me as his roommate and officemate for nearly 7 years. In addition, I also greatly enjoyed the company and interactions with many other members of CMU and CALCM: Roland Wunderlich, Stephen Somogyi, Evaggelis Vlachos, Ippokratis Pandis, Vasileios Liaskovitis, Kai Yu, Wei Yu, Jennifer Black, Qian Yu, Professor Onur Mutlu, Chris Craik, Chris Fallin, Theodoros Strigkos, Marek Telgarsky, Bob Koutsoyannis, Kristen Dorsey, Ryan Johnson, and Brett Meyer.

I’d like to thank Elaine Lawrence, Reenie Kirby, Matt Koeske, and all the other assistants in our department for handling all the administrative duties that made things run smoothly for all the students. Susan Farrington always made me feel welcome and appreciated at CMU. The Isomac and Expobar machines deserve a special thanks for distilling the endless amounts of caffeine that transformed into this thesis.

I would like to thank Edward Lin and Fabian Manan, great friends who kept my life interesting beyond the lab. Thank you to old friends who kept close from afar: Ben Liu, Patrick Charoenpong, Brandon Ooi, Mike Liao, Chethana Kulkarni, Kazim Zaidi, and Menzies Chen.

I am grateful to our cats, Cho Cho and Mi Mi, who kept us company all these years with an unlimited supply of entertainment, exasperation, and cat hair.

I cannot thank my parents enough for the sacrifices that they have made so that I could be here today. My family members, Alan, Lily, Melody, and Katie have been a pillar of support throughout my life and time at CMU.

Finally, I express my gratitude to Kari Cheng. As the significant other of an ‘eternal’ graduate student, Kari tolerated my habits, my unusual hours, and my times of absence with love, grace, and understanding. This journey would not have been complete without her.

Contents

Chapter 1: Introduction
  1.1 FPGAs for Computing
  1.2 Limitations of Conventional FPGAs
  1.3 FPGAs Lack A Standard Memory Architecture
  1.4 The Connected RAM Memory Architecture
  1.5 Thesis Contributions
  1.6 Thesis Outline

Chapter 2: Background
  2.1 FPGA Anatomy
  2.2 Why Compute With FPGAs?
  2.3 FPGA-Based Computing
    2.3.1 The Convey HC-1
    2.3.2 Nallatech Accelerator Module
    2.3.3 Berkeley Emulation Engine 3 (BEE3)
  2.4 The Need For A Standard Memory Architecture

Chapter 3: CoRAM Architecture
  3.1 CoRAM System Organization and Assumptions
  3.2 The CoRAM Program Model
    3.2.1 Software Control Threads
    3.2.2 How Does CoRAM Solve The Portability Challenge?
    3.2.3 CoRAM Microarchitecture
  3.3 CoRAM versus Alternative Approaches
  3.4 Communication and Synchronization
  3.5 Other Issues
  3.6 Summary

Chapter 4: Cor-C Architecture and Compiler
  4.1 Cor-C Overview
    4.1.1 Control Threads
    4.1.2 Object Instantiation and Identification
    4.1.3 Memory Control Actions
  4.2 Disallowed Behaviors
  4.3 Simple Example: Vector Addition
  4.4 The CoRAM Control Compiler (CORCC)
  4.5 Summary

Chapter 5: Cor-C Examples
  5.1 Matrix-Matrix Multiplication
    5.1.1 Background
    5.1.2 Related Work
    5.1.3 Parallelization on the FPGA
    5.1.4 Manual Implementation
    5.1.5 Cor-C Implementation
    5.1.6 Discussion
  5.2 Black-Scholes
    5.2.1 Background
    5.2.2 Parallelization on the FPGA
    5.2.3 Memory Streaming
    5.2.4 Discussion
  5.3 Sparse Matrix-Vector Multiplication
    5.3.1 Background
    5.3.2 Related Work
    5.3.3 Parallelization on the FPGA
    5.3.4 Discussion
  5.4 Summary

Chapter 6: CoRAM Memory System
  6.1 Devising a Memory System for CoRAM
  6.2 CoRAM Memory System: Hard or Soft?
  6.3 Understanding the Cost of Soft CoRAM
  6.4 Implementing A Cluster-Style Microarchitecture for CoRAM
    6.4.1 Design Overview
    6.4.2 Clients and Interfaces
    6.4.3 Message Sequencing and Out-of-Order Delivery
    6.4.4 Memory-to-CoRAM Datapath
    6.4.5 Cluster Summary
  6.5 Network-on-Chip
  6.6 Memory Controllers and Caches
  6.7 Network-to-DRAM-Cache Interface
  6.8 Summary

Chapter 7: Prototype
  7.1 Cluster Design
  7.2 CONECT Network-on-Chip Generator
  7.3 Edge Memory Cache Controllers
  7.4 Corflow
    7.4.1 Command Line Parameters and Options
  7.5 Xilinx ML605 Board
    7.5.1 Tool Release and Future Plans

Chapter 8: Evaluation
  8.1 Experimental Setup
  8.2 Microbenchmark Characterization
    8.2.1 Latency Characterization
    8.2.2 Bandwidth Characterization
    8.2.3 Summary
  8.3 Synthesis Characterization
    8.3.1 Comparing Area and Power Between Hard vs. Soft CoRAM
  8.4 Application Evaluation
    8.4.1 Matrix-Matrix Multiplication
    8.4.2 Black-Scholes
    8.4.3 Sparse Matrix-Vector Multiplication
  8.5 Summary

Chapter 9: Related Work
  9.1 Specialized Reconfigurable Devices
  9.2 Reconfigurable Memory Architectures
  9.3 High Level Synthesis
  9.4 Architectures and Program Models

Chapter 10: Conclusions
  10.1 Future Work

Appendix A: Interfaces and Libraries
  A.1 Bluespec
  A.2 Cor-C Include Header

Appendix B: Cache Memory Personality
  B.1 Bluespec
  B.2 Control Thread Program

Appendix C: FIFO Memory Personality
  C.1 Bluespec
  C.2 Control Thread Program

Appendix D: Matrix-Matrix Multiplication
  D.1 Bluespec
  D.2 Control Thread Program

Appendix E: Sparse Matrix-Vector Multiplication
  E.1 Bluespec
  E.2 Control Thread Program

List of Tables

Table 1: Comparison Between FPGAs, GPGPUs, CPUs, and Custom Logic.
Table 2: Cor-C Data Types.
Table 3: Cor-C Accessor Control Actions (Static).
Table 4: Memory Control Actions (Dynamic).
Table 5: Channel Control Actions (Dynamic).
Table 6: Single-Precision PE Characteristics.
Table 7: Black-Scholes Processing Characteristics.
Table 8: Single-Precision Sparse Matrix-Vector Multiplication Kernel Characteristics.
Table 9: Summary of Key Relations in Cluster-Style Microarchitecture.
Table 10: Estimated MUX Costs Across Projected FPGA Configurations.
Table 11: Transaction Interface.
Table 12: Summary of Key Structures in a Single Cluster.
Table 13: Parameters of Cluster-Memory Interconnection Network (Large).
Table 14: Customizable Parameters in a Single Cluster Implemented in Bluespec.
Table 15: Specification File in Corflow.
Table 16: Corflow Design Statistics.
Table 17: Configuration Used in Single Cluster Characterization.
Table 18: Timeline of Single Unloaded Access to Memory.
Table 19: Area and Power Characteristics of Single Cluster and Mesh Router.
Table 20: Control Thread Synthesis Results.
Table 21: FPGA System Parameters with CoRAM Support.
Table 22: Mapping MMM and SpMV to Various FPGA Configurations.
Table 23: Matrices Used in Sparse Matrix-Vector Multiplication Experiments.
Table 24: Summary of Area-Normalized Performances Across Workloads.

List of Figures

Figure 1: Application Writer's View of a Conventional FPGA.
Figure 2: User's View of an FPGA with CoRAM Support.
Figure 3: CoRAM: Physical Implementation.
Figure 4: Conventional FPGA Fabric with Configurable Logic Blocks.
Figure 5: Conventional Block RAM Interface.
Figure 6: Xilinx FPGA Technology Trends.
Figure 7: ITRS 2009 Scaling Projections.
Figure 8: The Convey HC-1 Architecture.
Figure 9: The Nallatech Xeon Accelerator Module.
Figure 10: The BEE3 Platform.
Figure 11: Assumed System Context.
Figure 12: CoRAM Memory Architecture.
Figure 13: Example Usage of CoRAM.
Figure 14: Memory Control Action.
Figure 15: Beneath the CoRAM Programming Abstraction.
Figure 16: CoRAM as a Software-Managed Memory Hierarchy.
Figure 17: Linear and Scatter-Gather RAM Compositions.
Figure 18: Supporting Automatic Notification with Channel-to-CoRAM Bindings.
Figure 19: Options for Synthesizing Control Threads.
Figure 20: CORCC Example.
Figure 21: Matrix-Matrix Multiplication Processing Element.
Figure 22: Work Distribution in Matrix-Matrix Multiplication.
Figure 23: Computation Phase of Matrix-Matrix Multiplication.
Figure 24: Matrix-Matrix Multiplication with Double Buffering.
Figure 25: Matrix-Matrix Multiplication Control Thread (Non-double-buffered).
Figure 26: Black-Scholes Processing Element.
Figure 27: Stream FIFO Memory Personality.
Figure 28: Example of Compressed Sparse Row Format in SpMV.
Figure 29: Cache Memory Personality.
Figure 30: Implementation of SpMV with CoRAM.
Figure 31: Beneath the CoRAM Architecture.
Figure 32: Shift Register Network for CoRAM.
Figure 33: Efficient 16-to-1 multiplexer using 6-input LUTs.
Figure 34: Shift Register Network (Combined) for CoRAM.
Figure 35: Multiple Memory Interfaces.
Figure 36: Cluster-Style Microarchitecture.
Figure 37: Cluster-Style Microarchitecture for CoRAM.
Figure 38: Cluster Front-End Interface.
Figure 39: Mapping Functions to Support Arbitrary CoRAM Aspect Ratios.
Figure 40: Memory-to-CoRAM Datapath.
Figure 41: Single and Multi Memory-to-CoRAM Queues.
Figure 42: Time-shifted Schedule.
Figure 43: Front and Back Interfaces of Timeshift FIFO.
Figure 44: Example Timeline of Timeshift FIFO.
Figure 45: Complete View of CoRAM Microarchitecture.
Figure 46: Virtex-6 LX760 Layout Schematic.
Figure 47: NoC-to-DRAM Interface.
Figure 48: Memory Subsystem.
Figure 49: Corflow Method of Compilation.
Figure 50: Creating Shadow Ports.
Figure 51: ML605 Prototype.
Figure 52: Single Cluster Throughput Characteristics.
Figure 53: MMM Performance Trends.
Figure 54: GEMM Area and Efficiency Trends.
Figure 55: Black-Scholes Performance Across Network Topologies (small FPGA).
Figure 56: Black-Scholes Bandwidth Usage Across Network Topologies (small FPGA).
Figure 57: Black-Scholes Performance Trends.
Figure 58: Black-Scholes Area and Efficiency Trends.
Figure 59: SpMV Performance Trends.
Figure 60: SpMV Per-Input Performance Trends.
Figure 61: SpMV Area and Efficiency Trends.
Figure 62: Detailed Comparison Between Good and Bad Inputs on SpMV.

To my family and Kari

Chapter 1

Introduction

And now for something completely different.

Monty Python

Over the past several decades, the defining characteristic of general-purpose processors has been their programmability and flexibility. With steady improvements in VLSI technology, general-purpose processors have attained tremendous levels of success and adoption. In recent years, the deviation from classical scaling laws [37] has begun to threaten the sustained scalability of future multicores built solely out of general-purpose cores [23]. Although transistor dimensions continue to shrink with each major technology node, transistor threshold and supply voltages have not been scaling commensurately due to leakage effects. The inability to scale the supply voltage has now constrained us to an era where power—not area—is becoming the ultimate limiting resource.

With the recent emphasis on power efficiency coupled with the ever-increasing appetite for high computational throughput, designers must think beyond classical von Neumann architectures that fundamentally sacrifice efficiency for their generality.

In the quest for energy-efficient computing, Field Programmable Gate Arrays (FPGAs) have emerged as a class of general-purpose accelerators with high potential to address the demand for high performance and efficiency. FPGAs comprise a sea of programmable logic gates that can be reconfigured on demand to accelerate a particular problem or task. Despite their proven advantages, today's commercial FPGAs are not built with computing in mind and have mostly eluded mainstream adoption in general-purpose computing. A fundamental limitation of the FPGA is its lack of a standard memory architecture. Beyond having to develop the core logic, FPGA application writers are burdened by a low-level fabric abstraction that exposes complex details specific to particular devices and platforms. Applications often lack portability, and substantial effort must be invested to develop and/or optimize the cloud of infrastructure that surrounds the core logic of the application.

To address these challenges, the contribution of this thesis is a major reconception of the architectural paradigm of FPGA-based computing. The thesis presents a top-to-bottom exploration of a new, all-purpose memory architecture called Connected RAM (CoRAM), which simplifies difficult aspects of FPGA programming and memory management while enhancing the programmability and portability of applications.

1.1 FPGAs for Computing

The merits of FPGA-based reconfigurable computing have been known since the early 1990s [35]. FPGAs consist of up to millions of tiny, programmable lookup tables that can be configured "in the field" to implement arbitrary logic functions. Unlike the sequential-style programming languages of general-purpose processors, FPGA "programs" are typically captured with hardware description languages that expose a highly concurrent abstraction to the user. FPGAs have been demonstrated in many cases to accelerate a wide variety of applications, ranging from financial analytics and bioinformatics to physics and databases [98, 118, 87, 53]. FPGAs typically achieve their feats through massive, fine-grained parallelism and the pipelining of arbitrary dataflows.

1.2 Limitations of Conventional FPGAs

While accumulated VLSI advances have steadily improved the FPGA fabric's processing capability, FPGAs have yet to gain widespread acceptance as mainstream computing devices. A commonly cited obstacle is the difficulty in programming FPGAs using low-level hardware description languages and development flows. A further problem lies in the fact that FPGAs today are not architected for computing purposes but rather to emulate or replace ASICs.

While the former concern is being gradually addressed by advances in high-level synthesis [45, 17], a more fundamental problem lies in the FPGA's lack of a stable and standard "architecture" for application writers. Unlike their general-purpose counterparts, conventional FPGAs expose nothing to the user but a "sea" of logic and a collection of external I/O pins. A general-purpose processor, by contrast, exports its computational and memory resources through a generic and structured interface (i.e., an Instruction Set Architecture) that abstracts away the machine's underlying details.

Figure 1: Application Writer's View of a Conventional FPGA (application logic and SRAMs surrounded by memory controllers, I/O interfaces, and control logic connecting to/from the network).

When developing for the FPGA, a designer often has to create from bare fabric not only the target application kernel itself but also the application-specific infrastructure logic, which requires detailed knowledge of the specific device and platform being utilized. This infrastructure often includes user-developed mechanisms to support and optimize the transfer of data to and from external DRAM interfaces and to I/O devices (see Figure 1). Very often, creating or using this infrastructure logic not only increases design time and effort but will frequently lock a design to a particular FPGA platform or device, hindering scalability and portability. Further, the support mechanisms which users are directly responsible for will become increasingly difficult to manage in the future as: (1) the number of on-die SRAMs and off-chip DRAM interfaces increase in number and become more distributed across the fabric, and (2) long-distance interconnect delays become more difficult to tolerate in larger fabric designs, requiring more spatial awareness by the user to distribute memory data effectively.

1.3 FPGAs Lack A Standard Memory Architecture

The root of the aforementioned challenges can be traced to the fact that current FPGAs lack essential abstractions and built-in mechanisms that one comes to expect in a general-purpose computer—i.e., an Instruction Set Architecture (ISA) that defines a standard agreement between hardware and software. From a computing perspective, a standard and stable architectural definition is a critical ingredient for programmability and for application portability.

A crucial starting point for addressing this challenge is to rethink how the notion of "memory" should be presented and architected within an FPGA. In many cases, the lack of application portability stems from unenforced separation between the application kernel itself and the mechanisms needed to obtain its data from the environment (i.e., main memory and I/O). To specifically address the challenges related to memory on FPGAs, the central goal of this thesis is to create a shared, scalable, and portable memory architecture suitable for future FPGA-based computing devices. Such a memory architecture would be used in a way that is analogous to how general-purpose programs universally access main memory through standard "loads" and "stores" as defined by an ISA—without any knowledge of hierarchy details such as caches, memory controllers, etc. At the same time, the FPGA memory architecture definition cannot simply adopt what exists for general-purpose processors and instead should reflect the spatially distributed nature of today's FPGAs—consisting of up to millions of interconnected LUTs and thousands of embedded SRAMs [116].

Working under the above premises, the guiding principles for the desired FPGA memory architecture are:

• The architecture should present to the user a common, virtualized appearance of the FPGA fabric, which encompasses reconfigurable logic, its external memory interfaces, and the multitude of SRAMs—while freeing designers from details irrelevant to the application itself.

• The architecture should provide a standard, easy-to-use mechanism for controlling the transport of data between memory interfaces and the SRAMs used by the application throughout the course of computation.

• Applications for the architecture should port effortlessly to newer devices and platforms without modification.

• The architecture should be amenable to scalable implementations on the FPGA without affecting the architectural view presented to existing applications.

1.4 The Connected RAM Memory Architecture

To satisfy the above goals, this thesis proposes the Connected RAM (CoRAM) memory architecture for future FPGAs designed for general-purpose computing. The CoRAM architecture allows the application writer to focus on creating high-performance kernels within an efficient, virtualized FPGA environment while relying on a standard set of abstracted mechanisms to sustain the kernel's distributed memory data consumption and production. As shown in Figure 2, CoRAM presents a highly simplified view of the fabric to the user, consisting of programmable logic, on-die SRAMs, and a standard software abstraction for mediating accesses to main memory.

Figure 2: User's View of an FPGA with CoRAM Support.

To facilitate portability, application logic is never permitted to directly access the edge memory interfaces nor the I/O pins that are typical in modern FPGAs. A unique characteristic of the CoRAM architecture is that the SRAMs themselves are used as portals into global memory and are programmed using a stylized, portable high-level language. The use of the language to specify the memory requirements of a given kernel and application over the course of a computation effectively "virtualizes" the application's use of memory and would make it possible to easily relocate an application between different FPGA families with a common fabric architecture, or to port a kernel to another FPGA within a family. The CoRAM abstraction itself is also naturally distributed to suit the nature of FPGA-like fabrics and can be used to form convenient memory structures with higher-level semantics without the loss of portability or efficiency.

Figure 3: CoRAM: Physical Implementation (clusters of CoRAMs and control blocks in the fabric, routers for bulk data distribution over a network-on-chip, virtual-to-physical translation (TLB), and memory interfaces (DRAM or cache)).

Beneath the CoRAM abstraction lies a high-performance microarchitecture designed to scale to hundreds or thousands of "connected" CoRAMs. Figure 3 illustrates an island-style microarchitecture devised in this thesis that distributes control threads and core logic components across replicated clusters of aggregated CoRAMs and control blocks. In between the clusters, a high-performance network-on-chip provides global connectivity between the endpoints and the off-chip memory interfaces. An important property of the microarchitecture shown in Figure 3 is the speedup provided by hardening the network-on-chip and the clustering logic. Over-provisioning the bandwidth and latency of these general-purpose mechanisms has the notable benefit of allowing applications to achieve near-ideal performance potential and efficiency with modest overheads in die area and power, as will be demonstrated in Chapter 8.

1.5 Thesis Contributions

The scope of this thesis encompasses two major research thrusts: (1) defining a proper memory abstraction useful to a wide range of FPGA applications, and (2) investigating the underlying hard and soft mechanisms needed to support the abstraction effectively. The contributions of this thesis are:

• Identification and rationale for essential abstractions in the CoRAM architecture.

• The demonstrated effectiveness of CoRAM in comparison to conventional approaches.

• An exploration of the microarchitectural design space for CoRAM on conventional and future FPGAs.

• Prototype RTL designs that demonstrate the feasibility of CoRAM.

• A study of the performance and efficiency gap between CoRAM and conventional development approaches.

• An investigation into which mechanisms are needed in future FPGAs to support the CoRAM architecture effectively.

1.6 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 gives background on FPGAs and their applications in computing. Chapter 3 introduces the CoRAM architectural paradigm. Chapter 4 describes the Cor-C Architecture Specification, which is a devised instance of the CoRAM paradigm. Chapter 5 describes several case studies using Cor-C. Chapter 6 gives a detailed treatment of the CoRAM microarchitecture. Chapter 7 describes the prototype implementation of CoRAM. Chapter 8 gives an evaluation of the CoRAM microarchitecture. Chapter 9 discusses related work. Chapter 10 offers conclusions and directions for future work.

Chapter 2

Background

Traditionally, FPGAs have been the bastard step-brother of ASICs.

André DeHon, FPGA 2004

This chapter presents background material on FPGAs and their technological trends over the past decade. Section 2.1 covers the basic anatomy of the FPGA, beginning from fabric architecture down to the embedded memories in today's commercial devices. Section 2.2 explains the merits of FPGAs for computing. Section 2.3 discusses state-of-the-art FPGA-based computing systems. Section 2.4 concludes with implications and guiding design principles for the CoRAM memory architecture.

2.1 FPGA Anatomy

Field Programmable Gate Arrays (FPGAs) are silicon devices that allow "in-the-field" reconfiguration of programmable logic after manufacture. Modern FPGAs consist of up to millions of small, programmable lookup tables (LUTs) and interconnect wires that can be configured at the bit and wire level to implement arbitrary logic functions [74]. Figure 4 illustrates the basic anatomy of the FPGA "soft fabric", which comprises a sea of regularly tiled configurable logic blocks (CLBs) surrounded by programmable routing [112]. Typically, each CLB contains one or more lookup tables (LUTs), which are small 2^n-deep SRAMs that can be "configured" with specific values to implement a desired 2^n-to-1 truth table. CLBs additionally include hardwired elements such as flip-flops to form sequential logic and buffering. In between the CLBs shown in Figure 4, a sea of programmable interconnect allows neighboring logic to communicate and thus form compositions of multiple gates. The architecture of soft logic is typically characterized by a bevy of parameters, including the granularity of the CLB, the LUT size, the input and output connectivity to the CLB, the switch block flexibility, the number of routing tracks per channel, etc. [26]¹

¹ Soft logic architecture will not be a major focus of this work. The Xilinx Virtex-6 FPGA architecture will largely serve as a baseline fabric for this thesis.

Figure 4: Conventional FPGA Fabric with Configurable Logic Blocks [114].
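To make the lookup-table behavior described above concrete, the following minimal C sketch (illustrative only; the function name and the packing of input bits are my own) models an n-input LUT as a truth table stored in a small SRAM and indexed directly by the input signals:

    #include <stdint.h>

    /* Model of an n-input LUT (n <= 6): 'config' holds the 2^n truth-table
     * bits loaded at FPGA configuration time; 'inputs' packs the n input
     * signals into the low bits of an integer. */
    static inline int lut_eval(uint64_t config, unsigned inputs, unsigned n) {
        unsigned index = inputs & ((1u << n) - 1);  /* select one of 2^n rows */
        return (int)((config >> index) & 1u);       /* the stored output bit */
    }

    /* Example: a 2-input XOR has truth table 0110 (rows 00, 01, 10, 11),
     * so lut_eval(0x6, 0x1, 2) == 1 and lut_eval(0x6, 0x3, 2) == 0. */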

Figure 5: Conventional Block RAM Interface (dual ports A and B, each with enable, write-enable, clock, address, data-in, and data-out signals).

Reconfigurable logic alone typically does not suffice for many FPGA-based applications. Modern commercial FPGAs usually embed up to several megabytes of on-chip memory in the form of many small embedded SRAMs distributed throughout the reconfigurable fabric (e.g., Xilinx uses 18Kbit SRAMs called BlockRAMs or BRAMs [115]). These embedded memories provide a basic wire-level SRAM interface for reading and writing by gate-level reconfigurable logic (see Figure 5) and can typically operate at up to 500MHz [115]. Unlike conventional SRAMs, these memories can be efficiently configured with varying aspect ratios (e.g., 16384x1 or 4096x32) or combined to emulate larger blocks of memory. The architecture of the FPGA itself usually includes hard mechanisms that avoid the use of LUT resources up to a certain composition size of multiple SRAMs [122].

The large number of distributed SRAMs plays an essential role in providing FPGA-based applications with tremendous on-chip memory bandwidth (up to terabytes per second) in order to sustain the high rate of consumption that fabric-based applications require. SRAMs are typically populated prior to configuration of the FPGA by embedding the required data within the FPGA bitstream itself. Another alternative, although costly in terms of SRAM port and LUT usage, is to dynamically populate the SRAMs with data from off-chip interfaces during the runtime of an application.

FPGA Technological Trends. The technological trends of FPGAs have closely mirrored the trajectory of Moore's Law since their inception. Figure 6 plots the characteristics of all Xilinx FPGA devices since 2002. As shown, FPGAs have continued to double in LUT area density every 18 to 24 months—progressing from tens of thousands of LUTs up to millions on a single device². To visualize this level of scale, one million LUTs would be sufficient to synthesize over 800 minimally-configured soft MicroBlaze processors in a single device [3]. The scaling of fabric is also accompanied by a commensurate increase in the number of SRAMs and multipliers, as can be seen in Figure 6. The largest FPGAs today provide enough on-chip memory (tens of megabytes) to rival the capacity of the last-level caches in today's state-of-the-art multicores. The scaling of FPGA fabric has also been accompanied by an unprecedented increase in external I/O bandwidth. Xilinx, for example, now manufactures FPGAs with high-speed transceivers operating at up to 20Gbps per pin [116]. From a memory bandwidth perspective, high-end FPGAs would be able to provide up to 175GB/sec of off-chip memory bandwidth by the year 2012.

² Note that the slight anomaly in the LUT area increase from 2004 to 2007 occurred when Xilinx transitioned from a 4-input LUT architecture to a larger 6-input LUT.

Figure 6: Xilinx FPGA Technology Trends (panels: LUT area, I/O bandwidth, 18kbit BlockRAM count, and DSP multiplier count from Virtex-2P through Virtex-7, reaching 1.2M LUTs, 175GB/sec, and 3760 BRAMs on Virtex-7; plus LUT-to-BlockRAM ratio (min=121, avg=232, max=530) and BlockRAM-to-I/O ratio).

2.2 Why Compute With FPGAs?

Since 2005, processor designers have shifted their focus towards increasing core counts to achieve performance commensurate with Moore's Law. Moore's Law, which has been a fundamental driver for technological innovation in the industry, projects that the number of components in a single device will double every 18 to 24 months. The recent departure from classical scaling laws [37] has placed Moore's Law in jeopardy, and with it the expected scalability of future multicore systems. Figure 7 shows the long-term expected trends in pin count, Vdd, and gate capacitance according to the ITRS 2009 roadmap [57]. Although transistor densities are projected to double with each major technology node, supply voltages are only expected to decrease by a very small amount due to leakage concerns. With the inability to reduce the threshold voltage, supply voltages must also be held high enough to maintain sufficient overdrive. Assuming that clock frequencies do not increase substantially, the power per transistor is expected to drop by only a factor of 5X over the next fifteen years (in contrast to the doubling of transistor density with each additional technology node).

Figure 7: ITRS 2009 Scaling Projections (High-performance MPUs and ASICs [57]).
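To see the magnitude of the resulting gap, a rough back-of-the-envelope calculation (my own arithmetic, assuming one technology node roughly every two years) combines the two trends quoted above: fifteen years span about seven node transitions, so transistor density grows by 2^7 while power per transistor improves by only 5X. A fully utilized die of constant area and frequency would then dissipate on the order of

\[
\frac{P_{\mathrm{chip}}(2026)}{P_{\mathrm{chip}}(2011)} \approx \frac{2^{7}}{5} \approx 26\times
\]

more power—far beyond any practical power budget—implying that only a shrinking fraction of a chip's transistors can be active at once.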

Unconventional Architectures. The use of "unconventional" computing architectures such as FPGAs, GPGPUs, or even ASICs today offers a promising path in the quest for energy-efficient computing. For computing applications, the customizable and fine-grained nature of FPGAs enables them to achieve significant improvements in absolute performance and energy efficiency relative to conventional microprocessors (e.g., [43, 23, 21, 96, 63]). For example, FPGAs have recently been used to accelerate a wide range of non-traditional applications, including quantitative finance [98], databases [53], bioinformatics [118], and speech recognition [71].

Table 1, drawn from a recent study [23], compares physically measured performance and energy efficiency between state-of-the-art multicores, GPUs, FPGAs, and custom logic for single-precision Black-Scholes, matrix-matrix multiplication, and the Fast Fourier Transform. As shown, the FPGA architecture offers competitive performance and energy efficiency gains over other computing devices, even relative to GPGPUs with dedicated floating-point units. The recent spate of evidence supporting FPGAs, GPGPUs, and ASICs suggests that future multicore devices will become heterogeneous, combining general-purpose cores with specialized hardware suited for specific tasks [23, 101]. Recent commercial products that tightly couple GPUs and FPGAs with high-performance processors appear to support this trend [13, 55].

Table 1: Comparison Between FPGAs, GPGPUs, CPUs, and Custom Logic.

Matrix-matrix multiplication       GFLOP/s    (GFLOP/s)/mm² (40nm)   GFLOP/J (40nm)
  Intel Core i7-960                96         0.50                   1.14
  Nvidia GTX285                    425        2.40                   6.78
  Nvidia GTX480                    541        1.28                   3.52
  ATI R5870                        1491       5.95                   9.87
  Virtex-6 LX760                   204        0.53                   3.62
  65nm standard cell               694        19.28                  50.73

Fast Fourier Transform (N=1024)    GFLOP/s    (GFLOP/s)/mm²          GFLOP/J
  Intel Core i7-960                67         0.35                   0.71
  Nvidia GTX285                    250        1.41                   4.2
  Nvidia GTX480                    453        1.08                   4.3
  ATI R5870                        -          -                      -
  Virtex-6 LX760                   380        0.99                   6.5
  65nm standard cell               952        239                    90

Black-Scholes                      MOptions/s (MOptions/s)/mm²       MOptions/J
  Intel Core i7-960                487        2.52                   4.88
  Nvidia GTX285                    10756      60.72                  189
  Nvidia GTX480                    -          -                      -
  ATI R5870                        -          -                      -
  Virtex-6 LX760                   7800       20.26                  138
  65nm standard cell               25532      1719                   642.5
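Reading the table in relative terms (ratios computed directly from the entries above): for matrix-matrix multiplication, the FPGA's energy efficiency sits roughly 3X above the multicore CPU but roughly 14X below custom logic,

\[
\frac{3.62}{1.14} \approx 3.2\times \ \text{(Virtex-6 vs. Core i7)}, \qquad
\frac{50.73}{3.62} \approx 14\times \ \text{(standard cell vs. Virtex-6)}
\]

which frames the FPGA as a middle ground between programmable and fixed-function hardware.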

2.3 FPGA-Based Computing

The impressive raw capabilities of FPGAs have led to numerous projects and commercial efforts to develop computing platforms that incorporate FPGAs. Historically, FPGA-based reconfigurable computing systems existed either as stand-alone boards (e.g., [20]) or as peripheral cards on a computer system's low-performance I/O bus (e.g., [110]). To address bandwidth and latency concerns, newer commercial systems have begun to place FPGAs on the primary memory bus of multiprocessor systems. In particular, Cray [29] and SGI [90] first marketed computing systems that allowed a mixed population of FPGA-based processing modules and standard microprocessors on the shared-memory interconnect. Today, FPGA-based processing modules that plug into standard sockets of multiprocessor PCs or servers are widely available from SRC [85], DRC [40], XtremeData [117], and Nallatech [80, 72]. Many of the recent systems that integrate FPGAs onto the memory bus employ soft memory controllers implemented directly within the reconfigurable soft logic (e.g., BEE3 [32], Xilinx MIG [111]). Many of these soft implementations operate much slower than comparable hardwired implementations [28] (for example, the Virtex-6 LX240T FPGA on an ML605 platform [110] can only operate the lowest-rated DDR3 memory DIMM at 400MHz, compared to the nominal clock speeds of 800MHz and above achieved by standard processors). For this reason, FPGA companies have begun integrating hardwired memory controllers that can operate at high speeds in future announced products [9, 116].

In the next subsections, we describe several existing systems that represent the state of the art in FPGA-based computing. The CoRAM architecture proposed in this thesis assumes and builds upon external memory subsystems that closely resemble the commercial designs presented below.

2.3.1 The Convey HC-1

The state of the art in FPGA-based computing is best exemplified by the recent Convey HC-1 Computer [27], which marries several large FPGAs to a high-bandwidth memory subsystem. Figure 8 illustrates the system-level architecture of the HC-1, which couples four large Virtex-5 LX330 FPGAs to the front-side bus of an Intel-based processor system. A unique feature of the HC-1 is that each memory controller guarantees full coherency with the caches of the host processor³. This feature simplifies many aspects of application development, including eliminating the need to manually transfer data between the host processor and the FPGA. The Convey memory controllers further simplify this task by implementing translation lookaside buffers (TLBs) that allow the FPGAs to access the same virtual memory address space shared by the host processor.

³ The memory controllers of the Convey are also implemented using smaller nearby FPGAs, although they cannot be re-configured by the user.

Figure 8: The Convey HC-1 Architecture (from www.convey.com).

From the perspective of the fabric, each individual FPGA sustains a peak 20GB/sec of memory bandwidth as long as the traffic is distributed evenly across eight independent off-chip memory controllers (for a combined total of 80GB/sec across four FPGAs). From the fabric's point of view, the HC-1 exposes 16 64-bit-wide load-store interfaces to partitioned address spaces within each FPGA. Each interface operates at a clock speed of 156 MHz, enabling 1.25GB/s of bandwidth per port. To achieve the maximum throughput of 20GB/sec per FPGA, applications must distribute their accesses sequentially and uniformly across the 16 load-store interfaces.
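To illustrate that last point concretely, the short C sketch below (a hypothetical fabric-side address generator, not Convey's actual API) stripes the i-th 64-bit word of a sequential stream round-robin across the 16 load-store interfaces so that each port carries an even 1/16th of the traffic:

    #include <stdint.h>

    #define NUM_PORTS  16   /* load-store interfaces per FPGA on the HC-1 */
    #define WORD_BYTES 8    /* each interface is 64 bits wide */

    /* Port carrying the i-th word of a sequential stream. */
    static inline unsigned port_of(uint64_t i) {
        return (unsigned)(i % NUM_PORTS);
    }

    /* Address presented to that port: each port advances by one word
     * for every NUM_PORTS words of the overall stream. */
    static inline uint64_t port_addr(uint64_t base, uint64_t i) {
        return base + (i / NUM_PORTS) * WORD_BYTES;
    }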

2.3.2 Nallatech Accelerator Module

Figure 9: The Nallatech Xeon Accelerator Module (from www.nallatech.com).

The Nallatech Xeon Accelerator Module developed by Intel and Xilinx is another example of a hybrid CPU-FPGA platform that allows mixing and matching of processors and FPGAs in a multi-socket backplane (see Figure 9). The Nallatech in-socket accelerator module allows a Virtex-5 LX330 FPGA to be added onto the front-side bus (FSB) of a conventional Intel-based multiprocessor system. A unique feature of the Nallatech is the ability to stack multiple FPGAs in a single processor socket with a lower footprint. The Nallatech also allows attachment of external SRAMs to individual FPGAs to extend the on-die memory capacity. In a Nallatech system, FPGAs are treated as slave devices that receive their tasks and computation state through the host Intel processor. From the perspective of the fabric, each stand-alone adapter hosting one or more FPGAs can sustain a peak 8GB/sec of bandwidth to memory, which is limited by the 64-bit 1066MHz FSB interface. A round-trip read access to memory takes approximately 700ns [72].

Figure 10: The BEE3 Platform (from [32]).

2.3.3 Berkeley Emulation Engine 3 (BEE3)

The Berkeley Emulation Engine 3 (BEE3) [32] exemplifies another class of FPGA-based computing platforms, which comprise multiple Virtex-5 LX155T FPGAs (or footprint-compatible parts) connected using high-speed links on a stand-alone board. In the BEE3, four FPGAs are arranged in a ring topology with high-speed links (8.0GB/s) between them. Host-to-FPGA and FPGA-to-FPGA communication is facilitated through multiple PCI-E 8X links.

A feature unique to the BEE3 is the distributed DRAMs local to each FPGA. From the perspective of the fabric, each individual FPGA holds up to four private DDR2 DIMMs managed by two dual-channel memory controllers. Each DDR2 memory controller operates at 500MT/s, amounting to a total of 16GB/sec of off-chip memory bandwidth per FPGA [32].
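The quoted aggregate is consistent with the per-DIMM numbers (my own arithmetic, assuming the standard 64-bit DDR2 data path of 8 bytes per transfer):

\[
4~\text{DIMMs} \times 500~\text{MT/s} \times 8~\text{B/transfer} = 16~\text{GB/s per FPGA}
\]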

2.4 The Need For A Standard Memory Architecture

Despite the impressive raw capabilities of these systems, the proliferation of disparate platforms and devices creates a serious challenge for application writers. From the application writer's perspective, an application written with one platform in mind easily becomes obsolete once the system falls out of fashion or a platform change is desired. Applications are frequently either discarded or re-ported to newer systems, but at the cost of substantial development effort and time. If we closely examine the systems above, a root problem can be traced to the fact that applications perceive and access memory very differently across platforms. The Convey HC-1 exposes a coherent global memory system through 16 local interfaces per FPGA; the BEE3 distributes its memory across multiple devices, with each device hosting two memory controllers; the Nallatech relies on the host processor to deliver the application state and memory.

In devising a standard memory architecture for FPGA-based computing, the objective of CoRAM is to abstract away the details of lower-level memory management and to enable portability and scalability of applications. To achieve portability and scalability, we devise several guiding design principles for CoRAM:

• An application should not be aware of, or coded to, the specific protocol of a given low-level memory interface specific to a platform.

• An application should not be aware of the number of memory controllers or memory ports specific to a platform.

• Applications using a general-purpose abstraction should scale with ease and without modification to the core logic.

As will be introduced in Chapter 3, the CoRAM paradigm is designed to hide the underlying interface details by utilizing the embedded memories common in any FPGA as a logical portal into external memory (called the embedded CoRAM). Chapter 6 will later describe a style of microarchitecture for CoRAM devised to support scaling up to thousands of CoRAMs per FPGA.

Chapter 3

CoRAM Architecture

Why would you want more than one machine language?

John von Neumann

This chapter motivates and introduces the Connected RAM (CoRAM) architecture, which embodies a paradigm for how users perceive and manage the computational and memory resources of an FPGA-based computing device. Much like how conventional general-purpose ISAs abstract away the lower-level details of the memory hierarchy through standard loads and stores, the CoRAM architecture provides a highly distributed programming interface for managing the on- and off-chip memory resources of an FPGA. This chapter begins by describing how CoRAM-based FPGAs "fit" into the computational stack of conventional systems. Following a discussion of key assumptions, the architectural ideas behind CoRAM are explained and justified.

3.1 CoRAM System Organization and Assumptions

The CoRAM architecture assumes the co-existence of FPGA-based computing devices alongside general-purpose processors in the context of a shared memory multiprocessor system (see Figure 11). The CoRAM architecture assumes that reconfigurable logic resources will exist either as discrete FPGA devices on a multiprocessor memory interconnect or integrated as fabric into a single-chip heterogeneous multicore. In this context, the FPGA-based components operate as peer computing devices that can independently fetch from and store to main memory.

Figure 11: Assumed System Context ((a) discrete FPGAs on the memory interconnect; (b) fabric integrated into a single-chip heterogeneous processor).

Figure 12: CoRAM Memory Architecture ((a) reconfigurable core logic; (b) CoRAMs; (c) control threads; (d) edge memory; (e) channels).

Regardless of the configuration, it is assumed that memory interfaces for loading from and storing to a global linear address space will exist at the boundaries of the reconfigurable logic (referred to as edge memory in this thesis). These implementation-specific edge memory interfaces could be realized as dedicated memory/bus controllers or even as coherent cache interfaces to a shared on-die memory hierarchy. Like commercial systems available today (e.g., Convey Computer [27]), reconfigurable logic devices can directly access the same virtual address space as general-purpose processors (e.g., by introducing MMUs at the boundaries of the fabric). The combined integration of virtual memory and direct access to the memory bus allows applications to be efficiently and easily partitioned across general-purpose processors and FPGAs, while leveraging the unique strengths of each respective device. A nearby processor is useful for handling tasks not well suited to FPGAs—e.g., providing the OS environment, executing irregular sequential tasks (e.g., system calls), and initializing the memory contents of an application prior to its execution on the FPGA.

3.2 The CoRAM Program Model

The CoRAM programming model fundamentally embodies three independent ideas: (1) enforced separation of concerns between computation and memory management, (2) the use of embedded SRAMs as standard interfaces for accessing on- and off-chip memory, and (3) the use of software control threads as a portable memory management interface. Figure 12 offers a conceptual view of how applications are decomposed when mapped into reconfigurable logic with CoRAM support. The core logic component shown in Figure 12a is a collection of LUT and sequential resources (e.g., flip-flops) used to host the state and logic of the algorithmic kernels of a user application. It is important to note that the CoRAM programming model preserves the hardware-centric view for developing core logic and places no fundamental restriction on the description language used. For portability reasons, the only requirement is that core logic is never permitted to directly interface with off-chip I/O pins, access memory interfaces, or be aware of platform-specific details.

An application hosted in this environment is only allowed to interact with external memory and I/O devices through a collection of specialized, distributed embedded SRAMs called CoRAMs, as diagrammed in Figure 12b. Much like current FPGA memory architectures, the embedded CoRAMs serve the same role as conventional FPGA SRAMs [81]: (1) they provide the application with many independent banks of on-chip storage, (2) they present a simple wire-level SRAM interface to the core logic with deterministic access times, (3) they are spatially distributed, and (4) they provide high aggregate on-chip bandwidth on the order of terabytes per second. Like traditional FPGA-based embedded SRAMs, CoRAMs can be further composed together and configured with flexible aspect ratios (e.g., 16384x1, 4096x32). CoRAMs, however, deviate drastically from conventional SRAMs in the sense that the data contents of individual CoRAMs are actively managed by finite state machines called "control threads", as shown in Figure 12c.

3.2.1 Software Control Threads

The heart of the CoRAM memory architecture is the software control thread. Software control threads form a fabric-distributed collection of logical, asynchronous finite state machines for managing and mediating the data transfers between the embedded CoRAMs instantiated by an application and the edge memory interfaces. At a high level, control threads can be viewed as an abstract, general-purpose mechanism for prefetching an application's required data from the edge memory interface to the fabric-distributed CoRAMs. At the lowest level, control thread programs describe an ordered sequence of memory commands directed by control flow. Under the CoRAM architecture, the application writer relies solely on software control threads to access external main memory and I/O over the course of computation. Each CoRAM is managed by at most a single control thread, although a single control thread can manage multiple CoRAMs.

Control threads and the core logic of an application are peer entities that interact over low-latency, bidirectional channels (see Figure 12e). In the CoRAM architecture, channels are built out of FIFOs and registers that allow control threads and core logic to exchange information when necessary. A control thread maintains local state to facilitate its sequencing activities and issues transfer commands to the edge memory interface on behalf of the application; upon completion, the control thread informs the core logic via channels when data within the CoRAMs are ready to be accessed through their locally-addressed SRAM interfaces.
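A minimal sketch of this producer-consumer interaction, written in the style of the control-thread programs introduced below (cpi_get_channel and cpi_channel_read are assumed counterparts to the channel-write primitive, and NUM_BLOCKS, BLOCK_BYTES, and base_addr are hypothetical; the actual channel control actions are defined in Chapter 4), shows a control thread that fills a CoRAM buffer, signals the core logic, and waits for an acknowledgment before reusing the buffer:

    /* Control-thread sketch under assumed names. */
    ctrlthread() {
        cpi_hand buf  = cpi_get_ram(0);
        cpi_hand chan = cpi_get_channel(0);

        for (int block = 0; block < NUM_BLOCKS; block++) {
            /* 1. prefetch the next block from edge memory into the CoRAM */
            cpi_write_ram(buf, 0, base_addr + block * BLOCK_BYTES, BLOCK_BYTES);
            /* 2. token over the channel: data is ready */
            cpi_channel_write(chan, block);
            /* 3. block until the core logic signals it is done with the buffer */
            cpi_channel_read(chan);
        }
    }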

Control Actions. To utilize memory, an FPGA application instantiates one or more embedded CoRAMs to be used as a logical "portal" into external memory. To "program" the embedded CoRAMs, an associated software control thread invokes a predefined collection of memory and communication primitives called control actions. Control actions constitute a memory management interface that allows control threads to mediate accesses between main memory and the CoRAMs embedded throughout the fabric. At the most basic level, a control thread describes a sequence of control actions executed over the course of a program. In general, a control thread issues control actions along a dynamic sequence that can include cycles and conditional paths.

Example. To illustrate how control threads and control actions operate, Figure 13 shows how a user would (1) instantiate an embedded CoRAM as a black-box module within an application, and (2) program a corresponding control thread to read a single data word from edge memory into the CoRAM. The control thread program shown in Figure 13 first acquires a special object called the co-handle and passes it into a cpi_write_ram¹ control action, which performs a 4-byte memory transfer from the edge memory address space to the CoRAM embedded blocks referred to by the co-handle. To inform the application when the data is ready to be accessed for computation, the control thread issues a token over a bidirectional channel to the core logic using the cpi_channel_write control action.

¹ cpi = CoRAM Programming Interface.

// Core logic (HDL): the embedded CoRAM instantiated as a black-box module
module application(...);
  ram c0(.Clock(...), .Address(...), .En(...), .WrEn(...),
         .WrData(...), .RdData(...));
  ...
endmodule

// Control thread program
ctrlthread() {
  cpi_hand c0 = cpi_get_ram(0);
  cpi_write_ram(c0, sram_address, global_address, size);  // (1)
  cpi_channel_write(...);                                 // (2)
}

Figure 13: Example Usage of CoRAM ((1) transfer from global memory into the CoRAM; (2) ready token sent over the channel FIFO).

Discussion. As illustrated in the above example, a control thread is simply a high-level description of an application's memory access pattern. The control thread description deliberately presents a simple abstraction to the user but is flexible enough to be re-targeted to different hardware implementations. The example also shows that a complete "application" written for the CoRAM program model is defined as both the core logic component written in an HDL of choice and the requisite control thread programs written in software. Fundamentally, CoRAM requires inter-operation with existing hardware description languages such as Verilog, which serve the complementary role of describing the computational or processing logic components of an application (for example, the design of a highly tuned floating-point functional unit). The decision to allow arbitrary hardware description languages is motivated by the need to preserve a hardware-centric view of the fabric to the application writer. In today's FPGAs, a diversity of approaches exists to describe FPGA-based applications in an efficient manner—ranging from high-level synthesis languages (e.g., Bluespec, C) down to hand-tuned IP libraries (e.g., Xilinx Coregen). A fundamental goal of CoRAM is to not restrict the user to a specific style of implementation when developing core logic.

[Figure 14: Memory Control Action. The behavior of cpi_write_ram(cohandle, ram_addr, mem_addr, size) on a single CoRAM, on a linear composition of two CoRAMs (2x1 aspect ratio), and on a scatter-gather composition of two CoRAMs (1x2 aspect ratio); in each case, size bytes beginning at mem_addr in the virtual address space are mapped to local addresses beginning at ram_addr.]

Embedded CoRAM Composition. The CoRAM programming model fundamentally exposes a hardware-centric view of the FPGA that allows the user to customize data partitioning and manage the port and bandwidth usage of the embedded CoRAMs. A unique feature of CoRAM is the ability to compose multiple embedded CoRAMs to form logical portals into memory with arbitrary aspect ratios. For instance, the control action shown below is contextually interpreted based on the aspect ratio of the memories referred to by the cohandle object:

cpi_write_ram(cpi_hand cohandle, cpi_ram_addr ram_addr, cpi_addr mem_addr, int bytes);

Figure 14 gives illustrated examples of how the above control action would behave differently depending on the composition style and aspect ratios of the memories. A linear composition, for example, would concatenate the local addresses of multiple embedded CoRAMs to form a deeper memory, while a scatter-gather composition would create wider memories. When executing the control action, the sequential data streaming from memory would be divided into words that match the desired composition. This feature of CoRAM gives FPGA application writers the ability to customize the interfaces of memories to their application’s needs; meanwhile, the CoRAM memory management interface automatically handles the data transfer and layout of data that matches the customization.
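To make the two composition styles concrete, the control thread sketch below exercises both of them, using the accessor and control action forms defined later in Chapter 4; the thread name, object ids, addresses, and transfer size are illustrative only.

void composition_thread() {
    cpi_register_thread("composer", 1);

    /* Four embedded CoRAMs (obj_ids 0..3) concatenated into one deeper memory. */
    cpi_hand deep = cpi_get_rams(4, false /*scatter*/, 0 /*base obj_id*/);

    /* Four embedded CoRAMs (obj_ids 4..7) interleaved into one wider memory. */
    cpi_hand wide = cpi_get_rams(4, true /*scatter*/, 4 /*base obj_id*/);

    /* The same control action is interpreted contextually: the first call
     * fills the four CoRAMs back-to-back; the second stripes consecutive
     * words of the memory stream across all four CoRAMs. */
    cpi_write_ram(deep, 0 /*ram_addr*/, 0x10000 /*mem_addr*/, 512);
    cpi_write_ram(wide, 0 /*ram_addr*/, 0x20000 /*mem_addr*/, 512);
}

Because the co-handle carries the aspect ratio, the core logic sees either one deep SRAM or one wide SRAM without any change to the transfer code.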

3.2.2 How Does CoRAM Solve The Portability Challenge?

The basic challenge of portability arises from the fact that today's FPGAs require an all-to-all mapping from high-level applications down to specific devices and platforms. In the worst case, a full cross-product of implementations is needed. The root of this problem stems from the fact that conventional memory subsystems of FPGAs expose very low-level interfaces connected to external memory devices such as DRAM or a processor memory bus. Users are often required to familiarize themselves with, or even devise, the protocols needed to interface with a myriad of vendor-specific specifications and devices.

Rather than forcing designers or compute vendors to target their applications and libraries to specific protocols, the software control thread paradigm of CoRAM fundamentally replaces the low-level interfaces with an intermediate machine abstraction that both application designers and system vendors can agree upon. The intermediate machine abstraction must be flexible and easy to use while exposing sufficient intent by the user such that efficient hardware implementations can be facilitated.

The software-based control threads serve this goal by raising the level of abstraction of an application's memory access pattern in a stylized and portable manner. Yet, at the same time, the software control threads along with the embedded CoRAM instantiations are high-level, parallelized descriptions of an application that can be efficiently re-targeted to different FPGAs and platforms.

As will be shown quantitatively in Chapter 8, expressing control threads in a high-level language does not become a limiting factor from a performance perspective, because most time is spent either waiting for memory responses or waiting for computation to progress. Chapter 8 will also demonstrate why CoRAM is portable by automatically retargeting applications to different hardware configurations scaled across multiple technology nodes.

3.2.3 CoRAM Microarchitecture

The dual software and HDL descriptions that constitute a CoRAM-based application must ultimately be synthesized and mapped into physical hardware. Like any "standard" architecture, a deliberate separation exists between what the application writer perceives and the physical mechanisms that lie beneath the abstraction. To fill in the gulf that separates a high-level control thread program and actual hardware, synthesis tools must translate the memory access patterns described by a control thread into hardware mechanisms of an independent device or platform.

[Figure 15: Beneath the CoRAM Programming Abstraction. The architecture seen by the application (control thread programs and SRAM interfaces) is mapped onto a microarchitecture of control blocks and routers within the fabric, with bulk data distribution over a network-on-chip, virtual-to-physical translation (TLBs), and memory interfaces (DRAM or cache) at the edge of the FPGA.]

Figure 15 illustrates the physical archetype of a CoRAM-based FPGA proposed in this thesis, consisting of distributed islands of reconfigurable logic and embedded CoRAMs. The control threads written by an application writer are mapped down into physical control blocks co-located with various embedded CoRAMs throughout the fabric. Control blocks are responsible for interpreting and executing the control thread programs as well as generating memory requests and handling responses through a general-purpose network-on-chip (NoC). The NoC fundamentally provides connectivity and bandwidth between the CoRAM endpoints and the edge memory interfaces embedded throughout the FPGA. At the very edge of the fabric lies the memory subsystem component. Multiple memory controllers and/or bus interfaces are present to manage the data transfers between the edge of the FPGA and external storage devices such as DRAM or SRAM. Above the memory interfaces, optional last-level caches provide a bridge into the FPGA, which can be used to filter unnecessary accesses to memory. Finally, translation lookaside tables may exist above the caches in order to facilitate accesses to a virtual memory address space compliant with other peer computing devices situated on the shared memory bus.

A recurring theme echoed throughout the remainder of this thesis will be the question of hard

versus soft—that is, which sub-components required in the CoRAM microarchitecture merit implementation in existing FPGAs today, and what features merit "hardening" into future FPGAs designed for computing. The microarchitecture design space of Figure 15 will be explored extensively in Chapter 6.

3.3 CoRAM versus Alternative Approaches

In this section, we compare and contrast the CoRAM programming model with alternative memory architecture choices. The objective of this discussion is to elaborate upon the design choices of the

CoRAM architecture and to evaluate it against alternative styles that could also potentially satisfy the desired goals of portability and simplified FPGA memory management.

Standardized Memory Interfaces. A hypothetical way to standardize the memory architecture of the FPGA is to restrict all interactions to off-chip memory and I/O through uniform load-store interfaces between the edges of the fabric and the off-chip interfaces. In this case, FPGA design vendors who adopt the same standard must agree upon a particular convention and protocol for accessing the memory or bus controllers of off-chip interfaces. In this setting, the application writer is still responsible for sharing and multiplexing the off-chip memory resources among the multiple clients that require access to memory. The application writer must also implement manually within soft logic the mechanisms for transferring data between the now-standardized edge memory interfaces and the on-die SRAMs.

In-Fabric Memory Hierarchies. A natural extension of the previous approach is to distribute the standardized load-store interfaces to within the fabric itself and to provide a dedicated memory port to any client that requires memory. A distributed load-store interface requires the underlying hardware to facilitate the movement and buffering of data between external memory and the client that issued the request. The LEAP scratchpad concept [10] is a soft-logic demonstration of this approach where load-store interfaces can be instantiated on-demand by multiple clients in an application.

Behind the distributed load-store interfaces, the LEAP scratchpad concept further employs a processor-like multi-level cache hierarchy built out of embedded SRAMs within the fabric of the

FPGA (with a parameterizable backing store made up of DRAM and/or on-board SRAMs). LEAP

supports the data distribution automatically by instantiating an on-chip network along with parameterized levels of caching to reduce latency and bandwidth demands between the distributed clients and memory. The use of caching further virtualizes the capacity of on-die FPGA memory from the perspective of the application. To preserve portability of applications across different LEAP implementations, the control logic that manages the accesses of on- or off-chip memory must conform to a timing-insensitive request-response protocol [10]. The abstraction presented by LEAP could hypothetically be implemented in a future FPGA with dedicated cache controllers and arrays embedded within the fabric.

CoRAM versus In-Fabric Memory Hierarchy. The CoRAM architecture differs from an in-fabric memory hierarchy by deliberately exposing control of the on-die SRAMs in a way that preserves the hardware-centric view of FPGA memory. As in a conventional FPGA, the application writer still perceives many independent banks of memory, where each bank exposes a private address space and a single-cycle access latency. This hardware-centric view gives the application writer the liberty to perform custom data partitioning across multiple banks, implement composition of multiple SRAMs with flexible aspect ratios that match the application's requirements, and maintain guaranteed control over usage of SRAM port bandwidth and latency. An in-fabric memory hierarchy could hypothetically support similar optimizations through application-level hints to the memory subsystem; however, a demand-based memory hierarchy with automatic management of data between cache levels has a more restricted timing-insensitive interface and semantics that limit the ability of the user to precisely control data placement and port usage of the underlying on-chip

SRAMs.

What CoRAM adds to the conventional SRAM interface is a standard way of simplifying and automating a frequent pattern found in FPGA-based computing, which is the movement of data between off-chip interfaces and their ultimate destinations, the on-die SRAMs. Unlike an in-fabric memory hierarchy, the CoRAM architecture resembles a "close-to-metal" ISA for the FPGA that encompasses low-level memory management primitives that can in turn be used to form higher-level services and semantics. For example, embedded CoRAMs and control threads can be used in conjunction with reconfigurable logic to provide the illusion of a timing-insensitive, cache-like interface. This style of layering deliberately decouples the FPGA architect or platform builder from

having to implement higher levels of abstraction such as caches or memory streams. Instead, implementations are only required to conform to a simple set of requirements laid out by the CoRAM architecture specification, as will be discussed in detail in Chapter 4. As shown later in Chapter 8, an FPGA that directly implements the minimum set of features required by CoRAM allows applications to achieve near-ideal performance potential while incurring modest overheads in implementation cost.
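As a sketch of this layering (the thread name, channel ids, and 256-byte line size are illustrative assumptions, not part of the CoRAM specification), a control thread can offer the core logic a timing-insensitive, cache-like fill service using the channel and binding primitives defined in Chapter 4:

void block_fetch_thread() {
    cpi_register_thread("block_fetch", 1);
    cpi_hand line_buf = cpi_get_rams(4, true, 0);      /* backing CoRAMs     */
    cpi_hand misses   = cpi_get_channel(cpi_fifo, 0);  /* fetch requests in  */
    cpi_hand fills    = cpi_get_channel(cpi_fifo, 1);  /* fill notifications */

    /* Completions on line_buf are auto-enqueued into the fills channel. */
    cpi_bind(fills, line_buf);
    while (1) {
        /* Block until the core logic requests a line (its address arrives
         * as channel data), then fetch the 256-byte line into the CoRAMs. */
        cpi_addr line = (cpi_addr) cpi_read_channel(misses);
        cpi_write_ram(line_buf, 0, line, 256);
    }
}

From the core logic's point of view, the request and fill channels behave like a miss/refill port; the CoRAM primitives supply the data movement underneath.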

CoRAM Control Threads versus Wire-Level Interfaces. A unique feature of CoRAM is the software control thread, which is an orthogonal architectural feature with respect to the previously discussed memory systems. In principle, a variant of CoRAM could allow the user to combine the control thread and core logic in a monolithic RTL application without enforcing separation of compute and memory. In this case, the user would describe control logic that issues control actions over a wire-level interface in RTL. As mentioned earlier, the rationale for using software is to avoid having the designer implement address generation using low-level RTL languages. The software abstraction allows the designer to naturally describe memory accesses as an untimed sequence of commands.

In some cases, applications may exhibit a tight coupling between the application data itself and the logic used to generate requests to memory. Algorithms such as graph traversal and sparse matrix multiplication exhibit memory accesses that depend on the memory data itself. As will be shown in our examples in Chapter 5, the lightweight control threads of CoRAM communicate with the core logic through high-bandwidth, low-latency channels. This capability allows data-dependent memory accesses to be supported efficiently, as will be demonstrated in the Sparse Matrix-Vector Multiplication case study of Chapter 5.
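The shape of that interaction can be sketched as follows; this is a minimal, hypothetical pointer-chasing thread rather than the Chapter 5 SpMV design, and the names and sizes are illustrative:

void pointer_chase_thread() {
    cpi_register_thread("chaser", 1);
    cpi_hand node  = cpi_get_ram(0);                /* holds the fetched node */
    cpi_hand next  = cpi_get_channel(cpi_fifo, 0);  /* addresses from logic   */
    cpi_hand ready = cpi_get_channel(cpi_fifo, 1);  /* fetch completions      */
    while (1) {
        /* The core logic derives the next address from data it has already
         * consumed and feeds it back through the channel. */
        cpi_addr a = (cpi_addr) cpi_read_channel(next);
        cpi_write_ram(node, 0, a, 64);   /* data-dependent fetch */
        cpi_write_channel(ready, 1);     /* node is now in the CoRAM */
    }
}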

CoRAM versus Scratchpad Memories. Beyond the systems described thus far, the style of memory management in CoRAM closely resembles the concept of software-based Scratchpad Memories (SPM) [16]. SPMs are typically implemented in specialized or power-constrained settings such as embedded systems, DSPs, and GPGPUs, which expose software-level control over the on-die memories. The control threads of CoRAM fulfill a similar role by allowing explicit and controlled data movements between global memory and the remote scratchpads (i.e., CoRAMs) distributed throughout the fabric (see Figure 16).

There are several differences between SPM and CoRAM. In CoRAM, the consumer and producer of the scratchpad is not a fixed-width processor but is instead a reconfigurable fabric that requires a different style of interface and semantics. CoRAM further separates the computation into asynchronous memory threads and reconfigurable core logic—whereas in a processor-based system, a single thread of control is the common case. In systems with SPM, memory must usually be managed carefully to avoid introducing bottlenecks in the computation. As will be shown in Chapter 5, FPGA-based applications tend to avoid this bottleneck through the asynchronous execution of core logic and software control threads.

CoRAM versus DMA Engines. The mechanisms of CoRAM share similarities with Direct Memory

Access (DMA) engines, which are typically used to manage bulk transfers between memory and devices on a shared memory or I/O bus. From a certain perspective, CoRAM could be viewed as a programmable software abstraction for a multitude of distributed DMA engines throughout the

FPGA fabric. The control actions of a thread, in fact, closely resemble high-level DMA commands that ultimately become translated into low-level control signals in the memory subsystem. One difference between a raw DMA engine and CoRAM, however, is the explicit decoupling of the mechanisms from the interface. A control thread program can be interpreted in any number of ways in the implementation, making no explicit requirement on what hardware must exist beneath it.

Other Memory Architectures. Aside from CoRAM and the memory systems described in this section, other memory architectures and organizations are also valid candidates for standardization.

For example, a variant of CoRAM could incorporate only a standardized, general-purpose data distribution network rather than coupling the network end-points to the on-die embedded SRAMs and the off-chip memory. In this case, the user would be given a flexible, general-purpose mechanism for data distribution throughout the fabric but would still be responsible for manually layering the memory mechanisms over the network. Both CoRAM and the in-fabric hierarchies, in particular, make the deliberate decision to couple data distribution and on-die memories due to its frequent occurrence as an application pattern. A comparison across all potential styles of memory architecture certainly merits further investigation and research. It is a contention of this thesis, however, that the CoRAM architecture is a strong candidate for investigation due to an offering of features that simplify application development, retain a hardware-centric view of on-die memory, and can be implemented efficiently in future FPGAs for computing.

3.4 Communication and Synchronization

In this section, we discuss several topics related to communication and synchronization in the

CoRAM architecture.

FPGA-to-FPGA Communication. An important question that should be raised is how CoRAM as an abstraction handles multiple FPGAs or multiple disjoint fabrics on a single die. In some platforms, direct on-board links are provided between multiple FPGA nodes [32], enabling low latency and high bandwidth. The direct link approach, although beneficial in providing excellent communication bandwidth, can be detrimental to application portability because applications are explicitly partitioned and tuned to a specific platform's topology. The CoRAM abstraction deliberately makes a compromise by not allowing direct links to be exposed between multiple FPGAs; instead, CoRAM requires all communication and data transfers to propagate through global shared memory, in a similar fashion to multiple processors on a shared memory bus². The Convey HC-1 [27], for example, adopts a similar convention by requiring all of its four FPGAs to communicate through global coherent shared memory.

Host-to-FPGA Communication. As discussed earlier, the CoRAM abstraction assumes that all communication between general-purpose processors and FPGAs is carried out over the shared memory subsystem. Another scenario that must be considered is when the FPGA incorporates embedded hard processors in the fabric. The recently announced Xilinx Zynq-7000, for example, marries reconfigurable logic with a dual-core ARM processor [116]. The introduction of embedded hard cores with private cache hierarchies can have the unfortunate side effect of deterring portability if a given FPGA-based application assumes the features of a certain core and interacts with it through direct links. Even in this scenario, the CoRAM abstraction deliberately restricts all processor-to-logic interactions through shared memory in order to standardize the communication between any FPGA application and a processor, whether on- or off-chip. To interact with processors, an FPGA-based

²Note that CoRAM does not preclude FPGA-to-FPGA links as a hardware optimization; it only requires that the links cannot be exposed directly to the application.

application would instantiate one or more CoRAMs and would employ control threads to communicate with general-purpose programs.

Synchronization Primitives. An important question is whether CoRAM should support native synchronization primitives in the architecture specification (e.g., read-modify-write, test-and-set). In a conventional multicore or multiprocessor system, the coherent cache subsystem is a shared resource that serves the dual role of data distribution and communication between multiple processors. When multiple readers and writers are sharing the same data, semaphores must typically be employed to guarantee atomicity and proper ordering of data accesses.

In the case of a single-chip FPGA system, the soft logic fabric presents a unique environment where point-to-point communication and data distribution can be channeled and steered through the fabric without the use of memory. In the CoRAM abstraction, control threads can further communicate directly with core logic through the channel FIFOs, which can be layered atop the fabric to form efficient message-passing primitives between multiple threads. In the case of a multi-FPGA system, where distinct FPGAs share data on the memory bus (and have no direct channels with each other), the existing blocking control actions described in Chapter 4 can be used to implement serialization or atomicity if required by an application that shares data between multiple FPGAs. The same primitives can also be used to coordinate accesses between the FPGAs and other hosts such as general-purpose processors.

3.5 Other Issues

Virtual Memory and I/O. In our discussion of CoRAM, we have yet to cover the issue of virtual memory and how FPGAs with CoRAM handle I/O devices and external interaction. It is assumed that FPGAs co-existing with conventional general-purpose processors will be able to directly access the virtual address space shared by all existing processing cores on the memory bus. As noted earlier in Section 3.1, the components at the edge of the fabric will host translation lookaside tables (TLBs) that can map global addresses to physical addresses beyond the boundaries of the FPGA fabric. The

TLBs would be accompanied by small controllers that provide services such as TLB miss handling and replacement. The existence of TLBs would allow access to the I/O space of physical devices

through conventional memory mapping. The design and implementation of virtual memory for

FPGAs has already been demonstrated in commercial designs [27] such as the Convey HC-1 (as covered in Chapter 2) and will not be a major focus of this thesis.

Supporting Alternative Styles of Memories. It has been assumed up to this point that FPGA-based applications are exposed only to a homogeneous collection of embedded memories on the

FPGA. Altera MRAMs [12], for example, are divided into block types of varying aspect ratios and capacities (the 20Kbit M20K versus the 640-bit MLAB). Other styles of memories include LUTRAM-based memories [7] and externally-attached SRAMs [80]. Supporting heterogeneous memories in

CoRAM does not introduce new difficulties because applications are already allowed to instantiate

CoRAMs of arbitrary aspect ratios. Supporting LUTRAM would require emulation of the CoRAM mechanisms by means of soft logic or introducing dedicated mechanisms into the CLB. Finally, external SRAMs can be accommodated in CoRAM in one of several ways: (1) by treating and wrapping the external SRAM as another embedded CoRAM, (2) extending the edge memory interfaces and caches with an extra level of hierarchy (which makes the SRAM accesses implicit), or (3) exposing the SRAM devices explicitly as memory-mapped I/O devices.

Operating Systems and Application Environments. In a general-purpose processor, application portability is not provided by an ISA alone. The operating system, libraries, and established conventions play a crucial role in allowing forward compatibility of applications. The

CoRAM architecture most similarly resembles an ISA for an FPGA and would also require a software stack of libraries and components that are built upon the virtual machine abstraction (e.g., stdlib). It is not difficult to imagine that libraries of control thread programs could be written to form standard environments and services that support FPGA-based applications. Chapter 5 will give concrete examples of re-usable "memory personality" libraries built out of control threads and core logic.

What CoRAM Does Not Virtualize. Although CoRAM provides a “virtual” interface to memory, it does not completely solve all of the portability challenges associated with FPGAs. For instance, a fixed application running on a particular FPGA may not necessarily scale down to another FPGA half its size—that is, CoRAM does not virtualize the LUT and on-die memory resources of a given

fabric. However, CoRAM as an abstraction does provide forward compatibility. Assuming that future FPGAs increase in density, existing applications should run without modification on subsequent devices that support the same CoRAM architectural specification.

[Figure 16: CoRAM as a Software-Managed Memory Hierarchy. An application divides into a compute component (user logic accessing SRAM) and a memory component (control threads issuing control actions against global memory).]

3.6 Summary

Processing and memory are inseparable aspects of any real-world computing problem. A proper memory architecture is a critical requirement for FPGAs to succeed as a general-purpose computing technology. In this chapter, we presented a new, portable memory architecture called

CoRAM to provide deliberate support for memory accesses from within the fabric of a future FPGA engineered to be a computing device. CoRAM is designed to match the memory requirements of highly concurrent and spatially distributed processing kernels that consume and produce memory data from within the fabric. The subsequent chapters of this thesis will explain the full features of

CoRAM and will also demonstrate the plausibility of software-based memory management through concrete examples.

Chapter 4

Cor-C Architecture and Compiler

I have stopped reading Stephen King novels. Now I just read C code instead.

Richard O’Keefe

The Cor-C architecture specification is an instance of the CoRAM concept, which establishes all of the requisite details, data types, constraints, and semantics that are necessary for a real portable hardware/software interface¹. The Cor-C architecture specification defines a dialect of the C language that can be used to express the desired behavior of control thread programs. The use of a standard, high-level language such as C affords an application developer not only simpler but also more natural expressions of control flow and memory pointer manipulations. It is important to note that Cor-C is not intended to be used as a medium for expressing the computational components of an application but rather to be used as a lightweight memory management interface that "wraps" a given application to facilitate portability and to reduce design effort.

This chapter begins by introducing the salient features of the Cor-C language, including data types, thread invocation and management, control actions, and the semantics of memory. Section 4.4 will describe a prototype compiler for the Cor-C specification, which compiles control thread programs into finite state machines. Chapter 5 will later present actual uses of the prototype compiler for developing real applications using the Cor-C language.

¹An appropriate analogy would be the MIPS ISA being an instance of the RISC concept.

Data type        Description
bool             1-bit boolean
char, uchar      8-bit signed and unsigned integers
sint, suint      16-bit signed and unsigned integers
int, uint        32-bit signed and unsigned integers
int64, uint64    64-bit signed and unsigned integers
cpi_channel_ty   An enumeration of channel object types (reg, fifo)
cpi_addr         64-bit virtual address
cpi_ram_addr     16-bit local ram address
cpi_hand         Static handle for CoRAMs and channel objects
cpi_tag          Transaction tag for logical memory transactions

Table 2: Cor-C Data Types

4.1 Cor-C Overview

The standard collection of primitives in Cor-C is divided into static versus dynamic control actions. Table 3 lists the accessor control actions that are statically processed at compile-time, while Table 4 lists the control actions that are executed dynamically throughout the course of an application. The control actions have the appearance of a memory management API and abstract away the details of the underlying hardware support—similar to the role served by the Instruction Set Architecture (ISA) between software and evolving hardware implementations. As will be shown later in Chapter 5, the basic set of control actions defined here comprises powerful building blocks that can be used to compose more sophisticated memory abstractions such as scratchpads, caches, and FIFOs—each of which is tailored to the memory patterns and desired interfaces of specific applications.

4.1.1 Control Threads

Every application in CoRAM begins with a source-level description of control threads to act as a "wrapper" around the core processing logic. Control threads are written in the Cor-C language, which is syntactically identical to C [64]. Table 2 summarizes the types in the language, which include several types specific to the Cor-C language. To begin an application, threads are declared using the cpi_register function from Table 3, which takes as arguments a unique thread name and a scale factor that replicates the body of the containing function N times. The code below illustrates how two separate Cor-C functions would be instantiated in a single program. In this example, a total of three threads would be executed during runtime (one of threadA, two of threadB).

cpi_register — Registers a control thread with name thread_name and replicates it N times.
    void cpi_register_thread(cpi_str thread_name, cpi_int N);

cpi_instance — Returns the thread ID as an int.
    int cpi_instance();

cpi_get_ram — Returns a ram co-handle uniquely identified by obj_id and an optional list of sub-ids.
    cpi_hand cpi_get_ram(int obj_id);
    cpi_hand cpi_get_ram(int obj_id, int sub_id);
    cpi_hand cpi_get_ram(int obj_id, ...);

cpi_get_rams — Returns a co-handle that combines N rams together as a single logical memory. When scatter is enabled, the rams are combined in a word-interleaved fashion. If scatter is disabled, the rams are composed linearly. The rams selected are based on N consecutively numbered ids from id...id+N-1, where id is the last argument used in the control action.
    cpi_hand cpi_get_rams(int N, bool scatter, int obj_id);
    cpi_hand cpi_get_rams(int N, bool scatter, int obj_id, int sub_id);
    cpi_hand cpi_get_rams(int N, bool scatter, int obj_id, ...);

cpi_get_channel — Returns a channel co-handle based on the enumeration ty. The channel is uniquely identified by obj_id and an optional list of sub-ids.
    cpi_hand cpi_get_channel(cpi_channel_ty ty, int obj_id);
    cpi_hand cpi_get_channel(cpi_channel_ty ty, ...);

Table 3: Cor-C Accessor Control Actions (Static).

// single thread
void threadA() {
    cpi_register("thread-A", 1);
    ...
}

// two threads
void threadB() {
    cpi_register("thread-B", 2);
    ...
}
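Each replicated copy can then use cpi_instance to identify itself and select per-instance objects; the sketch below (object ids and addresses are illustrative) splits a 1024-byte transfer evenly across the two copies of thread-B:

// two threads, each transferring its own half of a 1024-byte region
void threadB() {
    cpi_register("thread-B", 2);
    int id = cpi_instance();            /* 0 or 1, one per replica       */
    cpi_hand buf = cpi_get_ram(0, id);  /* per-instance CoRAM via sub-id */
    cpi_write_ram(buf, 0, id * 512, 512);
}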

4.1.2 Object Instantiation and Identification

To utilize embedded CoRAMs, a designer begins with a pre-defined library of black-box module wrappers written in a specific hardware description language. Listing 4.1 (top) shows the Verilog port list of an embedded CoRAM with a single read-write SRAM port. Unlike a typical SRAM, the embedded CoRAM includes extra parameters specific to the Cor-C architecture specification. The THREAD parameter is a string that names the particular control thread associated with the CoRAM. The additional field, THREAD_ID, is necessary for scale factors greater than 1 (i.e., when a thread is replicated with cpi_register). Finally, the OBJECT_ID and an optional list of SUB_ID parameters distinguish between multiple CoRAMs managed by a single control thread instance. In addition to the CoRAMs, users may also instantiate channel FIFOs (shown in Listing 4.1, bottom) that enable the core logic to communicate with specific control threads in the application. The convention for identifying and acquiring channel objects is the same as that for acquiring CoRAMs.

When performing accesses to memory, a control thread typically gathers one or more instantiated CoRAMs into a single, program-level identifier called the co-handle, or cpi_hand for short.

The co-handle establishes a compile-time binding between an individual control thread and a collection of one or more CoRAMs that are functioning as a single logical unit. Like conventional

FPGAs, CoRAMs can be combined to form flexible aspect ratios and capacities. Figure 17 illustrates how multiple CoRAMs are composed to form a single RAM with deeper entries (called linear) or a single RAM with wider data words (called scatter/gather). The composition of multiple RAMs

Listing 4.1: Verilog black-box definition for single-ported embedded CoRAM and Channel FIFO.

module CORAM1(CLK, RST_N, en, rnw, addr, din, dout);

  parameter THREAD     = "thread_name";
  parameter THREAD_ID  = 0;   // corresponds to cpi_instance()
  parameter OBJECT_ID  = 0;   // object id
  parameter WIDTH      = 32;  // RAM data width
  parameter DEPTH      = 512; // RAM depth
  parameter INDEXWIDTH = 9;   // log of RAM depth

  // Additional optional parameters
  parameter SUB_ID    = -1;   // valid if not -1
  parameter SUBSUB_ID = -1;
  ...

  input CLK, RST_N, en, rnw;
  input [INDEXWIDTH-1:0] addr;
  input [WIDTH-1:0] din;
  output [WIDTH-1:0] dout;

endmodule


module ChannelFIFO(CLK, RST_N, din, din_rdy, din_en, dout, dout_rdy, dout_en);

  parameter THREAD    = "thread-name";
  parameter THREAD_ID = 0;  // corresponds to cpi_instance()
  parameter OBJECT_ID = 0;  // object id
  parameter WIDTH     = 64; // channel data width
  parameter DEPTH     = 16; // channel depth
  parameter LOGDEPTH  = 4;  // log of channel depth

  // Additional optional parameters
  parameter SUB_ID = -1;    // valid if not -1
  ...

  input CLK, RST_N;
  // From user logic to control thread
  input dout_en;
  input [WIDTH-1:0] dout;
  output dout_rdy;
  // From control thread to user logic
  input din_en;
  output [WIDTH-1:0] din;
  output din_rdy;

endmodule

can be declared in a control thread using the cpi_get_rams accessor function, which returns a co-handle that represents one or more CoRAMs functioning as a single logical unit. The accessor takes as arguments N, the number of CoRAMs, an option to compose the CoRAMs linearly or in scatter-gather mode, and the base object id of the CoRAMs (plus an optional list of sub-ids) to uniquely identify a range of CoRAMs.

[Figure 17: Linear and Scatter-Gather RAM Compositions. Left: four 8x32-bit RAMs (obj_ids 0 through 3) composed linearly with get_rams(N=4, scatter=False, obj_id=0), concatenating the local address spaces so that RAM i holds addresses 8i through 8i+7. Right: the same four RAMs composed with get_rams(N=4, scatter=True, obj_id=0), interleaving consecutive word addresses across the RAMs so that RAM i holds addresses i, i+4, and so on.]

4.1.3 Memory Control Actions

The basic role of the control thread is to perform memory operations upon co-handles and to inform the core processing logic through channels when particular operations have completed. The most basic way to operate upon a co-handle is to pass it into a cpi_write_ram memory control action, which performs a logical memory transfer of size bytes from the global memory address mem_addr to the local address ram_addr of the CoRAMs named by the co-handle; the complementary cpi_read_ram control action transfers data in the opposite direction. When completed, a sequential block of data from memory will have been split into RAM-sized words arranged into the addresses of each individual CoRAM according to the memory-mapping (see Figure 17).

Blocking vs. Non-Blocking. Memory control actions are subdivided into blocking versus non-blocking behaviors (see Table 4). The CoRAM architecture presents a "blocking" behavior where sequences of control actions written in a control thread will appear to execute atomically "one-at-a-time" from the perspective of a single control thread.

cpi_nb_write_ram — Performs a non-blocking transfer of N bytes from memory at address addr to the rams co-handle beginning at local address ram_addr. Returns a transaction tag cpi_tag, which can be valid or invalid (CPI_INVALID_TAG). If the last argument tag_append is set to a value equal to the tag of a previous non-blocking transfer, the current transfer will be appended to the previous transaction and will share the same tag.
    tag = cpi_nb_write_ram(cpi_hand rams, cpi_ram_addr ram_addr, cpi_addr addr, cpi_int N, cpi_tag tag_append);

cpi_nb_read_ram — Same as cpi_nb_write_ram except that transfers move from rams to memory.
    tag = cpi_nb_read_ram(cpi_hand rams, cpi_ram_addr ram_addr, cpi_addr addr, cpi_int N, cpi_tag tag_append);

cpi_write_ram — Same as cpi_nb_write_ram except that control threads suspend until the transaction completes.
    cpi_write_ram(cpi_hand rams, cpi_ram_addr ram_addr, cpi_addr addr, cpi_int N);

cpi_read_ram — Same as cpi_nb_read_ram except that control threads suspend until the transaction completes.
    cpi_read_ram(cpi_hand rams, cpi_ram_addr ram_addr, cpi_addr addr, cpi_int N);

cpi_test — Takes as argument a rams co-handle and tag and returns a bool indicating whether previous transactions associated with cpi_nb_read_ram or cpi_nb_write_ram have completed.
    bool cpi_test(cpi_hand rams, cpi_tag tag);

cpi_wait — Takes as argument a rams co-handle and tag and blocks the control thread until the previous transactions associated with tag have completed.
    void cpi_wait(cpi_hand rams, cpi_tag tag);

cpi_bind — Establishes a static binding between a rams co-handle and a channel co-handle, which will automatically deliver notifications of transactions applied to rams to channel. Once a cpi_bind is established, the control thread is no longer permitted to use cpi_test or cpi_wait on rams. The control thread also can no longer perform cpi_write_channel to channel.

Table 4: Memory Control Actions (Dynamic).

cpi_read_channel — Reads from channel and returns data of type cpi_int64. The control thread will block if channel is empty.
    cpi_int64 cpi_read_channel(cpi_hand channel);

cpi_write_channel — Writes data to channel. The control thread will block if the channel is full.
    void cpi_write_channel(cpi_hand channel, cpi_int64 data);

cpi_test_channel — Takes as argument a channel co-handle and returns a bool indicating whether the channel is either empty or full (depending on the boolean option check_empty).
    bool cpi_test_channel(cpi_hand channel, bool check_empty);

Table 5: Channel Control Actions (Dynamic).

[Figure 18: Supporting Automatic Notification with Channel-to-CoRAM Bindings. Top (conventional): (1) the control thread issues tag = cpi_nb_write_ram(RAM, ...) as a memory request, (2) wait-polls with cpi_wait(RAM, tag), (3) the data transfers from memory into the CoRAM, (4) the thread executes write_channel(FIFO), and (5) the user logic is notified and computes. Bottom (automatic notification): (1) the thread establishes cpi_bind(FIFO, RAM) once, (2) issues cpi_nb_write_ram(RAM, ...), (3) the data transfers from memory into the CoRAM, and (4) the notification is enqueued into the FIFO automatically.]

In some circumstances, it is desirable from a performance perspective to explicitly allow multiple outstanding control actions to proceed in parallel (i.e., to pipeline multiple address requests). Non-blocking control actions support this by immediately returning control to the thread and providing a tag that must be tested later to determine when a transaction has completed (see cpi_test and cpi_wait in Table 4). Note that in some cases, the underlying hardware may return an invalid tag, which requires the control thread to retry the transaction at a later time. A tag is held indefinitely until a cpi_test or cpi_wait is called, which

has the side-effect of releasing the tag when the operation returns successfully.
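One consequence is that issue loops typically retry until the hardware grants a tag; the helper below is a minimal sketch of that obligation (the helper name is ours and not part of the Cor-C specification):

cpi_tag issue_write(cpi_hand rams, cpi_ram_addr dst, cpi_addr src, int bytes) {
    cpi_tag t = CPI_INVALID_TAG;
    do {
        /* Re-issue until the memory subsystem accepts the transaction. */
        t = cpi_nb_write_ram(rams, dst, src, bytes, CPI_INVALID_TAG);
    } while (t == CPI_INVALID_TAG);
    return t;  /* the caller releases t later via cpi_test or cpi_wait */
}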

A common task of the control thread is to periodically inform the core logic when specific memory transactions have completed. Table 5 summarizes the channel control actions that enable bidirectional communication through FIFOs and registers. A very typical synchronization pattern is shown in Figure 18 (top), where (1) a control thread issues a memory control action and receives a transaction tag, (2+3) tests the transaction tag for completion, (4) writes a token to the core logic through a channel FIFO, and (5) the core logic consumes the token and processes data from the CoRAM.

Transaction Coalescing. The use of non-blocking transactions requires a control thread to track multiple outstanding tags, which can lead to overheads in tag state management and cycles consumed by periodic testing. The Cor-C architecture provides an optimization to reduce this overhead by allowing a memory control action to coalesce multiple transactions to an existing tag held by the thread. For example:

cpi_tag reused_tag = CPI_INVALID_TAG;
for(int i=0; i < 10; i++) {
    reused_tag = cpi_nb_write_ram(ramA, i, i*4, 4, reused_tag);
}
cpi_wait(reused_tag);

In the example shown above, 10 non-blocking memory transactions are executed by the control thread and coalesced into a single tag. At the end of the loop, only a single cpi_wait operation is required. When passing in a re-used tag, the memory control action will merge the new transaction with the prior ones.

Automatic Notifications. Another feature of the Cor-C specification is the ability to completely eliminate the need for control threads to synchronize directly with core logic. The cpi_bind control action shown in Table 4 allows a control thread to establish a static binding between a CoRAM co-handle and a channel FIFO. Figure 18 illustrates the method of operation—when a control action is performed upon a specific co-handle, the associated channel FIFO will automatically enqueue a token that signals to the core logic the completion of a memory transaction. Completions are placed into the channel FIFO in the same order that transactions are issued. The cpi_bind operation reduces the overall latency of a round-trip memory access and also allows a control thread to pipeline multiple non-blocking memory requests without having to periodically test for completions.

Thread-to-Thread Communication. Thread-to-thread synchronization can be provided natively in the Cor-C specification for message-passing between multiple threads. In some applications, the need for synchronization arises when dependencies must be enforced between phases of computation and when there are multiple concurrent control threads. Custom forms of synchronization can also be facilitated through the use of channels. For example, to implement a fast barrier, users can instantiate channels as needed into the soft logic fabric to implement their own desired synchronization methods, as sketched below.
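In the sketch (thread name, channel ids, and the four-way replication are illustrative), each replicated worker signals its arrival on one channel and blocks on a second channel; simple soft logic that has collected an arrival token from every worker then broadcasts the release tokens:

void worker_thread() {
    cpi_register("worker", 4);
    int id = cpi_instance();
    cpi_hand arrive  = cpi_get_channel(cpi_fifo, 0, id); /* to fabric   */
    cpi_hand release = cpi_get_channel(cpi_fifo, 1, id); /* from fabric */

    /* ... phase 1: control actions for this worker's partition ... */
    cpi_write_channel(arrive, 1);   /* announce arrival at the barrier  */
    cpi_read_channel(release);      /* block until all workers arrived  */
    /* ... phase 2 ... */
}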

4.2 Disallowed Behaviors

Although control threads have the appearance of general-purpose software threads, there are a number of restrictions in the Cor-C specification:

• The static control actions listed in Table 3 can only be executed unconditionally (e.g., cannot be conditioned by a loop variable).

• Control threads are limited to 64 CoRAMs per co-handle².

• Control threads may not test invalid tags or perform control actions with invalid arguments.

• Threads may not dynamically allocate memory or instantiate global variables.

• Control threads may not dereference memory pointers directly³.

• Control threads may not execute floating point operations.

• Function stacks are allowed but must be statically bounded.

• No recursion allowed.

Many of the language restrictions above are intended to reduce the likelihood of "abusing" control threads for computation purposes. The various restrictions also ensure that control threads are highly amenable to lightweight implementations in hardware (i.e., synthesized into state machines or executed on lightweight microprocessors).

²The architectural limit is placed here due to physical constraints imposed by the cluster-style microarchitecture presented in Chapter 6.
³When a control thread needs to directly access memory, a single CoRAM along with a channel FIFO can be allocated and "wrapped" together to form a simple load-store interface (see Chapter 5).
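The load-store wrapping of footnote 3 can be sketched concretely as follows; the command encoding, object ids, and 4-byte word size are illustrative assumptions:

void loadstore_thread() {
    cpi_register("loadstore", 1);
    cpi_hand word = cpi_get_ram(0);                /* one-word data buffer */
    cpi_hand req  = cpi_get_channel(cpi_fifo, 0);  /* address + direction  */
    cpi_hand resp = cpi_get_channel(cpi_fifo, 1);  /* completion tokens    */
    while (1) {
        cpi_int64 cmd = cpi_read_channel(req);
        cpi_addr a = (cpi_addr)(cmd & 0x7fffffffffffffffLL);
        if (cmd < 0)                       /* top bit set: store request  */
            cpi_read_ram(word, 0, a, 4);   /* "store": CoRAM -> memory    */
        else                               /* top bit clear: load request */
            cpi_write_ram(word, 0, a, 4);  /* "load":  memory -> CoRAM    */
        cpi_write_channel(resp, 1);        /* word slot is ready          */
    }
}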

4.3 Simple Example: Vector Addition

To concretely illustrate the features of the Cor-C language, the code below gives a complete top-to-bottom example of the vector increment kernel, where a sequential array of data is read in from memory, incremented, and written back to main memory. The particular kernel in this example performs two concurrent increments per clock cycle.

void vector_increment_thread() {
    cpi_register_thread("vector_add", 1/*number of threads*/);
    cpi_hand data_store   = cpi_get_rams(2/*numRams*/, true/*scatter*/, 0, 0);
    cpi_hand bind_channel = cpi_get_channel(cpi_fifo, 0);
    cpi_hand done_channel = cpi_get_channel(cpi_fifo, 1);
    cpi_bind(bind_channel, data_store);
    cpi_tag tag = CPI_INVALID_TAG;

    /* Read memory */
    for(int i=0; i < 128; i+=8)
        tag = cpi_nb_write_ram(data_store, i, i*8, 8, tag);

    /* Wait for computation to finish */
    while(!cpi_read_channel(done_channel)) {}
    tag = CPI_INVALID_TAG;

    /* Writeback to memory */
    for(int i=0; i < 128; i+=8)
        tag = cpi_nb_read_ram(data_store, i, i*8, 8, tag);
    cpi_wait(tag);
}

module vector_kernel(CLK, RST_N);
    input CLK, RST_N;
    reg busy, writeback, done, dout_en, wen;
    wire dout_rdy;
    reg [5:0] addr, waddr;
    reg [31:0] din0, din1;
    wire [31:0] dout0, dout1;

    CORAM2#("vector_add", 0/*thread-id*/, 0/*obj-id*/, 0/*sub-id*/,
            32/*data width*/, 32/*depth*/, 5/*addr-width*/)
        arr0 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen),
              .waddr(waddr), .addr(addr), .din(din0), .dout(dout0));

    CORAM2#("vector_add", 0/*thread-id*/, 0/*obj-id*/, 1/*sub-id*/,
            32/*data width*/, 32/*depth*/, 5/*addr-width*/)
        arr1 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen),
              .waddr(waddr), .addr(addr), .din(din1), .dout(dout1));

    // A non-zero token value releases the control thread's wait loop.
    ChannelFIFO cfifo(.CLK(CLK), .RST_N(RST_N),
                      .dout_en(dout_en), .dout(64'd1),
                      .dout_rdy(dout_rdy), .../*unused signals*/);

    always@(posedge CLK) begin
        if(RST_N) begin
            if(dout_rdy && !busy) begin
                writeback <= 0; busy <= 1; addr <= 0; waddr <= 0;
            end
            else if(busy) begin
                addr <= addr + 1;
                writeback <= (addr < 32);
                busy <= (addr < 32);
            end
            else writeback <= 0;

            if(writeback) begin
                wen   <= 1'b1;
                din0  <= dout0+1;
                din1  <= dout1+1;
                waddr <= waddr+1;
                if(waddr == 31) dout_en <= 1;
            end
            else begin
                wen <= 1'b0;
                dout_en <= 1'b0;
            end
        end
        else begin
            writeback <= 0; dout_en <= 0; busy <= 0; wen <= 0;
        end
    end
endmodule

In the first phase of the control thread program above, the thread sets up a programmed transfer that reads in 128B of data from memory into 2 separate embedded CoRAMs represented by a single co-handle. To present a wide 64-bit word interface to the fabric, the co-handle is composed with the scatter-gather argument set to true. The cpi_bind operation thereafter establishes an implicit channel between the memory system and the core logic, which is the ultimate consumer of the memory data. Within the core logic, two embedded CoRAMs and a single channel FIFO are instantiated as black-box modules. As memory transactions stream through one-at-a-time, the core logic will receive tokens through the channel FIFO, indicating that data is ready for access within the CoRAMs.

In the simple example above, the core logic waits until all the tokens are received before performing the accumulation steps. During the compute phase, the core logic reads and writes 32 clock cycles worth of data from the embedded CoRAMs. Upon completion, a token is written back from the core logic to the control thread indicating that a writeback to memory is pending. The control thread in the background wait-polls on the channel until receiving the token and then performs the final set of memory control actions to write data from the CoRAMs to memory.

Summary. It is not difficult to imagine that many variants of control actions could be added to the Cor-C architecture specification to support more sophisticated patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch, strided access, programmable patterns, etc.). In

a commercial production setting, control actions—like instructions in an ISA—must be carefully defined and preserved to achieve the value of portability and compatibility. Optimizing compilers could also play a significant role in the static optimization of control thread programs. Analysis could be used, for example, to identify non-conflicting control actions that are logically executed in sequence but can actually be executed concurrently without affecting correctness. The next section describes a compiler and proof-of-concept of the Cor-C architecture specification. Chapter 5 will later present concrete demonstrations of Cor-C-based applications.

[Figure 19: Options for Synthesizing Control Threads. Control thread programs map onto a CoRAM unit cluster in the fabric in one of three ways: (1) direct synthesis to RTL control blocks, (2) compilation to soft processor cores, or (3) execution on hard micro-controllers attached to the network-on-chip.]

4.4 The CoRAM Control Compiler (CORCC)

The CoRAM Control Compiler (CORCC) was developed in this thesis to explore the various implementation options for control threads. Figure 19 shows the several ways in which a control thread program can be mapped down into control logic in an FPGA with CoRAM support: (1) directly compiling control thread programs into soft logic state machines via high-level synthesis,

(2) compiling control threads to pre-implemented soft microprocessor cores (e.g., Xilinx Microblaze [112] or Altera Nios [11]), or (3) compiling to a hard microprocessor serving as a dedicated microcontroller.

void example() {
    cpi_register_thread("read_thread", 1);
    cpi_hand ramA = cpi_get_rams(1, false, 0, 0);
    cpi_hand cfifo = cpi_get_channel(cpi_fifo, 0);
    cpi_tag tag = CPI_INVALID_TAG;
    cpi_read_channel(cfifo);
    for(int i=0; i < 10; i++)
        tag = cpi_nb_write_ram(ramA, i, i*4, 4, tag);
    cpi_wait(ramA, tag);
    cpi_write_channel(cfifo, 1);
}

[The remainder of Figure 20 shows the corresponding LLVM IR organized into basic blocks (bb0 through bb3, critedge, preheader, and loopexit) and the synthesized FSM, whose states map onto the basic blocks and connect to the application core logic through channel FIFOs and to the cluster through memory-request and tag-grant interfaces.]

Figure 20: CORCC Example.

CORCC supports direct synthesis of control threads into synthesizable RTL from standard C code and can also be configured to model the cycle-time performance of a simple microprocessor.

The implementation of CORCC leverages the Low Level Virtual Machine (LLVM) framework [69], which is an open-source, end-to-end compiler with pluggable extensions for custom passes and backends. CORCC exploits the modularity of LLVM and its language-independent intermediate representation (IR) to implement a simple form of high-level synthesis with extensions for microprocessor performance modeling.

Implementation. The CORCC LLVM extension is implemented in approximately 6,000 lines of C++ as a series of

LLVM passes. CORCC extends LLVM with special objects and data types that are specific to the

Cor-C architecture specification. These include CoRAM and channel accessors, co-handles, and the memory/channel control actions. LLVM includes front-ends for several popular languages such as C and C++ and can automatically generate an intermediate representation (IR) in Static Single Assignment (SSA) form. In SSA form, each variable in a routine is assigned exactly once, which is useful for various optimizations and for simplifying the properties of variables. An important feature of the IR is the LLVM type system, which provides high-level program-level information accessible at the assembly level. The first stage of CORCC is automatically handled by LLVM, which translates the high-level control thread program into IR organized into basic blocks. The assembly employed by LLVM constitutes about 70 instructions [69], of which a subset of about 30 are supported in CORCC⁴.

Thread-to-Hardware Interface. The control threads of an application, which exist either in the form of soft finite state machines or as microcontrollers, can be viewed as clients that issue memory requests to the underlying memory subsystem comprising the embedded CoRAMs, the network-on-chip, and the edge memory interfaces (as illustrated earlier in Figure 19). CORCC assumes that in a soft implementation of control threads, a well-defined request-response interface exists between the underlying subsystem and the control threads implemented in the fabric. The details of such interfaces are described further in Chapter 6.

Co-handle Pass. The first stage of CORCC performs a sweep through the LLVM-generated IR and identifies call instructions that match the function signatures of static control actions such as co-handle and channel accessors (see Table 3) that are used to establish bindings to various

CoRAM-related objects. In LLVM, any function with a return value is assigned to a register identifier with a unique integer. Within CORCC, the static pass creates an internal map between an identified co-handle and its corresponding destination register. When processing a co-handle, the function arguments are checked to be constant and valid values. During this pass, any dynamic control actions are annotated and linked against the detected co-handles. The link step performs a backtracing through registers in the IR to identify the specific co-handle associated with a dynamic control action.

Thread Synthesis. Once all co-handles have been identified, CORCC performs a synthesis step that translates the LLVM instructions and the Cor-C dynamic control actions into synthesizable

⁴Use of unsupported instructions results in compile-time errors in CORCC.

Verilog. The basic approach taken by CORCC is to perform a direct mapping of basic blocks into single-cycle states in a finite state machine. To handle register state, CORCC instantiates a physical register for each assigned variable in a program. To implement logic, all of the instructions in a basic block are converted into combinational statements, where the inputs to the logic are read from registers in a single clock cycle (and in the same clock cycle, the output is written to the destination registers). The SSA form of LLVM guarantees that no registers are read and written at the same time within a single basic block.

To handle dynamic control actions, special states are introduced at locations in the LLVM IR where call instructions are detected. For example, when a cpi_write_ram function call is detected, the parent basic block will be split into two states, one containing the original block and the other handling the invocation of the control action. The predecessor basic block will always jump into the special state first, which handles the actual issue of the control action to the memory subsystem through a request-response interface; thereafter, the FSM jumps into the original basic block while returning the value of the control action. Figure 20 gives a complete example of compiling the simple example from Chapter 3 into synthesizable hardware.

Microprocessor Performance Modeling. To explore the design space for microcontroller-based control threads in our evaluation in Chapter 8, CORCC includes an additional feature that approximates the performance characteristics of a simple in-order microprocessor core. The core is modeled with a constant CPI value (cycles per LLVM instruction) and is assumed to have specialized logic that interfaces directly to the underlying memory subsystem described in Chapter 6. The programmed CPI value sets the rate at which control threads advance through the LLVM basic blocks in order to mimic the performance characteristics of an idealized microprocessor. Chapter 8 will later present simulation-driven results that compare direct synthesis by CORCC to soft and hard microprocessor cores.

CORCC Limitations. The CORCC compiler employs a relatively simple approach to high-level synthesis, which completely expands the basic blocks of an application into synthesizable hardware.

The simple approach taken here can have a detrimental effect on performance and area, especially if LLVM produces large basic blocks or allocates a large number of registers. A potential way to mitigate large critical paths within a basic block is to split basic blocks where necessary, which can either be supported automatically or guided by the user. More advanced high-level synthesis techniques can also be applied—e.g., constraining and scheduling the usage of resources. As will be shown later in Chapter 8, without any optimizations, the FSMs generated by the CORCC compiler consume relatively modest area while operating at nominal FPGA clock frequencies.

Cor-C vs. Parallel Languages. Our selection of the C language is not a fundamental requirement of the CoRAM paradigm. An area that merits further research is the use of functional or parallel languages to express higher levels of parallelism within control threads. A particular consequence of using a sequential language like C is the serialization of requests during dynamic execution.

Consider the for loop below, which generates a stream of requests to the memory subsystem:

    for (int i = 0; i < 8192; i += BLOCK_BYTES) {
        tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag);
    }

In the example above, CORCC would not allow the multiple control actions to execute in parallel due to serialization on the coalesced tag variable. In such cases, a parallel construct such as forall could explicitly declare that the loop body operations are independent.
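As a sketch of the alternative (using the Cor-C primitives from this chapter; whether CORCC accepts this exact formulation is our assumption, not a documented feature), giving each request its own tag removes the dependence that forced serialization:

    /* Each iteration gets an independent tag, so no coalescing dependence
       exists between requests; the loop bodies become candidates for a
       forall. CPI_INVALID_TAG requests a fresh tag for each action. */
    cpi_tag tags[8192 / BLOCK_BYTES];
    for (int i = 0; i < 8192 / BLOCK_BYTES; i++)
        tags[i] = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, CPI_INVALID_TAG);
    for (int i = 0; i < 8192 / BLOCK_BYTES; i++)
        cpi_wait(ramA, tags[i]);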

4.5 Summary

This chapter presented the Cor-C architecture specification and compiler. Cor-C is a devised instance of the CoRAM concept, and provides an example of how the CoRAM concept is applied in a real-world environment. The CoRAM Control Compiler (CORCC) is a proof-of-concept that implements the Cor-C specification and is evaluated further in Chapter 8.

Chapter 5

Cor-C Examples

Programming is usually taught by examples.

Niklaus Emil Wirth, author of Pascal

Three applications are presented in this chapter to provide a concrete demonstration of the Cor-C language. The first example, Matrix-Matrix Multiplication, illustrates how blocking algorithms can be expressed using a centralized control thread in Cor-C. The next example, Black-Scholes, demonstrates the description of wide, concurrent memory streams using the Cor-C language. The last example, Sparse Matrix-Vector Multiplication, demonstrates simultaneous thread execution and the support for irregular, indirect references in memory. The control thread programs presented in this chapter are excerpts from applications compiled using the CoRAM Control Compiler (CORCC) from Chapter 4. Chapter 8 will quantify the performance and efficiency of these applications and compare them against manual approaches for the FPGA.

5.1 Matrix-Matrix Multiplication

Matrix-matrix multiplication (MMM) is of critical importance for a broad range of scientific and engineering applications and is often a starting point for demonstrating new architectures. In this section, we demonstrate how the Cor-C architecture language from Chapter 4 can be used to succinctly express the control and memory access requirements of an FPGA-based implementation of MMM.

5.1.1 Background

The standard matrix-matrix multiplication procedure is defined as:

\[ C_{i,j} = \sum_{k=0}^{N-1} A_{i,k} \times B_{k,j}, \quad 0 \le i < M, \; 0 \le j < R \]

where A, B, and C are matrices of dimensions M × N, N × R, and M × R, respectively. In typical usage, the matrices of MMM are encoded in row-major format—i.e., the data words of each row are consecutively ordered in main memory beginning from the first row of the matrix to the last. Assuming row-major encoding, the standard C code that implements MMM is:

    void mmm(Data A[M][N], Data B[N][R], Data C[M][R]) {
        int i, j, k;
        for (i = 0; i < M; i++) {
            for (j = 0; j < R; j++) {
                for (k = 0; k < N; k++)
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
            }
        }
    }

The basic challenge in optimizing MMM is to avoid excessive demands on off-chip memory bandwidth when computing very large matrices that do not fit in aggregate on-chip memory. The MMM kernel exhibits high data re-use, with O(N^3) computations carried out over O(N^2) data accesses. MMM is also easily parallelizable given that all the dot product calculations are independent. The basic approach to reducing bandwidth in MMM is to exploit data re-use through blocking. Blocking works by sub-dividing the large matrix calculation into smaller MMM multiplications that have a working set size that can fit within a given device's on-chip memory constraint. The following C code gives an example of blocking in square matrix-matrix multiplication, where N is the square dimension and NB is the blocking factor.

    void mmm_kernel(Data* A, Data* B, Data* C, int N, int NB) {
        int i, j, k;
        for (j = 0; j < NB; j++)
            for (i = 0; i < NB; i++)
                for (k = 0; k < NB; k++)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];
    }

    void blocked_mmm(Data* A, Data* B, Data* C, int N, int NB) {
        int j, i, k;
        for (j = 0; j < N; j += NB)
            for (i = 0; i < N; i += NB)
                for (k = 0; k < N; k += NB)
                    mmm_kernel(&(A[i*N+k]), &(B[k*N+j]), &(C[i*N+j]), N, NB);
    }

Blocking improves the arithmetic intensity of MMM by increasing the average number of floating-point operations performed for each external memory byte transferred. As shown in the code above, the blocked MMM kernel works by computing smaller NB × NB matrices within C. Assuming that NB is sized to fit within aggregate on-chip memories, the bandwidth required to perform the computation is:

\[ \text{read bytes} = 3 \times D \times NB^2 \quad (5.1) \]
\[ \text{write bytes} = D \times NB^2 \quad (5.2) \]
\[ \text{flops} = 2 \times NB^3 \quad (5.3) \]
\[ \text{bytes/flop} = 2 \times D / NB \quad (5.4) \]

where D is the size in bytes of the data type used in the matrix multiplication. To maintain a balanced computer system [66], the total memory bandwidth B must scale with the computational throughput C. That is:

\[ \text{GFLOP/s} = C \quad (5.5) \]
\[ \text{GB/s} = B \quad (5.6) \]

Figure 21: Matrix-Matrix Multiplication Processing Element. (Bottom of figure: ring network packet format with OP, PE_ID, and DATA fields.)

\[ B = C \times 2 \times D / NB \quad (5.7) \]

In general, a baseline MMM implementation that is sped up (e.g., through parallelization) by a factor of p will increase the required memory bandwidth to pB. Increasing the blocking factor NB can similarly reduce the required off-chip memory bandwidth. For a given blocking factor NB, 3 × D × NB² bytes of on-chip memory are needed.
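To put Equations 5.4 and 5.7 in concrete terms (the numbers below are illustrative choices of ours): with single-precision data (D = 4 bytes) and a blocking factor NB = 512, a kernel sustaining C = 100 GFLOP/s demands only

\[ B = C \times \frac{2D}{NB} = 100 \times \frac{2 \times 4}{512} \approx 1.6~\text{GB/s} \]

of off-chip bandwidth, while the corresponding on-chip buffering requirement is 3 × D × NB² = 3 MB.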

5.1.2 Related Work

A significant body of work has examined blocking and parallelization of the MMM kernel for a variety of microarchitectures, ranging from multiprocessors [22, 56] to GPGPUs [82, 78] to FPGAs [39, 31, 65, 60]. The Intel Math Kernel Library, for example, provides parallelized BLAS routines optimized for microarchitectural features in Intel-based multicores [56]. The CUBLAS library from Nvidia supplies multithreaded GPGPU-optimized MMM routines [82]. A variety of floating- and fixed-point FPGA accelerators have also been demonstrated in the literature [39, 31, 65, 60].


Figure 22: Work Distribution in Matrix-Matrix Multiplication. (Each RAM {A,B,C} is only locally accessible by its owning PE; the C result is divided evenly into rows amongst the PEs, with one data dependence per row of C.)

5.1.3 Parallelization on the FPGA

A basic approach to parallelizing MMM on the FPGA is to create p identical processing elements (PE) that each perform the dot-product accumulation for a single row of matrix C. Each PE contains either a single- or double-precision accumulator that performs up to one multiply-addition per FPGA clock cycle. The parallelization strategy simplifies dependences by allowing each PE to compute on independent dot product accumulations. Figure 21 illustrates a parameterized hardware kernel developed for single-precision blocked MMM. The design assumes that the large input matrices A, B, and the output matrix C are stored in external memory in row-major order. In each iteration: (1) different sub-matrices subA, subB, and subC are read in from external memory, and (2) subA and subB are multiplied to produce intermediate sums accumulated into sub-matrix subC. The sub-matrices are sized to utilize available SRAM storage on the FPGA.

Figure 22 shows how the work is distributed evenly between three PEs that compute a 3x3 subC matrix from 3x4 and 4x3 subA and subB inputs. The row slices of subB and subC are divided evenly among the p PEs and held in per-PE local buffers. The column slices of subA are also divided evenly and stored similarly. Figure 23 gives an example of how each phase of computation works. A complete iteration repeats the following steps p times: (1) each PE performs dot-products of its local slices of subA and subB to calculate intermediate sums to be accumulated into subC, and (2) each PE passes its local column slice of subA to its right neighbor cyclically. Note that as an optimization, step 2 is overlapped with step 1 in the background, as illustrated in the neighbor exchanges shown in Figure 23.
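A small software model of this schedule (our own illustration; P = 3 matches the example above, and the mod-P index is our reading of the cyclic right-neighbor exchange):

    #include <stdio.h>
    #define P 3   /* number of PEs, as in the 3-PE example */

    /* PE i starts with subA slice i; after each step it receives the
       slice from its left neighbor (having passed its own to the right),
       so at step t it accumulates with slice (i - t) mod P. */
    int main(void) {
        for (int step = 0; step < P; step++)
            for (int pe = 0; pe < P; pe++)
                printf("step %d: PE %d accumulates with subA slice %d\n",
                       step, pe, ((pe - step) % P + P) % P);
        return 0;
    }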

Figure 23: Computation Phase of Matrix-Matrix Multiplication. (RX = read on clock cycle X; AX = accumulate on cycle X.)

Memory Accesses and Double-buffering. Prior to the computation phase, subA, subB, and subC must be read in sequentially and buffered into the local PEs' SRAM buffers. The subC matrix is revisited and accumulated across multiple iterations. Once the computation phase finishes, the updated subC must be written back to external memory, while the next set of values for subA and subB are streamed in. An important optimization for the FPGA-based implementation of MMM is to simultaneously overlap computation and memory transfers through double-buffering. Double-buffering effectively splits the on-chip memory into halves, where one half of the storage is used for reading in data for the next iteration, while the other half is being used for computation in the current iteration. Double-buffering reduces the blocking factor (to NB/2) because only one half of the available SRAM storage is used for computation at any given moment. Figure 24 shows a timeline of double-buffered execution, with each colored box representing a particular iteration of the computation. Note how in steady-state, up to three separate iterations are simultaneously overlapped.

Single PE resources     ~1.2 KLUTs (area can be reduced using DSP48Es)
Clock frequency         300 MHz
Peak PE throughput      600 MFLOP/s

Table 6: Single-Precision PE Characteristics.

5.1.4 Manual Implementation

Our implementation of MMM for the FPGA was developed in two phases: (1) a stand-alone kernel that targets conventional FPGAs, and (2) a revised implementation that employs the Cor-C architecture language. Figure 21 shows a single compute engine for a conventional FPGA. The PE internally maintains a single- or double-precision accumulator surrounded by local buffers. External to the PE, a custom ring network provides connectivity between one or more PEs and an external DMA controller responsible for issuing memory accesses to the native DRAM interface. The ring network employs a custom packet format as shown in Figure 21 (bottom), which allows the DMA controller to issue per-PE read and write commands to any of the local buffers. The distributed control logic of each PE coordinates the shared SRAM reads and writes between the ring network and the functional units.

The manual approach was developed in about 2,000 lines of Bluespec System Verilog [17] over a period of 12 man-weeks. Xilinx Coregen 13.1 was used to generate the optimized floating point cores used in the accumulator [6]. Table 6 shows synthesis characteristics of a single optimized PE, which was fully placed-and-routed at 300MHz for a Virtex-6 LX760 FPGA [23]. The area consumed when the PE is configured in single-precision floating point mode is about 1.2 KLUTs, although the design does include parameterized options to utilize DSP48E slices if available on the FPGA [115].

Figure 24: Matrix-Matrix Multiplication with Double Buffering. (Timeline: reads of A/B/C for iterations 0-2 overlap with Compute0/Compute1 and the writeback of C0.)

Figure 25: Matrix-Matrix Multiplication Control Thread (Non-double-buffered). The pseudo-code shown in the figure is:

    void ctrl_thread() {
      for (j = 0; j < N; j += NB)
        for (i = 0; i < N; i += NB)
          for (k = 0; k < N; k += NB) {
            cpi_channel_read(…);
            for (m = 0; m < NB; m++) {
              cpi_nb_write_ram(ramsA, m*NB,
                               A + i*N+k + m*N, NB*dsz);
              …
            }
            cpi_channel_write(…);
          }
    }

5.1.5 Cor-C Implementation

In this section, we describe how the Cor-C language was used to simplify and express the control and memory access requirements of our FPGA-based implementation of MMM. In the Cor-C version of MMM, the local buffers of each PE are replaced with black-box embedded CoRAMs as was described in Chapter 4. The Cor-C version replaces the entire data distribution network shown in Figure 21 with a centralized control thread as shown in Figure 25. As can be seen, the custom ring network, the DMA engine, as well as the native interface to DRAM are eliminated in the new design. The Cor-C version introduces a single channel FIFO that enables all the PEs to communicate with the control thread. To populate the CoRAM buffers of each PE, the centralized control thread blocks on the channel FIFO until a token is received from the core logic. Upon receiving the token, the control thread performs memory accesses to all of the necessary per-row and per-column slices of subA, subB, and subC.

Centralized Control Thread. The pseudo-code for the centralized control thread is shown in Figure 25. For brevity and simplicity, Figure 25 only shows the non-double-buffered implementation and omits the reading and writing of the subC matrix. It is worth noting how the code in Figure 25 appears similar to the C reference code for blocked MMM, with the exception that the inner-most loop now consists of memory control actions rather than computation. In the inner-most loop of Figure 25, the first action is a cpi_channel_read, which waits on a single token that indicates when all of the PEs are ready to load the next round of data.

In the code, the ramsA co-handle represents an aggregation of all the labeled 'a' embedded CoRAMs belonging to the PEs. The 'a' CoRAMs are combined as a single wide memory such that sequential data arriving from memory will be scattered and written across multiple CoRAMs in a word-by-word interleaved fashion. This is due to the fact that data residing in external memory is encoded in row-major format, while the individual 'a' CoRAMs expect the sequential data from memory in column-major format. The co-handle ramsB expects data in a row-major format and is written to as a linear concatenation of all of the local addresses of the 'b' CoRAMs. Within the body of the inner loop, the control thread executes a series of cpi_nb_write_ram control actions to populate the embedded CoRAMs with the requisite data. Upon completion, the control thread informs the user logic when the data is ready to be accessed by writing to the bidirectional channel FIFO using cpi_channel_write. The control thread terminates after iterating over all the blocks of matrix C.
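The word-by-word interleaving can be made concrete with a small self-contained sketch (the mod/div mapping below is our reading of the scatter described above, not CORCC-generated code):

    #include <stdio.h>
    #define N_PE 4   /* illustrative number of 'a' CoRAMs in the co-handle */

    /* Sequential word w of a memory burst lands in CoRAM (w mod N_PE) at
       local address (w / N_PE), so a row-major stream is transposed into
       the column-major layout each PE expects. */
    int main(void) {
        for (int w = 0; w < 8; w++)
            printf("memory word %d -> 'a' CoRAM %d, local address %d\n",
                   w, w % N_PE, w / N_PE);
        return 0;
    }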

Double-buffered Cor-C. A complete example of the double-buffered version of the control thread for MMM is shown in Listing 5.1. A single boolean variable determines which phase of the buffering the control thread is operating in. Within the innermost body of the loop, the same cpi_nb_write_ram control actions appear as before—however, the local RAM addresses are switched between lower and upper halves in each phase. The double-buffered version also splits the C CoRAM into two separate CoRAMs, C0 and C1, in order to allow simultaneous reads and writes to the logical C buffer in a single phase. In early development phases, we found that performing the reads and writes to a single logical memory for C could cause increased wait times in memory, especially in the soft logic implementations of CoRAM (see Chapter 8).

Listing 5.1: MMM control thread program.

    #define N 2*NumPEs
    #define N_PE NumPEs
    #define DTYPE sizeof(float)

    int gemm_thread() {
      int NB = N_PE, ram_depth = N_PE * 2;
      cpi_addr dataA = 0, dataB = B_OFF, dataC = C_OFF;
      cpi_int64 token = 0;
      cpi_tag tagA = CPI_INVALID_TAG, tagB = CPI_INVALID_TAG,
              tagC0 = CPI_INVALID_TAG, tagC1 = CPI_INVALID_TAG;

      cpi_register_thread("gemm_thread", N_THREADS);
      cpi_hand cfifo = cpi_get_channel(cpi_fifo, 0);
      cpi_hand ramsA = cpi_get_rams(N_PE, false, 0, 0);
      cpi_hand ramsB = cpi_get_rams(N_PE, true, 0, N_PE);
      cpi_hand ramsC0 = cpi_get_rams(N_PE, false, 0, 2*N_PE);
      cpi_hand ramsC1 = cpi_get_rams(N_PE, false, 0, 3*N_PE);

      bool which = false; // for double buffer
      int prev_i = 0, prev_j = 0;

      for (int k = 0; k < N; k += NB) {
        for (int j = 0; j < N; j += NB) {
          for (int i = 0; i < N; i += NB) {
            int offset = which ? ram_depth/2 : 0;
            int c_offset = !which ? ram_depth/2 : 0;
            for (int r = 0; r < NB; r++)
            {
              cpi_poll(tagA, cpi_nb_write_ram(ramsA, r*ram_depth+offset,
                       dataA+DTYPE*(i*N+k+r*N), NB*DTYPE, tagA));
              cpi_poll(tagB, cpi_nb_write_ram(ramsB, r+offset,
                       dataB+DTYPE*(k*N+j+r*N), NB*DTYPE, tagB));

              if (!which) {
                cpi_poll(tagC0, cpi_nb_write_ram(ramsC0, r*ram_depth+offset,
                         dataC+DTYPE*(i*N+j+r*N), NB*DTYPE, tagC0));
                cpi_poll(tagC1, cpi_nb_read_ram(ramsC1, r*ram_depth+c_offset,
                         dataC+DTYPE*(prev_i*N+prev_j+r*N), NB*DTYPE, tagC1));
              }
              else {
                cpi_poll(tagC1, cpi_nb_write_ram(ramsC1, r*ram_depth+offset,
                         dataC+DTYPE*(i*N+j+r*N), NB*DTYPE, tagC1));
                cpi_poll(tagC0, cpi_nb_read_ram(ramsC0, r*ram_depth+c_offset,
                         dataC+DTYPE*(prev_i*N+prev_j+r*N), NB*DTYPE, tagC0));
              }
            }
            cpi_wait(ramsA, tagA);
            cpi_wait(ramsB, tagB);
            cpi_wait(ramsC0, tagC0);
            cpi_wait(ramsC1, tagC1);
            cpi_write_channel(cfifo, token);
            token = cpi_read_channel(cfifo);
            prev_i = i;
            prev_j = j;
            which = !which;
          }
        }
      }
    }

Distributed MMM. The Cor-C architecture specification places a logical limit on the number of CoRAMs that can be combined into a single co-handle (up to 64, see Chapter 4). For MMM, this limits the number of PEs that can be managed by a single control thread. To scale beyond the limit of 64, multiple PEs can be grouped into cores that are replicated to operate on disjoint sub-blocks of the large matrices. This is a parallelization strategy typically employed by multicores and GPGPUs. Assuming that double-buffering completely overlaps the memory transfer time with the computation, the expected performance of the distributed MMM kernel is:

\[ \text{GFLOP/s} = 2 \times N_{pe} \times ghz_{fpga} \]
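For example, at the 300 MHz clock frequency reported in Table 6, a hypothetical configuration of 128 PEs would peak at

\[ 2 \times 128 \times 0.3~\text{GHz} = 76.8~\text{GFLOP/s}. \]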

5.1.6 Discussion

Our example of MMM highlights the simplicity and convenience of using a high-level language such as C to succinctly express the memory access requirements of a highly distributed FPGA-based kernel. The notion of CoRAM allowed us to completely replace portions of the original MMM datapath, including the custom network and DMA engine, both of which contributed considerably to the complexity of the design. CoRAM also allowed us to easily express the re-assignment of FPGA kernels to different regions of the external memory over the course of a large computation. This feature of the CoRAM architecture could potentially be used to simplify the task of building out-of-core FPGA-based applications that support inputs much larger than the total on-chip memory capacity. Thus far, our discussion of MMM has not considered the performance of our Cor-C implementation and how it would compare against the manual approach. Chapter 6 will later discuss how the Cor-C architecture specification is mapped into physical implementations; Chapter 8 will perform a detailed quantitative evaluation that illustrates how our implementation of MMM in Cor-C can achieve comparable if not better performance than the manual approach.

5.2 Black-Scholes

Our next example focuses on stream-based computation, which is a common pattern found in many FPGA-based applications. In this section, we introduce a fundamental concept in the CoRAM paradigm called the memory personality. A memory personality is a re-usable library component built out of native RTL and the primitives available in the Cor-C architecture language. Memory personalities are designed to provide an extra layer of abstraction above Cor-C to facilitate interfaces that are better suited to a particular application's needs (in our case, streaming). In the subsections below, we give background on Black-Scholes and describe a detailed example of the Stream FIFO memory personality used to support the application.

5.2.1 Background

The Black-Scholes formula is a popular model used in the pricing of European-style options [103]. An option is an agreement between a buyer and a seller where the buyer is granted the right to exercise the option at a certain time in the future. A call option allows the buyer to purchase an underlying asset at a strike price at some moment in the future; a put option grants the right to sell the underlying asset. A profit is made when there is a difference between the strike price and the actual price of the underlying asset, minus the price of the option. The Black-Scholes model shown below gives a partial differential equation for the evolution of an option price under a given set of assumptions [103].

\[ \frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0 \]

Where:

• V = the price of the option as a function of time and stock price

• r = the risk-free interest rate

• t = time

• σ = volatility of the stock

• S = the price of the underlying asset

For European options, which can only be exercised on the expiration date, a closed-form solution exists for the PDE above (as reprinted from [103]):

\[ V_{call} = S \times CND(d_1) - X \times e^{-rT} \times CND(d_2) \]

\[ V_{put} = X \times e^{-rT} \times CND(-d_2) - S \times CND(-d_1) \]

\[ d_1 = \frac{\log(S/X) + (r + v^2/2)\,T}{v\sqrt{T}} \]

\[ d_2 = \frac{\log(S/X) + (r - v^2/2)\,T}{v\sqrt{T}} \]

\[ CND(-d) = 1 - CND(d) \]

Where:

• V_call = price for a call option

• V_put = price for a put option

• CND(d) = cumulative normal distribution function

• S = current price of the underlying asset

• X = strike price of the option

• T = time until the option expires

• r = risk-free interest rate

5.2.2 Parallelization on the FPGA

The Black-Scholes formula is typically used to compute many independent solutions, making it an easy task to parallelize on the FPGA. The Black-Scholes formula employs a rich mixture of arithmetic floating-point operators but exhibits a very simple sequential memory access pattern. Assuming that the input vectors are laid out sequentially in memory, a fully pipelined processing element (PE) can be built that streams the data in, computes V_call and V_put, and writes them sequentially out to memory. From a memory bandwidth perspective, each computed option reads 5 data values and writes 2. Assuming single-precision floating point, this amounts to an arithmetic intensity of 28 bytes per option.

PE resources (single-precision)   ~30 KLUTs, 57 DSPs
Clock frequency                   300 MHz
Peak PE throughput                600 Moptions/s

Table 7: Black-Scholes Processing Characteristics.

Figure 26: Black-Scholes Processing Element. (Inputs: stock price S, strike price X, expiration time T, risk-free rate R, volatility V; outputs: put value P and call value C.)
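Combining the 28 bytes-per-option arithmetic intensity with the peak rate in Table 7 bounds the external bandwidth a single PE can demand:

\[ 28~\text{bytes/option} \times 600~\text{Moptions/s} = 16.8~\text{GB/s}. \]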

Figure 26 illustrates a fully pipelined, single-precision Black-Scholes PE implemented in this thesis. The various floating point operators needed in the Black-Scholes equation are built from a combination of optimized IP cores generated from the FloPoCo framework [34] and Xilinx Coregen [6]. The application's performance is highly scalable; one could increase performance by instantiating multiple PEs that consume and produce independent input and output data streams. Performance continues to scale until either the reconfigurable logic capacity is exhausted or the available external memory bandwidth is saturated. The characteristics of the Black-Scholes PE are shown in Table 7.

5.2.3 Memory Streaming

Streaming from and to external memory is a typical memory access pattern in many FPGA-based applications. In a conventional FPGA, a stream is supported by inserting FIFO buffers between the native memory interfaces and the core logic of the application. The buffers are often implemented using the available embedded SRAMs in the FPGA. The surrounding FIFO control logic is tasked with issuing memory addresses to the native memory interfaces and filling or draining the buffers as data arrives or leaves. Despite the simplicity of streaming, there is no standard convention or agreement on how FPGA-based applications access streams. Application writers are frequently required to either develop their own stream modules or tie themselves to vendor-specific offerings (e.g., ACP [72], Convey [27], etc.).

Figure 27: Stream FIFO Memory Personality. (Steps: (1) initialize the FIFO data pointer to the source data; (2) check FIFO space via cpi_read_channel; (3) transfer FIFO data via cpi_nb_write_ram and update the FIFO data pointer; (4) update the head pointer via cpi_write_channel.)

The Stream FIFO Memory Personality. The Cor-C architecture language is intended to raise the level of abstraction by replacing low-level control logic with a portable, software-based specification. In our example, the Black-Scholes processing element can be viewed as a stream client, which performs sequential reads and writes to specific starting locations in memory. A stream client simply expects to see a standard FIFO interface without any knowledge of the underlying memory subsystem.

To support stream clients using the Cor-C language, we develop the notion of a Stream FIFO memory personality. A memory personality is a library component re-usable across many devices and platforms. From the perspective of the application's core logic, a personality appears to be a black-box module in native RTL with a particular client interface such as a FIFO. The black-box module contains one or more embedded CoRAMs and an associated control thread. Figure 27 illustrates the Stream FIFO memory personality module, which exports to the core logic in native RTL a standard FIFO interface: data, ready, pop. The stream FIFO can be replicated in a single application to support multiple clients if needed.

Listing 5.2: Control program for Memory-to-Logic Stream FIFO.

    1  void write_fifo(cpi_hand rams, cpi_hand creg, cpi_addr src,
    2                  int bytes, int word_size, int depth) {
    3
    4    int words_left = bytes / word_size, src_word = 0, tail = 0, head = 0;
    5
    6    while(words_left > 0)
    7    {
    8      tail = cpi_read_channel(creg);
    9      int free_words = head >= tail ? depth-1-(head-tail) : (tail-head-1);
    10     int bsize_words = MIN(free_words, words_left);
    11
    12     if(bsize_words != 0)
    13     {
    14       cpi_write_ram(rams, head, src + src_word * word_size,
    15                     bsize_words * word_size);
    16       src_word += bsize_words;
    17       words_left -= bsize_words;
    18       head = (head + bsize_words) & (depth - 1);
    19       cpi_channel_write(creg, head);
    20     }
    21   }
    22 }

Listing 5.3: Control program for Logic-to-Memory Stream FIFO.

    void read_fifo(cpi_hand rams, cpi_hand creg, cpi_addr dst,
                   int bytes, int word_size, int depth) {

      int words_left = bytes / word_size, dst_word = 0, tail = 0, head = 0;

      while(words_left > 0)
      {
        head = cpi_read_channel(creg);
        int used_words = (head >= tail) ? head - tail : depth - (tail - head);
        int bsize_words = MIN(words_left, used_words);

        if(head != tail)
        {
          cpi_nb_read_ram(rams, tail, dst + dst_word * word_size,
                          bsize_words * word_size);
          dst_word += bsize_words;
          words_left -= bsize_words;
          tail = (bsize_words + tail) & (depth - 1);
          cpi_write_channel(creg, tail);
        }
      }

      return;
    }

Anatomy of the Stream FIFO. Figure 27 illustrates the anatomy of the Stream FIFO module, which utilizes a single embedded CoRAM as a circular buffer and as a logical portal to external memory. The Stream FIFO shown performs transfers from memory to reconfigurable logic and is implemented using a combination of RTL and Cor-C. Within the boundary shown in Figure 27, the Stream FIFO contains head/tail pointers and comparator logic used to determine when the FIFO is empty or full. The FIFO consumer (i.e., the Black-Scholes PE) only sees a simple read data interface with a valid bit.

Unlike a typical FIFO, the writer of the FIFO is not an entity hosted in reconfigurable logic but is managed by a single control thread. Figure 27 (right) illustrates the software control thread used to fulfill the FIFO producer role (the corresponding pseudo-code is shown in Listing 5.2). The event highlighted in Step 1 of Figure 27 first initializes a source pointer to the location in memory where the starting Black-Scholes data resides. In Step 2, the control thread samples the head and tail pointers to compute how much available space is left within the FIFO (L8-L10 in Listing 5.2). If sufficient space exists, the event in Step 3 performs a multi-word byte transfer from the edge memory interface into the CoRAM using the cpi_write_ram control action (L14-L15 in Listing 5.2). The event in Step 4 completes the FIFO production by having the control thread update the head pointer using the cpi_channel_write control action to inform the reconfigurable logic within the stream FIFO module when new data has arrived (L19 in Listing 5.2). Finally, L16-L18 show updates to the internal state maintained by the control thread. For completeness, Listing 5.3 shows the corresponding logic-to-memory Stream FIFO, which performs memory transfers from logic to external memory.
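The pointer arithmetic in Step 2 is easy to check in isolation; the following self-contained snippet (ours) exercises the free-space computation at L9 of Listing 5.2, including the one-slot sacrifice that distinguishes full from empty:

    #include <stdio.h>

    /* free_words mirrors L9 of Listing 5.2. */
    static int free_words(int head, int tail, int depth) {
        return head >= tail ? depth - 1 - (head - tail) : tail - head - 1;
    }

    int main(void) {
        printf("%d\n", free_words(0, 0, 8)); /* empty FIFO: 7 words free */
        printf("%d\n", free_words(7, 0, 8)); /* full FIFO:  0 words free */
        printf("%d\n", free_words(2, 5, 8)); /* wrapped:    2 words free */
        return 0;
    }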

Thread Invocation. Appendix C.1 shows the actual RTL for the Stream FIFO library code implemented in Bluespec System Verilog. In the modules, the embedded CoRAMs and channels are instantiated to work in tandem with the associated control threads; the thread names and object ids must be matched by convention with the names and values used in the associated control threads. The parameterized RTL allows the application writer to form Stream FIFOs of arbitrary dimensions. In the case of the reader thread, the Stream FIFO is configured with a 20B width (corresponding to 5 floating point inputs) and 1024 entries (the FIFO depth), while the writer thread is configured with an 8B width (2 floating point outputs) and 1024 entries. The code below illustrates how the reader and writer Stream FIFOs are invoked by the control threads of the Black-Scholes kernel.

    void bscholes_write_thread() {
      cpi_register_thread("bscholes_write_thread", 1);
      cpi_hand creg = cpi_get_channel(cpi_reg, 0/*obj_id*/);
      cpi_hand rams = cpi_get_rams(5/*width*/, true, 0/*obj_id*/, 0/*sub_id*/);
      write_fifo(rams, creg, READ_VA,
                 NUM_OPTIONS*sizeof(float)*5, 5*sizeof(float), 1024);
    }

    void bscholes_read_thread() {
      cpi_register_thread("bscholes_read_thread", 1);
      cpi_hand creg = cpi_get_channel(cpi_reg, 0/*obj_id*/);
      cpi_hand rams = cpi_get_rams(2/*width*/, true, 0/*obj_id*/, 0/*sub_id*/);
      read_fifo(rams, creg, WRITE_VA,
                NUM_OPTIONS*sizeof(float)*2, 2*sizeof(float), 1024);
    }

As shown in the code example above, the Black-Scholes kernel consists of two thread descriptions, one used to stream data from external memory to the kernel and the other used to write results back to memory. The functions described in Listings 5.2 and 5.3 present the appearance of simple library functions invoked by the application. The application writer simply has to specify the starting virtual addresses of the application's read and write streams, respectively. In the case where multiple stream clients are desired, the cpi_register_thread static control action can be configured with an argument value greater than 1 to instantiate multiple concurrent threads.

5.2.4 Discussion

Our example of Black-Scholes in this section illustrates how the CoRAM paradigm virtualizes the memory subsystem through memory personalities. From our example, a few salient observations can be made. First, the memory personality concept is a hybrid hardware-software entity that targets an abstract intermediate representation presented by the Cor-C architecture specification. The personality presents a higher level abstraction above the Cor-C interface and is a portable, re-usable component that can be invoked by any application that requires its interface. Memory is accessed easily without requiring any underlying knowledge of the memory subsystem. Memory throughput is scaled by instantiating and replicating memory personalities as many times as needed. Personalities are also highly parameterizable and decoupled from the underlying characteristics of the FPGA memory system. For example, the width of the Stream FIFO can be configured arbitrarily without being coupled to the underlying widths of native memory interfaces on an FPGA.

Although the memory personalities present a simpler, higher level abstraction than a low-level native interface to memory, an open question up to this point is whether these abstractions can be effectively translated into good efficiency and performance, which are crucial requirements in highly tuned FPGA-based applications. We postpone a detailed analysis of performance for now, but show later in Chapter 8 that personalities are scalable and can be supported without major losses in efficiency or performance.

Figure 28: Example of Compressed Sparse Row Format in SpMV. (A sparse matrix A encoded as the values, cols, and rows arrays.)

5.3 Sparse Matrix-Vector Multiplication

The last case study we present in this chapter is the most sophisticated demonstration of CoRAM presented in this thesis. In this section, we show how multiple memory personalities are composed to support an irregular memory access pattern with indirect references. Sparse Matrix-Vector Multiplication (SpMV) is a ubiquitous kernel used in many scientific and engineering applications and continues to be actively researched on many different architectures.

5.3.1 Background

The SpMV kernel solves the statement y = Ax, where y and x are 1-dimensional dense vectors and A is a matrix populated with mostly zeros. In SpMV, the large A matrix is encoded in a compressed sparse format that only stores the non-zero values. A commonly used encoding is the Compressed Sparse Row (CSR) [2] format, shown as an example in Figure 28. The non-zero values of matrix A are stored in row order as a linear array vals in external memory. The column number of each entry in vals is stored in a corresponding entry of a separate column array called cols. The i'th entry of another array (rows) holds the index of the first entry in vals (and cols) belonging to row i of A. The reference C code for computing y = Ax is given as follows:

    void spmv_csr(int n_rows, int *cols, Data *vals, int *rows,
                  Data *x, Data *y) {
        for (int r = 0; r < n_rows; r++) {
            Data sum = 0;
            for (int i = rows[r]; i < rows[r+1]; i++)
                sum += vals[i] * x[cols[i]];
            y[r] = sum;
        }
    }

As shown, each dot product is accumulated by accessing the linear array vals using the row index and by indirect reference to the dense x vector through values stored in the cols vector.

5.3.2 Related Work

The optimization of the SpMV kernel continues to be an actively studied research problem. SpMV is particularly challenging due to the indirect references of the x vector, which can be unpredictable or irregular depending on the sparsity patterns of the input matrix. When parallelizing SpMV across multiple compute cores, load balancing is also another issue in performance tuning given that the dot product sizes being accumulated can be of arbitrary length up to the dimension of the matrix. SpMV has been studied and optimized extensively for general-purpose architectures [108, 54, 48, 105, 104, 70, 95, 100, 84]. In the domain of FPGAs, a significant body of work exists in developing hardware SpMV kernels [36, 124, 93, 123, 47]. Many prior works focus exclusively on the development of efficient floating-point accumulators, which form the heart of the SpMV computation.

The main difficulty in achieving effective throughput on FPGAs is the pipelining of single- or double-precision floating point accumulators. This particular problem stems from the fact that a single floating point add cannot be completed in a single FPGA cycle with a reasonable clock period. A variety of circuits have been proposed in the literature to address this. The simplest approach is to statically schedule inputs from disjoint rows at an interleaving interval that corresponds to the latency of the floating point accumulator. This approach was employed by deLorimier and DeHon [36] and shown to achieve about 66% of peak floating point throughput. More sophisticated approaches have been proposed based on dynamic scheduling, which maintains partially accumulated sums across multiple rows. The state-of-the-art for this approach was demonstrated by Nagar and Bakos, which requires only a single double-precision adder and can handle an arbitrary number of accumulation sets with no knowledge of the dot product sizes. A third approach that does not require scheduling is based on modifying the floating point adder itself to reduce the latency during the accumulation step [34].

Single PE resources     ~5 KLUTs
Clock frequency         300 MHz
Peak PE throughput      600 MFLOP/s

Table 8: Single-Precision Sparse Matrix-Vector Multiplication Kernel Characteristics.

Listing 5.4: Value Stream FIFO Thread

    void val_thread()
    {
      cpi_register_thread("val_thread", N_PE);

      DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);
      DECLARE_CFIFO_BIND(rams, 1, creg, cfifo, hfifo, bfifo, head, 0);

      cpi_addr val_ptr = mp_thread_read(ldst_ram, ldst_fifo,
                                        BASE_ADDR + VAL_OFFSET);

      int first, second;

      while(1) {
        cpi_int64 val = cpi_read_channel(cfifo);
        unpack_data(val, &first, &second);
        write_fifo_bind(rams, creg, hfifo, val_ptr + first*sizeof(int),
                        second*sizeof(int), 4, FIFO_DEPTH, &head);
      }
    }

5.3.3 Parallelization on the FPGA

The basic strategy for parallelizing SpMV on the FPGA is to distribute independent dot products across multiple processing elements, each with a single floating point accumulator. Figure 30 illustrates an FPGA design for SpMV, where multiple processing elements (PEs) operate concurrently on distinct rows of matrix A. The contents of the rows array are streamed in from external memory to a centralized work scheduler that assigns rows to different PEs. The role of the work scheduler is to track the amount of work pending for each PE and to dynamically load balance the resources to maintain high utilization. The centralized work scheduler is simple to implement in the FPGA due to the fine-grained communication possible within reconfigurable fabric.
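A software sketch of the scheduler's policy (entirely our own illustration; the row lengths are made up): each incoming row is assigned to the PE with the least pending work, measured in nonzeros.

    #include <stdio.h>
    #define N_PE 4

    int main(void) {
        int pending[N_PE] = { 0 };
        const int row_nnz[] = { 9, 2, 7, 3, 5, 1, 8, 4 };
        for (int r = 0; r < 8; r++) {
            int best = 0;                       /* least-loaded PE */
            for (int p = 1; p < N_PE; p++)
                if (pending[p] < pending[best]) best = p;
            pending[best] += row_nnz[r];
            printf("row %d (%d nnz) -> PE %d\n", r, row_nnz[r], best);
        }
        return 0;
    }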

For each assigned row, a PE employs two memory-to-logic Stream FIFO memory personalities from Section 5.2 to sequentially read in data blocks from specific offsets in the vals and cols arrays, respectively. A single logic-to-memory Stream FIFO writes the accumulated dot products out to memory. To configure the offsets for each of the two stream FIFOs, the PE logic must pass row assignment information (via channels) to the control threads belonging to the stream FIFOs for each new dot product that is being accumulated by the PE. The row information describes the logical offset in memory as well as the size of the dot product being accumulated. Listing 5.4 shows the code for the value thread that performs this operation.

74 Listing 5.5: Cache Thread

    void xcache_thread()
    {
      cpi_register_thread("xcache_thread", N_PE);
      DECLARE_LDST(ldst_fifo, ldst_ram, 1);

      cpi_hand fifo = cpi_get_channel(cpi_fifo, 0, 0);
      cpi_hand bfifo = cpi_get_channel(cpi_fifo, 0, 1);
      cpi_hand data_ram = cpi_get_rams(XCACHE_WIDE, true, 0, 0);
      cpi_hand tag_ram = cpi_get_rams(1, false, 0, XCACHE_WIDE);
      cpi_addr x_ptr = mp_thread_read(ldst_ram, ldst_fifo,
                                      BASE_ADDR + X_OFFSET);
      cpi_bind(bfifo, data_ram);

      int log_word_bytes = log2(XCACHE_WIDE * (WORD_WIDTH/8));

      while(1) {
        cpi_int64 message = cpi_read_channel(fifo);
        cpi_addr miss_addr = message & 0xffffffffu & ~(XCACHE_BLK_BYTES-1);
        cpi_int64 data_index = miss_addr & (XCACHE_SIZE_BYTES-1)
                               & ~(XCACHE_BLK_BYTES-1);
        cpi_tag tag = CPI_INVALID_TAG;
        while(1) {
          tag = cpi_nb_write_ram(data_ram, data_index >> log_word_bytes,
                                 x_ptr + miss_addr, XCACHE_BLK_BYTES, tag);
          if(tag != CPI_INVALID_TAG) break;
        }
      }
    }

Figure 29: Cache Memory Personality. (Steps: (1) check for a cache miss via cpi_read_channel; (2) perform the cache fill via cpi_write_ram; (3) acknowledge via cpi_write_channel.)

To compute the dot products, a PE must make indirect references to vector x based on the index values received from the cols Stream FIFO. A simple approach to handling this case is to have an x thread that takes the column value from the cols FIFO and directly issues an access to memory. Unfortunately, this approach leads to sub-optimal performance because the access patterns are not necessarily sequential and could be highly irregular or random. Furthermore, x can be a very large data structure that cannot fit into aggregate on-chip memory.

Cache Memory Personality. An effective optimization is to exploit any reuse or locality of the elements of x across multiple rows through caching. To implement caching within each PE, Figure 29 illustrates a simple cache memory personality built using the Cor-C architecture language. Within the cache, embedded CoRAMs are composed to form data and tag arrays while conventional reconfigurable logic implements the bulk of the cache controller logic. A single control thread is used to implement cache fills to the CoRAM data arrays. When a miss is detected, the address is enqueued to the control thread through an asynchronous FIFO (step 1 in Figure 29). Upon a pending request, step 2 of the control thread transfers a cache block's worth of data to the data array using the cpi_nb_write_ram control action. In step 3, the control thread acknowledges the cache controller using the cpi_write_channel control action. The cache controller shown in Figure 29 is somewhat simplified compared to our actual implementation; the cache design is further extended to support multiple outstanding misses through the use of MSHRs [88]. Appendix ?? gives further details on this implementation.
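The fill-address arithmetic in Listing 5.5 reduces to standard direct-mapped indexing; the self-contained sketch below (with illustrative block and cache sizes of our choosing) reproduces it:

    #include <stdio.h>
    #include <stdint.h>

    #define XCACHE_BLK_BYTES  64     /* illustrative block size */
    #define XCACHE_SIZE_BYTES 4096   /* illustrative cache size */

    /* As in Listing 5.5: round the miss address down to a block
       boundary, then take its index within the direct-mapped array. */
    int main(void) {
        uint64_t miss = 0x12345;
        uint64_t blk  = miss & ~(uint64_t)(XCACHE_BLK_BYTES - 1);
        uint64_t idx  = blk & (XCACHE_SIZE_BYTES - 1);
        printf("miss 0x%llx -> fill block 0x%llx into data array at 0x%llx\n",
               (unsigned long long)miss, (unsigned long long)blk,
               (unsigned long long)idx);
        return 0;
    }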

Figure 30: Implementation of SpMV with CoRAM. (Per PE: sum += vals[i] * x[cols[i]]; rows, vals, cols, and y are sequential streams from edge memory, while x is accessed randomly through the cache.)

Summary. As shown in Figure 30, each individual PE of SpMV employs a total of four memory personalities (3 Stream FIFOs, 1 Cache) attached to a single accumulator. The single SpMV PE forms a highly optimized pipeline where multiple outstanding dot products are being streamed in, computed, and written out. Ideally, the SpMV kernels developed in this section should scale linearly with the number of PEs until either the reconfigurable logic resources are exhausted or off-chip memory bandwidth is fully utilized. Chapter 8 will later perform detailed quantitative evaluations of our SpMV implementation.

5.3.4 Discussion

The SpMV example demonstrates how different memory personalities built out of CoRAM can be composed to support different memory access patterns. Caches, in particular, were used to support the random access pattern of the x vector, whereas the stream FIFOs were re-used from Black-Scholes for the remaining sequential accesses. In our development efforts of SpMV, the instantiation of CoRAM-based memory personalities along with spatially-distributed PEs allowed us to quickly instantiate “virtual taps” to external memory wherever needed. This level of abstraction was convenient as it allowed us to concentrate our efforts on optimizing only the processing components of SpMV.

5.4 Summary

What should be apparent from the examples presented in this chapter is that control thread programs are relatively easy to develop compared to conventional RTL and are succinct in expressing the memory access pattern requirements of a given application. While the CoRAM architecture does not eliminate the effort needed to develop optimized stand-alone processing kernels, it does free application writers from having to explicitly manage memory and data distribution in a low-level RTL abstraction. Specifically, control threads did not require us to specify the sequencing details at the RTL level. Further, Cor-C did not limit our ability to support fine-grained interactions between the core logic and the control threads. Fundamentally, the high-level abstraction provided by CoRAM and Cor-C is what enables portability across multiple hardware implementations, as will be demonstrated in Chapter 8. Lastly, the memory personalities demonstrated in our examples (Stream FIFO, Cache) are by no means sufficient for all possible applications and only offered a flavor of what is possible with CoRAM. It is conceivable in the future that soft libraries consisting of many types of memory personalities could be built and shared across many applications.

Chapter 6

CoRAM Memory System

Software comes from heaven when you have good hardware.

Ken Olsen, Founder of Digital Equipment Corporation

A deliberate separation exists between what the CoRAM application writer perceives and the memory mechanisms that lie beneath the abstraction. Like any general-purpose memory subsystem, CoRAM can only be practical if the area, performance, and power overheads do not overwhelm the benefits gained through high-level abstraction and portability. A recurring question visited throughout this chapter is that of “hard versus soft”—that is, which components of the CoRAM architecture merit implementation in pre-fabricated silicon, and which components can be supported practically in soft logic.

Throughout the process of devising a microarchitecture for CoRAM, this chapter contributes: (1) identification and development of the core mechanisms needed in both hard and soft implementations to support CoRAM effectively, (2) a lower-bound cost analysis for data distribution in soft logic designs, and (3) the design and implementation of a distributed, cluster-style microarchitecture that can be adapted to either soft or hard implementations of CoRAM.

6.1 Devising a Memory System for CoRAM

Efficient data distribution forms the heart of any practical CoRAM memory microarchitecture. From the application writer's perspective, the delivery of data from memory into a specific CoRAM appears implicit and automatic. Beneath the abstraction, CoRAM requires a high-speed datapath that connects all of the edge memory interfaces to any embedded CoRAMs that are employed by the application.

Figure 31: Beneath the CoRAM Architecture. (From top to bottom: control thread programs and SRAMs at the architecture level; control blocks and routers in the fabric; bulk data distribution (network-on-chip); virtual-to-physical translation (TLB); memory interfaces (DRAM or cache).)

Figure 31 illustrates the physical archetype of an FPGA with CoRAM memory system support. From top to bottom, the FPGA comprises: (1) the embedded CoRAMs used to store application data, (2) a collection of control blocks used to translate high level control thread programs into low-level memory transactions and control signals, (3) a distribution mechanism for marshalling and steering data between endpoints, and (4) the edge memory components at the boundaries of the fabric, which comprise translation-lookaside buffers (TLBs), caches, and memory controllers.

Data Distribution. CoRAM requires a high performance data distribution network for transporting data between the edge memory interfaces and the embedded CoRAMs. Ideally, the network should be provisioned with sufficient bandwidth and latency to sustain the core logic. Further, the mechanisms should perform robustly across many applications' memory access patterns without requiring substantial tuning by the application writer.

The requirements for data distribution can be subdivided into: (1) bulk data transfers and (2) fine-grained steering and alignment to individual CoRAMs. Bulk data movements occur when high-level memory control actions are translated into memory requests that must be routed to remote memory interfaces. Once memory responds with a block of data, it must be subdivided into smaller-sized words that are aligned and steered to specific destination CoRAMs.

Edge Memory Interfaces. This thesis assumes that future FPGAs devised for computing will incorporate hard memory controllers and caches at the edges of the fabric (see Figure 31). Current and future memory technologies such as DDR3 can already operate at I/O speeds that vastly exceed the clock frequencies of conventional fabric [28]. Commodity DDR3-1600, for example, operates at 800MHz on the I/O bus and transfers data on both edges. Today's fastest FPGAs can barely operate practically in the 300MHz range, which severely limits the amount of bandwidth sustainable by fabric. This thesis assumes that future FPGAs will incorporate hardened memory interfaces to keep pace with expected memory technology trends.

The increasing gap between memory clock speeds and the FPGA's clock has the unfortunate side-effect of making fabric memory interfaces appear wider and wider. For example, an FPGA operating at 200MHz would “see” a standard DDR3-1600 interface as a 512b datapath. To compound this problem further, commodity DRAMs are typically optimized for bulk transfers (i.e., cache block sizes) and become inefficient when multiple column accesses are not made to contiguous addresses [58] (e.g., the 8n-prefetch architecture of DDR3). To bridge the interface gap between fabric and edge controllers, high-speed stream buffers or caches can be inserted between the fabric and the memory controllers as shown in Figure 31. A highly-banked cache, for example, would allow the backend ports to scale as needed with memory speeds while providing multiple, slower fabric-speed interfaces without reducing the overall throughput to the external interfaces. Commercial vendors such as Convey [27] have also observed this problem and addressed it by creating custom “scatter-gather” memory DIMMs that allow full bandwidth at narrow data widths.
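The 512b figure follows directly from the ratio of transfer rates:

\[ \frac{1600~\text{MT/s} \times 64~\text{bits}}{200~\text{MHz}} = 512~\text{bits}. \]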

Virtual Memory. The memory accesses that are issued by control threads are assumed to be virtual addresses in CoRAM. Supporting virtual memory within the FPGA requires translation look-aside buffers (TLBs) either at the boundaries of the fabric or near the control threads. It is assumed in this thesis that a nearby general-purpose processor (either on-chip with the FPGA or off-chip) is responsible for running the operating system that populates the requisite page tables within global virtual memory. TLB miss handling on the FPGA would be handled through either an on-die dedicated FSM with direct access to the memory controllers (e.g., x86-style hardware-managed TLBs) or through special-purpose programmable microcontrollers responsible for TLB replacements. Companies such as Convey [27] have already demonstrated the plausibility of virtual memory in commercially-available FPGA-based systems. Although virtual memory forms an important component in CoRAM, it will not be a major focus of this thesis.

6.2 CoRAM Memory System: Hard or Soft?

Today's FPGAs contain an abundance of reconfigurable logic, flip-flops, and programmable interconnect, any of which can be used to implement the mechanisms required for CoRAM. It is instructive to first provide a clear definition of what “soft logic fabric” means as distinguished from “hard logic”. Soft logic fabric generically refers to any logic function implemented as a series of programmable gates connected through a programmable routing fabric [26]. Hard logic, on the other hand, is typically a dedicated structure embedded within the FPGA that could otherwise also be implemented in soft logic fabric. From this definition, it is clear that modern FPGAs are already heterogeneous and contain a varied mixture of hard blocks such as flip-flops, multi-bit block RAMs, I/O transceivers, DSPs, and clock generators [115]. Rose et al. distinguish heterogeneity based on granularity [8]. Soft logic heterogeneity, such as the dedicated carry chains per logic block, is replicated homogeneously across the chip. This type of heterogeneity is distinguished from another style where nearby tiles are completely different from the logic blocks—e.g., BlockRAMs or DSPs in Xilinx FPGAs [115].

The decision whether to implement application-specific circuitry within the FPGA as hard or soft logic fabric is challenging, especially when economics, markets, and programmability are factored into the decision making. On one hand, although hard circuits provide dramatic benefits in improving the area, speed, and power efficiency of a given functionality [67], the consumed area becomes wasted if the circuitry is not utilized by a particular benchmark or application. The issue of applicable workloads presents an unusual challenge for CoRAM because it does not have an established market or user base. Further, given that today's FPGAs are not architected for computing purposes, it may be difficult to suggest architectural changes to the status quo, especially when the vast majority of the FPGAs in the market today are targeted towards non-compute applications such as high-speed communications or computer systems emulation.

It can be argued, nevertheless, that unlike application-specific circuits, CoRAM addresses a general-purpose need that spans all applications that utilize memory on the FPGA. The embedding of hard memory controllers in commercial FPGAs already suggests that dedicated memory systems will become the norm in future FPGAs, whether for computing purposes or otherwise. The central question of this chapter is thus: if general-purpose memory mechanisms are beneficial to a wide range of applications, what subset of the CoRAM mechanisms merits implementation in the FPGA itself?

To answer these questions, the remainder of this chapter will focus on implementing some of the required mechanisms of CoRAM in soft logic and determining what components and structures present the largest sources of overhead. As noted by Rose et al. [8], an alternative to hardening is to enhance the soft fabric itself in ways that can perhaps benefit all applications without over-specializing. In the remainder of this chapter, we will identify opportunities where soft fabric can be enhanced to support CoRAM efficiently.

6.3 Understanding the Cost of Soft CoRAM

To answer the soft versus hard question, we begin by first determining the cost of CoRAM implemented in soft logic fabric. This section will primarily focus our attention and analysis on data distribution, which constitutes the largest source of overhead. We begin with a simple cost analysis followed by a more detailed microarchitectural design that will also address the overheads of control and state management. Our analysis and optimizations in the sections below will primarily target the Virtex-6-style architecture from Xilinx.

Simple Shift Register Network. Figure 32 illustrates a simple network that represents a “minimalist” method for providing full read-write connectivity between a single b-wide memory interface and r CoRAMs. In each FPGA clock cycle, b/s words of data can be injected into the network as long as each s-bit data word is destined for a CoRAM in an independent lane. At each stage of the shift register network, a mux selects between the data of the previous stage and the output of the current CoRAM. It should not be difficult to see that such a network minimizes the mux depth at each stage at the cost of a worst-case latency of rs/b, which can be significant for large FPGA configurations (e.g., r > 1000).

Figure 32: Shift Register Network for CoRAM. (b/s rows × rs/b columns of s-bit registers between a b-wide memory interface and r CoRAMs.)

Figure 33: Efficient 16-to-1 multiplexer using 6-input LUTs [115].

In practice, multiplexers up to a certain depth can be implemented in modern FPGAs without the cost of intermediate gates. The Xilinx Virtex-6 architecture, for example, allows implementing a 1-bit, 16:1 mux using only four 6-input LUTs, as shown in Figure 33. Figure 34 shows a refined architecture that aggregates multiple CoRAMs into a single stage, which reduces the delay to rs/(bk), where k is the height of each mux. The parameter k can be viewed as a configurable parameter that tunes the delay and mux area of the shift register network.
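A back-of-the-envelope estimator for the combined network (a sketch under our own assumptions: the rs/(bk) latency from the text, and the four-LUTs-per-16:1-mux-bit cost from Figure 33 applied at k = 16; the parameter values are illustrative):

    #include <stdio.h>

    int main(void) {
        int r = 1024, s = 64, b = 512, k = 16;   /* illustrative values */
        int latency = (r * s) / (b * k);         /* worst-case cycles   */
        long mux_luts = (long)(r / k) * s * 4;   /* one s-bit k:1 mux
                                                    per group of k
                                                    CoRAMs, 4 LUTs/bit  */
        printf("latency: %d cycles, mux cost: ~%ld LUTs\n",
               latency, mux_luts);
        return 0;
    }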

Multiple Memory Interfaces. Figure 35 extends the previous design to multiple memory interfaces, where the shift register network is now multiplexed across m b-wide interfaces. A new parameter, g, splits the original chain into multiple clusters, each containing a private group of r/g CoRAMs. Separating the chain into multiple clusters allows concurrent accesses between the m interfaces and the g clusters, supported by a b-wide switch that connects the m memory ports to the g clusters. A generic high-level view of the “cluster-style” microarchitecture is shown in stylized fashion in Figure 36, where distinct clusters of CoRAMs are connected globally using a network that provides bulk data distribution.

Figure 34: Shift Register Network (Combined) for CoRAM (b/s rows by rs/bk columns).

Key Relations. Table 9 establishes the key relations in this style of microarchitecture. The free variables set by the designer include f, the nominal clock frequency of the FPGA fabric; s, the port access width of each individual CoRAM; r, the total number of CoRAMs per FPGA; B, the aggregate off-chip bandwidth in GB/s; b, the bit-width of each independent memory access port; g, the total number of clusters; and k, the mux height used to tune the delay and area of each cluster.

The cost of data steering can vary dramatically based on the values of three key variables: b, g, and m. When g or m assume large values, the cost of the network dominates due to the increased number of paths that exist between the m memory interfaces and the g clusters. To maintain a minimally-balanced system where all clusters can concurrently transfer B's worth of bandwidth, the relation g ≥ m must hold to avoid under-utilizing the memory bandwidth. In many cases, a value of g greater than m is desirable to provide increased localized bandwidth to individual, small clusters. This can become necessary when multiple applications share a single FPGA or when memory traffic to a specific cluster is bursty.

85 r / g CoRAMs per cluster with k height

b

Mem b

b

g clusters m memory Mem interfaces b Cluster (k, g)

b

b Mem Cluster (k, g) b Figure 35: Multiple Memory Interfaces.

The b parameter corresponds to the width of the datapath over which individual memory requests and responses travel. Increasing b lowers the overall number of clusters required to balance the memory traffic (and hence, the overhead of the network); however, b cannot assume unrealistically high values due to the potential waste of bandwidth. Generally, not all applications can be constructed in such a way that all memory accesses occur in large bulk transfers (see the Sparse Matrix-Vector Multiplication example in Section 5).

Parameter Selection and Estimates. To populate our free variables with meaningful parameters, Table 10 provides mux estimates across a variety of nominal FPGA configurations based on the technological trends discussed in Chapter 2. The four configurations shown represent a spectrum of baseline FPGAs, ranging from designs inspired by real devices today to more speculative future FPGAs that can sustain high levels of bandwidth to the fabric.

The bottom of Table 10 shows a variety of costs obtained by setting g equal to multiples of m, where m is determined automatically by the FPGA configuration and the selection of b = 128. As can be seen, the overheads are generally acceptable (under 10%) when the number of clusters exactly equals the total number of memory ports (m). When provisioning for increased localized bandwidth per cluster (e.g., g = 2 × m), however, the overhead increases non-negligibly.

Figure 36: Cluster-Style Microarchitecture (g clusters connected to m memory interfaces by a network of g + m nodes with b-wide datapaths).

Summary. Overall, the relative overheads increase as we look towards the future. Very large FPGAs with up to thousands of SRAMs and hundreds of GB/s of memory bandwidth can incur substantial overheads (15%) due to the need to connect so many CoRAMs to distinct interfaces. The initial analytical results here suggest that a potential avenue for reducing the cost of data distribution would be to introduce dedicated distribution mechanisms that mitigate the mux overheads. Our analysis thus far has only evaluated the mux cost of data steering and has ignored other important factors such as control overheads and programmability. Buffers, control logic, flow control, address generation, state machines, and other “overheads” are additional area costs that cannot be ignored when implementing data distribution in soft logic.

6.4 Implementing A Cluster-Style Microarchitecture for CoRAM

In this section, we devise a microarchitecture that satisfies the requirements outlined in Section 6.3. The design is based on the cluster-style microarchitecture and is adaptable to both soft and hard logic implementations. The soft version is architected to be highly parameterizable and expends only the minimal soft logic area needed to support a particular application; it further employs FPGA-specific optimizations to minimize area overheads. As will be discussed towards the end of this section, with little effort the same cluster-style microarchitecture with fixed parameters can also be hardened into dedicated silicon to improve efficiency and performance.

Parameters:
FPGA nominal fabric frequency (GHz) | f
Data port-width of single embedded CoRAM | s
CoRAMs per FPGA | r
Aggregate memory bandwidth (GB/s) | B
Fabric-level memory interface bit-width | b
Total clusters | g
Cluster mux height | k

Relations:
CoRAMs per cluster | r/g
Total number of memory interfaces | m = (8 × B)/(f × b)
Rows per cluster | y = b/s
Columns per cluster (= latency) | x = (r × s)/(b × k)
Per-CoRAM bandwidth (GB/s) | (f × b)/(8 × r/g)
s-wide (k:1) muxes (across all clusters) | x × y × g
b-wide (m:1) muxes (across all clusters) | g
b-wide (g:1) muxes (across all memory) | m

Table 9: Summary of Key Relations in Cluster-Style Microarchitecture
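To make these relations concrete, the short C sketch below (not part of the original design flow) evaluates the Table 9 formulas for one configuration; the parameter values are taken from the “Large” column of Table 10, and the helper is purely illustrative:

    #include <stdio.h>

    int main(void) {
        double f = 0.25;   /* nominal fabric frequency (GHz), per the Table 10 model */
        int    s = 32;     /* CoRAM port width (bits) */
        int    r = 4096;   /* total CoRAMs per FPGA */
        double B = 102.4;  /* aggregate off-chip bandwidth (GB/s) */
        int    b = 128;    /* memory interface bit-width */
        int    k = 16;     /* cluster mux height */

        int m = (int)((8.0 * B) / (f * b) + 0.5); /* memory interfaces */
        int g = m;                                /* clusters; g >= m required */
        int y = b / s;                            /* rows per cluster */
        int x = (r * s) / (b * k);                /* columns per cluster (latency), per Table 9 */

        printf("m = %d interfaces, about %d CoRAMs per cluster\n", m, r / g);
        printf("muxes: %d s-wide k:1, %d b-wide m:1, %d b-wide g:1\n",
               x * y * g, g, m);
        return 0;
    }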

Configuration | Small | Medium | Large | Future
Technology | 45nm | 32nm | 22nm | 16nm
6-input LUTs (K) | 500 | 1000 | 2000 | 4000
FFs (K) | 1000 | 2000 | 4000 | 8000
Fabric Frequency (MHz) | 200 | 200 | 225 | 250
DRAM Interfaces | 2 x DDR3-1600 | 4 x DDR3-1600 | 4 x DDR4-3200 | 8 x DDR4-3200
DRAM Bandwidth (GB/s) | 25.6 | 51.2 | 102.4 | 204.8
4kB CoRAMs | 1024 | 2048 | 4096 | 8192
Model (s = 32, b = 128, k = 16) | B = 25.6, m = 8 | B = 51.2, f = 0.2, m = 16 | B = 102.4, f = 0.25, m = 26 | B = 204.8, f = 0.3, m = 43
LUTs (g = m) | 13980 | 36152 | 104828 | 305799
CoRAMs per cluster | 128 | 128 | 160 | 192
BW per CoRAM (MB/s) | 25 | 25 | 25 | 25
Area overhead (%) | 3% | 3.5% | 5% | 7.5%
LUTs (g = 2 × m) | 23864 | 80497 | 250103 | 777231
CoRAMs per cluster | 64 | 64 | 80 | 96
BW per CoRAM (MB/s) | 50 | 50 | 50 | 50
Area overhead (%) | 5% | 8% | 12.5% | 19.5%
LUTs (g = 4 × m) | 64113 | 234721 | 736002 | 2045870
CoRAMs per cluster | 32 | 32 | 40 | 48
BW per CoRAM (MB/s) | 100 | 100 | 100 | 100
Area overhead (%) | 13% | 23.5% | 37% | 51%

Table 10: Estimated MUX Costs Across Projected FPGA Configurations

6.4.1 Design Overview

Figure 36 shows a complete overview of the cluster microarchitecture, consisting of address generation, the memory-to-CoRAM datapath, and the network-on-chip. In the soft fabric design, the logical embedded CoRAMs used by an application must be mapped into existing physical SRAMs on the FPGA. The SRAMs, in turn, are organized into clusters that provide connectivity to the memory interfaces (e.g., memory controllers) situated at the edges of the fabric. The cluster is a fundamental building block of the CoRAM memory subsystem; it simultaneously handles memory requests, control logic, and local data steering for a finite collection of physically-mapped CoRAMs. Figure 37 illustrates how physical CoRAMs are organized into clusters connected over a network-on-chip. Adjacent to the clusters are the control threads, which act as clients that submit memory address requests and status queries to one or more clusters. In a soft implementation of the CoRAM memory subsystem, clusters are instantiated as-needed to support the total number of CoRAMs used by the application. The number of CoRAMs allowed per cluster depends on an allowed maximum (typically 32 to 64) and the desired bandwidth to the memory subsystem.

Figure 37: Cluster-Style Microarchitecture for CoRAM (N CoRAM slots per cluster connected by a local interconnect to a control unit and NoC router, with memory interfaces at the edges).

Parameters. We introduce several parameters to guide our discussion of the cluster design. Each instantiated cluster is provisioned with N CoRAM slots, where each slot provides a physical 4kB CoRAM corresponding to a discrete physical SRAM on a Virtex-6 FPGA. At compile-time, the multiple CoRAMs employed by any given control thread are typically mapped into the slots within a cluster. Each cluster will typically host 8-64 CoRAMs, depending on the configuration set by user-guided policies.

Figure 38: Cluster Front-End Interface.

From the perspective of the user and the cluster, each individual CoRAM provides a dual-ported read-write memory with s-wide data ports, where s is assumed to be 32b when targeting a Virtex-6 FPGA. Assuming a 4kB capacity and a 32b datapath, each CoRAM stores 1024 words and is accessed using a 10-bit local RAM address. In the soft logic implementation, one SRAM port is dedicated to the cluster logic, while the other is exported to the application.

A key parameter b defines the link width between the cluster and the on-chip network. The link width places an upper limit on the bandwidth and throughput between a single cluster (and hence its private set of CoRAMs) and the edge memory interfaces. Typical values of b range from 64 to 256 bits, depending on the implementation and the desired throughput to memory.

6.4.2 Clients and Interfaces

The control threads of an application, which exist either in the form of soft finite state machines or as microcontrollers (as described in Chapter 4), can be viewed as clients that issue memory requests to the clusters. In the soft implementation of CoRAM, multiple control threads may share a single cluster, and a single control thread may also access multiple clusters. A memory access begins with a control thread that issues a memory control action over the wire-level interface shown in Figure 38. Recall from Chapter 3 that memory requests are generated using software control actions such as:

    tag = cpi_nb_write_ram(
        cpi_hand     rams,       /* co-handle               */
        cpi_ram_addr ram_addr,   /* RAM local address       */
        cpi_addr     addr,       /* virtual memory address  */
        cpi_int      N,          /* N bytes                 */
        cpi_tag      tag_append  /* append to existing tag  */
    );

The fields of the wire-level transactions shown in Table 11 correspond to the arguments specified by control actions at the software level. The rams argument defines the co-handle that points to a collection of one or more CoRAMs hosted by the cluster. In hardware, this argument is mapped into several sub-fields (map, width, and width idx) that will be explained in the subsections below.

The ram addr argument determines the local address of the logical memory to which the transfer operations pertain. The ram addr argument is interpreted contextually, depending on the configuration of the co-handle. For example, four 1024x32-bit CoRAMs composed linearly in a co-handle would present a valid 12-bit address space for accessing individual 32-bit words. Conversely, four 1024x32-bit CoRAMs composed in scatter-gather fashion would present a valid 10-bit address space, where each word of the logical co-handle is viewed as 128 bits wide.

Address Requests and Tag-based Allocation. The cluster microarchitecture is designed to support concurrent memory transactions to increase throughput to the memory system. A tag-based allocation scheme helps achieve this objective by allowing out-of-order delivery of messages and multiple outstanding transactions across simultaneous threads. Out-of-order delivery gives flexibility to other components such as the network-on-chip and the edge memory cache subsystem.

Within each cluster, a control action issued by a control thread results in the transmission of one or more tagged memory requests to the network and edge memory interfaces. An N-byte control action, for instance, results in N/(b/8) separate requests to memory. When a control action is requested, a thread receives a logical tag from a FIFO-based freelist; the tag corresponds to a single control action as well as the allocated resources within the cluster. A thread is responsible at a later time for querying the status of all the tags it receives. Due to re-ordering, tags are not guaranteed to complete in order of allocation.

Signal | Width | Description
prio | 1 bit | High-priority request
rnw | 1 bit | Read-not-write request
ram addr | 16 bits | CoRAM word address
mem addr | 32 or 64 bits | Virtual memory address
bytes | 12 bits | Size of memory transaction
width | log2(Ncorams per cluster) bits | Number of CoRAMs to access
width idx | 5 bits | Division lookup table index
map | 5 bits | Co-handle-to-CoRAM map table index
append | 1 bit | Append to existing transaction
append tag | log2(Ntags per cluster) bits | Tag to use during append

Table 11: Transaction Interface
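The FIFO-based freelist of logical tags can be pictured as a small circular buffer; the following C sketch is a behavioral illustration only (names and sizes are not taken from the actual RTL):

    #define NUM_TAGS 16

    /* Behavioral sketch of a FIFO-based logical-tag freelist. */
    static int freelist[NUM_TAGS];
    static int head = 0, tail = 0, count = 0;

    void freelist_init(void) {
        for (int t = 0; t < NUM_TAGS; t++) freelist[t] = t;
        head = 0; tail = 0; count = NUM_TAGS;
    }

    int alloc_tag(void) {                 /* returns -1 if no tags available  */
        if (count == 0) return -1;        /* the thread must stall until a    */
        int tag = freelist[head];         /* prior transaction is reclaimed   */
        head = (head + 1) % NUM_TAGS;
        count--;
        return tag;
    }

    void reclaim_tag(int tag) {           /* invoked when the transaction's   */
        freelist[tail] = tag;             /* message count reaches zero       */
        tail = (tail + 1) % NUM_TAGS;
        count++;
    }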

Figure 38 shows how the logical tag is used to index and allocate the various hardware structures that remain active throughout the duration of a given transaction. The TransTable, for instance, records information necessary for lookups once memory responses arrive. Another structure, MsgCount, tracks the total number of outstanding messages corresponding to a single logical control action. The Ptags structure is a separate freelist that contains physical tags associated with each individual memory message; individual physical tags are used to index structures that track per-word steering information used when memory responses arrive (as will be discussed further below).

6.4.3 Message Sequencing and Out-of-Order Delivery

When a data response arrives for a memory control action, the cluster is responsible for steering the data appropriately to the destination CoRAMs named by the co-handle used in the control action. Recall from Chapter 4 that software co-handles are used to represent one or more physical CoRAMs functioning as a single logical unit. The ability of co-handles to encapsulate arbitrary aspect ratios of CoRAMs requires the cluster to handle various steering and alignment scenarios. For example, a “scatter-gather” co-handle combining 4 separate CoRAMs would expect an N = 16B stream of data to be split into 4 individual words, each written to a separate CoRAM. Conversely, writing a large stream of data to a “linear” co-handle would write as much data as possible to the first CoRAM until the end of its 10-bit local address space is reached, followed by the second CoRAM, and so forth. For each s-wide word of data, the steering logic must uniquely determine: (1) the CoRAM slot target from 0 to N − 1, and (2) the 10-bit local address to use when writing the s-wide word to the CoRAM. Another requirement is the need to support memory responses that may arrive out of issue order.

Figure 39 illustrates a sequenced mapping approach that simultaneously addresses the steering problem and the handling of out-of-order message delivery. When a control action is executed, every s-wide word of data of the sequential memory stream is tagged with a small, consecutively increasing sequence ID. The sequence ID allows simultaneous computation of both the CoRAM slot (i.e., a value between 0 and N − 1) and the local address where the destination word belongs. For each word with a given sequence ID i, the CoRAM slot target is computed by taking i % width, where width is the logical width of the co-handle divided by s. For example, a co-handle configured as 1024x128b would have width = 128/32 = 4. The local address of the target CoRAM is computed by taking ram addr + i/width, where ram addr is the local address specified as an argument in the original control action. In our implementation, the sequence ID is a simple 16-bit field attached to every memory request and facilitates word-level steering, address calculation, and in-place delivery of memory messages into the CoRAMs.
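For concreteness, the per-word steering computation can be expressed as the following C sketch (names are illustrative; the actual logic is implemented in cluster hardware):

    #include <stdint.h>

    /* Steering for one s-wide word with sequence ID i (illustrative). */
    typedef struct {
        uint32_t slot;      /* destination CoRAM slot, 0..N-1   */
        uint32_t ram_addr;  /* local word address in that CoRAM */
    } steer_t;

    steer_t steer_word(uint32_t i, uint32_t width, uint32_t ram_addr) {
        steer_t out;
        out.slot     = i % width;            /* which CoRAM of the co-handle */
        out.ram_addr = ram_addr + i / width; /* offset within that CoRAM     */
        return out;
    }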

Implementation. The mapping functions for steering require division in hardware if the width field is not assumed to be a power of two. To support this operation efficiently on the FPGA, we exploit the fact that local RAM address spaces are small (between 10b and 16b) and can be implemented efficiently using lookup tables hosted in just 1 or 2 BRAMs. A further optimization exploits the fact that not all dividend-divisor combinations are required, given that only a finite number of CoRAMs and co-handles are mapped into a given cluster. Prior to runtime, the division lookup tables are populated only with the dividend-divisor combinations selected by width idx, corresponding to the wire-level transaction field shown in Table 11. Figure 38 (right) shows a division lookup stage that takes as arguments the width idx field along with an assigned sequence ID i to compute the CoRAM slot target and the target RAM address. Both results are stored into the TransTable for lookup later, once a memory response arrives.
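Since only the combinations actually used by mapped co-handles are needed, the division tables can be filled before runtime, as in this hedged C sketch (array sizes and names are illustrative assumptions, not the thesis RTL):

    #include <stdint.h>

    #define MAX_SEQ    4096  /* assumed sequence-ID range for one cluster          */
    #define NUM_WIDTHS 4     /* co-handle widths in use, selected by width idx     */

    static uint16_t quot_lut[NUM_WIDTHS][MAX_SEQ]; /* i / width -> address offset  */
    static uint8_t  slot_lut[NUM_WIDTHS][MAX_SEQ]; /* i % width -> CoRAM slot      */

    /* Populate the tables for the (nonzero) widths of this configuration. */
    void build_division_tables(const uint32_t widths[NUM_WIDTHS]) {
        for (uint32_t w = 0; w < NUM_WIDTHS; w++)
            for (uint32_t i = 0; i < MAX_SEQ; i++) {
                quot_lut[w][i] = (uint16_t)(i / widths[w]);
                slot_lut[w][i] = (uint8_t)(i % widths[w]);
            }
    }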

Figure 39: Mapping Functions to Support Arbitrary CoRAM Aspect Ratios. A co-handle combines three 1024x32 CoRAMs (slots 0-2) into a wide 1024x96 RAM written via cpi_write_ram(cohandle, x, src, 32). Each 32-bit word arriving from memory carries a consecutive seqID, which maps to a slot and address via Target_Slot(seqID) = seqID % width and Target_Addr(seqID) = seqID / width (here width = 3).

6.4.4 Memory-to-CoRAM Datapath

The memory-to-CoRAM datapath, shown in Figure 40, is responsible for receiving a b-wide block of data from the network and steering the data to the target CoRAMs. At this point in the datapath, it is already assumed that the CoRAM target slot (from 0 to N − 1) and the 10-bit RAM address are known for each s-wide word (see the previous section). The front of the datapath first splits the b-wide block of data into b/s lanes, where each lane provides a dedicated datapath to N/(b/s) CoRAMs. The b-wide blocks of data that arrive from memory are first queued into a special input buffer called the Timeshift FIFO (explained further below). At the head of the FIFO, the b-sized blocks are separated into b/s lanes of data.

The target CoRAM slots at the head of each FIFO lane are fed into a conflict detector that determines which entries at the head of the FIFO are allowed to proceed down the pipeline. Conflicts occur when multiple words at the head of the FIFO are destined to the same CoRAM or to a conflicting lane. After the conflict detector evaluates, a used bitmask register is updated for each lane that is ready to proceed. When the entire bitmask is set, the FIFO is dequeued. At each clock cycle, an input alignment step takes place on the conflict-free words, which re-shuffles them into their respective lanes. After the alignment stage, the CoRAM address and data are presented to the respective CoRAMs. The reverse operation, i.e., reading from CoRAMs and writing to memory, is similar, except that the input data is ignored, while the output data is read out, re-aligned, and collected into an output coalescing buffer bound for the network.

Figure 40: Memory-to-CoRAM Datapath.

Eliminating Head-of-Line Blocking. The Timeshift FIFO (TFIFO) mentioned earlier is a core component of the cluster logic that addresses a specific problem related to head-of-line (HOL) blocking. Figure 41 (left) illustrates a frequently-occurring case where two b-wide blocks of memory (A and B) are unable to read or write CoRAMs concurrently because all of the words of A are serialized and waiting at the head of the FIFO. HOL blocking occurs when multiple control threads operate concurrently upon co-handles configured with narrow aspect ratios. A straightforward but costly approach to this problem is to add extra FIFOs per lane, as shown in Figure 41 (right). The extra FIFOs would allow independent destination words to be exposed at the heads of the FIFOs, but at the cost of increased area and clock delay. Figure 42 illustrates how the TFIFO addresses this problem by re-ordering input data into diagonally-shifted departures. The TFIFO has the property of consuming only as much storage as a standard FIFO, yet it can completely eliminate HOL blocking within the cluster.

Figure 41: Single and Multiple Memory-to-CoRAM Queues.

Figure 42: Time-shifted Schedule (data arrival order versus the desired diagonally-shifted departure order).

TFIFO Implementation. A TFIFO is characterized by D, the depth; s, the width of individual words; N, the total number of write ports; and C, a special parameter that determines the number of so-called colors. The idea of colors is to sort traffic into different allocated slots in the FIFO such that independent traffic can proceed without experiencing HOL blocking. For instance, if a control thread X issues a large stream of accesses to a single narrow CoRAM, the accesses by thread X should be assigned to only a single color, which forces the received packets into time-shifted slots that minimize the opportunity to block others at the head of the TFIFO. If another thread Y is writing simultaneously, a separate color can be assigned to it, which places its packets into slots that allow transactions from both threads to proceed at full throughput.

A TFIFO maintains a throughput of s × N bits per cycle while keeping a total storage of D × s × N/8 bytes. The TFIFO contains N independent memories of dimension D × s. When employed in a single cluster, the parameter N is assigned to b/s, corresponding to the number of lanes in the memory-to-CoRAM datapath of Figure 40. Figure 43 illustrates the front- and back-end ports of the TFIFO.

Figure 43: Front and Back Interfaces of the Timeshift FIFO.

A single ENQ signal writes an entire b-wide memory word on each clock cycle across the b/s write ports. The TFIFO is configured with C colors, where D/C entries of the TFIFO are allocated to each color. When ENQ is enabled, an associated color input determines where the data is written. Internally, the TFIFO maintains C × N write pointers, where N is the total number of write ports and memories. The write pointers are initialized to consecutive values starting from 0 to C − 1 in the first lane, from 1 to C in the second lane, from 2 to C + 1 in the third lane, and so forth. The off-by-one shifting ensures that the s-sized words of the input data are shifted diagonally across the internal memories. A write pointer can only increment by C on each ENQ, which guarantees that no two pointers can write to the same location within a single lane. To read values from the TFIFO, a single read pointer advances one-by-one and must check against the head pointers of each color on every clock cycle to determine the emptiness of the TFIFO.
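The pointer discipline can be summarized in C as follows (a behavioral sketch with illustrative names and sizes, not the RTL itself):

    #define N_LANES 4   /* b/s lanes          */
    #define COLORS  2   /* C colors           */
    #define DEPTH   16  /* D entries per lane */

    static unsigned wptr[COLORS][N_LANES];

    /* Diagonal initialization: lane L's pointer for color c starts at c + L. */
    void tfifo_init(void) {
        for (unsigned c = 0; c < COLORS; c++)
            for (unsigned lane = 0; lane < N_LANES; lane++)
                wptr[c][lane] = (c + lane) % DEPTH;
    }

    /* On each ENQ of a b-wide word with a given color, every lane's pointer
     * for that color advances by C, so colors never collide within a lane. */
    void tfifo_enq(unsigned color) {
        for (unsigned lane = 0; lane < N_LANES; lane++) {
            /* mem[lane][wptr[color][lane]] = word_slice(lane); */
            wptr[color][lane] = (wptr[color][lane] + COLORS) % DEPTH;
        }
    }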

TFIFO Example. Figure 44 gives a concrete example of the TFIFO in action. In this example, two streams of data originating from separate control threads are written into two separate 32-bit-wide CoRAMs (labeled A and B, respectively). The TFIFO shown has D = 8 entries, N = 2 lanes, and C = 2 colors (gray and white). At time=0, the gray pointer is initialized to 1 while the white pointer is initialized to 0. When the first block of data A arrives from memory, its upper half is written into slot 0/lane 0, while the lower half is written into slot 1/lane 1 in a diagonal fashion. Note how slot 0/lane 1 is preinitialized with a padded value called ‘X’. After the first write, both pointers increment by C = 2. The second word that arrives is also destined for A and has its upper and lower halves written into slot 2/lane 0 and slot 3/lane 1, respectively. Note how this arrangement preserves slot 1/lane 0 and slot 2/lane 1 for B at time=2. As the read pointer advances, the upper and lower halves of A and B appear interleaved, which allows concurrent writes to their respective CoRAMs.

Figure 44: Example Timeline of the Timeshift FIFO.

Color Assignment. Within the TFIFO, each word of memory must be colored appropriately to minimize HOL blocking. To assign colors, the cluster must know ahead of time which memory accesses are destined to the same CoRAM and which to independent CoRAMs. In the case of wide co-handles, HOL blocking is a non-issue because all words can be written independently, and thus any color can be selected without problem. In the case of narrow co-handles, a first-order approximation that works effectively is to sort the traffic by the lower-order bits of a number associated with each unique control thread attached to the cluster. This is the reason why the thread-to-cluster interface shown earlier in Table 11 includes a width field associated with each control action. A hedged sketch of this policy appears below.
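The following C sketch captures the first-order color assignment just described (names and the power-of-two color count are illustrative assumptions):

    /* Pick a TFIFO color for a request from a given control thread.
     * width_words is the co-handle's logical width divided by s;
     * num_colors is assumed to be a power of two. Illustrative only. */
    unsigned pick_color(unsigned thread_id, unsigned width_words,
                        unsigned lanes, unsigned num_colors) {
        if (width_words >= lanes)             /* wide co-handle: HOL blocking  */
            return 0;                         /* is a non-issue; any color ok  */
        return thread_id & (num_colors - 1);  /* narrow: sort by thread ID     */
    }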

Avoiding Deadlock With Dummy Injections. Another problem within the TFIFO is that deadlock can occur if only a single stream is actively writing to a single CoRAM; in this case, all but one color goes unused within the TFIFO. To handle the deadlock problem, the cluster must ensure that the word count for all colors is kept high enough within the TFIFO such that at any given clock cycle, an entry at the head of the FIFO is allowed to dequeue. To achieve this, Figure 43 shows an additional “pad” signal that allows the cluster logic to inject dummy words into the TFIFO to maintain a high enough word count per color. A simple strategy is to enqueue into the TFIFO with a random color whenever a valid data word is not present to be written. However, this approach can dramatically increase the latency of data responses because the TFIFO is kept full at almost all times. A much better strategy is to maintain level counters per color to track which color requires injection. We have found that this approach offers the best throughput and latency for the cluster.

Figure 45: Complete View of the CoRAM Cluster Microarchitecture.

Component | Description
Per-thread Request FIFOs | Small per-thread queues that accept control action requests
Central Request FIFOs | Central queue with an arbiter that selects from a request queue
Ptags Freelist | Circular buffer with free ptags
Tag Queues | Circular buffer with free tags
Status Request FIFOs | Small per-thread queues that accept tag check requests
Memory Request FIFOs | Queue for all outgoing memory requests
Transaction Table | SRAM that stores logical transaction state
Input Timeshift FIFO | Front-end re-ordering queue for the memory-to-CoRAM datapath
Output Coalescing FIFO | Outgoing queue for CoRAM-to-memory transfers
Message Counters | Multi-ported memory that maintains a message count per tag
Transaction Locks | 1-bit-wide table used to avoid race conditions
Map/slot tables | ROM tables used to map co-handles to CoRAMs
Division lookup table | ROM tables used to perform mapping functions
Reverse ptag-to-tag table | Small table that provides reverse mapping of ptags to tags

Table 12: Summary of Key Structures in a Single Cluster.

6.4.5 Cluster Summary

In this section, we have covered the salient features of the cluster-style microarchitecture. Figure 45 shows all of the major sub-components of the cluster in a single diagram, comprising the thread interface, the address generation unit, the memory-to-CoRAM datapath, and the interfaces to surrounding external components. In summary, the cluster microarchitecture supports a number of key features:

• Highly concurrent throughput with multiple outstanding memory transactions and out-of-order message delivery.

• Arbitrary aspect ratios and compositions of multiple CoRAMs.

• Reduced head-of-line blocking using the specialized Timeshift FIFO.

Table 12 summarizes the key structures, components, and parameters of a single cluster. In the soft fabric implementation presented in this thesis, each of these components is configurable at compile-time and consumes only the minimum resources needed to support a particular configuration. Recalling our soft CoRAM analysis from Section 6.3, it can be seen that even within a single cluster, there can be significant overheads beyond the baseline steering requirements alone. The next section describes the communication and memory components external to the clusters.

6.5 Network-on-Chip

In many digital systems, performance is often limited by communication or interconnection, and not necessarily by logic or memory. The CoRAM paradigm fundamentally requires a high-performance communication and data distribution mechanism that connects all of the clusters and memory ports. Based on current scaling trends, a future FPGA with CoRAM support could be expected to host up to 128 clusters and up to 32 memory nodes by the 16nm technology node (see Table 10). A Network-on-Chip (NoC) is a promising approach that handles the critical role of communication and data distribution between the clusters and the memory ports at the edges of the fabric. Dally et al. [30] state the various requirements that dictate a network's parameters:

• Number of terminals

• Peak bandwidth of each terminal

• Average bandwidth of each terminal

• Required latency

• Message size

• Traffic pattern(s)

• Quality of service

• Reliability

The parameters of a desired CoRAM-to-memory interconnect for a hardened “large” FPGA configuration (see Table 10) are summarized in Table 13. The performance of an NoC must be provisioned to at least sustain the peak bandwidth of off-chip memory and to minimize contention and latency perceived by the application. It is important to note how the peak bandwidth of an NoC is provisioned significantly higher than the aggregate amount of memory bandwidth available in the system. Although one could theoretically divide the 102.4GB/s of bandwidth between all 64 clusters, the effect of serialization at each cluster port can have a detrimental effect on latency [30].

Parameter | Value
Cluster Ports | 64
Memory Ports | 16
Peak bandwidth | 922 GB/s
Average bandwidth | 102.4 GB/s
Message latency | 100ns
Message size | 16-32B
Traffic patterns | arbitrary
Quality of service | none
Reliability | no message loss

Table 13: Parameters of Cluster-Memory Interconnection Network (Large)

For a given application, a designer must work within technology constraints to implement the topology, routing, and flow control of the network. In CoRAM, the technology constraints can vary depending on the implementation style, i.e., hard vs. soft. In a hard implementation of an NoC, each of the above parameters must be selected prior to fabrication and hardening within the FPGA. In a soft implementation of CoRAM, the NoC is a flexible design choice at compile-time: an NoC can be configured with precisely the topology, routing, and flow control needed to support the requirements of an application. Soft NoCs, however, are limited to slow fabric clock frequencies and must also share resources with the application. Even though a hardened NoC is “inflexible”, it is likely to provide substantially improved throughput and latency relative to soft logic.

Topology Selection. The archetype of a network-on-chip comprises a collection of shared router nodes and wire-level links (or channels). The topology of a network-on-chip generally refers to the arrangement of such nodes and links. As noted by [30], a good topology exploits characteristics of the underlying medium to meet the bandwidth and latency requirements of an application while minimizing costs. A topology is often characterized by its bisection bandwidth, which is the bandwidth between the two equal parts when the network is segmented in half. Topology also determines the average distance between nodes (or the hop count), which directly relates to the latency that nodes perceive.

When developing either a hard or soft NoC design for CoRAM, the topology and routing are restricted to a few practical choices due to the placement of SRAMs in modern FPGAs today. As shown in Figure 46, the Block RAMs in the Xilinx Virtex-6 architecture are typically arranged in adjacent parallel columns separated by soft logic fabric. The arrangement of SRAMs in parallel columns is due to both design and manufacturing constraints: design-wise because the column arrangement simplifies multiple-SRAM composition, and manufacturing-wise due to the ease of replicating identical fabric tiles to create different FPGA sub-families [4].

Figure 46: Virtex-6 LX760 Layout Schematic (Block RAM columns are highlighted).

Assuming that the columnar arrangement of SRAMs does not change in the future, this thesis considers only two well-studied topologies: the mesh and the ring. A mesh topology is a natural fit for several reasons: (1) the spatial layout of physical RAMs and memory interfaces at the edges matches well with the mesh topology, which minimizes wire distances between nodes, and (2) the mesh router only requires a 4x4 crossbar, which can be implemented very efficiently in a soft logic fabric design. The ring topology represents the lowest-cost design point for soft logic designs, which is important for determining a lower bound on the cost of implementing CoRAM in soft logic.

NoC: Hard vs. Soft. From the perspective of a fabric designed to support computing, a hardwired NoC offers significant advantages, especially if it reduces or eliminates the need for long-distance routing tracks. Under the CoRAM architectural paradigm, global bulk communications are restricted to CoRAM-to-CoRAM or CoRAM-to-edge transfers. Such a usage model is better served by the high performance (bandwidth and latency) and the reduced power and energy of a dedicated hardwired NoC that connects the CoRAMs and the edge memory interfaces. With a hardwired NoC, it is also more cost-effective in area and energy to over-provision network bandwidth and latency to deliver robust performance across different applications. Chapters 7 and 8 will later demonstrate the benefits of a hard NoC in the context of real applications.

6.6 Memory Controllers and Caches

The edge memory subsystem comprises the hard memory controllers and caches that bridge the external main memory to the cluster subsystem. With memory I/O speeds rapidly outpacing the clock frequencies of the fabric [28], conventional FPGAs have begun incorporating dedicated on-die memory controllers [113]. A memory controller is typically responsible for translating high-level application requests into row and column commands for the raw DRAM interface. Controllers are also responsible for issuing refresh commands to the memory DIMMs.

Typical commodity DRAMs are optimized for large bulk transfers (e.g., 8 × 8B bursts = 64B) that correspond to common cache block sizes in modern general-purpose processors [58]. A particular challenge with burst-oriented DRAM interfaces is the waste of bandwidth when the memory request sizes originating from the clusters are not exact multiples of the burst length (e.g., 64B). The introduction of caches at the boundaries of the fabric is a simple but effective solution to this bandwidth waste. A cache retains the entire burst of DRAM data and allows subsequent requests to the same burst to access the data without further communication with the DRAM controller.

Figure 48 shows the internals of the multi-banked cache controller used in the CoRAM memory subsystem implemented in this thesis. Each bank exposes an independent lookup stage connected to a network port and allows multiple in-flight transactions to the memory controller through the tracking of Miss Status Holding Registers (MSHRs) [88]. On a cache miss, all of the MSHRs are looked up concurrently in a single clock cycle. If a matching tag is found, the new miss is merged with an existing outstanding miss; otherwise, a new MSHR is allocated and a request is issued to the DRAM controller.

6.7 Network-to-DRAM-Cache Interface

Figure 47 illustrates the connection between the edge cache and multiple ports of the network-on-chip. Typically, the cache-to-NoC link width provides a lower throughput than the actual raw DRAM interface (especially in a soft NoC design). To balance the system, the edge caches can be banked into multiple memory ports, with each port exposed as an additional memory node on the NoC. Banking helps to increase parallelism and utilization at the memory controllers. The degree of banking must be kept high enough to rate-match the NoC link and the cache port, as described by the following relation:

ports = ((8 × fi/o) / (fnoc × b × 1/8)) × BankFactor

The relation assumes a JEDEC-standard DRAM interface with a 64-bit datapath between the FPGA and each DIMM managed by the memory controller [59]. The peak throughput of each DIMM is fi/o × 8 bytes per second. To balance the traffic, this throughput must at least equal fnoc × b × 1/8, where b is the link width of the network-on-chip. Given that the cache controller can also act as a filter, the BankFactor variable adds additional ports and cache banks to ensure that the memory controller will be maximally utilized. In our experiments of Chapter 8, a bank factor of 2 was sufficient to achieve bandwidth-limited performance in our applications.
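As a worked instance of the relation, the C sketch below plugs in illustrative values (a DDR3-1600 DIMM with an effective fi/o of 1.6 GHz, a 200 MHz soft NoC with 128-bit links, and a bank factor of 2; these numbers are assumptions for illustration, not results from the thesis):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double f_io  = 1.6;        /* effective DRAM I/O frequency (GHz), 8B/transfer */
        double f_noc = 0.2;        /* NoC clock (GHz)                                 */
        double b     = 128.0;      /* NoC link width (bits)                           */
        double bank_factor = 2.0;  /* extra banking, per the Chapter 8 experiments    */

        double dimm_bw = 8.0 * f_io;       /* 12.8 GB/s per DIMM     */
        double link_bw = f_noc * b / 8.0;  /*  3.2 GB/s per NoC link */
        int ports = (int)ceil((dimm_bw / link_bw) * bank_factor);

        printf("ports per controller = %d\n", ports); /* prints 8 */
        return 0;
    }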

Address mapping. With multiple memory ports distributed along the network-on-chip, a mapping must exist between a request's address and the memory controller where the data resides. The simplest approach is to block-interleave at a fixed granularity by masking out the lower-order bits of an address. Block-interleaving uniformly distributes the traffic from the clusters to the edge memory ports.1 Other mapping schemes are also possible, which involve XOR hashing of higher- and lower-order bits to uniformly distribute the traffic [99]. In our experiments of Chapter 8, a block-interleaving scheme was found to be sufficient for our applications.

1 The Convey HC-1 computer, for instance, block-interleaves at a granularity of 32B across 16 independent DDR2 DIMMs [27].
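A minimal C sketch of block-interleaved port selection follows (the constants are illustrative; the commented-out XOR fold corresponds to the hashing variant [99]):

    #include <stdint.h>

    #define NUM_PORTS    16   /* memory ports on the NoC (power of two) */
    #define ILEAVE_BYTES 64   /* interleaving granularity               */

    static inline unsigned port_of(uint64_t addr) {
        uint64_t block = addr / ILEAVE_BYTES;       /* block index            */
        /* block ^= block >> 8; */                  /* optional XOR hash [99] */
        return (unsigned)(block & (NUM_PORTS - 1)); /* mask lower-order bits  */
    }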

Figure 47: NoC-to-DRAM Interface (ports per controller = (8B × fi/o) / (fnoc × b × 1/8); each memory controller sustains a bandwidth of 8B × fi/o over a 64-bit DRAM interface).

6.8 Summary

In this chapter, we developed the archetype of the cluster-style memory subsystem from first principles. The central building block of the CoRAM memory subsystem is the cluster, which aggregates multiple embedded CoRAMs behind a single network endpoint. The cluster serves as a central point for issuing requests, collecting responses, and performing fine-grained data steering and alignment. At the macro scale, a network-on-chip provides scalable bulk data distribution between the cluster endpoints and the multitude of distributed memory ports throughout the fabric. At the memory nodes, non-blocking caches bridge the memory controllers and DRAM to the network-on-chip. In Chapter 8, we will evaluate the merits of this style of microarchitecture for both hard and soft implementation targets.

Figure 48: Memory Subsystem (four cache banks per memory controller and four memory controllers; each bank contains tag and data arrays, searchable MSHRs with sub-entry queues, and request/response queues bridging the on-chip network to the memory controller).

Chapter 7

Prototype

No way of thinking or doing, however ancient, can be trusted without proof.

Henry David Thoreau

The CoRAM memory subsystem from Chapter 6 has been fully implemented in synthesizable RTL and validated in Verilog simulations. The implementation, along with the CoRAM Control Compiler (CORCC) from Chapter 4, is integrated into a unified tool called Corflow, which supports a stand-alone flow for compiling Cor-C applications into synthesizable RTL. The prototype serves several key objectives:

• To obtain accurate FPGA and ASIC synthesis results for evaluating and comparing implementation alternatives in Chapter 8.

• To carry out performance simulations of real applications with bit-level accuracy and for validation.

• To provide a testbed for application development and tuning.

• To demonstrate the plausibility of CoRAM as a practical abstraction for on-die FPGA-based memory management.

Section 7.1 describes the RTL implementation of the CoRAM memory subsystem. Section 7.2 describes a synthesizable network-on-chip generator called CONECT that enables exploration of multiple network designs. Section 7.3 describes the edge memory cache controller designs. Section 7.4 gives an overview of the Corflow tool that integrates CORCC and the physical hardware and provides an automatic Cor-C-to-RTL development flow. Section 7.5 describes a Corflow backend that targets the ML605 FPGA prototype board.

Parameter | Options | Description
Base RAM Width (s) | Any power-of-two width | Base CoRAM width (typically maps to BRAM width)
Memory Data Width (b) | Any multiple of b > 0, b > s | Cluster-to-NoC link width
Memory Address Width | 32, 64 | Virtual address space width (host-dependent)
Max Cluster Size | ≥ 4 | Number of CoRAMs in a cluster
Max Request Size | Any multiple of b > 0 | Maximum memory request size per control action
Max Threads | ≥ 1 | Number of thread-to-cluster interfaces
Num Tags | ≥ 1 | Total number of logical tags
Num Ptags | ≥ 1 | Total number of physical tags

Table 14: Customizable Parameters in a Single Cluster Implemented in Bluespec.

7.1 Cluster Design

The cluster design of Chapter 6 was developed in 8300L of Bluespec System Verilog (BSV) [17] over a period of nine man-months. The Bluespec RTL is highly parameterized and targetable to both hard and soft logic implementations. The Bluespec language was selected due to its features that greatly accelerated hardware development and validation. The strong type checking and rule-based semantics of BSV were instrumental in reducing the likelihood of design bugs, while the built-in static elaboration engine enabled highly parameterized and modular designs, allowing for rapid exploration of alternatives. The salient parameters of the cluster RTL are listed in Table 14; they allow the user to configure the desired parameters for a given application.

FPGA Design. A significant effort (3 man-months) was invested in optimizing the area and clock frequency of the soft cluster design on the Virtex-6 FPGA architecture [112]. The optimizations focused on reducing the area costs of the various FIFOs, memories, and logic used for buffering and tracking the state of transactions. Particular attention was given to the efficient use of LUTRAM-based FIFOs and BRAMs to implement the various required structures. BRAMs can be configured into multiple memory aspect ratios (e.g., 1024x36-bit to 16384x1-bit) and are typically useful for constructing deep, wide-aspect-ratio memories. A single Xilinx Virtex-6 LUT, on the other hand, can be used to construct a single dual-ported 32x1-bit memory (LUTRAM); the LUTRAM is typically most efficient when implementing shallow memories. The division lookup tables described in Section 6.4.3, for example, require thousands of entries and were mapped into dual-ported BRAMs. Other objects such as buffers and state-tracking structures were implemented judiciously in LUTRAMs. Some structures, such as the transaction message counters, required multiple concurrent write ports, which, if implemented naively, could consume a significant amount of area (1-2 KLUTs). Rather than implementing multiple write ports in logic, a banked multi-ported approach based on prior work [68] was employed, which reduced the multi-ported register file's area by an order of magnitude.

ASIC Design. The same RTL developed for the soft cluster is adaptable to a hardened ASIC implementation. A hard implementation of the cluster must permanently fix the values of various parameters shown in Table 14 prior to fabrication. Further, the LUT- and BRAM-based memories used to implement buffering and state tracking must be replaced with standard SRAMs generated using a memory compiler or through custom design.

7.2 CONECT Network-on-Chip Generator

The network-on-chip for CoRAM employs a highly optimized packet-switched network-on-chip generator called the COnfigurable NEtwork Creation Tool (CONECT) [83]. The CONECT framework used in CoRAM is adapted from the original winning submission to the MEMOCODE 2011 design contest [83] (courtesy of Michael K. Papamichael). The goal of the contest was to devise a fast, parameterized network-on-chip simulator, which CONECT achieved by leveraging the rich static elaboration features of Bluespec System Verilog and through extensive optimizations specific to the Virtex-6 FPGA architecture.

The heart of CONECT is a programmable, packet-switched router, which is configurable in VCs, ports, allocators, buffers, and routing tables, all of which enable the generation of highly parameterized networks. CONECT implements an input-queued router that switches packets in a single clock cycle. Various CONECT networks have been validated on the Xilinx ML605 board and through extensive Verilog simulations [83]. The set of parameters accepted by CONECT includes:

• Pre-defined network topologies (mesh, ring, crossbar, torus, star, fully-connected, custom)

• Up to 1024 routers

• Up to 16 inputs and outputs per router

• Up to 8 virtual channels per router

• Configurable buffer depths > 2

• Configurable allocators (separable, monolithic)

CONECT and CoRAM. CONECT is used to generate NoCs for both soft and hard implementations of CoRAM. When synthesizing an implementation, CONECT is configured with a total number of nodes equal to the number of clusters and memory ports in the fabric. The CONECT router is configured with a flit width that matches the Cluster-to-NoC data link (typically 16B or 32B) plus the additional meta-fields associated with each memory request or acknowledgement (i.e., address, mask, is write, tag, etc.). In the version of CONECT used in this thesis, all packets are switched in a single clock cycle (i.e., packet = flit).

7.3 Edge Memory Cache Controllers

In addition to the cluster design and CONECT, several cache controller designs were implemented in RTL to functionally simulate the edge memory caches that bridge the external memory ports (i.e., memory controllers) to the network-on-chip, as described in Chapter 6. The caches are intended to eliminate bandwidth waste by efficiently supporting large sequential transfers that may take place over multiple memory requests issued by the clusters.

7.4 Corflow

The Corflow tool combines the cluster RTL design, the CONECT network-on-chip generator, and the CoRAM Control Compiler (CORCC) from Chapter 4 into a single flow that converts Cor-C applications into functioning CoRAM designs. Corflow serves several objectives: (1) to generate soft RTL designs for programming Cor-C applications, (2) to generate non-existent, simulatable FPGA designs for evaluation and comparison, and (3) to offer a proof-of-concept of the feasibility of a software-managed memory abstraction for FPGA-based computing. The major components of the Corflow tool have all been tested extensively with over 10 microbenchmarks and real applications. Over the course of the CoRAM project, two students new to the project were able to successfully develop new applications and microbenchmarks using Corflow.

Figure 49: Corflow Method of Compilation.

Figure 49 illustrates the entire flow, beginning from an application's source-level description down to the final configuration bits on the FPGA. The initial steps begin with translation of the Cor-C program into synthesizable RTL using CORCC. In conjunction with this step, the application core logic component (e.g., Verilog) is parsed separately to collect all elaborated instances of Cor-C objects such as embedded CoRAMs, channel FIFOs, and channel registers. Each object's Cor-C parameters (e.g., THREAD, OBJECT ID) are linked with the co-handles extracted from the control threads. Following the identification of co-handles, Corflow instantiates clusters, an NoC, and a cache memory subsystem based on user-guided specifications. A post-processing step for the Verilog creates shadow ports that route and connect CoRAMs and channel objects nested within the core logic to the generated clusters and synthesized control threads. Figure 50 shows how embedded CoRAM and channel objects can exist at any level of a nested hierarchy and are wired automatically to the top-level design.

7.4.1 Command Line Parameters and Options

Invoking the Corflow tool requires passing several parameters as follows:

    corflow -spec=<spec> -vdir=<vdir> -top=<top> -cfile=<cfile> -mfile=<mfile> -bdir=<bdir>

Figure 50: Creating Shadow Ports.

The -vdir flag specifies the directory containing the files that implement the application's core logic (e.g., Verilog, VHDL), while -cfile points to the top-level control thread program that “wraps” the core logic. The specification parameter (-spec) takes as input a plaintext file with the key parameters listed in Table 15. When executed, the Corflow tool generates a workspace directory and invokes the Bluespec compiler to compile various backend components such as the cluster logic and NoC into synthesizable Verilog. In the generated design, each of the major sub-components of CoRAM resides in a separate clock domain buffered by FIFO-based synchronizers: (1) core logic, (2) cluster/NoC, (3) control threads, and (4) I/O, DRAM, and edge caches. For soft CoRAM designs generated by Corflow, components 1, 2, and 3 share a common clock.

Parameter | Description
base ram width | Baseline CoRAM data width
memory data width | Cluster-to-NoC link width
memory addr width | Virtual address width
max cluster size | Maximum CoRAMs per cluster
max reqsize bytes | Maximum request size in memory control actions
num tags | Number of logical transaction tags per cluster
num ptags | Number of physical tags per cluster
num mc | Number of memory controllers
ports per mc | Number of ports and cache banks per memory controller
flit buffer depth | Flit buffer depth in the network
mem ileave bytes | Memory port interleaving granularity
stack depth | Maximum stack depth in control threads
debug id width | Width of debugging identifier
platform | Backend target for Corflow
dram delay cycles | Native DRAM delay in I/O clock cycles
dram hex file | Simulated DRAM file
simulate cpi | Simulate control threads as cores (when set to nonzero)
sysclk multiplier | NoC/cluster clock multiplier relative to FPGA clock
tclk multiplier | Control thread clock multiplier relative to FPGA clock
ioclk multiplier | I/O clock multiplier relative to FPGA clock
fpga clk ghz | FPGA clock frequency
topology | Network-on-chip topology
group by instance | Cluster mapping policy
misses per bank | Maximum number of outstanding misses per cache bank
hard cluster cap | Simulate a fixed number of clusters in a hard CoRAM design

Table 15: Specification File in Corflow.

Figure 51: ML605 Prototype (a Microblaze-based I/O subsystem and a Xilinx MIG memory controller bridged through edge caches and NoC routers to the CoRAM clusters and compute elements).

7.5 Xilinx ML605 Board

Corflow currently includes an experimental backend that targets Cor-C applications to the Xilinx ML605 platform [110] (see Figure 51). The ML605 is a stand-alone FPGA board that hosts a single Virtex-6 LX240T FPGA. The FPGA is connected to a single DDR3 DIMM that is managed by a soft DDR3-400 memory controller generated by the Xilinx MIG software tool [111]. When generating a backend, Corflow uses the specification file along with the Cor-C control thread program to automatically create a memory hierarchy comprising the network-on-chip, the clusters, and the core logic of the application, as shown in the example of Figure 51 (bottom). In the ML605 backend, the edge memory cache controllers, which are normally assumed to be hardened (as discussed in Chapter 6), are instantiated in soft logic. To interface the NoC generated by CONECT to the memory controller, a platform adapter was developed that translates memory commands received from the clusters into memory messages according to the request-response protocol employed by the memory controllers generated by MIG.

To interconnect all the components, Corflow generates two pcores (in Xilinx parlance), which are IP cores in the Xilinx EDK tool flow that allow easy insertion of modules into an existing system-on-chip. The system-on-chip comprises a single Microblaze core connected to an ARM AXI bus with peripheral devices such as RS232. In a normal stand-alone system generated by Xilinx EDK, the memory controllers are directly attached to the AXI bus. Corflow instead introduces a bridge component that allows memory transactions on the AXI bus to traverse the NoC generated by CONECT in order to access the memory controller. This arrangement ensures that memory accesses from the Microblaze and the control threads target the same shared address space.

Component | Language | Lines | Design Time
Simulator | Bluespec and C | 10000L | 6 months
Cluster Design | Bluespec | 8300L | 8 months
Cache Subsystem | Bluespec | 2200L | 2 months
CORCC Compiler | C++ | 6000L | 4 months
CONECT Tool ([83]) | Bluespec | 5300L | 2 months
Microbenchmarks | Bluespec and Cor-C | 1000L | 6 months
Applications | Bluespec and Cor-C | 5000L | 12 months

Table 16: Corflow Design Statistics.

7.5.1 Tool Release and Future Plans

The Corflow tool is being released to the public for evaluation and experimentation. By disseminating the infrastructure and source code, the Cor-C standard can serve as a starting point for more refined tools and for an agreed-upon memory standard within the community. Table 16 presents a few statistics of the general infrastructure. There are plans to create additional backends for Corflow in the future, including support for the Convey HC-1 [27] and for Altera devices [11].

Chapter 8

Evaluation

A fool is a man who never tried an experiment in his life.

Erasmus Darwin

Our evaluation of the CoRAM architecture considers several key questions posed in this thesis: (1) What is the performance and efficiency of applications developed using control threads? (2) What are the relative merits of hard versus soft implementations of CoRAM in the cluster-style microarchitecture? and (3) How scalable and portable is CoRAM across current and future generations of FPGAs? The experiments and studies conducted in this chapter examine these questions across multiple dimensions:

• Performance Characterization. We characterize system throughput and latency as a function of varying CoRAM microarchitecture designs, ranging from soft to hard CoRAM, and from synthesized control threads to hardened microcontrollers.

• Synthesis Characterization. We characterize the area, frequency, and power of different CoRAM designs and quantify the gaps in efficiency between hard and soft CoRAM.

• Application Performance, Efficiency, and Scalability. We evaluate application performance and area efficiency across various hardware designs. We also explore application scalability across multiple generations of FPGAs.

• CoRAM Versus Conventional. We compare our results against idealized application implementations on conventional FPGAs.

The results from our studies support several key conclusions:

• For the applications we studied, CoRAM as an abstraction does not limit the performance potential of FPGA-based applications and can achieve performance and efficiency comparable to, if not better than, manual approaches to memory management.

• Soft logic implementations of CoRAM incur high performance and area penalties stemming from low clock frequencies and resource sharing between the user application and the infrastructure.

• Control threads, whether synthesized into soft logic or implemented as hardened microcontrollers, do not limit the performance potential of the applications studied.

• The cluster-style microarchitecture devised in Chapter 6 for CoRAM scales well to the projected FPGA capacity and memory bandwidth of the 16nm technology node.

8.1 Experimental Setup

Performance Modeling. All performance results are collected from simulations of RTL generated by the Corflow tool as described in Chapter 7. Given an application or microbenchmark description in Cor-C along with a specification file, Corflow automatically generates a hardware implementation in synthesizable register-transfer-level Verilog (see Chapters 4 and 7). Depending on the specification, the emitted RTL models one of two system styles: (1) the core logic of the application combined with soft logic clusters and a CONECT-generated NoC that targets conventional FPGAs (designated throughout this section as S-CoRAM), and (2) the core logic of the application hosted within soft logic fabric that instantiates hardened clusters and an NoC built into the fabric of the FPGA itself (designated as H-CoRAM). In all experiments, the core logic of all applications and microbenchmarks is pre-compiled into synthesizable Verilog using Bluespec v2011-06D [17]. The Cor-C specification is compiled by CORCC, which is automatically invoked by the Corflow tool.

Terminology. In many experiments, we are interested in the performance and efficiency of S-CoRAM versus H-CoRAM. To model S-CoRAM in our Verilog simulations, a single FPGA clock is used for the core logic, the clusters, and the NoC, designated as the S-clock throughout this section. In the H-CoRAM simulations, the clusters and NoC are simulated in a separate clock domain scaled to multiples of the S-clock (designated as the H-clock). The implementation of control threads is independent of H- or S-CoRAM. In our experiments, control threads that are synthesized directly into soft finite state machines are designated as S-FSM. Control threads that execute on soft microprocessor cores are designated as S-Core, while hard cores are designated as H-Core. In all experiments, soft and hard cores are modeled as idealized processors that can execute one LLVM instruction per clock cycle (i.e., IPC=1). More details on microprocessor modeling can be found in Chapter 4. To collect performance statistics, we generate formatted traces using Verilog $display statements and perform trace-based post-processing with Python. Our tracing infrastructure allows us to collect detailed statistics on user-defined events such as counts, delays, and inter-arrival distributions.

Area, Frequency, and Power Estimation. All FPGA area and performance results are collected using Xilinx XST 13.1 [7]. Unless otherwise noted, all FPGA synthesis experiments are configured to target the Xilinx Virtex-6 LX760 (speed grade 2). The Xilinx Power Estimator tool is used to calculate power consumption of the fabric. To attain ASIC area and power estimates, we use Synopsys Design Compiler vC-2009.06-SP5 configured with a commercial 65nm standard cell library; power estimates are based on DC's report_power option. Due to the lack of a memory compiler for the 65nm library, CACTI 6.5 [97] is used to estimate the area, power, and latency for all memories in ASIC designs.

8.2 Microbenchmark Characterization

The characterizations of this section aim to answer the question: relative to a conventional memory subsystem on the FPGA, what is the penalty in raw throughput and latency (if any) when utilizing an implementation of the CoRAM high-level abstraction? When speaking of penalties, we refer to the latency of memory accesses relative to the raw DRAM access delay and the effective throughput between the fabric and the edge memory interfaces (i.e., bandwidth). Table 17 reports the parameters of the simple hardware systems generated by Corflow for this set of experiments. The system is configured with only a single cluster and control thread. The network-on-chip is configured as a simple crossbar, while a single memory controller is configured with four cache ports. Both the NoC and the memory controllers are provisioned with sufficient bisection bandwidth and memory bandwidth, respectively, to avoid becoming the performance bottleneck.

Figure 52: Single Cluster Throughput Characteristics. [Five panels plot memory-to-cluster throughput in bytes per FPGA cycle (up to the 16B/cycle cluster link limit) against memory transfer size (4B to 512B) for co-handle widths of 4B, 8B, and 16B. The panels cover a synthesized control thread (Tclk=1X S-clk), a soft core (IPC=1, Tclk=1X S-clk), hard cores at Tclk=4X and 8X S-clk, and a hard core at Tclk=8X S-clk with Hclk=4X S-clk; annotations mark the control-limited and CoRAM-limited regimes and the cluster link bandwidth.]

8.2.1 Latency Characterization

In the first experiment, we measure the latency of a single memory access generated by the following control thread program:

    cpi_tag tag = cpi_nb_write_ram(ramA, 0, 0, 4, CPI_INVALID_TAG);
    cpi_wait(ramA, tag);

Component            Parameter
Base CoRAM width     32b
Clusters             1
Control Threads      1
Cluster link width   128
Network              Crossbar
DRAM delay           60ns
Memory controllers   1
Memory ports         4
Memory I/O speed     1.6GHz
Cluster + NoC        Soft-200MHz, Hard-1.6GHz
Control              Soft synthesized FSM-200MHz, Hard core-800MHz (CPI=1)

Table 17: Configuration Used in Single Cluster Characterization.

Table 18 reports the cycle-by-cycle timeline of a memory-to-cluster access generated by the code above. Each column reflects a different configuration, ranging from S-CoRAM/S-FSM to H-CoRAM/H-Core. Events are reported in FPGA cycles (assuming 200MHz) and absolute time (ns). As Table 18 shows, the delay of a single control action can vary considerably depending on the implementation. In S-CoRAM/S-FSM, the total round-trip delay between the moment a thread issues a non-blocking control action and when the thread unblocks is 52 FPGA cycles (260ns), compared to the raw DRAM delay of about 12 FPGA clock cycles.

As shown, S-CoRAM introduces a significant overhead relative to the raw DRAM latency due to the added stages of the cluster logic, the network-on-chip, the buffers, and various arbitration stages. As will be discussed in Section 8.4, the increased latency can have a detrimental effect on latency-sensitive applications. Column 2 shows how the delay can be mitigated by hardening the cluster and NoC logic and operating it at multiples of the FPGA clock rate. H-CoRAM operating at 1.6GHz, for example, reduces the latency to 26 FPGA cycles, with the remaining overhead due to the control thread and DRAM latency. Column 3 shows that the latency can be reduced even further to 19 cycles (95ns) with a hardened microcontroller (H-Core) that also operates at 1.6GHz. As shown in Section 8.4, H-CoRAM is sufficient to allow memory-intensive applications such as SpMV to operate at peak performance potential.
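The round-trip figures in Table 18 follow directly from the 200MHz FPGA clock. As a worked check of the S-CoRAM/S-FSM column (using the 60ns DRAM delay of Table 17):

\[
t_{cycle} = \frac{1}{200\,\mathrm{MHz}} = 5\,\mathrm{ns}, \qquad
52 \times 5\,\mathrm{ns} = 260\,\mathrm{ns}, \qquad
t_{DRAM} = 60\,\mathrm{ns} = 12\ \text{cycles},
\]

leaving roughly 200ns (40 cycles) of abstraction overhead per unloaded access.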

Event                                      Soft-200MHz       Hard-1.6GHz       H-1.6GHz, T-1.6GHz
                                           Cycle  Time(ns)   Cycle  Time(ns)   Cycle  Time(ns)
Thread issues memory control action        0      0          0      0          0      0
Cluster receives request                   3      15         2      10         0      0
Cluster issues memory request to network   4      20         2      10         1      5
Memory message exits network               6      30         2      10         1      5
Edge cache lookup (miss)                   7      35         2      10         1      5
Request issued to DRAM controller          8      40         2      10         1      5
DRAM controller responds                   20     100        16     80         14     70
Cache bank begins replay                   21     105        16     80         15     75
Response arrives from network              25     125        16     80         15     75
First word written to CoRAM                34     170        16     80         17     85
Thread detects completion                  52     260        26     130        19     95

Table 18: Timeline of Single Unloaded Access to Memory.

8.2.2 Bandwidth Characterization

In the next experiment, we measure the data throughput between the CoRAM clusters and the edge memory interfaces. The objective is to characterize bottlenecks in the memory subsystem introduced by the cluster-style microarchitecture and control threads. The simple microbenchmark below performs a series of bulk memory transfers through repeated non-blocking memory control actions in a loop. A single tag is used to coalesce all the control actions for a single test at the end of the loop, as described in Chapter 4.

    cpi_hand ramA = cpi_get_rams(W, true /*scatter-gather*/, ...);
    cpi_tag tag = CPI_INVALID_TAG;
    for (int i = 0; i < 8192; i += BLOCK_BYTES) {
        tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag);
    }
    cpi_wait(ramA, tag);

Figure 52 shows the memory-to-cluster throughput as a function of different implementations and with varying W and BLOCK_BYTES values. The two parameters, W and BLOCK_BYTES, define the width of a co-handle and the byte size of the memory transfer, respectively. The W parameter is defined as a multiple of the base CoRAM width, which is 32b in our example. W = 2, for instance, would represent a co-handle with two CoRAMs combined with an aspect ratio of 1024x64b. For all of the graphs shown in Figure 52, the x-axis plots the memory transfer size in bytes, while the y-axis plots the data throughput of the cluster generated by the control thread. Note that the cluster is limited to a peak throughput of 16B/cycle due to the link width between the cluster and the associated network port.

Discussion. Each graph in Figure 52 shows several curves with different values of W, each representing a different logical aspect ratio of co-handles. There are several notable operating regimes in Figure 52. When memory transfer sizes are small, the throughput of the cluster becomes control-limited due to the overhead of executing each control action within the loop by the control thread. In the case of Figure 52a, regardless of the co-handle width, a control thread is only able to sustain a throughput of 1B/cycle when issuing a continuous stream of 4 bytes per control action, which translates to the initiation of a new control action once every four FPGA clock cycles. This limitation is due to the multiple states needed to serially set up and perform a single memory transaction (see Chapter 4). As the transfer sizes increase, the throughput transitions from control-limited to port-limited. In the curve where W=1, the throughput approaches the maximum rate of 4B per clock cycle because only a single 32b-wide embedded CoRAM is being accessed in the co-handle. As W increases to 4, the throughput becomes link-limited at 16B/cycle.

Figure 52b shows the effect of replacing the synthesized control thread with an S-Core of IPC=1. The control-limited throughput now decreases to one control action per 8 clock cycles, requiring even larger memory transfer sizes to amortize the cost of a single control action. Figures 52c and 52d show that the loss in throughput can be reclaimed with an H-Core that operates at multiples of the FPGA clock rate. Note that at H-clock = 4X S-clock, the maximum throughput is attained, and further increases in the H-clock are no longer beneficial, with the cluster logic now becoming the bottleneck. Figure 52e shows the result when both the cluster and the control threads operate at a clock rate 4-8X faster than the FPGA clock itself. In this case, even for small co-handle widths and small access sizes, the throughput immediately approaches saturation.

8.2.3 Summary

Our simple characterization experiments show that a soft implementation of CoRAM can introduce significant overheads in latency (2-3X the raw DRAM delay) and throughput. Our experiments also show that application writers cannot be oblivious to the microarchitectural details of CoRAM if they are to extract full performance from the memory system. In a cluster-style microarchitecture, application writers must be aware of the three operating regimes of a cluster: (1) control-limited, (2) port-limited, and (3) link saturation. Operating in a control- or port-limited regime is suboptimal and would typically require tuning on the part of the application developer to reach link saturation. Link saturation can also be reached by attaching multiple threads to a cluster to perform concurrent memory transfers that increase overall utilization.

In the soft implementation of control threads using CORCC, small memory accesses can also introduce high overheads due to the multiple clock cycles it takes to issue a single control action. However, as shown in our experiments, hardened implementations of CoRAM can easily close the gap between CoRAM and a raw interface to DRAM. Hardening the cluster and NoC logic reduces latency and increases throughput, eliminating the cluster bottleneck from the perspective of the FPGA fabric. Hardening the thread logic as a microcontroller also eliminates the serialization of small accesses and co-handle sizes. The next section will evaluate the effectiveness of CoRAM in real applications.

8.3 Synthesis Characterization

The next set of experiments characterizes the area, power, and frequency of the various building blocks used to implement CoRAM.

Cluster Characterization. Columns 2 and 3 of Table 19 show the FPGA area costs for a single cluster assuming both soft and hard configurations. Note that the estimates only reflect the cost of logic and buffers and do not include the embedded CoRAMs themselves. On a Virtex-6 LX760 FPGA, a soft cluster occupies about 1.04% of the total area (5 KLUTs), comparable to 4X the area of a minimally-configured Microblaze processor [3] or about one-half the area of a soft logic DDR3 memory controller [112]. The design achieves a clock frequency of about 150MHz. The bottom rows of Table 19 further show the FPGA power consumption of a single cluster when utilized at a 50% activity factor.

Column 3 of Table 19 illustrates the significant reductions in area and improvements in performance when synthesizing the same cluster configuration to 65nm standard cells. In absolute die area, the hard cluster consumes 0.738mm², which would displace about 244 LUTs if the soft logic area were traded for a hard CoRAM cluster¹. Further, the hardened cluster would achieve a clock rate of 840MHz without any ASIC-specific tuning, about 5X faster than the soft cluster design. Note also that power consumption is reduced by nearly an order of magnitude between hard and soft.

                                  Cluster                         Mesh Router
                                  Soft (V6, 40nm)  Hard (65nm)    Soft         Hard
LUTs                              4962             -              6002         -
FFs                               2326             -              1144         -
40nm Virtex-6 FPGA Frequency      150MHz           -              160MHz       -
65nm Hard Frequency               -                840MHz         -            610MHz
65nm Hard Logic Die Area          -                0.17mm²        -            0.068mm²
65nm Hard SRAM Die Area           -                0.568mm²       -            0.232mm²
65nm Hard Die Area                15mm²            0.738mm²       18.1mm²      0.3mm²
Hard Die Area (λ²)                14.2B            0.69B          17.1B        0.284B
65nm Hard Logic Power             -                81mW/GHz       -            48.4mW/GHz
65nm Hard Dynamic SRAM Power      -                64.5mW/GHz     -            28.24mW/GHz
65nm Hard Static SRAM Power       -                22.6mW         -            7.8mW
40nm Soft Dynamic Power (LX760)   950mW/GHz        -              970mW/GHz    -
40nm Soft Static Power (LX760)    4.44W            -              4.44W        -

Table 19: Area and Power Characteristics of Single Cluster and Mesh Router.

Mesh Router Characterization. Columns 4 and 5 of Table 19 show the synthesis results for a 2-VC mesh router generated by the CONECT framework (described in Chapter 7). The mesh router is configured with 5 input queues with a depth of 32 entries per buffer. In soft logic, the router consumes an area of 6 KLUTs, about 5X the cost of a minimally-configured Microblaze core [3]. The bulk of the cost is due to the buffers, allocators, and internal crossbar. Column 5 of Table 19 shows the synthesis results for the equivalent mesh router synthesized to the 65nm standard cell library. As expected, the results show that the single mesh router achieves over an order-of-magnitude improvement in both area and power consumption relative to soft logic.

Control Threads. Table 20 shows the FPGA synthesis estimates for several control threads used in our applications. For reference, the synthesized control threads are compared against a soft Microblaze processor core configured with 4kB caches and area-optimized parameters [3]. As can be seen, the majority of the control threads consume less area than a soft Microblaze core, with the exception of the MMM control thread. Note that the control threads also achieve frequencies comparable to the Microblaze. The synthesis results for the control threads should further be treated conservatively, considering how few optimizations have been applied in developing the C-to-RTL generator in CORCC from Chapter 4. In CORCC, all basic blocks of the LLVM intermediate code are automatically unrolled into physical hardware blocks, resulting in maximum area consumption. In practice, state-of-the-art HLS tools can reduce the amount of resources needed through added constraints and scheduling.

¹The estimated LUT area was attained through physical die area measurements of Virtex devices. The estimate of equivalent die area for a single LUT includes the programmable I/O, interconnect, on-die DSPs, and BlockRAMs.

Design                                      Area (LUTs)  Flip-flops  BRAMs  Fmax (MHz)
Microblaze (4kB caches, minimum-area cfg)   1210         973         4      161
Latency Microbenchmark                      158          90          0      344
Throughput Microbenchmark                   155          118         0      345
Matrix Matrix Multiplication                2581         2802        0      192
SpMV 'row' FIFO thread                      544          523         0      204
SpMV 'val' FIFO thread                      729          556         0      201
SpMV 'col' FIFO thread                      729          556         0      201
SpMV 'y' FIFO thread                        699          685         0      269
SpMV 'x' cache thread                       242          316         0      354

Table 20: Control Thread Synthesis Results.

8.3.1 Comparing Area and Power Between Hard and Soft CoRAM

In this section, we project the total area and power costs of hard versus soft CoRAM implementations based on the results listed in Table 19. All area and power estimates are standardized across the FPGA configurations shown in Table 21. The configurations progressively increase in LUT density and memory bandwidth, in accordance with ITRS technology predictions [57] and calibrated against the commercial FPGA technology trends discussed in Chapter 2. The first configuration, small, is based on a state-of-the-art FPGA with characteristics similar to a Xilinx Virtex-6 LX760 [112]. Each subsequent configuration doubles in LUT density relative to the previous generation in accordance with scaling predictions. For each of the design points listed in Table 21, we project both the area and power of adding hardened clusters and a network-on-chip into the FPGA fabric. Table 21 shows the parameters of the cluster memory subsystem and the network-on-chip for each of the configurations. In the case of hard logic, a maximum cluster size of 64 was selected, while the cluster-to-NoC link was configured to 16B.

Parameter                              Small            Medium           Large            Future
Technology                             45nm             32nm             22nm             16nm
Supply Voltage                         1.0V             0.94V            0.88V            0.76V
6-input LUTs (K)                       500              1000             2000             4000
FFs (K)                                1000             2000             4000             8000
Fabric Frequency (MHz)                 200              200              225              250
DRAM Interfaces                        2 x DDR3-1600    4 x DDR3-1600    4 x DDR4-3200    8 x DDR4-3200
DRAM Bandwidth (GB/s)                  25.6             51.2             102.4            204.8
4kB CoRAMs                             1024             2048             4096             8192

Soft Configurations
Max Cluster/NoC Clock                  0.2GHz           0.2GHz           0.225GHz         0.25GHz
Cluster/NoC Link Width                 128              128              128              128
# Clusters                             Variable         Variable         Variable         Variable
# Cache Banks                          4                8                16               32
MSHRs per Bank                         16               16               16               16
# Nodes                                Variable         Variable         Variable         Variable
Topologies                             Ring, Mesh, Xbar (all configurations)
CoRAMs/Cluster                         Variable         Variable         Variable         Variable

Hard Configurations
Max Cluster/NoC Clock                  0.8GHz           0.8GHz           0.9GHz           1GHz
Cluster/NoC Link Width                 128              128              128              128
# Clusters                             16               32               64               128
# Cache Banks                          4                8                16               32
MSHRs per Bank                         16               16               16               16
# Nodes                                20               40               80               160
Topologies                             Ring, Mesh (all configurations)
CoRAMs/Cluster                         64               64               64               64

Area
Fabric Die Area (λ²)                   1430B            2860B            5720B            11440B
Hard Cluster Area (λ²)                 11B              22B              45B              89B
Hard NoC Area (λ²)                     14B              27B              36B              73B
Soft CoRAM Area (%)                    73.5%            73.5%            54.3%            54.3%
Hard CoRAM Die Overhead                1.7%             1.7%             1.4%             1.4%

Power
65nm Hard Cluster Power @ 0.8GHz (W)   2.2              4.4              8.9              17.8
65nm Hard NoC Power @ 0.8GHz (W)       3.3              6.6              8.8              17.7
Tech-normalized CoRAM Power (W)        3.2              4.2              4.4              5.5
Peak Fabric Dynamic Power (W)          50               63               76               82
Worst-Case CoRAM Power Overhead        6%               7%               6%               7%

Table 21: FPGA System Parameters with CoRAM Support.

Parameter                 Small              Medium             Large              Future
Technology                45nm               32nm               22nm               16nm
6-input LUTs (K)          500                1000               2000               4000
FFs (K)                   1000               2000               4000               8000
Fabric Frequency (MHz)    200                200                225                250
DRAM Interfaces           2 x DDR3-1600      4 x DDR3-1600      4 x DDR4-3200      8 x DDR4-3200
DRAM Bandwidth (GB/s)     25.6               51.2               102.4              204.8
4kB CoRAMs                1024               2048               4096               8192
Bscholes Kernel           1,4,8 PEs          1,8,16 PEs         1,16,32 PEs        1,32,64 PEs
                          1,4,8 clusters     1,8,16 clusters    1,16,32 clusters   1,32,64 clusters
                          2,8,16 threads     2,16,32 threads    2,32,64 threads    2,64,128 threads
MMM Kernel                1,2 cores          2,4 cores          4,8 cores          8,16 cores
                          64,128 PEs         128,256 PEs        256,512 PEs        512,1024 PEs
                          6,12 clusters      12,24 clusters     24,48 clusters     48,96 clusters
                          1,2 threads        2,4 threads        4,8 threads        8,16 threads
SpMV Kernel               16 PEs             32 PEs             64 PEs             128 PEs
                          16 clusters        32 clusters        64 clusters        128 clusters
                          48 threads         96 threads         192 threads        384 threads

Table 22: Mapping Bscholes, MMM, and SpMV to Various FPGA Configurations.

Table 21 shows the relative power and area overheads of introducing hard CoRAM into a conventional fabric. For reference, the table also shows the same hard configuration implemented as soft logic (note that this should not be interpreted as a fair comparison because in a soft version of CoRAM, the application only instantiates what is needed, as discussed later in Section 8.4). As can be seen, the hard clusters and the network-on-chip together introduce a modest increase in die area (< 2%) and a modest worst-case overhead in peak power relative to the baseline FPGA fabric (< 7%). Although the results are approximate estimates based on Design Compiler, there is sufficient evidence that a feasible, practical microarchitectural design space exists, with additional headroom for optimization. The synthesized RTL reported in Table 21 is highly tuned for FPGA-based fabrics and not for standard cells. For instance, our FPGA-optimized mesh router consumes relatively high area and power compared to a state-of-the-art design in the same technology node [61]. Nevertheless, our results show that even with less optimized designs, our implementations can achieve modest overheads in power and area relative to the baseline fabric.

8.4 Application Evaluation

This section presents a detailed architectural evaluation of the three FPGA-based applications developed in this thesis (see Chapter 5): Matrix Matrix Multiplication (MMM), Black-Scholes (Bscholes), and Sparse Matrix-Vector Multiplication (SpMV). In our performance studies, we compare the applications across multiple dimensions: (1) the multiple FPGA configurations shown in Table 22, (2) network topology, and (3) soft versus hard logic implementations of the cluster microarchitecture. The network topologies we consider in the studies are: (1) the bidirectional ring, (2) the 2D mesh, and (3) the crossbar. The control threads used in our experiments are synthesized directly to FSMs using CORCC (see Chapter 4). The architectural simulations for S-CoRAM are configured with an optimistic clock frequency of 200MHz for the soft cluster and NoC designs, assuming that further increases in the soft clock rate can be achieved with additional tuning and floorplanning relative to the synthesis results presented in Section 8.3.

Figure 53: MMM Performance Trends. [Four panels (a)-(d) plot GFLOPs/sec for the small, medium, large, and future FPGAs, with crossbar, mesh, and ring curves across increasing implementation clock rates (e.g., S-200MHz through H-1.6GHz on the small FPGA and S-250MHz through H-2.0GHz on the future FPGA).]

8.4.1 Matrix-Matrix Multiplication

The MMM kernel we measure is scaled to each FPGA reported in Table 22. The MMM kernel is subdivided into multiple cores, where each core is a single double-buffered kernel with 64 PEs as discussed in Chapter 5. Each core employs a total of four clusters managed by a single control thread program. The four clusters correspond to the embedded CoRAMs of the A, B, C0, and C1 sub-blocks as described in Chapter 5.1. The cores cooperate to solve a single large matrix problem and execute asynchronously with respect to one another. Each core's control threads are pre-programmed at compile time to operate on disjoint sub-blocks of C (where each sub-block is a 64x64 square). The blocked MMM kernel we measure is double-buffered and concurrently performs a phase of computation while reading the data for the next phase and writing back data from the previous phase. The performance measurements we take are based on steady-state throughput (GFLOPs/sec) of the MMM kernel, including the non-overlapped time spent waiting on memory. We measure a single iteration of computation.

Figure 54: GEMM Area and Efficiency Trends. [For each FPGA generation (FPGA-500K through FPGA-4M), stacked bars break down area (KLUTs) into network, thread, cluster, and core logic for soft ring, soft crossbar, soft 2D-mesh, hard mesh/cluster, and ideal designs; overlaid curves plot efficiency and a 2X-error-margin efficiency (KLUT per GFLOPs/sec) on the right axis.]
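For reference, the fragment below sketches the shape of such a double-buffered control thread in the style of the Cor-C microbenchmarks of Section 8.2. It is a minimal sketch only: the helper macros (NUM_BLOCKS, BUF_OFFSET, BLK_ADDR) and the synchronization with the PEs are hypothetical placeholders, and the actual MMM control thread program is the one developed in Chapter 5.

    /* Minimal double-buffering sketch; helpers in caps are hypothetical. */
    cpi_hand bufA = cpi_get_rams(W, true /*scatter-gather*/, ...);
    cpi_tag tag = CPI_INVALID_TAG;
    for (int blk = 0; blk < NUM_BLOCKS; blk++) {
        /* Non-blocking prefetch of the next A sub-block into the idle
           half of the buffer while the PEs compute on the other half. */
        tag = cpi_nb_write_ram(bufA, BUF_OFFSET(blk + 1), BLK_ADDR(blk + 1),
                               BLOCK_BYTES, tag);
        /* ... PEs consume the sub-block fetched in the previous iteration ... */
        cpi_wait(bufA, tag);   /* ensure the prefetch landed before swapping */
    }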

Performance. Figure 53 shows the performance trends in MMM across multiple FPGA design generations. In all the graphs shown in Figure 53, points along the x-axis are labeled by the CoRAM implementation style, where S-Freq refers to S-CoRAM operating at frequency Freq, and H-Freq refers to H-CoRAM. Beginning with the small FPGA in Figure 53a, designs based on the crossbar and the mesh achieve peak compute-bound performance across S- and H-CoRAM. The only design that does not achieve peak performance is the S-CoRAM ring, which has the lowest bisection bandwidth at 12.8GB/sec (exceeded by the available memory bandwidth). The S-CoRAM ring suffers from increased contention and latency, which translates into memory stall times that cannot be completely overlapped by computation in the MMM kernel. This leads to a 26% degradation in performance in S-CoRAM. The hardened ring network at H-400MHz, however, doubles the bisection bandwidth and is able to achieve within 4% of peak performance.

However, across all design points employing either the mesh or the ring, H-CoRAM operating at

2X or higher the clock rate of S-CoRAM is sufficient to achieve peak performance potential.

Area Efficiency. At first glance, it would appear that for the mesh and crossbar, S-CoRAM is comparable in performance to H-CoRAM across the design points listed in Figure 53. Peak performance alone, however, cannot be used to make a comparison. A more important figure of merit for FPGAs is performance normalized to area. Figure 54 shows the area breakdown of the various sub-components for all the design points, comprising the core logic, the cluster, the NoC, and the control threads. Within each FPGA category, the soft designs are placed side by side with an implementation of H-CoRAM whose clock rate is 4X that of S-CoRAM. The right axis of Figure 54 takes the measured performance of each design point and normalizes it to KLUTs (reported in log scale). Note that the estimates do not factor in the increased die area that results from hardening the clusters and the NoC; however, as shown earlier in Section 8.3.1, the area and power overhead of adding dedicated CoRAM support is modest, requiring less than a 2% increase in die area. Once area is factored in, it is apparent from Figure 54 that a significant efficiency gap of at least 2X exists between the best possible S-CoRAM implementation and H-CoRAM. The soft logic designs incur a high area penalty from the soft NoC and the cluster logic. In all of the designs, the NoC and clusters contribute the largest sources of overhead in the soft designs relative to the core logic. It should be noted that the control threads contribute a relatively small overhead relative to the core logic (less than 4% area). The 2XErrorMargin bar further shows that even if the soft logic area overhead were halved, a substantial gap in efficiency would still exist between H- and S-CoRAM.

An important question that merits discussion is how the efficiency results would compare against a manual approach to MMM on a conventional FPGA. The right-most design point of each category in Figure 54 shows the hypothetical efficiency of an idealized MMM core that operates at peak performance potential and incurs no overhead relative to the core logic. In this case, the core logic constitutes only the control logic, the floating-point datapath, and the neighbor exchange datapath. As can be seen, H-CoRAM achieves comparable if not equal efficiency to the idealized version of MMM. The results from MMM suggest that H-CoRAM is effective at matching the performance and efficiency of manual approaches to memory management on the FPGA.

8.4.2 Black-Scholes

Figure 55: Black-Scholes Performance Across Network Topologies (small FPGA). [Panels (a) ring, (b) mesh, and (c) crossbar plot GOptions/s for 1, 4, and 8 PEs at S-200MHz, H-400MHz, H-800MHz, and H-1.6GHz.]

The Black-Scholes kernel differs from MMM in that it is more likely to be a bandwidth-bound application rather than compute-bound. Figure 55 shows the performance scaling trends for the small FPGA across multiple networks and numbers of instantiated PEs. Like we saw in MMM, the ring network experiences difficulty scaling unless the clock rate of H-CoRAM is about 8X that of S-CoRAM. Across most design points, performance saturates when transitioning from 4 to 8 instantiated PEs. For the same set of design points, Figure 56 shows the off-chip bandwidth consumption corresponding to each performance point. As can be seen, the Bscholes kernel at 8 PEs is bandwidth-limited due to the saturation of the off-chip memory bandwidth.

Figure 56: Black-Scholes Bandwidth Usage Across Network Topologies (small FPGA). [Panels (a) ring, (b) mesh, and (c) crossbar plot the percentage of off-chip bandwidth busy versus idle for 1, 4, and 8 PEs at S-200MHz through H-1.6GHz.]

Figure 57 illustrates the trends in scalability across all FPGA design points. As we observed before in MMM, both the mesh and the crossbar closely track each other in performance across design generations. A surprising result we found in Bscholes is that increasing the clock frequency of H-CoRAM did not improve performance monotonically. For example, in Figure 57b, the crossbar design drops slightly in performance by 2% when doubling the clock frequency from H-800MHz to H-1.6GHz. These performance drops appeared to be an artifact of the clock rates of the network-on-chip situated between the edge memory caches and the memory controllers. The increased clock caused slight re-ordering of miss references to the cache controllers, in some cases resulting in different effective performances. However, in most cases, the fluctuations are relatively negligible compared to the general performance trends when transitioning from S-CoRAM to H-CoRAM and between network topologies.

Figure 57: Black-Scholes Performance Trends. [Panels (a)-(d) plot GOptions/sec for the ring, mesh, and crossbar on the small (8 PEs), medium (16 PEs), large (32 PEs), and future (64 PEs) FPGAs across increasing S- and H-clock rates.]

An unusual property of the Bscholes kernel is that its data structures are laid out on non-aligned boundaries of 20B per option. Non-aligned accesses tend to be less efficient because the cluster link width does not exactly align with the data size. This often requires multiple issued requests with certain bytes masked off to perform the whole transfer. Further, the cluster logic may experience inefficiencies in the memory-to-CoRAM datapath due to non-power-of-two writes of bulk data to individual target CoRAMs (see Chapter 6 for more details). The inefficiencies caused by non-aligned accesses are highlighted in Figure 57c. For instance, even a single PE at H-400MHz is unable to achieve full throughput due to inefficient usage of the cluster logic. Note, however, that these inefficiencies are offset by increasing the clock rate sufficiently so that application performance is not impacted (i.e., H-800MHz or above).

Area Efficiency. Figure 58 shows Bscholes performance normalized to area.

Figure 58: Black-Scholes Area and Efficiency Trends. [Stacked bars break down area (KLUTs) into network, thread, cluster, and core logic, with efficiency and 2X-error-margin efficiency curves in MOptions/sec per KLUT, for 8 to 64 PEs across soft ring, soft crossbar, soft 2D-mesh, hard mesh/cluster, and ideal designs on the small through future FPGAs.]

Like we saw in MMM, a substantial gap in efficiency exists between the S- and H-CoRAM implementations. On the small FPGA, S-CoRAM with the soft crossbar achieves an efficiency of about 3.1 MOptions/sec per KLUT, while H-CoRAM with a mesh achieves about 6.8 MOptions/sec per KLUT. For comparison, the ideal efficiency, which considers the cost of the compute logic alone, is 7.0 MOptions/sec per KLUT. As before, the H-CoRAM designs achieve performance efficiency comparable to an ideal design and are about 2X or better relative to S-CoRAM. Even in a bandwidth-limited application like Bscholes, increased area efficiency is still beneficial because reconfigurable logic can be freed up to perform other tasks, or the required die size of the device can be reduced.

8.4.3 Sparse Matrix-Vector Multiplication

The SpMV results we present in this section are measured across a collection of Compressed Sparse Row (CSR)-formatted inputs. Table 23 shows the inputs and their descriptions, which are drawn from the University of Florida Sparse Matrix Collection [33]. Our implementation of SpMV comprises a parameterized linear array of PEs, with each PE capable of fetching, computing, and writing the outcomes of independent rows as discussed in Chapter 5. Table 22 shows the number of PEs scaled across multiple FPGA configurations. Each PE is mapped to a single cluster configured with a 16B link to the NoC. Four control threads are attached to each cluster, corresponding to the memory personalities associated with each PE: the x-cache, the address FIFO, the value FIFO, and the output FIFO. Our measurements are collected from 10,000 cycles of steady-state throughput after a warmup period of 10,000 cycles.

Input             Rows     Non-zeros  Description
cant              62451    2034917    FEM cantilever
consph            83334    3046907    FEM concentric spheres
cop20k            99843    1362087    Accelerator cavity design
mac_econ_fwd500   206500   1273389    Macroeconomic model
mc2depi           525825   2100225    2D Markov model of epidemic
pdb1HYS           36417    2190591    Protein data bank 1HYS
qcd5_4            49152    1916928    Quark propagators (QCD/LGT)
scircuit          170998   958936     Motorola circuit simulation
shipsec1          140874   3977139    FEM ship section/detail
webbase1M         1000005  3105536    Web connectivity matrix

Table 23: Matrices Used in Sparse Matrix-Vector Multiplication Experiments.

Figure 59 shows the performance of SpMV reported in GOPS/s averaged across all inputs in Table 23. Similar to MMM and Bscholes, both the mesh and crossbar designs achieve comparable performance, while the ring network suffers from poor scalability in the larger FPGAs.

H-CoRAM versus S-CoRAM. Figure 60 shows the per-input performance for the small FPGA across various S- and H-CoRAM configurations. The x-axis is sorted from the left, beginning with a soft logic implementation with an S-clock of 200MHz, followed by hard implementations with gradually increasing H-clocks. The right-most design point represents idealized performance, which is modeled by an idealistic hard cluster/crossbar design with control threads operating at an H-clock of 6.4GHz (i.e., no further performance improvements were observed by increasing the H-clock further).

The results show that the performance of SpMV is highly input-dependent. For well-behaved inputs such as cant and consph, the ideal performance is quickly achieved when H-clock ≈ 2× S-clock. At this operating point, the effective x-cache miss latency approaches the raw DRAM latency, and any further improvements in the clock amount to diminishing returns in performance. For other, more memory-intensive inputs such as web and cop20k, the H-clock must reach about 4× the S-clock to approach near-ideal performance. This is because such inputs exhibit much more irregular memory access patterns, which result in increased miss rates in the x-cache.

Figure 59: SpMV Performance Trends. [Panels (a)-(d) plot GOPS/sec for the crossbar, mesh, and ring on the small (16 PEs), medium (32 PEs), large (64 PEs), and future (128 PEs) FPGAs across increasing S- and H-clock rates (e.g., S-200MHz through H-1.6GHz on the small FPGA and S-250MHz through H-2GHz on the future FPGA).]

Performance Breakdown. Figure 62 shows detailed graphs comparing the timing and bandwidth characteristics of a well-behaved SpMV input (cant) versus a poorly-behaved input (cop20k) running on a small FPGA configured with a mesh network. Data points along the x-axis show near-linear scalability in performance with the number of instantiated SpMV PEs. All of the hard implementations show linear scalability with the number of PEs until the off-chip memory bandwidth is saturated (see Figure 62c). The timing breakdown of Figure 62b explains the performance gap between cant and cop20k. The categories show the percentage of time spent by each PE either waiting on memory stalls (indicated in blue or red) or performing computation (indicated in black).

Cop20k experiences increased stall time due to a higher rate of x-cache misses. Each cache miss incurs the full latency of having a control thread detect the miss, issue a control action, and having the transaction traverse the cluster logic and the network-on-chip, plus the DRAM latency, as described earlier in Section 8.2.1. As the hard clock increases to 6.4GHz, the exposed x-cache stall becomes dominated by DRAM latency alone.


Figure 60: SpMV Per-Input Performance Trends. [Panels (a) mesh, (b) ring, and (c) crossbar plot per-input GOPS/sec for 16 PEs on the small FPGA, for each input of Table 23 plus the mean.]

Figure 61: SpMV Area and Efficiency Trends. [Stacked bars break down area (KLUTs) into network, thread, cluster, and core logic, with efficiency and 2X-error-margin efficiency curves in MFLOPs/sec per KLUT (log scale), for 16 to 128 PEs across soft ring, soft crossbar, soft 2D-mesh, hard mesh/cluster, and ideal designs on the FPGA-500K through FPGA-4M configurations.]

Area Efficiency. Figure 61 shows the area consumption of SpMV for different design points. Compared to the previous results for MMM and Bscholes, the overhead of CoRAM is substantial relative to the core logic, even in H-CoRAM. This result is unsurprising given that SpMV is a highly memory-intensive application and that very few compute resources are needed to saturate the available memory bandwidth. The gap between S-CoRAM and H-CoRAM is also wider in SpMV than in Bscholes and MMM. On the small FPGA, the best S-CoRAM design achieves 8.1 MFLOPs/sec per KLUT, while H-CoRAM achieves about 39.4 MFLOPs/sec per KLUT. The nearly 5X gap in efficiency is attributed to the lower clock-by-clock performance of S-CoRAM as well as the added overhead of the clusters and NoC in soft logic.

On average, H-CoRAM achieves about 65% performance efficiency relative to an idealized implementation of SpMV that operates at bandwidth-limited performance and without the cost of any infrastructure needed to route data operands to and from the memory interfaces. The bulk of the inefficiency of H-CoRAM is attributed to the overhead of the soft synthesizable control threads generated by CORCC. Recall from Chapter 5 that each PE of SpMV requires four threads to manage independent Stream FIFOs. Although we do not present concrete results, the main source of overhead in the control threads is the use of 32b-wide data types, which incur a high cost in muxing and arithmetic units. In a more optimized experimental version, the use of 16b short integers can reduce the area of the control threads by nearly half, which would bring the relative efficiency to within 80% of an idealized implementation.

          Bscholes (MOptions/s per KLUT)   MMM (GFLOPs/s per KLUT)     SpMV (MFLOPs/s per KLUT)
          S-CoRAM  H-CoRAM  Ideal          S-CoRAM  H-CoRAM  Ideal     S-CoRAM  H-CoRAM  Ideal
Small     3.1      6.8      7.0            0.2      0.3      0.31      8.1      39.4     59
Medium    3.1      6.9      7.2            0.19     0.32     0.31      4.7      38.8     58.2
Large     3.1      7.2      7.4            0.16     0.35     0.35      4.8      37.6     56
Future    2.5      5.6      5.7            0.18     0.39     0.39      3.2      31.5     47.4

Table 24: Summary of Area-Normalized Performances Across Workloads.

8.5 Summary

Table 24 summarizes the key results of our study, comparing the best efficiencies of S-CoRAM versus H-CoRAM. For Bscholes and MMM, the relative efficiency gap between S- and H-CoRAM is about 2-3X. In SpMV, the gap widens to about 5-10X, depending on the FPGA configuration.

Effectiveness of Software Abstraction. A major premise of this thesis is that a software-based abstraction for FPGA-based computing should not “prevent” optimized applications from achieving peak performance potential. We have demonstrated in our results that a portable, scalable software-based abstraction for managing the memory resources of the FPGA is indeed plausible. We have shown through RTL simulations of prototypes that hardened implementations of CoRAM can achieve performance and efficiency comparable to idealized implementations of applications. Another key result is that the separation of computation and memory is feasible without penalizing the efficiency or performance of applications.

Hard versus Soft CoRAM. Despite our best effort to develop an efficient soft logic implementation of CoRAM, a significant gap in performance and efficiency of 2-10X still exists between soft and hard implementations of CoRAM. The soft CoRAM designs incur a high area penalty as a result of resource sharing between the core logic and the memory subsystem within the fabric. Further, the high latency introduced by general-purpose subcomponents can have a negative impact on latency-sensitive applications. Most importantly, we have shown that hard CoRAM comprising the cluster logic and the network-on-chip can be implemented with modest impact on area (less than 2%). For FPGAs in the future that employ CoRAM as an abstraction, hard logic is a highly cost-effective and efficient microarchitectural feature that can benefit a wide array of applications.

Network Topology. In our studies, we found that a 2D-mesh provided performance comparable to a full crossbar across all design points. A 2D-mesh is a particularly good match for the spatially distributed nature of FPGAs and was shown to be scalable for the futuristic FPGA designs presented in Table 10.

Application Portability and Scalability. In the examples from our studies, we have shown that an application written with CoRAM in mind is scalable to a variety of different FPGA designs without requiring modifications to the source code. The only change required between designs was increasing the number of instantiated elements along with the embedded CoRAMs employed by the application. We have shown that hard CoRAM is effective and scalable to futuristic FPGA designs with up to 200GB/sec of off-chip memory bandwidth and up to thousands of embedded CoRAMs in a single chip.

Figure 62: Detailed Comparison Between Good and Bad Inputs on SpMV. [For cant and cop20k on the small FPGA with a mesh network, three rows of panels plot GFLOP/s, per-PE time breakdown (compute, ALU drain, ALU wait, cache stall, value-queue stall, scheduler stall, and output-queue stall), and off-chip memory bandwidth usage (busy versus idle), each for 1, 8, and 16 PEs at S-200MHz through H-6.4GHz.]

Chapter 9

Related Work

If I have seen a little further—it is by standing on the shoulders of giants.

Isaac Newton

9.1 Specialized Reconfigurable Devices

The concept of reconfigurable computing has existed since the 1960s (e.g., Estrin [44]) and was the subject of a diverse range of research efforts in the 1990s (e.g., Programmable Active Memories [106], Virtual Computer [19], Splash and Splash 2 [49], and Teramac [14]). In recent years, specialized reconfigurable computing devices such as PipeRench [50, 89], RaPiD [42], and Amalgam [51] proposed reconfigurable fabrics for specialized streaming pipelined computations, while Garp [52], Chameleon [41], DISC [109], OneChip [18], PRISC [86], and Chimaera [119] combined fabrics with conventional processor cores. Designs such as ADRES [75, 102], Tilera [107], RAW [94], MATRIX [77], and MorphoSys [91] fall into the category of coarse-grain reconfigurable architectures, which incorporate coarse-grain functional units (or entire processing cores) rather than fine-grain lookup tables.

9.2 Reconfigurable Memory Architectures

Configurable memory systems have been explored in various settings. Garp [52] is an example that combines a MIPS core and cache hierarchy with a collection of reconfigurable processing elements (PEs). The PEs share access to the processor cache, but only through a centralized access queue at the boundary of the reconfigurable logic. Tiled architectures (e.g., Tilera [107]) consist of a large array of simple von Neumann processors instead of fine-grained lookup tables. The memory accesses by the cores are supported through per-core private caches interconnected by an on-chip network. Smart Memories [73], on the other hand, employs reconfigurable memory tiles that selectively act as caches, scratchpad memory, or FIFOs.

A body of work has also examined soft memory hierarchies for FPGAs (e.g., [121, 38, 62, 79]). The most closely related work to CoRAM is LEAP (Logic-based Environment for Application Programming) [10], which exports a standard, software-like programming environment to the user. The core functionalities of LEAP comprise a library of services and device abstractions that enable FPGA-based core logic to communicate with other software processes and platform-specific devices through standardized interfaces. A notable feature of LEAP is the Scratchpad service, which allows the user to dynamically allocate the on-die Block RAMs to form large storage elements on the FPGA. The scratchpads are accessed by an application using timing-insensitive request-response interfaces to local client address spaces. Beneath the interface, LEAP automatically performs multiple levels of caching through on-die BRAMs, off-chip SRAMs, and DRAM to provide the appearance of a large backing store. Chapter 3 compared the key differences between LEAP and CoRAM.

9.3 High Level Synthesis

The software-based control threads of CoRAM share similarities with traditional high-level synthesis (HLS), which attempts to replace low-level hardware description languages with familiar sequential languages such as C or C++ [45]. A significant body of academic and commercial work exists in this domain, with numerous tools available such as SystemC [5], Catapult-C [76], SpecC [46], and AutoESL [1]. These languages retain the familiarity of sequential programming languages while generating hardware through automatic refinement and synthesis flows. Unlike HLS, a fundamental feature of CoRAM is the decoupling of computation and memory management into core logic and control threads. Whereas traditional HLS requires that computation and memory be expressed in a single language, CoRAM deliberately retains the ability to devise core logic in any RTL language. This deliberate decision is based on the observation that not all applications can be expressed efficiently in sequential languages.

9.4 Architectures and Program Models

The architecture of CoRAM shares similarities with other architectures and software-based programming models. For instance, the idea of decoupling memory management from computation in CoRAM has been explored previously in decoupled, general-purpose processors [92, 25].

The programming model of CoRAM also shares significant similarities with the Partitioned Global Address Space (PGAS) [120] and the Cray SHared MEMory library (SHMEM). The PGAS model establishes a partitioned global address space accessible by all threads in the system. However, PGAS associates with each thread a partition of the global address space with a local affinity. The local affinity encourages programmers to allocate portions of distributed data structures close to the threads that use them, thereby reducing communication costs. CoRAM closely resembles PGAS in that control threads have a global address view and are responsible for pinning frequently accessed private data to the embedded CoRAMs.

The Cray SHMEM model provides a programming model in a hybrid form of message passing and shared memory. Each processor in SHMEM observes a logically global shared memory view but can only access remote data through explicit put and get operators to a global address space. The explicit get/put operators are similar to the control actions used by control threads to access bulk data.
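The resemblance is easiest to see side by side. The SHMEM call below uses the standard shmem_getmem interface; the Cor-C line reuses the control action from the microbenchmarks of Chapter 8, where our interpretation of the middle two arguments (local CoRAM offset, then global address) is an assumption based on those examples rather than a quoted definition.

    /* SHMEM: explicit bulk get of nbytes from a remote PE's partition. */
    shmem_getmem(local_buf, remote_src, nbytes, pe);

    /* Cor-C: control action moving nbytes from the global address space
       into the embedded CoRAMs behind the co-handle ramA. */
    tag = cpi_nb_write_ram(ramA, local_off, global_addr, nbytes, tag);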

Chapter 10

Conclusions

A conclusion is the place where you got tired of thinking.

Arthur McBride Bloch, author of Murphy’s Laws

As we stand on the brink of multi-billion transistor integration on a single die, FPGAs have emerged as a viable contender in the quest for power-efficient computing. Inspired by recent progress in FPGA-based computing, this thesis investigated a new memory architecture to provide deliberate support for memory accesses from within the fabric of a future FPGA engineered to be a computing device. The CoRAM memory architecture is a new paradigm designed to match the requirements of highly concurrent, spatially distributed processing kernels that consume and produce memory data from within the fabric. To reduce design effort and increase portability, CoRAM allows the application writer to express the memory access patterns of an application using a high-level specification language. This thesis showcased the usage of the CoRAM architecture through the design and implementation of three non-trivial application kernels. The thesis explored a cluster-style microarchitecture design for supporting the CoRAM architecture on a reconfigurable fabric and presented evaluations of the design space, paying attention to the tradeoffs between performance, power, and area. Based on prototyping and experimental results, this thesis concludes that CoRAM as an architecture is a highly compelling and feasible idea that merits further investigation and research.

10.1 Future Work

The investigation presented in this thesis is only preliminary; the following areas merit further work:

Case Studies. The Cor-C specification in Chapter 4 only described a baseline collection of primitives sufficient to support the three applications examined in this thesis. The study of applications with new memory access patterns will help to refine the specification and to identify areas for improvement. The UC Berkeley Dwarves [15], for instance, describe various communication and memory patterns in parallel algorithms, three of which are covered in this thesis; exploring the remaining patterns in CoRAM is a good starting point for future investigations.

Automatic Extraction of Control Threads. Beginning with a single high-level description of an application (e.g., written in C), it should be possible to extract the control flow and memory accesses from the application description and generate the control thread automatically. Performing this extraction would require developing compiler techniques that can separate the computation from the memory accesses within a single program description. Preliminary efforts undertaken by Gabriel Weisz using the LLVM framework have shown that this technique can work for certain styles of C code without loop-carried dependences.
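As a sketch of what such a pass would consume, consider the kind of C loop amenable to this style of extraction; the function and array names here are invented for illustration.

/* Single-source input (illustrative): no loop-carried dependences */
void scale(float *dst, const float *src, int n) {
  for (int i = 0; i < n; i++)
    dst[i] = 2.0f * src[i];
}
/* A compiler pass in the spirit described above would hoist the loads of
 * src into cpi_nb_write_ram transfers (memory -> CoRAM) and the stores of
 * dst into cpi_nb_read_ram transfers (CoRAM -> memory) inside a generated
 * control thread, leaving only the multiply for the synthesized kernel. */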

CoRAM for ASICs. The CoRAM architecture has the potential to simplify and standardize ASIC designs that require access to off-chip memory. The microarchitecture described in Chapter 6 could be viewed as a stylized, general-purpose template for implementing distributed memory accesses. The stylized microarchitecture comprising clusters and the network-on-chip could be generated automatically and synthesized efficiently using standard cells. A potential direction to pursue would be to compare ASIC applications that manually implement support for memory against designs that employ CoRAM.

VLSI Integration. The majority of the studies presented in this thesis were at the microarchitectural level and did not examine low-level VLSI issues such as layout, wiring, and circuit design. Also, the area, power, and frequency estimates reported in this thesis are only approximate since they are collected from early stages of the design flow. Further investigation is needed to pinpoint the costs and overheads of introducing hard CoRAM into traditional fabrics. Preliminary efforts are underway to develop a test chip that will help to answer these questions definitively.

Circuit-Switched Networks. The majority of the networks-on-chip examined in this thesis were based on dynamic, packet-switched architectures such as the mesh and ring. Circuit-switched, multi-stage networks such as Clos networks [24] may offer more cost-efficient solutions, especially in FPGA soft logic. Compared to a crossbar switch, a Clos network requires fewer crosspoints to provide a full connection between input and output ports.
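For reference, the crosspoint counts behind this comparison can be stated as follows (restating the standard result from [24]; $N$ denotes the number of ports):

\[ X_{\mathrm{crossbar}} = N^2, \qquad X_{\mathrm{Clos}} = 2rnm + mr^2 = m\left(2N + \frac{N^2}{n^2}\right), \quad N = rn, \]

for a symmetric three-stage Clos network with $r$ ingress switches of size $n \times m$, $m$ middle switches of size $r \times r$, and $r$ egress switches of size $m \times n$. With $m = 2n-1$ (strict-sense non-blocking) and $n \approx \sqrt{N/2}$, the Clos cost grows as $O(N^{3/2})$, asymptotically cheaper than the $N^2$ crossbar.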

System-on-Chips. Although the CoRAM architecture focuses exclusively on compute applications, it can be just as applicable to embedded development and System-on-Chips (SoCs). Many FPGA-based SoCs employ processor bus architectures (e.g., IBM's Processor Local Bus, ARM's AXI) to provide a standard glue between multiple IP components. For design compatibility, designers must frequently redesign their IP cores to match a particular bus protocol's specification. Further, many of the "soft" busses on the FPGA are simply cloned adaptations of their ASIC counterparts, which makes them sub-optimal in performance and ill-suited to the low operating clock rates of the fabric. The CoRAM architecture could easily be adapted to replace these archaic bus architectures with a scalable, portable abstraction for IP-to-IP and IP-to-memory communication.

Appendix A

Interfaces and Libraries

A.1 Bluespec

interface ChannelFIFO;
  method Bool put_valid();
  method Action put(Bit#(64) din);
  method Bool get_valid();
  method ActionValue#(Bit#(64)) get();
endinterface

interface ChannelReg;
  method Action put(Bit#(64) in);
  method Bit#(64) get();
endinterface

interface RamIfc1#(type idx_t, type data_t);
  (* always_ready *) method Action req(Rnw access, idx_t idx, data_t data);
  (* always_ready *) method ActionValue#(data_t) val();
endinterface

interface Coram#(numeric type ports, type addr, type data);
  interface Vector#(ports, RamIfc1#(addr, data)) p;
endinterface

/* Black-box modules */
module mkChannelReg#(String thread, Uint32 thread_id, Uint32 object_id)(ChannelReg);
module mkChannelFIFOSub#(String thread, Uint32 thread_id, Uint32 object_id, Uint32 sub_id)(ChannelFIFO);
module mkCoram#(String thread, Uint32 thread_id, Uint32 object_id, Uint32 sub_id)
               (Coram#(ports, addr, data))
  provisos(Bits#(addr, addr_nt), Bits#(data, data_nt));

A.2 Cor-C Include Header

/******* Data types *******/
typedef void *             cpi_hand;
typedef const char *       cpi_str;
typedef int                cpi_int;
typedef long long          cpi_int64;
typedef unsigned long long cpi_addr;
typedef unsigned int       cpi_ram_addr;
typedef unsigned short int cpi_tag;
typedef enum { cpi_fifo = 0, cpi_reg = 1 } cpi_channel_ty;

#define CPI_INVALID_TAG         0x8000
#define CPI_ATTR_SINGLE_CLUSTER 0

#define cpi_poll(tag, x) \
  while(1) { \
    cpi_tag __temp__ = x; \
    if(__temp__ != CPI_INVALID_TAG) { tag = __temp__; break; } \
  }

#define cpi_wait(hand, tag) \
  if(tag != CPI_INVALID_TAG) \
    { while(!cpi_test(hand, tag)) {}; tag = CPI_INVALID_TAG; }

#define cpi_write_ram(han, raddr, maddr, words) { \
  cpi_tag _tag_tmp = CPI_INVALID_TAG; \
  _tag_tmp = cpi_nb_write_ram(han, raddr, maddr, words, _tag_tmp); \
  cpi_wait(han, _tag_tmp); \
}

#define cpi_read_ram(han, raddr, maddr, words) { \
  cpi_tag _tag_tmp = CPI_INVALID_TAG; \
  _tag_tmp = cpi_nb_read_ram(han, raddr, maddr, words, _tag_tmp); \
  cpi_wait(han, _tag_tmp); \
}

/******* Thread control actions *******/
void     cpi_register_thread(cpi_str thread_name, cpi_int instances);
cpi_int  cpi_instance();
cpi_int  cpi_time();

/******* Accessor control actions *******/
cpi_hand cpi_get_ram(cpi_int /*object_id*/, ... /*sub_ids*/);
cpi_hand cpi_get_rams(cpi_int /*num_rams*/, bool /*scatter/gather*/,
                      cpi_int /*object_id*/, ... /*subids*/);
cpi_hand cpi_get_channel(cpi_channel_ty /*channel type*/,
                         cpi_int /*object_id*/, ... /*subids*/);

/******* Channel control actions *******/
cpi_int64 cpi_read_channel(cpi_hand /*channel*/);
void      cpi_write_channel(cpi_hand /*channel*/, cpi_int64 /*write data*/);

/******* Nonblocking memory control actions *******/
cpi_tag cpi_nb_write_ram(cpi_hand /*dest ram*/, cpi_ram_addr /*ram addr*/,
                         cpi_addr /*mem addr*/, cpi_int /*words*/,
                         cpi_tag /*tag_append*/, ... /*optional priority argument*/);
cpi_tag cpi_nb_read_ram(cpi_hand /*source ram*/, cpi_ram_addr /*ram addr*/,
                        cpi_addr /*mem addr*/, cpi_int /*words*/,
                        cpi_tag /*tag_append*/, ... /*optional priority argument*/);
void    cpi_bind(cpi_hand /*channel*/, cpi_hand rams);
void    cpi_set_attr(cpi_hand, cpi_int);

/******* Test control actions *******/
bool cpi_test_channel(cpi_hand /*channel*/, bool /*test if writable*/);
bool cpi_test(cpi_hand /*cohandle*/, cpi_tag /*ram tag*/);

/******* Others *******/
void cpi_printf(const char *fmt, ...);
void cpi_split(); // force LLVM to split basic block here

Appendix B

Cache Memory Personality

B.1 Bluespec

module mkCocache#(String thread, Uint32 thr_id, Uint32 obj_id, String name)
                 (Cocache#(`CACHE))
  provisos(`CACHE_PROVISOS, Div#(data_nt,32,wide), Mul#(wide, 32, data_nt));

  Vector#(wide, Coram#(1, Bit#(idx_width), Bit#(32))) data_arr;
  for(Integer i=0; i < valueOf(wide); i=i+1)
    data_arr[i] <- mkCoram(thread, thr_id, obj_id, fromInteger(i));

  Coram#(2, Bit#(tag_idx_width), Bit#(32)) tag_arr <-
    mkCoram(thread, thr_id, obj_id, fromInteger(valueOf(wide)));

  ChannelFIFO cfifo <- mkChannelFIFOSub(thread, thr_id, obj_id, 0);
  ChannelFIFO bfifo <- mkChannelFIFOSub(thread, thr_id, obj_id, 1);

  ///////////// Miss handling //////////////

  Reg#(Bool) miss_pend <- mkConfigReg(False);
  PulseWire miss_pend_en <- mkPulseWire();
  PulseWire miss_pend_dis <- mkPulseWire();

  FIFOCountIfc#(CacheReq_t#(addr_t,data_t),16) replayQ <- mkLUTFIFO(False);
  Count#(Bit#(4),1) replayQ_cnt <- mkCount(0);
  FIFOF#(addr_t) missQ <- mkUGSizedFIFOF(2);
  FIFOCountIfc#(CacheReq_t#(addr_t, data_t),4) inQ <- mkLUTFIFO(False);
  FIFOCountIfc#(data_t, 8) ackQ <- mkLUTFIFO(False);
  Count#(Bit#(3),1) ackQ_cnt <- mkCount(0);

  ///////////// Cache Pipeline State //////////////

  Reg#(CacheReq_t#(addr_t, data_t)) s0_req <- mkRegU;
  Reg#(Bool) s0_valid <- mkReg(False);

  Reg#(CacheReq_t#(addr_t, data_t)) s1_req <- mkRegU;
  Reg#(Bool) s1_valid <- mkReg(False);

  Reg#(CacheReq_t#(addr_t, data_t)) s2_req <- mkRegU;
  Reg#(Bool) s2_valid <- mkReg(False);
  Reg#(data_t) s2_data <- mkRegU;
  Reg#(Tuple2#(CacheBits_t, Bit#(tag_width))) s2_tag <- mkRegU;

  Reg#(CacheReq_t#(addr_t, data_t)) s3_req <- mkRegU;
  Reg#(Bool) s3_valid <- mkReg(False);
  Reg#(data_t) s3_data <- mkRegU;
  Reg#(Bool) s3_hit <- mkRegU;

  PulseWire stat_miss_w <- mkPulseWire;
  PulseWire stat_lookup_w <- mkPulseWire;

  function Bit#(idx_width) get_data_index(addr_t a);
    return truncate(pack(a) >> (valueOf(wd_width)));
  endfunction
  function Bit#(tag_idx_width) get_tag_index(addr_t a);
    return truncate(pack(a) >> (valueOf(blk_width)));
  endfunction
  function Bit#(tag_width) get_tag(addr_t a);
    return truncate(pack(a) >> (valueOf(idx_width)+valueOf(wd_width)));
  endfunction

  Reg#(Command) save <- mkRegU;

  rule updateMissPend(True);
    if(miss_pend_en) miss_pend <= True;
    else if(miss_pend_dis) begin miss_pend <= False; end
  endrule

  rule connect_mem_ack(bfifo.get_valid && miss_pend);
    let ack <- bfifo.get();
    save <= save & ack;
    miss_pend_dis.send();
    Bit#(32) tag_bits = zeroExtend(pack(tuple2(CacheBits_t{ valid:True, dirty:False },
                                               get_tag(missQ.first))));
    tag_arr.p[0].req(Write, get_tag_index(missQ.first), tag_bits);
    missQ.deq();
  endrule

  (* fire_when_enabled *)
  rule work(True);

    /////////////////////////////////
    // stage 4 (start cache miss, issue response)
    /////////////////////////////////

    if(!miss_pend && s3_valid) begin
      if(!s3_hit) begin
        miss_pend_en.send();
        Bit#(32) miss_addr = zeroExtend(pack(s3_req.addr));
        Bit#(32) data_index = zeroExtend(pack(get_data_index(s3_req.addr)));

        missQ.enq(s3_req.addr);
        replayQ.enq(s3_req); // replay missed request
        replayQ_cnt.add(1);
        cfifo.put({data_index, miss_addr});
        stat_miss_w.send();
      end
      else begin
        ackQ_cnt.add(1);
        ackQ.enq(s3_data); // return data
      end
    end
    else begin
      if(s3_valid) begin
        replayQ.enq(s3_req);
        replayQ_cnt.add(1);
      end
    end

    /////////////////////////////////
    // stage 3 (tag-check)
    /////////////////////////////////

    let status = tpl_1(s2_tag);
    let tag = tpl_2(s2_tag);
    let realtag = get_tag(s2_req.addr);
    Bool valid = status.valid;

    s3_hit <= valid && (tag==realtag);
    s3_req <= s2_req;
    s3_valid <= s2_valid;
    s3_data <= s2_data;

    /////////////////////////////////
    // stage 2 (latch RAM responses)
    /////////////////////////////////

    Vector#(wide, Bit#(32)) data_vec;
    for(Integer i=0; i < valueOf(wide); i=i+1) begin
      let d <- data_arr[i].p[0].val();
      data_vec[i] = d;
    end
    data_t data = unpack(pack(data_vec));
    let tag_bits <- tag_arr.p[1].val();
    Tuple2#(CacheBits_t, Bit#(tag_width)) taginfo = unpack(truncate(tag_bits));

    s2_req <= s1_req;
    s2_valid <= s1_valid;
    s2_data <= data;
    s2_tag <= taginfo;

    /////////////////////////////////
    // stage 1 (issue RAM request)
    /////////////////////////////////

    s1_req <= s0_req;
    s1_valid <= s0_valid;

    for(Integer i=0; i < valueOf(wide); i=i+1)
      data_arr[i].p[0].req(Read, get_data_index(s0_req.addr), ?);

    tag_arr.p[1].req(Read, get_tag_index(s0_req.addr), ?);

    /////////////////////////////////
    // stage 0 (latch inputs)
    /////////////////////////////////

    if(replayQ.notEmpty && !miss_pend) begin
      s0_req <= replayQ.first;
      s0_valid <= True;
      replayQ.deq();
      replayQ_cnt.sub(1);
    end
    else if(!miss_pend) begin
      s0_req <= inQ.first;
      s0_valid <= inQ.notEmpty;
      if(inQ.notEmpty) inQ.deq();
    end
    else s0_valid <= False;

  endrule

  interface Server user_ifc;

    interface Put request;
      method Action put(CacheReq_t#(addr_t,data_t) req)
        if(!miss_pend && (inQ.count <= 1) && (ackQ_cnt.value <= 1) &&
           (replayQ_cnt.value <= 1));
        stat_lookup_w.send();
        inQ.enq(req);
      endmethod
    endinterface

    interface Get response;
      method ActionValue#(data_t) get() if(ackQ.notEmpty && !miss_pend);
        ackQ_cnt.sub(1);
        ackQ.deq();
        return ackQ.first;
      endmethod
    endinterface

  endinterface
endmodule

B.2 Control Thread Program

void cache_thread() {
  cpi_register_thread("cache_thread", N_PE);

  DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);

  cpi_hand fifo     = cpi_get_channel(cpi_fifo, 0, 0);
  cpi_hand bfifo    = cpi_get_channel(cpi_fifo, 0, 1);
  cpi_hand data_ram = cpi_get_rams(XCACHE_WIDE, true, 0, 0);
  cpi_hand tag_ram  = cpi_get_rams(1, false, 0, XCACHE_WIDE);
  cpi_addr x_ptr    = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + X_OFFSET);
  cpi_bind(bfifo, data_ram);

#define log2(x) \
  ((x) ==  1) ? 0 : \
  ((x) ==  2) ? 1 : \
  ((x) ==  4) ? 2 : \
  ((x) ==  8) ? 3 : \
  ((x) == 16) ? 4 : \
  ((x) == 32) ? 5 : \
  ((x) == 64) ? 6 : 0

  int log_word_bytes = log2(XCACHE_WIDE * (WORD_WIDTH/8));

  while(1) {
    cpi_int64 message = cpi_read_channel(fifo);
    cpi_addr miss_addr = message & 0xffffffffu & ~(XCACHE_BLK_BYTES-1);
    cpi_int64 data_index = miss_addr & (XCACHE_SIZE_BYTES-1) & ~(XCACHE_BLK_BYTES-1);
    cpi_tag tag = CPI_INVALID_TAG;
    while(1) {
      tag = cpi_nb_write_ram(data_ram, data_index >> log_word_bytes,
                             x_ptr + miss_addr, XCACHE_BLK_BYTES,
                             tag, 1/*hiprio*/);
      if(tag != CPI_INVALID_TAG) break;
    }
  }
}

Appendix C

FIFO Memory Personality

C.1 Bluespec

interface StreamFIFO#(type addr, type data);
  method Action deq();
  method data first();
  method Bool notEmpty();
  method Uint32 get_id();
endinterface

module mkStreamFIFO#(String thread, Uint32 thread_id, Uint32 object_id)
                    (StreamFIFO#(addr_t, data_t))
  provisos(Bits#(addr_t, addr_nt), Bits#(data_t, data_nt),
           Add#(addr_nt, v, 32), Add#(addr_nt, t, 64),
           Div#(data_nt, 32, nr_nt),
           Bits#(Vector#(nr_nt, Bit#(32)), data_nt));

  function makeCoram(Integer i) =
    mkCoram(thread, thread_id, object_id, fromInteger(i));

  Vector#(nr_nt, Coram#(1, addr_t, Bit#(32))) corams <- genWithM(makeCoram);
  FIFOCountIfc#(Bit#(data_nt),32) dfifo <- mkLUTFIFO(False);

  Reg#(Bit#(addr_nt)) tail <- mkConfigReg(0);
  Reg#(Bit#(addr_nt)) head <- mkConfigReg(0);
  FIFO#(void) vfifo <- mkSizedFIFO(16); // valid tokens

  ChannelReg creg <- mkChannelReg(thread, thread_id, object_id);
  ChannelFIFO headQ <- mkChannelFIFOSub(thread, thread_id, object_id, 1);
  ChannelFIFO bindQ <- mkChannelFIFOSub(thread, thread_id, object_id, 2);

  rule updateHead(bindQ.get_valid && headQ.get_valid);
    let x <- bindQ.get();
    let nhead <- headQ.get();
    head <= truncate(nhead);
  endrule

  rule fill_fifo((tail != head) && dfifo.count <= 16);
    for(Integer i=0; i < valueOf(nr_nt); i=i+1) begin
      corams[i].p[0].req(Read, unpack(truncate(tail)), ?);
    end
    tail <= tail+1;
    vfifo.enq(?);
  endrule

  rule fillFifo(True);
    vfifo.deq();
    Vector#(nr_nt, Bit#(32)) vv;
    for(Integer i=0; i < valueOf(nr_nt); i=i+1) begin
      let val <- corams[i].p[0].val;
      vv[i] = val;
    end
    dfifo.enq(pack(vv));
  endrule

  rule update_tail(True);
    creg.put(zeroExtend(tail));
  endrule

  method Action deq();
    dfifo.deq();
  endmethod

  method first() = unpack(pack(dfifo.first));
  method notEmpty() = dfifo.notEmpty;
  method get_id() = corams[0].get_id;
endmodule

C.2 Control Thread Program

void write_stream_fifo(cpi_hand rams, cpi_hand creg, cpi_hand hfifo,
                       cpi_addr src, int bytes, int word_size,
                       int depth, int *head)
{
  int words_left = divide(bytes, word_size);
  int mask = depth - 1;
  int src_word = 0;

  while(words_left > 0) {
    int tail = (int)cpi_read_channel(creg);
    int free_words = (*head >= tail) ? depth - 1 - (*head - tail)
                                     : (tail - *head - 1);
    int bsize_words = MIN(((MAX_REQ_BYTES/2)/word_size),
                          MIN(free_words, words_left));

    if((bsize_words != 0) && (free_words >= 64)) {
      cpi_tag tag = cpi_nb_write_ram(rams, *head,
                                     src + src_word * word_size,
                                     bsize_words * word_size,
                                     CPI_INVALID_TAG);
      if(tag != CPI_INVALID_TAG) {
        src_word += bsize_words;
        words_left -= bsize_words;
        *head = (*head + bsize_words) & mask;
        cpi_write_channel(hfifo, *head);
      }
    }
  }
  return;
}

Appendix D

Matrix-Matrix Multiplication

D.1 Bluespec

interface CogemmPE;
  method Action set_thread_id(Uint32 id);
  method Action set_id(Uint32 id);
  method Action putOp(Opand b_in);
  method Opand getOp();
  method Action start();
  method ActionValue#(Bit#(0)) done();
  interface AntiCoram#(1, RamAddr, Opand) antiRamA;
  interface AntiCoram#(1, RamAddr, Opand) antiRamB;
  interface AntiCoram#(2, RamAddr, Opand) antiRamC0;
  interface AntiCoram#(2, RamAddr, Opand) antiRamC1;
endinterface

module mkCogemmPE(CogemmPE);

  RWire#(Uint32) id_w <- mkRWire();
  RWire#(Uint32) tid_w <- mkRWire();
  function Uint32 gid() = validValue(id_w.wget);
  function Uint32 tid() = validValue(tid_w.wget);

  //////////////////////////////////////////////////////////////////
  // Datapath (BRAMs + FU)
  //////////////////////////////////////////////////////////////////

  Reg#(Opand) cLatch <- mkRegU;
  Madd madder <- mkMadd(0);
  Vector#(3, Reg#(Pstage)) stages <- replicateM(mkConfigReg(unpack(0)));

  //////////////////////////////////////////////////////////////////
  // Control
  //////////////////////////////////////////////////////////////////

  Reg#(PeState) state <- mkReg(PeWait);
  Reg#(Bit#(1)) compPhase <- mkReg(0);
  FIFOF#(Bit#(0)) readyQ <- mkSizedFIFOF(2);
  FIFOF#(Bit#(0)) doneQ <- mkSizedFIFOF(2);
  PulseWire drainDone <- mkPulseWire();

  //////////////////////////////////////////////////////////////////
  // Operand network
  //////////////////////////////////////////////////////////////////

  RWire#(Opand) opIn <- mkRWire();
  Reg#(Opand) bOutQ <- mkRegU;

  //////////////////////////////////////////////////////////////////
  // Memory
  //////////////////////////////////////////////////////////////////

  AntiRamToRam#(1, RamAddr, Opand) aRam <- mkAntiRamToRam();
  AntiRamToRam#(1, RamAddr, Opand) bRam <- mkAntiRamToRam();
  AntiRamToRam#(2, RamAddr, Opand) cRam0 <- mkAntiRamToRam();
  AntiRamToRam#(2, RamAddr, Opand) cRam1 <- mkAntiRamToRam();

  Reg#(BigRamAddr) loop_index <- mkReg(0);
  Reg#(RamCount) outer_index <- mkReg(0); // outer loop
  Reg#(RamCount) inner_index <- mkReg(0); // inner loop
  Reg#(Bool) inner_loop_done <- mkReg(False);
  Reg#(Bool) last_loop_done <- mkReg(False);

  rule startCompute(readyQ.notEmpty);
    readyQ.deq();
    inner_index <= 0;
    outer_index <= 0;
    inner_loop_done <= False;
    last_loop_done <= False;
    state <= PeLoop;
  endrule

  (* fire_when_enabled *)
  rule compute_stage_0(state == PeLoop);
    Pstage p = ?;

    p.valid = True;
    p.addrA = {compPhase, truncate(outer_index)};
    p.addrB = {compPhase, truncate(outer_index)};

    RamIndex c_ram_index = truncate(inner_index);
    c_ram_index = c_ram_index + truncate(gid);

    p.addrC = {compPhase, truncate(c_ram_index)};
    p.useNeighbor = (inner_index != 0);
    p.comp_phase = compPhase;

    stages[0] <= p;

    inner_loop_done <= (inner_index == fromInteger(valueOf(BlockDim)-2));
    last_loop_done <= (loop_index == fromInteger(valueOf(BlockDim)*valueOf(A_width)-2));
    outer_index <= inner_loop_done ? outer_index+1 : outer_index;
    inner_index <= inner_loop_done ? 0 : inner_index+1;
    loop_index <= last_loop_done ? 0 : loop_index+1;
    state <= last_loop_done ? PeDrain : state;
  endrule

  (* fire_when_enabled *)
  rule not_stage_0(state != PeLoop);
    Pstage p = ?;
    p.valid = False;
    stages[0] <= p;
  endrule

  (* fire_when_enabled *)
  rule compute_stage_1(True);
    let p = stages[0];
    let cAddr = p.addrC;

    if(p.valid) begin
      aRam.ram.p[0].req(Read, p.addrA, ?);
      bRam.ram.p[0].req(Read, p.addrB, ?);
      cRam0.ram.p[0].req(Read, cAddr, ?);
      cRam1.ram.p[0].req(Read, cAddr, ?);
    end

    stages[1] <= p;
  endrule

  (* fire_when_enabled *)
  rule compute_stage_2(True);
    let p = stages[1];
    let av <- aRam.ram.p[0].val();
    let bv <- bRam.ram.p[0].val();
    let cv0 <- cRam0.ram.p[0].val();
    let cv1 <- cRam1.ram.p[0].val();

    p.valA = av;
    p.valB = bv;
    p.valC = p.comp_phase == 1 ? cv1 : cv0;

    stages[2] <= p;
  endrule

  (* fire_when_enabled *)
  rule compute_stage_3(True);
    let p = stages[2];
    let bVal = p.useNeighbor ? validValue(opIn.wget) : p.valB;
    if(p.valid) madder.issue(p.valA, bVal, p.valC);
    bOutQ <= bVal;
  endrule

  /////////////////////////////////////////////////////////////////
  // Local writes to C ram
  /////////////////////////////////////////////////////////////////

  Reg#(BigRamAddr) cWrCnt <- mkReg(0);  // outer_c
  Reg#(RamCount) cWrIndex <- mkReg(0);  // inner_c
  Reg#(Bool) cLoop <- mkReg(False);     // inner_c
  Reg#(Bool) cDone <- mkReg(False);     // inner_c

  (* fire_when_enabled *)
  rule write_dotprod(madder.done);
    RamIndex c_ram_index = truncate(cWrIndex) + truncate(gid);
    RamAddr cAddr = {compPhase, truncate(c_ram_index)};
    if(compPhase == 1) cRam1.ram.p[1].req(Write, cAddr, madder.result);
    else cRam0.ram.p[1].req(Write, cAddr, madder.result);
    cWrIndex <= cLoop ? 0 : cWrIndex+1;

    if(cDone) begin
      cWrCnt <= 0;
      cLoop <= False;
      cDone <= False;
      doneQ.enq(?);
      drainDone.send();
    end
    else begin
      cWrCnt <= cWrCnt+1;
      cLoop <= (cWrIndex == fromInteger(valueOf(BlockDim)-2));
      cDone <= (cWrCnt == fromInteger(valueOf(BlockDim)*valueOf(A_width)-2));
    end
  endrule

  rule restartCompute(drainDone && (state == PeDrain));
    compPhase <= ~compPhase;
    state <= PeWait;
  endrule

  //////////////////////////////////////////////////////////////////
  // Interfaces
  //////////////////////////////////////////////////////////////////

  method Action start();
    readyQ.enq(?);
  endmethod

  method ActionValue#(Bit#(0)) done();
    actionvalue
      doneQ.deq();
      return doneQ.first();
    endactionvalue
  endmethod

  method Action putOp(Opand b_in);
    opIn.wset(b_in);
  endmethod

  method Opand getOp();
    return bOutQ;
  endmethod

  method Action set_thread_id(Uint32 _id);
    tid_w.wset(_id);
  endmethod

  method Action set_id(Uint32 _id);
    id_w.wset(_id);
  endmethod

  interface antiRamA  = aRam.anti;
  interface antiRamB  = bRam.anti;
  interface antiRamC0 = cRam0.anti;
  interface antiRamC1 = cRam1.anti;
endmodule

D.2 Control Thread Program

#define N         2*NumPEs
#define N_PE      NumPEs
#define DTYPE     sizeof(float)
#define A_OFF     (cpi_instance()*(3*N*N) + 0)
#define B_OFF     (cpi_instance()*(3*N*N) + (N*N)*sizeof(float))
#define C_OFF     (cpi_instance()*(3*N*N) + (2*N*N)*sizeof(float))
#define N_THREADS ReplicationFactor

int gemm_thread()
{
  int NB = N_PE, ram_depth = N_PE * 2;
  cpi_addr dataA = 0, dataB = B_OFF, dataC = C_OFF;
  cpi_int64 token = 0;
  int prev_j = 0, prev_i = 0;
  cpi_tag tagA = CPI_INVALID_TAG, tagB = CPI_INVALID_TAG,
          tagC0 = CPI_INVALID_TAG, tagC1 = CPI_INVALID_TAG;

  cpi_register_thread("gemm_thread", N_THREADS);
  cpi_hand cfifo  = cpi_get_channel(cpi_fifo, 0);
  cpi_hand ramsA  = cpi_get_rams(N_PE, false, 0, 0);
  cpi_hand ramsB  = cpi_get_rams(N_PE, true,  0, N_PE);
  cpi_hand ramsC0 = cpi_get_rams(N_PE, false, 0, 2*N_PE);
  cpi_hand ramsC1 = cpi_get_rams(N_PE, false, 0, 3*N_PE);

  bool which = false; // for double buffer

  for (int k = 0; k < N; k += NB) {
    for (int j = 0; j < N; j += NB) {
      for (int i = 0; i < N; i += NB) {
        int offset   = which  ? ram_depth/2 : 0;
        int c_offset = !which ? ram_depth/2 : 0;
        for(int r=0; r < NB; r++) {
          cpi_poll(tagA, cpi_nb_write_ram(ramsA, r*ram_depth+offset,
                                          dataA+DTYPE*(i*N+k+r*N), NB*DTYPE, tagA));

          cpi_poll(tagB, cpi_nb_write_ram(ramsB, r+offset,
                                          dataB+DTYPE*(k*N+j+r*N), NB*DTYPE, tagB));

          if(!which) {
            cpi_poll(tagC0, cpi_nb_write_ram(ramsC0, r*ram_depth+offset,
                                             dataC+DTYPE*(i*N+j+r*N), NB*DTYPE, tagC0));
            cpi_poll(tagC1, cpi_nb_read_ram(ramsC1, r*ram_depth+c_offset,
                                            dataC+DTYPE*(prev_i*N+prev_j+r*N), NB*DTYPE, tagC1));
          }
          else {
            cpi_poll(tagC0, cpi_nb_write_ram(ramsC0, r*ram_depth+offset,
                                             dataC+DTYPE*(i*N+j+r*N), NB*DTYPE, tagC0));
            cpi_poll(tagC1, cpi_nb_read_ram(ramsC1, r*ram_depth+c_offset,
                                            dataC+DTYPE*(prev_i*N+prev_j+r*N), NB*DTYPE, tagC1));
          }
        }
        cpi_wait(ramsA, tagA);
        cpi_wait(ramsB, tagB);
        cpi_wait(ramsC0, tagC0);
        cpi_wait(ramsC1, tagC1);
        cpi_write_channel(cfifo, token);
        token = cpi_read_channel(cfifo);
        prev_i = i;
        prev_j = j;
        which = !which;
      }
    }
  }

  return 0;
}

Appendix E

Sparse Matrix-Vector Multiplication

E.1 Bluespec

module mkPE#(Uint32 uid)(PE);

  FIFOCountIfc#(Pe_addr, 32) inAddrQ <- mkLUTFIFO(True);
  FIFOCountIfc#(Pe_input, 32) inQ <- mkLUTFIFO(True);
  FIFOCountIfc#(Pe_input, 32) pendQ <- mkLUTFIFO(True);
  FIFOCountIfc#(Rword, 32) dotSizeQ <- mkLUTFIFO(True);

  let row_ls  <- mkLoadStore("row_thread", uid, 1);
  let val_ls  <- mkLoadStore("val_thread", uid, 1);
  let addr_ls <- mkLoadStore("addr_thread", uid, 1);
  let x_ls    <- mkLoadStore("xcache_thread", uid, 1);
  let y_ls    <- mkLoadStore("y_thread", uid, 1);
  let ref_ls  <- mkLoadStore("ref_thread", uid, 1);

  Valfifo valQ <- mkCofifoBind("val_thread", uid, 0);
  Addrfifo addrQ <- mkCofifoBind("addr_thread", uid, 0);
  Xcache xcache <- mkXcache("xcache_thread", "[ x-cache ]", uid);
  FIFOCountIfc#(Dword, 32) xcacheQ <- mkLUTFIFO(True);

  Yfifo yQ <- mkToCofifoBind("y_thread", uid, 0);
  Rfifo refQ <- mkCofifoBind("ref_thread", uid, 0);

  Reg#(Rword) dotCount <- mkReg(0);
  IntMacc#(Dword,1) macc <- mkIntMacc();
  Reg#(Bool) waitSum <- mkReg(False);
  Reg#(Bit#(64)) numMadds <- mkReg(0); // Stats

  ///////////////////////////////////////////////////////////////////////////////

  rule requestsToMemory(valQ.cfifo.put_valid && addrQ.cfifo.put_valid);
    let in = inAddrQ.first;
    inAddrQ.deq();
    valQ.cfifo.put({pack(in.dotsize), pack(in.val_base)});
    addrQ.cfifo.put({pack(in.dotsize), pack(in.val_base)});
    yQ.cfifo.put({fromInteger(valueOf(RowSize)), pack(in.row)});
  endrule

  rule processWork(True);
    let in = inQ.first;
    inQ.deq();
    pendQ.enq(in);
    dotSizeQ.enq(in.dotsize-1);
  endrule

  rule processAddresses(addrQ.notEmpty);
    let addr = addrQ.first;
    addrQ.deq();
    let corr = addr;
    addr = addr << valueOf(TLog#(TDiv#(WordWidth,8)));
    xcache.user_ifc.request.put(
      CacheReq{addr:addr, rnw:True, rtoken:?, size:?, data:?, id:?} );
  endrule

  rule drainSum(waitSum);
    let dot_info = pendQ.first;
    if(macc.rdy) begin
      yQ.enq(macc.get);
      macc.clear();
      pendQ.deq();
      dotSizeQ.deq();
      waitSum <= False;
      Int#(WordWidth) sumRes = unpack(macc.get);
    end
  endrule

  rule drainXcache(True);
    Dword xval;
    let _xval <- xcache.user_ifc.response.get();
    xval = _xval;
    xcacheQ.enq(xval);
  endrule

  rule processValues(!waitSum && xcacheQ.notEmpty && valQ.notEmpty);
    let dot_info = pendQ.first;
    let xval = xcacheQ.first;
    let val = valQ.first;
    xcacheQ.deq();
    valQ.deq();
    Dword v = unpack(val);
    Dword x = unpack(xval);
    macc.put(v, x);

    if(dotCount == dotSizeQ.first) begin
      waitSum <= True;
      dotCount <= 0;
    end
    else dotCount <= dotCount + 1;
  endrule

  method Action put(Pe_input in);
    inQ.enq(in);
  endmethod

  method Action put_addr(Pe_addr in);
    inAddrQ.enq(in);
  endmethod
endmodule

E.2 Control Thread Program

#include "cpi.h" #include "Cofifo.h" #include "CofifoBind.h" #include "LoadStore.h"

#define BASE_ADDR 0x0 #define N_ROW_OFFSET 0x0 #define VAL_OFFSET 0x4 #define COL_OFFSET 0x8 #define ROW_OFFSET 0xc #define X_OFFSET 0x10 #define Y_OFFSET 0x14 static void unpack_data(cpi_int64 bits, int *first, int *second) { *first = bits & 0xffffffff; *second = (bits >> 32ull) & 0xffffffff; } void row_thread() { cpi_register_thread("row_thread", 1);

DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);

/* Determine start address for rows */ int n_rows, row_offset; n_rows = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + N_ROW_OFFSET); row_offset = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + ROW_OFFSET); cpi_addr row_ptr = BASE_ADDR + row_offset;

/* Create stream Co-fifo object*/ DECLARE_CFIFO_BIND(rams, ROW_BUNDLE_WIDTH/32, creg, cfifo, hfifo, bfifo, head, 0); cpi_set_attr(rams, CPI_ATTR_SINGLE_CLUSTER);

/* Use built-in channel to communicate row information */ cpi_write_channel(cfifo, n_rows);

/* Start streaming in schedule data */ write_fifo_bind(rams, creg, hfifo, row_ptr, sizeof(int)*(n_rows+2), ROW_BUNDLE_WIDTH/8 /*word_size*/, FIFO_DEPTH, &head, "\"stream-thread\""); } void val_thread() { cpi_register_thread("val_thread", N_PE);

  DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);
  DECLARE_CFIFO_BIND(rams, 1, creg, cfifo, hfifo, bfifo, head, 0);

  cpi_addr val_ptr = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + VAL_OFFSET);
  int first, second;

  while(1) {
    cpi_int64 val = cpi_read_channel(cfifo);
    unpack_data(val, &first, &second);
    write_fifo_bind(rams, creg, hfifo, val_ptr + first*sizeof(int),
                    second*sizeof(int), 4, FIFO_DEPTH, &head);
  }
}

void addr_thread()
{
  cpi_register_thread("addr_thread", N_PE);

  DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);
  DECLARE_CFIFO_BIND(rams, 1, creg, cfifo, hfifo, bfifo, head, 0);
  cpi_addr col_ptr = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + COL_OFFSET);
  int first, second;
  while(1) {
    cpi_int64 val = cpi_read_channel(cfifo);
    unpack_data(val, &first, &second);
    write_fifo_bind(rams, creg, hfifo, col_ptr + first * sizeof(int),
                    second * sizeof(int), 4, FIFO_DEPTH, &head);
  }
}

void xcache_thread()
{
  cpi_register_thread("xcache_thread", N_PE);
  DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);

  cpi_hand fifo     = cpi_get_channel(cpi_fifo, 0, 0);
  cpi_hand bfifo    = cpi_get_channel(cpi_fifo, 0, 1);
  cpi_hand data_ram = cpi_get_rams(XCACHE_WIDE, true, 0, 0);
  cpi_hand tag_ram  = cpi_get_rams(1, false, 0, XCACHE_WIDE);
  cpi_addr x_ptr    = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + X_OFFSET);
  cpi_bind(bfifo, data_ram);

#define log2(x) \
  ((x) ==  1) ? 0 : \
  ((x) ==  2) ? 1 : \
  ((x) ==  4) ? 2 : \
  ((x) ==  8) ? 3 : \
  ((x) == 16) ? 4 : \
  ((x) == 32) ? 5 : \
  ((x) == 64) ? 6 : 0

  int log_word_bytes = log2(XCACHE_WIDE * (WORD_WIDTH/8));

  while(1) {
    cpi_int64 message = cpi_read_channel(fifo);
    cpi_addr miss_addr = message & 0xffffffffu & ~(XCACHE_BLK_BYTES-1);
    cpi_int64 data_index = miss_addr & (XCACHE_SIZE_BYTES-1) & ~(XCACHE_BLK_BYTES-1);
    cpi_tag tag = CPI_INVALID_TAG;
    while(1) {
      tag = cpi_nb_write_ram(data_ram, data_index >> log_word_bytes,
                             x_ptr + miss_addr, XCACHE_BLK_BYTES,
                             tag, 1/*hiprio*/);
      if(tag != CPI_INVALID_TAG) break;
    }
  }
}

void y_thread()
{
  cpi_register_thread("y_thread", N_PE);
  DECLARE_LDST(ldst_fifo, ldst_ram, 1/*obj_id*/);
  DECLARE_CFIFO_BIND(rams, 1, creg, cfifo, tfifo, bfifo, tail, 0);

  cpi_addr y_ptr = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + Y_OFFSET);
  int n_rows = mp_thread_read(ldst_ram, ldst_fifo, BASE_ADDR + N_ROW_OFFSET);

  while(1) {
    int first, second;
    cpi_int64 val = cpi_read_channel(cfifo);
    unpack_data(val, &first, &second);
    read_fifo_bind(rams, creg, tfifo, y_ptr + (first + n_rows) * sizeof(int),
                   second * sizeof(int), 4, FIFO_DEPTH, &tail);
  }
}

References

[1] AutoESL High-Level Synthesis Tool. http://www.xilinx.com/tools/autoesl.htm.

[2] Compressed Sparse Row Format. http://netlib.org/linalg.

[3] MicroBlaze Soft Processor Core. http://www.xilinx.com/tools/microblaze.htm.

[4] Personal communication.

[5] SystemC: The Open SystemC Initiative. http://www.systemc.org.

[6] Xilinx CORE Generator System. http://www.xilinx.com/tools/coregen.htm.

[7] XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices. http://www.xilinx.com/support/documentation/sw_manuals/xilinx13_1/xst_v6s6.pdf.

[8] Hard vs. Soft: The Central Question of Pre-Fabricated Silicon. In Proceedings of the 34th International Symposium on Multiple-Valued Logic, pages 2–5, Washington, DC, USA, 2004. IEEE Computer Society.

[9] Achronix, Inc. Speedster Product Brief. http://www.achronix.com.

[10] Michael Adler, Kermin E. Fleming, Angshuman Parashar, Michael Pellauer, and Joel Emer. LEAP Scratchpads: Automatic Memory and Cache Management for Reconfigurable Logic. In FPGA’11: Proceedings of the 2011 ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, 2011.

[11] Altera, Inc. http://www.altera.com.

[12] Altera, Inc. Internal Memory (RAM and ROM) User Guide. http://www.altera.com/literature/ug/ug_ram_rom.pdf.

[13] AMD, Inc. http://www.fusion.amd.com.

[14] R. Amerson, R. J. Carter, W. B. Culbertson, P. Kuekes, and G. Snider. Teramac-configurable custom computing. In FCCM'95: Proceedings of the IEEE Symposium on FPGA's for Custom Computing Machines, page 32, Washington, DC, USA, 1995. IEEE Computer Society.

[15] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Commun. ACM, 52:56–67, October 2009.

[16] Luca Benini, Alberto Macii, Enrico Macii, and Massimo Poncino. Synthesis of Application-Specific Memories for Power Optimization in Embedded Systems. In Proceedings of the 37th Annual Design Automation Conference, DAC'00, pages 300–303, New York, NY, USA, 2000. ACM.

[17] Bluespec, Inc. http://www.bluespec.com/products/bsc.htm.

[18] Jorge E. Carrillo and Paul Chow. The Effect of Reconfigurable Units in Superscalar Processors. In FPGA'01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays, pages 141–150, New York, NY, USA, 2001. ACM.

[19] S. Casselman. Virtual computing and the Virtual Computer. pages 43–48, April 1993.

[20] C. Chang, J. Wawrzynek, and R.W. Brodersen. BEE2: A High-End Reconfigurable Computing System. Design Test of Computers, IEEE, 22(2):114–125, March 2005.

[21] Shuai Che, Jie Li, Jeremy W. Sheaffer, Kevin Skadron, and John Lach. Accelerating Compute-Intensive Applications with GPUs and FPGAs. In SASP’08: Proceedings of the 2008 Symposium on Application Specific Processors, pages 101–107, Washington, DC, USA, 2008. IEEE Computer Society.

[22] J. Choi. A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers. In Proceedings of the 11th International Symposium on Parallel Processing, IPPS’97, pages 310–314, Washington, DC, USA, 1997. IEEE Computer Society.

[23] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In MICRO-43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2010.

[24] Charles Clos. A Study of Non-Blocking Switching Networks. In Bell Sys. Tech. J., 32 (1953) 406-424., 1953.

[25] E.U. Cohler and J.E. Storer. Functionally Parallel Architecture for Array Processors. Com- puter, 14:28–36, 1981.

[26] Katherine Compton and Scott Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Comput. Surv., 34(2):171–210, 2002.

[27] Convey, Inc. http://www.convey.com.

[28] Adrian Cosoroaba. Designing High Efficiency DDR3 Memory Controllers with today’s FP- GAs. In Proceedings of the MEMCON’11, 2010.

[29] Cray, Inc. http://www.cray.com.

[30] William Dally and Brian Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.

[31] Nirav Dave, Kermin Fleming, Myron King, Michael Pellauer, and Muralidaran Vijayaraghavan. Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA. In Proceedings of the 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign, MEMOCODE'07, pages 97–100, Washington, DC, USA, 2007. IEEE Computer Society.

[32] John D. Davis, Charles P. Thacker, and Chen Chang. BEE3: Revitalizing Computer Architecture Research. Technical Report MSR-TR-2009-45, Microsoft Research, April 2009.

[33] Timothy A. Davis. University of Florida Sparse Matrix Collection. NA Digest, 92, 1994.

[34] Florent de Dinechin and Bogdan Pasca. Designing Custom Arithmetic Data Paths with FloPoCo. IEEE Des. Test, 28:18–27, July 2011.

[35] Andre Maurice Dehon. Reconfigurable architectures for general-purpose computing. PhD thesis, 1996. Supervisor: Thomas Knight, Jr.

[36] Michael deLorimier and André DeHon. Floating-Point Sparse Matrix-Vector Multiply for FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, FPGA'05, pages 75–85, New York, NY, USA, 2005. ACM.

[37] Robert H. Dennard, Fritz H. Gaensslen, Hwa-nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc. Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions. Solid-State Circuits Newsletter, IEEE, 12(1):38–50, Winter 2007.

[38] Harald Devos, Jan Van Campenhout, and Dirk Stroobandt. Building an Application-specific Memory Hierarchy on FPGA. 2nd HiPEAC Workshop on Reconfigurable Computing, 2008.

[39] Yong Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev. 64-bit Floating-Point FPGA Matrix Multiplication. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, FPGA’05, pages 86–95, New York, NY, USA, 2005. ACM.

[40] DRC Computer, Inc. http://www.drccomputer.com.

[41] Drew Wilson. Chameleon takes on FPGAs, ASICs. http://www.edn.com/article/ CA50551.html?partner=enews, Oct 2000.

[42] Carl Ebeling, Darren C. Cronquist, and Paul Franklin. RaPiD - Reconfigurable Pipelined Datapath. In FPL'96: Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers, pages 126–135, London, UK, 1996. Springer-Verlag.

[43] T. El-Ghazawi, E. El-Araby, Miaoqing Huang, K. Gaj, V. Kindratenko, and D. Buell. The Promise of High-Performance Reconfigurable Computing. Computer, 41(2):69–76, February 2008.

[44] Gerald Estrin. Reconfigurable computer origins: the UCLA fixed-plus-variable (F+V) structure computer. IEEE Ann. Hist. Comput., 24(4):3–9, 2002.

[45] Daniel D. Gajski, Nikil D. Dutt, Allen C.-H. Wu, and Steve Y.-L. Lin. High Level Synthesis: Introduction to Chip and System Design. 1992.

[46] Daniel D. Gajski, Jianwen Zhu, Rainer Dömer, Andreas Gerstlauer, and Shuqing Zhao. SpecC: Specification Language and Methodology. 2000.

[47] Marco Gerards. Streaming Reduction Circuit for Sparse Matrix Vector Multiplication in FPGAs. Master’s thesis, 2008.

[48] Roman Geus and Stefan Röllin. Towards a fast parallel sparse symmetric matrix-vector multiplication. Parallel Comput., 27:883–896, May 2001.

[49] M. Gokhale, W. Holmes, A. Kosper, D. Kunze, D. Lopresti, S. Lucas, R. Minnich, and P. Olsen. SPLASH: A reconfigurable linear logic array. In International Conference on Parallel Processing, 1990.

[50] Seth Copen Goldstein, Herman Schmit, Matthew Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In ISCA'99: Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 28–39, Washington, DC, USA, 1999. IEEE Computer Society.

[51] Derek B. Gottlieb, Jeffrey J. Cook, Joshua D. Walstrom, Steven Ferrera, Chi wei Wang, and Nicholas P. Carter. Clustered Programmable-Reconfigurable Processors. In Proc. of the 1st IEEE International Conference on Field Programmable Technology (FPT), Hong Kong, pages 134–141. IEEE, 2002.

[52] J. R. Hauser and J. Wawrzynek. Garp: a MIPS processor with a reconfigurable coprocessor. In FCCM’97: Proceedings of the 5th IEEE Symposium on FPGA-based Custom Computing Machines, page 12, Washington, DC, USA, 1997. IEEE Computer Society.

[53] IBM, Inc. The Netezza Data Appliance Architecture: A Platform for High Perfor- mance Data Warehousing and Analytics. http://www.netezza.com/documents/ whitepapers/Netezza_Appliance_Architecture_WP.pdf.

[54] Eun-Jin Im, Katherine Yelick, and Richard Vuduc. Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl., 18:135–158, February 2004.

[55] Intel, Inc. http://ark.intel.com/products/codename/42360/ Stellarton.

[56] Intel, Inc. Intel Math Kernel Library web site. http://www.intel.com/software/ products/mkl.

[57] International Technology Roadmap for Semiconductors. http://www.itrs.net.

[58] Bruce Jacob and David Wang. DRAM: Architectures, Interfaces, and Systems: A Tutorial. In Proceedings of the 2002 International Symposium on Computer Architecture, 2002.

[59] JEDEC. http://www.jedec.org.

[60] Jiang Jiang, Vincent Mirian, Kam Pui Tang, Paul Chow, and Zuocheng Xing. Matrix Multiplication Based on Scalable Macro-Pipelined FPGA Accelerator Architecture. In Proceedings of the 2009 International Conference on Reconfigurable Computing and FPGAs, RECONFIG'09, pages 48–53, Washington, DC, USA, 2009. IEEE Computer Society.

[61] Andrew B. Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE’09, pages 423–428, 3001 Leuven, Belgium, Belgium, 2009. European Design and Automation Association.

[62] George Kalokerinos, Vassilis Papaefstathiou, George Nikiforos, Stamatis Kavadias, Manolis Katevenis, Dionisios Pnevmatikatos, and Xiaojun Yang. FPGA implementation of a configurable cache/scratchpad memory with virtualized user-level RDMA capability. In Proceedings of the 9th International Conference on Systems, Architectures, Modeling and Simulation, SAMOS'09, pages 149–156, Piscataway, NJ, USA, 2009. IEEE Press.

[63] Nachiket Kapre and Andre DeHon. Accelerating SPICE Model-Evaluation using FPGAs. pages 37–44, Los Alamitos, CA, USA, 2009. IEEE Computer Society.

[64] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall Press, Upper Saddle River, NJ, USA, 1988.

[65] Vinay B. Y. Kumar, Siddharth Joshi, Sachin B. Patkar, and H. Narayanan. FPGA Based High Performance Double-Precision Matrix Multiplication. In Proceedings of the 2009 22nd International Conference on VLSI Design, pages 341–346, Washington, DC, USA, 2009. IEEE Computer Society.

[66] H. T. Kung. Memory requirements for balanced computer architectures. In ISCA'86: Proceedings of the 13th Annual International Symposium on Computer Architecture, pages 49–54, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press.

[67] Ian Kuon and Jonathan Rose. Measuring the Gap Between FPGAs and ASICs. In FPGA’06: Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, pages 21–30, New York, NY, USA, 2006. ACM.

[68] Charles Eric LaForest and J. Gregory Steffan. Efficient Multi-Ported Memories for FPGAs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA'10, pages 41–50, New York, NY, USA, 2010. ACM.

[69] Chris Lattner and Vikram Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, CGO'04, pages 75–, Washington, DC, USA, 2004. IEEE Computer Society.

[70] Benjamin C. Lee, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick. Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply. In Proceedings of the 2004 International Conference on Parallel Processing, ICPP'04, pages 169–176, Washington, DC, USA, 2004. IEEE Computer Society.

[71] Edward C. Lin and Rob A. Rutenbar. A Multi-FPGA 10x-Real-Time High-Speed Search Engine for a 5000-word Vocabulary Speech Recognizer. In Proceeding of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA’09, pages 83–92, New York, NY, USA, 2009. ACM.

[72] Liu Ling, Neal Oliver, Chitlur Bhushan, Wang Qigang, Alvin Chen, Shen Wenbo, Yu Zhihong, Arthur Sheiman, Ian McCallum, Joseph Grecco, Henry Mitchel, Liu Dong, and Prabhat Gupta. High-performance, Energy-efficient Platforms Using In-Socket FPGA Accelerators. In FPGA'09: Proceeding of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 261–264, New York, NY, USA, 2009. ACM.

[73] Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and Mark Horowitz. Smart Memories: a modular reconfigurable architecture. In ISCA'00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 161–171, New York, NY, USA, 2000. ACM.

[74] A. Marquardt, V. Betz, and J. Rose. Speed and Area Tradeoffs in Cluster-Based FPGA Architectures. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 8(1):84–93, February 2000.

[75] Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix. In Field-Programmable Logic and Applications, volume 2778 of Lecture Notes in Computer Science, pages 61–70. Springer Berlin / Heidelberg, 2003.

[76] Mentor Graphics. Catapult C. http://www.mentor.com/esl, 2009.

[77] E. Mirsky and A. DeHon. MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources. In FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on, pages 157–166, April 1996.

[78] Naohito Nakasato. A fast GEMM implementation on the cypress GPU. SIGMETRICS Per- form. Eval. Rev., 38:50–55, March 2011.

[79] Pradeep Nalabalapu and Ron Sass. Bandwidth Management with a Reconfigurable Data Cache. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Workshop 3 - Volume 04, IPDPS’05, pages 159.1–, Washington, DC, USA, 2005. IEEE Computer Society.

[80] Nallatech. http://www.nallatech.com.

[81] T. Ngai, J. Rose, and S.J.E. Wilton. An SRAM-programmable field-configurable memory. In Custom Integrated Circuits Conference, 1995., Proceedings of the IEEE 1995, May 1995.

[82] Nvidia, Inc. CUDA CUBLAS Library 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf.

[83] Michael K. Papamichael. Fast Scalable FPGA-based Network-on-Chip Simulation Mod- els. In Proceedings of the 9th IEEE/ACM International Conference on Formal Methods and Models for Codesign, MEMOCODE’11, 2011.

[84] Ali Pinar and Michael T. Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing'99, New York, NY, USA, 1999. ACM.

[85] Daniel S. Poznanovic. Application Development on the SRC Computers, Inc. Systems. In IPDPS’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers, page 78.1, Washington, DC, USA, 2005. IEEE Computer Society.

[86] Rahul Razdan and Michael D. Smith. A High-Performance Microarchitecture with Hardware-Programmable Functional Units. In MICRO’94: Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 172–180, New York, NY, USA, 1994. ACM.

[87] Kentaro Sano, Wang Luzhou, Yoshiaki Hatsuda, Takanori Iizuka, and Satoru Yamamoto. FPGA-Array with Bandwidth-Reduction Mechanism for Scalable and Power-Efficient Numerical Simulations Based on Finite Difference Methods. ACM Trans. Reconfigurable Technol. Syst., 3:21:1–21:35, November 2010.

[88] C. Scheurich and M. Dubois. The design of a lockup-free cache for high-performance multiprocessors. In Proceedings of the 1988 ACM/IEEE conference on Supercomputing, Supercomputing'88, pages 352–359, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.

[89] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine, and R. Reed Taylor. PipeRench: A Virtualized Programmable Datapath in 0.18 Micron Technology. In Custom Integrated Circuits Conference, 2002. Proceedings of the IEEE 2002, pages 63–66, 2002.

[90] Silicon Graphics, Inc. Extraordinary acceleration of workflows with reconfigurable application-specific computing from SGI. http://www.sgi.com/pdfs/3721.pdf, 2004.

[91] H. Singh, Ming-Hau Lee, Guangming Lu, F.J. Kurdahi, N. Bagherzadeh, and E.M. Chaves Filho. MorphoSys: An Integrated Reconfigurable System for Data-parallel and Computation-intensive Applications. Computers, IEEE Transactions on, 49(5):465–481, May 2000.

[92] James E. Smith. Decoupled access/execute computer architectures. SIGARCH Comput. Archit. News, 10:112–119, April 1982.

[93] Junqing Sun, Gregory Peterson, and Olaf Storaasli. Sparse Matrix-Vector Multiplication Design on FPGAs. In Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 349–352, Washington, DC, USA, 2007. IEEE Computer Society.

[94] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA'04, pages 2–, Washington, DC, USA, 2004. IEEE Computer Society.

[95] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of the 1992 ACM/IEEE conference on Supercomputing, Supercomputing'92, pages 578–587, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.

[96] David Barrie Thomas, Lee Howes, and Wayne Luk. A Comparison of CPUs, GPUs, FPGAs, and Massively Parallel Processor Arrays for Random Number Generation. In FPGA'09: Proceeding of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 63–72, New York, NY, USA, 2009. ACM.

[97] David Tarjan, Shyamkumar Thoziyoor, and Norman P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs, 2006.

[98] Xiang Tian and Khaled Benkrid. High-Performance Quasi-Monte Carlo Financial Simulation: FPGA vs. GPP vs. GPU. ACM Trans. Reconfigurable Technol. Syst., 3:26:1–26:22, November 2010.

[99] Hans Vandierendonck and Koen De Bosschere. Xor-based hash functions. IEEE Trans. Comput., 54:800–812, July 2005.

[100] Brendan Vastenhouw and Rob H. Bisseling. A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Rev., 47:67–95, January 2005.

[101] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Conservation Cores: Reducing the Energy of Mature Computations. In ASPLOS'10: Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–218, New York, NY, USA, 2010. ACM.

[102] Francisco-Javier Veredas, M. Scheppler, W. Moffat, and Bingfeng Mei. Custom Implementation of the Coarse-grained Reconfigurable ADRES Architecture for Multimedia Purposes. In Field Programmable Logic and Applications, 2005. International Conference on, pages 106–111, 2005.

[103] Victor Podlozhnyuk, Nvidia Inc. Black-Scholes Option Pricing, 2007.

[104] Richard Vuduc, Shoaib Kamil, Jen Hsu, Rajesh Nishtala, James W. Demmel, and Katherine A. Yelick. Automatic performance tuning and analysis of sparse triangular solve. In ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries, 2002.

[105] Richard Wilson Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis, 2003. AAI3121741.

[106] Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Hervé H. Touati, and Philippe Boucard. Programmable active memories: reconfigurable systems come of age. IEEE Trans. Very Large Scale Integr. Syst., 4:56–69, March 1996.

[107] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27(5):15–31, 2007.

[108] Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput., 35:178–194, March 2009.

[109] M. J. Wirthlin. A Dynamic Instruction Set Computer. In FCCM’95: Proceedings of the IEEE Symposium on FPGA’s for Custom Computing Machines, page 99, Washington, DC, USA, 1995. IEEE Computer Society.

[110] Xilinx, Inc. http://www.xilinx.com/products/devkits/EK-V6-ML605-G.htm.

[111] Xilinx, Inc. Memory Interface Solutions. http://www.xilinx.com/support/documentation/ip_documentation/ug086.pdf.

[112] Xilinx, Inc. http://www.xilinx.com.

[113] Xilinx, Inc. Spartan-6 FPGA Memory Controller. http://www.xilinx.com/support/documentation/user_guides/ug388.pdf.

[114] Xilinx, Inc. Virtex-II Platform FPGAs: Complete Data Sheet, 2005.

[115] Xilinx, Inc. Virtex-6 Family Overview, 2009.

[116] Xilinx, Inc. Virtex-7 Series Overview, 2010.

[117] XtremeData, Inc. http://www.xtremedata.com.

[118] Yoshiki Yamaguchi, Kuen Hung Tsoi, and Wayne Luk. A Comparison of FPGAs, GPUs and CPUs for Smith-Waterman Algorithm. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA'11, pages 281–281, New York, NY, USA, 2011. ACM.

[119] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee. CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit. In ISCA'00: Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 225–235, New York, NY, USA, 2000. ACM.

[120] Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, and Tong Wen. Productivity and Performance Using Partitioned Global Address Space Languages. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO'07, pages 24–32, New York, NY, USA, 2007. ACM.

[121] Peter Yiannacouras and Jonathan Rose. A parameterized automatic cache generator for FPGAs. In Proc. Field-Programmable Technology (FPT), pages 324–327, 2003.

[122] Steven P. Young. FPGA architecture having RAM blocks with programmable word length and width and dedicated address and data lines, United States Patent No. 5,933,023. 1996.

[123] Ling Zhuo, G.R. Morris, and V.K. Prasanna. High-Performance Reduction Circuits Using Deeply Pipelined Operators on FPGAs. Parallel and Distributed Systems, IEEE Transactions on, 18(10):1377–1392, October 2007.

[124] Ling Zhuo and Viktor K. Prasanna. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the ACM International Symposium on Field Programmable Gate Arrays, pages 63–74. ACM Press, 2005.
