ACCELERATOR ARCHITECTURES FOR APPLICATIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Sang Kyun Kim
January 2013

© 2013 by Sang Kyun Kim. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/nn963tk4553

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Oyekunle Olukotun, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christoforos Kozyrakis

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Andrew Ng

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Matrices are a well-known data representation extensively used in a wide range of applications. Numerous applications across various domains use matrix operations to represent and perform their core algorithms. Improving matrix operation performance is therefore critical to a vast variety of fields: it not only allows existing applications to run faster, but also enables computations with larger matrices.

Modern GPUs and CPUs with SIMD support have been very effective at accelerating matrix operations. However, these architectures only work well on dense, fat matrices. Skinny dense matrices tend to underutilize SIMD resources when the width of the matrix is less than the number of SIMD lanes, and they may limit scalability since they have a smaller amount of computation with which to hide communication overhead. Sparse matrices are also difficult to accelerate on current architectures, because their memory accesses are irregular and their workload imbalance is severe. This thesis introduces two specialized hardware designs, targeting narrow dense and sparse matrices.

The first part of this thesis focuses on accelerating the Restricted Boltzmann Machine (RBM), a popular machine learning algorithm used in deep learning. The RBM accelerator was designed using a modular approach to achieve linear scalability across transistor technologies, as well as across chip boundaries. The accelerator was implemented on FPGAs to demonstrate its performance improvements over high-end CPUs and GPUs. Both fat and skinny matrices were shown to fully utilize the computation resources in the learning process, which allows the training algorithm to converge in fewer iterations.

The second part of this thesis describes how applications can be accelerated with domain-specific hardware. We studied three sparse matrix applications that conventional hardware cannot easily accelerate. Based on our findings, we devised an accelerator architecture that targets certain sparse and dense matrix operations. The accelerator is capable of exploiting the fine-grained parallelism within sparse matrices, despite their irregularity, through buffering and work-stealing. In order to cover a wider range of applications, a small general-purpose core was added to the accelerator for non-critical execution flows. The sparse matrix accelerator was implemented on an FPGA board as an ASIC prototype to evaluate its performance using real-world data. Our accelerator shows performance comparable to GPUs on dense matrix operations, and outperforms conventional hardware on sparse matrix operations.

Acknowledgements

My PhD journey has been supported by family, colleagues, mentors, and friends. I couldn't have done it without them. I would like to use this opportunity to express my gratitude for their advice, efforts, kindness, patience, and care.

First of all, I would like to say "Thank you so much!" to my wife Eun Jung Lee for her patience and support. I am truly blessed to have met her at Stanford, as she has greatly enriched my graduate life with an abundance of joyful events. We share many invaluable memories, which enabled me to endure and push forward in my research.

I also want to thank my advisor Kunle Olukotun, who has been a great advisor to me. During our weekly 1:1 meetings, he always guided me with insightful advice. He has been patient with my research despite the delays from some difficulties in hardware. Kunle also kindly supported me financially with a research assistantship after my scholarship expired so that I could finish my PhD work. I also thank Darlene Hadding, who helped me with administrative work during my studies. My degree program had very few administrative issues thanks to her awesome support.

I would also like to thank the other Oral Exam Committee members: Christos Kozyrakis, Andrew Ng, and Yoshio Nishi. I thank Christos for serving as my associate advisor, reading committee member, and oral committee member. His computer architecture classes were one of the reasons why I decided to go further down the computer architecture path. I thank Andrew for serving as my reading committee member and oral committee member. The Brain-in-Box meetings with Andrew were very helpful in designing an RBM accelerator. I thank Yoshio for serving as my oral exam chair. He very kindly and happily agreed to be the chair, even though it was the first time we had met.

I want to thank all my colleagues and friends. Every one of them positively influenced my research work. Thanks to Lawrence McAfee and Peter McMahon, who worked together with me on designing the RBM architecture on FPGAs. Thanks to Honglak Lee for his help in understanding machine learning algorithms and for providing his source code. Thanks to Sungpack Hong and Frank Liu, who wrote some RTL code for me, which helped me save time in implementing the sparse matrix accelerator. Special thanks to Sungpack for our conversations, which were particularly helpful in debugging the sparse matrix accelerator. Thanks to Hyoukjoong Lee for his help with using the graphics card. Thanks to Jared Casper for helping me with Altera license issues, and thanks to Jacob Leverich for his help with using the CPU cluster. I also want to thank the rest of Kunle's group, as they always gave constructive feedback on my research.

I also want to thank the Stanford Korean Christian Fellowship (KCF), which was a spiritual home for me from my first year of graduate school. Thanks to all the caring friends at KCF for their loving prayers and support.

I would like to express my deep gratitude to the Kwanjeong Educational Foundation, which financially supported my first five years of graduate studies. Thanks to Chong-hwan Lee, the president of the foundation, and the rest of the foundation staff, I was able to study at Stanford without worrying about the expensive tuition and high cost of living. By securing my financial needs, I was able to stay more productive and focused.

Finally, I want to thank my parents in Korea for their warm encouragement and for having faith in me. Even at times when my grades were below expectations in Korea, my parents never gave up on me, and they still believe I can do well today. They helped me build a positive character, which really helped me avoid getting stressed out by the long debugging sessions and keep my pace in research.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Microprocessor Trends
  1.2 Specialized Architecture for Matrix-oriented Applications
  1.3 Previous Work On Accelerating Matrix Operations
  1.4 Thesis Outline

2 Accelerator Architectures for Restricted Boltzmann Machines
  2.1 Introduction
  2.2 Background on Restricted Boltzmann Machines
  2.3 Target FPGA Development Platform
  2.4 Single Chip Implementation
    2.4.1 Overall System Architecture
    2.4.2 Core Components
    2.4.3 Experimental Results
  2.5 Extension to Large Scale RBM Architecture
    2.5.1 Scalable Multi-chip Architecture
    2.5.2 Streaming Weights From DRAM and Associated Trade-offs
    2.5.3 Locally Dense Sparse Network
    2.5.4 Limitations of Multi-chip RBM
    2.5.5 Experimental Results

  2.6 Related Work
  2.7 Summary

3 Sparse Matrix Accelerator
  3.1 Introduction
    3.1.1 Sparse Matrix Applications
  3.2 Sparse Matrix Accelerator Architecture
    3.2.1 Overview
    3.2.2 Supported Matrix Operations
    3.2.3 Sparse Matrix Format
    3.2.4 Decode Unit
    3.2.5 Thread Block Unit
    3.2.6 Encode Unit
    3.2.7 Sparsification Modules
  3.3 Sparse Matrix Accelerator Implementation Details
    3.3.1 Target Development Platform
    3.3.2 Software Stack
    3.3.3 Resource Usage
    3.3.4 Challenges and Limitations
  3.4 Results and Analysis
    3.4.1 Experimental Platform
    3.4.2 Matrix Operation Performance Analysis
    3.4.3 Application Performance Analysis
    3.4.4 Summary

4 Concluding Remarks and Future Directions
  4.1 Contributions of Research
  4.2 Future Work
    4.2.1 Addressing the Limitations of the FPGA Implementation Platforms
    4.2.2 Future Directions in Accelerator Research

A Pseudo-code of Sparse Matrix Applications Used for SMA

B ISA for Sparse Matrix Accelerator FPGA
  B.1 Overview of Instruction Set Architecture for SMA-F
  B.2 Matrix Descriptor Format
  B.3 Detailed Instruction Format
  B.4 List of Instructions
    B.4.1 ADDH, ADDS - Floating Point Scalar Addition
    B.4.2 CSR - Control Status Register
    B.4.3 EADD - Element-wise Matrix Addition
    B.4.4 EGT - Element-wise Greater-than Comparison
    B.4.5 ELT - Element-wise Less-than Comparison
    B.4.6 EMAX - Element-wise Matrix Maximum
    B.4.7 EMIN - Element-wise Matrix Minimum
    B.4.8 EMULT - Element-wise Matrix Multiplication
    B.4.9 ESUB - Element-wise Matrix Subtraction
    B.4.10 EXP2 - Element-wise Exponential Base 2
    B.4.11 LOG2 - Element-wise Log Base 2
    B.4.12 MOVHS, MOVSH - Change Floating Point Precision
    B.4.13 MOVRx - Move Register
    B.4.14 MULH, MULS - Floating Point Scalar Multiplication
    B.4.15 MULT - Matrix Multiplication
    B.4.16 RCP - Element-wise Reciprocal
    B.4.17 RSQRT - Element-wise Reverse Square Root
    B.4.18 SGT, SLT, SMAX, SMIN - Matrix-Scalar Comparison
    B.4.19 SMULT, SADD, SSUB - Matrix-Scalar Arithmetic
    B.4.20 SUBH, SUBS - Floating Point Scalar Subtraction
    B.4.21 XCHG - Exchange

Bibliography

List of Tables

2.1 Single FPGA RBM Resource Utilization
2.2 Performance of Single Core CPU, GPU, and Multi-FPGA
2.3 Scalability: Speedup Against One Node of Each Platform
2.4 Power Consumption of Each Platform (Watt)

3.1 Density of Sparse Matrices in BC and MCL
3.2 List of Matrix Operations Supported
3.3 Prototype Implementation vs Projected Specification
3.4 Platform Specifications For Performance Evaluation and Analysis
3.5 CPU and GPU Libraries For Matrix Operations

B.1 Matrix Descriptor Fields and Format
B.2 Scalar Operations Opcode
B.3 Matrix Arithmetic and Math Function Opcode
B.4 Matrix Comparison Opcode

List of Figures

1.1 A Block Diagram for Future Microprocessor

2.1 Illustration of a Deep Belief Network and Restricted Boltzmann Machine
2.2 RBM Training Algorithm Pseudo-code
2.3 DE3 Development Board Used for RBM
2.4 System Architecture for Single Chip Restricted Boltzmann Machine
2.5 RBM Module Architecture Detail
2.6 RBM Matrix Multiplication Unit
2.7 Speedup for Single FPGA RBMs
2.8 High-level View of Multi-chip RBM Interconnect
2.9 Simplified Version of the Communication Between FPGAs
2.10 DRAM Bandwidth With Different Batch Sizes
2.11 Memory Bandwidth vs. I/O Bandwidth vs. On-chip Storage Trade-off
2.12 Buffering Weights and Partial Results
2.13 Locally Dense Sparse Network
2.14 Bit Error Rate Influence on Convergence

3.1 CPU Execution Time Breakdown on Sparse Matrix Applications
3.2 Non-zero Element Distribution
3.3 Sparse Matrix Accelerator (SMA) Architecture Overview
3.4 Compressed Sparse Block Format
3.5 SMA Decode Unit Block Diagram
3.6 SMA Thread Block Unit Block Diagram
3.7 SMA Floating Point ALU Block Diagram

3.8 Encode Unit Block Diagram
3.9 Sparsifying Modules
3.10 DE4 Development Board Used for SMA-F
3.11 SMA-F Software Stack
3.12 SMA-F Resource Utilization
3.13 Memory Blocks for Queues
3.14 Matrix Multiplication Performance
3.15 SMA-A Latency Sweep
3.16 SMA-A Work-stealing Performance
3.17 Element-wise Performance
3.18 SMA-A Restricted Boltzmann Machine Performance
3.19 SMA-A RBM Performance Analysis
3.20 SMA-A Markov Clustering Performance
3.21 SMA-A Betweenness Centrality Performance

4.1 Example of 2-D Sparse RBM Configurations

A.1 Pseudo-code for Markov Clustering
A.2 Pseudo-code for Sparse Restricted Boltzmann Machine
A.3 Pseudo-code for Betweenness Centrality

B.1 Nios II Custom Instruction Format

Chapter 1

Introduction

1.1 Microprocessor Trends

Due to transistor scaling and microarchitecture advances, microprocessors have shown exponential growth in logic resource capacity and tremendous performance enhancements over the past several decades¹. The exponential shrinking of semiconductor devices, known as Moore's Law [45], still holds true and is expected to continue for at least several more years. However, due to power limitations, clock frequency no longer scales with transistor size. Sequential performance gains from exploiting instruction-level parallelism (ILP) have significantly diminished as well, due to the inherently limited amount of ILP in applications, the worsening of wire delays, and the large energy consumption of increasingly complex hardware structures (e.g., super-scalar, out-of-order architectures with branch prediction).

Since the early 2000s, microprocessors have shifted their focus from increasing single-threaded performance to energy-efficient multi-core architectures. By utilizing the data-level and task-level parallelism in applications, a multi-core microprocessor can typically achieve better performance at the same level of energy. Multithreaded multicore microprocessors (e.g., the Sun UltraSPARC T1 [35]) were also introduced during this time for throughput-oriented applications; they utilize a large number of threads to hide long memory latencies. Recent microprocessors typically range from two to eight cores, based on performance and energy requirements.

¹ The first commercially available microprocessor, the Intel 4004, was introduced in 1971 [30].

Figure 1.1: A Block Diagram for Future Microprocessor (a multicore CPU and L3 cache alongside a GPU, FPGA fabric, and video/audio accelerators)

Heterogeneous computing was another microarchitectural trend seen during this time. The IBM Cell processor [31] was one of the first successful microprocessors with heterogeneous cores. NVIDIA CUDA [46] is an example of heterogeneous computing using the graphics processor. Intel Sandy Bridge added a graphics processor on the same die as an x86 processor. In general, a heterogeneous computer architecture typically consists of one or more powerful general-purpose cores that target sequential code, and many weak cores for energy-efficient data-parallel execution.

Despite the multi-core effort, without scaling of the supply voltage, future microprocessors will continue to be limited by power constraints. Recently, there have been concerns that future processors will be forced to turn off a large portion of the chip at any given time just to stay within the power budget. Esmaeilzadeh et al. predicted that more than 50% of a chip must be powered off at the 8nm technology node, the so-called dark-silicon apocalypse [21]. On the other hand, it has also been shown that using application-specific logic is far more energy efficient than using general-purpose processors for the same application at the same level of performance. Hameed et al. reported that their ASIC design was 500x more energy efficient than a four-core CMP design [25]. Consequently, it is believed that a promising approach for future microprocessor designs is to include more specialized logic blocks that are powered only when they are required for their specific functionality [10, 50]. Chien et al. argue that the traditional approach of optimizing for the 90% case no longer applies, and propose the "10×10" paradigm, which uses the growing transistor budget for 10 hardware accelerators, each covering 10% of the execution time. Such an approach can considerably improve the overall performance and energy efficiency of a wide variety of applications.

Based on these observations, Figure 1.1 shows an example of what the block diagram of a future microprocessor may look like. A future microprocessor is likely to have a powerful multicore CPU for general-purpose computing, but will also have accelerators for specific application domains. In fact, graphics, video, and audio accelerators can already be found in modern SoCs for portable consumer devices with tight energy constraints, such as mobile phones. There have also been proposals and early implementations of microprocessors with reprogrammable FPGA fabric. However, it is still an open question what other accelerators are needed, especially for the desktop and high-performance computing domains.

1.2 Specialized Architecture for Matrix-oriented Applications

Most accelerators in current microprocessors are related to security and multimedia. As the number of accelerators increases in future microprocessors, a wider range of applications will run more efficiently using customized hardware modules. This research studies the design of specialized hardware for emerging applications that use matrix operations as their primary computation. The Restricted Boltzmann Machine (RBM) is an example of a matrix-oriented application, and is the first focus of this thesis. An RBM is a two-layer network used as a building block for Deep Belief Networks [28], multi-layer neural networks that have been extremely popular over the last several years. This thesis discusses how we designed efficient, highly scalable RBM hardware and demonstrated its performance and scalability by implementing the architecture in FPGAs.

As an RBM scales to a very large size, many of the connections between the layers can be made sparse without much loss of accuracy. We show how we extend the multi-FPGA RBM hardware to support sparse networks that are locally dense.

The second part of this thesis investigates how we can generalize the RBM accelerator architecture to support a wider range of sparse matrix applications. As a case study, we chose three specific sparse matrix applications that existing hardware cannot easily accelerate: the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. We designed a sparse matrix accelerator architecture focused on the matrix operations used in these three applications, and implemented the design on an FPGA prototyping board. Assuming that an ASIC implementation would deliver roughly 10 times the performance of the FPGA prototype, we show that our customized sparse matrix hardware can give large speedups over current high-end CPUs and GPUs for applications dominated by the targeted matrix operations.

1.3 Previous Work On Accelerating Matrix Operations

Since matrix operations are very common in numerous applications, they are also among the most highly optimized types of computation. Most optimization efforts have focused on dense matrix operations for general-purpose processors, utilizing SIMD units to exploit the data parallelism and regular memory access patterns in matrix operations. More recently, graphics processors have also been used extensively to accelerate dense matrix operations, as they offer a large number of SIMD cores. The SIMD units in conventional hardware share instruction decode units to reduce resource overheads. As a result, resources may be underutilized if the matrix size is smaller than the SIMD width. Machine learning algorithms, such as the Restricted Boltzmann Machine, can benefit from using smaller batch sizes, since doing so helps the algorithm converge in fewer iterations. However, small batch sizes are rarely used in practice, because matrix operations on long, skinny matrices very often result in poor performance due to SIMD resource underutilization. Customized hardware

can more easily overcome this issue, since the overhead of sharing a general-purpose instruction unit is not needed. In Chapter 2, we show that our RBM accelerator can support long, skinny matrices without performance penalties. We also show that the accelerator scales well to a very large RBM network, which was one of the original motivations for building a custom RBM system.

Conventional SIMD hardware also provides limited support for sparse matrices, mostly focused on sparse matrix-vector multiplication (SpMV), which aligns relatively well with SIMD-style computation. However, highly irregular sparse matrix operations, such as sparse matrix - sparse matrix operations, are not well supported. Software solutions exist for both CPUs and GPUs [16, 13, 8], but are not capable of fully utilizing SIMD resources.

There has been previous work on implementing custom hardware accelerators for sparse matrix operations. Elgindy and Shue [20] demonstrated a fixed-point sparse matrix-vector multiplication accelerator using FPGAs. In 2005, floating-point implementations of sparse matrix-vector multiplication were shown in multiple publications [58, 18]. In 2010, Lin et al. explored the design space of implementing sparse matrix - sparse matrix multiplication on FPGAs using a systolic array architecture [40]. However, to the best of our knowledge, there has not been a sparse matrix accelerator design that targets multiple sparse matrix operations, providing both adequate speedup on full applications and flexibility. Many sparse matrix applications are composed of multiple types of matrix operations; thus, accelerating a single matrix operation is typically not sufficient to obtain good overall performance. In Chapter 3, the matrix operations in sparse matrix applications are studied to show that the accelerator must support multiple matrix operations in order not to be limited by Amdahl's Law. Our sparse matrix accelerator is capable of accelerating multiple types of matrix operations for both dense and sparse matrices, and exhibits considerable speedup over conventional hardware for representative sparse matrix applications.
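To make the irregularity concrete, the sketch below performs sparse matrix-vector multiplication over the common compressed sparse row (CSR) format. It is illustrative only and is not taken from the dissertation (the accelerator in Chapter 3 uses a compressed sparse block layout instead); the data-dependent gather from the dense vector and the uneven number of non-zeros per row are exactly the properties that frustrate fixed-width SIMD units.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix A.

    values, col_idx, row_ptr are the standard CSR arrays. The gather
    x[col_idx[j]] has a data-dependent access pattern, and the number of
    nonzeros per row varies, which causes SIMD underutilization and
    workload imbalance on conventional hardware.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]   # irregular gather
        y[i] = acc
    return y

# Tiny example: a 3x3 sparse matrix with an uneven number of nonzeros per row.
values  = np.array([2.0, 1.0, 3.0, 4.0])
col_idx = np.array([0, 2, 1, 2])
row_ptr = np.array([0, 2, 3, 4])
x = np.array([1.0, 2.0, 3.0])
print(spmv_csr(values, col_idx, row_ptr, x))   # [ 5.  6. 12.]
```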

1.4 Thesis Outline

The outline of the remaining chapters of the thesis is as follows. Chapter 2 describes the Restricted Boltzmann Machine algorithm and shows how very large scale RBMs can be accelerated using specialized hardware. The chapter explains the details of implementing the architecture on FPGAs and compares its performance and scalability with existing computer architectures. Chapter 3 gives a brief background on sparse matrix applications, and illustrates in detail our sparse matrix accelerator architecture as well as its FPGA implementation. The proposed accelerator architecture is prototyped on an FPGA to demonstrate its performance using real-world data. Chapter 4 concludes the thesis with a summary of the accomplishments of this research and suggestions for future work.

Chapter 2

Accelerator Architectures for Restricted Boltzmann Machines

2.1 Introduction

A Deep Belief Network (DBN) is a multilayer generative model that is trained to extract the essential features of the input data by maximizing the likelihood of its training data. DBNs have recently gained great popularity in the machine learning community due to their potential for solving previously difficult learning problems. Introduced in 2006 by Hinton et al. [28], DBNs use Restricted Boltzmann Machines (RBMs) to efficiently train each layer of a deep network. DBNs have been successfully demonstrated in various applications, such as handwritten digit recognition [37] and human motion modeling [49].

Although DBNs appear to be a promising tool, investigations are limited by the significant amount of processing that RBMs require; existing software implementations, including those for multi-core CPUs and GPUs, have long running times even for relatively small nets [47]. The primary issue is that conventional processors do not efficiently exploit the fine-grain parallelism present in RBM training algorithms, which are dominated by large, skinny matrix multiplications. Graphics processors have significant performance benefits over CPUs for matrix operations, but do not scale to very large networks due to I/O bandwidth limitations.


We seek to address this problem in both the short and the long term by developing a scalable, highly optimized custom computer architecture for DBN processing. In the near term, systems implementing this architecture may be able to achieve considerable speedups over conventional CPUs. Longer term, future generations of many-core processors may contain cores that are optimized for specific classes of applications, as mentioned in Section 1.1. An architectural exploration in this area using our ideas could lead to future processors that are better suited to DBN processing.

We describe an FPGA-based system that accelerates the training of DBNs. The ability to conveniently program logic provides considerable advantages in exploring the architectural design space of RBMs. Modern FPGAs contain a large number of configurable logic elements, which allow custom designs for complicated algorithms to be built. Abundant logic resources and the customizable nature of FPGAs allow us to fully exploit the fine-grain parallelism in the DBN training algorithm. More details of the FPGA board used are discussed in Section 2.3. Although FPGAs are used to implement the RBM algorithm, the architecture itself does not rely on the reconfigurable nature of FPGAs and may migrate to ASICs for better performance or energy efficiency.

The remainder of this chapter explains our RBM accelerator architecture as follows. Section 2.2 briefly reviews the Restricted Boltzmann Machine training algorithm. Section 2.3 describes the FPGA platform we used to implement the RBM accelerator designs. Section 2.4 illustrates our initial single FPGA design of a fully configurable Restricted Boltzmann Machine; the single chip RBM architecture was carefully constructed in a modular manner such that the design is scalable across multiple semiconductor technology generations. Section 2.5 extends the single FPGA design to a multi-FPGA version, in which each FPGA computes a subset of the visible and hidden layers and communicates data and intermediate results with the other FPGAs in a ring topology.


Figure 2.1: Illustration of a Deep Belief Network and Restricted Boltzmann Machine

2.2 Background on Restricted Boltzmann Machines

In this section, we briefly summarize the algorithm by Hinton et al. [27] for training Restricted Boltzmann Machines, which we seek to accelerate. A Restricted Boltzmann Machine (RBM) is a probabilistic generative model that is able to automatically extract features of its input data using an unsupervised learning algorithm. RBMs consist of a layer of hidden neurons and a layer of visible neurons, with the connection strengths between hidden and visible neurons represented by an array of weights (see Figure 2.1). A Deep Belief Network (DBN) [28, 9] is a multi-layer neural network that can be viewed as a stack of RBMs, with the hidden units of one RBM used as the visible inputs to the next higher RBM. DBNs learn the weights by applying the RBM training algorithm one layer at a time. Ideally, given enough neurons and layers, the user can learn very abstract features of the training set, with the intention of modeling the hierarchical learning structure of the brain; recent related work includes a comparison of sparse DBN output to the V2 area of the visual cortex [38].
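For reference, the two sampling directions used by the training procedure below draw from the standard RBM conditional distributions; writing the weights as $W$, the hidden biases as $b$, and the visible biases as $c$ (symbols chosen here for exposition, not the dissertation's notation):

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(c_i + \sum_j W_{ij} h_j\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.$$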

- Visible neurons initially set to a batch of training examples, denoted vis_batch_0
- Repeat until convergence {
    1) Sample hid_batch_0 from P(h|vis_batch_0)
        a) tmp_matrix_1 = vis_batch_0 * weights
        b) tmp_matrix_2 = tmp_matrix_1 + hid_biases
        c) tmp_matrix_3 = sigmoid(tmp_matrix_2)
        d) hid_batch_0 = tmp_matrix_3 > rand()
    2) Sample vis_batch_1 from P(v|hid_batch_0)
    3) Sample hid_batch_1 from P(h|vis_batch_1)
    4) Update parameters:
        a) weights += α(vis_batch_0^T * hid_batch_0 - vis_batch_1^T * hid_batch_1)
        b) vis_biases += α(vis_batch_0^T * 1 - vis_batch_1^T * 1)
        c) hid_biases += α(hid_batch_0^T * 1 - hid_batch_1^T * 1)
  }

Figure 2.2: RBM Training Algorithm Pseudo-code

To train an RBM, samples from a training set are used as input to the RBM through the visible neurons, and the network then alternately samples back and forth between the visible and hidden neurons. The goal of training is to learn the visible-hidden connection weights and neuron activation biases such that the RBM learns to reconstruct the input data during the phase where it samples the visible neurons from the hidden neurons. Figure 2.2 shows the pseudo-code for the RBM training algorithm. Each sampling process is essentially a matrix-matrix multiply between a batch of training examples and the weight matrix, followed by a neuron activation function, which in many cases is the sigmoid function $1/(1 + e^{-x})$. The sampling between the hidden and visible layers is followed by a slight modification of the parameters (controlled by the learning rate $\alpha$) and repeated for each data batch in the training set, and for as many epochs as necessary to reach convergence.
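As a software reference point, the following NumPy sketch transcribes one pass of the pseudo-code in Figure 2.2 (a single CD-1 update on one batch). The toy dimensions, random seed, and learning rate value are illustrative assumptions, not parameters taken from the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, weights, vis_biases, hid_biases, alpha=0.1):
    """One contrastive-divergence (CD-1) step on a batch v0 of shape
    (batch_size, n_visible), mirroring steps 1-4 of Figure 2.2."""
    # 1) Sample hid_batch_0 from P(h | vis_batch_0)
    h0_prob = sigmoid(v0 @ weights + hid_biases)
    h0 = (h0_prob > rng.random(h0_prob.shape)).astype(float)
    # 2) Sample vis_batch_1 from P(v | hid_batch_0) -- note the transpose of W
    v1_prob = sigmoid(h0 @ weights.T + vis_biases)
    v1 = (v1_prob > rng.random(v1_prob.shape)).astype(float)
    # 3) Sample hid_batch_1 from P(h | vis_batch_1)
    h1_prob = sigmoid(v1 @ weights + hid_biases)
    h1 = (h1_prob > rng.random(h1_prob.shape)).astype(float)
    # 4) Update parameters (positive phase minus negative phase)
    weights += alpha * (v0.T @ h0 - v1.T @ h1)
    vis_biases += alpha * (v0.sum(axis=0) - v1.sum(axis=0))
    hid_biases += alpha * (h0.sum(axis=0) - h1.sum(axis=0))
    return weights, vis_biases, hid_biases

# Toy run: 8 visible units, 4 hidden units, a batch of 16 binary examples.
n_vis, n_hid, batch = 8, 4, 16
weights = 0.01 * rng.standard_normal((n_vis, n_hid))
vis_biases, hid_biases = np.zeros(n_vis), np.zeros(n_hid)
vis_batch_0 = (rng.random((batch, n_vis)) > 0.5).astype(float)
cd1_update(vis_batch_0, weights, vis_biases, hid_biases)
```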

2.3 Target FPGA Development Platform

Both single chip and multi-chip RBM designs are implemented on the Altera DE3 development board from Terasic [51], which features an Altera Stratix III FPGA with a DDR2 SDRAM SODIMM and an SD card interface, as shown in Figure 2.3. The DE3 board also has four high speed connectors for communication with external devices, one SMA connector for an external clock input, and a USB Blaster interface to program the FPGA via JTAG. The JTAG interface is also used for debugging and interacting with the user application.

Figure 2.3: DE3 Development Board Used for RBM

The Stratix III EP3SL340 has 135,000 ALMs (Adaptive Logic Modules)¹, 16,272 kbits of embedded RAM, and 288 embedded 18x18 multipliers. With this number of multipliers, we are capable of processing approximately 256 neurons per clock cycle. A Nios II processor [1], Altera's proprietary soft processor, is instantiated in the FPGA to communicate with the user and configure the RBM. The Nios II receives input from the user (via JTAG), reads data from the SD card into DRAM, controls the overall flow of the RBM module, and displays the internal state of the RBM module. Altera's DDR2 memory controller is instantiated to support reading from and writing to DRAM at a memory clock frequency of 267MHz. The RBM core of the single FPGA implementation is clocked at 200MHz, while the multi-FPGA RBM implementation runs at 150MHz.

For inter-FPGA communication, the DE3 has four HSTC (High Speed Terasic Connector) connectors, which are Terasic's customized version of Altera's popular HSMC (High Speed Mezzanine Card) [2] interface.

¹ ALMs are essentially two 6-input ALUTs combined with two dedicated registers.

HSTC-compatible flex cables from Samtec were used to allow high speed communication between the FPGAs. The multi-FPGA RBM uses four LVDS pairs in each HSTC interface, which yields a data rate of 4.8 Gbps per direction. Details of the connection topology and communication protocol are discussed in Section 2.5.1.

The multi-FPGA implementation also requires a common clock for all the FPGAs. A separate dedicated device is used to create and distribute the common clock to the FPGAs via SMA cables. Using their embedded PLLs, the FPGAs generate all necessary clocks from the common clock, including the system clock and the DDR2 memory clock. Since all clocks are derived from the same source, each clock has exactly the same frequency in every FPGA and may differ only in phase. Being able to guarantee that the clock rates are exactly the same across FPGAs greatly simplifies the inter-chip communication protocol. Although enforcing a common clock may not be desirable in some industrial large-scale settings, as discussed later, we believe it suffices to serve the purpose of this research.

2.4 Single Chip Implementation

2.4.1 Overall System Architecture

Before designing a multi-chip large scale Restricted Boltzmann Machine architecture, we first investigated how to implement an RBM on a single FPGA. This allows us to conduct architectural experiments on relatively small-scale RBMs, and later provides the core building block for a multi-chip RBM system. In addition, the single FPGA implementation gives some important insights into building a scalable multi-chip RBM, as discussed in Section 2.5.1.

Figure 2.4 summarizes the structure of our single chip RBM architecture. The system consists of a Nios II processor, a DDR2 SDRAM controller, and an RBM module. The processor is clocked at a frequency of 100MHz and functions as the interface between the user and the RBM module via JTAG-UART. The CPU also initializes the weights, reads the visible neurons into SDRAM, initiates the algorithm, and returns the results to the user.


Figure 2.4: System Architecture for Single Chip Restricted Boltzmann Machine

The RBM module, operating at 200MHz, is the key component that executes the algorithm with the configuration chosen by the user. At a high level, the RBM module has an array of weights and neurons that are fed into an array of multipliers, and then into adders, to perform the matrix multiplication. After that, the RBM computes the sigmoid to obtain the probability of firing a neuron and fires the neuron using a comparator and a random number generator. After the positive and negative phases, the module continues to iterate until it meets the stopping condition given by the user. Every step is pipelined, which results in a throughput of approximately 256 multiply-and-add (MADD) operations per cycle.

The choice of arithmetic precision was a critical one, since the logic and multiplier resource utilization depends directly on the data width and format. Since FPGAs have considerably fewer logic resources than ASICs [36], data precision needs to be limited in order to fit enough computational units and achieve the targeted performance. Neural networks exhibit soft computing [55] characteristics, which refers to a collection of software techniques that exploit tolerance to noise for better performance and power efficiency.


Figure 2.5: RBM Module Architecture Detail

To determine the optimal precision, we simulated the DBN using the Fixed-Point MATLAB Toolbox for several fixed-point formats. From the simulation results, 16-bit fixed-point numbers were chosen to represent the weights and the training data set. Previous studies [29] also demonstrate that 16-bit precision is sufficient for a large range of neural network benchmarks².

As shown in Figure 2.5, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are stored in the embedded RAM distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups.

² Although 16-bit fixed-point is sufficient for the purpose of this research and the neural networks in consideration, it may no longer be sufficient if we expand the network to a very large scale unless we introduce some notion of sparsity. Section 2.5.3 gives an example of a sparse RBM network.

The rationale for this partitioning is that wire delay increases as semiconductor technology scales, so wire delay becomes the performance bottleneck if placement and routing are not performed efficiently. Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern devices. Signals that must communicate with other groups are appropriately buffered.

Partitioning the design into multiple groups also makes the design scalable. Since most of the algorithm is performed within each group, the design can easily be migrated to a future device by instantiating more of these groups, without having to worry about wire delays or routing. This is because most of the wiring is localized and the global signals are buffered. This also applies when extending the system to multiple boards. Each group was sized to match the DDR2 bus width of 256 bits, allowing for 16 multipliers per group with 16 bits of data precision.

A significant goal of this project is to facilitate research on large DBNs. To provide sufficient speedup, flexibility (relative to a software implementation) had to be sacrificed. Nonetheless, our system provides configurable parameters to allow wide-ranging experiments without the need to modify the FPGA design. The most significant parameter is one that allows the user to specify the number of neurons in each layer. This is in contrast to the RBM implementation by Ly and Chow [42], which requires the network size to be fixed and symmetric. However, due to the pipeline structure of our implementation, the multipliers are only fully utilized when the number of neurons is a multiple of 256. Although this is not a significant limitation when the objective is to accelerate large DBNs, it considerably restricts the range of experiments for the current single board implementation, since the on-chip memory only supports a weight matrix of size up to 512x512. This limitation is addressed in Section 2.5.2, which proposes a multi-chip architecture capable of accepting more nodes per FPGA by loading weights from DRAM. The system allows the user to specify other parameters as well, such as the learning rate. The representation of neurons as either fixed-point or binary numbers is also configurable, whereas the RBM implementation by Ly and Chow [42] only supported binary neurons. This widens the exploration space to non-binary numbers, which can be found in some software RBM implementations [27]. This generalization requires the use of multipliers instead of the simple AND gates used in Ly and Chow's implementation [42].

Modern FPGAs embed hard-wired multipliers that take up die space whether or not they are actually used. Thus, utilizing the multipliers did not increase the logic usage significantly compared to using AND gates.
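The 16-bit fixed-point format adopted in this section was selected from fixed-point simulations of the training algorithm (the dissertation used the Fixed-Point MATLAB Toolbox). The NumPy sketch below illustrates that style of experiment; the specific integer/fraction bit split and the toy matrices are assumptions made for this example, not the format documented here.

```python
import numpy as np

def quantize_fixed(x, int_bits=4, frac_bits=12):
    """Round x onto a signed fixed-point grid with int_bits + frac_bits = 16;
    values are clipped to the representable two's-complement range."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
weights = 0.1 * rng.standard_normal((512, 512))          # toy weight matrix
visible = (rng.random((16, 512)) > 0.5).astype(float)    # toy binary batch

exact = sigmoid(visible @ weights)
quant = sigmoid(quantize_fixed(visible) @ quantize_fixed(weights))
print("max |activation error| with 16-bit operands:", np.abs(exact - quant).max())
```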

2.4.2 Core Components

Matrix multiplication occurs in all three phases: the hidden neuron sampling phase, the visible neuron sampling phase, and the weight update phase. Thus, the inputs for the multiplication operations, which are the weights and the neurons, should reside in the embedded memories distributed across the FPGA. Although locating the inputs close to the multipliers is desirable, distribution of the weights is non-trivial due to a transpose operation that occurs during the visible neuron sampling phase.

The design by Ly and Chow [42] avoids the transpose problem by distributing the data such that no embedded RAM simultaneously reads out two or more elements from the same row with the same address, and no embedded RAM contains two or more elements of the same column. Then, by using a carefully designed addressing scheme, a column or row of the matrix is read directly out of the memory each cycle and no additional communication is required for the transpose. Although this approach eliminates communication for the transpose operation, it has two major drawbacks. One is that the weight and data matrices must be shifted before being written into the on-chip memories, requiring a sophisticated routing scheme from each RAM to the appropriate multiplier, since no row or column vector of the weight matrix contains the same index number. A more critical problem is that this approach assumes that the weight matrix fits on-chip and that the number of RAM blocks for the weight matrix is equal to the number of neurons, or at least O(n). However, if the network size scales to a point where the weight matrix no longer fits on-chip, then the weight matrix has to be streamed in from off-chip memory. Since the number of embedded RAMs can no longer equal the number of neurons, different weight matrix routing logic is required each time a portion of the weight matrix is streamed in, severely limiting scalability.

Figure 2.6: Matrix Multiplication for Computing (a) Hidden Neurons and (b) Visible Neurons

Although our single chip RBM implementation also assumes that the weight matrix fits on-chip, our approach solves the transpose problem in a way that scales to large networks where the weight matrix can be stored in off-chip DRAM. Figure 2.6 illustrates our approach.

To understand how our module works, the key observation is that a matrix multiplication can be viewed in several different ways; a matrix multiplication $C = A \cdot B$ (with $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$) can be considered as multiple linear combinations of vectors (2.1), as multiple vector inner products (2.2), or as a sum of vector outer products (2.3).

$$\begin{bmatrix} C_{1,j} \\ C_{2,j} \\ \vdots \\ C_{m,j} \end{bmatrix} = \sum_{i=1}^{k} B_{i,j} \begin{bmatrix} A_{1,i} \\ A_{2,i} \\ \vdots \\ A_{m,i} \end{bmatrix} \qquad (2.1)$$

$$C_{i,j} = \begin{bmatrix} A_{i,1} & A_{i,2} & \cdots & A_{i,k} \end{bmatrix} \cdot \begin{bmatrix} B_{1,j} \\ B_{2,j} \\ \vdots \\ B_{k,j} \end{bmatrix} \qquad (2.2)$$

$$C = \sum_{i=1}^{k} \begin{bmatrix} A_{1,i} \\ A_{2,i} \\ \vdots \\ A_{m,i} \end{bmatrix} \times \begin{bmatrix} B_{i,1} & B_{i,2} & \cdots & B_{i,n} \end{bmatrix} \qquad (2.3)$$

The matrix multiplication in the reconstruction phase ($HW^\top$) can be viewed as vector inner products (Eq. 2.2), where each row of $H$ and each column of $W^\top$ are multiplied element-wise, followed by a sum reduction. This suggests that each column of $W^\top$ and each row of $H$ should be spread out across separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Figure 2.6(b). For the hidden computation phase ($VW$), consider the transposed matrix operation ($W^\top V^\top$), and view the operation as a linear combination of vectors (Eq. 2.1). This requires that the $j$-th column vector of $W^\top$ be multiplied by the $j$-th element in a column vector of $V^\top$. This gives the structure of Figure 2.6(a), which computes multiple partial sums of hidden neurons in parallel. Since at each cycle we only need to read a column vector of $W^\top$ in both cases, the memory layout for the weights can remain the same, and no additional communication or routing is required for a transposed matrix multiplication.

Our approach requires more adders, since the adder requirements of the two phases are different. The reconstruction phase uses an adder tree to compute the reconstructed visible neurons; the pipelined structure yields one visible neuron every $\lceil m/256 \rceil$ cycles, where $m$ is the number of visible neurons. The hidden neuron computation phase requires accumulators instead, holding 256 partial sums; this also computes (on average) one hidden neuron every $\lceil n/256 \rceil$ cycles, where $n$ is the number of hidden neurons. This approach provides a scalable method for matrix multiply operations, with or without transpose, at the cost of additional hardware.
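The three factorizations above can be checked numerically; the short NumPy sketch below does so for a small random example. It mirrors only the algebra that motivates the hardware layout (reading one column of the transposed weight matrix per cycle serves both phases) and makes no claim about the RTL itself.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, n = 5, 4, 3
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))
C = A @ B

# Eq. (2.1): column j of C is a linear combination of the columns of A,
# weighted by the entries of column j of B.
C1 = np.stack([sum(B[i, j] * A[:, i] for i in range(k)) for j in range(n)], axis=1)

# Eq. (2.2): entry (i, j) of C is the inner product of row i of A and column j of B.
C2 = np.array([[A[i, :] @ B[:, j] for j in range(n)] for i in range(m)])

# Eq. (2.3): C is the sum over i of the outer product of column i of A with row i of B.
C3 = sum(np.outer(A[:, i], B[i, :]) for i in range(k))

assert np.allclose(C, C1) and np.allclose(C, C2) and np.allclose(C, C3)
print("Eqs. (2.1)-(2.3) all reproduce A @ B")
```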

Resource       ALUTs (Combinational)   ALUTs (Memory)   Registers         Block Memory (kbits)
RBM Group      3893 (1.44%)            2205 (1.63%)     9552 (3.54%)      590 (3.54%)
Global         1564 (0.58%)            0 (0.00%)        2238 (0.83%)      0 (0.00%)
SOPC           6670 (2.47%)            320 (0.24%)      5740 (2.13%)      123 (0.74%)
System Total   71494 (26.48%)          35600 (26.37%)   160996 (59.63%)   9560 (57.38%)

Table 2.1: Single FPGA RBM Resource Utilization

The update phase involves multiplying the transpose of the visible neuron matrix by the hidden neuron matrix, which is essentially the sum of outer products between the visible and hidden neurons (Eq. 2.3). Since the visible data already has a datapath that broadcasts as in Figure 2.6(a), and a number of hidden neurons can be read simultaneously as in Figure 2.6(b), the update phase multiplication can easily reuse the structure in Figure 2.6(a), where hidden neuron values take the place of the weights.

The matrix multiplication results are provided to an activation function, which in our case is the widely used sigmoid function. Since the sigmoid function is expensive to implement in hardware (it requires exponentiation and division), we instead used an approximate sigmoid design called PLAN (Piecewise Linear Approximation of Nonlinear function) [4], which requires only a minimal number of addition and shift operations. In software simulations, we found that the convergence properties were not degraded by the use of this approximate sigmoid function. The stochastic characteristics of an RBM are also greatly influenced by the quality of the random number generator (RNG). We used the RNG described in [52], which is a combination of a 43-bit LFSR (Linear Feedback Shift Register) and a 37-bit CASR (Cellular Automata Shift Register), providing good statistical properties along with a cycle length of 2^80, which is sufficient for our application.
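For illustration, the sketch below compares an exact sigmoid with a PLAN-style piecewise linear approximation. The breakpoints and slopes shown are the commonly cited PLAN values (all slopes are powers of two, so the multiplications reduce to shifts); the dissertation does not list its exact table, so treat these numbers as representative rather than as the implemented design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def plan_sigmoid(x):
    """PLAN-style piecewise linear sigmoid approximation.

    Slopes of 1/4, 1/8, and 1/32 reduce to shift-and-add in hardware,
    which is what makes this scheme attractive on an FPGA.
    """
    ax = np.abs(x)
    y = np.where(ax >= 5.0, 1.0,
        np.where(ax >= 2.375, 0.03125 * ax + 0.84375,
        np.where(ax >= 1.0,   0.125   * ax + 0.625,
                              0.25    * ax + 0.5)))
    return np.where(x < 0.0, 1.0 - y, y)

xs = np.linspace(-8, 8, 2001)
print("max |PLAN - sigmoid| on [-8, 8]:",
      np.abs(plan_sigmoid(xs) - sigmoid(xs)).max())
```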

2.4.3 Experimental Results

Table 2.1 summarizes the resource utilization of our single FPGA implementation, including the RBM computation engine and our instantiation of Altera's System-On-a-Programmable-Chip (SOPC) module. The SOPC includes a Nios II processor, a DDR2 memory controller, peripheral components, and the Avalon interconnect.

It should be noted that partitioning the design may increase the total logic count, since synthesis optimizations do not apply across partition boundaries. Thus, partitioning can be seen as a trade-off of scalability and performance against silicon area.

The FPGA implementation was verified by comparing its results with the reference MATLAB implementation from Hinton et al. [28]. The MATLAB RBM code was modified to use fixed-point representation; the FPGA version of the RBM was only deemed correct when its results matched the MATLAB output stream, given the same input.

Performance measurement was done in comparison against an Intel Core 2 processor clocked at 2.4GHz running a single-threaded version of the RBM application. MATLAB was used for the comparison since MATLAB is highly optimized for matrix operations and usually performs at least comparably to C implementations, if not better. For a fair comparison, both single and double precision versions of the MATLAB RBM were used. A fixed-point MATLAB version of the RBM was not considered for performance evaluation, since MATLAB does not currently support efficient fixed-point matrix operations. Three network sizes, 256x256, 512x512, and 256x1024, were tested and compared to see how our system performs on small, large, and asymmetric networks. Performance measurement covered only the execution of the algorithm itself; the time for data transfer between the SD card and the onboard DRAM was not taken into account³.

Figure 2.7 shows the speedup achieved by our implementation. Our RBM system runs 25 times as fast as the single precision software implementation, and 30 times as fast as the double precision implementation. Although power consumption was not measured in this experiment, we can infer from the multi-FPGA results in Section 2.5.5 that the estimated power for a single FPGA is around 10W, considerably lower than the Core 2 processor's power consumption (65W). This shows that the energy-efficient nature of custom logic enabled our RBM implementation to offer better performance using less power than modern general-purpose processors. Graphics processor performance was not evaluated in this experiment.

³ Data may also be transferred from the host computer to the FPGA via the JTAG-UART interface. Although this approach is useful for debugging the RBM module on small training datasets, the JTAG-UART interface is unreasonably slow for sending large, real-world datasets.

Figure 2.7: Speedup for 50 Epochs on 256x256, 256x1024, and 512x512 Networks (baselines: single and double precision CPU implementations)

Section 2.5.5 compares GPU performance with the multi-FPGA RBM implementation runtime, from which we can infer that each FPGA, running at 150MHz, performs comparably to GPUs running at 1GHz.

2.5 Extension to Large Scale RBM Architecture

As mentioned in the previous section, the size of RBMs in a single FPGA implementation is limited by the on-chip memory capacity. In order to support large RBMs, the accelerator design needs to be extended to a scalable multi-chip system. The following subsections describe the details of our multi-chip RBM architecture. Section 2.5.1 describes how multiple RBM modules can be interconnected to provide linear scalability. Section 2.5.2 discusses the trade-offs of two different strategies for placing the weights in DRAM, which is necessary to accommodate the quadratically growing weight matrix. Section 2.5.3 introduces a restricted form of sparse RBM that can be easily supported by the multi-RBM architecture with one tweak.


Figure 2.8: A High-level View of Inter-chip Communications Between Three FPGAs

Section 2.5.4 explains some of the limitations of the current multi-RBM implementation and their workarounds. Section 2.5.5 analyzes the performance results of our multi-FPGA RBM implementation.

2.5.1 Scalable Multi-chip Architecture

The modular approach used in our single-FPGA RBM design divides the work into multiple groups, localizing most operations such as matrix multiplication, the sigmoid function, and weight updates. Localization enables the user to easily migrate the same design to a future technology and take advantage of integration density improvements by adding more modules. The few operations that do require global communication are appropriately buffered to avoid long wiring.

Although our modular design approach is scalable within a chip, we cannot directly extend it to multi-chip systems. The main issue is how to deal with the global communication across chips, which includes the visible neuron broadcast (Figure 2.6(a)) in the hidden neuron computation phase and the tree add reduction (Figure 2.6(b)) for the visible reconstruction phase. Figure 2.8 illustrates our multi-FPGA system architecture for a network of three FPGAs. Our novel design builds on top of our previous single FPGA architecture.


Figure 2.9: Simplified Version of the Communication Between FPGAs

We see no inherent obstacles to extending our design to hundreds or thousands of FPGAs (although practical challenges are expected to arise when constructing such large systems). Our key insight is that it is possible to evenly distribute the computation across multiple FPGAs and require only two nearest-neighbor communication links for each FPGA, including a connection from the first FPGA to the last FPGA. The resulting interconnect topology is a ring network, as shown in Figure 2.8.

Let us first consider the hidden node computation phase. In the single FPGA implementation, visible neurons are broadcast during the computation of the hidden neurons. However, broadcasting to multiple FPGAs would severely limit the scalability of our system. Thus, broadcasting the neurons is only done within each FPGA (with the appropriate buffering). Instead of broadcasting to all FPGAs, each FPGA passes the visible neurons it has read or received to its neighbor in one direction. To avoid any initial idle cycles waiting for data, each FPGA first reads its own portion of the visible data from local memory and multiplies it with the appropriate weights, illustrated as the bold lines in Figure 2.9(a). In the meantime, each FPGA passes along the visible data it has processed and consumes new incoming visible data, as shown in Figure 2.9(b), until all the visible data has completely traversed the ring.

Reconstruction of the visible neurons on multiple FPGAs is done in a similar manner. Visible neuron computation requires a global add reduction. If we were to implement the global add reduction across multiple FPGAs with a method similar to the one used in the single FPGA implementation, then either the connections between the FPGAs would need to be almost all-to-all, or the global reduction would have to be performed and transferred at a slow rate due to shared wire contention, limiting the overall performance. Instead of performing the global add reduction all at once, we have each FPGA calculate the partial reduction for its final destination FPGA and pass this result to the neighboring FPGA⁴. Figures 2.9(c) and 2.9(d) illustrate how the partial reductions are passed to neighboring FPGAs. In Figure 2.9(c), each FPGA starts by computing the partial reduction for the furthest FPGA and passes its result to its neighbor. Then, in Figure 2.9(d), each FPGA computes the partial sum for the next furthest FPGA and adds it to the incoming partial sum. This continues until the partial sums accumulate at the final destination FPGA, at which point the visible node is reconstructed.

Since the hidden neuron computation requires one visible neuron broadcast per cycle, each FPGA only needs to send at most one visible neuron to its neighbor per cycle⁵; the data rate may be lower if there are more neurons than the number of multipliers per FPGA. The partial sums from the visible data reconstruction also require at most one partial sum communication per cycle. Communications flow in only one direction during a particular phase of the computation, although the direction changes periodically (hence a physical interconnect that supports only a single direction is not sufficient, but a full duplex link is not necessary).

4We chose the communication direction to be opposite of the hidden computation phase direction to support sparse RBMs, which we explain in Section 2.5.3. For fully dense networks, the direction of communication does not matter.
5The I/O bandwidth requirement may increase if weights are streamed from DRAM, as explained in Section 2.5.2.

than the time required to compute n neurons, where n is the number of neurons per FPGA. Thus, when deciding the number of neurons per chip, this latency factor must also be considered. Fortunately, since this study targets large-scale RBMs on FPGAs using high-speed LVDS connections, the communication latency is very unlikely to be a problem6. However, one may want a sufficiently large n when implementing the RBM design in ASICs, since ASICs typically support higher clock frequencies.

A parallel computation that requires only a ring topology for connecting FPGAs has several advantages. Modern ASICs and FPGAs have large off-chip I/O bandwidth, provided by many pins that can be clocked at high frequencies. However, the number of pins is limited, so a ring topology, as opposed to one with a higher number of connections per chip, is one of the few that allows the logical connections to be implemented directly as physical connections. This enables higher bandwidth and is cheaper; solutions involving high-bandwidth switches (such as 10GbE) can be prohibitively costly.
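To make the ring schedule concrete, the following C++ sketch models the two phases in software for a small ring of chips. The data layout and all names are our own illustration (activation functions and sampling are omitted); this is not code from the FPGA design.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Toy model of the ring schedule: P chips each own n visible and n hidden
    // neurons; chip c also stores the weight rows feeding its hidden slice.
    int main() {
        const int P = 4, n = 3, N = P * n;
        std::vector<std::vector<double>> W(N, std::vector<double>(N));
        std::vector<double> v(N), h(N, 0.0), v_rec(N, 0.0);
        for (int i = 0; i < N; ++i) {
            v[i] = 0.01 * (i + 1);
            for (int j = 0; j < N; ++j) W[i][j] = 0.001 * (i + 2 * j + 1);
        }

        // Phase 1: hidden computation.  At hop 0 each chip consumes its local
        // visible slice; at hop k it consumes the slice forwarded k times
        // along the ring (visible data travels in one fixed direction).
        for (int hop = 0; hop < P; ++hop)
            for (int c = 0; c < P; ++c) {
                int src = (c + hop) % P;              // owner of the slice seen this hop
                for (int i = c * n; i < (c + 1) * n; ++i)
                    for (int j = src * n; j < (src + 1) * n; ++j)
                        h[i] += W[i][j] * v[j];
            }

        // Phase 2: visible reconstruction.  The partial sum destined for chip d
        // starts at the furthest chip and accumulates contributions as it moves
        // around the ring in the opposite direction, arriving complete at d.
        for (int d = 0; d < P; ++d) {
            std::vector<double> acc(n, 0.0);          // travelling partial sums
            for (int step = 1; step <= P; ++step) {
                int c = (d + step) % P;               // step P is chip d itself
                for (int jj = 0; jj < n; ++jj)
                    for (int i = c * n; i < (c + 1) * n; ++i)
                        acc[jj] += W[i][d * n + jj] * h[i];
            }
            for (int jj = 0; jj < n; ++jj) v_rec[d * n + jj] = acc[jj];
        }

        // Sanity check against a direct global reduction.
        double err = 0.0;
        for (int j = 0; j < N; ++j) {
            double ref = 0.0;
            for (int i = 0; i < N; ++i) ref += W[i][j] * h[i];
            err = std::max(err, std::fabs(ref - v_rec[j]));
        }
        std::printf("max reconstruction mismatch: %g\n", err);
        return 0;
    }

The final check confirms that the hop-by-hop partial sums reproduce the result of a direct global reduction.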

2.5.2 Streaming Weights From DRAM and Associated Trade-offs

A major issue that was not addressed in Section 2.5.1 is that the weight matrix is assumed to fit on-chip. For a single-FPGA implementation, that assumption is reasonable as modern FPGAs provide ample on-chip memory. However, the weight matrix grows as O(n²), where n is the number of neurons. As we increase the number of FPGAs, the number of multipliers increases proportionally, so the number of neurons that can be computed per cycle also grows linearly. However, the embedded memory only increases linearly, while the weight matrix increases quadratically. To overcome this issue and scale to large systems, the weight matrix must be streamed in from off-chip memory.

However, the RBM architecture discussed so far assumes that M weights are consumed each cycle, where M is the number of multipliers in each FPGA. This may require a huge memory bandwidth that is beyond the

6Let us consider an FPGA with n = 256 running at 150MHz. Then the time to process n neurons is 256 × 6.7ns = 1715.2ns, which is much larger than the I/O link latency.

Figure 2.10: Computation of Processing (a) One Input Vector and (b) Two Input Vectors Per Cycle

limits of current FPGAs. Suppose we have 256 multipliers in a chip. To fully utilize the multipliers, we need to stream in 256 weights per cycle. Assuming a 200MHz RBM with 16-bit precision weights, the required memory bandwidth is around 100GB/sec, which is not supported by the Stratix III FPGA7.

One way to alleviate the memory bandwidth problem is to exploit the additional parallelism in the matrix multiplication by blocking the neuron matrix to process multiple training examples at once. This allows each weight to be fed into several multipliers rather than just one. Figure 2.10 compares processing one training example per cycle with processing two training examples per cycle. The two-input case halves the number of weights needed per cycle while still utilizing all the multipliers. In general, parallel processing of K training examples reduces the weight bandwidth requirement by a factor of K. Let us return to the previous example where 100GB/sec of bandwidth was required. The bandwidth requirement can be reduced to 6.25GB/sec when K = 16. This bandwidth is achievable on the DE3 board by using a DDR2 400MHz SDRAM, which can supply 128 bits at 400MHz, i.e., 6400MB/sec.

However, such a dramatic decrease in the DRAM bandwidth requirement comes with costs. One trade-off is that the on-chip storage for visible and hidden neurons

7The latest high-end FPGAs, such as the Altera Stratix V, now support such memory bandwidth.

Figure 2.11: Trade-off Between (a) Memory Bandwidth and I/O Bandwidth (b) Memory Bandwidth and On-chip Storage

increases, since multiple training examples are read from the local memory instead of one. For the range of K (the number of training examples processed in parallel) we consider in our design, this only takes up a fraction of the available embedded memory. A more critical trade-off is the increase in I/O bandwidth requirements. Recall that each chip sends one visible neuron to its neighbor each cycle. However, since the RBM is now processing multiple visible training examples at once, the I/O bandwidth requirement increases linearly with K. The same applies to the partial sums in the visible computation phase. Thus, the total memory and I/O bandwidth cost for M multipliers per FPGA is

    M/K + K                                                            (2.4)

where the unit of the cost is 16 bits of data per cycle. To minimize the total bandwidth requirement, we simply set M/K = K, which leads to K = √M.

Figure 2.11(a) illustrates the memory and I/O bandwidth trade-off as K is changed. The numbers of multipliers were selected to be 256 and 1024 to reflect the latest FPGAs on the market. As can be seen in the plot, only the memory bandwidth requirement depends on the number of multipliers. The I/O bandwidth remains constant while we change

the number of multipliers. However, the memory and I/O bandwidth requirements both increase linearly with the clock frequency. Therefore, for a given communication capacity, it is generally more efficient to use FPGAs with more multipliers at a reduced clock frequency than to attempt to gain performance by increasing the clock frequency.
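The cost model in Eq. (2.4) is easy to tabulate. The short C++ sketch below prints the DRAM and chip-to-chip I/O rates implied by a few values of K under assumed parameters (M = 256 multipliers, 16-bit words, a 200MHz clock); it is only a back-of-the-envelope aid, not part of the design.

    #include <cmath>
    #include <cstdio>

    int main() {
        const int M = 256;                  // multipliers per FPGA (assumed)
        const double f_hz = 200e6;          // RBM clock frequency (assumed)
        const double bits = 16.0;           // weight / neuron precision

        std::printf("optimal K = sqrt(M) = %.0f\n", std::sqrt((double)M));
        std::printf("%6s %14s %14s\n", "K", "DRAM (Gbps)", "I/O (Gbps)");
        for (int K = 1; K <= M; K *= 4) {
            double dram = (double)M / K * bits * f_hz / 1e9;  // weights streamed per second
            double io   = (double)K     * bits * f_hz / 1e9;  // neurons/partial sums per second
            std::printf("%6d %14.1f %14.1f\n", K, dram, io);
        }
        return 0;
    }

With these assumptions the K = 16 row lands close to the roughly 6.25GB/sec figure quoted above, and the printed optimum K = √M = 16 matches the analysis.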

(1) (1) VW v1 v : first training example (a) Multiply Weight Buffer Array W1,1 1 W2,1 W1,1 v (1)W (1) 1 1,1 vi Wi,1 i1

W 1,2 1 W2,2 W (1) (1) 1,2 v1 W1,2 vi Wi,2

i1 DRAM

W1,3 1 W1,3 v (1)W (1) 1 1,3 vi Wi,3 i1

W1,4 1 W1,4 (1) (1) v1 W1,4 vi Wi,4 i1

(2) (2) VW v1 v : second training example (b) Multiply Weight Buffer Array W1,1 1 W2,1 W1,1 v (2)W (2) 1 1,1 vi Wi,1

i1

W 1,2 1 W2,2 W (2) (2) 1,2 v1 W1,2 vi Wi,2

DRAM i1

W1,3 1 W2,3 W1,3 v (2)W (2) 1 1,3 vi Wi,3 i1

W1,4 1 W2,4 W1,4 (2) (2) v1 W1,4 vi Wi,4 i1

Figure 2.12: Buffering Weights and Partial Results

Although this memory-I/O bandwidth trade-off is a reasonable approach for modern FPGAs, there may be situations where I/O bandwidth is severely limited. In such

cases, one may instead trade off the memory bandwidth requirement against the on-chip memory storage requirement. Recall that the reduction of memory bandwidth was made possible by reusing the weights on several multipliers. Instead of simultaneously processing multiple training examples, we can still reuse weights by alternating between training examples each cycle and buffering the weights and partial results. Figure 2.12 illustrates how buffering can help reduce the weight bandwidth. Figure 2.12(a) processes the first training example and buffers the first half of the weights, while Figure 2.12(b) processes the second training example and buffers the second half of the weights. The same weights are used for these two training examples; thus, the weights for each hidden variable only need to stream in every other cycle, halving the bandwidth requirement. The I/O bandwidth remains one neuron per cycle, but the on-chip storage is doubled to buffer two partial results instead of one. In general, alternating between K training examples requires a memory bandwidth of M/K and on-chip storage of M · K, where M is the number of multipliers. The trade-off between memory bandwidth and on-chip buffering is shown in Figure 2.11(b). As can be seen, both the memory bandwidth and storage requirements increase linearly with the number of multipliers, while the clock frequency only affects the memory bandwidth. Therefore, for a given on-chip storage capacity, it is more effective to increase the clock frequency than to use FPGAs with more multipliers.

In conclusion, increasing the number of training examples (K) helps reduce the memory bandwidth requirement. However, K is limited by the on-chip memory space, and increasing it may also increase the I/O bandwidth requirement. The right balance for each system depends on the specific values of the chip-to-chip and memory bandwidth.
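The two weight-reuse schemes can be compared side by side with a few lines of arithmetic; the sketch below assumes M = 256 multipliers and 16-bit words, and simply lists, for each K, the per-cycle DRAM traffic common to both schemes, the extra I/O words required by batching, and the extra on-chip buffer words required by alternating.

    #include <cstdio>

    int main() {
        const int M = 256;  // multipliers per FPGA (assumed)
        std::printf("%4s %16s %16s %16s\n",
                    "K", "DRAM words/cyc", "I/O words/cyc", "buffer words");
        for (int K = 1; K <= 64; K *= 2) {
            int dram = M / K;           // both schemes stream M/K weights per cycle
            int io = K;                 // batching: K neurons / partial sums per cycle
            long buf = (long)M * K;     // alternating: M*K buffered partial results
            std::printf("%4d %16d %16d %16ld\n", K, dram, io, buf);
        }
        return 0;
    }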

2.5.3 Locally Dense Sparse Network

Although RBMs have all-to-all connections between the visible and hidden layers, it is unlikely that all connections will be actively used. Training RBMs for applications where locality is important may result in a sparse representation of the weight matrix. We can exploit this sparseness of the weight matrix to increase the efficiency of computation.


Figure 2.13: Locally Dense Sparse Network (a) Implementation (b) Conceptual Figure

One simple way to make use of the sparseness is to limit the fanout of each neuron to a constant number C. In addition, for simplicity, we restrict the connections of each neuron to its C nearest neighboring neurons. Then dense connections, if any, will occur only at the neighboring nodes. This is over-restrictive in the sense that the algorithm does not necessarily converge to such a locally dense representation in general. However, we believe that this setting fits well in applications where locality plays an important role, such as visual recognition [39, 48].

This restricted sparse configuration also maps well to our architecture. If we set C to be a multiple of the number of neurons per chip, we only need to configure the number of chips the data is passed down before stopping the computation phase. Figure 2.13 illustrates this approach. The boxes in Figure 2.13(a) represent the chip borders and the circles are the neurons. The left and right ends of the network are wrapped around such that all nodes have constant fanout. Figure 2.13(b) is an equivalent diagram of the network in Figure 2.13(a), leaving out the chip borders. As shown in Figure 2.13(a), the two layers are not completely connected to each other, but have a constant fanout of C = 2 × 3 = 6. Although the network shown in Figure 2.13 is not itself “sparse”, conceptually it shows how a locally dense sparse network will operate when many chips are connected to the system. By adding this one control, we can easily implement the locally dense sparse network.

The visible neurons and hidden partial sums are communicated in opposite directions, as mentioned in Section 2.5.1. This is because the hidden partial sums are used to calculate the reconstructed visible neurons, so the final summation must occur on

the FPGA where the target visible neurons are located. Visible neurons, on the other hand, are used to compute the hidden variables, and are passed along the network until they reach the FPGA with the furthest connected hidden neuron.

Another reason the locally dense sparse network is attractive is that all-to-all dense RBMs are costly to scale due to the O(n²) memory requirement for storing the weight matrix. Assuming each chip has 256 neurons and 1GB of dedicated DRAM, the maximum number of chips (B) that can fit the weight matrix is around 4000, assuming weights occupy half of the DRAM space8. For a locally dense sparse network, the weight matrix can be considerably smaller, since it grows only as O(n) with the number of neurons due to the constant fanout. This allows us to build a very large scale network beyond 4000 chips.

2.5.4 Limitations of Multi-chip RBM

Architectural Limitations

Although the multi-FPGA system described in the previous sections allows linear scalability, there are some practical issues that may prohibit such scaling.

One issue is synchronization between the chips. A simple synchronization method is to use a common clock. This is possible because the running time of each phase in the RBM algorithm is completely deterministic, so as long as the chips are synchronized to each clock edge, the chips are executing the same computational phase. However, for some very large systems, distributing a common clock to every chip may not be feasible. In such cases, multiple clock domains have to be defined, in which case the boards may not be completely in sync. Thus, each chip of the multi-chip RBM architecture must stall whenever its incoming queue is empty or its outgoing queue is back-pressured, affecting the overall performance.

Another issue is that the system described above does not provide any fault tolerance mechanism. If one chip fails, then the entire system fails. One solution is to group identical chips and checkpoint the state of the system periodically. During

8Derived from B · 2^29 bytes > (256 · B)^2 × 16 bits, i.e., B < 2^29/2^17 = 4096.

The figure plots reconstruction error against training epochs for simulated bit error rates between 6 × 10−12 and 6 × 10−8.
Figure 2.14: Bit Error Rate Influence on Convergence

normal execution, a group of chips would stop and dump its state into a shared memory. In addition, each group may have a certain number of redundant RBM chips that are connected to the ring but are normally bypassed. When the network determines that a chip has failed, the group reverts to its previous checkpoint state and replaces the failed chip with a redundant chip. The network will skip the set of neural data of the phase in which the failure occurred, exploiting the fact that neural networks, including the RBM algorithm, generally exhibit the soft computing property [55]. Thus, loss of data in the middle of training an RBM is not critical, although it may increase the amount of time required to reach convergence.

The current RBM system does not provide any protection against errors in data transmission either; however, the soft computing property enables the RBM algorithm to tolerate bit errors to a certain degree. Thus, as long as the transmission-medium bit error rate is low enough, one may send data without any protection to eliminate communication overhead. Our simulation experiments show that the RBM manages to converge for a bit error rate9

9The generated bit error rates are approximate rather than exact, since the simulation assumed that at most one bit of each 16-bit word can be flipped.

of less than 6 × 10−9, as shown in Figure 2.14.

Platform Restrictions

At the time of developing the multi-FPGA RBM, six DE3 boards were available to us. One board was used to generate a common clock, and another board became unresponsive, leaving us with four usable DE3 boards. We used these four FPGAs to implement the multi-FPGA RBM; this is small compared to the design's potential scalability, but it is enough to demonstrate the ideas mentioned earlier.

Another limitation is due to the communication between FPGAs. Although the DE3 board supports up to 30 LVDS pairs per HSTC connector10, we had difficulty communicating data successfully over more than four LVDS pairs. The problem seemed to reside in the PLL configuration, although we never had a chance to find its root cause. Due to time constraints, we decided to use only four LVDS pairs, resulting in a total datarate of 4.8 Gbps per direction.

Limiting the communication bandwidth only allows sending one 24-bit partial sum per cycle. In addition, the RBM clock frequency had to be reduced to 150MHz to stay within the I/O bandwidth constraints11, such that the computation pipeline never stalls due to an overflow of data and does not require complicated control signals. Since one data element is communicated per cycle in each direction, the weight datarate from DRAM can only be reduced below the DE3 DDR2 bandwidth by buffering at least 16 elements per multiplier. Fitting the additional logic resources required for buffering into an already crammed FPGA is time-consuming and may not be feasible. Thus, the current RBM implementation does not support streaming weights from DRAM; this is left as future work. In fact, the sparse matrix accelerator, introduced in Chapter 3, applies similar DRAM streaming concepts in an Altera Stratix IV FPGA, which has more logic and memory resources than the FPGA used on the DE3

10Nine of the 30 LVDS pairs are true LVDS, while the other 21 pairs are emulated (max 300Mbps). This gives a theoretical maximum I/O bandwidth of 17.1Gbps.
11Although theoretically the required I/O bandwidth is exactly the partial-sum data rate (24 × 200MHz = 4.8Gbps), the communication between FPGAs involves not only the payload data but also the packet headers. Due to this overhead, 24-bit partial sums cannot be sent every cycle unless the clock frequency is reduced.

board. Our current multi-FPGA RBM implementation instead utilizes the internal memory of each FPGA. Fortunately, the embedded memory capacity is large enough to store the weights for four FPGAs. Unfortunately, the embedded memory capacity only allows the minimum of 256 neurons per FPGA, since the weight matrix stored in each FPGA includes the connections to the three other FPGAs (a total of 4 × 256 × 256 connections per FPGA). Thus, the flexibility to adjust the number of neurons, as described in Section 2.4.1, is lost.

Although 16-bit fixed point numbers are adequate for the purposes of this study, this may no longer be true for very large scale systems. In such cases, the arithmetic units should support a format with better precision, such as single-precision floating point numbers12. Although the computation capacity per chip may be reduced due to the larger arithmetic units, the architectural features discussed in the previous sections, as well as the near-infinite linear scalability, still hold true.

2.5.5 Experimental Results

To evaluate our implementation, we used the FPGA boards to train on the MNIST handwritten digits dataset. To verify that the results are indeed correct, we modified a reference MATLAB implementation from Hinton et al. [28] into a 16-bit fixed point version so that we could compare the output of the algorithm given the same input. The implementation was considered correct only when the hardware and software implementations gave the same outputs for the same input.

Performance was compared against a 2.3GHz Intel Xeon E5345 processor and an NVIDIA GeForce GTX 275 GPU, which has 240 CUDA processor cores running at 1.4GHz. The RBM module, written in C++, used the GotoBLAS2 [23] library and the NVIDIA CUBLAS library for optimized matrix operations. However, these BLAS libraries only support floating point numbers, so single-precision routines were used.

Using the four FPGA boards available to us, we tested our multi-FPGA prototype system with the following network configurations: dense 768x768, dense 1024x1024, and “sparse” 1024x1024 with connections between two neighboring boards. Although

12The sparse matrix accelerator in Chapter 3 supports half-precision floating point numbers.

(a) Batch size K=100

            Network size     768x768    1024x1024   1024x1024(s)
  CPU       runtime (s)      3578.52    5424.8      4332.91
            Gmult/s          2.47       2.90        1.82
  GPU       runtime (s)      152.91     236.13      185.10
            speedup          23.40      22.98       23.41
            Gmult/s          57.85      66.61       42.49
  FPGAs     runtime (s)      76.97      102.58      51.36
            speedup          46.49      52.88       84.37
            Gmult/s          114.95     153.33      153.13

(b) Batch size K=16

            Network size     768x768    1024x1024   1024x1024(s)
  CPU       runtime (s)      4582.09    7863.75     5755.13
            Gmult/s          1.93       2.00        1.37
  GPU       runtime (s)      309.87     569.11      522.46
            speedup          14.79      13.82       11.02
            Gmult/s          28.55      27.64       15.05
  FPGAs     runtime (s)      76.97      102.58      51.36
            speedup          59.53      76.67       112.07
            Gmult/s          114.95     153.33      153.13

Table 2.2: Performance of Single Core CPU, GPU, and Multi-FPGA

these are not large networks, they are sufficient to demonstrate our architecture in practice. The experiments were conducted for a fixed number of 50 epochs and for two batch sizes, 16 and 100. For the multi-FPGA implementation, a batch size of 100 was not tested since the on-chip memory cannot hold both large batches and the weight matrix at the same time. The sparse RBM was run on the CPU and GPU by computing only the required operations. Table 2.2 shows the results for the 768x768 and 1024x1024 dense networks and the 1024x1024 locally dense sparse network.

Runtime was measured to compare the speedup of the GPU and multi-FPGA systems against a single CPU core. The average number of multiplications per second (mult/s) was used as a universally comparable performance metric that does not depend on the problem size; this is similar to the widely used metric CUPS (connection updates per second), but mult/s does not depend on the batch size.

As seen in Table 2.2(b), the CPU and GPU perform poorly with a batch size of 16. This is because small batch sizes tend to perform less well on SIMD-like cores, since matrix-vector multiplication does not provide enough parallelism. A batch size of 16 is sufficiently small that this lack of available parallelism becomes apparent in the results. However, the multi-FPGA implementation does not exhibit this problem, since the overhead to initiate a matrix multiplication is very small and the design always uses all the multipliers to achieve maximum performance. This is a practical advantage for the multi-FPGA architecture since smaller batches may require fewer epochs to converge, so small batch sizes may be favored in practice by end users. Our experiments show that the error level at epoch 50 for batch size 100 can be achieved with only 36 epochs for batch size 16. Thus, if we were to run the algorithm until a certain error rate is reached, the speedup for the multi-FPGA implementation would be even higher.

It is important to note that the networks we are experimenting with are not large scale. Since graphics processors tend to perform better with larger matrices, the speedup may differ when scaled to larger networks. Therefore, Table 2.2 is only a reference that shows how our design's performance scales with problem size. In fact, if we increase the network size to 3072x3072, then the graphics processor shows around a 52.7X speedup compared to the Intel Xeon processor. However, graphics processors in high-end discrete graphics cards have a limited amount of on-board DRAM, which limits the size of the network to approximately 10K – 20K neurons per layer. As we will see shortly, graphics processors also do not scale very well to a large number of nodes, which may be an issue for investigating very large RBMs. In addition, our RBM architecture does not rely on FPGA technology and can be implemented as an ASIC at a higher clock frequency; based on the numbers in Table 2.2, we can expect to get similar per-chip performance by using a 500MHz clock on the RBM accelerator13. Therefore, although graphics processors may show better performance than our 150MHz FPGA implementation on a per-chip basis for larger networks, this does not invalidate the efficiency of our multi-chip RBM architecture

13This estimate assumes that the RBM performance will increase linearly with the clock frequency, up to the point where memory bandwidth becomes the bottleneck.

Table 2.3 (speedup over one node):
  Platform     2 nodes   3 nodes   4 nodes
  CPU          1.59      1.80      2.14
  GPU          1.64      0.61      0.62
  FPGA         2.01      3.02      4.03

Table 2.4 (power, Watt):
  Platform     1 node    2 nodes   3 nodes   4 nodes
  CPU          225       450       675       900
  GPU          434       619       1104      1340
  FPGA         9.96      19.92     29.88     39.84

Table 2.3: Scalability: Speedup Against One Node of Each Platform
Table 2.4: Power Consumption of Each Platform (Watt)

for large scale networks, nor the claim that custom logic provides better performance and energy-efficiency.

Table 2.3 illustrates how CPUs, GPUs, and our FPGA architecture scale with the number of nodes. Four CPU machines, using one core per node, and two GPU machines, with two NVIDIA GTX 275 cards each, were fully connected via a Gigabit Ethernet switch to perform the scalability test. OpenMPI was used for communication, and the data was carefully distributed to minimize communication. Network sizes for the CPUs and GPUs were chosen such that the matrices were not so small as to be inefficient, but not so large as to cause overwhelming communication overhead14. Our FPGAs, on the other hand, currently have a fixed configuration of 256 neurons per node, so the network size varies with the number of boards (768x768 for 3 FPGAs, 1024x1024 for 4 FPGAs). Since the total number of neurons increases with the number of FPGAs in our architecture, we also increase the number of neurons for the CPUs and GPUs with the number of nodes to directly compare their scalability with the FPGA. To ensure a fair measurement of scalability despite the differences in network size, we use mult/s as the metric to compare the multi-node performance against the single-node performance of each platform.

As shown in Table 2.3, the CPU shows sublinear scalability as we increase the number of nodes. The GPU also showed a sublinear speedup for 2 nodes, but revealed a major performance loss when crossing machine boundaries as the number of nodes increases from two to three. This implies that communication becomes the bottleneck for large GPU systems [54]. The multi-FPGA RBM, on the other hand, showed good scalability up to four nodes, and is expected to scale well to a

14The CPU processed a 768x768 RBM per node, and the GPU processed 1536x1536 per node.

very large number, as communication only occurs between neighboring nodes at a consistently reasonable datarate.

Table 2.4 shows the power consumption of each platform. The numbers displayed for the CPU and GPU nodes include the power consumption of the system components, such as the motherboard and DRAM. The power consumption of the FPGA system is measured for the FPGA alone, and does not include the power consumption of the on-board DRAM or the controlling host computer. Although the numbers in Table 2.4 cannot be directly compared, one can still easily infer that the multi-FPGA RBM is a more energy-efficient solution than the other two general purpose platforms. In general, the configurable nature of FPGAs and custom ASICs allows energy- and area-efficient computation, such as fixed point arithmetic instead of full floating point operations. The energy-efficient nature of custom logic [26], in addition to the scalability of our design, makes our approach desirable for large-scale DBN implementations, as well as for special DBN cores in future heterogeneous processors.

2.6 Related Work

There has been considerable interest in accelerating the training of neural networks using customized hardware. In 1992, Cox and Blanz [15] demonstrated an FPGA implementation of a layered neural network for performing classification tasks. In 1994, Lysaght et al. [43] showed that dynamic reconfiguration of FPGAs could be used to train larger layered networks. Zhu and Sutton [57] provide a survey of FPGA implementations of neural networks trained using backpropagation. Graf et al. [24] introduced a single-FPGA design optimized for support vector machine (SVM) training and convolutional neural network processing. Systolic arrays have also been explored to exploit the parallelism in neural networks [56].

Since the introduction in 2006 of Hinton et al.'s fast learning algorithm [28] for DBNs (Deep Belief Nets), there has been renewed interest in neural networks. Ly and Chow [42] introduced an FPGA architecture for training DBNs. Our single chip

RBM work [33] improved the single-FPGA architecture by generalizing the data representation and adding runtime flexibility for the major learning parameters. In addition, our RBM architecture addresses the scalability issues in [42].

Ly and Chow extended their work to multiple FPGAs [41], where a partitioning algorithm is used to distribute the work amongst multiple FPGAs while minimizing the communication. However, their inter-chip network requires communication resources that increase quadratically with the number of neurons, making it difficult to scale to large networks. Instead of focusing on minimizing the amount of communication, our multi-FPGA work [34] localizes the communication to allow scalability.

2.7 Summary

Deep Belief Nets are popular machine learning tools that are built from Restricted Boltzmann Machines. The computation-intensive nature of RBMs has made it difficult to investigate very large scale Deep Belief Nets. We introduced a specialized architecture for RBMs that enables building a scalable and large DBN. The RBM core is carefully modularized such that it can easily scale in future devices. Inter-chip communication is performed over a ring topology and exhibits linear scalability. In addition, the multi-chip RBM architecture supports a restricted type of sparse RBM, where locality of connections is important. We demonstrated our ideas by implementing the RBM architecture on Altera Stratix III FPGAs. The single-FPGA implementation has shown a 25X speedup compared to a single-precision software implementation running on an Intel Core 2 processor. Our four-FPGA implementation has shown 46X-112X speedups compared to an Intel Xeon E5345 processor. In comparison to an NVIDIA GTX 275, the speedup is up to 5.5X. In addition, the four-FPGA implementation has shown linear scalability, while the CPU and GPU implementations suffered sublinear scalability, especially when crossing machine boundaries. We expect that our scalable architecture can be used to tackle very large machine learning applications that may have previously been difficult to approach. This is in contrast to previous RBM architectures, whose required communication resources scale with the square of the network size and hence are infeasible to implement for large networks.

Chapter 3

Sparse Matrix Accelerator

3.1 Introduction

Sparse matrices have long been an interesting subject to researchers in many fields. According to Davis and Hu [17], both the number and the sizes of the sparse matrices in the University of Florida Sparse Matrix Collection have continually increased since 1970. In addition, a great number of the sparse matrices in the collection appear to have been created within the last decade, which reflects the increasing importance of sparse matrices.

CPUs and GPUs are known to have good performance on certain sparse matrix operations. For example, the SpMV (Sparse Matrix - Vector Multiplication) routines found in the Intel MKL and CUSPARSE libraries are highly optimized for Intel SSE and NVIDIA CUDA cores, respectively. Bell and Garland [7] demonstrated how SpMV can effectively be mapped to CUDA cores.

However, due to the SIMD nature of the optimizations used in these libraries, they suffer greatly from load imbalance when the number of non-zeros varies considerably across the rows or columns of a sparse matrix. In addition, these math libraries lack support for sparse matrix - sparse matrix multiplication (SSMM1). Due to the irregular computational patterns in multiplying two sparse matrices, it is challenging

1We use this abbreviation for sparse matrix - sparse matrix multiplication, since some publications use SpMM for sparse matrix - dense matrix multiplication.


to map the computation efficiently onto a SIMD architecture. The CSparse library [16] is a highly cited and widely used sparse matrix library which includes an SSMM routine, but it does not utilize the underlying SIMD or multi-threading capabilities. The CUSP library [8] is capable of performing sparse matrix - sparse matrix multiplication on CUDA hardware, but in our experiments it rarely showed performance improvements over a single-threaded CPU implementation.

Given the increasing importance of sparse matrix computation, specialized hardware is needed to support the increasing number of sparse, irregular matrix operations. We propose an accelerator architecture that effectively exploits the fine-grain parallelism in sparse matrix computation. The accelerator tolerates the irregularity and load imbalance of sparse matrix computation with an efficient buffering and work-stealing mechanism, at the cost of limiting the acceleration to several fixed types of matrix operations. The feasibility and performance of the accelerator were demonstrated by implementing a prototype design on an FPGA board. Details of the accelerator architecture and its implementation are discussed in Sections 3.2 and 3.3.

3.1.1 Sparse Matrix Applications

As mentioned earlier, recent math libraries for CPUs and GPUs support a number of sparse matrix operations that match well with SIMD-style computation. Thus, our focus is on three sparse matrix applications that CPUs and GPUs cannot easily accelerate: the Sparse RBM [48], Betweenness Centrality [22], and Markov Clustering [53]. Appendix A lists the pseudo-code for these applications using MATLAB-style syntax.

Sparse RBM (sRBM) is a variation of the Restricted Boltzmann Machine, the basic building block of Deep Belief Networks, as described in Chapter 2. sRBM differs from the original algorithm by forcibly limiting the connections between layers to randomly selected receptive fields. The enforced sparseness improves robustness to noise in visual recognition.
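The receptive-field restriction can be pictured as a fixed binary mask over the weight matrix. The C++ sketch below builds such a mask with randomly placed contiguous fields; the field width and layout are our own assumptions for illustration, not the configuration used in [48].

    #include <cstdio>
    #include <random>
    #include <vector>

    // Illustrative construction of a sparse visible-to-hidden connection mask:
    // each hidden unit is wired to a randomly chosen contiguous receptive field
    // of `field` visible units; all other weights are forced to zero.
    int main() {
        const int n_visible = 1024, n_hidden = 1024, field = 64;
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, n_visible - field);

        std::vector<std::vector<bool>> mask(n_hidden, std::vector<bool>(n_visible, false));
        for (int h = 0; h < n_hidden; ++h) {
            int start = pick(rng);                          // receptive field origin
            for (int j = start; j < start + field; ++j) mask[h][j] = true;
        }

        long nnz = 0;
        for (auto& row : mask) for (bool b : row) nnz += b;
        std::printf("mask density: %.2f%%\n",
                    100.0 * nnz / ((double)n_visible * n_hidden));
        return 0;
    }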

The figure breaks down runtime into matrix multiplication (M*M), EXP, DIV, SUM/MAX, and element-wise operations for the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering.

Figure 3.1: CPU Execution Time Breakdown on Sparse Matrix Applications

Betweenness Centrality (BC) is a well-known graph algorithm in the social networking domain. BC measures the centrality of every node by finding the number of shortest paths that pass through each node. Mathematically, the betweenness centrality of vertex v is defined as

    BC(v) = Σ_{s≠v≠t∈V} σst(v) / σst                                   (3.1)

where σst is the total number of shortest paths from vertex s to vertex t, and σst(v) is the number of shortest paths from s to t that pass through v.

Markov Clustering (MCL) is a graph clustering algorithm frequently found in biological publications. MCL emulates a random walk by expanding a transition probability matrix M, i.e., computing M². After each expansion, strong transitions are reinforced and weak transitions are pruned to maintain sparseness. By repeating this two-phase computation until convergence, most of the less-likely transitions are eliminated, forming clusters of strongly connected nodes.
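As a concrete reference for the expand/prune loop described above, here is a minimal dense-matrix version of one MCL iteration in C++; the inflation exponent and pruning threshold are assumed parameters, and a real implementation would keep M in a sparse format throughout.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Matrix = std::vector<std::vector<double>>;

    // One MCL iteration on a column-stochastic transition matrix M (dense here
    // purely for clarity): expand (M = M*M), inflate (element-wise power, then
    // re-normalize columns), and prune entries below a threshold.
    Matrix mcl_step(const Matrix& M, double r = 2.0, double prune = 1e-4) {
        const int n = (int)M.size();
        Matrix E(n, std::vector<double>(n, 0.0));
        for (int i = 0; i < n; ++i)                        // expansion: E = M * M
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    E[i][j] += M[i][k] * M[k][j];

        for (int j = 0; j < n; ++j) {                      // inflation, column by column
            double colsum = 0.0;
            for (int i = 0; i < n; ++i) { E[i][j] = std::pow(E[i][j], r); colsum += E[i][j]; }
            for (int i = 0; i < n; ++i) {
                E[i][j] = (colsum > 0.0) ? E[i][j] / colsum : 0.0;
                if (E[i][j] < prune) E[i][j] = 0.0;        // pruning maintains sparseness
            }
        }
        return E;
    }

    int main() {
        // Tiny 3-node example: two strongly connected nodes plus a weak link.
        Matrix M = {{0.5, 0.45, 0.1}, {0.45, 0.5, 0.1}, {0.05, 0.05, 0.8}};
        for (int it = 0; it < 10; ++it) M = mcl_step(M);
        for (auto& row : M) {
            for (double x : row) std::printf("%6.3f ", x);
            std::printf("\n");
        }
        return 0;
    }

Repeating mcl_step until M stops changing yields the clusters; the pruning step is what maintains sparseness across iterations.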

As can be seen from the pseudo-code in Appendix A, most of the computation in the three applications consists of basic operations. The execution breakdown in terms of basic linear algebra operations for the sparse matrix applications is shown in Figure 3.1, which summarizes the types of matrix operations used and their respective runtimes. As shown in the figure, matrix multiplication dominates the execution time. Therefore, it is natural for the accelerator to focus on increasing matrix multiplication performance, which has the greatest impact on overall performance. However, sRBM and BC also spend considerable amounts of time on other matrix operations. By Amdahl's law, the accelerator must accelerate these other matrix operations as well to achieve a significant overall speedup.

Studying the applications in more depth revealed a few other requirements the accelerator must satisfy to reduce inefficiencies and realize better performance. Support for most of these requirements was difficult to find in existing hardware and software.

First, the accelerator should be able to efficiently write matrices to memory in a sparse format. Although this may seem like a simple task, doing so in a parallel manner is non-trivial. Most compressed sparse formats impose some type of ordering on the non-zero elements. During a sparse matrix computation, the offset of the non-zero elements for each row (or column) is computed on-the-fly according to the ordering of the format. Thus, computation results cannot be written to memory until the number of non-zero elements is known for all previous rows (or columns). This requires a considerable amount of synchronization between computational threads, which may incur significant overheads on conventional CPUs and GPUs. In addition, stalling of computational threads may deteriorate performance further if adequate on-chip buffering per thread is not provided; such problems are especially challenging for GPUs, which have a limited amount of on-chip storage per thread.

Writing to a dense matrix format and then converting it back to a sparse format is also not a feasible option, because the dense format may exceed the available memory capacity for very large matrices. Even if the memory is sufficiently large to accommodate the matrices of interest, the number of memory accesses increases enormously due to the dense format. For example, performing a sparse matrix multiplication (SSMM) on sparse matrices that are relatively dense (2% density for

  App     1        2       3        4        5        6       7
  BC      0.414%   6.80%   30.82%   32.98%   11.74%   2.42%   0.51%
  MCL     0.19%    2.24%   41.40%   10.00%   1.00%    0.20%   0.10%

Table 3.1: Density of Sparse Matrices in BC and MCL

source matrices, 3% density for the result matrix) showed that writing the matrix in dense format increased the number of write memory accesses by 15 times2.

Our accelerator was designed to efficiently write both sparse and dense matrices with dedicated encoding hardware. Although stalls due to dependencies are inevitable, the sparse matrix hardware tries to minimize idle cycles by providing adequate buffering together with a work-stealing mechanism among threads. In addition, the use of customized logic allows offsets to be calculated in parallel with light-weight synchronization, as explained in Section 3.2.6.

Table 3.1 illustrates the non-zero density of the sparse matrices used in Betweenness Centrality (BC) and Markov Clustering (MCL). The matrices shown for BC were sampled from the first inner loop shown in Figure A.3 of Appendix A, and the matrices for MCL correspond to the first seven M matrices in Figure A.1. As can be seen from the table, the density varies greatly across iterations for each application. Matrices can range from fairly dense (e.g., 41.4% non-zeros) to very sparse (e.g., 0.1% non-zeros). Denser matrices tend to take significantly more time; for example, the two densest matrices in BC make up approximately 65% of the runtime. However, the execution time of the other, sparser matrices adds up to the remaining 35% of the runtime, which is still a considerable portion when accelerating an application. Therefore, the hardware must perform well on a wide range of sparseness.

In addition, the graphs used in these applications tend to show large variance within their sparse matrices. Figure 3.2 shows the non-zero distribution for the densest matrix in Table 3.1 for BC and MCL. The number of non-zero elements for each row and column has been sorted and lumped into groups for better visualization. Non-zero elements are usually not evenly distributed across rows or columns, but are likely to be concentrated in a certain number of rows or columns, as shown in the figure. Large

2The write memory accesses were about 7% of all data memory accesses for the sparse matrix case. This increased to 73% when writing a dense matrix instead.

The figure has four panels: (a) BC row density, (b) BC column density, (c) MCL row density, and (d) MCL column density, each plotting the number of non-zero elements per row or column against the sorted row or column indices.
Figure 3.2: Non-zero Element Distribution

variation of the number of non-zero elements implies load imbalance across the rows and columns. Thus, such a distribution may adversely influence SIMD performance, depending on the non-zero variance between rows/columns and the sparse matrix format used. For instance, an SSMM operation on the matrix shown in Figure 3.2(a) and (b), using the CUSP library in CSR format on an NVIDIA Tesla C2050 GPU, takes 60% more execution time than computing on a uniformly distributed matrix with the same number of non-zeros. Though the aforementioned experiment appears to be one of the more extreme cases, similar performance degradation due to load imbalance was studied by Bell and Garland [7] for SpMV operations in CUDA. We address these issues in

designing the accelerator in the following sections.

3.2 Sparse Matrix Accelerator Architecture

This section explains the architectural aspects of the sparse matrix accelerator (SMA) design. Conventional SIMD architectures consist of an instruction decode unit, multiple arithmetic logic units (ALUs), and memory access units. Each SIMD ALU processes different data, but the ALUs execute identical instructions synchronously since they share an instruction decode unit. Our architecture is similar to SIMD in the sense that multiple ALUs share an execution flow. The main difference from SIMD, however, is that our architecture allows the ALUs to execute asynchronously with respect to each other. This is done by restricting the instructions to predefined, repetitive, coarse-grain operations. The coarse granularity requires each ALU to compute a bulk of data, while the predefined repetitive computation pattern allows each ALU to configure its datapaths only once for the entirety of the operation. Basic linear algebra operations, including sparse matrix arithmetic, fit perfectly into this model.

Asynchronous ALU execution is essential for high-performance sparse matrix operations, especially if the data structures are irregular. Queuing therefore becomes a key factor in sustaining asynchronous operation, as data is received by each ALU at a different rate.

For the following sections of this chapter, we define a thread to be the asynchronous computation flow that each ALU3 is responsible for. This includes the ALU, the surrounding control logic, and the data storage that supports the asynchronous execution. Although all threads execute the same instruction stream, they have independent data streams that need not be synchronous with each other. Details of our accelerator architecture are explained in the following subsections.
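A toy software model of this execution style may help: each thread is configured once with the same coarse-grain operation, but drains its own input queue at its own rate, so an irregular distribution of work never forces lockstep stalls. Everything below (queue sizes, the accumulate operation) is invented purely for illustration.

    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <vector>

    int main() {
        const int threads = 4;
        std::vector<std::deque<double>> queue(threads);
        std::vector<double> acc(threads, 0.0);

        // Irregular work distribution: thread t receives 2^t elements.
        for (int t = 0; t < threads; ++t)
            for (int k = 0; k < (1 << t); ++k) queue[t].push_back(1.0);

        // One configuration step for the whole operation (here: accumulate).
        std::function<void(int, double)> op = [&](int t, double x) { acc[t] += x; };

        // Each "cycle", every non-empty thread consumes one element; no thread
        // waits for the slowest one before taking its next element.
        int cycles = 0;
        bool busy = true;
        while (busy) {
            busy = false;
            for (int t = 0; t < threads; ++t)
                if (!queue[t].empty()) {
                    op(t, queue[t].front());
                    queue[t].pop_front();
                    busy = true;
                }
            if (busy) ++cycles;
        }
        for (int t = 0; t < threads; ++t) std::printf("thread %d sum = %.0f\n", t, acc[t]);
        std::printf("finished in %d cycles\n", cycles);
        return 0;
    }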

The figure shows the device processor with its instruction fetch unit, two DRAM controllers, and the PCI Express module connected through an interconnect to the OP decode/dispatch unit and DRAM arbiter, which feed the decode module, encode module, and ALU main control, and ultimately an array of thread blocks (ALU arrays with local memories and SPUs).
Figure 3.3: Sparse Matrix Accelerator (SMA) Architecture Overview

3.2.1 Overview

Figure 3.3 illustrates the architecture of SMA. The figure depicts the sparse matrix accelerator as a discrete device; however, an embedded sparse matrix accelerator would have essentially the same block diagram, except that it would not need a separate device processor and PCI Express module.

The top modules in Figure 3.3 are interfaces to external components. Many matrix applications are memory-bound, requiring a large DRAM bandwidth, so the accelerator may have multiple DRAM controllers to supply that bandwidth. The PCI Express module is used for DMA data transfers between the host and the device. It also provides a means of communicating with the host processor via memory-mapped I/O to control the DMA data flow. A small device processor is included mainly to fetch and forward matrix instructions to the accelerator. Since data transfers between the host and the device may be costly, the device processor also executes

3A thread may actually have multiple ALUs, as described in Section 3.2.5.

non-critical general purpose code to avoid communication overhead. This general purpose capability of the device processor may be excluded if the accelerator is used as an integrated functional unit of a general purpose processor.

The main matrix acceleration modules are the decode module, the encode module, and a large array of thread blocks. The decode module is responsible for generating memory addresses based on the matrix format and matrix operation, and for decoding the data into a format uniformly used by the thread blocks. The encode module takes the output of the thread blocks and writes it to the main memory as a dense or sparse matrix. Thread blocks consist of a number of computational threads, each of which has floating point ALUs. All computational threads are configured to use identical datapaths based on the matrix operation being performed, but each thread operates asynchronously with respect to the others. Each thread block also has one special purpose unit that performs commonly used mathematical functions, such as the logarithm and exponential functions.

The accelerator operates as follows. The device is held in a reset state except for the memory and host communication modules. Once the host downloads an executable binary into the device main memory, the device is released from reset and the device processor starts executing the binary. The device processor offloads each matrix instruction it encounters to the matrix acceleration modules while communicating with the host to coordinate DMA data transfers. The decode, encode, and thread block modules each have an instruction queue from which the module fetches matrix operation parameters and configures its control and datapath accordingly. Once the datapaths are configured in each module, data starts to flow from the device main memory to the decode module, then to the thread blocks, and finally to the encode module, where the output data is encoded and written to memory.
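The dispatch step of this sequence can be summarized in a short control-flow sketch. The opcode split, structure layout, and queue handling below are hypothetical and only name the steps described in the text; they are not the actual device firmware.

    #include <cstdint>
    #include <cstdio>
    #include <queue>
    #include <vector>

    // Hypothetical sketch of the dispatch flow on the device processor.
    struct MatrixOp { uint32_t opcode; uint64_t srcA, srcB, dst; };

    int main() {
        // Toy instruction stream that the host placed in device memory.
        std::vector<MatrixOp> program = {
            {0x01, 0x1000, 0x2000, 0x3000},   // matrix multiply  (offloaded)
            {0x05, 0x3000, 0x0000, 0x4000},   // element-wise op  (offloaded)
            {0x90, 0x4000, 0x0000, 0x0000},   // bookkeeping code (runs on device CPU)
        };

        // Per-module instruction queues; each module later pops its copy and
        // configures its own control logic and datapaths.
        std::queue<MatrixOp> decode_q, thread_q, encode_q;

        for (const MatrixOp& op : program) {
            if (op.opcode < 0x80) {            // matrix instruction: offload it
                decode_q.push(op);
                thread_q.push(op);
                encode_q.push(op);
            } else {
                std::printf("scalar op 0x%02x handled on the device processor\n", op.opcode);
            }
        }
        std::printf("queued %zu matrix instructions for the accelerator\n", decode_q.size());
        return 0;
    }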

3.2.2 Supported Matrix Operations

Table 3.2 summarizes the matrix operations that are accelerated by our architecture. A more comprehensive list of instructions supported by our FPGA implementation

The figure shows an example matrix and its CSB encoding: Value = {1, 2, 3, 4, 5, 6, 7}, Row Index = {0, 2, 1, 1, 3, 0, 2}, Col Index = {0, 1, 1, 4, 2, 0, 2}, Block Ptr = {0, 2, 5, 5, 7}.
Figure 3.4: Compressed Sparse Block Format

can be found in Appendix B. As can be seen in Table 3.2, the behavior of each operation is described using MATLAB [44] notation. The left-hand column of Table 3.2 lists the types of operands used for each operation category. Matrices can be in sparse or dense format, while vectors can only be in dense format. Each operation can involve purely sparse matrices, purely dense matrices, or a mixture of both4. Matrix transpose is supported for the operands of the accumulative operations, i.e., matrix multiplication and matrix reduction. Matrix transpose is supported for the other operations only if all matrices in the operation are dense. This is because the sparse matrix format used in our architecture requires a special ordering, which becomes difficult to preserve if transpose is allowed for element-wise operations.

Table 3.2 includes pseudo-instructions, which are two or more instructions combined to behave as one instruction. For example, our architecture calculates a matrix element-wise division C = A ./ B as two native instructions: an element-wise inverse R = 1 ./ B, and an element-wise multiplication C = A .* R. The use of pseudo-instructions can save area for operations that are usually not on the critical path.

  Matrix Operation                         MATLAB-like Notation
  Matrix Multiplication
      Matrix - Matrix                      A * B
      Matrix - Vector                      A * v
  Matrix Element-wise Arithmetic
      Matrix - Matrix                      A + B, A - B, A .* B, A ./ B†
      Matrix - Vector                      A + repmat(v,M,1)
      Matrix - Scalar                      A + 3, A .^ 3†
  Special Functions                        log2(A), 2 .^ A, sqrt(A)†,
                                           log(A)†, exp(A)†, 1 ./ sqrt(A)
  Matrix Element-wise Comparison
      Matrix - Matrix                      A > B, A < B, min(A,B), max(A,B)
      Matrix - Vector                      A > repmat(v,M,1)
      Matrix - Scalar                      A > 3, min(A,3)
  Matrix Reduction
      Reduction                            min(A), max(A), sum(A), min(A_i,k + B_k,j)†

Table 3.2: List of Matrix Operations Supported. † Note: Pseudo-instructions, made of two or more native instructions.
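As an illustration of how a pseudo-instruction is expanded into native instructions at dispatch time (mirroring the C = A ./ B example above), here is a small C++ sketch; the opcode mnemonics and the Instr layout are our own, not the accelerator's actual instruction encoding.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Illustrative expansion of a pseudo-instruction into native instructions:
    // C = A ./ B becomes R = 1 ./ B followed by C = A .* R.
    struct Instr { std::string op, dst, srcA, srcB; };

    std::vector<Instr> expand(const Instr& in) {
        if (in.op == "EDIV")                    // pseudo-instruction
            return { {"EINV", "R_tmp", in.srcB, ""},        // R = 1 ./ B
                     {"EMUL", in.dst,  in.srcA, "R_tmp"} };  // C = A .* R
        return { in };                          // already a native instruction
    }

    int main() {
        for (const Instr& i : expand({"EDIV", "C", "A", "B"}))
            std::printf("%s %s, %s, %s\n", i.op.c_str(), i.dst.c_str(),
                        i.srcA.c_str(), i.srcB.c_str());
        return 0;
    }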

3.2.3 Sparse Matrix Format

As mentioned in Section 3.1, our architecture targets applications that make extensive use of sparse matrix operations. Thus, deciding which sparse matrix formats to support is a crucial design factor, since the sparse matrix format can greatly influence the performance of these applications. Two of the applications shown in Section 3.1.1 spend the majority of their time in sparse matrix - sparse matrix multiplication (SSMM), which can be a very challenging operation to implement in hardware depending on the sparse matrix format. Another frequent computational pattern we found in the targeted applications was the use of transpose in conjunction with matrix multiplication. Therefore, it is preferable to choose a sparse matrix format that can efficiently carry out these types of operations in hardware.

Compressed Sparse Column (CSC) is a popular sparse matrix format widely used

4The actual FPGA implementation restricts mixing types of matrices for certain matrix operations.

in various applications. This format generally allows smaller memory footprints and is convenient for per-column computations. Existing software solutions, such as CSparse [16], implement SSMM by accumulating the weighted column vectors of the first operand matrix, where the weights are the elements of the corresponding column vector of the second operand matrix (see Eq. 2.1 in Chapter 2). However, implementing this in hardware can be complicated, since a considerable amount of indexing and indirection is required. If a dense column vector cannot fit on-chip, which is possible when the column vector size exceeds roughly one million elements, the accelerator would also need to access DRAM for temporary values and would require some caching mechanism, since it is difficult to predict which elements will be reused soon. In addition, any transpose operation must be performed prior to the multiplication, which adds further complications to the hardware. Although the CSC format may show good performance in many applications, we decided to postpone support for this format to future implementations and start with a simpler sparse format for our design.

Compressed Sparse Block [12] is a sparse matrix format which enables efficient transpose operations and provides a more straightforward approach to sparse matrix multiplication. Compressed Sparse Block is illustrated in Figure 3.4. As shown in the figure, the sparse matrix is divided into fixed-size blocks. In this chapter, we use β to represent the number of elements in a matrix block row. The non-zero elements are stored in a 3-tuple list, where each tuple consists of a row index, a column index, and the value of the element. The row index and column index are relative to the top-left corner of the matrix block in which the element is positioned. A separate matrix block pointer list is used to locate the beginning of each matrix block within the 3-tuple non-zero element list, as shown in Figure 3.4.

Although the ordering of elements within a block is not fixed in CSB, we enforce a certain ordering in our architecture to ease the implementation of element-wise matrix operations and the sparsification of matrices. In addition, we give the user the option to indicate whether the omitted values of a sparse matrix represent zero or infinity. Providing this option allows us to extend our applicability to certain graph algorithms, such as All Pairs Shortest Paths.
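The format just described maps naturally onto a handful of flat arrays. The sketch below is a minimal C++ representation with our own field names and assumed index/value widths; it is not the memory layout used by the hardware, only an illustration of the block-pointer indirection.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Minimal in-memory layout for the Compressed Sparse Block format: non-zeros
    // are stored as (row, col, value) tuples with indices relative to their
    // block, and blk_ptr[b] gives the first tuple of block b.
    struct CSBMatrix {
        uint32_t rows = 0, cols = 0;
        uint32_t beta = 0;                       // block dimension (fixed per design)
        std::vector<uint16_t> row_idx, col_idx;  // intra-block indices
        std::vector<float>    val;               // non-zero values
        std::vector<uint64_t> blk_ptr;           // number of blocks + 1 entries

        // Number of non-zeros in block b; empty blocks can be skipped in O(1).
        uint64_t block_nnz(size_t b) const { return blk_ptr[b + 1] - blk_ptr[b]; }

        // Reading a transposed matrix only requires swapping the roles of the
        // row and column index streams; the tuples themselves are untouched.
    };

    int main() {
        CSBMatrix A;                             // toy 8x8 matrix, 2x2 grid of 4x4 blocks
        A.rows = A.cols = 8; A.beta = 4;
        A.val     = {1.f, 2.f, 3.f};
        A.row_idx = {0, 2, 1};
        A.col_idx = {0, 1, 3};
        A.blk_ptr = {0, 2, 3, 3, 3};             // block 0 has 2 nnz, block 1 has 1
        for (size_t b = 0; b + 1 < A.blk_ptr.size(); ++b)
            std::printf("block %zu: %llu non-zeros\n", b,
                        (unsigned long long)A.block_nnz(b));
        return 0;
    }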

The figure shows the decode module's main control and memory arbiter (connected to the Nios II and DRAM) feeding the BlockPtr Addr Gen, Data/Index Addr Gen, Sparse Crossbar (XBAR), Dense Mux, Batch Transpose, and Vector Forward units, which drive the unit data stream and batch stream outputs.
Figure 3.5: SMA Decode Unit Block Diagram

One disadvantage of using CSB is that the performance depends greatly on β, whose optimal value differs for each sparse matrix. Thus, it would be natural to support multiple β values in the accelerator architecture. However, we decided to fix β to a specific value for the current version of our architecture so that we could focus on the architectural aspects during our initial design and development. It is also noteworthy that nothing in the architecture inherently requires the sparse format to be CSB, and different sparse matrix formats could be incorporated in future designs. In fact, Kestur et al. designed efficient universal sparse matrix logic that can decode various sparse formats into one representation [32], which could also be used in our architecture.

3.2.4 Decode Unit

The decode unit of SMA is responsible for reading data from DRAM and forwarding the data to the ALU arrays. This includes generating addresses based on the matrix format, decoding the data streamed in, converting the data format, and assembling packet headers to correctly route the data.

Figure 3.5 shows the block diagram of the decode unit. The decode unit fetches one or more command packets from the central OP dispatch unit and configures its finite state machines (FSMs) and datapaths accordingly.

The BlockPtr Addr Gen module in Figure 3.5 generates the addresses of the matrix block pointers of both matrix operands and issues the memory requests to the DRAM arbiter. The DRAM arbiter forwards the memory requests to the DRAM memory controllers and passes the DRAM data from the memory controllers to the Data/Index Addr Gen module. The BlockPtr Addr Gen module only requests memory accesses when the matrix format is sparse; when the matrix format is dense, the BlockPtr Addr Gen module produces the block pointers itself and sends them directly to the Data/Index Addr Gen.

The Data/Index Addr Gen module produces the addresses of the elements within the current matrix block based on the information from the BlockPtr Addr Gen module. The elements to be read include both indices and values in the case of a sparse matrix, and only the values in the case of a dense matrix. The Data/Index Addr Gen module skips empty matrix blocks based on the matrix block information. Depending on the type of the operation, non-empty matrix blocks may also be skipped if computation is unnecessary. Such a case occurs when the matrix operation is multiplicative and the corresponding matrix block of the other operand matrix is empty.

The addresses of the matrix elements generated by the Data/Index Addr Gen module are passed to the DRAM arbiter. The DRAM arbiter reads the matrix elements from DRAM and sends them to one of four destination modules: the Sparse Crossbar module, the Dense Mux module, the Batch Transpose module, or the Vector Forward module. In addition to the matrix elements, certain matrix block information from the Data/Index Addr Gen module is passed to the destination modules to assist in interpreting and processing the matrix elements.

The Sparse Crossbar module routes the elements of sparse matrices based on their row or column indices. The column indices are used for a transposed matrix, and the row indices are used otherwise. The crossbar is capable of switching multiple elements simultaneously, ideally as many elements as fit in a memory data bus so as to fully utilize the DRAM bandwidth, unless there is a conflict at one of the destination ports, in which case stalling one of the inputs is inevitable. This module

is used for both matrix operands of element-wise operations, and for the first matrix operand of accumulative operations. The Dense Mux module routes the elements of dense matrices, where the destination of each element is determined by its row or column number. The Dense Mux achieves this by utilizing buffers and shift registers. Similar to the Sparse Crossbar module, the Dense Mux module is used for both matrix operands of element-wise operations, and for the first matrix operand of accumulative operations5. The outputs of the Sparse Crossbar and Dense Mux modules are sent to the unit data stream interconnect network. In our SMA implementation explained in Section 3.3, these are point-to-point buffered connections to the thread blocks. The Batch Transpose is responsible for handling the second matrix operand of accumulative computations. Unlike element-wise matrix operations, accumulative matrix operations reuse data for multiple rows. To exploit this computation pattern, the architecture sends a small portion of the second operand matrix to all thread blocks to be cached for data reuse. Thus, the interconnect from the Batch Transpose to the thread block arrays is essentially a broadcast, with wiring and buffering structured as a tree network to meet timing constraints. In addition to sending batches of data, the Batch Transpose module performs transposes of matrices and densification of sparse matrices. The transpose of a small batch of data can be implemented in multiple ways; we chose the method used in [42] since it makes less use of multiplexers, which are costly in FPGAs6. The Batch Transpose module densifies sparse matrix rows that are not completely empty. Densifying rows, which essentially shifts the elements to their predetermined memory locations, reduces indexing and routing complications within the thread blocks. Depending on how dense the matrix block is, this approach may somewhat degrade performance, especially if the application is memory-bound. Section 3.4.2 discusses the performance of the sparse matrix accumulative operations in more detail.

5The Dense Mux module was included for practical reasons. For the same memory bandwidth, approximately twice as many dense matrix elements are read as sparse matrix elements, since dense matrix elements do not carry indices. Reusing the Sparse Crossbar module for dense matrix elements would require more crossbar ports, quadratically increasing the logic requirements. Instead, we included a separate Dense Mux module for more efficient logic use. 6However, as explained in Section 2.4.2, this transpose method may have scalability issues for future device generations. Different methods may be used in ASICs or future FPGAs.
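The densification performed by the Batch Transpose module amounts to scattering each non-zero element of a sparse row into its predetermined slot within the block. A minimal software model of this step, assuming per-element column offsets within the block are available, is shown below.

```c
/* Densify one sparse row of a matrix block: every non-zero lands at the slot
 * given by its column offset, and the remaining slots are zero-filled.
 * A behavioral sketch only; the hardware operates on streamed batches. */
void densify_row(const uint16_t *col_idx, const float *val, int nnz,
                 float *dense_row, int beta)
{
    for (int j = 0; j < beta; j++)
        dense_row[j] = 0.0f;              /* clear the block-wide row buffer */
    for (int k = 0; k < nnz; k++)
        dense_row[col_idx[k]] = val[k];   /* element lands at its predetermined slot */
}
```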

Figure 3.6: SMA Thread Block Unit Block Diagram

The Vector Forward module is used to forward vector data chunks to the thread blocks. Since each thread block receives the entire vector chunk and caches it, the same broadcast network is used for streaming the vector data.
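To summarize the block-traversal policy of the decode unit, the sketch below walks the block pointers of the first operand, skips empty blocks, and, for an element-wise multiplication, also skips blocks whose counterpart block in the second operand is empty. It reuses the csb_matrix sketch shown earlier; emit_read is a stand-in for issuing a DRAM read request to the arbiter, not a real interface.

```c
/* Software model of the Data/Index Addr Gen skipping policy (sketch).
 * Uses csb_matrix and csb_block_nnz() from the earlier sketch. */
#include <stdint.h>

static void emit_read(const void *addr, uint64_t n_elems)
{
    /* Stand-in for a DRAM read request routed through the arbiter. */
    (void)addr; (void)n_elems;
}

void generate_block_reads(const csb_matrix *a, const csb_matrix *b, int elementwise_mult)
{
    for (uint32_t bi = 0; bi < a->blocks_per_col; bi++) {
        for (uint32_t bj = 0; bj < a->blocks_per_row; bj++) {
            uint64_t nnz = csb_block_nnz(a, bi, bj);
            if (nnz == 0)
                continue;                           /* empty block: no DRAM traffic */
            if (elementwise_mult && b && csb_block_nnz(b, bi, bj) == 0)
                continue;                           /* partner block empty: result block is zero */
            uint64_t base = a->block_ptr[(uint64_t)bi * a->blocks_per_row + bj];
            emit_read(a->idx + base, nnz);          /* indices of the block ... */
            emit_read(a->val + base, nnz);          /* ... followed by its values */
        }
    }
}
```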

3.2.5 Thread Block Unit

Figure 3.6 shows the block diagram of a thread block. A thread block consists of a number of threads, a shared cache, and common control logic. The number of thread blocks and the number of threads per thread block may vary across different implementations. As a rule of thumb, since a thread block can consume one element of data (of the first matrix operand) per cycle, the number of thread blocks is roughly equal to the number of elements read from DRAM each cycle. The number of threads per thread block is then determined based on the available logic, memory, and routing resources on chip. The implementation discussed in Section 3.3 has four threads per thread block, which is also the case shown in Figure 3.6. Each thread in the thread block configures the control logic and datapaths at the beginning of each matrix operation. The configuration information is given by

the OP dispatch unit in multiple command packets stored in a queue (not shown in Figure 3.6). The thread configuration takes several cycles to complete, which is negligible compared to the number of cycles a computation-intensive matrix operation would take. Logic configured during this phase includes the data routing policy, the ALU datapath multiplexers, various state machines, and the work-stealing engine. Data sent from the decode unit is stored in one of the queues for routing within the thread block. The routing logic determines which queues to fetch data from and which thread to send the data to. Depending on the operation type, this module also compares the indices of the data and discards them if it finds computation is not needed. For example, element-wise multiplication only requires computation on elements that have the same indices, and may ignore elements that do not overlap between the two matrix operands. Data from the decode unit also carries useful control information, such as an indicator bit that tells when the matrix operation is completed. The routing logic module adjusts its states accordingly, and forwards the control data to other modules in the thread block. The thread block also has a cache module shared by all the threads within the thread block. The cache is used for accumulative matrix operations, where data may be reused across multiple rows and columns. The cache module uses a double buffering scheme to hide the latency of streaming in matrix rows. As mentioned in Section 3.2.4, the Batch Transpose unit sends batches of matrix data for reuse, which are stored in the cache module of the thread block. While the thread block is busy with computations on this batch, the cache module reads the next batch into the embedded memory. This approach requires twice as much memory compared to a single-buffer cache, but can significantly improve performance when enough workload is present in each non-empty matrix block. For accumulative matrix operations, each thread performs the computation required for each row of the resulting matrix. Each cycle, one of the threads receives a single srcA element and multiple srcB elements (srcA and srcB denote the two operand matrices). The srcA element is taken from one of the thread block input queues, while the srcB elements are read from the cache. These computations take multiple cycles; the exact number depends on the number of srcB elements and the number of floating

point ALUs per thread7. On the next cycle, if available, another batch of input data is sent to one of the threads according to the index value. If the data is sent to a busy thread, the data is stored in that thread's input queue. Otherwise, the thread immediately starts the computation. For element-wise matrix operations, only one thread within the thread block is utilized due to the DRAM bandwidth. Since the number of thread blocks is roughly the same as the number of elements read from DRAM each cycle, only one thread in a thread block can receive data per cycle. Unlike accumulative matrix operations, the per-cycle input data of element-wise matrix operations is one srcA element and one srcB element. Such operations can be processed by a thread at a throughput of one element per cycle (all ALU components are fully pipelined). Therefore, one thread is enough to sustain the element-wise data flow. Energy may be saved in these computations by turning off unused threads via clock gating. In case one of the thread input queues becomes full, other threads can be used to perform work-stealing. Work-stealing is only used for accumulative operations, because element-wise operations and special math functions only utilize one thread. Our sparse matrix accelerator architecture pairs two threads for work-stealing8, as shown in Figure 3.6. When one input queue is full or above a certain threshold, the routing module detects this and sends data to its partner thread instead. The sent data has a bit in its header indicating that this data is stolen and must be returned to its owner thread after computation. After the partner thread finishes the computation, it emits the accumulated results to a queue which the owner thread reads from when ready. Since accumulative operations require internal storage for accumulated values, each thread must have storage for its own accumulated values and for the partner thread's accumulated values. Having the ability to dynamically balance the workload is important for obtaining good performance on sparse matrices with severe workload imbalance, as shown in Section 3.4.2. Since many sparse matrices tend to be locally (somewhat) dense, the rows are interleaved among the thread blocks to avoid contention within the thread block.

7Suppose there are N srcB elements and M ALUs; then it takes N/M cycles to complete the computation using these inputs. 8Although it is possible to group more threads, we only group two threads for simplicity.
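The queue-threshold work-stealing decision made by the routing logic can be modeled as shown below. The queue depth, the threshold, and the extra condition that the partner itself is not saturated are illustrative assumptions; in hardware, the stolen flag travels in the packet header and the result is returned through a dedicated queue.

```c
/* Behavioral sketch of the routing logic's work-stealing decision. */
#include <stdbool.h>

#define QUEUE_DEPTH      16   /* assumed per-thread input queue depth */
#define STEAL_THRESHOLD  12   /* assumed "nearly full" watermark */

typedef struct {
    int occupancy[4];         /* current entries in each thread's input queue */
} thread_block_state;

/* Returns the thread that should receive an element destined for `owner`;
 * *stolen is set when the element is redirected to the partner thread. */
int route_element(const thread_block_state *tb, int owner, int partner, bool *stolen)
{
    if (tb->occupancy[owner] >= STEAL_THRESHOLD &&
        tb->occupancy[partner] < STEAL_THRESHOLD) {
        *stolen = true;       /* marked in the packet header in hardware */
        return partner;       /* partner accumulates and later returns the result */
    }
    *stolen = false;
    return owner;
}
```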

Figure 3.7: SMA Floating Point ALU Block Diagram

For example, thread 1 of thread block 1 computes the first row, thread 1 of thread block 2 computes the second row, and so on. This way, we reduce the chance of data being concentrated on a specific thread block for a locally dense sparse matrix. In addition, work-stealing falls apart when the workload is concentrated on two threads in a thread block that happen to be partner threads, while the other threads are idle. The probability of such cases can be reduced when the partner thread is the one responsible for the rows farthest away. For the thread block shown in Figure 3.6, Thread1 is paired with Thread3, and Thread2 is paired with Thread49. SMA thread blocks also have a shared special math unit, called the Special Purpose Unit (SPU). This unit performs the element-wise special functions listed in Table 3.2, i.e., the log base-2 function, the exponential base-2 function, the reciprocal function, and the inverse square root function. This unit acts as a separate thread and has its own input queue. As with other element-wise operations, only one thread is needed due to the memory bandwidth limitation. In addition, most of the special math operations are iterative and not pipelined, resulting in a throughput of less than one; the only operation that is fully pipelined is the log base-2.

9This deterministic allocation approach reduces resources and routing complications. A fully dynamic allocation of rows based on thread availability would require substantially more resources, and is left as future work.
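The deterministic interleaving of result rows across thread blocks, together with the pairing of each thread with the thread responsible for the rows farthest away, can be sketched as follows. The number of thread blocks is an assumed parameter; four threads per thread block matches the implementation in Section 3.3.

```c
/* Row-to-thread mapping and partner pairing (sketch). Threads are zero-based
 * here, so thread 0 <-> 2 and 1 <-> 3 correspond to Thread1 <-> Thread3 and
 * Thread2 <-> Thread4 in the text. NUM_THREAD_BLOCKS is an assumption. */
#define NUM_THREAD_BLOCKS 16
#define THREADS_PER_BLOCK 4

typedef struct { int block; int thread; int partner; } row_map;

row_map map_row(int row)
{
    row_map m;
    m.block   = row % NUM_THREAD_BLOCKS;                          /* rows interleaved across blocks */
    m.thread  = (row / NUM_THREAD_BLOCKS) % THREADS_PER_BLOCK;    /* then across threads in a block */
    m.partner = (m.thread + THREADS_PER_BLOCK / 2) % THREADS_PER_BLOCK; /* farthest-away pairing */
    return m;
}
```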

Figure 3.7 shows the Floating Point Arithmetic Logic Unit (FP ALU) block diagram. The FP ALU consists of a floating point multiplication unit, a floating point addition/subtraction unit, a floating point comparison unit, and a feedback unit. The floating point comparison unit can be further divided into a greater-than unit, a less-than unit, a max unit, and a min unit. The feedback unit contains multiple queues and control logic. The queues in the feedback unit are used as temporary storage for the accumulative operations. Half of the queues in the feedback unit are used for the owner thread, and the other half is used for the partner thread. Figure 3.7 also shows the multiplexers for the ALU datapaths. The input of each arithmetic submodule can come from the input matrices or from the output of other modules. INPUT A and INPUT B in Figure 3.7 are the srcA and srcB matrix elements, and DATA OUT is the output of the FP ALU. Which data to pass in each multiplexer depends on the matrix operation and is configured when the OP dispatch commands are read. When to pass the data may depend on the current phase of the matrix operation. For example, an accumulative operation makes use of the Feedback Queue module, whose output is mostly used as the input of the FP ADD module. However, when the end of a row is reached, the output of the Feedback Queue is no longer used as the input of the FP ADD module, and is instead selected as DATA OUT. Depending on the available resources, a thread may have multiple FP ALUs. This can be particularly useful for dense accumulative matrix operations, as long as the number of ALUs per thread does not exceed the number of srcB elements that fit in a row of the thread block cache. As the number of FP ALUs increases, the amount of feedback buffering decreases, as mentioned in Section 2.5.2. (Although Section 2.5.2 focuses on Restricted Boltzmann Machines, the principles and trade-offs of feedback buffering also generally apply to the accumulative matrix operations used in SMA.) However, feedback queues with shallow depths may cause pipeline stalls, since the round-trip latency from the input of the first ALU module to the output of the feedback queue requires that many srcB elements to keep the pipeline constantly busy.
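The role of the feedback queue during an accumulative operation can be modeled in software as a set of independent partial sums, one per in-flight addition, that are folded together at the end of a row. The adder latency below is an assumed value; the point is that the queue must hold at least that many partial sums to keep a fully pipelined FP ADD busy.

```c
/* Functional (not cycle-accurate) model of accumulation through a pipelined
 * FP adder with a feedback queue of depth ADD_LATENCY (assumed value). */
#define ADD_LATENCY 8

float accumulate_row(const float *products, int n)
{
    /* One partial sum per pipeline slot, so a new addition can issue every
     * cycle without waiting for the previous result to emerge. */
    float partial[ADD_LATENCY] = {0};
    for (int i = 0; i < n; i++)
        partial[i % ADD_LATENCY] += products[i];

    float sum = 0.0f;
    for (int k = 0; k < ADD_LATENCY; k++)   /* final reduction at end of row */
        sum += partial[k];
    return sum;
}
```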

Figure 3.8: Encode Unit Block Diagram

3.2.6 Encode Unit

The SMA encode unit stores the outcome of the matrix operation in DRAM. The output data from the thread blocks are combined and converted to a dense matrix format or the CSB matrix format, as specified by the application. As was the case for the other SMA units, the encode unit begins by reading command packets from the OP dispatch unit. Based on the information in the dispatch commands, the encode unit configures its finite state machines and datapaths. For dense matrices, the data from each thread corresponds to a portion of a row in the result matrix. Since each row is deterministically allocated to a thread, the output addresses for each thread can be calculated beforehand. If both source matrices are dense, then the order of elements within each row also arrives at the encode unit deterministically. All elements are guaranteed to be present and arrive cycle by cycle. The encode unit deserializes the output data of each row such that the data width matches the data unit for DRAM. The deserialization is done in parallel for each thread, thus requiring shift registers for each thread. Then, depending on the phase of the address generator, one of the deserialized rows is selected to be sent to DRAM. If either of the source matrices is a sparse matrix, then the order of elements

is deterministic, but it may not include all elements and the timing is non-deterministic. Thus, it is logical to send data to DRAM as soon as the data is available, since waiting for data that may never arrive while stalling other rows is not efficient. However, it also makes sense to coalesce memory accesses, since coalescing requires less memory bandwidth. Thus, the encode module does not send data to DRAM immediately, but shifts and buffers data until one of the following conditions is met:

1. another row from the same thread block is output
2. the shift register or buffer is full
3. the thread block signals the end of a matrix block

Meanwhile, the encode unit can emit any row that is ready. For sparse matrices, the output data from each row is non-deterministic. Thus, a separate address generation module is required. In addition, the offset of the data for each row needs to be calculated. Since the non-zero elements are stored in a specific order within the matrix block, the data from each thread needs to be shifted depending on the offset of the previous thread and the number of elements of the previous row. How this can be done in parallel is explained in Section 3.2.7.

3.2.7 Sparsification Modules

Support for sparse matrix writes is a must, as discussed in Section 3.1.1. Depending on the size of the matrix and its sparsity, the write memory access can take up to 99% of the execution time. In addition, applications require the ability to dynamically create a sparse matrix. The main issue with writing a matrix in a sparse format is that it requires an ordering of elements and pointers that depend on previous rows. A naïve approach may wait for all the non-zero elements of previous rows to be calculated, count how many elements there are for that particular row, write the elements starting from the offset calculated by the previous row, and calculate the offset of the data for the next row. Also, if either one of the operands is a dense matrix, reordering of the incoming elements is required. The key to writing a sparse matrix in parallel is to distribute the sparsification process across multiple modules, starting from the output of the thread block and



ending within the encode unit. This section describes the details of writing the data/index portion of a sparse matrix. The matrix block pointer list of a sparse matrix only needs to keep track of the number of elements per matrix block, and can be implemented within the encode module in a relatively simple manner.

Figure 3.9: Sparsifying Modules: (a) Thread Block Output (b) Sort & Count Logic (c) Offset Calculation & Shift Logic

Figure 3.9 illustrates the sparsification modules. Figure 3.9(a) shows the thread block output logic for sparsification. Before the thread block emits data to the encode unit, the data is serialized and packetized. One reason for this is the limited routing resources. Thread blocks are likely to use the majority of the logic resources, and thus are spread throughout the chip. Since the encode unit is a centralized module, using wide channels that consume a lot of routing resources and associated buffering logic may not be feasible in some hardware implementations. Therefore, to reduce the routing resource pressure, we limit the bus width by serialization and packetization. Even though data is serialized for each thread block, each thread block runs independently, providing adequate parallelism to fully exploit the available resources and memory bandwidth. A part of the sparsification process takes place in this packetization step. Before sending data out to the encode unit, data whose values are below a user-defined threshold may optionally be eliminated. This enables true sparsification,

as opposed to some sparse matrix routines that only preserve the non-zero locations regardless of their actual values. The thread block output FSMs control which thread output to emit and when to emit it. The reason for this is twofold: it provides the memory coalescing mechanism mentioned in Section 3.2.6, and it enforces an ordering of elements by dividing the data into multiple segments that are re-ordered in the Sort & Count module shown in Figure 3.9(b). The Sort & Count module is positioned between the thread block unit and the encode unit. This module not only provides the interconnect buffering between the thread block unit and the encode unit, but also performs sorting and counting of non-zero elements. The Sort & Count module reads the row and column index information from the header of each packet, and queues the data in the corresponding bucket, assuming that data within each bucket is already in order. While the data is stored in the buckets, the module counts the number of elements for each bucket. After a bucket contains a complete segment of data, the module FSM starts emitting the segment to the encode unit. The output segment begins with a count packet, followed by data packets. Figure 3.9(c) shows the block diagram of the Sparse Offset Calc module of the encode unit. Since adjacent rows are computed by adjacent thread blocks, the offset of one thread is determined by the offset of the previous thread plus the count of the previous thread. This calculation is shown as the chain of adders in Figure 3.9(c). Assuming it is possible to perform n additions of the adder chain in one cycle, an optimistic scenario would be having at least n count packets simultaneously present for adjacent thread blocks, and computing the offsets for the n adjacent thread blocks each cycle. The Deserialize/Shift module performs the shifts according to the offsets and combines the data.
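The adder chain of the Sparse Offset Calc module is, in effect, a prefix sum over the per-thread non-zero counts. A serial software equivalent, under the assumption that counts arrive in thread order, is shown below; the hardware performs n of these additions per cycle.

```c
/* Prefix-sum view of the Sparse Offset Calc adder chain (sketch):
 * offset[t] = base + sum of the counts of all preceding threads. */
void compute_offsets(const int *count, int n_threads, int base, int *offset)
{
    int acc = base;                 /* offset inherited from the previous group of rows */
    for (int t = 0; t < n_threads; t++) {
        offset[t] = acc;            /* where thread t's non-zeros start */
        acc += count[t];            /* carried into the next thread's offset */
    }
}
```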

3.3 Sparse Matrix Accelerator Implementation Details

This section describes the implementation of the Sparse Matrix Accelerator. We implement the design to demonstrate the feasibility of the SMA architecture, as well as

to evaluate the performance benefits of using this architecture. Although implementing the architecture as an ASIC would be the preferred approach for efficient usage of silicon area and faster clock frequencies, the complexity and cost of developing a full-scale accelerator ASIC are very large compared to other approaches, such as using FPGAs and software simulation. In particular, we found that the time-consuming debugging and verification process, as well as the high cost of taping out the design, was difficult to justify merely to test feasibility and performance when other low-cost approaches can serve that purpose. Software simulation offers the highest flexibility in programming and debugging at a very low cost. However, a typical simulation time is extremely long compared to the execution time of other hardware approaches. Due to the long simulation time, this approach may be restricted to tiny workloads in order to complete in a reasonable amount of time. Small workloads, however, often do not accurately reflect the characteristics of larger, real-world workloads. Another disadvantage of simulation is that the software implementation may not consider the design in realistic detail. Since the SMA is a new architecture with several unique and original modules, estimating the area and routing requirements is not a trivial task. In addition, it may sometimes be challenging to determine whether a module can be executed within a certain number of clock cycles while meeting the timing constraints, unless it is actually implemented in hardware. Due to these practical difficulties, evaluation by software simulation may not suffice to convincingly demonstrate the feasibility and performance of our SMA architecture. Therefore, we take a compromise approach by implementing the architecture on an FPGA. FPGAs have the advantage of carrying out computation several orders of magnitude faster than software. This allows evaluating the runtime of larger and more practical workloads. In addition, the reconfigurability of FPGAs provides sufficient flexibility to dramatically reduce debugging efforts compared to ASICs. Since FPGAs can use the same or similar RTL code as ASICs to program the hardware, FPGAs also enable us to better predict the area and routing requirements of the architecture compared to the software approach. Although FPGAs may be more expensive than utilizing commodity desktop computers and servers, the costs are

considerably lower than the costs of printing the design on a die, unless done in mass production. These aspects of FPGAs are a good fit for the purpose of prototyping the SMA architecture, which is to test the feasibility and performance of the architecture at a low cost and with a minimal amount of effort. However, FPGAs have significantly less logic available on die and lower clock frequencies compared to ASICs. Since these deficiencies can greatly influence the overall performance of the accelerator, FPGAs are not intended as the primary implementation target, but as a prototype development platform. In other words, the SMA architecture is implemented on an FPGA with the intention of eventually designing an ASIC. The purpose of the FPGA design is to accurately emulate the ASIC behavior, cycle by cycle, at a lower clock frequency. The lower logic density of FPGAs requires a scaled-down design, or the use of multiple FPGAs. Because communicating between FPGAs may cause additional complications, SMA is implemented on a single FPGA, scaled down compared to the ASIC. To fit in a single FPGA, some features need to be sacrificed, such as using fewer ALUs or reducing the floating point precision. In our FPGA implementation, we only reduce the floating point precision. Section 3.3.1 gives a more detailed comparison between the target specification and the prototype FPGA implementation. For the following sections, we refer to the target ASIC accelerator as SMA-A (short for Sparse Matrix Accelerator-ASIC), and the FPGA prototype as SMA-F (short for Sparse Matrix Accelerator-FPGA).

3.3.1 Target Development Platform

Figure 3.10 shows the DE4 development board used for the FPGA accelerator prototype. The DE4 board from Terasic features an Altera Stratix IV FPGA, two DDR2 SO-DIMM sockets, and PCI Express 2.0. The Altera Stratix IV GX EP4SGX530C2 has 212,480 ALMs (Adaptive Logic Modules), 20,736 kbits of embedded memory, and 512 embedded 18x18 multipliers. In comparison, the previous-generation Stratix III FPGA used for the Restricted Boltzmann Machine implementation has 135,000 ALMs, 16,272 kbits of embedded memory, and 288 18x18 multipliers.

Figure 3.10: DE4 Development Board Used for SMA-F

As was the case for the DE3 board, the DE4 uses the USB-JTAG interface to program the FPGA and debug the logic. However, the DE4 no longer relies on an SD card for user input and instead utilizes the PCI Express interface. As mentioned in Section 3.2, the SMA architecture can be used as an integrated functional unit or as a discrete device. The SMA-F was designed as a programmable discrete device that interacts with a host processor, similar to many modern GPUs. We instantiate Nios II processors, Altera's 32-bit soft processors, to serve as the instruction decoder of the accelerator, as well as the tiny device processor for non-critical general-purpose computation, to avoid excessive data transfers between the host and the device. Our instantiation of the Nios II processors has a 2KB instruction cache and a 2KB data cache for each processor, performs at a rate of 1.13 MIPS/MHz, and uses approximately 1% of the available programmable logic. Using the Nios II processor gives us the advantage of using Altera's C compiler for implementing the applications. However, since the accelerator is implemented as a discrete device, the user needs to program and compile two versions of code: one for the host processor and one for the Nios II processor. At host run-time, the Nios II executable binary is downloaded onto the FPGA main memory, and the Nios II processors are released from the reset state. Although this may not be a preferable way

of programming, it serves our purpose of running the discrete device, and similar programming practices are used in modern GPUs. The SMA-F utilizes the Nios II custom instruction interface [1] to implement the instructions for the matrix operations shown in Table 3.2. The Nios II custom instruction interface provides a simple and efficient way of communicating between the Nios II processors and the matrix accelerator core. Appendix B describes the SMA-F ISA in great detail. As was the case with the Restricted Boltzmann Machine implementations, the SMA-F takes advantage of Altera's intellectual property blocks for Stratix IV FPGAs. The Nios II processor is one example of these IP blocks. The SMA-F also uses a number of other IP blocks, such as the QSYS interconnect, the DDR2 memory controller, and the PCI Express hard IP. The DDR2 memory controller supports up to a 400MHz memory clock frequency, although the SMA-F operates at a lower memory clock frequency, as discussed later. The SMA-F instantiates two DDR2 memory controllers for both DRAM slots and assumes a 1GB DIMM for each slot (2GB of device memory in total). The PCI Express hard IP supports up to 8 lanes, with a throughput of 500MB/s per lane (PCI Express 2.0 standard). Since this is a scaled-down prototype of the targeted ASIC, several assumptions had to be made. Table 3.3 summarizes these assumptions. First, we assume that the FPGA core clock frequency, 100MHz, corresponds to a clock frequency of 1GHz in the targeted ASIC environment. To ensure such projections are reasonable, the number of global wires was minimized, and any long wire was appropriately buffered. Most wires are local due to the modular design, which helps meet timing and resource constraints. The number of ALUs is fixed at 256 due to the limited number of multipliers on the FPGA and routing issues. In addition, we reduce the data width of the ALUs to a 16-bit floating point format, while the target ASIC is assumed to use the 32-bit single precision floating point format. Such a reduction was inevitable due to the limited number of programmable logic elements; the average ratio of silicon area required to implement logic in an FPGA versus an ASIC is approximately 21 [36]. Thus, although we are restricted to 256 half-precision floating point units, we believe that more than 256 single-precision units will comfortably fit in an ASIC. For the memory interface, we

                     FPGA Prototype (SMA-F)         Targeted ASIC Accelerator (SMA-A)
Core Clock Freq.     100 MHz                        1 GHz
Memory Clock Freq.   200 MHz                        1.3 GHz
Memory Interface     128-bit DDR2                   384-bit GDDR5
Memory Bandwidth     6.2 GB/s                       125 GB/s
Number of ALUs       256 (16-bit floating point)    256 (32-bit floating point)

Table 3.3: Prototype Implementation vs Projected Specification

assume that the device-dedicated memory of SMA-A has the bandwidth of modern GPUs. Because the number of DDR2 memory pins is fixed and the DDR2 memory clock frequency must operate within a certain range, we adjusted the memory clock frequency such that the aggregate memory bandwidth approximates the memory bandwidth seen in current GPUs. This adjustment also reflects the fact that we are using half precision in our implementation, but targeting single precision in the ASIC. The 16-bit floating point format SMA-F uses does not comply with the IEEE 754-2008 standard, but follows a custom 16-bit format to match the embedded multiplier width (9 bits). Our half-precision floating point format consists of one sign bit, 7 exponent bits, and 9 mantissa bits. We do not consider denormalization in our ALUs. Due to the limited precision of the mantissa, some applications may not produce the correct answer when precision error accumulates. For these applications, we verify the correctness using small-scale data and use large-scale data only for performance measurements. More details of the evaluation methodology are discussed in Section 3.4.3.
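For illustration, the sketch below converts an IEEE 754 single-precision value into a narrow format with the field widths stated above (one sign bit, 7 exponent bits, 9 mantissa bits) and no denormal support. The exponent bias, the truncating rounding, and the flush-to-zero behavior on underflow are assumptions; they are not necessarily the rules implemented in the SMA-F ALUs.

```c
/* Hedged sketch of conversion into a custom narrow float format. */
#include <stdint.h>
#include <string.h>

#define EXP_BITS 7
#define MAN_BITS 9
#define EXP_BIAS ((1 << (EXP_BITS - 1)) - 1)      /* assumed bias of 63 */
#define EXP_MAX  ((1 << EXP_BITS) - 1)

uint32_t float_to_custom(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);                     /* reinterpret the IEEE-754 bits */
    uint32_t sign = u >> 31;
    int32_t  e    = (int32_t)((u >> 23) & 0xFF);  /* IEEE exponent field */
    uint32_t man  = (u >> (23 - MAN_BITS)) & ((1u << MAN_BITS) - 1); /* truncate mantissa */
    int32_t  exp  = e - 127 + EXP_BIAS;           /* rebias the exponent */

    if (e == 0 || exp <= 0)                       /* denormal or underflow: flush to zero */
        return sign << (EXP_BITS + MAN_BITS);
    if (exp >= EXP_MAX) {                         /* overflow: clamp to the largest value */
        exp = EXP_MAX;
        man = (1u << MAN_BITS) - 1;
    }
    return (sign << (EXP_BITS + MAN_BITS)) | ((uint32_t)exp << MAN_BITS) | man;
}
```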

3.3.2 Software Stack

Figure 3.11 illustrates the software stack of our matrix acceleration framework. As shown in the figure, we use Microsoft Windows XP Professional (32-bit) as the host operating system. Although any operating system can be used with the SMA-F, we chose this particular OS due to some practical issues. The PCI Express driver builds on the Microsoft Windows Driver Foundation (WDF 7600.16385), which provides a (relatively) simple and straightforward way to write driver code compared to Windows


Figure 3.11: SMA-F Software Stack

Driver Model (WDM). The PCI Base Address Register (BAR) 0 is used for the address space of the device DRAM. Data transfers between the host and device DRAM can be done via DMA (Direct Memory Access) and MMIO (Memory-Mapped I/O). BAR 1 and BAR 2 are used for accessing the control registers of the device. The host common library operates at the user level, and behaves as a wrapper for the kernel-mode driver routines. In addition, the host common library provides common routines, such as downloading the Nios II binary and generating matrices. As mentioned earlier, applications are divided into code that executes on the host and routines that run on the SMA-F. In general, the host routines perform general-purpose computing, and offload matrix-specific computation to the SMA-F via the host common library. Since the applications mentioned in Section 3.1.1 mainly consist of matrix operations, most computation takes place on the SMA-F, and the host is only responsible for generating the matrices and reading back the results.

On the device side, the lowest layer of the software stack is Altera's Nios II Board Support Package (BSP), which supports basic libc functionality, such as printf and malloc. The device common library builds on the BSP to provide commonly used routines, which include the macros for the matrix-related custom instructions and other matrix manipulation functions. The device common library also allows device applications to control the PCI Express interface as well as receive notification of any incoming data. The top layer of the device software stack is the code that expresses the application using matrix operations.

3.3.3 Resource Usage

Figure 3.12 illustrates the resource usage of the SMA-F implementation. Altera Quartus II 11.0 was used for synthesis, fitting, routing, and timing analysis of the design. The SMA-F uses approximately 80% of the entire FPGA. The numbers in Figure 3.12 display how much resource each module consumes. As shown in Figure 3.12(a), the floating point ALUs take the majority of the logic resources. Since there are 256 FP ALUs on the FPGA, each with a floating point adder, multiplier, and comparison unit, it is reasonable that the ALUs make up a significant portion of the die area. Another significant portion of the logic resources is used by the thread block routing, which mostly consists of multiplexers and buffers. One should note that the resource utilization in Figure 3.12(a) is for the FPGA and may differ in the ASIC implementation SMA-A. For example, since the ALUs utilize the FPGA embedded multipliers, which are not included in the FPGA ALUT statistics, the ALU portion of the resource utilization for SMA-A may increase to include the resources used in the multipliers. In addition, although the area of other modules may increase linearly with the data precision, the ALU area may increase quadratically depending on the actual implementation, further increasing the ALU portion. If denormalized numbers are handled in SMA-A, even more resources would be required in the ALUs. On the other hand, the resource usage for routing may decrease in SMA-A, since multiplexers tend to be much more efficient in ASICs than in FPGAs. The decode and encode units use 11% and 8% of the logic resources, respectively.

Figure 3.12: SMA-F Resource Utilization: (a) Logic Utilization (b) Memory Block Utilization

The majority of the logic in these modules is used for multiplexers and shift registers that manipulate DRAM data. The non-zero count module for sparsification adds 4% of the logic resources required for encoding data. The Nios II processors, which are in-order processors used for decoding instructions and general-purpose computing, use only 1% of the logic. Figure 3.12(b) shows the memory usage of the SMA-F. A considerable portion of the memory blocks is used for buffering data. As shown in the chart, a large portion of the memory is used in thread block routing. This is due to the buffers required to

tolerate the irregularity and load imbalance. A significant amount of memory is used for buffering data in the Altera Qsys interconnect, marked as Other in Figure 3.12(b). This corresponds to the auto-generated buffered interconnect between the memory controller and the other modules shown in Figure 3.3. The majority of the memory blocks used for the PCI Express and Memory Controller are also buffers, some of which are due to data transfers between asynchronous clock domains, while others are embedded in Altera's IP and reference design code, which we use without modification. The memory used for the non-zero count for sparsification also consists of buffers for sorting and counting non-zero elements. The total amount of cache for data reuse consumes 7% of the memory blocks. The ALUs and SPUs use 3% of the embedded memory, most of which is for the feedback queues in the FP ALUs. About 1% of the memory is used in the Nios II processors. Of this memory, 60% is used for the instruction and data caches, 15% for the register file, and 15% for hardware support for debugging Nios II software applications. Approximately 903KB of embedded memory is used in total for the SMA-F. This corresponds to 35% of the embedded memory bits available in the Stratix IV FPGA. The SMA-F actually uses all the embedded memory blocks (i.e., 1280 M9K memory blocks and 64 M144K memory blocks), but does not fully utilize the depth of each memory block, since many buffers do not require the full depth. The SMA-A will thus require at most 1.76 MB, assuming the worst case where all storage needs twice as much space as in SMA-F due to the doubling of data precision, although it is very likely that the required storage would be much less than that.

3.3.4 Challenges and Limitations

This subsection discusses the challenges confronted while implementing the SMA-F. Due to limited time (and manpower), many parts of the design were simplified. Minor details not essential to the purpose of evaluating the accelerator architecture were mostly omitted. Simplifications include enforcing memory alignment to 128 bytes, enforcing matrix sizes to be a multiple of 32 or 64, and using less accurate floating point units. Since our focus was on demonstrating the feasibility of our

approach, we argue that these simplifications do not invalidate our claims, since it is unlikely that adding these features would improve the feasibility and efficiency of our architecture in any significant way. There are also limitations that may negatively impact the performance of SMA-F. Since these limitations can only worsen SMA-F performance, they do not automatically negate the validity of our performance claims. However, such limitations should be avoided whenever possible, as they mask the full potential of our accelerator. Such limitations include using small fixed-size matrix blocks for the CSB format, supporting only one sparse matrix format, the fixed mapping of matrix rows to threads, and work-stealing only between a pair of predetermined partner threads within a thread block. As shown in Section 3.4.2, small CSB matrix blocks have a detrimental impact on performance for very sparse matrices. Inefficiencies occur when a matrix operation spends most of its time sending out memory requests to probe empty small matrix blocks. Also, parallelism is currently extracted on a per-matrix-block basis, thus requiring a sufficient number of non-zero elements within each matrix block to fully utilize the underlying hardware; small matrix blocks most likely have few non-zeros when the matrix in consideration is very sparse. An RMAT matrix, for example, is considered a very sparse matrix. Using the current fixed size for CSB matrix blocks, more than 99.9% of the matrix blocks can contain fewer than 40 non-zero elements, and more than half of these matrix blocks are empty. These types of matrices typically result in suboptimal performance. Our experiments show that a non-zero density of at least 1% within a CSB matrix block is required to keep the computational resources busy. Section 3.4.2 discusses this matter in more detail. The solution to this issue is either allowing the CSB matrix block size to be adjustable or supporting more sparse matrix formats. Both options are subject to future work. In addition to the limitations from time constraints, there were challenges in implementing an efficient prototype due to the resource usage restrictions in FPGAs. One example of such a restriction is using embedded memory blocks for shallow buffers. The embedded memory blocks in FPGAs offer an efficient way to implement RAMs and FIFOs without relying on programmable logic. However, the depth and width of


each memory block can only be configured to one of a few fixed settings10. To fully utilize a memory block, the width and depth of the instantiated RAM must closely match one of the configurations; otherwise, memory bits are wasted. In the SMA architecture, many of the buffers between modules and submodules are shallow queues used for simple backpressure and for improving throughput. However, since memory blocks in the Altera Stratix IV have a minimum depth of 256, a wide and shallow queue requires many memory blocks and wastes most of the memory bits.

Figure 3.13: Memory Blocks for Queues

Figure 3.13 describes this problem. In the left example shown in the figure, the buffer requires two M9K blocks to match the incoming data width, but the queue depth only requires a fraction of the embedded memory. The shaded area represents the memory bits actually used for the queue. As can be seen, a large portion of the memory is unused. To alleviate this problem, we emulate shallow queues using a faster clock frequency. In the right example shown in Figure 3.13, a clock frequency twice as fast as in the previous example is used to time-multiplex the incoming data within a single normal clock cycle. By using a faster clock, we can pack the wide data into one memory block. The memory bit requirement remains the same, but the number of memory blocks needed is halved.

10For example, an M9K block can be configured as the following (depth × width): 8Kx1, 4Kx2, 2Kx4, 1Kx8, 1Kx9, 512x16, 512x18, 256x32, and 256x36.
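The trade-off can be quantified with a small back-of-the-envelope helper: for a queue of a given width and depth, it estimates how many M9K blocks are needed when mapped directly versus when the data is time-multiplexed at twice the clock to halve the physical width. It assumes the widest M9K configuration (256 × 36) from the footnote above; the example values are illustrative.

```c
/* Estimate M9K block usage for a wide, shallow queue (sketch). */
#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int m9k_blocks(int width_bits, int depth, int clock_mult)
{
    int phys_width = ceil_div(width_bits, clock_mult);  /* time-multiplexed physical width */
    int phys_depth = depth * clock_mult;                /* words stored per logical entry */
    int blocks_w   = ceil_div(phys_width, 36);          /* widest config: 36 bits */
    int blocks_d   = ceil_div(phys_depth, 256);         /* minimum depth: 256 entries */
    return blocks_w * blocks_d;
}

int main(void)
{
    /* e.g. a 64-bit wide, 16-deep queue: 2 M9Ks direct vs. 1 M9K at a 2x clock */
    printf("direct: %d blocks, 2x clock: %d blocks\n",
           m9k_blocks(64, 16, 1), m9k_blocks(64, 16, 2));
    return 0;
}
```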

                    SMA-A                    CPU                                     GPU
Processor           Emulated (using SMA-F)   Intel Xeon X5550 (Dual-socket)          NVIDIA Tesla C2050
Process Tech.       N/A                      45 nm                                   40 nm
Core Clock Freq.    1 GHz                    2.67 GHz                                1.15 GHz
Memory Clock Freq.  1.3 GHz                  1.33 GHz                                1.5 GHz
Memory Interface    384-bit GDDR5            384-bit DDR3 (3 channels per socket)    384-bit GDDR5
Memory Capacity     4 GB                     96 GB                                   3 GB
Memory Bandwidth    125 GB/s                 64 GB/s                                 144 GB/s
Number of FP ALUs   256                      64                                      448
Max FLOPS           512 GFLOPS               171 GFLOPS                              1030 GFLOPS
On-chip memory      1906 KB                  18944 KB                                3456 KB

Table 3.4: Platform Specifications For Performance Evaluation and Analysis

3.4 Results and Analysis

3.4.1 Experimental Platform

As mentioned in Section 3.3, the SMA-F is used for estimating the performance of SMA-A, the ASIC implementation of the Sparse Matrix Accelerator. To evaluate the accelerator architecture, we compare its performance with a high-end CPU and GPU, detailed in Table 3.4. The table lists the chip specifications of each platform. A dual-socket system with Nehalem-based Intel Xeon processors running at 2.67 GHz was used for evaluating an SMP multi-core CPU system. The Xeon processors support SSE4.2 and yield a peak floating point performance of 171 GFLOPS across all eight cores11. AVX (Advanced Vector Extensions) technology, which was first supported in Intel Sandy Bridge processors in 2011, is expected to double the floating point performance; however, AVX is not supported by the Nehalem architecture and is not included in the CPU evaluation. The NVIDIA Tesla C2050 was used for evaluating the performance of a programmable GPU. It features a large memory bandwidth (144 GB/s) and many SIMD-like cores (448), giving a peak single-precision floating point

11Each core can compute up to 8 single-precision floating point operations per cycle, which gives a total performance of 8 (cores) × 8 (FLOP) × 2.67 (GHz) = 170.88 GFLOPS.

Matrix Operation Type           CPU             GPU
Dense Matrix – Dense Matrix     MKL + OpenMP    CUBLAS + CUDA
Sparse Matrix – Dense Matrix    MKL + OpenMP    CUSPARSE + CUDA
Sparse Matrix – Sparse Matrix   CSparse         CUSP

Table 3.5: CPU and GPU Libraries For Matrix Operations

performance of 1030 GFLOPS12. Table 3.4 also shows the projected specifications of SMA-A. The SMA-A hardware specifications highly resemble the specifications of modern GPUs, which implies that the SMA-A specifications are realistic and realizable. As mentioned in Section 3.3.1, the FPGA implementation (SMA-F) uses an Altera Stratix IV GX FPGA. Since the SMA core modules run at a clock frequency of 100 MHz, the SMA-A performance is estimated by measuring the runtime of SMA-F in number of cycles. The SMA-A is assumed to run the exact same number of cycles at a faster clock frequency (1 GHz). Thus, multiplying the number of cycles by 1 ns gives the SMA-A runtime. The measurement on SMA-F is done with hardware counters supported by the Nios II processors. The CPU performance is evaluated using the gettimeofday function. The GPU uses cudaEvent functions to accurately measure the GPU runtime. Table 3.5 summarizes the software libraries used in the CPU and GPU applications. The applications for the CPU utilize the Intel Math Kernel Library (MKL 10.3.4) and CSparse [16] for optimized matrix operations. OpenMP is used to parallelize element-wise matrix operations, and also enables the multi-threading capability in MKL routines. Since the CSparse library uses double precision floating point numbers for the matrix element values by default, we modified the CSparse library to use single precision floating point numbers instead for a fair performance comparison with the GPU and SMA-A. The GPU codes use CUBLAS 4.0, CUSPARSE 4.0, and CUSP 0.2.0 for most matrix operations. Element-wise matrix operations not supported by the GPU libraries are hand-coded as CUDA kernels.
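As a concrete illustration of the CPU-side timing methodology, the fragment below wraps a matrix routine with gettimeofday calls. The run_matrix_op function is a hypothetical placeholder for the MKL or CSparse call being measured, not part of any library used here.

```c
/* Wall-clock timing of a CPU matrix routine with gettimeofday (sketch). */
#include <sys/time.h>

static void run_matrix_op(void)
{
    /* Placeholder for the MKL/CSparse routine under measurement. */
}

double time_op_seconds(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);         /* timestamp before the operation */
    run_matrix_op();
    gettimeofday(&t1, NULL);         /* timestamp after the operation */
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}
```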

Figure 3.14: Matrix Multiplication Performance: (a) Dense Matrix * Dense Matrix (b) Uniform 1% Sparse Matrix * Dense Matrix (c) RMAT Sparse Matrix * Dense Matrix (d) Uniform 1% Sparse Matrix * Sparse Matrix (e) RMAT Sparse Matrix * Sparse Matrix (f) RMAT32 Sparse Matrix * Sparse Matrix

3.4.2 Matrix Operation Performance Analysis

In this section, the performance of individual matrix operations is measured and analyzed. As shown in Section 3.1.1, since the majority of the execution time is spent in matrix multiplication, the matrix multiplication performance is evaluated in more detail than that of the element-wise matrix operations. Four types of matrices are used in this analysis: dense, uniform 1%, RMAT, and RMAT32. A dense matrix is a matrix whose number of non-zero elements is very close to the total number of elements of the matrix. The format used for dense matrices is a dense array, where the elements are stored in row-major order. Uniform 1% matrices are sparse matrices whose non-zero elements are drawn from a uniform

12A core can issue a multiply-add floating point operation each cycle; thus, the peak floating point performance is 448 (cores) × 2 (FLOP) × 1.15 (GHz) = 1030.4 GFLOPS.

random distribution with a probability of 0.01. RMAT [14] matrices are also synthetically generated sparse matrices, representing realistic graphs that follow a power-law degree distribution. RMAT graphs resemble many real-world networks, such as the internet, social networks, and protein-protein interaction networks. RMAT and RMAT32 are generated using the same algorithm, except that RMAT32 has a denser distribution due to a larger fanout. The maximum number of edges in RMAT is eight times the number of nodes, while the maximum number of edges in RMAT32 is 32 times the number of nodes. The format used for the sparse matrices is Compressed Sparse Block in SMA-A, while Compressed Sparse Column was used on the CPU and GPU to utilize the highly optimized sparse matrix libraries.
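For reference, a uniform p-density sparse matrix of the kind described above can be generated with a few lines of C. The generator below stores the matrix densely for clarity; the actual experiments store the operands in CSB (SMA-A) or CSC (CPU/GPU) form, and the function name and use of rand() are illustrative.

```c
/* Generate an n x n matrix where each element is non-zero with probability p. */
#include <stdlib.h>

void gen_uniform_sparse(float *a, int n, double p, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < n * n; i++)
        a[i] = ((double)rand() / RAND_MAX < p)
             ? (float)rand() / RAND_MAX    /* arbitrary non-zero value in [0, 1] */
             : 0.0f;
}
```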

Matrix Multiplication

Figure 3.14 shows the matrix multiplication performance comparison between different hardware platforms and matrix types. Only square matrices were used in the performance measurements. The x-axes represent the number of elements per row or column of each matrix. The y-axes show the speedup in comparison to the single-threaded performance of the Nehalem-based Xeon processor. Figure 3.14(a) displays the dense matrix multiplication performance. As can be seen in the figure, SMA-A shows floating point performance close to the theoretical peak FLOPS (512 GFLOPS) for the dense matrix multiplication operation. The GPU and CPU also show relatively good floating point performance, although they fail to reach their theoretical maximum FLOPS (1030 GFLOPS and 171 GFLOPS, respectively). Nonetheless, this implies all three platforms make efficient use of their on-chip memory to stay within the memory bandwidth while fully utilizing the hardware floating point units. Figure 3.14(b) and Figure 3.14(c) illustrate the speedup for sparse matrix times dense matrix multiplication (SDMM). In both plots, the GPU and SMA-A benefit from their large memory bandwidth due to the memory-intensive nature of sparse matrix multiplication. In the case of SMA-A, the average memory throughput measures between 44 GB/s and 75 GB/s, and simulation results reveal that the peak throughput may be much higher. This indicates that SMA-A very often requires and utilizes a memory

bandwidth larger than that of the CPU platform to sustain its performance advantage over the CPU. In the case of the GPU, it is difficult to estimate how many DRAM memory transactions took place, since the data reuse algorithm in the CUSPARSE library is unknown. However, since the GPU performance is significantly better than that of the CPU, even though the average GPU FLOPS (approximately 10 GFLOPS) is considerably lower than the maximum floating point performance (1030 GFLOPS), we can deduce that a large portion of the GPU performance gain comes from the memory bandwidth. Thus, due to the large memory bandwidth and abundant parallel resources, both the GPU and SMA-A show huge speedups compared to the multi-core CPU platform. The uniform sparse matrix multiplication performance in Figure 3.14(b) shows that the GPU performance for smaller matrices is lower than that of the SMA-A. This is due to the SIMT structure of the GPU, which requires a certain number of non-zero elements clustered in a row or column to achieve good performance. Since 1% of the elements are non-zero, a row or column has about 20, 40, and 80 non-zero elements on average for matrix sizes of 2048, 4096, and 8192, respectively. The Fermi GPU, whose warp size is 32, suffers from load imbalance among threads when the warps are not fully utilized. Thus, a uniform sparse matrix of size 2048 is likely to underutilize the computation resources, and a matrix of size 4096 would only fully utilize the warps every other cycle13. In contrast, SMA-A does not suffer from such SIMD-like constraints. Figure 3.14(c) shows that SMA-A suffers a performance drop on larger RMAT matrices. To identify the reason for the performance drop, we conducted an RTL simulation of the RMAT sparse matrix multiplication. Our simulation analysis shows that a large portion of the inefficiency is due to the issues related to the CSB format mentioned in Section 3.3.4. In particular, the large overhead of accessing small CSB matrix blocks, as well as the limited amount of parallelism within each CSB matrix block, greatly contributed to the performance degradation of SMA-A, as described in Section 3.3.4. For the RMAT matrix of size 2048, 244 out of 1024 CSB matrix blocks have a density of

13The first of the alternating cycles utilizes all 32 cores in a warp to process 32 of the 40 non-zero elements, but the next cycle makes use of only 8 cores for the remaining 8 elements.

greater than 1%. Also, only 1.5% of the CSB matrix blocks were empty. However, for the RMAT matrix of size 8192, only 474 out of 16384 CSB matrix blocks have a density greater than 1%, and 25% of the CSB matrix blocks were empty. Thus, the RMAT 8192 matrix is a considerably worse environment for the SMA-A due to the large number of memory accesses probing empty CSB matrix blocks and the large number of CSB matrix blocks that are incapable of keeping the floating point resources busy. Figure 3.14(d), Figure 3.14(e), and Figure 3.14(f) plot the performance of the sparse matrix - sparse matrix multiplication (SSMM) operation for uniform 1%, RMAT, and RMAT32 sparse matrices, respectively. The CPU speedups are always 1 in these figures, since the CSparse library does not support multi-threading. The GPU, which uses the CUSP library, shows SSMM performance similar to the single-threaded CPU performance for most sparse matrices. We believe the performance drop of the GPU comes from the irregularity of memory accesses in SSMM operations, which makes it difficult to efficiently coalesce memory accesses and operate in a SIMD manner. Also, since the result of the SSMM is a sparse matrix, the ordering requirement of the non-zero elements in a sparse matrix format may also limit the parallelism by introducing synchronization and serialization. Figure 3.14(d) shows the SSMM performance using uniform 1% sparse matrices. Although only 1% of the elements are non-zero, the result matrix can be somewhat dense; the resulting matrices of size 2048, 4096, 8192, and 16384 have densities of 18.49%, 33.63%, 55.92%, and 80.58%, respectively. Thus, the GPU performance number for matrix size 16384 is not shown in Figure 3.14, since the result matrix is too large and dense to fit in the 3GB graphics memory. As can be seen from the figure, the SMA-A outperforms both the GPU and CPU by a factor of 5. Although the SMA-A speedup for SSMM operations may not be as impressive as the performance gains seen for SDMM and DDMM operations, the plot still demonstrates the benefits of using specialized hardware for sparse matrix operations. However, as mentioned in Section 3.3.4, certain limitations may prohibit SMA-A from fully utilizing the memory bandwidth or the computation resources. Figure 3.14(e) and Figure 3.14(f) use RMAT matrices with different non-zero densities to compare the performance of the hardware platforms for SSMM operations. For example, the RMAT 2048 matrix has an overall density of 0.3%. As explained above, 1.5% of


the CSB matrix blocks are empty, and 24% of the CSB matrix blocks have a density above 1%. The RMAT32 2048 matrix has four times the number of non-zeros, resulting in an overall density of 1.4%. 496 out of 1024 matrix blocks have a density greater than 1%, and only 1 out of the 1024 matrix blocks is empty. Due to more parallelism and less format overhead, using RMAT32 matrices yields better performance than using the original RMAT graphs. In general, the discrete memory bandwidth has a greater impact on the SSMM performance than the memory latency, since many memory requests are determined and streamed out without any additional dependency. This is especially true if each matrix block has enough non-zero elements to hide the latency of reading the matrix

In comparison, GPUs have specialized DRAMs that are tweaked to have large bandwidth at the cost of latency. For instance, the NVIDIA Tesla C2050 is known to have a round-trip GDDR5 memory latency of around 500 cycles14. On the other hand, DDR3 memory for CPUs tends to have lower round-trip latency (several tens to a few hundred cycles). Figure 3.15 shows how sparse matrix multiplication (SSMM) performance on SMA-A changes with memory latency. In this experiment, the round-trip memory latency was swept from 32 core clock cycles to 500 cycles. Since FPGA compilation times are usually very long, RTL simulation was used to measure the performance. Because RTL simulation runs much slower than actual hardware, matrix sizes had to be reduced to 256 or 512. Three different sparse matrices were tested: a uniform 1% sparse matrix, a uniform 7% sparse matrix, and a clustered 1% sparse matrix. As mentioned earlier, the product of larger uniform 1% sparse matrices tends to be denser. For example, an 8192x8192 uniform 1% sparse matrix multiplication results in a matrix with 56% non-zero elements. However, the SSMM operation on a uniform 1% sparse matrix of size 256 results in a 1% sparse matrix, which has considerably different memory write characteristics compared to larger uniform matrices. Thus, to mimic similar write patterns in SSMM, we also tested 7% uniform sparse matrices, where the result matrix has around 50% non-zero elements. The clustered 1% sparse matrix represents sparse matrices whose non-zero elements are mostly grouped into several relatively dense chunks (approximately 7% non-zero elements within each chunk). As can be seen in Figure 3.15(a), the uniform 1% sparse matrix multiplication performance drops by 45% as the latency is increased from 32 to 500. The reason for this is twofold. First, the total runtime is only 2708 cycles, so adding a 500-cycle latency to the total runtime already greatly degrades performance. Second, there is the overhead of each matrix block. Approximately 1% of the elements per block are non-zero, but the matrix block pointers are not read fast enough due to the long round-trip latency. Since the decode unit cannot generate addresses for non-zero elements without knowing the number of non-zeros within a matrix block,

14Source: various NVIDIA discussion boards on the web.

the decode unit will stall until the memory requests for the matrix block pointers are served. As a side note, the floating point throughput is much lower than in the other two examples, Figure 3.15(b) and Figure 3.15(c), since the probability of non-zero elements in each matrix actually being multiplied is much lower in the former case. In the case of Figure 3.15(b) and Figure 3.15(c), the performance drop due to long memory latency is relatively small. In the case of uniform 7% sparse matrix multiplication, the number of non-zero elements is sufficiently large to hide the memory latency for reading the matrix blocks. In the case of Figure 3.15(c), the sparse matrix dimension is 512x512. The number of non-zero elements is about the same as in the uniform 7% sparse matrix case, although they are clustered together to form a few relatively dense matrix blocks. Since address generation for non-zero elements is seldom stalled, the overall throughput is about the same, although the clustered case has additional overhead due to many empty matrix blocks (75% of the matrix blocks are empty). To summarize, memory latency has little effect as long as a sufficient number of non-zero elements is supplied within matrix blocks to hide the memory latency for reading matrix block pointers. However, if the number of non-zero elements per matrix block is too small, the overhead of reading a matrix block pointer may be overwhelming and cause considerable performance degradation, as mentioned above and in Section 3.3.4. This is a fundamental problem of using the Compressed Sparse Block format with small fixed-size blocks and may not be a problem for other sparse matrix formats. Figure 3.16 illustrates the effect of work-stealing. The matrices used in Figure 3.16(a) are synthetic sparse matrices whose non-zero elements are mostly concentrated in certain rows. To conveniently turn the work-stealing mechanism on and off, we performed the test in RTL simulation. To compare the impact of work-stealing across different non-zero element concentrations, we varied the number of non-zero elements per concentrated row and the number of concentrated rows while keeping the total number of non-zeros constant, i.e. maintaining the sparseness of 1% non-zeros. A matrix size of 512x512 was used so that simulations finish in a reasonable amount of time. Each pair of bars in Figure 3.16(a) represents the execution time for a particular non-zero concentration.

[Figure 3.16 appears here: bar charts of execution time in cycles with and without work-stealing, for (a) concentration on 1% of the rows, (b) concentration on 2% of the rows, (c) the Betweenness Centrality (BC) SSMM example, and (d) aggregate BC performance.]

Figure 3.16: SMA-A Work-stealing Performance Using Skewed Matrices

The first two bars in Figure 3.16(a) are the extreme case where 75% of the non-zero elements reside within four randomly chosen rows and the remaining 25% are evenly distributed among the other 508 rows. The next two bars in the plot have 50% of the non-zeros in the randomly chosen rows, while the remaining 50% of the non-zeros are evenly distributed among the other rows. The third pair of bars shows the result of having 25% of the non-zeros in four rows. Finally, the last two bars use a uniform random 1% sparse matrix for performance comparison. The y-axis indicates the runtime in this plot. The numbers on top of each bar indicate the speedup over the cases without work-stealing.

As can be seen in the figure, the performance benefits are greater for the more skewed matrices. In addition, no performance overhead was seen from work-stealing when running tests on the uniform sparse matrix. From the figure, we can also see the negative effects of only stealing work between designated partner threads, which basically limits the amount of parallelism. Note that the 75% concentration case runtime is twice as long as the uniform random sparse matrix runtime. Ideally, work-stealing would evenly distribute the workload among threads such that the resulting performance would be similar to that of using the uniform sparse matrix, since the total number of non-zero elements is about the same. However, the performance gains from using work-stealing in SMA are at most 2X since the workload can only be shared between partner threads. Since the non-zero elements are concentrated in four rows, only eight threads (including partner threads) would be actively processing data. Figure 3.16(b) shows the work-stealing performance when concentrating the non-zeros in eight rows instead of four. The performance gains of work-stealing are less dramatic than in Figure 3.16(a) since the matrices are less skewed. However, it also shows that the performance approaches the uniform sparse matrix runtime more quickly, since 16 threads are now mostly active via work-stealing. Figure 3.16(c) uses matrices generated from Betweenness Centrality to see how work-stealing affects the performance of an actual application. The x-axis shows the fringe matrices created after each iteration of the first inner loop shown in Figure A.3. Although work-stealing does not show benefits for all iterations, one particular iteration shows a 50% performance improvement. Figure 3.16(d) aggregates the matrix operation runtimes in Figure 3.16(c) and shows that work-stealing in BC can result in a 10% performance gain.
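To make the partner-thread restriction concrete, the following sketch models two threads that each own a queue of rows and steal only from their designated partner when their own queue runs dry. It is a simplified software analogy of the mechanism described above, not the SMA RTL; the queue structure and function names are invented for illustration.

#include <stdio.h>

#define QCAP 64

typedef struct {
    int rows[QCAP];
    int head, tail;              /* pop own work from the head, partner steals from the tail */
} row_queue;

static int pop_own(row_queue *q)            { return (q->head < q->tail) ? q->rows[q->head++] : -1; }
static int steal_from_partner(row_queue *p) { return (p->head < p->tail) ? p->rows[--p->tail] : -1; }

int main(void) {
    row_queue t[2] = {{{0}, 0, 0}, {{0}, 0, 0}};

    /* Skewed assignment: thread 0 owns 12 rows, thread 1 owns only 2. */
    for (int r = 0; r < 12; r++) t[0].rows[t[0].tail++] = r;
    for (int r = 12; r < 14; r++) t[1].rows[t[1].tail++] = r;

    int processed = 0;
    while (processed < 14) {
        for (int id = 0; id < 2; id++) {
            int row = pop_own(&t[id]);
            if (row < 0) row = steal_from_partner(&t[id ^ 1]);   /* designated partner only */
            if (row >= 0) { printf("thread %d processes row %d\n", id, row); processed++; }
        }
    }
    return 0;
}

With the skewed assignment in main(), thread 1 drains its two rows quickly and then steals from the tail of thread 0's queue, which is exactly the partner-limited sharing that caps the gain at roughly 2X.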

Element-wise Operations

The execution time of element-wise operations is typically much less than the execution time of matrix multiplication. However, as seen in Figure 3.1, element-wise operations also need to be accelerated to achieve a significant overall speedup. Figure 3.17(a) shows the performance of selected element-wise operations on dense matrices. The evaluation was performed using a 4096x4096 dense matrix, and speedup is relative to single-threaded CPU performance.

[Figure 3.17 appears here: speedup bars for the GPU and the FPGA relative to a single-threaded CPU, for (a) dense element-wise operations (ADD, LOG2, EXP2, RCP) and (b) sparse matrix addition (uniform 1% and RMAT matrices).]

Figure 3.17: Element-wise Performance

Multi-threaded CPU performance is not shown, since our experiments with multi-threaded versions of element-wise operations using OpenMP actually degraded the performance. As can be seen in the figure, the FPGA performance generally lags behind the GPU performance, with the exception of the LOG2 function. One reason is that the dense element-wise operations have been less tuned in SMA-F compared to matrix multiplication, since the focus of our accelerator is sparse matrix operations; there are some known inefficiencies in element-wise operations that have not been resolved yet. Another major reason for the performance difference is that EXP2 and RCP are not pipelined and use an iterative algorithm in order to save area. Figure 3.17(b) shows the sparse matrix addition performance. Element-wise special functions are not tested for sparse matrices since LOG2 and RCP cannot have zero as an input for SMA-F. The CSparse cs_add function was used as the CPU baseline performance, and the cusp::add function was used for the GPU sparse matrix addition evaluation. As shown in the figure, SMA-A shows 5X to 20X speedups over the CPU implementation, depending on the matrix type. However, the GPU tends to suffer in performance due to the irregularity of adding two sparse matrices and creating a new sparse matrix.
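The irregularity comes from the merge-like structure of sparse addition: which element is consumed next depends on the data itself. The sketch below adds one row of two sparse matrices stored as sorted (column, value) lists, a CSR-style view chosen for brevity rather than the accelerator's CSB format; the function and variable names are illustrative.

#include <stdio.h>

static int add_sparse_row(const int *colA, const float *valA, int nA,
                          const int *colB, const float *valB, int nB,
                          int *colC, float *valC) {
    int a = 0, b = 0, c = 0;
    while (a < nA || b < nB) {
        if (b >= nB || (a < nA && colA[a] < colB[b])) {          /* only A has this column */
            colC[c] = colA[a]; valC[c++] = valA[a++];
        } else if (a >= nA || colB[b] < colA[a]) {               /* only B has this column */
            colC[c] = colB[b]; valC[c++] = valB[b++];
        } else {                                                 /* both: add the values   */
            colC[c] = colA[a]; valC[c++] = valA[a++] + valB[b++];
        }
    }
    return c;   /* number of non-zeros in the result row */
}

int main(void) {
    int   colA[] = {1, 4, 7};   float valA[] = {1.f, 2.f, 3.f};
    int   colB[] = {2, 4, 9};   float valB[] = {5.f, 6.f, 7.f};
    int   colC[6];              float valC[6];
    int   n = add_sparse_row(colA, valA, 3, colB, valB, 3, colC, valC);
    for (int i = 0; i < n; i++) printf("(%d, %.1f) ", colC[i], valC[i]);
    printf("\n");
    return 0;
}

Every iteration takes a different branch depending on the column indices, and the output length is unknown in advance, which is why coalesced SIMD execution and preallocated output buffers are hard to arrange on a GPU.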

[Figure 3.18 appears here: speedup bars for (a) RBM performance results, comparing CPU (8 threads), GPU, and SMA-A, and (b) sparse RBM (sRBM) performance results, comparing CPU (8 threads) and SMA-A, for training image lengths 48, 64, and 96.]

Figure 3.18: SMA-A Restricted Boltzmann Machine Performance

3.4.3 Application Performance Analysis

This subsection presents the performance results of SMA-A running on applications using realistic data. Detailed analysis is performed on each application to better understand the performance gains at the application level. Both the original Restricted Boltzmann Machine and the sparse Restricted Boltzmann Machine were used to test the hardware platforms. Figure 3.18 shows the performance of running the original and sparse Restricted Boltzmann Machine on each hardware platform. Although different configurations can greatly impact the accuracy of the model, we always set the number of visible and hidden variables to be equal for simplicity, since these tests were conducted for runtime performance analysis. The MNIST handwritten digits were used as the training data set, where each training example is a 28x28 image. To test larger RBM network sizes, we scaled each handwritten digit via simple image processing. Training image length in Figure 3.18 represents the scaled image data and the weight matrix sizes. For example, training image length 48 indicates that 48x48 images (2304 visible variables in total) were used for training a 2304x2304 weight matrix. Figure 3.18(a) compares the multi-threaded CPU, GPU, and SMA-A performance over a single-threaded CPU for running the original dense RBM algorithm.

[Figure 3.19 appears here: (a) RBM execution time breakdown and (c) sRBM execution time breakdown over the operations MM, EXP, DIV, ELEMWISE, and SUM for training image lengths 48, 64, and 96; (b) RBM matrix operation speedup and (d) sRBM matrix operation speedup for CPU (OpenMP), GPU, and SMA-A.]

Figure 3.19: SMA-A RBM Performance Analysis

As shown in the figure, the GPU and SMA-A show comparable performance with a speedup close to 20, while the multi-threaded CPU reaches the maximum performance obtainable from utilizing all eight cores. Figure 3.19 further analyzes the dense RBM performance in more detail. Figure 3.19(a) shows the RBM execution time breakdown of a single-threaded CPU, and Figure 3.19(b) illustrates the speedups for each matrix operation using training image length 64. More than 90% of the runtime is consumed by the matrix multiplication operation, followed by the element-wise exponential operation. As shown in Figure 3.19(b), the SMA-A and GPU exhibit 23.5X and 19X speedups from

dense matrix multiplication, which accounts for the majority of the RBM execution. The GPU also shows a 180X performance gain in EXP operations, considerably reducing the exponential runtime and slightly increasing its overall speedup to 19.4X. On the other hand, the SMA-A EXP function showed an 18X performance gain, resulting in an overall speedup of 22.5X, slightly below its matrix multiplication speedup. Figure 3.18(b) contrasts the performance gain of using a multi-threaded CPU and the SMA-A for the sparse RBM. The CPU used the Intel MKL instead of CSparse for this test since only sparse matrix - dense matrix multiplication (SDMM) and dense - dense matrix multiplication (DDMM) were involved. The GPU was not tested due to a few difficulties in sparsifying the weight matrix and allocating threads to the receptive fields. The multi-threaded CPU steadily yields an approximately 5X speedup, while the SMA-A performance gain ranges from 10X to 25X. This is somewhat related to the sparseness issue discussed in Section 3.3.4 and Section 3.4.2. Since the size of the receptive fields (7x7) is fixed for all three tests, the sparseness of the weight matrix increases with the matrix size, i.e. the first weight matrix has 2.1% non-zeros, the second has 1% non-zeros, and the largest has 0.5% non-zeros. The single-threaded CPU execution time breakdown is shown in Figure 3.19(c), and the performance gains of the matrix operations are compared in Figure 3.19(d). The element-wise arithmetic and special operations make up around 40% of the entire execution, which is significantly more than in the case of the dense RBM. Figure 3.19(d) shows that SMA-A has approximately 18X speedups for EXP operations and element-wise addition/multiplication operations, but displays low performance gains for inverse (DIV) and summation operations. Fortunately, the runtime portion of the DIV and summation operations is much smaller than the EXP and ELEMWISE runtime, having negligible impact on the overall performance. One may note that the RBM implementation in the previous chapter matches GPU performance with an FPGA running at a 150MHz – 200MHz clock frequency. However, the SMA FPGA implementation (SMA-F) is approximately 10X slower than the GPU. Several factors contribute to the slowdown. First, the NVIDIA C2050 used in this test has 448 CUDA cores, which is nearly twice as many as the 240 CUDA cores of the NVIDIA GTX 275 used in the previous chapter. Also, SMA-F is running at

a 100MHz clock frequency, which is much slower than the clock frequency used for the RBM FPGA implementation. In addition, many of the element-wise computations were pipelined in the hardware RBM, effectively eliminating a large portion of the memory I/O. Pipelining allowed distinct operations to be computed in parallel. For example, the positive phase computation H = 1/(1 + exp(−W · V − bias)) is pipelined in such a way that the matrix multiplication, bias addition, and sigmoid are all executed in parallel, resulting in a throughput of one hidden neuron per cycle. On the other hand, SMA-F implements this phase in six different matrix operations15 (a small numeric sketch of this sequence follows the list):

1. matrix multiplication
2. bias addition
3. matrix scale operation (scale by -log2(e))
4. exponential base-2 operation (e^(−x) = 2^(−log2(e)·x))
5. matrix addition by one
6. RCP (inverse) operation.
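The sketch below walks the same six steps on a single value in plain C, mirroring the SMA-F sequence element by element; the constant and function names are ours, and real SMA-F code would issue the corresponding matrix instructions instead of scalar math.

#include <math.h>
#include <stdio.h>

/* Six-step sigmoid, following the decomposition above for one element x = (W*V)_i. */
static float sigmoid_via_exp2(float wx, float bias) {
    const float LOG2E = 1.44269504f;     /* log2(e) */
    float t = wx + bias;                 /* steps 1-2: matrix multiply result, bias add */
    t = t * -LOG2E;                      /* step 3: scale by -log2(e)                   */
    t = exp2f(t);                        /* step 4: 2^(-log2(e)*t) equals e^(-t)        */
    t = t + 1.0f;                        /* step 5: add one                             */
    return 1.0f / t;                     /* step 6: reciprocal (RCP)                    */
}

int main(void) {
    float x = 0.7f, bias = -0.2f;
    printf("decomposed: %f  direct: %f\n",
           sigmoid_via_exp2(x, bias),
           1.0f / (1.0f + expf(-(x + bias))));
    return 0;
}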

There are also other minor differences, such as increased ALU latency due to the use of floating point, streaming data from DRAM at a reduced memory bandwidth, etc. We believe these differences add up to the 10X difference in the accelerator vs. GPU performance comparison. The performance test for Markov Clustering (MCL) was done using RMAT graphs, modified by adding self-loops to help with convergence. Since there are no dense matrices involved in the algorithm (although some of the sparse matrices may contain enough non-zeros to be considered dense), we use the CSparse library for the CPU evaluation; thus, all CPU tests are single-threaded. Figure 3.20(a) shows the SMA-A performance of running MCL in comparison to the single-threaded CPU performance. The x-axis displays the matrix sizes used; for example, MCL RMAT 8K indicates a graph with 8192 nodes. As can be seen in the figure, SMA-A yields a considerable speedup, ranging from 50X to 76X, over the CPU implementation.

15Compared to the RBM implementation, SMA has flexibility but loses some efficiency.

[Figure 3.20 appears here: (a) Markov Clustering speedup of SMA-A over a single-threaded CPU, (b) CPU execution time breakdown, (c) matrix operation speedup for SSMM and element-wise operations, and (d) SMA-A execution time breakdown (SSMM, DIV, SUM/MAX, ELEM-WISE) for MCL RMAT 2K, 4K, and 8K.]

Figure 3.20: SMA-A Markov Clustering Performance and Analysis

To better understand where the performance gains come from, Figure 3.20(b) shows the single-threaded CPU execution time breakdown. Approximately 90% of the execution time is dominated by matrix multiplications, and the remaining 10% is consumed by element-wise addition/multiplication operations. The performance gains of using the accelerator for these matrix operations are shown in Figure 3.20(c). The greatest speedups were seen in the SSMM operation, ranging from 59X to 124X. The huge performance difference between SMA-A and the CPU is mostly due to the denser sparse matrices. As noted in Section 3.1.1, the density of MCL matrices varies greatly, and the majority of the execution time is spent in computing the denser

matrices. For these types of matrices, SMA-A is capable of fully utilizing the floating point units via asynchronous threads with buffering support and work-stealing. The CSparse library, on the other hand, is highly optimized for very sparse matrices and computes in a sequential manner; thus, the CPU implementation of MCL is expected to suffer in performance on the denser matrices. Although huge performance improvements are seen in the SSMM operations, the other element-wise operations do not see as much runtime benefit. Thus, as shown in Figure 3.20(d), a considerable portion of the execution time is spent in these element-wise operations, affecting the overall MCL performance. Figure 3.20(d) also shows that the SSMM portion of the execution time increases with the matrix size. This can be explained by two different factors. First, a matrix multiplication has a runtime of O(n^3), while the element-wise operations have O(n^2) runtime complexity. Thus, the execution time of a matrix multiplication tends to increase faster than that of the other element-wise matrix operations, although the exact runtime complexity varies with the number and distribution of non-zero elements. Another reason is that the SSMM speedups decrease as the matrix size increases, as shown in Figure 3.20(c). As the MCL matrix size increases, the density of the densest matrix has been shown to decrease: MCL 2K has 54% non-zero density, MCL 4K has 41% density, and MCL 8K has 30% density. Since the CSparse library performs better on sparser matrices, the relative MCL speedup decreases. Like Markov Clustering, Betweenness Centrality performance was measured on the CPU and SMA-A using RMAT graphs. The RMAT graphs were generated exactly in the same manner as shown in [5]. Again, we use the CSparse library for the CPU evaluation due to the extensive use of the SSMM operation in the application. Figure 3.21(a) shows the SMA-A performance compared to the single-threaded CPU performance. The x-axis represents the different matrix sizes used in the evaluation; for example, BC RMAT 13 denotes a graph with 2^13 nodes. The SMA-A achieves good performance compared to the single-threaded CPU, but not as impressive as the MCL results. To better analyze the performance bottleneck, Figure 3.21(b) breaks down the CPU execution time, and Figure 3.21(c) compares the accelerated performance between matrix operations. In contrast to MCL, a considerable amount of computation takes place within the element-wise matrix operations.

[Figure 3.21 appears here: (a) Betweenness Centrality speedup of SMA-A over a single-threaded CPU, (b) CPU execution time breakdown, (c) matrix operation speedup for SSMM and element-wise operations, and (d) SMA-A execution time breakdown (SSMM, DIV, ELEM-WISE, OTHERS) for BC RMAT-13, RMAT-14, and RMAT-15.]

Figure 3.21: SMA-A Betweenness Centrality Performance and Analysis

Thus, accelerating the element-wise matrix operations is likely to have a large impact on the overall BC performance. In fact, we can infer from Figure 3.21(a), (b), and (c) that most of the performance enhancement is due to the acceleration of these element-wise matrix operations. Figure 3.21(d) further supports this explanation, as it shows that the runtime percentage of SSMM in SMA-A increased to 90%, compared to 55% in the CPU evaluation. Similar to MCL, the sparseness of the BC fringe matrices dynamically changes per iteration. However, the SSMM operations in BC have very different characteristics

from MCL. The SSMM operations in MCL compute the square of the transition probability matrices, i.e., M^2. Thus, the computation time quickly increases as the transition probability matrices become denser. As seen earlier, the majority of the MCL execution time was spent computing matrix multiplications of the denser transition probability matrices. On the other hand, the fringe matrices in BC are always multiplied by the adjacency matrix, which is constant throughout the iterations and is very sparse; e.g., the “BC RMAT 15” matrix used in the evaluation has a non-zero density of 0.02%. As seen in Section 3.3.4 and Section 3.4.2, SMA-A suffers from the inefficiencies of the CSB format with small matrix block sizes when the matrix is very sparse. To overcome this limitation, SMA-A must either support multiple sparse matrix formats or flexible matrix block sizes. Such improvements are subject to future work.
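A quick calculation shows why such densities hurt the fixed-size blocking. The sketch below assumes, purely for illustration, 64x64 CSB blocks; the block dimension is not a quoted SMA-F parameter.

#include <stdio.h>

int main(void) {
    const double block_elems = 64.0 * 64.0;
    const double densities[] = {0.30, 0.02};   /* percent: an MCL-like matrix vs. BC RMAT 15 */

    for (int i = 0; i < 2; i++) {
        double nnz_per_block = block_elems * densities[i] / 100.0;
        printf("density %.2f%% -> about %.1f non-zeros per 64x64 block\n",
               densities[i], nnz_per_block);
    }
    return 0;
}

At 0.02% density the average block holds less than one non-zero, so almost every block-pointer read is pure overhead, which is the behavior described above.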

3.4.4 Summary

The significance of sparse matrices has been recognized in many fields, and the number of sparse matrix applications are growing. However, conventional hardware cannot easily accelerate sparse matrix operations with their SIMD computational resources. We propose a novel sparse matrix accelerator, which focuses on enhancing sparse matrix operations that do not match SIMD-style computation. Restricting hardware acceleration only to specific matrix operations was decided as a compromise between flexibility and efficiency. We considered three applications in our design: Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. Our initial analysis revealed that these applications spend majority of the execution time in matrix multiplication; however, other matrix operations cannot be neglected as they would quickly become the new bottleneck if not accelerated. In addition, the accelerator must tolerate workload imbalance observed in many of the matrices used in these applications. The sparse matrix accelerator architecture consists of a decode unit, an encode unit, a large array of thread blocks. The decode unit streams in matrix data and decodes it to a unified format before forwarding it to the thread blocks. Each thread block consists of multiple threads, where each thread consists of multiple floating CHAPTER 3. SPARSE MATRIX ACCELERATOR 95

point ALUs. Threads execute asynchronously with adequate input buffering, which reduces the chance of underutilizing the floating point resources seen in lock-step SIMD computations. In addition, threads support fine-grain work-stealing in order to tolerate the variances in workloads. The encode unit writes the computation result into memory in sparse or dense matrix format, depending on the user configuration. Since the applications we focused on require sparse matrix - sparse matrix multipli- cation results to be written in a sparse matrix format, we designed the sparsification modules in the encode unit to compute the offsets of the non-zero elements for each row in parallel, despite the ordering dependencies between and within rows. A prototype of the sparse matrix accelerator was implemented on an FPGA board, to validate the design and evaluate the potential efficiency of an ASIC implementa- tion. The development board used for the FPGA prototype (SMA-F) features an Altera Stratix IV FPGA with DDR2 DRAM and PCI Express interface. The Al- tera Nios II processor interfaces with the SMA-F accelerator core via the custom instruction interface, and was used to execute sparse matrix applications using SMA- F instructions. By measuring the SMA-F runtime, we demonstrated that the estimated perfor- mance of SMA-A, the ASIC version of the sparse matrix accelerator, would be compa- rable to GPUs on dense matrix operations, and considerably faster than conventional hardware on sparse matrix operations. The dense matrix multiplication performance of the SMA-A reaches near the peak floating point performance (512 GFLOPS) and shows a consistent 27X speedup over a high-end single-threaded CPU performance. The SMA-A speedup on sparse matrix operations varies greatly, ranging 1.2X – 120X, depending on the distribution of non-zero elements. In general, sparse matrices with sufficient number of non-zero elements per fixed-sized matrix block exhibited greater performance gains than sparse matrices with very few non-zero elements. SMA-A also delivered large performance improvements in the three targeted sparse matrix applications. The SMA-A was able to run the Sparse Restricted Boltzmann Machine 26 times faster than a single-threaded CPU on smaller training images, while the speedup was around 10X for larger training images, assuming the receptive field size remains fixed across different image sizes. Markov Clustering runs 50 to CHAPTER 3. SPARSE MATRIX ACCELERATOR 96

76 times faster using SMA-A, greatly benefiting from the acceleration hardware for sparse matrix multiplication. The speedup for Betweenness Centrality ranges from 3.1X to 5.7X, where most of the performance gains come from accelerating the element-wise operations.

Chapter 4

Concluding Remarks and Future Directions

We are now in an era where multi-core microprocessors are the mainstream. Multi- core processors are found everywhere, from energy-stingy mobile platforms to high- end computing clusters. At the same time, we are also entering an era in which more application-specific accelerators are being integrated into microprocessors. Ap- plication specific hardware has proven to be efficient both in terms of energy and performance, and is a strong candidate solution for overcoming the emerging dark silicon apocalypse. This thesis investigates designing custom hardware to accelerate emerging ap- plications that are mostly composed of matrix operations. Although conventional SIMD processors perform well on matrix operations using dense and wide matrices, the SIMD resource utilization may significantly drop on other types of matrices. We observed less floating point throughput on matrix operations using matrices with narrow widths, and found it challenging to efficiently map irregular sparse matrix operations onto SIMD computations. By designing a custom accelerator specifically tailored to these types of matrix operations, we were able to see significant perfor- mance improvements.


4.1 Contributions of Research

The Restricted Boltzmann Machine (RBM), a useful tool for training Deep Belief Net- works (DBN), was first studied with the objective of overcoming the computational barrier in increasing the number of neurons. We designed a Restricted Boltzmann Machine accelerator using a modular approach to provide good performance and scalability. We demonstrated that the RBM accelerator, implemented on FPGAs, performs well with both narrow and wide matrices, enabling the algorithm to con- verge faster with fewer iterations. Using a ring interconnect, the multi-FPGA RBM system was shown to scale linearly with the number of FPGAs, which allows training of very large RBMs. The key contributions of this research are:

• a modular RBM design that scales across multiple silicon technology generations • a novel solution to efficiently perform matrix multiplication with or without transpose operation • inter-chip communication mechanism and partitioning of RBM computation that allows linear scalability • understanding the trade-offs between DRAM bandwidth, I/O bandwidth, and on-chip storage for multi-chip RBMs • supporting sparse RBMs with locally dense connections

Sparse matrix operations are an important component of the computation in many scientific and information management domains. However, due to irregular data patterns observed in sparse matrices, conventional SIMD hardware does not accelerate a number of sparse matrix operations. We studied three sparse matrix applications that conventional SIMD hardware cannot easily accelerate: the Sparse Restricted Boltzmann Machine, Betweenness Centrality, and Markov Clustering. Based on our observations, we devised a sparse matrix accelerator architecture which focuses on specific matrix operations. The sparse matrix accelerator was implemented on an FPGA board as a prototype to evaluate the feasibility and performance by running the entire application using realistic data. The key contributions of this research are: CHAPTER 4. CONCLUDING REMARKS AND FUTURE DIRECTIONS 99

• demonstrating the limitations of SIMD architectures using case study applica- tions • understanding how restricting acceleration to a specific set of matrix operations can balance flexibility and efficiency • a novel asynchronous computational thread model that tolerates workload im- balance via buffering and work-stealing • a complete RTL implementation of sparse matrix accelerator in FPGA to demon- strate the feasibility and performance potential of our accelerator design

4.2 Future Work

4.2.1 Addressing the Limitations of the FPGA Implementation Platforms

Our accelerator architectures delivered considerable performance improvements for their targeted applications, fulfilling our objective to demonstrate the practicability and efficiency of using custom accelerators for matrix-oriented applications. However, there were limitations due to FPGA board restrictions and time constraints which prevented a more comprehensive performance evaluation. In the case of the multi-FPGA RBM implementation, communication between the FPGA boards was the main issue in testing with larger RBM networks. Although the I/O ports of the FPGA had a theoretical bandwidth of 17.1Gbps per direction, we were only able to utilize 4.8Gbps of the bandwidth. Since the I/O and DRAM bandwidth requirements are in a trade-off relationship, the proposed DRAM streaming mechanism could not be tested due to its huge DRAM memory requirements, which severely limited the RBM network size so that it would fit in the on-chip memory. Since debugging the inter-FPGA communication performance is time-consuming, future work includes finding an efficient way to obtain better I/O bandwidth. For the Sparse Matrix Accelerator FPGA (SMA-F) implementation, cramming a large amount of logic into limited programmable resources took the majority of the development time, as it required many iterations of modifying the RTL code, synthesizing, and debugging.

[Figure 4.1 appears here: panels (a), (b), and (c); labels include Hidden Group 1 and Hidden Group 2.]

Figure 4.1: Example of 2-D Sparse RBM Configurations

This may have negated our original intention of using a single FPGA to save design time by avoiding I/O communication. Because the FPGA was effectively full (80% resource utilization) with few opportunities to optimize resource usage further, the only way to expand our design is by using multiple FPGAs. Future work will involve deciding how to effectively partition the design. In addition, the small fixed-size CSB blocks were a major hindrance to getting good performance. Future work includes allowing CSB block sizes to change dynamically depending on the matrix size and type. Since adding the flexibility to adjust CSB block sizes may significantly increase resource usage, a stable multi-FPGA environment must be in place to pursue this future work.

4.2.2 Future Directions in Accelerator Research

In Chapter 2, we presented a simple, novel way of implementing Restricted Boltzmann Machines with sparse connections that are locally dense. However, this approach only captures locality in one dimension, i.e. neighboring nodes are defined by distance from each node in a one dimensional space. Figure 4.1(a) illustrates an RBM where locality is defined in 2-D space, which may provide a better fit for applications training on CHAPTER 4. CONCLUDING REMARKS AND FUTURE DIRECTIONS 101

images [39]. In order to support locality beyond 1-D, a ring interconnect structure may no longer be as efficient in scaling to a very large network due to the long latency of traversing the ring. Figure 4.1(b) adds bypass links to the ring interconnect to overcome such communication inefficiency. This approach, however, requires two bi-directional links to be added to each FPGA for each additional dimension, which would result in a significant amount of physical wiring. Figure 4.1(c) suggests an alternative hierarchical approach, which uses high-speed routers and cables to share connections between clusters of nodes. In this approach, the bandwidth of the high-speed cables and the throughput of the routers are the limiting factors to scalability. Future research may include finding the optimal interconnect configuration for sparse RBMs with multi-dimensional locality.

In Chapter 3, our sparse matrix accelerator was shown to deliver large performance gains in the three targeted applications. Future research may explore applications using different types of matrix operations to investigate the possibility of extending the accelerator to support a wider range of applications. The key challenge is identifying the core matrix computations and designing an architecture that can efficiently accelerate all time-consuming matrix computations. Certain applications may consist of matrix operations that have considerably different computation patterns and require two or more distinct accelerators to get a reasonable speedup. In such cases, finding the minimum set of accelerators required is another potential research topic. Within the current SMA architecture, future research may also explore supporting multiple sparse matrix formats and comparing the performance of each format for different matrix applications. Another possible way of enhancing the current SMA architecture is to investigate the use of advanced architectural techniques, such as chaining of matrix operations to reduce the number of memory requests.

As can be seen, there are many open questions in the hardware acceleration domain for matrix-oriented applications, with substantial room for improvement. As specialized accelerators become increasingly important in microprocessors, we expect that more research will be actively conducted in this area. We hope that this thesis will serve as a baseline reference for future research in designing and analyzing hardware accelerators for matrix applications.

Appendix A

Pseudo-code of Sparse Matrix Applications Used for SMA

This section lists the pseudo-codes of the applications investigated in Chapter 3. The algorithms are expressed in a MATLAB-like syntax.

Markov Clustering

The following Markov Clustering pseudo-code is based on the implementation found in [13], with a few modifications.

% Initialization
A = A + eye                          % Add self-loops
M = A ./ repmat(sum(A),N,1)          % normalize A

until M converges
    M = M * M                        % expand
    M = M .^ power                   % inflate
    M = M ./ repmat(sum(M),N,1)      % normalize
    chaos = max(max(M) - sum(M .^ 2))        % determine convergence
    M = M .* (M > prunethreshold)            % prune for sparseness

Figure A.1: Pseudo-code for Markov Clustering


Sparse Restricted Boltzmann Machine

This pseudo-code was based on the MATLAB code from Hinton et al. [28], and modified according to [48] for sparse weights.

% Initialize receptive field matrix
F = zeros(numvis,numhid)
for i = 1 to numhid
    vrange = randomly choose fixed size receptive field
    F[vrange] = 1.0

% Initialize weight matrix and biases
W = F .* 0.1 .* randn(numvis,numhid)
visbias = zeros(1,numvis)
hidbias = zeros(1,numhid)

% Main Loop
until convergence
    foreach batch in dataset
        % Positive Phase
        PV = batch                                  % visible neurons
        PH = 1 ./ (1 + exp( -W*PV - hidbias))       % hidden neurons
        PP = PV' * PH                               % products of PV and PH
        posvisact = sum(PV)
        poshidact = sum(PH)

        % Negative Phase
        NV = 1 ./ (1 + exp( -W'*PH - visbias))      % visible neurons
        NH = 1 ./ (1 + exp( -W*NV - hidbias))       % hidden neurons
        NP = NV' * NH                               % products of NV and NH
        negvisact = sum(NV)
        neghidact = sum(NH)

        % Weight & Bias Update
        W = W + F .* (epsilon/numcases) .* (PP - NP)
        visbias = visbias + (epsilon/numcases) .* (posvisact - negvisact)
        hidbias = hidbias + (epsilon/numcases) .* (poshidact - neghidact)

Figure A.2: Pseudo-code for Sparse Restricted Boltzmann Machine

Betweenness Centrality

The following betweenness centrality pseudo-code is a simplified version of the MATLAB implementation of Brandes' algorithm [11] in the HPC Scalable Graph Benchmark [5], which is based on the parallel algorithm devised by Bader et al. [6].

% Initialization
A = logical(adjmatrix)          % initialize unweighted adjacency matrix
bc = zeros(1,N)                 % initialize betweenness centrality

% Main Loop
foreach batch in A
    % Initialize Variables & Matrices
    depth = 0                   % depth: iterator for each BFS step
    % nsp(i,j) : number of shortest paths betw. node i and j
    nsp = [zeros;ones(batchSz);zeros]   % set 1 to nsp of root vertices
    % fringe(i,j): NSP discovered from node i to j at current BFS depth
    fringe = batch              % initially neighbors of root vertices
    % bfs(depth)(i,j): 1 if node j is found at current BFS depth from node i
    bfs = []

    % Breadth-First Search
    while nnz(fringe) > 0
        bfs(depth++) = logical(fringe)      % save found vertices
        nsp += fringe                       % save NSP for found vertices
        fringe = (fringe * A) .* not(nsp)   % BFS: add NSPs of neighbors of fringe

    % Pre-compute nspinv and bcu
    nspinv = 1 ./ nsp for non-zero elements only
    bcu = ones(batchSize,N)                 % bc update

    % Reverse Breadth-First Search
    for depth = depth:-1:2
        w = bfs(depth) .* nspinv .* bcu             % weights to be applied to predecessors
        bcu += (w * A') .* bfs(depth-1) .* nsp      % RBFS: apply weights reverse dir.

    % Update BC for current batch
    bc += sum(bcu)

Figure A.3: Pseudo-code for Betweenness Centrality

Appendix B

ISA for Sparse Matrix Accelerator FPGA

B.1 Overview of Instruction Set Architecture for SMA-F

The Sparse Matrix Accelerator FPGA (SMA-F) uses Altera Nios II processors [1] for executing applications, as described in Section 3.3. The Nios II/f core was selected to include and support a variety of features, such as instruction and data caches, hardware integer multiply, divide, and barrel shifter units, and dynamic branch prediction. Each Nios II processor takes up approximately 1% of the FPGA logic resources, 0.8% of the DSP resources, and 2% of the M9K embedded memory blocks. The Nios II processors in SMA-F are not instantiated with the Memory Management Unit (MMU) or the Memory Protection Unit (MPU) enabled. Thus, each Nios II processor has a 31-bit address space [1]. To overcome the address space limitation and fully cover both memory-mapped I/O and 2GB of DRAM memory, SMA-F uses two Nios II processors. The Nios II processors are mainly used to control and execute the matrix accelerator module, but can also be used for non-critical general purpose computing. Some short data parallel operations, such as vector inner-products, are implemented in software and exploit the dual-core feature of SMA-F.


    bits 31-27: A | 26-22: B | 21-17: C | 16: ra | 15: rb | 14: rc | 13-6: N | 5-0: opcode (CUSTOM)
Figure B.1: Nios II Custom Instruction Format

These operations are implemented as a Nios II software library and are mainly composed of vector operations that take a substantially smaller amount of time compared to the matrix operations supported by the SMA-F hardware.
SMA-F utilizes the Nios II custom instruction interface [3] for the accelerator instructions. Figure B.1 illustrates the Nios II custom instruction format. The opcode of custom instructions is always fixed to 0x32. Instruction fields A, B, and C are 5-bit register indices, and the ra, rb, and rc bit fields indicate whether the Nios II general purpose registers are used (as opposed to internal custom registers) for this instruction. The 8-bit N field is user-defined and is commonly used as a sub-opcode for multi-function custom logic. SMA-F uses the 8-bit N field to identify the requested operation. Section B.3 lists the supported accelerator instructions and their associated N field. When ra, rb, or rc is not set, the corresponding A, B, or C field represents the matrix descriptor index to be used in the matrix operation. SMA-F supports 32 hardware matrix descriptors, which are used to hold essential matrix information such as the matrix dimensions and format. More detail on the matrix descriptor can be found in Section B.2.
The accelerator instructions take a variable, non-deterministic number of cycles to execute and retire. Some instructions are blocking, in which case the Nios II processor is stalled until the instruction completes. Such instructions are mostly scalar instructions used for controlling the accelerator and for simple floating point arithmetic. However, the majority of the instructions are non-blocking, and the Nios II processor recovers execution flow control almost immediately after issuing the instruction. These instructions are mainly matrix operations that take more than a few hundred cycles to execute. The non-blocking property of these instructions is realized by using a separate accelerator input queue for asynchronous execution. This allows the Nios II processors to perform useful work while the SMA-F accelerator logic is executing its queued matrix operations. The STATUS register is updated

when a new instruction is inserted or when the active instruction retires. The Nios II software uses the STATUS register to resolve dependencies by manually waiting upon potentially conflicting instructions to complete and retire. Section B.4.2 describes the STATUS register in more detail.
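For illustration only, the bit packing in Figure B.1 can be written out as a small C helper. This is not how applications emit custom instructions in practice (the toolchain does that); the helper is ours, and it assumes the standard Nios II custom instruction convention that A and B are the source indices and C the destination.

#include <stdint.h>
#include <stdio.h>

/* Pack the fields of Figure B.1 into a 32-bit custom instruction word. */
static uint32_t encode_custom(unsigned a, unsigned b, unsigned c,
                              unsigned ra, unsigned rb, unsigned rc, unsigned n) {
    return ((a  & 0x1Fu) << 27) | ((b  & 0x1Fu) << 22) | ((c  & 0x1Fu) << 17) |
           ((ra & 0x1u)  << 16) | ((rb & 0x1u)  << 15) | ((rc & 0x1u)  << 14) |
           ((n  & 0xFFu) << 6)  | 0x32u;   /* fixed CUSTOM opcode */
}

int main(void) {
    /* Example: EADD(NN), N = 0x54, with ra/rb/rc = 0 so that A, B, C select
       matrix descriptors 0 and 1 as sources and descriptor 2 as destination. */
    printf("0x%08x\n", encode_custom(0, 1, 2, 0, 0, 0, 0x54));
    return 0;
}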

B.2 Matrix Descriptor Format

SMA-F stores important matrix information in a data structure called matrix de- scriptor. Table B.1 illustrates the fields of a matrix descriptor in detail. Each matrix descriptor has five fields: P, M, N, D, W. Each field is 32-bit wide, and stores some piece of information that is needed to correctly interpret, process, or encode the ma- trix. SMA-F supports up to 32 hardware matrix descriptors that can be accessed using the register indices embedded in the Nios II custom instruction. If an applica- tion requires more than 32 matrix descriptors, some will be required to overflow to memory using techniques similar to register spilling in modern compilers. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 108

Field Description P The P field defines the matrix properties. • P[0]: sparse bit. 1 indicates sparse format, 0 indicates dense format. • P[1]: vector bit. 1 indicates vector, 0 indicates matrix. • P[3]: column major bit. 1 indicates column major, 0 indicates row major. • P[4]: block column major bit. 1 indicates block column major, 0 indicates block row major. • P[5]: sparse INF bit. 1 indicates sparse format omits INF, 0 indicates sparse format omits zeros. • P[7]: random bit. 1 indicates random matrix, 0 indicates . • P[8]: work-steal bit. 1 enables work-stealing, 0 disables work-stealing. Used by src0 matrix only in accumulative operations. • P[22:16]: sparse exponent. The threshold exponent value for sparsifying matrix. Used by the destination matrix only.

M: The M field defines the size of the first dimension of the matrix. This field must be a multiple of 64.
N: The N field defines the size of the second dimension of the matrix. This field must be 1 for vectors, and a multiple of 64 otherwise.
D: The D field stores the address of the matrix data. The address must be aligned to 64 bytes.
W: The W field stores the address of the matrix block pointer array. This field is not used for dense matrices. The address must be aligned to 64 bytes.

Table B.1: Matrix Descriptor Fields and Format
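As a concrete illustration of Table B.1, a host-side program might mirror a descriptor with a plain struct like the sketch below. The struct, its field names, and the helper are ours for exposition; the hardware descriptors themselves live in the accelerator and are written through the MOVRx instructions.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t p;   /* property bits: [0] sparse, [1] vector, [3] column major,
                     [4] block column major, [5] sparse INF, [7] random,
                     [8] work-steal, [22:16] sparse exponent                 */
    uint32_t m;   /* first dimension, multiple of 64                          */
    uint32_t n;   /* second dimension, 1 for vectors, else multiple of 64     */
    uint32_t d;   /* address of the matrix data, 64-byte aligned              */
    uint32_t w;   /* address of the matrix block pointer array (sparse only)  */
} sma_matrix_desc;

/* Example: a 4096x4096 sparse, row-major matrix with work-stealing enabled. */
static sma_matrix_desc example_desc(uint32_t data_addr, uint32_t blkptr_addr) {
    sma_matrix_desc md = {0, 0, 0, 0, 0};
    md.p = (1u << 0) | (1u << 8);   /* sparse + work-steal */
    md.m = 4096;
    md.n = 4096;
    md.d = data_addr;               /* must be 64-byte aligned */
    md.w = blkptr_addr;             /* must be 64-byte aligned */
    return md;
}

int main(void) {
    sma_matrix_desc md = example_desc(0x10000000u, 0x20000000u);
    printf("P=0x%08x M=%u N=%u\n", md.p, md.m, md.n);
    return 0;
}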

B.3 Detailed Instruction Format

As mentioned in Section B.1, SMA-F uses the 8-bit user-defined N field to identify which function the application is requesting. The N field is referred to as the opcode1 in this section. SMA-F accelerator instructions can be categorized into three groups: the scalar operations, matrix arithmetic operations, and matrix comparison operations. Table B.2 lists the scalar operations supported by the SMA-F. Scalar operations are mostly used to control the accelerator, and also include scalar support for simple half- and single-precision floating point operations. These instructions, with the exception of XCHG, have a fixed latency. The latency of an XCHG instruction depends on the execution flow of the other Nios II processor. All scalar operations are blocking

1Not to be confused with the Nios II opcode, which is fixed to 0x32 for custom instructions. The N field of the custom instruction can be seen as the opcode from the accelerator’s point of view.

instructions. Scalar instructions have the most significant two bits N[7:6] set to zero. When applicable, bit N[1] indicates the floating point precision of the source registers, and bit N[0] indicates the floating point precision of the destination register.

Table B.3 lists the matrix arithmetic operations supported by the SMA-F. These instructions perform basic matrix operations such as matrix multiplication. The format of each matrix is encoded in the matrix descriptor indicated by the corresponding register index of the instruction. All matrix arithmetic operations are non-blocking; thus, manual dependency control using synchronization instructions is required. Matrix arithmetic operations have the most significant two bits N[7:6] set to 01. When applicable, bit N[1] indicates transpose of src0 and bit N[0] indicates transpose of src1.

Table B.4 lists the matrix comparison operations supported by the SMA-F. These instructions perform basic comparison operations such as matrix greater-than comparison. The format of each matrix is encoded in the matrix descriptor indicated by the corresponding register index of the instruction. All matrix comparison operations are non-blocking; thus, manual dependency control using synchronization instructions is required. Matrix comparison operations have the most significant two bits N[7:6] set to 10. When applicable, bit N[1] indicates transpose of src0 and bit N[0] indicates transpose of src1.
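The grouping by N[7:6] and the transpose flags in N[1:0] can be decoded with a few shifts, as in the small sketch below; the helper names are ours, and the example opcode 0x56 is EADD(TN) from Section B.4.3.

#include <stdint.h>
#include <stdio.h>

static const char *group_of(uint8_t n) {
    switch (n >> 6) {              /* N[7:6] selects the instruction group */
        case 0:  return "scalar";
        case 1:  return "matrix arithmetic";
        case 2:  return "matrix comparison";
        default: return "reserved";
    }
}

int main(void) {
    uint8_t eadd_tn = 0x56;        /* EADD(TN): src0 transposed, src1 not */
    printf("N=0x%02x: %s, transpose0=%u, transpose1=%u\n",
           eadd_tn, group_of(eadd_tn), (eadd_tn >> 1) & 1u, eadd_tn & 1u);
    return 0;
}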

Operation N[7:6] N[5:4] N[3:0] CSR 0000 00 XCHG 1000 MOVRP 0000 MOVRM 0010 MOVRN 01 0011 MOVRD 0100 MOVRW 0101 00 MOVHS 01 10 00 MOVSH 10 MULH 00 00 MULS 11 ADDH 00 11 01 ADDS 11 SUBH 00 10 SUBS 11

Table B.2: Scalar Operations Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 111

Operation N[7:6] N[5:4] N[3:0] MULT(NN) 00 MULT(NT) 01 00 00 MULT(TN) 10 MULT(TT) 11 EMULT(NN) 00 EMULT(NT) 01 00 EMULT(TN) 10 EMULT(TT) 11 EADD(NN) 00 EADD(NT) 01 01 01 EADD(TN) 10 EADD(TT) 11 ESUB(NN) 00 ESUB(NT) 01 01 10 ESUB(TN) 10 ESUB(TT) 11 SMULT(N) 00 00 SMULT(T) 10 SADD(N) 00 10 01 SADD(T) 10 SSUB(N) 00 10 SSUB(T) 10 LOG2(N) 00 00 LOG2(T) 10 EXP2(N) 00 01 EXP2(T) 11 10 RCP(N) 00 10 RCP(T) 10 RSQRT(N) 00 11 RSQRT(T) 10

Table B.3: Matrix Arithmetic and Math Function Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 112

Operation N[7:6] N[5:4] N[3:0] MAX(NN) 00 MAX(NT) 01 10 MAX(TN) 10 MAX(TT) 00 11 MIN(NN) 00 MIN(NT) 01 11 MIN(TN) 10 MIN(TT) 11 EGT(NN) 00 EGT(NT) 01 00 EGT(TN) 10 EGT(TT) 11 ELT(NN) 00 ELT(NT) 01 01 ELT(TN) 10 10 ELT(TT) 01 11 EMAX(NN) 00 EMAX(NT) 01 10 EMAX(TN) 10 EMAX(TT) 11 EMIN(NN) 00 EMIN(NT) 01 10 EMIN(TN) 10 EMIN(TT) 11 SGT(N) 00 00 SGT(T) 10 SLT(N) 00 01 SLT(T) 10 10 SMAX(N) 00 10 SMAX(T) 10 SMIN(N) 00 11 SMIN(T) 10

Table B.4: Matrix Comparison Opcode APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 113

B.4 List of Instructions

B.4.1 ADDH, ADDS - Floating Point Scalar Addition

Format:

Instruction   Opcode   Dst   Src0   Src1
ADDH          0x34     R     R      R
ADDS          0x37     R     R      R

Syntax:

ADDH ADDS

Description:

The ADDH performs scalar half-precision floating point addition of src0 and src1. ADDS performs scalar single-precision floating point addition of src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

ADDS internally converts the operands into half-precision floating point numbers, performs half-precision addition, and converts the result back to a single-precision floating point number.

B.4.2 CSR - Control Status Register

Format:

Instruction   Opcode   Dst   Src0   Src1
CSR           0x00     R     -      -

Syntax:

CSR APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 114

Description:

The CSR returns the current STATUS register of the accelerator Nios II instruction interface. The format of the STATUS register is shown below.

    bits 31-20: Unused | 19-16: PendingOp | 15-0: OPID

The PendingOp field represents the number of instructions pending in the instruction queue, and the OPID field is the identifier of the currently active instruction. These two fields together are used for synchronizing between dependent instructions. More specifically, the Nios II application waits for the OPID it depends on to complete (i.e., for the OPID to increment) before issuing a conflicting instruction. The PendingOp field can be used to determine whether the pipeline is completely flushed. The CSR currently has no access to the CONTROL register of the SMA-F’s Nios II instruction interface.
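A sketch of the software-side synchronization this enables is shown below. read_csr() here is a stand-in stub that simulates the STATUS word; in a real application it would issue the CSR custom instruction, and the helper names are ours.

#include <stdint.h>
#include <stdio.h>

/* Stub for the CSR custom instruction: simulates an accelerator whose OPID
 * advances and whose pending count drains as instructions retire. */
static uint32_t read_csr(void) {
    static uint32_t opid = 0, pending = 3;
    if (pending > 0) pending--;
    opid++;
    return ((pending & 0xFu) << 16) | (opid & 0xFFFFu);
}

static uint32_t status_opid(uint32_t s)    { return s & 0xFFFFu; }        /* bits [15:0]  */
static uint32_t status_pending(uint32_t s) { return (s >> 16) & 0xFu; }   /* bits [19:16] */

/* Wait until the instruction identified by 'opid' has retired (OPID has moved past it). */
static void wait_for_opid(uint16_t opid) {
    while ((uint16_t)status_opid(read_csr()) <= opid)
        ;
}

/* Wait until the accelerator pipeline is completely flushed. */
static void wait_for_idle(void) {
    while (status_pending(read_csr()) != 0)
        ;
}

int main(void) {
    wait_for_opid(5);
    wait_for_idle();
    printf("dependencies resolved\n");
    return 0;
}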

The PendingOp field represents the number of instructions pending in the instruc- tion queue, and the OPID field is the identifier of the current active instructions. These two fields together are used for synchronizing between dependent instructions. More specifically, Nios II application will await for the depending OPID to complete (i.e. the OPID to increment) before issuing a conflicting instruction. The PendingOp can be used to determine if the pipeline is completely flushed. The CSR currently has no access to the CONTROL register of the SMA-F’s Nios II instruction interface.

B.4.3 EADD - Element-wise Matrix Addition

Format:

Instruction   Opcode   Dst   Src0   Src1
EADD(NN)      0x54     C     C      C
EADD(NT)      0x55     C     C      C
EADD(TN)      0x56     C     C      C
EADD(TT)      0x57     C     C      C

Syntax:

EADD

Description:

The EADD performs element-wise matrix addition of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 115

fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EADD currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.4 EGT - Element-wise Greater-than Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
EGT(NN)       0x90     C     C      C
EGT(NT)       0x91     C     C      C
EGT(TN)       0x92     C     C      C
EGT(TT)       0x93     C     C      C

Syntax:

EGT

Description:

The EGT performs element-wise greater-than matrix comparison of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = 1.0,  if src0(i,j) > src1(i,j)
           0.0,  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 116

Restriction:

EGT currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.5 ELT - Element-wise Less-than Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
ELT(NN)       0x94     C     C      C
ELT(NT)       0x95     C     C      C
ELT(TN)       0x96     C     C      C
ELT(TT)       0x97     C     C      C

Syntax:

ELT

Description:

The ELT performs element-wise less-than matrix comparison of src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = 1.0,  if src0(i,j) < src1(i,j)
           0.0,  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

ELT currently cannot perform transpose on source matrices if both of them are sparse matrices. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 117

B.4.6 EMAX - Element-wise Matrix Maximum

Format:

Instruction   Opcode   Dst   Src0   Src1
EMAX(NN)      0x98     C     C      C
EMAX(NT)      0x99     C     C      C
EMAX(TN)      0x9A     C     C      C
EMAX(TT)      0x9B     C     C      C

Syntax:

EMAX

Description:

The EMAX performs element-wise maximum between src0 and src1, and stores the result in dst, where the src0, src1, and dst field are the indices of the corresponding matrix descriptors. The value of dst is determined as the following:

dst(i,j) = src0(i,j),  if src0(i,j) > src1(i,j)
           src1(i,j),  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMAX currently cannot perform transpose on source matrices if both of them are sparse matrices. APPENDIX B. ISA FOR SPARSE MATRIX ACCELERATOR FPGA 118

B.4.7 EMIN - Element-wise Matrix Minimum

Format:

Instruction   Opcode   Dst   Src0   Src1
EMIN(NN)      0x9C     C     C      C
EMIN(NT)      0x9D     C     C      C
EMIN(TN)      0x9E     C     C      C
EMIN(TT)      0x9F     C     C      C

Syntax:

EMIN

Description:

The EMIN performs element-wise minimum between src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. The value of dst is determined as follows:

dst(i,j) = src0(i,j),  if src0(i,j) < src1(i,j)
           src1(i,j),  otherwise

Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMIN currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.8 EMULT - Element-wise Matrix Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
EMULT(NN)     0x50     C     C      C
EMULT(NT)     0x51     C     C      C
EMULT(TN)     0x52     C     C      C
EMULT(TT)     0x53     C     C      C

Syntax:

EMULT

Description:

The EMULT performs element-wise matrix multiplication of src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

EMULT currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.9 ESUB - Element-wise Matrix Subtraction

Format:

Instruction   Opcode   Dst   Src0   Src1
ESUB(NN)      0x58     C     C      C
ESUB(NT)      0x59     C     C      C
ESUB(TN)      0x5A     C     C      C
ESUB(TT)      0x5B     C     C      C

Syntax:

ESUB

Description:

The ESUB performs element-wise matrix subtraction between src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.

Restriction:

ESUB currently cannot perform transpose on source matrices if both of them are sparse matrices.

B.4.10 EXP2 - Element-wise Exponential Base 2

Format:

Instruction   Opcode   Dst   Src0   Src1
EXP2(N)       0x74     C     C      -
EXP2(T)       0x76     C     C      -

Syntax:

EXP2

Description:

The EXP2 performs element-wise exponential base 2 on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. EXP2 internally uses an iterative approximation algorithm developed and implemented by Frank Liu.
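The exact iterative algorithm is not reproduced here; the sketch below only illustrates one common way to approximate 2^x in a hardware-friendly form, by splitting the input into integer and fractional parts, evaluating a short polynomial on the fraction, and rescaling. The coefficients and the name exp2_approx are assumptions, not the accelerator's implementation.

#include <math.h>

/* Hedged sketch of a split-and-scale base-2 exponential: 2^x = 2^k * 2^f
 * with k = floor(x) and f in [0, 1). */
static float exp2_approx(float x)
{
    float k = floorf(x);
    float f = x - k;                                   /* f in [0, 1) */
    /* Truncated Taylor series for 2^f = e^(f ln 2):
     * coefficients are ln2, (ln2)^2/2, (ln2)^3/6. */
    float p = 1.0f + f * (0.6931f + f * (0.2402f + f * 0.0555f));
    return ldexpf(p, (int)k);                          /* p * 2^k */
}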

Restriction:

Currently, only dense matrices are supported for EXP2.

B.4.11 LOG2 - Element-wise Log Base 2

Format:

Instruction   Opcode   Dst   Src0   Src1
LOG2(N)       0x70     C     C      -
LOG2(T)       0x72     C     C      -

Syntax:

LOG2

Description:

The LOG2 performs element-wise log base 2 on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. LOG2 internally uses an approximation algorithm based on look-up tables, developed and implemented by Frank Liu.
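As a rough illustration of a table-driven log2 (not the accelerator's exact tables or indexing), the sketch below reads the integer part of the result directly from the exponent field of a normal, positive float and approximates the fractional part with a small look-up table indexed by the top mantissa bits. The table size and names are assumptions.

#include <stdint.h>
#include <string.h>
#include <math.h>

#define LOG2_LUT_BITS 6
static float log2_lut[1 << LOG2_LUT_BITS];

/* Call once at startup: tabulate log2 of the mantissa range [1, 2). */
static void log2_lut_init(void)
{
    for (int i = 0; i < (1 << LOG2_LUT_BITS); i++) {
        float mant = 1.0f + (float)i / (float)(1 << LOG2_LUT_BITS);
        log2_lut[i] = log2f(mant);
    }
}

/* Assumes x > 0 and normal; subnormals, zero, Inf, and NaN are not handled. */
static float log2_approx(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exp = (int)((bits >> 23) & 0xFF) - 127;        /* integer part */
    uint32_t idx = (bits >> (23 - LOG2_LUT_BITS)) & ((1u << LOG2_LUT_BITS) - 1);
    return (float)exp + log2_lut[idx];                 /* fractional part from LUT */
}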

Restriction:

Currently, only dense matrices are supported for LOG2.

B.4.12 MOVHS, MOVSH - Change Floating Point Precision

Format:

Instruction   Opcode   Dst   Src0   Src1
MOVHS         0x21     R     R      -
MOVSH         0x22     R     R      -

Syntax:

MOVHS MOVSH

Description:

The MOVHS converts a scalar half-precision floating point value src0 into a single-precision floating point value and stores it in dst. The MOVSH converts a scalar single-precision floating point value src0 into a half-precision floating point value and stores it in dst. Register values are read from and stored to the Nios II general purpose register file.
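The minimal sketch below shows the MOVHS-style widening conversion for normal numbers and zero, assuming IEEE 754 half precision (1 sign, 5 exponent, 10 mantissa bits) and single precision (1, 8, 23 bits). Subnormals, infinities, and NaNs are omitted for brevity; MOVSH is the inverse conversion with rounding of the narrowed mantissa.

#include <stdint.h>
#include <string.h>

/* Hedged sketch: widen a half-precision bit pattern to single precision.
 * Only normal numbers and signed zero are handled here. */
static float half_to_single(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) & 0x1;
    uint32_t exp  = (uint32_t)(h >> 10) & 0x1F;
    uint32_t mant = (uint32_t)h & 0x3FF;
    uint32_t bits;

    if (exp == 0 && mant == 0) {
        bits = sign << 31;                    /* signed zero */
    } else {
        bits = (sign << 31)
             | ((exp - 15 + 127) << 23)       /* re-bias the exponent */
             | (mant << 13);                  /* widen the mantissa */
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}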

B.4.13 MOVRx - Move Register

Format:

Instruction   Opcode   Dst   Src0   Src1
MOVRP         0x10     C     R      -
                       R     C      -
MOVRM         0x12     C     R      -
                       R     C      -
MOVRN         0x13     C     R      -
                       R     C      -
MOVRD         0x14     C     R      -
                       R     C      -
MOVRW         0x15     C     R      -
                       R     C      -

Syntax:

MOVRx

Description:

The MOVRx moves data from a Nios II register to a matrix descriptor field and vice versa. src0 indicates the index of the matrix descriptor, while the low-order bits of the opcode select the field.

• MOVRP moves data from/to the P field.

• MOVRM moves data from/to the M field.

• MOVRN moves data from/to the N field.

• MOVRD moves data from/to the D field.

• MOVRW moves data from/to the W field.

Refer to Section B.2 for more information on each field.
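A hedged host-side view of this instruction is sketched below: the descriptor is modeled as a plain struct with one member per field named in Section B.2, and the register-to-descriptor direction is a field-selected store. The field widths and names here are placeholders; only Section B.2 is authoritative.

#include <stdint.h>

enum movr_field { FIELD_P, FIELD_M, FIELD_N, FIELD_D, FIELD_W };

struct matrix_descriptor {
    uint32_t p, m, n, d, w;                   /* placeholder widths */
};

/* Model of the register-to-descriptor direction of MOVRx. */
static void movr_write(struct matrix_descriptor *desc,
                       enum movr_field field, uint32_t reg)
{
    switch (field) {
    case FIELD_P: desc->p = reg; break;
    case FIELD_M: desc->m = reg; break;
    case FIELD_N: desc->n = reg; break;
    case FIELD_D: desc->d = reg; break;
    case FIELD_W: desc->w = reg; break;
    }
}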

B.4.14 MULH, MULS - Floating Point Scalar Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
MULH          0x30     R     R      R
MULS          0x33     R     R      R

Syntax:

MULH MULS

Description:

The MULH performs scalar half-precision floating point multiplication of src0 and src1. MULS performs scalar single-precision floating point multiplication of src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

MULS internally converts the operands into half-precision floating point numbers, performs half-precision multiplication, and converts the result back to a single-precision floating point number.

B.4.15 MULT - Matrix Multiplication

Format:

Instruction   Opcode   Dst   Src0   Src1
MULT(NN)      0x40     C     C      C
MULT(NT)      0x41     C     C      C
MULT(TN)      0x42     C     C      C
MULT(TT)      0x43     C     C      C

Syntax:

MULT

Description:

The MULT performs matrix multiplication of src0 and src1, and stores the result in dst, where the src0, src1, and dst fields are the indices of the corresponding matrix descriptors. Matrices src0 and src1 can be transposed using transpose0 and transpose1 fields. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0, src1, and dst.
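For reference, a dense single-precision model of MULT with the four transpose variants is sketched below; it is a host-side check, not the accelerator's blocked, half-precision datapath, and the function name is illustrative.

#include <stddef.h>

/* dst (M x N) = op(src0) * op(src1), where op() optionally transposes a
 * source as selected by t0/t1 (the transpose0/transpose1 fields).  src0 is
 * M x K and src1 is K x N after transposition. */
static void matmul(float *dst, const float *src0, const float *src1,
                   size_t m, size_t n, size_t k, int t0, int t1)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t l = 0; l < k; l++) {
                float a = t0 ? src0[l * m + i] : src0[i * k + l];
                float b = t1 ? src1[j * k + l] : src1[l * n + j];
                acc += a * b;
            }
            dst[i * n + j] = acc;
        }
    }
}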

B.4.16 RCP - Element-wise Reciprocal

Format:

Instruction   Opcode   Dst   Src0   Src1
RCP(N)        0x78     C     C      -
RCP(T)        0x7A     C     C      -

Syntax:

RCP

Description:

The RCP performs the element-wise reciprocal function on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. The RCP implementation is based on the Newton-Raphson division algorithm. The average error rate of each reciprocal compared to the double-precision result is 0.15%, and the maximum error rate is 0.48%.
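A minimal sketch of the Newton-Raphson idea is shown below: start from a crude bit-pattern seed and refine with x <- x(2 - ax), which roughly doubles the number of correct bits per step. The seed constant and the iteration count are illustrative assumptions, not the accelerator's exact parameters.

#include <stdint.h>
#include <string.h>

/* Hedged sketch of a Newton-Raphson reciprocal.  Assumes a > 0 and normal. */
static float reciprocal_nr(float a)
{
    /* Crude seed: reflect the bit pattern to roughly negate the exponent
     * (constant is illustrative). */
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits = 0x7EEEEEEEu - bits;
    float x;
    memcpy(&x, &bits, sizeof x);

    x = x * (2.0f - a * x);                   /* refinement 1 */
    x = x * (2.0f - a * x);                   /* refinement 2 */
    x = x * (2.0f - a * x);                   /* refinement 3 */
    return x;
}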

Restriction:

Currently, only dense matrices are supported for RCP.

B.4.17 RSQRT - Element-wise Reverse Square Root

Format:

Instruction   Opcode   Dst   Src0   Src1
RSQRT(N)      0x7C     C     C      -
RSQRT(T)      0x7E     C     C      -

Syntax:

RSQRT

Description:

RSQRT performs the element-wise reverse square root function on src0, and stores the result in dst, where the src0 and dst fields are the indices of the corresponding matrix descriptors. Matrix src0 can be transposed using the transpose0 field. The RSQRT implementation is based on the Fast Inverse Square Root algorithm [19]. The average error rate of each inverse square root compared to the double-precision result is 0.14%, and the maximum error rate is 0.45%.
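For context, the cited fast inverse square root technique [19] can be sketched as follows: a magic constant applied to the bit pattern yields a first guess, and one Newton-Raphson step refines it. The accelerator's seed and iteration count may differ.

#include <stdint.h>
#include <string.h>

/* Classic fast inverse square root sketch.  Assumes x > 0. */
static float rsqrt_fast(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5F3759DFu - (bits >> 1);         /* initial guess */
    float y;
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);        /* one Newton-Raphson step */
    return y;
}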

Restriction:

Currently, only dense matrices are supported for RSQRT.

B.4.18 SGT, SLT, SMAX, SMIN - Matrix-Scalar Comparison

Format:

Instruction   Opcode   Dst   Src0   Src1
SGT(N)        0xA0     C     C      R
SGT(T)        0xA2     C     C      R
SLT(N)        0xA4     C     C      R
SLT(T)        0xA6     C     C      R
SMAX(N)       0xA8     C     C      R
SMAX(T)       0xAA     C     C      R
SMIN(N)       0xAC     C     C      R
SMIN(T)       0xAE     C     C      R

Syntax:

SGT SLT SMAX SMIN

Description:

SGT, SLT, SMAX, and SMIN compare the value of src1 with each element in src0, and store the result in dst, where src0 and dst are the indices of the corresponding matrix descriptors. src1 is a Nios II general purpose register, which holds a half-precision floating point value. For SGT, dst_{i,j} = 1.0 if src0_{i,j} > src1, and 0.0 otherwise. For SLT, dst_{i,j} = 1.0 if src0_{i,j} < src1, and 0.0 otherwise. For SMAX, dst_{i,j} = src0_{i,j} if src0_{i,j} > src1, and src1 otherwise. For SMIN, dst_{i,j} = src0_{i,j} if src0_{i,j} < src1, and src1 otherwise.
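A compact dense model of the scalar-broadcast semantics is sketched below for SMAX only; SGT, SLT, and SMIN differ only in the per-element expression. The function name is illustrative.

#include <stddef.h>

/* The scalar s (from a Nios II register) is broadcast against every element
 * of src0, here over a flat array of count elements. */
static void smax(float *dst, const float *src0, float s, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = (src0[i] > s) ? src0[i] : s;
}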

Restriction:

SGT, SLT, SMAX, and SMIN currently cannot perform transpose on the source matrix if both src0 and dst are sparse matrices.

B.4.19 SMULT, SADD, SSUB - Matrix-Scalar Arithmetic

Format:

Instruction   Opcode   Dst   Src0   Src1
SMULT(N)      0x60     C     C      R
SMULT(T)      0x62     C     C      R
SADD(N)       0x64     C     C      R
SADD(T)       0x66     C     C      R
SSUB(N)       0x68     C     C      R
SSUB(T)       0x6A     C     C      R

Syntax:

SMULT SADD SSUB

Description:

SMULT multiplies the value of src1 with each element in src0, and stores the result in dst, where src0 and dst are the indices of the corresponding matrix descriptors. src1 is a Nios II general purpose register, which holds a half-precision floating point value. SADD adds the value of src1 to each element in src0, and stores the result in dst. SSUB subtracts the value of src1 from each element in src0, and stores the result in dst. Matrix src0 can be transposed using the transpose0 field. The matrix format (e.g. sparse/dense) of each matrix is determined by the matrix descriptors of src0 and dst.

Restriction:

SMULT, SADD, and SSUB currently cannot perform transpose on the source matrix if both src0 and dst are sparse matrices.

B.4.20 SUBH, SUBS - Floating Point Scalar Subtraction

Format:

Instruction   Opcode   Dst   Src0   Src1
SUBH          0x3C     R     R      R
SUBS          0x3F     R     R      R

Syntax:

SUBH SUBS

Description:

SUBH performs scalar half-precision floating point subtraction between src0 and src1. SUBS performs scalar single-precision floating point subtraction between src0 and src1. Register values are read from and stored to the Nios II general purpose register file.

Restriction:

SUBS internally converts the operands into half-precision floating point numbers, performs half-precision subtraction, and converts the result back to a single-precision floating point number.

B.4.21 XCHG - Exchange

Format:

Instruction   Opcode   Dst   Src0   Src1
XCHG          0x08     R     R      -

Syntax:

XCHG

Description:

The XCHG sends the value of src0 to the other Nios II processor and stores the value received from the other Nios II processor in dst. This instruction blocks until the other Nios II processor has also executed the XCHG instruction. This instruction is the primary means of communication and synchronization between the two Nios II processors in SMA-F.
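The sketch below shows the synchronization pattern only; sma_xchg() is a hypothetical host-side wrapper, not a real API, since the actual accessor is a Nios II custom-instruction macro generated by the tool flow.

#include <stdint.h>

/* Assumed wrapper around the XCHG custom instruction: both processors call
 * it, each blocks until the peer arrives, and the two values cross. */
extern uint32_t sma_xchg(uint32_t value);     /* hypothetical name */

/* Example: both cores agree on the larger of two locally computed row counts. */
static uint32_t agree_on_rows(uint32_t my_rows)
{
    uint32_t other_rows = sma_xchg(my_rows);  /* blocks until the peer's XCHG */
    return (my_rows > other_rows) ? my_rows : other_rows;
}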

Bibliography

[1] Altera. Nios II Processor Reference Handbook, 2011.

[2] Altera Corporation. High Speed Mezzanine Card (HSMC) Specification, June 2009.

[3] Altera Corporation. Nios II Custom Instruction User Guide, January 2011.

[4] H. Amin, K.M. Curtis, and B.R. Hayes-Gill. Piecewise linear approximation applied to nonlinear function of a neural network. Circuits, Devices and Systems, IEE Proceedings -, 144(6):313 –317, dec 1997.

[5] David A. Bader, John Feo, John Gilbert, Jeremy Kepner, David Koester, Eugene Loh, Kamesh Madduri, Bill Mann, Theresa Meuse, and Eric Robinson. HPC scalable graph analysis benchmark v1.0, 2009.

[6] David A. Bader and Kamesh Madduri. Parallel algorithms for evaluating centrality indices in real-world networks. In Proceedings of the 2006 International Conference on Parallel Processing, ICPP ’06, pages 539–550, Washington, DC, USA, 2006. IEEE Computer Society.

[7] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, December 2008.

[8] Nathan Bell and Michael Garland. Cusp: Generic parallel algorithms for sparse matrix and graph computations, 2012. Version 0.3.0.


[9] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA, 2007.

[10] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, May 2011.

[11] Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.

[12] Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, SPAA ’09, pages 233–244, New York, NY, USA, 2009. ACM.

[13] Aydin Buluç and John R. Gilbert. The combinatorial BLAS: design, implementation, and applications. International Journal of High Performance Computing Applications, 25(4):496–509, 2011.

[14] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-mat: A recursive model for graph mining. In SDM, 2004.

[15] C.E. Cox and W.E. Blanz. Ganglion - a fast field-programmable gate array implementation of a connectionist classifier. Solid-State Circuits, IEEE Journal of, 27(3):288–299, Mar 1992.

[16] Timothy A. Davis. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2006.

[17] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, December 2011.

[18] Michael deLorimier and André DeHon. Floating-point sparse matrix-vector multiply for fpgas. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, pages 75–85, New York, NY, USA, 2005. ACM.

[19] David Eberly. Fast inverse square root (revisited), 2010.

[20] H. Elgindy and Yen-Liang Shue. On sparse matrix-vector multiplication with fpga-based system. In Field-Programmable Custom Computing Machines, 2002. Proceedings. 10th Annual IEEE Symposium on, pages 273 – 274, 2002.

[21] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceeding of the 38th annual international symposium on Computer architecture, pages 365–376. ACM, 2011.

[22] Linton C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.

[23] Kazushige Goto and Robert Van De Geijn. High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35(1):1–14, 2008.

[24] Hans Peter Graf, Srihari Cadambi, Igor Durdanovic, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, and Srimat Chakradhar. A Massively Parallel Digital Learning Processor. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 529–536. 2009.

[25] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B.C. Lee, S. Richardson, C. Kozyrakis, and M. Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, volume 38. ACM Press, 2010.

[26] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA ’10, pages 37–47, New York, NY, USA, 2010. ACM.

[27] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[28] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural Comput., 18(7):1527–1554, July 2006.

[29] J.L. Holt and T.E. Baker. Back propagation simulations using limited precision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume ii, pages 121 –126 vol.2, jul 1991.

[30] J. Impagliazzo, M. Campbell-Kelly, G. Davies, and J.A.N. Lee. History in the computing curriculum. Annals of the History of Computing, IEEE, 21(1):4 –16, jan-mar 1999.

[31] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49(4/5):589–604, July 2005.

[32] Srinidhi Kestur, John Davis, and Eric Chung. Towards a universal fpga matrix-vector multiplication architecture. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on, 2012.

[33] Sang Kyun Kim, Lawrence Christopher McAfee, Peter Leonard McMahon, and Kunle Olukotun. A Highly Scalable Restricted Boltzmann Machine Implementation. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Sept. 2009.

[34] Sang Kyun Kim, Peter Leonard McMahon, and Kunle Olukotun. A large-scale architecture for restricted boltzmann machines. In Proceedings of the 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM ’10, pages 201–208, Washington, DC, USA, 2010. IEEE Computer Society.

[35] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: a 32-way multithreaded sparc processor. Micro, IEEE, 25(2):21 – 29, march-april 2005.

[36] Ian Kuon and Jonathan Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays, FPGA ’06, pages 21–30, New York, NY, USA, 2006. ACM.

[37] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 –2324, nov 1998.

[38] Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area v2. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 873–880. MIT Press, Cambridge, MA, 2008.

[39] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Léon Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 609–616, Montreal, June 2009. Omnipress.

[40] C.Y. Lin, Zheng Zhang, Ngai Wong, and H.K.-H. So. Design space exploration for sparse matrix-matrix multiplication on fpgas. In Field-Programmable Technology (FPT), 2010 International Conference on, pages 369 –372, dec. 2010.

[41] Daniel L. Ly and Paul Chow. A Multi-FPGA Architecture for Stochastic Restricted Boltzmann Machine. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, Sept. 2009.

[42] Daniel L. Ly and Paul Chow. A high-performance fpga architecture for restricted boltzmann machines. In Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’09, pages 73–82, New York, NY, USA, 2009. ACM.

[43] P. Lysaght, J. Stockwood, J. Law, and D. Girma. Artificial neural network implementation on a fine-grained FPGA. Field-Programmable Logic Architectures, Synthesis and Applications, 849:421–431, 1994.

[44] MATLAB. version 7.13.0 (R2011b). The MathWorks Inc., Natick, Massachusetts, 2011.

[45] G.E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114 –117, apr. 1965.

[46] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.

[47] Rajat Raina, Anand Madhavan, and Andrew Ng. Large-Scale Deep Unsupervised Learning using Graphics Processors. In Léon Bottou and Michael Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 873–880, Montreal, June 2009. Omnipress.

[48] Yichuan Tang and Chris Eliasmith. Deep networks for robust visual recognition. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1055–1062, Haifa, Israel, June 2010. Omnipress.

[49] Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems 19, pages 1345–1352. MIT Press, 2007.

[50] M.B. Taylor. Is dark silicon useful? In Proceedings of the 39th annual Design Automation Conference, 2012.

[51] Terasic Technologies Inc. DE3 User Manual, 2009.

[52] Thomas E. Tkacik. A hardware random number generator. In Revised Papers from the 4th International Workshop on Cryptographic Hardware and Embedded Systems, CHES ’02, pages 450–453, London, UK, UK, 2003. Springer-Verlag.

[53] Stijn Van Dongen. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

[54] Vasily Volkov and James W. Demmel. Benchmarking gpus to tune dense linear algebra. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–11, Piscataway, NJ, USA, 2008. IEEE Press.

[55] Lotfi A. Zadeh. Fuzzy logic, neural networks, and soft computing. Commun. ACM, 37(3):77–84, March 1994.

[56] David Zhang and Sankur K. Pal, editors. Neural Networks and Systolic Array Design. World Scientific Publishing Co. Pte. Ltd., Farrer Road, Singapore, 2002.

[57] J. Zhu and P. Sutton. FPGA Implementations of Neural Networks: a Survey of a Decade of Progress. In Proc. 13th International Conference on Field-Programmable Logic and Applications, pages 1062–1066, September 2003.

[58] Ling Zhuo and Viktor K. Prasanna. Sparse matrix-vector multiplication on fpgas. In Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, pages 63–74, New York, NY, USA, 2005. ACM.