
Simplified Vector-Thread Architectures for Flexible and Efficient Data-Parallel Accelerators

by Christopher Francis Batten

B.S. in Electrical Engineering, University of Virginia, May 1999
M.Phil. in Engineering, University of Cambridge, August 2000

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2010

© Massachusetts Institute of Technology 2010. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 29, 2010

Certified by: Krste Asanović, Associate Professor, University of California, Berkeley, Thesis Supervisor

Accepted by: Terry P. Orlando, Chairman, Department Committee on Graduate Students, Electrical Engineering and Computer Science

Abstract

This thesis explores a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture. The thesis begins by categorizing regular and irregular data-level parallelism (DLP), before presenting several architectural design patterns for data-parallel accelerators including the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. Our recently proposed VT pattern includes many control threads that each manage their own array of microthreads. The control thread uses vector memory instructions to efficiently move data and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow.

In this thesis, I introduce various techniques for building simplified instances of the VT pattern. I propose unifying the VT control-thread and microthread scalar instruction sets to simplify the microarchitecture and programming methodology. I propose a new single-lane VT microarchitecture based on minimal changes to the vector-SIMD pattern. Single-lane cores are simpler to implement than multi-lane cores and can achieve similar energy efficiency. This new microarchitecture uses control processor embedding to mitigate the area overhead of single-lane cores, and uses vector fragments to more efficiently handle both regular and irregular DLP as compared to previous VT architectures. I also propose an explicitly data-parallel VT programming methodology that is based on a slightly modified scalar compiler. This methodology is easier to use than assembly programming, yet simpler to implement than an automatically vectorizing compiler.

To evaluate these ideas, we have begun implementing the Maven data-parallel accelerator. This thesis compares a simplified Maven VT core to MIMD, vector-SIMD, and SIMT cores. We have implemented these cores with an ASIC methodology, and I use the resulting gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative comparison of the VT pattern to other patterns. My results suggest that future data-parallel accelerators based on simplified VT architectures should be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.

Thesis Supervisor: Krste Asanović
Title: Associate Professor, University of California, Berkeley

Acknowledgements

I would like to first thank my research advisor, Krste Asanović, who has been a true mentor, passionate teacher, inspirational leader, valued colleague, and professional role model throughout my time at MIT and U.C. Berkeley. This thesis would not have been possible without Krste's constant stream of ideas, encyclopedic knowledge of technical minutia, and patient yet unwavering support. I would also like to thank the rest of my thesis committee, Christopher Terman and Arvind, for their helpful feedback as I worked to crystallize the key themes of my research.

Thanks to the other members of the Scale team at MIT for helping to create the vector-thread architectural design pattern. Particular thanks to Ronny Krashinsky, who led the Scale team, and taught me more than he will probably ever know. Working with Ronny on the Scale project was, without doubt, the highlight of my time in graduate school. Thanks to Mark Hampton for having the courage to build a compiler for a brand new architecture. Thanks to the many others who made both large and small contributions to the Scale project including Steve Gerding, Jared Casper, Albert Ma, Asif Khan, Jaime Quinonez, Brian Pharris, Jeff Cohen, and Ken Barr.

Thanks to the other members of the Maven team at U.C. Berkeley for accepting me like a real Berkeley student and helping to rethink vector threading. Particular thanks to Yunsup Lee for his tremendous help in implementing the Maven architecture. Thanks to both Yunsup and Rimas Avizienis for working incredibly hard on the Maven RTL and helping to generate such detailed results. Thanks to the rest of the Maven team including Chris Celio, Alex Bishara, and Richard Xia. This thesis would still be an idea on a piece of green graph paper without all of their hard work, creativity, and dedication. Thanks also to Hidetaka Aoki for so many wonderful discussions about vector architectures. Section 1.4 discusses in more detail how the members of the Maven team contributed to this thesis.

Thanks to the members of the nanophotonic systems team at MIT and U.C. Berkeley for allowing me to explore a whole new area of research that had nothing to do with vector processors. Thanks to Vladimir Stojanović for his willingness to answer even the simplest questions about nanophotonic technology, and for being a great professional role model. Thanks to Ajay Joshi, Scott Beamer, and Yong-Jin Kwon for working with me to figure out what to do with this interesting new technology.

Thanks to my fellow graduate students at both MIT and Berkeley for making every extended intellectual debate, every late-night hacking session, and every conference trip such a wonderful experience. Particular thanks to Dave Wentzlaff, Ken Barr, Ronny Krashinsky, Edwin Olson, Mike Zhang, Jessica Tseng, Albert Ma, Mark Hampton, Seongmoo Heo, Steve Gerding, Jae Lee, Nirav Dave, Michael Pellauer, Michal Karczmarek, Bill Thies, Michael Taylor, Niko Loening, and David Liben-Nowell. Thanks to Rose Liu and Heidi Pan for supporting me as we journeyed from one coast to the other. Thanks to Mary McDavitt for being an amazing help throughout my time at MIT and even while I was in California.

Thanks to my parents, Arthur and Ann Marie, for always supporting me from my very first experiment with Paramecium to my very last experiment with vector threading. Thanks to my brother, Mark, for helping me to realize that life is about working hard but also playing hard.
Thanks to my wife, Laura, for her unending patience, support, and love through all my ups and downs. Finally, thanks to my daughter, Fiona, for helping to put everything into perspective.

Contents

1 Introduction
  1.1 Transition to Multicore & Manycore General-Purpose Processors
  1.2 Emergence of Programmable Data-Parallel Accelerators
  1.3 Leveraging Vector-Threading in Data-Parallel Accelerators
  1.4 Collaboration, Previous Publications, and Funding

2 Architectural Design Patterns for Data-Parallel Accelerators
  2.1 Regular and Irregular Data-Level Parallelism
  2.2 Overall Structure of Data-Parallel Accelerators
  2.3 MIMD Architectural Design Pattern
  2.4 Vector-SIMD Architectural Design Pattern
  2.5 Subword-SIMD Architectural Design Pattern
  2.6 SIMT Architectural Design Pattern
  2.7 VT Architectural Design Pattern
  2.8 Comparison of Architectural Design Patterns
  2.9 Example Data-Parallel Accelerators

3 Maven: A Flexible and Efficient Data-Parallel Accelerator
  3.1 Unified VT Instruction Set Architecture
  3.2 Single-Lane VT Microarchitecture Based on Vector-SIMD Pattern
  3.3 Explicitly Data-Parallel VT Programming Methodology

4 Maven Instruction Set Architecture
  4.1 Instruction Set Overview
  4.2 Challenges in a Unified VT Instruction Set
  4.3 Vector Configuration Instructions
  4.4 Vector Memory Instructions
  4.5 Calling Conventions
  4.6 Extensions to Support Other Architectural Design Patterns
  4.7 Future Research Directions
  4.8 Related Work

5 Maven Microarchitecture
  5.1 Microarchitecture Overview
  5.2 Control Processor Embedding
  5.3 Vector Fragment Merging
  5.4 Vector Fragment Interleaving
  5.5 Vector Fragment Compression
  5.6 Leveraging Maven VT Cores in a Full Data-Parallel Accelerator
  5.7 Extensions to Support Other Architectural Design Patterns
  5.8 Future Research Directions
  5.9 Related Work

6 Maven Programming Methodology
  6.1 Programming Methodology Overview
  6.2 VT Compiler
  6.3 VT Application Programming Interface
  6.4 System-Level Interface
  6.5 Extensions to Support Other Architectural Design Patterns
  6.6 Future Research Directions
  6.7 Related Work

7 Maven Evaluation
  7.1 Evaluated Core Configurations
  7.2 Evaluated Microbenchmarks
  7.3 Evaluation Methodology
  7.4 Cycle-Time and Area Comparison
  7.5 Energy and Performance Comparison for Regular DLP
  7.6 Energy and Performance Comparison for Irregular DLP
  7.7 Computer Graphics Case Study

8 Conclusion
  8.1 Thesis Summary and Contributions
  8.2 Future Work

Bibliography

List of Figures

1.1 Trends in Transistors, Performance, and Power for General-Purpose Processors
1.2 Qualitative Comparison of Multicore, Manycore, and Data-Parallel Accelerators
1.3 Examples of Data-Parallel Accelerators

2.1 General-Purpose Processor Augmented with a Data-Parallel Accelerator
2.2 MIMD Architectural Design Pattern
2.3 Mapping Regular DLP to the MIMD Pattern
2.4 Mapping Irregular DLP to the MIMD Pattern
2.5 Vector-SIMD Architectural Design Pattern
2.6 Mapping Regular DLP to the Vector-SIMD Pattern
2.7 Mapping Irregular DLP to the Vector-SIMD Pattern
2.8 Subword-SIMD Architectural Design Pattern
2.9 SIMT Architectural Design Pattern
2.10 Mapping Regular DLP to the SIMT Pattern
2.11 Mapping Irregular DLP to the SIMT Pattern
2.12 VT Architectural Design Pattern
2.13 Mapping Regular DLP to the VT Pattern
2.14 Mapping Irregular DLP to the VT Pattern

3.1 Maven Data-Parallel Accelerator
3.2 Multi-Lane versus Single-Lane Vector-Thread Units
3.3 Explicitly Data-Parallel Programming Methodology

4.1 Maven Programmer's Logical View
4.2 Example Maven Assembly Code
4.3 Examples of Various Vector Configurations
4.4 Memory Fence Examples

5.1 Maven VT Core Microarchitecture
5.2 Executing Irregular DLP on the Maven Microarchitecture
5.3 Example of Vector Fragment Merging
5.4 Example of Vector Fragment Interleaving
5.5 Example of Vector Fragment Compression
5.6 Extra Scheduling Constraints with Vector Fragment Compression
5.7 Maven Data-Parallel Accelerator with Array of VT Cores
5.8 Microarchitecture for Quad-Core Tile

6.1 Maven Toolchain
6.2 Regular DLP Example Using Maven Programming Methodology
6.3 Irregular DLP Example Using Maven Programming Methodology
6.4 Low-Level Example Using the Maven Compiler
6.5 Example of Vector-Fetched Block After Preprocessing
6.6 Maven System-Level Software Stack
6.7 Regular DLP Example Using Maven MIMD Extensions
6.8 Regular DLP Example Using Maven Traditional-Vector Extensions
6.9 Regular DLP Example Using Maven SIMT Extensions

7.1 Microbenchmarks
7.2 Evaluation Software and Hardware Toolflow
7.3 Chip Floorplan for vt-1x8 Core
7.4 Normalized Area for Core Configurations
7.5 Results for Regular DLP Microbenchmarks
7.6 Results for Regular DLP Microbenchmarks with Uniform Cycle Time
7.7 Results for Irregular DLP Microbenchmarks
7.8 Divergence in Irregular DLP Microbenchmarks
7.9 Results for bsearch Microbenchmark without Explicit Conditional Moves
7.10 Computer Graphics Case Study
7.11 Divergence in Computer Graphics Kernels

List of Tables

2.1 Different Types of Data-Level Parallelism
2.2 Mechanisms for Exploiting Data-Level Parallelism

4.1 Maven Scalar Instructions
4.2 Maven Vector Instructions
4.3 Scalar Register Usage Convention
4.4 Maven Vector-SIMD Extensions
4.5 Maven Vector-SIMD Extensions (Flag Support)

6.1 Maven VT Application Programming Interface

7.1 Evaluated Core Configurations
7.2 Instruction Mix for Microbenchmarks
7.3 Absolute Area and Cycle Time for Core Configurations
7.4 Instruction Mix for Computer Graphics Kernels

Chapter 1

Introduction

Serious technology issues are breaking down the traditional abstractions in computer architecture. Power and energy efficiency are now first-order design constraints, and the road map for standard CMOS technology has never been more challenging. At the same time, emerging data-parallel applications are growing in popularity but also in complexity, with each application requiring a mix of both regular and irregular data-level parallelism. In response to these technology and application demands, computer architects are turning to multicore and manycore processors augmented with data-parallel accelerators, where tens to hundreds of both general-purpose and data-parallel cores are integrated in a single chip.

In this thesis, I argue for a new approach to building data-parallel accelerators that is based on our recently proposed vector-thread architectural design pattern. The central theme of my thesis is that simplifying the vector-thread pattern's instruction set, microarchitecture, and programming methodology enables large arrays of vector-thread cores to be used in data-parallel accelerators. These simplified vector-thread accelerators combine the energy efficiency of single-instruction multiple-data (SIMD) accelerators with the flexibility of multiple-instruction multiple-data (MIMD) accelerators.

This chapter begins by examining the current transition to multicore and manycore processors and the emergence of data-parallel accelerators, before outlining the key contributions of the thesis and how they advance the state of the art in vector-thread architectures.

1.1 Transition to Multicore & Manycore General-Purpose Processors

Figure 1.1 shows multiple trends for a variety of high-performance desktop and server general-purpose processors from several manufacturers over the past 35 years. We focus first on four metrics: the number of transistors per die, the single-thread performance as measured by the SPECint benchmarks, the processor's clock frequency, and the processor's typical power consumption. We can clearly see the effect of Moore's Law with the number of transistors per die doubling about every two years.

Figure 1.1: Trends in Transistors, Performance, and Power for General-Purpose Processors – Various metrics are shown for a selection of processors usually found in high-performance desktop and server systems from 1975 to 2010. For over 30 years, engineers used increased clock frequencies and power hungry architectural techniques to turn the wealth of available transistors into single-thread performance. Unfortunately, power constraints are forcing engineers to integrate multiple cores onto a single die in an attempt to continue performance scaling, albeit only for parallel applications. (Data gathered from publicly available data-sheets, press-releases, and SPECint benchmark results. Some data gathered by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, and L. Hammond of Stanford University. Single-thread performance is reported as the most recent SPECint results normalized to the performance of an 80286 processor. SPECint results for many recent processors include auto-parallelization making it difficult to estimate single-thread performance. Conversion factors for different SPECint benchmark suites are developed by analyzing processors that have SPECint results for more than one suite.)

Computer architects exploited this wealth of transistors to improve single-thread performance in two ways. Architects used faster transistors and deeper pipelines to improve processor clock frequency, and they also used techniques such as caches, out-of-order execution, and superscalar issue to further improve performance.

For example, the MIPS R2000 processor released in 1985 required 110 K transistors, ran at 16 MHz, had no on-chip caches, and used a very simple five-stage pipeline [Kan87]. The DEC Alpha 21164 was released just ten years later and required 10 M transistors, ran at 300 MHz, included on-chip instruction and data caches, and used a seven-stage pipeline with four-way superscalar issue, all of which served to improve single-thread performance by almost two orders of magnitude over processors available in the previous decade [ERPR95]. The Intel Pentium 4, which was released five years later in 2000, was an aggressive attempt to exploit as much single-thread performance as possible resulting in another order-of-magnitude performance improvement. The Pentium 4 required 42 M transistors, ran at 2+ GHz, included large on-chip caches, and used a 20-stage pipeline with superscalar issue and a 126-entry reorder buffer for deep out-of-order execution [HSU+01]. These are all examples of the technology and architectural innovation that allowed single-thread performance to track Moore's Law for over thirty years. It is important to note that this performance improvement was basically transparent to software; application programmers simply waited two years and their programs would naturally run faster with relatively little work.

As illustrated in Figure 1.1, the field of computer architecture underwent a radical shift around 2005 when general-purpose single-threaded processors became significantly limited by power consumption. Typical processor power consumption has leveled off at 100–125 W, since higher power consumption requires much more expensive packaging and cooling solutions. Since power is directly proportional to frequency, processor clock frequencies have also been recently limited to a few gigahertz. Although single-threaded techniques such as even deeper pipelines and wider superscalar issue can still marginally improve performance, they do so at significantly higher power consumption. Transistor densities continue to improve, but computer architects are finding it difficult to turn these transistors into single-thread performance. Due to these factors, the industry has begun to de-emphasize single-thread performance and focus on integrating multiple cores onto a single die [OH05, ABC+06]. We have already started to see the number of cores approximately double every two years, and, if programmers can parallelize their applications, then whole application performance should once again start to track Moore's Law.

For example, the AMD 4-core Opteron processor introduced in 2007 required 463 M transistors, ran at 2+ GHz, and included four out-of-order superscalar cores in a single die [DSC+07]. Almost every general-purpose processor manufacturer is now releasing multicore processors including Sun's 8-core Niagara 2 processor [NHW+07] and 16-core Rock processor [CCE+09], Intel's 8-core processor [RTM+09], Fujitsu's 8-core SPARC64 VIIIfx processor [Mar09], and IBM's 8-core POWER7 processor [Kal09]. The current transition to multiple cores on a chip is often referred to as the multicore era, and some predict we will soon enter the manycore era with hundreds of very simple cores integrated onto a single chip [HBK06].

Examples include the Tilera TILE64 processor with 64 cores per chip [BEA+08] and the Intel terascale prototypes with 48–80 cores per chip [VHR+07, int09].

Power consumption constrains more than just the high-performance processors in Figure 1.1. Power as well as energy have become primary design constraints across the entire spectrum of computing, from the largest data center to the smallest embedded device. Consequently, arrays of processing cores have also been adopted in laptops, handhelds, network routers, video processors, and other embedded systems. Some multicore processors for mobile clients are simply scaled down versions of high-performance processors, but at the embedded end of the spectrum, multicore-like processors have been available for many years often with fixed-function features suitable for the target application domain. Examples include ARM's 4-core Cortex-A9 MPCore processor [arm09], Raza Microelectronics' 8-core XLR [Kre05], Cavium Networks' 16-core Octeon network processor [YBC+06], Cisco's Metro network router chip with 188 Tensilica Xtensa cores [Eat05], and PicoChip's 430-core PC101 signal-processing array [DPT03]. The precise distinction between multicore/manycore and general-purpose/embedded is obviously gray, but there is a clear trend across the entire industry towards continued integration of more processing cores onto a single die.

Figure 1.2 illustrates in more detail why the move to multicore and manycore processors can help improve performance under a given power constraint. The figure qualitatively shows the performance (tasks per second) on the x-axis, the energy-efficiency (energy per task) on the y-axis, and power consumption as hyperbolic iso-power curves; since power is the product of energy per task and tasks per second, curves of constant power are hyperbolas in this space. Any given system will have a power constraint that maps to one of the iso-power curves and limits potential architectural design choices. For example, a processor in a high-performance workstation might have a 125 W power constraint, while a processor in a mobile client might be limited to just 1 W. Point A in the figure shows where a hypothetical simple RISC core, similar to the MIPS R2000, might lie in the energy-performance space. The rest of the space can be divided into four quadrants. Alternative architectures which lie in the upper left quadrant are probably unattractive, since they are both slower and less energy-efficient than the simple RISC core. Alternative architectures which lie in the lower left quadrant are lower performance but might also require much less energy per task. While such architectures are suitable when we desire the lowest possible energy (e.g., processors for wireless sensor networks), in this thesis I focus on architectures that balance performance and energy-efficiency. The best architectures are in the lower right quadrant with improved performance and energy-efficiency, but building such architectures can be challenging for general-purpose processors which need to handle a wide variety of application types. Instead, most general-purpose architectural approaches lie in the upper right quadrant with improved performance at the cost of increased energy per task. For example, an architecture similar to the DEC Alpha 21164 might lie at point B if implemented in the same technology as the simple RISC core at point A.

Figure 1.2: Qualitative Comparison of Multicore, Manycore, and Data-Parallel Accelerators – Out-of-order execution, superscalar issue, and superpipelining increase processor performance but at significant cost in terms of energy per task (A → B → C). Multicore processors improve performance at less energy per task by using multiple complex cores and scaling the voltage to improve energy-efficiency (C → D → E) or by using multiple simpler and more energy-efficient cores (C → B → F). Unfortunately, parallelization and energy overheads can reduce the benefit of these approaches. Data-parallel accelerators exploit the special properties of data-parallel applications to improve performance while also reducing the energy per task.

Such an architecture would have higher single-thread performance but also require more energy per task due to additional pipeline registers, branch prediction logic, unnecessary work through misspeculation, and parallel dependency checking logic. An architecture similar to the Intel Pentium 4 implemented in the same technology might lie at point C and would again improve single-thread performance at even higher energy per task due to more expensive pipelining, higher misspeculation costs, and highly associative out-of-order issue and reordering logic. Architectural techniques for single-thread performance might continue to marginally improve performance, but they do so with high energy overhead and thus increased power consumption. Although there has been extensive work analyzing and attempting to mitigate the energy overhead of complex single-thread architectures, multicore processors take a different approach by relying on multiple cores instead of continuing to build an ever more complicated single-core processor.

Figure 1.2 illustrates two techniques for improving performance at the same power consumption as the complex core at point C: multiple complex cores with voltage scaling (C → D → E) or multiple simple and more energy-efficient cores (C → B → F). The first technique instantiates multiple complicated cores as shown by the transition from point C to point D. For completely parallel applications, the performance should improve linearly, since we can execute multiple tasks in parallel, and ideally the energy to complete each task should remain constant.

Instead of spending a certain amount of energy to complete each task serially, we are simply spending the same amount of energy in parallel to finish executing the tasks sooner. Of course, as Figure 1.2 illustrates, higher performance at the same energy per task results in higher power consumption which means point D exceeds our processor power constraint. Since energy is quadratically proportional to voltage, we can use frequency and voltage scaling to turn some of the performance improvement into reduced energy per task as shown by the transition from point D to point E. Using static or dynamic voltage scaling to improve the energy-efficiency of parallel processing engines is a well understood technique which has been used in a variety of contexts [GC97, DM06, dLJ06, LK09].

A second technique, which achieves a similar result, reverts to simpler cores that are both lower energy and lower performance, and then compensates for the lower performance by integrating multiple cores onto a single chip. This is shown in Figure 1.2 as the transition from point C to point B to point F. Comparing point C to points E and F, we see that multicore processors can improve performance at less energy per task under a similar power constraint. Note that we can combine voltage scaling and simpler core designs for even greater energy-efficiency. Manycore processors are based on a similar premise as multicore processors, except they integrate even more cores onto a single die and thus require greater voltage scaling or very simple core designs to still meet the desired power constraint.
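
To make the voltage-scaling argument concrete, here is a rough numerical sketch using the standard first-order models for dynamic power and energy; the 0.8 scaling factor is an illustrative assumption (not a number from this thesis) and it optimistically assumes frequency can scale linearly with voltage:

    P_{dyn} \approx C V^2 f, \qquad E_{task} \propto C V^2

Doubling the number of cores at nominal voltage and frequency doubles both throughput and power while leaving energy per task unchanged (C → D). If each core is then run at 0.8 times its nominal voltage and frequency, per-core power falls to roughly 0.8^3 ≈ 0.51 of nominal and energy per task to 0.8^2 = 0.64 of nominal, so the two scaled cores together deliver about 1.6 times the original throughput at approximately the original single-core power budget (D → E).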

The multicore approach fundamentally depends on resolving two key challenges: the parallel programming challenge and the parallel energy-efficiency challenge. The first challenge, parallel programming, has been well-known for decades in the high-performance scientific community, but now that parallel programming is moving into the mainstream it takes on added importance. Applications that are difficult to parallelize will result in poor performance on multicore processors. Even applications which are simple to parallelize might still be difficult to analyze and thus tune for greater performance. The second challenge, parallel energy-efficiency, is simply that it is difficult to make each individual core significantly more energy-efficient than a single monolithic core. Voltage scaling without drastic loss in performance is becoming increasingly difficult in deep sub-micron technologies due to variation and subthreshold leakage at low voltages [GGH97, FS00, CC06]. This means instantiating many complex cores is unlikely to result in both an energy-efficient and high-performance solution. While simpler cores can be more energy-efficient, they can only become so simple before the loss in performance starts to outweigh reduced power consumption. Finally, even with energy-efficient cores, there will inevitably be some energy overhead due to multicore management, coherence, and/or direct communication costs.

Overall, the parallel programming challenge and the parallel energy-efficiency challenge mean that points above and to the left of the ideal multicore architectures represented by points E and F more accurately reflect real-world designs. In the next section, we discuss tackling these challenges through the use of accelerators specialized for efficiently executing data-parallel applications.

1.2 Emergence of Programmable Data-Parallel Accelerators

Data-parallel applications have long been the dominant form of supercomputing, due to the ubiquity of data-parallel kernels in demanding computational workloads. A data-parallel application is one in which the most straightforward parallelization approach is to simply partition the input dataset and apply a similar operation across all partitions. Conventional wisdom holds that highly data-parallel applications are limited to the domain of high-performance scientific computing. Examples include molecular dynamics, climate modeling, computational fluid dynamics, and generic large-scale linear-algebra routines. However, an honest look at the current mainstream application landscape reveals a wealth of data-parallelism in such applications as graphics rendering, computer vision, medical imaging, video and audio processing, speech and natural language understanding, game physics simulation, encryption, compression, and network processing. Although all of these data-level parallel (DLP) applications contain some number of similar tasks which operate on independent pieces of data, there can still be a significant amount of diversity. On one end of the spectrum is regular data-level parallelism, characterized by well-structured data-access patterns and few data-dependent conditionals meaning that all tasks are very similar. In the middle of the spectrum is irregular data-level parallelism, characterized by less structured data-access patterns and some number of (possibly nested) data-dependent conditionals meaning that the tasks are less similar. Eventually an application might become so irregular that the tasks are performing completely different operations from each other with highly unstructured data-access patterns and significant dependencies between tasks. These applications essentially lie on the opposite end of the spectrum from regular DLP and are better characterized as task-level parallel (TLP).

DLP applications can, of course, be mapped to the general-purpose multicore processors discussed in the previous section. In fact, the distinguishing characteristics of DLP applications (many similar independent tasks) make them relatively straightforward to map to such architectures and thus help mitigate the parallel programming challenge. Unfortunately, general-purpose multicore processors have no special mechanisms for efficiently executing DLP applications. Such processors execute DLP and TLP applications in the same way, meaning that we can expect both an increase in performance and increase in energy per task compared to a single simple core regardless of the type of parallelism. Given the large amount of data-level parallelism in modern workloads and the energy overheads involved in multicore architectures, it's not difficult to see that we will soon be once again seriously limited by system power constraints.

The desire to exploit DLP to improve both performance and energy-efficiency has given rise to data-parallel accelerators which contain an array of programmable data-parallel cores. In this thesis, I loosely define a data-parallel accelerator to be a specialized architecture which is meant to augment a general-purpose multicore processor, and which contains dedicated mechanisms well suited to executing DLP applications. We should expect data-parallel accelerators to lie in the lower right quadrant of Figure 1.2 with higher performance and lower energy per task as compared to a single general-purpose core.

(a) UC Berkeley T0 Vector Processor; (b) Stanford Imagine Stream Processor; (c) IBM Cell General Accelerator; (d) NVIDIA Fermi Graphics Processor

Figure 1.3: Examples of Data-Parallel Accelerators – Demands for higher performance and/or lower energy per task on data-parallel applications have led to the emergence of data-parallel accelerators, from uni-processor vector and stream processors, to more general SIMD accelerators, to general-purpose graphics processors with hundreds of data-parallel engines on a single chip.

This is especially true for regular DLP applications, but probably less so for irregular DLP applications. Even so, an accelerator which can flexibly execute a wider array of DLP applications will be more attractive than one which can only execute regular DLP applications. Ideally a system will combine both general-purpose cores and data-parallel cores so that the best energy-performance trade-off can be achieved for any given application [WL08].

Figure 1.3 illustrates a few representative data-parallel accelerators. The Berkeley Spert-II system included the T0 vector processor with an 8-lane vector unit and was used as an accelerator for neural network, multimedia, and digital signal processing [WAK+96, Asa98]. The Stanford Imagine stream processor was a research prototype which included an 8-lane vector-like processing unit and was evaluated for a variety of data-parallel kernels [RDK+98]. These early examples contain a single tightly coupled data-parallel accelerator, while the more recent IBM Cell processor contains an array of eight simple subword-SIMD processors controlled by a single general-purpose processor integrated on the same chip [GHF+06]. Perhaps the most widespread examples of data-parallel accelerators are the graphics processors found in most mainstream workstations and mobile clients. NVIDIA's CUDA framework [NBGS08] and AMD/ATI's Close-to-the-Metal initiative [ati06] enable the many vector-like cores on a graphics processor to be used as a programmable data-parallel accelerator. For example, the recently announced Fermi graphics processor from NVIDIA includes 32 data-parallel cores each with a flexible 16-lane vector-like engine suitable for graphics as well as more general data-parallel applications [nvi09].

To help us reason about these different approaches and motivate the rest of my work, Chapter 2 will introduce a set of architectural design patterns which capture the key characteristics of various data-parallel accelerators used both in industry and research.

1.3 Leveraging Vector-Threading in Data-Parallel Accelerators

In this thesis, I propose a new approach to building data-parallel accelerators that is based on the vector-thread architectural design pattern. Vector-threading (VT) includes a control thread that manages a vector of microthreads. The control thread uses vector memory instructions to efficiently move blocks of data between memory and each microthread's registers, and vector fetch instructions to broadcast scalar instructions to all microthreads. These vector mechanisms are complemented by the ability for each microthread to direct its own control flow when necessary. This logical view is implemented by mapping the microthreads both spatially and temporally to a set of vector lanes contained within a vector-thread unit (VTU). A seamless intermixing of vector and threaded mechanisms allows VT to potentially combine the energy efficiency of SIMD accelerators with the flexibility of MIMD accelerators.

The Scale VT processor is our first implementation of these ideas specifically targeted for use in embedded devices. Scale uses a generic RISC instruction set architecture for the control thread and a microthread instruction set specialized for VT. The Scale microarchitecture includes a simple RISC control processor and a complicated (but efficient) four-lane VTU. The Scale programming methodology requires either a combination of compiled code for the control processor and assembly programming for the microthreads or a preliminary vectorizing compiler specifically written for Scale. For more information, see our previous publications on the Scale architecture [KBH+04a, KBH+04b], memory system [BKGA04], VLSI implementation [KBA08], and compiler [HA08].

Based on our experiences designing, implementing, and evaluating Scale, I have identified three primary directions for improvement to simplify both the hardware and software aspects of the VT architectural design pattern. These improvements include a unified VT instruction set architecture, a single-lane VT microarchitecture based on the vector-SIMD pattern, and an explicitly data-parallel VT programming methodology, and they collectively form the key contribution of this thesis. To evaluate these ideas we have begun implementing the Maven data-parallel accelerator, which includes a malleable array of tens to hundreds of vector-thread engines. These directions for improvement are briefly outlined below and discussed in more detail in Chapter 3.

• Unified VT Instruction Set Architecture – I propose unifying the VT control-thread and microthread scalar instruction sets such that both types of thread execute a very similar generic RISC instruction set. This simplifies the microarchitecture and the programming methodology, as compared to the specialized microthread instruction set used in Scale. Chapter 4 will discuss the Maven instruction set in general as well as the specific challenges encountered with a unified instruction set.

• Single-Lane VT Microarchitecture Based on the Vector-SIMD Pattern – I propose a new approach to implementing VT instruction sets that closely follows a simple traditional vector-SIMD microarchitecture as much as possible. This is in contrast to the more complex microarchitecture used in Scale. I also propose using single-lane VTUs as opposed to Scale's multi-lane VTU or the multi-lane vector units common in other data-parallel accelerators. A vector-SIMD-based single-lane VT microarchitecture is simpler to implement and can have competitive energy-efficiency as compared to more complicated multi-lane VTUs. Chapter 5 will describe the basic Maven microarchitecture before introducing techniques that reduce the area overhead of single-lane VTUs and improve efficiency when executing irregular DLP.

• Explicitly Data-Parallel VT Programming Methodology – I propose an explicitly data-parallel programming methodology suitable for VT data-parallel accelerators that combines a slightly modified C++ scalar compiler with a carefully written support library. The result is a clean programming model that is considerably easier than assembly programming, yet simpler to implement as compared to an automatic vectorizing compiler, such as the Scale compiler. Chapter 6 will explain the required compiler changes and the support library's implementation, as well as show several examples to illustrate the Maven programming methodology.

The unifying theme across these directions for improvement is a desire to simplify all aspects of the VT architectural design pattern, yet still preserve the primary advantages. To evaluate our ideas, we have implemented Maven as well as various other data-parallel cores using a semi-custom ASIC methodology in a TSMC 65 nm process. Chapter 7 will use the resulting placed-and-routed gate-level models to evaluate the area, performance, and energy of several compiled microbenchmarks. This work is the first detailed quantitative evaluation of the VT design pattern compared to other design patterns for data-parallel accelerators.
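
To give a concrete flavor of the VT execution model described above, the following C-like sketch shows how a simple data-dependent loop might be split between a control thread and its microthreads. Every vt_* primitive, the VA/VB register names, and vt_utidx() are hypothetical placeholders invented for this illustration; they are not the Maven instructions or programming interface, which Chapters 4 through 6 define precisely.

    /* Illustrative VT mapping of:
     *   for ( i = 0; i < n; i++ )
     *     if ( A[i] > 0 )
     *       C[i] = x * A[i] + B[i];
     * All vt_* names below are hypothetical placeholders.        */

    /* Scalar block that the control thread vector-fetches; one logical
     * copy runs on each microthread with that microthread's elements
     * of the VA and VB vector registers.                          */
    void vf_block( int a, int b, int x, int *C_strip )
    {
      if ( a > 0 )                           /* microthreads may diverge here  */
        C_strip[ vt_utidx() ] = x * a + b;   /* hypothetical microthread index */
    }

    /* Control-thread code: a stripmined loop that uses vector memory
     * instructions for the regular data movement and a vector fetch
     * to broadcast the scalar block above to all microthreads.     */
    void vt_loop( int *A, int *B, int *C, int x, int n )
    {
      int vlen = vt_config_max_vlen();            /* hypothetical              */
      for ( int i = 0; i < n; i += vlen ) {
        int len = ( n - i < vlen ) ? ( n - i ) : vlen;
        vt_set_vlen( len );                       /* shrink the last strip      */
        vt_load_unit_stride( VA, &A[i] );         /* vector memory instructions */
        vt_load_unit_stride( VB, &B[i] );
        vt_fetch( vf_block, x, &C[i] );           /* broadcast scalar work      */
        vt_sync();                                /* wait before next strip     */
      }
    }

The point of the sketch is only the division of labor: the regular data movement is amortized by the control thread's vector instructions, while each microthread keeps an ordinary scalar view of the data-dependent conditional.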

1.4 Collaboration, Previous Publications, and Funding

As with all large systems research projects, this thesis describes work which was performed as part of a group effort. Although I was personally involved at all levels, from gate-level design to compiler optimizations, the ideas and results described within this thesis simply would not have been possible without collaborating with my colleagues and, of course, the integral guidance of our research adviser, Krste Asanović.

The initial VT architectural design pattern was developed by Krste Asanović, Ronny Krashinsky, and myself while at the Massachusetts Institute of Technology from 2002 through 2008, and we designed and fabricated the Scale VT processor to help evaluate the VT concept. Ronny Krashinsky was the lead architect for Scale, and I worked with him on the Scale architecture and the overall direction of the project. I was responsible for the Scale memory system which included developing new techniques for integrating vector processors into cache hierarchies, designing and implementing a microarchitectural simulator for the non-blocking, highly-associative cache used in our early studies, and finally implementing the RTL for the memory system used in the fabricated Scale prototype. Much of the work in this thesis evolved from my experiences working on the Scale project.

The Maven VT core was developed by myself and a group of students during my time as a visiting student at the University of California, Berkeley from 2007 through 2009. I was the lead architect for the Maven project and helped direct the development and evaluation of the architecture, microarchitecture, RTL implementation, compiler, and applications. I was directly responsible for the Maven instruction set architecture, the Maven C++ compiler and programming support libraries, the implementation of several benchmarks, the development of the Maven assembly test suite, and the detailed paper design of the microarchitecture for the Maven single-lane core. Hidetaka Aoki made significant contributions to some of our very early ideas about single-lane VTUs. Yunsup Lee played a very large role in the Maven project including working on the initial C++ compiler port to a modified MIPS back-end, developing the Maven functional simulator, and setting up the CAD tool-flow. Most importantly, Yunsup took the lead in writing the RTL for the Maven VTU. Rimas Avizienis wrote the Maven proxy kernel and implemented the RTL for both the Maven control processor and the multithreaded MIMD processor used in our evaluation. Chris Celio wrote the RTL for the Maven vector memory unit. The benchmarks used to evaluate Maven were written by myself, Chris Celio, Alex Bishara, and Richard Xia. Alex took the lead in implementing the graphics rendering application for Maven (discussed in Section 7.7), and also worked on augmenting the functional simulator with a simple performance model.

The majority of the work in this thesis is previously unpublished. Some of the initial vision for the Maven processor was published in a position paper co-authored by myself and others entitled "The Case for Malleable Stream Architectures" from the Workshop on Streaming Systems, 2008 [BAA08].

This work was funded in part by DARPA PAC/C Award F30602-00-20562, NSF CAREER Award CCR-0093354, the Cambridge-MIT Institute, an NSF Graduate Research Fellowship, and donations from Infineon Corporation and Intel Corporation.

Chapter 2

Architectural Design Patterns for Data-Parallel Accelerators

This chapter will introduce a framework for reasoning about the diversity in both data-parallel applications and data-parallel accelerators. The chapter begins by illustrating in more detail the differences between regular and irregular data-level parallelism (Section 2.1), and then discusses the overall structure of data-parallel accelerators found in both industry and academia (Section 2.2). Although the memory system and inter-core network play a large role in such accelerators, what really distinguishes a data-parallel accelerator from its general-purpose counterpart is its highly-optimized data-parallel execution cores. Sections 2.3–2.7 present five architectural design patterns for these data-parallel cores: multiple-instruction multiple-data (MIMD), vector single-instruction multiple-data (vector-SIMD), subword single-instruction multiple-data (subword-SIMD), single-instruction multiple-thread (SIMT), and vector-thread (VT). An architectural design pattern captures the key features of the instruction set, microarchitecture, and programming methodology for a specific style of architecture. The design patterns are abstract enough to enable many different implementations, but detailed enough to capture the distinguishing characteristics. Each pattern includes example mappings of two simple yet representative loops reflecting regular and irregular data-level parallelism, and also discusses an accelerator which exemplifies the pattern. The chapter concludes with a comparison of the patterns (Section 2.8), and a discussion of how the five patterns can help categorize various accelerators in industry and academia (Section 2.9).

2.1 Regular and Irregular Data-Level Parallelism

Different types of data-level parallelism (DLP) can be categorized in two dimensions: the regularity with which data memory is accessed and the regularity with which the control flow changes. Regular data-level parallelism has well structured data accesses where the addresses can be compactly encoded and are known well in advance of when the data is ready. Regular data-level parallelism also has well structured control flow where the control decisions are either known statically or well in advance of when the control flow actually occurs. Irregular data-level parallelism might have less structured data accesses where the addresses are more dynamic and difficult to predict, and might also have less structured control flow with data-dependent control decisions. Irregular DLP might also include a small number of inter-task dependencies that force a portion of each task to wait for previous tasks to finish. Eventually a DLP kernel might become so irregular that it is better categorized as exhibiting task-level parallelism. Table 2.1 contains many simple loops which illustrate the spectrum from regular to irregular DLP.

(a) Regular Data Access & Regular Control Flow: unit-stride loads (A[i], B[i]) and store (C[i])
      for ( i = 0; i < n; i++ )
        C[i] = A[i] + B[i];

(b) Regular Data Access & Regular Control Flow: strided load (A[2*i]) and store (C[2*i])
      for ( i = 0; i < n; i++ )
        C[2*i] = A[2*i] + A[2*i+1];

(c) Regular Data Access & Regular Control Flow: shared load (x)
      for ( i = 0; i < n; i++ )
        C[i] = x * A[i] + B[i];

(d) Irregular Data Access & Regular Control Flow: indexed load (D[A[i]]) and store (E[C[i]])
      for ( i = 0; i < n; i++ )
        E[C[i]] = D[A[i]] + B[i];

(e) Regular Data Access & Irregular Control Flow: unit-stride loads/store with data-dependent if-then-else conditional
      for ( i = 0; i < n; i++ ) {
        x = ( A[i] > 0 ) ? y : z;
        C[i] = x * A[i] + B[i];
      }

(f) Irregular Data Access & Irregular Control Flow: conditional load (B[i]), store (C[i]), and computation (x * A[i] + B[i])
      for ( i = 0; i < n; i++ )
        if ( A[i] > 0 )
          C[i] = x * A[i] + B[i];

(g) Transforming Irregular Control Flow: previous example split into an irregular compress loop and then a shorter loop with regular control flow and both regular and irregular data access
      for ( m = 0, i = 0; i < n; i++ )
        if ( A[i] > 0 ) {
          D[m] = A[i]; E[m] = B[i];
          I[m++] = i;
        }
      for ( i = 0; i < m; i++ )
        C[I[i]] = x * D[i] + E[i];

(h) Irregular Data Access & Irregular Control Flow: inner loop with data-dependent number of iterations (B[i]) and pointer dereference (*A[i])
      for ( i = 0; i < n; i++ ) {
        C[i] = 0;
        for ( j = 0; j < B[i]; j++ )
          C[i] += (*A[i])[j];
      }

(i) Irregular Data Access & Irregular Control Flow: inner loop with complex data-dependent exit condition, nested control flow, and conditional loads/stores
      for ( i = 0; i < n; i++ ) {
        C[i] = false; j = 0;
        while ( !C[i] & (j < m) )
          if ( A[i] == B[j++] )
            C[i] = true;
      }

(j) Inter-Task Dependency Mixed with DLP: cross-iteration dependency (C[i] depends on C[i-1])
      for ( C[0] = 0, i = 1; i < n; i++ )
        C[i] = C[i-1] + A[i] * B[i];

(k) Task-Level Parallelism Mixed with DLP: call through function pointer dereference (*A[i])
      for ( i = 0; i < n; i++ )
        (*A[i])( B[i] );

Table 2.1: Different Types of Data-Level Parallelism – Examples are expressed in C-like pseudocode and are ordered from regular data-level parallelism at the top to irregular data-level parallelism at the bottom.

Table 2.1a is a simple vector-vector addition loop where elements of two arrays are summed and written to a third array. Although the number of loop iterations (i.e., the size of the arrays) is often not known until run-time, it does not depend on the array data and still results in a very regular execution. These kind of data accesses are called unit-stride accesses meaning that the array elements are accessed consecutively. This loop, as well as all the loops in Table 2.1, can be mapped to a data-parallel accelerator by considering each loop iteration as a separate data-parallel task. This loop represents the most basic example of regular DLP: elements are accessed in a very structured way and all of the iterations perform identical operations. Although simple, many data-parallel applications contain a great deal of code which is primarily unit-stride accesses with regular control flow, so we should expect data-parallel accelerators to do very well on these kind of loops.

Table 2.1b–c illustrates loops which are also well structured but have slightly more complicated data accesses. In Table 2.1b, each iteration accesses every other element in the A and C arrays. These kind of accesses are called strided accesses and in this example the stride is two. Although still well structured, because strided accesses read or write non-consecutive elements they can result in execution inefficiencies especially for power-of-two or very large strides. Notice that each loop iteration loads two consecutive elements of the array A. These make up a special class of strided accesses called segment accesses, and they are common when working with arrays of structures or objects [BKGA04]. For example, the loop iteration might be accessing the real and imaginary parts of a complex number stored in an array of two-element structures. Table 2.1c is a common loop in linear algebra where we multiply each element in the array A by the scalar value x and add the result to the corresponding element in the array B. Unlike the previous examples, this loop requires a scalar value to be distributed across all iterations. These kind of accesses are sometimes called shared accesses. Although not as simple as unit-stride, data-parallel accelerators should still be able to take advantage of strided, segment, and shared accesses for improved performance and efficiency.
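
To make the segment-access case concrete, the following small C example shows the array-of-structures situation mentioned above; the struct and function names are invented for illustration and do not appear in the thesis. Each iteration touches the two consecutive fields of one structure, which is exactly the shape that a two-element segment access captures.

    /* Array-of-structures traversal that produces segment accesses: each
     * iteration reads and writes two consecutive words (re, im) starting
     * at a base address that advances by the structure size. Names are
     * illustrative only. */
    typedef struct { float re; float im; } Complex;

    void scale_complex( Complex *A, float s, int n )
    {
      for ( int i = 0; i < n; i++ ) {
        A[i].re = s * A[i].re;   /* first element of the two-element segment */
        A[i].im = s * A[i].im;   /* second element of the same segment       */
      }
    }
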
Table 2.1d illustrates a loop with irregular data accesses but with regular control flow. Elements from one array are used to index into a different array. For example, while the array A is accessed with unit stride, the order in which the array D is accessed is not known until run-time and is data dependent. These kind of accesses are often called indexed accesses, although they are also called gather accesses (for loads) and scatter accesses (for stores). Many data-parallel applications have a mix of both regular and irregular data accesses, so effective data-parallel accelerators should be able to handle many different ways of accessing data. It is important to note that even if the accelerator does no better than a general-purpose multicore processor with respect to irregular data accesses, keeping the data loaded on the accelerator enables more efficient execution of the regular portions of the application.

Table 2.1e–f illustrates loops with irregular control flow. In Table 2.1e, an if statement is used to choose which scalar constant should be multiplied by each element of the array A. Although the conditional is data-dependent so that each task might use a different constant, the execution of each task will still be very similar. These types of small conditionals lend themselves to some form of conditional execution such as predication or a conditional move instruction to avoid branch overheads. Table 2.1f illustrates a more complicated conditional where the actual amount of work done by each task might vary. Some tasks will need to complete the multiplication and addition, while other tasks will skip this work. Notice that the data-dependent control flow also creates irregular data accesses to the B and C arrays. While the conditional store to C is essential for correctness, it might be acceptable to load all elements of B even though only a subset will actually be used. In general-purpose code, speculatively loading an element might cause a protection fault, but this is less of a concern for data-parallel accelerators which often have very coarse-grain protection facilities. However, using conditional execution for the entire loop is less attractive, since it can result in significant wasted work. Data-parallel accelerators which can handle irregular control flow are able to execute a wider variety of data-parallel applications, and as with irregular data access, even if this does not significantly improve the irregular portions of an application it can enable much more efficient execution of the regular portions.

Table 2.1g illustrates a technique for transforming irregular DLP into regular DLP. The first irregular loop compresses only those elements for which the condition is true into three temporary arrays. The second loop is then completely regular and iterates over just those elements for which the condition is true. This kind of transformation is often called compression, and the dual transformation where elements are moved from consecutive indices to sparse indices is called expansion. A data-parallel accelerator which has no mechanisms for irregular control flow might need to execute the first loop on the general-purpose processor before being able to execute the second loop on the data-parallel accelerator. Accelerators for which irregular control flow is possible but inefficient might also use this type of transformation, and might even provide special vector compress and expand instructions to facilitate these transformations. There is a fundamental tension between the time and energy required to perform the transformation and the efficiency of handling irregular control flow. Data-parallel accelerators which can efficiently execute irregular control flow can avoid these types of transformations and execute the control flow directly.

Table 2.1h–i illustrates more complicated irregular control flow with nested loops. These loops can be mapped to a data-parallel accelerator in one of two ways: outer loop parallelization, where each task works on a specific iteration of the outer loop, or inner loop parallelization, where each task works on a specific iteration of the inner loop.

In Table 2.1h, the outer loop contains an inner loop with a data-dependent number of iterations. Each element of the array A is a pointer to a separate subarray. The number of elements in each of these subarrays is stored in the array B. So the inner loop sums all elements in a subarray, and the outer loop iterates over all subarrays. In Table 2.1i, each iteration of the outer loop searches the array B for a different value (A[i]) and writes whether or not this value was found to the output array C. In both examples, all tasks are performing similar operations (accumulating subarray elements or searching for a value), but the irregular control flow means that different tasks can have very different amounts of work to do and/or be executing similar operations but at different points in time. Both examples also contain irregular data accesses. This kind of irregular DLP can be challenging for some data-parallel accelerators, but note that both examples still include a small number of regular data accesses and, depending on the data-dependent conditionals, there might actually be quite a bit of control flow regularity, making it worthwhile to map these loops to a data-parallel accelerator.

The loops in Table 2.1j–k are at the opposite end of the spectrum from regular data-level parallelism. The loop in Table 2.1j contains a cross-iteration dependency, since each iteration uses the result of the previous iteration. Even though this results in inter-task dependencies, there is still regular DLP available in this example. We can use regular unit-stride accesses to read the A and B arrays, and all tasks can complete the multiplication in parallel. In Table 2.1k, each iteration of the loop calls an arbitrary function through a function pointer. Although all of the tasks could be doing completely different operations, we can still use a regular data access to load the array B. It might be more reasonable to categorize such loops as task-level parallel or pipeline parallel, where it is easier to reason about the different work each task is performing as opposed to focusing on partitioning the input dataset. (See [MSM05] for an overview of data-level parallelism, task-level parallelism, pipeline parallelism, as well as other parallel programming patterns.)

Clearly there is quite a variety in the types of DLP, and there have been several studies which demonstrate that full DLP applications contain a mix of regular and irregular DLP [SKMB03, KBH+04a, RSOK06, MJCP08]. Accelerators which can handle a wider variety of DLP are more attractive than those which are restricted to just regular DLP for many reasons. First, it is possible to improve performance and energy-efficiency even on irregular DLP. Second, even if the performance and energy-efficiency on irregular DLP is similar to a general-purpose processor, by keeping the work on the accelerator we make it easier to exploit regular DLP inter-mingled with irregular DLP. Finally, a consistent way of mapping both regular and irregular DLP simplifies the programming methodology.
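
As a concrete illustration of the conditional-execution trade-off discussed above, the loop in Table 2.1f can be if-converted so that only the store remains guarded. This is a generic compiler transformation shown here for illustration, not code taken from the thesis; its cost is exactly the wasted work noted above when the condition is mostly false.

    /* Illustrative if-conversion of Table 2.1f: the multiply-add executes
     * unconditionally for every element, and only the store is guarded, so
     * every iteration runs the same instruction sequence and the branch can
     * become a conditional-move/select. */
    void masked_axpy( const int *A, const int *B, int *C, int x, int n )
    {
      for ( int i = 0; i < n; i++ ) {
        int keep = ( A[i] > 0 );
        int t    = x * A[i] + B[i];   /* possibly wasted work when keep == 0 */
        if ( keep )
          C[i] = t;                   /* guarded store preserves correctness */
      }
    }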

2.2 Overall Structure of Data-Parallel Accelerators

Data-parallel accelerators augment a general-purpose processor and are specialized for executing the types of data-parallelism introduced in the previous section. The general-purpose processor distributes data-parallel work to the accelerator, executes unstructured code not suitable for the accelerator, and manages system-level functions. Ideally, data-parallel code should have improved performance and energy-efficiency when executing on the accelerator as compared to the same code running on the general-purpose processor. Figure 2.1 shows various approaches for combining a general-purpose processor with a data-parallel accelerator. Figure 2.1a shows a two-chip solution where the accelerator is a discrete component connected to the general-purpose processor through an off-chip interconnect. In this organization, both the general-purpose processor and the accelerator can share a common pool of main-memory or possibly include dedicated memory for each chip. Both the Berkeley T0 vector processor and the Stanford Imagine stream processor shown in Figures 1.3a and 1.3b are discrete accelerators that serve as coprocessors for a general-purpose processor. These examples include a single data-parallel core, but more recent accelerators use a tiled approach to integrate tens of cores on a single chip. In

(b) Partially Integrated Accelerator

(a) Discrete Accelerator

(c) Fully Integrated Accelerator

Figure 2.1: General-Purpose Processor Augmented with a Data-Parallel Accelerator – (a) discrete accelerators might have their own main-memory and communicate with the general-purpose processor via an off-chip interconnect, (b) partially integrated accelerators share main-memory and communicate with the general-purpose processor via an on-chip interconnect, (c) fully integrated accelerators tightly couple a general-purpose and data-parallel core into a single tile (GPC = general-purpose core, DPC = data-parallel core, LM = local memory, R = on-chip router)

Figure 2.1a, a tile is comprised of one or more cores, some amount of local memory, and an on-chip network router which are then instantiated across both the general-purpose processor and the accelerator in a very regular manner. The NVIDIA Fermi architecture shown in Figure 1.3d includes 32 cores tiled across the discrete accelerator. Unfortunately, discrete accelerators suffer from large startup overheads due to the latency and bandwidth constraints of moving data between the general-purpose processor and the accelerator. The overheads inherent in discrete accelerators have motivated a trend towards integrated accelerators that can more efficiently exploit finer-grain DLP. Figure 2.1b shows an integration strategy where the data-parallel cores are still kept largely separate from the general-purpose cores. An early example, shown in Figure 1.3c, is the IBM Cell processor, which integrated a general-purpose processor with eight data-parallel cores on the same die [GHF+06]. Figure 2.1c shows an even tighter integration strategy where a general-purpose and data-parallel core are combined into a single tile. Industry leaders have indicated their interest in this approach, and it seems likely that fully integrated accelerators will be available sometime this decade [HBK06, amd08]. It is important to distinguish the data-parallel accelerators described so far from previous large-scale multi-node data-parallel machines. Some of these machines are similar in spirit to the patterns presented later in this chapter including multithreaded machines such as the Denelcor HEP machine and Cray MTA machine, and vector machines such as those offered by Cray, CDC, Convex, NEC, Hitachi, and Fujitsu. (See [URv03] for a survey of multithreaded machines and [EVS98, Oya99, HP07, Appendix F.10] for a survey of vector machines.) These machines are expensive supercomputers primarily intended for high-performance scientific computing. They are designed with less emphasis on reducing the energy per task, and more focus on achieving the absolute highest performance possible. Programming such machines can often be cumbersome, especially for irregular DLP on vector machines. Data-parallel accelerators, on the other hand, must be low cost and thus are implemented either on a single-chip or tightly integrated with a general-purpose processor. They must be easy to program and flexibly handle a diverse range of DLP applications with high energy-efficiency. We can learn much from multi-node data-parallel machines, and they inspire some of the architectural design patterns described in this chapter. Data-parallel accelerators, however, offer a new set of opportunities and challenges. Although the communication mechanism between the general-purpose processor and accelerator, the intra-accelerator network, and the accelerator memory system are all critical aspects of an effective implementation, in this thesis I focus on the actual data-parallel cores. These specialized cores are what distinguish accelerators from general-purpose processors and are the key to the accelerator's performance and efficiency advantages. Just as there are many types of data-level parallelism, there are many types of data-parallel cores, each with its own strengths and weaknesses. To help understand this design space, the rest of this chapter presents five architectural patterns for the design of data-parallel cores.

2.3 MIMD Architectural Design Pattern

The multiple-instruction multiple-data (MIMD) pattern is perhaps the simplest approach to building a data-parallel accelerator. A large number of scalar cores are replicated across a single chip. Programmers can map each data-parallel task to a separate core, but without any dedicated DLP mechanisms, it is difficult to gain an energy-efficiency advantage when executing DLP applications. These scalar cores can be extended to support per-core multithreading which helps improve performance by hiding various control flow, functional unit, and memory latencies. Figure 2.2 shows the programmer's logical view and an example implementation for the multithreaded MIMD pattern. I assume that all of the design patterns in this chapter include a host thread as part of the programmer's logical view. The host thread runs on the general-purpose processor and is responsible for application startup, configuration, interaction with the operating system, and managing the data-parallel accelerator. I refer to the threads which run on the data-parallel accelerator as microthreads, since they are lighter weight than the threads which run on the general-purpose processor. Microthreads are under complete control of the user-level application and may have significant limitations on what kind of instructions they can execute. Although a host thread may be virtualized by the operating system, microthreads are never virtualized, meaning there is a one-to-one correspondence between microthreads and hardware thread contexts in the data-parallel accelerator. In this pattern, the host thread communicates with the microthreads through shared memory. The primary advantage of the MIMD pattern is the flexible programming model, and since every core can execute a fully independent task, there should be little difficulty in mapping both regular and irregular DLP applications. This simplifies the parallel programming challenge compared to the other design patterns, but the primary disadvantage is that this pattern does little to address the parallel energy-efficiency challenge.
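To make this logical view concrete, here is a minimal C sketch using POSIX threads; the Work struct, the thread count, and the partitioning scheme below are illustrative assumptions rather than part of the pattern. The host thread publishes a work descriptor in shared memory and spawns the microthreads, and each microthread then performs its own data access and control flow over its partition.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Work descriptor placed in shared memory by the host thread. */
    typedef struct { const int *A, *B; int *C; int x; int n; } Work;
    static Work work;

    static void *microthread(void *arg) {
      int tidx = (int)(long)arg;
      int m  = work.n / NTHREADS;                         /* per-thread partition size    */
      int lo = tidx * m;
      int hi = (tidx == NTHREADS - 1) ? work.n : lo + m;  /* last thread takes leftovers  */
      for (int i = lo; i < hi; i++)                       /* scalar loop, per-thread flow */
        work.C[i] = work.x * work.A[i] + work.B[i];
      return NULL;
    }

    int main(void) {
      enum { N = 10 };
      int A[N] = {1,2,3,4,5,6,7,8,9,10}, B[N] = {0}, C[N];
      work = (Work){ A, B, C, 2, N };

      pthread_t t[NTHREADS];
      for (long i = 0; i < NTHREADS; i++)                 /* host thread launches microthreads */
        pthread_create(&t[i], NULL, microthread, (void *)i);
      for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

      printf("C[0]=%d C[9]=%d\n", C[0], C[9]);            /* prints 2 and 20 */
      return 0;
    }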

(a) Programmer’s Logical View (b) Implementation of 2 Cores Each With 2-Microthreads

Figure 2.2: MIMD Architectural Design Pattern – A general-purpose host thread uses shared memory to manage an array of microthreads. Each scalar microthread works on a different partition of the input dataset and is responsible for its own data access and control flow. Microthreads correspond to physical hardware contexts which are distributed across cores in the implementation. (HT = host thread, µT = microthread)

1  div    m, n, nthreads
2  mul    t, m, tidx
3  add    a_ptr, t
4  add    b_ptr, t
5  add    c_ptr, t
6
7  sub    t, nthreads, 1
8  br.neq t, tidx, notlast
9  rem    m, n, nthreads
10 notlast:
11
12 load   x, x_ptr
13
14 loop:
15 load   a, a_ptr
16 load   b, b_ptr
17 mul    t, x, a
18 add    c, t, b
19 store  c, c_ptr
20
21 add    a_ptr, 1
22 add    b_ptr, 1
23 add    c_ptr, 1
24
25 sub    m, 1
26 br.neq m, 0, loop

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Core, 4-Microthread Implementation

Figure 2.3: Mapping Regular DLP to the MIMD Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1c. Execution diagram starts with load at line 12. Since there are no explicit mechanisms for regular DLP exposed as part of a MIMD instruction set, DLP codes are always encoded as a set of scalar operations. For regular DLP, these operations execute in lock step but with little improvement in energy efficiency. (Assume * ptr and n are inputs. nthreads = total number of microthreads, tidx = current microthread's index, uti = microthread i. Pseudo-assembly line (1–2) determine how many elements each microthread should process, (3–5) set array pointers for this microthread, (7–9) last microthread only processes leftover elements, (12) shared scalar load, (15–16) scalar loads, (17–18) scalar arithmetic, (19) scalar store, (21–23) increment array pointers, (25–26) decrement loop counter and branch if not done.)

1  div    m, n, nthreads
2  mul    t, m, tidx
3  add    a_ptr, t
4  add    b_ptr, t
5  add    c_ptr, t
6
7  sub    t, nthreads, 1
8  br.neq t, tidx, notlast
9  rem    m, n, nthreads
10 notlast:
11
12 load   x, x_ptr
13
14 loop:
15 load   a, a_ptr
16 br.eq  a, 0, done
17
18 load   b, b_ptr
19 mul    t, x, a
20 add    c, t, b
21 store  c, c_ptr
22
23 done:
24 add    a_ptr, 1
25 add    b_ptr, 1
26 add    c_ptr, 1
27
28 sub    m, 1
29 br.neq m, 0, loop

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Core, 4-Microthread Implementation

Figure 2.4: Mapping Irregular DLP to the MIMD Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1f. Execution diagram starts with load at line 12. Irregular DLP maps naturally since each microthread can use scalar branches for complex control flow. The microthreads are coherent (executing in lock-step) before the scalar branch but then diverge after the branch. (Assume * ptr and n are inputs. nthreads = total number of microthreads, tidx = current microthread's index, uti = microthread i. Pseudo-assembly line (1–2) determine how many elements each microthread should process, (3–5) set array pointers for this microthread, (7–9) last microthread only processes leftover elements, (12) shared scalar load, (15–16) check data-dependent conditional and skip work if possible, (18–21) conditionally executed scalar ops, (24–26) increment array pointers, (28–29) decrement loop counter and branch if not done.)

Figure 2.3a demonstrates how we might map the regular DLP loop shown in Table 2.1c to the MIMD pattern. The first ten lines of the pseudo-assembly code divide the work among the microthreads such that each thread works on a different consecutive partition of the input and output arrays. Although more efficient partitioning schemes are possible, we must always allocate the work at a fine-grain in software. Also notice that in the MIMD pattern all microthreads redundantly load the shared scalar value x (line 12). This might seem trivial, but the lack of a specialized mechanism to handle shared loads and possibly also shared computation can adversely impact many regular DLP codes. Similarly there are no specialized mechanisms to take advantage of the regular data accesses. Figure 2.3b shows an example execution diagram for a 2-core, 4-microthread implementation with two-way multithreading. The scalar instructions from each microthread are interleaved in a fixed pattern. Since threads will never bypass values between each other, a fixed thread interleaving can improve energy-efficiency by eliminating the need for some interlocking and bypassing control logic. Unfortunately, a fixed interleaving can also cause poor performance on scalar code or when one or more threads are stalled. A more dynamic scheduling scheme can achieve higher performance by issuing any thread on any cycle, but then we must once again include the full interlocking and bypassing control logic to support back-to-back instructions from the same thread. For this and all execution diagrams in this chapter, we limit the number of functional unit resources for simplification. In this example, each core can only execute one instruction at a time, but a realistic implementation would almost certainly include support for executing multiple instructions at once either through dynamic superscalar issue or possibly static VLIW scheduling, both of which can exploit instruction-level parallelism to enable increased performance at the cost of some energy overhead. Figure 2.4a demonstrates how we might map the irregular DLP loop shown in Table 2.1f to the MIMD pattern. It is very natural to map the data-dependent conditional to a scalar branch which simply skips over the unnecessary work when possible. It is also straightforward to implement conditional loads and stores of the B and C arrays by simply placing them after the branch. Figure 2.4b shows how the microthreads are coherent (execute in lock-step) before the branch and then diverge after the data-dependent conditional with µT1 and µT2 quickly moving on to the next iteration. After a few iterations the microthreads will most likely be completely diverged. The recently proposed 1000-core Illinois Rigel accelerator is a good example of the MIMD pattern with a single thread per scalar core [KJJ+09]. The Rigel architects specifically rationalize the decision to use standard scalar RISC cores with no dedicated hardware mechanisms for regular DLP to enable more efficient execution of irregular DLP [KJJ+09]. Sun's 8-core Niagara processors exemplify the spirit of the multithreaded MIMD pattern with 4–8 threads per core for a total of 32–64 threads per chip [KAO05, NHW+07]. The Niagara processors are good examples of the multithreading pattern, although they are not specifically data-parallel accelerators. Niagara threads are heavier-weight than microthreads, and Niagara is meant to be a stand-alone processor as

opposed to a true coprocessor. Even so, the Niagara processors are often used to execute both regular and irregular DLP codes, and their multithreading enables good performance on these kinds of codes [Wil08]. Niagara processors dynamically schedule active threads, giving priority to the least-recently scheduled thread. Threads can become inactive while waiting for data from cache misses, branches to resolve, or long-latency functional units such as multiplies and divides. These MIMD accelerators can be programmed using general-purpose parallel programming frameworks such as OpenMP [ope08b] and Intel's Thread Building Blocks [Rei07], or in the case of the Rigel accelerator, a custom task-based framework is also available [KJL+09].
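For example, under OpenMP the regular loop from Table 2.1c could be written as below (a sketch; the function and variable names are chosen here purely for illustration). The runtime partitions the iterations across the available hardware threads, replacing the hand-written index arithmetic of Figure 2.3a.

    /* C[i] = x*A[i] + B[i]; compile with an OpenMP-enabled compiler (e.g., -fopenmp). */
    void mimd_regular(const int *A, const int *B, int *C, int x, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
        C[i] = x * A[i] + B[i];
    }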

2.4 Vector-SIMD Architectural Design Pattern

In the vector single-instruction multiple-data (vector-SIMD) pattern a control thread uses vector memory instructions to move data between main memory and vector registers, and vector arithmetic instructions to operate on vectors of elements at once. As shown in Figure 2.5a, one way to think of this pattern is as if each control thread manages an array of microthreads which execute in lock-step; each microthread is responsible for one element of the vector. In this context, microthreads are sometimes referred to as virtual processors [ZB91]. The number of microthreads per control thread, or synonymously the number of elements per vector register, is also called the hardware vector

(a) Programmer’s Logical View (b) Implementation of 2-Lane, 4-Microthread Core

Figure 2.5: Vector-SIMD Architectural Design Pattern – A general-purpose host thread uses shared memory to manage a large number of control threads. Each control thread is in turn responsible for managing a small array of microthreads using a combination of vector memory and arithmetic commands. The input dataset is partitioned across the control threads at a coarse granularity and then partitioned again across the microthreads at a fine granularity. A control thread and its array of microthreads map to a single core in the implementation; the control thread executes on a control processor, and the microthreads are striped both spatially and temporally across some number of vector lanes. Each control thread's array of microthreads executes in lockstep. The vector issue unit, vector lanes, and vector memory unit are often referred to as a vector unit. (HT = host thread, CT = control thread, µT = microthread, CP = control processor, VIU = vector issue unit, VMU = vector memory unit.)

length (e.g., four in Figure 2.5a). As in the MIMD pattern, a host thread runs on the general-purpose processor and manages the data-parallel accelerator. Unlike the MIMD pattern, a host thread in the vector-SIMD pattern only interacts with the control threads through shared memory and does not directly manage the microthreads. This explicit two-level hierarchy is an important part of the vector-SIMD pattern. Even though the host thread and control threads must still allocate work at a coarse-grain amongst themselves via software, this configuration overhead is amortized by the hardware vector length. The control thread in turn distributes work to the microthreads with vector instructions enabling very efficient execution of fine-grain DLP. Figure 2.5b shows a prototypical implementation for a single vector-SIMD core. The control thread is mapped to a control processor (CP) and the microthreads are striped across an array of vector lanes. In this example, there are two lanes and the vector length is four meaning that two microthreads are mapped to each lane. The vector memory unit (VMU) handles executing vector memory instructions which move data between the data memory and the vector register file in large blocks. The vector issue unit (VIU) handles the dependency checking and eventual dispatch of vector arithmetic instructions. Together the VIU, VMU, and the vector lanes are called a vector unit. Vector units can have varying numbers of lanes. At one extreme are purely spatial implementations where the number of lanes equals the hardware vector length, and at the other extreme are purely temporal implementations with a single lane. Multi-lane implementations improve throughput and amortize area overheads at the cost of increased design complexity, control broadcast energy in the VIU, and memory crossbar energy in the VMU. An important part of the vector-SIMD pattern is represented by the queue in Figure 2.5b, which enables the control processor to be decoupled from the vector unit [EV96]. This decoupling allows the control thread to run ahead and better tolerate various latencies, but also means the control thread must use vector memory fence instructions to properly order memory dependencies between the control thread and the microthreads. In addition to decoupling, vector-SIMD control processors can leverage more sophisticated architectural techniques such as multithreading [EV97], out-of-order execution [EVS97], and superscalar issue [QCEV99]. Such control processors accelerate vector execution but also enable higher performance on general-purpose codes. Eventually, vector-SIMD control processors might become so general-purpose that the resulting data-parallel accelerator should really be regarded as a unified accelerator in the spirit of Figure 2.1c. (See [HP07, Appendix F] for a more detailed introduction to vector instruction sets and microarchitecture.) Regular DLP maps very naturally to the vector-SIMD pattern. Figure 2.6a shows the vector-SIMD pseudo-assembly corresponding to the regular loop in Table 2.1c. Since this thesis focuses on a single core, I do not show any of the instructions required for coarse-grain data partitioning across cores. Unit-stride vector memory instructions (lines 5–6,9) efficiently move consecutive blocks of data in and out of vector registers. It is also common to provide special vector instructions for handling strided, indexed, and sometimes segment accesses.

1  load    x, x_ptr
2
3  loop:
4  setvl   vlen, n
5  load.v  VA, a_ptr
6  load.v  VB, b_ptr
7  mul.sv  VT, x, VA
8  add.vv  VC, VT, VB
9  store.v VC, c_ptr
10
11 add     a_ptr, vlen
12 add     b_ptr, vlen
13 add     c_ptr, vlen
14
15 sub     n, vlen
16 br.neq  n, 0, loop

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Lane, 4-Microthread Core

Figure 2.6: Mapping Regular DLP to the Vector-SIMD Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1c. Regular DLP maps naturally using vector memory and arithmetic commands. Each iteration of the stripmine loop works on vlen elements at once. (Assume * ptr and n are inputs. Vi = vector register i, *.v = vector command, *.vv = vector-vector op, *.sv = scalar-vector op, eli = element i. Pseudo-assembly line (1) shared scalar load, (4) set active hardware vector length, (5–6) unit-stride vector loads, (7–8) vector arithmetic, (9) unit-stride vector store, (11–13) increment array pointers, (15–16) decrement loop counter and branch if not done.) Vector-vector arithmetic instructions (line 8) efficiently encode regular arithmetic operations across the full vector of elements, and a combination of a scalar load and a scalar-vector instruction (lines 1,7) can easily handle shared accesses. In the vector-SIMD pattern the hardware vector length is not fixed by the instruction set but is instead stored in a special control register. The setvl instruction takes the application vector length (n) as an input and writes the minimum of the application vector length and the hardware vector length to the given destination register vlen (line 4). As a side-effect, the setvl instruction sets the active vector length which essentially specifies how many of the microthreads are active and should participate in a vector instruction. Software can use the setvl instruction to process the vectorized loop in blocks equal to the hardware vector length without knowing what the actual hardware vector length is at compile time. The setvl instruction will naturally handle the final iteration when the application vector length is not evenly divisible by the hardware vector length; setvl simply sets the active vector length to be equal to the final remaining elements. This technique is called stripmining and enables a single binary to handle varying application vector lengths while still running on many different implementations with varying hardware vector lengths. Figure 2.6b shows the execution diagram corresponding to the regular DLP pseudo-assembly

1  load     x, x_ptr
2
3  loop:
4  setvl    vlen, n
5  load.v   VA, a_ptr
6  load.v   VB, b_ptr
7  cmp.gt.v VF, VA, 0
8
9  mul.sv   VT, x, VA, VF
10 add.vv   VC, VT, VB, VF
11 store.v  VC, c_ptr, VF
12
13 add      a_ptr, vlen
14 add      b_ptr, vlen
15 add      c_ptr, vlen
16
17 sub      n, vlen
18 br.neq   n, 0, loop

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Lane, 4-Element Implementation

Figure 2.7: Mapping Irregular DLP to the Vector-SIMD Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1f. Irregular DLP maps less naturally requiring the use of vector flags to conditionally execute operations for just a subset of the elements. Managing complex nested control flow can be particularly challenging. (Assume * ptr and n are inputs. Vi = vector register i, VF = vector flag register, *.v = vector command, *.vv = vector-vector op, *.sv = scalar-vector op, eli = element i. Pseudo- assembly line (1) shared scalar load, (4) set active hardware vector length, (5–6) unit-stride vector loads, (7) set vector flag register VF based on comparing VA to zero, (9–10) conditional vector arithmetic under flag, (11) conditional vector store under flag, (13–15) increment array pointers, (17–18) decrement loop counter and branch if not done.) for the two-lane, four-microthread implementation pictured in Figure 2.5b. The vector memory commands are broken into two parts: the address portion goes to the VMU which will issue the re- quest to memory while the register write/read portion goes to the VIU. For vector loads, the register writeback waits until the data returns from memory and then controls writing the vector register file two elements per cycle over two cycles. Notice that the VIU/VMU are decoupled from the vector lanes to allow the implementation to overlap processing multiple vector loads. The vector arith- metic operations are also processed two elements per cycle over two cycles. The temporal mapping of microthreads to the same lane is an important aspect of the vector-SIMD pattern. We can eas- ily imagine using a larger vector register file to support longer vector lengths that would keep the vector unit busy for tens of cycles. The fact that one vector command can keep the vector unit busy for many cycles decreases instruction issue bandwidth pressure. So as in the MIMD pattern we can exploit instruction-level parallelism by adding support for executing multiple instructions per

39 microthread per cycle, but unlike the MIMD pattern it may not be necessary to increase the issue bandwidth, since one vector instruction occupies a vector functional unit for many cycles. Almost all vector-SIMD accelerators will take advantage of multiple functional units and also support by- passing (also called vector chaining) between these units. A final point to note is how the control processor decoupling and multi-cycle vector execution enables the control thread to continue exe- cuting while the vector unit is still processing older vector instructions. This decoupling means the control thread can quickly work through the loop overhead instructions (lines 11–16) so that it can start issuing the next iteration of the stripmine loop as soon as possible. Figure 2.6 helps illustrate three ways the vector-SIMD pattern can improve energy-efficiency: (1) the microthreads simply do not execute some operations which must be executed for each ele- ment in the MIMD pattern (e.g., the shared load on line 1, the pointer arithmetic on lines 11–13, and the loop control overhead on lines 15–16); these instructions are instead executed by the control thread either once outside the stripmine loop or once for every vlen elements inside the stripmine loop; (2) for those arithmetic operations that the microthreads do need to execute (e.g., the actual multiplication and addition associated with lines 7–8), the CP and VIU can amortize various over- heads such as instruction fetch, decode, and dependency checking over vlen elements; and (3) for those memory accesses which the microthreads still need to execute (e.g., the unit-stride accesses on lines 5–6,9) the VMU can efficiently move the data in large blocks. (See [LSFJ06] for a first-order quantitative study of the energy-efficiency of vector-SIMD compared to a standard general-purpose processor.) Mapping irregular DLP to the vector-SIMD pattern can be significantly more challenging than mapping regular DLP. Figure 2.7a shows the vector-SIMD pseudo-assembly corresponding to the irregular loop in Table 2.1f. The key difference from the regular DLP example in Figure 2.6 is the use of a vector flag to conditionally execute the vector multiply, addition, and store instructions (lines 9–11). Although this might seem simple, leveraging this kind of conditional execution for more complicated irregular DLP with nested conditionals (e.g., Table 2.1h–k) can quickly require many independent flag registers and complicated flag arithmetic [SFS00]. Figure 2.7b shows the execution diagram for this irregular DLP loop. In this example, microthreads which are inactive because the corresponding flag is false still consume execution resources. This is called a vlen-time implementation, since the execution time of a vector instruction is proportional to the vector length regardless of how many microthreads are inactive. A more complicated density-time implementa- tion skips inactive elements meaning that the execution time is proportional to the number of active elements [SFS00]. Density-time can be quite difficult to implement in a multi-lane vector unit, since the lanes are no longer strictly executing in lock-step. This is one motivation for our interest in single-lane vector units. Note that even though this is an irregular DLP loop, we can still potentially improve energy- efficiency by offloading instructions onto the control thread, amortizing control overheads, and

using vector memory accesses. These savings must be balanced by the wasted energy for processing elements which are inactive. There are, however, opportunities for optimization. It is possible to bypass flag writes so that inactive elements can attempt to use clock or data gating to minimize energy. Once the entire flag register is written it may be possible to discard vector instructions for which no microthreads are active, but the implementation still must fetch, decode, and process these instructions. A density-time implementation will of course eliminate any work for inactive microthreads at the cost of increased control complexity. The Berkeley Spert-II system uses the T0 vector-SIMD processor to accelerate neural network, multimedia, and digital signal processing applications [WAK+96, Asa98]. The T0 instruction set contains 16 vector registers; unit-stride, strided, and indexed vector memory instructions; and a variety of integer and fixed-point vector arithmetic instructions. Conditional vector move instructions enable data-dependent conditional control flow. The T0 microarchitecture uses a simple RISC control processor and eight vector lanes, and it supports a vector length of 32. Three independent vector functional units (two for vector arithmetic operations and one for vector memory operations) enable execution of up to 24 operations per cycle. The Spert-II system includes a Sun Sparc processor for general-purpose computation and the discrete T0 data-parallel accelerator on an expansion card. The Spert-II system uses hand-coded assembly to vectorize critical kernels, but vectorizing compilers are also possible [DL95].
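The stripmine loop structure described earlier in this section can also be sketched in plain C as follows; this is only a behavioral model under an assumed constant HW_VLEN, not actual vector code. The min computation plays the role of setvl, and each inner loop models one vector instruction operating on vlen elements at once.

    #define HW_VLEN 8   /* stand-in for the machine's hardware vector length */

    void vector_simd_regular(const int *A, const int *B, int *C, int x, int n) {
      while (n > 0) {
        int vlen = (n < HW_VLEN) ? n : HW_VLEN;  /* setvl vlen, n */

        for (int i = 0; i < vlen; i++)           /* load.v, mul.sv, add.vv, store.v */
          C[i] = x * A[i] + B[i];

        A += vlen; B += vlen; C += vlen;         /* increment array pointers */
        n -= vlen;                               /* decrement loop counter   */
      }
    }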

2.5 Subword-SIMD Architectural Design Pattern

The subword single-instruction multiple-data (subword-SIMD) architectural pattern is closely related to the vector-SIMD pattern but also captures some important differences. Figure 2.8 illustrates the programmer's logical view and an example implementation for this pattern. As with the vector-SIMD pattern, a host thread manages a collection of control threads which map to control processors each with its own vector-like unit. However, the defining characteristic of the subword-SIMD pattern is that the “vector-like unit” is really a standard full-word scalar datapath with standard scalar registers, often corresponding to a double-precision floating-point unit. The pattern leverages these existing scalar resources in the SIMD unit to execute multiple narrow-width operations in a single cycle. For example, in Figure 2.8 each register in the SIMD unit is 64 bits wide but we can treat this register as containing 8 × 8-bit, 4 × 16-bit, 2 × 32-bit, or 1 × 64-bit operands. The execution resources are similarly partitioned enabling special SIMD instructions to execute vector-like arithmetic operations. Some subword-SIMD variants support bitwidths larger than the widest scalar datatype, in which case the datapath can only be fully utilized with subword-SIMD instructions. Other variants unify the control thread and SIMD unit such that the same datapath is used for control, scalar arithmetic, and subword-SIMD instructions. Unified subword-SIMD saves area but eliminates any chance of control-thread decoupling.
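The subword idea can be sketched in portable C without committing to any particular SIMD extension. The function below treats a 64-bit word as four 16-bit lanes and adds them lane-wise, masking the top bit of each lane so carries cannot cross lane boundaries; this is a common software trick that mirrors what the partitioned hardware adders do.

    #include <stdint.h>
    #include <stdio.h>

    /* Add four packed 16-bit lanes (modulo 2^16 per lane) held in 64-bit words. */
    static uint64_t add_4x16(uint64_t a, uint64_t b) {
      const uint64_t LOW15 = 0x7FFF7FFF7FFF7FFFULL;  /* low 15 bits of each lane  */
      const uint64_t MSB   = 0x8000800080008000ULL;  /* top bit of each lane      */
      uint64_t sum = (a & LOW15) + (b & LOW15);      /* cannot carry across lanes */
      return sum ^ ((a ^ b) & MSB);                  /* fold the lane MSBs back in */
    }

    int main(void) {
      uint64_t a = 0x0001000200030004ULL;
      uint64_t b = 0x0010002000300040ULL;
      printf("%016llx\n", (unsigned long long)add_4x16(a, b)); /* 0011002200330044 */
      return 0;
    }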

In subword-SIMD, the number of elements which can be processed in a cycle is a function of how many elements can be packed into the word width. Since this number can vary depending on element size, the subword-SIMD datapath must be tightly integrated. Data movement into and out of the SIMD register file must be in full-word blocks, and there are often alignment constraints or performance implications for unaligned loads and stores. Software is often responsible for shuffling data elements via special permute operations so that elements are in the correct positions. These factors lead to a large amount of cross-element communication which can limit subword-SIMD widths to around 128–256 bits and also complicate temporal as opposed to purely spatial implementations. As illustrated in Figure 2.8, the full-word SIMD width is exposed to software, requiring code to be rewritten for differently sized SIMD units. Ultimately, a programmer must think about the unit as a single wide datapath as opposed to a vector of elements, and this can make mapping irregular DLP applications to subword-SIMD units challenging. One advantage of subword SIMD is that the full datapath can sometimes be leveraged for non-DLP code which uses full-width operands (e.g., double-precision operations on a 64-bit subword-SIMD datapath). There are several characteristics of the subword-SIMD pattern which help distinguish this pattern from the vector-SIMD pattern. Subword SIMD has short vector lengths which are exposed to software as wide fixed-width datapaths, while vector-SIMD has longer vector lengths exposed to software as a true vector of elements. In vector-SIMD, the vector length is exposed in such a way that the same binary can run on many different implementations with varying hardware resources

(a) Programmer’s Logical View (b) Implementation of Subword-SIMD Core

Figure 2.8: Subword-SIMD Architectural Design Pattern – A general-purpose host thread uses shared memory to manage a large number of control threads. Each control thread has its own SIMD unit which can execute full-word memory commands and subword-SIMD arithmetic commands to process multiple narrow width operations at once. The input dataset is partitioned across the control threads at a coarse granularity and then partitioned again across subword elements at a very fine granularity. A control thread and its SIMD unit map to a single core in the implementation. In the subword-SIMD pattern, the hardware vector length is fixed and part of the instruction set meaning there is little abstraction between the logical view and the actual implementation. (HT = host thread, CT = control thread, CP = control processor, VIU = vector issue unit, VMU = vector memory unit.)

and/or amounts of temporal versus spatial microthreading. Additionally, vector-SIMD has more flexible data-movement operations which alleviate the need for software data shuffling. In this thesis, I focus less on the subword-SIMD pattern, because the vector-SIMD pattern is better suited to applications with large amounts of data-parallelism as opposed to a more general-purpose workload with smaller amounts of data-parallelism. The IBM Cell processor includes a general-purpose processor implementing the standard Power instruction set as well as an array of eight data-parallel cores interconnected by an on-chip ring network [GHF+06]. Each data-parallel core includes a unified 128-bit subword-SIMD datapath which can execute scalar operations as well as 16 × 8-bit, 8 × 16-bit, 4 × 32-bit, or 2 × 64-bit operations. The data-parallel cores can only access memory with aligned 128-bit operations, so special permute operations are required for unaligned accesses. Unaligned or scalar stores require multiple instructions to read, merge, and then write a full 128-bit value. Subword-SIMD conditional moves are provided for data-dependent conditionals, although each data-parallel core's control thread also has standard scalar branches. The Cell processor is a good example of the subword-SIMD pattern for dedicated data-parallel accelerators, but almost all general-purpose processors have also included some form of subword-SIMD for over a decade. Examples include Intel's MMX and SSE extensions for the IA-32 architecture, AMD's 3DNow! extensions for the IA-32 architecture, MIPS' MIPS-3D extensions for the MIPS32 and MIPS64 architectures, HP's MAX extensions for the PA-RISC architecture, and Sun's VIS extensions for the Sparc architecture (see [SS00] for a survey of subword-SIMD extensions in general-purpose processors). These extensions help general-purpose processors better execute data-parallel applications, and are a first step towards the full integration shown in Figure 2.1c. In terms of programming methodology, many modern compilers include intrinsics for accessing subword-SIMD operations, and some compilers include optimization passes that can automatically vectorize regular DLP.

2.6 SIMT Architectural Design Pattern

The single-instruction multiple-thread (SIMT) pattern is a hybrid pattern with a programmer's logical view similar to the MIMD pattern but an implementation similar to the vector-SIMD pattern. As shown in Figure 2.9, the SIMT pattern supports a large number of microthreads which can each execute scalar arithmetic and control instructions. There are no control threads, so the host thread is responsible for directly managing the microthreads (usually through specialized hardware mechanisms). A microthread block is mapped to a SIMT core which contains vector lanes similar to those found in the vector-SIMD pattern. However, since there is no control thread, the VIU is responsible for amortizing overheads and executing the microthreads' scalar instructions in lock-step when they are coherent. The VIU also manages the case when the microthreads execute a scalar branch, possibly causing them to diverge. Microthreads can sometimes reconverge through static hints in the scalar instruction stream or dynamic hardware mechanisms. SIMT only has scalar loads

43 Figure 2.9: SIMT Architectural Design Pattern – A general-purpose host thread manages a large number of microthreads using dedicated hardware schedulers. The input dataset is partitioned across the microthreads at a fine granularity, and each microthread executes scalar instructions. A block of microthreads is mapped to a single core which is similar in spirit to a vector unit. A SIMT vector issue unit can dynamically transform the same scalar instruction executed across a core’s microthreads into vector-like arithmetic commands, and a SIMT vector memory unit can dynamically transform regular data accesses across a core’s microthreads into vector-like memory commands. (HT = host thread, µT = microthread, VIU = vector issue unit, VMU = vector memory unit.) and stores, but the VMU can include a memory coalescing unit which dynamically detects when these scalar accesses can be turned into vector-like memory operations. The SIMT pattern usually exposes the concept of a microthread block to the programmer: barriers are sometimes provided for intra-block synchronization, and application performance depends heavily on the coherence and coalescing opportunities within a microthread block. Regular DLP maps to the SIMT pattern in a similar way as in the MIMD pattern except that each microthread is usually only responsible for a single element as opposed to a range of elements (see Figure 2.10a). Since there are no control threads and thus nothing analogous to the vector-SIMD pattern’s setvl instruction, a combination of dedicated hardware and software is required to man- age the stripmining. The host thread tells the hardware how many microthread blocks are required for the computation and the hardware manages the case when the number of requested microthread blocks is greater than what is available in the actual hardware. In the common case where the ap- plication vector length is not statically guaranteed to be evenly divisible by the microthread block size, each microthread must use a scalar branch to verify that the computation for the corresponding element is actually necessary (line 1). Notice that as with the MIMD pattern, each microthread is responsible for its own address calculation to compute the location of its element in the input and output arrays (lines 3–5). Figure 2.10b shows the execution diagram corresponding to the regular DLP pseudo-assembly for the two-lane, four-element implementation pictured in Figure 2.9b. Scalar branch management corresponding to the branch at line 1 will be discussed later in this section. Without a control thread, all four microthreads redundantly perform address calculations (lines 3–4) and the actual

1  br.gte tidx, n, done
2
3  add    a_ptr, tidx
4  add    b_ptr, tidx
5  add    c_ptr, tidx
6
7  load   x, x_ptr
8
9  load   a, a_ptr
10 load   b, b_ptr
11 mul    t, x, a
12 add    c, t, b
13 store  c, c_ptr
14 done:

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Lane, 4-Microthread Implementation

Figure 2.10: Mapping Regular DLP to the SIMT Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1c. Since there are no explicit mechanisms for regular DLP exposed as part of a SIMT instruction set, DLP codes are always encoded as a set of scalar operations for each microthread. The implementation attempts to execute coherent scalar arithmetic ops with vector-like efficiencies and turn unit-stride scalar accesses into vector-like memory ops with dynamic coalescing. (Assume * ptr and n are inputs. tidx = current microthread's index, uti = microthread i. Pseudo-assembly line (1) branch to handle situations where application vector length n is not evenly divisible by the hardware vector length, (3–5) set array pointers for this microthread, (7) shared scalar load, (9–10) scalar loads, (11–12) scalar arithmetic, (13) scalar store.)
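At the source level, this mapping is usually expressed as a per-microthread function. The plain-C sketch below is only a stand-in for what a SIMT framework would generate, with the microthread index tidx assumed to be supplied by the hardware scheduler.

    /* Per-microthread body corresponding to Figure 2.10a. Each microthread
       handles one element; the guard replaces control-thread stripmining. */
    void ut_regular(int tidx, const int *A, const int *B, int *C,
                    const int *x_ptr, int n) {
      if (tidx >= n) return;       /* br.gte tidx, n, done                    */
      int x = *x_ptr;              /* every microthread reloads the shared x  */
      C[tidx] = x * A[tidx] + B[tidx];
    }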

1  br.gte tidx, n, done
2
3  add    a_ptr, tidx
4  load   a, a_ptr
5  br.eq  a, 0, done
6
7  add    b_ptr, tidx
8  add    c_ptr, tidx
9
10 load   x, x_ptr
11 load   b, b_ptr
12 mul    t, x, a
13 add    c, t, b
14 store  c, c_ptr
15 done:

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Lane, 4-Microthread Implementation

Figure 2.11: Mapping Irregular DLP to the SIMT Pattern – Example (a) pseudo-assembly and (b) ex- ecution corresponding to Table 2.1f. Irregular DLP maps naturally since each SIMT microthread already exclusively uses a scalar instruction set and thus can leverage scalar branches which are dynamically turned into vector-like flags. (Assume * ptr and n are inputs. tidx = current microthread’s index, uti = microthread i. Pseudo-assembly line (1) branch to handle situations where application vector length n is not evenly divis- ible by the hardware vector length, (3–5) check data-dependent conditional and skip work if possible, (7–8) set remaining array pointers for this microthread, (10) shared scalar load, (11) scalar loads, (12–13) scalar arithmetic, (14) scalar store.)

46 scalar load instruction (lines 9–10) even though these are unit-stride accesses. SIMT instruction sets may, however, provide special addressing modes to help streamline this process. The VMU dynamically checks all four addresses, and if they are consecutive, then the VMU will transform these accesses into a single vector-like memory operation. Also notice that since there is no control thread to amortize the shared load at line 7, all four microthreads must redundantly load x. The VMU may be able to dynamically coalesce this into one scalar load which is then broadcast to all four microthreads. Since the microthreads are coherent when they execute the scalar multiply and addition instructions (lines 11–12), the VIU should be able to execute them with vector-like efficiencies. The VMU attempts to coalesce well-structured stores (line 13) as well as loads. Figure 2.10 illustrates some of the issues that can prevent the SIMT pattern from achieving vector-like energy-efficiencies on regular DLP. The microthreads must redundantly execute instruc- tions that would otherwise be amortized onto the control thread (lines 1–7). Regular data accesses are encoded as multiple scalar accesses which then must be dynamically transformed (at some en- ergy overhead) into vector-like memory operations. In addition, the lack of a control thread prevents access-execute decoupling which can efficiently tolerate memory latencies. Even so, the ability to achieve vector-like efficiencies on coherent arithmetic microthread instructions helps improve SIMT energy-efficiency compared to the MIMD pattern. Of course, the SIMT pattern’s strength is on irregular DLP as shown in Figure 2.11. Instead of using vector flags which are cumbersome for complicated control flow, SIMT enables encod- ing data-dependent conditionals as standard scalar branches (lines 1,5). Figure 2.11b shows how the SIMT pattern handles these scalar branches in a vector-like implementation. After issuing the scalar branch corresponding to line 5, the VIU waits for the microthread block to calculate the branch resolution based on each microthread’s scalar data. Essentially, the VIU then turns these branch resolution bits into a dynamically generated vector flag register which is used to mask off inactive elements on either side of the branch. Various SIMT implementations handle the details of microthread divergence differently, but the basic idea is the same. In contrast to vector-SIMD (where the control processor is decoupled from the vector unit making it difficult to access the vector flag registers), SIMT can avoid fetching instructions when the vector flag bits are all zero. So if the entire microthread block takes the branch at line 5, then the VIU can completely skip the instructions at lines 7–14 and start the microthread block executing at the branch target. Conditional memory ac- cesses are naturally encoded by simply placing them after a branch (lines 10–11,14). SIMT instruc- tion sets will likely include conditional execution (e.g., predication or conditional move instructions) to help mitigate branch overhead on short conditional regions of code. Notice that even though the microthreads have diverged there are still opportunities for partial vector-like energy-efficiencies. In this example, the VIU can still amortize by the number of active elements the instruction fetch, decode, and dependency checking for the scalar multiply and addition instructions at lines 12–13. 
Just as in the vector-SIMD pattern, both vlen-time and density-time implementations are possible.
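Continuing the same source-level view (again a plain-C stand-in for framework-generated code, with tidx assumed to come from the hardware), the irregular loop of Figure 2.11a simply adds a data-dependent early return; in the implementation this scalar branch becomes the dynamically generated flag described above.

    /* Per-microthread body corresponding to Figure 2.11a. */
    void ut_irregular(int tidx, const int *A, const int *B, int *C,
                      const int *x_ptr, int n) {
      if (tidx >= n) return;             /* vector-length guard               */
      int a = A[tidx];
      if (a == 0) return;                /* br.eq a, 0, done -- skip the work */
      C[tidx] = (*x_ptr) * a + B[tidx];  /* microthreads reaching here may have diverged */
    }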

47 The SIMT pattern has been recently developed as a way for more general data-parallel pro- grams to take advantage of the large amount of compute resources available in modern graphics processors. For example, the NVIDIA Fermi graphics processor includes 32 SIMT cores each with 16 lanes suitable for graphics as well as more general data-parallel applications [nvi09]. Each mi- crothread block contains 32 microthreads, but there are many more than two microthreads per lane because Fermi (and indeed all graphics processors) have heavily multithreaded VIUs. This means that many microthread blocks are mapped to the same data-parallel core and the VIU is respon- sible for scheduling these blocks onto the execution resources. A block is always scheduled as a unit, and there are no vector-like efficiencies across blocks time-multiplexed onto the same core. This is analogous to mapping many control threads to the same control processor each with their own independent array of microthreads mapped to the same set of vector lanes. Multithreading the VIU allows Fermi to better hide the branch resolution latency illustrated in Figure 2.11b and also hide memory latencies compensating for the lack of control-thread decoupling. Fermi microthreads have a standard scalar instruction set which includes scalar loads and stores, integer arithmetic op- erations, double-precision floating point arithmetic operations, and atomic memory operations for efficient inter-microthread synchronization and communication. Some execution resources (e.g., load/store units and special fixed function arithmetic units) are shared among two cores. Various SIMT frameworks such as Microsoft’s DirectX Compute [Mic09], NVIDIA’s CUDA [NBGS08], Stanford’s Brook [BFH+04], and OpenCL [ope08a] allow programmers to write high-level code for the host thread and to specify the scalar code for each microthread as a specially annotated function. A combination of off-line compilation, just-in-time optimization, and hardware actually executes the data-parallel program. Unfortunately, much of the low-level software, instruction set, and microarchitecture for graphics processors are not publicly disclosed, but the SIMT pattern de- scribed in this section captures the general features of these accelerators based on publicly available information.

2.7 VT Architectural Design Pattern

The vector-thread (VT) pattern is also a hybrid pattern but takes a very different approach as compared to the SIMT pattern introduced in the previous section. Figure 2.12a illustrates the pro- grammer’s logical view for the VT pattern. Like the vector-SIMD pattern, the host thread manages a collection of control threads through shared memory and each control thread in turn manages an array of microthreads. Similar to the vector-SIMD pattern, this allows various overheads to be amortized onto the control thread, and control threads can also execute vector memory commands to efficiently handle regular data accesses. Unlike the vector-SIMD pattern, the control thread does not execute vector arithmetic instructions but instead uses a vector fetch instruction to indicate the start of a scalar instruction stream which should be executed by the microthreads. The microthreads execute coherently, but as in the SIMT pattern, they can also diverge after executing scalar branches.

(a) Programmer's Logical View (b) Implementation of 2-Lane, 4-Microthread Core

Figure 2.12: VT Architectural Design Pattern – A general-purpose host thread uses shared memory to manage a large number of control threads. Each control thread is in turn responsible for managing a small array of microthreads using a combination of vector memory commands and vector fetch commands. The input dataset is partitioned across the control threads at a coarse granularity and then partitioned again across the microthreads at a fine granularity. A control thread and its array of microthreads map to a single core which is similar in spirit to a vector-SIMD core. A control thread’s array of microthreads begins executing in lockstep with each vector fetch but can diverge when microthreads execute data-dependent control flow. (HT = host thread, CT = control thread, µT = microthread, CP = control processor, VIU = vector issue unit, VMU = vector memory unit.)

Figure 2.12b shows a typical implementation for a two-lane VT core with four microthreads. The control thread is mapped to a decoupled control processor and the microthreads are mapped both spatially and temporally to the vector lanes. Vector memory instructions use the VMU to execute in a very similar fashion as with vector-SIMD and thus achieve similar efficiencies. The VIU has its own port to the instruction memory which allows it to process vector fetch instructions. For vector-fetched instructions which are coherent across the microthreads, the implementation should achieve similar efficiencies as a vector-SIMD implementation. As with the MIMD and SIMT patterns, VT allows scalar branches to encode complex data-dependent control flow which can simplify the programming methodology. Compared to a vector-SIMD implementation, VT requires additional control complexity in the VIU to handle the microthreads' scalar instruction set but should have similar efficiency on regular DLP and simplifies mapping irregular DLP. Regular DLP maps very naturally to the VT pattern. Figure 2.13a shows the VT pseudo-assembly corresponding to the regular DLP loop in Table 2.1c. Stripmining (line 5), loop control (lines 11–16), and regular data accesses (lines 6–7,9) are handled just as in the vector-SIMD pattern. Instead of vector arithmetic instructions, we use a vector fetch instruction (line 8) with one argument which indicates the instruction address at which all microthreads should immediately start executing (e.g., the instruction at the ut_code label). All microthreads execute these scalar instructions (lines 20–21) until they reach a stop instruction (line 22). An important part of the VT pattern

49 is the interaction between vector registers as accessed by the control thread, and scalar registers as accessed by each microthread. In this example, the unit-stride vector load at line 6 writes the vector register VA with vlen elements. Each microthread’s scalar register a implicitly refers to that microthread’s element of the vector register (e.g., µT0’s scalar register a implicitly refers to the first element of the vector register VA and µT(vlen-1)’s scalar register a implicitly refers to the last element of the vector register VA.). In other words, the vector register VA as seen by the control thread and the scalar register a as seen by the microthreads are two views of the same register. The microthreads cannot access the control thread’s scalar registers, since this would significantly com- plicate control processor decoupling. Shared accesses are thus communicated with a scalar load by the control thread (line 1) and then a scalar-vector move instruction (line 2) which copies the given scalar register value into each element of the given vector register. This scalar value is then available for all microthreads to reference (e.g., as microthread scalar register z on line 20). Figure 2.13b illustrates how this regular DLP loop would execute on the implementation pic- tured in Figure 2.12b. An explicit scalar-vector move instruction (line 2) writes the scalar value into each element of the vector register two elements per cycle over two cycles. The unit-stride vector load instructions (lines 6–7) execute exactly as in the vector-SIMD pattern. The control processor then sends the vector fetch instruction to the VIU. The VIU fetches the scalar multiply and addi- tion instructions (lines 20–21) and issues them across the microthreads. When the VIU fetches and decodes the stop instruction (line 22) it moves on to process the next command waiting for it in the vector command queue. As in vector-SIMD, VT supports control processor decoupling, and by offloading the microthread instruction fetch to the VIU, the control processor is actually able to more quickly run-ahead to the next set of instructions. Figure 2.13 demonstrates how VT achieves vector-like energy-efficiency on regular DLP. Con- trol instructions are executed once by the control thread per-loop (lines 1–2) or per-iteration (lines 11– 16). Because the microthreads are coherent in this example, the VIU is able to amortize instruction fetch, decode, and dependency checking for vector arithmetic instructions (lines 20–21). VT uses the exact same vector memory instructions to efficiently move blocks of data between memory and vector registers (lines 6–7,9). There are however some overheads including the extra scalar-vector move instruction (line 2), vector fetch instruction (line 8), and microthread stop instruction (line 22). Irregular DLP also maps naturally and potentially more efficiently to the VT pattern, since one can use scalar branches to encode complicated control flow. Figure 2.14a shows the VT pseudo- assembly corresponding to the irregular loop in Table 2.1f. The primary difference from the regular DLP loop is that this example includes a scalar branch as the first vector-fetched instruction on line 20. Microthreads thus completely skip the multiplication and addition if possible. The condi- tional store is encoded by simply placing the store after the branch (line 24) similar to the MIMD and SIMT examples. 
As with the SIMT pattern, VT instruction sets will likely include conditional execution (e.g., predication or conditional move instructions) to help mitigate branch overhead on

1  load    x, x_ptr
2  mov.sv  VZ, x
3
4  loop:
5  setvl   vlen, n
6  load.v  VA, a_ptr
7  load.v  VB, b_ptr
8  fetch.v ut_code
9  store.v VC, c_ptr
10
11 add     a_ptr, vlen
12 add     b_ptr, vlen
13 add     c_ptr, vlen
14
15 sub     n, vlen
16 br.neq  n, 0, loop
17 ...
18
19 ut_code:
20 mul     t, z, a
21 add     c, t, b
22 stop

(a) Pseudo-Assembly

(b) Execution Diagram for 2-Lane, 4-Microthread Implementation

Figure 2.13: Mapping Regular DLP to the VT Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1c. Regular DLP maps naturally using vector memory and vector-fetched scalar arithmetic commands. Each iteration of the stripmine loop works on vlen elements at once. (Assume * ptr and n are inputs. Vi = vector register i, *.v = vector command, *.sv = scalar-vector op, uti = microthread i. In vector-fetched microthread code, scalar register i corresponds to vector register Vi. Pseudo-assembly line (1–2) shared load and move scalar x to all elements in vector register VZ, (5) set active hardware vector length, (6–7) unit-stride vector loads, (8) vector fetch so that all microthreads start executing the scalar ops at label ut_code, (9) unit-stride vector store, (11–13) increment array pointers, (15–16) decrement loop counter and branch if not done, (20–21) microthread scalar ops, (22) stop executing microthreads and continue to next vector command.) short conditional regions of code. Notice that we must explicitly pass the c_ptr base address so that each microthread can calculate an independent store address (line 23) when the branch is not taken. The vector-SIMD pattern uses a conditional vector store instruction to amortize some of this address calculation overhead. In this example, loading the B array is hoisted out of the conditional code (line 7). Depending on how often the branch is taken, it might be more efficient to transform this vector load into a vector-fetched scalar load so that we only fetch data which is actually used. The MIMD and SIMT examples use this approach, while the vector-SIMD example also encodes the load as a unit-stride vector memory instruction. Figure 2.14b shows the execution diagram corresponding to the irregular DLP loop. Similar to the SIMT pattern, the VIU must wait until all microthreads resolve the scalar branch to determine how to proceed. If all microthreads either take or do not take the branch, then the VIU can simply

(a) Pseudo-Assembly:

 1  load x, x_ptr
 2  mov.sv VZ, x
 3
 4  loop:
 5  setvl vlen, n
 6  load.v VA, a_ptr
 7  load.v VB, b_ptr
 8  mov.sv VD, c_ptr
 9  fetch.v ut_code
10
11  add a_ptr, vlen
12  add b_ptr, vlen
13  add c_ptr, vlen
14
15  sub n, vlen
16  br.neq n, 0, loop
17  ...
18
19  ut_code:
20  br.eq a, 0, done
21  mul t, z, a
22  add c, t, b
23  add d, tidx
24  store c, d
25  done:
26  stop

(b) Execution Diagram for 2-Lane, 4-Microthread Implementation

Figure 2.14: Mapping Irregular DLP to the VT Pattern – Example (a) pseudo-assembly and (b) execution corresponding to Table 2.1f. Irregular DLP also maps naturally since each microthread can use standard scalar branches for data-dependent control flow. (Assume *_ptr and n are inputs. Vi = vector register i, *.v = vector command, *.sv = scalar-vector op, eli = microthread i, tidx = current microthread's index. In vector-fetched microthread code, scalar register i corresponds to vector register Vi. Pseudo-assembly line (1–2) shared load and move scalar x to all elements of vector register VZ, (5) set active hardware vector length, (6–7) unit-stride vector loads, (8) move array C base pointer to all elements of vector register VD, (9) vector fetch so that all microthreads start executing the scalar ops at label ut_code, (11–13) increment array pointers, (15–16) decrement loop counter and branch if not done, (20) check data-dependent conditional and skip work if possible, (21–22) microthread scalar ops, (23–24) scalar store, (26) stop executing microthreads and continue to next vector command.)

start fetching from the appropriate address. If some microthreads take the branch while others do not, then the microthreads diverge and the VIU needs to keep track of which microthreads are executing which side of the branch. Although dynamic reconvergence detection hardware or static reconvergence hints in the microthread instruction stream are possible, they are not required for good performance on most applications. Each new vector fetch instruction naturally causes the microthreads to reconverge. As with the vector-SIMD and SIMT patterns, VT can also provide vlen-time and density-time implementations.

The VT architectural design pattern was developed as part of our early work on intermingling vector and threaded execution mechanisms. VT was proposed independently from the SIMT pattern, although we have since recognized some of the similarities between the two approaches. The Scale VT processor was our first implementation of the VT pattern [KBH+04a, BKGA04, KBA08]. Scale includes a specialized microthread instruction set with software-exposed clustered functional units, sophisticated predication, and explicit thread-fetches instead of scalar branches. Thread-fetches essentially enable variable-length branch-delay slots that help hide microthread control flow latencies. The Scale implementation includes a simple RISC control processor and a four-lane VTU. Each lane has four execution clusters resulting in a peak vector unit throughput of 16 operations per cycle. Scale's programming methodology uses either a combination of compiled code for the control thread and hand-coded assembly for the microthreads, or a preliminary version of a vectorizing compiler written specifically for Scale [HA08].

2.8 Comparison of Architectural Design Patterns

This chapter has presented five architectural patterns for the design of data-parallel cores. Table 2.2 summarizes the mechanisms provided by each pattern to exploit regular and irregular DLP. Since the MIMD pattern has no explicit regular DLP mechanisms, programmers simply use scalar memory, arithmetic, and control instructions regardless of whether the application exhibits regular or irregular DLP. The MIMD pattern can include conditional execution such as predication or conditional move instructions to help mitigate branch overheads in irregular control flow. The vector-SIMD pattern is at the opposite end of the spectrum with a significant focus on mechanisms for efficiently exploiting regular DLP. Notice that scalar memory and arithmetic instructions executed on the control thread can also be used to exploit regular DLP by amortizing shared loads and computation. The vector-SIMD pattern relies on indexed vector memory accesses, conditional loads/stores, and vector flags or vector conditional move instructions for irregular DLP. Very complicated irregular DLP might need to be executed serially on the control thread. The subword-SIMD pattern is similar in spirit to the vector-SIMD pattern but uses subword and full-word operations instead of true vector instructions. Irregular data accesses are usually encoded with a combination of full-word memory operations with subword permute instructions. The SIMT pattern lacks a control thread, meaning that only microthread instructions are available for exploiting DLP.

              Regular Data           Regular Control         Irregular Data          Irregular Control
              Access                 Flow                    Access                  Flow

MIMD          µT scalar ld/st        µT scalar instr         µT scalar ld/st         µT scalar branch,
                                                                                     µT cexec scalar instr

Vector-SIMD   vector unit-stride,    vector instr            vector indexed ld/st,   vector cexec instr
              strided ld/st                                  vector cexec ld/st

Subword-SIMD  full-word ld/st        subword instr           full-word ld/st with    subword cexec instr
                                                             subword permute instr

SIMT          µT scalar ld/st with   µT scalar instr with    µT scalar ld/st         µT scalar branches,
              HW coalescing          HW coherence                                    µT cexec scalar instr

VT            vector unit-stride,    vector-fetched µT       µT scalar ld/st         vector-fetched µT
              strided ld/st          scalar instr with                               scalar branches,
                                     HW coherence                                    µT cexec instr

Table 2.2: Mechanisms for Exploiting Data-Level Parallelism – Each architectural design pattern has different mechanisms for handling both regular and irregular DLP. Note that control thread instructions can be used for regular DLP to refactor shared data accesses and computation. (µT = microthread, CT = control thread, ld/st = loads/stores, instr = instruction, cexec = conditionally executed (e.g., predication or conditional move instructions), HW = hardware)

Scalar instructions with hardware coherence (i.e., instructions executed with vector-like efficiencies) and memory coalescing are critical to achieving good performance and low energy with the SIMT pattern. The VT pattern uses its control thread to execute vector memory instructions and also to amortize control overhead, shared loads, and shared computation. Regular DLP is encoded as vector-fetched coherent microthread instructions and irregular DLP is encoded with vector-fetched scalar branches. As with the other patterns, both SIMT and VT can include conditionally executed microthread instructions to help mitigate branch overhead.

Based on a high-level understanding of these design patterns it is possible to abstractly categorize the complexity required for mapping regular and irregular DLP. The MIMD and SIMT patterns are the most straightforward, since the programmer's view is simply a collection of microthreads. Regular DLP maps easily to vector-SIMD and subword-SIMD, but irregular DLP can be much more challenging. Our hope is that mapping both regular and irregular DLP should be relatively straightforward for the VT pattern. However, there are still some complexities when using the VT pattern

including refactoring scalar microthread memory operations into vector memory instructions, and partitioning code into multiple vector-fetched blocks.

We can also qualitatively predict where a core using each pattern might fall in the energy-performance space first described in Figure 1.2. Compared to a single-threaded MIMD core, a multithreaded MIMD core will result in higher performance by hiding various execution latencies, but this comes at some energy cost due to the larger physical register file. Thread scheduling might either add overhead or reduce overhead if a simple scheduler enables less pipeline and bypassing logic. The general energy-performance trends should be similar for both regular and irregular DLP. A vector-SIMD core should achieve lower energy per task on regular DLP and higher performance due to control processor decoupling and hiding execution latencies through vector length. A multi-lane vector-SIMD core will improve throughput and potentially result in additional control amortization, but comes at the cost of increased cross-lane broadcast and VIU/VMU complexity. It is important to note that even a single-lane vector-SIMD core should improve performance at reduced energy per task on regular DLP. On irregular DLP the energy-performance advantage of the vector-SIMD pattern is much less clear. Very irregular DLP loops might need to be mapped to the control thread resulting in poor performance and energy-efficiency. Irregular DLP mapped with vector flags and vector conditional move instructions may come at energy costs proportional to the vector length. Density-time implementations have their own sources of overhead. The subword-SIMD pattern can be seen as a less effective version of the vector-SIMD pattern. Subword-SIMD has short vector lengths, may not support control thread decoupling, and has potentially even less support for irregular DLP so we can expect subword-SIMD to lie between MIMD and vector-SIMD in the energy-performance space. Although the SIMT pattern can perform better than the vector-SIMD pattern on irregular DLP, SIMT requires extra hardware to dynamically exploit regular DLP in the microthread instruction stream. Without a control thread there is less opportunity for amortizing overheads. Thus we can expect the SIMT pattern to perform better than vector-SIMD on very irregular DLP, but worse on regular DLP. The VT pattern attempts to preserve some of the key features of the vector-SIMD pattern that improve efficiency on regular DLP (e.g., control thread, vector memory instructions, and vector-fetched coherent microthread instructions) while also allowing a natural way to efficiently exploit irregular DLP with vector-fetched scalar branches.
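To make the conditionally executed ("cexec") mechanisms in Table 2.2 concrete, the following sketch shows the same data-dependent update written with an explicit branch and in a branch-free form (the function names and types are assumptions); the second version is what predication or conditional-move instructions express at the instruction level, and it is attractive only when the guarded region is short:

  // Branchy form: becomes a scalar branch around the update (MIMD, SIMT, VT).
  int update_with_branch(int a, int b, int x, int c) {
    if (a != 0)
      c = x * a + b;
    return c;
  }

  // Branch-free form: the result is always computed and then selected, which is
  // what conditionally executed (predicated or conditional-move) code encodes.
  int update_with_select(int a, int b, int x, int c) {
    int t = x * a + b;
    return (a != 0) ? t : c;  // a compiler can lower this to a conditional move
  }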

2.9 Example Data-Parallel Accelerators

The previous sections discussed one or two accelerators which best exemplified each pattern including the Illinois Rigel 1000-core accelerator [KJJ+09], the Sun Niagara multithreaded proces- sors [KAO05, NHW+07], the Berkeley Spert-II system with the T0 vector processor [WAK+96, Asa98], the IBM Cell subword-SIMD accelerator [GHF+06], the NVIDIA Fermi graphics proces- sor [nvi09], and the Scale VT processor [KBH+04a, KBA08]. In this section, I briefly describe other data-parallel accelerators and then illustrate which set of patterns best characterize the pro-

grammer's logical view and implementation. As we will see, most accelerators actually leverage aspects from more than one pattern.

MIT Raw & Tilera TILE64 Processors – The MIT Raw processor is a classic example of the MIMD pattern with 16 simple RISC cores connected by multiple mesh networks [TKM+03]. Al- though Raw lacks dedicated hardware support for exploiting DLP, much of its evaluation work- load included a combination of regular and irregular DLP [TLM+04, GTA06]. The commercial successor to Raw is the Tilera TILE64 processor with 64 simple cores also connected by multi- ple mesh networks [BEA+08]. TILE64 adds limited DLP support in the form of unified 32-bit subword-SIMD, which allows the primary scalar integer datapath to execute 4 × 8-bit, 2 × 16-bit, or 1 × 32-bit operations per cycle. TILE64 is targeted towards the video and network processing application domains. Both the Raw and TILE64 processors also include a novel statically ordered on-chip network. This network enables data to be efficiently streamed between cores and is partic- ularly well suited to spatially mapping graphs of pipeline-parallel and task-parallel kernels. When viewed with this mechanism in mind, these processors resemble the systolic-array architectural pat- tern. The systolic-array pattern was not discussed in this chapter, but has been used for data-parallel accelerators in the past. This pattern streams data through an array of processing elements to per- form the required computation.

Cisco CRS-1 Metro and Intel IXP-2800 Network Processors – Although network processors are less general than most of the accelerators discussed in this chapter, they are beginning to employ arrays of standard programmable processors. Packet processing is to some degree a data-parallel application, since an obvious approach is to partition incoming packets such that they can be processed in parallel. However, this application is a good example of highly irregular DLP, because packets arrive skewed in time and the processing per packet can vary widely. Thus it is not surprising that network processors have leaned towards the MIMD pattern with many simple scalar or multithreaded cores. The Intel IXP-2800 network processor includes an integrated general-purpose XScale processor for executing the host thread and 16 multithreaded engines specialized for packet processing [int04]. The Cisco CRS-1 Metro network processor leverages 188 scalar Tensilica Extensa cores, with packets being distributed one per core [Eat05].

Alpha Tarantula Vector Processor – Tarantula was an industrial research project that explored adding vector-SIMD extensions to a high-performance Alpha processor with the specific purpose of accelerating data-parallel applications [EAE+02]. In contrast to the subword-SIMD extensions common in today's general-purpose processors [SS00], Tarantula was much closer to the vector-SIMD pattern. Although it was meant to be used with one core per chip, the sophisticated general-purpose control processor means it can be seen as a first step towards the full integration shown in Figure 2.1c. Tarantula has several interesting features including tight integration with an out-of-order superscalar multithreaded control processor, an address reordering scheme to minimize

memory bank conflicts, and extensive support for masked vector operations. The Tarantula instruction set includes 32 vector registers each with 128 × 64-bit elements. The vector unit uses 16 vector lanes and several independent vector functional units; the physical register file includes more than 32 vector registers to support the multithreaded control processor.

Stanford Imagine & SPI Storm-1 Stream Processors – Stream processors are optimized for a subset of data-parallel applications that can be represented as streams of elements flowing between computational kernels. These processors have a very similar implementation to the vector-SIMD pattern. The primary difference is an emphasis on a software-controlled memory hierarchy that enables stream processors to carefully manage bandwidth utilization throughout the memory system. Additionally, stream processors usually have a dedicated hardware interface between the general-purpose host processor and the accelerator's control processor. Irregular DLP is handled with vector masks or conditional streams (similar to vector compress and expand operations). The Stanford Imagine processor includes a small control processor and eight vector lanes, and it supports a hardware vector length of eight [RDK+98]. The commercial successor to Imagine is the SPI Storm-1 processor which increases the number of vector lanes to 16 and integrates the general-purpose host processor on-chip [KWL+08]. Both processors contain vector functional units that are statically scheduled through a VLIW encoding and include a software-exposed clustered vector register file. These processors include novel vector memory instructions that efficiently stage segments in a large software-controlled stream register file (similar to a backup vector register file). Data-parallel accelerators based on stream processors can be thought of as adhering to the main features of the vector-SIMD pattern, albeit with some extra features suitable for the more limited streaming programming model.

Berkeley VIRAM Vector Processor – The Berkeley VIRAM processor is a classic instance of the vector-SIMD pattern with four 64-bit vector lanes and a hardware vector length of 32 [Koz02, KP03]. It abstracts the hardware vector length with an instruction similar to setvl, and it in- cludes vector flag, compress, and expand operations for conditional execution. However, VIRAM resembles the subword-SIMD pattern in one key aspect, since it allows software to specify the ele- ment width of vector operations. The vector length can thus range from 32 × 64-bit, to 64 × 32-bit, to 128 × 16-bit. Other large-scale vector-SIMD machines, such as the CDC STAR-100 and the TI ASC, have included a similar feature that enabled a 64-bit lane to be split into two 32-bit lanes [HP07, Appendix F.10].

NVIDIA G80 Family Graphics Processors – NVIDIA first proposed the SIMT pattern when in- troducing the G80 family of graphics processors as a more general platform for data-parallel ac- celeration [LNOM08]. Specifically, the GeForce 8800 GPU was one of the first GPUs to unify the various specialized graphics pipelines (e.g., vertex, geometry, and pixel pipelines) into a single unified programmable shader that could also be programmed for general compute operations with

a general-purpose programming language. The GeForce 8800's VIU is highly multithreaded to help hide various execution latencies. One of the key reasons the NVIDIA G80 family was quickly adopted for use as data-parallel accelerators was the high-level explicitly data-parallel CUDA programming methodology [NBGS08].

ATI Radeon R700 Family Graphics Processor – The ATI Radeon R700 family of graphics pro- cessors closely resembles the SIMT pattern [ati09]. One extension is that each microthread executes five-way VLIW instructions. As discussed in the SIMT pattern, no control thread is exposed and mi- crothreads can execute control instructions which are managed by the VIU. Similar to the NVIDIA GPUs, the VIU is highly multithreaded to help hide various execution latencies. ATI’s program- ming methodology was initially very different from NVIDIA’s high-level CUDA framework. ATI released the raw hardware instruction set as part of its Close-to-the-Metal initiative [ati06], and hoped third-parties would develop the software infrastructure necessary for programmers to use the ATI GPUs as data-parallel accelerators. More recently, ATI has been leveraging OpenCL [ope08a] to simplify their programming methodology.

Intel Larrabee Graphics Processor – The future Intel accelerator known as Larrabee is initially planned for use in graphics processors, but Larrabee's programmable data-parallel cores make it suitable for accelerating a wider range of data-parallel applications [SCS+09]. Larrabee includes tens of data-parallel cores each with its own 16-lane vector unit. Like vector-SIMD, it has fixed-width elements (32 b) and more sophisticated vector memory instructions to support indexed accesses. Like subword-SIMD, it has a fixed vector length exposed as part of the instruction set and includes vector permute instructions, which require significant inter-microthread communication. Larrabee includes vector flags for conditional execution. One interesting aspect of Larrabee is that it uses a four-way multithreaded control processor, and thus each control thread has its own independent set of vector registers. Even though it is common for some of the threads to be executing the same code (i.e., exploiting DLP across both the control threads and the microthreads), the Larrabee architects found it advantageous to choose multithreading over longer vector lengths to enable better handling of irregular DLP.

ClearSpeed CSX600 Processor – The CSX600 data-parallel accelerator includes a specialized control processor that manages an array of 96 SIMD processing elements [cle06]. A common front-end distributes identical instructions to all of the processing elements. Instead of vector memory instructions, each processing element can independently load from or store to its own private local memory. There is, however, support for an independently controlled direct-memory access engine which can stream data into and out of a staging area. Irregular control flow is handled with a flag stack to simplify nested conditionals. The CSX600 is an instance of the distributed-memory SIMD (DM-SIMD) pattern not discussed in this chapter. As illustrated by the CSX600, the DM-SIMD pattern includes an amortized front-end with an array of processing engines similar to vector

lanes, but each lane has its own large local memory.

TRIPS Processor – The TRIPS processor has a specific emphasis on flexibly supporting instruction- level, data-level, and task-level parallelism [SNL+03]. Although TRIPS has been specifically ad- vocated as a data-parallel accelerator [SKMB03], it does not conveniently fit into any of the archi- tectural patterns introduced in this chapter. It can support various types of parallelism, but it does so with coarse-grain modes, while the patterns in this chapter avoid modes in favor of intermin- gling regular and irregular DLP mechanisms. TRIPS lacks a control processor for amortizing con- trol overheads, has nothing equivalent to vector memory operations, and relies on a small caching scheme to amortize instruction fetch for regular DLP. When in DLP mode, TRIPS is unable to sup- port arbitrary control flow as in the SIMT and VT patterns. While TRIPS can flexibly support many different types of parallelism, the patterns in this chapter focus on exploiting DLP as the primary goal.

Chapter 3

Maven: A Flexible and Efficient Data-Parallel Accelerator

Maven is a new instance of the vector-thread (VT) architectural design pattern well suited for use in future data-parallel accelerators. Maven is a malleable array of vector-thread engines that can scale from a few to hundreds of flexible and efficient VT cores tiled across a single chip. As shown in Figure 3.1, each Maven VT core includes a control processor, a vector-thread unit, and a small L1 instruction cache. L2 cache banks are also distributed across the accelerator, and an on-chip network serves to connect the cores to the cache banks. Although the memory system and on-chip network are important aspects of the Maven accelerator, in this thesis, I focus on the instruction set, microarchitecture, and programming methodology of a single Maven VT core. These data-parallel cores are one of the key distinguishing characteristics of data-parallel accelerators as compared to general-purpose processors. In this chapter, I introduce the three primary techniques for building simplified instances of the VT pattern: a unified VT instruction set architecture, a single-lane VT microarchitecture based on the vector-SIMD pattern, and an explicitly data-parallel VT program- ming methodology. These simplified VT cores should be able to combine the energy-efficiency of SIMD accelerators with the flexibility of MIMD accelerators.

3.1 Unified VT Instruction Set Architecture

In the Maven unified VT instruction set, both the control thread and the microthreads execute the same scalar instruction set. This is significantly different from our earlier work on the Scale VT processor, which uses a specialized microthread instruction set. Although an elegant feature in itself, a unified VT instruction set can also simplify the microarchitecture and programming methodology. The microarchitecture can be more compact by sharing common functional units between the control thread and the microthreads, and the microarchitecture development can also be simplified by refactoring common hardware modules such as the instruction decode logic.

Figure 3.1: Maven Data-Parallel Accelerator – Maven is a malleable array of vector-thread engines with each Maven VT core including a control processor, a vector-thread unit, and a small L1 instruction cache. An on-chip network connects the cores to each other and to the shared L2 cache. (CP = control processor, VTU = vector-thread unit, L1 = private L1 instruction cache, L2 = shared L2 cache)

A unified VT instruction set can be easier for programmers and compilers to reason about, since both types of thread execute the same kind of instructions. Similarly, a single compiler infrastructure can be used when compiling code for both types of thread, although code annotations might be useful for performance optimizations specific to one type of thread. For example, we might be able to improve performance by scheduling some instructions differently for the microthreads versus the control thread. This can also allow the same compiled code to be executed by both types of thread, which leads to better code reuse and smaller binaries. Of course the instruction sets cannot be completely unified. The control thread instruction set is a proper superset of the microthread instruction set because microthreads are not able to execute vector instructions and system instructions.

There are, however, some challenges in designing a unified VT instruction set. Minimizing architectural state increases in importance, since it can significantly impact the supported number of microthreads. Even a well designed unified VT instruction set will unfortunately give up some opportunities for optimization via specialized control-thread and microthread instructions. For example, Scale provided multi-destination instructions that are only really suitable for its software-exposed clustered architecture. Scale also provided a specialized microthread control flow instruction that simplified hiding the branch resolution latency. As another example, we might choose to increase the number of control-thread scalar registers relative to the number of microthread scalar registers. This could improve control-thread performance on unstructured code with instruction-level parallelism.
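One concrete benefit of executing the same code on both types of thread can be sketched as follows. The utidx instruction described in Section 4.1 returns the index of the executing microthread and -1 on the control thread, so a routine compiled once can key any thread-specific behavior off that value (the C++ intrinsic wrapper shown here is hypothetical, not part of the actual Maven toolchain):

  // Stub standing in for a hypothetical intrinsic wrapping the Maven utidx
  // instruction: assumed to return the microthread index, or -1 when executed
  // by the control thread.
  int maven_utidx() { return -1; }

  // Because the control thread and the microthreads share one scalar
  // instruction set, this routine can be compiled once and run on either.
  int checksum(const int* data, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
      sum += data[i];
    if (maven_utidx() == -1) {
      // behavior that only makes sense on the control thread would go here
    }
    return sum;
  }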

3.2 Single-Lane VT Microarchitecture Based on Vector-SIMD Pattern

One of the first design decisions when applying the VT architectural design pattern to a large-scale accelerator is choosing the size of each data-parallel core. There is a spectrum of possible

(a) Single Core with 4-Lane VTU
(b) Four Cores Each with a Single-Lane VTU

Figure 3.2: Multi-Lane versus Single-Lane Vector-Thread Units – A single core with a four-lane VTU has the same ideal vector compute and memory throughput as four cores each with a single-lane VTU. The multi-lane VTU might have better energy-efficiency and performance, since it can support longer hardware vector lengths (e.g., 16 µTs) compared to the single-lane VTUs (e.g., 4 µTs each), but single-lane VTUs can still be very competitive when executing regular DLP. The advantage of single-lane VTUs is that they are easier to implement than multi-lane VTUs especially with respect to efficiently executing irregular DLP. (CP = control processor, µT = microthread, VIU = vector issue unit, VMU = vector memory unit, VTU = vector-thread unit)

grain sizes from coarse-grain designs with few cores each containing many VT lanes and many microthreads per lane, to fine-grain designs with many cores each containing few VT lanes and few microthreads per lane. At the extremely coarse-grain end of the spectrum, we might use a single VT core with one control processor managing hundreds of lanes. Although such an accelerator would support very long hardware vector lengths, it would likely have worse energy-efficiency compared to multiple medium-sized VT cores. Control overheads amortized by the control processor and VIU would probably be outweighed by broadcasting control information to all lanes and by supporting a large crossbar in the VMU for moving data between memory and the lanes. At the extremely fine-grain end of the spectrum, we might use hundreds of cores each with a single lane and one microthread per lane. Of course with only one microthread per lane, there is little energy or performance benefit. For these reasons, we can narrow the design space to cores with a few lanes per VTU and a moderate number of microthreads per lane.

Figure 3.2 illustrates one possible core organization with four VT lanes and four microthreads per lane for a total hardware vector length of 16. This four-lane configuration is similar in spirit to the Scale VT processor. The VIU will broadcast control information to all four lanes, and the VMU includes a crossbar and rotation network for arbitrarily rearranging elements when executing vector loads and stores. To support efficient execution of indexed accesses (and to some extent strided accesses), the VMU should be capable of generating up to four independent memory requests per cycle, and the memory system should be able to sustain a corresponding amount of address and data bandwidth. The VIU and VMU become more complex as we increase the number of vector lanes

in a VT core. Because the VT pattern allows microthreads to diverge, the VIU will need to manage multiple instruction fetch streams across multiple lanes at the same time to keep the execution resources busy, and the VMU will need to handle arbitrary microthread loads/stores across the lanes. Since VT is explicitly focused on executing irregular DLP efficiently, we might also choose to implement some form of density-time execution to minimize overheads associated with inactive microthreads. All of these concerns suggest decoupling the VT lanes so that they no longer execute in lock step, but decoupled lanes require more control logic per lane.

Figure 3.2 illustrates a different approach where each core contains a single-lane VTU. Multi-lane VTUs use a spatio-temporal mapping of microthreads to lanes, while a single-lane VTU uses just temporal execution to achieve its performance and energy advantages. Trade-offs between multi-lane and single-lane cores in terms of flexibility, performance, energy-efficiency, design complexity, and area are briefly described below.

• Flexibility – Mapping highly irregular DLP or other forms of parallelism, such as task-level parallelism or pipeline parallelism, can be more difficult with multi-lane cores as opposed to single-lane cores. Because each single-lane core has its own control thread, the cores can operate independently from each other enabling completely different tasks to be easily mapped to different cores. Mapping completely different tasks to different microthreads within a multi- lane core will result in poor performance due to limited instruction bandwidth and the tight inter-lane coupling. Of course, we can map different tasks to different multi-lane cores, but then it is not clear how well we will be able to utilize the multi-lane VTUs. Single-lane cores move closer to the finer-grain end of the core-size spectrum increasing flexibility.

• Performance – A single-lane core will obviously have lower throughput than a multi-lane core, but as shown in Figure 3.2b, we can make up for this loss in performance by mapping an ap- plication across multiple cores each with its own single-lane VTU. We assume the multi-lane core has one address generator and address port per lane and that the number of functional units per lane is always the same, so a regular DLP application mapped to both the single four-lane core and the four single-lane cores in Figure 3.2 will have similar ideal arithmetic and mem- ory throughput. An irregular DLP application might even have higher performance on the four single-lane cores, since the cores can run completely separately. A multi-lane core should sup- port longer hardware vector lengths, but the marginal performance improvement decreases as we increase the vector length. For example, the memory system usually operates at a natural block size (often the L2 cache line size), so vector memory accesses longer than this block size might see less performance benefit. As another example, longer vector lengths refactor more work onto the control thread, but there is a limit to how much this can impact performance. Longer hardware vector lengths help decoupling, but the four single-lane cores have four times the control processor throughput. This can make up for the shorter vector lengths and provide

a similar amount of aggregate decoupling across all four control processors as compared to a multi-lane core with longer vector lengths. Density-time mechanisms can improve performance in both multi- and single-lane vector units at the expense of a more irregular execution. Such mechanisms are much easier to implement in single-lane vector units, since the irregular execution occurs in time as opposed to both time and space.

• Energy-Efficiency – Multi-lane cores amortize control overheads through spatial and temporal execution of the microthreads, while single-lane cores amortize control energy solely through temporal execution of the microthreads. The longer hardware vector lengths supported by a multi-lane core do help improve this control energy amortization, but there are several mitigat- ing factors. First, the marginal energy-efficiency improvement decreases as we move to longer hardware vector lengths, since control energy is only one part of the total core energy. In addi- tion, multi-lane VTUs have increased control-broadcast energy in the VIU and crossbar energy in the VMU. We can also always trade-off area for energy-efficiency by increasing the physical vector register file size in single-lane cores to enable greater amortization through temporal ex- ecution of more microthreads. This of course assumes the energy overhead of the larger register file does not outweigh the benefit of greater amortization.

• Design Complexity – Single-lane VTUs can be simpler to implement because there is no need for any tightly coupled inter-lane communication. Each core is designed independently, and then cores are tiled across the chip to form the full data-parallel accelerator. We may, however, want to group a few single-lane cores together physically to share local memory or on-chip network interfaces, but this does not affect the logical view where each core is a completely separate programmable entity with its own control thread. In addition, it is simpler to implement density-time execution in a single-lane VTU.

• Area – Probably the most obvious disadvantage of using multiple single-lane cores is the area overhead of providing a separate control processor per lane. We introduce a technique called control processor embedding that allows the control processor to share long-latency functional units and memory ports with the VTU. This significantly reduces the area overhead of including a control processor per vector lane. A unified instruction set helps enable control processor embedding, since both the control thread and the microthreads have identical functional unit requirements. It is important to note that an efficient multi-lane VTU, such as Scale, will most likely require decoupled lanes with more control logic per lane. This reduces the marginal increase in area from a multi-lane VTU to multiple single-lane VTUs.

Our interest in using single-lane VT cores for Maven is based on the assumption that a small increase in area is worth reduced design complexity, increased flexibility, equivalent (or greater) performance, and comparable energy-efficiency as compared to multi-lane VT cores. We would like a single-lane Maven VT core to execute regular DLP with vector-like energy ef-

ficiency, so we have chosen to design the Maven VT core to be as similar as possible to a traditional vector-SIMD microarchitecture. This is in contrast to the Scale VTU, which had a unique microarchitecture specific to VT. Of course, Scale's unique microarchitecture enabled a variety of interesting features such as atomic instruction block interleaving, cross-microthread register-based communication, vector segment accesses, and decoupled clustered execution. By following the vector-SIMD pattern more closely, we can simplify the implementation and perform a detailed comparison of the exact mechanisms required to enable VT capabilities.

We add vector fragments to a vector-SIMD microarchitecture to handle vector-fetched arithmetic and control flow instructions. A fragment contains a program counter and a bit mask representing which microthreads are part of the fragment. A vector-SIMD microarchitecture always executes full vector operations that contain all microthreads, while a VT microarchitecture executes vector fragments that can contain a subset of the microthreads. Vector fragments are fundamentally different than the vector flag registers common in vector-SIMD implementations. A vector flag register simply contains a set of bits, while a vector fragment more directly represents the control flow for a subset of microthreads. Vector fragments, and the various mechanisms used to manipulate them, are the key to efficiently executing both regular and irregular DLP on a Maven VT core.

We can extend the basic vector fragment mechanism in three ways to help improve efficiency when executing irregular DLP. Vector fragment merging allows fragments to dynamically reconverge at run-time. Vector fragment interleaving helps hide the relatively long vector-fetched scalar branch latency as well as other execution latencies. Vector fragment compression helps eliminate the energy and performance overheads associated with inactive microthreads in a fragment. Adding support for vector fragments and these additional extensions is largely limited to modifying the VIU of a traditional vector-SIMD core. This allows such VT cores to maintain many of the energy-efficiency advantages of a vector-SIMD core, while at the same time increasing the scope of programs that can be mapped to the core.
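The bookkeeping behind a vector fragment can be sketched in a few lines; this is purely illustrative of the concept described above (a program counter plus a microthread mask) and is not the actual Maven hardware encoding:

  #include <cstdint>

  // Illustrative state for one vector fragment: the PC its microthreads will
  // fetch next, plus a bit mask of which microthreads belong to the fragment.
  struct VectorFragment {
    uint32_t pc;
    uint32_t mask;  // bit i set => microthread i is part of this fragment
  };

  // A vector-fetched scalar branch can split a fragment in two: the taken
  // microthreads follow the target, the rest continue at the fall-through PC.
  inline void split(const VectorFragment& f, uint32_t taken_mask,
                    uint32_t target_pc, uint32_t fallthrough_pc,
                    VectorFragment& taken, VectorFragment& not_taken) {
    taken     = { target_pc,      f.mask &  taken_mask };
    not_taken = { fallthrough_pc, f.mask & ~taken_mask };
  }

  // Vector fragment merging: two fragments at the same PC can recombine so
  // that instruction fetch is again amortized across their microthreads.
  inline bool try_merge(VectorFragment& a, const VectorFragment& b) {
    if (a.pc != b.pc) return false;
    a.mask |= b.mask;
    return true;
  }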

3.3 Explicitly Data-Parallel VT Programming Methodology

Although accelerators have often relied on either hand-coded assembly or compilers that can automatically extract DLP, there has been recent interest in explicitly data-parallel programming methodologies. In this approach, the programmer writes code for the host thread and explicitly specifies a data-parallel task that should be executed on all of the microthreads in parallel. As shown in Figure 3.3, there is usually a two-level hierarchy of the microthreads exposed to the programmer. This hierarchy enables more efficient implementations and is common across the patterns: in the MIMD pattern this corresponds to the set of microthreads mapped to the same core, in the SIMT pattern this corresponds to microthread blocks, and in the SIMD and VT patterns this corresponds to the control threads. A key difference in the patterns is whether or not this two-level hierarchy is captured in the instruction set implicitly (as in the MIMD and SIMT patterns) or explicitly (as in

Figure 3.3: Explicitly Data-Parallel Programming Methodology – The host thread explicitly specifies data-parallel tasks and manages executing them on all the microthreads. Microthreads are implicitly or explicitly grouped into a two-level hierarchy.

the SIMD and VT patterns). Any explicitly data-parallel programming methodology must expose the two-level hierarchy in some way for good performance, but with the SIMD and VT patterns the methodology must also provide a mechanism for generating high-quality code for the control threads.

In Maven, we leverage a slightly modified scalar compiler to generate efficient code for the control thread and microthreads, and we use a carefully written support library to capture the fine-grain interaction between the two types of thread. A unified VT instruction set is critical to leveraging the same scalar compiler for both types of thread. The result is a clean programming model that is considerably easier to use than assembly programming, yet simpler to implement than an automatic vectorizing compiler, such as the one used in Scale.
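The structure in Figure 3.3 can be illustrated with a deliberately simplified host-side sketch; the interface below is hypothetical and is not the Maven support library described in Chapter 6, but it shows the shape of an explicitly data-parallel program: the host/control code names a per-element task and launches it over all microthreads.

  #include <cstddef>

  // Hypothetical launch helper: a VT implementation would stripmine n across
  // the hardware vector length and vector-fetch the task; a plain sequential
  // stand-in is used here so the sketch is self-contained.
  template <typename Func>
  void parallel_for_each_microthread(std::size_t n, Func task) {
    for (std::size_t i = 0; i < n; i++)
      task(i);
  }

  void saxpy(float* C, const float* A, const float* B, float x, std::size_t n) {
    // The host thread explicitly specifies the data-parallel task ...
    parallel_for_each_microthread(n, [=](std::size_t i) {
      C[i] = x * A[i] + B[i];  // ... conceptually executed by one microthread per element
    });
  }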

Chapter 4

Maven Instruction Set Architecture

This chapter describes Maven's unified VT instruction set and the design decisions behind various aspects of the instruction set. Section 3.1 has already motivated our interest in a unified VT instruction set. Section 4.1 gives an overview of the instruction set including the programmer's model, a brief description of all scalar and vector instructions, and two example assembly programs that will be referenced throughout the rest of the chapter. Section 4.2 goes into more detail about the specific challenges involved in unifying the control thread and microthread instruction sets. Sections 4.3–4.5 elaborate in more detail on certain aspects of the instruction set, including the vector configuration process (Section 4.3), vector memory instructions (Section 4.4), and the register usage and calling convention used by Maven (Section 4.5). Section 4.6 discusses various instruction set extensions that can allow reasonable emulations of the MIMD, vector-SIMD, and SIMT architectural design patterns. The chapter concludes with future research directions (Section 4.7) and related work (Section 4.8).

4.1 Instruction Set Overview

As with all VT architectures, the programmer's logical view of a single Maven VT core consists of a control thread which manages an array of microthreads (see Figure 4.1). A unique feature of the Maven instruction set compared to earlier VT architectures is that both the control thread and the microthreads execute a nearly identical 32-bit scalar RISC instruction set. Maven supports a full complement of integer instructions as well as IEEE compliant single-precision floating-point instructions. The control thread includes a program counter, 32 general-purpose registers, 4–32 vector registers, and several vector control registers. Scalar register number zero and vector register number zero are always fixed to contain the value zero. Each microthread has its own program counter and one element of the vector registers, meaning that each microthread effectively has 4–32 scalar general-purpose registers. Since vector register number zero is fixed to contain the value zero, each microthread's scalar register number zero is also fixed to contain the value zero. The three

Figure 4.1: Maven Programmer's Logical View – The control thread includes 32 general-purpose registers and a standard MIPS-like instruction set with support for both integer and single-precision floating-point arithmetic. Vector configuration instructions read and write three control registers (vl, vlmax, vcfg) which change the number of available and active microthreads. Each microthread has its own program counter (pc) and can execute the same set of instructions as the control thread except for vector and system instructions. (CT = control thread, µT = microthread, GPR = general purpose register)

vector control registers (vcfg, vlmax, vl) change the number of available and active microthreads (see Section 4.3 for more details).

The Maven instruction set could leave the hardware vector length completely unbounded so that implementations could provide very short hardware vector lengths or very long hardware vector lengths without restriction. There are, however, some benefits to defining a minimum and maximum hardware vector length as part of the instruction set. A programmer can avoid stripmining overhead if the application vector length is statically known to be smaller than the minimum hardware vector length, and a programmer can statically reserve memory space if there is a maximum hardware vector length. For these reasons, Maven specifies a minimum hardware vector length of four and a maximum hardware vector length of 64.

The Maven programmer's model makes no guarantees concerning the execution interleaving of microthreads. Microthreads can be executed in any order and any subset of microthreads can be executed in parallel. Software is also not allowed to rely on a consistent microthread execution interleaving over multiple executions of the same code. Avoiding any execution guarantees enables both spatial and temporal mapping of microthreads and also allows microthread mappings to change over time. Although Section 3.2 made a case for single-lane VTUs, future implementations of the Maven instruction set are free to also use multi-lane VTUs. An example microthread instruction interleaving for a two-lane, four-microthread VTU is shown in Figure 2.13b. The figure illustrates how µT0–1 execute in parallel, µT2–3 execute in parallel, and µT0–1 execute before µT2–3. Figure 2.13b implies that a vector fetch or other vector instruction acts as a barrier across the lanes, but this is overly restrictive and Maven does not guarantee this behavior. An implementation with de-

coupled lanes might allow each lane to execute very different parts of the program at the same time. The programmer can, of course, rely on the fact that the vector store on line 9 will correctly read the result of the vector-fetched multiply and addition on lines 20–21 for each microthread. In other words, microthreads execute in program order including the effect of vector memory operations on that microthread.

Although there are no guarantees concerning the microthread execution interleaving, the control thread is guaranteed to execute in parallel with the microthreads. For example, Figure 2.13b illustrates the control thread executing the loop control overhead in parallel with the microthreads executing the multiply and addition operations. Software is allowed to explicitly rely on this parallel execution to enable communication between the control thread and microthreads while they are running. This enables more flexible programming patterns such as a work-queue pattern where the control thread distributes work to the microthreads through a work queue in memory or an early-exit pattern where one control thread might notify other control threads through memory to abort the computation running on their microthreads. A disadvantage of guaranteeing parallel control-thread execution is that it eliminates the possibility of a completely sequential implementation. A sequential implementation executes the control thread until a vector fetch instruction, sequentially executes each microthread until completion, and then resumes execution of the control thread. Compared to sequential implementations, parallel control-thread execution has higher performance because it enables control-thread decoupling, and it also enables more flexible programming patterns making it a more attractive option.

Table 4.1 shows the scalar instruction set used by both the control thread and microthreads. The control thread can execute all listed instructions (including the microthread specific instructions), while the microthreads can execute all listed instructions except for control-thread specific instructions. Most of the listed instructions' function and encoding are based on the MIPS32 instruction set [mip09, Swe07], with the exceptions being the floating-point instructions and those instructions marked by an asterisk. Maven unifies the MIPS32 integer and floating-point registers into a single register space with 32 general-purpose registers. All floating-point instructions can read and write any general-purpose register, and floating-point comparisons target a general-purpose register instead of a floating-point condition-code register. These modifications required small changes to the floating-point instruction encoding format. Maven also eliminates the special hi and lo registers for integer multiply and division instructions. Instead, Maven provides integer multiply, divide, and remainder instructions that read and write any general-purpose register. Maven also provides atomic memory operations instead of using the standard MIPS32 load-linked and store-conditional instructions. The stop and utidx instructions have special meaning when executed by a microthread: a vector fetch starts execution of a microthread and the stop instruction stops execution of a microthread; the utidx instruction writes the index of the executing microthread to a general purpose register. The microthread index can be used to implement indexed vector memory accesses or to en-

Scalar Integer Arithmetic Instructions
  addu, subu, addiu                               Integer arithmetic
  and, or, xor, nor, andi, ori, xori              Logical operations
  slt, sltu, slti, sltiu                          Set less than comparisons
  mul, div*, divu*, rem*, remu*                   Long-latency integer arithmetic
  lui                                             Load upper 16 bits from immediate
  sllv, srlv, srav, sll, srl, sra                 Left and right shifts

Scalar Floating-Point Arithmetic Instructions
  add.s, sub.s, mul.s, div.s,                     Floating-point arithmetic
  abs.s, neg.s, sqrt.s
  round.w.s, trunc.w.s, ceil.w.s, floor.w.s       Floating-point to integer conversions
  cvt.s.w                                         Integer to floating-point conversion
  c.f.s, c.un.s, c.eq.s, c.ueq.s, c.olt.s,        Floating-point comparisons
  c.ult.s, c.ole.s, c.ule.s, c.sf.s,
  c.seq.s, c.ngl.s, c.lt.s, c.le.s, c.ngt.s

Scalar Memory Instructions
  lb, lh, lw, lbu, lhu                            Load byte, halfword, word (signed/unsigned)
  sb, sh, sw                                      Store byte, halfword, word

Scalar Control Instructions
  beq, bne, bltz, bgez, blez, bgtz                Conditional branches
  j, jr, jal, jalr                                Unconditional jumps
  movz, movn                                      Conditional moves

Scalar Synchronization Instructions
  sync                                            Memory fence
  amo.add*, amo.and*, amo.or*                     Atomic memory operations

Control-Thread Specific Instructions
  mfc0, mtc0                                      Move to/from coprocessor 0 registers
  syscall, eret                                   System call, return from exception

Microthread Specific Instructions
  stop*                                           Stop microthread
  utidx*                                          Write microthread index to destination

Table 4.1: Maven Scalar Instructions – This is a relatively standard RISC instruction set. Non-floating- point instructions without an asterisk are similar in function and encoding to the corresponding instructions in the MIPS32 instruction set [mip09, Swe07]. Floating-point instructions use the same set of general-purpose registers as integer operations, and floating-point comparison instructions target a general-purpose register instead of a floating-point condition code register. Instructions with an asterisk are specific to Maven and include divide and remainder instructions that target a general-purpose register and atomic memory opera- tions. The control thread can execute all listed instructions including the microthread specific instructions; the stop instruction is a nop and the utidx instruction always returns -1. Microthreads can execute all but the control-thread specific instructions.
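The atomic memory operations listed above are what make software patterns like the work-queue pattern described earlier practical. The sketch below uses std::atomic as a stand-in for amo.add and is illustrative only; it is not the Maven library interface:

  #include <atomic>
  #include <cstddef>

  // The control thread publishes total work items; microthreads claim them
  // one at a time with an atomic fetch-and-add (maps naturally onto amo.add).
  struct WorkQueue {
    std::atomic<std::size_t> next{0};
    std::size_t total = 0;
  };

  // Each worker (microthread) repeatedly claims the next unprocessed index,
  // which tolerates highly irregular per-item execution times.
  template <typename Func>
  void worker(WorkQueue& q, Func process) {
    for (;;) {
      std::size_t i = q.next.fetch_add(1);
      if (i >= q.total) break;
      process(i);
    }
  }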

Vector Configuration Instructions
  vcfgivl rdst, rlen, nvreg            Set vlmax based on nvreg vector regs, then execute setvl
  vcfgivli rdst, nlen, nvreg           Set vlmax based on nvreg vector regs, then execute setvli
  setvl rdst, rlen                     Set vl based on app vector length rlen, then set rdst = min(vl,rlen)
  setvli rdst, nlen                    Set vl based on app vector length nlen, then set rdst = min(vl,nlen)

Vector Unit-Stride & Strided Memory Instructions
  l{w,h,b}.v Rdst, rbase               Load vector reg Rdst from mem[rbase] with unit-stride
  l{hu,bu}.v Rdst, rbase               (unsigned and signed variants)
  l{w,h,b}st.v Rdst, rbase, rstride    Load vector reg Rdst from mem[rbase] with stride rstride
  l{hu,bu}st.v Rdst, rbase, rstride    (unsigned and signed variants)
  s{w,h,b}.v Rsrc, rbase               Store vector reg Rsrc to mem[rbase] with unit-stride
  s{w,h,b}st.v Rsrc, rbase, rstride    Store vector reg Rsrc to mem[rbase] with stride rstride

Vector Memory Fence Instructions
  sync.{l,g}                           Memory fence between executing thread's scalar loads and stores
  sync.{l,g}.v                         Memory fence between vector ld/st, µT ld/st
  sync.{l,g}.cv                        Memory fence between CT scalar ld/st, vector ld/st, µT ld/st

Vector Move Instructions
  mov.vv Rdst, Rsrc                    Move vector reg Rsrc to vector reg Rdst
  mov.sv Rdst, rsrc                    Move control thread reg rsrc to all elements of vector reg Rdst
  mtut Rdst, rutidx, rsrc              Move control thread reg rsrc to element rutidx of vector reg Rdst
  mfut Rsrc, rutidx, rdst              Move element rutidx of vector reg Rsrc to control thread reg rdst

Vector Fetch Instructions
  vf label                             Start microthreads executing instructions at label
  vfr raddr                            Start microthreads executing instructions at address in reg raddr

Table 4.2: Maven Vector Instructions – As with all VT instruction sets, Maven includes vector configuration and memory instructions as well as vector-fetch instructions. (CT = control-thread, µT = microthread, r = scalar register specifier, R = vector register specifier, n = immediate value, {w,h,b} = word, halfword, and byte vector memory variants, {hu,bu} = unsigned halfword and unsigned byte vector memory variants, {l,g} = local and global synchronization variants)

able different microthreads to execute different code based on their index. To facilitate code reuse, the control thread can also execute these two instructions: the stop instruction does nothing and the utidx instruction always returns -1, enabling software to distinguish between when it is running on the control thread and when it is running on a microthread. Instructions unique to Maven are encoded with an unused primary MIPS32 opcode and a secondary opcode field to distinguish instructions.

In addition to these differences from MIPS32, Maven has no branch delay slot and excludes many instructions such as branch-likely instructions, branch-and-link instructions, test-and-trap instructions, unaligned loads and stores, merged multiply-accumulates, rotates, three-input floating-point instructions, and bit manipulation instructions such as count leading zeros and bit-field extraction.

Some of these instructions are deprecated by MIPS32 (e.g., branch-likely instructions), some are not very useful with our C/C++-based programming methodology (e.g., branch-and-link instructions), and a few might be useful in future versions of the Maven instruction set (e.g., merged multiply-accumulates).

Table 4.2 shows the vector instructions that can be executed by the control thread to manage the array of microthreads. The vector configuration instructions enable reading and writing vector control registers that change the number of active microthreads and are described in more detail in Section 4.3. Maven includes a standard set of vector load/store instructions and vector memory fences, which are described in more detail in Section 4.4. Although it is possible to move vector registers with a vector-fetched scalar move instruction, a dedicated mov.vv instruction simplifies implementing vector register allocation in the Maven programming methodology. It is also possible to require that data be moved from the control-thread scalar registers to vector registers through memory, but it can be much more efficient to provide dedicated instructions such as mov.sv which copies a control-thread scalar register value to every element of a vector register and mtut/mfut (move to/from microthread) which transfers data between a control-thread scalar register and a single element in a vector register. Maven provides two vector fetch instructions with different ways of specifying the target address from which microthreads should begin executing: the vf instruction includes a 16-bit offset relative to the vf instruction's program counter, and the vfr instruction reads the target address from a general-purpose register. The Maven vector instructions are encoded with several unused primary MIPS32 opcodes (for the vector load/store instructions and the vf instruction) or with a MIPS32 coprocessor 2 primary opcode and a secondary opcode field.

The assembly code for two small examples is shown in Figure 4.2. These examples correspond to the regular DLP loop in Table 2.1c and the irregular DLP loop in Table 2.1f, and both examples were compiled using the Maven programming methodology described in Chapter 6. The assembly code is similar in spirit to the pseudo-assembly shown in Figures 2.13 and 2.14, but of course the real assembly code corresponds to an actual Maven binary. These examples will be referenced throughout the rest of this chapter.
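The stripmine structure visible in both columns of Figure 4.2 (subtract the elements already processed, setvl, do one vector's worth of work, advance the index) corresponds to roughly the following loop at the C++ level. The setvl() function is a stand-in for the setvl instruction of Table 4.2, and the stub value for the hardware vector length is an arbitrary assumption:

  #include <cstddef>

  // Stand-in for the setvl instruction: set the active vector length to
  // min(requested, hardware maximum) and return the value chosen.
  static const std::size_t kHwVlenMax = 32;  // assumed value for this stub only
  std::size_t setvl(std::size_t requested) {
    return requested < kHwVlenMax ? requested : kHwVlenMax;
  }

  // C++-level view of the stripmined loops compiled in Figure 4.2.
  void stripmine_example(int* C, const int* A, const int* B, int x, std::size_t n) {
    for (std::size_t i = 0; i < n; ) {
      std::size_t vlen = setvl(n - i);        // subu/setvl in the compiled code
      for (std::size_t j = 0; j < vlen; j++)  // stand-in for the vector loads,
        C[i + j] = x * A[i + j] + B[i + j];   // vector fetch, and vector store
      i += vlen;                              // addu v0, v0, a4 in the compiled code
    }
  }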

4.2 Challenges in a Unified VT Instruction Set

Minimizing architectural state in a unified VT instruction set is critical, since every microthread will usually need its own copy of this state. For example, the base MIPS32 instruction set in- cludes special 32-bit hi and lo registers. The standard integer multiply instruction reads two 32-bit general-purpose registers and writes the full 64-bit result to the hi and lo registers. The standard integer divide instruction also reads two 32-bit general-purpose registers and writes the 32-bit quo- tient into the lo register and the 32-bit remainder into the hi register. These special registers enable multiply and divide instructions to write two registers without an extra write port in the general-

(a) Regular DLP Example:

 1  19c: vcfgivl v0, a3, 5
 2  1a0: mov.sv  VA0, a4
 3  1a4: blez    a3, 1e0
 4  1a8: move    v0, zero
 5
 6  1ac: subu    a4, a3, v0
 7  1b0: setvl   a4, a4
 8  1b4: sll     v1, v0, 2
 9  1b8: addu    a6, a1, v1
10  1bc: addu    a5, a2, v1
11  1c0: lw.v    VV0, a6
12  1c4: lw.v    VV1, a5
13  1c8: vf      160
14  1cc: addu    v0, v0, a4
15  1d0: addu    v1, a0, v1
16  1d4: slt     a4, v0, a3
17  1d8: sw.v    VV0, v1
18  1dc: bnez    a4, 1ac
19
20  1e0: sync.l.cv
21  1e4: jr      ra
22
23  ut_code:
24  160: mul     v0, a0, v0
25  164: addu    v0, v0, v1
26  168: stop

(b) Irregular DLP Example:

 1  1b0: vcfgivl v0, a3, 7
 2  1b4: mov.sv  VA1, a4
 3  1b8: blez    a3, 1f4
 4  1bc: move    v0, zero
 5
 6  1c0: subu    a4, a3, v0
 7  1c4: setvl   a4, a4
 8  1c8: sll     v1, v0, 2
 9  1cc: addu    a5, a0, v1
10  1d0: mov.sv  VA0, a5
11  1d4: addu    a5, a2, v1
12  1d8: addu    v1, a1, v1
13  1dc: lw.v    VV0, v1
14  1e0: lw.v    VV1, a5
15  1e4: vf      160
16  1e8: addu    v0, v0, a4
17  1ec: slt     v1, v0, a3
18  1f0: bnez    v1, 1c0
19
20  1f4: sync.l.cv
21  1f8: jr      ra
22
23  ut_code:
24  160: blez    v0, 17c
25  164: mul     v0, a1, v0
26  168: utidx   a2
27  16c: sll     a2, a2, 2
28  170: addu    a2, a0, a2
29  174: addu    v1, v0, v1
30  178: sw      v1, 0(a2)
31  17c: stop

Figure 4.2: Example Maven Assembly Code – Assembly code generated from the Maven programming methodology discussed in Chapter 6 for (a) the regular DLP loop corresponding to Table 2.1c and (b) the irregular DLP loop corresponding to Table 2.1f. ($i = scalar register i, $vi = vector register i. Each line begins with the instruction address. Assume the following register initializations: a0 = base pointer for array C, a1 = base pointer for array A, a2 = base pointer for array B, a3 = size of arrays, a4 = scalar value x. See Tables 4.1 and 4.2 for a list of all Maven instructions. Section 4.5 discusses the register usage and calling conventions.)

general-purpose register file. MIPS32 also includes a separate 32-entry floating-point register space and eight floating-point condition-code registers. Separate hi, lo, and floating-point registers allow the long-latency integer multiplier, integer divider, and floating-point units to run decoupled from the primary integer datapath without additional write ports or any sophisticated write-port arbitration, which would be required with a unified register file. Unfortunately, these separate registers also more than double the amount of architectural state compared to the integer registers alone. As we will see in Chapter 7, the register file is a significant portion of the overall core area, so the additional architectural state translates into fewer microthreads for a given area constraint and thus shorter hardware vector lengths.

Maven unifies MIPS32's 32 integer registers, hi and lo registers, 32 floating-point registers, and eight floating-point condition-code registers into a single 32-entry general-purpose register space. These changes require new integer multiply, divide, and remainder instructions that write a single general-purpose register. These new instructions are more limited (i.e., multiplies produce a 32-bit result instead of a 64-bit result, and two instructions are required to calculate both the quotient and remainder), but they are actually a better match for our C/C++-based programming methodology. A more important issue with a unified register space is that it complicates decoupling long-latency operations, but earlier studies [Asa98] and our own experiments have indicated that the overhead is relatively small for moving from a 2-read/1-write register file to a 2-read/2-write register file. By adding a second write port we can simplify decoupled implementations, yet still support a unified register space. However, the dependency-checking logic needed to efficiently utilize multiple functional units with varying latencies and a unified register space is still more complicated. Maven also avoids instructions that have additional implicit architectural state such as the cvt.w.s instruction, which relies on a floating-point control register to determine the rounding mode. Maven instead supports the standard MIPS32 conversion instructions which explicitly specify the rounding mode in the opcode (e.g., round.w.s).

There are some trade-offs involved in choosing to base the Maven unified scalar instruction set on a standard RISC instruction set such as MIPS32, particularly with respect to control flow instructions. MIPS32 includes a single-cycle branch-delay slot which Maven eliminates to simplify future implementations which may leverage more sophisticated issue logic (e.g., limited superscalar issue or even out-of-order issue). The control processor branch resolution latency is relatively short, so a simple branch predictor can effectively help hide this latency. Unfortunately, the branch resolution latency for vector-fetched scalar branches can be quite long, which also implies a single-cycle branch-delay slot would be less effective. Figure 2.14b shows how the VIU must wait for all microthreads to resolve the branch, and this latency will increase with more microthreads per lane. Depending on the specifics of the actual implementation, this long branch resolution latency could significantly degrade performance, especially for situations with few instructions on either side of the branch.
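To make the quotient/remainder point concrete, the following minimal C sketch shows the operation that now requires two instructions; the mapping noted in the comments is an assumption about how a compiler would use the new single-destination instructions, not a statement about the actual Maven code generator.

    /* Illustrative only: with single-destination divide and remainder
     * instructions, computing both values takes two instructions, which
     * matches how C expresses the operation anyway. */
    static void divmod(int a, int b, int *q, int *r) {
        *q = a / b; /* maps to the new divide instruction    */
        *r = a % b; /* maps to the new remainder instruction */
    }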

One standard architectural technique is to replace some branches with conditional execution (e.g., predication). Conditionally executed instructions are always processed but only take effect based on additional implicit or explicit architectural state. Essentially, conditional execution turns control flow into data flow; more instructions need to be processed, but a branch instruction is no longer needed. There are many conditional execution mechanisms including: full predication, which provides many explicit predication registers and allows any instruction to be conditionally executed; partial predication, which provides fewer predication registers (or possibly one implicit predication register) and allows only a subset of the instructions to be conditionally executed; conditional move, which provides an instruction that conditionally copies a source register to a destination register usually based on the contents of a third general-purpose register; conditional select, which provides an instruction that copies one of its two sources to a destination register usually based on a separate predicate register; and conditional store, which provides a special store instruction that conditionally writes to memory based on an additional predicate register. Several studies have illustrated that more extensive conditional execution mechanisms have higher implementation cost but can also result in higher performance [SFS00, MHM+95]. Replacing highly divergent branches with conditional execution can also help energy efficiency by maintaining microthread coherence. Maven supports the standard MIPS32 conditional execution instructions: movz conditionally moves a value from one general-purpose register to another if a third general-purpose register is zero, and movn does the same if the third general-purpose register is non-zero. Even though more sophisticated conditional execution mechanisms would probably result in higher performance, Maven includes just these two conditional move instructions because they are part of the standard MIPS32 instruction set, are straightforward to implement, and are already supported by standard MIPS32 compilers.
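As a rough illustration, the C sketch below (names are ours) shows the kind of short data-dependent select that a standard MIPS32 compiler can lower to movz/movn instead of a branch; whether a particular compiler actually does so depends on its heuristics.

    /* Sketch of replacing a short branch with a conditional move. */
    int clamp_to_limit(int x, int limit) {
        int over = (x > limit);   /* comparison result lands in a general-purpose register */
        return over ? limit : x;  /* candidate for movn/movz rather than a branch           */
    }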

4.3 Vector Configuration Instructions

Maven uses a flexible vector configuration mechanism that enables the same implementation to support few microthreads each with many registers or many microthreads each with few registers. Figure 4.3 illustrates the general concept for an implementation similar to the one shown in Figure 2.12b. This implementation includes a total of 128 physical registers in the vector-thread unit (VTU). The standard Maven unified scalar instruction set requires 32 registers per microthread, meaning that this implementation can support a minimum of four microthreads (see Figures 4.3a and 4.3e). If the programmer can statically determine that not all 32 registers are needed for a portion of the code, then the VTU can be configured with fewer registers per microthread. For example, the same implementation can support twice as many microthreads (i.e., twice the vector length) if software only requires 16 registers per microthread (see Figures 4.3b and 4.3f). Maven allows software to configure between four and 32 registers per microthread, meaning that the maximum number of microthreads for this example is 32 (see Figures 4.3d and 4.3h). The available logical register numbers always start at zero and increase consecutively. For example, with four registers per microthread the logical register numbers available for use are 0–3. This implies that register zero is always available. The result of accessing an unavailable register is undefined. Reconfiguring the VTU clobbers all vector registers, so software must save to memory any values that should be live across a reconfiguration. Reconfiguration can also require dead time in the VTU while instructions prior to the vcfgivl instruction drain the pipeline. Thus reconfiguration should be seen as a coarse-grain operation done at most once per loop nest. Reconfigurable vector registers have some significant implications for the register usage and calling conventions, which are discussed in Section 4.5.

Figure 4.3: Examples of Various Vector Configurations – The same Maven implementation can support few microthreads each with many registers (a,e) or many microthreads each with few registers (d,h). Register number zero is shown as a normal register for simplification, but an actual implementation would exploit the fact that register number zero always contains the value zero such that there is no need to provide multiple copies of this register. (µT = microthread)

The example in Figure 4.3 does not exploit the fact that vector register number zero always contains the value zero. An actual implementation only needs 124 physical registers to support four microthreads with each microthread having all 32 registers that are part of Maven's unified scalar instruction set. Since each microthread does not need its own copy of register number zero, the 124 physical registers can sometimes support more microthreads than pictured. For example, with four registers per microthread (one of which is register number zero) the implementation can actually support 41 microthreads (41 = floor( 124 / (4 − 1) )) as opposed to the 32 microthreads shown in Figures 4.3d and 4.3h.

The example assembly code in Figure 4.2 illustrates how the Maven configuration mechanism is exposed to software through the vcfgivl instruction on line 1. As shown in Table 4.2, the vcfgivl instruction has two sources and one destination. The second source is an immediate specifying the number of registers required per microthread. The regular DLP example in Figure 4.2a requires five registers per microthread while the irregular DLP example in Figure 4.2b requires seven registers per microthread. Notice that the vector register numbers and microthread scalar register numbers do not exceed these bounds. When executing a vcfgivl instruction, the hardware will write the number of registers per microthread into the vcfg control register and also write the number of microthreads that are correspondingly available (i.e., the maximum vector length) into the vlmax control register. The value for vlmax is simply vlmax = floor( Nphysreg / (Nutreg − 1) ), where Nphysreg is the number of physical registers and Nutreg is the number of registers requested per microthread via the vcfgivl instruction. Implementations can take advantage of the fact that the regular DLP example requires fewer registers per microthread to support longer vector lengths, and this can result in higher performance and better energy efficiency.

In addition to configuring the number of available microthreads, the vcfgivl instruction also performs the same operation as a setvl instruction, which was briefly described in Section 2.7. As shown in Table 4.2, the setvl instruction has one source (the application vector length) and one destination (the resulting number of active microthreads). As a side effect, the setvl and vcfgivl instructions also write the number of active microthreads into the vl control register. The value for the number of active microthreads is vl = min( vlmax, Nutapp ), where Nutapp is the application vector length (i.e., the total number of microthreads required by the application). As the examples in Figure 4.2 illustrate, the setvl instruction is used for stripmining so that the loops can process vl elements at a time without statically knowing anything about the physical resources available in the VTU. Integrating the functionality of setvl into the vcfgivl instruction avoids having to use two instructions for what is a very common operation, and thus should help reduce overhead especially for loops with short application vector lengths.

The setvl and vcfgivl instructions enable software to flexibly configure the physical VT resources in two dimensions: the number of microthreads and the number of registers per microthread. This gives a great deal of freedom in the implementation in terms of the number of lanes and physical registers while still allowing the same compiled binary to efficiently run on all variations.
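The following minimal C sketch restates the configuration arithmetic above, assuming the implementation shares the always-zero register among microthreads; the function names are illustrative and not part of the Maven software stack.

    /* Sketch of the vcfgivl/setvl arithmetic described above. */
    int compute_vlmax(int n_physreg, int n_utreg) {
        /* e.g., 124 physical registers and 4 registers per microthread -> 41 */
        return n_physreg / (n_utreg - 1);            /* floor division           */
    }

    int compute_vl(int vlmax, int n_utapp) {
        /* setvl/vcfgivl write min(vlmax, application vector length) into vl */
        return (n_utapp < vlmax) ? n_utapp : vlmax;
    }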

4.4 Vector Memory Instructions

Maven includes vector instructions for handling unit-stride and strided accesses to vectors of words, halfwords, and bytes (see Table 4.2). These instructions are very similar to those found in other vector-SIMD instruction sets. Figure 4.2a illustrates using unit-stride vector loads and stores on lines 11–12 and 17. Most vector-SIMD instruction sets also include indexed vector loads/stores which use a control-thread scalar register as a base address and a vector register to indicate an offset for each accessed element. Maven achieves a similar effect with vector-fetched scalar accesses. Each microthread computes its own address and then simply uses a standard scalar load/store (e.g., lw/sw). Maven could additionally provide dedicated indexed vector loads/stores, but unlike unit-stride and strided accesses there is little regularity to exploit for performance or energy efficiency. Regardless of their encoding, indexed accesses cannot be decoupled effectively, since the addresses are not known early enough in the pipeline, and they cannot be transferred in efficient blocks. It is also important to note that vector-fetched scalar loads/stores are more general than indexed vector accesses, since they can also implement conditional vector loads/stores as illustrated in Figure 4.2b on line 30.

As discussed in Section 4.1, there are few constraints on the execution interleaving of the control thread and microthreads, and each Maven core is meant to be used in a full data-parallel accelerator with many independent cores. Thus it is important for the Maven instruction set to provide various memory fence instructions to constrain load/store ordering and prevent unwanted race conditions. Table 4.2 shows the three types of Maven memory fences: sync, sync.v, and sync.cv. Each memory fence type has a local (.l) and global (.g) variant. Local memory fences only enforce ordering as viewed by the control thread and microthreads on a single core, while global memory fences enforce ordering as viewed by all control threads and microthreads in the data-parallel accelerator. Notice that the sync.g instruction is an alias for the standard MIPS32 sync instruction.

Figure 4.4 illustrates the three types of memory fences with an abstract execution diagram. Unlike the execution diagrams presented earlier, this diagram does not attempt to capture the details of the implementation but instead tries to express the execution at the semantic level. A sync fence is the least restrictive and ensures that all scalar loads/stores before the fence are complete before any scalar loads/stores after the fence. In Figure 4.4a, a sync fence is used to order two consecutive stores on the control thread, and it is also used to order two consecutive stores on a microthread. A sync fence on one microthread does not constrain any other microthread. A sync.v fence ensures that all vector loads/stores and microthread scalar loads/stores before the fence are complete before any similar operations after the fence. In Figure 4.4a, a sync.v fence is used to order two consecutive vector stores, and it also ensures that the microthread stores are complete before the final vector store. These kinds of fences are useful when mapping inter-microthread communication through memory. For example, to permute a vector stored in a vector register we can use vector-fetched scalar stores to place each element in the proper location in a temporary memory buffer

(a) Memory Fence Example for sync & sync.v   (b) Memory Fence Example for sync.cv

Figure 4.4: Memory Fence Examples – These are abstract execution diagrams showing how the control thread and two microthreads execute logically with respect to various memory fences. Memory fences introduce additional dependencies between various loads and stores; these dependencies are shown with red arrows. The diagrams apply equally well for both the local and global memory fence variants and if any store is replaced with a load. (CT = control thread, µT = microthread)

before reloading with a unit-stride vector load. A sync.l.v fence is required in between the vector-fetched scalar stores and the unit-stride vector load. Otherwise the lack of any inter-microthread execution guarantees (even with respect to vector memory instructions) might allow the unit-stride load to read an older value from memory. A sync.v fence does not constrain control-thread scalar loads/stores in any way. A sync.cv fence constrains all types of memory accesses on a core, and thus is the most restrictive type of memory fence. In Figure 4.4b, a sync.cv fence is used to order two control-thread scalar stores, two vector stores, and two sets of vector-fetched scalar stores. These kinds of fences are useful when communicating between the control thread and the microthreads through memory. For example, using a sync.l.cv fence after initializing a data structure with the control thread ensures that the following vector memory accesses will read the correctly initialized data. Figure 4.2 illustrates using this kind of fence at the end of a function to ensure that the caller sees properly updated results that are stored in memory. Although there are several other types of fences that constrain various subsets of memory accesses, these three types cover the most common usage patterns.

In addition to memory fences, Maven also provides atomic memory operations for inter-thread communication (see Table 4.1). Atomic memory operations can provide more scalable performance as compared to the standard MIPS32 load-linked and store-conditional instructions.

Atomic memory operations do not enforce a particular memory ordering, so a programmer is responsible for combining an atomic memory operation with a memory fence to implement certain parallel communication mechanisms (e.g., a mutex).
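As a hypothetical illustration, the sketch below combines an atomic operation with a sync.l.cv fence to build a simple mutex; the two intrinsic names are placeholders standing in for whatever atomic memory operation Table 4.1 provides and for the fence, and are not actual Maven intrinsics.

    /* Placeholders: an atomic swap returning the previous value, and the
     * sync.l.cv fence. These are illustrative names only. */
    extern int  maven_amo_swap(volatile int *addr, int value);
    extern void maven_sync_l_cv(void);

    void mutex_lock(volatile int *lock) {
        while (maven_amo_swap(lock, 1) != 0)
            ;                   /* spin until the previous value was 0 (unlocked) */
        maven_sync_l_cv();      /* order the critical section after the acquire   */
    }

    void mutex_unlock(volatile int *lock) {
        maven_sync_l_cv();      /* complete critical-section accesses first       */
        *lock = 0;
    }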

4.5 Calling Conventions

The control thread and the microthreads could each use a different calling convention, but in Maven we exploit the unified instruction set to create a common calling convention for both types of thread. This calling convention only handles passing scalar values in registers. One can also imagine a calling convention for when the control thread passes vector registers as function arguments, but we do not currently support this feature.

Table 4.3 shows the control-thread and microthread scalar register usage convention, which is a modified version of the standard MIPS32 o32 convention [Swe07]. Maven includes eight instead of four argument registers, and both floating-point and integer values can be passed in any argument register (a0–a7) or return register (v0–v1). Stack passing conventions are the same as in o32 except that register-allocated arguments do not need to reserve space on the stack. Since the scalar convention is identical for both the control thread and the microthreads, the same compiled function can be called in either context.

The above discussion assumes that Maven is configured with 32 registers per microthread. Although reconfigurable vector registers enable higher performance and better energy efficiency, they also complicate the calling conventions. For example, assume the control thread configures the VTU for eight vector registers and then vector-fetches code onto the microthreads. These microthreads will only have eight scalar registers available and be unable to call a function using the standard

Register   Assembly
Number     Name      Usage
0          zero      Always contains zero
1          at        Reserved for use by assembler
2,3        v0,v1     Values returned by function
4–11       a0-a7     First few arguments for function
12–15      t4-t7     Functions can use without saving
16–23      s0-s7     Functions must save and then restore
24,25      t8,t9     Functions can use without saving
26,27      k0,k1     Reserved for use by kernel
28         gp        Global pointer for small data
29         sp        Stack pointer
30         fp        Frame pointer
31         ra        Function return address

Table 4.3: Scalar Register Usage Convention – This convention is used both by the control thread and by the microthreads and is similar to the standard MIPS32 o32 convention except that there are eight instead of four argument registers. (Adapted from [Swe07])

scalar calling convention. Critical registers, such as the stack pointer and return address register, will be invalid. Currently, Maven takes a simplistic approach where the microthread calling convention requires all 32 registers.

4.6 Extensions to Support Other Architectural Design Patterns

We would like to compare the energy efficiency and performance of Maven to the other architectural design patterns presented in Chapter 2. This section describes how the basic Maven instruction set can be extended to emulate the MIMD, vector-SIMD, and SIMT architectural design patterns for evaluation purposes.

4.6.1 MIMD Extensions to Maven Instruction Set Architecture

When emulating the MIMD pattern, software is limited to just the scalar instructions in Table 4.1 and should avoid the vector instructions listed in Table 4.2. All communication between microthreads occurs through memory, possibly with the help of the standard MIPS32 memory fence (sync) and Maven-specific atomic memory operations. Microthreads can take advantage of the utidx instruction to identify their thread index.

Most applications have a serial portion where a master thread will perform some work while the other worker threads are waiting. Usually this waiting takes the form of a tight loop that continually checks a shared variable in memory. When microthreads are running on different cores, this waiting loop will impact energy efficiency but not performance. In the MIMD pattern, multiple microthreads can be mapped to the same core, meaning that this waiting loop steals execution cycles from the master thread and significantly degrades performance. To address this issue, the Maven MIMD extensions include a coprocessor 0 control register named tidmask that indicates which microthreads are active. Any thread can read or write this control register, and software must establish a policy to coordinate when threads are made inactive or active. An inactive thread is essentially halted at the current program counter and executes no instructions until another thread activates it by setting the appropriate bit in tidmask. The stop instruction can also be used to allow a thread to inactivate itself. Note that this differs from how the stop instruction is ignored by the control thread when using the basic VT pattern.
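The C sketch below shows the naive waiting loop described above; the shared flag and function names are illustrative. With the MIMD extensions, a worker could instead execute stop and later be re-activated through tidmask.

    /* Naive serial-section wait: each worker spins on a shared flag in memory.
     * When several microthreads share a core, this spinning steals execution
     * cycles from the master thread. */
    volatile int work_ready = 0;   /* illustrative shared variable */

    void worker_wait_for_work(void) {
        while (!work_ready) {
            /* busy-wait; with the MIMD extensions the thread could instead
               execute stop and be re-activated by setting its bit in tidmask */
        }
    }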

4.6.2 Vector-SIMD Extensions to Maven Instruction Set Architecture

The Maven vector-SIMD extensions are based on the Torrent instruction set used in the T0 vector processor [Asa96]. Differences include support for floating-point instead of fixed-point arithmetic, and more extensive vector flag support instead of the vector conditional move instruction used in the Torrent instruction set. The Maven vector-SIMD extensions include an explicit eight-entry vector flag register file. To simplify instruction encoding and some vector flag arithmetic, vector flag register zero is always defined to contain all ones. When emulating the vector-SIMD pattern,

Vector Indexed Memory Instructions

l{w,h,b}x.v Rdst, rbase, Ridx      Load vector reg Rdst from mem[rbase] with offset in Ridx
l{hu,bu}x.v Rdst, rbase, Ridx      (unsigned halfword and unsigned byte variants)
s{w,h,b}x.v Rsrc, rbase, Ridx      Store vector reg Rsrc to mem[rbase] with offset in Ridx

Vector Integer Arithmetic Instructions

intop.vv Rdst, Rsrc0, Rsrc1, [F]   Vector-vector arithmetic, ( Rdst = Rsrc0 intop Rsrc1 )
intop.vs Rdst, Rsrc0, rsrc1, [F]   Vector-scalar arithmetic, ( Rdst = Rsrc0 intop rsrc1 )
intop.sv Rdst, rsrc0, Rsrc1, [F]   Scalar-vector arithmetic, ( Rdst = rsrc0 intop Rsrc1 )
    addu.{vv,vs}, subu.{vv,vs,sv}, mul.{vv,vs}
    div.{vv,vs,sv}, rem.{vv,vs,sv}, divu.{vv,vs,sv}, remu.{vv,vs,sv}
    and.{vv,vs}, or.{vv,vs}, xor.{vv,vs}, nor.{vv,vs}
    sll.{vv,vs,sv}, srl.{vv,vs,sv}, sra.{vv,vs,sv}

Vector Floating-Point Arithmetic Instructions

fpop.s.v Rdst, Rsrc, [F]           Vector arithmetic, Rdst = fpop( Rsrc )
fpop.s.vv Rdst, Rsrc0, Rsrc1, [F]  Vector-vector arithmetic, Rdst = ( Rsrc0 fpop Rsrc1 )
fpop.s.vs Rdst, Rsrc0, rsrc1, [F]  Vector-scalar arithmetic, Rdst = ( Rsrc0 fpop rsrc1 )
fpop.s.sv Rdst, rsrc0, Rsrc1, [F]  Scalar-vector arithmetic, Rdst = ( rsrc0 fpop Rsrc1 )
    add.s.{vv,vs,sv}, sub.s.{vv,vs,sv}, mul.s.{vv,vs,sv}, div.s.{vv,vs,sv}
    abs.s.v, neg.s.v, sqrt.s.v
    round.w.s.v, trunc.w.s.v, ceil.w.s.v, floor.w.s.v, cvt.s.w.v

Table 4.4: Maven Vector-SIMD Extensions – The vector-SIMD extensions add indexed vector load/store instructions, vector arithmetic instructions, and vector flag support. See Table 4.5 for the list of instructions related to vector flag support. (r = control-thread scalar register specifier, R = vector register specifier, [F] = optional vector flag register specifier for operations under flag, {w,h,b} = word, halfword, and byte vector memory variants, {hu,bu} = unsigned halfword and unsigned byte vector memory variants, {vv,vs,sv} = vector-vector, vector-scalar, and scalar-vector variants)

software can use all of the vector instructions in Table 4.2 except for the vector fetch instructions. Tables 4.4 and 4.5 show the additional vector instructions needed for the vector-SIMD pattern. There are vector equivalents for most of the scalar MIPS32 operations. Most of the vector-SIMD instructions are encoded with a MIPS32 coprocessor 2 primary opcode and a secondary opcode field, although some instructions require unused primary MIPS32 opcodes.

The vector-SIMD extensions include indexed vector loads and stores, each of which requires two sources: a scalar base register and a vector register containing offsets for each microthread. A full complement of vector integer and floating-point arithmetic instructions is available. Some instructions include three variants: vector-vector (.vv) with two vector registers as sources, and vector-scalar/scalar-vector (.vs/.sv) with a vector register and a control-thread scalar register as sources. Vector-scalar and scalar-vector variants are necessary for those operations that are not commutative. For example, the subu.vs instruction subtracts a scalar value from each element in a vector register, while the subu.sv instruction subtracts each element in a vector register from a scalar value.

Vector Flag Integer Compare Instructions

seq.f.vv Fdst, Rsrc0, Rsrc1        Flag reg Fdst = ( Rsrc0 == Rsrc1 )
slt.f.vv Fdst, Rsrc0, Rsrc1        Flag reg Fdst = ( Rsrc0 < Rsrc1 )

Vector Flag Floating-Point Compare Instructions

c.fpcmp.s.f.vv Fdst, Rsrc0, Rsrc1  Vector-vector flag compare, Fdst = ( Rsrc0 fpcmp Rsrc1 )
c.fpcmp.s.f.vs Fdst, Rsrc0, rsrc1  Vector-scalar flag compare, Fdst = ( Rsrc0 fpcmp rsrc1 )
c.fpcmp.s.f.sv Fdst, rsrc0, Rsrc1  Scalar-vector flag compare, Fdst = ( rsrc0 fpcmp Rsrc1 )
    c.f.s.{vv,vs,sv}, c.un.s.{vv,vs,sv}, c.eq.s.{vv,vs,sv}, c.ueq.s.{vv,vs,sv}
    c.olt.s.{vv,vs,sv}, c.ult.s.{vv,vs,sv}, c.ole.s.{vv,vs,sv}, c.ule.s.{vv,vs,sv}
    c.sf.s.{vv,vs,sv}, c.nlge.s.{vv,vs,sv}, c.seq.s.{vv,vs,sv}, c.ngl.s.{vv,vs,sv}
    c.lt.s.{vv,vs,sv}, c.nge.s.{vv,vs,sv}, c.le.s.{vv,vs,sv}, c.ngt.s.{vv,vs,sv}

Vector Flag Arithmetic Instructions

fop.f Fdst, Fsrc0, Fsrc1           Flag transformation, Fdst = ( Fsrc0 fop Fsrc1 )
    or.f, and.f, xor.f, not.f

Vector Flag Move Instructions

mov.f Fdst, Fsrc                   Move Fsrc to Fdst
mov.fv Fdst, Rsrc                  Move Rsrc to Fdst
mov.vf Rdst, Fsrc                  Move Fsrc to Rdst
popc.f rdst, Fsrc                  Write number of ones in flag reg Fsrc to CT reg rdst
findfone.f rdst, Fsrc              Write position of first one in flag reg Fsrc to CT reg rdst

Table 4.5: Maven Vector-SIMD Extensions (Flag Support) – The vector-SIMD extensions include support for writing flag registers based on integer and floating-point comparisons, as well as other flag arithmetic operations. (CT = control thread, r = control-thread scalar register specifier, R = vector register specifier, F = vector flag register specifier, {w,h,b} = word, halfword, and byte vector memory variants, {hu,bu} = unsigned halfword and unsigned byte vector memory variants)

All vector arithmetic operations include an optional source specifying a vector flag register. Each element of a vector is only written to the vector register if the corresponding bit of the flag register is one. The Maven vector-SIMD extensions do not include conditional loads and stores, since implementing them can be quite challenging. Various vector integer and floating-point comparison instructions are provided that write a specific flag register. Flag arithmetic instructions can be used for implementing more complicated data-dependent conditionals. For example, a simple if/then/else conditional can be mapped by executing a comparison operation that writes flag register F, executing the then instructions under flag F, complementing the flag register with a not.f instruction, and then executing the else instructions under the complemented flag. Flag arithmetic can also be used for nested if statements by using the or.f instruction to merge nested comparison operations.
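The scalar C sketch below mirrors this flag-based mapping for a simple if/then/else: both sides are computed and the results merged under a condition, just as the vector code would execute the then-block under flag F and the else-block under its complement. The operations and names are illustrative.

    /* Scalar model of execution under a flag: no branch, both sides computed. */
    void conditional_update(int n, const int *a, int *c, int x) {
        for (int i = 0; i < n; i++) {
            int flag     = (a[i] > 0);  /* comparison writes flag register F */
            int then_val = a[i] * x;    /* executed under flag F             */
            int else_val = a[i] + x;    /* executed under not.f F            */
            c[i] = flag ? then_val : else_val;
        }
    }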

4.6.3 SIMT Extensions to Maven Instruction Set Architecture

To emulate the SIMT pattern, we could simply restrict software in a similar way as for the Maven MIMD extensions, but this would require a completely different implementation from any of the other patterns. We found it easier to start from the basic Maven VT instruction set and corresponding Maven implementation, and simply restrict the types of instructions allowed when emulating the SIMT pattern. For example, SIMT software should avoid using the control thread as much as possible, since an actual SIMT implementation would not include a programmable control processor. SIMT software should begin by initializing scalar variables for each microthread with the mov.sv instruction. Normally this would be done with dedicated hardware in an actual SIMT implementation. SIMT software should also avoid using the setvl instruction, and instead use a vector-fetched scalar branch to check whether the current microthread index is beyond the application vector length. The SIMT software can take advantage of the reconfigurable vector registers if possible, since SIMT implementations might have similar reconfiguration facilities implemented with dedicated hardware. SIMT microthreads can use all of the scalar instructions in Table 4.1 including atomic memory operations. SIMT software should use vector-fetched scalar loads/stores instead of unit-stride or strided vector memory operations. Partitioning a long vector-fetched block into multiple shorter vector-fetched blocks emulates statically annotated reconvergence points.
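A minimal sketch of this SIMT-style guard is shown below, assuming a hypothetical maven_utidx() wrapper around the utidx instruction; it is not an actual Maven intrinsic.

    extern int maven_utidx(void);   /* placeholder for reading the microthread index */

    void simt_kernel(int n, const int *a, int *c, int x) {
        int i = maven_utidx();
        if (i < n)                  /* vector-fetched scalar branch replaces setvl */
            c[i] = a[i] * x;
    }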

4.7 Future Research Directions

This section briefly describes some possible directions for future improvements with respect to the Maven instruction set.

Calling Conventions with Reconfigurable Vector Registers – As discussed in Section 4.5, reconfigurable vector registers significantly complicate the register usage and calling convention. The current policy avoids any issues by limiting reconfiguration to leaf functions, but it would be useful to enable more flexible reconfiguration policies. One option might be to provide different versions of a function, each of which assumes a different number of available vector registers. The calling convention could standardize a few common configurations to reduce duplicate code. A bigger issue is the inability of the microthreads to make function calls unless the VTU is configured with 32 registers. A possible improvement would be to modify the order in which logical registers are eliminated as the number of registers per microthread is decreased. For example, if the gp, sp, fp, and ra registers were eliminated only after 16 of the other registers were eliminated, the microthreads could call functions as long as 16 or more registers were available.

Shared Vector Registers – Section 4.3 discussed how a Maven implementation can exploit the fact that register number zero is shared among all microthreads to increase vector lengths with the same amount of physical resources. This is a specific example of a more general concept known as shared vector registers first explored in the Scale VT processor. Currently Maven only provides

private vector registers, meaning that there is a unique element for each microthread. A shared vector register has only one element and is shared among all the microthreads, and it can be used to hold constants or possibly even reduction variables. The vcfgivl instruction would need to be extended so that the programmer could specify the required number of private and shared registers. Adding shared registers such that any Maven instruction can access the shared register is challenging without radical re-encoding. There are no available bits to hold the private register specifier as well as additional shared register specifiers. One option would be to place the shared registers in a separate register space and provide a special instruction to move values between the private and shared registers. Another option would be to allow some of the standard 32 registers currently used by Maven to be marked as shared. The order in which registers are marked private or shared could be specified statically as part of the instruction set. For example, some of the temporary registers might be able to be marked as shared. Of course, an additional challenge would be improving the programming methodology such that private and shared registers are allocated efficiently.

Vector Unconfiguration – Section 4.3 introduced a method for reconfiguring the vector registers based on the number of required vector registers. It would also be useful to add a vuncfg instruction that unconfigures the vector unit. This instruction acts as a hint to the hardware that the VTU is not being used. An implementation could then implement a policy to power-gate the VTU after a certain period of time, thereby reducing static power consumption. The VTU would be turned on at the next vcfgivl instruction. This instruction might also include an implicit sync.l.cv memory fence. A standard software policy for vectorized functions might be to first configure the VTU, complete the vectorized computation, and then unconfigure the VTU. Currently, most vectorized Maven functions already use a sync.l.cv memory fence before returning, so the vuncfg instruction would just be a more powerful mechanism for ending a vectorized function. Finally, unconfiguring the VTU also tells the operating system that there is no need to save the vector architectural state on a context swap.

Vector Segment Memory Accesses – Working with arrays of structures or objects is quite common, but accessing these objects with multiple strided accesses can be inefficient. Segment memory accesses, such as those available in the Scale VT processor and stream processors [RDK+98, KWL+08], can load multiple consecutive elements into consecutive vector registers. For example, an array of pixels each with a red, blue, and green color component could be loaded with either three strided accesses or with a single segment access that loads the three color components into three consecutive registers in each microthread. Segment accesses are straightforward to add to the instruction set, but implementing them efficiently can be difficult.
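For concreteness, the C sketch below shows the array-of-structures layout in question; the type and function are illustrative.

    /* Touching all three color components requires either three strided
     * accesses (stride equal to sizeof(Pixel)) or one segment access that
     * loads three consecutive elements per microthread. */
    typedef struct { int r, g, b; } Pixel;

    void scale_pixels(int n, Pixel *p, int s) {
        for (int i = 0; i < n; i++) {  /* each iteration maps to one microthread */
            p[i].r *= s;
            p[i].g *= s;
            p[i].b *= s;
        }
    }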

Vector Compress and Expand Instructions – As described in Section 2.1, vector compress and expand instructions can be used to transform irregular DLP to more regular DLP. Even though VT architectures can handle most irregular DLP naturally with vector-fetched scalar branches, it might

still be useful in some applications to collect just the active elements. For example, a programmer might want to break a large vector-fetched block into smaller blocks to facilitate chaining off vector loads and/or to reduce the required number of vector registers. If the large vector-fetched block contains control-flow intensive code, the programmer will need to save the control-flow state so that the microthreads can return to the proper control path in the second vector-fetched block. An alternative might be to compress active elements, and process similar elements with different vector-fetched blocks. Unfortunately, vector compress and expand instructions usually leverage a vector flag register to specify which elements to compress, but this is not possible in the Maven VT instruction set. Instead, Maven could provide a new microthread scalar instruction which marks that microthread as participating in a vector compress operation. The next vector compress instruction would only compress elements from those microthreads which have been marked and then reset this additional architectural state. It would probably be necessary for the control thread to have access to the number of microthreads participating in the compress operation for use in counters and pointer arithmetic.

Vector Reduction Instructions – Currently the Maven instruction set has no special provisions for reduction operations that can be used when implementing cross-iteration dependencies such as the one shown in Table 2.1j. These loops can still be mapped to Maven, but the cross-iteration portion either needs to be executed with vector memory instructions or by the control thread. Because the VT pattern includes vector instructions in addition to vector-fetched scalar instructions, Maven can eventually support any of the standard reduction instructions found in vector-SIMD architectures. For example, a vector element rotate instruction would essentially pass a register value from one microthread to the next, while a vector extract instruction would copy the lower portion of a source vector register to the higher portion of a destination vector register. Unfortunately, since these are vector instructions, they cannot be easily interleaved with vector-fetched scalar instructions. The programming methodology must partition vector-fetched blocks into smaller blocks to enable the use of these vector reduction instructions. An alternative would be to leverage the shared vector registers for cross-microthread communication. The instruction set could provide a new vector fetch atomic instruction for manipulating a shared register for cross-microthread communication, similar to the atomic instruction blocks used in Scale. A vector fetch atomic instruction requires the hardware to execute the corresponding vector-fetched atomic block for each microthread in isolation with respect to the other microthreads and shared registers. A microarchitecture can implement these atomic blocks by effectively causing maximum divergence where each microthread executes the atomic region by itself. Although this serializes the computation, it is a flexible mechanism and keeps the data in vector registers for efficient execution of any surrounding regular DLP.
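The loop below is a minimal C example of the kind of cross-iteration dependency being discussed; the accumulation into sum is what a reduction mechanism would need to handle (for instance, by forming partial sums that are later combined).

    int sum_array(int n, const int *a) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];    /* cross-iteration dependency on sum */
        return sum;
    }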

4.8 Related Work

This section highlights selected previous work specifically related to the unified VT instruction set presented in this chapter.

Scale VT Processor – The Maven instruction set has some similarities and important differences compared to the Scale instruction set [KBH+04a]. Both have very similar control-thread scalar instruction sets based on the MIPS32 instruction set, with the primary difference being Maven's support for floating-point instructions. Both instruction sets support reconfigurable vector registers, unit-stride and strided vector memory instructions, and vector fetch instructions. Scale includes dedicated vector segment accesses to efficiently load arrays of structures. The most important difference between the two instruction sets is that the Scale microthread instruction set is specialized for use in a VT architecture, while Maven unifies the control-thread and microthread instruction sets. Scale's microthread instruction set has software-exposed clustered register spaces compared to Maven's monolithic microthread register space; Scale microthread code must be partitioned into atomic instruction blocks while Maven microthread code simply executes until it reaches a stop instruction; Scale provides thread-fetch instructions for microthread control flow, while Maven uses standard scalar branch instructions; Scale has richer support for predication with an implicit predication register per cluster and the ability to predicate any microthread instruction, while Maven is limited to the standard MIPS32 conditional move instructions; Scale has hardware support for cross-microthread communication while Maven requires all such communication to be mapped through memory; and Scale includes shared registers in addition to private registers, which enables implementations to better exploit constants or shared values across microthreads. Overall the Maven instruction set is simpler, while the Scale instruction set is more complicated and specialized for use in a VT architecture.

Fujitsu Vector Machines Reconfigurable Vector Registers – Most vector-SIMD processors have fixed hardware vector lengths, but several Fujitsu vector machines include a reconfigurable vector register mechanism similar to Maven. For example, the Fujitsu VP-200 machine has a total of 8,192 64-bit elements that could be reconfigured from eight vector registers of 1024 elements to 256 registers of 32 elements [Oya99].

Graphics Processor Reconfigurable Number of Microthreads – Graphics processors from both NVIDIA and ATI include the ability to increase the total number of microthreads if each microthread requires fewer scalar registers [LNOM08, nvi09, ati09]. This does not change the number of microthreads per microthread block, but instead it changes the amount of VIU multithreading. Thus this reconfiguration will improve performance but is unlikely to help energy efficiency. Maven's configuration instructions change the actual number of microthreads managed by each control thread, and thus directly impact the amount of energy amortization.

Microthreaded Pipelines – Jesshope's work on microthreaded pipelines [Jes01] has some similarities to the Maven VT instruction set. The proposal includes a cre instruction that is similar to Maven's vf instruction. The cre instruction causes an array of microthread pipelines to each execute one iteration of a loop, and a subset of the master thread's instruction set is supported by the microthreaded pipelines. An important difference is that the microthreaded pipelines cannot execute conditional control flow, while the Maven instruction set is specifically designed to efficiently handle such control flow. Jesshope's work also lacks vector memory instructions to efficiently transfer large blocks of data between memory and the microthreaded pipelines. There are also significant differences in the implementation of the two approaches. The VT pattern attempts to achieve vector-like efficiencies for microthread instructions which are coherent, while the microthreaded pipeline approach has a much more dynamic thread execution model that always results in scalar energy efficiencies. In addition, the microthreaded pipeline technique focuses more on efficiently executing cross-iteration dependencies, while Maven focuses more on efficiently executing irregular control flow.

Unified Scalar/Vector – Jouppi et al.'s work on a unified scalar/vector processor leverages the primary scalar datapath to provide features similar to the vector-SIMD pattern with a purely temporal implementation [JBW89]. Both scalar and vector instructions operate on the same register file; vector instructions can start at any offset in the register file and simply perform the same operation for multiple cycles while incrementing the register specifier. This technique enables an elegant mapping of irregular DLP, since mixing scalar arithmetic and control instructions with vector arithmetic operations is trivial. Unfortunately, control thread decoupling is no longer possible and the energy efficiency of such an approach is unclear. Since registers are accessed irregularly, implementing multiple vector lanes or banked vector register files becomes more difficult. Dependency checking is more complicated and must occur per vector element.

Chapter 5

Maven Microarchitecture

The Maven data-parallel accelerator will include an array of tens to hundreds of cores, each of which implements the Maven instruction set introduced in the previous chapter. Section 3.2 has already motivated our interest in Maven VT cores based on small changes to a single-lane vector-SIMD unit. Section 5.1 gives an overview of the single-lane Maven VT core microarchitecture and provides an example program execution that will be carried throughout the rest of the chapter. Sections 5.2–5.5 introduce four techniques that complement the basic Maven microarchitecture. Section 5.2 describes control processor embedding, which helps mitigate the area overhead of including a separate control processor per vector lane. Section 5.3 describes vector fragment merging, which helps microthreads automatically become coherent again after diverging. Section 5.4 describes vector fragment interleaving, which helps hide latencies (particularly the branch resolution latency) by interleaving the execution of multiple independent vector fragments. Section 5.5 describes vector fragment compression, which is a flexible form of density-time execution suitable for use in VT architectures. Although this thesis focuses on a single data-parallel core, Section 5.6 provides some insight into how these cores can be combined into a full data-parallel accelerator. Section 5.7 discusses various extensions that enable the Maven microarchitecture to reasonably emulate the MIMD, vector-SIMD, and SIMT architectural design patterns. The chapter concludes with future research directions (Section 5.8) and related work (Section 5.9).

5.1 Microarchitecture Overview

Figure 5.1 shows the microarchitecture of a Maven VT core, which contains four primary modules: the control processor executes the control thread and fetches vector instructions, which are then sent to the vector issue unit and vector memory unit; the vector lane includes the vector register file and vector functional units; the vector issue unit (VIU) handles issuing vector instructions and managing vector-fetched instructions; and the vector memory unit (VMU) handles the vector and microthread memory accesses. The vector lane, VIU, and VMU form the Maven vector-thread

unit (VTU), which closely resembles a single-lane vector-SIMD unit. The primary differences between the two implementations are in the VIU, with some smaller changes in the vector lane and VMU. This close resemblance helps ensure that Maven will achieve vector-like efficiencies on regular DLP, while the differences in the VIU allow Maven to more easily execute a wider range of irregular DLP. Each of these four modules is explained in detail below, and then an example program is used to illustrate how all four modules work together to execute both regular and irregular DLP.

Maven includes a simple five-stage single-issue in-order RISC control processor. The control processor uses a 31-entry register file with two read ports and two write ports. The first write port is for short-latency integer operations, while the second write port is for long-latency operations such as loads, integer multiplies/divides, and floating-point operations. This allows the control processor to overlap short-latency operations while long-latency operations are still in flight without needing complex dynamic write-port arbitration. An early commit point in the pipeline allows these short-latency operations to be directly written to the architectural register file. The figure illustrates how the control processor is embedded into the vector lane by sharing the lane's long-latency functional units and memory ports. Section 5.2 will discuss control processor embedding in more detail.

A Maven vector lane includes a large vector register file with 248 × 32-bit elements and five vector functional units: vector functional unit zero (VFU0), which can execute simple integer operations, integer multiplies, and floating-point multiplies and square roots; vector functional unit one (VFU1), which can execute simple integer operations, integer divides, and floating-point additions and divides; the vector address unit (VAU), which reads addresses from the vector register file and sends them to the VMU for microthread loads and stores; the vector store-data read unit (VSDRU), which reads store data from the vector register file and sends it to the VMU for vector and microthread stores; and the vector load-data writeback unit (VLDWU), which writes load data from the VMU into the vector register file. The long-latency operations in VFU0/VFU1 are fully pipelined and have latencies ranging from four to 12 cycles. Each vector functional unit includes a small controller that accepts a vector operation and then manages its execution over many cycles. The controller handles fetching the appropriate sources from the vector register file, performing the operation, and writing the result back into the vector register file for each element of the vector. A single vector operation can therefore occupy a vector functional unit for a number of cycles equal to the current hardware vector length. Note that the vector functional unit controllers do not perform hazard checking nor handle any stall logic; the VIU is responsible for issuing vector operations to functional units only when there are no hazards and the operation can execute to completion (with the exception of the vector memory instructions described below). To support parallel execution of all five functional units, the vector register file requires six read ports and three write ports. The register file can be configured to support vector lengths of 8–32 with 32–4 registers per microthread respectively.
Note that there is actually enough state to support a vector length of 64 with only

Figure 5.1: Maven VT Core Microarchitecture – This microarchitecture corresponds to the high-level implementation shown in Figure 2.12b with a single-lane VTU. The control processor is embedded, meaning it shares long-latency functional units and memory ports with the VTU. Maven includes five vector functional units: VFU0 executes simple integer arithmetic, integer multiplication, and floating-point multiplication/square-root; VFU1 executes simple integer arithmetic, integer division, and floating-point addition/division; the VAU reads addresses and generates requests for microthread loads/stores; the VSDRU reads store-data from the register file for vector and microthread stores; and the VLDWU writes load-data into the register file for vector and microthread loads. The vector issue unit manages vector instructions and vector-fetched scalar instructions, while the vector memory unit manages vector and microthread loads/stores. (CP = control processor, VIU = vector issue unit, VMU = vector memory unit, PVFB = pending vector fragment buffer, PC = program counter, BRMR = branch resolution mask register, VFU = vector functional unit, VAU = vector address unit, VSDRU = vector store-data read unit, VLDWU = vector load-data writeback unit, VLAGU = vector load-address generation unit, VSAGU = vector store-address generation unit, VAQ = vector address queue, VSDQ = vector store-data queue, VLDQ = vector load-data queue, int = simple integer ALU, imul/idiv = integer multiplier and divider, fmul/fadd/fsqrt/fdiv = floating-point multiplier, adder, square root unit, and divider.)

four registers per microthread. We limit the vector length to 32 to reduce VIU control logic and microarchitectural state that scales with the maximum number of microthreads.

The VIU acts as the interface between the control processor and the vector lane. The control processor fetches vector instructions and pushes them into the VIU queue, and the VIU then processes them one at a time. If a vector instruction requires access to control-thread scalar data, then the control processor will push the corresponding values into the VIU queue along with the vector instruction. The VIU is divided into two decoupled stages: the fetch/decode stage and the issue stage. When the fetch/decode stage processes a simple vector instruction (e.g., a mov.vv or mtut instruction) sent from the control processor, it simply decodes the instruction and sends it along to the issue stage. When the fetch/decode stage processes a vector fetch instruction, it creates a new vector fragment. Vector fragments, and the various mechanisms used to manipulate them, are the key to efficiently executing both regular and irregular DLP on Maven. Each vector fragment contains a program counter and a bit mask identifying which microthreads are currently fetching from that program counter. Vector fragments are stored in the pending vector fragment buffer (PVFB) located in the fetch/decode stage of the VIU. When processing a vector fetch instruction, the newly created vector fragment contains the target instruction address of the vector fetch instruction and a bit mask of all ones, since all microthreads start at this address. The fetch/decode stage then starts to fetch scalar instructions for this fragment and sends the corresponding vector fragment micro-op (vector operation plus microthread bit mask) to the issue stage. The issue stage is responsible for checking all structural and dependency hazards before issuing vector fragment micro-ops to the vector lane. Fragment micro-ops access the vector register file sequentially, one element at a time, and cannot stall once issued. The regular manner in which vector fragments execute greatly simplifies the VIU. The issue stage tracks just the first microthread of each vector fragment micro-op, which effectively amortizes hazard checking over the entire vector length. Even though the VIU issues vector fragment micro-ops to the vector lane one per cycle, it is still possible to exploit instruction-level parallelism by overlapping the execution of multiple vector instructions, since each occupies a vector functional unit for many cycles.

When the VIU processes a vector-fetched scalar branch instruction, it issues the corresponding branch micro-op to VFU0/VFU1 and then waits for all microthreads to resolve the branch. Each microthread determines whether the branch should be taken or not taken and writes the result to the appropriate bit of the branch resolution mask register (BRMR). Once all microthreads have resolved the branch, the branch resolution mask is sent back to the VIU. If the branch resolution mask is all zeros, then the VIU continues fetching scalar instructions along the fall-through path, or if the branch resolution mask is all ones, then the VIU starts fetching scalar instructions along the taken path. If the microthreads diverge, then the VIU creates a new fragment to track those microthreads which have taken the branch, removes those microthreads from the old fragment, and continues to execute the old fragment.
Microthreads can continue to diverge and create new

fragments, but eventually the VIU will reach a stop instruction which indicates that the current fragment is finished. The VIU then chooses another vector fragment to start processing and again executes it to completion. Once all vector fragments associated with a specific vector fetch instruction are complete, the VIU is free to start working on the next vector instruction waiting in the VIU queue. Vector-fetched unconditional jump instructions are handled exclusively in the fetch/decode stage of the VIU. Vector-fetched jump register instructions require VFU0/VFU1 to read out each element of the address register and send it back to the fetch/decode stage of the VIU one address per cycle. The fetch/decode stage either merges the corresponding microthread into an existing fragment in the PVFB or creates a new fragment in the PVFB if this is the first microthread with the given target address. This allows the microthreads to remain coherent if they are all jumping to the same address, which is common when returning from a function.
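The following C sketch is a conceptual model of the fragment bookkeeping just described, not the actual PVFB hardware: a fragment is a program counter plus a microthread bit mask, and a diverging branch splits one fragment into two.

    #include <stdint.h>

    typedef struct {
        uint32_t pc;    /* address the fragment's microthreads fetch from */
        uint32_t mask;  /* one bit per microthread in the fragment        */
    } Fragment;

    /* Split 'frag' on a resolved branch: taken microthreads form a new
     * fragment at the branch target; the rest continue on the fall-through
     * path (assumed here to be pc + 4). */
    Fragment split_on_branch(Fragment *frag, uint32_t taken_mask, uint32_t target_pc) {
        Fragment taken = { target_pc, frag->mask & taken_mask };
        frag->mask &= ~taken_mask;  /* not-taken microthreads stay in the old fragment */
        frag->pc   += 4;            /* old fragment continues at the fall-through      */
        return taken;
    }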

The VMU manages moving data between memory and the vector register file. For vector memory instructions, the control processor first fetches and decodes the instruction before sending it to the VIU. The fetch/decode stage of the VIU then splits it into two parts: the address generation micro-op, which is sent to the VMU, and either the store-data read micro-op or the load-data writeback micro-op, which are sent to the issue stage. The VMU turns a full vector memory micro-op into multiple memory requests appropriate for the memory system. We currently assume the memory system naturally handles 256-bit requests, so a 32-element lw.v instruction (total size of 1024 bits) needs to be broken into four (or five if the vector is unaligned) 256-bit requests. Strided accesses may need to be broken into many more memory requests. The vector load-address generation unit (VLAGU) and vector store-address generation unit (VSAGU) are used by the VMU to generate the correct addresses for each 256-bit request. As shown in Figure 5.1, Maven supports issuing both a load and a store every cycle. Since read and write datapaths are always required, the performance improvement is well worth the additional address bandwidth to support issuing loads and stores in parallel.

For a vector load, the memory system will push the load data into the vector load-data queue (VLDQ). To prevent a core from stalling the memory system, the VMU is responsible for managing the VLDQ such that the memory system always has a place to write the response data. Once data arrives in the VLDQ, the VIU can issue the load-data writeback micro-op to the VLDWU, which will write the vector into the register file one element per cycle. For a vector store, the store-data read micro-op is issued to the VSDRU, and one element per cycle is read from the vector register file and pushed into the vector store-data queue (VSDQ). The VMU then takes care of issuing requests to send this data to the memory system. The issue stage must handle the store-data read micro-op and the load-data writeback micro-op carefully, since they can stall the vector lane if the VSDQ or VLDQ are not ready. The issue stage monitors the VSDQ and the VLDQ, and only issues the store-data read micro-op or load-data writeback micro-op when there is a good chance a stall will not occur (i.e., there is plenty of space in the VSDQ or plenty of data in the VLDQ). Conservatively guaranteeing these micro-ops will never stall incurs a significant performance and

Microthread loads and stores are handled in a similar way as vector memory instructions, except that the addresses come from the vector lane via the VAU, and the VIU must inform the VMU of how many elements to expect. By routing all memory operations through the VIU, we create a serialization point for implementing vector memory fences. Section 5.6 will illustrate how stream buffers located close to the L2 cache banks enable reading and writing the cache memories in wide 256-bit blocks. The blocks are then streamed between these buffers and the cores at 32 bits per cycle, which naturally matches the VSDRU and VLDWU bandwidth of one element per cycle.

Figure 5.2 shows how a small example executes on the Maven microarchitecture. The pseudocode is shown in Figure 5.2a and the resulting pseudo-assembly (similar in spirit to what was used in Section 2.7) is shown in Figure 5.2b. This example contains nested conditionals to help illustrate how Maven handles irregular DLP with vector fragments. We assume the hardware vector length is four, and that the four microthreads have the following data-dependent execution paths: µT0: opY → opZ; µT1: opW → opX → opZ; µT2: opY → opZ; µT3: opW → opZ. The unit-stride vector load at address 0x004 is split into the load-data writeback micro-op, which is handled by the VIU (wr.v), and the address generation micro-op, which is sent to the VMU (addr.v not shown, see Figure 2.13b for an example). Some time later the data comes back from memory, and the VIU is able to issue the load-data writeback to the VLDWU. Over four cycles, the VLDWU writes the returned data to the vector register file. The VIU is then able to process the vector fetch instruction at address 0x008, and the corresponding update to the PVFB is shown in the figure. The VIU creates a new vector fragment (A) with the target address of the vector fetch instruction (0x100) and a microthread mask of all ones. The first instruction of fragment A is the branch at address 0x100. It is important to note that this execution diagram, as well as all other execution diagrams in this thesis, assumes a vector lane can execute only a single vector micro-op at a time. This assumption is purely to simplify the discussion. Any practical implementation will support multiple vector functional units (e.g., Maven has five vector functional units) and allow results to be chained (bypassed) between these functional units. For example, Maven allows the unit-stride vector load to chain its result to the first vector-fetched branch instruction such that these two vector instructions can overlap. Chaining requires more complicated control logic in the VIU but is essential for good performance. The figure shows how the VIU waits for all four microthreads to resolve the branch before determining how to proceed. In this example, the microthreads diverge with µT0 and µT2 taking the branch and µT1 and µT3 falling through. The VIU creates a new fragment (B) which contains the branch target address (0x114) and the mask bits for µT0 and µT2 set to one (0101). The VIU now proceeds to continue executing fragment A, but only for µT1 and µT3. Notice how µT0 and

for ( i = 0; i < n; i++ )
  if ( A[i] >= 0 )
    opW;
    if ( A[i] == 0 )
      opX;
  else
    opY;
  opZ;

(a) Pseudocode

loop:
  000: setvl   vlen, n
  004: load.v  VA, a_ptr
  008: fetch.v ut_code
  00c: add     a_ptr, vlen
  010: sub     n, vlen
  014: br.neq  n, 0, loop
  ...

ut_code:
  100: br.lt  a, 0, else
  104: opW
  108: br.neq a, 0, done
  10c: opX
  110: jump   done
else:
  114: opY
done:
  118: opZ
  11c: stop

(b) Pseudo-Assembly

(c) Maven Execution Diagram

Figure 5.2: Executing Irregular DLP on the Maven Microarchitecture – Example (a) pseudocode, (b) pseudo-assembly, and (c) execution for a loop with nested conditionals illustrating how the PVFB manages divergence through vector fragments. (Code syntax similar to that used in Figure 2.14. Vector memory unit not shown for simplicity. CP = control processor, VIU = vector issue unit, PVFB = pending vector fragment buffer.)

µT2 are masked off when the vector lane executes the opW and br.neq instructions. The br.neq instruction at address 0x108 illustrates a nested conditional and the potential for nested divergence, since the microthreads have already diverged once at the first branch. The VIU again waits for the branch to resolve before creating a new vector fragment (C) which records the fact that µT3 has taken the branch. Vector fragment A now only has a single microthread (µT1). The VIU continues to execute fragment A until it reaches a stop instruction. Notice how the VIU quickly redirects the vector-fetched instruction stream when it encounters unconditional control flow (i.e., the jump at address 0x110). Now that fragment A has completed, the VIU will start executing one of the two remaining fragments (B, C). Exactly which fragment to execute next is a hardware policy decision, but based on the microarchitecture as currently described, there is no real difference. In this example, the VIU uses a stack-based approach and switches to fragment C, before it finally switches to fragment B. It is perfectly valid for these later fragments to continue to diverge. Once all fragments are complete and the PVFB is empty, the VIU can move onto the next vector instruction. In this example, the control processor has already run ahead and pushed the beginning of the next iteration of the stripmine loop into the VIU queue.

Based on this example, we can make several interesting observations. Note that the PVFB does not need to track any per-path history, but instead simply tracks the leading control-flow edge across all microthreads. This means the size of the PVFB is bounded by the maximum number of microthreads (i.e., 32). An equivalent observation is that the mask bit for each microthread will be set in one and only one entry of the PVFB. As with all VT architectures, Maven is able to achieve vector-like efficiencies on regular DLP, but also notice that Maven is able to achieve partial vector-like efficiencies when executing vector fragments caused by irregular DLP. For example, the VIU can still amortize instruction fetch, decode, and dependency checking by a factor of two when executing instruction opW of fragment A. Unfortunately, this example also illustrates a missed opportunity with respect to instruction opZ. Although this instruction could be executed as part of just one fragment, it is instead executed as part of three different fragments. Section 5.3 will describe vector fragment merging, which attempts to capture some of this reconvergence dynamically. It is important to note, though, that the microthreads always reconverge when the VIU moves onto the next vector instruction (e.g., for the next iteration of the stripmine loop). According to the Maven execution semantics any microthread interleaving is permissible, which means that executing fragments in any order is also permissible. Section 5.4 will describe vector fragment interleaving, which hides various execution latencies, such as the branch resolution latency, by switching to a new fragment before the current fragment is completed. In this example, there are 17 wasted execution slots due to inactive microthreads, and the amount of wasted time will continue to rise with even greater divergence. Section 5.5 will describe vector fragment compression, which skips inactive microthreads at the cost of increased control-logic and register-indexing complexity.

5.2 Control Processor Embedding

Control processor (CP) embedding allows the CP to share certain microarchitectural units with the VTU, and it is an important technique for mitigating the area overhead of single-lane VTUs. Figure 5.1 illustrates the two ways in which the Maven CP is embedded: shared long-latency functional units and shared memory ports.

An embedded CP still includes its own scalar register file and a relatively small short-latency integer datapath. Instead of including its own long-latency functional units (e.g., integer multiplier/divider and floating-point units), the CP reads the source operands from its scalar register file and pushes them along with the appropriate command information into the CP embedding request queue (shown at the top of Figure 5.1). This request queue is directly connected to VFU0 and VFU1. The vector functional unit controllers include a simple arbiter which executes the CP operation when the vector functional unit would be otherwise unused. Since these CP operations steal unused cycles and do not read or write the vector register file, the VIU has no need to track or manage the CP operations. Although simple, this approach can starve the CP, limiting decoupling. A more sophisticated option is to send CP operations through the VIU. The results of the long-latency operations are eventually written back into the CP embedding response queue. The CP is decoupled, allowing it to continue executing independent instructions until it reaches an instruction which relies on the result of the long-latency operation. The CP then stalls until the corresponding result appears in the CP embedding response queue.

Although sharing the long-latency functional units helps reduce the size of the CP, we would still need the following six memory ports for a Maven VT core with a single vector lane: CP instruction port, CP load port, CP store port, VIU vector-fetched instruction port, VMU load port, and VMU store port. Although it is possible to merge the CP load and store ports, Maven instead shares the three CP ports with the three VTU ports (shown at the bottom of Figure 5.1). Three two-input arbiters use round-robin arbitration to choose between the two types of requests. In addition, both the CP and the VTU share a small L1 instruction cache.

Note that a unified VT instruction set helps simplify CP embedding. Since both types of threads execute the same long-latency instructions, they can share the same long-latency functional units without modification. CP embedding is equally possible in vector-SIMD architectures, and it is particularly well suited to single-lane implementations. Embedding a control processor in a multi-lane vector unit or VTU requires one lane to be engineered differently from the rest, which might complicate the design. CP embedding helps reduce CP area, and the only significant disadvantage is that the CP will experience poor performance on compute- or memory-intensive tasks executing in parallel with the microthreads. However, this is rare, since the CP executes the more control-oriented portions, while the microthreads execute the more compute- and memory-intensive portions.
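The cycle-stealing arbitration can be summarized with a small behavioral C++ sketch: vector micro-ops always have priority, and a CP request is serviced only on an otherwise idle cycle. The types and queues below are invented stand-ins for the hardware control state, not the Maven RTL.

    #include <queue>

    // Behavioral model of a vector functional unit controller that lets the
    // embedded CP steal otherwise idle cycles. The Op type and the two queues
    // are invented stand-ins for the hardware control state.
    struct Op { int opcode; int src0, src1; };

    struct VfuController {
      std::queue<Op> vector_ops;    // micro-ops issued by the VIU
      std::queue<Op> cp_requests;   // CP embedding request queue

      // Returns true if some operation was started this cycle.
      bool issue_one_cycle() {
        if ( !vector_ops.empty() ) {   // vector work always has priority
          vector_ops.pop();            // (execute the vector micro-op)
          return true;
        }
        if ( !cp_requests.empty() ) {  // otherwise steal the cycle for the CP
          cp_requests.pop();           // (execute the CP long-latency operation)
          return true;
        }
        return false;                  // functional unit is idle this cycle
      }
    };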

5.3 Vector Fragment Merging

Once microthreads diverge there is no way for them to reconverge until the next vector instruction, e.g., a vector memory instruction or vector fetch instruction. Separately executing multiple fragments that could be executed as a single merged fragment will result in lower performance and energy efficiency. Vector fragment merging is a hardware mechanism that can cause microthreads to reconverge dynamically at run time. This dynamic hardware approach complements the Maven programming methodology described in Chapter 6, which allows programmers to statically reconverge microthreads by splitting a large vector-fetched block into multiple smaller vector-fetched blocks.

Vector fragment merging only requires modifications to the fetch/decode stage in the VIU. The general approach is to merge fragments with the same program counter into a single fragment, with the corresponding mask values combined using a logical "or" operation. The challenges are determining when to merge fragments and choosing the fragment execution order so as to maximize merging. One option is to associatively search the PVFB on every cycle to see if any other fragments have the same program counter as the currently executing fragment. Unfortunately, this frequent associative search can be expensive in terms of energy. A cheaper alternative is for the VIU to only search the PVFB when executing control-flow instructions. This can capture some common cases, such as multiple branches with the same target address (often used to implement more complicated conditionals) or the backward branches used to implement loops. It also allows a compiler to encourage reconvergence at known good points in the program by inserting extra unconditional jump instructions that target a common label as hints for the hardware to attempt merging. Vector fragment merging is equally applicable to both multi-lane and single-lane VTUs.

Figure 5.3 illustrates the vector fragment merging technique for the same example used in Figure 5.2. When the VIU executes the unconditional jump at address 0x110, it searches the PVFB and notices that both the current fragment A and the pending fragment C are targeting the same program counter (0x118). The VIU then merges the microthread active in fragment C (µT3) with the microthread in the current fragment A (µT1), and continues executing the newly merged fragment A. Unfortunately this technique is not able to uncover all opportunities for reconvergence, since there is no way for the hardware to know that fragment B will eventually also reach the instruction at address 0x118. Note that searching the PVFB every cycle does not increase microthread reconvergence in this example, but changing the order in which fragments are executed could help. For example, if the VIU switched to fragment B after merging fragments A and C, and the VIU also searched the PVFB every cycle, then it would successfully reconverge all four microthreads at instruction address 0x118. Developing a simple yet effective heuristic for changing the fragment execution order to encourage merging is an open research question.
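A minimal C++ sketch of the cheaper merge policy (searching only when a control-flow instruction executes) is shown below; the data structures mirror the PVFB sketch used earlier in this chapter and are illustrative only.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Fragment { uint32_t pc; uint32_t mask; };

    // Attempt to merge the currently executing fragment with any pending fragment
    // that has reached the same program counter, OR-ing the microthread masks.
    // In the cheaper policy this is called only on control-flow instructions.
    bool try_merge( Fragment& current, std::vector<Fragment>& pvfb )
    {
      for ( std::size_t i = 0; i < pvfb.size(); i++ ) {
        if ( pvfb[i].pc == current.pc ) {
          current.mask |= pvfb[i].mask;   // reconverge the microthreads
          pvfb.erase( pvfb.begin() + static_cast<std::ptrdiff_t>(i) );
          return true;
        }
      }
      return false;
    }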

for ( i = 0; i < n; i++ )
  if ( A[i] >= 0 )
    opW;
    if ( A[i] == 0 )
      opX;
  else
    opY;
  opZ;

(a) Pseudocode

loop:
  000: setvl   vlen, n
  004: load.v  VA, a_ptr
  008: fetch.v ut_code
  00c: add     a_ptr, vlen
  010: sub     n, vlen
  014: br.neq  n, 0, loop
  ...

ut_code:
  100: br.lt  a, 0, else
  104: opW
  108: br.neq a, 0, done
  10c: opX
  110: jump   done
else:
  114: opY
done:
  118: opZ
  11c: stop

(b) Pseudo-Assembly

(c) Maven Execution Diagram

Figure 5.3: Example of Vector Fragment Merging – Example pseudocode (a) and pseudo-assembly (b) are the same as in Figure 5.2. Execution diagram (c) illustrates how fragments A and C are able to reconverge when fragment A executes the unconditional jump instruction. (Code syntax similar to that used in Figure 2.14. Vector memory unit not shown for simplicity. CP = control processor, VIU = vector issue unit, PVFB = pending vector fragment buffer.)

5.4 Vector Fragment Interleaving

One of the primary disadvantages of the basic vector-fetched scalar branch mechanism is the long branch resolution latency. The benefit of waiting is that the VIU is able to maintain partial vector-like efficiencies for vector fragments after the branch, possibly even keeping all the microthreads coherent if the branches are all taken or all not-taken. Unfortunately, the long branch resolution latency also leads to idle execution resources, as shown in Figure 5.2. The performance impact is actually worse than implied in Figure 5.2, since Maven includes five vector functional units that will all be idle waiting for the branch to resolve. Vector fragment interleaving is a technique that hides the branch resolution latency (and possibly other execution latencies) by switching to a new fragment before the current fragment has completed. Since no two fragments contain the same microthread, fragments are always independent and their execution can be interleaved arbitrarily.

As with vector fragment merging, vector fragment interleaving only requires modification to the fetch/decode stage of the VIU. The general approach is for the VIU to choose a different fragment for execution every cycle. Standard scheduling techniques for multithreaded MIMD architectures are applicable. The VIU can obliviously rotate around the fragments in the PVFB, or it can select only from those fragments that are not currently stalled due to a branch, data dependency, structural hazard, or unready memory port. The regular way in which fragments access the vector register file, even when some number of microthreads are inactive, means that the standard VIU issue stage will work correctly without modification. A simpler interleaving scheme, called switch-on-branch, focuses just on hiding the branch resolution latency. When the VIU encounters a vector-fetched scalar branch it immediately switches to a new fragment. The VIU only returns to the original fragment when it encounters another branch or the current fragment is completed. Vector fragment interleaving is equally applicable to both multi-lane and single-lane VTUs.

Figure 5.4 illustrates the vector fragment interleaving technique for the same example used in Figure 5.2. When the VIU executes the conditional branch br.lt at address 0x100 as part of fragment A, it still must wait for the branch to resolve because there are no other pending fragments in the PVFB. When the VIU executes the conditional branch br.neq at address 0x108, it can immediately switch to the pending fragment B. The VIU executes fragment B to completion before switching back to fragment A, which is the only remaining fragment. The VIU still needs to wait for the branch to resolve, but the vector lane is kept busy executing the micro-ops for fragment B. Eventually the branch resolves, and execution proceeds similarly to Figure 5.2.

Although it is possible to combine vector fragment merging and interleaving, the fragment interleaving meant to hide execution latencies may not produce the best opportunities for merging. Even if the PVFB is searched every cycle for fragments that can be merged, the best we can hope for is serendipitous reconvergence. This reinforces the importance of statically managed reconvergence, achieved by partitioning large vector-fetched blocks into multiple smaller vector-fetched blocks. Each vector fetch instruction (or any other vector instruction) forces reconvergence across the microthreads.
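The switch-on-branch policy can be sketched in a few lines of C++; the Fragment type and the use of a deque to model the PVFB are illustrative simplifications of the hardware described above.

    #include <cstdint>
    #include <deque>

    struct Fragment { uint32_t pc; uint32_t mask; };

    // Switch-on-branch: when the current fragment issues a vector-fetched scalar
    // branch, park it in the PVFB and start another pending fragment so the
    // vector lane stays busy while the branch resolves.
    Fragment switch_on_branch( Fragment current, std::deque<Fragment>& pvfb )
    {
      if ( pvfb.empty() )
        return current;               // no other fragment: the VIU simply waits
      pvfb.push_back( current );      // the branching fragment becomes pending
      Fragment next = pvfb.front();   // resume some other (independent) fragment
      pvfb.pop_front();
      return next;
    }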

for ( i = 0; i < n; i++ )
  if ( A[i] >= 0 )
    opW;
    if ( A[i] == 0 )
      opX;
  else
    opY;
  opZ;

(a) Pseudocode

loop:
  000: setvl   vlen, n
  004: load.v  VA, a_ptr
  008: fetch.v ut_code
  00c: add     a_ptr, vlen
  010: sub     n, vlen
  014: br.neq  n, 0, loop
  ...

ut_code:
  100: br.lt  a, 0, else
  104: opW
  108: br.neq a, 0, done
  10c: opX
  110: jump   done
else:
  114: opY
done:
  118: opZ
  11c: stop

(b) Pseudo-Assembly

(c) Maven Execution Diagram

Figure 5.4: Example of Vector Fragment Interleaving – Example pseudocode (a) and pseudo-assembly (b) are the same as in Figure 5.2. Execution diagram (c) illustrates how the VIU can switch to fragment B while waiting for fragment A's branch to resolve. (Code syntax similar to that used in Figure 2.14. Vector memory unit not shown for simplicity. CP = control processor, VIU = vector issue unit, PVFB = pending vector fragment buffer.)

5.5 Vector Fragment Compression

As microthreads diverge, more of the execution resources for each fragment are spent on inactive microthreads. The implementation described so far has the nice property that fragments are always executed regularly regardless of the number of active microthreads. The dependency and structural hazard logic in the issue stage of the VIU can ignore the number of active microthreads, since the vector register file and vector functional units are always accessed in the exact same way. Unfortunately, codes with large amounts of divergence will suffer performance and energy overheads associated with processing the inactive microthreads. Vector fragment compression is a hardware technique that skips inactive microthreads so that the fragment micro-ops execute in density-time as opposed to vlen-time.

Vector fragment compression requires more modifications to the VTU than the previous techniques. The VIU fetch/decode stage remains essentially the same, but the issue stage is slightly more complicated. The changes to the issue stage will be discussed later in this section. In addition to modifications in the VIU, the vector functional unit controllers in the vector lane need to handle compressed fragments. In a vlen-time implementation, the controllers simply increment the register read/write address by a fixed offset every cycle, but with vector fragment compression the controllers need more complicated register indexing logic that steps through the sparse microthread mask to generate the appropriate offset. The VMU does not require any modifications, since it already needed to handle a variable number of microthread loads and stores, which occur after the microthreads diverge.

Figure 5.5 illustrates the vector fragment compression technique for the same example used in Figure 5.2. Notice that after the VIU resolves the first vector-fetched scalar branch, fragment A only contains two microthreads. In Figure 5.2, micro-ops for fragment A still take four execution slots (vlen-time execution), but with vector fragment compression they now take only two execution slots (density-time execution). This figure illustrates that, with vector fragment compression, VIU fetch bandwidth is more likely to become a bottleneck. Divergence effectively requires fetch bandwidth proportional to the number of pending vector fragments, but this only becomes significant once we use vector fragment compression to reduce functional unit pressure. Although it is possible to implement a multi-issue VIU to improve performance on highly irregular DLP, Maven uses the simpler single-issue VIU design.

Figure 5.5 also highlights that an extra scheduling constraint is required after a branch. In reality, this extra constraint is only required when the vector unit supports multiple vector functional units. Figure 5.6 shows two example executions for a vector unit similar to Maven with two vector functional units. In these figures, the read and write ports for each vector functional unit are shown as separate columns. Arrows connect the time when the read port reads the sources for a given micro-op and the time when the write port writes the corresponding result. We assume these are long-latency micro-ops. Arrows also connect dependent microthreads across micro-ops. Figure 5.6a illustrates

for ( i = 0; i < n; i++ )
  if ( A[i] >= 0 )
    opW;
    if ( A[i] == 0 )
      opX;
  else
    opY;
  opZ;

(a) Pseudocode

loop:
  000: setvl   vlen, n
  004: load.v  VA, a_ptr
  008: fetch.v ut_code
  00c: add     a_ptr, vlen
  010: sub     n, vlen
  014: br.neq  n, 0, loop
  ...

ut_code:
  100: br.lt  a, 0, else
  104: opW
  108: br.neq a, 0, done
  10c: opX
  110: jump   done
else:
  114: opY
done:
  118: opZ
  11c: stop

(b) Pseudo-Assembly

(c) Maven Execution Diagram

Figure 5.5: Example of Vector Fragment Compression – Example pseudocode (a) and pseudo-assembly (b) are the same as in Figure 5.2. Execution diagram (c) illustrates how the vector lane can skip inactive microthreads for each fragment, but there is an additional scheduling constraint between fragments that are not independent and are compressed differently. (Code syntax similar to that used in Figure 2.14. Vector memory unit not shown for simplicity. CP = control processor, VIU = vector issue unit, PVFB = pending vector fragment buffer.)

(a) Scheduling Multiple Independent Compressed Micro-Ops

(b) Scheduling Compressed Micro-Ops After a Branch

Figure 5.6: Extra Scheduling Constraints with Vector Fragment Compression – (a) micro-ops from the same fragment and from multiple independent fragments can go down the pipeline back-to-back; (b) micro-ops which are compressed differently after a branch as compared to before the branch must be delayed until all micro-ops before the branch have drained the pipeline. (VIU = vector issue unit, VFU = vector functional unit)

the common case. Micro-ops from the same compressed fragment (opR and opS) can go down the pipeline back-to-back, and the result from the first micro-op can be chained to the second micro-op. Micro-ops from fragment D and fragment E can also go down the pipeline back-to-back, even though they are compressed differently, because they are independent. Because these two fragments contain completely different microthreads, there can be no data-dependency hazards between them.

Figure 5.6b illustrates the subtle case where we must carefully schedule a compressed fragment. This occurs when a fragment is compressed differently before and after a branch. In this example, the VIU might issue micro-op opS because the dependency on the first microthread for opR is resolved. Remember that the VIU only tracks the first microthread in a fragment to amortize interlocking and dependency tracking logic. Unfortunately, because the fragment is compressed differently after the branch, µT3 of micro-op opS can read its sources before µT3 of micro-op opR has finished writing them. To prevent this hazard and also avoid per-element dependency logic, the VIU must wait for micro-ops before a branch to drain the pipeline before issuing any micro-ops after the branch. This is simple but overly conservative, since the VIU really only needs to wait for dependent micro-ops. The VIU does not need to wait if the fragment does not diverge, since the same compression will be used both before and after the branch.

Vector fragment compression can be easily combined with fragment merging. This might also reduce fetch pressure, since fragment merging reduces the number of fragments that need to be executed. Combining fragment compression with arbitrary fragment interleaving is also possible, and can help hide the extra scheduling constraint. Vector fragment compression is difficult to implement in multi-lane VTUs, because the lanes need to be decoupled to enable independent execution. Decoupling the lanes is already a significant step towards single-lane VTUs, but fully separate single-lane VTUs are more flexible and easier to implement. The simplicity of implementing vector fragment compression is one reason Maven uses this kind of VTU.
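The register-indexing change behind density-time execution can be sketched in a few lines of C++: a vlen-time controller visits every element slot, while a density-time controller visits only the slots whose mask bit is set. The helper below is an illustrative model of the vector functional unit controller behavior, not the actual control logic.

    #include <cstdint>
    #include <vector>

    // Density-time element sequencing for a compressed fragment: generate register
    // offsets only for active microthreads. A fragment with mask 0b1010 and a
    // hardware vector length of 4 yields {1, 3}, i.e., two execution slots
    // instead of four.
    std::vector<unsigned> density_time_offsets( uint32_t mask, unsigned vlen )
    {
      std::vector<unsigned> offsets;
      for ( unsigned e = 0; e < vlen; e++ )
        if ( mask & ( 1u << e ) )
          offsets.push_back( e );
      return offsets;
    }

Because the generated offset sequence changes whenever the mask changes, micro-ops compressed with different masks lose the fixed element-per-cycle alignment that makes chaining safe, which is exactly why the post-branch pipeline drain described above is required.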

5.6 Leveraging Maven VT Cores in a Full Data-Parallel Accelerator

Figure 5.7 illustrates how Maven VT cores can be combined to create a full data-parallel ac- celerator. Each core includes a small private L1 instruction cache for instructions executed by both the control thread and the microthreads, while data accesses for both types of threads go straight to a banked L2 cache. To help amortize the network interface and exploit inter-core locality, four cores are clustered together to form a quad-core tile. Each tile contains an on-chip network router and four cache banks that together form the L2 cache shared by the four cores (but private with respect to the other quad-core tiles). The four cores are connected to the four L2 cache banks via the memory crossbar, while the L2 cache banks are connected to the on-chip network router via the network interface crossbar. To balance the core arithmetic and memory bandwidth, the channels between the cores and the L2 cache banks are all scalar width (32 b/cycle), while the channels in the network interface and in the on-chip network itself are much wider (256 b/cycle).

Figure 5.7: Maven Data-Parallel Accelerator with Array of VT Cores – A Maven VT core includes a control processor, single-lane VTU, and small L1 instruction cache. Four cores are combined into a quad-core tile along with four shared L2 cache banks and a shared network interface. (C = Maven VT core, $ = L2 cache bank, R = on-chip network router, CP = control processor, VTU = vector-thread unit; narrow channels are 32-bit, wide channels are 256-bit)

Figure 5.8: Microarchitecture for Quad-Core Tile – Four Maven VT cores (each with its own control processor, single-lane VTU, and L1 instruction cache as shown in Figure 5.1) are combined into a quad-core tile with a four-bank L2 cache shared among the four lanes. Each tile shares a single wide (256 b) network interface. (Control signals shown with dashed lines. Xbar = crossbar, val = valid signal, rdy = ready signal, md = metadata, addr = address, ack = acknowledgement, N×M crossbar = N inputs and M outputs)

Figure 5.8 shows a possible microarchitecture for the quad-core tile. Each core has separate instruction, load, and store ports, resulting in 12 requesters arbitrating for the four L2 cache banks. There are dedicated load/store stream buffers for each core next to each L2 cache bank to enable wide bank accesses that are then streamed to or from the cores. The memory crossbar is similar to the intra-VMU crossbar found in multi-lane vector units, but is a little simpler. The memory crossbar in Figure 5.8 requires less muxing, since the element rotation is implemented temporally in the single-lane VMU and associated stream buffers. The design sketched in Figures 5.7 and 5.8 is just one way single-lane VT cores can be combined into a data-parallel accelerator. The on-chip caches, on-chip interconnection network, and off-chip memory system are all important parts of data-parallel accelerators, but a full investigation of this design space is beyond the scope of this thesis. Instead, this thesis focuses on just the specialized data-parallel cores that are the primary distinguishing characteristic of such accelerators.

5.7 Extensions to Support Other Architectural Design Patterns

We would like to compare the energy-efficiency and performance of Maven to the other ar- chitectural design patterns presented in Chapter 2, but we do not necessarily want to build several completely different implementations. This section describes how the Maven microarchitecture can be extended to emulate the MIMD, vector-SIMD, and SIMT architectural design patterns for eval- uation purposes. These extensions leverage the instruction set extensions described in Section 4.6.

5.7.1 MIMD Extensions to Maven Microarchitecture

Our multithreaded MIMD core is based on the Maven control processor with a larger architectural register file to handle additional microthread contexts. The larger register file still requires only two read ports and two write ports. Although the MIMD core does not include the VTU, it does include the long-latency functional units as a separate decoupled module directly attached to the same control processor embedding queues used in Maven. This allows the exact same issue logic to be used in both the Maven control processor and the MIMD core. The MIMD core is similar to the control processor in its ability to overlap short-latency instructions with decoupled loads and long-latency operations. None of the vector instruction decode logic is present in the MIMD core, and there is no need to share the memory ports. The MIMD core also includes additional thread scheduling logic to choose between the active microthreads using a simple round-robin arbiter. The ability to mark microthreads as active or inactive, and thus remove them from the scheduling pool, improves performance on serial codes. This also means that instructions from the same thread can flow down the pipeline back-to-back, so our implementation provides full bypassing and interlocking logic.
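The thread selection logic can be modeled with a short C++ sketch of a round-robin arbiter over an active-thread mask; the class and method names are illustrative, not part of any Maven interface.

    #include <cstdint>

    // Round-robin selection among active microthreads; a C++ model of the MIMD
    // core's thread scheduling logic (class and method names are illustrative).
    class RoundRobinScheduler {
     public:
      explicit RoundRobinScheduler( unsigned nthreads ) : n_(nthreads) {}

      void set_active( unsigned tid, bool active ) {
        if ( active ) active_mask_ |=  ( 1u << tid );
        else          active_mask_ &= ~( 1u << tid );
      }

      // Returns the next active thread after the last one issued, or -1 if none.
      int next_thread() {
        for ( unsigned i = 1; i <= n_; i++ ) {
          unsigned tid = ( last_ + i ) % n_;
          if ( active_mask_ & ( 1u << tid ) ) { last_ = tid; return (int)tid; }
        }
        return -1;
      }

     private:
      unsigned n_;
      uint32_t active_mask_ = 0;
      unsigned last_ = 0;
    };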

5.7.2 Vector-SIMD Extensions to Maven Microarchitecture

Our vector-SIMD core uses a very similar microarchitecture to the one shown in Figure 5.1, with additional support for executing the vector-SIMD instruction set shown in Tables 4.4 and 4.5. The vector lane for the vector-SIMD core removes the BRMRs but adds a separate vector flag register file and a new vector flag functional unit (VFFU). The vector flag register file includes eight 32-bit entries, four read ports, and three write ports. VFU0 and VFU1 can both read an entry in the vector flag register file (to support vector arithmetic operations executed under a flag) and also write an entry in the vector flag register file (to support the flag integer and floating-point compare instructions). The single-cycle VFFU requires two read ports and one write port to execute the simple vector flag arithmetic instructions (e.g., or.f and not.f). The VIU issue stage is the same in both the Maven VT core and the vector-SIMD core. The vector-SIMD fetch/decode stage does not include the PVFB nor the extra vector-fetched instruction port. This means the control processor has a dedicated instruction port and L1 instruction cache in the vector-SIMD core.

The vector-SIMD instruction set extensions were specifically designed to mirror their scalar counterparts. For example, the vector-SIMD add.s.vv instruction is executed in a similar way as a vector-fetched scalar add.s instruction. The fetch/decode stage takes vector instructions from the VIU queue and converts them into the same internal micro-op format used by the Maven VT core, except without the microthread mask. The vector flag registers are read at the end of the VFU0 and VFU1 pipelines to conditionally write the results back to the vector register file, as opposed to the PVFB, which is at the front of the vector pipeline. This allows chaining flag values between vector instructions. The VIU queue needs to handle an extra scalar operand for vector-scalar instructions, but this space is also required in the Maven VT core for some instructions (i.e., the base address for vector memory instructions). This scalar value is propagated along with the vector instruction to the vector lane, similar to Maven's vector-scalar move instruction (mov.sv).

Unit-stride and strided vector memory instructions are executed almost identically in the vector-SIMD core as in the Maven VT core. Indexed vector memory instructions are similar to microthread loads/stores, except that a base address is included from the control-thread's scalar register file. The VMU then requires an additional address generation unit for adding the indexed base register to the offsets being streamed from the vector register file by the VAU.

In addition to a single-lane vector-SIMD core, we also want to compare our single-lane Maven VT core to multi-lane vector-SIMD cores. For these configurations, we simply instantiate multiple vector lane modules and broadcast the VIU control information to all lanes. The VIU needs to be aware of the longer hardware vector lengths, but is otherwise identical to the single-lane configuration. Unfortunately, the multi-lane VMU is different enough from the single-lane implementation to require a completely different module. The multi-lane VMU must manage a single set of address generators and multiple VAQs, VSDQs, and VLDQs such that they can be used in unit-stride, strided, and indexed accesses. The control processor is embedded in the single-lane configuration,

and currently it is also embedded in the multi-lane configurations to simplify our evaluation. A different design might use a completely separate control processor in wide multi-lane vector-SIMD cores, to avoid wasting many execution slots across the lanes for each long-latency control processor operation.

5.7.3 SIMT Extensions to Maven Microarchitecture

The SIMT pattern is emulated by running software using a subset of the Maven VT instruction set on an unmodified Maven VT core. Although this provides a reasonable emulation, there are some important caveats. First, our SIMT core includes a control processor, yet one of the key aspects of the SIMT pattern is that it specifically lacks such a control processor. We address this discrepancy by limiting the code which executes on the control processor. Code which still runs on the control processor resembles some of the fixed hardware logic that would be found in a SIMT core's VIU, but is probably less efficient than an actual SIMT core. In addition, the Maven VT core does not include any special memory coalescing hardware, so the microthread accesses in our SIMT emulation are always executed as scalar loads and stores. The VT execution mechanisms do, however, reasonably capture the control-flow characteristics of a SIMT core, including vector-like execution of coherent microthreads and efficient divergence management.

5.8 Future Research Directions

This section briefly describes some possible directions for future improvements with respect to the Maven microarchitecture.

Support for New Instructions – Section 4.6 discussed several new instructions that would affect the microarchitecture in non-trivial ways. Some of these extensions, such as vector compress and expand instructions and vector shared registers, are more straightforward to implement in a single-lane implementation, since they would otherwise require cross-lane communication in a multi-lane implementation. Integrating vector segment accesses into the current Maven pipeline is an open research question, since these operations write to the register file in a non-standard way. The Scale technique leverages atomic instruction block interleaving, which is not possible in the Maven VT core. Vector fetch atomic instructions could instead be implemented specially in the VIU. When the VIU encounters this instruction, it executes the first microthread in the current fragment by itself until it reaches a stop instruction. The VIU continues serializing the execution of all microthreads in this way. When finished with the vector fetch atomic instruction, the VIU can start executing the next vector instruction as normal. A small instruction buffer in the VIU can help amplify the instruction fetch bandwidth needed for executing these atomic blocks. Combining atomic blocks with vector fragment compression might provide an efficient mechanism for handling cross-iteration dependencies via shared registers, similar to the atomic instruction blocks used in Scale.
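Assuming the serialization order described above, a hypothetical vector-fetch-atomic handler might behave like the following C++ sketch; both function names are invented for illustration, and this instruction is only a proposal, not part of the current Maven design.

    #include <cstdint>
    #include <cstdio>

    // Stand-in for fetching and issuing the scalar instructions of one microthread
    // until it reaches a stop instruction (hypothetical; not a real Maven interface).
    static void execute_block_for_microthread( uint32_t start_pc, unsigned utidx )
    {
      std::printf( "uT%u runs block at 0x%x to completion\n", utidx, start_pc );
    }

    // Hypothetical VIU handling of a vector-fetch-atomic block: the active
    // microthreads execute the block one at a time, serializing the whole block.
    void vfetch_atomic( uint32_t start_pc, uint32_t active_mask, unsigned vlen )
    {
      for ( unsigned ut = 0; ut < vlen; ut++ )
        if ( active_mask & ( 1u << ut ) )
          execute_block_for_microthread( start_pc, ut );
    }

    int main() { vfetch_atomic( 0x100, 0xFu, 4 ); return 0; }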

Compiler Assistance for Vector Fragment Merging – As discussed in Section 5.3, developing effective hardware policies for choosing the microthread execution order to promote vector fragment merging is a difficult problem. Static compiler analysis could definitely help the hardware determine when to switch fragments. For example, the compiler could generate a reconvergence wait instruction at points in the instruction stream where it can statically determine that most or all of the microthreads should reconverge. When the VIU encounters a wait instruction it marks the corresponding fragment in the PVFB as waiting and switches to another non-waiting fragment. When the VIU encounters another wait instruction it merges the corresponding microthreads into the waiting fragment and switches to another non-waiting fragment. Eventually the VIU will have executed all microthreads such that they have either completed or become part of the waiting fragment. The VIU can then proceed to execute the microthreads that have reconverged as part of the waiting fragment. A wait instruction that supports multiple waiting tokens enables the compiler to generate multiple independent wait points. Since this instruction is only a hint, the compiler can be optimistic without concern for deadlock. The VIU is always allowed to simply ignore the wait instruction, or to start executing a waiting fragment prematurely.

Vector-Fetched Scalar Branch Chaining – Although vector fragment interleaving helps hide the vector-fetched scalar branch resolution latency, it does not help when there are no pending fragments. For example, the VIU still needs to wait for the first branch to fully resolve in Figure 5.3c. A complementary technique can predict that at least one microthread will not take the branch. Since the branches for each microthread are resolved incrementally, the VIU can go ahead and issue the first fall-through instruction before the BRMR is completely written. We can essentially chain the BRMR to the instruction on the fall-through path. If the first microthread does not take the branch, the result is written back into the vector register file as normal, but if the first microthread does take the branch, then the writeback is disabled. The same process occurs for all microthreads in this first fall-through instruction and for later instructions on the fall-through path. When the original vector-fetched scalar branch is completely resolved, the VIU will take one of three actions. If the BRMR is all zeros (no microthreads take the branch), then the VIU simply continues executing the fall-through path. If the BRMR is all ones (all microthreads take the branch), then the VIU has executed several cycles of wasted work and switches to executing the taken path. If the microthreads diverged, then the VIU creates a new pending fragment to capture the taken path and continues along the fall-through path. This technique closely resembles how vector flags are chained in the vector-SIMD core, but it also has the advantages of being a vector-fetched scalar branch. The disadvantage is wasted work when all microthreads take the branch. This technique can be combined with vector fragment merging and interleaving, but might be difficult to combine with vector fragment compression. One option is just to avoid compressing the speculative fall-through path.

Vector Banking – The current Maven vector register file is quite large due to the six read and three write ports required to support the five vector functional units.
We could reduce the number of vector functional units, but this would significantly impact performance. An alternative is to exploit the structured way in which vector fragment micro-ops access the vector register file. Since a vector fragment micro-op can only access a given element on a specific cycle, we can element-partition the vector register file into multiple banks, each with fewer ports. This technique is common in some kinds of traditional vector-SIMD machines [Asa98]. For example, assume we implement the Maven vector register file with four banks, each with two read ports and two write ports. Vector elements are striped across the banks such that each bank contains one-fourth of the total number of vector elements. A vector fragment micro-op would cycle around the banks as it stepped through the vector elements. Successive vector instructions follow each other as they cycle around the banks in lock-step, resulting in no bank conflicts. Two read ports and one write port would be for general use by VFU0, VFU1, VAU, and VSDRU. The second write port would be dedicated for use by the VLDWU. This allows all five vector functional units to access the vector register file at the same time, but with many fewer read and write ports per register bitcell. Ultimately, this results in a smaller, faster, and more energy-efficient vector register file. Vector fragment merging and interleaving can use a banked vector register file without issue. Banking does complicate vector fragment compression, since compressed fragments access the vector register file in a less structured way. The solution is to limit the amount of compression which is supported. For example, with four banks a fragment could not be compressed to fewer than four microthreads even if fewer than four microthreads were active.
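The element striping can be captured in one line of arithmetic: element e maps to bank e mod 4 at intra-bank offset e div 4. The C++ sketch below spells this out; the struct and function names are illustrative.

    #include <cstdio>

    // Element striping for a four-bank vector register file: element e of a
    // vector register lives in bank (e % nbanks) at intra-bank offset (e / nbanks).
    // A micro-op stepping through elements 0, 1, 2, ... therefore visits the banks
    // round-robin, so back-to-back micro-ops never conflict on a bank.
    struct BankAddr { unsigned bank; unsigned offset; };

    BankAddr map_element( unsigned element, unsigned nbanks = 4 )
    {
      return { element % nbanks, element / nbanks };
    }

    int main()
    {
      for ( unsigned e = 0; e < 8; e++ ) {
        BankAddr a = map_element( e );
        std::printf( "element %u -> bank %u, offset %u\n", e, a.bank, a.offset );
      }
      return 0;
    }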

5.9 Related Work

This section highlights selected previous work specifically related to the Maven VT core microarchitecture presented in this chapter.

Scale VT Processor – Scale uses a multi-lane VTU with a very different microarchitecture than the one presented in this chapter [KBH+04a]. The four Scale lanes and the four functional units within each lane are all highly decoupled. As a result, the Scale microarchitecture is more complicated but also supports a more dynamic style of execution, while the Maven microarchitecture is simpler and supports a style of execution closer to the vector-SIMD pattern. The Scale microthread control-flow mechanism uses a significantly different implementation than Maven's. Upon a thread-fetch, the Scale VIU immediately begins fetching new scalar instructions for just that microthread. The Scale VIU does not wait in an effort to exploit partial vector-like efficiencies as in the Maven VIU. This is analogous to only supporting vector fragments that either contain a single active microthread (after a thread-fetch) or all active microthreads (after a vector fetch). Scale provides small instruction caches in each lane in an attempt to dynamically capture some of the locality inherent when microthreads have coherent control flow. The vector fragment mechanisms attempt to preserve as much vector-like efficiency as possible even after the microthreads have diverged. Scale's thread-fetch mechanisms naturally produce a density-time implementation, while Maven can eventually use vector fragment compression to eliminate inactive microthreads.

Maven executes a vector-fetched scalar instruction for all microthreads before moving to the next vector-fetched scalar instruction, while Scale executes several instructions for one microthread before switching to execute those same instructions for the next microthread. This is called atomic instruction-block interleaving, and it allows more efficient use of temporary architectural state (such as functional unit bypass latches) but requires more complicated vector functional unit controllers. The Scale controllers continually toggle control lines as they step through the instructions for each microthread; in Maven, the control lines stay fixed while executing each instruction completely. Scale uses software-exposed clusters to avoid a large multi-ported vector register file, while in the future a Maven VT core will likely use a banked vector register file for the same purpose.

Cray-1 Single-Lane Vector-SIMD Unit and Embedded Control Processor – The Cray-1 vector processor uses a single-lane vector unit with six functional units and eight vector registers each containing 64 × 64-bit elements [Rus78]. The control processor is partially embedded, since it shares the long-latency floating-point functional units with the vector lane but has its own data port to memory. In some sense, Maven's focus on single-lane VTUs is reminiscent of the earliest vector processor designs, except with increased flexibility through VT mechanisms.

Cray X1 Multi-Streaming Processors – The more recent Cray X1 can have hundreds of vector-SIMD cores called single-streaming processors (SSPs), where each SSP is a separate discrete chip containing a control processor and a two-lane vector unit [DVWW05]. Four SSPs are packaged with 2 MB of cache to create a multi-streaming processor (MSP) module. For irregular DLP, the SSPs can operate as independent vector-SIMD cores, each supporting 32 vector registers with 64 elements each. For regular DLP, the four SSPs in an MSP can be ganged together to emulate a larger vector-SIMD core with eight lanes and a hardware vector length of 256. For efficient emulation, the Cray X1 includes fast hardware barriers across the four SSPs and customized compiler support for generating redundant code across the four control processors.

Partitioned Vector-SIMD Units – Rivoire et al. also observe that data-parallel application performance varies widely for a given number of lanes per vector unit [RSOK06]. Some applications with regular DLP see great benefit from increasing the number of lanes, while other applications with irregular DLP see little benefit. In Maven, this observation has led us to pursue single-lane VTUs, while Rivoire et al. propose a different technique that can dynamically partition a wide vector unit into multiple narrower vector units. With an eight-lane vector unit, a single control thread can manage all eight lanes, two control threads can manage four lanes each, or eight control threads can each manage a single lane. This final configuration resembles our approach, except that multiple control threads are mapped to a single control processor instead of the embedded control processor per lane used in Maven. There are significant overheads and implementation issues associated with dynamically partitioning a wide vector unit, including increased instruction fetch pressure. Maven takes the simpler approach of only providing single-lane VTUs with embedded control processors.

Temporal vector execution provides competitive energy efficiency, while independent single-lane cores provide good flexibility. The partitioned vector-SIMD approach, where a wide vector unit is partitioned into multiple smaller vector units, can be seen as the dual of the Cray X1 MSPs, where multiple smaller vector units are ganged together to create a larger vector unit [DVWW05].

Larrabee’s Multithreaded Control Processor – The Larrabee accelerator uses a multithreaded control processor with four control threads and a single 16-lane vector unit with a hardware vector length of 16 [SCS+09]. The Larrabee vector register file is large enough to support four inde- pendent sets of architectural vector registers, so effectively each control thread’s logical array of microthreads are time multiplexed onto a single vector unit. An alternative design could use a sin- gle control thread and simply increase the hardware vector length to 64. Both designs require the same number of physical registers, and both designs time multiplex the microthreads. The dif- ference is whether this time multiplexing is supported through a longer hardware vector length or through multiple control threads. Four control threads that share the same vector unit can offer more flexibility, but less energy amortization, as compared to a single control thread with a much longer hardware vector length. In addition, the extra control threads require a more aggressive control processor to keep the vector unit highly utilized, especially if each Larrabee vector instruction only keeps the vector unit busy for a single cycle. Extra control threads can help tolerate memory and functional-unit latencies, but Maven achieves a similar effect through control processor decoupling (for memory latencies) and exploiting DLP temporally (for functional unit latencies).

Single-Lane Density-Time Vector-SIMD Units – Others have proposed single-lane density-time vector units. For example, Smith et al. describe a single-lane implementation that allows the VIU to skip inactive microthreads in power-of-two increments [SFS00]. Unfortunately, such schemes significantly complicate chaining. The VIU must wait for flags to be computed first before com- pressing the vector instruction leading to a VIU stall similar to the vector-fetched scalar branch resolution latency. Once the VIU compresses a vector instruction there is no guarantee the next instruction can also be compressed in the same fashion meaning that the VIU may not be able to chain operations executing under a flag register. Out-of-order issue complicates the VIU but might help find an instruction with the same flag register for chaining, or an instruction with a flag register containing an independent set of microthreads for interleaving. Future Maven VT cores will be able to provide density-time execution and fragment interleaving with a much simpler implementation due to the way in which conditionals are expressed in a VT instruction set.

NVIDIA SIMT Units – The NVIDIA G80 [LNOM08] and Fermi [nvi09] families of graphics processors implement the SIMT architectural design pattern but also include microthread control-flow mechanisms similar in spirit to the vector fragment technique presented in this chapter. Although the NVIDIA SIMT units can handle microthread divergence, the exact microarchitectural mechanism has not been discussed publicly.

It is clear, however, that the NVIDIA SIMT units use multiple lanes without support for density-time execution. To compensate for the lack of control processor decoupling in the SIMT pattern, each SIMT unit contains architectural state for many microthread blocks. Effectively, graphics processors use highly multithreaded VIUs, and this allows the SIMT unit to hide memory and execution latencies by switching to a different microthread block. This has some similarities to providing multiple control threads. SIMT units require hundreds of microthread blocks to hide long memory latencies, while Maven control processor decoupling can achieve a similar effect with much less architectural state. NVIDIA SIMT units also support a technique similar to vector fragment interleaving, and appear to support static hints in the instruction stream for vector fragment merging.

Dynamic Fragment Formation – Vector fragment compression is difficult to implement in multi-lane implementations, usually resulting in low utilization for highly divergent code. Fung et al. propose a new SIMT core design that dynamically composes microthreads from different microthread blocks to fill unused execution slots [FSYA09]. A similar technique is possible in vector-SIMD or VT cores that support multiple control threads by composing vector fragments from different control threads. Both approaches effectively generate new fragments at run-time that are not in the original program. This of course relies on the fact that the different microthread blocks or control threads are executing the same code. Constraints on the types of allowed composition prevent the need for a full crossbar between lanes. Unfortunately, dynamic fragment formation can interfere with memory coalescing and requires good fragment interleaving heuristics to enable opportunities for using this technique. An associative structure similar to the PVFB, but much larger to support multiple microthread blocks or control threads, is required to discover fragments that are executing the same instruction. In discussing dynamic fragment formation, Fung et al. also investigate reconvergence techniques and note the importance of good heuristics to promote vector fragment merging.

Chapter 6

Maven Programming Methodology

This chapter describes the Maven programming methodology, which combines a slightly modified C++ scalar compiler with a carefully written application programming interface to enable an explic- itly data-parallel programming model suitable for future VT accelerators. Section 3.3 has already discussed how this approach can raise the level of abstraction yet still allow efficient mappings to the VT pattern. Section 6.1 gives an overview of the methodology including the VT C++ application programmer’s interface and two example programs that will be referenced throughout the rest of the chapter. Section 6.2 discusses the required compiler modifications, and Section 6.3 discusses the implementation details of the VT application programmer’s interface. Section 6.4 briefly outlines the system-level interface for Maven. Section 6.5 discusses how the Maven programming methodol- ogy can be leveraged to emulate the MIMD, vector-SIMD, and SIMT architectural design patterns. The chapter concludes with future research directions (Section 6.6) and related work (Section 6.7).

6.1 Programming Methodology Overview

The Maven programming methodology supports applications written in the C++ programming language. C++ is low-level enough to enable efficient compilation, yet high-level enough to enable various modern programming patterns such as object-oriented and generic programming. Maven ap- plications can use the vector-thread application programmer’s interface (VTAPI) to access Maven’s VT capabilities. Table 6.1 lists the classes, functions, and macros that form the VTAPI. The VTAPI relies on the use of a special low-level HardwareVector class. This class represents vector types that can be stored in vector registers or in memory (but never in control-thread scalar regis- ters). A hardware vector contains a number of elements equal to the current hardware vector length. This class is templated based on the type of the elements contained with the hardware vector, al- though currently the HardwareVector class is not fully generic. It is instead specialized for the following types: unsigned int, unsigned short, unsigned char, signed int, signed short, signed char, char, float, T* (any pointer type). The HardwareVector class in-

117 Hardware Vector Class void HardwareVector::load( const T* in ptr ); Unit-stride load void HardwareVector::load( const T* in ptr, int stride ); Strided load void HardwareVector::store( const T* out ptr ); Unit-stride store void HardwareVector::store( const T* out ptr, int stride ); Strided store void HardwareVector::set all elements( const T& value ); mov.sv instruction void HardwareVector::set element( int idx, const T& value ); mtut instruction T HardwareVector::get element( int idx ); mfut instruction Vector Configuration Functions int config( int nvregs, int app vlen ); Compiles into vcfgivl instruction int set vlen( int app vlen ); Compiles into setvl instruction Memory Fence Functions void sync l(); void sync g(); Compiles into sync.{l,g} instruction void sync l v(); void sync g v(); Compiles into sync.{l,g}.v instruction void sync l cv(); void sync g cv(); Compiles into sync.{l,g}.cv instruction Atomic Memory Operations Functions int fetch add( int* ptr, int value ); Compiles into amo.add instruction int fetch and( int* ptr, int value ); Compiles into amo.and instruction int fetch or ( int* ptr, int value ); Compiles into amo.or instruction Miscellaneous Functions int get utidx(); Compiles into utidx instruction Vector-Fetch Preprocessor Macro VT VFETCH( (outputs), (inouts), (inputs), (passthrus), ({ body; }));

Table 6.1: Maven VT Application Programming Interface (VTAPI) – Classes, functions, and preprocessor macros which form the Maven VTAPI. All classes and functions are encapsulated in the vt:: C++ namespace. The templated HardwareVector class is currently not fully generic and is instead specialized for the following types: unsigned int, unsigned short, unsigned char, signed int, signed short, signed char, char, float, T* (any pointer type).

The HardwareVector class includes member functions for loading and storing elements from standard C++ arrays and for accessing the elements in the hardware vector. Additional free functions are provided for configuring the vector unit, memory fences, atomic memory operations, and accessing the current microthread’s index. The key to the VTAPI is a sophisticated macro called VT_VFETCH that vector fetches code onto the microthreads. When using the macro, a programmer specifies the input and output hardware vectors for the vector-fetched block, and the VTAPI handles connecting the vector view of these registers with the scalar view seen by the microthreads. The VTAPI will be discussed in more detail in Section 6.3.

Figure 6.1: Maven Software Toolchain – Maven C++ applications can be compiled into either a native executable or a Maven executable. Native executables emulate the VT application programming interface to enable rapid testing and development. A preprocessing step is required when compiling Maven executables to connect the control-thread’s vector register allocation with the microthread’s scalar register allocation.

Figure 6.1 illustrates the Maven software toolchain. A C++ application which uses the Maven VTAPI is first preprocessed and then compiled to generate a Maven executable (see Section 6.2 for more details on the compiler, and Section 6.3 for more details on the Maven preprocessor). The Maven executable can then be run on a functional simulator, a detailed RTL simulator, or eventually a Maven prototype. A key feature of the Maven programming methodology is that all Maven applications can also be compiled with a native C++ compiler. Currently only the GNU C++ compiler is supported, since the methodology makes extensive use of non-standard extensions specific to the GNU C++ toolchain [gnu10]. Native compilation does not attempt to generate an optimized executable suitable for benchmarking, but instead generates an executable that faithfully emulates the Maven VTAPI. Native emulation of Maven applications has several benefits. Programs can be quickly developed and tested without a Maven simulator (which can be slow) or prototype (which may not be available for some time). Once a developer is satisfied that the application is functionally correct using native compilation, the developer can then start using the Maven compiler. Problems that arise when compiling for Maven can be quickly isolated to a small set of Maven-specific issues as opposed to general problems with the application functionality. In addition to faster development and testing, native executables can also be augmented with run-time instrumentation to gather various statistics about the program. For example, it is possible to quickly capture average application vector lengths and the number and types of vector memory accesses.

Figure 6.2 illustrates using the VTAPI to implement the regular DLP loop in Table 2.1c. This example compiles to the assembly code shown in Figure 4.2a. The config function on line 3 takes two arguments: the required number of vector registers and the application vector length. Currently, the programmer must iterate through the software toolchain multiple times to determine how many registers are needed to compile the microthread code, but it should be possible to improve the methodology such that this is optimized automatically. The config function returns the actual number of microthreads supported by the hardware (the hardware vector length), and this value is then used to stripmine across the input and output arrays via the for loop on line 5.

 1 void rdlp_vt( int c[], int a[], int b[], int size, int x )
 2 {
 3   int vlen = vt::config( 5, size );
 4   vt::HardwareVector<int> vx(x);
 5   for ( int i = 0; i < size; i += vlen ) {
 6     vlen = vt::set_vlen( size - i );
 7
 8     vt::HardwareVector<int> vc, va, vb;
 9     va.load( &a[i] );
10     vb.load( &b[i] );
11
12     VT_VFETCH( (vc), (), (va,vb), (vx),
13     ({
14       vc = vx * va + vb;
15     }));
16
17     vc.store( &c[i] );
18   }
19   vt::sync_l_cv();
20 }

Figure 6.2: Regular DLP Example Using Maven Programming Methodology – Code corresponds to the regular DLP loop in Table 2.1c and compiles to the assembly shown in Figure 4.2a. (C++ code line (3,5–6) handles stripmining, (4) copies scalar x into all elements of hardware vector vx, (9–10) unit-stride vector loads, (12) specifies input and output hardware vectors for vector-fetched block, (13–15) vector-fetched block which executes on microthreads, (17) unit-stride store, (19) vector memory fence.)

The additional call to set_vlen on line 6 allows the stripmine loop to naturally handle cases where size is not evenly divisible by the hardware vector length. Line 4 instantiates a hardware vector containing elements of type int and initializes all elements in the vector with the scalar value x. This shared variable will be kept in the same hardware vector across all iterations of the stripmine loop. Line 8 instantiates the input and output hardware vectors also containing elements of type int. Lines 9–10 are effectively unit-stride loads; vlen consecutive elements of arrays a and b are moved into the appropriate hardware vector with the load member function. Lines 12–15 contain the vector-fetched code for the microthreads. Line 17 stores vlen elements from the output hardware vector to the array c, and line 19 performs a memory fence to ensure that all results are visible in memory before returning from the function.

The VT_VFETCH macro on lines 12–15 is a key aspect of the VTAPI. The macro takes five arguments: a list of output hardware vectors, a list of hardware vectors that are both inputs and outputs, a list of input hardware vectors, a list of hardware vectors that should be preserved across the vector-fetched block, and finally the actual code which should be executed on each microthread. The programming methodology requires that the input and output hardware vectors be explicitly specified so that the compiler can properly manage register allocation and data-flow dependencies. With more sophisticated compiler analysis it should be possible to determine the input and output hardware vectors automatically. The code within the vector-fetched block specifies what operations to perform on each element of the input and output hardware vectors.

 1 void idlp_vt( int c[], int a[], int b[], int size, int x )
 2 {
 3   int vlen = vt::config( 7, size );
 4   vt::HardwareVector<int> vx(x);
 5   for ( int i = 0; i < size; i += vlen ) {
 6     vlen = vt::set_vlen( size - i );
 7
 8     vt::HardwareVector<int*> vcptr(&c[i]);
 9     vt::HardwareVector<int> va, vb;
10     va.load( &a[i] );
11     vb.load( &b[i] );
12
13     VT_VFETCH( (), (), (va,vb), (vcptr,vx),
14     ({
15       if ( va > 0 )
16         vcptr[vt::get_utidx()] = vx * va + vb;
17     }));
18
19   }
20   vt::sync_l_cv();
21 }

Figure 6.3: Irregular DLP Example Using Maven Programming Methodology – Code corresponds to the irregular DLP loop in Table 2.1f and compiles to the assembly shown in Figure 4.2b. Notice the vector-fetched block includes a conditional statement that compiles to a scalar branch. (C++ code line (3,5–6) handles stripmining, (4) copies scalar x into all elements of hardware vector vx, (8) copies c array base pointer to all elements of hardware vector vcptr, (10–11) unit-stride vector loads, (13) specifies input and output hardware vectors for vector-fetched block, (14–17) vector-fetched block which executes on microthreads, (16) uses vt::get_utidx() to write into the proper element of output array, (20) vector memory fence.)

This means that the C++ type of a hardware vector is very different inside versus outside the vector-fetched block. Outside the block, a hardware vector represents a vector of elements and has type HardwareVector<T> (e.g., va on line 9 of Figure 6.2 has type HardwareVector<int>), but inside the block, a hardware “vector” now actually represents a single element and has type T (e.g., va on line 14 has type int). This also means the microthread code is type-safe in the sense that only operations that are valid for the element type are allowed within a vector-fetched block. Code within a vector-fetched block can include almost any C++ language feature including stack-allocated variables (the VTAPI ensures that all microthreads have their own unique stack), object instantiation, templates, function and method calls, conditionals (if, switch), and loops (for, while). The primary restrictions are that a vector-fetched block cannot use C++ exceptions nor make any system calls, thus microthreads cannot currently perform dynamic memory allocation. Note that these restrictions do not apply when compiling a Maven application natively, so it is possible to insert debugging output into a vector-fetched block during early development through native emulation.

Figure 6.3 illustrates using the VTAPI to implement the irregular DLP loop in Table 2.1f. This example compiles to the assembly code shown in Figure 4.2b.

The primary difference from the previous example is the conditional if statement on line 15. Also notice that this example no longer uses a unit-stride store for the output; each microthread instead uses its own scalar store if the condition is true. The base pointer for the array c is copied into all elements of the hardware vector vcptr on line 8 and then passed into the vector-fetched block. This allows each microthread to independently calculate the proper location for its output. After compiling this example, it was observed that more registers were needed in the vector-fetched block, which is why the vector-thread unit (VTU) is configured with seven registers per microthread on line 3 (instead of the five registers in Figure 6.2). The Maven functional simulator checks to verify that an application is not accessing registers that are unavailable.

6.2 VT Compiler

The Maven programming methodology attempts to leverage a standard scalar compiler as much as possible. We started with the most recent GNU assembler, linker, and C++ compiler (version 4.4.1), which all contain support for the basic MIPS32 instruction set. We then modified the assembler to support the new Maven scalar and vector instructions; modified the compiler back-end to support vector registers and intrinsics; modified the compiler instruction scheduler to help vector functional unit performance tuning; and modified the compiler front-end to support long vector types and new function attributes. Each of these is described in more detail below.

• Modified Assembler – The assembler was modified as follows: added support for a unified integer and floating-point register space; updated floating-point instruction encodings appropriately; added support for new Maven instructions; removed unsupported instructions and the branch delay slot.

• Modified Compiler Back-End – The compiler back-end needed significant changes to unify the integer and floating-point registers. Instruction templates were added for the new divide and remainder instructions to allow the compiler to correctly generate these instructions. A new vector register space and the corresponding instruction templates required for register allocation were added. Some of these modifications were able to leverage the GNU C++ compiler’s built-in support for fixed-length subword-SIMD instructions. Compiler intrinsics for some of the vector instructions were added to enable software to explicitly generate these instructions and for the compiler to understand their semantics.

• Modified Compiler Instruction Scheduler – The compiler instruction scheduling framework was used to create two new pipeline models for Maven: one for the control thread and one for the microthreads. The two types of threads have different performance characteristics.

 1 void intcopy_vt( int out[], int in[], int size )
 2 {
 3   int vlen;
 4   asm volatile ( "setvl %0, %1" : "=r"(vlen) : "r"(size) );
 5   for ( int i = 0; i < size; i += vlen ) {
 6     asm volatile ( "setvl %0, %1" : "=r"(vlen) : "r"(size-i) );
 7
 8     int vtemp __attribute__ ((vector_size(128)));
 9     vtemp = __builtin_mips_maven_vload_vsi( &in[i] );
10     __builtin_mips_maven_vstore_vsi( vtemp, &out[i] );
11   }
12   asm volatile ( "sync.l.cv" ::: "memory" );
13 }

Figure 6.4: Low-Level Example Using the Maven Compiler – Simple function which copies an array of integers. Modifications to a standard GNU C++ compiler enable vector types to be allocated to vector registers and provide vector intrinsics for common vector memory operations. (C++ code line (4,6) GNU C++ inline assembly extensions for setting hardware vector length, (8) instantiate a vector type with 32 × int using GNU C++ subword-SIMD extensions, (9–10) use compiler intrinsics for unit-stride vector loads and stores, (12) GNU C++ inline assembly extensions for memory fence.)

For example, a control-thread floating-point instruction occupies one floating-point functional unit for a single cycle and has a latency of four cycles, while a microthread floating-point instruction occupies one floating-point functional unit for a number of cycles equal to either the vector length or the active vector length (depending on whether we have a density-time implementation) but still has a latency of four cycles (through chaining). Since pipeline descriptions must use static occupancy and latency values, we use a reasonable fixed occupancy of eight for microthread arithmetic operations. In addition, the instruction cost functions were modified to separately account for the longer branch latency in microthreads as compared to the control thread. This allows the compiler to more aggressively use conditional move instructions when compiling for the microthreads (see the short example after this list).

• Modified Compiler Front-End – There were relatively few modifications necessary to the compiler front-end. We used the GNU C++ compiler’s function attribute framework to add new attributes denoting functions meant to run on the microthreads for performance tuning. We were able to leverage the GNU C++ compiler’s built-in support for fixed-length subword-SIMD instructions to create true C++ vector types. Unlike subword-SIMD types, which usually vary the number of elements with the element type (e.g., 4 × int or 8 × short), Maven fixes the number of elements to the largest possible hardware vector length (e.g., 32 × int or 32 × short). Programs can still use setvl to change the active vector length, but the compiler always assumes that vector types contain 32 elements. This can potentially waste stack space during register spilling but greatly simplifies the compiler modifications.
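To make the branch-cost point referenced in the instruction scheduler item above concrete, below is a minimal sketch of the kind of data-dependent select that appears in microthread code. With the higher modeled branch cost for microthreads, the compiler is more likely to lower this to a conditional-move sequence (e.g., MIPS movz/movn) than to a short forward branch. The function name is hypothetical and the snippet is only illustrative.

  // Hypothetical microthread helper: a data-dependent select. When compiled
  // with the microthread pipeline model (which charges a high cost for
  // branches), the compiler prefers a conditional move over a branch here.
  static inline int select_nonneg( int a, int fallback )
  {
    return ( a < 0 ) ? fallback : a;
  }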

Figure 6.4 illustrates how some of these modifications are exposed at the lowest level to the programmer. This is a simple example that copies an array of integers using vector memory operations. The example uses the GNU C++ compiler’s inline assembly extensions to explicitly generate setvl and sync.l.cv instructions (lines 4, 6, 12).

The vtemp variable on line 8 uses the GNU C++ compiler’s subword-SIMD extensions to tell the compiler that this is a vector of 32 integers. We have modified the compiler such that variables with these kinds of vector types are automatically allocated to vector registers. Note that the actual number of elements in the vector register might be less than 32 due to available hardware resources or simply because size is less than 32. The compiler, however, will always treat these types as having 32 elements, which should be the absolute maximum hardware vector length as specified by the Maven instruction set. The intrinsic functions on lines 9–10 generate unit-stride vector loads and stores. Using intrinsics instead of inline assembly allows the compiler to understand the semantics of these operations and thus perform better memory aliasing analysis. Eventually, more Maven operations could be moved from inline assembly into compiler intrinsics.

Overall, the changes required to a standard C++ compiler to support the Maven programming methodology are relatively straightforward. However, as the low-level example in Figure 6.4 shows, the result is far from elegant. This example is tightly coupled to the Maven architecture and obviously cannot be compiled natively. In addition, the compiler modifications do little to simplify vector fetching code onto the microthreads. The VTAPI provides an additional layer on top of these compiler modifications to enable native emulation and greatly simplify writing Maven applications.

6.3 VT Application Programming Interface

The VT application programming interface (VTAPI) is shown in Table 6.1 and was briefly discussed in Section 6.1. This section provides more details on the VTAPI implementation. The Maven implementation of the HardwareVector class contains a single vector type data member using the GNU C++ compiler subword-SIMD extensions, and each member function uses either inline assembly or compiler intrinsics as discussed in the previous section. A templated class is much more efficient than virtual functions and dynamic dispatch, yet at the same time it cleanly enables adding more element data types in the future. The rest of the free functions in Table 6.1 also use a combination of inline assembly and compiler intrinsics. Notice how the raw subword-SIMD vector types and intrinsics are never exposed to users of the VTAPI; this enables the native implementation to be completely library-based. The native implementation of the HardwareVector class contains a standard C array. The hardware vector length is always set to be the minimum vector length specified in the Maven instruction set (i.e., four), and global state is used to track the active vector length.

Vector fetches are handled in the Maven VTAPI implementation by generating a separate C++ function containing the microthread code, and then using a function pointer as the target for the vector fetch instruction. This keeps the microthread code separate from the control-thread code, so the control thread does not have to jump around microthread code.

The challenge then becomes connecting the vector registers seen by the control thread with the scalar registers seen within the microthread code. Although it is possible to establish a standard calling convention for the inputs and outputs of the vector-fetched microthread function, this can result in suboptimal vector register allocation. There is a tight coupling between the control-thread and the vector-fetched microthread code, so we would rather have the compilation process more closely resemble function inlining. Unfortunately, this creates a contradiction: we need to keep the microthread code separate from the control-thread code, but at the same time we want the effect of inlined register allocation. Our solution is to use the two-phase compilation process shown in Figure 6.1. A C++ source file that uses the VTAPI is denoted with the special .vtcc file name extension. These files are first compiled by the Maven preprocessor. The preprocessor uses the Maven compiler to generate the assembly for the source file, scans the assembly for the vector register allocation used in the control thread, and then generates a new C++ source file with this register allocation information embedded in the microthread functions. The new C++ source file is then compiled as normal. Implementing this process completely in the compiler itself would require significant effort. The two-phase process is simple to implement, although it does increase compilation time for files which use the VTAPI.

The Maven implementation of the VT_VFETCH macro uses sophisticated C preprocessor metaprogramming to generate the microthread function and all associated register allocation information. Figure 6.5 shows the result of running the vector-fetched block from Figure 6.2 (lines 12–15) through the C preprocessor. Notice that the microthread code is contained in a static member function of a local class (lines 9–30). The address of this function (&vp_::func) is then used in the inline assembly for the actual vector fetch instruction on lines 33–35. The microthread function includes Maven-specific function attributes (utfunc and target("tune=maven_ut") on line 9) to enable microthread-specific optimizations for this function. GNU C++ compiler extensions are used to ensure that the various microthread variables are in the correct registers (lines 12, 15, 18). This example shows the result after we have run the Maven preprocessor, so the explicit register allocation in the microthread function also corresponds to the vector register allocation in the control thread. The inline assembly for the vector fetch instruction includes information on which hardware vectors this vector-fetched block uses as inputs and outputs (lines 34–35). This dependency information allows the compiler to schedule vector fetch instructions effectively, and also allows us to compactly capture the control-thread’s vector register allocation in the generated assembly file using an assembly comment.

The VT_VFETCH macro shown as part of the VTAPI in Table 6.1 takes four lists of hardware vectors as arguments. The first is a list of hardware vectors used only as outputs, the second is a list of hardware vectors used as both inputs and outputs, and the third is a list of hardware vectors used only as inputs. The final list is for hardware vectors which must be allocated to the same register both before and after the vector-fetched block is executed. This list is needed because the compiler is otherwise unaware of vector register allocation done in the control thread when it is compiling the microthread function.

 1 {
 2   struct vp_
 3   {
 4     typedef __typeof__ (vc) vc_t_;
 5     typedef __typeof__ (va) va_t_;
 6     typedef __typeof__ (vb) vb_t_;
 7     typedef __typeof__ (vx) vx_t_;
 8
 9     __attribute__ ((flatten,utfunc,target("tune=maven_ut")))
10     static void func()
11     {
12       register va_t_::ireg_type va_in_ __asm__ ( "2" );
13       va_t_::ielm_type va = va_t_::cast_input( va_in_ );
14
15       register vb_t_::ireg_type vb_in_ __asm__ ( "3" );
16       vb_t_::ielm_type vb = vb_t_::cast_input( vb_in_ );
17
18       register vx_t_::ireg_type vx_in_ __asm__ ( "4" );
19       vx_t_::ielm_type vx = vx_t_::cast_input( vx_in_ );
20
21       vc_t_::elm_type vc;
22
23       vc = vx * va + vb;
24
25       goto stop;
26       stop:
27       register vx_t_::elm_type vx_out_ __asm__ ( "4" ) = vx;
28       register vc_t_::elm_type vc_out_ __asm__ ( "2" ) = vc;
29       __asm__ ( "" :: "r"(vc_out_), "r"(vx_out_) );
30     }
31   };
32
33   __asm__ volatile
34   ( "vf %4 # VT_VFETCH line: 18 ovregs: %0 ivregs: %1 %2 pvregs: %3"
35     : "=Z"(vc) : "Z"(va), "Z"(vb), "Z"(vx), "i"(&vp_::func) );
36 };

Figure 6.5: Example of Vector-Fetched Block After Preprocessing – Code corresponds to the vector-fetched block in Figure 6.2. The VTAPI uses a combination of a local class with a static member function, function attributes, explicit casts, and inline assembly to implement the VT_VFETCH preprocessor macro. (C++ code line (2–30) local class, (4–7) capture types of input and output hardware vectors, (9) function attributes, (10–30) static member function of local class, (12–19) specify register allocation and cast scalar elements to proper type, (21) declare output variable, (23) actual vector-fetched work, (25–26) stop label to allow user code to exit early, (27–28) ensure outputs are allocated to correct registers, (29) empty inline assembly to prevent compiler from optimizing away outputs, (33–35) inline assembly for actual vector-fetch instruction with input and output hardware vector dependencies properly captured.)

It is perfectly valid for the compiler to use extra scalar registers for temporaries when compiling the microthread function, but since these scalar registers correspond to vector registers, they may clobber values that are live across the vector-fetched block in the control thread. By fully specifying which hardware vectors are live into, live out of, and live across a vector-fetched block, the compiler can efficiently allocate vector registers both in the control thread and in the microthread.

The native implementation of the VT_VFETCH macro also generates a separate function for the microthread code. This function is called in a for loop for each element of the hardware vectors using standard C++. Input and output values are simply passed as function arguments. The complexity of the VT_VFETCH macro is hidden from the programmer, as illustrated by the examples in Figures 6.2 and 6.3. The resulting assembly in Figure 4.2 shows how the compiler is able to generate efficient code with an optimized vector register allocation.
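To make the native path described above more concrete, the following is a minimal sketch of how the emulation layer might look. It follows the description in this section (a plain C array sized to the Maven minimum vector length of four, global state for the active vector length, and a "vector fetch" that is just a per-element loop), but the names and exact structure are assumptions rather than the actual Maven headers.

  namespace vt_native_sketch {

  static const int kMinVlen      = 4;         // minimum vector length in the Maven instruction set
  static int       g_active_vlen = kMinVlen;  // global state tracking the active vector length

  template < typename T >
  class HardwareVector {
   public:
    HardwareVector() {}
    explicit HardwareVector( const T& value ) { set_all_elements( value ); }

    void load ( const T* in_ptr )       { for ( int i = 0; i < g_active_vlen; i++ ) elems_[i] = in_ptr[i]; }
    void store( T* out_ptr ) const      { for ( int i = 0; i < g_active_vlen; i++ ) out_ptr[i] = elems_[i]; }
    void set_all_elements( const T& v ) { for ( int i = 0; i < kMinVlen; i++ ) elems_[i] = v; }

    T&       operator[]( int idx )       { return elems_[idx]; }
    const T& operator[]( int idx ) const { return elems_[idx]; }

   private:
    T elems_[kMinVlen];  // the native implementation wraps a standard C array
  };

  // A native "vector fetch": the per-microthread body is just a function
  // called once per element in a plain for loop, with inputs and outputs
  // passed as ordinary function arguments.
  template < typename Tout, typename Tin, typename Func >
  void native_vfetch( HardwareVector<Tout>& out, const HardwareVector<Tin>& in, Func body )
  {
    for ( int i = 0; i < g_active_vlen; i++ )
      out[i] = body( in[i] );
  }

  } // namespace vt_native_sketch

In the actual VTAPI, the function generated by VT_VFETCH plays the role of body here, and its arguments carry the per-element inputs and outputs.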

6.4 System-Level Interface

In addition to efficiently managing its array of microthreads, the Maven control thread also needs some form of system-level support for interacting with the host thread. Figure 6.6 shows the approach used in the Maven programming methodology. To run a Maven application, a user accesses the general-purpose processor running the host thread and then starts the Maven application server. The application server is responsible for loading the program into the data-parallel accelerator's memory and then starting the control-thread's execution.

A Maven application is linked with the GNU Standard C++ Library which in turn is linked with the Redhat Newlib Standard C Library. Newlib is a lightweight implementation of the ANSI C standard library well suited to embedded platforms. It includes a narrow operating system interface to simplify porting to new architectures. We have written a small proxy kernel that runs on the control thread and handles the Newlib system calls. The proxy kernel marshals system call arguments and passes them to the application server for servicing. Once finished, the application server places the results back in the data-parallel accelerator's memory and then notifies the proxy kernel to continue execution.

Figure 6.6: Maven System-Level Software Stack – A portion of the Maven application runs on the control thread and a different portion runs on the microthreads, with the VTAPI as the connection between the two. Maven uses the GNU standard C++ library which in turn uses the Redhat Newlib standard C library. System calls are handled through a lightweight proxy kernel that communicates with an application server running on the host thread to actually service the system call. (PK = proxy kernel)

This simple scheme gives the Maven application the illusion that it is running on a full operating system. For example, Maven applications can easily read and write files that reside as part of the native operating system running on the host thread.
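As an illustration of the marshalling described above, the following sketch shows one plausible shape for a proxied system call. The structure layout, field names, and notification mechanism are assumptions made for illustration, not the actual Maven proxy kernel interface.

  #include <stdint.h>

  // Hypothetical marshalled system-call request shared between the proxy
  // kernel (control thread) and the application server (host thread).
  struct SyscallRequest {
    uint32_t syscall_num;    // Newlib system call number (e.g., open/read/write)
    uint32_t args[4];        // raw argument words; pointers are accelerator addresses
    int32_t  result;         // return value filled in by the application server
    int32_t  errno_value;    // errno reported back to Newlib
    volatile uint32_t done;  // set by the application server when servicing is complete
  };

  // Placeholder for the platform-specific doorbell that tells the host thread
  // a request is pending (details omitted in this sketch).
  static void notify_host( SyscallRequest* ) { }

  // Proxy-kernel side of a system call: marshal the arguments, notify the
  // application server, wait for completion, and hand the result to Newlib.
  static int proxy_syscall( SyscallRequest* req, uint32_t num,
                            uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3 )
  {
    req->syscall_num = num;
    req->args[0] = a0;  req->args[1] = a1;
    req->args[2] = a2;  req->args[3] = a3;
    req->done = 0;
    notify_host( req );
    while ( req->done == 0 ) { /* spin until the host thread finishes */ }
    return req->result;
  }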

6.5 Extensions to Support Other Architectural Design Patterns

When comparing Maven to other architectural design patterns, we would like to leverage the Maven programming methodology as much as possible. This simplifies programming multiple implementations of the same application. It also enables a fairer comparison, since all software will be using a very similar programming methodology. This section describes how the Maven programming methodology can be extended to emulate the MIMD, vector-SIMD, and SIMT architectural design patterns. These extensions leverage the instruction set extensions discussed in Section 4.6 and the microarchitectural extensions discussed in Section 5.7. Unless otherwise specified, all extensions include both native and Maven implementations to enable rapid development and testing of applications through native execution regardless of the architectural design pattern being used.

6.5.1 MIMD Extensions to Maven Programming Methodology

In the MIMD pattern, the host thread is usually responsible for spawning work onto the microthreads. Unfortunately, this can have high overheads, especially when the amount of work to be done per microthread is relatively small. So the Maven multithreaded MIMD extension takes a different approach where a “master” microthread on the multithreaded core is responsible for spawning the work on the other remaining “worker” microthreads. We can think of this as emulating a multithreaded MIMD accelerator where the host thread manages coarse-grain work distribution, while each master microthread manages fine-grain work distribution.

 1 void rdlp_mt( int c[], int a[], int b[], int size, int x )
 2 {
 3   BTHREAD_PARALLEL_RANGE( size, (c,a,b,x),
 4   ({
 5     for ( int i = range.begin(); i < range.end(); i++ )
 6       c[i] = x * a[i] + b[i];
 7   }));
 8 }

Figure 6.7: Regular DLP Example Using Maven MIMD Extension – Code corresponds to the regular DLP loop in Table 2.1c. (C++ code line (3) BTHREAD_PARALLEL_RANGE preprocessor macro automatically partitions input dataset's linear index range, creates separate function, spawns the function onto each microthread, passes in arguments through memory, and waits for the threads to finish, (5–6) each thread does the work from range.begin() to range.end() where range is defined by the preprocessor macro to be different for each thread.)

The general concept of a master microthread distributing work to the worker microthreads has some similarities to the vector-fetched blocks described earlier in this chapter, with respect to wanting to cleanly express work “inline” along with the necessary inputs and outputs that need to be marshalled (of course, the implementations of the VT and MIMD patterns are radically different). To this end, we first modify the Maven proxy kernel to support multiple threads of execution and then build a lightweight user-level threading library called bthreads on top of the proxy-kernel threads. Bthreads stands for “bare threads” because it has absolutely no virtualization. There is one bthread for each underlying hardware microthread context. The application is responsible for managing scheduling and virtualization, although the Maven MIMD extension currently maintains this one-to-one correspondence. Bthreads includes standard MIMD constructs such as mutex classes and access to the Maven atomic memory operations.

Spawning work is done with a BTHREAD_PARALLEL_RANGE macro as shown in Figure 6.7. Line 3 specifies the total number of elements to be distributed to the worker microthreads and a list of C++ variables that should be marshalled for each worker microthread. The final argument to the macro is the work to be done by each microthread (lines 4–7). The macro automatically partitions the input range (0 to size-1), and each microthread accesses its respective range through the implicitly defined range object. This macro is just one part of the bthreads library. Several other classes, functions, and macros enable other ways to partition and distribute work across the microthreads.

As with the VT_VFETCH macro, the BTHREAD_PARALLEL_RANGE macro's implementation generates a separate function containing the actual work to be done in parallel. The generated function takes one argument which is a pointer to an argument structure. This structure is automatically created to hold the arguments (i.e., c, a, b, and x). When the master thread executes the BTHREAD_PARALLEL_RANGE macro, it goes through the following four steps: (1) partition the input linear index range and marshal arguments, (2) spawn the work function onto each worker thread passing the argument pointer to each, (3) execute the work function itself, (4) wait for the worker microthreads to finish. The Maven implementation of bthreads also includes support for activating and inactivating microthreads to improve performance on serial code (see Section 4.6.1 for more information on this mechanism). Bthreads are initially inactive until work is spawned onto them. The native implementation of bthreads is built on top of the standard pthreads threading library for truly parallel execution.

The bthreads library can be combined with any of the other architectural design pattern programming methodologies to enable mapping an application to multiple cores. For example, by using a VT_VFETCH macro inside the body of a BTHREAD_PARALLEL_RANGE macro, we can use the bthreads library to distribute work amongst multiple control processors and the VTAPI to vectorize each core's work.
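A sketch of this combination appears below; it simply nests the vector-fetched rdlp loop of Figure 6.2 inside a parallel range in the style of Figure 6.7. The exact macro interaction is assumed to work as described above, so treat this as an illustration rather than code from the Maven tree.

  // Sketch: distribute the rdlp loop across cores with bthreads, then use the
  // VTAPI to vectorize each core's share of the work.
  void rdlp_mt_vt( int c[], int a[], int b[], int size, int x )
  {
    BTHREAD_PARALLEL_RANGE( size, (c,a,b,x),
    ({
      int vlen = vt::config( 5, range.end() - range.begin() );
      vt::HardwareVector<int> vx(x);
      for ( int i = range.begin(); i < range.end(); i += vlen ) {
        vlen = vt::set_vlen( range.end() - i );

        vt::HardwareVector<int> vc, va, vb;
        va.load( &a[i] );
        vb.load( &b[i] );

        VT_VFETCH( (vc), (), (va,vb), (vx),
        ({
          vc = vx * va + vb;
        }));

        vc.store( &c[i] );
      }
      vt::sync_l_cv();
    }));
  }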

6.5.2 Vector-SIMD Extensions to Maven Programming Methodology

When emulating the vector-SIMD pattern, software can use any of the classes and functions in Table 6.1 except for get_utidx and the VT_VFETCH macro.

 1 void rdlp_tvec( int c[], int a[], int b[], int size )
 2 {
 3   int vlen = vt::config( 4, size );
 4   for ( int i = 0; i < size; i += vlen ) {
 5     vlen = vt::set_vlen( size - i );
 6
 7     vt::HardwareVector<int> vc, va, vb;
 8     va.load( &a[i] );
 9     vb.load( &b[i] );
10
11     vc = va + vb;
12
13     vc.store( &c[i] );
14   }
15   vt::sync_l_cv();
16 }

Figure 6.8: Regular DLP Example Using Maven Traditional-Vector Extensions – Code corresponds to the regular DLP loop in Table 2.1a. (C++ code line (3–5) handles stripmining, (8–9) unit-stride vector loads, (11) vector-vector add, (13) unit-stride store, (15) vector memory fence.)

This means that the vector-SIMD pattern uses the hardware vector class, just as in the VT pattern, for specifying vector values that can be allocated to vector registers. We have modified the compiler back-end to recognize arithmetic operations on vector types, and we provide appropriate operator overloading for these operations in the HardwareVector class. Figure 6.8 illustrates using the vector-SIMD extensions for the simple regular DLP loop in Table 2.1a. This loop performs a vector-vector addition by first loading two hardware vectors from memory (lines 8–9), adding the two hardware vectors together and writing the result into a third hardware vector (line 11), and then storing the third hardware vector back to memory (line 13). The vector-SIMD programming methodology uses the exact same stripmining process as the VT pattern.

Unfortunately, the vector-SIMD programming methodology currently has some significant limitations. Vector-scalar operations are not supported, so it is not possible to map the regular DLP loop in Table 2.1c, and vector flag operations are also not supported, so it is not possible to map the irregular DLP loop in Table 2.1f. For these kinds of loops, the programmer must resort to hand-coded assembly. This actually illustrates some of the challenges when working with the vector-SIMD pattern. Mapping regular DLP to a vector-SIMD pattern is relatively straightforward whether using a vectorizing compiler or the explicit data-parallel methodology described above, but mapping irregular DLP to a vector-SIMD pattern can be very challenging for assembly-level programmers, compilers, and programming frameworks.
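The operator overloads themselves are not shown in the thesis text, so the following is a minimal sketch of what the vector-vector add overload might look like, under the assumption that the single data member is a GNU subword-SIMD vector type as described in Section 6.3. Names are illustrative.

  // Sketch of a vector-vector add overload for the vector-SIMD extension.
  // The data member uses the GNU vector_size extension (32 x int = 128 bytes),
  // so '+' on it is an element-wise add that the modified back-end can map
  // onto a vector add instruction.
  class HardwareVectorInt {
   public:
    typedef int vec_t __attribute__ (( vector_size(128) ));

    friend HardwareVectorInt operator+( const HardwareVectorInt& a,
                                        const HardwareVectorInt& b )
    {
      HardwareVectorInt r;
      r.vec_ = a.vec_ + b.vec_;  // element-wise add on the GNU vector type
      return r;
    }

   private:
    vec_t vec_;  // allocated to a vector register by the modified compiler
  };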

6.5.3 SIMT Extensions to Maven Programming Methodology

The SIMT extensions are essentially just a subset of the full VT programming methodology.

 1 void rdlp_simt( int c[], int a[], int b[], int size, int x )
 2 {
 3   int blocksz = vt::config( 5, size );
 4   int nblocks = ( size + blocksz - 1 ) / blocksz;
 5
 6   vt::HardwareVector<int*> vcptr(c), vbptr(b), vaptr(a);
 7   vt::HardwareVector<int> vsize( size ), vx( x );
 8   vt::HardwareVector<int> vblocksz( blocksz );
 9
10   for ( int block_idx = 0; block_idx < nblocks; block_idx++ ) {
11     vt::HardwareVector<int> vblock_idx( block_idx );
12
13     VT_VFETCH( (), (), (vblock_idx), (vcptr,vaptr,vbptr,vsize,vblocksz,vx),
14     ({
15       int idx = vblock_idx * vblocksz + vt::get_utidx();
16       if ( idx < vsize )
17         vcptr[idx] = vx * vaptr[idx] + vbptr[idx];
18     }));
19   }
20   vt::sync_l_cv();
21 }

Figure 6.9: Regular DLP Example Using Maven SIMT Extensions – Code corresponds to the regular DLP loop in Table 2.1c. (C++ code line (3–4) calculate hardware vector length and number of microthread blocks, (6) copy array base pointers into all elements of hardware vectors, (7–8) copy size, block size, and the scalar x into all elements of hardware vectors, (10–20) for loop emulating multiple microthread blocks mapped to the same core, (13) specifies input and output hardware vectors for vector-fetched block, (14–18) vector-fetched block, (15) each microthread calculates its own index, (16) check to make sure index is not greater than array size, (17) actual work, (20) vector memory fence.)

When emulating the SIMT pattern, software should use the control thread as little as possible, since the SIMT pattern does not include a control thread. The code which does run on the control thread should be thought of as emulating some of the dedicated hardware in a SIMT vector issue unit. SIMT programs should not use vector memory instructions; instead, all memory accesses should be implemented with microthread loads and stores. SIMT programs should also not use the setvl instruction, and instead use a microthread branch to handle situations where the application vector length is not evenly divisible by the hardware vector length. Hardware vectors should be limited to initializing each microthread with appropriate scalar values.

Figure 6.9 illustrates using the SIMT extensions for the regular DLP loop in Table 2.1c. Instead of using vector loads for the input arrays, the SIMT pattern initializes the array base pointers once outside the stripmine loop (line 6) and then uses microthread loads inside the vector-fetched block (line 17). Each microthread must calculate its own offset into the input and output arrays (line 15), and also check to see if the current index is less than the application vector length (line 16). It is important to note that this additional address calculation and conditional branch closely follows the practices recommended by the CUDA programming methodology [NBGS08].

6.6 Future Research Directions

This section briefly describes some possible directions for future improvements with respect to the Maven programming methodology.

Support for New Instructions – Section 4.6 discussed several new instructions that would affect the programming methodology. Some of these extensions, such as a vector unconfiguration instruction, vector compress and expand instructions, and vector reduction instructions, would be exposed as additional free functions. Adding segments to the VTAPI would be difficult, since segments impose additional constraints on vector register allocation. One option is simply to allow the programmer to more explicitly allocate segments to vector registers, but this significantly lowers the level of abstraction. A better solution is to add a compiler optimization pass that can merge multiple strided accesses into a single segment access. To support shared hardware vectors, the VTAPI would need a way to explicitly denote this kind of vector. The compiler register allocation framework would need to be modified to efficiently allocate both private and shared hardware vectors.
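Purely as an illustration of the first point, the new instructions might surface in the VTAPI as free-function declarations along the following lines; none of these names or signatures exist in the current interface.

  namespace vt {
    template < typename T > class HardwareVector;  // provided by the existing VTAPI

    // Hypothetical additions mirroring the instructions discussed above.
    void unconfig();                                   // vector unconfiguration
    int  reduce_add( const HardwareVector<int>& hv );  // sum across the active elements
    int  compress( HardwareVector<int>& dst,
                   const HardwareVector<int>& src,
                   const HardwareVector<int>& flags ); // pack elements whose flag is set,
                                                       // returning how many were packed
  }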

Hardware Vectors of C++ Objects – Hardware vectors are limited to containing primitive types, but programs often work with arrays of objects. It would be useful to be able to easily transfer arrays of objects into vector-fetched microthread code. Currently, the programmer must “unpack” the fields of the objects using strided accesses into multiple hardware vectors, pass these hardware vectors into the vector-fetched block, and then “repack” these fields into the corresponding object in the microthread code. In addition to being cumbersome, this process also violates the object’s data encapsulation by requiring data members to be publicly accessible (the vector load operations need access to the underlying addresses of these fields). Hardware vectors containing objects would greatly simplify this process, and possibly even allow the use of efficient segment accesses.
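A sketch of the unpack/repack workaround described above is shown below for a hypothetical two-field object; the field names, register count, and the units of the stride argument are assumptions made for illustration.

  // Current workaround for an array of objects: gather each field into its
  // own hardware vector with a strided load, pass the vectors into the
  // vector-fetched block, and reassemble the object per microthread.
  struct Particle { float pos; float vel; };   // hypothetical object type

  void advance_vt( Particle p[], int size, float dt )
  {
    int vlen = vt::config( 6, size );          // register count found by iterating the toolchain
    vt::HardwareVector<float> vdt(dt);
    for ( int i = 0; i < size; i += vlen ) {
      vlen = vt::set_vlen( size - i );

      vt::HardwareVector<float> vpos, vvel;
      vpos.load( &p[i].pos, sizeof(Particle) ); // strided "unpack" of the pos field
      vvel.load( &p[i].vel, sizeof(Particle) ); // strided "unpack" of the vel field

      VT_VFETCH( (), (vpos), (vvel), (vdt),
      ({
        Particle tmp = { vpos, vvel };          // "repack" inside the microthread
        tmp.pos += tmp.vel * vdt;
        vpos = tmp.pos;                         // updated field flows back out through vpos
      }));

      vpos.store( &p[i].pos, sizeof(Particle) ); // strided store of the updated field
    }
    vt::sync_l_cv();
  }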

Automatically Determine Vector-Fetched Block Inputs and Outputs – Improperly specified inputs and outputs for a vector-fetched block are a common source of errors when using the Maven programming methodology. More sophisticated static analysis might allow the compiler to automatically determine these inputs and outputs. This analysis would need to be cross-procedural (i.e., from control thread function to microthread function) and possibly rely on variable naming to relate microthread scalar variables to the control-thread's hardware vector variables. Integrating the vector fetch construct into the language would probably be required to support this kind of analysis.

Optimize Number of Registers per Microthread – The number of registers per microthread is currently set manually by the programmer. The compiler could include an optimization pass that attempts to statically determine an appropriate number of registers per microthread. More registers give the compiler more scheduling freedom at the cost of reduced hardware vector length, while fewer registers increase the hardware vector length at the cost of constrained scheduling and possible register spilling.

6.7 Related Work

This section highlights selected previous work specifically related to the Maven programming methodology presented in this chapter.

Scale VT Processor – Scale provides two programming methodologies. The first approach uses a standard scalar compiler for most of the control-thread code, and then manually written assembly for the critical control-thread loops and microthread code [KBH+04a]. The second approach uses a research prototype VT compiler [HA08]. Hand-coded assembly provides the highest performance, but requires the most effort. The research prototype VT compiler automatically exploits DLP for a VT architecture, but cannot handle all kinds of loops and results in lower performance than hand-coded assembly. Porting the research prototype VT compiler to Maven would have required significant effort. The Maven programming methodology takes a very different approach where the programmer must explicitly specify the data-level parallelism, but at a high level of abstraction.

Automatically Vectorizing Compilers – There has been a great deal of research on compilers that can automatically extract DLP from standard sequential programming languages [BGS94, DL95, HA08]. The success of such compilation varies widely and depends on how the original program was written, the presence of true data dependencies, and the sophistication of the compiler optimization passes. Effective vectorizing compilers must manage memory aliasing, nested loops, and complex data-dependent control flow. An explicitly data-parallel programming methodology simplifies the compiler implementation by shifting more of the burden onto the programmer, but can result in very efficient code generation. The Maven methodology uses a relatively high-level framework to help the programmer naturally express the data-level parallelism that is available in the application.

Data-Parallel Programming Languages – Programming languages such as Paralation Lisp, APL, and Fortran 90 include explicit support in the language for data-parallel operators (see [SB91] for a survey of early data-parallel languages). More recent examples include ZPL [CCL+98] and the Scout language for graphics processors [MIA+07]. Although all of these languages can elegantly express regular DLP and could produce efficient code on vector-SIMD architectures, they quickly become cumbersome to use when working with irregular DLP. The NESL language specifically attempts to better capture irregular DLP by providing nested data-parallel constructs [Ble96]. The explicit data-parallel programming methodology introduced in this chapter provides a much more general framework for expressing both regular and irregular DLP, although because it is mostly library-based it lacks some of the static guarantees available in true data-parallel languages.

Programming Frameworks for Graphics Processors – Programming frameworks that follow the SIMT pattern are probably the most similar to the Maven programming methodology. NVIDIA's CUDA [NBGS08], OpenCL [ope08a], and Stanford's Brook language [BFH+04] provide the ability to write microthread code as a specially annotated function, which is then spawned on the accelerator from the host thread.

Both the SIMT and VT patterns include a two-level hierarchy of microthreads, and this hierarchy is reflected in both programming methodologies. The SIMT pattern uses microthread blocks, while the VT pattern uses software-exposed control threads. Maven extends these earlier SIMT frameworks to include efficient compilation for the control thread.

Chapter 7

Maven Evaluation

This chapter evaluates the Maven VT core in terms of area, performance, and energy as compared to the MIMD, vector-SIMD, and SIMT architectural design patterns. Section 7.1 describes the 20 different data-parallel core configurations that we have implemented using a semi-custom ASIC methodology in a TSMC 65 nm process, and Section 7.2 describes the four microbenchmarks for studying how these cores execute both regular and irregular DLP. Section 7.3 reviews the methodology used to measure the area, performance, and energy of each core running the various microbenchmarks. Section 7.4 compares the cores based on their cycle times and area breakdown, and Sections 7.5–7.6 compare the cores' energy efficiency and performance. Section 7.7 concludes this chapter with a brief case study of how a much larger computer graphics application can be mapped to the Maven VT core.

7.1 Evaluated Core Configurations

Table 7.1 lists the 20 core configurations used to evaluate a simplified VT architecture for use in future data-parallel accelerators. There are three broad classes of configurations that correspond to three of the architectural design patterns: MIMD, vector-SIMD, and VT. There is no SIMT core configuration, since the SIMT pattern is evaluated by using the VT cores to run code written in the SIMT style. Table 7.1 also lists the number of lanes, the number of physical registers, the number of supported microthreads with the default 32 registers per microthread, the minimum and maximum hardware vector lengths, the sustainable arithmetic throughput, and the sustainable memory throughput.

The mimd-1x* configurations adhere to the MIMD pattern, with mimd-1x1 representing a single-threaded MIMD core, and mimd-1x2, mimd-1x4, and mimd-1x8 representing multithreaded MIMD cores with two, four, and eight microthreads per lane respectively. These MIMD cores include the instruction set and microarchitectural extensions described in Sections 4.6.1 and 5.7.1. Most notably, the multithreaded MIMD cores include support for activating and deactivating microthreads to avoid overheads on the serial portions of the microbenchmarks.

                  Num    Num   Num   Min   Max   Arith Throughput   Mem Throughput
Pattern  Variant  Lanes  Regs  µTs   Vlen  Vlen  (ops/cyc)          (elm/cyc)
mimd     1x1      n/a      31    1   n/a   n/a   1                  1
mimd     1x2      n/a      62    2   n/a   n/a   1                  1
mimd     1x4      n/a     124    4   n/a   n/a   1                  1
mimd     1x8      n/a     248    8   n/a   n/a   1                  1
vsimd    1x1        1      31    1   1†     10   1c + 2v            1l + 1s
vsimd    1x2        1      62    2   2†     20   1c + 2v            1l + 1s
vsimd    1x4        1     124    4   4      32   1c + 2v            1l + 1s
vsimd    1x8        1     248    8   4      32   1c + 2v            1l + 1s
vsimd    2x1        2      62    2   2†     20   1c + 4v            2l + 2s
vsimd    2x2        2     124    4   4      41   1c + 4v            2l + 2s
vsimd    2x4        2     248    8   4      64   1c + 4v            2l + 2s
vsimd    2x8        2     496   16   4      64   1c + 4v            2l + 2s
vsimd    4x1        4     124    4   4      41   1c + 8v            4l + 4s
vsimd    4x2        4     248    8   4      82   1c + 8v            4l + 4s
vsimd    4x4        4     496   16   4     128   1c + 8v            4l + 4s
vsimd    4x8        4     992   32   4     128   1c + 8v            4l + 4s
vt       1x1        1      31    1   1†     10   1c + 2v            1l + 2s
vt       1x2        1      62    2   2†     20   1c + 2v            1l + 2s
vt       1x4        1     124    4   4      32   1c + 2v            1l + 2s
vt       1x8        1     248    8   4      32   1c + 2v            1l + 2s

Table 7.1: Evaluated Core Configurations – Twenty data-parallel core configurations are used to evaluate three architectural design patterns: MIMD, vector-SIMD, and VT. In addition, the SIMT pattern is evaluated by using the VT cores to run code written in the SIMT style. MIMD core configurations include support for 1–8 microthreads. Vector-SIMD core configurations include 1–4 lanes with 31–248 physical registers per lane. VT core configurations all use a single lane with 31–248 physical registers. (num regs column excludes special register zero, num µTs column is the number of microthreads supported with the default 32 registers per microthread, † = these are exceptions since technically the Maven instruction set requires a minimum vector length of four, xc + yv = x control processor operations and y vector unit operations per cycle, xl + ys = x load elements and y store elements per cycle)

Although the MIMD cores have a decoupled load/store unit and long-latency functional units, they are fundamentally single-issue, resulting in a sustainable arithmetic throughput of one operation per cycle and a sustainable memory throughput of one element per cycle.

The vsimd-* configurations adhere to the vector-SIMD pattern with one-lane (vsimd-1x*), two-lane (vsimd-2x*), and four-lane (vsimd-4x*) variants. These vector-SIMD cores include the instruction set and microarchitectural extensions described in Sections 4.6.2 and 5.7.2. They support a full complement of vector instructions, including vector flag operations for data-dependent conditional control flow. The control processor is embedded in all configurations. The vsimd-*x1, vsimd-*x2, vsimd-*x4, and vsimd-*x8 configurations are sized to have enough physical registers to support one, two, four, and eight microthreads respectively.

This assumes each microthread requires the full 32 MIPS32 registers, although longer vector lengths are possible with reconfigurable vector registers. The special register zero is not included as a physical register, since it can be easily shared across all microthreads. The twelve vector-SIMD cores allow us to evaluate mapping microthreads in two dimensions: spatially across 1–4 lanes and temporally with 31–248 physical registers per lane. Some cores have a minimum vector length less than four, which is the minimum vector length specified by the Maven instruction set. We included these configurations for evaluation purposes, even though they are unlikely to be used in a real implementation. Also notice that the maximum vector length is limited to 32 per lane, which is sometimes less than what the number of physical registers will theoretically support (e.g., vsimd-1x8 can theoretically support a maximum vector length of 82). This helps limit the amount of state that scales with the number of microthreads (e.g., the size of each vector flag register) and enables a fair comparison against the VT cores that have similar constraints. The arithmetic throughput includes both the control processor functional unit and the vector functional units, although the control processor functional unit is only used for simple control operations. Increasing the number of lanes increases the arithmetic throughput, and the memory throughput is also increased to maintain a similar balance across all configurations. For example, the vsimd-4x* configurations' vector unit can execute eight arithmetic ops per cycle owing to two vector functional units (VFU0/VFU1) per lane and four lanes. The vsimd-4x* configurations also include a wide VMU that supports up to four load elements and four store elements per cycle. In addition to increased data bandwidth, the multi-lane configurations also include increased address bandwidth so that the vsimd-4x* configurations can issue four independent indexed or strided requests to the memory system per cycle. It is important to note that this increased arithmetic and memory throughput does not require increased fetch or issue bandwidth. The vector-SIMD cores can only sustain a single instruction fetch per cycle, but the VIU can still keep many vector functional units busy with multi-cycle vector operations.

The vt-1x* configurations adhere to the VT pattern with a single lane and a physical register file that ranges in size from 31–248 elements. The vt-1x* configurations are very similar to the vsimd-1x* configurations with respect to the number of physical registers, number of supported microthreads, minimum and maximum vector lengths, and sustained arithmetic and memory throughputs. Of course, the vt-1x* configurations differ from the vsimd-1x* configurations in their ability to support vector-fetched scalar instructions. The evaluation in this thesis is limited to just the basic Maven microarchitecture without support for vector fragment merging, interleaving, or compression. As with the vector-SIMD configurations, VT configurations with few physical registers (i.e., vt-1x1 and vt-1x2) do not support very long hardware vector lengths. These configurations are useful for studying trends, but are less practical for an actual prototype.

Each configuration is implemented in Verilog RTL. There is a great deal of common functionality across configurations, so many configurations are generated simply by enabling or disabling certain design-time options.

For example, the MIMD cores are similar to the control processor used in the vector-SIMD and VT cores. The primary differences are that the MIMD cores use a dedicated decoupled floating-point functional unit instead of sharing with a vector unit, and the multithreaded MIMD cores include microthread scheduling control logic. The vector-SIMD and VT configurations are also very similar, with design-time options to enable or disable the vector flag register, PVFB, and other small modules that are specific to one configuration or the other. For this evaluation, all configurations include a relatively idealized memory system represented by a fixed two-cycle latency magic memory module. The L1 instruction cache is not modeled and instead the instruction ports connect directly to the magic memory module.
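As an aside to the register sizing discussion above, the vector-length arithmetic can be summarized in a few lines. The per-lane cap of 32 and the 82-element example come from the text, while the assumption of a minimum allocation of 3 registers per microthread is inferred from the Max Vlen column of Table 7.1 rather than stated explicitly; the helper function itself is only illustrative.

  #include <algorithm>
  #include <cstdio>

  // Illustrative arithmetic only: maximum hardware vector length as limited by
  // the physical register file, the registers allocated per microthread, and
  // the 32-element-per-lane cap described in Section 7.1.
  int max_hw_vlen( int total_phys_regs, int regs_per_ut, int num_lanes )
  {
    int by_registers = total_phys_regs / regs_per_ut;
    int by_cap       = 32 * num_lanes;
    return std::min( by_registers, by_cap );
  }

  int main()
  {
    std::printf( "%d\n", max_hw_vlen( 248,  3, 1 ) ); // vsimd-1x8: 82 theoretical, capped to 32
    std::printf( "%d\n", max_hw_vlen( 248,  3, 4 ) ); // vsimd-4x2: 82, under the 128-element cap
    std::printf( "%d\n", max_hw_vlen( 248, 31, 1 ) ); // 31 physical regs per uT (reg zero shared): 8
    return 0;
  }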

7.2 Evaluated Microbenchmarks

Eventually, we will experiment with large-scale data-parallel applications running on the various core configurations, but my initial evaluation focuses on four carefully crafted microbenchmarks. These microbenchmarks capture some of the characteristics of both regular and irregular DLP programs, and they are easy to analyze and understand. In addition, although the evaluation methodology used in this thesis produces detailed results, it also requires lengthy simulation times that are better suited to shorter microbenchmarks. In the future, we can use the insight gained through working with these microbenchmarks to build higher-level models that should reduce the simulation times required for large-scale applications.

Figure 7.1 illustrates the four microbenchmarks: vvadd, which performs a 1000-element vector-vector addition with integer elements; cmult, which performs a 1000-element vector-vector complex multiply with floating-point imaginary and real components; mfilt, which performs a masked convolution with a five-element kernel across a gray-scale image that is 100 × 100 pixels and a corresponding mask image that is also 100 × 100 pixels; and bsearch, which performs 1000 look-ups using a binary search into a sorted array of 1000 key/value pairs.


Figure 7.1: Microbenchmarks – Four microbenchmarks are used to evaluate the various architectural design patterns: (a) the vvadd microbenchmark performs element-wise integer addition across two input arrays and writes a third output array; (b) the cmult microbenchmark performs element-wise complex multiplication across two input arrays of structures with the imaginary/real components and writes a third output array; (c) the mfilt microbenchmark performs a masked convolution with a five-element kernel on a gray-scale input image; (d) the bsearch microbenchmark uses a binary search to look up search keys in a sorted array of key/value pairs.

Each microbenchmark has at least four implementations that correspond to the four patterns we wish to evaluate: a multithreaded MIMD implementation that runs on the mimd-* cores, a vector-SIMD implementation that runs on the vsimd-* cores, a SIMT implementation that runs on the vt-* cores, and a VT implementation that also runs on the vt-* cores. Table 7.2 lists the number of instructions for each microbenchmark by type. Only the instructions in the inner loop are included in this table, which ignores the fact that the MIMD pattern requires significantly more start-up overhead as compared to the other patterns. The programming methodology described in Chapter 6 is used to compile all of the microbenchmarks, with the specific extensions discussed in Section 6.5 for the MIMD, vector-SIMD, and SIMT patterns. Due to the difficulty of compiling irregular DLP for the vector-SIMD pattern, the vector-SIMD implementations of the mfilt and bsearch microbenchmarks required hand-coded assembly. The bsearch microbenchmark also has implementations that use explicit conditional moves to reduce the number of conditional branches.

The vvadd and cmult microbenchmarks illustrate regular DLP. The vvadd microbenchmark uses integer arithmetic and unit-stride accesses, while the cmult microbenchmark uses floating-point arithmetic and strided accesses. The strided accesses are necessary, since the input and output are stored as an array of structures where each structure contains the imaginary and real part for one complex number. Neither microbenchmark includes any data-dependent control flow beyond the standard loop over the input elements. As expected, Table 7.2 shows that the MIMD implementations only use microthread scalar instructions, and the vector-SIMD implementations only use control-thread scalar and vector instructions. The SIMT and VT implementations use a combination of control-thread and microthread instructions. To capture the fact that a real SIMT core would not include a control processor, the SIMT implementations use fewer control-thread instructions and more microthread instructions as compared to the VT implementations. The microthread branch instruction in the SIMT implementations is for checking if the current microthread index is less than the input array size. The control-thread scalar and vector memory instructions for the vector-SIMD and VT implementations are identical because both use essentially the same stripmining and unit-stride/strided vector memory accesses. Because more code runs on the microthreads in the SIMT implementations, these implementations require more registers per microthread. This in turn will result in shorter hardware vector lengths as compared to the vector-SIMD and VT implementations.

The mfilt microbenchmark illustrates regular data access with irregular control flow. The input pixels are loaded with multiple unit-stride accesses and different base offsets, and the output pixels are also stored with a unit-stride access. The convolution kernel's coefficients use shared accesses, and there is also some shared computation to calculate an appropriate normalization divisor. Each iteration of the loop checks to see if the corresponding boolean in a mask image is false, and if so, skips computing the convolution for that pixel. The mask image used in this evaluation includes several randomly placed squares covering 67% of the image.
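To make the strided-access point for the cmult microbenchmark described above concrete, a scalar sketch of the assumed data layout follows: because the real and imaginary components are interleaved in an array of structures, a vector implementation must gather each component with a stride of one structure rather than with unit-stride loads. The struct and function names are illustrative.

  // Scalar reference for cmult with an array-of-structures layout. The
  // interleaving of real/imag fields is what forces strided vector accesses.
  struct Complex { float real; float imag; };

  void cmult_scalar( Complex c[], const Complex a[], const Complex b[], int size )
  {
    for ( int i = 0; i < size; i++ ) {
      c[i].real = a[i].real * b[i].real - a[i].imag * b[i].imag;
      c[i].imag = a[i].real * b[i].imag + a[i].imag * b[i].real;
    }
  }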

CT Scalar | CT Vector | µT Scalar | Totals
Name Pattern | int br | int fp ld st misc | int fp ld st br j cmv misc | nregs | CT µT

vvadd mimd: 6 2 2 1 10
vvadd vsimd: 7 1 1 2u 2u 1 4 12
vvadd simt: 1 1 2 8 2 2 1 2 10 4 14
vvadd vt: 7 1 2u 2u 2 1 1 4 12 2
cmult mimd: 5 6 6 2 1 20
cmult vsimd: 10 1 6 4s 2s 1 4 24
cmult simt: 1 1 2 7 6 6 2 1 2 10 4 24
cmult vt: 10 1 4s 2s 2 1 6 1 4 19 8
mfilt mimd: 34 7 2 5 1 49
mfilt vsimd: 24 2 1,9f 6u 1u 2 13 45
mfilt simt: 4 3 3 27 7 2 2 3 26 10 41
mfilt vt: 24 2 6u 1u 3 10 1 1 13 35 12
bsearch mimd: 19 4 2 5 30
bsearch vsimd: 9 3 2,17f 1u,2x 1u 5 10 40
bsearch simt: 1 1 2 26 4 2 5 1 1 2 16 4 41
bsearch simt-cmv: 1 1 2 28 3 2 2 4 2 26 4 41
bsearch vt: 6 1 1u 1u 5 15 3 4 1 1 2 10 13 26
bsearch vt-cmv: 6 1 1u 1u 5 17 2 1 4 1 13 14 25

Table 7.2: Instruction Mix for Microbenchmarks – The number of instructions for each microbenchmark is listed by type. The vvadd and cmult microbenchmarks have regular data access and control flow. The mfilt microbenchmark has regular data access but irregular control flow, and the bsearch microbenchmark has irregular data access and control flow. (CT = control thread, µT = microthread, int = simple short-latency integer, fp = floating-point, ld = load, st = store, br = conditional branch, j = unconditional jump, cmv = conditional move, CT vector misc = {setvl, vf, mov.sv}, µT scalar misc = {utidx, stop}, nregs = number of scalar registers required per µT, u suffix = unit-stride, s suffix = strided, x suffix = indexed, f suffix = vector operation writing or reading vector flag register)

Each microthread in the MIMD implementation requires five conditional branches to iterate over a portion of the input image and to check the mask image. The vector-SIMD implementation of this data-dependent conditional control flow uses nine vector instructions that either manipulate or operate under a vector flag register. Even though there is some irregular control flow, both the vector-SIMD and VT implementations can refactor 26 scalar instructions onto the control thread and use seven unit-stride memory accesses (five to load the five input pixels, one to load the mask image, and one to store the result). The SIMT and VT implementations use vector-fetched scalar branches to implement the data-dependent control flow, and the hardware manages divergence as necessary. Again notice the difference between the SIMT and VT implementations: the SIMT implementation uses 41 microthread instructions and only 10 control-thread instructions, while the VT implementation uses 12 microthread instructions and 35 control-thread instructions. VT is able to better amortize overheads onto the control processor, which should improve performance and energy efficiency.
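A rough scalar sketch of the mfilt loop structure described above is shown below; the exact kernel shape, mask encoding, and normalization are simplifying assumptions, but the sketch highlights the data-dependent check that lets the VT core skip the convolution entirely when a whole fragment is under the mask.

    // Simplified scalar sketch of the mfilt inner loop (illustrative only): a
    // five-element masked convolution over a gray-scale image. src holds five base
    // pointers into the image, mirroring the five unit-stride accesses with
    // different base offsets described in the text.
    void mfilt_ref(int n, const unsigned char* mask, const unsigned char* const* src,
                   const int* coeff, int norm, unsigned char* dst) {
      for (int i = 0; i < n; i++) {
        if (!mask[i])                    // mask is false: skip this pixel
          continue;                      // (a vector-fetched branch in the VT version)
        int acc = 0;
        for (int k = 0; k < 5; k++)
          acc += coeff[k] * src[k][i];   // five-tap convolution
        dst[i] = static_cast<unsigned char>(acc / norm);
      }
    }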

The bsearch microbenchmark illustrates irregular DLP. The microbenchmark begins with an array of structures where each structure contains a key/value pair. The input is an array of search keys, and the output is the corresponding value for each search key found by searching the sorted key/value array. The dataset used in this evaluation requires less than the full number of look-ups (10) for 52% of the search keys. The microbenchmark is essentially two nested loops, with an outer for loop over the search keys and an inner while loop implementing a binary search for finding the key in the key/value array. The MIMD implementation requires five branches to implement these nested loops. The vector-SIMD implementation uses outer-loop vectorization such that the microthreads perform binary searches in parallel. The outer-loop control runs on the control thread and always executes the worst-case O(log(N)) number of iterations. Vectorizing the outer loop requires 17 vector instructions that either manipulate or operate under vector flag registers. Although it is possible for the control thread to check whether the flag register is all zeros on every iteration of the inner while loop, this prevents control-processor decoupling and decreases performance. The vector-SIMD implementation must use indexed vector memory instructions to load the keys in the key/value array, since each microthread is potentially checking a different element in the key/value array. In the SIMT and VT implementations, each microthread directly executes the inner while loop, and the hardware manages divergence as necessary. The microthreads can use scalar loads to access the key/value array. The VT implementation can still amortize some instructions onto the control thread, including the unit-stride load to read the search keys and the unit-stride store to write the result. Although the compiler is able to automatically replace one conditional branch with a conditional move, both the SIMT and VT implementations include many conditional branches that can potentially limit performance and energy efficiency. We also examine implementations where some of these conditional branches are replaced with explicit conditional moves inserted by the programmer.
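To make the branch-versus-conditional-move trade-off concrete, the sketch below shows the binary-search inner loop in two scalar forms; the data layout and sentinel value are assumptions for illustration. The first form uses data-dependent branches like the compiled VT implementation, while the second replaces them with select-style updates that a compiler can lower to conditional moves, leaving only the backward branch of the while loop.

    // Illustrative scalar sketches of the bsearch inner loop (not the Maven source).
    struct Pair { int key, value; };

    // Branchy form: early exit plus a data-dependent branch per look-up.
    int bsearch_branchy(const Pair* kv, int n, int key) {
      int lo = 0, hi = n - 1;
      while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (kv[mid].key == key) return kv[mid].value;
        if (kv[mid].key <  key) lo = mid + 1;
        else                    hi = mid - 1;
      }
      return -1;                         // key not found
    }

    // Conditional-move-friendly form: the bound and result updates are selects,
    // so only the while loop's backward branch remains data dependent.
    int bsearch_cmv(const Pair* kv, int n, int key) {
      int lo = 0, hi = n - 1, result = -1;
      while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int k   = kv[mid].key;
        result  = (k == key) ? kv[mid].value : result;
        lo      = (k <  key) ? mid + 1 : lo;
        hi      = (k >= key) ? mid - 1 : hi;   // equality also shrinks the range
      }
      return result;
    }

In the vector-SIMD mapping, the analogous bound updates are expressed with operations under vector flags, and the control thread simply runs the loop for the worst-case number of iterations.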

7.3 Evaluation Methodology

Figure 7.2 illustrates the evaluation methodology used to collect area, performance, and energy results from the 20 core configurations and four microbenchmarks. The software toolflow follows the programming methodology described in Chapter 6. A C++ application can be compiled either natively for rapid development and testing, or for Maven with the Maven preprocessor and compiler. The resulting Maven binary can then be run on the instruction set simulator for functional verification. The instruction set simulator produces preliminary statistics about dynamic instruction counts and microthread branch divergence. We augment each microbenchmark with special instructions to tell the various simulators when the critical timing loop begins and ends so that our statistics only correspond to the important portion of the microbenchmark.

The hardware toolflow starts with the Verilog RTL for each of the 20 core configurations. Synopsys VCS is used to generate a fast simulator from this RTL model that can be linked to the application server described in Section 6.4.

Figure 7.2: Evaluation Software and Hardware Toolflow – The software toolflow allows C++ applications to be compiled either natively or for Maven, while the hardware toolflow transforms the Verilog RTL for a data-parallel core into actual layout. From this toolflow we can accurately measure area, performance (1/(cycle count × cycle time)), and energy (average power × cycle count × cycle time).

We use an extensive self-checking assembly-level test suite that can run on both the instruction set and RTL simulators to verify the functionality of the RTL model. Since this model is cycle accurate, we can also use the RTL simulator to determine the number of cycles required to execute a specific microbenchmark. Synopsys Design Compiler is used to synthesize a gate-level model from the higher-level RTL model. We used a 1.5 ns target cycle time (666 MHz clock frequency) for all configurations in hopes that most designs would ultimately be able to achieve a cycle time between 1.5–2 ns (666–500 MHz). All core configurations explicitly instantiate Synopsys DesignWare Verilog components to implement the larger and more complex functional units, including the integer multiplier/divider and the floating-point units. Automatic register retiming is used to generate fully pipelined functional units that meet our timing requirements. This resulted in the following functional unit latencies: four-cycle integer multiplier, 12-cycle integer divider, four-cycle floating-point adder, four-cycle floating-point multiplier, eight-cycle floating-point divider, and eight-cycle floating-point square root. The RTL is crafted such that the synthesis tool can automatically generate clock-gating logic throughout the datapaths and control logic. Unfortunately, we do not currently have access to a custom register-file generator.

Instead, we have written a Python script that generates gate-level models of register files with different numbers of elements, bits per element, read ports, and write ports. Our generator uses standard-cell latches for the bitcells, hierarchical tri-state buses for the read ports, and efficient muxing for the write ports. In addition to generating the Verilog gate-level model, our generator pre-places the standard cells in regular arrays. Our generator can also build larger register files out of multiple smaller subbanks. These pre-placed blocks are included as part of the input to the synthesis tool, but they are specially marked to prevent the tool from modifying them. This allows the synthesis tool to generate appropriately sized decode logic.

The synthesized gate-level model is then passed to Synopsys IC Compiler, which handles placing and routing the standard cells. We also use IC Compiler for floorplanning the location of the pre-placed blocks, synthesizing the clock tree, and routing the power distribution grid. The place-&-route tool will automatically insert extra buffering into the gate-level model to drive heavily loaded nets. The output of the place-&-route tool is an updated gate-level model and the final layout. We use the place-&-route tool to generate reports summarizing the area of each module in the hierarchy and the most critical timing paths using static timing analysis. From the cycle time and the cycle counts generated through RTL simulation, we can calculate each microbenchmark's performance in iterations per second for each core configuration.

We use Synopsys VCS to generate a simulator from the post-place-&-route gate-level model. This gate-level simulator is more accurate but slower than the higher-level RTL simulator. The gate-level simulator is linked to the application server, allowing us to run the exact same binaries as used with the instruction set and RTL simulators. The gate-level simulator also allows us to perform additional verification. Certain types of design errors will simulate correctly in the RTL simulator but produce incorrect results in the gate-level simulator (and in a fabricated prototype). In addition, the gate-level simulator generates detailed activity statistics for every net in the design. Synopsys PrimeTime is used to combine each net's switching activity with the corresponding parasitic capacitance values from the layout. This generates average power for each module in the design. From the average power, cycle time, and cycle counts, we can calculate each microbenchmark's energy efficiency in Joules per iteration for each core configuration.

As an example of our evaluation methodology, Figure 7.3 shows the post-place-&-route chip plot for the vt-1x8 core configuration. The modules are labeled according to the microarchitectural diagram in Figure 5.1. Using our pre-placed register-file generator results in regularly structured vector and control register files. Although this produces much better results than purely synthesizing these large register files, the pre-placed register files are still significantly larger than what would be possible with a custom register-file generator. We avoid any serious floorplanning of the vector functional units, even though there is the possibility to better exploit the structure inherent in these kinds of vector units. The benefit of these limitations is that they enable much faster development of a wide range of core configurations, and the results should still provide reasonable preliminary comparisons.

Figure 7.3: Chip Floorplan for vt-1x8 Core – Chip floorplan showing position of each gate in the post-place-&-route gate-level model. Modules are labeled according to the microarchitectural diagram in Figure 5.1. Register files use pre-placed standard-cell latches. Total area is 690 µm × 825 µm. (CP = control processor, VIU = vector issue unit, VMU = vector memory unit, VFU = vector functional unit, VAU = vector address unit, VSDRU = vector store-data read unit, VLDWU = vector load-data writeback unit, VAQ = vector address queue, VSDQ = vector store-data queue, VLDQ = vector load-data queue, INT = simple integer ALU, IMUL/IDIV = integer multiplier and divider, FMUL/FADD/FSQRT/FDIV = floating-point multiplier, adder, square root unit, and divider.)

Future work will iteratively refine a smaller set of core configurations to improve the quality of results, and eventually the designs will be suitable for fabricating a prototype. The software and hardware toolflow is completely scripted, which allows us to quickly push new designs through the toolflow. Using a small cluster of eight-core servers, the entire set of 20 core configurations and 18 microbenchmark implementations can be pushed through the toolflow to generate a complete set of results overnight.
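The performance and energy metrics reported in the following sections are derived from three measured quantities: the post-place-&-route cycle time, the RTL-simulated cycle count for the timed region, and the average power from gate-level activity. The small C++ helper below simply restates that arithmetic with made-up example numbers; it is not part of the actual toolflow scripts.

    #include <cstdio>

    // Restates the metric arithmetic: performance = tasks / (cycles * cycle time),
    // energy per task = (average power * cycles * cycle time) / tasks.
    // All values below are illustrative placeholders, not measured results.
    int main() {
      double cycle_time_ns = 1.81;    // e.g., from static timing analysis
      double cycles        = 12000;   // cycles in the timed region (RTL simulation)
      double avg_power_mW  = 50.0;    // from gate-level power analysis
      double tasks         = 1000;    // loop iterations in the timed region

      double runtime_s     = cycles * cycle_time_ns * 1e-9;
      double tasks_per_s   = tasks / runtime_s;
      double nJ_per_task   = (avg_power_mW * 1e-3 * runtime_s / tasks) * 1e9;

      std::printf("%.3g tasks/s, %.3g nJ/task\n", tasks_per_s, nJ_per_task);
      return 0;
    }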

7.4 Cycle-Time and Area Comparison

Table 7.3 shows the cycle time for each of the 20 core configurations. The tools worked hard to meet the 1.5 ns cycle-time constraint, but ultimately the cycle times range from 1.57–2.08 ns (637–481 MHz). Generally, the cycle times increase as configurations include larger physical register files, which require longer read and write global bitlines. Analysis of the critical paths in these configurations verifies that the cycle time is usually limited by accessing the register file. Although it might be possible to pipeline the register-file access, a more promising direction is to move to a banked design as described in Section 5.7. Increasing the number of lanes has less of an impact on the cycle time, since the larger vector register file in multi-lane vector units is naturally banked by lane. The cycle times vary by 6–13% across configurations with the same number of physical registers per lane, and by 25% across all configurations.

                              Cycle        Area (thousands of µm²)
Pattern  Variant   Time (ns)   Reg File   Int Dpath   FP Dpath   Mem Dpath   Ctrl Logic   CP     Total
mimd     1x1       1.57        22.2       60.8        53.4       4.4         7.4          –      148
mimd     1x2       1.78        44.5       59.3        51.9       4.4         8.9          –      169
mimd     1x4       1.83        89.0       63.8        51.9       4.4         8.9          –      218
mimd     1x8       2.08        179.4      69.7        51.9       4.4         8.9          –      314
vsimd    1x1       1.64        59.3       56.3        44.5       29.7        32.6         50.4   273
vsimd    1x2       1.79        105.3      54.9        44.5       28.2        34.1         48.9   316
vsimd    1x4       1.73        197.2      54.9        44.5       28.2        35.6         48.9   409
vsimd    1x8       1.84        390.0      54.9        43.0       28.2        38.6         47.5   602
vsimd    2x1       1.75        115.7      105.3       89.0       54.9        40.0         48.9   454
vsimd    2x2       1.68        209.1      108.2       89.0       54.9        41.5         50.4   553
vsimd    2x4       1.67        397.4      108.2       89.0       54.9        43.0         53.4   746
vsimd    2x8       1.97        784.4      112.7       87.5       54.9        47.5         53.4   1140
vsimd    4x1       1.73        229.8      215.0       176.5      109.7       53.4         51.9   836
vsimd    4x2       1.72        413.7      212.0       172.0      108.2       54.9         51.9   1013
vsimd    4x4       1.92        784.4      210.6       172.0      111.2       56.3         50.4   1385
vsimd    4x8       2.07        1546.6     212.0       170.5      114.2       68.2         53.4   2165
vt       1x1       1.67        46.0       54.9        47.5       28.2        51.9         48.9   277
vt       1x2       1.67        91.9       56.3        46.0       29.7        51.9         48.9   325
vt       1x4       1.68        183.9      57.8        46.0       29.7        54.9         50.4   423
vt       1x8       1.81        379.6      60.8        46.0       29.7        54.9         50.4   621

Table 7.3: Absolute Area and Cycle Time for Core Configurations – Cycle time as measured by static timing analysis after place-&-route. Area breakdown as reported by the place-&-route tool. Area breakdown for vector-SIMD and VT cores is for the vector unit, with the entire control processor grouped into a single column. See Figure 7.4 for a plot of normalized area for each configuration.


Figure 7.4: Normalized Area for Core Configurations – Area for each of the 20 core configurations listed in Table 7.1 is shown normalized to the mimd-1x1 core. Note that the embedded control processor helps mitigate the area overhead of single-lane vector units, and the VT cores have very similar area to the single-lane vector-SIMD cores. (See Table 7.3 for absolute area numbers. Area breakdown for vector-SIMD and VT cores is for the vector unit with all parts of the control processor grouped together on top.)

Table 7.3 also shows the area breakdown for each of the 20 core configurations, and Figure 7.4 plots the area normalized to the mimd-1x1 configuration. As expected, the register file dominates the area, particularly in the *-*x8 configurations. These results can help us better understand the area impact of moving to multi-lane vector units, embedding the control processor, and adding VT mechanisms.

Increasing the number of lanes in the vsimd-1x*, vsimd-2x*, and vsimd-4x* configurations obviously also increases the total area. Notice that the relative increase in the control logic area is much less than the relative increase in the register file and datapaths. For example, from the vsimd-1x8 configuration to the vsimd-4x8 configuration the register-file and datapath area increases by 395%, but the control logic only increases by 76%. This is because, in a multi-lane vector unit, most of the control logic is amortized across the lanes.

To evaluate control processor embedding, we can compare the area of the four-lane configurations (vsimd-4x4, vsimd-4x8) with four times the area of the single-lane configurations (vsimd-1x4, vsimd-1x8). The vsimd-*x1 and vsimd-*x2 configurations are less practical, since they support much shorter hardware vector lengths.

The four single-lane configurations consume approximately 11–18% more area than the corresponding four-lane configurations (for example, using the absolute areas in Table 7.3, 4 × 409 ≈ 1636 versus 1385 thousand µm² at the x4 design point, and 4 × 602 ≈ 2408 versus 2165 at the x8 design point). This comparison favors the multi-lane configurations, since we are assuming control-processor embedding is possible in all configurations. The embedded control processor area as a fraction of the total area for the vt-1x4 and vt-1x8 configurations ranges from 8–11%. If the control processor required its own long-latency functional units and memory ports, the area overhead would range from ≈22–30%. Even if further optimization decreases the size of the register file by a factor of 2–3×, the embedded control processor area would still be limited to 11–16%. This is a conservative estimate, because optimized register files would also decrease the area of the control processor. These results imply that control processor embedding is a useful technique that enables single-lane vector units without tremendous area overhead.

One of the design goals for the Maven VT core was to add VT capabilities with minimal changes to a vector-SIMD core. The hope is that this would allow the Maven VT core to maintain the area- and energy-efficiency advantages of the vector-SIMD core while adding increased flexibility. We can see the area impact of these changes in Table 7.3 by comparing the vt-1x8 configuration to the vsimd-1x8 configuration. The total area is within 3%, but this area is allocated across the design differently. The largest discrepancies are in the register file and control logic. The vector-SIMD core has a larger register file, since it includes the vector flag register file. The VT core has more control logic, since it includes the PVFB. These two structures (the vector flag register file and the PVFB) both help the corresponding cores manage irregular DLP, but they do so in very different ways. Ultimately, the VT cores are able to provide more flexible handling of irregular DLP with a similar total area as the single-lane vector-SIMD cores.

7.5 Energy and Performance Comparison for Regular DLP

Figures 7.5a–b show the energy and performance for the vvadd microbenchmark running on the 20 core configurations. Figure 7.5a shows the energy and performance normalized to the mimd-1x1 configuration, and Figure 7.5b shows the absolute energy breakdown for each configuration. The vvadd microbenchmark running on the MIMD configurations has few critical execution latencies. There is a two-cycle memory latency, a one-cycle branch resolution latency, and no functional unit latencies, since vvadd only performs single-cycle integer additions. As a consequence, there are few execution latencies to hide, and increasing the number of MIMD microthreads results in no performance improvement. Moving to eight threads (mimd-1x8) actually degrades performance due to the non-trivial start-up overhead required to spawn and join the microthreads. The energy increases significantly for larger MIMD cores, and Figure 7.5b shows that this is mostly due to increased register-file energy. Analysis of the energy-per-cycle results reveals that the increase in register-file energy is partially due to the larger register file required to support the microthread contexts, but also due to the increased number of start-up instructions required to manage more microthreads. These extra instructions also slightly increase the control and integer-datapath energy per task.

For the vvadd microbenchmark, the vector-SIMD cores are able to achieve higher performance at lower energy per task than the MIMD cores. Figure 7.5a shows that the vsimd-1x4 configuration is 3× faster than the mimd-1x1 configuration, and consumes just 60% of the energy to execute the same task. The vsimd-1x4 configuration is almost 3× larger, but area-normalizing these results against larger multithreaded MIMD cores results in an even larger margin of improvement. Instantiating three mimd-1x1 cores would help improve throughput but would not reduce the energy per task. Notice that the speedup is greater than what is implied by the vsimd-1x4 configuration's ability to execute two integer additions per cycle using the two vector functional units. This is because the vsimd-1x4 configuration not only executes the fundamental addition operations faster but also does less work by amortizing various overheads onto the control processor.

Figure 7.5b shows the effect of increasing the number of microthreads temporally. Moving from vsimd-1x1 (10 µTs) to vsimd-1x8 (32 µTs) decreases the control processor energy because longer hardware vector lengths enable greater amortization. At the same time, the register-file energy increases because the global bitlines in the register file grow longer. In the larger configurations, the larger register-file energy outweighs the smaller control processor energy. Figure 7.5b also shows the effect of increasing the number of microthreads spatially with multiple lanes. Moving from vsimd-1x1 (10 µTs) to vsimd-4x1 (40 µTs) decreases the control processor energy via longer hardware vector lengths. In addition, the multi-lane configuration is able to achieve better energy efficiency than a single-lane configuration with a similar number of microthreads, since structures such as the control logic and register file are banked across the lanes. Ultimately, increasing the number of lanes improves performance and slightly reduces the energy per task. Multiple single-lane vector units could potentially achieve similar throughput but without the slight reduction in energy per task. Finally, note that all of the vector-SIMD results show decreased performance when moving from few microthreads per lane to many microthreads per lane (e.g., vsimd-2x1 to vsimd-2x8). This is due to the longer cycle times in these larger designs. Figure 7.6a shows the energy and performance assuming all configurations run at the same cycle time. Notice that the performance is then always monotonically increasing with an increasing number of microthreads.

The SIMT results are for the SIMT implementation of the vvadd microbenchmark running on the vt-1x* core configurations. Because the SIMT implementation does not use the control processor to amortize control overheads nor to execute vector memory commands, the only energy-efficiency gains are from the vector-like execution of the arithmetic operations. The results in Figure 7.5a show that this is not enough to produce any significant energy-efficiency improvement. In fact, the SIMT configurations perform much worse than the MIMD configurations due to the additional work each microthread has to perform. Each microthread must check whether its index is less than the array size and also calculate its array index using an integer multiply operation. Although these overheads make the SIMT pattern much less attractive, more realistic SIMT implementations would include a multithreaded VIU to hide the branch resolution latency and dynamic memory coalescing to try to recoup some of the vector-like efficiencies from microthread loads and stores.
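As a hypothetical sketch (not the Maven ISA or API), the function below represents the work each SIMT microthread performs for vvadd; the index derivation and bounds check are exactly the per-microthread overheads that the VT and vector-SIMD mappings amortize onto the control thread through stripmining and unit-stride vector memory instructions.

    // Hypothetical per-microthread work in the SIMT mapping of vvadd (illustrative;
    // in the real code the index is derived from utidx with an integer multiply).
    void simt_uthread_vvadd(int uthread_idx, int n,
                            const int* a, const int* b, int* c) {
      int i = uthread_idx;     // per-microthread index computation
      if (i < n)               // per-microthread bounds check (scalar branch)
        c[i] = a[i] + b[i];    // the only fundamentally necessary work
    }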

(a) Energy vs Performance for vvadd (b) Energy Breakdown for vvadd

(c) Energy vs Performance for cmult (d) Energy Breakdown for cmult

Figure 7.5: Results for Regular DLP Microbenchmarks – Results for the 20 configurations listed in Table 7.1 running the four implementations of the vvadd and cmult microbenchmarks listed in Table 7.2. The VT cores are able to achieve similar energy efficiency and performance as the vector-SIMD cores on regular DLP. (Energy and performance calculated using the methodology described in Section 7.3. Results in (a) and (c) are normalized to the mimd-1x1 configuration. For each type of core, the *-*x1 configuration is circled in plots (a) and (c), and the remaining three data points correspond to the *-*x2, *-*x4, and *-*x8 configurations. Each group of four bars in plots (b) and (d) corresponds to the *-*x1, *-*x2, *-*x4, and *-*x8 configurations. Energy breakdown for vector-SIMD and VT cores is for the vector unit with all parts of the control processor grouped together on top.)

(a) vvadd (b) cmult

Figure 7.6: Results for Regular DLP Microbenchmarks with Uniform Cycle Time – These plots are similar to Figures 7.5a and 7.5c, except that the performance and energy assume all core configurations can operate at the same cycle time. Notice that the performance almost always improves or stays constant with increased physical register size.

For regular DLP such as the vvadd microbenchmark, we expect the VT core to have similar energy efficiency and performance as the single-lane vector-SIMD core. Figure 7.5a confirms these expectations, even though the VT core has to perform some additional work compared to the vector-SIMD core. For example, the VT core must execute an extra vector fetch and a microthread stop instruction. Ultimately these overheads are outweighed by the vector-like efficiencies from executing vector memory instructions and vector-fetched scalar instructions. Figure 7.5b shows a similar energy breakdown for both the vt-1x* and vsimd-1x* configurations. Note that the SIMT and VT results are running on the exact same hardware, but the VT configurations achieve much higher performance and lower energy per task. The only difference is that the VT implementation makes extensive use of the control thread to amortize various overheads, while the SIMT implementation requires each microthread to redundantly execute many of these control operations.

Figures 7.5c–d show the energy and performance for the cmult microbenchmark. These results are similar to those for the vvadd microbenchmark, except that the floating-point arithmetic causes increased floating-point-datapath energy and decreased integer-datapath energy. The MIMD configurations are able to achieve some performance improvement, as the multiple threads help hide the floating-point execution latencies. The non-monotonic energy profile of the MIMD configurations is caused by performance issues with how the microthreads are scheduled and the sharing of the second register-file write port between the floating-point units and the load writeback.

The vector-SIMD and VT cores are once again able to achieve higher performance (3–16×) at reduced energy per task (50–75%) compared to the mimd-1x1 configuration. Overall, the results for both regular DLP microbenchmarks are very encouraging. The VT cores are able to achieve vector-like energy efficiency on regular DLP even though the VT core uses a very different VIU implementation. The multithreaded MIMD cores make less sense unless there are significant execution latencies that can be hidden by interleaving multiple threads. The SIMT configurations would require significantly more hardware support before they can be competitive with the vector-SIMD and VT cores.

7.6 Energy and Performance Comparison for Irregular DLP

Figures 7.7a–b show the energy and performance for the mfilt microbenchmark running on the 20 core configurations. Figure 7.7a shows that additional microthreads improve the performance of the MIMD cores slightly, mainly due to hiding the integer-division execution latency. Since the MIMD pattern executes regular and irregular DLP identically, the energy breakdown results for the MIMD configurations in Figure 7.7b are similar to those for the regular DLP microbenchmarks in Figure 7.5. More microthreads require larger register files and increase start-up overheads, which together increase the energy per task.

The vector-SIMD cores are less effective at executing irregular DLP. The very small vsimd-1x1 configuration actually has lower performance and 2.5× worse energy per task than the mimd-1x1 configuration. The vsimd-1x1 configuration only supports a hardware vector length of two, and this is too short to achieve reasonable control amortization or to effectively hide the integer-division execution latency. Larger vector-SIMD configurations are better able to improve performance, resulting in a 2× speedup for the vsimd-1x8 configuration and a 6× speedup for the vsimd-4x8 configuration. Increasing the size of the physical register file per lane improves the energy efficiency through longer hardware vector lengths, which better amortize the control processor and vector control overheads. Eventually this is outweighed by a larger vector register file and more wasted work due to a greater fraction of inactive microthreads. Also notice that while increasing the number of lanes does reduce the energy per task (vsimd-1x4 to vsimd-2x4 decreases energy by ≈20%), the marginal reduction from two to four lanes is much less. There is only so much control overhead that can be amortized, and Figure 7.7b shows static energy starting to play a more significant role in the larger designs. Ultimately, the vector-SIMD cores are able to improve performance but not the energy efficiency as compared to the mimd-1x1 configuration. It should also be reiterated that the vector-SIMD implementation of the mfilt microbenchmark required hand-coded assembly.

As with the regular DLP microbenchmarks, the SIMT configurations are unable to improve much beyond the MIMD cores. Again, this is caused by the increased number of scalar instructions that each microthread must execute even in comparison to the MIMD implementations.

(a) Energy vs Performance for mfilt (b) Energy Breakdown for mfilt

(c) Energy vs Performance for bsearch (d) Energy Breakdown for bsearch

Figure 7.7: Results for Irregular DLP Microbenchmarks – Results for the 20 configurations listed in Table 7.1 running the four implementations of the mfilt and bsearch microbenchmarks listed in Table 7.2. The VT core is sometimes better and sometimes worse than the vector-SIMD core on irregular DLP. Future enhancements, including vector fragment merging, interleaving, and compression, are likely to improve both the performance and energy efficiency of the VT core on irregular DLP. (Energy and performance calculated using the methodology described in Section 7.3. The bsearch benchmark for the SIMT and VT cores uses explicit conditional moves and corresponds to the simt-cmv and vt-cmv entries in Table 7.2. Results in (a) and (c) are normalized to the mimd-1x1 configuration. For each type of core, the *-*x1 configuration is circled in plots (a) and (c), and the remaining three data points correspond to the *-*x2, *-*x4, and *-*x8 configurations. See Figure 7.5b for the legend associated with plots (b) and (d). Each group of four bars in plots (b) and (d) corresponds to the *-*x1, *-*x2, *-*x4, and *-*x8 configurations. Energy breakdown for vector-SIMD and VT cores is for the vector unit with all parts of the control processor grouped together on top.)

The mfilt microbenchmark is, however, straight-forward to implement using the SIMT pattern owing to the flexible programmer's model, especially when compared to the hand-coded assembly required for the vector-SIMD implementation.

The VT configurations are able to do much better than the SIMT configurations, yet they maintain the MIMD pattern's simple programming model. The primary energy-efficiency advantage of the VT versus SIMT configurations comes from using the control thread to read the kernel coefficients with shared loads, amortize the shared computation when calculating the normalization divisor, read the image pixels with unit-stride vector loads, and manage the pointer updates. To help better understand the mfilt microbenchmark's execution on the VT core, Figure 7.8a shows what fraction of the vector fragment micro-ops are active for all four VT configurations. More physical registers enable longer hardware vector lengths, but this also gives greater opportunity for the longer fragments to diverge. For example, approximately 8% of all fragment micro-ops have less than the maximum number of active microthreads in the vt-1x1 configuration, but in the vt-1x8 configuration this increases to 70%. A critical feature of the Maven VT microarchitecture is that the energy efficiency for these diverged fragments is still roughly proportional to the number of active microthreads. As Figure 7.7a shows, this allows the energy per task to decrease with increasing hardware vector lengths even though the amount of divergence increases. The VT core is actually able to do better than the vector-SIMD cores because of a key feature of the VT pattern. Since data-dependent control flow is expressed as vector-fetched scalar branches, the VT core's VIU can completely skip the convolution computation when all microthreads determine that their respective pixels are under the mask. This happens 50% of the time in the vt-1x8 configuration for the dataset used in this evaluation. This is in contrast to the vector-SIMD implementation, which must process vector instructions executing under a flag register even if all bits in the flag register are zero. Vector-SIMD cores can skip doing any work when the flag register is all zeros, but they must at least process these instructions. Thus, the VT configurations are able to execute at slightly higher performance and up to 30% less energy than their single-lane vector-SIMD counterparts. This also allows the Maven VT core to achieve a 2× speedup at equivalent energy per task compared to the mimd-1x1 configuration.

The current Maven VT core does not include support for vector fragment merging, interleaving, or compression. In the mfilt microbenchmark, there is only a single microthread branch, so vector fragment interleaving is unlikely to provide much benefit. The microthreads naturally reconverge after each iteration of the stripmine loop. Vector fragment merging can provide a very slight benefit, as there are two common microthread instructions at the target of the microthread branch. Vector fragment compression could improve performance and energy efficiency, especially on those fragment micro-ops with few active microthreads per fragment.

The bsearch microbenchmark is the most irregular of the four microbenchmarks studied in this thesis.

(a) mfilt (b) bsearch-cmv (c) bsearch

Figure 7.8: Divergence in Irregular DLP Microbenchmarks – Normalized histograms showing what fraction of vector fragment micro-ops have the given number of active microthreads. Increasing the hardware vector length leads to more possibilities for divergence in both the mfilt and bsearch microbenchmarks. Without explicit conditional moves, most of the fragment micro-ops in the bsearch microbenchmark are completely diverged with a single active microthread in each fragment. (Maximum number of microthreads per fragment (i.e., the hardware vector length) shown with vertical line.)

Figures 7.7c–d show that the MIMD configurations perform similarly to how they did on the other microbenchmarks. This is to be expected, since the MIMD cores execute both regular and irregular DLP in the same way. As in the mfilt microbenchmark, the vector-SIMD configurations are able to improve performance, but there is not enough regularity in the microbenchmark to reduce the energy per task below the mimd-1x1 configuration. As in the other microbenchmarks, the microthreads in the SIMT configurations must execute too many additional instructions to achieve any improvement in performance or energy efficiency.

The results for the VT core in Figures 7.7c–d correspond to the vt-cmv row in Table 7.2. This implementation includes explicitly inserted conditional move operations to force the compiler to generate vector-fetched conditional move instructions as opposed to vector-fetched scalar branches. There is one remaining backward branch that implements the inner while loop. Given this optimization, the VT core has slightly worse performance and requires ≈15% more energy per task than the single-lane vector-SIMD cores. Unlike the mfilt microbenchmark, there are fewer opportunities to skip over large amounts of computation. The fragment can finish early if all microthreads in the fragment find their search key before reaching the maximum number of look-ups. Unfortunately, this happens rarely, especially in the larger configurations with longer hardware vector lengths. The energy overhead, then, is partly due to the VIU's management of divergent fragments as microthreads gradually find their search key and drop out of the computation. Figure 7.7d clearly shows a significant increase in the control energy for the VT configurations as compared to the vector-SIMD configurations. Some of this energy is also due to the VIU's need to fetch and decode vector-fetched scalar instructions, which is handled by the control processor in the vector-SIMD configurations. The VT configurations are limited to the MIPS32 conditional move instruction for expressing conditional execution, while the vector-SIMD configurations use a more sophisticated vector flag mechanism. These vector flags allow the vector-SIMD configurations to more compactly express the conditional execution, which also accounts for some of the performance discrepancy. Even though this is an irregular DLP microbenchmark, comparing vt-1x1 to vt-1x8 in Figure 7.7d shows that the VT cores are still able to amortize control processor and control overheads with increasing hardware vector lengths. In the largest configuration, the energy overhead of the larger vector register file outweighs the decrease due to better amortization.

Figure 7.8b shows the divergence for all four VT configurations. As with the mfilt microbenchmark, increasing the number of physical registers leads to longer hardware vector lengths, but also causes additional divergence. In the vt-1x8 configuration, about 50% of the fragment micro-ops stay coherent before microthreads begin to find their respective keys and exit their while loops. Contrast these results with those shown in Figure 7.8c, which correspond to the bsearch microbenchmark without the explicit conditional moves. In this implementation the compiler generates many conditional branches (and only one conditional move), and as a result the fragment micro-ops quickly diverge. For this implementation running on the vt-1x8 configuration, over 70% of the fragment micro-ops contain only a single microthread.

Figure 7.9: Results for bsearch Microbenchmark without Explicit Conditional Moves – Without explicitly inserted conditional moves, the compiler generates many data-dependent conditional branches. Unfortunately, the current Maven microarchitecture is unable to efficiently handle this number of branches with so few instructions on either side of the branch. Future Maven implementations can use vector fragment merging, interleaving, and compression to improve the efficiency of such highly irregular DLP. (Energy and performance calculated using the methodology described in Section 7.3. The w/ cmv implementations correspond to the bsearch/simt-cmv and bsearch/vt-cmv rows in Table 7.2, while the w/o cmv implementations correspond to the bsearch/simt and bsearch/vt rows in Table 7.2. Results normalized to the mimd-1x1 configuration. For each type of core, the *-*x1 configuration is circled and the remaining three data points correspond to the *-*x2, *-*x4, and *-*x8 configurations.)

Figure 7.9 shows how this additional divergence impacts the performance and energy of the SIMT and VT configurations. Increasing the number of physical registers actually causes the performance and energy to worsen considerably. In this example, longer hardware vector lengths simply lead to more inactive microthreads per fragment.

Overall, the results for irregular DLP are a promising first step towards a Maven VT core that can efficiently handle all kinds of irregular DLP. The results for the mfilt microbenchmark illustrate the advantage of completely skipping large amounts of computation through vector-fetched scalar branches. The results for the bsearch microbenchmark are not as compelling, but they do raise some interesting opportunities for improvement. Conditional move instructions can help mitigate some of the overheads associated with vector-fetched scalar branches, and it should be possible to modify the compiler to more aggressively use this type of conditional execution. It should also be possible to use the microarchitectural techniques introduced in Chapter 5, such as vector fragment merging, interleaving, and compression, to improve the performance and energy efficiency of the bsearch microbenchmark specifically and irregular DLP more generally. One of the strengths of the VT approach, however, is the simple programming methodology that allows mapping irregular DLP to the Maven VT core without resorting to hand-coded assembly.

7.7 Computer Graphics Case Study

As a first step towards evaluating larger applications running on a Maven VT core, we have started mapping a computer graphics application using the Maven programming methodology. Figure 7.10 illustrates the four kernels that together form this application: the cg-physics kernel performs a simple Newtonian physics simulation with object collision detection; the cg-vertex kernel handles vertex transformation and lighting; the cg-sort kernel sorts the triangles based on where they appear in the frame buffer; and the cg-raster kernel rasterizes the triangles to the final frame buffer. The rendering pipeline is an example of sort-middle parallel rendering, where both the cg-vertex and cg-raster kernels are completely data-parallel, and the global communication during the cg-sort kernel reorganizes the triangles [MCEF94]. Table 7.4 shows the instruction mix for each kernel. The kernels include a diverse set of data-dependent application vector lengths, ranging from ten to hundreds. Although some tuning is still possible, all kernels currently use the full complement of 32 registers per microthread. All four kernels were written from scratch and include scalar, multithreaded MIMD, and VT implementations. The application requires a total of approximately 5,000 lines of C++ code.

The cg-physics kernel includes four phases per time step. In the first phase, the control thread creates an octree to efficiently group objects into cells that are close in space. Octree creation accounts for only a small portion of the kernel's run-time, but it may be possible to parallelize this phase in the future. The second phase determines which objects intersect. Each microthread is responsible for comparing one object to all other objects in the same octree cell. The objects are read using microthread loads, since the octree only stores pointers to objects as opposed to the objects themselves (making octree formation much faster). The microthreads perform several data-dependent conditionals to determine if the two objects under consideration collide.

(a) cg-physics (b) cg-vertex (c) cg-sort (d) cg-raster

Figure 7.10: Computer Graphics Case Study – The case study is comprised of four application kernels: (a) the cg-physics kernel performs a simple Newtonian physics simulation with collisions on a scene of rigid bodies each made up of many triangles, (b) the cg-vertex kernel transforms the resulting triangles into camera coordinates, projects the triangles into the 2D screen coordinates, and updates each vertex with lighting information, (c) the cg-sort kernel uses a radix sort to organize the triangles based on which rows in the frame buffer they cover, and (d) the cg-raster kernel rasterizes the triangles into the final frame buffer. Final image is an example of the actual output from the computer graphics application.

                      Control Thread                              Microthread
Name         lw.v   sw.v   lwst.v   swst.v   mov.sv   vf      int   fp    ld   st   amo   br   j    cmv    tot
cg-physics   6      1      12       9        15       4       5     56    24   4    0     16   0    0      132
cg-vertex    0      16     42       25       76       12      0     206   0    0    0     0    0    7      255
cg-sort      5      4      2        0        7        3       16    0     2    3    1     0    0    0      27
cg-raster    2      0      13       0        4        4       155   3     37   26   6     69   14   24     482

Table 7.4: Instruction Mix for Computer Graphics Kernels – The number of instructions for each application kernel is listed by type. The cg-vertex and cg-sort kernels have mostly regular data access and control flow, although cg-vertex includes conditional move instructions and cg-sort includes microthread loads/stores and atomic memory operations. The cg-physics and cg-raster kernels have much more irregular data access and control flow. Note that the instruction counts for the cg-raster kernel reflect compiler loop unrolling by a factor of six. (lw.v/sw.v = unit-stride load/store, lwst.v/swst.v = strided load/store, mov.sv = copy scalar to all elements of vector register, vf = vector fetch, int = simple short-latency integer, fp = floating-point, ld/st = load/store, amo = atomic memory operation, br = conditional branch, j = unconditional jump, cmv = conditional move)

The second phase is actually split into two vector-fetched blocks to allow vector memory accesses to interleave and to better block the computation into vector registers. A conditional control-flow state variable is required to link the control flow in the two vector-fetched blocks together. This phase can potentially experience significant divergence, since different microthreads will exit the vector-fetched blocks at different times based on different collision conditions. The third phase iterates over just the colliding objects and computes the new motion variables. Each microthread is responsible for one object and iterates over the list of objects it collides with using an inner for loop. The microthreads must again access the objects through microthread loads, since the collision lists are stored as pointers to objects. This phase will probably experience little divergence, since the number of objects that collide together in the same time step is usually just two or three. The final phase iterates over all objects in the scene and updates the location of each object based on that object's current motion variables.

The cg-physics kernel is a good example of irregular DLP. Table 7.4 shows that it includes a large number of microthread loads/stores (28) and conditional branches (16). Figure 7.11a shows the amount of divergence in this kernel when simulating a scene with 1000 small pyramids moving in random directions. Approximately 75% of the fragment micro-ops experience some amount of divergence, and 22% of the instructions are executed with only one microthread active per fragment. Although this kernel exhibits irregular DLP, it would still be simple to extend this implementation to multiple cores. Multiple cores might, however, increase the need to parallelize octree creation in the first phase. For the remaining phases, each core would work on a subset of the objects.

The cg-vertex kernel iterates over all triangles in the scene, applying coordinate transformations and updating lighting parameters. Each microthread works on a specific triangle by first transforming the three vertices into camera coordinates, then applying light sources to each vertex via flat shading, and finally projecting the vertices into quantized screen coordinates. This kernel includes a few small data-dependent conditionals that are mapped to conditional move instructions by the compiler.

(a) cg-physics (b) cg-raster

Figure 7.11: Divergence in Computer Graphics Kernels – Normalized histograms showing what fraction of vector fragment micro-ops have the given number of active microthreads. Both irregular DLP kernels have a significant amount of divergence. Statistics for the cg-physics kernel processing a scene with 1000 small pyramids moving in random directions, and for the cg-raster kernel processing a scene of the space shuttle with 393 triangles and an image size of 1000×1000 pixels. Measurements assume a vt-1x8 core configuration.

Otherwise, this kernel is a good example of regular DLP. Table 7.4 shows that it includes a large number of vector load/store instructions (83) and shared loads via the mov.sv instruction (76). Notice that this kernel uses 12 vector fetch instructions, even though it is parallelized by simply stripmining over all triangles. The large number of vector-fetched blocks is used to interleave vector load/store instructions amongst the microthread computation. This enables better use of the vector registers and prevents unnecessary microthread register spilling. It would be straight-forward to extend this kernel to multiple cores by simply having each core work on a subset of the triangle array.

The cg-sort kernel takes as input an array of triangles and a key for sorting. In this application, the key is the maximum y-coordinate across all vertices in a triangle. The maximum y-coordinate is computed in the previous kernel using the conditional move instructions. The kernel uses a radix sort to organize the triangles based on this key (i.e., from top to bottom in the frame buffer). The radix sort has three phases, each with a different vector-fetched block. In the first phase, the kernel creates a histogram of the frequency of each digit in the key using atomic memory operations. In the second phase, the kernel calculates offsets into the final output array based on these histograms. In the final phase, the kernel writes the input data into the appropriate location based on these offsets. Vector memory fences are required between each phase to properly coordinate the cross-microthread communication. This algorithm is similar in spirit to previous approaches for vector-SIMD processors [ZB91]. One difference is the use of atomic memory operations to update the radix histograms.
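A generic sketch of the first radix-sort phase is shown below; std::atomic stands in for the accelerator's atomic memory operations, and the 8-bit digit size is an assumption for illustration rather than a detail taken from the Maven kernel.

    #include <atomic>
    #include <cstdint>

    // Generic sketch of the histogram phase of one radix-sort pass: count how often
    // each digit of the sort key occurs, using atomic increments so that many
    // workers (microthreads in the kernel described above) can update shared bins.
    void radix_histogram(const uint32_t* keys, int n, int shift,
                         std::atomic<uint32_t>* hist /* 256 zero-initialized bins */) {
      for (int i = 0; i < n; i++) {
        uint32_t digit = (keys[i] >> shift) & 0xffu;
        hist[digit].fetch_add(1, std::memory_order_relaxed);
      }
    }

The second and third phases would then prefix-sum these counts into output offsets and scatter each element to its slot, with a barrier between phases standing in for the vector memory fences mentioned above.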

Table 7.4 shows that cg-sort is a mostly regular DLP kernel with a small number of microthread loads and stores. Mapping this kernel to multiple cores is still possible, but more challenging, owing to the cross-microthread communication.

The cg-raster kernel takes as input the sorted array of triangles and rasterizes them to the frame buffer. The kernel includes two phases. In the first phase, the kernel stripmines over the sorted array of triangles, and each microthread creates a set of scanline chunks for its triangle. A chunk corresponds to the portion of a triangle that overlaps with a given scanline. The chunk contains a start and stop x-coordinate, as well as metadata for each coordinate interpolated from the metadata associated with each triangle vertex. The chunks are stored per scanline using atomic memory operations. In the second phase, the kernel stripmines over the scanlines, and each microthread processes all chunks related to its scanline. The corresponding vector-fetched block has a for loop that allows each microthread to iterate across all pixels in its respective chunk. This can cause divergence when chunks are not the same length. Microthreads manage a z-buffer for overlapping triangles. Since microthreads are always working on different scanlines, they never modify the same region of the frame buffer at the same time. This kernel is another good example of irregular DLP. Table 7.4 shows that it requires many microthread loads/stores, conditional/unconditional branches, and conditional moves. Note that these instruction counts reflect compiler loop unrolling in the microthread code by a factor of six. Figure 7.11b shows the amount of divergence in this kernel when rasterizing a space-shuttle scene with 393 triangles and an image size of 1000×1000 pixels. Over 80% of the fragment micro-ops experience some amount of divergence. This is mostly due to the backward branch implementing the for loop in the second phase. Chunk lengths vary widely, and the corresponding fragments often diverge as some microthreads continue to work on longer chunks. Mapping this kernel to multiple cores involves giving each core a set of triangles that map to a region of consecutive scanlines. This is straight-forward, since the triangles have been sorted in the previous kernel. Triangles that overlap more than one region will be redundantly processed by multiple cores.

Overall, this study demonstrates that larger applications will exhibit a diverse mix of regular and irregular DLP, and that loops with irregular DLP will experience a range of microthread divergence. This helps motivate our interest in data-parallel accelerators that can efficiently execute many different types of DLP.

Chapter 8

Conclusion

Architects are turning to data-parallel accelerators to improve the performance of emerging data-parallel applications. These accelerators must be flexible enough to handle a diverse range of both regular and irregular data-level parallelism, but they must also be area- and energy-efficient to meet increasingly strict design constraints. Power and energy are particularly important as data-parallel accelerators are now deployed across the computing spectrum, from the smallest handheld to the largest data center. This thesis has explored a new approach to building data-parallel accelerators that is based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture.

8.1 Thesis Summary and Contributions

This thesis began by categorizing regular and irregular data-level parallelism (DLP), before introducing the following five architectural design patterns for data-parallel accelerators: the multiple-instruction multiple-data (MIMD) pattern, the vector single-instruction multiple-data (vector-SIMD) pattern, the subword single-instruction multiple-data (subword-SIMD) pattern, the single-instruction multiple-thread (SIMT) pattern, and the vector-thread (VT) pattern. I provided a qualitative argument for why our recently proposed VT pattern can combine the energy efficiency of SIMD accelerators with the flexibility of MIMD accelerators. The rest of this thesis introduced several general techniques for building simplified instances of the VT pattern suitable for use in future data-parallel accelerators. We have begun implementing these ideas in the Maven accelerator, which uses a malleable array of vector-thread engines to flexibly and efficiently execute data-parallel applications. These ideas form the key contribution of the thesis. I have outlined them and our experiences using them in the Maven accelerator below.

• Unified VT Instruction Set Architecture – This thesis discussed the advantages and disadvantages of unifying the VT control-thread and microthread scalar instruction sets. The hope was that a unified VT instruction set would simplify the microarchitecture and programming methodology, and we found this to be particularly true with respect to rapidly implementing a high-quality Maven compiler for both types of thread. In addition, a unified VT instruction set offers interesting opportunities for sharing between the control thread and microthread, both in terms of implementation (e.g., refactored control logic and control processor embedding) and software (e.g., common functions called from either type of thread). I suggest that future VT accelerators begin with such a unified VT instruction set and only specialize when absolutely necessary.

• Single-Lane VT Microarchitecture Based on the Vector-SIMD Pattern – This thesis described a new single-lane VT microarchitecture based on minimal changes to a traditional vector-SIMD implementation. Our experiences building the Maven VT core have anecdotally shown that implementing a single-lane VTU is indeed simpler than a multi-lane VTU, and that basing our design on a vector-SIMD implementation helped crystallize the exact changes that enable VT capabilities. This thesis introduced vector fragments as a way to more flexibly execute regular and irregular DLP in a vector-SIMD-based VT microarchitecture. I explained how control processor embedding can mitigate the area overhead of single-lane cores, and I introduced three new techniques that manipulate vector fragments to more efficiently execute irregular DLP. Vector fragment merging can automatically force microthreads to reconverge. Vector fragment interleaving can hide execution latencies (especially the vector-fetched scalar branch resolution latency) by switching between independent fragments. Vector fragment compression can avoid the overhead associated with a fragment's inactive microthreads (a conceptual sketch of this idea appears after this list). I suggest that future VT accelerators take a similar approach, although there is still significant work to be done in fully evaluating the proposed vector fragment mechanisms.

• Explicitly Data-Parallel VT Programming Methodology – This thesis presented a new VT programming methodology that combines a slightly modified C++ scalar compiler with a carefully written support library. Our experiences writing applications for Maven, particularly the computer graphics case study discussed in this thesis, have convinced us that an explicitly data-parallel framework is a compelling approach for programming future VT accelerators. Our experiences actually implementing the Maven programming methodology have also verified that, while not trivial, this approach is far simpler to develop than a similarly featured automatic vectorizing compiler. There is, of course, still room for improvement. I suggest that the methodology discussed in this thesis be used as an interim solution, possibly being replaced by a modified version of a more standard data-parallel framework such as OpenCL. A brief sketch of this style follows below.
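To give a concrete flavor of the explicitly data-parallel style, the sketch below writes a simple SAXPY routine against a hypothetical VT support library. The vt namespace and the config_vlen, vector_fetch, and sync functions are assumptions invented for this illustration (backed here by serial stand-ins so the code is self-contained); they do not reproduce the actual Maven library interface, and for brevity each microthread loads its own elements through ordinary pointers rather than explicit vector memory operations.

    #include <algorithm>
    #include <cstddef>

    namespace vt {
      // Serial stand-ins for a hypothetical support library so the sketch runs
      // anywhere; on a VT machine, vector_fetch would broadcast the scalar
      // function to the microthreads of the configured lane.
      inline int  config_vlen(int requested) { return requested; }
      inline void vector_fetch(void (*func)(int, void*), int vlen, void* args) {
        for (int utid = 0; utid < vlen; utid++) func(utid, args);
      }
      inline void sync() {}
    }

    struct SaxpyArgs { float a; const float* x; float* y; };

    // Scalar code executed by each microthread; utid selects the element it owns.
    static void saxpy_mt(int utid, void* p) {
      SaxpyArgs* args = static_cast<SaxpyArgs*>(p);
      args->y[utid] = args->a * args->x[utid] + args->y[utid];
    }

    // Control-thread code: stripmine the loop, advance the base pointers, and
    // vector-fetch the scalar kernel once per strip.
    void saxpy(float a, const float* x, float* y, std::size_t n) {
      std::size_t i = 0;
      while (i < n) {
        int vlen = vt::config_vlen(static_cast<int>(std::min<std::size_t>(n - i, 32)));
        SaxpyArgs args = { a, x + i, y + i };
        vt::vector_fetch(saxpy_mt, vlen, &args);
        vt::sync();
        i += static_cast<std::size_t>(vlen);
      }
    }

The essential property this sketch is meant to convey is that the microthread code is ordinary scalar C++ handled by the slightly modified scalar compiler, while the control thread is responsible for stripmining and dispatch.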
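The vector fragment compression technique referenced in the second bullet above can also be illustrated in software, even though it is a hardware mechanism. The sketch below is only a conceptual model, assuming a fragment active mask of at most 32 microthreads; the compress_fragment name and the use of std::vector are invented for the illustration and do not correspond to the Maven microarchitecture or RTL.

    #include <cstdint>
    #include <vector>

    // Conceptual model of vector fragment compression: given a fragment's active
    // mask, build a dense list of active microthread indices so that execution
    // slots are spent only on active microthreads (inactive ones are skipped).
    std::vector<int> compress_fragment(std::uint32_t active_mask, int vlen) {
      std::vector<int> active;                 // compacted execution slots
      for (int utid = 0; utid < vlen; utid++) {
        if (active_mask & (1u << utid))        // assumes vlen <= 32
          active.push_back(utid);
      }
      return active;                           // issue micro-ops only for these
    }

In hardware, an analogous compaction would let a fragment with only a few active microthreads occupy proportionally fewer execution slots, which is where the potential performance and energy benefit comes from.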

To evaluate these ideas, we implemented a simplified Maven VT core using a semi-custom ASIC methodology in a TSMC 65 nm process. We have been able to extend the Maven instruction set, microarchitecture, and programming methodology to reasonably emulate the MIMD, vector-SIMD, and SIMT design patterns. I evaluated these various core designs in terms of their area, performance, and energy. The Maven VT core has similar area to a single-lane vector-SIMD core, and uses control processor embedding to limit the area overhead of including a control processor per lane. To evaluate the performance and energy of a Maven VT core, I used four compiled microbenchmarks that capture some of the key features of regular and irregular DLP. My results illustrated that a Maven VT core is able to achieve similar performance and energy efficiency as a vector-SIMD core on regular DLP. I also explained how, for some forms of irregular DLP, a Maven VT core is able to achieve lower energy per task compared to a vector-SIMD core by completely skipping large portions of work. My results for more irregular DLP with many conditional branches were less compelling, but the vector fragment mechanisms presented in this thesis offer one possible way to improve the performance and energy efficiency on such codes. It is important to note that, although the vector-SIMD core performed better than the Maven VT core on the highly irregular DLP, this required careful work writing hand-coded assembly for the vector-SIMD core. Maven's explicitly data-parallel programming methodology made writing software with both regular and irregular DLP relatively straightforward. Both the VT and vector-SIMD cores perform better than the MIMD and SIMT designs, with a few small exceptions. This work is the first detailed quantitative comparison of the VT pattern to other design patterns for data-parallel accelerators. These results provide promising evidence that future data-parallel accelerators based on simplified VT architectures will be able to combine the energy efficiency of vector-SIMD accelerators with the flexibility of MIMD accelerators.

8.2 Future Work

This thesis serves as a good starting point for future work on VT-based data-parallel accelerators. Sections 4.7, 5.8, and 6.6 discussed specific directions for future work with respect to the instruction set, microarchitecture, and programming methodology. More general thoughts on future work are briefly outlined below.

• Improving Execution of Irregular DLP – One of the key observations of this thesis is that the vector fragment mechanism alone is not sufficient for efficient execution of highly irregular DLP. This thesis has introduced vector fragment merging, interleaving, and compression as techniques that can potentially improve the performance and energy efficiency on such codes. The next step would be to implement these techniques and measure their impact for our microbenchmarks using our evaluation methodology. In addition, small slices of the computer graphics case study could be used to help further evaluate the effectiveness of these techniques. Eventually, a higher-level and faster model will be required to work with even larger Maven applications.

• Maven Quad-Core Tile – This thesis has focused on a single data-parallel core, but there are many interesting design issues with respect to how these cores are integrated together. In Section 5.6, I briefly described how four Maven VT cores could be combined into a quad-core tile. The next step would be to implement area-normalized tiles for each of the design patterns. This would probably include a MIMD tile with eight cores, a vector-SIMD tile with one four-lane core, a vector-SIMD tile with four single-lane cores, and a VT tile with four single-lane cores. This would also enable more realistic modeling of the intra-tile memory system without needing a detailed model for the inter-tile memory system. A tile might also make a reasonably sized module for inclusion in a fabricated test chip.

• Full Maven Data-Parallel Accelerator – Finally, these quad-core tiles can be integrated into a model of a full data-parallel accelerator. Detailed RTL design of such an accelerator will be difficult, but higher-level models of each tile might enable reasonable research into the accelerator's full memory system and on-chip network.
