<<

Vector-Thread Architecture and Implementation
by
Ronny Meir Krashinsky

B.S. Electrical Engineering and Computer Science
University of California at Berkeley, 1999

S.M. Electrical Engineering and Computer Science Massachusetts Institute of Technology, 2001

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2007

© Massachusetts Institute of Technology 2007. All rights reserved.

Author...... Department of Electrical Engineering and Computer Science May 25, 2007

Certified by ...... Krste Asanović, Associate Professor, Thesis Supervisor

Accepted by ...... Arthur C. Smith, Chairman, Department Committee on Graduate Students

Vector-Thread Architecture and Implementation
by
Ronny Meir Krashinsky

Submitted to the Department of Electrical Engineering and Computer Science on May 25, 2007, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science

Abstract

This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow.

The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm², and it runs at 260 MHz while consuming 0.4–1.1 W.

This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication. Larger applications also include a JPEG image encoder and an IEEE 802.11a wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3–11 compute operations per cycle. Unlike other architectures which improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.

Thesis Supervisor: Krste Asanović
Title: Associate Professor

Acknowledgments

As I graduate from 24th grade and look back on my education, I am grateful to so many teachers, friends, and family members that have helped me get to this point. I feel very fortunate to have had the opportunity to complete a Ph.D.

First and foremost, thank you Krste for many years of teaching and wisdom. It has been a true privilege to work with you, and you have shaped my knowledge and understanding of computer architecture and research. Thank you for being intensely interested in the details, and for, so often, guiding me to the answer when I couldn’t quite formulate the question. Scale would not have come to be without your dedication, encouragement, and steadfast conviction. To: “are you done yet?”, yes.

Chris, in 25 words or less, your eye for elegant ideas and design has improved the work we’ve done together. Our conversations and debates have been invaluable. Thank you.

My thesis is primarily about the Scale project, but Scale has been a team effort. I acknowledge specific contributions to the results in this thesis at the end of the introduction in Section 1.6. I am very grateful to everyone who has contributed to the project and helped to make this thesis possible.

Many brilliant people at MIT have enhanced my graduate school experience and taught me a great deal. Special thanks to all the long-term members of Krste’s group, including, Seongmoo Heo, Mike Zhang, Jessica Tseng, Albert Ma, Mark Hampton, Emmett Witchel, Chris Batten, Ken Barr, Heidi Pan, Steve Gerding, Jared Casper, Jae Lee, and Rose Liu. Also, thanks to many others who were willing to impart their knowledge and argue with me, including, Matt Frank, Jon Babb, Dan Rosenband, Michael Taylor, Sam Larsen, Mark Stephenson, Mike Gordon, Bill Thies, David Wentzlaff, Rodric Rabbah, and Ian Bratt. Thanks to my thesis committee members, Arvind and Anant Agarwal. Finally, thank you to our helpful administrative staff, including Shireen Yadollahpour and Mary McDavitt.

To my friends and family: thank you so much for your love and support. This has been a long journey and I would not have made it this far without you!

Contents

Contents 7

List of Figures 11

List of Tables 15

1 Introduction 17 1.1 The Demand for Flexible Performance-Efficient Computing ...... 18 1.2 Application-Specific Devices and Heterogeneous Systems ...... 19 1.3 The Emergence of Processor Arrays ...... 20 1.4 An Efficient Core for All-Purpose Computing ...... 21 1.5 Thesis Overview ...... 21 1.6 Collaboration, Previous Publications, and Funding ...... 22

2 The Vector-Thread Architectural Paradigm 23 2.1 Defining Architectural Paradigms ...... 23 2.2 The RISC Architectural Paradigm ...... 24 2.3 The Symmetric Multiprocessor Architectural Paradigm ...... 26 2.4 The Vector Architectural Paradigm ...... 28 2.5 The Vector-Thread Architectural Paradigm ...... 30 2.6 VT Semantics ...... 37 2.7 VT Design Space ...... 38 2.7.1 Shared Registers ...... 38 2.7.2 VP Configuration ...... 38 2.7.3 Predication ...... 39 2.7.4 Segment Vector Memory Accesses ...... 39 2.7.5 Multiple-issue Lanes ...... 40 2.8 Vector-Threading ...... 40 2.8.1 Vector Example: RGB-to-YCC ...... 40 2.8.2 Threaded Example: Searching a Linked List ...... 41

3 VT Compared to Other Architectures 45 3.1 Loop-Level Parallelism ...... 46 3.2 VT versus Vector ...... 48 3.3 VT versus Multiprocessors ...... 49 3.4 VT versus Out-of-Order Superscalars ...... 50 3.5 VT versus VLIW ...... 50 3.6 VT versus Other Architectures ...... 51

7 4 Scale VT Instruction Set Architecture 55 4.1 Overview ...... 55 4.2 Virtual Processors ...... 59 4.2.1 Structure and State ...... 59 4.2.2 VP Instructions ...... 59 4.2.3 Predication ...... 59 4.3 Atomic Instruction Blocks ...... 61 4.4 VP Configuration ...... 61 4.5 VP Threads ...... 63 4.6 Cross-VP Communication ...... 64 4.6.1 Queue Size and ...... 66 4.7 Independent VPs ...... 67 4.7.1 Reductions ...... 67 4.7.2 Arbitrary Cross-VP Communication ...... 68 4.8 Vector Memory Access ...... 68 4.9 Control Processor and VPs: Interaction and Interleaving ...... 69 4.10 Exceptions and Interrupts ...... 69 4.11 Scale Virtual Machine Binary Encoding and Simulator ...... 70 4.11.1 Scale Primop Interpreter ...... 70 4.11.2 Scale VM Binary Encoding ...... 71 4.11.3 Scale VM Simulator ...... 71 4.11.4 Simulating Nondeterminism ...... 71

5 Scale VT 73 5.1 Overview ...... 73 5.2 Instruction Organization ...... 75 5.2.1 Micro-Ops ...... 76 5.2.2 Binary Instruction Packets ...... 80 5.2.3 AIB Cache Fill ...... 83 5.3 Command Flow ...... 84 5.3.1 VTU Commands ...... 85 5.3.2 Lane Command Management ...... 85 5.3.3 Execute Directives ...... 89 5.4 Vector-Thread Unit Execution ...... 90 5.4.1 Basic Cluster Operation ...... 90 5.4.2 Long-latency Operations ...... 94 5.4.3 Inter-cluster Data Transfers ...... 94 5.4.4 Execute Directive Chaining ...... 95 5.4.5 Memory Access Decoupling ...... 98 5.4.6 Atomic Memory Operations ...... 100 5.4.7 Cross-VP Start/Stop Queue ...... 100 5.5 Vector Memory Units ...... 101 5.5.1 Unit-Stride Vector-Loads ...... 102 5.5.2 Segment-Strided Vector-Loads ...... 104 5.5.3 Unit-Stride Vector-Stores ...... 105 5.5.4 Segment-Strided Vector-Stores ...... 105 5.5.5 Refill/Access Decoupling ...... 106 5.6 Control processor ...... 107

8 5.6.1 L0 Instruction Buffer ...... 108 5.7 Memory System ...... 109 5.7.1 Non-blocking Cache Misses ...... 110 5.7.2 Cache Arbitration ...... 111 5.7.3 Cache Address Path ...... 111 5.7.4 Cache Data Path ...... 112 5.7.5 Segment Accesses ...... 113

6 Scale VT Processor Implementation 115 6.1 Chip Implementation ...... 115 6.1.1 RTL Development ...... 115 6.1.2 Pre-Placement ...... 116 6.1.3 Memory Arrays ...... 117 6.1.4 Clocking ...... 118 6.1.5 Clock Distribution ...... 118 6.1.6 Power Distribution ...... 119 6.1.7 System Interface ...... 120 6.2 Chip Verification ...... 123 6.2.1 RTL Verification ...... 123 6.2.2 Netlist Verification ...... 123 6.2.3 Layout Verification ...... 124 6.3 Chip Results and Measurements ...... 124 6.3.1 Area ...... 126 6.3.2 Frequency ...... 129 6.3.3 Power ...... 132 6.4 Implementation Summary ...... 134

7 Scale VT Evaluation 135 7.1 Programming and Simulation Methodology ...... 135 7.2 Benchmarks and Results ...... 136 7.3 Mapping Parallelism to Scale ...... 139 7.3.1 Data-Parallel Loops with No Control Flow ...... 140 7.3.2 Data-Parallel Loops with Conditionals ...... 145 7.3.3 Data-Parallel Loops with Inner-Loops ...... 146 7.3.4 Loops with Cross-Iteration Dependencies ...... 147 7.3.5 Reductions and Data-Dependent Loop Exit Conditions ...... 150 7.3.6 Free-Running Threads ...... 150 7.3.7 Mixed Parallelism ...... 152 7.4 Locality and Efficiency ...... 152 7.4.1 VP Configuration ...... 152 7.4.2 Control Hierarchy ...... 154 7.4.3 Data Hierarchy ...... 154 7.5 Applications ...... 155 7.6 Measured Chip Results ...... 157

9 8 Scale Memory System Evaluation 161 8.1 Driving High-Bandwidth Memory Systems ...... 161 8.2 Basic Access/Execute Decoupling ...... 162 8.3 Refill/Access Decoupling ...... 163 8.4 Memory Access Benchmarks ...... 164 8.5 Reserved Element Buffering and Access Management State ...... 166 8.6 Refill Distance Throttling ...... 169 8.7 Memory Latency Scaling ...... 170 8.8 Refill/Access Decoupling Summary ...... 171

9 Conclusion 173 9.1 Overall Analysis and Lessons Learned ...... 174 9.2 Future Work ...... 179 9.3 Thesis Contributions ...... 180

Bibliography 181

List of Figures

2-1 Programmer’s model for a generic RISC architecture ...... 25 2-2 RISC archetype ...... 25 2-3 RGB-to-YCC color conversion kernel in C ...... 25 2-4 RGB-to-YCC in generic RISC assembly code ...... 26 2-5 RGB-to-YCC pipelined execution on a RISC machine ...... 26 2-6 Programmer’s model for a generic SMP architecture ...... 27 2-7 SMP archetype ...... 27 2-8 Programmer’s model for a generic vector architecture (virtual processor view) ... 28 2-9 RGB-to-YCC in generic vector assembly code...... 29 2-10 Vector archetype ...... 30 2-11 RGB-to-YCC vector execution ...... 31 2-12 Vector archetype with two arithmetic units plus independent load and store ports. . 32 2-13 Vector execution with chaining ...... 32 2-14 Programmer’s model for a generic vector-thread architecture ...... 33 2-15 Example of vector code mapped to the VT architecture ...... 34 2-16 Example of vector-threaded code mapped to the VT architecture ...... 35 2-17 Example of cross-VP data transfers on the VT architecture ...... 35 2-18 Vector-thread archetype ...... 36 2-19 Time-multiplexing of VPs in a VT machine...... 36 2-20 VP configuration with private and shared registers ...... 38 2-21 RGB-to-YCC in generic vector-thread assembly code ...... 41 2-22 RGB-to-YCC vector-thread execution ...... 42 2-23 Linked list search example in C ...... 43 2-24 Linked list search example in generic vector-thread assembly code ...... 43

3-1 Loop categorization...... 46

4-1 Programmer’s model for the Scale VT architecture...... 56 4-2 Scale AIB for RGB-to-YCC ...... 62 4-3 VLIW encoding for RGB-to-YCC ...... 62 4-4 Vector-vector add example in C and Scale assembly code...... 63 4-5 VP thread program example...... 64 4-6 Decoder example in C code and Scale code...... 65 4-7 Example showing the need for cross-VP buffering ...... 67 4-8 Reduction example ...... 67

5-1 Scale microarchitecture overview ...... 74 5-2 Instruction organization overview ...... 75

11 5-3 Micro-op bundle encoding ...... 76 5-4 for converting AIBs from VP instructions to micro-ops...... 76 5-5 Cluster micro-op bundles for example AIBs ...... 78 5-6 Binary instruction packet encoding ...... 80 5-7 Binary instruction packets for example AIBs ...... 81 5-8 AIB fill unit...... 82 5-9 Vector-thread unit commands ...... 84 5-10 Lane command management unit and cluster AIB caches ...... 86 5-11 Execute Directive Status Queue...... 88 5-12 Cluster execute directives...... 89 5-13 Generic cluster microarchitecture...... 90 5-14 Cluster ...... 91 5-15 Execute directive chaining ...... 96 5-16 Cluster 0 and store-data cluster microarchitecture ...... 98 5-17 Cross-VP Start/Stop Queue ...... 101 5-18 Vector memory unit commands...... 102 5-19 Vector load and store units microarchitecture ...... 103 5-20 Unit-stride tag table ...... 104 5-21 Vector refill unit...... 106 5-22 Control processor pipeline ...... 107 5-23 Cache microarchitecture...... 109 5-24 Cache access pipeline...... 110 5-25 Read crossbar ...... 112 5-26 Write crossbar ...... 113

6-1 Datapath pre-placement code example...... 117 6-2 Chip clock skew ...... 119 6-3 Chip power grid ...... 119 6-4 Plot and corresponding photomicrograph showing chip power grid and routing ... 120 6-5 Plots and corresponding photomicrographs of chip metal layers ...... 121 6-6 Chip filler cells ...... 122 6-7 Block diagram of Scale test infrastructure...... 122 6-8 Final Scale chip plot ...... 125 6-9 Cluster plot ...... 126 6-10 Scale die photos ...... 127 6-11 Shmoo plot for Scale chip ...... 130 6-12 Maximum Scale chip frequency versus supply voltage for on-chip RAM mode and cache mode ...... 131 6-13 Test program for measuring control processor power consumption ...... 132 6-14 Chip power consumption versus supply voltage ...... 132 6-15 Chip power consumption versus frequency ...... 133 6-16 Chip energy versus frequency for various supply voltages ...... 133

7-1 Execution of rgbycc kernel on Scale ...... 140 7-2 Diagram of decoupled execution for rgbycc ...... 141 7-3 Inputs and outputs for the hpg benchmark...... 142 7-4 Diagram of hpg mapping ...... 143 7-5 Scale code for hpg ...... 144

12 7-6 Inputs and outputs for the dither benchmark...... 145 7-7 VP code for the lookup benchmark...... 146 7-8 Performance of lookup ...... 147 7-9 Ideal execution of the adpcm dec benchmark ...... 149 7-10 Execution of the adpcm dec benchmark ...... 149 7-11 VP code for acquiring a lock ...... 150 7-12 VP function for string comparison ...... 151 7-13 JPEG encoder performance ...... 156 7-14 IEEE 802.11a transmitter performance ...... 156

8-1 Queuing resources required in basic DVM ...... 163 8-2 Queuing resources required in Scale ...... 163 8-3 Performance scaling with increasing reserved element buffering and access man- agement resources ...... 167 8-4 Cache replacement policy interaction with refill distance throttling ...... 170 8-5 Performance scaling with increasing memory latency ...... 170

9-1 Scale project timeline ...... 175

List of Tables

4.1 Basic VTU commands...... 57 4.2 Vector load and store commands...... 58 4.3 VP instruction opcodes...... 60

5.1 Scale compute-op opcodes ...... 77 5.2 Binary encoding of Scale register names ...... 77 5.3 Scale VRI to PRI mapping ...... 92 5.4 Send, receive, and kill control signals for synchronizing inter-cluster data transport operations ...... 95

6.1 Scale chip statistics...... 124 6.2 Scale area breakdown...... 128 6.3 Cluster area breakdown...... 128

7.1 Default parameters for Scale simulations...... 136 7.2 Performance and mapping characterization for EEMBC benchmarks ...... 137 7.3 Performance and mapping characterization for miscellaneous benchmarks ..... 138 7.4 XBS benchmarks ...... 138 7.5 Performance comparison of branching and predication...... 145 7.6 VP configuration, control hierarchy, and data hierarchy statistics for Scale benchmarks153 7.7 Measured Scale chip results for EEMBC benchmarks ...... 158 7.8 Measured Scale chip results for XBS benchmarks ...... 158

8.1 Refill/Access decoupling benchmark characterization ...... 165 8.2 Performance for various refill distance throttling mechanisms ...... 169 8.3 Resource configuration parameters with scaling memory latency ...... 171

Chapter 1

Introduction

Computer architecture is entering a new era of all-purpose computing, driven by the convergence of general-purpose and embedded processing.

General-purpose computers have historically been designed to support large and extensible software systems. Over the past 60 years, these machines have steadily grown smaller, faster, and cheaper. They have migrated from special rooms where they were operated by a dedicated staff, to portable laptops and handhelds. A single-chip processor today has more computing power than the fastest systems of only 15–20 years ago. Throughout this evolution, the defining feature of these machines has been their general-purpose programmability. Computer architecture research and development has primarily targeted the performance of general-purpose computers, but power limitations are rapidly shifting the focus to efficiency.

Embedded computing devices have historically performed a single fixed function throughout the lifetime of the product. With the advent of application-specific integrated circuits (ASICs), customized chips that are dedicated to a single task, the past 25 years has seen special-purpose hardware proliferate and grow increasingly powerful. Computing is now embedded in myriad devices such as cell-phones, televisions, cars, robots, and telecommunications equipment. Embedded computing has always focused on minimizing the consumption of system resources, but chip development costs and sophisticated applications are placing a new premium on programmability.

Today’s applications demand extremely high-throughput information processing in a power-constrained environment. Yet even while maximizing performance and efficiency, devices must support a wide range of different types of computation on a unified programmable substrate. This thesis presents a new all-purpose architecture that provides high efficiency across a wide range of computing applications. The vector-thread architecture (VT) merges vector and threaded execution with a vector of virtual processors that can each follow their own thread of control. VT can parallelize more codes than a traditional vector architecture, and it can amortize overheads more efficiently than a traditional multithreaded architecture.

Thesis Contributions

This thesis presents a top-to-bottom exploration of vector-thread architectures:

• The vector-thread architectural paradigm is a performance-efficient solution for all-purpose computing. VT unifies the vector and multithreaded execution models, allowing software to compactly encode fine-grained loop-level parallelism and locality. VT implementations simultaneously exploit data-level, thread-level, and instruction-level parallelism to provide high performance while using locality to amortize overheads.

• The Scale VT architecture is an instantiation of the vector-thread paradigm targeted at low-power and high-performance embedded systems. Scale’s software interface and microarchitecture pin down all the details required to implement VT, and the design incorporates many novel mechanisms to integrate vector processing with fine-grained multithreading. This thesis provides a detailed evaluation based on a diverse selection of embedded benchmarks, demonstrating Scale’s ability to expose and exploit parallelism and locality across a wide range of real-world tasks.

• The Scale VT processor is a chip prototype fabricated in 180 nm technology. This implementation proves the feasibility of VT, and demonstrates the performance, power, and area efficiency of the Scale design.

1.1 The Demand for Flexible Performance-Efficient Computing

High-Throughput Information Processing. The miniaturization of increasingly powerful computers has driven a technological revolution based on the digitization of information. In turn, the digital world creates an ever growing demand for sophisticated high-throughput information processing. Consider just a few example devices that require flexible, performance-efficient computing:

• Over the past 10–15 years cell phones have evolved from simple voice communication devices to full-fledged multimedia computer systems. Modern “phones” provide voice, audio, image, and video processing, 3D graphics, and Internet access. They have high-resolution color displays and they support multiple high-capacity data storage mediums. These devices run operating systems with virtualization and they provide a general-purpose application programming environment.

• At one time the electronics in a television set simply projected an analog signal onto a CRT display. Today, televisions execute billions of compute operations per second to transform a highly compressed digital signal at 30 frames per second for displays with millions of pixels. Meanwhile, digital video recorders transcode multiple channels simultaneously for storage on hard disk drives. These set-top boxes connect to the Internet to retrieve programming information and run commodity operating systems to manage their computing resources.

• Modern automobiles embed many processors to replace mechanical components and provide new and improved capabilities. Computer-controlled engines improve efficiency and reduce emissions, while computer-controlled steering and braking systems improve safety. Many cars today provide cellular telephone service, speech recognition systems, and digital audio and video entertainment systems. Global positioning system (GPS) receivers track location, and on-board computers provide maps and route planning. Recent advances include early-warning crash detection systems and robotic navigation systems built on extremely sophisticated processing of data from sonar sensors, radar sensors, laser scanners, and video cameras [SLB06].

• The Internet backbone is built on packet routers with extremely high data rates. The network processors in the router nodes examine incoming packets to determine destination ports and, in some cases, priorities. Higher-level processing in the routers handles packets which require special treatment. Routers may also scan packet data to detect security threats or perform other content-dependent processing. Programmability allows network processors to adapt to changing requirements, but the high data rates present a challenge for software processing.

• While increasingly powerful computing is being embedded in every-day consumer devices, conventional desktop and laptop computers are evolving to provide many of the same capabilities. Voice over Internet Protocol (VoIP) technology allows a computer to function as a telephone, and a proliferation of video over IP allows computers to function as TVs.

The Power Problem. All high-performance computing devices face stringent power constraints, regardless of whether they are portable or plugged into the power grid. This drives the need for efficient computing architectures which exploit parallelism and amortize overheads.

For battery-powered mobile platforms, energy consumption determines the lifetime between charges. Thus, the capabilities which a device is able to provide directly depend on the power usage, and the size and weight of devices depend on the associated charge storage requirements. For infrastructure processors that are connected to the power grid, energy consumption is a growing concern due to high electricity costs, especially for large data processing centers [FWB07]. Furthermore, power draw often limits a chip’s maximum operating frequency. If a chip runs too fast, its cooling systems will not be able to keep up and dissipate the generated heat.

1.2 Application-Specific Devices and Heterogeneous Systems

With the growing demand for high-throughput computing in a power-constrained environment, application-specific custom solutions are an obvious choice. However, application complexity and economic forces in the chip design industry are driving current designs towards more flexible programmable processors. For very demanding applications, designers must construct heterogeneous systems with different types of computing cores.

ASICs. High-performance embedded systems have traditionally used ASICs to satisfy their demanding computing requirements with efficient custom solutions. However, increasing mask and development costs for advanced fabrication technologies have made application-specific design infeasible for all but the highest volume markets [Cha07]. With costs for next-generation chip designs reaching $30 million, the number of new ASIC design starts per year has fallen by approximately two thirds over the past 10 years [LaP07]. Furthermore, even for high-volume applications, hard-wired circuits are often precluded by the growing need for flexibility and programmability.

Domain-Specific Processors. ASICs are being replaced by domain-specific processors—also known as application specific standard products (ASSPs)—which target wider markets [Ful05]. Example ASSPs include digital signal processors (DSPs), video and audio processors, and network processors. The higher volumes for these multi-purpose chips justify the large design effort required to build highly tuned implementations, making it even harder for application-specific designs to compete. Plus, the availability of existing domain-specific processors is yet another reason to avoid the long development times required for custom chips. Many modern systems, however, have computing needs that span multiple domains. Domain-specific processors are, by definition, less efficient at tasks outside their domain. This requires systems to include multiple types of domain-specific processors, which increases complexity.

FPGAs. Field-programmable gate arrays provide an alternative computing solution between fixed-function hardware and programmable software implementations. When an application does not map well to existing programmable parts, FPGAs can provide high levels of parallelism and performance at low cost. FPGAs are programmed using hardware design languages and the code is mapped using CAD tools, which is considerably more difficult than software programming. FPGAs and general-purpose processors can be integrated in various ways to improve performance and programmability. They can be placed together on the same board [THGM05] or the same chip [HW97, Xil05], or a programmable fabric can be integrated with a processor pipeline as a configurable functional unit [YMHB00, Str, Bar04].

Heterogeneous Systems. Full embedded systems with complex workloads and a variety of processing tasks often employ a heterogeneous mix of processing units, including ASICs, FPGAs, ASSPs, and programmable general-purpose processors. The resulting devices contain various computing cores with different instruction sets and with different parallel execution and synchronization models. Memory is typically partitioned, and communication between units goes through various point-to-point or bus-based channels. The lack of a common programming model and the lack of shared memory make these distributed systems difficult to program. They are also inefficient when the application workload causes load imbalance across the heterogeneous cores.

1.3 The Emergence of Processor Arrays

Processor arrays are emerging as a computing solution across a wide variety of embedded and general-purpose designs. Compared to application-specific solutions, programmable processor cores make a chip more flexible at executing a wider range of applications. They can also improve efficiency by allowing the same silicon area to be reused for different compute tasks depending on the needs of an application as it runs. Compared to heterogeneous designs, homogeneous processor arrays simplify programming and enable more portable and scalable software. Shared memory and a unified synchronization model can further simplify programming complexity compared to designs with ad-hoc communication mechanisms.

The move towards processor arrays is also driven by physical trends in chip design. With device counts in modern chips surpassing 1 billion transistors, the logical complexity of monolithic designs has become prohibitive. The advent of gigahertz-level clock frequencies has made wire delay a significant factor, with signals taking several cycles to cross a chip [TKM+02]. Increasing chip integration coupled with limited supply voltage scaling and leaky transistors has made power a primary design consideration. Re-use helps to reduce complexity, and partitioning the design lessens the impact of wire delay and reduces energy consumption.

In the general-purpose computing arena, industry has moved towards multi-core chips as the new way to extract performance from ever-increasing transistor counts. General-purpose superscalar processors have microarchitectural structures which scale poorly as clock frequencies increase [AHKB00]. Furthermore, even when clock rates remain unchanged, these processors must consume large amounts of silicon area and power to achieve modest improvements in instruction-level parallelism [ONH+96]. The thread-level parallelism exploited by multi-core superscalars has emerged as a more feasible way to scale performance [SACD04]. Current designs include AMD’s quad-core Opteron [DSC+07], IBM’s dual-core POWER6 [FMJ+07], and Intel’s dual-core Xeon processor [RTM+07]. Server applications generally have higher levels of thread parallelism, and Sun’s Niagara 2 chip [McG06] uses 8 cores that can each run 8 threads simultaneously.

Embedded designs are using even larger-scale processor arrays. To meet the high-throughput computing needs of games machines and other embedded applications, IBM’s Cell chip includes a general-purpose processor and an array of 8 SIMD processing engines [GHF+06]. Graphics processing units (GPUs) are transitioning from fixed-function chips to increasingly flexible computing substrates. For example, NVidia’s GeForce 8 merges the previously hard-wired pixel and vertex functionality into a homogeneous array of 128 programmable processors [NVI06]. Other recent embedded processor chips include Cavium Networks’ 16-core Octeon [YBC+06], Cisco’s network processor with 188 Tensilica Xtensa cores [Eat05], Ambric’s 360-processor AM2045 [Hal06a], PicoChip’s 430-processor PC101 DSP array [DPT03], and Connex Technology’s video processing chips with up to 4096 processing elements [Hal06b].

Even FPGAs are evolving in the direction of programmable processor arrays. Current Xilinx parts include hundreds of DSP functional units embedded in the gate array [Xil07], and MathStar’s field programmable object array (FPOA) uses hundreds of arithmetic units and register files as configurable building blocks [Hal06c].

1.4 An Efficient Core for All-Purpose Computing

The growing performance and efficiency requirements of today’s computing applications and the increasing development costs for modern chips drive the need for all-purpose programmable architectures. Recently, high-bandwidth information processing demands and complexity limitations have led designers to build chips structured as processor arrays. In these designs, a homogeneous programming model with uniform communication and synchronization mechanisms can ease the programming burden. From the perspective of this thesis, broad classes of applications will converge around a small number of programmable computing architectures. Future chips will be constructed as arrays of performance-efficient flexible cores.

The processor architectures widely used today will not necessarily meet the unique demands of future embedded cores. As a replacement for special-purpose hardware, programmable processors will have to attain very high levels of performance and efficiency. Given the local instruction and data caches and on-chip network connections which a core will require, a straightforward scalar RISC processor probably does not provide enough compute bandwidth to amortize these overheads. Additionally, although they are simple, scalar processors are not as efficient as possible since they do not fully exploit application parallelism and locality. Furthermore, very small cores can make the software thread parallelization challenge more difficult.

Scaling conventional general-purpose cores is also not likely to lead to performance-efficient solutions for embedded processing. Sequential architectures provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that track dependencies or that provide arbitrary global communication. For example, Intel’s recent Penryn processor fits two superscalar cores in about 44 mm² in 45 nm technology [Int07]. In comparison, the Scale VT processor described in this thesis can execute 16 operations per cycle and support up to 128 simultaneously active threads. In 45 nm technology Scale would only be about 1 mm², allowing over 40 cores to fit within the same area as Penryn’s two superscalar cores. Although current general-purpose architectures devote large amounts of silicon area to improve single-thread performance, compute density will become increasingly important for future processor arrays.

Ideally, a single all-purpose programmable core will efficiently exploit the different types of parallelism and locality present in high-throughput information processing applications. These applications often contain abundant structured parallelism, where dependencies can be determined statically. Architectures that expose more parallelism allow implementations to achieve high performance with efficient microarchitectural structures. Similarly, architectures that expose locality allow implementations to isolate data communication and amortize control overheads.
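The 1 mm² figure follows from a rough first-order feature-size scaling of the 16.6 mm² core area reported in the abstract; this back-of-the-envelope step is added here for clarity, it is not the thesis’s own derivation, and it ignores non-ideal effects such as SRAM and wire scaling:

    A(45 nm) ≈ A(180 nm) × (45/180)² = 16.6 mm² × 1/16 ≈ 1.0 mm²
    44 mm² / 1.0 mm² ≈ 44, i.e. roughly 40+ Scale-sized cores in the area of Penryn’s two cores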

1.5 Thesis Overview

This thesis begins by proposing the vector-thread architectural paradigm as a solution for all-purpose computing. Chapter 2 describes how VT architectures unify the vector and multithreaded execution models to flexibly and compactly encode large amounts of structured parallelism and locality. VT is primarily targeted at efficiently exploiting all types of loop-level parallelism, and Chapter 3 compares VT to other architectures.

This thesis then presents the Scale architecture, an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Chapter 4 presents the software interface for Scale and discusses alternatives and implications for VT instruction sets. VT machines use many novel mechanisms to integrate vector processing with fine-grained multithreading and to amortize the control and data overheads involved. Chapter 5 details a complete microarchitecture of a Scale implementation and discusses some of the design decisions.

The third major component of this thesis is a description of the prototype Scale VT processor implementation. Chapter 6 provides an overview of the chip design and verification flow which allowed us to implement Scale with relatively low design effort. The Scale chip has been fabricated in 180 nm technology, and the chapter presents area, clock frequency, and power results.

The final component of this thesis is a detailed evaluation of the Scale vector-thread architecture presented in Chapters 7 and 8. To evaluate the performance and flexibility of Scale, we have mapped a diverse selection of embedded benchmarks from various suites. These include example kernels for image processing, audio processing, cryptography, network processing, and wireless communication. Larger applications including a JPEG image encoder and an IEEE 802.11a wireless transmitter have also been programmed on Scale. The chapter discusses how various forms of application parallelism are mapped to the Scale VT architecture, and it presents both simulation results and hardware measurements. The chapter also analyzes the locality and efficiency properties of the Scale architecture and the benchmark mappings. Chapter 8 evaluates the performance and efficiency of Scale’s memory system, with a focus on cache refill/access decoupling.

1.6 Collaboration, Previous Publications, and Funding

This thesis describes work that was performed as part of a collaborative group project. Other people have made direct contributions to the ideas and results that are included in this document. Christopher Batten played a large role in the development of the Scale architecture. He was also responsible for the Scale cache, including both the simulator and the chip, and he worked on the datapath tiler used for constructing the chip. Mark Hampton was responsible for the Scale compiler tool-chain. He also implemented the VTorture chip testing tool. Steve Gerding designed the XBS benchmark suite. Jared Casper designed a controller for the test baseboard and worked on an initial implementation of the cache design. The benchmarks in this thesis were mapped to Scale by Steve, Jared, Chris, Mark, Krste, and myself. Albert Ma designed the VCO for the clock on the Scale chip and he helped to develop the CAD tool-flow. Asif Khan implemented the DRAM controller on the test baseboard. Jaime Quinonez helped to implement the datapath tiler. Brian Pharris wrote the DRAM model used in the Scale simulator. Jeff Cohen implemented an initial version of VTorture. Ken Barr wrote the host driver which communicates with the test baseboard. Finally, Krste Asanović was integral in all aspects of the project.

Some of the content in this thesis is adapted from previous publications, including: “The Vector-Thread Architecture” from ISCA, 2004 [KBH+04a], “The Vector-Thread Architecture” from IEEE Micro, 2004 [KBH+04b], and “Cache Refill/Access Decoupling for Vector Machines” from MICRO, 2004 [BKGA04].

This work was funded in part by DARPA PAC/C award F30602-00-2-0562, NSF CAREER award CCR-0093354, and donations from Infineon Corporation and Intel Corporation.

Chapter 2

The Vector-Thread Architectural Paradigm

In this chapter we introduce the vector-thread architectural paradigm. VT supports a seamless intermingling of vector and multithreaded computation to allow software to flexibly and compactly encode application parallelism and locality. VT implementations exploit this encoding to provide high performance while amortizing instruction and data overheads. VT builds on aspects of RISC, vector, and multiprocessor architectures, and we first describe these paradigms in order to set the stage for VT. We then introduce the vector-thread paradigm, including the programmer’s model and the physical model for VT machines, as well as the execution semantics for vector-thread programs. VT is an abstract class of computer architectures, and in this chapter we also describe several possible extensions to the baseline design. Finally, we provide an overview of vector-threading software for VT architectures.

2.1 Defining Architectural Paradigms

A vast gulf separates algorithms—theoretical expressions of computation—from the transistors, wires, and electrons in computing hardware. Multiple layers of software and hardware abstractions bridge this gap and these can be categorized into a few computing and architectural paradigms. This section defines the relationship between these abstractions.

A computing paradigm describes the structure and organization of control flow and data movement in a computer system. Von Neumann computing is the most common paradigm in use today [vN93]. It includes a uniform global memory, and programs are constructed as an ordered sequence of operations directed by imperative control constructs. Multithreaded programming is an extension to von Neumann computing where multiple threads of control execute programs simultaneously. In the shared memory variation, all of the threads operate concurrently and communicate through a shared global memory. In the message-passing variation, the threads operate on independent memories and communicate with each other by sending explicit messages. Further afield from von Neumann computing, the dataflow paradigm has no global memory and no explicit control flow. A dataflow program consists of a set of interconnected operations which execute whenever their input operands are available. A related model is the stream computing paradigm where data implicitly flows through interconnected computational kernels.

Programming languages allow people to precisely encode algorithms at a high level of abstraction. They generally follow a particular computing paradigm. For example, FORTRAN, C, C++, and Java are all von Neumann programming languages. Multithreaded computing paradigms are often targeted by programming libraries rather than built-in language constructs. Examples include POSIX threads [IEE95] for shared memory multithreading, and MPI [For94] for multithreading with message passing.

An instruction set architecture (ISA) defines a low-level hardware/software interface, usually in the form of machine instructions. For example, MIPS-IV is an ISA which is specified by around 300 pages of documentation [Pri95]. Compilers map high-level programming languages to machine code for a specific ISA, or assembly programmers may write the code directly. Generally, many different machines implement a particular ISA. The MIPS R10000 [Yea96] is an example which implements the MIPS-IV ISA.

Broader than any specific instruction set architecture, an architectural paradigm includes a programmer’s model for a class of machines together with the expected structure of implementations of these machines. The programming model is an abstract low-level software interface that gives programmers an idea of how code executes on the machines. There is also usually an archetype for the paradigm, a canonical implementation which shows the general structure of the category of machines. Although specific instruction set architectures and machine implementations are often hybrids or variations of architectural paradigms, the abstract classification is still important. It serves as a useful model for programmers, languages, and compilers, and it enables a generalized discussion of the performance, efficiency, and programmability merits of different architectures.

Architectural paradigms are usually designed to efficiently support specific computing paradigms, or sometimes vice-versa. But in general, any computing paradigm or language can map to any architectural paradigm. Popular architectural paradigms for von Neumann computing include RISC (reduced instruction set computing), superscalar, VLIW (very long instruction word), subword-SIMD (single instruction multiple data), and vector. Popular architectural paradigms for shared memory multithreading include symmetric multiprocessors (SMPs) and simultaneous multithreading (SMT). The following sections describe several of these architectural paradigms in more detail.

2.2 The RISC Architectural Paradigm

RISC is an architectural paradigm which supports von Neumann computing and provides a simple instruction interface to enable high-speed hardware implementations. The programmer’s model for a RISC architecture, diagrammed in Figure 2-1, includes a generic set of registers and a set of instructions which operate on those registers. For example, an add instruction takes two source registers as operands and writes its result to a destination register. Load and store instructions use registers to compute an address and transfer data between main memory and the register file. A program for a RISC architecture is organized as a sequence of instructions, and a RISC machine has a which contains the address of the current instruction. By default, the machine implicitly increments the program counter to step through the instruction sequence in the program. Branch and jump instructions direct the program control flow by providing a new address for the program counter. Conditional branches based on the values of registers allow programs to have data-dependent control flow. The RISC archetype (Figure 2-2) includes a register file and an ALU (). It also has instruction and data caches which provide fast access to recently used data, backed up by a large main memory. As instructions execute, they progress through 5 stages: fetch, decode, execute, memory access, and writeback. A pipelined implementation organizes each stage as a single clock cycle, enabling a fast clock frequency and allowing the execution of multiple instructions to be overlapped. When an instruction writes a register that is read by the following instruction, the result

24 PC

IR Processor Instr Cache PC

regs ALU

RISC ops

Data Cache

Memory Main Memory

Figure 2-1: Programmer’s model for a generic RISC Figure 2-2: RISC archetype. The pipelined proces- architecture. The processor includes a set of a reg- sor includes a register file and an ALU. Instruction isters and a program counter. It executes simple in- and data caches provide fast access to memory. structions and connects to a uniform memory. may be bypassed in the pipeline before it is written to the register file. Since conditional branch instructions do not resolve until the decode or execute stage of the pipeline, a pipelined machine may predict which instruction to fetch each cycle. When mispredictions are detected, the incorrect instructions are flushed from the pipeline and the program counter is set to the correct value. The simplest scheme predicts sequential control flow, but more complex schemes can predict taken jumps and branches. Figure 2-4 shows an example of RISC assembly code for the RGB to YCbCr color conversion kernel shown in C in Figure 2-3. This kernel converts an input image from an interleaved RGB (red- green-blue) format to partitioned luminance (Y) and chrominance components (Cb, Cr), a format used in image and video processing applications. Figure 2-5 shows a simple overview of how the code executes on a pipelined RISC machine. void RGBtoYCbCr (uint8_t* in_ptr, uint8_t* outy_ptr, uint8_t* outcb_ptr, uint8_t* outcr_ptr, int n) { for (int i=0; i> 16); *outcb_ptr++ = (((r*CONST_CB_R)+(g*CONST_CB_G)+(b*CONST_CB_B)+CONST_C_RND) >> 16); *outcr_ptr++ = (((r*CONST_CR_R)+(g*CONST_CR_G)+(b*CONST_CR_B)+CONST_C_RND) >> 16); } }

Figure 2-3: RGB-to-YCC color conversion kernel in C. The code uses 8-bit fixed-point datatypes.

RGBtoYCbCr:
loop:
    lbu  r1, 0(rin)            # load R
    lbu  r2, 1(rin)            # load G
    lbu  r3, 2(rin)            # load B
    mul  r4, r1, CONST_Y_R     # R * Y_R
    mul  r5, r2, CONST_Y_G     # G * Y_G
    add  r6, r4, r5            # accumulate Y
    mul  r4, r3, CONST_Y_B     # B * Y_B
    add  r6, r6, r4            # accumulate Y
    add  r6, r6, CONST_Y_RND   # round Y
    srl  r6, r6, 16            # scale Y
    sb   r6, 0(routy)          # store Y
    mul  r4, r1, CONST_CB_R    # R * CB_R
    mul  r5, r2, CONST_CB_G    # G * CB_G
    add  r6, r4, r5            # accumulate CB
    mul  r4, r3, CONST_CB_B    # B * CB_B
    add  r6, r6, r4            # accumulate CB
    add  r6, r6, CONST_C_RND   # round CB
    srl  r6, r6, 16            # scale CB
    sb   r6, 0(routcb)         # store CB
    mul  r4, r1, CONST_CR_R    # R * CR_R
    mul  r5, r2, CONST_CR_G    # G * CR_G
    add  r6, r4, r5            # accumulate CR
    mul  r4, r3, CONST_CR_B    # B * CR_B
    add  r6, r6, r4            # accumulate CR
    add  r6, r6, CONST_C_RND   # round CR
    srl  r6, r6, 16            # scale CR
    sb   r6, 0(routcr)         # store CR
    add  rin, rin, 3           # incr rin
    add  routy, routy, 1       # incr routy
    add  routcb, routcb, 1     # incr routcb
    add  routcr, routcr, 1     # incr routcr
    sub  n, n, 1               # decr n
    bnez n, loop               # loop
    jr   ra                    # func. return

Figure 2-4: RGB-to-YCC in generic RISC assembly code. For clarity, some registers are referred to using names instead of numbers (e.g. rin, routy, n).

             F    D    X    M    W
cycle 1:    lbu
cycle 2:    lbu  lbu
cycle 3:    lbu  lbu  lbu
cycle 4:    mul  lbu  lbu  lbu
cycle 5:    mul  mul  lbu  lbu  lbu
cycle 6:    add  mul  mul  lbu  lbu
cycle 7:    mul  add  mul  mul  lbu
cycle 8:    add  mul  add  mul  mul
cycle 9:    add  add  mul  add  mul
[ ... ]

Figure 2-5: RGB-to-YCC pipelined execution on a RISC machine. Each horizontal line corresponds to a single clock cycle, and the columns show which instruction is in each stage of the pipeline.

2.3 The Symmetric Multiprocessor Architectural Paradigm

The SMP architectural paradigm uses thread parallelism to increase processing performance compared to a single-threaded machine. SMP architectures provide a set of processor nodes with symmetric access to a global shared memory. The individual nodes can use any architecture, e.g. RISC in the example shown in Figure 2-6. SMPs support a multithreaded computing paradigm. The programmer’s model consists of a set of independent threads which execute in the same memory space, and threads can use regular loads and stores to share data. Synchronization between threads is an important concern, and an SMP architecture usually provides atomic read-modify-write memory operations to enable software-based synchronization primitives such as locks and semaphores.

In an SMP machine, such as the archetype shown in Figure 2-7, processors have independent caches, and memory is often distributed across the processing nodes. To support the shared memory model where stores by one processor should become visible to loads of the same location by other processors, SMPs must implement cache-coherence algorithms. A baseline algorithm, MSI, tracks the state of cache lines in each node as modified (M), shared (S), or invalid (I). Different nodes can read the same memory locations, and they acquire the associated lines in the shared state. However, a node can only write to a memory location if it has the cache line in the modified state. Only one node at a time is allowed to have a line in this state, so a node may need to invalidate the line in other nodes’ caches before it can complete a write. This can be accomplished with a snoopy bus protocol in which each node observes the memory accesses of all other nodes and responds appropriately. Or, a directory may be used to track the state of cache lines in the system, with nodes sending requests to each other to change the states of lines.
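As an aside (not from the thesis), the kind of lock primitive mentioned above can be built directly on such an atomic read-modify-write operation. The following C11 sketch uses a test-and-set flag; the names spinlock_t, spin_lock, and spin_unlock are chosen here purely for illustration:

    #include <stdatomic.h>

    typedef struct { atomic_flag held; } spinlock_t;   /* initialize with ATOMIC_FLAG_INIT */

    static void spin_lock(spinlock_t *l) {
        /* atomic test-and-set: a single read-modify-write memory operation */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;  /* spin until the current holder releases the lock */
    }

    static void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }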

Figure 2-6: Programmer’s model for a generic SMP architecture. In this example a set of 4 RISC processors share a common memory.


Figure 2-7: SMP archetype. In this example, four processing nodes each include a RISC processor and a slice of the system memory. A network connects the nodes together and serves as a conduit for messages.

Figure 2-8: Programmer’s model for a generic vector architecture (virtual processor view). A vector of virtual processors each contain registers and have the ability to execute basic operations. A RISC processor with support for vector commands controls the VPs en masse.

2.4 The Vector Architectural Paradigm

The vector architectural paradigm exploits data-level parallelism to enhance performance and efficiency. The programmer’s model for a vector architecture (Figure 2-8) extends a traditional RISC control processor with vector registers and vector instructions. Each vector register contains many elements, and vector instructions perform operations in parallel across these elements. An easy way to think about the resources in a vector machine is as a virtual processor vector [ZB91], where a virtual processor contains an element-wise subset of the vector registers. For example, a vector-vector-add:

    vvadd vr3, vr1, vr2

will compute an addition in each virtual processor:

    VP0.vr3 = VP0.vr1 + VP0.vr2
    VP1.vr3 = VP1.vr1 + VP1.vr2
    ...
    VPN.vr3 = VPN.vr1 + VPN.vr2

Vector loads and stores transfer data between memory and the vector registers, with each vector memory instruction loading or storing one element per VP. With unit-stride and strided vector memory instructions, the control processor provides the address information, for example:

    vlwst vr3, rbase, rstride   # rbase and rstride are CP registers
    VP0.vr3 = mem[rbase + 0×rstride]
    VP1.vr3 = mem[rbase + 1×rstride]
    ...
    VPN.vr3 = mem[rbase + N×rstride]

RGBtoYCbCr:
    mult   rinstripstr, vlmax, 3
    li     rstride, 24
stripmineloop:
    setvl  vlen, n               # (vl,vlen) = MIN(n,vlmax)
    vlbust vr1, rin, rstride     # load R
    add    r1, rin, 1
    vlbust vr2, r1, rstride      # load G
    add    r1, rin, 2
    vlbust vr3, r1, rstride      # load B
    vsmul  vr4, vr1, CONST_Y_R   # R * Y_R
    vsmul  vr5, vr2, CONST_Y_G   # G * Y_G
    vvadd  vr6, vr4, vr5         # accum. Y
    vsmul  vr4, vr3, CONST_Y_B   # B * Y_B
    vvadd  vr6, vr6, vr4         # accum. Y
    vsadd  vr6, vr6, CONST_Y_RND # round Y
    vssrl  vr6, vr6, 16          # scale Y
    vsb    vr6, routy            # store Y
    vsmul  vr4, vr1, CONST_CB_R  # R * CB_R
    vsmul  vr5, vr2, CONST_CB_G  # G * CB_G
    vvadd  vr6, vr4, vr5         # accum. CB
    vsmul  vr4, vr3, CONST_CB_B  # B * CB_B
    vvadd  vr6, vr6, vr4         # accum. CB
    vsadd  vr6, vr6, CONST_C_RND # round CB
    vssrl  vr6, vr6, 16          # scale CB
    vsb    vr6, routcb           # store CB
    vsmul  vr4, vr1, CONST_CR_R  # R * CR_R
    vsmul  vr5, vr2, CONST_CR_G  # G * CR_G
    vvadd  vr6, vr4, vr5         # accum. CR
    vsmul  vr4, vr3, CONST_CR_B  # B * CR_B
    vvadd  vr6, vr6, vr4         # accum. CR
    vsadd  vr6, vr6, CONST_C_RND # round CR
    vssrl  vr6, vr6, 16          # scale CR
    vsb    vr6, routcr           # store CR
    add    rin, rin, rinstripstr # incr rin
    add    routy, routy, vlen    # incr routy
    add    routcb, routcb, vlen  # incr routcb
    add    routcr, routcr, vlen  # incr routcr
    sub    n, n, vlen            # decr n
    bnez   n, stripmineloop      # stripmine
    jr     ra                    # func. return

Figure 2-9: RGB-to-YCC in generic vector assembly code.
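As a point of reference, the per-VP semantics of the strided byte loads used above (vlbust in Figure 2-9) can be written as a plain scalar loop in C. This is an illustrative sketch added here, not code from the thesis; the names vlbust_sketch, vr, and vl are invented for the example:

    #include <stdint.h>
    #include <stddef.h>

    /* Scalar expansion of a strided unsigned-byte vector load
       (vlbust vr, rbase, rstride): element i of the destination
       vector register gets the byte at address rbase + i*rstride. */
    static void vlbust_sketch(uint32_t vr[], const uint8_t *rbase,
                              ptrdiff_t rstride, int vl) {
        for (int i = 0; i < vl; i++)
            vr[i] = rbase[i * rstride];   /* one element per virtual processor */
    }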

With indexed vector memory instructions, the addresses are taken from vector registers, for example:

    vlw vr3, vr1
    VP0.vr3 = mem[VP0.vr1]
    VP1.vr3 = mem[VP1.vr1]
    ...
    VPN.vr3 = mem[VPN.vr1]

The maximum vector length (vlmax) is the number of virtual processors (i.e. vector register elements) provided by the vector machine. The active vector length (vl) may be changed dynamically, typically by writing a control register. A vectorizable loop is executed using stripmining, where strips of vlmax iterations of the loop are executed at a time using vector instructions. Figure 2-9 shows the RGB-to-YCC example coded for a generic vector architecture.

The archetype for a vector machine (Figure 2-10) includes a vector unit with an array of processing lanes. These lanes hold the vector registers, and the VPs (vector elements) are striped across the lanes. The control processor executes the main instruction stream and queues vector instructions in a vector command queue. The vector unit processes a vector command by issuing element groups across all the lanes in parallel each cycle. For example, with 4 lanes and a vector length of 16, a vector-vector-add will issue over 4 cycles, first for VP0, VP1, VP2, and VP3; then for VP4, VP5, VP6, and VP7; etc. This is accomplished simply by incrementing a base register file index each cycle to advance to the next VP’s registers while keeping the ALU control signals fixed.

A vector unit may connect either to a data cache or directly to main memory. The vector machine includes vector load and store units (VLU, VSU) which process unit-stride and strided vector loads and stores to generate requests to the memory system. Unit-stride instructions allow a

single memory access (with one address) to fetch multiple elements together. To handle indexed accesses, each lane has an address port into the memory system, allowing several independent requests to be made each cycle. Strided accesses may be implemented using the lane ports for increased throughput.

Figure 2-11 shows how the RGB-to-YCC example executes on the vector archetype. The speedup compared to a RISC machine is approximately a factor of 4, i.e. the number of lanes which execute operations in parallel. In addition to the lane parallelism, the loop bookkeeping code is factored out and executed only once per stripmine iteration by the control processor. This includes incrementing the input and output pointers, tracking the iteration count, and executing the loop branch instruction. The control processor can execute these in the background while the lanes process vector instructions, further improving the speedup compared to a RISC machine.

Although the baseline vector archetype processes only one vector instruction at a time, most vector machines provide higher throughput. Figure 2-12 shows the vector archetype extended to include two arithmetic units plus independent load and store ports. Chaining in a multi-unit vector machine allows the execution of different vector instructions to overlap. At full throughput, in each cycle a lane can receive incoming load data, perform one or more arithmetic ops, and send out store data. Figure 2-13 shows an example of chaining on a multi-unit vector machine.

Figure 2-10: Vector archetype. The vector unit contains a parallel array of lanes with ALUs and register files. The virtual processors are striped across the lane array.
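To make the stripmining structure described above concrete, the following scalar C sketch mirrors a stripmined vector-vector add. It is illustrative only, not thesis code; VLMAX and vvadd_stripmined are names invented for this sketch:

    enum { VLMAX = 16 };   /* assumed maximum vector length (number of VPs) */

    void vvadd_stripmined(const int *a, const int *b, int *c, int n) {
        while (n > 0) {
            int vl = n < VLMAX ? n : VLMAX;   /* setvl: active vector length */
            for (int i = 0; i < vl; i++)      /* one strip of vector work */
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl;        /* control-processor bookkeeping */
            n -= vl;
        }
    }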

2.5 The Vector-Thread Architectural Paradigm

VT Programmer's Model

The vector-thread architectural paradigm is a hybrid combining the vector and multiprocessor models. A conventional control processor interacts with a vector of virtual processors, as shown in Figure 2-14. As in a traditional vector architecture, vector commands allow the control processor to direct the VPs en masse. However, VPs also have the ability to direct their own control flow, as in a multiprocessor architecture. As a result, VT can seamlessly intermingle data-level and thread-level parallelism at a very fine granularity.

Under the VT programmer's model, there are two interacting instruction sets, one for the control processor and one for the virtual processors.

Figure 2-11: RGB-to-YCC vector execution. Virtual processor numbers for the issuing element operations are indicated before the colon. This example assumes minimal decoupling between the control processor and the vector unit. Typically, a vector command queue would allow the control processor to run further ahead.

Figure 2-12: Vector archetype with two arithmetic units plus independent load and store ports.

Figure 2-13: Vector execution with chaining. The example assumes that every cycle each lane can receive load data, perform two arithmetic operations, and send store data. Virtual processor numbers for the issuing element operations are indicated before the colon. Only two lanes are shown.

Figure 2-14: Programmer’s model for a generic vector-thread architecture. As in a vector architecture, a control processor interacts with a virtual processor vector. The VPs execute instructions that are grouped into atomic instruction blocks (AIBs). AIBs can be sent to the VPs from a control processor vector-fetch com- mand, or a VP itself can issue a thread-fetch to request an AIB. The VPs are interconnected in a unidirectional ring through prevVP and nextVP ports.

A VP contains a set of registers and can execute basic RISC instructions like loads, stores, and ALU operations. These VP instructions are grouped into atomic instruction blocks (AIBs). There is no automatic program counter or implicit instruction fetch mechanism for VPs. Instead, all instruction blocks must be explicitly requested by either the control processor or the VP itself.

When executing purely vectorizable code, the virtual processors operate as slaves to the control processor. Instead of using individual vector instructions like vector-vector-add, the control processor uses vector-fetch commands to send AIBs to the virtual processor vector. The control processor also uses traditional vector-load and vector-store commands to transfer data between memory and the VP registers. Figure 2-15 shows a diagram of vector code mapped to the VT architecture. As in a vector machine, the virtual processors all execute the same element operations, but each operates on its own independent set of registers.

Multithreading in the VT architecture is enabled by giving the virtual processors the ability to direct their own control flow. A VP issues a thread-fetch to request an AIB to execute after it completes its active AIB, as shown in Figure 2-16. VP fetch instructions may be unconditional, or they may be predicated to provide conditional branching. A VP thread persists as long as each AIB contains an executed fetch instruction, but halts when an AIB does not issue a thread-fetch to bring in another AIB.

The VT programmer's model also connects VPs in a unidirectional ring topology and allows a sending instruction on VPn to transfer data directly to a receiving instruction on VPn+1. These cross-VP data transfers are dynamically scheduled and resolve when the data becomes available. As shown in Figure 2-17, a single vector-fetch command can introduce a chain of prevVP receives and nextVP sends that spans the virtual processor vector. The control processor can push an initial value into the cross-VP start/stop queue (shown in Figure 2-14) before executing the vector-fetch command. After the chain executes, the final cross-VP data value from the last VP wraps around and is written into this same queue. It can then be popped by the control processor or consumed by a subsequent prevVP receive on VP0 during stripmined loop execution.
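A loop with a simple cross-iteration dependence shows where the prevVP/nextVP mechanism applies. The C fragment below is a hypothetical example (not one of the thesis benchmarks): the running accumulator that each iteration passes to the next is exactly the kind of value that would be received from prevVP and sent to nextVP when one iteration is mapped to each VP.

    /* Running (prefix) sum: iteration i needs the accumulator produced by
     * iteration i-1.  Mapped to VT, each VP would receive the accumulator
     * from prevVP, add its own element, and send the result to nextVP; the
     * control processor seeds the chain through the cross-VP start/stop queue. */
    void running_sum(const int *in, int *out, int n) {
        int acc = 0;               /* initial value pushed into the start/stop queue */
        for (int i = 0; i < n; i++) {
            acc = acc + in[i];     /* value carried from iteration to iteration */
            out[i] = acc;
        }
    }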

VT Archetype

The VT archetype contains a conventional scalar unit for the control processor together with a vector-thread unit (VTU) that executes the VP code.

Figure 2-15: Example of vector code mapped to the VT architecture. Vector-load and vector-store commands read or write a register in each virtual processor. A vector-fetch sends a single AIB to each VP. An AIB is composed of simple virtual processor instructions which operate on each VP’s local registers, here only the opcode is shown.

To exploit the parallelism exposed by the VT architecture, the VTU contains a parallel array of processing lanes, as shown in Figure 2-18. As in a vector architecture, lanes are the physical processors onto which VPs are mapped. The physical register files in each lane are partitioned among the VPs, and the virtual processor vector is striped across the lane array. In contrast to traditional vector machines, the lanes in a VT machine execute decoupled from each other. Figure 2-19 shows an abstract view of how VP execution is time-multiplexed on the lanes for both vector-fetched and thread-fetched AIBs. This interleaving helps VT machines hide functional unit, memory, and thread-fetch latencies.

As shown in Figure 2-18, each lane contains both a command management unit (CMU) and an execution cluster. An execution cluster consists of a register file, functional units, and a small AIB cache. The CMU buffers commands from the control processor and holds pending thread-fetch commands from the lane's VPs. The CMU also holds the tags for the lane's AIB cache. The AIB cache can hold one or more AIBs and must be at least large enough to hold an AIB of the maximum size defined in the VT architecture. Typically the AIB cache is sized similarly to a small register file; for example, Scale's AIB cache holds 32 instructions per cluster.

The CMU chooses either a vector-fetch or thread-fetch command to process. The fetch command contains an AIB address which is looked up in the AIB cache tags. If there is a miss, a request is sent to the AIB fill unit which retrieves the requested AIB from the primary cache. The fill unit handles one lane's AIB miss at a time, except when lanes are executing vector-fetch commands, in which case the refill overhead is amortized by broadcasting the AIB to all lanes simultaneously. After a fetch command hits in the AIB cache or after a miss refill has been processed, the CMU generates an execute directive which contains an index into the AIB cache. For a vector-fetch command the execute directive indicates that the AIB should be executed by all VPs mapped to the lane, while for a thread-fetch command it identifies a single VP to execute the AIB. The execute directive is sent to a queue in the execution cluster, leaving the CMU free to begin processing the next command. The CMU is able to overlap the AIB cache refill for new fetch commands with the execution of previous ones, but it must track which AIBs have outstanding execute directives to avoid overwriting their entries in the AIB cache.
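The command-management flow just described can be summarized with a small sketch. The C fragment below is a hypothetical model of how a CMU might turn a fetch command into an execute directive; the structure names, fields, and helper functions are invented for illustration and do not describe Scale's actual control logic.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical, simplified CMU model: look up the AIB address in the
     * lane's AIB cache tags, request a refill on a miss, then enqueue an
     * execute directive naming the cached AIB and the VP(s) to run it. */
    typedef struct {
        uint32_t aib_addr;      /* AIB address from the fetch command    */
        bool     vector_fetch;  /* true: all VPs on lane; false: one VP  */
        int      vp;            /* VP number for a thread-fetch          */
    } FetchCommand;

    typedef struct {
        int  aib_index;         /* index into the AIB cache              */
        bool all_vps;           /* execute for all VPs mapped to lane    */
        int  vp;                /* single VP for a thread-fetch          */
    } ExecuteDirective;

    extern int  aib_tag_lookup(uint32_t aib_addr);   /* returns -1 on a miss      */
    extern int  aib_fill(uint32_t aib_addr);         /* refill, return cache index */
    extern void enqueue_execute_directive(ExecuteDirective ed);

    void cmu_process_fetch(FetchCommand cmd) {
        int idx = aib_tag_lookup(cmd.aib_addr);
        if (idx < 0)
            idx = aib_fill(cmd.aib_addr);  /* miss: AIB fill unit reads the primary cache */

        ExecuteDirective ed = { idx, cmd.vector_fetch, cmd.vp };
        enqueue_execute_directive(ed);     /* CMU is now free for the next command */
    }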

Figure 2-16: Example of vector-threaded code mapped to the VT architecture. A VP issues a thread-fetch to request its next AIB, otherwise the VP thread stops. Vector and threaded control, as well as vector memory accesses, may be interleaved at a fine granularity.

Figure 2-17: Example of cross-VP data transfers on the VT architecture. A vector-fetched AIB establishes a cross-VP chain that spans the virtual processor vector. The prevVP receives are encoded as instruction source operands, and the nextVP sends are encoded as instruction destinations. The prevVP data for VP0 comes from the cross-VP start/stop queue, and the final nextVP data is written to this same queue.

Figure 2-18: Vector-thread archetype.

Figure 2-19: Time-multiplexing of VPs in a VT machine.

To process an execute directive, the execution cluster reads VP instructions one by one from the AIB cache and executes them for the appropriate VP. When processing an execute directive from a vector-fetch command, all of the instructions in the AIB are executed for one VP before moving on to the next. The virtual register indices in the VP instructions are combined with the active VP number to create an index into the physical register file. To execute a fetch instruction, the execution cluster sends the thread-fetch request (i.e. the AIB address and the VP number) to the CMU.

The lanes in the VTU are interconnected with a unidirectional ring network to implement the cross-VP data transfers. When an execution cluster encounters an instruction with a prevVP receive, it stalls until the data is available from its predecessor lane. When the VT architecture allows multiple cross-VP instructions in a single AIB, with some sends preceding some receives, the hardware implementation must provide sufficient buffering of send data to allow all the receives in an AIB to execute.
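The statement that virtual register indices are combined with the active VP number can be made concrete with a small sketch. The C functions below illustrate one plausible mapping from a (VP, virtual register) pair to a physical register file entry in a lane whose register file is partitioned evenly among its VPs; this layout is an assumption for illustration, not Scale's actual organization.

    /* Hypothetical per-lane register file indexing, for illustration only.
     * Assumes VPs are striped across NUM_LANES lanes and each VP owns a
     * contiguous block of REGS_PER_VP entries in its lane's register file. */
    enum { NUM_LANES = 4, REGS_PER_VP = 8 };

    int lane_of_vp(int vp)       { return vp % NUM_LANES; }
    int vp_index_on_lane(int vp) { return vp / NUM_LANES; }

    int physical_reg_index(int vp, int virtual_reg) {
        /* e.g. VP6 on a 4-lane machine lives on lane 2 as that lane's VP #1 */
        return vp_index_on_lane(vp) * REGS_PER_VP + virtual_reg;
    }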

2.6 VT Semantics

For single-threaded architectures, program semantics are usually defined by a straightforward sequential interpretation of the linear instruction stream. A VT architecture, however, requires a more intricate abstract model since the instruction stream contains inherent concurrency. The problem is akin to defining the execution semantics of a multithreaded program.

The semantics of VT programs are defined relative to the operation of a VT virtual machine. This is essentially a functional version of the programmer's model in Figure 2-14. It includes a control processor and a vector of virtual processors that execute blocks of VP instructions. These processors operate concurrently in a shared memory address space.

A common programming pattern is for the control processor to vector-load input data, use a vector-fetch to launch VP threads, and then either vector-store output data or vector-load new input data. As expected, the VT semantics dictate that commands after a vector-fetch do not interfere with the execution of the VP threads, i.e. they happen "after" the vector-fetch. More precisely, once a VP thread is launched, it executes to completion before processing a subsequent control processor command. VT semantics do however allow different VPs to process control processor commands "out of order" with respect to each other. For example, when there are back-to-back vector-fetch commands, some VP threads may execute the second command before others finish the first. Similarly, when a vector-store follows a vector-load, some VPs may execute the store before others complete the load. Traditional vector architectures have this same property.

Command interleaving concerns are mainly an issue for memory ordering semantics when the control processor and/or different VPs read and write the same memory locations. When necessary, memory dependencies between these processors are preserved via explicit memory fence and synchronization commands or atomic read-modify-write operations.

As a consequence of supporting multithreading, the semantics of a VT architecture are inherently nondeterministic. A VT program can potentially have a very large range of possible behaviors depending on the execution interleaving of the VP threads. As with any multithreaded architecture, it is the job of the programmer to restrict the nondeterminism in such a way that a program has well-defined behavior [Lee06]. Traditional multithreaded programs prune nondeterminism through memory-based synchronization primitives, and VT supports this technique for VP threads (via atomic memory instructions). However, this is often unnecessary since different VP threads do not usually read and write (or write and write) the same shared memory location before reaching one of the relatively frequent synchronization points introduced by commands from the control processor. Also, cross-VP data transfers provide a direct communication link between VPs without introducing nondeterminism.

It is important to note that nondeterminism is not unique to vector-thread architectures. Traditional vector architectures are also nondeterministic because they allow vector-loads, vector-stores, and scalar accesses to execute in parallel. The Cray X1 provides a variety of synchronization commands to order these different types of requests [cra03].

Figure 2-20: VP configuration with private and shared registers. The examples show various VP configurations mapped to a 32-entry register file. Each configuration command includes a private register count followed by a shared register count. Although only one register file is shown, the VPs would be striped across the lanes in a multi-lane VT machine.

2.7 VT Design Space

As with any architectural paradigm, there is a large design space for VT architectures. The number of lanes in an implementation can scale to balance performance and complexity, and similar to a traditional vector architecture, this can even be done while retaining binary compatibility. Here, we also consider several extensions to VT which affect the programmer’s model.

2.7.1 Shared Registers

In a VT architecture, the grouping of instructions into AIBs allows virtual processors to share temporary state. The general registers in each VP can be categorized as private registers (pr's) and shared registers (sr's). The main difference is that private registers preserve their values between AIBs, while shared registers may be overwritten by a different VP. In addition to holding temporary state, shared registers can hold scalar values and constants which are used by all VPs. The control processor can broadcast values by vector-writing the shared registers, but individual VP writes to shared registers are not coherent across lanes. That is to say, the shared registers are not global registers. A VT architecture may also expose shared chain registers (cr's) which can be used to avoid register file reads and writes when communicating single-use temporaries between producer and consumer instructions [AHKW02].

2.7.2 VP Configuration

When software uses fewer private registers per VP, the physical register files in VT hardware can support a longer vector length (more VPs). To take advantage of this, a VT architecture may make the number of private and shared registers per VP configurable. Software on the control processor can change the VP configuration at a fine granularity, for example before entering each stripmined loop. The low-overhead configuration command may even be combined with setting the vector length.

Figure 2-20 shows how various VP configurations map onto a register file in the VTU. Some traditional vector machines have also made the number of private registers per VP configurable, for example the VP100/VP200 [MU83].
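The configurations in Figure 2-20 are consistent with a simple model: each lane reserves the shared registers once and then packs as many private-register sets as fit in the remaining entries. The C sketch below illustrates that model; the function names and the 32-entry, 4-lane parameters are assumptions used only for illustration.

    #include <stdio.h>

    /* Hypothetical model of VP configuration (vcfg) for illustration only.
     * Assumes each lane has REGFILE_ENTRIES physical registers, the shared
     * registers are allocated once per lane, and the rest are divided into
     * per-VP private register sets. */
    enum { REGFILE_ENTRIES = 32, NUM_LANES = 4 };

    int vps_per_lane(int num_private, int num_shared) {
        if (num_private == 0) return 0;   /* no private state: degenerate case */
        return (REGFILE_ENTRIES - num_shared) / num_private;
    }

    int vlmax(int num_private, int num_shared) {
        return NUM_LANES * vps_per_lane(num_private, num_shared);
    }

    int main(void) {
        /* The single-register-file VP counts match Figure 2-20:
         * vcfg 8,0 -> 4 VPs, vcfg 4,12 -> 5 VPs,
         * vcfg 3,9 -> 7 VPs, vcfg 1,7  -> 25 VPs. */
        printf("%d %d %d %d\n",
               vps_per_lane(8, 0), vps_per_lane(4, 12),
               vps_per_lane(3, 9), vps_per_lane(1, 7));
        return 0;
    }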

2.7.3 Predication

In addition to conditional fetches, a VT instruction set may provide predication to evaluate small conditionals [HD86]. In such a design, each VP has a predicate register and VP instructions may execute conditionally under the predicate. Predication is a VP-centric view of the masked operations provided by many traditional vector instruction sets [SFS00].
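As a concrete illustration, a small conditional such as the one below can be if-converted: the comparison writes the VP's predicate register, and the two sides of the conditional issue under opposite predicates, so no thread-fetch is needed. The C fragment is a hypothetical example of the kind of conditional that maps well to predication, and the commented assembly mnemonics merely follow the generic style used in this chapter rather than any real instruction set.

    /* A small per-element conditional: saturate negative values to zero.
     * On a predicated VT machine this might become, per VP (illustrative
     * mnemonics only, in the generic VT assembly style):
     *   slt  p, pr0, 0        # p := (x < 0)
     *   (p)  copy pr1, 0      # if p:  y = 0
     *   (!p) copy pr1, pr0    # if !p: y = x
     */
    void saturate_to_zero(const int *in, int *out, int n) {
        for (int i = 0; i < n; i++) {
            int x = in[i];
            out[i] = (x < 0) ? 0 : x;
        }
    }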

2.7.4 Segment Vector Memory Accesses

In addition to the typical unit-stride and strided vector memory access patterns, VT architectures may provide segment vector memory accesses. Segments allow each VP to load or store several contiguous memory elements to support two-dimensional access patterns. As an example, consider the strided vector accesses for loading RGB pixel values in Figure 2-9. Here, the instruction encoding has converted a two-dimensional access pattern into multiple one-dimensional access patterns. These multiple strided accesses obfuscate locality and can cause poor performance in the memory system due to increased bank conflicts, wasted address bandwidth, and additional access management state. A different data layout might enable the programmer to use more efficient unit-stride accesses, but reorganizing the data requires additional overhead and may be precluded by external API constraints. Segment vector memory accesses more directly capture the two-dimensional nature of many higher-order access patterns. They are similar in spirit to the stream loads and stores found in stream processors [CEL+03, KDK+01]. Using segments, the RGB pixels can be loaded together with a command like:

    vlbuseg vr1, rin, 3   # load R/G/B values to vr1/vr2/vr3
      VP0.vr1 = mem[rin+0×3+0]
      VP0.vr2 = mem[rin+0×3+1]
      VP0.vr3 = mem[rin+0×3+2]
      VP1.vr1 = mem[rin+1×3+0]
      VP1.vr2 = mem[rin+1×3+1]
      VP1.vr3 = mem[rin+1×3+2]
      ...
      VPN.vr1 = mem[rin+N×3+0]
      VPN.vr2 = mem[rin+N×3+1]
      VPN.vr3 = mem[rin+N×3+2]

In this example, the stride between segments is equal to the segment size (3 bytes). A VT architecture may also support segment accesses with an arbitrary stride between segments. In this case, traditional strided accesses can be reduced to segment-strided accesses with a segment size of one element. Segment accesses are useful for more than just array-of-structures access patterns; they are appropriate any time a programmer needs to move consecutive elements in memory into different vector registers. Segments can be used to efficiently access sub-matrices in a larger matrix or to implement vector loop-raking [ZB91].
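To make the access pattern concrete, the sketch below expresses the same RGB load in plain C: the segment form touches memory as one stream of 3-byte records, while the strided form issues three separate element streams over the same bytes. The function names and the uint8_t element type are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    /* Semantics of the segment load "vlbuseg vr1, rin, 3" for vl elements:
     * VPi.vr1/vr2/vr3 receive the three consecutive bytes of pixel i. */
    void segment_load_rgb(const uint8_t *rin, size_t vl,
                          uint8_t *vr1, uint8_t *vr2, uint8_t *vr3) {
        for (size_t i = 0; i < vl; i++) {     /* one 3-byte segment per VP */
            vr1[i] = rin[i * 3 + 0];          /* R */
            vr2[i] = rin[i * 3 + 1];          /* G */
            vr3[i] = rin[i * 3 + 2];          /* B */
        }
    }

    /* The equivalent three strided loads (stride 3, offsets 0/1/2) touch the
     * same bytes but as three independent one-dimensional streams. */
    void strided_load_rgb(const uint8_t *rin, size_t vl,
                          uint8_t *vr1, uint8_t *vr2, uint8_t *vr3) {
        for (size_t i = 0; i < vl; i++) vr1[i] = rin[i * 3 + 0];
        for (size_t i = 0; i < vl; i++) vr2[i] = rin[i * 3 + 1];
        for (size_t i = 0; i < vl; i++) vr3[i] = rin[i * 3 + 2];
    }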

2.7.5 Multiple-issue Lanes

A VT machine can improve performance by executing multiple operations per cycle in each lane. One approach is to issue multiple VP instructions per cycle without exposing the instruction-level parallelism (ILP) to software. This is analogous to superscalar execution for a VP, and it is similar in effect to the way in which multi-unit vector machines overlap the execution of multiple vector instructions. Another ILP approach is to provide multiple execution clusters in the lanes and expose these to software as clustered resources within each VP, somewhat similar to a clustered VLIW interface. This is the approach adopted by Scale, and the reader may wish to compare Scale's programmer's model in Figure 4-1 to that for the baseline VT architecture in Figure 2-14. A lane in a multi-cluster VT machine has a single command management unit, and each cluster holds its own slice of the register file and the AIB cache. The properties of a VT architecture make it relatively simple to decouple the execution of different clusters with control and data queues, and the associated microarchitectural mechanisms are fully described in Chapter 5.

2.8 Vector-Threading

Applications can map to a vector-thread architecture in a variety of ways, but VT is especially well suited to executing loops. Under this model, each VP executes a single loop iteration and the control processor is responsible for managing the overall execution of the loop. For simple data-parallel loops, the control processor can use vector-fetch commands to send AIBs to all the VPs in parallel. For loops with cross-iteration dependencies, the control processor can vector-fetch an AIB that contains cross-VP data transfers. Thread-fetches allow the VPs to execute data-parallel loop iterations with conditionals or inner-loops. Vector memory commands can be used to exploit memory data-parallelism even in loops with non-data-parallel compute.

With VPs executing individual loop iterations, the multithreading in VT is very fine-grained (usually within a single function), and this local parallelism is managed by the compiler or the assembly language programmer. Compared to user-level threads that are managed by an operating system, VP threads in a VT architecture have orders of magnitude less overhead for initiation, synchronization, and communication. The control processor can simultaneously launch 10's–100's of VP threads with a single vector-fetch command, and it can likewise issue a barrier synchronization across the VPs using a single command. The VP threads benefit from shared data in the first-level cache, and when necessary the cache can be used for memory-based communication. VPs can also use cross-VP data transfers for low-overhead communication and synchronization.

When structured loop parallelism is not available, VPs can be used to exploit other types of thread parallelism. The primary thread running on the control processor can use individual VPs as helper threads. The control processor can assign small chunks of work to the VPs, or the VPs can help by prefetching data into the cache [CWT+01, Luk01]. In another usage model, the control processor interaction is essentially eliminated and the VPs operate as free-running threads. This is a more traditional multithreaded programming model where the VP threads communicate through shared memory and synchronize using atomic memory operations. For example, the VPs may act as workers which process tasks from a shared work queue.
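The free-running usage model can be pictured with a conventional shared work queue. The C sketch below is a hypothetical illustration of such a worker loop, using a C11 atomic counter to stand in for the atomic memory operations mentioned above; process_task and the task layout are assumptions, not part of the VT interface.

    #include <stdatomic.h>

    /* Free-running worker loop, sketched in C. Each VP thread repeatedly
     * claims the next task index with an atomic fetch-and-add and processes
     * it; the Task type and process_task() are hypothetical. */
    typedef struct { int data; } Task;

    extern void process_task(Task *t);

    void worker(Task *tasks, int num_tasks, atomic_int *next) {
        for (;;) {
            int i = atomic_fetch_add(next, 1);  /* claim a task */
            if (i >= num_tasks)
                break;                          /* queue drained: thread halts */
            process_task(&tasks[i]);
        }
    }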

2.8.1 Vector Example: RGB-to-YCC

Figure 2-21 shows the RGB-to-YCC example coded for a generic vector-thread architecture. The control processor first configures the VPs with 6 private and 13 shared registers, and then vector-writes the constants for the color transformation to shared registers.

    RGBtoYCbCr:
      vcfg 6, 13                  # configure VPs: 6 pr, 13 sr
      mult rinstripstr, vlmax, 3
      vwrsh sr0, CONST_Y_R
      vwrsh sr1, CONST_Y_G
      vwrsh sr2, CONST_Y_B
      vwrsh sr3, CONST_CB_R
      vwrsh sr4, CONST_CB_G
      vwrsh sr5, CONST_CB_B
      vwrsh sr6, CONST_CR_R
      vwrsh sr7, CONST_CR_G
      vwrsh sr8, CONST_CR_B
      vwrsh sr9, CONST_Y_RND
      vwrsh sr10, CONST_C_RND
    stripmineloop:
      setvl vlen, n               # (vl,vlen) = MIN(n,vlmax)
      vlbuseg pr0, rin, 3         # vector-load R/G/B
      vf rgbycc_aib               # vector-fetch AIB
      vsb pr7, routy              # vector-store Y
      vsb pr8, routcb             # vector-store CB
      vsb pr9, routcr             # vector-store CR
      add rin, rin, rinstripstr   # incr rin
      add routy, routy, vlen      # incr routy
      add routcb, routcb, vlen    # incr routcb
      add routcr, routcr, vlen    # incr routcr
      sub n, n, vlen              # decr n
      bnez n, stripmineloop       # stripmine
      jr ra                       # func. return

    rgbycc_aib:
      .aib begin
      mul sr11, pr0, sr0          # R * Y_R
      mul sr12, pr1, sr1          # G * Y_G
      add pr3, sr11, sr12         # accum. Y
      mul sr11, pr2, sr2          # B * Y_B
      add pr3, pr3, sr11          # accum. Y
      add pr3, pr3, sr9           # round Y
      srl pr3, pr3, 16            # scale Y
      mul sr11, pr0, sr3          # R * CB_R
      mul sr12, pr1, sr4          # G * CB_G
      add pr4, sr11, sr12         # accum. CB
      mul sr11, pr2, sr5          # B * CB_B
      add pr4, pr4, sr11          # accum. CB
      add pr4, pr4, sr10          # round CB
      srl pr4, pr4, 16            # scale CB
      mul sr11, pr0, sr6          # R * CR_R
      mul sr12, pr1, sr7          # G * CR_G
      add pr5, sr11, sr12         # accum. CR
      mul sr11, pr2, sr8          # B * CR_B
      add pr5, pr5, sr11          # accum. CR
      add pr5, pr5, sr10          # round CR
      srl pr5, pr5, 16            # scale CR
      .aib end

Figure 2-21: RGB-to-YCC in generic vector-thread assembly code.

The loop is stripmined as it would be for a traditional vector architecture. A segment vector-load fetches the RGB input data. The computation for one pixel transformation is organized as 21 VP instructions grouped into a single AIB. After vector-fetching this AIB, the control processor code uses three vector-stores to write the Y, Cb, and Cr output arrays.

Figure 2-22 diagrams how the RGB-to-YCC example executes on a VT machine. In contrast to the execution order on a vector machine (Figure 2-11), all of the instructions in the AIB are executed for one VP before moving on to the next. Although not shown in the example execution, chaining could allow the segment loads to overlap with the computation.
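For reference, the per-pixel computation inside rgbycc_aib corresponds to a fixed-point matrix transform followed by a 16-bit right shift. The C sketch below is a hypothetical scalar equivalent of one output channel; the specific JPEG/BT.601 coefficient values and the rounding term are assumptions (the thesis code only names them CONST_Y_R, CONST_Y_RND, and so on), but the structure (three multiply-accumulates, a rounding add, and a shift by 16) mirrors the AIB. Only the Y channel is shown to keep the sketch short.

    #include <stdint.h>

    /* Assumed 16.16 fixed-point coefficients (standard JPEG/BT.601 values
     * scaled by 65536); the thesis refers to them only symbolically. */
    #define C_Y_R   19595   /* ~0.299 * 65536 */
    #define C_Y_G   38470   /* ~0.587 * 65536 */
    #define C_Y_B    7471   /* ~0.114 * 65536 */
    #define C_Y_RND 32768   /* rounding term: 0.5 in 16.16 */

    void rgb_to_y(const uint8_t *rgb, uint8_t *y, int n) {
        for (int i = 0; i < n; i++) {
            const uint8_t *p = rgb + 3 * i;   /* one 3-byte segment */
            uint32_t acc = C_Y_R * p[0];      /* R * Y_R */
            acc += C_Y_G * p[1];              /* G * Y_G */
            acc += C_Y_B * p[2];              /* B * Y_B */
            acc += C_Y_RND;                   /* round Y */
            y[i] = (uint8_t)(acc >> 16);      /* scale Y */
        }
    }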

2.8.2 Threaded Example: Searching a Linked List

Figure 2-23 shows C code for a simple example loop that searches for many values in a linked list, and Figure 2-24 shows the example coded for a generic vector-thread architecture. The main loop is data-parallel, meaning that all of its iterations can execute concurrently. However, each iteration (each search) contains an inner while loop that executes for a variable number of iterations depending on its local search data. The loop is not vectorizable, but it can be mapped using VP threads.

In Figure 2-24, a vector-fetch (for search_aib) launches the VP threads, and the VPs each use predicated fetch instructions to continue traversing the linked list until they find their search value. The VPs will each execute the inner while loop for a variable number of iterations, similar to the example in Figure 2-16. On a VT machine, the execution of the VP threads will be interleaved on the lanes, similar to the diagram in Figure 2-19. A final point is that, even though the search loop itself is not vectorizable, the code in Figure 2-24 still uses vector-loads to read the input data and vector-stores to write the output data.

Figure 2-22: RGB-to-YCC vector-thread execution. Virtual processor numbers for the issuing element opera- tions are indicated before the colon. Comparing the order of execution to the vector execution in Figure 2-11, a lane executes all of the operations in an AIB for one VP before moving on to the next.

    void manySearch(int* in_ptr, int* out_ptr, Node* root, int n)
    {
      for (int i = 0; i < n; i++) {
        int searchv = *in_ptr++;
        Node* node = root;
        while (searchv != node->val)
          node = node->next;
        *out_ptr++ = node->data;
      }
    }

Figure 2-23: Linked list search example in C. The function searches a linked list for a set of values from an input array, and it writes the result data values from the searches to an output array.

    manySearch:
      vcfg 3, 3                   # configure VPs: 3 pr, 3 sr
      sll rstripstr, vlmax, 2
      la r1, search_aib
      vwrsh sr0, r1               # sr0 = search_aib
      vwrsh sr1, root             # sr1 = root
    stripmineloop:
      setvl vlen, n               # (vl,vlen) = MIN(n,vlmax)
      vlw pr1, rin                # vector-load searchv
      vf init_aib                 # setup node pointers
      vf search_aib               # launch searches
      vf output_aib               # get result data
      vsw pr2, rout               # vector-store output
      add rin, rin, rstripstr     # incr rin
      add rout, rout, rstripstr   # incr rout
      sub n, n, vlen              # decr n
      bnez n, stripmineloop       # stripmine
      jr ra                       # func. ret.

    init_aib:
      .aib begin
      copy pr0, sr1               # node = root
      .aib end

    search_aib:
      .aib begin
      lw sr2, 4(pr0)              # load node->val
      seq p, sr2, pr1             # p := (searchv == node->val)
      (!p) lw pr0, 8(pr0)         # if !p: node = node->next
      (!p) fetch sr0              # if !p: fetch search_aib
      .aib end

    output_aib:
      .aib begin
      lw pr2, 0(pr0)              # load node->data
      .aib end

Figure 2-24: Linked list search example in generic vector-thread assembly code. The VT mapping uses VP threads to perform searches in parallel.

Chapter 3

VT Compared to Other Architectures

In this chapter we compare the vector-thread architecture to other architectural paradigms. These paradigms can have many possible realizations, making it challenging to compare them in a systematic way. Since any software algorithm can map to any Turing-complete computer architecture, we judge different architectural paradigms on the following merits:

• Performance: Ideally the definition of the hardware-software interface should enable the implementation of a high-performance microarchitecture. Architectures can expose parallelism to allow many operations to execute simultaneously (e.g. the many element operations within a single vector instruction), and a simple instruction set interface can permit the use of low-latency hardware structures (e.g. the highly optimized components of a RISC pipeline).

• Efficiency: While attaining high performance, an implementation should minimize its consumption of critical resources including silicon area and power. In defining the interface between software and hardware, architectures strike a balance between the physical complexity of dynamic control algorithms implemented in hardware (e.g. register-renaming and dependency tracking in an out-of-order superscalar machine), and the overheads of static software control (e.g. storing and fetching wide instruction words in a VLIW machine). Architectures may also enable efficiency by exposing locality at the hardware/software interface to allow implementations to amortize control overheads across many operations.

• Programmability: An architecture is most useful when a variety of different applications can map to it efficiently. A clean software abstraction allows compilers or programmers to target an architecture with low effort. A more flexible architecture will allow more codes to obtain the benefits of high performance and efficiency. It may also be important for the software interface to provide compatibility and portability across different microarchitectures implementing the same instruction set.

This chapter begins with a general analysis of loop-level parallelism, a primary target for VT architectures. We then compare VT to vector architectures, multiprocessors, out-of-order superscalars, VLIW architectures, and other research architectures. The discussion is qualitative, as quantitative comparisons are only possible for processor implementations. An advantage of an abstract comparison is that it applies to the general concepts rather than being limited to a particular implementation.

• Number of iterations:
    fixed-static: Known at compile time.
    fixed-dynamic: Known before entering loop.
    data-dep: Data-dependent exit condition.
• Cross-iteration dependencies:
    data-parallel: No cross-iteration dependencies.
    static: Statically analyzable dependencies.
    data-dep: Data-dependent, unanalyzable dependencies.
• Internal control flow:
    none: Straight-line code.
    cond-small: Small conditional expressions.
    cond-large: Large conditional blocks.
    inner-loop: Internal loop.
    unstructured: Various unstructured control flow.
• Inner-loop number of iterations:
    fixed-common-static: Same for all outer-loop iterations, known at compile time.
    fixed-common-dynamic: Same for all outer-loop iterations, known before entering loop.
    fixed-varies: Varies for different outer-loop iterations, but known before entering loop.
    data-dep: Data-dependent exit condition.

Figure 3-1: Loop categorization.
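To ground the taxonomy, the hypothetical C fragments below give one example for each major category of cross-iteration dependence; the function names, array types, and bounds are illustrative only.

    /* data-parallel: no cross-iteration dependencies; every iteration
     * may execute concurrently (vectorizable). */
    void scale_array(float *a, float s, int n) {
        for (int i = 0; i < n; i++)
            a[i] = a[i] * s;
    }

    /* static cross-iteration dependency: each iteration reads the value
     * produced by the previous one, with a dependence distance of 1 that
     * is known at compile time. */
    void first_order_filter(const float *x, float *y, float a, int n) {
        float prev = 0.0f;
        for (int i = 0; i < n; i++) {
            prev = a * prev + x[i];   /* carried across iterations */
            y[i] = prev;
        }
    }

    /* data-dependent (unanalyzable) dependency: the indirect index may
     * alias earlier writes, so the dependence cannot be resolved statically. */
    void scatter_add(int *table, const int *idx, const int *val, int n) {
        for (int i = 0; i < n; i++)
            table[idx[i]] += val[i];
    }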

3.1 Loop-Level Parallelism

Architectures generally seek to improve performance or efficiency by exploiting the parallelism and locality inherent to applications. Program loops are ubiquitous in most applications, and provide an obvious source of instruction locality, as well as often providing data locality and parallelism between iterations. However, loops differ widely in their characteristics, and architectures are generally well-suited to only a subset of loop types.

Figure 3-1 presents a taxonomy for categorizing different types of loops based on properties which are important for exploiting loop-level parallelism. The loops in an application may have a nested structure; the taxonomy here applies to the loop which is chosen for parallelization. The first property is the number of iterations for which the loop executes. This may be fixed and known statically at compile time or dynamically before entering the loop, or it may be data-dependent and not known until the loop iterations actually execute. The second property involves data dependencies between loop iterations. Loops may have either no cross-iteration dependencies, cross-iteration dependencies which are statically known, or data-dependent cross-iteration dependencies (generally due to pointer aliasing or indexed array accesses) which can not be determined statically. The third property is the loop's internal control flow. Loop bodies may have either straight-line code with no control flow, small conditional if-then-else type statements, large conditional blocks which are infrequently executed, inner-loops, or unstructured procedure-call type control flow. Finally, for loops with inner-loops, the fourth property involves the number of iterations for which the inner-loop executes.

This may be fixed and common across all the outer-loop iterations and known either statically or dynamically before entering the outer-loop, it may be fixed but vary for different outer-loop iterations, or it may be data-dependent and not known until the loop executes.

Vector architectures typically exploit data-parallel loops with no internal control flow and a fixed number of iterations. When a loop is vectorized, each of its instructions becomes a multi-element operation which encodes the parallel execution of many loop iterations (i.e. the vector length). A vector machine exploits loop-level parallelism by executing element operations concurrently, and locality by amortizing the control overhead. Vector architectures can not execute loops with cross-iteration dependencies except for, in some cases, a very limited set of reduction operations. Vector architectures generally can not handle loops with internal control flow or data-dependent exit conditions. However, with outer-loop vectorization they are able to parallelize loops which themselves contain inner-loops, as long as the inner-loops execute for a common number of iterations. Some vector architectures can also handle small conditional expressions by predicating the element operations. Predication is usually unsuitable for larger conditional blocks, and some vector architectures provide compress and expand operations which add an initial overhead but then allow a machine to execute only the non-predicated iterations [SFS00].

Multithreaded architectures can use individual threads to execute the iterations of a data-parallel loop. Compared to vector architectures, they are much more flexible and can parallelize loops with any type of internal control flow. However, multithreaded machines are not nearly as efficient as vector machines at exploiting loop-level instruction and data locality when executing vectorizable applications. Also, the synchronization costs between threads make it impractical to parallelize loops that execute for a small number of iterations or that contain a small amount of work within each iteration. Typical multithreaded architectures can not parallelize loops with data-dependent cross-iteration dependencies. Some multithreaded architectures can parallelize loops with statically analyzable cross-iteration dependencies, but usually only by using relatively heavy-weight memory-based synchronization operations.

Superscalar architectures can reduce loop-level parallelism to instruction-level parallelism. In this way they can handle all types of loops, but they do not explicitly take advantage of the loops' parallelism or locality. They are particularly poor at exploiting parallelism when executing loops which contain unpredictable control flow, even when ample loop-level parallelism is available.

VLIW architectures typically use software pipelining to schedule the parallel execution of instructions across different loop iterations. In addition to vectorizable loops, this allows them to exploit parallelism in loops with statically analyzable cross-iteration dependencies. VLIW architectures can use predication to handle loops with data-dependent exit conditions or small conditional expressions, but they generally can not handle loops with other types of control flow. Although more flexible than vector machines, VLIW machines are not as efficient at executing vectorizable loops since they do not amortize control and data overheads.
The vector-thread architecture combines the advantages of vector and multithreaded architec- tures to efficiently exploit a wide variety of loop-level parallelism. At one extreme, when executing data-parallel loops, it is a vector architecture and can efficiently exploit the available parallelism and locality. However, it also includes mechanisms to handle loops with statically analyzable cross- iteration dependencies or internal control flow. At the other extreme, it is a multi-threaded architec- ture and can support many independent threads of control. The seamless transition between these modes of operation makes the vector-thread architecture unique in its ability to flexibly exploit parallelism.

3.2 VT versus Vector

The VT architecture builds upon earlier vector architectures [Rus78] by greatly enhancing programmability and flexibility. At the same time, VT attempts to allow implementations to retain most of the efficiencies of vector machines [WAK+96, KPP+97, KTHK03]. Similar to a vector machine, a VT machine uses vector commands from a control processor to expose 100s of parallel operations. These operations execute over several machine cycles on a set of parallel lanes.

In a vector processor, the instruction decoding and control overhead is typically done only once and amortized across the lanes. VT has more overhead since the lanes operate independently and the compute operations are held in AIB caches. Still, each lane performs only a single AIB cache tag-check, and the instructions within the AIB are then read using a short index into the small AIB cache. The control overhead is amortized over the multiple instructions within the AIB as well as the multiple VPs on each lane. Control overhead can be further amortized when an AIB has only one instruction (in a given cluster). In this case the instruction only needs to be read from the AIB cache and decoded once, and the datapath control signals can remain fixed as the instruction issues for all the VPs on the lane. Thus, when executing vectorizable code, VT machines are only marginally less efficient than vector machines at amortizing control overhead. This is especially true compared to modern Cray vector machines that are structured as a hierarchy of small two-lane vector units [D. 02, Faa98].

One of the most important benefits of vector processing is the ability to drive memory systems while tolerating long access latencies. The VT architecture retains traditional unit-stride and strided vector load and store commands. These expose memory access locality, allowing an implementation to efficiently fetch multiple elements with each access to the memory system. Furthermore, decoupling the memory access and register writeback portions of vector load commands (access/execute decoupling [EV96]) allows both vector and VT architectures to hide long memory access latencies. The machines can also exploit structured memory access patterns to initiate cache refills before the data is needed (refill/access decoupling [BKGA04]). For indexed loads and stores (gathers and scatters), the decoupled lanes of a VT machine help it to tolerate variation in the access latencies for individual elements. Other researchers have studied run-ahead lanes to achieve the same benefit for traditional vector machines [LSCJ06].

The grouping of instructions into AIBs can give VT several advantages over vector architectures, even on vectorizable code. (1) Each vector-fetch encodes a larger amount of work, and this makes it easier for the control processor to run ahead of the VTU and process the scalar code. (2) Temporary state within an AIB can be held in shared registers within the register file, or even in chain registers to avoid register file accesses altogether. Reducing the per-VP private register requirements improves efficiency by increasing the maximum vector-length that the hardware can support. (3) AIBs also improve efficiency when used in conjunction with segment vector memory accesses. Although a traditional vector architecture could also provide segments, the multi-element load data writebacks and store data reads would leave idle pipeline slots when chaining with single-operation vector instructions.
In VT, the multiple instructions within an AIB can keep the pipeline busy while data is transferred to or from memory.

One potential disadvantage of grouping instructions into AIBs is that there can be back-to-back dependencies between operations in the cluster execution pipeline. As a consequence, a VT machine must generally provide operand bypassing support within a cluster. These dependencies also preclude super-pipelining functional units such that basic operations have a latency greater than one cycle (for example, the Cray 1 took 3 cycles to perform an integer addition [Rus78]). However, it is not necessarily a limitation to base a machine's clock cycle time on a simple ALU with operand bypassing. This is a popular design point across many modern designs, as further pipelining tends to reduce power efficiency and greatly increase design complexity.

Longer latency instructions like multiplies and floating point operations can be pipelined in a VT machine as they would in any other architecture, and simple instruction scheduling can avoid pipeline stalls.

3.3 VT versus Multiprocessors

VT's virtual processor threads are several orders of magnitude finer-grained than threads that typically run on existing multiprocessor architectures. In a traditional multithreaded program, threads tend to execute thousands, millions, or billions of instructions between synchronization points. Even in integrated chip-multiprocessors (CMPs) and multi-core processors, memory-based communication and synchronization between threads can take 10s–100s of cycles [MMG+06, KPP06]. In comparison, VT provides extremely fine-grained multithreading with very low overhead; a single vector-fetch command can launch, for example, 100 threads which each execute only 10 instructions. Vector-fetches and the cross-VP network provide low-cost synchronization, and a shared first-level cache enables low-overhead communication.

The granularity of threading in VT enables a different type of parallelization. Traditional thread-level parallelism partitions the work of an application into separate chunks that execute concurrently, or it partitions a dataset to allow the threads to work on largely disjoint pieces. The parallelization, communication, and synchronization are usually explicitly managed by the application programmer. In VT, the thread parallelization is typically at the level of individual loop iterations within the compute kernels of an application. At this level, the memory dependencies between threads are often easy to analyze. A programmer targeting VT does not need to deal with many of the synchronization challenges that make multithreaded programming notoriously difficult. Even more promising, automatic loop-level threading is within the reach of modern compiler technology [BGGT01, CO03, LHDH05].

The amount of thread parallelism available in an application is often greater at a fine loop-level granularity than at a coarse task-level granularity [ACM88, CSY90, KDM+98]. Although this may seem counter-intuitive at first, task-level threading can be hampered by memory dependencies that are either fundamental to the application or artificially introduced by the program mapping. For example, it is challenging to speed up an H.264 video decoder with thread parallelism due to dependencies both within and across frames [vdTJG03].

The vector-thread architecture exposes instruction and data locality to a far greater extent than traditional thread-level parallelization. When VT threads all execute iterations of the same loop, the bookkeeping overhead code is factored out and amortized by the control thread. The small AIB caches in VT hardware exploit locality between threads to provide high instruction bandwidth with low fetch energy. When there is data locality in the threads' memory access streams, vector loads and stores can provide high bandwidth efficiently. Furthermore, threads in a VT machine can reuse data in the shared first-level cache. Threads executing on multiprocessors can only share instructions and data in their second or third level caches, so these machines can not attain the same level of efficiency enabled by VT.

Although fine-grained vector-threading has its benefits, it also has drawbacks compared to coarse-grained multithreading in multiprocessors. A VT machine optimizes the case where threads are executing the same instructions, but performance may be limited by fetch bandwidth if the threads are all following different code paths. A VT architecture also has inherently limited scalability due to the complexity of the interconnection between the lanes and the shared cache.
Instead of scaling a single VT processor to provide more compute power, the VT architecture should be used to build a flexible and efficient processor core. Larger systems can be constructed as VT-multiprocessors with a traditional multithreaded programming model across the cores. This is the same approach that has been commonly used in the design of vector supercomputers [ABHS89, HUYK04].

3.4 VT versus Out-of-Order Superscalars

Superscalar architectures retain a simple RISC programming model and use hardware to dynamically extract instruction-level parallelism. Instructions are staged in a reorder buffer which renames registers and tracks dependencies to enable out-of-order execution. Superscalar machines invest heavily in branch prediction to run ahead and feed speculative instructions into the reorder buffer, and the machines must implement intricate repair mechanisms to recover from misspeculations. When executing long-latency instructions like floating-point operations or loads which miss in the cache, superscalar machines rely on large instruction windows to buffer dependent operations and find instructions which can execute in parallel.

Superscalar processors primarily root out local parallelism in the instruction stream. They can also exploit loop-level parallelism, but in order to do this they must buffer all of the instructions in one iteration in order to get to the next. This becomes especially difficult when a loop contains even simple control flow, like an if-then-else statement. The machine must correctly predict the direction of all branches within the loop iteration in order to get to the next iteration, even though the execution of the next iteration does not depend on these local branches.

The vector-thread architecture has the advantage of exposing loop-level parallelism in the instruction set encoding. The virtual processors execute loop iterations in parallel, and they each process the local control flow independently. It is easily possible for 100 virtual processors to simultaneously execute a loop iteration which contains around 100 instructions. This is equivalent to an instruction window of 10,000, at least in the sense that instructions which would be separated by this distance in a scalar execution can execute concurrently. This huge window of parallelism is achieved by VT with no instruction buffering, no dynamic dependency tracking, and no speculation.

Although ILP is costly to exploit on a grand scale, it can provide an important low-cost performance boost when applied at a local level. Given that each lane in a VT machine has one load/store port into the memory system, it makes sense for a balanced machine to execute several instructions per cycle within each lane. Scale supports ILP with multi-cluster lanes that are exposed to software as clustered VPs.

Superscalars are well suited for executing code that lacks structured parallelism and locality. Aside from instruction and data caching, extracting local instruction-level parallelism is one of the only ways to speed up such code, and this is the forte of superscalar architectures. Applying this observation to VT, an implementation could use a superscalar architecture for the control processor. In such a design, the vector-thread unit offloads the structured portions of applications, leaving the superscalar to concentrate its efforts on the more difficult ILP extraction. Similar observations have also been made about hybrid vector/superscalar architectures [EAE+02, QCEV99, VJM99].

3.5 VT versus VLIW

VLIW architectures exploit instruction-level parallelism with a simple hardware/software interface. Whereas superscalars extract ILP dynamically in hardware, VLIW machines expose ILP to software. Wide instruction words control a parallel set of functional units on a cycle-by-cycle basis. Although simple, this low-level interface does not expose locality in the instruction or data streams. Compared to a vector or vector-thread machine, a VLIW implementation is less able to amortize instruction control overheads.

Not only must the hardware fetch and issue instructions every cycle, but unused issue slots are encoded as nops (no-operations) that still consume resources in the machine. For data access, VLIW machines use regular scalar loads and stores which are inefficient compared to vector accesses.

VLIW architectures are often used in conjunction with software pipelining to extract parallelism from loops with cross-iteration dependencies. A vector-thread architecture can also parallelize such loops, but it does so without using software pipelining. Instead, the vector-thread code must only schedule instructions for one loop iteration mapped to one virtual processor, and cross-iteration dependencies are directly encoded as prevVP receives and nextVP sends. The available parallelism is extracted at run time as the VPs execute on adjacent lanes. Dynamic hardware scheduling of the explicit cross-VP data transfers allows the execution to automatically adapt to the software critical path. Whereas a statically scheduled VLIW mapping has trouble handling unknown or unpredictable functional unit and memory access latencies, the decoupled lanes in a vector-thread machine simply stall (or execute independent instructions) until the data is available. In addition to improving performance, this dynamic hardware scheduling enables more portable and compact code.

3.6 VT versus Other Architectures

Speculative Threading. Multiscalar processors [SBV95] can flexibly exploit arbitrary instruction-level and loop-level parallelism. The architecture includes a centralized sequencer which traverses the program control flow graph and assigns tasks to independent processing units. A task can correspond to a loop iteration, but this is not a requirement. The processing units are connected in a ring to allow data dependencies to be forwarded from one to the next, i.e. from older tasks to newer tasks. Similar to VT's cross-VP data transfers, a compiler for Multiscalar determines these register dependencies statically. Multiscalar uses a fundamentally speculative parallelization model. It predicts when tasks can execute concurrently, and it uses an address resolution buffer to detect conflicts and undo invalid stores when incorrectly speculated tasks are squashed.

The superthreaded architecture [THA+99] pipelines threads on parallel execution units in a similar manner as Multiscalar. Instead of using a centralized controller to spawn threads, however, the architecture allows each thread to explicitly, and often speculatively, fork its successor. There is always a single head thread, and when a misspeculation is detected all of the successor threads are squashed. The main difference between the superthreaded architecture and Multiscalar is the hardware used to handle data speculation. The superthreaded architecture uses a simpler memory buffer, and software is responsible for explicitly managing memory ordering and detecting violations.

Like Multiscalar and the superthreaded architecture, VT exposes distant parallelism by allowing a control thread to traverse a program and launch parallel sub-threads. VT also uses a ring network to pass cross-iteration values between the threads. However, since VT is targeted specifically at parallelizing loop iterations, it is able to dispatch multiple iterations simultaneously (with a vector-fetch) and to exploit instruction reuse between the threads. Loop parallelization also allows VT to provide efficient vector memory accesses. Instead of executing one task per processing unit, VT uses multithreading of virtual processors within a lane to improve utilization and efficiency. In contrast to these speculative threading architectures, VT primarily exploits non-speculative structured parallelism. Multiscalar and the superthreaded architecture can extract parallelism from a wider range of code, but VT can execute many common parallel loop types with simpler logic and less software overhead.

Micro-Threading. Jesshope's micro-threading [Jes01] parallelizes loops in a similar manner to VT. The architecture uses a command analogous to a vector-fetch to launch micro-threads which each execute one loop iteration. Micro-threads execute on a parallel set of pipelines, and multithreading is used to hide load and branch latencies. In contrast to VT's point-to-point cross-VP data transfers, communication between threads is done through a global register file which maintains full/empty bits for synchronization.

Multi-Threaded Vectorization. Chiueh’s multi-threaded vectorization [Tzi91] extends a vector machine to handle loop-carried dependencies. The architecture adds an inter-element issue period parameter for vector instructions. To parallelize loops with cross-iteration dependencies, the com- piler first software-pipelines a loop. Then it maps each loop operation to a vector instruction by determining its issue time and issue period. As a result, the approach is limited to a single lane and requires the compiler to have detailed knowledge of all functional unit latencies. In comparison, VT groups loop instructions into an AIB with explicit cross-iteration sends and receives. AIBs execute on parallel lanes, and the execution dynamically adapts to runtime latencies.

Partitioned Vector. Vector lane threading [RSOK06] targets applications that do not have enough data-level parallelism to keep the lanes in a vector unit busy. Multiple control threads each operate on a configurable subset of the lanes in a vector unit. For example, two control threads could each use four lanes in an eight-lane vector machine. The control threads either run on independent control processors or they run concurrently on an SMT control processor. In contrast to VT, the lanes themselves (and the VPs) use regular vector control, and multithreading is provided by the control processor(s). The vector lane threading proposal also evaluates adding 4 KB caches and instruction sequencers to allow each lane to operate as an independent processor. In this mode, the architecture operates as a regular chip multiprocessor. This is different from VT since there is no longer a control processor issuing vector commands to the lanes. Furthermore, in this mode each lane runs only one thread, compared to multithreading multiple VPs.

Partitioned VLIW. The Multiflow Trace/500 architecture [CHJ+90] contains two clusters with 14 functional units each. It can operate as a 28-wide VLIW processor, or it can be partitioned into two 14-wide processors. The XIMD architecture [WS91] generalizes this concept by giving each functional unit its own instruction sequencer. In this way it achieves the flexibility of multithreading, and it can parallelize loop iterations that have internal control flow or cross-iteration dependencies [NHS93]. XIMD software can also run the threads in lock-step to emulate a traditional single-threaded VLIW machine. A primary difference between XIMD and VT is that XIMD's threads share a large global register file (which has 24 ports in the hardware prototype). In comparison, the VT abstraction gives virtual processors their own private registers. This provides isolation between threads and exposes operand locality within a thread. As a result, hardware implementations of VT are able to parallelize VP execution across lanes (with independent register files), as well as time-multiplex multiple VPs on a single lane. The explicit cross-VP data transfers in a VT architecture enable decoupling between the lanes (and between the VP threads).
The distributed VLIW (DVLIW) architecture [ZFMS05] applies the partitioned VLIW concept to a clustered machine. It proposes a distributed program counter and branching mechanism that together allow a single thread's instructions to be organized (and compressed) independently for each cluster. The Voltron architecture [ZLM07] extends this work and adds a decoupled mode of operation that it uses to exploit thread parallelism. Voltron essentially augments a chip multiprocessor with a scalar operand network and a program control network. In comparison, VT exploits fine-grained thread parallelism within a processor core. This allows it to reduce inter-thread communication and synchronization overheads, and to amortize control and data overheads across threads.

Hybrid SIMD/MIMD. Several older computer systems have used hybrid SIMD/MIMD architectures to merge vector and threaded computing, including TRAC [SUK+80], PASM [SSK+81], Opsila [BA93], and Triton [PWT+94]. The level of integration in these architectures is coarser than in VT. They are organized as collections of processors that either run independently or under common control, and the machines must explicitly switch between these separate MIMD and SIMD operating modes. Unlike VT, they do not amortize memory access overheads with vector load and store instructions, and they do not amortize control overheads by mapping multiple virtual processors to each physical processor.

Raw. The Raw microprocessor [TKM+02] connects an array of processing tiles with a scalar operand network [TLAA05]. Scalar operands are primarily encountered when extracting local instruction-level parallelism. They also appear as cross-iteration values during loop-level paral- lelization. Raw uses simple single-issue processor tiles and parallelizes scalar code across multiple tiles. Inter-tile communication is controlled by programmed switch processors and must be stati- cally scheduled to tolerate latencies. The vector-thread architecture isolates scalar operand communication within a single highly- parallel processor core. Inter-cluster data transfers provide low-overhead scalar communication, ei- ther across lanes for cross-VP data transfers, or within a lane in a multi-cluster VT processor. These scalar data transfers are statically orchestrated, but decoupling in the hardware allows execution to dynamically adapt to runtime latencies. Like Raw, VT processors can provide zero-cycle latency and occupancy for scalar communication between functional units [TLAA05], and the inter-cluster network in a VT machine provides even lower overhead than Raw’s inter-tile network.

Imagine. The Imagine stream processor [RDK+98] is similar to vector machines, with the main enhancement being the addition of stream load and store instructions that pack and unpack arrays of multi-field records stored in DRAM into multiple vector registers, one per field. In comparison, the vector-thread architecture uses a conventional cache to enable unit-stride transfers from DRAM, and provides segment vector memory commands to transfer arrays of multi-field records between the cache and VP registers. Like VT, Imagine also improves register file locality compared with traditional vector machines by executing all operations for one loop iteration before moving to the next. However, Imagine instructions use a low-level VLIW ISA that exposes machine details such as the number of physical registers and lanes. VT provides a higher-level abstraction based on VPs and AIBs which allows the same binary code to run on VT implementations with differing numbers of lanes and physical registers.

Polymorphic Architectures. The Smart Memories [MPJ+00] project has developed an architecture with configurable processing tiles which support different types of parallelism, but it has different instruction sets for each type and requires a reconfiguration step to switch modes. The TRIPS processor [SNL+03] similarly must explicitly morph between instruction, thread, and data parallelism modes. These mode switches limit the ability to exploit multiple forms of parallelism at a fine granularity, in contrast to VT which seamlessly combines vector and threaded execution while also exploiting local instruction-level parallelism.

Chapter 4

Scale VT Instruction Set Architecture

The Scale architecture is an instance of the vector-thread architectural paradigm. As a concrete architecture, it refines the VT abstract model and pins down all the details necessary for an actual hardware/software interface. Scale is particularly targeted at embedded systems—the goal is to provide high performance at low power for a wide range of applications while using only a small area. This chapter presents the programmer’s model for Scale and discusses some design decisions.

4.1 Overview

Scale builds on the generic VT architecture described in Section 2.5, and Figure 4-1 shows an overview of the Scale programmer’s model. The control processor (CP) is a basic 32-bit RISC architecture extended with vector-thread unit (VTU) commands. The CP instruction set is loosely based on MIPS-II [KH92]. The hardware resources in the VTU are abstracted as a configurable vector of virtual processors. Scale’s VPs have clustered register files and execution resources. They execute RISC-like VP instructions which are grouped into atomic instruction blocks. The control processor may use a vector-fetch to send an AIB to all of the VPs en masse, or each VP may issue thread-fetches to request its own AIBs. The control processor and the VPs operate concurrently in a shared memory address space. A summary of the VTU commands is shown in Tables 4.1 and 4.2. The vcfgvl command configures the VPs’ register requirements—which, in turn, determines the maximum vector length (vlmax)—and sets the active vector length (vl). The vector-fetch and VP-fetch commands send AIBs to either all of the active VPs or to one specific VP. Three types of synchronization commands (vsync, vfence, and vpsync) allow the control processor to constrain the ordering of memory accesses between itself and the VPs. System code can reset the VTU with the vkill command to handle exceptional events. The vwrsh command lets the control processor distribute a shared data value to all the VPs, and the vprd and vpwr commands allow it to read and write individual VP registers. The control processor can provide initial cross-VP data and consume the final results using the xvppush, xvppop, and xvpdrop commands. A variety of vector-load and vector-store commands support unit-stride and segment-strided memory access patterns, and a shared vector- load fetches a single data value for all the VPs. All of the VTU commands are encoded as individual control processor instructions, except for the vcfgvl command which the assembler may need to expand into two or three instructions. The following sections describe the virtual processors and the VTU commands in more detail.


Figure 4-1: Programmer's model for the Scale VT architecture.

configure VPs and set vector length
    vcfgvl rdst, rlen, nc0p, nc0s, nc1p, nc1s, nc2p, nc2s, nc3p, nc3s
    Configure the VPs with nc0p private and nc0s shared registers in cluster 0, nc1p private and nc1s shared registers in cluster 1, etc. The nc0p parameter is also used as the number of private store-data registers in cluster 0. State in the VPs becomes undefined if the new configuration is not the same as the existing configuration. The configuration determines the maximum vector length which the VTU hardware can support, and this value is written to vlmax. The new vector length is then computed as the minimum of rlen and vlmax, and this value is written to vl and rdst.

set vector length
    setvl rdst, rlen
    The new vector length is computed as the minimum of rlen and vlmax, and this value is written to vl and rdst.

vector-fetch
    vf label
    Send the AIB located at the label to every active VP in the VPV.

VP fetch
    vpf[.nb] rvp, label
    Send the AIB located at the label to the VP specified by rvp. The .nb version is non-blocking.

vector sync
    vsync
    Stall until every VP in the VPV is idle and has no outstanding memory operations.

vector fence
    vfence
    Complete all memory operations from previous vector commands (vector-fetch, vector-load, and vector-store) before those from any subsequent vector commands.

VP sync
    vpsync rvp
    Stall until the VP specified by rvp is idle and has no outstanding memory operations.

VTU kill
    vkill
    Kill any previously issued VTU commands at an arbitrary point of partial execution and reset the VTU into an idle state.

VP reg-read
    vprd[.nb] rvp, rdst, csrc/rsrc
    Copy csrc/rsrc in the VP specified by rvp to rdst. The .nb version is non-blocking.

VP reg-write
    vpwr[.nb] rvp, rsrc, cdst/rdst
    Copy rsrc to cdst/rdst in the VP specified by rvp. The .nb version is non-blocking.

vector reg-write shared
    vwrsh rsrc, cdst/rdst
    Copy rsrc to cdst/rdst in every VP in the VPV. rdst must be a shared register.

cross-VP push
    xvppush rsrc, cdst
    Push a copy of rsrc to the cross-VP start/stop queue for cluster cdst.

cross-VP pop
    xvppop rdst, csrc
    Pop from the cross-VP start/stop queue for cluster csrc and store the value into rdst.

cross-VP drop
    xvpdrop csrc
    Pop from the cross-VP start/stop queue for cluster csrc and discard the value.

Table 4.1: Basic VTU commands.

unit-stride vector load
    vL rbase, cdst/rdst
    vLai rbase, rinc, cdst/rdst
    Each active VP in the VPV loads the element with address rbase + width · VPindex (where width is the number of bytes determined by the opcode, and VPindex is the VP's index number) to cdst/rdst. The load-data is zero-extended for the u versions of the opcodes, or sign-extended otherwise. rdst must be a private register. For the ai versions of the opcode, rbase is automatically incremented by rinc.

segment-strided vector load
    vLsegst n, rbase, rstr, cdst/rdst
    vLseg n, rbase, cdst/rdst
    vLst rbase, rstr, cdst/rdst
    Similar to the unit-stride vector load, except each VP loads n elements with addresses rbase + rstr · VPindex + width · 0, rbase + rstr · VPindex + width · 1, ..., rbase + rstr · VPindex + width · (n − 1) to cdst/(rdst+0), cdst/(rdst+1), ..., cdst/(rdst+(n−1)). For the simplified seg versions of the opcode, the stride (rstr) is equal to the segment width (width · n). For the simplified st versions of the opcode, the segment size (n) is 1.

shared vector load
    vLsh rbase, cdst/rdst
    Every VP in the VPV loads the element with address rbase to cdst/rdst. The load-data is zero-extended for the u versions of the opcodes, or sign-extended otherwise. rdst must be a shared or chain register.

unit-stride vector store
    vS rbase, c0/sdsrc
    vS rbase, c0/sdsrc, P
    vSai rbase, rinc, c0/sdsrc
    vSai rbase, rinc, c0/sdsrc, P
    Each active VP in the VPV stores the element in c0/sdsrc to the address rbase + width · VPindex (where width is the number of bytes determined by the opcode, and VPindex is the VP's index number). For the predicated versions, the store only occurs if the store-data predicate register in cluster 0 is set to one with the (c0/sdp) argument or zero with the (!c0/sdp) argument. sdsrc must be a private store-data register. For the ai versions of the opcode, rbase is automatically incremented by rinc.

segment-strided vector store
    vSsegst n, rbase, rstr, c0/sdsrc
    vSsegst n, rbase, rstr, c0/sdsrc, P
    vSseg n, rbase, c0/sdsrc
    vSseg n, rbase, c0/sdsrc, P
    vSst rbase, rstr, c0/sdsrc
    vSst rbase, rstr, c0/sdsrc, P
    Similar to the unit-stride vector store, except each VP stores the n elements in c0/(sdsrc+0), c0/(sdsrc+1), ..., c0/(sdsrc+(n−1)) to the addresses rbase + rstr · VPindex + width · 0, rbase + rstr · VPindex + width · 1, ..., rbase + rstr · VPindex + width · (n − 1). For the simplified seg versions of the opcode, the stride (rstr) is equal to the segment width (width · n). For the simplified st versions of the opcode, the segment size (n) is 1.

L = {lb, lbu, lh, lhu, lw} S = {sb, sh, sw} P = {(c0/sdp), (!c0/sdp)}

Table 4.2: Vector load and store commands.

4.2 Virtual Processors

4.2.1 Structure and State

The biggest difference between the Scale architecture and the generic VT architecture in Section 2.5 is the software-visible clustering of resources in Scale's VPs. This change is motivated by the performance and efficiency requirements of embedded applications. Clusters greatly simplify the instruction dependence analysis that the VTU hardware must perform in order to issue multiple instructions per cycle in each lane. They also expose operand locality and enable efficient distributed register files which each have only a few ports. Without clusters, a monolithic multi-ported register file would be needed to issue multiple instructions per cycle.
Each cluster has a configurable number of private and shared registers (pr's and sr's) and a private 1-bit predicate register (p). Additionally, each cluster has two shared chain registers (cr0 and cr1) which correspond to the two input operands for the cluster's functional units. The chain registers can be used to avoid reading and writing the general register file, and they are also implicitly overwritten whenever any instruction executes on the cluster. The memory access cluster (c0) also contains an extra set of private store-data registers (sd's), a store-data predicate register (sdp), and a shared store-data chain register (sdc). Separating the store-data registers simplifies the implementation of vector-store commands and enables store decoupling in the Scale microarchitecture.

4.2.2 VP Instructions

An overview of Scale's VP instructions is shown in Table 4.3. All the clusters in a VP share a common set of capabilities, but they are also individually specialized to support certain extra operations. An instruction executes on a specific cluster and performs a simple RISC-like operation. The source operands for a VP instruction must come from registers local to the cluster or from the cluster's prevVP port, but an instruction may target both local and remote destination registers. Scale allows each VP instruction to have multiple destinations: up to one register in each cluster as well as the local nextVP port. As an example, the following Scale assembly instruction:

c0 addu pr0, sr2 -> pr1, c1/sr0, c2/cr0

performs an addition on cluster 0 using a private register and a shared register as operands, and it writes the result to a local private register, a remote shared register on cluster 1, and a remote chain register on cluster 2. Most instructions may also use a 5-bit immediate in place of the second operand. The load immediate instruction (li) supports a 32-bit immediate operand, but the assembler/loader may decompose it into multiple primitive machine instructions.
The Scale architecture also provides floating-point instructions for VPs, but since these are not implemented in the Scale prototype they are not listed in Table 4.3. The overall strategy is for cluster 3 to support floating-point multiplies and cluster 2 to support floating-point adds. Similarly, the architecture could be extended with special instructions to support fixed-point saturation and rounding and subword-SIMD operations.

4.2.3 Predication

Predication allows a VP to evaluate small conditionals with no AIB fetch overhead. When a VP instruction targets a predicate register, the low-order bit of the result is used. Any VP instruction can be positively or negatively predicated based on the cluster's local predicate register. For example, the following instruction sequence computes the absolute value of pr0, assuming that sr0 has been set to zero:

Arithmetic/Logical (all clusters):
    addu    addition
    subu    subtraction
    and     logical and
    or      logical or
    xor     logical exclusive or
    nor     inverse logical or
    sll     shift left logical
    srl     shift right logical
    sra     shift right arithmetic
    seq     set if equal
    sne     set if not equal
    slt     set if less than
    sltu    set if less than unsigned
    psel    select based on predicate reg.
    copy?   copy
    li?     load immediate
    la?     load address

Memory (cluster 0):
    lb      load byte (sign-extend)
    lbu     load byte unsigned (zero-extend)
    lh      load halfword (sign-extend)
    lhu     load halfword unsigned (zero-extend)
    lw      load word
    sb      store byte
    sh      store halfword
    sw      store word
    lw.atomic.add?   atomic load-add-store word
    lw.atomic.and?   atomic load-and-store word
    lw.atomic.or?    atomic load-or-store word

Fetch (cluster 1):
    fetch?       fetch AIB at address
    psel.fetch   select address based on predicate reg. and fetch AIB
    addu.fetch   compute address with addition and fetch AIB

Multiplication/Division (cluster 3):
    mulh        16-bit multiply (signed×signed)
    mulhu       16-bit multiply (unsigned×unsigned)
    mulhus      16-bit multiply (unsigned×signed)
    multu.lo    32-bit multiply (unsigned×unsigned) producing low-order bits
    multu.hi    32-bit multiply (unsigned×unsigned) producing high-order bits
    mult.lo     32-bit multiply (signed×signed) producing low-order bits
    mult.hi     32-bit multiply (signed×signed) producing high-order bits
    divu.q      32-bit divide (unsigned÷unsigned) producing quotient
    div.q       32-bit divide (signed÷signed) producing quotient
    divu.r      32-bit divide (unsigned÷unsigned) producing remainder
    div.r       32-bit divide (signed÷signed) producing remainder

? Pseudo-instruction, implemented as one or more primitive machine instructions.

Table 4.3: VP instruction opcodes.

c0 slt pr0, sr0 -> p          # p = (pr0 < 0)
c0 (p) subu sr0, pr0 -> pr0   # if (p) pr0 = (0 - pr0)

Scale also provides a useful predicate select instruction. For example, the following instruction sequence sets pr2 to the maximum of pr0 and pr1:

c0 slt pr0, pr1 -> p      # p = (pr0 < pr1)
c0 psel pr0, pr1 -> pr2   # pr2 = p ? pr1 : pr0

Scale provides only one predicate register per VP, and the logic for computing predicates is per- formed using regular registers. This is in contrast to some vector architectures which provide mul- tiple mask registers and special mask instructions [Asa98, EML88, Koz02, UIT94, Wat87]. Scale’s software interface allows any VP instruction to be optionally predicated, and the single predicate register saves encoding bits by avoiding the need for a register specifier. Also, using regular in- structions to compute predicates avoids extra complexity in the hardware. Although computing predicates with regular registers could cause many vector register bits to be wasted in a traditional vector architecture, in Scale the temporaries can be mapped to shared or chain registers. Another concern is that complicated nested conditionals can be difficult to evaluate using a single predicate

register [Asa98, SFS00]. However, Scale lets the programmer use AIB fetch instructions instead of predication to map this type of code.
Scale allows any VP instruction to be predicated, including instructions that write destination registers in remote clusters. In the Scale microarchitecture, nullifying remote writes is more difficult than nullifying writes local to a cluster (Section 5.4.3). Nevertheless, we decided that providing cleaner semantics justified the extra hardware complexity.

4.3 Atomic Instruction Blocks

Scale’s software-level instruction-set architecture is based on virtual processor instructions that are grouped into atomic instruction blocks. An AIB is the unit of work issued to a VP at one time, and it contains an ordered list of VP instructions with sequential execution semantics. Figure 4- 2 shows an example AIB. Although clusters are exposed to software, there are no constraints as to which clusters successive VP instructions can target. In other words, instruction parallelism between clusters is not explicit in Scale’s software ISA. This is in contrast to VLIW architectures in which software explicitly schedules operations on parallel functional units. For example, compare the example Scale AIB to the VLIW encoding shown in Figure 4-3. Scale hardware does not execute software-level VP instructions directly. Instead they are trans- lated into implementation-specific micro-ops, either by the compiler or the loader. The translation is performed at the granularity of AIBs, and the micro-op format for an AIB does expose the par- allelism between cluster operations. The micro-op format and translation is described in detail in Section 5.2.1. Although the inter-cluster instruction ordering is not constrained by the software interface, run- time parallelism within an AIB is generally enhanced by interleaving instructions for different clus- ters. Interleaving is demonstrated by the example AIB in Figure 4-2. The multiplies for the RGB- to-YCC conversion are performed on cluster 3, the accumulations on cluster 2, and the fixed-point rounding on cluster 1. The example also shows how interleaving can allow chain registers to be used to hold temporary operands between producer and consumer instructions. A recommended programming pattern when mapping code to virtual processors is for inter- cluster data dependencies within an AIB to have an acyclic structure. This property holds for the AIB in Figure 4-2 since all of the dependencies go from c3 to c2 and from c2 to c1. When such an AIB is vector-fetched, and as a result executed by many VPs in each lane, cluster decoupling in the hardware can allow the execution of different VPs to overlap. For the example AIB, c3, c2, and c1 can all be executing instructions for different VPs at the same time, with c3 running the furthest ahead and c1 trailing behind. This loop-level parallelism can be extracted even if the instructions within an AIB (i.e. within a loop body) are all sequentially dependent with no available parallelism. This is similar to the parallelism exploited by instruction chaining in traditional vector architectures, except that in Scale the instructions are grouped together in an AIB.

4.4 VP Configuration

In the Scale architecture, the control processor configures the VPs by indicating how many shared and private registers are required in each cluster. The length of the virtual processor vector changes with each re-configuration to reflect the maximum number of VPs that can be supported by the hard- ware. This operation is typically done once outside of each loop, and state in the VPs is undefined across re-configurations.

rgbycc_aib:
.aib begin
  c3 mulh pr0, sr0 -> c2/cr0   # R * Y_R
  c3 mulh pr1, sr1 -> c2/cr1   # G * Y_G
  c2 addu cr0, cr1 -> cr0      # accum. Y
  c3 mulh pr2, sr2 -> c2/cr1   # B * Y_B
  c2 addu cr0, cr1 -> c1/cr0   # accum. Y
  c1 addu cr0, sr0 -> cr0      # round Y
  c1 srl cr0, 16 -> c0/sd0     # scale Y
  c3 mulh pr0, sr3 -> c2/cr0   # R * CB_R
  c3 mulh pr1, sr4 -> c2/cr1   # G * CB_G
  c2 addu cr0, cr1 -> cr0      # accum. CB
  c3 mulh pr2, sr5 -> c2/cr1   # B * CB_B
  c2 addu cr0, cr1 -> c1/cr0   # accum. CB
  c1 addu cr0, sr1 -> cr0      # round CB
  c1 srl cr0, 16 -> c0/sd1     # scale CB
  c3 mulh pr0, sr6 -> c2/cr0   # R * CR_R
  c3 mulh pr1, sr7 -> c2/cr1   # G * CR_G
  c2 addu cr0, cr1 -> cr0      # accum. CR
  c3 mulh pr2, sr8 -> c2/cr1   # B * CR_B
  c2 addu cr0, cr1 -> c1/cr0   # accum. CR
  c1 addu cr0, sr1 -> cr0      # round CR
  c1 srl cr0, 16 -> c0/sd2     # scale CR
.aib end

Figure 4-2: Scale AIB for RGB-to-YCC (based on the generic VT code for RGB-to-YCC shown in Figure 2-21).

Figure 4-3: VLIW encoding for RGB-to-YCC. Only the opcodes are shown. The scheduling assumes single-cycle functional unit latencies.

As an example, consider the configuration command corresponding to the RGB-to-YCC code in Figure 4-2:

vcfgvl vlen, n, 3,0, 0,2, 0,0, 3,9

This command configures VPs with three private registers in cluster 0 (the store-data registers), two shared registers in cluster 1, no private or shared registers in cluster 2 (since it only uses chain registers), three private registers in cluster 3 to hold the load data, and nine shared registers in cluster 3 to hold the color conversion constants. As a side-effect, the command updates the maximum vector length (vlmax) to reflect the new configuration. In a Scale implementation with 32 physical registers per cluster and four lanes, vlmax will be ⌊(32−9)/3⌋ × 4 = 28, limited by the register demands of cluster 3 (refer to Figure 2-20 for a diagram). Additionally, the vcfgvl command sets vl, the active vector length, to the minimum of vlmax and the length argument provided (n), and the resulting length is also written to a destination register (vlen). Similar to some traditional vector architectures, Scale code can be written to work with any number of VPs, allowing the same binary to run on implementations with varying or configurable resources (i.e. number of lanes or registers). The control processor code simply uses the vector length as a parameter as it stripmines a loop, repeatedly launching a vector's-worth of iterations at a time. Scale's setvl command is used to set the vector length to the minimum of vlmax and the length argument provided (i.e. the number of iterations remaining for the loop). Figure 4-4 shows a simple vector-vector-add example in Scale assembly code.
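To make the vlmax arithmetic concrete, the following C sketch (added here for illustration; it is not part of the Scale definition) computes the limit implied by a register configuration. It assumes 32 registers per cluster and four lanes as in the example above, and it ignores details such as cluster 0's separate store-data registers and any absolute cap the hardware may impose on the number of active VPs.

    #include <limits.h>

    #define NUM_CLUSTERS 4

    /* Hypothetical helper: the vector length limit implied by a VP register
     * configuration is the minimum over the clusters of
     * floor((regs_per_cluster - shared[c]) / priv[c]) * num_lanes. */
    int compute_vlmax(const int priv[NUM_CLUSTERS], const int shared[NUM_CLUSTERS],
                      int regs_per_cluster, int num_lanes)
    {
        int vlmax = INT_MAX;
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (priv[c] == 0)
                continue;   /* a cluster with no private registers imposes no limit here */
            int vps_per_lane = (regs_per_cluster - shared[c]) / priv[c];
            int limit = vps_per_lane * num_lanes;
            if (limit < vlmax)
                vlmax = limit;
        }
        return vlmax;
    }

    /* For the RGB-to-YCC configuration, priv = {3,0,0,3} and shared = {0,2,0,9},
     * so compute_vlmax(priv, shared, 32, 4) gives ((32-9)/3)*4 = 28, limited by
     * cluster 3. */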

void vvadd(int n, const int32_t* A, const int32_t* B, int32_t* C) {
  for (int i = 0; i < n; i++)
    C[i] = A[i] + B[i];
}

vvadd:
                                       # configure VPs:  c0:p,s c1:p,s c2:p,s c3:p,s
  vcfgvl vlen, n, 1,0, 2,0, 0,0, 0,0   # vlmax = [ impl. dependent ]
                                       # (vl, vlen) = MIN(n, vlmax)
  sll stride, vlen, 2                  # stride = vlmax*4

stripmine_loop:
  setvl vlen, n                # set vector length: (vl, vlen) = MIN(n, vlmax)
  vlwai srcA, stride, c1/pr0   # vector-load A input
  vlwai srcB, stride, c1/pr1   # vector-load B input
  vf vvadd_aib                 # vector-fetch vvadd_aib
  vswai dstC, stride, c0/sd0   # vector-store C output
  subu n, vlen                 # decrement length by vlen
  bnez n, stripmine_loop       # branch to stripmine loop
  vsync                        # wait for VPs to finish
  jr ra                        # function call return

vvadd_aib:
.aib begin
  c1 addu pr0, pr1 -> c0/sd0   # do add
.aib end

Figure 4-4: Vector-vector add example in C and Scale assembly code.

4.5 VP Threads

As in the generic VT architecture, Scale’s virtual processors execute threads comprising a sequence of atomic instruction blocks. An AIB can have several predicated fetch instructions, but at most one fetch may issue in each AIB. If an AIB issues a fetch, the VP thread persists and executes the target AIB, otherwise the VP thread stops. A fetch instruction may be anywhere within an AIB sequence (not just at the end), but the entire current AIB always executes to completion before starting the newly fetched AIB. Scale provides cluster 1 with fetch instructions where the target AIB address is either calculated with an addition (addu.fetch) or chosen with a predicate select (psel.fetch). The AIB addresses used for VP fetches may be written to shared registers by the control processor. This is commonly done when a loop body contains a simple conditional block or an inner loop. For example, this sequence can implement an if-then-else conditional:

c2 slt pr0, pr1 -> c1/p    # if (pr0 < pr1)
c1 psel.fetch sr0, sr1     # fetch sr1, else fetch sr0

And this sequence can implement an inner loop executed by the VP thread:

c1 subu pr0, 1 -> pr0   # decrement loop counter
c1 seq pr0, 0 -> p      # compare count to 0
c1 (!p) fetch sr0       # fetch loop unless done

aibA:
.aib begin
  [...]
  c1 (p) addu.fetch pr0, (aibB - aibA) -> pr0    # fetch aibB if (p)
  c1 (!p) addu.fetch pr0, (aibC - aibA) -> pr0   # fetch aibC if (!p)
  [...]
.aib end

aibB:
.aib begin
  [...]
  c1 addu.fetch pr0, (aibD - aibB) -> pr0   # fetch aibD
  [...]
.aib end

aibC:
.aib begin
  [...]
  c1 addu.fetch pr0, (aibD - aibC) -> pr0   # fetch aibD
  [...]
.aib end

Figure 4-5: VP thread program counter example. By convention, the current AIB address is held in c1/pr0. Each fetch instruction produces the new AIB address as a result and writes it to this register. The immediate arguments for the addu.fetch instructions are expressed as the difference between AIB labels. The linker calculates these immediates at compile time.

VP threads may also have complex control flow and execute long sequences of AIBs. To manage the AIB addresses in this case, a VP will usually maintain its own "program counter" in a private register. By convention, this register always holds the address of the current AIB, and it is updated whenever a new AIB is fetched. An example of this style of VP code is shown in Figure 4-5.
The VP threads operate concurrently in the same address space. When they read and write shared memory locations, they must use atomic memory instructions (Table 4.3) to ensure coherence. Higher-level synchronization and mutual exclusion primitives can be built on top of these basic instructions.
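As a sketch of how such a higher-level primitive might look, the following C code builds a simple spinlock on top of an atomic fetch-or operation. It is added here for illustration: the wrapper atomic_fetch_or_int is hypothetical, and it assumes that an atomic load-or-store (such as Scale's lw.atomic.or) makes the value the word held before the OR available to the caller.

    /* Hypothetical wrapper around an atomic fetch-or.  On a host compiler this
     * maps to the GCC/Clang builtin; on Scale the same effect is assumed to
     * come from a lw.atomic.or VP instruction. */
    static inline int atomic_fetch_or_int(volatile int *addr, int val)
    {
        return __atomic_fetch_or(addr, val, __ATOMIC_SEQ_CST);
    }

    /* Spin until the lock word was previously 0, i.e. this thread set it to 1. */
    void lock_acquire(volatile int *lock)
    {
        while (atomic_fetch_or_int(lock, 1) != 0)
            ;   /* another VP thread holds the lock */
    }

    /* An ordinary store releases the lock. */
    void lock_release(volatile int *lock)
    {
        *lock = 0;
    }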

4.6 Cross-VP Communication

Each cluster in a virtual processor connects to its siblings in neighboring VPs with a prevVP input port that can be read as a source operand and a nextVP output port that can be written as a destination (refer to Figure 4-1). Typically, cross-VP chains are introduced by vector-fetching an AIB which contains prevVP receives and nextVP sends. VP0 receives prevVP input data from the per-cluster cross-VP start/stop queues, and—before vector-fetching the first cross-VP chain—the control processor uses xvppush commands to write the initial values to these queues. The cross-VP network has a ring structure, meaning that nextVP data from the last VP wraps back around to the cross-VP start/stop queue. The data can then be used by the next vector-fetched cross-VP chain, or the control processor can consume it with an xvppop or xvpdrop command.
Cross-VP data transfers allow a VT architecture to execute loops with cross-iteration dependencies (also known as loop-carried dependencies). As an example, Figure 4-6 shows a simplified version of the ADPCM speech decoder algorithm. The example loop has two cross-iteration dependencies (index and valpred), and these are evaluated separately by c1 and c2. The control thread

void decode_ex(int len, u_int8_t* in, int16_t* out) {
  int i;
  int index = 0;
  int valpred = 0;
  for(i = 0; i < len; i++) {
    u_int8_t delta = in[i];
    index += indexTable[delta];
    index = index < IX_MIN ? IX_MIN : index;
    index = IX_MAX < index ? IX_MAX : index;
    valpred += stepsizeTable[index] * delta;
    valpred = valpred < VALP_MIN ? VALP_MIN : valpred;
    valpred = VALP_MAX < valpred ? VALP_MAX : valpred;
    out[i] = valpred;
  }
}

decode_ex:                     # a0=len, a1=in, a2=out
                               # configure VPs:  c0:p,s c1:p,s c2:p,s c3:p,s
  vcfgvl t1, a0, 1,2, 0,3, 0,3, 0,0   # (vl,t1) = min(a0,vlmax)
  sll t1, t1, 1                # output stride
  la t0, indexTable
  vwrsh t0, c0/sr0             # indexTable addr.
  la t0, stepsizeTable
  vwrsh t0, c0/sr1             # stepsizeTable addr.
  vwrsh IX_MIN, c1/sr0         # index min
  vwrsh IX_MAX, c1/sr1         # index max
  vwrsh VALP_MIN, c2/sr0       # valpred min
  vwrsh VALP_MAX, c2/sr1       # valpred max
  xvppush $0, c1               # push initial index = 0
  xvppush $0, c2               # push initial valpred = 0
stripmineloop:
  setvl t2, a0                 # (vl,t2) = min(a0,vlmax)
  vlbuai a1, t2, c0/pr0        # vector-load input, incr. ptr
  vf vtu_decode_ex             # vector-fetch AIB
  vshai a2, t1, c0/sd0         # vector-store output, incr. ptr
  subu a0, t2                  # decrement count
  bnez a0, stripmineloop       # loop until done
  xvpdrop c1                   # pop final index, discard
  xvpdrop c2                   # pop final valpred, discard
  vsync                        # wait until VPs are done
  jr ra                        # return

vtu_decode_ex:
.aib begin
  c0 sll pr0, 2 -> cr1                  # word offset
  c0 lw cr1(sr0) -> c1/cr0              # load index
  c0 copy pr0 -> c3/cr0                 # copy delta
  c1 addu cr0, prevVP -> cr0            # accumulate index ( prevVP recv )
  c1 slt cr0, sr0 -> p                  # index min
  c1 psel cr0, sr0 -> sr2               # index min
  c1 slt sr1, sr2 -> p                  # index max
  c1 psel sr2, sr1 -> c0/cr0, nextVP    # index max ( nextVP send )
  c0 sll cr0, 2 -> cr1                  # word offset
  c0 lw cr1(sr1) -> c3/cr1              # load step
  c3 mulh cr0, cr1 -> c2/cr0            # step*delta
  c2 addu cr0, prevVP -> cr0            # accumulate valpred ( prevVP recv )
  c2 slt cr0, sr0 -> p                  # valpred min
  c2 psel cr0, sr0 -> sr2               # valpred min
  c2 slt sr1, sr2 -> p                  # valpred max
  c2 psel sr2, sr1 -> c0/sd0, nextVP    # valpred max ( nextVP send )
.aib end

Figure 4-6: Decoder example in C code and Scale code.

65 provides the initial values with xvppush commands before entering the stripmined loop. Within the stripmined loop, vector-fetches repeatedly launch the cross-VP chains, and vector-loads and vector- stores read and write memory. Once all the iterations are complete, the final values are discarded with xvpdrop commands. The prevVP receives and nextVP sends within a vector-fetched AIB should be arranged in cor- responding pairs. That is to say, when a virtual processor receives a cross-VP data value on its prevVP input port, it should send out a corresponding cross-VP data value on its nextVP output port. Even if instructions which read prevVP or write nextVP are predicated, one nextVP send should issue for each prevVP receive that issues. An AIB may read/write multiple values from/to the prevVP and nextVP ports, in which case the order of the receives and sends should correspond; i.e. communication through the ports is first-in-first-out.

4.6.1 Queue Size and Deadlock

Whenever queues are explicit in a hardware/software interface, the size must be exposed in some way so that software can avoid overflow or deadlock. In Scale, the per-cluster cross-VP start/stop queues can each hold 4 data elements (up to 16 values across the clusters). This is also the maximum number of prevVP/nextVP cross-VP data transfers that are allowed in one AIB.
When a vector-fetched AIB creates a chain of cross-VP sends and receives—and the initial data has been written to the cross-VP start/stop queue—deadlock will be avoided as long as the VPs are able to execute their AIBs in order (VP0 to VPN) without blocking. When a virtual processor encounters a prevVP receive, it will stall until the data is available. When a VP executes a nextVP send, either the receiver must be ready to receive the data, or the data must be buffered. To guarantee that deadlock will not occur, the Scale architecture dictates that a later VP can never cause an earlier VP to block. As a result, Scale must provide buffering for the maximum amount of cross-VP data that an AIB can generate. This buffering is provided in the microarchitecture by the cross-VP queues described in Section 5.4.1. With this guarantee, the VPs can each—in order, starting with VP0—execute the entire AIB, receiving all cross-VP data from their predecessor and producing all cross-VP data for their successor. As a result, a cross-VP chain created by a vector-fetched AIB will evaluate with no possible deadlock.
It is tempting to try to add restrictions to the software interface to eliminate the need for cross-VP data buffering in the VT hardware. For example, in a VT machine with one cluster, if all prevVP receives are required to precede all nextVP sends within an AIB, then an implementation does not need to provide buffering between VPs. As long as the VPs are executing concurrently in the hardware, a later VP can always receive all the cross-VP data from its predecessor, allowing the predecessor to execute the AIB without blocking. If there are two or more lanes in the hardware, then the VPs can each execute the entire AIB in succession to complete a vector-fetched cross-VP chain with no possible deadlock.
Although buffering can be eliminated if there is only one cluster, multiple clusters can have subtle interactions with cross-VP dependencies. Consider the somewhat contrived, but simple AIB in Figure 4-7. In this example, a VP must execute all of the nextVP instructions on cluster 1 before it executes the nextVP instruction on cluster 2. However, the successor VP must receive the prevVP data for cluster 2 before it will be ready to receive any of the prevVP data for cluster 1. Thus, an implementation must provide buffering for the nextVP data on cluster 1 in order to avoid deadlock. One way around this possibly surprising result would be to restrict the nextVP sends to execute in the same order as the prevVP receives; this would eliminate the criss-crossing dependency arrows in Figure 4-7.
As demonstrated by this example, eliminating the need for cross-VP data buffering would complicate the software interface. Requiring all receives to precede all sends could also hurt performance. Therefore, when designing Scale, we decided that providing a small amount of cross-VP buffering in each lane was an acceptable trade-off.

VP0 and VP1 (both execute the same AIB):

  c2 addu prevVP, 1 -> c1/cr1
  c1 xor prevVP, cr1 -> sr0
  c1 xor prevVP, cr1 -> sr1
  c1 xor prevVP, cr1 -> sr2
  c1 xor prevVP, cr1 -> sr3
  c1 srl sr0, 1 -> nextVP
  c1 srl sr1, 1 -> nextVP
  c1 srl sr2, 1 -> nextVP
  c1 srl sr3, 1 -> nextVP
  c1 srl sr3, 2 -> c2/cr0
  c2 xor cr0, sr0 -> nextVP

Figure 4-7: Example showing the need for cross-VP buffering. The AIB is shown for VP0 and VP1, and the cross-VP dependencies are highlighted.


Figure 4-8: Reduction example. The diagram shows a sum reduction accumulated first within 4 lanes, and then the final result accumulated across the 4 lanes.

4.7 Independent VPs

Scale extends the baseline VT architecture with the concept of independent VPs that execute in parallel rather than being time-multiplexed. Independent VPs are used for two primary purposes: (1) to compute associative reductions, and (2) to allow arbitrary cross-VP communication between VPs. Scale’s software interface defines a vector length for independent VPs (vlind), and in the Scale implementation vlind is equal to the number of lanes.

4.7.1 Reductions

The shared registers in a VT architecture provide an efficient way to compute associative reduction operations, e.g. calculating the sum, maximum, or minimum of a vector. The VPs first use a shared register to accumulate a partial result in each lane. This is done using a regular stripmined loop with a long vector length (and hence many VPs mapped to each lane). Then, the vector length is set to vlind to accumulate the final result across the lanes using the cross-VP network. A diagram of such an operation is shown in Figure 4-8.
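The pattern can be modeled in plain C (an illustration added here, not thesis code): phase 1 corresponds to the stripmined loop in which the VPs mapped to a lane accumulate into that lane's shared register, and phase 2 corresponds to setting the vector length to vlind and chaining the per-lane partial sums through the cross-VP network.

    #include <stdint.h>

    enum { NUM_LANES = 4 };   /* vlind in the Scale implementation */

    int32_t sum_reduce(const int32_t *a, int n)
    {
        int32_t partial[NUM_LANES] = {0};   /* models one shared accumulator per lane */

        /* Phase 1: long vector length; each VP adds its element into its
         * lane's shared register. */
        for (int i = 0; i < n; i++)
            partial[i % NUM_LANES] += a[i];

        /* Phase 2: vector length = vlind; the per-lane partial sums are
         * combined across the lanes (on Scale, via prevVP/nextVP transfers and
         * a final xvppop on the control processor). */
        int32_t result = 0;
        for (int lane = 0; lane < NUM_LANES; lane++)
            result += partial[lane];
        return result;
    }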

4.7.2 Arbitrary Cross-VP Communication

As described above, the standard way for software to use the cross-VP network is for the control processor to vector-fetch an AIB that contains corresponding prevVP receives and nextVP sends. This establishes a chain of cross-VP dependencies, and it guarantees that each send has a corresponding receive.
One could also imagine using the cross-VP network in a more general context, as a communication link between VP threads that have internal control flow. In this case, the nextVP send and its corresponding prevVP receive are not necessarily in the same AIB. Instead, they come from arbitrary AIBs that the sending and receiving VPs have thread-fetched. Although it seems like an intuitive usage pattern, the Scale architecture does not allow this type of arbitrary cross-VP communication between VP threads. This restriction enables an efficient hardware implementation—the virtual processors are time-multiplexed on the physical lanes in the VTU, and the cross-VP links between lanes are shared by the VPs. With this structure, if arbitrary communication between VP threads were allowed, a nextVP send from any VP thread mapped to lane 0 could inadvertently pair with a prevVP receive from any VP thread mapped to lane 1. Compared to vector-fetched AIBs which have a well-defined execution order on the VTU, the VPs may be arbitrarily interleaved when executing thread-fetched AIBs.
Although Scale restricts cross-VP communication in the general case, when the vector length is set to vlind VP threads with arbitrary control flow are allowed to communicate freely using the cross-VP network. Since the hardware resources are not time-multiplexed, the cross-VP sends and receives are always paired correctly. Independent VPs allow loops with cross-iteration dependencies and arbitrary control flow to be mapped to Scale, but the mechanism does limit the vector length and thereby the parallelism that is available for the hardware to exploit. Nevertheless, this is often not a major limitation since the cross-iteration dependencies themselves are likely to limit the available parallelism in these types of loops anyway. In fact, the cross-VP network is most beneficial when mapping loops with simple cross-iteration dependencies whose latency can be hidden by the parallel execution of a modest number of other instructions within the loop body. These loops can usually fit within a single vector-fetched AIB, so the Scale architecture is optimized for this case.

4.8 Vector Memory Access

Scale's vector memory commands (Table 4.2) expose structured memory access patterns, allowing an implementation to efficiently drive a high-bandwidth memory system while hiding long latencies. Like vector-fetch commands, these operate across the virtual processor vector; a vector-load writes the load data to a private register in each VP, while a vector-store reads the store data from a private store-data register in each VP. Scale supports unit-stride and segment-strided vector memory access patterns (Section 2.7.4). The vector-store commands may be predicated based on cluster 0's private store-data predicate register (sdp). Scale also supports vector-load commands which target shared registers to retrieve values used by all VPs. Although not shown in Table 4.2, Scale's shared vector-loads may also retrieve multiple elements, optionally with a stride between elements.
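The address patterns behind these commands can be written out explicitly. The following C model is added here for illustration and follows the formulas in Table 4.2; width is the element size in bytes implied by the opcode and vl is the active vector length.

    #include <stdint.h>

    /* Unit-stride vector load or store: VP k accesses rbase + width*k. */
    void unit_stride_addresses(uintptr_t rbase, int width, int vl, uintptr_t addr[])
    {
        for (int k = 0; k < vl; k++)
            addr[k] = rbase + (uintptr_t)(width * k);
    }

    /* Segment-strided access of n-element records: VP k accesses element j of
     * its record at rbase + rstr*k + width*j, for j = 0 .. n-1.  Results are
     * stored flat as addr[k*n + j]. */
    void segment_strided_addresses(uintptr_t rbase, uintptr_t rstr, int width,
                                   int n, int vl, uintptr_t addr[])
    {
        for (int k = 0; k < vl; k++)
            for (int j = 0; j < n; j++)
                addr[k * n + j] = rbase + rstr * (uintptr_t)k + (uintptr_t)(width * j);
    }

For the seg opcode variants the stride rstr is simply the segment width (width · n), so consecutive VPs access back-to-back records.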

Vector load and store commands are generally interleaved with vector-fetch commands which process the load data and generate the store data. The Scale implementation is able to overlap the execution of vector load, store, and fetch commands via chaining. For this reason, it is often beneficial to partition a vectorizable loop into multiple AIBs whose execution can overlap with the different vector memory commands. For example, instead of performing all of the vector-loads before issuing a vector-fetch, the control processor can vector-fetch an AIB after each vector-load to perform a slice of the computation. Partitioning AIBs in this way can increase the private register demands when values need to be communicated from one AIB to another. However, in other cases the partitioning can reduce the private register count by allowing the vector-loads and vector-stores to reuse registers.

4.9 Control Processor and VPs: Interaction and Interleaving

Scale’s program semantics are based on those defined for the generic VT architecture in Section 2.6. The Scale virtual machine (Scale VM) is essentially a functional version of the programmer’s model in Figure 4-1. It includes a control processor and a configurable vector of virtual processors that execute blocks of VP instructions. These processors operate concurrently in a shared memory address space. The control processor uses a vector-fetch command to send an AIB to the virtual processor vec- tor, and—if the AIB itself contains fetch instructions—to launch a set of VP threads. Additionally the control processor can use a VP-fetch command to launch an individual VP thread. As in the generic VT architecture, once it is launched a VP thread will execute to completion before processing a subsequent regular control processor command. For example, a VP register- read or register-write command following a vector-fetch will not happen until after the VP thread completes. However, Scale also provides special non-blocking versions of the VP register-read, register-write, and fetch commands (Table 4.1). The non-blocking commands issue asynchronously with respect to the execution of the VP thread. They allow the control processor to monitor or communicate with a running VP thread, and they can also aid in debugging. To resolve memory ordering dependencies between VPs (and between itself and the VPs), the control processor can use a vsync command to ensure that all previous commands have completed before it proceeds. It can also use a vfence command to ensure that all previous commands complete before any subsequent commands execute—in this case, the CP itself does not have to stall. Note that these commands are used to synchronize vector-loads and vector-stores as well as the VP threads.

4.10 Exceptions and Interrupts

The Scale control processor handles exceptions and interrupts. When such an event occurs, the processor automatically jumps to a pre-defined kernel address in a privileged execution mode. De- pending on the exceptional condition, the handler code can optionally use the vkill command to reset the vector-thread unit into an idle state. Exceptions within the VTU are not precise, and a vkill aborts previously issued VTU com- mands at an arbitrary point of partial execution. However, the Scale architecture can still support restartable exceptions like page faults. This is achieved at the software level by dividing the code into idempotent regions and using software restart markers for recovery [HA06]. With minor additions to the Scale instruction set, a similar software technique could make individ- ual VP threads restartable. A more thorough description of restart markers and exception handling

is beyond the scope of this thesis.
Incorrect user code can cause deadlock in the vector-thread unit, for example by using the cross-VP network improperly. An incorrect translation from VP instructions to hardware micro-ops could also cause the VTU to deadlock. Other problems in user code, like infinite loops in VP threads, can lead to livelock in the vector-thread unit. The deadlock conditions can potentially be detected by watchdog progress monitors in the hardware, while livelock detection will generally require user intervention (e.g. using control-C to terminate the program). In either case, an interrupt is generated for the control processor, and system software can recover from the error by using the vkill command to abort the vector-thread unit execution.

4.11 Scale Virtual Machine Binary Encoding and Simulator

In addition to serving as a conceptual model, Scale VM is used to define a binary encoding for Scale programs, and it is implemented as an instruction set simulator. The Scale VM simulator provides a high-level functional execution of Scale VM binaries, and it serves as a golden model for defining the reference behavior of Scale programs.

4.11.1 Scale Primop Interpreter

A virtual processor in Scale VM is defined in terms of a Scale Primop Interpreter (SPI, pronounced "spy"). A SPI is a flexible software engine that executes a set of primitive operations (primops). Each Scale VP instruction is decomposed into several primops. This indirection through primops facilitates architectural exploration, as different Scale instruction sets can be evaluated without changing the underlying primop interpreter.
A SPI is similar to an abstract RISC processor and it includes a set of foreground registers. Primops use a fixed-width 32-bit encoding format, and a basic primop includes two source register specifiers, a destination register specifier, and an opcode. Primops include basic arithmetic and logic operations, loads and stores, and a set of operations that are useful for instruction set interpretation. A SPI also includes a large number of background registers, and additional primops can copy registers between the foreground and background. SPIs also have input and output queue ports. These ports enable direct communication between different SPIs when they are linked together externally. Push and pop primops allow a SPI to write and read the queue ports. A SPI blocks when it tries to pop from an empty queue.
A set of conventions defines the mapping of Scale virtual processors onto Scale primop interpreters. A set of background registers are used to hold each cluster's register state. Likewise, the prevVP and nextVP ports for the different clusters map to certain SPI queue ports. As an example mapping, the following Scale VP instruction:

c2 addu cr0, pr0 -> c1/cr0, c0/pr1

is implemented by the following sequence of primops:

fg $r1, $B2/$b32    # bring c2/cr0 to foreground
fg $r2, $B2/$b0     # bring c2/pr0 to foreground
add $r0, $r1, $r2   # do addition
bg $B1/$b32, $r0    # write result to c1/cr0
bg $B0/$b1, $r0     # write result to c0/pr1

Analogous to the execution of atomic instruction blocks by Scale virtual processors, SPIs ex- ecute blocks of primops. A SPI has two fetch queues which hold addresses of fetch-blocks to be

executed. A controller outside of the SPI may enqueue a fetch-block address in its external fetch queue, and the SPI itself can execute a fetch primop which enqueues a fetch-block address in its internal fetch queue. A SPI always processes fetches from its internal queue before processing a fetch in its external queue. To process a fetch, the SPI sequentially executes all of the primops in the target fetch-block.
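The fetch-queue priority can be sketched as follows (an assumed structure for illustration only, not the actual simulator code):

    #include <stdbool.h>
    #include <stdint.h>

    #define FQ_DEPTH 16   /* arbitrary depth chosen for the sketch */

    typedef struct {
        uint32_t addr[FQ_DEPTH];
        int head, count;
    } fetch_queue_t;

    static bool fq_pop(fetch_queue_t *q, uint32_t *out)
    {
        if (q->count == 0)
            return false;
        *out = q->addr[q->head];
        q->head = (q->head + 1) % FQ_DEPTH;
        q->count--;
        return true;
    }

    static void execute_fetch_block(uint32_t block_addr)
    {
        /* Placeholder: a real SPI sequentially executes every primop in the
         * fetch-block at this address. */
        (void)block_addr;
    }

    /* One scheduling step for a SPI: internal fetches (enqueued by fetch
     * primops) are always processed before fetches pushed by the external
     * controller. */
    void spi_step(fetch_queue_t *internal, fetch_queue_t *external)
    {
        uint32_t block;
        if (fq_pop(internal, &block) || fq_pop(external, &block))
            execute_fetch_block(block);
    }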

4.11.2 Scale VM Binary Encoding

Scale assembly code is composed of two different instruction sets, the RISC instructions for the control processor and the Scale VP instructions for the vector-thread unit. During compilation, the control processor instructions—including the vector-thread unit commands—are translated by the assembler into binary machine instructions. This is a straightforward conversion that is usually a one-to-one mapping from assembly instruction to machine instruction.
Scale VP instructions may appear in the same assembly files as the control processor instructions, and both types of code are located within the .text section of the program. VP instructions are demarcated by .aib begin and .aib end assembler directives. During compilation, the VP instructions are first translated into primop assembly code by the Scale ISA translator (sisa-translator). The assembler then converts the primops into 32-bit machine instructions within the program binary. To help other tools process Scale binary files, the assembler creates a special section which enumerates the addresses of all the AIBs.
The primops in the Scale VM binary format effectively provide a flexible variable-width encoding for Scale VP instructions. These primops are not directly executed by Scale hardware. Instead, Scale VP instructions are translated into a hardware micro-op binary format as described in Section 5.2.1.

4.11.3 Scale VM Simulator

The Scale VM simulator includes a control processor model and an array of SPIs configured to model the virtual processor vector. The main execution loop in the simulator alternates between running the control processor and running the SPIs. When the control processor executes a vector-fetch command, the fetch-block address is queued in the external fetch queue of the SPIs corresponding to the active VPs. A SPI may temporarily block if it tries to execute a pop primop and the data is not available on its input queue port, but the simulator will repeatedly execute the SPIs until they all run to completion.
To aid in program debugging, the Scale VM simulator tracks the virtual processor configuration and detects when programs use invalid registers. The simulator also provides tracing capabilities and it supports special pseudo primops to print debugging information in-line during execution.

4.11.4 Simulating Nondeterminism

To help explore different possible executions of Scale programs, the Scale VM simulator supports both eager and lazy execution modes for the SPIs. Under eager execution, the SPIs run as far as possible after each command from the control processor. In the lazy mode, execution is deferred until a command from the control processor forces a synchronization. For example, many vector-fetch commands may issue and enqueue fetches without actually running the SPIs, but to preserve register dependencies the simulator must run the SPIs to completion before a subsequent vector-load or vector-store can execute.

In addition to lazy and eager execution, the Scale VM simulator also supports different execution interleavings of the SPIs themselves. The simulator always iterates through the SPI array and runs them one at a time.

The SPIs may either be configured to each run continuously for as long as possible, to each execute one fetch-block at a time, or to each execute one primop at a time.

Nondeterminism manifests as memory races between loads and stores in different threads. To help with program verification and debugging, the Scale VM simulator includes a memory race detector (MRD). The MRD logs all the loads and stores made by different threads. Then, at synchronization points, the MRD compares addresses between different threads to determine if a possible memory race has occurred. A few limitations of the current MRD in Scale VM are: (1) it does not take atomic memory operations into account, so it can falsely detect races between properly synchronized threads; (2) it does not track data values, so two stores may be flagged as a race when they actually store the same value; and (3) it does not take register dependencies into account, so it may falsely detect races that are precluded by true data dependencies. In explanation of this last point, when a vector-store follows a vector-load and their memory footprints overlap, a synchronizing fence is needed to avoid memory races—except for the special case in which the vector-load and the vector-store read and write the exact same addresses and they are ordered by data dependencies through registers. This special case is actually fairly common; it frequently occurs when a vector-load/compute/vector-store pattern is used to modify a data structure in place. Nevertheless, despite these false positives, the MRD in the Scale VM simulator has been useful in catching potential race conditions.
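A minimal sketch of this style of race detection is shown below in C++. The data structures are invented for illustration and do not reflect the actual Scale VM implementation; in particular, it shares the limitations listed above.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// Per-thread log of load and store addresses, gathered between
// synchronization points.
struct ThreadLog {
    std::set<uint32_t> loads;
    std::set<uint32_t> stores;
};

// At a synchronization point, report a possible race wherever one thread
// stores to an address that another thread loads or stores. This ignores
// atomics, data values, and register dependencies, so it can produce
// false positives.
void checkRaces(const std::vector<ThreadLog>& logs) {
    for (std::size_t a = 0; a < logs.size(); a++) {
        for (std::size_t b = a + 1; b < logs.size(); b++) {
            for (uint32_t addr : logs[a].stores)
                if (logs[b].loads.count(addr) || logs[b].stores.count(addr))
                    std::printf("possible race at address 0x%08x\n", addr);
            for (uint32_t addr : logs[b].stores)
                if (logs[a].loads.count(addr))
                    std::printf("possible race at address 0x%08x\n", addr);
        }
    }
}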

Chapter 5

Scale VT Microarchitecture

The Scale vector-thread architecture provides a programmer's model based on virtual processors. This abstraction allows software to flexibly and compactly encode large amounts of structured parallelism. However, the level of indirection means that the hardware must dynamically manage and schedule the execution of these VPs. The Scale microarchitecture includes many novel mechanisms to implement vector-threading efficiently. It exploits the VT encoding to execute operations in parallel while using locality to amortize overheads.

This chapter provides a detailed description of the organization and design of the Scale vector-thread microarchitecture. The general structure of the Scale microarchitecture could apply to many possible implementations, but to aid understanding this chapter describes the actual bit-level encodings and parameters (e.g. queue sizes) used in the Scale VT processor prototype.

5.1 Overview

Scale's microarchitecture reflects the structured parallelism and locality exposed by the VT software interface. Scale extends the VT archetype from Section 2.5, with the primary difference being the introduction of multi-cluster lanes. The partitioning of execution resources into clusters provides high computational throughput while limiting the complexity of control and datapaths. Scale also exploits the compact encoding of VT parallelism to improve performance-efficiency through extensive hardware decoupling. Each microarchitectural unit is an independent engine with a simple in-order execution flow, but decoupling allows some units to run far ahead of others in the overall program execution. Decoupling is enabled by simple queues which buffer data and control information for the trailing units to allow the leading units to run ahead.

Figure 5-1 shows an overview of the Scale microarchitecture. The single-issue RISC control processor executes the program control code and issues commands to the vector-thread unit. These include VTU configuration commands, vector-fetches, vector-loads, and vector-stores. Commands are queued in the VTU command queue, allowing the control processor to run ahead.

The VTU includes a parallel array of lanes which are each partitioned into a set of execution clusters. Each cluster contains a register file and an ALU and acts as an independent processing engine with its own small AIB cache. The clusters in a lane are interconnected with a data network, and they communicate with each other using decoupled transport and writeback pipelines. The clusters are also interconnected with a ring network across the lanes to implement cross-VP communication. Each lane also contains a command management unit which processes commands from the VTU and handles thread fetches generated by the VPs mapped to the lane. The CMU maintains tags for the cluster AIB caches and it coordinates with the AIB fill unit to manage a refill. The CMU distributes work to the clusters in the form of compact execute directives.

[Figure: the control processor and VTU command queue feed four lanes; each lane contains a CMU (AIB tags and pending thread-fetch queue) and clusters c0-c3 plus the store-data cluster, each with an execute directive queue, register file, AIB cache, and ALU; an AIB fill unit and cross-VP start/stop queues serve the lanes; the vector load, store, and refill units (VLU, VSU, VRU) and per-lane LVMUs connect the lanes to a four-bank cache (8 KB per bank, with tags and MSHRs) through an arbiter, address crossbar, read/write data crossbars, and segment buffers, backed by an external memory interface to main memory DRAM.]

Figure 5-1: Scale microarchitecture overview.

Figure 5-2: Instruction organization overview. The compiler translates Scale assembly code (VP instructions) into the Scale VM binary format (primops). The loader translates the Scale VM encoding into binary instruction packets which are stored in the program memory. During execution, the AIB fill unit converts the binary instruction packets into per-cluster micro-op bundles which are written to the cluster AIB caches.

The vector load unit (VLU) and the vector store unit (VSU) process vector memory commands from the control processor and manage the interface between the VTU lanes and the memory system. They each include an address generator for unit-stride accesses, and per-lane address generators for segment-strided accesses. The per-lane portions of the vector load and store units are organized as lane vector memory units (LVMUs). The vector refill unit (VRU) preprocesses vector-load commands to bring data into the cache.

Scale's memory system includes a 32 KB first-level unified instruction and data cache, backed up by a large main memory. The cache is comprised of four independent banks, and an arbiter decides which requester can access each bank every cycle. The cache connects to the VTU and the control processor with read and write crossbars, and it also contains load and store segment buffers. The cache is non-blocking and includes miss status handling registers (MSHRs) to continue satisfying new requests even when there are many outstanding misses. As a result, the memory system may return load data out-of-order, and requesters must tag each outstanding access with a unique identifier.

The remainder of this chapter describes these units and their interaction in much greater detail.

5.2 Instruction Organization

The Scale software ISA is portable across multiple Scale implementations but is designed to be easy to translate into implementation-specific micro-operations, or micro-ops. The loader translates the Scale software ISA into the native hardware ISA at program load time. It processes a Scale VM binary (described in Section 4.11.2) and essentially converts the primops back into a representation of the original Scale VP instructions. The loader then translates the Scale VP instructions into the hardware binary format, one AIB at a time. The binary format compresses micro-ops into 128-bit binary instruction packets. To process an AIB cache miss, the AIB fill unit reads these instruction packets from memory and constructs lane-wide fill packets which it broadcasts to the cluster AIB caches. Figure 5-2 shows an overview of the instruction organization.
[Figure: micro-op bundle formats; the trailing v(1), c-src(2), dst(5) fields form the writeback-op, and the c-dst-vec field is the transport-op.]

register-register:  pred(2) opcode(7) dst(5) src0(5) c-dst-vec(6) src1(5)  | v(1) c-src(2) dst(5)
register-immed5:    pred(2) opcode(7) dst(5) src0(5) c-dst-vec(6) immed(5) | v(1) c-src(2) dst(5)
register-immed11:   pred(2) opcode(7) dst(5) src0(5) immed(11)             | v(1) c-src(2) dst(5)
immed16:            pred(2) opcode(7) dst(5) immed(16)                     | v(1) c-src(2) dst(5)
store-data:         001(3) unused(18) pred(2) width(2) src(5)              | v(1) c-src(2) dst(5)

Figure 5-3: Micro-op bundle encoding. Each micro-op bundle is 38 bits wide. Regular register-register and register-immed5 micro-op bundles include a compute-op, a transport-op, and a writeback-op. The register-immed11 and immed16 encodings provide longer immediates in place of the transport-op.

For each AIB:

1. Start with an empty sequence of micro-ops for each cluster.

2. For each VP instruction:

(a) Convert the main operation into a compute-op, possibly with a local destination. Attach an empty transport-op to the compute-op. Append the compute-op/transport-op pair to the associated cluster's micro-op bundle sequence, merging with a previous writeback-op if one is present in the tail micro-op bundle.

(b) For each remote destination, generate a writeback-op and set the associated bit in the transport-op. Append the writeback-op as a new micro-op bundle in the target cluster's micro-op bundle sequence.

Figure 5-4: Algorithm for converting AIBs from VP instructions to micro-ops.

5.2.1 Micro-Ops

The primary purpose of micro-ops is to handle inter-cluster communication. Scale VP instructions are decomposed into three categories of micro-ops: a compute-op performs the main RISC-like operation, a transport-op sends data to another cluster, and a writeback-op receives data sent from an external cluster. The loader organizes micro-ops derived from an AIB into micro-op bundles which target a single cluster and do not access other clusters' registers. Figure 5-3 shows the encoding of a 38-bit Scale micro-op bundle, and Table 5.1 shows the encoding of the opcode field. A straightforward algorithm shown in Figure 5-4 converts Scale VP instructions into micro-ops.
Opcode bits 0..4 (decimal and binary); the listed variants are selected by opcode bits 5..6:

 0 00000: or.rr, or.rr.p, or.ri, or.ri.p
 1 00001: nor.rr, nor.rr.p, nor.ri, nor.ri.p
 2 00010: and.rr, and.rr.p, and.ri, and.ri.p
 3 00011: xor.rr, xor.rr.p, xor.ri, xor.ri.p
 4 00100: seq.rr, seq.rr.p, seq.ri, seq.ri.p
 5 00101: sne.rr, sne.rr.p, sne.ri, sne.ri.p
 6 00110: slt.rr, slt.rr.p, slt.ri, slt.ri.p
 7 00111: sltu.rr, sltu.rr.p, sltu.ri, sltu.ri.p
 8 01000: srl.rr, srl.rr.p, srl.ri, srl.ri.p
 9 01001: sra.rr, sra.ri
10 01010: sll.rr, sll.ri
11 01011: addu.rr, addu.ri, addu.ri11.f^c1
12 01100: subu.rr, subu.ri
13 01101: psel.rr, psel.rr.f^c1, psel.ri
14 01110: (unused)
15 01111: (unused)
16 10000: lb.rr^c0, lb.ri^c0
17 10001: lbu.rr^c0, lbu.ri^c0
18 10010: lh.rr^c0, lh.ri^c0
19 10011: lhu.rr^c0, lhu.ri^c0
20 10100: lw.rr^c0, lw.rr.lock^c0, lw.ri^c0, lw.ri.lock^c0
21 10101: sb.rr^c0, sb.ri^c0
22 10110: sh.rr^c0, sh.ri^c0
23 10111: sw.rr^c0, sw.ri^c0
24 11000: mulh.rr^c3, mulhu.rr^c3, mulh.ri^c3, mulhu.ri^c3
25 11001: mulhus.rr^c3, mulhus.ri^c3
26 11010: mult.lo.rr^c3, multu.lo.rr^c3, mult.lo.ri^c3, multu.lo.ri^c3
27 11011: mult.hi.rr^c3, multu.hi.rr^c3, mult.hi.ri^c3, multu.hi.ri^c3
28 11100: div.q.rr^c3, divu.q.rr^c3, div.q.ri^c3, divu.q.ri^c3
29 11101: div.r.rr^c3, divu.r.rr^c3, div.r.ri^c3, divu.r.ri^c3
30 11110: lli.i5, lli.i16
31 11111: aui.cr0i16

Table 5.1: Scale compute-op opcodes. A .rr extension indicates two register operands, while a .ri extension indicates that the second operand is a 5-bit immediate. A .p extension indicates that the compute-op targets the predicate register as a destination, and a .f extension indicates that the compute-op does a fetch. The superscript cluster identifiers indicate that the opcode is only valid on the given cluster.

The VP instructions in an AIB are processed in order. For each VP instruction, the main RISC operation becomes a compute-op for the associated cluster. This typically includes two local source registers, a local destination register, an opcode, and a predication field. A special destination register index signifies that a compute-op has no local destination. A transport-op attached to the compute-op contains a bit-vector indicating which external clusters are targeted, if any. For each remote write, a writeback-op is generated for the destination cluster. The writeback-op contains a source cluster specifier and a local destination register index. Table 5.2 shows how the Scale software register names are encoded in Scale hardware micro-ops. Figure 5-5 shows a few example AIBs translated into micro-ops.

All inter-cluster data dependencies are encoded by the transport-ops and writeback-ops in the sending and receiving clusters respectively. This allows the micro-op bundles for each cluster to be packed together independently from the micro-op bundles for other clusters.
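The conversion in Figure 5-4 can be summarized in the following C++ sketch. The data types are simplified stand-ins invented for illustration, not the actual loader's classes, but the appending and merging behavior matches the algorithm above.

#include <array>
#include <optional>
#include <string>
#include <vector>

constexpr int kNumClusters = 5;           // c0-c3 plus the store-data cluster

struct WritebackOp { int srcCluster; std::string dstReg; };
struct ComputeOp   { std::string op, dst, src0, src1; };

struct MicroOpBundle {
    std::optional<WritebackOp> wb;        // at most one writeback-op
    std::optional<ComputeOp>   compute;   // compute-op, plus transport targets
    unsigned transportMask = 0;           // bit per external destination
};

struct VpInstruction {
    int cluster;                          // cluster executing the operation
    ComputeOp compute;
    std::vector<int> remoteDsts;          // external clusters receiving the result
    std::vector<std::string> remoteRegs;  // destination register in each of them
};

void convertAib(const std::vector<VpInstruction>& aib,
                std::array<std::vector<MicroOpBundle>, kNumClusters>& seqs) {
    for (const VpInstruction& vi : aib) {
        auto& seq = seqs[vi.cluster];
        // Merge with a trailing writeback-only bundle if one is available,
        // otherwise start a new bundle.
        if (seq.empty() || seq.back().compute) seq.emplace_back();
        seq.back().compute = vi.compute;
        for (size_t i = 0; i < vi.remoteDsts.size(); i++) {
            int dst = vi.remoteDsts[i];
            seq.back().transportMask |= 1u << dst;
            // Each remote destination gets a writeback-op appended as a new
            // micro-op bundle in the target cluster's sequence.
            MicroOpBundle wbBundle;
            wbBundle.wb = WritebackOp{vi.cluster, vi.remoteRegs[i]};
            seqs[dst].push_back(wbBundle);
        }
    }
}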

Table 5.2: Binary encoding of Scale register names. The check marks show which micro-op fields each register can appear in.
[Figure: per-cluster micro-op bundles for the rgbycc_aib, vtu_decode_ex, and vpadd_ex example AIBs. The vpadd_ex source AIB is:]

    .aib begin
    c0 lw   sr0(pr0) -> c1/cr0
    c0 lw   sr1(pr0) -> c1/cr1
    c1 addu cr0, cr1 -> cr0
    c1 srl  cr0, 2   -> c0/sdc
    c0 sw   sdc, sr2(pr0)
    .aib end

Figure 5-5: Cluster micro-op bundles for example AIBs. The writeback-op field is labeled as 'wb-op' and the transport-op field is labeled as 'tp-op'. The prevVP and nextVP identifiers are abbreviated as 'pVP' and 'nVP'.

Therefore, as the Scale loader translates the VP instructions in an AIB into micro-ops, it maintains separate ordered sequences of micro-op bundles for each cluster. As it processes each VP instruction in order, the loader appends the compute-op/transport-op pair and the writeback-ops to the micro-op sequences for the appropriate clusters. The effect of Scale's micro-op bundle packing can be seen by comparing the RGB-to-YCC AIB in Figure 5-5 to the VLIW encoding in Figure 4-3.

Some Scale VP instructions are handled specially in the conversion to micro-ops. Load-immediate instructions with immediates larger than 5 bits expand into multiple compute-ops and make use of a special micro-op bundle format with 16-bit immediates. Atomic memory instructions expand into a load-lock/op/store-unlock sequence of compute-ops. Long-latency instructions like loads, multiplies, and divides use writeback-ops even for local register destinations. This reduces the bypass and interlock logic complexity in the implementation. Store instructions are split into a store-address compute-op on cluster 0 and a store-data compute-op on the store-data cluster. Regular instructions which target store-data registers generate writeback-ops for the store-data cluster. This c0/sd partitioning enables store decoupling in the implementation.

The semantics of Scale's micro-op bundles dictate that the writeback-op occurs "before" the compute-op when there are read-after-write or write-after-write data dependencies. As the loader builds micro-op bundle sequences for each cluster, whenever possible it merges a compute-op/transport-op pair with a writeback-op from a previous VP instruction. We considered adding a "before/after bit" to the writeback-ops to make the micro-op bundle merging more flexible. However, it is typical for a writeback-op to be followed by a compute-op which uses the data. Optimizing the encoding for this common case saves a critical micro-op encoding bit and simplifies the cluster bypassing and interlock logic.

When VP instructions are decomposed and rearranged into micro-op bundles, an important requirement is that the new encoding does not introduce any possibility of deadlock. The execution semantics of the Scale micro-op encoding requires that the clusters execute their micro-op bundles in parallel, but it does not require any buffering of transport data or writeback-ops. A deadlock-free execution is guaranteed based on the observation that all of the micro-ops for each original VP instruction can execute together to reconstruct the original VP instruction sequence. An implementation is free to pipeline the cluster execution and decouple clusters with queues, as long as it does not introduce any circular dependencies. The micro-ops from the oldest VP instruction must always be able to issue.

Scale's micro-op bundles pair one writeback-op with one compute-op. We considered providing two writeback-ops per compute-op as an alternative arrangement; however, we found that one external operand per compute operation is usually sufficient. Many operands are scalar constants or loop invariants, many operands are re-used multiple times, and many operands are produced by local computations.
Limiting the number of writeback-ops in a micro-op bundle to one increases encoding density and significantly reduces complexity in the cluster hardware. We also observe that, including the local compute-op destinations, the encoding provides an average of two register writebacks per compute operation. This is more register write bandwidth than machines typically provide, but Scale uses clusters to provide this bandwidth without undue complexity – each cluster register file has only two write ports.

An advantage of the Scale micro-op bundle encoding is that a single compute-op can target multiple external destinations. The 6-bit transport-op indicates which external clusters are targeted (clusters 0–3, the store-data cluster, and nextVP). Then, multiple writeback-ops are used to receive the data. The key feature which makes the encoding efficient is that the allocation of writeback-ops is flexible and not tied to compute-ops. On average, there is usually less than one external destination per compute-op, but allowing compute-ops to occasionally target multiple destinations can avoid copy instructions and significantly boost performance.
[Figure: 128-bit binary instruction packet layout: an aibend field and the m0c, m1c, and m2c cluster fields, followed by three 38-bit micro-op bundle slots (bit widths as given: 5, 3, 2, 2, 2, 38, 38, 38).]

aibend value  meaning
0             reserved
1             micro-op bundle 0 is the end of the AIB
2             micro-op bundle 1 is the end of the AIB
3             micro-op bundle 2 is the end of the AIB
4             >3 binary packets left in AIB
5             1 binary packet left in AIB
6             2 binary packets left in AIB
7             3 binary packets left in AIB

Figure 5-6: Binary instruction packet encoding. Three micro-op bundles are packed together in a 128-bit word. The m0c, m1c, and m2c fields indicate which cluster each micro-op bundle is for. The aibend field indicates how many packets remain in the AIB, or which micro-op bundle slot in this packet is the end of the AIB.

Scale's transport/writeback micro-op partitioning is in contrast to clustered VLIW encodings which allow a cluster operation to directly read or write a register in another cluster [GBK07, TTG+03]. Although it may seem like a subtle difference, the transport/writeback partitioning is the key instruction set feature which allows a Scale lane to break the rigid lock-step execution of a VLIW machine. A cluster can run ahead and execute its micro-ops while queuing transport data destined for other clusters. A receiving cluster writes back the data into its register file when it is ready. The important point is that the writeback-op orders the register write with respect to the compute-ops on the receiving cluster. Scale's decoupled cluster execution is also made possible by the grouping of instructions into explicitly fetched AIBs. In comparison, a decentralized VLIW architecture with regular program counters must synchronize across clusters to execute branches [ZFMS05].

In effect, decoupled cluster execution enables a lane to execute compute micro-ops out of program order with very low overhead. Compared to register renaming and dynamic dependency tracking in an out-of-order superscalar processor, Scale's clusters use decentralized in-order control structures. Furthermore, the transport data queues and cluster register files are simpler and more efficient than the large monolithic register files which support out-of-order execution in a superscalar machine.

5.2.2 Binary Instruction Packets

During execution, the Scale cluster AIB caches contain micro-op bundle sequences in essentially the same format as those constructed by the loader (Figure 5-5). However, in memory the micro-ops are stored in a more compact form. Usually the number of micro-op bundles is not equal across the clusters; in fact, it is not uncommon for only one or two clusters to be used in an AIB. In a typical VLIW binary format, nops are used to fill unused instruction slots. The Scale binary encoding reduces the number of memory bits wasted on nops by encoding AIBs as a unit, rather than encoding each horizontal instruction word individually.

The micro-op bundles in an AIB are decomposed into 128-bit binary instruction packets; Figure 5-6 shows the bit-level encoding. An instruction packet has three micro-op bundle slots. Each slot has an associated field to indicate which cluster it is for. This makes the packing flexible, and in some cases all of the micro-op bundles in an instruction packet may even be for the same cluster. The instruction packet also has a field to indicate which micro-op bundle, if any, is the end of the AIB. To aid the AIB fill unit, this field is also used to indicate how many more instruction packets remain in the AIB.

Figure 5-7: Binary instruction packets for example AIBs (from Figure 5-5).

Figure 5-8: AIB fill unit.

The cluster micro-op bundles are organized into instruction packets by striping across the cluster sequences. One micro-op bundle is taken from each cluster before wrapping around, and any clusters with no more micro-op bundles are skipped. For example, if an AIB uses all the clusters, the first instruction packet will have micro-op bundles for c0/c1/c2, the second for c3/sd/c0, the third for c1/c2/c3, and so on. Figure 5-7 shows the binary instruction packet encoding for the example AIBs.
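The striping can be sketched as follows in C++ (the types are placeholders for illustration, not the actual loader's data structures): one bundle is taken from each cluster in round-robin order, skipping exhausted clusters, and every three bundles form one packet.

#include <array>
#include <cstddef>
#include <vector>

struct MicroOpBundle { /* 38-bit payload, elided */ };
struct PacketSlot    { int cluster; MicroOpBundle bundle; };
using  Packet        = std::vector<PacketSlot>;          // at most 3 slots

template <std::size_t N>
std::vector<Packet> stripeAib(const std::array<std::vector<MicroOpBundle>, N>& seqs) {
    std::array<std::size_t, N> next{};                   // per-cluster position
    std::vector<Packet> packets;
    Packet cur;
    bool bundlesLeft = true;
    while (bundlesLeft) {
        bundlesLeft = false;
        for (std::size_t c = 0; c < N; c++) {             // c0, c1, ..., sd
            if (next[c] >= seqs[c].size()) continue;      // cluster exhausted
            cur.push_back({static_cast<int>(c), seqs[c][next[c]++]});
            bundlesLeft = true;
            if (cur.size() == 3) {                        // packet is full
                packets.push_back(cur);
                cur.clear();
            }
        }
    }
    if (!cur.empty()) packets.push_back(cur);             // padded final packet
    return packets;
}

For an AIB that uses all five clusters, this produces the c0/c1/c2, c3/sd/c0, c1/c2/c3 packet sequence described above.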

The binary instruction packet encoding does not preclude an AIB from starting in the same packet in which the previous one ends, as long as at most one AIB ends in each instruction packet. However, the current Scale linker uses AIB addresses which are based on the software VP instruction binary format, making it impossible to pack the translated micro-op AIBs next to each other in memory. Therefore, Scale requires all AIBs to begin on naturally aligned 128-bit boundaries. This restriction simplifies the AIB fill unit since AIBs never begin in the middle of an instruction packet.

When AIBs are decomposed into instruction packets, either zero, one, or two micro-op bundle slots are unused at the end of each AIB. The average is one wasted slot per AIB. However, in the current Scale compiler tool chain the software VP instruction encoding is less dense than the instruction packet encoding, so AIBs are spaced apart in memory (Figure 5-2). Thus, out of the six micro-op bundle slots in a 256-bit Scale cache line, between zero and five will be unused at the end of an AIB. This increases the wasted space in the cache to an average of 2.5 micro-op bundle slots per AIB.

We note that the encoding for VP stores is not particularly efficient since each one uses both a store-address micro-op bundle in c0 and a store-data micro-op bundle in sd. We considered encoding these together, but this would make it more difficult for the AIB fill unit to reconstruct the original micro-op bundle sequences in c0 and sd. We also considered packing two store-data micro-op bundles together in one slot. However, given the relatively low frequency of VP stores, we decided to simplify the hardware by using just the simple encoding.

5.2.3 AIB Cache Fill

To service a vector-fetch or a thread-fetch for an AIB that is not cached, the lane command management units send requests to the AIB fill unit. The fill unit fetches AIB instruction packets from memory and reconstructs the per-cluster micro-op bundle sequences. It constructs fill packets with one micro-op bundle for each cluster and broadcasts these to the lane AIB caches across a wide bus. Refilling all of the cluster AIB caches in parallel with wide fill packets simplifies the AIB cache management, allowing the CMU to manage a single AIB cache refill index for all of the clusters in the lane.

An overview of the AIB fill unit is shown in Figure 5-8. The front-end of the fill unit chooses between lane thread-fetch requests using a round-robin arbiter. When all the lanes are fetching the same AIB for a vector-fetch, the fill unit waits until they are all ready to handle the refill together. To accomplish this, priority is given to thread-fetch requests since these come from earlier VTU commands. The lane CMUs will eventually synchronize on each vector-fetch.

After it selects an AIB address, the fill unit begins issuing cache requests for sequential 128-bit instruction packets. It uses a queue to reserve storage for its outstanding requests and to buffer responses which may return out of order. The instruction packet queue slots are allocated in order, and a request is tagged with its chosen slot. When the data eventually returns from the cache, after either a hit or a miss, it is written into its preassigned slot. The fill unit processes the instruction packets in order from the head of the queue. The instruction packet queue needs at least a few entries to cover the cache hit latency, and increasing the capacity allows the fill unit to generate more cache misses in parallel. Scale's queue has 4 slots.

The back-end of the fill unit contains a two-entry queue of micro-op bundles for each cluster. Each cycle it examines the three micro-op bundles at the head of the instruction packet queue, and it issues these into the per-cluster queues. If more than one micro-op bundle in the packet targets a single cluster, the packet will take several cycles to process. To reconstruct wide fill packets with one micro-op bundle per cluster, the fill unit relies on the micro-op bundles being striped across the clusters in the instruction packets. Whenever any cluster queue contains two micro-op bundles, the fill unit can issue all of the head entries in parallel to the destination lane(s). Additionally, when the end of the AIB is reached, the fill unit issues the last group of micro-ops. As it reconstructs and issues fill packets, the fill unit must stall whenever the AIB caches in the destination lane or lanes do not have a free refill slot available.

Although it is possible to construct a fill unit with only one micro-op bundle buffer slot per cluster instead of two, the two-slot design avoids stall cycles when the micro-ops in a single instruction packet are split across two issue groups. We also note that the head micro-op bundle buffer slot serves as a pipeline register, and it could be eliminated at the cost of increased delay within the clock cycle.

An important concern in the design of the fill unit is how to handle the end of an AIB. The front-end runs ahead and fills the instruction packet queue, but since it does not know the size of the AIB it will continue to fetch past the end.
As a baseline mechanism, the back-end eventually detects the end of the AIB and it signals the front-end to stop fetching instruction packets. Additionally, all of the extra instruction packets which have already been fetched must be discarded (and some of these may still be in flight in the memory system). To accomplish this, the front-end always tags the beginning of an AIB with a special bit in the instruction packet queue entries. Thus, after reaching the end of an AIB, the back-end discards instruction packets from the queue until it encounters the beginning of a new AIB.

With the baseline AIB end detection mechanism, the front-end will fetch several unused instruction packets. To improve this, the Scale binary instruction packet format uses the AIB-end field (which indicates which micro-op in the packet, if any, is the end of the AIB) to also encode how many instruction packets remain in the AIB: 1, 2, 3, or more than 3 (Figure 5-6).
[Figure: bit-level formats of the VTU commands issued by the control processor (config, register, synchronization, fetch, vector-load, and vector-store); each command has a 53-bit primary field, and vector memory commands carry an additional 35-bit vector-length/stride field.]

Figure 5-9: Vector-thread unit commands issued by the control processor. Vector memory commands include vector length and stride fields in addition to the primary command fields.

The back-end examines this field in the head instruction packet and also considers the number of already allocated instruction packet queue slots. It signals the front-end to stop fetching packets as soon as it determines that the last packet has been fetched. We note that this early stop is precise for large AIBs, but for small AIBs the front-end may still fetch one or two unused instruction packets. In this case correct operation relies on the baseline mechanism.

With the early stop mechanism in place, the front-end of the fill unit is able to begin fetching the next AIB while the back-end continues to process the active one. This can improve performance for code with independent VP threads that require relatively high AIB fetch bandwidth. Ideally, the fill unit could fetch one 128-bit instruction packet (three micro-op bundles) per cycle. The Scale implementation actually has one unused fetch cycle between AIBs due to pipelining. However, this stall cycle in the front-end overlaps with stall cycles in the back-end when the last instruction packet in the AIB has more than one micro-op bundle destined for a single cluster.

A final comment on the AIB fill unit is that it partially decodes the cluster micro-op bundles before inserting them into the AIB caches. This is done to simplify and optimize the cluster control logic. For example, the 5-bit register indices are expanded into a 7-bit format. The top two bits are used to indicate whether the operand is in the register file, a chain register, a predicate register (for destinations only), from the previous-VP (for sources only), or null.
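Returning to the early-stop decision, the following C++ sketch shows one plausible form of the check, based on the aibend encoding in Figure 5-6; the exact interpretation of "packets left" relative to already-allocated queue slots is an assumption made for illustration.

// aibend values per Figure 5-6 (assumed constants).
enum AibEnd {
    kEndsInSlot0 = 1, kEndsInSlot1 = 2, kEndsInSlot2 = 3,   // AIB ends here
    kMoreThan3Left = 4, kOneLeft = 5, kTwoLeft = 6, kThreeLeft = 7
};

// Given the aibend field of the packet at the head of the queue and the number
// of packet-queue slots already allocated beyond it, decide whether the
// front-end can stop fetching.
bool canStopFetching(AibEnd headAibEnd, int packetsAllocatedAfterHead) {
    switch (headAibEnd) {
        case kEndsInSlot0:
        case kEndsInSlot1:
        case kEndsInSlot2:
            return true;                       // head packet is the last one
        case kOneLeft:   return packetsAllocatedAfterHead >= 1;
        case kTwoLeft:   return packetsAllocatedAfterHead >= 2;
        case kThreeLeft: return packetsAllocatedAfterHead >= 3;
        case kMoreThan3Left:
        default:         return false;         // cannot tell yet; keep fetching
    }
}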

5.3 Command Flow

Commands direct the execution of the lanes and the vector memory units in the VTU. These include VTU commands which originate as control processor instructions, and thread-fetch commands from the lanes themselves. The lane command management units handle AIB caching and propagate commands to the clusters as compact execute directives. The CMUs are also responsible for the tricky task of maintaining correct execution ordering between the VP threads and commands from the control processor.

5.3.1 VTU Commands

A summary of Scale's VTU commands is shown in Tables 4.1 and 4.2. The control processor processes commands and formats them for the VTU as shown in Figure 5-9. The commands are then buffered in a queue, allowing the control processor to run ahead in the program execution. VTU commands can contain many bits since they contain data fields which are read from control processor registers. To improve efficiency, the stride and vector length portions of vector memory commands (35 bits) are held in a separate structure with fewer entries than the main VTU command queue (53 bits). In Scale, the VTU command queue can hold 8 commands of which 4 can be vector memory commands. At the head of the unified VTU command queue, commands are processed and distributed to further queues in the vector load unit (VLU), the vector store unit (VSU), and the lanes. Vector load and store commands are split into two pieces: an address command is sent to the VLU or VSU and a data command is sent to the lanes.

The VP-configuration and vector length commands are tightly integrated with scalar processing, and they even write the resulting vector length to a CP destination register. Therefore these commands are executed by the control processor, and the CP holds the VTU control registers. The CP contains hardware to perform the maximum vector length calculation – given private and shared register counts it computes the maximum vector length for each cluster, then it calculates the minimum across the clusters. It also sets the new vector length based on the requested length and vlmax. The CP attaches the current vector length to all vector-load and vector-store commands, and it issues an update command to the lanes whenever the VP-configuration or the vector length changes.

VTU synchronization commands (vsync and vfence) stall at the head of the VTU command queue, preventing subsequent commands from issuing until the lanes and the vector memory units are idle. The only difference between a vsync and a vfence is whether or not the control processor also stalls. A VP synchronization (vpsync) works like a vsync except it only stalls the control processor until the lane that the specified VP maps to is idle.

System-level software on the control processor can issue a vkill command to handle exceptions, support time-sharing, or recover from VTU deadlock caused by illegal code. The CP first asserts a signal which stops the lanes and the vector memory units from issuing memory requests, and then it waits for any loads to return. Once the outstanding memory requests have resolved, the control processor resets the entire VTU to complete the vkill.
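As a rough illustration of the vector length calculation described above, the following C++ sketch computes a per-cluster maximum from the private and shared register counts and takes the minimum across clusters. The register-file size, lane count, and per-lane VP limit are treated as parameters here; the exact formula is an assumption for illustration.

#include <algorithm>
#include <vector>

struct ClusterConfig { int numPrivate; int numShared; };

int computeVlmax(const std::vector<ClusterConfig>& clusters,
                 int numLanes, int regsPerCluster, int maxVpsPerLane) {
    int vlmax = numLanes * maxVpsPerLane;
    for (const ClusterConfig& c : clusters) {
        int vpsPerLane = maxVpsPerLane;
        if (c.numPrivate > 0)
            vpsPerLane = std::min(maxVpsPerLane,
                std::max(0, (regsPerCluster - c.numShared) / c.numPrivate));
        vlmax = std::min(vlmax, numLanes * vpsPerLane);   // minimum across clusters
    }
    return vlmax;
}

// The new vector length is the requested length clamped to vlmax.
int setVectorLength(int requested, int vlmax) {
    return std::min(requested, vlmax);
}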

5.3.2 Lane Command Management

Each lane's command management unit contains a single-entry queue to buffer one command at a time. The CMU converts commands into execute directives that it issues to the clusters. Execute directives are queued in the clusters, allowing the CMU to continue processing subsequent commands. Aside from serving as a way point between the control processor and the clusters, the CMU has two main roles: (1) it manages the AIB caches within the lane, and (2) it buffers pending VP thread-fetches and maintains correct ordering between these and the control processor commands.

The structure of an AIB cache is different than a regular instruction cache because AIBs are composed of a variable number of micro-op bundles. Also, the micro-ops in an AIB are packed into instruction packets in the program binary, but in the VTU they are unpacked into a per-cluster organization. The cache management is based on the wide fill packets constructed by the AIB fill unit. These have one micro-op bundle per cluster, allowing the CMU to maintain a single set of AIB cache tags which correspond to these cross-cluster slices. With 32 micro-op bundle slots in each cluster's AIB cache, the CMU contains 32 AIB cache tags in a fully-associative CAM structure.

Figure 5-10: Lane command management unit and cluster AIB caches. An example of three cached AIBs is shown.

Each tag includes the AIB address and a valid bit, and extra information bits indicating whether each cluster contains any micro-op bundles in the AIB and whether the AIB contains any thread-fetch instructions. An AIB occupies sequential micro-op bundle slots in the cluster AIB caches and AIBs always start in the same slot across the clusters. The first micro-op slot for an AIB will have a valid tag, and the other tag slots will be unused. The 32 tags allow an AIB to begin in any micro-op bundle slot, and they allow up to 32 (small) AIBs to be cached at the same time. An alternative design could restrict the starting points, e.g. to every other micro-op bundle slot, and thereby require fewer tags.

The AIB caches are filled using a FIFO replacement policy with a refill index managed by the CMU. During each fill cycle, the CMU broadcasts the refill index to the clusters and the AIB fill unit sends a bit to each cluster indicating whether or not it has a valid micro-op bundle. The CMU writes the address of the AIB into the tag corresponding to its first micro-op bundle slot. It sets the valid bit for this tag during the first fill cycle, and it clears the valid bits for the tags corresponding to subsequent fill cycles. The AIB fill unit scans the micro-op bundles during the refill, and on the last fill cycle the CMU updates the additional information bits that are associated with each AIB tag.

Although they have the same starting slot across the clusters, AIBs have a different ending slot in each cluster. To track this, each cluster maintains a set of end bits which correspond to the micro-op bundle slots. During each fill cycle in which a micro-op bundle is written into its AIB cache, the cluster sets the corresponding end bit and clears the previous end bit (except it does not clear an end bit on the first fill cycle). As a result, only the last micro-op bundle in the AIB will have its end bit marked.

When processing a fetch command, the CMU looks up the address in the AIB cache tags. If it is a hit, the match lines for the CAM tags are encoded into a 5-bit index for the starting micro-op bundle slot. This AIB cache index is part of the execute directive that the CMU sends to each cluster which has valid micro-op bundles for that AIB (as indicated by the extra tag bits). When a fetch command misses in the AIB tags, the CMU sends a refill request to the AIB fill unit.

The VT architecture provides efficient performance through decoupling—execute directive queues allow clusters to execute AIBs at different rates and allow the CMU to run ahead in the program stream and refill the AIB caches. However, during a refill the CMU must avoid overwriting the micro-op bundle slots for any AIB which has outstanding execute directives in the clusters. To track which AIB slots are in use, the CMU uses an execute directive status queue (EDSQ) as shown in Figure 5-11. Whenever it issues a fetch ED to the lane, the CMU writes the AIB cache index and the cluster valid bits (as AIB cache active bits) to a new EDSQ entry. The EDSQ entries are allocated in FIFO order, and the chosen EDSQ index is sent to the clusters as part of each fetch ED. When a cluster finishes processing a fetch ED, the CMU uses the EDSQ index to clear the associated AIB cache active bit in that EDSQ entry. An EDSQ entry becomes invalid (and may be reused) when all of its active bits are cleared. Scale has 4 EDSQ entries in each CMU, allowing up to 4 fetch EDs at a time to be executing in the lane.
During an AIB cache refill, the CMU broadcasts the fill index to all of the EDSQ entries. If any entry has a matching AIB cache index and any active bits set, the CMU stalls the AIB fill unit. This mechanism will stall the fill unit as soon as it reaches the first slot of any active AIB, and since the AIB caches are filled in FIFO order this is sufficient to avoid overwriting any active AIB slots.

A defining feature of a VT architecture is that a VP executing a vector-fetched AIB can issue a thread-fetch to request its own AIB. These are buffered in the CMU in a pending thread-fetch queue (PTFQ). A PTFQ entry includes an AIB address and the requesting VP index number. Although a VT ISA does not necessarily dictate the order in which thread-fetches for different VPs are processed, Scale uses a simple FIFO structure for the PTFQ to provide fairness and predictability. The PTFQ must be able to buffer one pending fetch for each VP mapped to a lane, and Scale supports up to 32 VPs per lane.

Figure 5-11: Execute Directive Status Queue.
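As a concrete illustration of the EDSQ check in Figure 5-11, the following C++ model (with invented field names) compares the refill index against every entry and stalls the fill unit if the entry's AIB is still active in any cluster.

#include <array>
#include <cstdint>

struct EdsqEntry {
    uint8_t aibCacheIndex = 0;     // first micro-op bundle slot of the AIB
    uint8_t activeBits    = 0;     // one AIB-cache-active bit per cluster
};

bool stallFillUnit(const std::array<EdsqEntry, 4>& edsq, uint8_t fillIndex) {
    for (const EdsqEntry& e : edsq) {
        // An entry with any active bits set still has outstanding execute
        // directives; its AIB slots must not be overwritten.
        if (e.activeBits != 0 && e.aibCacheIndex == fillIndex)
            return true;
    }
    return false;
}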

Scale VPs are always fetch-capable, but an alternative design could reduce the number of PTFQ entries and limit the vector length when the VPs are configured as fetch-capable.

With vector and thread fetches operating concurrently in the VTU, an important consideration is the interleaving between these VP control mechanisms. The VT execution semantics dictate that a VP thread launched by a vector-fetch should execute to completion before being affected by a subsequent control processor command. Therefore, Scale always gives thread-fetches in the PTFQ priority over VTU commands from the CP. However, before it processes the next VTU command the CMU must also ensure that none of the buffered execute directives will issue a thread-fetch. To accomplish this, additional bits in each EDSQ entry indicate if the AIB contains a fetch and if cluster 1 (the cluster which can execute fetch instructions) has finished executing the AIB. The fetch bit is set based on the information in the associated AIB cache tag, and the c1-exec bit is handled in the same way as the AIB cache active bits. The CMU will not process a VTU command from the CP until (1) there are no pending thread-fetches in the PTFQ, and (2) there are no possible thread-fetches in outstanding execute directives. The one exception to this rule is that Scale provides non-blocking individual VP read, write, and fetch commands which are given priority over thread-fetches on the lane.
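A behavioral sketch of this ordering rule is shown below; the state flags and interfaces are simplifications introduced for illustration, not the actual CMU control signals.

struct CmuState {
    bool ptfqEmpty;                  // no pending thread-fetches
    bool fetchEdsOutstanding;        // an issued fetch ED might still fetch
    bool nextCommandIsIndividualVp;  // non-blocking VP read/write/fetch
};

enum class CmuAction { IssueThreadFetch, IssueVtuCommand, Wait };

CmuAction chooseNext(const CmuState& s, bool vtuCommandAvailable) {
    if (vtuCommandAvailable && s.nextCommandIsIndividualVp)
        return CmuAction::IssueVtuCommand;       // exception: may bypass threads
    if (!s.ptfqEmpty)
        return CmuAction::IssueThreadFetch;      // VP threads run to completion
    if (vtuCommandAvailable && !s.fetchEdsOutstanding)
        return CmuAction::IssueVtuCommand;       // safe: no possible thread-fetches
    return CmuAction::Wait;
}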
[Figure: execute directive types and fields.]
config ED (type, lastVPln, lane vector-len, num pr): configures the cluster vector length and private register count; also indicates which lane the last VP is mapped to.
fetch ED (type, lane VP num, EDSQi, v, AIB cache index): executes the target AIB for the indicated VP, or for all active VPs (v).
vlwb ED (type, sh, seg-n, regT, reg-index): writes back vector load data to the destination register(s) for all active VPs, or to a shared destination (sh).
vsd ED (type, width, seg-n, pred, reg-index): reads and forwards vector store data register(s) for all active VPs.
register ED (type, lane VP num, wr, regT, reg-index): reads or writes (wr) the specified VP register or cross-VP port.

Figure 5-12: Cluster execute directives.

5.3.3 Execute Directives VTU commands issue from the control processor, propagate through the lane command manage- ment units, and the CMUs eventually issue the commands to the individual clusters in the form of execute directives. The format and encoding for cluster execute directives is shown in Figure 5-12. Each cluster contains an execute directive queue (EDQ), and the CMU usually issues an ED to all the clusters in its lane at the same time. In Scale the cluster EDQs have 8 entries, except for the store-data cluster which has 16 entries to improve store decoupling. A VP-configuration or vector length command turns into a config ED for the VTU clusters. A config ED indicates how many active VPs are mapped to the lane and the number of private registers each VP has in that cluster. To aid with cross-VP transfers, each cluster is also informed about which lane the last VP is mapped to. The CMU also tracks the vector length commands, and when there are zero VPs mapped to a lane it will ignore certain VTU commands. A vector-fetch or VP-fetch from the control processor turns into a fetch ED, as does a thread fetch generated by a VP. A fetch ED identifies the target AIB with an index into the AIB cache. It contains a flag to indicate that the AIB should be executed by all active VPs (for a vector-fetch), or otherwise it identifies the particular VP index number which should execute the AIB. Finally, a fetch ED contains an EDSQ index to help the CMU track the status of outstanding fetch EDs. A vector-load command from the control processor is sent to the vector load unit for processing, and a corresponding a vector load writeback (vlwb) ED is sent to the destination cluster. A vlwb ED contains a destination register specifier which includes a register type and an index number. Segment vector loads include a segment size field to indicate the number of elements, this field is set to 1 for regular vector loads. Usually a vlwb ED writes data for every active VP mapped to the lane, but a special flag indicates that it is for a shared vector load and should only execute once. A vector-store command from the control processor is sent to the vector store unit for processing, and a corresponding vector store data (vsd) ED is sent to the store-data cluster. The vsd ED instructs the store-data cluster to forward data to the vector store unit from each active VP mapped to the lane. It includes the source register index, a predication field, the segment size, and the byte-width. VP register write commands are sent to the command management unit in the destination lane. The CMU forwards the data to a special queue in cluster 0. It also sends cluster 0 a register ED which reads the data from this queue and sends it to the destination cluster (if the destination is not c0 itself). A corresponding register ED sent to the destination cluster receives the data from c0 and writes it to the target register. Cross-VP push commands are handled similarly—they are sent to the last VP, and the destination cluster sends the data out on its nextVP port.


Figure 5-13: Generic cluster microarchitecture.

To handle VP register read commands from the control processor, the CMU sends the source cluster a register ED which forwards the data to cluster 0. The CMU also sends a register ED to cluster 0 which receives the data and forwards it to the control processor. Meanwhile, after it issues a register read command, the control processor stalls until the data is available (only one register read at a time is active in the VTU). Cross-VP pop commands are essentially register reads which execute on VP0 and read data from the source cluster’s prevVP port. A cross-VP drop command is similar, except the data is not returned to the control processor.

5.4 Vector-Thread Unit Execution

The execution resources in the vector-thread unit are organized as lanes that are partitioned into independent execution clusters. This section describes how clusters process execute directives from a lane's command management unit, and how data is communicated between decoupled clusters. This section also describes how the memory access cluster and the store-data cluster provide access/execute decoupling, and it discusses the operation of the cross-VP start/stop queue.

5.4.1 Basic Cluster Operation

Each cluster in the VTU operates as an independent execution engine. This provides latency tolerance through decoupling, and it reduces complexity by partitioning control and data paths. Figure 5-13 shows the general structure of a cluster, and Figure 5-14 shows the pipeline diagram.

Figure 5-14: Cluster pipeline.

A cluster contains a register file and an arithmetic and logical unit (ALU), and these resources are used to hold VP registers and execute micro-ops. Micro-ops are held in two instruction registers—the writeback-IR holds a writeback-op and the compute-IR holds a compute-op and a transport-op. A cluster also has an execute directive queue and an AIB cache. The clusters within a lane are interconnected with a broadcast data network, and a cluster's decoupled transport and writeback units send and receive data to/from other clusters. Clusters also connect to their siblings in other lanes with nextVP and prevVP data ports.

Clusters are ultimately controlled by execute directives and these are processed by execute directive sequencers. The fetch ED sequencer handles AIB fetches. Cycle-by-cycle it increments its AIB cache index to step through the target AIB's micro-op bundles and issue them for the specified VP. For a vector-fetch, the fetch ED sequencer repeats this process to execute the AIB for every VP mapped to the lane. It initializes the active VP index to 0, and then increments it after each iteration through the AIB. To save energy, Scale eliminates redundant AIB cache reads when a vector-fetched AIB has only one micro-op bundle. The vector load writeback ED sequencer repeatedly issues writeback-ops for every VP mapped to the lane. These writeback-ops receive data from the vector load unit and write it back to the cluster register file. For segment vector loads, the sequencer issues multiple writeback-ops for each VP. The store-data cluster has a vector store data ED sequencer which issues compute-ops to send store data to the vector store unit. In the most straightforward implementation, only one ED sequencer is active at a time. However, to improve performance, two ED sequencers can chain together and execute simultaneously. ED chaining is described further in Section 5.4.4. Finally, config EDs and register EDs do not use ED sequencers. A config ED updates the number of private registers and the lane vector length when it issues. A register ED issues straight into the compute-IR and writeback-IR.
Scale Reg.   binary encoding   VRI type  VRI index   PRI type  PRI index
pr0-pr14     0-14              RF        0-14        RF        prb+VRI
sr0-sr13     31-18             RF        31-18       RF        VRI
prevVP       15                P         0           NULL      0
null         15                NULL      0           NULL      0
p            15                P         0           P         prb
cr0          16                CR        0           CR        0
cr1          17                CR        1           CR        1
sd0-sd14     0-14              RF        0-14        RF        prb+VRI
sdp          15                P         0           P         prb
sdc          16                CR        0           CR        0

Table 5.3: Mapping of Scale virtual register indices (VRIs) to physical register indices (PRIs). The AIB fill unit recodes the 5-bit binary indices as 7-bit VRIs organized by register type. The VRI to PRI translation is made when a micro-op executes for a particular virtual processor, and it takes into account the VP's private register base index (prb). Note that the P VRI type encodes either prevVP for a source operand or the predicate register for a destination operand.

Resources in the VTU are virtualized, and micro-ops contain virtual register indices (VRIs). When a micro-op executes for a particular virtual processor, the VP number and the VRIs are combined to form physical register indices (PRIs). Table 5.3 shows the mapping. The VRIs for shared registers (31–19) directly map to PRIs. In this way, the shared registers extend downwards from register 31 in the register file. VRIs for private registers are translated by adding the private register index to the VP's base index in the register file (prb). The base index is computed by multiplying a VP's lane index number by the configured number of private registers per VP.

Micro-op bundles execute in a 5-stage pipeline: index (I), fetch (F), decode (D), execute (X), and writeback (W). During the index stage the fetch ED sequencer calculates the index into the AIB cache; this is analogous to the program counter in a traditional processor. Execute directives also issue from the EDQ to the ED sequencers during I. A micro-op bundle is read from the AIB cache during the fetch stage. Also during F, the active VP's base register file index (prb) is calculated. The compute-IR and writeback-IR are updated together at the end of F. At the beginning of the decode stage, virtual register indices are translated into physical register indices. Bypass and stall logic determine if the compute-IR and writeback-IR can issue. The operands for a compute-op may be read from the register file, bypassed from X, bypassed from a writeback, popped from the prevVP port, taken as an immediate from the compute-IR, or already held in a chain-register. Compute-ops go through the ALU in the execute stage, and they write a destination register during the writeback stage.

The cross-VP network links VPs in a ring, and cross-VP dependencies resolve dynamically. A compute-op which uses prevVP as an operand will stall in the decode stage until data is available on this port. When a transport-op targets nextVP, the data is written into the cross-VP queue (xvpq) during the writeback stage. When data is available in the sender's cross-VP queue, and the receiver is ready to use an operand from prevVP, the cross-VP data transfer completes. If the cross-VP queue is empty, the ALU output from the execute stage can be forwarded directly to this port.

The cross-VP network links VPs in a ring, and cross-VP dependencies resolve dynamically. A compute-op which uses prevVP as an operand will stall in the decode stage until data is available on this port. When a transport-op targets nextVP, the data is written into the cross-VP queue (xvpq) during the writeback stage. When data is available in the sender's cross-VP queue, and the receiver is ready to use an operand from prevVP, the cross-VP data transfer completes. If the cross-VP queue is empty, the ALU output from the execute stage can be forwarded directly to this port. This zero-cycle nextVP/prevVP bypass enables back-to-back execution of compute operations in a cross-VP chain. The size of the cross-VP queue determines the maximum number of cross-VP operations within an AIB. In Scale, each cluster's cross-VP queue holds 4 entries. Interaction with the cross-VP start/stop queue is described in Section 5.4.7.

Decoupled transports allow a cluster to continue executing micro-ops when a destination cluster is not ready to receive data. Transport-ops (other than nextVP transports) are enqueued in the transport-op queue (xportopq) after they issue from the compute-IR. Data is written into the transport data queue (xportdq) at the beginning of the writeback stage. When a transport-op reaches the head of the queue, a transport can take place as soon as the data is available. This can occur as early as the writeback stage in the cluster pipeline. When a transport completes, both the transport-op queue and the transport data queue are popped together. Section 5.4.3 describes the transport unit in more detail.

The organization of compute-ops and writeback-ops as a sequence of micro-op bundles defines the relative ordering of their register reads and writes. Within a bundle, the writeback-op precedes the compute-op. A baseline implementation could stall the writeback-IR whenever transport data from an external cluster is unavailable. To improve performance, writeback-ops are instead written into the writeback-op queue (wbopq) at the end of the decode stage. At the beginning of this stage the active writeback-op is selected, either from the head of the queue, or from the writeback-IR if the queue is empty. If transport data is available for the active writeback, the register file or chain register is written during the next cycle (X). With writeback-ops buffered in a queue, the compute-op stream can run ahead of the writeback-op stream as long as register dependencies are preserved. The entries in the writeback-IR and the writeback-op queue always precede the compute-IR. At the beginning of the decode stage, the compute-IR register indices (the PRIs) are compared to all of the writeback-op destination register indices. The compute-IR will stall if any of its sources match a writeback, unless the writeback data is available to be bypassed that cycle. The compute-IR must also stall if its destination matches an outstanding writeback (to preserve the write-after-write ordering). Scale's writeback-op queue has 4 entries. The bypass/stall logic includes 20 7-bit comparators to compare 5 writeback-op indices (4 in the queue plus the writeback-IR) to 4 compute-IR indices (two sources, one destination, and the predicate register).
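As a rough illustration of this interlock, the Python sketch below checks a compute-op's register indices against the outstanding writeback-op destinations; the function name and arguments are hypothetical, and the equality tests stand in for the 20 7-bit comparators mentioned above.

    def compute_ir_must_stall(compute_indices, compute_dest, wb_dests, bypassable):
        """compute_indices: the compute-IR source and predicate PRIs.
        wb_dests: destination PRIs of the writeback-IR plus the writeback-op queue.
        bypassable: writeback PRIs whose data can be bypassed this cycle."""
        for wb in wb_dests:
            if wb == compute_dest:
                return True                   # preserve write-after-write ordering
            if wb in compute_indices and wb not in bypassable:
                return True                   # source matches a pending writeback, no bypass
        return False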

Each VP has a private predicate register and these are held in a 32×1-bit register file indexed by the VP's base register file index (prb). All compute-ops have a predication field which indicates either positive, negative, or no predication. The predicate register is read during the decode stage of the pipeline, or possibly bypassed from a compute-op in X or a writeback-op. If the predication condition is false, the compute-op is killed during D so that it does not propagate down the pipeline. In particular, a compute-op implicitly overwrites the chain registers with its operands when it executes, but a predicated compute-op must not be allowed to modify these registers. We discovered a critical path in Scale which occurs when the predicate register is bypassed from X to determine whether or not a predicated compute-op will overwrite a chain-register that is simultaneously being targeted by a writeback-op. This is a rare condition since a chain-register written by a writeback-op is not usually overwritten by a compute-op. Therefore, Scale stalls compute-ops to eliminate the critical path from a bypassed predicate register to the chain register mux select signals. Another critical path occurs when a predicated compute-op uses prevVP as a source operand—the cross-VP data transfer should only occur if the compute-op actually executes. Again, Scale stalls compute-ops to avoid the critical path from a bypassed predicate register to the ready-to-receive signal for the prevVP port.

5.4.2 Long-latency Operations

Long-latency compute-ops—including multiplies, divides, loads, and floating-point operations—use a separate transport data queue. This avoids the problem of arbitrating between single-cycle and multi-cycle compute-ops for writing into a unified queue. When transport-ops are queued in stage D of the pipeline, they are updated with a special field to indicate their source transport data queue. They then wait at the head of the transport-op queue for data to be available in their source queue. In this way, a cluster emits its external transport data in the correct order, regardless of the latency of its execution units.

Long-latency compute-ops with local destinations have the potential to complicate the cluster bypassing and interlock logic as well as the destination register writeback. The destination register indices for in-flight multi-cycle operations must be checked against the sources and destinations of subsequent compute-ops. Multi-cycle operations must also contend with single-cycle operations for the register file write port. Instead of adding hardware, Scale's long-latency compute-ops use transport-ops and writeback-ops to target local destinations. Then, all of the necessary interlock and bypassing checks, as well as the bypass itself and the register file write, are handled with the existing writeback-op hardware. The extra writeback-op for local destinations is an acceptable cost, especially since long-latency operations usually target remote clusters anyway to improve cluster decoupling. One minor restriction which Scale must impose on long-latency compute-ops is that they are not allowed to directly target nextVP as a destination (an additional copy instruction is required).

In Scale, cluster 3 provides 16-bit multiplies. The multiplier actually completes in a single cycle, but it acts as a long-latency unit to avoid critical paths. This means that if a multiply on c3 is followed by a compute-op which uses its result, the second compute-op will stall for one cycle (assuming that the transport and writeback units are idle). Cluster 3 also has an iterative 32-bit multiplier and divider which shares the transport data queue with the 16-bit multiplier. To handle the contention, only one of these operation types is allowed to be active at a time. Loads on cluster 0 also act as long-latency operations. They use a special transport data queue that is described in more detail in Section 5.4.5.
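To make the ordering mechanism concrete, here is a small Python model of a transport unit with two data queues, one for single-cycle ALU results and one for long-latency results; the class and queue names are hypothetical. Each transport-op is tagged with its source queue when it issues, and the head transport waits for data in that queue, so results leave the cluster in program order.

    from collections import deque

    class TransportUnit:
        def __init__(self):
            self.xportopq = deque()                  # transport-ops, each tagged with a source queue
            self.data_queues = {"alu": deque(), "long": deque()}

        def enqueue_op(self, dests, source):
            self.xportopq.append((dests, source))    # source queue chosen when the op issues in D

        def try_send(self):
            if not self.xportopq:
                return None
            dests, source = self.xportopq[0]
            if not self.data_queues[source]:
                return None                          # head op waits for its data, preserving order
            self.xportopq.popleft()
            return dests, self.data_queues[source].popleft()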

5.4.3 Inter-cluster Data Transfers

Recalling the VP instruction to micro-op translation, all inter-cluster data communication is split into transport and writeback operations, and every transport has a corresponding writeback. Each cluster processes its transport-ops in order, broadcasting result values to the other clusters in the lane on its dedicated transport data bus. Each cluster also processes its writeback-ops in order, writing the values from external clusters to its local register set. A clock cycle is allocated for the transport and writeback, meaning that dependent compute operations on different clusters will have a one cycle bubble between them (which may be filled with an independent compute operation). An implementation could allow transports destined for different clusters to send data out of order, but the in-order implementation allows one transport data bus per cluster rather than a bus connecting every pair of clusters.

It is sufficient for a baseline implementation to only perform inter-cluster data transfers when both the sender and the receiver are ready. However, this can sometimes cause inefficiencies when the sender is ready before the receiver. In this case the sender's transport unit will stall, possibly delaying subsequent transports destined for other clusters. We have found that it is helpful to provide a small amount of buffering to allow a cluster to receive data from different senders out-of-order with respect to its writeback-ops.

dest   from C0              from C1              from C2              from C3              from VLU               kill
C0     C0toC0 / C0fromC0    C1toC0 / C0fromC1    C2toC0 / C0fromC2    C3toC0 / C0fromC3    VLUtoC0 / C0fromVLU    C0kill
C1     C0toC1 / C1fromC0    C1toC1 / C1fromC1    C2toC1 / C1fromC2    C3toC1 / C1fromC3    VLUtoC1 / C1fromVLU    C1kill
C2     C0toC2 / C2fromC0    C1toC2 / C2fromC1    C2toC2 / C2fromC2    C3toC2 / C2fromC3    VLUtoC2 / C2fromVLU    C2kill
C3     C0toC3 / C3fromC0    C1toC3 / C3fromC1    C2toC3 / C3fromC2    C3toC3 / C3fromC3    VLUtoC3 / C3fromVLU    C3kill
SD     C0toSD / SDfromC0    C1toSD / SDfromC1    C2toSD / SDfromC2    C3toSD / SDfromC3    VLUtoSD / SDfromVLU

Table 5.4: Send, receive, and kill control signals for synchronizing inter-cluster data transport operations (within a lane). Each entry lists the send / receive signal pair for the given source (column) and destination (row). The source cluster asserts the send and kill signals, and the destination cluster asserts the receive signal.

Scale clusters have 5 single-entry receive queues, one for each possible sender (the 4 clusters plus the vector load unit). These are implemented as transparent latches, allowing a writeback-op to issue on the same cycle that the corresponding transport data is written into the receive queue (in stage D of the cluster pipeline).

The inter-cluster data transfers are synchronized with send and receive control signals which extend between each cluster and every other cluster in a lane. A list of the signals is shown in Table 5.4. Each cluster asserts its receive signals based on the state of its receive queues: it is ready to receive data in any queue which is not full. In an implementation with no receive queues, the receive signals would be set based on the source of the active writeback-op (csrc). Each cluster asserts its send signals based on the destination bit-vector in its active transport-op (when the transport data is ready). A data transfer occurs whenever a send signal pairs with a receive signal. The sending and receiving clusters detect this condition independently, avoiding the latency of round-trip communication paths. A multi-destination transport-op completes once all of its sends have resolved.

When a VP instruction executes under a false predicate it produces no result and both its local and remote destination writebacks are nullified. Achieving this behavior is not trivial in an implementation with distributed transports and writebacks. To accomplish it, a kill signal is associated with every transport data bus (one per cluster). When a predicated compute-op is nullified in stage D of the pipeline, a special kill bit is set in the transport-op queue entry. When this transport-op reaches the head of the queue, it does not use the transport data queue. Instead of sending data, it asserts the cluster's kill signal. The send/receive synchronization works as usual, and the kill signal is recorded as a kill bit in the destination cluster's receive queue. When a writeback-op sees that the kill bit is set, it nullifies the writeback.
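The handshake for one sender/receiver pair can be sketched as follows in Python, under the assumption that each side evaluates only its own local state plus the other side's control wire; the names are illustrative. A transfer (or a kill) completes exactly when the sender's send signal pairs with the receiver's receive signal, and both sides observe that condition in the same cycle.

    def transfer_fires(send_op, data_ready, dest, recv_queue_empty):
        """True when the sender's send signal for dest pairs with dest's receive signal."""
        send = (send_op is not None and dest in send_op["dests"]
                and (data_ready or send_op["kill"]))   # a killed transport needs no data
        receive = recv_queue_empty                     # single-entry receive queue for this sender
        return send and receive

    # On a firing transfer, the receiver latches either the data or the kill bit into its
    # receive queue for this sender, and the sender marks this destination as done; a
    # multi-destination transport-op retires once all of its sends have resolved.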

5.4.4 Execute Directive Chaining

The goal of execute directive chaining is to allow vector-load, vector-fetch, and vector-store commands to overlap. This is similar to chaining in a traditional vector machine, except there is no chaining of regular compute operations—this is unnecessary since clustering already allows multiple compute operations to execute in parallel. The opportunity for chaining arises from the fact that a vector load writeback ED only uses the writeback portion of the cluster pipeline, leaving the compute portion idle. Conversely, many vector-fetch EDs only use the compute portion of the pipeline, leaving the writeback portion idle. Figure 5-15(a) shows the inefficiency when vector-load and vector-fetch commands execute back-to-back in a cluster without chaining.

The foundation for chaining rests on two properties of a vector (or vector-thread) machine: (1) VPs have independent register sets, and (2) vector commands iterate sequentially through the VP vector. Thus, when a vector-fetch follows a vector-load, it does not need to wait for the vector-load to finish before it can start executing. Instead, VP0 can start executing the fetch as soon as the vector load writeback for VP0 is complete, VP1 can execute the fetch as soon as the vector load writeback for VP1 is complete, and so on.


(a) without chaining (b) with chaining

Figure 5-15: Execute directive chaining. The example scenario includes a vector-load followed by a vector-fetch, repeated three times. The issuing micro-ops in a single cluster are shown, both without (a) and with (b) chaining. The vector-fetched AIB contains an add compute-op (+) followed by a right-shift compute-op (>>). A vector length of 8 is assumed, and lane VP numbers for the issuing writeback-ops and compute-ops are indicated before the colon. Time progresses downwards in the figure.

The overlapped execution is guaranteed not to violate any register dependencies because the registers used by the VP0 fetch cannot be written by the VP1–n writebacks, the registers used by the VP1 fetch cannot be written by the VP2–n writebacks, etc. The argument is symmetric when a vector-load chains on a previous vector-fetch. Figure 5-15(b) shows how the back-to-back vector-load and vector-fetch example executes with chaining. The first vector-fetch chains on the first vector-load, and then the second vector-load chains on the first vector-fetch. In this example the AIB has two compute-ops, so the second vector-load eventually gets throttled by the vector-fetch progress. Eventually, the steady-state has a vector-load chaining on the previous vector-fetch.

To generalize the concept of execute directive chaining we introduce the notion of a primary ED sequencer and a chaining ED sequencer. Recall that ED sequencers operate during the fetch stage of the cluster pipeline to issue micro-ops to the instruction registers (the compute-IR and the writeback-IR). A primary ED sequencer is for the older ED and it is always free to issue micro-ops, with the exception that a new primary ED sequencer must wait when there are any micro-ops from a previous ED sequencer stalled in the IRs. A chaining ED sequencer is for the newer ED and it is only allowed to issue micro-ops when (1) its VP number is less than that of the primary ED sequencer and (2) the primary ED sequencer is not stalled. The primary and chaining ED sequencers are free to issue micro-ops together on the same cycle (into different IRs and for different VPs). A chaining ED sequencer becomes primary after the primary ED sequencer finishes.

The remaining question is when a new execute-directive from the EDQ is allowed to issue to a chaining ED sequencer. The foremost restriction is that chaining is only allowed for vector EDs onto other vector EDs. Non-vector EDs such as those from thread-fetch commands are not allowed to chain on other EDs, and other EDs are not allowed to chain on them. Shared vector-loads are also not allowed to chain since they write to shared registers instead of private VP registers. The second restriction on chaining is that the EDs must not issue micro-ops to the same IRs. Vector load writeback (vlwb) EDs only use the writeback-IR and vector store data (vsd) EDs only use the compute-IR, so these EDs are free to chain on each other. Vector-fetch EDs may use both IRs, depending on the AIB. To enable chaining, the fetch ED sequencer records whether or not the AIB has any compute-ops or writeback-ops as it executes for the first time (for VP0 on the lane). Then, a vlwb ED may chain on a vector-fetch ED as long as the AIB does not have any writeback-ops, and a vsd ED may chain on a vector-fetch ED as long as the AIB does not have any compute-ops. A vector-fetch ED may also chain on a vlwb ED or a vsd ED, but in this case the fetch ED sequencer does not yet know what micro-ops are in the AIB. Therefore, when the fetch ED sequencer is chaining on the vlwb ED sequencer it will stall as soon as it encounters a writeback-op in the AIB. Similarly, when the fetch ED sequencer is chaining on the vsd ED sequencer it will stall as soon as it encounters a compute-op in the AIB. In these cases the fetch ED sequencer continues to stall until it becomes the primary ED sequencer.

Since execute directive chaining causes command operations to be processed out-of-order with respect to the original program sequence, deadlock is an important consideration.
To avoid deadlock, the oldest command must always be able to make progress—a newer command must not be able to block it. As an example of a potential violation, we were tempted to allow a vector-load to chain on a vector-fetch even if the vector-fetch had some writeback-ops. The idea was to allow the vector-load writebacks to slip into any unused writeback-op slots as the vector-fetch executed. However, this imposes a writeback ordering in which some of the vector-load writebacks must precede some of those from the earlier vector-fetch, and in some situations the vector-load writebacks can block the vector-fetch from making progress. For example, consider a vector-load to c0, followed by a vector-fetched AIB in which c0 sends data to c1, followed by a vector-load to c1. In program order, the vector load unit must send a vector of data to c0, then c0 must send a vector of data to c1, and then the vector load unit must send a vector of data to c1.


Figure 5-16: Cluster 0 and store-data cluster microarchitecture. The load data queue (LDQ) and store address queue (SAQ) in cluster 0 enable memory access decoupling.

However, if the vector-load writeback chains on the vector-fetch in c1, it will try to receive data from the vector load unit before receiving all of the data from c0. With c1 waiting for data from the vector load unit, c0 will stall trying to send data to c1, and the vector load unit will stall trying to send data to c0, thereby completing the cycle for deadlock. Again, the fundamental problem is that the vector-load writeback-ops on c1 are able to block the progress of the older vector-fetch micro-ops. We avoid this situation by only allowing vector load writebacks to chain on AIBs which do not have any writeback-ops.
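The chaining rules above can be summarized as a simple eligibility check. The Python sketch below is a simplified reading of those rules (it folds the shared-vector-load exclusion and the deadlock restriction into the same test); the ED type names are labels chosen for the example rather than names taken from the Scale design.

    VECTOR_EDS = {"vector_fetch", "vlwb", "vsd"}

    def may_chain(new_ed, primary_ed, aib_has_compute_ops, aib_has_writeback_ops,
                  new_ed_is_shared_load=False):
        if new_ed_is_shared_load:
            return False                             # shared vector-loads never chain
        if new_ed not in VECTOR_EDS or primary_ed not in VECTOR_EDS:
            return False                             # e.g. thread-fetch EDs never chain
        if new_ed == "vlwb":                         # uses only the writeback-IR
            return primary_ed == "vsd" or (primary_ed == "vector_fetch"
                                           and not aib_has_writeback_ops)
        if new_ed == "vsd":                          # uses only the compute-IR
            return primary_ed == "vlwb" or (primary_ed == "vector_fetch"
                                            and not aib_has_compute_ops)
        # a vector-fetch may begin chaining on vlwb/vsd, but it stalls dynamically if it
        # encounters a micro-op that needs the IR the primary ED sequencer is using
        return primary_ed in {"vlwb", "vsd"}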

5.4.5 Memory Access Decoupling

Indexed accesses are an important category of loads and stores that lack the structure exploited by vector memory commands. Instead, the VPs themselves access data using single-element load and store instructions. Typically, cluster 0 loads data values from memory and sends them to other clusters. Computation is performed on the data and results are returned to c0 and stored to memory. Since loads often miss in the cache, and since stores usually issue before the store data is available, cluster 0 is specially designed to support access/execute decoupling [Smi82, EV96].

Figure 5-16 shows an overview of the load and store resources in cluster 0 and the store-data cluster. To achieve access/execute decoupling, cluster 0 must run ahead in the program stream to fetch data for the compute clusters. An important consequence is that it must continue execution after a load misses in the cache. Loads in c0 are handled as long-latency operations (Section 5.4.2). Loads fetch data into the load data queue (LDQ), and each transport-op indicates whether its data comes from the LDQ or the regular transport data queue. This allows c0 to continue executing new instructions after issuing a load, and it even continues to issue new loads after a load misses in the cache. As a result, the difference between c0's LDQ and a regular transport data queue is that data may be written into the queue out of order. Scale's LDQ has 8 entries.

When c0 executes a load instruction, it generates an address and allocates a slot at the tail of the LDQ. The allocation bumps a tail pointer for the queue, but the data slot itself is marked as invalid. The access is sent to the memory system along with the index of the allocated LDQ slot. Each cycle, the memory system responds with data for up to one request. The response includes the original LDQ index that was sent with the request, and the data is written into this slot and marked as valid. The responses from the memory system may return out of order with respect to the requests, but the LDQ acts as a re-order queue. Regardless of the order in which data is written into the LDQ, it sends out data in the same FIFO order as the original allocation. To do this it simply waits for the head entry to become valid, then it propagates this data and bumps the head pointer.
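A minimal behavioral model of such a re-order queue is sketched below in Python, assuming an 8-entry LDQ as in Scale; the class is illustrative and is not taken from the Scale simulator. Slots are allocated in FIFO order, filled out of order using the index returned with each memory response, and drained strictly in allocation order.

    class ReorderQueue:
        def __init__(self, size=8):                  # Scale's LDQ holds 8 entries
            self.slots = [None] * size
            self.valid = [False] * size
            self.head = 0
            self.tail = 0
            self.count = 0

        def allocate(self):
            """Reserve the tail slot; the returned index travels with the memory request."""
            assert self.count < len(self.slots)
            idx = self.tail
            self.valid[idx] = False
            self.tail = (self.tail + 1) % len(self.slots)
            self.count += 1
            return idx

        def fill(self, idx, data):                   # memory response, possibly out of order
            self.slots[idx] = data
            self.valid[idx] = True

        def pop(self):
            """Drain in FIFO order: wait for the head entry to become valid."""
            if self.count == 0 or not self.valid[self.head]:
                return None
            data = self.slots[self.head]
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
            return data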
Store instructions pose a significant challenge to access/execute decoupling. Loads and compute operations can execute on different clusters to aid with decoupling. But to keep VP loads and stores coherent when they touch the same memory locations, they are both processed by cluster 0. As described previously, stores are actually split into two pieces (Section 5.2.1). The address portion of a store executes on c0, but the data portion executes on the store-data cluster (sd). The store-data cluster is not visible to the programmer, except that store instructions on c0 use special store-data registers which actually reside in sd. Any writes to the store-data registers generate writeback-ops on sd instead of c0.

When a store-address compute-op executes on c0, the generated address is written into a store address queue (SAQ). This leaves c0 free to continue executing other instructions, but it must stall subsequent loads if they are for the same address as a buffered store. To perform the memory disambiguation, Scale uses a conservative but energy-efficient scheme. It holds some signature bits for each buffered store in a special structure called the load/store disambiguation match queue. Since the loads and stores from different VPs do not have ordering constraints, the disambiguation bits can be taken from the VP number as well as the address. The corresponding signature for each load is checked in the match queue, and as long as any bit differs (in every valid entry) the access is guaranteed to be independent. The SAQ in Scale holds up to 4 addresses, and through experimentation we found it sufficient to use only the low-order 2 bits of the VP number (and no address bits) in the load/store disambiguation match queue. Assuming an AIB with one or more VP loads and one VP store, 4 VPs at a time can have buffered stores in c0, but the 5th VP will have to wait for the first VP's store to complete. Eliminating address bits from the disambiguation makes it easier to stall loads early in the cycle. Otherwise, the clock period constraints might make it necessary to speculatively issue a load but nullify it if a conflict is detected late in the cycle.
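For illustration, this conservative check reduces to comparing small signatures, as in the Python sketch below (assuming Scale's choice of the low-order 2 bits of the VP number and no address bits); the function is hypothetical.

    def load_may_issue(load_vp, buffered_store_vps):
        """A load may issue only if its signature differs from every valid entry."""
        sig = load_vp & 0x3                          # 2-bit VP-number signature
        return all((vp & 0x3) != sig for vp in buffered_store_vps)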

The store-data cluster executes special micro-op bundles that include a writeback-op for the store-data registers and a compute-op that corresponds to the data portion of the original store instructions. As in a regular cluster, a store-data compute-op will wait until any previous writeback-ops which target its source register have resolved. Store-data compute-ops also stall until an address is available in the SAQ. When a store-data compute-op eventually executes, it reads the source data from the register file (or from the store-data chain register), pops an address from the SAQ, and issues the store to the memory system. In addition to the store-data registers, the store-data cluster also holds the store-data predicate registers. When a store instruction is predicated based on c0/sdp, the compute-op on sd nullifies the store if the predicate evaluates to false. The store-address compute-op on c0 is not affected by the predication.

We also considered an alternative design which does not use an independent store-data cluster. Instead, cluster 0 holds a single store-data chain register, and both writes to this register and store operations are deferred in a decoupled store queue. This is a workable solution, but it means that vector-store commands interfere with both the source cluster which holds the data and cluster 0 which sends the data to the memory system. By adding private store-data registers held in an independent store-data cluster, vector-stores become simpler and more efficient. The vector store data execute directives operate solely in sd and do not interfere with c0 or the other clusters.

5.4.6 Atomic Memory Operations

Atomic memory instructions (lw.atomic.and, lw.atomic.or, and lw.atomic.add) can be used to synchronize VPs and allow them to share memory coherently. In Scale, an atomic VP instruction is broken into three separate cluster 0 operations: a load compute-op which locks the memory location, an arithmetic/logical compute-op, and a store compute-op which unlocks the memory location. As usual, the store is partitioned into an address and a data portion. The lock implementation is coarse-grained in Scale—the memory system actually blocks other requesters from accessing an entire cache bank until the unlock releases it.

When cluster 0 encounters a lw.lock compute-op, it stalls it in the decode stage until all previous stores have completed and the store-data cluster is caught up and ready to execute the corresponding store. Once it issues the load, a local c0 writeback-op waits for the data to return, possibly after a cache miss. After the data returns, the subsequent compute-op performs the atomic operation. The store address and data compute-ops issue as usual with no special logic. The memory system automatically unlocks the locked bank when it sees the store. An atomic memory operation occupies a cache bank for 7 cycles in Scale.

An atomic memory instruction may be conditional based on the VP's predicate register in c0, and the corresponding compute-ops on c0 will execute subject to c0/p as usual. However, since stores are usually only predicated based on c0/sdp, special care must be taken for the store-data compute-op. When the predicate is false, the lw.lock in c0 asserts a special signal that instructs the store-data cluster to nullify the store.
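As a behavioral illustration of the coarse-grained lock, the Python sketch below models a cache bank that denies other requesters while an atomic sequence is in flight; it is an assumption-laden simplification (single bank, no request queuing) rather than a description of the actual cache control logic.

    class LockableBank:
        """A cache bank that is locked by a lw.lock and released by the unlocking store."""
        def __init__(self):
            self.locked_by = None

        def request(self, requester, lock_load=False, unlock_store=False):
            if self.locked_by is not None and requester != self.locked_by:
                return "denied"                 # bank is blocked until the lock is released
            if lock_load:
                self.locked_by = requester      # the atomic load locks the whole bank
            if unlock_store:
                self.locked_by = None           # the store automatically unlocks the bank
            return "granted"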

5.4.7 Cross-VP Start/Stop Queue

The software interface for a VT machine connects VPs in a unidirectional ring. Instructions can directly use operands from prevVP and write results to nextVP. The cross-VP start/stop queue is an architecturally visible structure which connects the last VP in the vector back to VP0. The control processor can also push and pop data from this queue.

Recall that each cluster in Scale has an independent cross-VP network; an overview is shown in Figure 5-17. The nextVP and prevVP ports of sibling clusters in adjacent lanes are connected, and lane 3 connects back to lane 0. As described above (Section 5.4.1), each cluster has a cross-VP queue (xvpq) and the data transfers are orchestrated with send and receive control signals. The VTU also contains a cross-VP start/stop queue unit (xvpssq) for each cluster. The xvpssq controls the prevVP port of lane 0 in order to provide data for VP0.


Figure 5-17: Cross-VP Start/Stop Queue. The cross-VP network for one cluster is shown (cX); all four clusters have identical networks.

When lane 0 is executing micro-ops for VP0, the xvpssq sends data from its internal queue; otherwise it propagates the nextVP data directly from lane 3. Managing the write into the start/stop queue from the last VP is more difficult since the lane which holds the last VP is a function of the configurable vector length. Furthermore, the lanes and clusters in the VTU are decoupled, so they do not necessarily agree on the vector length at any given point in time. Still, when the vector length does change, the former last VP must be allowed to complete any nextVP sends to the start/stop queue before the new last VP is allowed to write to the queue. To accomplish this, each xvpssq keeps track of its view of the current lastVP-lane, and it will only receive cross-VP data from the cluster in that lane. The clusters track their own lastVP-lane status independently. When a cluster with lastVP-lane status processes a vector length command (in the form of a config ED), it stalls until (1) its pipeline and cross-VP queue do not have buffered nextVP sends, and (2) the xvpssq agrees that it has lastVP-lane status. Then the cluster communicates with the xvpssq to pass the lastVP-lane status to another lane. This token-passing protocol handles the corner case in which back-to-back vector length commands cause a cluster to gain and lose lastVP-lane status before the previous lastVP-lane cluster has relinquished control.

When a cluster with lastVP-lane status actually executes micro-ops for the last VP, it stalls them in the decode stage until its cross-VP queue is empty. Then, any nextVP sends are sent to the cross-VP start/stop queue instead of being sent out on the regular nextVP port. Physically, a single nextVP data bus is broadcast to both the sibling cluster in the next lane and to the xvpssq. The xvpssq receives nextVP inputs from the four lanes, and a mux selects write data from the lastVP-lane. Regular send/receive synchronization between the clusters and the xvpssq ensures that a transfer only completes when the xvpssq and the cluster agree on the lastVP-lane. Although regular cross-VP transfers between VPs can have a zero-cycle latency (Section 5.4.1), a transfer through the cross-VP start/stop queue has a latency of one cycle.

A reasonable question is whether the storage for the architecturally visible cross-VP start/stop queue can actually be implemented using the regular cluster cross-VP queues. Such a design would still need to carefully handle vector length changes to track which lane supplies prevVP data for VP0. However, the real problem with such an approach is the buffering required if sends to nextVP are allowed to precede receives from prevVP in an AIB. In this case each lane must be able to buffer the maximum amount of cross-VP data per AIB plus the cross-VP start/stop queue data. Compared to doubling the size of each cluster's cross-VP queue, it is more efficient to implement an independent start/stop queue.

5.5 Vector Memory Units

Scale's vector memory units include a vector load unit, a vector store unit, and a vector refill unit. These units process vector load and store commands and manage the interface between the VTU and the memory system.

Vector memory unit command fields (field widths in bits):

vector load unit command (74 bits):    sd(1)  c(2)  u(1)  sh(1)  us(1)  width(2)  seg-n(4)  address(27)  vector-length(8)  stride(27)
vector refill unit command (70 bits):  sh(1)  us(1)  width(2)  seg-n(4)  address(27)  vector-length(8)  stride(27)
vector store unit command (69 bits):   us(1)  width(2)  seg-n(4)  address(27)  vector-length(8)  stride(27)

Figure 5-18: Vector memory unit commands.

Commands at the head of the unified VTU command queue (Figure 5-9) are reformatted into the unit commands shown in Figure 5-18. The vector memory unit commands are buffered in queues to support decoupling; in Scale, the VRU command queue has 2 entries, the VLU command queue has 4 entries, and the VSU command queue has 10 entries.

The VLU and VSU each process one vector command at a time using either a single unit-stride address generator or per-lane segment-strided address generators. The VLU manages per-lane vector load data queues (VLDQs) which buffer load data and broadcast it to the clusters. The data is received in order by vector load writebacks in the destination cluster. In Scale, the VLDQs have 8 entries each. The VSU receives vector store data from the store-data cluster in each lane, and it combines the lane data into store requests for the memory system. The VRU scans vector-load commands and issues cache-line refill requests to prefetch data.

The per-lane portions of the VLU and VSU are organized together as a lane vector memory unit (LVMU). The LVMU internally arbitrates between the segment-strided vector load and store requests as well as the VP load and store requests from the lane itself. It chooses up to one lane request per cycle to send to the memory system, and the cache address and data paths are shared by segment-strided accesses and VP accesses. The priority order for the LVMU in Scale is (from high to low): VP load, VP store, vector load, vector store. An overview of the vector load and store units is shown in Figure 5-19 and the vector refill unit is shown in Figure 5-21. Together, the vector memory units have 11 address generators and 7 request ports to the memory system.

5.5.1 Unit-Stride Vector-Loads

Unit-stride is a very common memory access pattern which can readily be exploited to improve efficiency. A single cache access can fetch several data elements, thereby reducing address bandwidth demands and avoiding the bank conflicts which independent requests could incur. For unit-stride commands, the VLU issues cache requests for up to 4 data elements at a time. Each cache request is for a naturally aligned 128-bit, 64-bit, or 32-bit data block, depending on the byte width of the vector-load (word, halfword, or byte). Although the address alignment may cause the first access to be for fewer elements, all of the subsequent accesses until the last will be for 4 elements. With each request, the VLU specifies one of 4 possible rotations for returning the 4 elements in the block to the 4 lanes, and it also specifies which of the lanes should receive the data. The VLU address generator iterates through the command until the requested number of elements equals the vector length.

The centralized VLU logic issues requests for unit-stride vector load commands, but the data itself is returned to the lanes. Since the vector load data queue in each lane is used for both unit-stride and segment-strided data, it is challenging to ensure that the data is sent to the lane in the correct order. The VLDQ is a re-order queue similar to the LDQ in cluster 0.

Figure 5-19: Vector load and store units microarchitecture. The major address and data paths are shown.

Figure 5-20: Unit-stride tag table. This is a logical view of the unit-stride tag table; physically, it is distributed in the LVMUs so that the VLDQ indices do not need to be communicated back-and-forth.

When the VLU issues a unit-stride load request, it allocates a VLDQ entry in each lane which will receive a data element from the access. The lane VLDQ indices may all be different, so the VLU stores them in a slot which it allocates in its unit-stride tag table (Figure 5-20). It sends this single tag table index as the tag associated with the memory system request. When the memory system eventually responds, the unit-stride tag table is read to determine which VLDQ slots the incoming data should be written to. Physically, the unit-stride tag table slots are distributed across the lanes so that the VLDQ indices do not need to be communicated back-and-forth to a centralized controller. In Scale, the unit-stride tag table has 4 entries, allowing up to 4 unit-stride vector-load requests at a time.

The data communication from the LVMU to the clusters in a lane fits within the regular transport/writeback framework. The LVMU has a transport-op queue which is similar to that in the clusters. Whenever a VLDQ slot is allocated for a memory request, a corresponding transport-op is queued to indicate the destination cluster. Once the transport-op reaches the head of the queue, it waits for data to be available at the head of the VLDQ. The LVMU broadcasts the data over a bus to all of the clusters, and it asserts a send signal to identify the destination cluster. The destination cluster will have a corresponding writeback-op generated by a vector load writeback execute directive, and the send and receive signals synchronize to complete the transfer. Shared vector loads are also handled by the VLU's unit-stride address generator. It issues a request for one element at a time, and the same data is returned to all of the lanes.
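The address-generation loop for unit-stride word loads can be sketched as follows in Python; the request format and the rotation encoding are illustrative assumptions that follow the description above (up to four elements per naturally aligned 128-bit block, with a rotation that maps block word positions onto lanes).

    def unit_stride_word_requests(base_addr, vector_length):
        """Break a unit-stride word vector-load into aligned 128-bit block requests."""
        reqs, addr, vp = [], base_addr, 0
        while vp < vector_length:
            word_off = (addr & 0xF) >> 2                 # word position within the 128-bit block
            n = min(4 - word_off, vector_length - vp)    # first/last requests may cover fewer elements
            rotation = (vp - word_off) & 0x3             # lane of block word w is (w + rotation) % 4
            lanes = [(vp + i) & 0x3 for i in range(n)]   # lanes that receive data from this request
            reqs.append({"addr": addr, "elems": n, "rotation": rotation, "lanes": lanes})
            addr += 4 * n
            vp += n
        return reqs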

5.5.2 Segment-Strided Vector-Loads

Like unit-stride memory access patterns, segment-strided access patterns present an opportunity to fetch several data elements with a single cache access. A lane's segment data is buffered in the memory system on the cache side of the read crossbar, and returned to the lane at a rate of one element per cycle. Compared to using multiple strided accesses, segments reduce address bandwidth and avoid bank conflicts. Segment and strided vector loads are distributed to the independent lane vector memory units to be processed at a rate of up to 4 accesses per cycle. The parameters for the LVMU address generator include the segment size and the stride between segments. Regular strided loads are simply handled as single-element segments.

When it distributes a segment-strided command to the lanes, the central VLU logic must calculate a base address for each lane.

Since VPs are striped across the lanes, these base addresses are calculated as the original base address plus the stride multiplied by the lane number. The address calculations do not require any multipliers: lane 0 uses the original base address, lane 1 uses the base address plus the stride, lane 2 uses the base address plus twice the stride (a shift by 1), and lane 3 uses the base address plus the stride plus twice the stride (which is computed as the lane 1 base plus twice the stride). To aid the LVMUs, the central VLU logic also calculates seg-stride-diff: the stride multiplied by the number of lanes, minus the segment size in bytes. After reaching the end of a segment, the LVMUs increment their address by this value to get to the beginning of the next segment.

Segments contain up to 15 sequential elements which may be arbitrarily aligned relative to cache blocks. The LVMU issues memory system requests for up to 4 sequential elements at a time, subject to the constraint that all the elements must be within the same naturally aligned 128-bit data block. For each access, the LVMU calculates the number of elements in its request as a function of the current address offset, and it increments its address by the requested number of bytes. The LVMU maintains a count of the number of elements requested, and once it reaches the end of a segment it increments its address by seg-stride-diff to advance to the next segment. It repeats this process to request a segment of elements for every VP according to the vector length.

To handle segment requests, the vector load data queue in the LVMU is a special re-order queue that can allocate 1–4 slots at a time. To simplify the logic, Scale only allows a segment request to issue when at least 4 slots are free. Even though several elements are requested at the same time, the memory system only returns one element per cycle and these are written into the VLDQ as usual. The transport-op for a segment request includes the requested number of elements, and it executes over several cycles to transport the data to the destination cluster. After it issues a segment command to the LVMUs, the central VLU controller waits until they have all finished processing it before proceeding with the next vector-load command. Thus, the unit-stride and segment-strided address generators in the VLU are never active at the same time.
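A small Python sketch of the per-lane address setup follows, assuming four lanes and using only shifts and adds for the lane bases, as described above; the names are illustrative. The second function generates one lane's element addresses, advancing by seg-stride-diff at each segment boundary.

    def lane_base_addresses(base, stride):
        """Per-lane bases computed without multipliers (shift-and-add only)."""
        lane1 = base + stride
        lane2 = base + (stride << 1)
        lane3 = lane1 + (stride << 1)            # base + 3*stride, formed from the lane 1 base
        return [base, lane1, lane2, lane3]

    def lane_element_addresses(lane_base, stride, seg_elems, elem_bytes, num_segments,
                               num_lanes=4):
        """One lane's element addresses for a segment-strided vector load or store."""
        seg_stride_diff = stride * num_lanes - seg_elems * elem_bytes
        addrs, addr = [], lane_base
        for _ in range(num_segments):            # one segment per VP mapped to this lane
            for _ in range(seg_elems):
                addrs.append(addr)
                addr += elem_bytes
            addr += seg_stride_diff              # jump to this lane's next segment
        return addrs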

5.5.3 Unit-Stride Vector-Stores

Like unit-stride vector-loads, unit-stride vector-stores allow many data elements to be combined into a single cache access. The VSU breaks unit-stride stores into accesses that fit in a 128-bit block and have up to one element per lane. The data is produced by the vector store data execute directives in the lane store-data clusters, and the VSU waits until all of the active lanes are ready. The lanes' regular control and data paths are used to send the store to the memory system, with each lane providing 32 bits of data, 4 byte-enables, and a 2-bit index to indicate which word in the 128-bit block is targeted. For each active lane, the VSU calculates the address offset for the associated element within the block. The lane byte-enables and word index are a function of the byte width of the vector-store and the address offset. The data and byte-enables from all the active lanes are merged together in the memory system, and the VSU provides the address for the 128-bit store request. The VSU address generator iterates through the store command until the number of elements stored equals the vector length.

5.5.4 Segment-Strided Vector-Stores

The command management and addressing for segment and strided vector stores work similarly to vector loads. The central VSU logic calculates the base address for each lane and seg-stride-diff for the lane stride. Each LVMU address generator operates independently and breaks segments into 128-bit store requests. Regular strided stores are handled as single-element segments. The LVMUs use the lanes' regular 32-bit data paths to send segment store data to the memory system.


Figure 5-21: Vector refill unit.

To take advantage of segments and reduce cache accesses, each lane manages a 128-bit store buffer located in the memory system on the cache side of the crossbar. The store-data cluster supplies the data and the LVMU controls which bytes in the buffer each element is written to. In addition to buffering the data, the memory system also buffers the associated byte-enables. After it reaches either the last element in a segment or the end of a 16-byte block, the LVMU issues the store request to the memory system. Once a segment is complete the LVMU increments its address by seg-stride-diff to advance to the next segment, and it repeats this process to store a segment of elements for every VP according to the vector length.

5.5.5 Refill/Access Decoupling

The vector refill unit exploits structured memory access patterns to optimize cache refills. The VRU receives the same commands as the vector load unit, and it runs ahead of the VLU issuing refill requests for the associated cache lines. The goal of refill/access decoupling is to prefetch data ahead of time so that the VLU accesses are hits. When lines are prefetched with a single VRU request, the cache does not need to buffer many VLU misses for each in-flight line.

An overview of the VRU is shown in Figure 5-21. It issues up to one refill request per cycle to the cache and has two refill modes: unit-stride and strided. It uses the unit-stride refill mode for unit-stride vector loads and segment-strided vector loads with positive strides less than the cache line size. In this mode, the VRU calculates the number of cache lines in the memory footprint once when it begins processing a command, and then issues cache-line refill requests to cover the entire region. The hardware cost for the segment-strided conversion includes a comparator to determine the short stride and a small multiplier to calculate the vector length (7 bits) times the stride (5 bits). For the strided refill mode, the VRU repeatedly issues a refill request to the cache and increments its address by the stride until it covers the entire vector. It also issues an additional request whenever a segment straddles a cache-line boundary.

Three important throttling techniques prevent the VLU and the VRU from interfering with each other. (1) Ideally, the VRU runs far enough ahead so that all of the VLU accesses hit in the cache. However, when the program is memory bandwidth limited, the VLU can always catch up with the VRU. The VLU interference wastes cache bandwidth and miss resources and can reduce overall performance.


Figure 5-22: Control processor pipeline overview. The CP pipeline includes fetch (F), decode (D), execute (X), and writeback (W) stages. A decoupled fetch unit manages the program counter and a small L0 instruction buffer. Load data writebacks are decoupled from the execution pipeline and use a second register file write port. The control processor issues commands to the VTU command queue, and it may also write back results from the VTU.

As a solution, the VLU is throttled to prevent it from having more than a small number of outstanding misses. (2) The VLU has four address generators and can therefore process segment-strided vector loads that hit in the cache at a faster rate than the VRU. Whenever the VLU finishes a vector load command before the VRU, the VRU aborts the command. (3) If the VRU is not constrained, it will run ahead and use all of the miss resources, causing the cache to block and preventing other requesters from issuing cache hits. The VRU is throttled to reserve resources for the other requesters.

When a program is compute-limited, the VRU may run arbitrarily far ahead and evict data that it has prefetched into the cache before the VLU even accesses it. To avoid this, the VRU is also throttled based on the distance that it is running ahead of the VLU. To track distance, the VRU counts the number of cache-line requests that it makes for each vector load command (including hits). After finishing each command, it increments its total distance count and enqueues that command's distance count in its distance queue (see Figure 5-21). Whenever the VLU finishes a vector load, the VRU pops the head of the distance queue and decrements its total distance count by the count for that command. A single distance count is sufficient for programs that use the cache uniformly, but it can fail when a program has many accesses to a single set in the cache. As a solution, the VRU can maintain per-set counts, and throttle based on the maximum per-set distance. To reduce the hardware cost, the distance-sets that the VRU divides its counts into may be more coarse-grained than the actual cache sets.
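The distance-based throttle can be captured in a few lines of Python, as in the hedged sketch below (a single total count, with no per-set refinement); the class and method names are invented for the example.

    from collections import deque

    class RefillDistanceThrottle:
        """Tracks how far the VRU runs ahead of the VLU, in cache-line requests."""
        def __init__(self, max_distance):
            self.max_distance = max_distance
            self.total_distance = 0
            self.current_count = 0          # cache-line requests for the command in progress
            self.distq = deque()            # per-command counts, popped as the VLU catches up

        def vru_may_issue(self):
            return self.total_distance + self.current_count < self.max_distance

        def vru_issued_line_request(self):  # counted even if the line hits in the cache
            self.current_count += 1

        def vru_finished_command(self):
            self.total_distance += self.current_count
            self.distq.append(self.current_count)
            self.current_count = 0

        def vlu_finished_command(self):
            if self.distq:
                self.total_distance -= self.distq.popleft()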

5.6 Control Processor

The control processor ISA is a simplified version of MIPS with no branch delay slot. Scale's control processor is based on a traditional single-issue RISC pipeline.

Regular instructions go through fetch (F), decode (D), execute (X), and writeback (W) stages. Load and store instructions issue to the memory system during X. Figure 5-22 shows an overview of the control processor pipeline.

To interface with Scale's non-blocking cache, the control processor includes a decoupled load data writeback unit. The CP may have multiple loads in flight, and these may return from the memory system out of order. The load data writeback unit tracks the destination register indices for the in-flight loads and uses a dedicated register file port to write back the data. Subsequent instructions stall if they depend on an outstanding load, but otherwise they are free to execute and write back before a previous load completes. Load data returning from the cache may be bypassed directly to a dependent instruction in stage D of the pipeline. Scale's cache has a two cycle hit latency, so there will be two bubble cycles between a load and use. To cover the pipelining and cache hit latency, Scale's control processor supports up to 3 in-flight loads.

The CP processes VTU commands as described in Section 5.3.1. It issues commands to the VTU command queue during stage X of the pipeline. After issuing a command that returns a result (vprd or xvppop), the control processor stalls until the data is available from the VTU. After issuing a vsync or vpsync command, a subsequent load or store instruction will stall until the VTU synchronization is complete (Section 5.3.1).

The control processor supports user and kernel execution modes. The 32-bit virtual addresses in Scale programs are simply truncated to index physical memory. However, to aid debugging, user mode addresses are restricted to the top half of virtual memory (i.e., they have bit-31 set) and violations trigger an exception [Asa98]. Scale's exception and interrupt handling interface is derived from the MIPS system control coprocessor (CP0) [KH92]. Exceptions include address protection and misalignment faults, syscall and break instructions, illegal instructions, and arithmetic overflow. Scale also traps floating point instructions and emulates them in software. Scale supports both external and timer-based interrupts. The control processor handles exceptions and interrupts by switching to kernel mode and jumping to a predefined entry point in the kernel instruction space.

5.6.1 L0 Instruction Buffer

The control processor fetch unit uses a small L0 instruction buffer which is located on the cache side of the read crossbar. This buffer simplifies the CP pipeline by providing a single-cycle access latency and avoiding cache arbitration. Scale's L0 buffer caches 128 bytes, and it can hold any contiguous sequence of 32 instructions whose starting address is aligned on a 16-byte boundary. By not restricting the alignment to 128-byte boundaries, the L0 buffer is able to cache short loops regardless of their starting address.

The CP's program counter and the tag state for the L0 buffer are intricately linked. A 28-bit pc-base-addr indicates the starting address of the L0 sequence (which may be any naturally aligned 16-byte block), and a 5-bit pc-offset points to the current instruction within the sequence. By default, the 5-bit pc-offset increments every cycle to implement sequential instruction fetch. Whenever it wraps around (from 31 to 0), the pc-base-addr is incremented to point to the next 128-byte block. Unconditional jumps resolve in stage D of the pipeline and always load both PC registers. Conditional branches resolve in stage X of the pipeline, but they do not update pc-base-addr if the target is located within the L0 buffer. A short loop may initially incur two L0 misses if the pc-offset wraps around the first time it executes. However, after the backward branch the L0 buffer will be loaded based on the starting address of the loop.

Whenever pc-base-addr changes, the fetch unit automatically issues eight 16-byte requests to the primary cache to refill the L0 buffer. This causes a 2-cycle penalty to retrieve the first new block (assuming that it is present in the primary cache).


Figure 5-23: Cache microarchitecture.

The CP fetch unit maintains valid bits for the blocks within the L0 buffer, and it sets these when its refill accesses hit in the primary cache. The fetch unit can generate several cache misses when it tries to refill the L0 buffer. Instead of tracking these, whenever the block it is trying to access is not valid, it simply polls the primary cache until it gets a hit.

With regular sequential instruction fetch, a taken branch (in X) must nullify the invalid instructions in stages F and D of the pipeline. Scale uses a small branch target buffer (BTB) to eliminate the penalty for taken branches within the L0 instruction buffer. Scale's BTB has 4 entries which each include 5-bit source and target pc-offsets. Every cycle the current pc-offset is broadcast to all the entries in the BTB. If there is no match, the fetch unit uses sequential instruction fetch. If there is a match, a taken branch is predicted and the BTB provides the target pc-offset. Entries are inserted into the BTB whenever a branch within the L0 buffer is taken, and invalidated whenever a branch within the L0 buffer is not taken. All entries are invalidated whenever the L0 buffer moves to a new block of instructions.
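The PC handling for the L0 buffer can be illustrated with a short Python sketch; it assumes word-sized instructions and models only the pc-base-addr/pc-offset split and a simplified BTB lookup as described above, with invented names (it ignores the 4-entry BTB capacity limit).

    class L0Fetch:
        """pc-base-addr selects a 16-byte-aligned block; pc-offset indexes 32 instructions."""
        def __init__(self, pc):
            self.pc_base = pc >> 4              # 28-bit base (16-byte block address)
            self.pc_offset = (pc >> 2) & 0x1F   # 5-bit instruction offset
            self.btb = {}                       # source pc-offset -> target pc-offset

        def pc(self):
            return (self.pc_base << 4) + (self.pc_offset << 2)

        def next_fetch(self):
            if self.pc_offset in self.btb:      # predicted-taken branch inside the L0 buffer
                self.pc_offset = self.btb[self.pc_offset]
                return
            self.pc_offset = (self.pc_offset + 1) & 0x1F
            if self.pc_offset == 0:
                self.pc_base += 8               # next 128-byte block; an L0 refill starts here
                self.btb.clear()                # BTB entries only cover the current L0 contents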

5.7 Memory System

The units on the processor side of the Scale VT architecture issue load and store requests to the memory system. These include 8-, 16-, 32-, and 128-bit accesses, as well as segment requests. Each cycle, requests to the memory system are either granted or denied. A store is complete once accepted by the memory system. Loads use a request/response protocol that allows variable latency and re-ordering of requests. When requesters support multiple outstanding loads, each is tagged with a unique identifier that the memory system tracks and returns with the response. The memory system is an architectural abstraction that may have various implementations.


Figure 5-24: Cache access pipeline.

The Scale processor includes an on-chip 32 KB non-blocking primary cache and a connection to an external main memory; an overview is shown in Figure 5-23. The cache uses a line size of 32 bytes, and it uses CAM tags to provide 32-way set-associativity. To provide sufficient bandwidth for the many requesters, the cache is partitioned into 4 independent banks. Each bank can read or write 128 bits every cycle, and cache lines are striped across the banks. An arbiter controls access to the banks. Read and write crossbars provide the interface between the bank and requester data ports, and the crossbars include load and store segment buffers.

Cache accesses use a three-stage pipeline shown in Figure 5-24. Requests are sent at the beginning of M0, and the arbiter determines grant signals by the end of the same cycle. Up to 4 successful requests are selected by the address crossbar during M0 and extending into M1. The tags are searched in M1 to determine whether each access is a hit or a miss and to obtain the RAM indices. Also during M1 the write crossbar prepares the store data. The data bank is read or written during M2, and load data is returned across the read crossbar. The control processor load/store unit and the host interface share a port into the cache.

5.7.1 Non-blocking Cache Misses

The non-blocking Scale cache continues to operate after a miss. The cache supports both primary misses and secondary misses. A primary miss is the first miss to a cache line and causes a refill request to be sent to main memory, while secondary misses are accesses which miss in the cache but are for the same cache line as an earlier primary miss. Misses are tracked in the miss status handling registers using primary miss tags and replay queues. Primary miss tags hold the address of an in-flight refill request and are searched on every miss to determine if it is a secondary miss. A primary miss allocates a new primary miss tag to hold the address, issues a refill request, and adds a new replay queue entry to a linked list associated with the primary miss tag. Replay queue entries contain information about the access including the corresponding cache line offset, the byte width, the destination register for loads, and a pending data pointer for stores. A secondary miss adds a new entry to the appropriate replay queue, but does not send an additional refill request to main memory. The bank blocks if it runs out of primary miss tags or replay queue entries. The Scale cache supports 32 primary misses (8 per bank), and these may each have up to 5 pending replays.
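The primary/secondary bookkeeping can be summarized with a small behavioral sketch of one bank's miss status handling (illustrative C++ only; the 8-tag and 5-replay limits follow the text, while the field layout and names are assumptions).

    #include <cstdint>
    #include <vector>

    // Simplified per-bank miss status handling: primary miss tags plus
    // per-tag replay queues, as described above.
    struct ReplayEntry {
        uint32_t line_offset;     // offset of the access within the cache line
        uint8_t  byte_width;      // 1, 2, 4, or 16 bytes
        bool     is_load;
        uint8_t  dest_reg;        // destination register for loads
        uint16_t store_data_ptr;  // pointer into the pending store data buffer
    };

    struct PrimaryMissTag {
        bool     valid = false;
        uint32_t line_addr = 0;                 // address of the in-flight refill
        std::vector<ReplayEntry> replay_queue;  // a linked list in hardware
    };

    struct BankMSHR {
        static constexpr int kPrimaryTags = 8;  // 32 total across the 4 banks
        static constexpr int kMaxReplays  = 5;  // pending replays per tag
        PrimaryMissTag tags[kPrimaryTags];

        // Returns false if the bank must block (out of tags or replay slots).
        // issue_refill is set when a refill request must be sent to memory.
        bool handle_miss(uint32_t line_addr, const ReplayEntry& re, bool& issue_refill) {
            issue_refill = false;
            // Search the primary miss tags: a match means a secondary miss.
            for (PrimaryMissTag& t : tags) {
                if (t.valid && t.line_addr == line_addr) {
                    if (t.replay_queue.size() >= kMaxReplays) return false;  // block
                    t.replay_queue.push_back(re);  // no additional refill request
                    return true;
                }
            }
            // Primary miss: allocate a new tag and send a refill request.
            for (PrimaryMissTag& t : tags) {
                if (!t.valid) {
                    t.valid = true;
                    t.line_addr = line_addr;
                    t.replay_queue = {re};
                    issue_refill = true;
                    return true;
                }
            }
            return false;  // no free primary miss tag: the bank blocks
        }
    };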

When a primary miss occurs, a new destination line for the refill is allocated in FIFO order. If the victim line is dirty, an eviction is needed. Refill requests are buffered in a memory request queue in each bank, and the external memory interface unit arbitrates between the banks. It sends refill requests to the external main memory controller, followed by the eviction request if necessary. When a refill returns from main memory, the arbiter blocks other requesters to allow the line to be written into the data bank. Then the cache controller processes the replay queue cycle-by-cycle to return load data to the requesters and to update the line with the pending store data. Since multiple banks can be returning load replay data to the same requester on a cycle, replays must arbitrate for the writeback ports through the main cache arbiter.

Data for pending store misses is held in a per-bank pending store data buffer, and store replay queue entries contain pointers into this buffer. The storage for Scale's pending store data buffer is implemented as 8 extra 128-bit rows in each cache bank data array (0.5 KB total). Secondary store misses are merged into an already allocated store data buffer entry where possible. We also experimented with a no-write-allocate policy. In such a design, store misses are held in the replay queues while waiting to be sent to memory. These non-allocating stores are converted to allocating stores if there is a secondary load miss to the same cache line before the data is sent out. A no-write-allocate policy is supported by the Scale microarchitectural simulator, but it is not implemented in the Scale chip.

5.7.2 Cache Arbitration

The cache arbitrates between 10 requesters and also handles replays, refills, and evictions. It uses a fixed priority arbiter for each of the four banks, and each request targets only one bank as determined by the low-order address bits. The internal cache operations are given priority over the external requesters: refills have highest priority since the external memory interface cannot be stalled, followed by evictions, and then replays. Of the requesters, the host interface is given highest priority, followed by the control processor data and instruction requests, and then the VTU instruction fetch. These requesters are given high priority to ensure that they can make progress even if all the lanes are continuously accessing a single bank. The lanes themselves are next, and the arbiter uses a round-robin policy between them to ensure fairness. The vector memory units have lowest priority: first the vector load unit, then the vector store unit, and finally the vector refill unit. Given the relatively high cache access bandwidth and the decoupling resources in the VTU, we found that the ordering of requester priorities does not usually have a large effect on performance.

In addition to arbitrating for the cache bank, some requesters must arbitrate for other shared resources. The vector load unit and the lanes both return data to the lane writeback ports, so a VLU request is killed whenever there is a lane request, even if they are for different banks. Also, since load replays use the same writeback ports as cache hits, they preclude any original request by the same unit. This also means that VLU replays kill lane requests in order to reserve the lane writeback ports. Finally, a host interface request kills a CP data access since they use the same address and data paths to access the cache.
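A minimal sketch of this fixed-priority selection is shown below (illustrative C++; the ordering follows the priorities in the text, but the real arbiter also implements the shared-port and kill conditions described above, which are omitted here).

    #include <array>

    // Requester priority for one cache bank, highest first. Internal
    // operations (refill, eviction, replay) outrank all external requesters;
    // the four lanes are selected round-robin for fairness.
    enum Requester {
        REFILL, EVICTION, REPLAY,
        HOST_IF, CP_DATA, CP_INST, VTU_FETCH,
        LANE0, LANE1, LANE2, LANE3,
        VLU, VSU, VRU,
        NUM_REQ
    };

    // Pick the winner among the requesters asserting 'request' this cycle.
    // 'lane_rr' is the round-robin pointer among the four lanes.
    int arbitrate(const std::array<bool, NUM_REQ>& request, int lane_rr) {
        for (int r = REFILL; r < LANE0; ++r)        // fixed high-priority group
            if (request[r]) return r;
        for (int i = 0; i < 4; ++i) {               // round-robin over the lanes
            int lane = LANE0 + ((lane_rr + i) % 4);
            if (request[lane]) return lane;
        }
        for (int r = VLU; r < NUM_REQ; ++r)         // vector memory units last
            if (request[r]) return r;
        return -1;                                  // no request granted
    }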

5.7.3 Cache Address Path

The Scale cache has 10 requester address ports. The addresses are held in a low-latch by the requesters. The address crossbar consists of four ten-input muxes to select the active address in each bank. The outputs of the address crossbar go through a high-latch and are sent to the tag banks. Each tag bank is organized as eight 32-entry fully-associative subbanks, with low-order address bits selecting the subbank. An address is broadcast to all the entries in the subbank, and these each have a local comparator. The 32 match lines are OR-ed together to produce the hit signal, and also encoded into a 5-bit set index. These 5 bits are combined with 4 low-order address bits to create a 9-bit index to address a 128-bit word in the data bank.

Figure 5-25: Read crossbar. Each data bank is divided into four 32-bit-wide subbanks. The horizontal busses are driven by tri-state drivers. All wires in the figure are 32 bits wide.
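The index formation can be summarized with a short sketch (illustrative C++). The bit positions below are assumptions consistent with 32-byte lines striped across four banks, eight 32-entry subbanks per tag bank, and a 9-bit index into each data bank; the actual Scale bit assignments and concatenation order may differ.

    #include <cstdint>

    // Illustrative decomposition of a cache access address. The exact bit
    // positions are assumptions for the sketch, not taken from the Scale RTL.
    struct CacheIndex {
        uint32_t bank;     // which of the 4 banks
        uint32_t subbank;  // which 32-entry fully-associative subbank
        uint32_t tag;      // value searched against the 32 CAM entries
        uint32_t word;     // 128-bit word within the 32-byte line
    };

    inline CacheIndex decompose(uint32_t addr) {
        CacheIndex ci;
        ci.word    = (addr >> 4) & 0x1;  // two 128-bit words per 32-byte line
        ci.bank    = (addr >> 5) & 0x3;  // lines striped across the banks
        ci.subbank = (addr >> 7) & 0x7;  // low-order bits pick the subbank
        ci.tag     = addr >> 10;         // remaining bits compared by the CAM
        return ci;
    }

    // A CAM hit produces a 5-bit set index; combining it with the 4 low-order
    // bits (subbank and word select here) yields the 9-bit data RAM index.
    inline uint32_t ram_index(uint32_t set_index, const CacheIndex& ci) {
        return (set_index << 4) | (ci.subbank << 1) | ci.word;
    }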

5.7.4 Cache Data Path

Each data bank provides a 128-bit read/write port and includes support for per-byte write-enables. The banks are partitioned into four 32-bit subbanks, and only a subset of these are activated when an access is for fewer than 128 bits. The data bank connects to read and write crossbars, shown in Figures 5-25 and 5-26, to send data to and receive data from the requesters.

All of the requester read ports are 32 bits, except for the 128-bit VTU instruction port. The lane ports are either controlled together by a unit-stride vector-load request from the VLU, or they are controlled by independent load requests from the lanes. Each 32-bit requester port is driven by one of 16 possible 32-bit words from the data banks (there are four banks with four output words each). The 16-to-1 muxes are implemented as long busses with distributed tri-state drivers. Byte and halfword load data for the lanes and the CP is constructed in two stages: first the correct 32-bit word is broadcast on the requester read port, then muxes on this port control the byte selection and sign-extension.

In some cases the output data bus from a bank directly drives the requester busses in the read crossbar. This is how the cache controls the VTU instruction port and the control processor data port. Other requester ports include one or more 128-bit buffers adjacent to the bank. This is the case for the lane segment buffers (1 slot), the control processor instruction buffer (2 slots), and the victim eviction buffer (2 slots). The buffers are implemented as low-latches that the bank data passes through before connecting to the requester read busses. The transparent latches allow the data buffers to be both written and read on the same cycle that the data bank is accessed.

Figure 5-26: Write crossbar. Only one bank is shown, but the horizontal wires extend to all four banks and the crossbar in each is identical. All wires in the figure are 8 bits wide. The 16 crossbar muxes send 128 bits of write data to the bank.

The write crossbar receives 32-bit store data from six requesters: the four lanes, the combined control processor and host interface data port, and the external memory interface unit. This data is broadcast to all of the banks across long busses. For unit-stride vector-store requests, the data from multiple lanes is merged together and sent to a single bank's write port. Lane stores are handled independently and may be sent to different banks. The requesters are responsible for aligning byte and halfword data within their 32-bit port, which they do by simply replicating bytes four times and halfwords twice. Therefore, each byte in a bank's 128-bit write port comes from the corresponding byte in one of the requester ports, and a simple 6-to-1 mux selects the data. The write crossbar includes a 128-bit store buffer for each lane adjacent to each bank. Similar to the segment load buffers, these are implemented with transparent low-latches, allowing the lane store data to pass through within the M1 pipeline stage. The latches are used to implement the lanes' segment store buffers, and they are also used to temporarily buffer data when a store request is denied by the cache arbiter.
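The store alignment scheme (replicate sub-word data across the 32-bit port and let the per-byte write-enables select the target bytes) can be illustrated with a small sketch (C++; the helper and field names are invented for the example).

    #include <cstdint>

    // A requester aligns sub-word store data within its 32-bit port simply by
    // replication; the per-byte write-enables then select which bytes of the
    // bank's 128-bit port are actually written.
    struct AlignedStore {
        uint32_t data;         // replicated store data on the 32-bit port
        uint8_t  byte_enable;  // one bit per byte within the 32-bit word
    };

    inline AlignedStore align_store(uint32_t value, unsigned width_bytes,
                                    uint32_t addr) {
        AlignedStore s{};
        unsigned offset = addr & 0x3;  // byte offset within the 32-bit word
        switch (width_bytes) {
        case 1:
            s.data = (value & 0xff) * 0x01010101u;    // replicate the byte x4
            s.byte_enable = 1u << offset;
            break;
        case 2:
            s.data = (value & 0xffff) * 0x00010001u;  // replicate the halfword x2
            s.byte_enable = 0x3u << (offset & ~1u);
            break;
        default:
            s.data = value;                           // full 32-bit store
            s.byte_enable = 0xf;
            break;
        }
        return s;
    }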

5.7.5 Segment Accesses

When the cache receives a segment load request from a lane, a data block is read from the bank and the first element is sent over the read crossbar as usual. The data is also latched in the lane's 128-bit segment buffer adjacent to the bank. The remaining elements in the segment are sent over the read crossbar one-by-one during the following cycles. The cache arbiter continues to reserve the lane writeback port during these cycles, but the bank itself is available to handle other requests. The original request tag is returned with the first element and the tag is incremented with each subsequent element (the lane uses the tag to determine which VLDQ slot to write the data to). If a segment load misses in the cache, it will eventually be replayed using exactly the same control logic as the hit case.

Unlike segment loads, the element accesses for segment stores are handled by the lanes. Each lane manages its segment buffer and builds up a request over several cycles, including both the data and the byte-enables. The store request is sent to the cache as the last element is written to the buffer, and it is handled as a regular single-cycle 128-bit access with no special logic.

Even though each lane has only one logical load segment buffer and one logical store segment buffer, the lane buffers are physically replicated in each bank. This allows the wide 128-bit datapaths between the buffers and the data RAMs to remain within the bank, while narrow 32-bit datapaths in the crossbar link the lanes to the buffers. The area impact of the replication is not great since the buffers are physically located beneath the crossbar wires.
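A sketch of how a lane could build up a segment store request over several cycles, as described above (illustrative C++ only; the buffer organization shown is an assumption, not the Scale implementation).

    #include <cstdint>

    // Each lane accumulates segment store elements into its 128-bit buffer,
    // together with byte-enables; the request is issued to the cache only when
    // the last element has been written, and is then handled as a single-cycle
    // 128-bit access.
    struct SegmentStoreBuffer {
        uint32_t data[4] = {0, 0, 0, 0};  // four 32-bit slots = 128 bits
        uint16_t byte_enables = 0;        // one bit per byte of the 128-bit block
        unsigned elements_written = 0;

        // Returns true when all elements are present and the store can be sent.
        bool write_element(unsigned slot, uint32_t value, unsigned num_elements) {
            data[slot & 0x3] = value;
            byte_enables |= 0xfu << (4 * (slot & 0x3));
            return ++elements_written == num_elements;
        }
    };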

Chapter 6

Scale VT Processor Implementation

This chapter describes a prototype hardware implementation of the Scale vector-thread architecture. The Scale VT processor has 7.14 million transistors, around the same number as the Intel Pentium II, the IBM/Motorola PowerPC 750, or the MIPS R14000. However, as part of a university research project, we could not devote industry-level design effort to the chip building task. In this chapter, we describe the implementation and verification flow that enabled us to build an area and power efficient prototype with only two person-years of effort. The Scale chip proves the plausibility of vector-thread architectures and demonstrates the performance, power, and area efficiency of the design.

6.1 Chip Implementation

The Scale design contains about 1.4 million gates, a number around 60 times greater than that of a simple RISC processor. To implement Scale with limited manpower, we leverage Artisan standard cells and RAM blocks in an ASIC-style flow targeting TSMC's 180 nm 6 metal layer process technology (CL018G). We use Synopsys Design Compiler for synthesis and Cadence SoC Encounter for place-and-route.

In establishing a tool flow to build the Scale chip, we strove to make the process as automated as possible. A common design approach is to freeze blocks at the RTL level and then push them through to lower levels of implementation using hand-tweaked optimizations along the way. Instead, our automated flow enables an iterative chip-building approach more similar to software compilation. After a change in the top-level RTL, we simply run a new iteration of synthesis and place-and-route to produce an updated full chip layout, even including power and ground routing. Despite this level of automation, the tool flow still allows us to pre-place datapath cells in regular arrays, to incorporate optimized memories and custom circuit blocks, and to easily and flexibly provide a custom floorplan for the major blocks in the design. With our iterative design flow, we repeatedly build the chip and optimize the source RTL or floorplan based on the results.

We decided to use a single-pass flat tool flow in order to avoid the complexities involved in creating and instantiating hierarchical design blocks. As the design grew, the end-to-end chip build time approached two days. We are satisfied with this methodology, but the runtimes would become prohibitive for a design much larger than Scale.

6.1.1 RTL Development

Our early research studies on Scale included the development of a detailed microarchitecture-level simulator written in C++. This simulator is flexible and parameterizable, allowing us to easily evaluate the performance impact of many design decisions. The C++ simulator is also modular with a design hierarchy that we carried over to our hardware implementation.

We use a hybrid C++/Verilog simulation approach for the Scale RTL. After implementing the RTL for a block of the design, we use Tenison VTOC to translate the Verilog into a C++ module with input and output ports and a clock-tick evaluation method. We then wrap this module with the necessary glue logic to connect it to the C++ microarchitectural simulator. Using this methodology we are able to avoid constructing custom Verilog test harnesses to drive each block as we develop the RTL. Instead, we leverage our existing set of test programs as well as our software infrastructure for easily compiling and running directed test programs. This design approach allowed us to progressively expand the RTL code base from the starting point of a single cluster, to a single lane, to four lanes; and then to add the AIB fill unit, the vector memory unit, the control processor, and the memory system. We did not reach the milestone of having an initial all-RTL model capable of running test programs until about 14 months into the hardware implementation effort, 5 months before tapeout. The hybrid C++/Verilog approach was crucial in allowing us to easily test and debug the RTL throughout the development.
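The wrapping idea can be pictured with a short sketch. The VtocModule class below is a hypothetical stand-in for a translated Verilog block (it is not the actual VTOC-generated interface); it only illustrates how a block with ports and a clock-tick evaluation method can be driven by the C++ microarchitectural simulator.

    #include <array>
    #include <cstdint>

    // Hypothetical stand-in for a VTOC-translated Verilog block: input and
    // output ports plus a clock-tick evaluation method. The real generated
    // class differs; this stub only exists so the wrapping idea compiles.
    class VtocModule {
    public:
        void set_input(int port, uint32_t value) { in_[port] = value; }
        uint32_t get_output(int port) const { return out_[port]; }
        void tick() { out_[0] = in_[0]; }  // placeholder for RTL evaluation
    private:
        std::array<uint32_t, 4> in_{}, out_{};
    };

    // Glue logic that lets the C++ microarchitectural simulator drive the RTL
    // block: each simulated cycle, copy request signals in, tick the module,
    // and copy the responses back out for checking against the reference model.
    class RtlBlockWrapper {
    public:
        explicit RtlBlockWrapper(VtocModule& m) : rtl_(m) {}
        uint32_t step(uint32_t request_bits) {
            rtl_.set_input(0, request_bits);
            rtl_.tick();
            return rtl_.get_output(0);
        }
    private:
        VtocModule& rtl_;
    };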

6.1.2 Datapath Pre-Placement

Manually organizing cells in bit-sliced datapaths improves timing, area, and routing efficiency compared to automated placement [CK02]. However, to maintain an iterative design flow, a manual approach must still easily accommodate changes in the RTL or chip floorplan. We used a C++-based procedural datapath tiler which manipulates standard cells and creates design databases using the OpenAccess libraries. The tool allows us to write code to instantiate and place standard cells in a virtual grid and create nets to connect them together. After constructing a datapath, we export a Verilog netlist together with a DEF file with relative placement information.

We incorporate datapath pre-placement into our CAD tool flow by separating out the datapath modules in the source RTL; for example, the cluster datapaths for Scale include the ALU, shifter, and many 32-bit muxes and latches. We then write tiler code to construct these datapaths and generate cell netlists. During synthesis we provide these netlists in place of the source RTL for the datapath modules, and we flag the pre-placed cells as dont touch. In this way, Design Compiler can correctly optimize logic which interfaces with the datapath blocks. During the floorplanning step before place-and-route, we use scripts to flexibly position each datapath wherever we want on the chip. These scripts process the relative placement information in the datapath DEF files, combining these into a unified DEF file with absolute placement locations. We again use dont touch to prevent Encounter from modifying the datapath cells during placement and optimization. We use Encounter to do the datapath routing automatically; this avoids the additional effort of routing by hand, and we have found that the tool does a reasonable job after the datapath arrays have been pre-placed.

As a simple example of the ease with which we can create pre-placed datapath arrays, Figure 6-1(a) shows a small snippet of Verilog RTL from Scale which connects a 32-bit mux with a 32-bit latch. Figure 6-1(b) shows the corresponding C++ code which creates the pre-placed datapath diagrammed in Figure 6-1(c). The placement code is simple and very similar to the RTL; the only extra information is the output drive strength of each component. The supporting component builder libraries (dpMux2 and dpLatch_h_en) each add a column of cells to the virtual grid in the tiler (tl). By default, the components are placed from left to right. In this example, the dpMux2 builder creates each two-input multiplexer using three NAND gates. The component builders also add the necessary clock-gating and driver cells on top of the datapath, and the code automatically sets the size of these based on the bit-width of the datapath.

(a) Verilog RTL:

    wire [31:0] xw_sd_pn;
    wire [31:0] w_sd_np;

    dpMux2 #(32) xw_sd_mux
      ( .s   (xw_sdsel_pn),
        .i0  (xw_reg_pn),
        .i1  (xw_sdc_pn),
        .o   (xw_sd_pn));

    dpLatch_h_en #(32) w_sd_hl
      ( .clk  (clk),
        .d_n  (xw_sd_pn),
        .en_p (x_sden_np),
        .q_np (w_sd_np));

(b) C++ pre-placement code:

    Tiler tl;
    tl.addNet ( "xw_sd_pn", 32 );
    tl.addNet ( "w_sd_np", 32 );

    dpMux2 ( tl, Drive::X1, 32,
      "xw_sd_mux",
      "xw_sdsel_pn",
      "xw_reg_pn",
      "xw_sdc_pn",
      "xw_sd_pn");

    dpLatch_h_en ( tl, Drive::X1, 32,
      "w_sd_hl",
      "clk",
      "xw_sd_pn",
      "x_sden_np",
      "w_sd_np");

(c) Datapath cells: the pre-placed array for bit-slices 31 down to 0, with the xw_sd_mux column built from NAND2X1 cells and the w_sd_hl column from TLATX1 cells, and the clock-gating, select driver, and clock driver cells placed on top (diagram not reproduced here).

Figure 6-1: Datapath pre-placement code example.

We used our datapath pre-placement infrastructure to create parameterizable builders for components like muxes, latches, queues, adders, and shifters. It is relatively straightforward to assemble these components into datapaths, and easy to modify the datapaths as necessary. In the end, we pre-placed 230 thousand cells, 58% of all standard cells in the Scale chip.

6.1.3 Memory Arrays

Wherever possible, we use single-port and two-port (read/write) RAMs created by the Artisan memory generators. These optimized memories provide high-density layout and routing with six or eight transistors per bitcell. To incorporate the RAM blocks into our CAD tool flow, we use behavioral RTL models for simulation. These are replaced by the appropriate RAM data files during synthesis and place-and-route.

Artisan does not provide memory generators suitable for Scale's register files and CAM arrays. We considered using custom circuit design for these, but decided to reduce design effort and risk by building these components out of Artisan standard cells. The two-read-port two-write-port register file bitcell is constructed by attaching a mux cell to the input of a latch cell with two tri-state output ports. We found that the read ports could effectively drive 8 similar bitcells. Since our register file has 32 entries, we made the read ports hierarchical by adding a second level of tri-state drivers. For the CAM bitcell we use a latch cell combined with an XOR gate for the match. The OR reduction for the match comparison is implemented using a two-input NAND/NOR tree which is compacted into a narrow column next to the multi-bit CAM entry. For the cache-tag CAMs which have a read port in addition to the match port, we add a tri-state buffer to the bitcell and use hierarchical bitlines similar to those in the register file.

We pay an area penalty for building the register file and CAM arrays out of standard cells; we estimate that our bitcells are 3–4× larger than a custom bitcell. However, since these arrays occupy about 17% of the core chip area, the overall area impact is limited to around 12%. We found that these arrays are generally fast enough to be off of the critical path; for example, a register file read only takes around 1.4 ns including the decoders. The power consumption should also be competitive with custom design, especially since our arrays use fully static logic instead of pre-charged dynamic bitlines and matchlines.
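As a rough check on the 12% figure (a back-of-the-envelope estimate using the numbers above, taking 3.5× as the midpoint of the stated 3–4× range): if the 17% of core area now spent on these standard-cell arrays were instead custom bitcells 3.5× smaller, the core would shrink by roughly

$$17\% - \frac{17\%}{3.5} \approx 12\%.$$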

6.1.4 Clocking

Scale uses a traditional two-phase clocking scheme which supports both latches and flip-flops. The datapaths primarily use latches to enable time-borrowing across clock phases, as eliminating hard clock-edge boundaries makes timing optimization easier. Also, the pre-placed register file, CAM, and RAM arrays use latches to reduce area. A standard cell latch optimized for memory arrays with a tri-state output is around 41% smaller than a flip-flop. The control logic for Scale primarily uses flip-flops, as this makes the logic easier to design and reason about.

The Scale design employs aggressive clock-gating to reduce power consumption. The architecture contains many decoupled units, and typically several are idle at any single point in time. Control signals for operation liveness and stall conditions are used to determine which pipeline registers must be clocked each cycle. In many cases these enable signals can become critical paths, especially when gating the clock for low latches and negative-edge triggered flip-flops and RAMs. We make the enable condition conservative where necessary in order to avoid slowing down the clock frequency. We considered adding global clock-gating for entire clusters in Scale, but abandoned this due to the critical timing and complexity of ensuring logical correctness of the enable signal.

We incorporate clock-gating logic into the custom designed datapath components. Typically, an enable signal is latched and used to gate the clock before it is buffered and broadcast to each cell of a multi-bit flip-flop or latch. This reduces the clock load by a factor approximately equal to the width of the datapath, e.g. 32× for a 32-bit latch. We implement similar gating within each entry in the register file and CAM arrays, but we also add a second level of gating which only broadcasts the clock to all of the entries when a write operation is live. For clock-gating within the synthesized control logic, we use Design Compiler's automatic insert_clock_gating command. To enable this gating, we simply make the register updates conditional based on the enable condition in the source Verilog RTL. Design Compiler aggregates common enable signals and implements the clock-gating in a similar manner to our datapath blocks. Out of 19,800 flip-flops in the synthesized control logic, 93% are gated with an average of about 11 flip-flops per gate. This gating achieves a 6.5× reduction in clock load when the chip is idle.

In addition to gating the clock, we use data gating to eliminate unnecessary toggling in combinational logic and on long buses. For example, where a single datapath latch sends data to the shifter, the adder, and the logic unit, we add combinational AND gates to the bus to only enable the active unit each cycle.

6.1.5 Clock Distribution

The clock on the Scale chip comes from a custom designed voltage-controlled oscillator that we have used successfully in previous chips. The VCO frequency is controlled by the voltage setting on an analog input pad, and it has its own power pad for improved isolation. We can also optionally select an external clock input, and divide the chosen root clock by any integer from 1–32. There are a total of around 94,000 flip-flops and latches in the design. An 11-stage clock tree is built automatically by Encounter using 688 buffers (not counting the buffers in the pre-placed datapaths and memory arrays). The delay is around 1.5–1.9 ns, the maximum trigger-edge skew is 233 ps, and the maximum transition time at a flip-flop or latch is 359 ps. We use Encounter to detect and fix any hold-time violations. Figure 6-2 highlights all of the clocked cells in Scale and shows their relative clock skew. The regular cell layout in the pre-placed datapaths and memory arrays makes it easier for the clock tree generator to route the clock while minimizing skew.

Figure 6-2: Chip clock skew. All of the clocked cells are highlighted. Cells with the lowest clock latency are dark blue and cells with the highest clock latency are red. The few red cells have an intentionally delayed clock edge to allow time-borrowing across pipeline stages.


Figure 6-3: Chip power grid. A small section of two standard cell rows is shown.

6.1.6 Power Distribution

The standard cell rows in Scale have a height of nine metal-3 or metal-5 tracks. The cells get power and ground from metal-1 strips that cover two tracks, including the spacing between the tracks. The generated RAM blocks have power and ground rings on either metal-3 (for two-port memories) or metal-4 (for single-port SRAMs).

A power distribution network should supply the necessary current to transistors while limiting the voltage drop over resistive wires. We experimented with some power integrity analysis tools while designing Scale, but mainly we relied on conservative analysis and an over-designed power distribution network to avoid problems. We tried running fat power strips on the upper metal layers of the chip, but these create wide horizontal routing blockages where vias connect down to the standard cell rows. These wide blockages cause routing problems, especially when they fall within the pre-placed datapath arrays.

We settled on a fine-grained power distribution grid over the entire Scale chip; a diagram is shown in Figure 6-3. In the horizontal direction we route alternating power and ground strips on metal-5, directly over the metal-1 strips in the standard cell rows. These use two routing tracks, leaving seven metal-3 and metal-5 tracks unobstructed for signal routing. In the vertical direction, we route alternating power and ground strips on metal-6. These cover three metal-2 and metal-4 tracks, leaving nine unobstructed for signal routing. Figure 6-4 shows a plot and photomicrograph of the power grid, including some data signals that are routed on metal layers 5 and 6. Figure 6-5 shows plots and photomicrographs of the lower metal layers, 1–4. Regularly spaced pillars are visible where the power grid contacts down to the metal-1 power strips in the standard cell rows, and the data signals are routed around these blockages. Overall, the power distribution uses 21% of the available metal-6 area and 17% of metal-5. We estimate that this tight power grid can provide more than enough current for Scale, and we have found that the automatic router deals well with the small blockages.

Figure 6-4: Plot and corresponding photomicrograph showing chip power grid and routing. Metal layers 5 (horizontal) and 6 (vertical) are visible, and the height is seven standard cell rows (about 35 µm).

Another power supply concern is the response time of the network to fast changes in current (di/dt). At our target frequency range, the main concern is that the chip has enough on-chip decoupling capacitance to supply the necessary charge after a clock edge. Much of the decoupling capacitance comes from filler cells which are inserted into otherwise unused space on the chip, as shown in Figure 6-6. Around 20% of the standard cell area on the chip is unused, and filler cells with decoupling capacitance are able to fit into 80% of this unused area, a total of 1.99 mm2. Scale uses a maximum of around 2.4 nC (nano-coulombs) per cycle, and we estimate that the total on-chip charge storage is around 21.6 nC. This means that, even with no response from the external power pads, a single cycle can only cause the on-chip voltage to drop by around 11%. We deemed this an acceptable margin, and we rely on the external power pads to replenish the supply before the next cycle.
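The 11% figure is a first-order estimate that follows directly from the charge numbers above, assuming no charge is supplied by the external power pads within the cycle:

$$\frac{\Delta V}{V} \approx \frac{Q_{\mathrm{cycle}}}{Q_{\mathrm{stored}}} = \frac{2.4\ \mathrm{nC}}{21.6\ \mathrm{nC}} \approx 11\%.$$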

6.1.7 System Interface

The Scale chip test infrastructure includes three primary components: a host computer, a general test baseboard, and a daughter card with a socket for the Scale chip (see Figure 6-7). Scale is packaged in a 121-pin ceramic package (PGA-121M). The test baseboard includes a host interface and a memory controller implemented on a Xilinx FPGA as well as 24 MB of SDRAM, adjustable power supplies with current measurement, and a tunable clock generator. The host interface uses a low-bandwidth asynchronous protocol to communicate with Scale, while the memory controller and the SDRAM use a synchronous clock that is generated by Scale. Using this test setup, the host computer is able to download and run programs on Scale while monitoring the power consumption at various voltages and frequencies. The Scale chip has 32-bit input and output ports to connect to an external memory controller.


Figure 6-5: Plots and corresponding photomicrographs of chip metal layers. The photomicrographs are of the same die (which has had metal layers 5 and 6 removed), but they are taken at different focal depths. The area shown is around 80 µm×60 µm.

Figure 6-6: Chip filler cells. All of the decoupling capacitance filler cells in the chip are highlighted. Note that the image scaling can make the highlighted cells appear to occupy more area than they actually do.


Figure 6-7: Block diagram of Scale test infrastructure.

With its non-blocking cache, the chip itself has the ability to drive a high-bandwidth memory system through these ports. However, the existing test infrastructure limits the memory interface to 16 bits in each direction. More significantly, the SDRAM in the test infrastructure is only 12 bits wide and the frequency of the memory controller is limited by the outdated Xilinx FPGA. The few bits of memory bandwidth per Scale clock cycle will certainly limit performance for applications with working sets that do not fit in Scale's 32 KB cache. However, we can emulate systems with higher memory bandwidth by running Scale at a lower frequency so that the memory system becomes relatively faster. To enable this, the Scale clock generator supports configurable ratios for Scale's internal clock, the external memory system clock, and the memory interface bus frequency. The chip even supports a special DDR mode in which data is transferred on both edges of the Scale clock. This allows us to emulate a memory bandwidth of up to 4 bytes in and out per Scale cycle.

For testing purposes, the Scale chip supports a mode of operation in which cache misses are disabled and the memory system is limited to the on-chip 32 KB RAM. In this mode, the cache arbiter and data paths are still operational, but all accesses are treated as hits. As the tape-out deadline approached, it became apparent that the frequency of the design was limited by some long paths through the cache tags and the cache miss logic which we did not have time to optimize. Since the primary goal for the chip is to demonstrate the performance potential of the vector-thread unit, and given the memory system limitations imposed by the testing infrastructure, we decided to optimize timing for the on-chip RAM mode of operation. Unfortunately, this meant that we had to essentially ignore some of the cache paths during the timing optimization, and as a result the frequency when the cache is enabled is not as high as it could be.

6.2 Chip Verification

We ensure correct operation of the Scale chip using a three-step process: First, we establish that the source RTL model is correct. Second, we prove that the chip netlist is equivalent to the source RTL. Third, we verify that the chip layout matches the netlist.

6.2.1 RTL Verification

Our overall strategy for verifying the Scale RTL model is to compare its behavior to the high-level Scale VM simulator (Section 4.11.3). We use the full chip VTOC-generated RTL model, and the C++ harness downloads programs over the chip I/O pins through the host interface block. After the program runs to completion, the output is read from memory using this same interface and compared to reference outputs from the Scale VM simulator.

In addition to a suite of custom directed tests and the set of application benchmarks for Scale, we developed VTorture, a random test program generator. The challenge in generating random tests is to create legal programs that also stress different corner cases in the design. VTorture randomly generates relatively simple instruction sequences of various types, and then randomly interleaves these sequences to construct complex yet correct programs. By tuning parameters which control the breakdown of instruction sequence types, we can stress different aspects of Scale. Although our full chip RTL simulator runs at a modest rate, on the order of 100 cycles per second, we used a compute farm to simulate over a billion cycles of the chip running test programs. In addition to running test programs on the full RTL model, we also developed a test harness to drive the Scale cache on its own. We use both directed tests and randomly generated load and store requests to trigger the many possible corner cases in the cache design.

The VTOC-based RTL simulator models the design using two-state logic. This is sufficient for most testing; however, we must be careful not to rely on uninitialized state after the chip comes out of reset. To complete the verification, we use Synopsys VCS as a four-state simulator. We did not construct a new test harness to drive VCS simulations. Instead, we use the VTOC simulations to generate value change dump (VCD) output which tracks the values on the chip I/O pins every cycle. Then we drive a VCS simulation with these inputs, and verify that the outputs are correct every cycle. Any use of uninitialized state will eventually propagate to X's on the output pins, and in practice we did not find it very difficult to track down the source. Unfortunately, the semantics of the Verilog language allow unsafe logic in which uninitialized X values can propagate to valid 0/1 values [Tur03]. In particular, X's are interpreted as 0's in conditional if or case statements. We dealt with this by writing the RTL to carefully avoid these situations, and by inserting assertions (error messages) to ensure that important signals are not X's. For additional verification that the design does not rely on uninitialized state, we run two-state VTOC simulations with the state initialized with all zeros, all ones, and random values.
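The interleaving idea behind VTorture can be sketched as follows (illustrative C++ only; the real generator emits Scale assembly, uses per-type weights, and enforces legality constraints that are omitted here, e.g. by giving each sequence its own registers).

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    // Toy version of the interleaving step: given several self-contained
    // instruction sequences (each legal on its own), randomly interleave them
    // one instruction at a time so the combined program is complex but still
    // correct.
    std::vector<std::string> interleave_sequences(
        std::mt19937& rng, const std::vector<std::vector<std::string>>& seqs) {
        std::vector<std::string> program;
        if (seqs.empty()) return program;

        std::vector<size_t> cursor(seqs.size(), 0);
        std::uniform_int_distribution<size_t> pick(0, seqs.size() - 1);

        size_t remaining = 0;
        for (const auto& s : seqs) remaining += s.size();

        while (remaining > 0) {
            size_t s = pick(rng);
            if (cursor[s] < seqs[s].size()) {   // skip exhausted sequences
                program.push_back(seqs[s][cursor[s]++]);
                --remaining;
            }
        }
        return program;
    }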

6.2.2 Netlist Verification

The pre-placed datapaths on Scale are constructed by hand, and therefore error prone. We had great success using formal verification to compare the netlists from the pre-placed datapaths to the corresponding datapath modules in the source RTL. The verification is done in seconds using Synopsys Formality. As a double-check on the tools, we also use formal verification to prove equivalence of the source RTL to both the post-synthesis netlist and the final netlist after place-and-route. These Formality runs take many hours and use many gigabytes of memory, but they removed the need for gate-level simulation.

    process technology        TSMC 180 nm, 6 metal layers
    transistors               7.14 M
    gates                     1.41 M
    standard cells            397 K
    flip-flops and latches    94 K
    core area                 16.61 mm2
    chip area                 23.14 mm2
    frequency at 1.8 V        260 MHz
    power at 1.8 V            0.4–1.1 W
    design time               19 months
    design effort             24 person-months

Table 6.1: Scale chip statistics.

6.2.3 Layout Verification

The final verification step is to check the GDSII layout for the chip. We convert the final chip netlist from Verilog to a SPICE format and use Mentor Graphics Calibre to run a layout-versus-schematic (LVS) comparison. This check ensures that the transistors on the chip match the netlist description. LVS can catch errors such as routing problems which incorrectly short nets together. We also use Calibre to run design rule checks (DRC) and to extract a SPICE netlist from the layout.

We use Synopsys Nanosim to simulate the extracted netlist as a final verification that the chip operates correctly. This is the only step which actually tests the generated Artisan RAM blocks, so it serves as an important check on our high-level Verilog models for these blocks. The Nanosim setup uses VCD traces to drive the inputs and check the outputs every cycle (as in the VCS simulations, the VCD is generated by simulations of the source RTL). To get the I/O timing correct, these simulations actually drive the core chip module which does include the clock generator. We crafted separate Nanosim simulations to drive the chip I/O pins and verify the clock generator module. We also use Nanosim simulations to measure power consumption.

6.3 Chip Results and Measurements

We began RTL development for Scale in January of 2005, and we taped out the chip on October 15, 2006. Subtracting two months during which we implemented a separate test chip, the design time was 19 months, and we spent about 24 person-months of effort. We received chips back from fabrication on February 8, 2007, and we were able to run test programs the same day. The off-chip memory system was first operational on March 29, 2007. The first silicon has proven to be fully functional with no known bugs after extensive lab testing. Figure 6-8 shows an annotated plot of the final Scale design and Figure 6-9 shows a single cluster. Figure 6-10 shows die photos of the fabricated chip and Table 6.1 summarizes the design statistics.

Figure 6-8: Final Scale chip plot (annotated). The labeled blocks roughly correspond to the microarchitecture diagram in Figure 5-1.

Figure 6-9: Plot of a basic cluster implementation (cluster 2). The major components are labeled, and many of these correspond to the microarchitecture diagram in Figure 5-13.

Figure 6-10: Scale die photos. These chips were processed after fabrication to remove the top metal layers. The chip in the top photomicrograph was chemically etched, and the lower two photomicrographs are of a chip that was parallel lapped. The features visible in the different photos depend on the lighting conditions.

6.3.1 Area

We initially estimated that the Scale core would be around 10 mm2 in 180 nm technology [KBH+04a]. This estimate assumed custom layout for the cache tags and the register files, whereas the actual Scale chip builds these structures out of standard cells. The estimate was also based on a datapath cell library which was more optimized for area. Furthermore, the initial study was made before any RTL was written, and the microarchitecture has evolved considerably since then. For example, we added the store-data cluster and refined many details in the non-blocking cache design.

Scale's core area is 16.61 mm2 in 180 nm technology. At this size it is a suitable candidate for an efficient and flexible core in a processor array architecture. If implemented in 45 nm technology, Scale would only use around 1 square millimeter of silicon area, allowing hundreds of cores in a single chip. Scale's 16 execution clusters provide relatively dense computation. Even with the control overheads of vector-threading and its non-blocking cache, the core area is around the same size as a single tile in the 16-tile RAW microprocessor [TKM+03]. In 130 nm technology, around 40 Scale cores could fit within the area of the two-core TRIPS processor [SNG+06]. In 90 nm technology, Scale would be around 28% of the size of a single 14.8 mm2 SPE from the 8-SPE Cell processor [FAD+06]. These comparisons are not one-to-one since the various chips provide different compute capabilities (e.g. floating-point) and have different amounts of on-chip RAM, but Scale's compute density is at least competitive with other many-operation-per-cycle chips.

Tables 6.2 and 6.3 show the area breakdown for the Scale chip and for an individual cluster. The area is split roughly evenly between the vector-thread unit and the memory system, with the control processor only occupying around 2% of the total.

                           Gates       ---- Gates percentage ----   Cells        RAM
                           (thousands)                              (thousands)  bits
    Scale (total)          1407.0      100.0                        397.0
    Control Processor        28.8        2.0  100.0                  10.9
      Register File           9.8             34.0                    3.0
      Datapath                4.7             16.3                    1.8
      Mult/Div                2.5              8.7                    1.2
      VLMAX                   1.4              4.9                    0.8
      Control Logic          10.3             35.8                    4.1
    Vector-Thread Unit      678.2       48.2  100.0                 211.4       33792
      VTU command queues     11.2              1.7                    4.0
      VLU                     3.1              0.5                    1.7
      VSU                     3.0              0.4                    1.5
      VRU                     3.0              0.4                    1.4
      AIB Fill Unit          12.8              1.9                    5.3
      XVPSSQs                 5.6              0.8                    2.0
      Lane (4×)             159.6             23.5  100.0            48.8        8448
        CMU                  13.4                     8.4             4.2         896
        C0                   31.8                    19.9            10.0        1472
        C1                   28.5                    17.9             8.5        1472
        C2                   28.1                    17.6             8.4        1472
        C3                   35.6                    22.3            11.4        1472
        SD                   13.5                     8.5             2.8        1664
        LVMU                  6.6                     4.1             2.7
    Memory System           688.7       48.9  100.0                 171.2      266240
      Data Arrays           287.6             41.8                             266240
      Tag Arrays            177.1             25.7                   83.1
      Tag Logic              61.6              8.9                   33.0
      MSHRs                  43.5              6.3                   15.2
      Arbiter                 1.7              0.2                    0.9
      Address Crossbar        1.1              0.2                    0.6
      Read Crossbar          50.7              7.4                   13.0
      Write Crossbar         27.1              3.9                   11.7
      Other                  38.3              5.6                   13.7
    Host Interface            3.1        0.2                          1.1
    Memory Interface          2.1        0.1                          0.8

Table 6.2: Scale area breakdown. Many of the blocks are demarcated in Figure 6-8. The Gates metric is an approximation of the number of circuit gates in each block as reported by Cadence Encounter. Different sub-breakdowns are shown in each column of the Gates percentage section. The Cells metric reports the actual number of standard cells, and the RAM bits metric reports the bit count for the generated RAM blocks.

                                         Gates    Gates %   Cells
    Cluster 2 (total)                    28134    100.0     8418
      AIB Cache RAM                       4237     15.1
      AIB Cache Control                    376      1.3      129
      Execute Directive Queue              908      3.2      351
      Execute Directive Sequencers        1221      4.3      526
      Writeback-op Queue, Bypass/Stall    1231      4.4      410
      Writeback Control                    582      2.1      227
      Register File                      10004     35.6     3013
      Datapath                            6694     23.8     2605
      Datapath Control                    2075      7.4      828
      Transport Control                    472      1.7      188

Table 6.3: Cluster area breakdown. The data is for cluster 2, a basic cluster with no special operations. Many of the blocks are demarcated in Figure 6-9. Columns are as in Table 6.2.

As somewhat technology-independent reference points, the control processor uses the gate equivalent area of around 3.3 KB of SRAM, and the vector-thread unit uses the equivalent of around 76.6 KB of SRAM. The actual area proportions on the chip are somewhat larger than this since the standard cell regions have lower area utilization than the SRAM blocks.

Support for the vector-thread commands does not add undue complexity to the control processor. The largest area overhead is the logic to compute vlmax, which takes up around 5% of the CP area. Even including the VT overhead, the baseline architectural components of the control processor—the 32-entry register file; the adder, logic unit, and shifter datapath components; and the iterative multiply/divide unit—occupy around half of its total area.

The vector-thread unit area is dominated by the lanes. Together, the VTU command queues, the vector load unit, the vector store unit, the vector refill unit, and the AIB Fill Unit use less than 5% of the VTU area. Within a lane, the compute clusters occupy 79% of the area, and the command management unit, the lane vector memory unit, and the store-data cluster occupy the remainder. The area of a basic cluster (c2) is dominated by the register file (36%), the datapath (24%), and the AIB cache (15%). The execute directive queue uses 3%, and the synthesized control logic makes up the remaining 22%.

It is also useful to consider a functional breakdown of the VTU area. The register files in the VTU take up 26% of its area, and the arithmetic units, including the cluster adders, logic units, shifters, and multiply/divide units, take up 9%. Together, these baseline compute and storage resources comprise 35% of the VTU area. In a vector-thread processor with floating-point units or other special compute operations, this ratio would be even higher. The vector memory access units—the LVMUs, and the VLU, VSU, VRU, and their associated command queues—occupy 6% of the total, and the memory access resources in cluster 0 and the store-data cluster add another 3%. The AIB caching resources, including the AIB cache RAMs, tags, control, and the AIB fill unit, comprise 19% of the total. The vector and thread command control overhead—the VTU command queue, the execute directive queues and sequencers, the pending thread fetch queue, and other logic in the CMU—together take up 11% of the VTU area. The writeback decoupling resources in the VTU, including the datapath latches and the bypass/stall logic for the writeback-op queue, take up 8% of the area, and the transport decoupling resources take up 4% of the area. The cross-VP queues in the lanes and the cross-VP start/stop queues together occupy 3%. The remaining 11% of the VTU area includes the cluster chain registers, predicate registers, and other pipeline latches, muxes, and datapath control logic.

6.3.2 Frequency

Figure 6-11 shows a shmoo plot of frequency versus supply voltage for the Scale chip operating in the on-chip RAM testing mode. At the nominal supply voltage of 1.8 V, the chip operates at a maximum frequency of 260 MHz. The voltage can be scaled down to 1.15 V, at which point the chip runs at 145 MHz. The frequency continues to improve well beyond 1.8 V, and it reaches 350 MHz at 2.9 V (not shown in the shmoo plot). However, we have not evaluated the chip lifetime when operating above the nominal supply voltage. Of 13 working chips (out of 15), the frequency variation was generally less than 2%. We were surprised to find that using an undivided external clock input (which may have an uneven duty cycle) only reduced the maximum frequency by around 3% compared to using the VCO. Figure 6-12 shows the maximum frequency when caching is enabled (as explained in Section 6.1.7), alongside the on-chip RAM mode results. With caching, Scale's maximum frequency is 180 MHz at 1.8 V.

Figure 6-11: Shmoo plot for Scale chip operating in the on-chip RAM testing mode. The horizontal axis plots supply voltage in mV, and the vertical axis plots frequency in MHz. A "*" indicates that the chip functioned correctly at that operating point, and a "." indicates that the chip failed.

In the on-chip RAM operating mode, Scale's cycle time is around 43 FO4 (fan-out-of-four gate delays) based on a 90 ps FO4 metric. This is relatively competitive for an ASIC design flow, as clock periods for ASIC processors are typically 40–65 FO4 [CK02]. Our initial studies evaluated Scale running at 400 MHz [KBH+04a], but this estimate assumed a more custom design flow. We did previously build a test-chip which ran at 450 MHz (25 FO4) using the same design flow and process technology [BKA07]. However, the test-chip was a simple RISC core that was much easier to optimize than the entire Scale design.

The timing optimization passes in the synthesis and place-and-route tools work to balance all paths around the same maximum latency, i.e. paths are not made any faster than they need to be. As a result, there are generally a very large number of "critical" timing paths. Furthermore, different paths appear critical at different stages of the timing optimization, and they affect the final timing even if they become less critical at the end. Throughout the Scale implementation effort, we continuously improved the design timing by restructuring logic to eliminate critical paths. Some of the critical paths in the design include (in no particular order):

• The single-cycle 16-bit multiplier in cluster 3.

• The vlmax calculation in the control processor (for a vcfgvl command).

• Receiving a transport send signal from another cluster, then stalling stage D of the cluster pipeline based on a register dependency, and then propagating this stall back to the execute directive sequencers (stage I).

• Performing the VRI to PRI register index translation in stage D of the cluster pipeline, then stalling based on a register dependency, and then propagating this stall back to the execute directive sequencers.


Figure 6-12: Maximum Scale chip frequency versus supply voltage for on-chip RAM mode and cache mode.

• Performing the VRI to PRI translation, then reading the register file in stage D of the cluster pipeline.

• Performing load/store address disambiguation in stage X of cluster 0, then propagating a memory request through the LVMU, and then going through cache arbitration.

• Broadcasting the read crossbar control signals, then sending load data through the crossbar, through the byte selection and sign extension unit, and to the load data queue in cluster 0.

• Determining if the command management unit can issue an execute directive (i.e. if all the clusters in the lane are ready to receive it), and then issuing a register ED from the CMU straight to stage D of a cluster pipeline.

• Reading data from the vector load data queue in the lane vector memory unit, then sending it across the length of the lane to the receive queue (latch) in cluster 3.

• Performing the AIB cache tag check (a CAM search) in a command management unit, then determining if the next VTU command can issue (if all the CMUs are ready to receive it), and then issuing the VTU command to the CMUs.

When caching is enabled, some of the additional critical paths include:

• A tag check (a CAM search in a subbank), then encoding the match signals as a RAM index, or OR-ing the match signals together to determine a store miss and select the pending store data buffer RAM index, then sending the index across the chip to the data bank.

• Reading a replay queue entry from the MSHR, then arbitrating across the banks to choose one replay, then choosing between a replay and a new access for a requester, then going through cache arbitration.

            la    t9, input_end
    forever:
            la    t8, input_start
            la    t7, output
    loop:
            lw    t0, 0(t8)
            addu  t8, 4
            bgez  t0, pos
            subu  t0, $0, t0
    pos:
            sw    t0, 0(t7)
            addu  t7, 4
            bne   t8, t9, loop
            b     forever

Figure 6-13: Test program for measuring control processor power consumption. The inner loop calculates the absolute values of integers in an input array and stores the results to an output array. The input array had 10 integers during testing, and the program repeatedly runs this inner loop forever.

6.3.3 Power

The power consumption data in this section was recorded by measuring steady-state average current draw during operation. The power measurements are for the core of the chip and do not include the input and output pads. We measure the idle power draw when the clock is running but Scale is not fetching or executing any instructions. To estimate typical control processor power consumption, we use the simple test program shown in Figure 6-13. The code fits in the control processor's L0 instruction buffer. To estimate typical power consumption for a program that uses the VTU we use the adpcm_dec test program described in Section 7.3.4, but we scaled down the data size to fit in the cache. This test program executes around 6.5 VTU compute-ops per cycle. Section 7.6 reports power measurements for other benchmarks.

Figure 6-14 shows how the power consumption scales with supply voltage with the chip running at 100 MHz. At 1.8 V, the idle power is 112 mW, the power with the control processor running is 140 mW, and the power with the VTU also active is 249 mW. The figure also shows the power for Scale operating in the on-chip RAM testing mode. At 1.8 V the difference is only 3%, and the chip uses 241 mW. The power scaling is roughly linear, but the slope increases slightly above the nominal supply voltage.

Figure 6-14: Chip power consumption versus supply voltage at 100 MHz. Results are shown for idle (clock only), the control processor absolute-value test program, and the VTU adpcm_dec test program. The cache is enabled for the idle and CP results. For the VTU program, results are shown for both the on-chip RAM operating mode and the cache operating mode.

Figure 6-15 shows how the power consumption scales with frequency for the same test programs. At 260 MHz, the chip consumes 259 mW while idle, 323 mW while running the control processor program, and 589 mW while running the VTU program. The power scales linearly with frequency.

Figure 6-15: Chip power consumption versus frequency with a 1.8 V supply voltage. All data is for the on-chip RAM operating mode.

As another way to evaluate the chip power consumption, Figure 6-16 shows how energy scales with frequency for a range of supply voltages. To calculate the average energy per cycle, we multiply the average power consumption by the clock period. This is a measurement of the energy to perform a fixed amount of work (one clock cycle), so it does not tend to change very much with frequency as long as the supply voltage remains fixed. However, energy consumption can be reduced at a given frequency by lowering the supply voltage as much as possible. At low clock frequencies the chip can run at a low supply voltage, but as the speed increases the supply voltage must also increase. The outer edge of the data points in Figure 6-16 demonstrates that the minimum energy per clock cycle increases as the clock frequency increases. Another way to interpret this is that the energy used to execute a given program will increase as the chip executes the program faster. From the baseline operating point of 1.8 V and 260 MHz, the energy can be reduced by 57% (from 2.3 nJ to 1.0 nJ) if the frequency is scaled down by 42% to 150 MHz and the voltage is reduced to 1.2 V. The frequency can also be increased by 23% to 320 MHz, but the supply voltage must be raised to 2.4 V, which increases the energy consumption by 83% (to 4.2 nJ).

Figure 6-16: Chip energy versus frequency for various supply voltages. All data is for running the adpcm_dec test program in the on-chip RAM operating mode. For each supply voltage, the maximum frequency point is marked and labeled with the energy value.
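For example, the 2.3 nJ baseline point is consistent with the power measurements above for the adpcm_dec program at 1.8 V and 260 MHz:

$$E_{\mathrm{cycle}} = \frac{P_{\mathrm{avg}}}{f} = \frac{589\ \mathrm{mW}}{260\ \mathrm{MHz}} \approx 2.3\ \mathrm{nJ}.$$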

6.4 Implementation Summary

Our tool and verification flow helped us to build a performance-, power-, and area-efficient chip with limited resources. A highly automated methodology allowed us to easily accommodate an evolving design. Indeed, we only added the non-blocking cache miss logic to the chip during the final two months before tapeout. Our iterative approach allowed us to continuously optimize the design and to add and verify new features up until the last few days before tapeout. Even though we used a standard-cell-based ASIC-style design flow, procedural datapath pre-placement allowed us to optimize timing and area. We reduced power consumption by using extensive clock-gating in both the pre-placed datapaths and the synthesized control logic.

Chapter 7

Scale VT Evaluation

In this chapter we evaluate Scale's ability to flexibly expose and exploit parallelism and locality. We have mapped a selection of 32 benchmark kernels to illustrate and analyze the key features of Scale. These include examples from linear algebra, network processing, image processing, cryptography, audio processing, telecommunications, and general-purpose computing. Additionally, kernels are combined to create larger applications including a JPEG image encoder and an IEEE 802.11a wireless transmitter. Several benchmarks are described in detail in order to highlight how different types of parallelism map to the architecture, from vectorizable loops to free-running threads. We present detailed performance breakdowns and energy consumption measurements for Scale, and we compare these results to a baseline RISC processor running compiled code (i.e. the Scale control processor). We also present detailed statistics for Scale's control and data hierarchies in order to evaluate the architecture's ability to exploit locality and amortize overheads.

7.1 Programming and Simulation Methodology

Results in this chapter and the next are based on a detailed cycle-level, execution-driven simulator. The simulator includes accurate models of the VTU and the cache, but the control processor is based on a simple single-instruction-per-cycle model with zero-cycle cache access latency. Although its pipeline is not modeled, the control processor does contend for the cache and stall for misses. Since the off-chip memory system is not a focus of this work, main memory is modeled as a simple pipelined magic memory with configurable bandwidth and latency. Section 7.6 validates the performance results of the microarchitectural simulator against the Scale chip.

The Scale microarchitectural simulator is a flexible model that allows us to vary many parameters, and some of the defaults are summarized in Table 7.1. The simulator is configured to model the Scale prototype processor, with the exception that the cache uses a no-write-allocate policy that is not implemented in the Scale chip. An intra-lane transport from one cluster to another has a latency of one cycle (i.e. there will be a one cycle bubble between the producing instruction and the dependent instruction). Cross-VP transports are able to have zero cycle latency because the clusters are physically closer together and there is less fan-in for the receive operation. Cache accesses have a two cycle latency (two bubble cycles between load and use). The main memory bandwidth is 8 bytes per processor clock cycle, and cache misses have a minimum load-use latency of 35 cycles. This configuration is meant to model a 64-bit DDR DRAM port running at the same data rate as Scale, for example 400 Mbps/pin with Scale running at 400 MHz. Longer-latency memory system configurations are evaluated in Chapter 8.

For the results in this thesis, all VTU code was hand-written in Scale assembly language (as in

Vector-Thread Unit
  Number of lanes                4
  Clusters per lane              4
  Registers per cluster          32
  AIB cache slots per cluster    32
  Intra-cluster bypass latency   0 cycles
  Intra-lane transport latency   1 cycle
  Cross-VP transport latency     0 cycles

L1 Unified Cache
  Size                           32 KB
  Associativity                  32 (CAM tags)
  Line size                      32 B
  Banks                          4
  Maximum bank access width      16 B
  Store miss policy              no-write-allocate/write-back
  Load-use latency               2 cycles

Main Memory System
  Bandwidth                      8 B per processor cycle
  Minimum load-use latency       35 processor cycles

Table 7.1: Default parameters for Scale simulations.

Figure 4-6) and linked with C code compiled for the control processor using gcc. A vector-thread compiler is in development, and it is described briefly in Section 9.2.
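As an aside, the default main memory bandwidth in Table 7.1 corresponds directly to the 64-bit DDR DRAM port described above:

    64 pins × 400 Mb/s per pin / 400 MHz = 64 bits = 8 B per processor cycle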

7.2 Benchmarks and Results

Table 7.2 shows benchmark results from the Embedded Microprocessor Benchmark Consortium (EEMBC); Table 7.3 shows benchmark results from MiBench [GRE+01], Mediabench [LPMS97], and SpecInt; and Table 7.4 shows benchmark results from the Extreme Benchmark Suite (XBS) [Ger05]. All of the benchmark results show Scale's absolute performance in terms of the number of cycles required to process the given input. The results also show the per-cycle rate of VTU compute operations, multiplies, load elements, and store elements. These statistics help to evaluate the efficiency of the Scale mappings. The memory bandwidth statistic demonstrates cases where Scale is able to simultaneously sustain high compute and memory throughputs, or cases where performance is limited by the memory bandwidth. The XBS benchmarks, in particular, have results for both small data sets that fit in cache and larger data sets which stream from memory.

In addition to providing the absolute Scale performance and the number of operations executed per cycle, the results also indicate speedups compared to a scalar RISC machine. These are obtained by running compiled code on the Scale control processor. These speedups, on balance, provide a general idea of Scale's performance compared to a RISC architecture, but one should be careful in making the comparison. Although the Scale results match the typical way in which embedded processors are used, they illustrate the large speedups possible when algorithms are extensively tuned for a highly parallel processor versus compiled from standard reference code. The control processor model itself is a somewhat idealistic comparison point since it executes one instruction per cycle (subject to cache miss stalls), and it does not suffer from branch mispredictions or load-use latency for cache hits. Its MIPS-like instruction set provides compute and logic operations that are similar to those for the Scale clusters. However, it does not have an analogue to Scale's 16-bit multiply instruction (mulh). The compiled code either uses a mult/mflo sequence (which each execute in a single cycle in the simulator), or it executes multiplies by constants as a series of shifts

136 Iterations per Per-cycle statistics Memory million cycles Kernel Avg Load Store Mem Loop access Benchmark Description Input RISC Scale speedup VL Ops Muls Elms Elms Bytes types types rgbcmy RGB to CMYK color conversion 225 KB image 0.35 8.25 23.7 40.0 10.8 – 1.9 0.6 4.7 DP VM,SVM rgbyiq RGB to YIQ color conversion 225 KB image 0.14 5.78 39.9 28.0 9.3 4.0 1.3 1.3 4.2 DP SVM hpg High pass gray-scale filter 75 KB image 0.26 12.80 50.1 63.6 10.7 3.9 2.9 1.0 3.3 DP VM,VP text Control language parsing 323 rule strings (14.0 KB) 0.82 1.21 1.5 6.7 0.3 – 0.1 0.0 – DE VM dither Floyd-Steinberg gray-scale dithering 64 KB image 0.39 2.70 7.0 30.0 5.1 0.7 1.1 0.3 0.3 DP,DC VM,SVM,VP rotate Binary image 90 degree rotation 12.5 KB image 2.01 44.08 21.9 12.3 10.8 – 0.6 0.6 – DP VM,SVM lookup IP route lookup using Patricia Trie 2000 searches 4.65 33.86 7.3 40.0 6.1 – 1.3 0.0 – DI VM,VP ospf Dijkstra shortest path first 400 nodes × 4 arcs 18.81 17.78 0.9 28.0 1.2 – 0.2 0.1 – FT VP 512 KB (11.7 KB headers) 16.87 332.63 19.7 46.3 8.1 – 1.6 0.1 – pktflow IP packet processing 1 MB (22.5 KB headers) 6.02 89.47 14.9 51.2 4.2 – 0.8 0.1 4.7 DC VM,VP 2 MB (44.1 KB headers) 3.07 46.49 15.1 52.2 4.3 – 0.9 0.1 4.8 pntrch Searching linked list 5 searches, ≈ 350 nodes each 24.85 96.28 3.9 13.0 2.3 – 0.3 0.0 – FT VP fir-hl Finite impulse response filters 3 high + 3 low, 35-tap 135.12 15600.62 115.5 60.0 6.8 3.3 1.7 0.1 0.5 DP VM,SVM Bit allocation for DSL modems step (20 carriers) 32.73 755.30 23.1 20.0 2.3 – 0.4 0.0 – fbital pent (100 carriers) 3.38 163.01 48.2 100.0 3.4 – 0.5 0.0 – DC,XI VM,VP

137 typ (256 carriers) 2.23 60.45 27.1 85.3 3.6 – 0.5 0.3 – fft 256-pt fixed-point complex FFT all (1 KB) 17.23 302.32 17.5 25.6 3.8 1.2 1.7 1.4 – DP VM,SVM viterb Soft decision Viterbi decoder all (688 B) 3.94 40.41 10.3 16.0 4.9 0.4 0.5 0.5 – DP VM,SVM data1 (16 samples, 8 lags) 707.01 9575.15 13.5 8.0 3.9 1.2 1.4 0.1 – autocor Fixed-point autocorrelation data2 (1024 samples, 16 lags) 4.72 161.26 34.1 16.0 7.9 2.6 2.8 0.0 – DP VM data3 (500 samples, 32 lags) 4.95 197.97 40.0 32.0 9.5 3.2 3.3 0.0 – data1 (512 B, 5 xors/bit) 7.62 7169.28 940.9 32.0 11.0 – 0.9 0.2 – conven Convolutional encoder data2 (512 B, 5 xors/bit) 8.87 8815.78 994.4 32.0 11.3 – 1.1 0.3 – DP VM,VP data3 (512 B, 3 xors/bit) 11.21 10477.57 935.0 32.0 10.1 – 1.3 0.3 –

Table 7.2: Performance and mapping characterization for benchmarks from the Embedded Microprocessor Benchmark Consortium (EEMBC). The RISC column shows performance for compiled out-of-the-box code (OTB) running on the Scale control processor. The Scale column shows results for assembly-optimized code (OPT) which makes use of the VTU. The kernels were run repeatedly in a loop, and the performance metric shows the average throughput as iterations per million cycles. For EEMBC benchmarks with multiple data sets, all indicates that results were similar across the inputs. The Avg VL column displays the average vector length used by the benchmark. The Ops column shows VTU compute operations per cycle, the Muls column shows VTU multiplies per cycle (a subset of the total compute-ops), the Load-Elm and Store-Elm columns show VTU load and store elements per cycle, and the Mem-Bytes column shows the DRAM bandwidth in bytes per cycle. The Loop Type column indicates the categories of loops which were parallelized when mapping the benchmarks to Scale: [DP] data-parallel loop with no control flow, [DC] data-parallel loop with conditional thread-fetches, [XI] loop with cross-iteration dependencies, [DI] data-parallel loop with inner-loop, [DE] loop with data-dependent exit condition, and [FT] free-running threads. The Mem Access columns indicate the types of memory accesses performed: [VM] unit-stride and strided vector memory accesses, [SVM] segment vector memory accesses, and [VP] individual VP loads and stores.

Benchmark   Suite        Description                     Input                    RISC (M cyc)  Scale (M cyc)  Kernel speedup  Avg VL  Ops  Load Elms  Store Elms  Mem Bytes  Loop types    Mem access
rijndael    MiBench      Advanced Encryption Standard    3.1 MB                   410.6         168.4          3.3             4.0     3.1  1.0        0.1         –          DP            VM,VP
sha         MiBench      Secure hash algorithm           3.1 MB                   141.6         60.8           2.3             5.5     1.8  0.3        0.1         –          DP,XI         VM,VP
qsort       MiBench      Quick sort of strings           10 k strings (52.2 KB)   32.4          20.6           3.0             12.0    1.9  0.4        0.3         2.5        FT            VP
adpcm enc   Mediabench   ADPCM speech compression        144 K samples (288 KB)   7.0           3.9            1.8             90.9    2.0  0.1        0.0         –          XI            VM,VP
adpcm dec   Mediabench   ADPCM speech decompression      144 K samples (72 KB)    5.9           0.9            7.6             99.9    6.5  0.6        0.2         –          XI            VM,VP
li          SpecInt95    Lisp interpreter                test (queens 8)          1333.4        1046.3         5.3             62.7    2.7  0.3        0.2         2.9        DP,DC,XI,DE   VM,VP

Table 7.3: Performance and mapping characterization for miscellaneous benchmarks. Total cycle numbers are for the entire application, while the remaining statistics are for the kernel only (the kernel excludes benchmark overhead code and for li the kernel consists of the garbage collector only). Other columns are as described in Table 7.2.

Input elements Per-cycle statistics Memory per 1000 cycles Kernel Avg Load Store Mem Loop access Benchmark Description Input RISC Scale speedup VL Ops Muls Elms Elms Bytes types types rgbycc RGB to YCC color conversion 1 k pixels (2.9 KB) 11.17 434.36 38.9 31.2 9.1 3.9 1.3 1.3 – DP VM,SVM 100 k pixels (293.0 KB) 10.77 443.60 41.2 32.0 9.3 4.0 1.3 1.3 4.6 fdct Forward DCT of 8×8 blocks 32 blocks (4 KB) 0.39 11.47 29.2 9.5 10.7 2.2 1.5 1.5 – DP VM,SVM

1000 blocks (125 KB) 0.35 9.52 27.0 9.6 8.9 1.8 1.2 1.2 5.0 138 idct Inverse DCT of 8×8 blocks 32 blocks (4 KB) 0.33 11.00 32.9 9.5 11.3 2.1 1.4 1.4 – DP VM,SVM 1000 blocks (125 KB) 0.30 9.54 31.3 9.6 9.8 1.8 1.2 1.2 5.2 quantize JPEG quantization of 8×8 blocks 32 blocks (4 KB) 0.59 1.76 3.0 113.8 1.4 – 0.3 0.1 – DP VM 1000 blocks (125 KB) 0.53 1.76 3.3 119.9 1.4 – 0.3 0.1 0.8 jpegenc JPEG compression of 8×8 blocks 1 block (128 B) 0.47 1.29 2.8 3.9 1.7 – 0.2 0.0 – XI,DE VM,VP scrambler 802.11a bit scrambler 4 k bytes 477.70 3087.36 6.5 11.9 3.6 – 0.8 0.8 – DP SVM 100 k bytes 314.20 3300.93 10.5 12.0 3.7 – 0.8 0.8 7.7 convolenc 802.11a convolutional encoder 4 k bytes 33.33 276.19 8.3 54.1 4.6 – 0.7 0.1 – DP VM,VP 100 k bytes 32.11 276.93 8.6 55.9 4.6 – 0.7 0.1 0.9 interleaver 802.11a bit interleaver 4.8 k bytes 22.29 540.54 24.2 33.3 13.9 – 0.1 0.5 – DP SVM 96 k bytes 21.75 538.45 24.8 35.7 13.8 – 0.1 0.5 3.8 mapper 802.11a mapper to OFDM symbols 1.2 k bytes 15.29 544.95 35.6 16.7 3.8 – 1.4 1.5 – DP VM,SVM,VP 48 k bytes 15.06 419.99 27.9 20.0 2.9 – 1.1 1.1 7.7 ifft 64-pt complex-value inverse FFT 32 blocks (8 KB) 0.06 1.45 25.1 16.0 11.0 1.1 0.4 0.4 – DP VM,SVM 1000 blocks (250 KB) 0.06 1.23 21.7 16.0 9.3 0.9 0.4 0.3 4.5 Table 7.4: Performance and mapping characterization for benchmarks from the Extreme Benchmark Suite (XBS). Each kernel has one data set which fits in cache and one which does not. The kernels were run repeatedly in a loop, and the performance metric shows the average throughput as input elements (pixels, blocks, or bytes) processed per thousand cycles. Other columns are as described in Table 7.2. and adds. The EEMBC rules indicate that benchmarks may either be run “out-of-the-box” (OTB) as com- piled unmodified C code, or they may be optimized (OPT) using assembly coding and arbitrary hand-tuning. One drawback of this form of evaluation is that performance depends greatly on pro- grammer effort, especially as EEMBC permits algorithmic and data-structure changes to many of the benchmark kernels, and optimizations used for the reported results are often unpublished. We made algorithmic changes to several of the EEMBC benchmarks: rotate blocks the algorithm to enable rotating an 8-bit block completely in registers, pktflow implements the packet descriptor queue using an array instead of a linked list, fir-hl optimizes the default algorithm to avoid copy- ing and exploit reuse, fbital uses a binary search to optimize the bit allocation, conven uses bit packed input data to enable multiple bit-level operations to be performed in parallel, and fft uses a radix-2 hybrid Stockham algorithm to eliminate bit-reversal and increase vector lengths. These algorithmic changes help Scale to achieve some of the very large speedups reported in Table 7.2. For the other benchmark suites, the algorithms used in the Scale mapping were predominately the same as the C code used for the RISC results. For the XBS benchmarks, the optimized C versions of the algorithms were used for the RISC results [Ger05]. Overall, the results show that Scale can flexibly exploit parallelism on a wide range of codes from different domains. The tables indicate the categories of loops which were parallelized when mapping the benchmarks to Scale. Many of the benchmarks have data-parallel vectorizable loops (DP), and Scale generally achieves very high performance on these; it executes more than 8 compute- ops per cycle on 11 out of 32 of the benchmarks, and the speedups compared to RISC are generally 20–40×. 
This performance is often achieved while simultaneously sustaining high memory bandwidths, an effect particularly seen with the XBS benchmarks. Other data-parallel loops in the benchmarks either have large conditional blocks (DC), e.g. dither, or inner loops (DI), e.g. lookup. VP thread fetches allow these codes to execute efficiently on Scale; they achieve 5–6 compute-ops per cycle and a speedup of about 7×. The benchmarks that have loops with cross-iteration dependencies (XI) tend to have more limited parallelism, but Scale achieves high throughput on adpcm dec and moderate speedups on fbital, adpcm enc, sha, and jpegenc. Finally, Scale uses a free-running VP threads mapping for qsort and pntrch to achieve 3–4× speedups, despite qsort's relatively high cache miss rate and the structure of pntrch which artificially limits the maximum parallelism to 5 searches. The following section describes several of the benchmarks and their Scale mappings in greater detail.

Tables 7.2, 7.3, and 7.4 also indicate the types of memory accesses used by the Scale benchmark mappings: unit-stride and strided vector memory accesses (VM), segment vector memory accesses (SVM), and individual VP loads and stores (VP). Segment accesses are unique to Scale and they work particularly well when vector operations are grouped into AIBs. They expose parallelism more explicitly than the overlapping strided loads and stores that a traditional vector architecture would use for two-dimensional access patterns. Many algorithms process neighboring data elements together, and 14 out of 32 of the benchmarks make use of multi-element segment vector memory accesses. The memory access characterization in the tables also demonstrates how many non-vectorizable kernels—those with cross-iteration dependencies or internal control flow—can still take advantage of efficient vector memory accesses.

7.3 Mapping Parallelism to Scale

This section provides an overview of the Scale VT architecture's flexibility in mapping different types of code. Through several example benchmark mappings, we show how Scale executes vectorizable loops, loops with internal control flow, loops with cross-iteration dependencies, and reduction operations. This section also describes how free-running threads are used when structured loop-level parallelism is not available.

Figure 7-1: Execution of rgbycc kernel on Scale. VP numbers for the compute-ops are indicated before the colon. Time progresses downwards in the figure.

7.3.1 Data-Parallel Loops with No Control Flow

Data-parallel loops with no internal control flow, i.e. simple vectorizable loops, may be ported to Scale in a manner similar to other vector architectures. Vector-fetch commands encode the cross-iteration parallelism between blocks of instructions, while vector memory commands encode data locality and enable optimized memory access. The image pixel processing benchmarks (rgbcmy, rgbyiq, hpg, rotate, and rgbycc) are examples of streaming vectorizable code for which Scale is able to achieve high throughput.

RGB-to-YCC Example. An example vectorizable loop is the rgbycc kernel. The Scale AIB is similar to the example shown in Figure 4-2. The RGB bytes are read from memory with a 3-element segment load. To transform a pixel from RGB to each of the three output components (Y, Cb, and Cr), the vector-fetched AIB uses three multiplies, three adds, and a shift. The Scale code in the AIB processes one pixel—i.e., one loop iteration—in a straightforward manner with no cross-iteration instruction scheduling. The outputs are then written to memory using three unit-stride vector-stores. The scaling constants for the color conversion are held in shared registers. The example in Figure 4-2 uses 9 shared registers, but the actual Scale implementation uses only 8 since two of the scaling constants are the same. These are written by the control processor, and they take the place of vector-scalar instructions in a traditional vector machine. If each VP had to keep its own copy of the constants, a 32-entry register file could only support 2 VPs (with 11 registers each). However, with 8 shared registers and 3 private registers, the same register file can support 8 VPs, giving a vector length of 32 across the lanes. Within the color conversion computation for each output component, all temporary values are held in chain registers. In fact, even though the AIB contains 21 VP instructions, only the three final

output values (14%) are written to a register file (the private store-data registers). Of the 42 input operands for VP instructions, 21 (50%) come from private and shared registers, 18 (43%) come from chain registers, and 3 (7%) are immediates.

The per-cluster micro-op bundles for the AIB are shown in Figure 5-5, and Figure 7-1 shows a diagram of the kernel executing on Scale. The bottleneck is cluster 3, which continually executes multiplies and sends the results to cluster 2. Cluster 2 accumulates the data and forwards the sums to cluster 1 for scaling and rounding. The clusters in Scale are decoupled—although each individual cluster executes its own micro-ops in-order, the inter-cluster execution slips according to the data dependencies. In the example, the inter-cluster dependencies are acyclic, allowing c3 to run ahead and execute instructions from the next VP on the lane before c2 and c1 have finished the current one. This scheduling is done dynamically in hardware based on simple inter-cluster dataflow between the transport and writeback micro-ops (which are explicit in Figure 5-5). The cluster decoupling hides the one-cycle inter-cluster transport latency as well as the latency of the 16-bit multiplier (although the Scale multiplier executes in a single cycle, the result cannot be directly bypassed to a dependent instruction in cluster 3). In a traditional vector machine, chaining allows different functional units to operate concurrently. In Scale, the operations within an AIB execute concurrently across the clusters.

Figure 7-2: Diagram of decoupled execution for rgbycc kernel. One lane is shown. The operations involved in a single pixel transformation are highlighted. The control processor would generally be running even further ahead, but the decoupling has been condensed to fit on the figure.

Figure 7-2 shows a broader timeline diagram of the rgbycc kernel executing on Scale. During the execution, the control processor runs ahead, queuing the vector-load, vector-fetch, and vector-store commands for these decoupled units. The segment vector-loads get distributed to the lane vector memory units (LVMUs) which run ahead and fetch data into their vector load data queues (Figure 5-19); each cache access fetches 3 byte elements (R, G, and B). Corresponding vector-load writeback execute directives (vlwb EDs) on cluster 3 receive the load data and write it to private registers. Since c3 executes 9 multiplies for every 3 load elements, the vector-fetch EDs limit throughput and the vlwb EDs end up chaining behind, similar to the chaining in Figure 5-15. The progress of c3 throttles all of the other decoupled units—the command and data queues for


Figure 7-3: Inputs and outputs for the hpg benchmark: (a) hpg input, (b) hpg output. The images are gray-scale with 8 bits per pixel.

preceding units eventually fill up, causing them to stall, and the subsequent units must wait for data to be available. After c3, c2, and c1 complete the color conversion for a pixel, the three output components are written to private store-data registers in sd. For each vector-fetch, three vector-stores are used to write the data to memory. The vector-store unit (VSU) orchestrates the unit-stride accesses, and it combines data across the 4 lanes for each cache write. The first vector-store (for the Y outputs) chains on the vector-fetch in the sd cluster. After it completes, the other two vector-stores (for the Cb and Cr outputs) execute in bursts. The rgbycc AIB could be partitioned into three blocks to interleave the three vector-stores with the computation, but in this example the stores are not a bottleneck.

For the rgbycc kernel, chaining allows the vector load data writebacks and the vector store data reads to essentially operate in the background while the computation proceeds with no stalls. Decoupling hides any cache bank conflicts or stall conditions in the vector memory units or the control processor. Scale is able to achieve its peak sustained throughput of 4 multiplies per cycle. The VTU executes an average of 9.33 compute-ops per cycle, while simultaneously fetching an average of 1.33 load elements per cycle and writing 1.33 store elements per cycle (an equivalent of 12 total operations per cycle). The vector memory commands reduce cache access bandwidth: each read fetches the 3 load elements from a segment, and each write combines 4 store elements from across the lanes. Thus, the VTU performs around 0.78 cache accesses per cycle, a bandwidth easily satisfied by Scale's 4-bank cache. Overall, Scale does the RGB-to-YCC conversion at a rate of 2.25 cycles per pixel, and it processes an image at a rate of 1.33 bytes per cycle.
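To make the per-pixel arithmetic concrete, the work done by the vector-fetched rgbycc AIB corresponds to the plain-C sketch below. The fixed-point coefficients and the folded additive constants are illustrative BT.601-style stand-ins, not necessarily the exact constants used in the benchmark.

#include <stdint.h>

/* One RGB-to-YCC pixel conversion.  Per output component: three multiplies,
 * three additions, and one shift, mirroring the 21-instruction AIB described
 * above.  The additive constants fold together rounding and (for the chroma
 * components) the +128 offset, and are illustrative rather than exact. */
static inline void rgb_to_ycc(uint8_t r, uint8_t g, uint8_t b,
                              uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = (uint8_t)(( 77 * r + 150 * g +  29 * b +   128) >> 8);
    *cb = (uint8_t)((-43 * r -  85 * g + 128 * b + 32767) >> 8);
    *cr = (uint8_t)((128 * r - 107 * g -  21 * b + 32767) >> 8);
}

On Scale, the scaling constants occupy shared registers and the intermediate sums flow through chain registers, which is why only the three final output values touch the register file.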

High Pass Gray-Scale Filter Example. The hpg benchmark from the EEMBC suite computes a high pass filter of a gray-scale image encoded with one byte per pixel. Each output pixel is calculated as a weighted average of 9 pixels in the input image: its corresponding pixel and the 8 surrounding neighbor pixels. To take the average, the corresponding pixel is multiplied by a “CENTER” coefficient, and the neighboring pixels are multiplied by “HALO” coefficients. In the benchmark these coefficients are weighted to detect edges; the effect is shown in Figure 7-3. In the Scale mapping of hpg, each virtual processor calculates a column of output pixels. The mapping takes advantage of data and computation reuse that is inherent in the algorithm. As part of the calculation, three neighboring pixels are each multiplied by the HALO parameter and summed together. This sum is used for the bottom row of one output pixel, and then it is reused as the top row of the output pixel two rows below. Figure 7-4 shows a diagram of the mapping. A straightforward mapping would perform 9 loads, 9 multiplies, 8 additions, and 1 store for each output pixel. The



Figure 7-4: Diagram of hpg mapping. The compute operations for one virtual processor (VP0) are shown. Each step has 3 input pixels and 1 output pixel. The computation for one output pixel stretches over three steps, as highlighted in the figure.

[ ... ]
    vcfgvl  zero, zero, 2,0, 1,1, 1,0, 1,2
[ ... ]
loop:
    vlbu    inrowp, c3/pr0
    vf      vtu_left
    addu    t0, inrowp, 2
    vlbu    t0, c3/pr0
    vf      vtu_right
    addu    t0, inrowp, 1
    vlbu    t0, c3/pr0
    vf      vtu_center
    vsbai   outrowp, cols, c0/sd0
    addu    inrowp, cols
    subu    outrows, 1
    bnez    outrows, loop
[ ... ]

vtu_left: .aib begin c3 mulh pr0, sr0 -> c2/pr0 # L*HALO c0 copy pr0 -> c1/pr0 # MiddleTop = nextMiddleTop .aib end

vtu_right: .aib begin c3 mulh pr0, sr0 -> c2/cr1 # R*HALO c2 addu pr0, cr1 -> pr0 # (L+R)*HALO .aib end

vtu_center: .aib begin c3 mulh pr0, sr1 -> c2/cr0 # M*CENTER c2 addu cr0, pr0 -> c0/cr0 # nextMiddle = (L+R)*HALO + M*CENTER c0 addu cr0, pr1 -> pr0 # nextMiddleTop = nextMiddle + prevBottom c3 mulh pr0, sr0 -> c2/cr0 # M*HALO c2 addu cr0, pr0 -> c1/cr0, c0/pr1 # Bottom = (L+M+R)*HALO c1 addu cr0, pr0 -> cr0 # All = Bottom + MiddleTop c1 sra cr0, 8 -> c0/sd0 .aib end

Figure 7-5: Scale code for hpg. The code shown computes a column of the output image, with a width equal to the vector length. Not shown are the outer loop which computes the other columns in the image, or the code which handles the edges of the image.


Figure 7-6: Inputs and outputs for the dither benchmark: (a) gradation input, (b) gradation output, (c) dragon input, (d) dragon output. The inputs are gray-scale images with 8 bits per pixel, and the outputs are binary images with 1 bit per pixel.

                            White     Compute-ops per pixel        Pixels per 1000 cycles
Input        Size           pixels    Scale-pred   Scale-branch    Scale-pred   Scale-branch
gradation    64 KB          6.3%      27           28.3            181          178
dragon       139 KB         63.8%     27           12.8            211          289

Table 7.5: Performance comparison of branching and predication for the dither benchmark. The two inputs differ in their ratio of white pixels. The performance metric is the number of pixels processed per 1000 clock cycles.

Scale VPs exploit the reuse to perform only 3 loads, 4 multiplies, 5 additions, and 1 store for each output.

Figure 7-5 shows an overview of the Scale code. Each step uses three overlapping unit-stride vector-loads to fetch the input pixels and a unit-stride vector-store to write the output pixels. The computation within a step is split into three AIBs to enable chaining on the vector-loads. The two values which cross between stages of the computation are held in private VP registers (c0/pr0 and c0/pr1). The CENTER and HALO constants are held in shared registers, and chain registers are used to hold the temporaries between operations. The vector length is limited by the two private registers on cluster 0, and the 4 lanes are able to support 64 VPs. For this kernel, the cache acts as an important memory bandwidth amplifier. The overlapping vector-loads fetch each input byte three times, but the cache exploits the temporal locality so that each byte is only fetched once from external memory.

The hpg kernel requires higher load element bandwidth than RGB-to-YCC, but the performance is still limited by the multipliers on Scale. The peak of 4 multiplies per cycle results in an average of 10 total compute-ops per cycle, plus 3 load elements and 1 store element per cycle. Including the outer-loop vectorization overheads and the handling of the edges of the image, Scale is able to achieve around 96% of the theoretical peak. Overall, the filter runs at a rate of 0.96 pixels per cycle.
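For comparison with the reuse-optimized Scale mapping, the straightforward form of the computation is sketched below in plain C: 9 loads, 9 multiplies, and 8 additions per output pixel. The >>8 normalization mirrors the sra by 8 in Figure 7-5; the function name, argument layout, and the omission of edge handling and saturation are assumptions of this sketch rather than details of the benchmark.

#include <stdint.h>

/* Straightforward high-pass filter: each output pixel is a weighted sum of
 * its 3x3 neighborhood, the center pixel scaled by CENTER and the eight
 * neighbors scaled by HALO.  Edge rows/columns and saturation are omitted. */
void hpg_reference(const uint8_t *in, uint8_t *out, int rows, int cols,
                   int center, int halo)
{
    for (int r = 1; r < rows - 1; r++) {
        for (int c = 1; c < cols - 1; c++) {
            int sum = center * in[r * cols + c];      /* 1 load, 1 multiply  */
            for (int dr = -1; dr <= 1; dr++)          /* 8 neighbors:        */
                for (int dc = -1; dc <= 1; dc++)      /*   8 loads, 8 muls,  */
                    if (dr != 0 || dc != 0)           /*   8 additions       */
                        sum += halo * in[(r + dr) * cols + (c + dc)];
            out[r * cols + c] = (uint8_t)(sum >> 8);  /* normalize, 1 store  */
        }
    }
}

The Scale mapping instead keeps the three-pixel HALO row sums in private registers and reuses them two output rows later, which is what reduces the per-pixel cost to 3 loads, 4 multiplies, and 5 additions.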

7.3.2 Data-Parallel Loops with Conditionals

Traditional vector machines handle conditional code with predication (masking), but the VT architecture adds the ability to conditionally branch. Predication can be less overhead for small conditionals, but branching results in less work when conditional blocks are large.

The EEMBC dither benchmark includes an example of a large conditional block in a data-parallel loop. As shown in Figure 7-6, the benchmark converts gray-scale images to black and

145 # initialize c2 srl pr0, sr1 −> c3/p c3 psel sr0, sr1 −> pr0 c2 copy sr1 −> pr1 c1 fetch sr0

# load comparison bit pos. c3 copy pr0 −> c0/pr0 c0 lw 4(pr0) −> c2/pr2 c1 fetch sr1

# check if done, branch c2 slt pr2, pr1 −> c1/p c1 psel.fetch sr2, sr3

# choose left or right branch # based on chosen address bit c2 srl pr0, pr2 −> c0/p c0 (p) lw 12(pr0) −> c3/pr0 c0 (!p) lw 8(pr0) −> c3/pr0 c2 copy pr2 −> pr1 c1 fetch sr0

# check for address match c0 lw 0(pr0) −> c2/cr0 c2 seq cr0, pr0 −> p

Figure 7-7: VP code for the lookup benchmark. The code uses a loop to traverse the nodes of a tree. A conditional fetch instruction exits the loop once the search is complete.

white. The code uses the Floyd-Steinberg dithering algorithm to diffuse quantization errors, and it handles white pixels as a special case since they do not contribute to the error. Table 7.5 compares the performance of a conditional branch Scale mapping (Scale-branch) to one which uses predication instead (Scale-pred). In both mappings, the VP code to process a pixel comprises four AIBs. In Scale-branch, a vector-fetched AIB checks for a white pixel before conditionally fetching the second AIB; then, the fetches in the second and third AIBs are unconditional. In place of the VP fetch instructions, Scale-pred uses four vector-fetches to execute the AIBs. Thus, the VP fetch instructions add some overhead to Scale-branch, but it executes only 3 VP instructions for white pixels versus 30 for non-white pixels. Depending on the white content of the input image, the performance of Scale-branch can greatly exceed Scale-pred. For the reference input image, a gradation which has only 6.3% white pixels, Scale-branch runs around 2% slower than Scale-pred. However, for the dragon drawing with 64% white pixels, Scale-branch processes 289 pixels per thousand cycles compared to 211 for Scale-pred, a 37% increase.
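A plain-C sketch of the per-pixel work makes the asymmetry visible: the white-pixel test guards a large block of error-diffusion code that Scale-branch skips entirely with a conditional thread-fetch (the roughly 3-versus-30 instruction split quoted above), while Scale-pred must predicate through it. The white threshold, error-buffer layout, and exact diffusion details below are illustrative assumptions, not the benchmark's reference code.

#include <stdint.h>
#include <string.h>

#define WHITE 255   /* assumed white value for the special case */

/* Floyd-Steinberg-style dithering of an 8-bit gray image to a 1-bit image
 * (stored one byte per pixel here for simplicity).  cur[] and next[] are
 * caller-provided error buffers of cols entries each, zero-initialized. */
void dither_image(const uint8_t *in, uint8_t *out, int rows, int cols,
                  int *cur, int *next)
{
    for (int r = 0; r < rows; r++) {
        memset(next, 0, (size_t)cols * sizeof(int));
        for (int c = 0; c < cols; c++) {
            int idx = r * cols + c;
            if (in[idx] == WHITE) {              /* cheap path: white in,      */
                out[idx] = 1;                    /* white out, no error to     */
                continue;                        /* diffuse                    */
            }
            int pix = in[idx] + cur[c];
            int q = (pix >= 128);                /* quantize to black or white */
            out[idx] = (uint8_t)q;
            int e = pix - (q ? 255 : 0);         /* quantization error         */
            if (c < cols - 1) cur[c + 1]  += (e * 7) / 16;  /* right           */
            if (c > 0)        next[c - 1] += (e * 3) / 16;  /* below-left      */
            next[c]                       += (e * 5) / 16;  /* below           */
            if (c < cols - 1) next[c + 1] += (e * 1) / 16;  /* below-right     */
        }
        int *tmp = cur; cur = next; next = tmp;  /* advance to next row        */
    }
}

Whether white pixels also absorb incoming error differs between implementations; this sketch simply drops it.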

7.3.3 Data-Parallel Loops with Inner-Loops

Often an inner loop has little or no available parallelism, but the outer loop iterations can run concurrently. For example, the EEMBC lookup benchmark models a router using a PATRICIA tree [Mor68] to perform IP route lookups. The benchmark searches the tree for each IP address in an input array, with each lookup chasing pointers through around 5–12 nodes of the tree. Very little parallelism is available in each search, but many searches can run simultaneously.

In the Scale implementation, each VP handles one IP lookup. The computation begins with a


Figure 7-8: Performance of the lookup benchmark as a function of number of lanes and number of VPs per lane. The performance metric is the number of searches completed per 1000 clock cycles.

vector-load of the IP addresses, followed by a vector-fetch to launch the VP threads. The VP code, shown in Figure 7-7, traverses the tree using a loop composed of three AIBs. Although it only contains 10 VP instructions, one iteration of this inner loop takes around 21 cycles to execute on Scale due to the memory, fetch, and inter-cluster communication latencies. The loop is split into three AIBs to maximize the potential parallelism—after doing a load, a VP does not use the result until its next AIB, allowing the execution of other VPs to be interleaved with the memory access.

The microarchitectural simulator gives us the ability to evaluate a range of Scale configurations. To show scaling effects, we model Scale configurations with one, two, four, and eight lanes. For eight lanes, the memory system is doubled to eight cache banks and two 64-bit DRAM ports to appropriately match the increased compute bandwidth. The same binary code is used across all Scale configurations. Figure 7-8 evaluates the performance of lookup on Scale configurations with different numbers of lanes and with different vector lengths. The VP code uses 3 private registers, allowing Scale to support 10 VPs per lane. However, in the study, the vector length is artificially constrained to evaluate performance with 1–10 VPs per lane. The results show the benefit of exploiting the thread parallelism spatially across different lanes: performance increases roughly linearly with the number of lanes. The results also demonstrate the importance of exploiting thread parallelism temporally by multiplexing different VPs within a lane: performance increases by more than 3× as more VPs are mapped to each lane. With 4 lanes and a full vector length (40), Scale executes 6.1 compute-ops per cycle, 1.2 of which are loads. Including the overhead of initiating the searches and checking the final results, the overall rate for searching the tree is 0.58 node traversals per cycle, or 1.7 cycles per traversal.
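The pointer-chasing loop that each VP runs (Figure 7-7) corresponds roughly to the C sketch below of a PATRICIA-style lookup. The node layout and field names are hypothetical, chosen only to mirror the loads and bit tests visible in the VP code; the initialization and exact termination details of the benchmark differ slightly.

#include <stdint.h>

struct ptnode {
    uint32_t addr;            /* address stored at this node             */
    uint32_t bit;             /* bit position tested at this node        */
    struct ptnode *left;      /* child followed when the tested bit is 0 */
    struct ptnode *right;     /* child followed when the tested bit is 1 */
};

/* Walk the tree while the tested bit position keeps decreasing (the usual
 * PATRICIA termination test), then check for an exact address match. */
const struct ptnode *pt_lookup(const struct ptnode *root, uint32_t ip)
{
    const struct ptnode *node = root;
    uint32_t prev_bit = ~0u;
    while (node->bit < prev_bit) {
        prev_bit = node->bit;
        node = ((ip >> node->bit) & 1) ? node->right : node->left;
    }
    return (node->addr == ip) ? node : 0;
}

Each trip around this loop is one load-dependent step, which is why interleaving many VP searches per lane is so effective at hiding the memory and fetch latencies.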

7.3.4 Loops with Cross-Iteration Dependencies

Many loops are non-vectorizable because they contain loop-carried data dependencies from one iteration to the next. Nevertheless, there may be ample loop parallelism available when there are operations in the loop which are not on the critical path of the cross-iteration dependency. The vector-thread architecture allows the parallelism to be exposed by making the cross-iteration (cross-VP) data transfers explicit. In contrast to software pipelining for a VLIW architecture, the vector-

thread code need only schedule instructions locally in one loop iteration. As the code executes on a vector-thread machine, the dependencies between iterations resolve dynamically and the performance automatically adapts to the software critical path and the available hardware resources.

ADPCM Decode Example. Compression and decompression algorithms generally have dependencies from one coded element to the next. In the audio domain, analog signals are digitized in a process called pulse code modulation (PCM). A differential PCM encoding takes advantage of the high correlation between successive samples to compress the digital data as a sequence of differences instead of a sequence of samples. An adaptive differential PCM (ADPCM) encoding changes the step size of the digital quantization based on the signal itself—small steps are used when the signal amplitude is small, while large steps are used when the signal amplitude is large. The code in Figure 4-6 shows an example ADPCM decoder which converts an 8-bit input stream to a 16-bit output stream. In each iteration of the loop, a new input value (delta) is combined with the previous history to determine the step size for the reconstructed output (valpred). The index and valpred variables are accumulated from one iteration to the next.

The adpcm dec benchmark in the Mediabench suite is similar to the example in Figure 4-6, except the 16-bit signal is compressed as a stream of 4-bit values. This slightly complicates the code since each byte of compressed input data contains two elements. Another difference is that valpred is computed with a series of shifts and adds instead of a multiply. The Scale mapping is similar to that in Figure 4-6, and each VP handles one output element. Instead of vector-loading the input data, each VP uses a load-byte instruction, and then it conditionally uses either the lower or upper 4 bits depending on whether it is even or odd (the VPs each hold their index number in a private register). Overall, the Scale AIB for adpcm dec uses 33 VP instructions, compared to 16 in the example in Figure 4-6. Cluster 0 uses only one private register and seven shared registers, resulting in a maximum vector length of 100.

Figure 7-9 shows the theoretical peak performance of adpcm dec, under the assumption that a loop iteration can execute at a rate of one instruction per cycle. The software critical path is through the two loop-carried dependencies (index and valpred) which each take 5 cycles to compute (Figure 4-6 shows the 5 VP instructions used to compute the saturating accumulations). Thus, a new iteration can begin executing—and an iteration can complete—at an interval of 5 cycles, resulting in 33 instructions executed every 5 cycles, or 6.6 instructions per cycle. In order to achieve this throughput, 7 iterations must execute in parallel, a feat difficult to achieve through traditional software pipelining or loop unrolling. An alternative is to extract instruction-level parallelism within and across loop iterations, but this is difficult due to the sequential nature of the code. In particular, an iteration includes 3 load-use dependencies: the input data, the indexTable value, and the stepsizeTable value. The minimum execution latency for an iteration is 29 cycles on Scale, even though the 33 VP instructions are parallelized across 4 clusters.

Scale exploits the inherent loop-level parallelism in adpcm dec by executing iterations in parallel across lanes and by overlapping the execution of different iterations within a lane. Figure 7-10 diagrams a portion of the execution. Although the code within an iteration is largely serial, the Scale AIB partitions the instructions to take advantage of cluster decoupling.
Cluster 0 loads the delta, indexTable, and stepsizeTable values, c1 accumulates index from the previous iteration to the next, c2 performs the valpred calculation, and c3 accumulates valpred from the previous iteration to the next. The cluster execution unfolds dynamically subject to these dependencies; i.e., c0 runs ahead and c3 trails behind. Across lanes, the explicit nextVP sends and prevVP receives allow the execution to dynamically adapt to the software critical path. The two cross-iteration dependencies propagate in parallel on c1 and c3. The combination of cluster decoupling within a lane and parallel execution across lanes allows up to 9 VPs (loop iterations) to execute concurrently in


Figure 7-9: Ideal execution of the adpcm dec benchmark, assuming each iteration executes one instruction per cycle. The two cross-iteration dependencies limit the parallel execution to one iteration per 5 cycles, or 6.6 instructions per cycle.


Figure 7-10: Execution of the adpcm dec benchmark on Scale. Each small shaded box represents an individual compute-op. The narrow lines within a lane indicate inter-cluster data transfers, and the thicker arrows between lanes indicate cross-VP data transfers.

spin_on_lock: .aib begin
c0 lw.atomic.or 0(sr0), 1 -> pr0                                 # try to get the lock (set to 1).
c0 seq pr0, 0 -> c1/p                                            # did we get the lock (old value was 0)?
c1 (p) addu.fetch pr0, (critical_section - spin_on_lock) -> pr0
c1 (!p) addu.fetch pr0, (spin_on_lock - spin_on_lock) -> pr0
.aib end

Figure 7-11: VP code for acquiring a lock. The lock address is held in c0/sr0 and the address of the current AIB is held in c1/pr1. If it acquires the lock, the VP fetches the "critical section" AIB, otherwise it re-fetches the same AIB to try again.

the VTU. The output data is written to the store-data cluster and then transferred to memory using vector-stores which operate in the background of the computation. Including all overheads, Scale achieves 97.5% of its ideal performance for adpcm dec. It executes 6.46 compute-ops per cycle and completes a loop iteration every 5.13 cycles. The input stream is processed at 0.20 bytes per cycle, and the output rate is 0.78 bytes per cycle. This is a speedup of 7.6× compared to running the ADPCM decode algorithm on the control processor alone.
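For readers unfamiliar with the algorithm, one iteration of an IMA-style ADPCM decoder looks roughly like the C below; index and valpred are the two loop-carried values discussed above, and each is updated by a short saturating-accumulate chain. The standard IMA adjustment and step-size tables are assumed to be defined elsewhere, and benchmark-specific details may differ from this sketch.

#include <stdint.h>

extern const int indexTable[16];     /* standard IMA index adjustments, assumed provided */
extern const int stepsizeTable[89];  /* standard IMA quantizer step sizes, assumed provided */

/* Decode one 4-bit code.  The two values carried from iteration to
 * iteration, index and valpred, are exactly the cross-VP (cross-iteration)
 * transfers in the Scale mapping. */
static int16_t adpcm_decode_step(int delta, int *index, int *valpred)
{
    int step = stepsizeTable[*index];

    /* saturating accumulate of the quantizer index */
    *index += indexTable[delta & 0xf];
    if (*index < 0)  *index = 0;
    if (*index > 88) *index = 88;

    /* reconstruct the difference from the 4-bit code (shift/add form) */
    int vpdiff = step >> 3;
    if (delta & 4) vpdiff += step;
    if (delta & 2) vpdiff += step >> 1;
    if (delta & 1) vpdiff += step >> 2;
    if (delta & 8) vpdiff = -vpdiff;

    /* saturating accumulate of the predicted output value */
    *valpred += vpdiff;
    if (*valpred >  32767) *valpred =  32767;
    if (*valpred < -32768) *valpred = -32768;

    return (int16_t)*valpred;
}

Only the two saturating accumulations sit on the cross-iteration critical path; everything else can overlap across iterations, which is the parallelism Scale extracts.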

Other Examples. The two MiBench cryptographic kernels, sha and rijndael, have many loop-carried dependences. The sha mapping uses 5 cross-VP data transfers, while the rijndael mapping vectorizes a short four-iteration inner loop. Scale is able to exploit instruction-level parallelism within each iteration of these kernels by using multiple clusters, while also exploiting limited cross-iteration parallelism on multiple lanes.

7.3.5 Reductions and Data-Dependent Loop Exit Conditions

Scale provides efficient support for arbitrary reduction operations by using shared registers to accumulate partial reduction results from multiple VPs on each lane. The shared registers are then combined across all lanes at the end of the loop using the cross-VP network. The pktflow code uses reductions to count the number of packets processed.

Loops with data-dependent exit conditions ("while" loops) are difficult to parallelize because the number of iterations is not known in advance. For example, the strcmp and strcpy standard C library routines, used in the text benchmark, loop until the string termination character is seen. The cross-VP network can be used to communicate exit status across VPs but this serializes execution. Alternatively, iterations can be executed speculatively in parallel and then nullified after the correct exit iteration is determined. The check to find the exit condition is coded as a cross-iteration reduction operation. The text benchmark executes most of its code on the control processor, but uses this technique for the string routines to attain a 1.5× overall speedup.
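The reduction pattern described above has a simple shape in plain C: each worker accumulates into its own partial result, and the partials are combined once at the end. On Scale the partials live in per-lane shared registers and the final combination uses the cross-VP network; the worker count and names below are purely illustrative.

#define NUM_WORKERS 4   /* one partial per lane, illustrative */

/* Count how many elements satisfy a predicate using per-worker partials
 * followed by a single combining pass. */
int count_matches(const int *data, int n, int (*pred)(int))
{
    int partial[NUM_WORKERS] = {0};
    for (int i = 0; i < n; i++)
        partial[i % NUM_WORKERS] += pred(data[i]) ? 1 : 0;

    int total = 0;
    for (int w = 0; w < NUM_WORKERS; w++)   /* combine, like the cross-VP pass */
        total += partial[w];
    return total;
}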

7.3.6 Free-Running Threads

When structured loop parallelism is not available, VPs can be used to exploit arbitrary thread parallelism. With free-running threads, the control processor interaction is minimized. Each VP thread runs in a continuous loop processing tasks from a shared work-queue (in memory). These tasks may be queued by the control processor or the VPs themselves. The VP threads use atomic memory operations to synchronize accesses to shared data structures. Figure 7-11 shows a simple example of an AIB to acquire a lock.

vp_reverse_strcmp: .aib begin
c0 lbu 0(pr1) -> c1/cr0, c2/cr0   # get byte from 2nd string
c1 sne cr0, 0 -> p                # is the byte 0?
c0 lbu 0(pr0) -> c1/cr1, c2/cr1   # get byte from 1st string
c1 (p) seq cr0, cr1 -> p          # determine whether to continue
c2 subu cr0, cr1 -> c1/pr2        # calculate the return value
c1 psel.fetch pr1, pr0 -> pr0     # either loop or return
c0 addu pr0, 1 -> pr0             # increment the 1st pointer
c0 addu pr1, 1 -> pr1             # increment the 2nd pointer
.aib end

Figure 7-12: VP function for string comparison. Pointers to the two character arrays are passed in c0/pr0 and c0/pr1. The current AIB address is held in c1/pr0, and the return address is passed in c1/pr1. The result is returned in c1/pr2. The function executes as a loop, repeatedly fetching the same AIB as long as the string characters are identical. It returns to the caller once a difference is found.
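As a cross-check, the VP function in Figure 7-12 is essentially a string comparison; a rough C equivalent is shown below. Note that, as in the AIB, the returned difference is the second string's byte minus the first's, hence the "reverse" in the name.

/* C rendering of the vp_reverse_strcmp AIB: loop while the characters
 * match and the second string has not ended, then return the difference. */
int reverse_strcmp(const unsigned char *s1, const unsigned char *s2)
{
    while (*s2 != 0 && *s2 == *s1) {
        s1++;
        s2++;
    }
    return (int)*s2 - (int)*s1;
}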

The free-running threads method can also be used as an alternative to vector-fetched VP threads, even when structured parallelism is available. Since the VPs schedule their own work, they can run continuously to keep the VTU constantly utilized. In comparison, when VP threads are repeatedly launched by the control processor, the lane command management units must wait until one set finishes before initiating the next (this is described in Section 5.3.2). This is generally not a problem if the VP threads all run for the same duration, but the lane can be underutilized when one VP runs for longer than the others. Free-running threads can improve the load-balancing by eliminating unnecessary synchronization points. The quick-sort algorithm (qsort) is an example of a kernel implemented on Scale using free- running threads. The divide-and-conquer algorithm recursively partitions an input list into sublists which are sorted independently. The potential parallelism grows with each recursion. The Scale mapping includes a shared work queue of lists to be sorted. A VP thread 1) pops a task from the head of the work queue, 2) partitions the target list into two sublists, 3) pushes one of the sublists onto the tail of the work queue, and then 4) proceeds to process the other sublist. When a VP thread reaches the end of the recursion and its list does not divide into sublists, it repeats the process and checks the work queue to handle any remaining tasks. The VP threads terminate when the work queue is empty, at which point the sorting is complete. In comparison to the simple VP threads described in previous examples, those for qsort are significantly more substantial. The VP code includes around 450 VP instructions organized into 65 AIBs with relatively complicated control flow. The code also uses more private registers than the other examples, and the vector length is limited to 12 (3 VPs per lane). The quick-sort algorithm provided by the C Standard Library is a polymorphic function that can sort different types of objects. To achieve this, a pointer to the comparison routine used to make sorting decisions is passed as an argument to qsort. The Scale implementation retains this feature, and the VPs actually execute a call to a separately defined function. An example is the string comparison function shown in Figure 7-12. The calling convention passes arguments in private VP registers, and it includes a return address. The register configuration and usage policies would have to be specified in much more detail in a more rigorously defined application binary interface (ABI) for VPs. Quick-sort’s free-running threads stress Scale in different ways than benchmarks with more structured parallelism. The VP thread fetch instructions have an AIB cache miss rate of 8.2%. However, this should not be interpreted in the same way as the instruction cache miss rate on a scalar architecture. Since instructions are grouped together in AIBs, the miss rate per compute-op

151 is actually only 0.99%, with an average of 101 compute-ops executed per miss. Scale’s decoupled AIB fill unit and time-multiplexing of VP threads are able to effectively hide much of the latency of these AIB cache misses: even when the benchmark is run in a “magic AIB cache fill” mode, the performance only improves by 6%. Despite having a high cache miss rate, qsort executes an average of 1.9 compute-ops per cycle and achieves a 3× speedup compared to the control processor alone. With a magic main memory that has a zero-cycle latency and unlimited bandwidth, performance improves by 57%.
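The free-running structure itself (each VP thread repeatedly popping work from a shared, lock-protected queue, as in the qsort mapping described above) has roughly the following shape in C. The queue layout, its fixed capacity, and the partition() helper are assumptions of this sketch; the spin lock mirrors the atomic fetch-or lock of Figure 7-11.

#include <stdatomic.h>

#define WQ_CAP 1024            /* illustrative fixed queue capacity */

/* A task names one sublist [lo, hi) still to be sorted. */
struct task { int lo, hi; };

/* Shared work queue guarded by a test-and-set lock, in the spirit of the
 * lw.atomic.or spin lock in Figure 7-11. */
struct work_queue {
    atomic_int  lock;          /* 0 = free, 1 = held */
    struct task items[WQ_CAP];
    int         head, tail;    /* protected by lock  */
};

static void wq_lock(struct work_queue *q)   { while (atomic_exchange(&q->lock, 1)) {} }
static void wq_unlock(struct work_queue *q) { atomic_store(&q->lock, 0); }

/* Assumed helper: returns a split point strictly inside (lo, hi). */
extern int partition(int *data, int lo, int hi);

/* Body of one free-running worker: pop a sublist, partition it, push one half
 * back on the queue, keep working on the other half; exit when the queue is
 * empty, matching the simplified description in the text. */
void worker(struct work_queue *q, int *data)
{
    for (;;) {
        wq_lock(q);
        if (q->head == q->tail) { wq_unlock(q); return; }
        struct task t = q->items[q->head++ % WQ_CAP];
        wq_unlock(q);

        while (t.hi - t.lo > 1) {
            int mid = partition(data, t.lo, t.hi);
            struct task rest = { mid, t.hi };
            wq_lock(q);
            q->items[q->tail++ % WQ_CAP] = rest;   /* push one sublist    */
            wq_unlock(q);
            t.hi = mid;                            /* keep the other half */
        }
    }
}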

7.3.7 Mixed Parallelism

Some codes exploit a mixture of parallelism types to accelerate performance and improve efficiency. The garbage collection portion of the lisp interpreter (li) is split into two phases: mark, which traverses a tree of currently live lisp nodes and sets a flag bit in every visited node, and sweep, which scans through the array of nodes and returns a linked list containing all of the unmarked nodes [Asa98]. During mark, the Scale code sets up a queue of nodes to be processed and uses a stripmine loop to examine the nodes, mark them, and enqueue their children. In the sweep phase, VPs are assigned segments of the allocation array, and then they each construct a list of unmarked nodes within their segment in parallel. Once the VP threads terminate, the control processor vector-fetches an AIB that stitches the individual lists together using cross-VP data transfers, thus producing the intended structure. Although the garbage collector has a high cache miss rate, the high level of parallelism exposed in this way allows Scale to sustain 2.7 operations/cycle and attain a speedup of 5.3 over the control processor alone.

7.4 Locality and Efficiency

The strength of the Scale VT architecture is its ability to capture a wide variety of application parallelism while using simple microarchitectural mechanisms that exploit locality. This section evaluates the effectiveness of these mechanisms, and Table 7.6 presents a variety of statistics for each benchmark. The data characterizes the VP configuration and vector-length, and the use of Scale’s control and data hierarchies.

7.4.1 VP Configuration

Shared registers serve two purposes on Scale: they can hold shared constants used by all VPs, and they can be used as temporary registers within an AIB. The use of shared registers increases the virtual processor vector-length, or equivalently it reduces the register file size needed to support a large number of VPs. Table 7.6 shows that more often than not the VPs are configured with more shared than private registers. The significant variability in VP register configurations for different kernels highlights the importance of allowing software to configure VPs with just enough of each register type.

A moderate vector length allows the control processor overhead to be amortized by many VPs, and it provides the VTU with parallelism to hide functional unit and cache access latencies. Longer vector lengths also improve the efficiency of vector memory commands. As shown in the table, around 2/3 of the benchmarks sustain average vector-lengths of 16 or more, and 9 out of the 32 benchmarks have an average vector length greater than 50. A few benchmarks—like rotate, fdct, idct, and scrambler—have vector lengths constrained by large private register counts. Others have shorter vector lengths due to parallelism limitations inherent to the algorithms.

152 Averages Per 100 VTU Compute-Ops VP config. Control Data Hierarchy (Averages) Hierarchy Sources Dests. Loads Stores

Benchmark Private Regs Shared Regs VL Max Vector Length CP Instructions Vector-Fetches Thread-Fetches Fetch EDs AIBs AIB$ Misses Chain Registers Register File Immediates Chain Registers Register File Lane Transports NextVP Transports Elements Addresses Cache Bytes DRAM Bytes Elements Addresses Cache Bytes DRAM Bytes rgbcmy 5 5 40 40 1.3 0.3 – 1.2 11.8 – 65 118 18 53 76 76 – 18 7 18 18 6 2 24 26 rgbyiq 6 16 28 28 1.4 0.2 – 0.7 4.8 – 43 157 14 43 71 71 – 14 6 14 14 14 5 14 31 hpg 5 3 64 64 1.7 0.4 – 1.7 27.3 – 55 136 9 55 82 82 – 27 8 27 9 9 2 9 21 text 4 4 85 7 318.7 2.3 – 9.3 15.6 – 24 74 12 24 68 52 15 24 9 24 – 4 1 4 1 dither 10 12 36 30 5.1 0.1 10 10.3 13.4 – 36 113 31 39 53 50 – 23 16 37 4 5 3 9 – rotate 22 0 16 12 2.9 0.5 – 1.9 5.9 – 0 79 53 0 58 37 – 5 2 5 – 5 5 5 – lookup 8 8 40 40 0.2 0.0 30 29.8 30.8 – 1 122 20 1 49 30 – 21 20 84 – 0 0 0 – ospf 4 1 28 28 0.2 0.0 6 5.8 5.8 0.9 9 110 46 15 77 48 – 20 20 78 – 9 8 37 – pktflow (512KB) 5 12 56 46 0.9 0.4 3 4.6 23.1 – 55 80 41 54 51 39 – 20 19 49 – 2 0 6 – pktflow (1MB) 5 12 56 51 0.7 0.4 3 4.5 23.1 – 55 80 42 54 51 38 – 20 19 49 106 2 0 6 6 pktflow (2MB) 5 12 56 52 0.6 0.4 3 4.6 23.1 – 55 80 42 54 51 38 – 20 19 49 105 2 0 6 6 pntrch 6 16 70 13 0.0 0.0 14 14.2 14.2 0.2 29 107 36 36 21 29 – 14 14 57 – 0 0 1 – fir-hl 5 4 64 60 4.3 0.4 – 1.7 25.3 – 49 148 3 49 78 49 – 25 11 99 5 1 1 5 3 fbital (step) 4 9 116 20 10.6 0.5 10 12.5 20.8 – 12 114 26 12 78 73 11 16 10 32 – 1 0 1 – fbital (pent) 4 9 116 100 1.6 0.1 8 8.1 15.5 – 15 124 23 15 84 77 8 15 10 31 – 0 0 1 – fbital (typ) 4 9 116 85 1.2 0.1 8 8.2 15.7 – 15 130 23 15 84 77 8 15 10 31 – 7 2 14 – fft 11 1 32 26 14.3 1.3 – 5.2 33.0 – 33 186 17 33 112 67 – 45 22 91 – 37 9 74 – viterb 16 10 24 16 8.2 1.2 4 9.1 23.0 – 31 153 16 31 72 72 – 10 4 40 – 11 5 42 – autocor (data1) 3 2 124 8 23.5 4.5 – 17.8 35.6 – 63 131 4 63 84 66 – 36 18 71 – 2 0 4 – autocor (data2) 3 2 124 16 8.4 2.1 – 8.3 33.4 – 67 133 0 67 75 67 – 35 17 71 – 0 0 0 – autocor (data3) 3 2 124 32 4.2 1.0 – 4.2 33.4 – 67 133 0 67 71 67 – 34 14 69 – 0 0 0 – conven (data1) 5 7 60 32 3.5 0.4 – 1.6 12.5 – 25 117 58 33 108 46 – 8 6 17 – 2 1 8 – conven (data2) 5 7 60 32 3.5 0.4 – 1.6 12.5 – 30 115 55 37 105 48 – 10 7 20 – 2 1 10 – conven (data3) 5 7 60 32 3.9 0.4 – 1.7 13.3 – 33 113 53 40 100 47 – 13 9 27 – 3 1 13 – rijndael 13 8 24 4 11.1 1.9 – 7.7 7.7 – 77 94 27 76 44 50 – 33 32 132 – 4 2 15 – sha 1 4 114 6 19.0 1.2 – 3.7 6.7 – 103 30 41 89 17 54 19 17 14 68 – 5 3 18 – qsort 25 19 12 12 26.2 0.0 12 12.0 12.0 1.0 9 127 71 8 63 5 – 20 20 70 68 17 17 67 – adpcm enc 3 20 96 91 0.3 0.0 – 0.1 2.0 – 26 79 26 24 48 44 8 6 5 20 – 1 1 1 – adpcm dec 1 16 100 100 0.5 0.0 – 0.1 3.0 – 18 107 38 18 46 12 9 9 9 27 – 3 1 6 – li.gc 4 5 113 63 0.7 0.1 8 8.2 11.3 – 34 73 68 35 24 50 2 12 11 47 55 9 9 34 10 rgbycc (1k) 6 10 32 31 1.5 0.2 – 0.6 4.8 – 86 114 14 86 29 71 – 14 6 14 – 14 4 14 – rgbycc (100k) 6 10 32 32 1.3 0.1 – 0.6 4.8 – 86 114 14 86 29 71 – 14 6 14 14 14 4 14 35 fdct (32) 16 38 12 9 4.3 0.4 – 1.4 3.4 – 59 149 4 57 70 61 – 14 6 27 – 14 4 27 – fdct (1000) 16 38 12 10 4.1 0.4 – 1.4 3.4 – 59 149 4 57 70 61 – 14 7 27 28 14 4 27 17 idct (32) 18 28 12 9 3.9 0.3 – 1.3 3.1 – 72 128 3 64 56 61 – 13 6 25 – 12 4 25 – idct (1000) 18 28 12 10 3.8 0.3 – 1.3 3.1 – 72 128 3 64 56 61 – 13 6 25 25 12 4 25 17 quantize (32) 3 6 120 114 0.9 0.1 – 0.3 8.3 – 8 108 25 8 100 75 – 25 8 42 – 8 2 17 – quantize (1000) 3 6 120 120 0.8 0.1 – 0.3 8.3 – 8 108 25 8 100 75 – 25 8 42 25 8 2 17 34 jpegenc 16 26 16 4 16.8 1.2 6 10.8 10.8 – 23 71 34 23 41 40 17 12 11 21 1 1 1 1 – 
scrambler (4k) 8 4 12 12 6.6 0.5 – 1.9 5.7 0.2 23 184 6 23 99 22 – 22 14 87 – 22 11 86 – scrambler (100k) 8 4 12 12 6.2 0.5 – 1.9 5.7 – 23 187 6 23 100 23 – 23 15 90 91 23 11 90 118 convolenc (4k) 4 4 56 54 0.8 0.1 – 0.4 6.1 – 73 79 48 73 33 36 – 15 13 27 – 3 1 12 – convolenc (100k) 4 4 56 56 0.7 0.1 – 0.4 6.1 – 73 79 48 73 33 36 – 15 13 27 6 3 1 12 13 interleaver (4.8k) 9 8 36 33 0.3 0.1 – 0.5 3.9 – 83 79 42 83 18 58 – 1 1 4 – 4 1 4 – interleaver (96k) 9 8 36 36 0.2 0.1 – 0.4 3.9 – 83 79 42 83 18 58 – 1 1 4 4 4 1 4 23 mapper (1.2k) 8 2 20 17 3.4 0.5 – 2.2 9.0 – 58 127 42 58 50 71 – 36 39 130 – 38 16 153 – mapper (48k) 8 2 20 20 2.5 0.4 – 1.8 9.0 – 58 127 43 58 50 71 – 37 41 131 15 38 16 153 249 ifft (32) 19 10 16 16 1.1 0.2 – 1.0 3.9 1.0 64 86 43 58 41 48 – 4 3 16 – 3 2 14 – ifft (1000) 19 10 16 16 1.0 0.2 – 1.0 3.9 1.0 64 86 43 58 40 48 – 4 3 16 24 3 2 14 4

Table 7.6: VP configuration, control hierarchy, and data hierarchy statistics for Scale benchmarks. The VP configuration statistics are averages of data recorded at each vector-fetch command. The private and shared register counts represent totals across all four clusters. The control and data hierarchy statistics are presented as averages per 100 executed compute-ops. The source and destination operand stats include vector-load writebacks and vector-store reads. The load addresses statistic includes accesses by the vector refill unit.

7.4.2 Control Hierarchy

A VT machine amortizes control overhead by exploiting the locality exposed by AIBs and vector-fetch commands, and by factoring out common control code to run on the control processor. A vector-fetch broadcasts an AIB address to all lanes and each lane performs a single tag-check to determine if the AIB is cached. On a hit, an execute directive is sent to the clusters which then retrieve the instructions within the AIB using a short (5-bit) index into the small AIB cache. The cost of each instruction fetch is on par with a register file read. On an AIB miss, a vector-fetch will broadcast AIBs to refill all lanes simultaneously. The vector-fetch ensures an AIB will be reused by each VP in a lane before any eviction is possible. When an AIB contains only a single instruction on a cluster, a vector-fetch will keep the ALU control lines fixed while each VP executes its operation, further reducing control energy.

As an example of amortizing control overhead, rgbycc runs on Scale with a vector-length of 32 and vector-fetches an AIB with 21 VP instructions (Section 7.3.1). Thus, each vector-fetch executes 672 compute-ops on the VTU, 168 per AIB cache tag-check in each lane. In Table 7.6, this data appears as 0.1 vector-fetches and 0.6 fetch EDs per 100 compute-ops. Also, since each AIB contains 21 compute-ops, there are 4.8 AIBs per 100 compute-ops (1/21 × 100). As shown in the table, most of the benchmarks execute more than 100 compute-ops per vector-fetch, even non-vectorizable kernels like adpcm dec/enc. The execute directive and AIB cache tag-check overhead occurs independently in each lane, but there are still generally only a few fetch EDs per 100 compute-ops. The kernels with more thread-fetches have greater control overhead, but it is still amortized over the VP instructions in an AIB. Even qsort—which only uses thread-fetches—generates only 12 fetch EDs per 100 compute-ops. A few of the benchmarks average 25–33+ AIBs per 100 compute-ops (3 or 4 compute-ops per AIB), but most have many compute-ops grouped into each AIB and thus far lower AIB overheads.
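The amortization arithmetic above can be restated compactly; the sketch below (illustrative names only, not from the Scale tools) reproduces the rgbycc numbers quoted in the text and in Table 7.6:

# Control statistics per 100 compute-ops for a vector-fetched AIB, using the
# rgbycc example (VL = 32, 21 VP instructions per AIB, 4 lanes).
def control_stats(vector_length, insts_per_aib, lanes=4):
    ops_per_vector_fetch = vector_length * insts_per_aib            # 672
    return {
        "ops_per_lane_tag_check": ops_per_vector_fetch // lanes,        # 168
        "vector_fetches_per_100_ops": 100.0 / ops_per_vector_fetch,     # ~0.15
        "fetch_EDs_per_100_ops": 100.0 * lanes / ops_per_vector_fetch,  # ~0.6
        "AIBs_per_100_ops": 100.0 / insts_per_aib,                      # ~4.8
    }

print(control_stats(32, 21))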

7.4.3 Data Hierarchy

AIBs help in the data hierarchy by allowing the use of chain registers to reduce register file energy. Table 7.6 shows the prevalence of chain registers for compute-op sources and destinations. 19 out of 32 benchmarks use fewer than 1.2 register file operands per compute-op. Half of the benchmarks have at least 40% of their compute-ops targeting chain registers.

Clustering in Scale is area and energy efficient and cluster decoupling improves performance. The clusters each contain only a subset of all possible functional units and a small register file with few ports, reducing size and energy. Each cluster executes compute-ops and inter-cluster transport operations in order, requiring only simple interlock logic with no inter-thread arbitration or dynamic inter-cluster bypass detection. Independent control on each cluster enables decoupled cluster execution to hide inter-cluster or cache access latencies. Table 7.6 shows that 3/4 (24) of the benchmarks have at least 40% of their compute-ops targeting remote clusters. 10 of the benchmarks have more than 70% of their compute-ops targeting remote clusters.

Vector memory commands exploit spatial locality by moving data between memory and the VP registers in groups. This improves performance and saves memory system energy by avoiding the additional arbitration, tag-checks, and bank conflicts that would occur if each VP requested elements individually. Table 7.6 shows the reduction in memory addresses from vector memory commands. The maximum improvement is a factor of four, when each unit-stride cache access loads or stores one element per lane, or each segment cache access loads or stores four elements per lane. 13 benchmarks load two or more elements per cache access, and 20 benchmarks store two or more elements per access. Scale can exploit memory data-parallelism even in loops with

non-data-parallel compute. For example, the fbital, text, and adpcm benchmarks use vector memory commands to access data for vector-fetched AIBs with cross-VP dependencies. Table 7.6 shows that the Scale data cache is effective at reducing DRAM bandwidth for many of the benchmarks. For example, hpg retrieves 3 bytes from the cache for every byte from DRAM, dither and mapper read around 9× more cache bytes than DRAM bytes, and for fir-hl the ratio is about 20×. A few exceptions are pktflow, li, and idct, for which the DRAM bytes transferred exceed the total bytes accessed. Scale always transfers 32-byte lines on misses, but support for non-allocating loads could potentially help reduce the bandwidth in these cases. For stores, fir-hl, li, fdct, idct, and ifft have cache bandwidths that are 1.5–3.5× greater than the DRAM bandwidth. However, several benchmarks have a DRAM store bandwidth that is greater than their cache store bandwidth. This is usually due to insufficient element merging within the 16-byte non-allocating store blocks. In some cases, merging increases if the memory output port becomes a bottleneck. This is because the data blocks remain on-chip for a longer amount of time, increasing the opportunity to capture the spatial locality.
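As a rough restatement of the address amortization described above (an illustrative sketch assuming four lanes and four-element segments):

# Cache addresses generated for a vector of elements, given how many elements
# each cache access can move.
def cache_addresses(vector_length, elements_per_address):
    return vector_length // elements_per_address

VL = 32
print(cache_addresses(VL, 1))   # 32 addresses if each VP accessed its element alone
print(cache_addresses(VL, 4))   # 8 addresses for unit-stride (one element per lane)
                                # or for 4-element segments (four elements, one lane)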

7.5 Applications

This section presents performance results for larger applications mapped to Scale. Both examples come from the XBS benchmark suite [Ger05], and they each combine several kernels from Section 7.2.

JPEG Encoder. The JPEG encoder application compresses red-green-blue (RGB) input pixels into a JPEG-formatted image [PM93]. The implementation used here is based on the work done by the Independent JPEG Group [The]. The encoder first converts the RGB pixels into the YCbCr color space using the rgbycc kernel. The pixels are then processed as 8×8 blocks. First they are transformed into a frequency domain representation using the fdct kernel. Then the blocks are quantized, with greater emphasis given to the lower frequency components. The quantization routine for Scale optimizes the quantize kernel used in Section 7.2 by converting divide operations into multiplies by a reciprocal [Ger05]. Finally, the jpegenc kernel compresses blocks using differential pulse code modulation, run-length encoding, and Huffman coding.

Figure 7-13 compares the performance of the compiled code running on the RISC control processor to the hand-optimized Scale code which uses the VTU. The vectorizable portions of the application (rgbycc, fdct, and quantize) have very large speedups. Most of the remaining runtime in the Scale mapping comes from jpegenc and the non-parallelized application overhead code. Overall, Scale encodes the image 3.3× faster than the RISC baseline. Scale averages 68.2 cycles per pixel, or 3.8 million pixels per second at 260 MHz.
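The reciprocal-multiply optimization mentioned above for quantize can be sketched as follows (illustrative Python, not the Scale or XBS source; the 16-bit fixed-point scale and rounding are assumptions, and the sketch handles nonnegative coefficients only):

# Quantize coefficients by multiplying with a precomputed scaled reciprocal,
# approximating round(c / q) without a divide in the inner loop.
SHIFT = 16

def make_reciprocals(quant_table):
    return [(1 << SHIFT) // q for q in quant_table]

def quantize_block(coeffs, recips):
    half = 1 << (SHIFT - 1)
    return [(c * r + half) >> SHIFT for c, r in zip(coeffs, recips)]

quant = [16, 11, 10, 16]                 # illustrative quantizer values
print(quantize_block([400, 233, 90, 47], make_reciprocals(quant)))  # [25, 21, 9, 3]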

802.11a Wireless Transmitter. The IEEE 802.11a transmitter application prepares data for wireless communication [IEE99]. It does this by transforming binary input data into orthogonal frequency division multiplexing (OFDM) symbols. The XBS implementation supports the 24 Mbps 802.11a data rate. The translation is performed by streaming data through a sequence of kernels: scrambler, convolenc, interleaver, mapper, ifft, and cycext.

Figure 7-14 compares the performance of optimized compiled C code running on the RISC control processor to the hand-optimized Scale implementation which uses the VTU. The ifft runtime dominates the performance of both mappings. Scale achieves large speedups on this vectorizable application, and the overall performance improvement over RISC is 24×. Over the entire run, Scale executes an average of 9.7 VTU compute operations per cycle while also fetching 0.55 load elements and writing 0.50 store elements per cycle. The off-chip memory bandwidth is 0.14 load bytes

[Figure 7-13 bar chart data, cycles (millions):
              RISC   Scale
  rgbycc      11.6    0.7
  fdct        15.6    1.1
  quantize     8.9    1.2
  jpegenc     16.4    8.3
  other        6.8    6.6]

Figure 7-13: JPEG encoder performance. The input is a 512×512 pixel image (0.75 MB).

[Figure 7-14 bar chart data, cycles (millions):
           scrambler  convolenc  interleaver  mapper    ifft    cycext   other
  RISC       0.76       7.47       22.61      32.03    354.02    6.42     0.10
  Scale      0.07       0.87        0.89       0.95     13.82    1.04     0.03]

Figure 7-14: IEEE 802.11a transmitter performance. The input is 240 kB. The total cycle counts are 423.4 million for RISC and 17.7 million for Scale.

and 0.72 store bytes per cycle. The Scale 802.11a transmitter processes input data at a rate of 74 cycles per byte, 28.2 Mbps at 260 MHz.
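For reference, the 74 cycles-per-byte and 28.2 Mbps figures follow directly from the totals in Figure 7-14 (a quick illustrative check):

# Convert the transmitter's total cycle count into cycles/byte and a bit rate.
scale_cycles = 17.7e6      # total Scale cycles (Figure 7-14)
input_bytes  = 240e3       # 240 kB input
clock_hz     = 260e6

cycles_per_byte = scale_cycles / input_bytes     # ~73.8
mbps = clock_hz / cycles_per_byte * 8 / 1e6      # ~28.2
print(round(cycles_per_byte, 1), round(mbps, 1))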

7.6 Measured Chip Results

This section presents performance, power, and energy measurements for benchmarks running on the Scale chip (see Tables 7.7 and 7.8). Due to the memory bandwidth constraints imposed by the system test infrastructure (explained in Section 6.1.7), results are only reported for benchmarks with data sets that fit within Scale's 32 KB cache. However, since the benchmarks include additional code that is outside of the timing loop, the Scale chip is run in its caching-enabled mode of operation. The performance results are reported as per-cycle throughputs which are independent of operating frequency. The power measurements were made at an operating frequency of 160 MHz. Figure 6-15 shows that power scales linearly with frequency, and there is a factor of 1.6 increase from 160 MHz to 260 MHz. In addition to performance throughputs, the tables show energy throughputs for the benchmarks. These are also based on the 160 MHz operating frequency, but Figure 6-16 shows that at a given supply voltage (1.8 V in this case) energy consumption does not vary significantly with frequency.

The result tables show performance speedups for the assembly-optimized code running on Scale compared to the compiled code running on the control processor. Similar caveats apply to this comparison as were mentioned in Section 7.2. In particular, the Scale code is hand-tuned and the control processor does not have an analogue to Scale's 16-bit multiply instruction. The result tables also compare the CP results to the idealistic RISC simulations from Section 7.2. The control processor runs 7–150% slower, with many benchmarks falling within the 25–60% range. The primary reasons for these slowdowns are (1) CP instruction fetch stalls after misses in the L0 buffer or branch mispredictions, (2) Scale's 2-cycle load-use latency for cache hits, and (3) the latency of the control processor's iterative multiply/divide unit.

The result tables show that the measured results for code running on the VTU are generally in line with simulation results. 13 out of the 21 benchmarks have a simulation error of less than 5%, and 17 have an error less than 10%. The outliers are generally benchmarks which execute more code on the control processor (see CP Instructions in Table 7.6). For example, text executes 3.2× more CP instructions than compute-ops, viterb executes 8 CP instructions per 100 compute-ops, and jpegenc executes 17 CP instructions per 100 compute-ops. The accuracy for fbital, autocor, and conven improves for runs with less control processor overhead. Overall, the measured chip performance validates the simulation results used elsewhere in this thesis. The measured speedups on Scale are generally significantly higher than those reported in Section 7.2 due to the ideal RISC model used in the simulator.

The results in the table show a significant variance in Scale's power consumption across different benchmarks. The range is from 242 mW for jpegenc to 684 mW for idct, a factor of 2.8×. At 260 MHz, these would scale to around 400 mW and 1.1 W respectively. The variation in power consumption corresponds well with the benchmark utilization rates, i.e. the per-cycle throughputs for compute operations, load elements, and store elements which are shown in the result tables in Section 7.2. For example, idct has the highest utilization: it executes 11.3 compute-ops per cycle while reading 1.4 load elements and writing 1.4 store elements. jpegenc only averages 1.7 compute-ops and 0.2 load elements per cycle.
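The energy-throughput columns in Tables 7.7 and 7.8 can be derived from the per-cycle throughput, the 160 MHz test frequency, and the measured power; a small illustrative check against the rgbycc row of Table 7.8:

# Work per unit energy = (work per second) / (power in watts).
def items_per_microjoule(items_per_1000_cycles, clock_hz, power_mw):
    items_per_second = items_per_1000_cycles / 1000.0 * clock_hz
    return items_per_second / (power_mw / 1000.0) / 1e6

# rgbycc on the VTU: 431.31 pixels per 1000 cycles at 536 mW and 160 MHz.
print(items_per_microjoule(431.31, 160e6, 536))   # ~128.7 pixels per microjoule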
The spread in power consumption across the benchmarks shows the effectiveness of clock-gating in the Scale chip. This gating is directly enabled by the Scale VT architecture’s partitioning of work across decoupled clusters and vector memory units: when blocks are inactive, they do not consume power. As a comparison point, Tables 7.7 and 7.8 also show the power and energy for the compiled

157 Average Simulator Iterations per power Iterations per performance million cycles Speed- (mW) millijoule Energy difference % Benchmark Input CP Scale up CP Scale CP Scale Ratio RISC Scale text 323 rule strings (14.0 KB) 0.6 0.8 1.4 73 249 1.3 0.5 0.4 33.5 43.0 rotate 12.5 KB image 1.4 42.8 31.7 58 449 3.7 15.3 4.1 48.8 2.9 lookup 2000 searches 3.2 32.1 10.2 45 396 11.2 13.0 1.2 47.5 5.4 ospf 400 nodes × 4 arcs 10.3 16.2 1.6 67 243 24.7 10.7 0.4 82.8 9.7 pktflow 512 KB (11.7 KB headers) 11.8 297.5 25.3 67 394 28.0 120.8 4.3 43.4 11.8 pntrch 5 searches, ≈ 350 nodes each 14.1 98.3 7.0 72 266 31.2 59.1 1.9 75.9 -2.0 step (20 carriers) 20.8 707.2 34.0 61 274 54.4 412.8 7.6 57.4 6.8 fbital pent (100 carriers) 2.2 160.8 73.3 58 316 6.1 81.4 13.4 54.2 1.4 typ (256 carriers) 1.5 60.1 41.4 58 335 4.0 28.7 7.2 53.8 0.6 fft all (1 KB) 7.8 289.9 37.3 42 455 29.4 101.9 3.5 121.4 4.3 viterb all (688 B) 3.2 36.2 11.4 73 401 6.9 14.4 2.1 24.2 11.6 data1 (16 samples, 8 lags) 316.1 8264.2 26.1 19 335 2676.2 3949.4 1.5 123.6 15.9 autocor data2 (1024 samples, 16 lags) 2.0 161.3 78.8 21 503 15.6 51.3 3.3 130.6 -0.0 data3 (500 samples, 32 lags) 2.1 197.6 92.0 21 475 16.3 66.5 4.1 130.5 0.2 data1 (512 B, 5 xors/bit) 6.0 7299.0 1207.9 47 500 20.7 2337.2 113.1 26.1 -1.8 conven data2 (512 B, 5 xors/bit) 7.0 8849.1 1265.0 49 518 22.9 2731.2 119.5 26.7 -0.4 data3 (512 B, 3 xors/bit) 8.7 11235.3 1295.4 50 524 27.9 3431.9 122.9 29.2 -6.7

Table 7.7: Measured Scale chip results for EEMBC benchmarks. The CP columns show performance and power consumption for compiled out-of-the-box code (OTB) running on the Scale control processor. The Scale columns show results for assembly-optimized code (OPT) which makes use of the VTU. The power measurements were made with the chip running at 160 MHz. The chip idle power (169 mW) was subtracted from the CP power and energy measurements, but the Scale measurements report the total chip power con- sumption. The right-most columns compare the simulation results (Section 7.2) to the measured chip results, reporting the positive or negative percentage performance difference (simulator versus chip). The idealistic control processor simulator model (RISC) is compared to the actual Scale control processor.

Average Simulator Input elements power Input elements performance per 1000 cycles Speed- (mW) per microjoule Energy difference % Benchmark Input CP Scale up CP Scale CP Scale Ratio RISC Scale rgbycc 1 k pixels (2.9 KB) 10.30 431.31 41.9 53 536 31.1 128.7 4.1 8.4 0.7 fdct 32 8×8 blocks (4 KB) 0.37 11.41 31.0 61 670 1.0 2.7 2.8 6.9 0.5 idct 32 8×8 blocks (4 KB) 0.30 11.06 36.3 64 684 0.8 2.6 3.4 10.0 -0.5 quantize 32 8×8 blocks (4 KB) 0.24 1.67 7.0 41 249 0.9 1.1 1.2 148.2 5.4 jpegenc 1 8×8 block (128 B) 0.31 1.07 3.5 89 242 0.6 0.7 1.3 51.7 20.2 scrambler 4 k bytes 407.43 2989.43 7.3 83 384 787.3 1244.6 1.6 17.2 3.3 convolenc 4 k bytes 27.58 269.84 9.8 69 348 64.2 124.0 1.9 20.8 2.4 interleaver 4.8 k bytes 20.33 536.75 26.4 89 415 36.5 207.0 5.7 9.6 0.7 mapper 1.2 k bytes 9.86 502.12 50.9 63 423 25.2 189.9 7.5 55.0 8.5 ifft 32 64-pt blocks (8 KB) 0.03 1.47 47.4 60 554 0.1 0.4 5.2 87.0 -1.1

Table 7.8: Measured Scale chip results for XBS benchmarks. The performance and power results are based on input elements (pixels, blocks, or bytes) rather than iterations. Other columns are as described in Table 7.7.

benchmarks running on the control processor. The clocking overhead for the VTU is mitigated by subtracting out the chip idle power (169 mW). This is conservative since it also eliminates the idle clocking overhead for the control processor itself. However, the presence of the VTU also adds overhead to the control processor's cache accesses, and this overhead is not accounted for. Still, the reported CP power consumption is generally in line with the 56 mW at 160 MHz that we measured for a test-chip with a simple RISC core directly connected to 8 KB of RAM [BKA07]. The control processor generally uses around 45–75 mW. autocor uses much less power because the CP datapath is often stalled waiting for the iterative multiplier.

Scale's power draw is generally 3–10× greater than the control processor alone. However, the energy throughput data in the result tables reveals that the extra power is put to good use. In addition to achieving large performance speedups, the VTU code generally consumes significantly less energy than the CP code. Scale processes over twice as much work per joule compared to the control processor alone for 13 out of the 21 benchmarks, and on many of the vectorizable codes its efficiency improvement is over 4×. Threaded codes like lookup and those with low parallelism like jpegenc are less efficient, but they still achieve greater than 20% improvements in energy consumption. The two cases for which Scale is less energy efficient than the control processor (text and ospf) are due to the extra idle power overhead that Scale consumes in these low-utilization benchmarks.

Much of the energy savings come from the control and memory access overheads that are amortized by Scale's vector-thread architecture. In comparison, other architectures—including superscalars, VLIWs, and multiprocessors—also exploit parallelism to improve performance, but only at the expense of increased energy consumption. Other studies have also found that vector processors are able to achieve speedups of 3–10× over high-performance superscalar processors while simultaneously using 10–50% less energy [LSCJ06].

Chapter 8

Scale Memory System Evaluation

Vector architectures provide high-throughput parallel execution resources, but much of the performance benefit they achieve comes from exposing and exploiting efficient vector memory access patterns. The vector-thread architecture allows an even broader range of applications to make use of vector loads and stores. Scale also improves on a traditional vector architecture by providing segment loads and stores to replace multiple overlapping strided accesses.

Earlier vector supercomputers [Rus78] relied on interleaved SRAM memory banks to provide high bandwidth at moderate latency, but modern vector supercomputers [D. 02, KTHK03] now implement main memory with the same commodity DRAM parts as other sectors of the computing industry, for improved cost, power, and density. As with conventional scalar processors, designers of all classes of vector machines, from vector supercomputers [D. 02] to vector microprocessors [EAE+02], are motivated to add vector data caches to improve the effective bandwidth and latency of a DRAM main memory. For vector codes with temporal locality, caches can provide a significant boost in throughput and a reduction in the energy consumed for off-chip accesses. But, even for codes with significant reuse, it is important that the cache not impede the flow of data from main memory. Existing non-blocking vector cache designs are complex since they must track all primary misses and merge secondary misses using replay queues [ASL03, Faa98]. This complexity also pervades the vector unit itself, as element storage must be reserved for all pending data accesses in either vector registers [EVS97, KP03] or a decoupled load data queue [EV96].

In this chapter, we evaluate Scale's vector refill/access decoupling, which is a simple and inexpensive mechanism to sustain high memory bandwidths in a cached vector machine. Scale's control processor runs ahead queuing compactly encoded commands for the vector units, and the vector refill unit (VRU) quickly pre-executes vector memory commands to detect which of the lines they will touch are not in cache and should be prefetched. This exact non-speculative hardware prefetch tries to ensure that when the vector memory instruction is eventually executed, it will find data already in cache. The cache can be simpler because we do not need to track and merge large numbers of pending misses. The vector unit is also simpler as less buffering is needed for pending element accesses. Vector refill/access decoupling is a microarchitectural technique that is invisible to software, and so can be used with any existing vector architecture.

8.1 Driving High-Bandwidth Memory Systems

The bandwidth-delay product for a memory system is the peak memory bandwidth, B, in bytes per processor cycle multiplied by the round-trip access latency, L, in processor cycles. To saturate a

given memory system, a processor must support at least (B/b) × L independent b-byte elements in flight simultaneously. The relative latency of DRAM has been growing, while at the same time DRAM bandwidth has been rapidly improving through advances such as pipelined accesses, multiple interleaved banks, and high-speed double-data-rate interfaces. These trends combine to yield large and growing bandwidth-delay products. For example, a Pentium-4 desktop computer has around two bytes per processor cycle of main memory bandwidth with around 400 cycles of latency, representing around 200 independent 4-byte elements. A vector unit has around 32 bytes per cycle of global memory bandwidth with up to 800 cycles of latency across a large multiprocessor machine, representing around 3,200 independent 8-byte words [D. 02].

Each in-flight memory request has an associated hardware cost. The access management state is the information required to track an in-flight access, and the reserved element data buffering is the storage space into which the memory system will write returned data. In practice, it is impossible to stall a deeply pipelined memory system, and so element buffering must be reserved at request initiation time. The cost of supporting full memory throughput grows linearly with the bandwidth-delay product, and it is often this control overhead rather than raw memory bandwidth that limits memory system performance.

Vector machines are successful at saturating large bandwidth-delay product memory systems because they exploit structured memory access patterns, groups of independent memory accesses whose addresses form a simple pattern and therefore are known well in advance. Unit-stride and strided vector memory instructions compactly encode hundreds of individual element accesses in a single vector instruction [Rus78, KTHK03], and help amortize the cost of access management state across multiple elements.

At first, it may seem that a cache also exploits structured access patterns by fetching full cache lines (and thus many elements) for each request, and so a large bandwidth-delay memory system can be saturated by only tracking a relatively small number of outstanding cache line misses. Although each cache miss brings in many data elements, the processor cannot issue the request that will generate the next miss until it has issued all preceding requests. If these intervening requests access cache lines that are still in flight, the requests must be buffered until the line returns, and this buffering grows with the bandwidth-delay product. Any request which misses in the cache has twice the usual reserved element data storage: a register is reserved in the processor itself plus buffering is reserved in the cache.

Data caches amplify memory bandwidth, and in doing so they can increase the effective bandwidth-delay product. For example, consider a non-blocking cache with four times more bandwidth than main memory, and an application which accesses a stream of data and reuses each operand three times. The processor must now support three times the number of outstanding elements to attain peak memory throughput. With reuse, the bandwidth-delay product of a cached memory system can be as large as the cache bandwidth multiplied by the main memory latency.
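As a concrete check of the (B/b) × L requirement, the two examples above can be reproduced with a few lines of arithmetic (a minimal sketch; the function name is illustrative):

# Minimum number of independent in-flight elements needed to saturate memory:
# (B / b) * L, where B is bandwidth in bytes/cycle, b is the element size in
# bytes, and L is the round-trip latency in cycles.
def inflight_elements(bandwidth_bytes_per_cycle, element_bytes, latency_cycles):
    return (bandwidth_bytes_per_cycle / element_bytes) * latency_cycles

print(inflight_elements(2, 4, 400))    # desktop example: 200.0 elements
print(inflight_elements(32, 8, 800))   # vector multiprocessor example: 3200.0 words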

8.2 Basic Access/Execute Decoupling

We first consider Scale's basic access/execute decoupling which is similar to that provided by a generic decoupled vector machine (DVM). It is useful to explain the decoupling by following a single vector load request through the system. The control processor first sends a vector load command to the VTU command queue. When it reaches the front, the command is broken into two pieces: the address portion is sent to the VLU command queue while the register writeback portion is sent to the lane command queues. The VLU then breaks the long vector load command into multiple smaller subblocks. For each subblock it reserves entries in the vector load data queues (Figure 5-19) and

issues a cache request. On a hit, the data is loaded from the cache into the reserved VLDQ entries. The VLDQs enable the VLU to run ahead and issue many memory requests while the lanes trail behind and move data in FIFO order into the architecturally visible vector registers. The VLDQs also act as small memory reorder buffers since requests can return from the memory system out-of-order: hits may return while an earlier miss is still outstanding. The VSU trails behind both the VLU and the lanes and writes results to memory.

In the VTU, decoupling is enabled by the queues which buffer data and control information for the trailing units to allow the leading units to run ahead, and by the reserved element buffering provided by the VLDQs. In Scale's non-blocking cache, decoupling is enabled by the access management state in the MSHRs: primary miss tags and replay queues (described in Section 5.7.1).

Figure 8-1 illustrates the resource requirements for DVM. The distance that the CP may run ahead is determined by the size of the various vector command queues. The distance that the VLU may run ahead is constrained by the lane command queues, VSU command queue, VLDQs, replay queues, and primary miss tags. The key observation is that to tolerate increasing memory latencies these resources must all be proportional to the memory bandwidth-delay product and thus can become quite large in modern memory systems.

Figure 8-1: Queuing resources required in basic DVM. Decoupling allows the CP and VLU to run ahead of the lanes. Many large queues (represented by longer arrows) are needed to saturate memory.

Figure 8-2: Queuing resources required in Scale. Refill/access decoupling requires fewer resources to saturate the memory system.

8.3 Refill/Access Decoupling

Scale enables refill/access decoupling with its vector refill unit. The VRU receives the same commands as the VLU, and it runs ahead of the VLU issuing refill requests for the associated cache lines. The VRU implementation is described in Section 5.5.5. The primary cost is an additional address generator, and Scale's VRU implementation uses around 3000 gates, only 0.4% of the VTU total. Ideally, the VRU runs sufficiently far ahead to allow the VLU to always hit in the cache. A refill request only brings data into the cache, and as such it needs a primary miss tag but no replay queue entry. It also does not need to provide any associated reserved element buffering in the processor, and refill requests which hit in cache due to reuse in the load stream are ignored. This evaluation assumes that stores are non-allocating and thus the VRU need not process vector store commands.

Figure 8-2 illustrates the associated resource requirements for Scale with refill/access decoupling. The VRU is able to run ahead with no need for replay queues. The VLU runs further behind the CP compared to the basic decoupled vector machine shown in Figure 8-1, and so the VTU command queue must be larger. However, since the VLU accesses are typically hits, the distance

between the VLU and the lanes is shorter and the VLDQs are smaller. With refill/access decoupling, the only resources that must grow to tolerate an increasing memory latency are the VTU command queue and the number of primary miss tags. This scaling is very efficient since the VTU command queue holds vector commands which each compactly encode many element operations, and the primary miss tags must only track non-redundant cache line requests. The only reserved element buffering involved in scaling refill/access decoupling is the efficient storage provided by the cache itself.

Scale must carefully manage the relative steady-state rates of the VLU and VRU. If the VRU is able to run too far ahead it will start to evict lines which the VLU has yet to even access, and if the VLU can out-pace the VRU then the refill/access decoupling will be ineffective. In addition to the rates between the two units, Scale must also ensure that the temporal distance between them is large enough to cover the memory latency. Throttling the VLU and/or the VRU can help constrain these rates and distances to enable smoother operation. Section 5.5.5 describes Scale's throttling techniques, including the VRU's cache line distance count that it uses to track how far ahead of the VLU it is running. When necessary, a distance count can be tracked for each set in the cache; this is further evaluated in Section 8.6.

One disadvantage of refill/access decoupling is an increase in cache request bandwidth: the VRU first probes the cache to generate a refill and then the VLU accesses the cache to actually retrieve the data. This overhead is usually low since the VRU must only probe the cache once per cache line. Another important concern with refill/access decoupling is support for unstructured load accesses (i.e. indexed vector loads). Limited replay queuing is still necessary to provide these with some degree of access/execute decoupling.
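To make the distance-throttling idea concrete, the following is a rough software model of the per-set bookkeeping (an illustrative sketch, not the Scale hardware; the exact counter update policy is inferred from the description here and in Section 8.6):

# The VRU keeps a distance counter per cache set. Each refill probe that maps
# to a set increments that set's counter; when the VLU later accesses a line in
# the set, the counter is decremented. The VRU stalls whenever the target set's
# distance would exceed the limit (24 cache lines in the Set-Distance-24 runs).
class RefillThrottle:
    def __init__(self, num_sets, max_distance=24):
        self.max_distance = max_distance
        self.distance = [0] * num_sets

    def can_refill(self, cache_set):
        return self.distance[cache_set] < self.max_distance

    def on_vru_probe(self, cache_set):
        self.distance[cache_set] += 1      # VRU ran one more line ahead

    def on_vlu_access(self, cache_set):
        if self.distance[cache_set] > 0:   # VLU caught up by one line
            self.distance[cache_set] -= 1

With a single distance-set this corresponds to the Total-Distance scheme evaluated in Section 8.6; with more sets, the VRU can run further ahead when its probes are spread across the cache.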

8.4 Memory Access Benchmarks

Table 8.1 lists the benchmarks used in this evaluation. Results in the table are intended to approximately indicate each benchmark's peak performance given the processor, cache, and memory bandwidth constraints. They are based on a pseudo-ideal Scale configuration with very large decoupling queues, VRU refill/access decoupling, and zero-cycle main memory with a peak bandwidth of 8 B/cycle. The results throughout this chapter are normalized to this performance.

The first eight benchmarks are custom kernels chosen to exercise various memory access patterns. vvadd-word, vvadd-byte, and vvadd-cmplx all perform vector-vector additions, with vvadd-cmplx making use of segment accesses to read and write the interleaved complex data. vertex is a graphics kernel which projects 3-D vertices in homogeneous coordinates onto a 2-D plane using four-element segment accesses to load and store the vertices. fir performs a 35-tap finite impulse response filter by multiplying vectors of input elements by successive tap coefficients while accumulating the output elements. Successive vector-loads are offset by one element such that each element is loaded 35 times. transpose uses strided segment loads and unit-stride stores to perform an out-of-place matrix transposition on word element data. The 512×512 data set size has a stride which maps columns to a single set in the cache. idct performs an in-place 2-D 8×8 inverse discrete cosine transform on a series of blocks using the LLM algorithm. The implementation first uses segment vector loads and stores to perform a 1-D IDCT on the block rows, then uses unit-stride accesses to perform a 1-D IDCT on the columns.

The last six benchmarks are from the EEMBC suite, in some cases with larger inputs than the reference data sets from Table 7.2. rgbyiq and rgbcmyk use three-element segment accesses to efficiently load input pixels. rgbyiq also uses segment accesses to store the converted pixels, but rgbcmyk uses a packed word store to write its output data. In hpg, each output pixel is a

164 Access Pattern Ops/ Elements/ Cache Mem Indexed Unit Stride Strided Segment Avg Cycle Cycle Bytes/ Bytes/Cycle Benchmark Input Load Store Load Store Load Store Load Store VL Total Muls Load Store Cycle Total Load Store vvadd-word 250k elements – 976.6 KB 2 W 1 W 64.0 0.67 0.00 1.33 0.67 8.00 8.00 5.33 2.67 vvadd-byte 1m elements – 976.6 KB 2 B 1 B 64.0 1.88 0.00 3.76 1.88 5.64 6.57 3.76 2.81 vvadd-cmplx 125k cmplx nums – 976.6 KB 2 W[2]:* 1 W[2]:* 32.0 0.67 0.00 1.33 0.67 7.99 8.00 5.33 2.67 vertex 100k 4W vertices – 1562.5 KB 1 W[4]:* 1 W[4]:* 28.0 9.55 3.82 0.96 0.96 7.64 7.65 3.82 3.82 fir 500k elements – 976.6 KB 1 H 1 H 124.0 7.36 3.58 3.61 0.10 7.42 0.43 0.20 0.22 transpose400 400×400 words – 625.0 KB 8 W 1 W[8]:1600 16.0 0.00 0.00 1.00 1.00 8.00 8.00 4.00 4.00 transpose512 512×512 words – 1024.0 KB 8 W 1 W[8]:2048 16.0 0.00 0.00 0.96 0.96 7.65 7.66 3.83 3.83 idct 20k 8×8 blocks – 2500.0 KB 8 H 8 H 1 H[8]:* 1 H[8]:* 12.0 8.80 1.45 0.96 0.96 3.86 3.85 1.93 1.92 165 rgbyiq 320×240 3B pixels – 225.0 KB 1 B[3]:* 1 B[3]:* 28.0 9.33 4.00 1.33 1.33 2.67 3.91 1.33 2.58 rgbcmyk 320×240 3B pixels – 225.0 KB 1 W 1 B[3]:* 32.0 6.80 0.00 1.20 0.40 2.80 3.00 1.20 1.80 hpg 320×240 1B pixels – 75.0 KB 3 B 1 B 63.6 10.77 3.91 2.93 0.98 3.92 2.54 1.01 1.54 √ fft-ct 4096 points – 16.0 KB 4 H 4 H 6 H:? 4 H:? 31.4 3.25 1.00 1.50 1.08 5.16 1.13 0.44 0.69 rotate 742×768 bits – 69.6 KB 8 B 8 B:-768 15.5 9.01 0.00 0.47 0.47 0.94 7.99 0.47 7.52 √ √ dither 256×256 1B pixels – 64.0 KB 4 H 1 H 1 B:254 31.3 5.09 0.65 1.11 0.27 2.26 0.25 0.22 0.03

Table 8.1: Refill/Access decoupling benchmark characterization. The Access Pattern columns display the types of memory access streams used by each benchmark. The entries are of the format N{B,H,W}[n]:S, where N indicates the number of streams of this type, B, H, or W indicates the element width (byte, half-word, or word), [n] indicates the segment size in elements, and S indicates the stride in bytes between successive elements or segments. A stride of '*' indicates that the segments are consecutive, and a stride of '?' indicates that the stride changes throughout the benchmark. The Avg VL column displays the average vector length used by the benchmark. The per-cycle statistics are for the pseudo-ideal Scale configuration that is used as a baseline for normalization throughout this chapter.

function of nine input pixels, but the kernel is optimized to load each input pixel three times and hold intermediate input row computations in registers as it produces successive output rows. fft-ct is a radix-2 Fast Fourier Transform based on a decimation-in-time Cooley-Tukey algorithm. The algorithm inverts the inner loops after the first few butterfly stages to maximize vector lengths, resulting in a complex mix of strided and unit-stride accesses. The benchmark performs the initial bit-reversal using unit-stride loads and indexed stores. rotate turns a binary image by 90°. It uses 8 unit-stride byte loads to feed an 8-bit×8-bit block rotation in registers, and then uses 8 strided byte stores to output the block. dither uses a strided byte access to load the input image and indexed loads and stores to read-modify-write the bit-packed output image. The dithering error buffer is updated with unit-stride accesses.

Table 8.1 shows that the kernels which operate on word data drive memory at or near the limit of 8 B/cycle. vvadd-byte and fir approach the limit of 4 load elements per cycle; and vertex, fir, rgbyiq, and hpg approach the limit of 4 multiplies per cycle. Although input data sets were chosen to be larger than the 32 KB cache, several benchmarks are able to use the cache to exploit significant temporal locality. In particular, fir, dither, and fft-ct have cache-to-memory bandwidth amplifications of 17×, 9×, and 4.6× respectively. Several benchmarks have higher memory bandwidth than cache bandwidth due to the non-allocating store policy and insufficient spatial-temporal locality for store writeback merging.

8.5 Reserved Element Buffering and Access Management State

We evaluate vector refill/access decoupling and vector segment accesses in terms of performance and resource utilization for three machine configurations: Scale with refill/access decoupling, a basic decoupled vector machine (DVM), and a decoupled scalar machine (DSM). We disable Scale's vector refill unit to model the DVM machine. For the DSM machine, we dynamically convert vector memory accesses into scalar element accesses to model an optimistic scalar processor with four load-store units and enough decoupling resources to produce hundreds of in-flight memory accesses.

Figure 8-3 presents three limit studies in which we restrict either: (a) the reserved element buffering provided by the VLDQs, (b) the number of primary miss tags, or (c) the number of replay queue entries. To isolate the requirements for each of these resources, all of the other decoupling queue sizes are set to very large values, and in each of the three studies the other two resources are unconstrained. For these limit studies, the cache uses LRU replacement, and the Scale distance throttling uses 32 distance-sets each limited to a distance of 24. Memory has a latency of 100 cycles.

The overall trend in Figure 8-3 shows that Scale is almost always able to achieve peak performance with drastically fewer reserved element buffering and access management resources compared to the DVM and DSM machines. All the machines must use primary miss tags to track many in-flight cache lines. However, in order to generate these misses, DVM and DSM also require 10s–100s of replay queue entries and reserved element buffering registers. Scale mitigates the need for these by decoupling refill requests so that demand accesses hit in the cache.

With the 100 cycle memory latency, vvadd-word must sustain 533 B of in-flight load data to saturate its peak load bandwidth of 5.33 B/cycle (Table 8.1). This is equivalent to about 17 cache lines, and the benchmark reaches its peak performance with 32 primary miss tags. DVM accesses four word elements with each cache access, and so requires two replay queue entries per primary miss tag (64 total) and 256 VLDQ registers to provide reserved element buffering for these in-flight accesses. In stark contrast, Scale does not require any VLDQ reserved elements or replay queue entries to generate the memory requests. DSM issues independent accesses for each of the eight words

[Figure 8-3 plots: one panel per benchmark in each of three groups, with relative performance on the y-axis and resource count on the x-axis:
(a) VLDQ entries (reserved element buffering).
(b) Primary miss tags (access management state).
(c) Replay queue entries (access management state).]

Figure 8-3: Performance scaling with increasing reserved element buffering and access management resources. In (a), the x-axis indicates the total number of VLDQ entries across all four lanes, and in (b) and (c) the x-axis indicates the total number of primary miss tags and replay queue entries across all four cache banks. For all plots, the y-axis indicates performance relative to the pseudo-ideal configuration whose absolute performance is given in Table 8.1. Scale's performance is shown with solid lines, DVM with dashed lines, and DSM with dotted lines. The lines with circle markers indicate segment vector memory accesses converted into strided accesses.

in a cache line and so requires both 256 replay queue entries and 256 reserved element buffering registers to achieve its peak performance. However, even with these resources it is still unable to saturate the memory bandwidth because (1) each cache miss wastes a cycle of writeback bandwidth into the lanes, and (2) the lanes have bank conflicts when they issue scalar loads and stores.

The other load memory bandwidth limited benchmarks share similarities with vvadd-word in their performance scaling. The vvadd-byte kernel makes it even more difficult for DVM and DSM to drive memory since each load accesses four times less data than in vvadd-word. Thus, to generate a given amount of in-flight load data, they require four times more replay queue entries and VLDQ registers. Furthermore, the wasted lane writeback bandwidth and cache access bandwidth caused by misses are much more critical for vvadd-byte. DVM's cache misses waste half of the critical four elements per cycle lane writeback bandwidth, and so its peak performance is only half of that for Scale. To reach the peak load bandwidth, Scale uses 16 VLDQ registers (4 per lane) to cover the cache hit latency without stalling the VLU. The rgbyiq and rgbcmyk benchmarks have similar characteristics to vvadd-byte since they operate on streams of byte elements, but their peak throughput is limited by the processor's compute resources.

Refill/access decoupling can provide a huge benefit for access patterns with temporal reuse. The hpg benchmark must only sustain 1 B/cycle of load bandwidth to reach its peak throughput, and so requires 100 B of in-flight data. However, because the benchmark reads each input byte three times, DVM and DSM each require 300 reserved element buffering registers for the in-flight data. DVM accesses each 32 B cache line 24 times with four-element vector-loads, and thus requires 72 replay queue entries to sustain the three in-flight cache misses; and DSM needs four times this amount. Scale is able to use dramatically fewer resources by prefetching each cache line with a single refill request and then discarding the two subsequent redundant requests.

Refill/access decoupling can also provide a large benefit for programs that only have occasional cache misses. To achieve peak performance, rotate, dither, and fir must only sustain an average load bandwidth of one cache line every 68, 145, and 160 cycles respectively. However, to avoid stalling during the 100 cycles that it takes to service these occasional cache misses, DVM must use on the order of 100 VLDQ registers as reserved storage for secondary misses (fir and rotate) or to hold hit-under-miss data (dither). Scale attains high performance using drastically fewer resources by generating the cache misses long before the data is needed.

For benchmarks which use segment vector memory accesses, we evaluate their effect by dynamically converting each n-element segment access into n strided vector memory accesses. Figure 8-3 shows that several benchmarks rely on segment vector memory accesses to attain peak performance, even with unlimited decoupling resources. Segments can improve performance over strided accesses by reducing the number of cache accesses and bank conflicts. For example, without segments, the number of bank conflicts increases by 2.8× for rgbcmyk, 3.4× for vertex, 7.5× for rgbyiq, and 22× for transpose400.
Segments also improve locality in the store stream, and thereby alleviate the need to merge independent non-allocating stores in order to optimize memory bandwidth and store buffering resources. Although Scale always benefits from segment accesses, DVM is better off with strided accesses when reserved element buffering resources are limited. This is because a strided pattern is able to use the available VLDQ registers to generate more in-flight cache misses by first loading only one element in each segment, effectively prefetching the segment before the subsequent strided accesses return to load the neighboring elements.
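The buffering requirements quoted in this section follow from the bandwidth-delay arithmetic; a short illustrative restatement of the vvadd-word and hpg cases (names and structure are ours):

# In-flight state needed to cover a 100-cycle memory latency.
LINE_BYTES = 32

def inflight_cache_lines(load_bw_bytes_per_cycle, latency_cycles):
    return load_bw_bytes_per_cycle * latency_cycles / LINE_BYTES

def reserved_elements(load_bw_bytes_per_cycle, latency_cycles, reuse, element_bytes):
    # With reuse, every element access still needs reserved buffering in a
    # machine without refill/access decoupling, even though DRAM traffic is
    # unchanged.
    return load_bw_bytes_per_cycle * latency_cycles * reuse / element_bytes

print(inflight_cache_lines(5.33, 100))     # vvadd-word: ~17 lines in flight
print(reserved_elements(1.0, 100, 3, 1))   # hpg: ~300 reserved byte elements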

                     VTU-CmdQ          Total-Distance       Set-Distance-24
                     (size)            (cache lines)        (sets)
Benchmark            1024  256    64   768   100    24      2     4     8     16    32
vvadd-word           0.03  0.03  1.00  1.00  1.00  0.32    0.63  1.00  1.00  1.00  1.00
vvadd-byte           0.02  0.95  0.90  0.96  1.00  0.48    0.92  0.94  0.98  0.98  0.92
vvadd-cmplx          0.04  0.04  0.96  1.00  1.00  0.32    0.63  1.00  1.00  1.00  1.00
vertex               0.08  0.40  0.97  0.99  1.00  0.44    0.80  1.00  1.00  1.00  0.99
idct                 0.15  0.98  0.88  1.00  0.97  0.65    0.81  0.96  0.97  0.99  1.00
fir                  0.99  1.00  0.99  1.00  0.97  0.79    0.84  0.92  0.99  1.00  0.99
transpose400         0.12  1.00  0.96  1.00  0.99  0.30    0.31  0.78  0.99  1.00  1.00
transpose512         0.25  0.25  0.25  0.25  0.25  0.98    1.00  1.00  1.00  1.00  1.00
rgbyiq               1.00  1.00  1.00  1.00  1.00  0.98    1.00  1.00  1.00  1.00  1.00
rgbcmyk              1.00  1.00  1.00  1.00  1.00  1.00    1.00  1.00  1.00  1.00  1.00
hpg                  0.19  0.26  0.98  0.19  0.99  0.46    0.63  0.97  1.00  0.52  0.19
fft-ct               1.00  0.98  0.96  0.97  0.98  0.85    0.94  0.97  0.97  0.97  0.97
rotate               0.99  0.98  0.94  0.99  0.99  0.94    0.96  0.98  1.00  1.00  0.99
dither               0.98  0.96  0.96  1.00  0.95  0.61    0.71  0.87  0.93  0.99  1.00

Table 8.2: Performance for various refill distance throttling mechanisms. For each benchmark, the performance values are normalized to the best performance for that benchmark attained using any of the mechanisms.

8.6 Refill Distance Throttling

We next evaluate mechanisms for throttling the distance that the refill unit runs ahead of the VLU. Table 8.2 presents results for several throttling mechanisms at a memory latency of 400 cycles. The first group of columns evaluates having no explicit throttling other than the size of the VTU command queue (VTU-CmdQ). With a size of 1024, many of the benchmarks have dismal performance as the VRU evicts prefetched data from the cache before the VLU has a chance to access it. Simply reducing the queue size to 64 turns out to be an effective throttling technique. However, doing so can also over-constrain the refill unit, as seen with vvadd-byte and idct. In general, the command queue size is not a robust technique for throttling the refill unit since the distance that it can run ahead depends on both the vector-length and the percentage of queued commands that are vector-loads.

The Total-Distance and Set-Distance-24 results in Table 8.2 evaluate directly throttling the distance that the refill unit is allowed to run ahead of the VLU. These schemes use the distance counting mechanism described in Section 5.5.5 with the VTU-CmdQ size set to 1024. When only one count is used to limit the VRU distance to 768 cache lines (75% of the cache) or 100 cache lines (the memory bandwidth-delay product), most benchmarks perform very well. However, the refill unit still runs too far ahead for transpose512 since its accesses tend to map to a single set in the cache. Breaking the distance count into sets allows the VRU to run further ahead when access patterns use the cache uniformly, while still not exceeding the capacity when access patterns hit a single set. The Set-Distance-24 results in Table 8.2 use a varying granularity of distance-sets for the bookkeeping while always limiting the maximum distance in any set to 24. With a memory latency of 400 cycles, only around 4–8 distance-sets are needed.

The VRU distance throttling assumes that the cache eviction order follows its refill probe pattern, but a FIFO replacement policy can violate this assumption and cause refill/access decoupling to break down. This occurs when a refill probe hits a cache line which was brought into the cache long ago and is up for eviction soon.

This is a problem for the hpg benchmark, which has distant cache line reuse as it processes columns of the input image. Figure 8-4 shows that its performance initially increases as the refill unit is allowed to run further ahead, but then falls off dramatically when it goes far enough to evict lines before they are reused. dither frequently reuses its pixel error buffers, but they are periodically evicted from the cache, causing the VLU to get a miss even though the corresponding refill unit probe got a hit. Figure 8-4 shows how an LRU replacement policy fixes these breakdowns in refill/access decoupling. To simplify the cache hardware, an approximation of LRU would likely be sufficient.

[Figure 8-4 plots: two panels (hpg and dither), each comparing fifo and lru replacement; the x-axis is the maximum refill distance in cache lines.]

Figure 8-4: Cache replacement policy interaction with refill distance throttling. The refill unit uses the Set-Distance-24 throttling mechanism, and the x-axis increases the maximum refill distance by varying the number of distance-sets from 1 to 32. The y-axis shows performance relative to the pseudo-ideal Scale configuration. The results are for a memory latency of 400 cycles.

[Figure 8-5 plots: one curve per benchmark; panels (a) 8 B/cycle memory and (b) 4× processor frequency, 2 B/cycle memory; the x-axis is memory latency in cycles.]

Figure 8-5: Performance scaling with increasing memory latency. The y-axis shows performance relative to the pseudo-ideal Scale configuration (with 8 B/cycle memory). In (b), the processor frequency is increased by 4× while retaining the same memory system.

8.7 Memory Latency Scaling

We now evaluate a Scale configuration with reasonable decoupling queue sizes, and show how performance scales with memory latency and processor frequency. Table 8.3 shows the configuration parameters, and Figure 8-5a shows how performance scales as the latency of the 8 B/cycle main memory increases from 4 to 800 cycles. To tolerate the increasing memory bandwidth-delay product, only the VTU command queue size, the number of primary miss tags, and the distance-sets were scaled linearly with the memory latency. We found it unnecessary to scale the number of primary miss tags per bank between the 50 and 100 cycle latency configurations because applications tend to use these hard-partitioned cache bank resources more uniformly as their number increases.

Memory latency (cycles):           4       25      50      100     200      400      800
Scale Vector-Thread Unit
  Lane/VLU/VSU/VRU-CmdQs           8 (all configurations)
  VLDQ/LDQ/SAQ                     32 (8) (all configurations)
  VTU-CmdQ                         4       8       16      32      64       128      256
  Throttle distance per set        24 (all configurations)
  Distance-sets                    1       1       1       2       4        8        16
Non-Blocking Cache: 32 KB, 32-way set-assoc., FIFO replacement
  Replay queue entries             32 (8) (all configurations)
  Pending store buffer entries     32 (8) (all configurations)
  Primary miss tags                16(4)   32(8)   64(16)  64(16)  128(32)  256(64)  512(128)

Table 8.3: Resource configuration parameters with scaling memory latency. Per-lane and per-bank parameters are shown in parentheses.

As shown in Figure 8-5a, eight of the benchmarks are able to maintain essentially full throughput as the memory latency increases. This means that the amount of in-flight data reaches each benchmark's achievable memory bandwidth (Table 8.1) multiplied by 800 cycles—up to 6.25 KB. The performance for benchmarks which lack sufficient structured access parallelism (dither and fft-ct) tapers as the memory latency increases beyond their latency tolerance. hpg performance drops off due to the cache's FIFO eviction policy, and transpose512 cannot tolerate the long latencies since the refill distance is limited to the size of one set.

In Figure 8-5b, we evaluate the effect of increasing the processor clock frequency by 4× such that the memory latency also increases by 4× and the bandwidth becomes 2 B/cycle. We retain the same parameters listed in Table 8.3 for operating points with equal bandwidth-delay products. The results show that the benchmarks which were limited by memory at 8 B/cycle still attain the same performance, while those which were previously limited by compute or cache bandwidth improve with the increased frequency.

8.8 Refill/Access Decoupling Summary

We have described the considerable costs involved in supporting large numbers of in-flight memory accesses, even for vector architectures running applications that have large quantities of structured memory access parallelism. We have shown that data caches can increase performance in the presence of cache reuse, but also that this increases the effective bandwidth-delay product and hence the number of outstanding accesses required to saturate the memory system. In addition, the cost of each outstanding access increases due to the cache logic required to manage and merge pending misses. Cache refill/access decoupling dramatically reduces the cost of saturating a high-bandwidth cached memory system by pre-executing vector memory instructions to fetch data into the cache just before it is needed, avoiding most per-element costs. Furthermore, segment vector memory instructions encode common two-dimensional access patterns in a form that improves memory performance and reduces the overhead of refill/access decoupling. Together these techniques enable high-performance and low-cost vector memory systems to support the demands of modern vector architectures.

Chapter 9

Conclusion

In this thesis I have described a new architecture for an all-purpose processor core. The need for a new architecture is motivated by continuing advances in silicon technology, the emergence of high-throughput sophisticated information-processing applications, and the resulting convergence of embedded and general-purpose computing. These forces have created a growing demand for high-performance, low-power, programmable computing substrates.

I have used a three-layered approach to propose vector-thread architectures as a potential all-purpose computing solution. First, the VT architectural paradigm describes a new class of hardware/software interfaces that merge vector and multithreaded computing. This thesis compares the VT abstraction to other architectural paradigms, including vector architectures, multiprocessors, superscalars, VLIWs, and several research architectures. VT has a unique ability to expose and exploit both vector data-level parallelism and thread-level parallelism at a fine granularity, particularly when loops are mapped to the architecture. VT takes advantage of structured parallelism to achieve high performance while amortizing instruction and data overheads. The architecture is sufficiently flexible to map vectorizable loops, loops with cross-iteration dependencies, and loops with complex internal control flow. Ultimately, the vector-thread architectural paradigm is the primary contribution of this thesis, as the abstract concepts have the most potential to influence future architectures and ideas.

This thesis presents the Scale architecture as the second component of its VT proposal. As an instantiation of the vector-thread paradigm, Scale includes all of the low-level details necessary to implement an actual hardware/software interface. The Scale programmer's model defines an instruction set interface based on a virtual processor abstraction. It augments a RISC control processor with vector-configuration, vector-fetch, vector-load, and vector-store commands. Scale's virtual processors have exposed clustered execution resources where the number of private and shared registers in each cluster is configurable. VP instructions are grouped into atomic instruction blocks (AIBs), and fetch instructions allow each VP thread to direct its own control flow.

As the first implementation of VT, the Scale microarchitecture incorporates many novel mechanisms to efficiently provide vector execution and multithreading on the same hardware. Scale's vector-thread unit is organized as a set of clusters that execute independently. Cluster decoupling is enabled by a translation from VP instructions to hardware micro-ops which partition inter-cluster communication into explicit send and receive operations. The memory access cluster and the store-data cluster provide access/execute decoupling for indexed loads and stores. A lane's command management unit translates VTU commands into cluster execute directives, and it manages the small cluster AIB caches. The CMU is the meeting point for vector and threaded control, and it orchestrates the execution interleaving of the VPs on a lane. Scale's vector memory units process unit-stride and segment-strided vector-load and vector-store commands, and the vector refill unit prefetches data into the cache to provide refill/access decoupling.

173 lows the clusters to overlap vector load data writebacks with compute operations, and the store-data cluster can also overlap vector store data reads. The third component of the VT proposal provided by this thesis is the Scale VT processor. This hardware artifact proves that vector-thread architectures can be implemented efficiently. To build the Scale chip with limited resources, we used an automated and iterative CAD tool flow based on ASIC-style synthesis and place-and-route of standard cells. Still, we optimized area with procedural datapath pre-placement and power with extensive clock-gating. The Scale VT processor implementation demonstrates that the vector-thread control mechanisms can be implemented with compact and energy-efficient hardware structures. The Scale chip provides relatively high compute density with its 16 execution clusters and core area of 16.6 mm2 in 180 nm technology. Its clock frequency of 260 MHz is competitive with other ASIC processors, and at this frequency the power consumption of the core is typically less than 1 W. To support the VT proposal, this thesis also provides a detailed evaluation of Scale. The re- sults demonstrate that many embedded benchmarks from different domains are able to achieve high performance. The code mappings exploit different types of parallelism, ranging from vectorizable loops to unstructured free-running threads. Often, vector-loads, vector-stores, and vector-fetches are interleaved at a fine granularity with threaded control flow. Execution statistics show that the Scale architecture exploits locality in both its control and data hierarchies. Measurements of the Scale chip demonstrate that, even while achieving large performance speedups, it can also be more energy efficient than a simple RISC processor. Figure 9-1 shows a timeline history for the Scale project. The major phases and components included initial architecture development, the Scale virtual machine simulator, microarchitectural development and the scale-uarch simulator, RTL and chip design, and post-fabrication chip testing. Overall, it took about 5 years to go from the initial VT concept to the tapeout of the Scale chip.

9.1 Overall Analysis and Lessons Learned

This section highlights some of the main results and insights that come from a holistic appraisal of the Scale vector-thread architecture and implementation.

Interleaving Vector and Threaded Control. A primary outcome of this research work is the demonstration that vector and threaded control can be interleaved at a fine granularity. A vector-thread architecture allows software to succinctly and explicitly expose the parallelism in a data-parallel loop. Software uses vector commands when the control flow is uniform across the loop iterations, or threads when the loop iterations evaluate conditions independently. A vector-fetch serves as a low-overhead command for launching a set of threads. Since the threads terminate when they are done, the next vector command can usually be issued with no intervening synchronization operation. The Scale microarchitecture and chip design demonstrate that there is no significant overhead for interleaving vector and threaded control mechanisms. Scale's execution clusters are independent processing engines with small AIB caches, and clusters are controlled by execute directives which direct them to execute AIBs from their cache. Clusters issue instructions in the same way regardless of whether an execute directive comes from a vector-fetch or a thread-fetch. The command management unit in a lane handles the interleaving of vector commands and VP threads, and the logic that it uses to do this is only a few percent of the total lane area.
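As an illustrative, purely hypothetical example of this division of labor, consider the C loop below: the loads, stores, and uniform arithmetic map naturally onto vector commands, while the data-dependent branch is the part each VP would resolve with its own thread-fetches. The loop is ordinary C; the mapping noted in the comments is a sketch of the idea, not Scale assembly.

```c
/* Hypothetical data-parallel loop with per-iteration control flow.
   The comments indicate how a VT mapping might divide the work. */
void threshold_scale(const int *in, int *out, int n, int limit) {
    for (int i = 0; i < n; i++) {
        int x = in[i];            /* uniform work: vector-load           */
        if (x > limit) {          /* data-dependent: each VP evaluates   */
            x = limit;            /*   this with its own thread-fetch    */
        } else {
            x = x * 2;
        }
        out[i] = x;               /* uniform work: vector-store          */
    }
}
```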

Figure 9-1: Scale project timeline. Various events and milestones are highlighted. [The original figure is a month-by-month chart spanning mid-2001 through 2007, covering initial architecture development, the scale-VM simulator, microarchitecture development and the scale-uarch simulator, Scale RTL and chip design, and post-fabrication chip testing.]

Cross-VP Communication. An important part of a VT architecture is the explicit encoding of cross-VP send and receive operations. These allow software to succinctly map loops with cross-iteration dependencies and thereby expose the available parallelism between iterations. Importantly, software is not responsible for statically scheduling the parallel execution of the loop iterations. This makes the encoding compact, and it allows the hardware execution to dynamically adapt to run-time latencies. Organizing corresponding nextVP sends and prevVP receives in a vector-fetched AIB allows a single vector-fetch to establish a chain of cross-VP data transfers. This encoding also provides a means to guarantee deadlock-free execution with limited hardware queuing resources.
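To make this concrete, the hypothetical running-sum loop below carries a value from each iteration to the next; in a VT mapping, that carried value is what a VP would receive from prevVP and forward to nextVP, while the remaining work in the body proceeds in parallel. This is a plain C sketch of the loop shape, not Scale code.

```c
/* Loop with a cross-iteration dependency: each iteration consumes the
   value produced by the previous one (the carried 'acc'). In a VT
   mapping, 'acc' would travel over the prevVP/nextVP chain established
   by a single vector-fetched AIB, while the loads and stores of
   different iterations remain parallel. */
void prefix_sum(const int *in, int *out, int n) {
    int acc = 0;                  /* seed value from the control thread */
    for (int i = 0; i < n; i++) {
        acc += in[i];             /* receive from prevVP, accumulate ... */
        out[i] = acc;             /* ... and send the result to nextVP   */
    }
}
```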

Atomic Instruction Blocks. AIBs are a unique component of VT architectures, but they could be a generally useful tool for building instruction set interfaces. By organizing instructions into blocks, Scale is able to amortize the overhead of explicit fetch instructions. This overhead includes the encoding space for the fetch instructions themselves, the fetch latency, and the cache tag checks and other control logic. Explicit fetches allow Scale to eliminate the need for automatic program counters, and AIBs always execute in their entirety, mitigating the need to predict branches or execute instructions speculatively. A thread can also potentially schedule its fetch instruction early in an AIB to initiate the next AIB fetch while the other instructions execute. AIBs are encoded as a unit, providing opportunities for compression, and reducing the overhead for marking the end of an AIB. As another benefit, the atomic nature of AIB execution allows software to use temporary state that is only valid within the AIB. This improves hardware efficiency by allowing multiple threads to share the same temporary state. Scale exploits this property with its shared registers and chain registers. Finally, AIBs are integral to the encoding of cross-VP communication, as they provide a means for grouping prevVP receives with nextVP sends.

Clusters. The Scale architecture exposes clusters to software to exploit instruction-level parallelism within a lane. Four execution clusters are provided to balance a lane’s single load/store port. In mapping benchmarks to Scale, we found that it was generally not too difficult to spread local ILP across multiple clusters. In the Scale hardware, clustering of control and data paths simplifies the design and improves area and energy efficiency. The clusters each contain a small register file with few ports, reducing size and wiring energy. Static assignment of instructions to clusters allows Scale to avoid dynamic scheduling and register renaming [KP03]. Instead, each cluster executes its local compute-ops and inter-cluster transport operations in order, requiring only simple interlock logic with no dynamic inter-cluster bypass detection.

Micro-ops. Scale’s decomposition of VP instructions into micro-ops proved to be a useful feature of the hardware/software interface. VP instructions provide a simple and convenient high-level software abstraction. They can target any cluster and write result data to one or more destinations in external clusters. Successive VP instructions have intuitive sequential semantics. In comparison, cluster micro-op bundles explicitly encode parallelism across clusters. They limit the number of register destination writes in a cluster to two per cycle. Explicitly partitioning inter-cluster transport and writeback operations provides the basis for independent cluster execution and decoupling. It also allows the cluster micro-op bundles in an AIB to be packed independently, in contrast to a horizontal VLIW-style encoding.

Decoupling. Vector and vector-thread architectures enable decoupling through compact commands that encode many VP (i.e. element) operations. Decoupling provides an efficient means for improving performance. Latencies can be hidden when producing operations execute sufficiently far ahead of consuming operations. Compared to centralized out-of-order processing, decoupling uses decentralized control to allow the relative execution of independent in-order units to slip dynamically. Simple queues buffer commands and data for the trailing units, to allow the leading units to run ahead.

Decoupling is used throughout the Scale design to improve performance and efficiency. The control processor generally runs far ahead and queues compact commands for the vector-thread unit and the vector memory units. The vector refill unit pre-processes load commands to decouple cache refill from accesses by the vector-load unit. The vector load unit itself runs ahead of the lanes to fetch data from the cache, and the vector store unit runs behind the lanes and writes output data to the cache. The lane command management units and the AIB fill unit are decoupled engines which operate in the background of execution on the lanes. Compact execute directive queues allow cluster decoupling in a lane to hide functional unit, memory access, or inter-cluster transport latencies. Clusters also include decoupled transport, receive, and writeback resources to make inter-cluster scheduling more flexible. Scale partitions VP stores into address and data portions to allow cluster 0 to continue executing subsequent loads after it produces the address for a store, while the store-data cluster runs behind and waits for the data to be available.

The decoupling resources in Scale have proven to be a low-cost means for improving performance. The resulting overlap in the execution of different units efficiently exploits parallelism that is not explicit in the instruction set encoding. The decoupled structure also enables clock-gating in idle units to save power.
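The common structure underlying these decoupled pairs is a leading unit that deposits compact commands into a queue and a trailing unit that drains them later, letting the two slip relative to each other. The fragment below is a minimal software sketch of such a single-producer/single-consumer queue (the QSIZE depth and int command type are arbitrary illustrative choices); a hardware queue provides the same push/pop-with-backpressure behavior.

```c
#include <stdbool.h>

#define QSIZE 16                       /* illustrative queue depth */

/* Minimal single-producer/single-consumer command queue. Initialize a
   cmd_queue to all zeros before use. The leading unit enqueues compact
   commands; the trailing unit dequeues and executes them later. */
typedef struct {
    int buf[QSIZE];
    int head;                          /* next slot to dequeue */
    int tail;                          /* next slot to enqueue */
} cmd_queue;

bool cq_push(cmd_queue *q, int cmd) {
    int next = (q->tail + 1) % QSIZE;
    if (next == q->head) return false; /* full: the leader must stall   */
    q->buf[q->tail] = cmd;
    q->tail = next;
    return true;
}

bool cq_pop(cmd_queue *q, int *cmd) {
    if (q->head == q->tail) return false; /* empty: the trailer waits   */
    *cmd = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    return true;
}
```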

Multithreading. A VT architecture time-multiplexes the execution of multiple virtual processors on each physical lane. This multithreading can hide intra-thread latencies, including memory accesses and thread fetches. Scale achieves a low-cost form of simultaneous multithreading by allowing different VPs within a lane to execute at the same time in different clusters. Simultaneous multithreading is enhanced through decoupling when AIBs have acyclic inter-cluster data dependencies, and it has proven to be an important mechanism for hiding functional unit and inter-cluster transport latencies.

Structured Memory Access Patterns. To a certain extent, parallelizing computation is the simplest part of achieving high performance. The real challenge is in constructing a memory system that can supply data to keep the parallel compute units busy. Vector and vector-thread architectures provide a key advantage by exposing structured memory access patterns with vector-loads and vector-stores. Scale exploits these to amortize overheads by loading or storing multiple elements with each access. Perhaps more importantly, structured memory accesses enable decoupling to hide long latencies. Scale's vector load data queues allow the vector load unit to run ahead and fetch data from the cache while the execution clusters run behind and process the data. Scale's vector refill unit initiates cache refills early so that the VLU cache accesses are typically hits.

Scale's segment accesses have proven to be a useful feature, and software frequently uses segments to access data. Segments expose structured memory access patterns more directly by replacing interleaved strided accesses, and they improve efficiency by grouping multiple elements into each cache access. The resulting multi-element load data writebacks chain well with multi-op AIBs in the destination cluster. The Scale chip demonstrates that segment buffers fit naturally under the read and write crossbars.
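As a simple illustration of the kind of access pattern segments capture, the hypothetical loop below reads interleaved three-byte RGB pixels. A conventional strided mapping issues three separate element accesses per pixel, whereas a segment load can fetch the three consecutive bytes of each pixel as one unit, with the pixel stride applied between segments.

```c
#include <stdint.h>

/* Interleaved 3-byte RGB pixels: element i occupies bytes [3i, 3i+2].
   With strided vector loads this is three separate element accesses per
   pixel; a segment access instead fetches the three consecutive bytes
   of each pixel together, stepping by the pixel stride between segments. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t r = rgb[3 * i + 0];   /* one three-byte segment ...     */
        uint8_t g = rgb[3 * i + 1];
        uint8_t b = rgb[3 * i + 2];   /* ... per virtual processor      */
        gray[i] = (uint8_t)((r + 2 * g + b) / 4);  /* crude luma approx. */
    }
}
```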

Non-Blocking Multi-Banked Cache. Scale demonstrates that a relatively small 32 KB primary cache can satisfy the needs of many requesters. Sufficient bandwidth is provided by dividing the cache into four independent banks that each support wide 16-byte accesses. Additionally, the control processor and the clusters together have a total of 3.3 KB of independent instruction cache storage to optimize their local fetches. Conflicts in the primary cache are mitigated by using CAM-tags to provide 32-way set associativity.

An important requirement for a shared high-performance cache is that it continue to satisfy new accesses while misses are outstanding. A non-blocking cache is difficult to design since it must track and replay misses and handle subsequent accesses to the in-flight lines. However, a non-blocking design improves efficiency by allowing a small cache to provide high performance. Instead of relying on cache hits sustained by temporal locality (which an application may not have), the cache primarily serves as a bandwidth filter to exploit small amounts of reuse as data is continually streamed to and from memory.
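As a rough sketch of how such a cache decomposes an address, the code below uses the parameters given in this thesis (32 KB, four banks, 32-way set associativity, 32-byte lines) but assumes banks are interleaved on the low-order line-address bits; the exact bit selection in the Scale chip may differ.

```c
#include <stdio.h>

/* Cache parameters from the text; the bank/set bit selection below is an
   assumption for illustration, not necessarily the exact Scale mapping. */
enum {
    CACHE_BYTES   = 32 * 1024,
    LINE_BYTES    = 32,
    NUM_BANKS     = 4,
    NUM_WAYS      = 32,
    LINES         = CACHE_BYTES / LINE_BYTES,          /* 1024 lines      */
    SETS_PER_BANK = LINES / (NUM_BANKS * NUM_WAYS),    /* 8 sets per bank */
};

int main(void) {
    unsigned addr = 0x0001A7C4u;              /* arbitrary example address   */
    unsigned line = addr / LINE_BYTES;
    unsigned bank = line % NUM_BANKS;         /* low line bits pick the bank */
    unsigned set  = (line / NUM_BANKS) % SETS_PER_BANK;
    unsigned tag  = line / (NUM_BANKS * SETS_PER_BANK);

    /* With CAM tags, all 32 ways of the selected set are searched
       associatively, so only the set index is decoded from the address. */
    printf("addr 0x%08x -> bank %u, set %u, tag 0x%x\n", addr, bank, set, tag);
    return 0;
}
```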

Complexity of Scale. One might be struck by the complexity of the Scale architecture. Part of this impression may simply be due to the novel microarchitectural organization of a VT architecture. Other architectures, like out-of-order superscalars, are also very complicated, but their microarchitectures have become familiar to most readers. However, much of Scale's complexity comes from a fundamental trade-off between performance, efficiency, and simplicity. A design is usually a compromise between these opposing goals, and with Scale we have tried to maximize performance and efficiency. That is to say, we have tried to make Scale complexity-effective.

As an example, multi-cluster lanes add significant complexity to the Scale implementation. However, clusters improve performance by exploiting instruction-level parallelism within a lane, and exposing clusters to software improves efficiency by enabling partitioning in the hardware. Scale's translation from VP instructions to cluster micro-op bundles enables decoupling, an efficient way to boost performance.

Another source of complexity in Scale is its abstraction of physical resources as configurable virtual processors. With this interface, the hardware must manage the mapping and execution interleaving of VPs "under the hood". However, the virtualization allows Scale to improve performance and efficiency through multithreading. Also, the configurability allows Scale's hardware resources to be used more efficiently depending on the needs of the software.

Despite the complexity of the Scale design, the hardware implementation is relatively efficient in its area usage. The register files and functional units take up around 35% of the vector-thread unit area. This is a significant fraction for these baseline compute and storage resources, especially considering all of the other overheads in the VTU: the vector memory access units, the AIB caching resources (including the data RAMs, tags, control, and the AIB fill unit), the vector and thread command management and control resources (including the queues and sequencers, and the pending thread fetch queue), the transport and writeback decoupling resources, the cross-VP queues, and the miscellaneous datapath latches, muxes, and control logic. For an individual cluster, the synthesized control logic is only around 22% of the total area.

Core Size in Processor Arrays. Future all-purpose processor arrays will place a premium on compute density and efficiency. The Scale chip demonstrates that a large amount of compute, storage, and control structures can fit in a very small space. Scale's core area would shrink to only around 1 mm2 in a modern 45 nm technology, allowing a chip to fit around 100 cores per square centimeter. In comparison, Intel's latest Penryn processor uses around 22 mm2 per superscalar core in 45 nm technology. This order-of-magnitude difference shows that conventional general-purpose cores, which devote considerable silicon area to improving single-thread performance, are not likely to meet the needs of future high-throughput applications.

One might also consider using a scalar RISC processor as a core in a future all-purpose processor array. This is an attractive proposition since RISC cores are simple and compact. For example, the entire Scale design is around 50× larger than the control processor alone. However, a RISC processor will need instruction and data caches, say 16 KB each, and this reduces the Scale area ratio to about 5×. The connection to an inter-processor network will add even more overhead. Looking at the Scale plot in Figure 6-8, we see that the area of a cluster is about the same as the control processor. From this perspective, Scale's vector-thread unit effectively packs 16 RISC-like processing units tightly together, allowing them to amortize the area overheads of a core's cache and a network connection. Furthermore, these processing units are interconnected with a low-latency high-bandwidth network. Perhaps more important than its density of compute clusters, the Scale architecture improves efficiency compared to a multiprocessor RISC design. Scale's control processor and vector-fetch commands amortize instruction control overheads, and its vector-load and vector-store commands amortize memory access overheads. These features allow Scale to use less energy than a RISC processor when performing the same amount of work.

Related to core size is the network granularity of a processor array. Several proposals have advocated scalar operand networks to connect processing cores. With the Scale design, we argue that a core itself should provide high compute bandwidth, supported by an internal high-bandwidth low-latency scalar operand network. However, once scalar parallelism is exploited within a core, an inter-core network should optimize the bandwidth for larger data transfers. This corresponds to a software organization in which streams of data flow between coarse-grained compute kernels that are mapped to cores. The Scale results demonstrate that 32-byte cache lines are appropriate and that structured parallelism can be exploited to hide long latencies.
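The core-size comparison above is simple scaling arithmetic; the short program below reproduces it under the usual assumption of an ideal quadratic area shrink from 180 nm to 45 nm. The 16.6 mm2 and 22 mm2 figures are those quoted in the text.

```c
#include <stdio.h>

int main(void) {
    /* Figures from the text; an ideal quadratic shrink is assumed. */
    double scale_area_180nm = 16.6;                  /* mm^2 at 180 nm     */
    double shrink = (180.0 / 45.0) * (180.0 / 45.0); /* ideal area scaling */
    double scale_area_45nm  = scale_area_180nm / shrink;

    double penryn_core_45nm = 22.0;                  /* mm^2 per core      */

    printf("Scale core at 45 nm:  ~%.1f mm^2\n", scale_area_45nm);
    printf("cores per cm^2:       ~%.0f\n", 100.0 / scale_area_45nm);
    printf("Penryn core / Scale:  ~%.0fx\n", penryn_core_45nm / scale_area_45nm);
    return 0;
}
```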

9.2 Future Work

There are many possible future directions for VT architectures, and the Scale design and evaluation could be extended along many dimensions. The following suggestions are just a few examples.

More Applications. The Scale evaluation would benefit from having more large applications mapped to the architecture. The 802.11a transmitter is vectorizable, while the JPEG encoder mapping has limited parallelism due to the serial block compression algorithm. It would be interesting to gain a better understanding of the fine-grained loop-level parallelism available in different types of code. Interesting applications could include video processing, graphics rendering, machine learning, and bioinformatics.

Non-Loop-Level Parallelism. Most of the Scale benchmark mappings use either loop-level parallelization or free-running threads. Scale VPs can potentially be used to exploit other types of parallelism. One approach could be to use VPs as helper threads which process small chunks of work that a control thread offloads from a largely serial execution. Another approach could be to connect VPs in a pipeline where each performs an independent processing task.

Compilation. We designed Scale to be a compiler-friendly target, and an obvious research direction is to evaluate the effectiveness of a compiler at exploiting vector-thread parallelism. A Scale compiler is in development, and it currently has support for vectorizable loops, data-parallel loops with internal control flow, and loops with cross-iteration dependencies. The compiler uses VP thread-fetches to map conditionals and inner loops, and it uses vector-loads and vector-stores even when the body of a loop is not vectorizable. Preliminary performance results for a few benchmarks indicate that the compiler is able to achieve 5.0–5.9 VTU compute-ops per cycle on vectorizable code, and 3.5–6.8 VTU compute-ops per cycle on loops with internal control flow.

Special Compute Operations. One way to extend the Scale architecture would be to provide different types of cluster compute operations. Subword-SIMD operations could boost performance by performing multiple narrow-width operations in parallel. A useful instruction could be a dual 16-bit multiply that sends two results to different destination clusters, thereby enabling two 16-bit MACs per cycle in each lane. Performance could also be improved with fixed-point instructions that include support for scaling and rounding. Additionally, support for floating-point instructions would make the architecture more versatile, and these long-latency operations could put unique stresses on the microarchitecture (though Scale already supports long-latency multiply and divide units). Various special-purpose instructions could be added to speed up particular application kernels. Also, it would be interesting to add special reconfigurable units that could perform flexible bit-level computation.
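To make the fixed-point suggestion concrete, the helper below performs the scale-and-round (and saturate) sequence that such an instruction could fold into a single operation; the Q15 format and round-to-nearest mode are illustrative choices, not a Scale ISA definition.

```c
#include <stdint.h>

/* Q15 fixed-point multiply with rounding and saturation: the kind of
   scale-and-round sequence a dedicated fixed-point instruction could
   perform in one operation. Format and rounding mode are illustrative.
   (Assumes the usual arithmetic right shift of signed values.) */
int16_t q15_mul(int16_t a, int16_t b) {
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q15 x Q15 -> Q30 product */
    prod += 1 << 14;                          /* round to nearest         */
    prod >>= 15;                              /* rescale back to Q15      */
    if (prod >  32767) prod =  32767;         /* saturate on overflow     */
    if (prod < -32768) prod = -32768;
    return (int16_t)prod;
}
```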

Unexposing Clusters. Exposed clusters simplify hardware, but they shift the instruction-level parallelism burden to software. The alternative would be to have hardware dynamically assign instructions to clusters. An implementation could potentially use a clustered register file organization and issue micro-ops to route operands as necessary. The cluster assignment and operand management could possibly be performed during an AIB cache refill.

Larger Cores. An interesting direction for Scale could be to evaluate larger core designs. VT architectures can scale the number of lanes to improve performance while retaining binary compatibility, and many of the application kernels could benefit from 8, 16, or 32 lanes. The main hardware impediment to increasing the number of lanes would be the complexity of the crossbar. Larger designs would likely require multiple stages and have an increased latency. Unit-stride vector loads and stores could easily scale to handle more lanes. However, the cache might not provide sufficient banking to allow indexed and segment-strided accesses to issue from every lane every cycle. Segments would help alleviate this problem since they reduce the address bandwidth for a lane.

Smaller Cores. An alternative scaling approach would be to reduce the size of the Scale core, especially in the context of a chip multiprocessor. Even a single-lane Scale implementation has the potential to provide several operations per cycle of compute bandwidth while benefiting from vector-load and vector-store commands. It could be particularly attractive to try to merge the control processor and the lane hardware. The control processor could essentially be a special VP, possibly with unique scheduling priorities and VTU command capabilities. In addition to reducing hardware, such a design would unify the control processor and VP instruction sets.

VT Processor Arrays. From the beginning, Scale was intended to be an efficient core design for a processor array. One of the most interesting avenues of future work would be to evaluate a VT processor array, from both a hardware and a software perspective. Compared to other processor arrays, a VT design may achieve higher compute density, and it may simplify the software parallelization challenge by exploiting a significant amount of fine-grained parallelism within each core.

9.3 Thesis Contributions

In conclusion, this thesis makes three main contributions. The vector-thread architectural paradigm allows software to compactly encode fine-grained loop-level parallelism and locality, and VT hardware can exploit this encoding for performance and efficiency. The Scale VT architecture is a concrete instance of the VT paradigm, with a complete programmer's model and microarchitecture. Scale flexibly exposes and exploits parallelism and locality on a diverse selection of embedded benchmarks. The Scale VT processor implementation proves that vector-thread chips can achieve high performance, low power, and area efficiency.

Bibliography

[ABHS89] M. C. August, G. M. Brost, C. C. Hsiung, and A. J. Schiffleger. Cray X-MP: The Birth of a Supercomputer. Computer, 22(1):45–52, 1989. [ACM88] Arvind, D. E. Culler, and G. K. Maa. Assessing the benefits of fine-grain parallelism in dataflow programs. In Supercomputing ’88: Proceedings of the 1988 ACM/IEEE conference on Supercomputing, pages 60–69, 1988. [AHKB00] V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger. versus IPC: the end of the road for conventional microarchitectures. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 248–259. ACM Press, New York, NY, USA, 2000. [AHKW02] K. Asanovic, M. Hampton, R. Krashinsky, and E. Witchel. Energy-exposed instruction sets. pages 79–98, 2002. [Asa98] K. Asanovic. Vector Microprocessors. Ph.D. thesis, EECS Department, University of California, Berkeley, 1998. [ASL03] D. Abts, S. Scott, and D. J. Lilja. So Many States, So Little Time: Verifying Memory Coherence in the Cray X1. In IPDPS, Apr 2003. [BA93] F. Boeri and M. Auguin. OPSILA: A Vector and Parallel Processor. IEEE Trans. Comput., 42(1):76–82, 1993. [Bar04] M. Baron. Stretching Performance. Microprocessor Report, 18(4), Apr 2004. [BGGT01] A. Bik, M. Girkar, P. Grey, and X. Tian. Efficient Exploitation of Parallelism on Pentium III and Pentium 4 Processor-Based Systems. Intel Technology Journal, (Q1):9, 2001. [BKA07] C. Batten, R. Krashinsky, and K. Asanovic. Scale Control Processor Test-Chip. Technical Report MIT- CSAIL-TR-2007-003, CSAIL Technical Reports, Massachusetts Institute of Technology, Jan 2007. [BKGA04] C. Batten, R. Krashinsky, S. Gerding, and K. Asanovic. Cache Refill/Access Decoupling for Vector Ma- chines. In MICRO 37: Proceedings of the 37th annual IEEE/ACM International Symposium on Microar- chitecture, pages 331–342. IEEE Computer Society, Washington, DC, USA, 2004. [CEL+03] S. Ciricescu, R. Essick, B. Lucas, P. May, K. Moat, J. Norris, M. Schuette, and A. Saidi. The Reconfig- urable Streaming Vector Processor (RSVP). In MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, page 141. IEEE Computer Society, Washington, DC, USA, 2003. [Cha07] M. Chang. Foundry Future: Challenges in the 21st Century. Presented at the International Solid State Circuits Conference (ISSCC) Plenary Session, Feb 2007. [CHJ+90] R. P. Colwell, W. E. Hall, C. S. Joshi, D. B. Papworth, P. K. Rodman, and J. E. Tornes. Architecture and implementation of a VLIW supercomputer. In Supercomputing ’90: Proceedings of the 1990 ACM/IEEE conference on Supercomputing, pages 910–919. IEEE Computer Society, Washington, DC, USA, 1990. [CK02] D. Chinnery and K. Keutzer. Closing the Gap between ASIC and Custom: Tools and Techniques for High- Performance ASIC Design. Kluwer Academic Publishers (now Springer), 2002. [CO03] M. K. Chen and K. Olukotun. The Jrpm system for dynamically parallelizing Java programs. In ISCA ’03: Proceedings of the 30th annual International Symposium on Computer architecture, pages 434–446. ACM Press, New York, NY, USA, 2003. [cra03] Cray assembly language (CAL) for Cray X1 systems reference manual. Technical Report S-2314-51, Cray, Oct 2003.

181 [CSY90] D.-K. Chen, H.-M. Su, and P.-C. Yew. The impact of synchronization and granularity on parallel systems. In ISCA ’90: Proceedings of the 17th annual International Symposium on Computer Architecture, pages 239–248, 1990. [CWT+01] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative precomputation: long-range prefetching of delinquent loads. In ISCA ’01: Proceedings of the 28th annual international symposium on Computer architecture, pages 14–25. ACM Press, New York, NY, USA, 2001. [D. 02] D. H. Brown Associates, Inc. Cray Launches X1 for Extreme Supercomputing, Nov 2002. [DPT03] A. Duller, G. Panesar, and D. Towner. Parallel Processing - the picoChip way! In J. F. Broenink and G. H. Hilderink, editors, Communicating Process Architectures 2003, pages 125–138, Sep 2003. [DSC+07] J. Dorsey, S. Searles, M. Ciraula, S. Johnson, N. Bujanos, D. Wu, M. Braganza, S. Meyers, E. Fang, and R. Kumar. An Integrated Quad-Core Opteron Processor. In Proceedings of the International Solid State Circuits Conference (ISSCC), Feb 2007. [EAE+02] R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hernandez, T. Juan, G. Lowney, M. Mat- tina, and A. Seznec. Tarantula: a vector extension to the Alpha architecture. In ISCA ’02: Proceedings of the 29th annual international symposium on Computer architecture, pages 281–292. IEEE Computer Society, Washington, DC, USA, 2002. [Eat05] W. Eatherton. Keynote Address: The Push of Network Processing to the Top of the Pyramid. In Symposium on Architectures for Networking and Communications Systems, Oct 2005. [EML88] C. Eoyang, R. H. Mendez, and O. M. Lubeck. The birth of the second generation: the S-820/80. In Supercomputing ’88: Proceedings of the 1988 ACM/IEEE conference on Supercomputing, pages 296–303. IEEE Computer Society Press, Los Alamitos, CA, USA, 1988. [EV96] R. Espasa and M. Valero. Decoupled vector architectures. In HPCA-2, Feb 1996. [EVS97] R. Espasa, M. Valero, and J. E. Smith. Out-of-order vector architectures. In MICRO-30, Dec 1997. [Faa98] G. Faanes. A CMOS vector processor with a custom streaming cache. In Proc. Hot Chips 10, Aug 1998. [FAD+06] B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H.-J. Oh, S. M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, N. Yano, D. A. Bro- kenshire, M. Peyravian, V. To, and E. Iwata. The microarchitecture of the synergistic processor for a Cell processor. IEEE Journal of Solid-State Circuits (JSSCC), 41(1):63–70, Jan 2006. [FMJ+07] J. Friedrich, B. McCredie, N. James, B. Huott, B. Curran, E. Fluhr, G. Mittal, E. Chan, Y. Chan, D. Plass, S. Chu, H. Le, L. Clark, J. Ripley, S. Taylor, J. DiLullo, and M. Lanzerotti. Design of the POWER6 Microprocessor. In Proceedings of the International Solid State Circuits Conference (ISSCC), Feb 2007. [For94] M. P. I. Forum. MPI: A Message-Passing Interface Standard. Technical Report UT-CS-94-230, 1994. [Ful05] S. Fuller. The future of ASSPs: Leveraging standards for market-focused applications. Embedded Com- puting Design, Mar 2005. [FWB07] X. Fan, W.-D. Weber, and L. A. Barroso. Power Provisioning for a Warehouse-sized Computer. In ISCA-34: Proceedings of the 34th annual International Symposium on Computer Architecture, Jun 2007. [GBK07] A. Gangwar, M. Balakrishnan, and A. Kumar. Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures. ACM Trans. Des. Autom. Electron. Syst., 12(1):1, 2007. [Ger05] S. Gerding. 
The Extreme Benchmark Suite : Measuring High-Performance Embedded Systems. Master’s thesis, Massachusetts Institute of Technology, Cambridge, MA, Sept 2005. [GHF+06] M. Gschwind, H. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic Processing in Cell’s Multicore Architecture. IEEE Micro, 26(2):10–24, Mar 2006. [GRE+01] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC-4: Proceedings of the IEEE Interna- tional Workshop on Workload Characterization, pages 3–14. IEEE Computer Society, Washington, DC, USA, 2001. [HA06] M. Hampton and K. Asanovic. Implementing virtual memory in a vector processor with software restart markers. In ICS, pages 135–144, 2006. [Hal06a] T. R. Halfhill. Ambric’s New Parallel Processor. Microprocessor Report, 20(10), Oct 2006. [Hal06b] T. R. Halfhill. Digital Video. Microprocessor Report, 20(1):17–22, Jan 2006. [Hal06c] T. R. Halfhill. MathStar Challenges FPGAs. Microprocessor Report, 20(7), Jul 2006.

182 [HD86] P. Y. T. Hsu and E. S. Davidson. Highly concurrent scalar processing. In ISCA ’86: Proceedings of the 13th annual international symposium on Computer architecture, pages 386–395. IEEE Computer Society Press, Los Alamitos, CA, USA, 1986. [HUYK04] S. Habata, K. Umezawa, M. Yokokawa, and S. Kitawaki. Hardware system of the . Parallel Comput., 30(12):1287–1313, 2004. [HW97] J. R. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. pages 24–33, Apr 1997. [IEE95] IEEE. IEEE 1003.1c-1995: Information Technology — Portable Operating System Interface (POSIX) - System Application Program Interface (API) Amendment 2: Threads Extension (C Language), 1995. [IEE99] IEEE. IEEE standard 802.11a supplement. Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications. High-speed Physical Layer in the 5 GHz Band, 1999. [Int07] Intel. Press Release: Intel Details Upcoming New Processor Generations, Mar 2007. [Jes01] C. R. Jesshope. Implementing an efficient vector instruction set in a chip multi-processor using micro- threaded pipelines. Australia Computer Science Communications, 23(4):80–88, 2001. [KBH+04a] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector- Thread Architecture. In ISCA ’04: Proceedings of the 31st annual international symposium on Computer architecture, page 52. IEEE Computer Society, Washington, DC, USA, 2004. [KBH+04b] R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector- Thread Architecture. IEEE Micro, 24(6):84–90, Nov-Dec 2004. [KDK+01] B. Khailany, W. J. Dally, U. J. Kapasi, P. Mattson, J. Namkoong, J. D. Owens, B. Towles, A. Chang, and S. Rixner. Imagine: Media Processing with Streams. IEEE Micro, 21(2):35–46, 2001. [KDM+98] S. W. Keckler, W. J. Dally, D. Maskit, N. P. Carter, A. Chang, and W. S. Lee. Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor. In ISCA ’98: Proceedings of the 25th annual international symposium on Computer architecture, pages 306–317, 1998. [KH92] G. Kane and J. Heinrich. MIPS RISC architecture. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1992. [Koz02] C. Kozyrakis. Scalable vector media-processors for embedded systems. Ph.D. thesis, University of Cali- fornia at Berkeley, May 2002. [KP03] C. Kozyrakis and D. Patterson. Overcoming the Limitations of Conventional Vector Processors. In ISCA- 30, Jun 2003. [KPP+97] C. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic,´ N. Cardwell, R. Fromm, K. Keeton, R. Thomas, and K. Yelick. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, 30(9):75–78, Sep 1997. [KPP06] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, 2006. [KTHK03] K. Kitagawa, S. Tagaya, Y. Hagihara, and Y. Kanoh. A Hardware Overview of SX-6 and SX-7 Supercom- puter. NEC Research & Development Journal, 44(1):2–7, Jan 2003. [LaP07] M. LaPedus. Sockets scant for costly ASICs. EE Times, Mar 2007. [Lee06] E. A. Lee. The Problem with Threads. IEEE Computer, 39(5):33–42, 2006. [LHDH05] L. Li, B. Huang, J. Dai, and L. Harrison. Automatic multithreading and multiprocessing of C programs for IXP. In PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 132–141. ACM Press, New York, NY, USA, 2005. [LPMS97] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. 
MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communicatons Systems. In International Symposium on Microarchitecture, pages 330– 335, 1997. [LSCJ06] C. Lemuet, J. Sampson, J.-F. Collard, and N. Jouppi. The potential energy efficiency of vector acceleration. In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 77. ACM Press, New York, NY, USA, 2006. [Luk01] C.-K. Luk. Tolerating memory latency through software-controlled pre-execution in simultaneous multi- threading processors. In ISCA ’01: Proceedings of the 28th annual international symposium on Computer architecture, pages 40–51. ACM Press, New York, NY, USA, 2001. [McG06] H. McGhan. Niagara 2 Opens the Floodgates. Microprocessor Report, 20(11), Nov 2006.

183 [MMG+06] A. Mendelson, J. Mandelblat, S. Gochman, A. Shemer, E. Niemeyer, and A. Kumar. CMP Implementation in Systems Based on the Intel Core Duo Processor. Intel Technology Journal, 10(2):99–107, May 2006. [Mor68] D. R. Morrison. PATRICIA Practical Algorithm To Retrieve Information Coded in Alphanumeric. Journal of the ACM, 15(4):514–534, 1968. [MPJ+00] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz. Smart Memories: A Modular Recon- figurable Architecture. In Proc. ISCA 27, pages 161–171, June 2000. [MU83] K. Miura and K. Uchida. FACOM vector processing system: VP100/200. In Proc. NATO Advanced Research Work on High Speed Computing, Jun 1983. [NHS93] C. J. Newburn, A. S. Huang, and J. P. Shen. Balancing Fine- and Medium-Grained Parallelism in Schedul- ing Loops for the XIMD Architecture. In Proceedings of the IFIP WG10.3 Working Conference on Archi- tectures and Compilation Techniques for Fine and Medium Grain Parallelism, pages 39–52. North-Holland Publishing Co., Amsterdam, The Netherlands, The Netherlands, 1993. [NVI06] NVIDIA. Technical Brief: GeForce 8800 GPU Architecture Overview, Nov 2006. [ONH+96] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The case for a single-chip multipro- cessor. In ASPLOS-VII: Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, pages 2–11. ACM Press, New York, NY, USA, 1996. [PM93] W. Pennebaker and J. Mitchel. JPEG Still Image Data Compression Standard. Van Nostrand Reinhold, 1993. [Pri95] C. Price. MIPS IV Instruction Set Revision 3.2. Technical report, MIPS Technologies Inc., Sep 1995. [PWT+94] M. Philippsen, T. M. Warschko, W. F. Tichy, C. G. Herter, E. A. Heinz, and P. Lukowicz. Project Triton: Towards Improved Programmability of Parallel Computers. In D. J. Lilja and P. L. Bird, editors, The Interaction of Compilation Technology and Computer Architecture. Kluwer Academic Publishers, Boston, Dordrecht, London, 1994. [QCEV99] F. Quintana, J. Corbal, R. Espasa, and M. Valero. Adding a vector unit to a superscalar processor. In ICS ’99: Proceedings of the 13th international conference on Supercomputing, pages 1–10. ACM Press, New York, NY, USA, 1999. [RDK+98] S. Rixner, W. Dally, U. Kapasi, B. Khailany, A. Lopez-Lagunas, P. Mattson, and J. Owens. A Bandwidth- Efficient Architecture for Media Processing. In MICRO-31, Nov 1998. [RSOK06] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis. Vector Lane Threading. In ICPP ’06: Proceed- ings of the 2006 International Conference on Parallel Processing, pages 55–64. IEEE Computer Society, Washington, DC, USA, 2006. [RTM+07] S. Rusu, S. Tam, H. Muljono, D. Ayers, and J. Chang. A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache. In Proceedings of the International Solid State Circuits Conference (ISSCC), Feb 2007. [Rus78] R. M. Russel. The Cray-1 Computer System. Communications of the ACM, 21(1):63–72, Jan 1978. [SACD04] R. Sasanka, S. V. Adve, Y.-K. Chen, and E. Debes. The energy efficiency of CMP vs. SMT for multimedia workloads. In ICS ’04: Proceedings of the 18th annual international conference on Supercomputing, pages 196–206. ACM Press, New York, NY, USA, 2004. [SBV95] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In ISCA-22, pages 414–425, June 1995. [SFS00] J. E. Smith, G. Faanes, and R. Sugumar. Vector instruction set support for conditional operations. 
In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 260–269. ACM Press, New York, NY, USA, 2000. [SLB06] G. Seetharaman, A. Lakhotia, and E. P. Blasch. Unmanned Vehicles Come of Age: The DARPA Grand Challenge. Computer, 39(12):26–29, 2006. [Smi82] J. E. Smith. Decoupled access/execute computer architecture. In ISCA-9, 1982. [SNG+06] K. Sankaralingam, R. Nagarajan, P. Gratz, R. Desikan, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ran- ganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, W. Yoder, R. McDonald, S. Keckler, and D. Burger. The Distributed Microarchitecture of the TRIPS Prototype Processor. In MICRO-39, Dec 2006. [SNL+03] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. Moore. Exploit- ing ILP, TLP, and DLP with the Polymorphous TRIPS Architecture. In ISCA-30, June 2003.

184 [SSK+81] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, J. P T Mueller, and J. H E Smalley. PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition. IEEE Transactions on Computers, C-30(12):934–947, Dec 1981. [Str] Stretch. White Paper: The S6000 Family of Processors. [SUK+80] M. C. Sejnowski, E. T. Upchurch, R. N. Kapur, D. P. S. Charlu, and G. J. Lipovski. An overview of the Texas Reconfigurable Array Computer. In AFIPS 1980 National Comp. Conf., pages 631–641, 1980. [THA+99] J.-Y. Tsai, J. Huang, C. Amlo, D. J. Lilja, and P.-C. Yew. The Superthreaded Processor Architecture. IEEE Trans. Comput., 48(9):881–902, 1999. [The] The Independent JPEG Group. http://ijg.org/. [THGM05] J. L. Tripp, A. A. Hanson, M. Gokhale, and H. Mortveit. Partitioning Hardware and Software for Recon- figurable Supercomputing Applications: A Case Study. In SC ’05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing, page 27. IEEE Computer Society, Washington, DC, USA, 2005. [TKM+02] M. Taylor, J. Kim, J. Miller, D. Wentzla, F. Ghodrat, B. Greenwald, H. Ho, m Lee, P. Johnson, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Frank, S. Amarasinghe, and A. Agarwal. The Raw Mi- croprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, 22(2), 2002. [TKM+03] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, W. Lee, A. Saraf, N. Shnidman, V. Strumpen, S. Amarasinghe, and A. Agarwal. A 16-issue multiple-program- counter microprocessor with point-to-point scalar operand network. In Proceedings of the International Solid State Circuits Conference (ISSCC), Feb 2003. [TLAA05] M. B. Taylor, W. Lee, S. P. Amarasinghe, and A. Agarwal. Scalar Operand Networks. IEEE Transactions on Parallel and Distributed Systems, 16(2):145–162, 2005. [TTG+03] A. Terechko, E. L. Thenaff, M. Garg, J. van Eijndhoven, and H. Corporaal. Inter-Cluster Communication Models for Clustered VLIW Processors. In HPCA ’03: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, page 354. IEEE Computer Society, Washington, DC, USA, 2003. [Tur03] M. Turpin. The Dangers of Living with an X (bugs hidden in your Verilog). In Synopsys Users Group Meeting, Oct 2003. [Tzi91] Tzi-cker Chiueh. Multi-threaded vectorization. In ISCA ’91: Proceedings of the 18th annual international symposium on Computer architecture, pages 352–361. ACM Press, New York, NY, USA, 1991. [UIT94] T. Utsumi, M. Ikeda, and M. Takamura. Architecture of the VPP500 parallel supercomputer. In Proc. Supercomputing, pages 478–487, Nov 1994. [vdTJG03] E. B. van der Tol, E. G. Jaspers, and R. H. Gelderblom. Mapping of H.264 decoding on a multiprocessor architecture. In Image and Video Communications and Processing, 2003. [VJM99] S. Vajapeyam, P. J. Joseph, and T. Mitra. Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs. In ISCA ’99: Proceedings of the 26th annual international symposium on Computer architecture, pages 16–27. IEEE Computer Society, Washington, DC, USA, 1999. [vN93] J. von Neumann. First Draft of a Report on the EDVAC. IEEE Ann. Hist. Comput., 15(4):27–75, 1993. [WAK+96] J. Wawrzynek, K. Asanovic,´ B. Kingsbury, J. Beck, D. Johnson, and N. Morgan. Spert-II: A vector micro- processor system. IEEE Computer, 29(3):79–86, Mar 1996. [Wat87] T. Watanabe. Architecture and performance of NEC supercomputer SX system. , 5:247–255, 1987. [WS91] A. Wolfe and J. P. Shen. 
A variable instruction stream extension to the VLIW architecture. In ASPLOS-IV: Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, pages 2–14. ACM Press, New York, NY, USA, 1991. [Xil05] Xilinx. Press Release #0527: Xilinx Virtex-4 FX FPGAs With PowerPC Processor Deliver Breakthrough 20x Performance Boost, Mar 2005. [Xil07] Xilinx. Data Sheet: Virtex-5 Family Overview – LX, LXT, and SXT Platforms, Feb 2007. [YBC+06] V. Yalala, D. Brasili, D. Carlson, A. Hughes, A. Jain, T. Kiszely, K. Kodandapani, A. Varadharajan, and T. Xanthopoulos. A 16-Core RISC Microprocessor with Network Extensions. In Proceedings of the Inter- national Solid State Circuits Conference (ISSCC), Feb 2006.

185 [Yea96] K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–40, 1996. [YMHB00] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit. In ISCA ’00: Proceedings of the 27th annual international symposium on Computer architecture, pages 225–235. ACM Press, New York, NY, USA, 2000. [ZB91] M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Supercomputing ’91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pages 712–721. ACM Press, New York, NY, USA, 1991. [ZFMS05] H. Zhong, K. Fan, S. Mahlke, and M. Schlansker. A Distributed Control Path Architecture for VLIW Processors. In PACT ’05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, pages 197–206. IEEE Computer Society, Washington, DC, USA, 2005. [ZLM07] H. Zhong, S. Lieberman, and S. Mahlke. Extending Multicore Architectures to Exploit Hybrid Parallelism in Single-thread Applications. In Proc. 2007 International Symposium on High Performance Computer Architecture, February 2007.
