Vector Microprocessors
Total Page:16
File Type:pdf, Size:1020Kb
Vector Microprocessors by Krste AsanoviÂc B.A. (University of Cambridge) 1987 A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the GRADUATE DIVISION of the UNIVERSITY of CALIFORNIA, BERKELEY Committee in charge: Professor John Wawrzynek, Chair Professor David A. Patterson Professor David Wessel Spring 1998 The dissertation of Krste AsanoviÂc is approved: Chair Date Date Date University of California, Berkeley Spring 1998 Vector Microprocessors Copyright 1998 by Krste AsanoviÂc 1 Abstract Vector Microprocessors by Krste AsanoviÂc Doctor of Philosophy in Computer Science University of California, Berkeley Professor John Wawrzynek, Chair Most previous research into vector architectures has concentrated on supercomputing applications and small enhancements to existing vector supercomputer implementations. This thesis expands the body of vector research by examining designs appropriate for single-chip full-custom vector microprocessor imple- mentations targeting a much broader range of applications. I present the design, implementation, and evaluation of T0 (Torrent-0): the ®rst single-chip vector microprocessor. T0 is a compact but highly parallel processor that can sustain over 24 operations per cycle while issuing only a single 32-bit instruction per cycle. T0 demonstrates that vector architectures are well suited to full-custom VLSI implementation and that they perform well on many multimedia and human-machine interface tasks. The remainder of the thesis contains proposals for future vector microprocessor designs. I show that the most area-ef®cient vector register ®le designs have several banks with several ports, rather than many banks with few ports as used by traditional vector supercomputers, or one bank with many ports as used by superscalar microprocessors. To extend the range of vector processing, I propose a vector ¯ag processing model which enables speculative vectorization of ªwhileº loops. To improve the performance of inexpensive vector memory systems, I introduce virtual processor caches, a new form of primary vector cache which can convert some forms of strided and indexed vector accesses into unit-stride bursts. Professor John Wawrzynek Dissertation Committee Chair iii Contents List of Figures ix List of Tables xv 1 Introduction 1 2 Background and Motivation 5 2.1 Alternative Forms of Machine Parallelism :: ::: :::: ::: :::: ::: :::: ::: 6 2.2 Vector Processing ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 8 2.2.1 Vector Memory-Memory versus Vector Register Architectures : ::: :::: ::: 8 2.2.2 Vector Register ISA :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 9 2.2.3 The Virtual Processor View : :::: ::: :::: ::: :::: ::: :::: ::: 11 2.2.4 Vector ISA Advantages ::: :::: ::: :::: ::: :::: ::: :::: ::: 13 2.3 Applications ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 14 2.3.1 Software Effort : :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 16 2.3.2 The Continuing Importance of Assembly Coding : ::: :::: ::: :::: ::: 16 2.4 Scalar Performance ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 17 2.4.1 Scalar Performance of Vector Supercomputers ::: ::: :::: ::: :::: ::: 19 2.4.2 Scalar versus Vector Performance Tradeoff :::: ::: :::: ::: :::: ::: 21 2.5 Memory Systems : ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 21 2.5.1 STREAM Benchmark : ::: :::: ::: :::: ::: :::: ::: :::: ::: 22 2.5.2 High Performance DRAM Interfaces ::: :::: ::: :::: ::: :::: ::: 24 2.5.3 IRAM: Processor-Memory Integration ::: :::: ::: :::: ::: :::: ::: 25 2.6 Energy Ef®ciency ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 25 2.7 Summary : :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 28 3 T0: A Vector Microprocessor 29 3.1 Project Background :: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 29 3.2 Torrent Instruction Set Architecture : :::: ::: :::: ::: :::: ::: :::: ::: 31 3.3 T0 Microarchitecture :: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 32 3.3.1 External Memory ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 32 3.3.2 TSIP ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 34 3.3.3 Instruction Fetch and Decode :::: ::: :::: ::: :::: ::: :::: ::: 35 3.3.4 Scalar Unit ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 36 3.3.5 Vector Register File :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 37 3.3.6 Vector Memory Unit :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 37 3.3.7 Vector Arithmetic Units ::: :::: ::: :::: ::: :::: ::: :::: ::: 39 3.3.8 Chaining : ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 42 3.3.9 Exception Handling :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 43 iv 3.3.10 Hardware Performance Monitoring : ::: :::: ::: :::: ::: :::: ::: 44 3.4 T0 Implementation ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 44 3.4.1 Process Technology :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 44 3.4.2 Die Statistics :: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 48 3.4.3 Spert-II System :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 48 3.5 T0 Design Methodology :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 49 3.5.1 RTL Design :: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 49 3.5.2 RTL Veri®cation :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 50 3.5.3 Circuit and Layout Design :: :::: ::: :::: ::: :::: ::: :::: ::: 51 3.5.4 Control Logic Design : ::: :::: ::: :::: ::: :::: ::: :::: ::: 52 3.5.5 Timing Analysis :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 52 3.5.6 Timeline and Status :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 52 3.6 Short Chimes ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 53 4 Vector Instruction Execution 57 4.1 Overall Structure of a Vector Microprocessor ::: :::: ::: :::: ::: :::: ::: 58 4.2 Scalar Instruction Execution in a Vector Processor : :::: ::: :::: ::: :::: ::: 59 4.3 Vector Instruction Execution :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 60 4.4 Decoupled Vector Execution :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 62 4.5 Out-of-Order Vector Execution : ::: :::: ::: :::: ::: :::: ::: :::: ::: 63 4.6 Exception Handling for Vector Machines :: ::: :::: ::: :::: ::: :::: ::: 64 4.6.1 Handling Vector Arithmetic Traps :: ::: :::: ::: :::: ::: :::: ::: 64 4.6.2 Handling Vector Data Page Faults :: ::: :::: ::: :::: ::: :::: ::: 65 4.7 Example Decoupled Vector Pipeline Design : ::: :::: ::: :::: ::: :::: ::: 66 4.7.1 Instruction Queues ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 66 4.7.2 Execution Example :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 69 4.7.3 Queue Implementation : ::: :::: ::: :::: ::: :::: ::: :::: ::: 72 4.8 Inter-Instruction Latencies ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 74 4.9 Interlocking and Chaining ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 76 4.9.1 Types of Chaining ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 76 4.9.2 Interlock Control Structure :: :::: ::: :::: ::: :::: ::: :::: ::: 77 4.9.3 Early Pipe Interlocks :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 78 4.9.4 Late Pipe Interlocks :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 79 4.9.5 Chaining Control :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 80 4.9.6 Interlocking and Chaining Summary : ::: :::: ::: :::: ::: :::: ::: 81 5 Vector Register Files 83 5.1 Vector Register Con®guration : ::: :::: ::: :::: ::: :::: ::: :::: ::: 83 5.1.1 Vector Register Length : ::: :::: ::: :::: ::: :::: ::: :::: ::: 83 5.1.2 Number of Vector Registers : :::: ::: :::: ::: :::: ::: :::: ::: 85 5.1.3 Recon®gurable Register Files :::: ::: :::: ::: :::: ::: :::: ::: 87 5.1.4 Implementation-dependent Vector Register Length ::: :::: ::: :::: ::: 87 5.1.5 Context-Switch Overhead :: :::: ::: :::: ::: :::: ::: :::: ::: 90 5.1.6 Vector Register File Con®guration for a Vector Microprocessor : ::: :::: ::: 91 5.2 Vector Register File Implementation : :::: ::: :::: ::: :::: ::: :::: ::: 92 5.2.1 Register-Partitioned versus Element-Partitioned Banks : :::: ::: :::: ::: 93 5.2.2 Wider Element Bank Ports :: :::: ::: :::: ::: :::: ::: :::: ::: 97 5.2.3 Element Bank Partitioning for a Vector Microprocessor : :::: ::: :::: ::: 98 5.2.4 VLSI Implementation of Vector Register Files ::: ::: :::: ::: :::: ::: 99 v 6 Vector Flag Processing 107 6.1 Forms of Vector Conditional Execution ::: ::: :::: ::: :::: ::: :::: ::: 107 6.2 Vector Flag Register Organization :: :::: ::: :::: ::: :::: ::: :::: ::: 111 6.3 Flag Combining Operations :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 111 6.4 Masked Flag Generation :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 112 6.5 Flag Load and Store :: :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 113 6.6 Flag to Scalar Unit Communication : :::: ::: :::: ::: :::: ::: :::: ::: 115 6.7 Speculative Vector Execution : ::: :::: ::: :::: ::: :::: ::: :::: ::: 116 6.7.1 Speculative Overshoot : ::: :::: ::: :::: ::: :::: ::: :::: ::: 118 6.8 Flag Priority Instructions :::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 121 6.9 Density-Time Implementations of Masked Vector Instructions :: :::: ::: :::: ::: 124 6.10 Flag Latency versus Energy Savings : :::: ::: :::: ::: :::: ::: :::: ::: 125 6.11 Vector Flag Implementation :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 125 7 Vector Arithmetic Units 129 7.1 Vector Arithmetic Pipelines :: ::: :::: ::: :::: ::: :::: ::: :::: ::: 129 7.2 Vector IEEE Floating-Point Support : :::: ::: :::: ::: :::: ::: :::: ::: 130 7.2.1 Denormalized Arithmetic :: :::: ::: :::: ::: :::: ::: :::: ::: 130 7.2.2 User Floating-Point Trap Handlers :: ::: :::: ::: :::: ::: :::: ::: 133 7.3 Variable Width Virtual Processors :: :::: ::: :::: ::: :::: ::: :::: ::: 133 7.3.1 Setting VP Width ::: ::: :::: ::: :::: ::: :::: ::: :::: ::: 135 7.3.2 Narrow VPs and