EXPLOITING CONCURRENCY IN A GENERAL-PURPOSE ONE-INSTRUCTION COMPUTER ARCHITECTURE

A thesis submitted to the University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences

2010

By Christopher Daniel Emmons
School of Computer Science

Contents

Abstract 14

Declaration 15

Copyright 16

Acknowledgements 17

The author 18

I Introduction and background 19

1 Introduction 20 1.1 Motivation ...... 20 1.2 Research objectives ...... 21 1.3 Research contributions ...... 23 1.4 Thesis structure ...... 24

2 General-purpose computer architecture 25 2.1 General-purpose computer requirements ...... 25 2.2 Challenges in computer architecture ...... 26 2.2.1 Design complexity and productivity gap ...... 26 2.2.2 Power consumption ...... 27 2.2.3 Processor and memory performance gap ...... 28 2.2.4 Cost of communication ...... 29 2.3 Research directions ...... 30 2.3.1 Exploiting parallelism ...... 30 2.3.2 Increasing cache sizes ...... 35

2.3.3 Designing at higher levels of abstraction ...... 35 2.3.4 Communication-centric architectures ...... 36 2.4 Conclusion ...... 37

3 The Fleet architecture 38 3.1 Architecture overview ...... 38 3.2 Instruction set architecture ...... 39 3.2.1 Native hardware data types ...... 39 3.2.2 One instruction ...... 40 3.2.3 Code bags ...... 43 3.2.4 A simple Fleet program ...... 44 3.3 Concurrency ...... 45 3.4 Hardware organization ...... 45 3.4.1 Instruction fetch and dispatch ...... 45 3.4.2 Switch fabric ...... 46 3.4.3 Ships ...... 46 3.5 Early findings ...... 46 3.6 Limitations ...... 47 3.7 Conclusion ...... 47

II The Armada architecture 49

4 The Armada architecture 50 4.1 Differences from Fleet ...... 50 4.2 Architecture enhancements ...... 53 4.2.1 Independent code bags ...... 53 4.2.2 Enhanced local register file ...... 54 4.2.3 Context synchronizers ...... 55 4.2.4 Flow caching ...... 55 4.3 Instruction set architecture ...... 56 4.3.1 Memory model ...... 56 4.3.2 One instruction ...... 56 4.3.3 Choosing the code bag type ...... 58 4.3.4 Fetching code bags ...... 60 4.3.5 Freeing Fleet cores ...... 66

4.3.6 Virtual pipelines ...... 66 4.3.7 Handling state in Ships ...... 68 4.3.8 Context synchronizers ...... 69 4.3.9 Flow caching ...... 70 4.3.10 Hardware reset behavior ...... 72 4.3.11 Event handling ...... 73 4.3.12 Fleet prototype core ISA ...... 73 4.4 Related work ...... 74 4.4.1 Transport-triggered architectures ...... 74 4.4.2 Dataflow machines ...... 74 4.4.3 WaveScalar architecture ...... 75 4.4.4 Independence architectures ...... 75 4.4.5 SCALP and Vortex asynchronous processors ...... 76 4.4.6 TRIPS architecture ...... 76 4.4.7 Fleet at Sun Labs and U.C. Berkeley ...... 77 4.5 Conclusion ...... 77

5 The Armada-1 78 5.1 Overview ...... 78 5.2 Memory subsystem ...... 78 5.3 Fleet cores ...... 79 5.3.1 Simultaneous multithreading ...... 79 5.3.2 Data size ...... 80 5.3.3 Instruction horn ...... 81 5.3.4 Instruction pool and reservation stations ...... 81 5.3.5 Switch fabric ...... 82 5.3.6 Ships ...... 83 5.4 Fetch and dispatch unit ...... 84 5.4.1 Code bag fetch unit ...... 84 5.4.2 Instruction dispatch unit ...... 87 5.5 Context synchronizers ...... 88 5.6 Communication ...... 88 5.7 Conclusion ...... 88

6 Evaluation of Armada-1 90 6.1 ArmadaSim ...... 90

6.1.1 Structural configuration ...... 90 6.1.2 Timing model ...... 91 6.1.3 Statistics gathering ...... 91 6.1.4 Fleet state checker ...... 91 6.1.5 Value-change dump output ...... 92 6.1.6 Verification ...... 92 6.1.7 Design complexity ...... 93 6.1.8 Limitations ...... 94 6.2 Mandelbrot ...... 94 6.2.1 Computation ...... 94 6.2.2 Parallelism ...... 95 6.2.3 Mapping to Armada ...... 98 6.3 Conclusion ...... 101

7 Microarchitecture results and discussion 102 7.1 Performance overview ...... 102 7.2 Exploiting concurrency ...... 106 7.2.1 Instruction-level concurrency ...... 106 7.2.2 Thread-level concurrency ...... 108 7.3 Code density ...... 108 7.4 Code bag organization ...... 109 7.5 Asynchronous system performance ...... 110 7.6 Resource multiplexing ...... 111 7.7 Conclusion ...... 111

III An Armada 113

8 Armada Procedure Call Standard 114 8.1 Overview ...... 114 8.2 Register file ...... 115 8.2.1 Behavior ...... 115 8.2.2 Listing ...... 115 8.3 Stack ...... 116 8.3.1 Allocation policy ...... 116 8.3.2 Choosing the allocation policy ...... 117

8.4 Data placement ...... 118 8.4.1 Argument passing ...... 118 8.4.2 Automatic variables ...... 120 8.4.3 Return values ...... 121 8.5 Procedure calls ...... 121 8.5.1 Serial procedure calls ...... 121 8.5.2 Concurrent procedure calls ...... 122 8.6 Conclusion ...... 126

9 Compiling for Armada 127 9.1 Modern optimizing compilers ...... 127 9.2 LLVM IR ...... 128 9.2.1 Basic blocks ...... 129 9.2.2 Instructions ...... 129 9.3 Armada code generator ...... 132 9.3.1 Data types ...... 132 9.3.2 Pre-processing ...... 133 9.3.3 Code bag formation ...... 135 9.3.4 Ship type and instruction selection ...... 137 9.3.5 Instruction merging ...... 137 9.3.6 Instruction splitting ...... 138 9.3.7 Stray token cleanup ...... 138 9.3.8 Ship allocation ...... 140 9.4 Assembler ...... 141 9.5 Limitations ...... 141 9.6 Conclusion ...... 142

10 ArmadaSim debugger 143 10.1 Trace replay ...... 143 10.2 Message view ...... 143 10.3 Source code view ...... 144 10.4 State tracking view ...... 145 10.5 Limitations ...... 147 10.6 Conclusion ...... 148

11 Testing the Armada compiler 149 11.1 Objectives ...... 149 11.2 Method and metrics ...... 149 11.2.1 Code structure ...... 150 11.2.2 Code density ...... 150 11.2.3 Instructions fetched ...... 150 11.2.4 Instructions executed ...... 151 11.2.5 Instruction traffic ...... 151 11.2.6 Concurrency ...... 151 11.3 Results ...... 152 11.3.1 Performance ...... 152 11.3.2 Code structure ...... 153 11.3.3 Code density ...... 153 11.3.4 Instructions fetched and instruction traffic ...... 155 11.3.5 Instructions executed ...... 155 11.3.6 Concurrency ...... 156 11.4 Conclusion ...... 157

IV Discussion and conclusion 158

12 Discussion and conclusion 159 12.1 Fleet changes ...... 160 12.2 Architecture enhancements ...... 160 12.2.1 Proposed extensions and motivating factors ...... 160 12.2.2 Conclusions ...... 161 12.3 Compiler and tools ...... 162 12.4 Limitations of described work ...... 163 12.4.1 Few benchmarks ...... 163 12.4.2 Performance comparison to modern architectures . . . . . 163 12.4.3 Compiler maturity ...... 163 12.5 Areas for future work ...... 163 12.5.1 Memory subsystem ...... 164 12.5.2 Higher granularity Ship functionality ...... 164 12.5.3 Armada as an application-specific processor architecture . 164 12.5.4 Flow-control fabric ...... 164

12.5.5 Flow caching ...... 165 12.5.6 Compiler enhancements ...... 165 12.6 Final remarks ...... 165

Bibliography 166

A Ships in Armada-1 173 A.1 Register ...... 173 A.2 IntAdd64 ...... 173 A.3 IntDiv64 ...... 174 A.4 IntMul64 ...... 174 A.5 FPAdd64 ...... 174 A.6 FPDiv64 ...... 175 A.7 FPMul64 ...... 175 A.8 BitShift ...... 175 A.9 BitOp ...... 176 A.10 IntToFP ...... 176 A.11 SExt ...... 177 A.12 ...... 177 A.13 Stride ...... 177 A.14 Cmp ...... 178 A.15 Join ...... 179 A.16 Selector ...... 179 A.17 Toggle ...... 179 A.18 Fetch ...... 180 A.19 Memory ...... 180 A.20 ContextSynchronizer ...... 181 A.21 Stack ...... 181

B Hardware stack allocation support 183 B.1 Overview ...... 183 B.2 Priming the ring ...... 184 B.3 Stack allocation ...... 185 B.4 Stack release ...... 185 B.5 Testing ...... 185

C Code listings 186 C.1 Mandelbrot, hand-coded ...... 186 C.2 Mandelbrot, LLVM-generated ...... 189

D LLVM instruction mapping to Armada 196 D.1 Terminator class ...... 196 D.1.1 ret ...... 196 D.1.2 br ...... 196 D.1.3 switch ...... 197 D.1.4 invoke ...... 197 D.1.5 unwind ...... 197 D.1.6 unreachable ...... 197 D.2 Binary class ...... 197 D.2.1 add ...... 197 D.2.2 sub ...... 197 D.2.3 mul ...... 198 D.2.4 udiv ...... 198 D.2.5 sdiv ...... 198 D.2.6 fdiv ...... 198 D.2.7 urem ...... 199 D.2.8 srem ...... 199 D.2.9 frem ...... 199 D.3 Bitwise binary class ...... 199 D.3.1 shl ...... 199 D.3.2 lshr ...... 200 D.3.3 ashr ...... 200 D.3.4 and ...... 200 D.3.5 or ...... 200 D.3.6 xor ...... 201 D.4 Vector class ...... 201 D.4.1 extractelement ...... 201 D.4.2 insertelement ...... 201 D.4.3 shufflevector ...... 201 D.5 Memory class ...... 201 D.5.1 malloc ...... 201 D.5.2 free ...... 201

D.5.3 alloca ...... 202 D.5.4 load ...... 202 D.5.5 store ...... 202 D.5.6 getelementptr ...... 202 D.6 Conversion class ...... 203 D.6.1 trunc .. to ...... 203 D.6.2 sext .. to ...... 203 D.6.3 zext .. to ...... 203 D.6.4 fptrunc .. to ...... 203 D.6.5 fpext .. to ...... 203 D.6.6 fptoui .. to ...... 203 D.6.7 fptosi .. to ...... 203 D.6.8 uitofp .. to ...... 204 D.6.9 sitofp .. to ...... 204 D.6.10 ptrtoint .. to ...... 204 D.6.11 inttoptr .. to ...... 204 D.6.12 bitcast .. to ...... 204 D.7 Miscellaneous class ...... 204 D.7.1 icmp ...... 204 D.7.2 fcmp ...... 205 D.7.3 phi ...... 205 D.7.4 select ...... 205 D.7.5 call ...... 205 D.7.6 va_arg ...... 205 D.7.7 getresult ...... 205

List of Tables

3.1 Fleet native data types ...... 39 3.2 Comparison of Fleet and RISC instructions ...... 40 3.3 Move instruction syntax and graphical representation ...... 41

4.1 Code bag comparison ...... 59 4.2 Code bag descriptor syntax and representation ...... 61

7.1 Code size and instruction fetch traffic comparison ...... 109

8.1 Armada-1 register file listing ...... 116

9.1 LLVM instruction classes ...... 130

11.1 Hand-coded versus compiled Mandelbrot ...... 152 11.2 Mandelbrot comparison on various ISA’s ...... 155

A.1 Cmp Ship in op input port values ...... 178

List of Figures

2.1 Design complexity and productivity gap...... 27 2.2 Memory and CPU performance gap...... 29 2.3 Cache size versus performance in Itanium2 ...... 35

3.1 Sum of squares Fleet code snippet...... 44 3.2 Fleet architecture organization ...... 45

4.1 Gating fetch of a successor code bag on token cleanup ...... 52 4.2 Move instruction encoding overview...... 57 4.3 Code bag descriptor format ...... 60 4.4 Register forwarding example ...... 64 4.5 Painting a horizontal line example ...... 65 4.6 Virtual pipeline example ...... 67 4.7 Color conversion cacheable flow...... 72 4.8 Color conversion cleanup code...... 73

5.1 Block diagram of a dual-core Armada-1 processor ...... 79 5.2 Instruction horn and pool ...... 81 5.3 Switch fabric ...... 83 5.4 Fetch and dispatch unit ...... 85

6.1 Mandelbrot benchmark inner loops in Java ...... 96 6.2 Mandelbrot inner loop operations ...... 97 6.3 Mandelbrot code bags ...... 98 6.4 Mandelbrot code bag fetch scheme ...... 99 6.5 InnerLoop control flow ...... 100

7.1 Overview of Mandelbrot performance on Armada-1 ...... 103 7.2 Resource utilization of performance-limiting units ...... 104

7.3 Mandelbrot performance on optimized Armada-1 ...... 105 7.4 Concurrently active Ships in Armada-1, Mandelbrot ...... 107 7.5 Effect of code bag instruction arrangement on performance . . . . 110

8.1 Argument passing example with native and aggregate types. . . . 119 8.2 Stack layout for a serial procedure call ...... 123 8.3 Stack layout for a concurrent procedure call ...... 125

9.1 Modern optimizing compiler architecture ...... 128 9.2 Control flow graph of a simple LLVM loop ...... 131 9.3 LLVM add and store instructions...... 131 9.4 Translation of an LLVM branch to Armada move instructions . . 137

10.1 ArmadaSim Debugger messages view ...... 144 10.2 Debugging multiple threads ...... 145 10.3 Stepping through a program ...... 146 10.4 Tracking Fleet state ...... 147

11.1 Hand-coded and compiled Mandelbrot ...... 154 11.2 ILC comparison between hand-coded and compiled Mandelbrot . 156

B.1 Hardware stack allocation scheme ...... 184

Abstract

The University of Manchester
Christopher Daniel Emmons
Doctor of Philosophy
Exploiting Concurrency in a General-Purpose One-Instruction Computer Architecture
December 2009

Computer performance is suffering greatly from diminishing returns as the increasing cost of implementing complex hardware optimizations and of increasing clock frequency no longer yields the gains in computational ability and power efficiency consumers demand. Notable products including a generation of Pentium 4 processors have been cancelled as a result. This sudden hiccup in an historically predictable performance road map has inspired research and industrial communities to investigate architectures, some rather unorthodox, that complete work more quickly and more efficiently.

One such computer architecture under development, Fleet, exposes fine-grain instruction level concurrency, addresses the growing costs of on-chip communications, and promotes simplicity in the underlying hardware design. This one-instruction computer transports data using simple move operations. The globally-asynchronous architecture promotes high modularity allowing specialized configurations of the architecture to be generated quickly with low hardware and software complexity.

The Armada architecture presented in this thesis expands on Fleet by introducing constructs that exploit thread-level concurrency. The proposals herein aim to increase the performance efficiency of Fleet and other communication-centric architectures. Trade-offs between software and hardware complexity and between the static and dynamic division of labor are investigated through the implementation and study of an Armada microarchitecture and an Armada compiler created for this research. This thesis explores the merits and pitfalls of this unique architecture as the basis for general-purpose computers for the future.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns any copyright in it (the “Copyright”) and s/he has given The University of Manchester the right to use such Copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this thesis, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trade marks and any and all other intellectual property rights except for the Copyright (the “Intellectual Property Rights”) and any reproductions of copyright works, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this thesis, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of School of Computer Science (or the Vice-President).

Acknowledgements

I would like to thank my supervisor Prof. Steve Furber for providing me with an opportunity to study in Manchester, for being an ever-insightful sounding board, and for imparting expert direction and advice throughout the course of my research. Ivan Sutherland clearly motivated this work with his wacky Fleet architecture. I would like to thank him, Igor Benko, Wesley Clark, and Sun Microsystems for supporting my efforts with an internship in California and with good discourse that helped me find focus and refine my ideas. Thank you family and friends who have provided the best of support and encouragement while being a steady reminder that there is a world outside of the Ph.D. – which can be quite easy to forget. Finally, this work would have certainly been left uncompleted without the surprise reemergence of an old friend. Thank you, Erin, for knowing exactly what I needed when I needed it and for pushing me through the final months.

The author

Christopher Emmons obtained a Bachelor of Science degree in Computer Sciences from the University of Texas at Austin in 2003. While studying at Texas, he worked for Intel Corporation’s Cellular and Handheld Group on XScale processors as an Applications Engineer intern. After graduating, he assumed Performance Validation Engineer and, subsequently, System-on-Chip Microarchitect roles at Intel. Leaving Intel in September 2005, Christopher began this Ph.D. work at the University of Manchester. He held two internship positions during the course of his research at ARM Holdings in Austin, Texas, and at Sun Labs in Menlo Park, California. He is currently working for ARM in Austin, Texas, in the Research and Development Group.

Part I

Introduction and background

Chapter 1

Introduction

In 2004, the means for increasing computer processor performance for over a decade reached an inflection point. Facing power inefficiency, difficulties designing increasingly complex hardware, growing costs of on-chip communication, and other barriers, the computer industry shifted focus to new ways of achieving more computing power more efficiently. This thesis presents one such computer architecture that attempts to overcome the contemporary performance, power, and design constraints that are inhibiting the progress of more typical computer processor designs. This architecture, called Armada, aims to create a power-efficient and reconfigurable computer useful for general purpose computing.

1.1 Motivation

For the past two decades, we have enjoyed large and predictable increases in computer performance. This improvement has largely been sustained by increasing utilization of hardware components as the processor operating frequency is increased. Intel unintentionally affirmed that this trend is at an end by its highly publicized cancellation of a line of 4+ gigahertz processors in 2004 due to difficulty controlling the heat generated as an unwelcome byproduct of such fast clock rates.

Designers are also having an increasingly difficult time putting additional transistors on a chip, made available each year by improvements in fabrication technology, to work. The density of transistors on a single chip roughly doubles every two years. This trend has proven very consistent over time and is commonly referred to as Moore's Law^1 [Moo65]. However, a more recent observation indicates that these additional resources are being used less and less efficiently; growth in performance of new processors is roughly the square root of their growth in density on any given chip technology [Gel01]. The International Technology Roadmap for Semiconductors (ITRS) views this productivity gap as a great challenge for processor development over the coming years [ITR05].

^1 Gordon Moore's original observation was that the number of transistors on a die doubled every 24 months, and he projected that the trend would continue for at least the next ten years. That prediction was made nearly 45 years ago.

Additionally, logic is becoming less expensive in terms of area and power consumption while communication is increasingly costly. The majority of processors have been designed to reuse once-expensive logic as efficiently as possible. This reuse often requires moving large amounts of data through a minimal set of computation units. Although these computational units reside within close proximity to each other, controls spanning across the chip are greatly affected by the increasing wire delays that result from shrinking processor feature sizes [HMH01, AHKB00].

1.2 Research objectives

The overall goal of the research undertaken is to demonstrate a novel computer architecture with high power efficiency that addresses the modern challenges in processor design and manufacture. Computer architectures provide the framework on which processors are built and programmed. Architectures describe the programmer's view of the hardware – the instruction set architecture (ISA). They also describe the microarchitecture or hardware organization of the processor: what high-level blocks the hardware is composed of, what purposes they serve, and how they are interfaced together. Finally, computer architecture also encompasses the low-level implementation of the microarchitecture aptly referred to as the hardware implementation. Two or more implementations of the same microarchitecture may differ in characteristics such as size, performance, and clock frequency though they will share the same overall structure. This thesis will primarily focus on the ISA and microarchitectural aspects of computer architecture.

The soundness or goodness of a computer is largely judged on three factors: performance, power consumption, and cost. The computer must minimally provide enough functionality and performance to meet the demands of the target application and, preferably, leave room to spare for increasingly demanding tasks of the future. The performance provided must not come at the expense of unreasonable power consumption. Mobile devices often have the most strict power requirements while consumers of high performance computer servers increasingly consider the cost of ownership of the machine before making their purchases. This cost of ownership includes the energy requirements of the server and the external cooling required to keep it functioning reliably. Finally, the cost of the finished computer must fit within the budget of the target products' requirements and be price-competitive with alternative designs for the computer to be successful.

The Armada architecture presented in this thesis is based on a one-instruction computer architecture called Fleet currently under development by Sun Labs and a handful of students from the University of California, Berkeley. Fleet addresses new challenges in processor design and fabrication and also provides a unique instruction set architecture that frees programmers and the hardware from common sequential programming paradigms. Specifically, it supports the concurrent execution of many instructions and supplies software with a clean method of representing this concurrency. The power of this architecture appears to stem from its unique representation of programs and the implications this has on the partitioning of complexity both between hardware and software and between static compile-time and dynamic run-time responsibilities. One goal of this research is to understand how this new representation affects hardware and software synergy compared to other prominent architecture designs.

The Armada architecture extends Fleet by supporting a coarser granularity of concurrency. Whereas Fleet provides the software representation of instruction level concurrency and the capability required to execute many instructions at one time, Armada additionally provides the representation of task or thread level concurrency and the capability required to execute many contexts at the same time. The proposed microarchitecture contains multiple Fleet cores on a single chip. Furthermore, each Fleet core is given new capabilities to run multiple tasks at the same time. These hardware enhancements are supported by a small but significant addition to the ISA that enables the spawning and management of these tasks.

To explore the benefits and disadvantages of this representation, a model of an Armada microarchitecture implementation and an Armada imperative language compiler were created enabling insight into both hardware and software aspects of the architecture. The hope is to use the simple concurrency constructs in the ISA and the modular, decentralized microarchitecture to cheaply exploit concurrency and to create a scalable general purpose architecture for the future.

1.3 Research contributions

The following contributions were produced from this research:

• Method of representing and spawning threads in Fleet-based architectures

• Method of graphically representing Fleet and Armada architecture assembly language programs

• Method of hardware virtualization for one-instruction, asynchronous transport architectures that:

– enables simultaneous multithreading

– enables software performance to scale with the increasing number of functional units that may be placed on a chip each year

– enables increased resource utilization of processing elements in implementations with large numbers of such elements to decrease the impact of rising static leakage power costs and to improve throughput

• An Armada microarchitecture: a multicore, globally asynchronous, simultaneous multithreading Fleet-based microarchitecture with thread level concurrency extensions

• Timing approximate simulation model of an Armada microarchitecture

• An assembler for Armada programs

• A trace-replay debugger for debugging highly concurrent Armada assembly language programs

• An Armada code generator for the Low Level Virtual Machine compiler infrastructure supporting high level imperative languages

1.4 Thesis structure

The thesis is divided into four parts. The remainder of part one describes the current challenges in computer architecture introduced here in more detail. Various types of general purpose computer architectures explored in the past and current research directions are discussed. Finally, an overview of the Fleet architecture from which Armada is largely derived is given.

Part two discusses proposed enhancements to the Fleet architecture that form the Armada architecture. A microarchitecture implementation of Armada is presented, and a low-level simulator for the design is described. A discussion of how a benchmark was mapped to the architecture and some analysis of the microarchitecture's performance running this benchmark is provided.

Part three describes a procedure call standard that defines how software blocks are written for Armada so that they are interoperable with other software. A compiler code generator is described that transforms programs written in high-level imperative languages such as C++ and Java into instructions for the Armada architecture. Code generated for Armada is compared to code generated for other more typical architectures. Additionally, a tool for debugging the highly-concurrent architecture is described.

The final part of the report discusses the results, considers further work, and provides concluding remarks on this research.

Chapter 2

General-purpose computer architecture

This chapter provides an introduction to the area of general purpose computer architecture with a focus on major challenges of today and tomorrow. Some methods and research directions taken to confront these challenges are discussed. Exploiting concurrency is given special focus.

2.1 General-purpose computer requirements

Computer processors have widely varying applications from controlling kitchen appliances and children's toys to handling millions of secure transactions per minute in online store servers. Due to the enormous number of applications processors have, the design space for architectures is also large. This research focuses on the development and analysis of a new architecture to design general purpose computer processors.

General purpose processors may be found in powerful computer servers, home desktop and notebook computers, and in hand-held devices. Typical applications include gaming, media playback and editing, web browsing, data serving, and scientific workload processing. Successful processors will balance price-performance carefully; thus, semiconductor companies produce different flavors of general purpose processors to better meet the needs of the various target segments. For example, although one ISA may be used for all segments, a differentiation in microarchitecture for processors used in high-end desktop PC's and servers may be made from those processors used in mobile, power-constrained devices. Implementation differences such as clock frequency, chip fabrication technology, or cache sizes may be used to target more specific demands of applications and to satisfy a variety of customers with varying budgets while maximizing profits for the processor manufacturer.

General purpose architectures generally have long lifetimes because they serve as the rough blueprints for multiple generations of processors. Pre-compiled programs are expected to retain binary compatibility with these different processors. Therefore, programs should ideally benefit from any performance improvements made between each generation without having to recompile them.

An increasing number of applications in the traditionally general purpose computing design space have common and demanding tasks that they are expected to perform efficiently. For example, high-resolution video capture and compression on a desktop PC requires fast video encoding of a large amount of raw image data. Cryptography algorithms are used to securely transfer data and to enable digital rights management of protected media. The specialization of processors for target applications such as these is enabled by reconfigurable architectures and by the addition of hardware accelerators for common and demanding tasks. As transistor budgets grow, processor manufacturers are able to integrate such reconfigurable logic and specialized accelerators into traditionally general purpose processors.

2.2 Challenges in computer architecture

Optimizing processors for performance, power, and cost requires a carefully balanced architecture which addresses numerous challenges. The following sections describe some of the contemporary challenges architects face.

2.2.1 Design complexity and productivity gap

Improving chip process technologies allow more transistors to be crammed onto processors with each generation. As foretold by Moore's Law, billion transistor chips have arrived; Intel's Itanium 2 processor in production in 2006, codenamed “Montecito,” encompasses 1.72 billion transistors. Sun Microsystems' “Niagara” processor, shipping since late 2005, combines eight processing cores to tackle challenging applications. Managing such large numbers of resources effectively has become a formidable challenge.

Design complexity is a metric that refers to the number of devices a chip contains or to the amount of behavior that a design exhibits. As design complexity has steadily risen at a rate of approximately 58% per year, design productivity, the rate at which design engineers and their tools can build new processors, has only risen at a rate of 21% per year as shown in figure 2.1 [ITR05]. Therefore, more than a twofold increase in productivity in the coming years is required to make efficient use of increasing transistor density.

Figure 2.1: Design complexity and productivity gap. (from Sematech)

2.2.2 Power consumption

In recent years, maximizing processor performance has taken a back seat to limiting processor power consumption and increasing processing efficiency. High processor power consumption increases the cost of ownership of computers, limits the run time of mobile devices between charges, and stunts the growth potential of computer processors as the chips reach thermal package limits.

Power consumption in computer processors consists of a static component, called leakage power, and a dynamic component, called switching power (equation 2.1). Static power loss occurs when the circuit is in a steady-state. The switching power component captures the power loss due to the charging and discharging of the circuits as well as short-circuit power.^1 Short-circuit power is generally a small portion of dynamic power in well-designed circuits and is not discussed here.

^1 In complementary metal oxide semiconductor technology, or CMOS, short-circuit power loss occurs when both n and p transistors are simultaneously enabled during a transition. The power lost is dependent on the slope of the rising and falling transition edges.

\[ Power_{total} = Power_{static} + Power_{dynamic} = (I_{leakage} \cdot V_{dd}) + [\alpha \cdot C \cdot V_{dd}^{2} \cdot f_{clk}] \tag{2.1} \]

Although static leakage power has historically been small compared to dynamic switching power, the situation is changing as feature sizes decrease. The smallest feature size of a chip process technology refers to the smallest size of transistors, wires, or gaps between them that can be created on the chip die with that technology. As these sizes decrease, the capacitance of the system of transistors, $C$, is lowered. This reduced capacitance decreases the switching time of those transistors, or gate delay, resulting in faster logic performance accommodating faster processor clock frequencies ($f_{clk}$). The activity factor, $\alpha$ ($0 \leq \alpha \leq 1$), approximates the average switching activity of the circuit for each clock edge.^2

^2 The activity factor for a clocked flip-flop, for example, is 1/2 since it toggles at most once per clock cycle and, thus, toggles once for every two clock edges. Techniques like clock-gating which decrease the number of logic transitions aim to keep this activity factor low.

The supply voltage, $V_{dd}$, is lowered to reduce interference with the ever-closer neighboring components and to meet thermal requirements. Lowering $V_{dd}$ greatly reduces dynamic power consumption since the dynamic power is proportional to the square of this supply voltage. However, lowering the supply voltage in turn often requires a lowering of the threshold voltage, the voltage level at which transistors switch, to maintain fast clock rates. Lowering the threshold voltage and moving the threshold closer to ground causes a disproportionate increase in the static leakage current, $I_{leakage}$, and thus an increase in static power consumption [KAB+03].
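To make equation 2.1 concrete, the following C++ sketch evaluates the static and dynamic components for one set of inputs. The numeric values are illustrative assumptions, not measurements from any real processor.

    #include <cstdio>

    // Equation 2.1 evaluated numerically. All component values below are
    // illustrative assumptions.
    double total_power(double i_leakage,  // static leakage current (A)
                       double vdd,        // supply voltage (V)
                       double alpha,      // activity factor, 0 <= alpha <= 1
                       double c,          // switched capacitance (F)
                       double f_clk)      // clock frequency (Hz)
    {
        const double p_static  = i_leakage * vdd;
        const double p_dynamic = alpha * c * vdd * vdd * f_clk;  // alpha*C*Vdd^2*f_clk
        return p_static + p_dynamic;
    }

    int main() {
        // Illustrative inputs: 10 A of leakage, 1.1 V supply, alpha = 0.5 (a
        // clocked flip-flop), 100 nF of switched capacitance, 2 GHz clock.
        std::printf("%.1f W\n", total_power(10.0, 1.1, 0.5, 100e-9, 2e9));  // 132.0 W
        return 0;
    }

Note how the quadratic dependence on $V_{dd}$ dominates the dynamic term: halving the supply voltage in this sketch would cut the dynamic component to a quarter.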

2.2.3 Processor and memory performance gap

The computer processor is only one of the factors in computer system performance. Access to memory for instructions and data is often the bottleneck in today's systems. Computer processor performance has increased at a rate of roughly 55% per year while memory performance has increased at a rate of less than 10% per year as shown in figure 2.2 [HP03].


Figure 2.2: Memory and CPU performance gap. (Hennessy and Patterson)

Memory caches are used to store frequently accessed data closer to the processor core for faster lookup to help alleviate this problem. Small, local, simple caches can be accessed with very low latency and the data used very quickly by the processing core. As cache sizes grow, the access latency increases, and the benefits of a larger cache may not compensate for this increased lookup time. For this reason, architectures often have a hierarchy of caches with small, fast, level one (L1) caches close to the processor supported by one or more increasingly slower, larger caches (L2, L3, and so on). Though slower than an L1 cache, these larger caches still provide much faster access to data than off-chip memory with less power consumption. Much research has been conducted on the trade-offs of different cache schemes [Jou90, FPT94].
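The trade-off between a larger, slower L1 and a smaller, faster one can be quantified with the standard average memory access time (AMAT) model. This model is not developed in the text above, and the latencies and miss rates in the C++ sketch below are illustrative assumptions, in clock cycles.

    #include <cstdio>

    // AMAT for a two-level hierarchy: the cost of an L1 access plus the
    // probability-weighted cost of going to L2 and, on an L2 miss, to memory.
    double amat(double l1_hit, double l1_miss_rate,
                double l2_hit, double l2_miss_rate, double mem_latency) {
        const double l2_and_below = l2_hit + l2_miss_rate * mem_latency;
        return l1_hit + l1_miss_rate * l2_and_below;
    }

    int main() {
        // Small, fast L1: 2-cycle hit, 10% misses; larger, slower L1: 4 cycles, 5%.
        std::printf("small, fast L1: %.2f cycles\n", amat(2.0, 0.10, 12.0, 0.20, 200.0));
        std::printf("large, slow L1: %.2f cycles\n", amat(4.0, 0.05, 12.0, 0.20, 200.0));
        return 0;
    }

With these particular numbers the larger L1 wins (6.6 versus 7.2 cycles), but the balance flips as hit latency grows faster than the miss rate falls, which is exactly the tension the paragraph above describes.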

2.2.4 Cost of communication

In early computers, state-holding logic took the form of large and expensive vacuum tubes. Communication was handled by small and relatively cheap wires. Today's trends show that logic in the modern form of transistors is cheap and continually getting cheaper; communication costs, however, are growing in significance. As mentioned in section 2.2.2, wires and transistors on-chip are becoming smaller. Although smaller feature sizes result in decreases in gate delay, the same is not necessarily true for wire delay. Thinner wires with some material resistivity, $\rho$, have increased resistance per unit length, $R_w/l$, due to lower cross-sectional area, $A$ (equation 2.2). Wire delay, $\tau$, is a product of both this resistance and of wire capacitance (equation 2.3). The capacitance is not decreasing proportionally to the resistance; even reducing the length, width, and height of a wire by some common factor will not necessarily change the wire delay significantly since one component of the wire capacitance, fringing capacitance, does not scale with feature size. Therefore, decreasing feature sizes are resulting in an increase in the ratio of wire delay to gate delay and ultimately in communication-bound performance.

\[ \frac{R_w}{l} = \frac{\rho}{A} \tag{2.2} \]

\[ \tau = R_w C_w \tag{2.3} \]
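Equations 2.2 and 2.3 can be evaluated directly, as in the C++ sketch below. The resistivity, geometry, and capacitance values are illustrative assumptions chosen only to show the order of magnitude involved.

    #include <cstdio>

    // Shrinking the cross-section raises resistance per unit length; since
    // fringing capacitance does not scale with feature size, the RC delay of
    // the wire grows relative to gate delay.
    double wire_delay(double rho,   // resistivity (ohm * m)
                      double area,  // cross-sectional area A (m^2)
                      double len,   // wire length l (m)
                      double cw)    // total wire capacitance Cw (F)
    {
        const double rw = rho * len / area;  // Rw = rho * l / A  (equation 2.2)
        return rw * cw;                      // tau = Rw * Cw     (equation 2.3)
    }

    int main() {
        // Illustrative copper wire: rho = 1.7e-8 ohm*m, 100 nm x 200 nm cross
        // section, 1 mm long, 200 fF of total capacitance.
        std::printf("tau = %.3g s\n",
                    wire_delay(1.7e-8, 100e-9 * 200e-9, 1e-3, 200e-15));  // ~1.7e-10 s
        return 0;
    }

Halving both cross-sectional dimensions in this sketch quadruples $R_w$, so unless $C_w$ falls by the same factor the delay of a fixed-length wire rises.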

2.3 Research directions

Computer architects and engineers have responded to the challenges of designing general purpose computer processors in many different ways. Some of the most successful and interesting approaches are described here.

2.3.1 Exploiting parallelism

Parallelism in computer architectures is the ability for multiple actions to occur simultaneously. Parallelism is often categorized into the three types discussed here.

Data level parallelism (DLP)

General purpose architectures often support data-level parallelism (DLP), the parallelism found in the application of identical operations across many different data elements [HGLS86]. Some scientific applications and many multimedia applications exhibit large levels of DLP. Vector processors such as Cray Research's Cray-1 and Cray-2 use single instructions to operate on vectors of data elements at a time. Vector machines continue to be applied to scientific applications with high DLP, and research projects such as VIRAM show promise in applying vector processing to embedded markets [KP02, BG04]. Other common approaches for exploiting DLP are single-instruction multiple-data (SIMD) architecture extensions. SIMD instructions allow programmers to specify a single operation to apply to multiple, distinct data elements. Unlike vector processors, generally only a small number of elements that can fit in a single register at one time are operated on simultaneously. SIMD is fairly ubiquitous and has been supported by many commercial general-purpose processors since the mid-1990's in architecture extensions such as the Apple-IBM-Motorola alliance's AltiVec^3; Digital's MAX; Intel's MMX, SSE, and WMMX; Sun's VIS; AMD's 3DNOW; MIPS's MDMX; and ARM's NEON.

^3 Also known as Velocity Engine and VMX depending on which company is referring to the extensions.
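The loop below is a minimal C++ illustration of the DLP pattern just described: the same multiply-add is applied independently to every element, so a vectorizing compiler is free to map several iterations onto one SIMD instruction without any change to the source. The function name and array shapes are illustrative.

    #include <cstddef>

    // Identical operation across many data elements: the defining shape of DLP.
    void scale_and_add(float a, const float* x, const float* y,
                       float* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a * x[i] + y[i];   // element i is independent of element j
    }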

Instruction level parallelism (ILP)

Many processors today also support instruction-level parallelism (ILP) by taking advantage of resource and data independence between instructions and executing multiple operations at one time. Superscalar processors (or superscalars) are capable of fetching and executing more than one instruction simultaneously and thus exploit some level of ILP. Statically-scheduled superscalars may execute several instructions in program order if there are no hazards between the instructions. These superscalars stop executing later instructions when any hazard is first encountered which limits ILP. Another class of architectures, very long instruction word architectures (VLIW), contain multiple operations in a single instruction. The compiler attempts to resolve dependencies and places independent instructions into slots of the instruction word for parallel execution by the hardware. This approach reduces hardware complexity versus superscalars since the independencies are discovered at compile-time and are explicitly specified to the hardware. More comparisons between superscalars and VLIW processors will be given later in this section.

Dynamic scheduling allows processors to find independences that cannot be found at compile time or through static run-time methods. Out-of-order superscalars are capable of examining many instructions at once, determining independences among those instructions, and executing multiple operations simultaneously and potentially out of order. Out-of-order superscalars can look past instructions that cause hazards and execute one or more of the later instructions immediately. However, the hardware resources involved in finding ILP dynamically are large. Structures such as reorder buffers, rename register files, and instruction windows must grow in size and complexity with instruction issue width and the number of supported instructions in flight. Furthermore, dependency tracking logic grows quadratically in the number of instructions [HP03].

The limits of attainable ILP have been explored many times over with researchers drawing different conclusions [Wal91, Wal95, GG98]. The general consensus is that dynamic scheduling and employing speculative techniques of control and data value prediction are necessary to create large opportunities for exploiting ILP, but hardware overhead must be controlled. The latter proves difficult.

Thread level parallelism (TLP)

Some programming languages allow programmers to explicitly define threads of control in a program that may be executed concurrently. Additionally, multiprocessing operating systems may have several programs to execute, each having its own set of threads that may also be executed concurrently. The parallelism made available by executing multiple threads at a time is called thread-level parallelism (TLP). Hardware support for TLP typically comes in two forms, multiprocessing and multithreading.

Multiprocessing involves two or more processing cores either on a single chip or connected with an external interconnect. The most common type of single chip multiprocessor, or CMP, is called a symmetric multiprocessor, or SMP, which contains two or more identical cores connected to shared memory. SMP's are commercially available today on entry-level systems with the introduction of Intel Pentium D and AMD Athlon X2 processors. Commercial systems that interconnect multiple single core processors to create a multiprocessing system have been widely available for years.

Multithreading refers to hardware-supported execution of multiple program threads. Multithreaded processors hold the state of two or more threads at a time. These processors appear as multiple single processor cores to most operating systems though the execution resources in the hardware are actually shared among the threads. Sharing these resources rather than duplicating them as in CMP designs saves die area which helps to contain static leakage power and chip fabrication costs. Early approaches to multithreading executed a single thread at a time, switching between contexts on long latency operations like memory accesses or at regular, statically-defined intervals. These techniques helped to eliminate wasted cycles but did not improve utilization of idle hardware resources. Simultaneous multithreading (SMT) enables multiple threads to dynamically allocate and use available hardware resources concurrently providing opportunities for higher levels of resource utilization than CMP designs. Under high load, wide instruction issue SMT processors show similar performance to CMP's with several processing cores and smaller supported instruction widths [TEL98]; SMT's with more modest complexity and thus smaller issue widths and numbers of supported threads, like those in commercially available processors to date, see less improvement. Intel's SMT technology is marketed as Hyper-Threading or HT. SMT die area overhead in Pentium 4 HT processors supporting two threads is 5–6%, and performance gains are from 15–25% for a variety of common desktop applications [KM03]. Despite low die area overhead, design cost for the implementation was considerable due to supporting two logical contexts, changing micro-op prioritization schemes, and validating the permutations of the architecture's operating modes possible among the contexts [KM03].
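The following C++ sketch shows explicit thread-level parallelism of the kind described above: the program declares independent threads of control that a CMP or SMT processor may execute concurrently. The decomposition is illustrative and assumes nthreads is at least one.

    #include <numeric>
    #include <thread>
    #include <vector>

    // Each thread sums a disjoint slice of the input; the slices share no
    // data, so the threads can run on separate cores or hardware contexts.
    long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
        std::vector<long> partial(nthreads, 0);
        std::vector<std::thread> workers;
        const std::size_t chunk = data.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            const std::size_t lo = t * chunk;
            const std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
            workers.emplace_back([&data, &partial, lo, hi, t] {
                partial[t] = std::accumulate(data.begin() + lo,
                                             data.begin() + hi, 0L);
            });
        }
        for (std::thread& w : workers) w.join();   // wait for all threads
        return std::accumulate(partial.begin(), partial.end(), 0L);
    }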

Finding ILP in programs

The previous sections identified several types of parallelism, and many different approaches have been made to exploit it. Special attention is given to ILP here as the relationship between the hardware and software for making use of it is complex and requires many trade-offs. Architectures can be classified into three types based on hardware and software partitioning to find ILP: sequential, dependence, and independence [RF93].

As discussed, superscalar processors are given programs in the form of streams of instructions that are expected to execute one after another. ILP extraction from the stream must be performed in the hardware without any guarantees from software. However, a compiler may attempt to help hardware exploit ILP by applying any knowledge it has of the hardware implementation to schedule independent instructions next to one another in the code it generates. Such architectures are called sequential architectures. Of the three classes, hardware complexity for exploiting ILP is the greatest, hardware-software coupling is the loosest, and software's responsibility in finding ILP is the lowest.

In another class of architectures called dependence architectures, software communicates dependencies to the hardware directly. To execute instructions simultaneously, the hardware must still find independent operations, but the task is made easier than in sequential architectures due to the additional information the software provides. Data flow architectures are primary examples of dependence architectures. In data flow machines, an instruction typically contains a pointer to its successor instruction, the place to send the result of the current operation; this information expresses the dependence between the instructions. Dependence architectures have an easier task of finding parallelism than do sequential architectures, have loose hardware-software coupling, and rely on software to provide some information for exploiting ILP.

Finally, independence architectures may rely entirely on software to discover exploitable concurrency. VLIW architectures, for example, require software to fill slots in long instruction words with independent operations that may be executed in parallel. Generally, the compiler does most if not all of the scheduling work, and hardware in independence architectures can therefore be relatively simple and still exploit ILP. However, a hardware implementation may choose to search for additional independent operations at run time transparent to software like a superscalar does; this would of course result in an increase in hardware complexity. In independence architectures, hardware-software coupling is high as the compiler must often be aware of the hardware microarchitecture in order to determine instruction independences. Software and compilers for independence architectures are the most complex out of the three architecture classes presented here.

Independence architectures are becoming increasingly popular as a means to curb hardware complexity. VLIW independence architectures have been embraced in embedded markets as application-specific processors. TriMedia CPU64 VLIW processors can be found in set-top boxes and other embedded multimedia systems [vESV+99]. Explicitly parallel instruction computing architectures, or EPIC architectures, are a variant of VLIW developed by Hewlett Packard Labs that combine instruction independence discovery in software with traditionally superscalar approaches to dynamic exploitation of ILP [SR00]. The family of Itanium processors uses this architecture.

2.3.2 Increasing cache sizes

Today's high-performance processors are employing larger caches to help close the CPU and memory performance gap as shown in figure 2.3 [FH05]. Intel's "Montecito" Itanium 2 processor released in 2006 contains nearly 27MB of total cache memory on-die [MB05]. In addition to performance improving characteristics, caches have been optimized in ways to limit both dynamic and static power requirements thus providing power-efficient use of die area [FKM+02].

Figure 2.3: Cache size versus performance in Itanium2. (from Flynn)

2.3.3 Designing at higher levels of abstraction

A primary method of attacking the design complexity problem is to design architectures at higher levels of abstraction. Tools such as SystemC allow architects to develop high level models of architecture components and to explore the design space more quickly. SystemC-to-Verilog translators are becoming available to bridge the gap from the object oriented simulation environment to register transfer level (RTL) modeling and realization in hardware.

Another method of designing at a higher level of abstraction is the reuse of hardware and software blocks. SMP's, for example, copy multiple instances of identical cores into processors to exploit thread-level parallelism; the core is designed and validated once and reused many times. Tiled architectures are more extreme examples of this type of hardware reuse. Tiled architectures provide a framework for designing processors by copying a large number of logic and memory blocks in a regular fashion across the die. Raw architectures, for example, are built using multiple tiles containing a MIPS-style pipeline, a pipelined floating-point unit, several caches, and routers for communication with other tiles [TLM+02]. Despite exhibiting lower performance on sequential applications than traditional superscalar processors, some evaluations show a 2–9x improvement with Raw for applications with high ILP and a 10–100x improvement when highly parallel algorithms are carefully optimized for the architecture [TLM+04]. Unlike Raw architectures, Smart Memories have a heterogeneous mix of tiles that may be chosen for targeting specific applications and thus are also a type of reconfigurable architecture [MPJ+00]. The TRIPS architecture is a grid architecture composed of homogeneous but polymorphic resources [SNL+03]. Thus, unlike the Smart Memories approach, TRIPS uses identical processing elements and memories, but the behavior of these components can be dynamically reconfigured to enhance the performance of applications executing at the time by favoring ILP, TLP, or DLP at the software's discretion. These designs all reuse hardware blocks to minimize hardware complexity and development time.

2.3.4 Communication-centric architectures

In addition to tackling the problem of how to make use of growing transistor budgets, tiled architectures are also directly addressing the problem of increasing wire delays as a first order design constraint. This thesis refers to such architectures as communication-centric. The Raw and Smart Memories architectures both limit maximum tile sizes such that intra-tile communications require no more than one clock cycle [TLM+02, MPJ+00]. This requirement helps ensure that the tiled processors will scale with smaller feature sizes.

Some architectures place data movements directly in the hands of the compiler and programmer. MOVE architectures explicitly describe transports among functional units rather than the operations to perform [CM91]. The operations occur as side-effects of the transport. For example, if two data elements get sent to the two input ports of a dyadic multiplication functional unit, the values will be multiplied together. The result can then be moved directly from the multiplier to another functional unit; register files may be completely bypassed in favor of the shortest route to the successor operation. Forsell shows that MOVE architectures do well in reducing hardware complexity but do not compete with the performance of traditional general-purpose architectures [For03].

2.4 Conclusion

This chapter discussed some of the contemporary challenges in computer architecture and also some of the current approaches to address these issues. The next chapter describes the Fleet architecture designed to target these challenges. It is the foundation for the Armada architecture discussed in the remainder of the thesis.

Chapter 3

The Fleet architecture

The previous chapter discussed some of the problems today's computer architectures will face in the foreseeable future. This chapter introduces the Fleet architecture and discusses how it proposes to address those concerns. A description of the instruction set architecture is given followed by an overview of the hardware organization. Finally, some initial observations on working with Fleet and some of the architecture's constraints are discussed. Note that much of this material may be found in a research memo [Sut05]; however, that document may not be readily available. Thus, this chapter has been included to introduce the Fleet ideas developed by Sutherland, Benko, students from UC Berkeley, and others in the context of this research. Additionally, Fleet has continued to evolve in tandem with this research. The snapshot described here served as the basis of this study and does not reflect the latest updates to the architecture made by the other contributors.

3.1 Architecture overview

Computer architects have reacted to changing design and fabrication technology trends by proposing new architectures to accommodate and also benefit from these developments. Ivan Sutherland has proposed one such architecture called Fleet. Fleet:

• applies the increasing quantity of transistors available each year to integrating more functional units as opposed to complicated control logic

• simplifies hardware design by employing a very modular approach to the system architecture

• puts expensive communication in the hands of the compiler and programmer where optimizations can be made without complex hardware

The next sections describe Fleet's ISA and organization and present some results from initial testing. Fleet is a work in progress, and some behaviors and specifications have not been concretely established. These loose ends are noted in the chapter as the topics are discussed.

3.2 Instruction set architecture

The Fleet ISA differs from those of ubiquitous architectures in several ways. Here, Fleet hardware data typing support is discussed followed by a description of the one Fleet instruction, the move instruction.

3.2.1 Native hardware data types

Fleet natively supports several hardware data types (table 3.1). Data of any type may have an out-of-band, or OOB, value. OOB values indicate special error or termination conditions such as the end of an array or a memory parity error. Each native type supports at least one OOB value, last. In order to represent last, each data type must therefore have at least one additional bit. If multiple OOB values are supported, more bits are needed to represent the type. Different uses for OOB will be elaborated on in subsequent sections.

Type                 Description
token                a dataless event useful for sequencing
boolean              true or false
character            unsigned 16 bits
integer              32 bit signed value
long                 64 bit signed value
code bag descriptor  a reference to a bag of code
memory pointer       memory address (size currently unspecified)

Table 3.1: Fleet native data types. All types additionally support at least the OOB value "last" and thus require at least one additional bit to represent this (not shown).
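One plausible way to represent such a type in software is to pair each value with an explicit OOB flag, as in the C++ sketch below. The layout and names are illustrative assumptions; the architecture only requires that at least one extra bit be representable.

    #include <cstdint>

    // Fleet-style native datum: the ordinary value plus an out-of-band bit so
    // that "last" can travel alongside in-band values. Widths follow table 3.1.
    struct FleetInt {       // "integer": 32 bit signed value + OOB bit
        std::int32_t value;
        bool last;          // OOB marker, e.g. the end of an array
    };

    inline FleetInt fleet_last() { return FleetInt{0, true}; }
    inline bool is_oob(FleetInt d) { return d.last; }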

3.2.2 One instruction

Fleet instructions describe transports that the hardware should perform rather than operations. Reduced Instruction Set Computer (RISC) ISA's are some of the most common programming interfaces in today's processors. RISC instructions describe an operation to perform, the operands required for that operation, and where to write the result. Architectures that employ such an ISA are known as operation triggered architectures (OTA's). In contrast, Fleet instructions are more closely related to instructions characteristic of transport triggered architectures (TTA's). Instructions for TTA's describe a source location of data and one or more destination places to move that data to; the operation performed is a side-effect of the transport. Table 3.2 shows code for a simple dyadic addition operation in both generic RISC and equivalent Fleet instructions.

pseudo-code      RISC             Fleet
r0 := r1 + r2    add r0, r1, r2   r1 -> add.in1
                                  r2 -> add.in2
                                  add.out -> r0

Table 3.2: Comparison of Fleet and RISC instructions.

The general form of the Fleet move instruction takes one input from a source port and delivers it to one or more destinations.¹ From this template, several embellishments to the instruction have been suggested. One such addition is a mini-opcode that is context-specific to each destination. For example, a mini-opcode may instruct an arithmetic unit destination port to negate one datum on arrival, perhaps changing an addition operation to a subtraction operation. An assembly language syntax for other proposed variants of the instruction is given in table 3.3. As it is useful to describe Fleet programs graphically, the table also describes how move instructions are drawn in program diagrams throughout this thesis. These variations are described in the following sections.
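Before turning to those variants, a concrete illustration of the mini-opcode idea: the sketch below computes a subtraction using only an adder Ship. The neg mnemonic, standing in for a numeric mini-opcode encoding such as the (4) of table 3.3, is a hypothetical assignment, as no mini-opcode map has been specified.

    codebag subtract {
        r1 -> add.in1;          // minuend passes unmodified
        r2 -> add.in2(neg);     // hypothetical mini-op: negate on arrival
        add.out -> r0;          // r0 := r1 - r2
    }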

One-shot

One-shot instructions transport one item of data from a source location to one or more destinations.

¹The number of possible destinations per instruction has not been established, though Sutherland has recommended three as a starting point.

Description                                        Text syntax             Graphical form
One-shot                                           src -> dst              arrow shaft is -
Standing                                           src => dst              arrow shaft is =
Consume data at source                             src -> dst              no * before arrow
Copy data at source                                src *-> dst             * before arrow
Consume data at destination                        src -> dst              no * before dst
Reuse data at destination                          src -> *dst             * before dst
Pass a mini-op to a destination                    src -> dst(4)
Multiple destinations (the destination options     src -> *dst1, dst2
  may vary independently)

Examples
One-shot preserving data at source and             src *-> dst
  consuming data at destination
One-shot consuming data at source and              src -> *dst
  reusing data at destination
Multiple destinations, input consumed, dst1        src -> *dst1(3), dst2
  reuses data and accepts a mini-op, data
  consumed at dst2

Table 3.3: Move instruction syntax and graphical representation.

Standing

Standing instructions repeatedly move data from a source location to one or more destinations as data at the source becomes available. If multiple destinations are specified and one destination is blocked for a period of time, the other destinations may still receive transfers from the source. The data will queue up for the blocked destination; no data will ever be lost. A standing instruction breaks down when the source data being transferred is OOB; the instruction carries that OOB value then expires.

Consume input

The data is consumed at its source location by the move instruction and transported to one or more destinations.

Copy input

The data is copied from its source location by the move instruction, and the copy is transported to one or more destinations. The data remains valid at the source, and the next transport from this location will reuse the data.

Write once

The data is consumed at the destination.

Write persist

The data persists at the destination. This form of move is useful for supplying constants to destinations; the source data is communicated only once and is continuously reused.
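To make the write-persist form concrete, the following sketch scales a stream by a constant factor. The src and sink names stand for arbitrary producer and consumer Ships and are placeholders, not part of any specified Ship set.

    3 -> *mul.in1;          // the constant 3 persists at the multiplier
    src.out => mul.in0;     // standing: each arriving value pairs with 3
    mul.out => sink.in;     // stream of tripled values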

Fleet move instructions may specify direct transfers between functional units, allowing any register file in the system to be bypassed completely. This decentralization of data flow allows Fleet to use simpler, faster register files or perhaps to eliminate the use of a register file altogether.

Programmers can create virtual pipelines between functional units by issuing chains of standing instructions. Standing instructions bind a source to one or more destinations indefinitely. Data is repeatedly forwarded to the destination or destinations as soon as input data at the source is available and space to write this data at the destination is available. Standing instructions are broken when the data being carried is OOB. OOB values may be explicitly introduced by software or be produced as the output of a functional or storage unit based on specific conditions. For example, an arithmetic unit may produce an OOB value if any of its inputs are OOB or if an overflow or underflow condition occurs. A standing instruction that encounters OOB data will deliver this OOB value then expire. OOB status can be propagated in this way, breaking down lengthy virtual pipelines.

3.2.3 Code bags

In most ISA's, including RISC ISA's, programmers may assume instructions are executed in the order they are specified in the program. Fleet does not follow this model of inherently sequential processing. Concurrently executable move instructions are placed into units called code bags. The programmer or compiler ensures that transports in code bags may be executed in any order and, regardless of the execution order, will always produce the desired result. Thus, Fleet is a type of independence architecture. Recall from chapter 2 that independence architectures require software to specify which instructions are independent from each other and, therefore, which instructions may be executed concurrently. A particularly capable hardware implementation may issue all instructions in a code bag at one time.

In sequential architectures like RISC architectures, a program counter (PC) keeps track of which instruction should be executed next. Fleet does not have a PC. Instead, a code bag is responsible for fetching its successor bags, if any, by moving code bag descriptors defining those successors to a fetch unit. This operation will be described further in a following section.

3.2.4 A simple Fleet program

Figure 3.1 shows a code bag that produces the sum-of-squares for a range of numbers². A stride unit generates the integers 1–100. It is connected to both inputs of a multiplier via a single standing instruction. The multiplier will, therefore, generate the squares 1×1, 2×2, 3×3, ..., 100×100. The multiplier's output is connected to an accumulator, also via a standing instruction, that will compute the sum of all inputs received. The stride unit will produce a final last OOB value after producing the last valid count — 100 in this example. The multiplier will respond to last inputs by generating a final last value at its output. At this point, the standing instruction connecting the stride unit to the multiplier inputs expires. The last output from the multiplier is then carried to the accumulator. This connection will also expire once the last value is delivered. Finally, the accumulator is implemented to respond to a last input by producing the current accumulated amount as an output, the sum of squares in this example. Note that the instruction moving the accumulator output to the register may have been waiting for the accumulator to generate output for a long period of time. Additionally, the order of the instructions in this code bag, like every code bag, is of no consequence. The programmer should ensure that the output will be the same regardless of any particular hardware implementation's instruction execution order.

    codebag sumOfSquares {
        1   -> stride.start, stride.step;   // one-shot instruction to multiple destinations
        100 -> stride.stop, stride.next;    // last number to produce
        stride.out => mul.in0, mul.in1;     // send 1, 2, ... 100 to mul
        mul.out    => acc.in, stride.next;  // standing instruction
        acc.out    -> r0;                   // write result when count complete (OOB received)
    } sumOfSquares;

Figure 3.1: Fleet code that computes the sum-of-squares for the numbers 1–100. One-time instructions are specified by the single arrow operator, and standing instructions are specified by a double arrow. Sending a token to stride.next causes the next number in the sequence to appear at stride.out.

²If you follow the program and are concerned about left-over state when the program completes, good observation! Cleanup will be discussed in a later section.

3.3 Concurrency

As demonstrated by the ISA, Fleet requires the programmer to explicitly find instruction level concurrency (ILC). Programmers and compilers describe ILC to Fleet hardware by finding independences among instructions and by grouping the independent transports into code bags. Finding independences at compile-time allows hardware to exploit ILC easily at run-time with no processing overhead spent or hardware complexity required to search for candidate instructions.

3.4 Hardware organization

A Fleet processor contains three main types of components. The fetch and dispatch unit fetches code bags from memory and dispatches the instructions to another primary component, the switch fabric. The switch fabric, or interconnect, moves data from sources to destinations based on the instructions it receives. The sources and destinations are the output ports and input ports, respectively, of the third and final type of primary components, Ships. Ships are the functional units and storage units that may generate and receive data, usually performing some action on the inputs. Figure 3.2 shows the high level organization of a Fleet computer.

Figure 3.2: Fleet architecture organization.

3.4.1 Instruction fetch and dispatch

The fetch and dispatch unit receives code bag descriptors as an input and produces Fleet instructions loaded from memory as an output. The code bag descriptors are received from a special fetch Ship in the Fleet. The descriptors are then translated into the set of contiguous memory addresses that hold the instructions in the code bag. These instructions are fetched from memory and await dispatching to the switch fabric in an instruction pool.

3.4.2 Switch fabric

The switch fabric receives routing instructions from the fetch and dispatch unit and routes data amongst Ships as directed by those instructions. There are many possible interconnect implementations, such as a fully connected crossbar connecting every Ship output port to every Ship input port, a simple bus, or a heterogeneous mix of designs. As reflected in the ISA, the interconnect must hold instructions indefinitely and wait for data to appear at sources. Likewise, it must hold transported data indefinitely until the destination Ship accepts it.

3.4.3 Ships

Ships may be functional units such as adders and multipliers as well as sequencing units or register files. Many Ships receive data at one or more inputs and produce data at one or more outputs. However, not all Ships will contain both input and output ports. An example of a Ship that produces data but does not receive inputs is a random number generator. Similarly, a bit bucket receives inputs but does not generate any outputs. Ships may be implemented in numerous ways; as with the interconnect, how Ships accomplish their tasks is loosely defined. For example, Ships may be pipelined or not and be synchronous or asynchronous. As reflected in the ISA, Ships must only wait for inputs to arrive and wait for outputs to be picked up.

3.5 Early findings

Currently, few quantitative results have been gathered about the Fleet architecture. The Fleet instruction set architecture is unique, and it is taking some time for researchers to grasp the implications of it. A Java simulator of the Fleet architecture has been built by Sun Microsystems and students from U.C. Berkeley that has primarily been used to explore how programs might be written for Fleet. Most programs require specific Ships to be present in the system that exhibit behavior tailored to the task. Some program and hardware pairs include an accumulator of the Fibonacci sequence [Hol05], an eight-element bubble sort [Isa06], an implementation of Euclid's algorithm [Mey06a], and vector-matrix multiplication [Mey06b].

Typical problems encountered were congestion in the switch fabric and difficulties with programming the cleanup of orphan tokens left in the system after algorithm completion [Hol05][Mey06b]. Identification of some sources of congestion and some remedies are presented in [Hol05], but a general solution has not been found.

3.6 Limitations

Although a basic Fleet may tackle application-specific tasks well, the architecture is not well-suited for general purpose computing. First, the software's additional responsibilities of mapping transports to hardware ports require programs to know details about the microarchitecture. Specifically, software must be aware of the port addresses for every Ship to make full use of the available resources. Thus, any significant change to the hardware may result in changes required to the software for it to utilize the hardware efficiently. This coupling ultimately leads to either software incompatibility between generations of processors or the inability for legacy code to benefit from new resources that become available in the future. Secondly, perhaps the greatest inhibiting factor to realizing a general purpose Fleet processor is the inability to easily run multiple programs or multiple threads. The current architecture requires programs to share Ships cooperatively. Fleet programs must therefore yield control to other programs in order to multi-program, an archaic programming model. Additionally, the amount of parallelism that can be found statically in a program in the form of ILP has practical limits. TLP offers additional potential for concurrency of operations that Fleet is currently unable to efficiently capitalize on.

3.7 Conclusion

The basic Fleet architecture promotes modularity, flexibility, and concurrency, which confront many of the current challenges in computer architecture. Fleet is globally asynchronous, freeing components from implementation restrictions common in synchronous systems. The hardware blocks are easily interchangeable and coupled with other units only through a common asynchronous protocol. Furthermore, the programmer and compiler are given the responsibility of ensuring optimal communication between components; static, compile-time optimization allows thorough analysis to replace complex and inflexible hardware controls and optimizations. Ships may easily be added as more transistors become available on chip. Software may make use of the additional computational units and increase the achievable amount of instruction level parallelism. Finally, the compiler and programmer find the independences among instructions in Fleet; thus, the hardware has no overhead in discovering ILP as it is explicitly provided by software.

However, Fleet does not appear to apply well to general purpose applications. Strong coupling between hardware and software may create binary compatibility problems as more Ships are added to Fleet processors at new address ports. New resources will go unused by legacy software, limiting the amount of parallelism exploited by these applications. Additionally, multi-programming and exploiting TLP are not practical in Fleet as all processes would be required to cooperate and share Ships through inflexible compile-time partitioning.

Part II introduces new architectural improvements that attempt to remove the aforementioned limitations. Using Fleet as a stepping stone toward the goal of a scalable, more power-efficient general computer architecture for the future, the proposed enhancements are integrated with this previous work to form the Armada architecture.

Part II

The Armada architecture

Chapter 4

The Armada architecture

The Fleet architecture provides a unique instruction set architecture which allows software to express independence among instructions in a simple and useful way; hardware can profit from this instruction-level concurrency easily with little complexity. However, Fleet lacks similar methods of extracting and programming for thread-level concurrency. Armada is a multicore architecture composed of simultaneously-multithreaded Fleet cores. The constructs introduced in Armada attempt to enable exploitation of thread-level concurrency similarly to how Fleet currently takes advantage of instruction-level concurrency. With more concurrency available, supported by the many multithreaded Fleet cores on a single chip, the Armada architecture should demonstrate higher throughput than Fleet with minimal complexity and hardware overhead. To keep the complexity low, a cautious division of labor between software and hardware and between static and dynamic responsibilities is adopted.

4.1 Differences from Fleet

Several deviations from the Fleet architecture were made during the early stages of this research. These changes and the rationale for altering the original specification are discussed here. First, some proposed features of Fleet are orthogonal to its more radical proposals of dismissing the program counter and exposing ILC to hardware through code bags. To focus on these more unexplored frontiers of Fleet, some of these features were removed. Hardware enforcement of data-typing is not supported in Armada.


Additionally, the range of out-of-band values is essentially eliminated; only the last token behavior is retained to support virtual pipeline destruction and Ship reset. Fleet's move instructions contain a count field to support a specific number of repetitions of an instruction before it expires. Armada generalizes this behavior, supporting only single and unbounded (standing) instruction types. Supporting a specific repeat count is a specialization left for future study.

The Fleet memo also refers to a flow-control mechanism for facilitating communication between pipelined Ships. Although such a mechanism is probably crucial in achieving high single-thread performance, it is not strictly necessary for a Fleet processor to operate. Armada does not contain such a mechanism. Programming with virtual pipelines is discussed in more detail in the instruction set architecture section of this chapter.

The Fleet proposal refers to a hardware master-clear capability to bring a Fleet processor back to its clean, reset state. As such a system has not been devised, Armada relies on software to manage Fleet state and to detect when Fleet cores are clear of data and instructions. Software can then make the Fleet available for reuse by other programs or threads.

Additionally, the memo describes some Fleet Ships as pure sources and pure sinks. Pure source Ships may be possible depending upon the capabilities of the pipeline communication layer used between such Ships in the processor implementation. As Armada does not contain such a layer, a source Ship must be instructed to generate output by passing it a token. It must therefore expose at least one input trigger port to software and, consequently, is not a pure source. Armada contains only one Ship that may operate as a pure sink, the fetch Ship.

The memo describes another pure sink Ship that this research has found unusable. As discussed earlier, Armada requires software to manage token movements in order to prevent deadlock and to detect when a Fleet is free of all state after performing some computation. The memo suggests disposing of unneeded stray data tokens into a pure sink bit bucket Ship. Without an output port to deliver a token indicating when stray tokens have reached the bit bucket, software cannot know when these tokens have been removed from Ship output ports and have cleared the switch fabric. Therefore, in Armada, a counter Ship takes the place of the bit bucket in collecting stray tokens. The counter generates a single output after it receives the software-specified number of tokens to dispose of. This output token can then be used to gate the fetch of a successor code bag using some synchronization Ship like a join Ship, as shown in figure 4.1. Appendix A details the behavior of these Ships and all others mentioned in this thesis.

Figure 4.1: Gating fetch of a successor code bag on token cleanup. This example demonstrates a way to ensure stray tokens are cleaned up prior to fetching a successor code bag. The counter Ship produces an output token once all of the stray tokens arrive. The join Ship waits for both inputs in1 and in2 to arrive. It then consumes in2 and forwards in1 to the fetch Ship. Software must know which tokens are stray and where those stray tokens are located.
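In textual form, the gating pattern of figure 4.1 might look like the sketch below. The counter port names (count, in, out) and the Ships shipA–shipC holding the stray tokens are assumptions for illustration; only the counter's behavior is specified in this chapter.

    codebag cleanupAndFetch {
        3 -> counter.count;          // expect three stray tokens
        shipA.out -> counter.in;     // route each stray token into
        shipB.out -> counter.in;     //   the counter for disposal
        shipC.out -> counter.in;
        counter.out -> join.in2;     // cleanup-complete token gates...
        successor -> join.in1;       // ...the successor's descriptor
        join.out -> fetch.cbd;       // fetch only after cleanup
    }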

Fleet proposes specialized register Ships that allow software to overwrite their contents. While this behavior is undoubtedly useful, as it would decrease the number of stray tokens and eliminate the accompanying data movements required to dispose of them, the implementation and use of such Ships in an asynchronous, highly concurrent environment is unclear. Armada does not provide such capability in its register Ships.

Finally, this research branches off the Fleet architecture as it was defined three years ago. Fleet has continued to grow independently in a different direction through continued development at Sun Microsystems. Analysis of the Fleet architecture as it exists now is generally beyond the scope of this research. However, some aspects of the latest version of Fleet are discussed in terms of related work at the end of this chapter.

4.2 Architecture enhancements

The many, possibly heterogeneous Fleet cores that compose an Armada processor provide the hardware needed to run multiple threads simultaneously. These cores may differ in Ship composition, port mappings, switch fabric design, performance, power consumption, or any number of other characteristics to target specific workloads and to provide flexibility for future enhancements.

To capitalize on these additional hardware resources Armada makes available, four ISA modifications are proposed. The modifications are not orthogonal to each other; several of the constructs work together to exploit thread-level concurrency while balancing complexity. First, the ability to distinguish resource-dependent code bags from resource-independent ones is proposed. This delineation allows software to express which threads are concurrently executable, thus providing hardware with the ability to take advantage of TLC with minimal complexity. The second proposal consists of the mandatory use of a register file for centralizing the largely distributed state in a Fleet and a mechanism to forward this state to other cores as a way to support multithreading. Additionally, Armada introduces context synchronizers that allow many branches of a program's flow to merge before continuing past a barrier point. The final scheme aims to enhance processor performance when executing common code sequences through the use of cacheable software-generated flows based upon the virtual pipelines in Fleet. These flows are systems of virtual pipelines among Ships designed to perform some specific task. Caching flows reduces instruction fetches and average task run-time by reusing these systems and amortizing their setup cost over time. The hardware performs the cache management, autonomously reacting to dynamic run-time conditions, effectively making Armada a self-reconfiguring architecture.

The following sections discuss these proposals in more depth. Subsequently, the instruction set architecture is explored more rigorously, and details on writing software that uses these enhancements are given.

4.2.1 Independent code bags

Code bags in the basic Fleet architecture are all resource-dependent bags; the execution of any such code bag is dependent on the Ships, switch fabric, and other hardware resources retaining state from the execution of instructions in predecessor bags. Armada introduces resource-independent code bags to express an executable context's independence from any particular set of resources and associated state. These bags have types associated with them that correspond to the different types of the heterogeneous Fleet cores present in an Armada. The hardware can execute independent bags on any unallocated Fleet of a matching type in the system. State like a stack pointer or function arguments may be sent forward from a predecessor bag to an independent code bag successor through the local register file. The Armada hardware will load this data into the register file of the Fleet core allocated to run that successor bag.

4.2.2 Enhanced local register file

The ability to capture and restore system state allows an unbounded number of processes to execute on general-purpose processors.¹ The state is typically captured and restored by an operating system to support multiprogramming and multithreading. Furthermore, state can be replicated to spawn new programs or threads from an active process. In traditional architectures, the state of the system largely resides in the register file and program counter.² In contrast, state in Fleet is widely distributed among Ships and the interconnect. Capturing this distributed state is a difficult chore and may require costly hardware support. As an alternative solution, software can use the local register file to centralize a Fleet core's state at various points in a program. Armada adds a mechanism allowing hardware to package the code bag descriptor of the next independent code bag to fetch along with any arguments that are to be forwarded to that bag. This package can be stored to memory and later recalled for execution.

Note that hardware can only capture thread state at independent code bag fetch boundaries in a program; it cannot capture the state arbitrarily at any moment. Therefore, in the current implementation, a malicious program can spawn many threads that never free the Fleet cores allocated to them and livelock the system. This problem can be solved by the hardware reset mechanism described earlier, which is left as future work.

¹The number of processes that may be executed on modern computers is limited only by the memory available to store the state for each of the processes.
²Threads typically only have hardware state in the register file and the program counter. Processes, however, have hardware state that may include page tables and other resources.

4.2.3 Context synchronizers

Armada exposes context synchronizers to software that provide thread barriers in multithreaded program flow; programmers configure synchronizers to wait for multiple threads of a program to complete before continuing on with that program's execution. Hardware support is proposed because many programs will likely require thread synchronization. Additionally, threads in Armada may have quite short lifetimes relative to the time software-only methods would require to support the same barrier behavior.

4.2.4 Flow caching

Fleet already provides one way to speed up the execution of serial code: virtual pipelines connecting Ships. Flow caching is an extension of this idea aimed at exploiting a commonly occurring pattern in programs where the same instructions are repeatedly applied to different data. Flows are a specialized subset of independent code bags that set up a network of virtual pipelines once and allow this configuration to be applied to different data elements over time. Flows reduce the number of instruction fetches by amortizing the pipeline network setup cost over future invocations of the same code.

Fetching a cacheable flow, like fetching an independent code bag, starts a new thread on a free, compatible Fleet core. Armada hardware may cache a flow in a Fleet core if that core is not needed by other programs. When software fetches a flow's descriptor, the Armada hardware, at its discretion, attempts to find a free, cached instance of the flow to run the thread on. If successful, the register file contents are sent to the flow; instruction fetches are largely or completely avoided. If a free instance of the flow is not found, the hardware may either wait for a busy instance of the specified flow to become free, or it may load another copy of the flow onto an empty core as it would load any other independent code bag. If a particular flow can be applied to several data sets at once, Armada may cache multiple instances of the code on different cores at the same time.

The proposed system is self-reconfiguring, as the hardware-controlled caching frees the programmer from micromanaging the allocation and eviction of flows. However, software may provide hardware with hints on caching policy and may also lock time-sensitive flows, like interrupt handlers, onto a core. These behaviors and the mechanisms to support them are similar to instruction cache policies and cacheline locking features in traditional architectures.

The following section discusses the software view of how to program with flow caching and the other proposed features of Armada.

4.3 Instruction set architecture

This section describes the Armada ISA in enough detail for programmers to write simple programs using the aforementioned features. General programming concepts are discussed, and a prototype Fleet core is described.

4.3.1 Memory model

This research does not address the instruction and data memory subsystems in any detail. Although Armada may be better matched with a different type of memory model, it currently employs a conventional memory architecture for simplicity. Memory is byte-addressable, and both instruction and data addresses are 64-bit. Reads and writes of multi-byte data types must be aligned according to their respective sizes in bytes. For example, a four-byte word must begin at an address divisible by four.

4.3.2 One instruction

The move instruction is mostly unchanged from that of the basic Fleet architecture (recall section 3.2.2). However, as mentioned earlier in this chapter, the count field is not present in Armada. Armada defines several opcodes for the different variants of the instruction and fills in the gaps left in the Fleet-overview memo. In particular, a method of handling constants generated at compile-time is given. Figure 4.2 shows a summary of the encodings for the different variations of move.

The move variant in figure 4.2(d) is the most complex. While the other moves are a fixed size of 32 bits, this instruction opcode must be directly followed in memory by a constant that may be 4, 8, or 12 bytes long.³ Fleet allows the hardware to fetch and issue instructions in a code bag in any order; that flexibility allows the hardware to fetch and dispatch instructions that may be in a nearby cache, for example, before those instructions that are stored further away in the system, regardless of their location in the bag.

³The TYPE field encodes the size of the trailing constant. This field is larger than necessary; it originally described the hardware data type of the constant, which has since been defeatured.

Figure 4.2: The encoding in (a) is the commonly used port-to-port move instruction variant. The C bit indicates whether the source data is consumed or copied. The S bit indicates whether this is a standing instruction that repeatedly moves source data as it becomes available. The D bits indicate whether the moved data should persist at the destination ports, DESTx. The P bits pass along destination-specific mini-opcodes. Finally, SOURCE identifies the source port where the data is read from. Encodings (b) and (c) allow small immediate values generated at compile-time to be delivered to Ship input ports. The final encoding (d) moves large constants generated at compile-time that do not fit within the 32-bit instruction opcode. Software places a large constant immediately after the instruction opcode describing it in memory. The TYPE field indicates the constant's length.

Variable-length instructions complicate how the hardware can take advantage of that instruction issue freedom, as it must perform a sequential linear scan of the bag to identify the variable-length moves and handle them appropriately. Armada fetches instructions in a code bag in 32-byte cacheline chunks. The instruction fetch unit decodes instructions within each cacheline and handles variable-length instructions as necessary. Armada does not allow variable-length instructions to cross cacheline boundaries. This restriction allows Armada to issue all instructions within a cacheline that falls in the middle of a code bag without needing the surrounding cachelines. By imposing this cacheline-boundary restriction on long instructions, Armada retains the spirit of Fleet's ability to fetch and execute whichever instructions in a code bag are most readily available, though not with the finest level of granularity. However, as memory systems commonly move instructions around in groups of cachelines anyway, there is little impact from such a restriction.

4.3.3 Choosing the code bag type

Programs issue instructions by fetching code bags. Armada provides programmers and compilers with three different types of code bags. The characteristics of the different bags make them suitable for different parts of a program, as summarized in table 4.1.

Resource-dependent bags

Resource-dependent bags continue a current thread of execution. Small, sequential tasks that rarely execute should be defined in dependent bags to avoid the time and energy costs of allocating a Fleet core for infrequent and short-lived computations.

Resource-independent bags

Resource-independent bags spawn new threads of execution. Software should use independent bags to describe any tasks that can execute concurrently. Armada programmers and compilers should not be shy about spawning threads.

Type              Uses                                           Spawns thread   Run-time overhead   Software complexity
dependent         infrequently-executed, sequential              no              lowest              lowest
                  segments of a program
independent       concurrently executable segments of a          yes             highest             medium
                  program; ability to switch core types in
                  a heterogeneous Armada to accommodate
                  or optimize for different subroutine
                  characteristics (e.g. ILC, TLC,
                  control-flow, etc.)
independent flow  concurrently executable segments of a          yes             medium              highest
                  program that are repeated often,
                  e.g. inner loops

Table 4.1: Code bag comparison.

The number of threads a program may spawn is ultimately restricted only by the amount of physical memory available to store the independent tasks. However, an Armada operating system may choose to impose limits on the number of active threads a program can have, similar to how operating systems on traditional architectures impose limits on a program's stack size. These limits are typically never reached by a correctly written program.

Additionally, fetching an independent bag allows a program to choose a different type of Fleet core to execute code on in heterogeneous-multicore Armadas. For example, a program entering a segment rich in floating-point computations may stop the current thread and continue with a new thread on a type of core containing many floating-point units. When that computation is complete, the program may switch back to a Fleet type that has more control-flow Ships to determine what to do next.

Cacheable independent flow bags

Cacheable flow bags are a subset of resource-independent bags that spawn new threads of execution on a software-selected type of Fleet core, like independent bags. However, their special property of being reusable with amortized instruction fetch and setup costs makes them especially attractive for programs that execute the same instructions often. Flows typically have higher overhead than independent code bags as they must be specially coded to ready themselves for new data and to self-destruct when hardware commands them to. Thus, these bags should only be used when the code sequence they describe is expected to be reused often. Examples using each code bag type are given in the following sections.

4.3.4 Fetching code bags

Programs refer to code bags using descriptors. Table 4.2 describes the syntax and graphical representation for dependent, independent, and cacheable flow code bag descriptors.

Code bag descriptors

Fleet cores must support independent and dependent code bags. Flow support is optional. The 96-bit descriptors that describe these bags are shown in figure 4.3.

Figure 4.3: Code bag descriptor format.

Code bag descriptors describe a code bag's location in memory with a 64-bit physical start address and an 8-bit length field. The hardware uses these fields to load the appropriate instructions from memory.

Certain registers' contents may be forwarded to successor code bags. There is one 2-bit field for each such register that indicates whether the register's contents are ignored, copied, or moved to the successor. When the fetch Ship in a Fleet receives a code bag descriptor, it checks these fields and packages the contents of any marked registers with the descriptor. This package is sent to the local fetch unit for dispatching.

Description                                       Text syntax
Dependent code bag                                dependent codebag f { ... } f;
Independent code bag that does not take any       independent codebag g { ... } g;
  input arguments from the register file
Independent code bag that copies the contents     independent codebag h(*r0, r4) { ... } h;
  of r0 and moves the contents of r4 from the
  fetching Fleet and uses them as inputs
Independent cacheable flow that moves the         independent flow codebag j(r1) { ... } j;
  contents of r1 from the fetching Fleet and
  uses it as an input

Table 4.2: Code bag descriptor syntax and representation.

The BT field identifies whether the referenced code bag is dependent, independent, or independent-flow cacheable. Finally, the CORE TYPE field specifies which type of Fleet core in a heterogeneous multicore Armada the described code bag is designed to execute on. The hardware uses this field to map the code bag onto an appropriate core at run time. If the descriptor describes a dependent bag, the CORE TYPE field is ignored; in this case, the type is implicit, as dependent bags must execute on the same core as their predecessor bag.

The fetch Ship

Software fetches a code bag by sending a descriptor for the bag to a special fetch Ship in the Fleet. Every Fleet core is required to contain one of these Ships. The fetch Ship first looks at the descriptor to see if any register values should be forwarded. The descriptor and any forwarded register values are packaged as a code bag fetch request to the Armada hardware. The fetch Ship communicates the request to a centralized fetch unit, described in the next chapter, which then fetches the code bag. If the bag is dependent, the instructions are sent to the same Fleet that issued the request. If the bag is independent, the hardware finds a free Fleet core of the appropriate type as encoded in the descriptor’s CORE TYPE field and sends the instructions and forwarded registers there. Finally, if the bag is an independent cacheable flow, the hardware will attempt to find a free, cached instance of the flow and only forward the register values. If it cannot find a free instance of the flow, it can load the instructions and registers into an appropriately-typed Fleet core as it would when fetching a normal independent code bag.
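Using the descriptor syntax of table 4.2, the sketch below illustrates each kind of fetch. The bag names f, g, and h are placeholders; recall that the forwarding annotations belong to the bag declarations, so the fetch site simply sends the descriptor.

    codebag dispatchExamples {
        f -> fetch.cbd;     // dependent bag: continues on this core
        g -> fetch.cbd;     // independent bag: spawns a thread on any
                            //   free core of g's declared type
        h -> fetch.cbd;     // independent bag declared as h(*r0, r4):
                            //   the fetch Ship copies r0 and moves r4
                            //   into the new core's register file
    }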

Register forwarding

In addition to being used as temporary data storage, the register file is also used to forward arguments to successor code bags. By pushing all forwarded state in a Fleet into this one area, the hardware can store context state easily, providing it with flexibility on when and where to execute successor bags. The proposed use of the register file and the encapsulation of state have the following requirements and implications:

1. Code bags that fetch independent successor bags must pass any arguments for those successors through the register file. If successor bags do not require arguments, the register file does not have to contain data. If there are more arguments than registers available, some of the arguments may be written to memory and a pointer to that set of data passed using a single register.

2. Register values can be copied or consumed as they are passed to independent successor bags. For example, a constant that is reused and passed to many code bags may simply be copied while a disposable index variable may be consumed as it is transferred. Unused registers are ignored. The desired forwarding option is selected individually for each register by marking bits in the descriptor of the successor code bag.

3. Independent code bags (including cacheable flow code bags) that require input from their predecessors receive these arguments in their local register file; the hardware will populate the register file with the forwarded arguments provided by the predecessor. The predecessor and the independent child code bags it fetches must cooperate on what data is forwarded and how it is forwarded — which registers are used and what data each register holds. The Armada Procedure Call Standard, described in chapter 8, defines a contract that ensures all compliant code will interface correctly.

Registers have two states, empty and filled. A read from an empty register will block until that register is filled with data. A write to a filled register will block until the existing register data is consumed and the register made empty.

When the fetch Ship receives a code bag descriptor that requests register forwarding, the Ship waits for the specified registers to enter the filled state. At any point when all specified registers are filled, it may fetch the next bag. Software must therefore ensure that the registers are filled, or will be filled with the intended data, prior to fetching the successor. The fetch Ship reads all forwarded registers at a single time; if some forwarded registers are filled yet others are still empty, it will wait for the remaining registers to be filled before reading any of the values. This behavior implies that forwarded registers can be read from and written to freely until any time when they might possibly all reach the filled state at once.

Figure 4.4 shows two examples of a code bag attempting to call Func2(k+1). k has been forwarded into r0 prior to this code bag's execution. Figure 4.4(a) contains a race condition between the readout of k from the register file and the fetch of Func2.

If k is read and consumed from r0 prior to the fetch unit receiving the code bag descriptor for Func2, the operation is correct. However, it is also possible for the code bag descriptor to reach the fetch unit before k is read from the register, resulting in an incorrect call of Func2(k). Figure 4.4(b) corrects this hazard by ensuring that k is read out of the register file and that r0 is empty prior to releasing the code bag descriptor for Func2. The join Ship waits for both inputs to arrive at ports in1 and in2, then passes the value from in1 through to its output. If the Func2 descriptor reaches the fetch unit before k+1 is written, the fetch unit will block until r0 is in the filled state before consuming the argument and passing Func2(k+1) forward for execution. The toggle Ship passes the input from in1 before passing the data at port in2. The fetch unit will accept last, software's indication to the hardware that the Fleet is clear and ready for reuse, only after Func2 has been fetched.

Figure 4.4: Passing a variable to an independent successor bag. (a) incorrectly attempts to call Func2(k+1). Since r0 initially contains a value and is in the filled state, it is possible for the fetch of Func2 to occur before k is read out of the register file. This results in an undesired call to Func2(k) and stray state in the Fleet after termination. (b) shows a correct implementation that waits for k to be read out of the register file before fetching Func2.
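A textual sketch of the corrected program in figure 4.4(b) might read as follows; the exact wiring in the figure may differ. The arrival of k at join.in2 is used here as the proof that r0 has been emptied.

    codebag callFunc2 {                  // k arrives in r0; calls Func2(k+1)
        r0 -> add.in1, join.in2;         // consume k; its arrival at join.in2
                                         //   proves r0 has been read out
        1 -> add.in2;                    // compute k + 1
        add.out -> r0;                   // refill r0 for the successor
        Func2 -> join.in1;               // descriptor waits on the readout
        join.out -> toggle.in1;          // descriptor released first...
        last -> toggle.in2;              // ...then the core-free token
        toggle.out => fetch.cbd;         // standing move expires on last
    }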

In many cases, some register values may be consumed by a successor bag while other values are copied. Figure 4.5 shows a code bag designed to draw a horizontal line to a frame buffer. The code bag calls the paint function for constant row I over columns 0–2047 with the constant color C.

Figure 4.5: Painting a horizontal line. (a) shows a code bag designed to paint a horizontal line pixel by pixel by calling the paint function. The code bag is passed the row, I, and the color, C, in r0 and r2 respectively. The paint code bag requires that the arguments row, column, and color occupy registers r0, r1, and r2 respectively. (b) shows the corresponding cleanup code.

In the line drawing example, the row I and the color C are constants received as arguments to the code bag shown in figure 4.5(a). The stride Ship⁴ produces the column numbers to paint, which are sent to both r1 and the comparator. Given a start value s, a stop value t, and a step size i, a stride Ship generates a series of outputs:

output_x = s + xi    { ∀x | 0 ≤ x ≤ (t − s) / i }

The paint code bag descriptor is fetched for all columns in the specified range. Note how the descriptor for paint makes copies of the constants I and C. These values are sent as arguments to paint but left in the line drawer's register file for reuse. The column number in r1 is consumed by each call to paint and subsequently refilled with the next column number from the stride Ship. Once all columns are painted, the dependent cleanup bag is fetched. cleanup, shown in figure 4.5(b), empties the registers and subsequently frees the Fleet core for reuse.

4.3.5 Freeing Fleet cores

When a thread terminates, it must free the Fleet core it is running on so that the core may be reused. Software must ensure that the Ships are free of state⁵ and that the switch fabric is clear of tokens. Once the core is back in its reset state, software releases the core by sending a last token to the fetch Ship's code bag descriptor input, cbd. Once the fetch Ship receives this token, it signals to the Armada hardware that the core is free and may be used to execute other threads.

4.3.6 Virtual pipelines

Virtual pipelines connect multiple Ships together using standing instructions. The number of instruction fetches is reduced, and, ideally, single-thread performance improves by sending tokens through the pipeline quickly without potentially saturating and deadlocking the switch fabric. Virtual pipelines are one of the rare constructs in Fleet and Armada where the sequencing of data is guaranteed; data that enters a pipeline will never be overtaken by data subsequently pushed into that same pipeline.

⁴Originally proposed in [Sut05].
⁵This topic is discussed in section 4.3.7.

As previously mentioned, Armada does not have a dedicated flow-control mechanism for virtual pipelines. In this research, a regular standing instruction between some input trigger port on the first Ship in the pipeline and the output port of the last Ship in the pipeline regulates the flow of data to avoid switch fabric deadlock. For example, figure 4.6 shows how a virtual pipeline can be used to connect a stride Ship that is producing sequential addresses to a memory Ship that will write 0's to clear the contents of those addresses. In Armada, the stride Ship will only generate one output at a time. Receiving any data on the next input port of the Ship will trigger the production of another output. In the example, a standing instruction connects the memory Ship's write complete output port to the stride Ship's next input port. The stride Ship waits until the write is complete before generating the next address.

Figure 4.6: Virtual pipeline example. These instructions will clear 1,024 contiguous bytes of memory starting at address 0. 64 BIT is a mini-opcode encoded in the move instruction that expresses the size of the data to write.
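A textual rendering of figure 4.6 might look like the following sketch. The memory Ship port names (wr_addr, wr_data, wr_comp) are assumptions based on the Ship behaviors described later in this chapter, and the initial trigger to stride.next follows the idiom of the sum-of-squares example.

    codebag clearMemory {
        0    -> *mem.wr_data(64BIT);       // persistent zero; 64-bit writes
        0    -> stride.start;              // addresses 0, 8, 16, ..., 1016
        8    -> stride.step;               // one 64-bit word per step
        1016 -> stride.stop, stride.next;  // last address; the token to next
                                           //   triggers the first output
        stride.out  => mem.wr_addr;        // virtual pipeline: address stream
        mem.wr_comp => stride.next;        // request the next address only
                                           //   after the write completes
    }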

Software must ensure that tokens are not generated more quickly than they can be consumed by the slowest Ship in the pipeline. As Fleet does not impose any timing requirements on Ships, software cannot make any assumptions about when or where tokens may pile up in the pipeline. In Armada, the only way to guarantee that tokens will not saturate the switch fabric is to ensure that all of the outputs generated directly and indirectly from the initial input token set reach the end of the pipeline prior to inserting any new inputs.

Without a flow-control fabric, this simple pipeline implementation in Armada is not ideal for performance, as the latency of the switch fabric is fully and unnecessarily exposed. In the previous example, the write complete token takes one trip through the switch fabric before triggering the stride Ship to produce the next address. The next address must also take a trip through the fabric to reach the memory Ship. Thus, the latency from requesting the next address to receiving it is at least twice the latency of the fabric. In the general case, the period between tokens in a virtual pipeline is linearly proportional to the total length of that pipeline; only one data set may flow through it at once. The tokens flowing in a virtual pipeline in this implementation of Armada are effectively not pipelined at all. Therefore, the primary benefit of virtual pipelines in this first design is the reduction of instruction fetches attributed to the use of standing instructions.

A flow-control system could considerably increase the throughput of a virtual pipeline by allowing multiple input data sets to flow through the pipeline at once. As previously mentioned, this system is left as future work.

4.3.7 Handling state in Ships

Some Ships retain state for lengthy periods of time, even indefinitely. This state retention can be very useful as it reduces the amount of data that must be communicated through the switch fabric. Without a hardware reset capability, software must ensure that all state-retentive Ships are in the reset state before releasing the core for reuse. Ships may have internal state, like the stride and counter Ships, and they may have persistent inputs, like constants that are reused in calculations indefinitely. Software forces a Ship to clear its state in different ways according to Ship type.

Arithmetic Ships accept a persistent input on at most one of their input ports at a time.⁶ The persistence is cleared by sending a last token to the other input port. The memory Ship behaves in a similar way. For example, a persistent constant may be sent to the write address port to repeatedly address a first-in-first-out buffer (FIFO), or a constant may be sent to the write data port when repeatedly writing a specific pattern to a region of memory, as in figure 4.6. The persistence is cleared and the Ship reset by sending last to the non-persistent input.

Other Ships may have an input port that does not accept persistent inputs. For example, stride Ships do not accept persistent inputs on next input ports, and selector Ships do not accept them on select input ports. Sending a last token to one of these ports will consume the data on the other input ports.

Finally, some Ships, like the comparator Ship, provide a special reset port for the sole purpose of clearing Ship state.⁷

⁶This constraint is in place for good reason. If both inputs to a multiplier were persistent, for example, the unit would repeatedly generate the same output and flood the switch fabric.

In Armada, it is reasonable to compare two tokens to determine whether they are both last. Whereas other Ships typically react to last tokens by generating a last result, the comparator will produce a normal boolean result. The reset port provides an escape from standing instructions and clears persistent inputs by consuming all tokens at the input ports.

Upon receipt of a clear signal by any of the three means described, a Ship will wait for all relevant inputs to arrive before clearing those input tokens. For example, a comparator Ship that receives a reset directive will not reset until valid inputs are in place on its lhs, rhs, and op input ports. Likewise, an arithmetic Ship that receives last on its rhs port will wait indefinitely for an input to appear on its lhs port before freeing the Ship. A last token sent to a memory Ship's wr_addr port will cause the Ship to wait for a matching input on its wr_data port; however, it will not wait for an input at its rd_addr port, as this input is not relevant to a write operation. Waiting for the other input tokens before clearing the Ship is required to avoid a race condition between the Ship clear signal and the input tokens arriving at that Ship. If a Ship only cleared inputs that had already arrived, another potentially complex mechanism would be required to intercept tokens in flight in the switch fabric destined for that Ship. For similar reasons, a Ship that has received a clear signal will only remove one token per relevant input port. Software must track and handle cases where several inputs are queued up at one port.

Once a Ship is clear, it will generate a final last token on the relevant output port. Again using the memory Ship as an example, a clear signal sent to a write input port would cause a last token to appear at the wr_comp output port, but no token would be generated at the rd_data output port. This last token indicates that the Ship has been successfully reset and will also break down any standing instruction connected to that port.
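For instance, the persistent zero installed on the memory Ship in figure 4.6 might be cleared as sketched below, again with the assumed port names. The final last emitted on wr_comp also breaks down the standing instruction feeding stride.next.

    codebag clearMemCleanup {
        last -> mem.wr_addr;    // pairs with the persistent 0 on wr_data,
                                //   clears it, and resets the Ship;
                                //   mem then emits last on wr_comp
    }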

4.3.8 Context synchronizers

Software will often issue many threads and need to ensure those threads are complete before continuing with other tasks. Programmers and compilers interact with Armada’s context synchronizers to provide this barrier control.

⁷The comparator is the only Ship that behaves this way in Armada.

A program requests a synchronizer by sending to the context synchronizer Ship an independent code bag descriptor for the post-barrier successor code, along with any forwarded register contents it requires. This step is similar to fetching an independent code bag. Additionally, the program sends the synchronizer Ship a count indicating how many threads must complete before the successor code is fetched. The synchronizer Ship responds with a unique reference for the synchronization request. At this point, the program can fetch independent code bags for all of the threads it wants to spawn, passing them the synchronizer reference as an input. This parent thread then frees the core it is running on.

The child threads execute, and, just prior to terminating, each thread sends the synchronizer reference received from its parent to the decrement input of the local context synchronizer Ship. When the Armada hardware determines that the last child thread associated with that reference has completed, it sends the post-barrier independent successor code bag and forwarded register contents to a free Fleet core for execution.
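A minimal sketch of this protocol follows. Only the decrement input is named in the text above; the other synchronizer port names (cbd, count, ref) are assumptions, and the ordering of the final last against the child fetches is glossed over (real code would enforce it, for example with a toggle Ship as in figure 4.4).

    codebag parent {
        afterBarrier -> sync.cbd;    // post-barrier successor descriptor
        2 -> sync.count;             // wait for two child threads
        sync.ref -> r0;              // unique barrier reference
        child1 -> fetch.cbd;         // children declared to copy r0,
        child2 -> fetch.cbd;         //   e.g. child1(*r0), child2(*r0)
        last -> fetch.cbd;           // free this core after the fetches
    }

    codebag childEpilogue {          // tail of each child thread
        r0 -> sync.decrement;        // report completion to the barrier
        last -> fetch.cbd;           // free the child's core
    }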

4.3.9 Flow caching

Cacheable flows consist of instructions and data cached onto a Fleet core. The flow can be applied to different data elements over time, reducing instruction fetches and amortizing the setup cost of any virtual pipelines and persistent Ship inputs. Flows are generated from specially designed independent code bags that must:

1. Use one or more forwarded input data elements from the register file

2. Indicate readiness for the next set of inputs when the current computation is complete by passing a special code bag descriptor to the fetch Ship

3. Maintain a predictable flow over multiple sets of input arguments delivered through the register file; the state of the Fleet core before operating on a set of valid input data should be equivalent to the state of the core after operating on the data

4. Clean up and free the allocated core when any input data set is encountered that irreversibly alters the flow through the system such that condition three is not met

5. Clean up and free the allocated core when all the input arguments are last. The hardware uses this method to evict any cacheable flow on demand.

Cacheable flows are fetched like any other independent code bag. The bag descriptor has a bit which identifies it as cacheable. Unlike threads created from independent code bags, flows do not generally reset the underlying resources when they have finished working on a set of data. Instead, they ready themselves for another input data set. Whereas threads signal hardware once the underlying Fleet core is guaranteed to be back in the reset state, flows signal hardware when they are ready for the next data set.

Additionally, hardware may need to evict flows when the Fleet cores they occupy are needed for other purposes. As there is currently no core reset function implemented in hardware, flows must be carefully defined by software to self-destruct and release the occupied resource upon receiving an indication from the hardware. If the hardware needs to reclaim a Fleet used by a flow, it fills all of the input arguments in the register file of that Fleet with last values and waits for the flow to release the core.

Flows must indicate to the hardware when they are ready to accept new inputs. Therefore, it is necessary to differentiate between indications that the resources are freed and indications that the flow is ready for the next set of arguments. Like other code bags, flows indicate resources are freed by sending last to the fetch Ship's cbd input port. Flows additionally indicate that they are ready for another set of inputs by passing a special code bag descriptor, distinguished by a unique BT field (recall figure 4.3), to the cbd port. Hardware will not reuse a flow until it has received this ready signal.

Unless hardware signals a flow to destruct, flows ideally set themselves up for the next input data set when they complete their work. However, if an exceptional set of inputs arrives that breaks down the flow, making it difficult to reset for the next data, a flow may choose to clean up all resources and release the underlying Fleet core; flows have no obligation to always ready themselves for more data, though that is their most useful behavior.

One example where a cacheable flow may prove useful is the color conversion from the red, green, and blue color space (RGB) to the luminance and color difference space (YUV) commonly found in many image and video compression algorithms. A conversion operation, defined by equation 4.1, is performed on every pixel of an image.

\[
\begin{pmatrix} Y \\ U \\ V \end{pmatrix} =
\begin{pmatrix}
0.299 & 0.587 & 0.114 \\
-0.299 & -0.587 & 0.886 \\
0.701 & -0.587 & -0.114
\end{pmatrix}
\cdot
\begin{pmatrix} R \\ G \\ B \end{pmatrix}
\tag{4.1}
\]

The flow defined by the code bags in figure 4.7 performs this color space conversion. Transports marked with a minus sign pass a mini-opcode indicating to the destination Ship that the value should be negated prior to performing the arithmetic operation. If any of the RGB color inputs are last, the flow will clean itself up and terminate, freeing the Fleet core as in figure 4.8. Note that this implementation assumes that either all or none of the inputs are last.

Figure 4.7: Color conversion cacheable flow.
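For reference, the per-pixel arithmetic that this flow implements is simple in scalar form. The C++ sketch below evaluates equation 4.1 directly; the function name and the use of doubles are illustrative choices, not part of the flow's code bags.

    #include <array>
    #include <iostream>

    // Scalar reference for the RGB-to-YUV conversion of equation 4.1.
    std::array<double, 3> rgb_to_yuv(double r, double g, double b) {
        double y =  0.299 * r + 0.587 * g + 0.114 * b;
        double u = -0.299 * r - 0.587 * g + 0.886 * b;   // equals B - Y
        double v =  0.701 * r - 0.587 * g - 0.114 * b;   // equals R - Y
        return {y, u, v};
    }

    int main() {
        auto yuv = rgb_to_yuv(255.0, 0.0, 0.0);          // a pure red pixel
        std::cout << "Y=" << yuv[0] << " U=" << yuv[1] << " V=" << yuv[2] << '\n';
    }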

4.3.10 Hardware reset behavior

At hardware reset, Armada automatically fetches a special code bag from address 0. This initial code bag must contain only one move instruction, which software programs to fetch the code bag of reset handler code. In that reset handler bag, software can configure the system and launch programs or an operating system.

Figure 4.8: Color conversion cleanup code.

4.3.11 Event handling

Traditional architectures have interrupt handlers that react to events caused by I/O devices, on-chip monitors, software, and other sources. Armada supports such events with event handlers. Special registers in an Armada processor are set aside for each handled event. Software programs these registers at reset with independent code bag descriptors for the hardware to fetch if the event occurs. To prioritize the handling of such events over normal program execution, the OS may assign a Fleet core a special CORE TYPE that only event handler code bags use. Thus, normal program code will never occupy those cores, increasing the likelihood that the event handling code is assigned to a free Fleet immediately.
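A minimal model of this registration scheme is sketched below, assuming a flat map from event identifiers to 96-bit code bag descriptors; the Cbd96 layout and the class interface are assumptions for illustration, not the processor's register map.

    #include <cstdint>
    #include <unordered_map>

    // Sketch of the per-event registers described above: software programs a
    // descriptor at reset; hardware looks it up when the event fires.
    struct Cbd96 { uint64_t lo; uint32_t hi; };          // a 96-bit descriptor

    class EventUnit {
        std::unordered_map<int, Cbd96> handlers_;        // event id -> descriptor
    public:
        // Programmed by software at reset.
        void program(int event_id, Cbd96 cbd) { handlers_[event_id] = cbd; }

        // On an event, hardware fetches the registered independent code bag,
        // ideally onto a Fleet reserved via the special CORE TYPE.
        bool raise(int event_id, Cbd96& out) const {
            auto it = handlers_.find(event_id);
            if (it == handlers_.end()) return false;     // no handler registered
            out = it->second;
            return true;
        }
    };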

4.3.12 Fleet prototype core ISA

Armada is intended to support a heterogeneous mix of Fleet cores. Some cores may have Ship compositions ideal for handling complex program control flow. Others may aim to efficiently run demanding floating-point programs or to perform specialized functions like video encoding.

For this research, only one type of Fleet core was created with the purpose of being able to execute all general-purpose programs. There are many different Ship compositions that would achieve such a goal. This particular configuration was largely influenced by the functional unit requirements of the first hand-coded benchmark written for Armada (described in chapter 6); no effort was made to create the “best” Fleet core for general-purpose computing. The Ships are fully connected by the switch fabric in this core; every destination port is reachable from every source port.

4.4 Related work

Although Fleet and Armada are unique in their own right, they share similarities with some aspects of other architectures. This prior work is described in the following sections.

4.4.1 Transport-triggered architectures

Transport-triggered architectures, or TTAs, also have a single move instruction [CM91]. However, all known TTA designs retain the program counter and use synchronous logic. The compiler or low-level programmer must therefore schedule instructions such that pipeline and result latencies are accounted for. Additionally, TTA operations are triggered when an input is sent to a particular trigger port of a functional unit; conversely, in Fleet a unit is enabled when all of its activation inputs are present. In TTAs, a fixed number of move operations may occur in parallel by packing them into a single instruction word, as in VLIW architectures; Fleet's code bags, however, are variable length and allow the programmer to express much larger amounts of instruction-level concurrency.

4.4.2 Dataflow machines

Fleet functional units are activated in dataflow fashion, enabled by the asynchronous interfaces of the units to the switch fabric. The resulting behavior, though similar to the class of dataflow architectures derived from work by Dennis [DM75], is different in several ways. First, in pure dataflow, instructions are encoded with a type of operation to perform and a destination instruction to forward the result of that operation to. In Fleet, no operation is explicitly specified, as the operation is a consequence of where the data is sent. Additionally, Fleet does not require that the destination address of an operation's result be encoded in an instruction. Thus, unlike pure dataflow, Fleet requires control instructions to pick up data at the output of functional units and to forward that data to its destinations. Fleet also allows side-effects, changes to program state anywhere in a function other than the function's return value, which are not allowed in pure dataflow. Therefore, in addition to being easily targeted by a functional language compiler of the kind typically associated with dataflow machines, Fleet is more easily targeted by compilers for ubiquitous imperative languages such as C, C++, Java, and Fortran, which often contain such side-effects.

4.4.3 WaveScalar architecture

The WaveScalar architecture from the University of Washington has similar aims to Fleet and Armada. It attempts to confront growing design complexity and communication delays. Like Armada, it is strongly dataflow oriented, abandons the program counter, and has traditional memory semantics [SMSO03]. Unlike Armada, WaveScalar is synchronous. WaveScalar originally focused on exploiting data-level parallelism: data are tagged uniquely for each iteration of a loop, allowing the same instructions to operate on different iterations of a loop at once. This behavior is very similar to the dynamic, tagged-token dataflow machines of the past [GKW85]. In Armada, instructions are also tagged, allowing the same functional units to execute entirely different instructions on data from different threads. Armada may benefit from dynamic dataflow-like DLP extensions in the future. Proposed enhancements to WaveScalar, specifically the SpMT WaveCache, aim to also exploit thread-level parallelism, including speculative threads [PWD+09]. This behavior is supported by transactional memory and hardware threading extensions.

4.4.4 Independence architectures

Unlike programs for dataflow architectures, which encode the dependences between instructions, programs for independence architectures indicate which instructions are independent from each other [RF93]. Fleet is in this class of independence architectures, as Fleet compilers place sets of independent instructions that may execute concurrently into code bags. Some other members of the class of independence architectures include VLIW processors and the Horizon processor.

VLIW processors typically have fixed-length instruction words that contain several smaller, concurrently-executable instruction slots. Each instruction slot in a long instruction word corresponds to a particular pipeline and set of hardware resources. If a set of hardware resources cannot be used at a particular time, the corresponding slot is filled with a no-op instruction. Fleet code bags do not require this space-consuming waste. Horizon programs encode a lookahead field within memory-access instructions indicating how many subsequent instructions may be executed before the current instruction must complete [TS88]. Unlike the potentially very large code bags in Fleet, Horizon programs may only specify up to seven independent instructions as concurrently executable.

4.4.5 SCALP and Vortex asynchronous processors

Both Endecott’s SCALP and Fulcrum Microsystems’ Vortex processor share many similarities with Fleet [End95][Lin07]. The functional units in both processors have result and operand ports interfaced with an interconnect that may bypass any register file and communicate directly with other functional units. The units share similar asynchronous connections with the interconnect and, therefore, also exhibit dataflow behavior like Fleet. However, the hardware in both designs is responsible for ensuring program-order execution of instructions, as in traditional superscalar processors. The Vortex units may also store state internally, as in Fleet. Vortex has two classes of instructions, one of which is equivalent to the Fleet move instruction, including a vector or count field, though it lacks a form of standing instruction. The other class of instructions provides operations for functional units to perform. In Fleet, either the destination ports and destination opcodes dictate the behavior of the functional unit, or an operation may be sent to a functional unit's port directly by encoding the function to perform as data within the standard move instruction. Also unlike Fleet, Vortex has a sophisticated branch unit that handles high-level control-flow constructs, which are handled with primitive Ships in Fleet.

4.4.6 TRIPS architecture

The Fleet and Armada architectures are similar to EDGE instruction set architectures that perform explicit dataflow graph execution [BG04]. The TRIPS architecture is one such implementation of an EDGE architecture, developed at the University of Texas at Austin. The synchronous TRIPS architecture schedules hyperblocks of VLIW instructions that execute in dataflow-determined fashion [SNL+03]. A program counter is used to jump between the hyperblocks. The dataflow-oriented hyperblocks behave similarly to Fleet's code bags, and instructions within a block are executed as soon as the required inputs are available. TRIPS can execute multiple hyperblocks simultaneously. In some cases, the concurrently-executing hyperblocks are speculative continuations of the active thread. TRIPS can also be tuned to support more general thread-level parallelism and execute hyperblocks from multiple threads concurrently. This behavior is similar to the threading extensions to Fleet proposed in this work. The TRIPS implementation is more restrictive than Fleet and Armada, requiring fixed-size hyperblocks, a fixed number of loads and stores per block, a fixed number of register file accesses per block, and a fixed number of outputs per block. Fleet and Armada have no such restrictions, which makes code bags denser.

4.4.7 Fleet at Sun Microsystems and U.C. Berkeley

Fleet continues to develop at both Sun Microsystems and U.C. Berkeley. As mentioned previously, this research took a snapshot of the Fleet architecture in 2006 and branched from the core group. The other group of researchers is investigating ways of programming individual Ships with small kernels of instructions. These instructions can be dynamically loaded and unloaded from Ships, changing their behavior for different parts of programs. That research includes study into programming these systems with a new mid-level language above the level of move instructions.

4.5 Conclusion

This chapter described several enhancements to the Fleet architecture that enable further exploitation of concurrency. The Armada-1 microarchitecture discussed in the next chapter is the first design based on the Armada architecture. It contains hardware that exploits the thread-level concurrency described by Armada programs. Due to time constraints, further implementation and study of flow caching was not rigorously pursued and is left as future work.

Chapter 5

The Armada-1 microarchitecture

This chapter describes one possible hardware organization of the Armada architecture. This first Armada microarchitecture focuses on functional support for the thread-level concurrency extensions described in the previous chapter; cacheable flows are not supported. Processor performance was not a key consideration, as exploring the programmability and use of the novel architecture took precedence.

5.1 Overview

Armada-1, shown in figure 5.1, is the first microarchitecture based on Armada. It is designed to exploit thread-level concurrency with minimal complexity. Armada-1 supports configurations of one, two, and four Fleet cores. These cores each have distinct switch fabrics and Ships, and they share a common fetch unit. As mentioned previously, only one type of Fleet core was created for the research; although the Armada architecture generally supports the integration of different types of cores, Armada-1 is a homogeneous chip multiprocessor.

5.2 Memory subsystem

The data and instruction memories and accompanying infrastructure in Armada-1 are relatively primitive and are not specified in a high level of detail. As described in the previous chapter, Armada-1 has a uniform access memory architecture with byte addressing and 64-bit physical addresses. There is no virtual memory system.

The instruction memory is only a cache that must be preloaded with programs; there are no other components in the instruction memory hierarchy for it to connect to.


Figure 5.1: Block diagram of a dual-core Armada-1 processor.

The data memory is also a flat structure that must be preloaded with any program data.

5.3 Fleet cores

Armada-1 integrates several simultaneously-multithreading Fleet cores together within a single chip. The cores receive instructions from the fetch and dispatch unit (FDU) through an instruction horn. Those instructions merge with data in the instruction pool. Once paired together, instructions carry data through a funnel-and-horn switch fabric to the Ship destinations. The Ships forward outputs to the instruction pool where that data merges with instructions, and the process continues.

5.3.1 Simultaneous multithreading

The Fleet cores in Armada-1 multiplex the use of resources across multiple threads. Fleet cores are designed to integrate large numbers of Ships, and there are times that instructions from a single thread will not use all of those resources at the same time. Inactive resources may still significantly contribute to the processor's leakage power consumption. Allowing multiple threads to simultaneously use available resources in the system has two very useful implications. First, the leakage-power cost is amortized across those different threads. Secondly, the hardware is able to support even larger numbers of threads without the addition of more Ships and an increase to die area. Supporting numerous contexts is particularly attractive as the Armada ISA encourages software to spawn many threads.

Data and instructions from different threads may be in the instruction pool or in-flight in different parts of the switch fabric concurrently. Likewise, different Ships may operate on data from different threads concurrently. Due to these characteristics, Armada-1's Fleet cores are considered to be simultaneously multithreading.

The cores in Armada-1 implement a tagging scheme, or coloring scheme, that allows the hardware to distinguish between instructions and data from different threads. The switch fabric preserves this coloring, reading colored data from the instruction pool and delivering colored data to Ship input ports. Ships will only operate on data at their input ports that have identical colors, and any output they produce will be marked with that same color. State-retentive Ships, such as the stride Ship discussed previously, must be capable of maintaining unique state for the full number of contexts supported by the microarchitecture. Instructions in the instruction pool, also colored, will only merge with Ship output data of the same color.

Each core effectively manages multiple virtual channels of instructions and data occupying the same physical wires and processing elements. One such virtual channel or color layer on a physical Fleet core is referred to as a virtual Fleet. Thus, an Armada-1 processor built with four physical cores, each supporting 32 virtual channels, has 128 virtual Fleets on which to execute threads. This scheme is analogous to commercial SMT cores that present themselves as multicore processors to operating systems.
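The tag arithmetic above is easy to make concrete. A minimal C++ sketch follows, assuming a tag layout of a physical core index plus a color; the field widths are illustrative, not the real encoding.

    #include <cstdint>
    #include <iostream>

    // A virtual Fleet is named by a physical core index and a color.
    struct FleetTag {
        uint8_t core;    // which physical Fleet core
        uint8_t color;   // which virtual channel (thread) on that core
    };

    int main() {
        const int cores = 4, colors_per_core = 32;
        std::cout << "virtual Fleets: " << cores * colors_per_core << '\n'; // 128
        FleetTag t{3, 17};   // a thread running as color 17 on core 3
        // A Ship pairs only inputs whose colors match and stamps its output
        // with that same color.
        std::cout << "core " << int(t.core) << ", color " << int(t.color) << '\n';
    }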

5.3.2 Data size

In the original Fleet design, multiple out-of-band (OOB) values were possible. Although OOB was eventually limited to one value, last, Armada-1 has remnants of the original specification. Thus, the native data size is 96 bits to accommodate typical 64-bit integer and double-precision floating-point data plus 32 bits to represent the different possibilities of OOB. Code bag descriptors are also 96 bits and are handled as primitive data objects.

5.3.3 Instruction horn

The instruction horn feeds colored instructions from the shared FDU to the instruction pools within the individual cores (figure 5.2). Steering elements route instructions through the horn to reservation stations within the pool. Each steering element looks at one bit of the source address and steers the instruction toward the target station based on that bit's value.

Figure 5.2: Instruction horn and pool.

5.3.4 Instruction pool and reservation stations

The instruction pool is an abstract place where executable instructions wait before entering the switch fabric. In the microarchitecture, this instruction pool consists of many reservation stations that merge instructions and data together into switch fabric payloads.

There is one reservation station for each Ship output port in a Fleet. The stations wait for port-to-port move instructions to arrive from the instruction horn and for data to arrive from the Ship output ports. Once each of these different inputs of matching color arrive, the station combines the destination addresses and mini-opcodes in the instruction with the data into a switch fabric payload.

After forming this payload, the reservation station either preserves or destroys the Ship data and the instruction depending on the C and S bits in that move instruction (recall the move instruction opcode in figure 4.2(a)). If the instruction has the consume-source bit set, the station will erase the Ship data; otherwise, it will transfer that same data when the next instruction with the same color arrives. If the instruction has the standing bit set and the data it merges with is not last, a copy of the instruction will remain at the station to merge with the next data of matching color produced by the associated Ship; otherwise, the instruction is discarded after one use (a sketch of this C and S handling appears at the end of this section).

The Fleet cores handle constants generated at compile time, and the three move instruction variants that carry these values, in a special way. Each core dedicates one reservation station to handling these constants. In the prototype cores of Armada-1, the address of this reservation station is 127, as shown in figure 5.2. Unlike the other stations, the constant reservation station's data input link is not connected to a Ship output port. Instead, this link is connected directly to the FDU. The FDU strips the constant from constant move instructions and sends it to the reserved reservation station via this connection. The FDU also converts constant move instructions into port-to-port move instructions; it replaces the constant value with the address of the constant reservation station. These converted move instructions traverse the instruction horn as normal. Thus, constants and their associated move instructions take different pathways to the reservation station. The constant and the instruction for multiple constant move instructions in a single code bag arrive in the correct order because each pathway guarantees in-order delivery of the payloads.
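The C and S handling described above reduces to a small amount of state per station. The C++ sketch below models one station for a single color (the real stations track every color independently and also check color matches before pairing); the field names are assumptions.

    #include <cstdint>
    #include <optional>

    struct MoveInstr { uint8_t dest; bool consume_source; bool standing; };
    struct Token     { uint64_t data; bool last; };

    // One reservation station, for a single color.
    class ReservationStation {
        std::optional<MoveInstr> instr_;   // waiting move instruction, if any
        std::optional<Token>     data_;    // waiting Ship output data, if any
    public:
        void deliver_instr(MoveInstr m) { instr_ = m; try_fire(); }
        void deliver_data(Token t)      { data_ = t;  try_fire(); }
    private:
        void try_fire() {
            if (!instr_ || !data_) return;               // need a matching pair
            bool was_last = data_->last;
            // ... form a switch-fabric payload from *instr_ and *data_ here ...
            if (instr_->consume_source) data_.reset();   // C bit: erase Ship data
            if (!instr_->standing || was_last)           // S bit: only a standing
                instr_.reset();                          // instruction survives
        }
    };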

5.3.5 Switch fabric

The switch fabric and supporting structures move data from reservation stations in the instruction pool to Ship input ports. The switch fabrics in the Fleet cores of Armada-1 are directly derived from the experimental funnel-and-horn design of FLEETzero, a Fleet test chip developed by Sun Microsystems (figure 5.3) [CLJ+01]. The Armada-1 funnel uses arbitrated “demand merge” elements as described in the FLEETzero paper. These merge elements are arbiters that forward the first payload to arrive at either of their two input ports. The payload eventually arrives at the trunk of the switch fabric, which passes it forward into the switch horn. The switch horn routes the payloads to their one or two destination Ships.

Figure 5.3: Switch fabric.

5.3.6 Ships

Ships, the functional units in Fleet architectures, accept input data from the switch fabric and may forward Ship-specific output data to the instruction pool reservation stations. As previously described, a Ship may have one or many inputs and zero or many outputs. A Ship may also hold state established by previous inputs. Ships in Armada-1 may receive inputs of different colors in any order. The Ship is responsible for handling sets of data of each color independently from data of other colors. A Ship that carries state must therefore be capable of holding independent state for every possible color. Ships tag any output data they generate with the same color as the input data that invoked it.

Armada-1 consists of several classes of Ships. Control Ships allow programmers to express sequencing and changes in control flow. Bitwise-logic Ships perform bitwise operations on inputs. Integer and floating-point arithmetic Ships enable fundamental numerical operations. Memory Ships provide a means to communicate with external system memories. The fetch class consists of only one type of Ship, which enables program branching and facilitates virtualization. The register class contains only one type of register Ship, used to temporarily store data and to hold forwarded code bag arguments. Finally, context-synchronization Ships provide a thread-merging capability. The full array of Armada-1 Ships and the port address mapping are described in appendix A.

5.4 Fetch and dispatch unit

The Armada-1 shared fetch and dispatch unit receives code bag descriptors from the Fleet cores and ultimately delivers colored instructions and constants to those cores. Internally, the FDU consists of two units, the code bag fetch unit (CBFU) and the instruction dispatch unit (IDU), as shown in figure 5.4. The FDU receives two sets of inputs from each of the Fleet cores: dependent code bag descriptors travel on one channel into the CBFU, and independent code bag descriptors, together with forwarded register values, travel on another. The inputs from the different cores go through arbiters so that the CBFU handles only one request at a time per channel.

5.4.1 Code bag fetch unit

The CBFU has several responsibilities including:

• allocating virtual Fleets for independent code bags to execute on

• freeing a virtual Fleet when software signals that the resources are cleared

• providing the instruction memory with addresses of instructions to fetch

• forwarding register values to their destination virtual Fleet

• delivering constants to their destination virtual Fleet

The CBFU contains a Fleet allocation table that keeps track of the virtual Fleets that are in use.

Figure 5.4: Fetch and dispatch unit.

When an independent code bag descriptor arrives, the CBFU attempts to find a free virtual Fleet to allocate for the bag. If one is found, that Fleet is marked busy, and the bag is fetched and routed to it. If all virtual Fleets are in use, the data channel delivering independent code bag descriptors stalls until resources are freed. Because software may request thousands of threads at a time under normal conditions, the independent channel queue is backed by a buffer in main memory. Overflowing requests are written to main memory and subsequently fed into the CBFU by a direct-memory-access engine when the queue size decreases.1 If the operating system sees that a program is fetching more independent bags than it or the system memory supports, the operating system may throw an exception similar to the stack overflow exception thrown in many modern general-purpose computing systems.

Non-last dependent descriptors do not require an allocation table reference. Because a dependent code bag runs on the same virtual Fleet already allocated to the predecessor bag that fetched it, the dependent code bag descriptor queue is not subject to the blocking condition that the independent descriptor queue may experience; dependent code bag fetches can always be fulfilled. The CBFU will free a virtual Fleet upon receipt of a last dependent descriptor originating from that Fleet. In this case, the CBFU must reference the Fleet allocation table. The allocation table handles requests from this queue separately from those made by the independent code bag queue; if an independent request to the allocation table blocks because there are no free Fleets, the table will continue to handle dependent code bag requests. These dependent requests will eventually free virtual Fleets, allowing the CBFU to continue fulfilling allocation requests from the independent code bag queue.

When fetching a code bag, the CBFU extracts the start address and length of the code bag from the descriptor. The CBFU passes one or more 32-byte-aligned physical addresses and 8-bit instruction enables to the instruction memory indicating which instructions to fetch. The memory then delivers a 32-byte cacheline of instructions and the same 8-bit instruction enable data to the IDU. Additionally, the CBFU passes the tag of the destination virtual Fleet on which the instructions should be executed to the memory. The tag contains bits identifying which physical core the instructions are destined for and what color to tag those instructions with. By passing the tag and instruction enables to the instruction memory, the CBFU can fire and forget the requests. This decoupling gives the memory freedom to deliver instructions in any order. Cached instructions may be fetched and delivered immediately, and requests for instructions that must be fetched from farther away in the memory hierarchy can be initiated and queued for delivery later; the memory has all the information it needs to complete requests efficiently and out of order without help from the CBFU.2

The CBFU must also forward register values to virtual Fleets. The destination of values for an independent code bag is resolved when a virtual Fleet is allocated for the bag. In Armada-1, the CBFU forwards register values by constructing constant move instructions with the register value as the constant and the target register as the destination. The instructions are packed into a cacheline and delivered to the IDU with the appropriate instruction enables and target Fleet tag. This interface is identical to the one that the instruction memory shares with the IDU. The constant move instructions hold the 96 bits of the register value in addition to 32 bits of routing information. Thus, only two of these instructions may be delivered in a single 256-bit instruction cacheline. If all eight registers require forwarding, for example, four cachelines are created and delivered to the IDU. Handling forwarded values by interacting with the IDU in the same way as regular instructions helps to limit the special-purpose hardware needed to deliver this data.

1 The independent code bag descriptor and forwarded register queue is not implemented this way in the simulator described in the next chapter. For simplicity, the queue's size in the model is unbounded; thus, no DMA engine is needed.
2 Recall that this implementation has only a single-level memory hierarchy. Therefore, this decoupling will not affect performance in Armada-1.
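The allocation table itself can be modeled as a simple busy vector. The C++ sketch below captures the two operations the text describes, allocating on an independent descriptor and releasing on a last dependent descriptor; the interface is an illustrative assumption, not the hardware's.

    #include <cstddef>
    #include <optional>
    #include <vector>

    // Sketch of the Fleet allocation table: one busy bit per virtual Fleet.
    class FleetAllocTable {
        std::vector<bool> busy_;
    public:
        explicit FleetAllocTable(std::size_t n_virtual_fleets)
            : busy_(n_virtual_fleets, false) {}

        // An independent code bag descriptor arrives: find a free virtual
        // Fleet, or return nothing (the independent channel then stalls).
        std::optional<std::size_t> allocate() {
            for (std::size_t i = 0; i < busy_.size(); ++i)
                if (!busy_[i]) { busy_[i] = true; return i; }
            return std::nullopt;
        }

        // A 'last' dependent descriptor arrives: free the originating Fleet.
        void release(std::size_t fleet) { busy_[fleet] = false; }
    };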

5.4.2 Instruction dispatch unit

The IDU receives instructions from both the instruction memory and the CBFU, and it outputs move instructions and constants to the specified virtual Fleets. The IDU receives a 32-byte instruction cacheline, an 8-bit instruction enable, and a target Fleet tag as one input data set. An arbiter merges the requests from the CBFU and the instruction memory into one stream; the delivery order does not impact correct behavior.

Upon receiving an input data set, the IDU extracts instructions based on the instruction enable bits, sequentially reading from the lower bytes of the cacheline and working up to the higher-order bytes. If the instruction is a move-immediate type, the IDU extracts the immediate value, sign-extends it to a 64-bit integer, and finally zero-extends the result to 96 bits. If the instruction is a move-constant type, the IDU interprets a number of the following bytes in the cacheline as data and not as instructions. The number of bytes needed to represent the constant is encoded within the instruction. Armada-1 does not allow move-constant instructions and the constant data they reference to extend across cacheline boundaries; the assembler must organize instructions in code bags so as not to violate this requirement.
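The two-step widening of immediates can be expressed compactly. The sketch below assumes a 32-bit immediate field (the actual field width is not specified here) and models the 96-bit native type as a 64-bit value plus a 32-bit out-of-band word, matching the data size described in section 5.3.2.

    #include <cstdint>
    #include <iostream>

    struct Native96 { uint64_t value; uint32_t oob; };   // 96-bit native datum

    // Sign-extend the immediate to 64 bits, then zero-extend to 96 bits
    // (the OOB word on top becomes zero).
    Native96 extend_immediate(int32_t imm) {
        int64_t sext = static_cast<int64_t>(imm);        // sign-extension
        return { static_cast<uint64_t>(sext), 0u };      // zero OOB bits
    }

    int main() {
        Native96 n = extend_immediate(-5);
        std::cout << std::hex << n.value << ' ' << n.oob << '\n';
        // prints fffffffffffffffb 0
    }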

Once the instructions and constants have been extracted, the IDU appends coloring information from the target Fleet tag to them. The instructions and constants are then sent to different queues based on the target physical Fleet core in the tag. As previously described in section 5.3.4, the IDU forwards constants directly to constant reservation stations within the target Fleet's instruction pool and converts constant-move instructions into port-to-port move instructions.

5.5 Context synchronizers

Context synchronizers allow a program's many independent threads of execution to merge before the program executes additional code. Each synchronizer Ship in the different Fleet cores is connected to a central context synchronization unit (CSU). That unit interfaces with main memory to store and update synchronization objects requested by programs. These objects consist of an independent code bag descriptor, up to eight forwarded register values, and a 64-bit count.

When software requests a synchronization object, the local Ship forwards the request to the CSU. The CSU allocates a block of memory for the object and returns a reference to it; in this design, the reference is the physical address of the object in memory. Local context synchronizer Ships similarly forward decrement requests to the CSU. The CSU looks up the reference and decrements the counter. When the counter reaches zero, the CSU releases the barrier by forwarding the independent code bag descriptor and any register value arguments specified in the synchronization object to the FDU for dispatching. The CSU then frees the memory allocated for the synchronization object.
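A sketch of the synchronization object and the CSU's decrement path follows. The struct layout mirrors the description above (descriptor, eight registers, 64-bit count), while the pointer-as-reference convention and the method names are illustrative assumptions.

    #include <cstdint>

    struct Cbd96 { uint64_t lo; uint32_t hi; };          // 96-bit descriptor

    // A synchronization object as described above.
    struct SyncObject {
        Cbd96    cbd;        // independent code bag fetched on release
        uint64_t regs[8];    // forwarded register arguments
        uint64_t count;      // barrier count, decremented by Ships
    };

    class CSU {
    public:
        // Allocate an object; its address serves as the reference.
        SyncObject* create(Cbd96 cbd, uint64_t count) {
            return new SyncObject{cbd, {}, count};
        }
        // A decrement request from a synchronizer Ship.
        // Returns true when the barrier releases.
        bool decrement(SyncObject* obj) {
            if (--obj->count != 0) return false;
            dispatch(obj->cbd);        // forward descriptor + registers to FDU
            delete obj;                // free the object's memory
            return true;
        }
    private:
        void dispatch(const Cbd96&) { /* hand the descriptor to the FDU */ }
    };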

5.6 Communication

Armada-1 components communicate asynchronously using a single-track, bundled-data protocol. Each bit of data is carried by a single wire, high indicating logic 1 and low indicating logic 0. A single control wire signals when the bundle of data wires is driven to the correct value. The sending unit pulls this line high once the data wires are driven to their correct values to send a request, and the receiving unit pulls the signal low once it has read the data lines to send an acknowledge. This signal has a delay matched to the worst-case delay of the data lines so that a request does not reach the receiver before the data lines reflect the desired values. Most bundles of data wires in Armada-1 are 96 bits wide to handle the transfer of data for all native data types at once.

5.7 Conclusion

Although several design trade-offs were made to complete this first design of an Armada processor within a constrained time limit, the end product is capable of functionally testing the independent and dependent code bag programming model, a key contribution of this work for exploiting thread-level concurrency in this one-instruction architecture. The methods introduced to enable concurrent computation, instruction and data coloring and resource multiplexing, should keep die area and leakage power to a minimum while achieving proportionally larger performance gains. The next chapter discusses how the Armada-1 microarchitecture was evaluated using the Mandelbrot benchmark.

Chapter 6

Evaluation of Armada-1

To evaluate the Armada architecture and, more specifically, the Armada-1 microarchitecture described in the previous chapter, a low-level simulator of the design was constructed. The Mandelbrot benchmark was developed using Armada features and evaluated on the simulator. This chapter discusses the simulator and benchmark implementations.

6.1 ArmadaSim

ArmadaSim is a timing-approximate SystemC model of the Armada-1 microarchitecture. SystemC is an open-source modeling environment that provides C++ constructs for building standard modular hardware components [Ope06]. SystemC also provides a reference simulation engine for running the models. Many developers in the SystemC community work to enhance the models and to develop standards that maximize interoperability and reuse. Due to these efforts, supplemental tools are also becoming more readily available for visualization and power estimation. In addition to growing developer-community support and Institute of Electrical and Electronics Engineers (IEEE) standards approval, strong industry support is also a testament to the general acceptance of SystemC as a mature, robust modeling tool.

6.1.1 Structural configuration

The simulator may be centrally configured in an architecture definition file. The number of Fleet cores to instantiate and the number of virtual Fleets to support

per physical core are defined there. The file contains definitions of the widths of commonly used data structures and buses. It also describes the types of Fleet cores to integrate into an Armada. After modifications have been made, the sources are recompiled to generate a new executable model.

6.1.2 Timing model

ArmadaSim simulates the time required for inter-component communication and computation using a centrally-configurable timing model. Timing files can be interchanged to model different implementation choices such as fabrication process technology. The configuration points for simulation events vary in level of detail; some timing values approximate setup and hold times of local wires, while others approximate more abstract behavior such as fetching a cache line from memory or performing an asynchronous multiplication operation. A minimum and maximum time value is specified for each configuration point. At run time, the simulator randomly chooses and applies a delay from this range at each occurrence of a timed event. For example, the simulator may choose different delays for the result latency of an asynchronous multiplier each time it is used. Such a timing model seems appropriate for asynchronous processor implementations, in which completion times for tasks are often data-dependent. This dynamic timing model enables more robust verification and evaluation of Armada-1 than a static timing model could. However, the timing model applies variation randomly and does not consider real-world causes of asynchronous timing variation.
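The per-event delay selection reduces to sampling a uniform distribution at every occurrence. A minimal C++ sketch follows, with illustrative units and ranges; the class and its values are not taken from the ArmadaSim source.

    #include <random>

    // One configuration point of the timing model: a [min, max] delay range
    // sampled anew at every occurrence of the timed event.
    class TimingPoint {
        std::uniform_int_distribution<long> dist_;
    public:
        TimingPoint(long min_ps, long max_ps) : dist_(min_ps, max_ps) {}
        long sample(std::mt19937& rng) { return dist_(rng); }
    };

    int main() {
        std::mt19937 rng(42);
        TimingPoint mul_latency(900, 1400);   // e.g. an asynchronous multiply, in ps
        long d1 = mul_latency.sample(rng);    // each use may see a different,
        long d2 = mul_latency.sample(rng);    // data-dependent-style delay
        (void)d1; (void)d2;
    }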

6.1.3 Statistics gathering

ArmadaSim hardware blocks inherit from statistics-gathering interfaces defined in the source code which enable a variety of measurements. The current implementation provides an interface which enables the measurement of resource utilization over time. Other statistics interfaces may easily be added in future work. In addition to fine-grained measurement, total wall-clock time and simulation time are also reported.

6.1.4 Fleet state checker

Armada requires software to free all virtual Fleet state before releasing that Fleet resource back to the hardware. Due to the large number of concurrently active operations, cleanup of the cores is not straightforward and is prone to errors without a proper methodology. As a debug tool, a state checker was built into the simulator. Whenever a program attempts to free a virtual Fleet, the checker verifies that each Ship is free of any state for the appropriate color. If stray state is discovered, the simulator terminates and indicates the error. The simulator provides a listing of the Ships that have residual state as well as the code bag descriptors responsible for delivering the stray tokens. A current limitation of the checker is that it does not catch stray instructions and data that are in-flight. This enhancement is left as future work.

6.1.5 Value-change dump output

SystemC natively provides useful facilities for generating value-change dump (VCD) files that can be imported into many waveform viewers. ArmadaSim supports this feature, enabling users to view signals connecting both very primitive blocks and large composite structures. The signals are arranged in a hierarchy based on the block structure hierarchy of the architecture to ease signal selection for viewing. Studying waveforms is extremely useful for debugging Armada programs as well as the simulator.

6.1.6 Verification

Verification of the simulator was performed at many levels of the design, from independent testing of primitive blocks to holistic verification of the complete system. At one extreme, each independent block was tested with mostly random inputs to input ports, and the generated output was verified. Blocks that retain state were tested with verification methods that demanded such state was reflected in the outputs over time. When coloring extensions were added to the architecture, inputs of random colors were driven into components at random intervals, and the outputs from the components were verified for each color. Each block was tested in this fashion as it was developed. Nearly all of the design bugs were found and remedied at this stage.

At the next level of verification, two or more of these primitive blocks were put together and tested as a larger component. The inputs provided to these blocks were more directed to likely possibilities, but randomness was introduced when possible. The instruction horn is one example of a block tested at this mid-level verification stage. Many arbiter blocks are connected together to form this much larger block. The horn was tested holistically by inputting random, valid payloads and then verifying that the correct payload reached the appropriate destination output. The few bugs that were found at this stage were mostly errors in how the units were interfaced together.

At the highest level of verification, the primary and mid-level blocks were put together to form a functioning Armada system. The most common bugs found at this stage were differences between how the assembler used to generate executable binaries of the tests and how the implementation of the microarchitecture interpreted opcodes. As Armada and Fleet were very fluid, changes were made to opcode formats regularly that were not always reflected in both the simulator and the assembler, causing them to be out of sync. Additionally, some previously mentioned shortcuts were taken in order to complete the design more quickly; ways of handling these abstractions also had to be kept in sync between the pieces of software, and some discrepancies were discovered and resolved during this high-level validation stage.

6.1.7 Design complexity

A register-transfer level construction of the architecture was not developed, and the man-hours involved in that process are therefore unknown. However, some observations on the complexity involved in the implementation and verification of the Armada-1 simulator are worth consideration given the severity of the design productivity gap previously discussed in chapter 2. The complexity of the simulator development likely reflects the complexity involved in implementing an Armada-1 processor at the Verilog or VHDL level.

The modularity inherited from the Fleet architecture has proven to be extremely useful in decoupling the implementation of units from one another. As described in the previous section, the majority of bugs were found by testing units that were developed in complete isolation from one another. Most units, especially Ships, were simple to design and easy to test and validate. Additionally, Ship designs in particular took advantage of large amounts of reuse. The input and output port communication logic was implemented in a parent object that child Ship objects simply inherited from. Such reuse proved valuable as it kept the number of bugs in the colored-communications logic low and kept the means to fix the occasional error centralized. Finally, the total time required

to implement the working Armada microarchitecture model in SystemC by the researcher was only three months, including a SystemC learning curve. Although these observations may not quantitatively reveal the true complexity involved in realizing Armada microarchitectures in hardware,1 they are strong hints that such complexity will likely be very low for the design and debug phases of the process. The decoupled behavior allows optimizations to individual units to be performed as deemed necessary without incurring the costs of affecting other units.

1 Indeed, asynchronous tools currently lag behind the capabilities and robustness of synchronous ones.

6.1.8 Limitations

ArmadaSim is fairly robust today, with a decent feature set, good performance, and some focused validation. As previously mentioned, the timing model is useful for functional validation purposes but cannot be used to accurately reflect real-world performance in detail. The simulator does not currently support the flow caching enhancement first described in chapter 4.

6.2 Mandelbrot

In order to test the ideas behind Armada, the execution of suitable workloads on an Armada system must be analyzed. Because this research is not looking into memory performance, ideal workloads for analysis favor large amounts of computation over memory accesses. The Mandelbrot benchmark is one such workload. The algorithm contains ample amounts of ILP, TLP, and DLP that capable hardware may take advantage of. Additionally, the control flow is non-trivial, making coding of the algorithm for Armada an interesting challenge. Finally, this benchmark is accepted by commercial markets as one gauge of processor performance; it is used in the SiSoft Sandra benchmarking suite as a multimedia benchmark.

6.2.1 Computation

The Mandelbrot set is a fractal, an object or quantity that exhibits self-similarity at all scales [Wei06]. The set is computed from the equation:

\[
z_{n+1} = z_n^2 + C
\tag{6.1}
\]

with z_0 = C for points C on the complex plane. Points that do not diverge toward infinity when entered into the equation are members of the Mandelbrot set; those that do tend toward infinity are not in the set. An approximation of whether C tends toward infinity may be made by testing whether z_n is greater than some limit bound, LIMIT, for 0 ≤ n < ITER_MAX. Therefore, if z_(ITER_MAX−1) does not exceed LIMIT, the point is assumed to be a member of the Mandelbrot set.

Representations of the Mandelbrot set make interesting fractal drawings. An image resolution is chosen with dimensions window.width × window.height. Each pixel is mapped to a point on a viewport of the complex plane. The viewport is characterized by its top-left corner, width, and height. In most drawings, points in the Mandelbrot set are colored black. Points not in the set are colored differently based on the final value of n, the number of iterations taken to reach LIMIT. Thus, points that tend toward infinity at different rates than others are painted different colors.

The benchmark creates this graphical representation of the Mandelbrot set. A window of some variable resolution is specified. The typical viewport of the complex plane chosen for the benchmark is the range with top-left corner at (−2.5, 1.5), width of 4.0, and height of 3.0. Many points in this area are in the Mandelbrot set; therefore, the benchmark spends the majority of its time executing the innermost loop. Figure 6.1 shows a Java code snippet of the core of the Mandelbrot algorithm.

6.2.2 Parallelism

The Mandelbrot benchmark described has ample amounts of ILP, TLP, and DLP that may be taken advantage of by a highly concurrent architecture like Armada. Some ILP is present in the innermost loop of the algorithm, where the large majority of operations are performed. The innermost loop body consists of nine fundamental arithmetic operations, as depicted in the data flow diagram in figure 6.2. Independent operations can occur simultaneously if their inputs and enough hardware resources are available to handle the multiple operations. The diagram conveys a sense of time through varying lengths of operation boxes and the vertical space between them. A multiply operation is assumed to take twice the amount of time as an addition or subtraction operation; communication of each piece of data takes about half the time of an addition or subtraction operation. Though the exact timings may vary in an asynchronous system like Fleet, the diagram shows which operations are likely to occur simultaneously if sufficient resources are available.

 1  for (int i = 0; i < window.height; ++i) {
 2
 3
 4      c_i = ((double)i / (double)window.height) *
 5            viewport.getHeight() + viewport.getMinY();
 6      for (int j = 0; j < window.width; ++j) {
 7
 8
 9
10          c_r = ((double)j / (double)window.width) *
11                viewport.getWidth() + viewport.getMinX();
12          z_r = z_i = 0.0;
13          for (iter = 0; iter < ITER_MAX; ++iter) {
14              z_r_next = z_r * z_r - z_i * z_i + c_r;
15              z_i = 2.0 * z_r * z_i + c_i;
16              z_r = z_r_next;
17              if (z_r * z_r + z_i * z_i > LIMIT) break;
18          }
19          color_pixel(i, j, iter, ITER_MAX);
20      }
21  }

Figure 6.1: Mandelbrot benchmark inner loops in Java. In the translation to the Armada assembly language, lines 1–5 are mapped to the Main code bag, 6–11 to OuterLoop, and 12–19 to InnerLoop (as shown in figure 6.3).

Figure 6.2: Mandelbrot inner loop operations. Many operations are likely to overlap in the computation-intensive inner loop of the benchmark. This ILP can be exploited by an architecture to reduce the total computation time of the loop body.

The algorithm also contains abundant amounts of TLP and DLP. The same set of operations is performed for every data element or pixel, a characteristic of DLP. Single-threaded architectures may exploit this parallelism by applying SIMD operations to multiple pixels at a time. Alternatively, architectures that efficiently support numerous threads may map each pixel to an independent thread of execution to take advantage of the TLP in the algorithm. Hybrid approaches to using the parallelism are also possible. For example, each row can be treated as a thread. Each thread may then use SIMD instructions to operate on a set of pixels within that row at a single time.

6.2.3 Mapping to Armada

In this section, the process of mapping the Mandelbrot benchmark to the Armada instruction set architecture is discussed. The most challenging aspect of the algorithm to map is the control flow. Using Armada's independent code bags to exploit thread-level concurrency is also of primary interest to the research and is given special attention here. The Armada-assembly code for the benchmark is provided in appendix C.

Implementation structure

The Mandelbrot benchmark for Armada was hand-coded in the Armada assembly language. The final design is a sensible construction, but there was no rigorous optimization stage. Therefore, further improvements are possible. The core of the program is held in nine code bags, three independent bags and six dependent bags. Figure 6.3 shows the relationship between these code bags. Independent code bags have a dotted pattern on them, and dependent code bags are plain. The dashed arrows in the figure describe the fetch of independent code bags, and the solid arrows indicate the fetch of dependent code bags. The shaded regions indicate shared state; an independent code bag and all dependent bags that are fetched by it or any of its descendants share the same state. In Armada, all instructions from such a set of code bags that share state will execute on the same virtual Fleet.

Figure 6.3: Mandelbrot code bags. Each shaded region identifies code that shares state and is executed on one virtual Fleet.

Main is the entry point to the program. Main fetches window.height instances of the independent code bag OuterLoop before finally fetching the dependent code bag Main cleanup to clean up and release the virtual Fleet it occupies. Each instance of OuterLoop fetches window.width instances of the independent code bag InnerLoop before cleaning up by fetching the dependent code bag OuterLoop cleanup. Thus, Main and OuterLoop work together to fetch window.height × window.width instances of InnerLoop — one instance for each pixel of the window. Each of these instances has independent state and may therefore execute on many virtual Fleets concurrently. Figure 6.4 illustrates this technique of releasing the InnerLoop code bags.

Figure 6.4: Mandelbrot code bag fetch scheme. Main fetches one OuterLoop bag for every row of the window. These bags may execute concurrently and will each fetch one InnerLoop code bag for every column in their row. All window.height × window.width InnerLoop bags may execute concurrently to color in each pixel of the window.

A typical count-controlled for loop with no early exits is implemented using a stride Ship as the counter, a compare Ship to perform the loop conditional test, and a selector Ship to choose between two sets of actions to perform depending on the result of the comparison. One set of actions would be to iterate the counter and repeat the loop body; the other set would be to exit the loop and perhaps continue with additional behavior. This technique is applied in the simple example shown in figure 4.5(a). The Main and OuterLoop code bags work in this way.

Each InnerLoop bag must know the coordinates of the pixel it is operating on so that it can write to the correct address in the frame buffer. Therefore, Main and OuterLoop must minimally forward this information to the InnerLoop bags using Armada's register forwarding scheme. Main passes both the row number, i, and the imaginary component of the point on the complex plane to test, ci, to the OuterLoop bags it fetches. Because ci is the same for every column in any single row, the calculation need only be performed once per row. OuterLoop leaves these values untouched, forwarding them to each InnerLoop bag it fetches. OuterLoop additionally forwards the column number, j, and the real component of the point to test, cr, to the InnerLoop bags.
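For readers more familiar with conventional threading, the decomposition and forwarding scheme is equivalent in spirit to the following C++ sketch, where spawning a thread stands in for fetching an independent code bag and the function arguments stand in for forwarded registers. The viewport constants match the benchmark's; everything else is illustrative.

    #include <thread>
    #include <vector>

    constexpr int    HEIGHT = 24,  WIDTH = 32;       // benchmark resolution
    constexpr double MIN_X = -2.5, MIN_Y = -1.5;     // viewport minimum x and y
    constexpr double VP_W  =  4.0, VP_H  =  3.0;     // viewport width and height

    void inner_loop(int i, double ci, int j, double cr) {
        // iterate z = z^2 + c and color pixel (i, j), as in figure 6.1
    }

    void outer_loop(int i, double ci) {              // one per row, like OuterLoop
        std::vector<std::thread> pixels;
        for (int j = 0; j < WIDTH; ++j) {
            double cr = (double)j / WIDTH * VP_W + MIN_X;
            pixels.emplace_back(inner_loop, i, ci, j, cr);  // forward i, ci, j, cr
        }
        for (auto& t : pixels) t.join();
    }

    int main() {                                      // plays the role of Main
        std::vector<std::thread> rows;
        for (int i = 0; i < HEIGHT; ++i) {
            double ci = (double)i / HEIGHT * VP_H + MIN_Y;  // computed once per row
            rows.emplace_back(outer_loop, i, ci);
        }
        for (auto& t : rows) t.join();
    }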

Cleanup

InnerLoop applies the Mandelbrot set equation (equation 6.1) to the point on the complex plane forwarded to it in a tight loop. This for loop is more complex than the simple for loops handled by Main and OuterLoop; it must additionally manage an early exit condition. The count-controlled part of the loop terminates when n = ITER_MAX. However, the loop will terminate early if the value produced by the Mandelbrot equation exceeds LIMIT. Exiting due to loop counter expiration leaves the Fleet in a state that is not equivalent to the state it is left in if this early exit condition causes termination. Because of the different residual states, two different code bags of cleanup instructions, one for each case, are required to free the occupied Fleet (figure 6.5).

Figure 6.5: InnerLoop control flow.

Cleaning up a Fleet is not a trivial task, but the application of some useful techniques can simplify the problem. The process used to track and clean up residual state consisted of maintaining state tables for each virtual Fleet. First, one state table is built. Starting with the independent code bag, each input and output port of every Ship is score-boarded as instructions reference them. A couple of cases require special consideration. Data may queue up behind write-persist data sent previously to the same destination. Additionally, standing instructions may forward data automatically for some time, which must also be accounted for in the table.

In the case of conditional branching behavior caused by the conditional fetch of different dependent code bags, the state table is duplicated so that one table is maintained for each branch; trees of branches result in trees of state tables. When a branch terminates, the state table identifies all residual state in the system that must be cleaned up. When a branch loops back to an already visited section of code, the difference between the current table and the table as it was when first entering that section of code is computed. This residual state must be cleaned up before looping back and re-executing that code. These tables were drawn by hand for the coding of this program, but, as described in part III of this thesis, such a process can be automated.
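The table-diff step at a loop back-edge is the mechanical core of this process. A C++ sketch follows, representing a state table as the set of ports currently holding residual state; the port-name representation is an illustrative assumption.

    #include <set>
    #include <string>

    using StateTable = std::set<std::string>;   // ports holding residual state

    // On a back-edge: state present now but absent when the code section was
    // first entered must be cleaned up before re-executing that section.
    StateTable residual_on_loop(const StateTable& at_entry,
                                const StateTable& now) {
        StateTable diff;
        for (const auto& port : now)
            if (at_entry.count(port) == 0) diff.insert(port);
        return diff;
    }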

6.3 Conclusion

This chapter described the ArmadaSim simulator of the Armada-1 microarchitecture. The Mandelbrot benchmark was ported to the architecture and makes use of the thread-level concurrency and context synchronization enhancements described in chapter 4. The benchmark's large amounts of obvious parallelism and its low dependence on memory behavior made it an ideal test of Armada-1 given the current progress and level of detail in the design. The next chapter discusses the results from simulation runs of the benchmark on ArmadaSim.

Chapter 7

Microarchitecture results and discussion

The Mandelbrot benchmark described in the previous chapter was executed on different configurations of the Armada microarchitecture model. This chapter presents these results and discusses the ability of the microarchitecture to exploit instruction-level and thread-level concurrency. Static and dynamic code size is evaluated in comparison to existing commercial architectures. Finally, the chapter touches on the asynchrony aspect of Armada performance and performance modeling.

7.1 Performance overview

The Armada extensions supporting thread-level concurrency allow the Mandelbrot benchmark run time to scale with respect to the number of hardware threads. The compiled benchmark does not require modification to utilize additional Fleet cores present in an Armada. Figure 7.1 shows the speedup achieved by various configurations of the Armada-1 microarchitecture over the base case of a single-core, single-threaded implementation.

As shown in figure 7.1(a), Armada-1 achieves a greater than 12x performance improvement over a single, non-SMT Fleet core. A single-core SMT configuration, denoted by the diamonds in the figure, is limited to less than an 8x performance improvement despite the number of simultaneously-executable threads supported. This data indicates the presence of performance bottlenecks within the core. Multiple-core configurations hit performance bottlenecks when


Figure 7.1: Overview of Mandelbrot performance on Armada-1. (a) shows performance improvement relative to the total number of threads supported by all cores in the microarchitecture. (b) shows performance improvement relative to the number of threads supported by each individual core.

16 threads are supported regardless of whether the multithreading is achieved through core replication or simultaneous multithreading on a single core (figure 7.1(b)). This data indicates a bottleneck in the units that are shared among all cores. Figure 7.2 examines the bottlenecks in detail for different configurations of a dual-core Armada-1 implementation.

Figure 7.2: Resource utilization of performance-limiting units in a dual-core Armada-1 implementation. The trunk and horn utilization data shown is an average of the utilization measured on each core.

Using the utilization data collected from the simulations, the key core-specific bottleneck was identified as the trunk element of the switch fabric. The shared components that limit performance are the fetch and dispatch unit, instruction cache, and instruction dispatcher. It is possible to improve the throughput of these structures by modifying their implementation. Although these modifications are beyond the scope of this work, figure 7.3 shows the performance improvement possible if each of these components is effectively sped up by a factor of ten. The results presented in the remainder of this chapter come from simulating the benchmark on this hypothetical, optimized implementation.

With sixteen 8-way simultaneously-multithreaded cores, the Armada-1 processor delivers a nearly 100-fold increase in performance over a single Fleet core. Supporting more than eight threads per core does not result in further performance improvement (figure 7.3(b)). In this optimized implementation, the primary bottlenecks are within the cores and no longer in the shared units. For the Mandelbrot benchmark, the floating-point addition and multiply Ships are highly utilized and are the limiting factors on performance.

Figure 7.3: Overview of Mandelbrot performance on a hypothetical, optimized implementation of Armada-1. This data reflects a 10x speedup in the fetch and dispatch unit, instruction cache, instruction dispatcher, and switch-fabric trunk element. (a) shows performance improvement relative to the total number of threads supported by all cores in the microarchitecture. (b) shows performance improvement relative to the number of threads supported by a single core.

7.2 Exploiting concurrency

In Armada, instruction-level concurrency is discovered and represented statically by software. Code bags contain instructions that can be fetched and executed in any quantity and order. The amount of work that software requests the fetch and dispatch unit to do concurrently for a single thread is proportional to the number of instructions in the bag being fetched. However, the execution of move instructions is additionally dictated by data-flow ordering. Empirically, it is possible to determine how many operations are being performed simultaneously by counting how many Ships are active at any given time. Figure 7.4 shows the number of concurrently active Ships over the entire benchmark on single-core and 16-core Armada-1 configurations.

The single-core histogram shows that for non-SMT configurations, only one Ship is active at any time for the vast majority of the benchmark run time. The SMT behavior added to Fleet in Armada provides more work to perform concurrently, ultimately improving the benchmark run time. The 16-core histogram shows that the number of concurrently active Ships actually decreases when moving from an 8-way SMT to a 16-way SMT configuration. This pullback in the number of concurrent operations correlates with the decrease in overall performance between the configurations observed in figure 7.3. Possible reasons for this performance degradation are discussed in section 7.5.

7.2.1 Instruction-level concurrency

The amount of instruction-level concurrency in the Mandelbrot benchmark that Armada-1 can capitalize upon is reflected in the number of concurrent operations that take place in the single-core, single-threaded configuration shown in figure 7.4(a). This data accounts for the result latencies of the operations as well as data-flow dependencies. Over 80% of the time, only one operation is active at once, showing the necessity for fast single-threaded performance in the absence of other exploitable forms of concurrency.

Armada-1 is strongly limited in single-thread performance because of insufficient pipelining. As discussed in section 4.1, the Armada-1 switch fabric does not have a flow-control mechanism for pipelining standing instructions. The Ships are also not pipelined. Additionally, Armada-1 does not support branch prediction or data speculation, both of which can increase ILC. However, these deficiencies can be partially overcome by exploiting thread-level concurrency.

Figure 7.4: Concurrently active Ships in Armada-1 running Mandelbrot. These histograms show the number of concurrently active Ships over the duration of the benchmark. (a) describes one physical core with one-, four-, eight-, and sixteen-way SMT configurations. (b) describes a sixteen-physical-core configuration with each core having one-, four-, eight-, and sixteen-way SMT configurations.

7.2.2 Thread-level concurrency

Like ILC, thread-level concurrency is also discovered and represented statically by software. The number of independent code bags that can be fetched concurrently determines the maximum degree of TLC the hardware may achieve. For the majority of tests performed, the resolution was limited to 32x24 pixels to control simulation time, making about 770 threads available for concurrent execution. As seen in figure 7.4(b), many-core configurations like the 16-core ones shown enable large numbers of operations to occur concurrently. By having more cores available, core-local bottlenecks are mitigated. By increasing the multithreading capacity of each core, oft-idle units are more highly utilized, thus lowering benchmark run time with minimal hardware overhead. Workloads that do not use many threads will not benefit from Armada's TLC extensions.

7.3 Code density

Instruction set density, or code density, is a significant factor in system performance and cost. Dense instruction streams require less system bandwidth than sparse ones, often resulting in better performance and lower power consumption. Instruction caches can be utilized more efficiently, thereby reducing the frequency of costly off-chip memory accesses.

Armada generally requires more instructions than a traditional RISC architecture to represent the same computation. In Fleet architectures, the communication involved in computations is explicitly described by software whereas it is implicit in RISC and CISC architectures. Table 7.1 compares the code size and amount of instruction fetch traffic for three different architectures running the Mandelbrot benchmark.

Architecture   Code size (bytes)   Num exec insts   Inst traffic (bytes)
MIPS           180                 424,860          1,699,440
x86-64         200                 582,467          2,228,524
Armada-1       756                 761,072          3,401,640

Table 7.1: Code size and instruction fetch traffic comparison. MIPS code compiled with gcc v4.1.2 -O2 optimization. x86 code compiled with gcc v4.3.2 -O2 optimization. Armada version is hand-coded.

The ideal number of physical Ships for a general-purpose Fleet architecture is related to the amount of concurrent operations possible in a program. In Armada-1, 7-bit addresses for both input and output ports were chosen somewhat arbitrarily to fill a 32-bit instruction. The composition of Ships in the cores was largely influenced by the requirements of the Mandelbrot benchmark. Future implementations should take into account the concurrency available in typical programs when selecting processing resources. An analysis of concurrently active units, similar to but more detailed than the analysis described in section 7.2, would be useful in determining the quantities of different types of resources to put into a Fleet core. Additionally, the ideal number of bits needed for addresses could be optimized with the same procedure. Optimizing address size will reduce instruction size and instruction-related traffic.

7.4 Code bag organization

Fleet proposes that concurrently-executable instructions all be placed into a code bag. The order the hardware chooses to fetch and issue the instructions is out of software control. While this representation is useful in allowing hardware to fetch some non-contiguous instructions as they are available, software-provided hints on execution order can help to improve performance. Armada-1, for example, attempts to fetch instructions with lower physical addresses before fetching those with higher physical addresses from a code bag. If software were aware of this preference, it could place move instructions in the bag based on the level at which the corresponding arcs appear in the data-flow graph. High-priority arcs at the lower, leaf levels of the graph would be positioned at the lowest addresses in a code bag, and arcs at the highest level of the graph would be positioned at the highest addresses. The difference in execution time of the Mandelbrot benchmark when move instructions are positioned in these two opposing arrangements for Armada-1 16-core configurations varied between 0.1% and 6.4% as shown in figure 7.5. However, somewhat unexpectedly, the instruction orderings that were considered more ideal did not always yield a performance improvement. For the 16-way SMT case, for example, the arrangement considered non-ideal actually performed 6.4% better than the other arrangement, the largest variance measured. The next section explores possible reasons for such behavior.

Figure 7.5: Effect of intra-code bag instruction arrangement on Mandelbrot benchmark performance. The improvement of the first-needed prioritized instruction arrangement over the last-needed, inverse-priority arrangement is shown.

7.5 Asynchronous system performance

For the Mandelbrot benchmark, a Fleet core's performance saturates and may actually decrease when more than eight contexts are active at once, as shown in figure 7.3(b). The Fleet funnel-and-horn switch fabric exhibits high occupancy within the funnel and relatively low occupancy in the horn. The speed of the trunk element limits the rate at which data can enter the horn. Merge elements within the funnel rarely have multiple inputs to propagate in the single-threaded case, as shown by the low concurrency in figure 7.4. As more threads are allowed to occupy a single core, the funnel occupancy rises. When more than one input is present at a merge element, there is an additional delay incurred in the propagation of the second datum. ArmadaSim simulates a small delay required for the merge element to respond to an acknowledge from the successor element indicating that the forwarded data has been received. This reverse latency is hidden from the pipeline through that merge element if no other data is waiting to pass through it. Conversely, if another datum is at the merge element's input port, it must wait for that delay before the merge element processes it and moves it forward.

When more than 200 threads are active at once on any multicore, multithreaded configuration tested, performance begins to degrade further. This degradation, visible in the 8- and 16-core configurations of figure 7.3(a), is caused by saturation of the shared units in the processor model. The fetch and dispatch unit and the instruction memory become overloaded, reducing their overall throughput. Similar behavior has been observed and reported on in asynchronous pipeline and ring buffer experiments where throughput is limited by the availability of holes in the pipeline [MJC+99, Wil94].

7.6 Resource multiplexing

Armada-1 explored just one approach to increasing resource utilization, an approach that only reaps benefits when many threads execute concurrently. An alternative way of both increasing utilization of Ships and reducing the size of the switch fabric is to allow some port addresses in move instructions to map to the same physical Ship input ports. Some of the bits in the address would route data to the appropriate location, and the remaining bits could disambiguate the contexts to which the data applies. For example, if 5-bit addresses are used, the lower three address bits could be used to route data to the correct physical Ship, and the remaining two bits could be used to identify the context to which that data applies. This scheme may decrease the number of Ships needed in a core and increase Ship utilization among instructions within an individual thread. It is very similar to the tagging or coloring scheme used to distinguish between thread contexts in Armada-1.

Additionally, the costs of implementing SMT using the virtual channels in Armada-1 may outweigh the simpler alternative of increasing the number of physical cores and allowing only one context per Fleet. The appropriate trade-off is largely dependent on the targeted process technology for the processor. The exploration of this design space is beyond the scope of this work.
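As a concrete illustration of the address-splitting scheme above, the following C sketch decodes a 5-bit port address into a Ship index and a context tag. The type and function names are hypothetical, chosen only to match the example; they are not part of the Armada ISA.

#include <stdint.h>

/* Hypothetical decoding of a 5-bit move-instruction port address into
   a physical Ship input port (low three bits) and a context tag
   (remaining two bits). */
typedef struct {
    uint8_t ship_port; /* selects one of eight physical Ship input ports */
    uint8_t context;   /* disambiguates one of four sharing contexts */
} port_route_t;

static port_route_t decode_port(uint8_t addr5) {
    port_route_t r;
    r.ship_port = addr5 & 0x7;        /* lower three bits route the data */
    r.context   = (addr5 >> 3) & 0x3; /* upper two bits tag the context */
    return r;
}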

7.7 Conclusion

This chapter demonstrated that the Armada architecture is effective at capitalizing on the thread-level concurrency available in a benchmark like Mandelbrot. Nearly a 100-fold increase in performance is realized over a single-core Fleet by replicating cores, employing simultaneous multithreading using virtual channels, and by allowing software to spawn threads easily with Armada's addition of resource-independent code bags.

While hand-coding a benchmark for an unconventional processor like Armada is possible, compiling programs directly from a high-level language automatically is often much more difficult. The next part of this thesis shows that ubiquitous, imperative-language compilers can successfully target the Armada architecture. This capability provides further evidence that the Armada architecture may be used as the foundation for a general-purpose computer processor.

Part III

An Armada compiler

Chapter 8

Armada Procedure Call Standard

Compilers transform high-level languages into machine code for the target architecture. The compiler, as well as low-level programmers that write the native machine code, must adhere to a procedure call standard to ensure that the code they generate will cooperate with other code libraries and modules that their code may reference. Although such standards are relatively straightforward, if tedious, the Armada Procedure Call Standard (APCS) introduced here ensures that both serially- and concurrently-called code using Armada's novel resource-dependent and resource-independent code bags can work interchangeably. Readers familiar with procedure call standards for traditional architectures may wish to skip to section 8.5, which discusses Armada-specific behavior.

8.1 Overview

There are four stages in a procedure call [BD95]. The first stage is the caller prologue, where the caller prepares to invoke the subroutine. The second stage is the callee entry, where the subroutine takes control over which instructions are executed. In the third stage, the callee exit, the subroutine relinquishes control back to the caller. Finally, the caller recovers from the procedure in the caller epilogue stage.

As concurrency is an essential part of the Armada architecture, there are two procedure call methods in the APCS. The serial method is commonly used to call functions in parts of the program dominated by control flow code where actions are serialized. There is also a concurrent method of calling functions that may spawn one or more threads of execution to exploit thread-level concurrency available in programs.

The APCS therefore describes the caller and callee function responsibilities for both the serial and concurrent cases. The following sections discuss these responsibilities. First, the behavior of the key resources involved in procedure calls, the register file and the stack, is described. Then the standard for how data is passed between caller and callee is described. Finally, the serial and concurrent methods of calling procedures are discussed.

8.2 Register file

Unlike traditional architectures, the purpose of the Armada register file is not solely to hold often-used values close to the processing elements; the file’s primary purpose is to transfer arguments to execution contexts of independent code bags. A caller function will place arguments to the callee in the registers and fetch an independent code bag. That fetch operation will consume the values in the register file, package them with the independent code bag descriptor, and place the resulting independent context fetch transaction into a queue. The new context will eventually be allocated to a Fleet tile, the arguments loaded into the register file of that Fleet, and the bag of instructions executed.

8.2.1 Behavior

Armada registers are either full of data or empty. The fetch unit will block1 if an independent code bag descriptor cannot be fetched because one or more argument registers are empty. A bit-field in the code bag descriptor specifies the registers that must be full prior to executing the successor code bag. The register file may not be used at all when fetching dependent code bags that will execute on the same physical tile because data on any Ship output port may be accessed directly.

8.2.2 Listing

The Armada-1 architecture provides eight 96-bit general-purpose registers that have a role in calling procedures. These registers were designed to hold complete code bag descriptors as well as 64-bit data with a variety of OOB values. Register use depends on how the procedure call is invoked by the caller. A0−A4 may hold arguments to pass to procedures. SSP may hold a software-allocated successor stack pointer to be used by a callee. LSP holds the local stack pointer for the currently active routine. LR holds the independent code bag descriptor of the code to execute after the callee completes. A full register listing is shown in table 8.1. The registers' involvement in the procedure calling convention is described in detail in subsequent sections.

1 The fetch unit will appear to block from the software's perspective. The hardware may prefetch the instructions within the code bag onto a Fleet tile or perform other setup behaviors in anticipation of the register data arriving.

Register name   Alias   Description
R0              A0      argument passing
R1              A1      argument passing
R2              A2      argument passing
R3              A3      argument passing
R4              A4      argument passing
R5              SSP     successor stack pointer
R6              LSP     local stack pointer
R7              LR      link register

Table 8.1: Armada-1 register file listing.

8.3 Stack

The APCS requires stacks to be full descending. In a full descending stack, the stack grows to lower addresses (descending), and the stack pointer points to the last element on the stack (the pointed-to memory is full). Thus, when a new item is pushed onto the stack, the stack pointer must first be decremented to allocate space for that item. The stack must maintain 64-bit alignment across procedure calls. Empty buffering space should be added if needed to maintain this alignment. Argument and return value passing are discussed in section 8.4.
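A push on such a stack can be sketched in C as follows; this is a minimal illustration of the full descending discipline, and the names are illustrative rather than APCS-defined.

#include <stdint.h>

/* Push a 64-bit item onto a full descending stack: decrement the
   stack pointer first, then store, so the pointer always addresses
   the last (full) element. */
static uint64_t *push64(uint64_t *sp, uint64_t item) {
    sp -= 1;    /* grow toward lower addresses */
    *sp = item; /* sp now points at the new top-of-stack element */
    return sp;
}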

8.3.1 Allocation policy

For the concurrent procedure call cases, a single stack cannot easily be shared among many threads. Therefore, the APCS accommodates dynamic stack allocation for callee threads. Many hardware and software mechanisms in the literature may be adapted to support this behavior [EO88, VH99, BMBW00]. A subroutine's stack may be allocated by software or by hardware. The two methods have different advantages and disadvantages as discussed in the following sections. The caller function decides which method to use on a per-call basis. The compiler or programmer can choose the best method to use for each call and thus use both models within a single library or program.

Software allocation

Software may allocate the successor stack directly and pass this pointer to the callee. In this model, the software must supply the successor stack pointer in the SSP register when fetching an independent code bag descriptor. Software determines how the stack memory is allocated. For example, the stack space may be allocated on the heap by a thread library or may simply be a copy of the parent stack pointer’s current location.

Hardware allocation

Alternatively, the program can have the hardware create a stack dynamically. To decrease the amount of time required to allocate these stacks, hardware assists in the stack allocation process by maintaining a stack pool of pre-allocated and recycled stacks. The hardware, with operating system assistance2, will provide a stack pointer automatically for the callee by populating the SSP register.

8.3.2 Choosing the allocation policy

The stack policy may vary for each subroutine call. In some cases, a shared, linear stack like those typically used in traditional architectures may not be adequate, as several callees may be capable of running concurrently and may need their own independent memory. The Armada architecture's hardware support for stack allocation provides fast allocation and deallocation of stack memory for the concurrent execution of subroutines. However, if a caller must execute a single procedure call serially, the preferred method is to avoid stack allocation and release penalties by using the shared linear stack method and allowing the single successor subroutine to grow down from the end of the parent's stack. Amdahl's law tells us that the sequential segments of a program are the limiting factors to speedup in computer systems, so it is worthwhile to optimize the serial cases as much as possible.

2 Although the hardware still requires operating system assistance, an intelligent design will amortize the high software intervention cost by managing stack allocation and release in hardware when possible, perhaps by using stack caching and recycling mechanisms. Armada-1 implements such a system.

8.4 Data placement

The caller and callee exchange data in a similar way for both serial and concurrent call methods. This section describes how the caller transfers arguments to the callee and how the callee returns values to the caller.

8.4.1 Argument passing

The caller prologue is responsible for passing input arguments to the callee. Registers are the preferred means of transport, and the stack is used when the capacity of the argument passing registers is inadequate.

Argument registers

The 96-bit registers A0−A4 are argument passing registers. The argument passing registers contain the input arguments for the callee. 96 bits are enough to contain whole code bag descriptors as well as all other primitive Armada machine types with OOB values.

The arguments passed via the argument passing registers are loosely packed such that each primitive type occupies an entire register regardless of its size. For example, five 16-bit characters will occupy A0−A4 even though they could be packed into a single register. This restriction allows data to be passed from caller to callee without incurring packing and unpacking overhead.

Arguments are passed in ascending order using the argument passing registers. Argument one is passed to the callee via A0, argument two is passed via A1, and so forth. If there are more arguments than argument passing registers, the arguments overflow onto the callee stack. For example, if a callee takes nine integers as arguments, A0−A4 would contain arguments one through five, and arguments six through nine would be pushed onto the callee stack; see the sketch below. Data placement on the stack is discussed in a later section.
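The nine-integer case can be written as a hypothetical C prototype with the assignments noted in comments; this is an illustration of the convention described above, not compiler output.

void nine_ints(int a1, int a2, int a3, int a4, int a5, /* A0 - A4 */
               int a6, int a7, int a8, int a9);        /* callee stack */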

Aggregate types

Non-native aggregate types such as data structures passed to the callee with pass-by-value semantics may not fit in a single register. Such structures are passed over multiple registers if the entire unpacked structure can fit within the available registers. For example, if a callee has two parameters, an integer and a data structure comprised of two doubles, the integer would be passed in A0, the first double within the structure in A1, and the second double within the structure in A2. Structures that are too large to fit in their entirety in the available argument passing registers are pushed onto the callee stack. For example, if a callee takes two arguments, an integer and a data structure comprised of five doubles, the integer would be passed in A0, the data structure would be pushed onto the callee stack, and A1−A4 would be empty.

When there is a mix of non-native aggregate types and native types passed as arguments to a callee, arguments are positioned in the following way. First, the arguments are placed in the argument passing registers if possible, regardless of whether they are native types or aggregate types. As described earlier, an aggregate type may not fit in its entirety in the available argument passing registers. If this is the case, it is skipped over for the remainder of the first pass. The remaining arguments are visited and placed into the argument passing registers if possible. Once the argument passing registers are full or only arguments of aggregate types that do not fit in the available argument passing registers remain, the unplaced arguments are placed in reverse sequence on the callee stack. This scenario is demonstrated in figure 8.1.

struct elephant {
    char var_array[1000];
};

struct mouse {
    int x, y, z;
};

void proc(int a, struct elephant b, double c, struct mouse d,
          int e, int f);

Figure 8.1: Argument passing example with native and aggregate types.

In this example, the argument for parameter a is placed in register A0. The aggregate-typed argument for parameter b is skipped over for now as it will not fit in the four remaining argument passing registers. c's argument is placed in register A1. Finally, the aggregate-typed argument for parameter d is placed in registers A2−A4 (d.x, d.y, and d.z respectively) since it does fit in its entirety into the available argument passing registers. The remaining arguments are skipped over as there are no more argument passing registers available. Then, the unplaced arguments are revisited in reverse sequence beginning with the argument to parameter f and pushed onto the callee stack.

Overflowing to the stack

The callee stack provides additional storage for arguments that do not fit in the argument passing registers. Arguments are pushed onto the stack in reverse order. For the example in figure 8.1, the argument for f is pushed first, followed by the arguments for e then b. Aggregate types for parameters such as b have each of their members pushed on the stack in reverse order as well. If b.var_array[999] is at address 3999 on the full descending stack (byte-addressed, base 10), then b.var_array[0] is at address 3000. As a result, an object of an aggregate type residing in the heap, statically positioned in memory, or residing on the stack is accessed in the same way given a pointer to the base of the object.
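A minimal sketch of that uniformity in C, reusing the struct elephant definition from figure 8.1 and the example addresses above (and assuming 8-bit chars so that each element occupies one byte):

struct elephant { char var_array[1000]; }; /* definition from figure 8.1 */

char read_last(void) {
    struct elephant *pb = (struct elephant *)3000; /* base of b on the stack */
    return pb->var_array[999];                     /* reads address 3999 */
}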

8.4.2 Automatic variables

Automatic variables, or automatics, are variables that are created and destroyed as program flow enters and exits scoped portions of a user program. For example, variables used solely by a single function and defined within that function are automatic variables. Generally, automatics do not occupy registers but instead reside directly at Ship output ports as they are created. However, if an occupied Ship must be used for another purpose or the parent function otherwise needs to bank an automatic variable, it may store the variable onto the end of the stack as in traditional architectures. This is useful in many cases, including in performing recursive function calls.

8.4.3 Return values

Callee return values are passed back to the caller on the stack. Although in the serial case it is possible to return small values back via one or more registers, this method does not work for the concurrent case where many callees may need to return values to a caller at once. Having the return values returned on the stack guarantees that the necessary storage for those values is available. Placing the values on the stack for both serial and concurrent cases guarantees code uniformity in accessing those values; it allows a single implementation of a function to be called serially or concurrently. If the APCS allowed different methods of returning values for serial and concurrent cases, functions that could potentially be called both serially and concurrently would have to be implemented two different ways in order to adhere to the standard. This replication would increase code size and software maintenance cost, and invite confusion for the programmer.

8.5 Procedure calls

The following sections discuss the caller and callee responsibilities for serial and concurrent methods of calling procedures. Particular attention must be paid to the handling of the two stack pointers. The local stack pointer (LSP) points to the last element on the stack for the actively running function. The successor stack pointer (SSP) points to the last element on the stack for the successor subroutine being called. When fetching an independent code bag, the Armada hardware will move the SSP into the LSP position for the called function.

8.5.1 Serial procedure calls

In a serial procedure call, the caller function jumps to a single callee. Upon termination, the callee returns control back to the caller. The process in Armada is very similar to the process in traditional architectures. The following sections describe caller and callee requirements for the four stages of the function call process.

Caller prologue

In preparation to call a function, the caller must make room for the callee's return value on the stack, put arguments into the appropriate registers or stack locations, and place a code bag descriptor for the callee to fetch once it has completed. Figure 8.2(a) shows the layout of the stack after the prologue has set it up for the function call. The caller passes the new location of the stack frame to the callee via the SSP register.

Callee entry

When fetching an independent code bag descriptor, the Armada hardware repositions the SSP value from the caller into the LSP position of the callee, effectively setting up the local stack frame for the subroutine. The callee may allocate additional space on the stack for local, automatic variables as shown in figure 8.2(b). The callee can reference the arguments passed in by the caller from the register file or the stack as appropriate.

Callee exit

Upon exiting, the callee uses the return value pointer provided by the caller to store its result, if any, to memory. The successor stack pointer is incremented to remove the callee’s arguments, automatic variables, and return value pointer from the stack as in figure 8.2(c). Finally, the callee fetches the independent code bag descriptor provided by the caller in the link register (LR) and terminates.

Caller epilogue

The callee fetches the caller epilogue code bag prior to terminating. This code bag may now access the callee return value. As with the caller prologue to callee entry phase, the hardware moves the SSP set up by the callee into the LSP position when returning to the caller, as shown in figure 8.2(d).

8.5.2 Concurrent procedure calls

Spawning threads is designed to be easy with the Armada architecture. The APCS describes how procedures can be called to execute concurrently. Program flow can then later be resumed after a barrier point once the threads have completed.

(a) Caller prologue. (b) Callee entry. (c) Callee exit. (d) Caller epilogue.

Figure 8.2: Stack layout for a serial procedure call. The caller allocates space for the return value, if any, and passes a pointer to that memory and any arguments to the callee (a). The SSP in the caller is transferred to the LSP register in the callee, and the callee allocates any additional space it needs for automatic variables onto the stack (b). Upon callee exit, the LSP is discarded, and the SSP is positioned to effectively remove all local stack memory used by the callee (c). Finally, upon return to the caller, the SSP is again transferred to the LSP register by the hardware, and the original caller has the same LSP as before the call was made (d). The LSP points to the return value, if any, left by the callee.

Caller prologue

The caller may spawn one or more threads to execute concurrently while it continues running itself. The spawned threads may have one or more barriers that synchronize execution before continuing on with the program flow.

The caller function allocates separate stacks for the threads it creates. Unique stacks allow callee threads to execute freely without interfering with one another's local variables. The stacks may be allocated from the heap by the caller directly. Additionally, hardware may provide accelerated stack-allocation support as in Armada-1 (see appendix B). The caller allocates space for the return values of each callee either on its own stack or on the heap. Only the pointers to these memory locations are passed to the callees. Arguments to each callee are passed in through registers when available and the callees' individual stacks as needed. The stack layout for the concurrent caller prologue with return value storage allocated on its stack is shown in figure 8.3.

The caller must also set up any barrier code needed to synchronize execution among the threads it calls. The caller allocates barrier objects from the hardware as needed. Software supplies the number of threads to wait for, the independent code bag descriptor to fetch when those threads have completed, and arguments for that post-barrier code. The hardware then returns a reference to the barrier object. The caller places this barrier reference onto the stacks of callee threads. The caller passes a code bag descriptor to veneer code to the callees to handle the barrier object. That process is described in the caller epilogue section.

Callee entry and exit

The callee entry and exit are identical to the serial case described previously. This symmetry enables functions to easily be called serially or concurrently as needed. Although this symmetry aids writing reentrant functions, it cannot alone guarantee thread safety. The functions may, for example, have unchecked access to global variables which can cause those routines to fail if run concurrently. It is the compiler's or programmer's responsibility to ensure such accesses are avoided or guarded.

(a) Caller prologue. (b) A callee stack set up by the caller. (c) Barrier veneer code entry.

Figure 8.3: Stack layout for a concurrent procedure call. The caller allocates space for the return values, if any. In this example, the caller allocates space for those values on its own stack (a). The caller then allocates stacks for each callee. If the callee threads must all complete before control is returned to the caller, the caller will allocate a merge-task resource from the hardware and push a reference to it onto each callee stack. The caller will then replace the LR value with a code bag descriptor for veneer code that handles the task merging. The caller also passes return-value pointers unique to each callee and any arguments to the callees (b). The callees behave identically to the serial procedure case. If a merge-task is used, callees will fetch veneer code that decrements the merge-task reference counter when they each complete (c). The hardware then detects when all callees have terminated, and fetches a successor code bag to execute.

Caller epilogue

The veneer code set up by the caller decrements the barrier object count using the barrier reference provided on the stack. The stack may now be freed or may be freed after the barrier. Freeing the stack now, when possible, is the preferred method so as to hide this latency from program execution as much as possible. This veneer code bag then terminates. Program flow is continued when the barrier object reaches a zero count indicating that the spawned threads have completed. The hardware automatically fetches the post-barrier code to execute as set up by the caller prologue.

8.6 Conclusion

The APCS describes the rules that software must follow when writing or generating assembly code to guarantee interoperability of code. Serial and concurrent methods of calling functions described by the APCS make use of the Armada architecture's independent code bags and thread-level concurrency support. The next chapter describes a compiler that can generate APCS-compliant code from a high-level imperative programming language.

Chapter 9

Compiling for Armada

New and radical computer architectures are easy to envision. Programming these architectures is an entirely different matter. Part of this research aims to prove that the Armada ISA can be targeted by a modern imperative-language compiler. This research does not concede or conclude that imperative languages are the best choice for exploiting concurrency. These languages are ubiquitous, though, and it must be possible for an imperative-language compiler to target the Armada ISA for Armada to be a successful general-purpose computer architecture in the foreseeable future.

This chapter describes the implementation of a compiler back-end for the Armada architecture. Due to time constraints on the research, the back-end is not production quality and primarily targets only the basic Fleet instruction set; in particular, the thread-level extensions proposed for Armada are not supported. However, the back-end does show that compilers for common imperative languages can manage Armada's fundamental programming interface and approach to exploiting ILC.

9.1 Modern optimizing compilers

Compilers take high-level languages that are designed for humans and translate them into primitive instructions that a machine can understand. Modern optimizing compilers like the GNU Compiler Collection (GCC) and the Low Level Virtual Machine compiler infrastructure (LLVM) typically have three parts as shown in figure 9.1. The first part, the front-end, takes a high-level programming language as input. This language might be C, Java, or Fortran, for example.


The front-end translates the input program into a common intermediate representation (IR). Because this IR is independent of the high-level language used, subsequent steps in the compilation process can operate directly on the IR without having to know anything about the high-level language the program was originally written in. The optimizer, or the middle-end as it is often intriguingly called, performs optimizations on the IR. Common optimizations involve removing dead code, constant and value propagation, and subexpression elimination among many others. Finally, the back-end of the compiler is responsible for translating the optimized IR into the target machine's language. For this reason, a back-end for a specific target architecture is often referred to as a code generator for that architecture. An LLVM Armada code generator is the focus of this chapter.

Figure 9.1: Modern optimizing compiler architecture. High-level source code written in languages such as C, Java, or Fortran is translated into a common intermediate representation, or IR. This allows a common optimizer stage to optimize programs written in any of the supported high-level languages. Finally, a code generator converts the optimized IR to assembly code for the target architecture.

9.2 LLVM IR

Although virtual machine is in its name, the LLVM compiler infrastructure can be used as a static compiler. A static configuration of LLVM version 2.3 was used to create the Armada code generator. LLVM was chosen as the basis for the Armada code generator because of its large community support and excellent documentation. GCC was also considered, but its complexity and inconsistent documentation made it less amenable to the proof-of-concept compiler needed for this research. Other research compilers investigated, like Trimaran and SUIF, have generally lost support and are no longer actively updated.

The Armada code generator described here receives LLVM IR as input and produces Armada code as output. Although the full specification of the IR is beyond the scope of this work, the parts of the IR that are most relevant to the Armada code generator are described in the following sections.

9.2.1 Basic blocks

The IR fundamentally consists of basic blocks of sequenced LLVM instructions. A basic block has only one entry point and only one exit point. This definition implies that branches cannot jump into the middle of a basic block. Furthermore, there can only be one branch instruction in a basic block, and this branch instruction must be the last instruction in the block. By definition, if any instruction in a basic block is executed, then all of the instructions in the basic block must be executed. The all-or-none instruction execution semantics of basic blocks are similar to those of Fleet and Armada code bags. This symmetry is exploited by the back-end as discussed in the following sections. A basic block may have one or more predecessor blocks that jump to the start of it. Additionally, a basic block may have one or more successor blocks that its last instruction can conditionally jump to.

9.2.2 Instructions

There are seven classes of LLVM instructions as shown in table 9.1. Most instructions in the LLVM IR are in static single assignment (SSA) form. In SSA form, variables are defined exactly once. Where a high-level language may allow a variable to be redefined multiple times throughout a program, the LLVM IR creates new, unique versions of that variable as needed. LLVM supports an infinite number of typed virtual registers to hold these values. These virtual registers are mapped to real machine registers in a subsequent stage of the compilation process. Figure 9.2 shows the control flow graph (CFG) for a program in LLVM IR format that prints the numbers 0 to 99. Each box in the CFG corresponds to a basic block. Note how the second assignment of the variable i creates a new variable, indvar.next, rather than overwriting the original variable.

Class           Description                                          Examples
terminator      Every basic block ends with a single terminator     br, ret
                instruction. Terminator instructions change
                control flow and typically have no value.
binary          Binary operations execute an operation over two     add, sub, mul, udiv
                inputs. The inputs must be of the same type.
bitwise binary  Bitwise operations manipulate the bits of an        shl, ashr, xor
                input.
vector          Vector operations support insertion, removal,       extractelement,
                and permutation of vector data.                     insertelement,
                                                                    shufflevector
memory          The LLVM IR contains instructions to access and     malloc, free, alloca,
                address memory. Memory locations are not in SSA     load, store
                form in the IR. LLVM provides instructions to
                allocate memory on a heap or stack, to free
                memory, to load and store data, and to address
                data within aggregate data structures.
conversion      A slew of LLVM conversion instructions allow        trunc, zext, sext,
                values to be cast from one type to another.         fptoui, ptrtoint,
                                                                    bitcast
miscellaneous   In addition to the phi instructions, the IR has     icmp, fcmp, call,
                other miscellaneous instructions to compare         va_arg
                values and call functions among other behaviors.

Table 9.1: LLVM instruction classes.

Most LLVM instructions produce a value as a result. For example, figure 9.3 shows two typical LLVM instructions. The add instruction produces a result, %tmp. A few instructions like store, however, do not produce a result. LLVM implements load-store architecture semantics; data are transferred between registers and memory explicitly through load and store instructions. Unlike the majority of instructions in LLVM, memory instructions are not in SSA form.

Figure 9.2: Control flow graph of a simple LLVM loop.

Because pointers to memory for these instructions can take on a very large number of possible values, there is not a reasonable, compact way to represent these possibilities in SSA form. Additionally, LLVM enforces strong typing on its IR. In figure 9.3, the add instruction is operating on two 32-bit integers. The store instruction then stores the result to a memory location referred to by the 32-bit integer pointer %ptr. Cast instructions enable explicit conversions between data types. The LLVM primitive types are void, integer, boolean, and floating point.

%tmp = add i32 4, %var
store i32 %tmp, i32* %ptr

Figure 9.3: LLVM add and store instructions. LLVM instructions are strongly typed and are generally in static single assignment (SSA) form.

Basic blocks can have multiple predecessors. Some variables may have different values depending on the block that directly preceded the execution of the current block. For example, referring back to figure 9.2, i will take on the constant value 0 for the first iteration of the loop when the preceding, invoking block is entry. On subsequent iterations of the loop, the preceding block is bb and thus i = indvar.next. This flow-dependent assignment, characteristic of programs in SSA form, is captured by the LLVM phi instruction. The phi instruction in basic block bb conditionally assigns a value to a variable depending on the predecessor block that invoked it. The phi instruction is a member of the miscellaneous class of LLVM instructions.
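For reference, the C source of a loop of the kind that produces the CFG in figure 9.2 is sketched below; the source itself is a reconstruction from the description above. In the SSA-form IR, i becomes a phi node selecting the constant 0 when entered from entry and indvar.next when entered from the loop back-edge.

#include <stdio.h>

int main(void) {
    /* In the IR, i is a phi node: 0 on the first iteration (from
       entry), indvar.next on every later iteration (from bb). */
    for (int i = 0; i < 100; i++)
        printf("%d\n", i);
    return 0;
}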

9.3 Armada code generator

The Armada code generator creates Armada assembly code from the LLVM IR. This assembly code is run through an assembler to generate Armada binaries. These binaries can then be executed on Armada architecture models such as ArmadaSim described in chapter 6.

9.3.1 Data types

As described in a previous section, LLVM IR instructions are strongly typed. The LLVM IR supports void, boolean, 8- to 64-bit signed and unsigned integers, and single- and double-precision floating point primitive types. Additionally, the IR supports four derived types – pointers, functions, structures, and arrays.

Void

The void type is a trivial mapping to Armada; the type is data-less and 0 bytes in size.

Boolean

The boolean type is mapped to an unsigned 64-bit integer. The least significant bit holds the boolean value, and the meaning of the rest of the bits is undefined.
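In C terms, the mapping can be sketched as follows; the helper names are illustrative only.

#include <stdint.h>

/* A boolean occupies an unsigned 64-bit integer; only bit 0 is defined. */
static uint64_t encode_bool(int b) { return (uint64_t)(b != 0); }
static int      decode_bool(uint64_t v) { return (int)(v & 1); }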

Integers

LLVM provides a full range of 8- to 64-bit signed and unsigned integers. Some examples of uncommon integer sizes supported by LLVM are 21 and 37 bits. Armada supports 8-, 16-, 32-, and 64-bit signed and unsigned integers. LLVM integer types between supported sizes are promoted to the closest-sized Armada integer type that can fully represent the data.
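A minimal sketch of this promotion rule in C; the helper name is illustrative and not part of the back-end's API.

/* Promote an LLVM integer width to the smallest Armada-supported
   width (8, 16, 32, or 64 bits) that can fully represent it,
   e.g. 21 bits -> 32 and 37 bits -> 64. */
static int promote_width(int llvm_bits) {
    if (llvm_bits <= 8)  return 8;
    if (llvm_bits <= 16) return 16;
    if (llvm_bits <= 32) return 32;
    return 64; /* LLVM integer widths here are at most 64 bits */
}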

Floating point

Armada supports 64-bit, double-precision floating point values. It does not have single-precision support; all LLVM single-precision floating point types are promoted to the double-precision type.

Derived types

As Armada-1 supports 64-bit addressing, all pointers are treated as 64-bit unsigned integers. Elements of arrays and structures are accessed through their base pointers with offsets calculated by the size of the primitive types they contain. Finally, functions are translated to 64-bit pointers to their address in the program binary. The LLVM function type includes the argument types and the return type. The back-end uses this information to determine how arguments are passed to functions and how the return value is returned from the callee. The Armada Procedure Call Standard described in the previous chapter dictates how the back-end should move this data between caller and callee.

9.3.2 Pre-processing

Prior to forming code bags and translating the IR into Armada assembly, several pre-processing steps transform the IR at a high level to make subsequent processes easier.

LR insertion pass

The LR insertion pass introduces the link register into the IR. If the function makes subroutine calls, space is allocated on the stack for the LR so it may be recalled later. This pass also adds code at the end of the terminal basic block of a function to forward the contents of the LR to the fetch unit.

APCS conformance pass

This pass makes a function and all of its callers conform to the Armada Procedure Call Standard in terms of handling return values. It does three things.

First, it transforms the signature of all non-void-return-type functions into void-return-type functions that take one additional argument: a pointer to the old return type. This satisfies the APCS requirement that return values are always returned indirectly through memory as opposed to a register transfer. The return value pointer precedes all other arguments. For example, this prototype:

int fac(int f);

is converted to:

void fac(int *_ret_result_storage, int f);

Second, for every non-void-return-type function modified, all callers of that function are adapted accordingly. The callers create space on their stack to hold the return data. They also pass a pointer to this space as an additional argument to each call of the modified function. For the example above, a caller would create storage space for fac's return value. Then, every call to fac in this function would be changed to provide the return storage space pointer as the first argument to the callee.

Finally, the pass changes all LLVM return instructions within the transformed function into a two-instruction sequence. The first instruction stores the return value into the return value space provided by the caller. The second instruction returns void.
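Taken together, a call site is rewritten roughly as in the C sketch below; this is a hand-written illustration of the transformation, not actual compiler output.

void fac(int *_ret_result_storage, int f); /* transformed prototype */

void use_fac(void) {
    /* Originally:  int r = fac(5);  */
    int _ret_result_storage;          /* caller-allocated return slot */
    fac(&_ret_result_storage, 5);     /* return pointer passed first  */
    int r = _ret_result_storage;      /* caller reads the result back */
    (void)r;
}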

Procedure call block-splitting pass

The LLVM IR embeds procedure calls within the middle of basic blocks. This approach works fine for architectures with a program counter. The instruction to execute upon return is simply the instruction after the procedure call. As alluded to earlier, Armada will eventually transform the basic blocks into code bags. Armada therefore treats a procedure call as the termination of the block and any subsequent code as another basic block. This allows the callee to return by fetching the code bag descriptor for the remaining code in the caller. This pass splits blocks containing procedure calls into two or more basic blocks at every procedure call.

Entry point veneer pass

When callees return to callers in Armada, they fetch an independent code bag, jumping into the function just past the procedure call. After the block-splitting pass, this return point is now a separate basic block in the IR. This pass traverses each new entry point into the function created by the splitting pass and adds the necessary code to reload values stored on the stack prior to the function call. It also reads the callee's return value out of memory into an LLVM IR variable.

Stack pointer insertion pass

LLVM allocates space on the stack for variables with the alloca instruction. Instructions using those variables refer to the alloca instruction's value. This is an abstract reference that must be made concrete. This pass converts such references to loads offset from the local stack pointer.

9.3.3 Code bag formation

In the Fleet and Armada architectures, instructions are fetched and executed in variably-sized groups called code bags. If any one instruction in a bag is fetched and executed, all instructions in the bag will be fetched and executed. LLVM basic blocks behave similarly; once a program jumps into a block of instructions, all instructions in that block are guaranteed to execute. The code generator exploits this symmetry and directly maps basic blocks to code bags.

However, there is a fundamental difference between basic blocks and code bags in how instructions are ordered. In LLVM blocks, instructions are assumed to execute in sequence. These semantics differ from the unordered nature of code bags. Because the majority of LLVM instructions are in SSA format, this difference in sequencing is not a major problem. In SSA, variables can only be assigned once per block. Therefore, there is minimal ambiguity due to read-after-write or write-after-read hazards (RAW and WAR respectively) within a block1. If an Armada processor attempts to execute an instruction before its inputs have been assigned values, the instruction will stall until those inputs are ready. Thus, the execution of instructions originating from LLVM SSA instructions is governed by data-flow ordering.

1 However, two pointers could alias to a single address, making RAW and WAR hazards a concern. Programmers and the compiler should work together to eliminate or minimize such cases. Failure to do so will cause some instructions to execute serially that could otherwise run concurrently. LLVM and other modern compilers attempt to determine whether two pointers may alias each other to optimize code most efficiently.

The code generator cannot completely overlook certain sequencing aspects of LLVM IR semantics. For example, memory access instructions are problematic. Consider multiple load instructions reading from a hardware first-in first-out (FIFO) register. The meaning of the values read from the FIFO likely depends on the order in which they were read. A similar scenario involving store operations can easily be envisioned. The code generator must therefore preserve the execution sequence of at least some memory access instructions. In this first implementation of the compiler, all memory accesses are forced to occur in the order given by the LLVM IR. This ordering is implemented by gating memory accesses on the completion of previous accesses using tokens. The memory Ship generates a completion token whenever a store operation completes. Recall that in Armada-1 all memory operations are instantaneous. The memory Ship has been implemented to generate completion tokens based upon the mini-opcode sent to it. The programmer may choose to:

1. not generate a completion token

2. generate a completion token upon store completion

3. generate a split-completion token once a store operation has been issued to the memory subsystem in-order but has not necessarily completed

Although the current compiler back-end always uses the conservative option two, the other options are available to optimize memory subsystem performance. Option three is particularly useful as it can be used to guarantee the ordering of memory operations without waiting for those operations to complete. Therefore, multiple memory operations may be outstanding to help hide memory subsystem latency in a real system. Once the hardware issues a split-completion token, it guarantees that any other subsequently-issued requests to memory will occur after that previous operation has completed. Support for split-completion tokens in the compiler back-end is left as future work.

Additionally, although RAW and WAR hazards do not generally occur within a single code bag, these hazards can arise in cases where instructions from multiple bags are in-flight at once. In program loops, for example, the code generator must guarantee that operations from older iterations complete before corresponding operations from the current iteration. The back-end accomplishes this by ensuring that all instructions in the current bag have completed before fetching the next code bag. This very conservative implementation guarantees correctness but unnecessarily limits the ILC present among multiple code bags. Optimization in this area is left as future work.
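A conceptual C model of the completion-token gating used for option two is sketched below; the types and function names are invented for illustration and do not correspond to Armada assembly or the Ship interfaces.

#include <stdint.h>

/* A completion token: carries no data, only ordering. */
typedef struct { int unused; } token_t;

/* Model of a store that emits a completion token once the store
   has completed (option two). */
static token_t store_with_token(int64_t *addr, int64_t value) {
    *addr = value;
    return (token_t){0};
}

/* A later access is gated on the token, so it cannot be issued
   until the prior store has completed. */
static int64_t load_after(token_t t, int64_t *addr) {
    (void)t; /* consuming the token establishes the ordering */
    return *addr;
}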

9.3.4 Ship type and instruction selection

The back-end decides which Ship or Ships to use for each LLVM IR instruction and generates Armada move instructions to and from those Ships. This compiler pass is currently only capable of generating code for a single IR instruction at a time; it is unable to map multiple IR instructions together into an optimal Armada operation. For example, the multiple IR instructions involved in a for loop cannot currently make use of Armada's stride Ship.

The selection pass allocates virtual Ships for the Armada instructions it generates. These virtual Ships are mapped to physical resources in the Fleet in a later pass. Virtual Ships are distinguished from physical ones in code listings by the presence of a tilde in front of the virtual Ship name. The full IR instruction-to-Ship mapping is specified in appendix D.

Once the appropriate Ship type is selected, the compiler must generate Armada move instructions that move data to and from the Ship appropriately. For example, the branch instruction shown in figure 9.4 is translated into four move instructions making use of a selector and fetch Ship. At this stage in the compilation process, all move instructions generated are move-once, copy-source instructions to a single destination (denoted by *->). The token cleanup stage described later will convert copy-source to move-source operations when the source data does not need to be preserved. Recall that standing instructions are not supported in this first implementation of the compiler back-end.

Figure 9.4: Translation of LLVM branch to Armada move instructions.

9.3.5 Instruction merging

The initial instruction selection pass generates many single-destination instructions. The instruction-merging compiler pass then combines multiple move instructions that have the same source address or source value into a single Armada instruction with multiple destinations. The pass thus reduces the number of instructions in code bags, decreasing static code size and promoting increased instruction fetch efficiency.

9.3.6 Instruction splitting

The instruction merging pass can potentially merge a large number of instructions together into a single instruction with a large number of destination addresses. In an actual implementation of Armada, the instruction opcode size limits the number of destinations allowed per instruction. For example, Armada-1 allows only two destinations per 32-bit instruction. The instruction splitting pass looks for instructions that have more destination addresses than the number supported by the underlying Armada implementation and splits those instructions into multiple move instructions. This pass is decoupled from the instruction merging pass because the number of supported destinations and, therefore, the splitting process are implementation-dependent.
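A sketch of the splitting step under the same illustrative representation, with Armada-1's two-destination limit as the default:

```python
# Break an instruction with too many destinations into compliant pieces.
def split(instr, max_dests=2):
    src, dsts = instr
    return [(src, dsts[i:i + max_dests]) for i in range(0, len(dsts), max_dests)]

print(split(('~add.out', ['a', 'b', 'c', 'd', 'e'])))
# [('~add.out', ['a', 'b']), ('~add.out', ['c', 'd']), ('~add.out', ['e'])]
```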

9.3.7 Stray token cleanup

Unlike traditional architectures, the presence or absence of data plays a key role in the programming and execution of Armada programs. Whereas registers may be overwritten freely in a traditional synchronous machine, asynchronous Fleet architectures require that an unused or expired token be disposed of to clear the path for subsequent tokens traveling along the same route or using the same resource. Failure to do so will result in deadlock and errors in the program. In Armada, software is responsible for handling these stray tokens: the assembly-level programmer or compiler must keep track of tokens, determine when they expire and become stray, and, finally, dispose of them.

Sources of stray tokens

Stray tokens are expired values that no longer have any use in a Fleet program. Stray tokens are generated in multiple ways. A common way is for a temporary variable to go out of scope. For example, once an operation on a loop iterator satisfies the loop termination condition, the iterator may no longer be of any use, as in example 9.2; the program must then dispose of the token. Another example is when values passed in through registers to a function call are never referenced because of a particular control flow path taken within the function. If software performs any speculation, mispredictions transform speculated values into stray tokens.

Tracking stray tokens

The compiler keeps track of every value in the program from the moment it is created. This process is similar to liveness analysis in compilers for traditional architectures. When values are passed in to a function through registers, those tokens are added to a token live list. Similarly, automatic variables created by the function are added to the live list. In this respect, the compiler behaves in a traditional manner.

Fleet architecture compilers, however, have the additional responsibility of understanding the data transformations that occur at Ships. For example, two values that are sent to an adder will be consumed and generate a new result value. The compiler must replace the two operands in the live list with the result token. This process is complicated by persistent tokens, such as constants, that reside at a Ship's input port until Ship-specific events clear them. If one of the inputs to an adder is persistent, for example, that token will only be cleared when the other operand is last. Operand persistence and other Ship state, such as the internal count in a Counter Ship, must be cleared when that state goes out of scope. If this clearance does not happen as a side-effect of some calculation in the program, the compiler must generate code to explicitly clear the Ship's state.
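The bookkeeping can be sketched as follows. The structures and names are illustrative; the real pass works over LLVM IR, but the idea is the same: a Ship operation removes its non-persistent operands from the live list and adds the result token.

```python
# Illustrative live-token bookkeeping for a Fleet-style compiler pass.
live = set()
persistent = set()

def define(token, is_persistent=False):
    live.add(token)
    if is_persistent:
        persistent.add(token)

def ship_op(inputs, result):
    # Non-persistent operands are consumed by the Ship ...
    for tok in inputs:
        if tok not in persistent:
            live.discard(tok)
    # ... and the result becomes a new live token the compiler must track.
    live.add(result)

define('%i'); define('%one', is_persistent=True)
ship_op(['%i', '%one'], '%i.next')
print(live)   # {'%one', '%i.next'} -- %one persists and must be cleared later
```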

Removing stray tokens

In the original Fleet proposal, a bit-bucket Ship was proposed to accept stray tokens [Sut05]. It was a true sink that generated no output. This research determined that a pure sink is insufficient in a highly concurrent, asynchronous architecture like Fleet. In a system where software manages the tokens flowing through the system, it is essential to determine when tokens have been successfully eliminated. The problem with the bit-bucket Ship is that it has no way to indicate when stray tokens have arrived in the bucket. As Fleet imposes no strict timing constraints on Ships or the switch fabric, it is important to know when those tokens have been eliminated to avoid deadlock and miscalculations caused by data arriving out of the intended order.

Armada thus uses a counter to collect the stray tokens. Software provides an input to the counter indicating how many tokens to wait for. The stray tokens are moved to the counter's token input port. When the specified number of tokens have arrived, the counter generates a single output token that can be used, for example, to gate the fetch of another code bag or to free the underlying Fleet tile. New code bags with token-cleanup code are introduced into programs as needed. Typically, these code bags are added near forks in control flow where tokens often become extraneous depending on the path taken by the program. Token cleanup proves to be a costly problem as described in chapter 11.
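A small model of this cleanup mechanism, following the Counter Ship semantics given in appendix A (names are illustrative):

```python
# Model of counter-based stray-token cleanup: software supplies the expected
# count, strays go to the token input, and one output token gates what follows.
class Counter:
    def __init__(self):
        self.expected = None
        self.seen = 0

    def in_cnt(self, n):          # number of stray tokens to wait for
        self.expected = n

    def in_tok(self, _token):     # a stray token arrives and is discarded
        self.seen += 1
        if self.seen == self.expected:
            return 'last'         # gate the next code bag fetch on this token
        return None

c = Counter()
c.in_cnt(2)
assert c.in_tok('%i') is None
assert c.in_tok('%tmp') == 'last'   # safe to fetch the next bag or free the Fleet
```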

9.3.8 Ship allocation

Thus far, the instructions created have almost exclusively used virtual Ships created as needed by the compiler. This limitless supply of virtual Ships must be mapped to the finite set of real Ships present in the targeted Fleet tile type. This mapping is performed by the Ship allocation pass of the compiler.

The compiler is supplied with a Ship composition map for each type of Fleet tile in the system; in this study, the targeted Armada-1 has only one core type. The map describes the addresses of all software-accessible input and output ports of the Ships. Each function is compiled individually and is thus not entangled with any other function. The Ship allocation pass visits each function separately, processing each code bag in turn. The virtual Ships in each instruction are replaced by physical Ships as available. If two instructions move data to virtual destination Ships that cannot possibly be live at the same time, the same physical Ship may be used in both moves.

Like register allocators for traditional architectures, the Ship allocation pass uses graph-coloring techniques to assign the virtual Ships to physical resources; Ships are analogous to registers in register-allocation schemes. An interesting problem, however, is that there are many more Ships in a Fleet than there are registers in most architectures. The initial Ship allocator took many minutes to execute, forcing a search for a more efficient solution.

As the LLVM IR is in SSA form2, the interference graphs for each function are chordal [Hac05][HGG06][BDR07]. A chord is an edge connecting two non-adjacent nodes of a cycle; in a chordal graph, every cycle of more than three nodes has a chord. The significance of this characteristic is that it reduces the complexity of the Ship assignment problem when Ships do not need to be spilled. While the problem is NP-complete for non-chordal graphs, the SSA-elimination algorithm for chordal ones is optimal in polynomial time [Yi05][BDJS06]. The list-coloring implementation used in the Armada back-end is an adaptation of a method proposed for register and functional unit allocation in VLIW processors [ZW03].

2Store instructions generated by the LLVM IR may not be in SSA form, but after Armada inserts store completion tokens, they effectively are.

If there are not enough physical Ships to accommodate all of the instructions, the compiler would ideally introduce serialization to eliminate the overlapping liveness of the constrained resources. This may be accomplished by gating the execution of one move instruction on the completion of others that use the same resource. The compiler may split a code bag in two to separate the colliding instructions into different groups; the output token of the constrained resource is then used to gate the fetch of the next code bag that also needs that resource. Note that this spilling process is not yet implemented in the code generator and is left as future work.

After this Ship allocation stage, the compilation process is complete. Armada assembly code is generated that is compliant with the APCS and can be assembled into machine code.
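A sketch of the coloring idea described above: for a chordal interference graph, visiting the virtual Ships in a suitable elimination ordering (for SSA form, the dominance order of definitions) and greedily assigning the lowest free physical Ship yields an optimal coloring. This is a simplified illustration, not the list-coloring implementation actually used in the back-end.

```python
# Greedy coloring of a chordal interference graph of virtual Ships.
def color(order, edges, physical_ships):
    """order: virtual Ships in a suitable elimination ordering;
       edges: set of frozensets {u, v} of interfering virtual Ships."""
    assignment = {}
    for v in order:
        taken = {assignment[u] for u in assignment
                 if frozenset((u, v)) in edges}
        free = [s for s in physical_ships if s not in taken]
        if not free:
            raise RuntimeError('not enough Ships: spill/serialize needed')
        assignment[v] = free[0]
    return assignment

edges = {frozenset(p) for p in [('~a', '~b'), ('~b', '~c')]}
print(color(['~a', '~b', '~c'], edges, ['IntAdd64_0', 'IntAdd64_1']))
# {'~a': 'IntAdd64_0', '~b': 'IntAdd64_1', '~c': 'IntAdd64_0'}
```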

9.4 Assembler

The Armada assembler takes assembly code text and builds program binaries that can execute on Armada hardware or on simulators of Armada implementations such as Armada-1. The assembler was written before the LLVM back-end and before a high-level language compiler targeting Armada was envisioned. Unlike the compiler, it assembles all valid Armada assembly code, including independent code bags, standing move instructions, and other features not supported by the back-end. Its operation is straightforward and will not be described in detail.

9.5 Limitations

A fully operational and robust modern optimizing compiler requires a large amount of effort to create and validate. Although LLVM was leveraged to aid the process, a fully functional compiler was out of the scope of this research due to time constraints. The current compiler generates Armada assembly code that may need hand modification in two ways. First, although the compiler detects conditions where more virtual Ships are referenced than there are physical Ships to map them to, it cannot correct the code and spill Ships when needed. Additionally, the compiler cannot always split instructions with too many virtual Ship destinations into multiple, two-destination instructions as required by Armada-1. In addition to these functional limitations, numerous optimization-related limitations were described in the chapter. The compiler back-end is generally very conservative to ensure correct operation. However, this approach limits the available concurrency that the architecture could exploit.

9.6 Conclusion

The Armada compiler back-end generates assembly code that can be assembled with minimal manual intervention into binaries for the Armada-1 architecture model. The next chapter discusses a debugger designed to aid assembly-level programmers as well as the manual steps required to complete the compilation process using the limited compiler back-end described in this chapter. The quality of the code output from the compiler is then compared to the hand-coded implementation of the Mandelbrot benchmark in chapter 11.

Chapter 10

ArmadaSim debugger

Concurrent execution adds complexity to the debugging process. Not only are many instructions from a single thread eligible for execution at one time; many threads may be executing concurrently as well. The ArmadaSim Debugger helps the low-level programmer visualize the state of the system and see which instructions may imminently execute.

10.1 Trace replay

The ArmadaSim Debugger replays traces captured from a previously executed simulation, allowing programmers to step through the code and see how the processor state changed. The ArmadaSim simulator may be set up to capture traces from all parts of the processor. It must minimally be configured to capture messages from the Fetch unit and from the trunk element of the switch fabric to generate a trace usable by the debugger; other messages are ignored. The Fetch Ship messages tell the debugger which Fleets in the Armada are in use at any time. The trunk element provides a timestamp for when instructions pass through the switch fabric. Recall that every instruction in an Armada-1 Fleet must pass through the trunk element of the horn-and-funnel switch fabric.

10.2 Message view

The ArmadaSim Debugger relies on the simulation trace, or message file, to generate helpful debug information. Figure 10.1 shows the message view of the debugger. The left pane shows simulator messages along with their corresponding timestamp. The Fleet tile where the message originated is given by the physical tile and color fields. As the simulator may be configured to dump many messages, it is useful to filter out irrelevant messages in the debugger. The right pane allows the programmer to create custom filters that only show messages meeting given criteria. In figure 10.1, a Color0 filter is specified that only shows messages originating from color layer 0 of the Armada.

Figure 10.1: ArmadaSim Debugger message view. The message view displays all messages in the trace file generated by ArmadaSim. Custom filters like the Color0 filter shown can be used to suppress the display of chosen messages from the trace file.

10.3 Source code view

The primary use for the debugger is to track the progress of a previously-executed program. The user can load the assembly code for the program from which the trace was generated and step through the code. The debugger displays the trace file messages in the left pane and syntax-highlighted source code in the right pane, as shown in figure 10.2. When multiple Fleets are active simultaneously, new source code tabs are generated in the debugger for each virtual Fleet. The programmer may switch tabs to see what each Fleet is currently doing. When the user steps through to the next instruction, the debugger can switch tabs automatically to the Fleet that generated the message. The user may also choose to debug only one Fleet at a time and have the debugger skip to the next message originating from the Fleet under inspection.

Figure 10.2: Debugging multiple threads. The ArmadaSim Debugger keeps track of all active Fleet contexts. The programmer can flip through the tabs assigned to each context to inspect the state of each active Fleet.

10.4 State tracking view

A difficult problem encountered when debugging highly-concurrent programs is keeping track of which instructions are currently eligible for execution. The debugger highlights all instructions that may be currently in flight based on which code bags have been fetched, as in figure 10.3. Expired instructions are grayed out. The programmer steps through programs by stepping through the instructions as they travel through the trunk of the switch fabrics. There are two levels of stepping granularity: the user can step through instructions only or can follow every state change, such as when data arrives at a Ship.

Figure 10.3: Stepping through a program. Users can step through the source code of a program with the debugger. The debugger highlights which instructions have executed and expired, which instructions are eligible for execution, and the current instruction passing through the switch fabric.

Another difficulty in reasoning about Armada programs is keeping track of which Ships have data at their input or output ports. A programmer may forget to supply a Ship with data it needs to compute an output, causing the program to halt indefinitely. The debugger provides a model view of each active Fleet tile that allows the programmer to see data waiting at Ship input and output ports (figure 10.4). Visualizing this state allows the programmer to understand why a program has halted, why a Ship produced the output it did, or what stray state must be cleaned up before freeing a Fleet tile for reuse. This view also shows the code bag fetch history for the active Fleet. As code from previously fetched bags may still be in the system, this listing helps programmers keep track of which instructions may have lingering effects.

Figure 10.4: Tracking Fleet state. The debugger updates models for each active Fleet tile as the user steps through the program. Ship input port data, Ship output port data, and code bag fetch history for the Fleet under inspection are shown in the right pane.

10.5 Limitations

This debugger was primarily designed to aid in the compiler development process and thus shares the same limitations as the compiler. In particular, the debugger does not track persistent inputs at Ship input ports, and it does not handle standing instructions.

10.6 Conclusion

The ArmadaSim Debugger helps programmers keep track of the large state space that an asynchronous, highly-concurrent architecture like Armada may have. By replaying traces generated by ArmadaSim, programmers can track changes to every Fleet in the Armada at once. The programmer can monitor very low-level state such as Ship input and output port events or step through many instructions at a time. Despite its limitations, the debugger handles all code generated by the Armada compiler, making it a useful aid to the compiler development and verification process.

Chapter 11

Testing the Armada compiler

The Armada compiler demonstrates that it is feasible to translate a program written in a high-level imperative language into Armada assembly code. Although the thread-level extensions to the architecture are not explored by this proof-of-concept compiler, unique aspects of Fleet architectures, such as code bags of move instructions and the absence of a program counter, are successfully targeted. This chapter examines the code generated by the compiler relative to what is possible with hand-coded optimizations as well as to code generated for other ISAs by traditional modern compilers.

11.1 Objectives

Although this compiler is largely exploratory, it is useful to have some idea of how the Armada compiler compares to more traditional compilers in terms of code density and dynamic code size. The compiler-generated version of the Mandelbrot benchmark is also compared to the hand-coded version to determine how much optimization the compiler is leaving on the table. Direct comparisons of benchmark run-times between traditional and Armada ISAs are avoided, as ArmadaSim has not been correlated to real-world silicon performance.

11.2 Method and metrics

The Armada code generator output is compared to the output of the x86-64 and MIPS code generators from the same version of the LLVM compiler infrastructure. Thus, the same high-level optimizations are applied. However, the MIPS and x86-64 code generators are much more mature than the experimental Armada code generator and contain low-level optimizations not present in the Armada generator. The code density, number of instructions fetched, number of instructions committed, and instruction traffic are measured for all the target architectures and compared.

11.2.1 Code structure

The hand-coded implementation of the Mandelbrot benchmark attempts to make the best use of all of Armada's relevant features. The limited exploratory compiler generates code that is quite different from the envisioned low-level program. The code structures of the two implementations are compared to better understand the challenges of compiling a high-level imperative-language program for Armada.

11.2.2 Code density

Code density is a measure of the number of instruction bits required to implement a particular algorithm. High code density promotes efficient use of the instruction cache, lower memory requirements for constrained embedded systems, and less instruction traffic at run time. These benefits can result in lower power consumption, higher performance, and lower system costs. This metric is measured by compiling the code with similar compiler options across the studied platforms and measuring the resulting executable file size. Library calls are avoided and replaced with dummy calls as needed to eliminate external effects on the executable file's organization and on code density.

11.2.3 Instructions fetched

The number of instructions fetched from memory is measured for each test platform. In Armada, standing instructions are unique in that they may be fetched once but effectively executed many times over. Although some processors for the other platforms may have mechanisms for re-executing previously fetched and decoded instructions (e.g., a trace cache), these types of optimizations are not considered. Similarly, instructions that are fetched but never executed by speculative machines are not counted. Thus, for all test platforms except the Armada with standing instruction support, the number of instructions fetched is the same as the number of instructions executed.

11.2.4 Instructions executed

Instructions executed by a processor must be fetched from memory, decoded, and issued. These steps all require time and energy. They also utilize some of the available system bandwidth that other threads could otherwise use. The GNU debugger (GDB) is used on the various target platforms to count the number of native instructions executed before the program completes. Standing instructions used in the hand-coded version of the Mandelbrot benchmark are counted as executed each time they move data through the switch fabric. It is therefore possible for the number of instructions executed or committed by Armada to exceed the number of instructions fetched.

11.2.5 Instruction traffic

Instruction traffic is analogous to code density measured at run time. Algorithms packed densely into instructions consume less bus bandwidth at run time. For many systems, dense code packing results in less toggling of pins on the chip package, a typically costly action in terms of power. In fixed-instruction-length architectures, instruction traffic from the instruction cache can be calculated by multiplying the instructions committed by the instruction size. Computing instruction traffic for variable-length instruction set architectures such as the x86 ISA is more difficult. An extension developed for GDB is used to count the number of bits in every instruction executed by the target architectures. For the case of standing instructions in Armada, instruction traffic is measured at the L1 cache only.
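A worked example of the two cases; the MIPS numbers reproduce the table 11.2 figures (424,860 committed 4-byte instructions give 1,699,440 bytes of traffic), while the variable-length sizes below are illustrative only.

```python
# Instruction traffic: fixed-length ISAs multiply committed instructions by
# the instruction size; variable-length ISAs sum per-instruction sizes, as
# the GDB extension does.
def traffic_fixed(committed, instr_bytes):
    return committed * instr_bytes

def traffic_variable(instr_sizes_bytes):
    return sum(instr_sizes_bytes)

print(traffic_fixed(424_860, 4))        # 1,699,440 -- the MIPS row of table 11.2
print(traffic_variable([3, 5, 2, 7]))   # e.g. a short x86 fragment
```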

11.2.6 Concurrency

The amount of functional-unit concurrency achieved with the hand-coded version of Mandelbrot is compared with the amount achieved by the code generated by the compiler. The utilization of the Ships is compared between both versions of Mandelbrot running on a single-core, single-threaded Armada configuration in ArmadaSim. Thread-level concurrency is not measured, as the Armada code generator does not currently support independent code bags and hardware multithreading.
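The utilization measurement can be sketched as follows, assuming an illustrative trace of (ship, busy-start, busy-end) intervals; ArmadaSim's actual trace format differs.

```python
# Fraction of simulated time each Ship spends busy, from busy intervals.
from collections import defaultdict

def utilization(busy_intervals, total_time):
    busy = defaultdict(int)
    for ship, start, end in busy_intervals:
        busy[ship] += end - start
    return {ship: t / total_time for ship, t in busy.items()}

print(utilization([('IntAdd64', 0, 40), ('FPMul64', 10, 90)], 100))
# {'IntAdd64': 0.4, 'FPMul64': 0.8}
```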

11.3 Results

Given the low level of maturity of the compiler, the implementation of Mandelbrot that it generates performs somewhat respectably relative to the optimized, hand-coded implementation. Table 11.1 compares the two implementations1. Some of this data has been previously presented in chapter 7.

                       Hand-coded   Hand-coded             Compiled
                       1 Fleet      32 Fleets (2x16 SMT)   1 Fleet
code size (bytes)      756          756                    1176
number of code bags    9            9                      15
relative performance   1            19.76                  0.59

Table 11.1: Hand-coded versus compiled Mandelbrot.

11.3.1 Performance

The compiled implementation runs at roughly 60% of the speed of the hand-optimized version. Although slower, this performance is quite reasonable considering the limitations of this first code generator; it is not uncommon for hand-coded versions of algorithms on traditional architectures to show similar improvements over compiler-generated versions. Supporting persistent inputs and standing instructions will narrow this gap. Merging basic blocks into larger code bags will also eliminate the overhead of managing data that goes in and out of scope. Finally, compiler support for the Armada thread-level extensions will provide the greatest performance benefit for parallelizable workloads and benchmarks like Mandelbrot; the hand-coded, multithreaded implementation achieves a 20x improvement on a 32-Fleet configuration. Such speed-ups are not possible without Armada's thread-level concurrency support.

1The compiled version used for this analysis differs slightly from the listing given in appendix C. The version used for this analysis does not contain calls to malloc and free, to enable a fair comparison with the hand-coded version.

11.3.2 Code structure

The organization of the hand-coded and compiled Mandelbrot programs differs greatly. While the hand-coded version was aggressively optimized for the architecture, the compiler-generated structure is generic and does not reflect the organization chosen to take advantage of the thread-level extensions. Figure 11.1 shows the control flow diagrams for both implementations. Cleanup code bags, highlighted in figure 11.1(c), are inserted into the LLVM code by the Armada code generator. These code bags contain move instructions that pull stray, out-of-scope values from Fleets and dispose of them. As fetching code bags largely serializes the execution of instructions, these cleanup bags are detrimental to performance, especially within the critical loop. The hand-coded implementation groups cleanup code within existing bags, avoiding that penalty. Such an optimization is possible for the compiler but is currently not in place.

11.3.3 Code density

The compiled code is 56% larger than the hand-coded version. The primary reasons for this difference in code density are the lack of persistent Ship inputs and the larger number of code bags in the compiled version. Without persistent inputs, some values must be recommunicated with additional move instructions in other code bags. There are also more code bags in the compiled version because the code generator currently does not attempt to merge basic blocks into single code bags, whereas the hand-coded implementation merges as many instructions into a single bag as possible.

Compared to other architectures, Armada code size is quite large: even the hand-coded version is about four times the size of the same benchmark implemented for the MIPS RISC and x86-64 CISC architectures. As described in section 7.3, explicitly describing every movement of data in a Fleet requires more bits than only defining the operation as traditional architectures do. Additionally, Armada places data constants directly into instructions. In the other architectures examined, many constants are placed in separate memory regions (such as a literal pool) and read into the register file by load instructions; the size of the literal pool is not accounted for in table 11.2.

(a) Hand-coded implementation.

(b) Compiler implementation.

(c) Cleanup code bags in the compiled implementation.

Figure 11.1: Hand-coded and compiled Mandelbrot. 11.3. RESULTS 155

11.3.4 Instructions fetched and instruction traffic

Despite the larger code size, the number of instructions fetched at run-time by the hand-coded benchmark is fairly comparable to the number fetched on traditional architectures. The compiled version, however, fetches 2.7 and 3.7 times as many instructions as the CISC and RISC architectures do, respectively. Unsurprisingly, the amount of data actually transferred while fetching these instructions is also considerably higher: the compiled Armada implementation transfers about 2.8 times more instruction data than the hand-coded implementation. The compiler's inability to use standing instructions and persistent inputs contributes to this shortcoming. Additionally, poorer code organization and more orphan tokens to clean up than in the hand-optimized version both contribute to instruction-fetch bloat.

                             Hand-coded   Compiled     MIPS        x86-64
code size (bytes)                   756       1176        180         200
instructions fetched            622,861  1,576,952    424,860     582,467
instructions executed           761,072  1,576,952    424,860     582,467
instruction traffic (bytes)   3,401,640  9,418,396  1,699,440   2,228,524

Table 11.2: Mandelbrot comparison on various ISAs.

11.3.5 Instructions executed

Aside from the hand-coded Mandelbrot implementation, the number of instructions fetched is identical to the number of instructions executed for the various architectures (also shown in table 11.2). The hand-coded implementation on Armada uses standing instructions that are fetched only once and repeat execution indefinitely. The results indicate that standing instructions reduced the instruction fetch count by nearly 20% compared to the number of instructions executed (622,861 fetches for 761,072 executed instructions, a reduction of about 18%). While this result provides strong evidence that standing instructions reduce instruction fetches and fetch traffic, the number of fetches for Armada architectures is still higher than the number required by more traditional designs.

11.3.6 Concurrency

Thread-level concurrency is not currently supported by the Armada code generator. However, the compiler does formulate code bags, which enables programs to achieve some degree of functional-unit concurrency. Figure 11.2 shows that the compiled code leaves the Ships idle nearly 20% more of the time than the hand-coded version does. Although neither implementation results in high levels of operation concurrency for a single thread, the compiled version is not unreasonably off-pace from the optimized code. This result suggests that more must be done at the architecture and microarchitecture levels to enable higher levels of operation concurrency. As mentioned previously, one method of increasing single-thread ILC and Ship utilization is to implement a separate flow-control fabric on top of the existing switch fabric. The additional communication infrastructure can reduce communication latencies between Ships and pipeline operations more efficiently than is currently possible with Armada-1.

Figure 11.2: ILC comparison between hand-coded and compiled Mandelbrot. The benchmark implementations were executed on a single-core, non-SMT configuration of ArmadaSim.

11.4 Conclusion

The compiled version of the benchmark has decent performance compared to the optimized code, considering the compiler's limitations and early stage of development. However, with performance around 41% slower than the optimized version on a single-core, non-SMT configuration, the compiler has room for improvement. More intelligent grouping of basic blocks into code bags, persistent input support, standing instruction support, and a dedicated flow-control fabric are the major keys to enhancing single-threaded performance. The biggest gains, however, lie in programs and benchmarks with abundant thread-level concurrency, which can be harnessed using Armada's TLC extensions. Limitations on research time prohibited further compiler enhancement, leaving these improvements as future work.

Part IV

Discussion and conclusion

Chapter 12

Discussion and conclusion

The aim of this research was to enhance the highly-concurrent, communication-centric Fleet architecture and demonstrate that such an unconventional processor could be the basis for a general-purpose computer. Thread-level concurrency extensions were introduced, and a timing-approximate simulator of the architecture was created. The work undertaken generated a primitive but reasonably functional set of tools for the architecture, including an imperative-language compiler, an assembler, and a debugger. The developed tool chain generated a reasonable implementation of the Mandelbrot benchmark. This work has thus provided strong evidence that the Armada architecture can indeed serve as a general-purpose computer architecture.

However, the performance of the architecture appears to trail traditional CISC and RISC machines in terms of instruction fetch and execution efficiency. With lower code density, more instructions required to perform identical work, and higher instruction fetch traffic than those architectures, Armada puts high pressure on its shared instruction fetch and dispatch unit and instruction cache. Nevertheless, the thread-level extensions incorporated into Armada, though not currently exploited by the compiler, allow the performance of workloads with thread-level concurrency to scale greatly with the addition of more Fleets, without any changes to program binaries.

The following sections discuss the merits of Fleet, the extensions introduced with Armada, and the compiler designed to target the architecture. Finally, limitations of the work performed and areas for future work are discussed.


12.1 Fleet changes

After some study, a decision was made to remove support for the numerous out-of-bounds (i.e., OOB) possibilities envisioned by the pioneering Fleet work. Checking for the many possible OOB conditions, which can change dynamically, would require a certain level of serialization of the code; as OOB values would be common, checks would be similarly frequent. Recovering from error cases would also be difficult, especially in the absence of a hardware reset for a Fleet core. Had this decision been made earlier in the research, the 96-bit data paths would have been replaced by smaller 65-bit paths throughout the design. Using 65 bits accommodates common 64-bit data types together with the one OOB value that remains in the architecture, last. 96-bit code bag descriptors would no longer be treated as a first-order data type but would instead be accessed indirectly through a 64-bit address.

Hardware enforcement of data types was seen as a distraction to the study of Fleet and Armada in this exploratory, proof-of-concept research; it was not considered in this study.

12.2 Architecture enhancements

A number of architecture enhancements were added to the Fleet concept to form Armada. These extensions are discussed here.

12.2.1 Proposed extensions and motivating factors

Armada introduces independent code bags to the ISA to provide programmers and compilers with a low-level method of spawning threads. The motivation was to make threading a natural and unavoidable part of programming. In the process, it was hoped that software written and compiled once could take advantage of additional cores added to an Armada processor over time, as permitted by increasing transistor budgets.

The hardware supports the execution of multiple threads by two methods. Core replication produces multiple instances of physical resources like Ships and the switch fabric. Resource-multiplexing adds a layer of virtual channels on one physical Fleet and adds tags to instructions and data to uniquely identify separate thread contexts operating on top of the same hardware. Resource-multiplexing attempts to increase Ship utilization; higher utilization would imply good use of the die area and reduce the impact of the static, leakage power that accompanies enabled but seldom-used resources.

The context synchronizers were introduced to support barrier behavior and thread merging. Though not discussed in detail in this work, the units were tested and behaved as expected. The stack resource ring was implemented to lower the cost of spawning threads that require their own stacks.

12.2.2 Conclusions

These extensions perform admirably when executing the hand-coded Mandelbrot benchmark. The code can execute on a single Fleet yet exhibits increases in performance with the addition of up to 128 virtual Fleets. The best performance was achieved with a 16x8-way SMT configuration; further improvement was hampered by saturation of the shared fetch, dispatch, and instruction cache units. Unfortunately, the compiler development did not progress enough to make use of these extensions. The compiled version of the benchmark does spawn threads, but the execution of those threads is serial; there is no TLC.

Ship utilization was increased successfully using the virtual channels. Although in principle this increased utilization would decrease the relevance of the static power consumed by the units, other techniques like implementing multiple power domains would also alleviate the problem. The cost of implementing virtual channels for simple operations like additions will likely be higher than simply duplicating adders for each channel. Virtual channels sharing large, complex Ships may be worthwhile, however. More study using additional benchmarks and more precise simulations will help answer this question.

The performance of the stack allocation ring was not thoroughly evaluated. Although it functioned as anticipated, it is not possible to fully understand its performance without understanding the costs associated with stack allocation by the operating system. When designing the unit, it was assumed that a traditional operating system would map physical memory onto a process's virtual memory map. The stack allocation units would have to perform such behavior as well, but this level of detail was not implemented.

12.3 Compiler and tools

The compiler, assembler, and debugger are essential requirements for general-purpose computing. These tools were created to target Armada.

The compiled code achieved 60% of the performance of the hand-coded, optimized version on a single core. This result is quite promising considering the limited optimizations built into the tool. Fleet with a shared-memory subsystem accommodates the imperative-language programming common to general-purpose computing. Static single assignment form, increasingly common in optimizing compilers, maps easily to move instructions. Given their single-entry, single-exit semantics, basic blocks also map well to Fleet and its code bags. However, for the Mandelbrot benchmark, there are so many data dependencies within each block that instruction-level concurrency and functional-unit concurrency are quite low. This result largely diminishes the value of code bags in exploiting ILC as envisioned. Although compiler enhancements may help, the optimized, hand-coded version of the benchmark was also not very successful at keeping many Ships utilized concurrently. Architecture changes, such as the suggested flow-control fabric, are required to increase this level of concurrency.

The debugger, along with a simple waveform viewer, proved to be a useful tool for understanding what was happening in the highly-concurrent Armada architecture. However, these tools only test one possible sequence of events at a time. Although the simulator has randomness built in to help test software more thoroughly, this feature is still inadequate to cover all possible cases. As instructions and data may arrive in vastly different arrangements, formal methods would be essential for validating the correctness of software. Software today generally has very limited concurrency, and few concurrency-related bugs have serious impact; for example, researchers have discovered concurrency-related bugs in operating systems for traditional architectures that had been hidden in the code for years. Though these bugs could deadlock the system, the failing conditions occur so rarely that the bugs are difficult to discover and fix. Fleet and Armada place so few restrictions on instruction ordering, data movement, and timing in general that any such bug in software would be much more likely to appear.

12.4 Limitations of described work

The work performed covered a broad range of material from hardware simulation to software compilation. The uniqueness of the proposed architecture greatly limited reuse of existing infrastructure and tools. Thus, much of the research period was used to develop tools, models, and standards to evaluate Armada. Due to time constraints, there are several notable limitations that are described here.

12.4.1 Few benchmarks

One of the most prominent deficiencies in the work is the lack of benchmark results for Armada. One goal of the compiler was to make benchmark implementation easy so more benchmarks could be ported to the architecture. Time constraints ultimately prevented more benchmarks from being implemented and tested.

12.4.2 Performance comparison to modern architectures

ArmadaSim is a timing-approximate model. Although some time was spent researching the measured performance of asynchronous implementations of functional units and incorporating it into the Ships, the timing model was not detailed enough to fairly compare Armada performance against real silicon. An accurate performance model is important to this research, as a drastic change to a new computer architecture will only occur in the general-purpose computing domain if the performance improvement is great.

12.4.3 Compiler maturity

The compiler is not complete and requires some manual intervention in the Ship-allocation phase. Optimizations for Armada have not been investigated. The work focused on proving that a traditional compiler could target Armada, leaving enhancements as future work.

12.5 Areas for future work

In addition to the incomplete activities described in the limitations section, there are some other areas of broader study that are left open to future research.

12.5.1 Memory subsystem

The memory-processor performance gap is a notable problem in computer architecture. Fleet and Armada may be unique vehicles for exploring interesting memory architectures.

12.5.2 Higher granularity Ship functionality

Armada exposes communication at an extremely fine level of granularity. Exposing communication at this level may not be very useful, as moving data between such small bits of logic sitting very near each other is not extremely costly. One area of study is to incorporate more behavior into Ships. For example, a Ship may be a hardware accelerator block for video encoding as opposed to a simple adder.

12.5.3 Armada as an application-specific processor architecture

Following from coordinating communication of data at a higher level of granularity is the possibility of generating application-specific designs quickly from a library of accelerators, cores, or other logic blocks. Because the switch fabric places so few constraints on the blocks attached to it, integrating new logic is easy. Additionally, software accesses the logic uniformly with moves to and from numbered addresses. Therefore, the ISA remains relatively unchanged with the exception of port assignments for the logic. That factor is not particularly problematic for ASPs, where software is purpose-built to perform a specific task using a specific set of hardware resources.

12.5.4 Flow-control fabric

Armada-1 does not exhibit high single-threaded concurrency of operations. A flow-control fabric would enable Ships to communicate data more quickly with each other, especially where standing instructions are used.

12.5.5 Flow caching

Flow caching was described in chapter 4 as a means to increase performance by caching commonly-executed code on Fleet cores. By caching these flows, instruction fetches are largely avoided, resulting in decreased instruction-side pressure and latencies. Flow caching may prove to be a viable way of making reconfigurable computing easily accessible to programmers.

12.5.6 Compiler enhancements

Many optimization and enhancement opportunities remain for the LLVM code generator implemented for this research. Standing instruction support and persistent input support are key missing features. Additionally, a much larger outstanding problem is how to have an imperative-language compiler generate threaded code.

12.6 Final remarks

Armada builds on Fleet in design and in spirit. Using Fleet cores as fundamental building blocks, Armada promotes exposing expensive communication and makes concurrency a requirement rather than an optimization. Progress in the Armada research to date has demonstrated that the architecture is highly amenable to ubiquitous, imperative-language programming and may serve as a general-purpose computer architecture for the future. More investigation is required, however, to prove whether its use in this domain is practical.

Bibliography

[AHKB00] Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In ISCA '00: Proceedings of the 27th international symposium on Computer Architecture, pages 248–259, 2000.

[BD95] Mark W. Bailey and Jack W. Davidson. A formal model and speci- fication language for procedure calling conventions. In POPL ’95: Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pages 298–310, 1995.

[BDJS06] Philip Brisk, Foad Dabiri, Roozbeh Jafari, and Majid Sarrafzadeh. Optimal register sharing for high-level synthesis of SSA form programs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(5):772–779, 2006.

[BDR07] Florent Bouchez, Alain Darte, and Fabrice Rastello. On the complexity of spill everywhere under SSA form. SIGPLAN Notices, 42(7):103–112, 2007.

[BG04] D. Burger and J. R. Goodman. Billion-transistor architectures: there and back again. Computer, 37(3):22–28, 2004.

[BMBW00] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard: a scalable memory allocator for multithreaded applications. SIGPLAN Not., 35(11):117–128, 2000.

[CLJ+01] William S. Coates, Jon K. Lexau, Ian W. Jones, Scott M. Fairbanks, and Ivan E. Sutherland. FLEETzero: An asynchronous switching experiment. In ASYNC '01: Proceedings of the 7th international symposium on Asynchronous Circuits and Systems, pages 173–182, 2001.


[CM91] Henk Corporaal and Hans (J.M.) Mulder. MOVE: a framework for high-performance processor design. In Supercomputing '91: Proceedings of the 1991 ACM/IEEE conference on Supercomputing, pages 692–701, 1991.

[DM75] Jack B. Dennis and David P. Misunas. A preliminary architecture for a basic data-flow processor. In ISCA ’75: Proceedings of the 2nd annual symposium on Computer Architecture, pages 126–132, 1975.

[End95] P. B. Endecott. SCALP: A Superscalar Asynchronous Low-Power Processor. PhD thesis, University of Manchester, Manchester, UK, 1995.

[EO88] Carla Schlatter Ellis and Thomas J. Olson. Algorithms for parallel memory allocation. Int. J. Parallel Program., 17(4):303–345, 1988.

[FH05] M.J. Flynn and P. Hung. Microprocessor design issues: thoughts on the road ahead. IEEE Micro, 25, 2005.

[FKM+02] Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, and Trevor Mudge. Drowsy caches: simple techniques for reducing leakage power. In ISCA '02: Proceedings of the 29th annual international symposium on Computer Architecture, pages 148–157, 2002.

[For03] Martti Forsell. Analysis of transport triggered architectures in general purpose computing. In Proceedings of the 21st IEEE Norchip Conference, pages 183–186, 2003.

[FPT94] Matthew Farrens, Andrew R. Pleszkun, and Gary Tyson. A study of single-chip processor / cache organizations for large numbers of transistors. SIGARCH Computer Architecture News, 22(2):338–347, 1994.

[Gel01] P. P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities, and new frontiers. In ISSCC '01: International Solid-State Circuits Conference Digest of Technical Papers, pages 22–25, 2001.

[GG98] J. Gonzalez and A. Gonzalez. Limits of instruction level parallelism with data speculation. Proceedings of the VECPAR Conference, pages 585–598, 1998.

[GKW85] J. R. Gurd, C. C. Kirkham, and I. Watson. The Manchester prototype dataflow computer. Communications of the ACM, 28(1):34–52, 1985.

[Hac05] Sebastian Hack. Interference graphs of programs in SSA-form. Technical report, Universität Karlsruhe, June 2005.

[HGG06] Sebastian Hack, Daniel Grund, and Gerhard Goos. Register allocation for programs in SSA-form. In Compiler Construction 2006, volume 3923 of Lecture Notes in Computer Science, pages 247–262, March 2006.

[HGLS86] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986.

[HMH01] R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of the IEEE, 89(4):490–504, 2001.

[Hol05] Mike Holenderski. Accumulating the Fibonacci sequence using FLEET. Technical memo UCMH#2005-mh02, 2005.

[HP03] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, third edition, 2003.

[Isa06] Nemanja Isailovic. Eight-element bubble sort in FLEET. Technical memo, 2006.

[ITR05] ITRS. International Technology Roadmap for Semiconductors (ITRS) 2005 Edition: Design. Japan Electronics and Information Technology Industries Association, 2005.

[Jou90] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Computer Architecture News, 18(3a):364–373, 1990.

[KAB+03] Nam Sung Kim, Todd Austin, David Blaauw, Trevor Mudge, Krisztián Flautner, Jie S. Hu, Mary Jane Irwin, Mahmut Kandemir, and Vijaykrishnan Narayanan. Leakage current: Moore's law meets static power. Computer, 36(12):68–75, 2003.

[KM03] D. Koufaty and D. T. Marr. Hyperthreading technology in the NetBurst microarchitecture. IEEE Micro, 23(2):56–65, 2003.

[KP02] Christoforos Kozyrakis and David Patterson. Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks. In MICRO 35: Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, pages 283–293, 2002.

[Lin07] Andrew Lines. The Vortex: A superscalar asynchronous processor. In ASYNC ’07: Proceedings of the 13th IEEE international symposium on Asynchronous Circuits and Systems, pages 39–48, 2007.

[MB05] C. McNairy and R. Bhatia. Montecito: a dual-core, dual-thread Itanium processor. IEEE Micro, 25(2):10–20, 2005.

[Mey06a] Trevor Meyerowitz. Euclid’s algorithm in FLEET take 2. Technical memo, 2006.

[Mey06b] Trevor Meyerowitz. Matrix-vector multiplication in FLEET. Technical memo, 2006.

[MJC+99] C. E. Molnar, I. W. Jones, W. S. Coates, J. K. Lexau, S. M. Fairbanks, and I. E. Sutherland. Two FIFO ring performance experiments. In Proceedings of the IEEE, volume 87, pages 297–307, 1999.

[Moo65] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114–117, 1965.

[MPJ+00] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart Memories: a modular reconfigurable architecture. In Proceedings of the 27th international symposium on Computer Architecture, pages 161–171, 2000.

[Ope06] Open SystemC Initiative. Homepage: SystemC: Welcome. http://www.systemc.org/, September 2006.

[PWD+09] Songwen Pei, Baifeng Wu, Min Du, Gang Chen, Leandro A. J. Marzulo, and Felipe M. G. França. SPMT WaveCache: exploiting thread-level parallelism in WaveScalar. In CSIE '09: Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering, pages 530–535, Washington, DC, USA, 2009. IEEE Computer Society.

[RF93] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1):9–50, 1993.

[SMSO03] Steven Swanson, Ken Michelson, Andrew Schwerin, and Mark Oskin. WaveScalar. In MICRO 36: Proceedings of the 36th annual IEEE/ACM international symposium on Microarchitecture, pages 291–302, 2003.

[SNL+03] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W. Keckler, and C.R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. ACM SIGARCH Computer Architecture News, 31(2):422, 2003.

[SR00] M.S. Schlansker and B.R. Rau. EPIC: Explicitly parallel instruction computing. Computer, 33(2):37–45, 2000.

[Sut05] I. E. Sutherland. FLEET – A One-Instruction Computer. Memo to Berkeley students, December 2005.

[TEL98] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In ISCA '98: 25 years of the international symposia on Computer Architecture (selected papers), pages 533–544, New York, NY, USA, 1998. ACM Press.

[TLM+02] M. B. Taylor, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, A. Agarwal, et al. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, 2002.

[TLM+04] MB Taylor, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, J. Kim, J. Psota, et al. Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams. Computer Architecture, 2004. Proceedings. 31st annual international symposium on, pages 2–13, 2004.

[TS88] M.R. Thistle and B.J. Smith. A processor architecture for horizon. Supercomputing ’88: Proceedings, 1:35–41, 14-18 Nov 1988.

[vESV+99] J. T. J. van Eijndhoven, F. W. Sijstermans, K. A. Vissers, E. J. D. Pol, M. I. A. Tromp, P. Struik, R. H. J. Bloks, P. van der Wolf, A. D. Pimentel, and H. P. E. Vranken. TriMedia CPU64 architecture. In Computer Design, 1999 (ICCD '99) International Conference on, pages 586–592, Austin, TX, USA, 1999.

[VH99] Voon-Yee Vee and Wen-Jing Hsu. A scalable and efficient storage allocator on shared memory multiprocessors. In ISPAN '99: Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks, page 230, Washington, DC, USA, 1999. IEEE Computer Society.

[Wal91] David W. Wall. Limits of instruction-level parallelism. In ASPLOS-IV: Proceedings of the fourth international conference on Architectural Support for Programming Languages and Operating Systems, pages 176–188, 1991.

[Wal95] David W. Wall. Limits of instruction-level parallelism, pages 432– 444. IEEE Computer Society Press, 1995.

[Wei06] Eric W. Weisstein. Homepage: Fractal. http://mathworld.wolfram.com/Fractal.html, August 2006.

[Wil94] Ted E. Williams. Performance of iterative computation in self-timed rings. Journal of VLSI Signal Processing, 7(1-2):17–31, 1994.

[Yi05] Kwangkeun Yi, editor. Register Allocation via Coloring of Chordal Graphs, volume 3780 of Lecture Notes in Computer Science. Springer, 2005.

[ZW03] Thomas Zeitlhofer and Bernhard Wess. List-coloring of interval graphs with application to register assignment for heterogeneous register-set architectures. Signal Processing, 83(7):1411–1425, 2003.

Appendix A

Ships in Armada-1

A.1 Register

Description

Holds a data value.

Input ports
in: write value to register; must not be persistent

Output ports
out: read value from register

Mini-opcodes
none

A.2 IntAdd64

Description

Adds two signed 64-bit integers.

Input ports
in lhs: left-hand-side of operation; only one of in lhs and in rhs may be persistent
in rhs: right-hand-side of operation; only one of in lhs and in rhs may be persistent

Output ports
out: result

Mini-opcodes
in lhs and in rhs: change the sign of either input (multiply by -1) by moving with miniop=1


A.3 IntDiv64

Description

Divides two signed 64-bit integers.

Input ports
in lhs: left-hand-side of operation; only one of in lhs and in rhs may be persistent
in rhs: right-hand-side of operation; only one of in lhs and in rhs may be persistent

Output ports
outQ: quotient
outR: remainder

Mini-opcodes
in lhs and in rhs: change the sign of either input (multiply by -1) by moving with miniop=1

A.4 IntMul64

Description

Multiplies two signed 64-bit integers.

Input ports
in lhs: left-hand-side of operation; only one of in lhs and in rhs may be persistent
in rhs: right-hand-side of operation; only one of in lhs and in rhs may be persistent

Output ports
out: result

Mini-opcodes
in lhs and in rhs: change the sign of either input (multiply by -1) by moving with miniop=1

A.5 FPAdd64

Description

Adds two 64-bit double-precision floating point numbers.

Input ports
in lhs: left-hand-side of operation; only one of in lhs and in rhs may be persistent
in rhs: right-hand-side of operation; only one of in lhs and in rhs may be persistent

Output ports
out: result

Mini-opcodes
in lhs and in rhs: change the sign of either input (multiply by -1) by moving with miniop=1

A.6 FPDiv64

Description

Divides two 64-bit double-precision floating point numbers.

Input ports
in lhs: left-hand-side of operation; only one of in lhs and in rhs may be persistent
in rhs: right-hand-side of operation; only one of in lhs and in rhs may be persistent

Output ports
out: result

Mini-opcodes
in lhs and in rhs: change the sign of either input (multiply by -1) by moving with miniop=1

A.7 FPMul64

Description

Multiplies two 64-bit double-precision floating point numbers.

Input ports

in_lhs: left-hand side of the operation; only one of in_lhs and in_rhs may be persistent
in_rhs: right-hand side of the operation; only one of in_lhs and in_rhs may be persistent

Output ports

out: result

Mini-opcodes

in_lhs and in_rhs: change the sign of either input (multiply by -1) by moving with miniop=1

A.8 BitShift

Description

Performs bit shifts on 32- or 64-bit integers.

Input ports

in_lhs: base value; only one of in_lhs and in_rhs may be persistent
in_rhs: amount to shift by; only one of in_lhs and in_rhs may be persistent

Output ports

out: result

Mini-opcodes

in_lhs: 0: 32_BIT; 1: 64_BIT
in_rhs: 0: logical shift left (LSL); 2: logical shift right (LSR); 3: arithmetic shift right (ASR)
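
As an illustration (a sketch; the reconstructed mini-opcode values above are assumed, and the instance names are chosen for illustration), the following shifts the value 3 left by 4 places in 64-bit mode:

(3) -> BitShift_0.in_lhs(1);     // miniop 1: 64_BIT operation
(4) -> BitShift_0.in_rhs(0);     // miniop 0: logical shift left (LSL)
BitShift_0.out -> Register_0.in; // Register_0 now holds 48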

A.9 BitOp

Description

Performs bitwise operations.

Input ports

in_lhs: first operand
in_rhs: second operand

Output ports

out: result

Mini-opcodes

in_lhs: none
in_rhs: 0: AND; 1: OR; 2: XOR

A.10 IntToFP

Description

Converts a 64-bit integer to a 64-bit double-precision floating point number.

Input ports

in: integer to convert; may not be persistent

Output ports

out: floating-point result

Mini-opcodes

in: change the sign (multiply by -1) by moving with miniop=1

A.11 SExt

Description

Sign-extends a 32-bit integer to 64 bits.

Input ports

in: integer to sign-extend; may not be persistent

Output ports

out: result

Mini-opcodes

none

A.12 Counter

Description

Counts a number of incoming tokens, generating an output token, last, once all of them have arrived.

Input ports

in_tok: port where tokens are counted; may not be persistent
in_cnt: number of tokens to wait for; may not be persistent

Output ports

out: generates the last token when all inputs have arrived

Mini-opcodes

none
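
This is the Ship the compiler-generated cleanup bags in Appendix C use to dispose of stray tokens before fetching the next code bag; a minimal sketch of that idiom (bag and instance names illustrative):

(NextBag) -> Join_0.in_pass;             // bag to fetch once cleanup completes
(2) -> Counter_0.in_cnt;                 // number of stray tokens to wait for
Memory_0.out_wrComp -> Counter_0.in_tok; // stray token 1
Register_1.out -> Counter_0.in_tok;      // stray token 2
Counter_0.out -> Join_0.in2;             // last token releases the join
Join_0.out -> Fetch_0.in_cbd;            // fetch the next bag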

A.13 Stride

Description

Generates a series of numbers, followed by a last token when the series is complete.

Input ports

in_start: first value to generate
in_step: increments the output value by this amount for each subsequent value generated
in_stop: maximum bound; if the output value would exceed this bound for a positive step, or fall below it for a negative step, the last token is generated and the state is cleared
in_next: any token other than last sent to this port causes the Ship to generate the next output; a last token received at this port resets the Ship state

Output ports

out: output tokens

Mini-opcodes

none
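
The Main code bag of the hand-coded Mandelbrot program in Appendix C.1 drives a Stride Ship as a loop counter; a minimal sketch of that idiom:

(0) -> Stride_0.in_start;
(1) -> Stride_0.in_step, Stride_0.in_next; // unit step; the token on in_next kicks off the first value
(240) -> Stride_0.in_stop;
Stride_0.out => Register_0.in;             // persistent move: each generated index lands here
// ...each loop iteration sends a token to Stride_0.in_next to request the next index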

A.14 Cmp

Description

Compares two tokens, generating a boolean output.

Input ports

in_lhs: left-hand-side value to compare; one of in_lhs or in_rhs may be persistent
in_rhs: right-hand-side value to compare; one of in_lhs or in_rhs may be persistent
in_op: type of comparison to perform; may be persistent
in_reset: any token received at this port causes the Ship to reset

Op  Mnemonic  Description
0   INT_EQ    integer, ==
1   INT_NE    integer, !=
2   INT_UGT   unsigned integer, >
3   INT_UGE   unsigned integer, >=
4   INT_ULT   unsigned integer, <
5   INT_ULE   unsigned integer, <=
6   INT_SGT   signed integer, >
7   INT_SGE   signed integer, >=
8   INT_SLT   signed integer, <
9   INT_SLE   signed integer, <=
16  FP_EQ     double-precision floating-point, ==
17  FP_NE     double-precision floating-point, !=
18  FP_GT     double-precision floating-point, >
19  FP_GE     double-precision floating-point, >=
20  FP_LT     double-precision floating-point, <
21  FP_LE     double-precision floating-point, <=

Table A.1: Cmp Ship in op input port values.

Output ports

out: boolean result of the comparison

Mini-opcodes

none
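
For example (a sketch; instance names illustrative), testing whether a register value is less than 100 as a signed integer uses op 8 (INT_SLT) from Table A.1:

Register_0.out -> Cmp_0.in_lhs;
(100) -> Cmp_0.in_rhs;
(INT_SLT) -> Cmp_0.in_op;       // op 8: signed integer <
Cmp_0.out -> Selector_0.in_sel; // the boolean result typically drives a Selector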

A.15 Join

Description

There are two variants of the Join Ship used in testing: a 2-way and a 4-way join, taking two and four input tokens respectively. When all inputs have arrived, the token from in_pass is forwarded to the output port.

Input ports

in_pass: value that is passed forward to the output port once all other inputs have arrived; may be persistent
in2: input to wait for; may not be persistent
in3: input to wait for; may not be persistent
in4: input to wait for; may not be persistent

Output ports

out: delivers the token from in_pass once all other inputs have also arrived

Mini-opcodes

none

A.16 Selector

Description

Passes one of two inputs forward to the output port based on a boolean input.

Input ports

in_true: input to pass forward if the boolean input is true; may be persistent
in_false: input to pass forward if the boolean input is false; may be persistent
in_sel: boolean value that selects which input to forward; may not be persistent; a last token delivered to this input resets the Ship

Output ports

out: the selected input is forwarded here

Mini-opcodes

none
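
Combined with a Cmp Ship, the Selector implements conditional branching, as throughout the listings in Appendix C: the selected code bag descriptor is forwarded to the fetch Ship. A sketch with hypothetical bag names:

(LoopBody) -> Selector_0.in_true;  // bag to fetch while the condition holds
(LoopExit) -> Selector_0.in_false; // bag to fetch when the loop terminates
Cmp_0.out -> Selector_0.in_sel;    // boolean from a comparison
Selector_0.out -> Fetch_0.in_cbd;  // fetch whichever bag was selected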

A.17 Toggle

Description

Delivers two tokens in a predefined order regardless of the order in which they arrive at the Ship. It may deliver the first token to the output port even if the second input has not yet arrived.

Input ports

in1: first data value to forward; may not be persistent
in2: second data value to forward; may not be persistent

Output ports

out: delivers first data value followed by second data value

Mini-opcodes

none

A.18 Fetch

Description

Fetches code bags and forwards values to new threads.

Input ports

in_cbd: code bag descriptor of the bag to fetch; may not be persistent; a last token delivered to this port signals the hardware that the Fleet core is ready for reuse
in_r0: forwarding register; may be persistent
in_r1: forwarding register; may be persistent
in_r2: forwarding register; may be persistent
in_r3: forwarding register; may be persistent
in_r4: forwarding register; may be persistent
in_r5: forwarding register; may be persistent
in_sp: forwarding register; may be persistent
in_lr: forwarding register; may be persistent

Output ports

none

Mini-opcodes

none

A.19 Memory

Description

Reads from and writes to shared system memory.

Input ports

in_rdAddr: address to read from; may not be persistent
in_wrAddr: address to write to; one of in_wrAddr and in_wrData may be persistent
in_wrData: data to write; one of in_wrAddr and in_wrData may be persistent

Output ports

out_rdData: data read from memory
out_wrComp: write-completion indicator token

Mini-opcodes

Opcodes representing the size of the data to read or write are accepted by the in_rdAddr and in_wrAddr ports.

0: 8-bit
1: 16-bit
2: 32-bit
3: 64-bit
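
For example (a sketch; the address source and instance names are illustrative), a 32-bit store followed by disposal of the completion token:

Register_0.out -> Memory_0.in_wrAddr(32_BIT); // address; miniop 2 selects a 32-bit access
(42) -> Memory_0.in_wrData;                   // value to store
Memory_0.out_wrComp -> BitBucket.in;          // discard the completion token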

A.20 ContextSynchronizer

Description

Provides a thread-barrier operation: the Ship waits for a specified number of threads to complete and then fetches an independent code bag to continue program flow.

Input ports

in_icbd: independent code bag descriptor to fetch once the barrier threads complete; may not be persistent
in_cnt: number of threads to wait for
in_r0 to in_r7: registers to forward to the code bag fetched after the barrier
in_decrement: threads the barrier is waiting for send the synchronizer reference here when they complete; when all barrier threads have completed, the hardware fetches the post-barrier code bag

Output ports

out_reference: returns a reference to a synchronizer object; should be forwarded to child threads, which pass this reference to the in_decrement port when they complete
out_decrement_conf: generated to confirm that a synchronizer reference was accepted by the in_decrement port

Mini-opcodes

none
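
A sketch of the barrier idiom (the instance, register, and bag names are illustrative, not taken from the benchmarks): the parent arms the synchronizer and forwards the reference to its children, each of which decrements it on completion:

(2) -> ContextSynchronizer_0.in_cnt;                  // wait for two child threads
(AfterBarrier) -> ContextSynchronizer_0.in_icbd;      // bag to fetch once both complete
ContextSynchronizer_0.out_reference -> Fetch_0.in_r0; // forward the reference to the children
// ...in each child, on completion:
// Register_0.out -> ContextSynchronizer_0.in_decrement;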

A.21 Stack

Description

Supplies software with pointers to stacks that can be assigned to new threads; software also returns stacks that are no longer needed to this Ship.

Input ports

in_stack_req: any token sent to this port requests a stack memory pointer
in_stack: a stack pointer previously allocated by this Ship that is no longer needed by software

Output ports

out_stack: response to in_stack_req; outputs a pointer to stack memory for software to use

Mini-opcodes

none

Appendix B

Hardware stack allocation support

Armada-1 has hardware support for low-latency stack memory allocation when spawning threads. Caller code can issue a stack allocation request to a special stack Ship in the Fleet. This Ship returns a pointer to memory that has been pre-allocated for this purpose. Ideally, the memory would be mapped into the running process' page table; Armada-1 does not have a memory management unit (MMU), so this behavior is not implemented. Although this support was implemented in Armada-1, very little research was done on the efficiency of the mechanism due to time constraints. Without an MMU, the merits of the mechanism are difficult to quantify. It is also not used by the benchmarks tested. The details of the implementation have therefore been relegated to this appendix.

B.1 Overview

The idea behind pre-allocating stacks is to decrease the latency of spawning many concurrent threads that each require their own stack. As Armada was designed to allow software to spawn many threads quickly with low overhead, hardware support seemed appealing. A Ship in each Fleet core buffers a number of memory pages reserved for these stacks. When a Fleet requests a stack, the allocation is local, avoiding the latencies associated with a centralized resource scheme. In this implementation, the Ships in the Fleet cores are connected to two of their neighbors as in figure B.1. A Ship can request stacks from one neighboring core and push stacks to the other.

Figure B.1: Hardware stack allocation scheme. Each Fleet in the Armada has a local stack Ship that buffers stack resources (a). The Ships are connected to one another in a ring formation. One segment of the ring interfaces with the operating system to allow stacks to be inserted into and removed from the ring. Each stack Ship can request stacks from one neighbor and push stacks to the other (b).

B.2 Priming the ring

Initially, all cores are starved for stacks and will send requests for memory to their neighbors. These requests ultimately trigger an event (i.e. an interrupt) demanding operating system attention. The operating system allocates memory for stacks and pushes them into a hardware FIFO. That FIFO feeds into one core in the ring of Fleets. Stacks are propagated through the ring until all stack Ship buffers meet their minimum watermark level.

B.3 Stack allocation

When a fetch Ship in a core fetches an independent code bag descriptor and the hardware stack allocation method is specified by the programmer, a pointer to one of the locally buffered stacks is returned. If that causes the local stack buffer to fall below its low watermark, the Ship requests another stack from its neighbor.
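
The software-visible side of this mechanism is the Stack Ship of section A.21; a sketch of a thread-spawning sequence that consumes a pre-allocated stack (instance and bag names illustrative):

(0) -> Stack_0.in_stack_req;        // any token requests a stack pointer
Stack_0.out_stack -> Fetch_0.in_sp; // hand the stack to the thread being fetched
(ChildBag) -> Fetch_0.in_cbd;       // spawn the child thread
// ...the child later returns the pointer to Stack_0.in_stack when it terminates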

B.4 Stack release

When stacks are released via the local stack Ship, the stack memory is returned to the local pool of available stack resources. If many stacks are freed at once, the local buffers may become full. When those buffers reach a high watermark, they begin pushing stacks to their neighbors. Ultimately, the Fleet that terminates the ring and interfaces with the software-accessible FIFO may push those stacks out of the ring. This again initiates an event that is handled by the operating system.

B.5 Testing

The stack ring was functionally tested with a directed test. This test ensured the system primed itself by requesting stacks from the operating system; in doing so it also functionally tested the Armada event-handling system. Once the system was primed, test code spawned many threads that requested hardware stack allocation. The ring did not have enough buffers from its initial priming to fulfill all these requests, and additional events prompted the software to top up the ring by pushing in more stacks. Finally, these threads terminated and released their stacks in near unison. This filled the ring with stacks and caused more events to push stacks out of the ring. The event handler responded by reading out of the FIFO, effectively 'freeing' those stacks.

Appendix C

Code listings

C.1 Mandelbrot, hand-coded

1 // input aliases for Tile 0 46 alias Selector_1.in_sel 39; 2 alias Register_0.in 0; 47 alias Selector_1.in_true 40; 3 alias Register_1.in 1; 48 alias Selector_1.in_false 41; 4 alias Register_2.in 2; 49 5 alias Register_3.in 3; 50 alias Stride_0.in_start 42; 6 alias Register_4.in 4; 51 alias Stride_0.in_step 43; 7 alias Register_5.in 5; 52 alias Stride_0.in_stop 44; 8 alias Register_6.in 6; 53 alias Stride_0.in_next 45; 9 alias Register_7.in 7; 54 alias Stride_1.in_start 46; 10 55 alias Stride_1.in_step 47; 11 alias Fetch_0.in_cbd 8; 56 alias Stride_1.in_stop 48; 12 alias Fetch_0.in_reg0 9; 57 alias Stride_1.in_next 49; 13 alias Fetch_0.in_reg1 10; 58 14 alias Fetch_0.in_reg2 11; 59 alias Toggle_0.in1 50; 15 alias Fetch_0.in_reg3 12; 60 alias Toggle_0.in2 51; 16 alias Fetch_0.in_reg4 13; 61 alias Toggle_1.in1 52; 17 alias Fetch_0.in_reg5 14; 62 alias Toggle_1.in2 53; 18 alias Fetch_0.in_reg6 15; 63 19 alias Fetch_0.in_reg7 16; 64 alias IntToFP_0.in 54; 20 65 alias IntToFP_1.in 55; 21 alias Cmp_0.in_lhs 17; 66 alias IntToFP_2.in 56; 22 alias Cmp_0.in_rhs 18; 67 alias IntToFP_3.in 57; 23 alias Cmp_0.in_op 19; 68 24 alias Cmp_0.in_rst 20; 69 alias FPMul64_3.in1 58; 25 alias Cmp_1.in_lhs 21; 70 alias FPMul64_3.in2 59; 26 alias Cmp_1.in_rhs 22; 71 27 alias Cmp_1.in_op 23; 72 alias IntAdd64_0.in1 64; 28 alias Cmp_1.in_rst 24; 73 alias IntAdd64_0.in2 65; 29 74 alias IntAdd64_1.in1 66; 30 alias FourJoin_0.in1 25; 75 alias IntAdd64_1.in2 67; 31 alias FourJoin_0.in2 26; 76 alias IntAdd64_2.in1 68; 32 alias FourJoin_0.in3 27; 77 alias IntAdd64_2.in2 69; 33 alias FourJoin_0.in4 28; 78 alias IntAdd64_3.in1 70; 34 alias FourJoin_1.in1 29; 79 alias IntAdd64_3.in2 71; 35 alias FourJoin_1.in2 30; 80 alias IntAdd64_4.in1 72; 36 alias FourJoin_1.in3 31; 81 alias IntAdd64_4.in2 73; 37 alias FourJoin_1.in4 32; 82 alias IntAdd64_5.in1 74; 38 83 alias IntAdd64_5.in2 75; 39 alias Memory_0.in_rdAddr 33; 84 alias IntAdd64_6.in1 76; 40 alias Memory_0.in_wrAddr 34; 85 alias IntAdd64_6.in2 77; 41 alias Memory_0.in_wrData 35; 86 alias IntAdd64_7.in1 78; 42 87 alias IntAdd64_7.in2 79; 43 alias Selector_0.in_sel 36; 88 alias IntAdd64_8.in1 80; 44 alias Selector_0.in_true 37; 89 alias IntAdd64_8.in2 81; 45 alias Selector_0.in_false 38; 90 alias IntAdd64_9.in1 82;


91 alias IntAdd64_9.in2 83; 163 92 alias IntAdd64_10.in1 84; 164 alias Selector_0.out 36; 93 alias IntAdd64_10.in2 85; 165 alias Selector_1.out 39; 94 alias IntAdd64_11.in1 86; 166 95 alias IntAdd64_11.in2 87; 167 alias Stride_0.out 42; 96 168 alias Stride_1.out 46; 97 alias IntDiv64_0.in1 88; 169 98 alias IntDiv64_0.in2 89; 170 alias Toggle_0.out 50; 99 alias IntDiv64_1.in1 90; 171 alias Toggle_1.out 52; 100 alias IntDiv64_1.in2 91; 172 101 alias IntDiv64_2.in1 92; 173 alias IntToFP_0.out 54; 102 alias IntDiv64_2.in2 93; 174 alias IntToFP_1.out 55; 103 alias IntDiv64_3.in1 94; 175 alias IntToFP_2.out 56; 104 alias IntDiv64_3.in2 95; 176 alias IntToFP_3.out 57; 105 177 106 alias IntMul64_0.in1 96; 178 alias FPMul64_3.out 58; 107 alias IntMul64_0.in2 97; 179 108 alias IntMul64_1.in1 98; 180 alias IntAdd64_0.out 64; 109 alias IntMul64_1.in2 99; 181 alias IntAdd64_1.out 66; 110 alias IntMul64_2.in1 100; 182 alias IntAdd64_2.out 68; 111 alias IntMul64_2.in2 101; 183 alias IntAdd64_3.out 70; 112 alias IntMul64_3.in1 102; 184 alias IntAdd64_4.out 72; 113 alias IntMul64_3.in2 103; 185 alias IntAdd64_5.out 74; 114 186 alias IntAdd64_6.out 76; 115 alias FPAdd64_0.in1 104; 187 alias IntAdd64_7.out 78; 116 alias FPAdd64_0.in2 105; 188 alias IntAdd64_8.out 80; 117 alias FPAdd64_1.in1 106; 189 alias IntAdd64_9.out 82; 118 alias FPAdd64_1.in2 107; 190 alias IntAdd64_10.out 84; 119 alias FPAdd64_2.in1 108; 191 alias IntAdd64_11.out 86; 120 alias FPAdd64_2.in2 109; 192 121 alias FPAdd64_3.in1 110; 193 alias IntDiv64_0.outQ 88; 122 alias FPAdd64_3.in2 111; 194 alias IntDiv64_0.outR 89; 123 195 alias IntDiv64_1.outQ 90; 124 alias FPDiv64_0.in1 112; 196 alias IntDiv64_1.outR 91; 125 alias FPDiv64_0.in2 113; 197 alias IntDiv64_2.outQ 92; 126 alias FPDiv64_1.in1 114; 198 alias IntDiv64_2.outR 93; 127 alias FPDiv64_1.in2 115; 199 alias IntDiv64_3.outQ 94; 128 alias FPDiv64_2.in1 116; 200 alias IntDiv64_3.outR 95; 129 alias FPDiv64_2.in2 117; 201 130 alias FPDiv64_3.in1 118; 202 alias IntMul64_0.out 96; 131 alias FPDiv64_3.in2 119; 203 alias IntMul64_1.out 98; 132 204 alias IntMul64_2.out 100; 133 alias FPMul64_0.in1 120; 205 alias IntMul64_3.out 102; 134 alias FPMul64_0.in2 121; 206 135 alias FPMul64_1.in1 122; 207 alias FPAdd64_0.out 104; 136 alias FPMul64_1.in2 123; 208 alias FPAdd64_1.out 106; 137 alias FPMul64_2.in1 124; 209 alias FPAdd64_2.out 108; 138 alias FPMul64_2.in2 125; 210 alias FPAdd64_3.out 110; 139 // 211 140 alias BitBucket.in 127; 212 alias FPDiv64_0.out 112; 141 213 alias FPDiv64_1.out 114; 142 214 alias FPDiv64_2.out 116; 143 // output aliases for Tile 0 215 alias FPDiv64_3.out 118; 144 alias Register_0.out 0; 216 145 alias Register_1.out 1; 217 alias FPMul64_0.out 120; 146 alias Register_2.out 2; 218 alias FPMul64_1.out 122; 147 alias Register_3.out 3; 219 alias FPMul64_2.out 124; 148 alias Register_4.out 4; 220 149 alias Register_5.out 5; 221 150 alias Register_6.out 6; 222 151 alias Register_7.out 7; 223 //// MANDELBROT //// 152 224 initial codebag reset 0{ 153 alias Literal.out 8; 225 (dispatch) -> Fetch_0.in_cbd; 154 226 } reset; 155 alias Cmp_0.out 17; 227 156 alias Cmp_1.out 21; 228 // Put the various programs to run here 157 229 dependent codebag dispatch 0 { 158 alias FourJoin_0.out 25; 230 (Main) -> Toggle_0.in1; 159 alias FourJoin_1.out 29; 231 "OOB" -> Toggle_0.in2; 160 232 Toggle_0.out => Fetch_0.in_cbd; 161 alias Memory_0.out_rdData 33; 233 } dispatch; 162 alias Memory_0.out_wrComp 34; 234 188 APPENDIX C. CODE LISTINGS

235 independent codebag Main 0{ 307 Register_3.out -> *FPMul64_0.in2; 236 // viewport.width 308 FPMul64_0.out => FPAdd64_0.in1, Stride_0.in_next; 237 (4.0D) -> FPDiv64_0.in1; 309 FPAdd64_0.out => Fetch_0.in_reg3; 238 // viewport.height 310 } OuterLoop; 239 (3.0D) -> FPDiv64_1.in1; 311 240 // viewport.getMinY 312 dependent codebag OuterLoopCleanup 0 (r0, r1, r2, r3, r4){ 241 (-1.5D) -> *FPAdd64_0.in2; 313 Register_0.out -> Stride_0.in_start; 242 314 Register_1.out -> Stride_0.in_stop; 243 "OOB" -> *Cmp_0.in_rhs; 315 Register_2.out -> Stride_0.in_step; 244 (OuterLoop) -> *Selector_0.in_true; 316 Register_3.out -> FourJoin_0.in1; 245 (MainCleanup) -> *Selector_0.in_false; 317 Register_4.out -> FourJoin_0.in2; 246 318 Stride_0.out -> FourJoin_0.in3, FourJoin_0.in4; 247 319 FourJoin_0.out -> Cmp_0.in_rst; 248 // window.width 320 } OuterLoopCleanup; 249 (320) -> IntToFP_0.in, Fetch_0.in_reg4; 321 250 IntToFP_0.out -> FPDiv64_0.in2; 322 // i, j, c_i, c_r, window.width 251 // window.height 323 independent codebag InnerLoop 0 (*r0, r1, *r2, r3, *r4) { 252 (240) -> IntToFP_1.in, Stride_0.in_stop; 324 // init z_r, z_i 253 325 (0.0D) -> Register_5.in, Register_6.in; 254 IntToFP_1.out -> FPDiv64_1.in2; 326 255 327 "OOB" -> *Cmp_0.in_lhs; 256 FPDiv64_0.out -> Fetch_0.in_reg3; 328 (InnerLoop_compute) -> *Selector_0.in_true; 257 FPDiv64_1.out -> *FPMul64_0.in2; 329 (InnerLoop_cleanup) -> *Selector_0.in_false; 258 330 259 (0) -> Stride_0.in_start; 331 (100000.0D) -> *Cmp_1.in_rhs; 260 (1) -> Stride_0.in_step, Stride_0.in_next; 332 (InnerLoop_cleanup2) -> *Selector_1.in_true; 261 Stride_0.out => IntToFP_2.in, Register_0.in; 333 (InnerLoop_next_itr) -> *Selector_1.in_false; 262 IntToFP_2.out => FPMul64_0.in1; 334 263 Register_0.out => Cmp_0.in_lhs, Fetch_0.in_reg1; 335 (1) -> *Cmp_0.in_op; 264 336 (18) -> *Cmp_1.in_op; 265 FPMul64_0.out => FPAdd64_0.in1; 337 266 FPAdd64_0.out => Fetch_0.in_reg2, Stride_0.in_next; 338 Register_3.out -> *FPAdd64_1.in2, FourJoin_0.in2; 267 339 Register_2.out -> *FPAdd64_2.in2, FourJoin_0.in3; 268 (1) -> *Cmp_0.in_op; 340 269 Cmp_0.out => Selector_0.in_sel; 341 (-1) -> FourJoin_0.in1, FourJoin_0.in4; 270 Selector_0.out => Fetch_0.in_cbd; 342 FourJoin_0.out -> Stride_0.in_start, Memory_0.in_wrData; 271 } Main; 343 (1) -> Stride_0.in_step, Stride_0.in_next; 272 344 (256) -> Stride_0.in_stop; 273 dependent codebag MainCleanup 0 (r1, r2, r3, r4) { 345 Stride_0.out => Cmp_0.in_rhs, Register_7.in; 274 Register_4.out -> Stride_0.in_start, Stride_0.in_stop; 346 Cmp_0.out => Selector_0.in_sel; 275 Stride_0.out -> FourJoin_0.in1; 347 Selector_0.out => Fetch_0.in_cbd; 276 Register_1.out -> FourJoin_0.in2; 348 277 Register_2.out -> FourJoin_0.in3; 349 Cmp_1.out => Selector_1.in_sel; 278 Register_3.out -> FourJoin_0.in4, Stride_0.in_step; 350 } InnerLoop; 279 FourJoin_0.out -> Cmp_0.in_rst; 351 280 } MainCleanup; 352 dependent codebag InnerLoop_compute 0 { 281 353 Register_5.out -> FPMul64_0.in1, Register_2.in; 282 354 Register_2.out -> FPMul64_0.in2, FPMul64_2.in1; 283 // window.width, i, c_i, widthRatio 355 284 independent codebag OuterLoop 0 (r1, r2, *r3, *r4) { 356 Register_6.out -> FPMul64_1.in1, Register_3.in; 285 "OOB" -> *Cmp_0.in_rhs; 357 Register_3.out -> FPMul64_1.in2, FPMul64_2.in2; 286 358 287 // viewport.getMinX 359 FPMul64_0.out -> FPAdd64_0.in1, FPAdd64_3.in1; 288 (-2.5D) -> *FPAdd64_0.in2; 360 FPMul64_1.out -> FPAdd64_0.in2(1), FPAdd64_3.in2; 289 361 FPAdd64_0.out -> FPAdd64_1.in1; 290 (InnerLoop) -> *Selector_0.in_true; 362 FPAdd64_1.out -> Register_5.in, FourJoin_0.in3; 
291 (OuterLoopCleanup) -> *Selector_0.in_false; 363 292 364 FPMul64_2.out -> FPMul64_3.in1; 293 365 (2.0D) -> FPMul64_3.in2; 294 Register_4.out -> Stride_0.in_stop, Fetch_0.in_reg4; 366 FPMul64_3.out -> FPAdd64_2.in1; 295 (0) -> Stride_0.in_start; 367 FPAdd64_2.out -> Register_6.in, FourJoin_0.in4; 296 (1) -> Stride_0.in_step, Stride_0.in_next; 368 297 Stride_0.out => IntToFP_0.in, Register_0.in; 369 FPAdd64_3.out -> Cmp_1.in_lhs; 298 Register_0.out => Cmp_0.in_lhs, Fetch_0.in_reg1; 370 Selector_1.out -> FourJoin_0.in1, FourJoin_0.in2; 299 IntToFP_0.out => FPMul64_0.in1; 371 FourJoin_0.out -> Fetch_0.in_cbd; 300 372 } InnerLoop_compute; 301 (1) -> *Cmp_0.in_op; 373 302 Cmp_0.out => Selector_0.in_sel; 374 dependent codebag InnerLoop_next_itr 0 { 303 Selector_0.out => Fetch_0.in_cbd; 375 "OOB" -> Memory_0.in_wrAddr(1); 304 376 Memory_0.out_wrComp -> BitBucket.in; 305 Register_1.out -> Fetch_0.in_reg0; 377 Register_7.out -> Memory_0.in_wrData, Stride_0.in_next; 306 Register_2.out -> Fetch_0.in_reg2; 378 } InnerLoop_next_itr; C.2. MANDELBROT, LLVM-GENERATED 189

379 396 Register_6.out -> FourJoin_0.in2; 380 // this bag gets loaded if iter == iter_max 397 FPAdd64_2.out -> FourJoin_0.in3; 381 // causing inner loop to terminate execution 398 Memory_0.out_wrComp -> FourJoin_0.in4; 382 // i, j, X, X, window.width, 399 383 dependent codebag InnerLoop_cleanup 0 { 400 FourJoin_0.out -> Cmp_1.in_rst; 384 "OOB" -> FPAdd64_1.in1; 401 Selector_1.out -> Cmp_0.in_rst; 385 FPAdd64_1.out -> FPAdd64_2.in1; 402 } InnerLoop_cleanup; 386 Register_7.out -> BitBucket.in; 403 387 Register_4.out -> IntMul64_0.in1; 404 // this bag gets loaded if z_r^2 + z_i^2 > bound 388 Register_0.out -> IntMul64_0.in2; 405 // causing inner loop to terminate execution 389 IntMul64_0.out -> IntMul64_2.in2; 406 dependent codebag InnerLoop_cleanup2 0 { 390 IntMul64_2.out -> IntAdd64_0.in1; 407 "OOB" -> Memory_0.in_wrAddr(4), FourJoin_0.in4; 391 (16) -> IntMul64_1.in1, IntMul64_2.in1; 408 Memory_0.out_wrComp -> FourJoin_0.in1, FourJoin_0.in2; 392 Register_1.out -> IntMul64_1.in2; 409 Register_7.out -> Memory_0.in_wrData, FourJoin_0.in3; 393 IntMul64_1.out -> IntAdd64_0.in2; 410 FourJoin_0.out -> Stride_0.in_next; 394 IntAdd64_0.out -> Memory_0.in_wrAddr(1); 411 } InnerLoop_cleanup2; 395 Register_5.out -> FourJoin_0.in1;

C.2 Mandelbrot, LLVM-generated

Note that some small hand-coded changes were made in the following code to address limitations of the compiler back-end. The mandatory aliases shown in the previous listing are omitted here for brevity. The corresponding LLVM IR is shown at the top of each code bag. The comments were generated automatically by the compiler and are not intended to draw the reader's attention to anything in particular.

1 ;; initial codebag - this is hardcoded and assumes your program has a 2 ;; ‘main’ function 3 initial codebag __main 0 { 4 (dispatch) -> Fetch_0.in_cbd; 5 } __main; 6 7 dependent codebag dispatch 0 { 8 (exit) -> Fetch_0.in_lr; 9 (main) -> Toggle_0.in1; 10 "OOB" -> Toggle_0.in2; 11 Toggle_0.out -> Fetch_0.in_cbd; 12 (1048576) -> Fetch_0.in_sp, Fetch_0.in_r2; 13 (1048576) -> Fetch_0.in_r0, Fetch_0.in_r1; 14 Toggle_0.out -> Fetch_0.in_cbd; 15 } dispatch; 16 17 independent codebag exit 0(r6) { 18 "OOB" -> Join_0.in_pass; 19 Register_6.out -> Join_0.in2; 20 Join_0.out -> Fetch_0.in_cbd; 21 } exit; 22 23 24 ;; Function ‘my_malloc’ 25 ;; prototype: void (i8 * *, i32) 26 ;; predecessors: N/A 27 ;; successors: N/A 28 ;; ======29 ;; StackView (Function ‘my_malloc’) 30 ;; ------31 ;; nothing 32 ;; ======33 independent codebag my_malloc 0 (r0, r1, r6, r7) { 34 ;; %_incoming_sp_ = load i64* @Register6.out ; [#uses=1] 35 ;; %_lr_ = load i64* @Register7.out ; [#uses=1] 190 APPENDIX C. CODE LISTINGS

36 ;; store i64 %_lr_, i64* @Fetch.in_cbd 37 ;; store i8* null, i8** %_ret_val_storage_ptr_ 38 ;; %_outgoing_sp_ = bitcast i64 %_incoming_sp_ to i64 ; [#uses=1] 39 ;; store i64 %_outgoing_sp_, i64* @Fetch.in7 40 ;; ret void 41 42 "OOB" -> Toggle_0.in2; 43 Register_7.out -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 44 Register_0.out -> Memory_2.in_wrAddr(64_BIT); // store: ... -> [_ret_val_storage_ptr_] 45 (0) -> Memory_2.in_wrData(0); // store: <<>> -> ... 46 Register_6.out -> Fetch_0.in_sp(0); // store: fetch_sp 47 (2) -> Counter_0.in_cnt(0); // cleanup: num stray tokens to wait for 48 Counter_0.out -> Join_0.in2(0); // cleanup: trigger bag fetch when all stray tokens arrive 49 Memory_2.out_wrComp -> Counter_0.in_tok(0); // cleanup: remove this stray token 50 Register_1.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 51 Join_0.out -> Toggle_0.in1; 52 Toggle_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 53 Toggle_0.out -> Fetch_0.in_cbd(0); 54 } my_malloc; 55 56 57 ;; Function ‘my_free’ 58 ;; prototype: void (i8 *) 59 ;; predecessors: N/A 60 ;; successors: N/A 61 ;; ======62 ;; StackView (Function ‘my_free’) 63 ;; ------64 ;; nothing 65 ;; ======66 independent codebag my_free 0 (r0, r6, r7) { 67 ;; %_incoming_sp_ = load i64* @Register6.out ; [#uses=1] 68 ;; %_lr_ = load i64* @Register7.out ; [#uses=1] 69 ;; store i64 %_lr_, i64* @Fetch.in_cbd 70 ;; %_outgoing_sp_ = bitcast i64 %_incoming_sp_ to i64 ; [#uses=1] 71 ;; store i64 %_outgoing_sp_, i64* @Fetch.in7 72 ;; ret void 73 74 "OOB" -> Toggle_0.in2; 75 Register_7.out -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 76 Register_6.out -> Fetch_0.in_sp(0); // store: fetch_sp 77 Join_0.out -> Toggle_0.in1; // cleanup: wait until done before fetching next bag 78 Register_0.out -> Join_0.in2(0); // cleanup: remove this stray token 79 Toggle_0.out -> Fetch_0.in_cbd(0); 80 Toggle_0.out -> Fetch_0.in_cbd(0); 81 } my_free; 82 83 84 ;; Function ‘main’ 85 ;; prototype: void (i32 *, i32, i8 * *) 86 ;; predecessors: N/A 87 ;; successors: main_bb_entry_pt_veneer 88 ;; ======89 ;; StackView (Function ‘main’) 90 ;; ------91 ;; Offset: -8, Name: _lr_storage, Type: i64 * 92 ;; Offset: -16, Name: my_malloc_retVal, Type: i8 * * 93 ;; Offset: -24, Name: stacked__ret_val_storage_ptr_, Type: i32 * * 94 ;; ======95 independent codebag main 0 (r0, r1, r2, r6, r7) { 96 ;; %_lr_storage_addr = add i64 %_incoming_sp_, -8 ; [#uses=1] 97 ;; %_lr_storage_addr_as_ptr = inttoptr i64 %_lr_storage_addr to i64* ; [#uses=1] 98 ;; %my_malloc_retVal_addr = add i64 %_incoming_sp_, -16 ; [#uses=1] 99 ;; %my_malloc_retVal_addr_as_ptr = inttoptr i64 %my_malloc_retVal_addr to i8** ; [#uses=1] 100 ;; %stacked__ret_val_storage_ptr__addr = add i64 %_incoming_sp_, -24 ; [#uses=1] 101 ;; %stacked__ret_val_storage_ptr__addr_as_ptr = inttoptr i64 %stacked__ret_val_storage_ptr__addr to i32** 102 ;; [#uses=1] 103 ;; %_incoming_sp_ = load i64* @Register6.out ; [#uses=3] 104 ;; store i32* %_ret_val_storage_ptr_, i32** %stacked__ret_val_storage_ptr__addr_as_ptr 105 ;; %_incoming_lr_ = load i64* @Register7.out ; [#uses=1] 106 ;; store i64 %_incoming_lr_, i64* %_lr_storage_addr_as_ptr 107 ;; %tmp1 = tail call void @my_malloc( i8** %my_malloc_retVal_addr_as_ptr, i32 1228800 ) C.2. MANDELBROT, LLVM-GENERATED 191

108 ;; br label %main_bb_entry_pt_veneer 109 110 "OOB" -> Toggle_0.in2; 111 (main_bb_entry_pt_veneer) -> Fetch_0.in_lr(0); 112 (my_malloc) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 113 (1228800) -> Fetch_0.in_r1(0); 114 (-8) -> IntAdd64_5.in_rhs(0); 115 (-16) -> IntAdd64_1.in_rhs(0); 116 (-24) -> IntAdd64_3.in_rhs(0), IntAdd64_2.in_rhs(0); 117 IntAdd64_3.out -> Memory_0.in_wrAddr(64_BIT); // store: ... -> [stacked__ret_val_storage_ptr__addr_as_ptr] 118 Register_0.out -> Memory_0.in_wrData(0); // store: <<_ret_val_storage_ptr_>> -> ... 119 IntAdd64_5.out -> Memory_1.in_wrAddr(64_BIT); // store: ... -> [_lr_storage_addr_as_ptr] 120 Register_7.out -> Memory_1.in_wrData(0); // store: <<_incoming_lr_>> -> ... 121 IntAdd64_1.out -> Fetch_0.in_r0(0); 122 Register_6.out -> IntAdd64_2.in_lhs(0), Register_3.in; 123 Register_3.out -> IntAdd64_5.in_lhs(0), Register_4.in; 124 Register_4.out -> IntAdd64_1.in_lhs(0), IntAdd64_3.in_lhs(0); 125 IntAdd64_2.out -> Fetch_0.in_sp(0); 126 Join_0.out -> Toggle_0.in1; 127 Toggle_0.out -> Fetch_0.in_cbd; 128 Toggle_0.out -> Fetch_0.in_cbd; 129 (4) -> Counter_0.in_cnt(0); // cleanup: num stray tokens to wait for 130 Counter_0.out -> Join_0.in2(0); // cleanup: trigger bag fetch when all stray tokens arrive 131 Memory_0.out_wrComp -> Counter_0.in_tok(0); // cleanup: remove this stray token 132 Memory_1.out_wrComp -> Counter_0.in_tok(0); // cleanup: remove this stray token 133 Register_1.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 134 Register_2.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 135 } main; 136 137 138 ;; predecessors: main 139 ;; successors: main_bb 140 independent codebag main_bb_entry_pt_veneer 0 (r6) { 141 ;; %_incoming_sp_1 = load i64* @Register6.out ; [#uses=1] 142 ;; %_callee_adjusted_sp_ = add i64 %_incoming_sp_1, 24 ; [#uses=0] 143 ;; br label %main_bb 144 145 (main_bb) -> Fetch_0.in_cbd(0); 146 Register_6.out -> IntAdd64_1.in_lhs(0); 147 (24) -> IntAdd64_1.in_rhs(0); 148 (0) -> Register_0.in(0); // phi: setup value for successor bb ‘main_bb’ 149 } main_bb_entry_pt_veneer; 150 151 ;; predecessors: main_bb_entry_pt_veneer main_bb77 152 ;; successors: main_bb15 153 dependent codebag main_bb 0 { 154 ;; %i.0.reg2mem.0 = phi i32 [ 0, %main_bb_entry_pt_veneer ], [ %indvar.next109, %main_bb77 ] ; [#uses=3] 155 ;; %tmp910 = sitofp i32 %i.0.reg2mem.0 to double ; [#uses=1] 156 ;; %tmp13 = mul double %tmp910, 6.250000e-03 ; [#uses=1] 157 ;; %tmp14 = sub double %tmp13, 1.500000e+00 ; [#uses=1] 158 ;; %tmp59 = mul i32 %i.0.reg2mem.0, 640 ; [#uses=1] 159 ;; br label %main_bb15 160 161 (main_bb15) -> Fetch_0.in_cbd(0); 162 (0.5000D) -> FPMul64_0.in_rhs(0); 163 (1.50000D) -> FPAdd64_0.in_rhs(1); 164 Register_0.out *-> IntToFP_0.in(0), IntMul64_0.in_lhs(0); // sitofp: (float)i.0.reg2mem.0 165 IntToFP_0.out -> FPMul64_0.in_lhs(0); 166 FPMul64_0.out -> FPAdd64_0.in_lhs(0); 167 (8) -> IntMul64_0.in_rhs(0); 168 (0) -> Register_1.in(0); // phi: setup value for successor bb ‘main_bb15’ 169 } main_bb; 170 171 ;; predecessors: main_bb main_bb57__to__main_bb15__cleanup 172 ;; successors: main_bb22 173 dependent codebag main_bb15 0 { 174 ;; %j.0.reg2mem.0 = phi i32 [ 0, %main_bb ], [ %indvar.next108, %main_bb52__to__main_bb57__cleanup ] ; [#uses=3] 175 ;; %tmp1617 = sitofp i32 %j.0.reg2mem.0 to double ; [#uses=1] 176 ;; %tmp19 = mul double %tmp1617, 6.250000e-03 ; [#uses=1] 177 ;; %tmp20 = sub double %tmp19, 2.500000e+00 ; [#uses=1] 178 ;; br label %main_bb22 179 192 APPENDIX C. CODE LISTINGS

180 (main_bb22) -> Join_0.in_pass; 181 (0.00000D) -> Register_4.in(0), Register_5.in(0); // phi: setup value for successor bb ‘main_bb22’ 182 (0.5000D) -> FPMul64_0.in_rhs(0); 183 (2.50000D) -> FPAdd64_1.in_rhs(1); 184 Join_0.out -> Fetch_0.in_cbd(0); 185 Register_1.out *-> IntToFP_0.in(0); // sitofp: (float)j.0.reg2mem.0 186 IntToFP_0.out -> FPMul64_0.in_lhs(0); 187 FPMul64_0.out -> FPAdd64_1.in_lhs(0), Join_0.in2; 188 (0) -> Register_3.in(0); // phi: setup value for successor bb ‘main_bb22’ 189 } main_bb15; 190 191 ;; predecessors: main_bb15 main_bb52__to__main_bb22__cleanup 192 ;; successors: main_bb22__to__main_bb57__cleanup main_bb22__to__main_bb52__cleanup 193 dependent codebag main_bb22 0 { 194 ;; %iter.0.reg2mem.0 = phi i32 [ 0, %main_bb15 ], [ %tmp51, %main_bb52 ] ; [#uses=2] 195 ;; %z_i.0.reg2mem.0 = phi double [ 0.000000e+00, %main_bb15 ], [ %tmp38, %main_bb52 ] ; [#uses=3] 196 ;; %z_r.0.reg2mem.0 = phi double [ 0.000000e+00, %main_bb15 ], [ %tmp31, %main_bb52 ] ; [#uses=3] 197 ;; %tmp25 = mul double %z_r.0.reg2mem.0, %z_r.0.reg2mem.0 ; [#uses=1] 198 ;; %tmp28 = mul double %z_i.0.reg2mem.0, %z_i.0.reg2mem.0 ; [#uses=1] 199 ;; %tmp29 = sub double %tmp25, %tmp28 ; [#uses=1] 200 ;; %tmp31 = add double %tmp29, %tmp20 ; [#uses=3] 201 ;; %tmp33 = mul double %z_r.0.reg2mem.0, 2.000000e+00 ; [#uses=1] 202 ;; %tmp36 = mul double %tmp33, %z_i.0.reg2mem.0 ; [#uses=1] 203 ;; %tmp38 = add double %tmp36, %tmp14 ; [#uses=3] 204 ;; %tmp42 = mul double %tmp31, %tmp31 ; [#uses=1] 205 ;; %tmp45 = mul double %tmp38, %tmp38 ; [#uses=1] 206 ;; %tmp46 = add double %tmp42, %tmp45 ; [#uses=1] 207 ;; %tmp47 = fcmp ogt double %tmp46, 1.000000e+05 ; [#uses=1] 208 ;; br i1 %tmp47, label %main_bb22__to__main_bb57__cleanup, label %main_bb22__to__main_bb52__cleanup 209 210 (main_bb22__to__main_bb57__cleanup) -> Selector_0.in_true(0); 211 (main_bb22__to__main_bb52__cleanup) -> Selector_0.in_false(0); 212 (2.00000D) -> FPMul64_2.in_rhs(0); 213 (100000.0D) -> Cmp_0.in_rhs(0); 214 Register_5.out *-> FPMul64_0.in_lhs(0), FPMul64_0.in_rhs(0); 215 Register_5.out *-> FPMul64_2.in_lhs(0); 216 Register_4.out *-> FPMul64_1.in_lhs(0), FPMul64_1.in_rhs(0); 217 Register_4.out *-> FPMul64_3.in_rhs(0); 218 FPMul64_0.out -> FPAdd64_4.in_lhs(0); 219 FPMul64_1.out -> FPAdd64_4.in_rhs(1); 220 FPAdd64_4.out -> FPAdd64_2.in_lhs(0); 221 FPAdd64_1.out *-> FPAdd64_2.in_rhs(0); 222 FPMul64_2.out -> FPMul64_3.in_lhs(0); 223 FPMul64_3.out -> FPAdd64_3.in_lhs(0); 224 FPAdd64_0.out *-> FPAdd64_3.in_rhs(0); 225 FPAdd64_2.out *-> FPMul64_4.in_lhs(0), FPMul64_4.in_rhs(0); 226 FPAdd64_3.out *-> FPMul64_5.in_lhs(0), FPMul64_5.in_rhs(0); 227 FPMul64_4.out -> FPAdd64_5.in_lhs(0); 228 FPMul64_5.out -> FPAdd64_5.in_rhs(0); 229 FPAdd64_5.out -> Cmp_0.in_lhs(0); 230 (FP_GT) -> Cmp_0.in_op(0); 231 Cmp_0.out -> Selector_0.in_sel(0); 232 Selector_0.out -> Fetch_0.in_cbd(0); 233 Register_3.out *-> Register_2.in(0); // phi: setup value for successor bb ‘main_bb57’ 234 } main_bb22; 235 236 ;; predecessors: main_bb22 237 ;; successors: main_bb57 238 dependent codebag main_bb22__to__main_bb57__cleanup 0 { 239 ;; br label %main_bb57 240 241 (main_bb57) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 242 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 243 (6) -> Counter_0.in_cnt(0); // cleanup: num stray tokens to wait for 244 Counter_0.out -> Join_0.in2(0); // cleanup: trigger bag fetch when all stray tokens arrive 245 FPAdd64_1.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 246 
Register_3.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 247 Register_4.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 248 Register_5.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 249 FPAdd64_2.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 250 FPAdd64_3.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 251 } main_bb22__to__main_bb57__cleanup; C.2. MANDELBROT, LLVM-GENERATED 193

252 253 ;; predecessors: main_bb22__to__main_bb57__cleanup main_bb52__to__main_bb57__cleanup 254 ;; successors: main_bb57__to__main_bb77__cleanup main_bb57__to__main_bb15__cleanup 255 dependent codebag main_bb57 0 { 256 ;; %iter.0.reg2mem.1 = phi i32 [ %iter.0.reg2mem.0, %main_bb52__to__main_bb22__cleanup ], 257 ;; [ %tmp51, %main_bb52 ] ; [#uses=1] 258 ;; %tmp61 = add i32 %j.0.reg2mem.0, %tmp59 ; [#uses=1] 259 ;; %tmp6164 = sext i32 %tmp61 to i64 ; [#uses=1] 260 ;; %tmp65 = getelementptr i32* null, i64 %tmp6164 ; [#uses=1] 261 ;; store i32 %iter.0.reg2mem.1, i32* %tmp65, align 4 262 ;; %indvar.next108 = add i32 %j.0.reg2mem.0, 1 ; [#uses=2] 263 ;; %exitcond = icmp eq i32 %indvar.next108, 640 ; [#uses=1] 264 ;; br i1 %exitcond, label %main_bb57__to__main_bb77__cleanup, label %main_bb57__to__main_bb15__cleanup 265 266 (main_bb57__to__main_bb77__cleanup) -> Selector_0.in_true(0); 267 (main_bb57__to__main_bb15__cleanup) -> Selector_0.in_false(0); 268 Register_1.out -> IntAdd64_0.in_lhs(0), IntAdd64_3.in_lhs(0); 269 IntMul64_0.out *-> IntAdd64_0.in_rhs(0); 270 IntAdd64_0.out -> SExt_0.in(0); // sext: (int64_t)tmp61 271 (0) -> IntAdd64_2.in_lhs(0), Cmp_0.in_op(0); // getelemptr 272 (2) -> IntMul64_1.in_lhs(0); // getelemptr 273 IntMul64_1.out -> IntAdd64_2.in_rhs(0); // getelemptr 274 SExt_0.out -> IntMul64_1.in_rhs(0); // getelemptr 275 IntAdd64_2.out -> Memory_0.in_wrAddr(16_BIT); // store: ... -> [tmp65] 276 Register_2.out -> Memory_0.in_wrData(0); // store: <> -> ... 277 (1) -> IntAdd64_3.in_rhs(0); 278 IntAdd64_3.out -> Cmp_0.in_lhs(0), Register_1.in(0); 279 (8) -> Cmp_0.in_rhs(0); 280 Cmp_0.out -> Selector_0.in_sel(0); 281 Selector_0.out -> Fetch_0.in_cbd(0); 282 } main_bb57; 283 284 ;; predecessors: main_bb57 285 ;; successors: main_bb77 286 dependent codebag main_bb57__to__main_bb77__cleanup 0 { 287 ;; br label %main_bb77 288 289 (main_bb77) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 290 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 291 (4) -> Counter_0.in_cnt(0); // cleanup: num stray tokens to wait for 292 Counter_0.out -> Join_0.in2(0); // cleanup: trigger bag fetch when all stray tokens arrive 293 FPAdd64_0.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 294 IntMul64_0.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 295 Memory_0.out_wrComp -> Counter_0.in_tok(0); // cleanup: remove this stray token 296 Register_1.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 297 } main_bb57__to__main_bb77__cleanup; 298 299 ;; predecessors: main_bb57__to__main_bb77__cleanup 300 ;; successors: main_bb77__to__main_bb85__cleanup main_bb 301 dependent codebag main_bb77 0 { 302 ;; %indvar.next109 = add i32 %i.0.reg2mem.0, 1 ; [#uses=2] 303 ;; %exitcond110 = icmp eq i32 %indvar.next109, 480 ; [#uses=1] 304 ;; br i1 %exitcond110, label %main_bb77__to__main_bb85__cleanup, label %main_bb 305 306 (main_bb77__to__main_bb85__cleanup) -> Selector_0.in_true(0); 307 (main_bb) -> Selector_0.in_false(0); 308 Register_0.out -> IntAdd64_0.in_lhs(0); 309 (1) -> IntAdd64_0.in_rhs(0); 310 IntAdd64_0.out -> Cmp_0.in_lhs(0), Register_0.in(0); 311 (6) -> Cmp_0.in_rhs(0); 312 (INT_EQ) -> Cmp_0.in_op(0); 313 Cmp_0.out -> Selector_0.in_sel(0); 314 Selector_0.out -> Fetch_0.in_cbd(0); 315 } main_bb77; 316 317 ;; predecessors: main_bb77 318 ;; successors: main_bb85 319 dependent codebag main_bb77__to__main_bb85__cleanup 0 { 320 ;; br label %main_bb85 321 322 (main_bb85) -> Join_0.in_pass(0); // cleanup: fetch next 
bag when cleanup complete 323 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 194 APPENDIX C. CODE LISTINGS

324 Register_0.out -> Join_0.in2(0); // cleanup: remove this stray token 325 } main_bb77__to__main_bb85__cleanup; 326 327 ;; predecessors: main_bb77__to__main_bb85__cleanup 328 ;; successors: main_bb85_cr 329 dependent codebag main_bb85 0 { 330 ;; tail call void @my_free( i8* null ) nounwind 331 ;; br label %main_bb85_cr 332 333 (main_bb85_cr) -> Fetch_0.in_lr(0); 334 (my_free) -> Fetch_0.in_cbd(0); 335 (0) -> Fetch_0.in_r0(0); 336 IntAdd64_1.out -> IntAdd64_0.in_lhs(0); 337 (-24) -> IntAdd64_0.in_rhs(0); 338 IntAdd64_0.out -> Fetch_0.in_sp(0); 339 } main_bb85; 340 341 ;; predecessors: main_bb57 342 ;; successors: main_bb15 343 dependent codebag main_bb57__to__main_bb15__cleanup 0 { 344 ;; br label %main_bb15 345 346 (main_bb15) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 347 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 348 Memory_0.out_wrComp -> Join_0.in2(0); // cleanup: remove this stray token 349 } main_bb57__to__main_bb15__cleanup; 350 351 ;; predecessors: main_bb22 352 ;; successors: main_bb52 353 dependent codebag main_bb22__to__main_bb52__cleanup 0 { 354 ;; br label %main_bb52 355 356 (main_bb52) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 357 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 358 (3) -> Counter_0.in_cnt; 359 Counter_0.out -> Join_0.in2(0); // cleanup: remove this stray token 360 Register_2.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 361 Register_4.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 362 Register_5.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 363 } main_bb22__to__main_bb52__cleanup; 364 365 ;; predecessors: main_bb22__to__main_bb52__cleanup 366 ;; successors: main_bb52__to__main_bb22__cleanup main_bb52__to__main_bb57__cleanup 367 dependent codebag main_bb52 0 { 368 ;; %tmp51 = add i32 %iter.0.reg2mem.0, 1 ; [#uses=3] 369 ;; %tmp54 = icmp slt i32 %tmp51, 1000 ; [#uses=1] 370 ;; br i1 %tmp54, label %main_bb52__to__main_bb22__cleanup, label %main_bb52__to__main_bb57__cleanup 371 372 (main_bb52__to__main_bb22__cleanup) -> Selector_0.in_true(0); 373 (main_bb52__to__main_bb57__cleanup) -> Selector_0.in_false(0); 374 Register_3.out -> IntAdd64_0.in_lhs(0); 375 (1) -> IntAdd64_0.in_rhs(0); 376 IntAdd64_0.out *-> Cmp_0.in_lhs(0), Register_3.in(0); 377 IntAdd64_0.out *-> Register_2.in(0), Join_0.in2; 378 (100) -> Cmp_0.in_rhs(0); 379 (INT_SLT) -> Cmp_0.in_op(0); 380 Cmp_0.out -> Selector_0.in_sel(0); 381 Selector_0.out -> Join_0.in_pass; 382 Join_0.out -> Fetch_0.in_cbd(0); 383 FPAdd64_3.out -> Register_4.in(0); // phi: setup value for successor bb ‘main_bb22’ 384 FPAdd64_2.out -> Register_5.in(0); // phi: setup value for successor bb ‘main_bb22’ 385 } main_bb52; 386 387 ;; predecessors: main_bb52 388 ;; successors: main_bb22 389 dependent codebag main_bb52__to__main_bb22__cleanup 0 { 390 ;; br label %main_bb22 391 392 (main_bb22) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 393 Join_1.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 394 Register_2.out -> Join_0.in2(0); // cleanup: remove this stray token 395 IntAdd64_0.out -> Join_1.in2; C.2. MANDELBROT, LLVM-GENERATED 195

396 Join_0.out -> Join_1.in_pass; 397 } main_bb52__to__main_bb22__cleanup; 398 399 ;; predecessors: main_bb52 400 ;; successors: main_bb57 401 dependent codebag main_bb52__to__main_bb57__cleanup 0 { 402 ;; br label %main_bb57 403 404 (main_bb57) -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 405 Join_0.out -> Fetch_0.in_cbd(0); // cleanup: wait until done before fetching next bag 406 (5) -> Counter_0.in_cnt(0); // cleanup: num stray tokens to wait for 407 Counter_0.out -> Join_0.in2(0); // cleanup: trigger bag fetch when all stray tokens arrive 408 FPAdd64_1.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 409 Register_3.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 410 Register_4.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 411 Register_5.out -> Counter_0.in_tok(0); // cleanup: remove this stray token 412 IntAdd64_0.out -> Counter_0.in_tok(0); 413 } main_bb52__to__main_bb57__cleanup; 414 415 416 ;; predecessors: main_bb85 417 ;; successors: N/A 418 independent codebag main_bb85_cr 0 (r6) { 419 ;; %_lr_storage_addr5 = add i64 %_callee_adjusted_sp_3, -8 ; [#uses=1] 420 ;; %_lr_storage_addr5_as_ptr = inttoptr i64 %_lr_storage_addr5 to i64* ; [#uses=1] 421 ;; %stacked__ret_val_storage_ptr__addr4 = add i64 %_callee_adjusted_sp_3, -24 ; [#uses=1] 422 ;; %stacked__ret_val_storage_ptr__addr4_as_ptr = inttoptr i64 %stacked__ret_val_storage_ptr__addr4 to i32** 423 ;; [#uses=1] 424 ;; %_incoming_sp_2 = load i64* @Register6.out ; [#uses=1] 425 ;; %_callee_adjusted_sp_3 = add i64 %_incoming_sp_2, 24 ; [#uses=3] 426 ;; %_ret_val_storage_ptr__val = load i32** %stacked__ret_val_storage_ptr__addr4_as_ptr ; [#uses=1] 427 ;; %_lr_ = load i64* %_lr_storage_addr5_as_ptr ; [#uses=1] 428 ;; store i64 %_lr_, i64* @Fetch.in_cbd 429 ;; store i32 0, i32* %_ret_val_storage_ptr__val 430 ;; %_outgoing_sp_ = bitcast i64 %_callee_adjusted_sp_3 to i64 ; [#uses=1] 431 ;; store i64 %_outgoing_sp_, i64* @Fetch.in7 432 ;; ret void 433 434 "OOB" -> Toggle_0.in2; 435 (-8) -> IntAdd64_0.in_rhs(0); 436 (-24) -> IntAdd64_1.in_rhs(0); 437 Register_6.out -> IntAdd64_2.in_lhs(0); 438 (24) -> IntAdd64_2.in_rhs(0); 439 IntAdd64_1.out -> Memory_0.in_rdAddr(64_BIT); // load: [stacked__ret_val_storage_ptr__addr4_as_ptr] 440 IntAdd64_0.out -> Memory_1.in_rdAddr(64_BIT); // load: [_lr_storage_addr5_as_ptr] 441 Memory_1.out_rdData -> Join_0.in_pass(0); // cleanup: fetch next bag when cleanup complete 442 Memory_0.out_rdData -> Memory_2.in_wrAddr(32_BIT); // store: ... -> [_ret_val_storage_ptr__val] 443 (0) -> Memory_2.in_wrData(0); // store: <<>> -> ... 444 IntAdd64_2.out -> Fetch_0.in_sp(0), Register_0.in; 445 Register_0.out -> IntAdd64_0.in_lhs(0), IntAdd64_1.in_lhs(0); // store: fetch_sp 446 Join_0.out -> Toggle_0.in1; 447 Toggle_0.out -> Fetch_0.in_cbd; 448 Toggle_0.out -> Fetch_0.in_cbd; 449 Memory_2.out_wrComp -> Join_0.in2(0); // cleanup: remove this stray token 450 } main_bb85_cr; Appendix D

LLVM instruction mapping to Armada

This appendix gives a general description of how the LLVM intermediate representation instructions map to Armada Ships.

D.1 Terminator class

D.1.1 ret

Other passes transform the functions such that they all return void.

LLVM

ret void

Armada

lr -> Fetch.in_cbd

D.1.2 br

LLVM

br i1 <cond>, label <iftrue>, label <iffalse>

Armada

<cond> -> Selector.in_sel
<iftrue> -> Selector.in_true
<iffalse> -> Selector.in_false
Selector.out -> Fetch.in_cbd

LLVM

br label <dest>


Armada

<dest> -> Fetch.in_cbd

D.1.3 switch

Unimplemented.

D.1.4 invoke

Unimplemented.

D.1.5 unwind

Unimplemented.

D.1.6 unreachable

Unimplemented.

D.2 Binary class

D.2.1 add

LLVM

<result> = add <ty> <op1>, <op2> ; yields {ty}:result

Armada

if <ty> is an int:
<op1> -> IntAdd64.in_lhs
<op2> -> IntAdd64.in_rhs
; result on IntAdd64.out
else if <ty> is a float:
<op1> -> FPAdd64.in_lhs
<op2> -> FPAdd64.in_rhs
; result on FPAdd64.out

D.2.2 sub

LLVM

<result> = sub <ty> <op1>, <op2> ; yields {ty}:result

Armada

if <ty> is an int:
<op1> -> IntAdd64.in_lhs
<op2> -> IntAdd64.in_rhs(NEGATE)
; result on IntAdd64.out
else if <ty> is a float:
<op1> -> FPAdd64.in_lhs
<op2> -> FPAdd64.in_rhs(NEGATE)
; result on FPAdd64.out
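
A concrete instance of the integer case (a sketch; the register assignments are illustrative), for %d = sub i64 %a, %b:

Register_0.out -> IntAdd64_0.in_lhs;         // %a
Register_1.out -> IntAdd64_0.in_rhs(NEGATE); // %b, negated (miniop=1) as it enters the adder
IntAdd64_0.out -> Register_2.in;             // %d = %a - %b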

D.2.3 mul

LLVM

<result> = mul <ty> <op1>, <op2> ; yields {ty}:result

Armada

if <ty> is an int:
<op1> -> IntMul64.in_lhs
<op2> -> IntMul64.in_rhs
; result on IntMul64.out
else if <ty> is a float:
<op1> -> FPMul64.in_lhs
<op2> -> FPMul64.in_rhs
; result on FPMul64.out

D.2.4 udiv

LLVM

<result> = udiv <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> IntDiv64.in_lhs
<op2> -> IntDiv64.in_rhs
; must dispose of the remainder token on IntDiv64.outR if it is not used
; result on IntDiv64.outQ

D.2.5 sdiv

LLVM

<result> = sdiv <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> IntDiv64.in_lhs
<op2> -> IntDiv64.in_rhs
; must dispose of the remainder token on IntDiv64.outR if it is not used
; result on IntDiv64.outQ

D.2.6 fdiv

LLVM

<result> = fdiv <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> FPDiv64.in_lhs
<op2> -> FPDiv64.in_rhs
; result on FPDiv64.out

D.2.7 urem

LLVM

<result> = urem <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> IntDiv64.in_lhs
<op2> -> IntDiv64.in_rhs
; must dispose of the quotient token on IntDiv64.outQ if it is not used
; result on IntDiv64.outR

D.2.8 srem

LLVM

<result> = srem <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> IntDiv64.in_lhs
<op2> -> IntDiv64.in_rhs
; must dispose of the quotient token on IntDiv64.outQ if it is not used
; result on IntDiv64.outR

D.2.9 frem

Unimplemented.

D.3 Bitwise binary class

D.3.1 shl

LLVM

<result> = shl <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitShift.in_lhs(32_BIT or 64_BIT)
<op2> -> BitShift.in_rhs(LSL)
; result on BitShift.out

D.3.2 lshr

LLVM

<result> = lshr <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitShift.in_lhs(32_BIT or 64_BIT)
<op2> -> BitShift.in_rhs(LSR)
; result on BitShift.out

D.3.3 ashr

LLVM

<result> = ashr <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitShift.in_lhs(32_BIT or 64_BIT)
<op2> -> BitShift.in_rhs(ASR)
; result on BitShift.out

D.3.4 and

LLVM

<result> = and <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitOp.in_lhs
<op2> -> BitOp.in_rhs(AND)
; result on BitOp.out

D.3.5 or

LLVM

<result> = or <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitOp.in_lhs
<op2> -> BitOp.in_rhs(OR)
; result on BitOp.out

D.3.6 xor

LLVM

<result> = xor <ty> <op1>, <op2> ; yields {ty}:result

Armada

<op1> -> BitOp.in_lhs
<op2> -> BitOp.in_rhs(XOR)
; result on BitOp.out

D.4 Vector class

D.4.1 extractelement

Unimplemented.

D.4.2 insertelement

Unimplemented.

D.4.3 shufflevector

Unimplemented.

D.5 Memory class

D.5.1 malloc

LLVM

<result> = malloc <type>[, i32 <NumElements>][, align <alignment>]

Armada

Treated as a regular function call.

D.5.2 free

LLVM

free <type> <value> ; yields {void}

Armada

Treated as a regular function call.

D.5.3 alloca

LLVM

<result> = alloca <type>[, i32 <NumElements>][, align <alignment>] ; yields {type*}:result

Armada

Handled by the stack value placement compiler pass.

D.5.4 load

LLVM

<result> = load <ty>* <pointer>[, align <alignment>]
<result> = volatile load <ty>* <pointer>[, align <alignment>]

Armada

All loads are treated as volatile in this implementation.
<pointer> -> Memory.in_rdAddr(sizeof(<ty>))
; result on Memory.out_rdData
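
For instance (a sketch; the register assignments are illustrative), %v = load i64* %p becomes:

Register_0.out -> Memory_0.in_rdAddr(64_BIT); // %p; miniop selects a 64-bit access
Memory_0.out_rdData -> Register_1.in;         // %v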

D.5.5 store

LLVM

store <ty> <value>, <ty>* <pointer>[, align <alignment>] ; yields {void}
volatile store <ty> <value>, <ty>* <pointer>[, align <alignment>] ; yields {void}

Armada

All stores are treated as volatile in this implementation.
<pointer> -> Memory.in_wrAddr(sizeof(<ty>))
<value> -> Memory.in_wrData
; completion token on Memory.out_wrComp

D.5.6 getelementptr

LLVM

<result> = getelementptr <ty>* <ptrval>{, <ty> <idx>}*

Armada

<ptrval> -> IntAdd64.in_lhs
<sizeof(ty)> -> IntMul64.in_lhs
<idx> -> IntMul64.in_rhs
IntMul64.out -> IntAdd64.in_rhs
; result on IntAdd64.out
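
For example (a sketch; the register assignments are illustrative), %p = getelementptr i32* %base, i64 %idx scales the index by sizeof(i32) = 4 and adds it to the base:

Register_0.out -> IntAdd64_0.in_lhs; // %base
(4) -> IntMul64_0.in_lhs;            // sizeof(i32)
Register_1.out -> IntMul64_0.in_rhs; // %idx
IntMul64_0.out -> IntAdd64_0.in_rhs;
IntAdd64_0.out -> Register_2.in;     // %p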

D.6 Conversion class

D.6.1 trunc .. to

LLVM

<result> = trunc <ty> <value> to <ty2> ; yields ty2

Armada

<value> -> BitOp.in_lhs
<2^sizeof(<ty2>) - 1> -> BitOp.in_rhs(AND)
; result on BitOp.out
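
For example (a sketch; the register assignments are illustrative), %t = trunc i64 %v to i8 masks off all but the low eight bits:

Register_0.out -> BitOp_0.in_lhs; // %v
(255) -> BitOp_0.in_rhs(AND);     // 2^8 - 1
BitOp_0.out -> Register_1.in;     // %t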

D.6.2 sext .. to

LLVM

<result> = sext <ty> <value> to <ty2> ; yields ty2

Armada

The current implementation always extends to 64 bits.
<value> -> SExt.in(8_BIT or 16_BIT or 32_BIT)
; result on SExt.out

D.6.3 zext .. to

No-op in Armada.

D.6.4 fptrunc .. to

As Armada only supports double-precision floating point, calls to fptrunc are treated as no-ops; lower-precision floating-point numbers are not supported.

D.6.5 fpext .. to

As Armada only supports double-precision floating point, calls to fpext are treated as no-ops; all floating-point values are already double precision.

D.6.6 fptoui .. to

Unimplemented.

D.6.7 fptosi .. to

Unimplemented.

D.6.8 uitofp .. to

LLVM

<result> = uitofp <ty> <value> to <ty2> ; yields ty2

Armada

<value> -> IntToFP.in
; output on IntToFP.out

D.6.9 sitofp .. to

LLVM

<result> = sitofp <ty> <value> to <ty2> ; yields ty2

Armada

<value> -> IntToFP.in
; output on IntToFP.out

D.6.10 ptrtoint .. to

No-op in Armada.

D.6.11 inttoptr .. to

No-op in Armada.

D.6.12 bitcast .. to

No-op in Armada.

D.7 Miscellaneous class

D.7.1 icmp

LLVM

<result> = icmp <cond> <ty> <op1>, <op2> ; yields {i1}:result

Armada

<op1> -> Cmp.in_lhs
<op2> -> Cmp.in_rhs
<cond> -> Cmp.in_op
; result on Cmp.out

D.7.2 fcmp

LLVM

<result> = fcmp <cond> <ty> <op1>, <op2> ; yields {i1}:result

Armada

<op1> -> Cmp.in_lhs
<op2> -> Cmp.in_rhs
<cond> -> Cmp.in_op
; result on Cmp.out

D.7.3 phi

Handled in the Armada passes.

D.7.4 select

LLVM

<result> = select i1 <cond>, <ty> <val1>, <ty> <val2> ; yields ty

Armada

<val1> -> Selector.in_true
<val2> -> Selector.in_false
<cond> -> Selector.in_sel
; result on Selector.out
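
For example (a sketch; the register assignments are illustrative), %m = select i1 %c, i64 %x, i64 %y becomes:

Register_0.out -> Selector_0.in_true;  // %x
Register_1.out -> Selector_0.in_false; // %y
Cmp_0.out -> Selector_0.in_sel;        // %c
Selector_0.out -> Register_2.in;       // %m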

D.7.5 call

Handled in the Armada passes.

D.7.6 va_arg

Unimplemented.

D.7.7 getresult

Unneeded for calls, as they never return values in Armada. Unimplemented for invokes, because invoke is unimplemented in this version of the compiler.