Configurable Simultaneously Single-Threaded (Multi-)Engine Processor

by

Anita Tino
Bachelor of Engineering (B.Eng), Ryerson University, 2009
Master of Applied Science (M.A.Sc), Ryerson University, 2011

A dissertation presented to Ryerson University in partial fulfilment of the requirements for the degree of

Doctor of Philosophy

in the program of Electrical and Computer Engineering

Toronto, Ontario, Canada, 2017

© Anita Tino, 2017

AUTHOR’S DECLARATION FOR ELECTRONIC SUBMISSION OF A DISSERTATION

I hereby declare that I am the sole author of this thesis dissertation. This is a true copy of the dissertation, including any required final revisions, as accepted by my examiners.

I authorize Ryerson University to lend this dissertation to other institutions or individuals for the purpose of scholarly research.

I further authorize Ryerson University to reproduce this dissertation by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

I understand that my dissertation may be made electronically available to the public.

-Anita Tino


Configurable Simultaneously Single-Threaded (Multi-)Engine Processor

Anita Tino
Doctor of Philosophy, 2017
Electrical and Computer Engineering, Ryerson University

Abstract

As the multi-core computing era continues to progress, the need to increase single-thread performance, throughput, and seamlessly adapt to thread-level parallelism (TLP) remain important issues. Though the number of cores on each processor continues to increase, expected performance gains have lagged. Accordingly, computing systems often include Simultaneously Multi-Threaded (SMT) processors as a compromise between sequential and parallel performance on a single core. These processors effectively improve the throughput and utilization of a core, however often at the expense of single-thread performance as threads per core scale. Accordingly, applications which require higher single-thread performance must often resort to single-thread core multi-processor systems which incur additional area overhead and power dissipation. In attempts to improve single- and multi-thread core efficiency, this work introduces the concept of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP). ConSSTEP is a nuanced approach to multi-threaded processors, achieving performance gains and energy efficiency by invoking low overhead reconfigurable properties with full software compatibility. Experimental results demonstrate that ConSSTEP is able to increase single-thread Instructions Per Cycle (IPC) up to 1.39x and 2.4x for 2-thread and 4-thread workloads, respectively, improving throughput and providing up to 2x energy efficiency when compared to a conventional SMT processor.

Acknowledgements

I would like to express my gratitude to my supervisor, Dr. Kaamran Raahemifar, for his guidance and support of unconventional approaches and ideas throughout my PhD studies. I am truly grateful for the sincere trust and patience he has placed in me to develop the work presented in this thesis. I would like to thank the Ontario Graduate Scholarship (OGS) program and the Faculty of Engineering, Architecture, and Science (FEAS) for their financial support throughout my PhD studies. A special thanks to my thesis defense committee - Dr. Lev Kirischian, Dr. Ebrahim Bagheri, Dr. Isaac Woungang, and Dr. George Stamoulis, for their valuable feedback and comments which have greatly enhanced the quality of this thesis.

I would also like to thank Dr. Gul N. Khan for his advice and support throughout my graduate studies, and especially for granting me opportunities that have truly enhanced my experience at Ryerson. To Ryerson's faculty and staff, I would like to thank them for sharing their wisdom and knowledge throughout the years, making my Ryerson experience memorable. Also a special thanks to Jason, Luis, Dan, and Nipin for all their guidance, assistance, and laughs throughout the years. Sincere thanks to Luis for his help on the development of this written thesis.

Finally, to those closest to me: To my mom, sister Cassandra, my loving grandparents and family, I would like to sincerely thank them for their constant support and encouragement to pursue my studies, and especially for tolerating my never-ending university workload. I am truly blessed to have such a supportive and loving family, and would not be where I am today without them. To my best friend Cristina who has been nothing but supportive, encouraging, and loving since day one and especially throughout my graduate studies - words cannot express how truly grateful I am. And finally to Matthew - we've developed an exponential closeness throughout the past years. You push me to be the best I can be with endless love and encouragement, and I am sincerely thankful to have you in my life.

Contents

Author's Declaration

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction
  1.1 Motivation
    1.1.1 Processor Performance Plateau
    1.1.2 Computing Model Classifications
    1.1.3 Limitations of ISAs and Compilers
    1.1.4 Single Core Bottlenecks
    1.1.5 Multi-Threaded Processors
  1.2 Research Objectives
  1.3 Thesis Contributions
  1.4 Thesis Statement
  1.5 Dissertation Outline

2 Background
  2.1 Introduction
  2.2 In-order, Single (Scalar) Issue Processors
  2.3 Instruction Set Architectures (ISA)
  2.4 Superscalar/Out-of-Order Models
    2.4.1 Architectural Design
    2.4.2 Speculative Execution
    2.4.3 Limitations of OoO Superscalar Processors
  2.5 Parallelism Granularity and Application Flow
    2.5.1 Instruction-Level Parallelism (ILP)
    2.5.2 Thread-Level Parallelism (TLP)
    2.5.3 Data-Level Parallelism (DLP)
    2.5.4 Control-Flow vs Dataflow
  2.6 Single-Thread vs Multi-Threading
    2.6.1 Fine-Grained Multi-Threading
    2.6.2 Coarse-Grained Multi-Threading
    2.6.3 Simultaneous Multi-Threading (SMT)
  2.7 Multi-Core Models
    2.7.1 Motivation
    2.7.2 Limitations
  2.8 Very Long Instruction Word (VLIW)
  2.9 Explicitly Parallel Instruction Processors (EPIC)
  2.10 Digital Signal Processors (DSP)
  2.11 Co-Designed Virtual Machines
  2.12 Summary

3 ConSSTEP Overview
  3.1 Introduction
  3.2 General Overview
  3.3 Assumptions
  3.4 Compilation Process
  3.5 Architectural Flow and Pipeline
  3.6 Execution Example
  3.7 ConSSTEP vs Conventional CPUs
    3.7.1 Execution
    3.7.2 Storage & Interconnect
    3.7.3 ConSSTEP vs VLIWs
  3.8 Tradeoffs of ConSSTEP
  3.9 Summary

4 Related Work
  4.1 Introduction
  4.2 Hybrid Data-flow Architectures
  4.3 Distributed and Coarse-Grained Architectures
  4.4 Reconfigurable Architectures and CGRAs
  4.5 Other Architecture Models
    4.5.1 Stream Processors
    4.5.2 Transport Triggered Architectures
  4.6 Summary

5 Architecture
  5.1 Introduction
  5.2 rS Interconnect
  5.3 Read Register Buffer (RRB)
  5.4 Write Register Buffer (WRB)
  5.5 Functional Units
  5.6 External
  5.7 Configuration & Setup
    5.7.1 Setup Mitigation Techniques
    5.7.2 Floating Point (FP) Execution
    5.7.3 Branch Prediction & Loop Acceleration
  5.8 Exception Handling
  5.9 Scheduler
  5.10 Data Memory Accesses

6 Compilation
  6.1 Introduction
  6.2 Logical Compiler
  6.3 Physical Compiler (PhysC)
    6.3.1 Bundle Formation
    6.3.2 Data Dependency Analysis
    6.3.3 Instruction Analysis
    6.3.4 Engine Mapping
    6.3.5 Instruction-to-FU Mapping
    6.3.6 rS & Unit Mapping
    6.3.7 Configuration Data Generation
  6.4 Summary

7 Experimental Methodology
  7.1 Introduction
  7.2 Architectural Framework
    7.2.1 Simulators
    7.2.2 Benchmarks
  7.3 Physical Modelling
    7.3.1 Area Estimates
    7.3.2 Wire Delays
    7.3.3 Cycle Time Determination
  7.4 Summary

8 Sensitivity Studies
  8.1 Introduction
  8.2 Intra-Scheduling Algorithm Efficiency
    8.2.1 rS Latch Utilization
    8.2.2 rS Utilization during Propagation
    8.2.3 IPC
    8.2.4 Hop Count
    8.2.5 Summary - Intra-Scheduling
  8.3 Inter-Scheduling: Bundle to Engine Mapping
  8.4 Summary

9 Experimental Results and Analysis
  9.1 Introduction
  9.2 Two-Thread Comparison
    9.2.1 Configuration Overhead Concealment Techniques
    9.2.2 Area
    9.2.3 Energy and Power
    9.2.4 Performance
    9.2.5 Performance/Unit Area
    9.2.6 Performance/Watt
  9.3 Four-Thread Results
    9.3.1 Area
    9.3.2 Performance
    9.3.3 Performance/Unit Area
    9.3.4 Energy and Power
    9.3.5 Performance/Watt Comparison
  9.4 Other Architectural Statistics
    9.4.1 Configuration Memory
    9.4.2 Revisiting Tseng and Patt with ConSSTEP
    9.4.3 Load/Store Unit Scaling
  9.5 Summary
    9.5.1 2-Thread ConSSTEP
    9.5.2 4-Thread ConSSTEP
  9.6 Conclusion

10 Conclusions and Future Work
  10.1 Addressing Limitations
  10.2 Future Work

Bibliography

List of Figures

1.1 Architectural Advances and Energy Efficiency
1.2 CPU Trends
1.3 Achievable Single-Thread Performance Improvement for SMT
1.4 SMT Throughput Advantage Over Single-Thread Core

2.1 Simple In-Order Pipeline
2.2 Superscalar OoO Pipeline

3.1 ConSSTEP Top-Level Overview
3.2 ConSSTEP Execution Process (Single Engine)
3.3 rS Architectural Functionality (Single Input and Output Port)
3.4 ConSSTEP Basic Pipeline
3.5 rS Unit - Double Configuration Register Setup
3.6 ConSSTEP Aggressive Pipeline (2 Engines)
3.7 ConSSTEP Execution Example for 4-FU Engine

5.1 External rS Architecture
5.2 Internal rS Architecture
5.3 Internal RRB Architecture
5.4 Internal FU Architecture (ALU example)
5.5 rS Unit - Double Configuration Register Setup
5.6 FP Engine Architecture
5.7 ConSSTEP Cache System (to match conventional CPU model)

6.1 ConSSTEP's PhysC Flow
6.2 Sequence Graph of an Instruction Bundle
6.3 Scheduled Sequence Graph with Resource Binding, Data Latching and Transport
6.4 Timing Schedule for Engine Structures

7.1 In-house Simulator Framework
7.2 ConSSTEP 2T ⟨4x6⟩ High-level Core Layout
7.3 Instruction Distribution for all Benchmarks

8.1 rS Latch Utilization Comparison for Engine Configurations
8.2 Comparison of rS Utilization During Propagation for 2-Thread
8.3 Comparison of Delays due to Propagation Contention for 2-Thread
8.4 IPC - Performance Comparison for 2-Thread Engine Configurations
8.5 Hop Count Comparison for 2-Thread Engine Configurations
8.6 Average IPC per Algorithm for all Benchmarks - 2T Workload
8.7 Average Hops per Algorithm for all Benchmarks - 2T Workload
8.8 Heterogeneous Performance Analysis of Inter-Scheduling Algorithm - 4-Threads

9.1 Configuration Overhead Performance - Hardware Double Configuration Register Improvement over Software PhysC Scheduling Technique
9.2 Energy Reduction for Various ConSSTEP Configurations
9.3 Energy Distribution for ConSSTEP Structures, Averaged for all Benchmarks
9.4 SMT Energy Distribution for 2T and 4T SMT
9.5 Comparison of Data Movement Energy for 2T ConSSTEP Configurations
9.6 Power Consumption Savings for 2T ConSSTEP Configurations, Normalized to SMT
9.7 Single-Thread IPC Improvement Over 2-Core STSC
9.8 Single-Thread IPC Improvement per 2T Configuration, Averaged for all Benchmarks (Normalized to 2-Core STSC)
9.9 Cycle Increase Due to Contention for Various ConSSTEP Configurations
9.10 Two-Thread Throughput Improvement Normalized over SMT
9.11 Averaged 2T Throughput Improvement Normalized over SMT
9.12 Two-Thread ConSSTEP Performance/mm² Improvement
9.13 Two-Thread ConSSTEP Core Throughput/Power Consumption Improvement
9.14 Four-Thread IPC for Various ConSSTEP Configurations
9.15 Four-Thread Throughput Comparison for Various ConSSTEP Configurations
9.16 Four-Thread Performance/mm² Comparison for ConSSTEP Configurations
9.17 Four-Thread Energy Distribution for ConSSTEP Structures
9.18 Four-Thread Energy Saving for Various ConSSTEP Engine Configurations versus SMT
9.19 Power Consumption Savings for 4T ConSSTEP Configurations, Normalized to SMT
9.20 Four-Thread Throughput/Power Comparison for ConSSTEP Configurations
9.21 Propagation, Temporary Storage, and External Read/Write Requirements of Operand Dependencies with ConSSTEP
9.22 Two-Thread ConSSTEP LSU Scaling


List of Tables

7.1 Simulation & Modelling Parameters
7.2 Benchmark Descriptions
7.3 Delay, Energy per Access, and Area Results
7.4 Derived cycle time per processor/engine

9.1 2-Thread Area Comparison (mm²)
9.2 4-Thread Area Comparison (mm²)
9.3 Configuration Memory Specifications


Chapter 1

Introduction

1.1 Motivation

1.1.1 Processor Performance Plateau

Computers and similar devices, whether laptops, tablets, or cell phones, play an essential role in our daily lives. The past 10 to 15 years of computing have demonstrated major technological progression. One of the most influential components in such devices is the processor, i.e. the computer's "brain". A difficult challenge within computing is how to provide an increase in performance and energy efficiency per processor generation. Although performance gains have been prominent throughout the history of computing, we have observed a plateau in more recent multi-core computing generations. Accordingly, Fig. 1.1 displays eminent Intel processors from the dawn of the first on-chip cache processor, with the x-axis representing the performance gain per processor generation with respect to the previous generation. The figure was originally captured by Borkar and Chien [1], and is extended here to cover Central Processing Unit (CPU) trends to date. As shown in the figure, the past few processor generations have demonstrated negligible performance gains.

Fig. 1.1 also demonstrates industry's recent continuous emphasis on the multicore generation. An application, however, consists of both sequential and parallel sections. Therefore, although the parallelizable sections continue to be the main focus of recent computing trends, applications may only benefit from such multi-cores to a certain extent, as they remain bound by their sequential counterparts, as captured by Amdahl's law [2]. Furthermore, only so much parallel performance may be gained from multi-cores as applications possess variable and/or limited thread-level parallelism (TLP) [3].

Figure 1.1: Architectural Advances and Energy Efficiency

Figure 1.2: CPU Trends
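For reference, Amdahl's law bounds the overall speedup of a program in which a fraction f of its execution is parallelizable across N cores:

    Speedup(N) = 1 / ((1 - f) + f/N)

Even a program that is 90% parallelizable (f = 0.9) is therefore limited to a speedup of 1/(0.1 + 0.9/N), which approaches 10x regardless of how many cores are added.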

The majority of single-thread (sequential) performance gains have been achieved by a reliance on transistor scaling. Specifically, transistor scaling has accounted for almost three orders of magnitude in performance improvement over the past 20 years of processor history (i.e. increasing a chip's attainable frequency to increase the instructions processed per second) [1]. Accompanying Fig. 1.1, Fig. 1.2 presents CPU trends from the 1970s until 2016, illustrating the transistor count, frequency of operation, and thermal design power of various processors. As seen in the figure, we have now reached a plateau in transistor and frequency scaling, correlating to the performance plateau observed in Fig. 1.1. Consequently, very little single-thread performance improvement has actually been attained from architectural optimizations on chip, and such trends persist as more and more cores are integrated onto a die (considering homogeneous multi-cores, which replicate the same core on a die). As industry can no longer rely on transistor scaling, architectural optimizations and/or alternate designs and technologies are therefore the only viable means of increasing processor performance.

In response to the multi-core processor plateau, computing has now steered towards the heterogeneous computing era, which integrates cores of various types on a single chip. Such a heterogeneous approach encompasses both latency- and throughput-oriented cores on a die to provide efficient single-thread performance and thread-level parallelism, respectively. Although these systems cater to various workloads, a single core's architecture must still be considered within a multi-core system as it scales to provide performance gains. Accordingly, there is much ingenuity to be discovered if transistor budgets are allocated to improving and adding sophistication to individual processing elements, versus adopting the mantra of integrating more cores and memory on a single die [4]. Such a concept provides motivation for this thesis work.

1.1.2 Computing Model Classifications

Computing systems may be divided into three general categories: Computing Systems with Programmable Procedure (CSPP), computing systems with Application Specific Processing (ASP), and Reconfigurable Computing Systems (RCS) [5]. Each type of computing system has its advantages and limitations depending on the workload considered. Accordingly, a computing architecture, A, may be defined as a triple set of [5]:

1. Functional components (Ci)

2. Communication links between the components (Li,j), and

3. Functional procedures associated with C and L (Pi,j)

Depending on the computing system, any of the given elements in the triple set may have fixed, x, or programmable, ∼ x, functionality.

Conventional processors are considered CSPP models, such that ACSPP = {Ci, Li,j, ∼ Pi,j} [5]. In this case, the links and components are fixed, while the computing system executes programmable procedures (programs). The programmable procedures are more commonly referred to as general purpose applications, which are designed to execute a given algorithm as coded by the programmer. The CSPP model provides a universal computing system with its application flexibility and low-cost manufacturing process to process data. Aside from its technological reliance mentioned previously (i.e. frequency and transistor scaling), limitations of the model include the processing overhead incurred per instruction. That is, a processor requires that a program be divided into a sequence of elementary subtasks, referred to as instructions [5], formatted according to the specifications supported by the underlying architecture. An instruction consists of an operation, where instructions collectively execute the desired program. Specifically, each instruction specifies a basic operation, its data input operands, and an output operand. The exact formatting and data support for the instructions is dictated by the processor's target Instruction Set Architecture (ISA)^1.

The CSPP model requires several steps to process an instruction: initiation and decoding, input source data delivery, data processing, and result storage [5]. These steps are overlapped to create a pipeline which the processor uses to simultaneously process multiple successive instructions of a program. Such steps, however, create considerable control logic overhead in CSPP models, as the processing of such elementary data operations comes at the expense of time, energy, and hardware logic. Therefore, many clock cycles in a conventional processor pipeline^2 are spent on servicing and processing control information per instruction versus actually executing instruction data.

Heterogeneous systems integrate a variety of CSPP models and other compute models to achieve performance improvements. The other compute models integrated into such heterogeneous systems, however, are usually optimized to execute certain types of applications/algorithms and data structures. Such a compute model is referred to as an Application Specific Integrated Circuit (ASIC) or ASC, such that AASC = {Ci, Li,j, Pi,j}. Accordingly, all components in the computing model are fixed, and the system's efficiency comes at the expense of no application flexibility. Performance gain and energy efficiency are therefore possible because the circuits were designed and optimized to execute a particular algorithm, versus CSPPs which were designed to execute any application. In order to compromise between conventional CPUs and ASICs, Reconfigurable Computing Systems (RCS) may also be used to provide total architectural flexibility, such that ARCS = {∼ Ci, ∼ Li,j, ∼ Pi,j}. RCS models may rapidly adapt to any application using their programmable links and components, however at the expense of hardware, power, and re-programmability overhead. Intel's recent acquisition of Altera, however, suggests that this RCS approach will soon be underway, integrating CSPP models with FPGA-based accelerators. Heterogeneous systems may also integrate Graphics Processing Units (GPUs) on a die to provide throughput-oriented execution. Such a core, however, may be classified as a CSPP, demonstrating the same general advantages and limitations of the model.

^1 Chapter 1.1.3 provides further information on ISAs.
^2 Chapter 2.2 provides more insight on the advantages and drawbacks of pipelining.

Each of the three computing models discussed in this section has its advantages and limitations. Considering an objective to improve the CSPP model, the non-flexible characteristics exhibited by the ASIC model would not benefit the programmable procedure objective sought by such general purpose processors. The RCS model, however, poses an attractive solution to overcome the shortcomings of a processor's architecture. As mentioned previously, the advantages of reconfigurability come at the expense of the additional latencies incurred by flexible logic, hardware overhead (in comparison to dedicated circuitry), and the additional memory required to store such configuration logic. These configuration latency penalties would therefore be detrimental to increasing the performance of general purpose processors. Such penalties, however, are primarily problematic due to the way the CSPP model was designed, and not inherently in computing itself. It may therefore be possible to revisit the design fundamentals of a conventional processor by using current computing knowledge to re-design the CSPP model, originally created over half a century ago under vastly different technological circumstances. Accordingly, a nuanced processor architecture with an adaptive computing model for programmable procedures may be possible.
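To recap the classification at a glance, the short sketch below tabulates which elements of the triple set are fixed (x) and which are programmable (∼x) for each model. It is an illustrative summary only; the dictionary layout and labels are choices of this sketch, not notation from [5].

    # Illustrative summary of the triple-set classification: for each computing
    # system class, which elements of A = {C, L, P} are fixed ("x") and which
    # are programmable ("~x").

    MODELS = {
        #        components  links   procedures
        "CSPP": {"C": "x",  "L": "x",  "P": "~x"},   # conventional CPU
        "ASC":  {"C": "x",  "L": "x",  "P": "x"},    # ASIC
        "RCS":  {"C": "~x", "L": "~x", "P": "~x"},   # e.g. FPGA-based system
    }

    for name, a in MODELS.items():
        print(f"{name}: C={a['C']}, L={a['L']}, P={a['P']}")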

1.1.3 Limitations of ISAs and Compilers

In addition to the conventional processor bottlenecks discussed, processor architectures are also constrained by the concept of an Instruction Set Architecture (ISA). Conventional processors execute instructions according to a given ISA which provides a programmer with a simplified view of the microarchitecture – the underlying implementation and organizational details of the processor's hardware [5]. As previously mentioned, the ISA specifies the instruction formats, operands, operations, addressing, and data types that are supported by the processor.

A compiler's role is to translate a program into a set of instructions. Thus, when the compiler accepts an application as input, it generally has an unrestricted view of the program with knowledge of instruction dependencies, operand usage, and lifetimes [6]. Such information, however, is eliminated during the last compilation stage, which generates machine code for the processor according to the ISA. Consequently, the compiler is only able to express the program through basic instructions, and the processor hardware must dedicate a considerable portion of its logic to (dynamically) rediscover program characteristics which were known by the compiler. For example, the compiler knows that a temporary holding an intermediate result may be consumed exactly once by the following instruction and then die; once lowered to ISA-level register operands, the hardware must rediscover that short, single-use lifetime at run time. The hardware structures used to compensate for the ISA come at the expense of a core's power, area overhead, and latency per die. ISAs therefore reflect a division of labour between hardware and software that was created decades ago, again under vastly different technological circumstances [7].

It is possible, however, for processors to employ a smarter compilation process by resolving dependencies statically to eliminate complex hardware logic, while supporting wide-issue execution (i.e. the execution of many simultaneous instructions) per cycle. Eminent examples of such models include Very Long Instruction Word (VLIW) processors^3 and Explicitly Parallel Instruction Computing (EPIC), which use static scheduling methods and an exposed pipeline^4 to execute instructions. By using such a strategy, however, an alteration to the underlying architecture requires compiler amendments, in turn contributing to software compatibility issues (re-compilation, backwards compatibility issues, etc.) for newer generation models. Such a compiler model also imposes limitations on instruction execution, as general purpose applications have varying degrees of operation-type parallelism. Consequently, depending on the application, wide-issue architectures may lead to frequent core underutilization, while also demanding many register file ports for such wide-issue execution, contributing to additional power consumption, complexities, and critical path latencies^5. Consequently, a solution is sought to compromise between hardware and compiler directives while maintaining software compatibility to overcome various ISA limitations.

^3 See Chapter 2.8.
^4 An exposed pipeline is a technique which provides the compiler with direct knowledge of the underlying architecture, for instance the clock cycles required to perform a given operation, the number of physical registers on the core, etc.
^5 Latencies which determine a structure's operating frequency.


1.1.4 Single Core Bottlenecks

The following section reflects on further bottlenecks exhibited in more aggressive conventional processor models, namely Out-of-Order (OoO) superscalar models. Such processors support the execution of multiple instructions simultaneously (i.e. superscalar), where instructions may execute once their operands are ready (i.e. OoO). Such a model promotes dynamic, dataflow-like instruction execution in an otherwise sequential instruction stream of a program. In order to support such an execution style, however, the processor must integrate more complex logic circuits which extract dependencies and guarantee that instructions leave the pipeline in program order for application integrity^6. Bottlenecks of the OoO superscalar model include increasing issue-width to expose higher Instruction Level Parallelism (ILP, the overlap of instruction execution in a pipeline to improve performance), unscalable and redundant data operand transport, and the cost of mispredictions and exceptions. Such issues are elaborated in the following subsections.

Issue Width and Increasing Instruction Level Parallelism

Considering an OoO superscalar processor, issue width (IW) is the maximum number of instructions that can be issued/executed within the same clock cycle for a given processor core. IW therefore effectively contributes to a processor's maximum attainable ILP. The number of instructions present in a pipeline is also limited by an instruction window, referred to as the Re-Order Buffer (ROB - a structure which holds a list of ordered instructions as they are initiated in a pipeline, allowing instructions to execute OoO and leave the pipeline in-order). Theoretically, increasing the IW and instruction window of a pipeline could potentially increase a processor's attainable performance.

Pipeline structures which lie on a processor's critical path include register renaming and the Issue Queue (IQ), both of which are used to eliminate and monitor operand dependencies, respectively. The bypass network also lies on the critical path and is responsible for forwarding all functional unit (FU) results to their respective consumers, which may be present at various stages in the pipeline. Consequently, increasing a core's IW and FUs increases bypass network complexity, in turn impacting frequency [3]; roughly speaking, each of the IW results produced per cycle must be forwardable to every FU input at several pipeline stages, so the number of bypass paths grows approximately quadratically with issue width. In the case that the bypass network was lengthened, the IQ would also require an increase in capacity to meet the instruction demands of the execution stage, also impacting the number of register file read/write ports required, which contribute to increasing operand latencies. By increasing the size of the IQ, the fetch bandwidth (i.e. initiation, or rather the number of instructions fetched per cycle) would also need to increase to provide the processor with instructions to process. Increasing the fetch bandwidth, however, would prove especially problematic for the complex decoding stage of CISC-based ISAs such as x86^7 [3]. The general scalability of conventional processor pipelines therefore exhibits a cumulative effect of drawbacks which inherently limit a processor's ability to increase ILP and sequential performance.

^6 Detailed information on superscalar OoO models is provided in Chapter 2.4.

Costly Recovery of Mispredictions/Exceptions

The dynamic execution of conventional processors has led to the implementation and integration of various prediction mechanisms to increase performance, referred to as speculative (instruction) execution. The cost of a misprediction, however, requires various hardware recovery structures in order to restore the state of the processor prior to the offending instruction. Exception and interrupt handling within a processor follow very similar procedures. The recovery process to handle such events includes pipeline flushing, saving and restoring the last non-offending committed state, and in the case of an exception, executing the handler between the saving and restoring process. Due to the latencies incurred for handling such mispredictions and/or exceptions, processors suffer performance loss.

Pipeline complexities for restoring state also contribute to additional core power consumption and area overhead. It is therefore possible that a completely new approach to a processor’s model may take these latencies into consideration, while applying alternative and non-intrusive methods to handle such events.

Data Transport and Redundant Bandwidth

Data transport in a processor is the way in which a data result (producer) is delivered within the pipeline to its dependent (consumer) instructions. Since processors implement pipelining, it is possible that an instruction's dependencies may be located and/or buffered at various locations/structures in the pipeline. Data transport therefore ensures that a result is forwarded to all of its consumers within the pipeline. To achieve such forwarding, more aggressive (superscalar OoO) processor models require that each executed result be written to the register file and simultaneously broadcast to the IQ to "wake up" possible consumer operands. In the case that a consumer was scheduled directly after the executed result, the result is also forwarded on the bypass network directly to its consumer, effectively improving a core's performance.

^7 Further discussed in Chapter 2.3.

Studies have demonstrated, however, that results written to the register file are actually consumed almost immediately within a pipeline [6, 8]. Specifically, 70% of the broadcasts sent to the IQ are unnecessary as the values have already been forwarded to their single consumer on the bypass network, while 74% of the results stored to the register file have already been forwarded to their single consumer and/or are never read and/or are overwritten by subsequent instructions. 80% of such values also have a lifetime of 32 instructions or fewer [6]. These studies therefore emphasize the redundant bandwidth and unnecessary register reads/writes present in the processor pipeline to accommodate data transport. Accordingly, eliminating such redundant bandwidth through a smarter data forwarding process and an alternate datapath poses an attractive solution for improving performance, energy efficiency, and register storage allocation. Grouping related instructions, i.e. coarse-grained execution, versus the fine-grained execution approach imposed by conventional processors, could also contain operand lifetimes in a localized manner to effectively mitigate the data transport problem.
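To make the kind of measurement behind these figures concrete, the sketch below gathers operand lifetime and consumer-count statistics from a simple instruction trace. It is an illustrative sketch only, not the methodology of [6, 8]; the trace format and the 32-instruction threshold are assumptions of this example.

    # Illustrative sketch: measuring how often a produced register value has a
    # single consumer and how short its lifetime is, over a toy instruction trace.
    # Each trace entry is (dest_reg, [src_regs]); dest_reg may be None.

    def value_usage_stats(trace):
        last_def = {}    # reg name -> index of the instruction producing its live value
        consumers = {}   # producer index -> number of times the value is read as a source
        lifetimes = {}   # producer index -> distance (in instructions) to its last use

        for i, (dest, srcs) in enumerate(trace):
            for s in srcs:
                if s in last_def:
                    p = last_def[s]
                    consumers[p] = consumers.get(p, 0) + 1
                    lifetimes[p] = i - p
            if dest is not None:
                last_def[dest] = i
                consumers.setdefault(i, 0)
                lifetimes.setdefault(i, 0)

        produced = len(consumers)
        single_use = sum(1 for c in consumers.values() if c == 1)
        short_lived = sum(1 for d in lifetimes.values() if d <= 32)
        return single_use / produced, short_lived / produced

    # Example: r1 = ...; r2 = f(r1); r3 = g(r2, r1)
    trace = [("r1", []), ("r2", ["r1"]), ("r3", ["r2", "r1"])]
    print(value_usage_stats(trace))   # fractions of single-use and short-lived values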

1.1.5 Multi-Threaded Processors

The inefficiency of single-threaded cores arises when applications exhibit low inherent ILP and/or cache misses [9]. In the case of a cache miss, an instruction may take hundreds of clock cycles to fetch data from the lower levels of the memory hierarchy and return. The processor is therefore not utilized during this time, simply stalling until the instruction is resolved. In the case of low inherent ILP, the number of independent instructions executing simultaneously in the pipeline may not sufficiently utilize the pipeline structures, causing underutilization of core resources and unnecessary power consumption. To mitigate such cases, Simultaneous Multi-Threaded (SMT) processors were proposed as an augmentation to a conventional superscalar, sharing or duplicating pipeline structures so that multiple threads simultaneously traverse the pipeline to increase core utilization and throughput [10].

Although SMTs were proposed as a way to increase core efficiency, the model's single-thread performance is often sacrificed as threads scale, especially for more algorithmic- and computationally-intensive workloads. Specifically, certain threads are optimized for independent operations and are intended to fully exploit the core's hardware structures (registers, cache lines and FUs). Consequently, when an optimized thread shares core resources with other threads, contention and blocking of pipeline structures arise, causing performance degradation and hindering the benefits of multi-threaded execution [11]. The general sharing of pipeline structures also limits a thread's instruction window, where the IQ and ROB of a fixed size must be divided among the threads, in comparison to a single-thread core which may use the entire structure to extract dataflow-like execution. Consequently, the advantages of SMT throughput and utilization as threads scale come at the expense of individual thread performance loss.

To mitigate thread blocking, larger and/or duplicated hardware structures are integrated into a SMT pipeline, in addition to a fetch policy which invokes an algorithm to mitigate thread contention throughout the pipeline. To support such features and accommodate larger structures, more physical pipeline stages must be added to allow for an acceptable frequency of operation, in turn contributing to an increase in instruction latency, core area and power. SMT's main objective of utilizing a CPU's existing structures has therefore shifted towards the integration of larger hardware in a pre-existing complex pipeline. Consequently, applications requiring more aggressive single-thread performance often resort to alternate models which do not incur such performance penalties, i.e. multi-cores built from single-thread cores.

To provide more insight on the SMT single-thread and throughput issue, Figures 1.3 and 1.4 present select benchmarks of the PARSEC [12] and SPLASH-2 [13] suites. Fig. 1.3 displays the average single-thread IPC improvement over an equivalent SMT core, considering a multi-core of single-thread cores (STC) and the average IPC attained per core. As seen in Fig. 1.3, single-thread latency increases in SMT processors, where certain threads benefit from executing on multi-core STCs more than others (i.e. approximately 1.34x and 1.95x for 2 and 4 threads, respectively). The greater the IPC gain of the multi-core STC, the higher the level of contention experienced by the threads in the SMT due to resource sharing and blocking. As seen in the figure, even greater performance gains are attained by 4-Thread (4T) workloads for such reasons.

Figure 1.3: Achievable Single-Thread Performance Improvement for SMT

Figure 1.4: SMT Throughput Advantage Over Single-Thread Core

Conversely, Fig. 1.4 displays a SMT's throughput advantage over an average STC. The figure clearly shows that although there is a certain amount of single-thread IPC loss in a SMT core, throughput is in fact improved per core on average by 1.32x and 1.55x for 2 and 4 threads, respectively. This provides motivation for designing a new core which exploits both TLP and single-thread performance concurrently in a scalable manner. This objective is especially important as we continue to proceed in a heterogeneous computing era which compromises between both ILP and TLP.

Referring back to fetch policies, a major SMT design issue is selecting a policy which mitigates IQ clogging and provides a sufficient instruction window per active thread. An ideal fetch policy would therefore allow active threads of an application to proceed throughout the pipeline and execute, while stalled threads are not fetched and utilize minimal resources. Sources of IQ clogging include [14]:

• Long latency instructions: Loads which miss in the data cache consume pipeline resources, possibly preventing commits if the instruction is at the head of the ROB. Such a thread then contributes to IQ clogging and limits the instruction window for other threads. Minor clogging issues may also arise in the case of successive and dependent floating point instructions, i.e. long latency operations.

• Long data dependence chains: If an instruction relies on another instruction with a load miss, all dependent instructions will also stall until the memory load is resolved, causing IQ clogging for other threads.

• Contention for Functional Units: Since threads must share pipeline structures, structural hazards may arise (i.e. the unavailability of a resource), contributing to performance loss. Issue width in most cases cannot be widened (i.e. by including additional FUs) due to the scalability problems discussed previously.

In order to alleviate clogging, many fetch policies have been proposed, along with the concept of runahead threads [11, 15–18]. Such fetch policies aim to reduce resource monopolization based on certain criteria and/or events, often causing an underutilization of resources and/or unintentional thread stalling until certain thresholds are reached [17].
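As a concrete illustration of such a policy, the sketch below implements an ICOUNT-style selection heuristic, which favours the thread with the fewest instructions occupying the front-end and issue queue so that no single thread monopolizes the IQ. This is a simplified sketch for illustration only; it is not a policy from the works cited above, and the thread-state fields are assumptions of this example.

    # Simplified ICOUNT-style SMT fetch policy: each cycle, fetch from the
    # non-stalled thread with the fewest instructions in the pre-issue stages.

    def pick_fetch_thread(threads):
        """threads: list of dicts such as {"id": 0, "stalled": False, "inflight": 12}"""
        ready = [t for t in threads if not t["stalled"]]
        if not ready:
            return None                       # nothing can fetch this cycle
        return min(ready, key=lambda t: t["inflight"])["id"]

    threads = [
        {"id": 0, "stalled": False, "inflight": 18},   # long dependence chain queued
        {"id": 1, "stalled": False, "inflight": 6},
        {"id": 2, "stalled": True,  "inflight": 30},   # blocked on an L2 miss
    ]
    print(pick_fetch_thread(threads))          # -> 1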

Conversely, when the oldest instruction in the pipeline is a Level 2 cache (L2) miss (or a lock in the case of multiprocessors), runahead threads allow a conventional pipeline to continue executing instructions using a fake value to represent the load miss. Using the value, the instruction stream continues to execute, allowing future load/store instructions to issue and prefetch data, decreasing the chances of future cache misses. Context must be saved, however, so that the state prior to the fake value is restored once the L2 miss returns. All instructions which execute after the L2 miss are therefore only required for prefetching and must be squashed (eliminated from the pipeline). Runahead therefore consumes additional dynamic power, as the pipeline must be restored whether or not the fake value used was correct. Consequently, runahead threads require an increased number of executed/squashed instructions which adversely affect a core and/or SMT's energy efficiency [18]. Such policies and runahead techniques also do not provide the same single-thread performance as single-thread cores.

A recent study conducted by Eyerman and Eeckhout compared a homogeneous SMT multi-core system to heterogeneous (non-SMT) multi-core systems for various applications, considering the same power budget per core [19]. The study concluded that SMT multi-cores generally perform better than heterogeneous multi-cores, as each SMT core adapts to varying degrees of TLP while providing adequate single-thread performance. Each SMT core was therefore generally able to outperform each optimized heterogeneous core in a certain aspect, while eliminating the communication latencies and data sharing overhead between threads and cores that are typical of heterogeneous systems. As previously discussed, however, the SMT model possesses several limitations which may be improved. Therefore, if such improvements can be applied to homogeneous systems, more performance gains may be achieved. Such SMT limitations therefore act as the prime motivation of this work.

1.2 Research Objectives

According to the motivation provided in this chapter, the objective of this research work is to develop a processor architecture which provides a nuanced approach to conventional CPUs/CSPPs. The goal of such a core is to provide enhanced single-thread performance for multi-threaded workloads, while eliminating several conventional CPU and SMT model bottlenecks. Accordingly, such a processor should provide more architectural flexibility to efficiently execute and adapt to various degrees of thread-level parallelism (TLP) and programmable procedure workloads, while mitigating the instruction servicing latencies and redundant bandwidth that general purpose architectures impose on conventional pipelines.

The main emphasis of this thesis is therefore to develop a completely new approach to general purpose processor architectures. Primarily, the processor performance and transistor/frequency scaling plateau suggests that industry and academia must consider computing in an unconventional way. Although computing has steered towards heterogeneous core alternatives, a single core's architecture and performance remain essential to performance gains, where a processor's programmable procedure model must be considered.

To accomplish such an objective, several queries were raised which led to corresponding topics of investigation. These topics were researched and the results of the investigations were used to design the processor presented in this thesis. Such research queries and their respective topics are listed below.

Research Question 1: What nuanced computing model and datapath design may mitigate instruction control and processing overhead to emphasize data processing and increase a single-thread's performance? How can the model be applied to a multi-threaded workload domain to improve upon a SMT model's limitations?

Research Question 2: Is it possible to alter the triple set characteristics of a Computing System with Programmable Procedure (CSPP) in order to mitigate the effects of redundant bandwidth, unscalable structures, and the general bottlenecks raised in Research Question 1? If the triple set is altered, how will this affect the ISA and software/compiler compatibility?

Research Question 3: Is it possible to have a smarter compilation process to extract application characteristics and eliminate various datapath structures? If so, is it possible to maintain compatibility with standard software, ISAs, and programming models? If compatibility is maintained, how will the software interact with the processor's hardware to convey such application characteristics?

Research Question 4: Once the processor is built, how will exceptions and/or mispredictions be handled?

1.3 Thesis Contributions

The main contributions of this thesis are two-fold. The first main contribution is the proposal of a nuanced general purpose, configurable processor architecture designed to improve single- and multi-thread performance. Specifically, to the best of our knowledge, this is the first general purpose architecture which integrates configurable datapath logic to eliminate various data transport and instruction processing overhead issues in a conventional processor. The design effectively increases the single-thread performance per thread of a given multi-threaded workload, while supporting aggressive speculative execution and a simple misprediction/exception recovery process.

The second main contribution of this thesis is the concept of a compiler compatible process and framework to support such a configurable general purpose processing architecture, referred to as the Physical Compiler (PhysC). Many previous works have attempted to support unconventional underlying architectures through the design of custom primary compilers and/or instruction set extensions. Such approaches however sacrifice software compatibility and are therefore not effective solutions considering general purpose workloads. Conversely, the concept of a PhysC is able to maintain software compatibility by applying a secondary, independent layer to the compilation process. The PhysC may be considered as a static and hybrid co-designed VM [7] compiler. Specifically, as opposed to translating one ISA instruction to another dynamically, the PhysC performs application macro-processing^8, which follows an application's instruction flow, generating and translating macro-data to configuration logic for execution on the underlying configurable processor. Accordingly, the proposed approach eliminates many front-end bottlenecks, the fine-grained instruction issues of conventional processors, and limitations of co-designed VMs and static compilation, while increasing performance and energy efficiency.

Both contributions were implemented, tested, and compared to conventional processor models using a variety of multi-threaded benchmarks. Results verified that the proposed approach 1) can successfully increase both single-thread performance and TLP on a single chip when compared to both a SMT and single-threaded core model, and 2) achieve such results with up to half the area and 63% of the energy required of conventional models.

The result of this research work is the proposal of a Configurable Simultaneously Single-Threaded (Multi-)Engine Processor (ConSSTEP) which implements the architecture, control, functionality, and software translation process for a nuanced multi-threaded processor core. According to the computing system classification presented previously, ConSSTEP revisits the design of conventional processor architectures to rebuild a core, considering that the limitations of such systems lie in the way they are designed and programmed, and not necessarily in computing itself. In attempts to redefine the CSPP model and the limitations imposed by ISAs, ConSSTEP proposes both a configurable topology and programmable procedure, such that AConSSTEP = {Ci, ∼ Li,j, ∼ Pi,j}. ConSSTEP may therefore be considered as a reconfigurable processor, where the core's topology adapts to a given programmable procedure with fixed (heterogeneous) functional components. Accordingly, such a processor is able to increase single-thread and throughput performance by dynamically configuring and adapting its underlying architecture to the communication patterns of a given workload while eliminating instruction processing overhead. Such configurability is implemented at a coarse instruction granularity to provide minimal overhead for an aggressive, configurable general purpose processor.

^8 Macro-processing assembles blocks of code from a program, versus processing individual instructions as typical VMs do.

1.4 Thesis Statement

ConSSTEP demonstrates that a core's single-thread performance, throughput, and energy efficiency may be improved effectively and efficiently by rethinking the microarchitecture, architecture, and compilation process of a conventional processor while maintaining full software compatibility.

1.5 Dissertation Outline

This thesis dissertation is organized as follows: Chapter 2 provides background and details of various conventional processor designs, outlining both their advantages and disadvantages. Based on the limitations addressed, Chapter 3 provides a general overview of the ConSSTEP flow and execution process, and then outlines the challenges overcome by ConSSTEP in comparison to conventional CPUs, in addition to ConSSTEP's trade-offs. Chapter 4 covers various related works which have also deviated from the conventional style of computing and which are directly compared to the ConSSTEP approach.

Next, Chapter 5 provides details of the ConSSTEP architecture, and Chapter 6 details the overall compilation process. Chapter 7 then presents the experimental methodologies used to simulate and emulate the proposed processor and baselines for performance, power, and area to obtain statistics and analysis. Chapter 8 provides sensitivity studies to determine the ideal scheduling algorithms and engine sizes for the ConSSTEP architecture. Chapter 9 then presents and discusses experimental results. Finally, Chapter 10 provides conclusions based on the findings presented during experimental results, outlining advantages, limitations and possible future work for the ConSSTEP architecture.

Chapter 2

Background

2.1 Introduction

The following chapter provides background information on various conventional CPU models and their respective advantages and limitations. The background here elaborates on the points discussed during the introduction, while also providing motivation behind the transition from one computing generation to the next. The main objective of such background knowledge is therefore to provide insight on the limitations and challenges posed by conventional models, which have led to the motivation behind the ConSSTEP architecture.

Accordingly, this chapter first overviews details of the conventional in-order scalar CPU. Thereafter, the chapter describes various ISA models and the need to increase core performance, bringing forth the superscalar Out-of-Order processor. Several of the core's advantages and limitations are discussed, thereafter providing details of parallelism granularity in general purpose applications. The chapter then explains the need for multi-threaded architectures, and computing's transition to the current multi-core generation. Finally, the chapter discusses alternate compute models such as VLIWs, EPIC, and Digital Signal Processors (DSPs), and the concept of co-designed virtual machines (VMs).

2.2 In-order, Single (Scalar) Issue Processors

The conventional Von Neumann (VN) computing model can be defined as a stored-program computer consisting of the following four main blocks [20]:


1. Central Processing Unit (CPU) - Responsible for obtaining instructions of a program and processing the data incrementally through a set of pipeline stages. Specifically, the CPU contains a Program Counter (PC) to keep track of the next instruction's address, an on-chip register file for quick access to instruction operands, an Arithmetic Logic Unit (ALU) for the arithmetic and logical execution of instructions, a datapath which statically connects all the hardware components to provide pipeline functionality, and a control unit which interprets instructions and the CPU's functional state. Note that the CPU may also contain branching, Floating-Point (FP), complex integer, and load/store units (LSU) for non-arithmetic/logic instructions^1.

2. Memory - used to store program instructions, data, and any information pertaining to the CPU and OS, whether intermediate and/or final results. Although memory was implemented as a single block in the original VN model, it is now a memory hierarchy.

3. Input - obtains external data and instructions, storing/processing the information for the computer's use through interfacing protocols.

4. Output - displays and/or transmits data from the computer to the external environment, also through interface protocols.

The blocks listed above work together to provide the functionality of the VN computing model. As mentioned above, the CPU uses pipelining for program execution. Pipelining is a technique used to increase a processor's instruction throughput by dividing the execution of an instruction into several simple, single cycle stages which overlap with successive instructions of an instruction stream [5, 21]. A basic conventional CPU pipeline is presented in Fig. 2.1, consisting of the stages: 1) Instruction Fetch (IF), 2) Instruction Decode (ID), 3) Execute (EXE), 4) Memory (MEM) and 5) Writeback (WB) to the register file [21]. Using a pipeline technique, the execution of the first instruction may be overlapped with the second instruction's decoding and the third instruction's fetch. Therefore, as opposed to dedicating five clock cycles per instruction for fetch, decode, etc., pipelining effectively increases the processor's performance and ability to process multiple instructions simultaneously for a given programmable procedure.

^1 All units which possess more than arithmetic/logic functionality are referred to as Functional Units (FUs).


Figure 2.1: Simple In-Order Pipeline

During IF, an instruction is read (fetched) from memory according to the address specified by the PC. The instruction is then latched by a pipeline buffer/register for the next pipeline stage to read, where the PC is also incremented, reflecting the next instruction to fetch. In the next clock cycle, the instruction is passed to ID, which decodes the instruction's fields to obtain details including the opcode/operation, input and output operands, and/or addresses for memory or branch instructions, where the instruction metadata is again buffered by the next pipeline latch. Assuming all operands are ready (in-order model), the processor then reads the operands specified by the instruction from the register file and applies them as inputs to the ALU during EXE, which performs a given operation as also specified by the instruction. The result computed by the ALU is obtained and written back to either the register file (WB), or to memory (MEM), or to the PC register for branches, depending on the instruction type. In order to maintain program order and correctness, it is also necessary that all instructions and memory accesses are written and read in program order.
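To visualize the overlap described above, the short sketch below prints a classic pipeline occupancy diagram for the five stages; it is an illustrative sketch that ignores hazards and stalls.

    # Minimal sketch of 5-stage in-order pipelining: instruction i enters IF in
    # cycle i and advances one stage per cycle, so the stages of successive
    # instructions overlap.  Hazards and stalls are ignored for clarity.

    STAGES = ["IF", "ID", "EXE", "MEM", "WB"]

    def pipeline_diagram(n_instructions):
        total_cycles = n_instructions + len(STAGES) - 1
        for i in range(n_instructions):
            row = ["    "] * total_cycles
            for s, stage in enumerate(STAGES):
                row[i + s] = f"{stage:>4}"    # instruction i occupies stage s at cycle i+s
            print(f"I{i + 1}: " + "|".join(row))

    pipeline_diagram(3)
    # Three instructions complete in 3 + 5 - 1 = 7 cycles rather than the
    # 15 cycles an unpipelined, five-cycle-per-instruction CPU would need.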

2.3 Instruction Set Architectures (ISA)

An Instruction Set Architecture (ISA) is the functional definition of the operations, modes, instruction encoding, and storage supported by the processor hardware [21]. Although the precise implementation of an ISA within a processor architecture may vary depending on vendors and/or processor models, it provides a simplified view of the microarchitecture to the programmer, where each processor supports one target ISA [5]. The role of the ISA is to divide labour between hardware and software [7], allowing for universal software compatibility for various processors and across generations of processor models. Using an ISA, the programmer may design a software program which is compiled and converted to a set of instructions encoded according to a given ISA. The underlying processor which supports the target ISA then obtains the instructions and executes the program. Thus the ISA acts as an interface which allows programmers and computer architects to work independently with the same objective.

ISAs may be classified into three categories: Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), and Application Specific Instruction-set Processors (ASIP). To date, common ISAs aside from x86 include ARM (aarch32 and aarch64), the Itanium series IA-64, SPARC, and PowerPC.

CISC is an instruction set which invokes several complex operations per instruction [5]. Such instructions are of variable length and possess fairly complicated addressing modes. During the 1960s and 70s, CISC was the predominant ISA for computers mainly due to the memory wall – noting that memory access times were much slower than they are today, CISC was able to embed multiple programming constructs into one instruction. Thus CISC required fewer memory accesses while providing code density and performance gains. The overlapped pipelining technique discussed previously however was not originally supported by CISC due to its complex instruction set. Thus as technology scaled and transistors became faster, the RISC ISA emerged in the 1980s, "reducing" the number of instructions, addressing modes, formats and complexity per instruction in comparison to a typical CISC ISA [22].

Accordingly, RISC is an instruction set which possesses one operation per instruction. The ISA employs a general purpose register file (GPR) with fixed length instruction encoding, simple operations, and addressing modes. An instruction's inputs are read from the GPR, where only load/store type instructions are permitted to access memory, contrary to CISC architectures. Due to its formatting, RISC ISAs may support simpler processor designs and control units, where registers allow faster accesses per instruction operand. Although CISC-type processors may also possess GPRs, their operands may be read/written from/to memory or the GPR according to a given instruction's specifications, and hence the architecture must deal with such implementation variability.

Both CISC and RISC ISAs were developed for the general purpose processor domain. However as the embedded computing market continues to thrive, ASIPs have emerged to optimize more catered systems, as CISC and RISC ISAs have many features which are not necessarily required. Specifically, ASIPs are developed with the intention of accelerating a specific application domain's attainable performance and energy efficiency [23]. Such acceleration is accomplished by developing customized instructions which exploit underlying architectural features for higher application performance. Such ISAs however may only execute a handful of applications as opposed to the generic nature of RISC and CISC. Thus ASIPs have been mostly utilized in Digital Signal Processors (DSPs) and Neural Processing Units (NPUs) [23]. Accordingly, the rest of this section focuses on RISC and CISC ISAs for computing systems with programmable procedures.

RISC and CISC each have their individual advantages and limitations. For instance, since RISC employs one operation per instruction, a higher number of instructions are required per application in comparison to CISC. Consequently more instruction memory must be allocated, possibly contributing to higher instruction-cache misses. Conversely, CISC was designed with the objective of completing a task in as few assembly lines as possible. Such instructions however come at the expense of more complex pipeline hardware structures, especially for instruction decoding. RISC's instruction memory overhead is therefore traded for simpler hardware structures, promoting the concept of pipelining which CISC did not initially implement. Specifically, since RISC instructions contain one operation, instructions may execute in a uniform time as opposed to CISC instructions which contain a varying number of operations. One RISC instruction may therefore be fetched, decoded, executed, etc. in a single clock cycle, whereas CISC requires a variable number of cycles per instruction. To achieve RISC-like simplicity, CISC-based processors invoke a more complex decoding stage which divides instructions into micro-operations (uops) so that the new "instruction" (uop) contains one operation. These uops may then execute with the same pipelining benefits and structures as RISC models.
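As a hedged illustration of this trade-off (the instruction sequences in the comments below are simplified, hypothetical sketches, not the output of any particular compiler or ISA), consider a statement that adds two memory-resident operands:

```c
/* Illustrative sketch only: hypothetical CISC vs. RISC lowerings of c = a + b. */
int a, b, c;

void add_memory_operands(void)
{
    c = a + b;
    /* A CISC-style ISA might encode this as a single memory-to-memory
     * instruction, e.g.   ADD c, a, b
     * at the cost of a complex, variable-length encoding and multi-cycle decode.
     *
     * A RISC-style (load/store) ISA instead needs several simple instructions:
     *   LOAD  r1, a       ; read operand a into a register
     *   LOAD  r2, b       ; read operand b
     *   ADD   r3, r1, r2  ; one operation per instruction
     *   STORE r3, c       ; write the result back to memory
     * Each instruction has a fixed length and a single operation, which is
     * what makes uniform, single-cycle pipeline stages practical.           */
}
```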

The pipelining techniques invoked by RISC ISAs have also impacted the memory system of classical computing systems. Specifically, the Von Neumann architecture consists of a single memory which holds both a program's instructions and data. Thus CISC was designed with the concept of such a memory system, using complex instructions for fewer memory accesses, making the ISA inherently slower; CISC ISAs must manage instruction and data accesses from the same memory and so fewer instruction accesses are favourable. In contrast, RISC's pipelined implementation, specifically the IF and MEM stages, requires simultaneous access to the memory system, which causes structural hazards as both stages attempt to access the memory system with different intentions (i.e. read/write instructions versus data). Accordingly the Harvard architecture was brought forth, separating instruction and data memory so that the two pipeline stages could concurrently access memory without hazards and additional latencies (increased read/write ports etc). As such, RISC may only be implemented as a Harvard architecture, and CISC as both Von Neumann and Harvard.

As RISC and CISC ISAs continue to be predominant in general purpose processors, history dictates that ISAs experience long lifetimes as introducing new instruction sets may raise compatibility issues and disrupt co-operative hardware/software development. Conversely, slowly evolving ISAs eventually become a poor match to rapidly changing systems and create gaps between architecture, technology, and possible advancements [24]. Although these traditional means of compiling and computing have more than served their purpose in the past, both ISAs and fine-grained instruction execution have become increasingly confining and partially responsible for stagnating innovation in computer architecture.

2.4 Superscalar/Out-of-Order Models

While in-order pipelining was able to provide adequate scalar processor performance, the model was still bound by a maximum throughput of one instruction per cycle. In order to overcome this limitation, the superscalar pipeline was introduced, incorporating multiple ALUs and/or functional units (FUs) into the CPU's pipeline (where FUs may perform other operations than the basic arithmetic operations supported by ALUs). The incorporation of multiple ALUs/FUs allowed superscalars to execute several instructions simultaneously per cycle [25]. Processors may also be Out-of-Order (OoO), allowing instructions to execute in a different order from a given program's PC-bounded instruction stream. In this way, instructions may execute once their respective operands and hardware units become available, regardless of program order, to increase overall performance. Such a technique is especially beneficial for long-latency instructions, such as those incurring cache misses. In such cases, other independent instructions may continue to execute when ready while the miss is resolved, versus in-order execution which stalls further execution until the instruction is resolved.


Monolithic (aggressive) CPUs combine the methods of superscalar and OoO processing to execute instructions with higher ILP and more of a dataflow-like execution. The Von Neumann/Harvard underpinnings however require that a program commit its instructions in-order, so that the CPU may maintain memory ordering and precise state in the case of exceptions or other events. As a result, OoO superscalars rely on large centralized hardware resources to support such requirements while simultaneously avoiding data, control, and structural hazards. Centralized structures include register renaming, branch prediction, caches, the Issue Queue (IQ), the reorder buffer (ROB), and the bypass network(s).

Figure 2.2: Superscalar OoO Pipeline

2.4.1 Architectural Design

Fig. 2.2 illustrates the superscalar OoO pipeline, where its logical structure and operation have remained the same for decades. In the first stages of the pipeline, several instructions may be fetched in parallel and decoded. Decoding determines the operands and operation(s) required by the instruction. Since the ISA imposes a set of (limited) architectural registers, the instruction's source and destination register-based operands are renamed with a pool of the CPU's physical registers to eliminate false data dependencies. False dependencies are created due to the lack of registers present in the ISA: two instructions may write to the same destination register even though no true dependency exists between them, simply because the architectural registers are limited. Physical registers exist to eliminate such issues, however these registers must still be limited in number to maintain fast access speeds.
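The following minimal C sketch (the structure and field names are hypothetical, not taken from any real design) shows how a register alias table (RAT) removes a false write-after-write conflict by mapping each new architectural destination to a fresh physical register:

```c
#include <stdio.h>

#define NUM_ARCH_REGS 8
#define NUM_PHYS_REGS 32

static int rat[NUM_ARCH_REGS];          /* architectural -> physical mapping   */
static int next_free = NUM_ARCH_REGS;   /* simplistic free list: allocate upward */

/* Rename one instruction "rd = rs1 op rs2": sources read the current mapping,
 * the destination receives a brand-new physical register.                     */
static void rename(int rd, int rs1, int rs2)
{
    int p1 = rat[rs1], p2 = rat[rs2];
    int pd = next_free++;               /* real cores recycle freed registers  */
    rat[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++) rat[i] = i;
    /* Two instructions writing architectural r1: a false (WAW) dependency.    */
    rename(1, 2, 3);   /* r1 = r2 op r3  ->  p8 = p2 op p3                     */
    rename(1, 4, 5);   /* r1 = r4 op r5  ->  p9 = p4 op p5, independent of p8  */
    return 0;
}
```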

Once renamed, instructions are then dispatched into the issue queue (IQ), and also placed in the reorder buffer (ROB) to maintain the program's order for OoO execution (later used by the commit stage). When the IQ and its associative matching logic determine that an instruction's operands are ready and available, the instruction is placed in the dispatch queue, which awaits structural hazard resolution and an available ALU/FU for execution according to the instruction's operation. All operands to be executed by the ALUs/FUs are obtained from the register file or the bypass network. The bypass network allows computed results to be directly forwarded to their consumers, bypassing register latencies so that dependent instructions may execute as soon as possible. Thereafter, once instruction(s) have been dispatched and computed, the instructions await in the ROB to commit in program order. Thus when an instruction is at the head of the ROB, it may be committed (i.e. removed safely from the pipeline).

All instructions are kept in program order at the front of the pipeline until the rename stage. All stages prior to the rename stage are referred to as the pipeline's front-end. Subsequent pipeline stages are referred to as the backend, which executes OoO until the commit stage where instructions retire in program order. Conventional coding and compilation strictly rely on these semantics to maintain memory ordering and precise state.

To further improve the performance of an aggressive processor, certain techniques, collectively referred to as speculative execution, may be used; these are described in the following section.

2.4.2 Speculative Execution

The instruction to be fetched in the next cycle is maintained by the program counter (PC) as previously discussed. A pipelined machine achieves its maximum throughput when in a streaming mode (i.e. continuous fetching of instructions from sequential locations in program memory) [25]. Thus the most ideal application would simply increment its PC on every instruction fetch, pointing to the next sequential instruction to fetch, and so on. In actuality however, applications possess various conditional statements which redirect control-flow based on a given condition. The ISA expresses such conditions as branch instructions which are processed by the CPU's branch FU. During execution, the branch FU refers to a set of status registers holding conditional flags (i.e. zero flag, carry bit, etc.) to determine the outcome of the branch instruction. As an application executes, the status bits are updated to reflect machine state (for example, a last result equal to zero raises the zero flag). Based on the status of the flags, the application's control flow may be redirected to a non-sequential instruction address if a branch instruction is encountered, depending on its type. For instance, if the Branch-if-equal (BEQ) instruction reads that the zero flag is set in the status register, the program fetches from the address specified by the branch, otherwise the next sequential instruction is fetched (where multiple other branch type instructions exist, including BNE, BGT, BLT, etc.).
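As a small hedged illustration (the mnemonics in the comments are generic placeholders, not tied to a specific ISA), a high-level conditional is typically lowered to a flag-setting comparison followed by a conditional branch:

```c
/* Sketch only: the commented instruction sequence is hypothetical. */
int select_path(int x)
{
    if (x == 0)        /* CMP  r1, #0   ; the comparison sets the zero flag         */
        return 1;      /* BEQ  L_taken  ; fetch redirects when the zero flag is set */
    return x + 2;      /* fall-through: the next sequential instruction is fetched  */
}
```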

Consider a very simplistic in-order CPU. When a branch is encountered, its outcome may only be determined once the branch is executed in the backend, and therefore the PC does not know what address to fetch from until the branch is resolved/executed. Therefore once a branch is fetched, the front-end must stall until the branch is resolved, thereafter sending the effective address back to the PC so that the CPU may fetch the next specified instruction. In this case, performance is significantly degraded in proportion to the length of the pipeline from IF to EXE. A simple technique to mitigate stalling would be for the compiler to insert (branch-)independent instructions after a branch until it is resolved in the backend. When no independent instructions exist, No Operation instructions (NOOPs) may be inserted by the compiler so that the CPU may continue to execute until the branch is resolved, increasing the program size and decreasing a core's energy efficiency. Furthermore, as pipeline stages vary from processor to processor, the number of branch-independent instructions also varies, inevitably leading to pipeline stalls and performance degradation. For this reason, the hardware solution of branch prediction was introduced.

CPUs use branch prediction as a means to anticipate the behaviour of branch instructions so that performance penalties are minimized and instruction flow throughput is maximized [25]. To achieve such behaviour, branch prediction dynamically speculates branch outcomes (in addition to their target addresses) in the front-end and uses this knowledge to effectively predict a branch outcome (i.e. an application's future behaviour) rather than waiting until the branch is resolved to fetch the next instruction. To make such predictions, additional hardware logic is placed in the pipeline to speculate and maintain branch histories, where the backend is responsible for performing prediction validation once the branch is executed, performing misprediction recovery if required. Accordingly, the accuracy of the branch predictor has a significant impact on processor performance.

Common dynamic branch prediction techniques include bi-modal predictors (i.e. 2-bit saturating counters to determine branch outcomes), G-Share (hashing the PC address with history to determine a branch outcome and minimize branch aliasing) [26], and TAGE (a hybrid approach with partial tagging to determine branches with high certainty, using various history tables of different lengths to select the branch direction) [27]. The quality of branch predictors also depends on various parameters such as history lengths, the scope of the history (including nested branches), warmup periods (when no solid history of the application is yet known), and benchmark characteristics.
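A minimal sketch of the bi-modal scheme follows (assuming a simple PC-indexed table of 2-bit counters; the table size and index hash are illustrative only):

```c
#include <stdint.h>
#include <stdbool.h>

#define BP_ENTRIES 1024        /* illustrative table size */

/* Each entry is a 2-bit saturating counter:
 * 0 = strongly not-taken, 1 = weakly not-taken,
 * 2 = weakly taken,       3 = strongly taken.   */
static uint8_t counters[BP_ENTRIES];

/* Front-end lookup: predict taken if the counter is in the upper half. */
static bool predict(uint64_t pc)
{
    return counters[(pc >> 2) % BP_ENTRIES] >= 2;
}

/* Called after the branch resolves in the backend: nudge the counter
 * towards the observed outcome, saturating at 0 and 3.                */
static void update(uint64_t pc, bool taken)
{
    uint8_t *c = &counters[(pc >> 2) % BP_ENTRIES];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```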

Branch Prediction and Mispredictions: When a branch is mispredicted, the incorrect path must be discarded from the pipeline. Therefore a pipeline flush is required for all instructions subsequent to the branch, where the processor's state prior to the misprediction must be restored, and the alternate direction must be fetched and replayed. The front-end recovery process also requires the PC to be updated to the correct fetch address, while the incorrect prediction is updated in the history tables and Branch Target Buffer (BTB), i.e. the predicted branch target addresses. Misprediction may also be costly in the backend if all instructions prior to the branch must wait to be committed in the ROB prior to the pipeline flush to maintain program order (especially when considering an OoO superscalar CPU). Thus a low latency technique for branch misprediction is sought. To accomplish such a task, checkpointing mechanisms/structures are used in OoO superscalars. Accordingly, for every branch fetched/decoded, certain pipeline structures are copied or "snapshot" so that the processor state prior to the branch may be used to recover in the case of a misprediction. Such snapshotted structures include the register file, the Register Alias Table (RAT) for register renaming, and the PC, where ROB contents subsequent to the offending branch may simply be cleared using the checkpoint method. Therefore the PC must be updated to the correct address in the predictor, checkpoints must be restored, and the processor state must be rolled back to the point prior to the branch. Once restored, the processor may then resume its normal state and fetch from the corrected path while incrementally re-filling the pipeline.

In terms of the processor's architecture, the greater the number of pipeline stages between fetch and execute, the greater the cost of a misprediction and the recovery process for re-filling the pipeline with the correct instructions. Likewise, as processor architectures and microarchitectures become more aggressive and complex, so does the mis-speculation recovery process. Therefore although branch predictors may provide 90%+ accuracy, the possibility of occasional mispredictions still requires checkpointing mechanism integration, with a performance dependency on the number of front-end stages.

Exceptions: It is possible that certain instructions or conditions within the processor and/or its surrounding system may cause exceptions: unexpected changes in control flow causing other instructions within the pipeline to abort prior to their completion [28]. Such exceptional events include IO device requests, OS user services, arithmetic overflow, page faults, memory misalignments or protection violations, hardware malfunctions, and power failures.

The listed exceptions may be further classified into two categories: 1) interrupts, and 2) traps. Interrupts are caused by external events (i.e. IOs, page faults, system calls, etc.) and are asynchronous to program execution. Therefore such external events may be handled between instructions, i.e. after completion of the current instruction, where the program may simply resume execution once the exceptional event has been handled. Conversely, traps are caused by internal events (i.e. overflows, undefined instructions, etc.) and are synchronous to program execution. In the case of a trap, the program is suspended, where the exception is remedied by its respective handler. Once the handler has completed, the program resumes its execution (or aborts in certain cases).

When considering the execution of a single instruction, the pipelined processor executes the instruction in segments [28]. In the case of an exception, the offending instruction must always halt execution. Therefore all instructions preceding the exception must complete in the pipeline, where all subsequent instructions are flushed. As in the case of mispredictions, the state of the processor must also be saved prior to the exception (including the offending instruction's address, registers, and other structures). The PC is then redirected to the handler's address, where the exception handler is executed. Once complete, the processor restores its state and may continue executing as normal (assuming the exception is recoverable).

Although the probability of such exceptional events is expected to be low, when exceptions are raised the processor must spend time preserving, handling, and restoring state to resolve the issue. Consequently, current implementations of exception handling are latency intensive.

Loop unrolling: A loop is a method used in programming to repeat the execution of a set of instructions (a task) multiple times, based on a number of iterations or until a certain condition is reached. Loop unrolling is a technique used by compilers to replicate a loop's body multiple times, assuming independent instructions exist between the iterations, adjusting the loop's termination condition to aid in the overlapping of various instructions in the pipeline. Consequently, loop unrolling takes advantage of ILP in the instruction stream while fully utilizing the FUs of a processor [21]. Loop unrolling is also used to improve instruction scheduling latencies as it may conceal potential branches and overlap independent load and store accesses between iterations.
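A minimal sketch of unrolling by a factor of four (assuming, for brevity, that the trip count n is a multiple of four; a production compiler would also emit a remainder loop):

```c
/* Original loop: one multiply, one store, and one branch check per element. */
void scale(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Unrolled by 4 (assumes n % 4 == 0): the four multiplies are independent,
 * so they can overlap in the pipeline, and the loop branch executes only
 * one quarter as often.                                                     */
void scale_unrolled(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i += 4) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
}
```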

Loop unrolling is of limited benefit, however, if the overhead amortized by each unroll decreases, yielding insignificant performance gains per iteration. If the unrolling is also performed for a rather large loop, the code size may increase, leading to a larger memory footprint and likely higher instruction cache miss rates, which degrades performance. Furthermore, depending on the code size and the aggressiveness of the compiler's loop scheduling, register pressure may also increase due to the demand for live registers needed by the processor at a given time. Thus if the processor possesses a shortage of physical registers, although the code is theoretically faster, such unrolling would not be physically possible in the core [21]. Therefore a compiler's unrolling aggressiveness must be tuned to the underlying architecture to effectively increase performance.

A type of hardware loop detection may also be accomplished dynamically in the processor. Employed by many Intel CPUs [29], loop detection logic discovers when the CPU is executing a loop. When a loop is detected, branch prediction and the instruction fetch stage are temporarily disabled for the duration of the loop, where the detector injects the loop's instructions (possibly also decoded) into the pipeline. Accordingly, this loop detector unit is used to increase the performance of loops consisting of up to 18 micro-operations (uops), eliminating 2-3 pipeline stages while maintaining the state of the pipeline prior to the loop. Such cycle savings come at the expense of extra hardware units and constant loop detection surveillance, increasing the power and area requirements of the core. Accordingly, the general limitations of OoO superscalars are now discussed.

2.4.3 Limitations of OoO Superscalar Processors

Although the OoO execution pipeline is able to substantially increase performance in comparison to a simple in-order scalar core, it does so at the expense of hardware and power consumption. Specifically, the model centers around a sequential instruction fetch stage which introduces control dependencies between all instructions due to a PC-driven approach. The overall ILP achieved by OoO superscalars is also limited by the issue width, instruction window (ROB), IQ size, and the capacity of other hardware units in the pipeline (i.e. physical registers, dispatch queues, etc.). As a result, all these factors make the conventional processor model fundamentally sequential [30].

Branch mispredictions also require pipeline flushes and complex checkpointing mechanisms to recover from mis-speculation, where exceptions and recovery are fairly latency intensive. The associative hardware logic of the IQ which determines ready instructions is also quite demanding and centralized, where only certain issue widths may be sustained before the matching latency and power become significant bottlenecks [3]. Thus we observe superscalars devoting a large share of their hardware and complexity to reconstructing a limited view of program dataflow at runtime [6], where such dependencies are actually known by the compiler and limited by the ISA.

Another major disadvantage of the superscalar pipeline is data movement and transport, as discussed in Chapter 1. In the front-end, tags and data needed for matching and forwarding are continuously copied and buffered from one pipeline stage to the next, requiring hardware and interconnection which is often redundant. Likewise, the bypass logic in the backend consists of a complex interconnect, where the greater the number of ALUs/FUs, the more complex the communication infrastructure and the greater the number of register ports required [3]. Furthermore, regardless of value bypassing, values must still be written to the register file even if unnecessary, contributing to redundant bandwidth.

Overall, it is evident that superscalar models do not scale well due to a vast infrastructure of slow networks, associative searches, and complex control logic [7]. It is also clear that the model is very centralized, meaning that there are several single points of control which govern the hardware's execution. This centralization represents a single point of failure, and can often lead to faults disabling an entire core [31]. Thus as technology and transistors continue to scale, reliability must also be considered alongside die costs and manufacturing [32].

2.5 Parallelism Granularity and Application Flow

Applications may contain up to three types of parallelism: Instruction Level Parallelism (ILP), Data Level Parallelism (DLP), and Thread Level Parallelism (TLP). A processor is typically designed to handle all types of application parallelism, however certain models are more optimized for handling specific types of parallelism than others. This section highlights the concepts and ways in which processors address the various types of parallelism. Thereafter, the concepts of control-flow and dataflow processors are discussed.

2.5.1 Instruction-Level Parallelism (ILP)

Instruction Level Parallelism (ILP) is the overlap of instruction execution in a pipeline to improve performance [21]. The overall objective of ILP is to reduce execution time by overlapping the execution of as many independent instructions as possible within a given clock cycle. To achieve ideal ILP, superscalars must overcome three issues, namely: an uninterrupted supply of instructions for execution, just-in-time data for the execution of instructions, and the analysis of data dependencies within a window of instructions to initiate concurrent execution of the ready instructions [33].
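As a small hedged illustration, the two reduction loops below perform the same number of additions, but the first forms a serial dependency chain (each add needs the previous sum, limiting ILP to roughly one) while the second exposes four independent chains that a wide-issue core could overlap:

```c
/* Serial chain: every iteration depends on the previous sum. */
long sum_serial(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four partial sums: independent adds that may issue in parallel
 * (assumes n is a multiple of 4 for brevity).                     */
long sum_parallel(const long *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```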

Typical approaches for exploiting ILP within high performance cores involve employing either a hardware or software approach. In the hardware approach (specifically within superscalars), the IQ is responsible for monitoring operand dependencies dynamically and dispatching the ready instructions [34]. The issue width of the CPU therefore determines how many instructions may execute concurrently. Accordingly, the memory system's bandwidth and latency determine the overlap of instruction data in the pipeline, with the register file providing instruction operands, and the cache supplying memory instructions. Typical OoO superscalar processors aim to optimize these design constraints to extract higher ILP. Designing wider machines however incurs a geometric complexity increase in control logic, where issue width is also bounded by bypass scalability and the L1 cache's bandwidth [28, 3].

In the software approach, the compiler relies on software functionality to find instruction parallelism statically during application compilation. The compiler design also strictly depends on the underlying architecture, its issue width, and an exposed pipeline. Consequently this technique raises issues of binary and software compatibility, as a compiler must support multiple underlying architectures and processor generations, versus superscalars which support dynamic scheduling regardless of the underlying architecture. Furthermore, certain applications may possess limited ILP, and therefore providing a processor with a very wide instruction width may not be beneficial, leading to an under-utilization of processor resources. This technique is mainly used by VLIW and EPIC processors, further discussed in Section 2.8.

2.5.2 Thread-Level Parallelism (TLP)

Thread-Level Parallelism is a processor's ability to execute independent programs and/or thread contexts simultaneously on a single core [21]. TLP models allow multiple threads to traverse the pipeline in parallel, giving the illusion of a multi-tasking processor. Such TLP and/or multi-tasking relies on the OS, which creates and overlaps multiple threads of execution for the CPU to process concurrently. By supporting TLP, a processor may improve its core throughput to conceal various latencies (in comparison to executing a single-thread workload). Such TLP processor models include Simultaneous Multi-Threading (SMT), fine-grained MT, and coarse-grained MT (described in the Single-Thread vs Multi-Threading section of this chapter).

2.5.3 Data-Level Parallelism (DLP)

Data-Level Parallelism (DLP) is the ability for multiple threads of execution to perform identical operations on different data sets simultaneously [21]. The technique of DLP involves exploiting parallelism using data streams, whereas conventional ILP/TLP aim to exploit parallelism for multiple instructions using the same data stream. CPUs exploit DLP through Single Instruction Multiple Data (SIMD) operations, which are executed on dedicated backend SIMD FUs. SIMD operations behave as vector processing units, where the same instruction is executed on a row of data elements simultaneously. Such DLP is exploited in the source code statically, where performance is highly dependent on the programmer's ability to express such computations as parallel vectors, and on the compiler's ability to detect such operations.

To exploit DLP operations, ISAs are often augmented with SIMD instruction set extensions. SIMD extensions were originally introduced to ISAs for computing multimedia-based applications, i.e. MMX (64b vectors), but were soon also applied to scientific applications. Later SIMD extensions include SSE (128b) and AVX (256b, with AVX-512 extending to 512b) for the x86 ISA, with NEON for the ARM ISA. Accordingly, vector lengths vary between architectures and ISA extensions. Consequently software may need to be re-compiled or even rewritten for portability and compatibility between various processor platforms of different vector lengths [35]. Thus DLP issues remain beyond the scope of this work.
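As a brief hedged illustration of the SIMD style (using the widely available x86 SSE intrinsics; the 4-wide single-precision vectors assume 128-bit registers):

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128 holds four packed floats */

/* c[0..3] = a[0..3] + b[0..3] with a single vector add,
 * versus four scalar adds on a non-SIMD datapath.         */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}
```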

2.5.4 Control-Flow vs Dataflow

The processors discussed in this chapter, also referred to as control-flow based processors, use a memory system to store instructions and data, where instruction flow is driven by a program counter (PC). A program however is an implementation of an algorithm, which can be represented as a dataflow graph. Such dataflow graphs consist of vertices and edges which represent instructions and their dependencies, respectively. Accordingly, complex pipeline structures such as the IQ are used to reconstruct a limited view of this program dataflow at runtime.

To extract such program dataflow, many works have proposed pure dataflow processors, which require dedicated programming languages to represent a given algorithm as a dataflow graph [36, 37]. In dataflow processor systems, the execution of an instruction is triggered purely by the availability of an instruction's input operand data. Triggers are implemented using tokens, where an instruction may execute when all its input tokens are present. Once executed, the result token is distributed to its consumers, thereafter triggering other ready dependent instructions. Tokens therefore contain such metadata and independently traverse the pipeline to trigger functionality. Such a design eliminates the need for a PC and the control-flow style of execution, where data results may be passed directly from producers to consumers without the need for complex logic structures, as in the case of sequential instruction streams. Instructions in dataflow processors may also be discarded from the datapath/pipeline once executed, versus the ROB method employed by OoO superscalars.

Accordingly, the success of dataflow machines has heavily relied on the following requirements: 1) the ability to support imperative languages, 2) the efficiency of executing hybrid control-flow/dataflow based operations, 3) the need to manage data structures, memory organization, loops, and sub-functions in hybrid models, and 4) the importance of understanding static and dynamically allocated instruction placement, tokens, etc. Such dataflow processors often require custom compilers and ISA enhancements to support such custom languages. These architectures therefore cause software compatibility issues in modern programming and computing. For these reasons, control-flow machines have remained prominent. However, recent works have invoked hybrid control- and data-flow techniques to propose ways to compromise between the advantages of both processor approaches. Such works are discussed in Section 4.2.

Next, this chapter discusses details of multi-threading for increasing performance and better utilizing a given processor core.

2.6 Single-Thread vs Multi-Threading

A conventional single-thread CPU executes one thread per pipeline. As previously mentioned, in the case of an interrupt, exception, or OS call, the pipeline state must be saved and flushed, after which an alternate thread starts/restores its state and executes. Similarly, for cache misses and branch mispredictions, a single-threaded pipeline may be stalled for several cycles until resolved, leading to an under-utilization of hardware. Therefore although such events are application dependent, single-threaded cores limit the maximum instruction throughput of CPUs. This leads us to the concept of executing multiple thread contexts to increase core ILP and TLP. To accommodate multiple thread contexts, a single CPU core may be amended to support different granularities of Multi-Threading (MT). The following describes the three main techniques employed by processors for achieving multi-threading (MT) on a core: fine-grained, coarse-grained, and simultaneous.


2.6.1 Fine-Grained Multi-Threading

Fine-Grained MT processors fetch instructions from a different thread per clock cycle, such that every pipeline stage possesses a different thread's instructions. This technique improves pipeline utilization while tolerating control and data latencies, yielding sufficient throughput by overlapping threads with useful work [38]. Instruction/data dependency mechanisms and branch prediction are typically not required as threads do not overlap in the pipeline. Although this may be an effective technique for hiding latency, single-thread performance suffers greatly since one instruction fetch is performed per thread every P cycles (where P is the total number of pipeline stages). Hardware complexities, selection logic, and certain resource contentions are also problematic.

2.6.2 Coarse-Grained Multi-Threading

Coarse-Grained MT processors execute a single thread context in the processor until a stalling event occurs. The pipeline's state is then saved and flushed on such events, where a different (ready) thread is selected to run in the pipeline. The new thread is then restored to its previous state and executed. Such context switches however incur latency overhead, where deeper pipelines are subject to higher performance losses [37]. Although this method is much easier to implement in comparison to other MT techniques, coarse-grained MT suffers from low single-thread performance while yielding non-deterministic delays and performance.

2.6.3 Simultaneous Multi-Threading (SMT)

Similar to coarse- and fine-grained MT processors, SMT also fetches from one thread per cycle, however it allows multiple threads to dispatch and execute their instructions concurrently on a given core [15]. Such a technique is supported through the sharing and duplication of multiple pipeline structures, allowing for highly utilized pipeline stages and the execution of independent operations from various thread workloads. As some structures are shared, tags are also necessary for distinguishing thread contexts and their respective dependencies. SMT therefore requires additional complexity within a processor for better utilizing the datapath and providing higher throughput.

Diminishing performance returns in SMT CPUs remain an issue as the processor scales to support more thread contexts. Such scalability limitations are due to blocking, contention, structural hazards, potentially longer pipelines, and thread over-utilization. Accordingly, commercial CPUs do not exceed 2 threads per core for stricter memory models (i.e. x86²), and up to 8 threads for relaxed memory models (i.e. SPARC, etc.). Similar to other MT techniques, single-thread performance also degrades as threads scale due to thread selection, resource sharing, and blocking. Consequently, the number of threads executing per core must be restricted to trade off performance against hardware constraints, resource contention, and the overall instruction parallelism found within a general purpose workload. SMT support for embedded CPUs however has generally not been implemented, since a slight increase in performance for embedded workloads comes at the expense of energy consumption and extra hardware logic.

Although SMTs have both their advantages and disadvantages, multi-threaded workloads may also be deployed on multi-core models, described next.

2.7 Multi-Core Models

Technological innovation stems from the principles of Moore's Law and Dennard scaling. Dennard scaling suggests that maintaining constant electric fields and reducing transistor dimensions by 30% every two years will allow MOSFETs to consume less power and maintain reliability [1]. Thus Dennard forms the basis of Moore's Law: as transistors continue to scale, the number of transistors within a chip will double approximately every two years.

Transistor scaling has accounted for almost three orders of magnitude of performance improvement over the past 20 years of processor history [1]. According to Dennard's scaling, as transistors are scaled by 30%, their area reduces by 50%, doubling the density with every technological generation. Scaling also reduces a transistor's delay by approximately 0.7x, conversely increasing frequency by 1.4x [1].

Furthermore, by keeping the electric field constant, the supply voltage (Vdd) can be reduced by 30%, in turn reducing power by 50%³. Chip architects have been successful at increasing frequency and creating complex architectures within a power budget by exploiting these transistor trends.
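Combining these factors gives a back-of-the-envelope view of idealized Dennard scaling (leakage ignored; the 0.7x and 1.4x factors are the nominal values quoted above):

$$P_{dyn} = \alpha C V_{dd}^{2} f \;\Rightarrow\; P'_{dyn} \approx \alpha\,(0.7C)(0.7V_{dd})^{2}(1.4f) \approx 0.5\,P_{dyn},$$

so each scaled transistor dissipates roughly half the power while occupying roughly half the area, keeping power density approximately constant across generations.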

² Referred to as Hyper-Threading (HT) by Intel.
³ Since $P_{dyn} = \alpha C V_{dd}^{2} f$, assuming a constant activity factor $\alpha$.


As presented in the Introduction chapter of this thesis, Fig. 1.2 outlined CPU trends from the 1970s until 2016, including the number of transistors, frequency of operation, and the thermal design power for the more prominent architectures. Similarly, Fig. 1.1 displayed the architectural advances and energy efficiency of processor styles. Specifically, Fig. 1.2 displays that beginning in the early 2000s, transistor counts continued to increase while the gains in clock speed, power, and performance concurrently became elusive. As a result, the CPU scaling trend no longer followed Moore's law or Dennard's scaling. In order to restore such parallelism, performance, and the benefits of CPU scaling, multicore architectures were introduced around the mid 2000s. As seen in Fig. 1.1, these multicore (non-deep-pipelining) methods were able to revive performance and power scaling by liberating processors of their expensive and inefficient techniques, towards a simpler means of extracting parallelism. It is evident however that the last few multicore generations have provided a performance/watt plateau, as displayed in Fig. 1.1.

2.7.1 Motivation

The main concept behind multicore systems, as demonstrated in Figs. 1.1 and 1.2, is to trade design complexity for power/area efficiency. Accordingly, symmetric (homogeneous) multicore (SM) architectures often adopt one of two design methodologies: connect a large number of small, simple in-order cores to deliver high thread-level parallelism and power efficiency, or a smaller number of OoO SMT-based superscalars to negotiate between single- and multi-threaded performance [39]. Additional hardware units are also necessary for symmetric multicores to maintain coherency, communication, and memory semantics among the cores.

Although certain SMs may tolerate low-power in-order cores by compensating low single-thread performance with an abundance of thread-level parallelism, general purpose systems still require the single-thread performance and high ILP provided by aggressive, OoO processors [40, 41]. In order to directly compensate for the lack of single-thread performance in symmetric multicore systems, heterogeneous multicores were proposed [38]. These systems contain cores of various sizes, performance, and complexity to compromise between single- and multi-thread performance. They typically consist of a few aggressive cores for sequential code execution and other core models (i.e. SMT, simple in-order) which are dedicated to increasing on-chip throughput and TLP. Although results suggest that the cores are able to speed up sequential applications in comparison to a purely symmetric approach, several issues such as scheduling, core configuration, uniform ISAs, and utilization remain areas of research [1, 42]. Successful implementations of heterogeneous MP include ARM's big.LITTLE processor series.

2.7.2 Limitations

Since 2005, processor designers have steered towards increasing core counts to exploit Moore's Law rather than focusing on the importance of single-thread performance. As seen in Fig. 1.1, non-deep pipelines in conjunction with a multicore architecture obtained a 3.5x performance/watt average improvement in comparison to the previous generation. However, slow performance improvements have been observed thereafter as chip architects must limit the frequency and number of cores to keep power within reasonable limits [1]. It is also imperative that a certain performance gain is acquired with every advancing computing generation despite the decline of transistor and frequency scaling [43]. The majority of multicore hardware research, however, has been directed towards energy efficiency and specialization.

The work of Esmaeilzadeh et al. [43] assesses Dennard's scaling and various other characteristics to predict the number of future multicore scaling generations we are to expect. Combined with the ITRS's optimistic projections, the study predicts that the best average speedup attainable from multicore systems will be 16% per year until 2024, when the 8nm technology node is reached [43]. This figure is fairly accurate given that Intel's Haswell processor only provides a 14% performance improvement (on average) when compared to its predecessor.

Based on these trends it is evident that the performance gains and energy efficiency we have grown accustomed to cannot be achieved through advances in conventional architectures and by simply relying on transistor scaling. Whether these trends terminate due to energy/performance scaling issues or Moore's law remains unknown. Regardless, delivering performance and compatibility with every advancing computing generation will remain a constant goal for architects. To continue these trends, it is important that technological dependencies are mitigated by researching novel architectures which diverge from the standard model of computing. This thesis now briefly discusses other well-known alternative computing models.

2.8 Very Long Instruction Word (VLIW)

VLIW processors move the complexity of the hardware to the compiler, providing a more intuitive compilation process. Using this technique, the compiler may extract higher ILP from a sequential instruction stream using a wider issue width. VLIW architectures exploit such operation parallelism by positioning ALUs/FUs of specific types horizontally in the processor. Very long instruction words are created statically by the compiler according to the pipeline of the underlying architecture. The processor then fetches these long instructions consisting of multiple operations which must simply be decoded and executed, effectively reducing front-end bandwidth (i.e. fetch, decode, rename, etc.). A substantial amount of memory, however, is required to store such wide instructions.

As VLIWs are more compiler intensive, the CPU's hardware and logic are also much simpler in design, exhibiting lower power consumption and on-chip real estate in comparison to superscalar OoO models. In addition, VLIWs do not require hazard detection or hardware units such as register renaming or ROBs, since all issues are resolved by the compiler through static scheduling. Although VLIWs are simpler in design and obtain a higher frequency of operation, the ingenuity of superscalar designs was able to quickly match such performance gains. Examples of VLIW processors include Intel's i860 and Transmeta's Crusoe [44].

VLIWs ideally depend on an abundance of application ILP and operation parallelism to exploit their full potential. However, if insufficient application parallelism is present, the architecture's hardware units are underutilized and unscheduled slots translate to wasted power and hardware. To address such limitations, techniques such as loop unrolling have been used for aggressive execution in VLIWs at the cost of increased code size and additional memory storage. VLIW binary compatibility also poses a problem, since the compiler is designed based on an exposed pipeline (i.e. the pipeline specifications and latencies of the ALUs/FUs present in the processor). Different VLIW processor implementations may consist of different microarchitectures per generation, and hence maintaining binary compatibility for software across various models becomes problematic. Therefore VLIW advantages come at the expense of a complicated compiler design, and the loss of the purely dynamic issue/execution method found in superscalar OoO processors.

OoO execution allows a core to dynamically determine its operand status, versus VLIWs which assume and schedule instructions based on the compiler's knowledge of the microarchitecture. Thus if an instruction takes longer than anticipated, e.g. due to a cache miss, the VLIW processor must stall, since the compiler assumed a certain latency for the cache access; likewise, the compiler cannot predict the outcome of a branch. Conversely, OoO superscalars may schedule and execute other independent instructions in such cases, effectively increasing performance. The dynamic approach of the OoO superscalar therefore allows compatibility across processor generations with a generic compilation process which is independent of the underlying architecture, unlike static compilation.

2.9 Explicitly Parallel Instruction Processors (EPIC)

EPIC architectures extend many VLIW concepts by integrating both a static and dynamic approach for handling instruction streams. Although EPIC processors are statically scheduled, they require extensive software/hardware assistance to provide the processor with information to deal with events such as branch prediction, load speculation, and necessary exception handling. As a result, EPIC architectures require mechanisms to communicate this parallelism to the underlying hardware [45].

Many measures have been taken by EPIC-based architectures (such as the Itanium IA-64) to mitigate the effects of VLIW processors. For instance, decreased code sizes are managed by compressed instruction storage [20] and software pipelining [46]. To address binary compatibility, processors such as the IA-64 provide extensions for aggressive software speculation which overcome hardware dependency limitations [21]. For branching, methods such as predicated branch execution and trace scheduling have been employed. Latency and memory operations also remain problematic, with possible solutions including software prefetching and cache hierarchy storage prediction [46, 47]. Consequently, although EPIC models solve certain VLIW limitations, they also have much room for improvement.


2.10 Digital Signal Processors (DSP)

DSPs are a special type of processor used in embedded applications for digital signal processing. DSP hardware units are somewhat diverse from traditional general purpose processors, consisting of special purpose hardware to directly deal with signal processing. For instance, a DSP will often be chosen over other embedded processors if an application has an extensive need for a Multiply-Accumulate (MAC) unit - a unit capable of executing a multiplication and addition/accumulation as one instruction [21]. DSPs are also used for other dedicated communication algorithm accelerators.

DSPs are often employed for fixed-point calculations and accordingly possess extra wide registers that guard against rounding errors. Word sizes in DSPs are also not restricted to power of 2 sizes as in the case of general processors. Thus DSPs, such as the Motorola DSP56301, possess a 24-bit data width and 56-bit accumulator width, versus a typical processor width of 32-bit, 64-bit, etc. For many of these reasons, DSPs remain predominant mostly in the embedded industry for signal processing.
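A minimal sketch of the kind of kernel a MAC unit accelerates follows (a fixed-point dot product; the Q15 format and accumulator width shown are illustrative, not specific to any particular DSP):

```c
#include <stdint.h>

/* Dot product of two Q15 fixed-point vectors. Each loop iteration is one
 * multiply-accumulate; a DSP's MAC unit executes it as a single instruction,
 * with a wide accumulator absorbing intermediate growth without overflow.   */
int64_t dot_q15(const int16_t *x, const int16_t *h, int n)
{
    int64_t acc = 0;                    /* wide accumulator, like a DSP's */
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];    /* multiply-accumulate step       */
    return acc;
}
```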

2.11 Co-Designed Virtual Machines

As demonstrated throughout this chapter, a processor's hardware has architecturally transitioned with each computing generation, especially when compared to the original logic captured by general purpose ISAs. To minimize these interfacing effects and promote the role of the processor, the concept of virtualization is used. Virtualization places an abstraction layer between resources and the user of the resources so that the logical view of the system is different from the physical view [7]. This concept brings forth virtual machines (VMs), which virtualize a full set of hardware resources, i.e. a processor, memory, storage, and peripheral devices, and emulate them on another underlying system. Specifically, VMs may be implemented as individual processes and/or complete OS environments. Such virtualization therefore allows for program-to-architecture flexibility.

Co-designed VM technologies however differ from VMs as their prime goal is to provide intrinsic compatibility at the ISA level [7]. Co-designed VMs relieve ISA impositions by virtualizing one processor's instruction set to another. Specifically, VM software is used to maintain software compatibility, translating a program's instructions from a given target ISA to another (possibly new) ISA which is supported by an alternate underlying processor. Such a process therefore translates virtual instructions to real instructions so that they may be understood by the underlying architecture. This method is primarily invoked to enhance and/or remove functionality from the catered underlying architecture to increase a processor's performance and/or energy efficiency. It is therefore required that co-designed VM software be designed concurrently with a processor's architecture to promote such ISA flexibility and software compatibility.

Eminent works of co-designed VM implementations include Transmeta's Crusoe [44] and Nvidia's Denver [48]. These works have attempted to implement runtime (dynamic) co-designed VMs to accelerate performance on-chip at the trade-off of software complexity. Both works implemented software compatible VLIW pipelines to support wide issue execution, aiming to simplify the underlying hardware and maintain software compatibility for VLIW processors. To mitigate runtime software overhead, these co-designed VM cores include translation caches which hold previously translated (recurring) functions to speed up execution [48]. Consequently, code phases that have not previously been translated suffer a decoding penalty to support the dynamic software translation process, especially when compared to hardware execution. To decrease this decoding overhead, processors such as Denver multiplex the VM with a hardware-based decoding stage so that previously non-translated code phases may be redirected to hardware, effectively decreasing software overhead [48].

Co-designed VMs are a viable solution for enabling new hardware architectures and techniques which may otherwise be limited by ISAs and prohibited by backwards compatibility. Although the approach presents certain difficulties, further research emphasis holds promise for increasing the performance of software/hardware in processor architectures.

2.12 Summary

This chapter provided detailed information pertaining to various factors within processor computing. The background provided in this chapter illustrated various conventional processor models, discussed the concept of instruction sets, and demonstrated the general need for improvement, scalability, and innovation within current generation processors. Accordingly, the need to design unconventional models which deviate from conventional approaches poses an attractive solution. The next chapter discusses and overviews the proposed architecture design of this thesis.

Chapter 3

ConSSTEP Overview

3.1 Introduction

Now that sufficient background knowledge has been provided on processor architectures, this chapter provides a general overview of the thesis' proposed approach – ConSSTEP, outlining the core's compilation process and architecture. Specifically, this chapter provides an overview of the ConSSTEP core, detailing any assumptions made throughout the processor's design. The general compilation process is then described, along with the microarchitectural and architectural flow for executing the compiled data, and an example of ConSSTEP's execution flow is provided. Thereafter, a brief discussion on the advantages of the ConSSTEP approach is provided and directly compared to the limitations of conventional processor methods, structures, and pipeline stages, in addition to the static compilation approach of VLIWs. Finally, although ConSSTEP presents many advantages, these are accomplished at certain trade-offs, which are discussed in detail in the final section of this chapter.

3.2 General Overview

The overall ConSSTEP flow is presented in Fig. 3.1. ConSSTEP is a multi-threaded configurable processor consisting of multiple engines. Each engine possesses a varying number and type of functional units (FUs) connected through a registerSwitch (rS) interconnect. Each engine is dedicated to a single thread's workload using a low complexity and scalable logic design. Multiple engines work simultaneously on a core to provide high throughput and TLP flexibility for multi-threaded applications. FUs present in the engine may possess varying integer-based functionality ranging from simple ALU operations and barrel shifting to complex integer execution, and hence are referred to as FUs.

An engine’s components execute configuration data as opposed to ISA-based instructions. Engines mitigate much of the overhead associated with reconfigurability by compromising between CSPP and RCS approaches to increase performance. Such configuration logic is generated through a two-layer compilation process. This layered approach is used to maintain compatibility with current software while catering to an underlying configurable architecture. The first layer of compilation is referred to as Logical compilation (i.e. the standard compilation process), and the second layer as Physical compilation (i.e. configuration data generation).

The overall goal of a PhysC is to provide the advantages of a smarter compilation process¹ and the software compatibility of co-designed VMs [44, 48] with minimal runtime overhead. A PhysC’s main objective is to obtain a compiled binary and perform macro-processing, generating bundles of microcode logic so that a general-purpose application may execute on an underlying configurable architecture. The PhysC eliminates software compatibility issues such as those experienced by VLIWs and VMs by being co-designed with the processor architecture, while taking a pure hardware approach during runtime. Accordingly, a coarse-grained macro-processing approach is used to translate data/instructions, in contrast to co-designed VMs which simply convert one ISA to another (a fine-grained method) and incur high software overhead. Instructions are decoded, renamed, ordered, and translated to configuration logic to eliminate the runtime bottleneck of VMs and execute purely on hardware. The ConSSTEP architecture also deviates completely from conventional models by using such configuration logic, and is able to adapt to general-purpose workloads, eliminating many VLIW and superscalar impositions through a new datapath design. By using such a layered approach, ConSSTEP is able to modify internal microcode between processor generations with full software compatibility.

¹ Referring to the static and intuitive compilation processes of VLIWs and EPICs.


3.3 Assumptions

The following work and methods assume a Unix-based file system and a RISC target ISA. This approach allows the PhysC to implement a simple instruction decoding phase, with fixed-length instructions and one operation per instruction. The PhysC flow also assumes that a given benchmark may be translated to instructions using any compiler; however, the methodology elaborated upon in this work assumes a gcc/g++ compiler for a target RISC ISA. The architecture also assumes precise exception handling. Although a PhysC and ConSSTEP core may be designed and scaled to any Operating System (OS), compiler, and ISA, such implementations remain future work.
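As a brief illustration of why a fixed-length, one-operation-per-instruction RISC target simplifies the PhysC’s decoding phase, the following C sketch extracts the fields of a hypothetical 32-bit instruction word. The field layout and values are assumptions made purely for illustration and do not correspond to any particular ISA.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical 32-bit RISC-like encoding (illustrative layout only):
     * [31:26] opcode | [25:21] rd | [20:16] rs1 | [15:11] rs2 | [10:0] imm/unused */
    typedef struct {
        uint8_t opcode, rd, rs1, rs2;
    } DecodedInst;

    static DecodedInst decode(uint32_t word)
    {
        DecodedInst d;
        d.opcode = (word >> 26) & 0x3F;  /* fixed field positions: no variable-length parsing */
        d.rd     = (word >> 21) & 0x1F;
        d.rs1    = (word >> 16) & 0x1F;
        d.rs2    = (word >> 11) & 0x1F;
        return d;
    }

    int main(void)
    {
        DecodedInst d = decode(0x20A31000u);   /* arbitrary example word */
        printf("op=%u rd=%u rs1=%u rs2=%u\n", d.opcode, d.rd, d.rs1, d.rs2);
        return 0;
    }

Because every instruction has the same width and a single operation, the decoding step reduces to fixed shifts and masks, which keeps the PhysC’s first processing pass simple.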

It is also assumed that benchmark applications may use single- and/or multi-threaded programming models such as OpenMP [49] and PThreads. As the main objective of this work is to increase single-thread performance within multi-threaded workloads, such programming models must be considered. In the case of OpenMP, the thread-parallelizing model is initiated from a master thread which spawns (i.e. forks) slave threads when a given task in the program is specified as parallelizable. A task is specified as parallelizable using “pragma” keyword statements embedded by the programmer into the code [49]. The OpenMP model therefore takes a higher-level approach to coding multi-threaded applications, where the master thread forks threads during runtime at pragma locations so that threads may execute concurrently. Threads then join back to the master thread once the parallelized code has been executed, using barriers (i.e. thread wait statements which synchronize all threads).
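A minimal OpenMP fragment in C illustrating the fork/join behaviour described above; the loop and variable names are illustrative only and are not taken from any benchmark used in this work.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int sum = 0;

        /* The master thread forks slave threads at the pragma; iterations are
         * divided among them, and an implicit barrier joins the threads at the
         * end of the parallel region. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += i;

        printf("sum = %d\n", sum);   /* executed by the master thread after the join */
        return 0;
    }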

PThreads conversely follow a lower-level model with complex programming constructs, requiring explicit programming effort at a fine-grained instruction level. Accordingly, the programmer must specify forks, joins, mutexes, and signalling between the threads, and ensure program correctness. Due to such a low-level multi-threading approach, PThreads are invoked in the C programming language, whereas the higher-level OpenMP model may be applied to a wider variety of languages (C, C++, Fortran). Accordingly, ConSSTEP and the PhysC support all such languages and multi-threaded models.
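For contrast, an equivalent C fragment using PThreads, where the fork, join, and work partitioning must all be coded explicitly; the thread count and names are illustrative only.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long partial[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = id; i < 1000; i += NTHREADS)   /* explicit work partitioning */
            partial[id] += i;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long sum = 0;

        for (long t = 0; t < NTHREADS; t++)          /* explicit fork */
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)          /* explicit join */
            pthread_join(tid[t], NULL);
        for (long t = 0; t < NTHREADS; t++)
            sum += partial[t];

        printf("sum = %ld\n", sum);
        return 0;
    }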

Finally, considering other programming models such as OpenCL, CUDA, etc., which integrate various processing models within a heterogeneous core (i.e. GPUs, accelerators, etc.), ConSSTEP may be fully supported. Specifically, ConSSTEP was designed as a power- and performance-efficient substitute for an SMT model. Since the heterogeneous compilation process extracts program phases and compiles them according to the core to which each must be redirected, ConSSTEP integrates naturally into such a flow. That is, ConSSTEP is fully compatible with conventional single- and multi-threaded ISAs and compilers, and may therefore retrieve the generated binary and redirect it to the PhysC for processing. Since ConSSTEP also supports barriers and memory coherency in a similar manner to conventional CPUs, no issues arise for core integration within a heterogeneous system.

Figure 3.1: ConSSTEP Top-Level Overview

3.4 Compilation Process

The left of Fig. 3.1 illustrates a standard compilation process. That is, a benchmark is input to the logical compiler (i.e. any standard compiler) to generate a binary. Precompiled binaries therefore do not require any recompilation, assuming the next step, Physical Compilation (PhysC), also supports the same target ISA.


In order to generate data for the underlying configurable architecture, the binary is sent to the PhysC for further processing. With the binary, the PhysC is able to use its embedded microarchitectural information to 1) gather instructions to form bundles (terminated on conditional branches), 2) perform instruction analysis within a bundle to eliminate internal data hazards, which is then used to 3) extract intra- and inter-bundle dependencies, where duplicate bundles are removed, 4) select the most suitable execution engine per thread, assigning individual instructions to an engine’s internal FUs, 5) determine the data transport of operands with respect to their dependencies (rS interconnect, storage, propagation, etc.), and finally 6) generate configuration data for the engines and their respective configurable components to execute the given thread workloads.
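To make step 1 concrete, the following C sketch splits a decoded instruction stream into bundles terminated at conditional branches. The instruction and bundle representations are simplifications invented for illustration and are not the PhysC’s actual data structures.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { int opcode; bool is_cond_branch; } Inst;

    typedef struct { size_t start, length; } Bundle;   /* indices into the stream */

    /* Step 1 (illustrative): walk the stream, closing a bundle at every
     * conditional branch, which also becomes the bundle's terminator. */
    size_t form_bundles(const Inst *stream, size_t n, Bundle *out, size_t max_out)
    {
        size_t count = 0, start = 0;
        for (size_t i = 0; i < n && count < max_out; i++) {
            if (stream[i].is_cond_branch || i == n - 1) {
                out[count].start  = start;
                out[count].length = i - start + 1;
                count++;
                start = i + 1;
            }
        }
        return count;   /* later steps analyse hazards and dependencies per bundle */
    }

Steps 2 through 6 then operate on each bundle produced here, first removing internal hazards and duplicates before assigning instructions to FUs and emitting configuration data.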

Once generated, configuration data is routed to its respective engine and configuration memory bank (Fig. 3.2), where bundle addresses and dependencies are provided to the scheduler. Alternatively, scheduling may also be handled by the OS. However, since this work’s objective is to maintain software compatibility, it assumes an on-chip hardware scheduler, with threads themselves still managed by the OS.

Figure 3.2: ConSSTEP Execution Process (Single Engine)


3.5 Architectural Flow and Pipeline

As previously mentioned, ConSSTEP is a general-purpose multi-threaded processor consisting of variable-sized engines, where each engine comprises FUs interconnected through a registerSwitch (rS) network, shown in Fig. 3.2. The rS interconnect is a configurable topology in which rS units are programmed by the generated configuration logic, dictating the topology’s communication characteristics for a workload. That is, an rS interconnect may be programmed to provide distributed storage and/or single-cycle data propagation between dependent instructions per clock cycle. Accordingly, an rS interconnect allows an engine to temporally configure itself per cycle to support various data transfers and storage, adapting to a given thread’s execution workload. A simple example of an rS unit’s architecture for the latching and/or propagation of data (consisting of a single input and output port) is displayed in Fig. 3.3. The configuration registers shown are loaded with the PhysC’s generated bits, in turn dictating the control logic for a given rS port at a given cycle. A given input port may temporarily store and/or propagate a value as required, mitigating unnecessary buffering and/or data movement (further details are provided in Chapter 5.2). The FU functionality itself is not configurable; however, the PhysC must determine each FU’s operation per cycle according to the workload.

Figure 3.3: rS Architectural Functionality (Single Input and Output Port)
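A minimal behavioural sketch in C of a single rS port, assuming a per-cycle control word with one store bit and one propagate bit. The field layout, naming, and forwarding rule are assumptions made for illustration only and are not the actual configuration encoding.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_CYCLES 32

    /* Assumed per-cycle control for a single rS input/output port:
     * 'store' latches the incoming value locally, 'propagate' forwards a value
     * (the incoming one, or the previously latched one) to the output port. */
    typedef struct { bool store; bool propagate; } RsPortCtrl;

    typedef struct {
        RsPortCtrl ctrl[MAX_CYCLES];   /* loaded from the configuration memory bank */
        uint32_t   latched;            /* distributed temporary storage */
    } RsPort;

    /* Approximate behaviour of one port for one clock cycle. */
    uint32_t rs_port_cycle(RsPort *p, unsigned cycle, uint32_t in, bool *out_valid)
    {
        RsPortCtrl c = p->ctrl[cycle % MAX_CYCLES];
        if (c.store)
            p->latched = in;                      /* keep the operand local for a later consumer */
        *out_valid = c.propagate;
        return c.propagate ? (c.store ? in : p->latched) : 0;
    }

The point of the sketch is that the PhysC-generated bits, not runtime control logic, decide each cycle whether a value is held in place or forwarded, which is what lets the interconnect double as distributed storage.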

Each engine also possesses a small external architectural register file (EARF) to maintain software compatibility and the basic registers required of the target ISA. Specifically, the rS interconnect is responsible for managing propagation and temporary storage among instructions within a bundle, i.e. intra-bundle dependencies, without the need for register file intervention. However, a thread consists of many bundles, where a bundle’s final operands may be input to the next (dependent) bundle scheduled afterwards, i.e. inter-bundle dependencies. To satisfy such inter-bundle dependencies, any final architectural register values present within a bundle (operand results which are not required further by instruction consumers) are written to the EARF, where subsequent bundles may use such data to satisfy the inter-bundle dependencies of a thread. Since the rS interconnect is able to maintain the majority of operand lifetime within an engine/bundle, reads and writes to the EARF are decreased significantly.
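The following C sketch illustrates, under a simplified register-naming assumption, how a value produced inside a bundle can be classified as intra-bundle (served through the rS interconnect) or as a final architectural value that must be written to the EARF; it is not the PhysC’s actual analysis.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { int dest; int src1, src2; } BInst;   /* architectural register ids */

    /* A produced value has an intra-bundle consumer if a later instruction in the
     * same bundle reads it; it is a final (live-out) value if no later instruction
     * in the bundle overwrites the same architectural register, so the EARF must
     * be updated for subsequent, dependent bundles. */
    void classify(const BInst *b, size_t n, size_t i,
                  bool *has_intra_consumer, bool *is_final_value)
    {
        *has_intra_consumer = false;
        *is_final_value = true;
        for (size_t j = i + 1; j < n; j++) {
            if (b[j].src1 == b[i].dest || b[j].src2 == b[i].dest)
                *has_intra_consumer = true;    /* satisfied through the rS interconnect */
            if (b[j].dest == b[i].dest) {      /* redefined before bundle end */
                *is_final_value = false;       /* no EARF write needed for this value */
                break;
            }
        }
    }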

Figure 3.4: ConSSTEP Basic Pipeline

ConSSTEP’s basic pipeline is shown in Fig. 3.4. As seen in the figure, ConSSTEP eliminates the majority of conventional pipeline stages and places more emphasis on data processing. In the pipeline’s first stage, the scheduler sends out a bundle ID for execution, which is read from the configuration memory. In the second stage, the engine is configured while any input data is read from its respective location (EARF or cache). The third phase executes the bundle while performing any writebacks, and the final pipeline stage is responsible for final writebacks and providing the scheduler with the branch outcome. Using the outcome, the scheduler then determines the next bundle to execute, the respective bundle ID is sent out for configuration, and the pipeline flow continues. It is also worth mentioning that although the configuration and execution stages are each presented as a single pipeline stage, each is inherently pipelined and takes several cycles to complete (i.e. configuration may take up to 10 clock cycles, execution of a bundle varies, etc.). Each is nonetheless represented as one stage in the pipeline figure as the cycles belong to the same pipeline phase.
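A compact behavioural sketch in C of the basic (non-overlapped) pipeline loop just described; the function names are placeholders standing in for hardware behaviour, not ConSSTEP interfaces.

    #include <stdbool.h>

    typedef int BundleId;

    /* Placeholder hooks standing in for the hardware behaviour (assumptions). */
    static BundleId scheduler_next(bool taken)     { return taken ? 1 : 2; }
    static void     load_configuration(BundleId b) { (void)b; }  /* stage 1: read config memory */
    static void     configure_and_read(BundleId b) { (void)b; }  /* stage 2: program engine, EARF/cache reads */
    static bool     execute_bundle(BundleId b)     { (void)b; return true; }  /* stage 3: execute + writebacks */
    static void     final_writeback(BundleId b)    { (void)b; }  /* stage 4: final writebacks, report outcome */

    int main(void)
    {
        bool taken = false;
        for (int step = 0; step < 8; step++) {     /* a few iterations of the basic flow */
            BundleId b = scheduler_next(taken);
            load_configuration(b);
            configure_and_read(b);
            taken = execute_bundle(b);
            final_writeback(b);
        }
        return 0;
    }

In this basic form the configuration step sits serially in front of execution, which is exactly the latency the aggressive pipeline below is designed to hide.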

Although the presented pipeline eliminates many conventional front-end stages and logic, it is not ideal, as the clock cycles gained by eliminating such processing overhead have been redirected to the cycles required for engine configuration. Consequently, very minor performance gains, if any, would be expected of ConSSTEP with such a pipeline. A more aggressive technique is therefore sought to conceal such latencies and exceed the performance of current SMT and monolithic OoO superscalar cores.

Figure 3.5: rS Unit - Double Configuration Register Setup

To adequately conceal configuration overhead, a two-configuration-register setup is used by ConSSTEP’s engine structures. The idea behind this register scheme is to simultaneously pre-configure and execute bundles on a given engine. An example of an rS unit’s double configuration register setup is shown in Fig. 3.5. As seen in the figure, while the multiplexer is set to execute one configuration register set (select bit), the other may concurrently be configured to effectively conceal any setup overhead.
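A minimal ping-pong sketch in C of the double configuration register idea: while one register set drives execution, the other is being loaded, and a select bit swaps them at bundle boundaries. The data layout and sizes are assumptions for illustration only.

    #include <stdint.h>
    #include <string.h>

    #define CFG_WORDS 64

    typedef struct {
        uint32_t set[2][CFG_WORDS];   /* two full configuration register sets */
        int      active;              /* select bit: which set drives the multiplexers */
    } DoubleCfg;

    /* Pre-load the inactive set while the active one is executing. */
    void preload(DoubleCfg *c, const uint32_t *next_cfg)
    {
        memcpy(c->set[c->active ^ 1], next_cfg, sizeof(c->set[0]));
    }

    /* At a (correctly predicted) bundle boundary, swap in one step: the freshly
     * loaded set becomes active and the old one is free to be reconfigured for
     * the next predicted bundle. */
    void swap_sets(DoubleCfg *c)
    {
        c->active ^= 1;
    }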

Using double configuration registers, the proposed aggressive pipeline is provided in Fig. 3.6, consisting of two engines with a single thread’s workload mapped to each. As seen in the figure, such a pipeline now eliminates both conventional front-end limitations and configuration overhead. The scheduler, however, must now keep track of bundle dependencies, predictions, and completions for all running engines. A more illustrative example explaining the pipeline in further detail follows, referring to the execution process presented in Fig. 3.2.

Figure 3.6: ConSSTEP Aggressive Pipeline (2 Engines)

As seen in Fig. 3.2 (Step 1), the scheduler outputs an ID, i.e. a configuration (config) address, when a bundle is ready for execution. The bundle’s config memory address is then used by each engine structure’s memory bank to read in the contiguously stored configuration bits (Step 2a). These bits are then loaded onto each engine component’s configuration register, where any external values (EARF reads and/or cache loads) needed are simultaneously obtained and stored to the Read Register Buffer (RRB) (Step 2b). This phase is referred to as the engine configuration and external read process (external values being those required for the bundle to commence execution). Once loaded, each individual bit in the configuration register is used to dictate the unit’s control logic per clock cycle during execution.

Once the configuration and external read phase has completed, execution commences (Step 3b). As instructions execute on the FUs, data is propagated throughout the rS interconnect as topologically configured, where data may also be temporarily stored in the distributed rS units if needed later by subsequent instruction(s). External FUs not present within the main engine, such as load/store units (LSU) for memory references, may be shared between engines for more complex instruction execution. Such unit interfacing is completed on behalf of an engine’s RRB and Write Register Buffer (WRB). Each engine also possesses a conditional branch unit for executing branches (as bundles terminate upon conditional branches) (Step 4a), where the branch is also predicted by the scheduler’s branch predictor. When the correct branch outcome is computed by the engine during execution, it is sent to the scheduler to verify the prediction and signal bundle completion. At this time, any final data values pending in the WRB are also written or stored to the external register file and cache, respectively. Since the engine has already been pre-configured for the next bundle, it may read any external values necessary while the scheduler verifies its branch prediction. Assuming a correct prediction, the new bundle then continues to execute, while the next predicted bundle configures the alternate register set (i.e. the register set which has just completed execution, assuming a correct branch prediction outcome). Further architectural and execution details, including branch mispredictions, are discussed in Chapter 5.

3.6 Execution Example

An example of a bundle executed on a four-FU (4-FU) ConSSTEP engine is presented in Fig. 3.7. The bundle to be executed (represented as instructions here) is provided to the right of Fig. 3.7, where the circled numbers signify an instruction’s execution time as determined by the PhysC. Instructions are listed in the format shown in the figure. The engine’s path of execution for the listed instructions is provided on the engine’s datapath, temporally and spatially configured with the bits loaded to each component’s configuration register. The various dashed lines and colours used in the figure represent execution in different clock cycles.

Figure 3.7: ConSSTEP Execution Example for 4-FU Engine

In the first cycle, the OR and AND instructions are executed (opcodes buffered in their respective FUs), each obtaining their source data from the RRB (i.e. external register reads and/or immediate values buffered during configuration). The result generated by the OR instruction is then needed twice thereafter by the SUB instruction executed during the second cycle. The first cycle’s AND instruction is also a producer for two other instructions: once in the next cycle for the ADD instruction, and once in the third clock cycle for XOR. Since the operand (’5’) is required at two different times, the result must be both propagated and stored in the first clock cycle. Therefore, during the third clock cycle, the operand is obtained from the rS unit storage (marked ’S’ in the 2nd row, 1st column rS) and propagated to its consumer as required. ’5’ is also a final register value and must be written to the external register file. In this example, the PhysC schedules the external register write during the third clock cycle, with the final store instruction (7) also written to the WRB and thereafter forwarded to the D-cache. As seen in this example, FUs may execute various instructions, one per cycle, where data flows throughout the rS interconnect. Instructions continue to execute similarly until bundle termination, dictated by the branch.

3.7 ConSSTEP vs Conventional CPUs

Now that the ConSSTEP process has been sufficiently reviewed, this section discusses the benefits of ConSSTEP in comparison to conventional processor structures, interconnects, and pipeline stages.

3.7.1 Execution

Standard CPUs (including VLIWs) consist of a static set of predefined FUs and/or sets of FUs (i.e. two ALUs, one multiplier/divider, one branch unit, etc.) which dictate a core’s issue width. CPUs execute an out-of-order, yet fairly sequential, instruction stream, creating a limited view of an application’s data dependency graph using the IQ. The IQ monitors its dependencies through complex control logic and associative searches, dispatching instructions once ready. Due to this backend design, a restriction exists with respect to the type of instructions and maximum issue width a CPU may support, which is further limited within SMT models. This limitation brings forth the need to execute more demanding workloads on heterogeneous multi-cores to compromise between single- and multi-threaded workloads. However, only so much TLP and ILP may be extracted from these workloads, as performance is still limited by sequential execution and a single core’s architecture [38, 50].

ConSSTEP’s back-end consists of execution engines, each of which may comprise a different number and combination of FUs that execute concurrently and independently. It is arguable that not all FUs are needed at a given time, considering both computationally- and memory-intensive workloads. On the contrary, the inclusion of such FUs and rS units allows engines to possess flexible issue widths and distributed storage locations. Such a backend design also eliminates the effects of:

• Continuous access to a centralized register file, with increased read/write port requirements for wide issue execution

• Unscalable issue widths

• Fine-grained dispatch

• Tile-based hotspots (for distributed architectures)

• Unnecessary broadcasts to the IQ, bypass network, and register file

• Increased logic complexities and the general unscalability of multi-stage bypass networks.

Encompassing a variety of FUs and engine types also allows ConSSTEP to adapt from simple to more complex workloads on a single chip, in addition to varying degrees of TLP, while extracting higher performance and ILP. It is also less expensive in terms of hardware, area, and power to include a few additional FUs and rS units than several processors, as in the case of multi-cores, if similar or better performance is achieved.

3.7.2 Storage & Interconnect

Several works have addressed issues of value lifetime and redundant register file bandwidth [6, 8]. According to Tseng and Patt [6], 80% of these values have an average lifetime of 32 instructions or fewer. Using the PhysC to form bundles and extract instruction dependencies, experimental testing reveals that a bundle includes approximately 28 instructions on average and does not exceed 80 instructions. Hence the majority of operand lifetime may be maintained within a bundle through such coarse-grained execution (a topic revisited in Chapter 9.4.2).

Optimistically assuming that computation energy costs continue to decrease through voltage scaling, that frequency remains fairly constant, and that increasing performance will remain a constant goal, the cost of data movement within a processor begins to dominate energy efficiency [1, 43]. It is therefore important that data movement be restricted to achieve the increasing benefits of performance scaling. This signifies that data must be kept as local as possible, and that either the size of the local storage must be increased (preferably in a distributed manner to maintain access speeds) and/or producer-consumer dependencies between instructions must be better exploited.

rS interconnects within a ConSSTEP engine achieve interconnect scalability using a torus-like topological infrastructure which possesses temporal storage properties and single-cycle multi-hop data communication between dependent instructions. The rS units therefore maintain data locality and propagate multiple results to their consumers only when required, while also eliminating the need for register file hierarchies (discussed in Chapter 4). Furthermore, the configuration data generated by the PhysC eliminates the need to propagate tags throughout the network, as in the case of dataflow architectures, further mitigating data movement and improving energy efficiency.

3.7.3 ConSSTEP vs VLIWs

ConSSTEP shares similarities with the static compilation approach of VLIWs. VLIWs, however, generally extend instruction sets with specific instructions, whereas ConSSTEP extends the functionality of the ISA by generating control logic for a configurable underlying architecture. By also using a coarse-grained approach, ConSSTEP provides more flexibility than VLIWs, extracting both operation parallelism and data transport patterns while maintaining software compatibility. ConSSTEP may also execute a variable number of instructions per clock cycle per thread for multi-threaded workloads, whereas VLIWs execute a single thread’s workload, which causes core underutilization. The drawbacks of static compilation techniques, such as cache misses, are also mitigated in ConSSTEP by sufficient instruction parallelism from other concurrently executing threads.

Although certain aspects of the PhysC may resemble a VLIW-like compiler, the PhysC adopts certain principles from co-designed VMs to support standard compilers, while implementing a reconfigurable (and/or adaptable) underlying architecture. VLIWs conversely employ customized primary compilers specific to their underlying architecture during compilation, causing certain software compatibility issues (i.e. IA32/IA64, backwards compatibility issues, and the need for recompilation). Evidently, the notion of a PhysC allows applications to be distributed across a wide range of processor types to increase efficiency and promote the processor’s role, while concurrently concealing the software from the target architecture and avoiding the need for re-compilation.

In terms of the architecture, VLIWs also require demanding numbers of register file ports for wide issue, which contribute to additional power consumption, logic complexities, and latencies. Conversely, ConSSTEP promotes both scalable logic and interconnect solutions, reducing register ports, accesses, and storage requirements significantly, while supporting scalable data propagation.

3.8 Tradeoffs of ConSSTEP

Although ConSSTEP provides many potential performance, area, and energy efficiency improvements over conventional processors and SMTs, such advantages come with trade-offs: configuration memory requirements and the hardware/software co-designed PhysC.

Generally, the advantages of configurability often come at the expense of additional hardware logic, latencies, and memory storage requirements. These factors are especially true for architectures which are augmented with configurable logic, such as the works discussed in Chapter 4.4 (i.e. MorphCore, CoreFusion, etc.). Conversely, ConSSTEP directly considers such constraints by completely redesigning the datapath to accommodate configurability in a simple manner while enhancing the single-thread performance of multi-threaded workloads. To mitigate latency overhead, ConSSTEP distributes configuration memory and supports an aggressive pipeline. ConSSTEP also integrates configurable hardware within its datapath structures to eliminate unnecessary hardware logic, and is able to increase operational frequency in comparison to conventional processors due to its simple logic. The aggregated memory storage requirements for the configuration data, however, are much higher in comparison to a typical processor’s I-Cache (verified in Chapter 9.4.1).

ConSSTEP’s extra memory storage requirements, however, come with the advantage of eliminating several front-end pipeline and OoO re-ordering structures, decreasing the dynamic power and area overhead of a processor. In such a case, the eliminated front-end (dynamic) logic is replaced with configuration caches which are used sparsely (i.e. once per bundle), in comparison to every clock cycle as in the case of superscalars. The configuration logic used to mitigate the processing overhead of conventional CPUs, however, is generated at the cost of increased software complexity and a hardware/software trade-off. That is, the software-based PhysC which generates configuration logic is used to eliminate hardware-based instruction processing overhead and potentially increase issue width.

As previously discussed, Transmeta’s Crusoe [44] and Nvidia’s Denver [48] attempted to implement runtime co-designed VMs (software) to eliminate front-end structures and accelerate performance on-chip at the cost of software complexity. The PhysC concept takes an alternate approach to the Transmeta and Nvidia translation/processor vision. Rather than converting one ISA instruction to another on-the-fly (where both Transmeta and Denver possess an in-order VLIW-like underlying architecture), the PhysC promotes a static and compatible macro-processing translation flow which generates control logic for configurable processor architectures, also mitigating several VLIW bottlenecks. Thus the PhysC compromises between the benefits of a smarter static compilation process, in comparison to VLIWs, and the binary compatibility objective of co-designed VMs. The PhysC concept, however, requires much more research, especially regarding the elimination of various memory and aliasing issues. For this reason, the PhysC is implemented using a trace-based technique in this work to mimic steady-state behaviour, leaving such issues as future work.

The concept of a PhysC therefore remains the greatest trade-off of the ConSSTEP architecture, which is heavily reliant on its feasibility. As history dictates, however, VMs themselves were once considered mere academic curiosities which provided extraordinary system flexibility for certain unique applications [51]. The progress and research placed in such VMs within the past decade clearly demonstrate that programmers and researchers have greatly enhanced the area of VMs, and therefore a PhysC concept may in fact be feasible within the near future.

Finally, it is worth mentioning that the static approach invoked by the PhysC also loses the purely dynamic execution style typical of superscalar OoO CPUs. As discussed in Chapter 2.8, conventional OoO execution allows a core to dynamically determine operand status and execute, versus a static compilation approach which assumes a certain latency during instruction scheduling. Thus, if an instruction takes longer than anticipated, e.g. on a cache miss, the processor must stall, as the compiler cannot anticipate the miss. Conversely, OoO superscalars may continue executing independent instructions to effectively increase performance in comparison to VLIWs. However, considering that ConSSTEP directly improves the performance of multi-threaded workloads and decreases the processing overhead per instruction, such disadvantages are relieved. The work of Hily and Seznec has also demonstrated that even an in-order 4-thread SMT may reach 85% of the performance of a 4-thread OoO processor on multi-threaded workloads [52]. Accordingly, the effects of a static compilation approach are mitigated by supporting such multi-threaded workloads, OoO-compiled execution, and a nuanced underlying architecture.

3.9 Summary

This chapter provided a general overview of the proposed ConSSTEP architecture, outlining the mechanics of the core. The chapter gave a brief outline of the compilation procedure, the microarchitectural and architectural flow, and any assumptions made throughout the process. An example illustrating the execution flow for the proposed core model was provided. Thereafter, the chapter outlined the advantages which the ConSSTEP architecture possesses over conventional processor methods, and concluded with a discussion of the trade-offs at which ConSSTEP achieves such advantages over conventional cores, namely the additional memory needed to support a reconfigurable underlying architecture and the compromise between the hardware and software complexity of the PhysC.

Accordingly, now that the ConSSTEP methodology is understood, the next chapter discusses related work on various processor system research and compares each work to the ConSSTEP architecture. Further details pertaining to the ConSSTEP architecture and compilation process are provided thereafter in Chapters 5 and 6, respectively.

Chapter 4

Related Work

4.1 Introduction

This chapter provides background on several related processors which have attempted to deviate from conventional models. To discuss these works, the chapter is categorized into the following sections: hybrid data-flow architectures, distributed and coarse-grained models, reconfigurable architectures and CGRAs, and other architectural models. All models are then contrasted and compared with the method proposed by ConSSTEP.

Hybrid data-flow architectures attempt to expose data-flow-like execution in conventional control-flow instruction streams to increase ILP and TLP. Data-flow machines prior to these hybrid models were successful, but experienced several software compatibility issues and required custom programming languages to eliminate control-flow instructions. The hybrid approach discussed in this chapter, however, maintains software compatibility by integrating ISA extensions and/or custom compilers to support data-flow techniques (i.e. data-triggered execution) for a given architecture. Accordingly, the first section of this chapter discusses the eminent hybrid architectures of WaveScalar [30] and TRIPS [34, 53].

Distributed and/or coarse-grained models aim to eliminate the centralized structures and/or fine-grained instruction execution of conventional CPUs, respectively. Accordingly, distributed models duplicate and simplify compute units, dispersing the structures throughout the core. Such units work concurrently to provide higher TLP and ILP in comparison to conventional cores. Likewise, coarse-grained models employ custom compiling techniques to group instructions for coarse-grained execution, mitigating some of the control-flow limitations imposed on conventional processors. Accordingly, this section describes the eminent works of Task SuperScalar [54], BRAID [6], WiDGET [55], and CRIB [50].

Next, reconfigurable architectures and Coarse-Grained Reconfigurable Accelerators (CGRAs) are discussed. CGRAs propose the idea of integrating configurable accelerators into a pipeline, or as a co-processor for the main CPU, such that the accelerator may dynamically configure itself for recurring phases of code. Since dedicated accelerators may only benefit certain applications, the concept of reconfigurability poses an attractive solution for a wide selection of workloads. Similarly, the reconfigurable architectures discussed in this chapter possess certain properties which allow a core to transition dynamically from one core type to another (i.e. in-order SMT to OoO single-thread and vice versa), compromising between performance and energy efficiency. Works discussed in this section include DySER [56], MorphCore [10], and CoreFusion [57].

Finally, other processor models with similar objectives to ConSSTEP which do not fall into any of the previous categories are discussed under Other Architectural Models. The models discussed include Stream Processors and Transport Triggered Architectures (TTAs).

4.2 Hybrid Data-flow Architectures

Many previous works have applied hybrid data-flow concepts to deviate from conventional control-flow OoO processor models and exploit dataflow, instruction-, and data-level parallelism. Such architectures aim to provide both execution efficiency and compatibility with legacy software.

WaveScalar is a custom compiler which works on an extension of the Alpha ISA, designed for dataflow-like execution [30]. The compiler generates “waves” (i.e. bundles of instructions) which execute on the WaveCache processor. The WaveCache processor itself is a tile-based processor, where instructions execute on processing elements (PEs), each consisting of a five-stage dynamic pipeline. The internal architecture is very similar to a conventional CPU, but functions on the data firing rules and tokens of data-flow models. The placement of instructions is determined both at compile time and runtime, where a hierarchical interconnect is used to provide data-flow communication among dependent instructions, which may add latency to the data bypassing of instructions that are not within the same vicinity.

Although WaveScalar is a novel approach for transitioning away from a conventional CPU model, start-up costs, dynamic instruction loading, discarding, and context switching exhibit large latency penalties due to the hierarchical interconnect and outer-bound L2 data cache placement. Consequently, WaveScalar’s performance is negatively affected for memory-intensive applications. Sequential workloads also suffer a performance loss when highly dependent instructions become scattered throughout the WaveCache, increasing communication costs and interconnect traffic. The compiler and WaveScalar ISA are also strictly dependent on the WaveCache architecture for instruction placement, so other execution models do not necessarily benefit from the WaveScalar ISA.

The TRIPS model is a distributed, polymorphic architecture capable of statically configuring its hardware to support variable instruction granularities and diverse workloads [34, 53]. TRIPS functions on the innovative and custom Explicit Data Graph Execution (EDGE) ISA for mixed dataflow/control-flow execution. EDGE uses the notion of hyperblocks (i.e. blocks of instructions), where the role of the compiler is to statically schedule the blocks on the execution model and exploit dependencies critical to performance. One of EDGE’s most defining features, however, is its direct communication among dependent instructions, achieved by embedding target operands within an instruction so that produced results can be forwarded directly to their targets rather than the register file, for dataflow-like execution. The TRIPS processor itself consists of five different types of tiles that are duplicated and distributed across a platform, where each tile provides a specific pipeline function, interconnected through various micro-networks.

Although TRIPS provides a significant effort to deviate from conventional models, the core is only able to obtain limited performance gains due to its microarchitectural organization. As certain micro-networks and tiles lie on critical paths between dependent instructions, operand delivery and/or data communication act as bottlenecks for purely sequential applications, contributing to unnecessary congestion. Memory and communication overhead also prove problematic for memory-intensive applications due to the tile positioning of memory. Similarly, although there are many advantages to the EDGE ISA, there are also limitations regarding block and branch prediction, the high fanout of instruction targets needed to deliver producer results to their consumers, and compiler dependencies on a grid-like architecture. Furthermore, embedding more information into instructions contributes to fairly large code sizes, storage requirements, and instruction cache misses. Since the TRIPS architecture employs a custom compiler, the EDGE compiler also lacks certain optimizations in comparison to other eminent compilers.

Comparison to ConSSTEP

Unlike TRIPS and WaveScalar, ConSSTEP may employ any ISA and/or compiler by utilizing a two-layer compilation approach, eliminating the burden on the primary compiler and the need to implement ISA extensions. It is true, however, that both TRIPS and WaveScalar could also apply a PhysC technique to their works. When directly comparing the TRIPS and WaveScalar architectures to ConSSTEP, both TRIPS and WaveScalar maintain a conventional n-stage pipeline and simply apply it in a spatially-oriented manner.

Specifically, in the case of TRIPS, tiles are scattered throughout the core, where each is responsible for a certain pipeline function. ConSSTEP completely deviates from such a pipeline, allowing the PhysC to handle the front-end stages, while the actual hardware is solely responsible for execution and data writeback to improve energy efficiency. Engines also maintain one transport network, whereas TRIPS requires six, contributing to various contention and hotspot points due to its large mesh topology.

Conversely, WaveScalar also spatially distributes execution engines within a core, as in the case of ConSSTEP. WaveScalar, however, uses a complete dataflow technique to transport data through a hierarchical interconnect per wave. Such a triggered dataflow method requires complex tag manipulation and data routing among FUs. Its hierarchical interconnect (similar to TRIPS) also incurs additional and non-deterministic delays during data transport. Conversely, ConSSTEP maintains all operand communication within an engine with minimal contention due to the rS interconnect’s temporary storage system, providing configurability and workload adaptation. ConSSTEP also maintains a conventional memory system layout, while amending the pipeline functionality to match or exceed the performance of conventional processors.


4.3 Distributed and Coarse-Grained Architectures

Task SuperScalar (TaskSs) [54] provides a well-designed solution for mitigating the performance effects of control-flow instruction streams by creating tasks using the OmpSs programming model. The OmpSs model is employed to support parallel task execution and is extremely similar to the OpenMP paradigm (and has in fact influenced several OpenMP v4.0 features [58]). OmpSs programs are compiled using the Mercurium compiler, which spawns a task-generating thread that sends compiled tasks to the TaskSs front-end. The front-end then decodes inter-task dependencies at runtime in a similar manner to conventional superscalars, however at a coarser granularity. Once a task’s dependencies have been met, the task is dispatched to the backend, which consists of a series of interconnected superscalar processors to execute the ready task(s).

Although TaskSs is able to mitigate several issues of coarse-granular execution by employing an amended front-end and the OmpSs programming model, the TaskSs back-end simply consists of multiple processors. Therefore tasks must traverse another processor pipeline to execute their intra-task instructions. Hence area overhead and power consumption are neither efficient nor ideal.

The BRAID [6] architecture aims to provide a wide execution core with OoO performance and in-order complexity by exploiting the small fanout and short lifetimes of values produced by a program [6]. Braids are bundles of instructions generated using an augmented compiler, ISA, and microarchitecture. The BRAID compiler invokes a binary profiling and translation process to extract producer-consumer dependencies for each instruction, embedding hints within the instructions for the BRAID architecture to exploit at runtime. The architecture itself consists of a reduced-complexity OoO pipeline, where FUs are replaced with Braid Execution Units (BEUs). BEUs execute braids in order and bypass values to a BEU’s issue queue, internal register file (for short-lived operands), or the external register file (for other braids and BEUs to use), with the aid of the ISA extension. Such a design is able to reduce each structure’s complexity and the size of the register files.

Significant energy savings come from BRAID’s shorter rename stage and reduced register file accesses. BRAID is also able to give a slight performance increase (1.5%) when compared to an equivalent OoO core.

The Wisconsin Decoupled Grid Execution Tiles (WiDGET) [55] architecture aims to provide an energy-efficient computing structure for OoO single-thread performance using an in-order approach, decoupling thread management using several distributed and clustered simple in-order Execution Units (EUs). Specifically, an EU may execute one instruction at a time (either an integer ALU, FP ALU, or Address Generation Unit operation), where several EUs work concurrently. Once executed, instruction dependencies are redirected to their dependent EUs through instruction steering logic among the EU clusters. WiDGET aims to provide such scalability without the need for compiler and/or ISA support. Accordingly, the architecture supports the SPARC ISA while implementing an a priori static allocation policy to map instructions to EUs and provide performance efficiency.

Experimental testing showed that the in-order resources which WiDGET invokes are able to provide aggressive OoO-like single-thread performance for the majority of computationally intensive SPEC benchmarks. However, great performance losses were observed in memory-intensive applications due to its architectural layout.

In an attempt to depart from the confinement of traditional OoO processors, the Consolidated Rename Issue Bypass (CRIB) [50] architecture implements an innovative approach to the rename, issue, and bypass pipeline stages through structure consolidation while also maintaining compatibility with current software stacks. CRIB eliminates large multi-ported physical register files, reservation stations, Register Alias Tables (RAT), and the ROB to provide performance and energy efficiency through in-place execution. Specifically, the CRIB architecture employs a standard gcc compiler and the x86-64 ISA, where instructions are grouped into bundles of four decoded micro-ops during runtime according to the fetched instruction stream. The CRIB architecture consists of many in-place execution engines, i.e. entries, where multiple entries form a partition. An entry’s architecture is arranged in two dimensions, where data is latched horizontally upon completion, and computed/propagated vertically upwards (from bottom to top) in program order. Operands are obtained from the bottom row’s Architectural Register File (ARF) columns and routed vertically upwards to and from the appropriate source and destination columns for ALU row computation. Once all four (micro-)instructions have been executed in every entry and partition, they are committed in a circular fashion by all CRIB partitions, adhering to program order and eliminating the need for the ROB.

CRIB relies on a program-ordered instruction stream to determine its instruction chains. Therefore, instructions grouped together do not necessarily comprise producer and consumer dependencies, limiting the possibility of executing longer instruction chains and utilizing partitions more efficiently. Consequently, such PC-based ordering imposes negative effects on CRIB performance.

Comparison to ConSSTEP

Both WiDGET and BRAID attempt to provide the benefits of OoO execution with in-order complexity and energy efficiency. In the case of BRAID, compiler and ISA augmentations are required by the core, which cause certain compatibility issues for general-purpose computing. Conversely, WiDGET is able to provide full compatibility; however, its general layout and the use of small cores operating in lock step contribute to inter-core communication costs which prevent the architecture from attaining further performance. With regards to value lifetime in the case of BRAID, having multiple copies of architectural register files and an external register file does not completely address nor provide a complete solution to the data transport problem.

Conversely, ConSSTEP addresses compatibility issues and value lifetime by allowing the secondary compiler to relieve the burden of redesigning the primary compiler, while also extracting instruction dependencies within bundles. No amendments are needed to the ISA or primary compiler, while an rS interconnect is used for communication, bypassing, and temporary storage between dependent instructions to completely mitigate unnecessary data movement and/or register file hierarchies.

With regards to the TaskSs architecture, although the work is successful at addressing coarse-granular dependencies dynamically through hardware, it does not directly address the back-end execution bottleneck. In fact, TaskSs increases power consumption while requiring the OmpSs programming model, which may require reprogramming and re-compilation for certain applications.


Finally, when compared to CRIB, the CRIB core does not extract instruction dependencies in the same manner as a PhysC, contributing to additional dependencies and latencies. Each CRIB partition supports four entries in a static-like manner, while consolidating many pipeline stages to decrease on-chip power consumption. The benefits of static dataflow-like execution which CRIB tries to extract are fundamentally limited by the architecture, the instruction stream, and the need to maintain compatibility with the x86 ISA. On the contrary, a ConSSTEP engine may comfortably support and execute a multitude of instructions per engine in a dataflow-like manner while also avoiding the burden of ROBs, register renaming, physical register files, and the issue stage. ConSSTEP avoids ISA and compiler compatibility issues using the PhysC, and is able to mitigate the power consumption of many other pipeline stages using a completely nuanced approach to the processor’s architecture.

4.4 Reconfigurable Architectures and CGRAs

The reconfigurable style of computing has received much attention in the past few years, as it has the potential to provide both higher performance and system flexibility at the expense of certain programming effort. Intel’s recent acquisition of Altera has also steered commercial research towards the possibility of reconfigurable computing, which would be especially beneficial for high-performance applications which require application acceleration.

An FPGA’s bit-level configurable approach, however, requires very fine-grained application customization, a significant increase in design effort, and long compilation times which prove problematic for general-purpose computing. Coarse-Grained Reconfigurable Accelerators (CGRAs) have been proposed to mitigate these effects, raising configurability to the word level, similar to ConSSTEP, and reducing the amount of configuration information necessary for the dynamic customization of computing systems. CGRAs act as accelerators and/or co-processors to monolithic CPUs, and encompass several internal computing elements which are interconnected for dataflow-like execution [56, 59]. CGRAs increase performance, however, at the expense of ISA, compiler, and microarchitecture modifications to support custom instructions which redirect applicable code phases to the backend CGRA unit(s). Certain CGRAs also require custom programming languages, design flows, and OS support [59–61]. The following describes examples of CGRAs and briefly covers eminent works in the area of reconfigurable computing architectures.


DySER (Dynamically Specialized Execution Resources) [56] is a CGRA embedded in the backend of a monolithic processor. DySER requires an LLVM-based compiler and amends the SPARC ISA with custom extensions for interfacing DySER to a conventional processor. The compiler is responsible for profiling and evaluating application kernels, determining whether sufficient acceleration exists to benefit from a dedicated pipeline accelerator. The compiler then selects the most prominent recurrent kernels, generating memory and computation data for DySER to reconfigure dynamically according to the kernel’s dataflow requirements. Therefore, when an accelerated phase of the program is approaching, DySER dynamically reconfigures to the kernel’s data-flow, acting as a dedicated accelerator within the pipeline.

Although DySER displays significant energy and performance savings, the data and control communication overhead penalty between the accelerator block and the CPU pipeline is high. A heavy reliance also exists on the CPU pipeline to provide DySER with data. DySER’s integration into the CPU pipeline therefore complicates an already fairly complex front-end, making DySER an intrusive integration for higher-performance processors. Furthermore, many compiler amendments are required to embed and provide phase predictions in advance in order to dynamically configure DySER and conceal latency overhead. This is especially difficult and problematic when considering irregular control flows and speculative execution. Finally, applications that do not possess sufficient recurring phases do not benefit from DySER, in turn increasing static power dissipation.

Other tightly-coupled CGRAs include Matrix [60] and CHIMAERA [62], which are also integrated within monolithic backend pipelines for executing dataflow-like phases of an application. Similar to DySER, the ISA, compiler, and processor microarchitecture require modification to support custom instructions which redirect applicable code phases to the CGRA unit. Conversely, loosely-coupled CGRAs such as PipeRench [61] and FPCA [59] act as coprocessors to the CPU and thus are located externally from the main core. Since the CGRA is not within the processor pipeline, such coprocessors may be customized for larger application phases while achieving high energy efficiency for a given kernel. This flexibility, however, often comes at the expense of custom programming languages, design flows, and compilers which contribute to software compatibility issues and an increase in offload/communication latencies for general-purpose processors.


MorphCore [10] is a processor core which has the ability to run as a highly-threaded in-order SMT core for multi-threaded workloads, or as an aggressive OoO core for single (active) thread workloads. MorphCore takes advantage of the fact that aggressive OoO cores are ideal for providing high single-thread performance, whereas SMT workloads may achieve almost the same performance as an OoO core using in-order cores at a fraction of the power. In order to achieve such a compromise, the architecture monitors the number of active threads in a workload, reconfiguring from one core type to the other when a certain thread threshold is reached. Specifically, when the threshold is surpassed, the core runs as an in-order SMT; when only a single thread is active, the core runs as an aggressive OoO engine. Mode switching comes at the cost of draining the pipeline, spilling and refilling the architectural registers with the active thread(s)’ data, and shutting down the required pipeline structures when switching to in-order SMT mode. Consequently, depending on the overhead of the workload, such switches may be fairly expensive.

Due to MorphCore’s reconfigurable properties, the core cannot run at the same frequencies as its conventional core equivalent, i.e. overheads from switching and the additional pipeline logic required to reconfigure the structures add to critical path latencies. Therefore the core can only achieve performance in close range of throughput-optimized cores, while still experiencing the thread blocking and sharing issues of a conventional CPU. MorphCore is, however, able to provide improvements in energy efficiency.

CoreFusion [57] consists of multiple identical two-issue OoO cores, where a bus is used to connect the sets and their respective L1 I- and D-caches to provide coherence (all cores sharing the L2 cache). When required, each core may execute independently for high thread-level performance, or be fused in groups of 2-4 cores to form larger, more aggressive cores for wide-issue execution and efficient single-thread performance. Additional hardware logic is necessary to coordinate the cores, however, contributing to additional energy requirements and limited performance. Furthermore, larger core configurations have lower performance and higher energy consumption in comparison to a conventional OoO core due to the latencies incurred in mode/context switching, also requiring pipeline flushes and data migration as in the case of MorphCore. Experimental testing reveals that MorphCore is able to exceed the performance and energy efficiency of CoreFusion, however.


The ReMap [63] and ReKonf [64] architectures also address reconfigurability in chip multi-core processor systems in alternate ways. Specifically, the ReMap architecture provides a solution for reconfigurable fabrics on heterogeneous multi-core systems, where the fabric is able to build and collapse on-chip communication buses for customized data transfers between cores according to an application’s requirements. ReMap therefore provides more flexibility in comparison to hard-wired buses to overcome the contention limitations of general-purpose multi-processors. The ReKonf architecture, on the other hand, reconfigures the entire multi-core processor system environment, activating a certain multi-core setup (i.e. x CPUs, y caches) depending on an application’s requirements, monitored dynamically by the system which detects such operating thresholds.

Comparison to ConSSTEP

The processors and CGRAs presented in this section reconfigure to match and/or accelerate performance, with impositions that come at the hardware and latency cost of reconfigurability. Conversely, ConSSTEP provides reconfigurable properties without the overhead and performance losses exhibited in the majority of these works; in fact, ConSSTEP is able to achieve greater operational frequencies than conventional cores (see Experimental Results). Specifically, architectures such as MorphCore and CoreFusion require complete pipeline flushes and context saves in order to transition to different modes, which ultimately contribute to performance loss. ConSSTEP, on the other hand, is able to conceal the effect of configuration overhead using a double configuration register approach, while only suffering configuration overhead during branch mispredictions and/or exceptional events. In addition, although MorphCore, CoreFusion, ReKonf, and ReMap are able to reconfigure to different core architectures, these architectures still face the same impositions as conventional processor cores. ConSSTEP instead applies the concept of reconfigurability to directly address the impositions of a conventional processor with a diverse design for increased ILP and TLP.

In comparison to DySER and other CGRAs, ConSSTEP is able to extract dataflow not only from kernels but also from the general instruction stream, while integrating reconfigurable properties within the datapath itself. ConSSTEP conceals setup latencies and interacts with memory in the same manner as conventional CPUs, providing the performance gains and energy efficiency sought by CGRAs. Furthermore, the core mitigates the need for any additional logic complexities, while avoiding alterations to the compiler and ISA, with full software support.

4.5 Other Architecture Models

4.5.1 Stream Processors

Stream processors (SPs) [65, 66] rely on an abundance of application parallelism and are optimized for the stream/throughput-oriented execution model. SPs aim to exploit multiple levels of data locality and the predictability of data accesses in order to accelerate the throughput-oriented application domain. Such processors contain a large pool of ALUs and a two-level register file hierarchy to expose the deep storage properties and data locality typical of throughput-oriented applications. In order to reduce the hardware complexity of such a register file organization, SPs invoke a VLIW-like underlying architecture, redirecting result operands to their appropriate destinations and register file levels.

ConSSTEP’s goal, conversely, is to improve both latency- and throughput-oriented applications by limiting the disadvantages of SMT CPUs. ConSSTEP does not require an abundance of ALUs, and maintains compatibility with conventional compilers and ISAs. The ILP and fine-grained threading limitations of SPs are mitigated by invoking a PhysC with scalable engine structures and the rS interconnect, which relieves the need for register file hierarchies. The PhysC also eliminates the need for a static VLIW ISA, which would contribute to further software compatibility issues.

4.5.2 Transport Triggered Architectures

Transport Triggered Architectures (TTAs) [8, 67] are a superclass of traditional VLIW cores. As opposed to solely extracting operation parallelism as in the case of VLIWs, a TTA-based compiler also exploits parallelism at the data transport level. Similar to the objective of BRAID, TTAs exploit the fact that many register values have short lifetimes, rendering much of the register file bandwidth redundant. TTAs therefore opt to completely redesign the datapath to provide data transport and issue-width scalability.

TTAs consist of several FUs and a transport network comprising various transport buses. Each FU in the TTA is connected to the transport network using one or more bus sockets. FUs are triggered to execute when the appropriate data has been moved to their FU-trigger registers (similar to dataflow tokens), with the result stored to an FU-result register. Operands may be obtained from FU-operand registers, FU-result registers, or the general purpose register file. Thus the compiler must maintain these data locality details while scheduling instructions. In certain cases, the transport layer for a single instruction may require several moves to/from the various register types (i.e. result register to operand register, etc.). Consequently, for certain applications TTAs incur high register-to-register movement, and the compiler must integrate optimization techniques to avoid dead instructions and common operands in order to decrease this movement.

The compiler compatibility of TTAs raises issues for general purpose computing for reasons similar to VLIWs. Hence most realizations of TTAs have been employed for specific applications [67, 68]. The compiler design is also fairly complex, requiring customization depending on the transport buses, clustered connections, etc. Conversely, ConSSTEP aims to exploit both operation parallelism and data transport for general purpose applications while considering engine compatibility. The rS interconnect mitigates the TTA’s abundance of move instructions using the configuration data generated by the PhysC, where data is only moved if required by another instruction. Therefore ConSSTEP presents a scalable interconnect and a feasible solution for exploiting value lifetime in general purpose applications, as opposed to TTAs, which may only execute a fixed set of applications and must manually optimize customized interconnects according to an application’s communication characteristics.

4.6 Summary

Several works on unconventional processor models were discussed in this chapter, comparing and contrasting their methods with the approach invoked by ConSSTEP. All models display various advantages, yet also certain disadvantages which ConSSTEP addresses. Accordingly, this thesis introduces details of the ConSSTEP architecture in the next chapter, with the two-level compilation specifications discussed in the following chapter.

Chapter 5

Architecture

5.1 Introduction

This chapter outlines the architectural and microarchitectural details of ConSSTEP. Each structure in the architecture is thoroughly described, including the rS units, Read Register Buffer (RRB), Write Register Buffer (WRB), Functional Units (FUs), the external architectural register file (EARF), and the configuration process used for concealing overhead. Floating Point (FP) execution is also discussed, in addition to exception handling and data memory accesses. This chapter is presented prior to the compilation process so that the details of the ConSSTEP architecture and microarchitecture are well understood before the compilation flow is discussed.

5.2 rS Interconnect

The two types of rS units present in an engine’s interconnect are referred to as external and internal. Referring to Fig. 3.2, the external rS are found at the very top of the interconnect, connected directly to the RRB, and consist of four ports - two input and two output. An example of an external rS architecture’s port design is provided in Figure 5.1. These rS act as an interface between the RRB and FU inputs, and connect the bottom rS units to the top of the torus topology. External switches do not require storage properties since they are directly connected to the RRB (which buffers data) and do not receive any FU output. Horizontal propagation is also not supported by external switches due to the xy routing protocol used for the data transport of operands. Therefore data must first be propagated horizontally in the lower levels of the torus network, and then traversed upwards. These factors allow external switches to possess low hardware area and a reduced number of IO ports.

Figure 5.1: External rS Architecture

Internal rS are present in the main communication network (i.e. the 2nd rS row onwards of Fig. 3.2) and consist of six ports - three inputs and three outputs, with an example of an output port shown in Fig. 5.2. As seen in the figure, all internal rS input ports consist of a 2:1 multiplexer that may 1) propagate input data, 2) output a stored value, or 3) simultaneously store and propagate a value. An output port multiplexes all input ports, and therefore all rS output ports may propagate data simultaneously to neighbouring rS, where all control bits are determined by the PhysC.

Figure 5.2: Internal rS Architecture
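For illustration, the following minimal C++ sketch models the three per-cycle behaviours of a single internal rS input/output port pair described above (propagate, emit a stored value, or store and propagate). The type and function names (PortMode, RsPort, cycle) are invented for this example and do not correspond to the actual RTL.

```cpp
// Behavioural sketch of one internal rS port; names are illustrative only.
#include <cstdint>
#include <iostream>

enum class PortMode : uint8_t { Propagate, EmitStored, StoreAndPropagate };

struct RsPort {
    uint64_t latch = 0;                       // temporary storage latch

    // One clock cycle of the port: 'in' is the incoming operand and
    // 'mode' comes from the configuration register written by the PhysC.
    uint64_t cycle(uint64_t in, PortMode mode) {
        switch (mode) {
            case PortMode::Propagate:         return in;      // pass through
            case PortMode::EmitStored:        return latch;   // output held value
            case PortMode::StoreAndPropagate:
                latch = in;                                    // keep a copy
                return in;                                     // and forward it
        }
        return in;
    }
};

int main() {
    RsPort p;
    std::cout << p.cycle(7, PortMode::StoreAndPropagate) << "\n"; // 7, latched
    std::cout << p.cycle(9, PortMode::EmitStored) << "\n";        // 7, from latch
}
```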

The rS interconnect was designed to support single-cycle multi-hop traversal, signifying that any result may be sent to any FU source input (including its own) within one clock cycle. To support this feature, the frequency of operation is limited by an engine’s critical path. As each ConSSTEP engine may vary in topological size, its respective maximum frequency also varies. As will be further discussed in Chapter 7.3.2 however, wire delays only slightly increase as FUs are added to an engine. This is possible due to the torus topology which the rS interconnect adopts, stemming from the principles of Network-on-Chip (NoC) architectures.

The objective of NoCs is to replace lengthy buses in computing systems with short links and switches/routers that provide direct data communication between producers and consumers, which is especially beneficial for latency-critical applications. Specifically, the switches are responsible for directing data throughout the network either statically (according to a packet’s embedded header data) or dynamically (through Look-Up-Tables and algorithms in the switch). Conversely, the rS units invoke much simpler configurable logic which dictates the interconnect topology per clock cycle, avoiding the static or dynamic routing overhead which NoCs incur. The link lengths between the switches for both NoCs and ConSSTEP are short, and so their delays remain constant. Such links also allow for lower power and faster performance due to reduced contention per transfer, decreased wirelength, and lower bus traffic when directly compared to conventional bus networks [69].

By using a torus topology, performance of the rS interconnect is further enhanced by allowing opposite edges of the interconnect to be directly connected, effectively reducing contention and the number of rS units traversed between data dependencies. Such an advantage however comes at the complexity and cost of physically implementing two different wire lengths [69], i.e. managing the RC delays of the main short links and the longer outer grid links. However, since the engine size is restricted to provide single-cycle interconnect latency, these disadvantages do not affect ConSSTEP as they are directly considered within the design. Performance is also enhanced with rS units as data travels, in the worst case, a critical path length within one clock cycle which, as verified in Chapter 7.3.1, is on the order of picoseconds. On the contrary, conventional NoCs must make routing decisions every clock cycle and require pipelining per switch, which takes several cycles to deliver data in comparison to an rS interconnect. Thus the rS design, its single-cycle delay, and configurable topology allow for enhanced performance, which is especially important in the latency-sensitive operations required in processor design.
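The effect of the wrap-around links can be illustrated with a small hop-count comparison under xy routing; the grid size and coordinates below are made-up examples, not parameters of the evaluated engines.

```cpp
// Illustrative hop-count comparison for xy routing on a mesh versus a torus
// of the same dimensions.
#include <algorithm>
#include <cstdlib>
#include <iostream>

int meshHops(int ax, int ay, int bx, int by) {
    return std::abs(ax - bx) + std::abs(ay - by);           // plain xy routing
}

int torusHops(int ax, int ay, int bx, int by, int w, int h) {
    int dx = std::abs(ax - bx), dy = std::abs(ay - by);
    dx = std::min(dx, w - dx);                               // wrap-around link
    dy = std::min(dy, h - dy);
    return dx + dy;
}

int main() {
    // Opposite corners of a hypothetical 4x4 rS grid.
    std::cout << "mesh : " << meshHops(0, 0, 3, 3) << " hops\n";         // 6
    std::cout << "torus: " << torusHops(0, 0, 3, 3, 4, 4) << " hops\n";  // 2
}
```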


5.3 Read Register Buffer (RRB)

Figure 5.3: Internal RRB Architecture

The Read Register Buffer (RRB) is responsible for managing external data input and communicating such data to the engine’s components as needed during execution. An example of an RRB’s architecture is provided in Fig. 5.3. In general, RRBs manage:

• External register file reads and loads during configuration, communicating these values to the required rS during execution.

• Internal loads (requested by the WRB during execution), communicating these loaded values to the required rS.

• The communication of immediate values to the required rS, as configured by the scheduler/bundle data.

• Incoming external unit results, which are communicated as programmed to the correct rS.

Similar to the rS interconnect, the RRB is also provided with configuration data so that the multiplexers which connect the RRB to the rS interconnect may communicate data to each of the four external rS inputs of the engine as required.

5.4 Write Register Buffer (WRB)

The WRB’s main objectives are to manage external instructions issued by the engine, and to communicate the data to their respective external units. Although the WRB is not as complicated as the RRB design, it must properly interface with external units and manage concurrent requests made by the engine. For this reason, a queue is included in the WRB to manage pending requests, although it is infrequently required due to the scheduling techniques provided by the PhysC. The WRB must also handle the request and grant signals to external units shared between multiple engines. Based on these signals, if the RRB has not received a result after a specific time (e.g. four clock cycles for a load), the engine will stall until the pending data is received. Finally, the WRB manages the writeback of its final bundle values to the external register file as executed and configured.
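A minimal sketch of the pending-request bookkeeping and stall decision described above is given below; the 4-cycle load latency follows the text, while the structure and function names (PendingReq, Wrb, mustStall) are assumptions made for the example.

```cpp
// Rough sketch of WRB pending-request tracking and the resulting stall check.
#include <cstdint>
#include <deque>
#include <iostream>

struct PendingReq {
    uint32_t id;           // e.g. destination rS/FU awaiting the result
    int      issuedCycle;  // cycle the request was handed to the external unit
    int      latency;      // expected latency (4 c.c. for a load in this sketch)
};

struct Wrb {
    std::deque<PendingReq> pending;

    void issue(uint32_t id, int now, int latency) {
        pending.push_back({id, now, latency});
    }
    // The engine stalls when an outstanding request has exceeded its expected
    // latency and the corresponding result has not yet arrived at the RRB.
    bool mustStall(int now) const {
        for (const auto& r : pending)
            if (now - r.issuedCycle >= r.latency) return true;
        return false;
    }
    void resultArrived(uint32_t id) {
        for (auto it = pending.begin(); it != pending.end(); ++it)
            if (it->id == id) { pending.erase(it); return; }
    }
};

int main() {
    Wrb wrb;
    wrb.issue(/*id=*/3, /*now=*/0, /*latency=*/4);                 // a load
    std::cout << wrb.mustStall(3) << " " << wrb.mustStall(4) << "\n"; // 0 1
    wrb.resultArrived(3);
    std::cout << wrb.mustStall(5) << "\n";                          // 0
}
```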

5.5 Functional Units

FUs present within a ConSSTEP engine may be of any type, including (but not limited to) ALUs (with barrel shifters), complex integer units (including multiplier/divider), and Multiplier-Accumulator (MAC) units, all of which are generically referred to as FUs. A low complexity conditional branch unit is also included to execute each bundle’s final branch instruction. To support conditional branch execution, an engine contains a flag status register which is updated as instructions execute.

An example of the general FU architecture used by ConSSTEP is presented in Fig. 5.4. The configuration setup is similar to the rS, however the functionality of each FU/ALU is not programmable. Rather, the functionality of each FU is determined a priori, while the operation which the FU executes per cycle is determined by the PhysC. Once executed, the result is latched and thereafter propagated and/or stored as required in the next clock cycle. In the case of FP FUs or other more complex operations, such units and their respective data must be pipelined; however the overall architecture remains the same.

External to each engine are dedicated load/store unit(s) (LSUs). This unit receives load requests made by the engine’s WRB, fetches the cache data, and outputs the received data to the RRB, which in turn propagates the value to the requesting FU/rS. A load therefore takes approximately four clock cycles to execute, as is standard in the majority of conventional processors. The LSU also executes and sends stores to the cache, where the Load/Store Queue (LSQ) and Miss Status Holding Register (MSHR) are responsible for handling any misses and memory dependencies.


Figure 5.4: Internal FU Architecture (ALU example)

Any external units shared between engines are interfaced in a similar manner to the LSU. Such external units may include complex integer, multiplication, and division units, etc., depending on the architecture and the frequency of such operations in general purpose applications. A separate engine is also dedicated to FP instruction execution, where FP and integer data are forwarded to their complementary engines through the RRB and WRB (further details of FP execution are provided in Section 5.7.2). Accordingly, the PhysC relies on an exposed pipeline to plan for the execution and transport of the various instruction types and operations.

5.6 External Register File

A small architectural register file is placed in the ConSSTEP backend per engine, referred to as the external architectural register file (EARF). As mentioned previously, the PhysC is responsible for renaming, and hence a large physical RF and its associated renaming logic are not required by the processor. The external RF is therefore an architectural RF with the basic registers required by the ISA. The EARF is accessed by an engine prior to bundle execution to read the bundle’s input values, which are used during execution; final register values are written back to the EARF as they are produced within the engine, so that subsequent bundles may read them as input data. Since ConSSTEP directly addresses the single-thread performance of multi-threaded processors, external register files are replicated, i.e. one EARF per engine (per thread context), where data is shared between threads using the shared L1 cache and lower level memory hierarchies.


5.7 Configuration & Setup

Each engine possesses its own configuration memory, which is banked into several partitions to quickly configure engine components as shown in Figure 3.2. Specifically, separate memories are provided for the RRB, the WRB, each row of ALUs, and two dedicated banks for every row of rS, allowing a maximum configuration time of 10 clock cycles (c.c.) per engine (64-bit datawidth). The external register read and load configuration data also possess individual memory banks to obtain their values. For more demanding workloads, a multi-level cache system may be invoked by the configuration memory. In this work however, we assume the configuration memory footprint resides in the first level cache, with multi-level caches remaining future work. Experimental testing shows that the average external register reads and writes required per bundle are approximately 4.8 and 3.4 respectively (considering a dual-port register file per engine), approximately 2 reads per FP engine, and an average of one to a maximum of two loads per engine. Hence such requirements meet the 10 c.c./engine configuration time, where instructions exceeding these specifications are mapped to the next bundle (pending dependency analysis).

5.7.1 Setup Mitigation Techniques

As setup latencies may cause a significant amount of overhead in a reconfigurable architecture, two approaches to reducing setup overhead are investigated for the ConSSTEP architecture - 1) a software approach and 2) a hardware approach. Note that the hardware approach is the one assumed in the architectural explanations provided in the previous and subsequent chapters.

Software

In the software approach, a variable configuration time technique is employed by the PhysC to reduce setup overhead. As applications and thread workloads vary, the average configuration data required for each bundle’s execution also varies. Consequently, not all applications and execution engines require the same number of clock cycles during the reconfiguration process (i.e. less than or equal to 10 c.c.). The concept of variable reconfiguration therefore allows the PhysC to monitor the configuration time an application requires per engine during compilation.

To support such a feature, the PhysC must monitor the maximum configuration time required by each engine component across all application bundles (per thread) during compilation. Using this technique, the PhysC’s Config Data Generation step (see Chapter 6) is adjusted to keep count of the maximum configuration time required by all engine components. The PhysC then generates and sends a 4-bit value, representing x cycles, to each engine (once per application thread), dictating that configuration data is read from the banked configuration memories for x clock cycles (versus a static 10 clock cycles per banked memory). Accordingly, an engine must possess a 4-bit register in this case.
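The compile-time bookkeeping behind the variable-configuration-time technique can be sketched as follows; the per-component cycle counts and data-structure names are invented for the example, and only the 10-cycle cap and the 4-bit result follow the text.

```cpp
// Sketch: track, over every bundle of a thread, the worst-case configuration
// cycles any engine component needs, then emit one 4-bit count per engine.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

struct BundleConfigCost {
    std::vector<int> cyclesPerComponent;  // RRB, WRB, rS rows, FU rows, ...
};

uint8_t maxConfigCycles(const std::vector<BundleConfigCost>& bundles) {
    int worst = 0;
    for (const auto& b : bundles)
        for (int c : b.cyclesPerComponent) worst = std::max(worst, c);
    return static_cast<uint8_t>(std::min(worst, 10));  // capped at 10 c.c.
}

int main() {
    std::vector<BundleConfigCost> thread = {
        {{3, 2, 4, 3}}, {{5, 2, 6, 4}}, {{2, 1, 3, 2}}
    };
    // This value replaces the static 10-cycle configuration window.
    std::cout << int(maxConfigCycles(thread)) << " cycles\n";  // 6
}
```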

Hardware

To provide ConSSTEP with a performance gain and decrease configuration overhead during execution and/or branch prediction, every engine component in the hardware approach contains two sets of configuration registers, shown again in Figure 5.5 - one set which dictates control for the currently executing bundle, and a second set which is configured concurrently for the next (predicted) bundle. Since configuration memory is distributed and each engine component contains its own dedicated configuration memory, the ability to concurrently execute and configure the engine for the next bundle is feasible. Thus, once the current bundle has finished executing, the next predicted bundle has already been configured and may read any external data values, executing while the branch outcome is sent and verified by the scheduler’s predictor.

Figure 5.5: rS Unit - Double Configuration Register Setup

If the bundle is correctly predicted, the pre-configured bundle continues to execute while the next predicted bundle is sent to the engine by the scheduler, sufficiently concealing configuration overhead. If the prediction is incorrect however, the engine halts its execution (or memory read setup) and reconfigures for the correct bundle while updating the predictor’s branch history. Therefore ConSSTEP suffers a penalty on branch mispredictions only. Performance is also greatly enhanced in multi-threaded workloads, as several engines execute concurrently in ConSSTEP, where each engine configures and executes in a concealed manner.
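The double configuration register idea is essentially double buffering: one set drives the executing bundle while the other is filled in the background for the predicted next bundle, and a correct prediction costs only a selector flip. The following sketch illustrates the principle; field and function names are assumptions, not the actual control encoding.

```cpp
// Minimal double-buffered configuration-register sketch for one component.
#include <cstdint>
#include <iostream>
#include <vector>

struct ConfigSet {
    uint32_t bundleId = 0;
    std::vector<uint64_t> words;   // per-component control words
};

struct ComponentConfig {
    ConfigSet set[2];
    int active = 0;                // 'exe set' selector bit

    // Fill the inactive set in the background while the active one executes.
    void preload(uint32_t predictedBundle, std::vector<uint64_t> words) {
        set[1 - active] = {predictedBundle, std::move(words)};
    }
    // Correct prediction: flip the selector, no configuration stall.
    void commitAndSwap() { active = 1 - active; }

    uint32_t executingBundle() const { return set[active].bundleId; }
    uint32_t shadowBundle()    const { return set[1 - active].bundleId; }
};

int main() {
    ComponentConfig c;
    c.preload(42, {0xA, 0xB});     // scheduler's predicted next bundle
    c.commitAndSwap();             // current bundle done, prediction correct
    std::cout << c.executingBundle() << "\n";   // 42
}
```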

5.7.2 Floating Point (FP) Execution

Figure 5.6: FP Engine Architecture

An FP engine is placed adjacent to its integer (INT) engine equivalent as seen in Fig. 5.6, where the integer unit communicates values through its WRB to the FP RRB. Considering the frequency of FP instructions, their expensive hardware implementation, and the fact that FP FUs are pipelined, invoking a 2-FU FP engine was determined to be the favourable compromise between energy, performance, and PhysC complexity. To keep the FP engine simple, rS units are also eliminated; the WRB is responsible for controlling dataflow, with values temporarily held in the FP register file. Accordingly, the FP WRB is responsible for forwarding data to the RRB, the FP register file, and/or back to the integer RRB depending on the instruction dependencies and configuration data provided. In the case where the application and/or thread assigned to the engine does not require FP support, the FP engine is disabled, as determined by the PhysC.

The INT to FP engine dataflow exploits the concept of an exposed pipeline to route its data and execute operations. Hence, the pipeline latencies and underlying architectures of the FP FUs must be known by the PhysC to form and map bundle dataflows. FP FUs therefore possess the same pipeline latencies across all engines, while supporting the same functions as the FUs of a conventional CPU.

5.7.3 Branch Prediction & Loop Acceleration

As previously discussed, the concept of branch prediction in ConSSTEP allows engines to start the next bundle’s configuration process prior to explicitly resolving the current bundle’s branch. Therefore, by implementing two configuration register sets per component, ConSSTEP is able to mitigate setup overhead and increase system performance. Referring to Figure 3.2 (Step 4a) however, when an incorrect prediction is encountered, engine configuration must halt while the correct bundle ID is sent by the scheduler for engine reconfiguration. Since an engine contains two configuration register sets per component, it is possible that misprediction penalties may be further reduced by consulting the alternate register set (i.e. the register set which has just completed execution).

As previously discussed, superscalar OoO processors may achieve loop acceleration by using dedicated pipeline units to monitor loops. These units may be duplicated per thread and/or shared in SMT models. Conversely, processors with exposed pipelines, such as VLIWs and EPIC, use software pipelining techniques to detect loops statically in the code, while all dynamic and static models use loop unrolling compiler techniques to speed up execution. As opposed to invoking dedicated loop units or compiler techniques, ConSSTEP relies on its engine structures and branch predictor for loop acceleration. Specifically, since branch predictors may also detect the presence of loops within an instruction stream, the correct bundle may already be present in the alternate set of configuration registers.

The latency incurred for a correct branch prediction in ConSSTEP is simply that of reading register values from the external register file and commencing bundle execution. However, in the cold case of loop detection (especially considering a bimodal predictor), the predictor will likely make an incorrect prediction. To mitigate such cold cases, upon a misprediction the scheduler also compares the bundle ID which has just completed execution to the ID of the correct bundle. In the case of a match, the misprediction penalty is mitigated at the expense of two additional clock cycles, which allows the scheduler to send a control code to the engine (i.e. the exe set bit shown in Figure 5.5) so that it may retain the last bundle and read new input register values. Thereafter, the branch history is corrected and the next bundle prediction is sent to the alternate set of configuration registers for configuration.

In the case that the previously executed bundle and predicted bundle are both incorrect, the engine configuration registers are cleared and reconfigured, enduring a 10 c.c. penalty, while other threads concurrently execute in their respective assigned engines. Consequently, in the case of smaller bundles, it is possible that values may have been written to the external register file during execution of the mispredicted bundle. In this case, a flash copy of the last committed bundle’s register file is maintained. Accordingly, upon a misprediction, the flash copy EARF becomes the active external register file while the other copy is restored as the engine is reconfigured for the new bundle. This latency penalty however is minor when considering the flushing and refilling of a conventional pipeline, and the general performance gains attained by ConSSTEP.
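The flash-copy EARF can be viewed as a two-copy register file where commit and rollback are pointer operations rather than pipeline flushes. The sketch below illustrates the idea only; the register count and member names are assumptions for the example.

```cpp
// Sketch of a two-copy EARF: speculative writes go to a working copy, commit
// updates the shadow, and a misprediction rolls back by switching copies.
#include <array>
#include <cstdint>
#include <iostream>

struct Earf {
    std::array<uint64_t, 16> copy[2] = {};   // two architectural copies
    int working = 0;                          // copy the engine writes into

    void write(int reg, uint64_t v) { copy[working][reg] = v; }
    uint64_t read(int reg) const    { return copy[working][reg]; }

    // Bundle verified correct: the working copy becomes the committed state.
    void commit()   { copy[1 - working] = copy[working]; }
    // Misprediction: switch to the last committed copy and refresh the other.
    void rollback() { working = 1 - working; copy[1 - working] = copy[working]; }
};

int main() {
    Earf rf;
    rf.write(3, 111); rf.commit();     // committed state: r3 = 111
    rf.write(3, 999);                  // speculative write by a wrong bundle
    rf.rollback();
    std::cout << rf.read(3) << "\n";   // 111
}
```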

Finally, ConSSTEP also avoids the need for a ROB by dedicating one engine per thread, where the PhysC ensures instruction dependency requirements and memory ordering, stalling in the case of an unexpected event (i.e. cache miss, etc.). Accordingly, once a bundle’s branch outcome is verified as correct, the next bundle’s instructions are also guaranteed to be correct, avoiding the need to maintain explicit fine-grained instruction program order through hardware structures.

5.8 Exception Handling

As discussed in Chapter 2.4.2, exception handling in conventional processors requires the pipeline to save state prior to the exception, flush subsequent instructions from the pipeline, execute the handler, and restore the saved state. Since ConSSTEP takes an alternate approach to the processor’s architecture, which must also handle exceptions, bundles must also be created and distributed for exception handlers. Accordingly, when an exception is raised in the ConSSTEP architecture, the scheduler and offending engine interact, where a control code and “handler bundle” ID are sent by the scheduler. The engine in this case “pauses” execution at the offending instruction’s cycle. If the exception was due to a Translation Lookaside Buffer (TLB) miss or a similar case where the fault is generated external to the engine, the engine which raised or issued the exception is paused.


Since ConSSTEP possesses two sets of configuration registers per component, the processor may forgo the need to save state by using this “pausing” technique on the exception cycle, while configuring the alternate set to handle the exception. Since the engine is not configured for the exception handler, it incurs a standard latency penalty of 10 c.c. The scheduler must also save the alternate set’s bundle ID which is overwritten by the handler, so that the bundle may be restored once the handler has finished executing. Such a technique allows ConSSTEP to implement precise exception handling with less complexity in comparison to conventional processors, where ConSSTEP suffers a configuration cost with the advantage of avoiding pipeline flushing and state restoration. Likewise, other threads may concurrently continue to execute in their respective engines, where single-thread performance remains unaffected.

Since temporary data may be present in the rS latches when an exception is raised, handler bundles are generated without the ability to temporarily store values in rS units. Handler bundles are therefore formed under the assumption that the rS units are only used for data propagation, where the EARF is responsible for any operand writes and/or reads. Under such an assumption, it is also possible that the handler may require value(s) which are temporarily latched and not written back to the EARF at the time an exception is raised. To accommodate such cases, the datawidth of the interconnect wires is widened by 5 bits, where the most significant bit signifies data validity, and the next 4 bits represent the value’s writeback register should an exception occur. Thus during a handler’s configuration process, the engine simultaneously performs a writeback phase which allows each individual row to write out (valid) temporary values to the EARF per cycle (i.e. the bottom rS latches in the first cycle, the last row of ALUs in the 2nd cycle, the 2nd last row of rS latches in the next cycle, etc.), where input values required for the handler are read after the configuration process.
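The widened interconnect word can be pictured as the data payload plus a validity bit and a 4-bit architectural destination used only during the exception writeback phase. The exact bit layout below is an assumption made for illustration, not the implemented encoding.

```cpp
// Sketch of the widened interconnect word used to drain latched temporaries
// to the EARF when an exception is raised.
#include <cstdint>
#include <iostream>

struct WideWord {
    uint64_t data;
    uint8_t  wbReg : 4;   // architectural register to write on an exception
    uint8_t  valid : 1;   // is the latched value live?
};

int main() {
    WideWord w{0xDEADBEEF, /*wbReg=*/5, /*valid=*/1};
    if (w.valid)   // during the handler's configuration, drain to the EARF
        std::cout << "write r" << int(w.wbReg) << " = 0x"
                  << std::hex << w.data << "\n";
}
```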

Once the handler has completed execution, the engine sends a completion signal to the scheduler. When the system is verified as stable, the scheduler sends a signal back to the paused engine to continue execution, where the configuration register set which handled the exception is also reconfigured to its previously predicted bundle, which is re-sent by the scheduler. This technique of engine “pausing” is also used to maintain synchronization between thread barriers, monitored by the scheduler.


Note that the concept of exception handling presented here may also be applied to interrupt handling if required.

5.9 Scheduler

The scheduler plays an essential role in the ConSSTEP core. Specifically, it is responsible for dynamically monitoring engine execution (i.e. n threads simultaneously executing) and acting according to each engine’s state. Branch prediction in ConSSTEP assumes identical functionality to conventional processors. Therefore, although engines execute bundles created statically by the PhysC, the branch outcome must be resolved dynamically by the scheduler to verify whether the taken/not-taken branch outcomes per thread were predicted correctly. Hence the scheduler (per thread) must 1) implement branch prediction, 2) determine the next bundle ID to send to the engine based on the taken/not-taken prediction for engine pre-configuration, and 3) verify branch prediction outcomes. It must also dynamically monitor and handle any exceptions raised across the thread workloads.

In the cold case of scheduling, an initial bundle ID is generated by the PhysC and provided to the scheduler (per thread) to commence execution. Since all bundles are terminated on conditional (or indirect unconditional) branches and an aggressive pipeline is used, branch predictions must occur on the fly. The scheduler therefore possesses a branch predictor per thread which predicts the outcome of the executing bundle’s branch instruction, where the predicted bundle ID is sent to the engine for pre-configuration. In order to distinguish bundle IDs based on taken or not-taken branch outcomes, each thread in the scheduler possesses a 2-way direct-mapped cache so that the current bundle ID may be indexed and its respective taken/not-taken (predicted) bundle ID may be obtained.

The scheduler must also monitor raised flags. In particular, during regular execution the scheduler must monitor branch flags sent back by the engine indicating bundle completion and the branch outcome (0 = not taken, 1 = taken). Thus, if predicted correctly, the scheduler’s branch predictor uses its history to predict the next bundle’s outcome, indexing and sending out the next predicted bundle ID for the engine’s pre-configuration as the correctly predicted bundle continues to execute. However, if a misprediction occurs, the scheduler must 1) send out a stop signal to the engine to halt its execution, 2) verify that the misprediction is not a loop, 3) correct its branch history given the misprediction, and 4) send out the correct bundle ID if required.

If the prediction was in fact a loop, this signifies that the correct bundle currently resides in the configuration set which has just finished executing. The scheduler therefore sends out a “loop” signal to the engine so that new external values may be read into the RRB and the multiplexer bit in the engine’s structures is redirected to execute from the previous configuration register set. Once these steps have completed and the branch predictor’s history is corrected, the newly predicted bundle may be sent for pre-configuration as the loop executes. In the case of a total misprediction (i.e. both configuration register sets are incorrect), the scheduler must send a “clear” signal to the offending engine so that its contents may be cleared, where the correct bundle ID is then sent to be configured (undergoing a 10 clock cycle configuration penalty in total). Once the branch history has been corrected, a prediction is made on the next bundle to be executed, which is sent out for pre-configuration as the corrected bundle executes.
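The per-thread prediction state can be sketched as a small direct-mapped table holding each bundle’s taken and not-taken successor IDs, paired with a 2-bit bimodal counter that selects which successor to pre-configure. Table size, hashing, and the counter policy below are assumptions made for the example.

```cpp
// Sketch of the per-thread bundle-target lookup with a bimodal choice.
#include <array>
#include <cstdint>
#include <iostream>

struct TargetEntry { uint32_t tag = 0; uint32_t taken = 0, notTaken = 0; bool valid = false; };

struct BundlePredictor {
    std::array<TargetEntry, 64> table;      // direct-mapped on bundle ID
    std::array<uint8_t, 64>     bimodal{};  // 2-bit saturating counters

    void install(uint32_t id, uint32_t takenId, uint32_t notTakenId) {
        table[id % 64] = {id, takenId, notTakenId, true};
    }
    // Predict the next bundle ID to send for pre-configuration.
    uint32_t predict(uint32_t id) const {
        const auto& e = table[id % 64];
        if (!e.valid || e.tag != id) return 0;                // cold: no target yet
        return (bimodal[id % 64] >= 2) ? e.taken : e.notTaken;
    }
    // Resolve the real outcome and train the counter (history correction).
    void update(uint32_t id, bool taken) {
        uint8_t& c = bimodal[id % 64];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};

int main() {
    BundlePredictor p;
    p.install(/*id=*/7, /*taken=*/12, /*notTaken=*/8);
    std::cout << p.predict(7) << "\n";   // 8 (counter initially not-taken)
    p.update(7, true); p.update(7, true);
    std::cout << p.predict(7) << "\n";   // 12
}
```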

Finally, the scheduler must also monitor any exceptions raised, considering a general purpose processing environment. In the case of an exception, the scheduler sends out the “exception” flag, thereafter indexing the respective exception vector and ID to be sent to the engine for handling the event. The scheduler then waits for the engine’s completion signal to verify that the exception was handled and a normal processing state has resumed. The scheduler may then send a signal back to the engine to continue from the paused cycle in the alternate configuration register set.

Although the scheduler is not very complex in design, it is of a unified nature and therefore must monitor n threads simultaneously, updating/verifying each engine’s respective predictions, exceptions, etc. Its hardware must also integrate direct-mapped caches for indexing taken/not-taken bundle predictions per thread. Thus it is expected that a unified scheduler approach will require an increase in logical complexity as threads scale in ConSSTEP, especially in terms of IO ports and storage requirements. Hence the design and implementation of a distributed scheduling unit remains future work for ConSSTEP scalability.


Figure 5.7: ConSSTEP Cache System (to match conventional CPU model)

5.10 Data Memory Accesses

As ConSSTEP may support up to n engines operating concurrently, the data cache may also be required (in the worst case) to access up to n words, possibly belonging to n unique cache lines and/or virtual pages simultaneously. Consequently, the cost of adding n ports to the data cache may not be feasible in terms of energy and/or latency per memory instruction, depending on the number of engines per core. Therefore, ConSSTEP employs a banked data cache, where banking is performed at the cache line granularity, similar to conventional SMT models. Although engines are configured to access different banks, they may also access the same element for certain (and/or atomic) data operations. In the case of a bank conflict, the memory controller randomly selects which thread may access the cache, while the other thread/engine data is buffered in the crossbar.

ConSSTEP employs a conjoined system between LSUs and the engines so that the varying memory demands of the threads may be managed fairly, as shown in Fig. 5.7. For instance, if one engine requires two memory requests while the other engines make no requests, such a memory system may sufficiently tolerate these varying demands. Specifically, in a 2-Thread (2T) ConSSTEP configuration, two LSUs are assigned to each engine, whereas in the 4-Threaded (4T) configurations one LSU is assigned per engine and is shared between engines, as seen in Fig. 5.7. Signals sent to the shared LSUs are managed by the WRBs and RRBs, where the LSQ buffers pending memory requests and manages cache accesses. If two engines contend for a shared LSU, one engine will be stalled until the load/store instruction is resolved in the alternate engine (using a round-robin algorithm).
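Line-granular banking and the random conflict arbitration described above can be illustrated with the short sketch below; the line size, bank count, and RNG choice are assumptions for the example, not parameters of the evaluated memory system.

```cpp
// Sketch of cache-line-granular bank selection and random conflict arbitration.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

constexpr uint64_t kLineBytes = 64;
constexpr uint64_t kBanks     = 4;

uint64_t bankOf(uint64_t addr) { return (addr / kLineBytes) % kBanks; }

// Returns the engine granted access when several requests map to the same
// bank; the losers are buffered in the crossbar for a later cycle.
size_t arbitrate(const std::vector<uint64_t>& addrs, uint64_t bank, std::mt19937& rng) {
    std::vector<size_t> contenders;
    for (size_t i = 0; i < addrs.size(); ++i)
        if (bankOf(addrs[i]) == bank) contenders.push_back(i);
    std::uniform_int_distribution<size_t> pick(0, contenders.size() - 1);
    return contenders[pick(rng)];
}

int main() {
    std::mt19937 rng(1);
    std::vector<uint64_t> addrs = {0x1000, 0x2000, 0x1040};  // engines 0..2
    std::cout << "bank of 0x1000 = " << bankOf(0x1000) << "\n";
    std::cout << "granted engine: " << arbitrate(addrs, bankOf(0x1000), rng) << "\n";
}
```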

The memory systems for both the conventional processor baselines and ConSSTEP were designed with the same configurations (ports, banks, etc.) to simulate an identical data memory system. This framework allows the architectures to be directly compared without memory-related interference, and allows ConSSTEP to achieve identical memory access latencies. Therefore, any additional LSUs present in the ConSSTEP architecture in comparison to the baseline are multiplexed (similar to Fig. 5.7) to maintain the same memory parameters as the baseline processors. Since the same memory system is used by ConSSTEP, the bottleneck properties typical of conventional memory system hierarchies are not eliminated, and are therefore also endured by the core. Thus the purpose of ConSSTEP is purely to improve the performance of the processor itself, considering the same memory limitations as conventional CPUs.

Chapter 6

Compilation

6.1 Introduction

This chapter outlines the specifications of ConSSTEP’s compilation process, from benchmark input to configuration data output. Accordingly, this chapter discusses the concept of a logical compiler (a term coined in this thesis for a conventional compiler) and target ISAs that are efficient for such a compilation approach. The chapter then outlines the Physical Compilation (PhysC) process introduced in this thesis, and specifications pertaining to thread bundle formation, inter- and intra-bundle dependencies, and the respective mapping algorithms which are tested for performance and energy efficiency in Chapter 8. The chapter then discusses the PhysC’s use of sequence graphs, which are used to map instructions and bundles temporally and spatially throughout the ConSSTEP core. Thereafter, the generation of the configuration data used to execute the thread workloads of a multi-threaded application is discussed.

6.2 Logical Compiler

A logical compiler is a standard compiler used by modern computing systems. Any compiler may be chosen for ConSSTEP, however physical compilers benefit from RISC-like target architectures. Specifically, three-operand architectures (i.e. typical RISC ISAs such as ARM) allow destination registers to be distinct from their source registers, providing more renaming and execution flexibility and, most importantly, fixed-length instruction decoding. Conversely, two-operand architectures (i.e. typical CISC-like ISAs such as x86) use one of their operands as both a source and destination register, and possess variable-length instruction decoding [70]. CISC-like formats therefore impose greater difficulties for extracting register and instruction dependencies, leading to more complex designs for the next compilation phase. Regardless of such issues however, the compilation process may be implemented for any target ISA.

For this work, the ARMv7 ISA was selected as the target architecture for both the physical and logical compiler (other ISAs remaining future work), with a GNU-based gcc compiler. The logical compiler may therefore compile C, C++, and Fortran programs while supporting a variety of programming models, as previously discussed in Chapter 3.

6.3 Physical Compiler (PhysC)

A PhysC is the joint effort between compiler designers and computer architects, used for promoting a smarter compilation process and maintaining compiler, software, and binary compatibility. A PhysC’s main objective is to take a compiled binary and perform macro-processing, generating bundles of microcode logic to execute a given application on an underlying configurable architecture. For this work, input traces are provided to the PhysC, as the focus is primarily on the feasibility of the ConSSTEP architecture and the general compilation flow required to support such an underlying architecture. Consequently, in the case of memory references, disambiguation is handled easily (in addition to other phases such as alias analysis), which allows for a less complex PhysC design. Similar traceless translation processes have also been proposed [44, 48], demonstrating that such a compatible binary translation process is feasible. For this thesis work, the PhysC invokes trace-based input to mimic steady-state behaviour. The overall physical compilation flow is shown in Fig. 6.1.

6.3.1 Bundle Formation

Figure 6.1: ConSSTEP’s PhysC Flow

Bundle formation is the first stage of physical compilation. A bundle is the basic unit identified during physical compilation: a set of inter-related instructions extracted from an instruction stream, terminated once a conditional (or indirect unconditional) branch is encountered. For this work, the Gem5 simulator [71] was used to obtain instruction traces for various multithreaded workloads, where the traces were provided as input to the PhysC. The PhysC then creates bundles based on the provided instruction streams per thread, following thread traces until bundle-terminating branches. The instructions collected prior to a given branch are labelled as part of the same bundle, where unconditional (direct) branches may be eliminated since they are simply required for redirection. Once a conditional branch is reached, the taken and not-taken bundle IDs are generated and saved for inter-dependency analysis. In this work, exception handlers and their respective vector addresses must be directly provided to the PhysC for handler bundle generation.

Since the ARMv7 ISA is employed in this work, conditional instruction suffixes (i.e. branch elimination for smaller If-Then statements) may be generated by the compiler. This is dependent on the level of compiler optimization, and is referred to by ARM as IT blocks (i.e. ITE (If/Then-Else) for one if and one else statement in the code, ITTE for two then statements and one else statement, etc.). The suffix appended to each assembly instruction dictates whether the instruction belongs to the IF or ELSE block. Such suffixes are analyzed by the PhysC and inserted into the appropriate taken or not-taken bundle. This process occurs for all application bundles.

Once a conditional branch is reached, Gem5’s predicted branch route is followed by the simulator, where the PhysC continues to create new bundles with the provided trace. Gem5’s predicted route (i.e. taken or not taken) is stored as metadata to the previous bundle and assumed correct during trace replay in the PhysC. When a bundle/branch misprediction is encountered however, Gem5’s simulation stream is redirected. Hence, in this case of a misprediction, the PhysC must:

• Stop the current bundle’s formation

• Adjust the previous bundle’s prediction to the correct one

• Start bundle formation with the correct route’s bundle.

Bundles which are mispredicted yet executed by ConSSTEP (but not traced by Gem5) are filled with a few dummy instructions that the engine may execute and clear once the correct branch outcome is determined.
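The core of the bundle-formation pass can be sketched as a single walk over a decoded trace: direct unconditional branches are dropped, and a bundle is closed on every conditional (or indirect unconditional) branch. The instruction representation and enum below are assumptions made for the example.

```cpp
// Simplified bundle formation over a decoded instruction trace.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

enum class Kind { Alu, Load, Store, DirectJump, CondBranch, IndirectBranch };

struct Inst   { Kind kind; std::string text; };
struct Bundle { uint32_t id; std::vector<Inst> insts; };

std::vector<Bundle> formBundles(const std::vector<Inst>& trace) {
    std::vector<Bundle> bundles;
    Bundle cur{0, {}};
    for (const auto& in : trace) {
        if (in.kind == Kind::DirectJump) continue;       // pure redirection, elided
        cur.insts.push_back(in);
        if (in.kind == Kind::CondBranch || in.kind == Kind::IndirectBranch) {
            bundles.push_back(cur);                      // branch terminates bundle
            cur = Bundle{cur.id + 1, {}};
        }
    }
    if (!cur.insts.empty()) bundles.push_back(cur);      // trailing instructions
    return bundles;
}

int main() {
    std::vector<Inst> trace = {
        {Kind::Alu, "add"}, {Kind::Load, "ldr"}, {Kind::DirectJump, "b"},
        {Kind::Alu, "sub"}, {Kind::CondBranch, "bne"}, {Kind::Alu, "mov"}
    };
    for (const auto& b : formBundles(trace))
        std::cout << "bundle " << b.id << ": " << b.insts.size() << " insts\n";
}
```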

6.3.2 Data Dependency Analysis

Bundle formation allows the PhysC to generate a dependency graph, while also illustrating the dependencies between both taken and not-taken bundles. This step allows the PhysC to extract inter-bundle dependencies, which are used to temporally sort the application bundles at a coarser granularity to implement branch prediction routes. The PhysC also checks for bundle duplication at this stage so that configuration time may be reduced and optimized, with less memory storage required in the backend.

6.3.3 Instruction Analysis

After generating bundles and determining inter-dependencies, the PhysC assesses intra-dependencies between bundle instructions, eliminating false dependencies (i.e. Write-After-Write (WAW) and Write-After-Read (WAR)), and extracting producer-consumer dependencies without the need for explicit hardware register renaming. Since no physical register file exists in ConSSTEP to limit the number of register renames permitted, the PhysC is free to rename operands as necessary with no hardware restrictions, while keeping track of the final architectural register values to be written to a thread’s EARF. The compiler must also keep track of instruction types (i.e. integer, FP, etc.) and the associated dataflow between types. This phase in general allows ConSSTEP to extract and transform a restricted dataflow binary (i.e. logically compiled) into a dataflow-like executable on its engines. Memory dependency orders are also analyzed and maintained temporally as imposed by the ISA’s memory model.
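The false-dependency elimination can be pictured as an SSA-like software renaming pass: every definition receives a fresh value name, so WAW and WAR hazards disappear, while the last writer of each architectural register is recorded for the EARF writeback. The sketch below illustrates this idea; the data structures are assumptions for the example, not the PhysC’s internal representation.

```cpp
// Sketch of software renaming over a bundle's instructions.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Op { std::string dst; std::vector<std::string> srcs; };

void rename(std::vector<Op>& ops, std::map<std::string, std::string>& lastDef) {
    int next = 0;
    for (auto& op : ops) {
        for (auto& s : op.srcs)                       // read the current name
            if (lastDef.count(s)) s = lastDef[s];
        std::string fresh = "v" + std::to_string(next++);
        lastDef[op.dst] = fresh;                      // future reads use 'fresh'
        op.dst = fresh;
    }
}

int main() {
    std::vector<Op> bundle = {
        {"r1", {"r2", "r3"}},   // r1 = r2 op r3
        {"r1", {"r1", "r4"}},   // WAW on r1 removed by renaming
    };
    std::map<std::string, std::string> lastDef;       // arch reg -> final value
    rename(bundle, lastDef);
    for (const auto& op : bundle) {
        std::cout << op.dst << " <-";
        for (const auto& s : op.srcs) std::cout << " " << s;
        std::cout << "\n";
    }
    std::cout << "EARF writeback: r1 = " << lastDef["r1"] << "\n";  // v1
}
```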


6.3.4 Engine Mapping

To appropriately assign thread bundles to an engine, each bundle’s respective instructions and inter-bundle dependencies are analyzed and compared to the available (and ideal) execution engines present in the core. Contrary to Single-Instruction Multiple-Thread (SIMT) models (i.e. GPUs), which assign threads to identical SIMD units for execution, ConSSTEP supports various engine sizes. Thus inter-bundle placement may significantly impact ConSSTEP’s performance. For instance, placing a computationally-intensive bundle on a small engine may affect performance negatively in comparison to mapping the same bundle to a larger engine which can support wider issue widths and more temporary data storage.

Accordingly, a set of mapping algorithms is assessed in Chapter 8 to determine engine-to-bundle assignments and their effect on performance and energy efficiency. The following three bundle placement algorithms are used in this study:

• First Come First Serve (FCFS): Assigns the first ready bundle to the first available engine.

• Random (Rand): Selects a random (available) engine for a given bundle’s execution.

• Demand: Considers the total number of instructions per bundle, the computational intensity of the bundle, and the maximum number of instructions that may execute at once. The bundles are then classified and assigned to the most suitable engine, if available. Otherwise, the method searches for the next suitable engine for the bundle’s execution.

To maintain low hardware complexity, once all bundles have been mapped to their preferred engine, the bundles of a thread are assigned to one specific engine according to the average preferred engine mapping of that thread.
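The Demand policy can be sketched as choosing the smallest free engine whose FU count covers a bundle’s peak parallel issue, falling back to any free engine otherwise. The engine descriptions, thresholds, and tie-breaking below are assumptions made for illustration.

```cpp
// Sketch of Demand-style bundle-to-engine placement.
#include <algorithm>
#include <iostream>
#include <vector>

struct Engine { int id; int fuCount; bool busy = false; };
// A fuller version would also weight total instruction count / intensity;
// only the peak parallel issue is used in this sketch.
struct BundleStats { int instCount; int peakParallelIssue; };

int demandMap(const BundleStats& b, std::vector<Engine>& engines) {
    // Consider candidate engines from smallest to largest issue capability.
    std::sort(engines.begin(), engines.end(),
              [](const Engine& a, const Engine& c) { return a.fuCount < c.fuCount; });
    for (auto& e : engines)
        if (!e.busy && e.fuCount >= b.peakParallelIssue) { e.busy = true; return e.id; }
    for (auto& e : engines)                       // no ideal engine free
        if (!e.busy) { e.busy = true; return e.id; }
    return -1;                                    // nothing available, bundle waits
}

int main() {
    std::vector<Engine> engines = {{0, 4}, {1, 8}, {2, 4}};
    BundleStats heavy{40, 6}, light{8, 2};
    std::cout << "heavy bundle -> engine " << demandMap(heavy, engines) << "\n";  // 1
    std::cout << "light bundle -> engine " << demandMap(light, engines) << "\n";  // 0
}
```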

6.3.5 Instruction-to-FU Mapping

Once all threads and their bundles have been assigned to an appropriate engine, the next PhysC stage maps the instructions of a bundle to FUs according to the producer-consumer dependencies generated and/or any FU type dependencies based on the instruction’s operation. All instructions are time-sorted according to dependencies and the availability of FUs. Thus, if FU conflicts exist, the instruction is scheduled for the next clock cycle. This phase of the PhysC is also responsible for exploiting ILP within the bundles, using a maximum issue width dictated by the number of FUs present in the given engine. If the instructions scheduled for a cycle exceed the engine’s issue width, the excess instructions are scheduled for the next clock cycle, with all of their dependencies scheduled thereafter. In such ILP-exceeding cases, priority is given to instructions with a higher number of dependents so that such instructions and their future dependent instructions may execute as soon as possible.

As results produced in an engine may traverse any number of rS hops to their consumers in a single cycle, dependent instructions experience only combinational logic delays. Thus the majority of the effects of intra-bundle placement arise in energy efficiency, by reducing hop counts between dependent instructions. Accordingly, this work employs the following four instruction placement algorithms, which are analyzed for efficiency in Chapter 8:

• Static-Filler (Static): Assigns instructions to the engine’s FUs in a round-robin fashion until the engine is completely filled, and then starts assigning instructions to the first FU again in the next time slot if the issue width is exceeded. Instructions belonging to the next clock cycle are then filled in the same manner, starting from the first FU, etc.

• Dynamic-Filler (Dyn): Maps a bundle’s instructions to FUs considering data dependencies. The method places a dependent instruction in the first available FU found in the row directly below the producer instruction, or in the nearest row if the subsequent row is not available. If no dependencies exist, the instruction is placed in the first available FU. A direct consideration for reducing the hop count between dependent instructions in the main grid is therefore the objective of Dyn.

• Depend-Filler (Dep): Maps instructions in a similar manner to Dyn, however giving top priority to external unit dependencies. That is, if an instruction has an external unit dependency, it will be placed in the bottom rows of the engine to minimize the hop count to the WRB and external unit. If no FUs are available, the next best FU will be considered, etc. Dep then gives second priority to the method invoked by Dyn.

• Random (Rand): Assigns each instruction randomly to an FU in the assigned engine. In order to reduce mapping time, if a free FU has not been selected after 3 tries, a reverse Static-Filler approach is applied, i.e. the engine’s last FU is searched first, decrementing thereafter to find the first available FU.

It is important to note that load, store, and/or any external unit instructions within a bundle are extracted prior to this step as these instructions must execute on external units.
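The per-cycle issue-width constraint and the dependency-count priority described above amount to a simple list-scheduling pass, sketched below; the data structures and the rule that a consumer issues at least one cycle after its producer follow the text, while everything else is an assumption for the example.

```cpp
// List-scheduling sketch for instruction-to-FU mapping within one bundle.
#include <algorithm>
#include <iostream>
#include <vector>

struct Inst { int id; std::vector<int> deps; int fanout; };

// Returns, per instruction id, the cycle in which it was scheduled.
std::vector<int> schedule(std::vector<Inst> insts, int fuCount) {
    std::vector<int> cycleOf(insts.size(), -1);
    int scheduled = 0, cycle = 0;
    while (scheduled < static_cast<int>(insts.size())) {
        std::vector<Inst*> ready;
        for (auto& in : insts) {
            if (cycleOf[in.id] != -1) continue;
            bool ok = true;               // producers must be in earlier cycles
            for (int d : in.deps) ok &= (cycleOf[d] != -1 && cycleOf[d] < cycle);
            if (ok) ready.push_back(&in);
        }
        std::sort(ready.begin(), ready.end(),                 // high fanout first
                  [](const Inst* a, const Inst* b) { return a->fanout > b->fanout; });
        for (int slot = 0; slot < fuCount && slot < (int)ready.size(); ++slot) {
            cycleOf[ready[slot]->id] = cycle;                 // one FU per slot
            ++scheduled;
        }
        ++cycle;
    }
    return cycleOf;
}

int main() {
    std::vector<Inst> b = {{0, {}, 2}, {1, {}, 1}, {2, {0}, 0}, {3, {0, 1}, 0}};
    auto c = schedule(b, /*fuCount=*/2);
    for (size_t i = 0; i < c.size(); ++i)
        std::cout << "inst " << i << " -> cycle " << c[i] << "\n";
}
```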

6.3.6 rS & Unit Mapping

Once instructions have been successfully mapped to their respective FUs, the PhysC then evaluates the mappings to determine how each rS unit and engine structure should be temporally configured, considering the data transport of producer-to-consumer dependencies and the temporal/spatial placement according to the FU mapping determined previously.

Figure 6.2: Sequence Graph of an Instruction Bundle

An example of the process used by the PhysC to determine the control logic per component is demonstrated in Fig. 6.2 and 6.3, adapted and altered from the principles of Reconfigurable Computing Systems presented by L. Kirischian [5]. Fig. 6.2 displays a sequence graph (SG) of the bundle provided in the execution example of Chapter 3.6. Such an SG is representative of a bundle’s data flow and control sequences during computation, using the information derived from previous PhysC stages. Specifically, the vertices of the SG represent the functionality required by the engine to process an instruction at a given time, whereas the arcs display the data dependencies between the instructions. The SG therefore represents a polar and acyclic graph for a given bundle, vertically ordered with respect to increasing time [5]. External inputs are illustrated at t0, whereas final outputs are represented by a dashed arrow as displayed in the example above.

Since the preceding “Instruction-to-FU” PhysC stage determined the FU mappings per instruction (referred to as resource binding in RCS), this stage is responsible for determining the interconnect propagation/latching and the engine structure mapping based on that information. The basic mapping scheme used by the PhysC is shown in Fig. 6.3, where the constants a, b, and c are high-level representations of the immediate values found in the bundle’s instruction stream (0x3, 0x5, and 0x0, respectively), with the FU mappings displayed accordingly.

Figure 6.3: Scheduled Sequence Graph with Resource Binding, Data Latching and Transport

External inputs of engines are latched by the RRB. It is possible, however, that the RRB must output other external data from the same port at a given time. To accommodate this possibility, values such as c and b must be temporarily latched in the SG representation so that subsequent RRB values may propagate to their consumers. Such values are labelled “tmp” and may be either latched by an rS or buffered by the RRB if no other data is required for output in the next cycle. Such latching strategies are determined by the PhysC during this stage. On the contrary, the r2, r1, and ‘a’ values do not require temporary latching as they are consumed immediately. In the case of the AND instruction, the result requires both propagation and storage as previously discussed, and therefore such cases are marked as “rS str” in the SG so that the PhysC may determine the best rS unit for the mapping. Finally, external outputs represented on the SG are latched by the WRB (represented as a unified latch in the figure), where the external units are also included. The external FUs are represented on the SG as well so that the PhysC may generate WRB request signals to interface the units on the given cycle.

Once all components and latching requirements have been determined, the PhysC may then route data according to the timing schedule derived from the bound SG. For instance, in the case of the RRB and its constants c and b, since no values are required for output within the next two clock cycles, the immediate value ‘c’ may be buffered by the RRB until clock cycle 3, where it may thereafter be propagated directly to FU0, with a similar case for ‘b’. Detailed timing requirements for each component are then derived, with an example of the FU, RRB, and WRB timing requirements of Fig. 6.3 displayed in Fig. 6.4. As seen in Fig. 6.4, FUs are assigned their operation per clock cycle during this stage, and similarly the RRB and WRB are assigned their input/output values per port per clock cycle. The PhysC then takes this information and generates control logic for each engine structure and its configuration latches. The EARF input registers, immediate values, and final EARF output registers are also determined at this time.

Figure 6.4: Timing Schedule for Engine Structures

In the case of the rS units, such programmable logic is slightly more difficult to generate. Specifically, the rS propagation and/or temporary storage details specified in the SG must be determined. In the case of temporary storage, the value may be stored to any route-able and available latch. Propagation, however, must strictly follow an xy routing protocol to manage PhysC complexity. In the case of storage conflicts and/or contention during data transport, the PhysC may resort to re-routing in the case of propagation, alternative storage locations in the case of temporal latching, or, in the worst case, delaying execution by a clock cycle if neither option is possible. All such rS propagation and storage mappings are then determined. More detailed studies regarding contention and utilization during the rS mapping phase are elaborated in Chapter 8. With regard to FP engines, since such engine types do not possess rS units, the FP RRB, WRB, external register reads/writes, and the dataflow configurations between unit types must also be considered in a similar manner.

6.3.7 Configuration Data Generation

Once all structures and routes have been successfully mapped and determined, the configuration data needed by the ConSSTEP engines is generated for each rS, ALU (opcodes), RRB, WRB, and external memory access based on the information provided in the timing schedule of the previous stage. All configuration data is then passed to its respective banked memory and the scheduler.

6.4 Summary

This chapter discussed the details of the logical and physical compilation processes used in this work, from application input to the generation of configuration logic for the underlying architecture. With the ConSSTEP architecture and compilation flow now established, the next chapter presents the experimental methodology used for experimental testing, results, and analysis.

Chapter 7

Experimental Methodology

7.1 Introduction

This chapter outlines details pertaining to the experimental methodology used to obtain the results presented in the next two chapters. Specifically, this chapter outlines various details of the architectural framework/performance modelling and physical modelling (latency, area, and power/energy) used to evaluate the ConSSTEP architecture and all baseline processors.

This chapter opens with the architectural framework section, which discusses details of the simulator and all processor core specifications used during testing and evaluation. The benchmarks used to evaluate the cores are then discussed, and the physical modelling specifications are provided thereafter. Specifically, the physical modelling section discusses details of hardware logic area, its respective wire delay, and cycle time.

7.2 Architectural Framework

7.2.1 Simulators

To validate ConSSTEP's performance, an in-house simulator with trace-driven execution was used to simulate the core shown in Fig. 7.1. The simulator accepts benchmark thread traces as input (logically compiled), and a data file specifying the number of threads per core and the configuration of the ConSSTEP core, i.e. an x-FU by y-FU engine, an x1-FU by y1-FU engine, etc. The traces are then processed by the PhysC, where the simulator assembles the core configuration specified

by the data file for both ConSSTEP and the baselines. The bundles are then assigned to their respective distributed Configuration Caches (C-Caches), where the scheduler commences simulation. Once simulated, the simulator outputs various performance statistics, which are used in Chapter 9.
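The thesis does not give the exact format of this data file, but the sketch below shows one plausible way it could be expressed and parsed: a thread count followed by one x-by-y FU geometry per engine. The file layout, keywords, and values are assumptions for illustration only.

```cpp
// Sketch of a hypothetical simulator data file (threads per core plus one
// <x>x<y> FU geometry per engine) and a minimal parser for it.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct EngineGeometry { int x_fus; int y_fus; };

int main() {
    // Hypothetical contents: 2 threads, one <1x3> engine and one <2x2> engine.
    std::string data_file = "threads 2\nengine 1 3\nengine 2 2\n";
    std::istringstream in(data_file);

    int threads = 0;
    std::vector<EngineGeometry> engines;
    std::string key;
    while (in >> key) {
        if (key == "threads") in >> threads;
        else if (key == "engine") { EngineGeometry g{}; in >> g.x_fus >> g.y_fus; engines.push_back(g); }
    }
    std::cout << threads << " threads, " << engines.size() << " engines\n";
    return 0;
}
```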

Figure 7.1: In-house Simulator Framework

Although an in-house simulator was used to assess performance, the Gem5 [71] simulator was also used to generate the traces using Full-System (FS) mode to obtain various execution details required by the ConSSTEP simulator. Since Gem5's SMT simulation was problematic for both System-call Emulation (SE) and FS mode at the time of this work, trace-driven simulators for both the SMT processor and the Single-Thread Core (STC) multi-core were also developed. All traces were obtained using the Linux aarch-ael image and m5op definitions to instantiate application checkpoints, loaded onto the virtual file system consisting of n processors, where each processor runs one thread's workload for the n total threads. Thereafter the traces were collected and provided to the PhysC. The ARM ISA was selected as the target architecture for both the physical and logical compilers, using GNU-based gcc.

The trace-based, in-house simulators for both ConSSTEP and the baselines were coded in C++, running on Ubuntu and an Intel Core i7-5820K CPU at 3.30GHz with 8GB of RAM. Details of each simulator are provided in Table 7.1. The simulators were specifically designed for the architectural comparisons of a single ConSSTEP core, a SMT core, and a multi-core system consisting of monolithic STCs (i.e. one thread/core). Since the ARM ISA is used, the baseline configurations were modelled on a plausible SMT version of the ARM Cortex-A57 architecture, replicating the branch predictor, Instruction Fetch, rename logic, and

physical register file per thread, where other structures are shared as outlined in Table 7.1. Accordingly, this work considers 2-thread (2T) and 4-thread (4T) multi-threaded models during experimental testing, which were imposed by the maximum die size constraint of the SMT model.

The baseline simulators (the Single-Thread (multi-)core and SMT simulator) consist of an 8K Bi-Modal branch predictor with a 2K Branch Target Buffer (BTB) per thread. The specifications of the baseline CPUs assume a Harvard architecture, 12-stage front-end, a 2 clock cycle issue-to-dispatch latency, 3-wide instruction fetch and decode, 8-way issue consisting of a 64-entry IQ, 36-entry Physical Register File (PRF), 128-entry FP PRF, 4-wide commit and a ROB size of 128 entries. The backend FUs consist of 2 ALUs, 2 LSUs, 1 complex Integer unit, 2 FP units, and 1 branch unit. Evaluations for all baselines consider a multi-core environment by simulating the processors using a throughput-limited DRAM memory of 2GB/s per core, representative of current multi-core systems [35].

Several configurations for the ConSSTEP architecture were also tested, as detailed in the forthcoming chapter. Of the FUs specified in the data input file, half the FUs are assigned complex integer functionality by the simulator, where the rest are assigned basic ALU functionality. As previously mentioned, LSUs are conjoined between engines for 4T configurations, where 2 LSUs are assigned per engine for a 2T configuration. Each engine also possesses a conditional branch unit and branch predictor per thread, with a respective 2-FU FP engine. The memory system for ConSSTEP also assumes the same throughput-limited DRAM memory of 2GB/s per core, where all cores, including the baselines, invoke a 32KB 4-way L1 D-Cache with 64B lines, and a 64-entry dual-port TLB. A sample 2T ⟨4x6⟩ ConSSTEP architectural layout is provided in Fig. 7.2.

Locks and other thread synchronization primitives require special semantics within the simulator to handle a multi-threaded trace-driven approach. Specifically, all calls to synchronization primitives in the benchmarks were recorded and enforced within the simulator's scheduler during replay. Therefore both the ConSSTEP and baseline core simulators invoke execution-driven simulation for synchronization primitives, while maintaining a trace-driven approach for the standard execution of traces.


Figure 7.2: ConSSTEP 2T ⟨4x6⟩ High-level Core Layout

Table 7.1: Simulation & Modelling Parameters

Feature: Branch Predictor
  Baseline OoO SMT/STSC ARM CPU: 8K Bi-Modal branch predictor, w/ 2K BTB entries, per thread
  ConSSTEP (m x INT engines (T-Threads), 1x2 FP engines): Branch Predictor per engine

Feature: Functional Units
  Baseline: 2 ALUs, 2 LD/STR, 1 complex INT, 2 FPU, LD/STR (4 c.c), conflicts buffered in cache crossbar
  ConSSTEP: Per n-FU engine: n ALUs (incl. complex INT), 1-2 LSU/engine (2T)/shared for 4T, 1 cond. branch, 2 FPU/FP engine, LD/STR conflicts buffered in cache crossbar, at least one mul/div per engine

Feature: Execution Parameters
  Baseline: Modelled according to ARM Cortex-A57 architecture; SMT multi-threading with 3-wide fetch, 4-wide commit, 64-entry IQ (i.e. 2 c.c issue), 12-stage front-end, 8-way issue/disp, squash recovery, 36-entry PRF, 24-entry FP PRF, 128-entry ROB; SMT replicates: Branch, IF/PC, Rename, PRF
  ConSSTEP: 16-entry int & FP ARF per engine

Feature: Memory System
  Baseline: 32KB D$ 4-way, 64B line (2 c.c), 64-entry dual-port TLB, 32KB I$ 2-way, 64B line (2 c.c), L2 D$ 4MB 16-way 12 c.c, miss lat. 215 c.c
  ConSSTEP: 32KB D$ 4-way, 64B line (2 c.c), 64-entry dual-port TLB, total aggregated 746KB C$ (max bank = 64KB), L2 D$ 4MB 16-way 12 c.c, miss lat. 215 c.c


7.2.2 Benchmarks

Table 7.2: Benchmark Descriptions

Benchmark        Common Characteristics (Type, L1...)
Barnes           INT, SPr, low L1 miss rates
Blackscholes     INT, SPr, low L1 miss rates, high bank conflicts on reads, high branch count
Fluidanimate     INT, SMT, low L1 miss rates, high branch count
Cholesky         INT, SMT, fair shared L1/L2 accesses
OceanNP          FP+INT, SMT, low L1 miss rates, computationally-intensive
Swaptions        INT, SMT, low L1 miss rates and shared accesses
FFT              INT, SMT, high shared L1 reads and bank conflicts
Streamcluster    INT, SMT, high L2 miss rates
Water-Spatial    FP+INT, SMT, low L1 miss rates, computationally-intensive
LU               FP+INT, SMT, high L1 shared write accesses, computationally-intensive

Figure 7.3: Instruction Distribution for all Benchmarks

As ConSSTEP’s main objective is to improve the performance of multi-threaded workloads, we evaluate all processor models using the multi-threaded PARSEC [12] and SPLASH-2 [13] benchmark suites. Specifically ten benchmarks with different characteristics are used to evaluate ConSSTEP: Barnes, Blackscholes, Fluidanimate, Cholesky, Ocean NP, Swaptions, FFT, StreamCluster, Water-Spatial, and LU. Fig. 7.3 outlines the instruction type distribution for all the benchmarks, where Table 7.2 specifies the predominant memory and multi-threading characteristics for each benchmark. Benchmarks which specify SMT workloads generally consist of multiple

threads, where each thread is assigned a specific workload. Conversely, benchmarks labelled as Speculative Precomputation (SPr) are multi-threaded workloads consisting of one or more helper threads which solely provide data-prefetch, while the other thread(s) compute(s). Such SPr benchmarks therefore signify more memory-intensive benchmarks.

7.3 Physical Modelling

To assess factors of latency, area, and power consumption, hardware structures for all processors were modelled using a combination of synthesis and CACTI 6.5 [72]. Structures that required small storage (reservation stations, etc.), or structures which were more customized in nature (i.e. rS units), were modelled using VHDL and synthesized using Synopsys Design Compiler at the 45nm technology node with the OpenPDK library (NCSU/OSU). Larger structures consisting mostly of SRAM (such as the caches) were modelled using CACTI, also at the 45nm node. The following sections outline the various physical modelling details used to assess and validate the ConSSTEP architecture during experimental results and analysis (Chapter 9).

7.3.1 Area Estimates

During experimental testing, this work considers the rS interconnect wiring dimensions for all engine types (integer and FP), laid out horizontally and vertically. Metal layer 4 was selected for the rS column wiring, and layer 3 for the vertical wiring. Following the 45nm library layout rules, metal layers 3 and 4 both required a minimum spacing width of 0.14um. The total widths for the wiring of the integer engines considering 4-, 6-, and 8-FUs (and the longest torus wire) were 134.8um, 218.9um, and 269.6um respectively. Similarly, the total width for the wiring of a 2-FU FP-based engine was approximately 265.0um. The total wiring widths for both types of engines were therefore less than the square root of their respective engine areas (see Table 7.3). Based on these results, ConSSTEP's area in all cases is not wire-dominated.

7.3.2 Wire Delays

Wire delays for architectural modelling were calculated using the distributed RC model, delay = (1/2)·Rwire·Cwire [73]. For metal layer 3 with the worst case dimensions of


134.8um x 0.14um (i.e. longest torus wire), the delay was approximately 0.174ps for a 4-FU engine, 0.458ps for the 6-FU engine, and 0.872ps for the 8-FU engine. These results were calculated according to the parameters specified by the 45nm

OpenPDK library, considering Rwire equal to 0.25 ohms per square distance, and

Cwire 0.00015pF per square distance. According to these results, such single-cycle wire delays are therefore considered negligible during cycle time calculation. In general, wire delay issues were mitigated by using short wires between rS units while also considering the longest torus connection, which is also negligible as seen in the calculations provided here (and further enforced by restraining engine size).

Although the rS interconnect adopts Network-on-Chip (NoC) methods, data may traverse a maximum rS unit critical path length per engine and such wire delays must be considered. Although the wires are kept short using the rS interconnect as shown above, wire repeaters [74] are inserted at every rS unit to maintain the signal speed and integrity of NoCs. Thus considering that data may traverse a critical path length, and that there are three gates per multiplexer, two multiplexers per rS, and one gate per repeater, a 2-FU engine will experience 36 gate delays (gd), a 3-FU engine 39 gd, a 4-FU engine 42 gd, a 6-FU engine 48 gd, and an 8-FU engine 54 gd. Assuming 4ps per gd at the 45nm node [74], there is an approximate 24ps difference in critical path delay per engine configuration, demonstrating ConSSTEP's scalability in terms of frequency. Thus the two extremes of a 2-FU engine and an 8-FU engine have maximum wire delays of 0.14ns and 0.216ns, respectively, which are considered negligible.
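The short C++ sketch below simply reproduces the gate-delay figures quoted above (taken directly from the text, not re-derived) and converts them to nanoseconds at 4 ps per gate delay.

```cpp
// Converts the per-engine gate-delay counts stated in the text into an
// interconnect delay, assuming 4 ps per gate delay at 45nm [74].
#include <cstdio>
#include <map>

int main() {
    const double ps_per_gate_delay = 4.0;
    const std::map<int, int> gate_delays = {  // FUs per engine -> total gate delays
        {2, 36}, {3, 39}, {4, 42}, {6, 48}, {8, 54}};
    for (const auto& [fus, gd] : gate_delays)
        std::printf("%d-FU engine: %.3f ns interconnect delay\n",
                    fus, gd * ps_per_gate_delay / 1000.0);  // 0.144 ns .. 0.216 ns
    return 0;
}
```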

Table 7.3: Delay, Energy per Access, and Area Results

Component | ConSSTEP Delay (ns) | ConSSTEP Energy (pJ) | ConSSTEP Area (um2) | SMT Delay (ns) | SMT Energy (pJ) | SMT Area (um2)
2T/4T Scheduler vs Fetch | 0.37/0.94 | 1.01/4.0 | 584480/2450000 | 1.84 | 4.66 | 27831.4
Decode Unit (3-way) | - | - | - | 3.01 | 4.19 | 9468.6
Rename (3-way) | - | - | - | 3.28 | 10.41 | 31125.8
ROB | - | - | - | 0.53 | 1.101 | 204300
8-entry IQ | - | - | - | 2.32 | 4.348 | 18961.6
Config Reg Write(64b)/Read(1b) | 2.18/0.01 | 1.16/0.01 | 3917.2 | - | - | -
Payload RAM | - | - | - | 0.70 | 2.611 | 19926.0
INT ARF vs PRF | 1.27 | 2.54 | 18680 | 1.37 | 16.52 | 135254.4
FP ARF vs FP PRF | 1.89 | 8.4 | 41370.2 | 2.1 | 23.77 | 146278
ConSSTEP INT ALU vs ALU+Bypass | 1.48 | 1.13 | 7869.69 | 3.50 | 10.15 | 48140.8
Complex INT ALU | 1.84 | 1.23 | 9001.72 | 1.84 | 1.23 | 9001.72
Internal rS / External rS | 0.22/0.06 | 0.209/0.01 | 1115.0/165.19 | - | - | -
INT RRB / FP RRB | 1.66/2.17 | 11.09/16.99 | 68831.3/15565.7 | - | - | -
INT WRB / FP WRB | 2.1/2.27 | 2.55/4.74 | 7068.6/13942.4 | - | - | -
I$ | - | - | - | 0.68 | 0.19 | 1455000
Internal rS C$ | 2.027 | 0.208 | 1196851 | - | - | -
External rS C$ | 0.438 | 0.038 | 170789 | - | - | -
FP/ALUs C$ | 0.509 | 0.052 | 310000 | - | - | -
(INT/FP) RRB C$ | 0.445 | 0.039 | 243975 | - | - | -
(INT/FP) WRB C$ | 0.206 | 0.01 | 49393 | - | - | -
Pre-Load/Ext ARF C$ | 0.1166 | 0.01 | 23845 | - | - | -


7.3.3 Cycle Time Determination

Cycle time is derived for the baseline processor by finding the longest (latency) pipeline stage. In this case, cycle time was determined by the FU+bypass pipeline delay for both baseline processors [3], giving a cycle time of 3.5ns for the SMT and STC, where the SMT incurs two additional pipeline stages. For ConSSTEP, cycle time was derived using the following equations:

1) CycleTime = max(T_exe/engine, T_config), where

2) T_exe/engine = max(rS propagation latency (lat.) + ALU execution lat., RRB lat., WRB lat.)

3) T_config = max(read C$ bank lat., write config register lat., read config register lat.)

Table 7.4: Derived cycle time per processor/engine

Processor / Engine    Clock Cycle (ns)
SMT and STC           3.5
⟨2FU⟩ Engine          2.2
⟨3FU⟩ Engine          2.42
⟨4FU⟩ Engine          2.64
⟨6FU⟩ Engine          3.0
⟨8FU⟩ Engine          3.52

T_exe/engine considers the worst case latency for traversing data on the rS interconnect from result generation to dependent ALU source input, including any intermediate result latching in the rS. T_exe/engine is therefore mainly dependent on propagation latencies and engine size. Conversely, T_config is the maximum latency incurred when configuring engine components, considering both the configuration caches and dedicated registers per component. The values for all synthesized latencies are given in Table 7.3, and are independent of engine size (considering all latching and combinational delays). Based on these latency derivations, the worst case 4-FU engine cycle time was calculated as follows:

4) CriticalPath = ext rS lat. + int rS lat. + ALU lat.

Therefore, 5x0.22ns+0.06ns+1.48ns = 2.64ns cycle time, where the 6-FU engine

similarly yields a 3ns cycle time, etc. This signifies that there is a slight difference in cycle time between engines depending on their configurations, which is strictly dependent on the worst case propagation delay to maintain single-cycle interconnect latency.
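A minimal sketch of the 4-FU cycle-time derivation follows, using equations (1)-(5) with the synthesized latencies of Table 7.3; the five external rS traversals reflect the worst-case path stated above, and the configuration-register write latency is assumed to dominate T_config.

```cpp
// Sketch of the worst-case 4-FU engine cycle-time derivation, eqs. (1)-(5).
#include <algorithm>
#include <cstdio>

int main() {
    const double ext_rs_lat = 0.22, int_rs_lat = 0.06, alu_lat = 1.48;  // ns, Table 7.3
    const double t_exe    = 5 * ext_rs_lat + int_rs_lat + alu_lat;      // eq. (4)/(5): 2.64 ns
    const double t_config = 2.18;   // assumed worst config-register write latency (Table 7.3)
    const double cycle_time = std::max(t_exe, t_config);                // eq. (1)
    std::printf("4-FU engine cycle time: %.2f ns\n", cycle_time);       // 2.64 ns
    return 0;
}
```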

Results for all baseline and ConSSTEP engine cycle times are presented in Table 7.4. The SMT operates at the same frequency as the STC, however it incurs two additional pipeline stages, including thread select and pick. The cycle time of the multi-core STC baselines also does not consider interconnect latencies. As seen in the table, in comparison to the conventional CPU, ConSSTEP displayed a significant advantage in terms of achievable frequency due to its simplistic design and scalable data transport. Likewise, from Table 7.3 we observe that FP engines easily achieve the same frequency as integer-based engines. Thus for heterogeneous ConSSTEP configurations which contain multiple engine sizes, the slowest engine's cycle time determines ConSSTEP's frequency of operation, where the limiting cycle time is applied to all engines.

7.4 Summary

This chapter outlined the experimental methodology used for deriving and assessing the ConSSTEP core and baseline processors presented in the next two chapters. Specifically, this chapter outlined the architectural framework specifications for assessing performance factors, and the details of physical modelling to evaluate factors of latency, power/energy and area overhead across all core types. Now that such evaluation procedures have been clearly stated, the next chapter presents various sensitivity studies to determine optimal scheduling algorithms and topology configurations for the ConSSTEP architecture.

Chapter 8

Sensitivity Studies

8.1 Introduction

The purpose of this chapter is to conduct a series of experiments on the intra- and inter-bundle scheduling algorithms discussed in Chapter 6, assessing the best option(s) for PhysC integration. Specifically, factors of performance and energy efficiency must be assessed, in addition to the best combination of the two factors for scheduling efficiency. The algorithms used in this chapter are tested using the multi-threaded SPLASH-2 and PARSEC benchmarks presented in the previous chapter. Accordingly, the specific objectives of this chapter are to:

• Determine the intra-scheduling algorithms which contribute to the least contention and highest performance overall for various ConSSTEP engine configurations. Algorithms with considerable contention will not be considered for PhysC integration.

• Determine the intra-scheduling algorithms which provide low hop count (i.e. lowest number of rS units traversed) to mitigate data movement, while also considering the performance attained in the previous stage. Thereafter, the best intra-scheduling algorithm may be determined.

• Observe the engine configuration(s) which prove problematic towards performance and/or energy efficiency. Such configurations will not be considered during the experimental results which evaluate ConSSTEP against the baseline processor cores.

• Verify the best inter-scheduling algorithm using the intra-scheduling algorithm selected previously. To test for the best algorithm, bundles will be scheduled


using extreme heterogeneous engine topologies.

Based on the results obtained in this chapter, the corresponding best options for the intra- and inter-bundle algorithms will be integrated into the PhysC for the experimental testing conducted in the next chapter. Thereafter, during Experimental Results, several heterogeneous and homogeneous engine configurations may be tested for various 2-thread (2T) and 4-thread (4T) workloads, and compared to conventional SMT cores and STC multi-core systems.

Accordingly, this chapter will first test the scheduling efficiency of the intra-bundle algorithms Depend (Dep), Dynamic (Dyn), First Come First Serve (FCFS) and Random (Rand), with the inter-bundle algorithms Demand, Static, and Random. Specifically, 2T homogeneous engine configurations for ConSSTEP will be assessed across the algorithms, determining how factors of utilization on the interconnect, i.e. rS latch utilization for temporary storage, rS utilization during propagation, and re-routing delays incurred due to utilization conflicts, affect performance for a given engine topology. Thereafter, IPC and hop count are assessed and correlated to the previous utilization and contention results. By solely testing 2T homogeneous cores, each engine's individual behaviour may be understood, where the best engines may be selected for testing and analysis during the experimental results of the next chapter.

8.2 Intra-Scheduling Algorithm Efficiency

The following section presents experimental testing for all intra- and inter-scheduling algorithms considering 2-Thread (2T) homogeneous ConSSTEP architectures. Engine configurations are listed as ⟨x FUs x y FUs⟩, signifying the dimensions of the engine tested, i.e. x FUs on the x-axis and y FUs on the y-axis, for a total of x*y FUs. Engines consisting of 2 or 3 FUs (i.e. asymmetric) with one y FU and multiple x FUs suggest a horizontally-oriented engine configuration (i.e. ⟨2x1⟩), whereas one x and multiple y FUs signify a vertically-oriented engine (i.e. ⟨1x2⟩). Other engines tested are of symmetric orientations and hence are not referred to as vertically or horizontally oriented. For reasons of asymmetric topological orientation, 5-FU engines have been omitted during experimental testing. Accordingly, configurations with 2 FUs, 3 FUs, 4 FUs, 6 FUs, and 8 FUs are evaluated in this chapter.


8.2.1 rS Latch Utilization

The first factor assessed for ConSSTEP performance and algorithm efficiency is rS storage, i.e. latch utilization within the rS interconnect. Since an internal register file is not invoked in ConSSTEP, the rS latch usage must be monitored. Consequently, if latches are highly utilized and an insufficient number are available for temporary storage, extra delays are incurred by the engine until a latch becomes available. Latch utilization in this case signifies the number of latches being utilized at a given time. Fig. 8.1 provides the utilization breakdown for each 2T homogeneous configuration, for all scheduling algorithms. Specifically, the graphs outline the percentage of utilized latches for a given amount of time, considering the total duration of the benchmark's execution. Thus in general, the greater the time spent in the higher utilized categories (i.e. 76-100%), the more likely the algorithm experiences stalls due to the lack of available latches.
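The sketch below illustrates how such a utilization histogram can be built: per cycle, the fraction of occupied latches is bucketed into the quartile ranges used in Fig. 8.1, and each bucket is reported as a percentage of execution time. The latch count and the sample trace are placeholders, not measured data.

```cpp
// Sketch of the rS latch utilization metric: quartile buckets over a trace.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

int main() {
    const int total_latches = 12;                                        // assumed latch count
    std::vector<int> occupied_per_cycle = {2, 5, 9, 12, 7, 3, 11, 6};    // sample trace only

    std::array<int, 4> bucket{};  // 0-25%, 26-50%, 51-75%, 76-100%
    for (int n : occupied_per_cycle) {
        double util = 100.0 * n / total_latches;
        bucket[std::min(3, static_cast<int>(util / 25.0001))]++;
    }
    for (int b = 0; b < 4; ++b)
        std::printf("%d-%d%%: %.1f%% of cycles\n", b * 25 + (b ? 1 : 0), (b + 1) * 25,
                    100.0 * bucket[b] / occupied_per_cycle.size());
    return 0;
}
```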

As seen in the figure, the Dep and Dyn intra-scheduling algorithms generally provided the best (lowest) latch utilization, where the Rand algorithm demonstrated slightly lower utilization than Static due to its random placement, effectively reducing storage contention. Comparing the two 2-FU orientations, the ⟨2x1⟩ configuration shown in Fig. 8.1(a) was able to provide lower latch utilization in comparison to ⟨1x2⟩ (Fig. 8.1(b)) due to an increase in propagation contention, discussed in the next subsection. Consequently, such propagation contention contributes to more routing delays and likewise lower storage utilization per cycle. Conversely, the ⟨2x2⟩ (4-FU) engine's symmetric orientation, shown in Fig. 8.1(e), was able to provide more storage locations in a distributed manner, with less propagation contention and a wider spectrum of rS utilization. Similar utilization characteristics were also observed in the ⟨3x1⟩ (Fig. 8.1(c)) and ⟨1x3⟩ (Fig. 8.1(d)) engine configurations. The vertical orientations in general however were able to distribute latch locations between rows more efficiently across the algorithms, demonstrating higher usage. Therefore, as seen in the respective figures, the vertical orientations of the 2- and 3-FU engines (i.e. ⟨1x2⟩ and ⟨1x3⟩) utilized more latches (with less contention impact) than their horizontal counterparts. It was observed that the more rS units integrated on the interconnect in a distributed and spatial manner, the better the performance observed.

(a) ⟨2x1⟩⟨2x1⟩ Engine (2-FU Engines) (b) ⟨1x2⟩⟨1x2⟩ Engine (2-FU Engines) (c) ⟨3x1⟩⟨3x1⟩ Engine (3-FU Engines) (d) ⟨1x3⟩⟨1x3⟩ Engine (3-FU Engines) (e) ⟨2x2⟩⟨2x2⟩ Engine (4-FU Engines) (f) ⟨2x3⟩⟨2x3⟩ Engine (6-FU Engines) (g) ⟨2x4⟩⟨2x4⟩ Engine (8-FU Engines)

Figure 8.1: rS Latch Utilization Comparison for Engine Configurations

For 4-FU and 6-FU engines shown in Fig. 8.1(e) and (f), respectively, all algorithms provided fair utilization as an abundance of latches were present in the network, with minor storage contention issues in comparison to the smaller engines. Accordingly, the Dep and Dyn intra-scheduling algorithms demonstrated the best utilization when compared to Rand and Static. For the 8-FU engine (Fig. 8.1(g)), the best latch utilization was surprisingly obtained by the Rand algorithm, mainly due to its random nature, which sparsely mapped instructions to allow for low latch utilization. Finally, as expected, the inter-scheduling algorithms did not affect latch utilization due to the homogeneous architecture tested. Hence no performance gain was attained, as both engines provided the same advantages and limitations regardless of the inter-scheduling algorithm.

8.2.2 rS Utilization during Propagation

For all homogeneous configurations and algorithms tested, Fig. 8.2 displays rS unit utilization during data propagation with respect to the total duration of a benchmark's execution. As seen in the figure, the majority of the applications tested demonstrated a 0-25% utilization of rS units during propagation. For smaller engines such as the 2-FU engines of Fig. 8.2(a) and (b), this distribution signifies blocking due to the xy routing protocol. This may be verified in Fig. 8.3, which demonstrates the associated delay incurred during propagation due to contention in the rS interconnect. As verified in Fig. 8.3(a) and (b), contention especially dominates the rS interconnect's lower level for the 2-FU configurations, and hence only a certain number of rS units may be utilized per clock cycle.

As seen in Fig. 8.3(a) and Fig. 8.3(c), the Static and Rand algorithms generate high contention for such horizontally-oriented topologies, where Dep and Dyn are able to outperform these algorithms due to their intuitive instruction mapping. Considering Dyn and Dep for the 4-FU (shown in Fig. 8.3(e)), 6-FU (Fig. 8.3(f)), 8-FU (Fig. 8.3(g)) and ⟨1x3⟩ (Fig. 8.3(d)) engines, it is seen that these algorithms also utilize the rS interconnect efficiently, which is again due to the scheduling algorithms' ability to directly address instruction dependencies and placement. Although not all the rS units are utilized in these engines, less re-routing is required in comparison to the other scheduling methods, as verified in the respective sub-figures of Fig. 8.3. Overall, according to the results obtained, Dep proves best for all configuration types, and especially for the SPr benchmarks Barnes and Black (Fig. 8.3) which access memory frequently. Therefore the Dep algorithm is able to adequately reduce such propagation-prone contention for these low-performing, memory-intensive applications.


(a) ⟨2x1⟩⟨2x1⟩ Engine (2-FU Engines) (b) ⟨1x2⟩⟨1x2⟩ Engine (2-FU Engines) (c) ⟨3x1⟩⟨3x1⟩ Engine (3-FU Engines) (d) ⟨1x3⟩⟨1x3⟩ Engine (3-FU Engines) (e) ⟨2x2⟩⟨2x2⟩ Engine (4-FU Engines) (f) ⟨2x3⟩⟨2x3⟩ Engine (6-FU Engines) (g) ⟨2x4⟩⟨2x4⟩ Engine (8-FU Engines)

Figure 8.2: Comparison of rS Utilization During Propagation for 2-Thread


(a) ⟨2x1⟩⟨2x1⟩ Engine (2-FU Engines) (b) ⟨1x2⟩⟨1x2⟩ Engine (2-FU Engines) (c) ⟨3x1⟩⟨3x1⟩ Engine (3-FU Engines) (d) ⟨1x3⟩⟨1x3⟩ Engine (3-FU Engines) (e) ⟨2x2⟩⟨2x2⟩ Engine (4-FU Engines) (f) ⟨2x3⟩⟨2x3⟩ Engine (6-FU Engines) (g) ⟨2x4⟩⟨2x4⟩ Engine (8-FU Engines)

Figure 8.3: Comparison of Delays due to Propagation Contention for 2-Thread

Dyn however comes very close to the Dep algorithm's results, and is especially beneficial for more computationally-intensive applications.

8.2.3 IPC

Fig. 8.4 displays the IPC outcomes for all homogeneous engine configurations and algorithms listed, normalized to the average IPC attained by a 2-core STC system (i.e. one thread workload per core) for a 2T workload. Theoretically, intra-scheduling algorithms scheduled on a homogeneous ConSSTEP architecture should solely impact hop count (the number of rS units traversed between dependent instructions) and its respective energy efficiency. Thus bundle placement in a homogeneous setting should not affect IPC. However, as discussed in the previous subsections, intra-placement algorithms have a direct impact on both propagation and storage contention, and therefore also impact performance. In addition to the results obtained in the previous section, Fig. 8.4 also verifies that the more naive intra-scheduling approaches such as Static and Rand exhibit lower performance (IPC) due to increased contention, causing higher re-scheduling and delay requirements during instruction routing, whereas more involved algorithms such as Dep and Dyn are able to avoid contention. Therefore, Dep again provides the best performance outcome, achieving less rS contention for data stores and propagation in horizontally-oriented configurations. The Dyn algorithm however was also able to achieve performance within 0.8% of the Dep algorithm.

When compared specifically to the Rand and Static intra-placement algorithms, Dep was able to improve performance on average for 2-FU engines by 37.3% (seen in Fig. 8.4(a) and (b)), 3-FU by 18.86% (Fig. 8.4(c) and (d)), 4-FU by 20.63% (Fig. 8.4(e)), 6-FU by 25.74% (Fig. 8.4(f)), and 8-FU by 21.03% (Fig. 8.4(g)). Since more FUs and rS units were available for 8-FU engines, contention was less of a factor across the routing algorithms tested, and hence the 8-FU engines exhibited more performance improvement than the 6-FU engines in general. Similarly, the ⟨1x3⟩⟨1x3⟩ ConSSTEP architecture exhibited very little performance variation across all algorithms, as seen in Fig. 8.4(d). However, referring back to the contention displayed in Fig. 8.3(d), it is evident that contention in general was also very minimal due to the vertical topology layout, and therefore variation in IPC was negligible.


(a) ⟨2x1⟩⟨2x1⟩ Engine (2-FU Engines) (b) ⟨1x2⟩⟨1x2⟩ Engine (2-FU Engines) (c) ⟨3x1⟩⟨3x1⟩ Engine (3-FU Engines) (d) ⟨1x3⟩⟨1x3⟩ Engine (3-FU Engines) (e) ⟨2x2⟩⟨2x2⟩ Engine (4-FU Engines) (f) ⟨2x3⟩⟨2x3⟩ Engine (6-FU Engines) (g) ⟨2x4⟩⟨2x4⟩ Engine (8-FU Engines)

Figure 8.4: IPC - Performance Comparison for 2-Thread Engine Configurations


The most important observation however was that the 2-FU engines were not able to exceed the IPC of a conventional STC, defeating the purpose of a ConSSTEP core. It is also evident that the vertically-oriented engine topologies provide better performance and generally less contention than their horizontal counterparts.

8.2.4 Hop Count

Fig. 8.5 provides the hop count distribution for the scheduling algorithms tested across all benchmarks. Hop count is directly associated with the energy efficiency of data movement, i.e. the fewer hops/rS units traversed, the more energy efficient the algorithm. Therefore the ideal intra-scheduling algorithm will provide both high performance and low hop count.
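To make the hop-count/energy relation concrete, the short sketch below charges roughly one rS access energy (≈0.21 pJ per internal rS access, from Table 7.3) per hop; the total hop counts used are hypothetical placeholders, not measured values.

```cpp
// Sketch relating total hop count to data-movement energy: fewer rS hops
// directly translate into lower transport energy.
#include <cstdio>

int main() {
    const double rs_hop_energy_pj = 0.209;                    // internal rS access energy, Table 7.3
    const long long hops_dep = 1'200'000, hops_dyn = 980'000; // hypothetical hop totals
    std::printf("Dep: %.2f uJ, Dyn: %.2f uJ of data movement\n",
                hops_dep * rs_hop_energy_pj * 1e-6,
                hops_dyn * rs_hop_energy_pj * 1e-6);
    return 0;
}
```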

As seen in Fig. 8.5, the Dep and Dyn intra-bundle scheduling algorithms were able to provide the best hop count, improving upon the other algorithms in the range of 8-28%. According to the figure, the horizontally-oriented configurations provide lower hop count and therefore higher energy efficiency when compared to the vertically-oriented engines for the 2- and 3-FU topologies. As seen in the previous subsection however, such hop count mitigation came at the expense of some performance loss in these configurations.

Observing the results between the two most efficient algorithms, Dyn provides approximately 22.8% more hop efficiency than Dep. That is, since Dep places greater priority on the data traversal of external operations, dependencies of other instructions located in the main grid are given second priority, in turn increasing hop count in the main engine. Such results demonstrate that focusing on main grid dependencies proves beneficial to energy efficiency, especially for the more computationally-intensive workloads. Therefore the Dyn intra-bundle scheduling algorithm provides the best solution for reducing hop count and improving energy efficiency.

8.2.5 Summary - Intra-Scheduling

Fig. 8.6 and 8.7 display the average IPC and average hop count, respectively, considering a 2T architecture per intra-scheduling algorithm, normalized to a 2-core STC for all benchmarks. Since the intra-scheduling algorithms gave the same outcomes


(a) ⟨2x1⟩⟨2x1⟩ Engine (2-FU Engines) (b) ⟨1x2⟩⟨1x2⟩ Engine (2-FU Engines) (c) ⟨3x1⟩⟨3x1⟩ Engine (3-FU Engines) (d) ⟨1x3⟩⟨1x3⟩ Engine (3-FU Engines) (e) ⟨2x2⟩⟨2x2⟩ Engine (4-FU Engines) (f) ⟨2x3⟩⟨2x3⟩ Engine (6-FU Engines) (g) ⟨2x4⟩⟨2x4⟩ Engine (8-FU Engines)

Figure 8.5: Hop Count Comparison for 2-Thread Engine Configurations

for all inter-scheduling algorithms, only the intra-algorithms are provided in these figures. As seen in Fig. 8.6, the greater the number of FUs present in an engine, the more likely the PhysC was able to take advantage of the wider issue width, which also mitigates contention to provide further performance gains. Also observed in the figures are the trends of the vertically-oriented topologies, which achieve slight performance advantages over their horizontal counterparts, mainly due to reduced contention and more distributed storage locations.

In terms of energy efficiency and hop count as displayed in Fig. 8.7, it is observed that the Dyn intra-scheduling algorithm provided the best efficiency, where Static was able to match the Dep algorithm's hop count in certain cases. Considering both performance and energy efficiency however, the Dyn intra-scheduling algorithm proves most efficient, and will therefore be used for further experimental testing in the subsequent sections.

Figure 8.6: Average IPC per Algorithm for all Benchmarks - 2T Workload

As previously mentioned, this thesis also deduces by analysis that both 2-FU engine orientations were problematic for achieving performance gains and providing adequate storage area for temporary values. Specifically, their performance is lower than that of a STC, defeating the purpose of such an architecture for high single-thread and throughput performance. Thus only 3-FU engines and larger will be considered during experimental testing for evaluating ConSSTEP's feasibility and advantage over multicore STC and SMT processors.

Figure 8.7: Average Hops per Algorithm for all Benchmarks - 2T Workload

The subsequent section will present an extreme heterogeneous configuration case, demonstrating the performance effects of the inter-scheduling algorithms in a brief and concise manner, as we have clearly observed in this section that more involved algorithms are optimal for routing and storage placement problems. Thus the inter-scheduling algorithms which map thread bundles to engines, i.e. Demand, FCFS and Rand, will be tested using a 4-threaded core with extreme variation in engine sizes, using ⟨1x2⟩ (2-FU) and ⟨2x4⟩ (8-FU) engines, while using the Dyn intra-scheduling algorithm to map instructions to their respective FUs. Since there is much variation in the engine sizes selected, the mappings and respective performance outcomes will also quickly demonstrate the performance effects of inter-bundle placement. Based on the outcome, the Experimental Results section will then employ the best algorithm pair, invoking various engine configurations to find the most efficient and suitable ConSSTEP architecture for single-thread and throughput performance on a single core.

8.3 Inter-Scheduling: Bundle to Engine Mapping

Figure 8.8 presents the results of a ⟨1x2⟩⟨2x4⟩⟨1x2⟩⟨2x4⟩ 4-Threaded (4T) ConSSTEP architecture, comparing the Demand, Random, and FCFS inter-algorithms, paired with the Dyn intra-scheduling algorithm. As seen in the figure, the Demand algorithm surpasses the FCFS algorithm by 1.11x, whereas Demand observes a 1.152x speedup in comparison to Rand. Thus the Demand algorithm, as expected, outperforms or in the worst case matches the performance of the other inter-scheduling algorithms for a 4-thread processor. Accordingly, the next section will present various 2T and 4T ConSSTEP configurations considering the


Figure 8.8: Heterogeneous Performance Analysis of Inter-Scheduling Algorithm - 4-Threads

Demand-Dynamic (Dem-Dyn) algorithm, while using the engine combinations of (vertically-oriented) 3-FU, 4-FU, 6-FU and 8-FU engines.

8.4 Summary

This chapter provided various sensitivity studies to determine the ideal ConSSTEP scheduling algorithms for the PhysC to attain the best overall performance and energy efficiency. Based on the results, it was determined that the Demand-Dynamic (Dem-Dyn) algorithm combination generally provided the best results for all engine combinations. Based on the tests, it was also determined that the 2-FU engine configurations, both vertical and horizontal, suffered performance losses largely due to contention arising from the topological orientation of the core and a lack of storage locations. Therefore the next chapter will compare various 3-FU (vertically-oriented), 4-FU, 6-FU and 8-FU engine configurations to the baseline cores, determining ConSSTEP's feasibility as a configurable multi-threaded processor.

Chapter 9

Experimental Results and Analysis

9.1 Introduction

The previous chapter determined the most efficient scheduling algorithm for the PhysC in terms of performance and energy efficiency - Dem-Dyn. This chapter integrates the algorithm into the PhysC, and assesses various engine configurations according to the topologies verified in the previous chapter, comparing such variations of ConSSTEP cores to the conventional SMT and (multi-core) single-thread core (STC) baselines. Such comparisons are used to determine factors of performance, energy efficiency, and area overhead which may validate the feasibility of ConSSTEP. The ideal ConSSTEP engine configurations will also be determined for 2T and 4T workloads. Accordingly, this chapter will assess the following factors and answer the following questions:

• Performance:

– Single-thread IPC improvement: One of ConSSTEP's main objectives is to increase single-thread performance for multi-threaded workloads. Based on the results provided in this chapter, the following chapter will determine whether ConSSTEP was able to provide an improvement in single-thread IPC in comparison to SMT. And more importantly, was ConSSTEP able to achieve a single-thread performance gain in comparison to a STC?

– Throughput improvement: Throughput considers attainable frequency and how fast an application may execute in instructions/second. Thus the topology of an engine and its simple logic may allow ConSSTEP to


execute at a faster rate than complex conventional CPUs. Is ConSSTEP therefore able to provide similar or better throughput when directly compared to a SMT considering all benchmarks? Does ConSSTEP provide a larger throughput improvement over a STC than a SMT does?

• Power and Energy: Power is expressed in Watts (W), representing the instantaneous rate of energy transfer for both the active (dynamic) and inactive (static) transistors/components on a core. Energy, conversely, is measured in Joules (J), i.e. watt-seconds, and expresses the power consumed by a core over a specified time frame (i.e. switching energy). Energy in this chapter therefore signifies the number of joules required to complete an operation, in this case the energy required to execute a given application. Likewise, an increase in frequency may possibly lead to less energy per application due to such execution speedup, however it may also contribute to an increase in dynamic power consumption. Thus both power and energy are taken into consideration to assess and compare ConSSTEP to the baselines. This section of the chapter will then answer the following questions:

– Does ConSSTEP increase energy efficiency for multi-threaded applications in comparison to a SMT core?

– As ConSSTEP scales across various core configurations, are the number of FUs required per engine, its frequency, and its respective attainable performance justified in comparison to a SMT core when considering static and dynamic power consumption?

Only a SMT will be compared in this section to assess power and energy, as a multi-core STC will consume much more power and energy than a ConSSTEP core (i.e. a core replicated many times on a die).

• Area overhead: Considering the performance and energy efficiency of ConSSTEP, what is the relative die size in comparison to the baselines to achieve such gains? Was ConSSTEP able to simplify pipeline logic and reduce die overhead even though it relies on the replication of engines for single- and multi-thread execution efficiency?

• Performance/unit area and Performance/Watt: Based on the parameters obtained above, the rate of computation per unit area and the rate of computation per watt may be scaled and compared across the cores. That


is, the performance results may be directly compared considering a fixed area and power constraint per core for both ConSSTEP and the SMT core. Such metrics are especially important in parallel computers, as the cost of powering a processor may outweigh the cost of the CPU itself [75]; a brief sketch of these metrics follows this list.
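The following minimal C++ sketch shows how these two derived metrics are formed; the throughput and power values are placeholders, while the area value is taken from Table 9.1 purely as an example.

```cpp
// Sketch of performance-per-watt and performance-per-area metrics.
#include <cstdio>

int main() {
    const double throughput_mips = 1200.0;  // hypothetical throughput (millions of instr./s)
    const double power_w = 2.5;             // hypothetical average core power
    const double area_mm2 = 6.94;           // e.g. the <4x4> core area from Table 9.1
    std::printf("perf/W:   %.1f MIPS/W\n",   throughput_mips / power_w);
    std::printf("perf/mm2: %.1f MIPS/mm2\n", throughput_mips / area_mm2);
    return 0;
}
```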

Using the experimental inquiries discussed above, the results presented in this chapter will determine the feasibility of ConSSTEP on various levels in comparison to a conventional SMT and multi-core system. As previously discussed, 2T and 4T configurations are assessed for all core types, where ConSSTEP is constrained by the maximum size of an equivalent SMT die. Specifically, the first section of this chapter addresses 2-thread (2T) processor prototypes for performance, power/energy and area, where the findings are then extended to 4-Thread (4T) prototypes in the second section of the chapter. Finally, the chapter then addresses other architectural details and results, including configuration memory requirements and general statistics of propagation and storage through coarse-grained execution by revisiting the study conducted by Tseng and Patt [6].

9.2 Two-Thread Comparison

9.2.1 Configuration Overhead Concealment Techniques

Figure 9.1: Configuration Overhead Performance - Hardware Double Configuration Register Improvement over Software PhysC Scheduling Technique

Although the sensitivity studies provided in the previous chapter considered a hardware approach using a double configuration technique, this section provides

insight on both a software and a hardware technique for concealing overhead. Specifically, this section presents the general performance attained by ConSSTEP using the two overhead concealing techniques discussed earlier: 1) Software: the PhysC's determination of the maximum clock cycles needed to configure a given engine per thread, versus 2) Hardware: a double configuration register approach, where each configuration register is pre-configured with the predicted bundle determined by the scheduler. The engine configurations presented in this chapter are expressed as ⟨n-FU engine x m-FU engine⟩ using the ⟨x x y⟩ topology annotations derived in the previous chapter, where the number of FUs per engine equals x*y and the total FUs per core equals n+m.

Fig. 9.1 presents the IPC improvement which the hardware approach provides over the software technique, averaged across all benchmarks. As seen in the figure, the hardware approach for all 2T configurations provided more than a 2x performance gain, as the method hides the configuration penalty entirely except in the case of mispredictions, whereas the PhysC technique was only able to conceal a few clock cycles of the total 10 c.c penalty per configuration. As also observed in the figure, the larger the engine, the greater the performance gain using the hardware approach, since the larger engines are generally able to increase issue width and performance. Therefore such aggressive execution with overhead concealment provides more of a performance gain for such engines.
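The sketch below illustrates the ping-pong idea behind the double configuration register: the shadow register is pre-loaded with the scheduler's predicted bundle while the active register drives the engine, so a correct prediction makes reconfiguration a single-cycle swap. The class layout and values are illustrative assumptions, not the actual hardware interface.

```cpp
// Sketch of a double (ping-pong) configuration register.
#include <array>
#include <cstdint>
#include <cstdio>

class DoubleConfigRegister {
    std::array<uint64_t, 2> reg_{};  // two configuration registers
    int active_ = 0;                 // index of the register driving the engine
public:
    void preload(uint64_t predicted_bundle) { reg_[1 - active_] = predicted_bundle; }
    void swap() { active_ = 1 - active_; }   // correct prediction: flip, no stall
    uint64_t active() const { return reg_[active_]; }
};

int main() {
    DoubleConfigRegister cfg;
    cfg.preload(0xBEEF);  // scheduler's predicted next bundle (placeholder value)
    cfg.swap();           // prediction correct: reconfigure without the 10 c.c penalty
    std::printf("active config: %#llx\n", static_cast<unsigned long long>(cfg.active()));
    return 0;
}
```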

As verified by the results obtained here, the double configuration register approach achieves the best performance and will continue to be used for the experimental testing of this work.

9.2.2 Area

The area of ConSSTEP and the baseline SMT ARM CPU for 2-Threads (2T) are modelled according to the specifications previously discussed in Chapter 7 and provided in Table 7.1. Area comparisons for the several ConSSTEP configurations and the baseline SMT CPU model are provided in Table 9.1, presented in mm2. As seen in the table, the baseline 2T ARM CPU possesses the highest area overhead overall. The ConSSTEP core containing 3-FU engines - ⟨3x3⟩ (i.e. ⟨1x3⟩⟨1x3⟩) - was able to decrease area by 56.3%, whereas the largest core consisting of 8-FU engines still decreased area by 52.6% in comparison to the SMT.


Table 9.1: 2-Thread Area Comparison (mm2)

Processor       Area (mm2)
2-Thread SMT    15.87
⟨3x3⟩           6.928
⟨3x4⟩           6.97
⟨4x4⟩           6.94
⟨3x6⟩           6.987
⟨4x6⟩           6.985
⟨6x6⟩           7.485
⟨3x8⟩           7.048
⟨4x8⟩           7.475
⟨6x8⟩           7.51
⟨8x8⟩           7.523

Similarly, as seen in the table, the 4-FU and 6-FU engine configurations also decreased area overhead for 2T cores.

When assessing the transistor budget for all processor structures, ConSSTEP dedicated approximately 5-10% of its die area to computing, 5-17% of the die to scheduling, and 73-90% to configuration memory depending on the engine size. This hardware allocation was also achieved at a smaller die size in comparison to the SMT. Similarly, the SMT CPU die dedicated roughly 80% of its die area to instruction memory, only 4-5% to computation, and the remaining 15-16% to hardware which rediscovers and eliminates data dependencies and maintains program order - all of which are known by the compiler and imposed by the ISA. Thus ConSSTEP was able to eliminate such unnecessary hardware by invoking the PhysC and redesigning the datapath to save area overhead per die, while allotting much more memory for configuration purposes.

9.2.3 Energy and Power

Fig. 9.2 demonstrates various 2T ConSSTEP cores and their respective energy reduction for all benchmark applications in comparison to a 2T SMT processor. The figure only considers the energy required of the active structures to execute the given application, referred to as switching energy. As seen in the figure, the 2T configurations are able to save 37.5% energy on average, with the ⟨3x3⟩ engine obtaining slightly more energy efficiency (37.68% on average). Energy consumption differs between configurations in the magnitude of pJ - nJ, since only a few more links and rS units may be active for benchmark execution, acknowledging that only the switching energy required to execute the application is calculated. Hence the values plotted per benchmark demonstrated very little energy variation between configurations when directly compared to the SMT, however the savings are still significant when compared to the 2T SMT equivalent.

Figure 9.2: Energy Reduction for Various ConSSTEP Configurations

As previously mentioned, energy reduction is attributed to the speedup in execution time (performance is further discussed in the next subsection) and therefore, for certain applications, larger engines may demonstrate higher energy efficiency as they are able to extract higher ILP, generally executing applications faster and requiring less energy. As exhibited in the figure however, ⟨3x3⟩ is able to operate at a higher frequency and achieves the best energy efficiency across all benchmarks. However, for the more memory-intensive and SPr applications such as Barnes and Black which are bound by the memory system, engine size contributes very little to energy efficiency as speedup is still bounded by load/store instructions. In these cases, larger engines are actually less efficient as the applications may activate more components overall with little speedup in comparison to the other engine configurations.
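The switching-energy accounting behind Fig. 9.2 can be sketched as per-access energies (Table 7.3) multiplied by the simulator's activity counts; the access counts below are placeholders, not measured data.

```cpp
// Sketch of switching-energy accounting: energy-per-access times access count.
#include <cstdio>

int main() {
    struct { const char* name; double pj_per_access; long long accesses; } parts[] = {
        {"INT ALU",     1.13,  4'000'000},   // per-access energies from Table 7.3
        {"internal rS", 0.209, 9'500'000},   // access counts are placeholders
        {"INT RRB",     11.09,   600'000},
        {"INT WRB",     2.55,    550'000},
    };
    double total_pj = 0.0;
    for (const auto& p : parts) total_pj += p.pj_per_access * p.accesses;
    std::printf("switching energy: %.3f mJ\n", total_pj * 1e-9);  // pJ -> mJ
    return 0;
}
```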

Fig. 9.3 presents the energy distribution per component for all applications, averaged across all 2T ConSSTEP configurations, as all configurations demonstrated roughly the same relative energy consumption per component considering identical computation and external read/write requirements per benchmark. As observed in the figure, energy distribution was highly reliant on the application's characteristics. For instance, applications such as Ocean with high arithmetic intensity and fewer memory and branch instructions demonstrated FUs (integer and FP) accounting for a bigger share of the energy consumption in comparison to the other benchmarks. Although LU and Water were also computationally intensive, they experienced higher shared write accesses and higher memory accesses, respectively, and as observed in the figure, a larger portion of energy was dedicated to the RRB for interfacing external units. Benchmarks such as Blackscholes which experienced higher banking conflicts (i.e. SPr workloads) and branch instructions also required frequent access to external units as illustrated in Fig. 9.3. Therefore higher energy consumption was observed for the WRB and RRB structures for such memory demands. A direct correlation between the EARF energy and the number of external read/writes required by application bundles was also observed, i.e. Ocean and Stream generally required more EARF reads/writes per bundle than the average bundle. The energy dissipation for configuration (Config), which included mispredicted bundles, accounted for a small portion of the overall energy consumption and therefore configuration was a minimal energy penalty for the ConSSTEP core. Similarly, rS energy accounted for roughly 3-4% of the total energy due to its simplistic design and the PhysC's efficient hop count algorithm.

Figure 9.3: Energy Distribution for ConSSTEP Structures, Averaged for all Benchmarks

Next, Fig. 9.4 displays the energy distribution for the 2T and 4T SMT processor cores. According to the results observed, approximately 80-83% of the SMT energy consumption was attributed to the backend, including reads and writes to the register file, IQ, and bypass network communication. Observing Fig. 9.3, ConSSTEP in comparison eliminated such requirements by 20-30% on average through configuration logic and the rS interconnect, propagating values only when necessary and maintaining scalability without conventional CPU impositions and structures. Accordingly, the rS interconnect mitigated register file energy consumption on average by 9.71x and 7.77x for the 2T and 4T SMT, respectively. In turn, these savings amounted to an average of 52.37% fewer register reads and 55.43% fewer register writes+bypasses for all applications. Similarly, the elimination of conventional front-end structures roughly accounted for 15% energy savings of the SMT's total energy, in addition to the ROB and IQ which accounted for 20% of the overall SMT energy.

Figure 9.4: SMT Energy Distribution for 2T and 4T SMT

Fig. 9.5 displays the data movement energy required per engine, averaged across the benchmark kernels. According to the figure, ⟨3x3⟩, ⟨3x4⟩ and ⟨4x4⟩ provided the most efficient data movement for the ConSSTEP processor, saving approximately 13% energy compared to the other configurations. As expected, the most energy consuming were the larger engines, ⟨8x8⟩ and ⟨6x8⟩, consuming up to 41% more energy in comparison to the smaller engines. However, referring back to Fig. 9.2, such data movement energy consumption is minimal when directly comparing energy savings to the SMT core, yet this factor must still be considered as transistors scale.

Figure 9.5: Comparison of Data Movement Energy for 2T ConSSTEP Configurations

Finally, Fig. 9.6 presents the power savings (dynamic and static) of the various

2T ConSSTEP configurations, normalized to a 2T SMT core and averaged for all benchmarks. As expected, ⟨3x3⟩ was the most efficient core, saving approximately 59% power when compared to the SMT due to its simplistic logic and design. Conversely, the ⟨8x8⟩ configuration saved the least power (32%), with an average savings of 47% across all 2T configurations. As seen in the figure, as more functional units are added to the core, power savings decrease due to both the static and dynamic power consumption of the structures for a given ConSSTEP core.

9.2.4 Performance

Instructions Per Cycle (IPC)

Figure 9.6: Power Consumption Savings for 2T ConSSTEP Configurations, Normalized to SMT

Fig. 9.7 presents the single-thread IPC improvement for all models, normalized to a 2T SMT core. The first column in the figure presents the improvement which an average core on a multi-core STC system provides over a 2T SMT core. As displayed, ConSSTEP configurations on average provide an improvement of 1.39x, where the 2-core STC achieves a 1.25x single-thread IPC improvement in comparison to its SMT equivalent. Accordingly, ConSSTEP also improves the single-thread performance of a STC on average by 1.11x across all benchmarks. As seen in the figure however, certain IPC fluctuations exist across engine configurations, in particular the performance attained by heterogeneous engines in comparison to homogeneous core configurations (for example ⟨3x8⟩ versus ⟨4x4⟩). Since engines may only be assigned to one thread's workload, the more algorithmically intensive bundles which exist in a thread (considering a heterogeneous core) may be mapped to a smaller engine

versus the ideal larger engine. Hence minor performance loss may be exhibited due to the PhysC's thread mapping algorithm within heterogeneous cores. Despite these effects, ConSSTEP's performance gain is still substantial when compared to conventional SMT cores and STC models.

Referring to the individual benchmarks of Fig. 9.7, it is evident that Ocean obtained the greatest single-thread IPC improvement due to its high computational intensity and inherent ILP. Since thread sharing and blocking in the Ocean benchmark were exhibited in the SMT pipeline, larger ConSSTEP configurations were able to adequately support such demanding thread workloads. Conversely, benchmarks such as FFT (which experienced high banking conflicts for independent sequential memory accesses) and SPr workloads (Barnes and Blackscholes) received a small margin of performance improvement with 2T ConSSTEP (approx. 25%). Specifically, benchmarks such as Blackscholes and FFT exhibited smaller performance gains for the smaller engines due to memory scheduling conflicts across the threads. However the larger engines, which possess larger issue widths, were able to overlap more instruction execution to mitigate such memory conflicts in comparison to the smaller engines, which stalled more frequently and therefore achieved similar results to the SMT. Such more memory-intensive applications and/or those with conflicting accesses in general achieved similar results to the standard 2T SMT, however with a slight single-thread IPC improvement of approximately 1.77% or greater in comparison, due to increased frequency.

Figure 9.7: Single-Thread IPC Improvement Over 2-Core STSC

Figure 9.8: Single-Thread IPC Improvement per 2T Configuration, Averaged for all Benchmarks (Normalized to 2-Core STSC)

In accordance with Fig. 9.7, Fig. 9.8 displays the single-thread IPC improvement for the 2T ConSSTEP configurations, averaged across all benchmarks and normalized to the 2T SMT. According to the results obtained, ⟨3x3⟩ and ⟨3x4⟩ provide roughly the same IPC improvement as an average STC, whereas the other engines were able to support wider issue widths, which benefited certain benchmarks as discussed previously and provided higher IPC gains.


Figure 9.9: Cycle Increase Due to Contention for Various ConSSTEP Configurations

Fig. 9.9 demonstrates the percent increase in cycles required by a given configuration due to contention exhibited by the Demand-Dynamic algorithm across all benchmarks (determined to be the most efficient algorithm for both performance and energy). As seen in the figure, all configurations incurred less than a 0.6% increase in cycles due to contention. Particularly interesting is ⟨3x3⟩ (i.e. a ⟨1x3⟩⟨1x3⟩ topology), where the PhysC is able to comfortably re-route and prevent the majority of contention due to its vertical topological orientation.

Throughput

Figure 9.10: Two-Thread Throughput Improvement Normalized over SMT

Fig. 9.10 and 9.11 demonstrate the throughput per benchmark and the average throughput across all benchmarks, respectively, for all 2T engine configurations.

Figure 9.11: Averaged 2T Throughput Improvement Normalized over SMT

The first column in each graph illustrates the SMT improvement over an average single-thread core within a dual-core system. Contrary to the IPC improvements presented in the previous section, throughput also considers the frequency of operation per engine, where ConSSTEP's frequency is limited by the slowest engine present on the core, and this frequency is applied to all engines. Accordingly, the dashed lines of Fig. 9.11 are used to categorize configurations according to the slowest engine present on a core. As seen in the figures, smaller engines present in more extreme configurations such as ⟨3x8⟩ suffer performance losses, as the ⟨3⟩ engine's primary advantage is running at a higher frequency rather than exploiting issue width/ILP as in the case of the ⟨8⟩ engines. Therefore bundles mapped to the ⟨3⟩ engine suffer a performance loss due to this design limitation. As such, cores with 3-FU engines benefit when paired with other smaller engines. Fig. 9.11 also demonstrates that homogeneous ConSSTEP configurations generally outperform heterogeneous configurations within the same frequency category, as each engine within a homogeneous core is able to operate at the maximum issue width and frequency for the given classification.
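To make the throughput metric above concrete, the following minimal Python sketch computes a configuration's throughput improvement when the core clock is capped by the slowest engine present. All frequency and IPC values below are hypothetical placeholders, not simulator or synthesis results.

    # Hypothetical relative frequencies per engine size (smaller engines clock higher).
    ENGINE_FREQ_GHZ = {3: 2.0, 4: 1.8, 6: 1.5, 8: 1.3}

    def core_frequency(engines):
        """The whole core runs at the frequency of its slowest (largest) engine."""
        return min(ENGINE_FREQ_GHZ[e] for e in engines)

    def throughput_improvement(engines, ipc_per_thread, smt_ipc, smt_freq_ghz):
        """Aggregate ConSSTEP throughput (sum of per-thread IPC x shared clock),
        normalized to the SMT baseline's aggregate IPC x clock."""
        f = core_frequency(engines)
        return (sum(ipc_per_thread) * f) / (smt_ipc * smt_freq_ghz)

    # Example: on a <3x8> core the 8-FU engine caps the clock, so the 3-FU engine
    # loses its frequency advantage while keeping its narrow issue width.
    print(core_frequency([3, 8]))
    print(throughput_improvement([3, 3], ipc_per_thread=[0.9, 0.9],
                                 smt_ipc=1.1, smt_freq_ghz=1.2))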

Fig. 9.10 and 9.11 demonstrate that ConSSTEP configurations achieve a performance advantage over both SMT and STC multi-cores. According to the results obtained, the SMT on average achieves 1.32x the throughput of a single core, whereas the ConSSTEP engines achieve 2.4x the throughput of an SMT core due to frequency scaling. As seen in Fig. 9.10, benchmarks generally performed best with the ⟨3x3⟩ core, which provides a higher frequency and a manageable issue width per thread. However, in certain cases such as the computationally intensive benchmark Ocean (and certain other cases such as LU and Stream), ConSSTEP actually benefits from slightly wider issue widths for code phases which require higher ILP. Conversely, applications which are more memory intensive, such as Barnes and Blackscholes, benefit slightly from the frequency gains since they are still limited by the memory system hierarchy.

9.2.5 Performance/Unit Area

Figure 9.12: Two-Thread ConSSTEP Performance/mm2 Improvement

Although this thesis considers ConSSTEP as a single-core architecture during experimental testing, and assuming that performance for all CPUs and ConSSTEP scales fairly linearly as additional cores are integrated within a multi-core system, Fig. 9.12 displays ConSSTEP's instructions per second per square millimetre (performance/mm2) improvement when compared to the SMT baseline processor. On average, 2T ConSSTEP configurations were able to improve the performance/mm2 of SMTs by 4.09x. ⟨3x3⟩ in general provided the best performance/unit area, with ⟨6x8⟩ and ⟨8x8⟩ providing the least improvement, mainly due to frequency and area overhead in comparison to the smaller engines. Such engines however still provide a considerable performance/area gain when compared to a conventional SMT (greater than 3.5x).
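As a worked illustration of this metric, the sketch below combines a throughput ratio with the two die areas to obtain a performance/mm2 ratio; the numbers passed in are illustrative placeholders rather than the measured values.

    def perf_per_area_improvement(throughput_ratio, consstep_area_mm2, smt_area_mm2):
        """(ConSSTEP throughput / ConSSTEP area) divided by (SMT throughput / SMT area)."""
        return throughput_ratio * (smt_area_mm2 / consstep_area_mm2)

    # Illustrative example: a 2.4x throughput gain on a die roughly 40% smaller
    # than the SMT baseline yields about a 4x performance/mm2 improvement.
    print(perf_per_area_improvement(2.4, consstep_area_mm2=4.7, smt_area_mm2=8.0))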

9.2.6 Performance/Watt

Figure 9.13: Two-Thread ConSSTEP Core Throughput/Power Consumption Improvement

Fig. 9.13 plots ConSSTEP's performance/watt improvement over a 2T SMT considering the same power constraint for both dies. According to the figure, ⟨3x3⟩ appears to be the most efficient configuration, mainly due to its throughput speedup and low power consumption, followed closely by the ⟨3x4⟩ core. ⟨3x3⟩ demonstrates an average 11.99% improvement over all other configurations, and therefore provides the most efficient performance/watt for a 2T core. In comparison to the SMT, all configurations demonstrate an average performance/watt improvement of 1.18x.

Since ⟨8x8⟩ generally possesses the least gains for performance/area and performance/watt, the next section omits 8-FU engines and conducts further tests for 4-thread configurations and efficient area, power, and performance per core.

9.3 Four-Thread Results

9.3.1 Area

Area comparisons for the 4T processors employ the same evaluation methodology as discussed for 2T, however considering a 4T pipeline, i.e. the extra buffering, extended bit tags, and thread select logic required for a 4T SMT versus 2T. Table 9.2 presents all processor die area results. The 4T SMT required less than a 1% area overhead for adding two more threads to the datapath. ConSSTEP's area overhead however increased linearly with the two threads added, due to the replicated engine logic and the scheduler which must monitor two more engines concurrently. Nevertheless, the ConSSTEP architecture is still able to reduce area overhead in the best case by 13.05% (i.e. ⟨3x3x3x3⟩), and in the worst case (i.e. ⟨6x6x6x6⟩) by 6.5% for a 4T core.


Table 9.2: 4-Thread Area Comparison (mm2)

Processor       Area (mm2)
4-Thread SMT    16.01
⟨3x3x3x3⟩       13.856
⟨3x4x3x3⟩       13.868
⟨3x3x4x4⟩       13.898
⟨3x3x3x6⟩       13.913
⟨3x3x4x6⟩       13.915
⟨4x4x4x4⟩       13.88
⟨3x4x4x6⟩       13.955
⟨3x3x6x6⟩       14.413
⟨4x4x4x6⟩       13.925
⟨3x4x6x6⟩       14.425
⟨4x4x6x6⟩       14.455
⟨3x6x6x6⟩       14.472
⟨3x4x4x4⟩       13.91
⟨6x6x6x6⟩       14.97

The same transistor budget allocation for memory, scheduling, etc. which was discussed previously for 2T ConSSTEP scaled in the same manner for a 4T ConSSTEP core, except at a slightly larger die size. In addition to the overhead incurred by adding two additional threads' engines, Chapter 7's Table 7.3 demonstrates the additional unified scheduling logic required by ConSSTEP. Specifically, the scheduler's area nearly quadrupled to accommodate two extra threads, mainly attributed to the increased number of ports required by the scheduler to monitor all threads concurrently. Thus scheduler scalability remains future work.

9.3.2 Performance

IPC

Fig. 9.14 presents the single-thread IPC improvement of various 4T core models normalized to a 4T SMT, where the first column represents an average STC’s IPC improvement within a 4T multi-core in comparison to a 4T SMT. Based on the results obtained, ConSSTEP provides an average single-thread IPC improvement of 1.418x over a 4-core STC, and 2.41x improvement in comparison to the 4T SMT.


Figure 9.14: Four-Thread IPC for Various ConSSTEP Configurations

Similarly, the STC is able to provide 2x the single-thread performance gain on average in comparison to its SMT equivalent. Thus ConSSTEP is able to provide an IPC performance gain in comparison to both the SMT core and the STC.

Considering a 4T model, it is theoretically expected that the larger the engines present in a given ConSSTEP core, the greater the IPC gain, as the core may support a greater issue width. In actuality, as previously discussed in the case of 2T cores, the PhysC's scheduling algorithms have an effect on heterogeneous configurations. Consequently, performance in certain cases does not follow the expected trends and demonstrates slightly smaller gains when compared to the homogeneous cores. Additionally, certain threads may also possess low inherent ILP and therefore, when executed on larger engines, do not exhibit substantial IPC gains in comparison to other configurations (i.e. the memory-intensive applications such as Black etc.). For more computationally intensive applications such as Ocean however, it is observed that cores with generally larger engines benefit from the threads' higher inherent ILP according to the expected trends.

As seen in the figure, statistical observation concludes that ⟨6x6x6x6⟩ on average provides the greatest gain due to its larger engine size, issue width, and greater resource availability for the more computationally intensive applications. All ConSSTEP engines in general however were able to fully support the given workloads as threads scaled, with each engine exhibiting minimal pipeline contention in comparison to the SMT, with the exception of certain memory operations. Conversely, for 4T workloads, conventional SMT pipelines experienced greater contention delays as the number of threads in the pipeline scaled due to resource sharing and thread blocking. Consequently, a penalty in single-thread IPC performance was exhibited. For memory-intensive applications (SPr), and even in the worst case of Stream with its high cache misses, the best ConSSTEP configuration was still able to achieve a 25.1% single-thread performance gain over the 4T STC model due to the reduction of instruction processing overhead and misprediction penalties.

Throughput

Figure 9.15: Four-Thread Throughput Comparison for Various ConSSTEP Configurations

Fig. 9.15 demonstrates the average throughput per 4T engine configuration, and the multi-core STC throughput improvement over a 4T SMT equivalent. Similar to the 2T throughput section, the configurations in the figure are also divided according to the engine which limits the attainable frequency of operation. The 4T results follow the same trends as those of the 2T ConSSTEP cores, i.e. homogeneous cores achieve the best throughput results over heterogeneous cores, as all engines operate at the maximum issue width and frequency possible for the given classification.

As seen in the results, the 4-core STC achieves a 1.55x throughput improvement over a 4T SMT, whereas ConSSTEP engines on average achieve 3.13x the throughput of the SMT due to frequency, wider issue widths, and reduced instruction processing overhead. In comparison to the 2T ConSSTEP configurations presented previously, a greater performance gain was also exhibited as all engines work concurrently with minimal thread blocking in comparison to the SMT. ConSSTEP's overall limitation for additional performance gain however is directly related to the scheduler's scaling limitation in area overhead and power consumption. Therefore, considering the maximum die size presented by the SMT, ConSSTEP is limited to 4-6 threads depending on the configuration, due to such scheduler scalability. Accordingly, this chapter now presents performance/unit area, where the scheduler issue is further discussed in Chapter 10.

9.3.3 Performance/Unit Area

Figure 9.16: Four-Thread Performance/mm2 Comparison for ConSSTEP Configurations

Fig. 9.16 displays the instruction throughput per square millimetre (performance/mm2) improvement when compared to the 4T SMT baseline processor equivalent. On average, 4T ConSSTEP was able to improve the performance/mm2 by 2.27x, where ⟨3x3x3x3⟩ provided the best performance/unit area of 2.5x, and ⟨3x3x3x6⟩ and ⟨3x3x6x6⟩ provided the least improvement (however still greater than 2.2x). Such heterogeneous engine configurations were therefore limited by the largest engine, with direct correlation to the throughput analysis discussed previously. The 4T configurations in general provided less of a performance/area improvement than the 2T configurations since the engines and scheduler doubled in area to scale to the additional threads. Accordingly, ConSSTEP's performance/unit area gain was halved to support an additional two threads, nevertheless still achieving more than double the performance/unit area improvement of an SMT, and magnitudes of improvement over a 4-core STC (i.e. four copies of the same core on a die versus a single ConSSTEP core).


9.3.4 Energy and Power

Figure 9.17: Four-Thread Energy Distribution for ConSSTEP Structures

Figure 9.18: Four-Thread Energy Saving for Various ConSSTEP Engine Configurations versus SMT

Fig. 9.17 presents the energy distribution per ConSSTEP core structure per application, averaged across all 4T ConSSTEP configurations. Similar to the 2T cores, all configurations demonstrated the same relative energy consumption per component considering the overall system and the equivalent operations required per benchmark. Accordingly, consumption was also highly reliant on an application's characteristics for 4T cores. As expected, when directly comparing the 2T ConSSTEPs to their 4T equivalents, the scheduler consumed a greater amount of energy. Since the scheduler monitored two more threads concurrently, more energy was required and hence a greater portion of the total energy distribution was allotted to the scheduler. Aside from the scheduler's energy, all other structures remained similar to the 2T case, consuming energy according to the benchmark properties discussed previously.

Fig. 9.18 demonstrates the energy savings of a 4T ConSSTEP core in comparison to a 4T SMT, displaying similar, slightly varying energy consumption trends per core type (as in the case of the 2T ConSSTEP). 4T ConSSTEP cores save on average approximately 58.93% energy in comparison to the 4T SMT. Specifically, the 4T cores increase energy/application efficiency by approximately 1.57x in comparison to an SMT core equivalent. Such improvement is again attributed to the speedup in execution time and the non-blocking characteristics of the engines as threads scale, whereas contention and blocking increase in SMT processors as threads scale, increasing the execution time and energy required per thread. However, since the number of structures within a ConSSTEP core increases as threads scale, power consumption also increases, as discussed next.

Figure 9.19: Power Consumption Savings for 4T ConSSTEP Configurations, Normalized to SMT

Fig. 9.19 presents the dynamic and static power savings for the various 4T ConSSTEP configurations, normalized to the SMT core equivalent. In the case of 4T cores, the results demonstrate that the most efficient configurations are the ⟨3x3x3x3⟩ and ⟨3x4x3x3⟩ cores, saving approximately 24% and 22% power when compared to the SMT, respectively. The 4T cores however consume more power in comparison to the 2T versions due to the additional logic required to scale ConSSTEP to four threads. Thus power savings are less than the savings discussed for 2T, nevertheless still saving a conservative amount of power when compared to a standard SMT. Conversely, the ⟨6x6x6x6⟩ configuration consumes more power than the SMT, and hence this irregularity is observed in the figure; the irregularity is due to the increased number of integer-based FUs and the scheduler. Similarly, ⟨3x6x6x6⟩ and ⟨4x4x6x6⟩ consume nearly the same power as the SMT for similar reasons, i.e. more 6-FU engines present on the core and the monitoring required by the scheduler. An average power savings of 11% however exists across all 4T configurations.

9.3.5 Performance/Watt Comparison

Figure 9.20: Four-Thread Throughput/Power Comparison for ConSSTEP Configurations

Fig. 9.20 plots the throughput improvement per watt, normalized to the 4T SMT, displaying the most efficient configurations in terms of performance and power for the various 4T ConSSTEP configurations. According to the plot, ⟨3x3x3x3⟩ provides the best performance per watt due to its increased frequency and smaller engine sizes, whereas ⟨6x6x6x6⟩ yields the least improvement, as expected based on the power outcomes presented in the previous subsection. Overall, ⟨3x3x3x3⟩ is on average 2x more efficient than the other configurations presented.


9.4 Other Architectural Statistics

9.4.1 Configuration Memory

Table 9.3: Configuration Memory Specifications

Config Mem             Timing Specifications                Total Storage (KB)
rS                     5 c.c/rS                             240
FUs                    2 c.c/FU                             120
RRB                    4 c.c                                20
WRB                    4 c.c                                4
(Pre-)Load Memory      2-3 c.c/load                         2
Extern Reg File        2 c.c/two reads (multi-port)         1

Many conventional pipeline structures are eliminated in ConSSTEP through the use of configuration data and simple engine structures, i.e. the IQ, ROB, Rename, ID, Bypass, etc. Therefore this section assesses the requirements imposed by such configuration data storage. The following outlines the average configuration (banked) memory storage requirements per engine component:

- FUs: Configured in twos - 16Kb (i.e. 8Kb per ALU)
- rS: Configured in twos - 16Kb
- RRB: 8Kb
- WRB: 1Kb
- EARF R/W: 0.25 Kb
- Pre-Load Memory: 0.5kB

The total memory requirements per engine component (aggregated storage) are also outlined in Table 9.3 and are averaged per application. For larger and more intensive applications than the benchmarks presented here, the configuration memories must be treated as data/instruction caches, transferring configuration data through multiple levels of cache. Overall, the statistics presented here demonstrate that the scalability and reconfigurability of the engines and interconnect come at a cost of approximately 5KB per additional row added to an engine. Therefore, ConSSTEP requires more C-Cache data storage in comparison to a conventional CPU's I-Cache, however with the advantage of eliminating various dynamic pipeline structures while providing performance gains and an energy-efficient design.
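The ~5KB-per-row figure can be sanity checked from the per-component sizes listed above. The short sketch below does this; the assumed composition of a row (one FU pair and one rS pair, plus RRB and WRB configuration) is an illustrative assumption, not the exact breakdown used in the thesis.

    KBIT = 1024  # bits per Kb

    # Per-component configuration storage, mirroring the list above (in bits).
    COMPONENT_BITS = {
        "fu_pair": 16 * KBIT,   # two FUs configured together (8Kb per ALU)
        "rs_pair": 16 * KBIT,   # two rS units configured together
        "rrb":      8 * KBIT,
        "wrb":      1 * KBIT,
    }

    def row_config_kbytes():
        """Approximate configuration storage (KB) added by one engine row,
        assuming one FU pair and one rS pair per row (an assumption)."""
        bits = sum(COMPONENT_BITS.values())
        return bits / 8 / 1024

    print(row_config_kbytes())   # ~5.1 KB, in line with the ~5KB-per-row estimate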

9.4.2 Revisiting Tseng and Patt with ConSSTEP

Figure 9.21: Propagation, Temporary Storage, and External Read/Write Requirements of Operand Dependencies with ConSSTEP

One of the prime objectives of ConSSTEP is to eliminate data transport issues by exploiting value lifetime. This section revisits and analyzes ConSSTEP's storage and propagation statistics, and compares them to the findings presented by Tseng and Patt [6]. Fig. 9.21 demonstrates the overall operand dependency characteristics observed by ConSSTEP, averaged across all 2T and 4T configurations for each benchmark. Specifically, the figure displays the percentage of instruction operands which require propagation, temporary storage on the rS network, and external register file storage during bundle execution. Referring back to the benchmark statistics gathered by Tseng and Patt [6], 70% of the broadcasts in conventional processors which are sent to the IQ are not required, as they have been forwarded to their single consumer using the bypass network. Similarly, 74% of the results stored to the register file have already been forwarded to their single consumer, never read, and/or overwritten by other instructions, where 80% of the values have a lifetime of 32 instructions or fewer.

According to the statistics obtained by ConSSTEP during experimental testing, approximately 58.25% of the operands were directly propagated to their consumers throughout the network, 36.33% of the operands required temporary storage for an average of two instructions or fewer, and on average 5.42% of the operands required external register file storage due to inter-bundle dependencies. Therefore, by using the ConSSTEP approach, operand lifetimes were maintained within the bundles, allowing approximately 94.58% of instruction operands to be directly forwarded to their consumer(s) without register file intervention (whether directly propagated or briefly stored in the engine), where only 5.42% of the operands required EARF reads and writes. In comparison to the statistics obtained by Tseng and Patt, ConSSTEP therefore improves operand forwarding by 20.58% due to register file access avoidance during bundle execution, effectively minimizing register file IO port requirements, its centralized access, and the general unscalability of the bypass network.
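The breakdown above can be expressed as a simple classification over operand outcomes; the sketch below recomputes the three percentages and the overall forwarding fraction from raw counts. The counts used are illustrative stand-ins chosen to roughly reproduce the reported averages, not simulator output.

    def operand_breakdown(propagated, temp_stored, earf_stored):
        """Classify operand outcomes and report the percentages discussed above."""
        total = propagated + temp_stored + earf_stored
        pct = lambda n: 100.0 * n / total
        return {
            "propagated %": pct(propagated),
            "temporarily stored on rS %": pct(temp_stored),
            "EARF reads/writes %": pct(earf_stored),
            "forwarded without register file %": pct(propagated) + pct(temp_stored),
        }

    # Illustrative counts approximating the reported 58.25% / 36.33% / 5.42% split.
    print(operand_breakdown(propagated=5825, temp_stored=3633, earf_stored=542))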

9.4.3 Load/Store Unit Scaling

Given that the number of FUs per engine has been increased in ConSSTEP to obtain a computational performance gain in comparison to conventional processors, it was observed that various benchmarks are further bound by memory operations. To view the effects of increasing concurrent memory operations in ConSSTEP, Fig. 9.22 presents the IPC, cycle time, and energy/access increase obtained by scaling the number of LSUs per engine which access the cache, averaged for all benchmarks. The figure displays trends for a 2T ConSSTEP core, normalized to 2 LSUs/engine for cycle time and energy/access, where IPC is normalized to an STC. Cycle time (ns) and energy per access (nJ) were both obtained using CACTI estimates.

Figure 9.22: Two-Thread ConSSTEP LSU Scaling

As seen in Fig. 9.22, an average performance improvement of 4% was obtained by using 2 LSUs/engine in comparison to one LSU. Adding a third LSU/engine displayed only 1% further IPC improvement due to contention, where two more LSUs provided only a 0.75% additional performance increase. On average, cycle time increased at a slow linear rate, whereas energy/access increased rather exponentially. Accordingly, the inclusion of two LSUs per engine (multiplexed to a 2-write port D-cache) provided fair performance gains while maintaining the energy/access of the cache within reasonable bounds.
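The trade-off just described is essentially a marginal-gain analysis; the sketch below walks through it with illustrative numbers that mirror the trends of Fig. 9.22 (they are not the measured CACTI or simulator values).

    # Illustrative map: lsu_count -> (IPC gain over 1 LSU in %, relative energy/access).
    LSU_OPTIONS = {
        1: (0.00, 0.8),
        2: (4.00, 1.0),   # normalization point for energy/access
        3: (5.00, 1.6),
        4: (5.75, 2.6),
    }

    def marginal_analysis(options):
        """Print the extra IPC bought by each additional LSU and its energy cost."""
        prev = None
        for n in sorted(options):
            ipc, energy = options[n]
            if prev is not None:
                p_ipc, p_energy = options[prev]
                print(f"{prev}->{n} LSUs: +{ipc - p_ipc:.2f}% IPC, "
                      f"energy/access x{energy / p_energy:.2f}")
            prev = n

    marginal_analysis(LSU_OPTIONS)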

9.5 Summary

9.5.1 2-Thread ConSSTEP

According to the results obtained, the most efficient 2T ConSSTEP engine configuration was the ⟨3x3⟩ core due to its small relative size, low power consumption, higher frequency, and low area overhead. Its minimized critical path was therefore able to increase throughput in comparison to the other configurations tested, however at the expense of a lower issue width in comparison to a 2T STC. ⟨3x3⟩ however demands less of the other architectural structures such as configuration memory, FUs, and general PhysC complexity. In the case that the frequency of operation may be further optimized (whether by variable frequency per engine or deeper pipelining within each engine), and that area overhead may be further mitigated, larger configurations such as ⟨4x6⟩ and ⟨4x8⟩ also appear as promising solutions for increasing the issue width and general performance of a core. However, considering the architectural design presented in this thesis, the most efficient 2T ConSSTEP configuration was determined to be the ⟨3x3⟩ core.

9.5.2 4-Thread ConSSTEP

Similar to the 2T case presented, based on the results obtained the most efficient 4T ConSSTEP configuration evaluated was the ⟨3x3x3x3⟩ engine architecture. In accordance with the reasons mentioned above for 2T, as threads scale on a ConSSTEP core, area overhead and power consumption also increase. Consequently, the larger the engines and the more threads supported per core, the greater the area and power overhead required. However, as seen in the results, ConSSTEP nevertheless presents a more performance-scalable solution than the SMT, as thread blocking is still problematic in that model.


9.6 Conclusion

This chapter provided experimental results for the ConSSTEP architecture, with detailed comparisons to the SMT and STC multi-core baselines. Specifically this chapter raised various questions pertaining to performance, energy efficiency, power consumption, and area overhead of ConSSTEP to determine the results and feasibility of the core. The following is a summary of the findings based on the 2T and 4T core prototypes tested:

• Performance:

– IPC: On average, ConSSTEP was able to improve the single-thread IPC of a 2T SMT by 1.39x, and achieve a 1.11x single-thread performance gain over STC models. Scaling the core to 4T, ConSSTEP improved IPC by 2.41x in comparison to a 4T SMT and achieved 1.41x the single-thread IPC of an STC. ConSSTEP was therefore successful at mitigating much of the thread blocking imposed by typical SMTs, while eliminating much of the processing overhead per instruction, which is especially effective during speculative execution and for single-thread performance.

– Throughput: Considering the increase in attainable frequency of operation, an average 2T ConSSTEP core was able to achieve 2.4x the throughput of an SMT, whereas a 4T core achieved 3.13x the throughput as threads scaled. Hence the simple logic which increased ConSSTEP's frequency greatly contributed to ConSSTEP's performance increase, especially for non-memory-intensive applications. Memory-intensive applications however were also able to receive slight benefits from the increase in frequency and/or issue width attained by ConSSTEP.

• Energy Efficiency: Through experimental results, ConSSTEP was able to achieve 37.5% more energy efficiency when compared to a 2T SMT, and 58.93% when compared to a 4T SMT. In comparison to an STC multi-core system which duplicates cores on a die, ConSSTEP would therefore exhibit substantially greater energy efficiency.

• Power consumption: Considering all active and inactive components on chip, ConSSTEP achieved 59% power savings in comparison to a 2T SMT, and on average 11% in comparison to a 4T SMT; the lower 4T savings are mostly due to component duplication and scheduler complexity in the ConSSTEP core.


• Area overhead: According to the physical model results obtained for the cores, ConSSTEP was able to reduce the area overhead of a 2T SMT die on average by 52.6% - 56.3%, and by 6.5% - 13.05% for a 4T SMT. Due to component duplication and the scalability issues of the scheduler, ConSSTEP was limited to 4 threads per core. ConSSTEP however greatly increased performance while providing non-blocking thread behaviour for multi-threaded workloads when compared to conventional processor cores for the given die area.

• Performance/Unit Area: According to the results obtained from the simulator and physical modelling, considering the same area per core, an average 2T ConSSTEP core was able to achieve 4.09x the performance/unit area of a 2T SMT, and a 4T core achieved 2.27x that of a 4T SMT.

• Performance/Watt: Similarly, considering the same power constraint per die, a 2T ConSSTEP core on average achieved 1.18x the performance/watt of a 2T SMT, and a 4T core achieved 2x that of a 4T SMT.

Therefore, according to the experimental results presented in this chapter, ConSSTEP is a feasible solution for increasing an SMT's single-thread performance, throughput, and energy efficiency with generally less area overhead per die. ConSSTEP is able to support up to 4 threads considering a conventional SMT's maximum die size.

Chapter 10

Conclusions and Future Work

Increasing both single-thread performance and throughput on a single core is a challenging and difficult task, especially considering issues of hardware complexity and power consumption. This dissertation introduced the ConSSTEP core which completely deviated from conventional processor designs, integrating reconfigurable properties to allow for higher issue widths, the elimination of several datapath bottlenecks, scalable data transport, and higher single-thread performance and energy efficiency on a single chip.

ConSSTEP is a configurable multi-threaded processor which employs the concept of logical and physical compilation to support an underlying reconfigurable architecture. Specifically, the core consists of a variety of execution units (engines), each containing functional units (FUs) interconnected through a configurable, single-cycle multi-hop registerSwitch (rS) interconnect. The rS units solve the data transport bottleneck of conventional processors by providing both temporary distributed storage and data propagation properties, relieving the need for constant access to a register file, bypass network, and instruction queue.

Each thread of a multi-threaded workload receives a dedicated engine, where multiple engines work concurrently on a chip to provide high instruction throughput. Thus single-thread performance remains unaffected by resource contention, as is the case in other multi-threading models. The ConSSTEP architecture and physical compilation technique were therefore successful at increasing both the single-thread performance and the throughput of multi-threaded workloads.

Accordingly, the following answers address the Research Questions raised in Chapter 1.2:


Research Question 1: What nuanced computing model and datapath design may mitigate instruction control and processing overhead to emphasize data processing and increase a single-thread's performance? How can the model be applied to a multi-threaded workload domain to improve upon an SMT model's limitations?

Answer 1: A configurable datapath model, referred to as ConSSTEP, was defined as a nuanced computing model. The processor successfully mitigated instruction control and processing overhead, where the datapath was completely altered to eliminate several conventional core and SMT model limitations. The ConSSTEP model effectively increased single-thread performance, and was directly applied to the multi-threaded workload domain, which significantly improved area, performance, and energy efficiency per core.

Research Question 2: Is it possible to alter the triple set characteristics of a Computing System with Programmable Procedure (CSPP) in order to mitigate the effects of redundant bandwidth, unscalable structures, and the general bottlenecks raised in Research Question 1? If the triple set is altered, how will this affect the ISA and software/compiler compatibility?

Answer 2: The triple set characteristics of the Computing System with Programmable Procedure (CSPP) model were successfully re-defined as AConSSTEP = {Ci, ∼Li,j, ∼Pi,j}. Configurable properties were integrated into the datapath, i.e. flexible links (∼L) interconnecting the functional components (C), which mitigated several conventional CPU effects including redundant bandwidth, buffering, instruction processing overhead, and unscalable hardware structures. Configurable bottleneck properties (i.e. additional latencies, hardware overhead, configuration memory) were decreased by taking these factors into consideration during the datapath's design. Specifically, configuration latencies were eliminated using a distributed configuration banking system and an aggressive pipeline with simultaneous execution and configuration. Hardware overhead was also eliminated by integrating configurable logic within the processor's design. However, additional configuration memory requirements were necessary in place of the instruction cache (I-cache).

In terms of software/compiler compatibility, the reconfigurable triple set required a secondary compilation phase, referred to as the Physical Compiler (PhysC), to generate the configuration logic. The first phase, referred to as the Logical Compiler, i.e. any standard compiler, was kept to maintain software and binary compatibility. This compilation phase provided an application binary, where the binary was then input to the PhysC, generating bundles of instructions which were translated to configuration logic. Such logic was then used by the underlying architecture to execute bundles based on their respective dependencies. Therefore current ISAs and software/compilers remain fully compatible with ConSSTEP.
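A minimal conceptual sketch of this two-phase flow is given below, purely to illustrate the division of responsibility between the Logical Compiler and the PhysC. The function names, the toy bundling rule, and the placeholder configuration encoding are illustrative assumptions, not the actual PhysC algorithm.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Bundle:
        instructions: List[str]                        # dependent instructions grouped together
        config_words: List[int] = field(default_factory=list)

    def logical_compile(source_files):
        """Phase 1: any standard compiler; produces an ordinary application binary."""
        return ["ld r1,0(r2)", "add r3,r1,r4", "st r3,0(r2)", "mul r5,r6,r7"]   # stand-in binary

    def physical_compile(binary):
        """Phase 2 (PhysC): group instructions into bundles and translate each
        bundle into configuration data for the engines and rS interconnect."""
        bundles = [Bundle(binary[i:i + 3]) for i in range(0, len(binary), 3)]   # toy bundling rule
        for b in bundles:
            b.config_words = [hash(i) & 0xFFFF for i in b.instructions]          # placeholder encoding
        return bundles

    for b in physical_compile(logical_compile(["app.c"])):
        print(b.instructions, [f"0x{w:04x}" for w in b.config_words])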

Research Question 3: Is it possible to have a smarter compilation process to extract application characteristics and eliminate various datapath structures? If so, is it possible to maintain compatibility with standard software, ISAs, and programming models? If compatibility is maintained, how will the software interact with the processor's hardware to convey such application characteristics?

Answer 3: As mentioned in Answer 2, a two-layer compilation process was developed in this thesis to successfully extract application characteristics within the compiler, eliminating several front-end datapath structures. Accordingly, such a framework required no ISA amendments/extensions and maintained full software compatibility. The software was able to interact with and convey application characteristics to the underlying processor hardware through reconfigurable logic and its respective hardware properties. Specifically, as ConSSTEP invokes a configurable interconnect, the PhysC generates the core's configuration data based on the given workload's communication characteristics and data dependencies. This data conveys the application's requirements to the underlying hardware as simple control logic, eliminating the need for conventional structures to re-discover application characteristics at the cost of area, power, and additional processing latencies.

Research Question 4: Once the processor is built, how will exceptions and/or mispredictions be handled?

Answer 4: Based on ConSSTEP's architecture, mispredictions and exceptions were designed to be handled in a simple and efficient manner, especially when directly compared to conventional processors. Exceptions are handled through the technique of "pausing", which uses the double configuration register approach introduced in this thesis. Specifically, when an exception is raised by the executing configuration register set, the alternate set may be reconfigured to execute the exception handler while the executing set is simply paused and then resumed after the exception is resolved. Similarly, mispredictions use the alternate set to verify that the misprediction is not a loop. If the misprediction is not a loop, then a reconfiguration penalty is incurred; otherwise the misprediction penalty is completely avoided. Both approaches completely eliminate the need to flush the pipeline and checkpoint various structures, i.e. store and restore state prior to the exception/misprediction. Thus ConSSTEP handles exceptions simply and efficiently.
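The sketch below models the pausing mechanism conceptually with two configuration register sets per engine; the class and method names are illustrative, not the actual hardware interface.

    class Engine:
        """Toy model of one engine with a double configuration register set."""
        def __init__(self):
            self.config_sets = [None, None]   # two configuration register sets
            self.active = 0                   # index of the currently executing set
            self.paused = None                # index of a paused set, if any

        def run(self, config):
            self.config_sets[self.active] = config

        def raise_exception(self, handler_config):
            # Pause the executing set and run the handler on the alternate set;
            # no pipeline flush or state checkpoint is needed.
            self.paused = self.active
            self.active = 1 - self.active
            self.config_sets[self.active] = handler_config

        def exception_done(self):
            # Resume the paused set exactly where it left off.
            self.active, self.paused = self.paused, None

    e = Engine()
    e.run("bundle_A_config")
    e.raise_exception("handler_config")   # bundle A is paused, not squashed
    e.exception_done()                    # bundle A resumes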

10.1 Addressing Limitations

Although there were many advantages to the ConSSTEP architecture, limitations did exist as discussed in Chapter 3 and during experimental results. Therefore this section presents and lists possible solutions for addressing such limitations in future work:

Scheduler Scalability: This thesis presented a unified scheduler approach for monitoring thread workloads and distributing bundles to each thread's engine concurrently. A unified scheduler was especially required to provide synchronization across the threads and to simplify exception handling. As seen during experimental results however, such a unified approach led to double the area and power consumption for every two threads added to the system, limiting massive scalability on chip. Future work therefore involves a design which provides scheduler scalability in ConSSTEP.

A possible solution to allow for such scheduler scalability may involve adding dedicated synchronization mechanisms to monitor barriers, while integrating individual thread schedulers per engine. Similarly, a dedicated exception handling unit could be included in the core, however in a scalable manner without IO port explosion. Although such a scheduler design could possibly benefit ConSSTEP, a core's scalability would thereafter depend on the memory system bottleneck, as in the case of conventional CPUs.

Sharing FP Units Between Engines: A minor scalability issue in ConSSTEP involved the FP engines. Specifically, an FP unit was assigned to each engine/thread to ease the PhysC's scheduling complexity. As seen in the experimental results provided however, such FP logic added to the area overhead and power consumption per engine. FP operations in general were also infrequent across the applications.


Therefore a more ideal solution would be to share FP engines between integer engines for greater area and power scalability. A dynamic protocol between the integer WRB and the FP RRB would therefore be required to implement such sharing, and remains future work.

Runahead Techniques for Improving Memory Performance: The memory system invoked by ConSSTEP was implemented in an identical manner to the baseline cores to directly assess the benefits of a nuanced processor architecture. Experimental testing demonstrated that ConSSTEP was successful at increasing the performance of multi-threaded workloads, with a slight performance advantage for more memory-intensive/SPr applications. Given that ConSSTEP mitigates the many impositions of pipeline flushes and state restoring, it would be possible to increase its memory performance using runahead thread techniques [17, 18].

Runahead threads allow a conventional pipeline to continue executing instructions with a fake "load/store" value when the oldest instruction in the pipeline is an L2 miss (or a lock in the case of multi-threaded workloads). The instruction stream continues to execute with the fake value, where future load/store instructions are issued in order to prefetch data and mitigate future misses. Context must be saved in runahead mode however, so that the previous state (prior to the fake value) may be restored once the L2 miss returns. Thus all instructions which execute after the L2 miss must be squashed, consuming dynamic power for the possibility of data prefetching, and the pipeline must be restored regardless of the value used for runahead.
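For illustration, the toy sketch below walks through the conventional runahead control flow just described: checkpoint on the miss, run ahead with a fake value purely to issue prefetches, then squash and restore. The structures and the fixed miss latency are illustrative assumptions.

    class ToyPipeline:
        def __init__(self, instructions):
            self.instructions = instructions   # list of (op, address_or_None)
            self.pc = 0
            self.prefetches = []

        def save_state(self):
            return self.pc                     # checkpoint taken before runahead

        def restore_state(self, checkpoint):
            self.pc = checkpoint               # prefetches are the only kept effect

    def runahead(pipe, miss_latency):
        checkpoint = pipe.save_state()
        for _ in range(miss_latency):          # run ahead until the miss data returns
            if pipe.pc >= len(pipe.instructions):
                break
            op, addr = pipe.instructions[pipe.pc]
            if op in ("load", "store"):
                pipe.prefetches.append(addr)   # future misses are prefetched early
            pipe.pc += 1                       # executed with a fake value; results discarded
        pipe.restore_state(checkpoint)         # squash runahead work and resume from checkpoint

    p = ToyPipeline([("add", None), ("load", 0x100), ("mul", None), ("load", 0x200)])
    runahead(p, miss_latency=3)
    print(p.pc, p.prefetches)                  # pc restored to 0; 0x100 was prefetched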

Similar to the case of exception handling, the store, restore, and flush latencies of the pipeline are costly to performance. Since ConSSTEP possesses much simpler "pausing" logic to handle such situations faster, a revised runahead method appears promising, removing the need to save and restore state while still enabling prefetching and further performance gains for memory-intensive workloads. Hence a runahead ConSSTEP implementation remains future work.

10.2 Future Work

The following presents other possible solutions to further improve the quality and functionality of the ConSSTEP processor:


SIMD Support: This thesis directly addresses issues of ILP and TLP, leaving Data-Level Parallelism (DLP) as future work, i.e. the integration of SIMD units in the core. It is possible that a technique such as that employed for the FP engine may be used to integrate SIMD units, however in a shared manner. Direct consideration of the memory banking requirements for SIMD and general vector processing must also be investigated. It is possible that the engine units themselves may also be vectorized, invoking rS unit techniques for temporary storage. However, due to such complexities and design issues, SIMD support remains future work.

Apply core morphing to ConSSTEP: Similar to the CoreFusion [57] and MorphCore [10] approaches discussed, ConSSTEP may apply dynamic thread monitoring in the core to temporarily shut down engines, or fuse engines together to provide higher issue width for more computationally intensive workloads such as Ocean.

Revising the primary compiler to avoid unnecessary register spilling: As ConSSTEP was successful at mitigating register file accesses and exploiting operand lifetimes using engines and the rS interconnect, it is possible that loads and stores due to register spilling may be tolerated in a more efficient manner by the primary compiler. That is, the compiler could reduce the memory accesses it generates due to the insufficient architectural registers present in the ISA. Although a fairly complex endeavour, such a compilation approach remains future work.

Multi-core adaptability: The ConSSTEP architecture presented in this work considers a single-core multi-threaded processor. However, ConSSTEP may easily be adapted to a multi-core design to provide more significant performance gains. Since the memory system invoked by the ConSSTEP core is identical to that of conventional cores, the microarchitectural design presented in this thesis may easily support the same coherence protocols without any architectural modifications. Accordingly, the ConSSTEP cores presented in this thesis may be applied to heterogeneous and/or homogeneous multi-core systems without amendments; this remains future work.

Bibliography

[1] S. Borkar and A. Chien. The future of microprocessors. In Communications of the ACM, volume 54, pages 67–77, 2011.

[2] M. D. Hill and M. R. Marty. Amdahl’s law in the multicore era. In IEEE Computer, pages 33–38, 2008.

[3] P. Michaud, A. Mondelli, and A. Seznec. Revisiting clustered microarchitecture for future superscalar cores: A case for wide issue clusters. In ACM TACO, volume 13, pages 22–33, 2015.

[4] Y. Patt. Future microprocessors: What must we do differently if we are to effectively utilize multi-core and many-core chips. In IPSI BGD Transactions on Internet Research, volume 5, pages 2–10, 2009.

[5] L. Kirischian. Reconfigurable Computing Systems Engineering: Virtualization of Computing Architecture. Taylor & Francis, 2016.

[6] F. Tseng. Braids: Out-of-order performance with almost in-order complexity. In Doctoral Dissertation, University of Texas at Austin, 2008.

[7] J. Smith and R. Nair. Virtual Machines: Versatile Platforms for Systems and Processes (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

[8] H. Corporaal. Design of transport triggered architectures. In Design Automation of High Performance VLSI Systems, pages 130–135, 1994.

[9] E. Athanasaki, N. Anastopoulos, K. Kourtis, and N. Koziris. Exploring the performance limits of simultaneous multithreading for memory intensive applications. Journal of Supercomputing, 44:44–64, 2008.


[10] Khubaib, M. Aater Suleman, M. Hashemi, C. Wilkerson, and Y. Patt. Morphcore: An energy-efficient microarchitecture for high performance ilp and high throughput tlp. In Proceedings of MICRO, pages 305–316, 2012.

[11] S. Eyerman and L. Eeckhout. A memory-level parallelism aware fetch policy for smt processors. In Proceedings of HPCA, pages 240–249, 2007.

[12] C. Bienia, S. Kumar, J. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural implications. In Proceedings of ACM PACT, pages 72–81, 2008.

[13] N. Barrow-Williams, C. Fensch, and S. Moore. A communication characterisation of splash-2 and parsec. In Proceedings of IISWC, pages 86–97, 2009.

[14] A. El-Moursy and D. H. Albonesi. Front-end policies for improved issue efficiency in smt processors. In IEEE High Performance Computing Architectures (HPCA), 2003.

[15] D. Tullsen and J. A. Brown. Handling long-latency loads in a simultaneous multithreading processor. In Proceedings of MICRO, pages 318–327, 2001.

[16] K. Van Craeynest, K. Eyerman, and S. Eeckhout. Mlp-aware runahead threads in a simultaneous multithreading processor. In Proceedings of HiPEAC, pages 110–124, 2009.

[17] T. Ramirez, A. Pajuelo, O. Santana, and M. Valero. Runahead threads to improve smt performance. In Proceedings of HPCA, pages 16–20, 2008.

[18] T. Ramirez, A. Pajuelo, O. Santana, O. Mutlu, and M. Valero. Efficient runahead threads. In Proceedings of PACT, pages 443–452, 2010.

[19] S. Eyerman and L. Eeckhout. The benefit of smt in the multi-core era: Flexibility towards degrees of thread-level parallelism. In Proceedings of ASPLOS, pages 591–606, 2014.

[20] J. Baer. Microprocessor architecture: From simple pipelines to chip multiprocessors. In Cambridge University, 2010.

[21] J. Hennessy and D. Patterson. Computer architecture: A quantitative approach. Morgan Kaufmann Publishers Inc: San Francisco, 2011.

[22] T. Jamil. Risc versus cisc. In IEEE Potentials, volume 15, pages 13–16, 1995.


[23] K. Karuri and R. Leupers. The ASIP Design Space. Springer New York, 2011.

[24] K. Beyls and E. H. D’Hollander. Generating cache hints for improved program efficiency. In Journal of Systems Architecture, volume 51, pages 233–250, 2005.

[25] J. Shen and M. Lipasti. Modern processor design: Fundamentals of superscalar processors. In McGraw-Hill Publishers, 2004.

[26] P. Y. Chang, M. Evers, and Y. N. Patt. Improving branch prediction accuracy by reducing pattern history table interference. In Proceedings of Parallel Architectures and Compilation Techniques (PACT), pages 48–58, 1996.

[27] A. Seznec and P. Michaud. A case for (partially) tagged geometric history length branch prediction. In Journal of Instruction Level Parallelism (JILP), pages 1–23, 2006.

[28] J. Hennessy. Processor design and other challenges in the post-pc era. In Proceedings of the Microprocessor Forum, Cahners Microdesign Resources, 1999.

[29] Intel 64 and ia-32 architectures optimization reference manual. In Technical Report 248966-026, 2012.

[30] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. Wavescalar. In International Symposium on Microarchitecture, pages 291–302, 2003.

[31] A. Pellegrini, J. Greathouse, and V. Bertacco. Viper: Virtual pipelines for enhanced reliability. In Proceedings of the International Symposium on Computer Architecture, pages 344–355, 2012.

[32] E. B. Nightingale, J. R Douceur, and V. Orgovan. Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer pcs. In Proceedings of EuroSys, pages 343–356, 2011.

[33] R. Nagarajan. Design and analysis of technology scalable architectures. In PhD Thesis - University of Texas at Austin, 2006.

[34] K. Sankaralingam. Polymorphous architectures: A unified approach for extracting concurrency of different granularities. In PhD Thesis - University of Texas at Austin, 2006.


[35] S. Kalathingal, S. Collange, B. N. Swamy, and A. Seznec. Dynamic inter-thread vectorization architecture: Extracting dlp from tlp. In Proceedings of SBAC-PAD, 2016.

[36] G. M. Papadopoulos and D. E. Culler. Monsoon: an explicit token-store architecture. In ACM SIGARCH Computer Architecture News, pages 82–91, 1990.

[37] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B. Lim, K. Mackenzie, and D. Yeung. The mit alewife machine: Architecture and performance. In International Symposium on Computer Architecture (ISCA), pages 1–12, 1995.

[38] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded sparc processor. In IEEE MICRO, pages 21–29, 2005.

[39] R. Kalla, B. Sinharoy, and J. M. Tendler. Ibm power5 chip: A dual-core multithreaded processor. In IEEE Micro: Hot Chips, volume 24, pages 40–48, 2004.

[40] J. Chiu, Y. Chou, and P. Chen. Hyperscalar: A novel dynamically reconfigurable multi-core architecture. In Proceedings of the International Conference on Parallel Processing (ICPP), pages 277–285, 2010.

[41] D. Tarjan, M. Boyer, and K. Skadron. Federation: Repurposing scalar cores for out-of-order instruction issue. In Proceedings of the Design Automation Conference (DAC), pages 772–775, 2008.

[42] M. Qayum, N. Siddique, M. Haque, and A. S. Tayeen. Future of multiprocessors: Heterogeneous chip multiprocessors. In International Conference on Informatics, Electronics and Vision (ICIEV), pages 372–376, 2012.

[43] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural acceleration for general-purpose approximate programs. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 449–460, 2012.

[44] J. Dehnert, B. Grant, J. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson. The transmeta code morphing software: Using speculation, recovery, and adaptive retranslation to address real-life challenges. In Proceedings of CGO, pages 1–10, 2003.


[45] V. Packirisamy, Y. Luo, W. Hung, A. Zhai, P. C. Yew, and T. Ngai. Efficiency of thread-level speculation in smp and cmp architectures - performance, power, and thermal perspective.

[46] M. Ros and P. Sutton. A hamming distance based vliw/epic code compression technique. In International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), pages 132–139, 2004.

[47] H. Sharangpani and H. Arora. Itanium processor microarchitecture. In IEEE MICRO, volume 20, pages 24–43, 2000.

[48] NVIDIA Charts Its Own Path to ARMv8, 2014.

[49] L. Dagum and R. Menon. Openmp: an industry standard api for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55, 1998.

[50] E. Gunadi and M. Lipasti. Crib: Consolidated rename, issue, and bypass. In International Symposium on Computer Architecture (ISCA), pages 23–32, 2011.

[51] R. P. Goldberg. Survey of virtual machine research. In IEEE Computer, volume 7, pages 34–45, 1974.

[52] S. Hilly and A. Seznec. Out-of-order execution may not be cost-effective on processors featuring simultaneous multithreading. In Proceedings of HPCA, pages 64–67, 1999.

[53] K. Sankaralingam, R. Nagarajan, H. Lui, C. Kim, J. Huh, N. Ranganathan, D. Burger, and S. Keckler et al. Trips: A polymorphous architecture for exploiting ilp, tlp, and dlp. In ACM TACO, 2004.

[54] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, et al. Task superscalar: An out-of-order task pipeline. In IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 89–100, 2010.

[55] Y. Watanabe, J. D. Davis, and D. A. Wood. Widget: Wisconsin decoupled grid execution tiles. In International Symposium on Computer Architecture (ISCA), pages 2–13, 2010.

[56] V. Govindaraju, C. Ho, and K. Sankaralingam. Dynamically specialized datapaths for energy efficient computing. In IEEE MICRO, volume 32, pages 503–514, 2011.


[57] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: Accommodating software diversity in chip multiprocessors. In Proceedings of ISCA, pages 186–197, 2007.

[58] Barcelona Supercomputing Center. Programming with ompss. In BSC Programming Models - Technical Document, 2017.

[59] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou. A fully pipelined and dynamically composable architecture of cgra. In IEEE Symp on FPGA Custom Computing Machines (FCCM), pages 9–16, 2014.

[60] E. Mirsky and A. Dehon. Matrix: A reconfigurable computing architecture with configurable instruction distribution and deployable resources. In IEEE Symp on FPGA Custom Computing Machines (FCCM), pages 137–166, 1996.

[61] S. Goldstein, H. Schmit, M. Budiu, et al. Piperench: A reconfigurable architecture and compiler. In IEEE Computer, volume 33, pages 70–77, 2000.

[62] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee. Chimaera: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In International Symposium on Computer Architecture (ISCA), pages 225–235, 2000.

[63] M. A. Watkins and D. H. Albonesi. Remap: A reconfigurable architecture for chip multiprocessors. In IEEE MICRO, pages 65–77, 2011.

[64] R. K. Pal, K. Paul, and S. Prasad. Rekonf: A reconfigurable adaptive manycore architecture. In IEEE Parallel and Distributed Processing with Applications (ISPA), 2012.

[65] J. H. Ahn, M. Erez, and W. J. Dally. Tradeoff between data-, instruction-, and thread-level parallelism in stream processors. In Proceedings of ICS, pages 126–137, 2007.

[66] W. J. Dally, P. Hanrahan, M. Erez, and T. J. Knight. Merrimac: Supercomputing with streams. In Proceedings of ACM/IEEE Conference on SC, page 35, 2003.

[67] H. Corporaal and M. Arnold. Using transport triggered architectures for embedded processor design. In Integrated Computer-Aided Engineering, pages 19–38, 1998.


[68] S. Shahabuddin, J. Janhunen, and M. J. Juntti. Design of a transport triggered architecture processor for flexible iterative turbo decoder. CoRR, abs/1501.04192, 2014.

[69] G. De Micheli and L. Benini. Networks on Chips: Technology and Tools. Morgan Kaufmann Publishers Inc., 2006.

[70] G. McFarland. Microprocessor design: A practical guide from design planning to manufacture. In McGraw-Hill, 2006.

[71] N. Binkert, B. Beckmann, and G. Black et al. The gem5 simulator. In SIGARCH Computer Architecture News, volume 39, pages 1–7, 2011.

[72] N. Muralimanohar, R. Balasubramonian, and N. P Jouppi. Cacti 6.0: A tool to understand large caches. In Technical Report, Hewlett-Packard Labs, 2008.

[73] R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. In Proceedings of the IEEE, volume 89, pages 490–504, 2001.

[74] N. Weste and D. Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Pearson - Addison Wesley, 2005.

[75] S. Shankland. Power could cost more than servers, google warns. In CNET web article, 2006.
