An FPGA Implementation of an Investigative Many-Core Processor; Fynbos

An FPGA implementation of an investigative many-core processor; Fynbos
In support of a Fortran autoparallelising software pipeline

J. Wyngaard
Department of Electrical Engineering
University of Cape Town

Thesis Presented for the Degree of Doctor of Philosophy
June 2014

The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non-commercial research purposes only. Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author.

Abstract

Janet Ruth Wyngaard, June 2014

In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency.

In order to achieve greater efficiency in an environment in which Moore's law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near-threshold voltage operation. Performance is therefore dependent on exploiting a new degree of fine-grained parallelism such as is currently found only in GPGPUs, but in a manner that is less restrictive in the range of application domains.

While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required.
For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software, and specifically the compiler, rather than to the programmer or to an application-specific means of control simplification.

A legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core already exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA makes it possible to examine the real-world performance of the compiler-architecture system to a greater degree than simulation alone would allow.

Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but also to underperform significantly relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement tactics for mitigating the limitations of static scheduling, owing to a lack of compiler support for them.

Acknowledgements

I wish to acknowledge J. Collins most particularly, as this work would not have been possible without his compiler, and his and my supervisor Prof M. Inggs' initiation of the project. I would further like to thank them both for their supervision over the course of this thesis. I would also like to sincerely thank my colleagues in ACE lab at CHPC over the years, and most recently my husband Sebastian, for all the support and company.

Finally, the financial assistance of the following organisations towards this research is hereby acknowledged.
Opinions expressed and conclusions arrived at are those of the author and are not necessarily to be attributed to these bodies:

• The Centre for High Performance Computing (CHPC), a division of the Meraka Institute, a national research centre of the Council for Scientific and Industrial Research (CSIR), and an initiative of the Department of Science and Technology.
• The National Research Foundation (NRF)

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Context and Definitions: The Switch to Ubiquitous Parallel Computing
  1.2 Problem as it relates to HPC
    1.2.1 Architecture
    1.2.2 Programmability
  1.3 The many-core hypothesis
    1.3.1 Architecture definition
    1.3.2 Programmability challenges
  1.4 Thesis Hypothesis and scope limitations
  1.5 Chapter Summary

2 Literature Review
  2.1 Introduction
  2.2 Many-core-like Architectures
    2.2.1 Dataflow ISAs
      2.2.1.1 TRIPS
      2.2.1.2 WaveCache
      2.2.1.3 Tartan
      2.2.1.4 D3AS
      2.2.1.5 Comparisons
    2.2.2 MPPAs
      2.2.2.1 Proposed event-driven programmable array
      2.2.2.2 Kalray
      2.2.2.3 AMBRIC
      2.2.2.4 Comparisons
    2.2.3 Vector, VLIW, and Streaming Processors
      2.2.3.1 Imagine
      2.2.3.2 Merrimac
      2.2.3.3 Rigel
      2.2.3.4 RAW
      2.2.3.5 GPGPUs
      2.2.3.6 ClearSpeed
      2.2.3.7 Comparisons
    2.2.4 Alternative Approaches
      2.2.4.1 Reconfigurable Architectures
      2.2.4.2 Embedded Systems
      2.2.4.3 Fabrication Advances and Consequences
    2.2.5 Summary
  2.3 Concurrent Software Stacks and Languages
    2.3.1 Current HPC Programming Models
      2.3.1.1 Distributed Memory Based Message Passing Model
      2.3.1.2 Shared Memory Based Multi-threading Models
      2.3.1.3 Global Address Space Model and Variations Thereof
      2.3.1.4 Hybrid and Heterogeneous Models
      2.3.1.5 Implicit Programming Models
    2.3.2 Autoparallelising Software Stacks
    2.3.3 Summary
  2.4 Summary and Conclusions

3 APPRASE Pipeline
  3.1 Introduction
  3.2 Compiler History
      3.2.0.1 DAREA
  3.3 APPRASE
    3.3.1 Matching APPRASE to Fynbos
  3.4 Host software
    3.4.1 Fynbos Compiler (xml converter)
    3.4.2 Communication
  3.5 Summary and Conclusions

4 Fynbos Architecture
  4.1 Introduction
    4.1.1 Development Environment Hardware
    4.1.2 Architecture Overview
  4.2 Control and Data Movement
    4.2.1 Memory Infrastructure
    4.2.2 Data Sharing, Array Interconnectivity
  4.3 A Many-core PE
    4.3.1 ALU Design
  4.4 Operation
    4.4.1 Execution Flow
    4.4.2 Exceptions and Events
      4.4.2.1 Moving Results Off-chip
      4.4.2.2 Floating Point Exceptions
      4.4.2.3 Branching
      4.4.2.4 Program termination
  4.5 Verification
    4.5.1 Design discards
  4.6 Summary and Conclusions

5 APPRASE-Fynbos Evaluation
  5.1 Introduction
  5.2 Fynbos Hardware Evaluation
    5.2.1 Tool Chain Configuration and ASIC Scaling
      5.2.1.1 FPGAs
      5.2.1.2 ASICs
    5.2.2 Hardware Comparisons
      5.2.2.1 System Scalability
      5.2.2.2 Configuration Variants
    5.2.3 Power Distribution
  5.3 APPRASE-Fynbos Software Evaluation
    5.3.1 Theoretical Maximums of a Scaling Fynbos
    5.3.2 APPRASE-Fynbos Parallelisation Efficiency
      5.3.2.1 Program Length
      5.3.2.2 Communication
      5.3.2.3 Division Capacity and Parallelism
      5.3.2.4 Power Efficiency
    5.3.3 APPRASE-Fynbos Real-world Performance
  5.4 Summary and Conclusions
    5.4.1 Hardware
    5.4.2 Software

6 Conclusions
  6.1 Further Work Required
  6.2 Fynbos and Many-core Architecture Conclusions
  6.3 APPRASE and Programming for Many-core Conclusions
  6.4 Hypothesis Conclusions

References

Appendices

A Additional information
  A.1 Additional Explanatory Notes
    A.1.1 GPGPU Programming Model and Hardware Mapping
    A.1.2 BRAM Analysis Table
    A.1.3 Division Hardware
      A.1.3.1 Serial Nature of Division
      A.1.3.2 Xilinx Division Hardware
    A.1.4 Host commands
  A.2 Literature review comparison tables
  A.3 Additional graphs

B Index to Digital Attachment

List of Figures

  1.1 Microprocessor industry trend graphs
  3.1 DAREA scheduling algorithm
  4.1 Fynbos hardware development environment

