GoblinCore-64: A Scalable, Open Architecture for Data Intensive High Performance Computing

by

John D. Leidel, B.A.

A Doctoral Dissertation

In

Computer Science

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

Approved

Dr. Yong Chen, Committee Chair

Dr. David Haglin

Dr. Yu Zhuang

Dr. Sunho Lim

Dr. Noe Lopez-Benitez

Dr. Mark A. Sheridan

Dean of the Graduate School

May, 2017

© 2017, John D. Leidel

ACKNOWLEDGEMENTS

I would like to take this opportunity to thank all those who have assisted with my ongoing research, development and academic activities. First, I would like to acknowledge the support and guidance from my Ph.D. advisor, Dr. Yong Chen. You took me in as a non-traditional graduate student and guided my research and academic pursuits to a successful end. We've created some wonderful technology together that serves to change the computational world as we know it.

My sincere appreciation goes to my committee members. Dr. David Haglin (Pacific Northwest National Laboratory), your guidance and expertise in applying creative solutions to seemingly impossible computational problems is an inspiration to us all. Dr. Yu Zhuang (Texas Tech) and Dr. Noe Lopez-Benitez (Texas Tech), thank you both for your guidance and review of my hardware and system architecture knowledge and designs. Finally, thanks to Dr. Sunho Lim (Texas Tech) for his guidance and advice on interconnection networks and topologies.

I would like to thank the J.T. and Margaret Talkington Foundation for their ongoing support in my academic career at Texas Tech University. Your support is a cornerstone of the Lubbock community.

For my colleagues in the Data Intensive Scalable Computing Laboratory, thank you for your continued support for your fellow student from afar. Special thanks are in order for my dear friend, Mr. Xi Wang. Your patience and determination are an inspiration to your colleagues in the PhD program. I wish you all the best in your future research and academic career.

To my mother, Gloria Leidel, and my father, Dr. David Leidel, you gave me the tools, love and support as a child to grow and develop into a dedicated and productive member of my local community. The discipline instilled by you as parents has given me the ability to serve my country, my community and my family.

I would like to acknowledge the sacrifices made by my family throughout my academic research. To my daughter, Camdyn Leidel, and my sons, Kyle and William Leidel: thank you each for your patience in allowing your father to pursue his dreams. I only hope that my perseverance serves as inspiration for you each to listen to your heart and follow your dreams.


Finally, I would like to thank my dear wife, Stephanie Leidel. You are my guiding light, my reason for being and my inspiration to excel. With you, all things in life are possible.


TABLE OF CONTENTS

Acknowledgements
Abstract
List of Tables
List of Figures
1. Introduction
   1.1 Scalable, Efficient Data Intensive Computing
   1.2 Dissertation Overview
   1.3 Contributions
   1.4 Organization
2. Background
   2.1 Data Intensive Computing
   2.2 Algorithmic Design
   2.3 Programming Models
   2.4 Software Infrastructures
3. Core System Architecture
   3.1 Machine Hierarchy
   3.2 System Interconnect
4. Instruction Set Extensions
   4.1 Core Micro Architecture
   4.2 Instruction Set Extension
   4.3 Hardware-Software Task Interface
   4.4 Application of ISA Extensions
   4.5 Concurrency Mechanisms
5. Data Intensive Memory Hierarchy
   5.1 Memory Hierarchy
   5.2 Scratchpad Memory Architecture
   5.3 Dynamic Memory Request Coalescing
6. HMC Simulation and Ancillary Device Components
   6.1 Software Architecture
   6.2 Atomic Memory Operation Unit
   6.3 HMC Channel Interface
   6.4 HMC Memory Devices
7. Performance Evaluation
   7.1 Evaluation Methodology
   7.2 Evaluation of Results
   7.3 Comprehensive Analysis
8. Related Work
   8.1 Data Intensive Hardware Architectures
   8.2 Data Intensive Software Architectures
9. Conclusion and Future Work
   9.1 Conclusion
   9.2 Future Work
Bibliography
Appendix A: Instruction Set Encoding
Appendix B: Selected Source Code


ABSTRACT

The current class of mainstream architectures relies upon multi-level data caches and relatively low degrees of concurrency to solve a wide range of applications and algorithmic constructs. These mainstream architectures are well suited to efficiently executing applications that are generally considered to be cache friendly. These may include applications that operate on dense, linear data structures or applications that make heavy reuse of data in cache. However, applications that are generally considered to be data intensive in nature may access memory with irregular memory request patterns or access such large data structures that they cannot reside entirely in an on-chip data cache.

In this work, we introduce the GoblinCore-64 (GC64) architecture and instruction set. The goal of GC64 is to provide a scalable, flexible and open architecture for efficiently executing data intensive computing applications and algorithms. The GC64 infrastructure is built upon a hierarchical set of hardware modules designed to support scalable concurrency with explicit support for latency hiding. The infrastructure's memory hierarchy is constructed using software-managed scratchpad memories for local, application-managed memory requests and Hybrid Memory Cube devices for main memory storage. The instruction set is based upon the RISC-V instruction set specification with additional extensions to support scatter/gather memory requests, task concurrency and task management. Further, the system architecture is bolstered by a high performance main memory subsystem based upon three-dimensional stacked memories, or Hybrid Memory Cubes, coupled to an intelligent memory request coalescing methodology. The result is a simple, effective, and scalable architecture that can be easily adapted to efficiently execute data intensive applications using commodity programming models such as OpenMP, MPI and MapReduce.

In order to validate the GC64 infrastructure, we construct a simulation infrastructure based upon the standard RISC-V simulation platform, Spike, that mimics a production GC64 device. We demonstrate the aforementioned system architecture using a wide array of benchmark vehicles that include the GAP Benchmark Suite, the NAS Parallel Benchmarks, the High Performance Conjugate Gradient benchmark and the Barcelona OpenMP Task Suite.


LIST OF TABLES

3.1 HMC Device Packet Control Overhead
3.2 GoblinCore-64 System Interconnect Packets
3.3 GoblinCore-64 Atomic System Interconnect Packets
3.4 GoblinCore-64 System Interconnect Performance
4.1 RISC-V ISA Extensions
4.2 RISC-V ISA Register Extensions
4.3 RISC-V Opcode Cost Table
5.1 HMC Device Packet Control Overhead
5.2 DMC Synthesized Physical Design
6.1 Example Atomic Memory Operation Request Stream
6.2 Systolic Atomic Memory Operation Request Stream
6.3 Stream Runtime in Clock Cycles
6.4 RandomAccess Runtime in Clock Cycles
7.1 Evaluation Configurations
7.2 Context Switch Event Derivation
7.3 Peak Speedups by Configuration


LIST OF FIGURES

1.1 HPC-Derived Data Intensive Computing
2.1 November 2016 Graph 500 Results
3.1 GC64 Task
3.2 GC64 Task Group
3.3 GC64 Socket Architecture
3.4 GC64 Task Group OpenSoC Fabric Connectivity
3.5 GC64 4x4 Network on Chip Mesh Topology
3.6 GC64 Local Memory Interconnection
3.7 HMC Interconnection Topologies
3.8 GC64 System Interconnection
3.9 GC64 4x4xN Torus Interconnection Topology
3.10 GC64 Extended HMC Packet
4.1 GC64 Task Queue Link Structure
4.2 GC64 Task Keystone Link Structure
4.3 Scatter Gather Code Examples
4.4 OpenMP Taskyield Example
4.5 HPCC RandomAccess Example
4.6 GC64 Thread Context Pipeline
4.7 GC64 Pressure-Driven Context Switch Example
5.1 GC64 Physical Address Format
5.2 GC64 Extended Physical Address Format
5.3 GC64 Software Managed Scratchpad
5.4 Dynamic Memory Coalescing
5.5 Tree Expiration
5.6 DMC Architecture
6.1 Traditional and GC64 Software Architectures
6.2 AMO Systolic
6.3 HMC-Sim API structure hierarchy
6.4 HMC-Sim Sub-cycle state machine
6.5 HMC-Sim sample API sequencing
6.6 HPCC StreamTriad results for HMC 4link-2GB device
6.7 HPCC StreamTriad results for HMC 4link-4GB device
6.8 HPCC StreamTriad results for HMC 8link-4GB device
6.9 HPCC StreamTriad results for HMC 8link-8GB device
6.10 HPCC RandomAccess results for HMC 4link-2GB device
6.11 HPCC RandomAccess results for HMC 4link-4GB device
6.12 HPCC RandomAccess results for HMC 8link-4GB device
6.13 HPCC RandomAccess results for HMC 8link-8GB device
7.1 Simulation Environment
7.2 Simulator External Trace Mechanisms
7.3 Average Request Reduction
7.4 Scaled Read Performance
7.5 Scaled Read Performance
7.6 Scaled Write Performance
7.7 Scaled Average Performance
7.8 HPCG Read Request Distribution
7.9 HPCG Write Request Distribution
7.10 Simulation Environment
7.11 Total Context Switch Events
7.12 Baseline Opcode Distribution
7.13 Two Thread, 64 Max Pressure Opcode Distribution
7.14 Two Thread, 128 Max Pressure Opcode Distribution
7.15 Eight Thread, 64 Max Pressure Opcode Distribution
7.16 Eight Thread, 128 Max Pressure Opcode Distribution
7.17 Application Performance
7.18 GAPBS Application Performance
7.19 NASPB Application Performance
7.20 Application Speedup
7.21 Average Memory Bandwidth Improvement
7.22 Combined Architectural Speedup


CHAPTER 1 INTRODUCTION

1.1 Scalable, Efficient Data Intensive Computing

Recently, data intensive computing has garnered increased research and development interest in commercial and academic settings. The need for highly parallel algorithms and applications that support machine learning, graph theory, artificial intelligence and raw data analytics has spurred a flurry of research into novel hardware and software methodologies to accelerate this class of applications.

One of the driving issues in coping with parallel data intensive computing algorithms and applications is the inherent irregularity in their memory access patterns. These irregular memory access patterns often fall outside of traditional multi-level caching hierarchies and induce increased memory latencies that stall the execution pipelines of the core. Further, even the most optimized irregular algorithms experience a large number of cache line evictions due to the lack of spatial and temporal locality. These pipeline stalls prevent efficient execution, and thus increase the runtime of the algorithms.

Academic and commercial organizations have turned to what are traditionally known as high performance computing techniques in order to optimize these data intensive workloads. We find three main drivers for this shift toward specialized, high performance computing. First, applications that require a fixed time to solution, regardless of scale, often require large scale high performance computing. Second, applications (especially those in graph theory) may have an algorithmic complexity that dictates dedicated, high performance computing equipment. Finally, the inherent scale of the working set may exceed the computational resources present in the traditional storage platform. As we see in Figure 1.1, any time two of these paradigms converge, we find the inherent need to shift toward high performance computing platforms. Moreover, any time we see all three paradigms in a single application or workload, we deem this a Grand Challenge problem. This holds provided that the organization's core competency is not the hosting of large-scale computing platforms, as is the case for Google, Facebook, Netflix, Amazon (AWS) and Microsoft (Azure). The best modern example of a grand challenge data intensive computing problem is cyber analytics [84] [36] [38] [73].

Figure 1.1: HPC-Derived Data Intensive Computing

Parallel to this effort to construct large scale data intensive computing instruments is the race to build an exascale computing system. Recent reports [143] [157] [18] on constructing these exascale systems point to improving the overall efficiency of the coupled components at scale. Driving this improvement in efficiency is the need to improve power efficiency and provide platforms suitable for programmers to adapt parallel applications. Much in the same manner, we seek to improve the efficiency of execution of data intensive computing algorithms and applications at scale.

1.2 Dissertation Overview

In this work, we introduce a purpose-built system and micro architecture for large-scale data intensive computing. The GoblinCore-64 (GC64) system and micro architecture combines simple, efficient RISC processing elements with a hierarchy of concurrent hardware and software modules that provide scalability well beyond current system platforms without sacrificing programmability. Each of the GoblinCore-64 sockets is connected to one or more high bandwidth, Hybrid Memory Cube 3D stacked memory devices. The entire GoblinCore-64 ecosystem, including the hardware implementation and associated software modules, is packaged with a BSD-style license clause such that commercial and academic organizations may continue to explore scalable data intensive computing using GC64.

At the core of this research are instruction set and micro architectural extensions for the emerging RISC-V instruction set specification. RISC-V [161] provides an open, extensible platform to which we may add domain-specific micro architecture features such as those to accelerate data intensive computing applications. GoblinCore-64 adds scatter/gather memory operations, global memory addressing and thread/task management features as rudimentary instructions. We construct system-on-chip (SoC) devices using a hierarchy of RISC-V processing elements coupled to a scalable network-on-chip [65].

These micro architecture extensions to the core RISC-V instruction set are further augmented by a unique low-latency context switching mechanism. GoblinCore-64 encapsulates the entire state required to represent a thread or task on chip. These task units are selected for execution by an intelligent thread/task control machine that records the relative pressure put on a processing element by a thread or task. When a thread or task exceeds a maximum pressure threshold, a context switch event is triggered and an adjacent thread or task is selected for execution. This entire process of triggering and selecting a new parallel execution stream is completed in a single cycle.

Each group of processing elements is connected to a dynamic memory coalescing unit. The coalescing unit is designed to accept streams of irregular memory requests from the attached processing elements. The memory coalescing logic dynamically builds trees of read and write requests until a saturation point or a timeout has been reached, at which point the memory coalescing unit flattens the memory requests into the largest possible requests destined for the Hybrid Memory Cube main memory devices.

Finally, we scale the individual GoblinCore-64 node configurations to a degree beyond modern high performance computing architectures without sacrificing programmability. We achieve this by utilizing a modified Hybrid Memory Cube packetized protocol that permits global, partitioned memory addressing over high performance interconnection links. These interconnection links are now feasible at the scale and bandwidth required to support our platform given recent advances in silicon photonics [136] [163]. The entire GoblinCore-64 ecosystem is then packaged as a three-dimensional torus interconnect whereby each routing node contains a ring network of locally connected GoblinCore-64 processing elements.

We evaluate the aforementioned approach by building a fully functional simulation environment that executes a live Linux kernel and operating system. We develop a series of micro-tracing applications that permit us to examine the individual performance mechanisms of the various hardware and software systems. We further demonstrate the efficacy of our approach using twenty-four pathological workloads traditionally found in data intensive and high performance computing. As a result, we find that GoblinCore-64 provides up to a 14X performance advantage core-per-core over the equivalent RISC-V architecture. We can therefore conclude that our approach has significant benefit for efficiently executing large scale data intensive algorithms and applications.
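To make the pressure mechanism concrete, the following C sketch models the selection logic in software; the per-instruction cost accounting, the field names and the round-robin victim selection are illustrative assumptions rather than the GC64 hardware design.

#include <stdint.h>

#define MAX_TASKS 256            /* task units per task processor          */

typedef struct {
  uint64_t pressure;             /* accumulated instruction-cost pressure  */
  int      active;               /* non-zero if the task unit is runnable  */
} task_unit_t;

/* Accumulate the compiler-assigned cost of each retired instruction and
 * switch to an adjacent runnable task unit once the current one exceeds
 * the configured maximum pressure threshold.  Returns the index of the
 * task unit that should execute on the next cycle. */
int retire_insn(task_unit_t tu[], int cur, uint32_t insn_cost,
                uint64_t max_pressure) {
  tu[cur].pressure += insn_cost;
  if (tu[cur].pressure < max_pressure)
    return cur;                              /* keep executing current task */

  tu[cur].pressure = 0;                      /* reset pressure and switch   */
  for (int i = 1; i < MAX_TASKS; i++) {      /* round-robin victim search   */
    int next = (cur + i) % MAX_TASKS;
    if (tu[next].active)
      return next;
  }
  return cur;                                /* no other runnable task unit */
}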

1.3 Contributions

The contributions of this research are as follows:

• Scalable System Infrastructure: We design and implement a scalable system infrastructure that supports a large degree of concurrency and latency hiding capabilities, coupled to a high bandwidth memory subsystem [105]. The system infrastructure utilizes processing units based upon the open source RISC-V instruction set architecture. The system infrastructure also encapsulates all the state required to represent a thread or task directly into hardware in order to minimize the latency and cycles required to manage parallelism in software. We demonstrate simulated prototypes of the infrastructure executing live Linux-based operating system kernels and a number of pathological benchmark applications.

• Data Intensive Computing Instruction Set Extensions: We design and implement instruction set extensions attached to the base RISC-V IMAFD instruction set architecture [162] that support the high degree of execution concurrency present in our aforementioned system infrastructure [105]. The RISC-V IMAFD (or G) ISA has support for 64-bit addressing and base integer instructions (I), integer multiplication and division (M), atomic memory operations (A), single precision floating point (F) and double precision floating point (D). The GC64 instruction set extensions provide support for scatter/gather memory requests, global memory addressing as well as thread and task management in hardware.

• Hardware-Software Coupled Context Switching: We describe and demonstrate a low-latency context switching mechanism that couples the compiler's notion of instruction cost to a hardware-driven context switch mechanism. The context switching hardware is implemented adjacent to the standard RISC-V Rocket five-stage in-order pipeline in order to record the relative pressure a thread or task induces on the associated processing element. When a thread or task exceeds a maximum pressure threshold, the hardware rapidly context switches to an adjacent thread or task in a single cycle.

• Hardware-Software Coupled Runtime Mechanisms: In addition to the aforementioned system infrastructure and instruction set extensions, we also describe a runtime methodology and associated memory organization to support rapid thread or task spawns and joins. This methodology is purpose built to support a multitude of parallel queueing mechanisms, such as work stealing, across a scalable platform.

• Dynamic Memory Coalescing: In order to support the high bandwidth memory subsystem present in our system infrastructure, we design and implement a unique dynamic memory coalescing methodology that builds the most efficient memory request structure irrespective of the irregularity of the incoming memory requests [160]; a simplified sketch follows this list. Our memory coalescing approach builds dynamic trees of memory requests until a predetermined time or saturation point is reached, at which time the requests are flattened into the most advantageous set of block packet requests destined for a Hybrid Memory Cube device.

• HMC-Based System Interconnect: Our final contribution is the design and implementation of a scalable interconnect methodology based upon the HMC local memory interconnect specification. Our design utilizes the base HMC packet specification coupled to a scalable physical addressing model in order to provide global, partitioned addressing using rudimentary memory requests across a scalable system. We provide the initial messaging framework, extended HMC packet specification as well as sample performance ramifications for a multidimensional toroidal network infrastructure.
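The Dynamic Memory Coalescing contribution above can be summarized by the simplified sketch below. For brevity it buffers requests in a flat array rather than the request trees described later, and the saturation count, timeout window and callback interface are assumptions made for illustration.

#include <stdint.h>
#include <stdbool.h>

#define DMC_MAX_REQS 64          /* assumed saturation point                 */

typedef struct { uint64_t addr; bool write; } mem_req_t;

typedef struct {
  mem_req_t reqs[DMC_MAX_REQS];  /* pending requests (a tree in the real DMC) */
  int       count;
  uint64_t  deadline;            /* cycle at which the unit must flush        */
} dmc_t;

/* Queue one irregular request; flush when the unit saturates or when the
 * timeout expires.  flush_blocks() stands in for flattening the pending
 * requests into the largest possible HMC block read/write packets.        */
void dmc_push(dmc_t *d, mem_req_t r, uint64_t now,
              void (*flush_blocks)(const mem_req_t *, int)) {
  if (d->count == 0)
    d->deadline = now + 128;     /* assumed timeout window, in cycles        */
  d->reqs[d->count++] = r;
  if (d->count == DMC_MAX_REQS || now >= d->deadline) {
    flush_blocks(d->reqs, d->count);
    d->count = 0;
  }
}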

1.4 Organization

The remainder of this dissertation is organized as follows. Chapter 1 introduces the challenges in constructing and executing data intensive algorithms and applications at scale on modern computational hardware. Chapter 2 introduces several common algorithmic constructs and common methodologies utilized in data intensive high performance computing. Chapter 3 describes the GoblinCore-64 system architecture, including the internal processing architecture and the scalable interconnection network. Chapter 4 introduces the instruction set extensions necessary to implement the GoblinCore-64 extension to the base RISC-V instruction set architecture. It also discusses the various algorithmic and concurrent software mechanisms that are enabled by the aforementioned extensions. Chapter 5 outlines the GoblinCore-64 memory hierarchy, including the memory pipeline and the main memory device architecture. Chapter 6 describes the remaining, ancillary GoblinCore-64 device components. Chapter 7 outlines our evaluation methodology, sample benchmark applications and the associated performance results from the various tests. Chapter 8 introduces related software and hardware architectures for data intensive computing and Chapter 9 concludes this dissertation with final remarks and future research directions.


CHAPTER 2 BACKGROUND

2.1 Data Intensive Computing

As mentioned in Chapter 1, the goal of the GoblinCore-64 research effort is to construct a scalable, data intensive computing instrument with characteristics that promote execution efficiency and, subsequently, performance. The driving motivation is the inefficient use of current, commodity processing technologies for large-scale analytics workloads. Further, we also find that numerical solvers often utilized in large scale high performance and scientific computing applications share similar inefficiencies.

The current class of mainstream microprocessor architectures relies upon multi-level data caches and relatively low degrees of concurrency to solve a wide range of applications and algorithmic constructs. These mainstream architectures are well suited to efficiently executing applications that are generally considered to be cache friendly. These may include applications that operate on dense, linear data structures or applications that make heavy reuse of data in cache. However, applications that are generally considered to be data intensive in nature may access memory with irregular memory request patterns or access such large data structures that they cannot reside entirely in an on-chip data cache. Further, these workloads may present working sets that exceed the typical main memory footprint of a single node. As a result, multiple, distributed memory systems are employed in order to provide sufficient memory capacity to operate the solver.

As an example of this inefficiency, we analyze the results from the Graph 500 benchmark [116] [2]. The initial Graph 500 benchmark actually contains several unique kernels. First, the benchmark contains a scalable data generator that creates a unique graph for each execution of the benchmark. This generator produces the edge tuples containing the start and end vertex for each edge [47]. The first kernel essentially constructs an undirected graph which is used as input to the second kernel. The second kernel performs a basic graph traversal using a breadth-first search [34].

Figure 2.1 displays the results from the entire November 2016 Graph 500 list. The benchmark results are recorded in terms of the number of billions of traversed edges per second, or GTEPS. However, rather than displaying the results in terms of the raw benchmark performance, we derive the relative performance per core for each of the systems (GTEPS/core). We can quickly see that only a very small number of systems exceed 1 GTEPS per core. The vast majority of the systems fall well below 0.5 GTEPS per core. The average per-core performance across all the released results is actually only 0.206 GTEPS/core.

Figure 2.1: November 2016 Graph 500 Results

What is most interesting about these results, albeit based upon a synthetic benchmark, is the distribution of high performance systems that utilize novel processing and memory subsystems. We see that all the systems with per-core performance over 1.0 GTEPS/core have some degree of special-purpose processing logic. Further, the systems that exceed 1.5 GTEPS/core were each developed specifically to execute data intensive applications. As a result, we can clearly see the effect on performance and efficiency when crafting a core architecture specifically for data intensive workloads.
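The per-core metric used above is simply the reported traversal rate divided by the core count of the submission; the snippet below shows the arithmetic with placeholder values rather than figures taken from the Graph 500 list.

#include <stdio.h>

/* GTEPS-per-core = reported GTEPS / number of cores in the submission.
 * The inputs below are placeholders, not published Graph 500 entries.   */
int main(void) {
  double gteps = 250.0;          /* hypothetical reported GTEPS           */
  double cores = 1024.0;         /* hypothetical core count               */
  printf("GTEPS/core = %.3f\n", gteps / cores);
  return 0;
}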


2.2 Algorithmic Design

Given the surge of interest in data intensive computing, a large degree of recent research in the field has focused on algorithmic constructs and data representations to efficiently execute data intensive algorithms on commodity platforms. Many of these projects have focused on compressed or consolidated representations that provide a higher statistical probability of spatial and temporal locality in multi-level cache hierarchies. Most recently, a large degree of interest and research has been in the field of utilizing traditional linear algebra solvers in order to derive statistical and numerical qualities of large scale graphs (i.e., sparse matrices) [44] [86]. A community effort has formed to implement standard methodologies and data representations alongside machine-optimized sparse linear solvers based upon this research. The Graph BLAS Forum has presented excellent research results by the member organizations in the field [3]. The goal is to standardize the interface specifications by which users and applications make use of these facilities, much in the same manner as the original BLAS library implementations [60]. However, as of this writing, there has yet to exist a standard implementation of the Graph BLAS routines. Further, there does not yet exist an implementation standard that supports parallel execution across shared or distributed memory architectures.

One special area of algorithmic research is the ability to partition or shard a graph or sparse construct into multiple, relatively equivalent subgraphs. This ability to partition extreme-scale data structures into equivalent sub-components is key to scaling large-scale analytics workloads across distributed memory parallel systems [74] [31]. However, we often find that the individual partitioning solutions are well suited to specific classes of data, such as static social network data or transportation graphs [83] [144]. Rarely do we find approaches that are generally applicable across domains. What we find in these cases of non-optimal partitioning are algorithms that exhibit a large degree of communication, generally with small messages. As a result, the optimization of the approach becomes an optimization of the communication patterns, rather than an optimization of the fundamental data intensive algorithm.

2.3 Programming Models

Several programming models are currently considered to be widely adopted for data intensive computing algorithmic research and commercial development. Large cloud infrastructures are currently known to utilize frameworks such as MapReduce [156] and Spark [176] to execute data intensive applications. These programming models are generally constructed on top of portable, machine-independent language constructs such as Java and Scala. Further, the cloud infrastructures that execute these programming models often utilize simple Ethernet connectivity for large-scale messaging within a given workload.

Ultimately, this class of programming model is often well suited to those outside of extreme-scale data intensive computing. Organizations without access to very large cloud infrastructures or those executing time or mission-critical workloads will find that the quality of service guarantees and the relative performance thresholds of MapReduce and Spark fail to meet their needs. Spark, for example, attempts to optimize the use of main memory for its working set as opposed to MapReduce's use of special-purpose file systems. This limitation is generally due to the abstract nature of the machine-independent languages upon which these frameworks are constructed.

In addition to the aforementioned data-intensive-specific language constructs, recent research has also explored the use of traditional high performance computing parallel languages for data intensive workloads. The message passing interface (MPI) standard has been utilized to replicate out-of-core style parallel computing frameworks [91]. The wide adoption, maturity of implementation and cross-platform support of MPI distributions makes this an attractive solution. However, MPI, despite containing one-sided messaging, lacks much of the fundamental latency-hiding, global addressing and atomicity support required to efficiently implement data intensive algorithms.

The OpenMP [15] shared memory programming model has also been utilized to implement data intensive algorithms. Several reference implementations of the Graph 500 and the GAP benchmark suites [35] are constructed using this standard programming model. Much like MPI, OpenMP is well supported across platforms and tool chains. However, it also lacks the ability to scale beyond a single node and, as such, the size of the algorithmic working set is limited to the capacity of a single node.

The common theme amongst all the aforementioned programming models is that no one approach is considered to be best suited to solving all problems in data intensive computing. Users tend to utilize these as needed by the target application size, scope and required time-to-solution. As a result, we can assume no single programming platform as the basis of our research.


2.4 Software Infrastructures

In addition to the aforementioned programming models, a number of disparate projects have focused on developing sequential and parallel toolkits for solving specific classes of data intensive computing workloads. As an example, the MLlib toolkit is an extension to the Spark programming model that enables users to build and experiment with data intensive machine learning algorithms [112]. The Spark community also provides a graph processing framework, deemed GraphX [171]. Both of these infrastructures have unique data representation and algorithmic implementations that are optimized in the native Scala, Python and C++ programming languages. While convenient to those familiar with the Spark infrastructure, these extensions require the use of special language extensions outside the scope of standard parallel programming methods.

In addition to extensions to high-level programming constructs, the Combinatorial BLAS, or CombBLAS, library is constructed using high performance C++ in order to implement standard data representations, transformations, traversals and numerical solvers for sparse linear algebra [44]. These solvers are generally utilized to implement high performance graph or network theory algorithms that require high performance implementations. The machine-level nature of the C++ implementations permits users and developers to extend and optimize the library and the overlaying applications for specific architectures. Many of the solvers present in the latest CombBLAS implementation are modern solutions using the latest in efficient data representations [43] [86]. While the underlying library provides an advantageous platform for machine-specific optimization, it does not consider the inherent parallelism present in highly parallel data representations.

All of these sample infrastructures inherently rely upon a core programming model or methodology of parallelism. Moreover, each is fundamentally driven by slightly different workload and usability requirements. Therefore, the end result is a difficult optimization problem for the programming model implementers and the hardware architects given the disparate number of application and algorithm targets. This is especially true when considering the already inefficient performance characteristics of workloads such as the Graph 500 on modern, commodity architectures.

We can clearly see that across data representations, programming models and software infrastructures, there is a clear gap between efficient execution and expressive programmability. This perfect storm of computational inefficiency is the driving phenomenon behind the GoblinCore-64 research presented herein.


CHAPTER 3 CORE SYSTEM ARCHITECTURE

3.1 Machine Hierarchy

The GoblinCore-64 machine hierarchy can be described in terms of three major units: the execution hierarchy, the memory hierarchy and the core architecture. The core architecture implements the core instruction set and concurrency instruction set extensions required to operate a single execution unit. The execution hierarchy implements a set of configurable, nested hardware mechanisms that define the breadth and depth of the available concurrency in the system. Finally, the memory hierarchy implements a set of primitive hardware units in conjunction with the execution hierarchy in order to transparently promote efficient use of memory bandwidth resources.

The machine hierarchy of GoblinCore-64 is designed to facilitate multiple designs and implementations with varying degrees of hardware concurrency and system scalability. We provide this functionality via a set of six hierarchical layers in the execution module hierarchy. Layers 0-3 are required of all devices that implement the GC64 architecture. Layers 4-5 are optional module layers that provide additional scalability beyond a single system-on-chip. Each subsequent layer requires all levels below it to ensure a correct implementation. The execution hierarchy layers are labeled as follows:

• Layer 0: Task Unit

• Layer 1: Task Processor

• Layer 2: Task Group

• Layer 3: Socket

• Layer 4: Node

• Layer 5: Partition

The GC64 task unit is the smallest divisible unit of concurrency. The task unit contains a single RISC-V integer register file, the optional RISC-V floating point register file, the GC64 user-visible registers and the GC64 machine state registers. The goal of the task unit is to provide a simple hardware mechanism that maps directly to a single unit of parallelism in software. Software runtime libraries have the ability to map constructs such as threads or tasks directly to a single task unit.

We map one or more task units to a task processor. The GC64 task processor consists of the integer and optional floating point arithmetic units, a thread control unit as well as the associated task units (maximum of 256). The task processor, as shown in Figure 3.1, is permitted to execute instructions from a single task unit on any given cycle. In this manner, each task processor permits full control from a single task unit at any given time. It is the job of the thread control unit to enforce which task unit is in focus on any given cycle. Any time the thread control unit switches focus from one task unit to an adjacent task unit, we refer to this event as a context switch. The thread control unit may select a new task unit only from the task units that are directly attached. GC64 requires that at least one task unit exist per task processor. GC64 permits a maximum of 256 task units per task processor.

Figure 3.1: GC64 Task Processor

The GC64 Task Group, as shown in Figure 3.2, consists of one or more task processors interconnected to a local memory management unit (MMU) and subsequently to the SoC peripheral interfaces. This MMU serves two purposes. First, it determines whether a request is designated as a local request or a global request. Local requests are serviced by either the on-chip scratchpad memory or one or more HMC interfaces. Global requests are serviced by off-chip resources. Second, the local MMU performs request coalescing. Memory requests from multiple tasks across multiple task units and task processors are coalesced into larger request packets in order to optimize channel bandwidth utilization. The memory hierarchy is discussed further in Section 4.2.
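A hedged C sketch of the per-task-unit state follows, using the register names that appear in Figure 3.1 (the RISC-V integer and floating point files plus the GC64 tctx, tid, gconst and gcount registers); the field widths and any additional machine state registers are assumptions for illustration only.

#include <stdint.h>

/* Approximate per-task-unit architectural state (register names taken
 * from Figure 3.1; widths and any further machine state are assumed).  */
typedef struct {
  uint64_t x[32];     /* RISC-V integer register file                   */
  double   f[32];     /* optional RISC-V floating point register file   */
  uint64_t tctx;      /* GC64 register (per Figure 3.1)                 */
  uint64_t tid;       /* GC64 register (per Figure 3.1)                 */
  uint64_t gconst;    /* GC64 register (per Figure 3.1)                 */
  uint64_t gcount;    /* GC64 register (per Figure 3.1)                 */
} gc64_task_unit_t;

/* A task processor owns up to 256 task units and executes instructions
 * from exactly one of them on any given cycle.                         */
typedef struct {
  gc64_task_unit_t units[256];
  int              in_focus;  /* task unit selected by the thread control unit */
} gc64_task_processor_t;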



Figure 3.2: GC64 Task Group

The GC64 socket architecture shown in Figure 3.3 combines multiple task groups (and their hierarchical components) into a system-on-chip package. This package has the ability to host up to 256 GC64 task groups under the current architecture specification. In addition to task groups, the GC64 socket contains a number of peripheral modules. First, it contains a scratchpad memory module that acts as a low-latency, high bandwidth, user-allocatable memory resource for commonly used data. It also contains an atomic memory operation (AMO) unit that controls queuing, ordering and arbitration of atomic memory operations. All off-chip messages or data are handled via one or more Hybrid Memory Cube [17] channel interfaces and/or the GC64 off-chip network interface.

All of the intermediate components within a GC64 socket are interconnected via the open source OpenSoC Fabric interconnect logic [65]. The OpenSoC Fabric infrastructure is actually a network-on-chip generation framework that permits designers to architect different topologies, queueing mechanisms, pipelined routers and virtual channels. OpenSoC Fabric is built using the Chisel high-level hardware design language [29], which is also utilized for the RISC-V Rocket core implementation [25]. As a result, OpenSoC Fabric is a natural inclusion into the overall GoblinCore-64 hardware implementation (Figure 3.4). Currently, OpenSoC Fabric supports both mesh and flattened butterfly topologies [89]. As the basis of the GoblinCore-64 on-chip system architecture, we utilize a modified 2D, N × M degree mesh network. In this manner, we minimize long routes between major components in order to mitigate long latency routes across the entire device. Each of the routing points in the mesh infrastructure is implemented using a four-cycle pipeline. The four routing cycles are identified as follows, with a brief code summary following the list:



Figure 3.3: GC64 Socket Architecture

1. Routing computation: Eligible output ports and virtual channels are computed

2. Virtual channel allocation: Determines specific output port and virtual channel

3. Switch allocation: Determines whether the packet FLIT (flow control unit) can proceed

4. Switch Traversal: Packet FLIT traversal across the links
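The code summary referenced above simply enumerates the four pipeline stages in order; it is a descriptive aid, not the OpenSoC Fabric implementation.

/* Stage ordering of the four-cycle router pipeline described above. */
typedef enum {
  STAGE_RC = 0,   /* routing computation: eligible output ports and VCs   */
  STAGE_VA,       /* virtual channel allocation: pick output port and VC  */
  STAGE_SA,       /* switch allocation: may this FLIT proceed?            */
  STAGE_ST        /* switch traversal: FLIT crosses the link              */
} router_stage_t;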

In order to best optimize the network-on-chip routing infrastructure, we can further organize the on-chip components into quadrants. Each quadrant contains a local HMC channel interface with an associated atomic memory operation unit, an on-chip network interface as well as some number of GC64 Task Groups. In this manner, a well-placed application may have advantageous use of its locally-connected memory and network resources as well as equivalent bandwidth and latency to other on-chip components such as the on-chip scratchpad. The only exception to this fundamental quadrant methodology is the placement of the software-managed scratchpad component. This component resides in a separate dimension with two ports connected to the OpenSoC Fabric mesh. In this manner, we have the ability to support separate ports for read and write requests, respectively. As a result, the software-managed scratchpad device appears as a weakly-ordered memory device just as the HMC-based main memory components do.

Figure 3.4: GC64 Task Group OpenSoC Fabric Connectivity

As an example of the aforementioned mesh topology, Figure 3.5 contains a sample layout of a 4x4 mesh with the additional dimension containing the scratchpad memory component. Notice how each quadrant contains an individual HMC channel interface as well as an off-chip network interface. As manufacturing techniques for high performance SERDES links improve, we will improve the balance of local versus remote connectivity as well as the scalability of our mesh. Also note that each quadrant contains two fully encapsulated GC64 Task Groups. Each Task Group contains some number of Task Processors and each Task Processor contains a number of Task Units. In this manner, we maintain a discrete hierarchy of parallelism that can be modified depending upon the target application, power threshold or manufacturing process.

The remaining two modules in the GC64 execution hierarchy are optional based upon the desire to scale the infrastructure out beyond a single socket. The node hierarchy consists of one or more GC64 socket modules that reside within a single node addressing domain. The architecture specification defines a maximum of 256 sockets per node and a maximum of 65,536 nodes per partition. Each GC64 partition consists of one or more nodes that share a top-level addressing domain. Partitions may also contain links to adjacent partitions in order to create much larger system configurations. The node and partitioning schema are architected to provide a natural boundary for defining memory locality domains in higher-level programming models while maintaining globally shared memory addressing. The maximum configured GC64 system architecture can be described as follows:

• 256 Task Units per Task Processor

• 256 Task Processors per Task Group

• 256 Task Groups per Socket

• 256 Sockets per Node

• 65,536 Nodes per Partition

• 65,536 Partitions per System

• Maximum task concurrency of 18,446,744,073,709,551,616 tasks
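The maximum task concurrency quoted in the list above follows directly from multiplying the per-level limits; the short program below reproduces the arithmetic.

#include <stdio.h>

/* 256 * 256 * 256 * 256 * 65,536 * 65,536
 *   = 2^8 * 2^8 * 2^8 * 2^8 * 2^16 * 2^16 = 2^64 tasks.                */
int main(void) {
  unsigned bits = 8 + 8 + 8 + 8 + 16 + 16;    /* log2 of each level limit */
  long double total = 1.0L;
  for (unsigned i = 0; i < bits; i++)
    total *= 2.0L;
  printf("maximum concurrency = 2^%u = %.0Lf tasks\n", bits, total);
  return 0;
}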



Figure 3.5: GC64 4x4 Network on Chip Mesh Topology


3.2 System Interconnect

3.2.1 System Interconnect Requirements

In order to support the aforementioned scalability metrics, a scalable system interconnect must be coupled to the GoblinCore-64 socket infrastructure. At the time of this writing, there were a number of modern system interconnects available via commodity and pseudo-commodity mechanisms. Interconnects such as the Mellanox Infiniband standard have recently been utilized to connect systems of heterogeneous nodes, thereby permitting remote direct memory access (RDMA) between accelerator nodes [151] [140] [129]. However, we find that the Infiniband interconnect only performs efficiently when utilizing large messages or message frames between nodes. Small messages or packets, such as those found in HMC memory requests, are far too small to induce efficient execution in a large-scale Infiniband deployment.

Conversely, Cray recently constructed an interconnect, deemed Aries, through the DARPA HPCS program [57] [79] [64] [27]. The Aries interconnect adapts what are traditionally PCI-Express system interconnection protocols to a scalable, routable network infrastructure. The Aries interconnect has the ability to transfer smaller messages more efficiently and provides excellent scalability via its Dragonfly interconnect topology [89]. However, the Aries interconnect remains proprietary to the Cray line of supercomputing platforms and, as a result, fails to meet our needs.

Finally, several commercial interconnect technologies and standards have been constructed for traditional embedded computing environments. The RapidIO standard is one of the leading examples of a standard interconnection technology that has an open specification governed by a committee of members with commercial and academic implementations available [41] [42] [138]. However, the RapidIO standard lacks sufficient scalability to support even a modest platform implementation using the GoblinCore-64 system infrastructure.

Given these interconnection options, we find that one of the major detractions from these and other commodity options is the distinctive nature of the interconnect versus other peripheral systems. The high performance system interconnects require distinct I/O pins that are separate from other system buses or interconnects such as the main memory interconnect and the periphery interconnect (e.g., PCI). As a result, we expend chip area and I/O pins on supporting multiple protocols for multiple devices, which subsequently translates to data transfer latency in performing the protocol translations to build and issue messages.

Following our initial survey and conclusions of the current commodity interconnect options, we defined a set of requirements to drive the development of our GoblinCore-64 scalable system interconnect. We summarize these requirements as follows:

• Message Concurrency: Given the use of our HMC memory devices and the large degree of thread and task concurrency in the core processing elements, our system interconnection design must support a complementary degree of message concurrency. The interconnect should support weak ordering of messages and flow control necessary to support prioritization of particular packets or packet classes.

• Message Payload Flexibility: The message payload should support a multitude of different packet sizes without significant performance degradation for the smaller class of packets. Further, the messaging infrastructure and associated protocol should support registered (two-sided) read and write commands as well as one-sided, or posted requests.

• Link Physical Layer Homogeneity: Given the inherent cost of physical connectivity technology (SERDES, LVDS, etc.), we seek to construct a physical interconnect for both local and global connectivity that utilizes the same physical medium. As a result, the cost in licensing the physical technology and the time spent verifying the correctness of the design are minimized.

• Ability to Support Peripheral Devices: Given that we seek to build an interconnect with physical layer homogeneity, we also seek to extend this requirement to include support for periphery devices. High performance and data analysis systems are inherently connected to periphery subsystems such as external networks, management networks and storage in order to provide ancillary services and connectivity. As a result, we must support connectivity to these devices.

• Flexible Routing: As mentioned above, the GoblinCore-64 scalable system architecture has the ability to support a large number of system nodes and associated GC64 sockets. As a result, our interconnect design needs to support multiple routing topologies with a limited routing overhead. Classic interconnection networks such as Infiniband maintain expensive, stateful routing tables that become a burden at an extreme scale. We seek to avoid this scenario with GoblinCore-64.

• Support for Rudimentary Memory Traffic: Ideally, our interconnect design will support rudimentary memory traffic. This implies that the interconnect directly supports our physical memory format and the associated scalable physical addressing format. In this manner, the hardware is not required to perform costly translations from rudimentary memory requests (both local and global) to orthogonal messaging formats.

• Protocol Homogeneity: In order to optimize the hardware components necessary to construct and exchange messages over the local and global interconnection devices, we seek to construct an interconnect methodology that makes use of a singular messaging protocol. As a result, we can optimize the generation of local and global messages into singular hardware and software mechanisms.

• Hardware Atomicity: One of the most costly classes of operations performed over scalable system interconnects is barrier/synchronization operations. As a result, we seek to build an interconnect infrastructure that supports hardware-driven atomicity mechanisms such as sync-fetch-and-add or compare-and-swap (a brief C example follows this list).
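For reference, the two primitives named in the requirement above correspond to the familiar fetch-and-add and compare-and-swap operations; the C11 fragment below shows their software form, which the interconnect is intended to execute directly in hardware. The function names are illustrative.

#include <stdatomic.h>
#include <stdint.h>

/* Software form of the atomic operations the interconnect should offload. */
uint64_t barrier_arrive(_Atomic uint64_t *count) {
  return atomic_fetch_add(count, 1);                      /* fetch-and-add */
}

int try_claim(_Atomic uint64_t *word, uint64_t expected, uint64_t desired) {
  return atomic_compare_exchange_strong(word, &expected,  /* compare-and-swap */
                                        desired);
}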

Given the aforementioned set of interconnection requirements and the inherent use of Hybrid Memory Cube devices, we centralize our interconnect design and implementation around the HMC second generation physical and packet specification [17]. The HMC memory specification supports a packetized interconnect specification with one-sided, two-sided and atomic memory operations sufficient to meet our needs. Further, the packet opcode specification has a sufficient number of free opcodes that, with an extensible routing infrastructure, would have the ability to route memory packets across a scalable interconnect. The use of the HMC physical and addressing layer for interconnection networks has been previously proposed [88] [152] [121] [72] [96] [102] [95], but never implemented in practice or at this scale. As a result, the novelty of our approach is the inherent adaptation of the HMC interconnection methodology as a scalable interconnect for large-scale data intensive computing. Section 3.2.2 details the interconnect specifications for the local (main memory) interconnect and Section 3.2.3 details the extension of HMC to a global interconnect for GC64.



Figure 3.6: GC64 Local Memory Interconnection

3.2.2 Local System Interconnect Architecture The GoblinCore-64 local system interconnection methodology is designed to service the main memory requirements of each GC64 socket. Given the use of our HMC memory devices, we architected the GC64 socket to accept eight (8) HMC links. Four of the HMC links are dedicated to the local interconnection network. Each of the aforementioned HMC links are constructed using 15 Gbps high speed SERDES lanes. Each link consists of sixteen input lanes and sixteen output lanes. These configura- tions are known as full-width data lanes in the HMC vernacular [17]. The aggregate link bandwidth using the 15 Gbps SERDES physical interconnect is 240 GB/s (60 GB/s per link). As we see in Figure 3.6, the local HMC interconnection network is attached to a single, master, HMC memory device. Each GC64 socket is connected, at minimum, one HMC


Figure 3.7: HMC Interconnection Topologies

These HMC devices can be configured in 4 GB or 8 GB capacities, with a maximum internal DRAM bandwidth of 320 GB/s. As a result, the minimum peak memory bandwidth in GoblinCore-64 is 240 GB/s per socket (60 GB/s per link), or equivalent to the maximum aggregate link bandwidth. Given that data intensive computing algorithms and applications may require large memory footprints, our local HMC interconnection network also supports attaching multiple HMC devices as slaves to the primary, or master, HMC memory device. This methodology of chaining, in the HMC vernacular, can be performed in a multitude of topologies as shown in Figure 3.7 [98] [99]. Figure 3.6 depicts an eight-link HMC primary device with two slaves chained in succession. The subsequent HMC devices are addressed directly through the Cube ID, or CUB, field which is embedded in the physical memory specification of HMC. The maximum number of devices that can be interconnected in the local HMC memory network is eight per socket. As a result, the current maximum theoretical memory capacity is 64 GB per socket using 8 GB HMC devices.
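The arithmetic behind these figures can be restated in a short C sketch. This is illustrative only: the constants are taken from the text above, and the device count is a hypothetical configuration parameter rather than a fixed architectural value.

#include <stdio.h>

/* Illustrative restatement of the local memory figures above.  Constants
 * come from the text; 'ndevices' is a hypothetical configuration knob
 * (one master HMC device plus up to seven chained slaves). */
int main(void) {
  const double lane_rate_gbps = 15.0;  /* SERDES lane rate                 */
  const int    lanes_per_dir  = 16;    /* full-width link, each direction  */
  const int    local_links    = 4;     /* links dedicated to local memory  */
  const int    ndevices       = 3;     /* e.g., one master + two slaves    */
  const int    device_cap_gb  = 8;     /* 4 GB or 8 GB HMC devices         */

  /* 16 lanes x 15 Gbps = 240 Gbps = 30 GB/s per direction, 60 GB/s per link */
  double link_bw_gbs  = (lane_rate_gbps * lanes_per_dir * 2.0) / 8.0;
  double local_bw_gbs = link_bw_gbs * local_links;     /* 240 GB/s per socket */
  int    capacity_gb  = ndevices * device_cap_gb;      /* up to 64 GB with 8  */

  printf("per-link: %.0f GB/s, local aggregate: %.0f GB/s, capacity: %d GB\n",
         link_bw_gbs, local_bw_gbs, capacity_gb);
  return 0;
}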

3.2.3 Global System Interconnect Architecture
Much in the same way that we utilize the HMC technology for the local system interconnect, we also make use of the HMC methodology for interconnecting devices for our scalable, global system interconnect. However, we make several key extensions to the core HMC methodology in order to support our scalability and global usability requirements.


These extensions can be classified into three categories: link connectivity, interconnect routing and global messaging.


Figure 3.8: GC64 System Interconnection

First, we utilize the four remaining interconnection links as system links for our scalable interconnection network. Two of the four links are dedicated to routing traffic between sockets within a node. The node-level interconnection topology is configured as a simple ring topology in order to avoid ancillary switching components.


Table 3.1: GoblinCore-64 System Link Bandwidth Summary

Source                 Destination            Links   Bandwidth
GoblinCore-64 Socket   HMC Memory             4       240 GB/s
GoblinCore-64 Socket   GoblinCore-64 Socket   2       120 GB/s
GoblinCore-64 Socket   Router                 2       120 GB/s
Router                 Router                 3       180 GB/s

In Figure 3.8, these node-level interconnection links are displayed in orange. Next, the partition, or global, interconnection links occupy the remaining two HMC links per socket. These links are displayed in black. Each global link from each socket is connected to a unique GoblinCore-64 routing component that exists as a separate device. These routing components are rudimentary RISC-V devices that implement the HMC routing protocol for global messaging via firmware. In this manner, we may modify and extend our topological designs in future research. For every four GC64 sockets, we configure two routing nodes in a redundant manner. Each of the router devices also contains eight HMC links (Figure 3.8). Four links are dedicated to connecting the local GoblinCore-64 sockets, three links are dedicated to the global GC64 network topology and one link is dedicated to interconnecting periphery I/O devices such as storage and management networks. Using the maximum link bandwidth of 60 GB/s, we find that each socket has 120 GB/s of bandwidth to adjacent sockets and 120 GB/s to the global network. Each router has 180 GB/s of bandwidth to the global network and 60 GB/s of bandwidth to periphery devices. These bandwidth figures are summarized in Table 3.1. Given this, we can see that the GoblinCore-64 interconnection network is completely homogeneous for its high performance message traffic, thus supporting our aforementioned interconnection requirements.
In order to support a large degree of message concurrency within the global interconnect, we initially choose a three dimensional torus interconnect topology between routing elements [24]. Torus interconnects have been utilized to efficiently map data intensive computing algorithms across multi-dimensional toroidal interconnects in the past [175]. Each of the routing devices is connected to three adjacent routers, one in each dimension, labeled as X, Y and Z in Figure 3.8. Given this, we find that our fully connected global system interconnect becomes a scalable three dimensional torus whereby the processing nodes are rings of processing elements. The total interconnect has redundancy using the toroidal routing interconnections.



Figure 3.9: GC64 4x4xN Torus Interconnection Topology

We provide a sample of a single two-dimensional, 4x4xN plane of our interconnect topology in Figure 3.9. This global topology, coupled with the scalable physical memory addressing methodology outlined in Section 5.1.1, provides ample opportunity to express physical memory locality and associated memory latencies in scalable data intensive computing algorithms.
Finally, in order to support our global memory request requirement, we define an extended set of message packets for the HMC interconnect. While our local, commodity-based HMC memory devices will not support these packet types, the global interconnect components such as the system links and the routing devices will support the extended packet encodings.


Extended Header (8 bytes) | HMC Packet | Extended Tail (8 bytes)

Figure 3.10: GC64 Extended HMC Packet

This method of extending the core HMC messaging and packet infrastructure has been described in previous works in order to extend the in-situ atomicity functions [101] [94], for graph traversals [120] and for processing near memory [152] [121] [72] [96]. In order to support our extended memory addressing and messaging capability, we utilize the in-situ packet specification for memory reads, writes, posted writes and atomic operations in the Gen2 HMC specification [17]. For each of the aforementioned packet types, the HMC specification encodes a non-zero number of flow control units, or FLITs, for the request packet and (for non-posted requests) a non-zero number of FLITs for the response packet. Each FLIT is designated as 16 bytes of data that contains all the necessary locality, opcode (request type) and addressing information for the respective request. As shown in Figure 3.10, we extend this functionality to include an additional FLIT for each request and response packet, respectively, for each of the standard message opcodes. We utilize the unused opcode space present in the HMC specification for these custom memory cube, or CMC, operations. The extended payload space is utilized to encode the source (requester) routing information in the packet tail and the destination (endpoint) routing information in the packet header. As a result, each GC64 socket has the ability to dispatch rudimentary, extended memory requests using our extended HMC request structure to any adjacent node connected to the global memory interconnect. We summarize the opcode encodings and associated FLIT lengths for each of our extended memory operations in Table 3.2 and Table 3.3.
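A structural view of this framing is sketched below in C. The field names, the fixed maximum FLIT count and the flat layout are assumptions made for illustration; the actual wire encoding is defined by the packet specification, not by this struct.

#include <stdint.h>

/* One 16-byte flow control unit (FLIT). */
typedef struct { uint64_t lo, hi; } hmc_flit_t;

/* Logical sketch of the GC64 extended packet framing: an 8-byte extension
 * ahead of the base HMC packet carrying destination routing, and an 8-byte
 * extension behind it carrying source routing.  Together the two extensions
 * add one FLIT to every request and response.  Field names are illustrative;
 * the base packet is sized here for the largest (18-FLIT) request. */
typedef struct {
  uint64_t   dest_route;     /* extended header: destination (endpoint) routing */
  hmc_flit_t hmc_flits[18];  /* base Gen2 HMC packet FLITs (variable length)    */
  uint64_t   src_route;      /* extended tail: source (requester) routing       */
} gc64_ext_packet_t;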

3.2.4 System Interconnect Scalability
As a result of the aforementioned system interconnect design and implementation, we may summarize the performance of our proposed GoblinCore-64 platform in terms of the memory capacity, local memory bandwidth, global interconnect bandwidth and the bisection bandwidth. We present the results of this analysis in Table 3.4.
We can quickly see that even modestly populated GoblinCore-64 systems have tremendous capability in memory and system interconnect bandwidth. A system with four endpoints per dimension (4x4x4) that contains only 256 GC64 sockets has an aggregate local bandwidth of over 60 TB/s. Further, the aggregate global bandwidth is over 23 TB/s and the bisection bandwidth is over 5 TB/s. We may further scale this configuration out to 16 endpoints per dimension, where the local bandwidth exceeds 3 PB/s and the global bandwidth is near 1.5 PB/s in aggregate. This 16x16x16 system is well within the realm of current system platforms in deployment as of this writing.
Finally, a larger system of 64 endpoints per dimension with over one million GC64 sockets would be capable of 250 PB/s (1/4 EB/s) of local memory bandwidth and over 90 PB/s of aggregate global bandwidth. As a result, the GoblinCore-64 local and global system interconnect methodology is poised to scale well beyond current system architectural limits, thus supporting the memory bandwidth and concurrency required to support the next generation of scalable data intensive computing applications.


Table 3.2: GoblinCore-64 System Interconnect Packets

Command                 Opcode   Request FLITs   Response FLITs
16-byte Write           0x4      3               2
32-byte Write           0x5      4               2
48-byte Write           0x6      5               2
64-byte Write           0x7      6               2
80-byte Write           0x14     7               2
96-byte Write           0x15     8               2
112-byte Write          0x16     9               2
128-byte Write          0x17     10              2
256-byte Write          0x20     18              2
16-byte Posted Write    0x24     3               0
32-byte Posted Write    0x25     4               0
48-byte Posted Write    0x26     5               0
64-byte Posted Write    0x27     6               0
80-byte Posted Write    0x29     7               0
96-byte Posted Write    0x2A     8               0
112-byte Posted Write   0x2B     9               0
128-byte Posted Write   0x2C     10              0
256-byte Posted Write   0x2D     18              0
16-byte Read            0x2E     2               3
32-byte Read            0x2F     2               4
48-byte Read            0x38     2               5
64-byte Read            0x39     2               6
80-byte Read            0x3A     2               7
96-byte Read            0x3B     2               8
112-byte Read           0x3C     2               9
128-byte Read           0x3D     2               10
256-byte Read           0x3E     2               18


Table 3.3: GoblinCore-64 Atomic System Interconnect Packets

Command                                    Opcode   Request FLITs   Response FLITs
Dual 8-byte Signed Add Imm                 0x70     3               2
Single 16-byte Signed Add Imm              0x71     3               2
Posted Dual 8-byte Signed Add Imm          0x72     3               0
Posted Single 16-byte Signed Add Imm       0x73     3               0
Dual 8-byte Signed Add Imm and Return      0x74     3               3
Single 16-byte Signed Add Imm and Return   0x75     3               3
8-byte Increment                           0x76     2               2
Posted 8-byte Increment                    0x77     2               0
16-byte XOR                                0x55     3               3
16-byte OR                                 0x56     3               3
16-byte NOR                                0x57     3               3
16-byte AND                                0x58     3               3
16-byte NAND                               0x59     3               3
8-byte Compare & Swap (GT)                 0x5A     3               3
16-byte Compare & Swap (GT)                0x5B     3               3
8-byte Compare & Swap (LT)                 0x5C     3               3
16-byte Compare & Swap (LT)                0x5D     3               3
8-byte Compare & Swap (EQ)                 0x5E     3               3
16-byte Compare & Swap (ZERO)              0x66     3               3
8-byte Equal                               0x67     3               2
16-byte Equal                              0x6B     3               2
8-byte Bit Write                           0x6C     3               2
Posted 8-byte Bit Write                    0x6D     3               0
8-byte Bit Write with Return               0x6E     3               3
16-byte Swap                               0x6F     3               3


Table 3.4: GoblinCore-64 System Interconnect Performance

Topology   GC64 Sockets   Local Memory Bandwidth   Global Memory Bandwidth   Bisection Bandwidth
4x4x4      256            61,440 GB/s              23,040 GB/s               5,760 GB/s
8x8x8      512            491,520 GB/s             131,072 GB/s              11,520 GB/s
16x16x16   16,384         3,932,160 GB/s           1,474,560 GB/s            46,080 GB/s
32x32x32   131,072        31,457,280 GB/s          11,796,480 GB/s           184,320 GB/s
64x64x64   1,048,576      251,658,240 GB/s         94,371,840 GB/s           737,280 GB/s


CHAPTER 4 INSTRUCTION SET EXTENSIONS

4.1 Core Micro Architecture
Recent efforts have brought the RISC-style micro architecture infrastructure back to mainstream computer architecture. The RISC-V instruction set infrastructure is designed to serve as an efficient, open source basis for computer architecture exploration, commercial processor development and application-specific device construction. The efficiency associated with RISC-style micro architectures provides an excellent target for modern optimizing compilers and system software.
However, the simplicity of these RISC micro architectures may induce a scenario where seemingly simple operations such as managing threads, generating communication messages and exploiting available memory bandwidth require a significant number of instructions to accomplish efficiently. These basic architectural constraints often represent significant sources of latency when executing data intensive applications on modern RISC architectures.
As a result, we focus the micro architectural development of the GoblinCore-64 effort on extending the inherently efficient RISC-V ISA. Rather than crafting an entirely new instruction set and associated micro architecture, we craft unique extensions to the existing RISC-V ISA that stand to significantly accelerate our ability to exploit parallelism and effectively use bandwidth while executing data intensive applications.
This chapter presents the core micro architecture of the GoblinCore-64 infrastructure as a set of extensions to the RISC-V instruction set specification. These extensions focus on several general areas. First, we add instructions and architectural extensions to efficiently exploit high bandwidth memory systems. Next, we add micro architectural features to assist in managing thread and/or task state directly from the hardware. In this manner, runtime software libraries can efficiently exploit parallelism on GC64 using these unique hardware features. Finally, we add architectural extensions to provide runtime systems and software infrastructures the ability to query and interpret the hierarchical nature of GoblinCore. As a result, these environmental extensions can be directly coupled to the aforementioned thread management and bandwidth exploitation instructions to efficiently construct large-scale partitioned, parallel algorithms.


The GoblinCore-64 core architecture consists of an implementation of the RISC-V instruction set architecture [162] as well as extensions specifically required to support the concurrency mechanisms required for GC64. The RISC-V instruction set architecture was specifically chosen due to its ground-up design as an open, modular and extensible instruction set specification [26].
The modularity and extensibility of the RISC-V specification made it extremely attractive as the basis for GoblinCore-64. The RISC-V instruction set architecture comprises a single base ISA alongside a set of modular ISA extensions. The base, required instruction set, herein referred to as RV32I, defines the integral set of instructions and register state necessary to operate the most basic of RISC-V cores. The RISC-V specification describes the RV32I ISA target as the most fundamental forty-seven instructions required to support a compiler target and a modern operating system. This includes basic integer arithmetic required for addressing, core memory operations (load/store), control flow and system calls. It also includes the definition of the base set of user-visible registers. Thirty-two 32-bit general purpose registers are provided (x0-x31) as well as a program counter (pc) register. Compliant RISC-V implementations are required to implement the base ISA.
All of the remaining instructions and state are bundled in one or more instruction set extensions. These extensions provide additional computational, logical and memory support as well as support for operands larger than 32 bits. The instruction set extensions provided in the core RISC-V specification are summarized in Table 4.1.

Table 4.1: RISC-V ISA Extensions

Extension   Description
RV64I       64-bit Integer Support: widens the general purpose registers to 64 bits
RV128I      128-bit Integer Support: widens the general purpose registers to 128 bits
M           Integer Multiplication and Division: adds support for integer multiplication and division


Extension   Description
A           Atomic Instruction Support: provides load-reserved/store-conditional and atomic fetch-and-op memory instructions
F           Single Precision Floating Point Support: adds 32 general purpose single precision floating point registers (f0-f31), a floating point CSR and single precision floating point support
D           Double Precision Floating Point Support: extends the floating point registers to support double precision and adds double precision floating point support
Q           Quad Precision Floating Point Support: extends the floating point registers to support quad precision and adds quad precision floating point support
L           Decimal Floating Point Support: adds support for decimal floating point
B           Bit Manipulation Instruction Support: adds support for bit-manipulation instructions
T           Transactional Memory Support: adds transactional memory operation support
P           Packed-SIMD Instruction Support: adds support for packed SIMD instructions that reuse the existing integer and floating-point registers
C           Compressed Instruction Format: encodes the base integer instructions as compressed 16-bit encodings


Table 4.2: RISC-V ISA Register Extensions

Mnemonic   Index     Access
tctx       0b00000   User, Supervisor
tid        0b00001   User, Supervisor
tq         0b00010   User, Supervisor
te         0b00011   User, Supervisor
gconst     0b00100   User, Supervisor
garch      0b00101   User, Supervisor
gkey       0b10000   Supervisor
gcount     N/A       Arithmetic Machine State

The GC64 core architecture defines a minimum set of RISC-V architectural extensions in order to support the target data intensive application profiles as well as the additional GC64 ISA extension. The GC64 core architecture requires the RV64I, M and A extensions at a minimum. In addition, GC64 optionally supports the F and D extensions to enable hardware floating point and the RV128I extension for extended addressing support. In this manner, GC64 supports 64-bit integer addressing, arithmetic and atomic memory operations with optional support for single and double precision floating point.

4.1.1 GC64 Register State
In addition to the base set of RISC-V extensions required to construct the GC64 core architecture, the GC64 extension defines additional machine register state designed to support the required concurrency mechanisms in the GC64 micro architecture. The additional state includes eight registers: six user registers, one supervisor register and one arithmetic machine state register (Table 4.2). None of the additional register state can be directly accessed from the base RISC-V ISA or the standard set of extensions. It can, however, be accessed from GC64 extended instructions. These additional instructions are described further in Section 5.
The first additional user register, tctx, contains the base address of the task associated with the respective register file. This base address can be utilized to reference the task's stack space, preloaded state and context save/restore state. The second additional user register, tid, contains the unique task identification number for the task currently loaded into the respective hardware unit. This 64-bit value is designed to be utilized by low-level runtime and programming models to uniquely identify threads, tasks or other divisible parallel software units.

The value stored in the tid register is globally unique within the respective process space, much in the same manner as traditional threading models. In many explicitly MIMD parallel programming models, such as OpenMP [15], OpenCL [149] and CUDA [125], these unique identifiers are utilized for integral programming model tasks such as array indexing. In this manner, GC64 provides simple hardware mechanisms specifically designed to support higher level programming models.
The next set of GC64-defined registers includes those that control where new tasks are spawned to and retrieved from, as well as where tasks are sent when exceptions, breakpoints or other machine events occur. The tq, or task queue register, contains the memory address of the primary task queue. The task queue construct contains active tasks that are available to execute. Conversely, the te, or task exception register, contains the memory address of the primary task exception queue. This queue is identical in form to the task queue. However, this queue instance only contains tasks that have encountered a machine event such as an unmasked exception or breakpoint. Any active task that encounters a machine interrupt will perform a context save operation, save its state to the context save region referenced by its tctx register and place the task on the task exception queue. The queuing mechanisms are further discussed in Section 6.
GC64 defines two configuration registers that define the locality of the individual task unit components as well as the total size and shape of the respective system. The gconst register has sub-fields that define items such as the respective task unit, task processor, task group, socket, node and partition. The garch, or architecture register, has sub-fields that define items such as the number of task units per processor, the number of task processors per task group, the number of task groups per socket, the number of sockets per node, the number of nodes per partition and the number of partitions.
The final visible register defined as part of the GC64 ISA extension is the gkey register. The gkey register is classified as a supervisor mode register as it can only be read and written when the RISC-V core supervisor mode is enabled. It cannot be read or written from normal user code. The gkey register is initialized by the kernel runtime with a 64-bit value that is unique for each process space. When new tasks are spawned, the gkey value is written to the new task context region. New tasks are then checked by the hardware runtime support units such that only tasks with gkey values that match what is currently loaded in hardware are executed on the respective task units.


The gkey functionality is essentially a hardware and system software protection mechanism designed to enforce fair sharing of parallel hardware resources.
The GC64 extension contains a single arithmetic machine-state register. This register, known as the gcount register, is not directly readable or writable by any instruction. Its value is similar in function to a traditional performance counter register. Any time an instruction is retired through the pipeline, this value is incremented. The increment value is dependent upon the cost of the retired instruction. Once this counter reaches its maximum value as defined by the internal architecture state, the encountering task unit is forced to context switch, thus permitting another task unit to execute. The only method user-privileged code has to indirectly access this register state is the ctxsw instruction. The ctxsw instruction explicitly sets this register to its maximum value, thus forcing a context switch event.

4.2 Instruction Set Extension
The GC64 architecture contains four classes of instruction set extensions beyond the base set of RISC-V instructions. These instructions are utilized to make better use of system memory bandwidth, control GC64-specific machine state and maintain the task concurrency infrastructure. We classify these four instruction classes as: Scatter/Gather, Concurrency, Environment and Task Control instructions.

4.2.1 Scatter/Gather Instructions
The first class of instructions within the GC64 architecture extension adds the ability to utilize indexed loads and stores. Also called gathers and scatters, respectively, these instructions permit users to utilize linear index arrays in order to calculate the effective address of the load operation. GC64 provides gather, or indexed load, operations for all integer and optional floating point types in the following form:

TYPEGTHR Rd, Rs1, Rs2

where the effective address from which the TYPE value is loaded is formed via:

Rs1 + (Rs2 * size_in_bytes)

Conversely, the GC64 architecture extension also provides scatter instructions. Much like the gather form, the scatter instructions permit indexed store operations where the target address is calculated using linear index arrays.


Scatter instructions are provided for all integer and optional floating point types in the following form:

TYPESCATR Rs1, Rs2, Rs3

where the effective address of the target store operation for the TYPE value is formed via:

Rs1 + (Rs2 * size_in_bytes)
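The addressing semantics of the two forms can be restated in C for 64-bit elements. This is only a behavioral sketch; the function names are illustrative and do not correspond to real intrinsics or to the exact operand binding of the instructions.

#include <stdint.h>

/* Behavioral sketch of the indexed load (gather) and indexed store (scatter)
 * forms above for 64-bit elements (size_in_bytes = 8).  'base' plays the role
 * of the base-address register and 'index' the role of the index register. */
static inline uint64_t gthr_d(const uint64_t *base, uint64_t index) {
  return base[index];          /* load from base + (index * 8) */
}

static inline void scatr_d(uint64_t *base, uint64_t index, uint64_t value) {
  base[index] = value;         /* store value to base + (index * 8) */
}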

4.2.2 Instruction Concurrency
The GC64 architecture extension provides two basic instructions classified as concurrency instructions. These instructions are scalar instructions that are executed by a single GC64 task, but participate in the parallel concurrency mechanisms of the GC64 infrastructure and the parent application. The goal is to permit individual tasks to proactively meter their respective usage of hardware resources.
The first instruction in the concurrency class is the iwait instruction (iwait Rd, Rs1, Rs2). The goal of the iwait instruction is to pend the execution of the encountering task unit until the hazard on the integer register specified by the target register field, Rd, has been cleared. The clear register state indicates that no outstanding load operations are in flight and no outstanding arithmetic operations are in the pipeline. The encountering task pends as long as Rs2 < Rs1; the maximum number of cycles to pend is Rs1 - Rs2. The machine state for the iwait instruction adheres to that described in Algorithm 1.

Algorithm 1 IWAIT instruction algorithm
1: Rd is the target integer register, Rs1 is the maximum unsigned value to be reached, Rs2 is the unsigned start value
2: The encountering task unit pends until Rs2 is greater than or equal to Rs1
3: while Rs2 is less than Rs1 do
4:     Rs2++
5: end while
6: clear iwait state

The second instruction in the concurrency class is the ctxsw instruction. This operation takes no arguments and forces the executing task unit to context switch and yield its operation to an adjacent task unit. This is analogous to the OpenMP concept of task yield [15].


The implementation of ctxsw essentially forces an immediate overflow of the respective gcount register on the executing task unit.
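A minimal software model of the two concurrency instructions is sketched below. The task-unit structure, the way the hazard is represented and the exact pend loop are assumptions made for illustration; they only restate the pend/yield behavior described above.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical task-unit state used only to illustrate iwait/ctxsw behavior. */
struct task_unit {
  uint64_t gcount;        /* accumulated execution pressure                 */
  uint64_t gcount_max;    /* threshold that forces a context switch         */
  bool     hazard_clear;  /* no outstanding loads/arithmetic on the target register */
};

/* iwait Rd, Rs1, Rs2: pend until the Rd hazard clears or Rs2 reaches Rs1.
 * In hardware the hazard may clear asynchronously; here it is a flag. */
static void iwait_model(struct task_unit *tu, uint64_t rs1, uint64_t rs2) {
  while (!tu->hazard_clear && rs2 < rs1)
    rs2++;                               /* each iteration models one stall cycle */
}

/* ctxsw: force an immediate overflow of gcount, triggering a context switch. */
static void ctxsw_model(struct task_unit *tu) {
  tu->gcount = tu->gcount_max;
}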

4.2.3 Environment Instructions
The GC64 architecture extension contains two environment instructions, getgconst and getgarch. These instructions permit a task to query the gconst and garch registers in order to determine the physical locality of the backing task hardware infrastructure and the system configuration details, respectively. The values read from each of the registers are raw bit values. As such, it is up to the executing task to mask out the appropriate fields.
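The masking step can be illustrated with a short C sketch. The field positions and widths below are purely hypothetical assumptions made for the example; the actual gconst and garch sub-field layouts are defined by the GC64 specification, not here.

#include <stdint.h>

/* Extract a sub-field from a raw configuration register value.
 * The shift/width values used below are illustrative assumptions only. */
#define GC64_FIELD(val, shift, bits)  (((val) >> (shift)) & ((1ULL << (bits)) - 1ULL))

static inline uint64_t gconst_task_unit(uint64_t gconst) { return GC64_FIELD(gconst,  0, 8); }
static inline uint64_t gconst_task_proc(uint64_t gconst) { return GC64_FIELD(gconst,  8, 8); }
static inline uint64_t gconst_socket(uint64_t gconst)    { return GC64_FIELD(gconst, 24, 8); }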

4.2.4 Task Control Instructions
The final set of extended instructions is classified as task control instructions. The task control instructions operate in conjunction with the GC64 hardware scheduling mechanisms, the system runtime and the application concurrency mechanisms. We leave the latter two topics for future publications. The task control instructions permit an executing task to spawn or join adjacent tasks as well as query or modify the task queuing state and structures.
The join and spawn instructions use the base address of a task context and attempt to execute a task join or task spawn operation using the information contained within the context structure. The target register for each instruction contains the return status of the operation after completion. These operations are considered to be long latency operations. Despite their inherent simplicity, the implementations of the join and spawn operations require a number of store and load operations, respectively. The goal of providing such complex operations in such a simple form is to enable users and compilers to natively optimize sequential applications with parallel concurrency optimizations. Further, the simplicity of the task join and spawn mechanisms provides a solid basis for porting future programming models and domain specific language constructs without fear of limiting parallel concurrency and execution efficiency.
The GC64 task control instructions provide the ability to retrieve and set the task context value for the executing task unit. Retrieving the task context value via the gettask instruction is equivalent to retrieving the memory address of the context save region. Runtime libraries or applications that need explicit control over the register state for spawning and/or joining threads may utilize this method.


Setting the task context value via the settask instruction is equivalent to replacing the running task image with that of the new target context. This is analogous to the traditional UNIX/Linux exec [14] family of functions.

4.3 Hardware-Software Task Interface
4.3.1 Task Context
Historically, instruction set specifications do not specify the method by which the programming model(s) achieve parallelism and concurrency. Such documentation generally stops at instruction level parallelism (ILP) or multi-core connectivity to shared caches. However, the GC64 microarchitecture extension is explicitly designed to foster highly efficient parallel software development. As such, we explicitly outline the mechanisms by which the software and the instruction set may interact with the inherent hardware concurrency mechanisms (e.g., task units). We describe these mechanisms in terms of task contexts and task queues. All local and scalable concurrency mechanisms emanate from these two concepts.
The task context is responsible for encapsulating an indivisible task. The encapsulated task may be in one of four states. These states are summarized as follows:

• Free: Task contexts in this state have no instructions to execute and no known appli- cation state associated with it. They may exist within a queue of other unused task contexts. The purpose of these task contexts is to provide a high performance method to spawn new tasks without unnecessary memory allocation.

• Queued: Task contexts in this state have instructions to execute and, potentially, ap- plication state. The task contexts are awaiting a task processor to fetch the associated state into a task unit and begin (or continue) execution.

• Executing: Task contexts in this state are currently associated with a task unit on a task processor. The address of the executing task context is loaded into the tctx register of each task unit.

• Paused: Task contexts in this state have instruction and application state associated with them, but remain paused and queued on an exception queue. The respective task context may have encountered a break point, a single step or a fatal unmasked exception, or an adjacent task context may have encountered one of these events. This is the state utilized for debugging.

Given that the task context contains viable application state and is further utilized in debugging, we explicitly define the size and structure of the context. The task context structure contains all the necessary storage to encapsulate all the register state associated with a single task unit. We define two task context structures. The first is associated with the base GC64 microarchitecture specification. The latter serves to contain the register state when the optional 128-bit addressing mode is enabled. We define these using a basic C structure.

/* Task context: 64-bit addressing mode */
struct task_context{
  uint64_t id;                /* unique task context id */
  struct task_context *next;
  uint64_t pc;
  uint64_t x_reg[32];
  uint64_t f_reg[32];
  uint64_t tctx;
  uint64_t tid;
  uint64_t tq;
  uint64_t te;
  uint64_t gconst;
  uint64_t garch;
  uint64_t gkey;
};

/* Task context: 128-bit addressing mode */
struct task_context{
  uint64_t id;                /* unique task context id */
  struct task_context *next;
  uint64_t pc[2];
  uint64_t x_reg[64];
  uint64_t f_reg[32];
  uint64_t tctx[2];
  uint64_t tid;
  uint64_t tq[2];
  uint64_t te[2];
  uint64_t gconst;
  uint64_t garch;
  uint64_t gkey;
};

4.3.2 Task Queuing
4.3.2.1 Task Queue Structure
The GC64 task queue structure provides the basic ability to queue one or more task contexts with or without executable instructions. In addition, task queue structures provide a rudimentary locality mechanism for task groups, sockets, nodes and partitions. While this locality is not explicitly enforced in hardware, the software runtime mechanisms may make explicit use of this inherent locality for spawning or joining tasks. As mentioned in the GC64 register extensions, the tq and te registers each point to a single task queue instance; the task queues associated with the tq and te registers must be different. Each of the task queue structures contains four relevant items. Pointers are 64 bits in the base GC64 architecture. If the system supports the 128 bit addressing mode, the pointers are 128 bits. These items are summarized as follows:

• ID: A unique identifier for each task queue structure

• KEYSTONE: A pointer to a task queue keystone structure. It is permissible (especially in small configurations) for this pointer to be NULL.

• ENQ: The enqueue item is a pointer to the first valid task context structure. Valid task contexts contain the state associated with a task that is ready and valid for execution. If this pointer is NULL, then it is assumed that there are no tasks ready to execute on this queue.

• FQ: The free queue item is a pointer to the first free task context structure. Free task contexts contain the state associated with a task that is ready to be spawned and assigned work. No current valid state or instructions are associated with free task contexts. The goal of free task contexts is to provide applications the ability to spawn tasks without allocating memory.

We may also define these task queue structures in both the 64 and 128 bit addressing modes in terms of a basic C structure as follows:

/* Task Queue Structure */
struct task_queue{
  uint64_t id;                      /* unique task queue id */
  struct task_keystone *keystone;
  struct task_context *enqueue;
  struct task_context *freequeue;
};
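As an illustration of how a software runtime might use the free queue to spawn without allocation, the sketch below moves one context from the free list to the ready (enqueue) list using the task_queue and task_context structures defined above. It is a minimal sketch only: the function name is hypothetical and any atomicity or locking a real runtime would need is omitted.

#include <stddef.h>

/* Claim a free task context and move it onto the ready (enqueue) list.
 * Returns the claimed context, or NULL if no free contexts remain.
 * A real runtime would perform this update atomically. */
struct task_context *spawn_from_free_queue(struct task_queue *tq) {
  struct task_context *ctx = tq->freequeue;
  if (ctx == NULL)
    return NULL;                 /* no free contexts: caller must allocate */
  tq->freequeue = ctx->next;     /* unlink from the free queue             */
  ctx->next     = tq->enqueue;   /* push onto the ready (enqueue) list     */
  tq->enqueue   = ctx;
  return ctx;
}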

4.3.2.2 Task Queue Keystone Structure
The GC64 task queue architecture also provides an explicit and consolidated ability to interconnect the local task queue structures associated with a task group with the task queue structures of adjacent sockets, nodes and partitions. Each keystone structure contains four data values. The values are organized as follows:

• ID: The keystone ID value is an unsigned 64 bit unique identifier for each keystone structure.

• SOCKET: The socket next pointer is, by default, a 64 bit pointer to a task queue structure on an adjacent socket. In the optional 128 bit addressing mode, this pointer is 128 bits. If this pointer value is NULL, then no adjacent task queues are connected.

• NODE: The node next pointer is, by default, a 64 bit pointer to a task queue structure on an adjacent node. In the optional 128 bit addressing mode, this pointer is 128 bits. If this pointer value is NULL, then no adjacent task queues are connected.

• PART: The partition next pointer is, by default, a 64 bit pointer to a task queue structure on an adjacent partition. In the optional 128 bit addressing mode, this pointer is 128 bits. If this pointer value is NULL, then no adjacent task queues are connected.


Figure 4.1: GC64 Task Queue Link Structure

We may also define these keystone structures in both the 64 and 128 bit addressing modes in terms of a basic C structure as follows:

/* Task Queue Keystone */
struct task_keystone{
  uint64_t id;                  /* unique task keystone id */
  struct task_queue *socket;
  struct task_queue *node;
  struct task_queue *partition;
};
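Using the keystone and task queue structures above, a runtime can reach an adjacent queue with a simple pointer walk. The helper below is a sketch only (the function name is hypothetical and no synchronization is shown); it merely restates the NULL conventions described for the keystone fields.

/* Locate a task queue on an adjacent node, if one is connected. */
struct task_queue *adjacent_node_queue(const struct task_queue *tq) {
  if (tq->keystone == NULL)
    return NULL;                /* no keystone: small, unconnected configuration */
  return tq->keystone->node;    /* NULL if no adjacent node queue is connected   */
}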


Figure 4.2: GC64 Task Keystone Link Structure

4.3.3 Context Save/Restore
Given the aforementioned task context and task queueing architecture, we define the context save and context restore mechanisms in terms of task contexts and task queues. Each task group must have at least one unique task exception queue instance that is manifested using a task queue structure. Any time a pause event is encountered and a context save operation is initiated for the given process space, all the associated task units in the executing task group adhere to the following process:

1. Pause event is received (exception, breakpoint, step, et al.)

2. RISC-V pipeline is paused and drained

3. All RISC-V and GC64 register state is saved to the task unit’s associated task context at tctx

4. The task context tctx is inserted into the task exception queue te.

The converse of the context save operation is a context restore. The context restore operation utilizes the gconst value of each saved task context to determine the physical locality to which the respective task is restored. In this manner, contexts are restored to the same task groups, task processors and task units from which they were originally saved. This process is as follows (a brief software sketch follows the list):

1. Restore event is received (single step, execute signal, et al.)

2. A task context is fetched from the te queue.

3. The respective gconst value is read and interpreted for the correct location

4. The task context is restored to the correct physical task unit

5. Execution resumes
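The queue manipulation implied by step 4 of the save sequence and step 2 of the restore sequence can be sketched against the structures defined earlier. This is a minimal sketch under several assumptions: the te register's queue is treated as an ordinary task queue whose enqueue list holds paused contexts, and copy_register_state() is a hypothetical placeholder for the hardware register save, not a real routine.

/* Placeholder for the hardware register save; not a real routine. */
extern void copy_register_state(struct task_context *tctx);

/* Save path, steps 3-4: dump register state into the context at tctx,
 * then push the context onto the exception queue referenced by te. */
void context_save(struct task_context *tctx, struct task_queue *te) {
  copy_register_state(tctx);
  tctx->next  = te->enqueue;
  te->enqueue = tctx;
}

/* Restore path, step 2: fetch a paused context from the te queue.
 * Its gconst value is then interpreted to pick the physical task unit. */
struct task_context *context_restore_fetch(struct task_queue *te) {
  struct task_context *ctx = te->enqueue;
  if (ctx != NULL)
    te->enqueue = ctx->next;
  return ctx;
}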

4.3.4 Task Instruction Return Values
The task control instructions spawn and join attempt to increase and decrease the concurrency of a given application context, respectively. Recall the definition of these instructions: each of them returns a value in the target integer register that is essentially a return code describing the status of the given operation. The return values are signed 64-bit values. This is true in both the base 64-bit addressing mode and the optional extended 128-bit addressing mode. The full list of return values is as follows:


Mnemonic   Value   Description
TFATAL     -1      A fatal error occurred in the operation
TOK        0       No error occurred; the operation was successful
TNULL      1       Task queue was previously null (no queued tasks)

4.4 Application of ISA Extensions
4.4.1 Vector of Indices
One of the fundamental memory access patterns that we often find in data intensive computing is referred to as a vector of indices. The primitive operation involves two vectors or arrays of data. The first vector contains a set of densely organized data. The data may be of a primitive type, complex struct or derived object. The second vector is also a densely organized vector containing a set of indices, or offsets, into the first data vector. This second vector is often referred to as the table. The basic construct associated with vector of index memory accesses includes looping across some i values and selecting the value from the first vector referenced by the i'th table value.
The difficulty in performing this operation on modern computer architectures is due to the multiple address calculations. The first address calculation is performed in order to resolve the base address of the table[i] value. The index value is subsequently fetched from the appropriate table index location. The second address calculation utilizes the table index fetched from memory to calculate the offset into the first data vector. Given this, we find that the vector of index calculation has an inherent data dependency, forcing the entire operation to be performed sequentially.
We often see two primitive types of vector of index computations, summarized as scatter operations and gather operations. Scatter operations contain a vector of index calculation on the left hand side of an operation. Conversely, gather operations contain a vector of index calculation on the right hand side of an operation. We may also combine these operations into a full scatter+gather where we have vector of index calculations on both sides of an operation. We can see examples in C of each of these in Figure 4.3. Example A shows a basic scatter operation, Example B shows a basic gather operation and Example C shows a combined scatter+gather.
We take our examples from Figure 4.3 and compile them for the base RISC-V IMAFD ISA as well as our GC64 extended RISC-V architecture in order to elicit the utility of our GC64 scatter/gather instructions. We utilize the RISC-V GCC compiler version 4.9.2 (riscv64-unknown-linux-gnu-gcc) built for the GNU Linux ABI.


void scatter( uint64_t *restrict R,
              uint64_t *restrict table,
              uint64_t *restrict index,
              uint64_t LEN ){
  uint64_t i;
  for( i=0; i<LEN; i++ ){
    table[index[i]] = R[i];
  }
}

void gather( uint64_t *restrict R,
             uint64_t *restrict table,
             uint64_t *restrict index,
             uint64_t LEN ){
  uint64_t i;
  for( i=0; i<LEN; i++ ){
    R[i] = table[index[i]];
  }
}

void scattergather( uint64_t *restrict R,
                    uint64_t *restrict table,
                    uint64_t *restrict index,
                    uint64_t LEN ){
  uint64_t i;
  for( i=0; i<LEN; i++ ){
    R[index[i]] = table[index[i]];
  }
}

Figure 4.3: Scatter Gather Code Examples

For each of the examples, we use the basic compiler options -O3 -std=c99. The result of compiling Figure 4.3, Example A is shown in Listing 4.1. Notice the set of two load operations (ld) and address calculations (add+sll) prior to the final, dependent store operation. The entire function requires thirteen instructions.
The reduced version of Example A using the GC64 scatter instructions is shown in Listing 4.2. Note how the redundant load operations have been reduced to a single load and scatter (sdscatr) instruction. The reduced example is now only nine instructions, resulting in a 30% reduction in instructions.
Conversely, we see that Example B for basic gather operations also generates thirteen base instructions when compiled for RISC-V IMAFD (Listing 4.3). We again see that the GC64 gather operation (ldgthr) enables an overall instruction count reduction of 30% for the entire routine by consolidating sequentially dependent load operations (Listing 4.4).
Our final, combined scatter+gather example (Example C) is especially interesting. The default compiler logic using the -O3 optimization level induces a 2-way unrolling of the loop in order to obtain a sufficient number of outstanding memory requests to hide some amount of memory latency. While this technique is entirely useful with the base RISC-V ISA definition, the GC64 architecture provides an additional optimization to utilize a combined set of unrolled scatters and gathers.


Listing 4.1: Example A RISC-V Assembly
# Example A
scatter:
    sll  a6,a3,3
    move a4,zero
    beq  a3,zero,.L1
.L5:
    add  a5,a2,a4
    ld   a5,0(a5)
    add  a3,a0,a4
    ld   a3,0(a3)
    sll  a5,a5,3
    add  a5,a1,a5
    add  a4,a4,8
    sd   a3,0(a5)
    bne  a4,a6,.L5
.L1:
    ret


4.4.2 OpenMP Task Parallelism
In addition to efficient memory scatter and gather operations, the GC64 RISC-V extension also contains instructions that are carefully crafted to directly support parallel programming model concepts that traditionally require external function calls to runtime library routines. One such concept is the ability to efficiently support user-defined task dependency and concurrency operations in the OpenMP tasking constructs. One such routine is the OpenMP taskyield directive. The OpenMP language specification [15] defines the taskyield directive as the ability for the current task to suspend itself in order to allow other tasks to run. This is commonly used to explicitly perform fair sharing of compute resources or as a method to avoid inducing algorithmic deadlocks.
We see a simple example of the taskyield directive in Figure 4.4. This example is a modified version of the taskyield example found in the OpenMP language specification examples guide [16]. We have two external functions that execute a core compute solver (solver_function) and perform some shared memory data exchange between tasks (halo_exchange). In order to avoid any inherent race conditions between tasks, the algorithm forces all the tasks to arrive at the lock point before continuing to perform the data exchange.


Listing 4.2: Optimized Example A GC64 Assembly
# Optimized Example A
scatter:
    sll  a6,a3,3
    move a4,zero
    beq  a3,zero,.L1
.L5:
    add  a5,a2,a4
    ld   a5,0(a5)
    mul  a7,a4,8
    sdscatr a3,a5,a7
    bne  a4,a6,.L5
.L1:
    ret

Without the use of the taskyield construct, we may find that a small number of tasks are commanding a disproportionate number of compute cycles and not permitting neighboring tasks to execute, thus delaying the time for all tasks to arrive at the lock. Inserting the directive forces the tasks to explicitly fair share the compute resources by yielding their execution to a neighboring task.
Compiling this example using the aforementioned GNU 4.9.2 compiler with OpenMP enabled yields the assembly code in Listing 4.5 (abbreviated for space). Note the call to the runtime function GOMP_taskyield. The task is forced to make a potentially expensive function call to an OpenMP library function.
Conversely, the GC64 version of the code is found in Listing 4.6. Note that the only change in the instructions shown is the replacement of the call to GOMP_taskyield with a single instruction, ctxsw. Recall that this instruction forces the encountering thread or task to explicitly context switch, thus fair sharing the compute resources on the chip. The entire function call and all the potentially costly ABI logic (stack management, frame management, et al.) is replaced by a single, lightweight instruction. Further, it is quite simple to instruct the compiler and OpenMP routines to directly lower such common task management directives to single instructions as opposed to traditional runtime library calls.


Listing 4.3: Example B RISC-V Assembly
# Example B
gather:
    sll  a6,a3,3
    move a4,zero
    beq  a3,zero,.L8
.L12:
    add  a5,a2,a4
    ld   a5,0(a5)
    add  a3,a0,a4
    add  a4,a4,8
    sll  a5,a5,3
    add  a5,a1,a5
    ld   a5,0(a5)
    sd   a5,0(a3)
    bne  a4,a6,.L12
.L8:
    ret

4.4.3 HPCC RandomAccess
Finally, the GC64 RISC-V extension provides a unique ability to control the latency hiding capabilities present in the micro architecture. We elicit this using the HPCC RandomAccess benchmark kernel [108]. The HPCC RandomAccess benchmark measures the random memory access bandwidth of a target system architecture. The benchmark measures the number of Giga UPdates per Second, or GUPS, made to a randomly selected list of memory locations in one second. The randomness ensures that there exists only a small probability that subsequent memory requests will access co-located memory locations. In this manner, the benchmark exhibits a memory access pattern that is commonly found to be adverse to data cache reuse.
The benchmark kernel is defined by selecting a target problem size that is a power of two, 2^n, that is less than or equal to half of the entire main memory capacity. An update is made to a table of 64-bit words using a read-modify-write operation. In the case of the core benchmark, the modify operation is an exclusive OR (XOR). In our example denoted in Figure 4.5, the variable table contains our 64-bit word table. The source index value for the table is selected via a randomized index table generated by a known polynomial (POLY). The base indices are predetermined per unit of parallelism and stored in an array denoted by ran in our example.


Listing 4.4: Optimized Example B GC64 Assembly
# Optimized Example B
gather:
    sll  a6,a3,3
    move a4,zero
    beq  a3,zero,.L8
.L12:
    add  a4,a2,a4
    mul  a7,a4,8
    ldgthr a5,a5,a7
    sd   a5,0(a3)
    bne  a4,a6,.L12
.L8:
    ret

One of the inherent difficulties in executing the benchmark is bounding the selected table index to the size of the table in order to avoid over-indexing the allocated memory. This is depicted in the expression table[ran[j] & (tablesize-1)]. Given this constraint and the addition of the exclusive OR operation, we cannot simplify the statement into a single gather operation as described in Section 6.1. Note the sequential version of the compiled code in Listing 4.7. The calculation of the random index and the subsequent arithmetic and store contain an inherent data dependency.
Despite the hardware-driven latency hiding capabilities provided by the GC64 thread control unit, we may also seek to optimize the latency hiding such that we avoid unnecessary stall latencies. Using the original example, we insert three instructions that enable us to control the degree of latency hiding (Listing 4.8). The two hoisted load immediate instructions (ldi) set the bounds of the latency hiding operation to 8 cycles. We also insert a single iwait instruction. Recall from Section 5.2 that the iwait instruction forces a context switch of the current task if the encountering task does not clear the target register hazard or it reaches the maximum number of stall cycles. In this case, the task will forcibly context switch if the thread does not receive the memory read response from the random index table within eight cycles, thus balancing the degree of concurrency within the executing task processor.


#include <omp.h>

void solver_function( void );
void halo_exchange( void );

void driver( omp_lock_t *lock, int n )
{
  int i;

  for( i = 0; i < n; i++ ){
    #pragma omp task
    {
      solver_function();

      while( !omp_test_lock(lock) ){
        #pragma omp taskyield
      }

      halo_exchange();
      omp_unset_lock( lock );
    }
  }
}

Figure 4.4: OpenMP Taskyield Example

Listing 4.5: Example RISC-V OpenMP Task Yield
driver._omp_fn.0:
    add  sp,sp,-16
    sd   s0,0(sp)
    sd   ra,8(sp)
    ld   s0,0(a0)
    call solver_function
    j    .L3
.L2:
    call GOMP_taskyield
.L3:
    move a0,s0
    call omp_test_lock
    beq  a0,zero,.L2
    call halo_exchange
    ...


Listing 4.6: Optimized GC64 OpenMP Task Yield
driver._omp_fn.0:
    add  sp,sp,-16
    sd   s0,0(sp)
    sd   ra,8(sp)
    ld   s0,0(a0)
    call solver_function
    j    .L3
.L2:
    ctxsw
.L3:
    move a0,s0
    call omp_test_lock
    beq  a0,zero,.L2
    call halo_exchange
    ...

#define POLY 0x0000000000000007ULL

void gups( int64_t nthreads, int64_t nupdate, int64_t tablesize,
           int64_t *restrict table, int64_t *restrict ran )
{
  int64_t i, j;

  for( i=0; i<(nupdate/nthreads); i++ ){
    for( j=0; j<nthreads; j++ ){
      ran[j] = (ran[j] << 1) ^ ((int64_t) ran[j] < 0 ? POLY : 0);
      table[ran[j] & (tablesize-1)] ^= ran[j];
    }
  }
}

Figure 4.5: HPCC RandomAccess Example


Listing 4.7: Example RISC-V RandomAccess
# HPCC RandomAccess Kernel Loop
.L6:
    ld   a6,0(a7)
    add  a7,a7,8
    not  a5,a6
    sra  a5,a5,63
    and  a5,a5,-7
    sll  a6,a6,1
    add  a5,a5,7
    xor  a5,a6,a5
    and  a6,a5,a2
    sll  a6,a6,3
    add  a6,a3,a6
    ld   t0,0(a6)
    sd   a5,-8(a7)
    xor  a5,t0,a5
    sd   a5,0(a6)
    bne  a7,t1,.L6

Listing 4.8: Optimized GC64 RandomAccess
# Optimized HPCC RandomAccess
    ldi  a8,0
    ldi  a9,8
.L6:
    ld   a6,0(a7)
    add  a7,a7,8
    not  a5,a6
    sra  a5,a5,63
    and  a5,a5,-7
    sll  a6,a6,1
    add  a5,a5,7
    xor  a5,a6,a5
    and  a6,a5,a2
    sll  a6,a6,3
    add  a6,a3,a6
    ld   t0,0(a6)
    sd   a5,-8(a7)
    iwait t0,a9,a8
    xor  a5,t0,a5
    sd   a5,0(a6)
    bne  a7,t1,.L6



Figure 4.6: GC64 Thread Context Pipeline

4.5 Thread Concurrency Mechanisms
4.5.1 Thread Context Switching
In order to support our unique, low-latency context switch infrastructure, the GoblinCore-64 micro architecture implements a modified 5-stage, in-order RISC-V pipeline. Our unique pipeline has additional control paths in order to account for the replicated register state present in each task unit's storage, the thread/task control unit state management and the necessary facilities to maintain the identifiers for the active task in the pipeline. Figure 4.6 depicts these additional mechanisms as attached to the basic RISC-V in-order pipeline model.
The pipeline and task control unit maintain an internal register, CTX.ID, to maintain the identifier of the task unit currently enabled for execution. During the first stage of the pipeline execution, when the program counter is constructed for the target instruction, the CTX.ID is utilized in order to select the correct instruction from the target task's instruction stack. This identifier is then carried forth throughout the pipeline such that adjacent stages may perform register reads and writes to the correct task unit's register file. This selection is performed via a simple mux (Task Unit Mux) during the write-back stage of the pipeline.
In addition to the basic control state in the GC64 pipeline, the instruction decode (ID) phase utilizes the task unit identifier to calculate the pressure from the respective task unit's execution. Once the ID phase cracks the instruction payload's opcode, the opcode value is used to drive a table lookup in the Opcode Cost Table. The values stored in this table, one per opcode, represent the relative cost of the respective instruction toward the total execution of the task unit. The task control unit then utilizes this value and the CTX.ID value to perform a summation of the respective task unit's pressure for this cycle.


Listing 4.9: Example Triadic RISC-V Assembly
# y = a * x + b; 64-bit integers
LD  x12, $0(x11)   # load a
LD  x14, $0(x13)   # load x
LD  x16, $0(x15)   # load b
MUL x20, x12, x14  # temp = a * x
ADD x20, x20, x16  # temp = temp + b
SD  x20, $0(x22)   # mem = temp

This value is subsequently stored in the GCOUNT register for the target task unit. At this point, the task control unit permits the instruction to continue its execution through the pipeline, provided that it has not exceeded a pre-determined overflow value. However, if the GCOUNT value exceeds the maximum pressure threshold, the task control unit forces the task unit to yield its state to an adjacent task unit. The yield operation stops the execution of the current instruction and rewinds the program counter back to the value of the offending instruction. The task control unit also selects the next available task unit for execution in a round-robin manner and begins its execution at the next value stored in the target task unit's program counter. The value of the CTX.ID internal register is also updated with the new value. This process of yielding and selecting a new task requires a single cycle, as no memory operations are required to perform context saves or restores to/from memory.
As an example of our pressure-driven context switch methodology, consider the following. Our task processor is populated with two task units that represent two executing threads. We define a minimal cost table for our environment to charge all register-to-memory operations with 30 units and all arithmetic operations with 20 units. We set a maximum pressure threshold of 50 units before forcing a thread or task to yield its context of execution. In our example, each thread performs a simple triadic operation that represents the traditional y = a * x + b equation using 64-bit integers. Each thread or task's respective instruction stack is as follows: notice that we have three load, one multiply, one addition and one store instruction that honors the necessary order of operations in an in-order pipeline. An example assembly code output is detailed in Listing 4.9. Figure 4.7 represents the instruction execution of all threads or tasks as it relates to the individual clock cycles.

individual clock cycles. We begin by executing the instructions from task unit 0 in the normal manner. The first two load instructions are successfully passed through the normal RISC-V pipeline. Also note that when each instruction reaches the instruction decode (ID) stage of the pipeline, the GCOUNT register for the respective task unit accumulates the respective cost for the target instruction. However, when the third instruction reaches the instruction decode phase and the GCOUNT field reaches a value of 60, an overflow over the maximum pressure of 50, the task control unit forces the task unit to context switch. The task unit recycles the program counter and pushes the offending instruction back onto the respective instruction stack for future execution. As shown in the figure, this process of context switching between task units requires a single cycle to clear, flush and re-enable the pipeline. In the figure, the context switch cycles are noted with CTX in red and the offending thread's GCOUNT value that initiated the context switch event is also noted in red. On the next cycle, the adjacent task unit (1) begins its execution. This process of executing, overflowing, context switching and, subsequently, sharing the core processing resources continues until both task units have completed the necessary instructions. Note that we expend zero cycles on any memory instructions required to save the task unit's register state to memory. All state is stored in-situ. As a result, the total execution of our two task units for the respective instruction stack is 26 cycles.
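
A minimal C sketch of this pressure accounting and yield decision follows. It is a behavioral illustration only, not the GC64 hardware: the type and function names, the two-unit configuration, and the choice to clear GCOUNT on a yield are assumptions made for this example.

  #include <stdint.h>
  #include <stdbool.h>

  #define NUM_TASK_UNITS 2     /* task units resident in the task processor      */
  #define PRESSURE_MAX   50    /* maximum GCOUNT pressure before a forced yield  */

  /* Hypothetical per-task-unit state; all state remains in-situ, so a context
   * switch never generates memory traffic.                                      */
  typedef struct {
    uint64_t pc;       /* program counter of the task unit                       */
    uint32_t gcount;   /* accumulated pressure (GCOUNT) for the current interval */
  } task_unit_t;

  static task_unit_t units[NUM_TASK_UNITS];
  static unsigned    ctx_id = 0;   /* CTX.ID: task unit currently enabled        */

  /* Charge the active task unit for the decoded opcode.  Returns true if the
   * instruction may continue down the pipeline, false if the task unit must
   * yield and the offending instruction is replayed after the context switch.   */
  static bool charge_and_check(uint32_t opcode_cost) {
    task_unit_t *tu = &units[ctx_id];
    tu->gcount += opcode_cost;            /* summation performed in the ID stage */
    if (tu->gcount <= PRESSURE_MAX)
      return true;                        /* under threshold: keep executing     */

    /* Over threshold: the pipeline rewinds tu->pc to the offending instruction
     * (not modeled here), the pressure counter is cleared, and the next task
     * unit is selected round-robin; the switch itself costs a single cycle.     */
    tu->gcount = 0;
    ctx_id = (ctx_id + 1) % NUM_TASK_UNITS;
    return false;
  }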

4.5.2 Thread Pressure Summation

One of the more interesting aspects of our context switching approach is choosing an appropriate cost table to drive the necessary opcode cost summation. Given that our cost table is effectively static, the values stored therein will never be perfectly accurate to the resolution of the target micro architectural implementation. However, we may heuristically derive reasonable values for each of the opcode classes. Table 4.3 presents the full set of values for each of the opcode classes present in the standard set of RISC-V extensions. At the basis of our heuristic opcode table are all integer and floating point load operations. Given that these operations may be required to complete before subsequent instructions can proceed and the load latency may be extreme, we assign a cost value of 30 for these opcode classes. We also assign a value of 30 to system instructions.

Figure 4.7: GC64 Pressure-Driven Context Switch Example (per-cycle pipeline stages IF/ID/EX/MEM/WB for each instruction of the two task units over 26 clock cycles, the active CTX.ID, the per-thread GCOUNT values, and the single-cycle CTX context switch events triggered when a thread's GCOUNT exceeds the 50-unit threshold)


These system instructions may induce long-latency I/O events and/or kernel thread pre-emption and, as such, we consider these instructions to be expensive relative to simple arithmetic. In our table, load operations and system instructions are the most expensive in terms of their context switch pressure. Memory store operations are, however, slightly less expensive. While these operations may become equally latent, they are not often required to complete prior to subsequent instructions. As a result, all store operations have a cost of 20. We also assign this cost value to FENCE instructions and floating point fused arithmetic operations. Fused floating point operations (fused multiply-add, subtract) may require multiple passes through the floating point normalization and execution stages and, as such, are considered to be more expensive than integer arithmetic operations. Floating point conversion instructions are assigned a pressure value of 15 given that they do not require significant execution cycles. We also assign this value to the jump-and-link, or JAL*, opcodes. These opcodes are generally produced as the basis for programmatic control flow such as function calls. Despite being relatively efficient for nearby jump targets, these opcodes may require instruction cache fills in order to service the next portion of the instruction stack. Given that these jump classes of opcodes may occasionally appear to be latent to the execution pipeline, we assign them an intermediate cost value of 15. The RISC-V conditional branches are assigned a cost value of 10. These instructions execute local branches for basic programmatic control flow within a function. For all but pathologically large functions or instruction blocks, the target of these branches is likely to be resident in the instruction cache. As such, these branches are generally less latent than the equivalent jump and link opcodes. Finally, our basic integer arithmetic instructions and load immediate instructions have cost values of 5 and 2, respectively. We assign the integer arithmetic instructions a value of 5 given the five-stage pipeline depth required to execute them. All other opcodes, especially those from yet to be assigned opcode classes, are assigned the default pressure value of 30.


Table 4.3: RISC-V Opcode Cost Table

RISC-V Opcode   Short Name         Cost   Execution Description
0x37            LUI                2      Load upper immediate
0x17            AUIPC              2      Add upper immediate to PC
0x6F            JAL                15     Jump and link
0x67            JALR               15     Jump and link register
0x63            Branches           10     Conditional branches
0x07            Float Loads        30     Floating point memory loads
0x03            Loads              30     Integer memory loads
0x27            Float Stores       20     Floating point memory stores
0x23            Stores             20     Integer memory stores
0x13            Integer Arith      5      Integer arithmetic with immediates (add, conditionals, logicals)
0x33            Integer Arith      5      Register-register integer arithmetic (add, shifts, logicals)
0x1B            Integer Arith      5      Sign-extended integer-immediate arithmetic
0x3B            Integer Arith      5      CSR arithmetic
0x0F            Fence              20     Memory fence
0x73            System             30     System instructions
0x2F            Atomics            30     Atomics
0x43            FMAdd              20     Fused multiply-add
0x47            FMSub              20     Fused multiply-subtract
0x4B            FNMSub             20     Fused multiply-subtract and negate
0x4F            FNMAdd             20     Fused multiply-add and negate
0x53            Float Conversion   15     Floating point type conversions
Unknown         Misc               30     All other opcodes; opcodes in non-IMAFD extensions
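
In a simulator or DMC-style microcode, Table 4.3 can be realized as a small lookup table indexed by the seven-bit RISC-V major opcode (the low seven bits of the instruction word). The sketch below is an assumed encoding of the table; the array and function names are illustrative rather than part of the GC64 specification.

  #include <stdint.h>

  #define DEFAULT_COST 30u   /* all other opcodes and non-IMAFD extensions */

  /* Opcode cost table indexed by the 7-bit RISC-V major opcode; values
   * follow Table 4.3, and unlisted opcodes fall back to DEFAULT_COST.    */
  static uint8_t opcode_cost[128];

  static void init_opcode_cost(void) {
    for (int i = 0; i < 128; i++) opcode_cost[i] = DEFAULT_COST;
    opcode_cost[0x37] = 2;   /* LUI                           */
    opcode_cost[0x17] = 2;   /* AUIPC                         */
    opcode_cost[0x6F] = 15;  /* JAL                           */
    opcode_cost[0x67] = 15;  /* JALR                          */
    opcode_cost[0x63] = 10;  /* conditional branches          */
    opcode_cost[0x07] = 30;  /* floating point loads          */
    opcode_cost[0x03] = 30;  /* integer loads                 */
    opcode_cost[0x27] = 20;  /* floating point stores         */
    opcode_cost[0x23] = 20;  /* integer stores                */
    opcode_cost[0x13] = 5;   /* integer arithmetic, immediate */
    opcode_cost[0x33] = 5;   /* integer arithmetic, register  */
    opcode_cost[0x1B] = 5;   /* sign-extended immediate arith */
    opcode_cost[0x3B] = 5;   /* per Table 4.3                 */
    opcode_cost[0x0F] = 20;  /* FENCE                         */
    opcode_cost[0x73] = 30;  /* system instructions           */
    opcode_cost[0x2F] = 30;  /* atomics                       */
    opcode_cost[0x43] = 20;  /* FMADD                         */
    opcode_cost[0x47] = 20;  /* FMSUB                         */
    opcode_cost[0x4B] = 20;  /* FNMSUB                        */
    opcode_cost[0x4F] = 20;  /* FNMADD                        */
    opcode_cost[0x53] = 15;  /* floating point conversions    */
  }

  /* Cost charged in the ID stage for a raw 32-bit instruction word. */
  static uint8_t instruction_cost(uint32_t insn) {
    return opcode_cost[insn & 0x7Fu];
  }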


CHAPTER 5 DATA INTENSIVE MEMORY HIERARCHY

5.1 Memory Hierarchy

5.1.1 Overview

Recent advances in memory architecture and CMOS manufacturing techniques have provided the necessary technology to develop three-dimensional or stacked memory devices. One such device architecture and implementation is the Hybrid Memory Cube, or HMC, device specification [17]. This unique device architecture couples a dedicated logic layer that presents a packetized interface to the core processor with an abstracted set of vertical DRAM layers. The end result is a compact, power efficient memory package that delivers bandwidth in excess of 300 GB/s. One of the primary application areas where users are interested in applying the significant bandwidth available from these emerging HMC memory devices is data intensive computing. The algorithms utilized across many of the scalable data intensive computing applications often induce significant periods of irregular or non-deterministic memory access patterns. We find these algorithms, often rooted in graph theory, network theory or combinatorics, exhibit these irregular access patterns based upon the inherent non-determinism in the incoming source data. Static compilation and optimization techniques often fail to sufficiently optimize these kernels for spatial or temporal locality. As a result, traditional data cache hierarchies often operate inefficiently and fail to provide sufficient data reuse to the core processing elements. Given the orthogonal nature of the packetized interface presented by the HMC device architecture, traditional memory pipelines optimized for JEDEC DDR-style memory interfaces fail to make efficient use of all the available bandwidth provided by these stacked devices. The traditional 16 byte memory fetch used for standard DDR interfaces actually induces the most inefficient packet request structure for the HMC. As such, for data intensive computing applications that exhibit poor spatial and temporal locality, we must develop more efficient methodologies beyond traditional hierarchical data caches to optimize the use of multi-dimensional memories such as the Hybrid Memory Cube. This chapter presents a novel memory pipeline infrastructure and a set of algorithms designed to coalesce irregular memory requests from multiple threads of execution into highly


optimal HMC packet requests. This infrastructure, herein referred to as the Dynamic Memory Coalescing, or DMC, infrastructure, is designed to replace the traditional data cache hierarchy in order to advantageously provide spatial and temporal locality of requests for both dense or linear memory requests and irregular or non-deterministic request patterns. The novelty of our approach is summarized as an embedded RISC-V [162] core executing microcode that implements a novel set of algorithms that rearrange incoming streams of memory requests using a binary tree structure.

5.1.2 Memory Architecture

The GC64 memory hierarchy is designed to promote efficient utilization of on-chip and system bandwidth resources for highly concurrent applications. As we previously stated, these applications generally exceed the memory traffic and bandwidth envelopes for which traditional system architectures are designed. The memory hierarchy consists of three main components with an optional fourth component. The three main components consist of the memory management unit (MMU), the software-managed scratchpad memory and one or more Hybrid Memory Cube (HMC) devices. The task group MMU serves two main purposes. First, it serves as a memory coalescing point for all tasks in the respective task group. The coalescing logic is designed to build larger request payloads, where applicable, in order to service multiple individual memory requests with a single HMC payload. The MMU also determines whether an individual memory request is designated as local or global. Local requests are those requests that are serviced by any HMC device or software managed scratchpad that is directly attached to the respective socket. Global requests are those that must be routed off-socket and thus serviced by adjacent sockets, nodes or partitions. The software managed scratchpad is the second component in the mandatory memory hierarchy. The software managed scratchpad serves a purpose similar to a traditional data cache in that it contains frequently used data items that can be read or written via any of the tasks in an application's process space. However, it differs in the sense that the logic associated with populating the scratchpad is serviced via software mechanisms in the programming model, as opposed to explicit circuit logic. In this manner, the user and programmer have explicit control over which data items fall closer to the task units in the overall memory hierarchy.


The final unit in the memory hierarchy is the HMC device. The HMC devices may be interconnected directly or via routed devices (devices connected via other HMC devices). The GC64 micro-architecture does not prevent any standard HMC configurations. The fourth, optional, unit in the memory hierarchy consists of the ability to perform global addressing across system architectures that contain multiple nodes and/or partitions. This capability provides shared and partitioned addressing of an entire system via an extended 128-bit addressing format.

5.1.3 Physical Addressing

The local physical addressing format shown in Figure 5.1 is designed to support locally scalable (within a node) memory addressing that directly maps to the vault-interleaved nature of the HMC devices. The local physical memory address format consists of the following fields (a field-extraction sketch follows the list):

• Base Address: The base address is a 34-bit physical address unit that maps directly to the CUB field in the HMC base physical address specification

• CUB: The CUB or Cube ID field maps to the HMC request payload’s definition of a Cube ID. This is a unique identifier that is local to each GC64 socket. One Cube ID maps to a single HMC device. GC64 reserves the Cube ID 0xF for the respective socket’s scratchpad memory.

• Reserved: This field is currently unused, but reserved for future expansion

• Socket ID: This 8-bit field identifies the socket ID within the GC64 node. Given that the base physical address and the CUB field are socket-local, this field will designate addressing into devices beyond that of the local socket.

• Unused/Ignored: Bits [63:50] are ignored in the current GC64 micro-architecture. They are neither zero nor sign extended. This field is simply ignored.
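
As referenced above, the sketch below shows one way to unpack the local physical address fields using the bit positions given in Figure 5.1 (Base Address [33:0], CUB [37:34], Socket [49:38], upper bits ignored). The structure and function names are illustrative assumptions, and the socket width follows the figure rather than the 8-bit value quoted in the list.

  #include <stdint.h>

  /* Decoded view of a GC64 local physical address (Figure 5.1). */
  typedef struct {
    uint64_t base;    /* bits [33:0]  : base address within the device       */
    uint32_t cub;     /* bits [37:34] : Cube ID (0xF selects the scratchpad) */
    uint32_t socket;  /* bits [49:38] : socket identifier within the node    */
  } gc64_paddr_t;

  static gc64_paddr_t gc64_decode_paddr(uint64_t paddr) {
    gc64_paddr_t a;
    a.base   = paddr & 0x3FFFFFFFFull;              /* 34-bit base address     */
    a.cub    = (uint32_t)((paddr >> 34) & 0xF);     /* 4-bit Cube ID           */
    a.socket = (uint32_t)((paddr >> 38) & 0xFFF);   /* socket; [63:50] ignored */
    return a;
  }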

5.1.4 Extended Physical Addressing

The extended physical addressing format in Figure 5.2 requires the optional 128-bit addressing mode (based upon the RISC-V 128-bit addressing mode).


Unused [63:50] | Socket [49:38] | CUB [37:34] | Base Address [33:0]

Figure 5.1: GC64 Physical Address Format

Unused [127:96] | Partition ID [95:80] | Node ID [79:64]

Unused [63:50] | Socket [49:38] | CUB [37:34] | Base Address [33:0]

Figure 5.2: GC64 Extended Physical Address Format

In this mode, global addresses are formed using a base 64-bit addressing payload and an extended 64-bit addressing payload. The extended payload consists of two additional fields that permit applications to directly access the physical memory hierarchy of adjacent nodes and/or partitions. In addition to basic shared memory programming, the extended physical addressing schema provides a unique ability to support partitioned global address-style programming models such as Unified Parallel C [37] with native support in the hardware. Accessing a neighboring partition's memory space can be achieved by modifying the node and/or partition fields in the physical address. The extended physical addressing fields are designated as follows (a short sketch of forming such an address appears after the list):

• Node ID: This 16-bit field designates a remote node ID that may contain one or more GC64 sockets (with attached memory hierarchies).

• Partition ID: This 16-bit field designates a remote partition that may contain one or more GC64 nodes (each of which contain one or more sockets).

• Unused/Ignored: The upper 32 bits of the extended address are ignored
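
As noted above, a remote access is formed by stamping the node and partition identifiers into the extended 64-bit payload. The sketch below assumes a simple two-word representation of the 128-bit extended address; the type and helper names are illustrative.

  #include <stdint.h>

  /* 128-bit extended physical address (Figure 5.2) held as two 64-bit words:
   * lo carries the local format of Figure 5.1, hi carries Node ID [79:64]
   * and Partition ID [95:80]; bits [127:96] are ignored.                     */
  typedef struct {
    uint64_t lo;
    uint64_t hi;
  } gc64_xpaddr_t;

  /* Target the same local address on a remote node/partition by rewriting
   * the extended fields.                                                     */
  static gc64_xpaddr_t gc64_remote_addr(uint64_t local, uint16_t node, uint16_t part) {
    gc64_xpaddr_t x;
    x.lo = local;
    x.hi = ((uint64_t)part << 16) | (uint64_t)node;  /* node -> [79:64], partition -> [95:80] */
    return x;
  }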


Figure 5.3: GC64 Software Managed Scratchpad (the scratchpad SRAM, sized as a multiple of a 4KB page, attaches to the OpenSoC fabric through separate request and response ports)

5.2 Scratchpad Memory Architecture

In order to support a seamless user experience, the software managed scratchpad present in the GoblinCore-64 system architecture must appear as a cohesive part of the memory subsystem. However, unlike the HMC main memory system, the software-managed scratchpad can accept rudimentary memory operations directly from the GC64 task units. As mentioned in Chapter 3, the scratchpad memory is attached to the remainder of the device via the OpenSoC fabric on-chip network. Unlike the other devices on chip, the scratchpad interface contains two I/O ports to the OpenSoC fabric. One of the ports is designated as the input (request) port and one of the ports is designated as the output (response) port. In this manner, we can support asymmetric and asynchronous requests to support the weakly ordered request structure in GC64. In addition to the connectivity requirements for the OpenSoC fabric, the scratchpad also has a number of specific software-related requirements for its internal architecture. Figure 5.3 denotes the internal SRAM storage of the scratchpad as a multiple of 4KB. We define this requirement in order to map the internal addressing and storage structure to the standard Linux virtual memory page. In this manner, the kernel modules required to attach the scratchpad memory to the operating system can map the internal space directly to a page or some number of pages.
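
A minimal sketch of the sizing constraint described above: because the scratchpad SRAM is a multiple of the 4KB Linux page size, the kernel module that exposes it can compute an exact page count. The constant and function names are assumptions for illustration.

  #include <stddef.h>

  #define GC64_PAGE_SIZE 4096u   /* standard Linux 4KB virtual memory page */

  /* Returns the number of whole pages the scratchpad maps to, or 0 if the
   * SRAM size violates the modulo-4KB requirement.                         */
  static size_t scratchpad_page_count(size_t sram_bytes) {
    if (sram_bytes % GC64_PAGE_SIZE != 0)
      return 0;                            /* invalid configuration */
    return sram_bytes / GC64_PAGE_SIZE;
  }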

5.3 Dynamic Memory Request Coalescing

5.3.1 Request Coalescing Motivation

The Hybrid Memory Cube devices were chosen as the basis for the GoblinCore-64 main memory infrastructure given their ability to potentially deliver over 300 GB/s of memory

bandwidth per device [17]. The interface to these devices is quite orthogonal to traditional JEDEC DDR devices. Rather than utilizing traditional 16-byte memory read and write requests, the HMC device architecture makes use of a unique packetized interface. This packetized interface provides the ability to dispatch memory read and write requests that encapsulate data payloads much larger than 16 bytes. In this manner, the HMC devices have the ability to amortize the control overhead by dispatching memory requests with larger read or write payloads. For HMC Gen1 devices, the device packet specification supports read and write requests from 16 bytes to 128 bytes, and all 16 byte multiples in between. The HMC Gen2 devices add the ability to dispatch read and write requests with 256 byte payloads. The most optimal use of available HMC bandwidth is achieved using the largest possible memory request payloads, thus minimizing the control overhead present on the bus. The larger the memory request, the lower the relative control overhead. We summarize the control overhead for HMC Gen2 devices for normal read and write (non-posted) requests in Table 5.1. Note that for the HMC packet specification, a single flow unit, or FLIT, is 16 bytes of data. From this table, we can see that the control overhead ranges from 66.67% for 16 byte read and write requests down to a minimum of 11.11% for 256-byte read and write requests. Our initial study focused on the Gen1 specification with a minimum control overhead of 20% for 128-byte read and write packets. As such, the primary goal of the dynamic memory coalescing approach is to advantageously reorganize the incoming raw memory requests into the largest possible HMC payloads, thus maximizing the amount of available bandwidth for raw request data.
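
The overhead figures in Table 5.1 follow directly from the 16-byte FLIT framing: a non-posted read or write carries its data FLITs plus one header/tail FLIT on the request and one on the response. The short program below reproduces that arithmetic; it is a back-of-the-envelope model rather than an HMC controller implementation.

  #include <stdio.h>

  #define FLIT_BYTES 16   /* one HMC flow unit (FLIT) is 16 bytes */

  /* Control overhead for a non-posted read or write moving payload_bytes of
   * data: the payload occupies payload/16 FLITs and the request/response
   * pair adds two further FLITs of header/tail framing.                     */
  static double hmc_control_overhead(unsigned payload_bytes) {
    unsigned total_bytes = payload_bytes + 2 * FLIT_BYTES;
    return 100.0 * (double)(2 * FLIT_BYTES) / (double)total_bytes;
  }

  int main(void) {
    printf("16B : %.2f%%\n", hmc_control_overhead(16));    /* 66.67 */
    printf("128B: %.2f%%\n", hmc_control_overhead(128));   /* 20.00 */
    printf("256B: %.2f%%\n", hmc_control_overhead(256));   /* 11.11 */
    return 0;
  }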

5.3.2 Request Coalescing Methodology

The GC64 dynamic memory coalescing approach is designed to ingest streams of raw memory requests from a series of GC64 task groups and their respective task procs/units [160]. The DMC methodology is agnostic to the existence of a multi-level caching hierarchy as it simply receives incoming memory requests, buffers the requests and advantageously coalesces them into larger HMC packet requests. We do so by building a binary tree of memory requests, using the memory request type and respective base address to dynamically reorder the tree.

Table 5.1: HMC Device Packet Control Overhead

Packet Type   Request Flits   Request Bytes   Response Flits   Response Bytes   Total Flits   Total Bytes   % Overhead
WR16          2               32              1                16               3             48            66.67
WR32          3               48              1                16               4             64            50
WR48          4               64              1                16               5             80            40
WR64          5               80              1                16               6             96            33.33
WR80          6               96              1                16               7             112           28.57
WR96          7               112             1                16               8             128           25
WR112         8               128             1                16               9             144           22.22
WR128         9               144             1                16               10            160           20
WR256         17              272             1                16               18            288           11.11
RD16          1               16              2                32               3             48            66.67
RD32          1               16              3                48               4             64            50
RD48          1               16              4                64               5             80            40
RD64          1               16              5                80               6             96            33.33
RD80          1               16              6                96               7             112           28.57
RD96          1               16              7                112              8             128           25
RD112         1               16              8                128              9             144           22.22
RD128         1               16              9                144              10            160           20
RD256         1               16              17               272              18            288           11.11


Figure 5.4: Dynamic Memory Coalescing (state of the coalescing tree after three insertions: (1) Insert RD8 0x1000000fc, (2) Insert RD4 0x1000000f8, (3) Insert WR8 0x100000180)

For the GC64 micro architecture, we replace the area required to host the cache with a coalescing infrastructure that supports optimizing memory requests with spatial and temporal locality as well as irregular request patterns often found in data intensive computing algorithms and applications. Rather than being handled via a traditional multi-way caching hierarchy, the requests are buffered and advantageously coalesced into larger HMC packet requests. However, future variants of GC64 may implement a portion of traditional data caches directly coupled to the individual task processors. The implementation of the dynamic memory coalescing memory pipeline does not prohibit the existence of a data cache. By default, the binary tree contains a single node, the root node, that is effectively a null request. From the root node, all nodes in the left subtree represent read requests. All nodes in the right subtree represent write requests. As new requests arrive in the DMC module, individual requests are inserted into the binary tree as tuples. The tuples contain the source processor identifier, the respective memory operation and the base address of the request. The processor identifier is utilized to track an individual memory request within the stream of requests after it has been coalesced with other requests and is awaiting an HMC response. Once the responses are received by the DMC engine, the respective packets are deconstructed and the data is distributed back to the source processing unit(s) as described by this identifier. We provide an example of this process in Figure 5.4. In our example, we begin our execution using a coalescing tree with a single node, the root node. The first incoming request is an 8-byte read operation starting at address 0x1000000fc. This becomes the first read node in the left binary subtree. The second incoming request is a 4-byte read operation starting at address 0x1000000f8. Since the starting address of our second request is less than the first read address, we insert the new


node to the left of the 0x1000000fc node. The third incoming request is an 8-byte write operation starting at address 0x100000180. This request is inserted into the right subtree as the first unique write request. Notice that we do not make any special considerations for the ordering of subsequent read and write requests. This preserves the natural weak ordering presented by the HMC device interface. This process of inserting new vertices in the binary tree is performed until one of two expiration states is reached. The insertion and management of the binary tree is encapsulated in two fundamental algorithms. First, the insertion of requests is managed using a traditional binary tree insertion algorithm. Requests are inserted into the tree in address order where the left-most child node in the respective subtree represents the lowest address. As such, the weakly-ordered RISC-V memory requests are re-arranged in address order. The insertion algorithm, described in Algorithm 2, is similar to the classic recursive binary tree insertion approach. Duplicated request addresses are given unique vertices in order to correctly map HMC responses back to unique GC64 thread units. In addition to the memory request tree structure, we also record the total size of the incoming read and write requests, respectively, currently stored in the binary tree. Requests are buffered in the binary tree until one of two states is reached. First, if a read or write subtree reaches a point of saturation, then the requests contained within the respective subtree are expired or flushed to HMC requests. The saturation point is reached when the number of bytes stored in the subtree is equivalent to the maximum HMC request size. For example, for Gen1 devices, the saturation point is 128 bytes. Gen2 devices have a saturation point of 256 bytes. The saturation point is tracked for the read and write subtrees, respectively. As such, it is entirely probable that a saturation point is reached on one, but not both, of the respective subtrees during a given algorithmic insertion cycle. The second state of tree expiration occurs when the buffer life reaches a maximum timeout value. The timeout value ensures that the buffer of HMC requests has a finite life time. Applications and algorithms that contain periods of dense compute with little memory request traffic would otherwise stall the buffers in the DMC. Much like the total byte storage, we set individual timeout values for the read and write subtrees, respectively. By default, both timeouts are set to 16 insertion cycles. In this manner, the subtree is forcibly flushed to HMC requests every 16 subtree insertions. Large timeout values would induce an unnecessary increase in overall memory latency and small values would not permit sufficient memory coalescing.
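
To make the preceding description concrete, the sketch below shows one plausible encoding of the request tuple, the tree vertex, and the per-subtree saturation and timeout bookkeeping. The thresholds (128-byte Gen1 saturation and the 16-insertion timeout) follow the text, while the exact layout and names are assumptions rather than the DMC microcode itself.

  #include <stdint.h>

  #define SATURATION_BYTES 128   /* Gen1 maximum HMC request payload       */
  #define TIMEOUT_INSERTS   16   /* forced flush after 16 insertion cycles */

  typedef enum { REQ_READ, REQ_WRITE } req_type_t;

  /* Tuple carried with every buffered request: the source task processor,
   * the operation type, the base address and the request size in bytes.   */
  typedef struct dmc_node {
    uint32_t         proc_id;
    req_type_t       type;
    uint64_t         addr;
    uint32_t         size;
    struct dmc_node *left, *right, *parent;
  } dmc_node_t;

  /* Per-subtree bookkeeping used to decide when the subtree is expired.   */
  typedef struct {
    uint32_t bytes;     /* total bytes currently buffered in this subtree  */
    uint32_t inserts;   /* insertions since the last flush                 */
  } dmc_subtree_t;

  /* A subtree is flushed when it saturates a maximum-size HMC packet or
   * when its insertion-cycle timeout expires, whichever occurs first.     */
  static int dmc_should_flush(const dmc_subtree_t *st) {
    return st->bytes >= SATURATION_BYTES || st->inserts >= TIMEOUT_INSERTS;
  }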


Figure 5.5: Tree Expiration (coalescing of 16 raw requests into 6 HMC requests)

Raw requests, in insertion order:
 1. RD8: 0x000000000000d3a8
 2. RD8: 0x000000000000d3a0
 3. WR8: 0x0000000051378000
 4. RD8: 0x0000000051378000
 5. RD8: 0x000000000000d3d8
 6. RD8: 0x000000000000d3d0
 7. RD8: 0x000000000000d3c8
 8. RD8: 0x000000000000d3c0
 9. RD8: 0x000000000000d3b8
10. WR8: 0x000000005137a000
11. RD8: 0x000000000000dd18
12. WR8: 0x000000005137a010
13. RD8: 0x000000000000dd18
14. RD8: 0x0000000051376000
15. RD8: 0x0000000051378000
16. WR8: 0x000000005137a008

Coalesced requests:
1. HMC_RD64: 0x000000000000d3a0: {2, 1, 9, 8, 7, 6, 5}
2. HMC_RD16: 0x000000000000dd18: {13}
3. HMC_RD16: 0x0000000051376000: {14}
4. HMC_RD16: 0x0000000051378000: {4, 15}
5. HMC_WR16: 0x0000000051378000: {3}
6. HMC_WR16: 0x000000005137a000: {10, 16}

The tree expiration algorithm, shown in Algorithm 3, represents the second fundamental algorithm utilized in our approach. The purpose of this algorithm is to perform the actual coalescing of the memory requests found in a subtree into equivalent HMC requests. The algorithm is based upon the traditional, in-order binary tree traversal. We begin traversing the target read or write subtree at the left-most node. We recursively visit new nodes and attempt to build contiguous or semi-contiguous HMC requests using the list of visited vertices. HMC requests are then built from the active list of vertices starting at the base address, or left-most vertex, found in the list. However, unlike the insertion algorithm, the tree expiration algorithm differs slightly between the read and write subtrees. For the read subtree, we permit gaps in the final HMC request structure as it is more efficient to build larger requests with small gaps in the returning data payload. We also permit two vertices with the same source address to be coalesced into a single request slot. For each new read vertex, we insert the vertex into the current HMC request list if the resulting HMC request would be less than the largest possible HMC request or if the address is identical to the previous vertex (duplicate memory request). In this manner, the resulting HMC requests are semi-contiguous with respect to the raw read requests.


For the write subtree, however, we ensure true continuity in the final HMC request addressing in order to prevent any illegal data writes. This is true with respect to both gaps and duplicate addresses. In order to preserve the notion of ordered writes to the same addresses, duplicates in the write subtree are preserved through to the HMC packet generation. In addition to these fundamental rules governing memory ordering, we also enforce subtree flushes when encountering memory barrier requests such as a memory flush. As an example, Figure 5.5 presents a series of raw incoming read and write requests that have been coalesced using the aforementioned approach. For this example, assume that the tree is initially empty and the requests are inserted in the order presented in the included table. The sequence number for each incoming request is represented in the corresponding vertex in the binary tree. After inserting 16 raw, 8-byte memory requests into the tree, both the read and write subtrees reach their maximum timeout value. As such, we begin flushing the trees into coalesced HMC requests. For the 16 incoming requests, our coalescing algorithm subsequently builds four HMC read requests and two HMC write requests. The Coalesced Requests table in Figure 5.5 lists the individual HMC requests, the corresponding base address and the numbered vertices included in the respective HMC packet. Notice that our first read request represents seven individual read requests coalesced into a single packet. In addition, HMC requests 2 and 3, respectively, do not have sufficient address continuity with neighboring vertex addresses to justify further coalescing. Also note that the fourth HMC packet actually represents duplicate addresses in the tree. As a result, for our 16 incoming requests, we produce 6 outgoing HMC requests, which represents a 62.5% reduction in the memory request traffic.
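
A sketch of the per-vertex decision applied during the in-order traversal: a read vertex joins the HMC request currently being assembled if it keeps the request within the maximum payload (gaps are tolerated) or if it duplicates the previous address, whereas a write vertex must additionally be exactly contiguous. The function names and parameters are illustrative assumptions.

  #include <stdint.h>

  #define MAX_HMC_PAYLOAD 128   /* Gen1 maximum request payload in bytes */

  /* base is the start of the request being built, cursor is the end of the
   * previously accepted vertex, prev_addr is that vertex's address.        */
  static int read_vertex_fits(uint64_t base, uint64_t prev_addr,
                              uint64_t addr, uint32_t size) {
    if (addr == prev_addr)                             /* duplicate address */
      return 1;
    return (addr + size - base) <= MAX_HMC_PAYLOAD;    /* gaps allowed      */
  }

  static int write_vertex_fits(uint64_t base, uint64_t cursor,
                               uint64_t addr, uint32_t size) {
    return addr == cursor &&                           /* true continuity   */
           (addr + size - base) <= MAX_HMC_PAYLOAD;
  }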

5.3.3 Request Coalescing Architecture

The DMC architecture, shown in Figure 5.6, couples a small, embedded RISC-V CPU to a small SRAM buffer used to contain a block of microcode [1] as well as any required data structures used to implement our DMC algorithms. We interconnect our DMC unit to the input I/O paths normally reserved for the input to the caching logic. The incoming raw I/O requests are inserted into a standard hardware FIFO, which triggers an interrupt in the microcode to read the values into the aforementioned algorithms. The coalesced memory requests are encoded into the standard HMC packet format [17] and injected into one or more HMC memory controllers.


Figure 5.6: DMC Architecture (the task units and thread control unit issue raw memory requests into the DMC unit, which couples a RISC-V ALU, the DMC microcode and an SRAM request buffer, and emits coalesced HMC packet requests to the HMC memory)

All the HMC token allocation and token processing is performed within the HMC. Returning packets are processed in a similar manner via an incoming response FIFO where we decode the response packet IDs and unpack the results for return upstream to the respective GC64 task procs. The microcode that implements both of the core DMC algorithms is written in approximately 395 lines of ANSI-C. The microcode is written in a manner that avoids any dynamic memory allocation. In this manner, we allocate all the necessary data structures statically. In addition to the core microcode, we also provide a driver application that permits us to drive test benches, verification and performance counters through simulation. Both the driver and the core microcode have been released open source under a modified BSD license [1]. The source package provides the necessary build environment to build and evaluate our methodology on a simulated or hardware-driven RISC-V device.
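
The sketch below illustrates the general shape such statically allocated microcode could take: a main loop that drains the hardware request FIFO, inserts requests into the tree, and flushes a subtree when it saturates or times out. The hook functions are hypothetical placeholders reusing the names from the earlier sketches; this is not the released DMC microcode.

  #include <stdint.h>

  /* Hypothetical hooks into the request FIFO, the coalescing tree and the
   * HMC packetizer; all backing storage is statically allocated.          */
  extern int  fifo_pop(uint64_t *addr, uint32_t *size, int *is_write, uint32_t *proc);
  extern void tree_insert(uint64_t addr, uint32_t size, int is_write, uint32_t proc);
  extern int  subtree_should_flush(int is_write);
  extern void subtree_flush_to_hmc(int is_write);

  void dmc_main_loop(void) {
    uint64_t addr; uint32_t size, proc; int is_write;

    for (;;) {
      /* Drain any raw requests delivered by the hardware FIFO. */
      while (fifo_pop(&addr, &size, &is_write, &proc))
        tree_insert(addr, size, is_write, proc);

      /* Expire a subtree on saturation or insertion-cycle timeout. */
      if (subtree_should_flush(0)) subtree_flush_to_hmc(0);   /* read subtree  */
      if (subtree_should_flush(1)) subtree_flush_to_hmc(1);   /* write subtree */
    }
  }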


Table 5.2: DMC Synthesized Physical Design

Hardware Unit     Used    Available   % Utilized
Slice LUTs        13521   17600       76.82
Slice Registers   9727    35200       27.63
F7 Muxes          315     8800        3.58
F8 Muxes          43      4400        0.98
Block RAM Tile    6       60          10.00

As a part of the process of evaluating the efficacy of our approach, we implemented a sample DMC engine using the RISC-V Rocket-Chip [93] core generation tool. The core utilizes the simple RISC-V 64-bit integer ISA with multiplication and division extensions enabled (RV64IM). The implementation was targeted at a Xilinx Zynq 7000 FPGA [169] hosted on a Digilent Zybo board [56]. The total design fit well within the constrained area of the Zynq device. The physical core, complete with all the HTIF I/O interfaces, was synthesized with Xilinx Vivado 2015.4 [168]. We summarize the physical design requirements in Table 5.2. As we can see, the design is largely dominated by the LUT slices. In this manner, we optimize the use of large block RAMs by reducing the size of the storage required to host the DMC microcode. Additional design optimization may be possible in order to reduce the LUT requirements further.


Algorithm 2 Raw Request Insertion
Require: Root is the root node of the binary tree; ReadBytes, WriteBytes, ReadTimeout and WriteTimeout are initially zero
Ensure: The new memory request is inserted into the binary tree in address order

FUNCTION: InsertRequest(Request)
  if Request.Type = READ then
    InsertChild( Root.Left, Request )
    ReadBytes += Request.Size
    ReadTimeout ++
  else if Request.Type = WRITE then
    InsertChild( Root.Right, Request )
    WriteBytes += Request.Size
    WriteTimeout ++
  end if

FUNCTION: InsertChild(Parent, Request)
  if Parent.Request.Address >= Request.Address then
    if Parent.Left = NULL then
      Parent.Left = Request
      Request.Parent = Parent
    else
      InsertChild( Parent.Left, Request )
    end if
  else
    if Parent.Right = NULL then
      Parent.Right = Request
      Request.Parent = Parent
    else
      InsertChild( Parent.Right, Request )
    end if
  end if


Algorithm 3 HMC Request Flush
Require: A subtree exists with one or more vertices; TotalReadBytes, TotalWriteBytes, BaseAddr and TempAddr are all initially zero
Ensure: The entire subtree is coalesced into one or more HMC requests

FUNCTION: InOrderTraverse(Vertex, Type, End)
  if Vertex != NULL then
    InOrderTraverse(Vertex.Left, Type, End)
    if Vertex = End & Type = READ then
      TempAddr = Vertex.Address + Vertex.Size
      TotalReadBytes = TempAddr - BaseAddr
      if TotalReadBytes <= 128 then
        HMCRqst(TotalReadBytes, BaseAddr)
      end if
      Return
    else if Vertex = End & Type = WRITE then
      TempAddr = Vertex.Address + Vertex.Size
      TotalWriteBytes = TempAddr - BaseAddr
      if TotalWriteBytes <= 128 then
        HMCRqst(TotalWriteBytes, BaseAddr)
      end if
      Return
    end if
    if Type = READ then
      if Vertex.Address + Vertex.Size - BaseAddr < 128 then
        TempAddr = Vertex.Address + Vertex.Size
      else
        TotalReadBytes = TempAddr - BaseAddr
        BaseAddr = Vertex.Address
        TempAddr = Vertex.Address + Vertex.Size
        TotalReadBytes = 0
      end if
    end if
    if Type = WRITE then
      if Vertex.Address + Vertex.Size - BaseAddr < 128 & Vertex.Address = TempAddr & Vertex.Address = BaseAddr then
        TempAddr = Vertex.Address + Vertex.Size
      else
        TotalWriteBytes = TempAddr - BaseAddr
        BaseAddr = Vertex.Address
        TempAddr = Vertex.Address + Vertex.Size
        TotalWriteBytes = 0
      end if
    end if
    InOrderTraverse(Vertex.Right, Type, End)
  end if


CHAPTER 6 HMC SIMULATION AND ANCILLARY DEVICE COMPONENTS

6.1 Software Architecture

The software architecture of the GoblinCore-64 system architecture consists of a similar set of software components as traditional systems. However, we reorganize the traditional software hierarchy in a manner that permits us to exploit the micro architecture extensions and global memory infrastructure associated with the GC64 architecture. As we see in Figure 6.1, traditional system architectures require complex software stacks in order to provide a sufficient degree of abstraction necessary to implement high-level programming models. Notice that the operating system (Figure 6.1.A), in this example Linux, serves as the primary interface to the system architecture. In this manner, many of the fundamental functions required to perform operations such as high performance communication, thread management and basic programming model abstraction are routed through the operating system.

Figure 6.1: Traditional and GC64 Software Architectures (A. the traditional software stack layers numerical libraries, programming model tools, programming models such as OpenMP, MPI and UPC, threading and communication libraries, and system libraries above the Linux operating system and the hardware; B. the GC64 software stack layers the same programming models above a GC64 programming model machine abstraction layer and a GC64 machine runtime alongside the system libraries and Linux, directly atop the GoblinCore-64 hardware)

However, the GoblinCore-64 software architecture alleviates much of the overhead associated with traditional software stacks by coupling the aforementioned micro architecture

extensions directly to machine-level and programming model runtime infrastructures. As shown in Figure 6.1.B, the GoblinCore-64 software architecture circumvents much of the traditional system-level software stack by providing machine-level runtime library interfaces to the low-level hardware extensions. These interfaces provide higher-level programming model abstractions access to common functions such as the thread/task management extensions in the micro architecture as well as the global memory space in the form of communication primitives. This provides high-level programming model constructs the ability to make the most advantageous use of the system architecture optimizations present in the GoblinCore-64 architecture.

6.2 Atomic Memory Operation Unit

The atomic memory operation unit attached to the GoblinCore-64 system on chip is a direct extension of the dynamic memory coalescing unit outlined in Section 5.3. The atomic memory operation unit accepts streams of memory requests for the locally attached HMC device specifically for atomics. Given that the atomics have been coalesced by the DMC engine, the request may contain a number of individual atomics destined for a singular address. This scenario will often exist when algorithms seek to create barriers and synchronization points in parallel algorithms. The goal of the atomic memory operation unit is to further reduce the memory bandwidth required to service atomics without the requirement to construct a traditional cache mechanism. Our approach to constructing the atomic memory operation unit is based upon traditional systolic pipelines often utilized in FPGA IP blocks [150] [128]. This approach utilizes a series of identical pipeline stages to process a multitude of incoming requests. At each stage of the pipeline, the systolic array generates a single output result. Figure 6.2 depicts a sample of this scenario. Notice that the incoming request is initially serviced by the first (0) AMO ALU. The result of the first stage is then passed off to the second pipeline stage for the next stage in the computation. Also note that each stage generates a single resulting output. This permits us to iteratively send a memory response back to the corresponding source task unit. This process of calculating an atomic value and sending the responses back to the task units occurs on a single cycle cadence. Consider the set of instructions in Table 6.1.


Figure 6.2: AMO Systolic Execution Unit (a request read from memory flows through five identical AMO ALU pipeline stages, 0 through 4, and each stage produces a single response back to its source task unit)

Table 6.1: Example Atomic Memory Operation Request Stream

Cycle   Instruction   Address        Source Task Unit
0       AMOADD.D      0xFE00000000   0
1       AMOADD.D      0xFE00000000   4
2       AMOADD.D      0xFE00000000   5
3       AMOADD.D      0xFE00000000   7
4       AMOADD.D      0xFE00000000   1

We have five instructions executed on five individual cycles using the same instruction and the same address. However, the source of the instruction execution is via five separate task units. As a result, five separate response signals must be generated in order to correctly service the requests. Executing this series of requests via our atomic memory operation unit will require all five stages of our systolic atomic memory operation unit. It will also require a single memory read request and a single memory write request. We see the full set of execution cycles through the systolic array in Table 6.2. If the initial state of the memory location at cycle 0 is decimal "70" and the argument to each atomic instruction for the addition operation is decimal "1", we see how the systolic array generates results for each subsequent cycle.


Table 6.2: Systolic Atomic Memory Operation Request Stream

Cycle   ALU 0   ALU 1   ALU 2   ALU 3   ALU 4   Destination Task Unit
0       71      -       -       -       -       0
1       -       72      -       -       -       4
2       -       -       73      -       -       5
3       -       -       -       74      -       7
4       -       -       -       -       75      1
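
The cadence in Table 6.2 can be modeled as a pipeline in which the running value advances one AMO ALU per cycle and each stage emits one response to its source task unit, bracketed by a single memory read and a single memory write. The following behavioral model reproduces the example above; the names and five-stage depth are taken from the example and do not represent the hardware implementation.

  #include <stdint.h>
  #include <stdio.h>

  #define AMO_STAGES 5

  int main(void) {
    int64_t  value = 70;                                /* single memory read      */
    const unsigned src[AMO_STAGES] = {0, 4, 5, 7, 1};   /* Table 6.1 request order */

    for (unsigned cycle = 0; cycle < AMO_STAGES; cycle++) {
      /* Stage `cycle` applies its AMOADD.D (+1) to the running value and
       * immediately returns the result to the issuing task unit.           */
      value += 1;
      printf("cycle %u: ALU %u -> %lld (response to task unit %u)\n",
             cycle, cycle, (long long)value, src[cycle]);
    }

    printf("final memory value written back: %lld\n", (long long)value);   /* single write */
    return 0;
  }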

6.3 HMC Channel Interface

In order to support the interconnection to local and global GoblinCore-64 memory and network resources, respectively, we must include an HMC channel interface for each of the attached links. Several commercial HMC IP hardware blocks are available for both ASIC-derived implementations and FPGA platforms [13] [8] [4] [6]. However, given the high cost threshold and the BSD licensing of the entire GoblinCore-64 project source, the commercial HMC controller IP options were deemed unusable for our approach. Further, at least one open source HMC controller is available. The openHMC project has created an open source IP block suitable for ASICs and FPGAs [139]. However, the openHMC project maintains an incompatible GNU license. As a result, the BSD license of the GoblinCore-64 project again deems this unusable. Given the lack of suitable, open source IP options to utilize as the basis of our HMC memory controller, we construct our HMC controller using small, efficient RISC-V cores. We utilize RISC-V Z-Scale cores as the basis of our controller IP [92]. Much in the same manner as the Dynamic Memory Coalescing unit (Section 5.3), we utilize this efficient RISC-V core bundled with a block of microcode that implements the necessary token handlers and packet specifications required to interact with both local and global HMC links.

6.4 HMC Memory Devices

6.4.1 HMC Device Simulation

As stated in Chapters 3 and 5, the GoblinCore-64 system and micro architecture is explicitly designed to make use of Hybrid Memory Cube (HMC) devices [17]. The HMC devices provide a significant increase in potential bandwidth per device and bandwidth per

watt as compared to traditional JEDEC-style memory devices. However, given the novelty and infancy of the HMC specification, there has been very little published in academic and commercial literature regarding performance, best practices and application-specific optimizations for these devices. As a result, we developed an integrated framework as a platform to simulate different algorithmic and application scenarios against different HMC device configurations. This simulation framework, deemed HMC-Sim [98] [99] [94], has become the de facto standard, open source HMC simulation platform for academic users. Historically, computer architects and researchers have relied heavily on a myriad of simulation techniques in order to test new functionality, improve theoretical designs, and reduce the costs associated with re-fabricating erroneous devices. Traditional simulation infrastructures have relied upon memory organizations with simple row and column storage techniques. These two dimensional memory models do not take into account the forthcoming trend of constructing devices using through silicon via, or stacked, manufacturing process technologies. A prime example of three dimensional memory technology that currently exists as an industry standard is the Hybrid Memory Cube specification [17]. As a result, there exists a gap in the ability to simulate multidimensional memory devices that may contain layers dedicated to functionality outside the current scope of banked memory storage. The HMC-Sim simulation framework provides three novel techniques that enable flexible simulation capabilities for three dimensional stacked memory devices based upon the Hybrid Memory Cube specification:

1. Multidimensional Memory Infrastructure: Unlike traditional row/column oriented DDR3 and DDR4 [80] [81] main memory standards, the Hybrid Memory Cube specification has the addition of multiple die layers that provide parallel access to memory storage in three dimensions. This three dimensional layered approach permits memory transactions to exist in parallel not only across banks within a target storage array, but also across parallel storage arrays. As a result, current memory simulation techniques fall short of providing adequate support for the three dimensional block memory storage.

2. Coupled Logic and Memory Layered Packages: In addition to the stacked memory layers, the HMC specification contains a base logic die layer. This base layer provides flexible access to the three dimensional storage layers via a set of parallel


logic blocks. Given that current memory simulation techniques rely upon discrete memory storage and memory controller (logic) models, they fall short of providing adequate functionality to mimic such a tightly coupled, inherently parallel package.

3. Flexibility of Logic and Memory Layer Configurations: Unlike the rigid aforementioned main memory specifications, the HMC specification permits flexible interpretation and implementation of the target device. There exist several device configurations with respect to capacity, bandwidth, connectivity and internal logic block functionality. The resulting design specification accommodates implementers from a disparate set of mobile, embedded and high performance device manufacturers. As such, any attempt at building a general purpose simulation infrastructure for an HMC device must also adopt an equivalently flexible infrastructure.

To address the shortcomings of the current memory simulations, we propose and leverage the aforementioned three novel techniques to build HMC-Sim, a flexible simulation infrastructure that enables architectural simulation infrastructures to migrate from traditional banked main memory models to implementations based upon the HMC specification. This research is an extended study of early research efforts originally published in [98]. We are currently utilizing HMC-Sim to develop a high bandwidth memory system to support the massively parallel GoblinCore-64 processor and system architecture project [105]. Furthermore, we enable those performing systems software research associated with addressing models and virtual to physical address translation techniques to accurately simulate the future of stacked memory and logic devices. HMC-Sim is currently open source and freely available to the community, licensed under a BSD-style license model [105] [5].

6.4.2 Previous Simulation Efforts

Memory simulation techniques can be classified into two categories: hardware-based techniques and software-based techniques. Despite the recent progress on building fast, accurate and high performance FPGA-based simulation systems and techniques [153] [51], the overall class of hardware-based simulation techniques requires a high degree of understanding in both logic architecture and systems software architecture. Associated with these trends in hardware simulations, we also observe many advances in cross compilation


techniques [29] in order to ease the burden of creating hardware logic. Even so, the burden of creating combinatorial logic that mimics a multidimensional logic die is both expensive (hardware costs) and time consuming. HMC-Sim currently focuses only on software simulation models. Software simulation techniques can be further classified across several multi-faceted dimensions. These techniques define the intersection of functionality between the following four method classes:

• Synchronous: Defines how the simulation model discretizes time into fixed increments. At individual time increments, all functional components are evaluated and their respective states are updated.

• Event-Driven: Components are only evaluated when their respective inputs are modified between discretized points in time. This is also analogous to message-driven models.

• Process-Model: Considers each component to be a holistic unit and defines processes or process models associated with each unit. Processes advance their own state based upon their interaction with other processes in the simulation system.

• Hybrid Model: Many modern simulation infrastructures utilize two or more of the aforementioned techniques during execution. The result is a highly scalable, yet flexible infrastructure.

Virtualization and emulation have become common methods and techniques utilized in several simulation infrastructures. In several implementations [77] [45] [130] [110] [164], this technique enables the infrastructure to abstract the notion of cycle sampling and operation execution from the micro architectural model in order to provide flexibility across major classes of system architectures. The Zsim [137] notion of virtualization is also very lightweight in order to promote a high degree of parallelization and simulation performance. While this increased flexibility is convenient, this approach does not consider the tightly coupled nature of stacked die memory systems. Virtualization subsequently suffers from a lack of micro architectural fidelity when modeling granular memory operations in three dimensions.


In summary, the lack of cross-cutting memory simulation techniques beyond the functional and cycle-driven DRAM models [135] [145] [109] [155] has led to a gap in the ability to accurately simulate multidimensional memories. Several attempts have been made to combine techniques from fixed memory models with queueing methods and DRAM cycle accuracy in order to improve the overall fidelity of the simulation [145]. However, these techniques rely upon separate simulations of fixed logic devices that serve the purpose of controlling memory transactions to and from the devices, thus proving inadequate for simulating stacked memory devices.

6.4.3 Hybrid Memory Cube Specification Overview

The Hybrid Memory Cube, or HMC, specification [17] is a new high performance, volatile storage paradigm developed by a consortium of memory industry manufacturers and consumers. The guiding principle behind the new device specification is the ability to build through-silicon via, or TSV, devices. TSV manufacturing techniques enable adopters to interconnect multiple die layers in order to construct three-dimensional dies. This ability to interconnect multiple die layers has enabled the specification authors to build a memory device with a combination of memory storage layers and one or more logic layers. In this manner, the device provides the physical memory storage and logical memory transaction processing in a single die package. The end result is a very compact, power efficient package with available bandwidth capacity of up to 320 GB/s per device. The HMC device specification is capable of such bandwidth via a hierarchical and parallel approach to the design. The device hierarchy occurs vertically across the logic layers and the hardware parallelism occurs across a given die layer. The base die layer is generally configured as the logic base, or LoB. The logic layer consists of multiple components that provide both external link access to the HMC device as well as internal routing and transaction logic. The external I/O links are provided by four or eight logical links. Each link is a group of sixteen or eight serial I/O, or SERDES, bidirectional links. Four link devices have the ability to operate at 10, 12.5 and 15 Gbps. Eight link devices have the ability to operate at 10 Gbps. Internally, the links are attached to routing logic in order to direct transactions at logic devices that control each vertical memory storage unit. The specification does not currently define the routing logic. However, it suggests that a crossbar is a logical example of this interconnection logic.


The link structure in the HMC specification has a unique quality among high performance memory devices in that it supports the ability to attach devices to both hosts (processors) and other HMC devices. This concept of chaining permits the construction of memory subsystems that require capacities larger than a single device while not perturbing the link structure and packetized transaction protocols. Links can be configured as host device links or pass-through links in a multitude of topologies. The final major component within the LoB is the vault logic block. Memory in the HMC specification is organized in a three dimensional manner. The major organizational unit is referred to as a vault. Each vault vertically spans each of the memory layers within the die using the through-silicon vias. The vault logic block is analogous to a DIMM controller unit for each independent vault. These vault logic blocks and their respective vault storage units are organized into quad units. Each quad unit represents four vault units. Each quad unit is loosely associated with the closest physical link block. In this manner, host devices have the ability to minimize the latency through the logic layer of an HMC device by logically sending request packets to links whose associated quad unit is physically closest to the required vault. Once within a target memory vault, memory storage is again broken into the traditional concept of banks and DRAMs. Vertical access through the stacked memory layers is analogous to choosing the appropriate memory bank. Lower banks are configured in lower die layers while vertical ascension selects subsequent banks. Once within a bank layer, the DRAM is organized traditionally using rows and columns. The vault controller breaks the DRAM into 1 Mb blocks, each addressing 16 bytes. Read or write requests targeting a single bank are always performed in 32 byte blocks for each column fetch.

6.4.4 HMC-Sim Architecture

Given the high degree of flexibility and the compound logical and physical structures of the Hybrid Memory Cube specification, it is quite difficult to theorize all possible combinations of device configurations, queuing mechanisms and link structures. As such, we adopted a relatively narrow set of initial implementation goals and requirements in order to promote correctness and flexibility at all levels of the implementation. These requirements are outlined as follows:

1. Maintain Flexibility: As of this writing, there did not exist any


publicly accessible implementations of the HMC specification. The specification and corresponding HMC-Sim implementation were interpreted holistically in order to support any possible configuration option or topological device organization.

2. Topologically Agnostic: Given the link structures defined in the initial HMC specification, there exists a myriad of correct and incorrect device topologies. Rather than supporting only correct topologies, a conscious decision was made to support all possible topologies in the event that the target user and application deliberately misconfigure the devices. In this manner, users and applications can receive response packets with error structures due to incorrect topologies.

3. Flexible Queuing: The device specification explicitly calls out the existence of queueing in two physical logic blocks of an HMC device. Both the crossbar queueing and vault queueing mechanisms are deliberately defined in an ambiguous manner such that implementers may tailor the device to specific requirements. The HMC-Sim implementation follows this paradigm by requiring users to specify the depth of both queueing layers at initialization time.

4. Rudimentary Clock Domains: Many modern architecture simulation infrastructures can be executed in a functional mode, cycle accurate mode or a hybrid of both. HMC-Sim implements a rudimentary clock domain mechanism such that users or target applications have the ability to instantiate a clock signal to the simulated devices and thus simulate latent packet delivery.

5. Support for all HMC Packet Variations: In order to provide support for a wide array of simulation scenarios, including functional simulation, error simulation and performance simulation, HMC-Sim implements all possible device packet variations using all combinations of FLITs. The internal packet handlers strictly adhere to the packet format specifications and lengths set forth in the specification.

6.4.4.1 Structure Hierarchy

Given the logical and physical hierarchy present in the HMC device specification, we utilize a similar approach in constructing the internal software representation of an HMC device. This structure was chosen in order to easily track packet source and destination


correctness throughout the life of a device object. This structure was also chosen in order to provide the ability to recognize and trace latency throughout each packet's respective lifetime. For example, consider a device configured with four total links attached to a host device. Each link is physically closest to the respectively numbered quad unit, which contains a block of four vaults. In this manner, this device contains four quad units and sixteen vaults. The bank layout is not required for this example. The host device transmits a 64-byte write request packet over link zero with the required data payload. The packet is queued in the top-level crossbar queue that is physically closest to the combination of quad unit and vault zero. However, the base destination physical address physically falls on a DRAM bank held within vault five within the second quad unit. In order to successfully complete the write request, the packet must be forwarded across the device to vault five. HMC-Sim will perform this operation correctly. It will also recognize and trace the notion that this operation may cause higher latency than simply transmitting the initial write request packet through the second physical link.

Figure 6.3: HMC-Sim API structure hierarchy

As stated above, the internal representation of an HMC device is logically ordered in a hierarchical manner. Each instantiation of an HMC-Sim object adheres to this hierarchy. An application may contain more than one HMC-Sim object in order to simulate architectural characteristics such as non-uniform memory access. Within each object, the structure hierarchy adheres to the following, organized from the highest level to the lowest:

• HMC Devices: Devices are analogous to a single Hybrid Memory Cube device package. The notion of device chaining is supported. Each device structure contains three sub-structures: Links, Crossbar Units and Quad Units. The device structure also contains any device specific configuration registers that are typically accessed via MODE_READ requests, MODE_WRITE requests or the Joint Test Action Group (JTAG) IEEE 1149.1 hardware interface.

• Links: Links are analogous to an HMC physical device link. Per the current specification, device links may connect a host and an HMC device or two HMC devices (chaining). There exist four or eight links per device. HMC-Sim does not currently support mixing devices with different link counts. Each link contains a reference to its closest quad unit and the source and destination device identifiers (including host devices).

• Crossbar Units: Crossbar units are analogous to the first-level logic layer present in an HMC device. They simulate the queuing mechanisms present in the crossbar unit between device links and device vault controllers. Crossbar units contain the request and response queues for the respective device that are accessible from the host.

• Quad Units: Quad units map directly to the notion of a quadrant, or locality domain on an HMC device. Each quad unit is closely related to four vaults in both four and eight link configurations. Each quad unit also contains a pointer to the closest vault unit structures.

• Vaults: The vault structure maps directly to the notion of a vertically stacked vault unit within the HMC device specification. Each vault contains response and request queues whose respective depths are configured at initialization time in order to mimic the presence of a vault controller. Each vault also contains a reference to a block of memory bank structures.

• Banks: The bank structures map directly to the notion of a physical memory bank in the device hierarchy. Each bank is physically nested within its respective vault such that I/O operations do not occur outside the respective vault queue structure. Each bank contains a reference to a block of DRAMs.

• DRAMs: The DRAM structure maps directly to the physical memory DRAM in the device specification. The DRAM contains the designated data storage for all I/O operations.

• Queue Structure: All the queuing structures present in the HMC-Sim structure hierarchy share the same software representation. Each queue contains one or more queue slots. There must exist at least one queue slot for each logical queue representation in order to act as a registered input or output logic stage. Each queue slot contains a valid designator that describes whether the respective slot is in use or not. It also contains the storage for a single packet. Each packet is configured to contain sufficient storage for the largest possible packet with nine FLITs.
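The nesting described in the list above can be summarized with a simplified C sketch. The structure and field names below are hypothetical and capture only the containment relationships; the actual HMC-Sim structures carry additional identifiers, configuration registers and tracing state.

#include <stdint.h>

#define HMC_MAX_FLITS 9   /* largest packet: nine 128-bit FLITs */

/* Hypothetical, simplified mirror of the HMC-Sim structure hierarchy. */
struct queue_slot { int valid; uint64_t packet[HMC_MAX_FLITS * 2]; };
struct queue      { uint32_t depth; struct queue_slot *slots; };

struct dram  { uint8_t *storage; };
struct bank  { struct dram *drams;  uint32_t num_drams; };
struct vault { struct queue rqst_q, rsp_q; struct bank *banks; uint32_t num_banks; };
struct quad  { struct vault *vaults; uint32_t num_vaults; };
struct link  { uint32_t src_cub, dest_cub; struct quad *nearest_quad; };
struct xbar  { struct queue rqst_q, rsp_q; };

struct hmc_dev {
  struct link *links; uint32_t num_links;
  struct xbar *xbars;                 /* one crossbar queue pair per link */
  struct quad *quads;  uint32_t num_quads;
  uint64_t mode_regs[32];             /* device configuration registers   */
};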

Figure 6.3 provides a visual comparison between the physical HMC device configuration and the complementary HMC-Sim software structure. The HMC physical device in the diagram contains the host-to-device link structure, internal link crossbar and a single quadrant with associated vault units. Each vault unit contains eight vertical layers of memory storage. The software structures associated with each hardware component are shown in the respective nested format.

6.4.5 Link Structure

The previous subsection examines the internal hierarchical software structure present within HMC-Sim. This includes the definition of the link structure that includes the closest physical quad unit as well as the source and destination endpoint identifiers. Associated with each of these source and destination identifiers is an enumerated type that specifies the physical endpoint configuration of the respective link. If the device link is connected to a host device (a non-HMC device), the source link is always configured as the host-side connection. Internally, hosts are represented using a nonzero HMC Cube ID that is one greater than the total number of devices (num_devices+1). In this manner, hosts are uniquely identified from pure memory devices but are permitted to send and receive request and response packets in a seamless manner.

6.4.6 Simulation Structure

The HMC-Sim infrastructure provides support for rudimentary clock domains between internal and external memory operations. In this manner, the host processor may operate asynchronously from the target memory devices in the event that core processor frequencies, SERDES link frequencies and the target memory device operation frequencies exist across asynchronous clock boundaries. This also promotes the ability to connect multiple HMC-Sim device objects to a single host and operate them completely independently. This is analogous to the current system on chip methodology of utilizing multiple memory channels per socket. Internal memory operations are classified as follows:

• Packet transfers between link crossbar queues and vault queues

• Memory read or write request processing for each respective vault queue

• Bank conflict recognition within each vault request queue

• Response packet generation following a successful read or write operation

• Response packet generation following a failed read or write operation [error response packets]

External memory operations are classified as follows:

• Sending read or write request packets to a target HMC-Sim device

• Receiving response or error response packets from a target HMC-Sim device

• Sending flow-control packets to a target HMC-Sim device

• Any out of band communication via the JTAG interface


Clock signals are instantiated against a single HMC-Sim object using a single API call. Without this call, external memory operations may progress until appropriate stall signals are recognized. However, internal device operations will not progress until an appropriate call to the clock function has been instantiated. One call to this function progresses the internal memory operations and device clock by a single leading and trailing clock edge, or one clock cycle. The internal clock cycle handlers are progressed in a very explicit order. The internal ordering is currently broken into six sub-cycle operations. Request and response packets are only progressed by a single internal stage per sub-cycle operation. For example, it is not possible for an individual memory read or memory write packet to progress from the device crossbar interface directly to the memory bank in a single sub-cycle operation. Figure 6.4 presents an example simulation flow diagram that describes the sub-cycle stages for single and multiple device configurations. The sub-cycle operations are organized and performed in the following order:

1. Process Child Device Link Crossbar Transactions: The first sub-cycle stage processes the crossbar packets from each of the configured child HMC devices. These are devices that are not connected directly to a host device. The simulation infrastructure walks each link's respective queue structure and determines which appropriate vault devices or remote HMC devices are candidate destinations for the respective packet request. This stage registers trace messages if packets are misrouted or stalled due to queue congestion, or if higher latencies are detected due to the physical locality of the queue versus the destination vault.

2. Process Root Device Link Crossbar Request Transactions: The second sub-cycle stage processes the crossbar packets from the device or devices connected directly to a host interface. The simulation infrastructure walks each link's respective queue structure and determines which appropriate vault devices or remote HMC devices are candidate destinations for the respective packet request. This stage also registers trace messages if packets are misrouted or stalled due to queue congestion, or if higher latencies are detected due to the physical locality of the queue versus the destination vault.

3. Recognize Bank Conflicts on Vault Request Queues: This sub-cycle stage does not


Figure 6.4: HMC-Sim Sub-cycle state machine

modify any internal data representations. Rather, it focuses on recognizing potential bank conflicts on each vault based upon the current state of the request queues. This stage performs this operation by decoding the physical memory addresses present in the request packets and determining whether there exist conflicting packets within a spatial window of the queue. If so, a trace message is generated with the physical locality and clock value of the respective bank conflict event.

4. Process Vault Queue Memory Request Transactions: Once appropriate bank conflicts have been identified, each vault queue on all devices is traversed in FIFO order and


each request packet is appropriately processed. This includes write packets, read packets and atomic operation (read-modify-write) packets. All packets are currently processed in equivalent and constant time as long as their bank addressing does not conflict. All responses to the aforementioned packet transactions are registered in the appropriate vault response queues.

5. Register Response Packets with Crossbar Response Queues: Once all the initial memory read and write transaction operations are performed, the fifth sub-cycle stage processes all the response packets from the respective operations. This includes routing the packets from child devices to root devices in order to deliver them back to the appropriate host device. Response queues are first processed on the root devices, then the attached child devices. Otherwise, the root device response queues may appear unnecessarily congested and produce false positive stall signals back to the child devices.

6. Update the Internal Clock Value: The final sub-cycle stage updates the unsigned sixty-four-bit clock value by one. In this manner, all trace messages reported by the first four stages are registered within the current clock domain. Once the clock value is updated, the clock signal is considered to be at the leading edge of the next clock domain.
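A minimal sketch of this ordering is shown below. The helper names are illustrative placeholders rather than the actual HMC-Sim internals; the sketch only demonstrates that each call to the clock function applies the six sub-cycle stages in order and that a packet advances by at most one stage per cycle.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs for the six sub-cycle stages described above; the
 * function names are illustrative and are not the actual HMC-Sim API. */
static void process_child_xbar_requests(void) { /* stage 1 */ }
static void process_root_xbar_requests(void)  { /* stage 2 */ }
static void detect_bank_conflicts(void)       { /* stage 3: trace only */ }
static void process_vault_requests(void)      { /* stage 4: reads/writes/atomics */ }
static void register_xbar_responses(void)     { /* stage 5: roots, then children */ }

/* One simulated clock cycle: each packet advances at most one stage. */
static void hmc_cycle(uint64_t *clk) {
  process_child_xbar_requests();
  process_root_xbar_requests();
  detect_bank_conflicts();
  process_vault_requests();
  register_xbar_responses();
  (*clk)++;                      /* stage 6: update the internal clock */
}

int main(void) {
  uint64_t clk = 0;
  for (int i = 0; i < 4; i++) hmc_cycle(&clk);
  printf("simulated %llu cycles\n", (unsigned long long)clk);
  return 0;
}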

6.4.7 HMC-Sim Application Programming Interface

The HMC-Sim infrastructure is implemented in ANSI-style C and packaged as a single library object. It can be easily utilized in existing infrastructures without modification. The API consists of four major function classes: device initialization, topology initialization, packet handlers and register interface functions. Figure 6.5 provides a sample API calling sequence in C.
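The sketch below outlines such a calling sequence across the four API classes. The prototypes are approximations written for illustration; they are not the exact HMC-Sim function signatures, which are documented in the online API reference [5].

#include <stdint.h>

/* Illustrative prototypes only: these mirror the four HMC-Sim API classes
 * (device initialization, topology initialization, packet handlers and
 * register interface) but are NOT the exact library signatures. */
struct hmcsim_t;
extern int hmcsim_init(struct hmcsim_t *sim, uint32_t devs, uint32_t links,
                       uint32_t vaults, uint32_t banks, uint32_t capacity_gb,
                       uint32_t xbar_depth, uint32_t vault_depth);
extern int hmcsim_link_config(struct hmcsim_t *sim, uint32_t src_cub,
                              uint32_t dest_cub, uint32_t src_link,
                              uint32_t dest_link);
extern int hmcsim_send(struct hmcsim_t *sim, uint64_t *packet);
extern int hmcsim_recv(struct hmcsim_t *sim, uint32_t dev, uint32_t link,
                       uint64_t *packet);
extern int hmcsim_clock(struct hmcsim_t *sim);
extern int hmcsim_free(struct hmcsim_t *sim);

/* Typical calling sequence: initialize the device, wire the host-to-device
 * topology, then alternate sends, clock ticks and receive polls. */
void example_sequence(struct hmcsim_t *sim, uint64_t *rqst, uint64_t *rsp) {
  const uint32_t num_devs = 1;
  const uint32_t host_cub = num_devs + 1;   /* host cube IDs exceed the device count */

  hmcsim_init(sim, num_devs, 4, 16, 8, 2, 64, 128);
  hmcsim_link_config(sim, host_cub, 0, 0, 0);  /* host link 0 -> device 0, link 0 */

  hmcsim_send(sim, rqst);                      /* enqueue a pre-formed request packet */
  do {
    hmcsim_clock(sim);                         /* advance the device one clock cycle  */
  } while (hmcsim_recv(sim, 0, 0, rsp) != 0);  /* nonzero assumed to mean no packet   */

  hmcsim_free(sim);
}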

6.4.7.1 Device and API Initialization

The HMC-Sim API interface provides a single master initialization function. This function provides the user or parent application the ability to initialize one or more simulated HMC devices as well as instantiate the devices into a reset state. This initialization function requires the user to specify the physical details associated with one or more target HMC devices.

Figure 6.5: HMC-Sim sample API sequencing

Currently, HMC-Sim requires that the devices within a single object be physically homogeneous. In this manner, each device will be initially configured and reset to an identical state. The complete set of HMC-Sim API function descriptions can be found in the HMC-Sim online documentation [5].

6.4.7.2 Link Topology and Initialization

Given the chaining ability present in the HMC device links, the HMC-Sim infrastructure provides the ability for the user to define the memory topology between multiple devices. This includes the ability to define the host-to-device relationships as well as the device-to-device relationships.

There are, however, several constraints induced by the simulation infrastructure. First, devices that link to one another must exist within the same HMC-Sim object structure. The infrastructure does not currently support linking devices and their inherent queuing structures between disparate HMC-Sim objects. Next, the infrastructure does not permit users to configure links as loopbacks. Loopbacks in the link topology have a high probability of inducing zombie response requests that never reach a reasonable destination. Finally, the user must configure at least one device that connects to a host link. Otherwise, the host will have no access to main memory.

6.4.7.3 Request and Response Packet Handlers

The HMC-Sim API interface provides two sets of interface functions to assist with packet request and response handling. The first set of functions provides users or host applications the ability to execute simple send or receive operations of request or response packets, respectively. These send and receive operations interact with the request and response queues of the target HMC devices directly. The send function requires the application to have a preformatted, fully formed, compliant packet structure. This packet structure determines the candidate link and device combination as well as the desired memory transaction. The receive operation requires the user to specify the HMC device and link combination to poll for candidate response packets. Response packets are delivered as fully formed, compliant packet structures and may arrive out of order. It is up to the calling application to decode and correlate the response packet information to the correct memory transaction request. The API provides two functions to assist with encoding and decoding request and response packets, respectively.

6.4.7.4 Modifying and Querying HMC Device Registers

The current HMC device specification provides two interfaces for hosts to query and modify device registers. The packet interface utilizes MODE_READ and MODE_WRITE packet types to read and write device registers using the attached memory links. The benefit to this interface is its familiar packet structure and the ability to query or modify registers on devices that are chained or not directly connected to the host. These packet types will route to the destination cube ID as would any other packet type. However, the downside to this method is the use of available memory bandwidth. HMC-Sim supports the use of MODE_READ and MODE_WRITE packets to query and modify register state, but warns that performing these operations may have negative performance implications.


The second access method provided in the current specification is via a Joint Test Action Group IEEE 1149.1 (JTAG) or Inter-Integrated Circuit (I2C) bus infrastructure. The benefit to this access method is the side-band nature of the bus. It does not interrupt main memory traffic to and from the HMC devices. HMC-Sim provides a rudimentary function interface to both read and write device registers using this JTAG interface. This interface exists external to the normal HMC-Sim notion of clock domains. As such, this interface appears to exist out of band to the user or application simulation infrastructure.

6.4.8 Simulation Overview

6.4.8.1 Simulation Description

We selected two candidate benchmarks from the HPC Challenge [108] benchmark suite in order to elicit sample results using disparate memory access patterns. First, we chose the HPCC StreamTriad [111] benchmark that measures sustainable memory bandwidth using unit stride memory read and write traffic. The StreamTriad benchmark variant consists of a simple loop structure that naively performs two unit stride memory loads, one scalar load, one addition, one multiplication and one unit stride memory store for each iteration of the loop (Equation 6.1). Often, architectural or compiler optimizations hoist the scalar load outside the loop and, if possible, combine the addition and multiplication operation into a fused multiply-add. In our test code, we induce the hoisted scalar operation but do not fuse the arithmetic operations.

a(i) = b(i) + q ∗ c(i) (6.1)
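A minimal C sketch of the StreamTriad kernel described above follows; whether the multiply and add are fused into a single operation is left to the compiler and target ISA, and our test code keeps them separate.

#include <stddef.h>

/* StreamTriad: two unit-stride loads (b, c), one scalar load (q, hoisted
 * outside the loop by the compiler), one multiply, one add and one
 * unit-stride store per iteration, matching Equation 6.1. */
void stream_triad(double *a, const double *b, const double *c,
                  double q, size_t n) {
  for (size_t i = 0; i < n; i++) {
    a[i] = b[i] + q * c[i];
  }
}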

The second benchmark mirrors that of the HPCC RandomAccess benchmark [108], commonly referred to as GUPS. The term GUPS refers to the measurement of Giga UPdates per Second, or the total number of memory locations that can be updated per second divided by one billion (10^9). Subsequent memory locations have little in common. Each update operation consists of a full read-modify-write of a 64 bit value from a table whose size is a power of two that is less than or equal to half of the total physical memory. As such, the random access nature of the memory operations induces little cache reuse. In our test for HMC-Sim, we use the recurrence in Equation 6.2 to represent the algorithm and a polynomial (POLY) of 0x0000000000000007UL.


ran(j) = (ran(j) << 1) ^ (ran(j) < 0 ? POLY : 0)    (6.2)
Table[ran(j) & (TableSize − 1)] ^= ran(j)
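Expressed as a C sketch, the update loop corresponding to Equation 6.2 is shown below; each iteration performs an independent 64-bit read-modify-write against a power-of-two sized table, which is what defeats cache reuse.

#include <stdint.h>

#define POLY 0x0000000000000007ULL

/* Advance the pseudo-random sequence: the high-bit test is equivalent to
 * testing ran(j) < 0 on a signed 64-bit value, as written in Equation 6.2. */
static inline uint64_t next_ran(uint64_t ran) {
  return (ran << 1) ^ ((ran & 0x8000000000000000ULL) ? POLY : 0);
}

/* table_size must be a power of two so the AND acts as a cheap modulus. */
void random_access(uint64_t *table, uint64_t table_size,
                   uint64_t ran, uint64_t updates) {
  for (uint64_t i = 0; i < updates; i++) {
    ran = next_ran(ran);
    table[ran & (table_size - 1)] ^= ran;  /* 64-bit read-modify-write */
  }
}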

We initialize the HMC-Sim infrastructure to simulate each of the 4-Link-2GB, 4-Link-4GB, 8-Link-4GB and 8-Link-8GB device configurations. For each configuration, we initialize the device with 64 queue slots on each link and 128 queue slots on each vault's crossbar path. Each device was also configured to accept a maximum packet request size of 64 bytes, or a typical cache line. All simulations were executed using a theoretical host configuration consisting of 64 active threads, each with a SIMD width of 8. As such, the maximum degree of parallelism for memory or arithmetic operations on each cycle is 512. The StreamTriad tests were executed for 33,554,432 iterations of the algorithm parallelized across the aforementioned host resources. The RandomAccess tests were executed for 524,288 iterations of the algorithm parallelized across the aforementioned host resources.

6.4.9 Simulation Analysis

The graphical results provided in Figures 6.6-6.9 and Figures 6.10-6.13 are direct representations of five pertinent values from each of the respective device configurations. Each graph depicts the total number of cycles required to perform the respective algorithm on the target device. The graph projects the number of bank conflicts, memory read requests and memory write requests that occurred within each vault at each respective cycle. The graph also plots the number of crossbar stalls observed within the device (XBAR Request Stall) and the number of events raised due to the potential for routing latency across the device (XBAR Latency). The crossbar request stalls occur when a request cannot be routed from a crossbar arbiter to the target vault due to inadequate open vault queue slots. The latency penalties are raised when a request is received on a link that is not co-located with the destination quadrant and vault.

6.4.9.1 Stream Simulation Analysis

From the data we collected in the StreamTriad tests (Table 6.3), we observe several relevant phenomena. First, we observe that the performance scales (in clock cycles) primarily


Table 6.3: Stream Runtime in Clock Cycles

StreamTriad Device Configuration    Simulated Runtime in Cycles
4-Link; 8-Bank; 2GB                 9,324,667
4-Link; 16-Bank; 4GB                6,520,763
8-Link; 8-Bank; 4GB                 5,263,417
8-Link; 16-Bank; 8GB                3,285,572

as a function of the number of links as opposed to the number of internal banks. Given the benchmark's aggressive use of stride-1 bandwidth, this indicates that a host processor can double its unit stride bandwidth by doubling the number of connected HMC links. In our tests, the 8-Link-8-Bank devices provide a 1.77X speedup over the 4-Link-8-Bank devices. However, if a host processor or system architecture were to migrate to a device with the same number of links and double the number of internal banks (presumably for better request distribution), the speedup is smaller. Migrating from a 4-Link device with 8 banks to a 4-Link device with 16 banks yields a 1.42X speedup. The same relationship for 8-Link devices yields a 1.6X speedup.
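These speedups follow directly from the cycle counts in Table 6.3; for example, the link-doubling case quoted above is computed as:

\[
  S = \frac{t_{\text{4-Link, 8-Bank}}}{t_{\text{8-Link, 8-Bank}}}
    = \frac{9{,}324{,}667}{5{,}263{,}417} \approx 1.77
\]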

6.4.9.2 RandomAccess Simulation Analysis

We also observe several interesting phenomena from the data we collected in the RandomAccess tests (Table 6.4). Given the highly random nature of the workload, we observe that varying the device configuration has very little performance benefit. The timing between the highest and lowest performing device configurations only deviates by 0.845%. Each device requires roughly 6.1M clock cycles to complete the entire algorithm. Varying the link connectivity delivers a 1.006X speedup on devices with 8 banks and a 1.008X speedup on complementary devices with 16 banks. Relying on the same link connectivity and varying the bank configuration yields a 1.0005X speedup for 4-Link devices and a 1.003X speedup for 8-Link devices.

We can also observe that the average number of crossbar request stalls on each device configuration is roughly equivalent after the initial algorithm warmup period that consists of loading several scalar values. Despite a higher incidence of bank conflicts on the 16-bank device configurations, these crossbar stalls are likely the source of the application bottleneck. We are simply observing that the devices become saturated with outstanding requests. A potential solution would be to increase the internal crossbar and vault queue depths, which is roughly analogous to increasing the weakly-ordered nature of the device.


Figure 6.6: HPCC StreamTriad results for HMC 4link-2GB device

Table 6.4: RandomAccess Runtime in Clock Cycles

RandomAccess Device Configuration    Simulated Runtime in Cycles
4-Link; 8-Bank; 2GB                  6,196,451
4-Link; 16-Bank; 4GB                 6,193,288
8-Link; 8-Bank; 4GB                  6,160,469
8-Link; 16-Bank; 8GB                 6,144,087


Figure 6.7: HPCC StreamTriad results for HMC 4link-4GB device


Figure 6.8: HPCC StreamTriad results for HMC 8link-4GB device


Figure 6.9: HPCC StreamTriad results for HMC 8link-8GB device


Figure 6.10: HPCC RandomAccess results for HMC 4link-2GB device


Figure 6.11: HPCC RandomAccess results for HMC 4link-4GB device


Figure 6.12: HPCC RandomAccess results for HMC 8link-4GB device


Figure 6.13: HPCC RandomAccess results for HMC 8link-8GB device


CHAPTER 7 PERFORMANCE EVALUATION

7.1 Evaluation Methodology

In order to quantify the efficacy of the GoblinCore-64 architecture and methodology, we utilize the RISC-V Spike golden model simulator as the basis for our evaluation platform [11]. The Spike simulation infrastructure provides a cycle-based, functionally accurate model of the RISC-V core and its associated peripheral units. The Spike model is a convenient infrastructure to prototype and evaluate both point application cases as well as full operational systems. The Spike model also provides the ability to boot and manage live operating system images such as Linux [12] and FreeBSD [9]. Our simulation environment consists of a Linux 4.1.17 kernel compiled for the RISC-V IMAFD ISA in SMP mode (Figure 7.1). In this manner, our evaluations are executed in an environment that directly mimics what users would expect on a live system. Executing point applications outside of the purview of an operating system fails to accurately model the application in the face of operating system overhead such as system calls, I/O events and system processes that utilize CPU cycles and subsequently generate memory traffic. The simulation environment also includes a prebuilt disk image with system tools, system libraries and a compiler toolchain based upon the RISC-V GCC 5.3.0 port [10]. In this manner, we may operate our evaluation platform just as a normal system, including compiling the benchmark applications in-situ.

In order to sufficiently evaluate each of the respective GC64 micro and system architecture features, we modify the Spike simulation infrastructure with two main goals. First, the modifications to the Spike simulator add the necessary logic to model the GoblinCore micro architecture and system architectural extensions. For the memory system evaluation, we directly couple the DMC microcode to the Spike MMU implementation in order to immediately inject memory requests into the coalescing engine [100]. By coupling the microcode directly to the MMU, the memory requests injected into the DMC unit are representative of the application workload memory requests and any potential system or kernel pressure such as instruction cache fills, I/O, system calls or other system-related memory requests induced by the respective workload. For the thread concurrency evaluation, we modify the core instruction and ALU scheduler within Spike in order to implement the


necessary table-based context switching methodology. This implementation permits us to change the maximum threshold by which a thread or task is permitted to execute at runtime and thus tune the GoblinCore implementation. Finally, we add the necessary logic to mimic the register state and instruction extensions present in GC64. This enables us to manipulate the necessary register state and model the machine hierarchy sufficient to provide an accurate simulation environment.

Figure 7.1: Simulation Environment

Second, for each of the aforementioned modifications, we also add the ability to trace the intricate pipeline stages, memory request packets and micro architecture features at a per cycle or sub-cycle granularity. Given that we are executing a live operating system environment, we utilize named pipes for each of the tracing methods. When enabled, the Spike simulator opens a named pipe in the host system. The Spike simulator then pauses for an external application to attach and begin reading from the named pipe. When the tracing is complete, Spike signals the external trace collector that it is complete and finalizes the named pipe. Given the global scope of the Spike simulator, we also provide a simple mechanism by which to restrict trace recording to the execution of specific applications or workloads. We do so by marking special enable and disable addresses in an unmapped physical address range present in the MMU of the Spike simulator.


When a memory read is requested from the enable address, the simulator is triggered to open the respective named pipe and begin injecting trace data into the pipe for external collection. When the disable address is requested by a memory read, the named pipe is closed and the trace data is recorded to an external file or the terminal by the external trace data collection application.
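The guest-side half of this mechanism reduces to two dummy reads, as in the sketch below. The two addresses are hypothetical placeholders; the actual unmapped physical addresses recognized by our modified Spike MMU are an implementation detail of the simulator build, and this code is only meaningful when run inside the simulated environment.

#include <stdint.h>

/* Hypothetical trap addresses; the real values are chosen outside the
 * Linux kernel's virtual memory map and recognized by the modified MMU. */
#define TRACE_ENABLE_ADDR  ((volatile uint64_t *)0xFFFF000000000000ULL)
#define TRACE_DISABLE_ADDR ((volatile uint64_t *)0xFFFF000000000008ULL)

static inline void gc64_trace_enable(void)  { (void)*TRACE_ENABLE_ADDR;  }
static inline void gc64_trace_disable(void) { (void)*TRACE_DISABLE_ADDR; }

/* Bracket only the region of interest so the named-pipe trace excludes
 * unrelated operating system activity. */
void run_traced_benchmark(void (*kernel)(void)) {
  gc64_trace_enable();   /* read triggers the simulator to open the pipe  */
  kernel();              /* benchmark region captured in the trace        */
  gc64_trace_disable();  /* read closes the pipe and finalizes the trace  */
}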

Figure 7.2: Simulator External Trace Mechanisms

For our evaluation, we utilize a mixture of application workloads and benchmarks that are historically known to make use of dense memory operations and sparse memory operations. Our representative applications also span a variety of traditional scientific (physical) simulation, biological simulation, core numerical solvers and analytics workloads. All of the workloads are compiled with optimization using the RISC-V GCC 5.3.0 toolchain. We classify our workloads into four categories: BOTS, GAPBS, NASPB and MISC. The first set of workloads is built using the Barcelona OpenMP Tasks Suite, or BOTS [61]. BOTS utilizes the tasking features from the OpenMP [15] shared memory programming model to demonstrate a series of pathological workloads from a variety of different scientific disciplines. The individual BOTS workloads are identified as follows:


• Alignment: The Alignment benchmark aligns a sequence of proteins using a dynamic programming methodology. This benchmark implementation utilizes the Meyers and Miller [119] algorithm to align all protein sequences from an input set against every other sequence.

• FFT: The FFT benchmark computes a Fast Fourier Transform using a recursive complex method based upon the traditional Cooley-Tukey algorithm [70] in a single dimension (1D). Each of the subset discrete Fourier transforms is divided and subsequent OpenMP tasks are spawned for the individual solvers.

• Fib: The Fib benchmark computes the n'th number in the Fibonacci sequence using a recursive parallelization strategy. While this benchmark is not directly applicable to our study of data intensive computing algorithms, it does represent a pathological algorithmic model with deep tree structures found in many graph analytical models.

• Health: The Health benchmark simulates a country health system using an agent-based methodology. The original system is based upon a study done on the Colombian Health Care System [52] [53]. The benchmark contains multilevel lists where the element structure contains the data for a village with its associated patients and a single hospital. Each hospital has a doubly linked list that contains all the patient interaction data. Patients are simulated to interact with one another at each time step and, subsequently, infect one another. This benchmark is somewhat representative of small-scale streaming analytics problems where only a portion of the interactions can be predicted using standard heuristics.

• Sort: The Sort benchmark utilizes a mixture of different approaches to sort a vector of integers. The parallel sorting methods utilized [19] recursively sort halves of the vector or sub-vectors and merge the resulting sorted vectors together using a divide-and-conquer parallel strategy.

• SparseLU: The SparseLU benchmark performs an LU factorization of a sparse matrix. A top-level matrix contains pointers to smaller sub-matrices wherein, due to the sparseness of the matrices, a tremendous degree of imbalance exists. The benchmark authors utilize task parallelism with dynamic scheduling in order to reduce the imbalance of the workload [28]. In each phase, a new task is created for each non-empty sub-matrix.

• Strassen: The Strassen benchmark performs a Strassen-style multiplication of two dense, square matrices [66]. The top-level strategy for parallelizing the operation divides each dimension of the matrix into two sections of equal size. For each decomposition step, a new task is created. While the GoblinCore-64 research and development is not specifically focused on dense computational algorithms, eliciting positive performance on such a benchmark further strengthens the design goals of the GC64 micro architecture.

The second set of workloads is provided by the GAP Benchmark Suite [35]. This suite of benchmarks is designed to provide a standardized set of workloads written in C++11 and OpenMP to evaluate a target platform's efficacy on large scale graph processing. Each of the workloads executed in the GAP suite is directed to create a Kronecker graph with 2^20 vertices. The GAP suite is also a specification of algorithmic constructs such that our future research can further optimize the implementations presented here. The individual GAPBS workloads are summarized as follows:

• BC: The Betweenness Centrality (BC) benchmark performs an approximation of the classic centrality measurement for all vertices in the graph. It does so by only computing the shortest path from a subset of all the graph vertices and approximating the remainder of the results.

• BFS: The Breadth-First Search (BFS) benchmark performs a traversal of all connected vertices in the graph by first visiting all vertices at the current depth before moving to vertices at subsequent depths. In this manner, the BFS approach may have multiple correct solutions depending upon the order in which vertices at subsequent depths are visited. As such, the GAP suite defines a correct solution as follows:

– parent[source] = source
– parent[x] = -1 if x is unreachable from source
– if x is reachable and parent[x] = y, there exists an edge from y to x
– if x is reachable and parent[x] = y, depth[x] = depth[y] + 1


• CC: The Connected Components (CC) benchmark labels all vertices within the same connected component with identical, unique labels. In this manner, the connected components labeled in this approach are weakly connected components. The GAP suite defines the correctness of the Connected Components results as:

– vertices u and v have the same component label if and only if there exists an undirected path between u and v

• PR: The PageRank (PR) [39] benchmark computes the PageRank scores for all vertices in the graph. The GAP suite defines a standard PageRank scoring system with a damping factor of 0.85. Based upon the iterative solution present in PageRank, the GAP suite defines an answer to be correct if a single additional iteration would change all the scores by less than 10^-4.

• SSSP: The Single-Source Shortest Path (SSSP) benchmark computes the distance to all connected vertices from a given source vertex. The distance is calculated by using the minimum weight path between two vertices via a traversal. In this definition, all edge weights are positive, which induces a state where multiple shortest paths may exist, but all contain the same effective distance. The correct solution for SSSP is defined as follows:

– distance[source] = 0
– distance[x] = INFINITY if x is unreachable from the source
– if x is reachable from the source, there is no path from the source to x with a combined weight less than distance[x]

• TC: The Triangle Counting (TC) benchmark finds the number of triangles in an undirected graph, where a triangle is a clique of three vertices that are all directly connected to one another.

The third set of workloads is provided by the NAS Parallel Benchmarks, or NASPB [30] [82]. The original version of the NAS benchmark suite was developed to facilitate the development of parallel computing capabilities in support of NASA's Numerical Aerodynamic Simulation (NAS) program aerospace research and development. The OpenMP-C version of the benchmark source is utilized for our tests. We make use of the A problem size across all kernels and pseudo-applications. While this problem size is of a smaller scale than generally utilized to evaluate full-scale system deployments, the A size problems constitute the largest memory footprint reliably available in our simulation infrastructure. The individual NASPB workloads are summarized as follows:

• CG: The Conjugate Gradient (CG) kernel is designed to compute an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The sparse matrix used as input is randomly seeded with a pattern of nonzeros.

• EP: The Embarrassingly Parallel (EP) kernel is designed to provide the rational upper limit for floating point performance. It does so by generating pairs of Gaussian random deviates according to a very specific numerical schema. This style of pseudo-random floating point numerics is often observed in codes that utilize Monte Carlo methods. Further, it exhibits floating point performance with a very low requirement for communication.

• FT: The Fourier Transform (FT) kernel solves a three dimensional partial differential equation using forward and inverse FFTs. This micro benchmark is designed to stress communication performance between processing elements. Three dimensional FFTs, often utilized in CFD applications, exhibit periods of intense communication when performing operations such as array transpositions.

• MG: The Multigrid (MG) kernel is a simplified multigrid kernel designed to exhibit short and long distance data communication between processing elements. The benchmark executes four iterations of the V-cycle multigrid algorithm. The algorithm is utilized to realize an approximate solution u to the discrete Poisson problem described below on a 256x256x256 grid with periodic boundary conditions. The Poisson problem is described as follows:

∇²u = v

• IS: The Integer Sorting (IS) kernel sorts a large array of randomly generated integer values in order to simulate random memory access patterns across uniformly distributed arrays. For shared memory platforms, the key array must be evenly distributed in a single address space. For distributed or partitioned memory systems, such as GC64, the key array must be partitioned into N partitions of equal size, where the resulting sorted sub-arrays are merged together.

• LU: The LU pseudo-application provides the basis for a lower-upper Gauss-Seidel solver. The goal of this benchmark is to simulate a full-scale CFD application that utilizes symmetric successive over-relaxation (SSOR) to solve a seven block diagonal system. The block diagonal system is the result of a finite-difference discretization of the Navier-Stokes equations in three dimensions by splitting the solution into block lower and upper triangular systems.

• SP: The SP pseudo-application provides the basis for a scalar penta-diagonal solver. This micro benchmark also simulates a CFD application as is the case in the LU benchmark. However, the finite differences solution here is based upon the Beam-Warming [33] approximate factorization that decouples the x, y, and z dimensions of the three dimensional space. The resulting systems are bands of scalar penta-diagonal linear equations that are solved sequentially for each respective dimension.

The final set of workloads is a set of applications or benchmarks from several unique sources.

• HPCG: The High Performance Conjugate Gradient benchmark [58] [59] includes solvers for sparse matrix-vector multiplication, sparse triangular systems, vector updates, global dot products and symmetric Gauss-Seidel smoothing. HPCG is designed to measure the relative system architecture performance using algorithms and methods that are commonly found in large-scale scientific applications.

• Lulesh: The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics, or LULESH, proxy application is designed to represent traditional hydrodynamics applications utilized in large-scale simulation [7] [154]. It solves only a simple Sedov blast problem [154] as opposed to a complete hydrodynamic solution.

• Stream: The Stream test executes the classic STREAM memory bandwidth benchmarks [111] using the default parameters. The STREAM benchmarks attempt to measure the sustained memory bandwidth of a given platform given a set of corresponding, simple computational kernels. The vector compute kernels utilized in STREAM are summarized as follows:


– Copy: a(i) = b(i)
– Scale: a(i) = q*b(i)
– Sum: a(i) = b(i) + c(i)
– Triad: a(i) = b(i) + q*c(i)

• Matmul: The matmul test performs a dense matrix multiplication using 512x512 square matrices populated with 64-bit integer values. The goal of this test is to elicit the performance using naive, non block-oriented solvers. The solver utilizes statically allocated, two-dimensional arrays to host the core storage of the source matrices. As a result, on most system software platforms the arrays are arranged as purely linear structures, prompting reasonable memory and cache reuse.

7.2 Evaluation of Results

7.2.1 Dynamic Memory Request Coalescing

7.2.1.1 Evaluation Metrics

Following the execution of each of the aforementioned tests, we gathered the individual result sets for each thread count and generated a series of resulting graphs based upon the average read reduction, write reduction and total reduction of memory requests across all thread counts. The average read reduction is defined as the difference between the number of ingested raw memory read requests compared with the number of outgoing HMC read requests, expressed as a percentage of the total incoming memory read requests. For example, if we observe 10,000 raw memory read requests and generate 5,000 HMC memory read requests, our read request reduction would be expressed as 50%. Similarly, we record the same metrics for memory write requests and the total memory requests (combined reads and writes).

Exhibiting satisfactory request coalescing averaged across all thread counts would indicate that our approach scales well with varying degrees of parallelism while avoiding preferential performance variance for small or large thread counts. Further, observing satisfactory average results across all thread counts would indicate that our approach is also not bound by a pathological degree of parallelism and would have a higher probability of performing well across disparate applications outside of our test matrix.
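The reduction metric itself is straightforward; a small helper matching the definition above (and its 10,000-to-5,000 example) could be written as:

#include <stdint.h>

/* Percentage of raw incoming requests eliminated by coalescing.  With
 * 10,000 raw reads coalesced into 5,000 HMC reads this returns 50.0. */
static double request_reduction(uint64_t raw_requests, uint64_t hmc_requests) {
  if (raw_requests == 0) return 0.0;
  return 100.0 * (double)(raw_requests - hmc_requests) / (double)raw_requests;
}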


Figure 7.3: Average Request Reduction

7.2.1.2 Average Request Reduction

As we see in Figure 7.3, we achieved very respectable average performance for nearly all of the applications. Across all the applications and thread counts tested, we measured an overall average memory request reduction of 49.74%. This implies that our proposed approach coalesces incoming memory requests and dispatches half the number of equivalent packetized HMC requests as would normally be dispatched for a traditional DDR-style interface. The resulting coalesced stream of memory requests significantly reduces the overall control overhead, thus delivering more optimal bandwidth utilization for the target HMC memory devices. Upon further inspection, we find that the best average performing application was the protein alignment benchmark (bots_alignment) from the BOTS test suite. This application

generated an average request reduction of 77.81% across all thread counts. The alignment application also recorded the highest average read reduction, with a recorded average of 78.21%.

Given that write requests cannot be coalesced with intermediate addressing gaps, we expected our write coalescing performance to be less than the complementary read performance. As shown in Figure 7.3, nearly all the applications measured elicited higher read performance over average write performance. The best average write performance measured was for the SparseLU solver from the BOTS test suite (bots_sparselu). Given the structure of this application, the read requests are sparse where the write requests are naturally linear in nature. As such, the read requests are inherently more difficult to coalesce as opposed to the naturally linear write patterns. This also appears to be true in the results from the HPCG benchmark and the dense matrix multiplication tests (matmul), albeit not as pronounced as the SparseLU results.

The worst performing test in the average result matrix is the dense matrix multiplication test. We deliberately avoided common blocking optimizations with this implementation in order to elicit results based upon significantly sub-optimal code in a common solver. The results are quite dramatic. We find that the matmul benchmark only averages 12.74% coalesced requests across reads and writes. As mentioned above, the write request reduction of matmul is significantly better, nearly 2X, than the read performance.

If we consider the naive approach to solving a row-major, dense matrix multiplication, subsequent memory requests will often occur in non-contiguous cache lines. In the case of a 512x512 square matrix solver, subsequent read requests will occur on 4096 byte, or traditional Linux page, boundaries. As expected, this is where our memory coalescing approach does not provide a significant improvement in request reduction. The resulting stream of memory requests exceeds the timeout and the address distance gap for a significant portion of the read traffic for both input matrices. The write traffic to the resultant matrix can be somewhat coalesced, but still fails to deliver sufficient spatial or temporal locality. With this, we make the determination that traditional algorithmic techniques to exploit spatial and/or temporal locality in traditional memory hierarchies, such as data caches, will assist, as opposed to collide, with our dynamic memory coalescing methodology.
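The access pattern responsible for this behavior can be seen in a minimal sketch of the naive solver, assuming the statically allocated, row-major 512x512 layout described above.

#include <stdint.h>

#define N 512

/* Naive, non-blocked row-major matrix multiply.  In the inner loop over k,
 * successive reads of b[k][j] are separated by a full row of
 * N * sizeof(int64_t) = 4096 bytes, so consecutive read requests land on
 * different 4 KB regions and defeat window-based coalescing; only the
 * writes to c[i][j] remain naturally linear. */
void matmul_naive(int64_t c[N][N], int64_t a[N][N], int64_t b[N][N]) {
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
      int64_t sum = 0;
      for (int k = 0; k < N; k++) {
        sum += a[i][k] * b[k][j];   /* b[k][j]: 4096-byte stride in k */
      }
      c[i][j] = sum;
    }
  }
}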


Figure 7.4: Scaled Read Performance

7.2.1.3 Scaled Performance Results

In addition to the averaged results, we also record results scaled by thread count. Figures 7.5, 7.6 and 7.7 represent the scaled read, write and average performance, respectively. As we can see in each of the graphs, many of the applications exhibit nearly flat request coalescing across all degrees of concurrency. However, there are several interesting outlying values. First, despite the fact that the bots_alignment benchmark exhibits the best average read performance across all thread counts, it does not exhibit the best overall read performance. The bots_fft benchmark actually exhibits the best overall read performance with 16 parallel threads, where the DMC engine reduces the read requests by 84.67%. This is likely due to encountering an optimal point of parallel memory requests for power of two accesses, inducing an ideal stream of read requests [159].


Figure 7.5: Scaled Read Performance

Unlike the scaled read requests, the scaled write requests exhibit the same peak performance as is exhibited in the average results. The bots_sparselu benchmark exhibits the best overall write coalescing performance with 16 threads, reducing the write requests by 84.68%. This is a reduction of write requests from 205,535,811 raw requests to 12,050,300 HMC requests. Much like the scaled read requests, we also see that the bots_fft benchmark exhibits significantly better performance when utilizing 16 threads. We also see that GAPBS betweenness centrality (gapbs_bc) and single source shortest path (gapbs_sssp) benchmarks exhibit slightly higher results for 8 and 4 threads, respectively. While these outliers may be an algorithmic idiosyncrasy, these results are likely due to the inherently random nature of the input data that is generated for each test run of all the GAPBS benchmarks.


Figure 7.6: Scaled Write Performance

7.2.1.4 Application Analysis

Finally, we analyze the individual read and write request distribution of the scaled HPCG results for both the raw incoming requests and the resultant outgoing HMC requests. We display these results in Figures 7.8 and 7.9. In the case of the read requests, the vast majority of the raw reads recorded by our benchmarks fall in the category of 8 byte reads, followed closely by 2 byte reads. We also see that the majority of coalesced requests fall in the category of 16 byte HMC requests followed by 32 byte requests. However, of the 168,311,970 outgoing HMC requests in the 32 thread HPCG test, 5,489,508 of the requests are HMC read packets for 64 byte cache lines or larger. This represents a significant improvement over the traditional 16 byte DDR request patterns found in traditional memory subsystems.


Figure 7.7: Scaled Average Performance

This is also the case in the results from the write performance exhibited by the HPCG scaled tests. For the 10,272,141 outgoing HMC packet requests, we find that 425,565 are resolved to packet sizes greater than or equal to a traditional 64 byte cache line. The resulting write packets significantly reduce the control overhead required to utilize the packetized HMC interface.

7.2.2 Thread Concurrency Mechanisms

7.2.2.1 Evaluation Metrics

In order to sufficiently evaluate our context switching methodology, we utilized the existing RISC-V Spike golden-model simulation environment as the basis for our exploration. The Spike simulator was modified to utilize the aforementioned scheduling algorithms in a


manner that permitted us to define the maximum schedule pressure threshold at simulator boot time. This provided the ability to boot a live Linux kernel, compile applications and manage the simulation environment as an actual hardware-based platform.

In order to avoid drastically reducing the performance of the simulation environment, we also built a unique tracing tool that collected all the tracing data from the executing simulation. This was achieved in two phases. First, we modified the Spike simulator memory management unit with two annotated memory addresses outside of the normal range of the Linux kernel's virtual memory map. Any time a read operation is performed to the first address, the MMU will trap the execution and open a named pipe in the host execution system. The named pipe is utilized to send tracing data to an external application executing on the host system.

Figure 7.8: HPCG Read Request Distribution

Figure 7.9: HPCG Write Request Distribution

Any time a read operation is performed to the second address, the

tracing is finalized and the named pipe is closed. Next, once the tracing is enabled, the Spike simulator initiates a data send for each executed context switch event to the tracing application. The simulator sends the physical task unit ID of the initiating (yielding) thread/task, the hardware clock tick in which it was initiated and the last opcode that forced the gcount value to overflow the designated maximum pressure. This symbiotic execution environment, as depicted in Figure 7.10, is designed to permit maximum performance during the execution of the simulated environment without losing any valuable trace data. Each of the simulations performed for our evaluation was achieved by building a full Linux 4.1.17 operating system environment that executed on our modified Spike simulator. Rather than executing individual benchmarks in a synthetic environment, we gathered data


in a fully functional operating system environment that mimics a full platform installation. In this manner, we trap all the context switch events necessary to service kernel threads, interrupts, I/O and other system-related functionality.

Figure 7.10: Simulation Environment

However, we provide two simple applications that perform the necessary memory read operations to the designated trap addresses as mentioned above. This allows us to enable the tracing, execute an application benchmark and disable the tracing while executing the live operating system environment. This further permits us to isolate the context switch traces to the execution of our benchmark applications.

Each of the benchmarks was executed using five configurations. The first configuration represents a base configuration with no hardware-managed context switching and one thread per core. We use this as our baseline metric for performance evaluation. We also executed four multithreaded configurations: two with two threads per core and two with eight threads per core. For each of the multi-threaded configurations, we tried two separate maximum instruction pressure threshold values, 64 and 128. The former (64) was derived by using a simple dyadic operation (a=b+c) where we have one arithmetic operation and two memory load operations. In this manner, a thread should theoretically be able to perform the minimum dyadic operation prior to context switching. Conversely, the 128 maximum threshold was designed to permit more expressive operations such as floating point arithmetic and triadic integer operations (a=b+c+d) prior to context switching. The larger maximum threshold may permit a single thread to perform more work at the expense of permitting more pipeline stalls. We summarize these configurations in Table 7.1.


Table 7.1: Evaluation Configurations

Name     Threads Per Core    Max Thread Pressure
base     1                   NA
2.64     2                   64
2.128    2                   128
8.64     8                   64
8.128    8                   128

For each of the aforementioned configurations, we record and compile a number of relevant data values. First, we record the context switch events for each thread and each respective opcode class. This data is later compiled to reflect the distribution of context switch events by opcode as well as the total number of context switch events for the respective configuration. In addition to this micro architecture-driven data, we also record the application runtime for each benchmark and each benchmark configuration. This data is utilized to evaluate the relative performance impact of each configuration.
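The two toggle applications referenced in the methodology above amount to a single read from each designated address surrounding the benchmark run. The following is a minimal user-space sketch under several assumptions: the trap addresses are placeholders, and mapping them through /dev/mem is only one plausible way to issue the read; the actual mechanism is internal to the modified simulator environment.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical trap addresses; the real values lie outside the kernel's
 * normal virtual memory map and are specific to the modified simulator. */
#define TRACE_START_ADDR 0xFFFF0000UL
#define TRACE_STOP_ADDR  0xFFFF0008UL

/* Issue a single read to the given address so the simulator MMU traps it
 * and toggles tracing. */
static void poke_trap(unsigned long addr)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return;
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0)
        return;
    unsigned long base = addr & ~((unsigned long)page - 1);
    void *m = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, (off_t)base);
    if (m != MAP_FAILED) {
        volatile uint8_t *p = (volatile uint8_t *)m + (addr - base);
        (void)*p;                      /* the read itself is the trigger */
        munmap(m, (size_t)page);
    }
    close(fd);
}

int main(int argc, char **argv)
{
    /* Invoke with "stop" after the benchmark; any other invocation starts tracing. */
    if (argc > 1 && strcmp(argv[1], "stop") == 0)
        poke_trap(TRACE_STOP_ADDR);    /* finalizes tracing, closes the named pipe */
    else
        poke_trap(TRACE_START_ADDR);   /* opens the named pipe, begins tracing */
    return 0;
}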

7.2.2.2 Context Switch Event Analysis

Our analysis of the resulting benchmark data consists of three major categories. First, we analyze the raw context switch data from each of the aforementioned configurations and respective benchmarks so that we may fully understand where and how the applications induce pressure on the execution pipelines. Next, we analyze the raw performance of each of the benchmarks and benchmark configurations and examine how each performs as we scale the thread concurrency and the maximum pressure threshold. Finally, using the scaled performance data, we analyze the peak and average speedup of each application and attempt to determine which configuration provides the best general performance against our pathological set of applications.

The first portion of our analysis focuses on understanding the nature of each benchmark's context switch pattern so that we may tune our methodology for a wider array of applications. Figure 7.11 presents the total number of context switch events for each benchmark and each respective configuration. As expected, as we increase the number of threads, the number of context switch events generally also increases. The only tests that do not follow this trend are the GAPBS BFS tests using 2 threads and the 64 and 128 maximum pressure thresholds, respectively. These two-threads-per-core tests actually exhibit a lower number of context switch events than the respective baseline configurations.


Figure 7.11: Total Context Switch Events

In addition, we can analyze the average, maximum and minimum context switch event statistics. Given the data in Table 7.2, we see that the average number of context switch events increases as we increase the degree of concurrent threads attached to a single core. When moving to two threads per core, we see an average 1.47X increase in the number of context switch events. When moving from the baseline to eight threads per core, we see an average increase of 7.86X, which is much closer to a linear increase.

Table 7.2: Context Switch Event Derivation

Config    Average          Max            Min
base      12684040.79      75130993       1635
2.64      18748401.25      151333664      70878
2.128     18542459.67      150339080      42030
8.64      89264272.17      911562752      60880
8.128     110529572        1042984913     78840


Figure 7.12: Baseline Opcode Distribution

Across all configurations and benchmarks, we see a total average increase in context switch events of 5.12X over the baseline values. Further, we see that the BOTS Fibonacci benchmark has the smallest number of events for all configurations. This directly corresponds to the fact that this benchmark exhibited the lowest runtime for all respective configurations. Conversely, the HPCG benchmark recorded the highest number of context switch events for the baseline and both two-threads-per-core tests, while the NASPB SP benchmark recorded the highest number of context switch events for both eight-threads-per-core tests.

The next stage of our analysis considers the distribution of the various opcode classes as a function of the total context switch events for each respective configuration and benchmark. Figure 7.12 presents the data for each benchmark when executed using the baseline configuration. As mentioned in Section 7.2.1.1, we compiled statistics based upon the opcode class that induced the respective context switch event. We can clearly see that for the baseline configuration, a large fraction of all context switch events is induced by integer arithmetic opcodes in the 0x13 and 0x33 classes. Given that these context switch events are driven by the operating system kernel, the baseline data cannot be used to derive any specific conclusions on the various evaluated workloads.
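For reference, the opcode classes reported in these histograms correspond to the major opcode field (bits 6:0) of each RISC-V instruction: 0x13 and 0x33 are the integer OP-IMM and OP groups, and 0x53 (discussed below) is the OP-FP group. A minimal classifier of the kind that could be used to bin the trace records is sketched here; the label strings and the exact binning policy are our own illustration, while the major opcode values follow the RISC-V base encoding.

#include <stdint.h>

/* Map a 32-bit RISC-V instruction to a human-readable opcode class for
 * the context switch histograms. Labels are illustrative; the major
 * opcode values follow the RISC-V base ISA encoding. */
static const char *opcode_class(uint32_t inst)
{
    switch (inst & 0x7f) {            /* major opcode field, bits 6:0 */
    case 0x03: return "LOAD";
    case 0x23: return "STORE";
    case 0x13: return "OP-IMM";       /* integer immediate arithmetic */
    case 0x33: return "OP";           /* integer register-register arithmetic */
    case 0x53: return "OP-FP";        /* FP arithmetic and int/FP conversions */
    case 0x63: return "BRANCH";
    case 0x6f: return "JAL";
    case 0x67: return "JALR";
    case 0x2f: return "AMO";          /* atomic memory operations */
    default:   return "OTHER";
    }
}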


Figure 7.13: Two Thread, 64 Max Pressure Opcode Distribution

Figure 7.13 and Figure 7.14 display the opcode context switch statistics for the two thread tests using 64 and 128 as the maximum pressure threshold, respectively. We find a number of interesting trends in the two thread results. First, both of the two thread result sets contain context switch statistics relatively similar to those of the baseline tests. We can therefore observe that, despite having a reasonable performance impact as noted below, utilizing two threads per core does not dramatically change the execution flow of the core. Much like the baseline statistics, we find that in many of the tests the largest share of context switches falls within one or more integer arithmetic opcode variants. We may theorize that this is due to the cores still executing with a significant number of pipeline stalls caused by latent memory requests not returning in time for basic integer arithmetic. The implication is that, for most tests, scaling the number of threads/tasks per core should yield a higher degree of pipeline utilization due to greater concurrent overlapping of memory requests and compute opcode classes.

We can also observe several interesting phenomena for the various benchmark applications.


Figure 7.14: Two Thread, 128 Max Pressure Opcode Distribution

Note the significant number of context switch events for the Lulesh, NASPB EP, NASPB FT and NASPB LU codes in the 0x53 opcode class. These opcodes represent conversion operations between various precisions of floating point and integer values. These operations represent a statistically significant portion of the context switch events and, potentially, of the overall runtime of the respective benchmark. Future examination of these applications may conclude that operating in a homogeneous precision would serve to improve the overall performance and concurrent throughput of the solvers.

Figure 7.15 and Figure 7.16 display the opcode context switch statistics for the eight thread tests using 64 and 128 as the maximum pressure threshold, respectively. These context switch statistics drastically depart from those recorded in the baseline and two thread tests. Most noticeably, only in the NASPB EP benchmark with a maximum pressure threshold of 64 do we measure a statistically significant number of context switches due to floating point conversion events (opcode 0x53). Despite the extensive use of floating point arithmetic across many of our target applications, we still see that the vast majority of context switch events are induced via integer-related opcodes. This may imply that as we scale concurrency per core, the overall effect of floating point pipeline latency is minimized to a statistically insignificant amount.
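As a concrete illustration of the conversion traffic behind the 0x53 events noted above, consider an accumulation loop that mixes integer and double operands. On RISC-V, the mixed version forces an integer-to-floating-point conversion (an OP-FP instruction such as fcvt.d.w) on every iteration, which the homogeneous version avoids. This fragment is hypothetical and is not taken from the evaluated benchmarks.

/* Mixed precision: each iteration converts the integer weight w[i] to
 * double (fcvt.d.w on RISC-V, major opcode 0x53) before the multiply-add. */
double dot_mixed(const double *x, const int *w, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * w[i];          /* implicit int -> double conversion */
    return acc;
}

/* Homogeneous precision: storing the weights as doubles removes the
 * per-iteration conversion and leaves only FP multiply-add traffic. */
double dot_homogeneous(const double *x, const double *w, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * w[i];
    return acc;
}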

Figure 7.15: Eight Thread, 64 Max Pressure Opcode Distribution

7.2.2.3 Context Switch Timing Analysis

In addition to the raw context switch event statistics, we can also analyze the effect of our approach on the actual application timing. Figure 7.17 presents the application timing data (in seconds) for the individual test configurations (x-axis). We see that for most of the benchmark applications, the runtime is generally reduced as we scale the number of concurrent threads per core. However, there are several outliers. First, the HPCG application dominates the chart with the highest runtime in four of the five test configurations. It does not scale well at two threads per core, but does improve at eight threads per core. Further, the NASPB SP benchmark displays rather interesting behavior. As we scale the number of threads per core to two and eight, respectively, the performance undulates over and under the baseline recorded timing. However, the eight thread test using 128 as the maximum pressure threshold is significantly worse.


Figure 7.16: Eight Thread, 128 Max Pressure Opcode Distribution

We see that the runtime grows from 200.63 seconds for the eight thread, 64 maximum pressure threshold configuration to 651.87 seconds for the final configuration. This represents a 3.2X increase in timing, and SP is the only benchmark that displays significantly worse performance using our method. Further investigation is required to determine whether this is a statistical outlier or a relevant pattern in application execution using our approach.

In addition to the aforementioned timing graph, we can also analyze the individual classes of benchmark in order to draw more relevant conclusions on classes of algorithmic constructs. Figure 7.18 presents the scaled runtime of all the GAPBS benchmark applications. In our GAPBS application runtime results, we can clearly see that our approach has a positive effect on the overall application performance as we scale the thread concurrency per core. For all these graph-related benchmarks, the performance of both two-threads-per-core configurations exceeds the baseline results. However, for the PageRank (PR), Single Source Shortest Path (SSSP) and Connected Components (CC) benchmarks, we see a slight knee in performance for the eight-threads-per-core results when using a maximum pressure threshold of 64.


Figure 7.17: Application Performance

Given the rather irregular data access patterns exhibited in traditional graph algorithms, the lower maximum pressure threshold is likely not permitting a sufficient number of outstanding memory requests, preventing the request latency from being overlapped with adjacent arithmetic operations. However, with the higher maximum pressure threshold of 128, the benchmark performance returns to a positive application speedup.

In addition, we can also evaluate the classes of applications present in the NASPB suite. Figure 7.19 presents the application runtime for all the NASPB benchmarks. As mentioned above, we see the obvious outlier in the SP benchmark results. However, the remainder of the NASPB applications exhibit positive application speedups when scaling the number of concurrent threads per core. The NASPB LU benchmark displays the best performance speedup with a 5.53X increase using the eight thread configuration with a maximum pressure threshold of 128. Of the maximum recorded speedups, the SP benchmark displays the lowest with a 1.05X speedup using the eight-threads-per-core configuration with 64 as the maximum pressure threshold. Across all the NASPB applications, we see an average maximum application speedup of 2.72X.


Figure 7.18: GAPBS Application Performance

7.2.2.4 Context Switch Application Performance Analysis

The final set of evaluations we make is to determine which configuration is ideal for each benchmark as well as the total speedup available. Figure 7.20 plots each respective benchmark and its best recorded speedup. We can quickly see that the best speedup observed during our tests came from the BOTS Fibonacci benchmark, with a recorded speedup of 14.6X over the baseline configuration using eight threads and a maximum thread pressure of 64 (8.64). Despite having the lowest runtime of any of the target applications, the Fibonacci benchmark exhibited the best performance speedup. This may be due, in part, to the relatively small algorithmic kernel in terms of the number of instructions. This provides excellent instruction cache locality, whereby only the initial kernel entry points pay a penalty to fill. As a result, the total parallel application throughput easily amortizes any potential latency due to instruction cache misses.

The second highest recorded speedup value was for the NASPB LU benchmark with a 5.53X speedup. This was recorded for eight threads using 128 as the maximum thread pressure.
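For clarity, the speedup values plotted in Figure 7.20 can be read as runtime ratios against the single-thread baseline; in our own notation (not taken from the text), with T denoting the measured runtime of benchmark b under configuration c:

S_{b,c} = \frac{T_{b,\mathrm{base}}}{T_{b,c}}, \qquad
S_b^{\mathrm{peak}} = \max_{c \,\in\, \{2.64,\; 2.128,\; 8.64,\; 8.128\}} S_{b,c}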


Figure 7.19: NASPB Application Performance

Conversely, the lowest speedup was recorded for the BOTS Health benchmark with a maximum speedup of 1.00X. This was recorded using 2 threads and 128 for the maximum thread pressure. Over all the benchmark applications, we observe an average speedup of 3.23X over the baseline configuration, depicted as a blue horizontal line in Figure 7.20. This is achieved without any additional processing resources beyond the baseline configuration. However, we see an interesting deviation in the configuration of each benchmark that recorded the best performance.

Table 7.3: Peak Speedups by Configuration

Name     Number of Peak Speedups Per Config
base     0
2.64     3
2.128    4
8.64     9
8.128    8


Figure 7.20: Application Speedup

First, we observe that for all benchmarks, at least one parallel configuration exceeded the performance of the baseline configuration. We can therefore conclude that our implementation was, at minimum, successful in accelerating the baseline performance. We can then analyze the distribution of the peak speedup values in order to determine the ideal number of task units per core as well as the ideal maximum pressure threshold. Table 7.3 contains the results of this analysis. We find that only seven benchmarks, or roughly 30%, recorded their peak speedups using two threads. This implies that 70% of our benchmarks continued scaling their performance speedup out to eight threads per core.

Further, for our eight thread tests, we find that 9 benchmarks recorded their peak performance using a maximum pressure threshold of 64 while 8 benchmarks recorded their peak performance using the maximum pressure value of 128. Given that these values are so close to one another, it is difficult to conclude firmly that either 64 or 128 is the best potential maximum pressure value for a wide array of parallel applications. It does, however, provide a useful guide for future experimentation to determine the best possible value.

7.3 Comprehensive Analysis

Following our analysis of the individual architectural components, we find an interesting trend in the comprehensive architecture performance. Figure 7.21 presents the average memory transfer savings, in gigabytes, for each of the respective benchmarks. These savings are a direct result of the architectural features implemented in the GoblinCore-64 memory pipeline directed at improving the efficiency of the attached HMC memory devices. This also translates directly to a more realistic bytes-per-operation metric that ultimately reduces the bandwidth required to service a data intensive workload on both the local and global GC64 interconnects.
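The savings plotted in Figure 7.21 can be interpreted as the difference between the raw byte traffic issued by the cores and the coalesced byte traffic the DMC forwards to the HMC devices, averaged over the thread configurations. One plausible formalization, in our own notation rather than one taken from the text, is

\Delta_b = \frac{1}{|C|} \sum_{c \in C}
  \left( \sum_{r \in R_{b,c}} s_r^{\mathrm{raw}}
       - \sum_{p \in P_{b,c}} s_p^{\mathrm{HMC}} \right)

where, for benchmark b under configuration c, R_{b,c} is the set of raw memory requests of size s_r^raw bytes, P_{b,c} is the set of coalesced HMC packets of size s_p^HMC bytes, and C is the set of thread configurations.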

Figure 7.21: Average Memory Bandwidth Improvement

The smallest improvement in the results is found in the BOTS Fibonacci benchmark, with an average improvement of 2 GB of reduced data transfer. This is due, in part, to the low wallclock runtime and the high data reuse found in the instruction cache and data working set. The largest improvement is found in the Lulesh workload. This workload has an average data transfer savings of 69,931 GB, or approximately a 70 TB reduction in the amount of data transferred to service the parallel infrastructure. Despite the inherent irregularity found in the Lulesh numerical solvers, we significantly reduce the bandwidth required to service the parallel infrastructure. Across all the aforementioned benchmarks, this translates to an average data transfer reduction of 5,919 GB.

Finally, we can also examine the total potential architectural speedup using the hardware modules implemented as a part of the GoblinCore-64 infrastructure. Using the results from the aforementioned examinations, we present a total potential speedup metric combining the core performance from the latency hiding capabilities in the task processors with the bandwidth improvement due to the dynamic memory coalescing in the GC64 memory pipeline. We present our results in Figure 7.22.

In this figure, we plot the maximum core speedup and the average bandwidth improvement as a stacked histogram. First, we can clearly see a trend of positive speedup across all the tested workloads. This implies that our architectural techniques are successful across disparate workloads that include both dense and sparse or irregular compute and memory access patterns. On average, we find a combined speedup of 5.33X over the baseline configurations with no GoblinCore-64 features.

The application that exhibits the best potential for speedup is again the BOTS Fibonacci benchmark. This implies that as an algorithm or application begins to make significant reuse of its instruction stream or working set, we will observe a significant speedup on the GoblinCore-64 architecture. Further, for applications that exhibit very irregular memory access patterns, such as those present in the GAP Benchmark Suite, we can observe significant performance improvements over traditional monolithic core architectures.

As a result, we may conclude that the GoblinCore-64 architectural techniques provide a significant advantage in improving the performance, the throughput and the efficiency of data intensive and high performance computing algorithms and applications. Further, these performance advantages are measured without the use of architecture-specific programming models, compiler intrinsics or other target-specific optimizations.


Figure 7.22: Combined Architectural Speedup

The benchmark codes were executed without changes to the core source code. As a result, users with disparate programming models, programming styles and algorithmic constructs may realize performance advantages using the GoblinCore-64 infrastructure without changes to their source code and numerical solvers.


CHAPTER 8 RELATED WORK

8.1 Data Intensive Hardware Architectures

8.1.1 Tera MTA/Cray XMT

Several previous architectures have attempted to directly couple the hardware and software runtime architecture in order to efficiently support data intensive applications. One of the original attempts was the Tera MTA [23]. The MTA combined hardware-driven multithreading and tag-bit semantics in order to provide a whole-machine programming model.

The MTA made use of a processing unit methodology termed a barrel processor. The barrel processor executed what are termed streams in the MTA vernacular [22]. Streams were enqueued and dequeued from a ready queue as they became eligible to execute. Execution eligibility was determined by the availability of state in the associated register file once memory operations or tag-bit operations had cleared their respective hazards. The MTA's context switching mechanism built within the barrel processor would forcibly context switch between streams once per cycle. In this manner, a stream was only permitted to execute one instruction at a time before sharing the processing element with an adjacent stream.

The MTA and its child variants utilized a unique memory subsystem. The entire system contained a flat, globally addressable memory space that was theoretically equally accessible from any of the executing streams. The main memory space was not paged. Rather, a specific hashing technique was utilized in order to ensure non-conflicting linear addressing across multiple memory banks. The MTA processors had no on-board cache, and as a result, the latency to memory was very long compared to competitive processors with multi-level caches [158].

Several evaluations of parallel benchmarks have noted the efficiency with which the MTA executes unstructured algorithms [142] [40] [115]. Further, SPMD-style parallel programming models were evaluated using the MTA's lightweight threading mechanisms [114]. On the surface, this methodology appears sound. However, research suggests that this coarse-grained methodology does not sufficiently make use of the MTA's hardware components.


As a result, additional research was required to determine whether coupling these programming models to intelligent compilation techniques would permit SPMD codes to execute effectively on the MTA.

Later revisions of the MTA were released as the Cray XMT and XMT-2 [115]. The XMT processors, codenamed Threadstorm, shared a socket design with the AMD Socket F processor and a system architecture with the Cray XT4. However, both the MTA and the XMT relied on non-standard programming models that often hindered general acceptance. This lack of support for a commodity programming model restricted the user base to the select few willing to manually port applications to the XMT platform. As a result, the XMT is no longer in production and many known hardware installations have been retired.

8.1.2 IBM Cyclops

The IBM Cyclops64, also known as the Blue Gene/C, provided an interesting memory hierarchy that consisted of global interleaved memories connected via a crossbar switch and local scratchpad memories [21] [55]. The core architecture contained 80 processors per socket executing at 500 MHz that utilized a subset of the Power instruction set architecture. As a result, compilers, system software and many numerical libraries were already generally available during the development and initial deployment of Cyclops.

The Cyclops64 architecture exposed much of the underlying machine architecture in order to provide programmers and the runtime stack direct access to hardware and thereby maximize performance and efficiency. However, unlike the MTA/XMT architecture, thread execution in Cyclops64 is non-preemptive. As a result, threads sleep on wait states rather than context switching to adjacent threads. From the perspective of the programming model this is quite different, as the maximum degree of parallelism maps directly to the available hardware.

One of the core programming models for the Cyclops64 was an OpenMP [67] [55] implementation that exploited the hardware mechanisms present. This was enabled via sequentially consistent scratchpad memories without any coherent caches backing them within the processing hierarchy. This, coupled with the naturally weakly ordered memory subsystem, makes Cyclops64 an attractive target for commodity parallel programming models.


8.1.3 Convey MX-100

The Convey MX-100 [103] [104] was also designed to execute data intensive applications using latency hiding techniques. The MX-100 used a shared memory, heterogeneous system architecture that connected dual-socket Intel x86_64 servers to the MX-100 board via PCIe x8 Gen3 links. The MX-100 hardware architecture was based upon large Xilinx FPGAs. As a result, the clock frequency of the cores (300 MHz) was rather low compared to traditional ASIC-based designs.

The major benefit of the MX-100 architecture was two-fold. First, the memory subsystem connected to each of the FPGA-based processing elements was a fully connected crossbar. This enabled equivalent access from any individual processing element to any memory bank within the system. Further, the memory subsystem supported similar full/empty or tagged memory addresses. In this manner, each 64-bit memory location had an additional bit of storage that represented a lock or tag bit. These lock bits were manipulated using special memory instructions that permitted algorithms and applications to explicitly manage the lock state of individual memory locations using rudimentary instructions. As a result, the barrier synchronization performance of the MX-100 was far beyond that of competitive platforms [97].

The MX-100 supported a simple, OpenMP-based threading and tasking model that was directly assisted by hardware concurrency mechanisms [97]. The major downside to the architecture was the implementation of the coprocessor via low-frequency (300 MHz) FPGA logic. Despite its simple programming model, the result was an architecture that required significant application concurrency in order to garner performance.

8.1.4 Graphics Processor Memory Request Coalescing

The current state of the art with respect to coalescing disparate data accesses has generally focused on optimizing data access locality for modern graphics processors or transactional memories [147]. Many of these data reordering problems have been previously classified and shown to require solutions to NP-complete problems [166] [167]. As such, many research efforts have resorted to heuristics. We may further classify these heuristic methods as data reorganization optimizations and code generation optimizations.

Given the closed nature of GPU hardware implementations, several projects have focused on constructing software optimization methodologies that contribute to a more efficient stream of memory requests on GPUs. The approach taken by the G-Streamline [178] [177] project required a software infrastructure that implemented a set of coalescing algorithms to reorder streams of data requests. It also assisted in the overall allocation of large data structures such that they were more closely aligned to boundaries that are indicative of good memory bandwidth on the respective device. While the infrastructure performs well for traditional stencil-like codes in benchmark results, the approach is limited in its efficacy given that certain irregular data structures, such as graphs, cannot be easily aligned or reordered due to their inherent nondeterminism.

Several methodologies have also described the use of joint compiler and runtime optimizations in order to balance static (compile time) and dynamic (runtime) coalescing. One such effort implemented compiler optimizations coupled to an intelligent runtime infrastructure that delivers performance at or near the hand-optimized NVIDIA CUBLAS [125] library [173] [174]. While the performance benefits of this approach are clear and the overhead necessary to operate the runtime infrastructure is minimized, the approach is inherently tied to a specific GPU architecture (NVIDIA) using static compilation. As such, for other NVIDIA GPU models, the applications must be recompiled for the degree of warp parallelism, or the number of concurrent threads in a GPU streaming multiprocessor, present in the respective device. Further, for GPU manufacturers other than NVIDIA, this approach requires porting the static compilation techniques to support the disparate architecture target.

Other approaches to optimizing GPU memory access patterns have focused on re-crafting the current coalescing and caching hierarchy to support more irregular access patterns. WarpPool makes use of special hardware logic to optimize the path between the load/store units and the L1 cache [90]. This approach exploits the weak memory consistency model present in many GPUs and other highly concurrent parallel architectures to achieve better inter-warp spatial locality. While this hardware-driven approach is again rather successful for the respective GPU architecture, it fails to provide a more general approach applicable to other highly concurrent parallel architectures. Further, this approach relies upon custom hardware logic implemented alongside what is currently a closed-source micro architecture.

In addition to the more general work on GPUs, recent efforts have also begun to address the notion of memory request optimization for Hybrid Memory Cube devices [106] [69]. Rather than implementing the aforementioned approach on the CPU side of the system or software architecture, the authors chose to implement their Data Rearrangement Engines, or DREs, as custom logic blocks in the logic layer of an HMC device. Applications have the ability to invoke these lightweight hardware functions in order to rearrange data dynamically in memory without performing traditional read-modify-write cycles. The result of the DRE operation is a data block that is inherently aligned to the caching hierarchy already present on the CPU. The benefit of this approach can be realized using standard CPU architectures with traditional caching hierarchies. Further, the burden of modifying the core CPU architecture is minimized given the processing-near-memory nature of this approach. The performance and energy reduction achieved when the resulting blocks are reorganized by the DREs makes this approach interesting and novel as a future extension to our on-chip DMC.

8.2 Data Intensive Software Architectures

8.2.1 Grappa

Unlike the aforementioned hardware-driven solutions, the Grappa framework is a software-centric infrastructure designed to address large-scale data intensive computing and analytics workloads [123] [122] [124] [78] [118]. Grappa makes use of a distributed shared memory model that enables users and applications to view distributed memory systems as single, non-uniform memory access (NUMA) platforms. Grappa utilizes several unique latency-hiding features in order to promote the overall throughput of an application while trading off some degree of communication latency.

At its core, Grappa is written in C++ and is constructed using three main components. First, it provides a global address space for application execution. In this manner, applications are permitted to dereference and access local memory via standard pointers. Applications may also allocate blocks of symmetric memory whereby abstract pointers are given to each parallel unit. Any time accesses are made to these symmetric pointers in the global heap, the locally equivalent pointer is dereferenced and accessed as normal.

Second, Grappa provides lightweight, user-level parallelism in the form of a tasking layer. Each core in the system is allocated a single operating-system thread (e.g., a pthread) which is subsequently pinned and not permitted to migrate.

Grappa utilizes these system-level worker threads to spawn and manage its own notional tasks of execution. In this manner, all tasking in Grappa is performed at the user level in order to avoid system-level context switching. Grappa schedules tasks between worker threads by monitoring long-latency operations, such as symmetric memory accesses, and via application-triggered context switches.

Finally, Grappa implements a two-level communication and messaging framework. The communication framework supports both a user-level messaging interface that utilizes active messages and a network-level transport mechanism optimized for high bandwidth data transfers. Active messages in Grappa are expressed as C++11 lambdas. In this manner, the templated nature of the lambda performs all the necessary binary data collection and transmission outside of the user's purview. The second stage of communication is message aggregation into large, efficient message payloads. This enables more efficient communication at scale.

When benchmarked against traditional SPMD programming model implementations such as MapReduce and Spark, Grappa exhibited excellent scaling and performance characteristics [123]. Utilizing a commodity cluster with an InfiniBand interconnect, Grappa exhibited a 10X speedup over MapReduce, a 1.33X speedup over GraphLab and a 12.5X speedup over Shark. As such, despite the differences in programming approach, Grappa provides an excellent platform for scalable experimentation.

8.2.2 Stinger

Similar in its research goals to Grappa, Stinger is an infrastructure and application for building and analyzing dynamic graphs [63] [132] [131]. Unlike previous attempts to create frameworks for solving problems in graph theory and large-scale analytics, Stinger specifically focuses on providing infrastructure to analyze dynamic or streaming graphs. The end goal is to provide real-time (or near real-time) analytics for large-scale graphs.

Stinger is constructed as a series of libraries and external interfaces using the C programming language and the OpenMP programming model. Stinger also supports MTA extensions, including the full-empty locking primitives, when compiled on a Cray XMT platform.

Stinger also makes use of a unique internal data representation whereby graphs are viewed by the internal solvers and traversal methods as infinite streams of edge insertions, deletions and updates. Updates to the internal graph structure are made in bulk payloads in order to avoid global locks and the latency induced by ensuring atomicity. In this manner, the greatest degree of parallelism can be maintained despite a constant stream of incoming updates. The solver and traversal framework paramount to Stinger's efficacy also makes use of these bulk updates. Rather than recalculating entire graph statistics or full traversals, Stinger utilizes the update payloads to calculate micro-updates to the statistical analysis. While this may appear numerically unstable, the results are merged into the global statistics in a manner that provides numerical consistency across the dynamism of the graph.

Stinger has demonstrated its efficacy in building and manipulating large-scale dynamic graphs using social data as input [133] [134] [62]. These case studies in dynamic graph analysis have utilized data intensive computing platforms such as the Cray XMT [133] as well as commodity platforms [134]. As a result, Stinger has demonstrated both platform portability and performance portability beyond traditional approaches that use static graph solvers adapted for dynamic analysis.

8.2.3 Qthreads

In addition to more expressive, domain-specific computing infrastructures, the Qthreads library provides low-latency, high performance runtime management underlying high-level programming models [117]. Qthreads abstracts OS-level threading mechanisms (such as pthreads) into a portable abstraction layer for thread context management, lightweight synchronization, atomicity and resource management.

At the core of Qthreads is a set of primitive thread and task scheduling algorithms that permit users or high-level programming models to directly manipulate how threads or tasks are queued and subsequently scheduled across scalable system architectures [127] [126]. These scheduling mechanisms provide FIFO, work stealing and load balancing features well beyond the basic operating system-level scheduling supported in modern kernel architectures. In addition to the expressive scheduling mechanisms, Qthreads also supports an abstract notion of MTA-style locking. The software-level full-empty locking support is managed via a hardware abstraction layer such that commodity architectures and Cray XMT platforms can be utilized without modifying the high-level code [117].

Qthreads has been utilized to implement several high-level programming models in order to provide abstract, efficient scheduling support for parallelism. The Cray Chapel language currently maintains support for Qthreads as the basis for its low-latency task management [165]. Most recently, Qthreads was utilized to manage thread and task scheduling for applications that utilize hybrid programming models such as MPI and OpenMP concurrently [146].

Qthreads has also been utilized as the basis for more expressive solver frameworks such as the Sandia Multithreaded Graph Library (MTGL) [32]. Much in the same way that traditional high performance computing solver libraries made use of OpenMP or MPI primitives, MTGL has direct support for the MTA-style locking primitives as well as the thread management primitives present in Qthreads. As a result, MTGL provides a portable method by which to manipulate and analyze large-scale graph constructs on a multitude of platforms.


CHAPTER 9 CONCLUSION AND FUTURE WORK

9.1 Conclusion

In this work, we introduce the GoblinCore-64 micro and system architecture. The methodologies contained herein are designed to provide a data intensive computing platform that executes irregular algorithms and applications efficiently, both on small systems and at extreme scale, without sacrificing programmability. GoblinCore-64 combines simple, in-order RISC-V cores with intelligent memory pipelines and high bandwidth Hybrid Memory Cube devices to deliver simple, efficient computing capabilities without drastic changes to the programming model and compiler infrastructure.

At the core of this research is a set of instruction set and micro architectural extensions to the emerging RISC-V instruction set specification. These extensions provide scatter/gather memory operations, thread/task management and context management based in hardware. These extended RISC-V cores are attached to a hierarchical set of on-chip modules that permit rapid context switching between parallel units without the use of system software. This rapid context switching is directly related to the compiler's notional cost of an individual instruction and, as a result, provides a tightly-coupled hardware-software concurrency mechanism that is easily portable between programming models. Further, these task and thread management facilities are utilized by a hardware-software managed runtime environment that permits applications to spawn and join threads/tasks with individual instructions and provides scalable scheduling facilities, such as work stealing, across a large-scale system deployment.

These concurrency and low-latency context switching features are coupled to an intelligent memory pipeline and system interconnect. The memory pipeline is first coupled to a dynamic memory coalescing module that receives streams of irregular memory requests from the attached GoblinCore-64 task unit and coalesces them into the largest possible memory requests. The DMC engine utilizes intermediate tree structures to reorganize these weakly ordered requests before flattening them into block memory requests for the local or global memory subsystem. These requests are then dispatched to locally connected, three-dimensional stacked Hybrid Memory Cube devices.

Requests can also be dispatched to a global interconnect network, which is connected via a set of similar HMC-based links and associated protocols. The result is a scalable system infrastructure that permits applications to access any byte of the global, partitioned memory system using rudimentary memory requests.

Following an extensive evaluation of the aforementioned methodologies using a series of twenty-four pathological benchmarks for high performance and data intensive computing, we find that our approach stands to accelerate the aforementioned workloads by up to 14X per core. As a result, we may conclude that the GoblinCore-64 infrastructure and associated methodologies have the ability to significantly increase the performance and efficiency of scalable data intensive computing.

9.2 Future Work

Given the aforementioned research activities, there are a number of additional topics to explore in order to expand the GoblinCore-64 infrastructure. All of the benchmarks and evaluations presented in this work have utilized the traditional shared memory programming model, OpenMP [15] [16]. While OpenMP has become an industry standard shared memory programming model, supported by nearly all major compiler tool chains, it fails to address many of the idiosyncratic features required to support a data intensive computing application at extreme scale [85] [20]. Several previous attempts have been made to adapt OpenMP to platforms designed specifically for data intensive computing [67] [97] [115]. However, each has fallen short of garnering widespread adoption.

As a result, the first area of future work will be supporting more expressive programming models on the GoblinCore-64 platform. We may further classify our programming model exploration into two general classes: tightly coupled memory models and loosely coupled memory models. In the area of tightly coupled memory models, there exist several interesting candidates to support the global addressing features present in the GC64 architecture. First, the Unified Parallel C, or UPC, programming model [37] [54] has a natural locality decomposition that is well suited to exploit the partitioned nature of the GC64 scalable memory and addressing model. The basis for UPC and its C++ extension, UPC++ [179], is the core language model of C and C++. As a result, users are more likely to port existing data intensive codes written in similar languages, where the solvers and core infrastructure are easily verified for correctness.

In addition to UPC, the Cray Chapel [48] programming model contains some interesting design features well suited to exploit the GC64 micro architecture and system infrastructure. The Cray Chapel programming model was designed as a part of the DARPA HPCS High Productivity Languages program [107] in order to provide an ideal platform for users to rapidly build scalable scientific computing applications that directly exploit distributed and/or partitioned memory platforms with complex memory hierarchies. Chapel's notion of locales [75] provides an excellent platform by which to experiment with the direct exploitation of both the partitioned global memory space and the local shared scratchpad memories.

Supporting language constructs that compile directly to machine-level instructions has a tremendous number of advantages when building applications sensitive to memory latency, task concurrency and memory bandwidth. In addition, these machine-level languages can be augmented by compiler transformations [141] [87] [76] [50] and optimization frameworks [71] [68] to make efficient use of GC64's machine resources.

However, there still exists a productivity barrier that prevents users from utilizing more esoteric, partitioned global address space languages. As a result, we also seek to port several loosely coupled programming models to the GC64 platform. These programming models are those which do not compile directly to machine-level instructions; rather, they utilize intermediate layers such as Java and Python to interact with the machine-level instruction set and system architectural features. Google has employed these techniques in its private and public scalable computing models. Programming and execution models such as Google's BigTable [49] are often packaged as domain-specific languages or application-specific interfaces. As a result, the barrier to building and executing applications in these constructs is often minimized at the expense of machine-level performance.

Finally, the Spark programming model [176] [46] [148] provides an interesting intermediate target between a very domain specific programming model such as BigTable and the machine-level features present in UPC and Chapel. Spark provides a user-friendly programming interface that is similar in design to Hadoop MapReduce. However, Spark utilizes the Scala language with memory-centric primitives in order to permit users to control memory locality while increasing the programmability of the application. Further, Spark natively contains a number of packages that act as solvers for common operations such as graph analytics [170], SQL query processing [172], and machine learning [113].


BIBLIOGRAPHY

[1] Dynamic Memory Coalescing. http://gc64.org/?page_id=182. Accessed: 2016-07-28.

[2] Graph 500: November 2016 Results. http://www.graph500.org/results_nov_2016. Accessed: 2017-01-17.

[3] Graph BLAS Forum. http://graphblas.org/index.php/Graph_BLAS_Forum. Accessed: 2017-01-17.

[4] HMC-Controller-IP-Overview. http://picocomputing.com/wp-content/uploads/2016/06/HMC-Controller-IP-Overview.pdf. Accessed: 2017-01-17.

[5] HMC-Sim 2.0 API Documentation. http://docs.gc64.org/hmcsim/. Accessed: 2017-01-17.

[6] Hybrid Memory Cube Controller IP Core User Guide. https://www.altera.com/en_US/pdfs/literature/ug/ug_hmcc.pdf. Accessed: 2017-01-17.

[7] Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Technical Report LLNL-TR-490254.

[8] Open Silicon HMC Controller IP. http://www.open-silicon.com/open-silicon-ips/hmc/. Accessed: 2017-01-17.

[9] RISC-V FreeBSD. https://wiki.freebsd.org/riscv. Accessed: 2016-12-13.

[10] RISC-V GCC. https://github.com/riscv/riscv-gcc. Accessed: 2016-12-13.

[11] RISC-V ISA Simulator. https://github.com/riscv/riscv-isa-sim. Accessed: 2016-12-13.

[12] RISC-V/Linux. https://github.com/riscv/riscv-linux. Accessed: 2016-12-13.

[13] XHMC v1.0: Xilinx HMC Controller IP. https://www.xilinx.com/support/documentation/ip_documentation/xhmc/v1_0/pg216-xhmc.pdf. Accessed: 2017-01-17.


[14] execl(3) - Linux man page, August 2010.

[15] OpenMP Application Program Interface Version 4.0. Technical report, OpenMP Architecture Review Board, July 2013.

[16] OpenMP Application Programming Interface Examples Version 4.0.0. Technical report, OpenMP Architecture Review Board, November 2013.

[17] Hybrid Memory Cube Specification 2.0. Technical report, Hybrid Memory Cube Consortium, July 2015.

[18] Tilak Agerwala. Exascale computing: The challenges and opportunities in the next decade. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’10, pages 1–2, New York, NY, USA, 2010. ACM.

[19] S. G. Akl and N. Santoro. Optimal parallel merging and sorting without memory conflicts. IEEE Trans. Comput., 36(11):1367–1369, November 1987.

[20] Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and PACT - comparing data parallel programming models. In BTW, pages 25–44, 2011.

[21] George Almási, Călin Caşcaval, José G. Castaños, Monty Denneau, Derek Lieber, José E. Moreira, and Henry S. Warren, Jr. Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News, 31(1):26–38, March 2003.

[22] Gail Alverson, Preston Briggs, Susan Coatney, Simon Kahan, and Richard Korry. Tera hardware-software cooperation. In Proceedings of the 1997 ACM/IEEE Conference on Supercomputing, SC '97, pages 1–16, New York, NY, USA, 1997. ACM.

[23] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. SIGARCH Comput. Archit. News, 18(3b):1–6, June 1990.

[24] Francisco J. Andújar, Juan A. Villar, Francisco J. Alfaro, José L. Sánchez, and José Duato. Deadlock-free routing mechanism for 3D twin torus networks. In Proceedings of the 8th International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip, INA-OCMC '14, pages 3:1–3:4, New York, NY, USA, 2014. ACM.


[25] Krste Asanovic, Rimas Avizienis, Jonathan Bachrach, Scott Beamer, David Biancolin, Christopher Celio, Henry Cook, Daniel Dabbelt, John Hauser, Adam Izraelevitz, et al. The rocket chip generator. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2016-17, 2016.

[26] Krste Asanovic and David A. Patterson. Instruction sets should be free: The case for RISC-V. Technical Report UCB/EECS-2014-146, EECS Department, University of California, Berkeley, Aug 2014.

[27] Brian Austin and Nicholas J. Wright. Measurement and interpretation of microbenchmark and application energy use on the Cray XC30. In Proceedings of the 2nd International Workshop on Energy Efficient Supercomputing, E2SC '14, pages 51–59, Piscataway, NJ, USA, 2014. IEEE Press.

[28] Eduard Ayguade, Alejandro Duran, Jay Hoeflinger, Federico Massaioli, and Xavier Teruel. An Experimental Evaluation of the New OpenMP Tasking Model, pages 63–77. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[29] Jonathan Bachrach, Vo Huy, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. Chisel: Constructing hardware in a Scala embedded language. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pages 1212–1221, June 2012.

[30] David Bailey, Leonardo Dagum, Eric Barszcz, and Horst Simon. NAS parallel benchmark results. In Proceedings of the 1992 ACM/IEEE Conference on Supercomputing, Supercomputing '92, pages 386–393, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.

[31] Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Comput., 3(3):18:1–18:34, December 2016.

[32] Brian W. Barrett, Jonathan Berry, Richard C. Murphy, and Kyle B. Wheeler. Implementing a portable multi-threaded graph library: The MTGL on Qthreads. In 2009 IEEE International Symposium on Parallel Distributed Processing, pages 1–8, May 2009.

[33] Richard M. Beam and R. F. Warming. An implicit finite-difference algorithm for hyperbolic systems in conservation-law form. 22:87–110, 1976.

[34] Scott Beamer, Krste Asanović, and David Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 12:1–12:10, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.


[35] Scott Beamer, Krste Asanovic, and David A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.

[36] Justin M. Beaver, Christopher T. Symons, and Robert E. Gillen. A learning system for discriminating variants of malicious network traffic. In Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop, CSIIRW ’13, pages 23:1–23:4, New York, NY, USA, 2013. ACM.

[37] Dan Bonachea and Gary Funck. UPC Language Specifications, Version 1.3. Technical Report LBNL-6623E, Lawrence Berkeley National Laboratory, Nov 2013.

[38] Raymond C. Borges Hink, Justin M. Beaver, Mark Buckner, Tommy Morris, Uttam Adhikari, and Shengyi Pan. Machine learning for power system disturbance and cyber-attack discrimination. In 2014 7th International Symposium on Resilient Control Systems (ISRCS), pages 1–8, Aug 2014.

[39] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.

[40] Sharon Brunett, John Thornley, and Marrq Ellenbecker. An initial evaluation of the Tera multithreaded architecture and programming system using the C3I parallel benchmark suite. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC '98, pages 1–19, Washington, DC, USA, 1998. IEEE Computer Society.

[41] David Bueno, Chris Conger, and Alan D. George. Optimizing RapidIO architectures for onboard processing. ACM Trans. Embed. Comput. Syst., 9(3):18:1–18:30, March 2010.

[42] David Bueno, Chris Conger, Alan D. George, Ian Troxel, and Adam Leko. RapidIO for radar processing in advanced space systems. ACM Trans. Embed. Comput. Syst., 7(1):1:1–1:38, December 2007.

[43] Aydin Buluc. Linear Algebraic Primitives for Parallel Computing on Large Graphs. PhD thesis, Santa Barbara, CA, USA, 2010. AAI3398781.

[44] Aydin Buluc and John R Gilbert. The Combinatorial BLAS: design, implementation, and applications. Int. J. High Perform. Comput. Appl., 25(4):496–509, November 2011.


[45] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1–52:12, New York, NY, USA, 2011. ACM.

[46] Nicholas Chaimov, Allen Malony, Shane Canon, Costin Iancu, Khaled Z. Ibrahim, and Jay Srinivasan. Scaling Spark on HPC systems. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC '16, pages 97–110, New York, NY, USA, 2016. ACM.

[47] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A Recursive Model for Graph Mining, pages 442–446.

[48] Bradley L. Chamberlain, David Callahan, and Hans P. Zima. Parallel programmability and the Chapel language. The International Journal of High Performance Computing Applications, 21(3):291–312, 2007.

[49] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26, June 2008.

[50] Wei-Yu Chen, Dan Bonachea, Costin Iancu, and Katherine Yelick. Automatic nonblocking communication for partitioned global address space programs. In Proceedings of the 21st Annual International Conference on Supercomputing, ICS '07, pages 158–167, New York, NY, USA, 2007. ACM.

[51] Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil A. Patil, William Reinhart, Darrel Eric Johnson, Jebediah Keefe, and Hari Angepat. FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 249–261, Washington, DC, USA, 2007. IEEE Computer Society.

[52] Samir R. Das and Richard M. Fujimoto. A performance study of the cancelback protocol for time warp. SIGSIM Simul. Dig., 23(1):135–142, July 1993.

[53] Samir R. Das and Richard M. Fujimoto. A performance study of the cancelback protocol for time warp. In Proceedings of the Seventh Workshop on Parallel and Distributed Simulation, PADS '93, pages 135–142, New York, NY, USA, 1993. ACM.

[54] Mattias De Wael, Stefan Marr, Bruno De Fraine, Tom Van Cutsem, and Wolfgang De Meuter. Partitioned global address space languages. ACM Comput. Surv., 47(4):62:1–62:27, May 2015.


[55] Juan del Cuvillo, Weirong Zhu, Ziang Hu, and Guang R. Gao. Toward a software infrastructure for the Cyclops-64. In Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment, HPCS '06, pages 9–, Washington, DC, USA, 2006. IEEE Computer Society.

[56] Digilent. ZYBO FPGA Board Reference Manual: Doc 502-279. Technical report, Digilent, April 2016.

[57] Jack Dongarra, Robert Graybill, William Harrod, Robert Lucas, Ewing Lusk, Piotr Luszczek, Janice Mcmahon, Allan Snavely, Jeffrey Vetter, Katherine Yelick, Sadaf Alam, Roy Campbell, Laura Carrington, Tzu-Yi Chen, Omid Khalili, Jeremy Meredith, and Mustafa Tikir. DARPA's HPCS program: History, models, tools, languages. In Advances in Computers: High Performance Computing, volume 72 of Advances in Computers, pages 1–100. Elsevier, 2008.

[58] Jack Dongarra and Michael A. Heroux. Toward a New Metric for Ranking High Performance Computing Systems. Mar. 2015.

[59] Jack Dongarra, Michael A. Heroux, and Piotr Luszczek. HPCG benchmark: a new metric for ranking high performance computing systems. Technical report, University of Tennessee, Sandia National Laboratories, November 2015.

[60] Iain S. Duff, Michael A. Heroux, and Roldan Pozo. An overview of the sparse basic linear algebra subprograms: The new standard from the blas technical forum. ACM Trans. Math. Softw., 28(2):239–267, June 2002.

[61] Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, and Eduard Ayguade. Barcelona OpenMP Tasks Suite: a set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing, ICPP '09, pages 124–131, Washington, DC, USA, 2009. IEEE Computer Society.

[62] David Ediger, Karl Jiang, E. Jason Riedy, David A. Bader, Courtney Corley, Rob Farber, and William N. Reynolds. Massive social network analysis: Mining twitter for social good. In 39th International Conference on Parallel Processing (ICPP), San Diego, CA, September 2010.

[63] David Ediger, Jason Riedy, David A. Bader, and Henning Meyerhenke. Compu- tational graph analytics for massive streaming data. In Hamid Sarbazi-azad and Albert Zomaya, editors, Large Scale Network-Centric Computing Systems, Parallel and Distributed Computing, chapter 25. Wiley, July 2013.


[64] Mark R. Fahey. A leap forward with UTK’s Cray XC30. In Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment, XSEDE ’14, pages 30:1–30:8, New York, NY, USA, 2014. ACM.

[65] Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, and John Shalf. OpenSoC Fabric: on-chip network generator: Using chisel to generate a parameter- izable on-chip interconnect fabric. In Proceedings of the 2014 International Work- shop on Network on Chip Architectures, NoCArc ’14, pages 45–50, New York, NY, USA, 2014. ACM.

[66] Patrick C. Fischer and Robert L. Probert. Efficient procedures for using matrix algorithms. In Proceedings of the 2Nd Colloquium on Automata, Languages and Programming, pages 413–427, London, UK, UK, 1974. Springer-Verlag.

[67] Ge Gan, Xu Wang, Joseph Manzano, and Guang R. Gao. Tile percolation: An OpenMP tile aware parallelization technique for the Cyclops-64 multicore proces- sor. In Proceedings of the 15th International Euro-Par Conference on Parallel Pro- cessing, Euro-Par ’09, pages 839–850, Berlin, Heidelberg, 2009. Springer-Verlag.

[68] Diana Gohringer and Jan Tepelmann. An interactive tool based on polly for detection and parallelization of loops. In Proceedings of Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA- DITAM ’14, pages 1:1–1:6, New York, NY, USA, 2014. ACM.

[69] Maya Gokhale, Scott Lloyd, and Chris Hajas. Near memory data structure rearrange- ment. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS ’15, pages 283–290, New York, NY, USA, 2015. ACM.

[70] Irving John Good. Introduction to Cooley and Tukey (1965) An Algorithm for the Machine Calculation of Complex Fourier Series, pages 201–216. Springer New York, New York, NY, 1997.

[71] Tobias Grosser, Hongbin Zheng, Raghesh A, Andreas Simburger, Armin Grosslinger, and Louis-Noel Pouchet. Polly - polyhedral optimization in LLVM. In First International Workshop on Polyhedral Compilation Techniques (IMPACT’11), Chamonix, France, April 2011.

[72] Qi Guo, N. Alachiotis, Berkin Akin, F. Sadi, G. Xu, Tze-Meng Low, Lawrence Pileggi, James C. Hoe, and Franz Franchetti. 3d-stacked memory-side acceleration: Accelerator and system design. In Workshop on Near Data Processing (WONDP), 2014.


[73] Vladimir Hahanov, Wajeb Gharibi, Eugenia Litvinova, and Svetlana Chumachenko. Big data driven cyber analytic system. In Proceedings of the 2015 IEEE Interna- tional Congress on Big Data, BIGDATACONGRESS ’15, pages 615–622, Wash- ington, DC, USA, 2015. IEEE Computer Society.

[74] Mohammadtaghi Hajiaghayi, Theodore Johnson, Mohammad Reza Khani, and Barna Saha. Hierarchical graph partitioning. In Proceedings of the 26th ACM Sym- posium on Parallelism in Algorithms and Architectures, SPAA ’14, pages 51–60, New York, NY, USA, 2014. ACM.

[75] Riyaz Haque and David Richards. Optimizing PGAS overhead in a multi-locale Chapel implementation of CoMD. In Proceedings of the First Workshop on PGAS Applications, PAW ’16, pages 25–32, Piscataway, NJ, USA, 2016. IEEE Press.

[76] Akihiro Hayashi, Jisheng Zhao, Michael Ferguson, and Vivek Sarkar. LLVM-based communication optimizations for PGAS programs. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM ’15, pages 1:1– 1:11, New York, NY, USA, 2015. ACM.

[77] Stephen A Herrod. Using complete machine simulation to understand computer system behavior. Technical report, Stanford, CA, USA, 1998.

[78] Brandon Holt, Jacob Nelson, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Flat combining synchronized global data structures. In 7th International Conference on PGAS Programming Models, page 76, 2013.

[79] Khaled Z. Ibrahim and Katherine Yelick. On the conditions for efficient interop- erability with threads: An experience with PGAS languages using Cray commu- nication domains. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 23–32, New York, NY, USA, 2014. ACM.

[80] JEDEC. DDR3 SDRAM Standard. Technical report, July 2012. JEDEC JESD79-3F.

[81] JEDEC. DDR4 SDRAM Standard. Technical report, September 2012. JEDEC JESD79-4.

[82] Henry Jin, Michael Frumkin, and Jerry Yan. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical report, 1999.

[83] Yong-Yeon Jo, Sang-Wook Kim, and Duck-Ho Bae. Efficient sparse matrix multi- plication on gpu for large social network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 1261–1270, New York, NY, USA, 2015. ACM.


[84] Cliff Joslyn, Sutanay Choudhury, David Haglin, Bill Howe, Bill Nickless, and Bryan Olsen. Massive scale cyber traffic analysis: A driver for graph database research. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pages 3:1–3:6, New York, NY, USA, 2013. ACM.

[85] Sol Ji Kang, Sang Yeon Lee, and Keon Myung Lee. Performance comparison of OpenMP, MPI, and MapReduce in practical problems. Adv. MultiMedia, 2015:7:7–7:7, January 2015.

[86] Jeremy Kepner and John Gilbert. Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2011.

[87] Dounia Khaldi, Pierre Jouvelot, François Irigoin, Corinne Ancourt, and Barbara Chapman. LLVM parallel intermediate representation: Design and evaluation using openshmem communications. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM ’15, pages 2:1–2:8, New York, NY, USA, 2015. ACM.

[88] Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. Memory-centric system interconnect design with hybrid memory cubes. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 145–156, Piscataway, NJ, USA, 2013. IEEE Press.

[89] John Kim, William J. Dally, and Dennis Abts. Flattened Butterfly: a cost-efficient topology for high-radix networks. SIGARCH Comput. Archit. News, 35(2):126–137, June 2007.

[90] John Kloosterman, Jonathan Beaumont, Mick Wollman, Ankit Sethia, Ron Dreslinski, Trevor Mudge, and Scott Mahlke. Warppool: Sharing requests with inter-warp coalescing for throughput processors. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 433–444, New York, NY, USA, 2015. ACM.

[91] Dominique LaSalle and George Karypis. MPI for big data. Parallel Comput., 40(10):754–767, December 2014.

[92] Yunsup Lee, Albert Ou, and Albert Magyar. Z-scale: Tiny 32-bit RISC-V Systems. https://riscv.org/wp-content/uploads/2015/06/riscv-zscale-workshop-june2015.pdf, June 2015.

[93] Yunsup Lee, Andrew Waterman, Henry Cook, Brian Zimmer, Ben Keller, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtic, Stevo Bailey, Milovan Blagojevic, et al. An agile approach to building RISC-V microprocessors. IEEE Micro, 36(2):8–20, 2016.


[94] John Leidel and Yong Chen. Exploring tag-bit memory operations in hybrid memory cubes. In Proceedings of the 2016 International Symposium on Memory Systems, MEMSYS ’16, New York, NY, USA, 2016. ACM.

[95] John Leidel and Richard C Murphy. Hybrid memory cube system interconnect directory-based cache coherence methodology, May 7 2015. US Patent App. 14/706,516.

[96] John D Leidel, Xi Wang, and Yong Chen. Toward a memory-centric, stacked architecture for extreme-scale, data-intensive computing. In Workshop On Pioneering Processor Paradigms, 2017 IEEE Symposium on High Performance Computer Architecture. IEEE, 2017.

[97] John D Leidel, Joe Bolding, and Geoffrey Rogers. Toward a scalable heterogeneous runtime system for the Convey MX architecture. In Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pages 1597–1606. IEEE, 2013.

[98] John D Leidel and Yong Chen. HMC-Sim: A simulation framework for hybrid memory cube devices. Parallel Processing Letters, 24(04):1442002, 2014.

[99] John D Leidel and Yong Chen. HMC-Sim: A simulation framework for hybrid memory cube devices. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops, pages 1465–1474, May 2014.

[100] John D Leidel and Yong Chen. Toward memory centric architecture simulation for data intensive applications, 2015. 2015 Workshop on Modeling & Simulation of Systems and Applications.

[101] John D. Leidel and Yong Chen. HMC-Sim-2.0: A simulation platform for exploring custom memory cube operations. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 621–630, May 2016.

[102] John D Leidel and Richard C Murphy. Translation lookaside buffer in memory, May 21 2015. US Patent App. 14/718,649.

[103] John D Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. CHOMP: a framework and instruction set for latency tolerant, massively multithreaded processors. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:, pages 232–239. IEEE, 2012.

[104] John D Leidel, Kevin R Wadleigh, Joe Bolding, Tony Brewer, and Dean E Walker. Systems and methods for efficient scheduling of concurrent applications in multithreaded processors, March 15 2013. US Patent App. 13/834,362.


[105] John D. Leidel, Xi Wang, and Yong Chen. GoblinCore-64: Architectural Specifica- tion. Technical report, Texas Tech University, September 2015.

[106] Scott Lloyd and Maya Gokhale. In-memory data rearrangement for irregular, data- intensive computing. Computer, 48(8):18–25, Aug 2015.

[107] Ewing Lusk and Katherine Yelick. Languages for high-productivity computing: The darpa hpcs language project. Parallel Processing Letters, 17(01):89–102, 2007.

[108] Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kepner, Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. The hpc challenge (hpcc) benchmark suite. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM.

[109] Peter S. Magnusson and Bengt Werner. Some efficient techniques for simulating memory. Technical report, 1994.

[110] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, and David A. Wood. Multi- facet’s general execution-driven multiprocessor simulator (gems) toolset. SIGARCH Comput. Archit. News, 33(4):92–99, November 2005.

[111] John D. McCalpin. Memory bandwidth and machine balance in current high per- formance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pages 19–25, December 1995.

[112] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkatara- man, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. Mllib: Machine learning in apache spark. J. Mach. Learn. Res., 17(1):1235–1241, January 2016.

[113] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkatara- man, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. Mllib: Machine learning in apache spark. Journal of Machine Learning Research, 17(34):1–7, 2016.

[114] Carleton Miyamoto and Chang Lin. Evaluating titanium spmd programs on the tera mta. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, SC ’99, New York, NY, USA, 1999. ACM.


[115] David Mizell and Kristyn Maschhoff. Early experiences with large-scale Cray XMT systems. In Proceedings of the 2009 IEEE International Symposium on Paral- lel&Distributed Processing, IPDPS ’09, pages 1–9, Washington, DC, USA, 2009. IEEE Computer Society.

[116] Richard C. Murphy, Kyle B. Wheeler, Brian B. Barrett, and James A. Ang. Introducing the Graph 500, May 2010. Cray User’s Group.

[117] Richard C. Murphy, Kyle B. Wheeler, and Douglas Thain. Qthreads: An API for programming with millions of lightweight threads. In 2008 IEEE International Parallel & Distributed Processing Symposium, pages 1–8, 2008.

[118] Brandon Myers, Daniel Halperin, Jacob Nelson, Mark Oskin, Luis Ceze, and Bill Howe. Radish: Compiling efficient query plans for distributed shared memory. Technical report, UW CSE Tech Report 14-10-01, 2014.

[119] Gene Myers, Sanford Selznick, Zheng Zhang, and Webb Miller. Progressive mul- tiple alignment with constraints. In Proceedings of the First Annual International Conference on Computational Molecular Biology, RECOMB ’97, pages 220–225, New York, NY, USA, 1997. ACM.

[120] Lifeng Nai and Hyesoon Kim. Instruction offloading with hmc 2.0 standard: A case study for graph traversals. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS ’15, pages 258–261, New York, NY, USA, 2015. ACM.

[121] Ravi Nair. Active Memory Cube: A processing-in-memory architecture for exascale systems. Technical report, 2015. 59(2/3).

[122] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Ka- han, and Mark Oskin. Grappa: A latency-tolerant runtime for large-scale irregular applications. In International Workshop on Rack-Scale Computing (WRSC w/Eu- roSys), 2014.

[123] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Latency-tolerant software distributed shared memory. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Confer- ence, USENIX ATC ’15, pages 291–305, Berkeley, CA, USA, 2015. USENIX As- sociation.

[124] Jacob Nelson, Brandon Myers, Andrew H Hunter, Preston Briggs, Luis Ceze, Carl Ebeling, Dan Grossman, Simon Kahan, and Mark Oskin. Crunching large graphs with commodity processors. HotPar, 11:10–10, 2011.


[125] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, March 2008.

[126] Stephen L. Olivier, Bronis R. de Supinski, Martin Schulz, and Jan F. Prins. Characterizing and mitigating work time inflation in task parallel programs. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 65:1–65:12, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

[127] Stephen L. Olivier, Allan K. Porterfield, Kyle B. Wheeler, and Jan F. Prins. Scheduling task parallelism on multi-socket multicore systems. In Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS ’11, pages 49–56, New York, NY, USA, 2011. ACM.

[128] Igor Ozimek, Robert Verlic, and Jurij Tasic. Optimal scheduling for fast systolic array implementations. In Proceedings of the 1997 European Conference on Design and Test, EDTC ’97, pages 620–, Washington, DC, USA, 1997. IEEE Computer Society.

[129] Javier Prades, Carlos Reaño, and Federico Silla. CUDA acceleration for Xen virtual machines in Infiniband clusters with rCUDA. SIGPLAN Not., 51(8):35:1–35:2, February 2016.

[130] Steven K. Reinhardt, Mark D. Hill, James R. Larus, Alvin R. Lebeck, James C. Lewis, and David A. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. SIGMETRICS Perform. Eval. Rev., 21(1):48–60, June 1993.

[131] Jason Riedy. STINGER: Analyzing massive, streaming graphs. 2nd GraphLab Workshop, July 2013.

[132] Jason Riedy and David A. Bader. Massive streaming data analytics: A graph-based approach. XRDS, 19(3):37–43, March 2013.

[133] Jason Riedy, David Ediger, David A. Bader, and Henning Meyerhenke. Tracking structure of streaming social networks. 2011 Graph Exploitation Symposium hosted by MIT Lincoln Labs, August 2011.

[134] Jason Riedy, Henning Meyerhenke, David A. Bader, David Ediger, and Timothy G. Mattson. Analysis of streaming social networks and graphs on multicore architectures. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, March 2012.

[135] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, 10(1):16–19, Jan 2011.


[136] Sebastien Rumley, Dessislava Nikolova, Robert Hendry, Qi Li, David Calhoun, and Keren Bergman. Silicon photonics for exascale systems. 2014. [137] Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual Interna- tional Symposium on Computer Architecture, ISCA ’13, pages 475–486, New York, NY, USA, 2013. ACM. [138] Graham Schelle and Dirk Grunwald. CUSP: a modular framework for high speed network applications on FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-programmable Gate Arrays, FPGA ’05, pages 246–257, New York, NY, USA, 2005. ACM. [139] Juri Schmidt, Holger Fröning, and Ulrich Brüning. Exploring time and energy for complex accesses to a hybrid memory cube. In Proceedings of the Second Interna- tional Symposium on Memory Systems, MEMSYS ’16, pages 142–150, New York, NY, USA, 2016. ACM. [140] Gilad Shainer, Pak Lui, and Tong Liu. The development of Mellanox/NVIDIA GPUDirect over InfiniBand: A new model for GPU to GPU communications. In Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery, TG ’11, pages 26:1–26:1, New York, NY, USA, 2011. ACM. [141] Aroon Sharma, Darren Smith, Joshua Koehler, Rajeev Barua, and Michael Ferguson. Affine loop optimization based on modulo unrolling in chapel. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, PGAS ’14, pages 13:1–13:12, New York, NY, USA, 2014. ACM. [142] Allan Snavely, Larry Carter, Jay Boisseau, Amit Majumdar, Kang Su Gatlin, Nick Mitchell, John Feo, and Brian Koblenz. Multi-processor performance on the Tera MTA. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, SC ’98, pages 1–8, Washington, DC, USA, 1998. IEEE Computer Society. [143] Marc Snir. The future of supercomputing. In Proceedings of the 28th ACM Inter- national Conference on Supercomputing, ICS ’14, pages 261–262, New York, NY, USA, 2014. ACM. [144] Daniel A. Spielman and Shang-Hua Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing, STOC ’04, pages 81–90, New York, NY, USA, 2004. ACM. [145] Sadagopan Srinivasan, Li Zhao, Brinda Ganesh, Bruce Jacob, Mike Espig, and Ravi Iyer. CMP memory modeling: How much does accuracy matter?


[146] Dylan T. Stark, Richard F. Barrett, Ryan E. Grant, Stephen L. Olivier, Kevin T. Pedretti, and Courtenay T. Vaughan. Early experiences co-scheduling work and communication tasks for hybrid mpi+x applications. In Proceedings of the 2014 Workshop on Exascale MPI, ExaMPI ’14, pages 9–19, Piscataway, NJ, USA, 2014. IEEE Press.

[147] Srdan Stipić, Vasileios Karakostas, Vesna Smiljković, Vladimir Gajinov, Osman Unsal, Adrián Cristal, and Mateo Valero. Dynamic transaction coalescing. In Proceedings of the 11th ACM Conference on Computing Frontiers, CF ’14, pages 18:1–18:10, New York, NY, USA, 2014. ACM.

[148] Ion Stoica. Trends and challenges in big data processing. Proc. VLDB Endow., 9(13):1619–1619, September 2016.

[149] John E. Stone, David Gohara, and Guochun Shi. OpenCL: a parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66–73, May 2010.

[150] Yanteng Sun, Peng Li, Guochang Gu, Yuan Wen, Yuan Liu, and Dong Liu. HMMer acceleration using systolic array based reconfigurable architecture. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’09, pages 282–282, New York, NY, USA, 2009. ACM.

[151] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K. Panda. Rdma read based rendezvous protocol for mpi over infiniband: Design alternatives and benefits. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Prac- tice of Parallel Programming, PPoPP ’06, pages 32–39, New York, NY, USA, 2006. ACM.

[152] Zehra Sura, Arpith Jacob, Tong Chen, Bryan Rosenburg, Olivier Sallenave, Carlo Bertolli, Samuel Antao, Jose Brunheroto, Yoonho Park, Kevin O’Brien, and Ravi Nair. Data access optimization in a processing-in-memory system. In Proceedings of the 12th ACM International Conference on Computing Frontiers, CF ’15, pages 6:1–6:8, New York, NY, USA, 2015. ACM.

[153] Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, David Patterson, and Krste Asanović. RAMP Gold: an FPGA-based architecture simulator for multiprocessors. In Proceedings of the 47th Design Automation Conference, DAC ’10, pages 463–468, New York, NY, USA, 2010. ACM.

[154] Geoffrey Taylor. The formation of a blast wave by a very intense explosion. ii. the atomic explosion of 1945. Proceedings of the Royal Society of London A: Mathe- matical, Physical and Engineering Sciences, 201(1065):175–186, 1950.


[155] Richard A. Uhlig and Trevor N. Mudge. Trace-driven memory simulation: A survey. ACM Comput. Surv., 29(2):128–170, June 1997.

[156] Jeff Ullman. MapReduce algorithms. In Proceedings of the 2Nd IKDD Conference on Data Sciences, CODS-IKDD ’15, pages 1:1–1:1, New York, NY, USA, 2015. ACM.

[157] Brian Vinter, Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, and Kenneth Skovhede. Prototyping for exascale. In Proceedings of the 3rd International Confer- ence on Exascale Applications and Software, EASC ’15, pages 77–81, Edinburgh, Scotland, UK, 2015. University of Edinburgh.

[158] Richard Vuduc and E. Jason Riedy. Microbenchmarking the Tera MTA. http://www.cs.berkeley.edu/~culler/cs258-s99/project/tera-bench/report.pdf.

[159] Kevin R. Wadleigh and Isom L. Crawford. Software Optimization for High- performance Computing. HP Professional Series. Prentice Hall PTR, 2000.

[160] Xi Wang, John D. Leidel, and Yong Chen. Concurrent dynamic memory coalescing on GoblinCore-64 architecture. In Proceedings of the 2016 International Symposium on Memory Systems. ACM, 2016.

[161] Andrew Waterman, Yunsup Lee, Rimas Avizienis, David A. Patterson, and Krste Asanovic. The RISC-V instruction set manual volume ii: Privileged architecture version 1.7. Technical Report UCB/EECS-2015-49, EECS Department, University of California, Berkeley, May 2015.

[162] Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanovic. The RISC-V instruction set manual, volume i: User-level isa, version 2.0. Technical Report UCB/EECS-2014-54, EECS Department, University of California, Berke- ley, May 2014.

[163] Ke Wen, David Calhoun, Sébastien Rumley, Xiaoliang Zhu, Yang Liu, Lian Wee Luo, Ran Ding, Tom Baehr Jones, Michael Hochberg, Michal Lipson, et al. Reuse distance based circuit replacement in silicon photonic interconnection networks for hpc. In High-Performance Interconnects (HOTI), 2014 IEEE 22nd Annual Sympo- sium on, pages 49–56. IEEE, 2014.

[164] Thomas F. Wenisch, Roland E. Wunderlich, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and James C. Hoe. Simflex: Statistical sampling of computer system simulation. Micro, IEEE, 26(4):18–31, July 2006.


[165] Kyle Bruce Wheeler, Richard C. Murphy, Dylan Stark, and Brad L. Chamberlain. The Chapel Tasking Layer Over Qthreads. May 2011.

[166] Bo Wu, Zhijia Zhao, Eddy Zheng Zhang, Yunlian Jiang, and Xipeng Shen. Complex- ity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on gpu. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’13, pages 57–68, New York, NY, USA, 2013. ACM.

[167] Bo Wu, Zhijia Zhao, Eddy Zheng Zhang, Yunlian Jiang, and Xipeng Shen. Complex- ity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on gpu. SIGPLAN Not., 48(8):57–68, February 2013.

[168] Xilinx. Vivado Design Suite User Guide v2015.4: UG910. Technical report, Xilinx, November 2015.

[169] Xilinx. Xilinx Zynq-7000 All Programmable SoC Overview: DS190. Technical report, Xilinx, January 2016.

[170] Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. Graphx: Unifying data-parallel and graph-parallel analyt- ics. CoRR, abs/1402.2394, 2014.

[171] Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. GraphX: a resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pages 2:1–2:6, New York, NY, USA, 2013. ACM.

[172] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 13–24, New York, NY, USA, 2013. ACM.

[173] Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. A GPGPU compiler for memory optimization and parallelism management. SIGPLAN Not., 45(6):86–97, June 2010.

[174] Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. A gpgpu compiler for mem- ory optimization and parallelism management. In Proceedings of the 31st ACM SIG- PLAN Conference on Programming Language Design and Implementation, PLDI ’10, pages 86–97, New York, NY, USA, 2010. ACM.


[175] Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrick- son, and Umit Catalyurek. A scalable distributed parallel breadth-first search al- gorithm on BlueGene/L. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC ’05, pages 25–, Washington, DC, USA, 2005. IEEE Computer Society.

[176] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10– 10, Berkeley, CA, USA, 2010. USENIX Association.

[177] Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. On-the-fly elimination of dynamic irregularities for gpu computing. SIGPLAN Not., 46(3):369– 380, March 2011.

[178] Eddy Z. Zhang, Han Li, and Xipeng Shen. A study towards optimal data layout for GPU computing. In Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, MSPC ’12, pages 72–73, New York, NY, USA, 2012. ACM.

[179] Yili Zheng, A. Kamil, M.B. Driscoll, Hongzhang Shan, and K. Yelick. UPC++: A PGAS extension for C++. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 1105–1114, May 2014.


APPENDIX A: INSTRUCTION SET ENCODING

The GoblinCore-64 infrastructure utilizes the existing R-type instruction encoding from the RISC-V instruction set specification to encode all of the following instruction extensions. The GC64 extensions to the base RISC-V ISA are encoded entirely within the free opcode space left alongside the standard RISC-V IMAFD (i.e., RISC-V G) encoding.

GC64 Integer Load/Store Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
lbgthr  rd,rs1,rs2      0000000  000     0111111  0000000[rs2][rs1]000[rd]0111111
lhgthr  rd,rs1,rs2      0000000  001     0111111  0000000[rs2][rs1]001[rd]0111111
lwgthr  rd,rs1,rs2      0000000  010     0111111  0000000[rs2][rs1]010[rd]0111111
ldgthr  rd,rs1,rs2      0000000  011     0111111  0000000[rs2][rs1]011[rd]0111111
lbugthr rd,rs1,rs2      0000000  100     0111111  0000000[rs2][rs1]100[rd]0111111
lhugthr rd,rs1,rs2      0000000  101     0111111  0000000[rs2][rs1]101[rd]0111111
lwugthr rd,rs1,rs2      0000000  110     0111111  0000000[rs2][rs1]110[rd]0111111
sbscatr rs1,rs2,rs3     0000001  000     0111111  0000001[rs2][rs1]000[rs3]0111111
shscatr rs1,rs2,rs3     0000001  001     0111111  0000001[rs2][rs1]001[rs3]0111111
swscatr rs1,rs2,rs3     0000001  010     0111111  0000001[rs2][rs1]010[rs3]0111111
sdscatr rs1,rs2,rs3     0000001  011     0111111  0000001[rs2][rs1]011[rs3]0111111

GC64 Single Precision Load/Store Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
flwgthr  fd,rs1,rs2     0000010  000     0111111  0000010[rs2][rs1]000[fd]0111111
fswscatr rs1,rs2,fs3    0000011  000     0111111  0000011[rs2][rs1]000[fs3]0111111

GC64 Double Precision Load/Store Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
fldgthr  fd,rs1,rs2     0000100  000     0111111  0000100[rs2][rs1]000[fd]0111111
fsdscatr rs1,rs2,fs3    0000101  000     0111111  0000101[rs2][rs1]000[fs3]0111111


GC64 Concurrency Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
iwait rd,rs1,rs2        0000110  000     0111111  0000110[rs2][rs1]000[rd]0111111
ctxsw                   0000110  001     0111111  0000110[X0][X0]001[X0]0111111

GC64 Task Control Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
spawn rd,rs1            0000111  000     0111111  0000111[X0][rs1]000[rd]0111111
join rd,rs1             0000111  001     0111111  0000111[X0][rs1]001[rd]0111111
gettask rd,tctx         0001000  000     0111111  0001000[X0][TCTX]000[rd]0111111
gettid rd,tid           0001000  001     0111111  0001000[X0][TID]001[rd]0111111
gettq rd,tq             0001000  010     0111111  0001000[X0][TQ]010[rd]0111111
gette rd,te             0001000  011     0111111  0001000[X0][TE]011[rd]0111111
settask tctx,rs1        0001001  000     0111111  0001001[X0][rs1]000[TCTX]0111111
settq tq,rs1            0001001  001     0111111  0001001[X0][rs1]001[TQ]0111111
sette te,rs1            0001001  010     0111111  0001001[X0][rs1]010[TE]0111111

GC64 Supervisor Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
sgetkey rd,gkey         0010000  000     0111111  0010000[X0][GKEY]000[rd]0111111
ssetkey gkey,rs1        0010000  001     0111111  0010000[X0][rs1]001[GKEY]0111111

GC64 [Optional] 128-bit Load/Store Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
ldugthr rd,rs1,rs2      0100000  000     0111111  0100000[rs2][rs1]000[rd]0111111
lqugthr rd,rs1,rs2      0100000  001     0111111  0100000[rs2][rs1]001[rd]0111111
sqscatr rs1,rs2,rs3     0100000  010     0111111  0100000[rs2][rs1]010[rs3]0111111


GC64 Environment Instructions [R-Type]

Mnemonic                funct7   funct3  opcode   Encoding
getgconst rd,gconst     0001010  000     0111111  0001010[X0][GCONST]000[rd]0111111
getgarch rd,garch       0001010  001     0111111  0001010[X0][GARCH]001[rd]0111111
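To make the encodings concrete, the following minimal C sketch packs the standard R-type fields (funct7, rs2, rs1, funct3, rd, opcode) into a 32-bit instruction word for the lwgthr row of the integer load/store table above. The field positions are those of the base RISC-V R-type format; the helper name pack_rtype and the register choices are illustrative and are not part of the GC64 toolchain.

#include <stdint.h>
#include <stdio.h>

/* Pack a RISC-V R-type word: funct7 | rs2 | rs1 | funct3 | rd | opcode */
static uint32_t pack_rtype( uint32_t funct7, uint32_t rs2, uint32_t rs1,
                            uint32_t funct3, uint32_t rd, uint32_t opcode ){
  return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
         (funct3 << 12) | (rd << 7) | opcode;
}

int main( void ){
  /* lwgthr x10,x11,x12 : funct7=0000000, funct3=010, opcode=0111111 */
  uint32_t insn = pack_rtype( 0x00, 12, 11, 0x2, 10, 0x3F );
  printf( "lwgthr x10,x11,x12 = 0x%08x\n", insn );
  return 0;
}

The same packing applies to every row in the tables above; only the funct7 value, the funct3 value, and the register operands change between rows.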


APPENDIX B: SELECTED SOURCE CODE

The current implementation of the Dynamic Memory Coalescing API is written in C. Placeholders exist in the driver infrastructure for future, optimized versions of the microcode written in assembly language, permitting optimized microcode to be added later without dramatic changes to the overall infrastructure. The DMC source is made up of two main bundles. First, the DMC microcode implements the methodology and associated algorithms outlined in this text, including all of the necessary coalescing logic and internal data handlers. The second bundle is a driver infrastructure that permits users to execute synthetic and live tests against a functional RISC-V or GoblinCore-64 execution environment. The driver accepts a number of different inputs, including static memory traces, input from standard input (STDIN), and named-pipe input directly from an adjacent GC64 simulator instance.

1. Prerequisites. The RISC-V tools and simulation environments must be installed prior to building or executing the DMC microcode. This includes the entire RISC-V Tools suite (https://github.com/riscv/riscv-tools).

2. Source Code Retrieval. The full source code for the Dynamic Memory Coalescing engine is available as open source from the GoblinCore-64 website (http://gc64.org/?page_id=182). Download the gzipped tarball or clone the source from the Texas Tech Gitlab repository (http://discl.cs.ttu.edu/gitlab/gc64/gc64-dmc.git). Once you have uncompressed or cloned the source, change directories to the gc64-dmc/sw directory.

3. Compilation. The GC64-DMC source can be compiled using the included makefile. However, you must first ensure that the RISCV environment variable is set to the installation directory of your RISC-V tools. Setting the environment and starting the build can be done as follows:

$> export RISCV=/path/to/RISCV/tools/installation
$> make

4. Execution. Once the build has completed, a single executable, dmc_driver, will exist. The driver must be executed under the RISC-V Spike simulation environment as follows:


$RISCV/bin/spike pk dmc_driver

The driver options are noted as follows:

• -h: Prints the help menu

• -f filename: Input file containing a static memory trace

• -I: Read the memory trace from STDIN

• -p /path/to/pipe: Read the memory trace from the named pipe

• -t: Enable internal tracing

• -T: Enable deep tracing

• -n: Execute the C (unoptimized) version of the microcode

• -s: Execute the assembly version of the microcode

• -P: Print the statistical information from the simulation
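As an example of driving the tool, the following minimal C sketch emits a synthetic trace in the {WR,RD,CLOSE}:{NUM_BYTES}:{PROCID}:{0xADDR} record format parsed by the read_trace() routine in driver.c below. The program name gen_trace, the addresses, and the flag combination in the usage note are illustrative only.

#include <stdio.h>
#include <stdint.h>

int main( void ){
  uint64_t base = 0x0000000000001000ull;  /* illustrative base address */
  int i;
  /* eight sequential 8-byte reads from processor 0 */
  for( i = 0; i < 8; i++ ){
    printf( "RD:8:0:0x%016llx\n", (unsigned long long)(base + 8ull*i) );
  }
  /* one 8-byte write, then the CLOSE record that ends the trace */
  printf( "WR:8:0:0x%016llx\n", (unsigned long long)base );
  printf( "CLOSE:0:0:0x0000000000000000\n" );
  return 0;
}

The resulting output can be captured to a file and fed to the driver with, for example, ./gen_trace > trace.txt followed by $RISCV/bin/spike pk dmc_driver -f trace.txt -n -P.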

The remainder of this appendix contains the source code for the DMC microcode (ucode.c) and the testbench driver (driver.c).

**************************File 1 ucode.c**************************

/*
 * _UCODE_C_
 *
 * MICROCODE ENGINE FOR GC64-DMC ENGINE
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include "dmc_ucode.h"

/* * Globals *


*/

btree root; /* btree root */ btree *free_list; /* free node list */

/* last free node in the list */ btree *free_list_last; btree *nodes; /* node list */

/* total number of read bytes in the tree */ int total_read_bytes = 0;

/* total number of write bytes in the tree */ int total_write_bytes = 0;

/* read timeout value */ int read_timeout = 0;

/* write timeout value */ int write_timeout = 0;

/* number of nodes in the tree */ int num_nodes = 0;

/* number of nodes in the free list */ int num_free = 0;

/* dmc config struct */ struct dmc __config;

hmc_list *head = NULL; hmc_list *tail = NULL;

struct dmc __config;

/* base address of the tree */ uint64_t base_addr = 0;

/* temporary address for the logic testing*/ uint64_t temp_addr = 0;


/* temp node represent the previous node been inserted*/ btree *temp;

/* -- init the node of HMC LIST (the caller allocates the node) */ extern void hmc_init(hmc_list *node){ node->next = NULL; }

/* -- EXPIRE_TREE */ static void expire_tree(btree *p) {

btree *new = NULL;

if(p == NULL) { return ; }

/* descend into each subtree and put * all the nodes on the free list * first, get the end of the free * list */

if( p->left != NULL ){ expire_tree( p->left ); p->left = NULL; }

if( p->right != NULL ){ expire_tree( p->right ); p->right = NULL; }

/* * put p back on the free list


* */ if( free_list == NULL ){ free_list = p; free_list_last = p; num_free = 1; return ; }

/* * existing nodes, get the last in the list * */ new = free_list_last;

/* insert the free node */ new->left = p; p->left = NULL; free_list_last = p; num_free++;

}

/* -- INIT_ROOT_NODE */ static void init_root_node(){ root.left = NULL; root.right = NULL; root.parent = NULL; }

/* -- INIT_LEAF_NODE */ static void init_leaf_node(){ int i =0; btree *tmp;

/* * we only allocate the minimum number * of nodes; which is the maximum size * of a subtree or the maximum number


* insertion cycles * */ if( MAX_WRITE_TIME > 128 ){ nodes = malloc( sizeof( btree ) * 256 ); num_nodes = 256; }else{ num_nodes = MAX_WRITE_TIME*2; nodes = malloc( sizeof( btree ) * num_nodes ); }

#ifdef DEBUG printf( "num_nodes = \%d\n", num_nodes ); #endif

free_list = &(nodes[0]);
tmp = free_list;
/* link the remaining nodes into the free list */
for( i=1; i<num_nodes; i++ ){
  tmp->left = &(nodes[i]);
  tmp = tmp->left;
}

nodes[num_nodes-1].left = NULL; free_list_last = &(nodes[num_nodes-1]);

num_free = num_nodes; }

/* -- GET_LEFT_LEAF */ static btree *get_left_leaf( btree *src ){

btree *tmp = NULL;

#ifdef DEBUG printf( "... in get_left_leaf\n" ); #endif

/* if the left node is null, * then just return the src node * eg, no children present


*/ if( src->left == NULL ){ return src; }

tmp = src->left; while( tmp->left != NULL ){ tmp = tmp->left; }

return tmp; }

/* -- GET_RIGHT_LEAF */ static btree *get_right_leaf( btree *src ){

btree *tmp = NULL;

#ifdef DEBUG printf( "... in get_right_leaf\n" ); #endif

/* if the right node is null, * then just return the src node * eg, no children present */ if( src->right == NULL ){ return src; }

tmp = src->right; while( tmp->right != NULL ){ tmp = tmp->right; }

return tmp; } /* -- GET_NEW_NODE */ static btree *get_new_node(){


btree *tmp = free_list; btree *new = tmp;

#ifdef DEBUG printf( "... entering get_new_node\n" ); #endif

if( num_free == 0 || free_list == NULL ){ return NULL; }

/* last node */ if( free_list->left == NULL ){ free_list = NULL; num_free = 0; return tmp; }

/* get the last node */ while( new->left != NULL ){ tmp = new; new = new->left; }

num_free--; tmp->left = NULL; new->left = NULL; new->right = NULL;

#ifdef DEBUG printf( "... exiting get_new_node\n" ); #endif

return new; }

/* -- INSERT_CHILD */ static void insert_child( btree **parent, btree *child ){

#ifdef DEBUG


printf( "...in insert_child\n" ); #endif

/* if the parent is null, we’re done */ if( *parent == NULL ){ #ifdef DEBUG printf( "...found null parent\n" ); #endif *parent = child; return ; }

/* recursively insert based upon the address */ if( (*parent)->req.addr >= child->req.addr ){ /* if new inserted addr is smaller * or equal to the parents’ addr */ if( (*parent)->left == NULL) { (*parent)->left = child; child->parent = (*parent); return ; }else{ insert_child(&((*parent)->left),child); return ; } }else{ /* if new inserted addr is larger * the parents’ addr */ if((*parent)->right == NULL) { (*parent)->right = child; child->parent = (*parent); return ; }else{ insert_child(&((*parent)->right),child); return ; } } }


/* -- InOrderTraverse */ /* Read: io = 0 * Write: io = 1 */ static void InOrderTraverse(btree* n, int io, btree* end){

if(n != NULL){

InOrderTraverse(n->left, io,end); /* for the case total read bytes * in the whole tree doesn’t exceed 128*/ if(n == end && io == 0){ temp_addr = n->req.addr + n->req.op; total_read_bytes = temp_addr - base_addr; if(total_read_bytes <= 128){ trace_read( total_read_bytes, base_addr ); } return; } else if(n == end && io == 1){ temp_addr = n->req.addr + (n->req.op&0xFF); total_write_bytes = temp_addr - base_addr; if(total_write_bytes <= 128){ trace_write( total_write_bytes, base_addr ); } return; } if( io == 0 ){ /* READ */ /* check if total read bytes exceed * 128 or not*/ if( (n->req.addr + n->req.op - base_addr < 128) || (n->req.addr == base_addr) ){ temp = n; temp_addr = n->req.addr + n->req.op;


}else{ /* build one hmc request*/ total_read_bytes = temp_addr - base_addr; trace_read( total_read_bytes, base_addr );

/* reset the base address, * temp node and temp address*/ temp = n; base_addr = n->req.addr; temp_addr = n->req.addr + n->req.op; total_read_bytes = 0; }

}/* end read */ else{ /* WRITE */ if( ((n->req.addr + (n->req.op&0xFF) - base_addr < 128) && (n->req.addr == temp_addr)) || (n->req.addr == base_addr) ){ temp = n; temp_addr = n->req.addr + (n->req.op&0xFF); }else{ /* build one hmc request*/ total_write_bytes = temp_addr - base_addr; trace_write( total_write_bytes, base_addr );

/* reset the base address, temp node * and temp address*/ temp = n; base_addr = n->req.addr; temp_addr = n->req.addr + (n->req.op&0xFF); total_write_bytes = 0; } }/* end write */

InOrderTraverse(n->right, io,end);

} }


/* -- HMC_INSERT */
static void hmc_insert(int size, uint64_t addr, int flag ){
  /* allocate the new list node and init its link */
  hmc_list *node = (hmc_list*) malloc( sizeof( hmc_list ) );
  hmc_init(node);
  int op;
  /* round the size up to a 16-byte multiple, capped at 128 bytes */
  if( size%16 != 0 ){
    size = ((size/16)+1)*16;
    if( size>128 ){ size=128; }
  }
  /* if flag = 0, the request is a read operation */
  if(flag == 0)
    op = size/16;
  /* if flag = 1, the request is a write operation */
  if(flag == 1)
    op = size/16+16;

node->req.op = op; node->next = NULL;

/* if the list is empty*/ if( head == NULL ) head = node; else tail->next = node;

tail = node;

}

/* -- expire the hmc_list */ extern void hmc_expire(){ hmc_list *temp; hmc_list *node; temp =head;

if(head->next == NULL){ printf("the list is empty\n");


return; }

if(head->next == NULL){ free(head); printf("the list is expired\n"); return; }

while(temp != tail){ node = temp->next; free(temp); temp=node; }

free(tail); printf("the list is expired\n"); return; }

/* -- UCODE_INSERT */ extern void ucode_insert(struct mtrace rqst){

/* * call the tracing function before returning * the trace function handles the config * and only prints tracing if enabled * */ btree* tmp = NULL; btree* pri = NULL; btree* end = NULL;

#ifdef DEBUG printf( "... in ucode_insert\n" ); #endif

/* * get a new node for the insertion *


*/ tmp = get_new_node(); if( tmp == NULL ){

/* no new nodes, expire the trees */ #ifdef DEBUG printf("...out of nodes\n"); #endif

/* -- read nodes */ if( root.left != NULL ){ pri = get_left_leaf( root.left ); /*find out the most right nodes * (the end of the left subtree)*/ end = get_right_leaf( root.left ); base_addr = pri->req.addr; temp_addr = 0; temp = pri; InOrderTraverse(root.left,0,end); expire_tree(root.left); root.left = NULL; total_read_bytes = 0; read_timeout = 0; }

/* -- write nodes */ if( root.right != NULL ){ pri = get_left_leaf( root.right ); /*find out the most right nodes * (the end of the right subtree)*/ end = get_right_leaf( root.right ); base_addr = pri->req.addr; temp_addr = 0; temp = pri; InOrderTraverse(root.right,1,end); expire_tree(root.right); root.right = NULL; total_write_bytes = 0; write_timeout = 0; }


/* get a new node */ tmp = get_new_node(); }

#ifdef DEBUG printf( "...got a new node\n" ); #endif

/* fill in the data */ tmp->req.proc = rqst.proc; tmp->req.op = rqst.op; tmp->req.addr = rqst.addr;

/* insert read requests from RISCV cores */ if( (tmp->req.op >= RD1) && (tmp->req.op <= RD16) ) { #ifdef DEBUG printf( "...inserting read request\n" ); #endif insert_child( &root.left, tmp ); total_read_bytes+=tmp->req.op; read_timeout++; }

/* insert write requests from RISCV cores */ if( (tmp->req.op >= RD16) && (tmp->req.op <= WR16) ) { #ifdef DEBUG printf( "...inserting write request\n" ); #endif insert_child( &root.right, tmp ); total_write_bytes+=(tmp->req.op&0xFF); write_timeout++; }

#ifdef DEBUG printf( "...request inserted\n" ); #endif

/* * force an HMC read request


* set timeout to zero * clear read requests in the tree * */ if( (total_read_bytes == 128) || (read_timeout >= MAX_READ_TIME) ){ #ifdef DEBUG printf( "...expiring read tree\n" ); #endif pri = get_left_leaf( root.left ); end = get_right_leaf( root.left ); /* --- add logic to build the hmc requests. */

base_addr = pri->req.addr; temp_addr = 0; temp = pri;

/* --- traverse the tree and build hmc requests */ InOrderTraverse(root.left,0,end);

expire_tree(root.left); root.left = NULL;

total_read_bytes = 0; read_timeout = 0; }

/* * force an HMC write request * set timeout to zero * clear the write requests in the tree */ if( (total_write_bytes == 128) || (write_timeout >= MAX_WRITE_TIME) ){ #ifdef DEBUG printf( "...expiring write tree\n" ); #endif pri = get_left_leaf( root.right ); end = get_right_leaf( root.right );


base_addr = pri->req.addr; temp_addr = 0; temp = pri;

InOrderTraverse(root.right,1,end);

expire_tree(root.right); root.right = NULL;

total_write_bytes = 0; write_timeout = 0; }

trace(); }

/* -- UCODE_FLUSH */ extern void ucode_flush(){

btree* pri = NULL; btree* end = NULL;

/* -- read nodes */ #ifdef DEBUG printf( "...ucode_flush()\n" ); printf( "...flushing read nodes\n" ); #endif if( root.left != NULL ){ pri = get_left_leaf( root.left ); end = get_right_leaf( root.left ); base_addr = pri->req.addr; temp_addr = 0; temp = pri; InOrderTraverse(root.left,0,end); expire_tree(root.left); root.left = NULL; total_read_bytes = 0; read_timeout = 0; }


/* -- write nodes */ #ifdef DEBUG printf( "...flushing write nodes\n" ); #endif if( root.right != NULL ){ pri = get_left_leaf( root.right ); end = get_right_leaf( root.right ); base_addr = pri->req.addr; temp_addr = 0; temp = pri; InOrderTraverse(root.right,1,end); expire_tree(root.right); root.right = NULL; total_write_bytes = 0; write_timeout = 0; }

}

/* -- UCODE_FREE */ extern void ucode_free() { free(nodes); nodes = NULL; }

/* -- UCODE_INIT */ extern void ucode_init( uint64_t conf ) {

/* init the structure */ __config.conf = conf;

/* init the root node */ init_root_node();

/* init the leaf nodes */ init_leaf_node(); }

/* EOF */


**************************File 2 driver.c**************************

/*
 * _DRIVER_C_
 *
 * GC64-DMC TESTBENCH DRIVER
 *
 * DRIVES THE DMC TEST ENGINE
 *
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "dmc_ucode.h"

/* -- FUNCTION PROTOTYPES */ extern void ucode_init(uint64_t conf); extern void ucode_insert(); extern void ucode_opt_insert(); extern void ucode_flush(); extern void ucode_free(); extern void hmc_init(hmc_list * node); extern void hmc_expire();

extern hmc_list * head; extern hmc_list * tail;

/* -- GLOBALS */ uint64_t c_RD1; uint64_t c_RD2; uint64_t c_RD4; uint64_t c_RD8; uint64_t c_RD16;

uint64_t c_WR1; uint64_t c_WR2; uint64_t c_WR4; uint64_t c_WR8; uint64_t c_WR16;
uint64_t h_WR16; uint64_t h_WR32; uint64_t h_WR48; uint64_t h_WR64; uint64_t h_WR80; uint64_t h_WR96; uint64_t h_WR112; uint64_t h_WR128; uint64_t h_WR256;
uint64_t h_RD16; uint64_t h_RD32; uint64_t h_RD48; uint64_t h_RD64; uint64_t h_RD80; uint64_t h_RD96; uint64_t h_RD112; uint64_t h_RD128; uint64_t h_RD256;

/* -- INIT_STATS */ static void init_stats() { c_RD1 = 0x00ll; c_RD2 = 0x00ll; c_RD4 = 0x00ll; c_RD8 = 0x00ll; c_RD16 = 0x00ll; c_WR1 = 0x00ll; c_WR2 = 0x00ll; c_WR4 = 0x00ll; c_WR8 = 0x00ll; c_WR16 = 0x00ll;

h_WR16 = 0x00ll; h_WR32 = 0x00ll;
h_WR48 = 0x00ll; h_WR64 = 0x00ll; h_WR80 = 0x00ll; h_WR96 = 0x00ll; h_WR112 = 0x00ll; h_WR128 = 0x00ll; h_WR256 = 0x00ll; h_RD16 = 0x00ll; h_RD32 = 0x00ll; h_RD48 = 0x00ll; h_RD64 = 0x00ll; h_RD80 = 0x00ll; h_RD96 = 0x00ll; h_RD112 = 0x00ll; h_RD128 = 0x00ll; h_RD256 = 0x00ll; }

/* -- PRINT_STATS */ static void print_stats() { uint64_t total_r_in = 0x00ll; uint64_t total_w_in = 0x00ll; uint64_t total_in = 0x00ll; uint64_t total_r_out = 0x00ll; uint64_t total_w_out = 0x00ll; uint64_t total_out = 0x00ll;

total_r_in = c_RD1 + c_RD2 + c_RD4 + c_RD8 + c_RD16; total_w_in = c_WR1 + c_WR2 + c_WR4 + c_WR8 + c_WR16; total_in = total_r_in + total_w_in;

total_r_out = h_RD16 + h_RD32 + h_RD48 + h_RD64 + h_RD80 + h_RD96 + h_RD112 + h_RD128 + h_RD256; total_w_out = h_WR16 + h_WR32 + h_WR48 + h_WR64 + h_WR80 + h_WR96 + h_WR112 + h_WR128 + h_WR256; total_out = total_r_out + total_w_out;

printf("------\n");
printf(" Runtime Stats\n");
printf("------\n");
printf("RD1 %lld\n", c_RD1);
printf("RD2 %lld\n", c_RD2);
printf("RD4 %lld\n", c_RD4);
printf("RD8 %lld\n", c_RD8);
printf("RD16 %lld\n", c_RD16);
printf("WR1 %lld\n", c_WR1);
printf("WR2 %lld\n", c_WR2);
printf("WR4 %lld\n", c_WR4);
printf("WR8 %lld\n", c_WR8);
printf("WR16 %lld\n", c_WR16);
printf("------\n");
printf("HMC_WR_16 %lld\n", h_WR16);
printf("HMC_WR_32 %lld\n", h_WR32);
printf("HMC_WR_48 %lld\n", h_WR48);
printf("HMC_WR_64 %lld\n", h_WR64);
printf("HMC_WR_80 %lld\n", h_WR80);
printf("HMC_WR_96 %lld\n", h_WR96);
printf("HMC_WR_112 %lld\n", h_WR112);
printf("HMC_WR_128 %lld\n", h_WR128);
printf("HMC_WR_256 %lld\n", h_WR256);
printf("HMC_RD_16 %lld\n", h_RD16);
printf("HMC_RD_32 %lld\n", h_RD32);
printf("HMC_RD_48 %lld\n", h_RD48);
printf("HMC_RD_64 %lld\n", h_RD64);
printf("HMC_RD_80 %lld\n", h_RD80);
printf("HMC_RD_96 %lld\n", h_RD96);
printf("HMC_RD_112 %lld\n", h_RD112);
printf("HMC_RD_128 %lld\n", h_RD128);
printf("HMC_RD_256 %lld\n", h_RD256);
printf("------\n");
printf(" TOTAL_READ_INPUT %lld\n", total_r_in);
printf(" TOTAL_WRITE_INPUT %lld\n", total_w_in);
printf(" TOTAL_INPUT %lld\n", total_in);
printf(" TOTAL_READ_OUTPUT %lld\n", total_r_out);
printf(" TOTAL_WRITE_OUTPUT %lld\n", total_w_out);
printf(" TOTAL_OUTPUT %lld\n", total_out);
printf(" READ_EFFICIENCY %f\n",


(1.0f - (double)(total_r_out) / (double)(total_r_in)) * 100);
printf(" WRITE_EFFICIENCY %f\n",
(1.0f - (double)(total_w_out) / (double)(total_w_in)) * 100);
printf(" TOTAL_EFFICIENCY %f\n",
(1.0f - (double)(total_out) / (double)(total_in)) * 100);
printf("------\n");
}

/* -- PRINT_HELP */ static void print_help() { printf("----\n"); printf(" DMC_DRIVER\n"); printf("-----\n"); printf(" Usage: dmc_driver \n"); printf(" \n"); printf(" -h: print this help\n"); printf(" -f filename: file containing memory trace\n"); printf(" -I: read trace from STDIN\n"); printf(" -p /path/to/pipe: read trace from named pipe\n"); printf(" -t: enable tracing\n"); printf(" -T: enable deep tracing\n"); printf(" -n: execute C version of ucode\n"); printf(" -s: execute assembly version of ucode\n"); printf(" -P: print stats at the end of the run\n"); printf("---\n"); }

/* -- GET_MEMOP */ static int get_memop(int type, int nbytes, MEMOP * op) {

if (type == 0) { switch (nbytes) { case 1: * op = WR1; c_WR1++;


break; case 2: * op = WR2; c_WR2++; break; case 4: * op = WR4; c_WR4++; break; case 8: * op = WR8; c_WR8++; break; case 16: * op = WR16; c_WR16++; break; default: return -1; break; } return 0; } else if (type == 1) { switch (nbytes) { case 1: * op = RD1; c_RD1++; break; case 2: * op = RD2; c_RD2++; break; case 4: * op = RD4; c_RD4++; break; case 8: * op = RD8; c_RD8++; break;


case 16: * op = RD16; c_RD16++; break; default: return -1; break; } return 0; } else { return -1; }

}

/* --- READ_TRACE_PIPE */ static int read_trace_pipe(int fd, struct mtrace * otrace) { /* vars */ char cmdbuf[3]; char bytesbuf[2]; char procbuf[2]; char addrbuf[17]; char * token; size_t len = 0; int type = -1; /* {0=WR,1=RD} */ int nbytes = 0; int bytesread = 0; uint32_t proc = 0; uint64_t addr = 0x00ull; MEMOP op; /* ---- */

/* read a new buffer */ if ((bytesread = read(fd, cmdbuf, 3)) <= 0) { return -1; } len = strlen(cmdbuf); if (cmdbuf[len] == '\n') { cmdbuf[len] = '\0';


} cmdbuf[2] = '\0'; if (strcmp(cmdbuf, "WR") == 0) { type = 0; } else if (strcmp(cmdbuf, "RD") == 0) { type = 1; } else if (strcmp(cmdbuf, "EX") == 0) { return 1; } else { printf("error : erroneous message type : %s\n", cmdbuf); return -1; } if ((bytesread = read(fd, bytesbuf, 2)) <= 0) { return -1; } len = strlen(bytesbuf); if (bytesbuf[len] == '\n') { bytesbuf[len] = '\0'; } if ((bytesread = read(fd, procbuf, 2)) <= 0) { return -1; } len = strlen(procbuf); if (procbuf[len] == '\n') { procbuf[len] = '\0'; } if ((bytesread = read(fd, addrbuf, 17)) <= 0) { return -1; } len = strlen(addrbuf); if (addrbuf[len] == '\n') { addrbuf[len] = '\0'; } addrbuf[16] = '\0';


# ifdef DEBUG printf("msg = %s:%s:%s:%s\n", cmdbuf, bytesbuf, procbuf, addrbuf);# endif

nbytes = atoi(bytesbuf); proc = (uint32_t)(atoi(procbuf)); addr = (uint64_t)(strtol(addrbuf, NULL, 16));

/* get the appropriate memory operand */ if (get_memop(type, nbytes, & op) != 0) { printf("error : erroneous message type from get_memop\n"); return -1; }

/* fill out the data structure */ otrace->proc = proc; otrace->op = op; otrace->addr = addr;

#ifdef DEBUG printf("::DEBUG:: proc = %d\n", otrace - > proc); printf("::DEBUG:: op = %d\n", (int)(otrace - > op)); printf("::DEBUG:: addr = 0x%016lx\n", otrace - > addr); #endif

return 0; }

/* -- READ_TRACE */ static int read_trace(FILE * infile, struct mtrace * otrace) {

/* vars */ char buf[50];

char * token; size_t len = 0; int type = -1; /* {0=WR,1=RD} */ int nbytes = 0; uint32_t proc = 0; uint64_t addr = 0x00ull; MEMOP op; /* ---- */

/* read a single entry from the trace file */ if (feof(infile)) { return -1; } if (fgets(buf, 50, infile) == NULL) { return -1; }

/* * we have a valid buffer * strip the newline and tokenize it * */ len = strlen(buf); if (buf[len] == '\n') { buf[len] = '\0'; }

/* * now to tokenize it * Format: {WR,RD,CLOSE}:{NUM_BYTES}:{PROCID}:{0xADDR} */ token = strtok(buf, ":"); if (strcmp(token, "WR") == 0) { type = 0; } else if (strcmp(token, "RD") == 0) { type = 1; } else if (strcmp(token, "CLOSE") == 0) { /* received a CLOSE message */ return 1;


} else { printf("error : erroneous message type: %s\n", token); return -1; }

/* num_bytes */ token = strtok(NULL, ":"); nbytes = atoi(token);

/* procid */ token = strtok(NULL, ":"); proc = (uint32_t)(atoi(token));

/* -- first part of address = 0x */ token = strtok(NULL, "x"); /* -- last part of address in hex */ token = strtok(NULL, " "); addr = (uint64_t)(strtol(token, NULL, 16));

/* get the appropriate memory operand */ if (get_memop(type, nbytes, & op) != 0) { printf("error : erroneous message type from get_memop\n"); return -1; }

/* fill out the data structure */ otrace->proc = proc; otrace->op = op; otrace->addr = addr;

#ifdef DEBUG printf("::DEBUG:: proc = %d\n", otrace - > proc); printf("::DEBUG:: op = %d\n", (int)(otrace - > op)); printf("::DEBUG:: addr = 0x%016lx\n", otrace - > addr); #endif

return 0; }


/* -- MAIN */ int main(int argc, char ** argv) {

/* vars */ int ret = 0; int type = 0; int done = 0; int stats = 0; int fd = 0; int input = -1; /* 0 for tracefile; 1 for STDIN; 2 for named pipe */ uint64_t conf = 0x00ull; FILE * infile = NULL; struct mtrace rqst; char filename[1024]; /* ---- */

while (( ret = getopt(argc, argv, "f:tp:IThnsP")) != -1) { switch (ret) { case 'f': input = 0; sprintf(filename, "%s", optarg); break; case 'I': input = 1; break; case 't': conf |= CONF_TRACE; break; case 'p': input = 2; sprintf(filename, "%s", optarg); break; case 'T': conf |= CONF_TRACE; conf |= DEEP_TRACE; break; case 'n': printf("...Utilizing C-based microcode\n");


type = 0; break; case 's': printf("...Utilizing Assembly-based microcode\n"); type = 1; break; case 'P': init_stats(); stats = 1; break; case 'h': print_help(); return 0; break; default: printf("Unknown option!\n"); print_help(); return -1; break; } }

/* sanity check */ if (input == -1) { printf("error : no input method selected\n"); return -1; } if ((input == 0) && (strlen(filename) == 0)) { printf("error : filename is invalid\n"); return -1; }

/* open the file */ if (input == 0) { infile = fopen(filename, "r"); if (infile == NULL) { printf("error : could not open file %s\n", filename); return -1;


} } else if (input == 2) { if (strlen(filename) == 0) { printf("ERROR : filename is invalid\n"); return -1; } fd = open(filename, O_RDONLY); if (fd == -1) { printf("ERROR : COULD NOT OPEN NAMED PIPE AT %s\n", filename); return -1; } } else { infile = stdin; }

/* init the internal data structures */ ucode_init(conf);

/* read a request from the input file */ if (input == 2) { done = read_trace_pipe(fd, & rqst); } else { done = read_trace(infile, & rqst); }

/* begin the sim execution */ while (done == 0) {

/* insert it */ if (type == 0) { ucode_insert(rqst); } else { ucode_opt_insert(rqst); }

/* read a request from the input file */ if (input == 2) { done = read_trace_pipe(fd, & rqst); } else {


done = read_trace(infile, & rqst); } }

/* flush any remaining requests */ ucode_flush();

/* free ucode memory */ ucode_free();

if (input == 0) { fclose(infile); } else if (input == 2) { close(fd); } infile = NULL;

if (stats == 1) { print_stats(); }

return 0; } /* EOF */
