HARDWARE DESIGN OF MESSAGE PASSING ARCHITECTURE ON HETEROGENEOUS SYSTEM

by

Shanyuan Gao

A dissertation submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering

Charlotte

2013

Approved by:

Dr. Ronald R. Sass

Dr. James M. Conrad

Dr. Jiang Xie

Dr. Stanislav Molchanov

© 2013 Shanyuan Gao
ALL RIGHTS RESERVED

ABSTRACT

SHANYUAN GAO. Hardware design of message passing architecture on heterogeneous system. (Under the direction of DR. RONALD R. SASS)

Heterogeneous multi/many-core chips are commonly used in today's top-tier supercomputers. Similar heterogeneous processing elements — or, computation accelerators — are commonly found in FPGA systems. Within both multi/many-core chips and FPGA systems, the on-chip network plays a critical role by connecting these processing elements together. However, the common use of the on-chip network is for point-to-point communication between on-chip components and the memory interface. As the system scales up with more nodes, traditional programming methods, such as MPI, cannot effectively use the on-chip network and the off-chip network, and communication can therefore become the performance bottleneck. This research proposes an MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE improves the communication performance by offloading the communication workload from the general processing elements. On the other hand, the MPE provides a direct interface to the heterogeneous processing elements, which can eliminate the need for data to pass through the OS and libraries. Detailed experimental results have shown that the MPE can significantly reduce the communication time and improve the overall performance, especially because of its tight coupling with the network. Additionally, a hybrid "MPI+X" computing system is tested, and it shows that the MPE can effectively offload the communication and let the processing elements play to their strengths on the computation.

ACKNOWLEDGMENTS

No words can express my appreciation to my advisor, Dr. Ron Sass. Your wisdom and character have persistently nurtured and guided me through all the joys and difficulties during this journey, and continually so. I am grateful to my committee, Professor James Conrad, Professor Jiang Xie, and Professor Stanislav Molchanov. Your advice has challenged me in every aspect and perfected this work. I would like to thank the RCS lab (Andy, Will, Robin, Bin, Scott, Yamuna, Rahul, Ashwin, Shweta, and countless others). Our untiring discussions have inspired me to explore new directions. Your help has contributed to many parts of this work. I would like to thank my mother and my father for their full-hearted support. Rong, thank you for your love. Without you, none of this matters.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
  1.1 High-Performance Computing
  1.2 Interconnect and Communication
  1.3 Motivation
  1.4 Thesis Question
CHAPTER 2: BACKGROUND
  2.1 Field Programmable Gate Array
  2.2 Computational Science
  2.3 Top500 Supercomputers
  2.4 Message-Passing
  2.5 Benchmarks
  2.6 Amdahl's Law
  2.7 Communication Model
CHAPTER 3: RELATED WORK
  3.1 MPI Related Research
    3.1.1 Point-to-point Communication
    3.1.2 Collective Communication
    3.1.3 Hardware Optimization
  3.2 On-chip Message-Passing
    3.2.1 Raw
    3.2.2 Intel Terascale Computing
    3.2.3 RAMP
    3.2.4 Reconfigurable Computing Cluster
CHAPTER 4: DESIGN
  4.1 Design Infrastructure
    4.1.1 Off-chip Network
    4.1.2 On-chip Network
    4.1.3 Network Interface
    4.1.4 Base System
    4.1.5 Miscellaneous IP Cores
  4.2 Stage 1: Hardware Message-Passing Engine
    4.2.1 Point-to-point Communication
    4.2.2 Collective Communication
  4.3 Stage 2: Heterogeneous System
    4.3.1 Parallel FFT Operation
    4.3.2 Parallel Matrix-Vector Multiplication
CHAPTER 5: EVALUATION AND ANALYSIS
  5.1 Evaluation Infrastructure
  5.2 Testing Methodology
  5.3 Stage 1 Experiment
    5.3.1 Barrier Performance Result
    5.3.2 Broadcast Performance Result
    5.3.3 Reduce Performance Result
    5.3.4 Allreduce Performance Result
    5.3.5 Summary
  5.4 Communication Model
    5.4.1 Linear Fitting for Barrier
    5.4.2 Latency Model
  5.5 Stage 2 Experiment
    5.5.1 Parallel Fast Fourier Transformation
    5.5.2 Parallel Matrix-Vector Multiplication
    5.5.3 Hybrid Computing System
  5.6 Validation
CHAPTER 6: CONCLUSION
REFERENCES

LIST OF TABLES

TABLE 1.1: Top500 HPC trend
TABLE 5.1: Broadcast measurement vs. simulation
TABLE 5.2: Reduce measurement vs. simulation
TABLE 5.3: Bandwidth measurement vs. simulation
TABLE 5.4: Resource utilization of hardware MPE
TABLE 5.5: Performance improvement of hardware MPE
TABLE 5.6: Performance improvement of MPE on heterogeneous systems

LIST OF FIGURES

FIGURE 1.1: Configuration of heterogeneous HPC system
FIGURE 1.2: Communication of future heterogeneous HPC system
FIGURE 1.3: Proportion of collective operation in synthetic benchmark
FIGURE 1.4: Traditional operation flow
FIGURE 1.5: Operation flow of hardware message-passing
FIGURE 2.1: Diagram of barrier operation
FIGURE 2.2: Diagram of broadcast operation
FIGURE 2.3: Diagram of reduce operation
FIGURE 2.4: Amdahl's Law, performance gain of parallelism
FIGURE 4.1: Direct connected off-chip network with the router
FIGURE 4.2: 4-ary 3-cube torus network
FIGURE 4.3: On-chip components connected around the crossbar
FIGURE 4.4: Signal interface of LocalLink
FIGURE 4.5: Hardware base system with the on-chip router
FIGURE 4.6: FSM of barrier operation
FIGURE 4.7: FSM of broadcast operation
FIGURE 4.8: FSM of reduce operation
FIGURE 4.9: Topologies of hardware collective communication
FIGURE 4.10: Block diagram of MPE in heterogeneous system
FIGURE 4.11: Typical programming method for parallel heterogeneous system
FIGURE 4.12: FFT Decimation-In-Time
FIGURE 4.13: FFT Decimation-In-Frequency
FIGURE 4.14: Block diagram of FFT Core
FIGURE 4.15: Block diagram of FFT IO
FIGURE 4.16: Block diagram of vector-vector multiplication
FIGURE 4.17: Hybrid of hardware and thread
FIGURE 5.1: MPE barrier using different topologies
FIGURE 5.2: Software barrier
FIGURE 5.3: Bandwidth of different broadcast topologies
FIGURE 5.4: Bandwidth of software broadcast
FIGURE 5.5: Latency of different broadcast topologies
FIGURE 5.6: Latency of software broadcast
FIGURE 5.7: Bandwidth of different reduce topologies
FIGURE 5.8: Bandwidth of software reduce
FIGURE 5.9: Latency of different reduce topologies
FIGURE 5.10: Latency of software reduce
FIGURE 5.11: Execution time of different allreduce topologies
FIGURE 5.12: Latency of software allreduce
FIGURE 5.13: Mathematic fitting for MPE barrier
FIGURE 5.14: Time chart of broadcast operation
FIGURE 5.15: Latency simulation of broadcast
FIGURE 5.16: Time chart of reduce operation
FIGURE 5.17: Latency simulation of reduce
FIGURE 5.18: Max bandwidth simulation of broadcast and reduce
FIGURE 5.19: Communication impact on software FFT
FIGURE 5.20: Comparison of communication with software FFT
FIGURE 5.21: Communication impact on hardware FFT
FIGURE 5.22: Comparison of communication with hardware FFT
FIGURE 5.23: Communication impact on software MACC computation
FIGURE 5.24: Communication impact on accelerated MACC computation
FIGURE 5.25: Hardware MACC and MPE
FIGURE 5.26: Communication impact on hybrid computing system

LIST OF ABBREVIATIONS

CPU        Central Processing Unit
COTS       Commodity Off The Shelf
IC         Integrated Circuit
PCB        Printed Circuit Board
IP         Intellectual Property
FPU        Floating Point Unit
FPGA       Field Programmable Gate Array
Flops      Floating point operations per second
FIFO       First In, First Out
PetaFlops  10^15 Flops
TeraFlops  10^12 Flops
GigaFlops  10^9 Flops
MPI        Message-Passing Interface
PE         Processing Element
SOC        System on Chip
PLB        Processor Local Bus
MPMC       Multiport Memory Controller
PC         Personal Computer
UART       Universal Asynchronous Receiver/Transmitter
P2P        Point-to-point

CHAPTER 1: INTRODUCTION

Frequency scaling has played a major role in pushing the computer industry forward. Nevertheless, due to the memory wall, the instruction-level parallelism (ILP) wall, and the power wall [1,2], conventional frequency scaling has shown diminishing returns in performance in the past few years. As a result, research in academia and industry has shifted its focus towards the multi/many-core era. New single-chip architectures have been designed and manufactured to explore and exploit parallelism rather than single-thread performance. To make use of the massive number of cores — or, processing elements (PEs) — the on-chip and off-chip interconnect becomes the critical component of these new architectures. Presently, these multi/many-core processors follow traditional multi-chip symmetric multiprocessor (SMP) designs, which are integrated onto a single chip. Consequently, interconnect designs mainly provide point-to-point communication and are mostly used for the general shared-memory processor model. This is sufficient for desktop personal computers; however, in large scale systems with many heterogeneous PEs, this could lead to seriously inefficient communication and make the general-purpose processor the bottleneck of the system.

1.1 High-Performance Computing

High-Performance Computing (HPC) focuses on computing methodologies that solve complex computational problems in the shortest possible time. High-performance computers, often called supercomputers, are machines built to fulfill these computing needs. Frequency scaling, while pushing the PC industry forward, also benefitted the HPC world in the form of commodity off-the-shelf (COTS) clusters, also known as Beowulf style clusters [3]. These clusters achieved great success by integrating low cost commodity components, and therefore became the mainstream of HPC machines in the commercial market. Traditionally, HPCs were built in a homogeneous fashion, in which uniformly distributed general-purpose processors were used. In recent years, researchers built heterogeneous HPC systems that incorporated not only general-purpose PEs, such as the dual-core and quad-core processors from Intel and AMD [4,5], but also some modern multi/many-core computing accelerators, such as General Purpose computation on Graphics Processing Units (i.e. GPGPUs) from Nvidia and the Cell Broadband Engine (Cell B.E.) from a Sony Toshiba IBM partnership [6,7]. The newly built machines achieved the PetaFlops computing milestone in June 2008 [8]. Since then, more HPC systems have been using heterogeneous components, and the trend is heating up. However, as more heterogeneous components are used in these ever-larger HPC systems, the communication hierarchy between these PEs becomes more complex, which could possibly slow down the progress towards the next HPC target — Exascale Computing [9].

1.2 Interconnect and Communication

The term interconnect can be used in different ways. From the PC point of view, the interconnect connects the discrete chip-sets together, for example, the processor ICs, memory, video card, and other peripherals. A bus is the common term for an interconnect that shares physical connections. The peripherals connected to the bus are often categorized as masters and slaves. Because the sharing mechanism can cause contention, in some systems, multiple buses are used. With the emergence of multicore and System-on-Chip, multiple components can be pushed into one single silicon device. Traditional system interconnects, such as buses, are apparently inadequate because the growing number of on-chip components would compete for the shared resources. On-chip networks, proposed for modern multicore architectures, can take advantage of hardware that has very short signaling and is tightly coupled with the on-chip components, thereby providing efficient communication between the on-chip components.

Table 1.1: Top500 HPC trend

TOP500 list            Sys. 2011   Sys. 2012   Perf. 2011   Perf. 2012
Infiniband               41.8%       44.8%       38.7%        32.5%
Gigabit Ethernet         44.8%       37.8%       19.3%        12.6%
Custom Interconnect       N/A         N/A        24.1%        36.8%

The common use of the on-chip network is point-to-point communication, such as Intel QPI [10, 11] and HyperTransport [12, 13]. In the HPC world, the interconnect has another definition. These interconnects either directly or indirectly connect the distributed systems together. Each distributed system has its own OS and libraries, which handle the communication between each system. The Beowulf style cluster [3, 14], which uses many cost-effective COTS components, has made Fast Ethernet and Gigabit Ethernet popular in HPC systems. There are some less popular interconnects as well: some obsolete HPC systems use proprietary interconnects, such as Connection Machine [15], iWarp [16], and IBM SP-2 [17]. Nowadays, most commercial interconnects are standardized, such as Quadrics [18], Myrinet [19] and InfiniBand [20, 21]. Recent Top500 lists offer an interesting observation about the interconnect family. Table 1.1 shows that from 2011 to 2012, Gigabit Ethernet lost system share while Infiniband increased its system share. From a performance point of view, both Gigabit Ethernet and InfiniBand lost share to custom networks. Another interesting observation is that the top machines on the Top500 list all possess custom or proprietary interconnects. Although the performance gap can be due to many reasons — such as processor types, number of processors, or operating systems — one obvious reason is the use of different interconnects. As the HPC world is shifting to modern multicore processors, the interconnect hierarchy in the heterogeneous HPC system is becoming very complex.


Figure 1.1: Configuration of heterogeneous HPC system

As illustrated in Figure 1.1, one PCB board could host multiple sockets of general-purpose processors (CPUs), several heterogeneous processors (i.e. GPGPUs), and some custom computing accelerator chips. (To be general, processing elements (PEs) is used to represent CPU cores, GPU cores, or any other hardware accelerator cores in the following text unless otherwise mentioned.) Within each chip, PEs are connected via a certain type of on-chip network. Off-chip networks are used to connect these packaged ICs and PCB boards together. When communications occur, a single chip can possibly participate in multiple communication groups. Figure 1.2 shows the rack view of PEs involved in communications; the number denotes which communication group the core is involved in. One communication could use PEs across the entire HPC system, such as PEs on different silicon devices or on different PCB boards. Because of different physical locations, the communication time between PEs could be non-deterministic.


Figure 1.2: Communication of future heterogeneous HPC system

1.3 Motivation

With a massive number of PEs working in parallel, the system will generate a large volume of communication for coordinating and exchanging data. Depending on the hierarchical position, the communication time between the PEs will vary over a dispersed range. To find out how this large volume of communication affects the overall performance on a real HPC system, a preliminary test has been performed on a commercial multicore HPC cluster. The Python cluster, located at UNC Charlotte, consists of 384 computing cores, with both Gigabit Ethernet and QDR InfiniBand interconnects [22]. Using a standard Message-Passing Interface (MPI) implementation, OpenMPI 1.4.3 [23], with VampirTrace [24], a synthetic benchmark was written to test collective communications against a simple calculation. By keeping the total problem size constant and varying the number of computation units (tasks), we are able to profile the fraction of the whole benchmark time that each phase occupies. The ratio of communication and computation is reported in Figure 1.3.
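The benchmark code itself is not reproduced in this excerpt; the C sketch below only illustrates the structure of such a test. The problem size W, the iteration count, the dummy kernel, and the use of MPI_Reduce as the timed collective are assumptions made for illustration (the actual runs also exercised barrier and broadcast), but the measurement pattern (a fixed total problem, a per-task share of W/n, and separate timers for the computation and communication phases) is the one described above.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative sketch only: a fixed total problem size W is split across
     * n tasks; every iteration does the local share of a dummy computation
     * and then one collective operation.  Timing the two phases separately
     * yields the communication/computation ratio plotted in Figure 1.3.
     * W, ITERS, and the kernel are invented values. */
    #define W     (1L << 24)
    #define ITERS 100

    int main(int argc, char **argv)
    {
        int rank, n;
        double local = 0.0, global;
        double t_comp = 0.0, t_comm = 0.0, t0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &n);

        long chunk = W / n;                   /* same total work, smaller share per task */
        for (int it = 0; it < ITERS; it++) {
            t0 = MPI_Wtime();
            for (long i = 0; i < chunk; i++)  /* stand-in for the "simple calculation"   */
                local += (double)i * 1e-9;
            t_comp += MPI_Wtime() - t0;

            t0 = MPI_Wtime();                 /* replace with MPI_Barrier or MPI_Bcast   */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            t_comm += MPI_Wtime() - t0;
        }

        if (rank == 0)
            printf("communication: %.1f%% of total time\n",
                   100.0 * t_comm / (t_comm + t_comp));
        MPI_Finalize();
        return 0;
    }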


Figure 1.3: Proportion of collective operation in synthetic benchmark

It can be observed in Figure 1.3 that as the number of tasks grows, the proportion of communication time increases while the proportion of computation time decreases. This occurs for two reasons:

• First, increasing the number of tasks, n, requires more communication (i.e. a large n means more tasks have to coordinate).

• Second, the computation time decreases because an increasing n divides a fixed-size problem into smaller tasks (W/n).

(Note that the synthetic benchmark may not describe the behavior of all real applications. Often, the user will increase the problem size as the system scales up. However, the synthetic test does highlight the trend that an increasing number of tasks can make communication the bottleneck of the whole system.) To solve this communication bottleneck, many research efforts have been put into optimizing MPI in traditional homogeneous systems. As will be seen in Chapter 3, the software nature of MPI has limited the performance improvement. The standard MPI communication needs to pass through multiple software protocol stacks, which introduces overhead, as shown in Figure 1.4 (a). In the heterogeneous HPC system, the communication bottleneck becomes worse because not all the PEs are able to host a full-fledged OS. The communication between these PEs is still handled by the host CPU. Although each communication load can be small, when the number of PEs is large, the volume of communication will increase and can therefore saturate the CPU and create the bottleneck, as shown in Figure 1.4 (b). With the abundant transistor resources, we conjecture that moving some MPI operations into hardware can avoid the traditional protocol stacks, which leaves a small amount of interaction with the OS and the host CPU, as shown in Figure 1.5 (a). Moving the message-passing function into hardware can also apply to heterogeneous systems. With the direct messaging function in hardware, these heterogeneous PEs can talk to the network without overloading the general processor, as shown in Figure 1.5 (b).


Figure 1.4: Traditional operation flow: (a) homogeneous, (b) heterogeneous


Figure 1.5: Operation flow of hardware message-passing: (a) homogeneous, (b) heterogeneous

1.4 Thesis Question

Future VLSI technology will have millions of heterogeneous PEs assembled into one single HPC system. When working together, these PEs could produce a large volume of communication for coordinating and exchanging data. In order to provide communication for these PEs, a complex hierarchical interconnect should be used, which would exhibit diverse communication times between PEs in different hierarchical levels. Traditionally, communications between the PEs are handled in software, such as the OS or libraries. As the volume of communication increases, the software can become the bottleneck because of the software overhead. While VLSI technology enables multi/many-core heterogeneous PEs on chip, it also provides silicon resources to build the on-chip network, which can be used to handle communications directly in hardware. The streaming and parallel nature of the hardware on-chip network would provide short signaling and tight coupling between the PEs. Nowadays, on-chip networks are commonly used for point-to-point communications between 4 or 8 cores; however, this is not practical for hundreds of cores because point-to-point communication requests would line up sequentially, thereby becoming the bottleneck even in hardware. Facing the future massive number of on-chip and off-chip PEs and the complex interconnect hierarchy, the question arises: Can hardware be used to provide a unified view of the heterogeneous system and provide message-passing function to the chip as well as to the cluster? To answer this question, a custom hardware communication engine will be designed and tested against the traditional software communication method. With different sets of experiments, the thesis question can be answered in the following aspects:

• Is the hardware communication engine practical and feasible? If the communication function can be implemented in hardware and function as the traditional software communication does, it is a practical design. If the hardware communication engine consumes a reasonable amount of hardware resources — on par with a general processor — the hardware communication engine is feasible.

• Can the hardware communication engine improve the overall performance? By comparing different experimental configurations, we can quantitatively study the advantages and the limitations of the hardware communication engine.

• Is the hardware communication engine scalable when the system grows? By measuring the detailed communication cost and the software overhead, a model can be established to predict the performance beyond the test infrastructure. Compared with the software communication, if the hardware communication engine shows a similar or slower growth trend of the communication time, it is recognized as scalable.

• Can the hardware communication engine be used in the heterogeneous system? A heterogeneous computing environment will be designed. Hardware computing accelerators would interact with the hardware communication engine directly without involving the central processor. The implementation would prove the applicability.

• In the heterogeneous system, can the hardware communication engine bring performance gains? Tests will be designed to measure the communication time of the hardware communication engine and of the software communication.

CHAPTER 2: BACKGROUND

This chapter provides background knowledge for the proposed research — covering FPGAs, computational science, Top500 supercomputers, the Message-Passing Interface, benchmarks, and communication models. Technical and research related work can be found in Chapter 3.

2.1 Field Programmable Gate Array

A Field Programmable Gate Array (FPGA) is a reconfigurable IC on which the hardware fabric can be modified and programmed after manufacture. To program an FPGA, a Hardware Description Language (HDL) is commonly used to describe the hardware wiring at the register-transfer level. Using tools from the vendor or from third parties, an HDL design is translated and implemented as a device-specific bitstream, which can be used to program the device. Intellectual Properties (IPs) are design blocks which are modularized and can be inserted into a design with little or no configuration. There are two types of IP: one is called soft IP, which normally contains logic design only. This type of IP is portable and requires FPGA tools to translate it into fabric. The other type of IP is called diffused IP, sometimes known as hard IP. This type of IP is a set of device-specific circuits which are already implemented in the device, such as on-chip memory blocks, high-speed transceivers, and processor cores. To make use of a diffused IP, the designer needs to instantiate the IP and connect its IO signals in the design, and the tools will wire the signals and activate the IP. Because of the reconfigurability, FPGAs are often used to quickly prototype and validate hardware designs. The hardware characteristics of the FPGA have also made it a natural fit to execute some complicated operations in hardware. Furthermore, the FPGA is capable of processing operations in parallel. Therefore, FPGAs are often used as co-processors in some applications, which sometimes can achieve orders of magnitude performance improvement compared to traditional general processor systems [25, 26]. Further details of how an FPGA works and how designs are implemented can be found in [27].

2.2 Computational Science

The advancement of science and technology has made the scale of research, design, and decision systems large and complex. The traditional theoretical and experimental approaches to solving these problems have become less efficient and sometimes can barely meet the requirements. Computational science, an emerging approach based on the development of modern computing technology, opened the door to some new scientific and engineering areas, such as bioinformatics, computational fluid dynamics, financial modeling, etc. A computational science problem, built upon certain mathematical models and numerical algorithms, normally requires a huge amount of computation. Generally speaking, the computation cannot be accomplished by a single workstation in the required time. The straightforward answer to a computational science problem is to use supercomputers, which generally consist of many computation units computing in parallel. By breaking the big problem into smaller pieces and distributing the pieces to the computation units, the users can exploit the parallelism of the computation and solve the computational problem efficiently.

2.3 Top500 Supercomputers

Started in 1993, the Top500 has been ranking the world's fastest computers biannually. Back in June 1993, because building a supercomputer required strong financial support, only big companies such as Thinking Machines, Fujitsu, and HP were the major players in the industry. At that time, each company had its own proprietary architecture. The number of processors was small: 20% of the machines on the list had a single processor, and nearly half of the machines on the list had four or fewer processors. As technology advanced, powerful processors were designed and the architecture of these supercomputers changed. In June 2000, more than 80% of the machines on the list had 33 to 256 processors. Scalar processors ruled the market. Cluster architecture started to dominate the market. Specifically, Beowulf clusters — featuring off-the-shelf hardware components, open source software, and Ethernet connections — were very cost-effective in providing HPC power to budget-limited users. From June 2000 to June 2003, the ratio of cluster architecture in the Top500 list grew from 6.4% to 29.8%. Cluster architecture continued to dominate the market in the past few years (> 80%). In the latest Top500 list, June 2011, 4k to 16k processors was the typical number of processors in a machine. Roadrunner [8], the first machine to reach PetaScale in June 2008, was outperformed by Jaguar [28] with 1.7 PetaFlops in November 2009. Tianhe-1A [29] surpassed Jaguar with 2.566 PetaFlops after 12 months. However, the K computer [30] challenged Tianhe-1A with 8.162 PetaFlops in June 2011, and after another 6 months it reached 10.51 PetaFlops [31]. 4-core and 6-core general processors were used in most of the machines. Modern processors, such as GPGPUs and the Cell B.E., became popular in the list.

2.4 Message-Passing

Along with the development of the hardware, as mentioned in Section 2.3, there are several programming models for HPC systems. However, message-passing has practically been the de facto standard for programming HPC machines. The message-passing standard specifies the programming model for moving data explicitly between the tasks. The Message-Passing Interface (MPI) is the library specification for implementing the standard. There are several MPI implementations maintained by different research groups, such as MPICH and OpenMPI [32, 23].


Figure 2.1: Diagram of barrier operation

Two types of communication primitives are widely used in MPI applications. One is the point-to-point operation, which involves communication between exactly two tasks. The other type is called collective communication, which involves communication among a group of tasks. Though there are some variants of point-to-point operations, the functions are essentially the same — passing data from one task to the other. Collective communications, on the contrary, perform various duties — including synchronization, exchanging data, or performing computations.

Among all the collective communication primitives, the barrier operation is a relatively simple but important operation; it is widely used in MPI as well as other programming models. The function of barrier is to synchronize multiple parallel tasks, and it is critical to maintain correct ordering of parallel operations in some algorithms. The semantics of barrier is to block all tasks when they enter the operation and wait until every task has reached the barrier. At that point, all tasks are allowed to proceed. As shown in Figure 2.1, at t1, task 1 reaches the barrier, but it needs to wait for the other tasks. At t2, all tasks reach the barrier and are released thereafter.
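For reference, a minimal MPI program exercising the barrier semantics described above might look as follows (an illustrative example, not code from this dissertation); the sleep call is only there to make the ranks arrive at the barrier at different times, as in Figure 2.1.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        sleep((unsigned)rank);              /* ranks reach the barrier at different times (t1) */
        MPI_Barrier(MPI_COMM_WORLD);        /* every rank blocks here until all have arrived   */
        printf("task %d released\n", rank); /* all ranks proceed together (t2)                 */

        MPI_Finalize();
        return 0;
    }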

Broadcast is another frequently used collective primitive. Literally, broadcast distributes the data from the source task to all the other tasks in the communication. As shown in Figure 2.2, before the broadcast, tasks own different data. After the operation, all the recipient tasks own the same data as the source task.


Figure 2.2: Diagram of broadcast operation


Figure 2.3: Diagram of reduce operation

Another collective communication is reduce, which performs a commutative computation (such as ADD or MAX) on the data passed to the operation. Reduce provides a function to reduce the dimension of the input data by 1. For example, if a set of parallel tasks computes the max value of each row in a 2-D matrix, then the reduce operation will fulfill the job and return the result as a 1-D vector in the root task. In Figure 2.3, before the reduce, each task has a column of the input data. After the operation, task 0 (the root) has the result, the max value of each row.
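A small illustrative program (again, not taken from this dissertation) mirroring Figures 2.2 and 2.3 is sketched below: the root first broadcasts a four-element vector, then a MAX reduction over each task's four-element column leaves the per-row maxima on the root. The column values are arbitrary placeholders.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        int vec[4] = {0, 0, 0, 0};          /* broadcast payload (A B C D on the root) */
        int column[4], maxima[4];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) { vec[0] = 'A'; vec[1] = 'B'; vec[2] = 'C'; vec[3] = 'D'; }
        MPI_Bcast(vec, 4, MPI_INT, 0, MPI_COMM_WORLD);   /* every task now holds A B C D */

        for (int i = 0; i < 4; i++)                       /* this task's column of the matrix */
            column[i] = (rank * 7 + i * 3) % 10;
        MPI_Reduce(column, maxima, 4, MPI_INT, MPI_MAX,   /* element-wise max across tasks    */
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("row maxima: %d %d %d %d\n",
                   maxima[0], maxima[1], maxima[2], maxima[3]);
        MPI_Finalize();
        return 0;
    }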

2.5 Benchmarks

Benchmarks are used to measure the performance of the HPC system. Some benchmarks are specifically designed for a certain aspect of the system, such as the floating point operation rate, the IO bandwidth, or the communication latency. Some benchmarks are extracted from scientific applications, emulating the real operations in the HPC system and measuring the overall performance. The High Performance Linpack Benchmark (HPL) is the standard benchmark used for measuring the performance and ranking of the TOP500 HPC systems. The algorithm embedded in HPL is a double precision (64-bit, IEEE-754) dense linear system LU solver. MPI is utilized by HPL to distribute the data, synchronize the tasks and collect the result. By properly setting up the system parameters (problem size, row partition, column partition, etc.), HPL can test the accuracy of the result and measure the performance of the system in FLOPS (floating point operations per second). The NAS Parallel Benchmark (NPB) is a set of benchmarks developed by NASA. Since NPB is derived from computational fluid dynamics (CFD) applications, it is widely recognized and used to help evaluate the performance of HPCs. The NPB consists of five kernels and three pseudo-applications, each one having several "classes", targeting different problem sizes on different HPCs [33, 34].

• EP: An “embarrassingly parallel” kernel. It provides an estimate of the upper achievable limits for floating point performance, i.e., the performance without significant interprocessor communication.

• MG: A simplified multigrid kernel. It requires highly structured long distance communication and tests both short and long distance data communication.

• CG: A conjugate gradient method is used to compute an approximation to the smallest eigenvalue of a large sparse symmetric positive definite matrix. This kernel is typical of unstructured grid computations in that it tests irregular long distance communication employing unstructured matrix vector multiplication.

• FT: A 3-D partial differential equation solution using FFTs. This kernel performs the essence of many spectral codes. It is a rigorous test of long distance communication performance.

• IS: A large integer sort. This kernel performs a sorting operation that is important in "particle method" codes. It tests both integer computation speed and communication performance.

• BT: Solution of multiple, independent systems of nondiagonally-dominant, block tridiagonal equations with a (5 x 5) block size.

• SP: Solution of multiple, independent systems of nondiagonally-dominant, scalar pentadiagonal equations.

• LU: Regular-sparse, block (5 x 5) lower and upper triangular system solution.

2.6 Amdahl’s Law

Beyond the benchmarks, characterizing and modeling the parallelism and communication is another active research area. In the HPC world, one of the most well-known and classic theories is Amdahl's Law [35], which characterizes the speedup of parallelizing an application program:

Speedup = 1 / ((1 − F) + F/N)

Here F (0 < F < 1) denotes the fraction of the program which can be parallelized, and N represents the number of computing units. Assuming the units can achieve an N-times speedup on the parallel portion, the formula suggests that the maximum performance is limited by the sequential (non-parallel) part of the application. The speedup for different values of F is illustrated in Figure 2.4.
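As a minimal worked example (not part of the original text), letting N grow without bound makes the limit on the speedup explicit:

\[
\lim_{N \to \infty} \frac{1}{(1-F) + F/N} \;=\; \frac{1}{1-F},
\qquad \text{e.g. } F = 0.9 \;\Rightarrow\; \text{Speedup} < 10 .
\]

That is, even with unlimited processing units, a program that is 90% parallelizable can never run more than 10 times faster than its serial version, which matches the flattening of the F = 0.9 curve in Figure 2.4.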

2.7 Communication Model

David Culler et al. proposed the LogP model [36], which models the communication of modern and future massively parallel processor systems. The model is based on the four parameters listed below:

L: an upper bound on the latency, or delay, incurred in communicating a message containing a word (or small number of words) from its source module to its target module.

o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations.


Figure 2.4: Amdahl’s Law, performance gain of parallelism

g: the gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a processor. The reciprocal of g corresponds to the available per-processor communication bandwidth.

P: the number of processor/memory modules. We assume unit time for local operations and call it a cycle.
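As an illustrative consequence of these definitions (a standard reading of the model, not a result from this dissertation), the end-to-end time to deliver one small message is the sending overhead, plus the network latency, plus the receiving overhead:

\[
T_{\text{message}} \;\approx\; o + L + o \;=\; L + 2o .
\]

A processor issuing a stream of such messages is further limited by the gap g, since at most one message can be injected every g cycles.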

CHAPTER 3: RELATED WORK

3.1 MPI Related Research

Because MPI is the standard interface for programming supercomputers, there is a large body of HPC-related research on MPI. Following the specification, users can write applications in MPI and run the applications on different parallel machines. When running the program, the users are free to specify which algorithm to choose and which hardware interconnect to use. These flexible features let the users focus on the functionality of their applications, while letting researchers focus on optimizing MPI.

3.1.1 Point-to-point Communication

Point-to-point communications, such as send and receive, are the fundamental operations in MPI. The MPI specification defines several varieties of send and receive primitives, each one with different handshaking protocols and different buffering options. The varieties have different performance, and meanwhile provide the programmers the freedom to choose the best point-to-point operation based on their needs. Some research has been done to optimize the point-to-point operations. In [37, 38], the researchers leveraged RDMA to develop a set of customized protocols to maximize the performance of point-to-point communication. TMD-MPI [39] implemented MPI Send and MPI Recv in an FPGA.

3.1.2 Collective Communication

Compared to point-to-point operations, collective communication, which involves multiple tasks, has received a lot more research attention, ever since the advent of parallel computing. An interesting profiling study in [40] examined the behavior of real MPI applications running on state-of-the-art clusters. The statistical results showed that more than 40% of the execution time of all MPI calls is spent on MPI Allreduce and MPI Reduce. To alleviate the heavy load on these two primitives, the author proposed several reduce algorithms optimized for different vector sizes and numbers of processes in [41]. By experimenting with different parameters on the target machine, a 3× to 100× speedup of the reduce operation could be achieved.

The work in [42] presented several barrier algorithms. With respect to algorithms, one conventional approach is to create a head (or root) node which receives all the barrier messages and distributes the clear messages. Specifically, Central [43] is one algorithm where a counter is kept on one node to track the number of nodes that have reached their barrier. When the counter equals the size of the network, the clear barrier message is issued. The basic implementation of MPI Barrier used within OpenMPI utilizes point-to-point communications to pass barrier messages to and from each node and the head node, which is called Sequential Tree in [44]. Other tree-based barriers such as the Combining Tree [45, 46] can differ based on the internal tree structure and the decision making to achieve parallelism in message transmission. Alternatively, the Butterfly barrier [47] and Recursive Doubling [48] utilize pairwise message exchange to implement the barrier instead of using a head node to issue the clear barrier decision.

In [49], the authors comprehensively summarized the design and implementation of collective communication on several distributed-memory architectures, covering the research of the past 30 years. This paper not only summarized the algorithms used to implement the collective communication instances, but also analyzed these algorithms using mathematical models. Based on the commonly used algorithms (minimum-spanning tree algorithms (MST), bidirectional exchange algorithms (BDE), and the bucket algorithm), the authors proposed several hybrid algorithms focusing on different message sizes and architectures. The test results on a Myrinet-connected Xeon cluster showed that the hybrid algorithms achieved performance improvements in most situations compared to common implementations of MPI such as MPICH.

The work in [44] summarized the general algorithms for collective communication. By experimenting with the algorithms under different parameters (message size, communicator size, user application, etc.), a statically tuned collective communication library was obtained. The results were reported to improve the performance by 35% to 650% when compared to native MPI implementations. However, in most cases, the static optimization is tuned for a particular architecture or a specific application. The static optimization requires an extensive test of all the combinations of the parameters, which is not possible when the system scale is large. So mathematical models were used to predict the performance of the algorithms. In [50], the author used the Hockney, LogP/LogGP, and PLogP models to analyze the performance of the collective algorithms. Compared to the statically tuned library, the prediction of the mathematical models can achieve a near-optimal solution. In [51], a quadtree encoding method was used to build a run-time decision tree based on statistical learning. This research showed a feasible approach to optimizing collective communication at run-time.

3.1.3 Hardware Optimization

Although there is a large body of work related to changing the software optimization (switching the algorithm depending on the size and number of tasks participating in the operation), some of the optimizations can be applied in hardware as well.

Several algorithms were proposed in [52] for "global combination", which is now MPI Allreduce, on a 2-D mesh interconnect with wormhole routing. This paper proved that it is possible and efficient to execute global operations on a 2-D mesh interconnect. However, to achieve the best performance over the full range of data sizes, different algorithms should be adopted for different scenarios.

In [53, 54], IBM implemented dedicated networks for Blue Gene/L and Blue Gene/P. The nodes in the system are interconnected via three networks: a 3D torus, a collective tree network, and a global interrupt. The torus network is the main network for point-to-point communication. The collective tree network is capable of providing low latency and high bandwidth for fan-in and fan-out operations (broadcast and reduce). The global interrupt provides configurable OR wires to perform hardware-based synchronization. Beyond BG/L, BG/P features DMA to offload messaging work from processors and achieve better communication and computation overlap.

In [55], the researchers explored a hardware feature of the InfiniBand adapter, ConnectX-2 from Mellanox Technologies. The hardware offloading feature, called CORE-Direct, can offload a series of send, receive and reduction tasks to the adapter. The researchers generalized the collective communication into several primitives and designed these primitives using the hardware feature. The test result showed that the designed MPI Barrier (built from the primitives) achieved almost perfect overlap of computation and communication, as well as some performance improvement of the Recv-Replicate primitive.

The PERCS high-speed interconnect developed by IBM [56] features a Hub chip that is integrated into the compute node. The Hub chip is used to connect local Power7 chips and interconnect with other compute nodes. In the Hub chip, there is a Collective Acceleration Unit (CAU) designed to speed up the collective communication, specifically the barrier, multicast, and reduction. The large-scale PERCS installation, Blue Waters, is being constructed at NCSA, and it is expected to deliver sustained Petascale performance over a wide range of applications.

Cray Inc. designed the Seastar Interconnect [57] and Gemini Interconnect [58] to support high-performance distributed systems. The Portals network interface [59, 60] designed by Sandia National Lab can leverage the hardware DMA on the NIC to bypass the OS and offload the send and receive operations.

As part of the Adaptable Computing Cluster project, [61] implemented MPI Reduce in the FPGA fabric of a Network Interface Card. This has the advantage of using a commodity off-the-shelf interconnect (Gigabit Ethernet, in this case) in a commodity cluster. These ideas were further explored in [62]. Voltaire has recently announced support for collective communications inside of their InfiniBand switch; however, no peer-reviewed report is available yet to characterize the advantages.

The OSU group studied several collective communication primitives [63, 64, 65, 66, 67] on Myrinet. The research has shown that NIC-based collective operations are able to reduce the host processor involvement, avoid bus traffic and increase the tolerance to process skew and OS effects.

In the work of [68], the authors described an implementation of collective communication with a combination of shared and remote memory access (RMA) protocols. The proposed approaches were tested on an IBM SP with LAPI support for RMA, and achieved performance improvements in all test configurations.

3.2 On-chip Message-Passing

While the previous section describes research focusing on the standard MPI implementation and its optimization, the message-passing concept is not limited to software. Some research and development efforts on on-chip architectures implement similar message-passing mechanisms.

3.2.1 Raw

The Raw Architecture Workstation (Raw) is a tiled multicore architecture that explores the fine-grain parallelism between many replicated processing elements [69]. The key feature of Raw is that the hardware architecture is exposed to the programmers, so the compilers or application designers are required to choose the correct tile and program the routing between the tiles. The prototype of the Raw processor is a 4 × 4 tile structure. The processor core in each tile is an 8-stage MIPS processor along with local memory and the routers. The on-chip network between the tiles consists of a static network and a dynamic network. The static network is used for passing operands and data streams within or between the tiles, while the dynamic network provides DMA or message passing. Relatively speaking, the dynamic network has lower performance than the static network. Some researchers utilized the Raw processor in their applications and achieved considerable performance gains [70].

3.2.2 Intel Terascale Computing

As the leading manufacturer in the industry, Intel has several research and experimental projects aimed at future generations of processors and computer systems. Based on the current trend of multicore, Intel has envisioned future processors having hundreds of cores on a single chip. More importantly, the envisioned architecture would rely heavily on the on-chip network, advanced technologies and support for "message-passing". One prototype project implemented 80 simple cores on a single chip [71]. Each core has a message passing router, and the routers are connected as a 2D mesh network that allows message-passing communication. Another 48-core architecture, the "Single-chip Cloud Computer", is built as an experimental processor that resembles a cluster of computers [72]. Besides the components that are common in a system, the designers built an SRAM into each computation tile, called the message passing buffer (MPB), which is able to provide fast communication between cores via messages.

3.2.3 RAMP

RAMP is the acronym for "Research Accelerator for Multiple Processors", which is a group of research projects originating from UC Berkeley [73]. The goal of the RAMP project is to utilize the FPGA as a hardware instrument to prototype and simulate future computer systems, programming languages and other tools. Several prototype machines were built for different research purposes. RAMP-Red is a multiprocessor system with hardware support for transactional memory. On the development board, multiple processors are connected via a switch. A custom cache is designed to support transactional memory. This design is 100 times faster than the software simulation [74]. RAMP-Blue is a manycore message-passing architecture. 1008 MicroBlaze cores are connected with a custom network. The designers chose uClinux as the OS. With the UPC framework and GASNet to support message-passing, the system is able to run the NAS Parallel Benchmark [75].

3.2.4 Reconfigurable Computing Cluster

The Reconfigurable Computing Cluster (RCC) project is investigating the feasibility of cost-effective Petascale clusters of FPGAs [76]. A prototype machine is built with 64 ML-410 development boards. The network design is the critical component within the RCC project. The initial design includes a custom high-speed network card, which utilizes the RocketIO and Aurora cores from Xilinx [77]. The custom network, AIREN (Architecture Independent REconfigurable Network), aggregates both the single-FPGA network-on-chip and the multiple-FPGA networks. The bit rate of this high-speed network is measured at 3.2 Gb/s per channel. There are 8 channels on each network card, so different topologies can be built around the hardware. For the researchers' tests, the network can be arranged in a torus structure or a ring network. A DMA engine can be built into the network to fulfill the point-to-point communication. For each transfer, the latency between neighbor nodes is 0.8 µs. With this custom high-speed network, the researchers can have multiple hardware accelerators executing in parallel, obtaining linear speedup from 5.0× to 20.92× [78].

CHAPTER 4: DESIGN

Given the fact that communication frequency and data size will rise with the increasing number of PEs and the growing size of the problem, the processor hosting the OS can be overloaded by the heavy communication and therefore become the bottleneck of the system. In order to solve this bottleneck, the proposed solution is to design a dedicated Message-Passing Engine (MPE) in hardware to handle the communications, especially collective communications. The design is split into two stages. Stage 1 will focus on using hardware to implement the message-passing function and offload some software MPI operations in a homogeneous system. In Stage 2, the design integrates the MPE in the heterogeneous system, in which the hardware MPE will provide communication for different types of processing elements in the system.

4.1 Design Infrastructure

Spirit Cluster, described in Section 3.2.4, is used as the design infrastructure to implement and evaluate the proposed work. Spirit is a cluster of 64 ML-410 FPGA development boards. Each development board has a Xilinx Virtex 4 FX60 FPGA. A high-speed network card has been designed to route 8 high-speed transceiver ports off the board.

4.1.1 Off-chip Network

With the designed custom high-speed network card [77], the development boards can be arranged as a directly connected network, in which each node (FPGA) has a local router with a unique network ID. Routing packets between indirectly connected nodes requires the routers on the intermediate nodes to route the packets through. The example shown in Figure 4.1 is a connection of 8 independent systems via the local routers.

Figure 4.1: Direct connected off-chip network with the router

Figure 4.2: 4-ary 3-cube torus network

Figure 4.2 shows a 4-ary 3-cube torus network, which is the current implementation on Spirit. On each node, 6 out of 8 ports are used. Each port is connected to neighbor nodes in the X+, X-, Y+, Y-, Z+, and Z- directions. Different routing algorithms can be applied to the off-chip network, such as dimensional routing and adaptive routing [79].
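To make the dimensional (dimension-order) option concrete, the following C sketch routes a packet on a 4-ary 3-cube by resolving the X offset first, then Y, then Z, taking the shorter way around each ring. The node-ID encoding, the port numbering, and the function itself are hypothetical illustrations; they do not describe the actual Spirit router tables.

    #include <stdio.h>

    enum { K = 4 };   /* 4-ary: 4 nodes per dimension */
    enum { XPLUS, XMINUS, YPLUS, YMINUS, ZPLUS, ZMINUS, LOCAL };

    /* Return the output port for the next hop from node 'self' toward 'dest',
     * assuming node IDs are encoded as x + K*y + K*K*z. */
    static int next_port(int self, int dest)
    {
        int s[3] = { self % K, (self / K) % K, self / (K * K) };
        int d[3] = { dest % K, (dest / K) % K, dest / (K * K) };

        for (int dim = 0; dim < 3; dim++) {
            int diff = (d[dim] - s[dim] + K) % K;   /* hops needed in the + direction */
            if (diff == 0)
                continue;                           /* aligned, move to next dimension */
            /* take the shorter way around the ring in this dimension */
            return (diff <= K / 2) ? 2 * dim : 2 * dim + 1;
        }
        return LOCAL;                               /* arrived: deliver locally */
    }

    int main(void)
    {
        printf("node 0 -> node 5: first hop on port %d\n", next_port(0, 5));
        return 0;
    }

An adaptive router would instead choose among the profitable dimensions at each hop based on local congestion, which is why both options are cited above.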

4.1.2 On-chip Network

The on-chip network is designed to provide communication between local components. At the same time, it allows on-chip components to communicate with the off-chip network. Several on-chip interconnect methods can be used, such as ring, mesh, or star.

Figure 4.3: On-chip components connected around the crossbar

(Signals between sender and receiver: SOF_N, EOF_N, SRC_RDY_N, DST_RDY_N, DATA.)

Figure 4.4: Signal interface of LocalLink

In previous work, a 16-port crossbar switch and routing module have been implemented; this proposed design will leverage the router and connect the on-chip components around the router, as shown in Figure 4.3.

4.1.3 Network Interface

The network interface is designed to provide a unified view to handle communication between different hierarchical components. As shown in Figure 4.4, a standard network interface — LocalLink from Xilinx [80] — is used in this design. The simple interface of LocalLink provides an efficient handshaking mechanism for communications, especially for streams of data. Other network interfacing protocols can be used as well.
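As a behavioral sketch (in C rather than HDL, and not the actual RTL of this design), the LocalLink handshake of Figure 4.4 can be summarized as: all control signals are active-low, a data word moves only in a cycle where both ready signals are asserted, and SOF_N/EOF_N mark the first and last words of a frame.

    #include <stdbool.h>
    #include <stdio.h>

    /* One cycle's view of the LocalLink signals between a sender and a receiver. */
    struct ll_beat {
        bool sof_n, eof_n;    /* frame delimiters: low on the first/last word   */
        bool src_rdy_n;       /* low when the sender drives valid data          */
        bool dst_rdy_n;       /* low when the receiver can accept data          */
        unsigned data;
    };

    /* A word is transferred only when both sides are ready in the same cycle. */
    static bool ll_transfer(const struct ll_beat *b)
    {
        return !b->src_rdy_n && !b->dst_rdy_n;
    }

    int main(void)
    {
        struct ll_beat first = { .sof_n = false, .eof_n = true,
                                 .src_rdy_n = false, .dst_rdy_n = false,
                                 .data = 0xAB };
        printf("first word %s\n", ll_transfer(&first) ? "accepted" : "stalled");
        return 0;
    }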

4.1.4 Base System

The base system provides a platform for both the Stage 1 and Stage 2 designs. In order to support a message passing environment, a traditional processor-bus-memory architecture is adopted.


Figure 4.5: Hardware base system with the on-chip router

Using the Spirit cluster as the infrastructure, the base system is built within the platform FPGA. Note that the low-frequency embedded processor within the platform FPGA is obviously not suitable for HPC. An embedded processor is used because there is no discrete processor on the development board. However, the idea of using FPGAs to offload message-passing operations can be applied to general discrete processor systems or modern heterogeneous systems as well. Within the FPGA, Xilinx has already provided two diffused embedded PowerPC processors. Using Xilinx tools, the MicroBlaze processor, a soft IP from Xilinx, can also be used. The PowerPC has a higher executing frequency, whereas the MicroBlaze is more configurable and can be easily expanded to multiple cores on a single chip. As shown in Figure 4.5, the system bus is the Processor Local Bus (PLB). The peripherals, including DDR2 memory, UART, interrupt controller, IIC, and LL TEMAC Ethernet, are connected to the PLB. The on-chip router provides the connections for both the on-chip components and the off-chip system. The router also has a bus connection, which is used for setting control registers and the network ID.

4.1.5 Miscellaneous IP Cores

In order to test the functionality and measure the performance of the MPE, some supplementary hardware IPs are needed. One of the IPs is called the source/sink core, which shares the same communication interface as the MPE. The source/sink core can be used as a test core connected to the on-chip router. Controlled by the processor, it can behave like any processor or hardware accelerator sending and receiving a data stream. With the source/sink core, we can test the functional correctness of the MPE in simulation or in hardware. Another type of IP is called the monitor core. This hardware was mentioned in [81]. In this proposed work, the monitor core is a collection of hardware counters that count the clock cycles of various operations. With the monitor core, accurate measurements can be obtained.

4.2 Stage 1: Hardware Message-Passing Engine

In Stage 1, the design implements the communication functions in hardware. The hardware MPE acts like a co-processor, offloading software operations from the CPU. All the ML-410 development boards present the same hardware configuration.

4.2.1 Point-to-point Communication

Point-to-point communication is the basic operation, which semantically transfers one chunk of data from the sender to the receiver. In the standard specification, the variants of software send and receive incorporate different buffering options, which are not necessary in hardware. To implement the point-to-point communication in hardware, while eliminating the involvement of the processors as much as possible, a hardware DMA engine is connected to the router, as shown in Figure 4.5. When a send is requested, the processor passes the address and the length of the data to the DMA. The DMA will fetch the data directly from the DDR2 memory, assemble the packet and push the packet directly into the custom high-speed network. When the data packet arrives at the receiver, the DMA engine will trigger an interrupt to notify the processor. Similar to the software implementation, some collective communications can be easily set up using just send and receive, for example, broadcast, scatter, and gather.
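A hypothetical driver-side view of this send flow is sketched below to show how little the processor does: it only writes the buffer address, length, and destination, then waits for the completion interrupt while the DMA fetches the payload from DDR2 and streams the packet into the router. The register map, base address, and names are invented for illustration and do not describe the actual Spirit hardware; the ISR and setup code are omitted.

    #include <stdint.h>

    #define DMA_BASE      0x84000000u            /* assumed memory-mapped base address    */
    #define DMA_SRC_ADDR  (*(volatile uint32_t *)(DMA_BASE + 0x0))
    #define DMA_LENGTH    (*(volatile uint32_t *)(DMA_BASE + 0x4))
    #define DMA_DEST_ID   (*(volatile uint32_t *)(DMA_BASE + 0x8))
    #define DMA_CONTROL   (*(volatile uint32_t *)(DMA_BASE + 0xC))
    #define DMA_GO        0x1u

    static volatile int dma_done;                /* set by the DMA completion ISR (not shown) */

    static void mpe_send(const void *buf, uint32_t len, uint32_t dest_node)
    {
        dma_done     = 0;
        DMA_SRC_ADDR = (uint32_t)(uintptr_t)buf; /* physical address of the data in DDR2  */
        DMA_LENGTH   = len;
        DMA_DEST_ID  = dest_node;                /* network ID of the receiving node      */
        DMA_CONTROL  = DMA_GO;                   /* DMA fetches, packetizes, and sends    */
        while (!dma_done)
            ;                                    /* completion interrupt ends the wait    */
    }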

4.2.2 Collective Communication

After studying the typical implementations of MPI collective primitives, it is clear that the most time-consuming portion of MPI collective communications is the sequential sending and receiving of messages. Some optimizations are able to exploit the parallelism within the algorithms, so that some tasks can work in parallel based on certain topologies, as mentioned in Chapter 3. However, software overhead such as OS protocol stacks, ISRs, and interfacing with the network is still executed sequentially on general processors, which occupies the processor and limits the computation capability. To handle collective communications in hardware, the MPE is designed to connect to the on-chip router. Inside the MPE, three typical operations (barrier, broadcast, and reduce) are implemented.

4.2.2.1 Barrier Function

The hardware MPE implements the barrier function, shown in Figure 4.6, with the goal of moving barrier synchronization responsibilities from the OS and libraries into hardware. When a barrier request is initiated, the processor asserts one bit in the hardware and waits for the barrier-clear interrupt from the hardware MPE. The barrier message is assembled, sent, and received completely in hardware. When all the tasks reach the barrier, the hardware sends out interrupt signals to the processor.
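The FSM in Figure 4.6 can be read as the following C-style rendering of the state logic. The state and condition names follow the figure's legend, but this is only an illustrative software view of the hardware behavior, not the HDL source.

    typedef enum {
        S0_WAIT_CHILDREN,   /* S0: wait for the children's barrier messages       */
        S1_SEND_TO_PARENT,  /* S1: send this node's barrier message to the parent */
        S2_WAIT_PARENT,     /* S2: wait for the clear message from the parent     */
        S3_SEND_CLEAR,      /* S3: send the clear message to the children         */
        S4_BARRIER_DONE     /* S4: barrier done, interrupt the processor          */
    } barrier_state_t;

    /* One transition of the barrier FSM; the arguments model the arc conditions. */
    static barrier_state_t barrier_next(barrier_state_t s, int is_root,
                                        int children_arrived, int tx_done,
                                        int parent_clear, int os_barrier_low)
    {
        switch (s) {
        case S0_WAIT_CHILDREN:
            if (!children_arrived) return S0_WAIT_CHILDREN;
            return is_root ? S3_SEND_CLEAR : S1_SEND_TO_PARENT;
        case S1_SEND_TO_PARENT:
            return tx_done ? S2_WAIT_PARENT : S1_SEND_TO_PARENT;
        case S2_WAIT_PARENT:
            return parent_clear ? S3_SEND_CLEAR : S2_WAIT_PARENT;
        case S3_SEND_CLEAR:
            return tx_done ? S4_BARRIER_DONE : S3_SEND_CLEAR;
        case S4_BARRIER_DONE:
        default:
            return os_barrier_low ? S0_WAIT_CHILDREN : S4_BARRIER_DONE;
        }
    }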

4.2.2.2 Broadcast Function

The function of broadcast is to distribute the same chunk of data from the source task to the other tasks. Since data movement does not perform well through the processor-bus combination because of the slow bus transactions, the hardware

Figure 4.6: FSM of barrier operation (states: S0 wait for children; S1 send barrier to parent; S2 wait for parent; S3 send clear to children; S4 barrier done)

DMA engine used for point-to-point communication can also be used in broadcast to speed up the data operation without involving the processor and the bus. When a broadcast request is issued by the processor, the DMA engine fetches the data and passes the data to the hardware MPE. The MPE assembles the messages, handles the handshaking between the parent and the children, sends and receives data, and notifies the processor by interrupt when the broadcast request is done. Because the broadcast primitive operates on a vector of data, a FIFO is used as a buffer to hold the data. Because of the limited resources on the FPGA, the size of the FIFO is limited, which means that if the data size is larger than the FIFO size, the data is divided into chunks and transmitted separately.
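A minimal sketch of that chunking policy as a device driver might implement it; the FIFO depth constant and the mpe_bcast_chunk() helper are placeholders used only for illustration.

    #include <stddef.h>
    #include <stdint.h>

    #define MPE_FIFO_WORDS 4096u   /* assumed FIFO depth in 32-bit words */

    /* Stub standing in for the driver call that pushes one FIFO-sized chunk
     * through the DMA engine and the MPE broadcast hardware. */
    static void mpe_bcast_chunk(const uint32_t *buf, size_t nwords)
    {
        (void)buf;
        (void)nwords;
    }

    /* Split a broadcast that is larger than the FIFO into separate transactions. */
    static void mpe_bcast(const uint32_t *buf, size_t nwords)
    {
        size_t off = 0;
        while (off < nwords) {
            size_t chunk = nwords - off;
            if (chunk > MPE_FIFO_WORDS)
                chunk = MPE_FIFO_WORDS;
            mpe_bcast_chunk(buf + off, chunk);
            off += chunk;
        }
    }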

4.2.2.3 Reduce Function

Besides the similar but reversed data movement compared to broadcast, the unique feature of reduce is that it involves a commutative and associative computation operation. The normal implementation of reduce, e.g. MPI_Reduce in OpenMPI, is that all the nodes send data to the root node. The root node receives the data in

Figure 4.7: FSM of broadcast operation

sequence and computes the result in sequence until all the data are consumed. On one hand, the data communication is congested at the root node; on the other hand, the computation is also serialized on the root node. So the overall performance is limited by the performance of the ALU on the root node. Some modern processors have very complex designs to speed up the computation, but the processor embedded in the FPGA has a low clock frequency. What makes it worse is that the embedded processor does not have a usable hardware FPU. It usually takes tens to hundreds of clock cycles to execute one floating-point operation in software, while just a few cycles in hardware. Therefore, in this design, a hardware computation unit is adopted inside the reduce core.

4.2.2.4 Topology

As described in Chapter 3, the underlying communication topology — or algorithm — sometimes plays an important role in the performance of certain communications. Tree-based algorithms have the advantage of low algorithmic complexity and overall scalability. The proposed work designs and tests some of the popular tree-based topologies.

Figure 4.8: FSM of reduce operation

Figure 4.9: Topologies of hardware collective communication ((a) binomial tree, (b) linear tree, (c) star tree)

Binomial tree structures utilize all the possible physical channels. Taking the 4-ary 2-cube as an example, shown in Figure 4.9a, each node has 4 neighbor nodes directly connected. Theoretically, message transmissions can happen in all the channels in parallel, which could achieve the highest topology parallelism. The linear tree has no topology parallelism, as shown in Figure 4.9b. Every node has only one parent and one child directly connected. Messages are relayed from one node to another, from the leaves to the root. For simple collective operations, such as barrier, this structure is inefficient because there is no parallelism; however, for broadcast and reduce, which have multi-stage data operations, this topology creates a pipeline, which could achieve higher bandwidth than other topologies do. The star tree structure virtually connects the root node to all the other nodes. Physical channels are reused, and messages hop through multiple nodes to the destination via the on-chip router. The numbers labeled in Figure 4.9c show the number of hops for each virtual connection. This topology requires only one hardware MPE in the root node, while virtually exploiting the maximum parallelism. However, when the size of the message and the number of nodes are large, the sole MPE and the number of physical channels on the root node become the limiting factors that cause contention and degrade the overall performance of the system.
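To make the tree construction concrete, the sketch below computes the parent and children of each node in a binomial tree rooted at node 0. This is the standard binomial-tree construction, shown only as an illustration; the exact topology written into the MPE may differ (e.g., to match the physical channels of the 4-ary 2-cube).

    #include <stdio.h>

    /* Print the parent and children of `rank` in a binomial tree of `size` nodes. */
    static void binomial_tree(int rank, int size)
    {
        int mask = 1;

        /* Walk up to the lowest set bit of rank; clearing that bit gives the parent. */
        while (mask < size && !(rank & mask))
            mask <<= 1;

        if (rank == 0)
            printf("node %2d: root\n", rank);
        else
            printf("node %2d: parent = %d\n", rank, rank & ~mask);

        /* Every lower bit position that stays inside `size` names one child. */
        for (mask >>= 1; mask > 0; mask >>= 1)
            if (rank + mask < size)
                printf("node %2d: child  = %d\n", rank, rank + mask);
    }

    int main(void)
    {
        for (int r = 0; r < 16; r++)    /* e.g., the 16-node system of Figure 4.9 */
            binomial_tree(r, 16);
        return 0;
    }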

4.3 Stage 2: Heterogeneous System

In Stage 2, the design concentrates on heterogeneous systems. Instead of being a co-processor of the general processor, the hardware MPE is accessible to all the on-chip heterogeneous PEs. Heterogeneous PEs can communicate directly through the hardware MPE without involving the processor, as shown in Figure 4.10. Figure 4.10 shows the common configuration of current heterogeneous systems. The general processor is a multi/many-core chip with its on-chip network and a connection to the main memory. The heterogeneous chip is also a multi/many-core chip with a link to its local memory. Between these two packages, high-speed

Figure 4.10: Block diagram of MPE in heterogeneous system

Figure 4.11: Typical programming method for parallel heterogeneous system

connections such as PCI Express are often used. In some configurations, the two chips can be manufactured in a single package [82]. The typical programming method for a parallel heterogeneous system relies on the OS and libraries running on the general processor. Figure 4.11 is an example of programming 4 parallel tasks. Each task runs on a general processor with the OS and essential libraries. The initial data is distributed by standard MPI function calls and stored in the main memory. As each task finishes receiving the data, the OS and libraries assign subtasks to the heterogeneous PEs and copy the data from main memory to the local memory associated with the heterogeneous PEs. Then the heterogeneous PEs may start the computation. After the computation is finished, the OS and libraries copy the data back into the main memory. Finally, the general processor continues with the rest of the program. As can be seen in Figure 4.11, data are frequently transferred back and forth between the main memory and the heterogeneous PEs' local memory. This has two negative impacts. First, as the OS and libraries run on the general processor, frequent communication requests may overload the general processor. Second, the communication requests need to pass through multiple software stacks; these operations are trivial but consume clock cycles that could be used for real computation. To address these two negative impacts, the hardware MPE can be used, as shown in the shaded area of Figure 4.10. First, the hardware MPE is dedicated to communications; it can route communications directly to the heterogeneous PEs without going through the main memory or overloading the general processor. Second, the hardware MPE processes the communication in parallel with the general processor, which gives the general processor the opportunity to process other computation.

4.3.1 Parallel FFT Operation

Fourier Transform is a transformation of one sequence of signals into another sequence of signals. Generally, the forward transformation transforms time-domain signals to frequency-domain signals, whereas the inverse transformation transforms signals in the opposite direction. The commonly used Fourier Transform is the Discrete Fourier Transform (DFT), which involves heavy floating-point multiplication and addition. Because of the periodic characteristics of the twiddle factor and the finite sequence, the Fast Fourier Transform (FFT) algorithm can calculate the DFT using fewer computations, with intermediate variables reused. One well-known FFT algorithm is the Cooley–Tukey algorithm, which recursively divides the sequence into two halves that can be expressed as smaller FFTs. Two types of decimation strategies can be used to implement FFT algorithms: Decimation-In-Time (DIT) or Decimation-In-Frequency (DIF). As shown in Figure 4.12, the DIT algorithm requires a bit-reversal sorting operation on the input data, and the output data is in natural order. Figure 4.13 shows that the DIF algorithm is able to take the natural-order sequence directly as input, and outputs the result in bit-reversed order.
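For reference, the DFT and the radix-2 DIF split that the Cooley–Tukey algorithm exploits can be written as follows (standard textbook definitions, not drawn from the dissertation's figures):

    X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i\, kn/N}, \qquad k = 0, 1, \ldots, N-1

    X_{2k}   = \sum_{n=0}^{N/2-1} \left(x_n + x_{n+N/2}\right) e^{-2\pi i\, kn/(N/2)}

    X_{2k+1} = \sum_{n=0}^{N/2-1} \left(x_n - x_{n+N/2}\right) e^{-2\pi i\, n/N}\; e^{-2\pi i\, kn/(N/2)}

Here e^{-2\pi i n/N} is the twiddle factor; the even- and odd-indexed outputs are each a half-length DFT, which is why the output appears in bit-reversed order when the split is applied recursively.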

4.3.1.1 Algorithm

In this work, a parallel FFT operation is implemented in both software and hardware. With the goal of streaming the natural-order data into the hardware,

Figure 4.12: FFT Decimation-In-Time

Figure 4.13: FFT Decimation-In-Frequency

the DIF algorithm is implemented. The parallel FFT DIF algorithm involves two steps: an inter-node FFT DIF step and an intra-node FFT DIF step. The inter-node FFT DIF requires point-to-point communication, whereas the intra-node FFT DIF does not involve communication. The parallel FFT DIF algorithm is described below (a software sketch of the inter-node loop follows the steps):

1. Before the computation starts, twiddle factors are calculated and stored in the memory.

2. The root node generates the original data. A scatter operation is issued and distribute the original data to all the other nodes. The received data is used as the local data.

3. Based on how many nodes are involved, every node calculates its remote node. A point-to-point communication is initiated on each node, sending local data to remote node. The received data is treated as the remote data.

4. After every node has the local data and the remote data, an inter-node FFT DIF calculation is performed. The results are stored in the local data.

5. Check whether all log2(n) inter-node stages are finished. If yes, the intra-node FFT DIF is performed on the local data. If not, go to step 3.
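The steps above can be rendered in software roughly as follows. The sketch assumes a contiguous block distribution of the sequence across a power-of-two number of nodes and computes twiddle factors on the fly; exchange_with() stands in for the point-to-point send/receive pair (MPE or MPI) and is not part of the dissertation's actual API.

    #include <complex.h>
    #include <math.h>
    #include <stddef.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    /* Swap n complex values with the peer node: send `out`, receive into `in`.
     * Assumed to wrap the point-to-point primitives; left undefined here. */
    extern void exchange_with(int peer, const double complex *out,
                              double complex *in, size_t n);

    /* Inter-node stages of the parallel radix-2 DIF FFT (steps 3-5 above). */
    static void fft_dif_inter_node(double complex *local, double complex *remote,
                                   size_t n_local, int rank, int nodes)
    {
        for (int d = nodes / 2; d >= 1; d /= 2) {      /* log2(nodes) stages          */
            int peer = rank ^ d;                       /* step 3: compute remote node */
            exchange_with(peer, local, remote, n_local);

            int    low  = (rank < peer) ? rank : peer; /* lower node of the pair      */
            size_t span = (size_t)(2 * d) * n_local;   /* length of current sub-FFT   */
            for (size_t i = 0; i < n_local; i++) {     /* step 4: DIF butterflies     */
                size_t g = (size_t)(low % (2 * d)) * n_local + i; /* index in sub-FFT */
                if (rank == low)
                    local[i] = local[i] + remote[i];   /* lower half keeps a + b      */
                else
                    local[i] = (remote[i] - local[i])  /* upper half: (a - b) * W     */
                               * cexp(-2.0 * M_PI * I * (double)g / (double)span);
            }
        }
        /* Step 5: after the loop, the intra-node FFT DIF runs on `local`. */
    }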

4.3.1.2 Implementation

Figure 4.14 shows the block diagram of the designed hardware FFT core. The FFT core takes three complex inputs, cplex_a, cplex_b, and cplex_t, which correspondingly represent the local data, the remote data, and the twiddle factor from the table. The cplex_addsub block instantiates 2 floating-point add/sub units for the real part and the imaginary part of a complex number. The cplex_mul block instantiates 4 floating-point multiplication units and 2 floating-point add/sub units.

The cplex_sreg is a shift register designed to synchronize the table input with the add/sub path, providing exactly the same clock delay as the cplex_addsub. Based on the control signal, cplex_addsub either adds the two inputs or subtracts cplex_b from cplex_a. The output of cplex_addsub is fed into cplex_mul and multiplied with the synchronized table input from cplex_sreg.

Figure 4.14: Block diagram of FFT Core

Based on the control signal, the FFT core outputs the result either from the cplex_addsub or from the cplex_mul. Figure 4.15 illustrates the upper I/O level of the FFT core. Two FIFOs are used: one is used

to store the local data, and the other is used to store the remote data. The FFT_TABLE instantiates a two-port BRAM primitive to store the twiddle factors. One port of the BRAM (BRAM_PORT_A) is connected to the bus, through which the PowerPC calculates the twiddle factors and writes them into the BRAM. The other port of the BRAM (BRAM_PORT_B) is connected to the FFT core. When both the local data and the remote data are ready in the FIFOs, the FSM asserts read signals to both FIFOs as well as the BRAM. As the calculated results are pipelined out of the FFT core, they are fed back into the local FIFO. When the calculation is finished, the FSM asserts the read signal to the local FIFO and assembles a transmission to the remote node. At the same time, the FSM receives remote data and stores it in the remote FIFO. When the inter-node FFT DIF is completed, the data in the local FIFO can be dumped into the main memory or another hardware core. In this work, due to the limitation of the hardware resources, the intra-node computation is carried out on the PowerPC.
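In software terms, the datapath of Figures 4.14 and 4.15 computes one DIF butterfly per clock once the pipeline is full. The following sketch uses the same signal names; the control-signal multiplexing of the real core is simplified by returning both results at once.

    #include <complex.h>

    /* Both outputs of one FFT-core butterfly: the add path (a + b) and the
     * subtract path multiplied by the twiddle factor ((a - b) * t). */
    typedef struct {
        float complex add_result;   /* selected when the core bypasses cplex_mul       */
        float complex mul_result;   /* selected when the core routes through cplex_mul */
    } fft_butterfly_t;

    static fft_butterfly_t fft_core_butterfly(float complex cplex_a,   /* local data  */
                                              float complex cplex_b,   /* remote data */
                                              float complex cplex_t)   /* twiddle     */
    {
        fft_butterfly_t out;
        out.add_result = cplex_a + cplex_b;              /* cplex_addsub, add mode   */
        out.mul_result = (cplex_a - cplex_b) * cplex_t;  /* sub mode, then cplex_mul */
        return out;
    }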

4.3.2 Parallel Matrix-Vector Multiplication

In scientific applications, floating-point matrix calculation is widely used and is often considered an important performance index. Benchmarks such as HPL use

Figure 4.15: Block diagram of FFT IO

matrix calculation as the kernel computation. Therefore, in this experiment, a floating-point matrix-vector multiplication is implemented, both in the FPGA fabric and in software.

4.3.2.1 Algorithm

There are many parallel algorithms for matrix-vector multiplication. In this experiment, a row-based algorithm is designed as follows (a host-side MPI sketch follows Figure 4.16):

1. The root node generates matrix A and vector B.

2. The root scatters matrix A in row order to all the nodes in this operation. Each node now has a partial matrix A.

3. The root broadcasts vector B to all the nodes in this operation. All the nodes have vector B.

4. All the nodes calculate a partial result vector C using their partial matrix A and vector B.

5. The root gathers the partial result vectors C from all the nodes and combines them into the result vector C.

6. Optional: loop.

Figure 4.16: Block diagram of vector-vector multiplication
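A host-side sketch of steps 1-5 written with the standard MPI collectives that the MPE mirrors in hardware. It assumes the matrix dimension divides evenly by the number of nodes; buffer management is simplified.

    #include <mpi.h>
    #include <stdlib.h>

    /* Row-partitioned y = A * b for an n x n matrix A and n-element vector b.
     * A and b only need to be valid on the root; y is gathered on the root. */
    static void parallel_mvm(const float *A, const float *b_root, float *y, int n)
    {
        int rank, nodes;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nodes);

        int rows = n / nodes;                    /* rows per node (assumed exact) */
        float *a_part = malloc((size_t)rows * n * sizeof *a_part);
        float *b      = malloc((size_t)n * sizeof *b);
        float *y_part = malloc((size_t)rows * sizeof *y_part);

        /* Steps 2-3: scatter the matrix rows, broadcast the vector. */
        MPI_Scatter(A, rows * n, MPI_FLOAT, a_part, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
        if (rank == 0)
            for (int i = 0; i < n; i++) b[i] = b_root[i];
        MPI_Bcast(b, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

        /* Step 4: each node computes its partial result (the MACC kernel). */
        for (int r = 0; r < rows; r++) {
            float acc = 0.0f;
            for (int c = 0; c < n; c++)
                acc += a_part[(size_t)r * n + c] * b[c];
            y_part[r] = acc;
        }

        /* Step 5: gather the partial result vectors back on the root. */
        MPI_Gather(y_part, rows, MPI_FLOAT, y, rows, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(a_part); free(b); free(y_part);
    }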

4.3.2.2 Implementation

Based on the algorithm, the matrix-vector multiplication can be broken into several vector-vector multiply-accumulate operations. Considering this multiply-accumulate operation as a stand-alone unit, two implementations are designed: the hardware MACC core and the software MACC kernel. The hardware MACC core is implemented in the FPGA fabric, utilizing DSP slices and block RAMs. As shown in Figure 4.16, the MACC core is designed with one FIFO, one floating-point multiplication core, and one floating-point addition core. The computation essentially involves several data-streaming operations. Vector B is distributed from the root node and stored in the FIFO of every MACC core via

the broadcast operation. The partial matrix A is streamed into the MACC in row order through the scatter operation; the FIFO synchronously pops the data and feeds it to the floating-point multiplication core. At the same time, the output data is pushed back into the FIFO, ready for the next row of the partial matrix A. The register bank is used to temporarily buffer results during the pipeline delay of the adder core. All the hardware primitives are generated using Coregen from the Xilinx tools. The software MACC kernel is simply implemented as a for-loop. On one hand, the software MACC kernel can be used as a reference for the hardware MACC core. On the other hand, a hybrid computing system can be implemented by utilizing the software MACC kernel and the hardware MACC core in parallel.

Figure 4.17: Hybrid of hardware thread and software thread

The total workload can be distributed to software and hardware at the same time. To leverage both the heterogeneous hardware and the software configuration, Pthreads can be used, as shown in Figure 4.17.

CHAPTER 5: EVALUATION AND ANALYSIS

5.1 Evaluation Infrastructure

As described in Chapter 4, the hardware MPE design and testing infrastructure are implemented on the Spirit cluster. The detailed specification of the Xilinx ML-410 development board can be found in [83]. For reference, the Python cluster, which is mentioned in Section 1.3, is used as the commodity HPC system to run the reference software tests.

5.2 Testing Methodology

A synthetic benchmark is written in C to measure the execution time of the communication primitives. The processor writes to registers to set the network ID and the communication topology before the collective communication occurs. By measuring the time for a certain number of communication calls to complete, the average execution time can be calculated for each node. The measurements test

configurations with different numbers of nodes and varying problem sizes for reduce and broadcast. In order to run the synthetic benchmark under Linux, custom device drivers are required to support control between the hardware and the software. The device drivers issue the network IDs to the MPE based on the node's IP address. During initialization, the application writes the pre-calculated tree topology to the hardware

MPE. When a reduce or broadcast function call occurs, the device drivers initiate the memory operation from the DMA engine and wait for the completion interrupt from the hardware. To avoid overfilling the hardware FIFOs, the device drivers calculate the length of each message and divide long messages into smaller messages that fit in the FIFOs. Then the device drivers issue consecutive requests to the hardware MPE. The synthetic benchmark can be ported to use the standard software MPI. As a reference, the ported benchmark can be executed in native Linux on the FPGA, or on a commodity HPC system such as the Python cluster.
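The core of the synthetic benchmark is a simple averaging loop, sketched below. The mpe_barrier() call is a placeholder for whichever primitive (hardware MPE or software MPI) is being measured, and the repetition count is illustrative.

    #include <time.h>

    #define REPS 1000                      /* illustrative repetition count */

    /* Placeholder for the primitive under test (e.g., the MPE barrier call). */
    extern void mpe_barrier(void);

    /* Return the average time of one call in microseconds. */
    static double time_one_call(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i++)
            mpe_barrier();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (double)(t1.tv_sec - t0.tv_sec) * 1e6
                  + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
        return us / REPS;
    }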

5.3 Stage 1 Experiment

The Stage 1 experiments measure the performance (latency and bandwidth) of the designed MPE using different communication topologies, specifically the Binomial Tree, the Star Tree, and the Linear Tree. Other user-defined topologies, such as the Binary Tree, were also tested; because the Binary Tree does not show distinctive results, it is not reported. The same experiments are exercised on the Spirit and Python clusters using the traditional software MPI.

5.3.1 Barrier Performance Result

Due to the nature of barrier — one task cannot hit the next barrier while other tasks are still processing the current barrier — it does not involve any pipelined operation, which means the measured results directly reflect the operation latency.

Figure 5.1 shows the results of the MPE barrier operation. It can be seen that all three topologies show a 2× increase in latency as the number of nodes doubles. The Linear Tree performs the worst among the three topologies, while the Binomial Tree and Star Tree have very close results. The results show that increasing the dimensionality of the communication topology can effectively reduce the communication latency. Reusing channels in the Star Tree topology does not cause congestion because the communication payload of barrier is small. Figure 5.2 illustrates the traditional software MPI barrier on the Spirit cluster and the Python cluster. It can be observed that barrier shows quite a large latency on the Spirit cluster due to the slow processor and peripherals. With a much more advanced hardware architecture, the Python cluster is able to achieve latency as low as 12µs.

Figure 5.1: MPE barrier using different topologies

Figure 5.2: Software barrier ((a) Spirit cluster, (b) Python cluster)

5.3.2 Broadcast Performance Result

Broadcast operation distributes data from the root task to all the other tasks. In the repeating synthetic benchmark, this unidirectional communication pattern can

establish a pipelined structure — one task can start a new broadcast request as soon as it has finished broadcasting to all its children. At the same time, other tasks down the line can be processing previous requests. This pipelined structure can effectively increase the bandwidth of the communication. To measure the latency, a hardware MPE barrier is inserted between the repeating broadcast requests. Then the hardware barrier time is subtracted from the measured results.

5.3.2.1 Bandwidth

Figure 5.3 presents the bandwidth results of MPE broadcast. The bandwidth is calculated as the communication payload divided by the execution time (without the barrier inserted). It can be seen in all the tests that when the communication payload is small, the payload cannot fully utilize the bandwidth. The bandwidth gradually rises as the payload size increases, and as the payload size surpasses the buffer size (4096 words), the bandwidth saturates. Comparing the three topologies, the Linear Tree has the highest bandwidth, because the pipelined communication topology can have multiple broadcast requests in flight; as the number of nodes increases, the bandwidth does not degrade. The Star Tree has the lowest bandwidth, because it relies solely on the root node to send the data. During one transaction, all the other nodes wait for the data and no communication parallelism can be achieved. Additionally, as more nodes are involved in the communication, the bandwidth degrades even more. The Binomial Tree essentially combines the Linear Tree and the Star Tree: for communication between different topology levels, it features a pipelined structure; for nodes within the same topology level, it relies on the upper node to distribute the data, which makes the upper node the communication bottleneck.

Figure 5.3: Bandwidth of different broadcast topologies ((a) 4 nodes, (b) 8 nodes, (c) 16 nodes, (d) 32 nodes)

Figure 5.4: Bandwidth of software broadcast ((a) Spirit, (b) Python)

Figure 5.4 exhibits the bandwidth results of software broadcast on Spirit and Python. Because of the 300 MHz clock rate and relatively slow Fast Ethernet (100 Mbps), Spirit shows bandwidth less than 35 Mbps. With a more advanced processor and system interconnect, the bandwidth of the broadcast operation on Python is able to reach more than 4.0 Gbps.

5.3.2.2 Latency

Figure 5.5 shows the latency results of MPE broadcast. Because a 4096-word FIFO is used as the buffer in the hardware, these figures only report results up to 4096 words. For problem sizes larger than 4096 words, the data is divided into multiple 4096-word transactions, and the result is simply the corresponding multiple of the 4096-word result. It can be observed that, due to the long chain topology, the Linear Tree has the largest latency in almost all the test cases. The Star Tree shows interesting results: when the number of nodes is small, the Star Tree performs well because of the parallelism of the topology; however, as the number of nodes increases, the performance of the Star Tree degrades very quickly, because the Star Tree overly reuses the physical channels on the root node, which causes congestion there. As the number of nodes approaches 32, the performance of the Star Tree is almost as bad as that of the Linear Tree. The Binomial Tree performs the best among all the topologies, because the parallel topology utilizes all the physical channels and has no physical bottleneck on any node.

Figure 5.6 presents the results of the software broadcast on Spirit and Python. It can be seen that the software broadcast on Spirit takes 10× more time to finish than the hardware MPE using the Binomial Tree. With an advanced architecture and fast interconnect, Python is able to achieve very small latency. Comparing the MPE broadcast to the software broadcast, the MPE broadcast can effectively improve the latency by 1000× over Spirit. For small messages (< 256 words), the MPE broadcast can outperform the Python cluster. However, due to the saturation of the bandwidth, the latency is dominated by the size of the payload and is surpassed by Python for large payloads.

Figure 5.5: Latency of different broadcast topologies ((a) 4 nodes, (b) 8 nodes, (c) 16 nodes, (d) 32 nodes)

Figure 5.6: Latency of software broadcast ((a) Spirit, (b) Python)

5.3.3 Reduce Performance Result

The reduce operation can be considered the reverse of broadcast — every task sends its local data to the parent task, and along with the communication, a commutative and associative computation is applied to the data. After the operation, the root node has the final result. Like broadcast, this unidirectional communication pattern can establish a pipelined structure, which can effectively increase the bandwidth of the communication. To measure the latency, a hardware MPE barrier is inserted between the repeating reduce requests. Then the hardware barrier time is subtracted from the measured results.

5.3.3.1 Bandwidth

Figure 5.7 shows the bandwidth results of MPE reduce. Similar to MPE broadcast, the Linear Tree performs the best in all the test cases because of the pipelined topology. The Star Tree only obtains a small bandwidth, because it has no communication parallelism. The Binomial Tree performs in between the Linear Tree and the Star Tree.

Figure 5.8 shows the bandwidth results of reduce on Spirit and Python. Because the reduce operation involves a computation, and Spirit does not have a floating-point unit, all the floating-point computations are processed through a software library. The bandwidth on Spirit can only reach 15 Mbps. Python is able to reach 5.0 Gbps bandwidth when the number of nodes is small; as the number of nodes reaches 32, the bandwidth falls below 1.0 Gbps.

5.3.3.2 Latency

Figure 5.9 presents the latency results of MPE reduce. It exhibits almost identical results to broadcast. The Binomial Tree has the best performance because maximum parallelism can be obtained from the topology, while the Linear Tree does not perform well because of the relay mechanism. The Star Tree shows a small latency for a small-scale system (4 nodes), but a huge latency for a relatively large-scale system (32 nodes),

Figure 5.7: Bandwidth of different reduce topologies ((a) 4 nodes, (b) 8 nodes, (c) 16 nodes, (d) 32 nodes)

Figure 5.8: Bandwidth of software reduce ((a) Spirit, (b) Python)

Figure 5.9: Latency of different reduce topologies ((a) 4 nodes, (b) 8 nodes, (c) 16 nodes, (d) 32 nodes)

due to the bottleneck on the root node. Figure 5.10 shows the measured latency results on Spirit and Python. It can be seen that the hardware MPE can improve the latency by 100× over Spirit. For small messages, the MPE exhibits performance similar to the Python cluster. For large payloads, due to the saturation of the bandwidth, the latency shows a linear relationship with the payload size.

5.3.4 Allreduce Performance Result

The allreduce operation can be implemented by combining reduce and broadcast — all the tasks first execute the reduce operation, and after the root task has the updated result,

Figure 5.10: Latency of software reduce ((a) Spirit, (b) Python)

it initiates the broadcast operation and updates the result for all the other tasks. Since reduce and broadcast operate in reversed communication patterns, the combination breaks the pipelined operation flow, so only latency results are presented.

Figure 5.11 lists the MPE allreduce results for the different topologies. Like the results seen in Figure 5.1, because there is no pipelined operation in allreduce, the dimensionality of the network becomes the only performance factor. Therefore, the Binomial Tree performs the best among all tree structures. For small messages, the Star Tree performs well, but for large messages, the root node becomes the bottleneck.

Figure 5.12 shows the traditional software allreduce results of Spirit and Python. It can be observed that MPE can improve the latency of allreduce by ≈ 50× to ≈ 350×.

5.3.5 Summary

The collected results in Stage 1 show that the hardware MPE can significantly reduce the communication time. Among the three communication topologies, the Linear Tree is able to provide the highest bandwidth for unidirectional communication, but it incurs the longest delay in every test case. The Binomial Tree can leverage the physical channels and provide the highest parallelism, which results in the lowest

Figure 5.11: Execution time of different allreduce topologies ((a) 4 nodes, (b) 8 nodes, (c) 16 nodes, (d) 32 nodes)


Figure 5.12: Latency of software allreduce ((a) Spirit, (b) Python)

latency in all the tests and moderate bandwidth. The Star Tree reuses the physical channels and is able to achieve good performance when both the number of nodes and the communication payload are small. As a reference, the Python cluster generally performs better than the hardware MPE on Spirit. There are several reasons. The first reason is that Python has a more advanced architecture and interconnect, while Spirit runs a relatively slow processor. Though the hardware MPE can improve the raw communication performance, all the rest of the software stack runs at a slow frequency. Even the

MPI_Wtime() call runs at a 10× slower speed. The second reason is that the communication payload on Spirit is not cached, whereas on Python all the communication payload is cached. To leverage the hardware MPE in real applications such as HPL and NPB, a "replacement" API for traditional MPI is used. However, since these benchmarks are

not designed to test communication, there are no frequent barrier, reduce, or broadcast function calls, and the performance improvement is not distinctive.

5.4 Communication Model

A hardware counter is inserted in all the hardware communication primitives to count the non-idle clock cycles. The hardware counters showed results very close to those of MPI_Wtime(). This is due to the wait states in the FSM caused by the asynchronism between the nodes. Equation 5.1 shows that the total execution time (T_total) consists of 3 portions: idle time (T_idle), hardware processing time (T_running), and wait time (T_wait).

T_{total} = T_{idle} + T_{running} + T_{wait}    (5.1)

5.4.1 Linear Fitting for Barrier

Figure 5.13 shows the mathematical fitting of the measured barrier data. Because barrier does not involve any data operation, and the FSM only introduces a few

Figure 5.13: Mathematical fitting for MPE barrier

clock cycles of T_running, the majority of the time is T_wait for the asynchronous nodes, which is determined by the specific system.

5.4.2 Latency Model

Unlike the barrier, broadcast and reduce spend quite a number of clock cycles on processing the data, which makes T_running the major portion of the total time. As shown in Equation 5.2, T_running can be further broken into two parts: T_fsm and T_payload. T_fsm represents the time spent in states other than the "payload states", and T_payload denotes the time actually spent on processing the data.

T_{running} = T_{fsm} + T_{payload}    (5.2)

Figure 5.14 illustrates the time chart of broadcast from the viewpoint of the communication payload. White blocks represent input operations, and dark blocks represent output operations. The number in each block represents the source or the destination.

Figure 5.14: Time chart of broadcast operation ((a) Linear Tree, (b) Binomial Tree, (c) Star Tree)

Table 5.1: Broadcast measurement vs. simulation

(a) Absolute differences (payload size in words; columns = number of nodes)

  Linear     4        8        16       32
  8          5.87µs   6.03µs   7.99µs   13.4µs
  256        5.00µs   5.12µs   6.01µs   9.99µs
  1024       5.11µs   5.12µs   6.98µs   8.93µs
  4096       5.53µs   5.72µs   7.41µs   7.80µs

  Binomial   4        8        16       32
  8          5.83µs   5.92µs   6.29µs   6.89µs
  256        5.01µs   5.53µs   6.08µs   5.26µs
  1024       5.14µs   5.00µs   7.23µs   6.92µs
  4096       5.56µs   6.53µs   7.27µs   4.87µs

  Star       4        8        16       32
  8          6.89µs   9.29µs   10.6µs   13.9µs
  256        4.60µs   4.63µs   7.06µs   8.85µs
  1024       4.48µs   3.96µs   5.21µs   7.45µs
  4096       4.60µs   4.64µs   6.14µs   3.12µs

(b) Relative differences (payload size in words; columns = number of nodes)

  Linear     4        8        16       32
  8          93.6%    89.3%    85.4%    83.5%
  256        28.0%    18.1%    12.1%    10.5%
  1024       9.08%    5.26%    3.85%    2.57%
  4096       2.63%    1.52%    1.05%    0.57%

  Binomial   4        8        16       32
  8          94.7%    93.6%    92.9%    92.4%
  256        32.8%    30.1%    28.3%    22.6%
  1024       11.1%    8.90%    10.5%    8.80%
  4096       3.25%    3.08%    2.87%    1.67%

  Star       4        8        16       32
  8          94.5%    92.8%    88.6%    84.0%
  256        26.4%    16.7%    13.9%    9.48%
  1024       8.05%    4.12%    2.90%    2.15%
  4096       2.19%    1.24%    0.875%   0.230%

The letter "L" denotes the local DMA transaction. These analytic models can be expressed in the following equations:

T_{broadcast,linear} = (2 + n - 1) × P
T_{broadcast,binomial} = (2 + log_2(n)) × P    (5.3)
T_{broadcast,star} = (2 + n - 1) × P

In Equation 5.3, n represents the number of nodes and P represents the communication payload. The simulation results shown in Figure 5.15 exhibit a close match between the analytic model and the measured values. Table 5.1 lists the differences between the measurement and the simulation.
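Equations 5.3 and 5.4 can be evaluated directly. In the small sketch below, P is interpreted as the time to stream one payload through a stage and O as the pipelined-computation overhead; the functions simply restate the equations in code, with the units left to the caller.

    #include <math.h>

    /* Broadcast latency model, Equation 5.3. n = number of nodes, P = payload time. */
    static double t_bcast_linear(int n, double P)   { return (2 + n - 1) * P; }
    static double t_bcast_binomial(int n, double P) { return (2 + log2((double)n)) * P; }
    static double t_bcast_star(int n, double P)     { return (2 + n - 1) * P; }

    /* Reduce latency model, Equation 5.4. O = pipelined computation overhead. */
    static double t_reduce_linear(int n, double P, double O)
    {
        return (2 + n - 1) * P + (n - 1) * O;
    }
    static double t_reduce_binomial(int n, double P, double O)
    {
        return (2 + log2((double)n)) * P + log2((double)n) * O;
    }
    static double t_reduce_star(int n, double P, double O)
    {
        return (2 + n - 1) * P + (n - 1) * O;
    }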

Figure 5.15: Latency simulation of broadcast ((a) Linear Tree, (b) Binomial Tree, (c) Star Tree)

Figure 5.16: Time chart of reduce operation ((a) Linear Tree, (b) Binomial Tree, (c) Star Tree)

It can be observed that the absolute differences range consistently from 3µs to 14µs. The time difference includes asynchronous waits, software overhead, and measurement errors. For small payload sizes, the relative difference is large because the majority of the time is asynchronous wait and software overhead; for large payload sizes, the payload time is the major portion.

Similar to broadcast, Figure 5.16 illustrates the time chart of reduce from the viewpoint of the communication payload. These analytic models can be summarized in Equation 5.4, where O represents the overhead of the pipelined computation core.

T_{reduce,linear} = (2 + n - 1) × P + (n - 1) × O
T_{reduce,binomial} = (2 + log_2(n)) × P + log_2(n) × O    (5.4)
T_{reduce,star} = (2 + n - 1) × P + (n - 1) × O

Figure 5.17 presents the simulation results of reduce. Figure 5.17a shows a close match between the model and the measured values, while both Figure 5.17b and Figure 5.17c show an increasing gap between the model and the measured values. Table 5.2 shows that the growing gap is caused by the handshaking behavior between the parent and the children.

5.4.2.1 Bandwidth

The bit rate of the Aurora channel is 4.0 Gbit/s; removing the error-check bits makes the actual data rate 3.2 Gbit/s. Using the time charts in Figure 5.14 and Figure 5.16, the maximum bandwidth is calculated by dividing

Figure 5.17: Latency simulation of reduce ((a) Linear Tree, (b) Binomial Tree, (c) Star Tree)

Table 5.2: Reduce measurement vs. simulation

(a) Absolute differences (payload size in words; columns = number of nodes)

  Linear     4        8        16       32
  8          5.90µs   6.37µs   9.60µs   15.7µs
  256        5.77µs   5.68µs   8.32µs   10.1µs
  1024       5.92µs   5.63µs   9.26µs   9.46µs
  4096       6.29µs   6.74µs   9.08µs   9.17µs

  Binomial   4        8        16       32
  8          5.83µs   5.41µs   7.01µs   8.82µs
  256        6.05µs   6.45µs   6.89µs   8.69µs
  1024       6.30µs   5.51µs   8.61µs   10.2µs
  4096       6.83µs   6.97µs   8.54µs   10.3µs

  Star       4        8        16       32
  8          12.3µs   25.2µs   55.7µs   138µs
  256        12.4µs   24.6µs   55.6µs   138µs
  1024       11.8µs   24.6µs   55.3µs   136µs
  4096       12.4µs   25.7µs   56.0µs   136µs

(b) Relative differences (payload size in words; columns = number of nodes)

  Linear     4        8        16       32
  8          87.0%    77.6%    71.9%    67.4%
  256        30.3%    19.0%    15.3%    10.2%
  1024       10.3%    5.69%    4.98%    2.69%
  4096       2.97%    1.79%    1.28%    0.671%

  Binomial   4        8        16       32
  8          90.1%    86.0%    86.2%    86.6%
  256        36.4%    32.7%    30.1%    31.7%
  1024       13.2%    9.64%    12.2%    12.3%
  4096       3.99%    3.28%    3.35%    3.47%

  Star       4        8        16       32
  8          93.3%    93.2%    93.7%    94.8%
  256        48.3%    50.4%    54.8%    60.7%
  1024       18.5%    20.9%    23.8%    28.4%
  4096       5.70%    6.51%    7.43%    9.16%

the actual data rate by the maximum number of stages N_stage, as shown in Equation 5.5.

B = 3.2 / N_{stage}    (5.5)

Figure 5.18 shows the simulated maximum bandwidth, and Table 5.3 lists the differences between the measurement and the simulation. It can be observed that the Star Tree has the closest match between the measurement and the simulation among all the topologies, while the Linear Tree has a growing gap between the measurement and the simulation. This is because the payload operation is the dominating operation in the Star Tree, and the other operations are relatively constant and small compared to the payload. The Linear Tree, by contrast, hides the payload operations with its pipelined communication pattern, which exposes the growing gap occupied by other operations (e.g., handshaking).

5.5 Stage 2 Experiment

The following experiments test the hardware MPE with custom hardware accelerators. To offload the workload of the general PE, the hardware fabric can be used to accelerate both the computation and the communication. A custom parallel FFT operation and a parallel matrix-vector multiplication are tested.

Figure 5.18: Max bandwidth simulation of broadcast and reduce ((a) broadcast, (b) reduce)

Table 5.3: Bandwidth measurement vs. simulation

Broadcast    4           8           16          32
Linear       54.3 Mbps   66.7 Mbps   80.2 Mbps   132 Mbps
Binomial     52.5 Mbps   33.7 Mbps   29.7 Mbps   35.1 Mbps
Star         28.2 Mbps   8.59 Mbps   2.49 Mbps   1.31 Mbps

Reduce       4           8           16          32
Linear       65.9 Mbps   70.2 Mbps   88.3 Mbps   150 Mbps
Binomial     44.6 Mbps   36.1 Mbps   30.1 Mbps   38.8 Mbps
Star         37.9 Mbps   24.1 Mbps   14.6 Mbps   9.21 Mbps

Figure 5.19: Communication impact on software FFT ((a) Software MPI, (b) Hardware MPE)

5.5.1 Parallel Fast Fourier Transformation

The parallel Fast Fourier Transformation (FFT) tests exercise the inter-node stages of the FFT DIF algorithm, including the computation as well as the communication. Four test sets are evaluated: 1. software computation and software MPI; 2. software computation and hardware MPE; 3. hardware-accelerated computation and software MPI; 4. hardware-accelerated computation and hardware MPE.

5.5.1.1 Communication Impact on Software FFT Computation

Figure 5.19 shows the impact of the hardware MPE and the software MPI on the software FFT computation. The reported results are total execution times, including both the computation time and the communication time. It can be seen that the hardware MPE can effectively reduce the communication time and improve the overall execution time. Figure 5.20 compares the software MPI and the hardware MPE for particular problem sizes. For the small problem size shown in Figure 5.20a, the test using software MPI spends the majority of the execution time on communication. For the large problem size in Figure 5.20b, when the number of nodes is relatively small, both the hardware and the software communication help reduce the workload and the overall time.

Figure 5.20: Comparison of communication with software FFT ((a) size: 128, (b) size: 8192)

However, as the number of nodes increases, the software communication actually adds more overhead to the execution time, whereas the hardware MPE maintains the downward trend.

5.5.1.2 Communication Impact on Hardware FFT Computation

Figure 5.21 uses the hardware FFT core to accelerate the computation. One interesting observation in Figure 5.21a is that using the hardware processing elements together with software MPI actually increases the overall execution time. This is because the hardware accelerator requires extra communication to coordinate the hardware with the existing software. By combining the hardware MPE and the hardware FFT, Figure 5.21b shows a great improvement in performance (> 20×) over the combination of software FFT and hardware MPE. Figure 5.22 illustrates the communication impact of the hardware MPE and the software MPI on the hardware-accelerated FFT computation. It can be observed that the hardware FFT computation only occupies a small amount of time on actual computation; all the rest of the execution is spent on the software communication.

Figure 5.21: Communication impact on hardware FFT ((a) Software MPI, (b) Hardware MPE)

Figure 5.22: Comparison of communication with hardware FFT ((a) size: 128, (b) size: 8192)

Figure 5.23: Communication impact on software MACC computation ((a) 128 word, (b) 4096 word)

5.5.2 Parallel Matrix-Vector Multiplication

Similar to the FFT, the parallel matrix-vector multiplication is evaluated with the same four test sets. Additionally, because the row partitioning gives a uniform view of the problem, a hybrid computing system using hardware and software is also tested.

5.5.2.1 Communication Impact on Software MACC Kernel

Figure 5.23 compares the impact of the hardware MPE and the software MPI on the software computation. The reported results are total execution times, including both the computation time and the communication time. For the matrix size of 4096 words, Figure 5.23 shows the classic parallel-processing result: as more nodes are involved, the problem is divided into smaller pieces and the total execution time is reduced. Comparing the hardware MPE and the software MPI, there is no distinct difference between them, because the computation in the software MACC kernel occupies almost the entire execution time (> 100 ms), which makes the communication time indistinguishable. For the small matrix size of 128 words, the result is interesting because it shows the opposite trend compared to the large matrix size. When the number of nodes is small, both the hardware MPE and the software MPI help distribute the matrix

Figure 5.24: Communication impact on accelerated MACC computation ((a) 128 word, (b) 4096 word)

and reduce the total execution time. However, as more nodes are involved, the total execution time using software MPI increases instead of decreasing, while the total execution time using the hardware MPE keeps decreasing as expected. This is because the growing number of nodes effectively reduces the actual computation (< 25 ms) on each node; relatively, the increasing software communication time starts to dominate the overall execution time, whereas the fast hardware MPE keeps helping to reduce the overall execution time.

5.5.2.2 Communication Impact on Hardware MACC Core

Figure 5.24 exhibits the results of the hardware MACC core with different communication methods. Though running at 100 MHz, the hardware MACC core leverages the DSP slices within the FPGA and processes the data in a pipelined style. In contrast, the software MACC kernel runs at 300 MHz, but it lacks a floating-point unit and fetches the data through the bus. From Figure 5.24, it can be seen that the hardware MACC core is able to improve the performance by ≈ 100× for the matrix size of 4096 words, whether using the hardware MPE or the software MPI. For the small matrix size of 128 words, it can be observed that software MPI does not improve the performance

Figure 5.25: Hardware MACC and MPE ((a) 128 word, (b) 4096 word)

very much, especially for large numbers of nodes. Similar to the results observed in Figure 5.23, most of the execution time is spent in the software MPI, which makes the performance improvement less obvious. The hardware MPE scales well, and the performance improvement (≈ 20×) can be clearly observed. Figure 5.25 presents the hardware MPE with different numbers of MACC cores. Under close examination, it can be seen that, like software MPI, the hardware MPE also scales up as the number of nodes increases, but at a much slower rate than the software MPI. Figure 5.25 also illustrates how the number of MACC cores affects the computation: there is no significant performance impact from using different numbers of MACC cores. This is due to the fact that the switch and the MPE have only one memory interface, so multiple message streams are lined up for each MACC core on the switch. As a result, the computation time is largely determined by the number of transactions on the switch, which is a fixed number for a given matrix size. An estimated mathematical model is summarized in Equation 5.6:

T = T_{vector B} + T_{partial A}
T_{vector B} = N_{macc} × L_{row}
T_{partial A} = (L_{row} × N_{macc} + O_{macc}) × N_{iter}
N_{iter} = L_{row} / (N_{nodes} × N_{macc})

=>  T = L_{row}^2 / N_{nodes} + L_{row} × N_{macc} + (O_{macc} × L_{row}) / (N_{nodes} × N_{macc})
      ≥ L_{row}^2 / N_{nodes} + 2 × L_{row} × √(O_{macc} / N_{nodes})    (5.6)

The equations above divide the computation time into the time for broadcasting vector B (T_{vector B}) and the time for processing the partial matrix A (T_{partial A}). For

a partial matrix A of size (L_{row}/N_{nodes}) larger than the number of MACC cores (N_{macc}), the hardware MACC requires multiple iterations of computation (N_{iter}). Because the computations on the MACC run in a pipelined style, only one overhead term from the MACC (O_{macc}) is considered. From this model, it can be concluded that the execution time is determined by the size of the matrix (O(L_{row}^2)). This result may suggest that scaling up the number of MACC cores cannot further improve the performance; however, this result is actually due to the single memory interface. For other applications, the hardware accelerators may scale well.
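Equation 5.6 can likewise be evaluated directly; a small sketch, with all quantities expressed in the consistent time/size units implied by the model:

    /* Estimated MACC execution time of Equation 5.6.
     * l_row: matrix dimension, n_nodes: node count, n_macc: MACC cores per node,
     * o_macc: per-iteration overhead of the pipelined MACC core. */
    static double t_macc_model(double l_row, double n_nodes, double n_macc, double o_macc)
    {
        double t_vector_b  = n_macc * l_row;                      /* broadcast of vector B */
        double n_iter      = l_row / n_nodes / n_macc;            /* iterations per node   */
        double t_partial_a = (l_row * n_macc + o_macc) * n_iter;  /* streaming partial A   */
        return t_vector_b + t_partial_a;
    }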

5.5.3 Hybrid Computing System

Perhaps the most interesting result of this work is the hybrid computing system. Since the matrix-vector multiplication can be broken into multiple uniform vector-vector multiply-accumulate operations, both the hardware MACC core and the software MACC kernel are able to compute their results independently and in parallel. In this experiment, the hardware MPE is used as the communication method, 8 MACC cores are implemented in hardware, and various sizes of workload are tested on the software MACC kernel. Two threads are generated with the Pthreads library: one is the hardware MACC thread, and the other is the software MACC thread.
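A minimal Pthreads sketch of that split. The row bookkeeping, the argument struct, and the two worker functions are illustrative placeholders; in the experiment the hardware worker drives the 8 MACC cores through the MPE while the software worker runs the for-loop kernel.

    #include <pthread.h>

    typedef struct { int first_row, num_rows; } macc_work_t;

    /* Placeholders for the two workers: one issues rows to the hardware MACC
     * cores, the other computes its rows with the software MACC kernel. */
    extern void *hw_macc_worker(void *arg);
    extern void *sw_macc_worker(void *arg);

    /* Split the locally owned rows between one hardware and one software thread. */
    static void hybrid_mvm(int local_rows, int sw_rows)
    {
        pthread_t hw, sw;
        macc_work_t hw_work = { 0,                    local_rows - sw_rows };
        macc_work_t sw_work = { local_rows - sw_rows, sw_rows };

        pthread_create(&hw, NULL, hw_macc_worker, &hw_work);
        pthread_create(&sw, NULL, sw_macc_worker, &sw_work);

        pthread_join(hw, NULL);   /* both partial results are ready after the joins */
        pthread_join(sw, NULL);
    }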

Figure 5.26: Communication impact on hybrid computing system ((a) Software MPI, (b) Hardware MPE)

Figure 5.26a shows that the software MPI adds additional workload to the hybrid computing system, while the hardware MPE does not. It can also be observed in Figure 5.26 that the software MACC thread actually slows the whole system down. There are two major reasons for the slowdown. First, using Pthreads adds software overhead. Second, the PowerPC used in the test has only one processor core, which has to process the software MACC computation as well as the Pthreads overhead. However, these two issues can be resolved in future heterogeneous multi/many-core systems: software threads may run on one or several separate processor cores with fast clock rates and better floating-point units. Using the hardware MPE is then even more meaningful, as the hybrid computing system can focus purely on the computation while the hardware MPE facilitates the communication.

5.6 Validation

The thesis question asks: "Can hardware be used to provide a unified view of the heterogeneous system and provide message-passing functions to the chip as well as to the cluster?" Chapter 1 further divides this thesis question into the following five questions. This section answers these five questions by summarizing the experimental results.

Table 5.4: Resource utilization of hardware MPE

                              Used    Available   Percentage
Number of Slices:             1717    25280       6%
Number of Slice Flip Flops:   1283    50560       2%
Number of 4-input LUTs:       2843    50560       5%
Number of FIFO16/RAMB16s:     16      232         6%
Number of DSP48s:             4       128         3%

Table 5.5: Performance improvement of hardware MPE

                      MPE         Software MPI on Spirit   Improvement
Barrier 4 nodes       4.54 µs     4509 µs                  993×
Barrier 32 nodes      24.10 µs    18755 µs                 740×
Broadcast 4 nodes     169.4 µs    13900 µs                 82×
Broadcast 32 nodes    291.6 µs    39440 µs                 135×
Reduce 4 nodes        171 µs      16840 µs                 98×
Reduce 32 nodes       298 µs      26360 µs                 88×
Broadcast 4 nodes     1010 Mbps   29 Mbps                  35×
Broadcast 32 nodes    934 Mbps    29.7 Mbps                31×
Reduce 4 nodes        1000 Mbps   14 Mbps                  71×
Reduce 32 nodes       916 Mbps    3.86 Mbps                237×

1. Is the hardware MPE practical and feasible?

Yes, it functions correctly, and Table 5.4 shows that it occupies a reasonable amount of hardware resources.

2. Can the hardware communication engine improve the overall performance?

Yes, Table 5.5 shows that the hardware MPE can improve the performance by ≈ 30× to ≈ 1000×.

3. Is the hardware communication engine scalable when the system grows?

Yes, the communication model shows that it fully utilizes the bandwidth of the physical infrastructure. With only a small amount of overhead, the latency is dominated by the communication payload.

Table 5.6: Performance improvement of MPE on heterogeneous systems

Software computation    + MPE       software MPI   Improvement
FFT 2 nodes             17.8 ms     33.5 ms        188%
FFT 32 nodes            6.53 ms     37.1 ms        568%
MVM 2 nodes             1230 ms     1230 ms        100%
MVM 32 nodes            774 ms      811 ms         104%

Hardware computation    + MPE       software MPI   Improvement
FFT 2 nodes             0.425 ms    11.6 ms        2729%
FFT 32 nodes            0.561 ms    117 ms         20855%
MVM 2 nodes             89.6 ms     110 ms         122%
MVM 32 nodes            7.36 ms     34.3 ms        466%

4. Can the hardware MPE be used in a heterogeneous system?

Yes, the heterogeneous system tests show that it can, distributing data directly to the heterogeneous hardware.

5. In a heterogeneous system, can the hardware communication engine bring a performance gain?

Yes, Table 5.6 shows that the MPE can improve the performance by ≈ 200× for certain computations.

CHAPTER 6: CONCLUSION

Heterogeneous multi/many-core chips are widely used in today's top-tier supercomputers. Within these heterogeneous chips, the on-chip network often plays a major role by connecting the processing elements together. However, as the system scales up, traditional programming methods, such as MPI, may not use the on-chip network effectively and could therefore make communication the performance bottleneck.

This dissertation designed an MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE offloads the communication workload from the general processing elements. On the other hand, the MPE provides a direct interface to the heterogeneous processing elements, which eliminates the data path through the OS and libraries.

The proposed design has been implemented and evaluated on a parallel FPGA system. The MPE occupies 6% of the hardware resources on a Virtex 4 FX60 FPGA. The experimental results have shown that the MPE can significantly reduce the communication time and improve the overall performance. Specifically, among the three communication topologies studied, Binomial Tree, Star Tree, and Linear Tree, the Binomial Tree exhibits the lowest latency in all experiments. For unidirectional operations such as broadcast and reduce, the Linear Tree is able to pipeline the operation and thereby achieve sustained bandwidth in all experiments. In addition to the experiments, theoretical studies of the communication primitives have shown that the ideal performance matches the measured values well.

To investigate how the hardware MPE integrates with a heterogeneous system, two heterogeneous configurations were designed and implemented in the FPGA. The experimental results have shown that the hardware MPE can be tightly coupled with the computing cores, thereby increasing the total performance of the parallel computing system. Additionally, a hybrid “MPI+Pthreads” computing system was tested, and it shows that the MPE can effectively offload the communications and let the processing elements play to their strengths on the computation.

In summary, the hardware MPE can effectively improve the communication performance in parallel computing systems. The use of the hardware MPE is not limited to FPGAs; it can be applied to general multi/many-core processors. Specifically, in future heterogeneous systems with “Big–little” configurations, the hardware MPE can be integrated into the “little” processors to assist with communication without support from the OS.
