Lightweight, Scalable, Shared-Memory Computing on Many-Core Processors

LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS By BRYANT C. LAM A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2015 © 2015 Bryant C. Lam ACKNOWLEDGMENTS I would like to thank three important groups of people for their assistance and support in the creation of this dissertation: my committee members, my close friends and colleagues, and my wonderful family. I personally would like to thank my chair and cochair, Dr. Alan George and Dr. Herman Lam, for their academic, career, and personal advice and opportunities; my parents, Hoa and Jun, for their encouragement; and my loving wife, Phoebe, for her compassion and years of support. And last, but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022. 3 TABLE OF CONTENTS page ACKNOWLEDGMENTS...................................3 LIST OF TABLES......................................7 LIST OF FIGURES.....................................8 ABSTRACT......................................... 10 CHAPTER 1 INTRODUCTION................................... 12 2 LOW-LEVEL PGAS COMPUTING ON MANY-CORE PROCESSORS WITH TSHMEM 15 2.1 Background................................... 16 2.1.1 SHMEM and OpenSHMEM....................... 16 2.1.2 GASNet and the OpenSHMEM Reference Implementation....... 18 2.1.3 GSHMEM................................ 18 2.1.4 OSHMPI: OpenSHMEM using MPI-3.................. 19 2.1.5 OpenMP................................. 19 2.1.6 Tilera Many-Core Processors....................... 20 2.2 Device Performance Studies........................... 21 2.2.1 Memory Hierarchy............................ 22 2.2.2 TMC Common Memory......................... 23 2.2.3 TMC UDN Helper Functions....................... 25 2.2.4 TMC Spin and Sync Barriers....................... 27 2.3 Design Overview of TSHMEM.......................... 28 2.3.1 Environment Setup and Initialization................... 29 2.3.2 Point-to-Point Data Transfers...................... 29 2.3.2.1 Dynamically allocated symmetric objects........... 30 2.3.2.2 Statically allocated symmetric objects............. 30 2.3.2.3 Performance of SHMEM put/get............... 31 2.3.3 Synchronization............................. 34 2.3.3.1 Barrier synchronization.................... 34 2.3.3.2 Fence/quiet.......................... 36 2.3.4 Collective Communication........................ 36 2.3.4.1 Broadcast........................... 37 2.3.4.2 Fast collection......................... 39 2.3.4.3 Reduction........................... 40 2.4 Application Case Studies............................. 43 2.4.1 Exponential Curve Fitting........................ 43 2.4.2 OSH 2D Heat Equation......................... 45 2.4.3 Matrix Multiply.............................. 45 2.4.4 OSH Matrix Multiply........................... 46 4 2.4.5 OSH Heat Image............................. 47 2.4.6 Distributed FFT with SHMEM and FFTW............... 49 2.5 Concluding Remarks............................... 50 3 EVALUATING MANY-CORE PERFORMANCE WITH NAS PARALLEL BENCH- MARKS........................................ 52 3.1 Background................................... 52 3.1.1 OpenMP................................. 52 3.1.2 Tilera TILE-Gx.............................. 53 3.1.3 Intel Xeon Phi.............................. 54 3.2 Architecture Profiling with NPB......................... 55 3.2.1 NPB Kernels............................... 56 3.2.1.1 IS: integer sort......................... 56 3.2.1.2 EP: embarassingly parallel................... 56 3.2.1.3 CG: conjugate gradient.................... 58 3.2.1.4 MG: multi-grid......................... 59 3.2.1.5 FT: discrete 3D Fourier transform............... 59 3.2.2 NPB Pseudo-applications........................ 60 3.2.2.1 BT: block tri-diagonal solver................. 60 3.2.2.2 SP: scalar penta-diagonal solver................ 61 3.2.2.3 LU: lower-upper Gauss–Seidel solver............. 61 3.2.3 NPB Unstructured Computation and Data Movement......... 62 3.2.3.1 UA: unstructured adaptive mesh............... 62 3.2.3.2 DC: data cube......................... 63 3.2.4 Architectural Analysis.......................... 63 3.3 Concluding Remarks............................... 65 4 ANALYSIS AND DESIGN OPTIMIZATION OF SCIF COMMUNICATIONS FOR PGAS COMPUTING WITH SHMEM ACROSS MANY-CORE COPROCESSORS.. 66 4.1 Background................................... 67 4.1.1 Intel Xeon Phi (Knights Corner) Coprocessor.............. 68 4.1.2 PGAS and OpenSHMEM......................... 69 4.1.3 Related Works.............................. 70 4.2 Communication with Xeon Phi......................... 71 4.2.1 System Setup............................... 71 4.2.2 Communication Methods......................... 72 4.2.3 SCIF Overview.............................. 73 4.2.4 SCIF Performance Evaluation...................... 75 4.2.4.1 Intra-device.......................... 75 4.2.4.2 Inter-device near........................ 77 4.2.4.3 Inter-device far......................... 78 4.2.4.4 Performance highlights.................... 79 4.3 Design Overview of TSHMEM.......................... 81 4.3.1 Environment Setup and Initialization................... 82 5 4.3.1.1 Symmetric PGAS partitions.................. 82 4.3.1.2 SCIF network manager..................... 83 4.3.2 Put/Get................................. 84 4.3.3 Synchronization............................. 85 4.3.3.1 Barrier............................. 85 4.3.3.2 Fence/quiet.......................... 85 4.3.4 Other SHMEM Routines......................... 86 4.4 Performance Evaluation............................. 87 4.4.1 Setup of MPI Runtime Environments.................. 87 4.4.1.1 MPICH............................. 87 4.4.1.2 MVAPICH2-MIC........................ 87 4.4.1.3 Intel MPI............................ 88 4.4.2 Put/Get................................. 89 4.4.2.1 Intra-device.......................... 89 4.4.2.2 Inter-device near........................ 91 4.4.2.3 Inter-device far......................... 93 4.4.3 Barrier.................................. 93 4.4.4 Application Case Studies......................... 95 4.4.4.1 2D heat equation....................... 95 4.4.4.2 Heat image.......................... 97 4.4.4.3 Distributed FFT........................ 98 4.5 Concluding Remarks............................... 99 5 CONCLUSIONS.................................... 101 REFERENCES........................................ 103 BIOGRAPHICAL SKETCH................................. 108 6 LIST OF TABLES Table page 2-1 Basic subset of OpenSHMEM functions........................ 17 2-2 Architectural comparison for TILE-Gx8036 and TILEPro64.............. 20 2-3 Performance of OSH heat image at 36 cores for varying problem sizes........ 49 3-1 Speedup of NPB OpenMP for TILE-Gx and Xeon Phi................. 60 7 LIST OF FIGURES Figure page 2-1 Tilera architecture diagrams.............................. 20 2-2 Effective transfer bandwidth for shared-memory copy operations........... 24 2-3 Average half round-trip latencies on UDN....................... 26 2-4 Latencies of TMC spin and sync barriers........................ 28 2-5 Effective bandwidth of SHMEM put/get transfers on TILE-Gx36........... 31 2-6 Latencies of SHMEM dynamic put/get transfers on TILE-Gx36............ 32 2-7 Latencies of SHMEM static put/get transfers on TILE-Gx36............. 33 2-8 Latencies of SHMEM barrier on TILE-Gx36...................... 35 2-9 SHMEM broadcast latencies on TILE-Gx36...................... 38 2-10 SHMEM fast-collect latencies on TILE-Gx36...................... 39 2-11 SHMEM float-summation reduction latencies on TILE-Gx36.............. 41 2-12 Execution times for exponential curve fitting, OSH 2D heat equation, matrix multiply, and OSH matrix multiply.............................. 44 2-13 Execution times for OSH heat image and parallelization of FFTW.......... 48 3-1 Execution times for NPB kernels............................ 57 3-2 Execution times for NPB pseudo-applications..................... 61 3-3 Execution times for NPB unstructured computation and data movement....... 62 4-1 SCIF on Xeon Phi for intra-device communication within a single coprocessor.... 76 4-2 SCIF on Xeon Phi for inter-device near communication between two coprocessors via PCIe managed by the same CPU.......................... 77 4-3 SCIF on Xeon Phi for inter-device far communication between two coprocessors, each managed by a different, adjacent CPU...................... 79 4-4 System diagram with SCIF read/write small-message latencies and large-message effective bandwidths................................... 80 4-5 TSHMEM design architecture for Xeon Phi...................... 82 4-6 One-sided put/get latencies within a single Xeon Phi coprocessor........... 89 4-7 One-sided put/get latencies between two coprocessors in a system node....... 91 8 4-8 Barrier latencies on several Xeon Phi coprocessors................... 94 4-9 Execution times for 2D heat equation......................... 96 4-10 Execution times for heat image............................. 97 4-11 Execution times for distributed FFTW......................... 98 9 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS By Bryant C.

Lightweight, Scalable, Shared-Memory Computing on Many-Core Processors

Mellanox Scalableshmem Support the Openshmem Parallel Programming Language Over Infiniband

Many-Core Fabricated Chips Information Page

Introduction to Parallel Programming

An Open Standard for SHMEM Implementations Introduction

Enabling Efficient Use of UPC and Openshmem PGAS Models on GPU Clusters

Introduction to Parallel Computing

Bridging Parallel and Reconfigurable Computing with Multilevel PGAS and SHMEM+

Distributed Shared Memory – a Survey and Implementation Using Openshmem

NUG Monthly Meeting

A Space Exploration Framework for GPU Application and Hardware Codesign

Parallel Computing, Models and Their Performances

Benchmarking Parallel Performance on Many-Core Processors