Lightweight, Scalable, Shared-Memory Computing on Many-Core Processors
LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS

By
BRYANT C. LAM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2015

© 2015 Bryant C. Lam

ACKNOWLEDGMENTS

I would like to thank three important groups of people for their assistance and support in the creation of this dissertation: my committee members, my close friends and colleagues, and my wonderful family. I personally would like to thank my chair and cochair, Dr. Alan George and Dr. Herman Lam, for their academic, career, and personal advice and opportunities; my parents, Hoa and Jun, for their encouragement; and my loving wife, Phoebe, for her compassion and years of support. And last, but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 LOW-LEVEL PGAS COMPUTING ON MANY-CORE PROCESSORS WITH TSHMEM
  2.1 Background
    2.1.1 SHMEM and OpenSHMEM
    2.1.2 GASNet and the OpenSHMEM Reference Implementation
    2.1.3 GSHMEM
    2.1.4 OSHMPI: OpenSHMEM using MPI-3
    2.1.5 OpenMP
    2.1.6 Tilera Many-Core Processors
  2.2 Device Performance Studies
    2.2.1 Memory Hierarchy
    2.2.2 TMC Common Memory
    2.2.3 TMC UDN Helper Functions
    2.2.4 TMC Spin and Sync Barriers
  2.3 Design Overview of TSHMEM
    2.3.1 Environment Setup and Initialization
    2.3.2 Point-to-Point Data Transfers
      2.3.2.1 Dynamically allocated symmetric objects
      2.3.2.2 Statically allocated symmetric objects
      2.3.2.3 Performance of SHMEM put/get
    2.3.3 Synchronization
      2.3.3.1 Barrier synchronization
      2.3.3.2 Fence/quiet
    2.3.4 Collective Communication
      2.3.4.1 Broadcast
      2.3.4.2 Fast collection
      2.3.4.3 Reduction
  2.4 Application Case Studies
    2.4.1 Exponential Curve Fitting
    2.4.2 OSH 2D Heat Equation
    2.4.3 Matrix Multiply
    2.4.4 OSH Matrix Multiply
    2.4.5 OSH Heat Image
    2.4.6 Distributed FFT with SHMEM and FFTW
  2.5 Concluding Remarks
3 EVALUATING MANY-CORE PERFORMANCE WITH NAS PARALLEL BENCHMARKS
  3.1 Background
    3.1.1 OpenMP
    3.1.2 Tilera TILE-Gx
    3.1.3 Intel Xeon Phi
  3.2 Architecture Profiling with NPB
    3.2.1 NPB Kernels
      3.2.1.1 IS: integer sort
      3.2.1.2 EP: embarrassingly parallel
      3.2.1.3 CG: conjugate gradient
      3.2.1.4 MG: multi-grid
      3.2.1.5 FT: discrete 3D Fourier transform
    3.2.2 NPB Pseudo-applications
      3.2.2.1 BT: block tri-diagonal solver
      3.2.2.2 SP: scalar penta-diagonal solver
      3.2.2.3 LU: lower-upper Gauss–Seidel solver
    3.2.3 NPB Unstructured Computation and Data Movement
      3.2.3.1 UA: unstructured adaptive mesh
      3.2.3.2 DC: data cube
    3.2.4 Architectural Analysis
  3.3 Concluding Remarks
4 ANALYSIS AND DESIGN OPTIMIZATION OF SCIF COMMUNICATIONS FOR PGAS COMPUTING WITH SHMEM ACROSS MANY-CORE COPROCESSORS
  4.1 Background
    4.1.1 Intel Xeon Phi (Knights Corner) Coprocessor
    4.1.2 PGAS and OpenSHMEM
    4.1.3 Related Works
  4.2 Communication with Xeon Phi
    4.2.1 System Setup
    4.2.2 Communication Methods
    4.2.3 SCIF Overview
    4.2.4 SCIF Performance Evaluation
      4.2.4.1 Intra-device
      4.2.4.2 Inter-device near
      4.2.4.3 Inter-device far
      4.2.4.4 Performance highlights
  4.3 Design Overview of TSHMEM
    4.3.1 Environment Setup and Initialization
      4.3.1.1 Symmetric PGAS partitions
      4.3.1.2 SCIF network manager
    4.3.2 Put/Get
    4.3.3 Synchronization
      4.3.3.1 Barrier
      4.3.3.2 Fence/quiet
    4.3.4 Other SHMEM Routines
  4.4 Performance Evaluation
    4.4.1 Setup of MPI Runtime Environments
      4.4.1.1 MPICH
      4.4.1.2 MVAPICH2-MIC
      4.4.1.3 Intel MPI
    4.4.2 Put/Get
      4.4.2.1 Intra-device
      4.4.2.2 Inter-device near
      4.4.2.3 Inter-device far
    4.4.3 Barrier
    4.4.4 Application Case Studies
      4.4.4.1 2D heat equation
      4.4.4.2 Heat image
      4.4.4.3 Distributed FFT
  4.5 Concluding Remarks
5 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
2-1 Basic subset of OpenSHMEM functions
2-2 Architectural comparison for TILE-Gx8036 and TILEPro64
2-3 Performance of OSH heat image at 36 cores for varying problem sizes
3-1 Speedup of NPB OpenMP for TILE-Gx and Xeon Phi

LIST OF FIGURES

Figure
2-1 Tilera architecture diagrams
2-2 Effective transfer bandwidth for shared-memory copy operations
2-3 Average half round-trip latencies on UDN
2-4 Latencies of TMC spin and sync barriers
2-5 Effective bandwidth of SHMEM put/get transfers on TILE-Gx36
2-6 Latencies of SHMEM dynamic put/get transfers on TILE-Gx36
2-7 Latencies of SHMEM static put/get transfers on TILE-Gx36
2-8 Latencies of SHMEM barrier on TILE-Gx36
2-9 SHMEM broadcast latencies on TILE-Gx36
2-10 SHMEM fast-collect latencies on TILE-Gx36
2-11 SHMEM float-summation reduction latencies on TILE-Gx36
2-12 Execution times for exponential curve fitting, OSH 2D heat equation, matrix multiply, and OSH matrix multiply
2-13 Execution times for OSH heat image and parallelization of FFTW
3-1 Execution times for NPB kernels
3-2 Execution times for NPB pseudo-applications
3-3 Execution times for NPB unstructured computation and data movement
4-1 SCIF on Xeon Phi for intra-device communication within a single coprocessor
4-2 SCIF on Xeon Phi for inter-device near communication between two coprocessors via PCIe managed by the same CPU
4-3 SCIF on Xeon Phi for inter-device far communication between two coprocessors, each managed by a different, adjacent CPU
4-4 System diagram with SCIF read/write small-message latencies and large-message effective bandwidths
4-5 TSHMEM design architecture for Xeon Phi
4-6 One-sided put/get latencies within a single Xeon Phi coprocessor
4-7 One-sided put/get latencies between two coprocessors in a system node
4-8 Barrier latencies on several Xeon Phi coprocessors
4-9 Execution times for 2D heat equation
4-10 Execution times for heat image
4-11 Execution times for distributed FFTW

Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS

By Bryant C.