LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS

By BRYANT C. LAM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

© 2015 Bryant C. Lam

ACKNOWLEDGMENTS

I would like to thank three important groups of people for their assistance and support in the creation of this dissertation: my committee members, my close friends and colleagues, and my wonderful family. I personally would like to thank my chair and cochair, Dr. Alan George and Dr. Herman Lam, for their academic, career, and personal advice and opportunities; my parents, Hoa and Jun, for their encouragement; and my loving wife, Phoebe, for her compassion and years of support. And last, but certainly not least, this work was supported in part by the I/UCRC Program of the National Science Foundation under Grant Nos. EEC-0642422 and IIP-1161022.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER
1 INTRODUCTION
2 LOW-LEVEL PGAS COMPUTING ON MANY-CORE PROCESSORS WITH TSHMEM
  2.1 Background
    2.1.1 SHMEM and OpenSHMEM
    2.1.2 GASNet and the OpenSHMEM Reference Implementation
    2.1.3 GSHMEM
    2.1.4 OSHMPI: OpenSHMEM using MPI-3
    2.1.5 OpenMP
    2.1.6 Tilera Many-Core Processors
  2.2 Device Performance Studies
    2.2.1 Memory Hierarchy
    2.2.2 TMC Common Memory
    2.2.3 TMC UDN Helper Functions
    2.2.4 TMC Spin and Sync Barriers
  2.3 Design Overview of TSHMEM
    2.3.1 Environment Setup and Initialization
    2.3.2 Point-to-Point Data Transfers
      2.3.2.1 Dynamically allocated symmetric objects
      2.3.2.2 Statically allocated symmetric objects
      2.3.2.3 Performance of SHMEM put/get
    2.3.3 Synchronization
      2.3.3.1 Barrier synchronization
      2.3.3.2 Fence/quiet
    2.3.4 Collective Communication
      2.3.4.1 Broadcast
      2.3.4.2 Fast collection
      2.3.4.3 Reduction
  2.4 Application Case Studies
    2.4.1 Exponential Curve Fitting
    2.4.2 OSH 2D Heat Equation
    2.4.3 Matrix Multiply
    2.4.4 OSH Matrix Multiply
    2.4.5 OSH Heat Image
    2.4.6 Distributed FFT with SHMEM and FFTW
  2.5 Concluding Remarks
3 EVALUATING MANY-CORE PERFORMANCE WITH NAS PARALLEL BENCHMARKS
  3.1 Background
    3.1.1 OpenMP
    3.1.2 Tilera TILE-Gx
    3.1.3 Xeon Phi
  3.2 Architecture Profiling with NPB
    3.2.1 NPB Kernels
      3.2.1.1 IS: integer sort
      3.2.1.2 EP: embarrassingly parallel
      3.2.1.3 CG: conjugate gradient
      3.2.1.4 MG: multi-grid
      3.2.1.5 FT: discrete 3D Fourier transform
    3.2.2 NPB Pseudo-applications
      3.2.2.1 BT: block tri-diagonal solver
      3.2.2.2 SP: scalar penta-diagonal solver
      3.2.2.3 LU: lower-upper Gauss–Seidel solver
    3.2.3 NPB Unstructured Computation and Data Movement
      3.2.3.1 UA: unstructured adaptive mesh
      3.2.3.2 DC: data cube
    3.2.4 Architectural Analysis
  3.3 Concluding Remarks
4 ANALYSIS AND DESIGN OPTIMIZATION OF SCIF COMMUNICATIONS FOR PGAS COMPUTING WITH SHMEM ACROSS MANY-CORE COPROCESSORS
  4.1 Background
    4.1.1 Intel Xeon Phi (Knights Corner) Coprocessor
    4.1.2 PGAS and OpenSHMEM
    4.1.3 Related Works
  4.2 Communication with Xeon Phi
    4.2.1 System Setup
    4.2.2 Communication Methods
    4.2.3 SCIF Overview
    4.2.4 SCIF Performance Evaluation
      4.2.4.1 Intra-device
      4.2.4.2 Inter-device near
      4.2.4.3 Inter-device far
      4.2.4.4 Performance highlights
  4.3 Design Overview of TSHMEM
    4.3.1 Environment Setup and Initialization
      4.3.1.1 Symmetric PGAS partitions
      4.3.1.2 SCIF network manager
    4.3.2 Put/Get
    4.3.3 Synchronization
      4.3.3.1 Barrier
      4.3.3.2 Fence/quiet
    4.3.4 Other SHMEM Routines
  4.4 Performance Evaluation
    4.4.1 Setup of MPI Runtime Environments
      4.4.1.1 MPICH
      4.4.1.2 MVAPICH2-MIC
      4.4.1.3 Intel MPI
    4.4.2 Put/Get
      4.4.2.1 Intra-device
      4.4.2.2 Inter-device near
      4.4.2.3 Inter-device far
    4.4.3 Barrier
    4.4.4 Application Case Studies
      4.4.4.1 2D heat equation
      4.4.4.2 Heat image
      4.4.4.3 Distributed FFT
  4.5 Concluding Remarks
5 CONCLUSIONS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1  Basic subset of OpenSHMEM functions
2-2  Architectural comparison for TILE-Gx8036 and TILEPro64
2-3  Performance of OSH heat image at 36 cores for varying problem sizes
3-1  … of NPB OpenMP for TILE-Gx and Xeon Phi

LIST OF FIGURES

2-1  Tilera architecture diagrams
2-2  Effective transfer bandwidth for shared-memory copy operations
2-3  Average half round-trip latencies on UDN
2-4  Latencies of TMC spin and sync barriers
2-5  Effective bandwidth of SHMEM put/get transfers on TILE-Gx36
2-6  Latencies of SHMEM dynamic put/get transfers on TILE-Gx36
2-7  Latencies of SHMEM static put/get transfers on TILE-Gx36
2-8  Latencies of SHMEM barrier on TILE-Gx36
2-9  SHMEM broadcast latencies on TILE-Gx36
2-10 SHMEM fast-collect latencies on TILE-Gx36
2-11 SHMEM float-summation reduction latencies on TILE-Gx36
2-12 Execution times for exponential curve fitting, OSH 2D heat equation, matrix multiply, and OSH matrix multiply
2-13 Execution times for OSH heat image and parallelization of FFTW
3-1  Execution times for NPB kernels
3-2  Execution times for NPB pseudo-applications
3-3  Execution times for NPB unstructured computation and data movement
4-1  SCIF on Xeon Phi for intra-device communication within a single coprocessor
4-2  SCIF on Xeon Phi for inter-device near communication between two coprocessors via PCIe managed by the same CPU
4-3  SCIF on Xeon Phi for inter-device far communication between two coprocessors, each managed by a different, adjacent CPU
4-4  System diagram with SCIF read/write small-message latencies and large-message effective bandwidths
4-5  TSHMEM design architecture for Xeon Phi
4-6  One-sided put/get latencies within a single Xeon Phi coprocessor
4-7  One-sided put/get latencies between two coprocessors in a system node
4-8  Barrier latencies on several Xeon Phi coprocessors
4-9  Execution times for 2D heat equation
4-10 Execution times for heat image
4-11 Execution times for distributed FFTW

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LIGHTWEIGHT, SCALABLE, SHARED-MEMORY COMPUTING ON MANY-CORE PROCESSORS

By Bryant C. Lam
December 2015
Chair: Alan D. George
Cochair: Herman Lam
Major: Electrical and Computer Engineering

Modern processor architectures are delivering increasingly higher performance through wider parallelism and more processing cores. At its extreme, this trend gives rise to emerging many-core architectures that focus on extremely parallel workloads using processing cores that are individually less complex but significantly more numerous than those of modern multi-core processors. In this dissertation, we present research, design, and analysis of a new SHMEM infrastructure specifically crafted for low-level PGAS on modern and emerging many-core processors featuring dozens of cores and more. Our approach, a new library known as TSHMEM, is investigated and evaluated atop Tilera and Intel many-core architectures and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE-Gx, TILEPro, and Intel Xeon Phi many-core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and application studies with other communication libraries. Furthermore, we provide OpenMP application results with the NAS Parallel Benchmarks (NPB) for the TILE-Gx and Xeon Phi, allowing us to observe architectural strengths for each architecture through the context of computational kernels and communication patterns common to numerous science domains in HPC. In leveraging the insights from device-level microbenchmarking, our TSHMEM design outperforms the OpenSHMEM reference implementation and achieves comparable or better performance than OpenMP and OSHMPI atop MPICH on the TILE-Gx. Furthermore, benchmarking with NPB demonstrates the integer performance strength of TILE-Gx and the floating-point performance advantages of Xeon Phi, highlighting the classes of applications in which each architecture excels.

With our performance analyses and TSHMEM infrastructure, we expand our scope toward inter-device communication performance and behavior on a computationally dense system node of four Intel Xeon Phi 5110P many-core coprocessors. We explore these communication behaviors with TSHMEM, focusing our design decisions on efficient single- and multi-coprocessor communication when these devices are fully utilized. Experiments with TSHMEM show that it outperforms MPICH, MVAPICH2-MIC, and Intel MPI in one-sided put/get performance and barrier synchronization, scales to deliver higher inter-device application performance, and enables critical insights for progressively higher-density systems with nodes containing multiple many-core devices.

CHAPTER 1
INTRODUCTION

Diminishing returns from increased clock frequencies and instruction-level parallelism have forced computer architects to adopt architectures that exploit wider parallelism through multiple processor cores. While emerging many-core architectures have progressed at a remarkable rate, concerns arise regarding the performance and productivity of numerous parallel-programming tools for application development. Development of parallel applications on many-core processors often requires developers to familiarize themselves with unique characteristics of a target platform while attempting to maximize performance and maintain correctness of their applications. The family of partitioned global address space (PGAS) programming models comprises the current state of the art in balancing performance and programmability. One such PGAS approach is SHMEM, a lightweight, shared-memory programming library that has demonstrated high performance and productivity potential for parallel-computing systems with distributed-memory architectures. Chapter 2 presents research, design, and analysis of a new SHMEM infrastructure specifically crafted for low-level PGAS on modern and emerging many-core processors featuring dozens of cores and more. Our approach with a new library known as TSHMEM is investigated and evaluated atop two generations of Tilera architectures, which are among the most sophisticated and scalable many-core processors to date, and is intended to enable similar libraries atop other architectures now emerging. In developing TSHMEM, we explore design decisions and their impact on parallel performance for the Tilera TILE-Gx and TILEPro many-core architectures, and then evaluate the designs and algorithms within TSHMEM through microbenchmarking and application studies with other communication libraries. Our results with barrier primitives provided by the Tilera libraries show dissimilar performance between the TILE-Gx and TILEPro; therefore, TSHMEM's barrier design takes an alternative approach and leverages the on-chip mesh network to provide consistent low-latency performance. In addition, our experiments with TSHMEM show that naive collective

algorithms consistently outperformed linear distributed collective algorithms when executed in an SMP-centric environment. In leveraging these insights for the design of TSHMEM, our approach outperforms the OpenSHMEM reference implementation, achieves comparable or better performance than OpenMP and OSHMPI atop MPICH, and supports similar libraries in delivering high performance to emerging many-core systems.

With the emergence of many-core processor architectures onto the HPC scene, concerns arise regarding the performance and productivity of numerous existing parallel-programming tools, models, and languages. As these devices begin augmenting conventional distributed cluster systems in an evolving age of heterogeneous supercomputing, proper evaluation and profiling of many-core processors must occur in order to understand their performance and architectural strengths with existing parallel-programming environments and HPC applications. Chapter 3 presents and evaluates the comparative performance between two many-core processors, the Tilera TILE-Gx8036 and the Intel Xeon Phi 5110P, in the context of their application performance with OpenMP and their device-benchmarking results with the NAS Parallel Benchmarks (NPB). OpenMP results with NPB on these platforms allow us to observe architectural strengths and insights through the context of computational kernels and communication patterns common to numerous science domains in HPC. Benchmarking with NPB highlights the integer performance strength of TILE-Gx and the floating-point performance advantages of Xeon Phi. By understanding the performance characteristics of these many-core architectures and their comparative application behaviors, application and library developers are able to focus their time and effort toward suitable architectures for their performance needs.

HPC systems are moving more and more toward heterogeneous platforms specializing in particular classes of applications, with specialized tool and library support. As many-core devices evolve with increasingly higher core counts, systems begin to have more parallel computation localized among processing devices within a node, providing greater incentive to optimize for intra-node performance. Chapter 4 presents research, design, and analysis for inter-device communication performance and behavior on a computationally dense

system node consisting of four Intel Xeon Phi 5110P many-core coprocessors. Our approach includes exhaustive microbenchmarking of Intel SCIF (Symmetric Communications Interface) to determine its performance profile for communication between multiple coprocessors. We then explore designs and optimizations for these communication behaviors through a new version of TSHMEM, our OpenSHMEM library specifically crafted to leverage SCIF for inter-coprocessor PGAS-based communication. In developing TSHMEM for Xeon Phi, we focus on efficient single- and multi-coprocessor communication when these devices are fully utilized, and evaluate our approach with microbenchmarks and application case studies alongside several MPI implementations: MPICH, MVAPICH2-MIC, and Intel MPI. Our results show that SCIF can provide high-performance communication between multiple coprocessors, but its large-message write performance significantly degrades when transferring beyond the local PCIe bus and across CPU sockets. Experiments with TSHMEM show that it outperforms the MPI implementations in one-sided put/get performance and barrier synchronization, scales to deliver higher intra-node performance for several applications, and enables critical insights into inter-device behavior for progressively higher-density systems with nodes containing multiple many-core devices.

CHAPTER 2
LOW-LEVEL PGAS COMPUTING ON MANY-CORE PROCESSORS WITH TSHMEM

Parallel programming is experiencing explosive growth in demand due to processor architectures shifting toward many processing cores in an effort to maintain performance progression, especially in the face of technological and physical limitations. With the emergence of many-core processors into the high-performance computing (HPC) scene, there is strong interest in evaluating and evolving existing parallel-programming models, tools, and libraries. This evolution is necessary to best exploit the increasing single-device parallelism from multi- and many-core processors, especially in a field focused on massively distributed supercomputing. HPC has traditionally focused on models such as message passing with MPI [1] or multithreading with OpenMP [2], but interest is rising for a partitioned global address space (PGAS) abstraction with its potential to provide high-performing libraries and languages around a straightforward memory and communication model. Notable members of the PGAS family include SHMEM [3, 4], Unified Parallel C (UPC), Global Arrays (GA), Co-Array Fortran (CAF), Titanium, GASPI, MPI-3 RMA [5], X10, and Chapel. In this chapter, we present research, design, and analysis for a new SHMEM infrastructure for low-level PGAS semantics on modern and emerging many-core processors. We approach our investigation and evaluation with a new SHMEM library known as TSHMEM [6], based on the OpenSHMEM version 1.0 specification, with the intended objective of exploring SHMEM and PGAS semantics on many-core processors and enabling similar libraries to fully leverage these emerging devices. TSHMEM serves as the basis for our performance evaluation of communication algorithms as they pertain to SHMEM functionality, with focus on design exploration and maximizing the capabilities of the Tilera TILE-Gx and TILEPro many-core architectures. While exploring the design decisions that define TSHMEM, we strive to achieve high realizable performance via microbenchmarking and application studies, comparing results with alternative libraries and programming environments. In doing so, TSHMEM

aims to deliver a high-performance, many-core programming library that offers insights into performance for a variety of communication algorithms in the context of highly parallel, many-core processors. The remainder of the chapter is organized as follows. Section 2.1 provides background on the SHMEM library and standardization efforts via OpenSHMEM, our previous research with GSHMEM (i.e., SHMEM for clusters), a synopsis of OpenMP, and a brief introduction to Tilera's many-core architectures. Section 2.2 presents several microbenchmarking results on Tilera TILE-Gx8036 and TILEPro64 processors. Section 2.3 delves into the design of TSHMEM, with performance results and analysis for functionality defined by the OpenSHMEM specification. Section 2.4 presents several application studies with TSHMEM performance. Finally, Section 2.5 provides concluding remarks.

2.1 Background

The single-program, multiple-data (SPMD) programming style is highly amenable to tasks on large parallel systems, enabling diverse programming models such as active message passing, distributed shared memory, and partitioned global address space. This section provides a brief background of SHMEM, GSHMEM, and Tilera, which form the foundation of our experience and design for TSHMEM. A synopsis of OpenMP is also provided, as it serves as one of the parallel-programming environments that TSHMEM is compared with in Section 2.4.

2.1.1 SHMEM and OpenSHMEM

The SHMEM communication library adheres to a strict PGAS model whereby each cooperating parallel process (also known as a processing element, or PE) contributes a shared symmetric partition within the global address space. Each symmetric partition consists of symmetric objects (variables or arrays) of the same size, type, and relative address on all PEs. Originally developed to provide shared-memory semantics on the distributed-memory Cray T3D supercomputer, SHMEM closely models SPMD via its symmetric, partitioned, global address space.
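To make the symmetry property concrete, the following minimal sketch (our own illustration, not code from TSHMEM or the OpenSHMEM specification) derives a remote object's address purely from partition base addresses and a local offset; the two malloc'd buffers are hypothetical stand-ins for the symmetric partitions of two PEs.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical helper: because a symmetric object lives at the same offset
     * inside every PE's partition, a remote address is simply the remote
     * partition base plus the local offset. */
    static void *remote_symmetric_addr(void *local_addr, void *local_base, void *remote_base)
    {
        ptrdiff_t offset = (uint8_t *)local_addr - (uint8_t *)local_base;
        return (uint8_t *)remote_base + offset;
    }

    int main(void)
    {
        /* Mock "partitions" standing in for PE 0 and PE 1. */
        uint8_t *pe0 = malloc(1024), *pe1 = malloc(1024);
        double *x_on_pe0 = (double *)(pe0 + 64);   /* symmetric object at offset 64 */
        double *x_on_pe1 = remote_symmetric_addr(x_on_pe0, pe0, pe1);

        printf("offset on PE 0: %td, offset on PE 1: %td\n",
               (uint8_t *)x_on_pe0 - pe0, (uint8_t *)x_on_pe1 - pe1);
        free(pe0);
        free(pe1);
        return 0;
    }

This is exactly the pointer arithmetic that a library can use to turn a local symmetric address into a target address for a one-sided transfer.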

Table 2-1. Basic subset of OpenSHMEM functions.

  Category                  Example functions
  Setup and initialization  start_pes()
  Environment query         shmem_my_pe(), shmem_n_pes()
  Memory allocation         shmalloc(), shfree()
  Elemental put/get         shmem_int_p(), shmem_int_g()
  Block put/get             shmem_putmem(), shmem_getmem()
  Strided put/get           shmem_int_iput(), shmem_int_iget()
  Barrier                   shmem_barrier(), shmem_barrier_all()
  Communications sync       shmem_fence(), shmem_quiet()
  Point-to-point sync       shmem_wait(), shmem_wait_until()
  Broadcast                 shmem_broadcast32()
  Collection                shmem_collect32(), shmem_fcollect32()
  Reduction                 shmem_int_sum_to_all(), shmem_long_prod_to_all()
  Atomic swap               shmem_swap()

There are two types of symmetric objects that can reside in the symmetric partitions: static and dynamic. Static variables reside in the heap segment of the program executable and are allocated at link time. These static variables, when parallelized as multiple processes, appear at the same virtual address to all processes running the same executable, thus ensuring their symmetry across all partitions. Dynamic symmetric variables, in contrast, are allocated at runtime on all PEs via SHMEM's dynamic memory allocation function shmalloc(). These dynamic variables, however, may or may not be allocated at the same virtual address on all PEs, but are at the same offset relative to the start of each symmetric partition. SHMEM provides several routines for explicit communication between PEs, including one-sided data transfers (puts and gets), blocking barrier synchronization, and collective operations, as illustrated by the basic subset of available routines listed in Table 2-1. In addition to being a high-performance, lightweight library, SHMEM has historically provided atomic memory operations not available in popular library alternatives until recently (e.g., MPI 3.0).
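As a brief usage illustration of the routines in Table 2-1, the sketch below (our own example under stated assumptions, not an application from this dissertation; error handling omitted) allocates a symmetric array with shmalloc() and performs a one-sided put to a neighboring PE. It assumes an OpenSHMEM v1.0 toolchain, e.g., compiled with an implementation's oshcc wrapper and launched with its corresponding job launcher.

    #include <shmem.h>
    #include <stdio.h>

    int main(void)
    {
        start_pes(0);                      /* initialize the SHMEM runtime   */
        int me   = shmem_my_pe();          /* this PE's rank                 */
        int npes = shmem_n_pes();          /* total number of PEs            */

        /* Symmetric allocation: every PE calls shmalloc() with the same size
         * at the same point in the program, so 'slots' has the same offset in
         * each PE's symmetric partition. */
        int *slots = (int *) shmalloc(npes * sizeof(int));

        int right = (me + 1) % npes;
        /* One-sided block put: write 'me' into slots[me] on the right neighbor. */
        shmem_putmem(&slots[me], &me, sizeof(int), right);

        shmem_barrier_all();               /* ensure all puts have completed */

        int left = (me + npes - 1) % npes;
        printf("PE %d: slots[%d] = %d (written by PE %d)\n", me, left, slots[left], left);

        shfree(slots);
        return 0;
    }

Because every PE executes the same shmalloc() call, the array occupies the same offset in each symmetric partition, which is what allows shmem_putmem() to name a remote location implicitly.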

Commercial SHMEM implementations have emerged from vendors such as Cray, SGI, and Quadrics. Application portability between variants, however, proved difficult due to different functional semantics, incompatible APIs, or system-specific implementations. This situation regrettably fragmented developer adoption in the HPC community. Fortunately, SHMEM has seen renewed interest in the form of OpenSHMEM, a community-led effort to create a standard specification for SHMEM functions and semantics [7]. Version 1.0 of the OpenSHMEM specification [8] has already seen research and industry adoption in various implementations: the OpenSHMEM reference implementation [9], MVAPICH2-X [10], OSHMPI [11], Portals-SHMEM [12], POSH (Paris-OpenSHMEM) [13], and through vendors such as SGI [3], Cray [14], and Mellanox [15].

2.1.2 GASNet and the OpenSHMEM Reference Implementation

The OpenSHMEM community provides a reference implementation of the library with primary source-code contributions from the University of Houston and Oak Ridge National Laboratory [9]. This reference implementation is compliant with version 1.0 of the OpenSHMEM specification and is implemented atop GASNet [16], a low-level networking layer and communications middleware for supporting SPMD parallel-programming models such as PGAS. GASNet defines a core and an extended networking API that are implemented via conduits. These conduits enable support for numerous networking technologies and systems. By leveraging GASNet's conduit abstraction, the OpenSHMEM reference implementation is portable to numerous cluster-based systems.

2.1.3 GSHMEM

Our prior work with SHMEM involved the design and evaluation of an OpenSHMEM library called GSHMEM (GatorSHMEM) [17] atop GASNet [16]. GSHMEM targeted a draft version of the OpenSHMEM v1.0 specification in order to evaluate its existing functionality and propose several new additions for future revisions. Built for x86_64-based cluster systems, experimental results via microbenchmarking showed that GSHMEM performance is comparable to a proprietary Quadrics implementation of SHMEM and an MPI library (MVAPICH) over

InfiniBand. Additionally, two application case studies with GSHMEM demonstrated the library's portability across two distinct systems with vastly disparate interconnection technologies. GSHMEM proved that, by leveraging GASNet, SHMEM implementations can be made modern and portable over different architectures and system hierarchies without sacrificing high performance or developer productivity.

2.1.4 OSHMPI: OpenSHMEM using MPI-3

MPI 3.0 represents a significant revision to the MPI standard by including support for one-sided communication and introducing new semantics for memory consistency and ordering [5]. Hammond et al. developed an OpenSHMEM library [11] using MPI-3's one-sided, remote-memory-access (RMA) operations and demonstrated comparable results against other SHMEM implementations such as the OpenSHMEM reference implementation, MVAPICH2-X, and Portals-SHMEM. Of note, OSHMPI was able to outperform its competitors in the SMP intranode configuration, suggesting its suitability for platforms such as the TILE-Gx.

2.1.5 OpenMP
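For readers unfamiliar with MPI-3 RMA, the following sketch (a generic illustration under our own assumptions, not OSHMPI source code) shows the one-sided style that such a layer builds upon: a window of memory per rank plays the role of a symmetric partition, MPI_Put() plays the role of a SHMEM put, and MPI_Win_flush() provides quiet-like completion.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int me, npes;
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);

        int *buf;
        MPI_Win win;
        /* Expose one int per rank for remote access (analogue of a symmetric partition). */
        MPI_Win_allocate(npes * sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf, &win);

        /* Passive-target epoch on all ranks, similar to SHMEM's always-accessible memory. */
        MPI_Win_lock_all(0, win);

        int right = (me + 1) % npes;
        MPI_Put(&me, 1, MPI_INT, right, me /* displacement */, 1, MPI_INT, win);
        MPI_Win_flush(right, win);         /* complete the put, akin to shmem_quiet() */

        MPI_Win_unlock_all(win);
        MPI_Barrier(MPI_COMM_WORLD);       /* all puts done before reading locally */

        int left = (me + npes - 1) % npes;
        printf("rank %d received %d from rank %d\n", me, buf[left], left);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }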

The OpenMP specification defines a collection of library routines, compiler directives, and environment variables that enable application parallelization via multiple threads of execution [2]. Standardized in 1997, OpenMP has been widely adopted and is portable across multiple platforms. OpenMP commonly exploits symmetric-multiprocessing (SMP) architectures by enabling both data-level and thread-level parallelism. Parallelization is typically achieved via a fork-and-join approach controlled by compiler directives whereby a master thread will fork several child threads when encountering an OpenMP parallelization section. The child threads may be assigned to different processing cores and operate independently, thereby sharing the computational load with the master. Threads are also capable of accessing shared-memory variables and data structures to assist computation. At the end of each parallel section, child threads are joined with the master thread and the parallel section closes. The master thread continues on with sequential code execution until another parallel section is encountered.
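A minimal fork-and-join example is sketched below (our own illustration; the loop and bounds are arbitrary): the master thread forks a team at the parallel directive, iterations are divided among the threads, and the team joins before the final print.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* Fork: child threads share the loop iterations; the reduction clause
         * combines each thread's partial sum at the join point. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        /* Join: only the master thread continues past the parallel region. */
        printf("threads available: %d, harmonic sum: %f\n",
               omp_get_max_threads(), sum);
        return 0;
    }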


Figure 2-1. Tilera architecture diagrams. A) TILE-Gx8036 [18]. B) TILEPro64 [19].

Table 2-2. Architectural comparison for TILE-Gx8036 and TILEPro64.

  TILE-Gx8036                                   TILEPro64
  36 tiles of 64-bit VLIW processors            64 tiles of 32-bit VLIW processors
  32K L1i, 32K L1d, 256K L2 cache per tile      16K L1i, 8K L1d, 64K L2 cache per tile
  Up to 750 billion operations per second       Up to 443 billion operations per second
  60 Tbps of on-chip mesh interconnect          37 Tbps of on-chip mesh interconnect
  500 Gbps (62.5 GB/s) of memory bandwidth      200 Gbps (25 GB/s) of memory bandwidth
  1.0 to 1.5 GHz operating frequency            700 or 866 MHz operating frequency
  10 to 55 W (22 W typical @ 1.0 GHz)           19 to 23 W @ 700 MHz
  2 DDR3 memory controllers                     4 DDR2 memory controllers
  mPIPE for wire-speed packet processing;       —
  MiCA for crypto and compression

While other multi-threading APIs exist (e.g., POSIX threads), OpenMP is comparatively easier to use for developers that desire an incremental path to application parallelization for their existing sequential code. With the emergence of many-core processors such as Tilera’s TILE-Gx and Intel’s Xeon Phi, OpenMP is evolving to become a viable choice for single-device supercomputing tasks. 2.1.6 Tilera Many-Core Processors

Tilera Corporation develops commercial many-core processors emphasizing high performance and low power in the cloud-computing, general, and embedded markets. Each Tilera many-core processor is designed as a scalable 2D mesh of tiles, with each tile consisting of a processing core and cache system attached to several on-chip networks via a non-blocking cut-through switch. Referred to as the Tilera iMesh (intelligent Mesh), their scalable 2D

mesh consists of networks that provide data routing between memory controllers, caches, and external I/O and enables developers to explicitly transfer data between tiles via a low-level user-accessible dynamic network. Our work focuses on the 36-core TILE-Gx8036 (Figure 2-1A) with its predecessor, the 64-core TILEPro64 (Figure 2-1B), as a reference point for comparison. Their architectural characteristics are detailed in Table 2-2. The TILEPro is Tilera's previous generation of many-core processors with 32-bit processing cores interconnected via four dynamically dimension-order-routed networks and one developer-defined statically routed network. In contrast, the TILE-Gx is Tilera's current generation of 64-bit many-core processors. Differentiated by a substantially redesigned architecture, the TILE-Gx family exhibits upgraded processing cores and improved iMesh interconnects attached to five dynamic networks between the tiles and I/O. The TILE-Gx also includes hardware accelerators not found on previous Tilera processors: mPIPE (multicore Programmable Intelligent Packet Engine) for wire-speed packet classification, distribution, and load balancing; and MiCA (Multicore iMesh Coprocessing Accelerator) for cryptographic and compression acceleration. Other members of the TILE-Gx family include the 9-core TILE-Gx8009, 16-core TILE-Gx8016, and 72-core TILE-Gx8072.

2.2 Device Performance Studies

Tilera provides the Tilera Multicore Components (TMC) library for general application development, suitable for a variety of task models and featuring components that developers can leverage for their routines. In addition, the gxio library provides programmability for features specific to TILE-Gx devices, such as mPIPE and MiCA. For ease of development on their many-core devices, Tilera provides a customized Eclipse IDE installation with numerous extensions, such as state trackers for individual tiles. These libraries and development tools are packaged from Tilera in a Multicore Development Environment (MDE) distributable with the necessary drivers and boot images for development on their platforms. Our work uses MDE version 4.2.2 on the TILE-Gx and MDE version 3.0.3 for the TILEPro. The software versions packaged in our MDE releases are similar. The main difference between major MDE releases

is the target architecture supported (version 3 corresponding to TILEPro and version 4 for TILE-Gx). Benchmarking these libraries is necessary to determine the upper bound on performance realizable for any library design (e.g., TSHMEM) or application. Routines relevant to the functionality required in TSHMEM are microbenchmarked to compare performance and overhead. Platforms targeted by our research are the TILEmpower-Gx server with a single TILE-Gx8036 operating at 1.0 GHz, and the TILEncorePro-64 with a single TILEPro64 operating at 700 MHz. A host machine is required for PCIe-card platforms such as the TILEncorePro-64, while it is an option for standalone server platforms such as the TILEmpower-Gx. In Sections 2.3 and 2.4, a performance analysis for the design of TSHMEM is conducted on the TILE-Gx. While TSHMEM supports both the TILE-Gx and TILEPro architectures, TSHMEM performance numbers on the TILEPro are not provided in later sections. Focus is placed on the newer, current-generation TILE-Gx architecture due to its higher relevance for this work and the decreased support for the older TILEPro. Instead, microbenchmarking results in this section provide general trends for the performance expected with TSHMEM on the TILEPro. Microbenchmark results in this section are averaged over at least 1000 iterations.

2.2.1 Memory Hierarchy

Before discussing the microbenchmarks, a brief synopsis of Tilera’s memory hierarchy is necessary. Each physical tile on the TILE-Gx and TILEPro consists of a processor with L1i, L1d, and L2 caches. Tilera employs several techniques to reduce latency for external memory operations, one of which is the Dynamic Distributed Cache (DDC). Tilera’s DDC presents a large L3 unified cache that is the aggregation of L2 caches from all tiles. Each physical memory address is dynamically assigned to a home tile to manage, allowing memory requests to be potentially fulfilled from the caches of other tiles instead of memory, thereby maximizing on-chip performance.
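As a back-of-the-envelope illustration of this aggregation (our own arithmetic based on the per-tile cache sizes in Table 2-2): on the 36-tile TILE-Gx8036, the DDC can present up to 36 × 256 KB = 9 MB of distributed L3 capacity, which matches the 9 MB L3 DDC limit observed in the memory-bandwidth measurements later in this section.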

The method by which memory addresses are assigned to home tiles is memory homing. Tilera's memory hierarchy provides for three classes of homing: local homing, remote homing, and hash-for-home. Local homing assigns a page of memory to the same tile accessing it. For memory regions exhibiting high locality, this approach provides a potentially faster hit latency. Unfortunately, local homing loses the advantage of DDC as these pages cannot be distributed to other tiles' L2 caches. As a result, local homing is most suitable for small private data that can entirely reside in L2 cache, such as program stack data. Remote homing is the converse of local homing, whereby memory pages are homed on a tile other than the one currently accessing the data. This strategy is most useful in producer-consumer relationships when the producer can set a page for remote homing and write directly into the home tile's cache, avoiding unnecessary access to its own cache. The home tile as consumer can then directly consume the result from its own cache. Finally, hash-for-home is similar to remote homing; however, instead of homing a page to a single tile, the page is hashed and distributed across multiple tiles. This method allows for accesses across the entire L3 DDC, reducing bottlenecks at any individual tile's cache. Hash-for-home is inappropriate for private single-reader data that is more suitable for local or remote homing, but excels for memory shared between multiple threads or processes. By default, hash-for-home is used for a majority of data and instruction memory as it provides excellent performance for shared memory and good performance for private memory.

2.2.2 TMC Common Memory

The TMC library provides routines for allocating shared memory between processes. Referred to as common memory, it differentiates itself from traditional cross-process shared-memory mappings in that all participating processes will map the shared-memory region at the same virtual address, enabling processes to share pointers into common memory. Additionally, any process can create new mappings which become visible to others, removing the restriction that all shared memory must be created from a parent process. TSHMEM leverages common memory to provide the PGAS model and shared-memory semantics of

SHMEM.

The bandwidth of memory-copy operations to and from this shared memory is decisively important in determining TSHMEM's overall performance due to its significant use in one-sided data transfers. Figure 2-2 shows microbenchmark results for memcpy() operations between shared-memory segments allocated from TMC common memory.

Figure 2-2. Effective transfer bandwidth (cache and memory) for shared-memory copy operations on one core of TILE-Gx36 and TILEPro64. Transfers are from Tilera's memcpy() operations between shared-memory segments via TMC common memory.

Effective bandwidth on the TILE-Gx36 is much higher than on the TILEPro64 for all transfer sizes. This performance difference can be attributed to several reasons. The TILEPro's iMesh consists of four dynamic networks, one of which is dedicated to memory operations and another to cache-coherency communication among tiles. The TILE-Gx's iMesh, however, has been redesigned to include five dynamic networks, two of which are now dedicated to memory request and response operations and one to cache coherency. As a result, TILE-Gx memory performance is substantially improved. From Table 2-2, the TILE-Gx8036 provides up to 62.5 GB/s of aggregate, theoretical memory bandwidth. Effective bandwidth on TILE-Gx36 experiences three transitions in performance. The first two transitions are attributed to and occur at the L1d (32 KB) and L2 (256 KB) cache sizes,

4 rnfr r rmTilera’s from are Transfers 64. 4 MB 8 MB 16 MB 32 MB 64 MB indicating representative performance for the cache system. The L1d cache performance tops out around 3.0 GB/s and the L2 cache performance reaches a peak around 2.87 GB/s. While the L1d and L2 cache performances are similar in this situation, this result can be influenced by the MDE version of the Tilera software installation and the performance optimizations introduced with newer MDE minor releases. Our previous work [6] used MDE 4.0.0 while this work uses MDE 4.2.2, providing higher realizable mesh bandwidth for cache and memory transfers. The third performance transition on TILE-Gx36 is attributed to Tilera’s L3 DDC. Effective bandwidth decreases steadily between the L2 cache size of 256 KB to the L3 DDC limit of 9 MB (for TILE-Gx8036) as the L2 caches on the device are exhausted. The performance of memory-to-memory transfers is approximately 1.1 GB/s for transfer sizes beyond 9 MB (projected aggregate bandwidth of 40 GB/s) and remains constant as transfer size increases. The TILEPro64 follows the same trends experienced with TILE-Gx36, but at a less-pronounced performance benefit. Performance is stable at or near 0.50 GB/s through the L1d and L2 cache sizes and decreases into memory-to-memory transfers (0.37 GB/s, projected aggregate of 23.7 GB/s). These results represent our practical experience in determining the realistically achievable memory bandwidths for these architectures. As such, we make no claims to verify the theoretical aggregate bandwidths provided by Tilera in Table 2-2 due to non-trivial variations with methods for empirically measuring aggregate bandwidth as well as the result’s limited applicability in our experiments. These memory-bandwidth results are revisited in Section 2.3 when TSHMEM one-sided put/get performance is analyzed. 2.2.3 TMC UDN Helper Functions

Tilera provides access to the UDN (User Dynamic Network), a low-latency direction-order- routed dynamic network on their iMesh. Developers attach a 1-word header to each payload with information about the destination tile and transfer the data packet via the UDN—at a rate of 1 word per hop, per clock cycle—into one of four demultiplexing queues at the destination. Each receiving queue on the UDN can accommodate up to a payload size of 127


Neighbors Sides Corners Tile-to-Tile in 6 6 Area × Figure 2-3. Average half round-trip latencies (100 million iterations) on UDN between adjacent tiles (neighbors), tiles across the area (sides), and tiles on opposite corners of the effective area (corners). TILE-Gx36 has higher latency due to setup-and-teardown on a 64-bit switching fabric vs. TILEPro64’s 32-bit fabric. words (8-byte word on the TILE-Gx, 4-byte word on the TILEPro), making the UDN suitable for small-sized explicit communication. The TMC library provides UDN helper routines that facilitate these transfers via two-sided send-and-receive calls. We microbenchmark the UDN’s latency performance of minimum-sized payloads on the TILE-Gx36 and TILEPro64 between pairs of tiles with varying distances: neighbors for transfers between adjacent tiles; sides for transfers horizontally or vertically across the test area; and corners for diagonal transfers over the entire test area. The effective test area on both devices is 6×6 tiles, providing full coverage of the TILE-Gx36. Timing is performed on the sender tile as a halved average between a 1-word send and a 1-word acknowledgment from the receiver. Average one-way latencies are depicted in Figure 2-3. For each case, average latencies were consistent with low variance of up to 1 ns, regardless of the message direction. Each case can be broken down into two components: setup-and-teardown time and network-traversal time. The clock frequency and packet-switching rate are known, allowing us to roughly

26 determine the setup-and-teardown time. Our TILE-Gx36 operates at 1 GHz, requiring 1 ns to route 1 word/hop. In comparison, the TILEPro64 at 700 MHz requires 1.43 ns. The number of hops in a 6×6 mesh network is 1, 5, and 10 for neighbor-to-neighbor, side-to-side, and corner-to-corner, respectively; therefore the estimated setup-and-teardown time is roughly 19.5 ns for the TILE-Gx and 18 ns for the TILEPro. Because of the longer setup-and-teardown time, the TILE-Gx has a higher average latency for the neighbor-to-neighbor case, but exhibits equal or lower average latency for side-to-side and corner-to-corner as the number of hops increases. These latency tests have focused on minimum-sized payloads, but actual data transferred is doubled on TILE-Gx due to a 64-bit switching fabric compared to 32-bit on TILEPro. 2.2.4 TMC Spin and Sync Barriers

The TMC library provides two types of barriers for synchronization: spin and sync. True to its name, the spin barrier will block processing and poll continuously until the correct number of tasks has reached the barrier. This polling results in lower overhead but incurs significant performance degradation if the currently blocking task is context-switched out for a new task. As such, spin barriers should only be used when there is only one task per tile. In contrast, the sync barrier interacts with the Linux scheduler and notifies it when the barrier begins to block. The scheduler can swap out the task while it waits and replace it for another task to continue processing. The sync barrier incurs a larger performance penalty than spin, but allows for additional use cases when the restrictions of a spin barrier are inappropriate. The semantics for these two barrier types require a state variable backed by shared memory, and therefore rely on the memory technology. Latency results for spin and sync barriers are shown in Figure 2-4. As expected, spin barriers vastly outperform sync barriers due to their polling nature, with latencies of 1.6 µs and 49.0 µs at 36 tiles for the TILE-Gx36 and TILEPro64, respectively, compared to 211 µs and 754 µs. Furthermore, the barriers for the TILE-Gx significantly outperform the TILEPro’s due to different memory technologies (DDR3 vs. DDR2). Since SHMEM focuses on low-overhead, low-latency performance, the TMC spin barrier for TILE-Gx is an appealing candidate for use in


TILE-Gx36: spin TILEPro64: spin TILE-Gx36: sync TILEPro64: sync

Figure 2-4. Latencies of TMC spin and sync barriers. Spin barriers leverage spin polling to outperform the sync barriers’ use of process .

TSHMEM, but its performance difference with the spin barrier on TILEPro poses a challenge in realizing the same low-latency performance for the TILEPro. 2.3 Design Overview of TSHMEM

The software architecture of TSHMEM leverages the Tilera TMC libraries to provide an OpenSHMEM-compliant high-performance library for Tilera many-core processors. TSHMEM targets the OpenSHMEM v1.0 specification and implements all functionality required by SHMEM applications, with exception of support for static symmetric-variable transfers using SHMEM atomic operations. All other SHMEM functionality, including collectives and atomic operations with dynamic variables, is supported. The subsections below are ordered categorically according to Table 2-1, each including design description and performance results for the TILE-Gx8036. The performance of TSHMEM is compared with the microbenchmark results for the TILE-Gx from Section 2.2 and with other OpenSHMEM implementations: the OpenSHMEM reference implementation version 1.0f (referred to afterward as simply OpenSHMEM or OSH), and OSHMPI (git commit 1f33a2735b on 20140819) atop MPICH [20] version 3.1.3. The underlying functionality in the

28 OpenSHMEM reference implementation is provided by GASNet version 1.22.0, cross-compiled for the TILE-Gx architecture with GASNet’s SMP conduit. In contrast to the GASNet middleware in the OpenSHMEM reference implementation, TSHMEM does not leverage any middleware, instead opting to design its functionality with device primitives and algorithm exploration for higher device utilization and bare-metal performance. Execution runs with MPICH use mpiexec -bind-to core:1 to set CPU affinity. All compilations were done with the TILE-Gx compiler based on GCC version 4.4.7. Latency benchmarks for put/get and collectives are provided by the OSU micro-benchmarks suite [21]. 2.3.1 Environment Setup and Initialization

SHMEM implementations typically consist of the library to which applications are linked and an executable launcher which sets up the initial environment, forks the requested number of processes, and executes the desired application. TSHMEM’s executable launcher initializes the environment by setting up Tilera’s TMC common memory in order to create a globally shared space visible to all processes, and setting up the UDN for explicit communication between the tiles participating in SHMEM. After forking, each process uniquely binds to a tile, creating a one-to-one mapping. After exec(), the application calls start_pes() to finish initialization. At this time, the globally shared memory is partitioned symmetrically among participating tiles (providing the PGAS memory model) and each tile reports its partition’s starting address to every other tile via the UDN. Dynamic symmetric memory is managed via shmalloc() and shfree(). TSHMEM’s design of shmalloc() consists of a doubly linked list tracking the memory segments being used in the current tile’s symmetric partition. Memory is kept implicitly symmetric by the constraints imposed when using shmalloc(), requiring applications to call the routine on all PEs with the same size argument at the same location in the program execution path. 2.3.2 Point-to-Point Data Transfers

OpenSHMEM specifies several categories of point-to-point, one-sided data transfers consisting of elemental, bulk, and strided put/get operations. Elemental put/get functions

29 operate on single-element symmetric objects (e.g., short, int, float) whereas bulk functions operate on contiguous data. Strided operations allow the transfer of data with strides between consecutive elements in the source and/or target arrays. In the v1.0 specification, put operations will return from the function once the data transfer is in flight and the local buffer is available for reuse by the calling PE. Get operations, in contrast, will block and not return until the requested memory is visible to the local PE. 2.3.2.1 Dynamically allocated symmetric objects

At the startup of a SHMEM program, shared-memory partitions are given to each tile. Due to the symmetry of each partition, a tile in TSHMEM can determine the virtual address of any other tile’s dynamic symmetric object by calculating the offset of its own object from its partition’s start address and then adding the offset to the target tile’s partition start address. The data transfer is then facilitated with a memcpy() operation using the calculated virtual address into TMC common memory. 2.3.2.2 Statically allocated symmetric objects

Static symmetric objects are treated very differently from their dynamic counterpart. These objects are allocated statically into the program’s heap space at link time and are symmetric since the virtual addresses of the program heap are identical when parallel processes are instantiated from the same executable. Unfortunately, the heap space resides in private memory of a process and is not directly accessible to other processes. TSHMEM facilitates data transfer for static symmetric objects via UDN interrupts. The put/get functions check the data target and source addresses to see if either address does not reside in the globally partitioned shared space. If an address does not reside in the shared space, it is assumed to be a static symmetric variable. The local tile will notify the remote tile over UDN, causing an interrupt and forcing the remote tile to service the operation only when the local tile cannot. If one of the addresses is dynamic, either the local or the remote tile will be able to directly access that dynamic memory to service the request. For example, if the local tile cannot get from a remote tile’s static symmetric variable, the remote tile can

put into a dynamic symmetric variable on the local tile. This scenario represents a static-to-dynamic or dynamic-to-static transfer and incurs a minimal performance impact compared to dynamic-to-dynamic transfers. In the case when both the target and source addresses point to static symmetric variables, neither the local nor the remote tile is able to service the operation. For these static-to-static transfers, a temporary shared-memory buffer is created to assist in completing the transfer, but this incurs an additional memory-copy operation as overhead. Unfortunately, static symmetric-variable transfers are not currently supported in TSHMEM on the TILEPro architecture due to lack of support for UDN interrupts.

2.3.2.3 Performance of SHMEM put/get

Figure 2-5 shows the effective bandwidth for dynamic-to-dynamic put/get transfers in TSHMEM, OpenSHMEM, and OSHMPI. For TSHMEM, put performance with shmem_putmem() closely aligns with get performance with shmem_getmem(). The low overhead in the dynamic put/get design demonstrates the realizable performance in TSHMEM as the performance closely matches the TILE-Gx common-memory microbenchmark in Figure 2-2 using the hash-for-home memory strategy described in Section 2.2.1.

Figure 2-5. Effective bandwidth of SHMEM put/get transfers on TILE-Gx36. A) Put. B) Get.

Both put and get performances are higher in TSHMEM when comparing these results with OpenSHMEM and OSHMPI. The results are illustrated in Figures 2-6, 2-7A, and 2-7B with latency performance on a logarithmic scale for dynamic and static transfers. Small-message (less than 1 KB) put latencies in TSHMEM are three to four times faster than those with OpenSHMEM due to TSHMEM's bare-metal implementation with minimized overhead and an explicit memcpy() operation. In contrast, OpenSHMEM incurs larger overhead when passing these put operations to GASNet's SMP conduit and its generalized active-message interface allowing it to handle the message transfers and small-message put acknowledgments. Likewise, put operations are two to three times faster than those from OSHMPI. TSHMEM also exhibits a slight performance benefit of 0.1 µs for small-message get operations over OpenSHMEM and OSHMPI.

Figure 2-6. Latencies of SHMEM dynamic put/get transfers on TILE-Gx36. A) Dynamic Put. B) Dynamic Get.

With the case of static transfers, latency performances incur a penalty compared to dynamic transfers. In TSHMEM, this behavior is expected due to the use of a temporary shared-space buffer to aid in transfers between these static symmetric variables. The performance of TSHMEM, however, is consistently higher than that of OpenSHMEM and OSHMPI for small to medium transfers, supporting the approach we have taken with the design of TSHMEM by leveraging the UDN when appropriate.

Figure 2-7. Latencies of SHMEM static put/get transfers on TILE-Gx36. A) Static Put. B) Static Get. C) Dynamic-to-Static Put. D) Static-to-Dynamic Get.

Furthermore, TSHMEM includes optimizations for improved performance with dynamic-to-static put operations and static-to-dynamic get operations as seen in Figures 2-7C and 2-7D. When the local side of the transfer is represented by a dynamic symmetric variable, the remote tile is able to service the operation with little overhead instead of the alternative where intermediate buffers are required. With these two optimizations, TSHMEM latencies for these cases are reduced by more than half of the static-to-static latencies. In contrast, both OpenSHMEM and OSHMPI over MPICH relegate these two cases to the static-to-static code path, with performance comparable to the static-to-static transfers seen in Figures 2-7A and

esg Size Message 256 B 512 B 1 KB 2 KB 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB 1 MB hntelclsd ftetase srpeetdb yai ymti aibe h remote the variable, symmetric dynamic a by represented is transfer the of side local the When and 2-7A Figures in seen transfers static the static-to-static to the performance to comparable cases contrast, with two In path, these latencies. code relegate static-to-static MPICH the over of OSHMPI half and than OpenSHMEM more both by reduced latencies are TSHMEM cases optimizations, two static-to-static these these alternative With for the of required. instead are buffers overhead intermediate little where with case operation the service to able is 2-7D . tile and 2-7C Figures in seen as operations get static-to-dynamic and operations put static Put. Static A) TILE-Gx36. on transfers put/get static SHMEM of Latencies 2-7. Figure Latency (µs) Latency (µs) A C 1 1 utemr,THE nldsotmztosfripoe efrac ihdynamic-to- with performance improved for optimizations includes TSHMEM Furthermore, , , 100 100 000 000 0 0 10 10 . . 1 1 1 1 4 B 4 B 8 B 8 B Sai-oDnmcGet. Static-to-Dynamic D) Put. Dynamic-to-Static C) Get. Static B) 16 B 16 B OSHMPI OpenSHMEM TSHMEM OSHMPI OpenSHMEM TSHMEM 32 B 32 B 64 B 64 B 128 B 128 B

esg Size Message 256 B Size Message 256 B 512 B 512 B 1 KB 1 KB 2 KB 2 KB 4 KB 4 KB 8 KB 8 KB 16 KB 16 KB 32 KB 32 KB 64 KB 64 KB 128 KB 128 KB 256 KB 256 KB 512 KB 512 KB 1 MB 1 MB

33 Latency (µs) Latency (µs) B D 1 1 , , 100 000 100 000 0 0 10 10 . . 1 1 1 1 4 B 4 B 8 B 8 B 16 B 16 B 32 B 32 B 64 B 64 B 128 B 128 B esg Size Message esg Size Message 256 B 256 B 512 B 512 B 1 KB 1 KB 2 KB 2 KB 4 KB 4 KB 8 KB 8 KB 16 KB 16 KB 32 KB 32 KB 64 KB 64 KB 128 KB 128 KB 256 KB 256 KB 512 KB 512 KB 1 MB 1 MB 2-7B. Interestingly, OSHMPI performs slightly worse for transfers at and above 256 KB with these static-to-dynamic put operations compared to the static-to-static case. Intuitively, OSHMPI should perform similar to the statics case, warranting further investigation at possible non-optimal behavior. Note that, by definition, functional semantics for the remaining two cases of static-to-dynamic put operations and dynamic-to-static get operations are equivalent to dynamic-to-dynamic transfers because the remote PE’s symmetric variable is dynamically allocated in shared memory and can be directly accessed by the local tile. 2.3.3 Synchronization

The OpenSHMEM specification provides several categories of synchronization: barrier sync; communication sync with fence/quiet; and point-to-point sync (waiting until a vari- able’s value has changed). TSHMEM includes these functions to provide computation and communication synchronization for SHMEM processes. 2.3.3.1 Barrier synchronization

Barrier synchronization in SHMEM is provided by two routines: shmem_barrier_all(), which blocks forward processing until all tiles reach the barrier; and shmem_barrier(), which invokes a barrier on a subset of the tiles defined by an active-set triplet of which tile to start at, the stride between consecutive tiles, and the number of tiles participating in the barrier. The microbenchmark results for TMC spin and sync barriers in Figure 2-4 illustrate that using sync barriers is not feasible due to their high latency, and the spin barrier on TILEPro is significantly slower than the one on TILE-Gx. Consequently, TSHMEM’s barrier design uses the UDN to synchronize between tiles. The start tile in the active set generates an active-set identification for the barrier in order to prevent overlapping barrier calls from returning out-of-order or stalling. The active-set identification is encoded with a wait signal and is sent to the next tile and resent linearly until the last tile sends it back to the start, acknowledging that all participating tiles have reached the same execution point in the program. The process is repeated with a release signal, allowing the blocking processes to linearly forward the signal before resuming program

34 100 s)

µ 10 Latency ( 1

0 5 10 15 20 25 30 35 Number of PEs

TSHMEM OpenSHMEM OSHMPI TMC spin

Figure 2-8. Latencies of SHMEM barrier on TILE-Gx36.

execution. The number of messages transferred for this operation is 2n, where n is the number of PEs in the barrier. Interestingly, another design was evaluated whereby the start tile broadcasts the release signal instead of having each tile forward it linearly in a chain. Barrier latencies, however, were two times slower for this method. The performance of shmem_barrier_all() is shown in Figure 2-8 for TSHMEM, OpenSHMEM, and OSHMPI. For comparison and convenience, the microbenchmark results for the TMC spin barrier on TILE-Gx36 from Figure 2-4 are also illustrated. While not depicted in Figure 2-8, TSHMEM barriers on TILEPro64 perform with a 36-tile latency of 3 µs, on the same magnitude of performance as TSHMEM barriers on TILE-Gx and vastly outperforming the TMC spin barrier on TILEPro64 (50 µs). The TMC spin barrier on TILE-Gx36, however, outperforms the TSHMEM barrier, opening the possibility of adopting its use for the TILE-Gx version of TSHMEM. Unfortunately, the use semantics for TMC barriers require memory allocation of a state variable to track the number of tasks in the barrier. This allocation would have to occur for each instance of a SHMEM barrier call in order to ensure that PEs that are engaged in multiple barriers do not return from the wrong barrier. One design option is to leverage memoization techniques to alleviate some of the allocation penalty of state variables;

however, the added complexity from both memoization management and state-variable management may result in a performance penalty greater than the current performance of TSHMEM barriers over UDN, especially since the current TSHMEM barrier design does not depend on state variables nor require memory allocation. We plan to explore memoization and its performance implications in future TSHMEM barrier designs. Alongside the TSHMEM barrier results, performance for barriers in OpenSHMEM and OSHMPI is also shown. OpenSHMEM barriers demonstrate significant variance and unreliable behavior when scaling up. OSHMPI barriers have minimal variance and scale in performance from 8 µs to 44 µs at 36 tiles. In contrast, TSHMEM barriers reach 2.4 µs at 36 tiles, over 18 times lower latency than OSHMPI barriers.
2.3.3.2 Fence/quiet

Since put operations do not wait for completion before returning to the calling PE, the communication synchronization routines shmem_fence() and shmem_quiet() ensure outstanding puts are ordered properly or completed before returning. The shmem_fence() routine guarantees put ordering to individual PEs before and after the function call, but does not guarantee completion. In contrast, shmem_quiet() is semantically stronger and will block execution until all outstanding puts to all PEs are completed. TSHMEM implements shmem_quiet() using tmc_mem_fence(), a memory fence operation that blocks until all memory stores are visible. Currently, shmem_fence() is set as an alias of shmem_quiet(), providing it the stronger semantics until shmem_fence() is implemented with its weaker semantics.
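A minimal sketch of this design is shown below. tmc_mem_fence() is the Tilera routine named above; the header name and the surrounding glue are assumptions for illustration rather than the actual TSHMEM source.

#include <tmc/mem.h>   /* assumed header providing tmc_mem_fence() */

/* shmem_quiet(): block until all outstanding puts (memory stores) are visible. */
void shmem_quiet(void)
{
    tmc_mem_fence();
}

/* shmem_fence(): currently aliased to the stronger quiet semantics. */
void shmem_fence(void)
{
    shmem_quiet();
}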

2.3.4 Collective Communication

SHMEM collective routines provide group-based communication for a subset of tiles. Collective designs and performance results for TSHMEM are discussed below. While collective algorithms have been explored with greater depth in other parallel environments such as MPI [22, 23, 24, 25, 26], the collective algorithms in TSHMEM presented here are intended to explore performance behaviors on the TILE-Gx and its 2D mesh. Results for OpenSHMEM

and OSHMPI are also provided as a basis for comparison of both algorithmic performance and runtime/conduit behavior.
2.3.4.1 Broadcast

Broadcast is a one-to-all operation where the active set of PEs obtains data from a root PE. TSHMEM currently has support for push-based and pull-based implementations of broadcast. The push-based broadcast is performed by having the root PE perform a put operation sequentially to all other PEs. This algorithm does not fully utilize the mesh fabric or the memory bandwidth of the Tilera processors, and is therefore only used for testing purposes in TSHMEM. In contrast, the pull-based broadcast is performed by having all other PEs in the active set perform a get operation on the data from the root PE. This approach distributes work to all other PEs on the device, instead of the root PE performing all of the work as is the situation with push-based. All other PEs will be issuing concurrent requests to a single memory location. The effective bandwidth is maximized on the TILE-Gx by leveraging its L3 distributed cache (Section 2.2.1) and storing this repeatedly accessed data within the L2 caches of the tiles, bypassing the need to go to memory. Local tiles on the device observe maximum performance by accessing data directly from cache, but alternative algorithms are preferred for PEs on multi-socket systems that have to access the data through memory and cannot benefit from this optimization. Figure 2-9 shows results for push-based and pull-based TSHMEM algorithms, two algorithms within OpenSHMEM, and the performance of the underlying MPI broadcast within OSHMPI. The first OpenSHMEM algorithm is a linear broadcast which is functionally equivalent to the TSHMEM pull-based approach: all PEs other than the root PE issue a get operation on the data. The second algorithm is a binary-tree broadcast whereby a binary- tree graph is generated to determine which parent PEs transfer data to which children PEs. With the root PE at the tree’s root, parent nodes transfer data via put operations until all children receive the broadcasted data. For message sizes less than 128 KB, the tree-based

algorithm is faster than the linear algorithm, but demonstrates unfavorable performance at large message sizes. Furthermore, the performance difference between the linear algorithm and TSHMEM's pull-based approach is significant despite functional similarity. TSHMEM-pull outperforms OpenSHMEM-linear for all message sizes, and the linear algorithm is only able to approach the performance of TSHMEM-pull at large message sizes due to amortization of runtime overhead with large data transfers. OpenSHMEM-linear's higher latency can be attributed to the overhead of the GASNet communication runtime that it uses. In addition to the runtime overhead from using GASNet, we observe large variance and instability with GASNet at large system PE counts. OpenSHMEM experiences higher latency variance with increasing message size and number of PEs, as indicated in Figure 2-9C. In addition, low performance with OpenSHMEM collectives is not isolated to TILE-Gx; it has also been observed with distributed systems supporting InfiniBand [27] and is an area of improvement for the reference implementation.

Figure 2-9. SHMEM broadcast latencies on TILE-Gx36. A) 8 PEs. B) 16 PEs. C) 32 PEs.

The MPI broadcast used by OSHMPI performs similarly to the straightforward push-based approach in TSHMEM. In comparison, the pull-based TSHMEM algorithm is an order of magnitude faster than an MPI broadcast, achieving between 0.5 to 0.8 µs of latency for small-message broadcasts. For large message sizes from 4 MB to 32 MB, OSHMPI stabilizes with approximately three times higher latency than TSHMEM-pull. Finally, TSHMEM-pull exhibits high parallelism as the number of PEs increases: for 64-byte transfers, TSHMEM broadcast latencies range from 0.54 µs (8 PEs) to 0.57 µs (32 PEs), benefiting from the TILE-Gx distributed cache.
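As an illustration of the pull-based approach, the sketch below has every non-root PE issue a get on the root's symmetric buffer. It is simplified to the full set of PEs and to shmem_barrier_all() for synchronization; the function name and structure are illustrative and not the TSHMEM implementation itself.

#include <shmem.h>
#include <string.h>

/* Pull-based broadcast sketch: all PEs other than the root get the data
 * directly from the root's symmetric source buffer. */
void broadcast_pull(void *target, const void *source, size_t nbytes, int PE_root)
{
    int me = shmem_my_pe();

    shmem_barrier_all();                 /* root's source buffer is ready       */

    if (me == PE_root) {
        if (target != source)
            memcpy(target, source, nbytes);           /* local copy on the root */
    } else {
        shmem_getmem(target, source, nbytes, PE_root);    /* concurrent pulls   */
    }

    shmem_barrier_all();                 /* all pulls done before buffer reuse  */
}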

esg Size Message 512 B 1 KB 2 KB 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB 1 MB 2 MB 4 MB 8 MB atcletdsg a eboe notosae:(1) stages: This two into result. broken concatenated be newly a can array, the design everyone’s get collect receives PEs fast PE other root all the where Once executed put PE. is a broadcast root perform pull-based a PEs to all data algorithm, their fcollect send naive and the operation For design. linear a and array. broadcast; resultant based the to portion their same-sized append the to supply where must know PE implicitly the each to to that PEs array restriction allowing their the array, append has to collect where fast as contrast, well In know as result. to progressed other has each concatenation with the communicate along supply far to to how need PE PEs each concatenation. allows for collect array General different-sized of (fcollect). a types collect two fast defines and specification collect OpenSHMEM routines: The collection PEs. all to array resultant the distributes collection Fast 2.3.4.2 cache. distributed TSHMEM transfers, 0.54 64-byte For from are increases. latencies PEs broadcast of number the as exhibits TSHMEM-pull parallel PEs. Finally, high 32 TSHMEM-pull. C) than latency PEs. higher 16 times B) three PEs. approximately 8 A) TILE-Gx36. on latencies fast-collect SHMEM 2-10. Figure Latency (µs) 10 10 10 10 10 A sosTHE eut o w loihs av einlvrgn pull- leveraging design naive a algorithms: two for results TSHMEM shows 2-10 Figure and PE each from array an concatenates that operation all-to-all an is Collection 1 2 3 4 5 4 B 8 B 16 B TSHMEM-linear TSHMEM-naive 32 B 64 B 128 B

esg Size Message 256 B 512 B 1 KB 2 KB 4 KB

8 KB OSHMPI OpenSHMEM 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB 1 MB 10 10 10 10 10 B µ 5 1 2 3 4 8Ps o0.57 to PEs) (8 s 4 B 8 B 16 B 32 B 64 B 128 B

esg Size Message 256 B 512 B 39 1 KB 2 KB 4 KB 8 KB 16 KB µ 32 KB 3 E) eetn rmteTILE-Gx the from benefiting PEs), (32 s n 64 KB

E icuigtero E transfer PE) root the (including PEs 128 KB 256 KB 512 KB 1 MB 10 10 10 10 10 C 1 2 3 4 5 4 B 8 B 16 B 32 B 64 B 128 B

esg Size Message 256 B 512 B 1 KB 2 KB 4 KB 8 KB 16 KB 32 KB 64 KB 128 KB 256 KB 512 KB 1 MB M bytes to the root PE’s destination array, and (2) root PE broadcasts (n M) bytes to × destination arrays on (n 1) PEs. Treating M as constant, stage 1’s total data transferred − scales linearly with the number of participating tiles, similar to a broadcast operation. Stage 2, however, scales quadratically in total data as the number of tiles increase because each PE receives a copy of the entire concatenated result containing arrays from all other PEs. Summarizing this algorithm, all PEs execute a put operation to the root PE, then all PEs will execute a get operation from the root PE for the result. In contrast, the linear fcollect algorithm has all PEs execute a put operation to each other PE, sending it the portion of its data. This algorithm allows the result to be iteratively built on all PEs as the data arrives. Both TSHMEM and OpenSHMEM implement this linear algorithm, with results illustrated in Figure 2-10. Within OSHMPI, fcollect is implemented using MPI_Allgather() which performs the same functionality from the MPICH library. On the TILE-Gx, MPICH performance is more favorable than that of GASNet. As the number of PEs increases, TSHMEM’s fcollect performance surprisingly widens in favor of the naive approach for small message sizes. The main performance advantage of the naive approach with a large number of PEs is cache locality. The concatenated array is built on the root PE and then repeated cache reads can distribute the result to the other PEs efficiently via the L3 distributed cache. The linear algorithm only outperforms this naive approach for medium-sized messages. Since small-message or large-message transfers are emphasized in most applications, the default fcollect algorithm in TSHMEM is this naive approach with pull-based broadcast. In comparing the algorithms for large messages, TSHMEM-naive is 1.6 times faster than TSHMEM-linear, 2.3 times faster than OpenSHMEM, and 2.6 times faster than OSHMPI. 2.3.4.3 Reduction

2.3.4.3 Reduction

Reduction is an all-to-all operation that performs an associative binary operation on the array elements from each active-set PE. OpenSHMEM reduction routines are defined by the element type (e.g., short, int, float) and the reduction operation (e.g., xor, sum, min, max).

Figure 2-11. SHMEM float-summation reduction latencies on TILE-Gx36. A) 8 PEs. B) 16 PEs. C) 32 PEs.

TSHMEM currently includes three designs for reduction operations: naive, linear, and tree. The design for naive reduction has the root PE iteratively performing a reduction operation on the data values it gets from the active-set PEs. Once all active-set PEs have participated with the root PE, the final reduction result is available, and a pull-based broadcast is issued to distribute the result to all other participating PEs. Unlike the naive fcollect, whereby each PE was able to put its data onto a root PE concurrently, the naive reduction is bottlenecked with the root PE sequentially performing get operations and reducing the results as they arrive. Therefore, TSHMEM also provides a binary-tree reduction algorithm which sets up a tree communication pattern and reduces the results from the child nodes at each parent node until it reaches the root node (the root PE). Similar to naive, a pull-based broadcast is then executed for each PE to obtain the reduction result. Finally, TSHMEM and OpenSHMEM both provide a linear reduction algorithm. For linear reduction, each PE iteratively gets the source data from all other PEs and reduces them locally, so broadcasting the final result is not needed as the result is computed locally on each PE. For OSHMPI, reduction operations are translated to MPI_Allreduce().
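For illustration, the linear design specialized to float summation might look like the sketch below; the scratch-buffer size, the function name, and the use of shmem_barrier_all() in place of the pWrk/pSync machinery are simplifying assumptions rather than details of the TSHMEM implementation.

#include <shmem.h>

/* Linear reduction sketch (float summation): each PE gets every other PE's
 * symmetric source array and reduces it into its own local result, so no
 * final broadcast is needed. */
void float_sum_linear(float *target, const float *source, int nreduce)
{
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    static float scratch[4096];          /* local scratch; assumes nreduce <= 4096 */

    for (int i = 0; i < nreduce; i++)
        target[i] = source[i];           /* start from the local contribution */

    shmem_barrier_all();                 /* all source arrays are ready        */

    for (int pe = 0; pe < npes; pe++) {
        if (pe == me)
            continue;
        shmem_getmem(scratch, source, (size_t)nreduce * sizeof(float), pe);
        for (int i = 0; i < nreduce; i++)
            target[i] += scratch[i];
    }

    shmem_barrier_all();                 /* everyone finished reading sources  */
}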


Results for summation reduction with single-precision floating-point arrays are shown in Figure 2-11. We observed similar performance trends for other reduction operations such as integer summation and integer XOR. TSHMEM-linear results surprisingly show similar performance as the TSHMEM-naive method; however, TSHMEM-naive outperforms TSHMEM-linear for larger message sizes. At 32 PEs and for message sizes beyond the L1 cache size (32 KB), TSHMEM-naive is approximately two to three times faster than TSHMEM-linear. For TSHMEM-linear, cache locality significantly affects performance at these larger message sizes since all participating PEs are attempting to get data from all other PEs while accessing their own cache and memory to compute results. This behavior causes numerous concurrent and random memory accesses, whereas TSHMEM-naive experiences better cache locality due to the root PE performing all of the reduction calculations. For 1-MB transfers with 32 PEs, the root PE in TSHMEM-naive experiences half as many local-tile L3 cache reads compared to TSHMEM-linear, and the remaining PEs require only 0.6% as many local-and-remote L3 cache reads to retrieve the reduced dataset from the root PE compared to locally computing it. At 8 and 16 PEs, TSHMEM-tree is equal to or faster than TSHMEM-naive. For 32 PEs, TSHMEM-tree is faster than TSHMEM-naive for all message sizes. The default reduction algorithm in TSHMEM is the tree approach due to more efficient memory utilization with increasing PE counts. OpenSHMEM performance exhibits similar trends as with broadcast and fcollect. OSHMPI reduction performance is approximately 2.8 times slower than TSHMEM-tree for small messages, and similar in performance for larger messages. The collective results in this subsection are intended as a case study for the TILE-Gx. A common theme is that distributed collective algorithms can display insufficient performance on shared-memory, many-core devices. Others have reached similar conclusions when experimenting with multi-core systems [28]. In leveraging the device-level microbenchmarking results, we demonstrate that the design of collective communications in TSHMEM offers high performance on the TILE-Gx many-core architecture, while enabling further library exploration toward systems consisting of multiple many-core processors.

2.4 Application Case Studies

SHMEM and OpenMP are highly amenable programming environments for SMP architectures due to their shared-memory semantics. With many-core processors emerging onto the HPC scene, developers are interested in the performance and scalability of their applications for these devices. This section analyzes several applications, written in both SHMEM and OpenMP, on the TILE-Gx8036 [29]. We focus our analysis on showcasing performance differences between OpenMP (provided by the TILE-Gx GCC 4.4.7 compiler) and the three SHMEM implementations: TSHMEM, the OpenSHMEM reference implementation, and OSHMPI atop MPICH. OpenMP serves as a baseline for our performance comparison due to its ubiquity for parallel programming on SMP devices. In comparing TSHMEM with OpenMP, we aim to show that libraries like TSHMEM can offer competitive or higher performance than established language-based solutions. The applications in this section consist of both custom-developed kernels and example programs from the OpenSHMEM test suite version 1.0d [9]. These applications are presented as follows: exponential curve fitting; OSH 2D heat equation; matrix multiply; OSH matrix multiply; OSH heat image; and a case study in parallelizing the FFTW library. SHMEM applications were ported to OpenMP when it was easily achieved. Specific optimizations were made only when the computational algorithm remained unchanged for both versions of the application. Scalability results are presented with increasing number of PEs, where PEs are either processes in SHMEM or threads in OpenMP and are reported in execution times up to the realistic maximum number of PEs for the TILE-Gx8036 (36 PEs). For OSH heat image, we also present results with increasing problem sizes to illustrate TSHMEM’s performance improvement over OpenMP and OSHMPI at full-device scale. 2.4.1 Exponential Curve Fitting

An exponential equation of the form y = ae^(bx) can be represented in linear form with logarithms: ln(y) = ln(a) + bx. This form allows us to leverage linear curve-fitting via least-mean-squares approximation and transform the final result back to exponential form with inverse logarithms.
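For reference, the standard least-mean-squares solution of the linearized model over n samples (x_i, y_i) is the textbook result below; it is shown only for context and may differ from the application's exact formulation.

\[
  b = \frac{n\sum_i x_i \ln y_i - \bigl(\sum_i x_i\bigr)\bigl(\sum_i \ln y_i\bigr)}
           {n\sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2},
  \qquad
  \ln a = \frac{\sum_i \ln y_i - b \sum_i x_i}{n},
\]

after which a is recovered as e^(ln a) so that y is approximated by a e^(bx).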


Figure 2-12. Execution times. A) Exponential curve fitting (100M points, double). B) OSH 2D heat equation (Jacobi method on a 288×288 matrix). C) Matrix multiply (2048×2048, double). D) OSH matrix multiply (2048×2048, double).

The implementation for curve fitting consists of a constant number of barriers and reductions. This application serves as a metric for parallel-performance overhead for the runtime environments that we are testing. The execution times are presented in Figure 2-12A. OpenSHMEM is the only runtime environment that does not exhibit the expected linear scalability. Scalability of OpenSHMEM is significantly impacted for executions with more than 4 PEs on the TILE-Gx. Further investigation shows that this behavior is a result of generic instrumentation of the GASNet

conduits on the TILE-Gx and interoperability issues with GASNet and the TILE-Gx's process scheduler. As a result, GASNet is unable to leverage the TILE-Gx's NUMA (non-uniform memory access) hierarchy in an efficient manner. Attempting to manually set the processor affinity via the Linux scheduler and via numactl fails to improve its high-variance behavior. Consequently, the performance comparisons between TSHMEM, OpenMP, and OSHMPI in this section are more relevant in demonstrating application behavior on TILE-Gx.
2.4.2 OSH 2D Heat Equation

The OpenSHMEM (OSH) website has a test suite consisting of benchmarks and applications. One such application is an iterative heat-equation solver for heat distribution in a rectangular (2D) domain via conduction. The provided application supports three iteration methods: Jacobi, Gauss–Seidel, and successive over-relaxation. We benchmark our runtime environments with the Jacobi method on a 288×288 rectangular domain, with 288 chosen as the least-common multiple of 32 and 36 such that the domain space is evenly divisible amongst the PEs. SHMEM communication consists of a linear number of put operations, broadcasts, reductions, and barriers. An OpenMP implementation was not tested for this application. Execution times in Figure 2-12B show that TSHMEM and OSHMPI performance are similar, with a performance edge to TSHMEM at full-device utilization due to barrier performance. In contrast, OpenSHMEM executions behaved erratically, preventing several PE counts from executing to completion.
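For context, a single Jacobi sweep for this class of steady-state heat problems updates each interior grid point from the previous iterate of its four neighbors (the standard textbook formulation, shown here only for illustration):

\[
  u_{i,j}^{(k+1)} = \tfrac{1}{4}\Bigl(u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)}\Bigr),
\]

iterated until the change between successive sweeps drops below a convergence tolerance.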

2.4.3 Matrix Multiply

The matrix-multiplication algorithm chosen for instrumentation was a partial-row dissemination with loop-interchange optimization for three matrices: C = A × B. Each PE is assigned a block of sequential rows to compute the partial result. In the case of OpenMP, the A, B, and C matrices are shared among the threads via compiler directives. Because of SHMEM's symmetric heap, the A and C matrices can be easily partitioned among the PEs, but each PE receives a private copy of the B matrix due to the pattern of computation. Consequently, the memory requirements are forced to scale with the number

of PEs and the size of the matrix due to the private copies that reside on each PE. There are other parallelization strategies that do not require private matrix copies, but the pattern of computation and communication would have differed from the OpenMP version. In addition to row dissemination, loop interchange can easily occur since each matrix element in C has no data dependency with its other elements. By interchanging the inner-most loop with one of its outer loops, locality of reference and cache-hit rates drastically increase. Execution times for SHMEM and OpenMP matrix multiplication are presented in Figure 2-12C. For the SHMEM version, communication consists of broadcasting the B matrix to all PEs unless the data can be accessed directly from the remote partition via shmem_ptr(). The OpenMP version has only implicit barriers, as all three matrices are shared via compiler directives and are directly accessible. The execution times for OpenMP, TSHMEM, and OSHMPI are similar with each other and scale consistently to full-device utilization.
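A compact sketch of the row-partitioned, loop-interchanged computation (ikj ordering) is shown below; the function name and row-block interface are illustrative rather than taken from the benchmark source.

/* Partial-row matrix multiply with loop interchange, for row-major double
 * matrices of size n x n.  row_begin/row_end delimit this PE's (or thread's)
 * block of rows. */
void matmul_rows(const double *A, const double *B, double *C,
                 int n, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        /* Interchanged loops: the inner j loop streams through rows of B and C,
         * improving locality of reference and cache-hit rates. */
        for (int k = 0; k < n; k++) {
            double aik = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += aik * B[k * n + j];
        }
    }
}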

2.4.4 OSH Matrix Multiply

One of the applications from the OpenSHMEM test suite is a matrix-multiplication kernel. Unlike the previous matrix-multiplication kernel, this kernel implements a block-column distribution for computation and leverages a distributed data structure that divides up the three matrices among the PEs. This data distribution results in more communication time to obtain non-local elements of the B matrix to perform matrix multiplication, but the advantage is substantially lower memory use for increasing number of PEs. As a result, this approach sacrifices some runtime performance, but is more amenable for very large matrices. The communication in this application consists of a quadratic number of barriers and put operations with complexity O(p × r), where p is the number of PEs and r is the number of rows in the matrices. For further details, source code can be obtained from the OpenSHMEM test suite [9]. The performance of this kernel is shown in Figure 2-12D for the TILE-Gx. TSHMEM and OpenMP performance scale similarly on the device, with OpenMP showing a slight performance improvement at 32 PEs. This result is attributed to the amount of data

46 movement in the SHMEM version. In the SHMEM version, each PE exchanges data by copying it into another PE’s shared partition at the end of each compute iteration. The OpenMP version, however, does not require this step because the data can be accessed directly via data sharing. While the OpenMP approach is more amenable for an SMP device, the SHMEM approach was implemented for operation on a distributed system as the data cannot be accessed directly and must be transferred with one-sided operations. A different SHMEM implementation would be capable of accessing the data directly, but would only be applicable on SMP devices as a result. Interestingly, OSHMPI performance is consistently worse than OpenMP and TSHMEM even at 2 PEs and stops scaling after 16 PEs (half-device utilization). This kernel is the only application in this section that exhibits this behavior. This result is attributed to the amount of communication in this application and exposes potential scalability issues with the application itself. The amount of communication depends on p, the number of PEs, therefore additional PEs will increase both the amount and duration of communication operations. OSHMPI is significantly affected by this behavior with its higher-latency barriers compared to TSHMEM. Finally, the execution times from Figures 2-12C and 2-12D show that this implementation of matrix multiplication is 1.5 to 2 times slower than the previous matrix multiplication due to the pattern of computation. 2.4.5 OSH Heat Image

This application takes width and height parameters as inputs and solves a heat-conduction modeling problem. Each PE is assigned a block of rows and assists in performing iterative heat-conduction computation in order to generate an output image. The SHMEM communication for this application consists of a linear number of put and barrier operations based on the number of iterations in the modeling problem. The OpenMP version consists of a linear number of barriers and critical-section regions. Execution times are shown in Figure 2-13A. OpenMP, TSHMEM, and OSHMPI observe similar performance until 16 PEs. TSHMEM continues to scale while both OpenMP and


Figure 2-13. Execution times. A) OSH heat image (1024×1024 with 5000 iterations). B) Parallelization of FFTW (8192 FFT operations on 8192-length complex-float arrays).

OSHMPI exhibit a slight degradation in scaling at 32 and 36 PEs. This application demonstrates favorable performance for TSHMEM at full-device utilization. Because the entire input data is implicitly shared with OpenMP, synchronization operations such as barriers become more numerous and costly than with SHMEM's distributed approach to data partitioning and ownership. Additionally, as the number of PEs increases, the additional synchronization points while iterating on the heat-image model result in this decrease of performance for OpenMP. This slight decrease also applies with TSHMEM, but is less impactful on the overall application performance than in the OpenMP case. For OSHMPI, the majority of the time difference compared to TSHMEM is due to higher-latency barrier synchronization. We present results for OSH heat image with increasing input sizes in Table 2-3. TSHMEM shows a maximum performance improvement of 30% over OpenMP at inputs of 2048×2048, with an improvement of 18% for larger input sizes. This performance improvement is significant and is not intuitively conveyed via the logarithmic-scale graph in Figure 2-13A. Comparing TSHMEM to OSHMPI, TSHMEM's performance improvement decreases as the problem size increases, but the execution time differences between TSHMEM and OSHMPI

Table 2-3. Performance of OSH heat image at 36 cores for varying problem sizes.

                       TSHMEM      Compared to OpenMP            Compared to OSHMPI
Problem size           Time (s)    Time (s)   Improvement (%)    Time (s)   Improvement (%)
1024 × 1024               13.5        16.2         17               16.0         16
2048 × 2048               61.2        87.6         30               67.4          9.3
4096 × 4096              347.6       440.6         21              355.4          2.2
8192 × 8192             1568.8      1918.8         18             1581.1          0.78
16384 × 16384           6313.6      7717.2         18             6371.1          0.90

show an increasing trend. At 8192×8192, the time difference is 12.3 seconds in favor of TSHMEM over OSHMPI. The time difference at 16384×16384 is 57.5 seconds, an increasing trend favoring TSHMEM. This trend indicates that the rate of growth for the time difference is positive, but is slower than the rate of growth of the raw execution time for problem sizes less than 16384×16384. As a result, TSHMEM exhibits higher scalability than OSHMPI and each percentage point of improvement becomes more significant as problem sizes increase. Performance improvement of TSHMEM begins to increase at 16384×16384, indicating that the rate of growth for the time difference is now faster than the rate of growth of the execution time. 2.4.6 Distributed FFT with SHMEM and FFTW

The final application involves the process-based parallelization of a popular FFT library, FFTW [30]. The application performs a distributed, one-dimensional, discrete Fourier transform (DFT) using the FFTW library, with data setup and inter-process communication via SHMEM. While the FFTW library is already multithreaded internally, this application uses SHMEM instead of MPI to handle inter-process communication via fast one-sided puts to quickly exchange data for a distributed system. An OpenMP implementation was not tested for this application. The execution times are shown in Figure 2-13B for the TILE-Gx. This application executes in three phases: (1) DFT operation with twiddle calculations and data exchange, (2) matrix transpose, and (3) DFT operation. All of the SHMEM communication occurs in phase one during data exchange and, for each PE, consists of a linear number of put operations and a

computational barrier. TSHMEM and OSHMPI execution times are similar and achieve full-device scaling, with TSHMEM demonstrating a slight performance advantage over OSHMPI due to higher-performance put and barrier operations. OpenSHMEM is able to achieve a moderate amount of scalability, but not to the extent of either TSHMEM or OSHMPI.
2.5 Concluding Remarks

In exploring PGAS semantics for modern many-core processors, we have presented and evaluated our design and analysis of TSHMEM, a high-performance OpenSHMEM library built atop Tilera-provided libraries for the Tilera TILE-Gx and TILEPro many-core architectures. The current TSHMEM design provides for all of OpenSHMEM functionality, excluding static-variable support for atomic operations. Our analysis of TSHMEM serves as an evaluation basis for low-level PGAS semantics and performance on modern and emerging many-core processors with the intent of enabling similar libraries to deliver higher utilization and performance for current- and next-generation many-core systems. Performance, portability, and scalability of SHMEM applications for the TILE-Gx are illustrated via numerous application case studies comparing TSHMEM performance with OpenMP, the OpenSHMEM reference implementation, and OSHMPI. Our experiments exhibited application-scalability concerns with the OpenSHMEM reference implementation due to generic instrumentation for TILE-Gx used by its underlying GASNet communications runtime. As a result, we focus our experiments on analyzing performance behavior with TSHMEM, OpenMP, and OSHMPI. For application scalability, TSHMEM, OpenMP, and OSHMPI exhibit similar trends, but when exploring different problem sizes at full-device utilization, TSHMEM demonstrates a marginal to significant performance improvement. This conclusion provides validation to a bare-metal library design for TSHMEM on many-core devices. In the following chapter, we investigate the performance of another many-core architecture, the Intel Xeon Phi, and conduct performance benchmarks to determine its capabilities. Through these benchmarks, we discover the architectural and application benefits for several

HPC domains and apply our experiences with TSHMEM to research and design an efficient, inter-device programming library for many-core systems.

51 CHAPTER 3 EVALUATING MANY-CORE PERFORMANCE WITH NAS PARALLEL BENCHMARKS With the emergence of many-core processors into the high-performance computing (HPC) scene, there is strong interest in evaluating and evolving existing parallel-programming models, tools, and libraries. This evolution is necessary to best exploit the increasing single-device parallelism from multi- and many-core processors, especially in a field focused on massively distributed supercomputers. Although many-core devices offer exciting new opportunities for application acceleration, these devices need to be properly evaluated between each other and the conventional servers they potentially supplement or replace. In this chapter, we evaluate the performance of OpenMP applications on two current- generation many-core devices, the Tilera TILE-Gx and the Intel Xeon Phi. We present results from the suite of NAS Parallel Benchmarks (NPB) [31] on these many-core platforms in order to evaluate their architectural strengths at different categories of computation and communication common among HPC applications. OpenMP implementations are provided by the native compiler for each platform. Results from these applications emphasize comparative performance of our many-core devices and enable optimal selection and usage of the underlying architectures. The remainder of the chapter is organized as follows. Section 3.1 provides background on OpenMP and brief architectural descriptions of the Tilera TILE-Gx and Intel Xeon Phi. Section 3.2 analyzes comparative architectural strengths of the devices with results from NAS Parallel Benchmarks. Finally, Section 3.3 provides concluding remarks. 3.1 Background

This section provides brief background of OpenMP and the Tilera and Intel many-core platforms that will execute these applications. 3.1.1 OpenMP

The OpenMP specification defines a collection of library routines, compiler directives, and environment variables that enable application parallelization via multiple threads of

execution [2]. Standardized in 1997, OpenMP has been widely adopted and is portable across multiple platforms. OpenMP commonly exploits SMP architectures by enabling both data-level and thread-level parallelism. Parallelization is typically achieved via a fork-and-join approach controlled by compiler directives whereby a master thread will fork several child threads when encountering an OpenMP parallelization section. The child threads may be assigned to different processing cores and operate independently, thereby sharing the computational load with the master. Threads are also capable of accessing shared-memory variables and data structures to assist computation. At the end of each parallel section, child threads are joined with the master thread and the parallel section closes. The master thread continues on with sequential code execution until another parallel section is encountered. While other multi-threading APIs exist (e.g., POSIX threads), OpenMP is comparatively easier to use for developers who desire an incremental path to application parallelization for their existing sequential code. With the emergence of many-core processors such as the TILE-Gx and Xeon Phi, OpenMP is evolving to become a viable choice for single-device supercomputing tasks.
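A minimal example of this fork-and-join style, using a parallel-for region with a reduction, is sketched below for illustration; the array sizes and names are arbitrary.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000000 };
    static double a[N], b[N];
    double sum = 0.0;

    /* The master thread forks a team here; loop iterations are shared among
     * the threads, and the team joins at the end of the region. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * b[i];          /* data-level parallelism across iterations */
        sum += a[i];
    }                               /* implicit join and barrier */

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}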

3.1.2 Tilera TILE-Gx

Tilera Corporation, based in San Jose, California, develops commercial many-core processors with emphases on high performance and low power in the general server and embedded devices markets. Each Tilera many-core processor is designed as a scalable 2D mesh of tiles, with each tile consisting of a processing core and cache system. These tiles are attached to several on-chip networks via non-blocking cut-through switches. Referred to as the Tilera iMesh (intelligent Mesh), this scalable 2D mesh consists of dynamic networks that provide data routing between memory controllers, caches, and external I/O and enables developers to explicitly transfer data between tiles via a low-level user-accessible dynamic network.

53 Our research focuses on the current-generation Tilera TILE-Gx8036. The TILE-Gx is Tilera’s new generation of 64-bit many-core processors. Differentiated by a substantially redesigned architecture from its 32-bit predecessor—the TILEPro—the TILE-Gx exhibits upgraded processing cores, improved iMesh interconnects, and novel on-chip accelerators. Each 64-bit processing core is attached to five dynamic networks on the iMesh. The TILE-Gx8036 (a specific model of the TILE-Gx36) has 36 tiles, each consisting of a 64-bit VLIW processor with 32k L1i, 32k L1d, and 256k L2 cache. Furthermore, the L2 cache of each core is aggregated to form a large unified L3 cache. The TILE-Gx8036 offers up to 500 Gbps of memory bandwidth and over 60 Tbps of on-chip mesh interconnect bandwidth. An operating frequency from 1.0 to 1.2 GHz allows this processor to perform up to 750 billion operations per second at 10 to 55W (22W typical) TDP. Other members of the TILE-Gx family include the TILE-Gx9, TILE-Gx16, and TILE-Gx72. In addition, the TILE-Gx includes hardware accelerators not found on previous Tilera processors: mPIPE (multicore Programmable Intelligent Packet Engine) for wire-speed packet classification, distribution, and load balancing; and MiCA (Multicore iMesh Coprocessing Accelerator) for cryptographic and compression acceleration. 3.1.3 Intel Xeon Phi

The Xeon Phi is Intel’s product family of many-core coprocessors. With processing cores based on the original Intel Pentium architecture, the Xeon Phi architecture is comprised of up to 61 -like cores with an in-order memory model on a ring-bus topology. Its performance strength is derived from the 512-bit SIMD vector units on each core, providing maximum performance to highly vectorized applications. With four hardware-thread units per core (up to 244), the Xeon Phi can theoretically achieve more than 1 TFLOPS of double-precision performance. Each core of the Xeon Phi consists of an x86-like processor with hardware support for four threads and 32k L1i, 32k L1d, and 512k L2 caches. The caches are kept coherent via a globally distributed tag directory. Several wide high-bandwidth bidirectional ring interconnects connect the cores to each other and to the on-board GDDR5 memory modules. Data movement is

facilitated by this bidirectional hardware ring bus, including cache-line sharing via each core's private L2 cache. With the Xeon Phi, Intel aims to offer a general-purpose application coprocessor and accelerator for large-scale heterogeneous systems. Housed in a GPU form factor, the Xeon Phi attaches to the PCIe bus of a standard host server and provides application acceleration via three modes of operation: offload for highly parallel work while the host processor executes typically serial work, symmetric for parallel work sharing between the host and coprocessor, and native for coprocessor-only application executions. Multiple Xeon Phi coprocessors can be attached to a single server for very high computational throughput per unit area. By supporting tools and libraries (e.g., OpenMP, Intel Cilk, Intel MKL) available for code acceleration on Xeon processors, code portability to the Xeon Phi is relatively straightforward. Further performance optimizations for the Xeon Phi generally enable higher parallelism for the application on other platforms. Our research focuses on the Intel Xeon Phi 5110P coprocessor. This coprocessor model is comprised of 60 cores with 240 hardware threads, 30 MB of on-chip cache, and 8 GB GDDR5 memory with peak bandwidth of 320 GB/s. Operating at 1.053 GHz, this passively cooled coprocessor has a 225W TDP.
3.2 Architecture Profiling with NPB

The NAS Parallel Benchmarks (NPB) are performance programs developed and main- tained by the NASA Advanced Supercomputing (NAS) Division. The focus of NPB is performance profiling of highly parallel computing systems with computation and communi- cation patterns that are common among HPC applications. NPB has attracted community support due to several main features: portability of benchmarks between various platforms, flexibility of execution with benchmark configuration options, exhaustiveness with increasingly more time-consuming classes of input workloads (e.g., W, A, B, C), and diverse benchmarks for a variety of problem domains. NPB offers descriptive benchmarking of a system when comparatively referenced with other results.

55 The focus of this section is on architectural analysis between the TILE-Gx and Xeon Phi many-core processors. We leverage NPB version 3.3.1 to form the basis of our comparisons. While the suite includes MPI, OpenMP, and serial programs, the nature of our comparison is with SMP processors, therefore we primarily profile with the OpenMP suite. Serial baselines provided by NPB are leveraged as a baseline for speedup. This section consists of five kernel application types, three pseudo-applications (simulations), and two additional applications which specifically target irregular memory accesses and data movement. Each application was compiled and executed with O3 optimization levels. The following subsections briefly introduce each individual NPB application and then present an analysis of the benchmark results on TILE-Gx8036 and Xeon Phi 5110P, concluding with a synopsis of findings. 3.2.1 NPB Kernels

The five NPB kernel applications perform core computations for various numerical methods commonly used in the field of computational fluid dynamics (CFD) [32, 33]. Each kernel focuses on a particular type of numerical computation. 3.2.1.1 IS: integer sort

The IS application performs a sorting operation on a large integer data set with a bucket-sort algorithm. This type of operation finds use in CFD applications that require particle methods [31]. Unlike the majority of other NPB applications, IS does not test floating-point performance of the system and instead focuses on integer performance. Execution times for the TILE-Gx and Xeon Phi with IS Class C are presented in Figure 3-1. Tilera processors are advertised for their integer performance, and these results offer a direct comparison with the Xeon Phi. With performance at or above parity, the TILE-Gx definitively excels at integer performance with a fraction of the power consumption compared to the Xeon Phi.
3.2.1.2 EP: embarrassingly parallel

EP is an embarrassingly parallel application designed to test the maximum achievable peak performance on a particular system. The application involves generating pairs of Gaussian random deviates which are used to perform 2D statistics for many CFD applications [31].

Figure 3-1. Execution times for NPB kernels (IS Class C, EP Class B, CG Class B, MG Class B, and FT Class B) on the TILE-Gx and Xeon Phi.

57 Generation of pseudo-random numbers is done in parallel to minimize the cost of inter- process communication. Typical to Monte Carlo simulations, this application involves no communication until the very end when a reduction operation is performed. Figure 3-1 shows that the Xeon Phi is about five times faster in execution time than the TILE-Gx. Although this application is embarrassingly parallel, the Xeon Phi stops linearly scaling around 36 threads. This result is perplexing since the Xeon Phi with scatter thread affinity has only loaded 36 out of the available 60 cores. In Table 3-1, the speedup at 64 threads further decreases with four cores being actively shared by two threads each and one thread on the remaining 56 cores. We offer no conjecture for the sublinear performance scaling of an embarrassingly parallel application on the Xeon Phi. For the TILE-Gx, speedup is linear up to 34 of its 36 tiles. This result is attributed to the TILE-Gx architecture and operating-system management. The TILE-Gx mesh fabric— consisting of processors, cache, and switches—needs to interface externally via I/O shims on the boundaries of the device. These shims convert I/O traffic into packets amenable for transfer on the mesh. When transferring over interfaces such as PCIe, our TILE-Gx requires a minimum of two cores to handle I/O-shim traffic processing. Speedup results in Table 3-1 illustrate this core oversubscription with 36 tiles not performing with a speedup of 36 due to time-multiplexing of two threads of execution. Additional TILE-Gx executions at 34 threads show the expected speedup of 34. While the minimum of one tile is required to handle necessary I/O traffic, more may be used depending on additional I/O functionality requested (e.g., Universal Serial Bus) as in our case. 3.2.1.3 CG: conjugate gradient

The CG application implements a “conjugate gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix” [34]. This method is commonly used to solve an unstructured, sparse linear system of equations. The kernel tests performance with unstructured grid computations and irregular long-distance communications.

58 Similar to EP, the CG application is easily parallelized with the majority of parallelism occurring inside the conjugate-gradient computation loop. The execution times in Figure 3-1 show that TILE-Gx outperforms Xeon Phi on this application for all PE counts. The TILE-Gx execution times are roughly two-thirds of the Xeon Phi execution times. While this result is interesting on its own, Table 3-1 also shows speedup as definitively superlinear on the Xeon Phi and borderline superlinear with the TILE-Gx. The authors responsible for the NAS OpenMP implementations have similarly experienced superlinear speedup when evaluating CG on an SGI Origin 2000 [34]. Unfortunately, no clear explanation was provided; however, high cache-miss rates are suspected due to the irregular matrix-access pattern. 3.2.1.4 MG: multi-grid

MG implements a multi-grid method to solve a 3D Poisson partial differential equation [33]. Multi-grid methods are effective in solving problems which take a large number of iterations to obtain error convergence. By changing the grid from fine to coarse, the multi-grid method is capable of solving sparse/realistic problems within a fixed number of iterations. This kernel sequentially executes five different routines that each implement loop parallelism on their outer-most loops [34]. The application tests performance of both short- and long-distance structured communication. The execution times in Figure 3-1 show that Xeon Phi outperforms TILE-Gx with more than three times reduction in runtime for all PE counts. In addition, speedups on the Xeon Phi surpass those of the TILE-Gx. With lower execution times and higher scalability, the Xeon Phi excels at this class of applications over the TILE-Gx.
3.2.1.5 FT: discrete 3D Fourier transform

This application implements a Fast Fourier Transform (FFT) to solve 3D partial differential equations (PDE). The application calls forward FFT operations for each dimension of the 3D PDE and then iteratively calls inverse FFT routines. Designed to rigorously test the communication performance of the system [34], execution times with FT show that Xeon Phi performs 1.8–2.3 times faster than the TILE-Gx until 16 threads. Around 16 threads, TILE-Gx

Table 3-1. Speedup of NPB OpenMP for TILE-Gx and Xeon Phi.

            PEs   IS-C   EP-B   CG-B   MG-B   FT-B   BT-A   SP-A   LU-A   UA-A   DC-W
TILE-Gx       2    1.9    2.0    2.0    1.4    1.9    2.0    2.0    1.9    1.5    1.7
              4    3.6    4.0    4.0    2.6    3.6    3.8    3.8    3.7    2.8    3.1
              8    6.6    8.0    8.1    5.0    6.4    7.7    7.4    7.1    5.3    5.2
             16   12.4   15.9   16.0    9.6   10.1   15.1   13.7   13.2    9.6    7.4
             32   22.2   31.8   30.9   17.2   13.5   29.5   23.9   23.1   15.7    9.6
             36   22.8   31.9   34.0   17.3   13.5   28.9   23.3   22.4   16.6    9.4
Xeon Phi      2    1.8    1.8    2.2    2.0    1.9    2.0    2.0    1.7    0.6    1.9
              4    3.6    3.6    4.7    4.0    3.8    3.9    3.7    3.5    1.2    3.9
              8    7.3    8.0    9.4    7.9    7.7    7.8    7.1    6.9    2.1    7.7
             16   14.6   16.0   18.5   15.3   14.2   15.4   14.7   12.7    4.1   15.5
             32   29.0   32.0   36.6   29.1   27.6   29.6   28.5   23.0    7.1   28.8
             36   32.6   32.3   41.3   30.2   29.6   26.5   28.5   24.8    7.8   32.1
             64   43.8   48.5   80.0   36.2   45.2   49.7   42.4   35.5   11.1   45.5
            128   61.6   78.4  148.7   54.0   57.1   45.6   39.6   31.1   15.5   74.4
            240   44.0  109.4  105.6   20.8   30.4   14.3   13.7   11.0    7.2      –

For each application on each platform, serial baseline is taken as the fastest execution time from three separate programs: (1) NPB OpenMP with PE = 1, (2) NPB OpenMP compiled serially (ignoring OpenMP parallelization directives), and (3) NPB serial program. All executions on Xeon Phi explicitly set thread affinity via KMP_AFFINITY=granularity=fine,scatter.

begins to decrease in performance while the Xeon Phi maintains scalability until 128 threads. Table 3-1 shows that Xeon Phi speedup peaks at 57.1 while the TILE-Gx attains a speedup of 13.5 at full-device scale. 3.2.2 NPB Pseudo-applications

The NPB benchmark suite includes three pseudo-applications that combine computations to mimic the execution order for several important CFD applications [31]. Some of the complexities associated with actual CFD applications have been stripped from these pseudo-applications when those complexities are not significant to parallel performance.
3.2.2.1 BT: block tri-diagonal solver

The BT application solves multiple independent systems of tri-diagonal equations. Three sets of independent systems of equations are progressively solved using a multi-partition scheme in order to balance computational load and minimize communication. This application primarily profiles the computational density of the system. The execution times in Figure 3-2 for the

Figure 3-2. Execution times for NPB pseudo-applications (BT Class A, SP Class A, and LU Class A) on the TILE-Gx and Xeon Phi.

Xeon Phi are roughly 4.5 times faster than execution times for the TILE-Gx. Speedups for the two platforms are very similar, with the Xeon Phi continuing to scale until 64 threads. 3.2.2.2 SP: scalar penta-diagonal solver

In contrast to BT with tri-diagonal equations, SP solves scalar penta-diagonal systems. The application is parallelized similarly to BT and also performs coarse-grained communication to test the computational power of the system. Execution times with the Xeon Phi are around two times faster than the TILE-Gx. Similar to BT, the speedup of SP for both platforms scale in close relation to each other. 3.2.2.3 LU: lower-upper Gauss–Seidel solver

LU implements a lower-upper diagonal solver with a symmetric successive over-relaxation (SSOR) method to solve a square-block diagonal system split into lower and upper triangular blocks [34]. This application includes four main routines which are iteratively called by the

Figure 3-3. Execution times for NPB unstructured computation and data movement (UA Class A and DC Class W) on the TILE-Gx and Xeon Phi.

SSOR solver. Similar to BT and SP, the LU application exhibits the same trends in execution times and speedup on both Xeon Phi and TILE-Gx. LU execution times on the Xeon Phi are around two times faster compared to TILE-Gx. For these three pseudo-solvers, all speedups on the Xeon Phi reach a maximum around 64 threads. 3.2.3 NPB Unstructured Computation and Data Movement

The original NPB applications primarily had straightforward, fixed-stride, memory-access patterns which could be exploited in order to amortize any memory-traffic penalty [35]. Furthermore, as the sizes of real-world data became exceedingly large, the stress exhibited on the system’s memory hierarchy by the large number of data movements became a limiting factor. Unfortunately, this concern was not assessed by the original benchmarks [36]. The two more-recent NPB applications presented in this subsection exert additional pressure on the memory system by performing unpredictable dynamic memory accesses and intensive data transfers. 3.2.3.1 UA: unstructured adaptive mesh

The UA application simulates solving a 3D heat-transfer problem [35]. Implemented on an unstructured mesh that is adaptively and dynamically refined, the mesh is adjusted finer in locations where large temperature gradients exist and coarser in other locations. This application specifically profiles the memory and communication of the system by performing

62 irregular, unpredictable memory accesses that are typical of modern scientific applications. Execution times show a narrow difference in favor of Xeon Phi vs. TILE-Gx. Unlike previous NPB applications, the serial baseline for the Xeon Phi outperforms the parallel version with two threads. Potential explanations for this result are offered during the architectural analysis in Section 3.2.4. 3.2.3.2 DC: data cube

This application implements an arithmetic data cube to test the data-handling capabilities of the system when dealing with large, distributed data sets [36]. The DC application generates huge volumes of data and is capable of testing different levels of a memory hierarchy from L1 cache to distributed storage. This benchmark establishes the performance of a system for application domains such as data mining that are capable of exerting similar pressure on the memory hierarchy of the system. By default, DC indirectly benchmarks file system performance by reading and writing its data onto disk. For our experiments, DC was compiled with its IN_CORE feature to enable in-memory benchmarking instead of writing to disk. The execution times for TILE-Gx are faster compared to the Xeon Phi for PE counts less than 32 and perform near parity at 32 and 36 PEs. Unfortunately, DC consumes more memory with increasing PE counts and larger problem sizes. Executions above 144 threads were not possible at Class W. While not displayed in Figure 3-3, DC was additionally profiled at Class A and showed similar trends for execution time and speedup when compared to Class W; however, excessive memory consumption limited scalability tests to less than 64 threads at Class A. 3.2.4 Architectural Analysis

The applications of this section showcase common kernel operations and communication patterns in HPC applications. The use of OpenMP has its advantages for SMP-processor benchmarking, however scalability becomes a concern for several of these benchmarks as more cores are increasingly introduced by many-core devices. Several NPB benchmarks such as

63 the pseudo-applications BT, SP, and LU all experience performance-scaling issues beyond 64 threads. There are several possible explanations for this scaling issue:

1. Several NPB applications may no longer scale well with increasingly high core counts. While possible, this hypothesis is unlikely due to the several classes of input data sizes provided by NPB, each increasing the workload roughly four times from the previous class. For example, Class-B executions contain roughly four times more work than Class-A executions.

2. The Xeon Phi 5110P has 60 cores with 240 hardware-thread engines. At a power-of-two scale of 64 threads, four of these cores will be loaded with two threads, sharing the resources and cache of that core. Performance scaling possibly degrades if low cache locality is experienced or shared resources such as the vector engines need to be time- multiplexed. While this hypothesis is the most likely conclusion, further investigation is needed. Although several NPB applications do not scale on the Xeon Phi to full-device utilization, those applications consistently stop scaling around half-device utilization (120 threads). The OpenMP implementation of EP, however, scales well and fully utilizes the Xeon Phi due to minimal communication. The majority of these applications experience similar trends in comparative performance for the TILE-Gx and Xeon Phi. Exceptions arise when either the TILE-Gx or Xeon Phi narrowly or strongly outperform the other. The computational kernels IS and CG, for example, show high potential for the TILE-Gx due to near parity performance with Xeon Phi. Lower power consumption enables the TILE-Gx to outperform on a per-watt basis. The Xeon Phi, however, excels at computation and communication in line with the MG kernel and the BT, SP, and LU pseudo-applications. Along with execution times several factors faster than those of the TILE-Gx, the Xeon Phi also offers higher floating-point performance and higher scalability with up to 244 threads (61 cores). Effective exploitation of the 244 threads is necessary to achieve high performance on the Xeon Phi, especially as per-core shared resources become a greater concern. The UA and DC applications focus on benchmarking memory performance with irregular accesses. The results show that the TILE-Gx performs well compared to the Xeon Phi. TILE- Gx uses DDR3 memory while Xeon Phi sacrifices some latency for higher-bandwidth GDDR5 on-board memory. Due to the irregular memory accesses, the Xeon Phi generally incurs significant latency costs when retrieving these accesses from memory whereas the TILE-Gx

has slightly faster latency with DDR3. Fortunately, the Xeon Phi has a larger amount of on-chip cache (30 MB vs. TILE-Gx8036's 9 MB) and uses aggressive memory prefetching to mitigate some of these costs. With irregular accesses, however, prefetching may not perform well or may adversely affect performance. Finally, the TILE-Gx features a very high-bandwidth mesh-interconnect fabric optimized for data-packet traversal and memory accesses that definitively assists with the UA and DC applications. NPB results demonstrate that the TILE-Gx is a viable choice for IS- or CG-based applications, integer arithmetic and comparisons, and lower-latency irregular memory accesses when compared to the Xeon Phi. The Xeon Phi excels at the group of pseudo-applications BT, SP, and LU (used commonly in CFD), floating-point performance, and applications that are highly vectorized. The Xeon Phi also includes additional optimizations that are available for applications amenable to memory streaming. Although mentioned only in passing, these performance results should be power normalized to quantitatively determine the computational density per watt of these devices.
3.3 Concluding Remarks

We have presented and evaluated exhaustive platform benchmarking with OpenMP applications from the NAS Parallel Benchmarks (NPB) in order to compare architectural strengths of the TILE-Gx and Xeon Phi. This work illustrates several major contributions. In conducting our NPB OpenMP-applications analysis, architectural strengths emerged for the TILE-Gx (integer operations, low-latency memory accesses) and the Xeon Phi (MG and the CFD pseudo-applications). Surprising results that merit further investigation include sublinear speedup for the EP embarrassingly parallel kernel on the Xeon Phi and superlinear speedup with the CG conjugate-gradient kernel. By leveraging the insights from our performance analysis with TSHMEM on TILE-Gx and NPB on TILE-Gx and Xeon Phi, we expand our TSHMEM work toward developing an effective OpenSHMEM programming library for multiple Xeon Phi coprocessors.

65 CHAPTER 4 ANALYSIS AND DESIGN OPTIMIZATION OF SCIF COMMUNICATIONS FOR PGAS COMPUTING WITH SHMEM ACROSS MANY-CORE COPROCESSORS Due to the technological and physical limitations of frequency scaling, modern processor architectures are delivering increasingly higher performance through wider parallelism and more processing cores. At the extreme, this trend gives rise to emerging many-core architectures such as the Intel Xeon Phi coprocessor with focus on extremely parallel tasks using processing cores that are individually less complex but significantly more numerous than modern multi- core processors found in mainstream servers. The Xeon Phi architecture features up to 61 cores (244 hardware threads) alongside 8 to 16 GB of on-board memory and x86_64 compatibility, representing an attractive alternative to graphics-processing units (GPUs) and suitable for applications that desire both performance and portability with the opportunity for incremental optimizations. Furthermore, the Xeon Phi can execute applications in native mode, enabling traditional high-performance computing (HPC) applications that communicate via message passing with MPI [1] or shared memory with OpenMP [2] to execute directly on the coprocessor. These emerging many-core architectures present a unique option for application acceleration in numerous HPC domains. As multi-core and many-core devices evolve to include increasingly higher core counts, servers and systems begin to have more computation localized among processing devices within a node, providing greater incentive to optimize for intra-node performance. This trend is especially relevant for accelerators similar to the Xeon Phi that can be densely packed into a single platform, thereby substantially increasing that platform’s compute capabilities and enabling lower-latency communication for parallel applications. Understanding the communication behaviors between these devices within a node becomes valuable for application and library developers intending to correctly optimize data movement for these compute-dense systems. In this chapter, we present research, design, and analysis for inter-device communication performance and behavior on a computationally dense system node consisting of four Intel

Xeon Phi 5110P many-core coprocessors. For HPC applications, data movement and effective communication with these coprocessors can significantly affect runtime performance. Our approach includes extensive microbenchmarking and performance analysis of SCIF (Symmetric Communications Interface [37]), a high-performance, inter-device communications library for Intel Xeon processors and Xeon Phi coprocessors. We then present design and analysis for a new version of TSHMEM, our OpenSHMEM library with newly integrated support for Xeon Phi. TSHMEM is designed for efficient intra-node communication between multiple Xeon Phi coprocessors, leveraging the insights gained from our analysis with SCIF. Our experiments with TSHMEM are evaluated alongside several MPI implementations—MPICH [20], MVAPICH2-MIC [38], and Intel MPI [39]—in order to provide a comprehensive, multi-device performance study with fully utilized Xeon Phi coprocessors in a single node. In doing so, we aim to enable critical insights into intra-node behavior with SCIF and several popular MPI implementations, as well as deliver a high-performance, many-core programming library with TSHMEM. The remainder of the chapter is organized as follows. Section 4.1 provides background on the Intel Xeon Phi many-core coprocessor, a synopsis of the partitioned global address space (PGAS) model and OpenSHMEM, and several related works. Section 4.2 presents communication methods for Xeon Phi and intra-node microbenchmark results with SCIF. Section 4.3 delves into the design of TSHMEM for Xeon Phi. Section 4.4 showcases microbenchmark and application results with TSHMEM and several MPI implementations. Finally, Section 4.5 provides concluding remarks. 4.1 Background

One of the most common programming styles on large parallel systems is single-program, multiple-data (SPMD). By partitioning a large dataset across replicated application kernels, SPMD enables diverse programming models such as message passing and partitioned global address space (PGAS). This section provides a brief overview of the Intel Xeon Phi architecture, PGAS, and OpenSHMEM, which form the foundation of our experience and design with

67 TSHMEM. Several related works are also provided as they serve as additional references for exploratory designs and performance analyses on systems with Xeon Phi coprocessors. 4.1.1 Intel Xeon Phi (Knights Corner) Coprocessor

Formerly known as the Many Integrated Core (MIC) architecture, the Xeon Phi is Intel's many-core product line focused on high-performance computing. We examine the Knights Corner architecture, which is provided as a many-core coprocessor suitable for application acceleration with a host processor. Xeon Phi coprocessors based on Knights Corner have up to 61 processing cores at 1.25 GHz, with each core consisting of 32 KB L1i, 32 KB L1d, and 512 KB L2 cache [40]. Each Xeon Phi core provides in-order processing and also supports up to four hardware thread contexts (up to 244 total) that share the core's resources via time multiplexing. These cores are interconnected via a wide bi-directional ring bus with each other, the on-board GDDR5 memory modules (up to 16 GB), and a globally distributed tag directory that handles coherency for the L2 caches. Part of the Xeon Phi's performance strength is derived from the 512-bit SIMD vector units for more than 1 TFLOPS of double-precision performance and up to 352 GB/s memory bandwidth along the ring interconnect for highly vectorized applications. Programming for the Xeon Phi can be achieved through three execution models: offload, native, and symmetric. Offload execution is the accelerator-based model whereby applications executing on a host processor can annotate their source code with offload directives. These directives (e.g., pragmas or Intel-specific keywords) enable data transfer to and from the coprocessor as well as code execution on it. This style of programming is recognizable to users familiar with general-purpose processing on graphics processing units (GPGPUs) through APIs such as OpenCL. In contrast, native execution is similar to treating the coprocessor as its own compute node. Application binaries are compiled explicitly for the Xeon Phi and execute directly on the coprocessor. Symmetric execution extends native execution: the application workload is shared between the host processor and one or more coprocessors running natively. This style of execution is well suited to applications that have significant compute serialization, allowing the

68 serial portion to remain on the faster host processors while the highly parallel work is executed on the Xeon Phi’s numerous cores. The Xeon Phi coprocessor supports an x86_64-like instruction set and runs a native Linux operating system, enabling ease of portability for the majority of x86 applications and developer tools. Intel also provides software support for the Xeon Phi through their compilers and libraries (e.g., OpenMP, Intel MPI, Intel Cilk, and Intel MKL). For these reasons, Xeon Phi development has a low entry barrier, but applications still require proper SIMD optimizations to fully realize the best performance with the coprocessor. 4.1.2 PGAS and OpenSHMEM

HPC has traditionally focused on models such as message passing with MPI [1] or shared memory with OpenMP [2]. However, interest is rising for a partitioned global address space (PGAS) abstraction with its potential to enable high-performance libraries and languages around a straightforward memory and communication model. Notable members of the PGAS family include SHMEM [8, 4], Unified Parallel C (UPC), Global Arrays (GA), Co-Array Fortran (CAF), Titanium, GASPI, MPI-3 RMA [5], X10, and Chapel. The SHMEM communication library adheres to a strict PGAS model whereby each cooperating parallel process (also known as a processing element, or PE) owns a shared symmetric partition within the global address space. Each symmetric partition contains symmetric objects (scalar variables or arrays) of the same size, type, and relative address on all PEs. SHMEM provides several routines for explicit communication between PEs, including one-sided data transfers (puts and gets), blocking barrier and point-to-point synchronization, collectives, and atomic memory operations. The power of SHMEM comes from its simple memory model with the potential to enable lightweight library abstractions and hardware-level optimizations. Modern SHMEM development is maintained through OpenSHMEM, a community-driven effort among academia, industry, and government to standardize SHMEM semantics and provide improvements and advancements for next-generation applications and systems [8, 7].
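To make the SHMEM memory and communication model concrete, the following minimal sketch (hypothetical, not taken from any benchmark in this work) shows a symmetric allocation, a one-sided put, and barrier synchronization with the OpenSHMEM C API:

#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();                          /* create and initialize the PEs */
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric object: same size, type, and relative address on every PE. */
    long *counter = (long *)shmem_malloc(sizeof(long));
    *counter = 0;
    shmem_barrier_all();

    /* One-sided put: write this PE's rank into the next PE's partition. */
    long value = me;
    shmem_long_put(counter, &value, 1, (me + 1) % npes);

    shmem_barrier_all();                   /* ensure all puts are complete and visible */
    printf("PE %d of %d received %ld\n", me, npes, *counter);

    shmem_free(counter);
    shmem_finalize();
    return 0;
}

The put targets the remote PE's copy of counter directly; no matching receive is required, which is the source of SHMEM's lightweight, one-sided character.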

69 OpenSHMEM has already seen research and industry adoption in various implementations: the OpenSHMEM reference implementation [9], TSHMEM [41], MVAPICH2-X [10], OSHMPI [11], Portals-SHMEM [12], POSH (Paris-OpenSHMEM) [13], and through vendors such as SGI [3], Cray [14], and Mellanox [15]. 4.1.3 Related Works

One of the first works that explores communication performance with multiple Xeon Phi coprocessors comes from Colfax International and their performance analysis with Intel MPI and InfiniBand [42]. Colfax focused on introducing users to the Xeon Phi environment with their system-setup methodology for a testbench cluster of two nodes each with four Xeon Phi 31S1P coprocessors (57 cores), then investigated Intel MPI performance on two network fabrics: TCP and DAPL (Direct Access Programming Library). The following DAPL providers were emphasized: SCIF, mlx4_0 for InfiniBand, and mcm for MIC-based InfiniBand via host-side proxy daemon (mpxyd). In Section 4.2.4, we validate a subset of the Colfax results for the performance of SCIF with our own performance results while providing additional analysis and insights to SCIF behavior. MVAPICH2-MIC is an MPI library based on MVAPICH2 with optimizations for InfiniBand- based Xeon Phi clusters [38]. Their work focused on efficient coprocessor communication with SCIF within a node and InfiniBand between nodes. The experiments were performed on the TACC Stampede supercomputer with a per-node system setup of dual-socket Xeon E5-2680 Sandy Bridge host processors and one Xeon Phi SE10P (61 cores). Because each Stampede node only has a single coprocessor, MVAPICH2-MIC intra-node experiments were limited to (1) within a coprocessor, and (2) between coprocessor and host processor. We expand on their results in Section 4.4 by including MVAPICH2-MIC as one of the MPI implementations for intra-node performance analysis between multiple coprocessors. In addition to MVAPICH2-MIC, there are other specialized MPI libraries that target Xeon Phi clusters, such as DFCA-MPI [43] and MT-MPI [44]. In the PGAS domain, a subset of the

70 related research with Xeon Phi includes OpenSHMEM [45, 46], UPC [47, 48], MPI-3 RMA [49], and GASPI [50]. These works have all explored a facet of design or performance analysis with one or several Xeon Phi coprocessors. Our work differentiates itself by providing a larger breadth and depth of knowledge for the intra-node performance of mainstream libraries on systems with high compute and communication density due to the presence of multiple Xeon Phi coprocessors. 4.2 Communication with Xeon Phi

For HPC applications, communication is often a bottleneck that limits the hardware’s computational capabilities. This section introduces the various communication methods available for Xeon Phi coprocessors, then examines the features and performance of SCIF, a high-performance communications library, in greater detail. 4.2.1 System Setup

Our research system consists of four Intel Xeon Phi 5110P coprocessors [51]. This coprocessor model is based on the Knights Corner architecture and has 60 cores with 240 hardware threads, 30 MB of total L2 cache, and 8 GB GDDR5 memory with peak bandwidth of 320 GB/s. Operating at 1.053 GHz, this passively cooled coprocessor has a 225W TDP and is attached to the host platform via PCI Express (PCIe). The host system consists of two Intel Xeon E5-2620 v2 processors running at 2.10 GHz and a Supermicro X9DRG-QF dual-socket motherboard with a Patsburg-based chipset. The host processors are interconnected with Intel QPI (QuickPath Interconnect) for up to 7.2 GT/s (14.4 GB/s) unidirectional communications bandwidth. The Xeon Phi coprocessors are attached via PCIe 3.0 ×16 interfaces, with each host processor managing two coprocessors. Although these coprocessors are connected to PCIe 3.0 slots, the Xeon Phi 5110P (and others in the Knights Corner product family) only supports PCIe 2.0. Communication performance between coprocessors depends on whether those coprocessors are located on the same PCIe bus or on separate PCIe buses; the latter case necessitates data movement between the adjacent CPUs via QPI.

71 The operating system on the host is CentOS 6.6 (Linux 2.6.32-504.23.4.el6.x86_64) with software tools and libraries provided by Intel Parallel Studio. Execution binaries are cross-compiled for Xeon Phi with Intel Composer XE 2015.3.187. For the Xeon Phi, the operating system and supporting tools are provided by the Intel Manycore Platform Software Stack (MPSS) version 3.5.1. MPSS enables basic functionality on the coprocessors and methods of high-performance communication through libraries such as SCIF or from OFED (OpenFabrics Enterprise Distribution). In lieu of the supplied OFED version in the MPSS distributable, we leverage the upstream OFED version 3.18-rc3 which includes the latest support for Xeon Phi coprocessors. 4.2.2 Communication Methods

Within a single coprocessor, Xeon Phi supports the typical assortment of communication methods and APIs as a result of its x86-like compatibility and Linux operating system. These intra-coprocessor methods include the methods available from Linux such as shared memory (mmap or shm) and various point-to-point constructs (e.g., sockets, pipes, FIFOs, message queues). With compiler support from Intel or from the GNU gcc port, OpenMP is also available as a programming model alongside parallel-programming advancements in mainstream languages such as C, C++, and Fortran. All of these methods are primarily relevant for native execution on the coprocessor. For programming between host processor and coprocessor, offloading is an attractive paradigm for users familiar with accelerator-based models such as OpenMP 4.0, OpenACC, OpenCL, and CUDA. Two main offload methods are available for Xeon Phi developers: explicit offload with offload directives and implicit offload with MYO (Mine-Yours-Ours). Explicit offload is more familiar to OpenMP 4.0/OpenACC users and is leveraged from a host application through offload directives (pragmas) that specify data structures and memory regions to transfer over to one or more coprocessors. Computation kernels are annotated with the appropriate pragmas to operate on that memory before the data is transferred back to the host. This model of programming and communication offers high autonomy for users.
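The sketch below illustrates the explicit-offload style described above, assuming compilation with the Intel compiler; the array name, size, and kernel are illustrative only and not taken from any application in this work:

/* Explicit offload: host annotates data movement and runs the kernel on mic:0. */
#include <stdio.h>
#define N 1024

__attribute__((target(mic))) void scale(float *a, int n, float f) {
    for (int i = 0; i < n; i++)
        a[i] *= f;
}

int main(void) {
    static float data[N];
    for (int i = 0; i < N; i++) data[i] = (float)i;

    /* Transfer data to coprocessor 0, execute the kernel there, copy results back. */
    #pragma offload target(mic:0) inout(data:length(N))
    scale(data, N, 2.0f);

    printf("data[1] = %f\n", data[1]);
    return 0;
}

The in/out/inout clauses make the data transfers explicit, which is what gives this model its accelerator-like feel and its high degree of user control.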

72 Noteworthy, the Intel compilers support the OpenMP 4.0 standard, which includes support for compute devices such as Xeon Phi through similar programming directives. OpenMP 4.0 is thus an option for applications that require standards-compliance and portability. Alternatively, there is MYO, which is a shared-memory approach to synchronize data structures between host processors and coprocessors. Data structures are allocated at the same virtual addresses and implicitly synchronized between the devices, enabling computation via pointer-based operations or on complex data structures such as trees. MYO offers a more implicit communications model and support for MYO is primarily through keywords in Intel Cilk, a C/C++-based multi-threaded parallel-programming language. The offload methods are designed to handle application-specific communication. In contrast, SCIF is a low-level library that enables communication between Xeon processors and Xeon Phi coprocessors in an application-agnostic manner, enabling higher-level libraries such as MPI or OpenSHMEM to leverage its features. Intel also provides COI, the Coprocessor Offload Infrastructure library, that offers an asynchronous, pipelined programming model via source/sink data movement. COI is built atop SCIF for peer-to-peer communication. 4.2.3 SCIF Overview

The Symmetric Communications Interface is a high-performance communications library with a sockets-like connection framework [37]. Each side of a connection is managed through an endpoint descriptor, with connections originating or terminating on Xeon processors or Xeon Phi coprocessors. The bulk of SCIF functionality operates on these endpoint descriptors and abstracts the details for communicating over PCIe. SCIF supports message passing through two-sided communication, but its main performance benefit is from one-sided direct memory access (DMA) via explicit remote-memory access (RMA) operation or remote-memory mapping. The simpler two-sided communication routines scif_send and scif_recv are akin to the two-sided routines in MPI. While these operations do support large-message transfers, they are only truly suitable for small control messages. These message-passing transfers will copy

73 the payload into a transfer buffer before context switching from user space to kernel space to perform the actual operation. Then, the receiving end will be sent an interrupt and move into kernel space to read the data out of the transfer buffer before waking up the blocking recv call. Due to the extra copy operations and kernel transitions from system calls, these routines are not particularly useful for sustained low-latency operations. The best performance of SCIF is from its DMA capabilities. SCIF endpoints are required to register any memory regions that are to be transferred via explicit RMA. These registered memory regions are part of a separate address space called the registered address space that exists in a one-to-one mapping with the application’s virtual address space. The explicit RMA routines scif_readfrom and scif_writeto can then transfer memory between a local registered address and a remote registered address. SCIF also provides two additional routines (scif_vreadfrom and scif_vwriteto) to transfer memory between a local virtual address and a remote registered address. These virtual-address-based RMA variants incur some overhead cost by implicitly registering the local virtual address before performing the transfer, then optionally unregistering the local virtual address when done. As such, their intended use is for applications that need to transfer some temporary region of memory with lifetime not long enough to necessitate registration. High-performance applications are advised to leverage the former readfrom/writeto functions when possible. DMA performance is dependent on memory alignment. Memory regions should be properly cache-line-size aligned (64 bytes) for optimal performance. When memory is not aligned, the transfer is split into head-body-tail payloads in which the aligned body is transferred via the DMA engine, but the head and tail payloads are transferred via the much slower programmed input/output (PIO) method using the CPU. Additionally, applications with large memory regions can experience improved DMA performance by using 2 MB huge pages instead of the standard 4 KB page size. Due to the asynchronous nature of DMA, SCIF provides straightforward memory-ordering routines. Ongoing/uncompleted RMA transfers to or from an endpoint can be marked with

74 scif_fence_mark, and then waited on for completion with scif_fence_wait. Subsequent DMA transfers that occur after a mark operation are not affected by wait. Alternatively, scif_fence_signal can be used to signal on completion of all uncompleted RMA operations either to or from an endpoint. The signal occurs by writing a value into a variable, allowing a task to poll on the variable for completion. The explicit RMA routines perform well for large transfers, but applications with small transfers are not left excluded. Remote registered address spaces can be directly mapped into the local virtual address space with scif_mmap, enabling normal load/store operations (e.g., assignment operator, memcpy) on the remote memory. This method of direct memory access is highly recommended for applications that require low-latency performance for small messages. 4.2.4 SCIF Performance Evaluation

Figures 4-1–4-3 show our SCIF microbenchmark results with latency in the first column and effective bandwidth in the second column. For each figure, we measured the performance of transfers using load/store operations via memcpy (on a memory-mapped region with scif_mmap) and with explicit RMA routines readfrom/writeto and vreadfrom/vwriteto. We evaluated these operations for three localities: intra-device (between two processes on a single coprocessor); inter-device near (between two coprocessors on the same PCIe bus managed by the same host CPU); and inter-device far (between two coprocessors on PCIe buses that are managed by different, adjacent CPUs). On our research system, the four coprocessors are enumerated as mic0, mic1, mic2, and mic3 with virtual network interfaces. For locality, mic0 and mic1 are on the same PCIe bus (and likewise with mic2 and mic3). Communication between the two sockets (e.g., mic0 to mic2) has to traverse Intel QPI on the host processors. 4.2.4.1 Intra-device

For intra-device transfers, read and write performance is symmetric. In Figures 4-1A and 4-1B, read operations are similar in performance to the complementary write operation, and thus their performance curves overlap. SCIF is able to leverage shared memory to avoid

intra-coprocessor overhead on intra-device transfers. For small messages, the best performance is achieved with scif_mmap. Since these load/store operations (in our case, memcpy) are localized within the device, the performance of scif_mmap is equivalent to operating on memory allocated from Linux shared memory, resulting in latencies as low as 30 ns. While not shown, Linux shared-memory results align closely with the scif_mmap results. Bandwidth reaches a peak of 7.7 GB/s for 8 KB messages, but begins to diminish toward 2.5 GB/s for messages larger than 2 MB.

Figure 4-1. SCIF on Xeon Phi for intra-device communication within a single coprocessor. A) Latency. B) Bandwidth.

For transfers larger than 128 KB, switching from direct load/store operations to the explicit RMA routines yields higher performance; with readfrom/writeto, bandwidth reaches greater than 11 GB/s. As previously mentioned, these routines do require the sending and receiving address spaces to be registered with SCIF. Registration can be a limitation for extremely large data structures since it imposes an expensive up-front cost. The registration, however, is significantly worthwhile for intra-device transfers, yielding more than 4× higher bandwidth compared to scif_mmap with large transfers. When registration of both sides of a transfer is inappropriate, vreadfrom/vwriteto can be used instead and only requires the remote end to be in the registered address space. The local virtual address is temporarily registered and possibly unregistered by these operations, incurring a significant runtime penalty as seen in Figure 4-1A. As a consequence, these virtual-address-based RMA operations perform worse than memcpy for transfers less than 4 MB and only reach up to approximately 2.8 GB/s.

4.2.4.2 Inter-device near

For the inter-device near configuration, Figures 4-2A and 4-2B show the performance of SCIF communication between two coprocessors on the same PCIe bus (e.g., mic0 to mic1) managed by the same CPU. The scif_mmap approach shows that read operations are an order of magnitude slower than write operations. Despite the asymmetry, small-message transfers still observe low-latency performance with 0.86 µs reads and 0.19 µs writes, respectively. Comparatively, the explicit RMA routines are significantly slower for transfers less than 64 bytes: the minimum payload for the DMA engine is 64 bytes, and smaller transfers are forced to use programmed input/output, which results in this performance decrease. At 64 bytes and beyond, these latencies improve to about 4.8 µs.

Figure 4-2. SCIF on Xeon Phi for inter-device near communication between two coprocessors via PCIe managed by the same CPU. A) Latency. B) Bandwidth.

n 90 and ) 1M 2M 4M to 8M µ s Furthermore, there is a latency-performance anomaly in the small-message range. Observing the bandwidth results for these message sizes shows that scif_mmap performance plateaus at 9 MB/s for reads and 41 MB/s for writes until 256 bytes. SCIF supports 256-byte- payload TLPs (transaction-layer packets) over PCIe [52], accounting for the performance jump in Figure 4-2B from 64 bytes (Xeon Phi cache-line size) to 256 bytes (PCIe TLP payload size). The scif_mmap bandwidth plateaus again after 256 bytes at up to 77 MB/s for reads and 690 MB/s for writes. Fortunately, the explicit RMA operations have symmetric read/write performance for inter-device near communication. Due to the asymmetry in scif_mmap performance, there are different cutoff sizes for read and write to recommend switching over to explicit RMA routines. Reads have lower small-message performance than writes, therefore reads will switch sooner at about 64 bytes compared to writes at about 4 KB. Bandwidth for readfrom/writeto reaches a peak of about 5.3 GB/s, well near the practical maximum throughput of PCIe 2.0 ×16 (theoretical max of 8 GB/s). These SCIF results influence our design of TSHMEM as described in Section 4.3. 4.2.4.3 Inter-device far

Figures 4-3A and 4-3B show SCIF performance results for the inter-device far configuration whereby two coprocessors are communicating through different CPU sockets. This transfer imposes the most bandwidth constriction due to the transfer moving from the local PCIe bus, over Intel QPI through the host CPU, and through the other endpoint's PCIe bus. Fortunately, small-message latency results with scif_mmap are not penalized as severely due to transfer buffering. Direct-access latencies are as low as 1.04 µs for reads and 0.23 µs for writes with 8-byte transfers. Similar to the inter-device near configuration, there are bandwidth asymptotes for the far configuration, with 8 MB/s (read) and 34 MB/s (write) performance between 8 and 64 bytes. For transfers larger than 256 bytes, bandwidth peaks at 60 MB/s (read) and 290 MB/s (write). Direct-access far writes have significantly lower bandwidth compared to the near configuration's 690 MB/s write performance.
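The two SCIF data paths compared throughout this evaluation are summarized in the hedged sketch below. It assumes an already-connected endpoint epd and a suitably aligned buffer; the sizes, offsets, and flow are illustrative only, and error handling is omitted:

#include <scif.h>
#include <string.h>
#include <stdint.h>

#define REGION_SIZE (2 * 1024 * 1024)   /* one 2 MB huge page */

void scif_paths_example(scif_epd_t epd, void *local_buf, off_t remote_off)
{
    /* Register the local buffer so the DMA engine can address it. */
    off_t local_off = scif_register(epd, local_buf, REGION_SIZE, 0,
                                    SCIF_PROT_READ | SCIF_PROT_WRITE, 0);

    /* Large transfer: explicit RMA write between registered regions (DMA). */
    scif_writeto(epd, local_off, REGION_SIZE, remote_off, 0);

    /* Order and complete the outstanding DMA before reusing the buffer. */
    int mark;
    scif_fence_mark(epd, SCIF_FENCE_INIT_SELF, &mark);
    scif_fence_wait(epd, mark);

    /* Small transfer: map the remote registered region and use plain stores. */
    uint64_t *remote = scif_mmap(NULL, REGION_SIZE,
                                 SCIF_PROT_READ | SCIF_PROT_WRITE,
                                 0, epd, remote_off);
    remote[0] = 42;                      /* direct load/store access over PCIe */
    memcpy(remote + 1, local_buf, 64);   /* small memcpy-based transfer */
    scif_munmap(remote, REGION_SIZE);
}

The registered path pays an up-front registration cost but exposes the DMA engine for bandwidth, while the mapped path gives the lowest small-message latency at the cost of using the cores for the copy.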

Unfortunately, the explicit RMA operations do not have symmetric read/write performance, limiting our ability to mitigate the asymmetric direct-access behavior. Most importantly, the write performance is severely constricted regardless of the communication method. Unlike the previous near configuration, all of the far write methods peak at about 290 MB/s. In contrast, far reads with the explicit RMA operations perform better, reaching a peak of about 1.4 GB/s for large transfers. While writes have the best low-latency performance with small messages, they have the worst large-message bandwidth. The poor large-message write performance with SCIF between far devices is a known limitation on Xeon Phi [52] and can be partially mitigated with a proxy daemon on receiving devices that converts write operations into reads through the OFED mcm DAPL provider [53]. The OFED DAPL team uses this inter-node approach for InfiniBand transfers from Xeon Phi, but unfortunately it does not noticeably affect intra-node performance. MVAPICH2-MIC also leverages a proxy-based solution [38].

Figure 4-3. SCIF on Xeon Phi for inter-device far communication between two coprocessors, each managed by a different, adjacent CPU. A) Latency. B) Bandwidth.

4.2.4.4 Performance highlights

The highest-performing SCIF results from Section 4.2.4 are summarized in Figure 4-4. For read and write, three transfer types are shown: intra-device, inter-device near, and inter-device far. Each transfer type is annotated with two latency results for small messages and one bandwidth result for large messages. The latency results include the faster memcpy operation on a local or remote scif_mmapped region and the slower SCIF explicit RMA operation (scif_readfrom or scif_writeto) using DMA.

Figure 4-4. System diagram with SCIF read/write small-message latencies and large-message effective bandwidths. Intra-device: 0.03 µs (mmap) and 3.8–3.9 µs (RMA) latency, 11 GB/s bandwidth. Inter-device near: 0.19 µs write and 0.86 µs read (mmap) and 4.8 µs (RMA) latency, 5.2–5.3 GB/s bandwidth. Inter-device far: 0.23 µs write and 1.04 µs read (mmap) and 4.8–4.9 µs (RMA) latency, 1.4 GB/s read and 0.29 GB/s write bandwidth.

There are several conclusions to draw from these numbers. For small transfers, the best low-latency performance is achieved with direct load/store operations using memcpy on a local or remote scif_mmapped region. When communicating between coprocessors, direct reads are slower than writes, thereby influencing the cutoff size for switching to explicit RMA operations. Read operations as small as 64 bytes would benefit from switching to DMA, which also allows offloading the request from the Xeon Phi cores to the DMA engine. In contrast, a larger range of small write messages (up to around 1K to 4K bytes) remains viable with direct access if the latencies from Figure 4-4 are acceptable. An application or library can therefore simplify its design by only leveraging the SCIF RMA operations for transfers of 64 bytes or larger. General parallel-programming libraries such as MPI or OpenSHMEM, however, are expected to handle transfers of all sizes and may wish to adopt both methods for optimal performance between small and large messages.

In terms of performance pitfalls and deficiencies, the most significant is the bandwidth for large-message far writes. With only 290 MB/s write performance between CPU sockets, alternative solutions are necessary, such as converting these writes into read operations. Performance of DMA transfers is also dependent on proper memory alignment. Unaligned memory may be heavily penalized such that it is preferable to use memcpy via scif_mmap, regardless of message size. Finally, the virtual-address-based RMA operations scif_vreadfrom/scif_vwriteto should be avoided when possible (or only used for transfers with short-lifetime temporary buffers) due to the extra overhead incurred from temporary SCIF registration. 4.3 Design Overview of TSHMEM

TSHMEM is an OpenSHMEM-compliant library with the objective of supporting high- performance communication and experimental research on many-core devices. Historically, we have provided TSHMEM support for the Tilera TILE-Gx and TILEPro families of 2D-mesh many-core processors [41, 29], and now expand our support to Intel Xeon Phi coprocessors. TSHMEM supports the OpenSHMEM v1.2 specification [8] and implements all functionality required by SHMEM applications, including one-sided put/get, barrier synchronization, data collectives, and remote atomic memory operations. The subsections below detail our design of TSHMEM for Xeon Phi. Figure 4-5 illustrates the TSHMEM submodules that form the basis of our infrastructure. The OpenSHMEM API primarily consists of type-differentiated communication routines (e.g., put operations for int, float, double, etc.) and has great synergy with the C++ template system. As such, we leverage C++11 and its features such as templates, lambdas, and automatic type deduction in our TSHMEM design for a higher quality and better performing library. Performance analysis is found in Section 4.4 with TSHMEM alongside several MPI implementations that all support MPI-3 RMA operations.


(Figure 4-5 shows the OpenSHMEM API layered over the TSHMEM submodules—put/get, barrier and synchronization, data collectives, atomics, and extensions—which build on intra-device memory management (symmetric heap manager, symmetric address translator, Linux virtual-memory manager) and the SCIF network over PCIe (TSHMEM connection manager, RMA management, remote memory map, registered address spaces), down to the Intel Xeon Phi 5110P (k1om, Knights Corner) device primitives.)

Figure 4-5. TSHMEM design architecture for Xeon Phi.

4.3.1 Environment Setup and Initialization

SHMEM (OpenSHMEM) is a SPMD communication library with programming- environment characteristics similar to MPI. SHMEM processes, or processing elements (PEs), are created and initialized during application runtime with the shmem_init routine. After initialization, PEs communicate with each other primarily through the SHMEM API. When an application has completed using SHMEM, library cleanup and communications teardown can occur through shmem_finalize. In TSHMEM, applications are started with shmemrun similar to MPI’s mpirun. During shmem_init, TSHMEM forks the correct number of processes for the current device (if multiple devices are participating in the execution), sets their CPU affinity, initializes the symmetric partitions, and establishes SCIF connections for remote data transfers. 4.3.1.1 Symmetric PGAS partitions

SHMEM has a symmetric PGAS memory model whereby all symmetric partitions store identically structured objects (size, type, offset). This memory model enables the ability for SHMEM data transfers to calculate remote pointer addresses using object information from the local symmetric partition without the need to obtain remote metadata. TSHMEM implements the symmetric partition with fixed, virtual memory-mapped addresses to huge pages. Each PE would normally need to store the virtual base address of each other PE’s symmetric partition in order to calculate a remote object’s location from

82 the local object’s offset. Instead, these fixed addresses reduce library memory overhead by replacing the base-pointer array with a calculation function for base-address pointers. The same calculation function is used by scif_mmap to map the remote memory on Xeon Phi to these fixed addresses. Additionally, fixed-address calculation enables the compiler to place the relevant code into hot memory so it has a lower probability of being evicted from cache. Shared memory is allocated from 2 MB huge pages instead of the standard 4 KB page size. Use of huge pages improves DMA performance and lessens the pressure on the translation lookaside buffers (TLB) attached to the Xeon Phi’s ring interconnect. SHMEM includes support for data transfers with dynamically allocated memory and globally allocated static memory. Dynamically allocated memory can be obtained with shmem_malloc. Due to the importance of memory alignment for SCIF DMA performance, TSHMEM will align all allocation requests of sufficient size. In contrast, global memory is statically allocated into the program executable’s data segment at link time. This memory is treated by SHMEM as static symmetric memory because the virtual addresses of objects in the data segment will be identical when parallel processes are replicated from the same executable. TSHMEM handles symmetric-partition management of both dynamic and static memory through memory maps with scif_mmap, enabling normal load/store operations into a remote coprocessor’s memory. 4.3.1.2 SCIF network manager

TSHMEM abstracts network connections and communication through network managers. For Xeon Phi, we provide SCIF as one of the network managers alongside direct access through local shared memory. Our design leverages SCIF to establish peer-to-peer connections between participating SHMEM PEs and to register the local symmetric partition for remote access. Direct memory access to remote symmetric partitions is achieved with scif_mmap, and TSHMEM optionally supports lazy initialization for remote scif_mmaped memory to delay this expensive operation until it is needed on first access. Furthermore, we support SCIF RMA operations for explicit DMA through a thin-layer interface for one-sided put/get operations.

83 Other SHMEM operations implicitly benefit from SCIF RMA through use of SHMEM put/get routines. 4.3.2 Put/Get

OpenSHMEM specifies point-to-point, one-sided data transfers consisting of elemental, bulk, and strided put/get operations. Elemental put/get functions operate on single-element symmetric objects (e.g., int, long, float) whereas bulk functions operate on a contiguous array of objects. Strided operations allow the transfer of data with strides between consecutive elements in the source and/or target arrays. In the v1.2 specification, put operations will return from the function once the data transfer is in flight and the local buffer is available for reuse by the calling PE. Get operations, in contrast, will block and not return until the requested memory is visible to the local PE. In TSHMEM, we leverage scif_mmap to enable direct load/store operations such as memcpy on remote symmetric partitions. As seen from the SCIF performance results in Section 4.2.4, memcpy works well for small transfers, resulting in low-latency performance that is unmatched by the explicit SCIF RMA operations. For larger transfers, we switch over to these explicit RMA operations to obtain higher bandwidth. Notably, TSHMEM also uses these explicit RMA operations even for local PEs when it detects a sufficiently large-message transfer. The message-size thresholds to switch between these two communication methods were empirically determined based on TSHMEM performance profiling and the SCIF results from Section 4.2.4. These thresholds depend on a number of factors: the type of operation (read, write), if the sending memory is aligned properly, if the receiving memory is aligned properly, and the size of the transfer. For large transfers, the time it takes to determine whether or not the explicit RMA operations should be used is insignificant compared to the actual transfer time. For small transfers, however, this determination time can be up to 4× longer than the transfer would have taken. TSHMEM includes a small-message fast path that mitigates this problem by shortcutting through the threshold-calculation conditions when below a minimum message size (possibly different sizes for read and write).
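The following sketch illustrates the kind of size-based switch described above. The fixed partition layout, threshold value, and helper names are hypothetical assumptions for illustration and are not TSHMEM's actual internals:

#include <scif.h>
#include <string.h>
#include <stdint.h>

#define PARTITION_BASE   ((uintptr_t)0x600000000000ULL) /* assumed fixed mapping address */
#define PARTITION_STRIDE ((uintptr_t)(1ULL << 32))      /* assumed per-PE spacing */
#define RMA_THRESHOLD    (128 * 1024)                    /* assumed cutoff size */

/* Remote symmetric address from a local symmetric address: same offset within
 * the partition, different fixed per-PE base, so no base-pointer table is needed. */
static inline void *remote_sym_addr(const void *local, int pe, int my_pe)
{
    uintptr_t offset = (uintptr_t)local - (PARTITION_BASE + my_pe * PARTITION_STRIDE);
    return (void *)(PARTITION_BASE + pe * PARTITION_STRIDE + offset);
}

void example_put(void *dest, const void *src, size_t nbytes, int pe, int my_pe,
                 scif_epd_t epd, off_t src_reg_off, off_t dest_reg_off)
{
    if (nbytes < RMA_THRESHOLD) {
        /* Small transfer: direct store into the scif_mmapped remote partition. */
        memcpy(remote_sym_addr(dest, pe, my_pe), src, nbytes);
    } else {
        /* Large transfer: explicit RMA so the DMA engine moves the data;
         * completion is deferred to fence/quiet. */
        scif_writeto(epd, src_reg_off, nbytes, dest_reg_off, 0);
    }
}

In practice the cutoff would also depend on the operation direction and alignment, as discussed above, with a fast path that skips these checks entirely below a minimum message size.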

84 4.3.3 Synchronization

The OpenSHMEM specification provides several categories of synchronization: barrier sync; communication sync with fence/quiet; and point-to-point sync (waiting until a vari- able’s value has changed). TSHMEM includes these functions to provide computation and communication synchronization for SHMEM processes. 4.3.3.1 Barrier

Barrier synchronization is provided by two routines: shmem_barrier_all, which blocks forward processing until all PEs reach the barrier; and shmem_barrier, which invokes a barrier on a subset of PEs defined by an active-set triplet of which PE to start at, the stride between consecutive PEs, and the number of PEs participating in the barrier. TSHMEM’s barrier design uses a tree-based propagation algorithm on a distributed data structure of state variables to track barrier status for each PE in the environment. PEs participating in a barrier determine which children PEs they have to wait for and wait until those children have reached the barrier. Then the PE updates its barrier state to a hash value of the active-set triplet to prevent parent PEs participating in overlapping active sets from exiting the wrong barrier at an incorrect time. This tree algorithm propagates up to a root PE, which is defined as the start PE in the active set. The root PE then releases all its children PEs through a shmem_put operation while the children wait for the release state with a point-to-point synchronization operation such as shmem_wait. This algorithm has shown itself to offer low-latency performance and scale well for our research system. 4.3.3.2 Fence/quiet

Several SHMEM operations do not wait for completion before returning to the calling PE. These include put operations, atomic memory operations, and memory stores to symmetric objects. SHMEM provides shmem_fence and shmem_quiet for ordering and completion. For example, multiple put operations may arrive out of order to a destination PE. The shmem_fence routine provides put ordering to individual PEs by guaranteeing that puts that have started before shmem_fence will arrive before subsequent puts after it. However,

85 shmem_fence only ensures ordering, not completion. For completion, shmem_quiet is used to block execution until all outstanding puts to all PEs are completed. In TSHMEM, the hardware handles coherency and completion for direct load/store operations on remote-memory regions. We only have to concern ourselves with the ordering and completion of explicit SCIF RMA operations. Whenever TSHMEM initiates an explicit RMA operation, we cache the target’s SCIF endpoint descriptor. If shmem_fence is called, TSHMEM iterates over all active endpoints in the cache and marks their outstanding DMA operations with scif_fence_mark. RMA operations that start after shmem_fence will be properly ordered after the marked set. For shmem_quiet, TSHMEM will iterate over the active endpoints in the cache and mark their DMA operations, then each marked set is passed to scif_fence_wait to wait until those marked DMA operations are fully completed. Once all outstanding DMA operations in the system have been delivered, the endpoints cache is cleared. A straightforward fence/quiet implementation may attempt scif_fence_mark for all PEs (not just active ones), but this behavior incurs a large amount of overhead by unnecessarily synchronizing on endpoints that do not have active communication with the current PE. This performance penalty is unacceptable which is why our design leverages an endpoints cache for active communication. Note that for small transfers, TSHMEM avoid this overhead entirely by using direct load/stores in lieu of explicit RMA, allowing these transfers to fully benefit from their low-latency characteristics. 4.3.4 Other SHMEM Routines

The put/get, barrier, and fence/quiet routines can be used to build higher-level SHMEM operations such as data collectives. TSHMEM provides a variety of collective algorithms, such as linear and binary tree, for SHMEM broadcast, collection, and reduction. These collectives are a focus for future work in exploring hardware-aware and system-aware algorithms in addition to further optimizations to TSHMEM.
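As a hedged illustration of how such collectives can be composed from the primitives above, the sketch below shows a linear broadcast built only from puts and a barrier; it is not TSHMEM's actual broadcast implementation:

#include <shmem.h>

void linear_broadcast64(void *target, const void *source, size_t nelems, int root)
{
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    if (me == root) {
        /* Root pushes the payload into every other PE's symmetric target buffer. */
        for (int pe = 0; pe < npes; pe++) {
            if (pe != root)
                shmem_put64(target, source, nelems, pe);
        }
    }
    /* Ensure delivery and completion before any PE reads the target buffer. */
    shmem_barrier_all();
}

A tree-based variant would replace the root's loop with propagation through children, trading per-root fan-out for logarithmic depth.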

86 4.4 Performance Evaluation

In studying the performance behaviors between multiple Xeon Phi coprocessors, we leverage several MPI implementations that all support MPI-3 RMA operations: MPICH [20], MVAPICH2-MIC [38], and Intel MPI [39]. The results in this section are intended to provide comparative analysis with our TSHMEM design and between the selected MPI implementa- tions. 4.4.1 Setup of MPI Runtime Environments

We have previously described our hardware and software setup in Section 4.2.1. This subsection provides additional details for the setup and execution of each chosen MPI implementation. 4.4.1.1 MPICH

MPICH is a high-performance, portable MPI implementation primarily used to provide a quality reference implementation for MPI-library developers and to test experimental MPI features. The communication in MPICH is mostly handled through Nemesis, an internal communications channel that supports multiple network types. These Nemesis networks are abstracted and implemented as network modules (netmod). For our performance results, we evaluated MPICH version 3.1.4 with two Nemesis net- mods: SCIF and TCP. The TCP results are provided as a baseline for inter-coprocessor perfor- mance. MPICH was cross-compiled from source with --with-device=ch3:nemesis:scif,tcp. Runtime selection between netmods is controlled by the MPIR_CVAR_NEMESIS_NETMOD environ- ment variable. As a portable reference implementation, the code and infrastructure of MPICH is leveraged in numerous other MPI libraries, including MVAPICH2 and Intel MPI. 4.4.1.2 MVAPICH2-MIC

Based on MPICH and MVAPICH2, MVAPICH2-MIC (MV2-MIC) is optimized for InfiniBand-based Xeon Phi clusters with support for SCIF. For our performance results, we evaluated MVAPICH2-MIC version 2.0. Due to dependence on InfiniBand, intra-node execution is not possible without the presence of an InfiniBand HCA (Host Channel Adapter) with an

87 active port. As such, our research system was outfitted with a Mellanox MHQH29C-XTR ConnectX-2 dual-port card in physical loopback for the purposes of this evaluation. The HCA card is attached to the same PCIe bus as mic2 and mic3. 4.4.1.3 Intel MPI

Scalable support for MPI on Xeon Phi clusters is provided by Intel MPI (IMPI). For our performance results, we evaluated Intel MPI version 5.1.0.079. IMPI handles network communication via OFED fabric providers, with support for several fabrics including shared memory, DAPL, and TCP. For Xeon Phi, SCIF is implemented as one of the available DAPL providers by emulating an InfiniBand HCA and allowing OFED-based libraries such as IMPI to transparently leverage SCIF for optimal intra-node communication between devices. During evaluation, we observed that IMPI suffered from severe performance deficiencies in its default configuration. With assistance from Intel Support, all of our performance results include the following environment variables:

• I_MPI_FABRICS=shm:dapl

• I_MPI_DAPL_PROVIDER=ofa-v2-scif0

• I_MPI_ADJUST_BARRIER=4 # topology-aware recursive doubling

• I_MPI_SCALABLE_OPTIMIZATION=0 # disable The last two variables each represent different issues. The I_MPI_ADJUST_BARRIER=4 variable controls which barrier algorithm is used during runtime, with support for three different algorithms in standard and topology-aware variants. The default algorithm for IMPI barriers is standard recursive doubling (BARRIER=2), however it performed more than 10–15× worse for inter-coprocessor barriers compared to BARRIER=4. Selecting BARRIER=4 for topology-aware recursive doubling yields significantly lower latencies [54]. In its default configuration, we discovered that IMPI exhibited severe performance deficiencies when executing multiple, concurrent put/get operations with one or more copro- cessors [55]. The cause was determined to be IMPI’s use of scalable optimizations with 16 or

more processes, resulting in execution times that were several orders of magnitude higher (seconds to hours!). These scalable optimizations were therefore disabled before running our microbenchmarks and applications. 4.4.2 Put/Get

For our put/get and barrier results, OpenSHMEM and MPI microbenchmarks were obtained with the OSU micro-benchmarks suite. MPI provides several routines to create memory windows for one-sided operations. We evaluate two routines: the older MPI_Win_create, which allows users to create a window on an arbitrarily pre-allocated region of memory, and the newer MPI_Win_allocate, which allows users to create a window on a newly allocated memory region based on a requested size. The benefit of allocate is to allow an MPI implementation to potentially allocate aligned memory and enable more scalable optimizations for symmetric allocations across MPI processes [5]. As such, a user can expect RMA operations on a memory region from MPI_Win_allocate to have higher performance. 4.4.2.1 Intra-device

Figure 4-6 shows intra-device put and get results. Performance is symmetric between puts and gets on a single Xeon Phi coprocessor.

Figure 4-6. One-sided put/get latencies within a single Xeon Phi coprocessor. A) Intra-Device Put. B) Intra-Device Get. (Latency versus message size for TSHMEM and for MPICH, MVAPICH2-MIC, and Intel MPI with both MPI_Win_allocate and MPI_Win_create.)

90 o ml esgs u efrac (0.28 performance put 27 messages, than small more For exhibiting sizes, message all 4-7B . at and coprocessors 4-7A near Figures in shown are bus PCIe same the on coprocessors both with results near Inter-device 4.4.2.2 Near node. system a in coprocessors two between latencies put/get One-sided 4-7. Figure Latency (µs) Latency (µs) A C 10 10 iia oteitadvc ae SMMhstehgetpromnefrinter-device for performance highest the has TSHMEM case, intra-device the to Similar Performance coprocessors. two between results get and put inter-device shows 4-7 Figure 1 1 , , , , 000 000 000 000 100 100 0 0 10 10 . . 1 1 1 1 1 1

ItrDvc a Get. Far Inter-Device Inter-Device D) B) Put. Put. are Far Near coprocessors Inter-Device Far Inter-Device C) A) CPU. Get. same CPUs. Near the adjacent by different, managed by and managed to attached are coprocessors 2 2 4 4 8 8 16 16

esg ie(bytes) Size Message 32 (bytes) Size Message 32 64 64 128 128 256 256 512 512 1K 1K

MVAPICH2-MIC 2K 2K 4K 4K TSHMEM 8K 8K 16K 16K 32K 32K 64K 64K 128K 128K 256K 256K 512K 512K 1M 1M 2M 2M 4M 4M 8M 8M ne P allocate MPI Intel PC (SCIF) MPICH µ )i etrta e efrac (0.84 performance get than better is s) 91 Latency (µs) Latency (µs) B D 10 10 1 1 , , , , 100 000 000 100 000 000 0 0 10 10 . . 1 1 1 1 1 1 2 2 4 4 8 8 16 16 ne P create MPI Intel esg ie(bytes) Size Message esg ie(bytes) Size Message 32 32 PC (TCP) MPICH × 64 64 128 128 atrsalmsaetransfers. small-message faster 256 256 512 512 1K 1K 2K 2K 4K 4K 8K 8K 16K 16K 32K 32K 64K 64K 128K 128K 256K 256K 512K 512K 1M 1M 2M 2M 4M 4M µ 8M 8M s), reflecting the same trends in SCIF inter-device near performance from Section 4.2.4.2 with very little overhead. Latencies for the MPI libraries are 10 µs (MV2-MIC), 24 µs (MPICH-SCIF), 33 µs (IMPI), and 450 µs (MPICH-TCP). For large messages, TSHMEM bandwidth reaches more than 5.2 GB/s for both puts and gets after switching to the SCIF RMA operations. IMPI with create exhibits 4.8 GB/s bandwidth is the only MPI library to approach the large-message performance of TSHMEM. For the MPI libraries, we tested MPI_Win_create and MPI_Win_allocate. Performance between the two methods were approximately identical, except with Intel MPI. Therefore, Figure 4-7 only shows results with both methods for Intel MPI. Curiously, IMPI allocate is lower performing than create, with more than 10× worse performance for large messages. Memory allocation in the microbenchmark is only performed once before the benchmark’s timing loop. IMPI performance is fairly constant for small messages, until 4 KB where the performance difference between the two methods starts to appear and favors create. This performance discrepancy possibly indicates the need for optimizations to MPI_Win_allocate if it is expected to perform as well or better than MPI_Win_create. MV2-MIC has the fastest small-message performance with about 10 µs latency among the MPI libraries evaluated. For larger messages, however, IMPI create offers the best MPI performance at about 4.8 GB/s whereas MV2-MIC degrades to about 1.5 GB/s. For MPICH, we evaluated two netmods: SCIF and TCP, with TCP acting as a baseline. Among the SCIF-based implementations, TSHMEM is the highest performing whereas MPICH-SCIF is the lowest performing with 0.73 GB/s bandwidth for large message sizes. The MPICH-SCIF implementation registers and memory-maps a 64 KB transfer buffer, resulting in better initialization time due to reduced memory registration and mapping times relative to TSHMEM and MV2-MIC, however MPICH-SCIF exhibits the lowest overall performance for SCIF-based libraries as a result. The trending behavior with IMPI is similar to MPICH-SCIF, indicating that IMPI is using a similar implementation for its SCIF-based DAPL provider. Between the two

Between the two MPI libraries, IMPI create takes advantage of SCIF much more effectively than MPICH-SCIF, but IMPI allocate only exhibits about 0.52 GB/s for large messages.
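To make the two window-creation paths concrete, the following sketch outlines how a put microbenchmark of this kind can set up and exercise MPI-3 RMA windows. It is an illustrative reduction, not the source of our benchmark; the buffer size, rank roles, and synchronization choices are assumptions for clarity.

#include <mpi.h>
#include <stdlib.h>

#define BUF_SIZE (1 << 20)              /* placeholder message size */

int main(int argc, char **argv)
{
    int rank, nranks;
    char *buf_create;                   /* application-allocated memory */
    void *buf_alloc;                    /* MPI-allocated memory */
    MPI_Win win_create, win_alloc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* MPI_Win_create exposes memory that the application allocated. */
    buf_create = malloc(BUF_SIZE);
    MPI_Win_create(buf_create, BUF_SIZE, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win_create);

    /* MPI_Win_allocate lets the MPI library allocate the exposed memory,
     * which can permit a more efficient internal implementation. */
    MPI_Win_allocate(BUF_SIZE, 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &buf_alloc, &win_alloc);

    /* Passive-target put from rank 0 to rank 1; the real benchmark wraps
     * this region in a timing loop over varying message sizes. */
    if (rank == 0 && nranks > 1) {
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win_create);
        MPI_Put(buf_create, BUF_SIZE, MPI_BYTE, 1, 0, BUF_SIZE, MPI_BYTE,
                win_create);
        MPI_Win_unlock(1, win_create);  /* completes the transfer */
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_free(&win_create);
    MPI_Win_free(&win_alloc);
    free(buf_create);
    MPI_Finalize();
    return 0;
}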

4.4.2.3 Inter-device far

Performance results are shown in Figures 4-7C and 4-7D for inter-device far transfers between two coprocessors on different PCIe buses managed by different, adjacent CPUs. With SCIF, these transfers move between the adjacent CPUs via QPI. For small-message put transfers, TSHMEM has the highest performance with latencies as low as 0.28 µs, exhibiting little overhead over the SCIF performance results. For large-message put operations, however, performance degrades due to the underlying SCIF behavior, reaching around 290 MB/s of bandwidth. MV2-MIC uses similar SCIF-based communication and also exhibits this performance drop, as seen in Figure 4-7C. This bandwidth reduction is not seen with IMPI create (1.3 GB/s) and, to a lesser extent, MPICH-SCIF (0.60 GB/s). Far put performance is a focus of our future work with TSHMEM, in which we will experiment with and evaluate different approaches to alleviate this degradation. In contrast, far get operations with TSHMEM remain the highest performing at all message sizes, with small-message latencies as low as 1.04 µs and large-message bandwidth up to 1.5 GB/s, adding little to no overhead over the SCIF microbenchmark results. For the other libraries, large-message bandwidths are 1.3 GB/s (IMPI create), 1.2 GB/s (MV2-MIC), 0.59 GB/s (MPICH-SCIF), and 0.14 GB/s (MPICH-TCP).

4.4.3 Barrier

Barriers are computation-blocking operations that wait until a group of PEs has arrived before resuming normal program flow, potentially limiting application scalability as the number of PEs participating in a barrier increases. Some barrier operations, such as those in OpenSHMEM, will also quiet all ongoing communication. The OpenSHMEM barrier microbenchmark minimizes communication so that the performance impact of shmem_quiet is minimized, roughly equating the semantics of these barriers to that of MPI_Barrier, which only blocks computation.
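A minimal sketch of such a barrier-latency loop in OpenSHMEM is shown below; the iteration count and wall-clock timing mechanism are placeholders, not the exact harness used in our evaluation.

#include <shmem.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERATIONS 10000        /* placeholder iteration count */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    shmem_init();               /* start_pes(0) in older OpenSHMEM versions */
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Warm up and synchronize all PEs before timing. */
    shmem_barrier_all();

    double start = now_sec();
    for (int i = 0; i < ITERATIONS; i++)
        shmem_barrier_all();    /* no outstanding puts, so the quiet cost is minimal */
    double elapsed = now_sec() - start;

    if (me == 0)
        printf("%d PEs: %.3f us per barrier\n",
               npes, 1e6 * elapsed / ITERATIONS);

    shmem_finalize();
    return 0;
}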

Figure 4-8. Barrier latencies on several Xeon Phi coprocessors (latency in µs versus number of coprocessors). A) Minimum PEs. B) 60 PEs. C) Maximum PEs.

Figure 4-8 shows barrier latencies with our evaluated libraries for three different sets of configurations: A) the minimum number of PEs on one to four coprocessors; B) a constant 60 PEs spread across one to four coprocessors; and C) 60 PEs on each Xeon Phi 5110P 60-core coprocessor. Borrowing some terminology from MPICH’s mpirun process launcher, each x-axis label is annotated with n (total number of PEs) or ppn (PEs per node, where a node is a coprocessor). In MPI, each “node” is a logical compute device capable of spawning processes. Typically, an MPI node equates to a physical cluster node, but the Xeon Phi coprocessors can also be treated as logical nodes. Therefore, for our purposes, an MPI node (in ppn) represents one of the four coprocessors within our physical system. As an example of this annotation, Figure 4-8B uses ppn because n is always 60 for that configuration. TSHMEM exhibits the fastest barriers, with an order of magnitude better performance than the MPI implementations. This performance is due to a low-latency design optimized for intra-node communication. For the MPICH netmods, MPICH-SCIF is faster than MPICH-TCP when there are fewer PEs per coprocessor (e.g., less than ppn=30). As the coprocessors become fully utilized (Figure 4-8C), the barrier performance with MPICH-TCP

is faster. Furthermore, MPICH-SCIF does not initialize for n=240 due to the number of peer-to-peer SCIF connections, so that result is omitted. For the remaining libraries, MV2-MIC has the lowest performing barriers among the SCIF-based libraries, whereas IMPI offers balanced performance for MPI_Barrier between the configurations.

4.4.4 Application Case Studies

In addition to our microbenchmarks, we showcase performance results with three OpenSHMEM applications: 2D heat equation, heat image, and distributed FFT. Our analysis focuses on scaling trends with these applications and on relative performance between the evaluated libraries that use SCIF. For the MPI libraries, we leverage OSHMPI [11], an abstraction library that provides the OpenSHMEM API through MPI-3 RMA operations. Many SHMEM operations have a straightforward mapping to MPI-3 routines due to OpenSHMEM’s more stringent memory model and functional constraints compared to MPI’s more generic capabilities. Consequently, the use of OSHMPI for MPI-3 abstraction minimally influences performance and lets us use the same application source code with all of the libraries. Each application is evaluated in two sets of configurations: A) 60 PEs per coprocessor, and B) 60 PEs spread across one to four coprocessors. Note that for one coprocessor, MPICH uses shared memory instead of SCIF or TCP for intra-coprocessor communication. The MPICH-SCIF and MPICH-TCP results for one coprocessor therefore represent two data points for MPICH with shared memory, so the performance difference between the two will be minimal.
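As a conceptual illustration of this mapping (and not a reproduction of OSHMPI’s source), a blocking SHMEM put onto the symmetric heap can be expressed with MPI-3 passive-target RMA roughly as follows; the window and heap-base variables are assumed to have been established over the symmetric heap at initialization, with MPI_Win_lock_all opening a long-lived access epoch.

#include <mpi.h>
#include <stddef.h>

/* Assumed to be set up during initialization: an MPI-3 window exposing the
 * symmetric heap (with MPI_Win_lock_all already called) and the local base
 * address of that heap. */
extern MPI_Win  symm_heap_win;
extern void    *symm_heap_base;

/* Illustrative OpenSHMEM-over-MPI-3 equivalents. */

void example_shmem_putmem(void *dest, const void *src, size_t nbytes, int pe)
{
    /* Symmetric addressing: the same heap offset is valid on every PE. */
    MPI_Aint offset = (MPI_Aint)((char *)dest - (char *)symm_heap_base);

    MPI_Put(src, (int)nbytes, MPI_BYTE, pe, offset,
            (int)nbytes, MPI_BYTE, symm_heap_win);
    MPI_Win_flush(pe, symm_heap_win);   /* wait for remote completion */
}

void example_shmem_quiet(void)
{
    /* Complete all outstanding one-sided operations on the window. */
    MPI_Win_flush_all(symm_heap_win);
}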

4.4.4.1 2D heat equation

This application is an iterative heat-equation solver for heat distribution in a rectangular (2D) domain via conduction. The provided application supports three iteration methods: Jacobi, Gauss–Seidel, and successive over-relaxation. We benchmark our libraries with the Jacobi method on a 2160×2160 rectangular domain, with 2160 chosen as a multiple of 240 (the maximum number of PEs across four coprocessors) such that the domain space is evenly divisible amongst the PEs. Application communication consists of a linear number of put operations, broadcasts, reductions, and barriers.

Figure 4-9. Execution times for 2D heat equation (2160×2160, Jacobi method). A) Maximum PEs. B) 60 PEs.

Execution times are shown in Figure 4-9. TSHMEM exhibits the fastest execution times and highest performance for most of these configurations. Except for MPICH-SCIF, all of the libraries exhibit scalability. Although MPICH-SCIF performed adequately in microbenchmarks, its application performance reveals deficiencies when communicating with multiple coprocessors. As mentioned previously, MPICH-SCIF also fails to run for 240 PEs. Between TSHMEM, MV2-MIC, and IMPI, IMPI is the lowest performing library. With fully utilized coprocessors, MPICH-TCP performs surprisingly well, exhibiting lower execution times than MV2-MIC and IMPI. However, MPICH-TCP performs worse relative to TSHMEM, MV2-MIC, and IMPI when the number of PEs is held constant. This result indicates that either SCIF does not scale well with an increasing number of PEs on a coprocessor, or MPICH-TCP scales well for this particular application’s communication pattern. Subsequent applications show that this result is most likely due to the application’s communication pattern.
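To make this communication pattern concrete, the sketch below shows the halo exchange and convergence reduction for one Jacobi iteration in OpenSHMEM. The row-block decomposition, array names, and block size are illustrative assumptions rather than the application’s actual source.

#include <shmem.h>

#define NCOLS      2160      /* domain width used in the benchmark */
#define LOCAL_ROWS 9         /* e.g., 2160 rows / 240 PEs; placeholder */

/* Symmetric grid (allocated with shmem_malloc) holding LOCAL_ROWS interior
 * rows plus one ghost row above (row 0) and one below (row LOCAL_ROWS+1). */
static double *grid;

/* Reduction scratch space; pSync must be filled with SHMEM_SYNC_VALUE
 * before first use. */
static long   pSync[SHMEM_REDUCE_SYNC_SIZE];
static double pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static double local_err, global_err;     /* local_err set by the compute step */

static void jacobi_exchange_and_reduce(void)
{
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Send my first interior row to the bottom ghost row of the PE above. */
    if (me > 0)
        shmem_double_put(&grid[(LOCAL_ROWS + 1) * NCOLS],
                         &grid[1 * NCOLS], NCOLS, me - 1);

    /* Send my last interior row to the top ghost row of the PE below. */
    if (me < npes - 1)
        shmem_double_put(&grid[0 * NCOLS],
                         &grid[LOCAL_ROWS * NCOLS], NCOLS, me + 1);

    /* Ensure all puts are complete and visible before the next iteration. */
    shmem_barrier_all();

    /* Convergence check: sum the per-PE residuals onto every PE. */
    shmem_double_sum_to_all(&global_err, &local_err, 1,
                            0, 0, npes, pWrk, pSync);
}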

Figure 4-10. Execution times for heat image (8640×8640 with 5000 iterations). A) Maximum PEs. B) 60 PEs.

4.4.4.2 Heat image

This application solves a heat-conduction modeling problem and generates an output image. Each PE is assigned a block of rows and assists in performing iterative heat-conduction computation. Communication consists of a linear number of put and barrier operations based on the number of iterations in the modeling problem. Among the three applications, heat image has the least communication.

Execution times are shown in Figure 4-10. Similar to the previous application, MPICH-SCIF is the only library that does not exhibit scalability. This behavior is a recurring trend with the SCIF netmod in MPICH. Among the remaining libraries, TSHMEM exhibits the highest performance for all configurations, followed by MV2-MIC and IMPI. With fully utilized coprocessors, MPICH-TCP performance is competitive with IMPI. For 60 PEs, TSHMEM, MV2-MIC, IMPI, and MPICH-TCP each deliver higher performance with fewer PEs per coprocessor, despite a constant PE count between configurations. This result is due to the nearest-neighbor communication pattern for this application, which reduces the amount of inter-coprocessor communication to only a couple of PEs per coprocessor.

Figure 4-11. Execution times for distributed FFTW (10800 FFT operations on 10800-length complex-float arrays). A) Maximum PEs. B) 60 PEs.

4.4.4.3 Distributed FFT

The final application involves the process-based parallelization of a popular FFT library, FFTW [30]. The application performs a distributed, one-dimensional, discrete Fourier transform (DFT) using the FFTW library, with data setup and inter-process communication handled via SHMEM, using its fast one-sided put operations to exchange data. Among the three applications, distributed FFT has the most communication.

Execution times are shown in Figure 4-11. Due to the large amount of communication, this application forces performance issues to the forefront. MPICH-TCP and IMPI are the lowest performing, demonstrating execution times 100–1000× worse than TSHMEM, MV2-MIC, and MPICH-SCIF for fully utilized coprocessors. Note that for single-coprocessor executions, MPICH will use shared memory instead of SCIF or TCP. Among the SCIF-based libraries, IMPI appears as an anomaly. The large amount of communication penalizes IMPI more than the other libraries, possibly indicating that further IMPI optimization or tuning is required in addition to the environment variables we already use from Section 4.4.1.3.
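A stripped-down sketch of this pattern is shown below: each PE transforms its share of the arrays with FFTW and then uses a one-sided put to hand its results to a neighboring PE, standing in for the application’s put-based data exchange. The block distribution, plan reuse, and ring-style exchange are simplifying assumptions rather than the application’s exact structure.

#include <shmem.h>
#include <fftw3.h>

#define NUM_FFTS 10800   /* total 1D transforms in the benchmark */
#define FFT_LEN  10800   /* length of each complex-float array */

int main(void)
{
    shmem_init();
    int me      = shmem_my_pe();
    int npes    = shmem_n_pes();
    int my_ffts = NUM_FFTS / npes;              /* assumes even divisibility */
    size_t blk  = (size_t)my_ffts * FFT_LEN * sizeof(fftwf_complex);

    /* Symmetric buffers: every PE can be the target of one-sided puts. */
    fftwf_complex *in  = shmem_malloc(blk);
    fftwf_complex *out = shmem_malloc(blk);

    /* One reusable single-precision plan for all local transforms. */
    fftwf_plan plan = fftwf_plan_dft_1d(FFT_LEN, in, out,
                                        FFTW_FORWARD, FFTW_ESTIMATE);

    /* ... fill `in` with this PE's share of the input data ... */

    for (int i = 0; i < my_ffts; i++)
        fftwf_execute_dft(plan, &in[(size_t)i * FFT_LEN],
                                &out[(size_t)i * FFT_LEN]);

    /* Data exchange: push local results to the next PE's input buffer,
     * standing in for the application's put-based redistribution. */
    shmem_putmem(in, out, blk, (me + 1) % npes);
    shmem_barrier_all();                        /* complete and synchronize */

    fftwf_destroy_plan(plan);
    shmem_free(in);
    shmem_free(out);
    shmem_finalize();
    return 0;
}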

TSHMEM provides the best performance among the libraries. Going from n=60 to n=120, TSHMEM scales while the other libraries do not. Moving across the QPI links with n=180, TSHMEM predictably decreases in performance due to the SCIF far-write bandwidth limitations; however, n=240 shows that performance still scales despite these limitations. These results are further emphasized in Figure 4-11B, where TSHMEM maintains performance parity with a constant number of PEs until it has to move between the CPUs over QPI.

4.5 Concluding Remarks

In exploring PGAS communication between multiple Xeon Phi many-core coprocessors, we presented extensive microbenchmarking results with SCIF, an inter-device communications library, and then leveraged insights from those results in our new design of TSHMEM for Xeon Phi featuring intra-node optimizations for communication among one or more coprocessors. We then evaluated performance for TSHMEM and several MPI implementations through microbenchmarks and application case studies to provide a comparative analysis between these libraries on multiple coprocessors. Our results enable performance-trend analysis, further optimizations to communication in each evaluated library, and critical insights into inter-device behavior for progressively higher-density systems with nodes containing multiple many-core devices. Performance of TSHMEM is derived from our experiences with SCIF and its performance profile. SCIF offers a high-performance, intra-node communication library between multiple devices, achieving demonstrably low-latency read and write performance alongside high-throughput RMA operations over DMA. With our SCIF microbenchmarking, we observed inter-coprocessor latencies as low as 0.19 µs with write operations, and throughput as high as 5.3 GB/s over the local PCIe bus. Furthermore, SCIF RMA operations can also improve large-message read/write performance for intra-coprocessor communication when compared to load/store operations through shared memory. Unfortunately, large-message write performance with SCIF hits a performance bottleneck when transferring beyond the local PCIe bus and

across CPU sockets via QPI, motivating alternative designs to mitigate this limitation. TSHMEM leverages multiple SCIF features through algorithm-switching techniques to maximize performance for small as well as large transfers with little to no overhead over the base-level SCIF performance. For our comparative analysis of intra-node performance, we leveraged microbenchmarking and application case studies in order to evaluate TSHMEM and several MPI implementations: MPICH, MVAPICH2-MIC, and Intel MPI. Results with put/get showed TSHMEM outperforming the other libraries for all message sizes, with the exception of put operations across QPI. Inter-coprocessor small-message put performance with TSHMEM was more than 27× faster than the other libraries. Barrier operations with TSHMEM exhibited orders of magnitude higher performance due to intra-node optimizations. Analysis with three different applications emphasized a variety of performance trends among the libraries evaluated. TSHMEM exhibited the highest overall performance, followed by MVAPICH2-MIC, Intel MPI, and finally MPICH.

CHAPTER 5
CONCLUSIONS

In exploring lightweight PGAS computing on many-core processors, we presented our designs of TSHMEM for the TILE-Gx in Chapter 2 and for the Xeon Phi in Chapter 4. Performance of TSHMEM on TILE-Gx is demonstrated with microbenchmarks of Tilera-library and TSHMEM functions, offering direct validation of realizable performance and any inherited overhead. Results indicate that the TSHMEM designs for dynamic symmetric-variable transfers display minimal overhead over the underlying Tilera libraries, and numerous SHMEM functions outperform those from the OpenSHMEM reference implementation and from OSHMPI atop MPICH. Additionally, the design for barrier synchronization in TSHMEM is shown to be fast relative to several available Tilera barrier primitives for both the TILE-Gx and TILEPro. In comparing the performance of TSHMEM collectives, the communication algorithms that emphasize cache locality by coalescing results onto a single tile surprisingly performed better than the algorithms that focused on linearly distributed communication.

Our latest work with TSHMEM focuses on communication performance and behavior between multiple Xeon Phi coprocessors. We provide extensive microbenchmark and application results with SCIF, TSHMEM, and several MPI implementations on our computationally dense Xeon Phi research platform. The current TSHMEM design provides all of the OpenSHMEM functionality with optimizations for intra-node-focused inter-coprocessor communication. Our extensive results and analysis with TSHMEM serve as a basis for effective PGAS design and communication on modern and emerging systems with multiple many-core devices.

Future work for TSHMEM will include design exploration and optimizations to mitigate performance degradation of SCIF transfers across QPI, hardware- and system-aware collectives, and methods for improved intra-node communication. Additionally, OpenSHMEM is an evolving standard with an active community. Further library optimizations in TSHMEM will also incorporate exploration of extensions for the OpenSHMEM standard specification. With the resurgence of interest in SHMEM and OpenSHMEM, proposed extensions such as

threading support [56] merit investigation for their impact on performance and API semantics. Finally, we plan to further explore PGAS designs with TSHMEM on emerging many-core architectures such as the second-generation Xeon Phi architecture, Knights Landing, with host processor and coprocessor form factors.

REFERENCES
[1] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface standard,” Parallel Computing, vol. 22, no. 6, pp. 789–828, 1996. doi:10.1016/0167-8191(96)00024-5
[2] L. Dagum and R. Menon, “OpenMP: An industry standard API for shared-memory programming,” Computational Science & Engineering, IEEE, vol. 5, no. 1, pp. 46–55, Jan.–Mar. 1998. doi:10.1109/99.660313
[3] Silicon Graphics International Corp., “SHMEM API for parallel programming,” Milpitas, CA, USA.
[4] R. Barriuso and A. Knies, “SHMEM user’s guide for C,” Cray Research Inc., Eagan, MN, USA, Tech. Rep., Jun. 1994.
[5] MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, Jun. 2015, version 3.1.
[6] B. C. Lam, A. D. George, and H. Lam, “TSHMEM: Shared-memory parallel computing on Tilera many-core processors,” in Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, ser. IPDPSW ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 325–334. doi:10.1109/IPDPSW.2013.154
[7] B. Chapman, T. Curtis, S. Pophale, S. Poole, J. Kuehn, C. Koelbel, and L. Smith, “Introducing OpenSHMEM: SHMEM for the PGAS community,” in Proceedings of the 4th Conference on Partitioned Global Address Space Programming Models, ser. PGAS ’10. New York, NY, USA: ACM, 2010, pp. 2:1–2:3. doi:10.1145/2020373.2020375
[8] OpenSHMEM Application Programming Interface, OpenSHMEM, Mar. 2015, version 1.2.
[9] University of Houston, “OpenSHMEM source releases.” [Online]. Available: http://openshmem.org/site/Downloads/Source
[10] The Ohio State University, “MVAPICH2-X: Unified MPI+PGAS communication runtime over OpenFabrics/Gen2 for exascale systems.” [Online]. Available: http://mvapich.cse.ohio-state.edu/overview/
[11] J. R. Hammond, S. Ghosh, and B. M. Chapman, “Implementing OpenSHMEM using MPI-3 one-sided communication,” in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, ser. Lecture Notes in Computer Science, S. Poole, O. Hernandez, and P. Shamis, Eds. Cham, Switzerland: Springer International Publishing, 2014, vol. 8356, pp. 44–58. doi:10.1007/978-3-319-05215-1_4
[12] “Portals OpenSHMEM implementation.” [Online]. Available: https://code.google.com/p/portals-shmem/

[13] C. Coti, “POSH: Paris OpenSHMEM, a high-performance OpenSHMEM implementation for shared memory systems,” Procedia Computer Science, vol. 29, pp. 2422–2431, 2014.
[14] Cray Inc., “Software for the Cray XK7 system,” Seattle, WA, USA.
[15] Mellanox Technologies, “Mellanox HPC-X OpenSHMEM,” Sunnyvale, CA, USA.
[16] D. Bonachea, “GASNet specification, v1.1,” University of California at Berkeley, Berkeley, CA, USA, Tech. Rep. UCB/CSD-02-1207, Oct. 2002.
[17] C. Yoon, V. Aggarwal, V. Hajare, A. D. George, and M. Billingsley, III, “GSHMEM, a portable library for lightweight, shared-memory, parallel programming,” in Proceedings of the 5th Conference on Partitioned Global Address Space Programming Models, ser. PGAS ’11, Oct. 2011, pp. 1–9.
[18] Tilera Corporation, “TILE-Gx36 multicore processor.” [Online]. Available: http://www.tilera.com/products/?ezchip=585&spage=621
[19] Tilera Corporation, “TILEPro64 processor family.”
[20] “MPICH: High-performance portable MPI.” [Online]. Available: http://www.mpich.org/
[21] The Ohio State University, “OSU micro-benchmarks.” [Online]. Available: http://mvapich.cse.ohio-state.edu/benchmarks/
[22] T. Ma, G. Bosilca, A. Bouteiller, and J. J. Dongarra, “Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms,” Journal of Parallel and Distributed Computing, vol. 73, no. 7, pp. 1000–1010, 2013, Best Papers: International Parallel and Distributed Processing Symposium (IPDPS) 2010, 2011 and 2012. doi:10.1016/j.jpdc.2013.01.015
[23] R. L. Graham and G. Shipman, “MPI support for multi-core architectures: Optimized shared memory collectives,” in Recent Advances in Parallel Virtual Machine and Message Passing Interface, ser. Lecture Notes in Computer Science, A. Lastovetsky, T. Kechadi, and J. Dongarra, Eds. Springer Berlin Heidelberg, 2008, vol. 5205, pp. 130–140. doi:10.1007/978-3-540-87475-1_21
[24] A. R. Mamidala, R. Kumar, D. De, and D. K. Panda, “MPI collectives on modern multicore clusters: Performance optimizations and communication characteristics,” in Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid, ser. CCGRID ’08, May 2008, pp. 130–137. doi:10.1109/CCGRID.2008.87
[25] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn, “Collective communication: theory, practice, and experience,” Concurrency and Computation: Practice and Experience, vol. 19, no. 13, pp. 1749–1783, 2007. doi:10.1002/cpe.1206
[26] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication operations in MPICH,” International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005. doi:10.1177/1094342005051521

[27] J. Jose, J. Zhang, A. Venkatesh, S. Potluri, and D. K. Panda, “A comprehensive performance evaluation of OpenSHMEM libraries on InfiniBand clusters,” in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, ser. Lecture Notes in Computer Science, S. Poole, O. Hernandez, and P. Shamis, Eds. Cham, Switzerland: Springer International Publishing, 2014, vol. 8356, pp. 14–28. doi:10.1007/978-3-319-05215-1_2
[28] R. Nishtala and K. A. Yelick, “Optimizing collective communication on multicores,” in Proceedings of the 1st USENIX Conference on Hot Topics in Parallelism, ser. HotPar ’09. Berkeley, CA, USA: USENIX Association, 2009, p. 18.
[29] B. C. Lam, A. Barboza, R. Agrawal, A. D. George, and H. Lam, “Benchmarking parallel performance on many-core processors,” in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, ser. Lecture Notes in Computer Science, S. Poole, O. Hernandez, and P. Shamis, Eds. Cham, Switzerland: Springer International Publishing, 2014, vol. 8356, pp. 29–43, extended version. doi:10.1007/978-3-319-05215-1_3
[30] M. Frigo and S. G. Johnson, “The design and implementation of FFTW3,” Proceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005. doi:10.1109/JPROC.2004.840301
[31] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, “The NAS Parallel Benchmarks,” NASA Advanced Supercomputing Division, Tech. Rep. RNR-94-007, 1994.
[32] D. Bailey, T. Harris, W. Saphir, R. F. Van der Wijngaart, A. Woo, and M. Yarrow, “The NAS Parallel Benchmarks 2.0,” NASA Advanced Supercomputing Division, Tech. Rep. NAS-95-020, 1995.
[33] S. Saini and D. H. Bailey, “NAS Parallel Benchmark (version 1.0) results 11-96,” NASA Advanced Supercomputing Division, Tech. Rep. NAS-96-018, 1996.
[34] H. Jin, M. Frumkin, and J. Yan, “The OpenMP implementation of NAS Parallel Benchmarks and its performance,” NASA Advanced Supercomputing Division, Tech. Rep. NAS-99-011, 1999.
[35] H. Feng, R. F. Van der Wijngaart, R. Biswas, and C. Mavriplis, “Unstructured Adaptive (UA) NAS Parallel Benchmark, version 1.0,” NASA Advanced Supercomputing Division, Tech. Rep. NAS-04-006, 2004.
[36] M. A. Frumkin and L. Shabanov, “Arithmetic data cube as a data intensive benchmark,” NASA Advanced Supercomputing Division, Tech. Rep. NAS-03-005, 2003.
[37] Symmetric Communications Interface (SCIF) for Intel Xeon Phi Product Family, Users Guide, Intel Corporation, Aug. 2015, revision 3.5.

[38] S. Potluri, K. Hamidouche, D. Bureddy, and D. K. Panda, “MVAPICH2-MIC: A high performance MPI library for Xeon Phi clusters with InfiniBand,” in Proceedings of the 2013 Extreme Scaling Workshop, ser. XSW ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 25–32. doi:10.1109/XSW.2013.8
[39] Intel Corporation, “Intel MPI Library,” Santa Clara, CA, USA. [Online]. Available: http://software.intel.com/intel-mpi-library
[40] R. Rahman, “Intel Xeon Phi core micro-architecture,” Intel Corporation, Santa Clara, CA, USA. [Online]. Available: http://software.intel.com/articles/intel-xeon-phi-core-micro-architecture
[41] B. C. Lam, A. D. George, H. Lam, and V. Aggarwal, “Low-level PGAS computing on many-core processors with TSHMEM,” Concurrency and Computation: Practice and Experience, 2015. doi:10.1002/cpe.3569
[42] V. Karpusenko and A. Vladimirov, “Configuration and benchmarks of peer-to-peer MPI communication over Gigabit Ethernet and InfiniBand in a cluster with Intel Xeon Phi coprocessors,” Colfax International, Tech. Rep., 2014.
[43] M. Si, Y. Ishikawa, and M. Takagi, “Direct MPI library for Intel Xeon Phi co-processors,” in Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, ser. IPDPSW ’13. Washington, DC, USA: IEEE Computer Society, 2013, pp. 816–824. doi:10.1109/IPDPSW.2013.179
[44] M. Si, A. J. Peña, P. Balaji, M. Takagi, and Y. Ishikawa, “MT-MPI: Multithreaded MPI for many-core environments,” in Proceedings of the 28th ACM International Conference on Supercomputing, ser. ICS ’14. New York, NY, USA: ACM, 2014, pp. 125–134. doi:10.1145/2597652.2597658
[45] J. Jose, K. Hamidouche, X. Lu, S. Potluri, J. Zhang, K. Tomko, and D. K. Panda, “High performance OpenSHMEM for Xeon Phi clusters: Extensions, runtime designs and application co-design,” in Proceedings of the 2014 IEEE International Conference on Cluster Computing, ser. CLUSTER ’14, Sep. 2014, pp. 10–18. doi:10.1109/CLUSTER.2014.6968754
[46] N. Namashivayam, S. Ghosh, D. Khaldi, D. Eachempati, and B. Chapman, “Native mode-based optimizations of remote memory accesses in OpenSHMEM for Intel Xeon Phi,” in Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, ser. PGAS ’14. New York, NY, USA: ACM, 2014, pp. 12:1–12:11. doi:10.1145/2676870.2676881
[47] M. Luo, M. Li, A. Venkatesh, X. Lu, and D. K. Panda, “UPC on MIC: Early experiences with native and symmetric modes,” in Proceedings of the 7th International Conference on PGAS Programming Models, ser. PGAS ’13, 2013, pp. 198–210.
[48] D. A. Mallón, G. L. Taboada, and L. Koesterke, “MPI and UPC broadcast, scatter and gather algorithms in Xeon Phi,” Concurrency and Computation: Practice and Experience, 2015. doi:10.1002/cpe.3552

[49] M. Li, K. Hamidouche, X. Lu, J. Lin, and D. K. Panda, “High-performance and scalable design of MPI-3 RMA on Xeon Phi clusters,” in Euro-Par 2015: Parallel Processing, ser. Lecture Notes in Computer Science, J. L. Träff, S. Hunold, and F. Versaci, Eds. Germany: Springer Berlin Heidelberg, 2015, vol. 9233, pp. 625–637. doi:10.1007/978-3-662-48096-0_48
[50] D. Grünewald and C. Simmendinger, “The GASPI API specification and its implementation GPI 2.0,” in Proceedings of the 7th International Conference on PGAS Programming Models, ser. PGAS ’13, 2013, pp. 243–248.
[51] Intel Corporation, “Intel Xeon Phi coprocessor 5110P,” Santa Clara, CA, USA. [Online]. Available: http://ark.intel.com/products/71992/
[52] F. Roth, “Performance of write to remote memory window, comment 11,” Intel Developer Zone, Forum for Intel MIC Architecture, Jul. 2013. [Online]. Available: http://software.intel.com/forums/topic/393771#comment-1742284
[53] The DAPL Team, “README/Release Notes for OFED 3.18 DAPL Release 2.1.6,” Aug. 2015.
[54] B. C. Lam, “Performance issues with Intel MPI (barriers) between Xeon Phi coprocessors,” Intel Developer Zone, Forum for Intel MIC Architecture, Jun. 2015. [Online]. Available: http://software.intel.com/forums/topic/560662
[55] B. C. Lam, “Performance issues with Intel MPI (RMA put/get) on Xeon Phi,” Intel Developer Zone, Forum for Intel MIC Architecture, Jul. 2015. [Online]. Available: http://software.intel.com/forums/topic/562296
[56] M. ten Bruggencate, D. Roweth, and S. Oyanagi, “Thread-safe SHMEM extensions,” in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, ser. Lecture Notes in Computer Science, S. Poole, O. Hernandez, and P. Shamis, Eds. Cham, Switzerland: Springer International Publishing, 2014, vol. 8356, pp. 178–185. doi:10.1007/978-3-319-05215-1_13

BIOGRAPHICAL SKETCH

Bryant Lam is a computer engineer and high-performance-computing researcher focused on scalable many-core computing through hardware and software performance analysis and lightweight programming libraries. Bryant’s work includes TSHMEM, an OpenSHMEM programming library for Tilera and Intel many-core devices, and he has published several conference and journal manuscripts. His interests include high-performance computing, parallel-programming models and languages, and development of scalable applications and tools. Bryant received his B.S. in Electrical Engineering, B.S. in Computer Engineering, M.S. in Electrical and Computer Engineering, and Ph.D. in Electrical and Computer Engineering from the University of Florida.
