Analysis of Threading Libraries for High Performance Computing
This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TC.2020.2970706

© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Adrián Castelló (Universitat Jaume I, Castellón de la Plana, Spain), Rafael Mayo Gual (Universitat Jaume I, Castellón de la Plana, Spain), Sangmin Seo (Ground X, Seoul, Korea), Pavan Balaji (Argonne National Laboratory, Lemont, USA), Enrique S. Quintana-Ortí (Universitat Politècnica de València, València, Spain), Antonio J. Peña (Barcelona Supercomputing Center (BSC), Barcelona, Spain)

Abstract—With the appearance of multi-/many-core machines, applications and runtime systems evolved in order to exploit the new on-node concurrency brought by new software paradigms. POSIX threads (Pthreads) was widely adopted for that purpose, and it remains the most used threading solution on current hardware. Lightweight thread (LWT) libraries emerged as an alternative, offering lighter mechanisms to tackle the massive concurrency of current hardware. In this paper, we analyze in detail the most representative threading libraries, including Pthread- and LWT-based solutions. In addition, to examine the suitability of LWTs for different use cases, we develop a set of microbenchmarks consisting of OpenMP patterns commonly found in current parallel codes, and we compare the results obtained with the threading libraries and with OpenMP implementations. Moreover, we study the semantics offered by the threading libraries in order to expose the similarities among the different LWT application programming interfaces and their advantages over Pthreads. This study shows that LWT libraries outperform solutions based on operating system threads when tasks and nested parallelism are required.

Index Terms—Lightweight Threads, OpenMP, GLT, POSIX Threads, Programming Models
1. Introduction

In the past few years, the number of cores per processor has increased steadily, reaching impressive counts such as the 260 cores per socket in the Sunway TaihuLight supercomputer [1], which was ranked #1 for the first time in the June 2016 TOP500 list [2]. This trend indicates that upcoming exascale systems may well feature a large number of cores. Therefore, future applications will have to accommodate this massive concurrency by deploying a large number of threads and/or tasks in order to extract a significant fraction of the computational power of such hardware.

Current solutions for extracting on-node parallelism are based on operating system (OS) threads in both low- and high-level libraries. Examples of this usage are Pthreads [3] for the former and programming models (PMs) such as OpenMP [4] for the latter. However, performing thread management in the OS increases the cost of these operations (e.g., creation, context switch, or synchronization). As a consequence, leveraging OS threads to exploit a massive degree of hardware parallelism may be inefficient. In response to this problem, dynamic scheduling and lightweight thread (LWT) models (also known as user-level threads, or ULTs) were first proposed in [5] in order to deal with the required levels of parallelism, offering more efficient management, context switching, and synchronization operations. These solutions rely on threads that are managed in user space, so that the OS is only minimally involved and, hence, the overhead is lower.

To illustrate this, Figure 1 highlights the time spent creating OS threads and user-level threads (ULTs). In this example, one thread is created for each core in a machine with two Intel Xeon E5-2695v4 (2.10 GHz) CPUs and 128 GB of memory. For the OS threads we employ the GNU C 6.1 library [6], and for the ULT case we employ Argobots (07-2018) threads [7]. The time difference is caused by the involvement of the OS and by the features of each type of thread.

[Figure 1: Cost of creating OS threads and ULTs. The plot reports time (ms) for the OS-thread and ULT cases as the number of threads grows from 1 to 72.]

For tackling the OS overhead, a number of LWT libraries have been implemented for specific OSs, such as Windows Fibers [8] and Solaris Threads [9]; for specific hardware, such as TiNy-threads [10] for the Cyclops64 cellular architecture; or for network services, such as Capriccio [11]. Other solutions emerged to support specific higher-level PMs. This is the case of Converse Threads [12], [13] for Charm++ [14], and Nanos++ LWTs [15] for task parallelism in OmpSs [16]. Moreover, general-purpose solutions have emerged, such as GNU Portable Threads [17], StackThreads/PM [18], ProtoThreads [19], MPC [20], MassiveThreads [21], Qthreads [22], and Argobots [7]. Other solutions that abstract LWT facilities include Cilk [23], Intel TBB [24], and Go [25]. In addition, solutions like Stackless Python [26] and Protothreads [19] are more focused on stackless threads.

In spite of their potential performance benefits, none of these LWT software solutions has been significantly adopted to date. The easier code development offered by directive-based PMs, in combination with the lack of a standard/specification, hinders portability and requires a considerable effort to translate code from one library to another. In order to tackle this situation, a common application programming interface (API), called Generic Lightweight Threads (GLT), was presented in [27]. This API unifies LWT solutions under a unique set of semantics, becoming the first step toward a standard/specification. GLT is currently implemented on top of Qthreads, MassiveThreads, and Argobots.

One further step is presented in [28] and [29], where we explain the semantic mapping between the OpenMP and OmpSs PMs and LWTs and implement both high-level solutions on top of the GLT API.

In this paper we demonstrate the usability and performance benefits of LWT solutions. We analyze several threading solutions from a semantic point of view, identifying their strong and weak points. Moreover, we offer a detailed performance study using OpenMP because of its position as the de facto standard parallel PM for multi-/many-core architectures. Our results reveal that the performance of most of the LWT solutions is similar and that they are as efficient as OS threads in some simple scenarios, while outperforming them in many more complex cases.

In our previous work [30], we compared several LWT solutions and used the OpenMP PM as the baseline. In this study we expand that work by adding the Pthreads library to our semantic and functional analysis of threading libraries, in order to highlight the overhead (if any) introduced by the OpenMP implementations. The purpose of this paper is to present the first comparison of threading libraries from a semantic point of view, along with a complete performance evaluation that aims to demonstrate that LWTs are a promising replacement for Pthreads, used both as low-level libraries and as the base implementation for high-level PMs.
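To make the comparison in Figure 1 concrete, the sketch below (a minimal approximation, not the authors' benchmark code) times the creation of one OS thread per core with Pthreads and of one ULT per core with Argobots. The Argobots calls (ABT_init, ABT_xstream_self, ABT_xstream_get_main_pools, ABT_thread_create, ABT_thread_free, ABT_finalize) follow the public Argobots API; the default thread count, timing method, and build line are illustrative assumptions.

/* Minimal sketch: time the creation of N OS threads (Pthreads) versus
 * N user-level threads (Argobots ULTs). Assumes Argobots is installed;
 * build e.g. with: gcc creation.c -o creation -lpthread -labt */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#include <abt.h>

static void *os_thread_fn(void *arg) { (void)arg; return NULL; }
static void  ult_fn(void *arg)       { (void)arg; }

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) * 1e-6;
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 72;    /* one thread per core */
    struct timespec t0, t1;

    /* OS threads (Pthreads): create, then join outside the timed region. */
    pthread_t *osth = malloc(n * sizeof(*osth));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        pthread_create(&osth[i], NULL, os_thread_fn, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < n; i++)
        pthread_join(osth[i], NULL);
    printf("Pthreads creation:     %.4f ms\n", elapsed_ms(t0, t1));

    /* ULTs (Argobots): push the ULTs into the main pool of the current
     * execution stream, then free (which also joins) them. */
    ABT_init(argc, argv);
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_xstream_self(&xstream);
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    ABT_thread *ults = malloc(n * sizeof(*ults));
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        ABT_thread_create(pool, ult_fn, NULL, ABT_THREAD_ATTR_NULL, &ults[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < n; i++)
        ABT_thread_free(&ults[i]);
    printf("Argobots ULT creation: %.4f ms\n", elapsed_ms(t0, t1));

    ABT_finalize();
    free(osth);
    free(ults);
    return 0;
}

Because the ULTs are only enqueued in a user-space pool rather than handed to the OS scheduler, their creation cost stays low as the thread count grows, which is the behavior Figure 1 reports.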
2. Threading Libraries

In this section we describe the two types of threading libraries, OS threads and LWTs, that are analyzed and evaluated in this paper. Moreover, we briefly present the OpenMP PM, for which production implementations are currently based on Pthreads.

For the evaluation of the libraries, from the point of view of OS threads we have selected Pthreads, because this is a standard library that matches the current hardware concurrency. In the case of LWTs, Qthreads and MassiveThreads have been selected because these are among the most used lightweight threading models in high-performance computing (HPC). In addition, Converse Threads and Argobots were chosen because they correspond, respectively, to the first (and still currently used) LWT library and to the most flexible solution. Although Go is not HPC-oriented, we have also included it as a representative of the high-level abstracted LWT implementations.

Before highlighting the strengths and weaknesses of each solution, we present a summary of the most used functions when programming with threads. Table 1 lists the nomenclature of each API for the different functionality. This includes the initialization and finalization functions that set up and destroy the threads environment, as well as the thread/tasklet management functions (creation, join, and yield).

TABLE 1: Summary of the most used functions in microbenchmark implementations using threads. Pth, Arg, Qth, MTh, CTh, and Go identify the threading libraries Pthreads, Argobots, Qthreads, MassiveThreads, Converse Threads, and Go, respectively.

Function   Pth (pthread_)   Arg (ABT_)      Qth (qthread_)   MTh (myth_)   CTh            Go
Init       –                init            initialize       init          ConverseInit   –
Thread     create           thread_create   fork             create        CthCreate      go
Tasklet    –                task_create     –                –             CmiSyncSend    –
Yield      yield            thread_yield    yield            yield         CthYield       –
Join       join             thread_free     readFF           join          –              channel
End        –                finalize        finalize         fini          ConverseExit   –

2.1. Pthreads API

Pthreads [31] offers three PMs that differ in how the threads are bound and in which thread is in control. An important agent in these PMs is the kernel scheduled entity (KSE).
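As a concrete illustration of the nomenclature collected in Table 1, the following minimal sketch (not taken from the paper) expresses the same create/join pattern with Pthreads and with Qthreads, where qthread_fork creates a unit of work and qthread_readFF waits on its return word. The header path and link flags are assumptions that may differ between installations.

/* Sketch of the Table 1 mapping: one task created and joined with
 * Pthreads and with Qthreads. Assumes the Qthreads library is installed;
 * build e.g. with: gcc table1.c -o table1 -lpthread -lqthread */
#include <stdio.h>
#include <pthread.h>
#include <qthread/qthread.h>

static void *pth_work(void *arg)     { (void)arg; return NULL; }
static aligned_t qth_work(void *arg) { (void)arg; return 0; }

int main(void)
{
    /* Pthreads: no explicit init/finalize entry in Table 1. */
    pthread_t t;
    pthread_create(&t, NULL, pth_work, NULL);   /* "Thread: create"   */
    pthread_join(t, NULL);                      /* "Join:   join"     */

    /* Qthreads: initialize + fork + readFF + finalize. */
    qthread_initialize();                       /* "Init:   initialize" */
    aligned_t ret = 0;
    qthread_fork(qth_work, NULL, &ret);         /* "Thread: fork"       */
    qthread_readFF(NULL, &ret);                 /* "Join:   readFF"     */
    qthread_finalize();                         /* "End:    finalize"   */

    printf("Both runtimes completed one unit of work each.\n");
    return 0;
}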