BCL: a Cross-Platform Distributed Data Structures Library
Benjamin Brock, Aydın Buluç, Katherine Yelick
University of California, Berkeley
Lawrence Berkeley National Laboratory
{brock,abuluc,yelick}@cs.berkeley.edu

ABSTRACT

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters, and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.

CCS CONCEPTS

• Computing methodologies → Parallel programming languages.

KEYWORDS

Parallel Programming Libraries, RDMA, Distributed Data Structures

ACM Reference Format:
Benjamin Brock, Aydın Buluç, Katherine Yelick. 2019. BCL: A Cross-Platform Distributed Data Structures Library. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

arXiv:1810.13029v2 [cs.DC] 23 Apr 2019

1 INTRODUCTION

Writing parallel programs for supercomputers is notoriously difficult, particularly when they have irregular control flow and complex data distribution; however, high-level languages and libraries can make this easier. A number of languages have been developed for high-performance computing, including several using the Partitioned Global Address Space (PGAS) model: Titanium, UPC, Coarray Fortran, X10, and Chapel [8, 11, 12, 27, 31, 32]. These languages are especially well-suited to problems that require asynchronous one-sided communication, or communication that takes place without a matching receive operation or outside of a global collective. However, PGAS languages lack the kind of high-level libraries that exist in other popular programming environments. For example, high-performance scientific simulations written in MPI can leverage a broad set of numerical libraries for dense or sparse matrices, or for structured, unstructured, or adaptive meshes. PGAS languages can sometimes use those numerical libraries, but lack the kind of data structures that are important in some of the most irregular parallel programs.

In this paper we describe a library, the Berkeley Container Library (BCL), that is intended to support applications with irregular patterns of communication and computation and data structures with asynchronous access, for example hash tables and queues, that can be distributed across processes but manipulated independently by each process. BCL is designed to provide a complementary set of abstractions for data analytics problems, various types of search algorithms, and other applications that do not easily fit a bulk-synchronous model. BCL is written in C++ and its data structures are designed to be coordination-free, using one-sided communication primitives that can be executed using RDMA hardware without requiring coordination with remote CPUs. In this way, BCL is consistent with the spirit of PGAS languages, but provides higher-level operations such as insert and find in a hash table, rather than low-level remote read and write. As in PGAS languages, BCL data structures live in a global address space and can be accessed by every process in a parallel program. BCL data structures are also partitioned to ensure good locality whenever possible and allow for scalable implementations across multiple nodes with physically disjoint memory.

BCL is cross-platform, and is designed to be agnostic about the underlying communication layer as long as it provides one-sided communication primitives. It runs on top of MPI's one-sided communication primitives, OpenSHMEM, and GASNet-EX, all of which provide direct access to low-level remote read and write primitives to buffers in memory [6, 9, 17]. BCL provides higher-level abstractions than these communication layers, hiding from users many of the details of buffering, aggregation, and synchronization that are specific to a given data structure. BCL also has an experimental UPC++ backend, allowing BCL data structures to be used inside another high-level programming environment. BCL uses a high-level data serialization abstraction called ObjectContainers to allow the storage of arbitrarily complex datatypes inside BCL data structures. BCL ObjectContainers use C++ compile-time type introspection to avoid introducing any overhead in the common case that types are byte-copyable.

We present the design of BCL with an initial set of data structures and operations. We then evaluate BCL's performance on ISx, an integer sorting mini-application, Meraculous, a mini-application taken from a large-scale genomics application, and a collection of microbenchmarks examining the performance of individual data structure operations. We explain how BCL's data structures and design decisions make developing high-performance implementations of these benchmarks more straightforward and demonstrate that BCL is able to match or exceed the performance of both specialized, expert-tuned implementations as well as general libraries across three different HPC systems.
[Figure 1: Organizational diagram of BCL. The diagram shows the BCL Core internal DSL connecting the four backends (MPI, OpenSHMEM, GASNet-EX, UPC++) to the BCL containers.]

1.1 Contributions

(1) A distributed data structures library that is designed for high performance and portability by using a small set of core primitives that can be executed on four distributed memory backends
(2) The BCL ObjectContainer abstraction, which allows data structures to transparently handle serialization of complex types while maintaining high performance for simple types
(3) A distributed hash table implementation that supports fast insertion and lookup phases, dynamic message aggregation, and individual insert and find operations
(4) A distributed queue abstraction for many-to-many data exchanges performed without global synchronization
(5) A distributed Bloom filter which achieves fully atomic insertions using only one-sided operations
(6) A collection of distributed data structures that offer variable levels of atomicity depending on the call context, using an abstraction called concurrency promises
(7) A fast and portable implementation of the Meraculous benchmark built in BCL
(8) An experimental analysis of irregular data structures across three different computing systems, along with comparisons between BCL and other standard implementations

2 BACKGROUND AND HIGH-LEVEL DESIGN

Several approaches have been used to address programmability issues in high-performance computing, including parallel languages like Chapel, template metaprogramming libraries like UPC++, and embedded DSLs like STAPL. These environments provide core language abstractions that can boost productivity, and some of them have sophisticated support for multidimensional arrays. However, [...]

[...] inserts, and finds, are executed purely in RDMA without requiring coordination with remote CPUs.

Portability. BCL is cross-platform and can be used natively in programs written in MPI, OpenSHMEM, GASNet-EX, and UPC++. When programs only use BCL data structures, users can pick whichever backend's implementation is most optimized for their system and network hardware.

Software Toolchain Complexity. BCL is a header-only library, so users need only include the appropriate header files and compile with a C++14-compliant compiler to build a BCL program. BCL data structures can be used in part of an application without having to rewrite the whole application or include any new dependencies.

3 BCL CORE

The BCL Core is the cross-platform internal DSL we use to implement BCL data structures. It provides a high-level PGAS memory model based on global pointers, which are C++ objects that allow the manipulation of remote memory. During initialization, each process creates a shared memory segment of a fixed size, and every process can read and write to any location within another node's shared segment using global pointers. A global pointer is a C++ object that contains (1) the rank of the process on which the memory is located and (2) the particular offset within that process's shared memory segment that is being referenced. Together, these two values uniquely identify a global memory address. Global pointers are regular data objects and can be passed between BCL processes using communication primitives or stored in global memory.