Efficient Barrier Implementation on the POWER8 Processor

Efficient Barrier Implementation on the POWER8 Processor C.D. Sudheer Ashok Srinivasan IBM Research Dept. of Computer Science New Delhi, India Florida State University [email protected] Tallahassee, USA [email protected] Abstract—POWER8 is a new generation of POWER access and the impact of pre-fetching to cache. Our best processor capable of 8-way simultaneous multi-threading implementation achieves latency of around 1 µs - 4 µs per core. High-performance computing capabilities, such as on 1 and 8-way SMT respectively. This improves on the high amount of instruction-level and thread level parallelism, are integrated with a deep memory hierarchy. Fine-grained existing implementations by one to two orders of magnitude. parallel applications running on such architectures often rely Our implementation uses characteristics of the memory sub- on an efficient barrier implementation for synchronization. system, which can also help optimize other applications, as We present a variety of barrier implementations for a 4-chip mentioned in section VII on future work. POWER8 node. These implementations are optimized based The primary contributions of this paper are as follows. on a careful study of the POWER8 memory sub-system. Our best implementation yields one to two orders of magnitude • Developing efficient MPI and POSIX threads based lower time than the current MPI and POSIX threads based barrier implementations that improve on existing ones barrier implementations on POWER8. Apart from providing by one to two orders of magnitude. efficient barrier implementations, an additional significance • Demonstrating the potential for optimization through of this work lies in demonstrating how certain features of the memory subsystem, such as NUMA access to remote L3 cache control of hardware prefetching by the POWER8 core. and the impact of prefetching, can be used to design efficient This will be useful for optimizing other applications primitives on the POWER8. too on POWER8. • Demonstrating optimization of data movement on Keywords-Barrier; POWER8; MPI; pthreads; multi-chip POWER8 servers by accounting for NUMA effects in access to L3 cache. I. INTRODUCTION II. POWER8 ARCHITECTURE The POWER8 is a new high-end processor from IBM, POWER8 is the eighth-generation POWER processor [2], and IBM’s first high-end processor to be available under designed for both high thread-level performance and system OpenPOWER Foundation. It provides up to 8-way SMT throughput on a wide variety of workloads. POWER8 pro- per core, thus permitting a high degree of parallelism in cessor is implemented in IBM’s 22nm SOI technology [3]. node. Further details on its architecture are presented in Each POWER8 chip has up to 12 processor cores, a 50% Section II. Applications may implement parallelism using increase over the previous generation POWER processor. multiple threads, multiple processes, or a combination. In The 649mm2 die, shown in Figure 1, includes 12 enhanced many applications involving fine-grained parallelization, the 8-way multithreaded cores with private L2 caches, and a performance of barrier calls play a crucial role [1]. In this 96MB high-bandwidth eDRAM L3 cache. paper, we present our efficient MPI and POSIX pthreads The POWER8 memory hierarchy includes a per-core L1 barrier implementations on a 4-chip POWER8 server. cache, with 32KB for instructions and 64KB for data, a per- An efficient barrier implementation relies on characteri- core SRAM-based 512KB L2 cache, and a 96MB eDRAM- zation of the memory sub-system. We present early related based shared L3 cache. The cache line size on POWER8 is work on the POWER8 memory sub-system in section III, 128 bytes. In addition, a 16MB off-chip eDRAM L4 cache and also related work on similar architectures, especially in per memory buffer chip is supported. There are 2 memory the context of collective communication implementations. controllers on the chip supporting a sustained bandwidth We present our results on additional characterization of the of up to 230GB per second. The POWER8 chip has an memory subsystem in section IV. We then describe a variety interconnection system that connects all components within of popular barrier algorithms in section V. the chip. The interconnect also extends through module and Section VI presents our comprehensive evaluation of board technology to other POWER8 processors in addition these algorithms, along with results of optimization for to DDR3 memory and various I/O devices. the POWER8 architecture. In particular, the most effective The POWER8 based Scale Out servers [4] such as the optimizations account for NUMA effects in remote L3 cache one we used in our work are based on IBM‘s Murano dual-chip module (DCM) [5], which puts two half-cored III. RELATED WORK POWER8 chips into a single POWER8 socket and links In [8] various aspects of a POWER8 system were studied them by a crossbar [6]. Each DCM contains two POWER8 using microbenchmarks, and the performance of OpenMP chips mounted onto a single module that is plugged into a directives is evaluated. The performance of few scientific processor socket. For example, a 12-core DCM actually has applications using OpenMP is evaluated and it highlights two 6-core POWER8 processors mounted on the module. the overhead of OpenMP primitives especially when 8-way The particular server we used in this work is 8247-42L [7], SMT is used in each core. and it has 2 sockets, each with one DCM containing two 6- Communication between the cores in a cache coherent core POWER8 chips. In total, it has 24 cores with 192-way system is modeled in [9] using Xeon Phi as a case study. parallelism. Using their model, they designed algorithms for collective Each POWER8 core can issue and execute up to ten communication. instructions and 4 load operations can be performed in a Our barrier implementation is not built on top of given cycle [2]. This higher load/store bandwidth capability point-to-point primitives. Rather, all processes directly compared to its predecessor makes POWER8 more suitable use a common region of shared memory. Such use of for high bandwidth requirements of many commercial, big shared memory has been shown to be more efficient data, and HPC workloads. than implementing collectives on top of point-to-point POWER8 also has enhanced prefetching features such primitives [10] on the Opteron. We have also implemented as Instruction speculation awareness and Data prefetch collectives using shared memory for Cell BE and Xeon depth awareness. Designed into the load-store unit, the Phi. In our earlier work, we had used double buffering data prefetch engine can recognize streams of sequentially on the programmer controlled caches (local stores) of the increasing or decreasing accesses to adjacent cache lines Cell processor to optimize MPI [11]. We also showed that and then request anticipated lines from more distant levels double buffering can be effective even on the Xeon Phi of the memory hierarchy. After a steady state of data [12]. Since the caches on Xeon Phi are not programmer stream prefetching is achieved by the pipeline, each stream controlled, we induce the same benefits by careful design of confirmation causes the engine to bring one additional line the algorithms. Though our implementations on POWER8 into the L1 cache, one additional line into the L2 cache, use shared memory, the implementation is tuned considering and one additional line into the L3 cache [2]. Prefetching is the cache hierarchy and prefetching features of POWER8 automatically done by the POWER8 hardware and is config- rather than through the above optimization techniques. urable through the Data Streams Control Register (DSCR). The POWER ISA supports instructions and corresponding programming language intrinsics to supply a hint to data IV. EVALUATION OF MEMORY SUBSYSTEM prefetch engines to override the automatic stream detection Understanding the memory bandwidth, memory access capability of the data prefetcher [3]. Our shared memory latency or the cache-to-cache bandwidth on the POWER8 based barrier implementation is optimized as described later is crucial to optimize parallel codes. Here we focus on the by configuring this DSCR register. chip organization used in the POWER8 scale-out servers, which uses a 2 socket configuration. The L3 cache is shared among the the cores and a core can access a remote L3 cache block and the latency for the access depends on the placement of the remote core. The remote core could be present on the same chip, or same socket or a remote socket, and access latency varies accordingly. To achieve better parallel performance, core binding and NUMA placement are crucial. To better understand the NUMA effects of the POWER8 memory subsystem, we used two benchmarks to measure the core to core access latency. First, we use the standard ping-pong benchmark [13] to measure the point-to-point communication latency. Message passing latency between two nodes or between two cores within a node is commonly determined by measuring ping pong latency. MPI latency is usually defined to be half of the time of a ping pong operation with a message of size zero. Figure 1. POWER8 processor chip The table I shows MPI latency between two POWER8 cores with the two MPI processes placed onto different chips. For convenience, we use the following naming con- Min 0.295 Max 0.328 vention. The chips on the 2-socket POWER8 server is Average 0.303 referred as socket 0 having two chips A and B, and socket 1 Table III having two chips C and D. One MPI process is always bound BARRIER LATENCY IN A CHIP WITH ALL POSSIBLE MAPPINGS. onto a core in chip A, and latency is measured by placing the other task onto different chips. As can be seen from the latency numbers, when the two cores are on the same chip (A), the latency is less compared to when the cores V.

Load more