Two-Level Main Memory Co-Design: Multi-Threaded Algorithmic Primitives, Analysis, and Simulation

Michael A. Bender∗§ Jonathan Berry† Simon D. Hammond† K. Scott Hemmert† Samuel McCauley∗ Branden Moore† Benjamin Moseley‡ Cynthia A. Phillips† David Resnick† and Arun Rodrigues†

∗Stony Brook University, Stony Brook, NY 11794-4400 USA {bender,smccauley}@cs.stonybrook.edu †Sandia National Laboratories, Albuquerque, NM 87185 USA {jberry, sdhammo, kshemme, bjmoor, caphill, drresni, afrodi}@sandia.gov ‡Washington University in St. Louis, St. Louis, MO 63130 USA [email protected] §Tokutek, Inc. www.tokutek.com

Abstract—A fundamental challenge for memory architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth, high-capacity memory further away. This near memory will probably not have significantly faster latency than the far memory. This, combined with the large size of the near memory (multiple GB) and power constraints, may make it difficult to treat it as a standard cache.

In this paper, we explore some of the design space for a user-controlled multi-level main memory. We present algorithms designed for the heterogeneous bandwidth, using streaming to exploit data locality. We consider algorithms for the fundamental application of sorting. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories' SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance. This co-design effort suggests that implementing two-level main memory systems may improve memory performance in fundamental applications.

I. INTRODUCTION

Recently vendors have proposed a new approach to improve memory performance by increasing the bandwidth between cache and memory [18], [21]. The approach is to bond memory directly to the processor chip or to place it nearby on a silicon interposer. By placing memory close to the processor, there can be a higher number of connections between the memory and caches, enabling higher bandwidth than current technologies.¹ While the term scratchpad is overloaded within the computer architecture field, we use it throughout this paper to describe a high-bandwidth, local memory that can be used as a temporary storage location.

¹Note that the name "scratchpad" also refers to high-speed internal memory used for temporary calculations [5], [29], which is a different technology than that discussed in this paper.

The scratchpad cannot replace DRAM entirely. Due to the physical constraints of adding the memory directly to the chip, the scratchpad cannot be as large as DRAM, although it will be much larger than cache, having gigabytes of storage capacity. Since the scratchpad is smaller than main memory, it does not fully replace main memory, but instead augments the existing memory hierarchy.

The scratchpad has other limitations besides its size. First, the scratchpad does not significantly improve upon the latency of DRAM. Therefore, the scratchpad is not designed to accelerate memory-latency-bound applications, but rather bandwidth-bound ones. Second, adding a scratchpad does not improve the bandwidth between DRAM and cache. Thus, the scratchpad will not accelerate a computation that consists of a single scan of a large chunk of data that resides in DRAM.

This gives rise to a new multi-level memory hierarchy where two of the components—the DRAM and the scratchpad—work in parallel. We view the DRAM and the scratchpad as being on the same level of the hierarchy because their access times are similar. There is a tradeoff between the memories: the scratchpad has limited space and the DRAM has limited bandwidth.

Since the scratchpad does not have its own level in the cache hierarchy, when a record is evacuated from the cache there is an algorithmic decision whether to place the record in the scratchpad or directly into the DRAM. In the currently-proposed architecture designs, this decision is user-controlled.

Under the assumption that the memory is user-controlled, an algorithm must coordinate memory accesses from main memory and the scratchpad. Ideally, the higher bandwidth of the scratchpad can be leveraged to alleviate the bandwidth bottleneck in applications. Unfortunately, known algorithmics do not directly apply to the proposed two-level main memory architecture. The question looms: can an algorithm use the higher-bandwidth scratchpad memory to improve performance? Unless this question is answered positively, the proposed architecture, with its additional complexity and manufacturing cost, is not viable.

Our interest in this problem comes from Trinity [30], the latest NNSA (National Nuclear Security Administration) supercomputer architecture. This supercomputer uses the Knights Landing processor from Intel with Micron memory. This processor chip uses such a two-level memory [18]. The mission of the Center of Excellence for Application Transition [21], a collaboration between NNSA Labs (Sandia, Los Alamos, and Lawrence Livermore), Intel, and Cray, is to ensure applications can use the new architecture when it becomes available in 2015–16. This is the first work to examine the actual effect of the scratchpad on the performance of a specific algorithm.
A. Results

In this paper we introduce an algorithmic model of the scratchpad architecture, generalizing existing sequential and parallel external-memory models [1], [3]. (See Section VI for a brief background on the architectural design and motivation for the scratchpad.) We introduce theoretically optimal scratchpad-optimized sequential and parallel sorting algorithms. We report on hardware simulations varying the relative-bandwidth parameter. These experiments suggest that the scratchpad will improve running times for sorting on actual scratchpad-enhanced hardware for a sufficient number of cores and a sufficiently large bandwidth improvement.

Theoretical contributions. We give an algorithmic model of the scratchpad, which generalizes existing sequential and parallel external-memory models [1], [3]. In our generalization, we allow two different block sizes, B and ρB (ρ > 1), to model the bandwidth of DRAM and the larger bandwidth of the scratchpad. Specifically, ρ > 1 is the relative increase in bandwidth of the scratchpad in comparison to DRAM. We exhibit, under reasonable architectural assumptions of the scratchpad, sorting algorithms that achieve a ρ-factor improvement over the state-of-the-art sorting algorithms when DRAM is the only accessible memory. We begin by introducing a sequential algorithm for sorting, and then generalize to the multiprocessor (one multicore) setting. We complement this result by giving a matching lower bound in our theoretical model. Our algorithms supply theoretical evidence that the proposed architecture can indeed speed up fundamental applications.

Empirical contributions. After theoretically establishing the performance of our algorithms, we performed an empirical evaluation. The scratchpad architecture is yet to be produced, so we could not perform experiments on a deployed system. Instead, we extended Sandia National Laboratories' SST [25] simulator to allow for a scratchpad memory architecture with variable bandwidth, and ran simulations over a range of possible hardware parameters.

Our empirical results corroborate the theoretical results. We consider our algorithm in the cases where the application is memory-bandwidth bound. The benchmark we compare against our algorithm is the well-established parallel GNU multiway merge sort [27]. We show a linear reduction in running time for our algorithm when increasing the bandwidth from two to eight times. We also show that in the case where the application is not memory bound, then, as expected, we do not achieve improved performance.

Our empirical and theoretical results together show the first evidence that the scratchpad architecture could be a feasible way to improve the performance of memory-bandwidth-bound applications.

Algorithms and architecture co-design. This work is an example of co-design. The possible architectural design motivates algorithm research to take advantage of the new architectural features. The results, especially the experiments, help determine ranges of architectural parameters where the advantage is practical, at least for this initial application.

In particular, we find that the bandwidth-expansion parameter ρ introduced in the theoretical model is also an important performance parameter in our simulations. We estimate the ratio of processing capability to memory bandwidth necessary for sorting to become memory-bandwidth bound. From this, we estimate the number of cores that must be on a node at today's computing capability for the scratchpad to be of benefit. The core counts and minimum values of ρ could guide vendors in the design of future scratchpad-based systems.

The SST extension, designed and implemented by computer architects, is a co-design research tool. SST is an open-source, modular framework for designing simulations of advanced computing systems [24], [25]. Once this extension is released, algorithms researchers will be able to test scratchpad algorithms before the hardware, such as Knights Landing, is generally available.
II. ALGORITHMIC SCRATCHPAD MODEL

In this section, we propose an algorithmic model of the scratchpad, generalizing the external-memory model of Aggarwal and Vitter [1] to include high- and low-bandwidth memory.

Scratchpad model. Both the DRAM and the scratchpad are independently connected to the cache; see Figure 1. The scratchpad hardware has an access time that is roughly the same as that of normal DRAM—the difference is that the scratchpad has limited size but higher bandwidth. We model the higher bandwidth of the scratchpad by transferring data to and from the DRAM in smaller blocks of size B and to and from the scratchpad in larger blocks of size ρB, ρ > 1.

Fig. 1. The scratchpad memory model. The cache (size Z) exchanges blocks of size ρB (cost 1) with the scratchpad (size M) and blocks of size B (cost 1) with the arbitrarily large DRAM.

We parameterize memory as follows: the cache has size Z, the scratchpad has size M ≫ Z, and the DRAM is modeled as arbitrarily large. We assume a tall cache; that is, M > B². This is a common assumption in external-memory analysis; see e.g. [6], [9], [10], [17].

We model the roughly similar access times of the actual hardware by charging each block transfer a cost of 1, regardless of whether it involves a large or small block. Performance is measured in terms of block transfers. Computation is modeled as free because we consider memory-bound computations.

Note that the scratchpad block size, ρB, in the algorithmic performance model need not be the same as the block-transfer size in the actual hardware. But for the purpose of algorithm design, one should program assuming a block size of ρB, so that the latency contribution for a transfer is dominated by the bandwidth component.

In the following, we say that an event E_N on problem size N occurs with high probability if Pr(E_N) ≥ 1 − 1/N^c for some constant c.

Putting the scratchpad model in context. The external-memory model of Aggarwal and Vitter [1] assumes a two-level hierarchy comprised of a small, fast memory level and an arbitrarily large second level. Data is transferred between the two levels in blocks of size B. Because the scratchpad model has two different bandwidths but roughly similar latencies, we have extended the external-memory model to have two different block sizes, each with the same transfer cost; see Figure 1. There do exist other memory-hierarchy models, such as the cache-oblivious model [17], [23], which apply to multi-level memories. However, the combination of different bandwidths but similar latencies means that the cache-oblivious model does not seem to apply. Instead, the scratchpad model can be viewed as a three-level model, but one where two of the levels are in parallel with each other.

Background on external-memory sorting. The following well-known theorems give bounds on the cost of external-memory sorting.

Theorem 1 ([1]). Sorting N numbers from DRAM with a cache of size Z and block size L (but no scratchpad) requires Θ((N/L) log_{Z/L}(N/L)) block transfers from DRAM. This bound is achievable using multi-way merge sort with a branching factor of Z/L.

Theorem 2 ([1]). Sorting N numbers from DRAM with a cache of size Z and block size L (but no scratchpad) using merge sort takes Θ((N/L) lg(N/Z)) block transfers from DRAM.

In the analysis of sorting algorithms in the following sections, we will sometimes apply these theorems with L = B and sometimes apply them with L = ρB.
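The multi-way merge at the heart of Theorem 1 merges k = Z/L sorted runs in a single pass, touching each block once. As a concrete illustration, here is a minimal k-way merge sketch in C++; the heap-based run selection is standard, but the function name kWayMerge and the fully in-memory std::vector runs are our own illustrative assumptions, not the paper's implementation.

```cpp
#include <functional>
#include <queue>
#include <vector>
#include <cstdint>

// Minimal k-way merge sketch: merges k sorted runs in one pass,
// as in multi-way merge sort with branching factor k = Z/L.
std::vector<int64_t> kWayMerge(const std::vector<std::vector<int64_t>>& runs) {
    using Item = std::pair<int64_t, size_t>;          // (value, run index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<size_t> pos(runs.size(), 0);          // cursor into each run

    for (size_t r = 0; r < runs.size(); ++r)          // seed heap with run heads
        if (!runs[r].empty()) heap.push({runs[r][0], r});

    std::vector<int64_t> out;
    while (!heap.empty()) {
        auto [v, r] = heap.top(); heap.pop();         // smallest remaining element
        out.push_back(v);
        if (++pos[r] < runs[r].size())                // advance that run's cursor
            heap.push({runs[r][pos[r]], r});
    }
    return out;
}
```

Each of the k runs is consumed block by block, so one pass over N elements costs Θ(N/L) transfers; repeating the pass over the log_{Z/L}(N/L) levels of the merge tree gives the bound in Theorem 1.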
III. SORTING WITH A SCRATCHPAD

This section gives an optimal randomized algorithm for sorting with one processor with a scratchpad. This algorithm generalizes multiway merge sort and distribution sort [1], which in turn generalize merge sort and sample sort [16].

We are given an array A[1, ..., N] of N elements to sort, where N ≫ M. For simplicity of exposition we assume that all elements in A are distinct, but this assumption can be removed.

The algorithm works by recursively reducing the size of the input until each subproblem fits into the scratchpad, at which point it can be sorted rapidly. In each recursive step, we "bucket" elements so that for any two consecutive buckets (or ranges) R_i and R_{i+1}, every element in R_i is less than every element in R_{i+1}. We then sort each bucket recursively and concatenate the results, thus sorting all elements.

A. Choosing Bucket Boundaries

To perform the bucketing, we randomly select a sample set X of m = Θ(M/B) elements from A[1, ..., N] and sort them within the scratchpad. (We assume sampling with replacement, but sampling without replacement also works.)

The exact value of m is chosen by the algorithm designer, but must be small enough to fit in the scratchpad. Our implementation uses multiway mergesort from the GNU C++ library to sort within the scratchpad [11]. Other sorting algorithms could be used, such as quicksort. If ρ is sufficiently large, either sorting algorithm within the scratchpad leads to an optimal algorithm (see Theorem 6); however, the value of ρ based on current hardware probably is not large enough to make quicksort practically competitive with mergesort.

Corollary 3. Sorting x elements that fit in the scratchpad uses Θ((x/(ρB)) log_{Z/B}(x/B)) block transfers using multi-way merge sort with a branching factor of Z/B, or Θ((x/(ρB)) lg(x/Z)) in expectation using quicksort. Both algorithms use O(x lg x) work.

B. Bucketizing

We are given the sample set X = {x_1, x_2, ..., x_m} (x_1 < x_2 < ... < x_m) of sorted elements and the input array A. Since |X| = m, X fits in the scratchpad, and A does not.

Our objective is to place each element A[i] ∈ A \ X into bucket R_j if x_j < A[i] < x_{j+1}. (We may assume that x_0 = −∞ and x_{m+1} = ∞.)

We perform a linear scan through A, ingesting M − Θ(m) elements into the scratchpad at a time. All of X remains in the scratchpad for the entire scan. Once each new group of elements is loaded into the scratchpad, we sort them (along with X). Their positions relative to the elements of X indicate which buckets they are stored in. The resulting bucketed elements get written back to DRAM, while the set X remains in the scratchpad. We call this a bucketizing scan.

Lemma 4. Each bucketizing scan costs:
1) O(N/B) block transfers from DRAM,
2) O((N/(ρB)) log_{Z/(ρB)}(M/(ρB))) block transfers from the scratchpad, and
3) O(N lg M) operations in the RAM model.

Proof: We select and sort the set X of sampled points in the scratchpad (necessary for mergesort), and then bucketize the rest of A. Assume for simplicity that m ≤ M/2.

Sampling X requires O(m) block transfers and O(m) work, given constant time to choose a random word. Sorting X requires O(m lg m) work and O((m/(ρB)) log_{Z/B}(m/B)) block transfers (via Corollary 3).

To bucketize, we read all elements from DRAM to the scratchpad in blocks of size M − Θ(m) ≥ M/2. This requires O(M/B) block transfers from DRAM. The algorithm sorts each of these blocks in O((M/(ρB)) log_{Z/(ρB)}(M/(ρB))) scratchpad block transfers and requires O(M lg M) work to decide their buckets. There are O(N/M) such groups, so this requires O(N/B + (N/(ρB)) log_{Z/(ρB)}(M/(ρB))) total block transfers.

We write each bucket to a separate piece of DRAM memory. This does not increase the cost: were the writing done to a contiguous location, it would require Θ(M/B) block transfers, as the reading does. By writing the elements in O(m) = O(M/B) pieces, we may incur up to two extra block transfers per piece: one to start each bucket, and perhaps another to end it. Thus the total number of transfers to write buckets to DRAM (even in separate pieces) is Θ(M/B).
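To make the bucketizing scan concrete, the sketch below processes one in-scratchpad group: it sorts the group, then locates each element's bucket against the sorted pivots with a binary search (the sorted order would also permit a single merge pass). This is a simplified, single-threaded illustration with our own names; the algorithm above instead reads buckets off by position after sorting the group together with X, and the production NMsort records bucket metadata rather than materializing buckets, as described in Section IV-D.

```cpp
#include <algorithm>
#include <vector>
#include <cstdint>

// One step of a bucketizing scan (sketch): sort a group that fits in the
// scratchpad, then assign each element to the bucket R_j with
// x_j < element < x_{j+1}, where `pivots` is the sorted sample X.
// `buckets` must have pivots.size() + 1 entries.
void bucketizeGroup(std::vector<int64_t>& group,
                    const std::vector<int64_t>& pivots,
                    std::vector<std::vector<int64_t>>& buckets) {
    std::sort(group.begin(), group.end());   // in-scratchpad sort of the group
    for (int64_t v : group) {
        // Number of pivots <= v is exactly v's bucket index.
        size_t j = std::upper_bound(pivots.begin(), pivots.end(), v)
                   - pivots.begin();
        buckets[j].push_back(v);             // stand-in for a DRAM append
    }
}
```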
C. Bounding the Recursion Depth for Sorting

We now conclude our analysis of scratchpad sorting. Recall that the algorithm successively scans the input, refining it into progressively smaller buckets. Lemma 4 from the previous section gave a bound on the complexity of each scan.

In this section, we first show that the random samples are good enough so that, with high probability, each bucket is small enough to fit into the scratchpad after O(log_m(N/M)) bucketizing scans. Then we give a bound on the sorting complexity.

This randomized analysis applies to any external-memory sort. However, to the best of our knowledge, this kind of analysis does not appear in the literature. Previous external-memory sorting algorithms were deterministic. However, randomized algorithms are practical.

Lemma 5. Each bucket fits into the scratchpad after O(log_m(N/M)) bucketizing scans with high probability.

Proof: Each newly-created bucket results from a good or bad split, depending on its size. If a split makes a new bucket at least a √m factor smaller than the old one, we call this a good split; otherwise it is bad.

We show that in a sufficiently large set of O(log_m(N/M)) splits, there are 3 log_m(N/M) good splits with high probability. Thus, after O(log_m(N/M)) scans, each bucket is involved in at least 3 log_m(N/M) good splits, and fits in the scratchpad.

We first show that the probability of a bad split is exponentially small in √m. Let R_i be the bucket being split. For simplicity, assume R_i lists the elements in sorted order, though the bucket itself may not be sorted yet. If a subbucket starting at position j came from a bad split, no element in {R_i(j), R_i(j+1), ..., R_i(j + |R_i|/√m)} is sampled to be in X. This happens with probability at most (1 − √m/m)^m ≈ e^{−√m}.

We next show that of any (3/2)c log_m(N/M) splits on a root-to-leaf path of the recursion tree, at most c log_m(N/M) are bad with high probability. Thus there are at least (c/2) log_m(N/M) good splits; choosing c ≥ 6 proves the lemma.

By linearity of expectation, the expected number of bad splits is at most (3/2)c e^{−√m} log_m(N/M). Applying Chernoff bounds [20] with δ = (2/3)e^{√m} − 1, we obtain

  Pr[# bad splits > c log_m(N/M)] ≤ (e / ((2/3)e^{√m}))^{c log_m(N/M)}.

We want to show that this is bounded above by 1/N^α. Taking the log of both sides of the inequality, this simplifies to

  c((√m − 1) lg e − lg(3/2)) log_m(N/M) ≥ α lg N.

Now there are two cases. First, assume M ≥ √N. Recall the tall-cache assumption, that M > B². Then √m = √(M/B) ≥ M^{1/4} ≥ N^{1/8}. Substituting, we obtain c((N^{1/8} − 1) lg e − lg(3/2)) log_m(N/M) ≥ α lg N, which is true for sufficiently large N and c = α.

On the other hand, let M < √N. Note that for m ≥ 2, ((√m − 1) lg e − lg(3/2)) / lg m ≥ 1/2. Then for c ≥ 4α,

  c((√m − 1) lg e − lg(3/2)) log_m(N/M) ≥ (c/2) lg(N/M) ≥ α lg N.

This shows that each bucket is sufficiently small with high probability after O(log_m(N/M)) bucketizing scans. To show that all buckets must be small enough, take the union bound to obtain a probability no more than 1/N^{α−1}. Performing the analysis with α′ = α + 1 gives us the desired bound.

Theorem 6. There exists a sorting algorithm that performs Θ((N/B) log_{M/B}(N/B) + (N/(ρB)) log_{Z/(ρB)}(N/B)) block transfers with high probability to partition the array into buckets, all of size ≤ M. This bound comprises O((N/B) log_{M/B}(N/B)) block transfers from DRAM and O((N/(ρB)) log_{Z/(ρB)}(N/B)) high-bandwidth block transfers from the scratchpad. There is a matching lower bound, so this algorithm is optimal.

Proof: By Lemma 5 there are O(log_m(N/M)) ≤ O(1 + log_m(N/M)) = O(log_{M/B}(N/B)) bucketizing scans; multiplying this by the block transfers per scan in Lemma 4 gives us the desired bounds.
Note that once the buckets are of size ≤ M, we can sort all of them with O((N/(ρB)) log_{Z/(ρB)}(M/(ρB))) additional block transfers by Corollary 3; this is a lower-order term.

To obtain the lower bound we combine two lower bounds from weaker models. First consider a model where computation can be done in the scratchpad as well as in cache. Then the standard two-level-hierarchy lower bound applies, and sorting requires Ω((N/B) log_{M/B}(N/B)) block transfers from DRAM (since M + Z = O(M)). This bound is the same as the one given in Theorem 1, and can be found in [1].

Consider a second model where the DRAM allows high-bandwidth block transfers. Again, the standard two-level-hierarchy lower bound applies, and sorting requires Ω((N/(ρB)) log_{Z/(ρB)}(N/(ρB))) high-bandwidth block transfers.

Both of these bounds must be satisfied simultaneously. Thus, all together we require Ω((N/B) log_{M/B}(N/B) + (N/(ρB)) log_{Z/(ρB)}(N/(ρB))) block transfers. Since (N/(ρB)) log_{Z/(ρB)} ρ < N/B, this simplifies to Ω((N/B) log_{M/B}(N/B) + (N/(ρB)) log_{Z/(ρB)}(N/B)).

The following corollary explains how the sorting algorithm performs when quicksort is used within the scratchpad, and follows by linearity of expectation. The corollary shows that the algorithm remains optimal as long as the bandwidth of the scratchpad, determined by ρ, is sufficiently large.

Corollary 7. An implementation of sorting using quicksort within the scratchpad uses O((N/B) log_{M/B}(N/B) + (N/(ρB)) lg(M/Z) log_{M/B}(N/B)) block transfers in expectation. This is optimal when ρ = Ω(lg(M/Z)).

IV. SORTING IN PARALLEL

In this section, we show how to sort using a scratchpad that is shared by multiple CPUs. Our objective is to model one compute node of a scratchpad-enhanced supercomputer.

A. Parallel Scratchpad Model

The parallel scratchpad model resembles the sequential model, except that there are p processors all occupying the same chip. Each processor has its own cache of size Z, and all p processors are connected to a single scratchpad and to DRAM. In DRAM the blocks have size B, and in the scratchpad ρB.

In any given block-transfer step of the computation, our model allows for up to p′ processors to perform a block transfer, either from DRAM or from the scratchpad. Thus, although there are p processors, bandwidth limitations reduce the available parallelism to some smaller value p′. (In the sequential case, p′ = 1, and in the fully parallel case, p′ = p.)

B. Other Parallel External-Memory Models

The parallel external-memory (PEM) model [3] has p processors, each with its own cache of size M. These caches are each able to access an arbitrarily large external memory (DRAM in our terminology) in blocks of size B. These block transfers are the only way that processors communicate. This model works with CRCW, EREW, and CREW parallelism, though most research in this model (e.g., [2], [4], [28]) uses CREW.

Another line of study has examined resource-oblivious algorithms, which work efficiently without knowledge of the number of processors [7], [13], [14]. This work uses a similar model to PEM, but concentrates on the algorithmic structure. Subsequent work extended this to the hierarchical multi-level multicore (HM) model, featuring a multi-level hierarchy [12].

The Parallel Disk Model [32] is another model of a two-level hierarchy with multiple processors. However, this model has the important distinction of having multiple disks (multiple DRAMs in our terminology). Multiple disks—rather than a single external memory—lead to fundamentally different algorithmic challenges. Other parallel external-memory models take more parameters into account, such as communication time between processors [31] and multilevel cache hierarchies [8].
C. Parallel Sorting

To obtain a parallel scratchpad sorting algorithm, we parallelize two subroutines from the sequential sort. In particular, we ingest blocks into the scratchpad in parallel, and we sort within the scratchpad using a parallel external-memory sort. These two enhancements reduce the bounds in Theorem 6 by a factor of p′.

We begin by analyzing the bucketizing procedure. We can randomly choose the elements of X and move them into the scratchpad in parallel. To see how to sort in parallel, observe that when we have p processors, each with its own cache, but sharing the same scratchpad, we have the PEM model [3] and can apply PEM sorting algorithms.

Theorem 8 ([3]). Sorting N numbers in the PEM model with p′ processors, each with a cache of size Z and block size L, requires Θ((N/(p′L)) log_{Z/L}(N/L)) block-transfer steps.

By applying Theorem 8, we obtain a bound for a parallel bucketizing scan, replacing the bound from Lemma 4.

Lemma 9. Each parallel bucketizing scan costs O(N/(p′B)) + O((N/(p′ρB)) log_{Z/(ρB)}(M/(ρB))) block-transfer steps, of which O(N/(p′B)) are from DRAM and O((N/(p′ρB)) log_{Z/(ρB)}(M/(ρB))) are from the scratchpad.

Proof: Sampling to obtain X and moving X to the scratchpad requires O(m/p′) block-transfer steps. During the scan, we bring each group of M − m elements into the scratchpad in parallel, using O(M/(Bp′)) block-transfer steps. Then we sort these elements within the scratchpad using O((M/(p′ρB)) log_{Z/(ρB)}(M/(ρB))) block-transfer steps via Theorem 8. Multiplying by the number of groups of M − m elements, Θ(N/M), gives the final bounds.

Combining with Lemma 5, we obtain the total number of block transfers.

Theorem 10. Sorting on a node that allows p′ processors to make simultaneous block transfers requires O((N/(p′B)) log_{M/B}(N/B) + (N/(p′ρB)) log_{Z/(ρB)}(N/B)) block-transfer steps with high probability.

D. Practical Sorting in Parallel

In the algorithm described above, after each "chunk" of values gets sorted and bucketized in the scratchpad, the elements in the scratchpad that belong in bucket i are appended to an array in DRAM that stores all of bucket i's elements. Empirically, the number of elements destined for any given bucket might be small, so these appends can be inefficient.

As a practical improvement, we modify the algorithm to do fewer small memory transfers to DRAM. The practical algorithm, denoted Near-Memory sort or NMsort, operates in two phases, depicted respectively in Figures 2 and 3 and described in detail below. In Phase 1 we sort the chunks of elements and store metadata associated with bucket information rather than explicitly creating individual buckets. The bucket metadata enables us to avoid small memory transfers. Without this innovation, we were unable to exploit the scratchpad effectively.

We start Phase 1 by loading a chunk of Θ(M) elements from DRAM into the scratchpad and sorting it using a parallel sort (Figure 2(a)). That sorted chunk can immediately be scheduled for transfer back to DRAM (Figure 2(b)). This transfer could happen asynchronously as work proceeds, but our prototype implementation simply waits for the transfer to complete. Thus, it is likely that the simulation results we present later could be nontrivially improved.

After sorting and writing, we extract bucket boundaries in parallel from the sorted pivots array. Instead of populating individual buckets with chunk elements at that point, we simply record the bucket boundaries. Specifically, for each Θ(M)-sized chunk of values, we create an array of size |X| = Θ(M/B) where the i-th element gives the index of the first element that belongs in bucket i. We then write out this auxiliary array, called BucketPos in Figure 2(c), to DRAM. We also keep global counts of the number of elements in each bucket in an array, called BucketTot in Figure 2(c). This array remains in the scratchpad throughout both phases, and will hold the aggregate bucket counts from all sorted chunks.

We now consider the memory overhead (extra memory) of this method compared to storing bucket elements contiguously by writing to each bucket at each iteration. The chunks each have Θ(M) elements. The auxiliary array that gives bucket boundaries is of size Θ(M/B). If B (the cache-line size) is 128, then the memory overhead is less than 1%, and larger cache lines reduce the relative overhead.
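The BucketPos extraction admits a simple implementation: since both the chunk and the pivot set are sorted, the first element of each bucket can be found with one binary search per pivot. The following sketch is our own illustration with hypothetical names; it computes BucketPos and accumulates BucketTot for one sorted chunk.

```cpp
#include <algorithm>
#include <vector>
#include <cstdint>

// For one sorted chunk, record where each bucket starts (BucketPos) and
// accumulate global bucket sizes (BucketTot). `pivots` is the sorted sample X;
// bucketPos and bucketTot each have pivots.size() + 1 slots.
void recordBucketMetadata(const std::vector<int64_t>& sortedChunk,
                          const std::vector<int64_t>& pivots,
                          std::vector<size_t>& bucketPos,
                          std::vector<size_t>& bucketTot) {
    const size_t numBuckets = pivots.size() + 1;
    bucketPos[0] = 0;
    for (size_t j = 1; j < numBuckets; ++j)
        // First index whose value exceeds pivots[j-1] = start of bucket j.
        bucketPos[j] = std::upper_bound(sortedChunk.begin(), sortedChunk.end(),
                                        pivots[j - 1]) - sortedChunk.begin();
    for (size_t j = 0; j < numBuckets; ++j) {
        size_t end = (j + 1 < numBuckets) ? bucketPos[j + 1] : sortedChunk.size();
        bucketTot[j] += end - bucketPos[j];          // global running totals
    }
}
```

Both loops over the pivots are embarrassingly parallel, which matches the parallel boundary extraction described above, and the auxiliary array costs only the Θ(M/B) extra space per chunk computed in the preceding paragraph.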

Fig. 2. Phase 1 of NMsort.

Figure 2(d) shows the state of DRAM at the end of Phase 1.

After sorting all Θ(N/M) chunks, we create the final array in Phase 2. Using the array BucketTot, we determine how many buckets can fit into the scratchpad. Specifically, we find the largest k such that Σ_{i=0}^{k} BucketTot[i] ≤ M (Figure 3(a)). We then read in all elements from buckets 0 through k from each of the chunks in DRAM (Figure 3(b)). If i is the index of the first element of bucket k+1, then we read in all elements with indices 0 ... i. We then use multi-way merging to combine the Θ(N/M) sorted strings into a single sorted string. This is the start of the final list, which we write to DRAM, as shown in Figure 3(c). We continue with the next set of buckets that (almost) fills the scratchpad until we are done. For the examples we used in our experiments, we batched thousands of buckets into one transfer. This consolidation was a prerequisite for the simulated-time advantage we show in Section V.
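As an illustration of the Phase 2 batching rule, the following sketch (our own, assuming the BucketTot array produced in Phase 1) greedily selects the largest prefix of remaining buckets whose total size fits in a scratchpad of capacity M.

```cpp
#include <vector>
#include <cstddef>

// Phase 2 batching rule (sketch): starting at bucket `first`, find the
// largest k such that buckets first..k together fit in the scratchpad.
// Returns one past the last bucket in the batch.
size_t nextBucketBatch(const std::vector<size_t>& bucketTot,
                       size_t first, size_t scratchpadCapacity) {
    size_t used = 0, k = first;
    while (k < bucketTot.size() && used + bucketTot[k] <= scratchpadCapacity) {
        used += bucketTot[k];   // greedily add whole buckets
        ++k;
    }
    return k;                   // batch is [first, k)
}
```

Each batch is then assembled by reading the corresponding ranges of every sorted chunk (located via BucketPos) and merging them, for example with a k-way merge like the one sketched in Section II.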

Fig. 3. The beginning of Phase 2 of NMsort.

V. EXPERIMENTS

We implemented the NMsort algorithm described in Section IV-D. As no hardware systems with multi-level memory subsystems of the type targeted by NMsort are currently available, we used the Structural Simulation Toolkit (SST) [25] as a basis for predicting performance.

The implementation of NMsort bucketizes once and then sorts the buckets. Because the inputs for sorting are random 64-bit integers, the buckets will generally be about the same size, and we do not expect any buckets to be large enough to require recursion. The nonrecursive algorithm is sufficient for our experiments. The bucketization is calculated using a multithreaded algorithm that determines bucket boundaries in a sorted list.

For a comparison to existing contemporary single-level-memory sorting algorithms we use the GNU parallel C++ library's multi-way merge sort (originally from the MCSTL [27]). The latter is the fastest CPU-based multithreaded sort from our benchmarking exercises. We also employ the same calls to the GNU parallel sort as a subroutine call in NMsort for sorting integers once they are located in the scratchpad.
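For reference, invoking this sort is a one-liner in libstdc++'s parallel mode. The sketch below is our own minimal example (compile with -fopenmp); the tag overload explicitly requests the multiway-mergesort variant of the parallel sort.

```cpp
#include <parallel/algorithm>  // GNU libstdc++ parallel mode
#include <vector>
#include <cstdint>

// Sort a chunk with GNU parallel multiway mergesort, the call family used
// as our single-level baseline and as NMsort's in-scratchpad subroutine.
void parallelSortChunk(std::vector<int64_t>& chunk) {
    __gnu_parallel::sort(chunk.begin(), chunk.end(),
                         __gnu_parallel::multiway_mergesort_tag());
}
```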

On current hardware, sorting may not be a memory-bandwidth-bound operation. However, we claim that it will become memory-bandwidth bound in future high-performance-computing architectures as memory becomes more distant and the number of cores per NUMA region increases.

A. Experimental Design

We simulated a multi-level memory system with the parameters shown in Figure 4. In this system, we predict that sorting will be memory bound based on the following simple computation:² (1) given the available processing rate and the expected amount of work that needs to be done, compute the expected processing time, and (2) given the available memory bandwidth on/off-chip and the expected amount of memory transfers required, compute the expected time required for memory access. If the expected processing time is much less than the expected memory-transfer time, the application is memory-bandwidth bound.

²We thank Marc Snir for personal communication suggesting when applications become bandwidth bound.

Let x be the processing rate in operations per second (in our case, integer comparisons in the sort). Let y be the memory bandwidth between off-chip memory and on-chip cache (in array elements per second). Further, let N be the number of elements to sort and Z be the number of blocks in on-chip memory (i.e., L1, L2, L3 cache). Up to constants, we derive the inequality:

  (N log N) / x < (N log N) / (y log Z),

because there are N log N comparisons in the algorithm and the minimum aggregate memory transfer is (N log N)/(log m) elements [1]; taking the merge fan-out m to be Θ(Z) gives the right-hand side. Before plugging in the numbers, we can rearrange the inequality:

  y log Z < x.

Thus, the sorting instance size is not a factor in this computation. In our experimental setup, we have (roughly) Z ≈ 10^6, x ≈ 10^10, and y ≈ 10^9. Ignoring constants, we see that these quantities are comparable: 10^9 · log 10^6 ≈ 10^10. In our simulations, we observe that sorting is memory bound if the number of cores is 256 and not memory bound when that number is reduced to 128 (reducing x). We give results for the plausible case of 256 cores on-chip, as in Figure 4. In the future, it is expected that the number of cores per chip will grow past 256. When we run NMsort, we assume a near memory that can store several copies of an array of 10 million 64-bit integers.

• Processor of 256 cores split into quad-core groups
  – Cores run at 1.7GHz
  – Each core has a private 16KB L1 data cache
  – Each core group shares a 512KB L2 cache
  – Each core group has a 72GB/s connection to the on-chip network
• Far (Capacity) Memory (e.g., DRAM)
  – 1066MHz DDR configuration
  – 4 channels
  – 36GB/s connection per channel to the on-chip network
  – Approx. 60GB/s of STREAM Triad performance
  – Block size 64 bytes
• Scratchpad Near Memory
  – 500MHz clock, 50ns constant access latency
  – 8, 16 or 32 channels of memory (2X, 4X and 8X bandwidth)
  – Block size 64 bytes

Fig. 4. Simulation system parameters.
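The back-of-the-envelope test y log Z < x is easy to evaluate mechanically. The sketch below plugs in the approximate values quoted above (Z ≈ 10^6 blocks, x ≈ 10^10 comparisons/s, y ≈ 10^9 elements/s); these are the paper's rough estimates, not measured constants.

```cpp
#include <cmath>
#include <cstdio>

// Rough memory-bound test from Section V-A: the application is predicted to
// be memory-bandwidth bound when processing outpaces transfers, i.e. when
// y * log2(Z) < x (all constants dropped).
int main() {
    const double Z = 1e6;   // blocks of on-chip memory (approx.)
    const double x = 1e10;  // processing rate, comparisons/s (approx.)
    const double y = 1e9;   // memory bandwidth, elements/s (approx.)

    const double transferSide = y * std::log2(Z);  // ~2e10, comparable to x
    std::printf("y*log2(Z) = %.3g vs x = %.3g -> %s\n", transferSide, x,
                transferSide < x ? "memory bound" : "borderline/compute bound");
    return 0;
}
```

With these numbers the two sides are comparable, matching the observation above that whether sorting is memory bound flips between 128 and 256 cores (i.e., as x changes).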

To simulate this system, we use the Structural Simulation Toolkit (SST) [25], a parallel discrete-event simulator that couples a number of architectural simulation components. The full processor is simulated by the Ariel model, which permits an arbitrary (in this case 256) number of cores to be modeled, subject to a single executing application thread per core. The Ariel processor connects to an application running on real hardware through a shared-memory region for high performance.

Each application instruction is trapped using the Pin-tool dynamic instrumentation framework [22]. These instructions, including the executing thread ID and special operation codes (including any specialized memory allocation/free functions), are routed through the shared-memory queues. The virtual Ariel processor cores then extract instructions from their executing thread's queue and issue memory operations into the memory subsystem as required.

Figure 5 shows the simulation configuration, including the local, private L1 data caches, the shared L2 caches, and the on-chip network (named "Merlin") which allows routing of memory requests at the flit level to the various memory regions. Near and far memory timings use the DRAMSim2 simulation component for increased accuracy [26].

Fig. 5. Simulation Architectural Setup: application threads are trapped by a PIN tool and fed to Ariel CPU cores (each with a private L1-D cache and a shared L2), which connect through the Merlin on-chip network, directory controllers, and DMA engines to the "near" and "far" DRAMSim-modeled memories.

B. Results

Results of our simulated NMsort algorithm appear in Table I. We observe increasingly positive results for NMsort as the algorithm is given a higher-performing scratchpad. Each NMsort column shows the bandwidth-expansion factor (the ratio of the scratchpad to the DRAM bandwidth). This represents ρ from Section II. A bandwidth-expansion factor of 8 gives NMsort a wall-clock-time advantage of more than 25% compared to GNU sort. Furthermore, these simulations did not use direct memory access (DMA) transfers. We expect that specialized DMA capabilities could improve the performance of NMsort still further. Note that NMsort makes only half as many DRAM accesses as GNU sort. All of the sorting comparisons done by NMsort involve the scratchpad; the DRAM is used only for storing the blocks that do not fit in the scratchpad.

We also note that related simulations with 128 cores rather than 256 are not memory-bandwidth bound and hence do not benefit from scratchpad usage. However, in our simulations the 128-core runs are predicted to take less wall-clock time. We believe that this anomaly is most likely explained by a lack of GNU-sort strong scaling from 128 to 256 cores. Hence, two prerequisites for the successful application of NMsort are the compute-heavy, memory-bound systems anticipated in the future and scalable parallel sorting algorithms for the scratchpad sorts.
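The headline advantage quoted above can be recomputed directly from the simulated times in Table I; this standalone sketch simply transcribes those values and prints the percentage reduction relative to GNU sort.

```cpp
#include <cstdio>

// Recompute NMsort's wall-clock advantage over GNU sort from Table I.
int main() {
    const double gnu = 898.419;                          // GNU sort, seconds
    const double nmsort[] = {756.731, 693.889, 640.126}; // 2X, 4X, 8X scratchpad
    const int expansion[] = {2, 4, 8};

    for (int i = 0; i < 3; ++i) {
        double advantage = 100.0 * (gnu - nmsort[i]) / gnu;
        std::printf("NMsort (%dX): %.1f%% faster than GNU sort\n",
                    expansion[i], advantage);  // ~15.8%, ~22.8%, ~28.7%
    }
    return 0;
}
```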
performance of NMsort still further. Note that NMsort Memory in the stacks described is likely to be or- makes only half as many DRAM accesses as GNU sort. ganized differently than existing technologies where a All of the sorting comparisons done by NMsort involve single reference activates all dies on the referenced the scratchpad; the DRAM is used only for storing the DIMM. In order to save power and provide greater blocks that do not fit in the scratchpad. We also note that parallelism, a reference will likely move data to/from related simulations with 128 cores rather than 256 are a single memory die. With a sufficiently wide interface not memory-bandwidth bound and hence do not benefit from scratchpad usage. However, in our simulations the 3JEDEC (jedec.org) is a consortium that develops standards for the microelectronics industry. DDR3 and DDR4 memory result from 128 runs are predicted to take less wall clock time. standards developed by JEDEC, which also does many other standard- We believe that this anomaly is most likely explained ization efforts. GNU Sort NMsort (2X) NMsort (4X) NMsort (8X) Sim Time (s) 898.419 756.731 693.889 640.126 Scratchpad Accesses 0 415669342 372160301 368351141 DRAM Accesses 394774287 161440225 162584052 158521515

TABLE I: SST SIMULATION RESULTS FOR VARIOUS SCRATCHPAD NEAR MEMORY BANDWIDTHS.

VI. ARCHITECTURE OF THE SCRATCHPAD

In this section we give a brief overview of the memory architecture that underlies our theoretical model. We start with motivation: why have architects introduced this new architecture? Then we discuss specifics of the design of the architecture.

A. Motivation for the Scratchpad

There are multiple technologies in development that may be a basis for a level of memory between a processor's caches and conventional capacity memory. The WideI/O-2 standard, which is developed by JEDEC,³ has memory parts with four ports that are 128 bits wide each. Such a memory thus has a bandwidth that is between two and four times higher than standard memory DIMMs. There are also several efforts underway that enable multiple memory chips to be "stacked" or vertically packaged on top of each other to reduce wire distances between components and improve performance. When such a part comes to market, it is expected that stacks of between four and eight dies will be economical. The result will be considerably higher bandwidth rates to these memory parts in relatively smaller spaces, and lower power consumption than existing memory technologies.

³JEDEC (jedec.org) is a consortium that develops standards for the microelectronics industry, including the DDR3 and DDR4 memory standards, among many other standardization efforts.

However, the high pin density and electrical issues are likely to prevent stacked memory from becoming a complete replacement for existing memory DIMMs. Due to this limitation, the semiconductor industry is looking at mounting multiple die components either directly on top of CPU chips or, more likely, as multiple dies that are mounted on the same substrate as the CPU. The result will be memory capacities of several gigabytes that can be implemented with higher bandwidths ("near memory") but, in the short term at least, the greatest portion of memory capacity will continue to be built using lower-bandwidth "far" memories such as existing DDR DIMMs (Figure 6).

Fig. 6. Two Level Memory Architecture: a multicore processor connects through directory controllers (16K near DC entries) to near memory on a silicon interposer (2X, 4X, 8X bandwidth) and to far DDR-1066 memory over 4 channels.


Memory in the stacks described is likely to be organized differently than existing technologies, where a single reference activates all dies on the referenced DIMM. In order to save power and provide greater parallelism, a reference will likely move data to/from a single memory die. With a sufficiently wide interface, the memory will be able to provide increased bandwidth performance and somewhat reduced latency.

Our proposed scratchpad architecture bears some similarities to contemporary GPU memory architectures. The memory architecture of the GPU—like the scratchpad—has access times similar to main memory, but provides a higher bandwidth. Since the release of NVIDIA's CUDA 6 SDK, a feature called Unified Virtual Memory (UVM) has allowed for automatic migration of memory between the host DDR and the local GPU-based GDDR memories. However, it is generally accepted in the community that the current UVM implementation provided by NVIDIA can be outperformed by handwritten data-movement operations in the general case [15]. As we move into a future where more complex and elaborate memory hierarchies are likely, our model will become increasingly relevant, particularly in multi-level memory scenarios where automatic migration of pages is unlikely to be available for all memory levels.

B. The DRAM Scratchpad Design

Architecturally, the scratchpad memory is attached behind a directory controller after the last-level cache, as a "peer" to the conventional main memory. Our experiments assume a level-2 cache with separate L1 caches for each core in the CPU (see Figure 7).

Fig. 7. Architectural Setup: cores with private L1 caches (16KB, 2-way, 2ns) attach over snooping buses to shared L2 caches (512KB, 16-way, 10ns), which connect through the network-on-chip (20ns) and DMA engines to the near and far memories.

The scratchpad memory appears as a separate portion of the physical address space. The L2 cache can contain both conventional-memory and scratchpad references. References to scratchpad data from the cache are functionally the same as references to main memory from the cache, on the basis of a fixed address range.

In this work, the conventional and scratchpad memories are accessed through standard load/store instructions. In future work, we will examine the use of DMA engines which can transfer data between the near and far memory in the background, allowing overlap of computation and communication.

1) Why Not Simply a Large Cache? A multi-gigabyte cache would have to devote a substantial area to memory tags. This brings size and performance issues that would raise cost and complexity. Additionally, DRAM has fairly high access latencies, which could make tag comparisons costly. SRAM memories with small latencies could be added to hold the tags, but this would have a significant impact on area and power.

For example, a 4GB DRAM cache consisting of 2^26 64-byte cache lines might require 17 bits of SRAM tag and flag information per data block, or about 136 MB. If SRAM is about 1/24th the density of DRAM (an updated estimate from what is given in [19]), this would be almost 80% of the area of the DRAM data arrays and would consume much more power. Of course, larger page sizes could reduce this overhead, but that invites its own complexities.

By enabling an application to use an intermediate level of memory directly, the application can have a level of control, such as prefetching data that is known to be needed and saving data that does not need further references. Another benefit is that the intermediate memory can be used to hold temporary values in calculations. This directly reduces the bandwidth needed to main memory, which will remain bandwidth limited.
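The tag-overhead estimate above is pure arithmetic, reproduced in the sketch below; the line size, tag bits, and SRAM/DRAM density ratio are the figures assumed in the text.

```cpp
#include <cstdio>

// Reproduce the DRAM-cache tag-overhead estimate from Section VI-B.
int main() {
    const double cacheBytes       = 4.0 * (1ull << 30); // 4GB DRAM cache
    const double lineBytes        = 64.0;               // 64-byte cache lines
    const double tagBitsPerLine   = 17.0;               // tag + flag bits
    const double sramAreaPenalty  = 24.0;               // SRAM ~1/24th DRAM density

    const double lines    = cacheBytes / lineBytes;        // 2^26 lines
    const double tagBytes = lines * tagBitsPerLine / 8.0;  // ~136 MB of SRAM
    const double areaFrac = tagBytes * sramAreaPenalty / cacheBytes;

    std::printf("tag SRAM: %.0f MB, ~%.0f%% of the DRAM array area\n",
                tagBytes / (1 << 20), 100.0 * areaFrac);   // ~80%
    return 0;
}
```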
2) Programmatic Interface: For this set of experiments it is assumed that the scratchpad is given a portion of the memory address space, and that memory accesses (i.e., loads/stores) treat each space identically. We assume a modified malloc() call to allocate a portion of the scratchpad space, and that the OS or runtime handles the virtual-to-physical mapping and contention between applications.
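To make the intended interface concrete, here is a minimal usage sketch. The names scratchpad_malloc and scratchpad_free are hypothetical placeholders, not an API specified in this paper, and they are stubbed with the ordinary heap so the sketch compiles; a real runtime would serve them from the scratchpad's physical address range.

```cpp
#include <cstdlib>
#include <cstddef>

// Hypothetical scratchpad allocator (names are ours, not the paper's).
void* scratchpad_malloc(std::size_t bytes) { return std::malloc(bytes); }
void  scratchpad_free(void* p)             { std::free(p); }

// Usage sketch: keep the bandwidth-critical working buffer in near memory,
// while the bulk input stays in ordinary (far) memory.
void sortChunkInScratchpad(std::size_t n) {
    long* chunk = static_cast<long*>(scratchpad_malloc(n * sizeof(long)));
    if (chunk == nullptr) return;  // scratchpad exhausted: caller must fall back
    // ... load Θ(M) elements from DRAM, sort them, write results back ...
    scratchpad_free(chunk);
}
```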
VII. CONCLUSIONS AND FUTURE WORK

This paper gives a preliminary description of co-design work that provides evidence to support the potential usefulness of a scratchpad near memory for a single multicore node of a future extreme-scale machine. There are a number of ways to improve this work. For example, future systems could take advantage of DMA (direct memory access) to move large blocks between DRAM and scratchpad.

The sorting algorithms take advantage of algorithmically predictable prefetching. Besides the algorithms in this paper, we have preliminary algorithms for k-means clustering which take advantage of prefetching. All our k-means algorithms run a factor of ρ faster using the scratchpad for many sizes of data and k. It remains to determine what other kinds of algorithms can run efficiently on a scratchpad architecture.

VIII. ACKNOWLEDGMENTS

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. This research is also supported in part by NSF grants CNS 1408695, CCF 1439084, IIS 1247726, IIS 1251137, CCF 1217708, and CCF 1114809.

REFERENCES

[1] Aggarwal, A., and Vitter, J. S. The input/output complexity of sorting and related problems. Communications of the ACM 31, 9 (Sept. 1988), 1116–1127.
[2] Ajwani, D., Sitchinava, N., and Zeh, N. Geometric algorithms for private-cache chip multiprocessors. In Proceedings of the Eighteenth Annual European Symposium on Algorithms (ESA) (2010), pp. 75–86.
[3] Arge, L., Goodrich, M. T., Nelson, M., and Sitchinava, N. Fundamental parallel algorithms for private-cache chip multiprocessors. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures (SPAA) (2008), pp. 197–206.
[4] Arge, L., Goodrich, M. T., and Sitchinava, N. Parallel external memory graph algorithms. In Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS) (2010), pp. 1–11.
[5] Banakar, R., Steinke, S., Lee, B.-S., Balakrishnan, M., and Marwedel, P. Scratchpad memory: A design alternative for cache on-chip memory in embedded systems. In Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES) (2002), pp. 73–78.
[6] Bender, M. A., Ebrahimi, R., Fineman, J. T., Ghasemiesfeh, G., Johnson, R., and McCauley, S. Cache-adaptive algorithms. In Proceedings of the Twenty-Fifth Symposium on Discrete Algorithms (SODA) (2014), pp. 116–128.
[7] Bilardi, G., Pietracaprina, A., Pucci, G., and Silvestri, F. Network-oblivious algorithms. In Proceedings of the Twenty-First International Parallel & Distributed Processing Symposium (IPDPS) (2007), pp. 1–10.
[8] Blelloch, G. E., Fineman, J. T., Gibbons, P. B., and Simhadri, H. V. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the 23rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA) (2011), pp. 355–366.
[9] Brodal, G. S., Demaine, E. D., Fineman, J. T., Iacono, J., Langerman, S., and Munro, J. I. Cache-oblivious dynamic dictionaries with update/query tradeoffs. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2010), pp. 1448–1456.
[10] Brodal, G. S., and Fagerberg, R. On the limits of cache-obliviousness. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing (STOC) (2003), pp. 307–315.
[11] Carlini, P., Edwards, P., Gregor, D., Kosnik, B., Matani, D., Merrill, J., Mitchell, M., Myers, N., Natter, F., Olsson, S., Rus, S., Singler, J., Tavory, A., and Wakely, J. The GNU C++ Library Manual. 2012.
[12] Chowdhury, R. A., Ramachandran, V., Silvestri, F., and Blakeley, B. Oblivious algorithms for multicores and networks of processors. Journal of Parallel and Distributed Computing 73, 7 (2013), 911–925.
[13] Cole, R., and Ramachandran, V. Resource oblivious sorting on multicores. In Proceedings of the 37th International Colloquium on Automata, Languages and Programming (ICALP) (2010), pp. 226–237.
[14] Cole, R., and Ramachandran, V. Efficient resource oblivious algorithms for multicores with false sharing. In Proceedings of the Twenty-Sixth International Parallel & Distributed Processing Symposium (IPDPS) (2012), pp. 201–214.
[15] Edwards, H. C., Trott, C., and Sunderland, D. Kokkos, a manycore device performance portability library for C++ HPC applications. Presented at the GPU Technology Conference, 2014.
[16] Frazer, W. D., and McKellar, A. C. Samplesort: A sampling approach to minimal storage tree sorting. Journal of the ACM 17, 3 (1970), 496–507.
[17] Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. Cache-oblivious algorithms. ACM Transactions on Algorithms 8, 1 (2012), 4.
[18] http://www.hpcwire.com/2014/06/24/micron-intel-reveal-memory-slice-knights-landing/.
[19] Kogge, P. Exascale computing: embedded style, 2009. Slides from a talk given at the Fault-Tolerant Spaceborne Computing Employing New Technologies Workshop, Sandia National Laboratories.
[20] Motwani, R., and Raghavan, P. Randomized Algorithms. Chapman & Hall/CRC, 2010.
[21] http://nnsa.energy.gov/mediaroom/pressreleases/trinity.
[22] Patil, H., Cohn, R., Charney, M., Kapoor, R., Sun, A., and Karunanidhi, A. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2004), pp. 81–92.
[23] Prokop, H. Cache-oblivious algorithms. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, June 1999.
[24] Rodrigues, A., Murphy, R., Kogge, P., and Underwood, K. The Structural Simulation Toolkit: A tool for exploring parallel architectures and applications. Tech. Rep. SAND2007-0044C, Sandia National Laboratories, 2007.
[25] Rodrigues, A. F., Hemmert, K. S., Barrett, B. W., Kersey, C., Oldfield, R., Weston, M., Risen, R., Cook, J., Rosenfeld, P., Cooper-Balis, E., and Jacob, B. The Structural Simulation Toolkit. SIGMETRICS Performance Evaluation Review 38, 4 (Mar. 2011), 37–42.
[26] Rosenfeld, P., Cooper-Balis, E., and Jacob, B. DRAMSim2: A cycle accurate memory system simulator. IEEE Computer Architecture Letters 10, 1 (Jan. 2011), 16–19.
[27] Singler, J., Sanders, P., and Putze, F. MCSTL: The multi-core standard template library. In Proceedings of the 13th International Euro-Par Conference on Parallel Processing (Euro-Par) (2007), pp. 682–694.
[28] Sitchinava, N., and Zeh, N. A parallel buffer tree. In Proceedings of the Twenty-Fourth ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2012), pp. 214–223.
[29] Steinke, S., Wehmeyer, L., Lee, B., and Marwedel, P. Assigning program and data objects to scratchpad for energy reduction. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE) (2002), pp. 409–415.
[30] http://insidehpc.com/2014/07/cray-wins-174-million-contract-trinity-supercomputer-based-knights-landing.
[31] Valiant, L. G. A bridging model for multi-core computing. In Proceedings of the Sixteenth Annual European Symposium on Algorithms (ESA) (2008), pp. 13–28.
[32] Vitter, J. S., and Shriver, E. A. Algorithms for parallel memory, I: Two-level memories. Algorithmica 12, 2–3 (1994), 110–147.