The Mondrian Data Engine

Mario Drumond, Alexandros Daglis, Nooshin Mirzadeh, Dmitrii Ustiugov, Javier Picorel, Babak Falsafi (EcoCloud, EPFL); Boris Grot (University of Edinburgh); Dionisios Pnevmatikatos (FORTH-ICS & ECE-TUC)

ABSTRACT

The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance density of traditional CPU-centric architectures stagnates, advancing compute capabilities necessitates novel architectural approaches. Near-memory processing (NMP) architectures are reemerging as promising candidates to improve efficiency through tight coupling of logic and memory. NMP architectures are especially fitting for data analytics, as they provide immense bandwidth to memory-resident data and dramatically reduce data movement, the main source of energy consumption.

Modern data analytics operators are optimized for CPU execution and hence rely on large caches and employ random memory accesses. In the context of NMP, such random accesses result in wasteful DRAM row buffer activations that account for a significant fraction of the total memory access energy. In addition, utilizing NMP's ample bandwidth with fine-grained random accesses requires complex hardware that cannot be accommodated under NMP's tight area and power constraints. Our thesis is that efficient NMP calls for an algorithm-hardware co-design that favors algorithms with sequential accesses to enable simple hardware that accesses memory in streams. We introduce an instance of such a co-designed NMP architecture for data analytics, the Mondrian Data Engine. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49× and 5×, and efficiency by up to 28× and 5×, respectively.

CCS CONCEPTS

• Computer systems organization → Processors and memory architectures; Single instruction, multiple data; • Information systems → Main memory engines; • Hardware → Die and wafer stacking;

KEYWORDS

Near-memory processing, sequential memory access, algorithm-hardware co-design

1 INTRODUCTION

Large-scale IT services are increasingly migrating from storage to memory because of intense data access demands [7, 17, 51]. Online services such as search and social connectivity require managing massive amounts of data while maintaining a low request tail latency. Similarly, analytic engines for business intelligence are increasingly memory resident to minimize query response time. The net result is that memory is taking center stage in server design as datacenter operators maximize memory capacity integrated per server to increase throughput per total cost of ownership [42, 53].

The slowdown in Dennard scaling has further pushed server designers toward efficient memory access [18, 30, 49]. A large spectrum of data services has moderate computational requirements [20, 36, 43] but large energy footprints primarily due to data movement [16]: the energy cost of fetching a word of data from off-chip DRAM is up to 6400× higher than operating on it [29]. To alleviate the cost of memory access, several memory vendors are now stacking several layers of DRAM on top of a thin layer of logic, enabling near-memory processing (NMP) [35, 46, 63]. The combination of ample internal bandwidth and a dramatic reduction in data movement makes NMP architectures an inherently better fit for in-memory data analytics than traditional CPU-centric architectures, triggering a recent NMP research wave [2, 3, 22, 23, 56, 57].

While improving over CPU-centric systems with NMP is trivial, designing an efficient NMP system for data analytics in terms of performance per watt is not as straightforward [19]. Efficiency implications stem from the optimization of data analytics algorithms for CPU platforms, making heavy use of random memory accesses and relying on the cache hierarchy for high performance. In contrast, the strength of NMP architectures is the immense memory bandwidth they offer through tight integration of logic and memory. While DRAM is designed for random memory accesses, such accesses are significantly costlier than sequential ones in terms of energy, stemming from DRAM row activations [65].

In addition to their energy premium, random memory accesses can also become a major performance obstacle. The logic layer of NMP devices can only host simple hardware, being severely area- and power-limited. Unfortunately, generating enough memory-level parallelism to saturate NMP's ample bandwidth using fine-grained accesses requires non-trivial hardware resources. Random memory accesses result in underutilized bandwidth, leaving a significant performance headroom.

In this work, we show that achieving NMP efficiency requires an algorithm-hardware co-design to maximize the amount of sequential memory accesses and provide hardware capable of transforming these accesses into bandwidth utilization under NMP's tight area and power constraints. We take three steps in this direction: First, we identify that the algorithms for common data analytics operators that are preferred for CPU execution are not equally fitting for NMP. NMP favors algorithms that maximize sequential accesses, trading off algorithmic complexity for DRAM-friendly and predictable access patterns. Second, we observe that most data analytics operators feature a major data partitioning phase, where data is shuffled across memory partitions to improve locality in the following probe phase. Partitioning unavoidably results in random memory accesses, regardless of the choice of algorithm. However, we show that minimal hardware support that exploits a common characteristic in the data layout of analytics workloads can safely reorder memory accesses, thus complementing the algorithm's contribution in turning random accesses to sequential. Finally, we design simple hardware that can operate on data streams at the memory's peak bandwidth, while respecting the energy and power constraints of the NMP logic layer.

Building on our observations, we propose the Mondrian Data Engine, an NMP architecture that reduces power consumption by changing energy-hungry random memory access patterns to sequential DRAM-friendly ones and boosts performance through wide computation units performing data analytics operations on data streams at full memory bandwidth. We make the following contributions:

• Analyze common data analytics operators, demonstrating prevalence of fine-grain random memory accesses. In the context of NMP, these induce significant performance and energy overheads, preventing high utilization of the available DRAM bandwidth.
• Argue that NMP-friendly algorithms fundamentally differ from conventional CPU-centric ones, optimizing for sequential memory accesses rather than cache locality.
• Identify object permutability as a common attribute in the partitioning phase of key analytics operators, meaning that the location and access order of objects in memory can be safely rearranged to improve the efficiency of memory access patterns. We exploit permutability to coalesce random accesses that would go to disjoint locations in the original computation into a sequential stream.
• Show that NMP efficiency can be significantly improved through an algorithm-hardware co-design that trades off additional computations and memory accesses for increased memory access contiguity and uses simple hardware to operate on memory streams. Our Mondrian Data Engine improves the performance of basic data analytics operators by up to 49× and 5×, and efficiency by up to 28× and 5×, as compared to a CPU-centric and a baseline NMP system, respectively.

The paper is organized as follows: §2 provides an overview of data operators and their main phases. We briefly argue for NMP architectures as a good fit for data analytics and analyze their main inefficiencies in §3, which we then address in §4. We present the Mondrian Data Engine in §5, our methodology in §6, and evaluation in §7. Finally, we discuss related work in §8 and conclude in §9.

2 IN-MEMORY DATA OPERATORS

Large-scale IT services often rely on deep software stacks for manipulating and analyzing large data volumes. To enable both iterative and interactive data querying, the data management backend of such services increasingly runs directly in memory to avoid the bandwidth and latency bottleneck of disk I/O [12, 32, 45, 62]. While at higher levels data is queried in a specialized language (e.g., SQL), the software stack eventually converts the queries into data transformations. Much as in conventional database management systems, these transformations rely on a set of physical data operators at their core to plow through data [60]. Table 1 contains a broad set of common data transformations in Spark [21] and the basic data operators needed to implement them in memory: Scan, Group by, Join, and Sort.

    Basic operator | Spark operators
    Scan           | Filter, Union, LookupKey, Map, FlatMap, MapValues
    Group by       | GroupByKey, Cogroup, ReduceByKey, Reduce, CountByKey, AggregateByKey
    Join           | Join
    Sort           | SortByKey

Table 1: Characterization of Spark operators.

Contemporary analytics software is typically column-oriented [1, 12, 21, 40, 50, 71], storing datasets as collections of individual columns, each containing a key and an attribute of a particular type. A dataset is distributed across memory modules where the operators can run in parallel. The simplest form of transformation is a Scan operator where each dataset subset is sequentially scanned for a particular key. More complex transformations require comparing keys among one or more subsets of the dataset.

To execute the comparison efficiently (i.e., maximize parallelism and locality), modern systems partition datasets based on key ranges [68]. For instance, parallelizing a Join would require first a range partitioning of a dataset based on the data items' keys to divide them among the memory modules, followed by an independent Join operation on each subset. A similar data item distribution is required for the Group by and Sort operators.

Table 2 compares and contrasts the algorithmic steps in the basic data operators. All operators can be broken down into two main phases. Starting from data that is initially randomly distributed across multiple memory partitions, the goal of the first phase, partitioning, is to redistribute data so that memory access locality will be maximized in the upcoming operations. The partitioning phase involves data distribution to new partitions, usually based on a hash value of their key, which results in mostly random memory accesses across memory partitions. Prior work on join operators showed that partitioning accounts for roughly half of the total runtime [11, 38, 68], and may completely dominate execution time (> 90%) in highly optimized implementations [10].

The second phase, probe, leverages the data locality created in the first phase, by probing data exclusively located in a single partition and applying the desired operation. In some cases, such as in Join and Group by, a second hash step may be applied before the actual data probe, to facilitate data finding during the upcoming computations. Even though all data accesses in the probe phase are localized in the given data partition, the access pattern is often random. As a result, NMP architectures may struggle to fully utilize their available memory bandwidth. Generating high enough bandwidth demand with random accesses requires tens of outstanding requests, more than a conventional CPU typically supports. In contrast, sequential accesses allow bandwidth utilization with simple hardware, which is a critical target for the severely area- and power-constrained NMP logic. Therefore, algorithms that favor sequential accesses may be preferable to others that are optimized for locality and/or have a better algorithmic complexity.

    Operator  | Partitioning: Histogram build  | Partitioning: Data distribution | Probe: Hash table build | Probe: Operation
    Scan      | -                              | -                               | -                       | Scan keys
    Join      | Hash keys with low order bits  | Copy to partitions              | Hash keys & reorder     | Join by key
    Group by  | Hash keys with low order bits  | Copy to partitions              | Hash keys & reorder     | Group by key
    Sort      | Hash keys with high order bits | Copy to partitions              | -                       | Local sort

Table 2: Phases of basic data operators.
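To make the two-phase structure concrete, the sketch below shows the partitioning phase in plain C for a single source partition, following the histogram-then-distribute steps of Table 2. It is an illustration under our own simplifying assumptions (single thread, power-of-two fanout), not the paper's implementation.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { uint64_t key; uint64_t val; } tuple_t;  /* 16B key/value tuple */

    /* Partitioning phase for one source partition:
     * (i)   histogram build over low-order key bits,
     * (ii)  prefix sum to derive per-partition write cursors,
     * (iii) data distribution (the scattered writes discussed above). */
    void partition(const tuple_t *src, size_t n, tuple_t *dst,
                   size_t *cursor, int radix_bits)
    {
        size_t fanout = (size_t)1 << radix_bits;          /* assumes power of two */
        size_t *hist = calloc(fanout, sizeof *hist);

        for (size_t i = 0; i < n; i++)                    /* (i) histogram build  */
            hist[src[i].key & (fanout - 1)]++;

        for (size_t p = 0, sum = 0; p < fanout; p++) {    /* (ii) prefix sum      */
            cursor[p] = sum;
            sum += hist[p];
        }

        for (size_t i = 0; i < n; i++) {                  /* (iii) distribution   */
            size_t p = src[i].key & (fanout - 1);
            dst[cursor[p]++] = src[i];
        }
        free(hist);
    }

In the probe phase, each destination partition can then be processed independently, which is exactly the locality the partitioning step is meant to create.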

3 CHALLENGES IN DATA ANALYTICS

Performance constraints have made in-memory data analytics a common practice. While moving from disk-based to memory-resident data services highlights the central role of data, traditional CPU-centric platforms for data analytics are still built around compute rather than data. CPU-centric architectures are advocates of centralization, as they require transferring all data from the memory close to the processor, the ingress point where all computations take place, as shown in Fig. 1a. Data lies across several memory partitions, but such distribution has little effect: the data transfer cost from the memory close to the CPU may be somewhat sensitive to the location of the accessed memory partition (e.g., NUMA effects), but is high in any case. CPUs rely on high locality and reuse in their cache hierarchy for efficient operation, amortizing the exorbitant cost of moving data from memory over multiple fast and cheap cache accesses. Unfortunately, this requirement makes CPUs and large-scale analytics ill-matched, as the latter exhibit either no temporal reuse or reuse at intervals beyond the reach of practical on-chip caches [20, 64].

Near-Memory Processing (NMP) is an alternative architectural approach to data analytics, being data-centric instead of compute-centric. Fig. 1b illustrates the main concept of such architectures. Instead of moving data from memory to compute, compute elements are physically placed closer to the target data, near the memory, and the CPU offloads computation to these NMP units, which then independently access memory to perform the requested task. As in the case of CPU-centric architectures, data is distributed across multiple physical memory partitions; but, unlike CPUs, the accessed memory partition affects each NMP compute element's memory access efficiency: accessing a local memory partition is significantly faster and more efficient than a remote one. Hence, simply pushing compute closer to the memory does not entirely solve the data movement bottleneck, as many operators rely on scatter/gather operations that may require accessing one or more remote partitions. Therefore, in the partitioning phase of data operators, data is shuffled across the memory partitions (black arrows in Fig. 1b) to maximize memory access locality [48, 57]. After partitioning, the NMP compute units enjoy low-latency/high-bandwidth access to their local memory partition, avoiding wasteful data movement across the memory network.

Figure 1: Data access on two different architectures. Black arrows indicate expensive data movements. (a) CPU-centric architecture; crossing arrows represent random accesses. (b) NMP architecture; "C" represents an NMP core.

Overall, the data-centric nature of data analytics makes NMP architectures a great fit for such workloads. However, NMP is rather an enabler than a solution. The memory access patterns (§3.1) and the NMP logic's ability to utilize the available memory bandwidth under tight area and power constraints (§3.2) are critical to an NMP system's efficiency.

3.1 Memory access patterns

DRAM-based memory has its idiosyncrasies regarding access patterns, which determine its efficiency. In general, maximizing DRAM's efficiency and effective bandwidth requires sequential accesses instead of random ones.

DRAM is organized as an array of rows, and a DRAM access comprises two actions: a row activation and the transfer of the requested data on the I/O bus. A row activation involves copying the entire DRAM row containing the requested data to a row buffer, regardless of the memory access request's size. The row buffer can then serve requests coming from the upper level of the memory hierarchy (typically cache-block-sized). For DDR3, the row activation energy is 3× higher than the energy of transferring a cache block from/to the row buffer [47]. Because of this energy breakdown, energy efficiency requires amortization of the row activation cost, by leveraging row buffer locality (i.e., accessing all or most of the opened DRAM row's data).

While the memory access energy breakdown is different for die-stacked memory technologies, the trend remains similar [4]. Taking Micron's HMC as an example, the effective row buffer size is 32× smaller than DDR3's (256B vs. 8×1KB). It also supports transfers smaller than a typical 64B cache block (8 or 16B), as NMP compute units may directly access the memory without an intermediate cache hierarchy. Furthermore, the transfer distance in die-stacked memory is in the order of µm, as compared to mm in the case of DDR3. We use CACTI-3DD [14] to estimate the row activation energy overhead in HMC. While it only accounts for 14% of the access energy when a whole row is accessed, it quickly climbs to 80% when only 8B of data are accessed, a common occurrence for applications that are dominated by random memory accesses. HMC is a conservative example, as this energy gap is sensitive to the row buffer size, which is even larger in other die-stacked memory devices (2KB in HBM [35] and 4KB in Wide I/O 2 [34]).

As NMP inherently reduces the energy spent on data movement, the memory access pattern (random vs. sequential) becomes a considerable factor of the memory subsystem's overall efficiency, as row activations account for a major fraction of DRAM's dynamic energy. It is therefore important for data services running on NMP systems to replace fine-grained random accesses with bulk sequential ones.

3.2 Memory-level parallelism

NMP architectures expose immense memory bandwidth to the logic layer. However, the hardware resources that can be placed on the logic layer are tightly area- and power-constrained. Utilizing the available memory bandwidth requires the logic layer to generate enough memory-level parallelism (MLP). Doing so with fine-grained random accesses is challenging, as it requires hardware with a large instruction window and the capability of keeping many concurrent outstanding memory accesses alive at all times. Using streams of sequential accesses instead significantly relaxes the hardware requirements, by enabling utilization of the memory bandwidth with a small number of parallel outstanding memory access sequences.

To illustrate, a single HMC memory partition (vault) offers a peak memory bandwidth of 8GB/s, while the area and power budget of the vault's underlying logic layer is just 4.4mm2 and 312mW, respectively [23, 33, 57]. These limitations preclude the usage of aggressive out-of-order cores as NMP units, which represent the best candidates for achieving high numbers of outstanding random memory accesses. To put numbers into perspective, an ARM Cortex-A57 core with an ROB of 128 entries would—very optimistically—be able to keep about 20 outstanding memory accesses, assuming one 8-byte memory access every 6 instructions and enough MSHRs to support that many misses. With a memory latency of 30ns, even in the ideal case where that MLP is maintained at all times, the memory bandwidth utilization would approach 5.3GB/s. While not a far cry from the peak of 8GB/s, the power of such an ARM Cortex-A57 at 1.8GHz and 20nm is rated at 1.5W, surpassing the NMP power cap by several factors. While prefetchers can help push the bandwidth utilization higher, if the application is dominated by random memory accesses, the prefetched data will rarely be useful and may even be detrimental to the cache's hit ratio [20]. Our experiments (§7) confirm these limitations, showing that even for operators as simple as Scan, the above bandwidth utilization calculation is too optimistic. Overall, conventional MIMD cores cannot sustain a high enough number of parallel fine-grained accesses to utilize the memory's full bandwidth under the tight constraints of an NMP architecture.
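The 5.3GB/s figure above follows directly from the assumed memory-level parallelism; restated as an equation (our notation):

\[
BW \approx \frac{\mathrm{MLP} \times \text{access size}}{\text{memory latency}}
   = \frac{20 \times 8\,\mathrm{B}}{30\,\mathrm{ns}} \approx 5.3\,\mathrm{GB/s}
\]

Sustaining the 8GB/s vault peak with 8B accesses would instead require roughly 30 accesses in flight at all times, which is precisely the hardware state a simple NMP core cannot afford.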
4 EFFICIENT MEMORY ACCESS

Addressing the two efficiency challenges in near-memory analytics presented in §3 requires the synergy of algorithms and hardware. We now discuss the required modifications in data analytics software (§4.1) and the NMP hardware characteristics necessary to utilize the available bandwidth under tight power and area constraints (§4.2).

4.1 Sequential accesses and row buffer locality

Memory access patterns are a direct consequence of the data structures and algorithms used at the application level. Modern software is optimized for execution on CPU-centric architectures, favoring cache-aware data accesses to minimize expensive memory accesses and maximize locality. These algorithms should be revisited in the face of near-memory processing, where efficiency is derived not by caching, but rather by proximity to memory. For NMP architectures, efficient algorithms are those that access memory sequentially, thus allowing simple hardware to exploit rich internal DRAM bandwidth.

4.1.1 Choice of algorithm. Going back to the basic operators (Table 2), each operation's probe phase can be completed by different algorithms. To illustrate, Join algorithms generally fall into two categories: sort- and hash-based. The former target efficiency through sequential memory accesses, albeit at a higher algorithmic complexity; the latter optimize for fast lookups and high cache locality. While the gradual increase of SIMD width in mainstream CPUs may eventually tip the balance in favor of sort-based join algorithms [38], hash-based joins are still preferred for CPU execution [8].

In the case of NMP architectures, two characteristics motivate revisiting the algorithms known to perform best for CPUs: significantly cheaper and faster memory access and a constrained logic capacity, which limits the number of in-flight random memory accesses. These features motivate the use of algorithms with simpler—sequential—access patterns, even at the cost of higher algorithmic complexity (i.e., more passes on the data). In the case of the Join operator, sort-merge join is an alternative to hash join: while it has a higher algorithmic complexity than hash join (O(n log n) vs. O(n)), requiring sorting both relations prior to merge-joining them, it allows execution of the operator using solely sequential memory accesses. The same tradeoff applies to other data operators (e.g., Group by) as well: replacing hash-based approaches with—algorithmically more expensive—sort-based approaches can significantly improve the efficiency of the operator's probe phase.
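The tradeoff can be seen directly in the shape of the two probe loops. The sketch below is ours and deliberately simplified (unique keys in R, power-of-two bucket count); it is not the evaluated implementation.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t key; uint64_t val; } tuple_t;

    /* Hash-join probe: O(n), but each probing tuple performs a dependent,
     * effectively random lookup into R's buckets. */
    size_t hash_probe(const tuple_t *S, size_t n_s, const tuple_t *R,
                      const size_t *bucket_start, size_t n_buckets, tuple_t *out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n_s; i++) {
            size_t b = S[i].key & (n_buckets - 1);        /* random jump into R */
            for (size_t j = bucket_start[b]; j < bucket_start[b + 1]; j++)
                if (R[j].key == S[i].key)
                    out[m++] = (tuple_t){ S[i].key, R[j].val };
        }
        return m;
    }

    /* Merge-join probe: both relations are already sorted, so the probe is two
     * strictly sequential streams; the extra cost is the O(n log n) sort. */
    size_t merge_probe(const tuple_t *R, size_t n_r,
                       const tuple_t *S, size_t n_s, tuple_t *out)
    {
        size_t i = 0, j = 0, m = 0;
        while (i < n_r && j < n_s) {
            if      (R[i].key < S[j].key) i++;
            else if (R[i].key > S[j].key) j++;
            else { out[m++] = (tuple_t){ S[j].key, R[i].val }; j++; }  /* R keys unique */
        }
        return m;
    }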

4.1.2 Data permutability. The access patterns of the operators' probe phase are largely determined by the used algorithm. However, that is not the case for the partitioning phase, which accounts for a major fraction of the operators' total runtime [10, 11, 38, 68]. During partitioning, each computation unit sequentially goes through its local memory partition and determines each data element's destination partition. As multiple partitions execute this data partitioning in parallel, writes arrive at every destination partition interleaved, resulting in random memory accesses. This random access pattern is inherent to such data shuffling and is an unavoidable result of memory message interleaving in the memory network, hence it cannot be controlled at the software/algorithmic level, as is the case for the data operators' probe phase.

Fig. 2 exemplifies a memory access pattern resulting from partitioning. For simplicity, we only show two source partitions, A and B, where all of their data has to end up in the same destination partition. As both source partitions send data, the messages get interleaved in the memory network, resulting in random writes at the destination: no two consecutive writes hit in the same DRAM row.

Figure 2: Partitioning phase: data shuffling across memory partitions.

In reality, multiple DRAM rows can be kept open at the same time, so even interleaved accesses can hit in an open row buffer. However, the probability of an access finding an open row quickly drops with the system size, as more sources shuffle data concurrently. Finally, even though memory controllers have to some extent the ability to reorder incoming requests and prioritize accesses to open rows, the distance between accesses to different locations within a row is typically too long for this scheduling window [65, 66]; maximizing row buffer locality requires more advanced techniques.

We observe that we can turn all random accesses during the partitioning phase of data analytics operators into sequential ones by exploiting a data structure property prevalent in data analytics, with minimal hardware support in the memory controller. Our key insight is that the exact location of data objects within each memory partition is not important, because data operators apply the same computation on every data object in a memory region, much like an unordered data bucket. Memory partitions in the partitioning phase of data operators are commonly treated like hash table buckets. The order of items in a hash table bucket is not important, as it is always treated as a heap of data: finding something in a bucket requires going through all of its contents. Therefore, data objects in a hash table bucket—similarly, in a memory partition—are permutable: any order in the bucket is acceptable, and does not affect the operator's correctness. In Fig. 2's example, writing data objects sequentially into memory as they arrive would not affect correctness, as long as the software will only use that memory region as a sequence of unordered data objects during the upcoming probe phase. Such an approach avoids redundant row buffer activations by guaranteeing that every row buffer is activated exactly once.

The data object permutability insight is broadly applicable to a range of key data analytics operators. It holds for the partitioning phase of the basic Join, Group by, and Sort operators (Table 2) and also extends to more complex operators that build upon these three (e.g., see Table 1 for Spark operators). It also applies to the data partitioning and shuffling phase of MapReduce and any BSP-based graph processing algorithm, or, more generally, to any operation that involves coarse-grained data partitioning in buckets.

4.2 Streaming and SIMD

Revisiting the software-implemented algorithms to expose sequential accesses to hardware brings us half way to our goal of efficient bandwidth utilization; the remaining half requires designing hardware capable of leveraging these sequential accesses to execute data operators at full memory bandwidth. While the exact hardware parameters, such as the available memory bandwidth or the power and area limitations per compute unit, differ across NMP architectures, the main limitations and challenges remain the same. In all cases, a compute unit for an NMP architecture has access to memory bandwidth several times higher than a CPU core, with much stricter area and power constraints. At the same time, NMP compute units should be programmable to be capable of executing a broad range of data analytics operators.

By relying on software to access memory sequentially, NMP units can be programmable processing units capable of fully utilizing their corresponding memory bandwidth by streaming data from memory, thus requiring a small amount of hardware state to sustain the necessary memory-level parallelism. However, computation has to keep up with the data streams in order to sustain peak memory bandwidth. As the vast majority of data analytics operators are data parallel, we observe that a modest degree of SIMD capability is in most cases sufficient to achieve that. SIMD extensions are already commonplace, even in low-power processors that are usable as NMP compute units (e.g., the ARM Cortex-A35 [6]).

5 THE MONDRIAN DATA ENGINE

We now present the Mondrian Data Engine, a novel NMP architecture for data analytics built on top of our algorithm-hardware co-design insights that efficiently utilizes the ample memory bandwidth that is available due to memory proximity.

5.1 Architecture Overview

The Mondrian Data Engine architecture consists of a network of NMP-capable devices. Fig. 3a shows the floorplan of such a network. Each device has a number of memory partitions, and each memory partition features a tightly coupled compute unit capable of operating on data independently. All the partitions are interconnected via both on-chip interconnects and an inter-device network, allowing all compute units to access any memory location in the NMP network. The NMP network is also attached to a CPU, which has a supervisory role: it does not actively participate in the evaluation of the data operators, but initializes all components, launches and orchestrates the execution, and collects the final results. Without loss of generality, we assume a flat physical address space spanning across conventional planar DRAM and the NMP-capable devices. The address space covered by the NMP-capable devices is accessible by the CPU, and can be managed as a unified address space, similar to GPUs [31] and recent work on virtual memory support for NMP systems [55]. NMP compute units can only access the aggregate address space of the NMP-capable devices via physical addresses.

We design the Mondrian Data Engine's compute units to execute common data analytics operations at their local memory partition's peak effective bandwidth. We also apply minimal modifications to the memory controllers to exploit the data permutability opportunity inherent in data analytics. In the rest of this section, we discuss the Mondrian Data Engine's implementation and programming model.
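As a small illustration of why modest SIMD suffices, consider a Scan over a column chunk staged in a stream buffer. Every iteration of the loop below (plain C, our example) is independent, so it maps directly onto eight 16B-tuple lanes of a 1024-bit SIMD unit or onto compiler auto-vectorization.

    #include <stdint.h>
    #include <stddef.h>

    /* keys[]/vals[] model one column-oriented chunk delivered by a stream
     * buffer; iterations carry no dependences, so 8 keys can be tested per
     * SIMD operation. */
    size_t scan_chunk(const uint64_t *keys, const uint64_t *vals, size_t n,
                      uint64_t needle, uint64_t *out_vals)
    {
        size_t found = 0;
        for (size_t i = 0; i < n; i++)
            if (keys[i] == needle)
                out_vals[found++] = vals[i];
        return found;
    }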

5.2 Compute units

We use Micron's HMC [46] as our basic building block, which features a 3D stack of DRAM dies on top of a logic die, interconnected using TSV technology. Each die is segmented into 32 autonomous partitions. As shown in Fig. 3b, a vertical stack of the partitions forms a vault, with each such vault controlled by a dedicated controller residing on the logic die. Compared to conventional DRAM, which is organized into multi-KB rows, the HMC features smaller rows of 256B and a user-configurable access granularity of 8 to 256 bytes. To interface with the CPU, the HMC uses serial SerDes links running a packet-based protocol. Each link operates independently and can route memory requests crossing HMC boundaries. The actual DRAM control logic is distributed among the vault controllers.

Figure 3: Mondrian Data Engine architecture. (a) Floorplan. (b) HMC unit and logic tile (per-vault tile: ARM Cortex-A35 core, 1024-bit SIMD unit, stream buffers, object buffer, vault controller, inter-vault NOC router).

Every HMC vault offers an effective peak memory bandwidth of 8GB/s with an underlying logic layer of 4.4mm2 and a peak power budget of 312mW per vault. As our first-order analysis in §3.2 already showed, using conventional MIMD cores to saturate the bandwidth under that area and power envelope is very challenging. In contrast, the 8GB/s bandwidth target can be easily met by streaming, as long as the software exposes sequential accesses to the hardware. We thus provision the logic layer with eight 384B (1.5× the row buffer size) stream buffers, sized to mask the DRAM access latency and avoid memory-access-related stalls. The stream buffers are programmable and are used to keep a constant stream of incoming data in the form of binding prefetches to feed the compute units. The next challenge lies in processing the incoming memory streams at a rate matching the memory bandwidth.

Going back to the basic data analytics operators, we identify that the most challenging operation on data streams is sorting. Since most algorithms that favor sequential accesses over random ones eventually require sorting data prior to the target operator's probe phase, providing hardware that can operate on data tuples at memory bandwidth is critical to performance. We base the following analysis on 8B/8B key/value tuples, as commonly done in data analytics research [10, 11]. This assumption does not limit our hardware's capabilities; in fact, the smaller the tuples, the more challenging it is for compute units to utilize the available bandwidth. We estimate that we can sort incoming streams of 16B tuples at the available 8GB/s per-vault memory bandwidth with an 8-tuple-wide SIMD unit; hence we require a 128B (1024-bit) SIMD unit. We identify mergesort as the fittest near-memory sort algorithm, as it spends most of the time merging ordered streams of tuples, thus maximizing sequential memory accesses. Mergesort sorts n tuples by performing log n sequential passes on the dataset (O(n log n) algorithmic complexity). In order to match the available bandwidth, and assuming a frequency of 1GHz, we must process a tuple every 4 cycles; hence, with an 8-tuple-wide SIMD unit, we must process a group of 8 tuples every 32 cycles. We can trivially merge 8 streams of tuples into 4 sorted streams in a data-parallel manner under this time budget. We further optimize mergesort with an initial bitonic sort pass, using the SIMD algorithm used in [8], where we sort small groups of tuples that are later merged (intra-stream sorting). This optimization reduces the required number of passes on the dataset by four, which in our system setup (512MB vault filled with 16B tuples) corresponds to a ∼20% reduction in the total number of passes.

We choose the ARM Cortex-A35, a dual-issue, in-order core, as our baseline NMP compute unit, dissipating 90mW at 1GHz in 28nm technology. The Cortex-A35 offers customizable features; we estimate that a single core with 8KB caches and a 128-bit-wide SIMD unit has an area footprint of 0.65mm2, with less than 0.25mm2 attributed to the SIMD unit [27]. Given that data analytics does not require the complexity of floating-point SIMD, we can extend the SIMD unit to 1024 bits for just 2× higher power than the original 128-bit Neon, as illustrated by Qualcomm's Hexagon 680 DSP [25]. This dramatic reduction in power is corroborated by DianNao [15], which shows that fixed-point ALUs are about 8× more efficient than floating-point ALUs. Assuming similar area and power scaling while increasing the SIMD width and replacing floating point with fixed point precision, we estimate the modified ARM Cortex-A35 to be 1.15mm2 and dissipate at most 180mW, staying well within our per-vault area and power budget.
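For reference, the structure of the bitonic first pass can be sketched in scalar C as below; the real engine would apply the same compare-exchange pattern with SIMD min/max across 8 tuple lanes. The function names and group size here are illustrative, and keys are sorted without their payloads for brevity.

    #include <stdint.h>

    static void cmpswap(uint64_t *a, uint64_t *b, int asc) {
        if ((*a > *b) == asc) { uint64_t t = *a; *a = *b; *b = t; }
    }
    static void bitonic_merge(uint64_t *k, int n, int asc) {
        if (n < 2) return;
        for (int i = 0; i < n / 2; i++)          /* independent pairs: SIMD-friendly */
            cmpswap(&k[i], &k[i + n / 2], asc);
        bitonic_merge(k, n / 2, asc);
        bitonic_merge(k + n / 2, n / 2, asc);
    }
    /* Sorts one small group in place (n must be a power of two, e.g., 8 keys). */
    static void bitonic_sort(uint64_t *k, int n, int asc) {
        if (n < 2) return;
        bitonic_sort(k, n / 2, 1);               /* ascending half                   */
        bitonic_sort(k + n / 2, n / 2, 0);       /* descending half -> bitonic input */
        bitonic_merge(k, n, asc);
    }

After such a pre-sort, the per-vault mergesort only needs to merge already-ordered runs, which is what keeps its memory accesses sequential.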

5.3 Exploiting data permutability

We leverage the permutability property of data during the partitioning phase of data analytics to convert random memory accesses to sequential ones. Initially, when the CPU sets up an NMP execution, a per-vault base physical address and size of the destination buffer for the partitioning phase is sent to each vault controller by writing to a set of special memory-mapped registers. Every received write request marked as permutable is sequentially written in the destination buffer, thus maximizing row buffer locality. As the exact per-vault destination buffer size is only known after the data operator's histogram build step of the partitioning phase, the CPU only provides a best-effort overprovisioned estimation.

A subtle but critical aspect of the implementation is that the permutability feature holds on a per-object rather than a per-memory-message basis: if a single object is broken into multiple memory requests, these requests cannot be handled independently and treated as permutable. A solution to this implication is to always send write requests that contain the whole object. A vault controller only makes inter-request and never intra-request memory location permutations. In order to prevent data objects from straddling more than a single memory message, we introduce special object buffers. During the initialization of the permutable memory regions, the software exposes the used object sizes (and hence the granularity of permutability) to the hardware. During the partitioning phase, the hardware drains the object buffer to the vault router only when its contents match the specified object size, thus injecting an object-sized write request into the network. In our implementation, we use a single 256B object buffer per compute unit. The buffer size is determined by the row buffer size and is compliant with the HMC protocol, which allows messages up to 256B long to be exchanged between vault controllers. The object buffer size does preclude using data objects larger than 256B. However, beyond that size, the objects are big enough to exploit row buffer locality without being permuted by the destination vault controller.

5.4 Programming the Mondrian Data Engine

We provide two special functions to mark the code region of the partitioning phase, which generates permutable data object stores (Fig. 4a): shuffle_begin and shuffle_end. Shuffle_begin blocks execution until all participating vault controllers have been set up for the upcoming batching phase and involves two steps. First, every NMP unit computes the amount of data it is going to write to each remote vault, and then writes it to a predefined location of the corresponding remote vault. This information is collected during the histogram build step of the operators' partitioning phase (see Table 2), which is necessary in the baseline CPU system as well. After this information is exchanged, every NMP unit sums all these received values to compute the total size of inbound data. At this point, an NMP unit may identify that the incoming data will overflow the local destination buffer, which was allocated in advance by the CPU. In such a case, an exception may be raised for the CPU to handle, as the histogram build of the partitioning phase should be retried with a second round of partitioning in order to balance the resulting partitions' sizes. We focus on uniform data distributions where this behavior never arises and defer support for skewed datasets to future work.

    //allocate permutable regions per vault
    for (vault_id = 0; vault_id < vault_num; vault_id++)
        perm_array = malloc_permutable(size, object_size, vault_list[vault_id]);
    //enable memory permutability
    shuffle_begin(perm_array);
    ....
    shuffle_end(); // disable memory permutability

    (a) Toggling permutability during partitioning phase.

All the stores between shuffle_begin and shuffle_end falling in the permutable memory region are treated by the receiving memory controllers as permutable. The Mondrian Data Engine does not bound the number of permutable stores in flight during a partitioning phase, and does not enforce any ordering among them. However, a synchronization mechanism to identify the completion of all in-flight stores before execution continues to its next phase (probe) is necessary. The programmer marks this synchronization point in the code with a shuffle_end. When an NMP compute unit reaches that point, it suspends its execution until signaled to continue. We leverage message-signaled interrupts (MSI) as our notification mechanism. Every NMP accelerator has an interrupt vector with a width matching the total number of vaults, each bit representing a vault. When a vault controller finishes writing all expected data (as specified by shuffle_begin), it sends an MSI to all NMP units. When an NMP unit's interrupt controller finds all bits of the interrupt vector set, it wakes up its compute unit to resume execution. The same notification mechanism is used to advance into the partitioning phase after the completion of a shuffle_begin. While this all-to-all communication protocol can be expensive, it is only required at the beginning and end of a long partitioning phase, hence we did not attempt to optimize it.

Fig. 4b's pseudocode illustrates the Mondrian Data Engine's stream buffer management. We provide a programming interface that allows tying a data stream to a buffer and popping the head tuple of the stream to operate on. While the application consumes the stream buffers' head entries, stream buffers continue getting filled from memory, until the user-specified end of the stream is reached.

    //Configure prefetcher to prefetch up to 8 data streams
    //Streams are allocated as follows:
    //Stream 0: [start_addr, start_addr + stream_size)
    //Stream 1: [start_addr + stream_size, start_addr + 2*stream_size)
    //....
    prefetch_in_str_buf(start_addr, stream_size, num_streams);

    while (!all_stream_buffer_done()) {
        tuples = read_stream_heads(tuple_size);

        //Perform some computation with the tuples
        [...]

        //Advance stream heads - notify prefetcher
        pop_input_stream(tuple_size);
        store(modified_tuples, output_addresses);
    }

    (b) Stream buffer management.

    Figure 4: Mondrian Data Engine sample pseudocode.

6 METHODOLOGY

System organization. We evaluate a collection of basic data analytics operators on three different architectures: CPU, NMP, and Mondrian Data Engine. As platform setups for data analytics are memory driven, we use memory capacity as the normalization metric across the evaluated systems. Going after a 32GB system, we deploy four 8GB HMCs. Each modeled HMC features 16 512MB vaults (instead of the 32 256MB vaults of the real device, because of simulation limitations), for a total capacity of 32GB in 64 vaults. Taking into account that a typical CPU setup for data analytics allocates 2-8GB of memory per core (Cloudera) [64], we populate our 32GB CPU system with 16 cores. We assume a tiled chip, connected to four passive HMCs in a star topology (Fig. 5). In the NMP systems, the HMCs are active and fully connected, as illustrated in Fig. 3a.

Figure 5: Modeled CPU-centric architecture.

For the baseline NMP system we assume compute units that resemble the CPU-centric architecture's MIMD cores. We therefore use the best OoO core we can fit under the per-vault power budget: a 3-wide 1GHz core with a 48-entry ROB, similar to a Qualcomm Krait400 [26]. Both the CPU and the NMP baseline systems feature a next-line prefetcher, capable of issuing prefetches for up to three next cache lines.

Simulation. We use Flexus [67], a full-system cycle-accurate simulator, combined with DRAMSim2 [58] to evaluate and compare the three aforementioned systems. Table 3 summarizes the used parameters.

    CPU baseline
      Cores:      ARM Cortex-A57; 64-bit, 2GHz, OoO, RMO;
                  3-wide dispatch/retirement, 128-entry ROB
      L1 Caches:  32KB 2-way L1d, 48KB 3-way L1i; 64-byte blocks,
                  2 ports, 32 MSHRs; 2-cycle latency (tag+data)
      LLC:        Shared block-interleaved NUCA, 4MB, non-inclusive;
                  16-way, 1 bank/tile, 4-cycle hit latency
      Coherence:  Directory-based MESI
    NMP baseline
      Cores:      Qualcomm Krait400; 64-bit, 1GHz, OoO, RMO;
                  3-wide dispatch/retire, 48-entry ROB
      L1 Caches:  Same as CPU-centric
    Mondrian Data Engine
      Cores:      ARM Cortex-A35; 64-bit, 1GHz, in-order, RMO,
                  dual-issue; 1024-bit fixed-point SIMD unit
    Common
      DRAM Organization: 32GB: 8 layers x 16 vaults x 4 stacks
      DRAM Timing:       tCK = 1.6 ns, tRAS = 22.4 ns, tRCD = 11.2 ns,
                         tCAS = 11.2 ns, tWR = 14.4 ns, tRP = 11.2 ns
      NOC:               2D mesh, 16B links, 3 cycles/hop
      Inter-HMC Network: Fully connected for NMP, Star for CPU;
                         SerDes links @ 10GHz: 160Gb/s per direction

Table 3: System parameters for simulation on Flexus.

Evaluated operators. We evaluate four basic operators, namely Scan, Join, Group by, and Sort, which are representative of a broad range of data operators (e.g., see relation to Spark operators in Table 1). For the CPU, our operator implementations are adaptations of the Join algorithm used in [10], based on the radix hash join algorithm described by Kim et al. [38] (source code available [9]). In contrast, the Mondrian Data Engine performs sort-merge join to maximize row buffer locality. In all cases, we use 16-byte tuples comprising an 8-byte integer key and an 8-byte integer payload, representing an in-memory columnar database. We assume a uniform distribution of keys in the input relations.

We describe the Join operator (R ⋈ S) in more detail, being the most complicated of the four basic data operators. We assume that keys in the R and S relations follow a foreign key relationship: every tuple in S is guaranteed to find exactly one join match in R. The algorithm consists of two main phases, as illustrated in Table 2: partitioning, where data is shuffled across partitions, and probe, where the actual join is performed. The Mondrian Data Engine leverages data permutability to improve the efficiency of the partitioning phase, which involves two steps for both the small relation R and the large relation S: (i) building a histogram, and (ii) distributing data to the partitions. In the histogram step, all keys in the source memory partitions are hashed into N buckets, to determine their location in their destination partition, where N is the number of destination partitions. The hash function uses a number of the key's bits to determine each tuple's destination partition. In the code's CPU version, we use the keys' 16 low order bits, optimizing for our modeled system's private cache size. For the NMP systems, we use six bits, matching the total number of NMP vaults. In the second step of the partitioning phase, tuples are copied to their corresponding locations in their destination partitions, as determined in the histogram step.

In the Join operator's probe phase, matching tuples of the R and S relations are joined. This operation is fully localized within each partition. In the case of CPU, the phase starts with building a hash table and computing a prefix sum (similar to the first step of the partitioning phase) for the elements of the smaller relation R. The purpose of this second hashing is to group together keys of the R relation that map to the same hash index, and store them in a contiguous address range (further referred to as an index range). Finally, for each tuple in S, the index range of R that corresponds to the S tuple's key hash is probed, and the matching R tuple in the index range is joined with the probing S tuple. In the case of the Mondrian Data Engine, the probe phase is implemented as a sort-merge join. Thus, all data in the local vault is sorted and the two relations are joined doing a final pass.

For the Group by operator, we altered the last step of the join's algorithm to perform six aggregation functions (avg, count, min, max, sum, and sum squared), which are applied to all the tuple groups. Our modeled Group by query results in an average group size of four tuples. In the partitioning phase of the Sort operator, we use high order bits, as opposed to the low order bits used by the Join, to obtain partitions that contain keys strictly smaller or larger than the keys in any other partition. Then, in the probe phase, all tuples within each partition are sorted using quicksort, in the case of CPU, and mergesort, in the case of the Mondrian Data Engine. The last and simplest operator, Scan, does not have a data partitioning phase; each input data partition is scanned in parallel, and each tuple is compared to the searched value.

Evaluated configurations. We evaluate a number of different system configurations to break down the benefits of the algorithm-hardware co-design the Mondrian Data Engine is based on. In the partitioning phase, we separate the benefits of permutability and of deploying SIMD units by evaluating two configurations in addition to the NMP baseline (NMP) and Mondrian: NMP-perm and Mondrian-noperm. The former uses the NMP baseline's computation units but also leverages permutability; the latter uses Mondrian's SIMD units without leveraging permutability. In the probe phase, we aim to illustrate that deploying an algorithm that favors sequential accesses over random ones is not beneficial without the proper hardware support to translate this favorable memory access pattern into high effective bandwidth utilization and, conversely, performance. We therefore deploy two versions of the Join and Group by operators on the NMP baseline: the hash-based version (random memory accesses, also used by the CPU) and the sort-based version (sequential memory accesses, also used by Mondrian). We refer to these two versions as NMP-rand and NMP-seq, respectively.

Performance model. As the executed code differs across the evaluated systems, we base our performance comparisons on the runtime of each operator on the same datasets. We use the SMARTS sampling methodology [70] to reduce the turnaround time of cycle-accurate simulation, and measure the achieved IPC in each case. We then use fast functional-only simulation to measure the instruction count of each algorithm phase. Finally, we combine the IPC with the number of instructions to estimate the runtime of each operator on each of the architectures (see the formulas after Table 4).

Energy model. We develop a custom energy modeling framework to take the energy consumption of all major system components into account, combined with event counts from Flexus (cache accesses, memory row activations, etc.) to estimate the total expended energy for each experiment. Table 4 summarizes these power and energy estimations. We assume a 28nm technology and estimate LLC power using CACTI 6.5 [41]. We estimate core power based on the core's peak power and its utilization statistics. For the HMC, we scale the energy data reported by Micron [33] and separate the row activation from the access energy by modeling HMC on CACTI-3DD [14]. The HMC features the smallest row buffer size among commercial stacked DRAM devices, resulting in a conservative activation/access row buffer energy ratio. The energy benefits of reducing row buffer activations would be even larger for devices with larger row buffers, such as Wide I/O 2 [34] and HBM [35]. We also estimate HMC background power and energy per operation based on Micron models as implemented in DRAMSim2 [58] and Micron data sheets [47]. Finally, we use previously published data for the NOC [24, 65] and SerDes energy [22, 44].

    System component      Power / Energy
    CPU Core              Power: 2.1W
    NMP Baseline Core     Power: 312mW
    Mondrian Core         Power: 180mW
    LLC                   Access Energy: 0.09nJ; Leakage Power: 110mW
    NOC                   Energy: 0.04pJ/bit/mm; Leakage Power: 30mW
    HMC (per 8GB cube)    Background Power: 980mW; Activation Energy: 0.65nJ;
                          Access Energy: 2pJ/bit
    SerDes                Idle: 1pJ/bit; Busy: 3pJ/bit

Table 4: Power and energy of system components.
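Restated as equations (our notation, not taken verbatim from the paper), the performance and energy models reduce to:

\[
t_{\mathrm{phase}} \approx \frac{N_{\mathrm{instr}}}{\mathrm{IPC} \times f},
\qquad
E_{\mathrm{total}} \approx \sum_{c} N_c\,E_c \;+\; \sum_{c} P^{\mathrm{static}}_{c}\, t_{\mathrm{run}}
\]

where N_instr is the functionally measured instruction count of a phase, IPC the sampled IPC, f the core clock frequency, N_c the Flexus event count for component c (cache accesses, row activations, link traversals), E_c the corresponding per-event energy from Table 4, and P_static the background/leakage power entries.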

7 EVALUATION

We now proceed to evaluate the system configurations described in the methodology section: CPU-centric, NMP-perm, NMP-rand, NMP-seq, Mondrian-noperm, and Mondrian. We use the four common data operators that are representative of data analytics services (Scan, Sort, Group by, Join) to compare all of the above system configurations in terms of performance (§7.1), and energy and efficiency (§7.2).

7.1 Performance

We break down the data operators into their two main phases, namely partitioning and probe. We first show the results for the two phases separately and then combine them to deduce the overall performance of each evaluated system. All performance results appear as speedups over the CPU baseline.

Table 5 shows the speedup for the partitioning phase on NMP, NMP-perm, Mondrian-noperm and Mondrian. The partitioning phase for all operators is almost identical, so we only show the results for the Join operator. The conventional partition algorithm executed by CPU, NMP and Mondrian-noperm builds and maintains a histogram of tuple keys to calculate the exact destination address for each tuple. The histogram manipulation code suffers from heavy data dependencies and results in low bandwidth utilization and, ultimately, poor performance. NMP outperforms the CPU by 58× due to its high parallelism and proximity to memory, but falls short of its full potential with an average per-core IPC of 0.98. Overall, NMP utilizes only 1.0GB/s of memory bandwidth per vault.

    System            Speedup over CPU
    NMP               58×
    NMP-perm          98×
    Mondrian-noperm   142×
    Mondrian          273×

Table 5: Partition speedup vs. CPU.

Mondrian-noperm leverages its SIMD units to outperform NMP by 2.4×, but is still compute-bound. Due to the same data dependency issues observed in NMP, Mondrian-noperm cannot use SIMD instructions throughout the partition loop and manages to utilize 2.4GB/s of memory bandwidth on average.

NMP-perm and Mondrian exploit data permutability during partitioning. NMP-perm achieves a speedup of 1.7× over NMP mainly due to the execution of simpler code. Permutability eschews the need for destination address calculation and greatly reduces dependencies in the code. However, even after leveraging permutability, NMP-perm utilizes only 1.6GB/s of memory bandwidth per vault.

Finally, Mondrian leverages both SIMD and permutability. It uses SIMD instructions across the entire partition loop, enabling data-parallel processing of 8 tuples, and shifts the performance bottleneck to the SerDes links' bandwidth. The outcome is a bandwidth utilization of 4.5GB/s per vault and speedups of 1.9×, 2.8× and 273× over Mondrian-noperm, NMP-perm, and CPU, respectively.

Figure 6: Probe speedup vs. CPU.

Figure 7: Overall speedup vs. CPU.

Fig. 6 shows the probe phase performance results. In the case of Scan, NMP-rand and NMP-seq are identical, as they execute the same code, and achieve a performance improvement of 2.4× over the CPU due to memory proximity. However, NMP cannot fully utilize the memory bandwidth, reaching a per-vault bandwidth utilization of only 2.5GB/s, due to a combination of a narrow pipeline and code with heavy data dependencies. Although the overall CPU performance is lower than NMP, the per-CPU-core bandwidth utilization is 4.3GB/s, thanks to the cores' larger instruction window and higher frequency. Mondrian overcomes NMP's limitations by using wide SIMD operations that allow entire tuples to be processed by a single SIMD instruction, improving performance by 2.6× as compared to the NMP baselines. Mondrian utilizes 6.7GB/s of per-vault memory bandwidth, nearing the theoretical peak of 8GB/s.

Sort shows a similar trend as Scan, with two key changes. The performance gap between NMP and CPU grows due to an increased number of memory accesses. The gap between NMP and Mondrian grows even further because of Sort's higher computational requirements, which are better accommodated by Mondrian's wide SIMD. Both NMP-seq and NMP-rand execute mergesort.

For Group by and Join, we observe that NMP-rand outperforms NMP-seq, even though NMP-seq's IPC of 0.95 is significantly higher than NMP-rand's IPC of 0.24. While NMP-seq is faster due to its sequential memory access patterns, it is not fast enough to compensate for the added log n factor to algorithmic complexity. Mondrian's wide SIMD units absorb this algorithmic complexity bump, achieving speedups of 22× over the CPU and 5× over the best NMP.

Fig. 7 trivially combines the operators' partitioning and probe phases to represent the complete execution. For NMP and NMP-perm, we combine their corresponding partition phase with the best performing probe algorithm, NMP-rand. Mondrian's overall speedup over the CPU and the best NMP baseline, composed of NMP-perm+NMP-rand, peaks at 49× and 5×, respectively.

7.2 Energy and Efficiency

Fig. 8 shows the energy breakdown for CPU, NMP, NMP-perm, and Mondrian across their main components.
7.2 Energy and Efficiency

Fig. 8 shows the energy breakdown for CPU, NMP, NMP-perm, and Mondrian across their main components.

[Figure 8: Energy breakdown. Per-operator (Scan, Sort, Group by, Join) energy of each system, normalized to 100% and split into DRAM dynamic, DRAM static, cores, and SerDes+NOC.]

In the CPU case, DRAM bandwidth is severely underutilized and core energy dominates. However, we simulate a relatively small system due to practical constraints, and we expect commercial systems to have more DRAM per core (i.e., 8GB+ per core as opposed to the evaluated 2GB), making the DRAM energy component more significant. In NMP and Mondrian, lower core power promotes other components as the main sources of system energy. In the NMP architectures, execution is dominated by the probe phase; therefore, the impact of the partitioning mechanism used is small and the energy profiles of NMP and NMP-perm are near-identical. Mondrian's energy breakdown reflects its aggressive DRAM bandwidth utilization: the static-dominated components, namely SerDes and DRAM static energy, account for a relatively smaller share than in NMP.

Finally, Fig. 9 illustrates the overall efficiency improvement, in terms of performance per watt, of each of the three systems over the CPU baseline. While efficiency follows the performance trends, the gains are smaller than the performance improvements, reflecting Mondrian's high utilization of system resources: Mondrian's cores draw higher dynamic power to deliver higher performance and DRAM bandwidth utilization. Overall, Mondrian improves efficiency by 28× over the CPU and by 5× over the best NMP baseline.

[Figure 9: Efficiency improvement vs. CPU (log scale). Performance/energy improvement of NMP, NMP-perm, and Mondrian over the CPU for Scan, Sort, Group by, and Join.]
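The gap between the performance and efficiency gains can be read back-of-the-envelope; the arithmetic below is our own illustration (it assumes the two peak figures refer to the same operator) and not a separately measured result. Since efficiency is performance per watt, the peak 49× speedup and 28× efficiency gain together imply a proportionally higher power draw for Mondrian relative to the CPU baseline:

```latex
% efficiency gain = speedup / power ratio, hence:
\text{power ratio} \;=\; \frac{\text{speedup}}{\text{efficiency gain}}
                   \;\approx\; \frac{49}{28} \;\approx\; 1.75\times
```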
8 RELATED WORK

Near Memory Processing. Several attempts to go down the promising NMP path have been made in the past. However, ambitious projects such as FlexRAM [37], IRAM [54], DIVA [28], and Active Pages [52], which introduced novel NMP architectures, ultimately faced technology as a showstopper. Integrating logic and memory on the same die proved particularly challenging, damaging yields, while thermal concerns severely limited the logic's performance. Hence, the NMP idea remained confined to academic research and eventually faded away. Interest in NMP has recently been rekindled by advancements in die-stacking technology, which enable tight integration of logic and memory while still allowing separate chip manufacturing. Commercial products such as AMD-Hynix's HBM [5, 35], Micron's HMC [46], and Tezzaron's DiRAM [63] are already available, and even though these products are currently used as memory bandwidth boosters, ongoing efforts focus on leveraging the logic layer that lies under the memory dies for NMP.

Several recent proposals seek to make NMP practical and maximize its efficiency. Tesseract [2] is a specialized NMP architecture for graph processing that employs graph-processing-specific prefetching schemes to utilize the ample available memory bandwidth, but does not optimize for memory access efficiency (i.e., row buffer locality). Ahn et al. [3] focus on facilitating the programming of NMP architectures through ISA extensions. Gao et al. [22] propose an NMP architecture with an optimized communication scheme across NMP compute units; it still relies on caches for performance and proposes software-assisted coherence to alleviate coherence overheads. In contrast, we advocate against the traditional approach of relying on cache-optimized algorithms and instead propose adapting the algorithms and designing the NMP hardware to extract performance from the immense memory bandwidth through streaming.

Pioneering work on near-memory data analytics proposed an HMC-based architecture for MapReduce-style applications [56, 57], which share many similarities with the data operators we investigate. However, that approach is limited to mapping the CPU algorithm onto a set of simple in-order cores used as NMP compute units on the HMC's logic layer. As our algorithm-hardware co-design observations illustrate, such an approach leaves significant efficiency improvement headroom.

HRL [23] introduces a novel FPGA/CGRA hybrid as a compute substrate to maximize efficiency within NMP's tight area and power budget. We show that reconfigurable logic is not necessary for efficient bandwidth utilization in NMP: restructuring data analytics algorithms to be streaming-friendly and amenable to wide, data-parallel (SIMD) processing is enough to utilize the available bandwidth within the NMP constraints, while maintaining a familiar programming model. More broadly, NMP favors algorithms that trade off algorithmic complexity for sequential accesses, and such algorithms can achieve high efficiency by maximizing memory bandwidth utilization through streaming and wide computation units.

Dynamic address remapping. Impulse [13] introduces a software-managed address translation layer in the memory controller to restructure data and improve memory bandwidth and effective cache capacity. However, it does not change the internal DRAM data placement and access pattern, so it does not directly improve DRAM access efficiency. Akin et al. [4] also identify that data relocation in 3D memory architectures results in suboptimal memory accesses, and introduce a reshape accelerator that dynamically and transparently remaps physical addresses to DRAM addresses to increase row buffer locality. In contrast, the Mondrian Data Engine changes the physical addresses of the data objects themselves, relying on the end-to-end observation that data analytics operations typically act on buckets of data rather than on individual objects; a bucket's internal organization can therefore be relaxed by the hardware, guided by software hints, for efficiency.

Partitioning acceleration. HARP [68] identifies partitioning as a major fraction of the runtime of data analytics operators and introduces a specialized partitioning accelerator to alleviate that overhead. HARP is synergistic with our partitioning optimization, which exploits memory permutability in data analytics to maximize row buffer locality. HARP always preserves record order by design, conservatively assuming that all tuple outputs are in an "interesting order" [59]. Interesting orders are explicitly marked by the query planner, and the Mondrian Data Engine's design allows enabling the permutability optimization on demand.
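To make the permutability observation concrete, the sketch below shows, in software, the access pattern that the hardware optimization targets. The bucket count, staging-buffer size, hash function, and names are assumptions chosen for illustration; in the Mondrian Data Engine the row-granular placement is performed by hardware guided by software hints, not by an explicit software loop like this one.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sizes: 256 x 32B tuples roughly fill one 8KB DRAM row. */
typedef struct { uint64_t key; uint64_t payload[3]; } tuple_t;

enum { NBUCKETS = 64, ROW_TUPLES = 256 };

typedef struct {
    tuple_t staging[ROW_TUPLES]; /* one DRAM-row-sized staging buffer       */
    size_t  fill;                /* tuples currently staged                  */
    tuple_t *out;                /* bucket's output region (caller-allocated) */
    size_t  written;             /* tuples already written to 'out'          */
} bucket_t;

/* Because intra-bucket order is irrelevant (the bucket is "permutable"),
 * a full staging buffer can be written as one sequential, row-sized burst. */
static void flush_row(bucket_t *b) {
    memcpy(b->out + b->written, b->staging, b->fill * sizeof(tuple_t));
    b->written += b->fill;
    b->fill = 0;
}

void partition_permutable(const tuple_t *in, size_t n, bucket_t bkt[NBUCKETS]) {
    for (size_t i = 0; i < n; i++) {
        bucket_t *b = &bkt[in[i].key % NBUCKETS]; /* radix/hash routing     */
        b->staging[b->fill++] = in[i];
        if (b->fill == ROW_TUPLES)
            flush_row(b);                         /* full row: one burst    */
    }
    for (size_t j = 0; j < NBUCKETS; j++)         /* drain partial rows     */
        flush_row(&bkt[j]);
}
```

Because tuples within a bucket may land in any order, each flush is a full-row sequential burst, which is the source of the row buffer locality that the permutability optimization exploits.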

Data analytics acceleration. With data analytics taking center stage in modern computing, there have been proposals for analytics accelerators from both industry and academia, such as Oracle's DAX [61], Widx [39], and Q100 [69]. Unlike these accelerators, Mondrian's compute units are specifically designed to fit under NMP's tight area and power constraints. On-chip accelerators, like Q100, employ specialized pipelined units to maximize on-chip data reuse and minimize expensive off-chip memory accesses. In contrast, Mondrian leverages its proximity to memory to achieve high performance with simple general-purpose compute units.

9 CONCLUSION
The end of Dennard scaling has made the design of new architectures a one-way road for increasing the performance and energy efficiency of compute systems. At the same time, the modern data deluge intensifies this pressure, as information extraction requires plowing through unprecedented amounts of data. NMP architectures, which have recently become technologically viable, provide a great fit for energy-efficient data-centric computation, as they feature low-power logic that can utilize tightly coupled high-bandwidth memory. While NMP systems are architecturally different from CPU-centric ones, both today execute similarly structured programs: programs that involve many fine-grained random memory accesses. Such accesses are inefficient in NMP systems, however, as they result in an excess of expensive row buffer activations.
In this work, we showed that efficient NMP requires radically different algorithms from the ones typically preferred for CPUs. Instead of relying on cache locality and data reuse, NMP's ample memory bandwidth favors memory streaming through sequential accesses. Adapting modern data analytics to the strengths of NMP architectures requires algorithmic modifications that maximize sequential memory accesses, and hardware support to operate on data streams at memory bandwidth. We materialized our key observations by designing the Mondrian Data Engine, a novel NMP architecture optimized for efficient data analytics. Compared to a CPU-centric and a baseline NMP system, the Mondrian Data Engine improves the performance of basic data analytics operators by up to 49× and 5×, and efficiency by up to 28× and 5×, respectively.

ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers and Arash Pourhabibi for their precious comments and feedback. This work has been partially funded by a Microsoft Research PhD scholarship and the following projects: Nano-Tera YINS, CHIST-ERA DIVIDEND, and Horizon 2020's dRedBox and CE-EuroLab-4-HPC.

REFERENCES
[1] Daniel Abadi, Peter A. Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Foundations and Trends in Databases 5, 3 (2013), 197–280. https://doi.org/10.1561/1900000024
[2] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). 105–117. https://doi.org/10.1145/2749469.2750386
[3] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). 336–348. https://doi.org/10.1145/2749469.2750385
[4] Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Data reorganization in memory using 3D-stacked DRAM. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). 131–143. https://doi.org/10.1145/2749469.2750397
[5] AMD. 2016. High Bandwidth Memory, Reinventing Memory Technology. (2016). Retrieved April 26, 2017 from http://www.amd.com/en-us/innovations/software-technologies/hbm.
[6] ARM. 2017. Cortex-A35 Processor. (2017). Retrieved April 26, 2017 from https://www.arm.com/products/processors/cortex-a/cortex-a35-processor.php.
[7] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. 2012. Workload analysis of a large-scale key-value store. In Proceedings of the ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2012). 53–64. https://doi.org/10.1145/2254756.2254766
[8] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Özsu. 2013. Multi-core, Main-memory Joins: Sort vs. Hash Revisited. Proceedings of the VLDB Endowment 7, 1 (Sept. 2013), 85–96. https://doi.org/10.14778/2732219.2732227
[9] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Özsu. 2013. Multicore hash joins source code. (2013). Retrieved April 26, 2017 from https://www.systems.ethz.ch/node/334/.
[10] Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M. Tamer Özsu. 2013. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Proceedings of the 29th International Conference on Data Engineering (ICDE 2013). 362–373. https://doi.org/10.1109/ICDE.2013.6544839
[11] Spyros Blanas, Yinan Li, and Jignesh M. Patel. 2011. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2011). 37–48. https://doi.org/10.1145/1989323.1989328
[12] Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In Proceedings of the Second Biennial Conference on Innovative Data Systems Research (CIDR 2005). 225–237. http://www.cidrdb.org/cidr2005/papers/P19.pdf
[13] John B. Carter, Wilson C. Hsieh, Leigh Stoller, Mark R. Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael A. Parker, Lambert Schaelicke, and Terry Tateyama. 1999. Impulse: Building a Smarter Memory Controller. In Proceedings of the 5th International Symposium on High-Performance Computer Architecture (HPCA 1999). 70–79. https://doi.org/10.1109/HPCA.1999.744334
[14] Ke Chen, Sheng Li, Naveen Muralimanohar, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2012. CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory. In 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE 2012). 33–38. https://doi.org/10.1109/DATE.2012.6176428
[15] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014). 269–284. https://doi.org/10.1145/2541940.2541967
[16] Bill Dally. 2015. Keynote: Challenges for Future Computing Systems. (2015). Retrieved April 26, 2017 from https://www.cs.colostate.edu/~cs575dl/Sp2015/Lectures/Dally2015.pdf.
[17] Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (2013), 74–80. https://doi.org/10.1145/2408776.2408794
[18] Hewlett-Packard Enterprise. 2015. The Machine: A new kind of computer. (2015). Retrieved April 26, 2017 from http://www.labs.hpe.com/research/themachine/.
[19] Babak Falsafi, Mircea Stan, Kevin Skadron, Nuwan Jayasena, Yunji Chen, Jinhua Tao, Ravi Nair, Jaime H. Moreno, Naveen Muralimanohar, Karthikeyan Sankaralingam, and Cristian Estan. 2016. Near-Memory Data Services. IEEE Micro 36, 1 (2016), 6–13. https://doi.org/10.1109/MM.2016.9
[20] Michael Ferdman, Almutaz Adileh, Yusuf Onur Koçberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. 2012. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012). 37–48. https://doi.org/10.1145/2150976.2150982
[21] Apache Software Foundation. 2017. Apache Spark. (2017). Retrieved April 26, 2017 from http://spark.apache.org/.
[22] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. 2015. Practical Near-Data Processing for In-Memory Analytics Frameworks. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT 2015). 113–124. https://doi.org/10.1109/PACT.2015.22
[23] Mingyu Gao and Christos Kozyrakis. 2016. HRL: Efficient and flexible reconfigurable logic for near-data processing. In Proceedings of the 2016 International Symposium on High Performance Computer Architecture (HPCA 2016). 126–137. https://doi.org/10.1109/HPCA.2016.7446059
[24] Boris Grot, Joel Hestness, Stephen W. Keckler, and Onur Mutlu. 2011. Kilo-NOC: a heterogeneous network-on-chip architecture for scalability and service guarantees. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA 2011). 401–412. https://doi.org/10.1145/2000064.2000112
[25] Linley Group. 2015. Hexagon 680 Adds Vector Extensions. Mobile Chip Report (September 2015).
[26] Linley Gwennap. 2013. Qualcomm Krait 400 hits 2.3 GHz. Microprocessor Report 27, 1 (January 2013), 1–6.
[27] Linley Gwennap. 2015. Cortex-A35 Extends Low End. Microprocessor Report 29, 11 (November 2015), 1–10.
[28] Mary W. Hall, Peter M. Kogge, Jefferey G. Koller, Pedro C. Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John J. Granacki, Jay B. Brockman, Apoorv Srivastava, William C. Athas, Vincent W. Freeh, Jaewook Shin, and Joonseok Park. 1999. Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture. In Proceedings of the ACM/IEEE Conference on Supercomputing (SC 1999). 57. https://doi.org/10.1145/331532.331589
[29] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proceedings of the 43rd Annual International Symposium on Computer Architecture (ISCA 2016). 243–254. https://doi.org/10.1109/ISCA.2016.30
[30] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2011. Toward Dark Silicon in Servers. IEEE Micro 31, 4 (2011), 6–15. https://doi.org/10.1109/MM.2011.77
[31] Mark Harris. 2013. Unified memory in CUDA 6. (2013). Retrieved April 26, 2017 from http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3120-Unified-Memory-CUDA-6.0.pdf.
[32] IBM. 2017. IBM DB2. (2017). Retrieved April 26, 2017 from http://www.ibm.com/analytics/us/en/technology/db2/.
[33] Joe Jeddeloh and Brent Keeth. 2012. Hybrid memory cube new DRAM architecture increases density and performance. In 2012 Symposium on VLSI Technology (VLSIT). IEEE, 87–88. https://doi.org/10.1109/VLSIT.2012.6242474
[34] JEDEC. 2013. Wide I/O 2 Standard. (2013). Retrieved April 26, 2017 from http://www.jedec.org/standards-documents/results/jesd229-2.
[35] JEDEC. 2015. High Bandwidth Memory (HBM) DRAM. (2015). Retrieved April 26, 2017 from https://www.jedec.org/standards-documents/docs/jesd235a.
[36] Svilen Kanev, Juan Pablo Darago, Kim M. Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David M. Brooks. 2015. Profiling a warehouse-scale computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA 2015). 158–169. https://doi.org/10.1145/2749469.2750392
[37] Yi Kang, Wei Huang, Seung-Moon Yoo, Diana Keen, Zhenzhou Ge, Vinh Vi Lam, Josep Torrellas, and Pratap Pattnaik. 1999. FlexRAM: Toward an Advanced Intelligent Memory System. In Proceedings of the IEEE International Conference on Computer Design, VLSI in Computers and Processors (ICCD 1999). 192–201. https://doi.org/10.1109/ICCD.1999.808425
[38] Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. 2009. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs. Proceedings of the VLDB Endowment 2, 2 (Aug. 2009), 1378–1389. https://doi.org/10.14778/1687553.1687564
[39] Yusuf Onur Koçberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin T. Lim, and Parthasarathy Ranganathan. 2013. Meet the walkers: accelerating index traversals for in-memory databases. In Proceedings of the 46th Annual International Symposium on Microarchitecture (MICRO 2013). 468–479. https://doi.org/10.1145/2540708.2540748
[40] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. 2012. The Vertica Analytic Database: C-store 7 Years Later. Proceedings of the VLDB Endowment 5, 12 (Aug. 2012), 1790–1801. https://doi.org/10.14778/2367502.2367518
[41] Sheng Li, Ke Chen, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2011. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques. In Proceedings of the 2011 International Conference on Computer-Aided Design (ICCAD 2011). 694–701. https://doi.org/10.1109/ICCAD.2011.6105405
[42] Kevin T. Lim, Jichuan Chang, Trevor N. Mudge, Parthasarathy Ranganathan, Steven K. Reinhardt, and Thomas F. Wenisch. 2009. Disaggregated memory for expansion and sharing in blade servers. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA 2009). 267–278. https://doi.org/10.1145/1555754.1555789
[43] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2002. Optimizing Main-Memory Join on Modern Hardware. IEEE Trans. Knowl. Data Eng. 14, 4 (2002), 709–730. https://doi.org/10.1109/TKDE.2002.1019210
[44] Mozhgan Mansuri, James E. Jaussi, Joseph T. Kennedy, Tzu-Chien Hsueh, Sudip Shekhar, Ganesh Balamurugan, Frank O'Mahony, Clark Roberts, Randy Mooney, and Bryan Casper. 2013. A Scalable 0.128-1 Tb/s, 0.8-2.6 pJ/bit, 64-Lane Parallel I/O in 32-nm CMOS. J. Solid-State Circuits 48, 12 (2013), 3229–3242. https://doi.org/10.1109/JSSC.2013.2279052
[45] MEMSQL. 2017. MEMSQL: The Fastest In-Memory Database. (2017). Retrieved April 26, 2017 from http://www.memsql.com/.
[46] Micron. 2014. Hybrid Memory Cube Second Generation. (2014). Retrieved April 26, 2017 from http://investors.micron.com/releasedetail.cfm?ReleaseID=828028.
[47] Micron. 2017. DDR3 SDRAM System-Power Calculator. (2017). Retrieved April 26, 2017 from https://www.micron.com/support/tools-and-utilities/power-calc.
[48] Nooshin Mirzadeh, Yusuf Onur Koçberber, Babak Falsafi, and Boris Grot. 2015. Sort vs. hash join revisited for near-memory execution. In Proceedings of the 5th Workshop on Architectures and Systems for Big Data (ASBD 2015). http://acs.ict.ac.cn/asbd2015/papers/ASBD_2015_submission_3.pdf
[49] Cavium Networks. 2014. Cavium Announces Availability of ThunderX: Industry's First 48 Core Family of ARMv8 Workload Optimized Processors for Next Generation Data Center & Cloud Infrastructure. (2014). Retrieved April 26, 2017 from http://www.cavium.com/newsevents-Cavium-Announces-Availability-of-ThunderX.html.
[50] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD 2015). 677–689. https://doi.org/10.1145/2723372.2749436
[51] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. 2013. Scaling Memcache at Facebook. In Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2013). USENIX, 385–398. https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/nishtala
[52] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. 1998. Active Pages: A Computation Model for Intelligent Memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA 1998). 192–203. https://doi.org/10.1109/ISCA.1998.694774
[53] John K. Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazières, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru M. Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman. 2011. The case for RAMCloud. Commun. ACM 54, 7 (2011), 121–130. https://doi.org/10.1145/1965724.1965751
[54] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A case for intelligent RAM. IEEE Micro 17, 2 (Mar 1997), 34–44. https://doi.org/10.1109/40.592312
[55] Javier Picorel, Djordje Jevdjic, and Babak Falsafi. 2016. Near-Memory Address Translation. CoRR abs/1612.00445 (2016). http://arxiv.org/abs/1612.00445
[56] Seth H. Pugsley, Jeffrey Jestes, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. Comparing Implementations of Near-Data Computing with In-Memory MapReduce Workloads. IEEE Micro 34, 4 (2014), 44–52. https://doi.org/10.1109/MM.2014.54
[57] Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In Proceedings of the 2014 International Symposium on Performance Analysis of Systems and Software (ISPASS 2014). 190–200. https://doi.org/10.1109/ISPASS.2014.6844483
[58] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. 2011. DRAMSim2: A Cycle Accurate Memory System Simulator. Computer Architecture Letters 10, 1 (2011), 16–19. https://doi.org/10.1109/L-CA.2011.4
[59] P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD 1979). ACM, New York, NY, USA, 23–34. https://doi.org/10.1145/582095.582099
[60] Minglong Shao, Anastassia Ailamaki, and Babak Falsafi. 2005. DBmbench: fast and accurate database workload representation on modern microarchitecture. In Proceedings of the 2005 Conference of the Centre for Advanced Studies on Collaborative Research. 254–267. https://doi.org/10.1145/1105634.1105653
[61] R. Sivaramakrishnan and S. Jairath. 2014. Next generation SPARC processor cache hierarchy. In IEEE Hot Chips 26 Symposium (HCS), 2014. 1–28. https://doi.org/10.1109/HOTCHIPS.2014.7478828
[62] Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. 36, 2 (2013), 21–27. http://sites.computer.org/debull/A13june/VoltDB1.pdf
[63] Tezzaron. 2017. DiRAM4 3D Memory. (2017). Retrieved April 26, 2017 from http://www.tezzaron.com/products/diram4-3d-memory/.
[64] Stavros Volos, Djordje Jevdjic, Babak Falsafi, and Boris Grot. 2017. Fat Caches for Scale-Out Servers. IEEE Micro 37, 2 (2017), 90–103.
[65] Stavros Volos, Javier Picorel, Babak Falsafi, and Boris Grot. 2014. BuMP: Bulk Memory Access Prediction and Streaming. In Proceedings of the 47th Annual International Symposium on Microarchitecture (MICRO 2014). 545–557. https://doi.org/10.1109/MICRO.2014.44
[66] Thomas F. Wenisch, Michael Ferdman, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. 2008. Temporal streams in commercial server applications. In 4th International Symposium on Workload Characterization (IISWC 2008). 99–108. https://doi.org/10.1109/IISWC.2008.4636095
[67] Thomas F. Wenisch, Roland E. Wunderlich, Michael Ferdman, Anastassia Ailamaki, Babak Falsafi, and James C. Hoe. 2006. SimFlex: Statistical Sampling of Computer System Simulation. IEEE Micro 26, 4 (2006), 18–31. https://doi.org/10.1109/MM.2006.79
[68] Lisa Wu, Raymond J. Barker, Martha A. Kim, and Kenneth A. Ross. 2013. Navigating big data with high-throughput, energy-efficient data partitioning. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA 2013). 249–260. https://doi.org/10.1145/2485922.2485944
[69] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: the architecture and design of a database processing unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014). 255–268. https://doi.org/10.1145/2541940.2541961
[70] Roland E. Wunderlich, Thomas F. Wenisch, Babak Falsafi, and James C. Hoe. 2003. SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA 2003). 84–95. https://doi.org/10.1109/ISCA.2003.1206991
[71] Marcin Zukowski, Mark van de Wiel, and Peter A. Boncz. 2012. Vectorwise: A Vectorized Analytical DBMS. In Proceedings of the 28th International Conference on Data Engineering (ICDE 2012). 1349–1350. https://doi.org/10.1109/ICDE.2012.148
