Arxiv:1905.04767V1 [Cs.DB] 12 May 2019 Ing Amounts of Data Into Their Complex Computational Data Locality, Limiting Data Reuse Through Caching [121]

Moving Processing to Data On the Influence of Processing in Memory on Data Management Tobias Vin¸con · Andreas Koch · Ilia Petrov Abstract Near-Data Processing refers to an architec- models, on the other hand database applications exe- tural hardware and software paradigm, based on the cute computationally-intensive ML and analytics-style co-location of storage and compute units. Ideally, it will workloads on increasingly large datasets. Such modern allow to execute application-defined data- or compute- workloads and systems are memory intensive. And even intensive operations in-situ, i.e. within (or close to) the though the DRAM-capacity per server is constantly physical data storage. Thus, Near-Data Processing seeks growing, the main memory is becoming a performance to minimize expensive data movement, improving per- bottleneck. formance, scalability, and resource-efficiency. Processing- Due to DRAM's and storage's inability to perform in-Memory is a sub-class of Near-Data processing that computation, modern workloads result in massive data targets data processing directly within memory (DRAM) transfers: from the physical storage location of the data, chips. The effective use of Near-Data Processing man- across the memory hierarchy to the CPU. These trans- dates new architectures, algorithms, interfaces, and de- fers are resource intensive, block the CPU, causing un- velopment toolchains. necessary CPU waits and thus impair performance and scalability. The root cause for this phenomenon lies in Keywords Processing-in-Memory · Near-Data the generally low data locality as well as in the tradi- Processing · Database Systems tional system architectures and data processing algorithms, which operate on the data-to-code principle. It requires data and program code to be transferred to 1 Introduction the processing elements for execution. Although data- to-code simplifies software development and system ar- Over the last decade we have been witnessing a clear chitectures, it is inherently bounded by the von Neu- trend towards the fusion of the compute-intensive and mann bottleneck [55,12], i.e. it is limited by the avail- the data-intensive paradigms on architectural, system, able bandwidth. Furthermore, modern workloads are and application levels. On the one hand, large com- not only bandwidth-intensive but also latency-bound putational tasks (e.g. simulations) tend to feed grow- and tend to exhibit irregular access patterns with low arXiv:1905.04767v1 [cs.DB] 12 May 2019 ing amounts of data into their complex computational data locality, limiting data reuse through caching [121]. Tobias Vin¸con These trends relate to the following recent develop- Data Management Lab ments: (a)Moore's Law is said to be slowing down for Reutlingen University, Germany different types of semiconductor elements and Dennard E-mail: [email protected] scaling [26] has come to an end. The latter postulates Andreas Koch that performance per Watt grows at approximately the Embedded Systems and Applications Group Technische UniversitätDarmstadt, Germany same rate as the transistor count set by Moore's Law. E-mail: [email protected] Besides the scalability of cache-coherence protocols, the Ilia Petrov end of Dennard scaling is, among the frequently quoted Data Management Lab reasons, as to why modern many-core CPUs do not have Reutlingen University, Germany 128 cores that would otherwise be technically possible E-mail: [email protected] by now (see also [46,77]). As a result, improvements 2 Tobias Vin¸conet al. in computational performance cannot be based on the With 3D-stacking NAND Flash chips are not only expectation of increasing clock frequencies, and there- getting denser, but also exhibit higher bandwidth. fore mandate changes in the hardware and software ar- For instance, [58] presents 3D V-NAND with a ca- chitectures. (b) Modern systems can offer much higher pacity of 512 Gb and 1.0 GB/s bandwidth. Storage levels of parallelism, yet scalability and the effective use devices package several of those chips and connect- of parallelism are limited by the programming models ing them over independent channels to the on-device as well as by the amount and the types of data trans- processing element. Hence, aggregate the device-internal fers. (c) Memory wall [117] and Storage Wall. Storage bandwidth, parallelism, and access latencies are sig- (DRAM, Flash, HDD) is getting larger, cheaper but nificantly better than the external ones (device-to- also colder, as access latencies decrease at much lower host). rates. Over the last 20 years DRAM capacities have Consider the following numbers to gain a perspec- grown 128x, DRAM bandwidth has increased 20x, yet tive on the above claims. The on-chip bandwidth DRAM latencies have only improved 1.3x [20]. This of commercially available High-Bandwidth Memory trend also contributes to slow data transfers. (d) Mod- (HBM2) is 256 GB/s per package, whereas a fast ern data sets are large in volume (machine data, scien- 32-bit DDR5 chip can reach 32 GB/s. Upcoming tific data, text) and are growing fast [101]. Hence, they HBM3 is expected to raise that to 512 GB/s per do not necessarily fit in main memory (despite increas- package. Furthermore, a hypothetical 1 TB device ing memory capacities) and are spread across all lev- built on top of [58] will include at least 16 chips, els of the virtual memory/storage hierarchy. (e) Mod- yielding an aggregated on-device bandwidth of 16 ern workloads (hybrid/HTAP or analytics-based such GB/s, which is four times more than state-of-the- as OLAP or ML) tend to have low data locality and art four-lane PCIe 3.0 x4 (approx. 4 GB/s). incur large scans (sometimes iterative) that result in Consequently, processing data close to its physical massive data transfers. The low data locality of such storage location has become economically viable and workloads leads to low CPU cache-efficiency, making technologically feasible over a range of technologies. the cost of data movement even higher. In essence, due to system architectures and processing principles, current workloads require transferring 1.1 Definitions increasing volumes of large data through the virtual memory hierarchy, from the physical storage location to Near-Data Processing (NDP) targets the execution of the processing elements, which limits performance and data-processing operations (fully or partially) in-situ scalability and hurts resource- and energy-efficiency. based on the above trends, i.e within the compute ele- Nowadays, two important technological developments ments on the respective level of the storage hierarchy, open an opportunity to counter these factors. (close to) where data is physically stored and transfer { Trend 1: Hardware manufacturers are able to fabri- the results back, without moving the raw data. The cate combinations of storage and compute elements underlying assumptions are: (a) the result size is much at reasonable costs and package them within the smaller than the raw data, hence, allowing less frequent same device. Consider, for instance, 3D-stacked DRAM and smaller-sized data transfers; or (b) the in-situ com- [51], or modern mass storage devices [54,43] with putation is faster and more efficient than on the host, embedded CPUs or FPGAs. Cost-efficient fabrica- thus, yielding higher performance and scalability, or tion has been a major obstacle to past efforts. Inter- better efficiency. NDP has extensive impact on: estingly, this trend covers virtually all levels of the 1. hardware architectures and interfaces, instruction memory hierarchy: CPU and caches; memory and sets, hardware compute; storage and compute; accelerators { spe- 2. hardware techniques that are essential for present cialized CPUs and storage; and eventually, network programming models, such as: cache-coherence, ad- and compute. dressing and address translation, shared memory. { Trend 2: As magnetic/mechanical storage is being 3. software (OS, DBMS) architectures, abstractions and replaced with semiconductor storage technologies (3D programming models. Flash, Non-Volatile Memories, 3D DRAM), another 4. development toolchains, compilers, hardware/software key trend emerges: the chip- and device-internal band- co-design. width are steadily increasing. (a) Due to 3D-Stacking, Nowadays, we are facing disruptive changes in the the chip-organization and chip-internal interconnect, virtual memory hierarchy (Fig. 1), with the introduc- the on-chip bandwidth with 3D-Stacked DRAM is tion of new levels such as Non-Volatile Memories (NVM) steadily increasing and a function of the density. (b) or Flash. Yet, co-location of storage and computing units Moving Processing to Data 3 becomes possible on all levels of the memory hierarchy. surprising, given the characteristics of magnetic/mechanical NDP emerges as an approach to tackle the \memory storage combined with Amdahl's balanced systems law and storage walls" by placing the suitable processing [42], it is revised with modern technologies. Modern operations on the appropriate level so that execution semi-conductor storage technologies (NVM, Flash) are takes place close to the physical storage location of the offering high raw bandwidth and high levels of paral- data. The placement of data and compute as well as lelism. [14] also raises the issue of temporal locality in DBMS optimizations for such co-placements appear as database applications, which has already been ques- major issues. Depending on the physical data location tioned earlier and is considered to be low in modern and compute-placement several different terms are used workloads, causing unnecessary data transfers. Near- [97,12]: Data Processing presents an opportunity to address it. { Near-Data Processing or In-Situ Processing { are The concept of Active Disk emerged towards the general terms referring to the concept of perform- end of the 1990s. It is most prominently represented by ing data processing operations close to the physical systems such as: Active Disk [6], IDISK [57], and Active data location, independent of the memory hierarchy storage/disk [87].

Arxiv:1905.04767V1 [Cs.DB] 12 May 2019 Ing Amounts of Data Into Their Complex Computational Data Locality, Limiting Data Reuse Through Caching [121]

Memory Hierarchy Memory Hierarchy

Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Make the Most out of Last Level Cache in Intel Processors In: Proceedings of the Fourteenth Eurosys Conference (Eurosys'19), Dresden, Germany, 25-28 March 2019

Migration from IBM 750FX to MPC7447A by Douglas Hamilton European Applications Engineering Networking and Computing Systems Group Freescale Semiconductor, Inc

Caches & Memory

Stealing the Shared Cache for Fun and Profit

IBM Power Systems Performance Report Apr 13, 2021

Case Study on Integrated Architecture for In-Memory and In-Storage Computing

Cache & Memory System

Cache-Fair Thread Scheduling for Multicore Processors

A Cache Line Fill Circuit for a Micropipelined, Asynchronous Microprocessor

ECE 571 – Advanced Microprocessor-Based Design Lecture 17