Toward Efficient In-Memory Data Analytics on NUMA Systems
Puya Memarzia, Suprio Ray, Virendra C. Bhavsar
University of New Brunswick, Fredericton, Canada
[email protected] [email protected] [email protected]

arXiv:1908.01860v3 [cs.DB] 25 Jan 2020

ABSTRACT

Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures in order to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, achieving NUMA-awareness by rewriting the application from the ground up is not always feasible.

In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. Our findings indicate that the operating system's default configurations can be detrimental to query performance. With extensive experimental evaluation, we demonstrate that methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory data analytics workloads, on three different hardware architectures. Furthermore, we show that these strategies can speed up two popular database systems running a TPC-H workload.

Categories and Subject Descriptors

H.2.4 [Systems]: Query Processing

Keywords

NUMA, Memory Allocators, Memory Management, Concurrency, Database Systems, Operating Systems

© 2019 Copyright held by the owner/author(s). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.

1. INTRODUCTION

The digital world is producing large volumes of data at increasingly higher rates [76, 34, 68]. Data analytics systems are among the key technologies that power the information age. The breadth of applications that depend on efficient data processing has grown dramatically. Main memory query processing systems have been increasingly adopted, due to continuous improvements in DRAM capacity and speed, and the growing demands of the data analytics industry [36]. As the hardware landscape shifts toward greater parallelism and scalability, keeping pace with these changes and maintaining efficiency are key challenges.

The development of commodity CPU architectures continues to be influenced by various obstacles that hinder the speed and quantity of processing cores that can be packed into a single processor die [1]. The power wall motivated the development of multi-core CPUs [22], which have become the de facto industry standard. The memory wall [50, 65] is a symptom of the growing gap between CPU and memory performance, and of the bandwidth starvation of processing cores that share the same memory controller. The demand for greater processing power has pushed the adoption of various decentralized memory controller layouts, which are collectively known as non-uniform memory access (NUMA) architectures. These architectures are widely popular in the server and high-performance workstation markets, where they are used for compute-intensive and data-intensive tasks. NUMA architectures are pervasive in multi-socket and in-memory rack-scale systems. Recent developments have led to On-Chip NUMA Architectures (OCNA) that partition the processor's cores into multiple NUMA regions, each with their own dedicated memory controller [52, 67]. It is clear that the future is NUMA, and that the software stack needs to evolve and keep pace with these changes. Although these advances have opened a path toward greater performance, the burden of efficiently leveraging the hardware has mostly fallen on software developers and system administrators.

Although a NUMA system's memory is shared among all its processors, the access times to different portions of the memory vary depending on the topology. NUMA systems encompass a wide variety of CPU architectures, topologies, and interconnect technologies. As such, there is no standard for what a NUMA system's topology should look like. Due to the variety of NUMA topologies and applications, fine-tuning an algorithm to a single machine configuration will not necessarily achieve optimal performance on other machines. Given sufficient time and resources, applications could be fine-tuned to the different system configurations that they are deployed on. However, in the real world, this is not always feasible. Therefore, it is desirable to pursue solutions that can improve performance across the board, without tuning the code.

In an effort to provide a general solution that speeds up applications on NUMA systems, some researchers have proposed using NUMA schedulers that co-exist with the operating system (OS). These schedulers operate by monitoring running applications in real time, and managing thread and memory placement [7, 15, 47]. The schedulers make decisions based on memory access patterns, and aim to balance the system load. However, some of these approaches are not architecture- or OS-independent. For instance, Carrefour [13] needs an AMD CPU based on the K10 architecture, in addition to a modified OS kernel. Moreover, researchers have argued that these schedulers may not be beneficial for multi-threaded in-memory query processing [58]. A different approach involves either extensively modifying or completely replacing the operating system, with the goal of providing a custom-tailored environment for the application. Some researchers have pursued this direction with the goal of providing an operating system that is more suitable for large database applications [24, 26, 27]. Custom operating systems aim to reduce the burden on developers, but their adoption has been limited due to the high pace of advances in both the hardware and software stack. In the past, researchers in the systems community proposed a few new operating systems for multicore architectures, including Corey [9], Barrelfish [3] and fos [80]. However, none of them were adopted by the industry. We believe that any custom operating system designed for data analytics will follow the same trajectory. On the other hand, these efforts underscore the need to investigate the impact of system and architectural aspects on query performance.

In recent times, researchers in the database community have started to pay attention to the issues with query performance on NUMA systems. These researchers have favored a more application-oriented approach that involves algorithmic tweaks to the application's source code, particularly in the context of query processing engines. Among these works, some are static solutions that attempted to make query operators NUMA-aware [66, 78]. Others are dynamic solutions that focused on work allocation to threads using work-stealing.

Solving this problem without extensively modifying the code requires tools and tuning strategies that are application-agnostic. In this work, we evaluate the viability of several key approaches that aim to achieve this. In this context, the impact and role of memory allocators have been under-appreciated and overlooked. We demonstrate that significant performance gains can be achieved by altering policies that affect thread placement, memory allocation and placement, and load balancing. In particular, we investigate five different workloads that prominently feature joins and aggregations, arguably two of the most popular and computationally expensive workloads used in data analytics. Our study covers the following aspects:

1. Dynamic memory allocators (Section 3.1)
2. Thread placement and scheduling (Section 3.2)
3. Memory placement policies (Section 3.3)
4. Operating system configuration: virtual memory page size and NUMA load balancing (Section 3.4)

An important finding from our research is that the default operating system environment can be detrimental to query performance. For instance, the default Linux memory allocator ptmalloc can perform poorly compared to other alternatives. Furthermore, with extensive experimental evaluation, we demonstrate that it is possible to systematically utilize application-agnostic (or black-box) approaches to obtain speedups on a variety of in-memory data analytics workloads. We show that a hash join workload achieves a 3× speedup on Machine C (see machine topologies in Figure 1 and specifications in Table 3), just from using the tbbmalloc memory allocator. This speedup improves to 20× when we utilize the Interleave memory placement policy and modify the OS configuration. We also show that our findings can carry over to other hardware configurations, by evaluating the experiments on machines with three different hardware architectures and NUMA topologies. Lastly, we show that performance can be improved on two real database systems: MonetDB and PostgreSQL. For example, MonetDB's query latency for the TPC-H workload
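The application-agnostic techniques named above (allocator substitution, the Interleave placement policy, and the kernel's NUMA knobs) can all be applied to an unmodified binary from the Linux command line. The following is a rough sketch of that tooling, not the paper's exact methodology; `./query_app` is a placeholder binary, and library names and paths vary by distribution and TBB version:

```sh
# 1. Swap the default glibc allocator (ptmalloc) for tbbmalloc by
#    preloading TBB's malloc proxy -- no recompilation required:
LD_PRELOAD=libtbbmalloc_proxy.so.2 ./query_app

# 2. Interleave the process's pages round-robin across all NUMA nodes,
#    instead of the default local ("first-touch") placement:
numactl --interleave=all ./query_app

# 3. Inspect or disable the kernel's automatic NUMA load balancing:
sysctl kernel.numa_balancing
sudo sysctl -w kernel.numa_balancing=0

# 4. Check the transparent huge page setting (larger virtual memory pages):
cat /sys/kernel/mm/transparent_hugepage/enabled
```

Because these are process- or system-level switches, they can be toggled per experiment without rebuilding the workload, which is what makes a black-box evaluation of this kind practical.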