Emulating Hybrid Memory on NUMA Hardware
Shoaib Akram (Ghent University), Jennifer B. Sartor (Ghent University), Kathryn S. McKinley (Google), Lieven Eeckhout (Ghent University)

Abstract—Non-volatile memory (NVM) has the potential to disrupt the boundary between memory and storage, including the abstractions that manage this boundary. Researchers comparing the speed, durability, and abstractions of hybrid systems with DRAM, NVM, and disk to traditional systems typically use simulation, which makes it easy to evaluate different hardware technologies and parameters. Unfortunately, simulation is extremely slow, limiting the number of applications and dataset sizes in the evaluation. Simulation typically precludes realistic multiprogram workloads and considering runtime and operating system design alternatives. Good methodology embraces a variety of techniques for validation, expanding the experimental scope, and uncovering new insights. This paper introduces an emulation platform for hybrid memory that uses commodity NUMA servers. Emulation complements simulation well, offering speed and accuracy for realistic workloads, and richer software experimentation. We use a thread-local socket to emulate DRAM and the remote socket to emulate NVM. We use standard C library routines to allocate heap memory in the DRAM or NVM socket for use with explicit memory management or garbage collection. We evaluate the emulator using various configurations of write-rationing garbage collectors that improve NVM lifetimes by limiting writes to NVM, and use 15 applications from three benchmark suites with various datasets and workload configurations. We show emulation enhances simulation results. The two systems confirm most trends, such as NVM write and read rates of different software configurations, increasing our confidence for predicting future system effects. In a few cases, simulation and emulation differ, offering opportunities for revealing methodology bugs or new insights. Emulation adds novel insights, such as the non-linear effects of multi-program workloads on write rates. We make our software infrastructure publicly available to advance the evaluation of novel memory management schemes on hybrid memories.

I. INTRODUCTION

Systems researchers and architects have long pursued bridging the speed gap between processor, memory, and storage. Despite many efforts, the increase in processor performance has consistently outpaced memory and storage speeds. Recent advances in memory technologies have the potential to disrupt this speed gap.

On the storage side, emerging non-volatile memory (NVM) technologies with speed closer to DRAM and persistence similar to disk promise to narrow the speed gap between processors and storage. Recent work engineers new filesystem abstractions, storage stacks, programming models, wear-out mitigation schemes, and prototyping platforms to integrate NVM in the storage hierarchy [1], [2], [3], [4], [5], [6], [7].

On the main memory side, NVM promises abundant memory. DRAM is facing scaling limitations [8], [9], and recent work combines DRAM and NVM to form hybrid main memories [10], [11]. DRAM is fast and durable whereas NVM is dense and has low energy. Hardware mitigates NVM wear-out in both its storage and memory roles using wear-leveling and other approaches [11], [10], [12], [13], [14], while the OS keeps frequently accessed data in DRAM [15], [16], [17], [18], [19], [20]. Recent work also explores managed runtimes to mitigate wear-out [21], [22], tolerate faults [23], and keep frequently read objects in DRAM [24]. Collectively, prior research illustrates the substantial opportunities to exploit NVM across all layers, including software and language runtimes.

In this paper, we expand on the methodologies for evaluating NVM and hybrid memories. The dominant evaluation methodology in prior work is simulation; see for example [11], [10], [12], [13], [14], [15], [16], [17], [19]. A few researchers have complemented simulation with architecture-independent measurements [25], [21], [22], but these measurements have limited value because they miss important effects such as CPU caching. This paper shows that emulation confirms the results of simulation and architecture-independent analysis and enables researchers to explore richer software configurations.

The advantage of simulation is that it eases modeling new hardware features, revealing how sensitive results are to architecture. Its major limitation is that it is many orders of magnitude slower than running programs on real hardware. Because time and resources are finite, it thus reduces the scope and variety of architecture optimizations, application domains, implementation languages, and datasets one can explore. Popular simulators also trade off accuracy to speed up simulation [26], [27]. Furthermore, frequent hardware changes, microarchitecture complexity, and hardware’s proprietary nature make it difficult to faithfully model real hardware.

Other research evaluations are increasingly embracing emulation, for instance, emulating cutting-edge hardware on commodity machines to model asymmetric multicores using frequency scaling [28], [29], die-stacked and hybrid memory using DRAM [18], [24], [1], and wearable memory using fault injection software [23]. Recent work using emulation for exploring hybrid memory is either limited to native languages [18], [1] or limited to simplistic heap organizations in the case of managed languages [24] (see also Section VI). Table I compares the methodologies for evaluating hybrid memories, showing that all can lead to insight and that emulation has distinct advantages in speed and software configuration.

                      Simulation   Architecture Independent   Emulation
Speed                 Slow         Fast                       Native
Hardware Diversity    High         N/A                        Modest
Workload Diversity    Low          High                       High
Production Datasets   ✗            ✔                          ✔
Full System Effects   ✗            ✗                          ✔
Realistic Hardware    ✗            ✗                          ✔

TABLE I: Comparing the strengths and weaknesses of evaluation methodologies for hybrid memories. Emulation enables native exploration of diverse workloads and datasets on realistic hardware.

We present the design, implementation, and evaluation of an emulation platform that uses widely available commodity NUMA hardware to model hybrid DRAM-NVM systems. We use the local socket to emulate DRAM and the remote socket to emulate NVM. All threads execute on the DRAM socket. Our heap splits virtual memory into DRAM and NVM virtual memory, which we manage using two free lists, one for each NUMA node, by explicitly specifying where to allocate memory in the C standard library. We expose this hybrid memory to the garbage collector, which directs the OS where in memory (which NUMA node) to map heap regions. In contrast to most prior work, our platform handles both manual memory management routines from the standard C library and memory management using an automatic memory manager (garbage collector). We redesign the memory manager in the popular Jikes Research Virtual Machine (RVM) to add support for hybrid memories. Our software infrastructure is publicly available at <link-anonymized-for-blind-reviewing>.
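As a concrete illustration of this setup, the sketch below uses libnuma on a two-socket Linux machine to keep all threads on the socket that emulates DRAM and to back separate heap regions by the DRAM and NVM sockets. The node numbers, region sizes, and file name are illustrative assumptions, and libnuma is one common interface for requesting node-specific placement; the exact calls in our infrastructure may differ.

    /*
     * Minimal sketch of the NUMA-based emulation setup, assuming a
     * two-socket Linux machine where node 0 emulates DRAM and node 1
     * emulates NVM (node numbers and sizes are illustrative).
     * Compile with: cc emu_sketch.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define DRAM_NODE 0   /* local socket: emulated DRAM */
    #define NVM_NODE  1   /* remote socket: emulated NVM */

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return EXIT_FAILURE;
        }

        /* Keep all threads on the DRAM socket so that every access to the
         * emulated NVM region crosses the socket interconnect. */
        numa_run_on_node(DRAM_NODE);

        /* Malloc-style path: back two regions by specific sockets; a
         * memory manager can carve its DRAM and NVM free lists out of
         * these regions. */
        size_t dram_bytes = 512UL << 20;   /* 512 MB, illustrative */
        size_t nvm_bytes  = 2UL << 30;     /*   2 GB, illustrative */
        void *dram_heap = numa_alloc_onnode(dram_bytes, DRAM_NODE);
        void *nvm_heap  = numa_alloc_onnode(nvm_bytes,  NVM_NODE);
        if (dram_heap == NULL || nvm_heap == NULL) {
            fprintf(stderr, "node-local allocation failed\n");
            return EXIT_FAILURE;
        }

        /* GC-style path: reserve a heap region with mmap and tell the OS
         * which node should back it, as a collector managing its own
         * heap regions would. */
        void *gc_region = mmap(NULL, nvm_bytes, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (gc_region == MAP_FAILED) {
            fprintf(stderr, "mmap failed\n");
            return EXIT_FAILURE;
        }
        numa_tonode_memory(gc_region, nvm_bytes, NVM_NODE);

        /* ... run the workload against the DRAM and NVM regions ... */

        munmap(gc_region, nvm_bytes);
        numa_free(dram_heap, dram_bytes);
        numa_free(nvm_heap, nvm_bytes);
        return EXIT_SUCCESS;
    }

Because the threads never run on the NVM socket, every access to the emulated NVM region pays remote-socket latency and bandwidth, which is what lets the remote socket stand in for slower NVM.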
We evaluate this emulation platform on recently proposed write-rationing garbage collectors for hybrid memories [21]. Write-rationing collectors keep highly mutated objects in DRAM in hybrid DRAM-NVM systems to target longer NVM lifetime. We use 15 applications from three benchmark suites: DaCapo, Pjbb, and GraphChi; two input datasets; seven garbage collector configurations; and workloads consisting of one, two, and four application instances executing simultaneously. We find that emulation results are very similar to simulation results. Emulation also adds new insights, including the following:

• A major portion of the additional writes to memory is due to nursery writes. Kingsguard collectors isolate these writes on DRAM and thus are especially effective in multiprogrammed environments.
• Modern graph processing workloads use larger heaps, and their write rates are higher than those of widely used Java benchmarks. Future work should include such benchmarks when evaluating hybrid memories.
• Addressing the behavior of large objects is essential for memory managers for hybrid memories. Graph applications can see huge reductions in write rates when using Kingsguard collectors because they have many large objects that benefit from targeted optimizations.
• Changing a benchmark’s allocation behavior or input changes write rates. Future work should eliminate useless allocations and use a variety of inputs when evaluating hybrid memories.
• LLC size impacts write rates. Future work should use suitable workloads with emulation on modern servers with large LLCs, or report evaluations for a range of LLC sizes using simulation.
• Graph applications wear PCM out faster than traditional Java benchmarks. Multiprogrammed workloads can also wear PCM out in less than 5 years. Write limiting with Kingsguard collectors brings PCM lifetimes to practical levels.

II. BACKGROUND

This section briefly discusses characteristics of NVM hardware and the role of DRAM in hybrid DRAM-NVM systems. We then discuss write-rationing garbage collection [21], which protects NVM from writes and prolongs memory lifetime. We will evaluate write-rationing garbage collectors in Section V using our emulation platform.
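To make the write-rationing idea concrete before the detailed discussion, the following C fragment is a conceptual sketch only, not the Kingsguard implementation (which lives inside Jikes RVM): new objects are allocated in a DRAM nursery so nursery writes never reach NVM, a write barrier marks mutated objects, and at collection time mutated survivors stay in DRAM while unmutated survivors are copied to NVM. All types and function names are hypothetical.

    /* Conceptual sketch of a write-rationing placement policy. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct object {
        bool   written;   /* set by the write barrier on a field store */
        size_t size;      /* object size in bytes                      */
        /* ... payload ... */
    } object_t;

    /* In the emulator these would draw from the node-specific free
     * lists; plain malloc stands in so the sketch compiles alone. */
    static void *dram_alloc(size_t size) { return malloc(size); }
    static void *nvm_alloc(size_t size)  { return malloc(size); }

    /* All new objects start in the DRAM nursery, so the frequent
     * nursery writes never touch NVM. */
    static object_t *allocate(size_t size) {
        object_t *obj = dram_alloc(size);
        obj->written = false;
        obj->size = size;
        return obj;
    }

    /* Write barrier invoked on every mutation of an object field. */
    static inline void write_barrier(object_t *obj) {
        obj->written = true;
    }

    /* Placement decision for a nursery survivor during a collection:
     * mutated objects are kept in DRAM, the rest are moved to NVM. */
    static void *survivor_target(const object_t *obj) {
        return obj->written ? dram_alloc(obj->size) : nvm_alloc(obj->size);
    }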
A. NVM Drawbacks and Hybrid Memory

A promising NVM technology currently in production is phase change memory (PCM) [30]. PCM cells store information as the change in resistance of a chalcogenide material [31]. During a write operation, electric current heats up PCM cells to high temperatures, and the cells cool down into either an amorphous (high-resistance) or a crystalline (low-resistance) state.