Enabling Efficient and Transparent Remote Memory Access in Disaggregated Datacenters

Nathan Pemberton John D. Kubiatowicz, Ed. Randy H. Katz, Ed.

Electrical Engineering and Computer Sciences, University of California at Berkeley

Technical Report No. UCB/EECS-2019-154 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-154.html

December 1, 2019 Copyright © 2019, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

I would like to first thank Emmanuel Amaro who was integral to the design, motivation, and hardware implementation of the PFA. Howard Mao contributed significantly to general infrastructure and implemented the memory blade. The FireSim simulator was the work of many in the Berkeley architecture research group (BAR), especially Sagar Karandikar. I would also like to thank those in the BAR group, and the RISC-V Foundation for supplying the underlying infrastructure for this research. Finally, I acknowledge my advisers Prof. Randy Katz and Prof. John Kubiatowicz, as well as Prof. Krste Asanovic, for their guidance throughout this project.

Enabling Efficient and Transparent Remote Memory Access in Disaggregated Datacenters

by Nathan Pemberton

Research Project

Submitted to the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, in partial satisfaction of the requirements for the degree of Master of Science, Plan II.

Approval for the Report and Comprehensive Examination:

Committee:

Professor Randy Katz Research Advisor

(12/15/2017)

* * * * * * *

Professor John Kubiatowicz Research Advisor

(12/15/2017)

Abstract

Researchers from industry and academia have recently proposed to disaggregate memory in warehouse-scale computers, motivated by the increasing performance of networks and a proliferation of novel memory technologies. In a system with memory disaggregation, each compute node contains a modest amount of fast memory (e.g., high-bandwidth DRAM integrated on-package), while large-capacity memory or non-volatile memory is made available across the network through dedicated memory nodes. One common proposal to harness the fast local memory is to use it as a large cache for the remote bulk memory. This cache could be implemented purely in hardware, which could minimize latency, but may involve complicated architectural changes and would lack OS insights into memory usage. An alternative is to manage the cache purely in software with traditional paging mechanisms. This approach requires no additional hardware, can use sophisticated algorithms, and has insight into memory usage patterns. However, our experiments show that even when paging to local memory, applications can be slowed significantly due to the overhead of handling page faults, which can take several microseconds and pollute the caches. In this thesis, I propose a hybrid HW/SW cache using a new hardware device called the “page fault accelerator” (PFA), with a special focus on the impact on OS design and performance. With the PFA, applications spend 2.5x less time managing paging, and run up to 40% faster end-to-end.

Contents

1 Introduction
2 Motivation and Background
  2.1 FireBox
  2.2 Interfaces to Remote Memory
    2.2.1 Low-Level Interfaces
    2.2.2 Software Interfaces
  2.3 OS Paging Overview
  2.4 Paging Limitations
3 Page Fault Accelerator
  3.1 Page Fault Accelerator Design
    3.1.1 Page Table Entry
    3.1.2 Remote Memory Protocol
4 Implementation
  4.1 PFA Reference Implementation
  4.2 Hardware Implementation
    4.2.1 RocketCore and RocketChip
    4.2.2 Memory Blade
    4.2.3 FireSim
  4.3 Linux Integration
    4.3.1 Non-PFA Paging in Linux
    4.3.2 PFA Modifications
    4.3.3 kpfad
    4.3.4 Baseline Swapping
5 Evaluation
  5.1 Experimental Design
  5.2 End-to-End Performance
    5.2.1 Analysis
6 Future Work
7 Conclusion
Appendices
  A Reproducibility
  B PFA Specification
    B.1 Overview
      B.1.1 Basic Usage
      B.1.2 Free Frame and New Page Queue Management
      B.1.3 Limitations
    B.2 RISC-V Standards
    B.3 PTE
    B.4 MMIO
      B.4.1 FREE
      B.4.2 FREE STAT
      B.4.3 EVICT
      B.4.4 EVICT STAT
      B.4.5 NEW PGID
      B.4.6 NEW VADDR
      B.4.7 NEW STAT
      B.4.8 INIT MEM
  C Memory Blade Protocol
    C.1 Ethernet header
    C.2 Request Packet
    C.3 Atomic/Small Payload
    C.4 Response Packet
Glossary
Acronyms

1 Introduction

Traditional data center design aggregates all necessary resources (e.g., disk, memory, power supply, etc.) into many self-contained server chassis. This design was motivated by the ability to leverage commodity PC components and networks[1]. Additionally, an aggregated design was desirable because in-chassis interconnects were significantly faster than networks. However, data center-side compute has grown into an important independent market, leading to specialized server platforms and networks (often called warehouse-scale computers (WSCs)). Furthermore, networking technology has seen a rapid increase in performance, with 40 Gbit/s Ethernet becoming commonplace, and 100+ Gbit/s networks readily available, narrowing the gap between off-package DRAM and remote memory. Workloads have also changed; applications are fundamentally distributed (e.g., service-oriented architecture, map-reduce, etc.), use larger and rapidly changing datasets (“Big Data”), and demand latencies that can only be delivered by in-memory processing. Finally, a number of promising new memory technologies are becoming available. New non-volatile memory (NVM) devices are being introduced that promise low idle power, high density, and near-DRAM performance (e.g., fast NAND, phase-change, memristor). On the high-performance side, improvements in packaging technology have led to fast on-package DRAM (e.g., HBM) that offers hundreds of GB/s of bandwidth with capacities in the tens of GB.

These hardware and software trends have led to proposals from both academia[2][3] and industry[4][5][6][7] for a new style of WSC where resources are disaggregated. In a disaggregated WSC, resources like disk and memory become first-class citizens over a high-performance network. A compute node couples CPUs, network interfaces, and a small amount of high-speed memory into a self-contained system in package (SiP). This design allows data center operators to scale memory capacity while allocating it more flexibly (avoiding stranded resources and complex resource allocation policies). However, the memory access latency will be higher than traditional off-package DRAM, and bandwidth may be limited or subject to congestion. The small on-package memory allows us to mitigate some of this performance gap, but the question remains: how best to use it?

One way to harness the on-package DRAM is to use it as a large cache for the remote bulk memory. Operating systems have traditionally provided this through virtual memory, which uses paging to treat local physical memory as a software-managed cache (typically for disk). Indeed, several recent academic research projects have proposed using paging over remote direct memory access (RDMA) as a way of disaggregating memory[8][9]. Paging has traditionally been backed by slow disks with access latencies in the milliseconds. This led to sophisticated algorithms that can take several microseconds for every cache miss. An alternative is to use fully hardware-managed DRAM caches[10][11]. These eliminate much of the overhead, but lack the sophistication and application-level insight of OS-based approaches. For example, operating systems often use significant memory for optimistic pre-fetching and caching of disk blocks. A hardware-managed cache may choose to store these in remote memory, while the OS would simply delete them.

This thesis introduces a hardware accelerator for OS-managed caching called the page fault accelerator (PFA).
The PFA works by handling latency-critical page faults (cache misses) in hardware, while allowing the OS to manage latency-insensitive (but algorithmically complex) evictions asynchronously. We achieve this decoupling with a queue of free page frames (freeQ) to be used by the PFA for fetched pages, and a queue of new page descriptors (newQ) that the OS can use to manage new page meta-data. Execution then proceeds as follows:

• The OS allocates several page frames and pushes their addresses onto the freeQ.

• The OS experiences memory pressure and selects pages to evict to remote memory. It marks them as “remote” in the page tables and then provides them to the PFA for eviction.

• The application attempts to access a remote page, triggering the PFA to request the page from remote memory and place it in the next available free frame. The application is then resumed.

• Some time later (either through a background daemon, or through an interrupt due to full queues), the OS pops all new page descriptors off the newQ and records the (now local) pages in its meta-data. The OS typically provides more free frames at this time.

A sketch of the OS side of this flow is given below.
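The following sketch illustrates what the OS side of this protocol might look like. It is illustrative only: the register names follow the MMIO interface listed in Appendix B (FREE, FREE STAT, EVICT, NEW PGID, NEW STAT), but the base address, access widths, and the exact status-register semantics assumed here are hypothetical.

/* Illustrative sketch of the OS side of the PFA queue protocol.
 * Register names follow Appendix B; the base address and the exact
 * semantics of the status registers are assumptions made for this
 * example, not part of the specification. */
#include <stdint.h>

#define PFA_BASE        0x10017000UL  /* hypothetical MMIO base address  */
#define PFA_REG(off)    (*(volatile uint64_t *)(PFA_BASE + (off)))
#define PFA_FREE        PFA_REG(0x00) /* push a free frame (phys. addr.) */
#define PFA_FREE_STAT   PFA_REG(0x08) /* assumed: free slots in freeQ    */
#define PFA_EVICT       PFA_REG(0x10) /* push a PTE address to evict     */
#define PFA_NEW_PGID    PFA_REG(0x20) /* pop the pageID of a new page    */
#define PFA_NEW_STAT    PFA_REG(0x30) /* assumed: entries in newQ        */

/* Step 1: provision free frames for the PFA to use on future fetches. */
static void pfa_provision_frames(const uint64_t *frame_paddrs, int n)
{
    for (int i = 0; i < n && PFA_FREE_STAT > 0; i++)
        PFA_FREE = frame_paddrs[i];
}

/* Step 2: evict a page whose PTE the OS has already marked remote. */
static void pfa_evict_page(uint64_t pte_paddr)
{
    PFA_EVICT = pte_paddr;  /* completion handling is shown in Section 3.1 */
}

/* Step 4: drain the newQ and fold fetched pages back into OS metadata. */
static void pfa_bookkeep(void)
{
    while (PFA_NEW_STAT > 0) {
        uint64_t pgid = PFA_NEW_PGID;
        /* update LRU lists, reverse mappings, accounting for this page */
        (void)pgid;
    }
}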

2 Motivation and Background

2.1 FireBox

While there are several proposals for a disaggregated warehouse-scale computer, we will use the Firebox[2] project as an example throughout this thesis (see Figure 1). The Firebox proposal incorporates ideas from several academic and industrial projects and is representative of disaggregated WSCs in general. Firebox proposes using high-bandwidth, high-radix networks to connect compute, memory, and storage nodes in a WSC. Firebox does not dictate any particular networking technology, but is motivated by emerging integrated silicon photonic network interfaces because they can be integrated directly on-die (or on package, e.g., in a SiP), minimizing pin-out and providing high bandwidth at low power. By multiplexing many wavelengths onto a single fiber (wave-division multiplexing), it is possible to create a very high fanout while using few physical interfaces. This fanout may enable very high radix switches that can connect many SiPs within a single hop. Current estimates indicate these photonic networks can achieve aggregate bandwidths exceeding 1 Tbit/s over 128 channels on a single fiber[12][13]. Link latencies (including photonic transceiver crossings) are on the order of 10s of ns, but final latencies will be determined by protocol decisions. We believe round-trip latencies (across a single switch) of approximately 1 µs to be a conservative prediction. For reference, current Infiniband EDR networks provide approximately 24 Gbit/s per link (with up to 12 links per NIC), and have round-trip latencies of approximately 2 µs[14].

Compute nodes can be CPUs or special-purpose accelerators. In either case, they will include some amount of high-bandwidth on-package memory (HBM); we typically assume densities of approximately 2 GB per core. This on-package memory will have very high speed (up to 1 TB/s) at much lower power than traditional off-package memories. A relatively large amount of on-package memory, coupled with very fast networks, allows most memory in the system to consist of network-attached memory blades which are optimized for cost, power, and density. The total available memory may be as much as 1 PB. Other resources may also be used, such as high-performance NAND flash, disks, and external network bridges. Ultimately, the scale of such a system will be limited by network capacity, but a Firebox would include at least thousands of cores and hundreds of TB of bulk memory.

Figure 1: The Firebox WSC proposal. Compute and memory resources become first-class citizens over a high-radix network, allowing flexible resource allocation and scaling. Lower-speed peripherals (such as disk or wide-area networking) can be connected farther away in the network topology.

WSC Challenges WSCs promise to lower total cost of ownership (TCO) significantly by improving utilization and allowing components to develop and scale independently. These benefits bring with them opportunities and challenges that bear mentioning. The increased flexibility in resource allocation will require sophisticated and scalable resource management algorithms that balance utilization, congestion, and fairness. Encryption and authentication keys will also need to be managed to protect memory traffic that is now exposed to attackers across the network. In addition to being secure, memory-like interfaces must be very high performance, and any encryption or authentication mechanisms must match that performance. Finally, this new heterogeneous memory hierarchy presents challenges to applications and operating systems. What is the right interface to expose the gap between local and remote memory? Will applications need to be re-written to take advantage of it? It is these questions that I focus on in the remainder of this thesis.

2.2 Interfaces to Remote Memory

The problem of deep and heterogeneous memory hierarchies is not entirely new; mainframe and high-performance computing platforms have previously exposed the concept of remote memory. I will describe some of these approaches in the next two sections.

2.2.1 Low-Level Interfaces

NUMA Non-uniform memory access (NUMA) architectures partition memory resources across several compute nodes such that memory is always local to exactly one compute resource, but still directly addressable by the others. In this case, all memory has the same interface (loads and stores from CPUs), but some is faster than others (non-uniform). Some NUMA systems include hardware services to aid in page migration to mitigate this effect[15]. NUMA systems are appealing because they appear to software as a single, large memory. They can also offer memory access latencies on the order of 100s of nanoseconds. This performance and tight coupling, however, limit scalability. The largest NUMA systems can scale to hundreds of nodes and 10s of TB of memory[16], but typical systems support only a few TB and fewer than 10 nodes (due to poor scaling in cost and power).

RDMA RDMA systems are similar to NUMA in that memory resources are partitioned among several compute nodes (memory is always local to someone)[17][18]. The difference is that while NUMA systems typically expose a cache-coherent load-store interface to both local and remote memory resources, RDMA uses a special put/get interface to access remote memory resources. Typically, this service is provided through the network interface and managed by software. This interface allows RDMA systems to scale beyond what is possible in NUMA systems, at the cost of remote memory access performance and a more complex interface to applications. RDMA systems can scale to thousands of nodes and petabytes of memory[19]. Performance can vary, and scales with deployment size, but modern Infiniband networks provide round-trip latencies of several microseconds (within a rack) and bandwidths of hundreds of gigabits per second[20]. These systems have historically been considered costly and were primarily deployed in supercomputing environments, but recent Ethernet-based implementations have made them increasingly accessible[17].

Memory Semantic Fabrics Finally, a new class of interface has recently been introduced: the memory-semantic fabric. A memory-semantic fabric abstracts memory into a simple load-store interface (rather than technology-specific protocols). These interfaces are tightly coupled with the CPU, often loading memory directly into local caches or even registers. This abstraction enables heterogeneous memory technologies in flexible topologies. Memory thus becomes a first-class citizen (often called a “memory blade”) on a memory-optimized interconnect. The hope is that such interfaces will allow for greater scalability and flexibility than NUMA, while providing a more direct interface than RDMA. There are several commercial consortia developing cache-coherent interconnects for integrating accelerators and memories within a rack[21][22]. Some academic projects have focused on scaling NUMA by increasing the level of abstraction (e.g., [23][24]). Finally, an industrial effort called Gen-Z provides a more general interface that can connect memory, accelerators, and storage using memory-oriented operations (like load and store)[25]. While Gen-Z does not include cache-coherence in the core specification, it can be added through custom commands between devices that require it. It remains to be seen how these new interconnects will balance performance, scalability, and cost.

2.2.2 Software Interfaces

The low-level interfaces listed above do not necessarily mandate a particular software interface. NUMA systems typically expose a virtual memory abstraction to applications. In this case, the OS manages mappings from virtual to physical addresses while hardware uses those mappings to automatically route loads and stores to the appropriate memory resources. The OS is also responsible for choosing which NUMA domain to allocate memory from. This can be a complex decision and much effort has gone into studying such allocation policies[26]. RDMA systems are further divorced from specific hardware interfaces and enjoy a great diversity of interfaces. Some programming languages use a partitioned global address space to make it appear as if language-level variables are all directly accessible[27][28]. Other systems use RDMA more directly to accelerate applications such as key-value stores[29][30]. Memory-semantic fabrics are newer and it is not yet clear how their interfaces should be exposed. By coupling tightly with CPUs, it is possible to address them directly using virtual memory. However, it may be desirable to allow applications to choose which memory they access, or to have more abstracted interfaces (e.g., disk-like). Furthermore, these fabrics are designed to support highly heterogeneous memory technologies. This has led to page-migration proposals that try to manage performance and durability requirements either explicitly in the application, or transparently in the OS[31][32].

In this thesis, I will focus on a very general interface called paging (covered in detail in the next section) that can be implemented under any of the low-level interfaces listed here. We assume a system that allows reads and writes to remote memory resources. In a NUMA system, this would translate to page migration. For RDMA, we would allocate memory from under-utilized nodes to store pages from oversubscribed nodes (as was done in [8]). In the memory-semantic approach, dedicated memory blades would be used for remote memory and transfers would be initiated directly from the client CPUs (e.g., using memcpy()).
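As a concrete illustration of the last case, a page transfer over a memory-semantic fabric could reduce to a plain copy, since remote memory is directly load/store addressable. The mapping of the memory blade into the local address space (remote_base) and the pageID-to-offset scheme below are hypothetical, chosen only to make the idea concrete.

/* Illustrative only: page transfer on a memory-semantic fabric, where the
 * memory blade is mapped into the local physical address space. The base
 * pointer and pageID-to-offset mapping are assumptions for this sketch. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096UL

extern uint8_t *remote_base;   /* hypothetical mapping of the memory blade */

static void fetch_remote_page(uint64_t page_id, void *local_frame)
{
    memcpy(local_frame, remote_base + page_id * PAGE_SIZE, PAGE_SIZE);
}

static void evict_local_page(uint64_t page_id, const void *local_frame)
{
    memcpy(remote_base + page_id * PAGE_SIZE, local_frame, PAGE_SIZE);
}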

2.3 OS Paging Overview

Many architectures expose the abstraction of virtual memory. While the implementation of virtual memory is fairly similar across architectures and operating systems, for concreteness I will use the RISC-V ISA (privileged architecture version 1.10[33]) and Linux version 4.15[34] for most examples in this thesis. With virtual memory, the addresses issued through load and store instructions do not directly correspond to physical addresses in memory. Instead, they are translated to physical addresses through a data structure called the page table (see Figure 2). Page tables contain translations for fixed-sized ranges of memory called pages (4 KiB in RISC-V). In addition to translations, each page table entry (PTE) contains meta-data about the page such as read/write permissions, page validity, and whether the page has been read or written recently (called “accessed” and “dirty” respectively). Throughout this thesis I will refer to the logical group of data as a “page” and the physical region of memory containing that page as a “page frame” or simply “frame”. Figure 3 shows a flow chart for translating a virtual address:

1. The CPU issues a load or store for a particular virtual address

2. The load/store unit queries a small cache of recently used translations called the translation look-aside buffer (TLB).

Virtual   Physical
0x0       V
0x1       I 1
0x2       V
0x3       V
0x4       I 0
0x5       V
0x6       I 2
0x7       I 0

Figure 2: Example page table. Entries with a “V” are valid (have the valid bit set), while “I” indicates an invalid entry (valid bit clear). Addresses refer to the virtual page number (the 52 most significant bits in RISC-V).

(a) If the translation is found, a physical address is returned to the CPU which then performs the memory access.

3. If the translation is not found, the TLB uses a hardware block called the page table walker (PTW) to find the relevant PTE in main memory.

(a) If the PTE is marked valid, the physical address is returned to both the TLB (for caching) and the CPU.

4. If the PTE is marked invalid, the PTW issues a page-fault to the CPU which then traps into the OS to handle the invalid memory access.

Figure 3: Flow chart for virtual to physical address lookup in a typical virtual memory system. The translation look-aside buffer (TLB) caches translations. The page-table walker (PTW) fetches mappings from main memory when the TLB misses. Most mappings are valid and can be returned directly to the CPU, but invalid mappings result in a trap to the OS.
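The flow above can be summarized in pseudocode. This is a deliberately simplified model (a single flat table, no permission checks, no multi-level walk); the helper functions stand in for the hardware structures in Figure 3 and are not real APIs.

/* Simplified model of the translation flow in Figure 3. The helpers
 * (tlb_lookup, page_table_walk, ...) stand in for hardware structures;
 * real RISC-V translation uses a multi-level table and permission checks. */
#include <stdint.h>

typedef struct { uint64_t ppn; int valid; } pte_t;

int    tlb_lookup(uint64_t vpn, uint64_t *ppn);   /* hypothetical helper */
void   tlb_fill(uint64_t vpn, uint64_t ppn);      /* hypothetical helper */
pte_t  page_table_walk(uint64_t vpn);             /* hypothetical helper */
void   raise_page_fault(uint64_t vaddr);          /* trap to the OS      */

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> 12, off = vaddr & 0xfff, ppn;

    if (tlb_lookup(vpn, &ppn))           /* 2. TLB hit                   */
        return (ppn << 12) | off;

    pte_t pte = page_table_walk(vpn);    /* 3. PTW fetches PTE from DRAM */
    if (pte.valid) {
        tlb_fill(vpn, pte.ppn);          /* 3a. cache the translation    */
        return (pte.ppn << 12) | off;
    }

    raise_page_fault(vaddr);             /* 4. invalid PTE: trap to OS   */
    return 0;                            /* not reached after the trap   */
}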

In addition to providing protection and dirty/accessed tracking, virtual memory enables a number of operating system techniques to manage physical memory through the page-fault mechanism. For example, new virtual memory allocations in Linux do not get physical addresses assigned immediately. Instead, the OS protects all reads and writes to the page, and only allocates memory if and when the process actually attempts to use it (virtual address 0x4 in Figure 2, for example). This saves considerable memory allocation overhead when processes allocate more memory than they actually use (a fairly common occurrence). Another technique, called paging, creates the appearance of unlimited memory by transparently moving pages to an external storage device (such as a hard disk) and bringing them back only when accessed. This effectively turns main memory into a software-managed cache for the external storage device. It does this by marking the evicted pages as “invalid” and storing their location on the storage device in the PTE. For example, in Figure 2, virtual pages 0x1 and 0x6 are currently stored on an external storage device at index 1 and 2, respectively.

2.4 Paging Limitations

It is important to note that the majority of time in demand paging is spent in operating system software running in the page-fault handler. The OS must walk the page tables, look up per-page meta-data in internal data structures, and allocate a new frame, among other bookkeeping tasks (see Section 4.3 for more details on this process). Our experiments on a modern x86-based platform found that this processing can take anywhere from 2 µs to 5 µs. While this latency is tolerable when accessing a slow disk (with access latencies on the order of 10 ms), it is a significant overhead when using fast remote memory (with 1 µs access latency) as a backing store. To demonstrate this effect, we modified the Linux kernel to swap to pre-allocated DRAM buffers in local memory instead of an external device. This effectively eliminates backing store access times and directly measures the overhead of using paging at all.

Figure 4: Application slow-down when paging to local memory. Memory oversubscription refers to the percent of peak memory that was stored remotely.

Figure 4 plots the results of this experiment. Some applications (such as quicksort or Tensorflow) have good locality and therefore experience relatively little impact from paging (although even these applications see slowdowns of as much as 40%). However, applications with poor locality (such as Java or linpack) can experience significant slowdowns (up to 4x in this experiment) due solely to the software overhead of paging.

3 Page Fault Accelerator

Figure 5: Paging with the PFA. Instead of an invalid PTE causing a trap to the OS (as in Figure 3), invalid pages are passed to the PFA to be fetched from remote memory. The PFA may still cause a trap if it cannot handle the request (e.g., full queues).

Much of the work done during a page fault, while important, does not strictly need to occur in order for the application to make progress. For example, allocating free frames or updating page meta-data could be performed at any time. Other tasks may be more efficient in hardware than in the OS; the walking of page tables, for example. We propose a hardware accelerator that performs only the bare minimum of copying a remote page into a pre-allocated frame, updating the relevant PTE, and restarting the application (Figure 5). While this does not eliminate the need for software management of page meta-data (here referred to as bookkeeping), it does provide considerable flexibility to the OS in how such tasks get scheduled. Figure 6 illustrates the difference from the perspective of the OS. One immediate benefit is that the OS can schedule this bookkeeping thread on idle resources, e.g., while the application is blocked on I/O. Another benefit is that bookkeeping tasks can now be batched. Batching improves cache locality and amortizes overheads.


Figure 6: Timeline of page-fault processing with and without the PFA. Without the PFA, the OS must be invoked on every page miss and perform various data structure look ups to decide how to handle the fault. The PFA allows this bookkeeping to occur any time after the fetch in a separate kernel thread. Only the actual page read must occur before the application can be restarted.

3.1 Page Fault Accelerator Design

The primary interface to the PFA is through a number of memory-mapped queues: FreeQ, NewQ, and EvictQ. The FreeQ contains unused page frames that the PFA can use for fetching new pages, the NewQ reports any recently fetched pages to the OS bookkeeping thread, and the EvictQ contains a list of local pages that should be stored in remote memory. Using these queues, execution proceeds as follows:

Eviction The PFA handles all communication with the memory blade. This includes page eviction. The basic procedure is as follows (see Figure 7 for a detailed description):

1. The OS identifies pages that should be stored remotely.

2. It evicts them explicitly by writing to the EvictQ.

3. The OS stores a page identifier in the PTE and marks it as remote.

In addition to the three main queues, there are a number of other maintenance registers that are used for querying queue status and initializing the PFA; see Appendix B for a complete specification. I will mention one status register here: the EVICT STAT register. When a page is placed on the evict queue, the PFA begins transferring it to remote memory, but does not block the OS. This allows the OS to perform useful work while the eviction is taking place, potentially hiding some of the write latency. In order to re-use the page frame, however, the OS must poll the EVICT STAT register to ensure the write has completed.
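The following sketch shows how an OS might exploit this asynchrony. It reuses the hypothetical MMIO accessors from the sketch in Section 1; the register offset and the polling semantics assumed here (EVICT STAT becomes non-zero once the write has completed) are assumptions, not the normative behavior in Appendix B.

/* Sketch of asynchronous eviction. Builds on the PFA_REG/PFA_EVICT/PFA_FREE
 * macros from the earlier sketch; offsets and status semantics are
 * illustrative. */
#define PFA_EVICT_STAT  PFA_REG(0x18)   /* hypothetical offset */

extern void do_useful_work(void);       /* stand-in for any other OS work */

static void evict_and_overlap(uint64_t pte_paddr, uint64_t frame_paddr)
{
    PFA_EVICT = pte_paddr;          /* start the write to remote memory  */

    do_useful_work();               /* e.g., bookkeeping; hides latency  */

    while (PFA_EVICT_STAT == 0)     /* the frame may not be reused until */
        ;                           /* the write has reached the blade   */

    PFA_FREE = frame_paddr;         /* recycle the now-free frame        */
}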

Figure 7: Detailed eviction flow.

1. The OS notices memory pressure and selects a page to evict.
2. The OS writes the physical address of the PTE to the PFA's EvictQ.
3. The PFA notices a new entry in the EvictQ. It constructs a page write command header in memory.
4. The PFA writes a send request, indicating the header and page physical addresses to send, to the NIC's SendQ.
5. The NIC performs a gather and sends the full write command to the memory blade (3 packets total).
6. The NIC signals the PFA when the send is complete.
7. The PFA updates the evict queue size (which can be queried by the OS to detect eviction completion).

Fetch The primary function of the PFA is to automatically fetch pages from remote memory when an application tries to access them. It does this by detecting page table entries that are marked remote and transparently re-mapping them to the next available free frame. The basic operation is as follows:

1. Application code issues a load/store for the (now remote) page.

2. The PFA automatically and synchronously brings the page from remote memory and stores it in a free frame.

3. The PFA clears the remote bit in the PTE.

4. The PFA pushes the virtual address of the fetched page to the NewQ.

5. The application is restarted.

Figure 8 describes the process in more detail.

Figure 8: Detailed fetch flow. The application in the figure allocates a page (int *page = malloc(4096)), writes to it (*page = 100), has it evicted by the OS, and then reads it back (printf("val=%d", *page)), triggering the following sequence:

1. The application issues a load to a virtual address. The CPU queries the translation look-aside buffer (TLB) for a mapping.
2. The TLB does not have the mapping and requests the PTE from the page table walker (PTW).
3. The PTW walks the page table and detects a remote PTE.
4. The PTW sends the PTE address to the PFA.
5. The PFA assembles a load command in memory.
6. The PFA signals the NIC to send the command and provides the first free page (from the freeQ) as a receive buffer.
7. The NIC reads the request from memory and sends it to the memory blade.
8. The NIC gets the new page and DMAs it to the destination frame.
9. The NIC signals the PFA that it is done.
10. The PFA updates the page table.
11. The PFA adds the pageID to the NewQ.
12. The application is restarted with the new mapping.

Metadata Management The OS must keep enough free frames in the FreeQ to ensure smooth operation. If a remote page is requested and there are no free frames, the PFA will trap to the OS with a conventional page fault. The OS must enqueue one or more free frames before returning from the interrupt. This may involve evicting pages synchronously in the page-fault handler. Similarly, the OS needs to drain the new page queue periodically to ensure it does not overflow; if it fills, the PFA will also trap to the OS with a conventional page fault.
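A sketch of how the fault handler might service these two cases (empty freeQ, full newQ) follows. It builds on the hypothetical MMIO accessors from the earlier sketches, and alloc_free_frame() is a stand-in for the kernel's frame allocator rather than a real API.

/* Sketch of servicing a PFA-related page fault: drain the newQ, then
 * refill the freeQ so the faulting access can be retried. Builds on the
 * hypothetical MMIO accessors from the earlier sketches. */
extern uint64_t alloc_free_frame(void);    /* stand-in frame allocator */

static void pfa_service_fault(void)
{
    pfa_bookkeep();                          /* drain the newQ first     */

    while (PFA_FREE_STAT > 0) {              /* refill the freeQ         */
        uint64_t frame = alloc_free_frame(); /* may evict synchronously  */
        if (!frame)
            break;
        PFA_FREE = frame;
    }
    /* Returning from the fault handler retries the original access. */
}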

3.1.1 Page Table Entry

The PFA uses a special PTE format for remote pages (Figure 9). The fields are as follows:

• pageID: This acts as an address in remote memory for the remote page. It is used by the PFA to look up pages in remote memory, and for the OS to identify each page during bookkeeping.

• Prot: This sets the protection bits that the PFA will use when fetching a page. These bits include things like read/write permissions, as well as other page metadata (see the RISC-V privileged architecture manual for more details [33]).

• R: This bit indicates that a page is remote (when the valid bit is clear).

• V: This indicates whether a page is valid. A valid page is currently in main memory and would not trigger a page-fault. This is also referred to as the “present bit” in Linux.

63 … 40    39 … 12    11 … 2    1    0
Unused     Page ID    Prot      R    V

Figure 9: Remote PTE Format. The Page ID is a unique identifier of this page and serves as a remote memory address. The Prot field contains the permission and metadata bits that should be set after a page is fetched (see the RISC-V specification for details[33]). The R bit indicates that this page is remote, while the V bit indicates that the PTE is not a valid mapping (needed for backward compatibility).

It is worth noting that the RISC-V instruction set manual specifies that all bits other than the valid bit are considered “don’t care” when the valid bit is clear. This deviation from the standard may be problematic for some OSs, but is compatible with the Linux kernel. Another interesting feature of this design is the use of pre-defined protection bits. This includes a valid bit which can be cleared by the OS before evicting to trigger a page fault on this page immediately after fetching (a useful debugging feature). Also, bits 8 and 9 are reserved for software by the RISC-V ISA and can aid the OS in bookkeeping and debugging (see Section 4.3).
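The layout in Figure 9 can be captured with a few helpers. The bit positions below (V at bit 0, R at bit 1, ten Prot bits at 11:2, and a 28-bit Page ID at 39:12) follow the figure as reconstructed here; treat the exact field widths as illustrative rather than the normative encoding given in Appendix B.

/* Helpers for the remote PTE layout in Figure 9. Field positions are taken
 * from the figure; the normative encoding is specified in Appendix B. */
#include <stdint.h>

#define RPTE_V          (1ULL << 0)     /* valid bit (clear = not mapped) */
#define RPTE_R          (1ULL << 1)     /* remote bit                     */
#define RPTE_PROT_SHIFT 2
#define RPTE_PROT_MASK  0x3ffULL        /* 10 protection/metadata bits    */
#define RPTE_PGID_SHIFT 12
#define RPTE_PGID_MASK  0xfffffffULL    /* 28-bit remote page identifier  */

static inline uint64_t rpte_make(uint64_t pgid, uint64_t prot)
{
    /* Valid bit clear, remote bit set: the PTW hands this PTE to the PFA. */
    return ((pgid & RPTE_PGID_MASK) << RPTE_PGID_SHIFT) |
           ((prot & RPTE_PROT_MASK) << RPTE_PROT_SHIFT) | RPTE_R;
}

static inline int rpte_is_remote(uint64_t pte)
{
    return !(pte & RPTE_V) && (pte & RPTE_R);
}

static inline uint64_t rpte_pgid(uint64_t pte)
{
    return (pte >> RPTE_PGID_SHIFT) & RPTE_PGID_MASK;
}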

3.1.2 Remote Memory Protocol

When the PFA receives a request for a remote page from the PTW, it must communicate with a remote memory blade to retrieve the appropriate data. It does this through a custom network protocol, which involves forming network packets and interfacing with the NIC. For this project we assume an Ethernet-based network with a maximum transfer unit (MTU) of 1500 B and a page size of 4 KiB; this is not fundamental to the design. Figures 10 and 11 show the basic operation of the protocol. See Appendix C for a detailed description of these packet types. Note that due to the MTU, each page takes 3 packets to transfer. This protocol represents an initial design point that is suitable for a single client/memory blade pair. We handle potentially out-of-order packets by attaching sequence numbers and transaction IDs to each packet, but assume a reliable network.

While multiple clients could be supported with this protocol, some interesting design challenges begin to appear. One issue is that of fairness and congestion. A robust protocol would need some prioritization of clients (perhaps assigned from a global allocator) and a back-pressure mechanism to ensure optimal behavior. It may be desirable, due to high access latencies, to include some amount of compute in memory, whether that be traditional atomic memory operations or more general remote procedure calls. Another important component is confidentiality and authentication. While it would be straightforward to encrypt payloads, this may interfere with atomics or other compute-in-memory features. Headers present additional challenges. Simple encryption of headers would not be sufficient because attackers could simply replay old requests to modify application state. Some form of nonce (a unique number added to each header) may be required. While the design of a memory blade represents an interesting avenue of research, the design presented here is sufficient to evaluate the PFA and I will leave further discussion to future work.

[Figure 10 sequence: the PFA sends fragment 1 (1468 B), fragment 2 (1468 B), and fragment 3 (1460 B) to the memory blade, which replies with an Ack.]

Figure 10: Page Write. Note that each fragment contains a header containing a “write” opcode and destination pageID, along with a unique transaction ID.

[Figure 11 sequence: the PFA sends a read request (PageID) to the memory blade, which replies with fragment 1 (1468 B), fragment 2 (1468 B), and fragment 3 (1460 B).]

Figure 11: Page Read. Note that responses contain a header indicating the transaction ID of the corresponding read request.
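For illustration, a request header carrying the fields named above (opcode, pageID, transaction ID, and a sequence number for the fragment) might look like the following. The actual packet formats are specified in Appendix C; the field widths and ordering here are assumptions.

/* Illustrative memory blade request header. The real formats are defined
 * in Appendix C; field widths and ordering here are assumptions chosen to
 * cover the fields described in the text. */
#include <stdint.h>

enum membl_op { MEMBL_OP_READ = 0, MEMBL_OP_WRITE = 1 };

struct membl_hdr {
    uint8_t  opcode;     /* read or write                                  */
    uint8_t  seqno;      /* fragment number within a page (0..2 for 4 KiB) */
    uint16_t xact_id;    /* matches a response to its request              */
    uint32_t page_id;    /* 28-bit remote page identifier                  */
} __attribute__((packed));

/* A 4 KiB page does not fit in a 1500 B Ethernet MTU, so a page write is
 * carried as three fragments, each with its own header (Figure 10).      */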

4 Implementation

The PFA was implemented within the RISC-V ecosystem. RISC-V is an open-source instruction set with several open- and closed-source implementations and ports for many common software components[35]. I used the RISC-V port of Linux 4.13 with a Buildroot-generated root filesystem.

4.1 PFA Reference Implementation

To accelerate software development, and to provide a golden model of PFA behavior, I first implemented the PFA in a RISC-V ISA simulator called “Spike”[36]. Spike provides a functional simulation of a RISC-V core through a straightforward C++ interpreter, but does not provide any timing accuracy. Due to its simplicity, the PFA implementation required only a few weeks of effort and less than 1000 LoC. With Spike, software development was able to proceed concurrently with the concrete hardware design. Furthermore, unit tests developed under Spike were used to validate the hardware implementation, reducing debugging effort. In all, the only software change that was needed to go from Spike to a concrete implementation was one extra TLB flush, due to a difference in TLB design between Spike and the RISC-V implementation we used.

4.2 Hardware Implementation

The PFA prototype was implemented in the Chisel hardware construction language[37] and integrated with a simple in-order CPU called RocketCore[38]. The components were integrated using the RocketChip system-on-chip (SoC) generator[38]. I provide an overview of the relevant systems in the following sections. The current PFA prototype implements a subset of the specification described in Section 3.1. Specifically, it does not support multiple simultaneous evictions (the EvictQ has an effective depth of 1), although it does allow for asynchronous eviction. This prevents optimizations such as switching to other threads while many pages are simultaneously evicted (a single eviction does not take long enough to justify a context switch). The current prototype also does not support the full memory blade specification. In particular, it does not implement Ethernet headers, which limits experiments to a single client/memory pair over a static network route. Instead, it sends a simplified form of packet header with only an opcode and transaction ID. It also does not use a write acknowledgment, which means the OS must rate-limit writes to avoid overwhelming the memory blade. Without back-pressure, this rate-limiting must be pessimistic. However, the rate limit only affects evictions, which tend to occur slowly over time.

4.2.1 RocketCore and RocketChip

RocketChip[38] is a framework for generating SoCs. It includes on-chip interconnects, caches, and other utilities for chip construction. While the CPU is pluggable, we use only a single RocketCore in-order CPU for our experiments. Our implementation used dedicated 16 kB instruction and data caches. While the node had access to several gigabytes of main memory, we artificially limited application memory using Linux cgroups (see Section 5 for details). A real WSC would likely include a mixture of simple cores, high single-thread performance cores (e.g., out-of-order), and accelerators. We hope to evaluate the Berkeley out-of-order core (BOOM[39]) and the Hwacha vector accelerator[40] in the future when they become available in our simulation infrastructure.

4.2.2 Memory Blade

The memory blade used by the PFA was implemented as a bare-metal C program running on a RocketCore-based compute node. This memory blade is connected to the same network as our PFA-enabled compute node and traverses a simple switch in order to communicate. The current implementation uses a simple protocol in lieu of a full Ethernet stack. We are currently evaluating competing memory blade designs ranging from full hardware to higher-level software implementations. However, this simple design is sufficient for evaluation of the core PFA functionality.

4.2.3 FireSim

We simulated the RTL using a cycle-accurate simulator called “FireSim”[41]. FireSim is an FPGA-accelerated simulator that runs on the Amazon cloud. It can simulate thousands of nodes with a cycle-accurate network and heterogeneous components. Many parameters of the simulation are tunable within FireSim. We decided on a 200 Gbit/s network with 2 µs link latency, leading to a roughly 8 µs page access time to the memory blade. The network is in-order and reliable. A realistic Firebox system would include a much faster network with many more components interacting. We limit ourselves to this simple design in order to evaluate only fully synthesizable components (especially given the limited throughput of our CPU platform). Future experiments will evaluate the effects of faster networks with improved compute and memory performance.

CPU Type         Rocket (5-stage, in order)
CPU Frequency    3.2 GHz
Caches           16 kB D$ and I$
NW Topology      Single switch
NW Bandwidth     200 Gbit/s
NW Link Latency  2 µs

Figure 12: System parameters used for evaluation.

4.3 Linux Integration

We modified the Linux kernel (version 4.15[34]) to support the PFA. The majority of software development was done using the functional simulator. Linux is a mature open-source project with a long development history, resulting in many Linux-specific terms. I will define these terms throughout this discussion, but you may also refer to the glossary for reference.

4.3.1 Non-PFA Paging in Linux

I will now briefly describe how paging works in vanilla Linux. Note that the kernel internally uses the term swap to refer to all paging activity; I will use these terms interchangeably. For a more complete discussion of paging in Linux, see [42]. Figures 13 and 14 map out the steps involved in evicting and fetching pages, respectively.

Page Reclaiming Linux manages memory limits on a per-task basis. In this case, a task refers to the kernel-specific abstraction of a process. Each task has its own resource limits, which are exposed to system administrators through the control group (cgroup) interface. When a task approaches its assigned limit for a certain resource, it is throttled in a resource-specific manner. In the case of memory, the kernel attempts to free task-assigned memory. It will first attempt to shrink any file caches (especially clean disk blocks that can simply be deleted without requiring any disk activity). If shrinking caches is not enough, the kernel begins to page out non-file-backed pages (called “anonymous pages”). This is done using a pseudo-least-recently-used (LRU) eviction algorithm (Step 2 in Figure 13). Page reclaiming can be triggered in one of two ways (Step 1). If a hard memory limit is met, but more memory is needed to proceed, then page reclaiming happens synchronously (called direct reclaim). However, Linux tries to avoid this scenario by running a background thread called kswapd that begins reclaiming pages when the application reaches a soft resource limit. Kswapd is usually idle, but can be woken up when the kernel detects that a soft limit has been met. It also runs with a low priority to avoid wasting resources on speculative evictions (the application may never hit its hard limit).

Page Eviction Paging was originally intended to use hard disks as a backing store, and this is reflected in the design of paging in Linux. To swap, one or more block devices must be formatted and mounted as swap devices. Linux then uses the block offset on this disk as a unique identifier for an evicted page (Step 3). In order to support more complex paging schemes (such as page compression or heterogeneous memory), Linux introduced the transcendent memory (TMem) layer[43]. This scheme still uses disk offsets as identifiers, but completely bypasses the block layer. This is important because many optimizations in the block layer (e.g., write coalescing and block reordering) are not suitable for these alternative paging devices. Evictions do not immediately result in writes to TMem or a swap device. Instead, pages are stored in a data structure called the swap cache (Step 3). This swap cache helps reference-count shared pages, and hedges against poor eviction choices. Once a page is no longer physically available, Linux replaces the corresponding PTE with a swap_entry_t, which clears the valid bit and uses the remaining bits to store the swap device ID (called “type” in the kernel) and block ID (called “offset”) (Step 4). When changing a PTE, most architectures require the OS to flush the TLB. This forces a page-table walk on the next access to this virtual address. Finally, the kernel begins a write to the swap device in the background (Step 5).

[Figure 13 flow: (1) detect low memory (cgroups; kswapd or direct reclaim); (2) pick a page to evict (shrink_page_list()); (3) allocate swap space / add to the swap cache (add_to_swap()); (4) write a swap_entry_t to the PTE (try_to_unmap_one()); (5) write the page to the swap device or TMem (pageout()).]

Figure 13: Baseline Linux page eviction (reclaim) code path.

Page Fetch When a user program attempts to access a page that has been swapped out, the PTW notices the invalid PTE and issues a page fault trap to the OS (Step 1 in Figure 14). Note that the hardware does not examine the remaining bits (the swap_entry_t is purely a software construct). Upon receiving a page fault, Linux first determines if the requested virtual address has been assigned to this task. It does this by iterating through regions of virtual memory called virtual memory areas (VMAs) (Step 2). If a valid VMA is found, then the OS begins a page table walk to locate the corresponding PTE. There are several reasons that a page fault may occur, so the OS must check the PTE to determine the cause (Step 3). Assuming the cause was an invalid PTE, the OS then searches the swap cache for the page (Step 4), in case some other process that shares it has already brought it in. If the page is not found, then a new frame is allocated and a transfer is initiated to read the page from the swap device (Step 5). If the page is found in TMem, then the transfer occurs synchronously; otherwise the process initiates the transfer and then yields to the scheduler, resulting in a context switch. When the transfer is complete, the kernel changes the PTE from a swap_entry_t to a valid PTE with permissions defined by the VMA. Finally, the kernel updates page-tracking meta-data (Step 6). This includes the LRU lists maintained by the eviction algorithm, VMA membership, and a number of other kernel subsystems. Note that several of these updates require synchronization with other kernel threads. Once all bookkeeping is complete, and the PTE is updated, the kernel flushes the TLB and restarts the application.

[Figure 14 flow: (1) the application accesses an evicted page and traps; (2) look up the VMA and verify the virtual address (do_page_fault()); (3) find the PTE and determine the cause of the fault (handle_mm_fault()); (4) look for the page in the swap cache; (5) allocate a frame and copy the page from the swap device or TMem; (6) update kernel metadata (do_swap_page()).]

Figure 14: Baseline Linux page fetch code path.

4.3.2 PFA Modifications

The introduction of a PFA changes a number of the assumptions underlying baseline paging behavior. Figure 15 summarizes these changes.

Frame Allocation and Permissions Linux uses the faulting virtual address to make a number of decisions during the page fetch process. For instance, the permission bits are taken from the VMA. VMA information is also used to decide which physical frame to use (this is particularly important in NUMA systems). With the PFA, however, the OS must decide on this information at eviction time. Pre-allocating physical frames is not an issue in our system because there is only one core, and frame selection does not depend on the VMA. Permission bit selection is more problematic. Our current approach is to assign permission bits to a remote page based on the VMA permissions at the time of eviction; we then update those permissions while performing bookkeeping. In practice, this is unlikely to cause problems as permissions rarely change. Furthermore, Linux is able to correct inappropriately restrictive permissions during page-faults. However, there may be security concerns if permissions are made more restrictive while a page is remote. This vulnerability exists in the window between page fetch and bookkeeping. To mitigate this concern, the OS would need to be modified to update remote PTEs when changing VMA permissions.

[Figure 15 flow: the eviction path (detect low memory via kswapd or direct reclaim; pick a page to evict via shrink_page_list(); allocate swap space via add_to_swap(); fix up the PTE and write to remote memory via try_to_unmap_one()) and the bookkeeping path (detect new pages via kpfad or a page fault; walk the page table and fix up the PTE via pfa_new(); update kernel metadata via do_swap_page()).]

Figure 15: Major changes to Linux paging to accommodate the PFA. Most subsystems could be re-used without change. On the eviction path, all that was changed was the PTE update (to write a remote PTE instead of a swap_entry_t), and the write to disk (to write to remote memory instead). The bookkeeping path is now triggered either from the page-fault handler (due to a PFA service request) or from kpfad. The core bookkeeping function remains unmodified.

Under these simplifying assumptions, we are able to allocate frames proactively. The current implementation always refills the FreeQ during bookkeeping. To simplify the bookkeeping procedure, we track each allocated frame in a FIFO. This allows the bookkeeping code to simply pop this FIFO to find which frame was used for each new page (the PFA always drains the FreeQ in FIFO order). Furthermore, this FIFO allows for aggressive sanity checks that aided greatly in system debugging.

Asynchronous Bookkeeping In normal paging, the do_swap_page() function is able to update meta-data as soon as a page is fetched. With the PFA, we delay this bookkeeping for a bounded but potentially non-trivial period of time. Many of these bookkeeping tasks are in support of heuristics or resource accounting (e.g., LRU lists for eviction, or memory utilization metrics). Delaying these tasks reduces the accuracy of various algorithms, but does not result in incorrect behavior. Others are needed for correct execution (e.g., VMA membership or shared page tracking for copy-on-write). We address these correctness issues by performing bookkeeping preemptively before accessing any of the related algorithms. These accesses may be fairly common, but they are unlikely to actually involve a recently fetched page. To avoid preemptive bookkeeping when it is not needed, we use one of the reserved bits in the PTE protection field to indicate a page that has been recently fetched but not yet processed. This bit gets set at eviction time, but is cleared during bookkeeping.

Swap Device and Block ID Allocation Linux assumes that all swap activity is backed by a block device and it uses the physical address on this device to identify all evicted pages. This block ID is needed during the bookkeeping process to identify the page. To address this problem we make a number of simplifying assumptions.

1. A real swap device is available (even though it is not used). We use a ram-based file system (ramfs) to trick the kernel into thinking it has a large disk attached. This uses no actual physical memory.

2. There is only one swap device. This allows us to not track device ID. This is achieved by making the ramfs sufficiently large to address all swap activity.

3. Block IDs are contiguous on the integers (0, 2^28]. This allows us to pack the block ID into the remote PTE format (see Section 3.1.1). We achieve this by ensuring that the ramfs is the same size as our memory blade (and less than 2^28 pages). Since block IDs correspond to physical offsets on the swap device, we are guaranteed to never see an invalid block ID.

While these assumptions hold, we are able to compress the swap_entry_t into a 28-bit PageID by eliding the type and using the offset directly. Finally, we avoid overheads in the block layer by implementing the PFA as a TMem device. Since bookkeeping is asynchronous, and eviction occurs earlier in the process (due to ordering constraints with PTE modifications), this TMem plugin simply returns immediately. The current implementation evicts synchronously. This is because the expected write time is much smaller than a scheduling quantum, and asynchronous eviction would result in wasteful context switches. Future implementations may attempt to overlap eviction with low-latency tasks such as bookkeeping.
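A sketch of the PageID compression described above, using the kernel's swp_entry_t accessors, is shown below. It assumes the single-device, 28-bit-offset conditions just listed and is not the prototype's exact code.

/* Sketch of the swap_entry_t <-> PageID mapping under the assumptions
 * above (a single swap device, offsets below 2^28). Not the exact code
 * used in the prototype. */
#include <linux/bug.h>
#include <linux/swapops.h>

#define PFA_PGID_BITS 28

static inline unsigned long pfa_pgid_from_swp(swp_entry_t entry)
{
    /* Only one swap device exists, so the type can be elided. */
    WARN_ON(swp_type(entry) != 0);
    WARN_ON(swp_offset(entry) >= (1UL << PFA_PGID_BITS));
    return swp_offset(entry);      /* the offset doubles as the pageID */
}

static inline swp_entry_t pfa_swp_from_pgid(unsigned long pgid)
{
    return swp_entry(0, pgid);
}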

4.3.3 kpfad

The most basic implementation of PFA support in Linux simply performs bookkeeping tasks whenever the internal queues of the PFA fill up. This effectively batches page bookkeeping, but it does not allow the kernel to choose when the bookkeeping occurs. To leverage idle periods in program execution, or unused hardware threads, we implement a background bookkeeping daemon called kpfad. Kpfad is triggered by an adaptive timer that attempts to discover the average time between full queues. It does this by increasing the wait time by a small amount every time it runs, and decreasing the time whenever the application is interrupted due to full queues. While kpfad gives increased flexibility and efficiency on a lightly loaded system, it causes strictly more overhead than interrupt-driven bookkeeping when the application has enough work to keep all hardware threads busy (since the adaptive timer is not perfect). To avoid this, kpfad is run with very low priority (similar to the page-out daemon kswapd). Unlike kswapd, however, kpfad does not get triggered by a soft limit. We expect the adaptive timer scheme, coupled with low priority, to be sufficient to avoid significant overhead.
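The adaptive-timer policy can be sketched as follows. The constants, the halving/incrementing rule, and the helper names are illustrative rather than the prototype's actual implementation; pfa_bookkeep() refers to the earlier sketch.

/* Sketch of kpfad's adaptive period: back off slowly while bookkeeping
 * keeps up with the PFA, and shrink the period whenever the application
 * is interrupted by full queues. Constants and helpers are illustrative. */
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/kernel.h>

static unsigned long kpfad_period_ms = 10;

/* Called from the page-fault handler on a full-queue trap. */
static void kpfad_note_queue_full(void)
{
    kpfad_period_ms = max(kpfad_period_ms / 2, 1UL);
}

static int kpfad(void *unused)
{
    while (!kthread_should_stop()) {
        pfa_bookkeep();              /* drain the newQ, refill the freeQ */
        kpfad_period_ms += 1;        /* queues were not full: back off   */
        msleep_interruptible(kpfad_period_ms);
    }
    return 0;
}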

4.3.4 Baseline Swapping

We modified Linux to use the remote memory blade while paging. This was done by implementing a software interface to the remote memory blade as a TMem device. The swapping mechanism uses a custom NIC driver that provides zero-copy semantics and bypasses the normal Linux networking stack.

5 Evaluation

5.1 Experimental Design

Our evaluation is based on two benchmarks with significantly different access patterns. The first is quicksort (Qsort). This benchmark first allocates a large array of random numbers, and then sorts it using the well-known quicksort algorithm. Quicksort is a divide-and-conquer algorithm that automatically partitions the input array into small local blocks before performing a final sort. This leads to excellent cache behavior and predictable access patterns. Furthermore, by allocating the input array dynamically, quicksort performs no file I/O, so it is never blocked on I/O or other OS interactions. The other benchmark is a de-novo genome assembly benchmark (Gen). Gen begins by loading a large text file that represents raw genome data. Raw genome data consists of short, overlapping sequences of base pairs called “contigs”; the goal is to align these overlapping contigs into a single contiguous sequence representing a genome. This is done by loading contigs into a large hash table and probing into it repeatedly to find matching sequences. This leads to very little locality and unpredictable access patterns. Furthermore, Gen performs file I/O on the input, which allows for more complex OS interactions.

5.2 End-to-End Performance

The benchmarks were both run under a cgroup in Linux in order to reduce the available memory and emulate a system where applications would need to share limited local memory. This is the same mechanism that system administrators use today to control application memory consumption (e.g., in containers). In this experiment, we disable kpfad in order to isolate the batching of new-page management from the scheduling flexibility offered by kpfad's asynchrony. The PFA was configured to allow up to 64 outstanding page faults before bookkeeping was performed.

Figure 16: PFA vs. baseline without kpfad. Applications run approximately 20-40% faster when the PFA is enabled.

Both applications use 64 MB of memory at their peak. We then varied the cgroup memory limit from 100% (64 MB) down to 25% (16 MB), triggering increasing levels of paging. For both benchmarks, the PFA reduces end-to-end run time by up to 1.4x.

5.2.1 Analysis

We now analyze the sources of this performance improvement.

Fetch Times We begin our analysis by looking at the key metric of average fetch time. This is the time between when an application attempts to access a remote page and when it is able to continue processing. In this experiment, we use a simplified memory blade and network implementation with a constant 4 µs access latency in order to better understand local overheads. Figure 17 plots the time for accessing a single remote page on an unloaded system. We classify time into four categories:

• Trap: The time for the hardware to detect an invalid access and context switch to the OS.

• Proc: The time spent processing the page locally (overhead).

• NIC: The time spent interacting with the network interface.

• MemBlade: The time spent on the network and in the memory blade.

Figure 17: Breakdown of the time to fetch a single remote page. All data are the average of 10 runs. Error bars represent standard deviation (but are almost too small to be seen). Note that local processing time (including the trap and NIC interaction) accounts for only 8% of the time with the PFA, but for over 50% of the time in the baseline.

Recall from Section 3 (and Figure 6 in particular) that the PFA moves some of this processing (especially Proc) to an independent kernel thread; we account for this in a later section. Note that the trap overhead is a very small fraction of total time (just 113 ns). This is a result of using a simple in-order RISC core like Rocket; the overhead would likely be larger on a more complex architecture such as a server-grade x86 core. Next, note that the time spent on the network and in the memory blade accounts for less than half of the total in the baseline implementation, but completely dominates the PFA fetch time. This effect will become even more pronounced as network and memory blade performance improves. For example, if MemBlade time were reduced to 1129 ns to simulate a 1 Tbit/s link with 1 µs round-trip latency (as predicted in Section 2.1 for photonic networks), then client-side processing would account for 83% of the time in SW but only 23% of the time with the PFA. Finally, note that the NIC time in software is larger than the total fetch time with the PFA; we believe this is due to the more efficient hardware-to-hardware interface between the PFA and the NIC. While not visible in the figure, the actual PFA-specific processing takes only 1 cycle in hardware; the remaining time is split between detecting and delivering the remote PTE to the PFA (Trap) and interacting with the NIC (NIC). The total time to fetch a page with the PFA is 2.2 times faster than in SW, but this does not tell the whole story: the PFA does not eliminate the work done during SW Proc, it simply moves it to another thread. Likewise, the 113 ns trap overhead may seem small, but it does not account for the cache pollution the handler inflicts on the application when it restarts.
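To make the percentages above concrete, the client-side share of a fetch is simply the local component times divided by the end-to-end fetch time. The sketch below demonstrates that arithmetic; the per-component values are illustrative placeholders chosen to roughly reproduce the 83%/23% split quoted above (only the 113 ns trap time and the 1129 ns hypothetical MemBlade latency come from the text), not measurements from this work.

```c
/* Illustrative arithmetic only: how the client-side fraction of a fetch is
 * computed from its components.  The proc/NIC values below are made-up
 * placeholders; only the 113 ns trap and 1129 ns MemBlade figures appear
 * in the text. */
#include <stdio.h>

struct fetch_breakdown {
	double trap_ns;     /* detect invalid access, switch to handler */
	double proc_ns;     /* local page-fault processing              */
	double nic_ns;      /* interacting with the NIC                 */
	double memblade_ns; /* network + memory blade                   */
};

static double client_fraction(const struct fetch_breakdown *b)
{
	double local = b->trap_ns + b->proc_ns + b->nic_ns;
	return local / (local + b->memblade_ns);
}

int main(void)
{
	/* Hypothetical component times for a software fetch and a PFA fetch,
	 * paired with the 1129 ns "fast" memory blade from the text. */
	struct fetch_breakdown sw  = { 113.0, 3500.0, 1900.0, 1129.0 };
	struct fetch_breakdown pfa = { 113.0,    0.0,  225.0, 1129.0 };

	printf("SW  client-side share: %.0f%%\n", 100.0 * client_fraction(&sw));
	printf("PFA client-side share: %.0f%%\n", 100.0 * client_fraction(&pfa));
	return 0;
}
```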

Figure 18: Breakdown of the time to fetch a single remote page from a hypothetical fast memory blade with 1 µs page read latency. As network and memory technology improves, the relative benefit of the PFA increases (from 2.2x faster with the baseline memory blade to 4.6x faster with the optimistic memory blade).

Total Page Faults One key function of the PFA is to reduce the number of page faults due to paging. Recall from Section 2.3 that there are many causes of faults (e.g., copy-on-write); in Figure 19 I plot the number of paging-related faults each benchmark experiences as a fraction of total faults. The first thing to note is that the number of paging-related faults decreases by a factor of roughly 64 when the PFA is used. This is because the PFA allows the OS to perform bookkeeping only when its queues are full (every 64 fetches in this experiment). Even without the PFA, however, paging-related faults account for at most 45% of all faults, and that only in the worst case (our simplest benchmark, Qsort, with 25% local memory); the more complex Gen benchmark has an even smaller fraction of paging-related faults. While there are certainly some savings due to fewer kernel crossings, they are neither frequent nor long enough to explain all of the end-to-end performance benefit we see.

Figure 19: Number of paging-related faults as a fraction of total faults experienced.

Bookkeeping Time While we reduce the number of paging-related faults, the kernel still must perform bookkeeping on the same number of pages; the batching simply means that more work is performed per page fault with the PFA. Figure 20 shows the proportion of time spent bookkeeping, regardless of the number of page faults. While the number of evicted pages is the same in both configurations, using the PFA leads to a 2.5x reduction in bookkeeping time on average. The same code path is executed for each new page, but the PFA batches these events, leading to better cache locality for the OS and fewer cache-polluting page faults for the application. The result is that, even in the worst case, the PFA spends less than half of its time handling paging-related faults, while the baseline spends about 80%.

Figure 20: Proportion of Time Spent Bookkeeping

Scaling Figure 21 shows the improvement in end-to-end runtime due to the PFA for our applications. While the improvement is significant (up to 40%), the savings are a constant factor (there is no asymptotic improvement). This is because the PFA does not change any of the caching algorithms, and therefore incurs the same number of remote fetches. As a result, applications (such as Gen) that are not particularly cache-friendly can still see significant slowdowns in a disaggregated environment, even with the PFA. Ultimately, the PFA pushes the boundaries of what is possible with cache-like interfaces, but it cannot change their fundamental limitations. Applications like Gen will need deeper changes to be viable on a disaggregated system.

Figure 21: Total runtime improvement due to PFA

6 Future Work

So far, we have experimented with only a single client and memory blade. However, this does not capture all of the effects that contribute to performance; in particular, it would be valuable to understand congestion both at the memory blade and in the network. Another effect that may surface as we experiment with more distributed applications is tail latency: the introduction of a remote page fault could increase tail latency significantly and would require mitigation. Another topic not addressed by the current PFA and memory blade designs is management and security. How should memory blade capacity be allocated to applications? How can such allocations be authenticated? Simple payload encryption may not be sufficient to protect applications from attackers with network-level access. Finally, while caching can be very effective for some workloads, it is not appropriate everywhere; even within one workload, caching may suit some data structures but not others. To give users maximum flexibility to trade off performance and convenience, we plan to implement a hybrid cache/scratchpad interface to remote memory (similar to [44]). In this system, application memory would default to demand paging, but certain portions could be pinned in specially allocated regions of local memory; the application would then be responsible for moving data to and from remote memory directly.

7 Conclusion

Disaggregated memory systems promise to simplify deployment, allocation, and scheduling in next-generation WSCs, but they bring significant performance and complexity challenges. This complexity cannot be mitigated with shallow application changes because assumptions about system performance run deep. Caching attempts to hide these challenges, but we find that virtual memory paging is no exception: the OS mechanisms implementing paging were designed in a world of millisecond-level access latencies and are not suited to the microsecond access times offered by remote memory. The PFA allows us to make the deep changes needed to accommodate this new environment. By improving end-to-end application performance by up to 40%, the PFA enables a greater range of applications to run with limited local memory. However, caching is a general-purpose approach, and applications with large working sets and poor locality will always suffer from increased main-memory access times. To take full advantage of disaggregated memory, we will need a mix of interfaces, both implicit and explicit.

Acknowledgements

I would like to first thank Emmanuel Amaro who was integral to the design, motivation, and hardware implementation of the PFA. Howard Mao contributed significantly to general infrastructure and implemented the memory blade. The FireSim simulator was the work of many in the Berkeley architecture research group (BAR), especially Sagar Karandikar. I would also like to thank those in the BAR group, and the RISC-V Foundation for supplying the underlying infrastructure for this research. Finally, I acknowledge my advisers Prof. Randy Katz and Prof. John Kubiatowicz, as well as Prof. Krste Asanovic, for their guidance throughout this project.

Appendices

A Reproducibility

All code for the experiments in this thesis is available on GitHub (some repositories may be private or may move to other URLs; contact the author for access):

• Software Environment

– Repository: https://github.com/firesim/firesim-software
– Branch: nathan pfa
– Commit: 442324046a08cd51e0b8b9198effac7b649196a5

• Simulation Platform

– Repository: https://github.com/firesim/firesim
– Branch: pfa
– Commit: 2c7e113b4f317ad6ddece12705667c79ea30fb1f

B PFA Specification

B.1 Overview

The page-fault accelerator (PFA) is a device that allows RISC-V based CPUs to store virtual memory pages in remote memory and fetch them back automatically on a page fault. With a PFA, the latency-sensitive critical path of swapping in a page from remote memory is handled in hardware, while more complex behaviors like page-replacement policies are handled asynchronously by the OS.

B.1.1 Basic Usage

1. Eviction:

(a) The OS identifies pages that should be stored remotely.
(b) It evicts them explicitly by writing to the evict queue.
(c) The OS stores a page identifier in the PTE and marks it as remote.

2. Provide Free Frames:

(a) The OS identifies one or more physical frames that can be used to house newly fetched pages.
(b) It gives them to the PFA through the free queue.

3. Page Fault:

(a) Application code issues a load/store for the (now remote) page.
(b) The PFA automatically and synchronously brings the page from remote memory and stores it in a free frame.
(c) The PFA clears the remote bit in the PTE.
(d) The PFA pushes the virtual address of the fetched page to the new page queue.
(e) The application is restarted.

4. New Page Management

(a) The OS periodically queries the new page queue and performs any necessary bookkeeping.

B.1.2 Free Frame and New Page Queue Management

The OS should keep enough free frames in the free queue to ensure smooth operation. If a remote page is requested and there are no free frames, the PFA traps to the OS with a conventional page fault; the OS must enqueue one or more free frames before returning from the handler, which may involve evicting pages synchronously in the page-fault handler. Similarly, the OS must drain the new page queue periodically to ensure it does not overflow; a full new page queue also traps to the OS with a conventional page fault. The OS can distinguish a regular page fault from a PFA-generated fault by checking whether the requested page has the remote bit set in its PTE. To tell whether the free queue has run dry or the new-page queue has filled, the OS can query the FREE STAT and NEW STAT ports.
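As a rough illustration of the logic just described, the sketch below shows how an OS page-fault handler might classify and service PFA-related faults. It is a minimal sketch, not the Linux implementation from this work: the helper names (pte_is_remote, alloc_free_frame, bookkeep_new_page, and the queue accessors) are hypothetical, and the MMIO accessors assume the register map given in Section B.4.

```c
/* Minimal sketch of PFA-aware page-fault handling (hypothetical helpers).
 * A fault on a PTE with the remote bit set means the PFA could not service
 * the access itself: either the free-frame queue is empty or the new-page
 * queue is full.  FREE_STAT and NEW_STAT tell us which. */
#include <stdint.h>
#include <stdbool.h>

/* Assumed accessors for the MMIO ports defined in Section B.4. */
uint64_t pfa_read_free_stat(void);   /* unused slots in the free queue        */
uint64_t pfa_read_new_stat(void);    /* entries waiting in the new-page queue */
void     pfa_push_free_frame(uint64_t paddr);
void     pfa_pop_new_page(uint64_t *pgid, uint64_t *vaddr);

/* Hypothetical OS helpers. */
bool     pte_is_remote(uintptr_t fault_vaddr);
uint64_t alloc_free_frame(void);         /* may evict a page synchronously */
void     bookkeep_new_page(uint64_t pgid, uint64_t vaddr);
void     handle_ordinary_fault(uintptr_t fault_vaddr);

void page_fault_handler(uintptr_t fault_vaddr)
{
	if (!pte_is_remote(fault_vaddr)) {
		/* Not a PFA fault (e.g. copy-on-write, demand allocation). */
		handle_ordinary_fault(fault_vaddr);
		return;
	}

	/* Drain the new-page queue: it may be full, and draining it is
	 * bookkeeping the OS must do eventually anyway. */
	while (pfa_read_new_stat() > 0) {
		uint64_t pgid, vaddr;
		pfa_pop_new_page(&pgid, &vaddr);
		bookkeep_new_page(pgid, vaddr);
	}

	/* Refill the free-frame queue to capacity; each alloc_free_frame()
	 * call may synchronously evict a page to make room. */
	while (pfa_read_free_stat() > 0)
		pfa_push_free_frame(alloc_free_frame());

	/* Returning lets the PFA retry and complete the fetch in hardware. */
}
```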

B.1.3 Limitations

• The current PFA design does not support multiple cores.

– Example: Queues cannot be atomically queried for free space and pushed to.
– Example: The PFA directly modifies the page table with no regard for locking or other MMUs.

• The PFA does not handle shared pages.

B.2 RISC-V Standards

User Spec: 2.1 (RV64 only)
Priv Spec: 1.10 (Sv39 or Sv48 only)

B.3 PTE

Remote pages use a unique PTE format:

Bits 63–40: Unused
Bits 39–12: Page ID
Bits 11–2: Prot
Bit 1: R (Remote)
Bit 0: V (Valid)

Fields:
• Valid (V): Valid flag.
  – 1 indicates that this page is valid and shall be interpreted as a normal Sv48 PTE.
  – 0 indicates that this PTE is invalid (the remote bit will be checked).
• Remote (R): Remote memory flag.
  – 1 indicates the page is in remote memory.
  – 0 indicates that this page is simply invalid (access triggers a page fault).
  – Note: This is an incompatible change from the RISC-V privileged spec 1.10, which specifies that bits 1-63 are don't-cares if the valid bit is 0. It is compatible in practice with the current RISC-V Linux implementation.
• Protection: Protection bits to use after a page is fetched. These match the first 10 bits of a standard PTE.
  – Note: This includes a valid bit, which may differ from the remote PTE's valid bit. If it is 'invalid', the PFA will fetch the remote page but then trigger a page fault anyway.
• Page ID: A unique identifier for this page.
  – Must match a pageID that was evicted and not yet fetched.
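The following is a minimal sketch, under the assumption that the layout above is correct (28-bit Page ID in bits 39-12, 10 protection bits in bits 11-2, remote bit 1, valid bit 0), of how an OS might encode and decode a remote PTE; the function names are illustrative, not taken from the actual kernel patch.

```c
/* Illustrative encode/decode for the remote-PTE format described above.
 * Field positions follow the layout in Section B.3 and are an assumption
 * of this sketch, not a normative definition. */
#include <stdint.h>
#include <stdbool.h>

#define PTE_V_BIT        (1ULL << 0)                       /* valid flag  */
#define PTE_R_BIT        (1ULL << 1)                       /* remote flag */
#define PTE_PROT_SHIFT   2
#define PTE_PROT_MASK    (0x3FFULL << PTE_PROT_SHIFT)      /* 10 bits     */
#define PTE_PGID_SHIFT   12
#define PTE_PGID_MASK    (0xFFFFFFFULL << PTE_PGID_SHIFT)  /* 28 bits     */

/* Build a remote PTE: V = 0, R = 1, plus the protection bits to restore
 * after the fetch and the page ID used as the remote address. */
static inline uint64_t pfa_mk_remote_pte(uint64_t pgid, uint64_t prot10)
{
	return ((pgid << PTE_PGID_SHIFT) & PTE_PGID_MASK) |
	       ((prot10 << PTE_PROT_SHIFT) & PTE_PROT_MASK) |
	       PTE_R_BIT;                   /* valid bit deliberately clear */
}

static inline bool pfa_pte_is_remote(uint64_t pte)
{
	/* Remote PTEs are invalid (V = 0) but carry the remote bit. */
	return !(pte & PTE_V_BIT) && (pte & PTE_R_BIT);
}

static inline uint64_t pfa_pte_pgid(uint64_t pte)
{
	return (pte & PTE_PGID_MASK) >> PTE_PGID_SHIFT;
}

static inline uint64_t pfa_pte_prot(uint64_t pte)
{
	return (pte & PTE_PROT_MASK) >> PTE_PROT_SHIFT;
}
```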

B.4 MMIO

Name         Value
BASE         0x10017000
FREE         BASE + 0
FREE STAT    BASE + 8
EVICT        BASE + 16
EVICT STAT   BASE + 24
NEW PGID     BASE + 32
NEW VADDR    BASE + 40
NEW STAT     BASE + 48
INIT MEM     BASE + 56

Basic PFA MMIO behavior is described below. Operations marked “Illegal” will result in a load/store access fault.
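A minimal C view of this register map might look like the sketch below; the accessor style (volatile 64-bit MMIO loads and stores) is an assumption for illustration and is not taken from the actual driver.

```c
/* Illustrative register map for the PFA MMIO ports (offsets from Section B.4).
 * The volatile-pointer access style is an assumption of this sketch. */
#include <stdint.h>

#define PFA_BASE        0x10017000UL
#define PFA_FREE        (PFA_BASE + 0)
#define PFA_FREE_STAT   (PFA_BASE + 8)
#define PFA_EVICT       (PFA_BASE + 16)
#define PFA_EVICT_STAT  (PFA_BASE + 24)
#define PFA_NEW_PGID    (PFA_BASE + 32)
#define PFA_NEW_VADDR   (PFA_BASE + 40)
#define PFA_NEW_STAT    (PFA_BASE + 48)
#define PFA_INIT_MEM    (PFA_BASE + 56)

static inline uint64_t pfa_read(uintptr_t reg)
{
	return *(volatile uint64_t *)reg;
}

static inline void pfa_write(uintptr_t reg, uint64_t val)
{
	*(volatile uint64_t *)reg = val;
}

/* One-time setup: hand the PFA a page of scratch memory (see B.4.8).
 * scratch_paddr is the physical address of a page the OS never touches
 * again. */
static inline void pfa_init(uint64_t scratch_paddr)
{
	pfa_write(PFA_INIT_MEM, scratch_paddr);
}
```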

B.4.1 FREE

Provide free frames to the PFA for use when fetching remote pages.

Load Illegal

Store Expected Value: physical address (paddr) of the frame to publish.
Write the physical address of an unused memory frame to FREE to publish it for use in subsequent remote fetches. The FREE queue is bounded. Users may query FREE STAT before pushing to ensure there is sufficient space. Storing to FREE while the queue is full is illegal.

B.4.2 FREE STAT

Query status of the free queue.

Load Returned Value: Number of unused slots in the free queue. Returning 0 means it is illegal to store to FREE.

Store Illegal

B.4.3 EVICT

Evict pages to remote memory and check whether an eviction is complete.

Load Illegal

Store Expected Value: packed eviction uint64 containing pfn and pgid.
Eviction requires two values, packed into a single 8-byte value:
• pfn: Page frame number. This is the physical address of the page shifted 12 bits to the right (the low 12 bits of a page-aligned address are always 0).
• Page ID: A unique 28-bit value to be associated with the page. It must be unique among all currently evicted pages (pgids may be reused after being seen in the NewQ).

The two values are packed as follows:

Bits 63–36: Page ID
Bits 35–0: PFN

Eviction is asynchronous; multiple pages may be enqueued for eviction simultaneously. Users may check EVICT STAT before storing to ensure there is sufficient room in the queue. Storing to EVICT while the queue is full is illegal. Note: In principle, remote memory may become full (or an application may exceed its allocation). It is expected that the OS will track this out of band; a store to EVICT when remote memory is full is illegal.

B.4.4 EVICT STAT

Query status of the evict queue.

Load Returned Value: Number of unused slots in the evict queue. Returning 0 means it is illegal to store to EVICT. Returning EVICT MAX (size of evict Q) means all pages have been successfully evicted.

Store Illegal

B.4.5 NEW PGID

Check which pages have been fetched automatically by the PFA (useful for OS bookkeeping).

Load Returned Value: Page ID of oldest fetched page that has not been reported (FIFO order). Note: It is illegal to load from an empty new queue. You must check NEW STAT before loading from NEW. Note: Unlike EVICT, NEW always reports every fetched page. Since it may be bounded, it is important for SW to drain this queue periodically. A full new queue will result in a page-fault being delivered to the OS.

Store Illegal

B.4.6 NEW VADDR

Same as NEW PGID but returns the vaddr of fetched pages.

Load Returned Value: virtual address of the oldest fetched page that has not been reported (FIFO order). Note: It is up to the user to keep these queues in sync; ideally both would be read at the same time.

Store Illegal

B.4.7 NEW STAT

Query status of the new page queue.

Load Returned Value: Number of new pages in the queue. Note: It is undefined whether the reported count refers to the NEW VADDR or the NEW PGID queue; the two queues must always be popped together.

Store Illegal
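Putting B.4.5 through B.4.7 together, draining the new-page queue might look like the sketch below (again using the hypothetical MMIO accessors from earlier; bookkeep_new_page stands in for the kernel's per-page bookkeeping).

```c
/* Illustrative drain loop for the new-page queues (Sections B.4.5-B.4.7).
 * Register offsets follow Section B.4; pfa_read() and bookkeep_new_page()
 * are hypothetical helpers used throughout these sketches. */
#include <stdint.h>

#define PFA_BASE      0x10017000UL
#define PFA_NEW_PGID  (PFA_BASE + 32)
#define PFA_NEW_VADDR (PFA_BASE + 40)
#define PFA_NEW_STAT  (PFA_BASE + 48)

extern uint64_t pfa_read(uintptr_t reg);
extern void bookkeep_new_page(uint64_t pgid, uint64_t vaddr);

void pfa_drain_newq(void)
{
	/* NEW_STAT reports how many fetched pages are waiting; it is illegal
	 * to load from an empty queue, so always check first. */
	uint64_t pending = pfa_read(PFA_NEW_STAT);

	while (pending-- > 0) {
		/* Pop both queues together so they stay in sync. */
		uint64_t pgid  = pfa_read(PFA_NEW_PGID);
		uint64_t vaddr = pfa_read(PFA_NEW_VADDR);
		bookkeep_new_page(pgid, vaddr);
	}
}
```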

B.4.8 INIT MEM

Provide one page of scratch memory to the PFA. Software must not write to this page after registering it with the PFA. There is no way to revoke or de-initialize the PFA.

Store Expected Value: Physical address of 1 page of memory to be used by the PFA

Load Illegal

C Memory Blade Protocol

C.1 Ethernet Header

Field (Size in Bytes)   Description
src mac (6)
dst mac (6)
ethertype (2)
vlan id (2)             Optional: set if the previous ethertype is 0x8100
ethertype (2)           Optional: carries the inner ethertype if a VLAN tag is present

C.2 Request Packet

Field (Size in Bytes)   Values                        Description
version (1)             0x1 - draft                   Protocol version
op code (1)             0x0 - page read               Operation identifier
                        0x1 - page write
                        0x2 - word read
                        0x3 - word write
                        0x4 - atomic add
                        0x5 - compare and swap
part id (1)                                           Message part number
reserved (1)
transact id (4)                                       Transaction ID
page num (8)
payload                 Page Read (0)                 Payload size depends on op code
                        Page Write (1368/1368/1360)
                        Word Read (8)
                        Word Write (16)
                        Atomic Add (16)
                        CAS (24)
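A packed C view of this request header might look like the sketch below; field widths follow the table, while the struct and enum names are assumptions for illustration (a real implementation would also need to handle byte order and the multi-part page payloads).

```c
/* Illustrative layout of the memory blade request header (Section C.2).
 * Names are hypothetical; sizes follow the table above. */
#include <stdint.h>

enum memblade_op {
	MB_OP_PAGE_READ  = 0x0,
	MB_OP_PAGE_WRITE = 0x1,
	MB_OP_WORD_READ  = 0x2,
	MB_OP_WORD_WRITE = 0x3,
	MB_OP_ATOMIC_ADD = 0x4,
	MB_OP_CAS        = 0x5,
};

#define MB_VERSION_DRAFT 0x1

struct __attribute__((packed)) memblade_req_hdr {
	uint8_t  version;      /* 0x1 - draft                                 */
	uint8_t  op_code;      /* one of enum memblade_op                     */
	uint8_t  part_id;      /* message part number (for multi-part pages)  */
	uint8_t  reserved;
	uint32_t transact_id;  /* transaction ID, echoed in the response      */
	uint64_t page_num;     /* which remote page the request targets       */
	/* payload follows: 0 to 1368 bytes depending on op_code */
};
```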

C.3 Atomic/Small Payload

Field (Size in Bytes)         Bits   Values           Description
offset + disable + size (2)   15-4   offset           Offset within page + log2 of size in bytes
                              3-2    reserved
                              1-0    size: 0 - 8 bits
                                           1 - 16 bits
                                           2 - 32 bits
                                           3 - 64 bits
Reserved (6)
arg0 (8)                                              The data to swap/add
arg1 (8)                                              The data to compare
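The 2-byte offset/size field can be packed as in the sketch below (a hypothetical helper; the encoding follows the bit positions in the table).

```c
/* Illustrative packing of the 2-byte offset/size field (Section C.3):
 * bits 15-4 hold the byte offset within the page, bits 1-0 hold log2 of
 * the access size in bytes (0 = 8-bit ... 3 = 64-bit), bits 3-2 reserved. */
#include <stdint.h>

static inline uint16_t mb_pack_offset_size(uint16_t offset, unsigned log2_bytes)
{
	return (uint16_t)(((offset & 0xFFF) << 4) | (log2_bytes & 0x3));
}

/* Example: a 64-bit word at byte offset 0x128 within the page:
 *   uint16_t f = mb_pack_offset_size(0x128, 3);
 */
```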

C.4 Response Packet

Field (Size in Bytes)   Values                      Description
version (1)             0x1 - draft                 Protocol version
resp code (1)           0x80 - success w/ page      Response code
                        0x81 - success w/o data
                        0x82 - success w/ word
                        >= 0x83 - error
part id (1)                                         Message part number
reserved (1)
transact id (4)                                     Transaction ID
payload                 w/ page (1368/1368/1360)
                        w/ word (8)

Glossary

anonymous page: A page that does not contain disk-backed information. This is primarily "heap" memory (e.g., memory allocated through malloc()).

bookkeeping: The internal OS-specific tasks related to bringing in a new page. This typically includes updating metadata and other page-tracking activities.

cgroup: The per-task (or group of tasks) resource management system in the Linux kernel.

disaggregation: The WSC design that moves compute resources (such as memory, disk, and CPUs) into dedicated resource blades that are connected through a high-performance network.

EvictQ: Queue of pages to be evicted by the PFA. Populated by the OS when it needs to free local physical memory.

frame: Synonym for page frame.

FreeQ: Queue of free frames to be used by the PFA to service page faults.

kpfad: A background daemon that opportunistically performs bookkeeping and maintenance for the PFA.

kswapd: A background kernel thread that opportunistically performs bookkeeping.

least recently used: An algorithm that attempts to pick pages that have not been used recently.

maximum transfer unit: The largest contiguous packet that a network is capable of transmitting.

memory blade: A dedicated memory node in a disaggregated system. The memory blade exists solely to serve memory requests. A memory blade may be custom-designed for this purpose, or may simply expose an RDMA interface.

NewQ: Queue of new-page descriptors populated by the PFA on every page fault and drained by the OS for bookkeeping.

non-uniform memory access: A system where memory is cache-coherently available to multiple CPUs, but with varying access latencies and bandwidths (a type of multi-socket machine).

non-volatile memory: Storage devices with near-DRAM performance and byte-addressability that do not lose their data when powered off.

page: A fixed-size logical group of data (typically 4 KiB). Sometimes called a "virtual page".

page fault accelerator: The proposed hardware accelerator that handles page faults for remote pages automatically.

page table: A hardware-visible tree in main memory that contains translations from virtual to physical addresses.

page table entry: A single entry of the page table. Each PTE refers to a single virtual page.

page table walker: A hardware device that automatically walks the page table and locates PTEs for a particular virtual address.

pageID: A unique identifier for a page in remote memory. Acts as a remote-memory address.

paging: The process of storing logical pages on an external storage device in order to free physical memory. Also called "swapping".

remote direct memory access: A system where memory is directly addressable between multiple nodes through a network interface. RDMA systems are not typically cache-coherent.

swap: Historically used to refer to the process of moving an entire process's memory image to disk; Linux uses "swapping" to refer to all paging.

swap_entry_t: A Linux-specific value stored in evicted PTEs that contains information on where to locate an evicted page.

system in package: A system where all the necessary components (NIC, CPU, memory) are grouped into the same physical package (but not necessarily on the same chip).

task: Linux kernel internal abstraction of a process.

total cost of ownership: A metric that includes not only the up-front cost of a system, but the total cost to own and operate that system over its effective lifespan.

transcendent memory: A layer in the Linux paging subsystem that stores pages in specialized memory that may not be disk-backed.

translation look-aside buffer: A cache of virtual-to-physical address translations.

virtual memory area: Contiguous region of virtual memory used by Linux to simplify memory management.

warehouse-scale computer: Generic term referring to tightly integrated clusters of machines deployed in the data center.

Acronyms

LRU: least recently used
MTU: maximum transfer unit
NUMA: non-uniform memory access
NVM: non-volatile memory
PFA: page fault accelerator
PTE: page table entry
PTW: page table walker
RDMA: remote direct memory access
SiP: system in package
TCO: total cost of ownership
TLB: translation look-aside buffer
TMem: transcendent memory
VMA: virtual memory area
WSC: warehouse-scale computer

References

[1] T. E. Anderson, D. E. Culler, and D. A. Patterson, "The Berkeley Networks of Workstations (NOW) project," in Proceedings of the 40th IEEE Computer Society International Conference, COMPCON '95, (Washington, DC, USA), pp. 322–, IEEE Computer Society, 1995.

[2] K. Asanović, "FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers," FAST 2014.

[3] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends, "Rack-scale disaggregated cloud data centers: The dReDBox project vision," in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 690–695, March 2016.

[4] HP Labs, “The machine.” https://www.labs.hpe.com/the-machine, 2017.

[5] Huawei, “High throughput computing data center architecture - thinking of data center 3.0.” www.huawei.com/ilink/en/download/HW_349607, 2014.

[6] Intel, “Intel rack scale design.” https://www-ssl.intel.com/content/www/us/en/ architecture-and-technology/rack-scale-design-overview.html, 2017.

[7] Facebook, “Disaggregated rack.” http://www.opencompute.org/wp/wp-content/ uploads/2013/01/OCP_Summit_IV_Disaggregation_Jason_Taylor.pdf, 2013.

[8] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin, "Efficient Memory Disaggregation with Infiniswap," in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), (Boston, MA), pp. 649–667, USENIX Association, 2017.

[9] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, “Network Requirements for Resource Disaggregation,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), (Savannah, GA), pp. 249–264, USENIX Association, 2016.

[10] S. Volos, D. Jevdjic, B. Falsafi, and B. Grot, “Fat caches for scale-out servers,” IEEE Micro, vol. 37, pp. 90–103, Mar. 2017.

[11] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, "A Fully Associative, Tagless DRAM Cache," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA '15, (New York, NY, USA), pp. 211–222, ACM, 2015.

[12] C. Sun, M. T. Wade, Y. Lee, J. S. Orcutt, L. Alloatti, M. S. Georgas, A. S. Waterman, J. M. Shainline, R. R. Avižienis, S. Lin, B. R. Moss, R. Kumar, F. Pavanello, A. H. Atabaki, H. M. Cook, A. J. Ou, J. C. Leu, Y.-H. Chen, K. Asanović, R. J. Ram, M. A. Popović, and V. M. Stojanović, "Single-chip microprocessor that communicates directly using light," Nature, vol. 528, pp. 534–538, Dec. 2015.

[13] C. Batten, A. Joshi, J. Orcutt, C. Holzwarth, M. Popovic, J. Hoyt, F. Kartner, R. Ram, V. Stojanovic, and K. Asanovic, "Building manycore processor-to-DRAM networks with monolithic CMOS silicon photonics," IEEE Micro, vol. PP, no. 99, p. 11, 2016.

[14] C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and E. Zamanian, "The end of slow networks: It's time for a redesign," Proc. VLDB Endow., vol. 9, pp. 528–539, Mar. 2016.

[15] J. Laudon and D. Lenoski, The SGI Origin: A ccNUMA Highly Scalable Server, pp. 241–251. ISCA '97, ACM, 1997.

[16] “Sgi uv 3000, uv 30: Big brains for no-limit computing,” tech. rep., Silicon Graphics International Corp., 2016.

[17] “RoCE in the Data Center,” tech. rep., Mellanox, October 2014.

[18] R. Recio, B. Metzler, P. Culley, J. Hilland, and D. Garcia, “A remote direct memory access protocol specification,” RFC 5040, RFC Editor, October 2007.

[19] “Deploying hpc cluster with mellanox infiniband interconnect solutions,” tech. rep., Mellanox Technologies, 2017.

[20] “Edr infiniband.” https://www.openfabrics.org/images/eventpresos/ workshops2015/UGWorkshop/Friday/friday_01.pdf, 2015. 11th Annual Open- Fabrics Alliance Workshop.

[21] “Ccix consortium.” https://www.ccixconsortium.com, 2017.

[22] B. Wile, “Coherent accelerator processor interface (capi) for power8 systems,” tech. rep., IBM Systems and Technology Group, September 2014.

[23] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot, “Scale-out NUMA,” in Proceedings of the 19th International Conference on Architectural Support for Program- ming Languages and Operating Systems, ASPLOS ’14, (New York, NY, USA), pp. 3–18, ACM, 2014.

[24] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, System-level implications of disaggregated memory, pp. 1–12. IEEE, 2012.

[25] Gen-Z Consortium, “Gen-z overview,” tech. rep., 2016.

[26] “What is Linux Memory Policy?.” https://www.kernel.org/doc/Documentation/vm/ numa_memory_policy.txt, 2017.

[27] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren, Introduction to UPC and language specification. 1999.

[28] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin, Latency-tolerant software distributed shared memory, pp. 291–305. 2015.

[29] S. M. Rumble, A. Kejriwal, and J. Ousterhout, "Log-structured memory for DRAM-based storage," in Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), (Santa Clara, CA), pp. 1–16, USENIX, 2014.

[30] A. Dragojević, D. Narayanan, E. B. Nightingale, M. Renzelmann, A. Shamis, A. Badam, and M. Castro, No Compromises: Distributed Transactions with Consistency, Availability, and Performance, pp. 54–70. ACM, 2015.

[31] S. Kannan, A. Gavrilovska, V. Gupta, and K. Schwan, HeteroOS: OS Design for Heterogeneous Memory Management in Datacenter, pp. 521–534. ACM, 2017.

[32] Y. Zhang, J. Yang, A. Memaripour, and S. Swanson, Mojim: A Reliable and Highly-Available Non-Volatile Memory System, pp. 3–18. ACM, 2015.

[33] A. Waterman and K. Asanović, eds., The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Version 1.10. RISC-V Foundation, May 2017.

[34] “Linux kernel.” https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/, 2017.

[35] K. Asanović and D. A. Patterson, "Instruction sets should be free: The case for RISC-V," Tech. Rep. UCB/EECS-2014-146, EECS Department, University of California, Berkeley, Aug 2014.

[36] Y. Lee and A. Waterman, "RISC-V ISA simulator." https://github.com/riscv/riscv-isa-sim, 2017.

[37] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, Chisel: Constructing hardware in a Scala embedded language, pp. 1216–1225. ACM, 2012.

[38] K. Asanović, R. Avižienis, J. Bachrach, S. Beamer, D. Biancolin, C. Celio, H. Cook, D. Dabbelt, J. Hauser, A. Izraelevitz, S. Karandikar, B. Keller, D. Kim, J. Koenig, Y. Lee, E. Love, M. Maas, A. Magyar, H. Mao, M. Moreto, A. Ou, D. A. Patterson, B. Richards, C. Schmidt, S. Twigg, H. Vo, and A. Waterman, "The Rocket Chip Generator," Tech. Rep. UCB/EECS-2016-17, EECS Department, University of California, Berkeley, Apr 2016.

[39] C. Celio, D. A. Patterson, and K. Asanović, "The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor," Tech. Rep. UCB/EECS-2015-167, EECS Department, University of California, Berkeley, Jun 2015.

[40] Y. Lee, C. Schmidt, A. Ou, A. Waterman, and K. Asanovic, “The hwacha vector-fetch architecture manual, version 3.8.,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-262, 2015.

[41] S. Karandikar and M. Champion, "Bringing datacenter-scale hardware-software co-design to the cloud with FireSim and Amazon EC2 F1 instances." https://aws.amazon.com/blogs/compute/bringing-datacenter-scale-hardware-software-co-design-to-the-cloud-with-firesim-and-amazon-ec2-f1-instances/, 2017. AWS Compute Blog.

[42] D. Bovet and M. Cesati, Understanding the Linux Kernel, Second Edition. Sebastopol, CA, USA: O’Reilly & Associates, Inc., 3 ed., 2005.

[43] D. Magenheimer, “Transcendent memory in a nutshell.” https://lwn.net/Articles/454795/, 2011.

[44] H. Cook, K. Asanović, and D. A. Patterson, "Virtual local stores: Enabling software-managed memory hierarchies in mainstream computing environments," Tech. Rep. UCB/EECS-2009-131, EECS Department, University of California, Berkeley, Sep 2009.
