LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation

Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University

In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), October 8-10, 2018, Carlsbad, CA, USA.
https://www.usenix.org/conference/osdi18/presentation/shan



Abstract

The monolithic server model, where a server is the unit of deployment, operation, and failure, is meeting its limits in the face of several recent hardware and application trends. To improve resource utilization, elasticity, heterogeneity, and failure handling in datacenters, we believe that datacenters should break monolithic servers into disaggregated, network-attached hardware components. Despite the promising benefits of hardware resource disaggregation, no existing OSes or software systems can properly manage it.

We propose a new OS model called the splitkernel to manage disaggregated systems. Splitkernel disseminates traditional OS functionalities into loosely-coupled monitors, each of which runs on and manages a hardware component. A splitkernel also performs resource allocation and failure handling of a distributed set of hardware components. Using the splitkernel model, we built LegoOS, a new OS designed for hardware resource disaggregation. LegoOS appears to users as a set of distributed servers. Internally, a user application can span multiple processor, memory, and storage hardware components. We implemented LegoOS on x86-64 and evaluated it by emulating hardware components using commodity servers. Our evaluation results show that LegoOS' performance is comparable to monolithic Linux servers, while largely improving resource packing and reducing failure rate over monolithic clusters.

1 Introduction

For many years, the unit of deployment, operation, and failure in datacenters has been a monolithic server, one that contains all the hardware resources that are needed to run a user program (typically a processor, some main memory, and a disk or an SSD). This monolithic architecture is meeting its limitations in the face of several issues and recent trends in datacenters.

First, datacenters face a difficult bin-packing problem of fitting applications to physical machines. Since a process can only use processor and memory in the same machine, it is hard to achieve full memory and CPU resource utilization [18, 33, 65]. Second, after packaging hardware devices in a server, it is difficult to add, remove, or change hardware components in datacenters [39]. Moreover, when a hardware component like a memory controller fails, the entire server is unusable. Finally, modern datacenters host increasingly heterogeneous hardware [5, 55, 84, 94]. However, designing new hardware that can fit into monolithic servers and deploying them in datacenters is a painful and cost-ineffective process that often limits the speed of new hardware adoption.

We believe that datacenters should break monolithic servers and organize hardware devices like CPU, DRAM, and disks as independent, failure-isolated, network-attached components, each having its own controller to manage its hardware. This hardware resource disaggregation architecture is enabled by recent advances in network technologies [24, 42, 52, 66, 81, 88] and the trend towards increasing processing power in hardware controllers [9, 23, 92]. Hardware resource disaggregation greatly improves resource utilization, elasticity, heterogeneity, and failure isolation, since each hardware component can operate or fail on its own and its resource allocation is independent from other components. With these benefits, this new architecture has already attracted early attention from academia and industry [1, 15, 48, 56, 63, 77].

Hardware resource disaggregation completely shifts the paradigm of computing and presents a key challenge to system builders: how to manage and virtualize the distributed, disaggregated hardware components?

Unfortunately, existing kernel designs cannot address the new challenges hardware resource disaggregation brings, such as network communication overhead across disaggregated hardware components, fault tolerance of hardware components, and the resource management of distributed components. Monolithic kernels, microkernels [36], and exokernels [37] run one OS on a monolithic machine, and the OS assumes local accesses to shared main memory, storage devices, network interfaces, and other hardware resources in the machine. After disaggregating hardware resources, it may be viable to run the OS at a processor and remotely manage all other hardware components. However, remote management requires a significant amount of network traffic, and when processors fail, other components are unusable. Multi-kernel OSes [21, 26, 76, 106] run a kernel at each processor (or core) in a monolithic computer, and these per-processor kernels communicate with each other through message passing. Multi-kernels still assume local accesses to hardware resources in a monolithic machine, and their message passing is over local buses instead of a general network. While existing OSes could be retrofitted to support hardware resource disaggregation, such retrofitting will be invasive to the central subsystems of an OS, such as memory and I/O management.

We propose splitkernel, a new OS architecture for hardware resource disaggregation (Figure 3).

[Figure 1: OSes Designed for Monolithic Servers. Figure 2: Multi-kernel Architecture (P-NIC: programmable NIC). Figure 3: Splitkernel Architecture.]

The basic idea is simple: when hardware is disaggregated, the OS should be also. A splitkernel breaks traditional operating system functionalities into loosely-coupled monitors, each running at and managing a hardware component. Monitors in a splitkernel can be heterogeneous and can be added, removed, and restarted dynamically without affecting the rest of the system. Each splitkernel monitor operates locally for its own functionality and only communicates with other monitors when there is a need to access resources there. There are only two global tasks in a splitkernel: orchestrating resource allocation across components and handling component failure.

We choose not to support coherence across different components in a splitkernel. A splitkernel can use any general network to connect its hardware components. All monitors in a splitkernel communicate with each other via network messaging only. With our targeted scale, explicit message passing is much more efficient in network bandwidth consumption than the alternative of implicitly maintaining cross-component coherence.

Following the splitkernel model, we built LegoOS, the first OS designed for hardware resource disaggregation. LegoOS is a distributed OS that appears to applications as a set of virtual servers (called vNodes). A vNode can run on multiple processor, memory, and storage components, and one component can host resources for multiple vNodes. LegoOS cleanly separates OS functionalities into three types of monitors: process monitor, memory monitor, and storage monitor. LegoOS monitors share no or minimal states and use a customized RDMA-based network stack to communicate with each other.

The biggest challenge and our focus in building LegoOS is the separation of processor and memory and their management. Modern processors and OSes assume all hardware memory units, including main memory, page tables, and TLBs, are local. Simply moving all memory hardware and software to memory components across the network will not work.

We propose a new hardware architecture to cleanly separate processor and memory hardware functionalities, while preserving most of the performance of the monolithic server architecture. LegoOS moves all memory hardware units to the disaggregated memory components and organizes all levels of processor caches as virtual caches that are accessed using virtual memory addresses. To improve performance, LegoOS uses a small amount (e.g., 4 GB) of DRAM organized as a virtual cache below the current last-level cache.

The LegoOS process monitor manages application processes and the extended DRAM-cache. The memory monitor manages all virtual and physical memory space allocation and address mappings. LegoOS uses a novel two-level distributed virtual memory space management mechanism, which ensures efficient foreground memory accesses and balances load and space utilization at allocation time. Finally, LegoOS uses a space- and performance-efficient memory replication scheme to handle memory failure.

We implemented LegoOS on the x86-64 architecture. LegoOS is fully backward compatible with Linux ABIs by supporting common Linux system call APIs. To evaluate LegoOS, we emulate disaggregated hardware components using commodity servers. We evaluated LegoOS with microbenchmarks, the PARSEC benchmarks [22], and two unmodified datacenter applications, Phoenix [85] and TensorFlow [4]. Our evaluation results show that compared to monolithic Linux servers that can hold all the working sets of these applications, LegoOS is only 1.3x to 1.7x slower with 25% of the application working set available as DRAM cache at processor components. Compared to monolithic Linux servers whose main memory size is the same as LegoOS' DRAM cache size and which use local SSD/DRAM swapping or network swapping, LegoOS' performance is 0.8x to 3.2x. At the same time, LegoOS largely improves resource packing and reduces system mean time to failure.

Overall, this work makes the following contributions:

• We propose the concept of splitkernel, a new OS architecture that fits the hardware resource disaggregation architecture.
• We built LegoOS, the first OS that runs on and manages a disaggregated hardware cluster.
• Based on application properties and hardware trends, we propose a hardware plus software solution that cleanly separates processor and memory functionalities, while meeting application performance requirements.

LegoOS is publicly available at http://LegoOS.io.

[Figure 4: Datacenter Resource Utilization. (a) Google Cluster; (b) Alibaba Cluster.]

2 Disaggregate Hardware Resource

This section motivates the hardware resource disaggregation architecture and discusses the challenges in managing disaggregated hardware.

2.1 Limitations of Monolithic Servers

A monolithic server has been the unit of deployment and operation in datacenters for decades. This long-standing server-centric architecture has several key limitations.

Inefficient resource utilization. With a server being the physical boundary of resource allocation, it is difficult to fully utilize all resources in a datacenter [18, 33, 65]. We analyzed two production cluster traces: a 29-day Google one [45] and a 12-hour Alibaba one [10]. Figure 4 plots the aggregated CPU and memory utilization in the two clusters. For both clusters, only around half of the CPU and memory are utilized. Interestingly, a significant amount of jobs are being evicted at the same time in these traces (e.g., evicting low-priority jobs to make room for high-priority ones [102]). One of the main reasons for resource underutilization in these production clusters is the constraint that CPU and memory for a job have to be allocated from the same physical machine.

Poor hardware elasticity. It is difficult to add, move, remove, or reconfigure hardware components after they have been installed in a monolithic server [39]. Because of this rigidity, datacenter owners have to plan out server configurations in advance. However, with today's speed of change in application requirements, such plans have to be adjusted frequently, and when changes happen, they often come with waste in existing server hardware.

Coarse failure domain. The failure unit of monolithic servers is coarse. When a hardware component within a server fails, the whole server is often unusable and applications running on it can all crash. Previous analysis [90] found that motherboard, memory, CPU, and power supply failures account for 50% to 82% of hardware failures in a server. Unfortunately, monolithic servers cannot continue to operate when any of these devices fail.

Bad support for heterogeneity. Driven by application needs, new hardware technologies are finding their ways into modern datacenters [94]. Datacenters no longer host only commodity servers with CPU, DRAM, and hard disks. They include non-traditional and specialized hardware like GPGPU [11, 46], TPU [55], DPU [5], FPGA [12, 84], non-volatile memory [49], and NVMe-based SSDs [98]. The monolithic server model tightly couples hardware devices with each other and with a motherboard. As a result, making new hardware devices work with existing servers is a painful and lengthy process [84]. Moreover, datacenters often need to purchase new servers to host certain hardware. Other parts of the new servers can go underutilized, and old servers need to retire to make room for new ones.

2.2 Hardware Resource Disaggregation

The server-centric architecture is a bad fit for the fast-changing datacenter hardware, software, and cost needs. There is an emerging interest in utilizing resources beyond a local machine [41], such as distributed memory [7, 34, 74, 79] and network swapping [47]. These solutions improve resource utilization over traditional systems. However, they cannot solve all the issues of monolithic servers (e.g., the last three issues in §2.1), since their hardware model is still a monolithic one. To fully support the growing heterogeneity in hardware and to provide elasticity and flexibility at the hardware level, we should break the monolithic server model.

We envision a hardware resource disaggregation architecture where hardware resources in traditional servers are disseminated into network-attached hardware components. Each component has a controller and a network interface, can operate on its own, and is an independent, failure-isolated entity.

The disaggregated approach largely increases the flexibility of a datacenter. Applications can freely use resources from any hardware component, which makes resource allocation easy and efficient. Different types of hardware resources can scale independently. It is easy to add, remove, or reconfigure components. New types of hardware components can easily be deployed in a datacenter — by simply enabling the hardware to talk to the network and adding a new network link to connect it. Finally, hardware resource disaggregation enables fine-grained failure isolation, since one component failure will not affect the rest of a cluster.

Three hardware trends are making resource disaggregation feasible in datacenters. First, network speed has grown by more than an order of magnitude and has become more scalable in the past decade with new technologies like Remote Direct Memory Access (RDMA) [69] and new topologies and switches [15, 30, 31], enabling fast accesses of hardware components that are disaggregated across the network. InfiniBand will soon reach 200 Gbps and sub-600-nanosecond latency [66], being only 2x to 4x slower than the main memory bus in bandwidth. With the main memory bus facing a bandwidth wall [87], future network bandwidth (at line rate) is even projected to exceed local DRAM bandwidth [99].

Second, network interfaces are moving closer to hardware components, with technologies like Intel Omni-Path [50], RDMA [69], and NVMe over Fabrics [29, 71]. As a result, hardware devices will be able to access the network directly without the need to attach any processors.

Finally, hardware devices are incorporating more processing power [8, 9, 23, 67, 68, 75], allowing application and OS logic to be offloaded to hardware [57, 92]. On-device processing power will enable system software to manage disaggregated hardware components locally.

With these hardware trends and the limitations of monolithic servers, we believe that future datacenters will be able to largely benefit from hardware resource disaggregation. In fact, there have already been several initial hardware proposals in resource disaggregation [1], including disaggregated memory [63, 77, 78], disaggregated flash [59, 60], Intel Rack-Scale System [51], HP "The Machine" [40, 48], IBM Composable System [28], and Berkeley Firebox [15].

2.3 OSes for Resource Disaggregation

Despite the various benefits hardware resource disaggregation promises, it is still unclear how to manage or utilize disaggregated hardware in a datacenter. Unfortunately, existing OSes and distributed systems cannot work well with this new architecture. Single-node OSes like Linux view a server as the unit of management and assume all hardware components are local (Figure 1). A potential approach is to run these OSes on processors and access memory, storage, and other hardware resources remotely. Recent disaggregated systems like soNUMA [78] take this approach. However, this approach incurs high network latency and bandwidth consumption with remote device management, misses the opportunity of exploiting device-local computation power, and makes processors the single point of failure.

Multi-kernel solutions [21, 26, 76, 106, 107] (Figure 2) view different cores, processors, or programmable devices within a server separately by running a kernel on each core/device and using message passing to communicate across kernels. These kernels still run in a single server and all access some common hardware resources in the server like memory and the network interface. Moreover, they do not manage distributed resources or handle failures in a disaggregated cluster.

There have been various distributed OS proposals, most of which date decades back [16, 82, 97]. Most of these distributed OSes manage a set of monolithic servers instead of hardware components.

Hardware resource disaggregation is fundamentally different from the traditional monolithic server model. A complete disaggregation of processor, memory, and storage means that when managing one of them, there will be no local accesses to the other two. For example, processors will have no local memory or storage to store user or kernel data. An OS also needs to manage distributed hardware resources and handle hardware component failure. We summarize the following key challenges in building an OS for resource disaggregation, some of which have previously been identified [40].

• How to deliver good performance when application execution involves the access of network-partitioned disaggregated hardware and the current network is still slower than local buses?
• How to locally manage individual hardware components with limited hardware resources?
• How to manage distributed hardware resources?
• How to handle a component failure without affecting other components or running applications?
• What abstraction should be exposed to users and how to support existing datacenter applications?

Instead of retrofitting existing OSes to confront these challenges, we take the approach of designing a new OS architecture from the ground up for hardware resource disaggregation.

3 The Splitkernel OS Architecture

We propose splitkernel, a new OS architecture for resource disaggregation. Figure 3 illustrates splitkernel's overall architecture. The splitkernel disseminates an OS into pieces of different functionalities, each running at and managing a hardware component. All components communicate by message passing over a common network, and the splitkernel globally manages resources and component failures. Splitkernel is a general OS architecture we propose for hardware resource disaggregation. There can be many types of implementations of splitkernel. Further, we make no assumption on the specific hardware or network type in a disaggregated cluster a splitkernel runs on. Below, we describe four key concepts of the splitkernel architecture.

Split OS functionalities. Splitkernel breaks traditional OS functionalities into monitors. Each monitor manages a hardware component and virtualizes and protects its physical resources. Monitors in a splitkernel are loosely coupled, and they communicate with other monitors to access remote resources. For each monitor to operate on its own with minimal dependence on other monitors, we use a stateless design by sharing no or minimal states, or metadata, across monitors.
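To make the stateless, message-passing contract between monitors concrete, the following sketch (in C, the language LegoOS is implemented in) shows one way a cross-monitor request could be framed so that it is fully self-describing. The opcodes, field names, and dispatch routine are illustrative assumptions for this discussion, not LegoOS' actual message format.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical opcodes for cross-monitor requests. */
    enum sk_opcode {
        SK_ALLOC_VMEM = 1,   /* process monitor -> memory monitor  */
        SK_FETCH_LINE = 2,   /* process monitor -> memory monitor  */
        SK_READ_FILE  = 3,   /* memory monitor  -> storage monitor */
    };

    /* Every request is self-describing: it carries everything the
     * destination monitor needs, so no session state is shared. */
    struct sk_msg_hdr {
        uint32_t opcode;      /* which operation to perform          */
        uint32_t src_id;      /* sending component's cluster-wide id */
        uint64_t payload_len; /* bytes of opcode-specific payload    */
    };

    /* A monitor's receive loop dispatches purely on the header. */
    static void sk_dispatch(const struct sk_msg_hdr *hdr, const void *payload)
    {
        switch (hdr->opcode) {
        case SK_ALLOC_VMEM: /* handle virtual memory allocation */ break;
        case SK_FETCH_LINE: /* return the requested cache line  */ break;
        case SK_READ_FILE:  /* serve a buffer-cache miss        */ break;
        default:            /* unknown requests are rejected    */ break;
        }
        (void)payload;
    }

Because each request carries its own context, a monitor can be restarted or replaced without any peer having to rebuild shared session state.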

Run monitors at hardware components. We expect each non-processor hardware component in a disaggregated cluster to have a controller that can run a monitor. A hardware controller can be a low-power general-purpose core, an ASIC, or an FPGA. Each monitor in a splitkernel can use its own implementation to manage the hardware component it runs on. This design makes it easy to integrate heterogeneous hardware in datacenters — to deploy a new hardware device, its developers only need to build the device, implement a monitor to manage it, and attach the device to the network. Similarly, it is easy to reconfigure, restart, and remove hardware components.

Message passing across non-coherent components. Unlike other proposals of disaggregated systems [48] that rely on coherent interconnects [24, 42, 81], a splitkernel runs on a general-purpose network layer like Ethernet, and neither the underlying hardware nor the splitkernel provides cache coherence across components. We made this design choice mainly because maintaining coherence for our targeted cluster scale would cause high network bandwidth consumption. Instead, all communication across components in a splitkernel is through network messaging. A splitkernel still retains the coherence guarantee that hardware already provides within a component (e.g., cache coherence across cores in a CPU), and applications running on top of a splitkernel can use message passing to implement their desired level of coherence for their data across components.

Global resource management and failure handling. One hardware component can host resources for multiple applications, and its failure can affect all these applications. In addition to managing individual components, the splitkernel also needs to globally manage resources and failure. To minimize performance and scalability bottlenecks, the splitkernel involves global resource management only occasionally, for coarse-grained decisions, while individual monitors make their own fine-grained decisions. The splitkernel handles component failure by adding redundancy for recovery.

4 LegoOS Design

Based on the splitkernel architecture, we built LegoOS, the first OS designed for hardware resource disaggregation. LegoOS is a research prototype that demonstrates the feasibility of the splitkernel design, but it is not the only way to build a splitkernel. LegoOS' design targets three types of hardware components: processor, memory, and storage, and we call them pComponent, mComponent, and sComponent.

This section first introduces the abstraction LegoOS exposes to users and then describes the hardware architecture of components LegoOS runs on. Next, we explain the design of LegoOS' process, memory, and storage monitors. Finally, we discuss LegoOS' global resource management and failure handling mechanisms. Overall, LegoOS achieves the following design goals:

• Clean separation of process, memory, and storage functionalities.
• Monitors run at hardware components and fit device constraints.
• Comparable performance to monolithic Linux servers.
• Efficient resource management and memory failure handling, both in space and in performance.
• Easy-to-use, backward compatible user interface.
• Supports common Linux system call interfaces.

4.1 Abstraction and Usage Model

LegoOS exposes a distributed set of virtual nodes, or vNodes, to users. From a user's point of view, a vNode is like a virtual machine. Multiple users can run in a vNode and each user can run multiple processes. Each vNode has a unique ID, a unique virtual IP address, and its own storage mount point. LegoOS protects and isolates the resources given to each vNode from others. Internally, one vNode can run on multiple pComponents, multiple mComponents, and multiple sComponents. At the same time, each hardware component can host resources for more than one vNode. The internal execution status is transparent to LegoOS users; they do not know which physical components their applications run on.

With splitkernel's design principle of components not being coherent, LegoOS does not support writable shared memory across processors. LegoOS assumes that threads within the same process access shared memory and threads belonging to different processes do not share writable memory, and LegoOS makes scheduling decisions based on this assumption (§4.3.1). Applications that use shared writable memory across processes (e.g., with MAP_SHARED) will need to be adapted to use message passing across processes. We made this decision because writable shared memory across processes is rare (we have not seen a single instance in the datacenter applications we studied), and supporting it makes both hardware and software more complex (in fact, we had implemented this support but later decided not to include it because of its complexity).

One of the initial decisions we made when building LegoOS is to support the Linux system call interface and unmodified Linux ABIs, because doing so can greatly ease the adoption of LegoOS. Distributed applications that run on Linux can seamlessly run on a LegoOS cluster by running on a set of vNodes.
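As a rough illustration of the vNode abstraction in §4.1, a minimal descriptor might look like the sketch below. The struct layout, field names, and size limits are assumptions made purely for illustration and are not LegoOS' actual data structures.

    #include <stdint.h>

    #define VNODE_MAX_COMPONENTS 64   /* illustrative limit */

    /* A vNode looks like a virtual machine to users but is backed by
     * disaggregated components of all three types. */
    struct vnode {
        uint32_t id;                 /* unique vNode ID                */
        uint32_t vip;                /* unique virtual IPv4 address    */
        char     mount_point[128];   /* root of this vNode's file tree */

        /* Cluster-wide IDs of components backing this vNode. One
         * component may also appear in several vNodes' lists. */
        uint16_t pcomponents[VNODE_MAX_COMPONENTS];
        uint16_t mcomponents[VNODE_MAX_COMPONENTS];
        uint16_t scomponents[VNODE_MAX_COMPONENTS];
        uint8_t  n_pcomp, n_mcomp, n_scomp;
    };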

4.2 Hardware Architecture

LegoOS pComponents, mComponents, and sComponents are independent devices, each having their own hardware controller and network interface (for a pComponent, the hardware controller is the processor itself). Our current hardware model uses CPUs in pComponents, DRAM in mComponents, and SSDs or HDDs in sComponents. We leave exploring other hardware devices for future work.

To demonstrate the feasibility of hardware resource disaggregation, we propose a pComponent and an mComponent architecture designed within today's network, processor, and memory performance and hardware constraints (Figure 5).

[Figure 5: LegoOS pComponent and mComponent Architecture.]

Separating process and memory functionalities. LegoOS moves all hardware memory functionalities to mComponents (e.g., page tables, TLBs) and leaves only caches at the pComponent side. With a clean separation of process and memory hardware units, the allocation and management of memory can be completely transparent to pComponents. Each mComponent can choose its own memory allocation technique and virtual-to-physical memory address mapping (e.g., segmentation).

Processor virtual caches. After moving all memory functionalities to mComponents, pComponents will only see virtual addresses and have to use virtual memory addresses to access their caches. Because of this, LegoOS organizes all levels of pComponent caches as virtual caches [44, 104], i.e., virtually-indexed and virtually-tagged caches.

A virtual cache has two potential problems, commonly known as synonyms and homonyms [95]. Synonyms happen when a physical address maps to multiple virtual addresses (and thus multiple virtual cache lines) as a result of memory sharing across processes, and the update of one virtual cache line will not be reflected in other lines that share the data. Since LegoOS does not allow writable inter-process memory sharing, it will not have the synonym problem. The homonym problem happens when two address spaces use the same virtual address for their own different data. Similar to previous solutions [20], we solve homonyms by storing an address space ID (ASID) with each cache line and differentiating a virtual address in different address spaces using ASIDs.

Separating memory for performance and for capacity. Previous studies [41, 47] and our own show that today's network speed cannot meet application performance requirements if all memory accesses are across the network. Fortunately, many modern datacenter applications exhibit strong temporal locality in their memory accesses. For example, we found 90% of memory accesses in PowerGraph [43] go to just 0.06% of total memory and 95% go to 3.1% of memory (22% and 36% respectively for TensorFlow [4], 5.1% and 6.6% for Phoenix [85]).

With good memory-access locality, we propose to leave a small amount of memory (e.g., 4 GB) at each pComponent and move most memory across the network (e.g., a few TBs per mComponent). pComponents' local memory can be regular DRAM or on-die HBM [53, 72], and mComponents use DRAM or NVM.

Different from previous proposals [63], we propose to organize pComponents' DRAM/HBM as a cache rather than main memory, for a clean separation of process and memory functionalities. We place this cache under the current processor Last-Level Cache (LLC) and call it an extended cache, or ExCache. ExCache serves as another layer in the memory hierarchy between the LLC and memory across the network. With this design, ExCache can serve hot memory accesses fast, while mComponents can provide the capacity applications desire.

ExCache is a virtual, inclusive cache, and we use a combination of hardware and software to manage it. Each ExCache line has a (virtual-address) tag and two access permission bits (one for read/write and one for valid). These bits are set by software when a line is inserted into ExCache and checked by hardware at access time. For best hit performance, the hit path of ExCache is handled purely by hardware — the hardware cache controller maps a virtual address to an ExCache set, fetches and compares tags in the set, and on a hit, fetches the hit ExCache line. Handling ExCache misses is more complex than with traditional CPU caches, and thus we use LegoOS to handle the miss path of ExCache (see §4.3.2).

Finally, we use a small amount of DRAM/HBM at the pComponent for LegoOS' own kernel data, accessed directly with physical memory addresses and managed by LegoOS. LegoOS ensures that all its own data fits in this space to avoid going to mComponents.

With our design, pComponents do not need any address mappings: LegoOS accesses all pComponent-side DRAM/HBM using physical memory addresses and does simple calculations to locate the ExCache set for a memory access. Another benefit of not handling address mapping at pComponents and moving TLBs to mComponents is that pComponents do not need to access a TLB or suffer from TLB misses, potentially making pComponent cache accesses faster [58].
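The ExCache metadata described in §4.2 — a virtual-address tag, a read/write bit, a valid bit, plus an ASID to avoid homonyms — and the simple calculation that locates a set can be sketched in C as follows. The cache geometry and names are illustrative assumptions; real hardware would fix these parameters at design time.

    #include <stdint.h>
    #include <stdbool.h>

    #define EXCACHE_LINE_SHIFT 12           /* 4 KB lines, as in the prototype */
    #define EXCACHE_NUM_SETS   (1u << 20)   /* illustrative cache size         */

    /* Per-line metadata: set by software on insertion, checked by the
     * (hypothetical) hardware cache controller on the hit path. */
    struct excache_tag {
        uint64_t vtag;      /* high bits of the virtual address  */
        uint16_t asid;      /* address space ID, avoids homonyms */
        bool     valid;     /* line holds data                   */
        bool     writable;  /* read/write permission             */
    };

    /* The set index is a pure calculation on the virtual address: no TLB
     * and no page table is consulted on the pComponent side. */
    static inline uint32_t excache_set(uint64_t vaddr)
    {
        return (uint32_t)((vaddr >> EXCACHE_LINE_SHIFT) % EXCACHE_NUM_SETS);
    }

    /* Hit check for one way of a set. */
    static inline bool excache_match(const struct excache_tag *t,
                                     uint64_t vaddr, uint16_t asid, bool is_write)
    {
        uint64_t vtag = vaddr >> EXCACHE_LINE_SHIFT;
        return t->valid && t->vtag == vtag && t->asid == asid &&
               (!is_write || t->writable);
    }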

4.3 Process Management

The LegoOS process monitor runs in the kernel space of a pComponent and manages the pComponent's CPU cores and ExCache. pComponents run user programs in the user space.

4.3.1 Process Management and Scheduling

At every pComponent, LegoOS uses a simple local thread scheduling model that targets datacenter applications (we will discuss global scheduling in §4.6). LegoOS dedicates a small number of cores to kernel background threads (currently two to four) and uses the rest of the cores for application threads. When a new process starts, LegoOS uses a global policy to choose a pComponent for it (§4.6). Afterwards, LegoOS schedules new threads the process spawns on the same pComponent by choosing the cores that host the fewest threads. After assigning a thread to a core, we let it run to the end with no scheduling or kernel preemption under common scenarios. For example, we do not use any network interrupts and let threads busy wait on the completion of outstanding network requests, since a network request in LegoOS is fast (e.g., fetching an ExCache line from an mComponent takes around 6.5 us). LegoOS improves the overall processor utilization in a disaggregated cluster, since it can freely schedule processes on any pComponents without considering memory allocation. Thus, we do not push for perfect core utilization when scheduling individual threads and instead aim to minimize scheduling and performance overheads. Only when a pComponent has to schedule more threads than its cores will LegoOS start preempting threads on a core.

4.3.2 ExCache Management

The LegoOS process monitor configures and manages ExCache. During the pComponent's boot time, LegoOS configures the set associativity of ExCache and its cache replacement policy. While ExCache hits are handled completely in hardware, LegoOS handles misses in software. When an ExCache miss happens, the process monitor fetches the corresponding line from an mComponent and inserts it into ExCache. If the ExCache set is full, the process monitor first evicts a line in the set. It throws away the evicted line if it is clean and writes it back to an mComponent if it is dirty. LegoOS currently supports two eviction policies: FIFO and LRU. For each ExCache set, LegoOS maintains a FIFO queue (or an approximate LRU list) and chooses ExCache lines to evict based on the corresponding policy (see §5.3 for details).

4.3.3 Supporting the Linux Syscall Interface

One of our early decisions is to support Linux ABIs for backward compatibility and easy adoption of LegoOS. A challenge in supporting the Linux system call interface is that many Linux syscalls are associated with states — information about different Linux subsystems that is stored with each process and can be accessed by user programs across syscalls. For example, Linux records the states of a running process' open files, socket connections, and several other entities, and it associates these states with file descriptors (fds) that are exposed to users. In contrast, LegoOS aims at a clean separation of OS functionalities. With LegoOS' stateless design principle, each component only stores information about its own resource, and each request across components contains all the information that the destination component needs to handle the request. To solve this discrepancy between the Linux syscall interface and LegoOS' design, we add a layer on top of LegoOS' core process monitor at each pComponent to store Linux states and translate these states and the Linux syscall interface to LegoOS' internal interface.

4.4 Memory Management

We use mComponents for three types of data: anonymous memory (i.e., heaps, stacks), memory-mapped files, and storage buffer caches. The LegoOS memory monitor manages both the virtual and physical memory address spaces, their allocation, deallocation, and memory address mappings. It also performs the actual memory reads and writes. No user processes run on mComponents; they run completely in the kernel mode (the same is true for sComponents).

LegoOS lets a process address space span multiple mComponents to achieve efficient memory space utilization and high parallelism. Each application process uses one or more mComponents to host its data, and a home mComponent — an mComponent that initially loads the process — accepts and oversees all system calls related to virtual memory space management (e.g., brk, mmap, munmap, and mremap). LegoOS uses a global memory resource manager (GMM) to assign a home mComponent to each new process at its creation time. A home mComponent can also host process data.

4.4.1 Memory Space Management

Virtual memory space management. We propose a two-level approach to manage distributed virtual memory spaces, where the home mComponent of a process makes coarse-grained, high-level virtual memory allocation decisions and other mComponents perform fine-grained virtual memory allocation. This approach minimizes network communication during both normal memory accesses and virtual memory operations, while ensuring good load balancing and memory utilization. Figure 6 demonstrates the data structures used.

At the higher level, we split each virtual memory address space into coarse-grained, fixed-size virtual regions, or vRegions (e.g., of 1 GB). Each vRegion that contains allocated virtual memory addresses (an active vRegion) is owned by an mComponent. The owner of a vRegion handles all memory accesses and virtual memory requests within the vRegion.

The lower level stores user process virtual memory area (vma) information, such as virtual address ranges and permissions, in vma trees. The owner of an active vRegion stores a vma tree for the vRegion, with each node in the tree being one vma. A user-perceived virtual memory range can split across multiple mComponents, but only one mComponent owns a vRegion.
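The two-level scheme of §4.4.1 can be pictured with two small data structures: a per-process vRegion array that maps a virtual address to its owner mComponent, and a per-vRegion vma tree kept only at that owner. The sketch below is a simplified illustration under those assumptions; LegoOS' actual structures and field names may differ.

    #include <stdint.h>
    #include <stddef.h>

    #define VREGION_SIZE    (1ULL << 30)  /* 1 GB vRegions, as in the paper  */
    #define VREGIONS_PER_AS 512           /* illustrative address-space size */

    /* Higher level: which mComponent owns each vRegion and how much free
     * space is left. Kept at the home mComponent; pComponents cache a
     * read-only copy for routing. */
    struct vregion_entry {
        uint16_t owner_mcomp;   /* 0 means the vRegion is inactive */
        uint64_t free_bytes;
    };

    struct addr_space {
        struct vregion_entry vregions[VREGIONS_PER_AS];
    };

    /* Lower level: one vma node inside the owner mComponent's vma tree. */
    struct vma {
        uint64_t start, end;    /* virtual address range */
        uint32_t prot;          /* permissions           */
        struct vma *left, *right;
    };

    /* A pComponent only needs the cached array to route a miss or flush;
     * the caller is assumed to keep vaddr within the managed range. */
    static inline uint16_t vregion_owner(const struct addr_space *as,
                                         uint64_t vaddr)
    {
        return as->vregions[vaddr / VREGION_SIZE].owner_mcomp;
    }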

[Figure 6: Distributed Memory Management.]

vRegion owners perform the actual virtual memory allocation and vma tree setup. A home mComponent can also be the owner of vRegions, but the home mComponent does not maintain any information about memory that belongs to vRegions owned by other mComponents. It only keeps the information of which mComponent owns a vRegion (in a vRegion array) and how much free virtual memory space is left in each vRegion. These metadata can be easily reconstructed if a home mComponent fails.

When an application process wants to allocate a virtual memory space, the pComponent forwards the allocation request to its home mComponent (step 1 in Figure 6). The home mComponent uses its stored information of available virtual memory space in vRegions to find one or more vRegions that best fit the requested amount of virtual memory space. If no active vRegion can fit the allocation request, the home mComponent makes a new vRegion active and contacts the GMM (steps 2 and 3) to find a candidate mComponent to own the new vRegion. GMM makes this decision based on available physical memory space and access load on different mComponents (§4.6). If the candidate mComponent is not the home mComponent, the home mComponent next forwards the request to that mComponent (step 4), which then performs local virtual memory area allocation and sets up the proper vma tree. Afterwards, the pComponent directly sends memory access requests to the owner of the vRegion where the memory access falls (e.g., steps a and c in Figure 6).

LegoOS' mechanism of distributed virtual memory management is efficient, and it cleanly separates memory operations from pComponents. pComponents hand over all memory-related system call requests to mComponents and only cache a copy of the vRegion array for fast memory accesses. To fill a cache miss or to flush a dirty cache line, a pComponent looks up the cached vRegion array to find the owner mComponent and sends the request to it.

Physical memory space management. Each mComponent manages the physical memory allocation for data that falls into the vRegions that it owns. Each mComponent can choose its own way of physical memory allocation and its own mechanism of virtual-to-physical memory address mapping.

4.4.2 Optimization on Memory Accesses

With our strawman memory management design, all ExCache misses go to mComponents. We soon found that a large performance overhead in running real applications is caused by filling an empty ExCache, i.e., cold misses. To reduce the performance overhead of cold misses, we propose a technique to avoid accessing mComponents on first memory accesses.

The basic idea is simple: since the initial content of anonymous memory (non-file-backed memory) is zero, LegoOS can directly allocate a cache line with empty content in ExCache for the first access to anonymous memory instead of going to the mComponent (we call such cache lines p-local lines). When an application creates a new anonymous memory region, the process monitor records its address range and permission. The application's first access to this region will be an ExCache miss and will trap to LegoOS. The LegoOS process monitor then allocates an ExCache line, fills it with zeros, and sets its R/W bit according to the recorded memory region's permission. Before this p-local line is evicted, it only lives in the ExCache. No mComponents are aware of it or will allocate physical memory or a virtual-to-physical memory mapping for it. When a p-local cache line becomes dirty and needs to be flushed, the process monitor sends it to its owner mComponent, which then allocates physical memory space and establishes a virtual-to-physical memory mapping. Essentially, LegoOS delays physical memory allocation until write time. Notice that it is safe to maintain p-local lines only at a pComponent's ExCache without any other pComponents knowing them, since pComponents in LegoOS do not share any memory and other pComponents will not access this data.

4.5 Storage Management

LegoOS supports a hierarchical file interface that is backward compatible with POSIX through its vNode abstraction. Users can store their directories and files under their vNodes' mount points and perform normal read, write, and other accesses to them.

LegoOS implements core storage functionalities at sComponents. To cleanly separate storage functionalities, LegoOS uses a stateless storage server design, where each I/O request to the storage server contains all the information needed to fulfill the request (e.g., full path name and absolute file offset), similar to the server design in NFS v2 [89].
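A minimal sketch of this stateless storage interface is shown below: each request carries the full path and an absolute offset, and (as the next paragraphs describe) the storage monitor resolves paths with a flat hash table keyed by full path rather than a hierarchical namespace. The request layout, hash function, and table size are assumptions made for illustration.

    #include <stdint.h>
    #include <string.h>

    #define LEGO_PATH_MAX      256    /* illustrative limit */
    #define FILE_TABLE_BUCKETS 1024

    /* A self-contained I/O request, similar in spirit to NFS v2: no open
     * file state is kept at the sComponent between requests. */
    struct storage_req {
        char     path[LEGO_PATH_MAX];  /* full path, used as the file's name */
        uint64_t offset;               /* absolute file offset               */
        uint64_t len;
        uint8_t  is_write;
    };

    /* Flat per-vNode file table keyed by full path. */
    struct file_entry {
        char path[LEGO_PATH_MAX];
        uint64_t size;
        struct file_entry *next;       /* chaining on collisions */
    };

    static struct file_entry *file_table[FILE_TABLE_BUCKETS];

    static uint32_t path_hash(const char *path)
    {
        uint32_t h = 5381;             /* djb2-style hash, purely illustrative */
        while (*path)
            h = h * 33 + (uint8_t)*path++;
        return h % FILE_TABLE_BUCKETS;
    }

    static struct file_entry *lookup_file(const char *path)
    {
        struct file_entry *e = file_table[path_hash(path)];
        while (e && strcmp(e->path, path) != 0)
            e = e->next;
        return e;                      /* NULL if the file does not exist */
    }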

While LegoOS exposes a hierarchical file interface to users, internally the LegoOS storage monitor treats (full) directory and file paths just as unique names of a file and places all files of a vNode under one internal directory at the sComponent. To locate a file, the LegoOS storage monitor maintains a simple hash table with the full paths of files (and directories) as keys. From our observation, most datacenter applications only have a few hundred files or less. Thus, a simple hash table for a whole vNode is sufficient to achieve good lookup performance. Using a non-hierarchical file system implementation largely reduces the complexity of LegoOS' file system, making it possible for a storage monitor to fit in storage device controllers that have limited processing power [92].

LegoOS places the storage buffer cache at mComponents rather than at sComponents, because sComponents can only host a limited amount of internal memory. The LegoOS memory monitor manages the storage buffer cache by simply performing insertion, lookup, and deletion of buffer cache entries. For simplicity and to avoid coherence traffic, we currently place the buffer cache of one file under one mComponent. When receiving a file read system call, the LegoOS process monitor first uses its extended Linux state layer to look up the full path name, then passes it with the requested offset and size to the mComponent that holds the file's buffer cache. This mComponent will look up the buffer cache and return the data to the pComponent on a hit. On a miss, the mComponent will forward the request to the sComponent that stores the file, which will fetch the data from the storage device and return it to the mComponent. The mComponent will then insert it into the buffer cache and return it to the pComponent. Write and fsync requests work in a similar fashion.

4.6 Global Resource Management

LegoOS uses a two-level resource management mechanism. At the higher level, LegoOS uses three global resource managers for process, memory, and storage resources: GPM, GMM, and GSM. These global managers perform coarse-grained global resource allocation and load balancing, and they can run on one normal Linux machine. Global managers only maintain approximate resource usage and load information. They update their information either when they make allocation decisions or by periodically asking monitors in the cluster. At the lower level, each monitor can employ its own policies and mechanisms to manage its local resources.

For example, process monitors allocate new threads locally and only ask GPM when they need to create a new process. GPM chooses the pComponent that hosts the fewest threads based on its maintained approximate information. Memory monitors allocate virtual and physical memory space on their own. Only a home mComponent asks GMM when it needs to allocate a new vRegion. GMM maintains approximate physical memory space usage and memory access load by periodically asking mComponents, and it chooses the mComponent with the least load among all the ones that have at least one vRegion's worth of free physical memory.

LegoOS decouples the allocation of different resources and can freely allocate each type of resource from a pool of components. Doing so largely improves resource packing compared to a monolithic server cluster that packs all types of resources a job requires within one physical machine. Also note that LegoOS allocates hardware resources only on demand, i.e., when applications actually create threads or access physical memory. This on-demand allocation strategy further improves LegoOS' resource packing efficiency and allows more aggressive over-subscription in a cluster.
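As a concrete example of the coarse-grained decisions GMM makes, the sketch below picks an owner for a new vRegion: among mComponents with at least one vRegion's worth of free physical memory, it chooses the one with the lowest access load, using only the approximate statistics GMM maintains. The bookkeeping structure, names, and tie-breaking are assumptions, not LegoOS' exact policy code.

    #include <stdint.h>
    #include <stddef.h>

    #define VREGION_SIZE (1ULL << 30)   /* 1 GB, matching section 4.4.1 */

    /* Approximate per-mComponent statistics, refreshed periodically by
     * polling memory monitors; GMM never tracks exact usage. */
    struct mcomp_stat {
        uint16_t id;
        uint64_t free_bytes;    /* approximate free physical memory */
        uint64_t access_load;   /* approximate recent access rate   */
    };

    /* Returns the chosen mComponent ID, or 0 if none has enough space. */
    static uint16_t gmm_pick_vregion_owner(const struct mcomp_stat *stats,
                                           size_t n)
    {
        const struct mcomp_stat *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (stats[i].free_bytes < VREGION_SIZE)
                continue;                          /* not enough room     */
            if (!best || stats[i].access_load < best->access_load)
                best = &stats[i];                  /* least loaded so far */
        }
        return best ? best->id : 0;
    }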

4.7 Reliability and Failure Handling

After disaggregation, pComponents, mComponents, and sComponents can all fail independently. Our goal is to build a reliable disaggregated cluster that has the same or lower application failure rate than a monolithic cluster. As a first (and important) step towards achieving this goal, we focus on providing memory reliability by handling mComponent failure in the current version of LegoOS, because of three observations. First, when distributing an application's memory to multiple mComponents, the probability of memory failure increases, and not handling mComponent failure will cause applications to fail more often on a disaggregated cluster than on monolithic servers. Second, since most modern datacenter applications already provide reliability for their distributed storage data and the current version of LegoOS does not split a file across sComponents, we leave providing storage reliability to applications. Finally, since LegoOS does not split a process across pComponents, the chance of a running application process being affected by the failure of a pComponent is similar to one affected by the failure of a processor in a monolithic server. Thus, we currently do not deal with pComponent failure and leave it for future work.

A naive approach to handle memory failure is to perform a full replication of memory content over two or more mComponents. This method would require at least 2x memory space, making the monetary and energy cost of providing reliability prohibitively high (the same reason why RAMCloud [80] does not replicate in memory). Instead, since losing in-memory data will not affect user persistent data, we propose a space- and performance-efficient approach that provides in-memory data reliability in a best-effort manner.

We use one primary mComponent, one secondary mComponent, and a backup file in an sComponent for each vma. An mComponent can serve as the primary for some vmas and the secondary for others. The primary stores all memory data and metadata. LegoOS maintains a small append-only log at the secondary mComponent and also replicates the vma tree there. When a pComponent flushes a dirty ExCache line, LegoOS sends the data to both the primary and the secondary in parallel (steps a and b in Figure 6) and waits for both to reply (c and d). In the background, the secondary mComponent flushes the backup log to an sComponent, which writes it to an append-only file.

If the flushing of a backup log to the sComponent is slow and the log is full, we will skip replicating application memory. If the primary fails during this time, LegoOS simply reports an error to the application. Otherwise, when a primary mComponent fails, we can recover memory content by replaying the backup logs on the sComponent and in the secondary mComponent. When a secondary mComponent fails, we do not reconstruct anything and start replicating to a new backup log on another mComponent.

5 LegoOS Implementation

We implemented LegoOS in C on the x86-64 architecture. LegoOS can run on commodity, off-the-shelf machines and supports most commonly-used Linux system call APIs. Apart from being a proof-of-concept of the splitkernel OS architecture, our current LegoOS implementation can also be used on existing datacenter servers to reduce energy cost, with the help of techniques like Zombieland [77]. Currently, LegoOS has 206K SLOC, with 56K SLOC for drivers. LegoOS supports 113 syscalls, 15 pseudo-files, and 10 vectored syscall opcodes. Similar to the findings in [100], we found that implementing these Linux interfaces is sufficient to run many unmodified datacenter applications.

5.1 Hardware Emulation

Since there is no real resource disaggregation hardware, we emulate disaggregated hardware components using commodity servers by limiting their internal hardware usages. For example, to emulate controllers for mComponents and sComponents, we limit the usable cores of a server to two. To emulate pComponents, we limit the amount of usable main memory of a server and configure it as LegoOS software-managed ExCache.

5.2 Network Stack

We implemented three network stacks in LegoOS. The first is a customized RDMA-based RPC framework we implemented based on LITE [101] on top of the Mellanox mlx4 InfiniBand driver we ported from Linux. Our RDMA RPC implementation registers physical memory addresses with RDMA NICs and thus eliminates the need for NICs to cache physical-to-virtual memory address mappings [101]. The resulting smaller NIC SRAM can largely reduce the monetary cost of NICs, further saving the total cost of a LegoOS cluster. All LegoOS internal communication uses this RPC framework. For best latency, we use one dedicated polling thread at the RPC server side to keep polling for incoming requests. Other threads (which we call worker threads) execute the actual RPC functions. For each pair of components, we use one physically consecutive memory region at a component to serve as the receive buffer for RPC requests. The RPC client component uses RDMA write with immediate value to directly write into the memory region, and the polling thread polls for the immediate value to get the metadata of the RPC request (e.g., where the request is written in the memory region). Immediately after getting an incoming request, the polling thread passes it along to a work queue and continues to poll for the next incoming request. Each worker thread checks whether the work queue is non-empty and, if so, gets an RPC request to process. Once it finishes the RPC function, it sends the return value back to the RPC client with an RDMA write to a memory address at the RPC client. The RPC client allocates this memory address for the return value before sending the RPC request and piggy-backs the memory address with the RPC request.

The second network stack is our own implementation of the socket interface directly on RDMA. The final stack is a traditional socket TCP/IP stack we adapted from lwip [35] on our ported e1000 Ethernet driver. Applications can choose between these two socket implementations and use virtual IPs for their socket communication.

5.3 Processor Monitor

We reserve a contiguous physical memory region during kernel boot time and use fixed ranges of memory in this region as ExCache, tags and metadata for these caches, and kernel physical memory. We organize ExCache into virtually indexed sets with a configurable set associativity. Since x86 (and most other architectures) uses hardware-managed TLBs and walks the page table directly after TLB misses, we have to use paging, and the only chance we can trap to the OS is at page fault time. We thus use paged memory to emulate ExCache, with each ExCache line being a 4 KB page. A smaller ExCache line size would improve the performance of fetching lines from mComponents but increase the size of the ExCache tag array and the overhead of tag comparison.

An ExCache miss causes a page fault and traps to LegoOS. To minimize the overhead of context switches, we use the application thread that faults on an ExCache miss to perform ExCache replacement. Specifically, this thread will identify the set to insert the missing page using its virtual memory address, evict a page in this set if it is full, and, if needed, flush a dirty page to the mComponent (via a LegoOS RPC call to the owner mComponent of the vRegion this page is in). To minimize the network round trips needed to complete an ExCache miss, we piggy-back the dirty page flush and the new page fetch in one RPC call when the mComponent to be flushed to and the mComponent to fetch the missing page from are the same.
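Putting the §5.3 miss path together, the sketch below shows its overall shape: index the set by virtual address, pick a victim, and piggy-back the dirty flush with the fetch when both target the same owner mComponent. The helper functions declared extern are hypothetical stand-ins for LegoOS' internal interfaces, not its real API, and tag/valid-bit bookkeeping after the fetch is omitted.

    #include <stdint.h>
    #include <stdbool.h>

    struct excache_line { uint64_t vtag; bool valid, dirty; void *data; };
    struct excache_set  { struct excache_line ways[8]; };

    /* Hypothetical helpers standing in for LegoOS internals. */
    extern struct excache_set  *excache_lookup_set(uint64_t vaddr);
    extern struct excache_line *excache_pick_victim(struct excache_set *set);
    extern uint16_t vregion_owner_of(uint64_t vaddr);  /* cached vRegion array */
    extern int rpc_flush_and_fetch(uint16_t mcomp, uint64_t evict_va,
                                   const void *evict_data, uint64_t miss_va,
                                   void *fill_buf);    /* single round trip   */
    extern int rpc_flush(uint16_t mcomp, uint64_t va, const void *data);
    extern int rpc_fetch(uint16_t mcomp, uint64_t va, void *fill_buf);

    /* Runs in the context of the faulting application thread.  victim->data
     * points at the line's 4 KB backing page; the flush payload is copied
     * into the RPC request before the fetched line overwrites the buffer. */
    int excache_miss(uint64_t miss_va)
    {
        struct excache_set  *set    = excache_lookup_set(miss_va);
        struct excache_line *victim = excache_pick_victim(set);
        uint16_t fetch_owner = vregion_owner_of(miss_va);

        if (victim->valid && victim->dirty) {
            uint64_t evict_va    = victim->vtag << 12;   /* 4 KB lines */
            uint16_t evict_owner = vregion_owner_of(evict_va);

            if (evict_owner == fetch_owner)
                /* Same owner: flush + fetch in one RPC round trip. */
                return rpc_flush_and_fetch(fetch_owner, evict_va, victim->data,
                                           miss_va, victim->data);

            /* Different owners: two separate requests. */
            if (rpc_flush(evict_owner, evict_va, victim->data))
                return -1;
        }
        return rpc_fetch(fetch_owner, miss_va, victim->data);
    }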

LegoOS maintains an approximate LRU list for each ExCache set and uses a background thread to sweep all entries in ExCache and adjust LRU lists. LegoOS supports two ExCache replacement policies: FIFO and LRU. For FIFO replacement, we simply maintain a FIFO queue for each ExCache set and insert a corresponding entry at the tail of the FIFO queue when an ExCache page is inserted into the set. The eviction victim is chosen as the head of the FIFO queue. For LRU, we use one background thread to sweep all sets of ExCache to adjust their LRU lists. For both policies, we use a per-set lock and lock the FIFO queue (or the LRU list) when making changes to them.

5.4 Memory Monitor

We use regular machines to emulate mComponents by limiting usable cores to a small number (2 to 5 depending on configuration). We dedicate one core to busy-polling network requests and the rest to performing memory functionalities. The LegoOS memory monitor performs all its functionalities as handlers of RPC requests from pComponents. The memory monitor handles most of these functionalities locally and sends another RPC request to a sComponent for storage-related functionalities (e.g., when a buffer cache miss happens). LegoOS stores application data, application memory address mappings, vma trees, and vRegion arrays all in the main memory of the emulating machine.

The memory monitor loads an application executable from sComponents to the mComponent, handles application virtual memory address allocation requests, allocates physical memory at the mComponent, and reads/writes data to the mComponent. Our current implementation of the memory monitor is purely in software, and we use hash tables to implement the virtual-to-physical address mappings. While we envision future mComponents to implement memory monitors in hardware and to have specialized hardware parts to store address mappings, our current software implementation can still be useful for users that want to build software-managed mComponents.

5.5 Storage Monitor

Since storage is not the focus of the current version of LegoOS, we chose a simple implementation of building the storage monitor on top of the Linux vfs layer as a loadable Linux kernel module. LegoOS creates a normal file over vfs as the mount partition for each vNode and issues vfs file operations to perform LegoOS storage I/Os. Doing so is sufficient to evaluate LegoOS, while largely saving our implementation efforts on storage device drivers and layered storage protocols. We leave exploring other options of building the LegoOS storage monitor to future work.

5.6 Experience and Discussion

We started our implementation of LegoOS from scratch to have a clean design and implementation that can fit the splitkernel model and to evaluate the efforts needed in building different monitors. However, with the vast amount and the complexity of drivers, we decided to port Linux drivers instead of writing our own. We then spent our engineering efforts on an "as needed" basis and took shortcuts by borrowing some of the Linux code. For example, we re-used common algorithms and data structures in Linux to easily port Linux drivers. We believe that being able to support largely unmodified Linux drivers will assist the adoption of LegoOS.

When we started building LegoOS, we had a clear goal of sticking to the principle of "clean separation of functionalities". However, we later found several places where performance could be improved if this principle were relaxed. For example, for the optimization in §4.4.2 to work correctly, the pComponent needs to store the address range and permission for anonymous virtual memory regions, which is memory-related information that otherwise only mComponents need to know. Another example is the implementation of mremap. With LegoOS' principle of mComponents handling all memory address allocations, memory monitors will allocate new virtual memory address ranges for mremap requests. However, when data in the mremap region is in ExCache, LegoOS needs to move it to another set if the new virtual address does not fall into the current set. If mComponents are ExCache-aware, they can choose the new virtual memory address to fall into the same set as the current one. Our strategy is to relax the clean-separation principle only by giving "hints", and only for frequently-accessed, performance-critical operations.

6 Evaluation

This section presents the performance evaluation of LegoOS using micro- and macro-benchmarks and two unmodified real applications. We also quantitatively analyze the failure rate of LegoOS. We ran all experiments on a cluster of 10 machines, each with two Intel Xeon E5-2620 2.40 GHz processors, 128 GB DRAM, and one 40 Gbps Mellanox ConnectX-3 InfiniBand network adapter; a Mellanox 40 Gbps InfiniBand switch connects all of the machines. The Linux version we used for comparison is v4.9.47.

6.1 Micro- and Macro-benchmark Results

Network performance. Network communication is at the core of LegoOS' performance. Thus, we evaluate LegoOS' network performance first before evaluating LegoOS as a whole. Figure 7 plots the average latency of sending messages with socket-over-InfiniBand in Linux (Linux-IPoIB), LegoOS' implementation of socket on top of InfiniBand (LegoOS-Sock-o-IB), and LegoOS' implementation of RPC over InfiniBand (LegoOS-RPC-IB).
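Figure 7's comparison boils down to measuring per-message latency between two machines. As a rough illustration of how such a socket-level latency test can be structured (a hypothetical harness of our own, not the benchmark used in the paper), a client can time a fixed-size echo round trip against a server that simply echoes back whatever it receives:

    /* Hypothetical socket ping-pong latency test (not the paper's harness):
     * the client sends a fixed-size message, waits for the echo, and reports
     * the average round-trip latency. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    #define ITERS 100000

    int main(int argc, char **argv)
    {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <server-ip> <port> <msg-bytes>\n", argv[0]);
            return 1;
        }
        size_t len = (size_t)atoi(argv[3]);
        char *buf = calloc(1, len);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons((uint16_t)atoi(argv[2])) };
        inet_pton(AF_INET, argv[1], &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERS; i++) {
            /* One request/response per iteration; the server echoes len bytes. */
            if (write(fd, buf, len) != (ssize_t)len)
                return 1;
            size_t got = 0;
            while (got < len) {                 /* drain the full echo */
                ssize_t n = read(fd, buf + got, len - got);
                if (n <= 0)
                    return 1;
                got += (size_t)n;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double usec = (end.tv_sec - start.tv_sec) * 1e6 +
                      (end.tv_nsec - start.tv_nsec) / 1e3;
        printf("avg round-trip latency: %.2f usec\n", usec / ITERS);
        close(fd);
        free(buf);
        return 0;
    }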

Figure 7: Network Latency (usec vs. message size, for Linux-IPoIB, LegoOS-Sock-o-IB, and LegoOS-RPC-IB).  Figure 8: Memory Throughput (KIOPS vs. number of workload threads).  Figure 9: Storage Throughput (random/sequential reads and writes).  Figure 10: PARSEC Results. SC: StreamCluster. BS: BlackScholes.
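Looking back at the memory monitor of §5.4, its mapping state lives in ordinary hash tables and is exercised entirely from RPC handlers. The following plain-C sketch shows one way such a virtual-to-physical lookup could look; the structure layout, the (tgid, vpn) key, and the allocation callback are our own illustrative assumptions, not LegoOS' actual data structures.

    /* Illustrative only: a chained hash table mapping (tgid, virtual page
     * number) to a physical page number, in the style of the software
     * memory monitor of Section 5.4.  Names and layout are hypothetical. */
    #include <stdint.h>
    #include <stdlib.h>

    #define V2P_BUCKETS 4096
    #define PAGE_SHIFT  12

    struct v2p_entry {
        uint32_t tgid;              /* owning process */
        uint64_t vpn;               /* virtual page number */
        uint64_t pfn;               /* physical page number at the mComponent */
        struct v2p_entry *next;     /* chaining on hash collision */
    };

    static struct v2p_entry *v2p_table[V2P_BUCKETS];

    static unsigned int v2p_hash(uint32_t tgid, uint64_t vpn)
    {
        return (unsigned int)((vpn ^ ((uint64_t)tgid << 20)) % V2P_BUCKETS);
    }

    /* Look up (or lazily allocate) the physical page backing a virtual page.
     * A real monitor would run this as the handler of a page-miss RPC sent
     * by a pComponent on an ExCache miss. */
    static uint64_t v2p_lookup_or_alloc(uint32_t tgid, uint64_t vaddr,
                                        uint64_t (*alloc_pfn)(void))
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        unsigned int b = v2p_hash(tgid, vpn);
        struct v2p_entry *e;

        for (e = v2p_table[b]; e; e = e->next)
            if (e->tgid == tgid && e->vpn == vpn)
                return e->pfn;              /* mapping already exists */

        e = malloc(sizeof(*e));
        if (!e)
            abort();
        e->tgid = tgid;
        e->vpn  = vpn;
        e->pfn  = alloc_pfn();              /* allocate physical memory at the mComponent */
        e->next = v2p_table[b];
        v2p_table[b] = e;
        return e->pfn;
    }

    static uint64_t next_pfn = 1000;
    static uint64_t dummy_alloc_pfn(void) { return next_pfn++; }

    int main(void)
    {
        uint64_t pfn1 = v2p_lookup_or_alloc(42, 0x7f0000001000ULL, dummy_alloc_pfn);
        uint64_t pfn2 = v2p_lookup_or_alloc(42, 0x7f0000001fffULL, dummy_alloc_pfn);
        return pfn1 == pfn2 ? 0 : 1;        /* same page, so same mapping */
    }

A hardware mComponent would replace such a table with dedicated mapping storage, as the paper envisions.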

LegoOS uses LegoOS-RPC-IB for all its internal network communication across components and uses LegoOS-Sock-o-IB for all application-initiated socket network requests. Both of LegoOS' networking stacks significantly outperform Linux's.

Memory performance. Next, we measure the performance of the mComponent using a multi-threaded user-level micro-benchmark. In this micro-benchmark, each thread performs one million sequential 4 KB memory loads in a heap. We use a huge, empty ExCache (32 GB) to run this test, so that each memory access generates an ExCache (cold) miss and goes to the mComponent.

Figure 8 compares LegoOS' mComponent performance with Linux's single-node memory performance using this workload. We vary the number of per-mComponent worker threads from 1 to 8 with one and two mComponents (only showing representative configurations in Figure 8). In general, using more worker threads per mComponent and using more mComponents both improve throughput when an application has high parallelism, but the improvement largely diminishes after the total number of worker threads reaches four. We also evaluated the optimization technique in §4.4.2 (p-local in Figure 8). As expected, bypassing mComponent accesses with p-local lines significantly improves memory access performance. The difference between p-local and Linux demonstrates the overhead of trapping to the LegoOS kernel and setting up ExCache.

Storage performance. To measure the performance of LegoOS' storage system, we ran a single-threaded micro-benchmark that performs sequential and random 4 KB reads/writes to a 25 GB file on a Samsung PM1725s NVMe SSD (the total amount of data accessed is 1 GB). For write workloads, we issue an fsync after each write to test the performance of writing all the way to the SSD.

Figure 9 presents the throughput of this workload on LegoOS and on single-node Linux. For LegoOS, we use one mComponent to store the buffer cache of this file and initialize the buffer cache to empty so that file I/Os can go to the sComponent (Linux also uses an empty buffer cache). Our results show that Linux's performance is determined by the SSD's read/write bandwidth. LegoOS' random read performance is close to Linux's, since network cost is relatively low compared to the SSD's random read performance. With better SSD sequential read performance, network cost has a higher impact. LegoOS' write-and-fsync performance is worse than Linux's because LegoOS requires one RTT between the pComponent and the mComponent to perform a write and two RTTs (pComponent to mComponent, mComponent to sComponent) for fsync.

PARSEC results. We evaluated LegoOS with a set of workloads from the PARSEC benchmark suite [22], including BlackScholes, Freqmine, and StreamCluster. These workloads are representative of compute-intensive datacenter applications, ranging from machine-learning algorithms to stream-processing ones. Figure 10 presents the slowdown of LegoOS over single-node Linux with enough memory for the entire application working sets. LegoOS uses one pComponent with 128 MB ExCache, one mComponent with one worker thread, and one sComponent for all the PARSEC tests. For each workload, we tested one and four workload threads. StreamCluster, a streaming workload, performs the best because of its batched memory access pattern (each batch is around 110 MB). BlackScholes and Freqmine perform worse because of their larger working sets (630 MB to 785 MB). LegoOS performs worse with more workload threads, because the single worker thread at the mComponent becomes the bottleneck to achieving higher throughput.

6.2 Application Performance

We evaluated LegoOS' performance with two real, unmodified applications: TensorFlow [4] and Phoenix [85], a single-node multi-threaded implementation of MapReduce [32]. TensorFlow's experiments use the Cifar-10 dataset [2] and Phoenix's use a Wikipedia dataset [3]. Unless otherwise stated, the base configuration used for all TensorFlow experiments is a 256 MB 64-way ExCache, one pComponent, one mComponent, and one sComponent. The base configuration for Phoenix is the same as TensorFlow's with the exception that the base ExCache size is 512 MB. The total amount of virtual memory addresses touched in TensorFlow is 4.4 GB (1.75 GB for Phoenix). The total working sets of the TensorFlow and Phoenix executions are 0.9 GB and 1.7 GB. Our default ExCache sizes are set as roughly 25% of the total working sets. We ran both applications with four threads.
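For reference, the per-thread loop of the memory micro-benchmark in §6.1 can be sketched as follows. This is an illustrative reconstruction, not the paper's benchmark code: the thread count, buffer handling, and the choice of one byte-load per 4 KB page are our own assumptions. Each thread touches one location per page across a large private heap region, so with a cold ExCache every access misses to the mComponent.

    /* Sketch (under our own assumptions) of a multi-threaded benchmark where
     * each thread performs one million sequential 4 KB-apart loads in a heap. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 4
    #define ACCESSES    (1000UL * 1000UL)  /* one million loads per thread */
    #define STRIDE      4096UL             /* 4 KB stride: one load per page */

    struct worker_arg {
        volatile const char *base;         /* this thread's region of the heap */
        unsigned long sink;                /* defeats dead-code elimination */
    };

    static void *worker(void *p)
    {
        struct worker_arg *a = p;
        unsigned long sum = 0;
        for (unsigned long i = 0; i < ACCESSES; i++)
            sum += a->base[i * STRIDE];    /* sequential 4 KB-apart loads */
        a->sink = sum;
        return NULL;
    }

    int main(void)
    {
        /* Each thread reads a private ~4 GB region (64-bit build assumed);
         * untouched anonymous pages are backed lazily by the OS, so the loop
         * mainly exercises the paging / ExCache-miss path. */
        pthread_t tid[NUM_THREADS];
        struct worker_arg args[NUM_THREADS];

        for (int t = 0; t < NUM_THREADS; t++) {
            args[t].base = malloc(ACCESSES * STRIDE);
            if (!args[t].base) {
                perror("malloc");
                return 1;
            }
            pthread_create(&tid[t], NULL, worker, &args[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
        printf("done\n");
        return 0;
    }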

Figure 11: TensorFlow Perf.  Figure 12: Phoenix Perf. (Figures 11 and 12: slowdown vs. ExCache/Memory Size (MB), comparing Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS.)  Figure 13: ExCache Mgmt. (slowdown of Piggyback+List+FIFO, List+FIFO, List+LRU, and Bitmap+FIFO with 1 and 4 workers.)  Figure 14: Memory Config. (slowdown with 1, 2, and 4 mComponents, with and without replication, for TensorFlow and Phoenix.)

Impact of ExCache size on application performance. Figures 11 and 12 plot the TensorFlow and Phoenix runtime comparison across LegoOS, a remote swapping system (InfiniSwap [47]), a Linux server with a swap file on a local high-end NVMe SSD, and a Linux server with a swap file on a local ramdisk. All values are calculated as a slowdown relative to running the applications on a Linux server that has enough local resources (main memory, CPU cores, and SSD). For systems other than LegoOS, we change the main memory size to the same size as the ExCache in LegoOS, with the rest of the memory on a swap file. With around 25% of the working set, LegoOS only has a slowdown of 1.68× and 1.34× for TensorFlow and Phoenix compared to a monolithic Linux server that can fit all working sets in its main memory.

LegoOS' performance is significantly better than swapping to SSD and to remote memory, largely because of our efficiently-implemented network stack, a simplified code path compared with the Linux paging subsystem, and the optimization technique proposed in §4.4.2. Surprisingly, it is similar to or even better than swapping to local memory, even when LegoOS' memory accesses are across the network. This is mainly because ramdisk goes through the buffer cache and incurs memory copies between the buffer cache and the in-memory swap file.

LegoOS' performance results are not easy to achieve, and we went through rounds of design and implementation refinement. Our network stack and RPC optimizations yield a total improvement of up to 50%. For example, we made all RPC server (mComponent) replies unsignaled to save the mComponent's processing time and to increase its request handling throughput. Another optimization we did is to piggy-back a dirty cache line flush and a cache miss fill into one RPC. The optimization of the first anonymous memory access (§4.4.2) improves LegoOS' performance further by up to 5%.

ExCache Management. Apart from its size, how an ExCache is managed can also largely affect application performance. We first evaluated factors that could affect ExCache hit rate and found that higher associativity improves hit rate, but the effect diminishes when going beyond 512-way. We then focused on evaluating the miss cost of ExCache, since the miss path is handled by LegoOS in our design. We compare the two eviction policies LegoOS supports (FIFO and LRU), two implementations of finding an empty line in an ExCache set (linearly scanning a free bitmap and fetching the head of a free list), and one network optimization (piggybacking the flush of a dirty line with the fetch of the missing line).

Figure 13 presents these comparisons with one and four mComponent worker threads. All tests run the Cifar-10 workload on TensorFlow with 256 MB 64-way ExCache, one mComponent, and one sComponent. Using bitmaps for this ExCache configuration is always worse than using free lists because of the cost of linearly scanning a whole bitmap, and bitmaps perform even worse with higher associativity. Surprisingly, FIFO performs better than LRU in our tests, even though LRU utilizes the access locality pattern. We attribute LRU's worse performance to the lock contention it incurs: the kernel background thread sweeping the ExCache locks an LRU list when adjusting the position of an entry in it, while the ExCache miss handler thread also needs to lock the LRU list to grab its head. Finally, the piggyback optimization works well, and the combination of FIFO, free list, and piggyback yields the best performance.

Number of mComponents and replication. Finally, we study the effect of the number of mComponents and of memory replication. Figure 14 plots the performance slowdown as the number of mComponents increases from one to four. Surprisingly, using more mComponents lowers application performance by up to 6%. This performance drop is due to the effect of the ExCache piggyback optimization. When there is only one mComponent, flushes and misses are all between the pComponent and this mComponent, thus enabling piggyback on every flush. However, when there are multiple mComponents, LegoOS can only perform piggyback when the flush and the miss are to the same mComponent.

We also evaluated LegoOS' memory replication performance in Figure 14. Replication has a performance overhead of 2% to 23% (there is a constant 1 MB space overhead to store the backup log). LegoOS uses the same application thread to send the replica data to the backup mComponent and then to the primary mComponent, resulting in the performance loss.

Running multiple applications together. All our experiments so far run only one application at a time. Now we evaluate how multiple applications perform when running them together on a LegoOS cluster. We use a simple scenario of running one TensorFlow instance and one Phoenix instance together in two settings: 1) two pComponents each running one instance, both accessing one mComponent (2P1M), and 2) one pComponent running two instances and accessing two mComponents (1P2M).
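To make the free-list, FIFO, and piggyback comparison of Figure 13 concrete, here is a minimal sketch of per-set ExCache bookkeeping. It is written under our own assumptions (structure names, a 64-way set, a pthread mutex standing in for the per-set lock of §5.3) and is not LegoOS' actual code.

    /* Illustrative per-set ExCache bookkeeping: a free list for finding an
     * empty line, a FIFO queue for choosing an eviction victim, and a
     * per-set lock.  Not LegoOS' actual code. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define LINES_PER_SET 64                    /* e.g., a 64-way ExCache set */

    struct excache_line {
        unsigned long tag;                      /* which page this line caches */
        bool dirty;
        struct excache_line *next;              /* links the free list or FIFO queue */
    };

    struct excache_set {
        pthread_mutex_t lock;                   /* per-set lock (Section 5.3) */
        struct excache_line lines[LINES_PER_SET];
        struct excache_line *free_head;         /* empty lines */
        struct excache_line *fifo_head;         /* oldest occupied line = FIFO victim */
        struct excache_line *fifo_tail;
    };

    static void excache_set_init(struct excache_set *set)
    {
        pthread_mutex_init(&set->lock, NULL);
        set->free_head = set->fifo_head = set->fifo_tail = NULL;
        for (int i = 0; i < LINES_PER_SET; i++) {   /* all lines start free */
            set->lines[i].next = set->free_head;
            set->free_head = &set->lines[i];
        }
    }

    /* On a miss: take an empty line from the free list if one exists, otherwise
     * evict the FIFO head.  The caller learns whether a dirty victim must be
     * written back; with the piggyback optimization, that flush and the fetch
     * of the missing line are combined into one RPC when both target the same
     * mComponent. */
    static struct excache_line *
    excache_get_line(struct excache_set *set, unsigned long new_tag,
                     unsigned long *victim_tag, bool *victim_dirty)
    {
        struct excache_line *line;

        pthread_mutex_lock(&set->lock);
        if (set->free_head) {                   /* cheap path: pop the free-list head */
            line = set->free_head;
            set->free_head = line->next;
            *victim_dirty = false;
        } else {                                /* evict the FIFO head */
            line = set->fifo_head;
            set->fifo_head = line->next;
            if (!set->fifo_head)
                set->fifo_tail = NULL;
            *victim_tag = line->tag;
            *victim_dirty = line->dirty;
        }
        line->tag = new_tag;
        line->dirty = false;
        line->next = NULL;
        if (set->fifo_tail)                     /* newly filled line goes to the FIFO tail */
            set->fifo_tail->next = line;
        else
            set->fifo_head = line;
        set->fifo_tail = line;
        pthread_mutex_unlock(&set->lock);
        return line;
    }

    int main(void)
    {
        struct excache_set set;
        unsigned long victim_tag = 0;
        bool victim_dirty = false;

        excache_set_init(&set);
        for (unsigned long tag = 0; tag < 100; tag++)   /* more tags than lines: forces evictions */
            excache_get_line(&set, tag, &victim_tag, &victim_dirty);
        return 0;
    }

Because the critical section only manipulates list heads under the per-set lock, this path avoids the LRU-list contention between the background sweeper and the miss handler described above.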

              Processor   Disk   Memory   NIC     Power   Other   Monolithic   LegoOS
MTTF (year)   204.3       33.1   289.9    538.8   100.5   27.4    5.8          6.8 - 8.7

Table 1: Mean Time To Failure Analysis. MTTF numbers of devices (columns 2 to 7) are obtained from [90], and MTTF values of a monolithic server and LegoOS are calculated using the per-device MTTF numbers.
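The monolithic and LegoOS columns of Table 1 follow the harmonic-mean calculation given in Equation (1) below: assuming independent failures, a machine or component fails as soon as any of its devices fails, so its MTTF is the reciprocal of the summed per-device failure rates. The sketch below illustrates that calculation. The device lists are our own examples, and since Table 1 does not break out how many devices of each kind a server or component contains (several are folded into "Other"), the sketch is not expected to reproduce the table's exact values.

    /* Illustrative MTTF calculation in the spirit of Equation (1):
     * system MTTF = 1 / sum(1/MTTF_i) = HM(MTTF_i) / n,
     * assuming independent device failures.  Device lists are examples only. */
    #include <stdio.h>

    static double system_mttf(const double *device_mttf, int n)
    {
        double failure_rate = 0.0;          /* sum of per-device failure rates */
        for (int i = 0; i < n; i++)
            failure_rate += 1.0 / device_mttf[i];
        return 1.0 / failure_rate;          /* = HM(MTTF_i) / n */
    }

    int main(void)
    {
        /* The six per-device MTTFs (years) listed in Table 1, from [90].
         * The paper's monolithic-server figure also accounts for the actual
         * device mix per server, so this will not equal 5.8 exactly. */
        double example_server[] = { 204.3, 33.1, 289.9, 538.8, 100.5, 27.4 };
        /* A hypothetical pComponent-like device list: processor, NIC, power
         * supply (fan and heat sink omitted for lack of per-device numbers). */
        double example_pcomponent[] = { 204.3, 538.8, 100.5 };

        printf("example server MTTF:      %.1f years\n", system_mttf(example_server, 6));
        printf("example pComponent MTTF:  %.1f years\n", system_mttf(example_pcomponent, 3));
        return 0;
    }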

Figure 15: Multiple Applications.

Both settings use one sComponent. Figure 15 presents the runtime slowdown results. We also vary the number of mComponent worker threads for the 2P1M setting and the amount of ExCache for the 1P2M setting. With 2P1M, both applications suffer from a performance drop because their memory access requests saturate the single mComponent. Using more worker threads at the mComponent improves the performance slightly. For 1P2M, application performance largely depends on ExCache size, similar to our findings with the single-application experiments.

6.3 Failure Analysis

Finally, we provide a qualitative analysis of the failure rate of a LegoOS cluster compared to a monolithic server cluster. Table 1 summarizes our analysis. To measure the failure rate of a cluster, we use the metric Mean Time To (hardware) Failure (MTTF), the mean time to the failure of a server in a monolithic cluster or of a component in a LegoOS cluster. Since the only real per-device failure statistics we can find are the mean time to hardware replacement in a cluster [90], the MTTF we refer to in this study indicates the mean time to the type of hardware failures that require replacement. Unlike traditional MTTF analysis, we are not able to include transient failures.

To calculate the MTTF of a monolithic server, we first obtain the replacement frequency of different hardware devices in a server (CPU, memory, disk, NIC, motherboard, case, power supply, fan, CPU heat sink, and other cables and connections) from the real world (the COM1 and COM2 clusters in [90]). For LegoOS, we envision every component to have a NIC and a power supply, and in addition, a pComponent to have a CPU, fan, and heat sink, an mComponent to have memory, and an sComponent to have a disk. We further assume both a monolithic server and a LegoOS component to fail when any hardware device in them fails and the devices in them fail independently. Thus, the MTTF can be calculated using the harmonic mean (HM) of the MTTF of each device:

    MTTF = \mathrm{HM}_{i=0}^{n}(MTTF_i) / n        (1)

where n includes all devices in a machine/component.

Further, when calculating the MTTF of LegoOS, we estimate the number of components needed in LegoOS to run the same applications as a monolithic cluster. Our estimated worst case for LegoOS is to use the same number of hardware devices (i.e., assuming the same resource utilization as the monolithic cluster). LegoOS' best case is to achieve full resource utilization and thus to require only about half of the CPU and memory resources (since average CPU and memory resource utilization in monolithic server clusters is around 50% [10, 45]). With better resource utilization and simplified hardware components (e.g., no motherboard), LegoOS improves MTTF by 17% to 49% compared to an equivalent monolithic server cluster.

7 Related Work

Hardware Resource Disaggregation. There have been a few hardware disaggregation proposals from academia and industry, including FireBox [15], HP "The Machine" [40, 48], dRedBox [56], and IBM Composable System [28]. Among them, dRedBox and IBM Composable System package hardware resources in one big case and connect them with buses like PCIe. The Machine's scale is a rack and it connects SoCs to NVMs with a specialized coherent network. FireBox is an early-stage project and is likely to use high-radix switches to connect customized devices. The disaggregated cluster we envision to run LegoOS on is one that hosts hundreds to thousands of non-coherent, heterogeneous hardware devices, connected with a commodity network.

Memory Disaggregation and Remote Memory. Lim et al. first proposed the concept of hardware disaggregated memory with two models of disaggregated memory: using it as a network swap device and transparently accessing it through memory instructions [63, 64]. Their hardware models still use a monolithic server at the local side. LegoOS' hardware model separates processor and memory completely.

Another set of recent projects utilize remote memory without changing monolithic servers [6, 34, 47, 74, 79, 93]. For example, InfiniSwap [47] transparently swaps local memory to remote memory via RDMA. These remote memory systems help improve memory resource packing in a cluster.

82 13th USENIX Symposium on Operating Systems Design and Implementation USENIX Association itations of monolithic servers like the lack of hardware mogeneous pool. There are also few emerging propos- heterogeneity and elasticity. als to build distributed OSes in datacenters [54, 91], e.g., to reduce the performance overhead of middleware. Le- Storage Disaggregation. Cloud vendors usually pro- goOS achieves the same benefits of minimal middleware vision storage at different physical machines [13, 103, layers by only having LegoOS as the system manage- 108]. Remote access to hard disks is a common practice, ment software for a disaggregated cluster and using the because their high latency and low throughput can eas- lightweight vNode mechanism. ily hide network overhead [61, 62, 70, 105]. While dis- aggregating high-performance flash is a more challeng- 8 Discussion and Conclusion ing task [38, 59]. Systems such as ReFlex [60], Deci- bel [73], and PolarFS [25], tightly integrate network and We presented LegoOS, the first OS designed for hard- storage layers to minimize software overhead in the face ware resource disaggregation. LegoOS demonstrated the of fast hardware. Although storage disaggregation is not feasibility of resource disaggregation and its advantages our main focus now, we believe those techniques can be in better resource packing, failure isolation, and elastic- realized in future LegoOS easily. ity, all without changing Linux ABIs. LegoOS and re- source disaggregation in general can help the adoption Multi-Kernel and Multi-Instance OSes. Multi-kernel of new hardware and thus encourage more hardware and OSes like Barrelfish [21, 107], Helios [76], Hive [26], system software innovations. and fos [106] run a small kernel on each core or pro- LegoOS is a research prototype and has a lot of room grammable device in a monolithic server, and they use for improvement. For example, we found that the amount message passing to communicate across their internal of parallel threads an mComponent can use to process kernels. Similarly, multi-instance OSes like Popcorn [17] memory requests largely affect application throughput. and Pisces [83] run multiple Linux kernel instances on Thus, future developers of real mComponents can con- different cores in a machine. Different from these OSes, sider use large amount of cheap cores or FPGA to imple- LegoOS runs on and manages a distributed set of hard- ment memory monitors in hardware. ware devices; it manages distributed hardware resources We also performed an initial investigation in load using a two-level approach and handles device failures balancing and found that memory allocation policies (currently only mComponent). In addition, LegoOS dif- across mComponents can largely affect application per- fers from these OSes in how it splits OS functionali- formance. However, since we do not support mem- ties, where it executes the split kernels, and how it per- ory data migration yet, the benefit of our load-balancing forms message passing across components. Different mechanism is small. We leave memory migration for fu- from multi-kernels’ message passing mechanisms which ture work. In general, large-scale resource management are performed over buses or using shared memory in of a disaggregated cluster is an interesting and important a server, LegoOS’ message passing is performed using topic, but is outside of the scope of this paper. a customized RDMA-based RPC stack over InfiniBand or RoCE network. 
Like LegoOS, fos [106] separates Acknowledgments OS functionalities and run them on different processor cores that share main memory. Helios [76] runs satel- We would like to thank the anonymous reviewers and lite kernels on heterogeneous cores and programmable our shepherd Michael Swift for their tremendous feed- NICs that are not cache-coherent. We took a step further back and comments, which have substantially improved by disseminating OS functionalities to run on individual, the content and presentation of this paper. We are also network-attached hardware devices. Moreover, LegoOS thankful to Sumukh H. Ravindra for his contribution in is the first OS that separates memory and process man- the early stage of this work. agement and runs virtual memory system completely at This material is based upon work supported by the network-attached memory devices. National Science Foundation under the following grant: NSF 1719215. Any opinions, findings, and conclusions Distributed OSes. There have been several distributed or recommendations expressed in this material are those OSes built in late 80s and early 90s [14, 16, 19, 27, of the authors and do not necessarily reflect the views of 82, 86, 96, 97]. Many of them aim to appear as a sin- NSF or other institutions. gle machine to users and focus on improving inter-node IPCs. Among them, the most closely related one is Amoeba [96, 97]. It organizes a cluster into a shared pro- cess pool and disaggregated specialized servers. Unlike Amoeba, LegoOS further separates memory from pro- cessors and different hardware components are loosely coupled and can be heterogeneous instead of as a ho-

USENIX Association 13th USENIX Symposium on Operating Systems Design and Implementation 83 References [20] A. Basu, M. D. Hill, and M. M. Swift. Reducing Memory Reference Energy with Opportunistic Virtual [1] Open Compute Project (OCP). http:// Caching. In Proceedings of the 39th Annual Interna- www.opencompute.org/. tional Symposium on Computer Architecture (ISCA ’12). [2] The CIFAR-10 dataset. https://www.cs.toronto.edu/ [21] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, ∼kriz/cifar.html. [3] Wikipedia dump. https://dumps.wikimedia.org/. R. Isaacs, S. Peter, T. Roscoe, A. Schupbach,¨ and [4] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, A. Singhania. The Multikernel: A New OS Architecture J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, for Scalable Multicore Systems. In Proceedings of the M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. ACM SIGOPS 22nd Symposium on Operating Systems Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, Principles (SOSP ’09). M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system [22] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PAR- 12th USENIX Sym- SEC Benchmark Suite: Characterization and Architec- for large-scale machine learning. In tural Implications. In Proceedings of the 17th Interna- posium on Operating Systems Design and Implementa- tional Conference on Parallel Architectures and Compi- tion (OSDI ’16). lation Techniques (PACT ’08) [5] S. R. Agrawal, S. Idicula, A. Raghavan, E. Vlachos, . [23] M. N. Bojnordi and E. Ipek. PARDIS: A Programmable V. Govindaraju, V. Varadarajan, C. Balkesen, G. Gian- Memory Controller for the DDRx Interfacing Standards. nikis, C. Roth, N. Agarwal, and E. Sedlar. A Many-core Proceedings of the 39th Annual International Sympo- Pro- In Architecture for In-memory Data Processing. In sium on Computer Architecture (ISCA ’12) ceedings of the 50th Annual IEEE/ACM International . [24] Cache Coherent Interconnect for Accelerators, 2018. Symposium on Microarchitecture (MICRO ’17). https://www.ccixconsortium.com/. [6] M. K. Aguilera, N. Amit, I. Calciu, X. Deguillard, [25] W. Cao, Z. Liu, P. Wang, S. Chen, C. Zhu, S. Zheng, J. Gandhi, S. Novakovic,´ A. Ramanathan, P. Subrah- manyam, L. Suresh, K. Tati, R. Venkatasubramanian, Y. Wang, and G. Ma. PolarFS: an ultra-low latency and and M. Wei. Remote regions: a simple abstraction for re- failure resilient distributed file system for shared storage 2018 USENIX Annual Technical Con- cloud database. Proceedings of the VLDB Endowment mote memory. In (VLDB ’18). ference (ATC ’18). [26] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teo- [7] M. K. Aguilera, N. Amit, I. Calciu, X. Deguillard, dosiu, and A. Gupta. Hive: Fault Containment for J. Gandhi, P. Subrahmanyam, L. Suresh, K. Tati, Shared-memory Multiprocessors. In Proceedings of the R. Venkatasubramanian, and M. Wei. Remote memory Fifteenth ACM Symposium on Operating Systems Prin- in the age of fast networks. In Proceedings of the 2017 ciples (SOSP ’95). Symposium on Cloud Computing (SoCC ’17). [27] D. Cheriton. The V Distributed System. Commun. ACM, [8] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A 31(3), March 1988. scalable processing-in-memory accelerator for parallel [28] I.-H. Chung, B. Abali, and P. Crumley. Towards a Com- graph processing. In Proceedings of the 42nd Annual In- posable Computer System. In Proceedings of the Inter- ternational Symposium on Computer Architecture (ISCA national Conference on High Performance Computing ’15). in Asia-Pacific Region (HPC Asia ’18). [9] J. Ahn, S. Yoo, O. 
Mutlu, and K. Choi. PIM- [29] Cisco, EMC, and Intel. The Performance Impact of enabled Instructions: A Low-overhead, Locality-aware . . Processing-in-memory Architecture. In Proceedings of NVMe and NVMe over Fabrics. http://www snia org/ the 42nd Annual International Symposium on Computer sites/default/files/NVMe Webcast Slides Final.1.pdf. Architecture (ISCA ’15). [30] P. Costa. Towards rack-scale computing: Challenges and [10] Alibaba. Alibaba Production Cluster Trace Data. https: opportunities. In First International Workshop on Rack- //.com/alibaba/clusterdata. scale Computing (WRSC ’14). [31] P. Costa, H. Ballani, K. Razavi, and I. Kash. R2C2: A [11] Amazon. Amazon EC2 Elastic GPUs. https:// Network Stack for Rack-scale Computers. In Proceed- aws.amazon.com/ec2/elastic-gpus/. ings of the ACM SIGCOMM 2015 Conference on SIG- [12] Amazon. Amazon EC2 F1 Instances. https:// COMM (SIGCOMM ’15). aws.amazon.com/ec2/instance-types/f1/. [32] J. Dean and S. Ghemawat. MapReduce: Simplified Data [13] Amazon. Amazon EC2 Root Device Volume. https: Processing on Large Clusters. In Proceedings of the 6th //docs.aws.amazon.com/AWSEC2/latest/UserGuide/ RootDeviceStorage.html#RootDeviceStorageConcepts. Conference on Symposium on Opearting Systems Design [14] Y. Artsy, H.-Y. Chang, and R. Finkel. Interprocess com- and Implementation (OSDI ’04). munication in charlotte. IEEE Software, Jan 1987. [33] C. Delimitrou and C. Kozyrakis. Quasar: Resource- [15] K. Asanovi. FireBox: A Hardware Building Block efficient and QoS-aware Cluster Management. In Pro- for 2020 Warehouse-Scale Computers, February 2014. ceedings of the 19th International Conference on Archi- Keynote talk at the 12th USENIX Conference on File tectural Support for Programming Languages and Op- and Storage Technologies (FAST ’14). erating Systems (ASPLOS ’14). [16] A. Barak and R. Wheeler. MOSIX: An integrated unix [34] Dragojevic,´ Aleksandar and Narayanan, Dushyanth and for multiprocessor workstations. International Computer Hodson, Orion and Castro, Miguel. FaRM: Fast Remote Science Institute, 1988. Memory. In Proceedings of the 11th USENIX Confer- [17] A. Barbalace, M. Sadini, S. Ansary, C. Jelesnianski, ence on Networked Systems Design and Implementation A. Ravichandran, C. Kendir, A. Murray, and B. Ravin- (NSDI ’14). dran. Popcorn: Bridging the programmability gap in [35] A. Dunkels. Design and Implementation of the lwIP heterogeneous-isa platforms. In Proceedings of the TCP/IP Stack. Swedish Institute of Computer Science, 2001. Tenth European Conference on Computer Systems (Eu- [36] K. Elphinstone and G. Heiser. From l3 to sel4 what have roSys ’15). we learnt in 20 years of l4 ? In Proceed- [18] L. A. Barroso and U. Holzle.¨ The case for energy- ings of the Twenty-Fourth ACM Symposium on Operat- proportional computing. Computer, 40(12), December ing Systems Principles (SOSP ’13). 2007. [19] F. Baskett, J. H. Howard, and J. T. Montague. Task [37] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exok- Communication in DEMOS. In Proceedings of the ernel: An architecture for application- Sixth ACM Symposium on Operating Systems Principles level resource management. In Proceedings of the Fif- (SOSP ’77). teenth ACM Symposium on Operating Systems Princi-

84 13th USENIX Symposium on Operating Systems Design and Implementation USENIX Association ples (SOSP ’95). nick, N. Penukonda, A. Phelps, M. R. Jonathan Ross, [38] Facebook. Introducing Lightning: A flexible NVMe A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snel- JBOF. https://code.fb.com/data-center-engineering/ ham, J. Souter, A. S. Dan Steinberg, M. Tan, G. Thorson, introducing-lightning-a-flexible-nvme-jbof/. B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, [39] Facebook. Wedge 100: More open and versa- W. Wang, E. Wilcox, and D. H. Yoon. In-Datacenter Per- tile than ever. https://code.fb.com/networking-traffic/ formance Analysis of a Tensor Processing Unit. In Pro- wedge-100-more-open-and-versatile-than-ever/. ceedings of the 44th Annual International Symposium on [40] P. Faraboschi, K. Keeton, T. Marsland, and D. Milojicic. Computer Architecture (ISCA ’17). Beyond Processor-centric Operating Systems. In 15th [56] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zer- Workshop on Hot Topics in Operating Systems (HotOS vas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, ’15). D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, [41] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, M. Nemirovsky, D. Roca, H. Klos, and T. Berends. S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker. Net- Rack-scale disaggregated cloud data centers: The dReD- work Requirements for Resource Disaggregation. In Box project vision. In 2016 Design, Automation Test in 12th USENIX Symposium on Operating Systems Design Europe Conference Exhibition (DATE ’16). and Implementation (OSDI ’16). [57] A. Kaufmann, S. Peter, N. K. Sharma, T. Anderson, and [42] GenZ Consortium. http://genzconsortium.org/. A. Krishnamurthy. High Performance Packet Processing [43] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and with FlexNIC. In Proceedings of the Twenty-First Inter- C. Guestrin. PowerGraph: Distributed Graph-Parallel national Conference on Architectural Support for Pro- Computation on Natural Graphs. In Proceedings of the gramming Languages and Operating Systems (ASPLOS 10th USENIX conference on Operating Systems Design ’16). and Implementation (OSDI ’12). [58] S. Kaxiras and A. Ros. A New Perspective for Effi- [44] J. R. Goodman. Coherency for Multiprocessor Virtual cient Virtual-cache Coherence. In Proceedings of the Address Caches. In Proceedings of the Second Interna- 40th Annual International Symposium on Computer Ar- tional Conference on Architectual Support for Program- chitecture (ISCA ’13). ming Languages and Operating Systems (ASPLOS ’87). [59] A. Klimovic, C. Kozyrakis, E. Thereska, B. John, and [45] Google. Google Production Cluster Trace Data. https: S. Kumar. Flash Storage Disaggregation. In Proceed- //github.com/google/cluster-data. ings of the Eleventh European Conference on Computer [46] Google. GPUs are now available for Google Systems (EuroSys ’16). Compute Engine and Cloud Machine Learning. [60] A. Klimovic, H. Litz, and C. Kozyrakis. ReFlex: Re- https://cloudplatform.googleblog.com/2017/02/GPUs- mote Flash ≈ Local Flash. In Proceedings of the Twenty- are-now-available-for-Google-Compute-Engine-and- Second International Conference on Architectural Sup- Cloud-Machine-Learning.html. port for Programming Languages and Operating Sys- [47] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. Shin. Ef- tems (ASPLOS ’17). ficient Memory Disaggregation with Infiniswap. In Pro- [61] E. K. Lee and C. A. Thekkath. Petal: Distributed Vir- ceedings of the 14th USENIX Symposium on Networked tual Disks. 
In Proceedings of the Seventh International Systems Design and Implementation (NSDI ’17). Conference on Architectural Support for Programming [48] Hewlett-Packard. The Machine: A New Kind of Languages and Operating Systems (ASPLOS ’96). Computer. http://www.hpl.hp.com/research/systems- [62] S. Legtchenko, H. Williams, K. Razavi, A. Donnelly, research/themachine/. R. Black, A. Douglas, N. Cheriere, D. Fryer, K. Mast, [49] Intel. Intel Non-Volatile Memory 3D XPoint. A. D. Brown, A. Klimovic, A. Slowey, and A. Row- http://www.intel.com/content/www/us/en/architecture- stron. Understanding Rack-Scale Disaggregated Stor- and-technology/non-volatile-memory.html?wapkw= age. In 9th USENIX Workshop on Hot Topics in Storage 3d+xpoint. and File Systems (HotStorage ’17). [50] Intel. Intel Omni-Path Architecture. https://tinyurl.com/ [63] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Rein- ya3x4ktd. hardt, and T. F. Wenisch. Disaggregated Memory for Ex- [51] Intel. Intel Rack Scale Architecture: Faster pansion and Sharing in Blade Servers. In Proceedings of Service Delivery and Lower TCO. http: the 36th Annual International Symposium on Computer //www.intel.com/content/www/us/en/architecture- Architecture (ISCA ’09). and-technology/intel-rack-scale-architecture.html. [64] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, [52] Intel, 2018. https://www.intel.com/content/www/us/en/ P. Ranganathan, and T. F. Wenisch. System-level Im- high-performance-computing-fabrics/. plications of Disaggregated Memory. In Proceedings of [53] JEDEC. HIGH BANDWIDTH MEMORY (HBM) the 2012 IEEE 18th International Symposium on High- DRAM JESD235A. https://www.jedec.org/standards- Performance Computer Architecture (HPCA ’12). documents/docs/jesd235a. [65] D. Meisner, B. T. Gold, and T. F. Wenisch. The pow- [54] W. John, J. Halen, X. Cai, C. Fu, T. Holmberg, ernap server architecture. ACM Trans. Comput. Syst., V. Katardjiev, M. Sedaghat, P. Skoldstr¨ om,¨ D. Turull, February 2011. V. Yadhav, and J. Kempf. Making Cloud Easy: Design [66] Mellanox. ConnectX-6 Single/Dual-Port Adapter sup- Considerations and First Components of a Distributed porting 200Gb/s with VPI. http://www.mellanox.com/ Operating System for Cloud. In 10th USENIX Workshop page/products dyn?product family=265&mtag= on Hot Topics in Cloud Computing (HotCloud ’18). connectx 6 vpi card. [55] N. P. Jouppi, C. Young, N. Patil, D. Patterson, [67] Mellanox. Mellanox BuleField SmartNIC. G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Bo- http://www.mellanox.com/page/products dyn?product den, A. Borchers, R. Boyle, P. luc Cantin, C. Chao, family=275&mtag=bluefield smart nic. C. Clark, J. Coriell, M. D. Mike Daley, J. Dean, B. Gelb, [68] Mellanox. Mellanox Innova Adapters. http: T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hag- //www.mellanox.com/page/programmable network mann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, adapters?mtag=&mtag=programmable adapter cards. J. Ibarz, A. Jaffey, A. Jaworski, H. K. Alexander Ka- [69] Mellanox. Rdma aware networks programming user plan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, manual. http://www.mellanox.com/related-docs/ D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacK- prod software/RDMA Aware Programming user ean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, manual.pdf. R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omer- [70] J. Mickens, E. B. Nightingale, J. Elson, D. Gehring,

USENIX Association 13th USENIX Symposium on Operating Systems Design and Implementation 85 B. Fan, A. Kadav, V. Chidambaram, O. Khan, and [87] B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, K. Nareddy. Blizzard: Fast, Cloud-scale Block Storage and Y. Solihin. Scaling the Bandwidth Wall: Challenges for Cloud-oblivious Applications. In 11th USENIX Sym- in and Avenues for CMP Scaling. In Proceedings of the posium on Networked Systems Design and Implementa- 36th Annual International Symposium on Computer Ar- tion (NSDI ’14). chitecture (ISCA ’09). [71] D. Minturn. NVM Express Over Fabrics. 11th Annual [88] S. M. Rumble. Infiniband Verbs Performance. OpenFabrics International OFS Developers’ Workshop, https://ramcloud.atlassian.net/wiki/display/RAM/ March 2015. Infiniband+Verbs+Performance. [72] T. P. Morgan. Enterprises Get On The Xeon Phi [89] R. Sandberg. The Design and Implementation of the Roadmap. https://www.enterprisetech.com/2014/11/17/ Sun Network . In Proceedings of the 1985 enterprises-get-xeon-phi-roadmap/. USENIX Summer Technical Conference, 1985. [73] M. Nanavati, J. Wires, and A. Warfield. Decibel: Isola- [90] B. Schroeder and G. A. Gibson. Disk Failures in the Real tion and Sharing in Disaggregated Rack-Scale Storage. World: What Does an MTTF of 1,000,000 Hours Mean In 14th USENIX Symposium on Networked Systems De- to You? In Proceedings of the 5th USENIX Conference sign and Implementation (NSDI ’17). on File and Storage Technologies (FAST ’07). [74] J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Ka- [91] M. Schwarzkopf, M. P. Grosvenor, and S. Hand. New han, and M. Oskin. Latency-Tolerant Software Dis- Wine in Old Skins: The Case for Distributed Operating tributed Shared Memory. In Proceedings of the 2015 Systems in the Data Center. In Proceedings of the 4th USENIX Annual Technical Conference (ATC ’15). Asia-Pacific Workshop on Systems (APSys ’13). [75] Netronome. Agilio SmartNICs. https: [92] S. Seshadri, M. Gahagan, S. Bhaskaran, T. Bunker, //www.netronome.com/products/smartnic/overview/. A. De, Y. Jin, Y. Liu, and S. Swanson. Willow: A [76] E. B. Nightingale, O. Hodson, R. McIlroy, C. Haw- User-Programmable SSD. In Proceedings of the 11th blitzel, and G. Hunt. Helios: Heterogeneous Multipro- USENIX Symposium on Operating Systems Design and cessing with Satellite Kernels. In Proceedings of the Implementation (OSDI ’14). ACM SIGOPS 22nd Symposium on Operating Systems [93] Y. Shan, S.-Y. Tsai, and Y. Zhang. Distributed Shared Principles (SOSP ’09). Persistent Memory. In Proceedings of the 2017 Sympo- [77] V. Nitu, B. Teabe, A. Tchana, C. Isci, and D. Hagimont. sium on Cloud Computing (SoCC ’17). Welcome to Zombieland: Practical and Energy-efficient [94] M. Silberstein. Accelerators in data centers: the sys- Memory Disaggregation in a Datacenter. In Proceedings tems perspective. https://www.sigarch.org/accelerators- of the Thirteenth EuroSys Conference (EuroSys ’18). in-data-centers-the-systems-perspective/. [78] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and [95] A. J. Smith. Cache Memories. ACM Comput. Surv., B. Grot. Scale-out NUMA. In Proceedings of the 19th 14(3), September 1982. International Conference on Architectural Support for [96] A. S. Tanenbaum, M. F. Kaashoek, R. Van Renesse, Programming Languages and Operating Systems (ASP- and H. E. Bal. The Amoeba Distributed Operating Sys- LOS ’14). tem&Mdash;a Status Report. Comput. Commun., 14(6), [79] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and August 1991. B. Grot. 
The Case for RackOut: Scalable Data Serv- [97] A. S. Tanenbaum, R. van Renesse, H. van Staveren, ing Using Rack-Scale Systems. In Proceedings of the G. J. Sharp, and S. J. Mullender. Experiences with the Seventh ACM Symposium on Cloud Computing (SoCC Amoeba Distributed Operating System. Commun. ACM, ’16). 33(12), December 1990. [80] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, [98] E. Technologies. NVMe Storage Accelerator Series. and M. Rosenblum. Fast Crash Recovery in RAMCloud. https://www.everspin.com/nvme-storage-accelerator- In Proceedings of the 23rd ACM Symposium on Operat- series. ing Systems Principles (SOSP ’11). [99] S. Thomas, G. M. Voelker, and G. Porter. CacheCloud: [81] Open Coherent Accelerator Processor Interface, 2018. Towards Speed-of-light Datacenter Communication. In https://opencapi.org/. 10th USENIX Workshop on Hot Topics in Cloud Com- [82] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N. puting (HotCloud ’18). Nelson, and B. B. Welch. The Sprite Network Operating [100] C.-C. Tsai, B. Jain, N. A. Abdul, and D. E. Porter. A System. Computer, 21(2), February 1988. Study of Modern Linux API Usage and Compatibility: [83] J. Ouyang, B. Kocoloski, J. R. Lange, and K. Pedretti. What to Support when You’Re Supporting. In Proceed- Achieving Performance Isolation with Lightweight Co- ings of the Eleventh European Conference on Computer Kernels. In Proceedings of the 24th International Sym- Systems (EuroSys ’16). posium on High-Performance Parallel and Distributed [101] S.-Y. Tsai and Y. Zhang. LITE Kernel RDMA Sup- Computing (HPDC ’15). port for Datacenter Applications. In Proceedings of [84] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, the 26th Symposium on Operating Systems Principles K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fow- (SOSP ’17). ers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, [102] A. Verma, L. Pedrosa, M. R. Korupolu, D. Oppenheimer, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Pe- E. Tune, and J. Wilkes. Large-scale cluster management terson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and at Google with Borg. In Proceedings of the European D. Burger. A Reconfigurable Fabric for Accelerating Conference on Computer Systems (EuroSys ’15). Large-scale Datacenter Services. In Proceeding of the [103] VMware. Virtual SAN. https://www.vmware.com/ 41st Annual International Symposium on Computer Ar- products/vsan.html. chitecuture (ISCA ’14). [104] W. H. Wang, J.-L. Baer, and H. M. Levy. Organization [85] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and Performance of a Two-level Virtual-real Cache Hi- and C. Kozyrakis. Evaluating MapReduce for Multi- erarchy. In Proceedings of the 16th Annual International core and Multiprocessor Systems. In Proceedings of Symposium on Computer Architecture (ISCA ’89). the 13th International Symposium on High Performance [105] A. Warfield, R. Ross, K. Fraser, C. Limpach, and Computer Architecture (HPCA ’07). S. Hand. Parallax: Managing Storage for a Million Ma- [86] R. F. Rashid and G. G. Robertson. Accent: A Commu- chines. In Proceedings of the 10th Conference on Hot nication Oriented Kernel. In Topics in Operating Systems (HotOS ’05). Proceedings of the Eighth ACM Symposium on Operat- [106] D. Wentzlaff, C. Gruenwald, III, N. Beckmann, ing Systems Principles (SOSP ’81). K. Modzelewski, A. Belay, L. Youseff, J. Miller, and

86 13th USENIX Symposium on Operating Systems Design and Implementation USENIX Association A. Agarwal. An Operating System for Multicore and Clouds: Mechanisms and Implementation. In Proceed- ings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). [107] G. Zellweger, S. Gerber, K. Kourtis, and T. Roscoe. De- coupling cores, kernels, and operating systems. In Pro- ceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI ’14). [108] Q. Zhang, G. Yu, C. Guo, Y. Dang, N. Swanson, X. Yang, R. Yao, M. Chintalapati, A. Krishnamurthy, and T. Anderson. Deepview: Virtual Disk Failure Diag- nosis and Pattern Detection for Azure. In 15th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI ’18).
