Elasticizing Linux via Joint Disaggregation of Memory and Computation

Ehab Ababneh, Zaid Al-Ali, Sangtae Ha, Richard Han, Eric Keller
Department of Computer Science, University of Colorado Boulder

Abstract

In this paper, we propose a set of operating system primitives which provide a scaling abstraction to cloud applications, allowing them to transparently support scaled execution across multiple physical nodes as resource needs go beyond those available on a single machine. These primitives include stretch, to extend the address space of an application to a new node; push and pull, to move pages between nodes as needed for execution and optimization; and jump, to transfer execution in a very lightweight manner between nodes. This joint disaggregation of memory and computation allows for transparent elasticity, improving an application's performance by capitalizing on the underlying dynamic infrastructure without needing an application re-write. We have implemented these primitives in a Linux 2.6 kernel, collectively calling the extended operating system ElasticOS. Our evaluation across a variety of algorithms shows up to 10x improvement in performance over standard network swap.

1 Introduction

We are in the midst of a significant transition in computing, where we are consuming infrastructure rather than building it. This means that applications have the power of a dynamic infrastructure underlying them, but many applications struggle to leverage that flexibility. In this paper, we propose supporting this at the operating system level with new primitives to support scaling.

To gain some context on the challenge with scaling, we first discuss how it is predominantly handled today. The most straightforward option, which requires no changes to applications, is to simply get a bigger (virtual) machine as load for an application increases. Cloud providers, such as Amazon [2], offer a wide range of machine sizes which cost anywhere from less than a penny per hour to a few dollars per hour. For cost efficiency, companies wish to use the "right" size machine, which might change over time. But transitioning from one VM size to another can pose challenges. In some cases, we can take snapshots (e.g., with CRIU [8]) and migrate the application to a bigger/smaller VM, but this can be disruptive, and the management of the application needs scripts and other infrastructure to trigger scaling.

An alternative is to re-write the applications with scaling in mind. To leverage the scaling, applications are commonly built around frameworks such as Hadoop [12], Spark [26], MPI [7] or PGAS [1]. These frameworks are designed with the flexibility of being able to execute tasks on a varying amount of available distributed resources. The problem here is two-fold. First, to leverage this, the application needs to be built for it – a significant challenge (requiring a re-write) for any existing application, and one forcing application developers to evaluate and become fluent in the latest frameworks and potentially adapt the application as the frameworks change. Second, and perhaps more challenging, is that not every application fits into one of these frameworks.

Another approach to scaling is to replicate VMs/containers when an application becomes popular and requires more resources. This too introduces burdens on the programmer in order to synchronize shared data and state across multiple replicas, as well as to script their applications to spawn/delete replicas depending on load.

In short, in each case, the burden of scaling is placed on programmers. We argue that developers shouldn't need to be experts in cloud management and other frameworks, in addition to also needing to be fluent in programming and their application domain. Instead, the operating system should provide more support.

Broadly speaking, the job of an operating system is to make the life of an application developer easier (through abstraction). A modern OS provides virtual memory abstractions, so developers do not have to coordinate memory use among applications; network socket abstractions, so developers can send messages without needing to be intimately familiar with the underlying network protocols; and many other abstractions (file system, device, multitasking), all to support developers. We propose that scaling should be an OS abstraction.

Related Work: We are not the first to propose that operating systems should support scaling. Scaling of memory approaches are popular and include efforts such as RAMCloud [22], which requires refactoring in user space to utilize its memory scaling capabilities. An early approach to sharing memory, distributed shared memory (DSM) [21, 19, 17, 16, 14, 6], suffered from scaling issues, but more recently disaggregation-based approaches towards memory have emerged that are centered around transparent scaling of memory behind the swap interface, such as Nswap, Infiniswap, SwapX and MemX [20, 10, 29, 4]. Scaling of computation approaches include migration to machines with more resources [8, 13, 3, 25, 18, 27], in addition to the scaling frameworks and replication methods mentioned previously. Approaches to accelerate process migration [24, 15] have been proposed to hide the latency of migration by copying most of the process state in the background and only copying a small delta to the new machine after halting the process. Single system image (SSI) OSs such as Kerrighed, MOSIX, Sprite and Amoeba [15, 3, 5, 28] have been created to support operation across a distributed cluster of machines. These approaches typically employ a process migration model to move computation around cluster nodes and require applications to be recompiled for these specialized OSs.

These prior efforts in OS scaling suffer from a variety of limitations. Network swap-based approaches, while being a step in the right direction of disaggregation in the data center, miss the opportunity to exploit joint disaggregation of computation and memory for improved performance. Execution is typically assumed to be pinned on one machine, while memory pages are swapped back and forth remotely across the network. This can result in excessive swapping of pages over the network. In these cases, movement of computation to a remote machine, towards a cluster of locality stored in the remote machine's memory, would result in substantially faster execution and lower network overhead, as we will show later.

Combining current network swap approaches with existing process migration techniques to alleviate excessive network swapping overhead would suffer two major limitations. First, each decision to move computation would incur the overhead of copying the entire address space. This is a significant amount of overhead to impose on the network. Second, even with accelerated process migration, there is a substantial delay between the time the decision is made to migrate and when that is completed, at which time the conditions that triggered the original migration decision may be obsolete due to the length of time needed to copy all the state.

Figure 1: ElasticOS Vision for Cloud Data Centers.

Introducing ElasticOS: In response to these shortcomings, we introduce four primitives to realize the scaling OS abstraction – stretch, jump, push, and pull. These scaling abstractions are designed to be transparent, efficient, and practically useful. Our approach is inspired by an early work that hypothesized elasticizing operating systems as a hot research topic, but did not build a working implementation of the proposed concept [11]. Stretch is used when an application becomes overloaded (e.g., a lot of thrashing to disk is occurring), so the operating system stretches the application's address space to another machine – extending the amount of memory available to the application. Push and pull allow memory pages to be transferred between machines which the application has been stretched to, whether proactively to optimize placement, or reactively to make the data available where it is needed. Jump allows program execution to transfer to a machine which the application has been stretched to.

Unlike heavyweight process migration, our jump primitive is a lightweight transfer of execution that only copies the small amount of state needed to begin execution immediately on the remote machine, such as register state and the top of the stack. Any additional state that is needed is faulted in using pulls from the rest of the distributed address space. Having both jumping and push/pull allows the OS to choose between moving the data to be where the execution needs it, and moving the execution to be where the data is. This supports the natural, but not necessarily perfect, locality that exists in applications.

To demonstrate the feasibility of this scaling approach, we extended the Linux kernel with these four primitives, and call the extended Linux ElasticOS. Figure 1 provides a high level view of ElasticOS. We see that an instance of ElasticOS is capable of spanning a number of nodes in the data center, and that the number of spanned nodes can elastically scale up or down depending upon application demand. The application is executed within ElasticOS, and the scaling primitives are used to support this execution across a distributed collection of resources.

Figure 2: Illustration of ElasticOS abstractions. Each box labeled with a number is a compute node, with the shaded boxes within representing individual pages. Starting with execution on a single machine in (0), when memory nears being filled, we stretch to two nodes in (1) and balance the pages in (2). We then push and pull pages in (3), with the red shaded pages going from node 1 to 2 (push) and from node 2 to 1 (pull). Finally, in (4) and (6) we are seeing too many remote page faults (resulting in pulls), so we decide to jump from node 1 to 2 in (5) and from node 2 to 1 in (7), respectively.

To demonstrate the desirability of these four primitives, we evaluated a set of applications with large memory footprints and compared against network swap, which supports the pull and push primitives and has itself shown the performance benefit of scaling memory resources transparently across multiple machines. We illustrate the additional benefit of also transparently scaling computing resources across multiple machines, forming a system with joint disaggregation of memory and computation. Our evaluation shows up to 10x speedup over network swap, as well as a reduction of network transfer between 2x and 5x.

In summary, we make the following contributions.

• Introduce scaling as a new OS abstraction, specifically with four primitives: stretch, push, pull, and jump.

• Provide an architecture and implementation of these abstractions in Linux.

• Demonstrate through an evaluation on Emulab servers that ElasticOS achieves up to 10x speedup over network swap across a range of applications, and up to 5x reduction in network overhead.

2 ElasticOS Primitives in Action

In this section, we describe the four primitives through an illustration of a running program. Figure 2 graphically presents each of the primitives. In this figure, we can see nodes 1 and 2, with pages inside of each node – this represents the physical memory and whether a given page is used (shaded) or unused (unshaded) in physical memory. As a starting point, an application is running on a single machine. Over time, this application grows in memory use to nearly the size of the amount of memory in the entire node (label 0 in the figure). This is when ElasticOS decides to stretch the process, that is, to scale it out by using memory on a second node (label 1). At this point, the memory available to the application has grown (doubled in the figure, since it is now on two nodes with equal memory, which is not required in ElasticOS). ElasticOS can choose to balance the pages at this point, transferring pages to the (new) remote node (label 2). These pages can be chosen by a simple policy, such as least recently used.

Once the process is stretched, it is effectively running on multiple machines, but each node only hosts some of the pages. At this point, execution continues on the original machine. As not all of the pages are on this machine (which would have naturally happened over time, even if we didn't balance pages), when the process tries to access a page, it might trigger a page fault. In ElasticOS, the page fault handler is modified to handle this situation. At this point, we perform a pull, where the page from a remote machine (that caused the fault) is transferred to the local machine and the process is resumed. The process will be able to make progress, as the page that is being accessed (and caused a fault) is now local.

If space is needed to perform a pull, we can perform a push to free up memory for the incoming page by transferring a page to a remote node (that we have stretched the application to). Push (and pull) is more versatile, as they can be performed proactively as well – moving pages around, in the background, to optimize the placement for locality (label 3).

The idea of locality is important, especially in regards to our final primitive, jump. Assuming that programs have locality, there is a certain point at which, when we transition into a new pocket of locality, the amount of data that forms that locality is high. It is therefore advantageous to jump execution to the data, rather than pull it all into the local node (as is done in network swap). In the figure, in steps (4) and (6), the area highlighted in red represents an island of locality that would be more advantageous to jump to rather than pulling the entire group of pages to the local machine. When to jump is an important decision – jumping too much can hurt performance (constantly transferring execution, without making progress), but not jumping enough can also hurt performance (transferring lots of data back and forth between machines). As such, we created an initial algorithm, and implemented it as a flexible module within which new decision making algorithms can be integrated seamlessly.

3 ElasticOS Architecture

In this section, we describe the main components of the ElasticOS architecture. ElasticOS can be built as a service integrated into existing and commercially-available operating systems. Figure 3 illustrates the main functional elements that enable a process (e.g., a.out) to be stretched for distributed execution over two ElasticOS nodes. For clarity, we depict pushing and pulling from the perspective of node 1, but in reality all nodes have symmetric capabilities to enable pushing, pulling, and jumping in all directions.

Figure 3: EOS Architecture.

In the remainder of this section, we provide a more detailed architectural overview focusing on mechanisms that are roughly OS-independent in order to achieve stretching (3.1), pushing (3.2), pulling (3.3), and jumping (3.4). The discussion of OS-dependent elements specific to the Linux implementation is reserved for Section 4.

3.1 Stretching

Stretching is responsible for enabling a process to span multiple nodes. This consists of an initial stretch operation, as well as ongoing synchronization.

Figure 4: Stretching.

Initial stretch operation: In order for a process to span multiple nodes, it needs a process shell on each machine. In this way, stretching resembles previous Checkpoint/Restore (C/R) works [8, 3], except that less information needs to be written into the checkpoint. Here we need to create a process shell that will remain in a suspended state rather than a wholly-independent runnable replica. This makes stretching faster than standard C/R. The checkpoint requires only kernel-space process meta-data: virtual memory mappings (mmaps), the file descriptor table, the scheduling class, and any other meta-data which is not updated frequently. Other information that is typically modified at a high rate, such as pending signals, register state, and stack frames, need not be in the checkpoint and will be carried over from the running process whenever it jumps (3.4).

As shown in Figure 4, stretching is triggered by the EOS manager, which continuously monitors process memory usage and issues a newly-created signal (SIGSTRETCH) whenever it detects a process that is too big to fit into the node where it is running. Our special kernel-space handler (eos_sig_handler) intercepts the signal and instructs the process-export module (p_export) to send the checkpoint over a pre-created TCP socket to a process-import module (p_import) waiting in the other node. The latter then creates a shell process by allocating the necessary kernel-space structures and filling them in with checkpoint data.

State Synchronization: After the process has been stretched, and its replica has been created on another machine, additional changes in process state on the first machine will need to be propagated to the replica. This is handled in two ways. Rapid changes in state are handled using the jumping mechanism, as explained later. Changes in state at a more intermediate time scale, such as mapping new memory regions and opening or closing files, are handled using multicast sockets to listeners on each participating node.

One of the pitfalls to avoid here is that the operating system scheduler may delay flushing all such synchronization messages until after a jump is performed. If this happens, the system may arrive at an incorrect state or even crash. So, it is crucial to flush all synchronization messages before a jump is performed.

Figure 5: Pushing.

Figure 6: Pulling.

3.2 Pushing

Now that the process has presence on more than one machine, its memory pages are pushed between nodes in order to balance the load among participating nodes. Our page pusher piggybacks on the existing OS swap management (see Figure 5).

Typically, the swap daemon scans least-recently used (LRU) lists to select least recently used page frames for swapping. Our page balancer modifies this page scanner in order to identify pages mapped by elasticized processes (shaded pages in Figure 5) using the reverse mapping information associated with each page. These are then sent to a virtual block device client (VBD), similar to the one described in [10], after updating the respective page table entries (PTEs) in the elastic page table. The VBD then forwards the page, along with relevant information such as the process ID and the page's starting virtual address, to the page injection module (pg_inject) on the remote node, which will then allocate a new page, fill it with the proper content, and update the replica's elastic page table.

Maintaining accurate information in the elastic page tables when pushing pages is crucial to correct execution. As we will see later, jumping depends on this information for locating pages in the system.

3.3 Pulling

Partitioning the process's memory footprint will, inevitably, result in references to remote pages. These are handled by our modified page fault handler (Figure 6). On a page fault, the handler consults the elastic page table to identify the page's location. If it happens to be on a remote node, the page's starting virtual address and process ID are forwarded to the VBD, which then contacts the page extraction module (pg_extract) on the respective node to pull the page. Once it receives the page's content, the VBD client restores the process's access to the page.

Whenever a remote page fault is handled as described above, page fault counters are updated. This is required by ElasticOS's jumping policy (Section 3.4), which will always try to co-locate execution with its most-referenced memory.
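The pull path can be summarized with the following simplified C sketch. Function names such as elastic_pt_lookup() and vbd_fetch_page() are hypothetical stand-ins for the elastic page table and the VBD client described above, not actual ElasticOS symbols, and real fault handling happens inside the kernel rather than through these placeholder interfaces.

    #include <stdbool.h>
    #include <stdint.h>

    #define EOS_PAGE_MASK (~0xfffULL)

    struct elastic_pte { bool remote; int node; };

    /* Assumed helpers: elastic page table, VBD client, fault accounting. */
    extern struct elastic_pte *elastic_pt_lookup(int pid, uint64_t vaddr);
    extern int  vbd_fetch_page(int node, int pid, uint64_t vaddr, void *dst);
    extern void map_local_page(int pid, uint64_t vaddr, void *page);
    extern void count_remote_fault(int pid, int node);

    /* Invoked from the modified page fault handler for elasticized tasks. */
    int eos_handle_fault(int pid, uint64_t fault_addr, void *free_page)
    {
        uint64_t vaddr = fault_addr & EOS_PAGE_MASK;
        struct elastic_pte *pte = elastic_pt_lookup(pid, vaddr);

        if (pte == NULL || !pte->remote)
            return 0;                 /* fall back to the normal fault path */

        /* Pull: ask pg_extract on the owning node for the page contents.  */
        if (vbd_fetch_page(pte->node, pid, vaddr, free_page) != 0)
            return -1;

        map_local_page(pid, vaddr, free_page);   /* restore access          */
        pte->remote = false;

        count_remote_fault(pid, pte->node);      /* feeds the jumping policy */
        return 1;
    }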

3.4 Jumping

Jumping is the act of transferring execution from one node to another. For this, there is both a jumping mechanism that performs a lightweight process migration, and a jumping policy that determines when to jump.

Figure 7: Jumping.

Jumping mechanism: Jumping is a lightweight mechanism similar to checkpoint/restore systems. In contrast to stretching, with jumping the process does actually transfer execution, and it carries in the checkpoint only the information that changes at a high rate. This includes CPU state, the top stack frames, pending signals, auditing information, and I/O context. The overall size of the jumping checkpoint data is dominated by the stack frames, so it is very important to include only the topmost stack memory pages that are necessary for correct execution.

As shown in Figure 7, whenever a jump is deemed necessary by the jumping policy in the EOS Manager, it sends a special signal (SIGJUMP) to the process, which is routed to the eos_sig_handler, which in turn instructs the p_export module to checkpoint the process and send the information to the other node's p_import module. The latter fills in the appropriate kernel-space structures and sets the process's state to runnable. Notice here that when jumping, no new structures need to be allocated, since the process has already been stretched to the target node. Also notice that the process at the source node will remain in a suspended state. In essence, jumping resembles rescheduling a process from one CPU to another, across the boundaries of a single machine.

Jumping policy: Maximizing locality is crucially important to the application's performance. A naive approach to moving execution and memory pages around in the system will, inevitably, increase the rate of remote page faults, leading to poor performance. Thus, a good policy for moving processes close to their most frequently used memory is of critical importance. ElasticOS can achieve this goal by overcoming two challenges, namely having a good sense of how to group inter-dependent memory pages together on the same node, and detecting which of those groups is the most frequently accessed one.

The first challenge can be overcome by taking advantage of the natural groupings that memory pages belonging to an application tend to form due to recency of reference. This property is already evident in the wide adoption of the LRU algorithm for page replacement in most modern OSs. Thus, we can extend LRU algorithms to work in a multi-node system, where pages evicted from one node's RAM are immediately shipped to another node via our pushing mechanism.

The second challenge can be addressed by implementing a jumping policy that: 1) monitors the process's page accesses to find the "preferred" node, and 2) reschedules the process to the preferred node if it is running on any of the other ones.

Bear in mind that accurately tracking memory references for a particular process can be a challenging task, since CPUs do not report every memory access to the OS for performance reasons. This leaves us with options that provide the "next best thing", such as counting the number of times the CPU sets the PG_ACCESSED flag for a particular page frame when it is accessed in the x86_64 architecture, or tracking handled page faults.

4 ElasticOS Implementation

We implemented ElasticOS as a fork of the Linux kernel v2.6.38.8. We chose the 2.6 kernel because it contains the key features that we needed to demonstrate elasticity, e.g., support for 64-bit x86 architectures and a reasonably featured virtual memory manager, while avoiding unnecessary complexity and instability in later kernels.

System Startup: Whenever a machine starts, it sends a message on a pre-configured port announcing its readiness to share its resources. The message includes two groups of information: first, connectivity parameters such as IP addresses and port numbers; second, information noting the machine's available resources, which includes total and free RAM. Next, each participating node records the information received about the newly-available node and initiates network connections for the various clients. Finally, the EOS manager is started, which periodically scans processes and examines their memory usage, searching for opportunities for elasticity.

Identifying such opportunities can be achieved by examining the per-process counters Linux maintains to keep track of memory usage. They include: 1) task_size inside each process's memory descriptor (i.e., struct mm_struct), which keeps track of the size of mapped virtual memory, 2) total_vm inside the same structure, which tracks the process's mapped RAM pages, 3) rss_stat of type struct mm_rss_stat, which contains an array of counters that further breaks down task_size into different categories (i.e., anonymous and file-mapped RAM pages used, and swap entries), and 4) the maj_flt variable inside struct task_struct, which counts the number of swap-ins triggered by the process.

Linux also maintains memory utilization indicators called watermarks. There are three levels of watermarks: min, low, and high. These levels drive the kernel swap daemon's (kswapd) activity. When memory usage reaches the high watermark, page reclaim starts, and when it goes down to the low watermark, page reclaim stops. ElasticOS leverages these watermarks and the level of kswapd's activity to detect periods of memory pressure. Further, it identifies specific memory-intensive processes using the counters mentioned above and marks them for elasticity.
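As a rough illustration of how these counters and watermarks could drive the scan, consider the following user-level style C sketch. The helpers eos_under_memory_pressure() and eos_mark_for_elasticity() are assumed names, and the real EOS manager works directly on kernel structures such as struct task_struct and struct mm_struct rather than on a snapshot array like this one.

    #include <stdbool.h>
    #include <stdint.h>

    struct eos_proc_stats {
        int      pid;
        uint64_t total_vm;    /* mapped RAM pages                    */
        uint64_t maj_flt;     /* swap-ins triggered by the process   */
    };

    /* Assumed helpers. */
    extern bool eos_under_memory_pressure(void);   /* watermark / kswapd signal */
    extern void eos_mark_for_elasticity(int pid);  /* later receives SIGSTRETCH */

    void eos_scan_processes(const struct eos_proc_stats *procs, int nprocs,
                            uint64_t vm_threshold, uint64_t majflt_threshold)
    {
        if (!eos_under_memory_pressure())
            return;                       /* only act during memory pressure */

        for (int i = 0; i < nprocs; i++) {
            /* A large resident footprint plus frequent swap-ins marks a
             * memory-intensive process as a candidate for stretching.     */
            if (procs[i].total_vm > vm_threshold &&
                procs[i].maj_flt  > majflt_threshold)
                eos_mark_for_elasticity(procs[i].pid);
        }
    }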

Stretching Implementation: The Linux kernel forces each process to handle pending signals upon entering the CPU. This is when our in-kernel signal handler, the p_export module, checks for pending ElasticOS-specific signals. This placement of the checkpoint creation logic gives us access to register state, while preventing the process from updating its own memory while a checkpoint is in progress.

The handler then accesses the process information while writing it to a socket initialized during system startup. At the other end of the socket, the p_import module collects the information and uses it to create the new shell process.

The key items included in this checkpoint consist of: the contents of the process descriptor (struct task_struct), the memory descriptor (struct mm_struct), virtual memory mappings (struct vm_area_struct), open files information (struct files_struct), scheduling class information (struct sched_class), signal handling information (struct sighand_struct), and a few others. The overall size of this checkpoint in our experiments averages around nine kilobytes, and is dominated by the size of the process's data segment, which is also included in the checkpoint.

Note that we do not need to copy memory pages containing the code, since our implementation assumes that the same file system is available on all participating nodes. Instead, we carry over with the checkpoint data the mapped file names. Our p_import module will locate and map the same files at the appropriate starting addresses.

P_import handles process creation the same way as if the process were forked locally, while substituting missing values with others from the local machine. For example, it assigns the newly created process a "baby sitter" to replace the real parent from the home node.

Finally, the p_import module leaves the newly created process in a suspended state and informs the p_export module that it can allow the original process in the source node to resume execution.
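To make the division of state more concrete, the following is a minimal, illustrative C sketch of what a stretch checkpoint along the lines listed above could carry. The structure and field names (eos_stretch_ckpt, eos_vma) are our own placeholders, not the actual ElasticOS kernel structures, and the real checkpoint serializes kernel objects rather than these flattened types.

    #include <stddef.h>
    #include <stdint.h>

    struct eos_vma {                /* one virtual memory mapping (mmap)  */
        uint64_t start, end;        /* virtual address range              */
        uint32_t prot, flags;       /* protection and mapping flags       */
        char     file[256];         /* backing file name, if file-mapped  */
    };

    struct eos_stretch_ckpt {       /* slow-changing metadata only        */
        int32_t          pid;       /* identity of the stretched process  */
        size_t           n_vmas;
        struct eos_vma  *vmas;      /* virtual memory mappings            */
        size_t           n_fds;
        int32_t         *fd_table;  /* open file descriptor table         */
        int32_t          sched_class;
        /* Register state, pending signals and the top stack frames are   */
        /* deliberately absent: they change rapidly and travel with each  */
        /* jump (Section 3.4) rather than with the stretch.               */
    };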
Pushing and Pulling Implementation: We extend Linux's second-chance LRU page replacement algorithm by adding multi-node page distribution awareness to it. In this version, pages selected for swapping out that belong to elasticized processes are pushed to another node and injected into the address space of the process duplicate there. Second-chance LRU groups pages in reference-based chronological order within the pages list, so it is most likely that pages at the rear of the queue, which are typically considered for eviction, are related in terms of locality of reference.

One challenge that needed to be solved to implement page balancing is identifying pages belonging to an elasticized process and what virtual address they are mapped to. Luckily, Linux maintains a functionality called reverse mapping, which links anonymous pages to their respective virtual area map. By walking this chain of pointers and then finding which process owns that map, we can tell these pages apart from other pages owned by other processes in the system. Then, with simple calculations, we can find the starting virtual address of that page.

As for moving pages from one machine to another, we created a virtual block device (VBD) that sends page contents over a socket connected to a page server on the other machine (VBD server) rather than storing them to a storage medium. This was shown in Figure 6. This virtual block device is added to the system as a swap device. All pages belonging to an elasticized process that are sent to the other machine are allocated swap entries from this device. The swap entry is inserted into the page table of the elasticized process where the page is mapped. As a result, if that page needs to be faulted in later on, the swap entry will route the page fault to our VBD. This design choice allows us to reuse Linux's page eviction and faulting code.

Jumping: Whenever a remote page fault occurs, a remote page fault counter is incremented. We keep track of the number of remote page faults to use later on for jumping. As the remote page fault counter builds up, it shows the tendency of where page faults are "going". If the remote fault count hits a predetermined threshold, then the system determines that the process would better exploit locality of reference if it jumps to the remote node. Jumping starts by sending a special signal to the target process, which is handled by an in-kernel checkpoint module. This module then copies only the information necessary for the process to resume on the other node. This information includes: 1) the CPU context, which contains the register state and other flags, 2) pending signals (i.e., struct sigpending contents inside struct task_struct), 3) auditing counters, and 4) the stack page frames (i.e., RAM pages mapped by the vm_area_struct with the flag VM_GROWSDOWN set).

In our tests, the checkpoint size was roughly 9KB and was dominated by the two stack page frames (4KB each). Other information about the process will be synchronized using a special module described next. These pieces of information are sent to the restart module in the remote node via a pre-established TCP connection. In its turn, the restart module updates the process information with the checkpoint data and sends a SIGCONT to the process, which informs the scheduler that it is ready to run again. The process on the source machine will remain in an interruptible wait state (i.e., suspended). This guarantees that only one clone of the process is running at any given instance.
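The threshold-based policy just described reduces to a few lines. In this hedged sketch, eos_trigger_jump() stands in for delivering SIGJUMP to the process; the counter and threshold mirror the per-process remote fault accounting above, and the counter is reset after every jump as in our experimental setup (Section 5.1).

    #include <stdint.h>

    struct eos_jump_state {
        uint64_t remote_faults;   /* remote pulls served since the last jump */
        uint64_t threshold;       /* tunable; Section 5 explores 32 up to 4M */
    };

    extern void eos_trigger_jump(int pid, int target_node);  /* assumed helper */

    /* Called each time a pull is served on behalf of this process. */
    void eos_note_remote_fault(struct eos_jump_state *js, int pid, int remote_node)
    {
        if (++js->remote_faults >= js->threshold) {
            /* Execution follows the data: resume on the node we keep faulting to. */
            eos_trigger_jump(pid, remote_node);
            js->remote_faults = 0;        /* reset after every jump */
        }
    }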

State Synchronization: The state synchronization component is built as a collection of user-space programs and a kernel module. The user-space portion simply sets up the network connections and then passes their socket descriptors to the kernel module, which exposes hook functions to the main kernel.

When an elasticized process issues a system call that modifies its in-kernel data structures (e.g., mmap), the appropriate kernel module hook function is called (e.g., sync_new_mmap), which then multicasts a message to all participating nodes. The message contains all necessary information (e.g., the region's starting address, its length, mapping flags, and file name) to apply the same operation on all process replicas. Multicast listeners then relay the message to the appropriate hook functions, which apply the change (i.e., call mmap on the process replica).

5 Performance Evaluation

In this section, we focus on evaluating the performance of ElasticOS. Specifically, we look to quantify the benefit of joint disaggregation (memory and computation) by comparing against network swap, which is one-dimensional (scaling memory only) and has previously been shown to have performance benefits over not scaling memory [20, 10]. We note that we do not explicitly compare against just process migration, as the use cases are different: process/VM migration is commonly used to move execution permanently and is triggered by contention for resources or by other operational reasons (e.g., planned maintenance), making it heavier weight and not well suited for comparison.

5.1 Experimental Setup

We evaluated ElasticOS on the Emulab testbed [9]. We used Emulab D710 nodes with a 64-bit Quad Core Xeon processor, 12 gigabytes of RAM, and a gigabit NIC. We chose Emulab D710 nodes because they support Linux kernel 2.6. Our experimental setup for each experiment consists of two nodes connected via gigabit Ethernet ports through a network switch.

To evaluate, we ran tests on a variety of algorithms representing the type of processing that would be a target use case for ElasticOS – large graphs or lists to be processed. Shown in Table 1 is a summary of these applications, and the footprint of each application – note that the footprint goes beyond the limits of a single node in Emulab. Specifically, these algorithms typically use 11GB of memory on the first machine, and stretch to a remote machine for the additional memory.

Table 1: Tested algorithms and their memory footprints.

Algorithm             Memory Footprint
Depth First Search    330 million nodes (15 GB)
Linear Search         2 billion long int (15 GB)
Dijkstra              3.5 billion int weights (14 GB)
Block Sort            1.8 billion long int (13 GB)
Heap Sort             1.8 billion long int (14 GB)
Count Sort            1.8 billion long int (14 GB)

In our experimental setup, we employed a basic jumping algorithm to trigger transfer of execution. A simple remote page fault counter is updated for each remote pull, and whenever a counter threshold value is reached, the process jumps its execution to the remote machine. In addition, the counter is then reset. We tested the algorithms with different counter threshold values (32 up to 4M).

For each algorithm, we measure its execution time as well as the network traffic generated, and compare results of ElasticOS and network swap. To provide a comparison with network swap, hereafter termed Nswap, in a manner which isolates the gains to simply the benefit of jumping and not any implementation differences, we use ElasticOS code but disable jumping. In this way, Nswap tests pin a process on one machine, but use the memory of a remote machine as a swap space. In our experiments, both ElasticOS and Nswap spanned two machines. Emulab provides isolation of networking and execution in the testbed from external disturbances.

5.2 Micro-benchmarks

An important metric when evaluating ElasticOS is the performance of each individual primitive. These are summarized in Table 2, based on our measurements on Emulab D710 nodes.

Table 2: Micro-benchmarks of ElasticOS primitives.

Primitive   Latency     Network Transfer
Stretch     2.2 ms      9 KB
Push        30-35 us    4 KB
Pull        30-35 us    4 KB
Jump        45-55 us    9 KB

We note that jumping is very fast, taking only 45-55 microseconds. This is substantially lower than reported numbers for process or VM migration, which are measured in seconds (e.g., one benchmark states CRIU's downtime is roughly 3 seconds [23]). Stretching is also only performed once – when a decision is made that this process would benefit from scaling in the future.

We measured pushing and pulling to be between 30-35 microseconds – roughly the time to submit the request and transfer a page (4KB) of data across a network. For jumping to be effective in speeding up execution in ElasticOS, there must be locality. That is, the time for a single jump must be less than the time for the number of remote page pulls that would be saved by jumping. For our performance microbenchmarks, for jumping to be efficient, the process should save at least two remote page pulls. As we show next, the locality is much greater than this, resulting in substantial speedups.
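As a back-of-the-envelope check of that break-even point, taking the midpoints of the measured ranges in Table 2 (the exact figures vary per run):

    $t_{jump} < n \cdot t_{pull} \;\Rightarrow\; n > t_{jump}/t_{pull} \approx 50\,\mu s / 32.5\,\mu s \approx 1.5$

so a jump pays for itself once it avoids at least two remote page pulls, matching the statement above.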

Figure 8: Execution Time Comparison.

Table 3: Jumping Thresholds.

Algorithm       Threshold   Number of jumps   Jumping frequency (jumps/sec)
DFS             8K          180               0.6
Block Sort      512         1032              12.3
Heap Sort       512         3454              12.4
Linear Search   32          3054              157.4
Count Sort      4096        198               0.6
Dijkstra        512         520               1.4

5.3 Execution Time and Network Traffic

There are two key metrics to consider when comparing ElasticOS (with jumping, pulling and pushing) to network swap (just pulling and pushing). The first is overall execution time. Here, the key premise behind jumping is that to exploit locality, we should transfer execution to where the data is, rather than pull in the data to where the execution is. The second is the amount of network traffic – jumping needs to transfer context (e.g., the current stack), and pulling/pushing transfers pages.

In Figure 8, we show our measured average execution time for both Nswap and ElasticOS for each of the algorithms we have evaluated. These execution times are averaged over four runs using the threshold that achieves the most improvement. We observe that in the best case, ElasticOS shows substantial performance benefits for most algorithms. For example, Linear Search experienced about an order of magnitude speedup in execution performance, Depth First Search (DFS) achieved about a 1.5x delay improvement, while Dijkstra's algorithm achieved no speedup.

Table 3 describes the specific threshold values where the best performance was achieved in ElasticOS for each algorithm. It also lists the total number of jumps at that threshold as well as the frequency of jumping for each algorithm at that threshold. The jumping rate ranges from less than once per second to hundreds of times per second.

While Figure 8 represents the best case, we were also interested in understanding whether we could find universal threshold values that achieve performance improvements - perhaps not the best - regardless of the algorithm. Our analysis found that, regardless of the algorithm, using any threshold value above 128, ElasticOS performs better than Nswap, either in delay, network overhead, or both.

The use of jumping to exploit locality improves the execution time by enabling more local pages to be accessed, rather than needing to go across a network (which is orders of magnitude slower). This also reduces the amount of network traffic, even taking into account the data transfer needed to perform a jump. Shown in Figure 9 are our measured results for each of the algorithms tested. We can see that ElasticOS reduces the amount of traffic on the network for all algorithms tested by a significant amount – from a 5x reduction for Linear Search to about a 2x reduction for DFS. By avoiding the process of swapping in and out to remote machines through lightweight jumping, we save a large amount of data and control traffic associated with avoidable remote page faults. Also, even if we did not achieve any delay improvements running ElasticOS, we still can obtain network traffic reduction. For example, Dijkstra's algorithm did not achieve any delay improvement, even though Table 3 shows that Dijkstra had 520 jumps, but these jumps helped reduce its network overhead by 70%. In examining the behavior of Dijkstra's, its initial set of jumps before settling down to execution on one machine resulted in substantial overhead savings.

Figure 9: Network Traffic Comparison.

5.4 Understanding Application Specific Behavior

We previously showed that each algorithm has a varying degree of improvement. While the simple reasoning is that it is due to locality, here we examine three of the algorithms in detail to really understand this behavior.

5.4.1 Linear Search

For Linear Search, the memory access pattern is simple and predictable, namely the memory address space is accessed in a linear fashion. As a result, consecutive memory pages tend to age in LRU lists together, and end up being swapped to the remote machine together. When a process jumps towards a remote page, it is very likely for the process to find a chunk of consecutive pages to access, exploiting the locality of these pages, which saves the process a significant amount of time by avoiding swap overhead. Figure 10 shows delay improvements on Linear Search with respect to the jumping threshold. Linear Search tends to perform better when the counter threshold is smaller, hence jumping early is better when accessing the address space in a linear fashion. Table 3 shows the highest frequency of jumping for Linear Search, as well as the lowest threshold value used. We also observe that as the threshold for jumping increases, jumping will occur less often and eventually not at all, hence explaining why the delay curve for ElasticOS converges to Nswap.

Figure 10: Linear Search Execution Time.

5.4.2 Depth First Search

On the other hand, Depth First Search has a non-linear memory access pattern. When the algorithm starts a depth first search, the search starts at the root node and traverses the graph branch by branch, from root to the end (depth) of the branch. While the graph nodes are laid out in a certain order in the memory space, the access pattern of DFS does not match this layout. This increased randomness of access to pages means that there is less locality to exploit on each jump than occurred for Linear Search, and hence less gain versus Nswap compared to Linear Search. Figure 11 shows different execution times of DFS for various counter threshold sizes. ElasticOS achieves at best about a 1.5x improvement in delay over Nswap across a wide range of counter thresholds, namely larger than 64. However, for very small values of the threshold, less than or equal to 64, DFS performs worse. Figure 12 shows that when the threshold value is very small, DFS experiences a large number of jumps. Also, our tests showed that DFS's best performance happens when the threshold value is large compared to other algorithms, as shown in Table 3.

Figure 11: Depth First Search Execution Time.

Figure 12: Depth First Search Number of Jumps.

The shape of the graph in DFS can also impact the memory access pattern. For example, increasing the depth of the graph would make branches longer, resulting in a longer branch that occupies more memory pages, increasing the chance of a single branch having pages located both on local and remote machines. This would increase the chances of jumping more and performing poorly. Figure 13 shows DFS performance on ElasticOS for different graph depths with a fixed jumping counter size of 512. Increasing the graph depth eventually results in poorer performance. Figure 14 shows that this poorer performance occurs when there is excessive jumping for deep graphs. To make ElasticOS perform better on such graph depths, we need to increase the jumping counter size to values larger than 512, to avoid jumping too much.

Figure 13: Depth First Search Performance on Different Depths.

Figure 14: Depth First Search Jumps on Different Depths.

5.4.3 Dijkstra's Algorithm

ElasticOS achieved very little gain when executing Dijkstra's algorithm when compared to Nswap. Dijkstra's algorithm scans through an adjacency matrix, then learns and stores information about the shortest path in a separate array. However, Dijkstra does not necessarily access all nodes in the adjacency matrix, because some nodes are not connected, or one of the paths was excluded for being too long.

Since Dijkstra's algorithm keeps track of the shortest path nodes in a separate array, it only accesses the adjacency matrix nodes once, and keeps useful information in the shortest path array. Based on how Dijkstra's algorithm works, it does not access memory frequently, and only accesses part of the allocated memory. Therefore, most of Dijkstra's execution time does not involve many remote page faults. Since jumping saves time wasted on remote page faults, Dijkstra does not gain much delay improvement, because it does not jump, due to the very small number of remote page faults. Figure 15 confirms that Dijkstra's algorithm spends most of its execution time on one machine without jumping. Our experiments showed that only a relatively small set of jumps happened at the beginning, and the rest of the time execution stayed on one machine.

6 Discussion and Future Work

We intend to upgrade ElasticOS to a newer version of Linux. We plan to investigate improved jumping algorithms that better exploit locality by actively learning about an elasticized process' memory access patterns during run time and employing adaptive jumping thresholds. Probabilistic models will be investigated. In addition, we will explore whether incorporating the burstiness of remote page faulting into the jumping decision brings any benefit. Also, we are considering a more proactive approach to controlling the swap-out operation for elasticized processes by modifying kswapd. If we selectively swap out pages to remote machines, we might be able to create islands of locality on remote machines, thus making jumping more efficient. We can also pin memory pages and prevent them from being swapped, which would allow us to control how the memory address space is distributed across participating machines. We plan to test a wider variety of algorithms, including SQL-like database operations. We intend to expand testing to more than two nodes.

7 Conclusion

In this paper, we have implemented within Linux four new primitives, namely stretch, push, pull, and jump, to support scaling as an OS abstraction. This extended Linux system is called ElasticOS. These primitives transparently achieve joint disaggregation of computation and memory, enabling both data to move towards execution, and execution to move towards data, within a stretched address space spanning multiple nodes in a data center. Our evaluation results were obtained from Emulab deployments of ElasticOS testing a variety of different application algorithms, and indicate that such joint disaggregation achieves up to 10x speedup in execution time over network swap, as well as 2-5x reductions in network overhead.

8 Acknowledgments

This research was supported by NSF CCF grant #1337399, funded under the NSF program Exploiting Parallelism and Scalability, "XPS: SDA: Elasticizing the Linux Operating System for the Cloud". We also wish to thank Sepideh Goodarzy and Ethan Hanner.

Figure 15: Maximum Time Spent on a Machine without Jumping.

References

[1] Partitioned global address space. http://www.pgas.org/. Accessed: 2017-06-05.
[2] Amazon EC2 pricing. https://aws.amazon.com/ec2/pricing/. [Online; accessed 6-Feb-2018].
[3] Barak, A., Shiloh, A., and Amar, L. An organizational grid of federated MOSIX clusters. In Cluster Computing and the Grid (CCGrid 2005), IEEE International Symposium on (2005), vol. 1, IEEE, pp. 350-357.
[4] Deshpande, U., Wang, B., Haque, S., Hines, M., and Gopalan, K. MemX: Virtualization of cluster-wide memory. In Parallel Processing (ICPP), 2010 39th International Conference on (2010), IEEE, pp. 663-672.
[5] Douglis, F. Experience with process migration in Sprite. In Proceedings of the USENIX Workshop on Distributed and Multiprocessor Systems (1989), pp. 59-72.
[6] Dragojević, A., Narayanan, D., Hodson, O., and Castro, M. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (2014), pp. 401-414.
[7] El-Rewini, H., and Abd-El-Barr, M. Message passing interface (MPI). Advanced Computer Architecture and Parallel Processing (2005), 205-233.
[8] Emelyanov, P. CRIU: Checkpoint/restore in userspace, July 2011.
[9] Emulab. https://www.emulab.net/portal/. [Online; accessed 6-Feb-2018].
[10] Gu, J., Lee, Y., Zhang, Y., Chowdhury, M., and Shin, K. G. Efficient memory disaggregation with Infiniswap. In NSDI (2017), pp. 649-667.
[11] Gupta, A., Ababneh, E., Han, R., and Keller, E. Towards elastic operating systems. In HotOS (2013).
[12] Hadoop. http://hadoop.apache.org/. [Online; accessed 6-Feb-2018].
[13] Hargrove, P. H., and Duell, J. C. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series (2006), IOP Publishing, p. 494.
[14] Keleher, P., Cox, A. L., Dwarkadas, S., and Zwaenepoel, W. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Proceedings of the USENIX Winter 1994 Technical Conference (Berkeley, CA, USA, 1994), USENIX Association, pp. 10-10.
[15] Kerrighed. http://www.kerrighed.org/wiki/index.php/Main_Page/. [Online; accessed 6-Feb-2018].
[16] Leach, P. J., Levine, P. H., Hamilton, J. A., and Stumpf, B. L. The file system of an integrated local network. In Proceedings of the 1985 ACM Thirteenth Annual Conference on Computer Science (New York, NY, USA, 1985), CSC '85, ACM, pp. 309-324.
[17] Li, K., and Hudak, P. Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS) 7, 4 (1989), 321-359.
[18] Milojičić, D. S., Douglis, F., Paindaveine, Y., Wheeler, R., and Zhou, S. Process migration. ACM Computing Surveys (CSUR) 32, 3 (2000), 241-299.
[19] Minnich, R. G., and Pryor, D. V. Mether: Supporting the shared memory model on computing clusters. In Compcon Spring '93, Digest of Papers (1993), IEEE, pp. 558-567.
[20] Newhall, T., Finney, S., Ganchev, K., and Spiegel, M. Nswap: A network swapping module for Linux clusters. In European Conference on Parallel Processing (2003), Springer, pp. 1160-1169.
[21] Nitzberg, B., and Lo, V. Distributed shared memory: A survey of issues and algorithms. Computer 24, 8 (1991), 52-60.
[22] Ongaro, D., Rumble, S. M., Stutsman, R., Ousterhout, J., and Rosenblum, M. Fast crash recovery in RAMCloud. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, pp. 29-41.
[23] CRIU. Performance research. https://criu.org/Performance_research. [Online; accessed 6-Feb-2018].
[24] Roush, E. T., and Campbell, R. H. Fast dynamic process migration. In Distributed Computing Systems, 1996, Proceedings of the 16th International Conference on (1996), IEEE, pp. 637-645.
[25] Smith, J. M. A survey of process migration mechanisms. ACM SIGOPS Operating Systems Review 22, 3 (1988), 28-40.
[26] Spark. https://spark.apache.org/. [Online; accessed 6-Feb-2018].
[27] Stellner, G. CoCheck: Checkpointing and process migration for MPI. In Parallel Processing Symposium, 1996, Proceedings of IPPS '96, The 10th International (1996), IEEE, pp. 526-531.
[28] Tanenbaum, A. S., Kaashoek, M. F., Renesse, R. v., and Bal, H. E. The Amoeba distributed operating system - a status report. Computer Communications 14 (1991), 324-335.
[29] Zhu, G., Lu, K., Wang, X., Zhang, Y., Zhang, P., and Mittal, S. SwapX: An NVM-based hierarchical swapping framework. IEEE Access 5 (2017), 16383-16392.
