Ghost: Fast & Flexible User-Space Delegation of Linux Scheduling

ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling Jack Tigar Humphries Neel Natu Ashwin Chaugule Google Google Google Ofir Weisse Barret Rhoden Josh Don Google Google Google Luigi Rizzo Oleg Rombakh Paul Turner Google Google Google Christos Kozyrakis Stanford University Abstract types can substantially improve key metrics such as latency, We present ghOSt, our infrastructure for delegating kernel throughput, hard/soft real-time characteristics, energy effi- scheduling decisions to userspace code. ghOSt is designed ciency, cache interference, and security [1–26]. For example, to support the rapidly evolving needs of our data center the Shinjuku request scheduler [25] optimized highly disper- workloads and platforms. sive workloads – workloads with a mix of short and long Improving scheduling decisions can drastically improve requests – improving request tail latency and throughput the throughput, tail latency, scalability, and security of im- by an order of magnitude. The Tableau scheduler for virtual portant workloads. However, kernel schedulers are difficult machine workloads [23] demonstrated improved throughput to implement, test, and deploy efficiently across a large fleet. by 1.6× and latency by 17× under multi-tenant scenarios. Recent research suggests bespoke scheduling policies, within The Caladan scheduler [21] focused on resource interference custom data plane operating systems, can provide compelling between foreground low-latency apps and background best performance results in a data center setting. However, these effort apps, improving network request tail latency byas gains have proved difficult to realize as it is impractical to much as 11,000×. To mitigate recent hardware vulnerabil- deploy a custom OS image(s) at an application granularity, ities [27–32], cloud platforms running multi-tenant hosts particularly in a multi-tenant environment, limiting the prac- had to rapidly deploy new core-isolation policies, isolating tical applications of these new techniques. shared processor state between applications. ghOSt provides general-purpose delegation of schedul- Designing, implementing, and deploying new scheduling ing policy to userspace processes in a Linux environment. policies across a large fleet is an exacting task. It requires de- ghOSt provides state encapsulation, communication, and velopers to design policies capable of balancing the specific action mechanisms that allow complex expression of sched- performance requirements of many applications. The imple- uling policies within a userspace agent, while assisting in mentation must conform with a complex kernel architecture, synchronization. Programmers use any language to develop and errors will, in many cases, crash the entire system or and optimize policies, which are modified without a host otherwise severely impede performance due to unintended reboot. ghOSt supports a wide range of scheduling models, side effects [33]. Even when successful, the disruptive nature from per-CPU to centralized, run-to-completion to preemp- of an upgrade carries its own opportunity cost in host and tive, and incurs low overheads for scheduling actions. We application down-time. This creates a challenging conflict demonstrate ghOSt’s performance on both academic and between risk-minimization and progress. real-world workloads. We show that by using ghOSt instead Prior attempts to improve performance and reduce com- of the kernel scheduler, we can quickly achieve comparable plexity in the kernel by designing userspace solutions have throughput and latency while enabling policy optimization, significant shortcomings: they require substantial modifi- non-disruptive upgrades, and fault isolation for our data- cation of application implementation [1, 10, 21, 25, 34, 35], center workloads. We open-source our implementation to require dedicated resources in order to be highly respon- enable future research and development based on ghOSt. sive [1, 21, 25], or otherwise require application-specific kernel modifications [1, 2, 10, 21, 25, 34–37]. 1 Introduction Although Linux does support multiple scheduling implementations; tailoring, maintaining, and deploying a different CPU scheduling plays an important role in application perfor- mechanism for every different application is impractical to mance and security. Tailoring policies for specific workload 1 Kernel User ghOSt supports concurrent execution of multiple policies, Workload space space fault isolation, and allocation of CPU resources to different Optional scheduling hints applications. Importantly, to enable practical transition of Thread/CPU Messages existing systems in the fleet to use ghOSt, it co-exists with Kernel ghOSt agents Status words other schedulers running in the kernel, such as CFS (§3.4). We demonstrate that ghOSt enables us to define various ghOSt Transactions scheduling class CPU scheduling policies leading to performance comparable or better than Syscalls decisions existing schedulers on both academic and production workloads (§4). We characterize the overheads of key ghOSt op- erations (§4.1) – showing that ghOSt’s overheads are only Figure 1. Overview of ghOSt. a few hundred nanoseconds higher than comparable schedulers. We evaluate ghOSt by implementing centralized and preemptive policies for `s-scale workloads that lead to high manage on a large fleet. One of our goals is to formalize throughput and low tail latency in the presence of high re- what kernel modifications are required to enable a plethora quest dispersion [10, 25] and antagonists [1, 21] (§4.2). We of optimizations (and experimentation) in userspace, on a compare ghOSt to Shinjuku [25], a specialized modern data stable kernel ABI. plane, to show ghOSt’s minimal overhead and competitive Further, the hardware landscape is ever changing, stress- `s-scale performance (within 5% of Shinjuku) and while sup- ing existing scheduling abstraction models [38] originally porting a broader set of workloads, including multi-tenancy. designed to manage simpler systems. Schedulers must evolve We also implement a policy for our production packet- to support these rapid changes: increasing core counts, new switching framework used in our data centers, leading to sharing properties (e.g., SMT), and heterogeneous systems comparable and in some cases 5-30% better tail latency than (e.g. big.LITTLE[39]). NUMA properties of even a single- the soft real-time scheduler used today (§4.3). Lastly, we socket platform continue to increase (e.g., chiplets [40]). Com- implement a ghOSt policy for machines running our pro- pute offloads such as AWS Nitro [41] and DPUs [42] as well duction, massive-scale database, which serves billions of as domain-specific accelerators such as GPUs and TPUs[43] requests per day (§4.4). For that workload, we show for both are new types of tightly coupled compute devices that exist throughput and latency that ghOSt matches and often out- entirely beyond the domain of a classical scheduler. We need performs by 40-50% the kernel scheduling policy we use a paradigm to more efficiently support the evolving prob- today by customizing to the workload’s machine topology in lem domain than is possible today with monolithic kernel <1000 total lines of code. With ghOSt, scheduling strategies implementations. — previously requiring extensive kernel modification — can In this paper, we present the design of ghOSt and its eval- be implemented in just 10s or 100s of lines of code. uation on production and academic use-cases. ghOSt is a scheduler for native OS threads that delegates policy decisions to userspace. The goal of ghOSt is to fundamentally 2 Background & Design Goals change how scheduling policies are designed, implemented, and Large cloud providers and users (e.g., cloud clients) are mo- deployed. ghOSt provides the agility of userspace development tivated to deploy new scheduling policies to optimize per- and ease of deployment, while still enabling `s-scale schedul- formance for key workloads running on increasingly com- ing. ghOSt provides abstraction and interface for userspace plex hardware topologies, and provide protection against software to define complex scheduling policies, from the new hardware vulnerabilities such as Spectre [29–32] and per-CPU model to a system-wide (centralized) model. Impor- L1TF/MDS [27, 31, 44] by isolating untrusted threads on tantly, ghOSt decouples the kernel scheduling mechanism separate physical cores. from policy definition. The mechanism resides in the kernel Scheduling in Linux. Linux supports implementing mul- and rarely changes. The policy definition resides in userspace tiple policies via scheduling classes [7]. Classes are ordered and rapidly changes. We open-source our implementation by their priority: a thread scheduled with a higher priority to enable future research and development using ghOSt. class will preempt a thread scheduled with a lower prior- By scheduling native threads, ghOSt supports existing ap- ity class. To optimize performance of specific applications, plications without any changes. In ghOSt (Fig. 1), the schedul- developers can modify a scheduling class to better fit the ap- ing policy logic runs within normal Linux processes, denoted plication’s needs, e.g., prioritizing the application’s threads to as agents, which interact with the kernel via the ghOSt API. reduce network latency [21] or scheduling the application’s The kernel notifies the agent(s) of state changes for all man- threads using real-time policies to meet deadlines [45] for

Ghost: Fast & Flexible User-Space Delegation of Linux Scheduling

Confine: Automated System Call Policy Generation for Container

Operating System Support for Run-Time Security with a Trusted Execution Environment

Master Thesis

An Operating System Architecture for Intra-Kernel Privilege Separation

Towards a Formally Verified Microkernel Using the VCC Verifier

The Journal for the International Ada Community

Implementing Molecular Dynamics on Hybrid High Performance Computers - Short Range Forces

Linux Distrib Vendors Make Patches Available for GHOST 29 January 2015, by Nancy Owano

Formal Verification for High Assurance Software: a Case Study Using the SPARK Auto-Active Verification Oolsett

SCONE: Secure Linux Containers with Intel

Schnek: a C++ Library for the Development of Parallel Simulation Codes on Regular Grids.” Computer Physics Communications, Vol

Virtual Ghost: Protecting Applications from Hostile Operating Systems