Helios: Heterogeneous Multiprocessing with Satellite Kernels

Helios: Heterogeneous Multiprocessing with Satellite Kernels † Edmund B. Nightingale Orion Hodson Ross McIlroy Microsoft Research Microsoft Research University of Glasgow, UK Chris Hawblitzel Galen Hunt Microsoft Research Microsoft Research ABSTRACT 1. INTRODUCTION Helios is an operating system designed to simplify the task of writ- Operating systems are designed for homogeneous hardware ar- ing, deploying, and tuning applications for heterogeneous platforms. chitectures. Within a machine, CPUs are treated as interchange- Helios introduces satellite kernels, which export a single, uniform able parts. Each CPU is assumed to provide equivalent functional- set of OS abstractions across CPUs of disparate architectures and ity, instruction throughput, and cache-coherent access to memory. performance characteristics. Access to I/O services such as file At most, an operating system must contend with a cache-coherent systems are made transparent via remote message passing, which non-uniform memory architecture (NUMA), which results in vary- extends a standard microkernel message-passing abstraction to a ing access times to different portions of main memory. satellite kernel infrastructure. Helios retargets applications to avail- However, computing environments are no longer homogeneous. able ISAs by compiling from an intermediate language. To simplify Programmable devices, such as GPUs and NICs, fragment the tra- deploying and tuning application performance, Helios exposes an ditional model of computing by introducing “islands of computa- affinity metric to developers. Affinity provides a hint to the operat- tion” where developers can run arbitrary code to take advantage of ing system about whether a process would benefit from executing device-specific features. For example, GPUs often provide high- on the same platform as a service it depends upon. performance vector processing, while a programmable NIC pro- We developed satellite kernels for an XScale programmable I/O vides the opportunity to compute “close to the source” without card and for cache-coherent NUMA architectures. We offloaded wasting time communicating over a bus to a general purpose CPU. several applications and operating system components, often by These devices are not cache-coherent with respect to general pur- changing only a single line of metadata. We show up to a 28% pose CPUs, are programmed with unique instruction sets, and often performance improvement by offloading tasks to the XScale I/O have dramatically different performance characteristics. card. On a mail-server benchmark, we show a 39% improvement Operating systems effectively ignore programmable devices by in performance by automatically splitting the application among treating them no differently than traditional, non-programmable multiple NUMA domains. I/O devices. Therefore device drivers are the only available method of communicating with a programmable device. Unfortunately, the device driver interface was designed for pushing bits back and forth Categories and Subject Descriptors over a bus, rather than acting as an interface through which appli- D.4.4 [Operating Systems]: Communications Management; D.4.7 cations coordinate and execute. As a result, there often exists little [Operating Systems]: Organization and Design; D.4.8 [Operating or no support on a programmable device for once straightforward Systems]: Performance tasks such as accessing other I/O devices (e.g., writing to disk), de- bugging, or getting user input. A secondary problem is that drivers, which execute in a privileged space within the kernel, become ever General Terms more complicated as they take on new tasks such as executing ap- Design, Management, Performance plication frameworks. For example, the NVIDIA graphics driver, which supports the CUDA runtime for programming GPUs, con- tains an entire JIT compiler. Keywords Helios is an operating system designed to simplify the task of Operating systems, heterogeneous computing writing, deploying, and tuning applications for heterogeneous platforms. Helios introduces satellite kernels, which export a single, † uniform set of OS abstractions across CPUs of disparate archi- Work completed during an internship at Microsoft Research. tectures and performance characteristics. Satellite kernels allow developers to write applications against familiar operating system APIs and abstractions. In addition, Helios extends satellite kernels to NUMA architectures, treating NUMA as a shared-nothing Permission to make digital or hard copies of all or part of this work for multiprocessor. Each NUMA domain, which consists of a set of personal or classroom use is granted without fee provided that copies are CPUs and co-located memory, runs its own satellite kernel and in- not made or distributed for profit or commercial advantage and that copies dependently manages its resources. By replicating kernel code and bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific by making performance boundaries between NUMA domains ex- permission and/or a fee. plicit, Helios removes the kernel as a bottleneck to scaling up per- SOSP’09, October 11–14, 2009, Big Sky, Montana, USA. formance in large multiprocessor systems. Copyright 2009 ACM 978-1-60558-752-3/09/10 ...$10.00. 221 Satellite kernels are microkernels. Each satellite kernel is com- We discuss the design goals for Helios in the next section and posed of a scheduler, a memory manager, a namespace manager, we describe the implementation in Section 3. Section 4 evaluates and code to coordinate communication between other kernels. All Helios, Section 5 discusses related work, and then we conclude. other traditional operating system drivers and services (e.g., a file system) execute as individual processes. The first satellite kernel to 2. DESIGN GOALS boot, called the coordinator kernel, discovers programmable de- We followed four design goals when we created an operating vices and launches additional satellite kernels. Helios provides system for heterogeneous platforms. First, the operating system transparent access to services executing on satellite kernels by ex- should efficiently export a single OS abstraction across different tending a traditional message-passing interface to include remote programmable devices. Second, inter-process communication that message passing. When applications or services communicate with spans two different programmable devices should function no dif- each other on the same satellite kernel a fast, zero-copy, message- ferently than IPC on a single device. Third, the operating system passing interface is used. However, if communication occurs be- should provide mechanisms to simplify deploying and tuning ap- tween two different satellite kernels, then remote message pass- plications. Fourth, it should provide a means to encapsulate the ing automatically marshals messages between the kernels to facili- disparate architectures of multiple programmable devices. tate communication. Since applications are written for a message- passing interface, no changes are required when an application is 2.1 Many Kernels: One Set of Abstractions run on a programmable device. Exporting a single set of abstractions across many platforms sim- In a heterogeneous environment, the placement of applications plifies writing applications for different programmable devices. Fur- can have a drastic impact on performance. Therefore, Helios sim- ther, these abstractions must be exported efficiently to be useful. plifies application deployment by exporting an affinity metric that is Therefore, when determining the composition of a satellite kernel, expressed over message-passing channels. A positive affinity pro- the following guidelines were followed: vides a hint to the operating system that two components will benefit from fast message passing, and should execute on the same satel- • Avoid unnecessary remote communication. A design that re- lite kernel. A negative affinity suggests that the two components quires frequent communication to remote programmable de- should execute on different satellite kernels. Helios uses affinity vices or NUMA domains (e.g., forwarding requests to a gen- values to automatically make placement decisions when processes eral purpose CPU) would impose a high performance penalty. are started. For example, the Helios networking stack expresses Therefore, such communication should be invoked only when a positive affinity for the channels used to communicate with a a request cannot be serviced locally. network device driver. When a programmable network adapter is present, the positive affinity between the networking stack and the • Require minimal hardware primitives. If a programmable driver executing on the adapter causes Helios to automatically of- device requires a feature-set that is too constrained (e.g., an fload the entire networking stack to the adapter. Offloading the MMU), then few devices will be able to run Helios. On the networking stack does not require any changes to its source code. other hand, requiring too few primitives might force Helios Affinity values are expressed as part of an application’s XML meta- to communicate with other devices to implement basic fea- data file, and can easily be changed by developers or system admin- tures (such as interrupts), which violates the previous guide- istrators to tune application performance or adapt an application to line. Therefore, Helios should require a minimal set of

Helios: Heterogeneous Multiprocessing with Satellite Kernels

Multiprocessing Contents

Multiprocessing and Scalability

Design of a Message Passing Interface for Multiprocessing with Atmel Microcontrollers

Piotr Warczak Quarter: Fall 2011 Student ID: 99XXXXX Credit: 2 Grading: Decimal

Message Passing Fundamentals

Composable Multi-Threading and Multi-Processing for Numeric Libraries

14. Parallel Computing 14.1 Introduction 14.2 Independent

Efficient Parallel Approaches to Financial Derivatives and Rapid Stochastic Convergence

Parallel Processing with Eight-Node Raspberry Pi Cluster

Introduction to Multithreading, Superthreading and Hyperthreading Introduction

Three Basic Multiprocessing Issues

Multiprocessing: Architectures and Algorithms