The Utility Coprocessor: Massively Parallel Computation from the Coffee Shop

The Utility Coprocessor: Massively Parallel Computation from the Coffee Shop John R. Douceur, Jeremy Elson, Jon Howell, and Jacob R. Lorch Microsoft Research Abstract slow jobs that take several minutes without UCop be- UCop, the “utility coprocessor,” is middleware that come interactive (15–20 seconds) with it. Thanks to makes it cheap and easy to achieve dramatic speedups the recent emergence of utility-computing services like of parallelizable, CPU-bound desktop applications using Amazon EC2 [8] and FlexiScale [45], which rent com- utility computing clusters in the cloud. To make UCop puters by the hour on a moment’s notice, anyone with a performant, we introduced techniques to overcome the credit card and $10 can use UCop to speed up his own low available bandwidth and high latency typical of the parallel applications. networks that separate users’ desktops from a utility One way to describe UCop is that it effectively con- computing service. To make UCop economical and easy verts application software into a scalable cloud service to use, we devised a scheme that hides the heterogene- targeted at exactly one user. This goal entails five re- ity of client configurations, allowing a single cluster to quirements. Configuration transparency means the ser- serve virtually everyone: in our Linux-based prototype, vice matches the user’s application, library, and con- the only requirement is that users and the cluster are us- figuration state. Non-invasive installation means UCop ing the same major kernel version. works with a user’s existing file system and application This paper presents the design, implementation, and configuration. Application generality means a developer evaluation of UCop, employing 32–64 nodes in Amazon can easily apply the system to any of a variety of applica- EC2, a popular utility computing service. It achieves tions, and ease of integration means it can be done with 6–11× speedups on CPU-bound desktop applications minimal changes to the application. Finally, the system ranging from video editing and photorealistic rendering must be performant. to strategy games, with only minor modifications to the UCop achieves these goals. To guarantee that the clus- original applications. These speedups improve perfor- ter uses exactly the same inputs as a process running on mance from the coffee-break timescale of minutes to the the client, it exclusively uses clients’ data files, applica- 15–20 second timescale of interactive performance. tion images, and library binaries; the cluster’s own file system is not visible to clients. The application exten- 1 Introduction sion is a simple user-mode library that can be installed easily and non-invasively. We demonstrate UCop’s gen- The hallmark that separates desktop computing from erality by applying it to the diverse application areas batch computing is the notion of interactivity: users can of 3D modeling, strategy games, and video editing; we see their work in finished form as they go. However, also describe six other suitable application classes. Fi- many CPU-intensive applications that are best used inter- nally, UCop is easy to integrate: with our 295-line patch actively, such as video editing, 3D modeling, and strat- to a video editor, users can exploit a 32-node cluster egy games, can be slow enough even on modern desktop (at $7/hour), transforming three-minute batch workflow hardware that the user experience is disrupted by long for video compositing into 15-second interactive WYSI- wait times. This paper presents the Utility Coproces- WYG display. sor (UCop), a system that dramatically speeds up desk- The biggest challenge in splitting computation be- top applications that are CPU-bound and parallelizable tween the desktop and the cloud is achieving good per- by supplementing them with the power of a large data- formance despite the high-latency, low-bandwidth net- center compute cluster. We demonstrate several applica- work that separates them. It is dealing with this chal- tions and workloads that are changed in kind by UCop: lenge that most distinguishes our work from past “depart- 1 ment clusters,” such as NOW [9], MOSIX [12], and Con- bandwidth (§5.4), and an analysis of the optimized sys- dor [40], which assume users and compute resources are tem’s time budget (§5.5). Finally, §6 concludes. colocated and connected by a fast network. UCop com- bines a variety of old and new techniques to address net- 2 Prior Work working issues. To reduce latency penalties, we carefully relax the file consistency contract and use automatic pro- UCop bears similarity to prior research on computa- filing to send cache validation information to the server tional clusters, grids, process migration, network file sys- before it is needed. To reduce bandwidth penalties, tems, and parallel programming models. we use remote differential compression. A library-level Computational clusters are collections of computers multiplexer on the cluster end of the link scales the ef- that are typically homogeneously configured, geograph- fects of these techniques across many servers. This com- ically close, and either moderately or very tightly cou- bination reduces UCop’s overhead for remotely running pled. Sprite [32] is a distributed operating system that a process (assuming most of its dependencies are cached provides a network file system, process-migration facil- in the cloud) down to just a few seconds, even on links ities, and a single system image to a cluster of worksta- with latencies of several hundred milliseconds. tions. MOSIX [12] is a management system that runs on Of course, it makes little sense to pay a remote- clusters of x86-based Linux computers; it supports high- execution overhead of a few seconds for a computation performance computing for both batch and interactive that could be done locally in less time. UCop is also processes via automatic resource discovery and dynamic not practical for tasks that are I/O bound, or for multi- workload distribution. Condor [40] is a software frame- threaded applications with fine-grained parallelism. In work that runs on Linux, Unix, Mac OS X, FreeBSD, and other words, UCop will not speed up an Emacs session or Windows, and supports the parallel execution of tasks on reduce the wait while Outlook indexes incoming email. tightly coupled clusters or idle desktop machines. The However, there is an important class of desktop applica- Berkeley NOW [9] system is a distributed supercomputer tions that are both CPU-bound and parallelizable; UCop running on a set of extremely tightly coupled worksta- enhances such applications less invasively and at more tions interconnected via Myrinet. Cluster systems have interactive timescales than existing systems. been applied to interactive applications, including some The contributions of this paper are: of those we consider in §4, such as compilation [28] and • We identify a new cluster computing configuration: graphics rendering [25]. However, for transparent paral- remote parallelization for interactive performance, lelization, clusters require the client to be one of the ma- which provides practical benefit to independent, in- chines in the cluster, requiring invasive installation. By dividual users. contrast, in the UCop system architecture, the client is ar- bitrarily configured, geographically remote, and largely • We identify the primary challenges of this new con- decoupled from the cluster. figuration: the latency and bandwidth constraints of Computational grids [19] are collections of comput- the user’s access link. ers that are loosely coupled, heterogeneously configured, and geographically dispersed. Grid systems comprise a • We introduce prethrowing and task-end-to-start large body of work, encompassing various projects (e.g., consistency as techniques for dealing with that link. the Open Science Grid [20] and EGEE [3]), standards (e.g., WSRF [11]), recommendations (e.g., OGSA [20]), • We add remote differential compression, a cluster- and toolkits (e.g., Globus [18] and gLite [5]). Although side multiplexer, and a shared cache, resulting in the majority of work on grid systems is focused on batch a system that can invoke a wide parallel computa- processing, there has been some limited research into tion using just four round-trip latencies and minimal adding interactivity to grid systems. IC2D [14] is a bandwidth. graphical environment for monitoring and steering ap- • We show our system is performant, easy to deploy, plications that employ the ProActive Java library. I- and can readily adapt existing programs into paral- GASP [13] is a system that provides grid interactivity lel services running in the cloud. via a remote shell and desktop. It also includes middleware for matching applications to their required re- We begin with a review of related work in §2. §3 sources, which is considered by some [38] to be a criti- describes UCop’s architecture and implementation, and cal problem for satisfying the quality-of-service require- §4 describes UCop applications. §5 has several evalu- ments of interactive applications. The DISCOVER [27] ations: microbenchmarks (§5.1), end-to-end application system provides system-integration middleware and an benchmarks (§5.2), a decomposition of each optimiza- application-control network to support runtime moni- tion’s effect (§5.3), a sensitivity analysis to latency and toring and steering of batch applications. The Interac- 2 tive European Grid project [6] provides many services mance (§5.3). In addition, LBFS uses leases instead of intended to support interactivity, including a migrating prethrowing, so the machine that holds the authoritative desktop, complex visualization services, job scheduling, files (the server in LBFS terms, but the client in UCop and security services [34]. However, none of this work terms) must store its files using the Arla [44] AFS client. supports interactive applications per se, but rather pro- This would require invasive installation in our scenario. vides mechanisms for interactively monitoring and ma- Involved parallel programming models, such as the nipulating a long-running distributed computation.

The Utility Coprocessor: Massively Parallel Computation from the Coffee Shop

Clustering with Openmosix

Load Balancing Experiments in Openmosix”, Inter- National Conference on Computers and Their Appli- Cations , Seattle WA, March 2006 0 0 5 4 9

HPC with Openmosix

Introduction Course Outline Why Does This Fail? Lectures Tutorials

Legoos: a Disseminated, Distributed OS for Hardware Resource

Architectural Review of Load Balancing Single System Image

SEMINARIS DOCENTS DE CASO 03/04 - 2Q NÚM GRUP: Facultat D´Informàtica De Barcelona - Departament AC - UPC

Study of Mosix Software Tool for Cluster Computing in Linux

Process Migration in Distributed Systems

Using Mosix for Wide-Area Compuational Resources

Task Management Issues in Distributed Systems Abstract 1 Introduction 2 Techniques and Attributes of Task Management

Open-MPI Over MOSIX Parallel Computing in a Clustered World