COMET: Code Offload by Migrating Execution Transparently
Total Page:16
File Type:pdf, Size:1020Kb
COMET: Code Offload by Migrating Execution Transparently Mark S. Gordony D. Anoushe Jamshidiy Scott Mahlkey Z. Morley Maoy Xu Chen? yEECS Department, ? AT&T Labs - Research University of Michigan [email protected] Ann Arbor, MI, USA fmsgsss,ajamshid,mahlke,[email protected] Abstract In our work, we apply DSM to offloading – the task of augmenting low performance computing elements with In this paper we introduce a runtime system to allow high performance elements. unmodified multi-threaded applications to use multiple Offloading is an idea that has been around as long machines. The system allows threads to migrate freely as there has been a disparity between the computational between machines depending on the workload. Our pro- powers of available computing elements. The idea has totype, COMET (Code Offload by Migrating Execution grown in popularity with the concept of ubiquitous com- Transparently), is a realization of this design built on top puting where many low-powered, well-connected com- of the Dalvik Virtual Machine. COMET leverages the puting elements would exist that could benefit from the underlying memory model of our runtime to implement computation of nearby server-class machines. The pop- distributed shared memory (DSM) with as few interac- ular approach to this problem is visible in specialized tions between machines as possible. Making use of a systems like Google Translate and Apple iOS’s Siri. For new VM-synchronization primitive, COMET imposes lit- broad applicability, COMET and other recent work [9, 8] tle restriction on when migration can occur. Additionally, have aimed to generalize this approach to enable offload- enough information is maintained so one machine may ing in applications that contain no offloading logic. These resume computation after a network failure. systems when compared to specialized offloading systems We target our efforts towards augmenting smartphones can offer similar benefits that compilers offer over hand- or tablets with machines available in the network. We optimized code. They can save programmer effort and demonstrate the effectiveness of COMET on several real in some cases outperform many specialized efforts. We applications available on Google Play. These applications believe our work is unique as it is the first to apply DSM include image editors, turn-based games, a trip planner, to offloading. Using DSM instead of remote procedure and math tools. Utilizing a server-class machine, COMET calls (RPC) offers many advantages including full multi- can offer significant speed-ups on these real applications threading support, migration of a thread at any point when run on a modern smartphone. With WiFi and 3G during its execution, and in some cases, more efficient networks, we observe geometric mean speed-ups of 2.88X data movement. and 1.27X relative to the Dalvik interpreter across the set However, in applying DSM, we faced many challenges of applications with speed-ups as high as 15X on some unique to our use case. First, the latency and bandwidth applications. characteristics of a smartphone’s network connection to an external server are much worse than those of a clus- 1 Introduction ter of workstations using a wired connection. Second, Distributed Shared Memory (DSM) systems provide a the type of computation is significantly different: while way for memory to be accessed and modified between previous DSM systems focused on scientific computing, computing elements. It was an active area of research we aim to augment real user-facing applications with in the late 1980s. Classically, DSM has been applied to more stringent response requirements. Despite these chal- networks of workstations, special purpose message pass- lenges, we show in x5 that performance improvements can ing machines, custom hardware, and heterogeneous sys- be significant for a wide range of applications across both tems [18]. With the onset of relatively low performance WiFi and 3G networks. smartphones, a new use case for DSM has presented itself. Our design aims to serve these primary goals: 1 Unmodified & Executes concurrently The rest of the paper is organized as follows. In x2, multi-threaded with non-offloaded threads we summarize works in the field of DSM and offloading Mobile Offloaded and discuss how they relate to COMET. x3 presents the application threads overall system design of COMET. x4 describes how we implemented COMET using Android’s Dalvik Virtual Ma- Memory in-sync Memory chine. x5 presents how we evaluated COMET including states states the overheads introduced by the offloading system, per- Distributed memory Distributed memory formance and energy improvements for our benchmark synchronization via synchronization suite, and two case studies demonstrating some interesting network challenges and how COMET solves them. x6 covers some PhoneOS RemoteOS limitations of our system, and x7 ends with final remarks. Figure 1: High-level view of COMET’s architecture • Require only program binary (no manual effort) 2 Related Work • Execute multi-threaded programs correctly A significant amount of work has gone into designing • Improve speed of computation systems that combine the computation efforts of multiple • Resist network and server failures machines. The biggest category of such systems are • Generalize well with existing applications those that create new programming language constructs to facilitate offloading. Agent Tcl [14] was one such a Building upon previous work, our work offers sev- system that allowed a programmer to easily let computa- eral new contributions. To the best of our knowledge, tion “jump” from one endpoint to another. Other systems we propose a new approach to lazy release consistency that fall into this broad category include Emerald [19], DSM [20] that operates at the field level granularity to Network Objects [5], Obliq [6], Rover [17], Cuckoo [21], allow for multiple writers of the same memory unit. This and MapReduce [11]. Other work instead focuses on spe- allows us to minimize the number of times the client and cific kinds of computation like Odessa [26] which targets server need to communicate to points where communica- perception applications, or SociableSense [27] designed tion must occur. Using DSM also allows us to be the first for harvesting social measurements. COMET takes an offloading engine to fully support multi-threaded compu- alternative approach by building on top of an existing tation and allow for threads to move between endpoints runtime and requiring no binary modifications. at any interpreted point in their execution rather than Other related offloading systems include OLIE [15], offloading only entire methods. Despite these features, which applied offloading to Java to overcome resource COMET is designed to allow computation to resume on constraints automatically. Messer et al. [23] proposed a the client if the server is lost at any point during the system to automatically partition application components system’s execution. Finally, we propose a very simple using MINCUT algorithms. Hera-JVM [22] used DSM scheduling algorithm that can give some loose guarantees to manage memory consistency while doing RPC-style on worst case performance. Figure 1 gives an overview of offloading between cores on a Cell processor. Helios [25] the design of the system. was an operating system built to utilize heterogeneous We implemented COMET within the Dalvik Virtual programmable devices available on a device. Machine of the Android Gingerbread operating system Closer to our use case, MAUI [9] enabled automated and conducted tests using a Samsung Captivate phone and offloading and demonstrated that offloading could be an an eight core server for offloading. Our tests were com- effective tool for performance and energy gains. In their prised of a suite of nine real-world applications available system, the developer was not required to write the logic on Google Play including trip planners, image editors, to ship computation to a remote server; instead he/she games, and math/science tools. From our tests, we found would decide which functions could be offloaded and the that COMET achieved a geometric mean speedup of 2.88X offloading engine would do the work of deciding what and average energy savings of 1.51X when offloading should be offloaded and collecting all necessary state. using a WiFi connection. 3G did not perform as well with Requiring annotation of what could be offloaded limited a geometric speedup of 1.27X and a modest increase in the reach of the system leaving room for CloneCloud [8] energy usage. With lower latency and higher bandwidth, to extend this design, filling this usability gap by using we expect better offloading performance for 4G LTE net- static analysis to automatically decide what could be of- works compared to 3G. floaded. Other works, like ECOS [13], attempt to address 2 the problem of ensuring data privacy to make offloading deal with high latency between endpoints that often exists applicable for enterprises. in wireless networks. JESSICA2 implemented DSM in Java for clusters of While it appears our design could be extended to more workstations. JESSICA2 implemented DSM at the object than two endpoints, we discuss how operations work level using an object homing technique. This approach is specifically for the two endpoint case that our prototype not suitable for our use case because each synchronization supports. Our intended endpoints are one client (a phone) operation triggers a network operation. Moreover it is not and one server (a high-performance machine). However, robust to network failures. the design rarely needs to identify which endpoint is Munin [7] was one of a couple early DSM prototypes which and our work could be used in principle to combine built during the 90s that targeted scientific computation two similarly powered devices. on workstations. Shasta [29] aimed to broaden the ap- plicability of software DSM by making a system that 3.1 Security worked with existing applications. These were followed Under our design a malicious server can take arbitrary up by systems like cJVM [4] and JESSICA2 [30] which action on behalf of the client process being offloaded.