Work Stealing for Interactive Services to Meet Target Latency ∗
Jing Li∗   Kunal Agrawal∗   Sameh Elnikety†   Yuxiong He†   I-Ting Angelina Lee∗   Chenyang Lu∗   Kathryn S. McKinley†
∗Washington University in St. Louis   †Microsoft Research
{li.jing, kunal, angelee, lu}@wustl.edu, {samehe, yuxhe, mckinley}@microsoft.com

∗This work was initiated and partly done during Jing Li's internship at Microsoft Research in summer 2014.

PPoPP '16, March 12-16, 2016, Barcelona, Spain. Copyright © 2016 ACM 978-1-4503-4092-2/16/03 $15.00. DOI: http://dx.doi.org/10.1145/2851141.2851151

Abstract

Interactive web services increasingly drive critical business workloads such as search, advertising, games, shopping, and finance. Whereas optimizing parallel programs and distributed server systems has historically focused on average latency and throughput, the primary metric for interactive applications is instead consistent responsiveness, i.e., minimizing the number of requests that miss a target latency. This paper is the first to show how to generalize work stealing, which is traditionally used to minimize the makespan of a single parallel job, to optimize for a target latency in interactive services with multiple parallel requests.

We design a new adaptive work-stealing policy, called tail-control, that reduces the number of requests that miss a target latency. It uses instantaneous request progress, system load, and a target latency to choose when to parallelize requests with stealing, when to admit new requests, and when to limit the parallelism of large requests. We implement this approach in the Intel Threading Building Blocks (TBB) library and evaluate it on real-world and synthetic workloads. The tail-control policy substantially reduces the number of requests exceeding the desired target latency and delivers up to a 58% relative improvement over various baseline policies. This generalization of work stealing for multiple requests effectively optimizes the number of requests that complete within a target latency, a key metric for interactive services.
1. Introduction

Delivering consistent interactive latencies is the key performance metric for interactive cloud services, such as web search, stock trading, ads, and online gaming. The services with the most fluid and seamless responsiveness gain a substantial competitive advantage in attracting and captivating users over less responsive systems (Dean and Barroso 2013; DeCandia et al. 2007; He et al. 2012; Yi et al. 2008). Many such services are deployed on cloud systems that span hundreds or thousands of servers, where achieving interactive latencies is even more challenging (Jeon et al. 2013; Apache Lucene 2014; Mars et al. 2011; Ren et al. 2013; Zircon Computing 2010; Yeh et al. 2006; Raman et al. 2011). The seemingly infrequent occurrence of high-latency responses at one server is significantly amplified because services aggregate responses from a large number of servers. Therefore, these servers are designed to minimize tail latency, i.e., the latency of requests in the high percentiles, such as the 99th-percentile latency (Dean and Barroso 2013; DeCandia et al. 2007; Haque et al. 2015; He et al. 2012; Yi et al. 2008). This paper studies a closely related problem, optimizing for a target latency: given a target latency, the system minimizes the number of requests whose latency exceeds this target.

We explore interactive services on a multicore server with multiple parallelizable requests. Interactive service workloads tend to be computationally intensive with highly variable work demand (Dean and Barroso 2013; He et al. 2012; Jeon et al. 2013; Kanev et al. 2015; Haque et al. 2015; Zircon Computing 2010). A web search engine, for example, represents, partitions, and replicates billions of documents on thousands of servers to meet responsiveness requirements. Request work is unknown when a request arrives at a server, and prediction is unappealing because it is never perfect or free (Lorch and Smith 2001; Jalaparti et al. 2013; Kim et al. 2015b). The amount of work per request varies widely: the work of the 99th-percentile requests is often larger than the median by orders of magnitude, ranging from 10 to over 100 times larger. Moreover, the workload is highly parallelizable; it has inter-request parallelism (multiple distinct requests execute simultaneously) and fine-grain intra-request parallelism (Jeon et al. 2013; Haque et al. 2015; Zircon Computing 2010). For instance, search could process every document independently. More generally, the parallelism of these requests may change as a request executes.

Such workloads are excellent candidates for dynamic multithreading, where the application expresses its logical parallelism with spawn/sync, fork/join, or parallel loops, and the runtime schedules the parallel computation, mapping it to available processing resources dynamically. This model has the following advantages. 1) It separates the scheduling logic from the application logic, so the runtime adapts the application to different hardware and numbers of processors without any changes to the application. 2) The model is general and allows developers to structure the parallelism in their programs in a natural and efficient manner. 3) Many modern concurrency platforms support dynamic multithreading, including Cilk dialects (Intel 2013; Frigo et al. 1998; Danaher et al. 2006; Leiserson 2010; Lee et al. 2013), Habanero (Barik et al. 2009; Cavé et al. 2011), Java Fork/Join (Lea 2000; Kumar et al. 2012), OpenMP (OpenMP 2013), Task Parallel Library (Leijen et al. 2009), Threading Building Blocks (TBB) (Reinders 2010), and X10 (Charles et al. 2005; Kumar et al. 2012).
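To make this model concrete, the sketch below (an illustration, not code from the paper) expresses the per-document work of a single search-like request as a parallel loop in TBB, one of the platforms listed above; the runtime decides at execution time how many workers actually participate. The Request type, the score() function, and process_request() are hypothetical names introduced only for this example.

```cpp
// Illustrative sketch: intra-request parallelism expressed as a TBB parallel loop.
// Request, score(), and process_request() are hypothetical names, not from the paper.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

struct Request {
    std::vector<float> scores;  // one score per candidate document
};

// Placeholder for the real per-document work of the service.
static float score(std::size_t doc_id, const Request&) {
    return static_cast<float>(doc_id % 7);
}

void process_request(Request& req, std::size_t num_docs) {
    req.scores.resize(num_docs);
    // The loop only declares what may run in parallel; the scheduler maps
    // iterations onto however many workers it assigns to this request.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, num_docs),
                      [&](const tbb::blocked_range<std::size_t>& range) {
                          for (std::size_t d = range.begin(); d != range.end(); ++d) {
                              req.scores[d] = score(d, req);
                          }
                      });
}
```

Because the loop declares only logical parallelism, the runtime can run the same request on one worker or on many without any change to the application code, which is the flexibility a policy such as tail-control relies on when deciding how much parallelism each request receives.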
Most of these platforms use a work-stealing scheduler, which is efficient both in theory and in practice (Blumofe et al. 1995; Blumofe and Leiserson 1999; Arora et al. 2001; Kumar et al. 2012). Work stealing is a distributed scheduling strategy with low scheduling overheads. The goal of traditional work stealing is to deliver the smallest possible makespan for a single job. In contrast, interactive services have multiple requests (jobs), and the goal is to meet a target latency. To bring the benefits of work stealing's dynamic multithreading to interactive services, we modify work stealing so that it can optimize for a target latency.

While work stealing has not been explored for interactive services, researchers have considered parallelizing requests for the related problem of minimizing tail latency in interactive services (Jeon et al. 2013; Kim et al. 2015a; Haque et al. 2015). This prior work applies higher degrees of parallelism to large requests that perform significantly more work than the other, small requests, assuming that small requests will meet the target latency even without parallelism. In contrast, we show that when the system is heavily loaded, a large parallel request can monopolize many cores, imposing queuing delays on the execution of many small requests and thus increasing their latencies much more than necessary. In this work, we investigate a different strategy: under heavy instantaneous load, the system explicitly sacrifices a small number of large requests. We execute them serially so they occupy fewer resources, enabling smaller waiting requests to meet the target latency.

Designing such a strategy faces several challenges. (1) When a request arrives in the system, the request's work is unknown. (2) Which requests to sacrifice depends on each request's progress, the target latency, and the current load. Serializing large requests can limit their impact on other waiting requests.

We implement tail-control in the TBB work-stealing scheduler (Reinders 2010) and evaluate the system with several interactive server workloads: a Bing search workload, a finance server workload (He et al. 2012), and synthetic workloads with long-tail distributions. We compare tail-control with three baseline work-stealing schedulers: steal-first, admit-first, and default TBB. The empirical results show that tail-control significantly outperforms them, achieving up to a 58% reduction in the number of requests that exceed the target latency.

2. Background and Terminology

This section defines terminology used throughout the paper, characterizes interactive services and their available intra-request parallelism, and lastly reviews work-stealing basics.

Terminology. The response time (latency) of an interactive service request is the time elapsed between the time when the request arrives in the system and the time when the request completes. Once a request arrives, it is active until its completion. A request is admitted once the system starts working on it. An active request is thus either executing or waiting. Since the service may perform some or all of the request in parallel to reduce its execution time, we define request work as the total amount of computation time that all workers spend on it. A request is large or small according to its relative total work. A request misses a specified target latency if its latency exceeds that target.
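As a concrete reading of these definitions, the sketch below (again an illustration rather than code from the paper) records a request's arrival, admission, completion, and accumulated work, and derives its latency and whether it misses a given target; all field and function names are assumptions made for the example.

```cpp
// Illustrative per-request bookkeeping implied by the terminology above.
// The structure and names are assumptions; the paper does not prescribe them.
#include <chrono>
#include <optional>

using Clock    = std::chrono::steady_clock;
using Duration = Clock::duration;

struct RequestRecord {
    Clock::time_point arrival;                  // request becomes active on arrival
    std::optional<Clock::time_point> admitted;  // set when the system starts working on it
    std::optional<Clock::time_point> completed; // set when the request finishes
    Duration work{0};                           // total computation time across all workers

    bool is_active()  const { return !completed.has_value(); }
    bool is_waiting() const { return is_active() && !admitted.has_value(); }

    // Response time (latency): completion time minus arrival time.
    // Assumes the request has completed.
    Duration latency() const { return *completed - arrival; }

    // A request misses the target if its latency exceeds the target latency.
    bool misses(Duration target) const { return latency() > target; }
};
```

Under this reading, the objective stated in the introduction is to minimize, over all requests, the number for which misses(target) is true.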