Apollo: Learning Query Correlations for Predictive Caching in Geo-Distributed Systems
Apollo: Learning Query Correlations for Predictive Caching in Geo-Distributed Systems

Brad Glasbergen, Michael Abebe, Khuzaima Daudjee, Scott Foggo, Anil Pacaci
University of Waterloo
[email protected]

ABSTRACT

The performance of modern geo-distributed database applications is increasingly dependent on remote access latencies. Systems that cache query results to bring data closer to clients are gaining popularity, but they do not dynamically learn and exploit access patterns in client workloads. We present a novel prediction framework that identifies and makes use of workload characteristics obtained from data access patterns to exploit query relationships within an application's database workload. We have designed and implemented this framework as Apollo, a system that learns query patterns and adaptively uses them to predict future queries and cache their results. Through extensive experimentation with two different benchmarks, we show that Apollo provides significant performance gains over popular caching solutions through reduced query response time. Our experiments demonstrate Apollo's robustness to workload changes and its scalability as a predictive cache for geo-distributed database applications.

Figure 1: Google's datacenter and edge node locations [21]. (a) Datacenter Locations; (b) Edge Node Locations.

1 INTRODUCTION

Modern distributed database systems and applications frequently have to handle large query processing latencies resulting from the geo-distribution of data [11, 13, 41].

Figure 2:
1. SELECT C_ID FROM CUSTOMER WHERE C_UNAME = @C_UN and C_PASSWD = @C_PAS
2. SELECT MAX(O_ID) FROM ORDERS WHERE O_C_ID = @C_ID
3. SELECT ... FROM ORDER_LINE, ITEM WHERE OL_I_ID = I_ID and OL_O_ID = @O_ID
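To make the dependency chain among the three Figure 2 queries concrete, the following sketch (ours, not the paper's implementation; `execute` and `cache` are hypothetical helpers, and parameter binding is simplified) runs the queries in sequence, feeding each result set into the next query and caching the predicted results:

```python
# Illustrative sketch of the Figure 2 dependency chain: each query's
# result set supplies the next query's parameter, so once the first
# query has run, the remaining ones can be executed predictively and
# their results cached before the client asks for them.

def predictively_execute(execute, cache, c_uname, c_passwd):
    """`execute` and `cache` are hypothetical helpers supplied by the caller."""
    # Query 1: the client's credentials yield the customer ID (C_ID).
    c_id = execute(
        "SELECT C_ID FROM CUSTOMER WHERE C_UNAME = @C_UN and C_PASSWD = @C_PAS",
        {"@C_UN": c_uname, "@C_PAS": c_passwd})
    # Query 2 depends only on query 1's result set, so it can run ahead of time.
    o_id = execute(
        "SELECT MAX(O_ID) FROM ORDERS WHERE O_C_ID = @C_ID", {"@C_ID": c_id})
    cache("q2", c_id, o_id)
    # Query 3 likewise depends only on query 2's result set.
    lines = execute(
        "SELECT ... FROM ORDER_LINE, ITEM WHERE OL_I_ID = I_ID and OL_O_ID = @O_ID",
        {"@O_ID": o_id})
    cache("q3", o_id, lines)
    return lines
```

If the client then submits queries 2 and 3 itself, both are satisfied from the cache rather than by round-trips to the remote database.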
Industry reports indicate that even small increases in client latency can result in significant drops in both web traffic [20] and sales [3, 30]. A common solution to this latency problem is to place data closer to clients [38, 39] using caches, thereby avoiding costly remote round-trips to datacenters [27]. Static data, such as images and video content, is often cached on servers geographically close to clients. These caching servers, called edge nodes, are a crucial component in industry architectures. To illustrate this, consider Google's datacenter and edge node locations in Figure 1. Google has comparatively few datacenter locations relative to edge nodes, and the latency between the edge nodes and datacenters can be quite large. Efficiently caching data on these edge nodes substantially reduces request latency for clients.

Existing caching solutions for edge nodes and content delivery networks (CDNs) focus largely on static data, necessitating costly round trips to remote datacenters for requests relying on dynamic data [21]. Since a majority of webpages today are generated dynamically [5], a large number of requests are not satisfied by cached data, thereby incurring significant latency penalties. We address this concern in Apollo, a system that exploits client access patterns to intelligently prefetch and cache dynamic data on edge nodes.

Database client workloads often exhibit query patterns, corresponding to application usage patterns. In many workloads [1, 10, 42], queries are highly correlated. That is, the execution of one query determines which query executes next and with what parameters. These dependencies provide opportunities for optimization through predictively caching queries. In this paper, we focus on discovering relationships among queries in a workload. We exploit the discovered relationships to predictively execute future dependent queries. Our focus is to reduce the response time of consequent queries by predicting and executing them, caching query results ahead of time. In doing so, clients can avoid contacting a database located at a distant datacenter, satisfying queries instead from the cache on a closer edge node.

Figure 2: A set of motivating queries in TPC-W's Order Display web interaction. Boxes of the same colour indicate shared values across queries.

As examples of query patterns, we consider a set of queries from the TPC-W benchmark [42]. In this benchmark's Order Display web interaction, shown in Figure 2, we observe that the second query is dependent upon the result set of the first query. Therefore, given the result set of the first query, we can determine the input set of the second query, predictively execute it, and cache its results. After the second query has executed, we can use its result set as input to the third query, again presenting an opportunity for predictive caching. Similar scenarios abound in the TPC-W and TPC-C benchmarks, such as in the TPC-W Best Seller web interaction and in the TPC-C Stock Level transaction. Examples that benefit from such optimization, including real-world applications, have been previously described [10].

In this paper, we propose a novel prediction framework that uses a query-pattern-aware technique to improve performance in geo-distributed database systems through caching. We implement this framework in Apollo, which uses a query transition graph to learn correlations between queries and to predict future queries. In doing so, Apollo determines query results that should be cached ahead of time so that future queries can be satisfied from a cache deployed close to clients. Apollo prioritizes common and expensive queries for caching, eliminating or reducing costly round-trips to remote data without requiring modifications to the underlying database architecture. Apollo's ability to learn allows it to rapidly adapt to workloads in an online fashion. Apollo is designed to enhance an existing caching layer, providing predictive caching capabilities for improved performance.

The contributions of this paper are threefold:

(1) We propose a novel predictive framework to identify relationships among queries and predict consequent ones. Our framework uses online learning to adapt to changing workloads and reduce query response times (Section 2).

(2) We design and implement our framework in a system called Apollo, which predictively executes and caches query results.

© 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018, ISBN 978-3-89318-078-3 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2018.23.

… query relationships, dependencies and workload characteristics for use in our predictive framework. The predictive framework stores query result sets in a shared local cache, querying the remote database if a client submits a query for which the cache does not have the results.

Figure 3 gives a high-level overview of how incoming queries are executed, synthesized into the query transition graph, and used for future query predictions. Incoming queries are routed to the query processor, which retrieves query results from a shared local query result cache, falling back to a remote database on a cache miss. Query results are immediately returned to the client and, together with their source queries, are mapped to more generalized query template representations (Section 2.1). These query templates are placed into per-client queues of queries called query streams, which are continuously scanned for relationships among executed queries. Query relationships are synthesized into the query transition graph and then used to detect query correlations, discovering dependencies among executed queries and storing them in a dependency graph. This dependency graph is used by the prediction engine to predict consequent queries given client queries that have executed.

Figure 3: Query flow through components of the predictive framework.

Although we focus on geographically distributed edge nodes with remote datacenters, Apollo can also be deployed locally as a middleware cache. Our experiments in Section 4 show that both deployment environments benefit significantly from Apollo's predictive caching.

Next, we discuss the abstractions and algorithms of our predictive framework, describing how queries flowing through the system are merged into the underlying models and used to predict future queries.

2.1 Query Templates

Using a transition graph to reason about query relationships requires a mapping from database workloads (queries and query relationships) to transition structures (query templates and template transitions). We propose a formalization of this mapping through precise definitions, and then show how our model can be used to predict future queries.

Queries within a workload are often correlated directly through parameter sharing. Motivated by the Stock Level transaction in the TPC-C benchmark, consider an example of parameter sharing in which an application executes query Q1 to look up a product ID followed by query Q2 to check the stock level of a given product ID. A common usage pattern is to execute Q1, and then use the returned product ID as an input to Q2 to check that product's stock level. In this case, Q2 is directly related to Q1 via a dependency relationship. Specifically, Q2 relies on the output of Q1 to execute.

We generalize our model by tracking relationships among query templates rather than among parameterized queries. Two queries
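The template abstraction described above can be sketched as follows. This is our own illustrative approximation, not Apollo's actual algorithm, and the table and column names are simplified stand-ins for TPC-C's Stock Level queries: each concrete query is mapped to a template by replacing its literal constants with a placeholder, and template-to-template transitions observed in a per-client query stream are counted to form a small transition graph.

```python
import re
from collections import defaultdict

# Replace string and numeric literals with a placeholder so that two
# executions of the same query with different parameters collapse into
# a single template.
LITERALS = re.compile(r"('[^']*'|\b\d+\b)")

def to_template(query):
    # e.g. "... WHERE S_I_ID = 1001" and "... WHERE S_I_ID = 2002"
    # both map to "... WHERE S_I_ID = ?".
    return LITERALS.sub("?", query)

def build_transition_graph(stream):
    """Count template-to-template transitions in one client's query stream."""
    graph = defaultdict(lambda: defaultdict(int))
    prev = None
    for query in stream:
        template = to_template(query)
        if prev is not None:
            graph[prev][template] += 1
        prev = template
    return graph
```

Under this sketch, two Stock Level executions with different item IDs contribute to the same Q1-to-Q2 edge, which is what lets relationships be tracked once per template rather than once per parameter binding.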