Database-Agnostic Workload Management

Shrainik Jain (University of Washington, shrainik@cs.washington.edu), Jiaqi Yan (Snowflake Computing, jiaqi@snowflake.com), Thierry Cruanes (Snowflake Computing, thierry.cruanes@snowflake.com), Bill Howe (University of Washington, billhowe@cs.washington.edu)

ABSTRACT

We present a system to support generalized SQL workload analysis and management for multi-tenant and multi-database platforms. Workload analysis applications are becoming more sophisticated to support database administration, model user behavior, audit security, and route queries, but the methods rely on specialized feature engineering, and therefore must be carefully implemented and reimplemented for each SQL dialect, database system, and application. Meanwhile, the size and complexity of workloads are increasing as systems centralize in the cloud. We model workload analysis and management tasks as variations on query labeling, and propose a system design that can support general query labeling routines across multiple applications and database backends. The design relies on the use of learned vector embeddings for SQL queries as a replacement for application-specific syntactic features, reducing custom code and allowing the use of off-the-shelf machine learning algorithms for labeling. The key hypothesis, for which we provide evidence in this paper, is that these learned features can outperform conventional feature engineering on representative machine learning tasks. We present the design of a database-agnostic workload management and analytics service, describe potential applications, and show that separating workload representation from labeling tasks affords new capabilities and can outperform existing solutions for representative tasks, including workload sampling for index recommendation, user labeling for security audits, and error prediction.
1. INTRODUCTION

Extracting patterns from a SQL query workload has enabled a number of important features in database systems, including workload compression [3], index recommendation [2], modeling user and application behavior [31, 9, 35], query recommendation [1], predicting performance [29, 5], and designing benchmarks [35]. These techniques can be used as part of a more comprehensive approach to automate database administration [26].

However, the diversity of applications has led to a diversity of solutions, each relying on specialized feature engineering. For example, workload summarization for index recommendation uses the structure of join and group by operators as features [3], query recommendation may pre-process a query into fragments before making recommendations [13], and security audits may require user-defined functions to enforce particular policies [32]. In fact, the features and the algorithms to extract them tend to be the significant contributions in the papers in this space.

But the state of the art in a variety of applications is to learn features automatically. For instance, Natural Language Processing applications previously relied on parsing and labeling sentences as a pre-processing step, but now use learned vector representations almost exclusively [6, 28]. This approach not only obviates the need for manual feature engineering and pre-processing, but also has the potential to significantly outperform more specialized methods.

We see three trends motivating an analogous role for generalized workload representations. First, workload heterogeneity is increasing, making it difficult to maintain SQL parsers and feature extraction routines: the number of SQL-like languages is increasing, with inconsistent support and syntax for even relatively common features such as outer joins. Second, workload scale is increasing. Cloud-hosted, multi-tenant database services including Redshift [8], Snowflake [4], and BigQuery [23] receive millions of queries daily from thousands of customers using hundreds of schemas; relying on brittle parsers (or worse, manual inspection) to identify query patterns that influence administration decisions is no longer tenable. Third, new use cases for centralized workload management are emerging. For example, SQL debugging [7], database forensics [27], and data use management [32] motivate a more automated analysis of user behavior patterns, and cloud-hosted multi-tenant systems motivate a more automated approach to query routing and resource allocation.

In this work, we propose Querc, a database-agnostic system for mining and managing large-scale and heterogeneous workloads. We model workload management and analysis as a set of query labeling tasks. For instance, workload sampling can be reduced to labeling each query as present or absent in the sample, error prediction involves labeling each query with an error type, query routing involves labeling each query with a cluster resource to which the query should be routed, and so on. Because our framework depends only on the query text (along with typical metadata such as the arrival timestamp and the userid issuing the query), it can be used with any DBMS and any SQL dialect. In fact, as we will show, features learned with a workload against a particular schema and SQL dialect can be effective even when used with a different schema and SQL dialect. The weakness of this approach is that it requires enormous amounts of data to be effective. But as database products migrate to the cloud, service providers have access to workloads from a large number of customers, potentially even across different database products. Since the input is just the query text, these diverse workloads can be processed as one very large dataset, while the resulting vectors can still be used to train models to support specific applications, as we will show on three representative tasks: workload summarization for index selection, user prediction for security audits and routing, and query error prediction.

2. SYSTEM ARCHITECTURE

Figure 1: System architecture. Queries arrive for three different applications X, Y, and Z and are processed by one or more (embedder, labeler) pairs before being sent on to the database, centralized for offline labeling tasks, or both. In this example, DB(X) and DB(Y) are tenants in the same service.

Figure 1 illustrates the architecture of Querc. There are three applications, X, Y, and Z. Each application has its own database, DB(X), DB(Y), and DB(Z), though these may be logical instances in the same physical multi-tenant service. In this example, DB(X) and DB(Y) are tenants in the same service.
Each application is also associated with a separate stream of queries (at left), where query(X,t) indicates a batch of queries arriving for application X at time instant t. Each application is associated with one Qworker, but each Qworker operates multiple classifiers. Qworkers may not be entirely stateless, as some labeling tasks process a small window of queries. However, the state is assumed to be small, such that the Qworkers do not need their own local storage and can be load balanced and parallelized in typical ways.

Each classifier is a pre-trained (embedder, labeler) pair. The same trained embedder may be used across multiple applications. This split design is critical, because we want to learn features using a very large, combined workload, but an individual classifier may perform better when trained on an application-specific workload. In this example, application X and application Y share the same embedder, EmbedderA, trained on the combined X and Y workloads, written EmbedderA(X,Y). This log sharing between customers may not always be permitted for security reasons, and in this example, application Z uses only its own data. But there is some incentive for customers to pool their data, as the additional signal can potentially improve accuracy, and some cloud providers support features to allow data sharing between customers.

The Labeler passes the query on to the database, but also transmits the query back to a central training module ("Training, Evaluation, and Offline Labeling" in Figure 1). The training module manages training sets, including the (parallel) execution of training and evaluation routines, then deploys trained models back to Qworkers. There is significant ongoing research in the database, systems, and ML communities on runtime architectures for training and deploying models (e.g., [21]); we do not discuss them further since our requirements are relatively modest.

Since Querc is specialized for query workload analytics rather than general machine learning, one data model can be shared among most applications. The only messages passed between components are labeled queries. A labeled query is a tuple (Q, c1, c2, c3, ...), where each ci is a label. This simple model captures situations where a query arrives already equipped with a timestamp, a userid, an IP address, etc., but also captures more verbose query logs that are returned from the database.

The training module also records the queries with their predicted labels for retraining, evaluation, and to support offline analysis tasks. Offline tasks are those that do not require or do not allow processing each query separately, and can be implemented as typical batch jobs. For example, query clustering is important for workload summarization [16], but does not require real-time labeling of individual queries.

Training data is collected periodically from the databases in the form of query logs. These logs are (batched) sequences of labeled queries, but with additional labels to be used for training, such as runtime, memory usage, error codes, security flags, and resource IDs. We do not specify the mechanism by which these logs are transmitted from the database to Querc, since most systems have robust means of exporting logs in appropriate forms.

In some applications, Querc may not be in the critical path for query execution, to avoid any performance overhead or reduce dependencies. In these cases, queries are forked to Querc. No change to the architecture is required; queries come in, and labeled queries are collected in the training module. The query is simply not forwarded to the database.
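To make this data model concrete, the following is a minimal Python sketch of a labeled query and a Qworker's (embedder, labeler) classifiers; the class and method names here are illustrative assumptions, not part of Querc's actual codebase.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class LabeledQuery:
    """A query plus an open-ended set of labels (Q, c1, c2, ...): some
    labels arrive with the query (timestamp, userid, IP address), and
    others are attached by classifiers or returned in database logs."""
    text: str
    labels: Dict[str, Any] = field(default_factory=dict)

class Classifier:
    """A pre-trained (embedder, labeler) pair: the embedder maps query
    text to a vector, and the labeler maps that vector to a label."""
    def __init__(self, name: str, embedder, labeler):
        self.name, self.embedder, self.labeler = name, embedder, labeler

    def label(self, q: LabeledQuery) -> LabeledQuery:
        q.labels[self.name] = self.labeler.predict(self.embedder.encode(q.text))
        return q

class Qworker:
    """Runs every configured classifier over a batch of queries, then
    forwards the labeled queries to the database, the central training
    module, or both."""
    def __init__(self, classifiers: List[Classifier]):
        self.classifiers = classifiers

    def process(self, batch: List[LabeledQuery]) -> List[LabeledQuery]:
        for q in batch:
            for clf in self.classifiers:
                clf.label(q)
        return batch
```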
This architecture is not designed for continuous learning, as training is handled separately from real-time query labeling. Not all algorithms can support fully continuous learning, and an important design goal is to support simple machine learning algorithms as labelers. Model training is therefore assumed to occur infrequently as a batch job.

3. LEARNING VECTOR REPRESENTATIONS

There are multiple choices for embedders; we describe the two initial models we evaluate in this paper. There are multiple prior approaches in the NLP literature that compare the efficacy of these models and their relative performance [17, 22, 30]; for this paper, we consider context-based models (i.e., Doc2Vec) and LSTM autoencoders.

Context prediction models: Mikolov et al. [25, 24, 17] proposed learning a vector representation for words by predicting the next word in a context, and then deriving a vector representation for larger semantic units (sentences, paragraphs, documents) by adding a vector representing the paragraph to each context as an additional "word." The learned vector for this virtual context word is used as a representation for the entire paragraph. This "Doc2Vec" method has been shown to capture semantic relationships that work well for, say, sentiment classification and clustering tasks [14, 18]. This approach can be applied directly for learning representations of SQL queries: we can use fixed-size context windows to learn a representation for each token in the query, and include an identifier to learn a representation of the entire query.
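As a concrete illustration of this approach, the following sketch trains Doc2Vec on tokenized SQL text using the gensim library; the tokenizer and hyperparameters are illustrative assumptions rather than the configuration used in our experiments.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tokenize(sql: str):
    # Naive tokenizer for illustration; a real deployment would also
    # normalize literals and identifiers before training.
    return sql.lower().replace("(", " ( ").replace(")", " ) ").split()

queries = [
    "SELECT name FROM users WHERE id = 1",
    "SELECT count(*) FROM orders GROUP BY region",
]

# Each query is one "document"; its tag plays the role of the virtual
# context word whose learned vector represents the entire query.
corpus = [TaggedDocument(words=tokenize(q), tags=[i])
          for i, q in enumerate(queries)]
model = Doc2Vec(corpus, vector_size=128, window=5, min_count=1, epochs=40)

# Infer an embedding for a previously unseen query.
vec = model.infer_vector(tokenize("SELECT name FROM users WHERE id = 2"))
print(vec.shape)  # (128,)
```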

LSTM AutoEncoders: The paragraph vector approach above is viable, but it requires a hyper-parameter for the context size, and there is no obvious way to determine a context size for queries, for two reasons: first, there may be semantic relationships between distant tokens in the query; second, the lengths of queries vary widely in ad hoc workloads [12, 10]. To avoid setting a context size, we can use Long Short-Term Memory (LSTM) networks [36], which are modified Recurrent Neural Networks (RNNs) that can automatically learn how much context to remember and how much of it to forget, thereby removing the dependency on a fixed context size. LSTMs have been used successfully for sentence classification, semantic similarity between sentences, and sentiment analysis [30]. We use a standard LSTM encoder-decoder network [37, 20] with the architecture illustrated in Figure 2.

Figure 2: The LSTM autoencoder network architecture learns to generate the input tokens in the decoding phase. Once trained, the encoder can be used to output a vector representation for the text of a query.

An LSTM autoencoder is trained by sequentially feeding words from the query to the network one word at a time, and then attempting to reproduce the input. The LSTM network not only learns the encoding for the samples, but also the relevant context window associated with the samples. The final output of the encoder network gives us an encoding for the query. Once this network has been trained, an embedded representation for a query can be computed by passing the query to the encoder network, completing a forward pass, and using the hidden state of the final encoder LSTM cell as the learned vector representation, as in the sketch below.
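The following PyTorch sketch shows the shape of such an encoder-decoder; the dimensions, vocabulary handling, and training details are illustrative assumptions rather than the exact network used in our experiments.

```python
import torch
import torch.nn as nn

class QueryAutoencoder(nn.Module):
    """LSTM encoder-decoder trained to reproduce the input token
    sequence; the final encoder hidden state is the query embedding."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):              # tokens: (batch, seq)
        x = self.embed(tokens)
        _, state = self.encoder(x)          # compress to final hidden state
        dec, _ = self.decoder(x, state)     # reconstruct (teacher forcing)
        return self.out(dec)                # (batch, seq, vocab) logits

    def encode(self, tokens):
        _, (h, _) = self.encoder(self.embed(tokens))
        return h[-1]                        # (batch, hidden_dim) embedding

vocab_size = 10_000
model = QueryAutoencoder(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (32, 24))  # stand-in token ids
opt.zero_grad()
loss = loss_fn(model(batch).reshape(-1, vocab_size), batch.reshape(-1))
loss.backward()
opt.step()                                      # one reconstruction step
```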
4. APPLICATIONS

The applications supported by this system reduce to query labeling, and the general workflow consists of two machine learning models: a representation learner (an embedder) and a classifier (a labeler). We split the task into two parts to allow the same representation to be used for multiple applications.

Workload summarization for index recommendation: The goal [3, 16] is to find a representative sample of the workload as input to further database administration, tuning, and testing tasks [3, 33]. In particular, workload summarization aids index recommendation, since the recommendation process is typically quadratic in the size of the workload [3]. While index recommendation systems are well-studied and ship with most production databases [3, 2], the quality of the representative sample determines the overall quality of the final recommendations. In Section 5, we show that a simple sampling procedure using learned features delivers a significant runtime improvement over the built-in sampling procedure in the SQL Server database system.

Enforcing query routing policies: Query routing in a distributed database involves identifying the cluster resources on which to execute the incoming query. The policies that govern these routing decisions may involve customer SLAs, security considerations (e.g., certain applications must use a physically distinct cluster from other applications), or auditing requirements (e.g., queries from certain accounts, or those accessing certain tables, must be logged for auditing purposes). Even in modern cloud-hosted database products such as Snowflake [4] and BigQuery [23], these policies tend to be manually encoded, and management of these policies as they evolve, while maintaining multiple heterogeneous clusters used by thousands of customers, is increasingly perceived as untenable. Under the hypothesis that queries that follow a particular policy tend to have similar features, Querc can help identify policy misconfiguration by detecting when a predicted routing decision differs from the assigned routing decision.

Error prediction: Particular syntax patterns in the workload may be associated with resource errors or bugs in the database system. In a multi-tenant, multi-database, high-volume scenario, identifying the syntactic patterns that tend to trigger errors, either manually or with scripts, becomes untenable: there may be hundreds of error codes, each with hundreds of subtle patterns that tend to trigger them, across hundreds of tenant schemas. Using learned features, a classifier to predict errors from syntax is trivial to engineer. This prediction allows the query to be routed to a different runtime environment that is instrumented, equipped with more memory per node, or running a more stable version of the database engine. In Figure 3, we show a clustering of error-generating SQL queries from a large-scale cloud-hosted multi-tenant database system [4]. Color represents the type of error; there are over twenty different types of errors, ranging from out-of-memory errors to hardware failures to query execution bugs (the figure highlights three specific error types).
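A view like Figure 3 can be produced directly from the learned vectors; the following sketch assumes scikit-learn's t-SNE as the 2-D projection (the projection method is an illustrative choice here, not necessarily the one used to render the figure, and the inputs shown are placeholders).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Assumed inputs: query_vecs holds one embedding per error-generating
# query, error_codes the corresponding error label (placeholders here).
query_vecs = np.random.rand(500, 128)
error_codes = np.random.randint(0, 20, size=500)

xy = TSNE(n_components=2).fit_transform(query_vecs)
plt.scatter(xy[:, 0], xy[:, 1], c=error_codes, cmap="tab20", s=5)
plt.title("Error-generating queries, colored by error type")
plt.show()
```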


Figure 3: A clustering of error-generating SQL queries from a large-scale cloud-hosted multi-tenant database system. Each point in the plot represents a query, and the color represents the type of error (the figure annotates three specific error types: "Unknown Timezone In Date", "Date Parse Error", and "Divide by Zero"). The syntax patterns in the workload are complex, as one would expect, but there are obvious clusters, some of which are strongly associated with specific error types. The text in the annotated boxes gives the error type the query clusters correspond to; for example, the rectangle at the top right corresponds to queries that raised an error while parsing an incorrectly formatted DateTime field, and the cluster annotated in green corresponds to queries that generated divide-by-zero errors.

The syntax patterns in the workload are complex (as one would expect), but there are obvious clusters, some of which are strongly associated with specific error types. For example, the cluster at the upper right corresponds to errors raised when the compiler failed to parse a wrongly formatted DateTime field. Using an interactive visualization based on these clusterings, the analyst can inspect syntactic clusters of queries to investigate problems rather than inspecting individual queries, with two benefits: first, the analyst can prioritize large clusters that indicate a common problem; second, the analyst can quickly identify a number of related examples in order to confirm a diagnosis. When we first showed this visualization to our colleagues at Snowflake [4], they were able to diagnose the problem associated with one of the clusters immediately.

Resource allocation: The structure of the query is not sufficient to accurately predict its runtime or memory footprint, but it can provide a hint that can be used for load balancing, scheduling, and as an input for optimization. If we can coarsely categorize queries as memory-intensive, long-running, etc. with some degree of accuracy, these labels can be used as a simple, database-agnostic way to speculatively allocate resources. Training data is readily available from the query logs themselves. We consider this application in a tech report companion to this paper [11] and leave a detailed analysis for future work.

Query recommendation: The query recommendation problem can be modeled as predicting the next query the user will submit to the database based on the recent history of queries [1]. This prediction is then shown to the user through an appropriate client application to assist in query authoring. Our framework can generate features that can be used to train query recommendation models. We consider this application in a tech report companion to this paper [11].

Security auditing: To the extent that users' individual workloads tend to follow predictable patterns, an anomalous query may be a sign that a user's account has been compromised. By formulating a prediction problem that tries to guess the user who submitted the query from the syntax alone, we can identify anomalous queries for security audits. In our framework, the labeler is a simple classifier V → user, as in the sketch below.
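A minimal sketch of this audit labeler, assuming a user classifier already trained over query vectors (the attribute and function names are illustrative):

```python
def audit_flags(queries, embedder, user_clf):
    """Flag queries whose syntax does not look like the issuing user's
    typical workload, i.e., predicted user != actual user."""
    flagged = []
    for q in queries:  # each q is assumed to carry .text and .userid
        predicted = user_clf.predict([embedder.encode(q.text)])[0]
        if predicted != q.userid:
            flagged.append((q, predicted))
    return flagged     # candidate queries for a security audit
```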

Figure 4: Workload runtime using indices recommended under various time budgets. For most time budgets, the workload summaries improve runtimes, even when the embedders were trained on an unrelated workload (lstmSnowflake and doc2vecSnowflake).

Figure 5: Workload summarization using learned query embeddings (the algorithm described in Section 5.1).

Method             Account Labeling   User Labeling
Doc2Vec                 78.8%             39.0%
LSTMAutoencoder         99.1%             55.4%

Table 1: Query labeling results (10-fold cross-validation accuracy).

5. EXPERIMENTS

We consider three applications: workload summarization for index selection, labeling tasks for security audits and query routing, and query error prediction.

5.1 Workload Summaries for Index Selection

The workload summarization task (with respect to index recommendation) is to find a subset Q_sub of a given query workload Q, such that the set of indices recommended based on Q_sub is similar to the set of indices recommended for the overall workload Q. Previous solutions are primarily variants of the approach of Chaudhuri et al. [3], which uses K-medoids to cluster the queries and selects a witness query from each cluster. However, the authors emphasize that a custom distance function should be developed for specific workloads; our hypothesis is that generic representation learning approaches obviate the need for these custom distance functions.

In the Querc framework, this task is offline and does not require real-time labeling of queries; instead, we perform it as an offline unsupervised learning task. In our approach, we assign each query to a vector (using a suitably trained embedder), then simply use K-means to find K query clusters and pick the query nearest to the centroid of each cluster as the representative subset. To determine K, we use an intentionally simple method (the "elbow method" [15]), which runs the K-means algorithm in a loop with increasing K until the rate of change of the sum of squared distances from the centroids plateaus. Although better methods exist, we aim to highlight the effect of the learned vectors rather than the choice of K. We present this workload summarization algorithm in Figure 5; a sketch in code follows.
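The procedure amounts to a few lines with scikit-learn; in this sketch, the elbow test is a simple relative-improvement threshold, an illustrative stand-in for the heuristic described above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def summarize_workload(query_vecs, queries, max_k=50, tol=0.05):
    """K-means over learned query vectors: K grows until the sum of
    squared distances (inertia) stops improving, then the query nearest
    to each centroid is returned as the workload summary."""
    km, prev = None, None
    for k in range(2, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(query_vecs)
        if prev is not None and (prev - km.inertia_) / prev < tol:
            break                    # improvement has plateaued: elbow
        prev = km.inertia_
    # Pick the witness query closest to each cluster centroid.
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, query_vecs)
    return [queries[i] for i in idx]
```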
Setup: Following the evaluation strategy of Chaudhuri et al. [3], we first run the index selection tool on the entire workload Q, create the recommended indices, and measure the runtime t_orig for the original workload. We then use the workload summarization algorithm to produce a reduced set of queries Q_sub, re-run the index selection tool, create the recommended indices, and again measure the runtime t_sub of the entire original workload. We use SQL Server 2016 and the Database Engine Tuning Advisor, which performs its own summarization on the input according to the documentation, with an m4.large AWS EC2 instance as the server. We use TPC-H with scale factor 1 as the workload, for comparison with previous results and to interpret the recommended indices, but we also show how the method performs when trained on a more complex Snowflake workload. We pass the summarized workload to the tuning advisor, along with a time budget (a parameter supported by the tuning advisor). Each experiment involves clearing caches, generating indices, applying the indices, and running the full workload. We report the time spent running the workload; the time budget specifies the time limit under which the advisor must return a set of recommendations.

Results: Figure 4 shows the results. The x-axis is the time budget, and the y-axis is the runtime for the entire workload after building the recommended indices. For time budgets less than three minutes, the advisor does not produce any index recommendations for any method, and the runtime is constant at 1200 seconds. As we relax the time budget, different sets of indices are recommended, each associated with a separate runtime. The runtime for the full workload (blue line) varies dramatically with the time budget, and surprisingly it gets worse before it gets better. For the summarized workloads, the workload is small enough that the runtimes are constant: once three minutes have elapsed, the advisor has found the "optimal" set of indices, and allowing more time does not change the result.

We evaluate four trained embedders: two methods on two workloads. The two methods are Doc2Vec and the LSTMAutoencoder, and the two workloads are TPC-H itself and a separate workload of 500,000 queries from the Snowflake service. When training the embedder on TPC-H (doc2VecTPCH and lstmTPCH), the advisor finds close-to-optimal indices in about three minutes, as opposed to the six minutes the advisor requires on the full workload.

Surprisingly, under tight time budgets, the index recommendations made by the native system can actually hurt performance relative to having no indices at all: the optimizer chooses a bad plan based on the suboptimal indices.

Figure 6: Comparing runtimes for all queries in the workload under different index recommendations (x-axis: query ID, 0-800; y-axis: execution time in seconds). (a) Runtime for each query under no indices and under indices recommended with a three-minute time budget; for a few specific queries (all instances of TPC-H Query 18), the presence of a recommended index results in significantly worse performance. (b) Runtime for each query under our LSTM-based pre-compression scheme ("Indices from LSTM-compressed workload") and the "optimal" indices recommended by SQL Server with an unlimited time budget; the pre-compressed workload achieves essentially identical performance but with significantly smaller time budgets.

In Figures 6a and 6b, we show the sequence of queries in the workload on the x-axis and the runtime for each query on the y-axis. The indices suggested under a three-minute time budget result in all instances of TPC-H Query 18 (queries 640-680 in Figure 6a) taking much longer than they would take when run without these indices. The key conclusion is that pre-compression can help achieve the best possible recommendations in significantly less time, even though compression is already being applied by the engine itself.

Transfer learning: Figure 4 also illustrates the capacity for transfer learning using Querc: when training the embedder on the Snowflake dataset (a workload completely unrelated to the TPC-H workload in the SQL Server dialect), the summarized workload still outperforms native SQL Server for most time budgets. This transfer learning effect allows us to bootstrap new applications without waiting for a representative workload to accumulate, and to avoid having to repeatedly re-implement brittle parsers and feature extractors for each new dialect of SQL we encounter.

#queries   #users   accuracy
 73881       28      49.3%
 55333       10      37.4%
 18487       46      31.8%
  5471       21      96.2%
  4213        6      58.5%
  3894       12      99.7%
  3373        9      99.8%
  2867        6      99.8%
  1953       15      89.1%
  1924        4      98.1%
  1776        9      95.2%
  1699        5      99.8%
  1108       12      98.2%

Table 2: Top accounts with user prediction accuracy.

5.2 Labeling for Security Audits

We consider the conditions under which the learned features from query syntax are sufficient to predict the username and customer account, where each customer account has many users. When the predicted username differs from the actual username, we can potentially flag the query for an audit: predicting the username can help flag queries for security audits, while account and cluster labels can identify misrouted queries. We predict these labels from query syntax using the two embedding methods described in Section 3 over the Snowflake dataset.

Setup: We use embedders pre-trained on 500,000 Snowflake queries. The experiment itself is run on another dataset of 200,000 Snowflake queries labeled with the username, the account id, and the name of the cluster that ran the query. We then train classifiers (randomized decision trees) for username and customer account; the evaluation pipeline is sketched below.
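A minimal version of that pipeline follows; ExtraTreesClassifier stands in for the "randomized decision trees," and the estimator settings and placeholder data are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Assumed inputs: one embedding per labeled query, plus the username
# (or account id) as the prediction target; placeholders shown here.
query_vecs = np.random.rand(1000, 128)
usernames = np.random.randint(0, 20, size=1000)

clf = ExtraTreesClassifier(n_estimators=200)
scores = cross_val_score(clf, query_vecs, usernames, cv=10)
print("10-fold CV accuracy: %.3f" % scores.mean())
```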
Results: Table 1 shows the results for the labeling experiments; the numbers denote the 10-fold cross-validation score on the respective task. We find that the LSTM-based embedder beats Doc2Vec on all tasks. The LSTM method achieves near-perfect accuracy when predicting the customer account, because it automatically incorporates signal from the schema, and different customers use primarily different schemas (there are instances of shared schemas, but that is the less common case); the method itself is completely generic and knows nothing about schemas or queries. For user prediction, the task is more difficult, and the overall accuracy is lower, at 55%. Upon further analysis, we found that the user labeling task has > 95% accuracy for a majority of accounts (Table 2). The accounts with poor user-labeling accuracy had one distinctive property: multiple users running the exact same query, making the users nearly indistinguishable. In the sample of the workload we were working with, there were two accounts with a large number of repetitive queries by different users (for instance, 69% of the 74,000 queries in one account had more than one user label), and these two accounts also covered around 65% of the total queries, bringing down the overall accuracy of the classifiers.

5.3 Error Prediction

In Figure 3, we showed how the syntactic patterns in a query workload correlate with errors. In this section, we perform a more quantitative experiment, wherein we train a classifier to predict whether a query will raise an error. Such classifiers can be used in production to help quarantine error-prone queries and run them on instrumented debugging platforms.

Classifying multiple errors. Setup: In this experiment we train a classifier that can predict an error type (or no error) given an input Snowflake query. We use the LSTMAutoencoder-based embedder pre-trained on 500,000 Snowflake queries. The classifier itself is trained using a dataset of 100,000 queries from Snowflake. We randomly split the learned query vectors (and corresponding error codes) into training (85%) and test (15%) sets, use the training set to train a classifier, and present the performance of this classifier on the test set.

Error Code   Precision   Recall   f1-score   # queries
    -1         0.986      0.992     0.989       7464
   604         0.878      0.927     0.902       1106
   606         0.929      0.578     0.712         45
   608         0.996      0.993     0.995       3119
   630         0.894      0.864     0.879         88
  2031         0.765      0.667     0.712         39
 90030         1.000      0.998     0.999       1529
100035         1.000      0.710     0.830         31
100037         1.000      0.417     0.588         12
100038         0.981      0.968     0.975       1191
100040         0.952      0.833     0.889         48
100046         1.000      0.923     0.960         13
100051         0.941      0.913     0.927        104
100069         0.857      0.500     0.632         12
100071         0.857      0.500     0.632         12
100078         1.000      0.974     0.987         77
100094         0.833      0.921     0.875         38
100097         0.923      0.667     0.774         18

Table 3: Performance of a classifier trained using query embeddings, for different error types (-1 signifies no error).

Results: We summarize the performance of the classifier on all error classes with more than 10 queries each in Table 3. The classifier performs well for the errors that occur sufficiently frequently, suggesting that syntax alone can indicate queries that will generate errors. This mechanism can be used in an online fashion to route queries to specific resources with monitoring and debugging enabled to diagnose the problem. Offline, query error classification can be used for forensics; it is this use case that was our original motivation. Although individual bugs are not difficult to diagnose, there is a long tail of relatively rare errors, and manual inspection and diagnosis of these cases is prohibitively expensive. With automated classification, the patterns can be presented in bulk.

Classifying out-of-memory errors. Setup: In this experiment, we compare the classification performance of our method for one type of error considered a high priority by our colleagues: queries running out of memory (OOM). We compare against baseline heuristic methods developed in collaboration with Snowflake, based on their knowledge of problematic queries. We use a workload of 4491 Snowflake queries, a mix of queries with and without OOM errors, to train a classifier to predict OOM errors. Following the methodology of the previous classification task, we use the pre-trained embedder to generate query representations for the workload, randomly split the learned query vectors into training (85%) and test (15%) sets, and present the performance on the test set.

Heuristic baselines: We interviewed our collaborators at Snowflake and learned that the presence of window functions, or joins between large tables, tends to be associated with OOM errors. We implement four naïve baselines that look for the presence of window functions or a join between at least 3 of the top 1000 largest tables in Snowflake. The first baseline looks for the presence of heavy joins, the second looks for window functions, the third looks for the presence of either indicator (heavy joins or window functions), and the fourth looks for the presence of both heavy joins and window functions. Each baseline predicts that the query will run out of memory if the corresponding indicator is present in the query text (see the sketch at the end of this section).

Method                                   Precision   Recall   f1-score
Contains heavy joins                       0.729      0.115     0.198
Contains window funcs                      0.762      0.377     0.504
Contains heavy joins OR window funcs       0.724      0.403     0.518
Contains heavy joins AND window funcs      0.931      0.162     0.162
LSTMAutoencoder                            0.983      0.977     0.980
Doc2Vec                                    0.919      0.823     0.869

Table 4: Classifier performance for predicting OOM errors.

Results: Table 4 shows the results. We find that our method significantly outperforms the baseline heuristics, without requiring any domain knowledge or custom feature extractors. We do find that the presence of heavy joins and window functions in a query is a good indicator of OOM errors (especially if they occur together), given the precision of these baselines; however, the low recall suggests that such hard-coded heuristics would miss other causes of OOM errors. Querc obviates the need for such hard-coded heuristics. As with any errors, this mechanism can be used to route potentially problematic queries to clusters instrumented with debugging or monitoring harnesses, or to clusters with larger available main memory. We see Querc as a component of a comprehensive scheduling and workload management solution; these experiments show the potential of the approach.
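For reference, a minimal sketch of the heuristic baselines described above; the window-function pattern and the heavy-table set are illustrative stand-ins for Snowflake's internal definitions.

```python
import re

WINDOW_FUNC = re.compile(r"\bOVER\s*\(", re.IGNORECASE)
HEAVY_TABLES = {"lineitem", "orders", "events"}  # stand-in for top-1000 list

def has_window_function(sql: str) -> bool:
    return bool(WINDOW_FUNC.search(sql))

def has_heavy_join(sql: str, min_tables: int = 3) -> bool:
    # Crude check: at least min_tables known-large tables referenced.
    referenced = {tok.lower() for tok in re.findall(r"\w+", sql)}
    return len(referenced & HEAVY_TABLES) >= min_tables

def predict_oom(sql: str) -> bool:
    # The "AND" baseline: flag only when both indicators are present.
    return has_window_function(sql) and has_heavy_join(sql)
```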
We see 100094 0.833 0.921 0.875 38 Querc as a component of a comprehensive scheduling and 100097 0.923 0.667 0.774 18 workload management solution; these experiments show the potential of the approach. Table 3: Performance of classifier trained using query em- beddings for different error types (-1 signifies no error). Method Precision Recall f1-score Contains heavy joins 0.729 0.115 0.198 Results: We summarize the performance of the classifier Contains window funcs 0.762 0.377 0.504 Contains heavy joins on all error classes with more than 10 queries each in Ta- 0.724 0.403 0.518 ble 3. The classifier performs well for the errors that occur OR window funcs sufficiently frequently, suggesting that the syntax alone can Contains heavy joins 0.931 0.162 0.162 indicate queries that will generate errors. This mechanism AND window funcs can be used in an online fashion to route queries to specific LSTMAutoencoder 0.983 0.977 0.980 resources with monitoring and debugging enabled to diag- Doc2Vec 0.919 0.823 0.869 nose the problem. Offline, query error classification can be used for forensics; it is this use case that was our original Table 4: Classifier performance for predicting OOM errors. motivation. Although individual bugs are not difficult to diagnose, there is a long tail of relatively rare errors; manual inspec- tion and diagnosis of these cases is prohibitively expensive. 6. FUTURE WORK With automated classification, the patterns can be presented Other methods: There are a variety of other methods in bulk. for learning representations of text that we do not evaluate in this paper. Our goal is not to identify the best possible Classifying out-of-memory errors. representation learning approach but rather to show that Setup: In this experiment, we compare the classification these methods can compete with and outperform classical performance of our method for one type of error consid- approaches that rely on task-specific heuristics and feature ered a high priority for our colleagues — queries running engineering (extracting JOIN clauses, counting the number out of memory (OOM). We compare to a baseline heuristic of attributes, etc.), and to organize the methods into a co- method developed in collaboration with Snowflake based on herent system architecture. Alternative methods can be roughly categorized into non- implications for sql. In Proceedings of the 29th neural-network based methods and neural-network-based meth- International Conference on Very Large Data Bases - ods. The non-neural-network-based methods, including non- Volume 29, VLDB ’03, pages 730–741. VLDB negative matrix factorization (NMF), bag-of-words repre- Endowment, 2003. sentations, and LDA [22] have been shown to be less effec- [3] S. Chaudhuri, A. K. Gupta, and V. Narasayya. tive than neural-network-based-methods in a variety of con- Compressing sql workloads. In Proceedings of the 2002 texts [25, 19]. Apart from the methods considered in this ACM SIGMOD international conference on paper, there are more recent neural-network-based methods Management of data, pages 488–499. ACM, 2002. using Convolutional Neural Networks (CNNs) adapted for [4] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, text data. However, Yin et al. [34] showed that RNN based A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, methods (e.g., LSTMs) perform well and are robust in a M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. broad range of tasks when compared to CNNs. However, Munir, S. Pelley, P. Povinec, G. Rahn, S. 
Publish pre-trained models: The results in Section 5 demonstrate that the proposed framework has the potential to use models pre-trained on generic workloads to aid analytics for previously unseen queries. In future work, we will build this framework as a service accessible by third parties. Given the workloads we have access to from Snowflake [4], such a service could be especially beneficial for researchers who do not have access to massive query workloads.

Other applications: We touched upon the potential utility of our framework for a variety of applications in Section 4; future work will explore the applications that were not part of the experimental evaluation in this paper.

7. CONCLUSIONS

We presented the architecture for Querc, a database-agnostic workload analytics service that captures the structural and schema patterns present in a query workload automatically, largely eliminating the need for the specialized syntactic feature engineering that has motivated a number of papers in the literature. The proposed architecture provides a new way of organizing a variety of database administration and user productivity tasks, and provides a mechanism by which to automatically adapt database operations to specific query workloads. Our evaluation of this architecture showed that our general framework outperformed or was competitive with previous approaches that required specialized feature engineering, and also admitted simpler classification algorithms, because the inputs are numeric vectors with well-behaved algebraic properties rather than the results of arbitrary user-defined functions for which few properties can be assumed. The use of transfer learning in Querc allows workload analytics to be SQL-dialect independent, and enables the capability to bootstrap new analytics tasks while avoiding the re-implementation of brittle code paths.

8. ACKNOWLEDGEMENTS

We would like to thank our collaborators at Snowflake for valuable feedback and the datasets that made this work possible. This work is sponsored by National Science Foundation award 1740996 and by Microsoft.

9. REFERENCES

[1] J. Akbarnejad, G. Chatzopoulou, M. Eirinaki, S. Koshy, S. Mittal, D. On, N. Polyzotis, and J. S. V. Varman. SQL QueRIE recommendations. Proceedings of the VLDB Endowment, 3(1-2):1597-1600, 2010.
[2] S. Chaudhuri, P. Ganesan, and V. Narasayya. Primitives for workload summarization and implications for SQL. In Proceedings of the 29th International Conference on Very Large Data Bases, VLDB '03, pages 730-741. VLDB Endowment, 2003.
[3] S. Chaudhuri, A. K. Gupta, and V. Narasayya. Compressing SQL workloads. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 488-499. ACM, 2002.
[4] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, A. W. Lee, A. Motivala, A. Q. Munir, S. Pelley, P. Povinec, G. Rahn, S. Triantafyllis, and P. Unterbrunner. The Snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 215-226, New York, NY, USA, 2016. ACM.
[5] A. Dan, P. S. Yu, and J. Y. Chung. Characterization of database access pattern for analytic prediction of buffer hit probability. The VLDB Journal, 4(1):127-154, Jan. 1995.
[6] Y. Goldberg and O. Levy. word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[7] T. Grust and J. Rittinger. Observing SQL queries in their natural habitat. ACM Trans. Database Syst., 38(1):3:1-3:33, Apr. 2013.
[8] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1917-1923. ACM, 2015.
[9] S. Jain and B. Howe. Data cleaning in the wild: Reusable curation idioms from a multi-year SQL workload. In Proceedings of the 11th International Workshop on Quality in Databases, QDB 2016, at the VLDB 2016 conference, New Delhi, India, September 5, 2016.
[10] S. Jain and B. Howe. SQLShare data release. https://uwescience.github.io/sqlshare//data_release.html, 2016. [Online].
[11] S. Jain and B. Howe. Query2Vec: NLP meets databases for generalized workload analytics. CoRR, abs/1801.05613, 2018.
[12] S. Jain, D. Moritz, D. Halperin, B. Howe, and E. Lazowska. SQLShare: Results from a multi-year SQL-as-a-service experiment. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 281-293. ACM, 2016.
[13] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22-33, 2010.
[14] Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
[15] T. M. Kodinariya and P. R. Makwana. Review on determining number of cluster in K-means clustering. International Journal, 1(6):90-95, 2013.
[16] P. Kołaczkowski. Compressing very large database workloads for continuous online index selection. In Database and Expert Systems Applications, pages 791-799. Springer, 2008.
[17] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.
[18] Y. LeCun. The MNIST database of handwritten digits.
[19] O. Levy, Y. Goldberg, and I. Ramat-Gan. Linguistic regularities in sparse and explicit word representations. In CoNLL, pages 171-180, 2014.
[20] J. Li, M. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. CoRR, abs/1506.01057, 2015.
[21] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI '14, pages 583-598, Berkeley, CA, USA, 2014. USENIX Association.
[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 142-150. Association for Computational Linguistics, 2011.
[23] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330-339, 2010.
[24] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[26] A. Pavlo, G. Angulo, J. Arulraj, H. Lin, J. Lin, L. Ma, P. Menon, T. C. Mowry, M. Perron, I. Quah, et al. Self-driving database management systems. In CIDR, 2017.
[27] K. E. Pavlou and R. T. Snodgrass. Generalizing database forensics. ACM Trans. Database Syst., 38(2):12:1-12:43, July 2013.
[28] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[29] C. Sapia. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery, DaWaK 2000, pages 224-233, London, UK, 2000. Springer-Verlag.
[30] D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, 2015.
[31] Q. T. Tran, K. Morfonios, and N. Polyzotis. Oracle workload intelligence. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1669-1681, New York, NY, USA, 2015. ACM.
[32] P. Upadhyaya, M. Balazinska, and D. Suciu. Automatic enforcement of data use policies with DataLawyer. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 213-225, New York, NY, USA, 2015. ACM.
[33] J. Yan, Q. Jin, S. Jain, S. D. Viglas, and A. Lee. Snowtrail: Testing with production queries on a cloud database. In Proceedings of the Workshop on Testing Database Systems, DBTest '18, pages 4:1-4:6, New York, NY, USA, 2018. ACM.
[34] W. Yin, K. Kann, M. Yu, and H. Schütze. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
[35] P. S. Yu, M.-S. Chen, H.-U. Heiss, and S. Lee. On workload characterization of relational database environments. IEEE Trans. Softw. Eng., 18(4):347-355, Apr. 1992.
[36] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[37] R. S. Zemel. Autoencoders, minimum description length and Helmholtz free energy. In NIPS, 1994.