Demonstration of ScroogeDB: Getting More Bang For the Buck with Deterministic Approximation in the Cloud

Saehan Jo, Jialing Pei, Immanuel Trummer
Cornell University, NY, USA
[email protected], [email protected], [email protected]

ABSTRACT
We demonstrate ScroogeDB, which aims at minimizing the monetary cost of processing aggregation queries in the Cloud. It runs on top of a Cloud platform that offers pay-as-you-go query processing where users pay according to the number of bytes processed. ScroogeDB exploits deterministic approximate query processing (DAQ) to achieve monetary savings. That is, ScroogeDB provides deterministic bounds, i.e., bounds that contain the true value with 100% probability. ScroogeDB creates small synopses of the database and uses these synopses to answer aggregation queries. By rewriting a query on base tables into a query on smaller synopses, we significantly reduce the amount of processed data. We do not pre-compute synopses in advance of an analysis session. Instead, we generate them on the fly, interleaving synopsis generation with query execution.

In our demonstration, we show that our system realizes impressive monetary savings with little precision loss. We run our system on top of the BigQuery Cloud platform and provide users with a graphical interface that visualizes deterministic bounds. The graphical interface also provides information regarding the generated synopses and how they contribute to monetary savings.

PVLDB Reference Format:
Saehan Jo, Jialing Pei, Immanuel Trummer. Demonstration of ScroogeDB: Getting More Bang For the Buck with Deterministic Approximation in the Cloud. PVLDB, 13(12): 2961-2964, 2020.
DOI: https://doi.org/10.14778/3415478.3415519

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 13, No. 12
ISSN 2150-8097.

1. INTRODUCTION
With the proliferation of Cloud-based query processing services like Google BigQuery or Amazon Redshift, very large data sets have become more readily accessible for data analysis. A novel pricing model available for Cloud-service users is the pay-as-you-go policy where users pay solely based on the amount of data processed and stored in the Cloud. In this scenario, we want to minimize the amount of data read per query to minimize monetary fees for end-users. ScroogeDB aims at minimizing the number of bytes processed via deterministic approximate query processing.

Prior work on approximate query processing (AQP) typically uses sampling to generate confidence bounds on query aggregates [1]. Confidence bounds can be hard to interpret for lay users [6]. Sampling is problematic if aggregates are sensitive to outliers (e.g., maxima and minima). Those drawbacks have recently motivated DAQ [6], aimed at producing hard bounds on query aggregates. Prior work on DAQ focuses on simple (single-table) queries [4, 6] or time series data [2]. Another recent work, BitGourmet [3], introduces a specialized execution engine that operates on bit-sliced data to produce deterministic bounds. All prior work focuses on minimizing execution time, rather than monetary execution fees. On the other hand, ScroogeDB is non-intrusive (i.e., runs on top of existing Cloud databases) and focuses on minimizing monetary fees.

ScroogeDB operates as a layer on top of a Cloud processing platform offering pay-as-you-go pricing. It is non-intrusive since it does not require access to the internals of the Cloud database. When a user issues a query, ScroogeDB rewrites the query into a semantically equivalent query on data synopses and executes it instead. Each data synopsis is a table that summarizes a portion of the database (covering one or multiple base tables). These synopses are much smaller in size than base tables, and thus, using synopses rather than base tables reduces monetary fees. ScroogeDB chooses the synopsis to use to answer a specific query based on a precision model. The precision model estimates the result precision when answering the current query using a specific synopsis. Given a user-defined precision constraint, ScroogeDB finds the synopsis with minimum processing cost among synopses that satisfy the precision constraint.

Prior AQP systems usually generate auxiliary data structures offline, prior to query executions. Their goal is to reduce run time, whereas our goal is to reduce monetary cost. Since we pay per byte processed, creating redundant synopses directly translates into unnecessary monetary fees. We have to ensure that each generated synopsis pays off. For this reason, ScroogeDB generates synopses in an online fashion. That is, we interleave synopsis generation (which is done in the background) with query execution. ScroogeDB does not require an example workload, nor does it create pre-computed synopses. Instead, we only rely on (a handful of) queries from the ongoing analysis session to determine the query distribution in a probabilistic manner. Based on that, ScroogeDB searches for a synopsis that benefits us not only for the current query but also in future query executions.

Figure 1: Overview of the ScroogeDB system.

ScroogeDB relies on a cost-based synopsis utility model to determine the optimal synopsis to generate. We formulate the problem of choosing the synopsis with maximum utility as an integer linear program.

In our demonstration, we run ScroogeDB on top of Google BigQuery and demonstrate its effectiveness in reducing monetary cost. ScroogeDB demonstrates impressive monetary savings with little precision loss on query results. Section 2 introduces our problem model. Section 3 gives an overview of ScroogeDB, introducing its components. In Section 4, we provide details on our demonstration plan, which includes an interactive user experience of our system.

2. PROBLEM MODEL
We introduce our problem model and relevant definitions. ScroogeDB supports aggregation queries with key-foreign-key joins on a star schema. We currently support Count, Sum, Avg, Min, and Max aggregation functions and group-by clauses as well as equality and inequality predicates.

2.1 Data Synopsis
A Data Synopsis is a table summarizing a portion of the underlying database. It can cover a subset of columns in the fact table as well as columns in dimension tables. To specify a synopsis, we select a subset of columns from the database. Then, the synopsis consists of all distinct value combinations appearing in the selected columns and their frequencies. In other words, a data synopsis resembles a table created by the following SQL statement:

    select C, Count(*) as cnt
    from F
    [join D on {foreign key constraint}]*
    group by C;

where C denotes a subset of columns in the database, F is the fact table, and D denotes a dimension table. Note that there can be multiple dimension tables, each joined along the foreign key constraint.

Now, we introduce approximation to synopses. We can divide the value domain of a column so that distinct values are clustered into a smaller number of coarser value groups. Each value group can represent multiple consecutive values in the value domain. We call a column in a synopsis an Approximate Column if its value domain is divided into coarser value ranges. Otherwise, if the value domain of a column in a synopsis is at its finest granularity, we call it an Exact Column. This is when there is one value per value group. We use a tilde over a column symbol to denote an approximate column (i.e., c̃ indicates that a column c is approximate).

The size of a synopsis depends on how many distinct value combinations appear in the selected columns. With approximate columns, we can reduce the number of distinct value combinations, and thus, shrink the size of a synopsis even further compared to when it only contains exact columns. For an approximate column, a synopsis stores the minimum and maximum values of each value group.

Example 1. Figure 2 depicts a case where the number of distinct value combinations reduces from seven to four. We divide the value domain [1, 5] of an integer column c into two value ranges [1, 3) and [3, 5]. The synopsis with the approximate column c̃ stores its minimum and maximum for each value range.

Figure 2: Example of exact and approximate columns in a synopsis.

2.2 Problem Statement
We measure the Precision of deterministic bounds as the lower bound divided by the upper bound. In other words, we get l/u where l is the lower bound and u is the upper bound (assuming positive numbers for aggregates). We get a precision of one when the query result is exact (i.e., l = u) and a precision of zero when we have a trivial lower bound of zero.

The goal of Monetary Fee Minimization via DAQ is to minimize the monetary cost of pay-as-you-go query processing in the Cloud.
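To make the synopsis construction above concrete, the following Python sketch builds an exact synopsis (distinct value combinations with frequencies) and then coarsens column c into the two value ranges from Example 1. The rows with a = 'B' are chosen to match the counts used in Examples 2 and 3; the rows with a = 'A' are purely hypothetical filler, since Figure 2 is not reproduced here.

```python
from collections import Counter

# Toy fact table rows (a, c); the a='B' counts match Examples 2-3,
# the a='A' rows are hypothetical filler.
rows = ([("B", 1)] * 78 + [("B", 2)] * 19 + [("B", 3)] * 11 + [("B", 5)] * 8
        + [("A", 1)] * 5 + [("A", 2)] * 6 + [("A", 4)] * 7)

# Exact synopsis: one row per distinct (a, c) combination plus a count,
# i.e., the result of: select a, c, Count(*) as cnt from F group by a, c
exact = Counter(rows)
assert len(exact) == 7  # seven distinct value combinations

# Approximate column: coarsen c's domain [1, 5] into ranges [1, 3) and [3, 5].
def value_group(c):
    return 0 if c < 3 else 1

# Approximate synopsis: per (a, group), store the frequency plus the
# minimum and maximum c-value in that group (the cmin/cmax columns).
approx = {}
for (a, c), cnt in exact.items():
    key = (a, value_group(c))
    f, lo, hi = approx.get(key, (0, c, c))
    approx[key] = (f + cnt, min(lo, c), max(hi, c))

assert len(approx) == 4                 # reduced from seven to four combinations
assert approx[("B", 0)] == (97, 1, 2)   # 78 + 19 rows with c in [1, 3)
assert approx[("B", 1)] == (19, 3, 5)   # 11 + 8 rows with c in [3, 5]
```

The coarsened table has four rows instead of seven, which is exactly the size reduction that makes approximate columns attractive under per-byte pricing.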

In our problem model, a user conducts an analysis session and executes queries on a database one after another. In this scenario, there is no example workload and the specifics of a query are not given until its execution. We exploit DAQ in our solution to achieve this goal. Given a target precision, we produce deterministic bounds with the desired level of precision as query results.

3. SYSTEM OVERVIEW
Figure 1 shows how ScroogeDB interacts with the underlying Cloud database to reduce monetary cost. When a user issues an aggregation query, ScroogeDB rewrites it into a semantically equivalent query on data synopses (Query Rewriter in Figure 1). Synopses are much smaller in size, compared to base tables, and thus, the rewritten query accesses less data than the original. As a result, ScroogeDB reduces monetary cost, as pay-as-you-go query processing charges users per byte processed.

We illustrate the internal process of ScroogeDB. First, given an aggregation query to execute, ScroogeDB identifies synopses (among all synopses stored in the Cloud) that can be used to answer the current query while satisfying the user-defined precision constraint (Precision Model in Figure 1). If there are multiple such synopses, we choose the synopsis that results in the minimal number of bytes processed. Otherwise, we consider generating a new data synopsis. ScroogeDB takes into account whether a synopsis can answer the current query as well as the expected utility for future queries in the ongoing analysis session (Utility Model in Figure 1). We consider two factors to estimate synopsis utility: 1) how likely the synopsis is to be used and 2) how much savings we get per usage. To predict upcoming queries, we use a probabilistic approach and model the query distribution of the current session. If the size of the smallest synopsis that can answer the current query is already comparable to that of the base tables, there is no benefit in using a synopsis. In that case, we do not create a synopsis and execute the query without rewriting it.

3.1 Query Rewriting
ScroogeDB utilizes query rewrite rules to transform an aggregation query on base tables into a semantically equivalent form that operates on a synopsis. The high-level idea for query rewriting is to scale aggregated values in the synopsis based on the frequency counts. The following example illustrates rewriting a query using an exact column.

Example 2. Consider the synopsis with the exact column c in Figure 2 and the aggregation query Sum(c) where a = 'B'. To compute a sum over c, we simply multiply each value in c with the frequency count and add them up to get the result. In other words, we get 1 · 78 + 2 · 19 + 3 · 11 + 5 · 8 = 189. In summary, we rewrite the query as Sum(c · cnt) where a = 'B'.

Now, suppose we use the synopsis with the approximate column c̃ to calculate the summation over c.

Example 3. Consider the same query Sum(c) where a = 'B'. ScroogeDB computes the deterministic lower bound by multiplying the frequency count by the minimum value for each row and adding them together. We get a lower bound of 1 · 97 + 3 · 19 = 154. Similarly, we compute the deterministic upper bound using the maximum values as 2 · 97 + 5 · 19 = 289. In short, we rewrite the query as Sum(cmin · cnt) as lower, Sum(cmax · cnt) as upper where a = 'B'. Note that cmin and cmax are column names rather than aggregates.

3.2 Precision Model
If a synopsis only contains exact columns, we get the best possible precision of one (i.e., we compute exact results). When a query evaluation involves approximate columns, we can still get a lower bound on the precision. We can show that if the relevant approximate columns satisfy certain properties, it is guaranteed that the query result meets the desired level of precision. We describe this concept using the following example.

Example 4. Suppose we are executing the query Sum(c) where a = 'B' using the synopsis with the approximate column c̃ in Figure 2. Each value range of c̃ has the property that the ratio of the lower bound to the upper bound is greater than or equal to 0.5. That is, we get 1/2 = 0.5 for the first value range and 3/5 = 0.6 for the second value range. Then, it is guaranteed that the deterministic bounds of the summation give us a precision better than or equal to 0.5. For this example, we get a precision of 154/289 ≈ 0.53 from the deterministic bounds we computed in Example 3.

This is true for the case when an approximate column is used as an aggregated column or as a constrained column in an inequality predicate. For other cases (e.g., equality predicates or group-by clauses), we resort to exact columns in the synopsis.

3.3 Synopsis Utility Model
Internally, ScroogeDB selects synopses to generate based on a synopsis utility model. Based on the query history and the current set of stored synopses, the utility model calculates an estimated benefit of constructing a new synopsis. For each possible synopsis, we weigh two factors: the probability that this synopsis will be used in the future and the monetary savings obtained when using it.

To predict likely future queries, ScroogeDB determines a probability distribution over queries in the current query workload. Based on past query executions, we calculate the probability of a certain column appearing in a specific part of a query. These probabilities together form the query distribution for the ongoing analysis session. Based on that, we obtain the probability of a synopsis being used in the future. We know that a synopsis can answer a query only when the synopsis supports all columns of the query. Thus, the probability that a synopsis supports a future query is the probability that no unsupported column appears in the query. In addition, the set of synopses we already have influences our utility estimates about others. If a query can be answered by an existing synopsis, there is little benefit in creating a new synopsis for that particular query. Thus, we focus on creating a synopsis that can answer currently unsupported queries.

The expected monetary saving per synopsis usage can be computed based on the size difference between the synopsis and the base tables. We approximate this difference using the ratio of the number of rows in the synopsis to the number of rows in the fact table. Since it is hard to compute the exact number of rows in a synopsis without actually generating it, we estimate the number using a space model. For a synopsis, the model considers the number of distinct values in each of its columns and the total number of rows in the base table.
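The rewrites of Examples 2-4 can be replayed in a few lines of Python. The synopsis rows below are the a = 'B' rows read off the paper's examples; the computation is just the scaled sums described in Section 3.1, so this is an illustration of the arithmetic, not of the system itself.

```python
# Exact synopsis restricted to a = 'B': (c, cnt) pairs, as in Example 2.
exact_rows = [(1, 78), (2, 19), (3, 11), (5, 8)]

# Rewritten exact query: Sum(c * cnt).
exact_sum = sum(c * cnt for c, cnt in exact_rows)
assert exact_sum == 189

# Approximate synopsis restricted to a = 'B': (cmin, cmax, cnt) per value range.
approx_rows = [(1, 2, 97), (3, 5, 19)]

# Rewritten approximate query: Sum(cmin * cnt) as lower, Sum(cmax * cnt) as upper.
lower = sum(cmin * cnt for cmin, _, cnt in approx_rows)
upper = sum(cmax * cnt for _, cmax, cnt in approx_rows)
assert (lower, upper) == (154, 289)

# Precision = lower / upper; it is guaranteed to be at least the worst
# per-range ratio cmin / cmax (Example 4).
precision = lower / upper
worst_ratio = min(cmin / cmax for cmin, cmax, _ in approx_rows)
assert worst_ratio == 0.5
assert precision >= worst_ratio   # 154/289 is about 0.53, which is >= 0.5

# The true result lies within the deterministic bounds.
assert lower <= exact_sum <= upper
```

Note that the deterministic bounds here are hard guarantees: unlike sampling-based confidence intervals, the true sum 189 can never fall outside [154, 289].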

Figure 3: Relative monetary savings per benchmark.
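As a rough illustration of the utility computation in Section 3.3 (the paper does not spell out the exact formulas or the ILP formulation), the sketch below combines the two factors under simplifying assumptions: column appearances in queries are treated as independent, the support probability is the probability that no unsupported column appears, and the saving per usage is approximated by the synopsis-to-fact-table row ratio. All function names and numbers are hypothetical.

```python
def support_probability(synopsis_cols, col_appearance_prob):
    """Probability that a future query uses no column outside the synopsis,
    assuming columns appear in queries independently (a simplification)."""
    p = 1.0
    for col, prob in col_appearance_prob.items():
        if col not in synopsis_cols:
            p *= (1.0 - prob)
    return p

def expected_utility(synopsis_cols, est_synopsis_rows, fact_table_rows,
                     col_appearance_prob):
    """Weigh usage probability against the per-usage saving, which we
    approximate via the ratio of synopsis rows to fact-table rows."""
    saving = 1.0 - est_synopsis_rows / fact_table_rows
    return support_probability(synopsis_cols, col_appearance_prob) * saving

# Hypothetical session statistics: per-column appearance probabilities
# estimated from past queries, and row counts from a space model.
probs = {"a": 0.9, "c": 0.7, "d": 0.2}
u = expected_utility({"a", "c"}, est_synopsis_rows=4, fact_table_rows=124,
                     col_appearance_prob=probs)
assert 0.0 < u < 1.0
```

A synopsis covering frequently queried columns and shrinking the data aggressively scores highest; ScroogeDB then picks the maximum-utility synopsis via its integer linear program.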

4. DEMONSTRATION
We present ScroogeDB on top of Google BigQuery¹ and demonstrate its monetary savings compared to standard execution (i.e., executing queries on base tables). We provide a graphical interface which displays deterministic bounds as well as various details regarding the internal decision process (e.g., query rewriting, synopsis utility model, and synopsis generation) of the ScroogeDB system.

4.1 Data Sets and Experimental Results
We use two real-world data sets as well as a standard benchmark in our demonstration. The first data set is about childbirths in the United States between 1969 and 2008 (natality data set)². The second data set contains information about the number of reported crimes per crime type in London (London crime data set)³. Also, we use the Star Schema Benchmark (SSB) [5] with a scaling factor of 10. We conducted a user study with five participants to collect analytical query workloads on the two real-world data sets. Given 20 minutes per data set, the participants were asked to find interesting facts or trends in the data. As a result, we collected five query workloads per data set where each workload has up to 76 queries. Since SSB only consists of 13 queries (which is a small number for a query workload), we create ten variants per query by randomly changing the constant values in predicates.

Figure 3 illustrates the experimental results on the three benchmarks. We assign a value of 0.9 as the target precision for the experiment. That is, the ratio of the lower bound to the upper bound is at least 0.9. Our system was evaluated against a standard execution (i.e., without ScroogeDB) to get initial experimental results. ScroogeDB gives us average monetary savings of 47.23% over all benchmarks and up to 83.39% in the best case for a query workload. Averaging over all five workloads, we get savings of 70.27% for the London crime data set and 30.65% for the natality data set. For SSB, where there is only one workload, we get savings of 40.77%.

4.2 Demonstration Plan
Figure 4 shows a screenshot of the ScroogeDB graphical interface. It allows users to easily formulate aggregation queries on the underlying data sets. The interface visualizes deterministic bounds using black bars (on the right side of the figure). It also displays information regarding the internal decisions of ScroogeDB, including which synopses have been generated and which synopsis has been used per query (under Log in the top-left corner). In addition, monetary fees charged per query execution are shown to users.

Figure 4: Screenshot of the ScroogeDB interface.

In our demonstration, we present the graphical user interface as a Web application. We provide the option for users to switch between ScroogeDB and the standard execution of Google BigQuery. By switching between the two execution methods, users can experience the accumulated monetary savings offered by ScroogeDB. Also, users can change the precision constraint for ScroogeDB and observe how it affects the size of the generated synopses. By using smaller synopses, users can save even more monetary fees per query execution.

5. REFERENCES
[1] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In EuroSys, pages 29-42, 2013.
[2] E. Boursier, J. J. Brito, C. Lin, and Y. Papakonstantinou. Plato: Approximate Analytics over Compressed Time Series with Tight Deterministic Error Guarantees. CoRR, abs/1808.04876, 2018.
[3] S. Jo and I. Trummer. BitGourmet: Deterministic Approximation via Optimized Bit Selection. In CIDR, 2020.
[4] K. Li, Y. Zhang, G. Li, W. Tao, and Y. Yan. Bounded Approximate Query Processing. TKDE, 31(12):2262-2276, 2019.
[5] P. O'Neil, B. O'Neil, and X. Chen. Star Schema Benchmark. 2009.
[6] N. Potti and J. M. Patel. DAQ: A New Paradigm for Approximate Query Processing. PVLDB, 8(9):898-909, 2015.

¹ https://cloud.google.com/bigquery
² https://www.kaggle.com/bigquery/samples
³ https://www.kaggle.com/LondonDataStore/london-crime
