Bounded Approximate Query Processing

Kaiyu Li Yong Zhang Guoliang Li Wenbo Tao Ying Yan

Abstract—OLAP is a core functionality in database systems, and its performance is crucial to enable on-time decisions. However, OLAP queries are rather time consuming, especially on large datasets, and traditional exact solutions usually cannot meet the high-performance requirement. Recently, approximate query processing (AQP) has been proposed to enable approximate OLAP. However, existing AQP methods have some limitations. First, they may involve unacceptable errors on skewed data (e.g., long-tail distributions). Second, they need to store a large amount of data and achieve no significant performance improvement. Third, they only support a small subset of SQL aggregation queries. To overcome these limitations, we propose a bounded approximate query processing framework BAQ. Given a predefined error bound and a set of queries, BAQ judiciously selects high-quality samples from the data to generate a unified synopsis offline, and then uses the synopsis to answer online queries. Compared with existing methods, BAQ has the following salient features. (1) BAQ does not need to generate a synopsis for each query; it only generates a unified synopsis, and thus BAQ has a much smaller synopsis. (2) BAQ achieves much smaller error than existing studies. Specifically, BAQ can provide deterministic approximate results (i.e., the estimated query results must be within the error bound with 100% confidence) for SQL aggregation queries that do not contain selection conditions on numerical columns. For queries with selection conditions on numerical columns, we propose effective grouping-based techniques, and the estimated results are also within the error bound in practice. Experimental results on both real and synthetic datasets show that BAQ significantly outperforms state-of-the-art approaches. For example, on a Microsoft production dataset (a real dataset with synthetic queries), BAQ achieves 10-100× improvement on synopsis size and 10-100× improvement on error compared with state-of-the-art algorithms.

Index Terms—Data Integration, Approximate Query Processing, Synopsis, Sampling, Bounded Error

1 INTRODUCTION

Online analytical processing (OLAP) is an important functionality in database systems. The performance of processing OLAP queries is crucial in many applications, such as decision-support systems. For example, consider the revenue table T of a fruit company in Table 1. A data analyst wants to know the average tax in market "US" on category "apple", and poses the SQL query "SELECT AVG(Tax) FROM T WHERE Market=US and Fruit=apple". Nevertheless, executing a SQL query on large datasets is expensive, and traditional exact solutions cannot meet the high-performance requirement. To address this problem, approximate query processing (AQP) [10], [18] has been proposed to enable interactive OLAP on big data. However, existing AQP techniques (e.g., the sampling-based method (SAQ) [10], the deterministic method (DAQ) [33], Sketch [11]–[13], [43], Histogram [9], [19], [26], [30]–[32], [37], [40] and Wavelet [7], [14], [15], [25], [39]) have the following limitations.

Firstly, SAQ requires a given query workload, selects a synopsis of samples for each distinct query column set (QCS) of queries in the workload, and uses the synopsis to answer an OLAP query whose QCS is a subset of a given query's QCS. It is challenging to select high-quality samples. Random sampling works well on data with a normal distribution, but has large (unacceptable) errors on skewed data [42]. Although stratified sampling (AQUA [35], START [36], BlinkDB [4]) can deal with sparse data, SAQ still has some limitations. (1) SAQ cannot give 100% confidence on the error bound of using samples to answer queries and may return large errors for skewed data, and thus SAQ cannot meet the requirement of many real applications that the error must be within 5% [42]. (2) SAQ needs to generate a sample for each QCS and may generate many samples, leading to low performance.

Secondly, DAQ does not require a query workload. It focuses on numerical data and uses the high-order bits of a numerical column to approximate the results. It still needs to consider all the data, so the performance improvement is limited. Moreover, it cannot support some SQL queries, e.g., queries with both categorical and numerical columns.

Thirdly, Sketch, Histogram and Wavelet also require a query workload. However, they cannot answer the queries that are not in the workload. Moreover, they can only support a subset of SQL aggregation queries and cannot support queries with categorical columns.

To address these limitations, we propose a bounded approximate query processing framework BAQ. Given an error bound and a query workload, BAQ generates a unified synopsis and uses the synopsis to answer an online query whose QCS is a subset of a given query's QCS. Compared with existing methods, BAQ has the following salient features. (1) BAQ does not need to generate a synopsis for each query; it only generates a unified synopsis, and thus BAQ generates a much smaller synopsis. (2) BAQ achieves much smaller error than existing studies. Specifically, BAQ can provide deterministic approximate results (i.e., the estimated query results must be within the error bound with 100% confidence) for SQL aggregation queries that do not contain selection conditions on numerical columns.

• Kaiyu Li is with the Department of Computer Science, Tsinghua University, Beijing, China. [email protected].
• Yong Zhang is with the Research Institute of Information Technology at Tsinghua University, Beijing, China. [email protected].
• Guoliang Li is with the Department of Computer Science, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China. [email protected].
• Wenbo Tao is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, U.S. [email protected].
• Ying Yan is with the Cloud Computing Group, MSRA, Beijing, China. [email protected].

For queries with selection conditions on numerical columns, we propose effective grouping-based techniques, and the estimated results are also within the error bound in practice.

The basic idea of BAQ is that we first judiciously select a query synopsis for each QCS with 100% confidence within the error bound, and then generate a unified synopsis by covering all these query synopses, which is used to answer online queries. For example, in Table 1, we first generate a synopsis for each query in Figure 1 and finally generate a unified synopsis in Table 2 by merging the query synopses. (We will introduce more details in Section 3.) To summarize, we make the following contributions.

(1) We propose a bounded approximate query processing framework BAQ. BAQ can provide deterministic approximate results for queries that do not contain selection conditions on the numerical attributes. For queries with selection conditions on numerical columns, we propose effective grouping-based techniques, and the estimated results are also within the error bound in practice.
(2) We first select a query synopsis for each QCS (see Section 3). We then generate a unified synopsis by covering the query synopses, prove that the problem is NP-hard, and propose heuristic solutions (see Section 4).
(3) We extend our method to support big data and devise efficient algorithms to generate a unified synopsis in distributed environments (see Section 5).
(4) We implemented our method on Spark and evaluated it on both real and synthetic datasets. Experimental results showed that our method generated a much smaller synopsis and had a much better error-bound guarantee compared with state-of-the-art algorithms (see Section 6).

2 PRELIMINARIES

2.1 Problem Definition

Data Model. Given a database T with N records, we use A to denote its column set, where each column A ∈ A is either a categorical column with a limited number of distinct values or a numerical column with real numbers.

Query Model. Our system can support any SQL aggregation query, and here we focus on the following query.

SELECT A^s_1, A^s_2, ..., A^s_{|A^s|}, AGG(A^a_1), AGG(A^a_2), ..., AGG(A^a_{|A^a|})
FROM T
WHERE (A^w_1 VP v_1) OP (A^w_2 VP v_2) OP ... (A^w_{|A^w|} VP v_{|A^w|})
GROUP BY A^g_1, A^g_2, ..., A^g_{|A^g|}

where A^s = {A^s_1, A^s_2, ..., A^s_{|A^s|}} denotes a set of selected columns, A^a = {A^a_1, A^a_2, ..., A^a_{|A^a|}} denotes a set of aggregation columns (AGG can be COUNT, SUM, AVG, MAX, MIN), A^w is a set of selection conditions (VP denotes a value comparison operator in {<, ≤, >, ≥, =, ≠}, v_i denotes a value, and OP is either OR or AND), and A^g = {A^g_1, A^g_2, ..., A^g_{|A^g|}} is a set of grouping columns.

Given a relation T, a set of SQL queries q_1, q_2, ..., q_τ, and a system-defined relative error bound δ, we aim to build a synopsis S, which is a subtable of T with selected records, such that we can use the synopsis to answer any query q_i and the relative error is not larger than the given bound δ.

Definition 1 (Relative Error). Given a SQL query with aggregations AGG(A^a_1), AGG(A^a_2), ..., AGG(A^a_{|A^a|}), the relative error on an aggregation AGG(A^a_i) is

$$\text{relative error} = \begin{cases} \dfrac{|result - estimation|}{\epsilon} & \text{if } result = 0 \\[4pt] \dfrac{|result - estimation|}{|result|} & \text{if } result \neq 0 \end{cases} \quad (1)$$

where result is the true result of AGG(A^a_i), estimation is the result estimated using the synopsis, and ε is a small number close to 0.

Definition 2 (Bounded Synopsis Construction). Given a relation T, SQL queries q_1, q_2, ..., q_τ, and a system-defined relative error bound δ, select a minimal subset S of records based on which we compute an approximate answer for each q_i such that the relative error is within δ.

Table 1 shows a table with three categorical attributes (Market, Fruit and Profitable) and two numerical attributes (Revenue and Tax). Figure 1 illustrates five example queries. Suppose the error bound is 0.2. The highlighted records in Table 1 are selected to construct the synopsis in Table 2, and we can utilize the synopsis to answer these queries within the error bound.

Result Model. We compute the deterministic approximate result where the relative error is within an error bound δ with 100% confidence.

Definition 3 (Query Column Set). Given a query q_i, the query column set (QCS) of q_i is π_i = A^s ∪ A^a ∪ A^w ∪ A^g. π = ∪^τ_{i=1} π_i is the QCS of all queries.

Note that the synopsis we generate not only can answer the queries q_1, q_2, ..., q_τ, but also can answer any query whose QCS is a subset of some π_i within the error bound.

Remark. (1) Our techniques can be extended to support nested queries and queries with the Having clause. We omit the details due to space constraints. (2) We focus on a single table. Existing techniques for queries with multiple tables [20], [24] can be easily integrated into our method.
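To make Definition 1 concrete, the following is a minimal sketch of the relative-error computation; the function name and the concrete choice of ε are illustrative, not from the paper.

```python
def relative_error(result: float, estimation: float, eps: float = 1e-9) -> float:
    """Relative error of Definition 1; eps is the small constant used
    when the true result is 0 (its concrete value is an assumption)."""
    if result == 0:
        return abs(result - estimation) / eps
    return abs(result - estimation) / abs(result)

# A true AVG of 148.25 estimated as 147.92 (the q3 example in Section 3.2.1):
print(relative_error(148.25, 147.92))  # ~0.0022
```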
2.2 Workflow

Our method works in two steps: offline synopsis generation and online query processing. Figure 2 shows the architecture of our framework. The data can be stored in HDFS, HBase, or an RDBMS. We generate a unified synopsis S, store it in memory, and use it to answer online queries.

Offline Synopsis Generation. The synopsis generation component generates the synopsis offline. For ease of presentation, we first discuss how to generate the synopsis for each query (Section 3) and then combine them to generate a unified synopsis (Section 4). We devise effective algorithms to generate the unified synopsis (Section 4). To support large datasets, we develop effective algorithms on Spark to generate the unified synopsis (Section 5).

Online Query Processing. Given a query q whose QCS is a subset of some π_i, the query processing engine rewrites q to q′ and poses q′ to the synopsis S to compute the results. We will discuss how to rewrite the queries in Section 3. As our synopsis S is much smaller than the original table T, online query processing is very fast.

2.3 Related Work

Existing studies on approximate query processing can be broadly classified into two categories: AQP with a known query workload and AQP without a known query workload. Besides, we discuss the error estimation of AQP.
Fig. 1. Five Example Queries and Their Synopses.

(1) Query q1: SELECT COUNT(*) FROM T GROUP BY Market
    Rewrite q1′: SELECT SUM(SF) FROM s1 GROUP BY Market
    Synopsis s1 (Market, SF): (US, 8), (CN, 2), (FR, 2)
(2) Query q2: SELECT COUNT(*) FROM T WHERE Market = US and Profitable = Yes
    Rewrite q2′: SELECT SUM(SF) FROM s2 WHERE Market = US and Profitable = Yes
    Synopsis s2 (Market, Profitable, SF): (US, Yes, 4), (US, No, 4), (CN, Yes, 2), (FR, Yes, 2)
(3) Query q3: SELECT AVG(Tax) FROM T
    Rewrite q3′: SELECT SUM(SF*Tax)/SUM(SF) FROM s3
    Synopsis s3 (Tax, SF): (133, 5), (152, 5), (175, 2); s3+ (min, max, SF): (120, 144, 5), (145, 161, 5), (175, 180, 2)
(4) Query q4: SELECT MAX(Revenue) FROM T
    Rewrite q4′: SELECT MAX(Revenue) FROM s4
    Synopsis s4 (Revenue, SF): (1250, 4), (1420, 5), (1670, 2), (2130, 1); s4+ (min, max, SF): (1100, 1310, 4), (1380, 1580, 5), (1670, 1820, 2), (2130, 2130, 1)
(5) Query q5: SELECT AVG(Tax) FROM T WHERE Market = US and Fruit = apple
    Rewrite q5′: SELECT SUM(SF*Tax)/SUM(SF) FROM s5 WHERE Market = US and Fruit = apple
    Synopsis s5 (Market, Fruit, Tax, SF): (US, orange, 120, 2), (US, apple, 150, 4), (US, apple, 161, 2), (CN, banana, 152, 2), (FR, banana, 144, 2)

TABLE 1: Table T.
  (Market, Fruit, Profitable, Revenue, Tax)
  r1: US, orange, Yes, 1100, 120
  r2: US, orange, No, 1250, 140
  r3: US, apple, Yes, 1400, 145
  r4: US, apple, No, 1380, 150
  r5: CN, banana, Yes, 1420, 152
  r6: US, apple, Yes, 1670, 161
  r7: FR, banana, Yes, 1820, 144
  r8: US, apple, No, 1500, 175
  r9: FR, banana, Yes, 1310, 125
  r10: US, apple, Yes, 1580, 133
  r11: CN, banana, Yes, 1220, 180
  r12: US, apple, No, 2130, 154

TABLE 2: A Unified Synopsis for Five Queries.
  (Market, Fruit, Profitable, Revenue, Tax, SF2, SF4, SF5)
  US, orange, Yes, 1100, 120, 4, 4, 2
  US, apple, No, 1380, 150, 4, 5, 4
  CN, banana, Yes, 1420, 152, 2, 0, 2
  US, apple, Yes, 1670, 161, 0, 2, 2
  FR, banana, Yes, 1820, 144, 2, 0, 2
  US, apple, No, 2130, 154, 0, 1, 0

Fig. 2. Architecture of BAQ. (Users issue requests to the system and receive responses; the unified synopsis resides in the memory of the server; the data resides in storage; the inputs are the relational data, the query log, and the error bound.)

2.3.1 AQP With Known Query Workload

Given a query workload and a predefined error bound, existing algorithms aim to generate a synopsis offline and use the synopsis to answer online queries. There are two cases for the known query workload: query matching and QCS matching. The former answers online queries that exactly match a query in the query workload. The latter answers an online query if there exists a query in the workload whose QCS contains the online query's QCS.

(1) Query Matching: Wavelet, Sketch and Histogram use query matching. For each query, Histogram groups the numerical values into "buckets". For each bucket, Histogram computes summary statistics that can be used to approximately answer a query. Wavelet is conceptually close to Histogram, and it aims to compress the data by capturing the most expressive features. However, Wavelet requires decompression when answering queries. Sketch models a numerical column as a vector or matrix and transforms the data by a fixed matrix (depending on the query workload) to construct the synopsis. Sketch is suited to streaming data but not to general queries. These methods have several limitations. First, they need to know all the queries in advance. Second, they only focus on numerical columns. Third, they take much space to store the synopses, as they construct a synopsis for each query.

(2) QCS Matching: Sampling-based approximate query processing (SAQ) is widely used to support QCS matching. SAQ first selects samples for each QCS and then uses the samples to answer online queries [10], [27], [41]. It is challenging to select high-quality samples, and many methods have been proposed to generate them. A common method is random sampling [42]. However, this method can only provide a high confidence (e.g., 95%) that the result of using the sample to answer a query is within a given error bound; it cannot achieve 100% confidence. Moreover, it performs badly on skewed data, e.g., long-tail distributions. Besides, the size of groups and the values in each group may be highly skewed, making many traditional uniform-sampling-based methods unreliable [42]. Although stratified sampling (AQUA [35], START [36], BlinkDB [4], Babcock [5], Sample + seek [12]) can deal with sparse data, SAQ still has some limitations. (i) These methods generate samples for each QCS and cannot share the samples among different QCSs. If we simply union the independent samples for every attribute, the overall sampling rate will be prohibitively large.

Instead, our method BAQ shares samples among QCSs, and the synopsis size is rather small. (ii) SAQ ignores the long tails of data distributions, making it hard to effectively answer MAX/MIN queries, while our method supports MAX/MIN very well. (iii) To provide high confidence, SAQ uses Bootstrap [3], [17], [29], [44] to diagnose the results by using multiple samples. This approach can be computationally expensive and inaccurate for some queries.

2.3.2 AQP Without Query Workload

(1) Deterministic Approximate Query: Deterministic approximate queries (DAQ) [33] use the high-order bits of numerical data to do approximation, borrowing the idea from probabilistic databases [22], [23], [38]. However, DAQ cannot support general SQL aggregation queries with Where and Groupby while BAQ can. Moreover, DAQ maintains all records, and the performance improvement is limited.

(2) Online Aggregation: Online aggregation (OLA) provides the users with an approximate answer based on online estimation over the currently seen data [8], [16], [24], [28], [34]. The user can stop the query execution if she is satisfied with the answer. OLA has no guarantee on the performance and error bound, and it may take a long time and return bad results.

2.3.3 Error Estimation

(1) Error Estimation with Known Distribution. If we have a-priori knowledge about the data distribution or have enough samples to obtain the distribution, then we can regard error estimation as a parameter estimation problem. Many real-world datasets follow the normal distribution, and existing studies assume that the data follows the normal distribution [6].

(2) Error Estimation without Known Distribution. If the distribution is unknown, some resampling methods, e.g., Bootstrap, can be used. The key idea of Bootstrap is that, to use a sample S to replace the original dataset D, one can draw samples from S instead of D to compose the distribution of AGG(S) [44]. For each sample S, the estimated values of AGG(S) can be computed, and these values compose a distribution which can be used to estimate the aggregation result. Interested readers are referred to [17] for more details. Closed-form estimation is faster than resampling for some specific queries. In probability theory, the central limit theorem (CLT) establishes that, for independent random variables, the normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed. Thus, the distribution of AGG(S) can be approximated as N(AGG(S), σ²), where σ can be computed from the mean squared error Var(S). Computing Var(S) for a small dataset S is faster than brute-force resampling. However, this method only works for COUNT, SUM and AVG, and fails on queries whose variance is hard to compute, such as MAX, MIN or user-defined functions [3]. Large deviation bounds study how to compute the confidence interval in the worst case by estimating the minimal and maximal values while keeping the results not dominated by outliers.
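As an illustration of the closed-form CLT-based estimation sketched above (a sketch under our own naming, not the paper's code), one can approximate the distribution of AVG over a uniform sample and report a confidence interval:

```python
import math
import random

def clt_avg_estimate(sample, z=1.96):
    """CLT-based closed-form estimate for AVG: the sample mean is
    approximately N(mean, var/n), so a ~95% interval is mean +- z*sqrt(var/n)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, (mean - half, mean + half)

population = [random.gauss(100, 15) for _ in range(100_000)]
est, ci = clt_avg_estimate(random.sample(population, 1_000))
print(est, ci)  # e.g., ~100, (~99, ~101)
```

Note that such an interval holds only with high probability (e.g., 95%); this is exactly the gap that BAQ's 100%-confidence guarantee targets.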
2.4 Comparison with Existing Methods

We compare our method BAQ with existing methods in Table 3. (1) Both BAQ and SAQ assume that the QCSs of the query workload are stable. Compared with SAQ, BAQ generates a smaller unified synopsis, while the SAQ methods must generate larger samples offline, which costs too much storage. Besides, all existing SAQ methods perform badly on skewed data, especially for MAX and MIN, but BAQ is not sensitive to the data distribution. BAQ can give exact answers for categorical-only queries while SAQ cannot. Broadly speaking, BAQ can always give answers with smaller errors and higher confidence than SAQ. (2) Wavelet, Histogram and Sketch assume the queries are known in advance, while BAQ assumes the QCSs are known in advance. Wavelet, Histogram and Sketch only support numerical columns, but BAQ supports categorical columns, numerical columns, or both. (3) DAQ gives approximation using the high-order bits of attributes and thus needs to access all of the tuples, so the efficiency improvement of DAQ is limited and DAQ needs to store all of the original data. In contrast, BAQ just stores the unified synopsis and improves the efficiency a lot even as the number of tuples in the database increases.

TABLE 3: Comparison of BAQ with the state of the art. (×: Not support; √: Support such case; Exact: Exact result; 100%: 100% within δ. The last three columns give the error bound guarantee.)
  Method    | Workloads | Skewed Data | Categorical Only | No Numerical in WHERE | Numerical in WHERE
  SAQ       | QCS       | ×           | not 100%         | not 100%              | not 100%
  DAQ       | No        | √           | ×                | not 100%              | not 100%
  Wavelet   | Queries   | √           | ×                | not 100%              | not 100%
  Histogram | Queries   | √           | ×                | not 100%              | not 100%
  Sketch    | Queries   | √           | ×                | not 100%              | not 100%
  BAQ       | QCS       | √           | Exact            | 100%                  | not 100%

3 SINGLE QUERY

We study how to build a synopsis for a single query. Given a query q_i and its QCS π_i, we first study the case where π_i only contains categorical columns in Section 3.1. Then we discuss the case where π_i contains both categorical and numerical columns but the where clause does not contain the numerical columns in Section 3.2. Next we consider the case where the where clause contains numerical columns in Section 3.3.

3.1 Categorical Columns Only

Synopsis Definition. For simplicity, we first consider the simple case where the QCS of the query q_i contains only one column, i.e., |π_i| = 1. In this case, suppose the attribute is A. Let D[A] denote the set of distinct values in column A. For each value v ∈ D[A], we compute its frequency in table T, denoted by f[v]. We call f[v] the scale factor (SF) of v. We select (v, f[v]) for every v ∈ D[A] as the synopsis. For example, consider query q1 in Figure 1. There are 3 distinct values, and the scale factor of "US" is 8.

Then we consider |π_i| > 1. Let T[π_i] denote the subtable of T that only keeps the columns in π_i, and D[π_i] denote the set of distinct tuples in T[π_i]. For each tuple V ∈ D[π_i], we compute the frequency of tuple V in T[π_i], denoted by f[V]. We call f[V] the scale factor of V. We select (V, f[V]) for every V ∈ D[π_i] as the synopsis. For example, consider the SQL query q2 in Figure 1. There are 4 distinct tuples of (Market, Profitable), and the scale factor of tuple "(US, Yes)" is 4.

Definition 4 (Synopsis of q_i with Categorical Columns Only). The synopsis of query q_i is a table with |π_i| + 1 columns, including the |π_i| columns in π_i and a column SF. Each row contains a tuple V ∈ D[π_i] and its frequency f[V].

For example, q1 and q2 in Figure 1 only contain categorical columns, and the synopses s1, s2 are shown in the figure.

Synopsis Construction. We can easily generate the synopsis by scanning the records in T once. We scan the records in T and keep a hash table of the tuples in T[π_i]. If a tuple in T[π_i] appears in the hash table, we increase its frequency by 1; otherwise we add the tuple into the hash table and set its frequency to 1. After accessing all records in T, we generate the synopsis. The time complexity is O(N).

Query Rewriting. If q_i contains categorical columns only, AGG can only be COUNT, because SUM, MIN, AVG, MAX do not make sense on categorical columns. For any column A, we have COUNT(A) = SUM(SF). Thus, we can rewrite q_i by replacing each COUNT(A) with SUM(SF). For example, we show the rewritten queries in Figure 1.

Error Analysis. For a query q_i with only categorical columns, we can use the synopsis s_i to exactly answer the queries with the same QCS as q_i (no error), as stated in Lemma 1.

Lemma 1. For each query q_i with only categorical columns, the result of using the synopsis s_i to answer the queries with the same QCS as q_i is exactly the same as the result of using the original data to answer the queries.

Proof 1. Due to space constraints, we put all proofs in our technical report¹.

1. http://dbgroup.cs.tsinghua.edu.cn/ligl/baq.pdf
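The single-scan construction above is essentially frequency counting with a hash table. Here is a minimal sketch (assuming records are dicts keyed by column name; names are illustrative):

```python
from collections import Counter

def categorical_synopsis(table, qcs):
    """One scan over T: count the frequency (scale factor SF) of each
    distinct tuple projected on the QCS columns. O(N) time."""
    counts = Counter(tuple(record[c] for c in qcs) for record in table)
    return [tup + (sf,) for tup, sf in counts.items()]

table = [{"Market": "US", "Profitable": "Yes"},
         {"Market": "US", "Profitable": "Yes"},
         {"Market": "CN", "Profitable": "Yes"}]
print(categorical_synopsis(table, ["Market", "Profitable"]))
# [('US', 'Yes', 2), ('CN', 'Yes', 1)]
```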
3.2 No Numerical Columns in The Where Clause

We first consider the case where π_i only contains numerical columns in Section 3.2.1, and then discuss the case where π_i contains both numerical and categorical columns in Section 3.2.2.

3.2.1 Numerical Columns Only

Obviously, the groupby clause does not have numerical columns; thus in this case, the where clause is empty and the select clause has only numerical columns.

(1) Only One Numerical Column. Values in numerical columns are usually continuous, and it is expensive and impractical to select distinct values for numerical columns. For example, for the Tax column, the number of distinct values may be the same as the number of records in the table. To address this issue, we propose to partition the numerical values in column A into a set of disjoint groups. Suppose the values in column A are v_1, v_2, ..., v_N. Without loss of generality, we assume that the values are sorted (if not, we first sort them), i.e., v_j ≤ v_k for j < k. Next we discuss how to generate the groups.

Definition 5 (Numerical Value Grouping). We partition the values into x groups, G[1] = {v_1, ..., v_{j_1}}, G[2] = {v_{j_1+1}, ..., v_{j_2}}, ..., G[x] = {v_{j_{x-1}+1}, ..., v_N}, where the relative error between the smallest value and the largest value in each group is not larger than δ, i.e., (v_{j_t} − v_{j_{t−1}+1}) / v_{j_{t−1}+1} ≤ δ for 1 ≤ t ≤ x, with j_0 = 0 and j_x = N.

As the relative error between any two values in each group is not larger than δ, we can select any value as a pivot to represent the values, and the relative error of the other non-selected values to the pivot is not larger than δ.

Grouping Strategy. We first argue that we can find a strategy to generate the groups. A trivial case is that each value forms a group, giving N groups. However, this method involves a huge number of groups. For example, if we use each of the distinct values in column Tax as a group, there will be 12 groups. Thus, we aim to find a grouping strategy that minimizes the number of groups.

Minimizing The Number of Groups. We propose a scan-based algorithm to minimize the number of groups; a code sketch is given at the end of this subsection. Firstly, we initialize the first group and add the first value v_1 into it. Then we consider the next value v_2. If v_2 ≤ v_1 * (1 + δ), we add v_2 into the first group; otherwise we generate a new group and put v_2 into the new group. Iteratively, we generate all the groups. For example, consider the column Tax. We first sort the values in Tax. Suppose the relative error bound is δ = 0.2. We add the first value 120 into the first group and then insert 125, 133, 140, 144 into the first group. We find that 145 > (1 + 0.2) * 120 = 144, so we generate a new group and put 145 into the second group. Iteratively, we partition the values into 3 groups.

Lemma 2. The scan-based algorithm minimizes the number of groups.

Synopsis Definition. We partition the numerical values into groups G[1], G[2], ..., G[x]. For each group G[p] (p ∈ [1, x]), we select a pivot v ∈ G[p] and take the size of G[p] as the scale factor of v. We select (v, |G[p]|) as the synopsis.

Synopsis Construction. Given a query with numerical column A, we first sort the values and generate the groups for the column. Then we select a pivot from each group to generate the synopsis. The complexity is O(N log N).

Query Rewriting. If q_i contains numerical columns only, AGG can be COUNT, MIN, MAX, SUM, AVG. We can rewrite query q_i by replacing AGG(A) as below:
• COUNT(A) → SUM(SF)
• MAX(A) → MAX(A)
• MIN(A) → MIN(A)
• SUM(A) → SUM(A*SF)
• AVG(A) → SUM(A*SF) / SUM(SF)

Error Analysis. We discuss the errors for the different functions.
(1) COUNT(A). There is no error because we can use the scale factor to exactly count the number.
(2) MAX(A) and MIN(A). The pivot and the correct answer must be in the same group. According to Definition 5, the error of computing MAX(A) and MIN(A) is within δ.
(3) SUM. Let v_i denote a value and v_i′ the pivot used to represent v_i. The relative error of SUM is

$$\frac{\left|\sum_{i=1}^{N} v_i - \sum_{i=1}^{N} v_i'\right|}{\left|\sum_{i=1}^{N} v_i\right|} = \frac{\left|\sum_{i=1}^{N} (v_i - v_i')\right|}{\left|\sum_{i=1}^{N} v_i\right|} \le \frac{\sum_{i=1}^{N} |v_i - v_i'|}{\sum_{i=1}^{N} |v_i|} \le \frac{\sum_{i=1}^{N} \delta |v_i|}{\sum_{i=1}^{N} |v_i|} = \delta$$

(4) AVG. The relative error of AVG is

$$\frac{\left|\left(\sum_{i=1}^{N} v_i - \sum_{i=1}^{N} v_i'\right)/N\right|}{\left|\left(\sum_{i=1}^{N} v_i\right)/N\right|} = \frac{\left|\sum_{i=1}^{N} v_i - \sum_{i=1}^{N} v_i'\right|}{\left|\sum_{i=1}^{N} v_i\right|} \le \delta$$

Thus the relative errors of all functions are within δ.

For example, consider q3 in Figure 1. The rewritten query q3′ and synopsis s3 are shown in the figure. We partition the numbers in Tax and add the scale factor. We answer q3′ by computing (133*5 + 152*5 + 175*2) / 12 = 147.92. The real answer is 148.25, and the relative error is |147.92 − 148.25| / 148.25 = 0.0022.

(2) Multiple Numerical Columns. We partition each numerical column into groups and store them independently. The query rewriting is the same as for queries with a single numerical column. The relative errors are still within δ.
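The scan-based grouping of Lemma 2, as a minimal sketch: sort once, then open a new group whenever a value exceeds (1 + δ) times the smallest value of the current group. Names are illustrative.

```python
def minimal_groups(values, delta):
    """Partition values into the fewest groups such that within each
    group (max - min) / min <= delta (Definition 5)."""
    groups = []
    for v in sorted(values):
        # groups[-1][0] is the smallest value of the current group
        if groups and v <= groups[-1][0] * (1 + delta):
            groups[-1].append(v)
        else:
            groups.append([v])  # v opens a new group
    return groups

tax = [120, 140, 133, 145, 150, 154, 161, 175, 152, 180, 125, 144]
print(minimal_groups(tax, 0.2))
# [[120, 125, 133, 140, 144], [145, 150, 152, 154, 161], [175, 180]]
```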

3.2.2 Both Numerical and Categorical Columns

For simplicity, we first introduce some notations. Let π_i^c denote the set of categorical columns in π_i and π_i^n denote the set of numerical columns in π_i. Let D[π_i^c] denote the set of distinct tuples in T[π_i^c] and G[π_i^n] denote the set of group combinations for the numerical columns in π_i^n.

We propose two methods to generate the synopsis. The first is a combination method that conducts a Cartesian product on D[π_i^c] and G[π_i^n] as follows. For each tuple V in D[π_i^c] and each group G in G[π_i^n], it generates a pair (V, G) and counts the number of records in T whose π_i^c tuple is V and whose π_i^n value is in group G, denoted by f(V, G). If f(V, G) > 0, we add (V, G, f(V, G)) into the synopsis. For example, consider the SQL query q5 in Figure 1 with two categorical columns and a numerical column. Suppose we group the numerical column with δ = 0.2; it is divided into 3 groups b1, b2 and b3. There are 3 distinct values in each categorical column, so there are 27 combinations. Since some of the combinations contain no tuples, we generate 7 tuples in the synopsis, as shown in Figure 3(a).

The second method first enumerates all the distinct tuples of D[π_i^c]. Then, for each tuple V in D[π_i^c], we consider the subtable of T whose π_i^c tuple is V, i.e., T[π_i^c = V], and group these tuples based on the numerical values on π_i^n. We still generate the minimum number of groups. For each group G, we take the size of the group on T[π_i^c = V] as the scale factor, denoted by f(G, π_i^c = V). We add (V, G, f(G, π_i^c = V)) into the synopsis. For example, in Figure 3(b), if we group the table according to the categorical columns, it is divided into the (US,orange), (US,apple), (CN,banana), (FR,banana) groups. We set δ = 0.2. The tuples in the group (US,apple) are divided into b5 and b6. This method generates 5 groups, which is fewer than the first method; a sketch of this per-category method is given after this subsection.

Fig. 3. Examples of Two Grouping Methods. ((a) Method 1: Cartesian product of the distinct categorical tuples and the numerical groups b1 = [120,144], b2 = [145,161], b3 = [175,180]; (b) Method 2: per-category numerical groups b4–b8, e.g., b5 and b6 for (US, apple).)

Lemma 3. With the same relative error bound, the second method generates a synopsis no larger than the first.

In this case, the query rewriting is the same as in the previous sections. The complexity of generating the synopsis is O(|π_i^n| N log N), where |π_i^n| is the number of numerical columns in the query.
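A sketch of the second (per-category) method, reusing minimal_groups from the sketch in Section 3.2.1; the pivot choice (the group's first value) and the helper names are illustrative:

```python
from collections import defaultdict

def per_category_synopsis(table, cat_cols, num_col, delta):
    """Method 2: partition records by their categorical tuple, then run
    the minimal grouping on the numerical values inside each category."""
    by_cat = defaultdict(list)
    for record in table:
        by_cat[tuple(record[c] for c in cat_cols)].append(record[num_col])
    synopsis = []
    for cat_tuple, values in by_cat.items():
        for group in minimal_groups(values, delta):
            # any value of the group can serve as the pivot; SF = |group|
            synopsis.append(cat_tuple + (group[0], len(group)))
    return synopsis
```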
3.3 The Where Clause Has Numerical Columns

In this section, we focus on the case where the where clause contains numerical columns, e.g., A > 100. We still use the above methods to generate the synopsis, but the way of using the synopsis to answer a query is different.

3.3.1 Only One Numerical Column in Where Clauses

We consider the case where there is only one numerical column A in the where clause, e.g., A > v; our method can be easily extended to support other cases, e.g., A < v. Suppose there are x groups for A, G[1], G[2], ..., G[x]. We consider the following cases.
(1) There is no group satisfying A > v, i.e., the maximal value of G[x] is smaller than v. Then there is no tuple satisfying the query, and our method can exactly answer the query.
(2) There is only one group satisfying A > v, i.e., the maximal value of G[x] is larger than v but the minimal value of G[x] is smaller than v. In this case, some tuples in the group satisfy the query, and G[x] is called a partially satisfied group.
(3) There is more than one group satisfying A > v, i.e., the maximal value of some G[j] is larger than v and the minimal value of G[j] is smaller than v. Then G[j] is a partially satisfied group and every G[k] with k > j is a fully satisfied group (as all tuples in G[k] satisfy the query condition).
(4) All groups satisfy A > v, i.e., the minimal value of G[1] is larger than v. Our method can exactly answer the query.

For (1) and (4), we just use the same rewritten query and synopsis to answer the query. For (2) and (3), we process the fully satisfied groups and the partially satisfied group separately and then sum up the results of the two cases.

Partially Satisfied Group. For the partially satisfied group (there is at most one), we need to estimate how many tuples in the group satisfy the query condition. Suppose G is the partially satisfied group with scale factor G.f. Let G.min and G.max denote the minimal and maximal values in the group. Then the number of tuples satisfying the query condition is estimated as ⌊G.f * (G.max − v) / (G.max − G.min)⌋. To this end, we maintain a table s_i^+, which keeps three columns A_m, A_M, A_f to store each group's minimum value, maximum value and scale factor.

Fully Satisfied Group. For a fully satisfied group, we still use the rewritten query to get the answer, and the error is also within the bound δ. For example, consider the following query:

SELECT AGG(Tax) FROM T WHERE Tax > 134

We can use the table s3+ in Figure 1 to answer this query. The groups (145, 150, 152, 154, 161) and (175, 180) are fully satisfied, and (120, 125, 133, 140, 144) is partially satisfied. We store the minimal and maximal values in s3+ as [120,144], [145,161], [175,180], and the scale factors are 5, 5 and 2 respectively. In the partially covered group, the scale factor is 5, so the number of tuples satisfying the query is estimated as ⌊5 * (144 − 134) / (144 − 120)⌋ = 2.
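A sketch of this estimation for a predicate A > v, over an s+ table of (min, max, SF) triples (names are illustrative):

```python
import math

def estimate_count_gt(groups, v):
    """groups: list of (g_min, g_max, sf) per numerical group.
    Estimate the number of records whose value is > v."""
    count = 0
    for g_min, g_max, sf in groups:
        if g_min > v:                       # fully satisfied group
            count += sf
        elif g_min <= v <= g_max:           # partially satisfied group
            count += math.floor(sf * (g_max - v) / (g_max - g_min))
    return count

s3_plus = [(120, 144, 5), (145, 161, 5), (175, 180, 2)]
print(estimate_count_gt(s3_plus, 134))  # 2 + 5 + 2 = 9
```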

Next we give the details on how to utilize this technique to rewrite a query.

(1) MAX and MIN:
Query Rewriting. When AGG is MAX or MIN and the result is in a partially satisfied group, we directly use MAX(A) and MIN(A) to rewrite the query without any change:
• MAX(A) → MAX(A), MIN(A) → MIN(A)
Error Analysis. It is obvious that the result is within δ.

(2) COUNT:
Query Rewriting. The rewritten query contains two parts. Firstly, we compute the value G.max of the partially satisfied group and the estimated number of tuples satisfying A > v in the partially satisfied group by

SELECT A_M, ⌊SF * (A_M − v) / (A_M − A_m)⌋ FROM s+ WHERE A_m ≤ v ≤ A_M

Secondly, we compute the number of tuples satisfying A > v in the fully satisfied groups by:

SELECT SUM(SF) FROM s+ WHERE A_m > G.max

The sum of the results of these two queries is the answer to the count query. In the above example query, we estimated 2 tuples in the partially satisfied group, and we know that there are 5+2=7 tuples in the fully satisfied groups. Then the answer is 7+2=9, which is the same as the exact answer.
Error Analysis. There is no error in the fully satisfied groups. The error in the partially satisfied group is small. Fortunately, the partially satisfied group is always much smaller than the fully satisfied groups, so the answer is reliable.

(3) SUM:
Query Rewriting. The rewritten query contains two parts. We compute the maximum value G.max and the SUM in the partially satisfied group by

SELECT A_M, A_M * ⌊SF * (A_M − v) / (A_M − A_m)⌋ FROM s+ WHERE A_m ≤ v ≤ A_M

We compute the result in the fully satisfied part as:

SELECT SUM((A_m + A_M) / 2 * SF) FROM s+ WHERE A_m > G.max

Then we add them up. In the above example query, the sum in the partially satisfied group is 144*2 = 288, and the sum in the fully satisfied groups is (145+161)/2 * 5 + (175+180)/2 * 2 = 1120. The final result is 1120+288 = 1408. The exact answer is 1401.0, and the relative error is only 0.0049.
Error Analysis. The error in the fully satisfied groups is within δ. The error in the partially satisfied group is small.

(4) AVG:
Query Rewriting. We can compute the answer by dividing the result in (3) by the result in (2). For the above example, if the aggregation function is AVG, we just use 1408 / 9 = 156.44. The exact answer is 155.67, and the relative error is 0.0050.
Error Analysis. The AVG in the fully satisfied groups will be within the error bound. The additional error is caused by the partially satisfied group when counting the number of satisfying records, and this error is very small.

3.3.2 Multiple Numerical Columns in Where Clauses

When the where clause contains multiple numerical columns, e.g., A_1 > v_1 and A_2 > v_2, there may be many fully satisfied groups and one partially satisfied group for each predicate. We first identify the partially satisfied groups and fully satisfied groups. Then we consider the four combinations of the two columns (the method can be easily extended to support more than 2 predicates): partially/partially, partially/fully, fully/partially, and fully/fully. For fully/fully, we use the rewritten query and the error is within the bound. For partially/fully or fully/partially, we use the methods in Section 3.3.1 to estimate the results. For partially/partially, we assume the two columns are independent and estimate the answer.

Remark. Even though partially satisfied groups in the where clause cause additional errors according to the theoretical analysis, the experimental results indicate that these errors are very small. We show the details in Section 6.

4 MULTIPLE QUERIES

In this section, we study how to build a synopsis for multiple queries. A straightforward method builds a synopsis for each QCS and then uses multiple synopses to answer queries. Obviously, this method generates many synopses and incurs high space and time cost. We find that some synopses are redundant, and we discuss how to detect and eliminate the redundant synopses in Section 4.1. Then we propose to combine the multiple synopses and generate a unified synopsis to answer multiple queries. We discuss how to minimize the synopsis size for multiple queries in Section 4.2. Since the problem of minimizing the synopsis size is NP-hard, we propose two approximate algorithms in Section 4.3.
We discuss how to answer newly arriving queries and how to incrementally update the synopsis in Section 4.4.

4.1 Redundant Synopsis Detection

Consider two queries q_i, q_j and their QCSs π_i, π_j with π_i ⊆ π_j. Then q_j has more attributes than q_i and generates a finer-granularity synopsis than q_i. We find that the synopsis s_j of q_j can be used to answer q_i as follows. We still use the same method to rewrite query q_i as q_i′. Then we pose query q_i′ to synopsis s_j and get a result A(s_j, q_i′). We can prove that the error of the result A(s_j, q_i′) is also within δ, as stated in Lemma 4, because if there is a tuple in the synopsis of q_i, then there must be one or more corresponding tuples in the synopsis of q_j.

Lemma 4. Given two queries q_i, q_j and their QCSs π_i, π_j, if π_i ⊆ π_j, the error of using s_j to answer q_i is within δ.

For example, consider q1 and q2 in Figure 1. π_1 = {Market} ⊆ {Market, Profitable} = π_2. s2 can be used to answer q1, and q1 in Figure 1 is equivalent to

SELECT COUNT(*) FROM T WHERE Profitable = Yes OR Profitable = No GROUP BY Market

The scale factor of 'US' is 8 in s1, and it is 4+4=8 in s2.

Lemma 5. Given a query q_i in the query workload and an online query q, if q's QCS is a subset of π_i, the error of using s_i to answer q is within δ.

Definition 6. Given a set of queries, if π_i ⊆ π_j, the synopsis of π_i is redundant.

We aim to eliminate all redundant synopses. To this end, we enumerate every pair of queries (q_i, q_j). If the QCS of q_i is a subset of the QCS of q_j, we do not keep the synopsis of q_i. For example, consider the five queries q1 to q5 in Figure 1. The QCSs are {Market, (Market, Profitable), Tax, Revenue, (Market, Fruit, Tax)}. We can eliminate Tax and Market because Market ⊆ (Market, Profitable) and Tax ⊆ (Market, Fruit, Tax). A sketch of this elimination step is given below.

Fig. 4. Unified Synopsis Generation Algorithms. ((a) Scan; (b) Greedy: records r1–r12 of T mapped to the tuples of s2, s4, s5 they cover.)
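A sketch of the elimination step of Definition 6: drop every QCS that is a strict subset of another (names are illustrative):

```python
def eliminate_redundant(qcss):
    """Keep only the QCSs that are not strict subsets of another QCS."""
    sets = [frozenset(q) for q in qcss]
    return [q for q in sets if not any(q < other for other in sets)]

qcss = [{"Market"}, {"Market", "Profitable"}, {"Tax"},
        {"Revenue"}, {"Market", "Fruit", "Tax"}]
print(eliminate_redundant(qcss))
# keeps {Market, Profitable}, {Revenue}, {Market, Fruit, Tax}
```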

4.2 Constructing A Unified Synopsis

We aim to generate a single synopsis and utilize it to answer online queries within the threshold δ.

Synopsis Definition. Without loss of generality, suppose q_1, q_2, ..., q_τ are the queries after removing the redundant ones. We use π to denote the set of distinct columns of these queries. The synopsis has |π| + τ columns, including the columns in π and a scale factor column for each query. For each synopsis s_i of query q_i, consider the j-th tuple s_i[j] and its corresponding scale factor f_i^j in the synopsis. The k-th record r[k] in the original table T covers s_i[j] if (1) s_i[j] and r[k] have the same values on the categorical columns in q_i, and (2) s_i[j] and r[k] are in the same group in each numerical column. If there are multiple records covering s_i[j], we assign the scale factor SF_{π_i} of one tuple as f_i^j and those of the others as 0. For ease of presentation, we use "record" to refer to the data in the table T and "tuple" to refer to the data in a synopsis.

For example, consider the 3 queries q2, q4, q5 after removing the redundancy. π = π_2 ∪ π_4 ∪ π_5 = (Market, Fruit, Profitable, Revenue, Tax). We use the columns in π and 3 scale factor columns for the three queries to construct the unified synopsis shown in Table 2. Considering the first tuple s5[1] of s5 in Figure 1, the first record r1 in Table 2 covers s5[1] because they have the same values on the categorical columns {Market, Fruit}, and in the numerical column Tax, the two values (120 and 120) are in the same group.

Synopsis Construction. We want to select the minimum number of records in the unified synopsis S to cover all tuples in each query synopsis, such that we can utilize S to answer every query. However, this problem is NP-hard, by a reduction from the minimum set cover problem [21], as stated in Lemma 6.

Lemma 6. Selecting a unified synopsis to cover all the query synopses with the minimum size is NP-hard.

Lemma 6 implies that we need to find effective approximation algorithms. We introduce two heuristic algorithms in the following subsection.

Query Processing with the Unified Synopsis. Given a query q, if its columns are contained in some QCS π_i, we use S to answer the query, using the same rewritten query q_i′, within the bound δ; otherwise, we use the original data to answer the query.

Lemma 7. The answer of the rewritten query q_i′ of q_i on the unified synopsis S is the same as the answer of q_i′ on the synopsis s_i (within the error threshold δ).

For example, consider q5 in Figure 1. The exact result of q5 on T is 153.0. When we use q5′ on the synopsis s5, the result is 153.67 and the error is (153.67 − 153.0) / 153.0 = 0.0044.

Remark. We cannot construct a synopsis for π = ∪_{i=1}^τ π_i, which is a superset of each π_i. In a real workload, the queries may be related to many columns, and if we enumerate all the possible values of π, there will be a large number of groups and the size of the synopsis will be close to the original table T. Moreover, it is unnecessary to construct a synopsis for π, because if no query contains two given columns of π, we do not need to combine them. For example, if we use the method in Section 3.2.2 to construct a synopsis for π = ∪_{i=1}^5 π_i = (Market, Fruit, Profitable, Revenue, Tax), we first partition the table into 6 categories according to the 3 categorical columns and finally get a synopsis with 9 tuples after grouping each category. Additionally, we should add 3 columns for the scale factors. This is close to the size of the original table T (12 records), so we cannot use this strategy to construct a synopsis.

4.3 Synopsis Finding Algorithms

We propose approximate algorithms to generate the synopsis. We first introduce a scan-based algorithm, which is very fast but has a large synopsis size. We then introduce a greedy algorithm, which is slower but has a smaller synopsis size.

4.3.1 Scan-based Algorithm

We first generate the query synopsis for each query. Then we propose a scan-based method to find a set of records that covers each tuple in the query synopses; a sketch is given below. We use a flag to keep the status of each tuple s_i[j] of synopsis s_i. Initially, all of the flags are set to false. We scan the table T, and for each record r_k, we enumerate each synopsis s_i. If r_k covers a tuple s_i[j] in s_i with a false flag, we add r_k into the synopsis S and set the flag of each s_i[j] covered by r_k to true, where 1 ≤ j ≤ |s_i|. (Note that if r_k covers multiple tuples of different queries, we only add r_k into S once.) If all the flags are true, the algorithm terminates. Note that this algorithm must terminate, because for each tuple s_i[j] in a synopsis, we can select the corresponding record (the one the tuple was generated from), which must cover this tuple.
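A minimal sketch of the scan-based algorithm; it assumes a covers(record, tuple) predicate implementing the covering test of Section 4.2 (names are illustrative):

```python
def scan_unified_synopsis(table, synopses, covers):
    """synopses: one list of tuples per query. Scan T once and keep any
    record that covers a still-uncovered synopsis tuple. O(MN) time."""
    uncovered = {(i, j) for i, s in enumerate(synopses) for j in range(len(s))}
    unified = []
    for record in table:
        hit = [(i, j) for (i, j) in uncovered if covers(record, synopses[i][j])]
        if hit:
            unified.append(record)          # a record is added at most once
            uncovered.difference_update(hit)
        if not uncovered:
            break
    return unified
```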
In next step, we scan the records in again and identify T the tuples with the coverage of τ 1, and put them into the 5.1 Repartition-based Method unified synopsis. Iteratively, we generate− the synopsis. We first generate query synopsis s for each query q and For example, consider the example in Table 1. We i i then combine them to generate the unified synopsis. scan the table 3 times. In the first iteration, we com- Generating Query Synopsis si. (1) For the query with pute the coverage of each record. The first record r1 = ‘US’, ‘orange’, ‘Yes’, ‘1100’, ‘120’ in covers ‘US’, categorical columns only, we first select the distinct tuples { } T { on these categorical columns in each partition and then ‘Yes’ of q2, ‘1250’ of q4 and ‘US’, ‘orange’, ‘120’ of } { } { } merge them together. We can prove that this method can q5, and the coverage is 3. The coverage of r2 is 1 and that correctly generate the query synopsis. (2) For the query of r3 is 2. We find r1, r4 and r7 with coverage 3 in the first iteration. Thus we first select them. Iteratively, we select 6 with numerical columns only, we need to sort the numerical records, which has the same size as the optimal synopsis as values. To sort the values, we need to repartition the records shown in Figure 4(b). to put the close numerical values into the same partition, The algorithm iterates at most τ times. For each iteration, generate the numerical groups in each partition and then we will scan the table to find the records with the most collect them together. (3) For the query with both numerical uncovered tuples in different queries. For each record, we and categorical columns, we need to reparation the records should test how many tuples it covers. So the complexity of based on the distinct tuples on the categorical columns, and the greedy algorithm is (τMN). put the records with the same distinct tuple into the same O partition. Then we generate the synopsis in each partition 4.4 Answer New Queries and Update the Synopsis and collect them together. For cases (2) and (3), we need to repartition the records, which is rather expensive. Considering a new query q, if its columns are contained by a QCS πi, we use to answer the query using the same Generating Unified Synopsis . We distribute the query S synopsis s to each partition. Then,S we select the records rewritten query qi0 within the bound δ; otherwise, we cannot i use the synopsis to answer the query and instead we use from each partition with the largest coverage τ and merge the original data toS answer the query. them together. We update the tuple flag in the query syn- If there are some QCSs that cannot be answered by the opsis if the tuple is covered by the selected records and synopsis, we need to update the synopsis . However, it distribute the updated information to each partition. Next is expensive to update online. To addressS this issue, we we select the record from each partition with the coverage S τ 1 and merge them together. We repeat the above steps update the synopsis in a delay manner, for example, when − the server is not busy, we update all QCSs that cannot be until all the tuples in each query synopsis are covered. QCS For example, in Figure 5(a), suppose that the table is answered by the synopsis. Formally, for each such πi, T we generate its synopsis i, merge i with the synopsis stored on two partitions P1 and P2. P1 stores records r1 to r6 S S , and update using the merged results as discussed in and P2 stores records r7 to r12. 
4.4 Answer New Queries and Update the Synopsis

Consider a new query q. If its columns are contained in some QCS π_i, we use S to answer the query, using the same rewritten query q_i′, within the bound δ; otherwise, we cannot use the synopsis S to answer the query, and instead we use the original data to answer the query.

If there are some QCSs that cannot be answered by the synopsis, we need to update the synopsis S. However, it is expensive to update S online. To address this issue, we update the synopsis in a delayed manner: for example, when the server is not busy, we update all QCSs that cannot be answered by the synopsis. Formally, for each such QCS π_i, we generate its synopsis S_i, merge S_i with the synopsis S, and update S using the merged results as discussed in Section 4.3.

5 DISTRIBUTED ALGORITHMS

We study how to extend our method to support big data. We first study how to generate the same unified synopsis as the greedy algorithm in Section 5.1. As this method is rather expensive, we propose an efficient merge-based algorithm to generate the unified synopsis in Section 5.2.

5.1 Repartition-based Method

We first generate the query synopsis s_i for each query q_i and then combine them to generate the unified synopsis.

Generating Query Synopsis s_i. (1) For a query with categorical columns only, we first select the distinct tuples on these categorical columns in each partition and then merge them together. We can prove that this method correctly generates the query synopsis. (2) For a query with numerical columns only, we need to sort the numerical values. To sort the values, we need to repartition the records to put close numerical values into the same partition, generate the numerical groups in each partition, and then collect them together. (3) For a query with both numerical and categorical columns, we need to repartition the records based on the distinct tuples on the categorical columns, putting the records with the same distinct tuple into the same partition. Then we generate the synopsis in each partition and collect them together. For cases (2) and (3), we need to repartition the records, which is rather expensive.

Generating the Unified Synopsis S. We distribute the query synopsis s_i to each partition. Then we select the records from each partition with the largest coverage τ and merge them together. We update a tuple's flag in the query synopsis if the tuple is covered by the selected records and distribute the updated information to each partition. Next we select the records from each partition with coverage τ−1 and merge them together. We repeat the above steps until all the tuples in each query synopsis are covered.

Fig. 5. Distributed Algorithms. ((a) Repartition: greedy selection over partitions P1 (r1–r6) and P2 (r7–r12) against the synopses s2, s4, s5; (b) Merge: the local Tax groups for (US, apple), [133,145] on P1 and [140,161], [175,175] on P2, are refined into the global groups [133,159] and [160,175].)

For example, in Figure 5(a), suppose that the table T is stored on two partitions P1 and P2. P1 stores records r1 to r6 and P2 stores records r7 to r12. We only consider queries q2, q4 and q5 after eliminating redundant synopses.

For q2, with categorical columns only, we generate the local synopses on P1 and P2 as {{(‘US’,‘Yes’), 3}, {(‘US’,‘No’), 2}, {(‘CN’,‘Yes’), 1}} and {{(‘FR’,‘Yes’), 2}, {(‘US’,‘Yes’), 1}, {(‘US’,‘No’), 2}, {(‘CN’,‘Yes’), 1}}. Then we merge them to generate the synopsis s2. Consider the more general case q5 with both categorical and numerical columns. We repartition the records in T based on the categorical values as in Figure 3(b): we store the 6 records with {‘US’, ‘apple’} on P1 and the other 6 records on P2. Then we sort the numerical values and generate 2 groups on P1 and 3 groups on P2, as in Figure 5(a). We distribute s2, s4 and s5 to each of the partitions. We select r1 and mark (‘US’,‘Yes’) in s2, (1250) in s4 and (‘US’,‘orange’,120) in s5 as covered. Then we find the next record with the maximum coverage. Iteratively, we select 6 records, which has the same size as the optimal synopsis.

5.2 Merge-Based Method

The repartition-based method is rather expensive because (1) it requires repartitioning the data, leading to a huge amount of data transmission, and (2) it needs to generate the query synopses and distribute them to every partition. To address these issues, we propose a merge-based algorithm, which generates a local synopsis on each partition and then merges them to generate the global synopsis.

We first consider the queries with categorical columns only. We use the greedy algorithm to generate the local synopsis on each partition. Then we collect the local synopses and merge them by running the greedy algorithm to generate the global synopsis. However, for queries with numerical columns, it is hard to merge the local synopses, because different partitions may generate different groups. Considering the example in Section 5.1, when we group the values in Tax for category {‘US’, ‘apple’} for q5, if one partition has one group [145,161] while another partition has two groups [133,154] and [175,175], they cannot be merged: [133,154] and [145,161] overlap, but combining them, i.e., [133,161], is invalid as (161 − 133) / 133 > 0.2. So it is hard to merge the numerical groups.

To merge the groups, we need to use the same grouping strategy on different partitions. To achieve this goal, for a query with numerical columns, we first generate the local groups on each partition, then refine them to generate the global groups, and ask each partition to generate the query synopsis based on these groups. Note that each partition generates a set of groups, which must contain its minimum and maximum values. Thus we can get the global minimum value v_0 and maximum value v_N from these groups. Then we generate the global groups as follows. First, we initialize an empty global group set Φ = ∅. We initially add a group [v_0, (1+δ)v_0] and check whether there exists a local group g_l such that g_l ∩ [v_0, (1+δ)v_0] ≠ ∅. If yes, we add [v_0, (1+δ)v_0] to Φ; otherwise we discard it. Then we add a group starting at (1+δ)v_0 + 1 and check whether it overlaps with the local groups. Iteratively, we generate all the global groups; a sketch is given below. We prove that the number of groups generated is at most twice the optimal group number.

Lemma 8. The number of groups found by the merge-based method is at most twice the optimal group number.

Based on the groups, we can generate the local synopsis for each query. Then we collect the local synopses and merge them by the greedy algorithm. As the local synopses use the same groups, we can easily merge them.

Considering the above example, in Figure 5(b), we group the values in Tax on category {‘US’, ‘apple’} for q5. P1 has one group [133,145] while P2 has two groups [140,161] and [175,175]. We can generate groups between the minimum 133 and the maximum 175. First, we generate [133,159] and find [133,159] ∩ [133,145] = [133,145] ≠ ∅. Then we find [160,175] ∩ [140,161] ≠ ∅. So we finally get the two groups [133,159] and [160,175]. Next we generate the local synopsis on P1, i.e., r1, r4, r5, r6, and the local synopsis on P2, i.e., r7, r8, r9, r10, r11, r12. We use the greedy method on them to generate a unified synopsis with r1, r4, r5, r6, r7, r12, which has the same size as the optimal synopsis.
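A sketch of the global group generation, assuming integer values as in the example (the cap at the global maximum mirrors the [160,175] group above; names are illustrative):

```python
import math

def global_groups(local_groups, delta):
    """local_groups: (lo, hi) intervals collected from the partitions.
    Propose groups of relative width delta starting at the global
    minimum and keep those overlapping at least one local group."""
    v_min = min(lo for lo, _ in local_groups)
    v_max = max(hi for _, hi in local_groups)
    groups, lo = [], v_min
    while lo <= v_max:
        hi = min(math.floor(lo * (1 + delta)), v_max)
        if any(l <= hi and lo <= h for l, h in local_groups):
            groups.append((lo, hi))
        lo = hi + 1  # the next candidate group starts just above
    return groups

# P1 has [133,145]; P2 has [140,161] and [175,175] (the q5 example):
print(global_groups([(133, 145), (140, 161), (175, 175)], 0.2))
# [(133, 159), (160, 175)]
```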
6 EXPERIMENTAL EVALUATION

We compared our method BAQ with the state-of-the-art SAQ (we implemented BlinkDB with stratified sampling [1] and Seek [12] using sample and index), DAQ [33], and Histogram [31]. As BlinkDB had verified that SAQ was better than Sketch and Wavelet, we did not evaluate them.

6.1 Setting

Datasets. We used two datasets. The first is the Microsoft real-life production workload dataset MS with 30 columns (20 categorical columns and 10 numerical columns) [2]. The dataset contained the statistics of Bing search and was used for query analysis. 77.5% of the queries accessed 7 to 20 columns, and less than 5% of the queries changed within a month. The second is the well-known TPC-H dataset, of which we use the orders table with 4 columns (3 categorical and 1 numerical). The statistics are shown in Table 4.

TABLE 4: Two Datasets.
  Dataset         | rows | columns | size    | avg. distinct values | min/max of numerical | Description
  MS-Dataset      | 23M  | 30      | 125.75G | 417                  | 0 / 14399999         | Microsoft production
  TPC-H order.tbl | 150M | 4       | 15G     | 20                   | 811.73 / 568754.48   | Standard TPC

Queries. In the MS dataset, we generated 1000 queries with 100 QCSs, including 200 queries with only 1 categorical column, 200 queries with 2 categorical columns, 100 queries with 3 categorical columns, 10 queries with 1 numerical column, 10 queries with 2 numerical columns, 200 queries with 1 categorical + 1 numerical column, 200 queries with 2 categorical + 1 numerical columns, and 80 queries with 3 categorical + 1 numerical columns. In the TPC-H dataset, we generated 250 queries with 6 QCSs, including 100 queries with 1 categorical column, 15 queries with 2 categorical columns, 120 queries with 1 categorical + 1 numerical column, and 15 queries with 2 categorical + 1 numerical columns.

Metrics. We compared the three methods (DAQ, BAQ, SAQ) on four aspects: (1) answer error; (2) synopsis size (#samples); (3) online query processing time (in milliseconds); (4) offline synopsis generation time (in seconds). We ran each method 20 times and report the average results.

Setting. We also evaluated our method on Spark. We used a cluster with 20 nodes. Each node had 112GB of memory, and the cluster had in total 2TB of memory and 152 cores. All the data was resident in main memory.

6.2 BAQ vs SAQ

6.2.1 Varying Error Bound

We compared SAQ and BAQ by varying the error bounds. For SAQ, we implemented two variants, BlinkDB and Seek. For BlinkDB, we set its confidence to 99.9%. For Seek, we set the sample size to √N/δ². We use SAQ to denote BlinkDB/Seek. For BAQ, we used the greedy algorithm to generate the synopsis. Figure 6 shows the results.

4 4 -2 10 2 10 10 Relative Error Offline Time(s) 10-3 2 Online Time(ms) 3 Synopses Size(MB) 10 0 10 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 Error Bound Error Bound Error Bound Error Bound (a) MS Error (b) MS Synopsis Size (c) MS Online Time (d) MS Offline Time

0 4 100 BAQ BAQ 10 BAQ 10 BAQ BlinkDB BlinkDB BlinkDB BlinkDB SEEK 102 SEEK SEEK SEEK 10-1

-1 3 10-2 10 10 101 10-3 Relative Error Offline Time(s) 10-4 0 Online Time(ms) -2 2 Synopses Size(MB) 10 10 10 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 0.05 0.10 0.15 0.20 0.25 Error Bound Error Bound Error Bound Error Bound (e) TPC Error (f) TPC Synopsis Size (g) TPC Online Time (h) TPC Offline Time Fig. 6. BAQ vs SAQ: Varying Error Bound.
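Before turning to the analysis, the baseline configuration above is easy to restate in code. The sketch below computes the Seek sample size √N/δ² from the setup just described; the text does not spell out the "answer error" formula, so the relative-error definition |approx − exact| / |exact| here is our assumption.

```python
import math

def seek_sample_size(n_rows, delta):
    # Sample size sqrt(N) / delta^2 used for the Seek baseline (Section 6.2.1).
    return int(math.sqrt(n_rows) / delta ** 2)

def relative_error(approx, exact):
    # Assumed answer-error metric: relative deviation from the exact answer.
    return abs(approx - exact) / abs(exact)

# MS dataset (23M rows) at the smallest and largest error bounds in Figure 6:
print(seek_sample_size(23_000_000, 0.05))  # ~1.9M sampled rows
print(seek_sample_size(23_000_000, 0.25))  # ~77K sampled rows
```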

(1) Error Analysis. BAQ had a much smaller relative error, even 10-100× better than SAQ. For example, across the error bounds, BAQ had 0.1%-1% error while SAQ had about 10% error. The reasons were two-fold. First, BAQ had a confidence of 100% in most cases. Second, BlinkDB worked well for the normal distribution but not for other distributions. BlinkDB and BAQ had smaller errors on TPC-H than on MS, because TPC-H followed the uniform distribution, for which it was easy to meet the error bounds, while MS did not. With the increase of the error bounds, the error also increased, but it was always smaller than the given bound, because the bound was estimated for the worst case and in real cases the error was much smaller. Seek had smaller errors than BlinkDB, as it used indexes to support the queries with few answers.

(2) Synopsis Size. BAQ had a much smaller synopsis size than SAQ, even 100-1000× smaller. For example, across the error bounds, the synopsis sizes of BAQ were less than 1GB while those of SAQ were more than 100GB. The main reasons were two-fold. First, SAQ generated a synopsis for each QCS while BAQ generated a unified synopsis. Second, BAQ judiciously selected the synopsis to cover the queries, while SAQ used sampling-based methods and could not select high-quality samples. In addition, with the increase of the error bounds, the synopsis size decreased, as fewer samples sufficed to meet the error bound. The downward trend on TPC-H was faster than that on MS, as TPC-H had fewer columns and fewer samples were selected. Seek had larger sizes than BlinkDB, as it selected more samples to generate the synopsis.

(3) Online Query Time. BAQ was faster than SAQ, as BAQ had a smaller synopsis size. BAQ was rather efficient and could answer a query within 1ms. With the increase of the error bounds, the online query time decreased, because smaller synopses were used to answer a query. All methods had better performance on TPC-H because TPC-H only had 4 columns. Seek had a longer latency than BlinkDB, as Seek separately answered different queries and used more samples.

(4) Offline Synopsis Generation Time. BAQ took more time than SAQ to generate the synopsis, because BAQ needed to select high-quality samples while SAQ used sampling methods that might not select high-quality samples. As the synopsis generation was offline, this was acceptable.

Summary. BAQ outperformed SAQ in terms of error bounds, synopsis sizes and online query time. Due to space constraints, we focus on error bounds and synopsis size in the following experiments.

6.2.2 Varying Sampling Rate

We compared SAQ and BAQ under the same sampling rate, varied from 0.001 to 0.005. For each sampling rate, we found an appropriate error bound for BAQ by binary searching the error bounds between 0.01 and 0.5 to meet the sampling rate. We used the greedy algorithm to generate the synopsis. Figure 7 showed the results. As the sampling rates were the same, the methods had similar synopsis sizes, so we only show the results on relative error. Firstly, BAQ still significantly outperformed SAQ at each sampling rate. Secondly, with the increase of sampling rates, the relative errors decreased because a larger synopsis could be used to answer queries. BAQ had much better result quality because it had 100% confidence on the error bounds.

Fig. 7. BAQ vs SAQ: Varying Sampling Rate. [Plots omitted; panels: (a) MS Error, (b) TPC Error; x-axis: sampling rate.]

6.2.3 Varying #QCS

We set the error bound to 0.1 and compared SAQ and BAQ by varying the number of QCSs from 20 to 100. We used the greedy algorithm to generate synopses offline. Figure 8 showed the result. We had the following observations. Firstly, when the number of QCSs increased, the errors of BAQ and SAQ increased a little. Thus BAQ and SAQ were not sensitive to the number of QCSs in terms of error, because both of them generated synopses for all the QCSs within the error bound. Secondly, the synopsis size of BAQ increased more slowly than that of SAQ as the number of QCSs grew. Thus SAQ was more sensitive to the number of QCSs in terms of synopsis size, whereas the number of QCSs had no significant effect on BAQ, as BAQ generated a unified synopsis rather than one synopsis per QCS as in SAQ.

Fig. 8. BAQ vs SAQ: Varying QCS on MS. [Plots omitted; panels: (a) Error, (b) Synopsis Size; x-axis: #QCS.]

6.2.4 Varying #Distinct Values in Columns

We set the error bound to 0.1 and compared SAQ and BAQ by varying the average number of distinct values from 100 to 500. We used the greedy algorithm to generate a synopsis offline. Figure 9 showed the result. Firstly, with the increase of distinct values, both BAQ and SAQ generated larger synopses, because the synopsis size depends on the number of distinct values. The synopsis sizes of BAQ and SAQ increased only linearly with the number of distinct values: BAQ had to select more tuples for the unified synopsis, while SAQ had to consider more groups. Thus BAQ and SAQ worked well even for large numbers of distinct values. Secondly, with more distinct values, the errors decreased because larger synopses were selected to answer queries. BAQ was more sensitive in error than SAQ when the number of distinct values varied, because BAQ generated a smaller synopsis while SAQ selected more samples to support the confidence within the error bound.

Fig. 9. BAQ vs SAQ: Varying Distinct Values on MS. [Plots omitted; panels: (a) Error, (b) Synopsis Size; x-axis: #distinct values 100-500.]

6.2.5 Varying Query Type

We compared SAQ and BAQ on different types of queries: categorical-only queries, queries without numerical columns in the WHERE clause, and queries with numerical columns in the WHERE clause. We used the greedy algorithm to generate a synopsis offline. Figure 10 showed the results. (1) BAQ could exactly answer categorical-only queries with no error, but SAQ always had a large error, because BAQ used scale factors to keep the frequencies of all categories while SAQ failed to get the exact numbers. (2) BAQ got smaller errors than SAQ on queries with numerical columns, because BAQ could achieve higher quality using the scale factors with 100% confidence. Thus BAQ had much higher quality than SAQ on all queries. (3) Both BAQ and SAQ had a higher error when there were numerical columns in the WHERE clause, because neither could accurately estimate the number of records that satisfied the selection conditions of the queries. Fortunately, the result indicated that the additional error caused by partially satisfied groups was small.

Fig. 10. BAQ vs SAQ: Different Types of Queries. [Plots omitted; panels: (a) MS Error, (b) TPC Error; x-axis: query type (Categorical, Numerical, WHERE).]

6.3 BAQ vs DAQ

Both BAQ and DAQ gave an approximate result within a given error bound with 100% confidence. We compared BAQ and DAQ by varying the error bounds. As DAQ used absolute error, we varied the error bounds from 2^10 to 2^30. As DAQ cannot support WHERE and GROUP BY, we compared BAQ and DAQ on queries with numerical columns only.

6.3.1 Real Dataset

Figure 11 showed the results on the two datasets. We had the following observations. (1) BAQ ran 100× faster than DAQ for larger error bounds. This was because BAQ used a small synopsis while DAQ had to compute the high-order bits of all the records; thus DAQ had limited performance improvement over the exact algorithms. (2) BAQ had smaller errors than DAQ. This was because DAQ ignored the impact of the low-order bits, so it got a higher error than BAQ, especially for larger error bounds. (3) DAQ had smaller errors on MAX than on AVG because it captured the maximal values using the high-order bits.

Fig. 11. BAQ vs DAQ: Varying Error Bounds. [Plots omitted; panels: (a) MS Online Time, (b) MS Relative Error, (c) TPC-H Online Time, (d) TPC-H Relative Error; x-axis: error bound 2^10-2^30; curves: BAQ-AVG, BAQ-MAX, DAQ-AVG, DAQ-MAX.]

6.3.2 Zipf Distribution

As DAQ was better than SAQ on these queries, here we only compared with DAQ. To compare the error bounds of BAQ and DAQ on data with a heavily skewed distribution, we generated datasets with the Zipf distribution, setting the Zipf parameter to 1.5. Figure 12 showed the results. We had the following observations. (1) BAQ and DAQ had smaller errors for MAX, since both of them supported maximum values well: BAQ used the samples in a tight range to support the maximum value, while DAQ used the high-order bits to find it. Thus both methods could support heavily skewed data. (2) BAQ had smaller errors than DAQ, and the reason was similar to the case on the real datasets. (3) BAQ was faster than DAQ, as DAQ needed to use all records.

Fig. 12. BAQ vs DAQ: Zipf Distributions. [Plots omitted; panels: (a) Zipf 1.5 MS, (b) Zipf 1.5 TPC-H; x-axis: error bound 2^10-2^30.]
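Skewed inputs like those in Section 6.3.2 are straightforward to reproduce. Below is one plausible generator, a sketch of our own that uses NumPy's Zipf sampler rather than whatever generator the authors used:

```python
import numpy as np

def zipf_column(n_rows, a=1.5, seed=0):
    # Heavily skewed numerical column with Zipf exponent 1.5, as in Section 6.3.2.
    rng = np.random.default_rng(seed)
    return rng.zipf(a, size=n_rows)

col = zipf_column(1_000_000)
# Long-tail distribution: most values are tiny, a few are huge, which is
# exactly what stresses AVG estimation while leaving MAX relatively easy.
print(col.min(), np.median(col), col.max())
```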

6.4 BAQ vs Histogram

Both BAQ and Histogram can divide the numerical values into buckets within a bounded error. We set the error bound to 0.1 and compared BAQ with the state-of-the-art Histogram by varying the number of QCSs from 20 to 100. BAQ and Histogram had the same relative error, but BAQ used less storage than Histogram, because BAQ constructed a unified synopsis for all the QCSs while Histogram constructed one synopsis for each of the QCSs. Table 5 showed the result. With the increase of QCSs, the size of the BAQ synopsis increased slowly while the size of the Histogram synopsis increased quickly. Further, Histogram could not support categorical queries well compared with BAQ.

#QCS       20      40      60      80      100
BAQ        491.3M  554.8M  658.2M  718.7M  790.5M
Histogram  1.3G    3.4G    5.9G    10.6G   14.9G
TABLE 5. Synopsis Size of BAQ and Histogram.

6.5 Evaluating Synopsis Generation in BAQ

6.5.1 Synopsis Generation: Scan vs Greedy

We evaluated the synopsis generation algorithms, i.e., the scan-based algorithm and the greedy algorithm. We varied the error bound from 0.05 to 0.25 and measured the synopsis size, the offline synopsis generation time, and the error. Figure 13 showed the results. We had the following observations. Firstly, the scan-based algorithm always took less time than the greedy algorithm, because the scan-based algorithm scanned the table once to generate the unified synopsis while the greedy algorithm needed to scan the table multiple times. Secondly, the greedy algorithm generated a much smaller synopsis than the scan-based algorithm, because the greedy algorithm aimed to select the synopsis with the minimum number of records to cover the query synopses. Thirdly, the errors of the scan-based algorithm and the greedy algorithm were similar, because the scan-based algorithm selected more tuples while the greedy algorithm covered more query synopses.

Fig. 13. Scan vs Greedy. [Plots omitted; panels: (a) Time and Size, (b) Error; x-axis: error bound 0.05-0.25.]
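To illustrate the scan-based side of this comparison, here is a minimal sketch of one-pass grouping over sorted numerical values, the grouping step whose optimality Lemma 2 (Appendix 8.2) establishes. The function name and the toy data are our own illustrative choices.

```python
def scan_groups(values, delta):
    """One-pass grouping of numerical values for the scan-based algorithm.

    After sorting, a new group is opened whenever the next value would push
    the open group's relative spread (max - min) / min beyond delta.
    """
    values = sorted(values)
    groups, start = [], 0
    for i, v in enumerate(values):
        if v > (1 + delta) * values[start]:  # v no longer fits the open group
            groups.append(values[start:i])
            start = i
    groups.append(values[start:])
    return groups

taxes = [133, 140, 145, 154, 161, 175]
print(scan_groups(taxes, delta=0.2))
# -> [[133, 140, 145, 154], [161, 175]]; each group's spread stays within 20%
```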
6.5.2 Distributed Algorithms: Repartition vs Merge

We evaluated the repartition-based and merge-based methods on Spark, measuring the offline synopsis generation time and the synopsis size. The results were shown in Figure 14. We had the following observations. Firstly, both of our algorithms were very fast on Spark. Secondly, the merge-based algorithm performed faster than the repartition-based algorithm on Spark, because the repartition-based algorithm took much time to shuffle the data while the merge-based algorithm just needed to generate a synopsis on each node and combine them together. Thirdly, the size of the synopsis generated by the merge-based algorithm was a little bigger than that of the repartition-based algorithm, because the merge-based algorithm generated more groups on the numerical columns. Fourthly, the error of the merge-based algorithm was a bit larger than that of the repartition-based algorithm, as the merge-based algorithm only found locally optimal results rather than the globally optimal results.

Fig. 14. Repartition vs Merge. [Plots omitted; panels: (a) Time and Size, (b) Error; x-axis: error bound 0.05-0.25.]

6.5.3 Varying Number of Columns

We evaluated the performance on the MS dataset by varying the number of columns from 6 (4 categorical and 2 numerical) to 30 (20 categorical and 10 numerical). We reported the results of the merge-based algorithm in Section 5.2 and set the error bound to 0.1. Table 6 showed the construction time and synopsis size. We had the following observations. (1) The construction time increased slowly as the column number increased, because the time cost depended on the number of tuples and was not sensitive to the number of columns. (2) The synopsis size increased with the number of columns, because more tuples of more columns had to be covered. But the growth rate was small, and our method still worked well for many columns, e.g., 30. Thus, even with many columns, our method could still generate a high-quality synopsis.

#Columns               6      12     18     24     30
Synopsis Size (MB)     392.6  548.2  650.0  730.5  790.5
Construction Time (s)  196    235    262    285    300
TABLE 6. Varying Column Number.

7 CONCLUSION

We proposed a bounded approximate query processing framework that supports most SQL aggregation queries with low error bounds. BAQ first selects a high-quality synopsis for each query and then constructs a unified synopsis by covering each query synopsis. We proved that selecting the optimal unified synopsis is NP-hard and devised effective algorithms. We also developed distributed algorithms to generate the unified synopsis in distributed environments. Experimental results showed that BAQ had lower errors and smaller synopsis sizes compared with the state-of-the-art.

REFERENCES

[1] BlinkDB homepage. http://blinkdb.org/.
[2] Taming big wide tables. http://acmsocc.github.io/2015/posters/socc15posters-final24.pdf.
[3] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. I. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when you're wrong: building fast and reliable approximate query processing systems. In SIGMOD, pages 481–492, 2014.
[4] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, pages 29–42, 2013.
[5] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, pages 539–550, 2003.
[6] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to Probability: Problem Solutions. 2008.
[7] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB J., 10(2-3):199–223, 2001.
[8] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. In NSDI, pages 313–328, 2010.
[9] G. Cormode, A. Deligiannakis, M. N. Garofalakis, and A. McGregor. Probabilistic histograms for probabilistic data. PVLDB, 2(1):526–537, 2009.
[10] G. Cormode, M. N. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1-3):1–294, 2012.
[11] G. Cormode and M. Hadjieleftheriou. Methods for finding frequent items in data streams. VLDB J., 19(1):3–20, 2010.
[12] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang. Sample + Seek: Approximating aggregates with distribution precision guarantee. In SIGMOD, pages 679–694, 2016.
[13] M. N. Garofalakis and J. Gehrke. Querying and mining data streams: You only get one look. In VLDB, 2002.
[14] M. N. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In SIGMOD, pages 476–487, 2002.
[15] S. Guha and B. Harb. Wavelet synopsis for data streams: minimizing non-euclidean error. In SIGKDD, pages 88–97, 2005.
[16] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD, pages 171–182, 1997.
[17] T. Hesterberg. What teachers should know about the bootstrap: Resampling in the undergraduate statistics curriculum. The American Statistician, 69:371–386, 2015.
[18] Y. E. Ioannidis. Approximations in database systems. In ICDT, pages 16–30, 2003.
[19] Y. E. Ioannidis. The history of histograms (abridged). In VLDB, pages 19–30, 2003.
[20] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex ad-hoc queries in big data clusters. In SIGMOD, pages 631–646, 2016.
[21] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103, 1972.
[22] L. V. S. Lakshmanan, N. Leone, R. B. Ross, and V. S. Subrahmanian. ProbView: A flexible probabilistic database system. ACM Trans. Database Syst., 22(3):419–469, 1997.
[23] L. V. S. Lakshmanan and F. Sadri. Uncertain deductive databases: A hybrid approach. Inf. Syst., 22(8):483–508, 1997.
[24] F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation for joins. In SIGMOD, pages 2121–2124, 2016.
[25] Y. Matias, J. S. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In VLDB, pages 101–110, 2000.
[26] M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In SIGMOD, pages 28–36, 1988.
[27] F. Olken and D. Rotem. Simple random sampling from relational databases. In VLDB, pages 160–169, 1986.
[28] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. PVLDB, 4(11):1135–1145, 2011.
[29] A. Pol and C. Jermaine. Relational confidence bounds are easy with the bootstrap. In SIGMOD, pages 587–598, 2005.
[30] V. Poosala and V. Ganti. Fast approximate answers to aggregate queries on a data cube. In SSDBM, pages 24–33, 1999.
[31] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In SIGMOD, pages 294–305, 1996.
[32] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In SIGMOD, pages 294–305, 1996.
[33] N. Potti and J. M. Patel. DAQ: A new paradigm for approximate query processing. PVLDB, 8(9):898–909, 2015.
[34] C. Qin and F. Rusu. PF-OLA: a high-performance framework for parallel online aggregation. Distributed and Parallel Databases, 32(3):337–375, 2014.
[35] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system. In SIGMOD, 1999.
[36] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 2007.
[37] M. Shekelyan, A. Dignös, and J. Gamper. DigitHist: a histogram-based data summary with tight error bounds. PVLDB, 10(11):1514–1525, 2017.
[38] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic Databases. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011.
[39] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, pages 193–204, 1999.
[40] H. Wang and K. C. Sevcik. Utilizing histogram information. In Centre for Advanced Studies on Collaborative Research, page 16, 2001.
[41] J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, pages 469–480, 2014.
[42] Y. Yan, L. J. Chen, and Z. Zhang. Error-bounded sampling for analytics on big sparse data. PVLDB, 7(13):1508–1519, 2014.
[43] K. Yi, F. Li, M. Hadjieleftheriou, G. Kollios, and D. Srivastava. Randomized synopses for query assurance on data streams. In ICDE, pages 416–425, 2008.
[44] K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, pages 277–288, 2014.

Kaiyu Li received the bachelor's degree from the Department of Computer Science and Technology, Harbin Institute of Technology, China. He is currently working toward the master's degree in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include approximate query processing, data integration and crowdsourcing.

Yong Zhang is an associate professor at the Research Institute of Information Technology, Tsinghua University. He received his BSc degree in Computer Science and Technology in 1997 and his PhD degree in Computer Software and Theory in 2002, both from Tsinghua University. From 2002 to 2005 he did his postdoc at Cambridge University, UK. His research interests are data management and data analysis.

Guoliang Li is currently working as an associate professor in the Department of Computer Science, Tsinghua University, Beijing, China. He received his PhD degree in Computer Science from Tsinghua University, Beijing, China in 2009. His research interests mainly include data cleaning and integration, spatial databases and crowdsourcing.

Wenbo Tao received the bachelor's degree from the Department of Computer Science, Tsinghua University, Beijing, China. He is currently a PhD student at MIT, Boston, U.S. His research interests include approximate query processing, data visualization and data integration.

Ying Yan is currently working as a Lead Researcher in the Cloud Computing Group, MSRA. She received her PhD degree in computer science from Fudan University. Her current research interests include big data analytics, database optimization and technology.

8 APPENDIX

8.1 Proof of Lemma 1.
If qi contains categorical columns only, any query with the same QCS πi as qi also contains categorical columns only, and the AGG of such a query can only be COUNT. Since we have kept the scale factor as the frequency of each possible tuple in the subtable T[πi], which contains only the columns in πi, we can answer the query exactly.

8.2 Proof of Lemma 2.
We prove it by contradiction. Suppose that there exists a partitioning strategy with fewer groups than the x groups produced by our scan-based method, say y < x groups, denoted G'[1] = {v_1, ..., v_{k_1}}, G'[2] = {v_{k_1+1}, ..., v_{k_2}}, ..., G'[y] = {v_{k_{y-1}+1}, ..., v_N}, where (v_{k_t} - v_{k_{t-1}+1}) / v_{k_{t-1}+1} ≤ δ for 1 ≤ t ≤ y, with k_0 = 0 and k_y = N. Since y < x, by the pigeonhole principle two consecutive boundary values v_{j_{t-1}} and v_{j_t} of the scan-based groups must fall into the same group G'[l] for some 1 ≤ l ≤ y. However, the scan-based method opens a new group exactly when (1 + δ)v_{j_{t-1}} < v_{j_t}, so G'[l] would contain two values whose relative gap exceeds δ, a contradiction. Hence our scan-based algorithm minimizes the number of groups.

8.3 Proof of Lemma 3.
For each tuple V in D[πi^c], we consider the subtable of T whose πi^c-tuple equals V, i.e., T[πi^c = V]. In the second method, these tuples are grouped based on the numerical values on πi^n. In the first method, we group these tuples into many groups; these groups, however, may contain tuples outside T[πi^c = V]. Hence the tuples in T[πi^c = V] may be spread over more groups in the first method than in the second, and the synopsis size of the second method is not larger than that of the first.

8.4 Proof of Lemma 4.
For each tuple t in si, there must exist a tuple t' in sj such that t' has the same value on πi as t. Moreover, there may be multiple tuples satisfying this property; we can use any of them to answer qi within the error bound. If more tuples are used, we can still guarantee the bound.

8.5 Proof of Lemma 5.
If q's QCS is a subset of πi, then as shown in Lemma 4, the error of using si to answer q is within the bound δ.

8.6 Proof of Lemma 6.
Consider a minimum cover instance: a collection C = {c_1, c_2, ..., c_{|C|}} = {{e^1_{l_1}, e^1_{l_2}, ..., e^1_{l_τ}}, {e^2_{l_1}, e^2_{l_2}, ..., e^2_{l_τ}}, ..., {e^{|C|}_{l_1}, e^{|C|}_{l_2}, ..., e^{|C|}_{l_τ}}} of subsets of a finite set S = {e_1, e_2, ..., e_z}, and a positive integer K ≤ |C|; each member of C has size τ. The goal is to decide whether there exists a subset C' ⊆ C with |C'| ≤ K such that every element of S belongs to at least one member of C'. This problem is known to be NP-hard [21]. We regard S as the collection of all the distinct tuples of all the query synopses in our problem; each element e_j is a tuple in a synopsis. Each set c_i is the collection of elements of S covered by the i-th record of the original table, and we consider τ queries. We can simply set K = |C|, the number of records in T. In our problem, we aim to find the minimum number of records in T to construct the synopsis, i.e., to select C' from C. Therefore, our problem is equivalent to the minimum cover problem and is thus NP-hard.

8.7 Proof of Lemma 7.
As the unified synopsis S covers each tuple in each of the query synopses, S must cover the tuples that meet the conditions of query q'_i. Besides, S keeps the scale factor for each tuple of each query synopsis. So the error of the answer of the rewritten query q'_i of q_i on the unified synopsis S is within the threshold δ, the same as the answer of q'_i on the synopsis s_i.

8.8 Proof of Lemma 8.
For any group g in the optimal group set, g can have common parts with at most two groups in the group set generated by the merge-based method, where groups g and g' have common parts if g ∩ g' ≠ ∅. Indeed, if g had common parts with three or more merge-based groups, g would fully contain at least one merge-based group, which implies (g.max − g.min)/g.min > δ, where g.max and g.min are the maximum and minimum values in g, a contradiction. So the number of groups generated by the merge-based method is not more than twice the optimal group number.
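Since Lemma 6 reduces unified-synopsis selection to minimum cover, the greedy selection used throughout the paper ("pick the record with the maximum coverage") follows the classic greedy set-cover pattern. A self-contained toy sketch, with hypothetical record and tuple names of our own:

```python
def greedy_cover(universe, records):
    """Greedy set cover: repeatedly pick the record covering the most
    still-uncovered tuples, as in the paper's greedy synopsis generation."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(records, key=lambda r: len(uncovered & records[r]))
        if not uncovered & records[best]:
            break  # remaining tuples cannot be covered
        chosen.append(best)
        uncovered -= records[best]
    return chosen

# Toy instance: each record covers some tuples of the query synopses.
records = {
    "r1": {"e1", "e2", "e3"},
    "r2": {"e3", "e4"},
    "r3": {"e4", "e5"},
    "r4": {"e1", "e5"},
}
print(greedy_cover({"e1", "e2", "e3", "e4", "e5"}, records))
# -> ['r1', 'r3']: two records cover all five tuples
```

This heuristic trades extra passes over the data for a much smaller cover, which is consistent with the scan-vs-greedy trade-off reported in Section 6.5.1.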