Probabilistic Summarization for Interactive Data Exploration

Laurel Orr, Magdalena Balazinska, and Dan Suciu
Seattle, Washington, USA
{ljorr1, magda, suciu}@cs.washington.edu

ABSTRACT

We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values.

1. INTRODUCTION

Interactive data exploration allows a data analyst to browse, query, transform, and visualize data at "human speed" [7]. It has been long recognized that general-purpose DBMSs are ill suited for interactive exploration [19]. While users require interactive responses, they do not necessarily require precise responses because either the response is used in some visualization, which has limited resolution, or an approximate result is sufficient and can be followed up with a more costly query if needed. Approximate Query Processing (AQP) refers to a set of techniques designed to allow fast but approximate answers to queries. All successful AQP systems to date rely on sampling or a combination of sampling and indexes. The sample can either be computed on-the-fly, e.g., in the highly influential work on online aggregation [12] or systems like DBO [14] and Quickr [16], or precomputed offline, like in BlinkDB [2] or Sample+Seek [9]. Samples have the advantage that they are easy to compute, can accurately estimate aggregate values, and are good at detecting heavy hitters. However, sampling may fail to return estimates for small populations; targeted stratified samples can alleviate this shortcoming, but stratified samples need to be precomputed to target a specific query, defeating the original purpose of AQP.

In this paper, we propose an alternative approach to interactive data exploration based on the Maximum Entropy principle (MaxEnt). The MaxEnt model has been applied in many settings beyond data exploration; e.g., the multiplicative weights mechanism [11] is a MaxEnt model for both differentially private and, by [10], statistically valid answers to queries, and it has been shown to be theoretically optimal. In our setting of the MaxEnt model, the data is preprocessed to compute a probabilistic model. Then, queries are answered by doing probabilistic inference on this model. The model is defined as the probabilistic space that obeys some observed statistics on the data and makes no other assumptions (Occam's principle). The choice of statistics boils down to a precision/memory tradeoff: the more statistics one includes, the more precise the model and the more space required. Once computed, the MaxEnt model defines a probability distribution on possible worlds, and users can interact with this model to obtain approximate query results. Unlike a sample, which may miss rare items, the MaxEnt model can infer something about every query.

Despite its theoretical appeal, the computational challenges associated with the MaxEnt model make it difficult to use in practice. In this paper, we develop the first scalable techniques to compute and use the MaxEnt model. As an application, we illustrate it with interactive data exploration. Our first contribution is to simplify the standard MaxEnt model to a form that is appropriate for data summarization (Sec. 3). We show how to simplify the MaxEnt model to be a multi-linear polynomial that has one monomial for each possible tuple (Sec. 3, Eq. (5)) rather than its naïve form that has one monomial for each possible world (Sec. 2, Eq. (2)). Even with this simplification, the MaxEnt model starts by being larger than the data. For example, the flights dataset is 5 GB, but the number of possible tuples is approximately 10^10, more than 5 GB. Our first optimization consists of a compression technique for the polynomial of the MaxEnt model (Sec 4.1); for example, for the flights dataset, the summary is below 200 MB, while for our larger dataset of 210 GB, it is less than 1 GB. Our second optimization consists of a new technique for query evaluation on the MaxEnt model (Sec. 4.2) that only requires setting some variables to 0; this reduces the runtime to be on average below 500 ms and always below 1 s.

We find that the main bottleneck in using the MaxEnt model is computing the model itself; in other words, computing the values of the variables of the polynomial such that it matches the existing statistics over the data. Solving the MaxEnt model is difficult; prior work for multi-dimensional histograms [18] uses an iterative scaling algorithm for this purpose. To date, it is well understood that the MaxEnt model can be solved by reducing it to a convex optimization problem [23] of a dual function (Sec. 2), which can be solved using Gradient Descent. However, even this is difficult given the size of our model. We managed to adapt a variant of Stochastic Gradient Descent called Mirror Descent [5], and our optimized query evaluation technique can compute the MaxEnt model for large datasets in under a day.

In summary, in this paper, we develop the following new techniques:
• A closed-form representation of the probability space of possible worlds using the Principle of Maximum Entropy, and a method to use the representation to answer queries in expectation (Sec 3).
• A compression method for the MaxEnt summary (Sec 4.1).
• Optimized query processing techniques (Sec 4.2).
• A new method for selecting 2-dimensional statistics based on a modified KD-tree (Sec 4.3).

We implement the above techniques in a prototype system that we call EntropyDB and evaluate it on the flights and astronomy datasets. We find that EntropyDB can answer queries faster than sampling while introducing no more error, on average, and does better at identifying small populations.

2. BACKGROUND

We summarize data by fitting a probability distribution over the active domain. The distribution assumes that the domain values are distributed in a way that preserves given statistics over the data but are otherwise uniform.

For example, consider a data scientist who analyzes a dataset of flights in the United States for the month of December 2013. All she knows is that the dataset includes all flights within the 50 possible states and that there are 500,000 flights in total. She wants to know how many of those flights are from CA to NY. Without any extra information, our approach would assume all flights are equally likely and estimate that there are 500,000/50^2 = 200 flights. Now suppose the data scientist finds out that flights leaving CA only go to NY, FL, or WA. This changes the estimate because instead of there being 500,000/50 = 10,000 flights leaving CA and uniformly going to all 50 states, those flights are only going to 3 states. Therefore, the estimate becomes 10,000/3 ≈ 3,333 flights.

This example demonstrates how our summarization technique would answer queries, and the rest of this section covers its theoretical foundation.

2.1 Possible World Semantics

To model a probabilistic database, we use the slotted possible world semantics where rows have an inherent unique identifier, meaning the order of the tuples matters. Our set of possible worlds is generated from the active domain and size of each relation. Each database instance is one possible world with an associated probability such that the probabilities of all possible worlds sum to one.

In contrast to typical probabilistic databases, where the probability of a relation is calculated from the probability of each tuple, we calculate a relation's probability from a formula derived from the MaxEnt principle and a set of constraints on the overall distribution. This approach captures the idea that the distribution should be uniform except where otherwise specified by the given constraints.

2.2 The Principle of Maximum Entropy

The Principle of Maximum Entropy (MaxEnt) states that, subject to prior data, the probability distribution which best represents the state of knowledge is the one that has the largest entropy. This means that, given our set of possible worlds PWD, the probability distribution Pr(I) is one that agrees with the prior information on the data and maximizes

    - Σ_{I∈PWD} Pr(I) log(Pr(I))

where I is a database instance, also called a possible world. The above probability must be normalized, Σ_I Pr(I) = 1, and must satisfy the prior information represented by a set of k expected value constraints:

    s_j = E[φ_j(I)],  j = 1, k    (1)

where s_j is a known value and φ_j is a function on I that returns a numerical value in R. One example constraint is that the number of flights from CA to WI is 0.

Following prior work on the MaxEnt principle and solving constrained optimization problems [4, 23, 20], the MaxEnt probability distribution takes the form

    Pr(I) = (1/Z) exp( Σ_{j=1}^{k} θ_j φ_j(I) )    (2)

where θ_j is a parameter and Z is the following normalization constant:

    Z = Σ_{I∈PWD} exp( Σ_{j=1}^{k} θ_j φ_j(I) ).

To compute the k parameters θ_j, we must solve the non-linear system of k equations, Eq. (1), which is computationally difficult. However, it turns out [23] that Eq. (1) is equivalent to ∂Ψ/∂θ_j = 0, where the dual Ψ is defined as:

    Ψ = Σ_{j=1}^{k} s_j θ_j - ln(Z).

Furthermore, Ψ is concave, which means solving for the k parameters can be achieved by maximizing Ψ. We note that Z is called the partition function, and its log, ln(Z), is called the cumulant.
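As a quick sanity check (a restatement using only the definitions above, not an additional result), differentiating Ψ shows why its maximum enforces the constraints of Eq. (1):

\[
\frac{\partial \Psi}{\partial \theta_j}
= s_j - \frac{1}{Z}\frac{\partial Z}{\partial \theta_j}
= s_j - \frac{1}{Z}\sum_{I \in PWD} \phi_j(I)\,\exp\Big(\sum_{j'=1}^{k}\theta_{j'}\phi_{j'}(I)\Big)
= s_j - E[\phi_j(I)],
\]

so setting ∂Ψ/∂θ_j = 0 for every j is exactly Eq. (1).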

3. EntropyDB APPROACH

This section explains how we use the MaxEnt model for approximate query answering. We first show how we use the MaxEnt framework to transform a single relation R into a probability distribution represented by P. We then explain how we use P to answer queries over R.

3.1 Maximum Entropy Model of Data

We consider a single relation with m attributes and schema R(A1, ..., Am), where each attribute Ai has an active domain Di, assumed to be discrete and ordered (we support continuous data types by bucketizing their active domains). Let Tup = D1 × D2 × ··· × Dm = {t1, ..., td} be the set of all possible tuples. Denoting Ni = |Di|, we have d = |Tup| = Π_{i=1}^{m} Ni.

An instance for R is an ordered bag of n tuples, denoted I. For each I, we form a frequency vector, which is a d-dimensional vector n^I = [n^I_1, ..., n^I_d] ∈ R^d, where each number n^I_i represents the count of the tuple ti ∈ Tup in I (Fig. 1); this is a standard data model in several applications, such as differential privacy [17]. The mapping from I to n^I is not one-to-one because the instance I is ordered, and two distinct instances may have the same counts. Further, for any instance I of cardinality n, ||n^I||_1 = Σ_i n^I_i = n. The frequency vector of an instance consisting of a single tuple {ti} is denoted n^{ti} = [0, ..., 0, 1, 0, ..., 0], with a single value 1 in the ith position; i.e., {n^{ti} : i = 1, d} forms a basis for R^d.

While the MaxEnt principle allows us, theoretically, to answer any query probabilistically by averaging the query over all possible instances, in this paper we limit our discussion to linear queries. A linear query is a d-dimensional vector q = [q1, ..., qd] in R^d. The answer to q on instance I is the dot product ⟨q, n^I⟩ = Σ_{i=1}^{d} qi n^I_i. With some abuse of notation, we will write I when referring to n^I and ti when referring to n^{ti}. Notice that ⟨q, ti⟩ = qi, and, for any instance I, ⟨q, I⟩ = Σ_i n^I_i ⟨q, ti⟩.

Fig. 1 illustrates the data and query model. Any counting query is a vector q where all coordinates are 0 or 1 and can be equivalently defined by a predicate π such that ⟨q, I⟩ = |σ_π(I)|; with more abuse, we will use π instead of q when referring to a counting query. Other SQL queries can be modeled using linear queries, too. For example, SELECT A, COUNT(*) AS cnt FROM R GROUP BY A ORDER BY cnt DESC LIMIT 10 corresponds to several linear queries, one for each group, where the outputs are sorted and the top 10 returned.

[Figure 1: Illustration of the data and query model. Domains: D1 = {a1, a2} (N1 = 2), D2 = {b1, b2} (N2 = 2), Tup = {(a1,b1), (a1,b2), (a2,b1), (a2,b2)}, d = 4. Database instance I with n = 5, m = 2: rows (a1,b1), (a1,b2), (a2,b2), (a1,b1), (a2,b2), so n^I = (2, 1, 0, 2). Query q: SELECT COUNT(*) FROM R WHERE A = a1, i.e., q = (1, 1, 0, 0), and ⟨q, n^I⟩ = 3, also denoted ⟨q, I⟩.]

Our goal is to compute a summary of the data that is small yet allows us to approximatively compute the answer to any linear query. We assume that the cardinality n of R is fixed and known. In addition, we know k statistics, Φ = {(cj, sj) : j = 1, k}, where cj is a linear query and sj ≥ 0 is a number. Intuitively, the statistic (cj, sj) asserts that ⟨cj, I⟩ = sj. For example, we can write 1-dimensional and 2-dimensional (2D) statistics like |σ_{A1=63}(I)| = 20 and |σ_{A1∈[50,99] ∧ A2∈[1,9]}(I)| = 300.

Next, we derive the MaxEnt distribution for the possible instances I of a fixed size n. We replace the exponential parameters θ_j with ln(α_j) so that Eq. (2) becomes

    Pr(I) = (1/Z) Π_{j=1,k} α_j^⟨cj,I⟩.    (3)

We prove the following about the structure of the partition function Z:

Lemma 3.1. The partition function is given by

    Z = P^n    (4)

where P is the multi-linear polynomial

    P(α_1, ..., α_k) = Σ_{i=1,d} Π_{j=1,k} α_j^⟨cj,ti⟩.    (5)

Proof. Fix any n = [n1, ..., nd] such that ||n||_1 = Σ_{i=1}^{d} ni = n. The number of instances I of cardinality n with n^I = n is n! / Π_i ni!. Furthermore, for each such instance, ⟨cj, I⟩ = ⟨cj, n⟩ = Σ_i ni ⟨cj, ti⟩. Therefore,

    Z = Σ_I Π_{j=1,k} α_j^⟨cj,I⟩ = Σ_{n:||n||_1=n} (n! / Π_i ni!) Π_{j=1,k} α_j^{Σ_i ni⟨cj,ti⟩} = ( Σ_{i=1,d} Π_{j=1,k} α_j^⟨cj,ti⟩ )^n = P^n.

The data summary consists of the polynomial P (Eq. (5)) and the values of its parameters α_j; the polynomial is defined by the linear queries cj in the statistics Φ, and the parameters are computed from the numerical values sj.

Example 3.2. Consider a relation with three attributes R(A, B, C), and assume that the domain of each attribute has 2 distinct elements. Assume n = 10 and the only statistics in Φ are the following 1-dimensional statistics:

    (A = a1, 3)  (B = b1, 8)  (C = c1, 6)
    (A = a2, 7)  (B = b2, 2)  (C = c2, 4).

The first statistic asserts that |σ_{A=a1}(I)| = 3, etc. The polynomial P is

    P = α1β1γ1 + α1β1γ2 + α1β2γ1 + α1β2γ2 + α2β1γ1 + α2β1γ2 + α2β2γ1 + α2β2γ2

where α1, α2 are variables associated with the statistics on A, β1, β2 are for B, and γ1, γ2 are for C (we abuse notation here for readability; technically, αi = α_{ai}, βi = α_{bi}, and γi = α_{ci}). Consider the concrete instance

    I = {(a1, b1, c1), (a1, b2, c2), ..., (a1, b2, c2)}

where the tuple (a1, b2, c2) occurs 9 times. Then, Pr(I) = α1^10 β1 β2^9 γ1 γ2^9 / P^10.
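To make Eq. (3)–(5) concrete, the following minimal Java sketch materializes P for the tiny 2 × 2 × 2 domain of Example 3.2 and evaluates Pr(I) for the concrete instance above. It is illustrative only (the α values below are placeholders, and the brute-force enumeration is exactly what the compression of Sec. 4.1 avoids); it is not the EntropyDB implementation.

    // Minimal sketch: materialize P from Eq. (5) for the 2 x 2 x 2 domain of Example 3.2
    // and evaluate Pr(I) via Eq. (3) with Z = P^n (Lemma 3.1).
    // Illustrative only (alpha/beta/gamma values are placeholders); not the EntropyDB code.
    public class TinyMaxEntSummary {
        public static void main(String[] args) {
            double[] alpha = {0.3, 0.7};   // variables for A = a1, a2
            double[] beta  = {0.8, 0.2};   // variables for B = b1, b2
            double[] gamma = {0.6, 0.4};   // variables for C = c1, c2

            // P = sum over all possible tuples of the product of their statistic variables (Eq. (5)).
            double P = 0;
            for (int a = 0; a < 2; a++)
                for (int b = 0; b < 2; b++)
                    for (int c = 0; c < 2; c++)
                        P += alpha[a] * beta[b] * gamma[c];

            // Instance from Example 3.2: (a1,b1,c1) once and (a1,b2,c2) nine times, n = 10.
            int n = 10;
            double weight = Math.pow(alpha[0], 10) * beta[0] * Math.pow(beta[1], 9)
                          * gamma[0] * Math.pow(gamma[1], 9);
            System.out.println("P = " + P + ", Pr(I) = " + weight / Math.pow(P, n));
        }
    }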

Example 3.3. Continuing the previous example, we add the following multi-dimensional statistics to Φ:

    (A = a1 ∧ B = b1, 2)  (B = b1 ∧ C = c1, 5)
    (A = a2 ∧ B = b2, 1)  (B = b2 ∧ C = c1, 1).

P is now

    P = α1β1γ1[αβ]_{1,1}[βγ]_{1,1} + α1β1γ2[αβ]_{1,1} + α1β2γ1[βγ]_{2,1} + α1β2γ2
      + α2β1γ1[βγ]_{1,1} + α2β1γ2 + α2β2γ1[αβ]_{2,2}[βγ]_{2,1} + α2β2γ2[αβ]_{2,2}.    (6)

The bracketed variables are the added 2-dimensional statistic variables; we use [αβ]_{1,1} to denote a single variable corresponding to a 2D statistic on the attributes AB. Notice that each 2D variable only occurs with its related 1-dimensional variables; [αβ]_{1,1}, for example, is only in the same term as α1 and β1. Now consider the earlier instance I. Its probability becomes Pr(I) = α1^10 β1 β2^9 γ1 γ2^9 [αβ]_{1,1} [βγ]_{1,1} / P^10.

To facilitate analytical queries, we choose the set of statistics Φ as follows:
• Each statistic φ_j = (cj, sj) is associated with some predicate π_j such that ⟨cj, I⟩ = |σ_{πj}(I)|. It follows that for every tuple ti, ⟨cj, ti⟩ is either 0 or 1; therefore, each variable α_j has degree 1 in the polynomial P in Eq. (5).
• For each domain Di, we include a complete set of 1-dimensional statistics in our summary. In other words, for each v ∈ Di, Φ contains one statistic with predicate Ai = v. We denote Ji ⊆ [k] the set of indices of the 1-dimensional statistics associated with Di; therefore, |Ji| = |Di| = Ni.
• We allow multi-dimensional statistics to be given by arbitrary predicates. They may be overlapping and/or incomplete; e.g., one statistic may count the tuples satisfying A1 ∈ [10, 30] ∧ A2 = 5 and another count the tuples satisfying A2 ∈ [20, 40] ∧ A4 = 20.
• We assume the number of 1-dimensional statistics dominates the number of attribute combinations; i.e., Σ_{i=1}^{m} Ni ≫ 2^m.
• If some domain Di is large, it is beneficial to reduce the size of the domain using equi-width buckets. In that case, we assume the elements of Di represent buckets, and Ni is the number of buckets.
• We enforce our MaxEnt distribution to be overcomplete [23, pp. 40] (as opposed to minimal). More precisely, for any attribute Ai and any instance I, we have Σ_{j∈Ji} ⟨cj, I⟩ = n, which means that some statistics are redundant since they can be computed from the others and from the size of the instance n.

Note that, as a consequence of overcompleteness, for any attribute Ai, one can write P as a linear expression

    P = Σ_{j∈Ji} α_j P_j    (7)

where each P_j, j ∈ Ji, is a polynomial that does not contain the variables (α_j)_{j∈Ji}. In Example 3.3, the 1-dimensional variables for A are α1, α2, and indeed, each monomial in Eq. (6) contains exactly one of these variables. One can write P as P = α1P1 + α2P2, where α1P1 collects the monomials containing α1 (the first line of Eq. (6) above) and α2P2 those containing α2 (the second line). P is also linear in β1, β2 and in γ1, γ2.

3.2 Query Answering

In this section, we show how to use the data summary to approximately answer a linear query q by returning its expected value E[⟨q, I⟩]. The summary (the polynomial P and the values of its variables α_j) uniquely defines a probability space on the possible worlds (Eq. (3) and (5)). We start with a well known result in the MaxEnt model. If c_ℓ is the linear query associated with the variable α_ℓ, then

    E[⟨c_ℓ, I⟩] = (n/P) α_ℓ ∂P/∂α_ℓ.    (8)

We review the proof here. The expected value of ⟨c_ℓ, I⟩ over the probability space (Eq. (3)) is

    E[⟨c_ℓ, I⟩] = (1/P^n) Σ_I ⟨c_ℓ, I⟩ Π_j α_j^⟨cj,I⟩ = (1/P^n) Σ_I α_ℓ ∂/∂α_ℓ ( Π_j α_j^⟨cj,I⟩ )
                = (1/P^n) α_ℓ ∂/∂α_ℓ ( Σ_I Π_j α_j^⟨cj,I⟩ ) = (1/P^n) α_ℓ ∂(P^n)/∂α_ℓ = (n/P) α_ℓ ∂P/∂α_ℓ.

To compute a new linear query q, we add it to the statistical queries cj, associate it with a fresh variable β, and denote by Pq the extended polynomial:

    Pq(α_1, ..., α_k, β) = Σ_{i=1,d} Π_{j=1,k} α_j^⟨cj,ti⟩ β^⟨q,ti⟩.    (9)

Notice that Pq[β = 1] ≡ P; therefore, the extended data summary defines the same probability space as P. We can apply Eq. (8) to the query q to derive:

    E[⟨q, I⟩] = (n/P) ∂Pq/∂β.    (10)

This leads to the following naïve strategy for computing the expected value of q: extend P to obtain Pq and apply formula Eq. (10). One way to obtain Pq is to iterate over all monomials in P and add β to the monomials corresponding to tuples counted by q. As this is inefficient, Sec. 4.2 describes how to avoid modifying the polynomial altogether.
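As a worked instance of Eq. (8), using only the quantities of Example 3.2 (where the 1D-only polynomial factorizes as P = (α1 + α2)(β1 + β2)(γ1 + γ2)):

\[
E[\langle c_{A=a_1}, I\rangle]
= \frac{n\,\alpha_1}{P}\frac{\partial P}{\partial \alpha_1}
= \frac{n\,\alpha_1(\beta_1+\beta_2)(\gamma_1+\gamma_2)}{(\alpha_1+\alpha_2)(\beta_1+\beta_2)(\gamma_1+\gamma_2)}
= \frac{n\,\alpha_1}{\alpha_1+\alpha_2},
\]

which the parameter fitting described next (Sec. 3.3) drives to the observed statistic s = 3 for n = 10.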

3.3 Probabilistic Model Computation

We now describe how to compute the parameters of the summary. Given the statistics Φ = {(cj, sj) : j = 1, k}, we need to find values of the variables {α_j : j = 1, k} such that E[⟨cj, I⟩] = sj for all j = 1, k. As explained in Sec 2, this is equivalent to maximizing the dual function Ψ:

    Ψ = Σ_{j=1}^{k} s_j ln(α_j) - n ln P.    (11)

Indeed, maximizing Ψ reduces to solving the equations ∂Ψ/∂α_j = 0 for all j. Direct calculation gives us ∂Ψ/∂α_j = s_j/α_j - (n/P) ∂P/∂α_j = 0, which is equivalent to s_j - E[⟨cj, I⟩] = 0 by Eq. (8). The dual function Ψ is concave, and hence it has a single maximum value that can be obtained using convex optimization techniques such as Gradient Descent.

In particular, we achieve the fastest convergence rates using a variant of Stochastic Gradient Descent (SGD) called Mirror Descent [5], where each iteration chooses some j = 1, k and updates α_j by solving (n α_j / P) ∂P/∂α_j = s_j while keeping all other parameters fixed. In other words, the step of SGD is chosen to solve ∂Ψ/∂α_j = 0. Denoting P_{α_j} = ∂P/∂α_j and solving, we obtain:

    α_j = s_j (P - α_j P_{α_j}) / ((n - s_j) P_{α_j}).    (12)

Since P is linear in each α_j, neither P - α_j P_{α_j} nor P_{α_j} contains any α_j variables. We repeat this for all j, and continue this process until all differences |s_j - n α_j P_{α_j} / P|, j = 1, k, are below some threshold. Algorithm 1 shows pseudocode for the solving process.

Algorithm 1 Solving for the αs

    maxError = infinity
    while maxError >= threshold do
        maxError = -1
        for each alpha_j do
            value = s_j * (P - alpha_j * P_alpha_j) / ((n - s_j) * P_alpha_j)
            alpha_j = value
            error = | s_j - n * alpha_j * P_alpha_j / P |
            maxError = max(error, maxError)
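The update of Eq. (12) is simple to implement when P and its derivatives are available in factored form. The following Java sketch runs the loop of Algorithm 1 for the 1D-only summary of Example 3.2, where P = (Σ α)(Σ β)(Σ γ) and P_{α_j} is the product of the other two factors; it is a toy illustration under those assumptions, not the EntropyDB solver.

    // Toy illustration of Algorithm 1 / Eq. (12) for the 1D-only summary of Example 3.2.
    // Assumes P factorizes as (sum of alphas)(sum of betas)(sum of gammas); not the EntropyDB solver.
    public class TinySolver {
        public static void main(String[] args) {
            double n = 10;
            double[][] vars  = {{1, 1}, {1, 1}, {1, 1}};   // alpha, beta, gamma, initialized to 1
            double[][] stats = {{3, 7}, {8, 2}, {6, 4}};   // observed statistics s_j for A, B, C

            for (int iter = 0; iter < 50; iter++) {
                for (int attr = 0; attr < 3; attr++) {
                    for (int j = 0; j < 2; j++) {
                        // P_alpha_j is the product of the other attributes' sums; P = (this sum) * P_alpha_j.
                        double pDeriv = 1;
                        for (int other = 0; other < 3; other++)
                            if (other != attr) pDeriv *= vars[other][0] + vars[other][1];
                        double p = (vars[attr][0] + vars[attr][1]) * pDeriv;
                        double s = stats[attr][j];
                        // Eq. (12): alpha_j = s_j (P - alpha_j P_alpha_j) / ((n - s_j) P_alpha_j)
                        vars[attr][j] = s * (p - vars[attr][j] * pDeriv) / ((n - s) * pDeriv);
                    }
                }
            }
            // At convergence, n * alpha_j * P_alpha_j / P matches each s_j (e.g., 10*alpha_1/(alpha_1+alpha_2) = 3).
            System.out.println(java.util.Arrays.deepToString(vars));
        }
    }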

4. OPTIMIZATIONS

We now discuss three optimizations: (1) summary compression in Sec. 4.1, (2) optimized query processing in Sec. 4.2, and (3) selection of statistics in Sec. 4.3.

4.1 Compression of the Data Summary

The summary consists of the polynomial P that, by definition, has |Tup| monomials, where |Tup| = Π_{i=1}^{m} Ni. We describe a technique that compresses the summary to a size closer to O(Σ_i Ni).

We start by walking through an example with three attributes, A, B, and C, each with an active domain of size N1 = N2 = N3 = 1000. Suppose first that we have only 1D statistics. Then, instead of representing P as a sum of 1000^3 monomials, P = Σ_{i,j,k∈[1000]} αi βj γk, we factorize it to P = (Σ αi)(Σ βj)(Σ γk); the new representation has size 3 · 1000.

Now, suppose we add a single 3D statistic on ABC: A = 3 ∧ B = 4 ∧ C = 5. The new variable, call it δ, occurs in a single monomial of P, namely α3 β4 γ5 δ. Thus, we can compress P to (Σ αi)(Σ βj)(Σ γk) + α3 β4 γ5 (δ - 1).

Instead, suppose we add a single 2D range statistic on AB, say A ∈ [101, 200] ∧ B ∈ [501, 600], and call its associated variable δ1. This will affect 100 · 100 · 1000 monomials. We can avoid enumerating them by noting that they, too, factorize. The polynomial compresses to (Σ αi)(Σ βj)(Σ γk) + (Σ_{i=101}^{200} αi)(Σ_{j=501}^{600} βj)(Σ γk)(δ1 - 1).

Finally, suppose we have three 2D statistics: the previous one on AB plus the statistics B ∈ [551, 650] ∧ C ∈ [801, 900] and B ∈ [651, 700] ∧ C ∈ [701, 800] on BC. Their associated variables are δ1, δ2, and δ3. Now we need to account for the fact that 100 · 50 · 100 monomials contain both δ1 and δ2. Applying the inclusion/exclusion principle, P compresses to the following (the products of 1D sums correspond to part (i) of Theorem 4.1 below, and the multi-dimensional correction terms to part (ii)):

    P = (Σ αi)(Σ βj)(Σ γk) + (Σ_{i=101}^{200} αi)(Σ_{j=501}^{600} βj)(Σ γk)(δ1 - 1)    (13)
      + (Σ αi) [ (Σ_{j=551}^{650} βj)(Σ_{k=801}^{900} γk)(δ2 - 1) + (Σ_{j=651}^{700} βj)(Σ_{k=701}^{800} γk)(δ3 - 1) ]    (14)
      + (Σ_{i=101}^{200} αi)(Σ_{j=551}^{600} βj)(Σ_{k=801}^{900} γk)(δ1 - 1)(δ2 - 1).    (15)

The size, counting only the αs, βs, and γs for simplicity, is 3000 + 1200 + 1350 + 250.

Before proving the general formula for P, note that this compression is related to standard algebraic factorization techniques involving kernel extraction and rectangle coverings [13]; both techniques reduce the size of a polynomial by factoring out divisors. The standard techniques, however, are unsuitable for our use because they require enumeration of the product terms in the sum-of-products (SOP) polynomial to extract kernels and form cube matrices. Our polynomial in SOP form is too large to be materialized, making these techniques infeasible. It is future work to investigate other factorization techniques geared towards massive polynomials.

We now make the following three assumptions for the rest of the paper.
• Each predicate has the form π_j = ⋀_{i=1}^{m} ρ_{ij}, where m is the number of attributes and ρ_{ij} is the projection of π_j onto Ai. If j ∈ Ji, then π_j ≡ ρ_{ij}. For any set of indices of multi-dimensional statistics S ⊂ [k], we denote ρ_{iS} = ⋀_{j∈S} ρ_{ij} and π_S = ⋀_i ρ_{iS}; as usual, when S = ∅, then ρ_{i∅} ≡ true.
• Each ρ_{ij} is a range predicate Ai ∈ [u, v].
• For each set of attributes I, the multi-dimensional statistics whose attributes are exactly those in I are disjoint; i.e., for j1, j2 whose attributes are I: ρ_{ij} ≢ true for i ∈ I, ρ_{ij} ≡ true for i ∉ I, and π_{j1} ∧ π_{j2} ≡ false.

Using this, define J_I ⊆ P([k]) (where P([k]) is the power set of {1, 2, ..., k}) for I ⊆ [m] to be the set of sets of multi-dimensional statistics whose combined attributes are {Ai : i ∈ I} and whose intersection is non-empty (i.e., not false). In other words, for each S ∈ J_I, ρ_{iS} ∉ {true, false} for i ∈ I and ρ_{iS} ≡ true for i ∉ I. For example, suppose we have the three 2D statistics from before: π_{j1} = A1 ∈ [101, 200] ∧ A2 ∈ [501, 600], π_{j2} = A2 ∈ [551, 650] ∧ A3 ∈ [801, 900], and π_{j3} = A2 ∈ [651, 700] ∧ A3 ∈ [701, 800]. Then, {j1} ∈ J_{1,2} and {j2}, {j3} ∈ J_{2,3}. Further, {j1, j2} ∈ J_{1,2,3} because ρ_{2j1} ∧ ρ_{2j2} ≢ false. However, {j1, j3} ∉ J_{1,2,3} because ρ_{2j1} ∧ ρ_{2j3} ≡ false.

Using these definitions, we now give the compression.

Theorem 4.1. The polynomial P is equivalent to:

    P = Σ_{I⊆[m]}  [ Π_{i∉I} ( Σ_{j∈Ji} α_j ) ]_(i)  ·  [ Σ_{S∈J_I} Π_{i∈I} ( Σ_{j∈Ji : π_j ∧ ρ_{iS} ≢ false} α_j ) Π_{j∈S} (α_j - 1) ]_(ii)

The proof uses induction on the size of I, but we omit it for lack of space.

To give intuition, when I = ∅, we get the sum over the 1D statistics because when S = ∅, part (ii) equals 1. When I is not empty, part (ii) has one summand for each set S of multi-dimensional statistics whose attributes are I and whose intersection is non-empty. For each such S, the summand sums up all 1-dimensional variables α_j, j ∈ Ji, that are in the ith projection of the predicate π_S (this is what the condition (π_j ∧ ρ_{iS}) ≢ false checks) and multiplies with terms α_j - 1 for j ∈ S.

At a high level, our algorithm computes the compressed representation of P by first computing the summand for when I = ∅ by iterating over all 1-dimensional statistics. It then iterates over the multi-dimensional statistics and builds a map from each attribute set I to the statistics defined on exactly those attributes; i.e., I → J_I such that |S| = 1 for S ∈ J_I. It then iteratively loops over this map, taking the cross product of different values, J_I and J_I', to see if any new J_{I∪I'} can be generated. If so, J_{I∪I'} is added to the map. Once done, it iterates over the keys in this map to build the summands for each I.

The algorithm can be used during query answering to compute the compressed representation of Pq from P (Sec. 3.2) by rebuilding part (ii) for the new q. However, as this is inefficient and may increase the size of our polynomial, our system performs query answering differently, as explained in the next section.

We now analyze the size of the compressed polynomial P. Let Ba denote the number of non-empty J_I; i.e., the number of unique multi-dimensional attribute sets. Since Ba < 2^m and Σ_{i=1}^{m} Ni ≫ 2^m, Ba is dominated by Σ_{i=1}^{m} Ni. For some I, part (i) of the compression is O(Σ_{i=1}^{m} Ni). Part (ii) of the compression is more complex. For some S ∈ J_I, the summand is of size O(Σ_{i=1}^{m} Ni + |S|). As |S| ≤ Ba ≪ Σ_{i=1}^{m} Ni, the summand is only O(Σ_{i=1}^{m} Ni). Putting it together, for some I, the size is O(Σ_i Ni + |J_I| Σ_i Ni) = O(|J_I| Σ_i Ni).

|J_I| is the number of sets of multi-dimensional statistics whose combined attributes are {Ai : i ∈ I} and whose intersection is non-empty. A way to think about this is that each Ai defines a dimension in |I|-dimensional space. Each S ∈ J_I defines a rectangle in this hyper-space. This means |J_I| is the number of rectangle coverings defined by the statistics over {Ai : i ∈ I}. If we denote R = max_I |J_I|, then the size of the summand is O(R Σ_i Ni).

Further, although there are 2^m possible I, J_I is non-empty for only Ba + 1 values of I (the 1 is from I = ∅). Therefore, the size of the compression is O(Ba R Σ_i Ni).

Theorem 4.2. The size of the polynomial is O(Ba R Σ_{i=1}^{m} Ni), where Ba is the number of unique multi-dimensional attribute sets and R is the largest number of rectangle coverings defined by the statistics over some I.

In the worst case, if one gathers all possible multi-dimensional statistics, this compression will be worse than the uncompressed polynomial, which is of size O(Π_{i=1}^{m} Ni). However, in practice, Ba < m, and R is dependent on the number and type of statistics collected and results in a significant reduction of polynomial size to one closer to O(Σ_{i=1}^{m} Ni) (see Fig. 2 discussion).
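For concreteness, here is one way to account for the 3000 + 1200 + 1350 + 250 size quoted for Eq. (13)–(15) (our reading, counting only occurrences of α, β, and γ), set against the bound of Theorem 4.2:

\[
\underbrace{(1000+1000+1000)}_{\text{1D products, Eq. (13)}}
+ \underbrace{(100+100+1000)}_{\delta_1\text{ term, Eq. (13)}}
+ \underbrace{(1000 + (100+100) + (50+100))}_{\text{Eq. (14)}}
+ \underbrace{(100+50+100)}_{\text{Eq. (15)}}
= 5800,
\]

versus the 1000^3 = 10^9 monomials of the uncompressed sum-of-products form.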

4.2 Optimized Query Answering

In this section, we assume that the query q is a counting query defined by a conjunction of predicates, one over each attribute Ai; i.e., q = |σ_π(I)|, where

    π = ρ1 ∧ ··· ∧ ρm    (16)

and ρi is a predicate over the attribute Ai. If q ignores Ai, then we simply set ρi ≡ true. Our goal is to compute E[⟨q, I⟩]. In Sec. 3.2, we described a direct approach that consists of constructing a new polynomial Pq and returning Eq. (10). However, as described in Sec. 3.2 and Sec. 4.1, this may be expensive.

We describe here an optimized approach to compute E[⟨q, I⟩] directly from P. The advantage of this method is that it does not require any restructuring or rebuilding of the polynomial. Instead, it can use any optimized oracle for evaluating P on given inputs. Our optimization has two parts: a new formula for E[⟨q, I⟩] and a new formula for derivatives.

New formula for E[⟨q, I⟩]: Let π_j be the predicate associated to the jth statistical query. In other words, ⟨cj, I⟩ = |σ_{πj}(I)|. The next lemma applies to any query q defined by some predicate π. Recall that β is the new variable associated to q in Pq (Sec. 3.2).

Lemma 4.3. For any ℓ variables α_{j1}, ..., α_{jℓ} of Pq:
(1) If the logical implication π_{j1} ∧ ··· ∧ π_{jℓ} ⇒ π holds, then

    α_{j1} ··· α_{jℓ} ∂^ℓ Pq / (∂α_{j1} ··· ∂α_{jℓ}) = α_{j1} ··· α_{jℓ} β ∂^{ℓ+1} Pq / (∂α_{j1} ··· ∂α_{jℓ} ∂β).    (17)

(2) If the logical equivalence π_{j1} ∧ ··· ∧ π_{jℓ} ⇔ π holds, then

    α_{j1} ··· α_{jℓ} ∂^ℓ Pq / (∂α_{j1} ··· ∂α_{jℓ}) = β ∂Pq / ∂β.    (18)

Proof. (1) The proof is immediate by noting that every monomial of Pq that contains all variables α_{j1}, ..., α_{jℓ} also contains β; therefore, all monomials on the LHS of Eq. (17) contain β and thus remain unaffected by applying the operator β∂/∂β.
(2) From item (1), we derive Eq. (17); we prove now that the RHS of Eq. (17) equals β∂Pq/∂β. We apply item (1) again to the implication π ⇒ π_{j1} and obtain β∂Pq/∂β = βα_{j1} ∂²Pq/(∂β∂α_{j1}) (the role of β in Eq. (17) is now played by α_{j1}). As P is linear, the order of partials does not matter, and this allows us to remove the operator α_{j1}∂/∂α_{j1} from the RHS of Eq. (17). By repeating the argument for π ⇒ π_{j2}, π ⇒ π_{j3}, etc., we remove α_{j2}∂/∂α_{j2}, then α_{j3}∂/∂α_{j3}, etc. from the RHS.

Corollary 4.4. (1) Assume q is defined by a point predicate π = (A1 = v1 ∧ ··· ∧ Aℓ = vℓ) for some ℓ ≤ m. For each i = 1, ℓ, denote ji the index of the statistic associated to the value vi. In other words, the predicate π_{ji} ≡ (Ai = vi). Then,

    E[⟨q, I⟩] = (n/P) α_{j1} ··· α_{jℓ} ∂^ℓ P / (∂α_{j1} ··· ∂α_{jℓ}).    (19)

(2) Let q be the query defined by a predicate as in Eq. (16). Then,

    E[⟨q, I⟩] = Σ_{j1∈J1 : π_{j1}⇒ρ1} ··· Σ_{jm∈Jm : π_{jm}⇒ρm} (n/P) α_{j1} ··· α_{jm} ∂^m P / (∂α_{j1} ··· ∂α_{jm}).    (20)

Proof. (1) Eq. (19) follows from Eq. (10), Eq. (18), and the fact that Pq[β = 1] ≡ P. (2) Follows from (1) by expanding q as a sum of point queries as in Lemma 4.3 (1).

In order to compute a query using Eq. (20), we would have to examine all m-dimensional points that satisfy the query's predicate, convert each point into the corresponding 1D statistics, and use Eq. (19) to estimate the count of the number of tuples at this point. Clearly, this is inefficient when q contains any range predicate containing many point queries.

New formula for derivatives: Thus, to compute E[⟨q, I⟩], one has to evaluate several partial derivatives of P. Recall that P is stored in a highly compressed format, and therefore, computing the derivative may involve nontrivial manipulations. Instead, we use the fact that our polynomial is overcomplete, meaning that P = Σ_{j∈Ji} α_j P_j, where P_j, j ∈ Ji, does not depend on any variable in {α_j : j ∈ Ji} (Eq. (7)). Let ρi be any predicate on the attribute Ai. Then,

    Σ_{ji∈Ji : π_{ji}⇒ρi} α_{ji} ∂P/∂α_{ji} = P[ ⋀_{j∈Ji : π_j ⇏ ρi} α_j = 0 ].    (21)

Thus, in order to compute the summation on the left, it suffices to compute P after setting to 0 the values of all variables α_j, j ∈ Ji, that do not satisfy the predicate ρi (this is what the condition π_{ji} ⇏ ρi checks).

Finally, we combine this with Eq. (20) and obtain the following, much simplified formula for answering a query q defined by a predicate of the form Eq. (16):

    E[⟨q, I⟩] = (n/P) P[ ⋀_{i=1,m} ⋀_{j∈Ji : π_j ⇏ ρi} α_j = 0 ].

In other words, we set to 0 all 1D variables α_j that correspond to values that do not satisfy the query, evaluate the polynomial P, and multiply it by n/P (which is a precomputed constant independent of the query). For example, if the query ignores an attribute Ai, then we leave the 1D variables for that attribute, α_j, j ∈ Ji, unchanged. If the query checks a range predicate Ai ∈ [u, v], then we set α_j = 0 for all 1D variables α_j corresponding to values of Ai outside that range.

Example 4.5. Consider three attributes A, B, and C, each with domain 1000, and the three multi-dimensional statistics from before: one AB statistic A ∈ [101, 200] ∧ B ∈ [501, 600] and two BC statistics B ∈ [551, 650] ∧ C ∈ [801, 900] and B ∈ [651, 700] ∧ C ∈ [701, 800]. The polynomial P is shown in Eq. (13)–(15). Consider the query:

    q: SELECT COUNT(*) FROM R
       WHERE A in [36,150] AND C in [660,834]

We estimate q using our formula (n/P) P[α_{1:35} = 0, α_{151:1000} = 0, γ_{1:659} = 0, γ_{835:1000} = 0]. There is no need to compute a representation of a new polynomial.

4.3 Choosing Statistics

In this section, we discuss how we choose the multi-dimensional statistics. Recall that our summary always includes all 1D statistics of the form Ai = v for all attributes Ai and all values v in the active domain Di. We describe here how to trade off the size of the summary for the precision of the MaxEnt model.

A first choice that we make is to include only 2D statistics. It has been shown that restricting to pairwise correlations offers a reasonable compromise between the number of statistics needed and the summary's accuracy [22]. Furthermore, we restrict each Ai1Ai2 statistic to be a range predicate; i.e., π_j ≡ Ai1 ∈ [u1, v1] ∧ Ai2 ∈ [u2, v2]. As explained in Sec. 4.1, the predicates over the same attributes Ai1Ai2 are disjoint.

The problem is as follows: given a budget B = Ba · Bs, which Ba attribute pairs Ai1Ai2 do we collect statistics on, and which Bs statistics do we collect for each attribute pair? This is a complex problem, and we make the simplifying assumption that B and Ba are known, but we explore different choices of Ba in Sec. 6. It is future work to investigate automatic techniques for determining the budgets.

Given Ba, there are two main considerations when picking pairs: attribute cover and attribute correlation. If we focus only on correlation, we can pick the set of attribute pairs that are not uniform (this can be checked by calculating the chi-squared coefficient and seeing if it is close to 0) and have the highest combined correlation such that every pair has at least one attribute not included in any previously chosen, more correlated pair. If we also consider attribute cover, we can pick the set of pairs that covers the most attributes with the highest combined correlation. For example, if Ba = 2 and we have the attribute pairs BC, AB, CD, and AD in order of most to least correlated, then if we only consider correlation, we would choose AB and BC. However, if we consider attribute cover, we would choose AB and CD. We experiment with both of these choices in Sec. 6 and, in the end, conclude that considering attribute cover achieves more precise query results for the same budget than the alternative.

Next, for a given attribute pair of type Ai1Ai2, we need to choose the best Bs 2D range predicates [u1, v1] × [u2, v2]. We consider three heuristics and show experimental results to determine which technique yields, on average, the lowest error on query results.

LARGE SINGLE CELL. In this heuristic, the range predicates are single point predicates, Ai1 = u1 ∧ Ai2 = u2, and we choose the points (u1, u2) as the Bs most popular values in the two-dimensional space; i.e., the Bs largest values of |σ_{Ai1=u1 ∧ Ai2=u2}(I)|.

ZERO SINGLE CELL. In this heuristic, we select the empty/zero/nonexistent cells; i.e., we choose Bs points (u1, u2) s.t. σ_{Ai1=u1 ∧ Ai2=u2}(I) = ∅. If there are fewer than Bs such points, we choose the remaining points as in LARGE SINGLE CELL. The justification for this heuristic is that, given only the 1D statistics, the MaxEnt model will produce false positives ("phantom" tuples) in empty cells; this is the opposite problem encountered by sampling techniques, which return false negatives. This heuristic has another advantage because the value of α_j in P is always 0 and does not need to be updated during solving.

COMPOSITE. This method partitions the entire space Di1 × Di2 into a set of Bs disjoint rectangles and associates one statistic with each rectangle. We do this using an adaptation of KD-trees. The only difference between our KD-tree algorithm and the traditional one is our splitting condition. Instead of splitting on the median, we split on the value that has the lowest sum squared average value difference. This is because we want our KD-tree to best represent the true values. Suppose we have cell counts on dimensions A and A' as shown in Fig. 2 (a). For the next vertical split, if we followed the standard KD-tree algorithm, we would choose the second split. Instead, our method chooses the first split. Using the first split minimizes the sum squared error.

Our COMPOSITE method repeatedly splits the attribute domains Di1 and Di2 (alternating) by choosing the best split value until it exhausts the budget Bs. Then, for each rectangle [u1, v1] × [u2, v2] in the resulting KD-tree, it creates a 2D statistic (cj, sj), where the query cj is associated with the 2D range predicate and the numerical value sj = |σ_{Ai1∈[u1,v1] ∧ Ai2∈[u2,v2]}(I)|.

[Figure 2: (a) Illustration of the splitting step on a small grid of cell counts over attributes A × A' (values u1–u4 by u1'–u3', with counts 2,10,10,10 / 1,10,10,10 / 1,12,10,10), contrasting the best split for a traditional KD-tree with the best split for the data summary; and (b) query accuracy versus budget (500, 1,000, 2,000) for the three heuristics (ZERO, LARGE, COMPOSITE) and three selections: (b.i) 100 heavy hitter values, (b.ii) nonexistent values, and (b.iii) 100 light hitter values.]

We evaluate the three heuristics on the dataset of flights in the United States restricted to the attributes (fl_date, fl_time, distance) (see Sec 6 for details). We gather statistics using the different techniques and different budgets on the attribute pair (fl_time, distance). There are 5,022 possible 2D statistics, 1,334 of which exist in Flights.

We evaluate the accuracy of the resultant count of the query SELECT fl_time, distance, COUNT(*) FROM Flights WHERE fl_time = x AND distance = y GROUP BY fl_time, distance for 100 heavy hitter (x, y) values, 100 light hitter (x, y) values, and 200 random (x, y) nonexistent/zero values. We choose 200 zero values to match the 100+100 heavy and light hitters.

Figure 2 (b.i) plots the query accuracy versus method and budget for 100 heavy hitter values. Both LARGE and COMPOSITE achieve almost zero error for the larger budgets, while ZERO gets around 60 percent error no matter the budget.

Figure 2 (b.ii) plots the same for nonexistent values, and clearly ZERO does best because it captures the zero values first. COMPOSITE, however, gets a low error with a budget of 1,000 and outperforms LARGE. Interestingly, LARGE does slightly worse with a budget of 1,000 than 500. This is a result of the final value of P being larger with a larger budget, and this makes our estimates slightly higher than 0.5, which we round up to 1. With a budget of 500, our estimates are slightly lower than 0.5, which we round down to 0.

Lastly, Figure 2 (b.iii) plots the same for 100 light hitter values, and while LARGE eventually outperforms COMPOSITE, COMPOSITE gets similar error for all budgets. In fact, COMPOSITE outperforms LARGE for a budget of 1,000 because LARGE predicts that more of the light hitter values are nonexistent than it does with a smaller budget, as less weight is distributed to the light hitter values.

Overall, we see that COMPOSITE is the best method to use across all queries and is the technique we use in our evaluation. Note that when the budget is high enough, around 5,000, all methods get almost no error on all queries because we have a large enough budget to capture the entire domain. Also note that the uncompressed polynomial sizes of these summaries are orders of magnitude larger than the compressed versions. For example, for a budget of 2,000, the uncompressed polynomial has 4.4 million terms while the compressed polynomial has only 9,000 terms.

5. IMPLEMENTATION

We implemented our polynomial solver and query evaluator in Java 1.8, in a prototype system that we call EntropyDB. We created our own polynomial class and variable types to implement our factorization. Since the initial implementation of our solver took an estimated 3 months to run on the experiments in Sec 6, we further optimized our factorization and solver by using bitmaps to associate variables with their statistics and using Java's parallel streaming library to parallelize polynomial evaluation. Also, by using Java to answer queries, we were able to store the polynomial factorization in memory. By utilizing these techniques, we reduced the time to learn the model (solver runtime) to 1 day and saw a decrease in query answering runtime from around 10 sec to 500 ms (95% decrease). We show more query evaluation performance results in Sec. 6.

Lastly, we stored the polynomial variables in a Postgres 9.5.5 database and stored the polynomial factorization in a text file. We perform all experiments on a 64-bit Linux machine running Ubuntu 5.4.0. The machine has 120 CPUs and 1 TB of memory. For the timing results, the Postgres database, which stores all the samples, also resides on this machine and has a shared buffer size of 250 GB.
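To tie Sec. 4.2 and the implementation together, the following Java sketch answers a conjunctive range query over a purely 1D-factored summary by skipping (i.e., zeroing) the 1D variables outside the query's ranges, with each per-attribute sum computed via Java's parallel streams. It is a simplified illustration under stated assumptions — it ignores the multi-dimensional correction terms of the full compression and is not the EntropyDB code; all names are ours.

    import java.util.stream.IntStream;

    // Simplified illustration: E[<q, I>] = (n / P) * P[variables outside the query ranges = 0]
    // for a summary with only 1D statistics, so P = (sum over A)(sum over B)(sum over C).
    // Multi-dimensional correction terms are omitted for brevity; not the EntropyDB code.
    public class FactoredQueryEval {
        // alphaByAttr[i][v]: 1D variable for value v of attribute i; lo[i]..hi[i]: query range on attribute i.
        static double answer(double[][] alphaByAttr, int[] lo, int[] hi, long n, double fullP) {
            double restrictedP = 1.0;
            for (int i = 0; i < alphaByAttr.length; i++) {
                final double[] alphas = alphaByAttr[i];
                // Variables outside [lo, hi] are set to 0, i.e., simply excluded from the sum.
                restrictedP *= IntStream.rangeClosed(lo[i], hi[i])
                                        .parallel()
                                        .mapToDouble(v -> alphas[v])
                                        .sum();
            }
            return n * restrictedP / fullP;
        }

        public static void main(String[] args) {
            double[][] alphas = new double[3][1000];          // toy 3-attribute summary, 1,000 values each
            for (double[] a : alphas) java.util.Arrays.fill(a, 1.0);
            double fullP = 1000.0 * 1000.0 * 1000.0;          // P with every variable left at 1
            int[] lo = {35, 0, 659};                          // A in [36,150], B unrestricted, C in [660,834]
            int[] hi = {149, 999, 833};                       //   (0-indexed value positions)
            System.out.println(answer(alphas, lo, hi, 500_000L, fullP));
        }
    }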

6. EVALUATION

In this section, we evaluate the performance of EntropyDB in terms of query accuracy and query execution time. We compare our approach to uniform sampling and stratified sampling.

6.1 Experimental Setup

For all our summaries, we ran our solver for 30 iterations using the method presented in Sec. 3.3 or until the error was below 1 × 10^-6. Our summaries took under 1 day to compute, with the majority of the time spent building the polynomial and solving for the parameters.

We evaluate EntropyDB on two real datasets as opposed to benchmark data to measure query accuracy in the presence of naturally occurring attribute correlations. The first dataset comprises information on flights in the United States from January 1990 to July 2015 [1]. We load the data into PostgreSQL, remove null values, and bin all real-valued attributes into equi-width buckets. We further reduce the size of the active domain to decrease memory usage and solver runtime by binning cities such that the two most popular cities in each state are separated and the remaining less popular cities are grouped into a city called 'Other'. We use equi-width buckets to facilitate transforming a user's query into our domain and to avoid hiding outliers, but it is future work to try different bucketization strategies. The resulting relation, FlightsFine(fl_date, origin_city, dest_city, fl_time, distance), is 5 GB in size.

To vary the size of our active domain, we also create FlightsCoarse(fl_date, origin_state, dest_state, fl_time, distance), where we use the origin state and destination state as flight locations. The left table in Fig. 3 shows the resulting active domain sizes.

The second dataset is 210 GB in size. It comprises N-body particle simulation data [15], which captures the state of astronomy simulation particles at different moments in time (snapshots). The relation Particles(density, mass, x, y, z, grp, type, snapshot) contains attributes that capture particle properties and a binary attribute, grp, indicating if a particle is in a cluster or not. We bucketize the continuous attributes (density, mass, and position coordinates) into equi-width bins. The right table in Fig. 3 shows the resulting domain sizes.

[Figure 3: Active domain sizes. Each cell shows the number of distinct values after binning. Abbreviations shown in brackets are used in figures to refer to attribute names; e.g., OS stands for origin_state.]

    Flights (FlightsCoarse / FlightsFine): fl_date (FD) 307/307, origin (OS/OC) 54/147, dest (DS/DC) 54/147, fl_time (ET) 62/62, distance (DT) 81/81; # possible tuples 5.0 × 10^8 / 4.5 × 10^9.
    Particles: density 58, mass 52, x 21, y 21, z 21, grp 2, type 3, snapshot 3; # possible tuples 3.3 × 10^10.

We compute correlations on FlightsCoarse across all attribute-pairs and identify the following pairs as having the largest correlations (C stands for "coarse"): 1C = (origin_state, distance), 2C = (destination_state, distance), 3 = (fl_time, distance), and 4C = (origin_state, destination_state). We use the corresponding attributes, which are also the most correlated, for the finer-grained relation and refer to those attribute-pairs as 1F, 2F, and 4F.

As explained in Sec. 4.3, we build four summaries with a budget B = 3,000, chosen to keep runtime under a day while allowing for variations of Ba ("breadth") and Bs ("depth"), to show the difference in choosing statistics based solely on correlation (choosing statistics in order of most to least correlated) versus attribute cover (choosing statistics that cover the attributes with the highest combined correlation). The first summary, No2D, contains only 1D statistics. The next two, Ent1&2 and Ent3&4, use 1,500 buckets across the attribute-pairs (1, 2) and (3, 4), respectively. The final one, Ent1&2&3, uses 1,000 buckets for the three attribute-pairs (1, 2, 3). We do not include 2D statistics related to the flight date attribute because this attribute is relatively uniformly distributed and does not need a 2D statistic to correct for the MaxEnt's underlying uniformity assumption. Fig 4 summarizes the summaries.

[Figure 4: MaxEnt 2D statistics included in the summaries. The top row is the label of the MaxEnt method used in the graphs.]

    MaxEnt Method                No2D   1&2   3&4   1&2&3
    Pair 1 (origin, distance)           X           X
    Pair 2 (dest, distance)             X           X
    Pair 3 (time, distance)                   X     X
    Pair 4 (origin, dest)                     X

For sampling, we choose to compare with a uniform sample and four different stratified samples. We choose the stratified samples to be along the same attribute-pairs as the 2D statistics in our summaries; i.e., pair 1 through pair 4.

To test query accuracy, we use the following query template:

    SELECT A1, ..., Am, COUNT(*)
    FROM R
    WHERE A1 = 'v1' AND ... AND Am = 'vm';

We test the approaches on 400 unique (A1, ..., Am) values. We choose the attributes for the queries in a way that illustrates the strengths and weaknesses of EntropyDB. For the selected attributes, 100 of the values used in the experiments have the largest count (heavy hitters), 100 have the smallest count (light hitters), and 200 (to match the 200 existing values) have a zero true count (nonexistent/null values). To evaluate the accuracy of EntropyDB, we compute |true − est| / (true + est) on the heavy and light hitters. To evaluate how well EntropyDB distinguishes between rare and nonexistent values, we compute the F measure, 2 · precision · recall / (precision + recall), with precision = |{est_t > 0 : t ∈ light hitters}| / |{est_t > 0 : t ∈ (light hitters ∪ null values)}| and recall = |{est_t > 0 : t ∈ light hitters}| / 100. We do not compare the runtime of EntropyDB to sampling for the flights data because the dataset is small, and the runtime of EntropyDB is, on average, below 0.5 seconds and at most 1 sec. Sec. 6.3 reports runtime for the larger data.
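For reference, a minimal sketch of the two metrics just defined (illustrative only; names and structure are ours):

    // Minimal sketch of the accuracy metrics of Sec. 6.1 (illustrative only).
    public class Metrics {
        // Relative error |true - est| / (true + est), used for heavy and light hitters.
        static double relError(double truth, double est) {
            return Math.abs(truth - est) / (truth + est);
        }
        // F measure over light hitters and null values; an estimate > 0 counts as "predicted to exist".
        static double fMeasure(double[] lightHitterEsts, double[] nullValueEsts) {
            long tp = java.util.Arrays.stream(lightHitterEsts).filter(e -> e > 0).count();
            long fp = java.util.Arrays.stream(nullValueEsts).filter(e -> e > 0).count();
            double precision = tp / (double) (tp + fp);    // undefined (NaN) if nothing is predicted to exist
            double recall = tp / (double) lightHitterEsts.length;
            return 2 * precision * recall / (precision + recall);
        }
    }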

6.2 Query Accuracy

We first compare EntropyDB to uniform and stratified sampling on the flights dataset. We use one percent samples, which require approximately 100 MB of space when stored in PostgreSQL. To approximately match the sample size, our largest summary requires only 600 KB of space in PostgreSQL to store the polynomial variables and approximately 200 MB of space in a text file to store the polynomial factorization. This, however, could be improved and factorized further beyond what we did in our prototype implementation.

[Figure 5: Query error difference between all methods (Uni, Strat1–Strat4, Ent1&2, Ent3&4) and Ent1&2&3 over FlightsCoarse, for heavy hitters (top) and light hitters (bottom). The pair in parenthesis in the column header corresponds to the 2D statistic pair(s) used in the query template; heavy hitter panels: OB & DB (Pair 4), DB & ET & DT (Pair 2&3), FL & DB & DT (Pair 2); light hitter panels: ET & DT (Pair 3), DB & DT (Pair 2), FL & DB & DT (Pair 2). For reference, pair 1 is (origin/OB, distance/DT), pair 2 is (dest/DB, distance/DT), pair 3 is (time/ET, distance/DT), and pair 4 is (origin/OB, dest/DB).]

[Figure 6: F measure for light hitters and null values over FlightsCoarse (left) and FlightsFine (right) for Uni, Strat1–Strat4, Ent1&2, Ent3&4, and Ent1&2&3.]

Fig. 5 (top) shows query error differences between all methods and Ent1&2&3 (i.e., average error for method X minus average error for Ent1&2&3) for three different heavy hitter queries over FlightsCoarse. Hence, bars above zero indicate that Ent1&2&3 performs better, and vice versa. Each of the three query templates uses a different set of attributes that we manually select to illustrate different scenarios. The attributes of the query are shown in the column header in the figure, and any 2D statistic attribute-pair contained in the query attributes is in parentheses. Each bar shows the average of 100 query instances selecting different values for each template.

As the figure shows, Ent1&2&3 is comparable or better than sampling on two of the three queries and does worse than sampling on query 1. The reason it does worse on query 1 is that it does not have any 2D statistics over 4C, the attribute-pair used in the query, and 4C is fairly correlated. Our lack of a 2D statistic over 4C means we cannot correct for the MaxEnt's uniformity assumption. On the other hand, all samples are able to capture the correlation because the 100 heavy hitters for query 1 are responsible for approximately 25% of the data. This is further shown by Ent3&4, which has 4C as one of its 2D statistics, doing better than Ent1&2&3 on query 1.

Ent1&2&3 is comparable to sampling on query 2 because two of its 2D statistics cover the three attributes in the query. It is better than both Ent1&2 and Ent3&4 because each of those methods has only one 2D statistic over the attributes in the query. Finally, Ent1&2&3 is better than stratified sampling on query 3 because it not only contains a 2D statistic over 2C but also correctly captures the uniformity of flight date. This uniformity is also why Ent1&2 and a uniform sample do well on query 3. Another reason stratified sampling performs poorly on query 3 is that the result is highly skewed in the attributes of destination state and distance but remains uniform in flight date. The top 100 heavy hitter tuples all have the destination of 'CA' with a distance of 300. This means even a stratified sample over destination state and distance will likely not be able to capture the uniformity of flight date within the strata for 'CA' and 300 miles.

Fig. 5 (bottom) shows results for the same queries but for the bottom 100 light hitter values. In this case, EntropyDB always does better than uniform sampling. Our performance compared to stratified sampling depends on the stratification and query. Stratified sampling outperforms Ent1&2&3 when the stratification is exactly along the attributes involved in the query. For example, for query 1, the sample stratified on pair 3 outperforms EntropyDB by a significant amount because pair 3 is computed along the attributes in query 1. Interestingly, Ent3&4 and Ent1&2 do better than Ent1&2&3 on query 1 and query 2, respectively. Even though both of the query attributes for query 1 and query 2 are statistics in Ent1&2&3, Ent1&2 and Ent3&4 have more buckets and are thus able to capture more zero elements. Lastly, we see that for query 3, we are comparable to stratified sampling because we have a 2D statistic over pair 2C, and the other attribute, flight date, is relatively uniformly distributed in the query result.

We ran the same queries over the FlightsFine dataset and found identical trends in error difference. We omit the graph due to space constraints.

An important advantage of our approach is that it more accurately distinguishes between rare values and nonexistent values compared with stratified sampling, which often does not have samples for rare values when the stratification does not match the query attributes. To assess how well our approach works on those rare values, Fig. 6 shows the average F measure over fifteen 2- and 3-dimensional queries selecting light hitters and null values.

We see that Ent1&2 and Ent3&4 have F measures close to 0.72, beating all stratified samples and also beating Ent1&2&3. The key reason why they beat Ent1&2&3 is that these summaries have the largest numbers of buckets, which ensures they have more fine grained information and can more easily identify regions without tuples. Ent1&2&3 has an F measure close to 0.69, which is slightly lower than the stratified sample over pair 3 but better than all other samples. The reason the sample stratified over pair 3 performs well is that the flight time attribute has a more skewed distribution and has more rare values than other dimensions. A stratified sample over that dimension will be able to capture this. On the other hand, Ent1&2&3 will estimate a small count for any tuple containing a rare flight time value, and that count will be rounded to 0.

1163 Heavy Hitters Light Hitters den & mass mass & x & y y & z & grp & den & mass mass & x & y y & z & grp & 0.8 0.3 & grp & type & z type & grp & type & z type 0.6 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 0.2 1.0 0.4 0.1 0.2

0.0 F Measure 0.0 0.5 Avg Error Avg Avg Heavy Hitters Error Hitters Heavy Avg No2D No2D No2D No2D Ent1&2 Ent1&2 Ent1&2 Ent1&2 0.0 Ent3&4 Ent3&4 Ent3&4 Ent3&4 Ent1&2&3 Ent1&2&3 Ent1&2&3 Ent1&2&3 Uni Strat No2D EntAll Coarse Fine Coarse Fine 4 (a) (b)

Figure 8: (a) Error over 2D heavy hitter queries and (b) F measure over 2D light hitter and null value queries across different MaxEnt methods over FlightsCoarse and FlightsFine.

Figure 7: Query average error and runtime for three 4D selection queries on the Particles table. The stratified sample (orange) is stratified on (den, grp).

(each snapshot is approximately 70 GB in size). We build a 1 GB uniform sample for each subset of the table as well as a stratified sample over the pair density and group with the same sampling percentage as the uniform sample. We then build two MaxEnt summaries: EntNo2D uses no 2D statistics, and EntAll contains five 2D statistics with 100 buckets over each of the most correlated attributes, not including snapshot. We run a variety of 4D selection queries such as the ones from Sec. 6.2, split into heavy hitters and light hitters, and record the query accuracy and runtime.

Fig. 7 shows the query accuracy and runtime for three different selection queries as the number of snapshots increases. We see that EntropyDB consistently does better than sampling on query runtime, although both EntropyDB and stratified sampling execute queries in under one second. Stratified sampling outperforms uniform sampling because the stratified samples are generally smaller than their equally selective uniform sample.

In terms of query accuracy, sampling always does better than EntropyDB for the heavy hitter queries. This is expected because the bucketization of Particles is relatively coarse grained, and a 1 GB sample is sufficiently large to capture the heavy hitters. We do see that EntAll does significantly better than EntNo2D for query 1 because three of its five statistics are over the attributes of query 1, while only one statistic is over the attributes of queries 2 and 3. However, the results of query 3 are more uniform, which is why both EntNo2D and EntAll do well on it.

For the light hitter queries, none of the methods do well except for the stratified sample in query 1, because that query is over the attributes used in the stratification. EntAll does slightly better than stratified sampling on queries 2 and 3.

6.4 Statistics Selection

To compare different 2D statistic choices for our method, we look at the query accuracy of the four different MaxEnt methods summarized in Fig. 8. We use the flights dataset and query templates from Sec. 6.2. We run six different two-attribute selection queries over all possible pairs of the attributes covered by pairs 1 through 4; i.e., origin, destination, time, and distance. We select 100 heavy hitters, 100 light hitters, and 200 null values.

Fig. 8 shows the average error for the heavy hitters and the F measure for the light hitters across the six queries. Overall, we see that the summary with more attribute-pairs but fewer buckets (more "breadth"), Ent1&2&3, does best on the heavy hitters. On the other hand, for the light hitters, the summary with fewer attribute-pairs but more buckets (more "depth") that still covers the attributes, Ent3&4, does best. Ent3&4 doing better than Ent1&2 implies that choosing attribute-pairs that cover the attributes yields better accuracy than choosing the most correlated pairs: even though Ent1&2 has the most correlated attribute-pairs, it does not have a statistic containing flight time. Lastly, Ent1&2&3 does best on the heavy hitter queries yet slightly worse on the light hitter queries because it does not have as many buckets as Ent1&2 and Ent3&4 and thus cannot capture as many regions of the active domains with no tuples.

7. DISCUSSION

The above evaluation shows that EntropyDB is competitive with stratified sampling overall and better at distinguishing between infrequent and absent values. Importantly, unlike stratified sampling, EntropyDB's summaries permit multiple 2D statistics. The main limitations of EntropyDB are the dependence on the size of the active domain, correlation-based 2D statistic selection, manual bucketization, and limited query support.

To address the first problem, our future work is to investigate using standard algebraic factorization techniques on non-materializable polynomials. By further reducing the polynomial size, we will be able to handle larger domain sizes. We will also explore using statistical modeling techniques to decompose the attributes into 2D pairs more effectively, similar to [8]. To no longer require bucketizing categorical variables (like city), we will research hierarchical polynomials. These polynomials will start with coarse buckets (like states) and build separate polynomials for buckets that require more detail. This may require the user to wait while a new polynomial is loaded but would allow for different levels of query accuracy without sacrificing polynomial size.

Lastly, to address our queries not reporting error, we will add variance calculations to query answers. We have a closed-form formula for calculating the variance for a single statistic but still need to extend the formula to handle more complex queries. Additionally, our theoretical model can support more complex queries involving joins and other aggregates, but implementing these queries and making them run efficiently is future work.
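For concreteness, the sketch below shows one way the accuracy metrics reported in Secs. 6.3 and 6.4 could be computed: an average relative error over heavy-hitter groups, and an F measure that checks whether a summary separates rare (light hitter) values from values that do not exist at all (null values). The exact definitions are not restated in this section, so the formulas, the existence threshold, and all names and example values below are illustrative assumptions, not our implementation.

```python
def avg_relative_error(true_counts, est_counts):
    """Mean |estimate - truth| / truth over groups with a nonzero true count."""
    errs = [abs(est_counts.get(g, 0.0) - c) / c
            for g, c in true_counts.items() if c > 0]
    return sum(errs) / len(errs) if errs else 0.0


def f_measure(light_hitters, null_values, est_counts, threshold=0.5):
    """Treat 'estimated count above threshold' as predicting that the value exists.
    Light hitters (rare but present) are positives; null values (absent) are negatives."""
    tp = sum(1 for g in light_hitters if est_counts.get(g, 0.0) > threshold)
    fp = sum(1 for g in null_values if est_counts.get(g, 0.0) > threshold)
    fn = len(light_hitters) - tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0


# Hypothetical true counts and summary estimates for two heavy hitters,
# one light hitter, and one nonexistent (origin, destination) pair:
truth = {("SEA", "JFK"): 1200, ("SEA", "LAX"): 3400}
est = {("SEA", "JFK"): 1100.0, ("SEA", "LAX"): 3600.0, ("SEA", "XYZ"): 0.1}
print(avg_relative_error(truth, est))                      # ~0.07
print(f_measure([("SEA", "JFK")], [("SEA", "XYZ")], est))  # 1.0
```

Under this reading, a summary scores a high F measure only when it assigns near-zero mass to absent values while still giving rare values a nonzero estimate, which is the behavior the light hitter and null value queries are designed to probe.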

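Sec. 6.4 and the limitation noted above (correlation-based 2D statistic selection) contrast two ways of choosing attribute-pairs: taking the most correlated pairs versus taking pairs that cover all attributes. The paper gives no pseudocode for either heuristic, so the following is only a minimal sketch of that contrast; the greedy structure, function names, and correlation scores are hypothetical.

```python
def pick_most_correlated(corr, budget):
    """Correlation-only heuristic: keep the top-scoring attribute-pairs
    (can leave some attributes without any 2D statistic, as with Ent1&2)."""
    return sorted(corr, key=corr.get, reverse=True)[:budget]


def pick_covering_pairs(corr, budget):
    """Coverage-first heuristic: greedily prefer pairs that touch an attribute
    not yet covered by a chosen statistic (closer in spirit to Ent3&4)."""
    ranked = sorted(corr, key=corr.get, reverse=True)
    chosen, covered = [], set()
    for pair in ranked:                      # first pass: add uncovered attributes
        if len(chosen) == budget:
            break
        if set(pair) - covered:
            chosen.append(pair)
            covered |= set(pair)
    for pair in ranked:                      # second pass: spend any leftover budget
        if len(chosen) == budget:
            break
        if pair not in chosen:
            chosen.append(pair)
    return chosen


# Hypothetical correlation scores over flight attributes:
corr = {("origin", "destination"): 0.9, ("origin", "time"): 0.8,
        ("destination", "time"): 0.7, ("distance", "carrier"): 0.3}
print(pick_most_correlated(corr, budget=3))  # never covers distance or carrier
print(pick_covering_pairs(corr, budget=3))   # trades one correlated pair for coverage
```

The second heuristic mirrors the "breadth" argument from Sec. 6.4: sacrificing a highly correlated pair to cover an otherwise untouched attribute tends to help light hitter queries over that attribute.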
8. RELATED WORK

Although there has been work on the theoretical aspects of probabilistic databases [21], as far as we could find, there is no existing work on using a probabilistic database for data summarization. However, there has been work by Markl [18] on using the maximum entropy principle to estimate the selectivity of predicates. This is similar to our approach except that we allow for multiple predicates on an attribute and use the results to estimate the likelihood of a tuple being in the result of a query rather than the likelihood of a tuple being in the database.

Our work is also similar to that by Suciu and Ré [20] except that their goal was to estimate the size of the result of a query rather than tuple likelihood. Their method also relied on statistics on the number of distinct values of an attribute, whereas our statistics are based on the selectivity of each value of an attribute.

There has been much research in sampling techniques for faster query processing. Chaudhuri et al. [6] precompute the samples of data that minimize the errors due to variance in the data for a specific set of queries they predict. The work by [3] chooses multiple samples to use in query execution but only considers single-column stratifications. The work by [9] builds a measure-biased sample for each measure dimension to handle sum queries and uniform samples to handle count queries. Depending on whether the query is highly selective or not, they choose an appropriate sample. The later work of BlinkDB [2] improves on this by removing assumptions on the queries. BlinkDB only assumes that there is a set of columns that are queried, but the values for these columns can be anything among the possible set of values. BlinkDB then computes samples for each possible value of the predicate column in an online fashion and chooses the single best sample to run when a user executes a query.

Although we handle only linear queries, our work makes no assumptions on the query workload and can take into account multi-attribute combinations when choosing statistics. When a user executes a query, our method does not need to choose which summary to use. Further, the summary building is all done offline.

9. CONCLUSION

We presented EntropyDB, a new approach to generate probabilistic database summaries for interactive data exploration using the Principle of Maximum Entropy. Our approach is complementary to sampling. Unlike sampling, EntropyDB's summaries strive to be independent of user queries and capture correlations between multiple different attributes at the same time. Results from our prototype implementation on two real-world datasets up to 210 GB in size demonstrate that this approach is competitive with sampling for queries over frequent items while outperforming sampling on queries over less common items.

Acknowledgments. This work is supported by NSF 1614738 and NSF 1535565. Laurel Orr is supported by the NSF Graduate Research Fellowship.

10. REFERENCES
[1] http://www.transtats.bts.gov/.
[2] S. Agarwal et al. BlinkDB: Queries with bounded errors and bounded response times on very large data. In Proc. of EuroSys'13, pages 29–42, 2013.
[3] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 539–550, 2003.
[4] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy approach to natural language processing. Comput. Linguist., 22(1), 1996.
[5] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, January 2015.
[6] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In ACM SIGMOD Record, volume 30, pages 295–306, 2001.
[7] A. Crotty, A. Galakatos, E. Zgraggen, C. Binnig, and T. Kraska. The case for interactive data exploration accelerators (IDEAs). In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, page 11. ACM, 2016.
[8] A. Deshpande, M. N. Garofalakis, and R. Rastogi. Independence is good: Dependency-based histogram synopses for high-dimensional data. In SIGMOD Conference, 2001.
[9] B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang. Sample+Seek: Approximating aggregates with distribution precision guarantee. In Proc. SIGMOD, pages 679–694, 2016.
[10] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. Generalization in adaptive data analysis and holdout reuse. In Advances in Neural Information Processing Systems, pages 2350–2358, 2015.
[11] M. Hardt and G. N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 61–70. IEEE, 2010.
[12] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In ACM SIGMOD Record, volume 26, pages 171–182. ACM, 1997.
[13] A. Hosangadi, F. Fallah, and R. Kastner. Factoring and eliminating common subexpressions in polynomial expressions. In IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2004.
[14] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. Scalable approximate query processing with the DBO engine. ACM Transactions on Database Systems (TODS), 33(4):23, 2008.
[15] P. Jetley et al. Massively parallel cosmological simulations with ChaNGa. In Proc. IPDPS, 2008.
[16] S. Kandula, A. Shanbhag, A. Vitorovic, M. Olma, R. Grandl, S. Chaudhuri, and B. Ding. Quickr: Lazily approximating complex adhoc queries in bigdata clusters. In Proceedings of the 2016 International Conference on Management of Data, pages 631–646. ACM, 2016.
[17] C. Li et al. Optimizing linear counting queries under differential privacy. In Proc. of PODS, pages 123–134, 2010.
[18] V. Markl et al. Consistently estimating the selectivity of conjuncts of predicates. In Proc. of VLDB, pages 373–384. VLDB Endowment, 2005.
[19] B. Mozafari and N. Niu. A handbook for building an approximate query engine. IEEE Data Eng. Bull., 38(3):3–29, 2015.
[20] C. Ré and D. Suciu. Understanding cardinality estimation using entropy maximization. ACM TODS, 37(1):6, 2012.
[21] D. Suciu, D. Olteanu, C. Ré, and C. Koch. Probabilistic databases. Synthesis Lectures on Data Management, 3(2):1–180, 2011.
[22] K. Tzoumas, A. Deshpande, and C. S. Jensen. Efficiently adapting graphical models for selectivity estimation. The VLDB Journal, 22(1):3–27, 2013.
[23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2), Jan 2008.
