<<

arXiv:2004.08255v1 [cs.DS] 17 Apr 2020 oe.Hwvr re ttsis uha unie,heavy quantiles, as such , order However, nodes. eut rmec oe h oa omncto oto this of cost communication total mergi is The by computation node. obtained network each be distributed from a can results of statistics nodes aggregation in overall stored the are data where [15], pc opeiyaeboth an are complexity complexity Time space continuously. can results da we arrived new all time, covering incrementing with results one aggregated by For update one and is. trace come elements data data model where the streaming max [14], For large statistics. or aggregation min how all deviations, are matter values standard values, no mean sum, computed relatively examples, them be makes to which easy incremental, and mergeable are based statisti categories order two and into statistics divided aggregation scalability: be small on can on to easily metadata calculated ways The be new data. can out which space figuring metadata and the on compute time decades a both spent applicable of have not complexity Researchers intolerable are of their methods Because for brutal challenges. more classic own size, its data faces large and [11]. to atta dramatically is importance profiling great mining data data, data subsequently large-scale can preprocessing of When which performance [10], preprocessin [9], the cleaning data improve data other as S such for [ etc. work analysis, [8], guidance [4], un [7], data provides duplicates to thresholds information for determine discover researchers [6], crucial [3], helps anomalies detect is distribution It It data data. derstand [2]. large-scale for [1], especially data given about agae n ihabifdsuso fteftr directio future different the of in discussion packages brief existing area. a this of in with coverage and with Fin languages presented. are conclude streams, data s we high-speed scenarios, and various data in skewed algorithms perfor quantile Th and approximate covered. efficiency of the are improving models for distributed techniques multiple or quanti models approximate streaming statist compute studies over that order of algorithms an randomized analysis alg and on deterministic comprehensive Both focus computation. a quantile on we approximate present computation paper, and their this limitation support quantiles, In not the data. does Thus, space large-scale incremental. be memory computed or and be mergeable time to not difficult are one is However, they statistics, data. order large-scale metadata, of of metadata extract to helps uvyo prxmt unieCmuainon Computation Quantile Approximate of Survey A aapoln sasto ciiist ecietemetadata the describe to activities of set a is profiling Data Abstract grgto ttsisaenmdfrterpoet htth that property their for named are statistics Aggregation 2 hrtnSho fCmue cec,Uiest fWaterlo of University Science, Computer of School Cheriton A aavlm rw xesvl,dt profiling data extensively, grows volume data —As O ( | ag-cl aa(ehia Report) (Technical Data Large-scale v | .I I. ) where , NTRODUCTION 1 colo otae snhaUiest,Biig 100084, Beijing, University, Tsinghua Software, of School O (1) | v | sfrdsrbtdmodels distributed for As . stenme fnetwork of number the is hwiChen Zhiwei s[12]. cs orithms [13], s mance c as uch cause aby ta ched kind ally, uch en, 5], ny ng les on ey ic, of 1 d n g - . , oinZhang Aoqian , n one where es hnlna-ieslcinwsivne 2] nclas at In 1973 [25]. computing to invented of back was methods selection traced linear-time be when can least studies Algorithmic sort. after dynamic and [23] have histograms [24]. algorithms computations complicated equi-depth geometric more quantile as resolve to such addition, subroutines problems, In as used [22]. widely helpin been data interval, the time on clean or constraints to long able temporal very still appropriate are delayed quantiles are granularity, sensitive a inconsistent timestamps less timestamps have some imprecise are if where reflect They even data, prevalent, to [21]. temporal In deviation accurate outliers. data more to and and center objective mean data more absol with are compared and deviations quantiles values, deviations, [20] dirty standard collection [19], and contains data values networks dataset in sensor a used in When etc monitoring widely top-k, running-state are max, basic and sum, quantiles seven the with Besides, al of along for [18]. one defined, basic is operators the quantile is analysis, which data language, log Sawzall Googles the components In built-in etc. libraries. as Python, functions MATLAB, quantile-computing Excel, have including tools, instance, dat For analysis in one practice. data and statistics as theory important both regarded most in analysis and filter, are quality fundamental quantiles correlation most the Therefore, appropriate of reco [17]. set data an over databases in query set in correlation And set to for [16]. efficiency so values helps improving attribute the attribute it same demonstrate misplaced querying, the to identify For of applied to querying. values is as data among it and distribution cleaning cleaning, distance data data in in c used example, computational widely low at is reflect data Pdf to of proba able (pdf), the function thus are distribution (cdf), they function words, distribution distributi cumulative data other the the In of parameters. description the without generate to help They mergea way. a into some many them in turn them, form to incremental compute propo trying or are statistics, to structures order order these storage for In or produced way. structures newly data straight with customized a results existing in merging c results cannot by we So, them property. such pute preserve not do etc., hitters, pc opeiyis complexity space uniei h lmn tacranrn ntedataset the in rank certain a at element the is quantile A quantiles. statistic, order one on focus we summary, this In φ ∈ ,Wtro,Otro aaaNL3G1 N2L Canada Ontario, Waterloo, o, (0 ⌊ , φN 2 1) rtw otaleeet n hnrtr the return then and elements all sort we first , ⌋ t iecmlxt is complexity time Its . O ( N φ qatl vradtsto size of dataset a over -quantile ) biul.Hwvr nlarge-scale in However, obviously. China O ( N log N specify ) many bility to g om- and ost. ons sed ute rds ble N sic or re a 1 l , . . , 2

data, the method is infeasible under restrictions of memory • Update time It is the time spent on updating quantile size. Munro et al. has proved that any exact quantile algorithm answers when new element arrives. Fast updates can im- with p-pass scan over data requires at least Ω(N 1/p) space prove the user experience, so many streaming algorithms [26]. Besides, in streaming models, e.g., over streaming events take update time as a main consideration. [27], quantile algorithms should also be streaming. That is, • Accuracy It measures the distance between approximate they are permitted to scan each element only once and need quantiles and ground truth. Intuitively, the more accurate to update quantile answers instantaneously when receiving an algorithm is, the more space and the longer time it will new elements. There is no way to compute quantiles exactly consume. They are on the trade-off relationship. We use under such condition. Thus, approximation is introduced in approximation error, maximum actual approximation er- quantile computation. Approximate computation is an efficient ror and average actual approximation error as quantitative way to analyze large-scale data under restricted resources indicators to measure accuracy. [28]. On one hand, it raises computational efficiency and We collected and studied researches about approximate lower computational space. On the other hand, large scale quantile computation, then completed this survey. The survey of the dataset can dilute approximation effects. Large-scale includes various algorithms, varying from data sampling to data is usually dirty, which also makes approximate quantile data structure transformation, and techniques for optimization. endurable and applicable in industry. Significantly, the scale Important algorithms are listed in Table I. The remaining parts of data is relative, based on the availability of time and space. of this survey are organized as follows: In Section II, we in- So, the rule about how to choose between exact quantiles and troduce deterministic quantile algorithms over both streaming approximate quantiles differs in heterogeneous scenarios, de- models and distributed models. Section III discusses random- pending on the requirement for accuracy and the contradiction ized algorithms [32]. Section IV introduces some techniques between the scale of data and that of resources. When the cost and algorithms for improving the performance and efficiency of computing exact quantiles is intolerable and the results are of quantile algorithms in various scenarios. Section V presents not required to be totally precise, approximate quantiles are a a few off-the-shelf tools in industry for quantile computation. promising alternative. Finally, Section VI makes the conclusion and proposes inter- We denote approximation error by ǫ. A ǫ-approximate φ- esting directions for future research. quantile is any element whose rank is between − ǫN and r + ǫN after sort, where r = ⌊φN⌋. For example, we II. DETERMINISTIC ALGORITHMS want to calculate 0.1-approximate 0.3-quantile of the dataset An algorithm is deterministic while it returns a fixed answer 11, 21, 24, 61, 81, 39, 89, 56, 12, 51. As shown in Figure 1, we given the same dataset and query condition. Furthermore, sort the elements as 11, 12, 21, 24, 39, 51, 56, 61, 81, 89 and quantile algorithms are classified based on their application compute the of the quantile’s rank, which is [(0.3 − scenarios. In streaming models, where data elements arrive one 0.1) × 10, (0.3+0.1) × 10] = [2, 4]. Thus the answer can be by one in a streaming way, algorithms are required to answer one of 12, 21, 24. In order to further reduce computation space, quantile queries with only one-pass scan, given the data size approximate quantile computation is often combined with N [33] or not [34], [36], [37], [47], [48]. Except to answering randomized sampling, making the deterministic computation quantile queries for all arrived data, Lin et al. [35] concentrates becomes randomized. In such case, another parameter δ, or on tracing quantiles for the most recent N elements over a randomization degree, is introduced, meaning the algorithm data stream. In distributed models, where data or statistics are answers a correct quantile with a of at least 1 − δ. stored in distributed architectures such as sensor networks, algorithms are proposed to merge quantile results from child nodes using as low communication cost as possible to reduce energy consumption and prolong equipment life [20], [38], [40], [42], [45].

A. Streaming Model

Fig. 1. An example of a ǫ-approximate φ-quantile, where ǫ = 0.1 and The most prominent feature of streaming algorithms is that φ = 0.3. all data are required to be scanned only once. Besides, the length of the data stream may be uncertain and can even grow There are 3 basic metrics to assess an approximate quantile arbitrarily large. Thus, classic quantile algorithms are infeasi- algorithm [29]: ble for streaming models because of the limitation of memory. • Space complexity It is necessary for streaming algo- Therefore, the priority of approximate quantile algorithms for rithms. Due to the limitation of memory, only algo- streaming models is to minimize space complexities. In 2010, rithms using sublinear space are applicable [30]. It cor- Hung et al. [49] have proved that any comparison-based ǫ- responds to communication cost in distributed models. approximate quantile algorithm over streaming models needs 1 1 In a distributed sensor network, communication overhead space complexity of at least Ω( ǫ log ǫ ), which sets a lower consumes more power and limits the battery life of bound for these algorithms. power-constrained devices, such as wireless sensor nodes Both Jain et al. [50] and Agrawal et al. [51] proposed [31]. So, quantile algorithms are aimed at low space algorithms to compute quantiles with one-pass scan. How- complexity or low communication cost. ever, neither of them clarified the upper or lower bound of 3

TABLE I APPROXIMATEQUANTILEALGORITHMS

Algorithm Randomization Model Space complexity / Communication cost 1 2 MRL98 [33] Deterministic Streaming O( ǫ log (ǫN)) 1 GK01 [34] Deterministic Streaming O( ǫ log(ǫN)) 1 2 1 Lin SW [35] Deterministic Streaming O( ǫ log(ǫ N)+ ǫ2 ) 1 2 Lin n-of-N [35] Deterministic Streaming O( ǫ2 log (ǫN)) 1 1 Arasu FSSW [36] Deterministic Streaming O( ǫ log ǫ log N) 1 1 Arasu VSSW [36] Deterministic Streaming O( ǫ log ǫ log(ǫN) log N) 1 u GK-UN [37] Deterministic Streaming O( ǫ log(ǫP C(S ))) 1 Q-digest [20] Deterministic Distributed O( ǫ |v| log σ) |v| 2 GK04 [38] Deterministic Distributed O( ǫ log N) |v| Cormode05 [39] Deterministic Distributed O( ǫ2 log N) |v| Yi13 [40] Deterministic Distributed O( ǫ log N) 1 2 1 MRL99 [41] Randomized Streaming O( ǫ log ǫ ) 1 1.5 1 Agarwal13 [42] Randomized Streaming O( ǫ log ǫ ) 1 1 Felber15 [43] Randomized Streaming O( ǫ log ǫ ) 1 1 Karnin16 [44] Randomized Streaming O( ǫ log log ǫδ ) 1 Huang11 [45] Randomized Distributed O( ǫ p|v|h) 1 Haeupler18 [46] Randomized Distributed O(log log N + log ǫ )

approximation error. In 1997, Alsabti et al. [47] improved the • If only one buffer is empty, its height is set to l. algorithm and came up with a version with guaranteed error • If there are two or more empty buffers, their heights are bound, referred to as ARS97. Its basic idea is sampling and it set to 0. includes the following steps: • Otherwise, buffers of height l are collapsed, generating a 1) Divide the dataset into r partitions. buffer of height l +1. 2) For each partition, s elements and store them in By tuning b and k, MLR98 can narrow the approximation a sorted way. error within ǫ. Figure 2 demonstrates the trigger strategy when 3) Combine r partitions of data, generating one sequence b =3. The height of the strategy tree is logarithmic, thus the 1 2 for querying quantiles. space complexity is O( ǫ log (ǫN)). ARS97 is targeted at disk-resident data, rather than streaming models. Nevertheless, its idea to partition the entire dataset and maintain a sampled sorted sequence for quantile querying inspires quantile algorithms over streaming models afterwards. The inspired algorithm is referred to as MRL98 proposed by Manku et al. [33]. It requires the prior knowledge of the length N of data stream. Similar with ARS97, MRL98 divides the data stream into b blocks, samples k elements from each block and puts them into b buffers. Each buffer X is given a weight w(X), representing the number of elements covered by this buffer. The algorithm consists of 3 operations: • NEW Put the first bk elements into buffers successively Fig. 2. The collapsing strategy with 3 buffers. A node corresponds to a buffer and set their weights to 1. and its number denotes the weight. • COLLAPSE Compress elements from multiple buffers into one buffer. Specifically, each element from an in- The limitation of MRL98 is that it needs to know the length put buffer Xi would be duplicated w(Xi) times. Then of the data stream at first. However, in more cases, the length these duplicated elements are sorted and merged into is uncertain and may even grow arbitrarily large. In such case, a sequence, where k elements are selected at regular Greenwald et al. [34] proposes celebrated GK01 algorithm, intervals and stored in the output buffer Y , whose weight distinguished by its innovative data structure, Summaries, w(Y )= w(Xi) or S for short. The basic idea is that when N increases, the Pi • OUTPUT Select an element as the quantile answer from set of ǫ-approximate answers for querying φ-quantile expands b buffers. as well, so correctness can be retained even if removing some NEW and OUTPUT are straightforward, so the algorithm’s elements. S is a collection of tuples in the form of (vi,gi, ∆i), space complexity depends mainly on how to trigger COL- where vi is the element with the property vi ≤ vi+1, gi and LAPSE. MRL98 proposed a tree-structure trigger strategy. ∆i are 2 integers satisfying following conditions: Each buffer X is assigned a height l(X) and l is set to g ≤ r(v )+1 ≤ g + ∆ (1) X j i X j i minil(Xi). l(X) is set as the following standard: j≤i j≤i 4

gi + ∆i ≤⌊2ǫN⌋ (2) r(vi), whose bounds are guaranteed by (1), is the ground truth of vi’s rank. Obviously, there are at most gi +∆i −1 elements between vi−1 and vi. (2) makes sure that the range is within ⌊2ǫN⌋− 1. Therefore, for any φ ∈ (0, 1), there always exists a tuple (vi,gi, ∆i), where r(vi) ∈ [⌊(φ − ǫ)N⌋, ⌊(φ + ǫ)N⌋], and thus vi is a ǫ-approximation φ-quantile. To find the approximate quantile, we can find the least i satisfying: Fig. 3. Structure of SW model g + ∆ > 1+ ⌊ǫN⌋ + max (g + ∆ )/2 (3) X j i i i i j≤i ′ ǫN the problem: if 0 ≤ N − N ≤ ⌊ 2 ⌋, it can convert a ǫ ′ ′ and return vi−1 as the answer. S also supports several opera- 2 -approximate S covering N elements to a ǫ-approximate tions: S covering N elements. Its worst case space complexity is 1 2 1 . • INSERT Search S to find the least i so that vi > v and in- O( ǫ log(ǫ N)+ ǫ2 ) The distinction of n-of-N model is that it answers quantile sert (v, 1, ⌊2ǫN⌋) right before the tuple ti = (vi,gi, ∆i). queries instantaneously over the most recent elements where • DELETE Update gi+1 as gi+1 = gi+1 + gi and delete n is any integer not larger than a predefined . It takes ti. To maintain (2), ti is removable only when it satisfies n N advantage of the EH-partition technique [54] as shown in gi + gi+1 + ∆i+1 ≤⌊2ǫN⌋ (4) Figure 4. The technique classifies buckets, marking buckets at

• COMPRESS It can be found that DELETE is adding level i as i−bucket whose S covers all elements arriving since 1 g of the deleting tuple to that of its predecessor. So we the bucket’s timestamp. There are at most ⌈ λ ⌉ +1 buckets at each level, where λ ∈ (0, 1). When a new element arrives, the can delete multiple successive tuples, ti+1,ti+2, ..., ti+k model creates a 1−bucket and sets its timestamp to the current at the same time by updating gi as gi = gi + gi+1 + timestamp. When the number of i − buckets reaches ⌈ 1 ⌉ +2, gi+2 + ··· + gi+k and removing them. GK01 proposed a λ complicated COMPRESS strategy to reduce the size of the two oldest buckets at level i are merged into a 2i−bucket carrying the oldest timestamp iteratively until buckets at all S as small as possible: it executes when ti,ti+1, ..., ti+k 1 levels are less than ⌈ 1 ⌉ + 2. Because each bucket covers are removable on the arrival of every 2ǫ elements. λ 11 elements arriving since its timestamp, merging two buckets It proves that the maximum size of S is 2ǫ log(2ǫN). So, 1 is equal to removing the later one. Lin et al. also proved that its space complexity is O( ǫ log(ǫN)). Additionally, the sum- mary structure is also applicable in a wide range such as for any bucket b, its coverage Nb satisfies Kolmogorov-Smirnov statistical tests [52] and balanced par- Nb − 1 ≤ λN (5) allel computations [53]. GK01 is for computing quantiles over all arrived data. In order to guarantee that quantiles are ǫ-approximate, λ is set ǫ ǫ Sometimes quantiles of the most recent N elements in a stream to ǫ+2 . Each bucket preserves a 2 -approximate S. Quantiles are required. Lin et al. [35] expanded GK01 and proposed two are queried as follows: algorithms for such case: SW model and n-of-N model. 1) Scan the bucket list until finding the first bucket b SW model is for answering quantiles over the most recent N making Nb ≤ n. elements instantaneously, where N is predefined. The model 2) Use LIFT to convert Sb to a ǫ-approximate S covering puts the most recent N elements into several buckets in their n elements. arriving order. Rather than original elements, each buckets 3) Search S to find the quantile answer. ǫN stores a Summary [34] covering successive elements. ǫNb 2 According to (5), n − Nb ≤ ǫ+2 . Mark its predecessor bucket The buckets have 3 states as illustrated in Figure 3: ′ as b and we have Nb′ >n, thus n − Nb ≤ Nb′ − Nb − 1, and ǫN • A bucket is active when its coverage is less than . At ǫn 2 furthermore n − Nb ≤ ⌊ 2 ⌋. So, LIFT can be applied to Sb. this time, it maintains a ǫ -approximate S computed by 1 2 4 The worst case space complexity is O( ǫ2 log (ǫN)). GK01. Arasu et al. [36] generalized SW model and came up • A bucket is compressed when its coverage reaches with fixed- and variable- size sliding window approximate ǫN ǫ ǫ 2 . The 4 -approximate S would be compressed to 2 - quantile algorithms. A window is fixed-size when insertion and approximate S by an algorithm COMPRESS. deletion of elements must appear in pairs after initialization. • A bucket is expired if it is the oldest bucket when the The fixed-size sliding window model defines blocks and levels coverage of all buckets exceeds N. Once expired, the as Figure 5. Each level preserves a partitioning of the data bucket is removed from the bucket list. stream into non-overlapping blocks of equal size. Blocks and By maintaining and merging buckets, SW model answers levels are both numbered sequentially. Assuming the window quantile queries of elements covered by all unexpired buckets. size is N, the block b in level l contains a Summary [34], However, when an old bucket is just expired and the active denoted as F (N,ǫ) in this model, which covers elements with l ǫN l ǫN bucket has not been full, the coverage of all unexpired buckets arriving positions in the range [b2 4 , (b+1)2 4 −1]. Similar N ′ is less than N. The difference between N and N ′ is at with SW model, a bucket is assigned one of 3 states at any ǫN most ⌊ 2 ⌋− 1. An algorithm LIFT was proposed to resolve point of time: 5

Fig. 6. Fn(N,ǫ), a restriction of F (N,ǫ) to the last n elements.

satisfying 2k−1

Fig. 4. Structure of the EH-partition technique in n-of-N model, where λ = • INSERT Update all Fn(N,ǫ) in V (n,ǫ) by increment- 0.5. k k k+1 ǫ ing n. If n +1 = 2 + 1, create F2 (2 , 2 ) from k k ǫ F2 (2 , 2 ) and insert the new element into it, thus getting k k+1 ǫ • A block is active if all elements covered by it belong to F2 +1(2 , 2 ). k ǫ k ǫ the current window. • DELETE Compute Fn−1(2 , 2 ) from Fn(2 , 2 ). If n − k−1 k− k ǫ • A bucket is expired if it covers at least one element which 1=2 , remove F2 1 (2 , 2 ). is older than the other N elements. • QUERY In order to query ǫ-approximate quantiles over ′ • A bucket is under construction if some of its elements the most n ≤ n recent elements, find the integer l l−1 l l ǫ belong to the current window while others are yet to satisfying 2 < n ≤ 2 . Then query Fn(2 , 2 ) and arrive. return the answer. The highest level with active blocks or blocks under construc- The space complexity of the variable-size sliding window tion L is log ( 4 ). Blocks in levels above are marked expired. 2 ǫ model is O( 1 log 1 log(ǫN) log N). For each active block in level l, a F (N,ǫl) is retained, where ǫ ǫ ǫ (L−l) ǫl = 2(2L+2) 2 . When a block is under construction, For uncertain data, one may still expect to obtain some ǫl et al. GK01 computes its F (N, 2 ), and it is converted to a F (N,ǫl) consistent answers for queries [55]. Liang [37] ex- in the same way as COMPRESS in SW model. The F (N,ǫl) tended the problem and resolved approximate quantile queries occupies 1 space. Arasu et al. proved that using a set over uncertain data streams. More specifically, it focuses O( ǫl ) of Summaries covering N1,N2, ..., Ns elements each with on maintaining quantile Summaries [34] over data streams approximate error ǫ1,ǫ2, ..., ǫs, a ǫ-approximate quantile can whose elements are drawn from individually domain space, be computed, where ǫ1N1+ǫ2N2+···+ǫsNs . Thus, the represented by continuous or discrete pdf. An uncertain data ǫ = N +N +···+Ns 1 2 u u u fixed-size sliding window model can computes ǫ-approximate stream S is a sequence of elements as {e1 ,e2 , ...}, each of 1 1 quantiles over the last N elements using O( ǫ log ǫ log N) which is drawn from a domain Di with pdf fi : Di → (0, 1] u space. Contrary to a fixed-size window, a variable-size window such that p∈Di fi(p)=1. So, S is a concise representation of exponentialP or infinite number of possible worlds W. Each world W = {pi|pi ∈ Di,i = 1, 2, ...} is a deter-

ministic stream with probability P r(W ) = pi∈W fi(pi). Therefore, the definition of approximate quantilesQ is gen- u u eralized as the element p ∈ Di of ei ∈ S such that p pmin W ∈W P r(W )q(r, rW ) − W ∈W P r(W )q(r, rW ) ≤ ǫN, P p P where rW is the rank of p in W and pmin is the element p Fig. 5. Levels and blocks in the fixed-size sliding window model. minimizing W ∈W P r(W )q(r, rW ). Liang et al. proposed 2 P p error metric functions as q(r, rW ), the squared error func- p p 2 bears no limitation on insertion and deletion, so the window tion q(r, rW ) = (r − rW ) and the related error function p p size keeps changing. Arasu et al. used V (n,ǫ) to denote a q(r, rW ) = r − rW . Following GK01, an online algorithm, epsilon-approximate Summary covering n elements, where namely UN-GK, is introduced. UN-GK adjusts tuples in u u n is the current size of the window. When a new ele- Summaries as vi = pi ∈ Dj of ej ∈ S , gi = P Cmin(pi)− ment arrive, it becomes V (n + 1,ǫ), and when the oldest P Cmin(pi−1), and ∆i = P Cmax(pi) − P Cmin(pi), where element leaves, it gets V (n − 1,ǫ). Besides, Fn(N,ǫ) is P Cmin(p) and P Cmax(p) is the lower bound and the upper defined as a restriction of F (N,ǫ) to the last n elements bound of probabilistic cardinality [56] of elements in Su no u as Figure 6. Fn(N,ǫ) is the same as F (N,ǫ) except that larger than p. And (2) is modified as gi + ∆i ≤ 2ǫP C(S ), only blocks whose elements all belong to the most n recent where P C(Su) is the probabilistic cardinality of all elements elements, instead of N elements, are assumed active. V (n,ǫ) in Su as P C(Su) = |Su|. In this way, a ǫ-approximate can be constructed by a set of Fn(N,ǫ) in the form of quantile can be queried anytime over an uncertain data stream, k ǫ k−1 ǫ 2 ǫ 1 u {Fn(2 , 2 ), Fn(2 , 2 ), ..., Fn( ǫ , 2 )}, where k is an integer and its space complexity is O( ǫ log(ǫP C(S ))). 6

B. Distributed Model When a network node receives q-digest trees from other nodes, In distributed models such as sensor networks, communi- it merges them with its own tree by adding up count of nodes cation between nodes consumes much energy and cuts down representing the same elements and applying COMPRESS. the battery life of power-constrained devices. In addition, data After merging all q-digest trees in the network, we can query transmission takes most of the running time of algorithms. quantiles by traversing the ultimate tree in postorder, summing Therefore, the priority of approximate quantile algorithms over up count of passed nodes until the sum exceeds ⌊φN⌋. The distributed models is to reduce the communication cost by queried quantile is v.max of the current node. The total 1 decreasing the size of transmitted data. communication cost of q-digest is O( ǫ |v| log σ), where |v| In 2004, Shrivastava et al. [20] designed an approximate denotes the number of network nodes. quantile algorithm on distributed sensor networks with fixed- In the same year, Greenwald et al. [38] proposed an universe data, named as q-digest. Fixed-universe data refers algorithm on distributed models, referred to as GK04. Unlike to elements from a definite collection. Q-digest uses a unique q-digest, GK04 does not require that all data be fixed-universe. ǫ binary tree structure to compress and store elements so that It maintains a collection of 2 -approximate Summaries [34], the storage space is cut down. The size of the fixed-universe denoted as Sv, in each node v in the sensor network. More 1 2 k i i collection is denoted as σ and the compress coefficient is specifically, Sv = {Sv ,Sv , ..., Sv }, where Sv covers nv i i i denoted as k. Figure 7 is the structure of a q-digest tree, elements, and Sv is classified as class(Sv)= ⌊log nv⌋. In the which exists in each network node. Each node of the tree network, data are transmitted in the form of Sv. When data covers elements ranging from v.min to v.max, recording the arrives at a node, it combines its own Sv with the coming Sv sum of their frequencies as count(v). Each leaf represents one by iteratively merging Sv of the same class from bottom to element in the universe, in other words, v.min = v.max. Take top. Once the iteration ends, a pruning algorithm is applied to i nv node d as an example, its coverage is [7, 8] and there are 2 Sv to reduce the number of its tuples to log ǫ +1 at most. The i ′ elements in this range. Besides, each node v of the tree must pruning would bring up the approximation error of Sv from ǫ to ′ ǫ at most, so a -approximate is computed after satisfy: ǫ + 2 log nv ǫ S n 1 2 count(v) ≤⌊ ⌋ (6) merging Sv from all nodes. GK04 only transmits O( ǫ log N) k data, where N is the number of all data in the network. n Moreover, if the height of the network topology is far smaller count(v)+ count(vp)+ count(vs) > ⌊ ⌋ (7) k than N, Greenwald et al. improved GK04 to reduce the total |v| h where vp is v’s parent node and vs is its sibling node. communication cost to O( ǫ log N log( ǫ )). Its basic idea is (6) guarantees the upper bound of v’ coverage to narrow to apply a new operation, REDUCE, which makes sure that h approximation error while (7) demonstrates when to merge all nodes transmits less than O(log ǫ ) Summaries, after two small nodes so that the storage cost can be reduced. merging and pruning. Q-digest and GK04 are both offline algorithms that compute quantiles over stationary data in distributed models. Besides, there are also algorithms focusing on distributed models whose nodes receiving continuous data streams. In 2005, Cormode et al. [39] proposed a quantile algorithm for the scenario that multiple mutually isolated sensors are connected with one coordinator, which traces updating quantiles in real time. Its goal is to ensure ǫ-approximate quantiles at the coordinator while minimizing communication cost between nodes and the coordinator. In general, each remote node maintains a local approximate Summary [34] and informs the coordinator after certain number of updates. In the coordinator, the approxima- tion error ǫ is divided into 2 parts as ǫ = α + β:

Fig. 7. Structure of a q-digest tree, where n = 15,k = 5, σ = 8. • α is the approximation error of local Summaries sent to the coordinator. Q-digest proposed an algorithm COMPRESS to merge • β is the upper bound on the deviation of local nodes of a tree: Summaries since the last communication. 1) Scan nodes from bottom to top, from left to right, and Intuitively, larger β allows for large deviation, thus less com- sum up count(v) and count(vs) when (7) is violated. munication between nodes and the coordinator. But because ǫ 2) Store the sum in vp. is fixed, α is smaller, increasing the size of Summaries sent 3) Remove the count in v and vs. to the coordinator each time. In other words, α and β are on In order to reduce data communication, we number nodes the trade-off relationship. To resolve the trade-off, Cormode in the tree from top to bottom, from left to right and only et al. introduces a prediction model in each remote node that transmits nonempty nodes in the form of (id(v), count(v)), in captures the anticipated behavior of its local data stream. With which way every transmission contains only O(log σ +log N) the model, the coordinator is able to predict the current state data. For example, node c in Figure 7 is transformed to (6, 2). of a local data stream while computing the global Summary 7 and the remote node can check for the deviation between its algorithms propose randomized versions in this way [36], [40], Summary and the coordinator’s prediction. The algorithm [41]. proposes 3 concise prediction models: • Zero-Information Model assumes that there is no local update at any remote node since the last communication. A. Streaming Model • Synchronous-Updates Model assumes that at each time Back to 1971, Vapnik et al. [57] proposed a randomized step, each local node receives one update to its distribu- 1 1 quantile algorithm with space complexity of O( ǫ2 log ǫ ). This tion. 1 2 1 benchmark were raised to O( ǫ log ǫ ) by Manku et al. [41] • Update-Rates Model assumes that updates are observed in 1999, referred to as MRL99. MRL99 does not require prior at each local node at a uniform rate with a notion of knowledge of the data size N and occupies less space in global time. experiments when φ is an extreme value. The algorithm are Using prediction models to reduce communication times, the improved based on MRL98 [33] so they have the identical |v| total communication cost of this algorithm is O( ǫ2 log N). frame with only minor differences in the operation NEW: In 2013, Yi et al. [40] proposed an algorithm in the 1) For each buffer, randomly select an element among r same scenario and optimized the cost to O( |v| log N). The ǫ consecutive elements. algorithm divides the whole tracking period into O(log N) 2) Repeats this operation k times to get k initial elements. rounds. A new round begins whenever doubles. It first N 3) Set the buffer’s weight to r. resolves the median-tracking problem, which can be easily generalized to the quantile-tracking problem. Assume that M Notice that MRL99 equals with MRL98 if r =1. Because of is the cardinality of data, fixed at the beginning of a round, and the collapsing strategy as Figure 2, the larger weight a buffer m is the tracking median at the coordinator. The coordinator has, the greater chance there its data must be retained while maintains 3 data structures. The first one is a dynamic set collapsing. If r is kept consistent, the probability of newly ǫM arrived data being selected will go down continuously. So r of disjoint intervals, each of which contains between 8 ǫM should keep changed dynamically, meaning the sampling is and 2 elements. The others are 2 counters C.∆(L) and C.∆(R), recording the number of elements received by all nonuniform. MRL99 initially sets r to 2 and traces a parameter remote nodes to the left and the right of m since last update h, representing the maximum height of all buffers. When a respectively. They are guaranteed with an absolute error at buffer’s height reaches h + i for the first time, where i ≥ 0, r ǫM is doubled. By tuning h, b and k, MRL99 manages to compute most 8 by asking each remote node to send an update whenever it receives ǫM elements to the left or the right of m. approximate quantiles with a probability of at least 1 − δ. 8|v| Recalling two sliding window models in Section II-A, Arasu When |C.∆(L) − C.∆(R)|≥ ǫM , m is updated as follows: 2 et al. proposed the fact that the quantile of a random sample of 1) Compute the total number of elements to the left and the 1 −1 size O( 2 log δ ) is an ǫ-approximate quantile of N elements right of m, C.L and C.R and let d = 1 |C.L − C.R|. ǫ 2 with the probability at least 1 − δ. To sample elements of 2) Compute the new median m′ satisfying that |r(m) − ′ ǫM specific size, a fast alternative is to randomly select one out r(m ) − d|≤ 4 , where r is the rank of element in all k 1 −1 of 2 successive elements, where k = ⌊log N/( 2 log δ )⌋. data. Replace m with m′. m′ can be found quickly with 2 ǫ In this way, k grows logarithmically along with the data stream the set of intervals. First find the first separating element as required, so the approximation error and the randomization e of the intervals to the left of M. Then compute n , 1 1 degree are guaranteed. the number of elements at all remote sites that are in the Another algorithm was proposed by Agarwal et al. [42], interval [e ,m]. If |n − d|≤ ǫM , e is m′. Otherwise 1 1 2 1 which is based on Summaries [34]. Its basic idea is to sample find the next separating element e and count the number i tuples from multiple Summaries and merge them to com- n until |n − d|≤ ǫM , then set m′ to e . i i 2 i pute approximate quantiles with low space complexity. There 3) Set C.∆(L) and C.∆(R) to 0. are two situations while merging: same-weight merges and Yi et al. has proved that m is at most ǫM elements away from uneven-weight merges. Same-weight merges are for merging the ground truth. Step 1 needs to exchange O(|v|) messages. ′ two Ss covering the same number of elements. It contains As for Step 2, m can be found after at most O(1) searches following steps: and the cost of each search is O(|v|), making the total cost ǫ 1) Combine the two Ss in a sorted way. O(|v|). Because each update increases N by a factor of 1+ 2 , 1 2) Label the tuples in order and classify them by label m is updated at most O( ǫ ) times. So, the total communication |v| |v| parity. cost is O( ) each round and O( log N) for the whole ǫ ǫ 3) Equiprobably select one class of tuples as the merged algorithm. result Smerged. III. RANDOMIZED ALGORITHMS Assuming each S covers k elements, if we have k = 1 1 Generally, randomized approximate quantile algorithms are O( ǫ qlog( ǫδ )), the algorithm will answer quantile queries combined with random sampling. They first sample a part with a probability of at least 1 − δ. Uneven-weight merges are of data and compute overall approximate quantiles with this for merging two Ss of different sizes, which can be reduced portion. With less data taken into computation, the compu- to same-weight merges by a so-called logarithmic technique 1 1.5 1 tation cost is cut down. In fact, many deterministic quantile [38]. The space complexity of the algorithm is O( ǫ log ǫ ). 8

However, Agarwal et al. just proposed and analyzed the into a ”super node”. Now the height of the tree reduces to algorithm in theory without implementation. Afterwards, Fel- t. By setting t = |v|/h, the desired space bound becomes 1 ber et al. [43] came up with a randomized algorithm whose O( ǫ |v|h). 1 1 p space complexity is O( ǫ log ǫ ) but also did not realize it. In 2018, Haeupler et al. [46], gave a drastically faster gossip Besides, this algorithm is not actually useful but only suitable algorithm, referred as Haeupler18, to compute approximate for theoretical study because its hidden coefficient of O is too quantiles. Gossip algorithms [58] are algorithms that allow large. nodes in a distributed network to contact with each other In 2016, Karnin et al. [44] achieved the space complexity randomly in each round and gradually converge to get final 1 1 of O( ǫ log log ǫδ ). Its idea is based on MRL99 with some results. The algorithm contains two phases. In the first phase, improvements. The first improvement is using increasing each node adjusts its value so that the quantiles around φ- compactor capacities as the height gets larger (a compactor quantile become the median quantiles approximately. And in operates a COLLAPSE process in MRL99). The second comes the second phase, nodes compute their approximate median 1 from special handling of the top log log δ compactors. And the quantiles to get the global result. Haeupler et al. proved that 1 last is replacing those top compactors with Summaries [34]. the algorithm requires O(log log N + log ǫ ) rounds to solve the ǫ-approximate φ-quantile problem with high probability. B. Distributed Model IV. IMPROVEMENT As for distributed models, Huang et al. [45] proposed a randomized quantile algorithm which brings down total So far, in the discussion about approximate quantile algo- 2 N rithms, they are generally used with constant approximation communication cost from O(|v| log ǫ ) in GK04 [38] to O( 1 |v|h), where h denotes the height of the network error and indiscriminate performance on data regardless of ǫ p topology. It contains two version: the flat model and the tree data distribution. However, in some cases, we may have known model. For the flat model, all other nodes are assumed to be that the data is skewed [59]–[63]. In other cases, quantile directly connected to the root node. The algorithm is designed queries over high-speed data streams need to be updated as following steps: and answered highly efficiently [64], [65]. In addition, there are also techniques for optimizing quantile computation with 1) Sample elements in node v with a probability of p and the help of GPUs [66], [67]. This section presents several compute their ranks in v, denoted as r(a, v) where a is techniques for improving the performance and efficiency of a sampled element. approximate quantile algorithms in various scenarios. 2) Transmit sampled elements, as well as ones from its child nodes, to its parent node v . p A. Skewness 3) Find predecessors of a, denoted as pred(a, vs), in v’s sibling nodes vs. The first algorithm is known as t-digest, proposed by Dun- 4) Estimate the rank of a in vs, denoted as rˆ(a, vs), ning et al. [63]. T-digest is for computing extreme quantiles according to r(pred(a, vs), vs). such as the 99th, 99.9th and 99.99th . Totally 5) Compute the approximate rank of a in vp as r(a, vp)= different from q-digest, its basic idea is to cluster real-valued rˆ(a, vs)+ r(a, v). samples like histograms. But they differ in three aspects. P If elements are not uniformly distributed in the network and First, the range covered by clusters may overlap. Second, some nodes contain the majority, the total communication cost instead of lower and upper bounds, a cluster is represented will increase dramatically as the effect of load imbalance. The by a centroid value and an accumulated size on behalf of algorithm resolves the problem by tuning p based on data the number of elements. Third, clusters whose range is close amount in each node: to extreme values contain only a few elements so that the approximation error is not absolutely bounded, but relatively • If the amount is greater than N/ |v|, p is set to 1/(ǫNv). p bounded, which is φ(1 − φ). T-digest can be applied to both • Otherwise, p is set to Θ( |v|/(ǫN)). p streaming models and distributed models because the proposed The inconsistent probability of sampling makes sure that cluster is a mergeable structure. The merge is restricted by the O( 1 ) elements are sampled at most in each node no matter ǫ size bound of clusters. Dunning et al. proposed several scale how many elements there exist at first. After transmitting functions to define the bound. The standard is that the size of all sampled elements to the root node, the element whose a each cluster should be small enough to get accurate quantiles, rank is closest to is returned as the queried r(a, vroot) ⌊φN⌋ but large enough to avoid winding up too many clusters. A quantile. For the tree model, things become more complicated scale function is for two reasons. First, an intermediate node may suffer from δ heavy traffic going through if it has too many descendants f(φ)= sin−1(2φ − 1) (8) without any data reduction. Second, each message needs O(h) 2π hops to reach the root node, leading to the total communication where δ is the compression parameters and the size bound is of O(h |v|/ǫ). To resolve the first problem, Huang et al. defined as p proposed a algorithm MERGE in a systematic way to reduce Wleft + W Wleft Wbound = f( ) − f( ) ≤ 1 (9) data size. As for the second problem, the basic idea is to N N partition the routing tree into t connected components, each where Wleft and W are respectively the weight of clusters of which has O(|v|/t) nodes. Then each component is shrunk whose centroid values are smaller than that of the current 9

3 algorithm to compute biased quantiles with space complexity 1 2 of O( ǫ log σ log(ǫN)), where σ denotes the size of the fixed- universe collection. And a simpler sampling-based approach 1 et al. −3 2 ) by Gupta [71] uses space of O(ǫ log N log σ). 0 f(

-1 B. High-Speed Data Streams -2

-3 In order to compute quantiles over high-speed data streams, 0 0.2 0.4 0.6 0.8 1 both computational cost and per-element update cost need to be low. Zhang et al. [65] proposed an algorithm for both Fig. 8. Function of (8) with δ = 10. fixed- and arbitrary- size high-speed data streams. Let N and n denote the number of elements in the entire data stream and elements seen so far. For the fixed-size data streams, where N cluster and of current cluster. As Figure 8 shows, (8) is non- is given, a multi-level summary structure S = {s0,s1, ..., sL} decreasing and is steeper when φ is closer to 0 or 1, which is maintained, where si is the summary at level i, as shown means clusters covering extreme values have smaller size, in Figure 9. Each element in s is stored with its upper and making the algorithm more accurate for computing extreme lower bound of rank, rmin(e) and rmax(e). The data stream quantiles and more robust for skewed data. The second algorithm, proposed by Lin et al. [61] and elaborated by Liu et al. [62], is aimed at streaming models, using nonlinear interpolation. The algorithm maintains two buffers, the quantile buffer Q = {q1, q2, ..., qm}, where qi is the approximate φi-quantile, and the data buffer B of size n, holding the most recent n elements. Q is estimated from observed data and incrementally updated when B is full- Fig. 9. A multi-level summary structure with L = 3. filled. In order to estimate the extreme quantiles accurately, 1 the nonlinear interpolation F (x), which is an approximate is divided into blocks of size b = ⌊ ǫ log ǫN⌋ and si covers distribution function estimated from a training set stream, is a disjoint bag Bi. Among bags, B0 contains the most recent leveraged together with B and Q to update Q. blocks even though it may be incomplete. The structure is The third algorithm was proposed by Cormode et al. [59] for maintained as follows: skewed data. Again, the algorithm is an improved version of 1) Insert the new element to s0. GK01 for two problems. The first problem is the biased quan- 2) If |s0| 0 tinuous quantile queries with different φ and ǫ, Lin et al. 1 1 2 and worst case of O( ǫ2 log ǫ log ǫN). [64] proposed 2 techniques for processing queries. The first For problems resolved by q-digest [20] that all elements are technique is to cluster multiple queries as a single query selected from a fixed-universe collection, Cormode et al. [60] virtually while guaranteeing accuracy. Its basic idea is to combined the binary tree structure in q-digest and standard cluster the queries that share some common results. The dictionary data structures [70], proposing a new deterministic second technique is to minimize both the total number of times 10 for reprocessing and the number of clusters. It adopts a trigger- In Java, Google publishes an open-source set of core li- based lazy update paradigm. braries, named as Guava4. It includes new collection types, graph libraries, support for concurrency, etc. Also, quantile computation is included in its functions as median() and C. GPU percentiles(). The quantiles are exact results so the average Govindaraju et al. [66] studied optimizing quantile com- time complexity of its implementation is O(n) while the worst putation using graphics processors, or GPU for short. GPUs case time complexity is O(N 2). It optimizes multiple quantile are well designed for rendering and allow many rendering computation on the same dataset with indexes, improving the applications to raise memory performance [72]. In order to performance in some degree. Another package that supports utilize the high computational power of GPUs, Govindaraju et quantile computation is Apache Beam5. It is a unified pro- al. proposed an algorithm based on sorting networks. Sorting gramming model for batch and streaming data processing on networks are a set of sorting algorithms mapped well to mesh- execution engines. Unlike Guava, it implements approximate based architectures [73]. Operations in the algorithm, includ- quantile computation. ing comparisons and comparator mapping, are realized by The programming language, Rust6, implements approxi- color blending and texture mapping in GPUs. The theoretical mate quantile computation over data streams with a moderate algorithm they used is GK04 [38] and by taking advantage of amount of memory. It implements the algorithms, GK01 [34] high computational power and memory bandwidth of GPUs, and CKMS [59], so that no boatload of information is stored in the algorithm offers great performance for quantile computa- memory. Besides, it implements a variant of quantile computa- tion over streaming models. tion, an ǫ-approximate frequency count for a stream, outputting the k most frequent elements, with the algorithm Misra-Gries V. APPROXIMATE QUANTILE COMPUTATION TOOLS [74], which helps with an estimation of quantiles as well. As for C++, a peer-reviewed set of core libraries, Boost, provides Over the past decades, many off-the-shelf, open-source exact quantile computation as its in-tool function7. packages or tools for quantile computation have been de- veloped and available to users. Some of them are based on exact quantile computation while others implement approxi- VI. FUTURE DIRECTIONS AND CONCLUSIONS mate quantile computation. In this section, we review such In this survey, we have presented a comprehensive survey of tools, focusing on their algorithmic theories and application approximate quantile computation. Readers who want to know scenarios. more about the performance of basic quantile algorithms can Most basically, SQL provides a window function ntile1 read a paper by Luo et al. [29], which implements a part of the which helps with quantile computation. It receives a parameter quantile algorithms above and compares them by experiments. b and partitions a set of data into b buckets equally. Therefore, Besides, Cormode et al. [75] summarized lower bounds of the with sortBy and ntile, we can partition the data in order and researches about comparison-based quantile computation so get the leading element of each partition as the quantile. Also, far and proved that the space complexity of GK01 is optimal these quantiles are exact. by showing a matching lower bound. MATLAB is a famous numerical computing environment Even though approximate quantiles have been studied for and programming language. It provides comprehensive pack- more than three decades, there is still plenty of room for im- ages for computing various statistics or functions. Quantiles provement. The explosion of data also brings new challenges are included as its in-tool function2. The function quantile to this field. There are many studies to resolve quantile com- receives a few parameters, including a numerical array, the putation in large-scale data, but not enough. Besides, with the cumulative probability and the computation kind. It computes development of industries, new scenarios keep appearing so quantiles in either exact or approximate way. It computes that quantile computation should be optimized specifically as exact quantiles by the classic algorithm that uses sort. And well. And the evolvement of techniques brings new direction it implements t-digest [63] for approximate quantiles. So, and new potential to quantile computation. Here we present a this function is suitable to compute quantiles over distributed few future directions for quantile computation. models, which benefit from parallel computation. First, almost all quantile algorithms require at least one- The distributed cluster-computing framework, Spark, pro- pass scan over the entire dataset, no matter it is applied on vides approximate quantile computation since version 2.03. It streaming models or distributed models. However, at some implements GK01 and can be called in many programming time, the cost of scanning the entire data is intolerable if languages such as Python, Scala, R and Java. Computing on quantile queries are demanded to be answered in time. In a Spark dataframe, the function contains 3 parameters as a such case, only a portion of the entire dataset is permitted dataframe, the name of a numerical column, a list of quantile into the process of the whole algorithm. This condition is more and the approximation error. Since version 2.2, restricted than that in existing randomized quantile algorithms. the function has been upgraded to support computation over Even if randomized algorithms sample a part of data to save multiple dataframe columns. 4https://github.com/google/guava 1https://www.sqltutorial.org/sql-window-functions/sql-ntile/ 5https://beam.apache.org/ 2https://www.mathworks.com/help/stats/quantile.html 6https://www.rust-lang.org/ 3https://spark.apache.org/docs/2.0.0 7https://www.boost.org/ 11 the space of computation, the process of sampling requires frequencies of two edges connecting to the same vertex are the participation of the entire dataset, which means that they correlated). Maybe it is a challenge, as well as an opportunity, need to scan all data once. To resolve the problem, we need to efficiently compute approximate quantiles in a correlated to determine the sampling methods first. Unlike fine-grained data graph. Furthermore, if the graph is constrained by some sampling in existing randomized algorithms, we may use patterns [84], [85], the algorithms may be improved and coarse-grained sampling, such as sampling in the unit of block. optimized correspondingly. Other cases include computing The method should also take the way of storage into account. quantiles in the Blockchain network [86]. Unlike traditional For example, if the dataset is stored in distributed HDFS [76], distributed models, which have a master node and multiple we may sample part of blocks, avoiding scanning the entire slave nodes, the Blockchain network is decentralized, making data. The sampling methods may be exclusive, varying based merging quantile algorithms infeasible. Further research is on applications. After determining sampling methods, we need needed for computing approximate quantiles in such network. to consider how much data is sampled to balance the accuracy Haeupler et al. [46], which uses gossip algorithms, provides and the occupation of time and space. One direction is to a good direction, but there is much more to be done. simulate the idea in machine learning [77] that data are trained (or sampled in our situation) continuously until the quantile result converges. REFERENCES Second, many new computation engines are being devel- [1] Z. Abedjan, L. Golab, and F. Naumann, “Profiling relational data: a oped. They can be classified into two categories in general. survey,” VLDB J., vol. 24, no. 4, pp. 557–581, 2015. One is streaming computation engines and the other is dis- [2] S. Song, L. Chen, and P. S. Yu, “Comparable dependencies over tributed computation engines. Streaming computation engines, heterogeneous data,” VLDB J., vol. 22, no. 2, pp. 253–274, 2013. [3] Z. He, Z. Cai, S. Cheng, and X. Wang, “Approximate aggregation for represented by Spark Streaming, Storm and Flink [78], are tracking quantiles and range countings in wireless sensor networks,” aimed at real-time streaming needs with minor difference Theor. Comput. Sci., vol. 607, pp. 381–390, 2015. in some ways. For example, Storm and Flink behave like [4] S. Song, H. Zhu, and L. Chen, “Probabilistic correlation-based similarity true streaming processing systems with lower latencies while measure on text records,” Inf. Sci., vol. 289, pp. 8–24, 2014. [5] Y. Wang, S. Song, L. Chen, J. X. Yu, and H. Cheng, “Discovering Spark Streaming can handle higher throughput at the cost conditional matching rules,” TKDD, vol. 11, no. 4, pp. 46:1–46:38, 2017. of higher latencies. Distributed computation engines, repre- [6] A. Zhang, S. Song, J. Wang, and P. S. Yu, “Time series data cleaning: sented by Spark and GraphLab [79], are designed to analyze From anomaly detection to anomaly repairing,” PVLDB, vol. 10, no. 10, pp. 1046–1057, 2017. distributed data such as datasets with graph properties.They [7] S. Song, L. Chen, and H. Cheng, “Efficient determination of distance have their pros and cons in heterogeneous scenarios as well. thresholds for differential dependencies,” IEEE Trans. Knowl. Data Eng., The characteristics of these engines may be utilized in the vol. 26, no. 9, pp. 2179–2192, 2014. [8] S. Song and L. Chen, “Differential dependencies: Reasoning and dis- implementation of algorithms. As reviewed in Section V, covery,” ACM Trans. Database Syst., vol. 36, no. 3, pp. 16:1–16:41, only a small fraction of approximate quantile algorithms is 2011. implemented in them such as GK01 in Spark. But many other [9] S. Song, A. Zhang, J. Wang, and P. S. Yu, “SCREEN: stream data cleaning under speed constraints,” in Proceedings of the 2015 ACM algorithms are still needed to be implemented and optimized SIGMOD International Conference on Management of Data, Melbourne, purposefully. For example, Spark [80] is a framework for Victoria, Australia, May 31 - June 4, 2015, pp. 827–841, 2015. computation in distributed clusters, supporting parallel com- [10] A. Zhang, S. Song, and J. Wang, “Sequential data cleaning: A statistical putation naturally and having potential of benefitting many approach,” in Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, quantile algorithms such as MRL98. Its streaming version, USA, June 26 - July 01, 2016 (F. Ozcan,¨ G. Koutrika, and S. Madden, Spark Structured Streaming [81], supports stream processing, eds.), pp. 909–924, ACM, 2016. as well as window operations. It may help quantile algorithms [11] S. Song, C. Li, and X. Zhang, “Turn waste into wealth: On simultaneous clustering and cleaning over dirty data,” in Proceedings of the 21th ACM over streaming models, such as Lin SW and Lin n-of-N [35], SIGKDD International Conference on Knowledge Discovery and Data to be implemented in a more efficient way, even though Mining, Sydney, NSW, Australia, August 10-13, 2015, pp. 1115–1124, the theoretical complexity remains the same. Many quantile 2015. [12] S. Guha and A. McGregor, “Stream order and order statistics: Quantile algorithms are implemented with basic languages while others estimation in random-order streams,” SIAM J. Comput., vol. 38, no. 5, are even without implementation. Transplanting them to new pp. 2044–2059, 2009. computation engines is not an easy work and there is great [13] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” in Proceedings of the Twenty-first room for optimization. ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Third, with the appearance of new specific application Systems, June 3-5, Madison, Wisconsin, USA, pp. 1–16, 2002. scenarios, the requirements of approximate quantile algorithms [14] S. Muthukrishnan, “Data streams: Algorithms and applications,” Foun- dations and Trends in Theoretical Computer Science, vol. 1, no. 2, 2005. evolve as well. For example, nowadays, more and more [15] A. Segall, “Distributed network protocols,” IEEE Trans. Information attention is being paid to the correlation of data, and data Theory, vol. 29, no. 1, pp. 23–34, 1983. graph is one way to present the correlation. In a data graph, [16] Y. Sun, S. Song, C. Wang, and J. Wang, “Swapping repair for misplaced each edge is usually associated with a weight, representing attribute values,” in 36st International Conference on Data Engineering (ICDE’20), IEEE, 2020. frequencies of the appearance of the correlation. One may need [17] F. Gao, S. Song, L. Chen, and J. Wang, “Efficient set-correlation operator to determine a threshold for pruning edges by their weights inside databases,” J. Comput. Sci. Technol., vol. 31, no. 4, pp. 683–701, for better performance in analysis [82], [83]. And we can use 2016. [18] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the quantiles to determine the threshold. However, unlike random data: Parallel analysis with sawzall,” Scientific Programming, vol. 13, sampling in a dataset, the edges in a graph are correlated (the no. 4, pp. 277–298, 2005. 12

[19] G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and [40] K. Yi and Q. Zhang, “Optimal tracking of distributed heavy hitters and D. Srivastava, “Holistic udafs at streaming speeds,” in Proceedings of quantiles,” Algorithmica, vol. 65, no. 1, pp. 206–223, 2013. the ACM SIGMOD International Conference on Management of Data, [41] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, “Random sampling Paris, France, June 13-18, 2004, pp. 35–46, 2004. techniques for space efficient online computation of order statistics [20] N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri, “ of large datasets,” in SIGMOD 1999, Proceedings ACM SIGMOD and beyond: new aggregation techniques for sensor networks,” in Pro- International Conference on Management of Data, June 1-3, 1999, ceedings of the 2nd International Conference on Embedded Networked Philadelphia, Pennsylvania, USA, pp. 251–262, 1999. Sensor Systems, SenSys 2004, Baltimore, MD, USA, November 3-5, [42] P. K. Agarwal, G. Cormode, Z. Huang, J. M. Phillips, Z. Wei, and 2004, pp. 239–249, 2004. K. Yi, “Mergeable summaries,” ACM Trans. Database Syst., vol. 38, [21] C. Leys, C. Ley, O. Klein, P. Bernard, and L. Licata, “Detecting outliers: no. 4, pp. 26:1–26:28, 2013. Do not use around the mean, use absolute deviation [43] D. Felber and R. Ostrovsky, “A randomized online quantile summary around the median,” Journal of Experimental Social Psychology, vol. 49, in o(1/epsilon * log(1/epsilon)) words,” in Approximation, Random- no. 4, pp. 764–766, 2013. ization, and Combinatorial Optimization. Algorithms and Techniques, [22] S. Song, Y. Cao, and J. Wang, “Cleaning timestamps with temporal APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA, constraints,” PVLDB, vol. 9, no. 10, pp. 708–719, 2016. pp. 775–785, 2015. [23] M. B. Greenwald and S. Khanna, “Quantiles and equi-depth histograms [44] Z. S. Karnin, K. J. Lang, and E. Liberty, “Optimal quantile approxi- over streams,” in Data Stream Management - Processing High-Speed mation in streams,” in IEEE 57th Annual Symposium on Foundations Data Streams, pp. 45–86, 2016. of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, [24] P. Indyk, “Algorithms for dynamic geometric problems over data New Brunswick, New Jersey, USA, pp. 71–78, 2016. streams,” in Proceedings of the 36th Annual ACM Symposium on Theory [45] Z. Huang, L. Wang, K. Yi, and Y. Liu, “Sampling based algorithms for of Computing, Chicago, IL, USA, June 13-16, 2004, pp. 373–380, 2004. quantile computation in sensor networks,” in Proceedings of the ACM [25] M. Blum, R. W. Floyd, V. R. Pratt, R. L. Rivest, and R. E. Tarjan, “Time SIGMOD International Conference on Management of Data, SIGMOD bounds for selection,” J. Comput. Syst. Sci., vol. 7, no. 4, pp. 448–461, 2011, Athens, Greece, June 12-16, 2011, pp. 745–756, 2011. 1973. [46] B. Haeupler, J. Mohapatra, and H. Su, “Optimal gossip algorithms for [26] J. I. Munro and M. Paterson, “Selection and sorting with limited exact and approximate quantile computations,” in Proceedings of the storage,” Theor. Comput. Sci., vol. 12, pp. 315–323, 1980. 2018 ACM Symposium on Principles of Distributed Computing, PODC [27] J. Wang, S. Song, X. Zhu, X. Lin, and J. Sun, “Efficient recovery 2018, Egham, United Kingdom, July 23-27, 2018, pp. 179–188, 2018. of missing events,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 11, [47] K. Alsabti, S. Ranka, and V. Singh, “A one-pass algorithm for accurately pp. 2943–2957, 2016. estimating quantiles for disk-resident data,” in VLDB’97, Proceedings of [28] S. Mittal, “A survey of techniques for approximate computing,” ACM 23rd International Conference on Very Large Data Bases, August 25-29, Comput. Surv., vol. 48, no. 4, pp. 62:1–62:33, 2016. 1997, Athens, Greece, pp. 346–355, 1997. [29] G. Luo, L. Wang, K. Yi, and G. Cormode, “Quantiles over data streams: [48] F. Chen, D. Lambert, and J. C. Pinheiro, “Incremental quantile estima- experimental comparisons, new analyses, and further improvements,” tion for massive tracking,” in Proceedings of the sixth ACM SIGKDD VLDB J., vol. 25, no. 4, pp. 449–472, 2016. international conference on Knowledge discovery and data mining, [30] S. Ganguly and A. Majumder, “Cr-precis: A deterministic summary Boston, MA, USA, August 20-23, 2000, pp. 516–522, 2000. structure for update data streams,” in Combinatorics, Algorithms, Proba- [49] R. Y. S. Hung and H. Ting, “An w(\frac1e bilistic and Experimental Methodologies, First International Symposium, log\frac1e)\omega(\frac{1}{\varepsilon} \log ESCAPE 2007, Hangzhou, China, April 7-9, 2007, Revised Selected \frac{1}{\varepsilon}) space lower bound for finding epsilon- Papers, pp. 48–59, 2007. approximate quantiles in a data stream,” in Frontiers in Algorithmics, [31] A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong, 4th International Workshop, FAW 2010, Wuhan, China, August 11-13, “Model-driven data acquisition in sensor networks,” in (e)Proceedings of 2010. Proceedings, pp. 89–100, 2010. 2 the Thirtieth International Conference on Very Large Data Bases, VLDB [50] R. Jain and I. Chlamtac, “The p algorithm for dynamic calculation of 2004, Toronto, Canada, August 31 - September 3 2004, pp. 588–599, quantiiles and histograms without storing observations,” Commun. ACM, 2004. vol. 28, no. 10, pp. 1076–1085, 1985. [32] R. Motwani and P. Raghavan, “Randomized algorithms,” ACM Comput. [51] R. Agrawal and A. N. Swami, “A one-pass space-efficient algorithm for Surv., vol. 28, no. 1, pp. 33–37, 1996. finding quantiles,” in Advances in Data Management, Proceedings of the [33] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, “Approximate medians 7th International Conference on Management of Data, COMAD 1995, and other quantiles in one pass and with limited memory,” in SIGMOD December 28-30, 1995, Pune, India, 1995. 1998, Proceedings ACM SIGMOD International Conference on Manage- [52] A. Lall, “Data streaming algorithms for the kolmogorov-smirnov test,” ment of Data, June 2-4, 1998, Seattle, Washington, USA, pp. 426–435, in 2015 IEEE International Conference on Big Data, Big Data 2015, 1998. Santa Clara, CA, USA, October 29 - November 1, 2015, pp. 95–104, [34] M. Greenwald and S. Khanna, “Space-efficient online computation 2015. of quantile summaries,” in Proceedings of the 2001 ACM SIGMOD [53] Y. Tao, W. Lin, and X. Xiao, “Minimal mapreduce algorithms,” in international conference on Management of data, Santa Barbara, CA, Proceedings of the ACM SIGMOD International Conference on Man- USA, May 21-24, 2001, pp. 58–66, 2001. agement of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, [35] X. Lin, H. Lu, J. Xu, and J. X. Yu, “Continuously maintaining quantile pp. 529–540, 2013. summaries of the most recent N elements over a data stream,” in [54] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream Proceedings of the 20th International Conference on Data Engineering, statistics over sliding windows,” SIAM J. Comput., vol. 31, no. 6, ICDE 2004, 30 March - 2 April 2004, Boston, MA, USA, pp. 362–373, pp. 1794–1813, 2002. 2004. [55] X. Lian, L. Chen, and S. Song, “Consistent query answers in incon- [36] A. Arasu and G. S. Manku, “Approximate counts and quantiles over sistent probabilistic databases,” in Proceedings of the ACM SIGMOD sliding windows,” in Proceedings of the Twenty-third ACM SIGACT- International Conference on Management of Data, SIGMOD 2010, SIGMOD-SIGART Symposium on Principles of Database Systems, June Indianapolis, Indiana, USA, June 6-10, 2010 (A. K. Elmagarmid and 14-16, 2004, Paris, France, pp. 286–296, 2004. D. Agrawal, eds.), pp. 303–314, ACM, 2010. [37] C. Liang, M. Li, and B. Liu, “Online computing quantile summaries [56] C. Liang, Y. Zhang, P. Shi, and Z. Hu, “Learning accurate very fast over uncertain data streams,” IEEE Access, vol. 7, pp. 10916–10926, decision trees from uncertain data streams,” Int. J. Systems Science, 2019. vol. 46, no. 16, pp. 3032–3050, 2015. [38] M. Greenwald and S. Khanna, “Power-conserving computation of order- [57] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence statistics over sensor networks,” in Proceedings of the Twenty-third of relative frequencies of events to their probabilities,” in Measures of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database complexity, pp. 11–30, Springer, 2015. Systems, June 14-16, 2004, Paris, France, pp. 275–285, 2004. [58] S. P. Boyd, A. Ghosh, B. Prabhakar, and D. Shah, “Gossip algorithms: [39] G. Cormode, M. N. Garofalakis, S. Muthukrishnan, and R. Rastogi, design, analysis and applications,” in INFOCOM 2005. 24th Annual “Holistic aggregates in a networked world: Distributed tracking of ap- Joint Conference of the IEEE Computer and Communications Societies, proximate quantiles,” in Proceedings of the ACM SIGMOD International 13-17 March 2005, Miami, FL, USA, pp. 1653–1664, 2005. Conference on Management of Data, Baltimore, Maryland, USA, June [59] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, “Effective 14-16, 2005, pp. 25–36, 2005. computation of biased quantiles over data streams,” in Proceedings of 13

the 21st International Conference on Data Engineering, ICDE 2005, 5-8 International Conference on Management of Data, SIGMOD Conference April 2005, Tokyo, Japan, pp. 20–31, 2005. 2018, Houston, TX, USA, June 10-15, 2018, pp. 601–613, 2018. [60] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, “Space- [82] X. Zhu, S. Song, X. Lian, J. Wang, and L. Zou, “Matching heterogeneous and time-efficient deterministic algorithms for biased quantiles over data event data,” in International Conference on Management of Data, streams,” in Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD- SIGMOD 2014, Snowbird, UT, USA, June 22-27, 2014, pp. 1211–1222, SIGART Symposium on Principles of Database Systems, June 26-28, 2014. 2006, Chicago, Illinois, USA, pp. 263–272, 2006. [83] Y. Gao, S. Song, X. Zhu, J. Wang, X. Lian, and L. Zou, “Matching [61] Z. Lin, J. Liu, and N. Lin, “Accurate quantile estimation for skewed data heterogeneous event data,” IEEE Trans. Knowl. Data Eng., vol. 30, streams,” in 28th IEEE Annual International Symposium on Personal, no. 11, pp. 2157–2170, 2018. Indoor, and Mobile Radio Communications, PIMRC 2017, Montreal, [84] X. Zhu, S. Song, J. Wang, P. S. Yu, and J. Sun, “Matching heterogeneous QC, Canada, October 8-13, 2017, pp. 1–7, 2017. events with patterns,” in IEEE 30th International Conference on Data [62] J. Liu, W. Zheng, Z. Lin, and N. Lin, “Accurate quantile estimation for Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, skewed data streams using nonlinear interpolation,” IEEE Access, vol. 6, pp. 376–387, 2014. pp. 28438–28446, 2018. [85] S. Song, Y. Gao, C. Wang, X. Zhu, J. Wang, and P. S. Yu, “Matching [63] T. Dunning and O. Ertl, “Computing extremely accurate quantiles using heterogeneous events with patterns,” IEEE Trans. Knowl. Data Eng., t-digests,” CoRR, vol. abs/1902.04023, 2019. vol. 29, no. 8, pp. 1695–1708, 2017. [64] X. Lin, J. Xu, Q. Zhang, H. Lu, J. X. Yu, X. Zhou, and Y. Yuan, [86] G. Zyskind, O. Nathan, and A. Pentland, “Decentralizing privacy: Using “Approximate processing of massive continuous quantile queries over blockchain to protect personal data,” in 2015 IEEE Symposium on high-speed data streams,” IEEE Trans. Knowl. Data Eng., vol. 18, no. 5, Security and Privacy Workshops, SPW 2015, San Jose, CA, USA, May pp. 683–698, 2006. 21-22, 2015, pp. 180–184, 2015. [65] Q. Zhang and W. Wang, “A fast algorithm for approximate quantiles in high speed data streams,” in 19th International Conference on Scientific and Statistical Database Management, SSDBM 2007, 9-11 July 2007, Banff, Canada, Proceedings, p. 29, 2007. [66] N. K. Govindaraju, N. Raghuvanshi, and D. Manocha, “Fast and approximate stream mining of quantiles and frequencies using graphics processors,” in Proceedings of the ACM SIGMOD International Confer- ence on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pp. 611–622, 2005. [67] Y. ZHOU, H. WANG, and C.-t. CHENG, “Parallel computing method of data stream quantiles with gpu,” Journal of Computer Applications, vol. 2, 2010. [68] Q. Zhang and W. Wang, “An efficient algorithm for approximate biased quantile computation in data streams,” in Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pp. 1023–1026, 2007. [69] Y. Zhang, X. Lin, J. Xu, F. Korn, and W. Wang, “Space-efficient relative error order sketch over data streams,” in Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, USA, p. 51, 2006. [70] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company, 1989. [71] A. Gupta and F. Zane, “Counting inversions in lists,” in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA, pp. 253–254, 2003. [72] O. Rubinstein, D. Reed, and J. Alben, “Gpu rendering to system memory,” Oct. 27 2005. US Patent App. 10/833,694. [73] K. E. Batcher, “Sorting networks and their applications,” in American Federation of Information Processing Societies: AFIPS Conference Proceedings: 1968 Spring Joint Computer Conference, Atlantic City, NJ, USA, 30 April - 2 May 1968, pp. 307–314, 1968. [74] J. Misra and D. Gries, “Finding repeated elements,” Sci. Comput. Program., vol. 2, no. 2, pp. 143–152, 1982. [75] G. Cormode and P. Vesel´y, “Tight lower bound for comparison-based quantile summaries,” CoRR, vol. abs/1905.03838, 2019. [76] D. Borthakur et al., “Hdfs architecture guide,” Hadoop Apache Project, vol. 53, no. 1-13, p. 2, 2008. [77] C. M. Bishop and N. M. Nasrabadi, “Pattern Recognition and Machine Learning,” J. Electronic Imaging, vol. 16, no. 4, p. 049901, 2007. [78] S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holder- baugh, Z. Liu, K. Nusbaum, K. Patil, B. Peng, and P. Poulosky, “Benchmarking streaming computation engines: Storm, flink and spark streaming,” in 2016 IEEE International Parallel and Distributed Pro- cessing Symposium Workshops, IPDPS Workshops 2016, Chicago, IL, USA, May 23-27, 2016, pp. 1789–1792, 2016. [79] J. Wei, K. Chen, Y. Zhou, Q. Zhou, and J. He, “Benchmarking of distributed computing engines spark and graphlab for big data analytics,” in Second IEEE International Conference on Big Data Computing Ser- vice and Applications, BigDataService 2016, Oxford, United Kingdom, March 29 - April 1, 2016, pp. 10–13, 2016. [80] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” in 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010, 2010. [81] M. Armbrust, T. Das, J. Torres, B. Yavuz, S. Zhu, R. Xin, A. Ghodsi, I. Stoica, and M. Zaharia, “Structured streaming: A declarative API for real-time applications in apache spark,” in Proceedings of the 2018