Tuple Routing Strategies for Distributed Eddies
Feng Tian    David J. DeWitt
Department of Computer Sciences, University of Wisconsin, Madison
Madison, WI, 53706
{ftian, dewitt}@cs.wisc.edu

Abstract

Many applications that consist of streams of data are inherently distributed. Since input stream rates and other system parameters such as the amount of available computing resources can fluctuate significantly, a stream query plan must be able to adapt to these changes. Routing tuples between operators of a distributed stream query plan is used in several data stream management systems as an adaptive query optimization technique. The routing policy used can have a significant impact on system performance. In this paper, we use a queuing network to model a distributed stream query plan and define performance metrics for response time and system throughput. We also propose and evaluate several practical routing policies for a distributed stream management system. The performance results of these policies are compared using a discrete event simulator. Finally, we study the impact of the routing policy on system throughput and resource allocation when computing resources can be shared between operators.

1. Introduction

Stream database systems are a new type of database system designed to facilitate the execution of queries against continuous streams of data. Example applications for such systems include sensor networks, network monitoring applications, and online information tracking. Since many stream-based applications are inherently distributed, a centralized solution is not viable. Recently, the design and implementation of scalable, distributed data stream management systems has begun to receive the attention of the database community.

Many of the fundamental assumptions that are the basis of standard database systems no longer hold for data stream management systems [8]. A typical stream query is long running -- it listens on several continuous streams and produces a continuous stream as its result. The notion of running time, which is used as an optimization goal by a classic database optimizer, cannot be directly applied to a stream management system. A data stream management system must use other performance metrics. In addition, since the input stream rates and the available computing resources will usually fluctuate over time, an execution plan that works well at query installation time might be very inefficient just a short time later. Furthermore, the “optimize-then-execute” paradigm of traditional database systems is no longer appropriate, and a stream execution plan must be able to adapt to changes in input streams and system resources.

An eddy [2] is a stream query execution mechanism that can continuously reorder operators in a query plan. Each input tuple to an eddy carries its own execution history, implemented using two bitmaps: a done bitmap records which operators the tuple has already visited, and a ready bitmap records which operators the tuple can visit next. An eddy routes each tuple to the next operator based on the tuple’s execution history and statistics maintained by the eddy. If the tuple satisfies the predicate of an operator, the operator makes appropriate updates to the two bitmaps and returns the tuple to the eddy. The eddy continues this iteration until the tuple has visited all operators. Figure 1.1 shows an eddy with three operators. The major advantage of an eddy is that the execution plan is highly adaptive, with the routing decision for each individual tuple deciding the execution order of the operators for this tuple. [2][18] demonstrate that this technique adapts well to changes in input stream rates.

Permission to copy without fee all or part of this material is granted
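The bitmap-based routing loop described above can be sketched in a few lines. The code below is an illustrative simplification, not the Telegraph implementation: the class names and predicates are our own, and it simply picks the lowest-numbered ready operator, whereas a real eddy would consult runtime statistics when choosing the next destination.

```python
class Tuple:
    """A tuple annotated with its execution history (two bitmaps)."""
    def __init__(self, values, num_ops):
        self.values = values
        self.done = 0                      # bit i set => operator i already visited
        self.ready = (1 << num_ops) - 1    # bit i set => operator i may be visited next

class Operator:
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate         # returns True if the tuple passes

class Eddy:
    def __init__(self, operators):
        self.ops = operators

    def route(self, t):
        """Return the tuple if it passes all operators, else None."""
        while t.ready:
            # Pick the lowest-numbered ready operator (a real eddy would
            # rank candidates using observed selectivities and costs).
            i = (t.ready & -t.ready).bit_length() - 1
            if not self.ops[i].predicate(t.values):
                return None                # tuple fails a predicate and is dropped
            t.done |= 1 << i               # mark operator i as visited
            t.ready &= ~(1 << i)           # operator i is no longer a candidate
        return t

ops = [Operator("Op1", lambda v: v["a"] > 0),
       Operator("Op2", lambda v: v["b"] < 10),
       Operator("Op3", lambda v: v["a"] != v["b"])]
eddy = Eddy(ops)
print(eddy.route(Tuple({"a": 3, "b": 5}, 3)) is not None)   # True: passes all three
print(eddy.route(Tuple({"a": -1, "b": 5}, 3)) is not None)  # False: fails Op1
```

Because the history travels with the tuple, each tuple can take a different operator order, which is exactly what makes the plan adaptive.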
provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.

However, a centralized eddy cannot be directly employed in a distributed data stream management system without incurring unnecessary network traffic and delays, and it would almost certainly end up being a bottleneck.

Figure 1.1 Centralized eddy. Figure 1.2 A distributed query plan. [figures omitted]

In this paper we study the design, implementation, and performance of the following distributed eddy algorithm. After an operator processes a tuple, instead of returning the tuple to a centralized eddy, the operator makes a routing decision based on the execution history of the tuple and statistics maintained at the operator. Figure 1.2 shows a distributed plan with three operators. The dashed arrows indicate possible routes between operators; the four solid arrows indicate one possible execution order that a tuple might actually take. The routing policy at each operator decides the execution order of the operators for each tuple, thereby dynamically optimizing the distributed stream query plan. The purpose of this paper is to study the effectiveness of different routing policies.

As discussed earlier, query response time is not an appropriate metric to evaluate a data stream management system. Instead we propose the following two metrics:

ART - the average response time, measured as the average time between when a tuple enters and leaves the operators that form the distributed eddy.

MDR - the maximum data rate the system can handle before an operator becomes a bottleneck.

The formal description of the system and rigorous definitions of these metrics are given in Section 3. Section 4 examines the impact of the routing policy on system performance. The distributed query plan is modelled using a queuing network and a solution technique is described. We also study several practical routing policies that have straightforward implementations and compare their performance.

A distributed stream processing system must be able to dynamically adapt to configuration changes such as adding or removing computing resources. Changes in input data rates may also require the system to re-allocate resources via load sharing techniques. Aurora* [6] implements box sliding and box splitting to enable load sharing across nodes. The natural way of applying these load sharing techniques is to split the workload of an overloaded node and to merge the workloads of lightly loaded nodes. The routing policy is an important factor in determining which node is likely to be overloaded. Section 5 examines the effect of the routing policy on system throughput and resource allocation when computing resources can be added to or removed from a node. Conclusions and future research directions are contained in Section 6.

2. Related Work

There are a number of research projects currently studying issues related to streaming data [1][2][3][4][5][6][7][8][12][16][18][22][26]. Those most closely related to our work are the Aurora* [6][8], STREAM [3][4][22], Telegraph [2][9][18] and Cougar [7][12] projects.

The original eddy paper [2] introduced the concept of routing tuples between operators as a form of query optimization. This paper extends the idea of an eddy to a distributed environment. The routing policies described in [2] and [18] are compared against several other routing policies in Sections 4 and 5.

Aurora [8] describes the architecture of a data stream management system. Aurora* [6] extends Aurora to a distributed environment and discusses load sharing techniques. Aurora also uses routing as a mechanism to reorder operators. The routing mechanism is similar to that of an eddy, and our results can be adapted to Aurora*.

STREAM [3] describes a query language and precise semantics of stream queries. [5][22] describe both operator scheduling and resource management in a centralized data stream management system, focusing on minimizing inter-operator queue length or memory consumption. In [22] a near-optimal scheduling algorithm for reducing inter-operator queue size is presented. In addition, [22] explores using constraints to optimize stream query plans.

Cougar [7][12] is a distributed sensor database system. Cougar focuses on forming clusters out of sensors to allow intelligent in-network aggregation that conserves energy by reducing the amount of communication between sensor nodes.

[27] asserts that execution time is not an appropriate goal for optimizing stream queries and proposes the use of output rates as more appropriate. The output rate metric proposed in [27] is essentially equivalent to our MDR.

Several approaches have been proposed for gathering statistics over a stream [4][11][13][16][19][20][21], with the primary focus being how to obtain good estimates over streaming data with limited amounts of memory and minimal CPU usage. These results will be critical to the design of accurate routing policies for any distributed eddy implementation.

There are many papers that describe the use of queuing networks to analyze computer systems; [14][15] are the standard texts on this subject.

3. Overview of the System Model and Performance Metrics

We model a distributed stream query plan as a set of operators Opi, i=1,..,n connected by a network. Input tuples to an operator are added to a first-come, first-served (FCFS) queue, as shown in Figure 3.1. Opi.R resources (i.e., CPU, memory and network bandwidth) are assumed to be available to each operator Opi.
We further assume that each input tuple to Opi consumes, on average, Opi.r resources. Thus, Opi can process at most Opi.R/Opi.r input tuples per time unit, and the average service time Ts for each individual tuple is Opi.r/Opi.R.

Operators in this model have only one input queue. We briefly explain how to implement the join operator, which logically has two input streams. Our treatment of join is very much like a distributed version of SteMs
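The resource model above can be made concrete with a small calculation. The sketch below is our own illustration (the class name, sample numbers, and the M/M/1 treatment of the FCFS queue are assumptions for exposition; the paper's actual queuing-network analysis is developed in Section 4). It derives an operator's maximum sustainable input rate Opi.R/Opi.r, its mean service time Ts = Opi.r/Opi.R, and an approximate per-tuple response time at a given arrival rate.

```python
from dataclasses import dataclass

@dataclass
class OperatorModel:
    R: float  # resources available per time unit (CPU, memory, bandwidth)
    r: float  # average resources consumed per input tuple

    @property
    def max_rate(self) -> float:
        """Maximum input tuples per time unit before the operator saturates."""
        return self.R / self.r

    @property
    def service_time(self) -> float:
        """Average service time Ts per tuple: r / R."""
        return self.r / self.R

    def response_time(self, arrival_rate: float) -> float:
        """Expected queueing + service time for one tuple, under the
        simplifying assumption that the FCFS queue behaves like M/M/1."""
        rho = arrival_rate * self.service_time  # utilization
        if rho >= 1.0:
            return float("inf")  # the operator has become a bottleneck
        return self.service_time / (1.0 - rho)

op = OperatorModel(R=100.0, r=2.0)
print(op.max_rate)            # 50.0 tuples per time unit
print(op.service_time)        # 0.02
print(op.response_time(25.0)) # 0.04: at 50% utilization, response time doubles
```

Note how the MDR metric corresponds to the largest input rate at which every operator's utilization stays below 1, while ART aggregates the per-operator response times along a tuple's route.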