Automata Theory Based Approach to the Join Ordering Problem in Relational Database Systems
Automata Theory based Approach to the Join Ordering Problem in Relational Database Systems
Miguel Rodríguez1, Daladier Jabba1, Elias Niño1, Carlos Ardila1 and Yi-Cheng Tu2 1Department of Systems Engineering, Universidad del Norte, KM5 via Puerto Colombia, Barranquilla, Colombia 2Department of Computer Science and Engineering, University of South Florida, Tampa, U.S.A.
Keywords: Automata Theory, Query Optimization, Join Ordering Problem.
Abstract: The join query optimization problem has been widely addressed in relational database management systems (RDBMS). The problem consists of finding a join order that minimizes the time required to execute a query. Many strategies have been implemented to solve this problem including deterministic algorithms, randomized algorithms, meta-heuristic algorithms and hybrid approaches. Such methodologies deeply depend on the correct configuration of various input parameters. In this paper, a meta-heuristic approach based on the automata theory will be adapted to solve the join-ordering problem. The proposed method requires a single input parameter that facilitates its usage respect to those previously described in the literature. The algorithm was embedded into PostgreSQL and compared with the genetic competitor using the most resent TPC-DS benchmark. The proposed method is supported by experimental results achieving up to 30% faster response time than GEQO in different queries.
1 INTRODUCTION theoretically possible for a small number of relations. When N increases substantially, finding Since the early days of RDBMS the problem of the optimal join order is considered an NP-hard finding a join order to minimize the execution time problem and thus deterministic algorithms cannot of a query has been approached. (Chaudhuri, 1998) find a solution easily. defined large join queries as relational algebra Systems holding workloads from applications queries with N join operations involving N+1 such as decision support systems and business relations when N is greater or equals to 10. intelligence require the ability of joining more than Consecutively the Large Join Query Optimization 10 relations easily. In this paper, a Meta heuristic Problem (LJQOP) was formally addressed as finding approach based on the automata theory that has been a Query Execution Plan (QEP) with a minimum cost effectively used in the solution of the Traveling for a large join query. Salesman Problem (TSP) will be presented and its The LJQOP have been widely addressed and application in the solution of the join ordering many methods have been developed to solve it. problem will be discussed. Finally the automata Randomized algorithms such as iterative based query optimizer proposed in this work will be improvement and simulated annealing, evolutionary tested using the most recent decision support algorithms such as genetic algorithms and Meta benchmark TPC-DS. heuristics such as ant colony optimization are some The remaining parts of this work will be common strategies used in the solution of the distributed as follows. Previous work on solving the problem. LJQOP is discussed in section two. The proposed The solution space of a LJQOP consists of all methodology will be explained in section three. The query trees that answers the query. There are three experimental design and setup used to test the types of query trees that can result from the solution algorithm is going to be exposed in section four. A space: left deep, bushy and right deep. An extended discussion about the results obtained by the discussion about types of query trees is given in algorithm and a comparison analysis between the (Ioannidis and Kang, 1991). proposed method and the PostgreSQL genetic The construction of a LJQOP solution space is optimizer module is showed in section five. Finally
Rodríguez M., Jabba D., Niño E., Ardila C. and Tu Y.. 257 Automata Theory based Approach to the Join Ordering Problem in Relational Database Systems. DOI: 10.5220/0004433802570265 In Proceedings of the 2nd International Conference on Data Technologies and Applications (DATA-2013), pages 257-265 ISBN: 978-989-8565-67-9 Copyright c 2013 SCITEPRESS (Science and Technology Publications, Lda.) DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
conclusions and future work are presented. The calculation of the acceptance probability of the simulated annealing algorithm is adopted from the Metropolis algorithm and corresponds to the 2 RELATED WORK following equation. 1, The join ordering problem has been approached in (2.1) , different ways among the years. A literature review presented in (Steinbrunn et al., 1997) provides Different implementations of the simulated detailed information on different approaches to the annealing algorithm have been used to solve the join solution of the problem and classifies them in four ordering problem using different cooling schemas, groups. The first one corresponds to deterministic initial solutions, and solution generation algorithms such as dynamic programming and mechanisms. minimum selectivity algorithm. The second group, The implementation in (Ioannidis and Wong, randomized algorithms, includes simulated 1987) proposed the use of simulated annealing to annealing, iterative improvement, two-phase solve the recursive query optimization problem. The optimization and random sampling. The third group initial state was chosen using semi-naïve consists of genetic algorithms, which encode the evaluation methods and the initial temperature solutions and then uses selection, crossover and was chosen as twice the cost of the initial state. The mutation algorithms. Finally the fourth group is termination criterion of the algorithm is composed of compound of hybrid methods. two parts: the temperature must be below 1 and the Three of the most popular approximate solutions final state must remain the same for four consecutive to the join-ordering problem are simulated stages. The generation mechanism is based on a annealing, genetic algorithms and ant colony transition probability matrix : → 0,1 optimization. where each neighbor of the current state has the same probability to be chosen as the next state.
2.1 Simulated Annealing 1 ∈ , ′ | | (2.2) The annealing process in physics consists of 0 obtaining low energy states of a solid element being heated. Simulated Annealing takes advantage of the Finally authors suggest the use of two different Metropolis algorithm used to study equilibrium cooling schedules in their implementation. They properties in the microscopically analysis of solids. propose the use of the following equation to control Specifically the Metropolis algorithm generates a the temperature of the system. sequence of states for a solid object. Given an (2.3) element in state i with energy E a new element in state j is produced, if the difference between The function returns values between 0 and 1. energies is below cero, the new state is automatically The first strategy proposed consists of keeping a accepted; otherwise its acceptance will depend on constant value of 0.95 and the second one consists of certain probability based on the temperature the modifying the value of according to Table 1. system is exposed to and a physic constant known as Table 1: Factor to reduce temperature. Boltzmann constant k . Similarly, the simulated annealing algorithm constructs solutions to / combinatorial problems linking solution-generation 2 0.80 alternatives and an acceptance criterion. The states 4 0.85 of the system can be matched to solutions of the 8 0.90 combinatorial problem, and in the same way the cost ∞ 0.95 function of the optimization problem can be seen as the energy cost of the annealing system. Therefore A second approach to query optimization by the simulated annealing algorithm starts exposing simulated annealing is proposed in (Swami and the system to high temperatures and thus accepting Gupta, 1988) where two implementations of solutions that do not improve previous solutions. By simulated annealing are compared to several other terms of a cooling factor, the temperature starts algorithms including perturbation walk, Quasi- lowering until it reaches zero where solutions that do random sampling, local optimization and iterative not improve its parents are not accepted. improvement. The proposed simulated annealing
258 AutomataTheorybasedApproachtotheJoinOrderingProbleminRelationalDatabaseSystems
implementation uses an interesting generation mutation operation is essential for the genetic mechanism that combines two different strategies. algorithm to work properly, it must be used The first strategy is swapping which consists of carefully. selecting two positions in the vector and Genetic algorithms have also been used to solve interchanges its values and the second strategy is 3- the query optimization problem as alternative to cycle, which consists of randomly selecting 3 randomized algorithms. The genetic algorithm elements of the actual state and shift them one implemented by the authors of (Bennett et al., 1991) position to the right in a circle. In order to select is the first known genetic algorithm used to which strategy is used to generate the new solution approach the query optimization problem. The at a given iteration, variable ∈ 0,1 that authors adapted a genetic algorithm used to solve the represents the frequency of swap selection and thus assembly line balancing problem focusing on 1 that represents 3-cycle selection is used. finding an appropriate encoding schema and crossover operation to solve the query optimization 2.2 Genetic Algorithms problem. The author of (Muntes-Mulero et al., 2006) proposed the Carquinyoli Genetic Optimizer (CGO) The author of (Holland, 1992) explains that “Most which uses a tree fashioned codification for the organisms evolve by means of two primary process: algorithm to represent solutions, the crossover natural selection and sexual reproduction. The first operation randomly selects two members of the determines which members of the population survive current population and examines each tree’s to reproduce, and the second ensures mixing and operations and stores them in a list, then a sub tree recombination among the genes of their offspring”. from each parent is selected and an offspring is Genetic algorithms use the same procedure to seek generated by combining a sub tree from one parent an optimal solution to an optimization problem by and the ordered list of operations from the other, the selecting the most fitted solutions of the problem same procedure is applied to the other sub tree and and combining them to create new generations. operation list. Five different mutation strategies were A genetic algorithm heavily depends on the used, swap, change scan, change join, join performance of its three basic operations: selection, reordering and random sub tree. The selection recombination and mutation. strategy used by GCO is a simple elitist algorithm. In general the selection scheme describes how to Finally the commercial database system extract individuals from the current generation to PostgreSQL is equipped with GEQO, a genetic create new elements to be evaluated in the next optimizer that activates when the number of tables generation by creating a mating pool. Consequently involved in a query exceeds 10. GEQO is based on the elements selected in the present generation must the steady state genetic algorithm GENITOR that be “good” enough to be the parents of the new presents two main differences compared to generation. traditional genetic algorithms, the explicit use of The crossover operation is what makes genetic ranking and the genotype reproduction in an algorithms different from other randomized individual basis. methods. Based on the natural reproduction process where parts of the genes of both parents are 2.3 Ant Colony Optimization combined to form new individuals, the crossover function uses two individuals selected from the The optimization method based on ant colonies mating pool and combines them to create new consists of three procedures: ConstructAntsSolution, individuals. Several methods to crossover have been UpdatePheromones and DeamonActions. The first designed and some of the most popular are one-point method manages the construction of solutions by crossover, two-point crossover, cycle crossover and single ants using pheromone trails and heuristic uniform crossover. information. The second method uses the solution The mutation operation adds an additional constructed by the ant and update pheromone trail change to the new individuals of the population to accordingly increasing the amount of pheromones or prevent the generation of uniform populations and reducing the amount of pheromones due getting trapped in local optima. The mutation evaporation. The third method includes actions that operation in binary encoded genetic algorithms can cannot be performed by single ants like local be easily implemented by selecting a random bit optimization procedures or additional pheromone from the encoded string and change its value by increases. using the negation operation. Even though the Lately new algorithms to solve the query
259 DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
optimization problem have been proposed based on ∗ 1 the ACO theory. Different approaches to query |Σ| (3.2) 2 optimization using ant algorithms have been : Σ → Q, is the transition function and developed and also combined with genetic takes the node ∈ and swap the elements in the algorithms to increase the accuracy of the solutions positions indicated by the element of the alphabet, found. In this section different approaches to query is the initial state and is given by an initial optimization assembled on ant colony optimization algorithms are studied. Specifically three different solution to the problem, is the set of final states, approaches will me mentioned: the first known ACO is the input vector containing the initial order of algorithm to be applied to the query optimization elements corresponding to the state , is the problem (LI et al., 2008), a best-worst ACO objective function of the combinatorial problem that combined with a genetic algorithm to solve the evaluates the given order in the node . query optimization problem (Zhou et al., 2009) and For instance the following example is given to finally another type of combination between ACO understand the construction of a DFAS. Given the and GA to approach the join ordering problem is objective function of a combinatorial optimization presented in (Kadkhodaei and Mahmoudi, 2011). problem and an input vector. 0.3 0.2 0.1 (3.3)