Query Optimization in Database Systems

Query Optimization in Database Systems MATTHIAS JARKE Graduate School of Business Administration, New York Uniuersity, New York, New York 10006 JijRGEN KOCH Fachbereich Znformatik, Johann Wolfgang Goethe-Universitiit, 6000 Frankfurt 1, West Germany Efficient methods of processing unanticipated queries are a crucial prerequisite for the success of generalized database management systems. A wide variety of approaches to improve the performance of query evaluation algorithms have been proposed: logic-based and semantic transformations, fast implementations of basic operations, and combinatorial or heuristic algorithms for generating alternative access plans and choosing among them. These methods are presented in the framework of a general query evaluation procedure using the relational calculus representation of queries. In addition, nonstandard query optimization issues such as higher level query evaluation, query optimization in distributed databases, and use of database machines are addressed. The focus, however, is on query optimization in centralized database systems. Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; H.2.2 [Database Management]: Physical Design--access methodqH.2.3 [Database Management]: Languages-query languages; H.2.4 [Database Management]: Systems-query processing; H.2.6 [Database Management]: Database Machines; 1.1.1 [Algebraic Manipulation]: Expressions and Their Representation-simplification of expressions General Terms: Algorithms, Performance Additional Key Words and Phrases: Database implementation, query optimization, query simplification JNTRODUCTION Such conceptual models include the hier- archical, the network, the relational, and a Database management systems (DBMS) number of semantics-oriented models that have become a standard tool for shielding have been reviewed in a large number of the computer user from details of secondary books and surveys [Brodie et al. 19841. storage management. They are designed to A second area of interest is the safe and improve the productivity of application efficient implementation of the DBMS. programmers and to facilitate data access Computerized data have become a central by computer-naive end users. resource of most organizations. Each im- There have been two major areas of re- plementation meant for production use search in database systems. One is the anal- must take this into account by guarantee- ysis of data models into which the real ing the safety of the data in the cases of world can be mapped and on which inter- concurrent access [Bernstein and Good- faces for different user types can be built. man 19&c], recovery [Verhofstad 19781, Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and ita date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 0 1984 ACM 0360-0300/84/0600-0111$00.75 hnputing Surveys,Vol. 16, No. 2, June 1964 112 l M. Jarke and J. Koch CONTENTS 1982a] and extendable to the implementation of network DBMSs [Dayal and Good- man 19821. Moreover, many popular query languages, such as SQL [Astrahan and INTRODUCTION Chamberlin 19751 or QUEL [Stonebraker 1. THE QUERY OPTIMIZATION PROBLEM et al. 19761, map easily into relational cal- 1.1 Queries culus. 1.2 Optimization Objectives In the interest of space, the focus of the 1.3 Top-Down Approach to Query Optimization 2. QUERY REPRESENTATION paper is primarily on the problem of opti- 2.1 The Relational Calculus mizing queries in the centralized DBMS. 2.2 The Relational Algebra Centralized query optimization is not only 2.3 Query Graphs important in many mainframe databases- 2.4 Tableaus 3. QUERY TRANSFORMATION and more recently in microcomputer 3.1 Standardization DBMSs-but also appears as a subproblem 3.2 Simplification of query optimization in distributed sys- 3.3 Amelioration tems. Distributed query optimization itself 4. QUERY EVALUATION 4.1 One-Variable Expressions [Bayer et al. 1984; Sacco and Yao 1982; 4.2 Two-Variable Expressions Ullman 19821 is only addressed briefly, and 4.3 Multivariable Expressions the following two related areas are not 5. ACCESS PLANS treated at all: 5.1 Generation of Access Plans 5.2 Cost Analysis of Access Plans User Optimization. The overall cost of an 5.3 Selection of Access Plans information system is composed of the 5.4 Support for Multiple Queries DBMS cost and the costs of user efforts to 6. NONSTANDARD QUERY OPTIMIZATION 6.1 Higher Level Queries work with the system. The interface in the 6.2 Distributed Databases two areas consists of the functional capa- 6.3 Database Machines bilities and usability of the query language I. SUMMARY [Vassiliou and Jarke 19841, mainly in the ACKNOWLEDGMENTS REFERENCES response time of the system. If one assumes given functional capabilities of the query language and a response time minimization goal of the query evaluation system, query optimization can be handled as a separately and reorganization [Sockut and Goldberg tractable subproblem of user optimization. 19791. One major criticism of many early File Structures. A query optimization al- DBMSs has been their lack of efficiency in gorithm has to choose among a variety of handling the powerful operations they of- existing access paths to resolve a query. fer, particularly the content-based access The internal details of implementing such to data by queries. Query optimization tries access paths and the derivation of the re- to solve this problem by integrating a large lated cost functions (see, e.g., Teorey and number of techniques and strategies, rang- Fry [1982]) are beyond the scope of this ing from logical transformations of queries paper. to the optimization of access paths and the storage of data on the file system level. The paper is organized into six sections, Traditionally, each of these approaches following a top-down approach. In Section has used a different language. This is prob- 1 we present a global framework for query ably one of the reasons why no comprehen- optimization. In Section 2 we compare four sive survey of query optimization tech- techniques for representing queries in niques has yet been presented. The goal of terms of their suitability for optimization. this paper is to review query optimization In Section 3 we utilize one of these tech- techniques in the common framework of niques, the relational calculus, for present- relational calculus. This has been shown to ing logic-based transformations, including be technically equivalent to a relational the emerging methods of semantic query algebra representation [Codd 1972; Klug optimization. Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 113 After being transformed, a query must be comes necessary if ad hoc queries are to be mapped into a sequence of operations that asked by use of a general-purpose query return the requested data. In Section 4 we language. analyze the implementation of such opera- A second application of queries occurs in tions on a low-level system of stored data transactions that change the stored data and access paths. In Section 5 we present based on their current value (e.g., “give all optimization procedures for integrating assistant professors a 10 percent salary in- these operations into a globally optimal crease”). Finally, querylike expressions can access plan. be used internally in a DBMS, for example, A number of query optimization prob- to check access rights [Griffiths and Wade lems require special treatment because of 19761, maintain integrity constraints higher query complexity or certain charac- [ Stonebraker 19751, and synchronize con- teristics of the underlying hardware. Three current accesses correctly [Reimer 19831. such problem areas-higher level queries, distributed queries, and queries using da- 1.2 Optimization Objectives tabase machines-are summarized in Sec- tion 6. The economic principle requires that optimization procedures either attempt to max- 1. THE QUERY OPTIMIZATION PROBLEM imize the output for a given number of resources or to minimize the resource usage Exact optimization of query evaluation pro- for a given output. Query optimization tries cedures is in general computationally in- to minimize the response time for a given tractable and is hampered further by the query language and mix of query types in a lack of precise statistical information about given system environment. This general the database. Query evaluation algorithms goal allows a number of different opera- must rely heavily on heuristics. Neverthe- tional objective functions. The response less, the term “query optimization” will be time goal is reasonable only under the as- used to refer to strategies intended to im- sumption that user time is the most impor- prove the efficiency of query evaluation tant bottleneck resource. Otherwise, direct procedures. In this section we state the cost minimization of technical resource objectives of query optimization and pre- usage can be attempted. Fortunately, both sent a general procedure designed to struc- objectives are largely complementary; when ture the solution process. goal conflicts arise, they are typically re- solved by assigning limits to the availability 1.1 Queries of technical resources (e.g., those of main memory buffer space). A query is a language expression that de- In order to allow a fair comparison of scribes data to be retrieved from a database. efficiency, the functional capabilities of the In the context of query optimization, it is query evaluation systems to be compared often assumed that queries are expressed must be similar. The requirement of “rela- in a content-based (and mostly set-ori- tional completeness” coined by Codd [ 19721 ented) manner, giving the optimizer suffi- (compare Section 2.1) has become a quasi- cient choices among alternative evaluation standard. The techniques surveyed in this procedures. paper are presented as contributions to the Queries are used in several settings.

Load more