Query Optimization in Systems

MATTHIAS JARKE Graduate School of Business Administration, New York Uniuersity, New York, New York 10006 JijRGEN KOCH Fachbereich Znformatik, Johann Wolfgang Goethe-Universitiit, 6000 Frankfurt 1, West Germany

Efficient methods of processing unanticipated queries are a crucial prerequisite for the success of generalized database management systems. A wide variety of approaches to improve the performance of query evaluation algorithms have been proposed: logic-based and semantic transformations, fast implementations of basic operations, and combinatorial or heuristic algorithms for generating alternative access plans and choosing among them. These methods are presented in the framework of a general query evaluation procedure using the representation of queries. In addition, nonstandard query optimization issues such as higher level query evaluation, query optimization in distributed , and use of database machines are addressed. The focus, however, is on query optimization in centralized database systems. Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; H.2.2 [Database Management]: Physical Design--access methodqH.2.3 [Database Management]: Languages-query languages; H.2.4 [Database Management]: Systems-query processing; H.2.6 [Database Management]: Database Machines; 1.1.1 [Algebraic Manipulation]: Expressions and Their Representation-simplification of expressions General Terms: Algorithms, Performance Additional Key Words and Phrases: Database implementation, query optimization, query simplification

JNTRODUCTION Such conceptual models include the hier- archical, the network, the relational, and a Database management systems (DBMS) number of semantics-oriented models that have become a standard tool for shielding have been reviewed in a large number of the computer user from details of secondary books and surveys [Brodie et al. 19841. storage management. They are designed to A second area of interest is the safe and improve the productivity of application efficient implementation of the DBMS. programmers and to facilitate data access Computerized data have become a central by computer-naive end users. resource of most organizations. Each im- There have been two major areas of re- plementation meant for production use search in database systems. One is the anal- must take this into account by guarantee- ysis of data models into which the real ing the safety of the data in the cases of world can be mapped and on which inter- concurrent access [Bernstein and Good- faces for different user types can be built. man 19&c], recovery [Verhofstad 19781,

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and ita date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 0 1984 ACM 0360-0300/84/0600-0111$00.75

hnputing Surveys,Vol. 16, No. 2, June 1964 112 l M. Jarke and J. Koch

CONTENTS 1982a] and extendable to the implementa- tion of network DBMSs [Dayal and Good- man 19821. Moreover, many popular query languages, such as SQL [Astrahan and INTRODUCTION Chamberlin 19751 or QUEL [Stonebraker 1. THE QUERY OPTIMIZATION PROBLEM et al. 19761, map easily into relational cal- 1.1 Queries culus. 1.2 Optimization Objectives In the interest of space, the focus of the 1.3 Top-Down Approach to Query Optimization 2. QUERY REPRESENTATION paper is primarily on the problem of opti- 2.1 The Relational Calculus mizing queries in the centralized DBMS. 2.2 The Centralized query optimization is not only 2.3 Query Graphs important in many mainframe databases- 2.4 Tableaus 3. QUERY TRANSFORMATION and more recently in microcomputer 3.1 Standardization DBMSs-but also appears as a subproblem 3.2 Simplification of query optimization in distributed sys- 3.3 Amelioration tems. Distributed query optimization itself 4. QUERY EVALUATION 4.1 One-Variable Expressions [Bayer et al. 1984; Sacco and Yao 1982; 4.2 Two-Variable Expressions Ullman 19821 is only addressed briefly, and 4.3 Multivariable Expressions the following two related areas are not 5. ACCESS PLANS treated at all: 5.1 Generation of Access Plans 5.2 Cost Analysis of Access Plans User Optimization. The overall cost of an 5.3 Selection of Access Plans information system is composed of the 5.4 Support for Multiple Queries DBMS cost and the costs of user efforts to 6. NONSTANDARD QUERY OPTIMIZATION 6.1 Higher Level Queries work with the system. The interface in the 6.2 Distributed Databases two areas consists of the functional capa- 6.3 Database Machines bilities and usability of the I. SUMMARY [Vassiliou and Jarke 19841, mainly in the ACKNOWLEDGMENTS REFERENCES response time of the system. If one assumes given functional capabilities of the query language and a response time minimization goal of the query evaluation system, query optimization can be handled as a separately and reorganization [Sockut and Goldberg tractable subproblem of user optimization. 19791. One major criticism of many early File Structures. A query optimization al- DBMSs has been their lack of efficiency in gorithm has to choose among a variety of handling the powerful operations they of- existing access paths to resolve a query. fer, particularly the content-based access The internal details of implementing such to data by queries. Query optimization tries access paths and the derivation of the re- to solve this problem by integrating a large lated cost functions (see, e.g., Teorey and number of techniques and strategies, rang- Fry [1982]) are beyond the scope of this ing from logical transformations of queries paper. to the optimization of access paths and the storage of data on the file system level. The paper is organized into six sections, Traditionally, each of these approaches following a top-down approach. In Section has used a different language. This is prob- 1 we present a global framework for query ably one of the reasons why no comprehen- optimization. In Section 2 we compare four sive survey of query optimization tech- techniques for representing queries in niques has yet been presented. The goal of terms of their suitability for optimization. this paper is to review query optimization In Section 3 we utilize one of these tech- techniques in the common framework of niques, the relational calculus, for present- relational calculus. This has been shown to ing logic-based transformations, including be technically equivalent to a relational the emerging methods of semantic query algebra representation [Codd 1972; Klug optimization.

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 113 After being transformed, a query must be comes necessary if ad hoc queries are to be mapped into a sequence of operations that asked by use of a general-purpose query return the requested data. In Section 4 we language. analyze the implementation of such opera- A second application of queries occurs in tions on a low-level system of stored data transactions that change the stored data and access paths. In Section 5 we present based on their current value (e.g., “give all optimization procedures for integrating assistant professors a 10 percent salary in- these operations into a globally optimal crease”). Finally, querylike expressions can access plan. be used internally in a DBMS, for example, A number of query optimization prob- to check access rights [Griffiths and Wade lems require special treatment because of 19761, maintain integrity constraints higher query complexity or certain charac- [ Stonebraker 19751, and synchronize con- teristics of the underlying hardware. Three current accesses correctly [Reimer 19831. such problem areas-higher level queries, distributed queries, and queries using da- 1.2 Optimization Objectives tabase machines-are summarized in Sec- tion 6. The economic principle requires that opti- mization procedures either attempt to max- 1. THE QUERY OPTIMIZATION PROBLEM imize the output for a given number of resources or to minimize the resource usage Exact optimization of query evaluation pro- for a given output. Query optimization tries cedures is in general computationally in- to minimize the response time for a given tractable and is hampered further by the query language and mix of query types in a lack of precise statistical information about given system environment. This general the database. Query evaluation algorithms goal allows a number of different opera- must rely heavily on heuristics. Neverthe- tional objective functions. The response less, the term “query optimization” will be time goal is reasonable only under the as- used to refer to strategies intended to im- sumption that user time is the most impor- prove the efficiency of query evaluation tant bottleneck resource. Otherwise, direct procedures. In this section we state the cost minimization of technical resource objectives of query optimization and pre- usage can be attempted. Fortunately, both sent a general procedure designed to struc- objectives are largely complementary; when ture the solution process. goal conflicts arise, they are typically re- solved by assigning limits to the availability 1.1 Queries of technical resources (e.g., those of main memory buffer space). A query is a language expression that de- In order to allow a fair comparison of scribes data to be retrieved from a database. efficiency, the functional capabilities of the In the context of query optimization, it is query evaluation systems to be compared often assumed that queries are expressed must be similar. The requirement of “rela- in a content-based (and mostly set-ori- tional completeness” coined by Codd [ 19721 ented) manner, giving the optimizer suffi- (compare Section 2.1) has become a quasi- cient choices among alternative evaluation standard. The techniques surveyed in this procedures. paper are presented as contributions to the Queries are used in several settings. The implementation of queries in a relationally most obvious application is that of direct complete language with minimal evaluation requests by end users who need information cost or response time. Queries of higher about the structure or content of the data- complexity [Chandra and Hare1 1982a] are base. If the requests are limited to a set of considered in Section 6.1. The total cost to standard queries, they can be optimized be minimized is the sum of the following: manually by programming the associated search procedures and restricting the user’s Communication Cost: The cost of trans- input to a menu format. However, an mitting data from the site where they are automatic query optimization system be- stored to the sites where computations are

Computing Surveys, Vol. 16, No. 2, June 19.34 114 l M. Jarke and J. Koch performed and results are presented. These measured by the number of comparisons to costs are composed of costs for the com- be performed). A number of common ideas munication line, which are usually related underly most techniques developed to re- to the time the line is open, and costs for duce these costs. They try to (1) avoid the delay in processing caused by transmis- duplication of effort, (2) use standardized sion. The latter, which is more important parts, (3) look ahead in order to avoid un- for query optimization, is often assumed to necessary operations, (4) choose the cheap- be a linear function of the number of data est way to execute elementary operations, transmitted. and (5) sequence them in an optimal fash- Secondary Storage Access Cost: The cost ion. The following simple example demon- of (or time for) loading data pages from strates what can be expected from query secondary storage into main memory. This optimization. is influenced by the number of data to be Consider the relational schema of a da- retrieved (mainly by the size of intermedi- tabase that describes employees offering ate results), the clustering of data on phys- computer lectures to departments of a geo- ical pages, the size of the available buffer graphically distributed organization: space, and the speed of the devices used. Storage Cost: The cost of occupying sec- employees (e, ename, status, city) ondary storage and memory buffers over papers (enr, title, year) time. Storage costs are relevant only if stor- departments (dname city, street address) age becomes a system bottleneck and if it courses (E, cz’abstract) can be varied from query to query. lectures (cnr-5 -3dname -2enr daytime) Computation Cost: The cost for (or time Key attributes are underlined; a given com- of) using the central processing unit (CPU). bination of key attribute values identifies a The structure of query optimization al- element uniquely. Assume that a gorithms is strongly influenced by the user is interested in the trade-off among these cost components. In “names of departments located in New long-range distributed DBMSs with rela- York offering courses on database tively slow communication lines, commu- management.-” nication delay dominates the costs, whereas the other factors are relevant only for local There are many possible strategies to solve suboptimization. In centralized systems, this query, three of which are compared the costs are dominated by the time for with respect to the following assumptions secondary storage accesses although the on actual data values. Note that the de- CPU costs may be quite high for complex tailed data used for the computations below queries [Gotlieb 19751. In locally distrib- are not usually available to the query op- uted DBMSs, all factors have similar timizer, but have to be estimated. weights, which results in very complex cost There are 100 “departments”, 5 of which functions and optimization procedures. are located in New York. A physical block Since the focus of this paper is on cen- can take 5 department records or 50 dname tralized databases, comuunication costs are values. not considered because in such systems There are 500 “courses”, 20 of which are communication requirements are inde- on database management. The physical pendent of the evaluation strategy. For the block size is 10 records. optimization of single queries, storage costs There are 2000 “lectures”. Three hun- are usually also assumed to be of secondary dred are on database management, 100 are importance. They are considered only for held in New York departments, and 20 the simultaneous optimization of multiple (from 3 departments) satisfy both condi- queries. tions. The physical block size is 10 records. There remain the costs of secondary stor- Assume further that sorting time is N * age accesses (usually measured by the num- log(B)N, where iV is the file size in blocks, ber of page accesses) and CPU usage (often and that there is a buffer of one block for

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 115 each relation. Finally, all relations are selections as early as possible and thereby physically sorted by ascending key values. reducing the cost for sorting and merging The first strategy presented here follows intermediate results: a straightforward approach of translating a Strategy 3 relational calculus expression into a se- 1. Merge “courses” with “lectures”, and quence of algebra operations [Codd 19721. (r: 50 + 200) Together with each step of the strategy, the 2. keep only the dnames of combinations numbers of secondary storage accesses re- with cname = ‘database management’ quired to read (r) or write (w) a physical (w: 2) block are given: 3. Sort the dname list generated. (r + w: 2) Strategy I 4. Merge the result with the “departments” re- 1. Form the Cartesian product of the “courses”, lation, and “lectures”, and “departments” relations, and (r: 2 + 20) (r: 200000) 5. keep only those dnames with city = ‘New 2. Retain the dname of those “depart- York’. ment” records, for which the cnr of “courses” and “lectures” match, (w: 1) and total: 277 accesses. the dname of “lectures” and “departments” match, Thus a reduction by a factor of approxi- and cname = ‘database management’ mately 700 has been achieved. For larger and city = New York, (w: 1) databases and more complex queries, more total: approximately 200000 accesses. sophisticated techniques may result in even higher reductions. The extremely high cost of this strategy results from the fact that it generates an 1.3 Top-Down Approach intermediate result, of which only a very to Query Optimization small portion is actually relevant for fur- ther processing. Efficiency can be improved Query optimization research in the litera- substantially by considering only those ture can be divided in two classes, which combinations of elements from different can be described as bottom up and top relations that have matching values in com- down. Researchers found the overall query mon attributes. By making use of existing optimization problem to be very complex. sort orders, such combinations can be Theoretical work began with a bottom-up formed by “merging” the participating re- approach, studying special cases, such as lations. This operation is called a “”: the optimal implementation of important operations and evaluation strategies for Strategy 2 certain simple subclasses of queries. Sub- 1. Merge %ourses” and “lectures”. sequently researchers attempted to com- (r: 50 + 200; w: 400) pose larger building blocks from these early 2. Sort the result by dnames. results. (r + w: 400 log(2)400) A need for working systems triggered the 3. Merge the result with “departments”. development of full-scale query evaluation (r: 400 + 20; w: 400 + 400) procedures, which stressed the generality 4. Select the combinations with city = ‘New York’ and cname = ‘database management’, of solutions and handled query optimiza- and tion in a uniform and heuristic manner (r: 800) [Astrahan and Chamberlin 1975; Makinou- 5. keep only the dname column. chi et al. 1981; Niebuhr et al. 1976; Palermo (w: 1) 1972; Schenk and Pinkert 1977; Wong and Youssefi 19761. As this often did not total: approximately 6000 accesses. achieve competitive system efficiency, the The cost for answering the query can be current trend seems to be a top-down ap- reduced further by performing value-based proach that incorporates more knowledge

Computing Surveys, Vol. 16, No. 2, June 1964 116 l M. Jarke and J. Koch about special case optimization opportuni- additional information must be compared ties into the general procedures. At the to its value. same time, the general algorithms them- selves have been augmented by combina- 2. QUERY REPRESENTATION torial cost-minimization procedures for choosing among strategies. Queries can be represented in a number of This paper follows the top-down ap- forms. In the context of query optimization, proach, utilizing the general evaluation an appropriate query representation form procedure that follows as a framework for must fulfill the following requirements: It the specific techniques developed in query should be powerful enough to express a optimization research: large class of queries, and it should provide a well-defined basis for query transforma- Step 1. Find an internal query represen- tion. In this section we present four differ- tation into which user queries can easily be ent query representation forms, each of mapped that leaves the system all neces- which has been used in a number of ap- sary degrees of freedom to optimize the proaches to query optimization. evaluation. Step 2. Apply logical transformations to 2.1 The Relational Calculus the query representation that (1) standard- ize the query, (2) simplify the query to avoid The (tuple) relational calculus as intro- duplication of effort, and (3) ameliorate the duced by Codd [1971, 19721 is a notation query to streamline the evaluation and to for defining the result of a query through allow special case procedures to be applied. the description of its properties. The rep- Step 3. Map the transformed query into resentation of a query in relational calculus alternative sequences of elementary opera- consists of two parts: the target list and the tions for which a good implementation and selection expression. its associated cost are known. The result of The selection expression specifies the this step is a set of candidate “access plans”. contents of the relation resulting from the Step 4. Compute the overall cost for each query by means of a first-order predicate access plan, choose the cheapest one, and (i.e., a generalizedBoolean expression pos- execute it. sibly containing existential and/or univer- sal quantifiers). The target list defines the The first two steps of this procedure are free variables occurring in the predicate to a large degree data independent and thus and specifies the structure of the resulting often can be handled at compile time. The relation. Example 2.1 demonstrates the re- quality of Steps 3 and 4, that is, the richness lational calculus representation using the of the access plans generated and the opti- syntax of the database programming lan- mality of the choice algorithm, heavily de- guage Pascal/R [Schmidt 19771. pends upon knowledge about the values in Example 2.1. Names of processors who the database. published some paper in 1981. The consequences of data dependence are twofold. First, if the database is volatile, [ (e.ename) OF Steps 3 and 4 can be done only at run time. EACH e IN employees: This means that the possible gain in effi- e. status = professor ciency must be traded off against the cost AND SOME p IN papers of the optimization itself. Second, a meta- (e.enr = p.enr AND p.year = 1981)] database (e.g., an augmented data diction- ary) must maintain general information In the target list, that is, the subexpres- about the database structure as well as sion preceding the colon, the range of the statistical information about the database (free) variable e is restricted to elements of contents. As in many similar operational the relation “employees”. The relation “em- research problems (e.g., inventory control), ployees” is therefore called the range relu- the costs of obtaining and maintaining this tion of e. The target attribute specification

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 117 “( e.ename)” indicates that only the names power. A representation form is said to be of employees are retained for the query relationally complete if it allows the defini- result. tion of any query result definable by a The selection expression-the predicate relational calculus expression. Clearly, re- following the colon-defines constraints on lational completeness has to be considered the free variable. The first constraint is a as a minimum requirement with respect to restrictive or monadic term, restricting the expressive power. An often cited example free variable to those “employees” records for a conceptually simple query that goes that have the status value “professor”. This beyond relational completeness is “find the constraint is AND-connected with a join or names of employees reporting to manager dyadic term, relating “employees” to “pa- Smith at any level”, provided that a hier- pers”, and another monadic term, further archy of employees is modeled in a single restricting the result to those employees relation (e.g., via a name and manager at- who published some paper in 1981. The tribute) [Pirotte 19791. Furthermore, quer- comparison operators usually allowed in ies in today’s applications often contain terms are =, #, <, >, 5, and ZZ. aggregations that cannot be expressed in In contrast to the one-sorted predicate pure relational calculus. However, the ex- calculus, the relational calculus allows var- tension of relational calculus by aggregate iables to be bound to different sorts (range functions is rather straightforward [Klug relations); for instance, variable e is bound 1982b; Maier and Warren 19811. to “employees” and variable p is bound to “papers”. The consequences of the many sortedness of the relational calculus with 2.2 The Relational Algebra respect to query transformation are dis- The relational algebra as defined by Codd cussed in Section 3.1. - [1972] is a collection of operators on rela- In addition to the logical operator AND, tions. These operators fall into two classes, the operators OR and NOT can also be that is, traditional set operators, such as used in predicates. Relational calculus Cartesian product, union, intersection, and predicates are completely defined by the difference, and special relational algebra following recursive rules: operators, such as restriction, projection, 1. Atomic predicates: join, and division. The special operators are defined below by relating them to equiva- (i) A (monadic or dyadic) term is an lent relational calculus expressions. atomic predicate. The restriction operator applied to a re- (ii) TRUE is an atomic predicate. lation “rel” constructs a horizontal subset (iii) FALSE is an atomic predicate. according to a quantifier-free predicate 2. An atomic predicate is a predicate. Let containing only monadic terms or intrare- A be a predicate, r an element variable, lational dyadic terms (comparisons be- and rel a relation. Then tween two attributes of the same relation element): (i) SOME r IN rel(A), (ii) ALL r IN rel(A ) Rest(re1, pred) = [EACH r IN rel: pred]. are also predicates. The projection operator serves to con- 3. Let A and B be predicates. Then struct a vertical subset of a relation “rel” (i) NOT (A) (negation), by selecting a set A of specified attributes (ii) A and B (conjunction), and eliminating duplicate tuples within (iii) A OR B (disjunction) these attributes: are predicates. Proj(re1, A) = [(r.A) OF EACH r IN rel: 4. No other formulas are predicates. true]. In Codd [1972] the relation calculus has The join operator permits two relations been introduced as a yardstick of expressive “reli” and “re12” to be combined into a

Computing Surveys, Vol. 16, No. 2, June 1964 118 . M. Jarke and J. Koch single relation whose attributes are the alent algebra expression. An analogous re- union of the attributes of “reh” and “relp”: sult for algebra and calculus expressions extended by aggregate functions has been Join(re&, A op B, relz) proven by Klug [1982a]. = [EACH rl IN reh, EACH r-2 IN re12: r1.A op r&l]. 2.3 Query Graphs The comparison operators “op” allowed in Graphs have been used for the visual rep- joins are the same as those in dyadic terms resentation of structured objects in a num- of the relational calculus. If “op” is the ber of areas. Two well-known examples are equality operator ‘=‘, the “natural” join the use of syntax trees in compiler con- omits either A or B in the result. struction and the use of AND/OR graphs The division operator provides an alge- artificial intelligence applications. braic counterpart to the universal quanti- graphs are used in query optimization for fier. It is defined as follows: the representation of queries or query eval- uation strategies. Two classes of graphs can Divi(relr, A by B, re12) be distinguished: object graphs and opera- = [ ( r.compl(A ) ) OF EACH rl IN re&: tor graphs. ALL r2 IN re12 SOME r3 IN rell Nodes in object graphs represent objects ( rl.compl(A ) = r2.compl(A ) AND such as (relation) variables and constants. r2.B = r3A)]. Edges describe predicates that these objects where compl(A ) is the complement of A in are to fulfill [Bernstein and Chiu 1981; the attribute set of “reh”. As the definition Palermo 1972; Youssefi and Wong 19791. indicates, division is a rather complex op- Object graphs contain the properties of the eration, which can make the understanding query result and are therefore closely re- of a query a difficult job. lated to the relational calculus. Operator Example 2.2 represents the query of Ex- graphs describe an operator-controlled data ample 2.1 in relational algebra. flow by representing operators as nodes that are connected by edges indicating the Example 2.2. Names of professors who direction of data movement. In Smith and published some paper in 1981. Chang [1975] and Yao [1979], operator Proj(Rest(Join(employees, graphs have been used for the representa- enr = enr, tion of algebra expressions. Figures 1 and 2 Rest(papers, year = 1981)), give one example, respectively, for an object status = professor), graph and an operator graph. ename) Query graphs have many attractive prop- As opposed to a relational calculus erties. The visual presentation of a query expression, which describes the relation re- contributes to an easier understanding of sulting from a query by means of its prop- its structural characteristics. In addition, erties, a relational algebra expression de- graph theory offers a number of results fines an algorithm for the construction of useful for the automatic analysis of graphs, the resulting relation. A calculus expression for example, discovery of cycles and tree appears to be a better starting point for property. Finally, an important advantage query optimization since it provides an op- of query graphs is that they can be easily timizer only with the basic properties of the augmented with additional information. query; optimization opportunities may be- For example, the augmentation of graphs come hidden in a particular sequence of with details of the physical data organiza- algebra operators. With respect to rela- tion of a database has been proposed by tional completeness, however, the rela- Rosenthal and Reiner [1982]. tional algebra is at least as powerful as the 2.4 Tableaus relational calculus. In Codd [1972] it has been shown that any relational calculus Tableaus as defined by Aho et al. [1979a, expression can be translated into an equiv- 1979b, 1979c] are tabular notations for a

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 119

e.ename

Figure 1. An object graph representing the example query.

t

ename (projection) I

t

status = professor (restriction)

a t

I enr = enr (join)

“c3

t employees t

year = 1981 (restriction)

a

t papers

Figure 2. An operator graph representing the example query. subset of relational calculus queries, char- responding to existentially quantified var- acterized by containing only AND-con- iables), constants, blanks, and tags (indi- nected terms and no universal quantifiers. cating the range relation). Thus tableau queries are a particular kind Figure 3 illustrates the construction of a of conjunctive queries [Char&a and Merlin tableau representing the query of Example 1977; Rosenkrantz and Hunt 19801. 2.1. It starts with tableaus for single rela- Tableaus are specialized matrices, the tions and proceeds by combining these ta- columns of which correspond to the attri- bleaus into new tableaus for larger and butes of the underlying database schema. larger subexpressions. Distinguished vari- The first of the matrix, the summary, ables are denoted by a’s; nondistinguished serves the same purpose as the target list ones are denoted by b’s. of a relational calculus expression. The Expressions containing disjunction (set other rows describe the predicate. The sym- union) and negation (set difference) can be bols appearing in a tableau are distin- represented by sets of tableaus [Sagiv and guished variables (corresponding to free Yannakakis 19801. Klug [1983] and John- variables), nondistinguished variables (cor- son and Klug [1983] use sets of tableaus

Computing Surveys, Vol. 16, No. 2, June 1984 120 l M. Jarke and J. Koch status e*llM enr year T(employees) =

T(papers) = a3 a4

papers

T( Rest(papers,year=l981) ) =

$1 papers

T( Join(employees. enr=enr , Rest(papers,year=1981) ) =

T( Rest(Join(employees. enr=enr, Rest(papers,year=l981), status=professor) ) =

T( Proj(Rest(Join(employees. enr=enr. Rest(papers,year=1961). status=professor). ename) ) =

Figure 3. Stepwise construction of a tableau T representing the query of Example 2.1. for representing general conjunctive quer- The transformation of a given expression ies. The specific value of tableaus with re- into an equivalent one by means of well- spect to query optimization is discussed in defined rules is the subject of this section. Section 3.2. The goals of query transformation are 3. QUERY TRANSFORMATION threefold: (1) the construction of a stand- ardized starting point for query optimiza- We have seen that queries can be expressed tion (stundardization), (2) the elimination in a number of different representation of redundancy (simplification), and (3) the forms. Additionally, a number of semanti- construction of expressions that are im- cally equivalent expressions may exist for proved with respect to evaluation perform- each query, even within a given language. ance (amelioration).

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems 121 3.1 Standardization (see Grant and Minker [1981], Jarke et al. [1984], and Warren [1981] for exceptions), Several approaches to query optimization a detailed description is omitted at this define a standardized starting point point. through a normalized version of the under- The transformation of an arbitrary rela- lying query representation form [Jarke and tional calculus expression into prenex nor- Schmidt 1981; Kim 1982; Palermo 1972; mal form is a matter of moving quantifiers Wong and Youssefi 19761. In the following, over terms (from right to left). Quantifier we present two normal forms for the rela- movement is governed by the transforma- tional calculus together with the rules to be tion rules of 1. Normalization of the obeyed by a normalization procedure. matrix is rather straightforward and can be A relational calculus representation of a achieved by using DeMorgan’s rules, the query is said to be in prenex normal form if distributive rules, and the rule of double its selection expression is of the form negation (see Table 2). SOME/ALL rl IN rell - - - The distinction between empty and non- SOME/ALL r, IN rel,(M), empty range relations in rules Q2 and Q3 of Table 1 results from the variability of where A4 is a quantifier-free predicate. M relations over time and the many sorted- is called the matrix and can also be stand- ness of the relational calculus [Jarke and ardized. A matrix consisting of a disjunc- Schmidt 19821. A relational calculus tion of conjunctions (of terms Ai;), such as expression can be transformed into an equivalent expression of a one-sorted cal- (A11 AND . . . AND Al,) OR . - - culus by introducing a range definition such OR (AmI AND . . . AND A,,), as (r IN rel) as another type of atomic predicate: is said to be in disjunctive normal form, and a matrix consisting of a conjunction of dis- 01: SOME r IN rel(pred) junctions, such as w SOME r((r IN rel) AND pred), (Al, OR . . . OR AI,) AND . . . 02: ALL r IN rel(pred) AND (Am1 OR . . . OR A,,), ti ALL r( (r IN rel) + pred). is in conjunctive normal form. The application of rules Q2a and Q3a when The prenex normal form combined with moving a quantifier over a term would the normal forms for the matrix yields two therefore yield a wrong result in the case of normal forms for relational calculus expres- an empty range relation. It follows that sions: disjunctive prenex normal form normalization of an arbitrary relational cal- (DPNF) and conjunctive prenex normal culus expression at compile time must pre- form (CPNF). The use of DPNF is moti- serve information about the original range vated by the goal to optimize and evaluate definition of variables so that run-time independent query components separately modifications according to rules Q2b and [Bernstein et al. 19811. The CPNF has Q3b can be performed when necessary. proved useful for the decomposition of queries [Wong and Youssefi 19761 and for data-dependent amelioration (e.g., testing 3.2 Simplification the most restrictive disjunction first). Queries in CPNF can be transformed We have already seen that there might be further into a quantifier-free form popular several semantically equivalent expressions in artificial intelligence theorem-proving representing one and the same query. One applications, the so-called clausal form source of differences between any two [Nilsson 19821. Logic-based database lan- equivalent expressions is their degree of guages such as Prolog [Kowalski 19811 are redundancy [Hall 1976; Stroet and Eng- based on clausal form. Since clausal form mann 19791. A straightforward evaluation has been rarely used in query optimization of a redundant expression would lead to

Computing Surveys, Vol. 16, No. 2, June 1984 122 l M. Jarke and J. Koch Table 1. Transformation Rules for Quantified Expressions

Q1: A AND SOME r IN rel (B(r)) <==> SOME r IN rel (A AND B(r))

Q2: A OR SOYE r IN rel (B(r)) <==> ;j ;OYE r IN rel (A OR B(r)) rel Z [] rel = []

43: A AND ALL r IN rsl (B(r)) <==> ;i ;LL r in rel (A AND B(r)) rel # [] rel = []

Q4: A OR ALL r IN rel (B(r)) <==> ALL r IN rel (A OR B(r))

45: SOME rl IN rell SOME r2 IN rel2 (A(rl.r2)) <==> SOME r2 IN re12 SOYE rl IN rell (A(rl.r2))

QS: ALL rl IN rell ALL r2 IN re12 (A(rl.rZ)) <==> ALL r2 IN re12 ALL rl IN rell (A(rl.r2))

47: SOME r IN rel (A(r) OR B(rj) <==> SOME r IN rel (A(r)) OR SOME r IN rel (B(r))

QS: ALL r IN rel (A(r) AND B(r)) <==> ALL r IN rel (A(r)) AND ALL r in rel (B(r))

QB: NOT ALL r IN rel (A(r)) <==> SOME r IN rel (NOT(A(r)))

QlO:NOT SOME r IN rel (A(r)) <==> ALL r IN rel (NOT(A(~)))

the execution of a set of operations, some [EACH e IN employees: of which are superfluous. Therefore query e.ename = ‘Smith’ optimization aims at the elimination of re- OR dundancy by means of transforming a re- (e.status = assistant dundant expression into an equivalent non- OR e.status = professor) AND redundant one. NOT(e.status = professor A redundant expression can be simplified OR e.status = assistant)] by applying the transformation rules M4a to to M4j, which consider idempotency (see Table 2). The application of these rules is [EACH e IN employees: e.ename = ‘Smith’] complicated by the fact that idempotency can occur at any level in the expression, by means of rules M4d and M4g, the sub- expressions owing to the presence of common sub- expressions, that is, subexpressions that (e.status = assistant OR e.status = professor) occur more than once in the expression and representing the query. Thus, in order to simplify an expression such as (e.status = professor OR e.status = assistant)

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems l 123 Table 2. Transformation Rules for General Expressions

kl : Comuutative rules

a) A OR B <==> B OR A

b) A AND B <==> B AND A

Y2: Associative rules

a) (A OR B) OR C <==> A OR (EJ OR C)

b) (A AND B) AND C <==> A AND (B AND C)

Y3: Distributive rules

a) A OR (B AND C) <==> (A OR B) AND (A OR C)

b) A AND (B OR C) <==> (A AND B) OR (A AND C)

Y4: Idempotency rules

a) A OR A <==> A

b) A AND A <==> A

c) A OR NOT(A) <==> TRUE

d) A AND NOT(A) <==> FALSE

e) A AND (A OR B) <==> A

I) A OR (A AND B) <==> A

2) A OR FALSE <==> A

h) A AND TRUE <==> A

i) A OR TRUE <==> TRUE

j) A AND FALSE <==> FALSE

Y5: De Morgan’s rules

a) NOT (A AND B) <==> NOT (A) OR NOT (B)

b) NOT (A OR B) <==> NOT (A) AND NOT (B)

MB: Double negation rule

NOT (NOT (A)) <==> A

must first be recognized as being equiva- in Table 3. (Note that these rules can be lent. Algorithms are given by Downey et al. applied only at run time.) [1980] and Hall [1974, 19761. The recogni- Terms as defined in Section 2.1 serve as tion of common subexpressions and the atomic predicates in the relational calculus. application of idempotency rules have to be However, these terms can be simplified or performed concurrently rather than se- removed if the semantics of the comparison quentially, since the simplification of an operators are explicitly taken into account. expression by means of idempotency rules An important application is the so-called may yield further common subexpressions, constant propagation, which uses transitiv- which, in turn, are subject to simplification. ity laws, such as Expressions that are bound to empty rela- tions can also be simplified. Transforma- r.A op s.B AND s.B = const tion rules for their simplification are given =a r.A op const,

Computing Surveys, Vol. 16, No. 2, June 1984 124 . M. Jarke and J. Koch Table 3. Transformation Rules for Expressions with Empty Relations

El: [EACH r in [I: pred] <==> []

~2: [ OF EACH r IN [I: pred] <==> [I

E3: SOME r IN [] (pred) <==> FALSE

E4: ALL r IN [I (pred) <==> TRUE

to reduce the number of dyadic terms in a curs when one or more matrix conjunctions query. Algorithms that minimize the num- of the standardized query can be shown to ber of rows in tableaus as introduced in be unsatisfiable [Eswaran et al. 1976; Klug Section 2.4 systematically exploit such sim- 1983; Munz et al. 1979; Ozsoyoglu and Yu plification rules for conjunctive queries 19801. For example, consider the expression [Aho et al. 1979a; Sagiv 1981, 19831. Since the number of rows in a tableau is one more than the number of joins (dyadic join r.A I s.B AND s.B > t.C AND t.C 2 r.A, terms) in the expression, the minimization of the number of rows corresponds to the which implies the contradiction, r.a > r.a, elimination of redundant joins. and can therefore be replaced by the Sagiv and Yannakakis [1980] exiend the Boolean value, false. Satisfiability can be tableau techniques to cover the simplifica- efficiently decided at compile time for a tion of expressions containing disjunctions. conjunction of terms with comparison op- The generalization to all expressions of a erators (=, <, I, >, 2) [Rosenkrantz and relationally complete language, however, is Hunt 19801, but is computationally intract- still an open problem. able when the nonequality comparison Information about semantic integrity operator is allowed. constraints can be exploited for “semantic” query simplification [Aho et al. 1979a, 3.3 Amelioration 1979b, 1979c; Jarke et al. 1984; Johnson and Klug 1982; Ott and Horlaender 1982; Query simplification does not necessarily Rosenthal and Reiner 19841. As a simple produce a unique expression. Other nonre- example, consider the case of key con- dundant expressions may exist that are se- straints. If r and r’ are (free or existentially mantically equivalent to the one generated quantified) variables ranging over the same by some simplification technique. The eval- relation “rel”, equijoin terms of the form uation of expressions corresponding to one “r.key = r ‘.key” are superfluous in the and the same query may differ substantially sense that the term and one of the varia- with respect to performance parameters, bles, say r’, can be deleted, followed by the for example, the size of intermediate results substitution of r’ by r in any term referring and the number of relation elements ac- to r’. This type of simplification is most cessed. Below, a number of query transfor- relevant in the context of processing, mation heuristics are presented that, when in which the reduction of user queries ad- applied to expressions, yield ameliorated dressing views to system queries addressing expressions with respect to evaluation per- stored relations may introduce substantial formance. redundancy. Its application range can be The simplest transformations considered extended by considering additional con- in this section are the combination of a straints implied by the query structure sequence of projections into a single projec- [Klug 1980; Koch et al. 19813. tion, and the combination of a sequence of A final opportunity for simplification oc- restrictions into a single restriction [Hall

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 125 1976; Smith and Chang 19751. The corre- able is detached and forms an inner nesting. sponding transformation rules are Detachment is performed recursively at any nesting level until the expression can- Al: Proj(. . . Proj(Proj(re1, Al) not be reduced further. Experiments re- AZ), . . . , AnI ported by Youssefi and Wong [1979] have PrT(re1, A,), shown this heuristic to be very strong. Ex- A2: Rest(. . . (Rest(Rest(re1, predi), ample 3.2 demonstrates the detachment of predd, . . . , pred,) a subexpression in a complex expression. Example 3.2. Departments offering lec- Rez(re1, predi AND predz AND . . . tures that are held by professors who live AND pred,) . in the same city where the department is The combination of intrarelational op- located and who have published some paper erations results in two advantages: First, in 1981. repetitive reading of the same relation is The corresponding expression is avoided, and second, existing access paths may be used for the combined operation, [EACH d IN departments: and not only for the first operation in the SOME 1 IN lectures sequence. SOME e IN employees (estatus = professor Minimization of the size of intermediate AND results to be constructed, stored, and re- ddname = ldname trieved is the goal of a number of amelio- AND l.enr = e.enr rating transformations. An important heu- AND e.city = dxity ristic moves selective operations, such as AND restriction and projection, over construc- SOME p IN papers tive operations, such as join and Cartesian (p.year = 1981 product, in order to perform the selective AND p.enr = e.enr))] operations as early as possible [Smith and An equivalent expression produced by Chang 19751. In the context of relational query detachment is calculus, the consideration of a certain evaluation sequence can be represented by [EACH d IN departments: a nested expression. The evaluation of a SOME 1 IN lectures nested expression starts with the evalua- SOME e IN [EACH e IN tion of the innermost nesting, followed by [EACH e IN employees: its surrounding nesting, and so on until the e.status = professor]: SOME p IN [EACH p IN outermost nesting is reached. A nested papers: p.year = 19811 expression implying the early evaluation of (e.enr = p.enr)] monadic terms (restrictions) is given in (d.dname = l.dname AND l.enr = e.enr Example 3.1. AND exity = d.city)] Example 3.1. A nested expression An object graph representing the query is equivalent to the expression in Example shown in Figure 4. 2.1. Note that the resulting nested expres- [ (e.ename) OF sion is irreducible [Goodman and Shmueli EACH e IN [EACH e IN employees: 19801; that is, it cannot be separated into estatus = professor]: two subexpressions overlapping on a single SOME p IN [EACH p IN papers: variable. In other words, the nested expres- p.year = 19811 sion contains a cycle (see Figure 4). (e.enr = p.enr)] The importance of the distinction be- The early evaluation of selective opera- tween cyclic and acyclic (treelike) expres- tions forms a special case of query detach- sions for query processing is discussed ment as introduced by Wong and Youssefi further in Section 4.3. At this point, we [ 19761. A subexpression that overlaps with mention only that there are cycles that can the rest of the expression on a single vari- be transformed into equivalent acyclic

Computing Surveys, Vol. 16, No. 2, June 1964 126 l M. Jarke and J. Koch

EACH d IN departments

l e.status=prof I l p.year=l991

Figure 4. Object graph for Example 3.2.

query graphs. Such cycles include those A5: ALL r IN rel: (NOT(predi) that (1) are introduced by transitivity OR br& [Bernstein and Chiu 1981; Yu and Ozsoy- oglu 19791, (2) contain certain combina- AZ r IN [EACH r IN rel: predl] tions of inequality join term edges [Bern- (pred2) . stein and Goodman 1981b; Ozsoyoglu and Yu 19801, (3) are “closed” by universally Note that transformation rule A5 for uni- quantified variables [Jarke and Koch versally quantified variables is especially 19831, and (4) contain variables that can be profitable since, through the reduction of decomposed by use of functional depend- the number of conjunctions in the outer encies [Kambayashi and Yoshikawa 19831. nesting, the intermediate results can be The concepts of extended range expres- expected to be considerably smaller in size. sions {Jarke and Schmidt 19821 and range The ameliorating transformations pre- nesting [Jarke and Koch 19831 provide a sented thus far use information from three generalization of query detachment in that sources: general transformation rules and they also consider expressions containing heuristics guiding their usage, knowledge universal quantification. Database rela- about the relational data structures, and tions defining the range of a relation vari- the query itself. Two other knowledge bases able are replaced by calculus expressions that have not yet been considered are the according to the following transformation integrity constraints that complement the rules: structural schema definition in many da- tabase systems and the actual data stored A3: [EACH r IN rel: predl AND predz] in the database. Integrity constraints are predicates that [ETCH r IN [EACH r IN rel: predi]: must be true for each element of a certain relation, or for each combination of ele- w&l, ments of a certain group of relations. They A4: SOME r IN rel(predi AND pred2) therefore can be added to the selection expression of any query without changing S&E r IN [EACH r IN rel: predi] its truth value. There are a few approaches (w&h exploiting this observation for amelioration

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems l 127 rather than simplification (which was men- sions, two-variable expressions, and multi- tioned in the previous subsection). They variable expressions. The individual ap- have been called knowledge-based [Ham- proaches can be viewed as the building mer and Zdonik 19801 or semantic query blocks of a general query evaluation system. processing [King 1979, 19811. Their associated costs and ranges of appli- Assume, for example, that an integrity cability constitute the input to the last constraint says: “We only hire professors stage of the query optimization process, who have published at least one paper per which generates the optimal access plan. year.” In this case, the evaluation of Ex- ample 2.1 (asking for professors with pa- 4.1 One-Variable Expressions pers published in 1981) becomes trivial, and the evaluation of Example 3.2 is substan- One-variable expressions describe condi- tially simplified. tions for the selection of elements from a Adding an integrity constraint to a selec- single relation. A naive approach to their tion expression can also change the struc- evaluation would be to read every element ture of the query in order to make it more of the relation, and to test whether it sat- tractable. Consider the constraint: “We isfies each term of the expression. Since only hire local professors.” In this case, the this approach is very costly in the presence term “d.city = e.city” in Example 3.2 can of large relations and complex expressions, be omitted. The remaining query no longer various techniques have been used to re- contains a cycle in its object graph. duce the number of element accesses and The success of semantic query processing the number of tests applied to an accessed depends largely on the development of ef- element. ficient heuristics for choosing among the The number of element accesses can be many transformations made possible by the reduced by employing data structures that addition of any combination of integrity provide access paths other than those of constraints to the query. In King [1981] exhaustive sequential search. One possibil- and Xu [ 19831, artificial intelligence type ity is to keep the relation sorted with re- rules are used to make this decision for a spect to one or more attributes so that it special class of relational databases. can be accessed in ascending or descending Yao [1979] points out that cases exist in order, or binary search can be performed. which the optimal transformation is data This has proved useful for the evaluation dependent. The heuristics presented above of range queries (i.e., expressions that de- may not always be optimal, especially when fine an interval of attribute values [Bolour certain access paths are supported by phys- 1981; Davis and Winslow 19821). Another alternative, hashing, provides fast direct ical data structures. One consequence of access without necessarily preserving order. such data dependence is that, in addition Direct and ordered access can be pro- to the query compiler, the run-time support vided by indexes, possibly combined with must also be equipped with query transfor- multilist structures [Welch and Graham mation facilities. Furthermore, if heuristics 1976; Yang 19771. Conceptually speaking, do not yield satisfactory results, simulta- an index is a binary relation that associates neous optimization is required on the phys- attribute values with references to relation ical and the logical level. Before turning to elements, usually called tuple identifiers such integrated approaches, however, the (TIDs). We distinguish one-dimensional in- physical evaluation of query components dexes, which support access via a single must be described. relation attribute, from multidimensional indexes, which support access via a combi- 4. QUERY EVALUATION nation of attributes. One-dimensional in- dexes are usually implemented by ISAM In this section we present methods for the [IBM 19661 or B-tree [Bayer and Mc- evaluation of query components of varying Creight 19721 structures. An overview of complexity, such as one-variable expres- multidimensional index structures is given

Computing Surveys, Vol. 16, No. 2, June 1964 128 l M. Jarke and J. Koch by Bentley and Friedman [ 19791. Some ex- ues [Hanani 19771, whereas others work amples include the work by Shneiderman without such assumptions [Breitbart and [1977] on combined indexes, Nievergelt et Reiter 19751. Warren [1981] applies a sim- al. [1984] on grid files, and Gardarin et al. ilar technique for optimizing database pro- [1984] on predicate trees. Although access grams expressed in logic. paths are usually invisible to the user, ef- forts have been reported to develop high- 4.2 Two-Variable Expressions level language representations directly available to those database programmers Two-variable expressions describe condi- who insist on extreme efficiency. Such lan- tions for the combination of elements from guage constructs range from TIDs [Jarke two relations. In general, two-variable and Schmidt 1981; van de Riet et al. 19811 expressions are composed of monadic to abstract representations of complete ac- terms, which restrict single variables inde- cess paths [Mall et al. 1984; Schmidt 1984; pendently of each other, and dyadic terms, Tsichritzis 19761. which establish the link between both var- The number of tests applied to an ac- iables. In this section we first describe the cessed relation element during expression basic methods for the evaluation of a single evaluation can be reduced by means of run- dyadic term, corresponding to the join op- time transformations of the expression. erator in Section 2.2, and then strategies The optimization of a special class of for the evaluation of arbitrary two-variable expressions, Boolean expressions, has been expressions. a research topic in compiler construction Approaches to the implementation of the for a long time [Gries 19711. Boolean join operation can be classified into order- expressions (i.e., quantifier-free AND/OR dependent and order-independent strate- connected terms) are an integral part of a gies [Todd 19741. A simple method that is number of control structures in high-level independent of the order of element access programming languages. The purpose of is the so-called nested iteration method code optimization for Boolean expressions [Pecherer 1975, 1976; Selinger et al. 19791 is to generate code that skips over the eval- in which every pair of relation elements is uation of expression components no longer accessed, and concatenated if the join con- relevant to the value of the expression as a dition is satisfied. A sketch of the algorithm whole. For example, in the statement follows: IF A AND B THEN FORi:=lTONlDO statement-l read zth element of r&; ELSE FORj:= lTON*DO statement-2 read jth element of relZ; END form the join according to the join concli- tion; the evaluation of term B is superfluous, and END; the ELSE-branch can be executed right END; away in case term A has already been eval- uated as “false”. If the same idea is applied Let N1(N2) be the number of elements of to the evaluation of one-variable expres- the relation read in the outer (inner) loop. sions in query languages [ Gudes and Reiter Ni + Ni * N2 secondary storage accesses 1973; Liu 19761, queries can be simplified are required to evaluate the dyadic term, at run time. assuming that each element access needs Another approach designed to improve one secondary storage access. evaluation efficiency is that of changing the The nested iteration method can be aug- order in which individual expression com- mented by the use of an index on the join ponents are evaluated. Several algorithms attribute(s) of “re12.” Instead of scanning are known to lead to optimal evaluation “re12” sequentially for each element of sequences in certain situations; some as- “re&,” the matching “re&” elements are sume a priori probabilities for attribute val- retrieved directly [Griffeth 1978; Klug

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems . 129 1982b]. Thus only Ni + Ni * Nz * j,, of dyadic terms. They differ in the way they accesses are required, where j,, is a join make use of or even create access paths, selectivity factor describing the reduction such as indexes and sorting, and the order of the Cartesian product of “reh” and “relz” in which the terms are processed. One such by the join condition. method is illustrated in Figure 5. It makes The nested block method [Kim 19801 extensive use of indexes. Tuple identifiers adapts the nested iteration method to a resulting from the processing of monadic paged-memory environment. It assumes terms and those that satisfy the join con- that a main memory buffer will hold one or dition are intersected and then used to ac- more pages of both relations, where each cess the relation elements. These elements page contains a set of records. are projected onto attributes appearing in The algorithm itself is basically identical the dyadic term and in the target list. The to the one of the nested iteration method, projected elements are concatenated and except that memory pages are read instead finally projected on the target attribute. of single relation elements. The number of Blasgen and Eswaran [1976], Niebuhr secondary storage accesses needed to form and Smith [1976], Yao and DeJong [1978], the join is reduced bo P, + (Pi/&) * Pz, and Yao [1979] present various other al- where P1(P2) is the number of pages occu- gorithms and compare them systematically pied by the outer (inner) relation and B1 is with respect to their efficiency. Their re- the number of pages of the outer relation sults demonstrate that often no a priori held in the main memory buffer. The for- best algorithm exists. The optimizer must mula demonstrates that it is always pref- either rely on heuristics or perform an ex- erable to read the smaller relation in the pensive cost comparison of many alterna- outer loop (i.e., make Pi < I’*). Note that tives for each query. only PI -I- Pz accesses are necessary if one An important case of two-variable ex- of the relations can be kept entirely in the pressions is a join in which one of the main memory buffer. participating variables (the “inner” one) is The merge method [Blasgen and Eswaran existentially or universally quantified. The 1977; Selinger et al. 19791 is based on the result of such an expression contains only order in which relation elements are ac- elements of one relation. Furthermore, ac- cessed. Both relations are scanned in as- cess to the other relation needs to establish cending or descending order of join attri- only whether, for a given value of the bute values and merged according to the “outer” variable, the join condition is sat- join condition. Approximately N1 + Nz + isfied with any (respectively all) elements S1 + Sp secondary storage accesses are re- of the range relation of the “inner” variable. quired, where S1 and Sa denote the number These elements themselves are of no inter- of secondary storage accesses necessary to est. This means that for the evaluation of sort the relations. If the two relations are quantified queries intermediate results can already sorted by the same criterion, the be represented in a compressed fashion merge method appears to be the most effi- [Dayal 1983a; Jarke and Schmidt 19811. If cient one for evaluating a dyadic term the two-variable expression contains just [Merrett 1981; Merrett et al. 19811. Excep- one join term with a comparison operator tions occur when one of the two relations of <, 5, >, or 2, only one attribute value is small enough to fit in main memory (the (the minimum or maximum attribute value nested block method is preferable) or one appearing in the inner relation) is required. relation is much larger than the other and That is, the quantified two-variable expres- indexes are available (nested iteration with sion can be converted into a monadic term indexes is better). [ Jarke and Koch 19831. Incidentally, a sim- Methods for the evaluation of arbitrary ilar use of aggregate functions (maximum two-variable expressions are created on the or minimum) has also been proposed in the basis of strategies for one-variable expres- context of integrity maintenance [Bern- sions and algorithms for the computation stein et al. 19801.

Computing Surveys, Vol. 16, No. 2, June 1984 130 l M. Jarke and J. Koch

t

employees-enr-ind

papers-enr-ind year D= 1981 . t papers-year-ind Figure 5. Operator graph illustrating the evaluation of the query of Example 2.1. The existence of various indexes is assumed.

4.3 Multivariable Expressions data can be avoided by simultaneous eval- uation of multiple-query components. In Stategies for the evaluation of multivaria- Palermo [19’72], all monadic terms associ- ble expressions are the largest building ated with a variable are completely evalu- blocks for a general query-processing sys- ated and all dyadic terms in which the same tem. Two basic approaches exist, which are variable appears are partially processed referred to as parallel processing and step- concurrently with scanning of the range wise reduction. relation of the variable. Even AND-connec- The parallel processing of query compo- tions existing among the terms can be eval- nents serves to avoid repeated access to the uated in parallel, which further reduces the same data. Repeated access to the same size of intermediate results [Jarke and

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems 131 Schmidt 19821. A similar approach on a d.dname 0 higher level has been described by Klug [1982b] in which aggregate functions and complex subqueries are computed in par- EACH d IN allel. Scheduling strategies for the parallel departments processing of query components are dis- cussed by Schmidt [ 19791. Pipelining operations that can work on the partial output of preceding operations is another technique that exploits paral- lel processing opportunities [Smith and Chang 1975; Yao 19791. For example, re- e.statu8=prof. I . dayt ime>8pm. striction and projection can be pipelined so that only a relatively small buffer for data Figure 6. Object graph for Example 4.1. exchange is required rather than the crea- tion and subsequent reading of a possibly large temporary relation. Aspects of simultaneous evaluation and The second basic approach to the evalu- pipelining are combined in the so-called ation of multivariable expressions will be “feedback” method [Clausen 1980; Rothnie motivated by means of the following ex- 1974, 19751, which uses partial results of a ample. join operation in order to restrict its input. The degree to which this can be done de- Example 4.1. Figure 6 shows an object - pends on the quantification of variables graph representing the expression occurring in the join term. For example, [ (ddname) OF consider the expression EACH d IN departments: SOME e IN employees(e.status= professor [EACH rl IN reli: AND ALL r2 IN re12 (rI.A op r2.B)]. e.city = d.city) AND Assume that the join term is evaluated by SOME 1 IN lectures (ldaytime > 8 pm nested iteration. While testing some ele- AND ment, rl, it is found that the term “q.A op l.dname = d.dname)] r2.B” evaluates to false for a certain r2 with r2.B = cl. Because of the universal quanti- Expressions like the one in Example 4.1 fication of r2, rl is rejected, and an elimi- are called tree expressions [Goodman and nation filter can be added to the selection Shmueli 1982; Shmueli 19811, since their expression associated query graph is a tree. The stand- ard approach for evaluating such an expres- [EACH rl IN reli: sion would be to form the join of the three NOT (rI.A op ci) relations, restrict the intermediate result AND ALL r2 IN re12 (rI.A op r2.B )] according to monadic terms, and finally project it onto the attributes appearing in because the same rz would cause the rejec- the target list. As has been shown in the tion of all elements rl of “reli” that do not introductory example (Section 1.2, Strat- satisfy the first term. On the other hand, if egy 2), this method performs rather poorly. rl with rI.A = c2 passes the test, a true filter It is even more problematic in a distributed can be added to the selection predicate: environment where each relation resides on [EACH rl IN reli: a different site, as entire relations must be rI.A op c2 OR transmitted from site to site. Moreover, the relation at the target site is temporarily NOT (rl.A op cl) AND . . -1. expanded through the formation of the join, Both filters can be updated subsequently although the final result is only a horizontal to sharpen the constraints. and vertical subset of it.

Computing Surveys, Vol. 16, No. 2, June 1964 132 . M. Jarke and J. Koch

d.dname 0

d _ dname = \ 1 .dname Figure 7. Object graph for Example 4.2.

In Bernstein and Chiu [1981], Bernstein expression, one semijoin per edge is exe- et al. [1981], and Bernstein and Goodman cuted in breadth-first leaf-to-root order. [1981a, 1981b, 1981c], the stepwise reduc- Thus a tree expression containing one free tion of tree expressions (with free and ex- and n - 1 existentially quantified variables istentially quantified variables) has been can be completely processed by n - 1 semi- introduced. This method often outperforms joins. the simple approach above in a decentral- If all variables are free, an additional ized as well as in a centralized setting. It is semijoin sequence in reverse order must be based on a modified join operation, and performed. In a centralized setting, this uses the so-called semijoin operator. kind of semijoin strategy is inferior to a The semijoin of a relation “reh” by a merge join operation combined with paral- relation “relp” equals the join of these re- lel quantifier evaluation. Even in a distrib- lations projected back onto the attributes uted system, the simple bottom-up/top- of relation “reli”: down sequence is often less efficient than a middle-out sequence that begins with semijoin(reh, A op B, relz) semijoins that achieve very strong reduc- = proj( join(re&, A op B, relz), tion [Chiu and Ho 1980; Chiu et al. 19811. attributes(reh)). Strategies for the evaluation of tree The operation thus forms “half of a join.” expressions containing both existential and The main advantages of semijoin over join universal quantifiers must take into ac- are as follows: (1) its evaluation only re- count the order in which these quantifiers quires the transmission of value lists of the appear in the expression. Stepwise reduc- joining attributes instead of that of an en- tion is possible only when the processing of tire relation, and (2) it has a “reductive” the edges of the query tree (breadth-first effect since the result of semijoin (reh, A leaf-to-root) corresponds to the order of the op B, relz) is always a subset of re12, whereas quantifiers in the expression (right to left). a join may produce a Cartesian product in the worst case. In terms of relational cal- Example 4.2. Consider the query tree of culus, a semijoin corresponds to a two- Figure ‘7 representing the expression variable expression, where one variable is [ (ddname) OF existentially quantified, as discussed in EACH d IN departments: Section 4.2. ALL p IN papers The simplest evaluation of a tree expres- SOME e IN employees sion by means of stepwise reduction can be (p.enr = e.enr AND e.city = d.city) described as follows. Starting from the AND leaves of the query tree representing the SOME 1 IN lectures (ldname = ddname)]

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems l 133 d.dname 0

EACH d IN departments Figure 8. A cyclic calculus expres- sion and its corresponding object graph.

SOME e IN e.enr=l .enr SOME 1 IN

. l e.status=prof 1. dayt ime>Spm

departments dname city street address

employees 1 ei / en’i s~:~s ~=~~~, Figure 9. Some possible database State.

lecturea 1 CII / eII / dam / da;;;me 1

Processing’the tree in breadth-first leaf-to- Cyclic expressions are the complement of root order would yield the value of the tree expressions with respect to the entire expression set of expressions. Although we quoted [ (ddname) OF some benign exceptions in Section 3.3, EACH d IN departments: cyclic expressions in general cannot be fully SOME e IN employees reduced by means of semijoins [Bernstein ALL p IN papers and Goodman 1981a, 1981b; Goodman and (p.enr = e.enr AND e.city = deity) Shmueli 19821. AND SOME 1 IN lectures (ldname = ddnarne)] Example 4.3. Consider the query: “names of departments that offer lectures which is not equivalent to the original after 8 pm given by professors who live in expression. the city where the department is located”. The position of an existential and a uni- The corresponding relational calculus ex- versal quantifier cannot be interchanged pression and a query graph are shown in without changing the meaning of the Figure 8. expression, except in cases where rules Ql If the database is in the state shown in through Q4 of Table 1 apply. Expressions Figure 9, no sequence of semijoins (corre- containing only one type of quantifier allow sponding to edges of the query graph of the sequence of variables to be inter- Figure 8) will produce the correct result changed arbitrarily, according to transfor- (the empty relation). The reason is that the mation rules Q5 and Q6. Jarke and Koch semijoin technique only considers one edge [1983] describe a directed, so-called “quant at a time, and thus loses restrictive condi- graph” that supports the application of tions introduced through the feedback ef- these rules. fect of the cycle:

Computing Surveys, Vol. 16, No. 2, June 1964 134 l M. Jarke and J. Koch

d.dname 0

EACH d IN departments

d.dname=l .dname Figure 10. Augmented object d.city=l.city graphfor Example4.3.

SOME e IN SOME 1 IN

e.city=l .city

l l e.status=prot 1. dayt ime>Epm

[ (ddname) OF sented in this section, are candidates for EACH d IN departments: hardware components in specialized data- SOME e IN employees base machines. However, such components (e.status = professor often allow parallelism and therefore re- AND quire somewhat different join and semijoin SOME 1 IN lectures (l&time > 8 pm algorithms [Bitton et al. 1983; Maekawa AND 1982; Missikoff and Scholl 1983; Valduriez ddname = l.dname AND l.enr = e.enr 1982; Valduriez and Gardarin 19841. A brief AND e.city = d.city))] survey of hardware approaches to query optimization is presented in Section 6.3. In Kambayashi et al. [1982] a proposal is made that is intended to generalize the 5. ACCESS PLANS applicability of the semijoin technique to In the previous section we dealt with tech- cyclic expressions. The overall idea is to niques for the efficient evaluation of query transform the cyclic query graph into a tree components that can be used as building by adding appropriate terms to edges of the blocks of a general query evaluation algo- graph. Figure 10 demonstrates the tech- rithm. The final step of our query evalua- nique applied to the cyclic expression of tion framework requires the combination Example 4.3. of these blocks into an efficient evaluation The additional terms “d.city = Lcity” and procedure for an arbitrary standardized, “l.city = e.city” imply the condition “d.city simplified, and ameliorated expression. Un- = e.city” by transitivity. Thus the resulting fortunately, work in this area is still incom- graph is equivalent to a chain query, a plete. special form of tree query. Note that addi- The inputs of such a procedure are a tion of the new terms corresponds basically logically preprocessed query (as described to addition of the city attribute to the in Section 3), the existing storage struc- schema of the lectures relation (initialized tures and access paths, and a cost model. with values). The output is an optimal (or at least heur- The query tree then is processed by step- istically “good”) access plan. The procedure wise reduction, executing a generalized consists of the following steps. semijoin for each edge while taking into account the newly introduced attributes. (1) Generate all reasonable logical access The number of data transfers are reduced plans for evaluating the query. A logical by means of specialized compression tech- access plan describes a sequence of opera- niques. tions or of intermediate results leading Methods for the efficient implementa- from existing relations to the final result of tion of operations, such as the ones pre- a query.

Computing Surveys, Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 135 (2) Augment the logical access plans by this concept places too much emphasis on details of the physical representation of the user-specified block structure of the data (sort orders, existence of physical ac- query and therefore introduces query cess paths, statistical information). standardization steps into SQL query pro- (3) Choose the cheapest access plan by ap- cessing. plying a model of access and processing A similar compromise was chosen in costs. INGRES [Youssefi and Wong 19791. The heuristic decomposition approach reduces In this section we review the generation a query to a set of subqueries containing at of access plans, cost models for their eval- most two variables. For each of these uation, and the problem of selecting the subqueries a more detailed analysis of its cheapest plan. The quality of the final so- optimal implementation is performed. lution plan is strongly influenced by the A comprehensive procedure for generat- existing storage structures and access ing access plans to solve conjunctive quer- paths, which usually cannot be optimized ies without universal quantifiers and aggre- for a single ad hoc query. In Section 5.4 we gates has been proposed by Rosenthal and therefore briefly consider the simultaneous Reiner [1982]. An expanded object graph optimization of multiple queries. representation is used for modeling evalu- ation strategies that exploit auxiliary direct 5.1 Generation of Access Plans access structures. To avoid the generation of an excessive number of strategies to be Access plans describe sequences of opera- generated, the generation of access plans is tions (represented by operator graphs) or interleaved with the selection step: Strate- intermediate results (object graphs) leading gies known to be infeasible or dominated from the existing data structures to a query by other procedures are not created. result. The query optimizer should generate a set of plans rich enough to contain the 5.2 Cost Analysis of Access Plans optimal plan but small enough to keep the optimization effort acceptable. The selection of physical access plans is Two extreme approaches are exemplified determined by heuristic rules or is based on by Smith and Chang [1975] and Yao a cost model of storage structures and ac- [1929]. Smith and Chang use a rigid set of cess operations [Merrett 19771. In this sec- “automatic programming” query transfor- tion, cost models and their integration into mation rules similar to the ones discussed optimization procedures are reviewed. in Section 3. Their procedure generates Whereas a few researchers consider exactly one access plan, which need not be working storage requirements [Kim 1982; optimal. Yao’s method generates all non- Lang et al. 1977; Sacco and Schkolnick dominated access plans possible in a given 19821 or CPU costs [Gotlieb 1975; Selinger physical environment. While this may be et al. 19791, most cost models are based on feasible in the context of two-variable quer- the number of secondary storage accesses. ies, it becomes prohibitively costly for very For a given operation, this figure is influ- complex queries. enced by the size of its operands, the access Other approaches seek a compromise be- structures used, and the size of main mem- tween heuristic selection and detailed gen- ory buffers. eration of alternative access plans. For ex- At the beginning of the evaluation, the ample, System R [Chamberlin et al. 1981; operands are existing data structures of Selinger et al. 19791 applies a hierarchical known size, such as relations or indexes. In procedure based on the nested block con- later stages, however, most operands are cept of SQL. On the lower level, evaluation results of preceding operations, and the plans for each query block are generated cost model must estimate their size by using and compared. On the upper level, the se- information about the original data struc- quence in which the query blocks are eval- tures and the selectivity of the operations uated is determined. Kim [ 19821 notes that already performed on them. Although there

Computing Surveys, Vol. 16, No. 2, June 1964 136 l M. Jarke and J. Koch is much ongoing work, no generally ac- onstrates that it is sufficient to know the cepted formulas for estimating the size of size of all projections in the database if the intermediate results have evolved thus far. attribute value distributions are uniform This results in part from the fact that the and independent both within an attribute trade-off between the amount of informa- and between attributes of the same domain. tion used and the accuracy of the result is Christodoulakis [ 1981, 19831 and Mont- not very well understood. gomery et al. 119831 have critically reviewed In general, the more restrictive the as- such assumptions and have proved that sumptions about the data, the fewer are the they lead to a bias against direct-access parameters needed to compute the size of structures in selection plans. However, no the results of operations. For example, in practical formulas with more general as- Demolombe [ 19801 a recursive procedure sumptions, but without excessive data re- for estimating the size of a quantifier-free quirements, have yet been published. calculus expression is described, for which The final cost measure is the number of five types of parameters must be known if secondary storage accesses, not the sizes of fairly detailed database statistics are avail- intermediate results. A large number of re- able. Under more restrictive assumptions, searchers estimate the relationship be- however, only three of them are needed. tween the two figures [Cardenas 1975; The size estimates given by Selinger et al. Chan and Niamir 1982; Cheung 198213; Luk [1979] use only information already exist- 1983; Whang et al. 1983; Yao 1977a; Yu et ing in the database but make many as- al. 19781. In essence, it depends on the sumptions about data and queries [Astra- physical storage structures involved and han et al. 19801. At the other extreme, Yao the proportion of elements to be accessed. [ 19791 assumes (implicitly) that detailed Assume first that all elements of an op- selectivity data are known; no statement is erand of size N have to be accessed. The made as to how these data are obtained. optimal number of secondary storage ac- (See, however, Yao [1977b] for an access cesses would then be N/B, where B is the cost model.) blocking factor of the operand. This can be More recently researchers recognized the achieved only if the elements are stored need to carefully state and critically review densely and it is clear from the beginning all underlying assumptions about database on which physical records the elements re- characteristics to generate formally valid side. As a counterexample, the so-called parameter systems that allow one to “segment scan” of System R requires access to a superset of the necessary pages to find (1) compute a size estimate for any feasible all elements of a relation [Selinger et al. operation, and 1979 1. If it is necessary to read the elements (2) compute parameter values for inter- in some predetermined sequence, they must mediate results required for further op- not only be stored densely but also sorted erations. by the given reading order. If direct access to a subset of the elements Such techniques view the database state at is used, the number of secondary storage run time as the result of a random process accesses required to retrieve n of the N that generates relation elements from the elements will depend on the clustering of Cartesian product of the attribute domains, elements in physical blocks. Most of the governed by some probability distribution models cited above assume random place- (usually assumed to be uniform) and by ment of records on pages, which in some general laws (e.g., functional dependencies) sense describes a worst case [Christodou- of the database schema [Gelenbe and Gardy lakis 19811. Optimal clustering can reduce 1982; Richard 19811. From these assump- the number of pages to be accessed to n/B. tions, parameters are derived whose values In conclusion, the traditional assump- must be known in order to compute the size tions about value distributions and element of any intermediate result of complex op- placements tend to overestimate costs and erations. For example, Richard [ 19811 dem- thus to bias cost estimates against the use

Computing Surveys, Vol. 16, No. 2, June 1964 , Query Optimization in Database Systems 137 of direct-access structures. On the other ting stuck in local optima if no look ahead hand, more sophisticated techniques re- is applied. However, if used carefully, the quire more statistical information about method can reduce the evaluation costs for the database. The question of how to keep queries, in which the sizes of actual inter- such information up-to-date is not yet fully mediate results differ from the expected resolved. sizes.

5.3 Selection of Access Plans 5.4 Support for Multiple Queries How are the cost estimates used in query All of the query evaluation procedures con- optimization? As mentioned in Section 5.2, sidered thus far concentrate on optimizing there are heuristic procedures that do not the evaluation of a single query. Chesnais use them at all. Other approaches combine et al. [ 19831 have also investigated the per- heuristic reduction of choices with enumer- formance effect of multiple users accessing ative cost minimization in the “end game” a database in parallel. However, query op- [Youssefi and Wong 19791. Experiments timization strategies can even go beyond indicate that combinatorial analysis can simple parallelism by sharing the execution improve database performance consider- costs of common operations among queries. ably [Epstein and Stonebraker 19801. Additionally, a strategy that optimizes the There are two ways to utilize cost esti- evaluation of multiple queries simultane- mates in the selection process. First, the ously can consider investments in addi- costs of each alternative access plan can be tional access paths, the creation of which determined completely [Blasgen and Es- would not be cost effective for a single waran 1976; Yao 19791. This approach has query. The few existing approaches to mul- the advantage of covering techniques like tiple-query optimization can be classified parallelism or feedback in a realistic way. in three groups, according to the time scope On the other hand, the optimization effort for which decisions are made: (1) simulta- is high. neous optimization of batched queries, (2) Second, the cost of strategies can be com- index selection, and (3) physical database puted incrementally in parallel to their gen- design. eration. This approach allows whole fami- A set of queries submitted by one or more lies of strategies with common parts to be users at approximately the same time can evaluated in parallel, which considerably be batched for more efficient evaluation reduces the optimization costs. For exam- [Shneiderman and Goodman 19761. The ple, the method proposed by Rosenthal and techniques for batched evaluation are sim- Reiner [ 19821 retains only the cost minimal ilar to those described in Section 4.3 for way to obtain each intermediate result, and multivariable expressions. First, results of discards any other method as soon as its common subexpressions can be shared nonoptimality is detected. among queries [Grant and Minker 1981; An extension of this second approach is a dynamic query optimization procedure, Jarke 19841, and subexpressions accessing which derives from the observation that, at the same physical data page can do so with each moment, only the next operation to one secondary storage access. Second, tem- be performed has to be decided. To guar- porary physical access ,paths such as sort- antee overall optimality, only the conse- ing, hashing, or indexes can be provided quences of this decision for the remainder whose costs pay off for the batch as a whole. of the evaluation must be evaluated. A dy- Finally, results of some queries can be re- namic procedure has actual information tained for processing subsequent queries about all its operands, including interme- [Finkelstein 1982; Hevner and Yao 19811. diate results. This information can also be There seems to be no coherent theory in used to update the estimates of the remain- this area yet. Kim [1981, 19841 and Jarke ing steps. The dynamic method has two [ 19841 present language constructs andpre- drawbacks: its cost and the danger of get- liminary architectures, and a number of

Computing Surveys, Vol. 16, No. 2, June 1984 138 l M. Jarke and J. Koch ongoing research projects are described in is referred to the cited literature for further _ IEEE [ 19821. details. Many of the examples in this paper have demonstrated the importance of using in- dexes for the performance of query evalua- 6.1 Higher Level Queries tion algorithms. From this viewpoint, in- dexes can hardly hurt anywhere but are Queries expressed in a relationally com- most profitable if they are very selective plete language retrieve such relation ele- and support access to attributes frequently ments (or sets of elements) from a database referred to in queries [Gilles and Schuster that can be described by a predicate of the 19751. However, index selection must also relational calculus. Whereas this method is take into account altering transactions be- sufficient for most transaction-oriented cause they must change the index in addi- business applications, certain other appli- tion to the base data. The index selection cations may require more complex data ob- problem has been described in several sur- jects or more powerful query predicates. veys [Batory 19821 and tutorial papers These requirements can be characterized [March 1983; Putkonen 1979; Schkolnick as language extensions that yield query lan- 1975; Severance and Carlis 19771. Selection guages more powerful than the relationally and maintenance of more general indexes complete ones. Optimization techniques for that support user views have been investi- such extensions are addressed in this sec- gated by Roussopoulos [1982a, 1982b]. The tion. use of high-level language constructs for Hierarchical and network database sys- extended view concepts has been discussed tems support data objects that are more by Jarke [1984] and Schmidt [1984]. complex than flat records and thus contain Finally, query optimization influences information-bearing access paths [Astra- the physical . However, han and Ghosh 19741. Relational interfaces long-term query optimization is only one of to such systems therefore require efficient many aspects of physical database design translation of relational queries to naviga- (see, e.g., Carlis et al. [1981], Carlson and tional access programs. (The reverse prob- Kaplan [1976], Schkolnick [1982], and lem-program conversion from network Teorey and Fry [1982]). Important design code to relational queries-has also been strategies with an impact on query process- studied [Katz and Wong 19821.) Several ing efficiency include the horizontal clus- approaches have been described: interpre- tering of relation elements by attribute tative, translational, and view processing. values [Salton 19781 and the vertical par- Interpretative models (e.g., Zaniolo [1979]) directly interpret relational queries titioning of attributes by frequency of com- as sequences of tuple-at-a-time operations bined access [Hammer and Niamir 19791. on a network database. Similarly, direct translation methods (e.g., Vassiliou and Lo- chovsky [1980]) frequently do not address 6. NONSTANDARD QUERY OPTIMIZATION optimization; such methods may yield quite We have described query optimization in inefficient code. Dayal and his co-workers the framework of relational calculus queries [Dayal et al. 1981; Dayal and Goodman in centralized database systems. While this 19821 and Gray [1981, 19841 have devel- approach covers much of the work done in oped query optimization strategies for net- the area, some query-processing problems work databases. Alternatively, one can rep- exceed the framework either because of a resent network structures as relational query complexity that goes beyond rela- views with particular integrity constraints tional completeness or as a result of the [Rosenthal and Reiner 19841. In this way, structure of the underlying physical data- existing relational view optimization facil- base. Without claiming completeness, in ities can be used for compiling an efficient the following sections we briefly survey navigational program working on the net- some important developments. The reader work database.

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems 139 Applications in computer-aided design Database applications in artificial intel- and manufacture and text processing usu- ligence, especially expert systems [Nau ally work on even more complex objects 19831 and natural language user interfaces composed of related elements from differ- [Marburger and Nebel 1983; Sagalowicz ent relations. On top of a traditional rela- 19771, require inferences to be performed tional (or network) database, one can de- over the raw data coming from the database fine structures that allow access to either [Buneman 1979; Chang 1979; Minker 1975, the whole object or to part of it, using 19781. Whereas parts of such an inference multiple views of the data [Johnston et al. mechanism can be provided by traditional 19831. Another approach is an extension of view mechanisms with some additional op- the by nonfirst normal timization [Jarke et al. 1984; Ott 1977; Ott form relations [Lamersdorf 1984; Schek and Horlaender 1982; Paige 19821, more and Pistor 19821. Such extensions support complex requests-use of recursion, for access to substructures by specialized in- example-must be treated in a different dexing schemes or use pattern-searching way. mechanisms similar to those used in infor- A number of alternative architectures for mation retrieval [Davis and Kunii 19821. coupling expert systems with DBMSs are Dayal [ 1983b] investigates the related presented by Jarke and Vassiliou [ 19841. A problem of answering queries in generali- request to the expert system usually trans- zation hierarchies, which has to take into lates to a sequence of related database calls. account the fact that data objects inherit Optimization techniques include the com- properties of more general objects. bination of multiple tuple-oriented data- A relational calculus query retrieves a set base calls into set-oriented operations [Ku- of data as they come from the database, but nifuji and Yokota 1982; Vassiliou et al. does not support grouping and abstracting 19841, the simplification of such retrieval from the stored data to more complex data requests [Grishman 1978; Jarke et al. 1984; objects within the query language. The Reiter 19781, the reordering of conditions grouping of and summarizing over elements to be tested [Warren 1981], and the sharing of the same relation by certain so-called of intermediate results [Grant and Minker category attributes lies in the domain of 19811, which is particularly useful in exe- statistical query processing. Some query cuting recursive database calls [Chang languages offer a limited set of built-in 1978; Henschen and Naqvi 1984; Kellogg aggregate functions to support such appli- 1982; Minker and Nicolas 19831. A tool cations. Aggregate functions can be defined often proposed in this context is the logic as extensions to either the relational cal- programming language Prolog [ Kowalski culus [Klug 1982a] or the relational algebra 1981 J. Parsaye [ 19831 proposes extensions [Ozsoyoglu and Ozsoyoglu 19831, and can to Prolog specifically designed for database be evaluated by using nested iteration with and knowledge-base management. indexes as described in Section 4 [Klug The language extensions presented in 1982131. However, many statistical data- this section can be characterized theoreti- bases are characterized by high redundancy cally by their expressive power and by and a large percentage of null values. the difficulty of their evaluation. Chandra Therefore the data must be compressed and and Hare1 [1982a, 1982b] analyze several stored in ways that differ from standard classes of higher level query languages, such relational databases. An overview of these as (in order of increasing language power issues can be found in Shoshani [1982]. and computational complexity) first-order Special software [Eggers and Shoshani relational calculus queries, Horn clause 19801 and hardware [Bancilhon et al. 1982; queries, fix-point queries, second-order Hawthorn 19821 have been devised for queries, and general computable queries. processing queries on such databases. The user of a traditional query language Walker [1980] discusses the use of small has to achieve the power of such high-level abstracted databases in decision support languages through the use of general pro- systems. gramming language constructs in a data-

Computing Surveys, Vol. 16, No. 2, June 1984 140 l M. Jarke and J. Koch base programming language [Schmidt demonstrated by Tanenbaum [1981] and 19841. The disadvantage of this solution, Gavish and Segev [1982]. Within this gen- besides the increased programming effort, eral strategy, the most important tactical is that the responsibility for efficient im- objectives are the maximal reduction of plementation shifts from the database data to be transmitted by local preprocess- management system to the user. ing [Forker 19821, and the selection of the site(s) where the global operations are per- formed (e.g., Bernstein et al. [1981] and 0.2 Distributed Databases Ceri and Pelagatti [1982]). If data are rep- In a fragments of one licated, there is a choice of which copy to logical database (the database as seen by use in order to minimize data transfer [Ull- the user) reside on several physical data- man 1982; Williams et al. 19821. bases, each accessed by a separate com- A large number of distributed query pro- puter. Databases can be distributed for cessing strategies have been devised. The higher availability of data (through data choice among such strategies is influenced replication [Andler et al. 1982]), improved by several factors. accessibility by the most frequent users (through local storage of data [Wong 6.2.1 Trade-Off between Communication 1983]), and increased execution speed of and Local Processing Costs queries (through parallel processing of If the communication lines are relatively fragments [Su and Mikkilineni 1982; slow, communication costs usually domi- Wong and Katz 19831). Examples of dis- nate the costs of distributed evaluation to tributed database systems include Distribu- the extent that all other costs become ted INGRES [Stonebraker and Neuhold negligible. Some of the more theoretical 19771, POREL [Neuhold and Biller 19771, models (e.g., Apers et al. [1983], Hevner SIRIUS [Esculier and Clorieux 19791, [1979], and Muthuswamy and Kerschberg VDN [Munz 19791, SDD-1 [Bernstein et al. [1983]) ignore local processing costs alto- 19811, and R* [Williams et al. 19821. gether. In practice [Bernstein et al. 1981; The fact that data are physically dis- Smith et al. 19811 there will be a two-level persed and may be replicated strongly in- optimization procedure that first plans the fluences database design [Chen and Akoka global strategy and then develops an opti- 19801, [Bernstein and mal subquery evaluation plan for each local Goodman 1981c], and query processing. site. Rothnie and Goodman [ 19771 give an over- As the communication speed increases, view of the major research issues. local processing costs have to be taken into If data are distributed, the cost of data account [Chu and Hurley 1982; Gouda and transfer becomes a decision variable rather Dayal 19811: No longer can an arbitrary than a constant in the query optimization amount of local preprocessing be justified problem. Since communication costs tend to reduce data transfer between sites. to dominate local processing costs, query Kerschberg et al. [1982] report experience optimization requires a completely differ- with a two-computer network that demon- ent objective function: the minimization of strates the relative importance of local communication delay, often represented by processing. Of course, simultaneous min- the amount of data transmitted from one imization of data transfer and local pro- site to another. cessing increases the number of alterna- The primary decision influencing data tives to be compared considerably since the transfer is the selection of the site(s) where problem no longer can be decomposed into comput.ations are performed. The general a hierarchy of independent subproblems. strategy is to distribute the evaluation of the query rather than collect all the data 6.2.2 Network Structure and execute the query at one site (e.g., where the query was issued). The benefits The topology of the computer network on of this approach have been impressively which the distributed database is imple-

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems l 141 mented has an impact on the complexity of fragments. Work on general nondisjoint query optimization. In networks with arbi- fragments has just begun [Maier and Ull- trary message routing, queuing delays at man 19831. intermediate nodes play an important role The main focus of the existing literature [Muthuswamy and Kerschberg 1983; Tan- is on vertically distributed databases. In enbaum 19811. However, they can be con- this case, restriction and projection can be sidered only within a global optimization of performed locally; research has therefore all network traffic. To avoid this compli- concentrated on the optimal implementa- cation, many optimization algorithms (e.g., tion and scheduling of joins and other mul- Hevner and Yao [19’79]) assume a fully tirelation operations. The main tool used connected network with node-independent in this context is the semijohn operation communication delays. Another simple described in Section 4.3. As discussed, tree structure is a star computer network queries can be completely solved by using [Kerschberg et al. 19821, in which a major semijoins. For the special case of chain central processor is connected to several queries, an efficient algorithm exists that (usually smaller) local processors. computes the optimal semijoin schedule The network structure can be homoge- when the reduction factors of each semijoin neous (i.e., it can consist of processors of are known [Chiu et al. 19811. Similar algo- the same type) or heterogeneous. For a rithms for general tree queries are given by homogeneous network (equal processor Chiu and Ho [1980] and Gouda and Dayal speed, processor-independent linear com- [ 19811. However, owing to the computa- munication costs), Chu and Hurley [1982] tional complexity of exact methods, heuris- have proved a number of theorems, the tic procedures are normally preferred application of which restricts the search [Bernstein et al. 1981; Chang 1982; Cheung space for optimal solutions. For instance, 1982a; Yu and Chang 19831. the local preprocessing of monadic opera- Some early algorithms do not use semi- tions (restriction, projection) is shown to joins. Wong [1977] employs a hill-climbing be globally optimal under these assump- procedure to improve incrementally on an tions. Even within a homogeneous com- initial solution that processes the complete puter network, a heterogeneous distributed query ‘at one site. Epstein et al. [1978] database may exist if the local databases present a model for minimizing processing work on different data models or DBMS time or network traffic separately for each [Chan et al. 1983; Smith et al. 19811. operation in a query. No global optimality is guaranteed. The optimization algorithm 6.2.3 Data Distribution Strategy for R* (a distributed extension of System R [Williams et al. 19821) performs an ex- A relation can be physically partitioned haustive search over combinations of sev- horizontally (the set of tuples is divided eral alternative join strategies in multijoin into subsets according to the pattern of queries. Within its search space and the local reference), vertically (the set of attri- limits of data estimation, an optimal solu- butes is partitioned according to different tion is found [Daniels 1982; Daniels et al. applications at different locations), or into 1982; Ng 1982; Selinger and Adiba 19701. arbitrary fragments, with or without over- lap among them [Mahmoud et al. 19791. As in the centralized case, many distrib- Gavish and Segev [1983] describe an al- uted query evaluation algorithms have been gorithm for optimizing set operations in criticized for either assuming too much sta- horizontally partitioned relational data- tistical information to be practical or for bases. The problem is proven to be com- making unrealistic simplifications. Mu- putationally untractable, and heuristics for thuswamy and Kerschberg [1983] describe its solution are developed. Ullman [1982] a procedure for obtaining detailed database describes the use of “guard conditions” for statistics by observing database usage. simplifying queries on horizontally distrib- The computational complexity of the dis- uted databases by identifying the relevant tributed query optimization problem has

Computing Surveys, Vol. 16, No. 2, June 1964 142 l M. Jarke and J. Koch lead to the same phenomenon that we ob- uation of restrictive operations and parallel served for centralized queries: One group of processing, by placing on-board logic as researchers is looking for optimal solutions close as possible to the base data and divid- in special cases, while another employs gen- ing labor among existing hardware compo- eral heuristics in order to be able to answer nents. all queries. If one plans to implement a In general, a hardware back end consists distributed DBMS, the challenge is to com- of a set of cooperating special-purpose proc- bine both approaches in a homogeneous essors. A wide bandwidth of architectural way. A good example of such a comprehen- alternatives exists, ranging from the asso- sive strategy, which includes the efficient ciation of identical processors with disjoint handling of fragments, data replication, fragments of the database to the assign- and dynamic access plans (avoiding some ment of specific processors to different of the aforementioned pitfalls of compli- DBMS functions; these assignments can be cated statistical estimates), is given by Yu made either statically or dynamically. and Chang [ 19831. The so-called “cellular logic” [Su 19791 is characterized by the fragmentation of the 0.3 Database Machines database (e.g., residing on a disk) into equal-sized cells (e.g., disk tracks), each of The use of database machines is motivated which is associated with a special-purpose primarily by the goal of relieving a general- processor. Usually, these processors have purpose host computer from the burden of an identical repertoire and are connected database processing so that overall system with a master processor controlling the performance may be improved. Off-load- concurrent operations of its slaves. Various ing DBMS functionality from the host cellular logic prototypes, such as CASSM to a back-end database machine can be [ Su and Lipkovsky 19751, RAP [ Ozkarahan achieved with a software-oriented or hard- 19821, and RARES [Lin et al. 19761, have ware-oriented approach. received wide attention. In the software backend approach (sur- Some designs of database machines have veyed, e.g., by Maryanski [1980]), the da- also experimented with associative arrays tabase-together with DBMS software-is [Berra and Oliver 19791. However, owing transferred to a standby general-purpose to their relatively high cost and limited computer. The software back end com- capacity, they do not constitute a realistic pletely processes each database request, solution for the permanent storage of entire thus enabling the host to execute nonda- databases. Associative arrays therefore are tabase tasks in parallel. Another attractive usually proposed to be used as staging de- property of the approach is the reasonable vices for relatively inexpensive mass stor- cost of conventional computers. For very age devices. database-intensive applications, however, The specialization of processors to spe- the software back-end approach does not cific DBMS functions has been realized in seem to be appropriate. In these situations, a number of database machine architec- the back-end computer itself becomes the tures. Specific cost models [Bernadat 19831 system bottleneck since it has the same and optimization algorithms [Valduriez limitations as a general-purpose host. Over- and Gardarin 19841 have been developed. all performance may be even worse than in In the DBC [Banerjee and Hsiao 19791, a host-only configuration because of the access control, directory maintenance, and additional interprocessor communication query processing are performed by different requirements. computers. The dynamic assignment of sin- In such a case, a hardware backend ap- gle processors to query-processing tasks, proach (surveyed, e.g., by Hsiao [1979] and such as the evaluation of distinct subquer- Langdon [ 19791) could be chosen. From the ies, has been pursued in DIRECT [Dewitt viewpoint of query optimization, hardware 19791, as well as the search processor of the back ends can be used to support important Technical University of Braunschweig optimization strategies, such as early eval- [Leilich et al. 19781. More recently, LSI

Computing Surveys, Vol. 16, No. 2, June 1964 Query Optimization in Database Systems 143

and VLSI technology have been explored expressions. ACM Trans. Database Syst. 4, 4 for the purposes of close integration of data (Dec.), 435-454. storage and query-processing capabilities, AHO, A. V., SAGIV, Y., AND ULLMAN, J. D. 1979c. as well as hardware implementations of Equivalences among relational expressions. SIAM J. Comput. 8, 2,218-246. language constructs usually available in ANDLER, S., DING, I., ESWARAN, K., HAUSER, C., high-level query languages [Kim et al. 1981; KIM, W., MEHL, J., AND WILLIAMS, R. 1982. Menon and Hsiao 19811. System D: A distributed system for availability. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City). VLDB 7. SUMMARY Endowment, Saratoga, Calif., pp. 33-44. An overview of logical transformation tech- APERS, P. M. G., HEVNER, A. R., AND YAO, S. B. niques and physical evaluation methods for 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-g, 1,5768. database queries was given, using the ASTRAHAN, M. M., AND CHAMBERLIN, D. D. 1975. framework of the relational calculus. It was Implementation of a structured English query shown that a large body of knowledge has language. Commun. ACM 18, 10 (Oct.), 580-588. been developed to solve the problem of ASTRAHAN, M. M., AND GHOSH, S. P. 1974. A search processing queries efficiently in conven- path selection algorithm for the Data Independ- tional centralized and distributed database ent Accessing Model (DIAM). In Proceedings of systems. the ACM-SZGMOD Workshop on Data Descrip- tion, Access and Control (Ann Arbor, Mich., May Query optimization research is still an l-3). ACM, New York, pp. 367-388. active field. Promising directions include ASTRAHAN, M. M., SCHKOLNICK, M., AND KIM, W. the development of simple yet realistic cost 1980. Performance of the System R access path estimates, the optimization of queries on selection algorithm. In Information Processing 80. databases with deductive or computational Elsevier North-Holland, New York, pp. 487-491. capabilities, and the simultaneous optimi- BANCILHON, F., RICHARD, P., AND SCHOLL, M. 1982. On line processing of compacted relations. In zation of multiple queries and update trans- Proceedings of the 8th International Conference actions. Other interesting areas only briefly on Very Large Data Bases (Mexico City). VLDB addressed in this survey are query optimi- Endowment, Saratoga, Calif., pp. 263-269. zation in database systems that utilize more BANERJEE, J., AND HSIAO, D. K. 1979. DBC-A advanced access paths, such as multiple- database computer for very large databases. attribute indexes or database machines, IEEE Trans. Comput. C-28,6,414-429. and query optimization in systems that BATORY, D. S. 1982. Index selection. In Principles of Database Design, S. B. Yao, Ed. Springer, New work on the complex data structures re- York. quired for artificial intelligence, office, sta- BAYER, R., AND MCCREIGHT, E. 1972. Organization tistical, decision support, or computer- and maintenance of large ordered indexes. Acta aided design and manufacture applications. Znf. 1, 173-189. BAYER, R., ELHARDT, K., KIESSLING, W., AND KIL- ACKNOWLEDGMENTS LAR, D. 1984. Verteilte Datenbanksysteme. Znf. Spektrum 7, 1, 1-19. The work of Matthias Jarke was supported in part by BENTLEY, J. L., AND FRIEDMAN, J. H. 1979. Data a joint study with the IBM Corporation. The work of structures for range searching. ACM Comput. Jiirgen Koch was supported in part by the Deutsche Sum. 11, 4 (Dec.), 397-409. Forschungsgemeinechaft under Grant SCHM450/2-1 BERNADAT, P. 1983. Decomposition et evaluation de (principal investigator: J. W. Schmidt). The authors questions dans une machine base de don&es are grateful to Joachim Schmidt and Arnie Rosenthal relationnelles. These IIIme cycle, Department for many discussions and suggestions, and to the ref- d’Informatique, Universite de Paris VI, Paris, erees and editors for helpful comments on the pres- France. entation of this material. BERNSTEIN, P. A., AND CHIU, D. M. W. 1981. Using semi-joins to solve relational queries. J. ACM 28, REFERENCES 1 (Jan.), 25-40. BERNSTEIN, P. A., AND GOODMAN, N. 1981a. The AHO, A. V., BEERI, C., AND ULLMAN, J. D. 1979a. power of natural semijoins. SIAM J. Comput. 10, The theory of joins in relational databases. ACM 4,751-771. Trans. Database Syst. 4, 3 (Sept.), 297-314. BERNSTEIN, P. A., AND GOODMAN, N. 1981b. The AHO, A. V., SAGIV, Y., AND ULLMAN, J. D. 1979b. power of inequality semijoins. Znf. Syst. 6,4,255- Efficient optimization of a class of relational 265.

Computing Surveys,Vol. 16,No. 2, June 1984 144 l M. Jarke and J. Koch BERNSTEIN, P. A., AND GOODMAN, N. 1981c. SCHKOLNICK, M., SELINGER, P. G., SLUTZ, D. Concurrency control in distributed database sys- R.. WADE. B. W.. AND YOST, R. A. 1981. tems. ACM Comput. Surv. 13, 2 (June), 185-221. Support for repetitive transactions and ad hoc BERNSTEIN, P. A., BLAUSTEIN, B. T., AND CLARKE, queries in System R. ACM Trans. Database Syst. E. M. 1980. Fast maintenance of semantic in- 6, 1 (Mar.), 70-94. tegrity assertions using redundant aggregate data. CHAN, A., AND NIAMIR, B. 1982. On estimating cost In Proceedings of the 6th International Conference of accessing records in blocked database organi- on Very Large Data Bases (Montreal, Oct. l-3). zations. Comput. J. 25, 3, 368-374. IEEE, New York, pp. 126-136. CHAN, A., DAYAL, U., Fox, S., GOODMAN, N., RIES, BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, D. D., AND SKEEN, D. 1983. Overview of an Ada C. L., AND ROTHNIE, J. B., JR. 1981. Query compatible distributed database manager. In processing in a system for distributed databases SZGMOD 83, Proceedings of the Annual Meeting (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec.), (San Jose, Calif., May 23-26). ACM, New York, 602-625. pp. 228-237. BERRA, B., AND OLIVER, E. 1979. The role of asso- CHANDRA, A. K., AND HAREL, D. 1982a. Structure ciative array processors in database machine ar- and complexity of relational queries. J. Comput. chitectures. IEEE Comput. 12, 3, 53-61. Syst. Sci. 25, 99-128. BITTON, D., BORAL, H., DEWI~, D., AND WILKIN- CHANDRA, A. K., AND HAREL, D. 1982b. Horn SON, W. K. 1983. Parallel algorithms for the clauses and the fixpoint query hierarchy. In Pro- execution of operations. ACM ceedings of the ACM Symposium on Principles of Trans. Database Syst. 8, 3 (Sept.), 324-353. Database Systems (Los Angeles, Calif., Mar. 29- BLASGEN, M. W., AND ESWARAN, K. P. 1976. On 31). ACM, New York, pp. 158-163. the evaluation of queries in a relational data base CHANDRA, A. K., AND MERLIN, P. M. 1977. Optimal system. IBM Res. Rep. RJ 1745, IBM Research implementation of conjunctive queries in rela- Laboratories, San Jose, Calif. tional data bases. In Proceedings of the 9th An- BLASGEN, M. W., AND ESWARAN, K. P. 1977. nual ACM Symposium on Theory of Computation Storage and access in relational databases. IBM (Boulder, Colo., May 24). ACM, New York, pp. Syst. J. 16, 363-377. 77-90. BOLOUR, A. 1981. Optimal retrieval for small range CHANG, C. L. 1978. DEDUCE 2: Further investiga- queries. SIAM J. Comput. 10, 4, 721-741. tions of deduction in relational databases. In BREITBART, Y., AND REITER, A. 1975. Algorithms Logic and Databases, H. Gallaire and J. Minker, for fast evaluation of Boolean expressions. Acta Eds. Plenum, New York, pp. 210-236. Znf 4, 107-116. CHANG, C. L. 1979. On evaluation of queries con- taining derived relations in a relational data base. BRODIE, M., MYLO~OULOS, J., AND SCHMIDT, J. W., Eds. 1984. On Conceptual ModelZing. Perspec- IBM Res. Rep. RJ2667, IBM Research Labora- tives from Artificial Intelligence, Databases, and tories, San Jose, Calif. Programming Languages. Springer, New York. CHANG, J.-M. 1982. A heuristic approach to distrib- uted query processing. In Proceedings of the 8th BUNEMAN, P. 1979. The problem of multiple paths Znternational Conference on Very Large Data in a database schema. In Proceedings of the 5th Bases (Mexico City). VLDB Endowment, Sara- International Conference on Very Large Data toga, Calif., pp. 54-61. Bases (Rio de Janeiro, Oct. 3-5). IEEE, New York, pp. 368-372. CHEN, P. P. S., AND AKOKA, J. 1980. Optimal design of distributed information systems. IEEE Trans. CARDENAS, A. F. 1975. Analysis end performance of Comput. C-29, 12,1068-1080. inverted data base structures. Commun. ACM 18, CHESNAIS, A., GELENBE, E., AND MITRANI, I. 1983. 5 (May), 253-263. On the modeling of parallel access to shared data. CARLIS, J. V., MARCH, S. T., AND DICKSON, G. W. Commun. ACM 26.3 (Mar.), 196-202. 1981. Physical database design: A DSS ap- CHEUNG, T.-Y. 1982a. A method for equijoin queries proach. In Proceedings of the 2nd International in distributed relational databases. IEEE Trans. Conference on Information Systems (Boston, Comput. C-31,8, 746-751. Mass.). ACM, New York, pp. 153-172. CHEUNG, T.-Y. 1982b. Estimating block accesses CARLSON, C. R., AND KAPLAN, R. S. 1976. A gener- and number of records in file management. Com- alized access path model and its application to a mun. ACM 25, 7 (July), 484-487. relational data base system. In Proceedings of the CHIU, D. M., AND Ho, Y. C. 1980. A methodology ACM-SZGMOD International Conference on for interpreting tree queries into optimal semi- Management of Data (Washington, D.C., June 2- join expressions. In Proceedings of the ACM- 4). ACM, New York, pp. 143-154. SZGMOD Znternational Conference on Manage- CERI, S., AND PELAGA~I, G. 1982. Allocation of ment of Data (Santa Monica, Calif., May 14-16). operations in distributed database access. IEEE ACM, New York, pp. 169-178. Trans. Comput. C-31, 2,119-128. CHIU, D. M., BERNSTEIN, P. A., AND Ho, Y. C. CHAMBERLIN, D. D., ASTRAHAN, M. M., KING, W. F., 1981. Optimizing chain queries in a distributed LORIE, R. A., MEHL, J. W., PRICE, T. G., database system. Tech. Rep. TR-01-81, Computer

Computing Surveys,Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 145 Science Dept., Harvard University, Cambridge, CCAdl-11, Computer Corporation of America, Mass. Cambridge, Mass. CHRISTODOULAKIS, S. 1981. Estimating selectivities DEMOLOMBE, R. 1980. Estimation of the number of in data bases. Tech. Rep. CSRG-136, Computer tuples satisfying a query expressed in predicate Science Dept., University of Toronto, Toronto, calculus language. In Proceedings of thesth Con- Canada. ference on Very Large Data Bases (Montreal, Oct. CHRISTODOULAKIS, S. 1983. Estimating block trans- i-3). IEEE, New York, pp. 55-63: fers and join sizes. In SIGMOD 83, Proceedings DEWITT, D. J. 1979. Query execution in DIRECT. of the Annual Meeting (San Jose. Calif., Mav 23- In Proceedings of the ACM-SZGMOD Znterna- 26). ACM, New York: pp. 40-54.’ - tional Conference on Management of Data (Bos- CHU, W. W., AND HURLEY, P. 1982. Optimal query ton, Mass., May 30June 1). ACM, New York, processing for distributed database systems. pp. 13-22. IEEE Trans. Comput. C-31,9,835-850. DOWNEY, P. J., SETHI, R., AND TARJAN, R. E. CLAUSEN, S. E. 1980. Optimizing the evaluation of 1980. Variations on the common subexpression calculus expressions in a relational database sys- problem. J. ACM 27, 4 (Oct.), 758-771. tem. Znf. Syst. 5, 1,41-54. ECGERS, S. J., AND SHOSHANI, A. 1980. Efficient CODD, E. F. 1971. A database sublanguage founded access of compressed data. In Proceedings of the on the relational calculus. In Proceedings of the 6th Zntemational Conference on Very Large Data ACM-SIGFIDET Workshop, Data Description, Bases (Montreal, Oct. l-3). IEEE, New York, pp. Access, and Control (San Diego, Calif., Nov. ll- 205-211. 12). ACM, New York, pp. 35-68. EPSTEIN, R., AND STONEBRAKER, M. 1980. Analysis CODD, E. F. 1972. Relational completeness of data of distributed data base processing strategies. In base sublanguages. In Courant Computer Science Proceedings of the 6th International Conference Symposia No. 6: Data Base Systems. Prentice- on Very Large Data Bases (Montreal, Oct. l-3). Hall, New York, pp. 67-101. IEEE, New York, pp. 92-101. DANIELS, D. 1982. Query compilation in a distrib- EPSTEIN, R., STONEBRAKER, M., AND WONG, E. uted database system. IBM Res. Rep. RJ3423, 1978. Distributed query processing in a rela- IBM Research Laborator+, San Jose, Calif. tional data base system. In Proceedings of the ACM-SZGMOD Znternational Conference on DANIELS, D., SELINGER, S., HAAS, L., LINDSAY, B., Management of Data (Austin, Tex., May Bl-June MOHAN, C., WALKER, A., AND WILMS, P. 2). ACM, New York, pp. 169-180. 1982. An introduction to distributed query com- Dilation in R*. In Proceedings of the 2nd Svmpos- ESCULIER, C., AND CLORIEUX, A. M. 1979. The SIR- IUS-DELTA distributed DBMS. In Proceedings iurn on Distributed Databask iBerlin, FRG): El- sevier North-Holland, New York. of the International Conference on Entity-Rela- tiomhb Aonroach. P. Chen. Ed. Elsevier North- DAVIS, H. W., AND WINSLOW, L. E. 1982. Holland, New York, pp. 543-551. Computational power in query languages. SIAM ESWARAN, K. P., GRAY, J. N., LORIE, R. A., AND J. Comput. 11, 3,547-554. TRAIGER, I. L. 1976. The notions of consistency DAVIS, L. S., AND KUNII, T. L. 1982. Pattern data- and predicate locks in a database system. Com- bases. In Data Base Design Technioues II, S. B. mun. ACM 19, 11 (Nov.), 624-633. Yao and T. L. Kunii, Ed;. Springer, New’ York, FINKELSTEIN, S. 1982. Common expression analysis pp. 357-399. in database applications. In Proceedings of the DAYAL, U. 1983a. Evaluating queries with quanti- ACM-SZGMOD International Conference on fiers: A horticultural approach. In Proceedings of Management of Data (Orlando, Fla., June 2-4). the ACM Symposium on Principles of Database ACM, New York, pp. 235-245. Systems (Atlanta, Ga.). ACM, -New. York, pp. FORKER, H. J. 1982. Algebraical and operational 125-136. methods for the optimization of query processing DAYAL, U. 1983b. Processing queries over generali- in distributed relational database management zation hierarchies in a multi-database system. In systems. In Proceedings of the 2nd International Proceedings of the 9th International Conference Symposium on Distributed Databases (Berlin, on Very Large Data Bases (Florence, Italy). FRG). Elsevier North-Holland, New York, pp. VLDB Endowment, Saratoga, Calif., pp. 342-353. 39-59. DAYAL, U., AND GOODMAN, N. 1982. Query optimi- GARDARIN, G., VALDURIEZ, P., AND VIEMONT, Y. zation for CODASYL database systems. In Pro- 1984. Predicate trees: An approach to optimize ceedings of the ACM-SZGMOD International relational query operations. In Proceedings of the Conference on Management of Data (Orlando, IEEE COMPDEC Conference (Los Angeles, Fla., June 2-4). ACM, New York, pp. 138-150. Calif.). IEEE, New York. DAYAL, U., GOODMAN, N., LANDERS, T. A., OLSON, GAVISH, B., AND SEGEV, A. 1982. Query optimization K.. SMITH. J. M.. AND YEDWAB. L. 1981. Local in distributed computer systems. In Management query optimization in Multibase-A system for of Distributed Data Processing, J. Akoka, Ed. heterogeneous distributed databases. Tech. Rep. Elsevier North-Holland, New York, pp. 233-252.

Computing Surveys,Vol. 16, No. 2, June 1984 146 . M. Jarke and J. Koch GAVISH, B., AND SEGEV, A. 1983. Set query optimi- FRG, Sept. 13-15). IEEE, New York, pp. 400- zation in horizontally partitioned distributed da- 406. tabases. Working Paper QM8304, Graduate GUDES, E., AND REITER, A. 1973. On evaluating School of Management, University of Rochester, Boolean expressions. Softw. Pratt. Exper. 3, 345- Rochester, N.Y. 350. GELENBE, E., AND GARDY, D. 1982. The size of HALL, P. A. V. 1974. Common subexpression iden- projections of relations satisfying a functional tification in general algebraic systems. Tech. Rep. dependency. In Proceedings of the 8th Intema- UKSC 0060. IBM UK Scientific Center. Peterlee. tional Conference on Very Large Data Bases England. (Mexico City). VLDB Endowment, Saratoga, HALL, P. A. V. 1976. Optimization of single expres- Cahf., pp. 325-333. sions in a relational data base system. IBM J. GILLES, J. H., AND SCHUSTER, S. A. 1975. Query Res. Devel. 20, 3, 244-257. execution and index selection for relational data HAMMER, M., AND NIAMIR, B. 1979. A heuristic bases. Tech. Rep. CSRG-53, Computer Science approach to attribute partitioning. In Proceedings Dept., University of Toronto, Toronto, Ontario. of the ACM-SIGMOD International Conference GOODMAN, N., AND SHMUELI, 0. 1980. Nonreduc- on Management of Data (Boston, Mass., May 30- ible database states for cyclic queries. Tech. Rep. June 1). ACM, New York, pp. 93-101. TR-15-80, Computer Science Dept., Harvard HAMMER, M., AND ZDONIK, S. B., JR. 1980. University, Cambridge, Mass. Knowledge-based query processing. In Proceed- GOODMAN, N., AND SHMUELI, 0.1982. Tree queries: ings of the 6th Znternational Conference on Very A simple class of relational queries. ACM Trans. Large Data Bases (Montreal, Oct. l-3). IEEE, Database Syst. 7,4 (Dec.), 653-677. New York, pp. 137-147. GOTLIEB, L. R. 1975. Computing joins of relations. HANANI, M. Z. 1977. An optimal evaluation of Boo- In Proceedings of the ACM-SZGMOD Znterna- lean expressions in an online query system. Com- tional Confer&& on Management of Data (San mun. ACM 20, 5 (May), 344-347. Jose, Calif., May 14-16). ACM, New York, pp. HAWTHORN, P. 1982. Microprocessor assisted tuple 55-63. access, decompression, and assembly for statisti- GOUDA, M. G., AND DAYAL, U. 1981. Optimal semi- cal database systems. In Proceedings of the 8th join schedules for query processing in local dis- International Conference on Very Large Data tributed database systems. In Proceedings of the Bases (Mexico Citv). VLDB Endowment. Sara- ACM-SZGMOD International Conference on toga, Calif., pp. 223-233. Management of Data (Ann Arbor, M&h., Apr. 29- HENSCHEN, L. J., AND NAQVI, S. A. 1984. On com- May 1). ACM, New York, pp. 164-175. piling queries in recursive first-order databases. GRANT, J., AND MINKER, J. 1981. Optimization in J. ACM 31, 1 (Jan.), 47-85. deductive and conventional relational database HEVNER, A. R. 1979. The optimization of query systems. In Advances in , H. processing on distributed database systems. Gallaire. J. Minker. and J.-M. Nicolas. Eds. Ph.D. dissertation. Comnuter Science Dent.. Pur- Plenum; New York, pp. 195-234. ’ due University, West Lafayette, Ind. - ’ GRAY, P. M.D. 1981. Use of automatic programming HEVNER, A. R., AND YAO, S. B. 1979. Query proc- and simulation to facilitate operations on CO- essing on a distributed database. IEEE Trans. DASYL databases. In Database, M. P. Atkinson, Softw. Eng. SE-5, 3, 177-187. Ed. Pergamon Infotech, London, pp. 315-369. HEVNER, A. R., AND YAO, S. B. 1981. Transaction GRAY, P. M. D. 1984. Implementing the join opera- optimization on a distributed database system. tion on CODASYL DBMS. In Databases: Role Tech. Rep. HR-81-257, Honeywell Corporate and Structure, P. M. Stocker, Ed. Cambridge Computer Center, Bloomington, Minn. University Press, Cambridge, England. HSIAO, D. K. 1979. Database machines are coming, GRIES, D. 1971. Compiler Construction for Digital database machines are coming. IEEE Comput. Computers. Wiley, New York. 12, 3, 7-9. GRIFFETH, N. D. 1978. Nonprocedural query proc- IBM CORPORATION 1966. Introduction to IBM di- essing for databases with access paths. In Pro- rect-access storage devices and organization ceedings of the ACM-SZGMOD International methods. Programming manual GC 20-1649-06, Conference on Management of Data (Austin, Tex., May 31-June 2). ACM, New York, pp. 160-168. IEEE 1982. Special issue on query optimization. IEEE GRIFFITHS, P. G., AND WADE, B. W. 1976. An au- Database Eng. 5, 3 (Sept.). thorization mechanism for a relational database JARKE, M. 1984. Common subexpression isolation in system. ACM Trans. Database Syst. 1, 3 (Sept.), multiple query optimization. In Query Processing 242-255. in Database Systems, W. Kim, D. Reiner, and D. GRISHMAN, R. 1978. The simplification of retrieval Batory, Eds. Springer, New York. requests generated by question-answering sys- JARKE, M., AND KOCH, J. 1983. Range nesting: A tems. In Proceedings of the 4th International Con- fast method to evaluate quantified queries. In ference on Very Large Data Bases (West Berlin, SIGMOD 83, Proceedings of Annual Meeting (San

Computing Surveys,Vol. 16, No. 2, June 1964 Query Optimization in Database Systems 147

Jose, Calif., May 23-26). ACM, New York, pp. ment of Data (Santa Monica, (Calif., May 14-16). 196-206. ACM, New York, pp. 179-187. JARKE, M., AND SCHMIDT, J. W. 1981. Evaluation KIM, W. 1981. Query optimization for relational da- of first-order relational expressions. Tech. Rep. tabase systems. IBM Res. Rep. RJ3081, IBM 78, Fachbereich Informatik, Universitaet Ham- Research Laboratories, San Jose, Calif. burg, Hamburg, FRG. KIM, W. 1982. On optimizing an SQL-like nested JARKE, M., AND SCHMIDT, J. W. 1982. Query proc- query. ACM Trans. Database Syst. 7, 3 (Sept.), essing strategies in the PASCAL/R relational pp. 443-469. database management system. In Proceedings of KIM, W. 1984. Global optimization of relational the ACM-SZGMOD International Conference on queries: A first step. In Query Processing in Da- Management of Data (Orlando, Fla., June 2-4). tabase Systems, W. Kim, D. Reiner, and D. Ba- ACM, New York, pp. 256-264. tory, Eds. Springer, New York. JARKE, M., AND VASSILIOU, Y. 1984. Coupling ex- KIM, W., KUCK, D. J., AND GARSKI, D. 1981. A bit- pert systems with database management systems. serial/tuple-parallel relational query processor. In Artificial Intelligence Applications for Business, IBM Res. Rep. RJ3194, IBM Research Labora- W. Reitman, Ed. Ablex, Norwood, N.J., pp. 6% tories, San Jose, Calif. 85. KING, J. J. 1979. Exploring the use of domain knowl- JARKE, M., CLIFFORD, J., AND VASSILIOU, Y. 1984. edge for query processing efficiency. Tech. Rep. An optimizing Prolog front-end to a relational STAN-CS-79-781, Computer Science Dept., query system. In SZGMOD 84, Proceedings of Stanford University, Stanford, Calif. Annual Meeting (Boston, Mass., June 18-21). KING, J. J. 1981. QUIST: A system for semantic ACM, New York, pp. 296-306. query optimization in relational databases. In JOHNSON, D. S., AND KLUG, A. 1982. Testing con- Proceedings of the 7th International Conference tainment of conjunctive queries under functional on Very Large Data Bases (Cannes, Sept. 9-11). and inclusion dependencies. In Proceedings of the IEEE, New York, pp. 510-517. ACM Symposium on Principles of Database Sys- KLUG, A. 1980. Calculating constraints on relational tems (Los Angeles, Calif., Mar. 29-31). ACM, expressions. ACM Trans. Database Syst. 5, 3 New York, pp. 164-169. (Sept.), 260-290. JOHNSON, D. S., AND KLUG, A. 1983. Optimizing KLUG, A. 1982a. Equivalence of relational algebra conjunctive queries that contain untyped varia- and relational calculus query languages having bles. SIAM J. Comput. 12, 4, 616640. aggregate functions. J. ACM 29, 3 (July), 699- JOHNSTON, H. R., SCHWEITZER, J. E., AND WARKEN- 717. TINE, E. R. 1983. A DBMS facility for handling KLUG, A. 1982b. Access paths in the “ABE” statis- structured engineering entities. In Proceedings of tical query facility. In Proceedings of the ACM- the Database Week Engineering Design Applica- SZGMOD International Conference on Manage- tions Conference (San Jose, Calif.). ACM, New ment of Data (Orlando, Fla., June 2-4). ACM, York, pp. 3-12. New York, pp. 161-173. KAMBAYASHI, Y., AND YOSHIKAWA, M. 1983. Query KLUG, A. 1983. Locking expressions for increased processing utilizing dependencies and horizontal database concurrency. J. ACM 30, 1 (Jan.), 36- SZG%fOD 8.3 Proceedings of the decompo&ion. In 54. Annual Meetine (San Jose. Calif.. Mav 23-26). ACM, New York,pp. 55-67. KOCH, J., SCHMIDT, J. W., AND WUNDERLICH, V. - 1981. Type derivation for first-order relational KAMBAYASHI, Y., YOSHIKAWA, M., AND YAJIMA, S. expressions. Tech. Rep. no. 79, Fachbereich In- 1982. Query processing for distributed data- formatik, Universitiit Hamburg, Hamburg, FRG. bases using generalized semi-joins. In Proceedings of the ACM-SZGMOD International Conference KOWALSKI, R. 1981. Logic as a database language. on Management of Data (Orlando, Fla., June 2- Unpublished manuscript, Computer Science 4). ACM, New York, pp. 151-160. Dept., Imperial College, London. KATZ, R. H., AND WONG, E. 1982. Decompiling CO- KUNIFUJI, S., AND YOKOTA, H. 1982. Prolog and DASYL DML into relational queries. ACM relational databases for Fifth Generation Com- Trans. Database Syst. 7, 1 (Mar.), l-23. puter Systems. In Proceedings of the Workshop KELLOGG, C. 1982. A practical amalgam of knowl- on Loeical Bases for Data Bases (Toulouse, edge and data base technology. InProceedings of France?. ONERA-CERT, Toulouse, France. the National Conference on Artificial Intelligence LAMERSDORF, W. 1984. Recursive data models for (Pittsburgh, Pa.): AAAI, Menlo Park, Calif, non-conventional database applications. In Pro- KERSCHBERG, L., TING, P. D., AND YAO, S. B. ceedings of the IEEE COMPDEC Conference (Los 1982. Query optimization in star computer net- Angeles, Calif.). IEEE, New York. works. ACM Trans. Database Syst. 7, 4 (Dec.), LANG, T., WOOD, C., AND FERN~NDEZ, I. B. 1977. 678-711. Database buffer paging in virtual storage systems. KIM, W. 1980. A new way to compute the product ACM Trans. Database Syst. 2, 4 (Dec.), 339-351. and join of relations. In Proceedings of the ACM- LANGDON, G. G. 1979. Database machines: an intro- SZGMOD Znternatianal Conference on Manage- duction. IEEE Trans. Comput. C-28,6,381-383.

Computing Surveys, Vol. 16, No. 2, June 1964 148 l M. Jarke and J. Koch LEILICH, H.-O., STIEGE, G., AND ZEIDLER, H. C. MARYANSKI, F. J. 1980. Backend database systems. 1978. A search processor for data base manage- ACM Comput. Surv. 1.2, 1 (Mar.), 3-25. ment systems. In Proceedings of the 4th Znterna- MENON, M. J., AND HSIAO, D. K. 1981. Design and ti0na.l Conference on Very Large Data Bases (West analysis of a relational join operation for VLSI. Berlin, FRG, Sept. 13-15). IEEE, New York, pp. In Proceedings of the 7th International Conference 286-287. on Very Large Data Bases (Cannes, Sept. 9-11). LIN, C. S., SMITH, D. C. P., AND SMITH, J. M. IEEE, New York, pp. 44-55. 1976. The design of a rotating associative mem- MERRETT, T. H. 1977. Database cost analysis: Atop- ory for relational database applications. ACM down approach. In Proceedings of the ACM-SZG- Trans. Database Syst. 1, 1 (Mar.), 53-65. MOD International Conference on Management LIU, J. W. S. 1976. Algorithms for parsing search of Data (Toronto, Canada, Aug. 3-5). ACM, New queries in systems with inverted file organiza- York, pp. 135-143. tions. ACM Trans. Database Syst. 1, 4 (Dec.), MERRETP, T. H. 1981. Why sort-merge gives the 299-316. best implementation of the natural join. SZG- LUK, W. S. 1983. On estimating block accesses in MOD Rec. 13, 2,39-51. database organizations. Commun. ACM 26, 11 M ERRETT, T. H., KAMBAYASHI, Y., AND YASUURA, H. (Nov.), 945-947. 1981. Scheduling of page-fetches in join opera- MAEKAWA, M. 1982. Parallel join and sorting algo- tions. In Proceedings of the 7th Znternatianal Con- rithms. In Data Base Design Techniques ZZ, S. B. ference on Very Large Data Bases (Cannes, Sept. Yao and T. L. Kunii, Eds. Springer, New York, 9-11). IEEE, New York, pp. 488-498. pp. 266-296. MINKER, J. 1975. Performing inferences over rela- MAHMOUD, S. A., RIORDON, J. S., AND TOTH, K. C. tion data bases. In Proceedings of the ACM- 1979. Database partitioning and query process- SZGMOD Zntematianal Conference on Manage- ing. In Proceedings of the ZFZP Working Confer- ment of Data (San Jose, Calif., May 14-16). ACM, ence on Database Architecture. Elsevier North- New York, pp. 79-91. Holland, New York, pp. 3-21. MINKER, J. 1978. Search strategy and selection func- MAIER, D. 1983. The Theory of Relational Databases. tion for an inferential relational system. ACM Computer Science Press, Rockville, Md. Trans. Database Syst. 3, 1 (Mar.), 1-31. MAIER, D., AND ULLMAN, J. D. 1983. Fragments of MINKER, J., AND NICOLAS, J.-M. 1983. On recursive axioms in deductive databases. Znf. Syst. 8, 1, l- relations. In SZGMOD 33, Proceedings of the An- nual Meeting (San Jose, Calif., May 23-26). 13. ACM, New York, pp. 15-22. MISSIKOFF, M., AND SCHOLL, M. 1983. Relational queries in domain based DBMS. In SZGMOD 63, MAIER, D., AND WARREN, D. S. 1981. Incorporating Proceedings of Annual Meeting (San Jose, Calif., computed relations in relational databases. In May 23-26). ACM, New York, pp. 219-227. Proceedings of the ACM-SZGMOD International Conference on Management of Data (Ann Arbor, MONTGOMERY, A. I., D’SOUZA, D. J., AND LEE, S. B. Mich., Apr. 29-May 1). ACM, New York, pp. 1983. The cost of relational algebraic operations 176-187. in skewed data: Estimates and experiments. In Information Processing 83. Elsevier North-Hol- MAKINOUCHI, A., TEZUKA, M., KITAKAMI, H., AND land, New York, pp. 235-241. ADACHI, S. 1981. The optimization strategy for query evaluation in RDB/Vl. In Proceedings of MUNZ, R. R. 1979. Gross architecture of the distrib- the 7th International Conference on Very Large uted database system. VDN. In Proceedings of the Data Bases (Cannes, Sept. 9-11). IEEE, New ZFZP Working Conference on Database Architec- York, pp. 518-529. ture. Elsevier North-Holland, New York, pp. 23- 34. MALL, M., REIMER, M., AND SCHMIDT, J. W. 1984. Data selection, sharing and access control MUNZ, R. R., SCHNEIDER, H.-J., AND STEYER, F. in a relational scenario. In On Conceptual Mod- 1979. Application of sub-predicate tests in da- cling. Perspectives from Artificial Intelligence, Da- tabase systems. In Proceedings of the 5th Znter- tabases, and Programming Languages, M. Brodie, national Conference on Very Large Data Bases J. Mylopoulos, and J. W. Schmidt, Eds. Springer, (Rio de Janeiro, Oct. 3-5). IEEE, New York, pp. New York, pp. 411-436. 426-435. MARBURGER, H., AND NEBEL, B. 1983. Natuer- MUTHUSWAMY, B., AND KERSCHBERG, L. 1983. lichsprachlicher Datenbankzugang mit HAM- Distributed query optimization using detailed da- ANS: Syntaktische Korespondenz, natuerlich- tabase statistics. Unpublished manuscript, Com- sprachliche Quantifizienmg und semantisches puter Science Dept., University of South Caro- Model1 des Diskursbereichs. In Sprachen fuer lina. Datenbanken, J. W. Schmidt, Ed. Springer-Ver- NAU, D. 1983. Expert computer systems. IEEE Com- lag, Berlin, pp. 26-41. put. 16, 2 (Feb.), 63-85. MARCH, S. T. 1983. A mathematical programming NEUHOLD, E. J., AND BILLER, H. 1977. POREL: A approach to the selection of access paths for large distributed data base on an inhomogeneous com- multiuser data bases. Decision Sci. 14,4,564-587. puter network. In Proceedings of the 3rd Znter-

Computing Surveys,Vol. 16, No. 2, June 1964 Query Optimization in Database Systems l 149

national Conference on Very Large Data Bases PECHERER, R. M. 1976. Efficient exploration of (Tokyo, Oct. 6-8). IEEE, New York, pp. 380-395. product spaces. In Proceedings of tke ACM-SZG- NG, P. 1982. Distributed compilation and recompi- MOD International Conference on Management lation of distributed queries. IBM Res. Rep. of Data (Washington, D.C., June 2-4). ACM, New RJ3375, IBM Research Laboratories, San Jose, York, pp. 169-177. Calif. PIROTTE, A. 1979. Findamental and secondary issues NIEBUHR, K. E., AND SMITH, S. E. 1976. N-ary joins in the design of non-procedural relational lan- for processing Query by Example. IBM Tech. guages. In Proceedings of the 5th International Disclosure Bull. 19, 6,2377-2381. Conference on Very Large Data Bases (Rio de Janeiro, Oct. 3-5). IEEE, New York, pp. 239-250. NIEBUHR, K. E., SCHOLZ, K. W., AND SMITH, S. E. 1976. Algorithm for processing Query by Ex- PUTKONEN, A. 1979. On the selection of the access ample. IBM Tech. Disclosure Bull. 19,2,736-741. path in inverted database organizations. Znf. Syst. 4, 4, 219-225. NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, K. C. 1984. The grid file: An adaptable, symmetric REIMER, M. 1983. Solving the phantom problem by multikey file structure. ACM Trans. Database predicative optimistic concurrency control. In Syst. 9, 1 (Mar.), 38-71. Proceedings of the 9th International Conference on Very Large Data Bases (Florence, Italy). NILSSON, N. 1982. Principles of Artificial Zntelli- VLDB Endowment, Saratoga, Calif., pp. 81-88. gence. Springer, New York. REITER, R. 1978. Deductive question-answering on OTT, N. 1977. Interpretation of questions with quan- relational data bases. In Logic and Databases, H. ‘tifiers and negation in the -USL system. IBM Gallaire and J. Minker, Eds. Plenum, New York, Tech. RCTD. TR-77.10965. IBM Scientific Center. pp. 149-178. Heidelberg, FRG. ’ OTT, N., AND HORLAENDER, K. 1982. Removing re- RICHARD, P. 1981. Evaluation of the size of a query dundant join operations in queries involving expressed in relational algebra. In Proceedings of views. IBM Tech. Rep. TR-82.03993, IBM Sci- the ACM-SZGMOD International Conference on entific Center, Heidelberg, FRG. Management of Data (Ann Arbor, Mich., Apr. 29- May 1). ACM, New York, pp. 155-163. OZKARAHAN, E. A. 1982. Database machine/com- ROSENKRANTZ, D. J., AND HUNT, H. B. III. 1989. puter based distributed databases. In Proceedings Processing conjunctive predicates and queries. In bf the 2nd Znternational Symposium on Distrib- Proceedings of the 6th Znternatianal Conference uted Databases (Berlin. FRG). Elsevier North- on Very Large Data Bases (Montreal, Oct. l-3). Holland, New York, pp: 61-86. IEEE, New York, pp. 64-72. OZSOYOGLU, M. Z., AND Ozso~o~~u, G. 1983. An ROSENTHAL, A., AND REINER, D. 1982. An architec- extension of relational algebra for summary ta- ture for.query optimization. In Proceedings of the bles. In Proceedings of the 2nd Statistical Data- ACM-SZGMOD International Conference on base Workshop (Berkeley, Calif.). University of Management of Data (Orlando, Fla.,’ June 2-4). California, Berkeley. ACM, New York, pp. 246-255. Ozso~o~~u, M., AND Yu, C. T. 1980. On identifying ROSENTHAL, A., AND REINER, D. 1984. Querying a class of database queries that can be processed relational views of networks. In Query P&es& efficiently. In Proceedings of the IEEE COMP- in Database Systems. W. Kim. D. Reiner. and D. SAC Conference. IEEE, New York, pp. 453-461. Batory, Eds. ipringer, New York. PAIGE, R. 1982. Applications of finite differencing ROTHNIE, J. B. 1974. An approach to implementing to database integrity control and query/transac- a relational data management system. In Pro- tion optimization. In Proceedings of the Workshop ceedings of the ACM-SZGMOD Workshop on Data on Logical Bases for Data Bases (Toulouse, Description, Access, ana’ Contml (Ann Arbor, France). ONERA-CERT, Toulouse, France. Mich., May l-3). ACM, New York, pp. 277-294. PALERMO, F. P. 1972. A data base search problem. ROTHNIE, J. B., JR. 1975. Evaluating inter-entry In Proceedings of the 4th Symposium on Computer retrieval expressions in a relational data base and Information Science (Miami Beach, Fla.). management system. In Proceedings of the Na- AFIPS Press, Reston, Va., pp. 67-101. tional Computer Conference (Anaheim, Calif., PARSAYE, K. 1983. Database management, knowl- May lS-22), vol. 44. AFIPS Press, Reston, Va., edge base management, and expert system devel- pp. 417-423. onment in PROLOG. In Proceedings of the Da- ROTHNIE, J. B., AND GOODMAN, N. 1977. A survey tabase Week Conference on Engineering Applica- of research and development in distributed data- tions of Databases (San Jose. Calif.). ACM. New base management. In Proceedings of the 3rd Zn- York, pp. 159-178.‘ ’ ’ ’ ternational Conference on Very Large Data Bases PECHERER, R. M. 1975. Efficient evaluation of (Tokyo, Oct. 6-8). IEEE, New York, pp. 48-62. expressions in a relational algebra. In Proceedings Rousso~ou~os, N. 1982a. View indexing in rela- of the ACM Pacific 75 Conference (San Francisco, tional databases. ACM Trans. Database Syst. 7, 2 Calif., May 14-16). ACM, New York, pp. 44-49. (June), 258-290.

Computing Surveys,Vol. 16, No. 2, June 1964 150 l M. Jarke ana! J. Koch ROUSSOPOULOS, N. 1982b. The logical access path Res. Rep. RJ2283, IBM Research Laboratories, schema of a database. IEEE Trans. Softw. Eng. San Jose, Calif. SE-8,6,563-573. SELINGER, P. G., ASTRAHAN, M. M., CHAMBERLIN, SACCO, G. M., AND SCHKOLNICK, M. 1982. A mech- D. D., LORIE. R. A.. AND PRICE. T. G. anism for managing the buffer pool in a relational 1979. ‘Access path selection in a relational da- database system using the hot set model. In Pro- tabase management system. In Proceedings of the ceedings of the 8th International Conference on ACM-SZGMOD International Conference on Very Large Data Bases (Mexico City). VLDB Management of Data (Boston, Mass., May Endowment, Saratoga, Calif., pp. 257-262. 30-June 1). ACM, New York, pp. 23-34. SACCO, G. M., AND YAO, S. B. 1982. Query optimi- SEVERANCE, D. G., AND CARLIS, J. V. 1977. A prac- zation in distributed database systems. In Ad- tical approach to selecting record access paths. vances in Computers, vol. 21. Academic Press, ACM Comput. Surv. 9, 4 (Dec.) 259-272. New York, pp. 225-273. SHMUELI, 0. 1981. The fundamental role of tree SAGALOWICZ, D. 1977. IDA: An intelligent data ac- schemas in relational query processing. Ph.D. cess program. In Proceedings of the 3rd Interna- thesis, Computer Science Dept., Harvard Univ., tional Conference on Very Large Data Bases (To- Cambridge, Mass. kyo, Oct. 6-8). IEEE, New York, pp. 293-302. SHNEIDERMAN, B. 1977. Reduced combined indexes SAGIV, Y. 1981. Optimization of Queries in Relational for efficient multiple attribute retrieval. Znf. Syst. Databases. UMI Research Press, Ann Arbor, 2, 4, 149-154. Michigan. SHNEIDERMAN, B., AND GOODMAN, V. 1976. SAGIV, Y. 1983. Quadratic algorithms for minimizing Batched searching of sequential and tree struc- joins in restricted relational expressions. SIAM tured tiles. ACM Trans. Database Syst. 1, 3 J. Comput. 12, 2, 316328. (Sept.), 268-275, SAGIV, Y., AND YANNAKAKIS, M. 1980. Equivalences SHOSHANI, A. 1982. Statistical database: Character- among relational expressions with the union and istics, problems, and some solutions. In Proceed- difference operators. J. ACM 27, 4 (Oct.) 633- ings of the 8th International Conference on Very 655. Large Data Bases (Mexico City). VLDB Endow- SALTON, G., AND WONG, A. 1978. Generation and ment, Saratoga, Calif., pp. 208-222. search of clustered files. ACM Trans. Database SHULTZ, R. K., AND ZINGG, R. J. 1984. Response Syst. 3, 4 (Dec.) 321-346. time analysis of multiprocessor computers for SCHEK, H.-J., AND PISTOR, P. 1982. Data structures database support. ACM Trans. Database Syst. 9, for an integrated database management and in- 1 (Mar.), 100-132. formation retrieval system. In Proceedings of the SMITH, J. M., AND CHANG, P. Y. T. 1975. Optimizing 8th International Conference on Very Large Data the performance of a relational algebra database Bases (Mexico City). VLDB Endowment, Sara- interface. Commtin. ACM 18, 10 (Oct.), 568-579. toga, Calif., pp. 197-207. SMITH, J. M., BERNSTEIN, P. A., DAYAL, U., GOOD- SCHENK, K. L., AND PINKERT, J. R. 1977. An algo- MAN, N., LANDERS, T., LIN, K. W. T., AND rithm for servicing multi-relational queries. In WONG, E. 1981. MULTIBASE-Integrating Proceedings of the ACM-SIGMOD International heterogeneous distributed database systems. In Conference on Management of Data (Toronto, Proceedings of the AFZPS National Computer Canada, Aug. 3-5). ACM, New York, pp. 10-20. Conference (Chicago, May 4-7), vol. 50. AFIPS SCHKOLNICK, M. 1975. The optimal selection of in- Press, Reston, Va., pp. 487-499. dexes for files. Znf. Syst. 1, 4, 141-146. SOCKUT, G. H., AND GOLDBERG, R. P. 1979. SCHKOLNICK, M. 1982. Physical database design Database reorganization-Principles and prac- techniques. In Data Base Design Techniques 71, tice. ACM Comput. Surv. 11, 4 (Dec.) 371-395. S. B. Yao and T. L. Kunii. Eds. Snrineer. New STONEBRAKER, M. 1975. Implementation of integ- York, pp. 229-252. ’ A y ’ rity constraints and views by query modification. SCHMIDT, J. W. 1977. Some high level language con- In Proceedings of the ACM-SZGMOD Znterna- structs for data of type relation. ACM Trans. tional Conference on Management of Data (San Database Syst. 2, 3 (Sept.), 247-261. Jose, Calif., May 14-16). ACM, New York, pp. SCHMIDT, J. W. 1979. Parallel processing of rela- 65-78. tions: A single-assignment approach. In Proceed- STONEBRAKER, M., AND NEUHOLD, E. 1977. A dis- ings of the 5th International Conference on Very tributed database version of INGRES. In Pro- Large Data Bases (Rio de Janeiro, Oct. 3-5). ceedings of the 2nd Berkeley Workshop on Dis- IEEE, New York, pp. 398-408. tributed Data Management-and Computer Net- SCHMIDT, J. W. 1984. Database programming: Lan- works (Berkeley. Calif.). Universitv of California. guage constructs and execution models. In Pro- Berkeley. - grammiersprachen und Programmentwicklung, U. STONEBRAKER, M., WONG, E., KREPS, P., AND HELD, Ammann, Ed. Springer-Verlag, Berlin, pp. l-26. G. 1976. The design and implementation of SELINGER, P. G., AND ADIBA, M. 1980. Access path INGRES. ACM Trans. Database Syst. 1,3 (Sept.), selection in distributed database systems. IBM 189-222.

Computing Surveys,Vol. 16, No. 2, June 1984 Query Optimization in Database Systems l 151

STROET, J. W. M., AND ENGMANN, R. 1979. WALKER, A. 1980. On retrieval from a small version Manipulation of expressions in a relational alge- of a large data base. In Proceedings of the 6th bra. Znf. Syst. 4, 4,195-203. International Conference on Very Large Data SU, S. Y. W. 1979. Cellular-logic devices: Concepts Bases (Montreal, Oct. l-3). IEEE, New York, pp. and applications. IEEE Comput. 12, 3,11-25. 47-54. Su, S. Y. W., AND LIPKOVSKI, G. 1975. CASSM: A WARREN, D. H. D. 1981. Efficient processing of in- cellular system for very large databases. In Pro- teractive relational database queries expressed in ceedings of the 1st Znternational Conference on logic. In Proceedings of the 7th International Con- Very Large Data Bases (Framingham, Mass., ference on Very Large Data Bases (Cannes, Sept. Sept. 22-24). ACM, New York, pp. 456-472. ‘9-11). IEEE, New York, pp. 272-281. Su, S. Y. W., AND MIKKILINENI, K. P. 1982. Parallel WELCH, J. W., AND GRAHAM, J. W. 1976. Retrieval algorithms and their implementation in MICRO- using ordered lists in inverted and multilist files. NET. In Proceedings of the 8th International In Proceedings of the ACM-SZGMOD Znterna- Conference on Very Large Data Bases (Mexico tional Conference on Management of Data (Wash- City). VLDB Endowment, Saratoga, Calif., pp. ington, D.C., June 2-4). ACM, New York, pp. 21- 310-324. 29. TANENBAUM, A. 1981. Computer Networks. Pren- WHANG, K.-Y., WIEDERHOLD, G., AND SAGALOWICZ, tice-Hall, New York. D. 1983. Estimating block accesses in database TEOREY, T. J., AND FRY, J. P. 1982. Design of Da- organizations: A closed noniterative formula. tabase Structures. Prentice-Hall, New York. Commun. ACM 26, 11 (Nov.), 940-944. TODD, S. 1974. Implementing the join operator in WILLIAMS. R.. DANIELS, D., HAAS, L., LAPIS, G., relational data bases. IBM Scientific Center LINDSAY, R., NG, P., OBERMARCK, R., SELINGER, Tech. Note 15, IBM UK Scientific Center, Peter- P.. WALKER. A.. WILMS, P., AND YOST, R. lee, England. i982. R*: An overview of the architecture. In TSICHRITZIS, D. 1976. LSL: A link and selector lan- Proceedings of the International Conference on guage. In Proceedings of the ACM-SZGMOD Zn- Database Systems (Jerusalem, Israel). ternational Conference on Management of Data WONG, E. 1977. Retrieving dispersed data from (Washington, D.C., June 2:4). ACM, New York, SDD-1: A svstem for distributed databases. In pp. 123-133. Proceedings of the Second Berkeley Workshop on ULLMAN, J. D. 1982. Principles of Database Systems. Distributed Data Management and Computer Computer Science Press, Rockville, Md. Networks (Berkeley, Calif.), pp. 217-235. VALDURIEZ, P. 1982. Semi-join algorithms for mul- WONG, E. 1983. Dynamic rematerialization: Proc- tinrocessor svstems. In Proceedings of the ACM- essing distributed queries using redundant data. SZGMOD Zn~ernational Conference on Mannge- IEEE Trans. Softw. Eng. SE-g, 3,228-232. ment of Data (Orlando, Fla., June 2-4). ACM, WONG, E., AND KATZ, R. H. 1983. Distributing a New York, pp. 225-233. database for parallelism. In SZGMOD 83, Pro- VALDURIEZ, P., AND GARDARIN, G. 1984. Join and ceedings of the Annual Meeting (San Jose, Calif., semijoin algorithms for a multiprocessor database May 23-26). ACM, New York, pp. 23-29. machine. ACM Trans. Database Syst. 9. 1 (Mar.), WONG, E., AND YOUSSEFI, K. 1976. Decomposi- 133-161. tion-A strategy for query processing. ACM VAN DE RIET, R. P., WASSERMAN, A. I., KERSTEN, M. Trans. Database Syst. 1, 3 (Sept.), 223-241. L., AND DE JONGE, W. 1981. High-level pro- XV, G. D. 1983. Search control in semantic query gramming features for improving the efficiency optimization. Tech. Rep. #83-09, Computer and of a relational database system. ACM Trans. Da- Information Science Dept., University of Massa- tabase Syst. 6, 3 (Sept.), 464-485. chusetts, Amherst, Mass. VASSILIOU, Y., AND JARKE, M. 1984. Query lan- YANG, C.-S. 1977. Avoiding redundant accesses in guages-A taxonomy. In Human Factors and Zn- unsorted multilist file organizations. Znf. Syst. 2, teractioe Computer Systems, Y. Vassiliou, Ed. 4,155-158. Ablex, Norwood, N.J. YAO, S. B. 1977a. Approximating block accesses in VASSILIOU, Y., AND LOCHOVSKY, F. 1980. DBMS database organizations. Commun. ACM 20, 4 transaction translation. In Proceedings of the (Apr.), 260-261. IEEE COMPSAC Conference. IEEE, New York, YAO, S. B. 1977b. An attribute based model for da- pp. 89-96. tabase access cost analysis. ACM Trans. Database VASSILIOU, Y., CLIFFORD, J., AND JARKE, M. 1984. Syst. 2. 1 (Mar.), 45-67. Access to specific declarative knowledge in expert YAO, S. B. 1979. Optimization of query evaluation systems: The impact of logic programming. De- aleorithms. ACM Trans. Database Syst. 4, 2 cision SupZIort Syst. 1, 1. (June), 133-155. VERHOFSTAD, J. S. M. 1978. Recovery techniques YAO, S. B., AND DFJONG, D. 1978. Evaluation of for database systems. ACM Comput. Suru. 10, 2 database access paths. In Proceedings of the (June), 167-195. ACM-SZGMOD Znternational Conference on

Computing Surveys,Vol. 16, No. 2, June 1964 152 l M. Jarke and J. Koch

Management of Data (Austin, Tex., May 31-June YU, C. T., AND OZSOYOGLU, M. 1979, An algorithm 2). ACM, New York, pp. 66-77. for tree query membership of a distributed query. In Proceedings of the IEEE COMPSAC Confer- YOIJSSEFI, K., AND WONG, E. 1979. Query process- ence. IEEE, New York, pp. 306-312. ing in a relational database management system. C. T., LUK, W. S., AND SIU, M. K. 1976. On the In Proceedings of the 5th International Conference yu, on Very Large Data Bases (Rio de Janeiro, Oct. estimation of the number of desired records with 3-5). IEEE, New York, pp. 409-417. respect to a given query. ACM Trans. Database Syst. 3, 1 (Mar.), 41-56. Yu, C. T., AND CHANG, C. C. 1963. On the design of ZANIOLO, C. 1979. Design of relational views over a query processing strategy in a distributed da- network schemas. In Proceedings of the ACM- tabase environment. In SIGMOD 83, Proceedings SZGMOD International Conference on Manage- of the Annual Meeting (San Jose, Cahf., May 23- ment of Data (Boston, Mass., May BO-June 1). 26). ACM, New York, pp. 30-39. ACM, New York, pp. 179-190.

Received October 1983; final revision accepted April 1984.

Computing Surveys,Vol. 16, No. 2, June 1964