Query Evaluation Techniques for Large Databases

GOETZ GRAEFE

Portland State University, Computer Science Department, P. O. Box 751, Portland, Oregon 97207-0751

Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible systems will not solve this problem. On the contrary, modern data models exacerbate the problem: In order to manipulate large sets of complex objects as efficiently as today's database systems manipulate simple records, query processing algorithms and software will become more complex, and a solid understanding of algorithm and architectural issues is essential for the designer of database management software. This survey provides a foundation for the design and implementation of query execution facilities in new database management systems. It describes a wide array of practical query evaluation techniques for both relational and postrelational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set-matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.

Categories and Subject Descriptors: E.5 [Data]: Files; H.2.4 [Database Management]: Systems—query processing

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Complex query evaluation plans, dynamic query evaluation plans, extensible database systems, iterators, object-oriented database systems, operator model of parallelization, parallel algorithms, relational database systems, set-matching algorithms, sort-hash duality

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1993 ACM 0360-0300/93/0600-0073 $01.50

CONTENTS

INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
   2.1 Sorting
   2.2 Hashing
3. DISK ACCESS
   3.1 File Scans
   3.2 Associative Access Using Indices
   3.3 Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
   4.1 Aggregation Algorithms Based on Nested Loops
   4.2 Aggregation Algorithms Based on Sorting
   4.3 Aggregation Algorithms Based on Hashing
   4.4 A Rough Performance Comparison
   4.5 Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
   5.1 Nested-Loops Join Algorithms
   5.2 Merge-Join Algorithms
   5.3 Hash Join Algorithms
   5.4 Pointer-Based Joins
   5.5 A Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
   9.1 Parallel versus Distributed Database Systems
   9.2 Forms of Parallelism
   9.3 Implementation Strategies
   9.4 Load Balancing and Skew
   9.5 Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
   10.1 Parallel Selections and Updates
   10.2 Parallel Sorting
   10.3 Parallel Aggregation and Duplicate Removal
   10.4 Parallel Joins and Other Binary Matching Operations
   10.5 Parallel Universal Quantification
11. NONSTANDARD QUERY PROCESSING ALGORITHMS
   11.1 Nested Relations
   11.2 Temporal and Scientific Database Management
   11.3 Object-oriented Database Systems
   11.4 More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
   12.1 Precomputation and Derived Data
   12.2 Data Compression
   12.3 Surrogate Processing
   12.4 Bit Vector Filtering
   12.5 Specialized Hardware
SUMMARY AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

Effective and efficient management of large data volumes is necessary in virtually all computer applications, from business data processing to library information retrieval systems, multimedia applications with images and sound, computer-aided design and manufacturing, real-time process control, and scientific computation. While database management systems are standard tools in business data processing, they are only slowly being introduced to all the other emerging database application areas.

In most of these new application domains, database management systems have traditionally not been used for two reasons. First, restrictive data definition and manipulation languages can make application development and maintenance unbearably cumbersome. Research into semantic and object-oriented data models and into persistent database programming languages has been addressing this problem and will eventually lead to acceptable solutions. Second, data volumes might be so large or complex that the real or perceived performance advantage of file systems is considered more important than all other criteria, e.g., the higher levels of abstraction and programmer productivity typically achieved with database management systems. Thus, object-oriented database management systems that are designed for nontraditional database application domains and extensible database management system toolkits that support a variety of data models must provide excellent performance to meet the challenges of very large data volumes, and techniques for manipulating large data sets will find renewed and increased interest in the database community.

The purpose of this paper is to survey efficient algorithms and software architectures of database query execution engines for executing complex queries over large databases. A "complex" query is one that requires a number of query-processing algorithms to work together, and a "large" database uses files with sizes from several megabytes to many terabytes, which are typical for database applications at present and in the near future [Dozier 1992; Silberschatz et al. 1991]. This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression.

While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over "bulk" data types such as sets and lists.

It is assumed that the reader possesses basic textbook knowledge of database query languages, in particular of relational algebra, and of file systems, including some basic knowledge of index structures. As shown in Figure 1, query processing fills the gap between database query languages and file systems. It can be divided into query optimization and query execution. A query optimizer translates a query expressed in a high-level query language into a sequence of operations that are implemented in the query execution engine or the file system. The goal of query optimization is to find a query evaluation plan that minimizes the most relevant performance measure, which can be the database user's wait for the first or last result item, CPU, I/O, and network time and effort (time and effort can differ due to parallelism), memory costs (as maximum allocation or as time-space product), total resource usage, even energy consumption (e.g., for battery-powered laptop systems or spacecraft), a combination of the above, or some other performance measure. Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization—it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.

Figure 1. Query processing in a database system. (The figure shows the layers User Interface, Database Query Language, Query Optimizer, Query Execution Engine, Files and Indices, I/O Buffer, and Disk.)

A general outline of the steps required for processing a database query is shown in Figure 2. Of course, this sequence is only a general guideline, and different database systems may use different steps or merge multiple steps into one. After a query or request has been entered into the database system, be it interactively or by an application program, the query is parsed into an internal form. Next, the query is validated against the metadata (data about the data, also called schema or catalogs) to ensure that the query contains only valid references to existing database objects. If the database system provides a macro facility such as relational views, referenced macros and views are expanded into the query [Stonebraker 1975]. Integrity constraints might be expressed as views (externally or internally) and would also be integrated into the query at this point in most systems [Motro 1989]. The query optimizer then maps the expanded query expression into an optimized plan that operates directly on the stored database objects. This mapping process can be very complex and might require substantial search and cost estimation effort. (Optimization is not discussed in this paper; a survey can be found in Jarke and Koch [1984].) The optimizer's output is called a query execution plan, query evaluation plan, QEP, or simply plan. Using a simple tree traversal algorithm, this plan is translated into a representation ready for execution by the database's query execution engine; the result of this translation can be compiled machine code or a semicompiled or interpreted language or data structure.

This survey discusses only read-only queries explicitly; however, most of the techniques are also applicable to update requests. In most database management systems, update requests may include a search predicate to determine which database objects are to be modified.

Figure 2. Query processing steps: Parsing, Query Validation, View Resolution, Optimization, Plan Compilation, Execution.

Standard query optimization and execution techniques apply to this search; the actual update procedure can be either applied in a second phase, a method called deferred updates, or merged into the search phase if there is no danger of creating ambiguous update semantics.¹ The problem of ensuring ACID semantics for updates—making updates Atomic (all-or-nothing semantics), Consistent (translating any consistent database state into another consistent database state), Isolated (from other queries and requests), and Durable (persistent across all failures)—is beyond the scope of this paper; suitable techniques have been described by many other authors, e.g., Bernstein and Goodman [1981], Bernstein et al. [1987], Gray and Reuter [1991], and Haerder and Reuter [1983].

Most research into providing ACID semantics focuses on efficient techniques for processing very large numbers of relatively small requests. For example, increasing the balance of one account and decreasing the balance of another account require exclusive access to only two database records and writing some information to an update log. Current research and development efforts in transaction processing target hundreds and even thousands of small transactions per second [Davis 1992; Serlin 1991]. Query processing, on the other hand, focuses on extracting information from a large amount of data without actually changing the database. For example, printing reports for each branch office with average salaries of employees under 30 years old requires shared access to a large number of records. Mixed requests are also possible, e.g., for crediting monthly earnings to a stock account by combining information about a number of sales transactions. The techniques discussed here apply to the search effort for such a mixed request, e.g., for finding the relevant sales transactions for each stock account.

Embedded queries, i.e., database queries that are contained in an application program written in a standard programming language such as Cobol, PL/1, C, or Fortran, are also not addressed specifically in this paper because all techniques discussed here can be used for interactive as well as embedded queries. Embedded queries usually are optimized when the program is compiled, in order to avoid the optimization overhead when the program runs. This method was pioneered in System R, including mechanisms for storing optimized plans and invalidating stored plans when they become infeasible, e.g., when an index is dropped from the database [Chamberlin et al. 1981b]. Of course, the cut between compile-time and run-time can be placed at any other point in the sequence in Figure 2.

Recursive queries are omitted from this survey, because the entire field of recursive query processing—optimization rules and heuristics, selectivity and cost estimation, algorithms and their parallelization—is still developing rapidly (suffice it to point to two recent surveys by Bancilhon and Ramakrishnan [1986] and Cacace et al. [1993]).

The present paper surveys query execution techniques; other surveys that pertain to the wide subject of database systems have considered data models and query languages [Gallaire et al. 1984; Hull and King 1987; Jarke and Vassiliou 1985; McKenzie and Snodgrass 1991; Peckham and Maryanski 1988], access methods [Comer 1979; Enbody and Du 1988; Faloutsos 1985; Samet 1984; Sockut and Goldberg 1979], compression techniques [Bell et al. 1989; Lelewer and Hirschberg 1987], distributed and heterogeneous systems [Batini et al. 1986; Litwin et al. 1990; Sheth and Larson 1990; Thomas et al. 1990], concurrency control and recovery [Barghouti and Kaiser 1991; Bernstein and Goodman 1981; Gray et al. 1981; Haerder and Reuter 1983; Knapp 1987], availability and reliability [Davidson et al. 1985; Kim 1984], query optimization [Jarke and Koch 1984; Mannino et al. 1988; Yu and Chang 1984], and a variety of other database-related topics [Adam and Wortmann 1989; Atkinson and Bunemann 1987; Katz 1990; Kemper and Wallrath 1987; Lyytinen 1987; Teoroy et al. 1986]. Bitton et al. [1984] have discussed a number of parallel-sorting techniques, only a few of which are really used in database systems. Mishra and Eich's [1992] recent survey of relational join algorithms compares their behavior using diagrams derived from one by Kitsuregawa et al. [1983] and also describes join methods using index structures and join methods for distributed systems. The present survey is much broader in scope as it also considers system architectures for complex query plans and for parallel execution, selection and aggregation algorithms, the relationship of sorting and hashing as it pertains to database query processing, special operations for nontraditional data models, and auxiliary techniques such as compression.

Section 1 discusses the architecture of query execution engines. Sorting and hashing, the two general approaches to managing and matching elements of large sets, are described in Section 2. Section 3 focuses on accessing large data sets on disk. Section 4 begins the discussion of actual data manipulation methods with algorithms for aggregation and duplicate removal, continued in Section 5 with binary matching operations such as join and intersection and in Section 6 with operations for universal quantification. Section 7 reviews the many dualities between sorting and hashing and points out their differences that have an impact on the performance of algorithms based on either one of these approaches. Execution of very complex query plans with many operators and with nontrivial plan shapes is discussed in Section 8. Section 9 is devoted to mechanisms for parallel execution, including architectural issues and load balancing, and Section 10 discusses specific parallel algorithms. Section 11 outlines some nonstandard operators for emerging database applications such as statistical and scientific database management systems. Section 12 is a potpourri of additional techniques that enhance the performance of many algorithms, e.g., compression, precomputation, and specialized hardware. The final section contains a brief summary and an outlook on query processing research and its future.

For readers who are more interested in some topics than others, most sections are fairly self-contained. Moreover, the hurried reader may want to skip the derivation of cost functions;² their results and effects are summarized later in diagrams.

¹ A standard example for this danger is the "Halloween" problem: Consider the request to "give all employees with salaries greater than $30,000 a 3% raise." If (i) these employees are found using an index on salaries, (ii) index entries are scanned in increasing salary order, and (iii) the index is updated immediately as index entries are found, then each qualifying employee will get an infinite number of raises.

² In any case, our cost functions cover only a limited, though important, aspect of query execution cost, namely I/O effort.

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

This survey focuses on useful mechanisms for processing sets of items. These items can be records, tuples, entities, or objects. Furthermore, most of the techniques discussed in this survey apply to sequences, not only sets, of items, although most query processing research has assumed relations and sets. All query processing algorithm implementations iterate over the members of their input sets; thus, sets are always represented by sequences. Sequences can be used to represent not only sets but also other one-dimensional "bulk" types such as lists, arrays, and time series, and many database query processing algorithms and techniques can be used to manipulate these other bulk types as well as sets. The important point is to think of these algorithms as algebra operators consuming zero or more inputs (sets or sequences) and producing one (or sometimes more) outputs. A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.

The physical algebra is equivalent to, but quite different from, the logical algebra of the data model or the database system. The logical algebra is more closely related to the data model and defines what queries can be expressed in the data model; for example, the relational algebra is a logical algebra. A physical algebra, on the other hand, is system specific. Different systems may implement the same data model and the same logical algebra but may use very different physical algebras. For example, while one relational system may use only nested-loops joins, another system may provide both nested-loops join and merge-join, while a third one may rely entirely on hash join algorithms. (Join algorithms are discussed in detail later in the section on binary matching operators and algorithms.)

Another significant difference between logical and physical algebras is the fact that specific algorithms and therefore cost functions are associated only with physical operators, not with logical algebra operators. Because of the lack of an algorithm specification, a logical algebra expression is not directly executable and must be mapped into a physical algebra expression. For example, it is impossible to determine the execution time for the left expression in Figure 3, i.e., a logical algebra expression, without mapping it first into a physical algebra expression such as the query evaluation plan on the right of Figure 3. This mapping process can be trivial in some database systems but usually is fairly complex in real database systems because it involves algorithm choices and because logical and physical operators frequently do not map directly into one another, as shown in the following four examples. First, some operators in the physical algebra may implement multiple logical operators. For example, all serious implementations of relational join algorithms include a facility to output fewer than all attributes, i.e., a relational delta-project (a projection without duplicate removal) is included in the physical join operator. Second, some physical operators implement only part of a logical operator. For example, a duplicate removal algorithm implements only the "second half" of a relational projection operator. Third, some physical operators do not exist in the logical algebra. Concretely, a sort operator has no place in pure relational algebra because it is an algebra of sets, and sets are, by their definition, unordered. Finally, some properties that hold for logical operators do not hold, or only with some qualifications, for the counterparts in physical algebra. For example, while intersection and union are entirely symmetric and commutative, algorithms implementing them (e.g., nested loops or hybrid hash join) do not treat their two inputs equally.

The difference of logical and physical algebras can also be looked at in a different way. Any database system raises the level of abstraction above files and records; to do so, there are some logical type constructors such as tuple, relation, set, list, array, pointer, etc. Each logical type constructor is complemented by some operations that are permitted on instances of such types, e.g., attribute extraction, selection, insertion, deletion, etc.


Figure 3. Logical and physical algebra expressions. (Left, a logical expression: Intersection of Set A and Set B. Right, the corresponding query evaluation plan: Merge-Join (Intersect) over two Sort operators, each fed by a File Scan of A and B, respectively.)
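To make the logical-to-physical mapping just described concrete, the following toy sketch assumes a hypothetical set of operator names and a deliberately simplistic selection rule; a real optimizer's algorithm choice is far more involved.

```c
/* Toy sketch of mapping logical operators to physical algorithms.
 * Operator names and the selection rule are illustrative assumptions. */
#include <stdio.h>

typedef enum { LOG_INTERSECTION, LOG_JOIN, LOG_SELECT } logical_op;
typedef enum { PHYS_MERGE_JOIN, PHYS_HASH_JOIN, PHYS_NESTED_LOOPS, PHYS_FILTER } physical_op;

/* One logical operator may map to several physical algorithms; a system
 * might choose, e.g., based on whether sorted inputs are already available. */
static physical_op choose_algorithm(logical_op op, int inputs_sorted) {
    switch (op) {
    case LOG_INTERSECTION:               /* intersection treated like a join */
    case LOG_JOIN:
        return inputs_sorted ? PHYS_MERGE_JOIN : PHYS_HASH_JOIN;
    case LOG_SELECT:
        return PHYS_FILTER;
    }
    return PHYS_NESTED_LOOPS;            /* fallback */
}

int main(void) {
    /* with sorted inputs, intersection maps to merge-join as in Figure 3 */
    printf("%d\n", choose_algorithm(LOG_INTERSECTION, 1));
    return 0;
}
```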

On the physical or representation level, there is typically a smaller set of representation types and structures, e.g., file, record, record identifier (RID), and maybe very large byte arrays [Carey et al. 1986]. For manipulation, the representation types have their own operations, which will be different from the operations on logical types. Multiple logical types and type constructors can be mapped to the same physical concept. There may also be situations in which one logical type constructor can be mapped to multiple physical concepts, e.g., a set depending on its size. The mapping from logical types to physical representation types and structures is called physical database design. Query optimization is the mapping from logical to physical operations, and the query execution engine is the implementation of operations on physical representation types and of mechanisms for coordination and cooperation among multiple such operations in complex queries. The policies for using these mechanisms are part of the query optimizer.

Synchronization and data transfer between operators is the main issue to be addressed in the architecture of the query execution engine. Imagine a query with two joins, and consider how the result of the first join is passed to the second one. The simplest method is to create (write) and read a temporary file. The need for temporary files, whether they are kept in the buffer or not, is a direct result of executing an operator's input subplans completely before starting the operator. Alternatively, it is possible to create one process for each operator and then to use interprocess communication mechanisms (e.g., pipes) to transfer data between operators, leaving it to the operating system to schedule and suspend operator processes as pipes are full or empty. While such data-driven execution removes the need for temporary disk files, it introduces another cost, that of operating system scheduling and interprocess communication. In order to avoid both temporary files and operating system scheduling, Freytag and Goodman [1989] proposed writing rule-based translation programs that transform a plan represented as a tree structure into a single iterative program with nested loops and other control structures. However, the required rule set is not simple, in particular for algorithms with complex control logic such as sorting, merge-join, or even hybrid hash join (to be discussed later in the section on matching).

The most practical alternative is to implement all operators in such a way that they schedule each other within a single operating system process. The basic idea is to define a granule, typically a single record, and to iterate over all granules comprising an intermediate query result.³ Each time an operator needs another granule, it calls its input (operator) to produce one. This call is a simple procedure call, much cheaper than interprocess communication since it does not involve the operating system.

³ It is possible to use multiple granule sizes within a single query-processing system and to provide special operators with the sole purpose of translating from one granule size to another. An example is a query processing system that uses records as an iteration granule except for the inputs of merge-join (see later in the section on binary matching), for which it uses "value packets," i.e., groups of records with equal join attribute values.

The calling operator waits (just as any calling routine waits) until the input operator has produced an item. That input operator, in a complex query plan, might require an item from its own input to produce an item; in that case, it calls its own input (operator) to produce one. Two important features of operators implemented in this way are that they can be combined into arbitrarily complex query evaluation plans and that any number of operators can execute and schedule each other in a single process without assistance from or interaction with the underlying operating system. This model of operator implementation and scheduling resembles very closely those used in relational systems, e.g., System R (and later SQL/DS and DB2), Ingres, Informix, and Oracle, as well as in experimental systems, e.g., the E programming language used in EXODUS [Richardson and Carey 1987], Genesis [Batory et al. 1988a; 1988b], and Starburst [Haas et al. 1989; 1990]. Operators implemented in this model are called iterators, streams, synchronous pipelines, row-sources, or similar names in the "lingo" of commercial systems.

To make the implementation of operators a little easier, it makes sense to separate the functions (a) to prepare an operator for producing data, (b) to produce an item, and (c) to perform final housekeeping. In a file scan, these functions are called open, next, and close procedures; we adopt these names for all operators. Table 1 gives a rough idea of what the open, next, and close procedures for some operators do, as well as the principal local state that needs to be saved from one invocation to the next. (Later sections will discuss sort and join operations in detail.) The first three examples are trivial, but the hash join operator shows how an operator can schedule its inputs in a nontrivial manner. The interesting observations are that (i) the entire query plan is executed within a single process, (ii) operators produce one item at a time on request, (iii) this model effectively implements, within a single process, (special-purpose) coroutines and demand-driven data flow, (iv) items never wait in a temporary file or buffer between operators because they are never produced before they are needed, (v) therefore this model is very efficient in its time-space-product memory costs, (vi) iterators can schedule any tree, including bushy trees (see below), and (vii) no operator is affected by the complexity of the whole plan, i.e., this model of operator implementation and synchronization works for simple as well as very complex query plans. As a final remark, there are effective ways to combine the iterator model with parallel query processing, as will be discussed in Section 9.

Since query plans are algebra expressions, they can be represented as trees. Query plans can be divided into prototypical shapes, and query execution engines can be divided into groups according to which shapes of plans they can evaluate. Figure 4 shows prototypical left-deep, right-deep, and bushy plans for a join of four inputs. Left-deep and right-deep plans are different because join algorithms use their two inputs in different ways; for example, in the nested-loops join algorithm, the outer loop iterates over one input (usually drawn as left input) while the inner loop iterates over the other input. The set of bushy plans is the most general as it includes the sets of both left-deep and right-deep plans. These names are taken from Graefe and DeWitt [1987]; left-deep plans are also called "linear processing trees" [Krishnamurthy et al. 1986] or "plans with no composite inner" [Ono and Lehman 1990].

For queries with common subexpressions, the query evaluation plan is not a tree but an acyclic directed graph (DAG). Most systems that identify and exploit common subexpressions execute the plan equivalent to a common subexpression separately, saving the intermediate result in a temporary file to be scanned repeatedly and destroyed after the last scan.


Table 1. Examples of Iterator Functions

Print
  Open: open input
  Next: call next on input; format the item on screen
  Close: close input
  Local state: (none)

Scan
  Open: open file
  Next: read next item
  Close: close file
  Local state: open file descriptor

Select
  Open: open input
  Next: call next on input until an item qualifies
  Close: close input
  Local state: (none)

Hash join (without overflow resolution)
  Open: allocate hash directory; open left "build" input; build hash table calling next on build input; close build input; open right "probe" input
  Next: call next on probe input until a match is found
  Close: close probe input; deallocate hash directory
  Local state: hash directory

Merge-Join (without duplicates)
  Open: open both inputs
  Next: get next item from input with smaller key until a match is found
  Close: close both inputs
  Local state: (none)

Sort
  Open: open input; build all initial run files calling next on input; close input; merge run files until only one merge step is left; open file descriptors for run files
  Next: determine next output item; read new item from the correct run file
  Close: destroy remaining run files
  Local state: merge heap, open file descriptors for run files
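The open-next-close protocol summarized in Table 1 can be made concrete with a small C sketch that wires a filter iterator on top of a scan iterator using function pointers. This is a minimal sketch in the spirit of the design described in this section, not actual Volcano code; the names, the integer items, and the in-memory "scan" are illustrative assumptions.

```c
/* A minimal sketch of demand-driven iterators (open/next/close);
 * all names are hypothetical, and real operators are far richer. */
#include <stdio.h>

typedef struct iterator {
    void (*open)(struct iterator *self);
    int  (*next)(struct iterator *self, int *item);  /* returns 0 at end of stream */
    void (*close)(struct iterator *self);
    void *state;                                     /* operator-specific state record */
} iterator;

/* --- scan operator: produces items from an in-memory array --- */
typedef struct { const int *data; int count; int pos; } scan_state;

static void scan_open(iterator *self)  { ((scan_state *)self->state)->pos = 0; }
static int  scan_next(iterator *self, int *item) {
    scan_state *s = self->state;
    if (s->pos >= s->count) return 0;
    *item = s->data[s->pos++];
    return 1;
}
static void scan_close(iterator *self) { (void)self; }

/* --- filter operator: calls next on its input until an item qualifies --- */
typedef struct { iterator *input; int (*predicate)(int); } filter_state;

static void filter_open(iterator *self) {
    filter_state *s = self->state;
    s->input->open(s->input);
}
static int filter_next(iterator *self, int *item) {
    filter_state *s = self->state;
    while (s->input->next(s->input, item))
        if (s->predicate(*item)) return 1;           /* pass qualifying item upward */
    return 0;
}
static void filter_close(iterator *self) {
    filter_state *s = self->state;
    s->input->close(s->input);
}

static int is_even(int x) { return x % 2 == 0; }

int main(void) {
    const int data[] = { 3, 8, 5, 12, 7, 4 };
    scan_state   ss = { data, 6, 0 };
    iterator scan   = { scan_open, scan_next, scan_close, &ss };
    filter_state fs = { &scan, is_even };
    iterator filter = { filter_open, filter_next, filter_close, &fs };

    int item;
    filter.open(&filter);                            /* the consumer paces the producers */
    while (filter.next(&filter, &item))
        printf("%d\n", item);
    filter.close(&filter);
    return 0;
}
```

Note how the scan produces items only when the filter demands them; no item is ever buffered between the two operators, which is exactly observation (iv) above.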

Figure 4. Left-deep, bushy, and right-deep plans. (Each plan joins the four inputs A, B, C, and D; the bushy plan combines intermediate results such as Join A-B and Join C-D in a final join.)

Each plan fragment that is executed as a unit is indeed a tree. The alternative is a "split" iterator that can deliver data to multiple consumers, i.e., that can be invoked as iterator by multiple consumer iterators. The split iterator paces its input subtree as fast as the fastest consumer requires it and holds items until the slowest consumer has consumed them. If the consumers request data at about the same rate, the split operator does not require a temporary spool file; such a file and its associated I/O cost are required only if the data rate required by the consumers diverges above some predefined threshold.
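The three plan shapes of Figure 4 can also be written down as toy trees; the node type and the hand-built trees below are illustrative assumptions, much simpler than real plan nodes.

```c
/* Toy plan trees for a join of four inputs A, B, C, D, illustrating the
 * left-deep, right-deep, and bushy shapes. Illustrative only. */
#include <stdio.h>

typedef struct plan {
    const char *op;                 /* "join" or the name of a stored input */
    struct plan *left, *right;      /* NULL for leaves (stored inputs)      */
} plan;

static plan A = {"A", 0, 0}, B = {"B", 0, 0}, C = {"C", 0, 0}, D = {"D", 0, 0};

/* left-deep: every join's right (inner) input is a stored input */
static plan ab_l  = {"join", &A, &B};
static plan abc_l = {"join", &ab_l, &C};
static plan left_deep = {"join", &abc_l, &D};

/* right-deep: every join's left input is a stored input */
static plan cd_r  = {"join", &C, &D};
static plan bcd_r = {"join", &B, &cd_r};
static plan right_deep = {"join", &A, &bcd_r};

/* bushy: both inputs of the top join are themselves intermediate results */
static plan ab_b  = {"join", &A, &B};
static plan cd_b  = {"join", &C, &D};
static plan bushy = {"join", &ab_b, &cd_b};

static void print_plan(const plan *p) {
    if (!p->left) { printf("%s", p->op); return; }
    printf("(");
    print_plan(p->left);
    printf(" %s ", p->op);
    print_plan(p->right);
    printf(")");
}

int main(void) {
    print_plan(&left_deep);  printf("\n");   /* (((A join B) join C) join D) */
    print_plan(&right_deep); printf("\n");   /* (A join (B join (C join D))) */
    print_plan(&bushy);      printf("\n");   /* ((A join B) join (C join D)) */
    return 0;
}
```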

A box stands for a a~ors independent of the sem%tics ‘and representation of items in the data streams they are processing. This organi- zation using function pointers for input ~Since each operator in such a query execution operators is fairly standard in commer- system will accessa permanent relation, the name cial database management systems. “accesspath selection” used for System R optimiza- In order to make this discussion more tion, although including and actually focusing on join optimization, was entirely correct and more concrete, Figure 5 shows two operators in descriptive than “query optimization.” a query evaluation plan that prints se-


Figure 5. Two operators in a Volcano query plan. (The filter operator is represented by a small structure pointing to open-filter, next-filter, and close-filter and to a state record holding its arguments (a print function), its input pointer, and its state; the input pointer leads to the file scan operator's structure with open-filescan, next-filescan, and close-filescan and a state record whose argument is a predicate and which has no input.)

The purpose and capabilities of the filter operator in Volcano include printing items of a stream using a print function passed to the filter operator as one of its arguments. The small structure at the top gives access to the filter operator's iterator functions (the open, next, and close procedures) as well as to its state record. Using a pointer to this structure, the open, next, and close procedures of the filter operator can be invoked, and their local state can be passed to them as a procedure argument. The filter's iterator functions themselves, e.g., open-filter, can use the input pointer contained in the state record to invoke the input operator's functions, e.g., open-file-scan. Thus, the filter functions can invoke the file scan functions as needed and can pace the file scan according to the needs of the filter.

In this section, we have discussed general physical algebra issues and synchronization and data transfer between operators. Iterators are relatively straightforward to implement and are suitable building blocks for efficient, extensible query processing engines. In the following sections, we consider individual operators and algorithms including a comparison of sorting and hashing, detailed treatment of parallelism, special operators for emerging database applications such as scientific databases, and auxiliary techniques such as precomputation and compression.

2. SORTING AND HASHING

Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are "alike" together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations. Therefore, we discuss these approaches first in general terms, without regard to specific algorithms. After a survey of specific algorithms for unary (aggregation, duplicate removal) and binary (join, semi-join, intersection, division, etc.) matching problems in the following sections, the duality of sort- and hash-based algorithms is discussed in detail.

2.1 Sorting

Sorting is used very frequently in database systems, both for presentation to the user in sorted reports or listings and for query processing in sort-based algorithms such as merge-join. Therefore, the performance effects of the many algorithmic tricks and variants of external sorting deserve detailed discussion in this survey. All sorting algorithms actually used in database systems use merging, i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output.

ACM Computing Surveys,Vol. 25, No. 2, June 1993 84 “ Goetz Graefe ing, i.e., the input data are written into algorithms, and the last question regard- initial sorted runs and then merged into ing substitute sorts will be touched on in larger and larger runs until only~ne run the section on surrogate processing. is left, the sorted output. Only in the All sort algorithms try to exploit the unusual case that a data set is smaller duality between main-memory mergesort than the available memorv can in- and quicksort. Both of these algorithms memory techniques such as q-uicksort be are recursive divide-and-conquer algo- used. An excellent reference for many of rithms. The difference is that merge sort the issues discussed here is Knuth [ 19731.-, first divides physically and then merges who analyzes algorithms much more ac- on logical keys, whereas quicksort first curately than we do in this introductory divides on logical keys and then com- survey. bines physically by trivially concatenat- In order to ensure that the sort module ing sorted subarrays. In general, one of interfaces well with the other operators, the two phases—dividing and combining e.g., file scan or merge-join, sorting should —is based on logical keys, whereas the be im~lemented as an iterator. i.e.. with other arranges data items only physi- open, ‘next, and close procedures as all cally. We call these the logical and the other operators of the physical algebra. physical phases. Sorting algorithms for In the Volcano query-processing system verv large data sets stored on disk or (which is based on iterators), most of the tap: are-also based on dividing and com- sort work is done during open -sort bining. Usually, there are two distinct [Graefe 1990a; 1993b]. This procedure sub algorithms, one for sorting within consumes the entire input and leaves ap- main memory and one for managing sub- propriate data structures for next-sort to sets of the data set on disk or tape. The produce the final sorted output. If the choices for mapping logical and physical entire input fits into the sort space in phases to dividing and combining steps main memory, open-sort leaves a sorted are independent for these two subalgo- array of pointers to records in 1/0 buffer rithms. For practical reasons, e.g., ensur- memory which is used by next-sort to ing that a run fits into main memory, the ~roduce the records in sorted order. If disk management algorithm typically ~he input is larger than main memory, uses physical dividing and logical com- the open-sort procedure creates sorted bining (merging). A point of practical im- runs and merges them until onlv one portance is the fan-in or degree of final merge ph~se is left, The last merge merging, but this is a parameter rather step is performed in the next-sort proce- than a defining algorithm property. dure, i.e., when demanded by the con- There are two alternative methods for sumer of the sorted stream, e.g., a creating initial runs, also called “level-O merge-join. The input to the sort module runs” here. First, an in-memory sort al- must be an iterator, and sort uses open, gorithm can be used, typically quicksort. next, and close procedures to request its Using this method, each run will have input; therefore, sort input can come from the size of allocated memory, and the a scan or a complex query plan, and the number of initial runs W will be W = sort operator can be inserted into a query [R\M] for input size R and memory size plan at any place or at several places. M. 
(Table 3 summarizes variables and Table 2, which summarizes a taxon- their meaning in cost calculations in this omy of parallel sort algorithms [ Graefe survey.) Second, runs can be produced 1990a], indicates some main characteris- using replacement selection [Knuth tics of database sort algorithms. The first 1973]. Replacement selection starts by few items apply to any database sort and filling memory with items which are or- will be discussed in this section. The ganized into a priority heap, i.e., a data questions pertaining to parallel inputs structure that efficiently supports the op- and outputs and to data exchange will be erations insert and remove-smallest. considered in a later section on parallel Next, the item with the smallest key is


Table 2. A Taxonomy of Database Sorting Algorithms

  Input division: logical keys (partitioning) or physical division
  Result combination: logical keys (merging) or physical concatenation
  Main-memory sort: quicksort or replacement selection
  Merging: eager or lazy or semi-eager; lazy and semi-eager with or without optimizations
  Read-ahead: no read-ahead or double-buffering or read-ahead with forecasting
  Input: single-stream or parallel
  Output: single-stream or parallel
  Number of data exchanges: one or multiple
  Data exchange: before or after local sort
  Sort objects: original records or key-RID pairs (substitute sort)

Next, the item with the smallest key is removed from the priority heap and written to a run file and then immediately replaced in the priority heap with another item from the input. With high probability, this new item has a key larger than the item just written and therefore will be included in the same run file. Notice that if this is the case, the first run file will be larger than memory. Now the second item (the currently smallest item in the priority heap) is written to the run file and is also replaced immediately in memory by another item from the input. This process repeats, always keeping the memory and the priority heap entirely filled. If a new item has a key smaller than the last key written, the new item cannot be included in the current run file and is marked for the next run file. In comparisons among items in the heap, items marked for the current run file are always considered "smaller" than items marked for the next run file. Eventually, all items in memory are marked for the next run file, at which point the current run file is closed, and a new one is created.

Using replacement selection, run files are typically larger than memory. If the input is already sorted or almost sorted, there will be only one run file. This situation could arise, for example, if a file is sorted on field A but should be sorted on A as the major and B as the minor sort key. If the input is sorted in reverse order, which is the worst case, each run file will be exactly as large as memory. If the input is random, the average run file will be twice the size of memory, except the first few runs (which get the process started) and the last run. On the average, the expected number of runs is about W = ⌈R/(2 × M)⌉ + 1, i.e., about half as many runs as created with quicksort. A more detailed discussion and an analysis of replacement selection were provided by Knuth [1973].

An additional difference between quicksort and replacement selection is the resulting I/O pattern during run creation. Quicksort results in bursts of reads and writes for entire memory loads from the input file and to initial run files, while replacement selection alternates between individual read and write operations. If only a single device is used, quicksort may result in faster I/O because fewer disk arm movements are required. However, if different devices are used for input and temporary files, or if the input comes as a stream from another operator, the alternating behavior of replacement selection may permit more overlap of I/O and processing and therefore result in faster sorting.

The problem with replacement selection is memory management. If input items are kept in their original pages in the buffer (in order to save copying data, a real concern for large data volumes), each page must be kept in the buffer until its last record has been written to a run file. On the average, half a page's records will be in the priority heap. Thus, the priority heap must be reduced to half the size (the number of items in the heap is one half the number of records that fit into memory), canceling the advantage of longer and fewer run files. The solution to this problem is to copy records into a holding space and to keep them there while they are in the priority heap and until they are written to a run file. If the input items are of varying sizes, memory management is more complex than for quicksort because a new item may not fit into the space vacated in the holding space by the last item written into a run file. Solutions to this problem will introduce memory management overhead and some amount of fragmentation, i.e., the size of runs will be less than twice the size of memory. Thus, the advantage of having fewer runs must be balanced with the different I/O pattern and the disadvantage of more complex memory management.
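The core of replacement selection as just described can be sketched in a few dozen lines. The sketch below sorts integer keys with a tiny in-memory heap and "writes" runs to standard output; buffering, variable-length records, and the memory management issues discussed above are deliberately ignored, and all names are illustrative.

```c
/* Replacement selection run generation, reduced to its core idea:
 * keep memory (a priority heap) full; an item smaller than the last key
 * written is tagged for the next run. Integer keys, heap of MEMORY items,
 * runs written to stdout. Illustrative sketch only. */
#include <stdio.h>

#define MEMORY 4

typedef struct { int run; int key; } entry;     /* ordered by (run, key) */

static entry heap[MEMORY];
static int heap_size = 0;

static int less(entry a, entry b) {
    return a.run != b.run ? a.run < b.run : a.key < b.key;
}
static void swap(int i, int j) { entry t = heap[i]; heap[i] = heap[j]; heap[j] = t; }

static void sift_down(int i) {
    for (;;) {
        int l = 2*i + 1, r = 2*i + 2, m = i;
        if (l < heap_size && less(heap[l], heap[m])) m = l;
        if (r < heap_size && less(heap[r], heap[m])) m = r;
        if (m == i) return;
        swap(i, m); i = m;
    }
}
static void sift_up(int i) {
    while (i > 0 && less(heap[i], heap[(i-1)/2])) { swap(i, (i-1)/2); i = (i-1)/2; }
}
static void push(entry e) { heap[heap_size] = e; sift_up(heap_size++); }
static entry pop(void)    { entry e = heap[0]; heap[0] = heap[--heap_size]; sift_down(0); return e; }

int main(void) {
    int input[] = { 17, 3, 29, 8, 22, 1, 35, 12, 6, 40, 2, 19 };
    int n = sizeof input / sizeof input[0], next = 0;
    int current_run = 0, last_key_written = 0;

    while (next < n && heap_size < MEMORY)          /* fill memory */
        push((entry){ 0, input[next++] });

    while (heap_size > 0) {
        entry e = pop();                            /* smallest eligible item */
        if (e.run != current_run) {                 /* current run exhausted */
            current_run = e.run;
            printf("--- new run ---\n");
        }
        printf("run %d: %d\n", current_run, e.key);
        last_key_written = e.key;
        if (next < n) {                             /* refill memory immediately */
            int k = input[next++];
            /* a key smaller than the last one written must go to the next run */
            push((entry){ k < last_key_written ? current_run + 1 : current_run, k });
        }
    }
    return 0;
}
```

Running this on the sample input produces a first run of seven keys and a second run of five, illustrating why replacement selection yields runs that are, on average, longer than memory.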


Table 3. Variables, Their Meaning, and Units

  M: memory size (pages)
  R, S: inputs or their sizes (pages)
  C: cluster or unit of I/O (pages)
  F, K: fan-in or fan-out (none)
  W: number of level-0 run files (none)
  L: number of merge levels (none)

The level-0 runs are merged into level-1 runs, which are merged into level-2 runs, etc., to produce the sorted output. During merging, a certain amount of buffer memory must be dedicated to each input run and the merge output. We call the unit of I/O a cluster in this survey, which is a number of pages located contiguously on disk. We indicate the cluster size with C, which we measure in pages just like memory and input sizes. The number of I/O clusters that fit in memory is the quotient of memory size and cluster size. The maximal merge fan-in F, i.e., the number of runs that can be merged at one time, is this quotient minus one cluster for the output. Thus, F = ⌊M/C⌋ − 1. Since the sizes of runs grow by a factor F from level to level, the number of merge levels L, i.e., the number of times each item is written to a run file, is logarithmic with the input size, namely L = ⌈log_F(W)⌉.

There are four considerations that can improve the merge efficiency. The first two issues pertain to scheduling of I/O operations. First, scans are faster if read-ahead and write-behind are used; therefore, double-buffering using two pages of memory per input run and two for the merge output might speed the merge process [Salzberg 1990; Salzberg et al. 1990]. The obvious disadvantage is that the fan-in is cut in half. However, instead of reserving 2 × F + 2 clusters, a predictive method called forecasting can be employed in which the largest key in each input buffer is used to determine from which input run the next cluster will be read. Thus, the fan-in can be set to any number in the range ⌊M/(2 × C)⌋ − 2 ≤ F ≤ ⌊M/C⌋ − 1. One or two read-ahead buffers per input disk are sufficient, and F = ⌊M/C⌋ − 3 will be reasonable in most cases because it uses maximal fan-in with one forecasting input buffer and double-buffering for the merge output.

Second, if the operating system and the I/O hardware support them, using large cluster sizes for the run files is very beneficial. Larger cluster sizes will reduce the fan-in and therefore may increase the number of merge levels.⁵ However, each merging level is performed much faster because fewer I/O operations and disk seeks and latency delays are required. Furthermore, if the unit of I/O is equal to a disk track, rotational latencies can be avoided entirely with a sufficiently smart disk controller.

⁵ In files storing permanent data, large clusters (units of I/O) containing many records may also create artificial buffer contention (if much more disk space is copied into the buffer than truly necessary for one record) and "false sharing" in environments with page (cluster) locks, i.e., artificial concurrency conflicts. Since run files in a sort operation are not shared but temporary, these problems do not exist in this context.

Usually, relatively small fan-ins with large cluster sizes are the optimal choice, even if the sort requires multiple merge levels [Graefe 1990a]. The precise tradeoff depends on disk seek, latency, and transfer times. It is interesting to note that the optimal cluster size and fan-in basically do not depend on the input size.

As a concrete example, consider sorting a file of R = 50 MB = 51,200 KB using M = 160 KB of memory. The number of runs created by quicksort will be W = ⌈51200/160⌉ = 320. Depending on the disk access and transfer times (e.g., 25 ms disk seek and latency, 2 ms transfer time for a page of 4 KB), C = 16 KB will typically be a good cluster size for fast merging. If one cluster is used for read-ahead and two for the merge output, the fan-in will be F = ⌊160/16⌋ − 3 = 7. The number of merge levels will be L = ⌈log₇(320)⌉ = 3. If a 16 KB I/O operation takes T = 33 ms, the total I/O time, including a factor of two for writing and reading at each merge level, for the entire sort will be 2 × L × ⌈R/C⌉ × T = 10.56 min.

An entirely different approach to determining optimal cluster sizes and the amount of memory allocated to forecasting and read-ahead is based on processing and I/O bandwidths and latencies. The cluster sizes should be set such that the I/O bandwidth matches the processing bandwidth of the CPU. Bandwidths for both I/O and CPU are measured here in records or bytes per unit time; instructions per unit time (MIPS) are irrelevant. It is interesting to note that the CPU's processing bandwidth is largely determined by how fast the CPU can assemble new pages, in other words, how fast the CPU can copy records within memory. This performance measure is usually ignored in modern CPU and cache designs geared towards high MIPS or MFLOPS numbers [Ousterhout 1990].

Tuning the sort based on bandwidth and latency proceeds in three steps. First, the cluster size is set such that the processing and I/O bandwidths are equal or very close to equal. If the sort is I/O bound, the cluster size is increased for less disk access overhead per record and therefore faster I/O; if the sort is CPU bound, the cluster size is decreased to slow the I/O in favor of a larger merge fan-in. Next, in order to ensure that the two processing components (I/O and CPU) never (or almost never) have to wait for one another, the amount of space dedicated to read-ahead is determined as the I/O time for one cluster multiplied by the processing bandwidth. Typically, this will result in one cluster of read-ahead space per disk used to store and read input runs into a merge. Of course, in order to make read-ahead effective, forecasting must be used. Finally, the same amount of buffer space is allocated for the merge output (access latency times bandwidth) to ensure that merge processing never has to wait for the completion of output I/O. It is an open issue whether these two alternative approaches to tuning cluster size and read-ahead space result in different allocations and sorting speeds or whether one of them is more effective than the other.

The third and fourth merging issues focus on using (and exploiting) the maximal fan-in as effectively and often as possible. Both issues require adjusting the fan-in of the first merge step using the formula given below, either for the first merge step of all merge steps or, in semi-eager merging [Graefe 1990a], for the first merge step after the end of the input has been reached. This adjustment is used for only one merge step, called the initial merge here, not for an entire merge level.

The third issue to be considered is that the number of runs W is typically not a power of F; therefore, some merges proceed with fewer than F inputs, which creates the opportunity for some optimization. Instead of always merging runs of only one level together, the optimal strategy is to merge as many runs as possible using the smallest run files available. The only exception is the fan-in of the first merge, which is determined to ensure that all subsequent merges will use the full fan-in F.


Figure 6. Naive and optimized merging.

Let us explain this idea with the example shown in Figure 6. Consider a sort with a maximal fan-in F = 10 and an input file that requires W = 12 initial runs. Instead of merging only runs of the same level as shown in Figure 6, merging is delayed until the end of the input has been reached. In the first merge step, only 3 of the 12 runs are combined, and the result is then merged with the other 9 runs, as shown in Figure 6. The I/O cost (measured by the number of memory loads that must be written to any of the runs created) for the first strategy is 12 + 10 + 2 = 24, while for the second strategy it is 12 + 3 = 15. In other words, the first strategy requires 60% more I/O to temporary files than the second one. The general rule is to merge just the right number of runs after the end of the input file has been reached, and to always merge the smallest runs available for merging. More detailed examples are given in Graefe [1990a]. One consequence of this optimization is that the merge depth L, i.e., the number of run files a record is written to during the sort or the number of times a record is written to and read from disk, is not uniform for all records. Therefore, it makes sense to calculate an average merge depth (as required in cost estimation during query optimization), which may be a fraction. Of course, there are much more sophisticated merge optimizations, e.g., cascade and polyphase merges [Knuth 1973].

Fourth, since some operations require multiple sorted inputs, for example merge-join (to be discussed in the section on matching), and since sort output can be passed directly from the final merge into the next operation (as is natural when using iterators), memory must be divided among multiple final merges. Thus, the final fan-in f and the "normal" fan-in F should be specified separately in an actual sort implementation. Using a final fan-in of 1 also allows the sort operator to produce output into a very slow operator, e.g., a display operation that allows scrolling by a human user, without occupying a lot of buffer memory for merging input runs over an extended period of time.⁶

Considering the last two optimization options for merging, the following formula determines the fan-in of the first merge. Each merge with normal fan-in F will reduce the number of run files by F − 1 (removing F runs, creating one new one). The goal is to reduce the number of runs from W to f and then to 1 (the final output). Thus, the first merge should reduce the number of runs to f + k(F − 1) for some integer k. In other words, the first merge should use a fan-in of F₀ = ((W − f − 1) mod (F − 1)) + 2. In the example of Figure 6, (12 − 10 − 1) mod (10 − 1) + 2 results in a fan-in for the initial merge of F₀ = 3. If the sort of Figure 6 were the input into a merge-join and if a final fan-in of 5 were desired, the initial merge should proceed with a fan-in of F₀ = (12 − 5 − 1) mod (10 − 1) + 2 = 8.

⁶ There is a similar case of resource sharing among the operators producing a sort's input and the run generation phase of the sort. We will come back to these issues later in the section on executing and scheduling complex queries and plans.
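The initial-merge fan-in formula is easy to check mechanically; the sketch below evaluates F₀ for the two scenarios of Figure 6 (the function name and structure are illustrative assumptions).

```c
/* Initial-merge fan-in as derived in the text: F0 = ((W - f - 1) mod (F - 1)) + 2.
 * The two calls reproduce the Figure 6 examples. Illustrative only. */
#include <stdio.h>

static int initial_fan_in(int W, int F, int f) {
    return (W - f - 1) % (F - 1) + 2;
}

int main(void) {
    /* 12 initial runs, maximal fan-in 10, final merge with full fan-in 10 */
    printf("F0 = %d\n", initial_fan_in(12, 10, 10));  /* prints F0 = 3 */
    /* same sort feeding a merge-join that leaves room for a final fan-in of 5 */
    printf("F0 = %d\n", initial_fan_in(12, 10, 5));   /* prints F0 = 8 */
    return 0;
}
```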

ACM Computing Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 89 size of the two inputs. For example, if 50-page (200 KB) 1/0 buffer, varying the two merge-join inputs are 1 MB and 9 cluster size from 1 page (4 KB) to 15 MB, and if 20 clusters are available for pages (60 KB). The initial run size was inputs into the two final merges, then 2 1,600 records, for a total of 63 initial clusters should be allocated for the first runs. We counted the number of 1/0 and 18 clusters for the second input (1/9 operations and the transferred pages for = 2/18). all run files, and calculated the total 1/0 Sorting is sometimes criticized because cost by charging 25 ms per 1/0 operation it requires, unlike hybrid hashing (dis- (for seek and rotational latency) and 2 cussed in the next section), that the en- ms for each transferred page (assuming 2 tire input be written to run files and then MB/see transfer rate). As can be seen in retrieved for merging. This difference has Table 4 and Figure 7, there is an optimal a particularly large effect for files only cluster size with minimal 1/0 cost. The slightly larger than memory, e.g., 1.25 curve is not as smooth as might have times the size of memory. Hybrid hash- been expected from the approximate cost ing determines dynamically how much of function because the curve reflects all the input data truly must be written to real-system effects such as rounding temporary disk files. In the example, only (truncating) the fan-in if the cluster size slightly more than one quarter of the is not an exact divisor of the memory memory size must be written to tempo- size, the effectiveness of merge optimiza- rary files on disk while the remainder of tion varying for different fan-ins, and the file remains in memory. In sorting, internal fragmentation in clusters. The the entire file (1.25 memory sizes) is detailed data in Table 4, however, reflect written to one or two run files and then the trends that larger clusters and read for merging. Thus, sorting seems to smaller fan-ins clearly increase the require five times more 1/0 for tempo- amount of data transferred but not the rary files in this example than hybrid number of 1/0 operations (disk and la- hashing. However, this is not necessarily tency time) until the fan-in has shrunk true. The simple trick is to write initial to very small values, e.g., 3. It is clearly runs in decreasing (reverse) order. When suboptimal to always choose the smallest the input is exhausted, and merging in cluster size (1 page) in order to obtain increasing order commences, buffer the largest fan-in and fewest merge lev- memory is still full of useful pages with els. Furthermore, it seems that the range small sort keys that can be merged im- of cluster sizes that result in near- mediately without 1/0 and that never optimal total 1/0 costs is fairly large; have to be written to disk. The effect of thus it is not as important to determine writing runs in reverse order is compara- the exact value as it is to use a cluster ble to that of hybrid hashing, i.e., it size “in the right ball park.” The optimal is particularly effective if the input is fan-in is typically fairly small; however, only slightly larger than the available it is not e or 3 as derived by Bratberg- memory. sengen [1984] under the (unrealistic) To demonstrate the effect of cluster assumption that the cost of an 1/0 oper- size optimization (the second of the four ation is independent of the amount of merging issues discussed above), we data being transferred. 
sorted 100,000 100-byte records, about 10 MB, with the Volcano query process- 2.2 Hashing ing system, which includes all merge op- timization described above with the ex- For many matching tasks, hashing is an ception of read-ahead and forecasting. alternative to sorting. In general, when (This experiment and a similar one were equality matching is required, hashing described in more detail earlier [Graefe should be considered because the ex- 1990a; Graefe et al. 1993].) We used a pected complexity of set algorithms based sort space of 40 pages (160 KB) within a on hashing is O(N) rather than

ACM Computing Surveys, Vol. 25, No. 2, June 1993 90 “ Goetz Graefe

Table 4. Effect of Cluster Size Optlmlzations Cluster Pages Total 1/0 Size Average Disk Transferred cost [ X 4 KB] Fan-in Depth Operations [X4KB] [see]

1 40 1.376 6874 6874 185.598 2 20 1.728 4298 8596 124.642 3 13 1.872 3176 9528 98.456 4 10 1.936 2406 9624 79.398 5 8 2.000 1984 9920 69.440 6 6 2.520 2132 12792 78.884 7 5 2.760 1980 13860 77.220 8 5 2.760 1718 13744 70.438 9 4 3.000 1732 15588 74.476 10 4 3.000 1490 14900 67.050 11 3 3.856 1798 19778 84.506 12 3 3.856 1686 20232 82.614 13 3 3.856 1628 21164 83.028 14 2 5.984 2182 30548 115.646 15 2 5.984 2070 31050 113.850

150 Total 1/0 Cost 100 [SW] 50

0 i I I I I I I o 2.5 5 7.5 10 12.5 15

Cluster Size [x 4 KB]

Figure 7. Effect of cluster size optimizations

0( IV log N) as for sorting. Of course. this must fit into memory. However, if the makes ‘intuitive sense- if hashing is required hash table is larger than mem- viewed as radix sorting on a virtual key ory, hash table ouerfZow occurs and must [Knuth 1973]. be dealt with. Hash-based query processing algo- There are basically two methods for rithms use an in-memory hash table of managing hash table overflow, namely database objects to perform their match- avoidance and resolution. In either case, ing task. If the entire hash table (includ- the input is divided into multiple parti- ing all records or items) fits into memory, tion files such that partitions can be pro- hash-based query processing algorithms cessed independently from one another, are very easy to design, understand, and and the concatenation of the results of all implement, and they outperform sort- partitions is the result of the entire oper- based alternatives. Note that for binary ation. Partitioning should ensure that the matching operations, such as join or in- partitioning files are of roughly even size tersection, only one of the two inputs and can be done using either hash parti -

ACM Computing Surveys,Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 91 tioning or range partitioning, i.e., based on keys estimated to be quantiles. Usu- ally, partition files can be processed us- ing the original hash-based algorithm. The maximal partitioning fan-out F, i.e., number of partition files created, is de- ‘.+-.. termined by the memory size ill divided ‘- over the cluster size C minus one cluster ‘.< Partition for the partitioning input, i.e., F = [M/C ‘= Files – 11, just like the fan-in for sorting. ‘.< On Disk In hash table overflow avoidance, the ‘* input set is partitioned into F partition ‘.. files before any in-memory hash table is Iash ‘* / built. If it turns out that fewer partitions Directory1~ than have been created would have been sufficient to obtain partition files that Figure 8. Hybrid hashing. will fit into memory, bucket tuning (col- lapsing multiple small buckets into larger ones) and dynamic destaging (determin- rithms. As many hash buckets as possi- ing which buckets should stay in mem- ble are kept in memory, e.g., as linked ory) can improve the performance of lists as indicated by solid arrows. The hash-based operations [Kitsuregawa et other hash buckets are spooled to tempo- al. 1989a; Nakayama et al. 1988]. rary disk files, called the overflow or par- Algorithms based on hash table over- tition files, and are processed in later flow resolution start with the assumption stages of the algorithm. Hybrid hashing that overflow will not occur, but resort to is useful if the input size R is larger than basically the same set of mechanisms as the memory size AL?but smaller than the hash table overflow avoidance once it memory size multiplied by the fan-out F, does occur. No real system uses this naive i.e., M< R< FXM. hash table overflow resolution because In order to predict the number of 1/0 so-called hybrid hashing is as efficient operations (which actually is not neces- but more flexible. Hybrid hashing com- sary for execution because the algorithm bines in-memory hashing and overflow adapts to its input size but may be desir- resolution [DeWitt et al. 1984; Shapiro able for cost estimation during query 1986]. Although invented for relational optimization), the number of required join and known as hybrid hash join, hy- partition files on disk must be deter- brid hashing is equally applicable to all mined. Call this number K, which must hash-based query processing algorithms. satisfy O s ~ s F. Presuming that the Hybrid hash algorithms start out with assignment of buckets to partitions is the (optimistic) premise that no overflow optimal and that each partition file is will occur; if it does, however, they parti- equal to the memory size ikf, the amount tion the input into multiple partitions of of data that may be written to ~ parti- which only one is written immediately to tion files is equal to K x M. The number temporary files on disk. The other F – 1 of required 1/0 buffers is 1 for the input partitions remain in memory. If another and K for the output partitions, leaving overflow occurs, another partition is M – (K+ 1) x C memory for the hash written to disk. If necessary, all F parti- table. The optimal K for a given input tions are written to disk. Thus, hybrid size R is the minimal K for which K X hash algorithms use all available mem- M + (M – (K + 1) x C) > R. 
Solving ory for in-memory processing, but at the this inequality and taking the smallest same time are able to process large input such K results in K = [(~ – M i- files by overflow resolution, Figure 8 C)/(M – C)]. The minimal possible 1/0 shows the idea of hybrid hash algo- cost, including a factor of 2 for writing

ACM Computing Surveys,Vol. 25, No 2, June 1993 92 “ Goetz Graefe and reading the partition files and mea- ets in the hash directory for a fan-out of sured in the amount of data that must be 10 [Schneider 19901. In other words. the written or read, is 2 x (~ – (lvl – (~ + fan~out is set a pri~ri by the query opti- 1) X C)). To determine the 1/0 time, this mizer based on the expected (estimated) amount must be divided by the cluster input size. Since the page size in Gamma size and multiplied with the 1/0 time for is relatively small, only a fraction of one cluster. memorv is needed for outrmt buffers. and For example, consider an input of R = an in-memory hash tab~e can be ‘used 240 pages, a memory of M = 80 pages, even while output partitions are being and a cluster size of C = 8 pages. The written to disk. Second. in bucket tuning maximal fan-out is F = [80/8 – 1] = 9. and dynamic destaging [Kitsuregawa e; The number of partition files that need al. 1989a; Nakayama 1988], a large num- to be created on disk is K = [(240 – 80 ber of small partition files is created and + 8)/(80 – 8)] = 3. In other words, in the then collamed into fewer partition files best case, K X C = 3 X 8 = 24 pages will no larger ~han memory. I; order to ob- be used as output buffers to write K = 3 tain a large number of partition files and, partition files of no more than M = 80 at the same time. retain some memorv pages, and M–(K+l)x C=80–4 for a hash table, ‘the cluster size is sent X 8 = 48 pages of memory will be used quite small, e.g., C = 1 page, and the as the hash table. The total amount of fan-out is very large though not maxi- data written to and read from disk is mal, e.g., F = M/C/2. In the example 2 X (240 – (80 – 4 X 8)) = 384 pages. If above, F = 40 output partitions with an writing or reading a cluster of C = 8 average size of R/F = 6 pages could be pages takes 40 msec, the total 1/0 time created, even though only K = 3 output is 384/8 X 40 = 1.92 sec. partitions are required. The smallest In the calculation of K, we assumed an partitions are assigned to fill an in-mem- optimal assignment of hash buckets to ory hash table of size M – K X C = 80 – partition files. If buckets were assigned 3 x 1 = 77 pages. Hopefully, the dy- in the most straightforward way, e.g., by namic destaging rule—when an overflow dividing the hash directory into F occurs, assign the largest partition still equal-size regions and assigning the in memorv . to disk—ensures that indeed buckets of one region to a partition as the smallest partitions are retained in indicated in Figure 8, all partitions were memory. The partitions assigned to disk of nearly the same size, and either all or are collapsed into K = 3 partitions of no none of them will fit into their output more than M = 80 pages, to be processed cluster and therefore into memory. In in K = 3 subsequent phases. In binary other words, once hash table overflow operations such as intersection and rela- occurred, all input was written to parti- tional join, bucket tuning is quite effec- tion files. Thus, we presumed in the ear- tive for skew in the first input, i.e., if the lier calculations that hash buckets were hash value distribution is nonuniform assigned more intelligently to output and if the partition files are of uneven partitions. sizes. It avoids spooling parts of the sec- There are three ways to assign hash ond (typically larger) input to temporary buckets to partitions. First, each time a ~artition. files because the ~artitions. 
in hash table overflow occurs, a fixed num- memory can be matched immediately us- ber of hash buckets is assigned to a new ing a hash table in the memory not re- output partition. In the Gamma database auired as outtmt buffer and because a . L machine, the number of disk partitions is number of small partitions have been col- chosen “such that each bucket [bucket lapsed into fewer, larger partitions, in- here means what is called an output par- creasing the memory available for the tition in this survey] can reasonably be hash table. For skew in the second inmt. expected to fit in memory” [DeWitt and bucket tuning and dynamic desta~ng Gerber 1985], e.g., 10% of the hash buck- have no advantage. Another disadvan-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques 8 93 tage of bucket tuning and dynamic destaging is that the cluster size has to be relatively small, thus requiring a large number of 1/0 operations with disk seeks and rotational latencies to write data to the overflow files. Third, statistics gath- ered before hybrid hashing commences can be used to assign hash buckets to partitions [Graefe 1993a]. Unfortunately, it is possible that one or more partition files are larger than ~ x= memory. In that case, partitioning is used recursively until the file sizes have Figure 9. Recursive partitioning. shrunk to memory size. Figure 9 shows how a hash-based algorithm for a unary operation, such as aggregation or dupli- A major problem with hash-based algo- cate removal, partitions its input file over rithms is that their performance depends multiple recursion levels. The recursion on the quality of the hash function. In terminates when the files fit into mem- many situations, fairly simple hash func- ory. In the deepest recursion level, hy- tions will perform reasonably well. brid hashing may be employed. Remember that the purpose of using If the partitioning (hash) function is hash-based algorithms usually is to find good and creates a uniform hash value database items with a specific key or to distribution, the file size in each recur- bring like items together; thus, methods sion level shrinks by a factor equal to the as simple as using the value of a join key fan-out, and therefore the number of re- as a hash value will frequently perform cursion levels L is logarithmic with the satisfactorily. For string values, good size of the input being partitioned. After hash values can be determined by using L partitioning levels, each partition file binary “exclusive or” operations or by de- is of size R’ = R/FL. In order to obtain termining cyclic redundancy check (CRC) partition files suitable for hybrid hashing values as used for reliable data storage (with M < R’ < F X itf), the number of and transmission. If the quality of the full recursion levels L, i.e., levels at which hash function is a potential problem, uni- hybrid hashing is not applied, is L = versal hash functions should be consid- [log~(R/M)]. The 1/0 cost of the re- ered [Carter and Wegman 1979]. maining step using hybrid hashing can If the partitioning is skewed, the re- be estimated using the hybrid hash for- cursion depth may be unexpectedly high, mula above with R replaced by R’ and making the algorithm rather slow. This multiplying the cost with FL because is analogous to the worst-case perfor- hybrid hashing is used for this number of mance of quicksort, 0(iV2 ) comparisons partition files. Thus, the total 1/0 cost for an array of N items, if the partition- for partitioning an input and using hy- ing pivots are chosen extremely poorly brid hashing in the deepest recursion and do not divide arrays into nearly equal level is sub arrays. Skew is the major danger for inferior 2X RX L+2XFL performance of hash-based query- X( R’–(M– KX C)) processing algorithms, There are several ways to deal with skew. For hash-based =2x( Rx(L+l)– F” algorithms using overflow avoidance, X( M–K XC)) bucket tuning and dynamic destaging are quite effective. Another method is to ob- =2 X( RX(L+l)– FLX(M tain statistical information about hash -[(R’ –M)/(M - C)l X C)). values and to use it to carefully assign

ACM Computing Surveys, Vol. 25, No. 2, June 1993 94 “ Goetz Graefe hash buckets to partitions. Such statisti- into database systems supporting and ex- cal information can be ke~t in the form of ploiting a deep storage hierarchy is still histomams and can ei;her come from in its infancy. perm~nent system catalogs (metadata), On the other hand, some researchers from sampling the input, or from previ- have considered in-memory or main- ous recursion levels. For exam~le. for an memory databases, motivated both by the intermediate query processing’ result for desire for faster transaction and query- which no statistical parameters are processing performance and by the de- known a priori, the first partitioning level creasing cost of semi-conductor memory might have to proceed naively pretending [Analyti and Pramanik 1992; Bitton et that the partitioning hash function is al. 1987; Bucheral et al. 1990; DeWitt et perfect, but the second and further recur- al. 1984; Gruenwald and Eich 1991; Ku- sion levels should be able to use statistics mar and Burger 1991; Lehman and Carey gathered in earlier levels to ensure that 1986; Li and Naughton 1988; Severance each partitioning step creates even parti- et al. 1990; Whang and Krishnamurthy tions, . i.e., . that the data is ~artitioned. 1990]. However, for most applications, an with maximal effectiveness [Graefe analysis by Gray and Putzolo [ 1987] 1993a]. As a final resort, if skew cannot demonstrated that main memory is cost be managed otherwise, or if distribution effective only for the most frequently ac- skew is not the problem but duplicates cessed data, The time interval between are, some systems resort to algorithms accesses with equal disk and memory that are not affected by data or hash costs was five minutes for their values of value skew. For example, Tandem’s hash memory and disk prices and was ex- join algorithm resorts to nested-loops join pected to grow as main-memory prices (to be discussed later) [Zeller and Gray decrease faster than disk prices. For the 19901. purposes of this survey, we will presume As- for sorting, larger cluster sizes re- a disk-based storage architecture and will sult in faster 1/0 at the expense of consider disk 1/0 one of the major costs smaller fan-outs, with the optimal fan-out of query evaluation over large databases. being fairly small [Graefe 1993a; Graefe et al. 1993]. Thus, multiple recursion lev- 3.1 File Scans els are not uncommon for large files, and statistics gathered on one level to limit The first operator to access base data is skew effects on the next level are a real- the file scan, typically combined with istic method for large files to control a built-in selection facility. There is the performance penalties of uneven not much to be said about file scan ex- partitioning. cept that it can be made very fast using read-ahead, particularly, large-chunk (“track-at-a-crack”) read-ahead. In some 3. DISK ACCESS database systems, e.g., IBMs DB2, the All query evaluation systems have to read-ahead size is coordinated with the access base data stored in the data- free space left for future insertions dur- base. For databases in the megabyte to ing database reorganization. If a free terabyte range, base data are typically page is left after every 15 full data pages, stored on secondary storage in the form the read-ahead unit of 16 pages (64 KB) of rotating random-access disks. How- ensures that overflow records are imme- ever, deeper storage hierarchies includ- diately available in the buffer. 
ing optical storage, (maybe robot-oper- Efficient read-ahead requires contigu- ated) tape archives, and remote storage ous file allocation, which is supported by servers will also have to be considered in many operating systems. Such contigu- future high-functionality high-volume ous disk regions are frequently called ex- database management systems, e.g., as tents. The UNIX operating system does outlined by Stonebraker [1991]. Research not provide contiguous files, and many

ACM Computmg Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 95

database systems running on UNIX use cation of storage structures, though not a “raw” devices instead, even though this new idea [Omiecinski 1985], is likely to means that the database management become an important research topic system must provide operating-system within database research over the next functionality such as file structures, disk few years as databases become larger space allocation, and buffering. and larger and are spread over many The disadvantages of large units of 1/0 disks and many nodes in parallel and are buffer fragmentation and the waste distributed systems. of 1/0 and bus bandwidth if only individ- While most current database system ual records are required. Permitting dif- implementations only use some form of ferent page sizes may seem to be a good B-trees, an amazing variety of index idea, even at the added complexity in the structures has been described in the lit- buffer manager [Carey et al. 1986; Sikeler erature [Becker et al. 1991; Beckmann et 1988], but this does not solve the prob- al. 1990; Bentley 1975; Finkel and Bent- lem of mixed sequential scans and ran- ley 1974; Guenther and Bilmes 1991; dom record accesses within one file. The Gunther and Wong 1987; Gunther 1989; common solution is to choose a middle-of- Guttman 1984; Henrich et al. 1989; Heel the-road page size, e.g., 8 KB, and to and Samet 1992; Hutflesz et al. 1988a; support multipage read-ahead. 1988b; 1990; Jagadish 1991; Kemper and Wallrath 1987; Kolovson and Stone- braker 1991; Kriegel and Seeger 1987; 3.2 Associative Access Using Indices 1988; Lomet and Salzberg 1990a; Lomet In order to reduce the number of accesses 1992; Neugebauer 1991; Robinson 1981; to secondary storage (which is relatively Samet 1984; Six and Widmayer 1988]. slow compared to main memory), most One of the few multidimensional index database systems employ associative structures actually implemented in a search techniques in the form of indices complete database management system that map key or attribute values to loca- are R-trees in Postgres [Guttman 1984; tor information with which database Stonebraker et al. 1990b]. objects can be retrieved. The best known Table 5 shows some example index and most often used database index structures classified according to four structure is the B-tree [Bayer and characteristics, namely their support McCreighton 1972; Comer 1979]. A large for ordering and sorted scans, their number of extensions to the basic struc- dynamic-versus-static behavior upon ture and its algorithms have been pro- insertions and deletions, their support posed, e.g., B ‘-trees for faster scans, fast for multiple dimensions, and their sup- loading from a sorted file, increased fan- port for point data versus range data. We out and reduced depth by prefix and suf- omitted hierarchical concatenation of at- fix truncation, B*-trees for better space tributes and uniqueness, because all in- utilization in random insertions, and dex structures can be implemented to top-down B-trees for better locking be- support these. The indication “no range havior through preventive maintenance data” for multidimensional index struc- [Guibas and Sedgewick 1978]. Interest- tures indicates that range data are not ingly, B-trees seem to be having a renais- part of the basic structure, although they sance as a research subject, in particular can be simulated using twice the number with respect to improved space utiliza- of dimensions. 
We included a reference or tion [Baeza-Yates and Larson 1989], con- two with each structure; we selected orig- currency control [ Srinivasan and Carey inal descriptions and surveys over the 1991], recovery [Lanka and Mays 1991], many subsequent papers on special as- parallelism [Seeger and Larson 1991], pects such as performance analyses, mul- and on-line creation of B-trees for very tidisk’ and multiprocessor implementa- large databases [Srinivasan and Carey tions, page placement on disk, concur- 1992]. On-line reorganization and modifi- rency control, recovery, order-preserving

ACM Computing Surveys, Vol. 25, No. 2, June 1993 96 = Goetz Graefe

Table 5. Classlflcatlon of Some Index Structures Structure Ordered Dynamic Multl-Dim. Range Data References

ISAM Yes No No No [Larson 1981] B-trees Yes Yes No No [Bayer and McCreighton 1972; Comer 1979] Quad-tree Yes Yes Yes No [Finkel and Bentley 1974; Samet 1984] kD-trees Yes Yes Yes No [Bentley 1975] KDB-trees Yes Yes Yes No [Robinson 1981] hB-trees Yes Yes Yes No [Lomet and Salzberg 1990a] R-trees Yes Yes Yes Yes [Guttman 1984] Extendible No Yes No No [Fagin et al. 1979] Hashing Linear Hashing No Yes No No [Litwin 1980] Grid Files Yes Yes Yes No [Nievergelt et al. 1984]

hashing, mapping range data of N di- information that can then be used to re- mensions into point data of 2 N dimen- trieve the actual data object. Typically, in sions, etc.—this list suggests the wealth relational systems, an attribute value is of subsequent research, in particular on mapped to a tuple or record identifier B-trees, linear hashing, and refined mul- (TID or RID). Different systems use dif- tidimensional index structures, ferent approaches, but it seems that most Storage structures typically thought of new designs do not firmly attach the as index structures may be used as pri- record lookup to the index scan. mary structures to store actual data or There are several advantages to sepa- as redundant structures (“access paths”) rating index scan and record lookup. that do not contain actual data but point- First, it is possible to scan an index only ers to the actual data items in a separate without ever retrieving records from the data file. For example, Tandem’s Non- underlying data file. For example, if only Stop SQL system uses B-trees for actual salary values are needed (e.g., to deter- data as well as for redundant index mine the count or sum of all salaries), it structures. In this case, a redundant in- is sufficient to access the salary index dex structure contains not absolute loca- only without actually retrieving the data tions of the data items but keys used to records. The advantages are that (i) fewer search the primary B-tree. If indices are 1/0s are required (consider the number redundant structures, they can still be of 1/0s for retrieving N successive index used to cluster the actual data items, i.e., entries and those to retrieve N index the order or organization of index entries entries plus N full records, in particular determines the order of items in the data if the index is nonclustering [Mackert file. Such indices are called clustering and Lehman 1989] and (ii) the remaining indices; other indices are called nonclus- 1/0 operations are basically sequential tering indices. Clustering indices do not along the leaves of the index (at least for necessarily contain an entry for each data B ‘-trees; other index types behave differ- item in the primary file, but only one ently). The optimizers of several commer- entry for each page of the primary file; in cial relational products have recently this case, the index is called sparse. Non- been revised to recognize situations in clustering indices must always be dense, which an index-only scan is sufficient. i.e., there are the same number of entries Second, even if none of the existing in- in the index as there are items in the dices is sufficient by itself, multiple in- primary file. dices may be “joined” on equal RIDs to The common theme for all index struc- obtain all attributes required for a query tures is that they associatively map some (join algorithms are discussed below in attribute of a data object to some locator the section on binary matching). For ex-

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 97 ample, by matching entries in indices on ences to items that must be retrieved, salaries and on names by equal RIDs, giving the functional join operator signif- the correct salary-name pairs are estab- icant freedom to fetch items from disk lished. If a query requires only names efficiently. Of course, this technique and salaries, this “join” has made access- works most effectively if no other trans- ing the underlying data file obsolete. actions or operators use the same disk Third, if two or more indices apply to drive at the same time. individual clauses of a query, it may be This idea has been generalized to as- more effective to take the union or inter- semble complex objects~In object-oriented section of RID lists obtained from two systems, objects can contain pointers to index scans than using only one index (identifiers of) other objects or compo- (algorithms for union and intersection are nents, which in turn may contain further also discussed in the section on binary pointers, etc. If multiple objects and all matching). Fourth, joining two tables can their unresolved references can be con- be accomplished by joining the indices on sidered concurrently when scheduling the two join attributes followed by record disk accesses, significant savings in disk retrievals in the two underlying data sets; seek times can be achieved [Keller et al. the advantage of this method is that only 1991]. those records will be retrieved that truly contribute to the join result [Kooi 1980]. 3.3 Buffer Management Fifth, for nonclustering indices, sets of RIDs can be sorted by physical location, 1/0 cost can be further reduced by and the records can be retrieved very caching data in an 1/0 buffer. A large efficiently, reducing substantially the number of buffer management tech- number of disk seeks and their seek dis- niques have been devised; we give only a tances. Obviously, several of these tech- few references. Effelsberg and Haerder niques can be combined. In addition, some [19841 survey many of the buffer man- systems such as Rdb/VMS and DB2 use agement issues, including those pertain- very sophisticated implementations of ing to issues of recovery, e.g., write-ahead multiindex scans that decide dynami- logging. In a survey paper on the interac- cally, i.e., during run-time, which indices tions of operating systems and database to scan, whether scanning a particular management systems, Stonebraker index reduces the resulting RID list suf- [ 1981] pointed out that the “standard” ficiently to offset the cost of the index buffer replacement policy, LRU (least re- scan, and whether to use bit vector filter- cently used), is wrong for many database ing for the RID list intersection (see a situations. For example, a file scan reads later section on bit vector filtering) a large set of pages but uses them only [Antoshenkov 1993; Mohan et al. 1990]. once, “sweeping” the buffer clean of all Record access performance for nonclus- other pages, even if they might be useful tering indices can be addressed without in the future and should be kept in mem- performing the entire index scan first (as ory. Sacco and Schkolnick [1982; 19861 required if all RIDs are to be sorted) by focused on the nonlinear performance using a “window” of RIDs. Instead of effects of buffer allocation to many rela- obtaining one RID from the index scan, tional algorithms, e.g., nested-loops join. 
retrieving the record, getting the next Chou [ 1985] and Chou and DeWitt [ 1985] RID from the index scan, etc., the lookup combined these two ideas in their DBMIN operator (sometimes called “functional algorithm which allocates a fixed number join”) could load IV RIDs, sort them into of buffer pages to each scan, depending a priority heap, retrieve the most conve- on its needs, and uses a local replace- niently located record, get another RID, ment policy for each scan appropriate to insert it into the heap, retrieve a record, its reference pattern. A recent study into etc. Thus, a functional join operator us- buffer allocation is by Faloutsos et al. ing a window always has N open refer- [1991] and Ng et al. [19911 on using

ACM Computing Surveys, Vol. 25, No. 2, June 1993 98 ● G’oetz Graefe marginal gain for buffer allocation. A very memory management scheme for inter- promising research direction for buffer mediate results and items passed from management in object-oriented database iterator to iterator. The cost of this deci- systems is the work by Palmer and sion is additional in-memory copying as Zdonik [1991] on saving reference pat- well as the possible inefficiencies associ- terns and using them to predict future ated with, in effect, two buffer and mem- object faults and to prevent them by ory managers. prefetching the required pages. The interactions of index retrieval and 4. AGGREGATION AND DUPLICATE buffer management were studied by REMOVAL Sacco [1987] as well as Mackert and Lehman [ 19891.-, and several authors Aggregation is a very important statisti- studied database buffer management and cal concept to summarize information virtual memory provided by the operat- about large amounts of data. The idea is ing system [Sherman and Brice 1976; to represent a set of items by a single Stonebraker 1981; Traiger 1982]. value or to classify items into groups and On the level of buffer manager imple- determine one value per group. Most mentation, most database buffer man- database systems support aggregate agers do not ~rovide read and write in. functions for minimum, maximum, sum, t~rfaces to th~ir client modules but fixing count, and average (arithmetic mean’). and unfixing, also called pinning and un- Other aggregates, e.g., geometric mean pinning. The semantics of fixing is that a or standard deviation, are typically not fixed page is not subject to replacement provided, but may be constructed in some or relocation in the buffer pool, and a systems with extensibility features. Ag- client module may therefore safely use a gregation has been added to both rela- memory address within a fixed page. If tional calculus and algebra and adds the the buffer manager needs to replace a same expressive power to each of them page but all its buffer frames are fixed, [Klug 1982]. some special action must occur such as Aggregation is typically supported in dynamic growth of the buffer pool or two forms, called scalar aggregates and transaction abort. aggregate functions [Epstein 1979]. The iterator implementation of query Scalar aggregates calculate a single evaluation algorithms can exploit the scalar value from a unary input relation, buffer’s fix/unfix interface by passing e.g., the sum of the salaries of all employ- pointers to items (records, objects) fixed ees. Scalar aggregates can easily be de- in the buffer from iterator to iterator. termined using a single pass over a data The receiving iterator then owns the fixed set. Some systems exploit indices, in par- item; it may unfix it immediately (e.g., ticular for minimum, maximum, and after a predicate fails), hold on to the count. fixed record for a while (e.g., in a hash Aggregate functions, on the other hand, table), or pass it on to the next iterator determine a set of values from a binary (e.g., if a predicate succeeds). Because the input relation, e.g., the sum of salaries iterator control and interaction of opera- for each department. Aggregate functions tors ensure that items are never pro- are relational operators, i.e., they con- duced and fixed before they are required, sume and produce relations. Figure 10 the iterator protocol is very eflic~ent in shows the output of the query “count of its buffer usage. 
employees by department.” The “by-list” Some implementors, however, have felt or grouping attributes are the key of the that intermediate results should not be new relation, the Department attribute materialized or kept in the database sys- in this example. tem’s 1/0 buffer, e.g., in order to ease Algorithms for aggregate functions re- implementation of transaction (ACID) se- quire grouping, e.g., employee items may mantics, and have designed a separate be grouped by department, and then one

ACM Computing Surveys. Vol 25, No 2, June 1993 Query Evaluation Techniques ● 99

on nested loops, sorting, and hashing. The first algorithm, which we call nested-loops aggregation, is the most Shoe 9 simple-minded one. Using a temporary Hardware 7 I file to accumulate the output, it loops for each input item over the output file accu- Figure 10. Count of employees by department, mulated so far and either aggregates the input item into the appropriate output item or creates a new output item and output item is calculated per group. This appends it to the output file. Obviously, grouping process is very similar to dupli- this algorithm is quite inefficient for large cate removal in which eaual data items inputs, even if some performance en- must be brought together; compared, and hancements can be applied.7 We mention removed. Thus, aggregate functions and it here because it corres~onds. to the al~o- duplicate removal are typically imple- rithm choices available for relational mented in the same module. There are joins and other binary matching prob- only two differences between aggregate lems (discussed in the next section), functions and duplicate removal. First, in which are the nested-loops join and the duplicate removal, items are compared more efficient sort-based and hash-based join algorithms. As for joins and binary on all their attributes. but onlv. on the attributes in the by-list of aggregate matching, where the nested-loops algo- functions. Second, an identical item is rithm is the only algorithm that can eval- immediately dropped from further con- uate any join predicate, the nested-loops sideration in duplicate removal whereas aggregation algorithm can support un- in aggregate functions some computation usual aggregations where the input items is ~erformed before the second item of are not divided into disjoint equivalence th~ same group is dropped. Both differ- classes but where a single input item ences can easily be dealt with using a may contribute to multiple output items. switch in an actual algorithm implemen- While such aggregations are not sup- tation. Because of their similarity, dupli- ported in today’s database systems, clas- cate removal and aggregation are de- sifications that do not divide the input scribed and used interchangeably here. into equivalence classes can be useful in In most existing commercial relational both commercial and scientific applica- systems, aggregation and duplicate re- tions. If the number of classifications is moval algorithms are based on sorting, small enough that all output items can following Epstein’s [1979] work. Since be kept in memory, the performance of aggregation requires that all data be con- this algorithm is acceptable. However, for sumed before any output can be pro- the more standard database aggregation duced, and since main memories were problems, sort-based and hash-based du- significantly smaller 15 years ago when plicate removal and aggregation algo- the prototypes of these systems were de- rithms are more appropriate. signed, these implementations used tem- porary files for output, not streams and iterator algorithms. However, there is no reason why aggregation and duplicate re- 7The possible improvements are (i) looping over pages or clusters rather than over records of input moval cannot be implemented using iter- and output items (block nested loops), (ii) speeding ators exploiting today’s memory sizes. 
the inner loop by an index (index nested loops), a method that has been used in some commercial relational sYstems, (iii) bit vector filtering to deter- 4.1 Aggregation Algorithms Based on mine without inner loop or index lookup that an Nested Loops item in the outer loop cannot possibly have a match in the inner loop. All three of these issues are There are three types of algorithms for discussed later in this survey as they apply to aggregation and duplicate removal based binary operations such as joins and intersection.

ACM Computing Surveys, Vol. 25, No 2, June 1993 100 “ Goetz Graefe

4.2 Aggregation Algorithms Based on the output cardinality (number of items) Sorting is G times less than the input cardinality (G = R/o), where G is called the aver. Sorting will bring equal items together, age group size or the reduction factor, and duplicate removal will then be easy. only the last ~logF(G)l merge levels, in- The cost of duplicate removal is domi- cluding the final merge, are affected by nated bv the sort cost. and the cost of early aggregation because in earlier lev- this na~ve dudicate removal al~orithm els more than G runs exist, and items based on sorting can be assume~ to be from each group are distributed over all that of the sort operation. For aggrega- those runs, giving a negligible chance of tion, items are sorted on their grouping early aggregation. attributes. In the first merge levels, all input items This simple method can be improved participate, and the cost for these levels by detecting and removing duplicates as can be determined without explicitly cal- early as possible, easily implemented in culating the size and number of run files the routines that write run files during on these levels. In the affected levels, the sorting. With such “early” duplicate re- size of the output runs is constant, equal moval or aggregation, a run file can never to the size of the final output O = R/G, contain more items than the final output while the number of run files decreases (because otherwise it would contain du- by a factor equal to the fan-in F in each plicates!), which may speed up the final level. The number of affected levels that merges significantly [Bitten and DeWitt create run files is Lz = [log~(G)l – 1; the 1983]. subtraction of 1 is necessary because the As for any external sort operation, the final merge does not create a run file but optimizations discussed in the section on the output stream. The number of unaf- sorting, namely read-ahead using fore- fected levels is LI = L – Lz. The number casting, merge optimizations, large clus- of input runs is W/Fz on level i (recall ter sizes, and reduced final fan-in for the number of initial runs W = R/M binary consumer operations, are fully ap- from the discussion of sorting). The total plicable when sorting is used for aggre- cost,8 including a factor 2 for writing and gation and duplicate removal. However, reading, is to limit the complexity of the formulas, we derive 1/0 cost formulas without the L–1 effects of these optimizations. 2X RX LI+2XOX ~W/Fz The amount of I/0 in sort-based a~- Z=L1 gregation is determined by the number ~f =2 XRXL1+2XOXW merge levels and the effect of early dupli- cate removal on each merge step. The x(l\FL1 – l/’F~)/’(l – l\F). total number of merge levels is unaf- fected by aggregation; in sorting with For example, consider aggregating R = 100 MB input into O = 1 MB output quicksort and without optimized merg- ing, the number of merge levels is L = (i.e., reduction factor G = 100) using a system with M = 100 KB memory and DogF(R/M)l for input size R, memory fan-in F = 10. Since the input is W = size M, and fan-in F. In the first merge 1,000 times the size of memory, L = 3 levels, the likelihood is negligible that merge levels will be needed. The last Lz items of the same group end up in the = log~(G) – 1 = 1 merge level into tem- same run file, and we therefore assume porary run files will permit early aggre- that the sizes of run files are unaffected until their sizes would exceed the size of gation. Thus, the total 1/0 will be the final output. 
Runs on the first few merge levels are of size M X F’ for level i, and runs of the last levels have the 8 Using Z~= ~ a z = (1 – a~+l)/(l – a) and Z~.KaL same size as the final output. Assuming = X$LO a’ – E~=-O] a’ = (aK – a“’’+l)/(l – a).

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques ● 101

2X1 OOX2+2X1X1OOO files (all partitioning files in any one re- x(l/lo2 – 1/103 )/’(1 – 1/10) cursion level) will basically be as large as the entire input because once a partition = 400 + 2 x 1000 x 0.009/0.9 is being written to disk, no further aggre- = 420 MB gation can occur until the partition files are read back into memory. which has to be divided by the cluster The amount of 1/0 for hash-based size used and multi~lied bv the time to aggregation depends on the number of read or write a clu~ter to “estimate the partitioning (recursion) levels required 1/0 time for aggregation based on sort- before the output (not the input) of one ing. Naive separation of sorting and sub- partition fits into memory. This will be sequent aggregation would have required the case when partition files have been reading and writing the entire input file reduced to the size G x M. Since the par- three times, for a total of 600 MB 1/0. titioning files shrink by a factor of F at Thus, early aggregation realizes almost each level (presuming hash value skew is 30% savings in this case. absent or effectively counteracted), the Aggrega~e queries may require that number of partitioning (recursion) levels duplicates be removed from the input set is DogF(R/G/M)l = [log~(O/M)l for in- to the aggregate functions,g e.g., if the put size R, output size O, reduction fac- SQL distinct keyword is used. If such an tor G, and fan-out 1’. The costs at each aggregate function is to be executed us- level are proportional to the input file ing sorting, early aggregation can be used size R. The total 1/0 volume for hashing only for the duplicate removal part. How- with overflow avoidance, including a fac- ever, the sort order used for duplicate tor of 2 for writing and reading, is removal can be suitable to .~ermit the subsequent aggregation as a simple filter 2 x R X [logF(O/~)l. operation on the duplicate removal’s out- put stream. The last partitioning level may use hy- brid hashing, i.e., it may not involve 1/0 for the entire input file. In that case, 4.3 Aggregation Algorithms Based on L = llog~( O\M )j complete recursion lev- Hashing els involving all input records are re- Hashing can also be used for aggregation quired, partitioning the input into files of by hashing on the grouping attributes. size R’ = R/FL. In each remaining hy- Items of the same group (or duplicate brid hash aggregation, the size limit for items in duplicate removal) can be found overflow files is M x G because such an and aggregated when inserting them into overflow file can be aggregated in mem- the hash table. Since only output items, ory. The number of partition files K must not input items, are kept in memory, satisfy KxMx G+(M– KxC)XG hash table overflow occurs only if the > R’, meaning K = [(R’/G – M)/(M – output does not fit into memory. How- C)l partition files will be created. The ever, if overflow does occur, the partition total 1/0 cost for hybrid hash aggrega- tion is

2X RX L+2XFL 9 Consider two queries, both counting salaries per department. In order to determine the number of x( R’–(M– KxC)XG) (salaried) employees per department, all salaries are counted without removing duplicate salary val- =2 X( RX(L+l)– FL ues. On the other hand, in order to assess salary x( M–Kx C)XG) differentiation in each department, one might want to determine the number of distinct salary levels in =2 X( RX(L+l)– FL each department. For this query, only distinct salaries are counted, i.e., duplicate department- x(M– [(R’/G –M)/(M– C)] salary pairs must be removed prior to counting. (This refers to the latter type of query.) xC) x G).

ACM Computing Surveys, Vol. 25, No 2, June 1993 102 “ Goetz Graefe

600 – 500 –

400 – I/o 300 – [MB] o Sorting without early aggregation 200 – A Sorting with early aggregation 100 – x Hashing without hybrid hashing o– ❑ Hashing with hybrid hashing I I I I I I I 1 I I I I I 1 23510 2030 50 100 200300500 1000

Group Size or Reduction Factor

Figure 11. Performance of sort- and hash-based aggregation.

As for sorting, if an aggregate query suits of Bitton and DeWitt [1983]. The requires duplicate removal for the input other algorithms all exhibit similar, set to the aggregate function,l” the group though far from equal, performance im- size or reduction factor of the duplicate provements for larger reduction factors. removal step determines the perfor- Sorting with early aggregation improves mance of hybrid hash duplicate removal. once the reduction factor is large enough The subsequent aggregation can be per- to affect not only the final but also previ- formed after the duplicate removal as an ous merge steps. Hashing without hybrid additional operation within each hash hashing improves in steps as the number bucket or as a simple filter operation on of partitioning levels can be reduced, with the duplicate removal’s output stream. “step” points where G = F’ for some i. Hybrid hashing exploits all available 4.4 A Rough Performance Comparison memory to improve performance and generally outperforms overflow avoid- It is interesting to note that the perfor- ance hashing. At points where overflow mance of both sort-based and hash-based avoidance hashing shows a step, hybrid aggregation is logarithmic and improves hashing has no effect, and the two hash- with increasing reduction factors. Figure ing schemes have the same performance. 11 compares the performance of sort- and While hash-based aggregation and du- hash-based aggregationll using the for- plicate removal seem superior in this mulas developed above for 100 MB input rough analytical performance compari- data, 100 KB memory, clusters of 8 KB, son, recall that the cost formula for sort- fan-in or fan-out of 10, and varying group based aggregation does not include the sizes or reduction factors. The output size effects of replacement selection or the is the input size divided by the group merge optimizations discussed earlier in size. the section on sorting; therefore, Figure It is immediately obvious in Figure 11 11 shows an upper bound for the 1/0 that sorting without early aggregation is cost of sort-based aggregation and dupli- not competitive because it does not limit cate removal. Furthermore, since the cost the sizes of run files, confirming the re- formula for hashing presumes optimal assignments of hash buckets to output partitions, the real costs of sort- and 1“ See footnote 9. hash-based aggregation will be much 11Aggregation by nested-loops methods is omitted from Figure 11 because it is not competitive for more similar than they appear in Figure large data sets. 11. The important point is that both their

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques “ 103

costs are logarithmic with the input size, cost calculation, and comparison of alter- improve with the group size or reduction native plans. For these applications, factor, and are quite similar overall. faster algorithms can be designed that rely either on a single sequential scan of the data (no run files, no overflow files) 4.5 Additional Remarks on Aggregation or on sampling [Astrahan et al. 1987; Some applications require multilevel ag- Hou and Ozsoyoglu 1991; 1993; Hou et gregation. For example, a report genera- al. 1991]. tion language might permit a request like “sum (employee. salary by employee.id by 5. BINARY MATCHING OPERATIONS employee. department by employee. divi- While aggregation is essential to con- siony to create a report with an entry for dense information, there are a number of each employee and a sum for each de- database operations that combine infor- partment and each division. In fact, spec- mation from two inputs, files, or sets and ifying such reports concisely was the therefore are essential for database driving design goal for the report genera- systems’ ability to provide more than tion language RPG. In SQL, this requires reliable shared storage and to perform multiple cursors within an application inferences, albeit limited. A group of op- program, one for each level of detail. This erators that all do basically the same is very undesirable for two reasons. First, task are called the one-to-one match op- the application program performs what erations here because an input item con- is essentially a join of three inputs. Such tributes to the output depending on its joins should be provided by the database match with one other item. The most system, not required to be performed prominent among these operations is the within application programs. Second, the relational join. Mishra and Eich [1992] database system more likely than not have recently written a survey of join executes the operations for these cursors algorithms, which includes an interest- independently from one another, result- ing analysis and comparison of algo- ing in three sort operations on the em- rithms focusing on how data items from ployee file instead of one. the two inputs are compared with one If complex reporting applications are another. The other one-to-one match op- to be supported, the query language erations are left and right semi-joins, left, should support direct requests (perhaps right, and symmetric outer-joins, left and similar to the syntax suggested above), right anti-semi-joins, symmetric anti- and the sort operator should be imple- join, intersection, union, left and right mented such that it can perform the en- differences, and symmetric or anti-dif- tire operation in a single sort and one ference.lz Figure 12 shows the basic final pass over the sorted data. An analo- gous algorithm based on hashing can be defined; however, if the aggregated data 12 The anti-semijoin of R and S is R SEMIJOIN are required in sort order, sort-based ag- S = R – (R SEMZJOIN S), i.e., the items in R gregation will be the algorithm of choice. without matches in S. The (symmetric) anti-join For some applications, exact aggregate contains those items from both inputs that do not functions are not required; reasonably have matches, suitably padded as in outer joins to close approximations will do. For exam- make them union compatible. 
Formally, the (sym- metric) anti-join of R” and S is R JOIN S =- (R ple, exploratory (rather than final pre- SEMIJOIN S ) U (S SEMIJOIN R ) with the tuples cise) data analysis is frequently very use- of the two union arguments suitably extended with ful in “approaching” a new set of data null values. The symmetric or anti-difference M the [Tukey 1977]. In real-time systems, pre- union of the two differences. Formally the anti-dif- cision and response time may be reason- ference of R and S is (R u S) – (R n S) = (R – S) u (S – R) [Maier 1983]. Among these three op- able tradeoffs. For database query opti- erations, the anti-semijoin is probably the most mization, approximate statistics are a useful one, as in the query to “find the courses that sufficient basis for selectivity estimation, don’t have any enrollment.”

ACM Computing Surveys, Vol. 25, No. 2, June 1993 104 “ Goetz Graefe

R s

A B c

m

output Match on all Match on some Attributes Attributes A Difference Anti-semi-join B Intersection Join, semi-join c Difference Anti-semi-join A, B Left outer join A, C Symmetric difference Anti-join B, C Right outer join A, B, C Union Symmernc outer join

Figure 12, Binary one-to-one matching.

principle underlying all these operations, surprising places. Consider an object- namely separation of the matching and oriented database system that uses a nonmatching components of two sets, table to map logical object identifiers called R and S in the figure, and produc- (OIDS) to physical locations (record iden- tion of appropriate subsets, possibly after tifiers or RIDs). Resolving a set of OIDS some transformation and combination of to RIDs can be regarded (as well as opti- records as in the case of a join. If the sets mized and executed) as a semi-join of the R and S have different schemas as in mapping table and the set of OIDS, and relational joins, it might make sense to all conventional join strategies can be think of the set B as two sets 11~ and employed. Another example that can oc- B~, i.e., the matching elements from R cur in a database management system and S. This distinction permits a clearer for any data model is the use of multiple definition of left semi-join and right indices in a query: the pointer (OID or semi-join, etc. Since all these operations RID) lists obtained from the indices must require basically the same steps and can be intersected (for a conjunction) or be implemented with the same algo- united (for a disjunction) to obtain the rithms, it is logical to implement them in list of pointers to items that satisfy the one general and efficient module. For whole query. Moreover, the actual lookup simplicity, only join algorithms are dis- of the items using the pointer list can be cussed here. Moreover, we discuss algo- regarded as a semi-join of the underlying rithms for only one join attribute since data set and the list, as in Kooi’s [1980] the algorithms for multi-attribute joins thesis and the Ingres product [Kooi and (and their performance) are not different Frankforth 1982] and a recent study by from those for single-attribute joins. Shekita and Carey [1990]. Finally, many Since set operations such as intersec- path expressions in object-oriented tion and difference will be used and must database systems such as “employee. de- be implemented efficiently for any data partment.manager. office.location” can model, this discussion is relevant to rela- frequently be interpreted, optimized, and tional, extensible, and object-oriented executed as a sequence of one-to-one database systems alike. Furthermore, bi- match operations using existing join and nary matching problems occur in some semi-join algorithms. Thus, even if rela-

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 105 tional systems were completely abolished are never written to disk, and because and replaced by object-oriented database these costs are equal for all algorithms. systems, set matching and join tech- niques developed in the relational con- 5.1 Nested-Loops Join Algorithms text would continue to be important for the performance of database systems. The simplest and, in some sense, most Most of today’s commercial database direct algorithm for binary matching is systems use only nested loops and the nested-loops join: for each item in one merge-join because an analysis per- input (called the outer input), scan the formed in connection with the System R entire other input (called the inner in- project determined that of all the join put) and find matches. The main advan- methods considered, one of these two al- tage of this algorithm is its simplicity. ways provided either the best or very Another advantage is that it can com- close to the best performance [Blasgen pute a Cartesian product and any @join and Eswaran 1976; 1977]. However, the of two relations, i.e., a join with an arbi- System R study did not consider hash trary two-relation comparison predicate. join algorithms, which are now regarded However, Cartesian products are avoided as more efficient in many cases. by query optimizers because their out- There continues to be a strong interest puts tend to contain many data items in join techniques, although the interest that will eventually not satisfy a query has shifted over the last 20 years from predicate verified later in the query eval- basic algorithmic concerns to parallel uation plan. techniques and to techniques that adapt Since the inner input is scanned re- to unpredictable run-time situations such peatedly, it must be stored in a file, i.e., a as data skew and changing resource temporary file if the inner input is pro- availability. Unfortunately, many new duced by a complex subplan. This situa- proposed techniques fail a very simple tion does not change the cost of nested test (which we call the “Guy Lehman test loops; it just replaces the first read of the for join techniques” after the first person inner input with a write. who pointed this test out to us), making Except for very small inputs, the per- them problematic for nontrivial queries. formance of nested-loops join is disas- The crucial test question is: Does this trous because the inner input is scanned new technique apply to joining three in- very often, once for each item in the outer puts without interrupting data flow be- input. There are a number of improve- tween the join operators? For example, a ments that can be made to this naiue technique fails this test if it requires ma- nested-loops join. First, for one-to-one terializing the entire intermediate join match operations in which a single match result for random sampling of both join carries all necessary information, e.g., inputs or for obtaining exact knowledge semi-join and intersection, a scan of the about both join input sizes. Given its im- inner input can be terminated after the portance, this test should be applied to first match for an item of the outer input. both proposed query optimization and Second, instead of scanning the inner in- query execution techniques. 
put once for each item from the outer For the 1/0 cost formulas given here, input, the inner input can be scanned we assume that the left and right inputs once for each page of the outer input, an have R and S pages, respectively, and algorithm called block nested-loops join that the memory size is M pages. We [Kim 1980]. Third, the performance can assume that the algorithms are imple- be improved further by filling all of mem- mented as iterators and omit the cost of ory except K pages with pages of the reading stored inputs and writing an op- outer input and by using the remaining eration’s output from the cost formulas K pages to scan the inner input and to because both inputs and output may be save pages of the inner input in memory. iterators, i.e., these intermediate results Finally, scans of the inner input can be


made a little faster by scanning the inner built on the fly, i.e., indices built on inter- input alternatingly forward and back- mediate query processing results. ward, thus reusing the last page of the A recent investigation by DeWitt et al. previous scan and therefore saving one [1993] demonstrated that index nested- 1/0 per inner scan. The 1/0 cost for this loops join can be the fastest join method version of nested-loops join is the product if one of the inputs is so small and if the of the number of scans (determined by other indexed input is so large that the the size of the outer input) and the cost number of index and data page re- per scan of the inner input, plus K 1/0s trievals, i.e., about the product of the because the first inner scan has to scan index depth and the cardinality of the or save the entire inner input. Thus, the smaller input, is smaller than the num- total cost for scanning the inner input ber of pages in the larger input. repeatedly is [R\(M – K)] x (S – K) + Another interesting idea using two or- K. This expression is minimized if K = 1 dered indices, e.g., a B-tree on each of the and R z S, i.e., the larger input should two join columns, is to switch roles of be the outer. inner and outer join inputs after each If the critical performance measure is index lookup, which leads to the name not the amount of data read in the re- “zig-zag join.” For example, for a join peated inner scans but the number of predicate R.a = S.a, a scan in the index 1/0 operations, more than one page on R .a finds the lower join attribute should be moved in each 1/0, even if value in R, which is then looked up in more memory has to be dedicated to the the index on S.a. A continuing scan in inner input and less to the outer input, the index on S.a yields the next possible thus increasing the number of passes over join attribute value, which is looked up the inner input. If C pages are moved in in the index on R .a, etc. It is not immedi- each 1/0 on the inner input and M – C ately clear under which circumstances pages for the outer input, the number of this join method is most efficient. 1/0s is [R/(M – C)] x (S/C) + 1, For complex queries, N-ary joins are which is minimized if C = M/2. In other sometimes written as a single module, words, in order to minimize the number i.e., a module that performs index lookups of large-chunk 1/0 operations, the clus- into indices of multiple relations and joins ter size should be chosen as half the all relations simultaneously. However, it available memory size [Hagmann 1986]. is not clear how such a multi-input join Finally, index nested-loops join ex- implementation is superior to multiple ploits a permanent or temporary index index nested-loops joins, on the inner input’s join attribute to re- place file scans by index lookups. In prin- 5.2 Merge-Join Algorithms ciple, each scan of the inner input in naive nested-loops join is used to find The second commonly used join method matches, i.e., to provide associativity. Not is the merge-join. It requires that both surprisingly, since all index structures inputs are sorted on the join attribute. are designed and used for the purpose of Merging the two inputs is similar to the associativity, any index structure sup- merge process used in sorting. An impor- porting the join predicate (such as = , <, tant difference, however, is that one of etc.) can be used for index nested-loops the two merging scans (the one which is join. 
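As a concrete illustration of the cost trade-off just described, the following sketch implements block nested-loops join over inputs represented as lists of pages, together with the I/O estimate [R/(M − K)] × (S − K) + K given above. It is a simplified model (record layout, the buffering of the K inner pages, and actual I/O are abstracted away), not the implementation of any particular system.

    import math

    def block_nested_loops_join(outer_pages, inner_pages, M, K=1,
                                match=lambda r, s: r == s):
        """Join two inputs given as lists of pages (each page a list of records):
        M - K memory pages buffer a chunk of the outer input; the inner input is
        scanned once per chunk."""
        chunk = M - K
        result = []
        for i in range(0, len(outer_pages), chunk):
            outer_chunk = [r for page in outer_pages[i:i + chunk] for r in page]
            for page in inner_pages:                   # one full inner scan per chunk
                for s in page:
                    result.extend((r, s) for r in outer_chunk if match(r, s))
        return result

    def block_nested_loops_io(R, S, M, K=1):
        """Page I/O estimate from the text: ceil(R / (M - K)) * (S - K) + K,
        minimized with K = 1 and the larger input as the outer input."""
        return math.ceil(R / (M - K)) * (S - K) + K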
The fastest indices for exact match advanced on equality, usually called the queries are hash indices, but any index inner input) must be backed up when structure can be used, ordered or un - both inputs contain duplicates of a join ordered (hash), single- or multi-attribute, attribute value and when the specific single- or multidimensional. Therefore, one-to-one match operation requires that indices on frequently used join attributes all matches be found, not just one match. (keys and foreign keys in relational sys- Thus, the control logic for merge-join tems) may be useful. Index nested-loops variants for join and semi-j oin are slightly join is also used sometimes with indices different. Some systems include the no-

ACM Computing Surveys, Vol 25, No 2. June 1993 Query Evaluation Techniques ● 107 tion of “value packet,” meaning all items [Cheng et al. 1991], combining elements with equal join attribute values [Kooi from index nested-loops join, merge-join, 1980; Kooi and Frankforth 1982]. An it- and techniques joining sorted lists of in- erator’s next call returns a value packet, dex leaf entries. After sorting the outer not an individual item, which makes the input on its join attribute, hybrid join control logic for merge-j oin much easier. uses a merge algorithm to “join” the outer If (or after) both inputs have been sorted, input with the leaf entries of a preexist- the merge-join algorithm typically does ing B-tree index on the join attribute of not require any 1/0, except when “value the inner input. The result file contains packets” are larger than memory. (See entire tuples from the outer input and footnote 1.) record identifiers (RIDs, physical ad- An input may be sorted because a dresses) for tuples of the inner input. stored database file was sorted, an or- This file is then sorted on the physical dered index was used, an input was locations, and the tuples of the inner re- sorted explicitly, or the input came from lation can then be retrieved from disk an operation that produced sorted out- very efficiently. This algorithm is not en- put, e.g., another merge-join. The last tirely new as it is a special combination point makes merge-join an efficient algo- of techniques explored by Blasgen and rithm if items from multiple sources are Eswaran [1976; 1977], Kooi [1980], and matched on the same join attribute(s) in Whang et al. [1984; 1985]. Blasgen and multiple binary steps because sorting in- Eswaran considered the manipulation of termediate results is not required for RID lists but concluded that either later merge-joins, which led to the con- merge-join or nested-loops join is the op- cept of interesting orderings in the Sys- timal choice in almost all cases; based on tem R query optimizer [Selinger et al. this study, only these two algorithms 1979]. Since set operations such as inter- were implemented in System R [Astra- section and union can be evaluated using han et al. 1976] and subsequent rela- any sort order, as long as the same sort tional database systems, Kooi’s optimizer order is present in both inputs, the effect treated an index similarly to a base rela- of interesting orderings for one-to-one tion and the lookup of data records from match operators based on merge-join can index entries as a join; this naturally always be exploited for set operations. permitted joining two indices or an index A combination of nested-loops join and with a base relation as in hybrid join. merge-join is the heap-filter merge -]”oin [Graefe 1991]. It first sorts the smaller 5.3 Hash Join Algorithms inner input by the join attribute and saves it in a temporary file. Next, it uses Hash join algorithms are based on the all available memory to create sorted idea of building an in-memory hash table runs from the larger outer input using on one input (the smaller one, frequently replacement selection. As discussed in the called the build input) and then probing section on sorting, there will be about this hash table using items from the other W = R/(2 x M) + 1 such runs for outer input (frequently called the probe input ). input size R. 
These runs are not written These algorithms have only recently to disk; instead, they are joined immedi- found greater interest [Bratbergsengen ately with the sorted inner input using 1984; DeWitt et al. 1984; DeWitt and merge-join. Thus, the number of scans of Gerber 1985; DeWitt et al. 1986; Fushimi the inner input is reduced to about one et al. 1986; Kitsuregawa et al. 1983; half when compared to block nested loops. 1989a; Nakayama et al. 1988; Omiecin- On the other hand, when compared to ski 1991; Schneider and DeWitt 1989; merge-join, it saves writing and read- Shapiro 1986; Zeller and Gray 1990]. One ing temporary files for the larger outer reason is that they work very fast, i.e., input. without any temporary files, if the build Another derivation of merge-join is the input does indeed fit into memory, inde- hybrid join used in IBMs DB2 product pendently of the size of the probe input.
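A minimal sketch of this no-overflow case, assuming the build input fits into memory (overflow avoidance and resolution are discussed next), might look as follows; the relation contents and key functions in the usage comment are invented for illustration.

    from collections import defaultdict

    def hash_join(build_input, probe_input, build_key, probe_key):
        """Classic in-memory hash join: build a hash table on the (smaller) build
        input, then probe it with the other input.  No temporary files are needed
        as long as the build input fits into memory."""
        table = defaultdict(list)
        for b in build_input:                      # build phase
            table[build_key(b)].append(b)
        for p in probe_input:                      # probe phase
            for b in table.get(probe_key(p), ()):  # all matches for this probe item
                yield (b, p)

    # Example with invented data: departments (dno, name) and employees (name, dno).
    # list(hash_join([(1, "toys"), (2, "tools")],
    #                [("ann", 1), ("bob", 2), ("cay", 2)],
    #                build_key=lambda d: d[0], probe_key=lambda e: e[1]))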


However, they require overflow avoid- ance or resolution methods for larger build inputs, and suitable methods were developed and experimentally verified Second only in the mid-1980’s, most notably in Join connection with the Grace and Gamma –-l -l-l database machine projects [DeWitt et al. Input 1986; 1990; Fushimi et al. 1986; Kitsure- gawa et al. 1983]. In hash-based join methods, build and L probe inputs are partitioned using the same partitioning function, e.g., the join First Join Input key value modulo the number of parti- tions. The final join result can be formed Figure 13. Effect of partitioning for join operations. by concatenating the join results of pairs of partitioning files. Figure 13 shows the effect of partitioning the two inputs of a binary operation such as join into hash buckets and partitions. (This figure was adapted from a similar diagram by IKit- suregawa et al. [ 1983]. Mishra and Eich [1992] recently adapted and generalized it in their survey and comparison of rela- tional join algorithms.) Without parti- tioning, each item in the first input must be compared with each item in the sec- ond input; this would be represented by complete shading of the entire diagram. With partitioning, items are grouped into Figure 14. Recursive partitioning in binary operations. partition files, and only pairs in the se- ries of small rectangles (representing the partitions) must be compared. If a build partition file is still larger file individually, hash-based binary than memory, recursive partitioning is matching operators are particularly ef- required. Recursive partitioning is used fective when the input sizes are very dif- for both build- and probe-partitioning ferent [Bratbergsengen 1984; Graefe et files using the same hash and partition- al. 1993]. ing functions. Figure 14 shows how both The 1/0 cost for binary hybrid hash input files are partitioned together. The operations can be determined by the partial results obtained from pairs of number of complete levels (i.e., levels partition files are concatenated to form without hash table) and the fraction of the result of the entire match operation. the input remaining in memory in the Recursive partitioning stops when the deepest recursion level. For memory size build partition fits into memory. Thus, M, cluster size C, partitioning fan-out the recursion depth of partitioning for F = [M\C – 1],build input size R, and binary match operators depends only on probe input size S, the number of com- the size of the build input (which there- plete levels is L = [logF( R/M)], after fore should be chosen to be the smaller which the build input partitions should input) and is independent of the size of be of size R’ = R/FL. The 1/0 cost for the probe input. Compared to sort-based the binary operation is the cost of parti- binary matching operators, i.e., variants tioning the build input divided by the of merge-join in which the number of size of the build input and multiplied by merge levels is determined for each input the sum of the input sizes. Adapting the


cost formula for unary hashing discussed earlier, the total amount of I/O for a recursive binary hash operation is

2 × (R × (L + 1) − F^L × (M − [(R′ − M + C)/(M − C)] × C))/R × (R + S),

which can be approximated with 2 × log_F(R/M) × (R + S). In other words, the cost of binary hash operations on large inputs is logarithmic; the main difference to the cost of merge-join is that the recursion depth (the logarithm) depends only on one file, the build input, and is not taken for each file individually.

As for all operations based on partitioning, partitioning (hash) value skew is the main danger to effectiveness. When using statistics on hash value distributions to determine which buckets should stay in memory in hybrid hash algorithms, the goal is to avoid as much I/O as possible with the least memory "investment." Thus, it is most effective to retain those buckets in memory with few build items but many probe items or, more formally, the buckets with the smallest value for r_i/(r_i + s_i), where r_i and s_i indicate the total size of a bucket's build and probe items [Graefe 1993b].

5.4 Pointer-Based Joins

Recently, links between data items have found renewed interest, be it in object-oriented systems in the form of object identifiers (OIDs) or as access paths for faster execution of relational joins. In a sense, links represent a limited form of precomputed results, somewhat similar to indices and join indices, and have the usual cost-versus-benefit tradeoff between query performance enhancement and maintenance effort. Kooi [1980] modeled the retrieval of actual records after index searches as "TID joins" (tuple identifiers permitting direct record access) in his query optimizer for Ingres; together with standard join commutativity and associativity rules, this model permitted exploring joins of indices of different relations (joining lists of key-TID pairs) or joins of one relation with another relation's index. In the Genesis data model and database system, Batory et al. [1988a; 1988b] modeled joins in a functional way, borrowing from research into the database languages FQL [Buneman et al. 1982; Buneman and Frankel 1979], DAPLEX [Shipman 1981], and Gem [Tsur and Zaniolo 1984; Zaniolo 1983] and permitting pointer-based join implementations in addition to traditional value-based implementations such as nested-loops join, merge-join, and hybrid hash join.

Shekita and Carey [1990] recently analyzed three pointer-based join methods based on nested-loops join, merge-join, and hybrid hash join. Presuming relations R and S, with a pointer to an S tuple embedded in each R tuple, the nested-loops join algorithm simply scans through R and retrieves the appropriate S tuple for each R tuple. This algorithm is very reminiscent of unclustered index scans and performs similarly poorly for larger set sizes. Their conclusion on naive pointer-based join algorithms is that "it is unwise for object-oriented database systems to support only pointer-based join algorithms."

The merge-join variant starts with sorting R on the pointers (i.e., according to the disk addresses they point to) and then retrieves all S items in one elevator pass over the disk, reading each S page at most once. Again, this idea was suggested before for unclustered index scans, and variants similar to heap-filter merge-join [Graefe 1991] and complex object assembly using a window and priority heap of open references [Keller et al. 1991] can be designed.

The hybrid hash join variant partitions only relation R on pointer values, ensuring that R tuples with S pointers to the same page are brought together, and then retrieves S pages and tuples. Notice that the two relations' roles are fixed by the direction of the pointers, whereas for standard hybrid hash join the smaller relation should be the build input. Differ-


[Figure 15 plots the I/O count (in thousands, from 0 to about 125) against the size of R (100 to 1,500, with S = 10 × R) for five join methods: pointer join with pointers from S to R, nested-loops join, merge-join with two sorts, hashing without hybrid hashing, and pointer join with pointers from R to S.]

Figure 15. Performance of alternative join methods.
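Figure 15 cannot be reproduced exactly from the formulas in this section alone; the sketch below only mimics its shape using simplified cost estimates that are consistent with the preceding discussion (block nested-loops join, merge-join with two external sorts whose merge depths are determined per input, and hash join whose recursion depth is determined by the smaller input) and the parameters stated in the text (memory of 100 KB in 8 KB clusters, fan-in and fan-out of 10, S = 10 × R). The exact cost functions, and the pointer-join variants, are deliberately omitted.

    import math

    M, F = 100 // 8, 10     # memory of 100 KB in 8 KB clusters; merge fan-in / partition fan-out

    def levels(n):
        """Number of merge levels or partitioning recursion levels for n pages."""
        return 0 if n <= M else math.ceil(math.log(n / M, F))

    def nested_loops_io(R, S):
        # Block nested-loops join; the larger input is the outer input (K = 1).
        outer, inner = max(R, S), min(R, S)
        return math.ceil(outer / (M - 1)) * (inner - 1) + 1

    def merge_join_io(R, S):
        # Merge-join with two external sorts: roughly one write and one read of
        # each input per merge level, with levels determined per input.
        return 2 * R * levels(R) + 2 * S * levels(S)

    def hash_join_io(R, S):
        # Recursive partitioning without hybrid hashing: recursion depth is
        # determined by the smaller (build) input only, cf. 2 x logF(R/M) x (R + S).
        return 2 * (R + S) * levels(min(R, S))

    for R in range(100, 1600, 200):
        S = 10 * R
        print(R, nested_loops_io(R, S), merge_join_io(R, S), hash_join_io(R, S))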

ently than standard hybrid hash join, without hybrid hashing, bucket tuning, relation S is not partitioned. This algo- or dynamic destaging; and pointer joins rithm performs somewhat faster than with pointers from R to S and from S to pointer-based merge-join if it keeps some R without grouping pointers to the same partitions of R in memory and if sorting target page together. This comparison is writes all R tuples into runs before merg- not precise; its sole purpose is to give a ing them. rough idea of the relative performance of Pointer-based join algorithms tend to the algorithm groups, deliberately ignor- outperform their standard value-based ing the many tricks used to improve and counterparts in many situations, in par- fine-tune the basic algorithms. The rela- ticular if only a small fraction of S actu- tion sizes vary; S is always 10 times ally participates in the join and can be larger than R. The memory size is 100 selected effectively using the pointers in KB; the cluster size is 8 KB; merge fan-in R. Historically, due to the difficulty of and partitioning fan-out are 10; and the correctly maintaining pointers (nones- number of R-records per cluster is 20. sential links ), they were rejected as a It is immediately obvious in Figure 15 relational access method in System R that nested-loops join is unsuitable for [Chamberlain et al. 1981al and subse- medium-size and large relations, because quently in basically all other systems, the cost of nested-loops join is propor- perhaps with the exception of Kooi’s tional to the size of the Cartesian prod- modified Ingres [Kooi 1980; Kooi and uct of the two inputs. Both merge-join Frankforth 1982]. However, they were (sorting) and hash join have logarithmic reevaluated and implemented in the cost functions; the sudden rise in merge- Starburst project, both as a test of Star- join and hash join cost around R = 1000 burst’s extensibility and as a means of is due to the fact that additional parti- supporting “more object-oriented” modes tioning or merging levels become neces- of operation [Haas et al. 1990]. sary at that point. The sort-based merge-join is not quite as fast as hash join because the merge levels are deter- 5.5 A Rough Performance Comparison mined individually for each file, includ- Figure 15 shows an approximate perfor- ing the bigger S file, while only the mance comparison using the cost formu- smaller build relation R determines the las developed above for block nested-loops partitioning depth of hash join. Pointer join; merge-join with sorting both inputs joins are competitive with hash and without optimized merging; hash join merge-joins due to their linear cost func-

tion, but only when the pointers are embedded in the smaller relation R. When S-records point to R-records, the cost of the pointer join is even higher than for nested-loops join.

The important point of Figure 15 is to illustrate that pointer joins can be very efficient or very inefficient, that one-to-one match algorithms based on nested-loops join are not competitive for medium-size and large inputs, and that sort- and hash-based algorithms for one-to-one match operations both have logarithmic cost growth. Of course, this comparison is quite naive since it uses only the simplest form of each algorithm. Thus, a comparison among alternative algorithms in a query optimizer must use the precise cost function for the available algorithm variant.

6. UNIVERSAL QUANTIFICATION¹²

Universal quantification permits queries such as "find the students who have taken all database courses"; the difference to one-to-one match operations is that a student qualifies because his or her transcript matches an entire set of courses, not only one item as in an existentially quantified query (e.g., "find students who have taken a (at least one) database course") that can be executed using a semi-join. In the past, universal quantification has been largely ignored for four reasons. First, typical database applications, e.g., record-keeping and accounting applications, rarely require universal quantification. Second, it can be circumvented using a complex expression involving a Cartesian product. Third, it can be circumvented using complex aggregation expressions. Fourth, there seemed to be a lack of efficient algorithms.

The first reason will not remain true for database systems supporting logic programming, rules, and quantifiers, and algorithms for universal quantification will become more important. The second reason is valid; however, the substitute expressions are very slow to execute because of the Cartesian product. The third reason is also valid, but replacing a universal quantifier may require very complex aggregation clauses that are easy to "get wrong" for the database user. Furthermore, they might be too complex for the optimizer to recognize as universal quantification and to execute with a direct algorithm. The fourth reason is not valid; universal quantification algorithms can be very efficient (in fact, as fast as semi-join, the operator for existential quantification), useful for very large inputs, and easy to parallelize. In the remainder of this section, we discuss sort- and hash-based direct and indirect (aggregation-based) algorithms for universal quantification.

In the relational world, universal quantification is expressed with the universal quantifier in relational calculus and with the division operator in relational algebra. We will explain algorithms for universal quantification using relational terminology. The running example in this section uses the relations Student (student-id, name, major), Course (course-no, title), Transcript (student-id, course-no, grade), and Requirement (major, course-no) with the obvious key attributes. The query to find the students who have taken all courses can be expressed in relational algebra as

π_student-id, course-no(Transcript) ÷ π_course-no(Course).

The projection of the Transcript relation is called the dividend, the projection of the Course relation the divisor, and the result relation the quotient. The quotient attributes are those attributes of the dividend that do not appear in the divisor. The dividend relation semi-joined with the divisor relation and projected on the quotient attributes, in the example the set of student-ids of Students who have taken at least one course, is called the set of quotient candidates here.

12 This section is a summary of earlier work [Graefe 1989; Graefe and Cole 1993].
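Before turning to algorithms, the effect of the division operator can be pinned down in a few lines of Python. This is only a statement of the quotient's definition over in-memory sets, with sample data modeled on the Jack/Jill example used later with Figure 16; it is not one of the division algorithms discussed below.

    def divide(dividend, divisor):
        """dividend: set of (quotient-attribute, divisor-attribute) pairs;
        divisor: set of divisor-attribute values.  The quotient contains every
        quotient value that appears with all divisor values."""
        candidates = {q for (q, d) in dividend}              # quotient candidates
        return {q for q in candidates
                if divisor <= {d for (q2, d) in dividend if q2 == q}}

    transcript = {("Jack", "Intro to Databases"),
                  ("Jill", "Intro to Databases"),
                  ("Jill", "Intro to Graphics"),
                  ("Jill", "Readings in Databases")}
    db_courses = {"Intro to Databases", "Readings in Databases"}
    print(divide(transcript, db_courses))                    # {'Jill'}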


Some universal quantification queries database courses both in the dividend seem to require relational division but (the Transcript relation) and in the divi- actually do not. Consider the query for sor (the Course relation). Counting only the students who have taken all courses database courses might be easy for the required for their major. This query can divisor relation, but requires a semi-join be answered with a sequence of one-to- of the dividend with the divisor relation one match operations. A join of Student to propagate the restriction on the divi- and Requirement projected on the stu- sor to the dividend if it is not known a dent-id and course-no attributes minus priori whether or not referential in- the Transcript relation can be projected tegrity holds between the dividend’s divi- on student-ids to obtain a set of students sor attributes and the divisor, i.e., who have not taken all their require- whether or not there are divisor attribute ments. An anti-semi-join of the Student values in the dividend that cannot be relation with this set finds the students found in the divisor. For example, course- who have satisfied all their require- nos in the Transcript relation that do not ments. This sequence will have accept- pertain to database courses (and are able performance because its required therefore not in the divisor) must be re- set-matching algorithms (join, difference, moved from the dividend by a semi-join anti-semi-join) all belong to the family of with the divisor. In general, if the divisor one-to-one match operations, for which is the result of a prior selection, any efficient algorithms are available as dis- referential integrity constraints known cussed in the previous section. for stored relations will not hold and must Division algorithms differ not only in be explicitly enforced using a semi-join. their performance but also in how they Furthermore, in order to ensure correct fit into complex queries, Prior to the divi- counting, duplicates have to be removed sion, selections on the dividend, e.g., only from either input if the inputs are projec- Transcript entries with “A” grades, or on tions on nonkey attributes. the divisor, e.g., only the database There are four methods to compute the courses, may be required. Restrictions on quotient of two relations, a sort- and a the dividend can easily be enforced with- hash-based direct method, and sort- and out much effect on the division operation, hash-based aggregation. Table 6 shows while restrictions on the divisor can im- this classification of relational division ply a significant difference for the query algorithms. Methods for sort- and hash- evaluation plan. Subsequent to the divi- based aggregation and the possible sort- sion operation, the resulting quotient re- or hash-based semi-join have already lation (e.g., a set of student-ids) may be been discussed, including their variants joined with another relation, e.g., the for inputs larger than memory and their Student relation to obtain student cost functions. Therefore, we focus here names. Thus, obtaining the quotient in a on the direct division algorithms. form suitable for further processing (e.g., The sort-based direct method, pro- join or semi-join with a third relation) posed by Smith and Chang [1975] and can be advantageous. called naizw diuision here, sorts the Typically, universal quantification can divisor input on all its attributes and the easily be replaced by aggregations. 
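The aggregation-based rewriting just mentioned can be sketched as follows. This is a hedged illustration rather than any specific system's plan; the semi-join with the divisor and the duplicate removal mirror the caveats discussed in the surrounding text.

    def divide_by_counting(dividend, divisor):
        """Indirect division: count, per quotient candidate, the distinct divisor
        values it occurs with, and keep the candidates whose count equals the
        divisor's cardinality."""
        divisor = set(divisor)                         # duplicate removal on the divisor
        restricted = {(q, d) for (q, d) in dividend    # semi-join with the divisor;
                      if d in divisor}                 # set semantics removes duplicates
        counts = {}
        for (q, _) in restricted:
            counts[q] = counts.get(q, 0) + 1
        return {q for q, c in counts.items() if c == len(divisor)}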
(In- dividend relation with the quotient at- tuitively, all universal quantification can tributes as major and the divisor at- be replaced by aggregation. However, we tributes as minor sort keys. It then pro- have not found a proof for this statement.) ceeds with a merging scan of the two For example, the example query about sorted inputs to determine which items database courses can be restated as “find belong in the quotient. Notice that the the students who have taken as many scan can be programmed such that it database courses as there are database ignores duplicates in either input (in case course s.” When specifying the aggregate those had not been removed yet in the function, it is important to count only sort) as well as dividend items that do


Table 6. Classification of Relational Division Algorithms

                           Based on Sorting                              Based on Hashing

Direct                     Naive division                                Hash-division

Indirect by semi-join      Sorting with duplicate removal, merge-join,   Hash-based duplicate removal, hybrid hash
and aggregation            sorting with aggregation                      join, hash-based aggregation
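Hash-division, the direct hash-based entry in Table 6, is described in detail in the following paragraphs; for orientation, here is a minimal in-memory sketch of its core (a divisor table assigning sequence numbers and a quotient-candidate table with one bit per divisor item). It is only an illustration, not the Volcano implementation referred to below, and it omits overflow handling and partitioning.

    def hash_division(dividend, divisor):
        """Direct hash-based division: one hash table for the divisor (assigning
        sequence numbers) and one for quotient candidates, each candidate keeping
        a bit map indexed by those sequence numbers."""
        divisor_table = {}                        # divisor value -> sequence number
        for d in divisor:
            divisor_table.setdefault(d, len(divisor_table))

        candidates = {}                           # quotient value -> bit map (as int)
        for (q, d) in dividend:
            seq = divisor_table.get(d)
            if seq is None:                       # no divisor match: ignore immediately
                continue
            candidates[q] = candidates.get(q, 0) | (1 << seq)   # duplicates just re-set a bit

        all_bits = (1 << len(divisor_table)) - 1
        return {q for q, bits in candidates.items() if bits == all_bits}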

not refer to items in the divisor. Thus, Stadent Coarse neither a preceding semi-join nor explicit Jack Intro to Databases duplicate removal steps are necessary for naive division. The 1/0 cost of naive di- Jill Intro to Databases vision is the cost of sorting the two in- Jill Intro to Graphics puts plus the cost of repeated scans of Jill Readings in Databases the divisor input. Figure 16 shows two tables, a dividend and a divisor, properly sorted for naive division. Concurrent scans of the “Jack” tuples (only one) in the dividend and of the entire divisor determine that “Jack Figure 16. Sorted inputs into naive division is not part of the quotient because he has not taken the “Readings in Databases” course. A continuing scan through the removal during insertion into the divisor “Jill” tuples in the dividend and a new table) and automatically ignores dupli- scan of the entire divisor include “Jill” in cates in the dividend as well as dividend the output of the naive division. The fact items that do not refer to items in the that “Jill” has also taken an “Intro to divisor (e.g., the AI course in the exam- Graphics” course is ignored by a suitably ple). Thus, neither prior semi-join nor general scan logic for naive division. duplicate removal are required. However, The hash-based direct method, called if both inputs are known to be duplicate hash-division, uses two hash tables, one free, the bit maps can be replaced by for the divisor and one for the quotient counters. Furthermore, if referential in- candidates. While building the divisor tegrity is known to hold, the divisor table table, a unique sequence number is as- can be omitted and replaced by a single signed to each divisor item. After the counter. Hash-division, including these divisor table has been built, the dividend variants, has been implemented in the is consumed. For each quotient candi- Volcano query execution engine and has date, a bit map is kept with one bit for shown better performance than the other each divisor item. The bit map is indexed three algorithms [Graefe 1989; Graefe with the sequence numbers assigned to and Cole 1993]. In fact, the performance the divisor items. If a dividend item does of hash-division is almost equal to a not match with an item in the divisor hash-based join or semi-join of table, it can be ignored immediately. dividend and divisor relations (a semi- Otherwise, a quotient candidate is either join corresponds to existential quantifica- found or created, and the bit correspond- tion), making universal quantification ing to the matching divisor item is set. and relational division realistic opera- When the entire dividend has been con- tions and algorithms to use in database sumed, the quotient consists of those applications. quotient candidates for which all bits are The aspect of hash-division that makes set. it an efficient algorithm is that the set of This algorithm can ignore duplicates in matches between a quotient candidate the divisor (using hash-based duplicate and the divisor is represented efficiently

ACM Computing Surveys, Vol. 25, No 2, June 1993 114 “ Goetz Graefe using a bit map. Bit maps are one of the ample above are partitioned into under- standard data structures to represent graduate and graduate courses, the final sets, and just as bit maps can be used for result consists of the students who have a number of set operations, the bit maps taken all undergraduate courses and all associated with each quotient candidate graduate courses, i.e., those that can be can also be used for a number of opera- found in the division result of each parti- tions similar to relational division. For tion. In quotient partitioning, the entire example, Carlis [ 1986] proposed a gener- divisor must be kept in memory for all alized division operator called “HAS” that partitions. The final result is the concate- nation (union) of all partial results. For included relational division as a s~ecialL case. The hash-division algorithm can example, if Transcript items are parti- easily be extended to compute quotient tioned by odd and even student-ids, the candidates in the dividend that match a final result is the union (concatenation) majority or given fraction of divisor items of all students with odd student-id who as well as (with one more bit in each bit have taken all courses and those with map) quotient candidates that do or do even student-id who have taken all not match exactly the divisor items courses. If warranted by the input data, For real queri~s containing a division, divisor partitioning and quotient parti- consider the operation that frequently tioning can be combined. follows a division. In the example, a Hash-division can be modified into an user is typically not really interested in algorithm for duplicate removal. Con- student-ids only but in information about sider the problem of removing duplicates the students. Thus, in many cases, rela- from a relation R(X, Y) where X and Y tional division results will be used to are suitably chosen attribute groups. This select items from another relation using relation can be stored using two hash a semi-join. The sort-based algorithms tables, one storing all values of X (simi- produce their output sorted, which will lar to the divisor table) and assigning facilitate a subsequent (semi-) merge-join. each of them a unique sequence number, The hash-based algorithms produce their the other storing all values of Y and bit output in hash order; if overflow oc- maps that indicate which X values have curred, there is no m-edictable. order at occurred with each Y value. Consider a all. However, both aggregation-based and brief example for this algorithm: Say re- direct hash-based algorithms use a hash lation R(X, Y) contains 1 million tuples, table on the quotient attributes, which but only 100,000 tuples if duplicates were may be used immediately for a subse- removed. Let X and Y be each 100 bytes quent (semi-) join. It seems quite long (total record size 200), and assume straightforward to use the same hash there are 4,000 unique values of each X table for the aggregation and a subse- and Y. For the standard hash-based du- quent join as well as to modify hash- plicate removal algorithm, 100,000 x 200 division such that it removes quotient bytes of memory are needed for duplicate candidates from the quotient table that removal without use of temporary files. do not belong to the final quotient and For the redesigned hash-division algo- then performs a semi-join with a third rithm, 2 x 4,000 x 100 bytes are needed input relation. 
for data values, 4,000 x 4 for unique se- If the two hash tables do not fit into quence numbers, and 4,000 x 4,000 bits memory, the divisor table or the quotient for bit maps. Thus, the new algorithm table or both can be partitioned, and in- works efficiently with less than 3 MB of dividual partitions can be held on disk memory while conventional duplicate re- for processing in multiple steps. In diui- moval requires slightly more than 19 MB sor partitioning, the final result consists of memory, or seven times more than the of those items that are found in all duplicate removal algorithm adapted partial results; the final result is the from hash-division. Clearly, choosing at- intersection of all partial results. For ex- tribute groups X and Y to find attribute ample, if the Course relations in the ex- groups with relatively few unique values


is crucial for the performance and mem- Both approaches permit in-memory ver- ory efficiency of this new algorithm. Since sions for small data sets and disk-based such knowledge is not available in most versions for larger data sets. If a data set systems and queries (even though some fits into memory, quicksort is the sort- efficient and helpful algorithms exist, e.g., based method to manage data sets while Astrahan et al. [1987]), optimizer heuris- classic (in-memory) hashing can be used tics for choosing this algorithm might be as a hashing technique. It is interesting difficult to design and verify. to note that both quicksort and classic To summarize the discussion on uni- hashing are also used in memory to oper- versal quantification algorithms, aggre- ate on subsets after “cutting” an entire gation can be used in systems that lack large data set into pieces. The cutting direct division algorithms, and hash- process is part of the divide-and-conquer division performs universal quantifica- paradigm employed for both sort- and tion and relational division generally, i.e., hash-based query-processing algorithms. it covers cases with duplicates in the in- This important similarity of sorting and puts and with referential integrity viola- hashing has been observed before, e.g., tions, and efficiently, i.e., it permits par- by Bratbergsengen [ 1984] and Salzberg titioning and using hybrid hashing tech- [1988]. There exists, however, an impor- niques similar to hybrid hash join, mak- tant difference. In the sort-based algo- ing universal quantification (division) as rithms, a large data set is divided into fast as existential quantification (semi- subsets using a physical rule, namely into join). As will be discussed later, it can chunks as large as memory. These chunks also be effectively parallelized. are later combined using a logical step, merging. In the hash-based algorithms, large inputs are cut into subsets using a 7. DUALITY OF SORT- AND HASH-BASED logical rule, by hash values. The result- QUERY PROCESSING ALGORITHMS’4 ing partitions are later combined using a

We conclude the discussion of individual physical step, i.e., by simply concatenat- query processing by outlining the many ing the subsets or result subsets. In other existing similarities and dualities of sort- words, a single-level merge in a sort algo- and hash-based query-processing algo- rithm is a dual to partitioning in hash rithms as well as the points where the algorithms. Figure 17 illustrates this du- two types of algorithms differ. The pur- ality and the opposite directions. pose is to contribute to a better under- This duality can also be observed in standing of the two approaches and their the behavior of a disk arm performing tradeoffs. We try to discuss the ap- the 1/0 operations for merging or parti- proaches in general terms, ignoring tioning. While writing initial runs after whether the algorithms are used for rela- sorting them with quicksort, the 1/0 is tional join, union, intersection, aggre- sequential. During merging, read opera- gation, duplicate removal, or other tions access the many files being merged operations. Where appropriate, however, and require random 1/O capabilities. we indicate specific operations. During partitioning, the 1/0 operations Table 7 gives an overview of the fea- are random, but when reading a parti- tures that correspond to one another. tion later on, they are sequential. For both approaches, sorting and hash- ing, the amount of available memory lim- its not only the amount of data in a basic 14 ~art~ of ~hi~ section have been derived from unit processed using quicksort or classic Graefe et al. [1993], which also provides experimen- hashing, but also the number of basic tal evidence for the relative performance of sort- units that can be accessed simultane- and hash-based query processing algorithms and ously. For sorting, it is well known that discusses simple cases of transferring tuning ideas merging is limited to the quotient of from one type of algorithm to the other. The discus- sion of this section is continued in Graefe [ 1993a; memory size and buffer space required 1993 C]. for each run, called the merge fan-in.
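The limits just described can be stated in a few lines. The sketch below is an illustration with the exact accounting of input and output buffers simplified; it computes the merge fan-in, the partitioning fan-out, and the number of levels, which is the same for merging and for recursive partitioning given the same fan and input size.

    import math

    def merge_fan_in(memory_pages, run_buffer_pages):
        """Merge fan-in: memory divided by the buffer space required per run."""
        return memory_pages // run_buffer_pages

    def partition_fan_out(memory_pages, cluster_pages):
        """Partitioning fan-out: one output cluster per partition file, keeping
        one cluster as the input buffer (cf. F = M/C - 1 earlier in the text)."""
        return memory_pages // cluster_pages - 1

    def levels(input_pages, memory_pages, fan):
        """Merge levels for sorting, or recursion depth for partitioning: both
        change the unit of work by a factor of `fan` per level."""
        if input_pages <= memory_pages:
            return 0
        return math.ceil(math.log(input_pages / memory_pages, fan))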

ACM Computing Surveys, Vol 25, No. 2, June 1993 116 ● Goetz Graefe

Table 7. Duality of Sort- and Hash-Based Algorithms

Aspect                                    Sorting                                          Hashing
In-memory algorithm                       Quicksort                                        Classic hash
Divide-and-conquer paradigm               Physical division, logical combination           Logical division, physical combination
Large inputs                              Single-level merge                               Partitioning
I/O patterns                              Sequential write, random read                    Random write, sequential read
Temporary files accessed simultaneously   Fan-in                                           Fan-out
I/O optimizations                         Read-ahead, forecasting                          Write-behind
                                          Double-buffering, striping merge output          Double-buffering, striping partitioning input
Very large inputs                         Multilevel merge                                 Recursive partitioning
                                          Merge levels                                     Recursion depth
                                          Nonoptimal final fan-in                          Nonoptimal hash table size
Optimizations                             Merge optimizations                              Bucket tuning
Better use of memory                      Reverse runs & LRU                               Hybrid hashing
                                          Replacement selection                            ?
                                          ?                                                Single input in memory
Aggregation and duplicate removal         Aggregation in replacement selection             Aggregation in hash table
Algorithm phases                          Run generation, intermediate and final merge     Initial and intermediate partitioning,
                                                                                           in-memory (hybrid) hashing
Resource sharing                          Eager merging                                    Depth-first partitioning
                                          Lazy merging                                     Breadth-first partitioning
Partitioning skew and effectiveness       Merging run files of different sizes             Uneven output file sizes
"Item value"                              log (run size)                                   log (build partition size / original build input size)
Bit vector filtering                      For both inputs and on each merge level?         For both inputs and on each recursion level
Interesting orderings, multiple joins     Multiple merge-joins without sorting             N-ary partitioning and joins
                                          intermediate results
Interesting orderings: grouping/          Sorted grouping on foreign key useful            Grouping while building the hash table
aggregation followed by join              for subsequent join                              in hash join
Interesting orderings in index            B-trees feeding into a merge-join                Merging in hash value order
structures

In order to keep the merge process active at all times, many merge imple- mentations use read-ahead controlled by forecasting, trading reduced 1/0 delays for a reduced fan-in. In the ideal case, the bandwidths of 1/0 and processing (merging) match, and 1/0 latencies for both the merge input and output are hid- Figure 17. Duahty of partitioning and mergmg, den by read-ahead and double-buffering, as mentioned earlier in the section on sorting. The dual to read-ahead during merging is write-behind during partition- ing, i.e., keeping a free output buffer that Similarly, partitioning is limited to the can be allocated to an output file while same fraction, called the fan-out since the previous page for that file is being the limitation is encountered while writ- written to disk. There is no dual to fore- ing partition files. casting because it is trivial that the next

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 117 output partition to write to is the one for section on sorting). There is no direct which an output cluster has just filled dual in hash-based algorithms for this up. Both read-ahead in merging and optimization. With respect to memory write-behind in partitioning are used to utilization, the fact that a partition file ensure that the processor never has to and therefore a hash table might actu- wait for the completion of an 1/0 opera- ally be smaller than memory is the clos- tion. Another dual is double-buffering and est to a dual. Utilizing memory more striping over multiple disks for the out- effectively and using less than the maxi- put of sorting and the input of mal fan-out in hashing has been ad- partitioning. dressed in research on bucket tuning Considering the limitation on fan-in [Kitsuregawa et al. 1989a] and on his- and fan-out, additional techniques must togram-driven recursive hybrid hash join be used for very large inputs. Merging [Graefe 1993a]. can be performed in multiple levels, each The development of hybrid hash algo- combining multiple runs into larger ones. rithms [DeWitt et al. 1984; Shapiro 1986] Similarly, partitioning can be repeated was a consequence of the advent of large recursively, i.e., partition files are repar- main memories that had led to the con- titioned, the results repartitioned, etc., sideration of hash-based join algorithms until the partition files flt into main in the first place. If the data set is only memory. In sorting and merging, the runs slightly larger than the available mem- grow in each level by a factor equal to ory, e.g., 109%0larger or twice as large, the fan-in. In partitioning, the partition much of the input can remain in memory files decrease in size by a factor equal to and is never written to a disk-resident the fan-out in each recursion level. Thus, partition file. To obtain the same effect the number of levels during merging is for sort-based algorithms, if the database equal to the recursion depth during par- system’s buffer manager is sufficiently titioning. There are two exceptions to be smart or receives and accepts appropri- made regarding hash value distributions ate hints, it is possible to retain some or and relative sizes of inputs in binary op- all of the pages of the last run written in erations such as join; we ignore those for memory and thus achieve the same effect now and will come back to them later. of saving 1/0 operations, This effect can If merging is done in the most naive be used particularly easily if the initial way, i.e., merging all runs of a level as runs are written in reverse (descending) soon as their number reaches the fan-in, order and scanned backward for merg- the last merge on each level might not be ing. However, if one does not believe in optimal. Similarly, if the highest possible buffer hints or prefers to absolutely en- fan-out is used in each partitioning step, sure these 1/0 savings, then using a the partition files in the deepest recur- final memory-resident run explicitly in sion level might be smaller than mem- the sort algorithm and merging it with ory, and less than the entire memory is the disk-resident runs can guarantee this used when processing these files. Thus, effect. in both approaches the memory re- Another well-known technique to use sources are not used optimally in the memory more effectively and to improve most naive versions of the algorithms. 
sort performance is to generate runs In order to make best use of the final twice as large as main memory using a merge (which, by definition, includes all priority heap for replacement selection output items and is therefore the most [Knuth 1973], as discussed in the earlier expensive merge), it should proceed with section on sorting. If the runs’ sizes are the maximal possible fan-in. Making best doubled, their number is cut in half. use of the final merge can be ensured by Therefore, merging can be reduced by merging fewer runs than the maximal some amount, namely log~(2) = fan-in after the end of the input file has l/logs(F) merge levels. This optimiza- been reached (as discussed in the earlier tion for sorting has no direct dual in the

ACM Computing Surveys, Vol. 25, No. 2, June 1993 118 0 Goetz Graefe realm of hash-based query-processing If an iterator interface is used for both algorithms. its input and output, and therefore mul- If two sort operations produce input tiple operators overlap in time, a sort data for a binarv -. ooerator such as a operator can be divided into three dis- merge-join and if both sort operators’ fi- tinct algorithm phases. First, input items nal merges are interleaved with the join, are consumed and sorted into initial runs. each final merge can employ only half Second, intermediate merging reduces the memorv. In hash-based one-to-one the number of runs such that only one match algo~ithms, only one of the two final merge step is left. Third, the final inputs resides in and consumes memory merge is performed on demand from the beyond a single input buffer, not both as consumer of the sorted data stream. Dur- in two final merges interleaved with a ing the first phase, the sort iterator has merge-join. This difference in the use of to share resources, most notably memory the two inputs is a distinct advantage of and disk bandwidth, with its producer hash-based one-to-one match al~orithms operators in a query evaluation plan. that does not have a dual in s~rt-based Similarly, the third phase must share algorithms. resources with the consumers. Interestingly, these two differences of In many sort implementations, namely sort- and hash-based one-to-one match those using eager merging, the first and algorithms cancel each other out. Cutting second phase interleave as a merge step the number of runs in half (on each merge is initiated whenever the number of runs level, including the last one) by using on one level becomes equal to the fan-in. replacement selection for run generation Thus, some intermediate merge steps exactly offsets this disadvantage of sort- cannot use all resources. In lazy merging, based one-to-one match operations. which starts intermediate merges only Run generation using replacement se- after all initial runs have been created, lection has a second advantage over the intermediate merges do not share quicksort; this advantage has a direct resources with other operators and can dual in hashing. If a hash table is used to use the entire memory allocated to a compute an aggregate function using query evaluation plan; “thus, intermedi- grouping, e.g., sum of salaries by depart- ate merges can be more effective in lazy ment, hash table overflow occurs only if merging than in eager merging. the operation’s output does not fit in Hash-based query processing algo- memory. Consider, for example, the sum rithms exhibit three similar phases. of salaries by department for 100,000 First, the first partitioning step executes employees in 1,000 departments. If the concurrently with the input operator or 1,000 result records fit in memory, clas- operators. Second, intermediate parti- sic hashing (without overflow) is suffi- tioning steps divide the partition files to cient. On the other hand, if sorting based ensure that they can be processed with on quicksort is used to compute ;his ag- hybrid hashing. Third, hybrid and in- gregate function, the input must fit into memory hash methods process these par- memory to avoid temporary files.15 If re- tition files and produce output passed to placement selection is used for run gen- the consumer operators. As in sorting, eration, however, the same behavior as the first and third phases must share with classic hashing is easy to achieve. 
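Run generation by replacement selection, referred to repeatedly in this section, can be sketched with a priority heap as follows. This is a memory-resident toy version (records are plain keys, runs are returned as lists) intended only to show why runs come out about twice the size of memory on randomly ordered input.

    import heapq

    def replacement_selection(records, memory_capacity):
        """Run generation with a priority heap (replacement selection): each
        record is tagged with a run number, and a record smaller than the key
        just written is deferred to the next run."""
        it = iter(records)
        heap = []
        for _ in range(memory_capacity):
            try:
                heap.append((0, next(it)))
            except StopIteration:
                break
        heapq.heapify(heap)

        runs = []                       # list of runs, each a sorted list of keys
        current_run = -1
        while heap:
            run_no, key = heapq.heappop(heap)
            if run_no != current_run:
                runs.append([])         # start a new run file
                current_run = run_no
            runs[-1].append(key)
            try:
                nxt = next(it)
            except StopIteration:
                continue
            # A smaller key cannot go into the current run; defer it to the next one.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
        return runs

    # Example: replacement_selection([5, 3, 8, 1, 9, 2, 7], 3) yields two sorted runs.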
resources with other concurrent opera- tions in the same query evaluation plan. The standard implementation of hash- based query processing algorithms for verv large in~uts uses recursion, i.e., the ori~inal ‘algo~ithm is invoked for each 15A scheme usuw aulcksort and avoiding tem~o- partition file (or pair of partition files). rary 1/0 m this ~as~ can be devised but ~ould’be extremely cumbersome; we do not know of any While conce~tuallv sim~le. this method report or system with such a scheme. has the di~adva~tage ‘ that output is

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques 8 119 produced before all intermediate parti- I I tioning steps are complete. Thus, the op- n erators that consume the output must ,... A ~es allocate resources to receive this output, typically memory (e.g., a hash table). ‘i” k El Partitioning I Further intermediate partitioning steps * Merging - will have to share resources with the < consumer operators, making them less effective. We call this direct recursive im- Figure 18. Partitioning skew, plementation of hash-based partitioning depth-first partitioning and consider its behavior as well as its resource sharing must increase as partition files get and performance effects a dual to eager smaller in hash-based query processing merging in sorting. The alternative algorithms. For sorting, a suitable choice schedule is breadth-first partitioning, for such a value is the logarithm of the which completes each level of partition- run size [Graefe 1993c]. The value of a ing before starting the next one. Thus, sorted run is the product of the run’s size hybrid and in-memory hashing are not and the size’s logarithm. The optimal initiated until all partition files have be- merge effectiveness is achieved if each come small enough to permit hybrid and item’s value increases in each merge step in-memory hashing, and intermediate by the logarithm of the fan-in, and the partitioning steps never have to share overall value of all items increases with resources with consumer operators. this logarithm multiplied with the data Breadth-first partitioning is a dual to lazy volume participating in the merge step. merging, and it is not surprising that However, only if all runs in a merge step they are both equally more effective than are of the same size will the value of all depth-first partitioning and eager merg- items increase with the logarithm of the ing, respectively. fan-in. It is well known that partitioning skew In hash-based query processing, the reduces the effectiveness of hash-based corresponding value is the fraction of a algorithms. Thus, the situation shown in partition size relative to the original in- Figure 18 is undesirable. In the extreme put size [Graefe 1993c]. Since only the case, one of the partition files is as large build input determines the number of as the input, and an entire partitioning recursion levels in binary hash partition- step has been wasted. It is less well rec- ing, we consider only the build partition. ognized that the same issue also pertains If the partitioning is skewed, i.e., output to sort-based query processing algo- partition files are not of uniform length, rithms [Graefe 1993c]. Unfortunately, in the overall effectiveness of the partition- order to reduce the number of merge ing step is not optimal, i.e., equal to the steps, it is often necessary to merge files logarithm of the partitioning fan-out. from different merge levels and therefore Thus, preventing or managing skew in of different sizes. In other words, the partitioning hash functions is very im- goals of optimized merging and of maxi- portant [Graefe 1993a]. mal merge effectiveness do not always Bit vector filtering, which will be dis- match, and very sophisticated merge cussed later in more detail, can be used plans, e.g., polyphase merging, might be for both sort- and hash-based one-to-one required [Knuth 1973]. match operations, although it has been The same effect can also be observed if used mainly for parallel joins to date. 
“values” are attached to items in runs Basically, a bit vector filter is a large and in partition files. Values should re- array of bits initialized by hashing items flect the work already performed on an in the first input of a one-to-one match item. Thus, the value should increase operator and used to detect items in the with run sizes in sorting while the value second input that cannot possibly have a

ACM Computing Surveys, Vol. 25, No. 2, June 1993 120 “ Goetz Graefe , match in the first input. In effect, bit Merge-Join vector filtering reduces the second input to the items that truly participate in the /“\ Probe Vector 2 binary operation plus some “false passes” sort due to hash collisions in the bit vector I I filter, In a merge-join with two sort oper- sOrt Build Vector 2 ations, if the bit vector filter is used be- fore the second sort, bit vector filtering is I I Build Vector 1 Probe Vector 1 as effective as in hybrid hash join in reducing the cost of processing the sec- I I ond input. In merge-join, it can also be Scan Input 1 Scan Input 2 used symmetrically as shown in Figure 19. Notice that for the right input, bit Figure 19. Merge-join with symmetric bit vector filtering. vector filtering reduces the sort input size, whereas for the left input, it only reduces the merge-join input. In recur- sive hybrid hash join, bit vector filtering same attribute can be performed without can be used in each recursion level. The sorting intermediate join results. For effectiveness of bit vector filtering in- joining three relations, as shown in Fig- creases in deeper recursion levels, be- ure 20, pipelining data from one merge- cause the number of distinct data values join to the next without sorting trans- in each ~artition file decreases. thus re- lates into a 3:4 advantage in the number ducing tie number of hash collisions and of sort operations compared to two joins false passes if bit vector filters of the on different join keys, because the inter- same size are used in each recursion level. mediate result 01 does not need to be Moreover, it can be used in both direc- sorted. For joining N relations on the tions, i.e.: to reduce the second input us- same key, only N sorts are required in- ing a bit vector filter based on the first stead of 2 x N – 2 for joins on different input and to reduce the first input (in the attributes. Since set operations such as next recursion level) using a bit vector the union or intersection of N sets can filter based on the second input. The same always be performed using a merge-join effect could be achieved for sort-based algorithm without sorting intermediate binary operations requiring multilevel results, the effect of interesting orderings sorting and merging, although to do so is even more important for set operations implies switching back and forth be- than for relational joins. tween the two sorts for the two inputs Hash-based algorithms tend to produce after each merge level. Not surprisingly, their outputs in a very unpredictable or- switching back and forth after each merge der, depending on the hash function and level would be the dual to the .nartition- on overflow management. In order to take ing process of both inputs in recursive advantage of multiple joins on the same hybrid hash join. However, sort operators attribute (or of multiple intersections, that switch back and forth on each merge etc.) similar to the advantage derived level are not only complex to implement from interesting orderings in sort-based but may also inhibit the merge optimiza- query processing, the equality of at- tion discussed earlier. tributes has to be exploited during the The final entries in Table 7 concern logical step of hashing, i.e., during parti- interesting orderings used in the System tioning. In other words, such set opera- R query optimizer [Selinger et al. 
1979] tions and join queries can be executed and presumably other query optimizers effectively by a hash join algorithm that as well. A strong argument in favor of recursively partitions N inputs concur- sorting and merge-join is the fact that rently. The recursion terminates when merge-join delivers its output in sorted N – 1 inputs fit into memory and when order; thus, multiple merge-joins on the the Nth input is used to probe N – 1

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 121

Merge-Join b=b Merge-Join a=a /\ Sort on b Sort on b /\ 01I I Merge-Join a=a Sort on a Merge-Join a=a /\ I Sort on a Sort on a Input 13 / \ ‘n’”’” Sort on a Sort on a I I I I Input 11 Input 12 Input 11 Input 12

Figure 20. The effect of interesting orderings.

derings is an aggregation followed by a join. Many aggregations condense infor- mation about individual entities; thus, the aggregation operation is performed on a relation representing the “many” side of a many-to-one relationship or on the relation that represents relationship instances of a many-to-many relation- ship. For example, students’ grade point averages are computed by grouping and averaging transcript entries in a many- to-many relationship called transcript Figure 21. Partitioning in a multiinput hash join. between students and courses. The im- portant point to note here and in many similar situations is the grouping at- hash tables. Thus, the basic operation of tribute is a foreign key. In order to relate this N-ary join (intersection, etc.) is an the aggregation output with other infor- N-ary join of an N-tuple of partition files, mation pertaining to the entities about not pairs as in binary hash join with one which information was condensed, aggre- build and one m-obe file for each ~arti- gations are frequently followed by a join. tion. Figure 21 ‘illustrates recursiv~ par- If the grouping operation is based on titioning for a join of three inputs. In- sorting (on the grouping attribute, which stead of partitioning and joining a pair of very frequently is a foreign key), the nat- in~uts and ~airs of ~artition files as in ural sort order of the aggregation output tr~ditional binary hybrid hash join, there can be exploited for an efficient merge- are file triples (or N-tuples) at each step. join without sorting. However, N-ary recursive partitioning While this seems to be an advantage of is cumbersome to implement, in particu- sort-based aggregation and join, this lar if some of the “join” operations are combination of operations also permits a actually semi-join, outer join, set inter- special trick in hash-based query pro- section, union, or difference. Therefore, cessing [Graefe 1993 b]. Hash-based ag- until a clean implementation method for gregation is based on identifying items of hash-based N-ary matching has been the same group while building the hash found, it might well be that this distinc- table. At the end of the operation, the tion. /.,ioins on the same or on different hash table contains all output items attributes, contributes to the right choice hashed on the grouping attribute. If the between sort- and hash-based algorithms grouping attribute is the join attribute in for comdex aueries. the next operation, this hash table can Anot~er si~uation with interesting or- immediately be probed with the other

ACM Computing Surveys, Vol. 25, No 2, June 1993 122 “ Goetz Graefe join input. Thus, the combined aggrega- very poorly and create significantly tion-join operation uses only one hash higher costs than sorting and merge-join. table, not two hash tables as two sepa- If the quality of the hash function cannot rate o~erations would do. The differences be predicted or improved (tuned) dynam- to tw~ separate operations are that only ically [Graefe 1993a], sort-based query- one join input can be aggregated effi- processing algorithms are superior be- ciently and that the aggregated input cause they are less vulnerable to nonuni- must be the join’s build input. Both is- form data distributions. Since both cases, sues could be addressed by symmetric join of differently sized files and skewed hash ioins with a hash table on each of hash value distributions, are realistic sit- the in”~uts which would be as efficient as uations in database query processing, we sorting and grouping both join inputs. recommend that both sort- and hash- A third use of interesting orderings is based algorithms be included in a query- the positive interaction of (sorted, B-tree) processing engine and chosen by the index scans and merge-join. While it has query optimizer according to the two not been reported explicitly in the litera- cases above. If both cases arise simulta- ture. the leaves and entries of two hash neously, i.e., a join of differently sized indices can be merge-joinedjust like those inputs with unpredictable hash value of two B-trees, provided the same hash distribution, the query optimizer has to function was used to create the indices. estimate which one poses the greater For example, it is easy to imagine “merg- danger to system performance and choose ing” the leaves (data pages) of two ex- accordingly. tendible hash indices [Fagin et al. 1979], The important conclusion from these even if the key cardinalities and distribu- dualities is that neither the absolute in- tions are verv different. put sizes nor the absolute memory size In summa~y, there exist many duali- nor the input sizes relative to the mem- ties between sorting using multilevel ory size determine the choice between merging and recursive hash table over- sort- and hash-based query-processing flow management. Two special cases ex- algorithms. Instead, the choice should be ist which favor one or the other, however. governed by the sizes of the two inputs First, if two join inputs are of different into binary operators relative to each size (and the query optimizer can reli- other and by the danger of performance ably predict this difference), hybrid hash impairments due to skewed data or hash join outperforms merge-join because only value distributions. Furthermore, be- the smaller of the two inputs determines cause neither algorithm type outper- what fraction of the input files has to be forms the other in all situations, both written to temporary disk files during should be available in a query execution partitioning (or how often each record engine for a choice to be made in each has to be written to disk during recursive case by the query optimizer. partitioning), while each file determines its own disk 1/0 in sorting [Bratberg- 8. EXECUTION OF COMPLEX QUERY sengen 1984]. For example, sorting the PLANS larger of two join inputs using multiple merge levels is more expensive than When multiple operators such as aggre- writing a small fraction of that file to gations and joins execute concurrently in hash overflow files. 
This performance ad- a pipelined execution engine, physical re- vantage of hashing grows with the rela- sources such as memory and disk band- tive size difference of the two inputs, not width must be shared by all operators. with their absolute sizes or with the Thus, optimal scheduling of multiple op- memory size. erators and the division and allocation of Second, if the hash function is very resources in a complex plan are impor- poor, e.g., because of a prior selection on tant issues. the ioin attribute or a correlated at- In earlier relational execution engines, tribu~e, hash partitioning can perform these issues were largely ignored for two

ACM Computmg Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques “ 123 reasons. First, only left-deep trees were tention each time a new hash table is used for query execution, i.e., the right built. Most recently, Brown et al. [1992] (inner) input of a binary operator had to have considered resource allocation be a scan. In other words, concurrent tradeoffs among short transactions and execution of multiple subplans in a single complex queries. query was not possible. Second, under Schneider [1990] and Schneider and the assumption that sorting was needed DeWitt [1990] were the first to systemat- at each step and considering that sorting ically examine execution schedules and for nontrivial file sizes requires that the costs for right-deep trees, i.e., query eval- entire input be written to temporary files uation plans with multiple binary hash at least once, concurrency and the need joins for which all build phases proceed for resource allocation were basically ab- concurrently or at least could proceed sent. Today’s query execution engines concurrently (notice that in a left-deep consider more join algorithms that per- plan, each build phase receives its data mit extensive pipelining, e.g., hybrid hash from the probe phase of the previous join, join, and more complex query plans, in- limiting left-deep plans to two concurrent cluding bushy trees. Moreover, today’s joins in different phases). Among the systems support more concurrent users most interesting findings are that and use parallel-processing capabilities. through effective use of bit vector fil- Thus, resource allocation for complex tering (discussed later), memory re- queries is of increasing importance for quirements for right-deep plans might database query processing. actually be comparable to those of Some researchers have considered re- left-deep plans [Schneider 1991]. This source contention among multiple query work has recently been extended by processing operators with the focus on Chen et al. [1992] to bushy plans in- buffer management. The goal in these terpreted and executed as multiple efforts was to assign disk pages to buffer right-deep subplans, slots such that the benefit of each buffer For binary matching iterators to be slot would be maximized, i.e., the number used in bushy plans, we have identified of 1/0 operations avoided in the future. several concerns. First, some query- Sacco and Schkolnick [1982; 1986] ana- processing algorithms include a point at lyzed several database algorithms and which all data are in temporary files on found that their cost functions exhibit disk and at which no intermediate result steps when plotted over available buffer data reside in memory. Such “stop” points space, and they suggested that buffer can be used to switch efficiently between space should be allocated at the low end different subplans. For example, if two of a step for the least buffer use at a subplans produce and sort two merge-join given cost. Chou [1985] and Chou and inputs, stopping work on the first sub- DeWitt [1985] took this idea further by plan and switching to the second one combining it with separate page replace- should be done when the first sort opera- ment algorithms for each relation or scan, tor has all its data in sorted runs and following observations by Stonebraker when only the final merge is left but no [1981] on operating system support for output has been produced yet. Figure 22 database systems, and with load control, illustrates this point in time. 
Fortu- calling the resulting algorithm DBMIN. nately, this timing can be realized natu- Faloutsos et al. [1991] and Ng et al. [1991] rally in the iterator implementation of generalized this goal and used the classic sorting if input runs for the final merge economic concepts of decreasing marginal are opened in the first call of the next gain and balanced marginal gains for procedure, not at the end of the open maximal overall gain. Their measure of phase. A similar stop point is available in gain was the reduction in the number of hash join when using overflow avoidance. page faults. Zeller and Gray [1990] de- Second, since hybrid hashing produces signed a hash join algorithm that adapts some output data before the memory con- to the current memory and buffer con- tents (output buffers and hash table) can

ACM Computing Surveys, Vol. 25, No. 2, June 1993 124 - Goetz Graefe

merge join sort in its run generation phase shares resources with other operations, e.g., a sort following two sorts in their final /\ merges and a merge-join, it should also use resources proportional to its input ‘0” ‘~ size. For example, if two merge-join in- mm puts are of the same size and if the (done) to be done merge-join output which is sorted imme- diately following the merge-join is as Figure 22. The stop point during sorting large as the two inputs together, the two final merges should each use one quar- ter of memory while the run gener- be discarded and since, therefore, such a ation (quicksort) should use one half of stop point does not occur in hybrid hash memory. join, implementations of hybrid hash join Fifth, in recursive hybrid hash join, and other binary match operations should the recursion levels should be executed be parameterized to permit overflow level by level. In the most straightfor- avoidance as a run time option to be ward recursive algorithm, recursive invo- chosen by the query optimizer. This dy- cation of the original algorithm for each namic choice will permit the query opti- output partition results in depth-first mizer to force a stop point in some opera- partitioning, and the algorithm produces tors while using hybrid hash in most output as soon as the first leaf in the operations. recursion tree is reached. However, if the Third, binary-operator implementa- operator that consumes the output re- tions should include a switch that con- quires memory as soon as it receives in- trols which subplan is initiated first. In put, for example, hybrid hash join (ii) in Table 1 with algorithm outlines for itera- Figure 23 as soon as hybrid hash join (i) tors’ open, next, and close procedures, produces output, the remaining parti- the hash join open procedure executes tioning operations in the producer opera- the entire build-input plan first before tor (hybrid hash join (i)) must share opening the probe input. However, there memory with the consumer operator (hy- might be situations in which it would be brid hash join (ii)), effectively cutting the better to open the probe input before partitioning fan-out in the producer in executing the build input. If the probe half. Thus. hash-based recursive match- input does not hold any resources such ing algorithms should proceed in three as memory between open and next calls, distinct phases—consuming input and initiating the probe input first is not a initial partitioning, partitioning into files problem. However, there are situations suitable for hybrid hash join, and final in which it creates a big benefit, in par- hybrid hash join for all partitions—with ticular in bushy query evaluation plans phase two completed entirely before and in parallel systems to be discussed phase three commences. This sequence of later. partitioning steps was introduced as Fourth, if multiple operators are active breadth-first partitioning in the previous concurrently, memory has to be divided section as opposed to depth-first parti- among them. If two sorts produce input tioning used in the most straightforward data for a merge-join, which in turn recursive algorithms. Of course, the top- passes its output into another sort using most operator in a query evaluation plan quicksort, memory should be divided pro- does not have a consumer operator with portionally to the sizes of the three files which it shares resources; therefore, this involved. 
We believe that for multiple operator should use depth-first partition- sorts producing data for multiple merge- ing in order to provide a better response joins on the same attribute, proportional time, i.e., earlier delivery of the first data memory division will also work best. If a item.

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 125

Hybrid Hash Join (ii) of the principal ideas of Ingres Decompo- sition is a repetitive cycle consisting of 01/ \ three steps. First, the next step is se- Hybrid Hash Join (i) lected, e.g., a selection or join. Second, /“ \ 1“’”’13 the chosen step is executed into a tempo- Input 11 Input 12 rary table. Third, the query is simplified by removing predicates evaluated in the Figure 23. Plan for joining three inputs. completed execution step and replacing one range variable (relation) in the query with the new temporary table. The justi- fication and advantage of this approach Sixth, the allocation of resources other are that all earlier selectivities are known than memory, e.g., disk bandwidth and for each decision, because the intermedi- disk arms for seeking in partitioning and ate results are materialized. The disad- merging, is an open issue that should be vantage is that data flow between opera- addressed soon, because the different im- tors cannot be exploited, resulting in a m-ovement. rates in CPU and disk s~eeds. significant cost for writing and reading will increase the im~ortance of disk ~er- intermediate files. For very complex formance for over~ll query processing queries, we suggest modifying Decompo- performance. One possible alleviation of sition to decide on and execute multiple this .m-oblem might come from disk ar- steps in each cycle, e.g., 3 to 9 joins, rays configured exclusively for perfor- instead of executing only one selection or mance, not for reliability. Disk arrays join as in Ingres. Such a hybrid approach might not deliver the entire ~erformance might very well combine the advantages ga~n the large number of disk’drives could of a priori optimization, namely, in- provide if it is not possible to disable a memory data flow between iterators, and disk array’s parity mechanisms and to optimization with exactly known inter- access s~ecific disks within an arrav. mediate result sizes. particul~rly during partitioning aid An optimization and execution envi- merging. ronment even further tuned for very Finally, scheduling bushy trees in complex queries would anticipate possi- multiprocessor systems is not entirely ble outcomes of executing subplans and understood yet. While all considerations provide multiple alternative subsequent discussed above apply in principle, multi- plans. Figure 24 shows the structure of m-ocessors ~ermit trulv . concurrent exe- such a dynamic plan for a complex query. cution of multiple subplans in a bushy First, subplan A is executed, and statis- tree. However, it is a very hard problem tics about its result are gathered while it to schedule two or more subplans such is saved on disk. Depending on these that their result streams are available at statistics, either B or C is executed next. the right times and at the right rates, in If B is chosen and executed, one of D, E, particular in light of the unavoidable er- and F will complete the query; in the rors in selectivity and cost estimation case of C instead of B, it will be G or H. during query optimization [Christodoula- Notice that each letter A–H can be an kis 1984; Ioannidis and Christodoulakis arbitrarily complex subplan, although 19911. probably not more than 10 operations The last point, estimation errors, leads due to the limitations of current selectiv- us to suspect that plans with 30 (or even ity estimation methods. Unfortunately, 100) joins or other operations cannot be realization of such sophisticated query optimized completely before execution. 
optimizers will require further research, Thus. we susnect. that a techniaue. remi- e.g., into determination of when separate niscent of Ingres Decomposition [Wong cases are warranted and limitation of and Youssefi 1976; Youssefi and Wong the possibly exponential growth in the 1979] will prove to be more effective. One number of subplans.

ACM Computing Surveys, Vol. 25, No 2, June 1993 126 ● Goet.z Graefe

and scaleup are different, it should al- ways be clearly indicated which parallel efficiency measure is being reported. B c A third measure for the success of a parallel algorithm based on Amdahl’s law /’”% is the fraction of the sequential program DEF GH for which linear speedup was attained, defined byp=f Xs/d+(l-f)X sfor Figure 24. A decision tree of partial plans, sequential execution time s, parallel exe- cution time p, and degree of parallelism d. Resolved for f, this is f = (s – p)/ (s 9. MECHANISMS FOR PARALLEL QUERY – s/d) = ((s – p)/s)/((d – I)/d). For EXECUTION the example above, this fraction is f = Considering that all high-performance ((1200 – 100)/’1200)/((16 – 1)/’16) = computers today employ some form of 97.78%. Notice that this measure gives parallelism in their processing hardware, much higher percentage values than the it seems obvious that software written to parallel efficiency calculated earlier; manage large data volumes ought to be therefore, the two measures should not able to exploit parallel execution capabil- be confused. ities [DeWitt and Gray 1992]. In fact, we For query processing problems involv- believe that five years from now it will be ing sorting or hashing in which multiple argued that a database management sys- merge or partitioning levels are expected, tem without parallel query execution will the speedup can frequently be more than be as handicapped in the market place as linear, or superlinear. Consider a sorting one without indices. problem that requires two merge levels The goal of parallel algorithms and in a single machine. If multiple machines systems is to obtain speedup and scaleup, are used, the sort problem can be parti- and speedup results are frequently used tioned such that each machine sorts a to demonstrate the accomplishments of a fraction of the entire data amount. Such design and its implementation. Speedup partitioning will, in a good implementa- considers additional hardware resources tion, result in linear speedup. If, in addi- for a constant problem size; linear tion, each machine has its own memory speedup is considered optimal. In other such that the total memory in the system words, N times as many resources should grows with the size of the machine, fewer solve a constant-size problem in l\N of than two merge levels will suffice, mak- the time. Speedup can also be expressed ing the speedup superlinear. as parallel efficiency, i.e., a measure of how close a system comes to linear 9.1 Parallel versus Distributed Database speedup. For example, if solving a prob- Systems lem takes 1,200 seconds on a single ma- chine and 100 seconds on 16 machines, It might be useful to start the discussion the speedup is somewhat less than lin- of parallel and distributed query process- ear. The parallel efficiency is (1 x ing with a distinction of the two concepts. 1200)/(16 X 100) = 75%. In the database literature, “distributed” An alternative measure for a parallel usually implies “locally autonomous,” i.e., system’s design and implementation is each participating system is a complete scaleup, in which the problem size is al- database management system in itself, tered with the resources. Linear scaleup with access control, metadata (catalogs), is achieved when N times as many re- query processing, etc. In other words, sources can solve a problem with iV times each node in a distributed database man- as much data in the same amount of agement system can function entirely on time. 
Scaleup can also be expressed us- its own, whether or not the other nodes ing parallel efficiency, but since speedup are present or accessible. Each node per-

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques ● 127 forms its own access control, and co- be of the same types), or heterogeneous, operation of each node in a distributed meaning that multiple database manage- transaction is voluntary. Examples of ment systems work together using stan- distributed (research) systems are R* dardized interfaces but are internally [Haas et al. 1982; Traiger et al. 1982], different.lG Furthermore, distributed distributed Ingres [Epstein and Stone- systems may employ parallelism, e.g., braker 1980; Stonebraker 1986a], and by pipelining datasets between nodes SDD-1 [Bernstein et al. 1981; Rothnie et with the receiver already working on al. 1980]. There are now several commer- some items while the producer is still cial distributed relational database man- sending more. Parallel systems can be agement systems. Ozsu and Valduriez based on shared-memory (also called [1991a; 1991b] have discussed dis- shared-everything), shared-disk (multi- tributed database systems in much more ple processors sharing disks but not detail. If the cooperation among multiple memory), distributed-memory (with- database systems is only limited, the sys- out sharing disks, also called shared- tem can be called a “federated database nothing), or hierarchical computer system [ Sheth and Larson 1990]. architectures consisting of multiple clus- In parallel systems, on the other hand, ters, each with multiple CPUS and disks there is only one locus of control. In other and a large shared memory. Stone- words, there is only one database man- braker [ 1986b] compared the first three agement system that divides individual alternatives using several aspects of queries into fragments and executes the database management, and came to fragments in parallel. Access control to the conclusion that distributed memory data is independent of where data objects is the most promising database man- currently reside in the system. The query agement system platform. Each of optimizer and the query execution engine these approaches has advantages and typically assume that all nodes in the disadvantages; our belief is that the hi- system are available to participate in ef- erarchical architecture is the most gen- ficient execution of complex queries, and eral of these architectures and should participation of nodes in a given transac- be the target architecture for new data- tion is either presumed or controlled by a base software development [Graefe and global resource manager, but is not based Davison 1993], on voluntary cooperation as in dis- tributed systems. There are several par- 9.2 Forms of Parallelism allel research prototypes, e.g., Gamma There are several forms of parallelism [DeWitt et al. 1986; DeWitt et al. 1990], that are interesting to designers and im- Bubba [Boral 1988; Boral et al. 1990], plementors of query processing systems. Grace [Fushimi et al. 1986; Kitsuregawa Irzterquery parallelism is a direct result et al. 1983], and Volcano [Graefe 1990b; of the fact that most database manage- 1993b; Graefe and Davison 1993], and ment systems can service multiple re- products, e.g., Tandem’s NonStop SQL quests concurrently. In other words, [Englert et al. 1989; Zeller 1990], Tera- multiple queries (transactions) can be data’s DBC/1012 [Neches 1984; 1988; executing concurrently within a single Teradata 1983], and Informix [Davison database management system. In this 1992]. 
form of parallelism, resource contention Both distributed database systems and parallel database systems have been de- signed in various kinds, which may cre- 16 In some organizations, two different database ate some confusion. Distributed systems management systems may run on the came (fairly can be either homogeneous, meaning that large) computer. Their interactions could be called “nondistributed heterogeneous.” However, since the all participating database management rules governing such interactions are the same as systems are of the same type (the hard- for distributed heterogeneous systems, the case is ware and the operating system may even usually ignored in research and system design.

ACM Computing Surveys, Vol. 25, No 2, June 1993 128 * Goetz Graefe

is of great concern, in particular, con- represented sequences or time series in tention for memory and disk arms. a scientific database management sys- The other forms of parallelism are all tem, partitioning into subsets to be oper- based on the use of algebraic operations ated on independently would not be on sets for database query processing, feasible or would require additional e.g., selection, join, and intersection. The synchronization when putting the in- theory and practice of exploiting other dependently obtained results together. “bulk” types such as lists for parallel Both vertical interoperator parallelism database query execution are only now and intraoperator parallelism are used in developing. Interoperator parallelism is database query processing to obtain basically pipelining, or parallel execution higher performance. Beyond the obvious of different operators in a single query. opportunities for speedup and scaleup For example, the iterator concept dis- that these two concepts offer, they both cussed earlier has also been called “syn- have significant problems. Pipelining chronous pipelines” [Pirahesh et al. does not easily lend itself to load balanc- 1990]; there is no reason not to consider ing because each process or processor asynchronous pipelines in which opera- in the pipeline is loaded proportionally to tors work independently connected by a the amount of data it has to process. This buffering mechanism to provide flow amount cannot be chosen by the imple- control. mentor or the query optimizer and can- Interoperator parallelism can be used not be predicted very well. For intraoper- in two forms, either to execute producers ator, partitioning-based parallelism, load and consumers in pipelines, called uerti- balance and performance are optimal if cal in teroperator parallelism here, or to the partitions are all of equal size; how- execute independent subtrees in a com- ever, this can be hard to achieve if value plex bushy-query evaluation plan concur- distributions in the inputs are skewed. rently, called horizontal in teroperator or bushy parallelism here. A simple exam- 9.3 Implementation Strategies ple for bushy parallelism is a merge-join receiving its input data from two sort The purpose of the query execution en- processes. The main problem with bushy gine is to provide mechanisms for query parallelism is that it is hard or impossi- execution from which the query opti- ble to ensure that the two subplans start mizer can choose—the same applies for generating data at the right time and the means and mechanisms for parallel generate them at the right rates. Note execution. There are two general ap- that the right time does not necessarily proaches to parallelizing a query execu- mean the same time, e.g., for the two tion engine, which we call the bracket inputs of a hash join, and that the right and operator models and which are used, rates are not necessarily equal, e.g., if for example, in the Gamma and Volcano two inputs of a merge-join have different systems, respectively. sizes. Therefore, bushy parallelism pre- In the bracket model, there is a generic sents too many open research issues and process template that can receive and is hardly used in practice at this time. send data and can execute exactly one The final form of parallelism in operator at any point of time. 
A schematic database query processing is intraopera- diagram of a template process is shown tor parallelism in which a single opera- in Figure 25, together with two possible tor in a query plan is executed in multi- operators, join and aggregation. In order ple processes, typically on disjoint pieces to execute a specific operator, e.g., a join, of the problem and disjoint subsets of the the code that makes up the generic tem- data. This form, also called parallelism plate “loads” the operator into its place based on fragmentation or partitioning, (by switching to this operator’s code) and is enabled by the fact that query process- initiates the operator which then controls ing focuses on sets. If the underlying data execution; network 1/0 on the receiving

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 129

output when an entire query is evaluated on a single CPU (and could therefore be eval- uated in a single process, without inter- process communication and operating Input(s) system involvement) or when data do not need to be repartitioned among nodes in a network. An example for the latter is the query “joinCselAselB” in the Wiscon- sin Benchmark, which requires joining three inputs on the same attribute [De- Witt 1991], or any other query that per- mits interesting orderings [ Selinger et al. Figure 25. Bracket model of parallelization. 1979], i.e., any query that uses the same join attribute for multiple binary joins. and sending sides is performed as a ser- Thus, in queries with multiple operators vice to the operator on its request and (meaning almost all queries), interpro- initiation and is implemented as proce- cess communication and its overhead are dures to be called by the operator. The mandatory in the bracket model rather number of inputs that can be active at than optional. any point of time is limited to two since An alternative to the bracket model is there are only unary and binary opera- the operator model. Figure 26 shows a tors in most database systems. The oper- possible parallelization of a join plan us- ator is surrounded by generic template ing the operator model, i.e., by inserting code. which shields it from its environ- “parallelism” operators into a sequential ment, for example, the operator(s) that plan, called exchange operators in the produce its input and consume its out- Volcano system [Graefe 1990b; Graefe put. For parallel query execution, many and Davison 1993]. The exchange opera- templates are executed concurrently in tor is an iterator like all other operators the system, using one process per tem- in the system with open, next, and close plate. Because each operator is written procedures; therefore, the other opera- with the imdicit. assum~tion that this tors are entirely unaffected by the pres- operator controls all acti~ities in its pro- ence of exchange operators in a query cess, it is not possible to execute two evaluation plan. The exchange operator operators in one process without resort- does not contribute to data manipulation; ing to some thread or coroutine facility thus, on the logical level, it is a “no-op” i.e., a second implementation level of the that has no place in a logical query alge- process concept. bra such as the relational algebra. On In a query-processing system using the the physical level of algorithms and pro- bracket model, operators are coded in cesses, however, it provides control not such a way that network 1/0 is their provided by any of the normal operators, only means of obtaining input and deliv- i.e., process management, data redistri- ering output (with the exception of scan bution, and flow control. Therefore, it is and store operators). The reason is that a control operator or a meta-operator. each operator is its own locus of control, Separation of data manipulation from and network flow control must be used to process control and interprocess com- coordinate multiple operators, e.g., to munication can be considered an im- match two operators’ speed in a pro- portant advantage of the operator ducer-consumer relationship. 
Unfortu- model of parallel query processing, be- nately, this coordination requirement cause it permits design, implementation, also implies that passing a data item and execution of new data manipulation from one operator to another always in- algorithms such as N-ary hybrid hash volves expensive interprocess communi- join [Graefe 1993a] without regard to the cation system calls, even in the cases execution environment.

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 130 “ Goetz Graefe

Print I Print Exchange I Q Join /’”\ Join Exchange

/\ I Exchange Exchange scan I I scan scan

Figure 26. Operator model of parallehzation.

A second issue important to point out Figure 27. Processes created by exchange operators. is that the exchange operator only pro- vides mechanisms for parallel query pro- cessing; it does not determine or presup- pose policies for using its mechanisms. ous figure, with each circle representing Policies for parallel processing such as a process. Note that this set of processes the degree of parallelism, partitioning is only one possible parallelization, which functions, and allocation of processes to makes sense if the joins are on the same processors can be set either by a query join attributes. Furthermore, the degrees optimizer or by a human experimenter in of data parallelism, i.e., the number of the Volcano system as they are still sub- processes in each process group, can be ject to intense research. The design of the controlled using an argument to the ex- exchange operator permits execution of a change operator. complex query in a single process (by There is no reason to assume that the using a query plan without any exchange two models differ significantly in their operators, which is useful in single- performance if implemented with similar processor environments) or with a num- care. Both models can be implemented ber of processes by using one or more with a minimum of control overhead and exchange operators in the query evalua- can be combined with any partitioning tion plan. The mapping of a sequential scheme for load balancing. The only dif- plan to a parallel plan by inserting ex- ference with respect to performance is change operators permits one process per that the operator model permits multiple operator as well as multiple processes for data manipulation operators such as join one operator (using data partitioning) or in a single process, i.e., operator synchro- multiple operators per process, which is nization and data transfer between oper- useful for executing a complex query plan ators with a single procedure call with- with a moderate number of processes. out operating system involvement. The Earlier parallel query execution engines important advantages of the operator did not provide this degree of flexibility; model are that it permits easy paral- the bracket model used in the Gamma lelization of an existing sequential sys- design, for example, requires a separate tem as well as development and mainte- process for each operator [DeWitt et al. nance of operators and algorithms in a 1986]. familiar and relatively simple single-pro- Figure 27 shows the processes created cess environment [ Graefe and Davison by the exchange operators in the previ- 1993].

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 131

The bracket and operator models both tioned such that the processing load is provide pipelining and partitioning as nearly equal for each processor. Notice part of pipelined data transfer between that in particular for binary operations process groups. For most algebraic opera- such as join, equal processing loads can tors used in database query processing, be different from equal-sized partitions. these two forms of parallelism are suffi- There are several research efforts de- cient. However, not all operations can be veloping techniques to avoid skew or to easily supported by these two models. limit the effects of skew in parallel query For example, in a transitive closure oper- processing, e.g., Baru and Frieder [1989], ator, newly inferred data is equal to in- DeWitt et al. [ 1991b], Hua and Lee put data in its importance and role for [1991], Kitsuregawa and Ogawa [1990], creating further data. Thus, to paral- Lakshmi and Yu [1988; 1990], Omiecin- lelize a single transitive closure operator, ski [199 1], Seshadri and Naughton the newly created data must also be par- [1992], Walton [1989], Walton et al. titioned like the input data. Neither [1991], and Wolf et al. [1990; 1991]. How- bracket nor operator model immediately ever, all of these methods have their allow for this need. Hence, for transitive drawbacks, for example, additional re- closure operators, intraoperator paral- quirements for local processing to deter- lelism based on partitioning requires that mine quantiles. the processes exchange data among Skew management methods can be di- themselves outside of the stream vided into basically two groups. First, paradigm. skew avoidance methods rely on deter- The transitive closure operator is not mining suitable partitioning rules before the only operation for which this restric- data is exchanged between processing tion holds. Other examples include the nodes or processes. For range partition- complex object assembly operator de- ing, quantiles can be determined or esti- scribed by Keller et al. [1991] and opera- mated from sampling the data set to be tors for numerical optimizations as might partitioned, from catalog data, e.g., his- be used in scientific databases. Both tograms, or from a preprocessing step. models, the bracket model and the opera- Histograms kept on permament base data tor model, could be extended to provide a have only limited use for intermediate general and efficient solution to intraop- query processing results, in particular, if erator data exchange for intraoperator the partitioning attribute or a correlated parallelism. attribute has been used in a prior selec- tion or matching operation. However, for stored data they may be very beneficial. 9.4 Load Balancing and Skew Sampling implies that the entire popula- For optimal speedup and scaleup, pieces tion is available for sampling because the of the processing load must be assigned first memory load of an intermediate re- carefully to individual processors and sult may be a very poor sample for parti- disks to ensure equal completion times tioning decisions. Thus, sampling might for all pieces. In interoperator paral- imply that the data flow between opera- lelism, operators must be grouped to en- tors be halted and an entire intermediate sure that no one processor becomes the result be materialized on disk to ensure bottleneck for an entire pipeline. Bal- proper random sampling and subsequent anced processing loads are very hard to partitioning. 
However, if such a halt is achieve because intermediate set sizes required anyway for processing a large cannot be anticipated with accuracy and set, it can be used for both purposes. For certainty in database query optimization. example, while creating and writing ini- Thus, no existing or proposed query- tial run files without partitioning in a processing engine relies solely on inter- parallel sort, quantiles can be deter- operator parallelism. In intraoperator mined or estimated and used in a com- parallelism, data sets must be parti- bined partitioning and merging step.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 132 “ Goetz Graefe

10000 – solid — 99 9Z0confidence dotted — 95 % confidence Sample 10~_ o 1024 patitions Size ❑ 2 partitions ❑ Per ❑ ❑ Partition 10Q –

107 I I I I 1 1.25 1.5 1.75 2 Skew Limit

Figure 28. Skew hmit,confidence, andsamples izeperpartltion.

Second, skew resolution repartitions distribution. Figure 28 shows the re- some or all of the data if an initial parti- quired sample sizes per partition for var- tioning has resulted in skewed loads. ious skew limits, degrees of parallelism, Repartitioning is relatively easy in and confidence levels. For example, to shared-memory machines, but can also ensure a maximal skew of 1.5 among be done in distributed-memory architec- 1,000 partitions with 9570 confidence, 110 tures, albeit at the expense of more net- random samples must be taken at each work activity. Skew resolution can be site. Thus, relatively small samples suf- based on rehashing in hash partitioning fice for reasonably safe skew avoidance or on quantile adjustment in range parti- and load balancing, making precise tioning. Since hash partitioning tends to methods unnecessary. Typically, only create fairly even loads and since net- tens of samples per partition are needed, work bandwidth will increase in the near not several hundreds of samples at each future within distributed-memory ma- site. chines as well as in local- and wide-area For allocation of active processing ele- networks, skew resolution is a reason- ments, i.e., CPUS and disks, the band- able method for cases in which a prior width considerations discussed briefly in processing step cannot be exploited to the section on sorting can be generalized gather the information necessary for for parallel processes. In principal, all skew avoidance as in the sort example stages of a pipeline should be sized such above. that they all have bandwidths propor- In their recent research into sampling tional to their respective data volumes in for load balancing, DeWitt et al. [ 1991b] order to ensure that no stage in the and Seshadri and Naughton [1992] have pipeline becomes a bottleneck and slows shown that stratified random sampling the other ones down. The latency almost can be used, i.e., samples are selected unavoidable in data transfer between randomly not from the entire distributed pipeline stages should be hidden by the data set but from each local data set at use of buffer memory equal in size to the each site, and that even small sets of product of bandwidth and latency. samples ensure reasonably balanced loads. Their definition of skew is the quo- 9.5 Architectures and Architecture tient of sizes of the largest partition and Independence the average partition, i.e., the sum of sizes of all partitions divided by the de- Many database research projects have gree of parallelism. In other words, a investigated hardware architectures for skew of 1.0 indicates a perfectly even parallelism in database systems. Stone-

ACM Computmg Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 133 braker[1986b] compared shared-nothing cuted customized software on standard (distributed-memory), shared-disk (dis- processors with local disks [Boral et al. tributed-memory with multiported disks), 1990; DeWitt et al. 1990]. For scalabil- and shared-everything (shared-memory) ity and availability, both projects architectures for database use based on a used distributed-memory hardware number of issues including scalability, with single-CPU nodes and investi- communication overhead, locking over- gated scaling questions for very large head, and load balancing. His conclusion configurations. at that time was that shared-everything The XPRS system, on the other hand, excels in none of the points considered; has been based on shared memory [Hong shared-disk introduces too many locking and Stonebraker 1991; Stonebraker et al. and buffer coherency problems; and 1988a; 1988b]. Its designers believe that shared-nothing has the significant bene- modern bus architectures can handle up fit of scalability to very high degrees of to 2,000 transactions per second, and that parallelism. Therefore, he concluded that shared-memory architectures provide overall shared-nothing is the preferable automatic load balancing and faster architecture for database system imple- communication than shared-nothing mentation. (Much of this section has been machines and are equally reliable and derived from Graefe et al. [1992] and available for most errors, i.e., media Graefe and Davison [1993 ].) failures, software, and operator errors Bhide [1988] and Bhide and Stone- [Gray 1990]. However, we believe that braker [1988] compared architectural al- attaching 250 disks to a single machine ternatives for transaction processing and as necessary for 2,000 transactions per concluded that a shared-everything second [ Stonebraker et al. 1988b] re- (shared-memory) design achieves the best quires significant special hardware, e.g., performance, up to its scalability limit. channels or 1/0 processors, and it is quite To achieve higher performance, reliabil- likely that the investment for such hard- ity, and scalability, Bhide suggested ware can have greater impact on overall considering shared-nothing (distributed- system performance if spent on general- memory) machines with shared-every- purpose CPUS or disks. Without such thing parallel nodes. The same idea is special hardware, the performance limit mentioned in equally general terms by for shared-memory machines is probably Pirahesh et al. [1990] and Boral et al. much lower than 2,000 transactions per [1990], but none of these authors elabo- second. Furthermore, there already are rate on the idea’s generality or potential. applications that require larger storage Kitsuregawa and Ogawa’s [1990] new and access capacities. database machine SDC uses multiple Richardson et al. [1987] performed an shared-memory nodes (plus custom hard- analytical study of parallel join algo- ware such as the Omega network and a rithms on multiple shared-memory “clus- hardware sorter), although the effect of ters” of CPUS. They assumed a group of the hardware design on operators other clusters connected by a global bus with than join is not evaluated in their article. multiple microprocessors and shared Customized parallel hardware was in- memory in each cluster. Disk drives were vestigated but largely abandoned after attached to the busses within clusters. 
Boral and DeWitt’s [1983] influential Their analysis suggested that the best analysis that compared CPU and 1/0 performance is obtained by using only speeds and their trends. Their analysis one cluster, i.e., a shared-memory archi- concluded that 1/0, not processing, is tecture. We contend, however, that their the most likely bottleneck in future results are due to their parameter set- high-performance query execution. Sub- tings, in particular small relations (typi- sequently, both Boral and DeWitt em- cally 100 pages of 32 KB), slow CPUS barked on new database machine (e.g., 5 psec for a comparison, about 2-5 projects, Bubba and Gamma, that exe- MIPS), a slow global network (a bus with

ACM Computing Surveys, Vol. 25, No 2, June 1993 134 “ Goetz Graefe typically 100 Mbit/see), and a modest advantages. The important point is the number of CPUS in the entire system combina~ion of lo~al busies within (128). It would be very interesting to see shared-memory parallel machines and a the analysis with larger relations (e.g., dobal interconnection network amomz 1– 10 GB), a faster network, e.g., a mod- machines. The diagram is only a very ern hypercube or mesh with hardware general outline of such an architecture; routing. and consideration of bus load manv details are deliberately left out and u, . . and bus contention in each cluster, which unspecified. The network could be imple- might lead to multiple clusters being the mented using a bus such as an ethernet, better choice. On the other hand, commu- a ring, a hypercube, a mesh, or a set of nication between clusters will remain a ~oint-to-~oint connections. The local significant expense. Wong and Katz busses m-ay or may not be split into code [1983] developed the concept of “local and data or by address range to obtain sufficiency” that might provide guidance less contention and higher bus band- in declusterinp and re~lication to reduce width and hence higher scalability limits data moveme~t betw~en nodes. Other for the use of sh~red memory. “Design work on declustering and limiting declus- and placement of caches, disk controllers, tering includes Copeland et al. [1988], terminal connections. and local- and Fang et al. [1986], Ghandeharizadeh and wide-area network connections are also DeWitt [1990], Hsiao and DeWitt [1990], left open. Tape drives or other backup and Hua and Lee [ 19901. devices would be connected to local Finally, there are several hardware de- busses. signs that attempt to overcome the Modularity is a very important consid- shared-memory scaling problem, e.g., the eration for such an architecture. For ex- DASH project [Anderson et al. 1988], the ample, it should be possible to replace all Wisconsin Multicube [Goodman and CPU boards with uwn-aded.= models with- Woest 1988], and the Paradigm project out having to replace memories or disks. [Cheriton et al. 1991]. However, these Considering that new components will desires follow the traditional se~aration change communication demands, e.g., of operating system and application pro- faster CPUS might require more local bus gram. They rely on page or cache-line bandwidth, it is also important that the faulting and do not provide typical allocation of boards to local busses can be database concepts such as read-ahead changed, For example, it should be easy and dataflow. Lacking separation of to reconfigure a machine with 4 X 16 mechanism and policy in these designs CPUS into one with 8 X 8 CPUS. almost makes it imperative to implement Beyond the effect of faster communica- dataflow and flow control for database tion and synchronization, this architec- query processing within the query execu- ture can also have a simificant effect on tion engine. At this point, none of these control overhead, load balancing, and re- hardware designs has been experimen- sulting response time problems. Investi- tally tested for database query gations in the Bubba project at MCC processing. 
demonstrated that large degrees of paral- New software systems designed to ex- lelism mav reduce ~erformance unless ploit parallel hardware should be able to load imbal~nce and o~erhead for startup, exploit both the advantages of shared synchronization, and communication can memory, namely efficient communica- be kept low [Copeland et al. 1988]. For tion, synchronization, and load balanc- example, when placing 100 CPUS either ing, and of distributed memory, namely in 100 nodes or in 10 nodes of 10 CPUS scalability to very high degrees of paral- each, it is much faster to distribute query lelism and reliability and availability plans to all CPUS and much easier to through independent failures. Figure 29 achieve reasonable . balanced loads in the shows a general hierarchical architec- second case than in the first case. Within ture, which we believe combines these each shared-memory parallel node, load

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques “ 135

[ Interconnection Network I I I Loc .d Bus + CPU 1

+ CPU ]

+ CPU I

+ CPU ~

Ed Figure 29. Ahierarchical-memory architecture imbalance can be dealt with either by engine were discussed. In this section, compensating allocation of resources, e.g., individual algorithms and their special memory for sorting or hashing, or by rel- cases for parallel execution are consid- atively efficient reassignment of data to ered in more detail. Parallel database processors. query processing algorithms are typically Many of today’s parallel machines are based on partitioning an input using built as one of the two extreme cases of range or hash partitioning. Either form this hierarchical design: a distributed- of partitioning can be combined with sort- memory machine uses single-CPU nodes, and hash-based query processing algo- while a shared-memory machine consists rithms; in other words, the choices of of a single node. Software designed for partitioning scheme and local algorithm this hierarchical architecture will run on are almost always entirely orthogonal. either conventional design as well as a When building a parallel system, there genuinely hierarchical machine and will is sometimes a question whether it is allow the exploration of tradeoffs in the better to parallelize a slower sequential range of alternatives in between. The algorithm with better speedup behavior most recent version of Volcano’s ex- or a fast sequential algorithm with infe- change operator is designed for hierar- rior speedup behavior. The answer to this chical memory, demonstrating that the question depends on the design goal and operator model of parallelization also of- the planned degree of parallelism. In the fers architecture- and topology-indepen- few single-user database systems in use, dent parallel query evaluation [Graefe the goal has been to minimize response and Davison 1993]. In other words, the time; for this goal, a slow algorithm with parallelism operator is the only operator linear speedup implemented on highly that needs to “understand” the underly- parallel hardware might be the right ing architecture, while all data manipu- choice. In multi-user systems, the goal lation operators can be implemented typically is to minimize resource con- without concern for parallelism, data dis- sumption in order to maximize through- tribution, and flow control. put. For this goal, only the best sequen- tial algorithms should be parallelized. For example, Boral and DeWitt [1983] con- 10. PARALLEL ALGORITHMS cluded that parallelism is no substitute In the previous section, mechanisms for for effective and efficient indices. For a parallelizing a database query execution new parallel algorithm with impressive

ACM Computing Surveys, Vol 25, No. 2, June 1993 136 ● Goetz Graefe speedup behavior, the question of fied: the number of their parallel inputs whether or not the underlying sequential (e.g., scan or subplans executed in paral- algorithm is the most efficient choice lel) and the number of parallel outputs should always be considered. (consumers) [Graefe 1990a]. As sequen- tial input or output restrict the through- put of parallel sorts, we assume a 10.1 Parallel Selections and Updates multiple-input multiple-output paral- Since disk 1/0 is a performance bottle- lel sort here, and we further assume neck in many systems, it is natural to that the input items are partitioned parallelize it. Typically, either asyn- randomly with respect to the sort at- chronous 1/0 or one process per partici- tribute and that the outmut items pating 1/0 device is used, be it a disk or should be range-partitioned ~nd sorted an array of disks under a single con- within each range. troller. If a selection attribute is also the Considering that data exchange is ex- partitioning attribute, fewer than all pensive, both in terms of communication disks will contain selection results, and and synchronization delays, each data the number of processes and activated item should be exchanged only once be- disks can be limited. Notice that parallel tween m-ocesses. Thus. most ~arallel sort . L selection can be combined very effec- algorithms consist of a local sort and a tively with local indices, i.e., ind~ces cov- data exchange step. If the data exchange ering the data of a single disk or node. In step is done first, quantiles must be general, it is most efficient to maintain known to ensure load balancing during indices close to the stored data sets, i.e., the local sort step. Such quantil& can b: on the same node in a parallel database obtained from histograms in the catalogs system. or by sampling. It is not necessary that For updates of partitioning attributes the quantiles be precise; a reasonable in a partitioned data set, items may need approximation will suffice. to move between disks and sites, just as If the local sort is done first. the final items may move if a clustering attribute local merging should pass data directly is updated. Thus, updates of partitioning into the data exchange step. On each attributes may require setting up data receiving site, multiple sorted streams transfers from old to new locations of must be merged during the data ex- modified items in order to maintain the change step. bne of th~ possible prob- consistency of the partitioning. The fact lems is that all producers of sorted that updating partitioning attributes is streams first produce low key values, more expensive is one reason why im- limiting performance by the speed of the mutable (or nearly immutable) iden- first (single!) consumer; then all produc- tifiers or keys are usually used as ers switch to the next consumer, etc. partitioning attributes. If a different partitioning strategy than range partitioning is used, sorting with subsequent partitioning is not guaran- 10.2 Parallel Sorting teed to be deadlock free in all situations. Since sorting is the most expensive oper- Deadlock will occur if (1) multiple con- ation in many of today’s database man- sumers feed multiple producers, (2) each agement systems, much research has m-oducer m-educes a sorted stream, and been dedicated to parallel sorting ~ach con&mer merges multiple sorted [Baugsto and Greipsland 1989; Beck et streams, (3) some key-based partitioning al. 
1988; Bitton and Friedland 1982; rule is used other than range partition- Graefe 1990a; Iyer and Dias 1990; Kit- ing, i.e., hash partitioning, (4) flow con- suregawa et al, 1989b; Lorie and Young trol is enabled, and (5) the data distribu- 1989; Menon 1986; Salzberg et al. 1990]. tion is particularly unfortunate. There are two dimensions along which Figure 30 shows a scenario with two parallel sorting methods can be classi- producer and two consumer processes,

ACM Computing Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 137 i.e., both the producer operators and the consumer operators are executed with a degree of parallelism of two. The circles in Figure 30 indicate processes, and the arrows indicate data paths. Presume that the left sort produces the stream 1, 3, 5, 7 , 999, 1002, 1004, 1006, Ikii, :.., 2000 while the right sort pro- duces 2, 4, 6, 8,..., 1000, 1001, 1003, 1005, 1007,..., 1999. The merge opera- tions in the consumer processes must re- ceive the first item from each producer process before they can create their first output item and remove additional items from their input buffers. However, the producers will need to produce 500 items Figure 30. Scenario with possible deadlock. each (and insert them into one consumer’s input buffer, all 500 for one consumer) before they will send their first item to the other consumer. The data exchange buffer needs to hold 1,000 items at one point of time, 500 on each side of ! [$1 Figure 30. If flow control is enabled and Receive \ Receive ) if the exchange buffer (flow control slack) is less than 500 items, deadlock will odd Y I even occur. even The reason deadlock can occur in this m situation is that the producer processes Pa ..--.. need to ship data in the order obtained from their input subplan (the sort in Fig- k ) ure 30) while the consumer processes need to receive data in sorted order as F required by the merge. Thus, there are

two sides which both require absolute Figure 31. Deadlock-free scenario. control over the order in which data pass over the process boundary. If the two requirements are incompatible, an un- Our recommendation is to avoid the bounded buffer is required to ensure above situation, i.e., to ensure that such freedom from deadlock. query plans are never generated by the In order to avoid deadlock, it must be optimizer. Consider for which purposes ensured that one of the five conditions such a query plan would be used. The listed above is not satisfied. The second typical scenario is that multiple pro- condition is the easiest to avoid, and cesses perform a merge join of two in- should be focused on. If the receiving puts, and each (or at least one) input is processes do not perform a merge, i.e., sorted by several producer processes. An the individual input streams are not alternative scenario that avoids the prob- sorted, deadlock cannot occur because the lem is shown in Figure 31. Result data slack given in the flow control must be are partitioned and sorted as in the pre- somewhere, either at some producer or vious scenario. The important difference some consumer or several of them, and is that the consumer processes do not the process holding the slack can con- merge multiple sorted incoming streams. tinue to process data, thus preventing One of the conditions for the deadlock deadlock. problem illustrated in Figure 30 is that

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 138 . Goetz Graefe

Figure 32. Deadlock danger due to a binary operator in the consumer.

there are multiple producers and multi- If moving a sort operation into the con- ple consumers of a single logical data sumer process is not realistic, e.g., be- stream. However, a very similar deadlock cause the data already are sorted when situation can occur with single-process they are retrieved from disk as in a B-tree producers if the consumer includes an scan, alternative parallel execution operation that depends on ordering, typi- strategies must be found that do not re- cally merge-join. Figure 32 illustrates the quire repartitioning and merging of problem with a merge-join operation exe- sorted data between the producers and cuted in two consumer processes. Notice consumers. There are two possible cases. that the left and right producers in Fig- In the first case, if the input data are not ure 32 are different inputs of the merge- only sorted but also already partitioned join, not processes executing the same systematically, i.e., range or hash parti- operators as in Figures 30 and 31. The tioned, on the attribute(s) considered by consumer in Figure 32 is still one opera- the consumer operator, e.g., the by-list of tor executed by two processes. Presume an aggregate function or the join at- that the left sort produces the stream 1, tribute, the process boundary and data 3, 5, 7, ..., 999, 1002, 1004, 1006, exchange could be removed entirely. This 1008 ,...,2000 while the right sort pro- implies that the producer operator, e.g., duces 2, 4, 6, 8,...,1000, 1001, 1003, the B-tree scan, and the consumer, e.g., 1005, 1007,... , 1999. In this case, the the merge-join, are executed by the same merge-join has precisely the same effect process group and therefore with the as the merging of two parts of one logical same degree of parallelism. data stream in Figure 30. Again, if the In the second case, although sorted data exchange buffer (flow control slack) on the relevant attribute within each is too small, deadlock will occur. Similar partition, the operator’s data could be to the deadlock avoidance tactic in Fig- partitioned either round-robin or on a ure 31, deadlock in Figure 32 can be different attribute. For a join, a avoided by placing the sort operations fragment-and-replicate matching strat- into the consumer processes rather than egy could be used [Epstein et al. 1978; into the producers. However, there is an Epstein and Stonebraker 1980; Lehman additional solution for the scenario in et al. 1985], i.e., the join should execute Figure 32, namely, moving only one of within the same threads as the operator the sort operators, not both, into the con- producing sorted output while the second sumer processes. input is replicated to all instances of the

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 139

lo– Total Memory Size M =40 Input Size R = 125,000 8– Merge Depth 6–

4–

2– I I I I I I I 1 3 5 7 9 11 13 Degree of Parallelism

Figure 33. Merge depth as a function of parallelism.

join. Note that fragment-and-replicate A final remark on deadlock avoidance: methods do not work correctly for semi- Since deadlock can only occur if the con- join, outer join, difference, and union, i.e., sumer process merges, i.e., not only the when an item is replicated and is in- producer but also the consumer operator serted (incorrectly) multiple times into try to determine the order in which data the global output. A second solution that cross process boundaries, the deadlock works for all operators, not only joins, is problem only exists in a query execution to execute the consumer of the sorted engine based on sort-based set-processing data in a single thread. Recall that mul- algorithms. If hash-based algorithms tiple consumers are required for a dead- were used for aggregation, duplicate re- lock to occur. A third solution that is moval, join, semi-join, outer join, inter- correct for all operators is to send dummy section, difference, and union, the need items containing the largest key seen so for merging and therefore the danger of far from a producer to a consumer if no deadlock would vanish. data have been exchanged for a predeter- An interesting parallel sorting method mined amount of time (data volume, key with balanced communication and with- range). In the examples above, if a pro- out the possibility of deadlock in spite of ducer must send a key to all consumers local sort followed by data exchange (if at least after every 100 data items pro- the data distribution is known a priori) is cessed in the producer, the required to sort locally only by the position within buffer space is bounded, and deadlock the final partition and then exchange can be avoided. In some sense, this solu- data guaranteeing a balanced data flow. tion is very simple; however, it requires This method might be best seen in an that not only the data exchange mecha- example: Consider 10 partitions with key nism but also sort-based algorithms such values from O to 999 in a uniform distri- as merge-join must “understand” dummy bution. The goal is to have all key values items. Another solution is to exchange all between O to 99 sorted on site O, between data without regard to sort order, i.e., to 100 and 199 sorted on site 1, etc. First, omit merging in the data exchange mech- each partition is sorted locally at its orig- anism, and to sort explicitly after repar- inal site, without data exchange, on the titioning is complete. For this sort, re- last two digits only, ignoring the first placement selection might be more effec- digit. Thus, each site has a sequence such tive than quicksort for generating initial as 200, 301, 401, 902, 2, 603, 804, 605, runs because the runs would probably 105, 705,...,999, 399. Now each site be much larger than twice the size of sends data to its correct final destination. memory. Notice that each site sends data simulta-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 140 “ Goetz Graefe neously to all other sites, creating a bal- locally to resolve hash table overflow, anced data flow among all producers and items can be moved directlv . to their final consumers. While this method seems ele- site. Hopefully, this site can aggregate gant, its problem is that it requires fairly them immediately into the local hash detailed distribution information to en- table because a similar item already ex- sure the desired balanced data flow. ists. In manv recent distributed-memorv In shared-memory machines, memory machines, it” is faster to ship an item to must be divided over all concurrent sort another site than to do a local disk 1/0. processes. Thus, the more processes are In fact. some distributed-memorv ven- active, the less memory each one can get. dors attach disk drives not to tie pri- The importance of this memory division mary processing nodes but to special is the limitation it puts on the size of “1/0 nodes” because network delay is initial runs and on the fan-in in each negligible compared to 1/0 time, e.g., in merge process. In other words, large de- Intel’s iPSC/2 and its subsequent paral- grees of parallelism may impede perfor- lel architectures. mance because they increase the number The advantage is that disk 1/0 is re- of merge levels. Figure 33 shows how the quired only when the aggregation output number of merge levels grows with in- size does not fit into the aggregate mem- creasing degrees of parallelism, i.e., de- orv. available on all machines. while the creasing memory per process and merge standard local aggregation-exchange- fan-in. For input size R, total mem- global aggregation scheme requires local ory size ill, and P parallel processes, the disk 1/0 if any local output size does not merge depth L is L = log&f lP - ~((R\P)/ fit into a local memorv. The difference ( M/P)) = log ~, P- ~(R/M). The optimal between the two is d~termined by the degree of parallelism must be deter- degree to which the original input is al- mined considering the tradeoff between ready partitioned (usually not at all), parallel processi~g and large fan-ins, making this technique very beneficial. somewhat similar to the tradeoff be- tween fan-in and cluster size. Extending 10.4 Parallel Joins and Other Binary this argument using the duality of sort- Matching Operations ing and hashing, too much parallelism in hash partitioning on shared-memory ma- Binary matching operations such as join, chines can also be detrimental, both for semi-join, outer join, intersection, union, aggregation and for binary lmatching and difference are different than the pre- [Hong and Stonebraker 1993]. vious operations exactly because they are binary. For bushy parallelism, i.e., a join for which two subplans create the two 10.3 Parallel Aggregation and Duplicate inputs independently from one another Removal in parallel, we might consider symmetric Parallel algorithms for aggregation and hash join algorithms. Instead of differen- duplicate removal are best divided into a tiating between build and probe local step and a global step. First, dupli- inputs, the symmetric hash join uses two cates are eliminated locally, and then hash tables. one for each in~ut. When a data are partitioned to detect and re- data item (or packet of ite’ms) arrives, move duplicates from different original the join algorithm first determines which sites. For aggregation, local and global in~ut it came from and then ioins the aggregate functions may differ. 
For ex- ne’w data item with the hash t~ble built ample, to perform a global count, the from the other input as well as inserting local aggregation counts while the global the new data item into its hash table aggregation sums local counts into a such that data items from the other in- global count. put arriving later can be joined correctly. For local hash-based aggregation, a Such a symmetric hash join algorithm special technique might improve perfor- has been used in XPRS, a shared- mance. Instead of creating overflow files memory high-performance extensible-

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques 9 141

relational database system [Hong and outer join, difference, and union, namely, Stonebraker 1991; 1993; Stonebraker when an item is replicated and is in- et al. 1988a; 1988b] as well as in Pris- serted into the output (incorrectly) multi- ma/DB, a shared-nothing main-memory ple times. database system [Wilschut 1993; Wil- A technique for reducing network traf- schut and Apers 1993]. The advantage fic during join processing in distributed of symmetric matching algorithms is that database systems uses redundant semi- they are independent of the data rates of joins [Bernstein et al. 1981; Chiu and Ho the inputs; their disadvantage is that 1980; Gouda and Dayal 1981], an idea they require that both inputs fit in mem- that can also be used in distributed- ory, although one hash table can be memory parallel systems. For example, dropped when one input is exhausted. consider the join on a common attribute For parallelizing a single binary A of relations R and S stored on two matching operation, there are basically different nodes in a network, say r and two techniques, called here symmetric s. The semi-join method transfers a partitioning and fragment and replicate. duplicate-free projection of R on A to s, In both cases, the global result is the performs a semi-join there to determine union (concatenation) of all local results. the items in S that actually participate Some algorithms exploit the topology of in the join result, and ships these items certain architectures, e.g., ring- or cube- to r for the actual join. In other words, based communication networks [Baru based on the relational algebra law that and Frieder 1989; Omiecinski and Lin 1989]. In the symmetric partitioning meth- R JOINS ods, both inputs are partitioned on the = R JOIN (S SEMIJOIN R), attributes relevant to the operation (i.e., the join attribute for joins or all at- tributes for set operations), and then the cost savings of not shipping all of S were operation is performed at each site. Both realized at the expense of projecting and the Gamma and the Teradata database shipping the R. A-column and executing machines use this method. Notice that the semi-join. Of course, this idea can be the partitioning method (usually hashed) used symmetrically to reduce R or S or and the local join method are indepen- both, and all operations (projection, du- dent of each other; Gamma and Grace plicate removal, semi-join, and final join) use hash joins while Teradata uses can be executed in parallel on both r and merge-join. s or on more than two nodes using the In the fragment-and-replicate meth- parallel join strategies discussed earlier ods, one input is partitioned, and the in this section. Furthermore, there are other one is broadcast to all sites. Typi- probabilistic variants of this idea that cally, the larger input is partitioned by use bit vector filtering instead of semi- not moving it at all, i.e., the existing joins, discussed later in its own section. partitions are processed at their loca- Roussopoulos and Kang [ 1991] recently tions prior to the binary matching opera- showed that symmetric semi-joins are tion. Fragment-and-replicate methods particularly useful. 
Using the equalities were considered the join algorithms of (for a join of relations R and S on at- choice in early distributed database sys- tribute A) tems such as R*, SDD-1, and distributed Ingres, because communication costs overshadowed local processing costs and R JOINS because it was cheaper to send a small = R JOIN (S SEMIJOIN r~R ) input to a small number of sites than to = (R SEMIJOIN w~(S SEMIJOIN n.R)) partition both a small and a large input. JOIN (S SEMIJOIN(a)~~R) (a) Note that fragment-and-replicate meth- = (R SEMIJOIN ri-.(S SEMIJOIN WAR)) ods do not work correctly for semi-join, JOIN (S (SEMIJOIN@)r~R ) , (b)

ACM Computing Surveys, Vol 25, No. 2, ,June 1993 142 ● Goetz Graefe they designed a four-step procedure to I L I L I compute the join of two relations stored A A A i at two sites. First, the first relation’s join attribute column R. A is sent duplicate R free to the other relation’s site, s. Second, the first semi-join is computed at s, and either the matching values (term (a) above) or the nonmatching values (term (b) above) of the join column S. A are sent back to the first site, r. The choice s between (a) and (b) is made based on the number of matching and nonmatching17 Figure 34. Symmetric fragment-and-replicate join. values of S. A. Third, site r determines which items of R will participate in the join R JOIN S, i.e., R SEMIJOIN S. replicated within each row, while the Fourth, both input sites send exactly other input is partitioned and replicated those items that will participate in the over columns. Each item from one input join R JOIN S to the site that will com- “meets” each item from the other input pute the final result, which may or may at exactly one site, and the global join not be one of the two input sites. Of result is the concatenation of all local course, this two-site algorithm can be joins. used across any number of sites in a Avoiding partitioning as well as broad- parallel query evaluation system. casting for many joins can be accom- Typically, each data item is exchanged plished with a physical database design only once across the interconnection net- that considers frequently performed joins work in a parallel algorithm. However, and distributes and replicates data over for parallel systems with small communi- the nodes of a parallel or distributed sys- cation overhead, in particular for tem such that many joins already have shared-memory systems, and in parallel their input data suitably partitioned. processing systems with processors with- Katz and Wong formalized this notion as out local disk(s), it may be useful to local sufficiency [Katz and Wong 1983; spread each overflow file over all avail- Wong and Katz 1983]; more recent re- able nodes and disks in the system. The search on the issue was performed in the disadvantage of the scheme may be Bubba project [Copeland et al. 1988]. communication overhead; however, the For joins in distributed systems, a third advantages of load balancing and cumu- class of algorithms, called fetch-as- lative bandwidth while reading a parti- needed, was explored. The idea of these tion file have led to the use of this scheme algorithms is that one site performs the both in the Gamma and SDC database join by explicitly requesting (fetching) machines, called bucket spreading in the only those items from the other input SDC design [DeWitt et al. 1990; needed to perform the join [Daniels and Kitsuregawa and Ogawa 19901. Ng 1982; Williams et al. 1982]. If one For parallel non-equi-joins, a symmet- input is very small, fetching only the ric fragment-and-replicate method has necessary items of the larger input might been proposed by Stamos and Young seem advantageous. However, this algo- [Stamos and Young 1989]. As shown in rithm is a particularly poor implementa- Figure 34, processors are organized into tion of a semi-join technique discussed rows and columns. One input relation is above. 
Instead of requesting items or val- partitioned over rows, and partitions are ues one by one, it seems better to first project all join attribute values, ship (stream) them across the network, per- form the semi-join using any local binary 17SEMIJOIN stands for the anti-semi-join, which determines those items in the first input that do matching algorithm, and then stream ex- not have a match in the second input. actly those items back that will be re-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 143 quired for the join back to the first site. replicated in the main memory of all par- The difference between the semi-join ticipating processors. After replication, technique and fetch-as-needed is that the all local hash-division operators work semi-join scans the first input twice, once completely independent of each other. to extract the join values and once to Clearly, replication is trivial for shared- perform the real join, while fetch as memory machines, in particular since a needed needs to work on each data item single copy of the divisor table can be only once. shared without synchronization among multiple processes once it is complete. When using divisor partitioning, the 10.5 Parallel Universal Quantification resulting partitions are processed in par- In our earlier discussion on sequential allel instead of in phases as discussed for universal quantification, we discussed hash table overflow. However, instead of four algorithms for universal quantifica- tagging the quotient items with phase tion or relational division, namely, naive numbers, processor network addresses division (a direct, sort-based algorithm), are attached to the data items, and the hash-division (direct, hash based), and collection site divides the set of all incom- sort- and hash-based aggregation (indi- ing data items over the set of processor rect ) algorithms, which might require network addresses. In the case that the semi-joins and duplicate removal in the central collection site is a bottleneck, the inputs. collection step can be decentralized using For naive division, pipelining can be quotient partitioning. used between the two sort operators and the division operator. However, both quo- 11. NONSTANDARD QUERY PROCESSING tient partitioning and divisor partition- ALGORITHMS ing can be employed as described below for hash-division. In this section, we briefly review the For algorithms based on aggregation, query processing needs of data models both pipelining and partitioning can be and database systems for nonstandard applied immediately using standard applications. In many cases, the logical techniques for parallel query execution. operators defined for new data models While partitioning seems to be a promis- can use existing algorithms, e.g., for in- ing approach, it has an inherent problem tersection. The reason is that for process- due to the possible need for a semi-join. ing, bulk data types such as array, set, Recall that in the example for universal bag (multi-set), or list are represented as quantification using Transcript and sequences similar to the streams used in Course relations, the join attribute in the the query processing techniques dis- semi-join (course-no) is different than the cussed earlier, and the algorithms to ma- grouping attribute in the subsequent ag- nipulate these bulk types are equal to gregation (student-id). Thus, the Tran- the ones used for sets of tuples, i.e., rela- script relation has to be partitioned twice, tions. However, some algorithms are gen- once for the semi-join and once for the uinely different from the algorithms we aggregation. have surveyed so far. In this section, we For hash-division, pipelining has only review operators for nested relations, limited promise because the entire temporal and scientific databases, division is performed within a single object-oriented databases, and more operator. 
However, both partitioning meta-operators for additional query pro- strategies discussed earlier for hash table cessing control. overflow can be employed for parallel ex- There are several reasons for integrat- ecution, i.e., quotient partitioning and di- ing these operators into an algebraic visor partitioning [Graefe 1989; Graefe query-processing system. First, it per- and Cole 1993]. mits efficient data transfer from the For hash-division with quotient par- database to the application embodied in titioning, the divisor table must be these operators. The interface between

ACM Computing Surveys, Vol. 25, No. 2, June 1993 144 “ Goetz Graefe database operators is designed to be as can be represented more naturally than efficient as possible; the same efficient in the fully normalized model; many fre- interface should also be used for applica- quent join operations can be avoided, and tions. Second, operator implementors can structural information can be used for take advantage of the control provided by physical clustering. Its disadvantage is the meta-operators. For example, an op- the added complexity, in particular, in erator for a scientific application can storage management and query be implemented in a single-process envi- m-ocessing. ronment and later parallelized with the Severa~ algebras for nested relations exchange operator. Third, query opti- have been defined, e.g., Deshpande and mization based on algebraic transform a- Larson [1991], Ozsoyoglu et al. [1987], tion rules can cover all operators, includ- Roth et al. [ 1988], Schek and Scholl ing operations that are normally consid- [1986], Tansel and Garnett [1992]. Our ered database application code. For ex- discussion here focuses not on the concep- ample, using algebraic optimization tools tual design of NF z algebras but on algo- such as the EXODUS and Volcano opti- rithms to manipulate nested relations. mizer generators [ Graefe and DeWitt Two operations required in NF 2 1987; Graefe et al. 1992; Graefe and database systems are operations that McKenna 1993], optimization rules that transform an NF z relation into a normal- can move an unusual database operator ized relation with atomic attributes onlv../ in a query plan are easy to implement. and vice versa. The first operation is For a sampling operator, a rule might frequently called unnest or flatten; the permit transforming an algebra expres- opposite direction is called the nest oper- sion to query a sample instead of sam- ation. The unnest operation can be per- pling a query result, formed in a single scan over the NF 2 relation that includes the nested subtu- ples; both normalized relations in Figure 11.1 Nested Relations 35 and their join can be derived readily Nested relations, or Non-First-Normal- enough from the NF z relation. The nest Form (NF 2) relations, permit relation- operation requires grouping of tuples in valued attributes in addition to atomic the detail relation and a join with the values such as integers and strings used master relation. Grouping and join can in the normal or “flat” relational model. be implemented using any of the algo- For example, in an order-processing ap- rithms for aggregate functions and bi- plication, the set of individual line items nary matching discussed earlier, i.e., sort- on each order could be represented as a and hash-based sequential and parallel nested relation, i.e., as part of an order methods. However, in order to ensure tuple. Figure 35 shows an NF z relation that unnest and nest o~erations. are ex- with two tuples with two and three nested act inverses of each other, some struc- tuples and the equivalent normalized re- tural information might have to be pre- lations, which we call the master and served in the unnest operation. Ozsoyoglu detail relations. Nested relations can be and Wang [1992] present a recent inves- used for all one-to-many relationships but tigation of “keying methods” for this pur- are particularly well suited for the repre- pose. 
sentation of “weak entities” in the All operations defined for flat relations Entity-Relationship (ER) Model [Chen can also be defhed for nested relations, 1976], i.e., entities whose existence and in particular: selection, join, and set op- identification depend on another entity erations (union, intersection, difference). as for order entries in Figure 35. In gen- For selections, additional power is gained eral, nested subtuples may include rela- with selection conditions on subtuples tion-valued attributes, with arbitrary and sets of subtuples using set compar- nesting depth. The advantages of the NF 2 isons or existential or universal auantifl-. model are that component relationships cation. In principle, since a nested rela-

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 145

Order Customer Date Items -No -No Part-No Count 110 911 910902 4711 8 2345 7 112 912 910902 9876 3 2222 1 2357 9

Order-No Part-No Quantity 110 4711 8 110 2345 7 112 9876 3 112 2222 1 112 2357 9

Figure 35. Nested relation and equivalent flat relations.

tion is a relation, any relational calculus Deshpande and Larson [ 1992] investi- and algebra expression should be permit- gated join algorithms for nested relations ted for it. In the example in Figure 35, because “the purpose of nesting in order there may be a selection of orders in to store precomputed joins is defeated which the ordered quantity of all items is if it is unnested every time a join is more than 100, which is a universal performed on a subrelation.” Their algo- quantification. The algorithms for selec- rithm, (parallel) partitioned nested- tions with quantifier are similar to the hashed-loops, joins one relation’s subre- ones discussed earlier for flat relations, lations with a second, flat relation by e.g., relational semi-join and division, but creating an in-memory hash table with are easier to implement because the the flat relation. If the flat relation is grouping process built into the flat- larger than memory, memory-sized seg- relational algorithms is inherent in the ments are loaded one at a time, and the nested tuple structure. nested relation is scanned repeatedly. For joins, similar considerations apply. Since an outer tuple of the nested rela- Matching algorithms discussed earlier tion might have matches in multiple seg- can be used in principle. They may be ments of the flat relation, a final merging more complex if the join predicate in- pass is required. This join algorithm is volves subrelations, and algorithm com- reminiscent of hash-division, the flat re- binations may be required that are de- lation taking the role of the divisor and rived from a flat-relation query over flat the nested tuples replacing the quotient relations equivalent to the NF 2 query table entries with their bit maps. over the nested relations. However, there Sort-based join algorithms for nested should be some performance improve- relations require either flattening the ments possible if the grouping of values nested tuples or scanning the sorted flat in the nested relations can be exploited, relation for each nested tuple, somewhat as for example, in the join algorithms reminiscent of naive division. Neither al- described by Rosenthal et al. [1991]. ternative seems very promising for large

ACM Computing Surveys, Vol. 25, No 2, June 1993 146 “ Goetz Graefe

inputs. Sort semantics and appropriate join. They do not reestablish relation- sort algorithms including duplicate re- ships based on identifying keys but match moval and grouping have been consid- data values that express a dimension in ered by Saake et al. [1989] and Kuespert which distance can be defined, in partic- et al. [ 1989]. Other researchers have fo- ular, time. Traditionally, such join predi- cused on storage and retrieval methods cates have been considered non-equi-joins for nested relations and operations possi- and were evaluated by a variant of ble with single scans [Dadam et al. 1986; nested-loops join. However, such “band Deppisch et al. 1986; Deshpande and Van joins” can be executed much more effi- Gucht 1988; Hafez and Ozsoyoglu 1988; ciently by a variant of merge-join that Ozsoyoglu and Wang 1992; Scholl et al. keeps a “window” of inner relation tuples 1987; Scholl 1988]. in memory or by a variant of hash join that uses range partitioning and assigns some build tuples to multiple partition 11.2 Temporal and Scientific Database files. A similar partitioning model must Management be used for parallel execution, requiring For a variety of reasons, management multi-cast for some tuples. Clearly, these and manipulation of statistical, tempo- variants of merge-join and hash join will ral, and scientific data are gaining inter- outperform nested loops for large inputs, est in the database research community. unless the band is so wide that the join Most work on temporal databases has result approaches the Cartesian product. focused on semantics and representation For storage and management of the in data models and query languages massive amounts of data resulting from [McKenzie and Snodgrass 1991; Snod- scientific experiments, database tech- grass 1990]; some work has considered niques are very desirable. Operators for special storage structures, e.g., Ahn and processing time series in scientific Snodgrass [1988], Lomet and Salzberg databases are based on an interpretation [ 1990b], Rotem and Segev [1987], Sever- of a stream between operators not as a ance and Lehman [1976], algebraic oper- set of items (as in most database applica- ators, e.g., temporal joins [Gunadhi and tions) but as a sequence in which the Segev 1991], and optimization of tempo- order of items in a stream has semantic ral queries, e.g., Gunadhi and Segev meaning. For example, data reduction [1990], Leung and Muntz [1990; 1992], using interpolation as well as extrapola- Segev and Gunadhi [1989]. While logical tion can be performed within the stream query algebras require extensions to ac- paradigm. Similarly, digital filtering commodate time, only some storage [Hamming 19771 also fits the stream- structures and algorithms, e.g., multidi- processing protocol very easily. Interpo- mensional indices, differential files, and lation, extrapolation, and digital filtering versioning, and the need for approximate were implemented in the Volcano system selection and matching (join) predicates with a single algorithm (physical opera- are new in the query execution algo- tor) to verify this fit, including their opti- rithms for temporal databases. mization and parallelization [ Graefe and A number of operators can be identi- Wolniewicz 1992; Wolniewicz and Graefe fied that both add functionality to 1993]. Another promising candidate is vi- database systems used to process scien- sualization of single-dimensional arrays tific data and fit into the database query such as time series, processing paradigm. DeWitt et al. 
Problems that do not fit the stream [1991a] considered algorithms for join paradigm, e.g., many matrix operations predicates that express proximity, i.e., such as transformations used in linear join predicates of the form R. A – c1 < algebra, Laplace or Fast Fourier Trans- S.B < R.A + Cz for some constants c1 form, and slab (multidimensional sub- and Cz. Such join predicates are very array) extraction, are not as easy to different from the usual use of relational integrate into database query processing

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 147 systems. Some of them seem to fit better gebraic optimization techniques will con- into the storage management subsystem tinue to be useful. rather than the algebraic query execu- Associative operations are an impor- tion engine. For example, slab extraction tant part in all object-oriented algebras has been integrated into the NetCDF because they permit reducing large storage and access software [Rew and amounts of data to the interesting subset Davis 1990; Unidata 1991]. However, it of the database suitable for further con- is interesting to note that sorting is a sideration and processing. Thus, set- suitable algorithm for permuting the lin- processing and set-matching algorithms ear representation of a multidimensional as discussed earlier in this survey will be array, e.g., to modify the hierarchy of found in object-oriented systems, imple- dimensions in the linearization (row- vs. mented in such a way that they can oper- column-major linearization). Since the ate on heterogeneous sets. The challenge final position of all elements can be for query optimization is to map a com- predicted from the beginning of the op- plex query involving complex behavior eration, such “sort” algorithms can be and complex object structures to primi- based on merging or range partitioning tives available in a query execution en- (which is yet another example of the gine. Translating an initial request with duality of sort- and hash- (partitioning-) abstract data types and encapsulated be- based data manipulation algorithms). havior coded in a computationally com- plete language into an internal form that both captures the entire query’s seman- 11.3 Object-Oriented Database Systems tics and allows effective query optimiza- Research into query processing for exten- tion is still an open research issue sible and object-oriented systems has [Daniels et al. 1991; Graefe and Maier been growing rapidly in the last few 1988]. years. Most proposals or implementa- Beyond associative indices discussed tions use algebras for query processing, earlier, object-oriented systems can also e.g., Albert [1991], Cluet et al. [1989], benefit from special relationship indices, Graefe and Maier [1988], Guo et al. i.e., indices that contain condensed infor- [199 1], Mitschang [1989], Shaw and mation about interobject references. In Zdonik [1989a; 1989b; 1990], Straube and principle, these index structures are sim- Ozsu [1989], Vandenberg and DeWitt ilar to join indices [Valduriez 1987] but [1991], Yu and Osborn [1991]. These al- can be generalized to support multiple gebras resemble relational algebra in the levels of referencing. Examples for in- sense that they focus on bulk data types dices in object-oriented database systems but are generalized to support operations include the work of Maier and Stein on arrays, lists, etc., user-defined opera- [1986] in the Gemstone object-oriented tions (methods) on instances, heteroge- database system product, Bertino [1990; neous bulk types, and inheritance. The 1991] and Bertino and Kim [1989], in the use of algebras permits several impor- Orion project and Kemper et al. [19911 tant conclusions. First, naive execution and Kemper and Moerkotte [1990a; models that execute programs as if all 1990b] in the GOM project. At this point, data were in memory are not the only it is too early to decide which index alternative. 
Second, data manipulation structures will be the most useful be- operators can be designed and imple- cause the entire field of query processing mented that go beyond data retrieval and in object-oriented systems is still devel- permit some amount of data reduction, oping rapidly, from query languages to aggregation, and even inference. Third, algebra design, algorithm repertoire, and algebraic execution techniques including optimization techniques. other areas of the stream paradigm and parallel execu- intense current research interest are tion can be used in object-oriented data buffer management and clustering of models and database systems, Fourth, al- objects on disk.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 148 “ Goetz Graefe

One of the big performance penalties or control operator. There are several in object-oriented database systems is other control o~erators. that can be used “pointer chasing” (using OID references) in database query processing, and we which may involve object faults and disk survey them briefly in this section. read operations at widely scattered loca- In situations in which an intermediate tions, also called “goto’s on disk.” In or- result is used repeatedly, e.g., a nested- der to reduce 1/0 costs, some systems loops join with a composite inner input, use what amounts to main-memory either the intermediate result is derived databases or map the entire database many times, or it is saved in a temporary into virtual memory. For systems with file during its first derivation and then an explicit database on disk and an in- retrieved from this file while serving sub- memory buffer, there are various tech- sequent requests. This situation arises niques to detect object faults; some not only with nested-loops join but also commercial object-oriented database sys- with other algorithms, e.g., sort-based tems use hardware mechanisms origi- universal quantification [Smith and nally perceived and implemented for Chang 1975]. Thus, it might be useful to virtual-memory systems. While such encapsulate this functionality in a new hardware support makes fault detection algorithm, which we call the store-and- faster. it does not address the .m-oblem of scan o~erator. expensive 1/0 operations. In order to re- The’ store-and-scan operator permits duce actual 1/0 cost, read-ahead and three generalizations. First, if the first planned buffering must be used. Palmer consumption of the intermediate result and Zdonik [1991] recently proposed might actually not need it entirely, e.g., a keeping access patterns or sequences and nested-loops semi-join which terminates activating read-ahead if accesses equal each inner scan after the first match, the or similar to a stored ~attern are de- operator should be switched to derive only tected. Another recent p~oposal for effi- the necessary data items (which implies cient assembly of complex objects uses a leaving the input plan ready to produce window (a small set) of open references more data later) or to save the entire and resolves, at any point of time, the intermediate result in the temporary file most convenient one by fetching this ob- right away in order to permit release of ject or component from disk, which has all resources in the subplan. Second, sub- shown dramatic improvements in disk sequent scans might permit starting not seek times and makes complex object re- at the beginning of the temporary file but trieval more efficient and more indepen- at some later point. This version is useful dent of object clustering [Keller et al. if manv . duplicates. exist in the inwts. of 19911. Policies and mechanisms for effi- one-to-one matching algorithms based on cient parallel complex object assembly are merge-join. Third, in some execution an important challenge for the develop- strategies for correlated SQL sub queries, ers of next-generation object-oriented the plan corresponding to the inner block database management systems [Maier et is executed once for each tuple in the al. 1992]. outer block. The tudes of the outer block provide different c~rrelation values, al- though each value may occur repeatedly. 
11.4 More Control Operators In order to ensure that the inner plan is The exchange operator used for parallel executed only once for each outer correla- query processing is not a normal opera- tion value, the store-and-scan operator tor in the sense that it does not manipu- could retain information about which part late, select, or transform data. Instead, of its temporary file corresponds to which the exchange operator provides control of correlation value and restrict each scan query processing in a way orthogonal to appropriately. Another use of a tem~orarv file is to what a query does and what algorithms .“ it uses. Therefore, we call it a meta- support common subexpressions, which

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques - 149 can be executed efficiently with an opera- could design data flow translation con- tor that passes the result of a common trol operators. The first such operator subexpression to multiple consumers, as which we call the active scheduler can be mentioned briefly in the section on the used between a demand-driven producer architecture of query execution engines. and a data-driven consumer. In this case, The problem is that multiple consumers, neither operator will schedule the other; typically demand-driven and demand- therefore, an active scheduler that de- driving their inputs, will request items of mands items from the producer and forces the common subexpression result at dif- them onto the consumer will glue these ferent times or rates. The two standard two operators together. An active- solutions are either to execute the com- scheduler schematic is shown in the third mon subexpression into a temporary file diagram of Figure 36. The opposite case, and let each consumer scan this file at a data-driven producer and a demand- will or to determine which consumer will driven consumer, has two operators, each be the first to require the result of the trying to schedule the other one. A sec- common subexpression, to execute the ond flow control operator, called the pas- common subexpression as part of this sive scheduler, can be built that accepts consumer and to create a file with the procedure calls from either neighbor and common subexpression result as a by- resumes the other neighbor in a corou- product of the first consumer’s execution. tine fashion to ensure that the resumed Instead, we suggest a new meta-operator, neighbor will eventually demand the item which we call the split operator, to be the scheduler just received. The final dia- placed at the top of the common subex- gram of Figure 36 shows the control flow pression’s plan and which can serve mul- of a passive scheduler. (Notice that this tiple consumers at their own paces. It case is similar to the bracket model of automatically performs buffering to ac- parallel operator implementations dis- count for different paces, uses temporary cussed earlier in which an operating sys- disk space if the discrepancies are too tem or networking software layer had to wide, and is suitably parameterized to be placed between data manipulation op- permit both standard solutions described erators and perform buffering and flow above. control.) In query processing systems, data flow Finally, for very complex queries, it is usually paced or driven from the top, might be useful to break the data flow the consumer. The leftmost diagram of between operators at some point, for two Figure 36 shows the control flow of nor- reasons. First, if too many operators run mal iterators. (Notice that the arrows in in parallel, contention for memory or Figure 36 show the control flow; the data temporary disks might be too intense, flow of all diagrams in Figure 36 is as- and none of the operators will run as sumed to be upward. In data-driven data efficiently as possible. A long series of flow, control and data flows point in the hybrid hash joins in a right-deep query same direction; in demand-driven data plan illustrates this situation. Second, flow, their directions oppose each other.) 
due to the inherent error in selectivity However, in real-time systems that cap- estimation during query optimization ture data from experiments, this ap- [Ioannidis and Christodoulakis 1991; proach may not be realistic because the Mannino et al. 1988], it might be worth- data source, e.g., a satellite receiver, has while to execute only a subset of a plan, to be able to unload data as they arrive. verify the correctness of the estimation, In such systems, data-driven operators, and then resume query processing with shown in the second diagram of Figure another few steps. After a few processing 36, might be more appropriate. To com- steps have been performed, their result bine the algorithms implemented and size and other statistical properties such used for query processing with such as minimum and maximum and approxi- real-time data capture requirements, one mate number of duplicate values can be

ACM Computing Surveys, Vol. 25, No. 2, June 1993 150 “ Goet.z Graefe

r+ Passive Standard Data-Driven I Active Iterator Opemtor Scheduler Scheduler

L + CDFigure 36. Operators, schedulers, and control flow

12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT

In this section, we consider some additional techniques that have been proposed in the literature or used in real systems and that have not been discussed in earlier sections of this survey. In particular, we consider precomputation, data compression, surrogate processing, bit vector filters, and specialized hardware. Recently proposed techniques that have not been fully developed are not discussed here, e.g., "racing" equivalent plans and terminating the ones that seem not competitive after some small amount of time.

12.1 Precomputation and Derived Data

It is trivial to answer a query for which the answer is already known—therefore, precomputation of frequently requested information is an obvious idea. The problem with keeping preprocessed information in addition to base data is that it is redundant and must be invalidated or maintained on updates to the base data.

Precomputation and derived data such as relational views are duals. Thus, concepts and algorithms designed for one will typically work well for the other. The main difference is the database user's view: precomputed data are typically used after a query optimizer has determined that they can be used to answer a user query against the base data, while derived data are known to the user and can be queried without regard to the fact that they actually must be derived at run-time from stored base data. Not surprisingly, since derived data are likely to be referenced and requested by users and application programs, precomputation of derived data has been investigated both for relational and object-oriented data models.

Indices are the simplest form of precomputed data since they are a redundant and, in a sense, precomputed selection. They represent a compromise between a nonredundant database and one with complex precomputed data because they can be maintained relatively efficiently.

The next more sophisticated form of precomputation are inversions as provided in System R's "0th" prototype [Chamberlin et al. 1981a], view indices as analyzed by Roussopoulos [1991], two-relation join indices as proposed by Valduriez [1987], or domain indices as used in the ANDA project (called VALTREE there) [Deshpande and Van Gucht 1988] in which all occurrences of one domain (e.g., part number) are indexed together, and each index entry contains a relation identification with each record identifier. With join or domain indices, join queries can be answered very fast, typically faster than using multiple single-relation indices. On the other hand, single-clause selections and updates may be slightly slower if there are more entries for each indexed key.

For binary operators, there is a spectrum of possible levels of precomputations (as suggested by J. A. Blakeley), explored predominantly for joins. The simplest form of precomputation in support of binary operations is individual indices, e.g., clustering B-trees that ensure and maintain sorted relations. On the other extreme are completely materialized join results. Intermediate levels are pointer-based joins [Shekita and Carey 1990] (discussed earlier in the section on matching) and join indices [Valduriez 1987]. For each form of precomputed result, the required redundant data structures must be maintained each time the underlying base data are updated, and larger retrieval speedup might be paid for with larger maintenance overhead.

Babb [1982] explored storing only results of outer joins, but not the normalized base relations, in the content-addressable file store (CAFS), and called this encoding join normal form. Blakeley et al. [1989], Blakeley and Martin [1990], Larson and Yang [1985], Medeiros and Tompa [1985], Tompa and Blakeley [1988], and Yang and Larson [1987] investigated storing and maintaining materialized views in relational database systems. Their hope was to speed relational query processing by using derived data, possibly without storing all base data, and ensuring that their maintenance overhead would be less than their benefits in faster query processing. For example, Blakeley and Martin [1990] demonstrated that for a single join there exists a large range of retrieval and update mixes in which materialized views outperform both join indices and hybrid hash join. This investigation should be extended, however, for more complex queries, e.g., joins of three and four inputs, and for queries in object-oriented systems and emerging database applications.

Hanson [1987] compared query modification (i.e., query evaluation from base relations) against the maintenance costs of materialized views and considered in particular the cost of immediate versus deferred updates. His results indicate that for modest update rates, materialized views provide better system performance. Furthermore, for modest selectivities of the view predicate, deferred-view maintenance using differential files [Severance and Lohman 1976] outperforms immediate maintenance of materialized views. However, Hanson also did not include multi-input joins in his study.
Sellis [1987] analyzed caching of results in a query language called Quel+ (which is a subset of Postquel [Stonebraker et al. 1990b]) over a relational database with procedural (QUEL) fields [Sellis 1987]. He also considered the case of limited space on secondary storage used for caching query results, and replacement algorithms for query results in the cache when the space becomes insufficient.
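As one concrete point on the spectrum of precomputation for binary operations, the short Python sketch below (an illustration under simple in-memory assumptions, not taken from any cited system) builds a two-relation join index — pairs of record identifiers of matching tuples — and uses it to answer a join without rescanning and rematching the base relations; like any precomputed structure, it must be maintained whenever the base data change.

```python
# Illustrative sketch of a two-relation join index: precomputed pairs
# of record identifiers (RIDs) of tuples that join with each other.

# Base relations, addressed by RID (list position).
parts  = [("P1", "bolt"), ("P2", "nut"), ("P3", "washer")]
orders = [("O1", "P1"), ("O2", "P3"), ("O3", "P1")]

def build_join_index(parts, orders):
    """Precompute all (part_rid, order_rid) pairs whose tuples join."""
    by_key = {}
    for rid, (part_no, _) in enumerate(parts):
        by_key.setdefault(part_no, []).append(rid)
    index = []
    for orid, (_, part_no) in enumerate(orders):
        for prid in by_key.get(part_no, []):
            index.append((prid, orid))
    return index

join_index = build_join_index(parts, orders)

# Answering the join later requires only RID lookups, no matching.
result = [(parts[prid], orders[orid]) for prid, orid in join_index]
print(result)

# On an update to the base data, the redundant structure must be
# maintained, e.g., by appending the new matching RID pairs.
orders.append(("O4", "P2"))
new_orid = len(orders) - 1
join_index += [(prid, new_orid) for prid, (pno, _) in enumerate(parts)
               if pno == "P2"]
```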


Links between records (pointers of some sort, e.g., record, tuple, or object identifiers) are another form of precomputation. Links are particularly effective for system performance if they are combined with clustering (assignment of records to pages). Database systems for the hierarchical and network models have used physical links and clustering, but supported basically only queries and operations that were "precomputed" in this way. Some researchers tried to overcome this restriction by building relational query engines on top of network systems, e.g., Chen and Kuck [1984], Rosenthal and Reiner [1985], Zaniolo [1979]. However, with performance improvements in the relational world, these efforts seem to have been abandoned. With the advent of extensible and object-oriented database management systems, combining links and ad hoc query processing might become a more interesting topic again. A recent effort for an extensible-relational system is Starburst's pointer-based joins discussed earlier [Haas et al. 1990; Shekita and Carey 1990].

In order to ensure good performance for its extensive rule-processing facilities, Postgres uses precomputation and caching of the action parts of production rules [Stonebraker 1987; Stonebraker et al. 1990a; 1990b]. For automatic maintenance of such derived data, persistent "invalidation locks" are stored for detection of invalid data after updates to the base data.

Finally, the Cactis project focused on maintenance of derived data in object-oriented environments [Hudson and King 1989]. The conclusions of this project include that incremental maintenance coupled with a fairly simple adaptive clustering algorithm is an efficient way to propagate updates to derived data.

One issue that many investigations into materialized views ignore is the fact that many queries do not require views in their entirety. For example, if a relational student information system includes a view that computes each student's grade point average from the enrollment data, most queries using this view will select only a single student, not all students at the school. Thus, if the view definition is merged into the query before query optimization, as discussed in the introduction, only one student's grade point average, not the entire view, will be computed for each query. Obviously, the treatment of this difference will affect an analysis of costs and benefits of materialized views.
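The following small Python sketch is purely illustrative (the relation contents and function names are invented): it contrasts materializing the entire grade-point-average view with merging the view definition into the query so that only the one requested student's average is computed.

```python
# Illustrative sketch: evaluating a query through a view either by
# materializing the whole view or by merging the view definition
# into the query and pushing the selection down.

enrollment = [
    ("alice", "db", 4.0), ("alice", "os", 3.0),
    ("bob",   "db", 2.0), ("bob",   "ai", 3.0),
]

def gpa_view(rows):
    """The view: grade point average for every student in the input."""
    totals = {}
    for student, _, grade in rows:
        s = totals.setdefault(student, [0.0, 0])
        s[0] += grade
        s[1] += 1
    return {student: total / n for student, (total, n) in totals.items()}

# Naive evaluation: materialize the entire view, then select one row.
print(gpa_view(enrollment)["alice"])

# Merged evaluation: push the selection on the student into the view,
# so only that student's enrollment rows are aggregated.
print(gpa_view([r for r in enrollment if r[0] == "alice"])["alice"])
```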
12.2 Data Compression

A number of researchers have investigated the effect of compression on database systems and their performance [Graefe and Shapiro 1991; Lynch and Brownrigg 1981; Ruth and Keutzer 1972; Severance 1983]. There are two types of compression in database systems. First, the amount of redundancy can be reduced by prefix and suffix truncation, in particular, in indices, and by use of encoding tables (e.g., color combination "9" means "red car with black interior"). Second, compression schemes can be applied to attribute values, e.g., adaptive Huffman coding or Ziv-Lempel methods [Bell et al. 1989; Lelewer and Hirschberg 1987]. This type of compression can be exploited most effectively in database query processing if all attributes of the same domain use the same encoding, e.g., the "Part-No" attributes of data sets representing parts, orders, shipments, etc., because common encodings permit comparisons without decompression.

Most obviously, compression can reduce the amount of disk space required for a given data set. Disk space savings has a number of ramifications on I/O performance. First, the reduced data space fits into a smaller physical disk area; therefore, the seek distances and seek times are reduced. Second, more data fit into each disk page, track, and cylinder, allowing more intelligent clustering of related objects into physically near locations. Third, the unused disk space can be used for disk shadowing to increase reliability, availability, and I/O performance [Bitton and Gray 1988]. Fourth, compressed data can be transferred faster to and from disk. In other words, data compression is an effective means to increase disk bandwidth (not by increasing physical transfer rates but by increasing the information density of transferred data) and to relieve the I/O bottleneck found in many high-performance database management systems [Boral and DeWitt 1983]. Fifth, in distributed database systems and in client-server situations, compressed data can be transferred faster across the network than uncompressed data. Uncompressed data require either more network time or a separate compression step. Finally, retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os. The last three points are actually more general. They apply to the entire storage hierarchy of tape, disk, controller caches, local and remote main memories, and CPU caches.

For query processing, compression can be exploited far beyond improved I/O performance because decompression can often be delayed until a relatively small data set is presented to the user or an application program. First, exact-match comparisons can be performed on compressed data. Second, projection and duplicate removal can be performed without decompressing data. The situation for aggregation is a little more complex since the attribute on which arithmetic is performed typically must be decompressed. Third, neither the join attributes nor other attributes need to be decompressed for most joins. Since keys and foreign keys are from the same domain, and if compression schemes are fixed for each domain, a join on compressed key values will give the same results as a join on normal uncompressed key values. It might seem unusual to perform a merge-join in the order of compressed values, but it nonetheless is possible and will produce correct results.

There are a number of benefits from processing compressed data. First, materializing output records is faster because records are shorter, i.e., less copying is required. Second, for inputs larger than memory, more records fit into memory. In hybrid hash join and duplicate removal, for instance, the fraction of the file that can be retained in the hash table and thus be joined without any I/O is larger. During sorting, the number of records in memory and thus per run is larger, leading to fewer runs and possibly fewer merge levels. Third, and very interestingly, skew is less likely to be a problem. The goal of compression is to represent the information with as few bits as possible. Therefore, each bit in the output of a good compression scheme has close to maximal information content, and bit columns seen over the entire file are unlikely to be skewed. Furthermore, bit columns will not be correlated. Thus, the compressed key values can be used to create a hash value distribution that is almost guaranteed to be uniform, i.e., optimal for hashing in memory and partitioning to overflow files as well as to multiple processors in parallel join algorithms.

We believe that data compression is undervalued in current query processing research, mostly because it was not realized that many operations can often be performed faster on compressed data than on uncompressed data, and we hope that future database management systems make extensive use of data compression. Considering the current growth rates in CPU and I/O performance, it might even make sense to exploit data compression on the fly for hash table overflow resolution.
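A minimal Python sketch of the domain-wide encoding idea (illustrative only; the dictionary-building scheme and data values are invented): if all Part-No attributes share one order-preserving dictionary, equality and ordering comparisons — and therefore joins and merge-joins — can be evaluated directly on the compressed codes.

```python
# Illustrative sketch: one order-preserving encoding table shared by
# all attributes of the same domain, so comparisons and joins can be
# performed on the compressed codes without decompression.

def build_dictionary(values):
    """Assign small integer codes in sort order (order-preserving)."""
    return {v: code for code, v in enumerate(sorted(set(values)))}

part_nos = ["P-900", "P-100", "P-500"]
dictionary = build_dictionary(part_nos + ["P-300"])   # shared domain codes

parts     = [(dictionary[p], desc) for p, desc in
             [("P-100", "bolt"), ("P-500", "nut"), ("P-900", "washer")]]
shipments = [(dictionary[p], qty) for p, qty in
             [("P-500", 10), ("P-100", 3), ("P-500", 7)]]

# Exact-match join on the compressed key values; no decompression needed.
matches = [(p, s) for p in parts for s in shipments if p[0] == s[0]]
print(matches)

# Because the codes are order-preserving, sorting or merge-joining in
# code order yields the same order as sorting on the original values.
assert sorted(d[0] for d in parts) == [dictionary[p] for p in sorted(part_nos)]
```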
12.3 Surrogate Processing

Another very useful technique in query processing is the use of surrogates for intermediate results. A surrogate is a reference to a data item, be it a logical object identifier (OID) used in object-oriented systems or a physical record identifier (RID) or location. Instead of keeping a complete record in memory, only the fields that are used immediately are kept, and the remainder replaced by a surrogate, which has in principle the same effect as compression. While this technique has traditionally been used to reduce main-memory requirements, it can also be employed to improve board- and CPU-level caching [Nyberg et al. 1993].

The simplest case in which surrogate processing can be exploited is in avoiding copying. Consider a relational join; when two items satisfy the join predicate, a new tuple is created from the two original ones. Instead of copying the data fields, it is possible to create only a pair of RIDs or pointers to the original records if they are kept in memory. If a record is 50 times larger than an RID, e.g., 8 vs. 400 bytes, the effort spent on copying bytes is reduced by that factor.
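A brief Python sketch of this idea (illustrative only; not the implementation of any cited system): the join produces pairs of RIDs instead of copied, concatenated records, and the full output records are assembled only when — and if — they are finally needed.

```python
# Illustrative sketch: joining on surrogates (RIDs) instead of copying
# whole records, with materialization of output records deferred.

customers = [("C1", "Smith", "..."), ("C2", "Jones", "...")]   # wide records
orders    = [("O1", "C2"), ("O2", "C1"), ("O3", "C2")]

def rid_join(left, right, left_key, right_key):
    """Return (left_rid, right_rid) pairs for matching records."""
    index = {}
    for rid, rec in enumerate(left):
        index.setdefault(left_key(rec), []).append(rid)
    return [(lrid, rrid)
            for rrid, rec in enumerate(right)
            for lrid in index.get(right_key(rec), [])]

rid_pairs = rid_join(customers, orders,
                     left_key=lambda c: c[0], right_key=lambda o: o[1])

# Only copy the (much larger) data fields for the rows actually needed,
# e.g., the first two result rows requested by the consumer.
for lrid, rrid in rid_pairs[:2]:
    print(customers[lrid] + orders[rrid])
```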
Copying is already a major part of the CPU time spent in many query processing systems, but it is becoming more expensive for two reasons. First, many modern CPU designs and implementations are optimized for an impressive number of instructions per second but do not provide the performance improvements in mundane tasks such as moving bytes from one memory location to another [Ousterhout 1990]. Second, many modern computer architectures employ multiple CPUs accessing shared memory over one bus because this design permits fast and inexpensive parallelism. Although alleviated by local caches, bus contention is the major bottleneck and limitation to scalability in shared-memory parallel machines. Therefore, reductions in memory-to-memory copying in database query execution engines permit higher useful degrees of parallelism in shared-memory machines.

A second example for surrogate processing was mentioned earlier in connection with indices. To evaluate a conjunction with multiple clauses, each of which is supported by an index, it might be useful to perform an intersection of RID-lists to reduce the number of records needed before actual data are accessed.

A third case is the use of indices and RIDs to evaluate joins, for example, in the query processing techniques used in Ingres [Kooi 1980; Kooi and Frankforth 1982] and IBM's hybrid join [Cheng et al. 1991] discussed in the section on binary matching.

Surrogate processing has also been used in parallel systems, in particular, distributed-memory implementations, to reduce network traffic. For example, Lorie and Young [1989] used RIDs to reduce the communication time in parallel sorting by sending (sort key, RID) pairs to a central site, which determines each record's global rank, and then repartitioning and merging records very quickly by their rank alone without further data comparisons.

Another form of surrogates are encodings with lossy compressions, such as superimposed coding used for efficient access methods [Bloom 1970; Faloutsos 1985; Sacks-Davis and Ramamohanarao 1983; Sacks-Davis et al. 1987]. Berra et al. [1987] and Chung and Berra [1988] considered indexing and retrieval organizations for very large (relational) knowledge bases and databases. They employed three techniques, concatenated code words (CCWs), superimposed code words (SCWs), and transformed inverted lists (TILs). TILs are normal index structures for all attributes of a relation that permit answering conjunctive queries by bitwise anding. CCWs and SCWs use hash values of all attributes of a tuple and either concatenate such hash values or bitwise or them together. The resulting code words are then used as keys in indices. In their particular architecture, Berra et al. and Chung and Berra consider associative memory and optical computing to search efficiently through such indices, although conventional software techniques could be used as well.

12.4 Bit Vector Filtering

In parallel systems, bit vector filters have been used very effectively for what we call here "probabilistic semi-joins." Consider a relational join to be executed on a distributed-memory machine with repartitioning of both input relations on the join attribute. It is clear that communication effort could be reduced if only the tuples that actually contribute to the join result, i.e., those with a match in the other relation, needed to be shipped across the network. To accomplish this, distributed database systems were designed to make extensive use of semi-joins, e.g., SDD-1 [Bernstein et al. 1981].

A faster alternative to semi-joins, which, as discussed earlier, requires basically the same computational effort as natural joins, is the use of bit vector filters [Babb 1979], also called Bloom-filters [Bloom 1970]. A bit vector filter with N bits is initialized with zeroes; all items in the first (preferably the smaller) input are hashed on their join key to 0, ..., N - 1. For each item, one bit in the bit vector filter is set to one; hash collisions are ignored. After the first join input has been exhausted, the bit vector filter is used to filter the second input. Data items of the second input are hashed on their join key value, and only items for which the bit is set to one can possibly participate in the join. There is some chance for false passes in the case of collisions, i.e., items of the second input pass the bit vector filter although they actually do not participate in the join, but if the bit vector filter is sufficiently large, the number of false passes is very small.

In general, if the number of bits is about twice the number of items in the first input, bit vector filters are very effective. If many more bits are available, the bit vector filter can be split into multiple subvectors, or multiple bits can be set for each item using multiple hash functions, reducing the number of false passes. Babb [1979] analyzed the use of multiple bit vector filters in detail.
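The sketch below (illustrative Python, with the bit-vector size and hash functions chosen arbitrarily) builds a bit vector filter from the build input's join keys and uses it to discard most probe items that cannot possibly find a match; collisions can only cause false passes, never lost matches.

```python
# Illustrative sketch of a bit vector (Bloom) filter used to prefilter
# the probe input of a join; hash choices here are arbitrary.

N = 64                       # number of bits, ideally ~2x the build size
HASHES = (lambda k: hash(("a", k)) % N,
          lambda k: hash(("b", k)) % N)

def build_filter(keys):
    bits = [0] * N
    for k in keys:
        for h in HASHES:
            bits[h(k)] = 1   # hash collisions are simply ignored
    return bits

def may_match(bits, key):
    """False means the key is definitely absent; True may be a false pass."""
    return all(bits[h(key)] for h in HASHES)

build_keys = [3, 17, 42, 99]
probe_rows = [(k, "payload") for k in range(0, 120, 7)]

bits = build_filter(build_keys)
surviving = [row for row in probe_rows if may_match(bits, row[0])]

# Only the surviving rows need to be shipped across the network (or
# written to an overflow file) and matched exactly against the build input.
print(len(probe_rows), "->", len(surviving))
```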
filtering can be used on the divisor at- The Gamma relational database ma- tributes to eliminate most of the dividend chine demonstrated the effectiveness of items that do not pertain to any divisor bit vector filtering in relational join pro- item. Thus, our earlier assessment that cessing on distributed-memory hardware universal quantification can be per- [DeWitt et al. 1986; 1988; 1990; Gerber formed as fast as existential quantifica- 1986]. When scanning and redistributing tion (a semi-join of dividend and divisor the build input of a join, the Gamma relations) even extends to special tech- machine creates a bit vector filter that is niques used to boost join performance. then distributed to the scanning sites of Bit vector filtering can also be ex- the probe input. Based on the bit vector ploited in sequential systems. Consider a

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 156 * Goetz Graefe

merge-join with sort operations on both be used in hash-based duplicate removal. inputs. If the bit vector filter is built Since bit vector filters can only deter- based on the input of the first sort, i.e., mine safely which item has not been seen the ,bit vector filter is completed when all yet, but not which item has been seen yet data have reached the first sort operator. (due to possible hash collisions), bit vec- This bit vector filter can then be used to tor filters cannot be used in the most reduce the input into the second sort op- direct way in hash-based duplicate re- erator on the (presumably larger) second moval. However, hash-based duplicate input. Depending on how the sort opera- removal can be modified to become simi- tion is organized into phases, it might lar to a hash join or actually a hash-based even be possible to create a second bit set intersection. Consider a large file R vector filter from the second merge-join and a partitioning fan-out F. First, R is input and use it to reduce the first join partitioned into F/2 partitions. For each input while it is being merged. partition, two files are created; thus, this For sequential hash joins, bit vector step uses the entire fan-out to create a filters can be used in two ways. First, total of F files. Within each partition, a they can be used to filter items of the bit vector filter is used to determine probe input using a bit vector filter cre- whether an item belongs into the first or ated from items of the build input. This the second file of that partition. If an use of bit vector filters is analogous to bit item is guaranteed to be unique, i.e., there vector filter usage in parallel systems is no earlier item indicated in the bit and for merge-join. In Rdb/VMS and vector filter, the item is assigned to the DB2, bit vector filters are used when first file, and a bit in the bit vector filter intersecting large RID lists obtained from is set. Otherwise, the item is assigned multiple indices on the same table into the partition’s second file. At the end [Antoshenkov 1993; Mohan et ‘al. 1990]. of this partitioning step, there are F files, Second, new bit vector filters can be cre- half of them guaranteed to be free of ated and used for each partition in each duplicate data items. The possible size of recursion level. In the Volcano query- the duplicate-free files is limited by the processing system, the operator imple- size of the bit vector filters; therefore, menting hash join, intersection, etc. uses this step should use the largest bit vector the space used as anchor for each bucket’s filters possible. After the first partition- linked list for a small bit vector filter ing step, each partition’s pair of files is after the bucket has been spilled to an intersected using the duplicate-free file overflow file. Only those items from the as probe input. Recall that duplicate re- probe input that pass the bit vector filter moval for a join’s build input can be ac- are written to the probe overflow file. complished easily and inexpensively This technique is used in each recursion while building the in-memory hash table. level of overflow resolution. 
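The following Python sketch illustrates the partitioning step just described, under simplifying assumptions (in-memory lists standing in for files, one hash function per partition filter): each partition is split into a duplicate-free file and a file of suspected duplicates, guided by a small bit vector filter, and the two files of each partition are then matched to remove the remaining duplicates.

```python
# Illustrative sketch of hash-based duplicate removal with bit vector
# filters: each of F/2 partitions is split into a duplicate-free file
# and a file of suspected duplicates (false passes are possible).

FAN_OUT = 8                      # F: total number of output "files"
BITS_PER_FILTER = 32

def partition_with_filters(items):
    npart = FAN_OUT // 2
    dup_free = [[] for _ in range(npart)]     # guaranteed unique items
    suspects = [[] for _ in range(npart)]     # items that hit a set bit
    filters = [[0] * BITS_PER_FILTER for _ in range(npart)]
    for item in items:
        p = hash(item) % npart
        b = hash(("bit", item)) % BITS_PER_FILTER
        if filters[p][b]:
            suspects[p].append(item)          # possibly a duplicate
        else:
            filters[p][b] = 1
            dup_free[p].append(item)          # first item to set this bit
    return dup_free, suspects

def remove_duplicates(items):
    dup_free, suspects = partition_with_filters(items)
    result = []
    for unique_file, suspect_file in zip(dup_free, suspects):
        uniq = set(unique_file)
        # The suspect file plays the role of the build input: it is
        # deduplicated while building an in-memory hash table, and the
        # duplicate-free file then serves as the probe input.
        table = set(suspect_file)
        result.extend(unique_file)            # already known to be unique
        result.extend(x for x in table if x not in uniq)
    return result

data = [1, 5, 9, 1, 5, 33, 41, 9, 9, 77]
print(sorted(remove_duplicates(data)))        # [1, 5, 9, 33, 41, 77]
```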
In order to find and exploit a dual in the realm of sorting and merge-join to bit vector filtering in each recursion level of recursive hash join, sorting of multiple inputs must be divided into individual merge levels. In other words, for a merge-join of inputs R and S, the sort activity should switch back and forth between R and S, level by level, creating and using a new bit vector filter in each merge level. Unfortunately, even with a sophisticated sort implementation that supports this use of bit vector filters in each merge level, recursive hybrid hash join will make more effective use of bit vector filters because the inputs are partitioned, thus reducing the number of distinct values in each partition in each recursion level.

12.5 Specialized Hardware

Specialized hardware was considered by a number of researchers, e.g., in the forms of hardware sorters and logic-per-track selection. A relatively recent survey of database machine research is given by Su [1988]. Most of this research was abandoned after Boral and DeWitt's [1983] influential analysis that compared CPU and I/O speeds and their trends. They concluded that I/O is most likely the bottleneck in future high-performance query execution, not processing. Therefore, they recommended moving from research on custom processors to techniques for overcoming the I/O bottleneck, e.g., by use of parallel readout disks, disk caching and read-ahead, and indexing to reduce the amount of data to be read for a query. Other investigations also came to the conclusion that parallelism is no substitute for effective storage structures and query execution algorithms [DeWitt and Hawthorn 1981; Neches 1984]. An additional very strong argument against custom VLSI processors is that microprocessor speed is currently improving so rapidly that it is likely that, by the time a special hardware component has been designed, fabricated, tested, and integrated into a larger hardware and software system, the next generation of general-purpose CPUs will be available and will be able to execute database functions programmed in a high-level language at the same speed as the specialized hardware component. Furthermore, it is not clear what specialized hardware would be most beneficial to design, in particular, in light of today's directions toward extensible database systems and emerging database application domains. Therefore, we do not favor specialized database hardware modules beyond general-purpose processing, storage, and communication hardware dedicated to executing database software.

SUMMARY AND OUTLOOK

Database management systems provide three essential groups of services. First, they maintain both data and associated metadata in order to make databases self-contained and self-explanatory, at least to some extent, and to provide data independence. Second, they support safe data sharing among multiple users as well as prevention and recovery of failures and data loss. Third, they raise the level of abstraction for data manipulation above the primitive access commands provided by file systems with more or less sophisticated matching and inference mechanisms, commonly called the query language or query-processing facility. We have surveyed execution algorithms and software architectures used in providing this third essential service.

Query processing has been explored extensively in the last 20 years in the context of relational database management systems and is slowly gaining interest in the research community for extensible and object-oriented systems. This is a very encouraging development, because if these new systems have increased modeling power over previous data models and database management systems but cannot execute even simple requests efficiently, they will never gain widespread use and acceptance. Databases will continue to manage massive amounts of data; therefore, efficient query and request execution will continue to represent both an important research direction and an important criterion in investment decisions in the "real world." In other words, new database management systems should provide greater modeling power (this is widely accepted and intensely pursued), but also competitive or better performance than previous systems. We hope that this survey will contribute to the use of efficient and parallel algorithms for query processing tasks in new database management systems.

A large set of query processing algorithms has been developed for relational systems. Sort- and hash-based techniques have been used for physical-storage design, for associative index structures, for algorithms for unary and binary matching operations such as aggregation, duplicate removal, join, intersection, and division, and for parallel query processing using hash- or range-partitioning. Additional techniques such as precomputation and compression have been shown to provide substantial performance benefits when manipulating large volumes of data. Many of the existing algorithms will continue to be useful for extensible and object-oriented systems, and many can easily be generalized from sets of tuples to more general pattern-matching functions. Some emerging database applications will require new operators, however, both for translation between alternative data representations and for actual data manipulation.

The most promising aspect of current research into database query processing for new application domains is that the concept of a fixed number of parametrized operators, each performing a part of the required data manipulation and each passing an intermediate result to the next operator, is versatile enough to meet the new challenges. This concept permits specification of database queries and requests in a logical algebra as well as concise representation of database programs in a physical algebra. Furthermore, it allows algebraic optimizations of requests, i.e., optimizing transformations of algebra expressions and cost-sensitive translations of logical into physical expressions. Finally, it permits pipelining between operators to exploit parallel computer architectures and partitioning of stored data and intermediate results for most operators, in particular, for operators on sets but also for other bulk types such as arrays, lists, and time series.

We can hope that much of the existing relational technology for query optimization and parallel execution will remain relevant and that research into extensible optimization and parallelization will have a significant impact on future database applications such as scientific data. For database management systems to become acceptable for new application domains, their performance must at least match that of the file systems currently in use. Automatic optimization and parallelization may be crucial contributions to achieving this goal, in addition to the query execution techniques surveyed here.

ACKNOWLEDGMENTS

José A. Blakeley, Cathy Brand, Rick Cole, Diane Davison, David Helman, Ann Linville, Bill McKenna, Gail Mitchell, Shengsong Ni, Barb Peters, Leonard Shapiro, the students of "Readings in Database Systems" at the University of Colorado at Boulder (Fall 1991) and "Database Implementation Techniques" at Portland State University (Winter 1993), David Maier's weekly reading group at the Oregon Graduate Institute (Winter 1992), the anonymous referees, and the Computing Surveys editors Shamkant Navathe and Dick Muntz gave many valuable comments on earlier drafts of this survey, which have improved the paper very much. This paper is based on research partially supported by the National Science Foundation with grants IRI-8996270, IRI-8912618, IRI-9006348, IRI-9116547, IRI-9119446, and ASC-9217394, ARPA with contract DAAB 07-91-C-Q518, Texas Instruments, Digital Equipment Corp., Intel Supercomputer Systems Division, Sequent Computer Systems, ADP, and the Oregon Advanced Computing Institute (OACIS).

REFERENCES

ADAM, N. R., AND WORTMANN, J. C. 1989. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv. 21, 4 (Dec.), 515.
AHN, I., AND SNODGRASS, R. 1988. Partitioned storage for temporal databases. Inf. Syst. 13, 4, 369.


ALBERT, J. 1991. Algebraic properties of bag data types. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 211.
ANALYTI, A., AND PRAMANIK, S. 1992. Fast search in main memory databases. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 215.
ANDERSON, D. P., TZOU, S. Y., AND GRAHAM, G. S. 1988. The DASH virtual memory system. Tech. Rep. 88/461, Univ. of California—Berkeley, CS Division, Berkeley, Calif.
ANTOSHENKOV, G. 1993. Dynamic query optimization in Rdb/VMS. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS, P. P., KING, W. F., LORIE, R. A., MCJONES, P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER, I. L., WADE, B. W., AND WATSON, V. 1976. System R: A relational approach to database management. ACM Trans. Database Syst. 1, 2 (June), 97.
ASTRAHAN, M. M., SCHKOLNICK, M., AND WHANG, K. Y. 1987. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1, 11.
ATKINSON, M. P., AND BUNEMAN, O. P. 1987. Types and persistence in database programming languages. ACM Comput. Surv. 19, 2 (June), 105.
BABB, E. 1982. Joined Normal Form: A storage encoding for relational databases. ACM Trans. Database Syst. 7, 4 (Dec.), 588.
BABB, E. 1979. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar.), 1.
BAEZA-YATES, R. A., AND LARSON, P. A. 1989. Performance of B+-trees with partial expansions. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 248.
BANCILHON, F., AND RAMAKRISHNAN, R. 1986. An amateur's introduction to recursive query processing strategies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 16.
BARGHOUTI, N. S., AND KAISER, G. E. 1991. Concurrency control in advanced database applications. ACM Comput. Surv. 23, 3 (Sept.), 269.
BARU, C. K., AND FRIEDER, O. 1989. Database operations in a cube-connected multicomputer system. IEEE Trans. Comput. 38, 6 (June), 920.
BATINI, C., LENZERINI, M., AND NAVATHE, S. B. 1986. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 4 (Dec.), 323.
BATORY, D. S., BARNETT, J. R., GARZA, J. F., SMITH, K. P., TSUKUDA, K., TWICHELL, B. C., AND WISE, T. E. 1988a. GENESIS: An extensible database management system. IEEE Trans. Softw. Eng. 14, 11 (Nov.), 1711.
BATORY, D. S., LEUNG, T. Y., AND WISE, T. E. 1988b. Implementation concepts for an extensible data model and data language. ACM Trans. Database Syst. 13, 3 (Sept.), 231.
BAUGSTO, B., AND GREIPSLAND, J. 1989. Parallel sorting methods for large data volumes on a hypercube database computer. In Proceedings of the 6th International Workshop on Database Machines (Deauville, France, June 19-21).
BAYER, R., AND MCCREIGHT, E. 1972. Organization and maintenance of large ordered indices. Acta Informatica 1, 3, 173.
BECK, M., BITTON, D., AND WILKINSON, W. K. 1988. Sorting large files on a backend multiprocessor. IEEE Trans. Comput. 37, 7 (July), 769.
BECKER, B., SIX, H. W., AND WIDMAYER, P. 1991. Spatial priority search: An access technique for scaleless maps. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 128.
BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 322.
BELL, T., WITTEN, I. H., AND CLEARY, J. G. 1989. Modelling for text compression. ACM Comput. Surv. 21, 4 (Dec.), 557.
BENTLEY, J. L. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept.), 509.
BERNSTEIN, P. A., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185.
BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, C. L., AND ROTHNIE, J. B. 1981. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec.), 602.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass.
BERRA, P. B., CHUNG, S. M., AND HACHEM, N. I. 1987. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Comput. 20, 3 (Mar.), 25.
BERTINO, E. 1991. An indexing technique for object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 160.
BERTINO, E. 1990. Optimization of queries using nested indices. In Lecture Notes in Computer Science, vol. 416. Springer-Verlag, New York.
BERTINO, E., AND KIM, W. 1989. Indexing techniques for queries on nested objects. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 196.
BHIDE, A. 1988. An analysis of three transaction processing architectures. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 339.


BHIDE, A., AND STONEBRAKER, M. 1988. A performance comparison of two architectures for fast transaction processing. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 536.
BITTON, D., AND DEWITT, D. J. 1983. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June), 255.
BITTON-FRIEDLAND, D. 1982. Design, analysis, and implementation of parallel external sorting algorithms. Ph.D. Thesis, Univ. of Wisconsin—Madison.
BITTON, D., AND GRAY, J. 1988. Disk shadowing. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 331.
BITTON, D., DEWITT, D. J., HSIAO, D. K., AND MENON, J. 1984. A taxonomy of parallel sorting. ACM Comput. Surv. 16, 3 (Sept.), 287.
BITTON, D., HANRAHAN, M. B., AND TURBYFILL, C. 1987. Performance of complex queries in main memory database systems. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
BLAKELEY, J. A., AND MARTIN, N. L. 1990. Join index, materialized view, and hybrid hash-join: A performance analysis. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
BLAKELEY, J. A., COBURN, N., AND LARSON, P. A. 1989. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. Database Syst. 14, 3 (Sept.), 369.
BLASGEN, M., AND ESWARAN, K. 1977. Storage and access in relational databases. IBM Syst. J. 16, 4, 363.
BLASGEN, M., AND ESWARAN, K. 1976. On the evaluation of queries in a relational database system. IBM Res. Rep. RJ 1745, IBM, San Jose, Calif.
BLOOM, B. H. 1970. Space/time tradeoffs in hash coding with allowable errors. Commun. ACM 13, 7 (July), 422.
BORAL, H. 1988. Parallelism in Bubba. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Austin, Tex., Dec.), 68.
BORAL, H., AND DEWITT, D. J. 1983. Database machines: An idea whose time has passed? A critique of the future of database machines. In Proceedings of the International Workshop on Database Machines. Reprinted in Parallel Architectures for Database Systems. IEEE Computer Society Press, Washington, D.C., 1989.
BORAL, H., ALEXANDER, W., CLAY, L., COPELAND, G., DANFORTH, S., FRANKLIN, M., HART, B., SMITH, M., AND VALDURIEZ, P. 1990. Prototyping Bubba, a highly parallel database system. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 4.
BRATBERGSENGEN, K. 1984. Hashing methods and relational algebra operations. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 323.
BROWN, K. P., CAREY, M. J., DEWITT, D. J., MEHTA, M., AND NAUGHTON, J. F. 1992. Scheduling issues for complex database workloads. Computer Science Tech. Rep. 1095, Univ. of Wisconsin—Madison.
BUCHERAL, P., THEVENIN, J. M., AND VALDURIEZ, P. 1990. Efficient main memory data management using the DBGraph storage model. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 683.
BUNEMAN, P., AND FRANKEL, R. E. 1979. FQL—A functional query language. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 52.
BUNEMAN, P., FRANKEL, R. E., AND NIKHIL, R. 1982. An implementation technique for database query languages. ACM Trans. Database Syst. 7, 2 (June), 164.
CACACE, F., CERI, S., AND HOUTSMA, M. A. W. 1992. A survey of parallel execution strategies for transitive closures and logic programs. To appear in Distrib. Parall. Databases.
CAREY, M. J., DEWITT, D. J., RICHARDSON, J. E., AND SHEKITA, E. J. 1986. Object and file management in the EXODUS extensible database system. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 91.
CARLIS, J. V. 1986. HAS: A relational algebra operator, or divide is not enough to conquer. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 254.
CARTER, J. L., AND WEGMAN, M. N. 1979. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2, 143.
CHAMBERLIN, D. D., ASTRAHAN, M. M., BLASGEN, M. W., GRAY, J. N., KING, W. F., LINDSAY, B. G., LORIE, R., MEHL, J. W., PRICE, T. G., PUTZOLU, F., SELINGER, P. G., SCHKOLNICK, M., SLUTZ, D. R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A. 1981a. A history and evaluation of System R. Commun. ACM 24, 10 (Oct.), 632.
CHAMBERLIN, D. D., ASTRAHAN, M. M., KING, W. F., LORIE, R. A., MEHL, J. W., PRICE, T. G., SCHKOLNICK, M., SELINGER, P. G., SLUTZ, D. R., WADE, B. W., AND YOST, R. A. 1981b. Support for repetitive transactions and ad hoc queries in System R. ACM Trans. Database Syst. 6, 1 (Mar.), 70.
CHEN, P. P. 1976. The entity relationship model—Toward a unified view of data. ACM Trans. Database Syst. 1, 1 (Mar.), 9.
CHEN, H., AND KUCK, S. M. 1984. Combining relational and network retrieval methods. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 131.
CHEN, M. S., LO, M. L., YU, P. S., AND YOUNG, H. C. 1992. Using segmented right-deep trees for the execution of pipelined hash joins. In Proceedings of the International Conference on Very Large Data Bases (Vancouver, B.C., Canada). VLDB Endowment, 15.


CHENG, J., HADERLE, D., HEDGES, R., IYER, B. R., MESSINGER, T., MOHAN, C., AND WANG, Y. 1991. An efficient hybrid join algorithm: A DB2 prototype. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 171.
CHERITON, D. R., GOOSEN, H. A., AND BOYLE, P. D. 1991. Paradigm: A highly scalable shared-memory multicomputer. IEEE Comput. 24, 2 (Feb.), 33.
CHIU, D. M., AND HO, Y. C. 1980. A methodology for interpreting tree queries into optimal semi-join expressions. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 169.
CHOU, H. T. 1985. Buffer management of database systems. Ph.D. thesis, Univ. of Wisconsin—Madison.
CHOU, H. T., AND DEWITT, D. J. 1985. An evaluation of buffer management strategies for relational database systems. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 127. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
CHRISTODOULAKIS, S. 1984. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163.
CHUNG, S. M., AND BERRA, P. B. 1988. A comparison of concatenated and superimposed code word surrogate files for very large data/knowledge bases. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 364.
CLUET, S., DELOBEL, C., LECLUSE, C., AND RICHARD, P. 1989. Reloops, an algebra based query language for an object-oriented database system. In Proceedings of the 1st International Conference on Deductive and Object-Oriented Databases (Kyoto, Japan, Dec. 4-6).
COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Surv. 11, 2 (June), 121.
COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 99.
DADAM, P., KUESPERT, K., ANDERSON, F., BLANKEN, H., ERBE, R., GUENAUER, J., LUM, V., PISTOR, P., AND WALCH, G. 1986. A database management prototype to support extended NF2 relations: An integrated view on flat tables and hierarchies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 356.
DANIELS, D., AND NG, P. 1982. Distributed query compilation and processing in R*. IEEE Database Eng. 5, 3 (Sept.).
DANIELS, S., GRAEFE, G., KELLER, T., MAIER, D., SCHMIDT, D., AND VANCE, B. 1991. Query optimization in Revelation, an overview. IEEE Database Eng. 14, 2 (June).
DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D. 1985. Consistency in partitioned networks. ACM Comput. Surv. 17, 3 (Sept.), 341.
DAVIS, D. D. 1992. Oracle's parallel punch for OLTP. Datamation (Aug. 1), 67.
DAVISON, W. 1992. Parallel index building in Informix OnLine 6.0. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 103.
DEPPISCH, U., PAUL, H. B., AND SCHEK, H. J. 1986. A storage system for complex objects. In Proceedings of the International Workshop on Object-Oriented Database Systems (Pacific Grove, Calif., Sept.), 183.
DESHPANDE, V., AND LARSON, P. A. 1992. The design and implementation of a parallel join algorithm for nested relations on shared-memory multiprocessors. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 68.
DESHPANDE, V., AND LARSON, P. A. 1991. An algebra for nested relations with support for nulls and aggregates. Computer Science Dept., Univ. of Waterloo, Waterloo, Ontario, Canada.
DESHPANDE, A., AND VAN GUCHT, D. 1988. An implementation for nested relational databases. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Calif., Aug.). VLDB Endowment, 76.
DEWITT, D. J. 1991. The Wisconsin benchmark: Past, present, and future. In Database and Transaction Processing System Performance Handbook. Morgan-Kaufman, San Mateo, Calif.
DEWITT, D. J., AND GERBER, R. H. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 151.
DEWITT, D. J., AND GRAY, J. 1992. Parallel database systems: The future of high-performance database systems. Commun. ACM 35, 6 (June), 85.
DEWITT, D. J., AND HAWTHORN, P. B. 1981. A performance evaluation of database machine architectures. In Proceedings of the International Conference on Very Large Data Bases (Cannes, France, Sept.). VLDB Endowment, 199.
DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALIKRISHNA, M. 1986. GAMMA—A high performance dataflow database machine. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 228. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEIDER, D. 1988. A performance analysis of the GAMMA database machine. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 350.


DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H. I., AND RASMUSSEN, R. 1990. The Gamma database machine project. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 44.
DEWITT, D. J., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1.
DEWITT, D., NAUGHTON, J., AND BURGER, J. 1993. Nested loops revisited. In Proceedings of Parallel and Distributed Information Systems (San Diego, Calif., Jan.).
DEWITT, D. J., NAUGHTON, J. E., AND SCHNEIDER, D. A. 1991a. An evaluation of non-equijoin algorithms. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 443.
DEWITT, D., NAUGHTON, J., AND SCHNEIDER, D. 1991b. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the International Conference on Parallel and Distributed Information Systems (Miami Beach, Fla., Dec.).
DOZIER, J. 1992. Access to data in NASA's Earth observing systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1.
EFFELSBERG, W., AND HAERDER, T. 1984. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec.), 560.
ENBODY, R. J., AND DU, H. C. 1988. Dynamic hashing schemes. ACM Comput. Surv. 20, 2 (June), 85.
ENGLERT, S., GRAY, J., KOCHER, R., AND SHAH, P. 1989. A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases. Tandem Computers Tech. Rep. 89.4, Tandem Corp., Cupertino, Calif.
EPSTEIN, R. 1979. Techniques for processing of aggregates in relational database systems. UCB/ERL Memo. M79/8, Univ. of California, Berkeley, Calif.
EPSTEIN, R., AND STONEBRAKER, M. 1980. Analysis of distributed data base processing strategies. In Proceedings of the International Conference on Very Large Data Bases (Montreal, Canada, Oct.). VLDB Endowment, 92.
EPSTEIN, R., STONEBRAKER, M., AND WONG, E. 1978. Distributed query processing in a relational database system. In Proceedings of the ACM SIGMOD Conference. ACM, New York.
FAGIN, R., NIEVERGELT, J., PIPPENGER, N., AND STRONG, H. R. 1979. Extendible hashing: A fast access method for dynamic files. ACM Trans. Database Syst. 4, 3 (Sept.), 315.
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49.
FALOUTSOS, C., NG, R., AND SELLIS, T. 1991. Predictive load control for flexible buffer allocation. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 265.
FANG, M. T., LEE, R. C. T., AND CHANG, C. C. 1986. The idea of declustering and its applications. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 181.
FINKEL, R. A., AND BENTLEY, J. L. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1, 1.
FREYTAG, J. C., AND GOODMAN, N. 1989. On the translation of relational queries into iterative programs. ACM Trans. Database Syst. 14, 1 (Mar.), 1.
FUSHIMI, S., KITSUREGAWA, M., AND TANAKA, H. 1986. An overview of the system software of a parallel relational database machine GRACE. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). ACM, New York, 209.
GALLAIRE, H., MINKER, J., AND NICOLAS, J. M. 1984. Logic and databases: A deductive approach. ACM Comput. Surv. 16, 2 (June), 153.
GERBER, R. H. 1986. Dataflow query processing using multiprocessor hash-partitioned algorithms. Ph.D. thesis, Univ. of Wisconsin—Madison.
GHANDEHARIZADEH, S., AND DEWITT, D. J. 1990. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 481.
GOODMAN, J. R., AND WOEST, P. J. 1988. The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor. Computer Science Tech. Rep. 766, Univ. of Wisconsin—Madison.
GOUDA, M. G., AND DAYAL, U. 1981. Optimal semijoin schedules for query processing in local distributed database systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 164.
GRAEFE, G. 1993a. Volcano, an extensible and parallel dataflow query processing system. IEEE Trans. Knowledge Data Eng. To be published.
GRAEFE, G. 1993b. Performance enhancements for hybrid hash join. Available as Computer Science Tech. Rep. 606, Univ. of Colorado, Boulder.
GRAEFE, G. 1993c. Sort-merge-join: An idea whose time has passed? Revised in Portland State Univ. Computer Science Tech. Rep. 93-4.
GRAEFE, G. 1991. Heap-filter merge join: A new algorithm for joining medium-size inputs. IEEE Trans. Softw. Eng. 17, 9 (Sept.), 979.
GRAEFE, G. 1990a. Parallel external sorting in Volcano. Computer Science Tech. Rep. 459, Univ. of Colorado, Boulder.


GRAEFE, G. 1990b. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 102.
GRAEFE, G. 1989. Relational division: Four algorithms and their performance. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 94.
GRAEFE, G., AND COLE, R. L. 1993. Fast algorithms for universal quantification in large databases. Portland State Univ. and Univ. of Colorado at Boulder.
GRAEFE, G., AND DAVISON, D. L. 1993. Encapsulation of parallelism and architecture-independence in extensible database query processing. IEEE Trans. Softw. Eng. 19, 7 (July).
GRAEFE, G., AND DEWITT, D. J. 1987. The EXODUS optimizer generator. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 160.
GRAEFE, G., AND MAIER, D. 1988. Query optimization in object-oriented database systems: A prospectus. In Advances in Object-Oriented Database Systems, vol. 334. Springer-Verlag, New York, 358.
GRAEFE, G., AND MCKENNA, W. J. 1993. The Volcano optimizer generator: Extensibility and efficient search. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
GRAEFE, G., AND SHAPIRO, L. D. 1991. Data compression and database performance. In Proceedings of the ACM/IEEE-Computer Science Symposium on Applied Computing. ACM/IEEE, New York.
GRAEFE, G., AND WARD, K. 1989. Dynamic query evaluation plans. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 358.
GRAEFE, G., AND WOLNIEWICZ, R. H. 1992. Algebraic optimization and parallel execution of computations over scientific databases. In Proceedings of the Workshop on Metadata Management in Scientific Databases (Salt Lake City, Utah, Nov. 3-5).
GRAEFE, G., COLE, R. L., DAVISON, D. L., MCKENNA, W. J., AND WOLNIEWICZ, R. H. 1992. Extensible query optimization and parallel execution in Volcano. In Query Processing for Advanced Database Applications. Morgan-Kaufman, San Mateo, Calif.
GRAEFE, G., LINVILLE, A., AND SHAPIRO, L. D. 1993. Sort versus hash revisited. IEEE Trans. Knowledge Data Eng. To be published.
GRAY, J. 1990. A census of Tandem system availability between 1985 and 1990. Tandem Computers Tech. Rep. 90.1, Tandem Corp., Cupertino, Calif.
GRAY, J., AND PUTZOLU, F. 1987. The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 395.
GRAY, J., AND REUTER, A. 1991. Transaction Processing: Concepts and Techniques. Morgan-Kaufman, San Mateo, Calif.
GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. 1981. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June), 223.
GRUENWALD, L., AND EICH, M. H. 1991. MMDB reload algorithms. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 397.
GUENTHER, O., AND BILMES, J. 1991. Tree-based access methods for spatial databases: Implementation and performance evaluation. IEEE Trans. Knowledge Data Eng. 3, 3 (Sept.), 342.
GUIBAS, L., AND SEDGEWICK, R. 1978. A dichromatic framework for balanced trees. In Proceedings of the 19th Symposium on the Foundations of Computer Science.
GUNADHI, H., AND SEGEV, A. 1991. Query processing algorithms for temporal intersection joins. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 336.
GUNADHI, H., AND SEGEV, A. 1990. A framework for query optimization in temporal databases. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management.
GUNTHER, O. 1989. The design of the cell tree: An object-oriented index structure for geometric databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 598.
GUNTHER, O., AND WONG, E. 1987. A dual space representation for geometric data. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 501.
GUO, M., SU, S. Y. W., AND LAM, H. 1991. An association algebra for processing object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 23.
GUTTMAN, A. 1984. R-Trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 47. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
HAAS, L., CHANG, W., LOHMAN, G., MCPHERSON, J., WILMS, P. F., LAPIS, G., LINDSAY, B., PIRAHESH, H., CAREY, M. J., AND SHEKITA, E. 1990. Starburst mid-flight: As the dust clears. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 143.
HAAS, L., FREYTAG, J. C., LOHMAN, G., AND PIRAHESH, H. 1989. Extensible query processing in Starburst. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 377.
HAAS, L. M., SELINGER, P. G., BERTINO, E., DANIELS, D., LINDSAY, B., LOHMAN, G., MASUNAGA, Y., MOHAN, C., NG, P., WILMS, P., AND YOST, R. 1982. R*: A research project on distributed relational database management. IBM Res. Division, San Jose, Calif.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 164 ● Goetz Graefe

1982. R*: A research project on distributed HUA, K. A., AND LEE, C. 1990. An adaptive data relational database management. IBM Res. Di- placement scheme for parallel database com- vision, San Jose, Calif. puter systems. In Proceedings of the Interna- HAERDER, T., AND REUTER, A. 1983. Principles of tional Conference on Very Large Data Bases transaction-oriented database recovery. ACM (Brisbane, Australia). VLDB Endowment, 493, Comput. Suru. 15, 4 (Dec.). HUDSON, S. E., AND KING, R. 1989. Cactis: A self- HAFEZ, A., AND OZSOYOGLU, G. 1988. Storage adaptive, concurrent implementation of an ob- structures for nested relations. IEEE Database ject-oriented database management system. Eng. 11, 3 (Sept.), 31. ACM Trans. Database Svst. 14, 3 (Sept.), 291, HAGMANN, R. B. 1986. An observation on HULL, R., AND KING, R. 1987. Semantic database database buffering performance metrics. In modeling: Survey, applications, and research Proceedings of the International Conference on issues. ACM Comput. Suru, 19, 3 (Sept.), 201. Very Large Data Bases (Kyoto, Japan, Aug.). HUTFLESZ, A., SIX, H. W., AND WIDMAYER) P. 1990. VLDB Endowment, 289. The R-File: An efficient access structure for HAMMING, R. W. 1977. Digital Filters. Prentice- proximity queries. In proceedings of the IEEE Hall, Englewood Cliffs, N.J. Conference on Data Engineering. IEEE, New HANSON, E. N. 1987. A performance analysis of York, 372. view materialization strategies. In Proceedings HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988a, of ACM SIGMOD Conference. ACM, New York, Twin grid files: Space optimizing access 440. schemes. In Proceedings of ACM SIGMOD HENRICH, A., Stx, H. W., AND WIDMAYER, P. 1989. Conference. ACM, New York, 183. The LSD tree: Spatial access to multi- HUTFLESZ, A., Sm, H, W., AND WIDMAYER, P, 1988b. dimensional point and nonpoint objects. In The twin grid file: A nearly space optimal index Proceedings of the International Conference on structure. In Lecture Notes m Computer Sci- Very Large Data Bases (Amsterdam, The ence, vol. 303, Springer-Verlag, New York, 352. Netherlands). VLDB Endowment, 45. IOANNIDIS, Y. E,, AND CHRISTODOULAIiIS, S. 1991, HOEL, E. G., AND SAMET, H. 1992. A qualitative On the propagation of errors in the size of join comparison study of data structures for large results. In Proceedl ngs of ACM SIGMOD Con- linear segment databases. In %oceedtngs of ference. ACM, New York, 268. ACM SIGMOD Conference. ACM. New York, 205. IYKR, B. R., AND DIAS, D. M. 1990, System issues in parallel sorting for database systems. In HONG, W., AND STONEBRAKER, M. 1993. Opti- Proceedings of the IEEE Conference on Data mization of parallel query execution plans in Engineering. IEEE, New York, 246. XPRS. Distrib. Parall. Databases 1, 1 (Jan.), 9. JAGADISH, H. V. 1991, A retrieval technique for HONG, W., AND STONEBRAKRR, M. 1991. Opti- similar shapes. In Proceedings of ACM mization of parallel query execution plans in XPRS. In Proceedings of the International Con- SIGMOD Conference. ACM, New York, 208. ference on Parallel and Distributed Information JARKE, M., AND KOCH, J. 1984. Query optimiza- Systems (Miami Beach, Fla., Dec.). tion in database systems. ACM Cornput. Suru. Hou, W. C., AND OZSOYOGLU, G. 1993. Processing 16, 2 (June), 111. time-constrained aggregation queries in CASE- JARKE, M., AND VASSILIOU, Y. 1985. A framework DB. ACM Trans. Database Syst. To be for choosing a database query language. ACM published. Comput. Sure,. 17, 3 (Sept.), 313. Hou, W. C., AND OZSOYOGLU, G. 1991. Statistical KATz, R. H. 1990. 
Towards a unified framework estimators for aggregate relational algebra for version modeling in engineering databases. queries. ACM Trans. Database Syst. 16, 4 ACM Comput. Suru. 22, 3 (Dec.), 375. (Dec.), 600. Hou, W. C., OZSOYOGLU, G., AND DOGDU, E. 1991. KATZ, R. H., AND WONG, E. 1983. Resolving con- Error-constrained COUNT query evaluation in flicts in global storage design through replica- relational databases. In Proceedings of ACM tion. ACM Trans. Database S.vst. 8, 1 (Mar.), SIGMOD Conference. ACM, New York, 278. 110. HSIAO, H. I., ANDDEWITT, D. J. 1990. Chained KELLER, T., GRAEFE, G., AND MAIE~, D. 1991. Ef- declustering: A new availability strategy for ficient assembly of complex objects. In Proceed- multiprocessor database machines. In Proceed- ings of ACM SIGMOD Conference. ACM. New ings of the IEEE Conference on Data Engineer- York, 148. ing. IEEE, New York, 456. KEMPER, A., AND MOERKOTTE G. 1990a. Access HUA, K. A., m~ LEE, C. 1991. Handling data support in object bases. In Proceedings of ACM skew in multicomputer database computers us- SIGMOD Conference. ACM, New York, 364. ing partition tuning. In Proceedings of the In- KEMPER, A., AND MOERKOTTE, G. 1990b. Ad- ternational Conference on Very Large Data vanced query processing in object bases using Bases (Barcelona, Spain). VLDB Endowment, access support relations. In Proceedings of the 525. International Conference on Very Large Data

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques 9 165

Bases (Brisbane, Australia). VLDB Endow- KFUEGEL, H. P., AND SEEGER, B. 1988. PLOP- ment, 290. Hashing: A grid file without directory. In Pro- KEMPER, A.j AND WALI.RATH, M. 1987. An analy- ceedings of the IEEE Conference on Data Engi- sis of geometric modeling in database systems. neering. IEEE, New York, 369. ACM Comput. Suru. 19, 1 (Mar.), 47. KRIEGEL, H. P., AND SEEGER, B. 1987. Multidi- KEMPER,A., KILGER, C., AND MOERKOTTE, G. 1991. mensional dynamic hashing is very efficient for Function materialization in object bases. In nonuniform record distributions. In Proceed- Proceedings of ACM SIGMOD Conference. ings of the IEEE Conference on Data Enginee- ACM, New York, 258. ring. IEEE, New York, 10. KRRNIGHAN,B. W., ANDRITCHIE,D. M. 1978. VW KRISHNAMURTHY,R., BORAL, H., AND ZANIOLO, C. C Programming Language. Prentice-Hall, 1986. Optimization of nonrecursive queries. In Englewood Cliffs, N.J. Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). KIM, W. 1984. Highly available systems for VLDB Endowment, 128. database applications. ACM Comput. Suru. 16, 1 (Mar.), 71. KUESPERT, K.j SAAKE, G.j AND WEGNER, L. 1989. Duplicate detection and deletion in the ex- KIM, W. 1980. A new way to compute the prod- tended NF2 data model. In Proceedings of the uct and join of relations. In Proceedings of 3rd International Conference on the Founda- ACM SIGMOD Conference. ACM, New York, 179. tions of Data Organization and Algorithms (Paris, France, June). KITWREGAWA,M., AND OGAWA,Y, 1990. Bucket spreading parallel hash: A new, robust, paral- KUMAR, V., AND BURGER, A. 1991. Performance lel hash join method for skew in the super measurement of some main memory database database computer (SDC). In Proceedings of recovery algorithms. In Proceedings of the the International Conference on Very Large IEEE Conference on Data Engineering. IEEE, Data Bases (Brisbane, Australia). VLDB En- New York, 436. dowment, 210. LAKSHMI,M. S., AND Yu, P. S. 1990. Effectiveness KI’rSUREGAWA, M., NAKAYAMA, M., AND TAKAGI, M. of parallel joins. IEEE Trans. Knowledge Data 1989a. The effect of bucket size tuning in the Eng. 2, 4 (Dec.), 410. dynamic hybrid GRACE hash join method. In LAIWHMI, M. S., AND Yu, P. S, 1988. Effect of Proceedings of the International Conference on skew on join performance in parallel architec- Very Large Data Bases (Amsterdam, The tures. In Proceedings of the International Sym- Netherlands). VLDB Endowment, 257. posmm on Databases in Parallel and Dis- KITSUREGAWA7 M., TANAKA, H., AND MOTOOKA, T. tributed Systems (Austin, Tex., Dec.), 107. 1983. Application of hash to data base ma- LANKA, S., AND MAYS, E. 1991. Fully persistent chine and its architecture. New Gener. Com- B + -trees. In Proceedings of ACM SIGMOD p~t. 1, 1, 63. Conference. ACM, New York, 426. KITSUREGAWA, M., YANG, W., AND FUSHIMI, S. LARSON,P. A. 1981. Analysis of index-sequential 1989b. Evaluation of 18-stage ~i~eline. . hard- files with overflow chaining. ACM Trans. ware sorter. In Proceedings of the 6th In tern a- Database Syst. 6, 4 (Dec.), 671. tional Workshop on Database Machines LARSON, P., AND YANG, H. 1985. Computing (Deauville, France, June 19-21). queries from derived relations. In Proceedings KLUIG, A. 1982, Equivalence of relational algebra of the International Conference on Very Large and relational calculus query languages having Data Bases (Stockholm, Sweden, Aug.). VLDB aggregate functions. J. ACM 29, 3 (July), 699. Endowment, 259. KNAPP, E. 1987. Deadlock detection in dis- LEHMAN, T. 
J., AND CAREY, M. J. 1986. Query tributed databases. ACM Comput. Suru. 19, 4 processing in main memory database systems. (Dec.), 303. In Proceedings of ACM SIGMOD Conference. KNUTH, D, 1973. The Art of Computer Program- ACM, New York, 239. ming, Vol. III, Sorting and Searching. LELEWER, D. A., AND HIRSCHBERG, D. S. 1987. Addison-Wesley, Reading, Mass. Data compression. ACM Comput. Suru. 19, 3 KOLOVSON, C. P., ANTD STONEBRAKER M. 1991. (Sept.), 261. Segment indexes: Dynamic indexing tech- LEUNG, T. Y. C., AND MUNTZ, R. R. 1992. Tempo- niques for multi-dimensional interval data. In ral query processing and optimization in multi- Proceechngs of ACM SIGMOD Conference. processor database machines. In Proceedings of ACM, New York, 138. the International Conference on Very Large KC)OI. R. P. 1980. The optimization of queries in Data Bases (Vancouver, BC, Canada). VLDB relational databases. Ph.D. thesis, Case West- Endowment, 383. ern Reserve Univ., Cleveland, Ohio. LEUNG, T. Y. C., AND MUNTZ, R. R. 1990. Query KOOI, R. P., AND FRANKFORTH, D. 1982. Query processing in temporal databases. In Proceed- optimization in Ingres. IEEE Database Eng. 5, ings of the IEEE Conference on Data Englneer- 3 (Sept.), 2. mg. IEEE, New York, 200.

ACM Com~utina Survevs, Vol 25, No. 2, June 1993 166 “ Goetz Graefe

LI, K., AND NAUGHTON, J. 1988. Multiprocessor MAIER, D., GRAEFE, G., SHAFIRO, L., DANIELS, S., main memory transaction processing. In Pro- KELLER, T., AND VANCE, B. 1992 Issues in ceedings of the Intcrnatlonal Symposium on distributed complex object assembly In Prc~- Databases m Parallel and Dlstrlbuted Systems ceedings of the Workshop on Distributed Object (Austin, Tex., Dec.), 177. Management (Edmonton, BC, Canada, Aug.). LITWIN, W. 1980. Linear hashing: A new tool for MANNINO, M. V., CHU, P., ANrI SAGER, T. 1988. file and table addressing. In Proceedings of the Statistical profile estimation m database sys- International Conference on Ver

ACM Computmg Surveys, Vol 25, No 2. June 1993 Query Evaluation Techniques ● 167

ACM SIGMOD Conference. ACM, New York, data models. ACM Comput. Suru. 20, 3 (Sept.), 118. 153. NG, R., FALOUTSOS,C., ANDSELLIS,T. 1991. Flex- PmAHESH, H., MOHAN, C., CHENG, J., LIU, T. S., AND ible buffer allocation based on marginal gains. SELINGER, P. 1990. Parallelism in relational In Proceedings of ACM SIGMOD Conference. data base systems: Architectural issues and ACM, New York, 387. design approaches. In Proceedings of the Inter- NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, national Symposwm on Databases m Parallel K. C. 1984. The grid file: An adaptable, sym- and Distributed Systems (Dublin, Ireland, metric multikey file structure. ACM Trans. July). Database Syst. 9, 1 (Mar.), 38. QADAH, G. Z. 1988. Filter-based join algorithms NYBERG,C., BERCLAY,T., CVETANOVIC,Z., GRAY.J., on uniprocessor and dmtributed-memory multi- ANDLOMET,D. 1993. AlphaSort: A RISC ma- processor database machines. In Lecture Notes chine sort. Tech. Rep. 93.2. DEC San Francisco m Computer Science, vol. 303. Springer-Verlag, Systems Center. Digital Equipment Corp., San New York, 388. Francisco. REW, R. K., AND DAVIS, G. P. 1990. The Unidata OMIECINSK1,E. 1991. Performance analysis of a NetCDF: Software for scientific data access. In load balancing relational hash-join algorithm the 6th International Conference on Interactwe Information and Processing Swstems for Me- for a shared-memory multiprocessor. In Pro- teorology, Ocean ography,- aid Hydrology ceedings of the International Conference on Very (Anaheim, Calif.). Large Data Bases (Barcelona, Spain). VLDB Endowment, 375. RICHARDSON, J. E., AND CAREY, M. J. 1987. Pro- OMIECINSKLE. 1985. Incremental file reorgani- gramming constructs for database system im- plementation m EXODUS. In Proceedings of zation schemes. In Proceedings of the Interna- ACM SIGMOD Conference. ACM, New York, tional Conference on Very Large Data Bases 208. (Stockholm, Sweden, Aug.). VLDB Endowment, 346. RICHAR~SON,J. P., Lu, H., AND MIKKILINENI, K. 1987. Design and evaluation of parallel OMIECINSKI, E., AND LIN, E. 1989. Hash-based pipelined join algorithms. In Proceedings of and index-based join algorithms for cube and ACM SIGMOD Conference. ACM, New York, ring connected multicomputers. IEEE Trans. 399. Knowledge Data Eng. 1, 3 (Sept.), 329. ROBINSON,J. T. 1981. The K-D-B-Tree: A search ONO, K., AND LOHMAN, G. M. 1990. Measuring structure for large multidimensional dynamic the complexity of join enumeration in query indices. ln proceedings of ACM SIGMOD Con- optimization. In Proceedings of the Interna- ference. ACM, New York, 10. tional Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 314. ROSENTHAL,A., ANDREINER,D. S. 1985. Query- ing relational views of networks. In Query Pro- OUSTERHOUT,J. 1990. Why aren’t operating sys- cessing in Database Systems. Springer, Berlin, tems getting faster as fast as hardware. In 109. USENIX Summer Conference (Anaheim, Calif., ROSENTHAL, A., RICH, C., AND SCHOLL, M. 1991. June). USENIX. Reducing duplicate work in relational join(s): A O~SOYOGLU, Z. M., AND WANG, J. 1992. A keying modular approach using nested relations. ETH method for a nested relational database man- Tech. Rep., Zurich, Switzerland. agement system. In Proceedings of the IEEE ROTEM, D., AND SEGEV, A. 1987. Physical organi- Conference on Data Engineering. IEEE, New zation of temporal data. In Proceedings of the York, 438. IEEE Conference on Data Engineering. IEEE, OZSOYOGLU,G., OZSOYOGLU,Z. M., ANDMATOS,V. New York, 547. 1987. 
Extending relational algebra and rela- ROTH, M. A., KORTH, H. F., AND SILBERSCHATZ, A. tional calculus with set-valued attributes and 1988. Extended algebra and calculus for aggregate functions. ACM Trans. Database nested relational databases. ACM Trans. Syst. 12, 4 (Dec.), 566. Database Syst. 13, 4 (Dec.), 389. Ozsu, M. T., AND VALDURIEZ, P. 1991a. Dis- ROTHNIE, J. B., BERNSTEIN, P. A., Fox, S., GOODMAN, tributed database systems: Where are we now. N., HAMMER, M., LANDERS, T. A., REEVE, C., IEEE Comput. 24, 8 (Aug.), 68. SHIPMAN, D. W., AND WONG, E. 1980. Intro- Ozsu, M. T., AND VALDURIEZ, P. 1991b. Principles duction to a system for distributed databases of Distributed Database Systems. Prentice-Hall, (SDD-1). ACM Trans. Database Syst. 5, 1 Englewood Cliffs, N.J. (Mar.), 1. PALMER, M., AND ZDONIK, S. B. 1991. FIDO: A ROUSSOPOULOS, N. 1991. An incremental access cache that learns to fetch. In Proceedings of the method for ViewCache: Concept, algorithms, International Conference on Very Large Data and cost analysis. ACM Trans. Database Syst. Bases (Barcelona, Spain). VLDB Endowment, 16, 3 (Sept.), 535. 255. ROUSSOPOULOS, N., AND KANG, H. 1991. A PECKHAM, J., AND MARYANSKI, F. 1988. Semantic pipeline N-way join algorithm based on the

ACM Computing Surveys, Vol. 25. No. 2, June 1993 168 “ Goetz Graefe

2-way semijoin program. IEEE Trans Knoul- mance evaluation of four parallel join algo- edge Data Eng. 3, 4 (Dec.), 486. rithms in a shared-nothing multiprocessor RUTH, S. S , AND KEUTZER, P J 1972. Data com- environment. In Proceedings of ACM SIGMOD pression for business files. Datamatlon 18 Conference ACM, New York, 110 (Sept.), 62. SCHOLL, M. H, 1988. The nested relational model SAAKD, G., LINNEMANN, V., PISTOR, P , AND WEGNER, —Efficient support for a relational database L. 1989. Sorting, grouping and duplicate interface. Ph.D. thesis, Technical Umv. Darm- elimination in the advanced information man- stadt. In German. agement prototype. In Proceedl ngs of th e Inter- SCHOLL, M., PAUL, H. B., AND SCHEK, H. J 1987 national Conference on Very Large Data Bases Supporting flat relations by a nested relational VLDB Endowment, 307 Extended version in kernel. In Proceedings of the International IBM Sci. Ctr. Heidelberg Tech. Rep 8903.008, Conference on Very Large Data Bases (Brigh- March 1989. ton, England, Aug ) VLDB Endowment. 137. S.4CX:0, G 1987 Index access with a finite buffer. SEEGER, B., ANU LARSON, P A 1991. Multl-disk In Procecdmgs of the International Conference B-trees. In Proceedings of ACM SIGMOD Con- on Very Large Data Bases (Brighton, England, ference ACM, New York. 436. Aug.) VLDB Endowment. 301. ,SEG’N, A . AND GtTN.ADHI. H. 1989. Event-Join op- SACCO, G. M., .4ND SCHKOLNIK, M. 1986, Buffer timization in temporal relational databases. In management m relational database systems. Proceedings of the IrLternatlOnctl Conference on ACM Trans. Database Syst. 11, 4 (Dec.), 473. Very Large Data Bases (Amsterdam, The SACCO, G M , AND SCHKOLNI~, M 1982. A mech- Netherlands), VLDB Endowment, 205. anism for managing the buffer pool m a rela- SELINGiZR, P. G., ASTRAHAN, M. M , CHAMBERLAIN. tional database system using the hot set model. D. D., LORIE, R. A., AND PRWE, T. G. 1979 In Proceedings of the International Conference Access path selectlon m a relational database on Very Large Data Bases (Mexico City, Mex- management system In Proceedings of AdM ico, Sept.). VLDB Endowment, 257. SIGMOD Conference. ACM, New York, 23. SACXS-DAVIS,R., AND RAMMIOHANARAO,K. 1983. Reprinted in Readings m Database Sy~tems A two-level superimposed coding scheme for Morgan-Kaufman, San Mateo, Calif., 1988. partial match retrieval. Inf. Syst. 8, 4, 273. SELLIS. T. K. 1987 Efficiently supporting proce- S.ACXS-DAVIS, R., KENT, A., ANn RAMAMOHANAR~O, K dures in relational database systems In Pro- 1987. Multikey access methods based on su- ceedl ngs of ACM SIGMOD Conference. ACM. perimposed coding techniques. ACM Trans New York, 278. Database Syst. 12, 4 (Dec.), 655 SEPPI, K., BARNES, J,, AND MORRIS, C. 1989, A SALZBERG, B. 1990 Mergmg sorted runs using Bayesian approach to query optimization m large main memory Acts Informatica 27, 195 large scale data bases The Univ. of Texas at SALZBERG, B. 1988, Fde Structures: An Analytlc Austin ORP 89-19, Austin. Approach. Prentice-Hall, Englewood Cliffs, NJ. SERLIN, 0. 1991. The TPC benchmarks. In SALZBER~,B , TSU~~RMAN,A., GRAY, J., STEWART, Database and Transaction Processing System M., UREN, S., ANrJ VAUGHAN, B. 1990 Fast- Performance Handbook. Morgan-Kaufman, San Sort: A distributed single-input single-output Mateo, Cahf external sort In Proceeduzgs of ACM SIGMOD SESHAnRI, S , AND NAUGHTON, J F. 1992 Sam- Conference ACM, New York, 94. pling issues in parallel database systems In SAMET, H. 1984. 
The quadtree and related hier- Proceedings of the International Conference archical data structures. ACM Comput. Saru. on Extending Database Technology (Vienna, 16, 2 (June), 187, Austria, Mar.). SCHEK, H. J., ANII SCHOLL, M. H. 1986. The rela- SEVERANCE. D. G. 1983. A practitioner’s guide to tional model with relation-valued attributes. data base compression. Inf. S.yst. 8, 1, 51. Inf. Syst. 11, 2, 137. SE\rERANCE, D., AND LOHMAN, G 1976. Differen- SCHNEIDER, D. A. 1991. Blt filtering and multi- tial files: Their application to the maintenance way join query processing. Hewlett-Packard of large databases ACM Trans. Database Syst. Labs, Palo Alto, Cahf. Unpublished Ms 1.3 (Sept.). SCHNEIDER, D. A. 1990. Complex query process- SEWERANCE, C., PRAMANK S., AND WOLBERG, P. ing in multiprocessor database machines. Ph.D. 1990. Distributed linear hashing and parallel thesis, Univ. of Wmconsin-Madison projection in mam memory databases. In Pro- SCHNF,mER. D. A., AND DEWITT, D. J. 1990. ceedings of the Internatl onal Conference on Very Tradeoffs in processing complex join queries Large Data Bases (Brmbane, Australia) VLDB via hashing in multiprocessor database Endowment, 674. machines. In Proceedings of the Interna- SHAPIRO, L. D. 1986. Join processing in database tional Conference on Very Large Data Bases systems with large main memories. ACM (Brisbane, Austraha). VLDB Endowment, 469 Trans. Database Syst. 11, 3 (Sept.), 239. SCHNEIDER, D., AND DEWITT, D. 1989. A perfor- SHAW, G. M., AND ZDONIIL S. B. 1990. A query

ACM Camputmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 169

algebra for object-oriented databases. In Pro. STONEBRAKER, M. 1991. Managing persistent ob- ceedings of the IEEE Conference on Data Engl. jects in a multi-level store. In Proceedings of neering. IEEE, New York, 154. ACM SIGMOD Conference. ACM, New York, 2. SHAW, G., AND Z~ONIK, S. 1989a. An object- STONEBRAKER,M. 1987. The design of the POST- oriented query algebra. IEEE Database Eng. GRES storage system, In Proceedings of the 12, 3 (Sept.), 29. International Conference on Very Large Data SIIAW, G. M., AND ZDUNIK, S. B. 1989b. AU Bases (Brighton, England, Aug.). VLDB En- object-oriented query algebra. In Proceedings dowment, 289. Reprinted in Readings in of the 2nd International Workshop on Database Database Systems. Morgan-Kaufman, San Ma- Programming Languages. Morgan-Kaufmann, teo, Calif,, 1988. San Mateo, Calif., 103. STONEBRAKER, M. 1986a. The case for shared- SHEKITA, E. J., AND CAREY, M. J. 1990. A perfor- nothing. IEEE Database Eng. 9, 1 (Mar.), mance evaluation of pointer-based joins. In STONEBRAKER, M, 1986b. The design and imple- Proceedings of ACM SIGMOD (70nferen ce. mentation of distributed INGRES. In The ACM, New York, 300. INGRES Papers. Addison-Wesley, Reading, SHERMAN, S. W., AND BRICE, R. S. 1976. Perfor- Mass., 187. mance of a database manager in a virtual STONEBRAKER, M. 1981. Operating system sup- memory system. ACM Trans. Data base Syst. port for database management. Comrnun. ACM 1, 4 (Dec.), 317. 24, 7 (July), 412. SHETH, A. P., AND LARSON, J. A. 1990, Federated STONEBRAKER, M. 1975. Implementation of in- database systems for managing distributed, tegrity constraints and views by query modifi- heterogeneous, and autonomous databases. cation. In Proceedings of ACM SIGMOD Con- ACM Comput. Surv. 22, 3 (Sept.), 183. ference ACM, New York. SHIPMAN, D. W. 1981. The functional data model STONEBRAKER, M., AOKI, P., AND SELTZER, M. and the data 1anguage DAPLEX. ACM Trans. 1988a. Parallelism in XPRS. UCB/ERL Database Syst. 6, 1 (Mar.), 140. Memorandum M89\ 16, Univ. of California, SIKELER, A. 1988. VAR-PAGE-LRU: A buffer re- Berkeley. placement algorithm supporting different page STONEBRAWCR, M., JHINGRAN, A., GOH, J., AND sizes. In Lecture Notes in Computer Science, POTAMIANOS, S. 1990a. On rules, procedures, vol. 303. Springer-Verlag, New York, 336. caching and views in data base systems In SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J. Proceedings of ACM SIGMOD Conference. 1991. Database systems: Achievements and ACM, New York, 281 opportunities. Commun. ACM 34, 10 (Oct.), STONEBRAKER,M., KATZ, R., PATTERSON,D., AND 110, OUSTERHOUT,J. 1988b. The design of XPRS, SIX, H. W., AND WIDMAYER, P. 1988. Spatial In Proceedings of the International Conference searching in geometric databases. In Proceed- on Very Large Data Bases (Los Angeles, Aug.). ings of the IEEE Conference on Data Enginee- VLDB Endowment, 318. ring. IEEE, New York, 496. STONEBBAKER, M., ROWE, L. A., AND HIROHAMA, M. SMITH,J. M., ANDCHANG,P. Y. T. 1975. Optimiz- 1990b. The implementation of Postgres. IEEE ing the performance of a relational algebra Trans. Knowledge Data Eng. 2, 1 (Mar.), 125. database interface. Commun. ACM 18, 10 STRAUBE, D. D., AND OLSU, M. T. 1989. Query (Oct.), 568. transformation rules for an object algebra, SNODGRASS. R, 1990. Temporal databases: Status Dept. of Computing Sciences Tech, Rep. 89-23, and research directions. ACM SIGMOD Rec. Univ. of Alberta, Alberta, Canada. 19, 4 (Dec.), 83. Su, S. Y. W. 1988. Database Computers: Princ- SOCKUT, G. H., AND GOLDBERG, R. P. 1979. 
iples, Archltectur-es and Techniques. McGraw- Database reorganization—Principles and prac- Hill, New York. tice. ACM Comput. Suru. 11, 4 (Dec.), 371. TANSEL, A. U., AND GARNETT, L. 1992. On Roth, SRINIVASAN, V., AND CAREY, M. J. 1992. Perfor- Korth, and Silberschat,z’s extended algebra and mance of on-line index construction algorithms. calculus for nested relational databases. ACM In Proceedings of the International Conference Trans. Database Syst. 17, 2 (June), 374. on Extending Database Technology (Vienna, TEOROy, T. J., YANG, D., AND FRY, J. P. 1986. A Austria, Mar.). logical design metb odology for relational SRINWASAN,V., AND CAREY, M. J. 1991. Perfor- databases using the extended entity-relation- mance of B-tree concurrency control algo- ship model. ACM Cornput. Suru. 18, 2 (June). rithms. In Proceedings of ACM SIGMOD 197. Conference. ACM, New York, 416. TERADATA. 1983. DBC/1012 Data Base Com- STAMOS,J. W., ANDYOUNG,H. C. 1989. A sym- puter, Concepts and Facilities. Teradata Corpo- metric fragment and replicate algorithm for ration, Los Angeles. distributed joins. Tech. Rep. RJ7 188, IBM Re- THOMAS, G., THOMPSON, G. R., CHUNG, C. W., search Labs, San Jose, Calif. BARKMEYER, E., CARTER, F., TEMPLETON, M.,

ACM Computing Surveys, Vol. 25, No. 2, June 1993 170 “ Goetz Graefe

Fox, S., AND HARTMAN, B. 1990. Heteroge- in a mam memory database system. Ph.D. the- neous distributed database systems for produc- sis, Univ. of Tweuk, The Netherlands. tion use, ACM Comput. Surv. 22. 3 (Sept.), WiLSCHUT, A. N., AND AP~RS, P. M. G. 1993. 237. Dataflow query execution m a parallel main- TOMPA, F. W., AND BLAKELEY, J A. 1988. Main- memory environment. Distrlb. Parall. taining materialized views without accessing Databases 1, 1 (Jan.), 103. base data. Inf Swt. 13, 4, 393. WOLF, J. L , DIAS, D. M , AND Yu, P. S. 1990. An TRAI~ER, 1. L. 1982. Virtual memory manage- effective algorithm for parallelizing sort merge ment for data base systems ACM Oper. S.vst. in the presence of data skew In Proceedl ngs of Reu. 16, 4 (Oct.), 26. the International Syrnpo.?lum on Data base~ 1n TRIAGER, 1. L., GRAY, J., GALTIERI, C A., AND Parallel and DLstrlbuted Systems (Dubhn, LINDSAY, B, G. 1982. Transactions and con- Ireland, July) sistency in distributed database systems. ACM WOLF, J. L., DIAS, D M., Yu, P. S . AND TUREK, ,J. Trans. Database Syst. 7, 3 (Sept.), 323. 1991. An effective algorithm for parallelizmg TSUR, S., AND ZANIOLO, C. 1984 An implementa- hash Joins in the presence of data skew. In tion of GEM—Supporting a semantic data Proceedings of the IEEE Conference on Data model on relational back-end. In Proceedings of Engtneermg. IEEE, New York. 200 ACM SIGMOD Conference. ACM, New York, WOLNIEWWZ, R. H., AND GRAEFE, G. 1993 Al- 286. gebralc optlmlzatlon of computations over TUKEY, J, W, 1977. Exploratory Data Analysls. scientific databases. In Proceedings of the Addison-Wesley, Reading, Mass, International Conference on Very Large UNTDATA 1991. NetCDF User’s Guide, An Inter- Data Bases. VLDB Endowment. face for Data Access, Verszon III. NCAR Tech WONG, E., ANO KATZ. R. H. 1983. Dlstributmg a Note TS-334 + 1A, Boulder, Colo database for parallelism. In Proceedings of VALDURIIIZ, P. 1987. Join indices. ACM Trans. ACM SIGMOD Conference. ACM, New York, Database S.yst. 12, 2 (June). 218. 23, WONG, E., AND Youssmv, K. 1976 Decomposition VANDEN~ERC, S. L., AND DEWITT, D. J. 1991. Al- —A strategy for query processing. ACM Trans gebraic support for complex objects with ar- Database Syst 1, 3 (Sept.), 223. rays, identity, and inhentance. In Proceedmg.s of ACM SIGMOD Conference. ACM, New York, YANG, H., AND LARSON, P. A. 1987. Query trans- 158, formation for PSJ-queries. In proceedings of the International Conference on Very Large WALTON, C B. 1989 Investigating skew and Data Bases (Brighton, England, Aug. ) VLDB scalabdlty in parallel joins. Computer Science Endowment, 245. Tech. Rep. 89-39, Umv. of Texas, Austin. YOUSS~FI. K, AND WONG, E 1979. Query pro- WALTON, C, B., DALE, A. G., AND JENEVEIN, R. M. cessing in a relational database management 1991. A taxonomy and performance model of system. In Proceedings of the Internatmnal data skew effects in parallel joins. In Proceed- Conference on Very Large Data Bases (Rio de ings of the Interns tlona 1 Conference on Very Janeiro, Ott ). VLDB Endowment, 409. Large Data Bases (Barcelona, Spain). VLDB Endowment, 537. Yu, C, T., AND CHANG) C. C. 1984. Distributed query processing. ACM Comput. Surt, 16, 4 WHANG. K. Y.. AND KRISHNAMURTHY, R. 1990. (Dec.), 399. Query optimization m a memory-resident do- mam relational calculus database system. ACM YLT, L., AND OSBORN, S, L, 1991. An evaluation Trans. Database Syst. 15, 1 (Mar.), 67. framework for algebralc object-oriented query WHANG, K. Y., WIEDERHOLD G., AND SAGALOWICZ, D. models. 
In Proceedings of the IEEE Conference 1985 The property of separabdity and Its ap- on Data Englneermg. VLDB Endowment, 670. plication to physical database design. In Query ZANIOIJ I, C. 1983 The database language Gem. Processing m Database Systems. Springer, In Proceedings of ACM SIGMOD Conference. Berlin, 297. ACM, New York, 207. Rcprmted m Readcngb WHANG, K. Y., WIEDERHOLD, G., AND SAGLOWICZ, D. zn Database Systems. Morgan-Kaufman, San 1984. Separability—An approach to physical Mateo, Calif., 1988. database design. IEEE Trans. Comput. 33, 3 ZANmLo, C. 1979 Design of relational views over (Mar.), 209. network schemas. In Proceedings of ACM WILLIANLS,P., DANIELS, D., HAAs, L., LAPIS, G., SIGMOD Conference ACM, New York, 179 LINDSAY,B., NG, P., OBERMARC~,R., SELINGER, ZELLER, H. 1990. Parallel query execution m P., WALKER, A., WILMS, P., AND YOST, R. 1982 NonStop SQL. In Dtgest of Papers, 35th C’omp- R’: An overview of the architecture. In Im- Con Conference. San Francisco. provmg Database Usabdlty and Responslue - ZELLER H,, AND GRAY, J. 1990 An adaptive hash ness. Academic Press, New York. Reprinted in join algorithm for multiuser environments. In Readings m Database Systems. Morgan-Kauf- Proceedings of the International Conference on man, San Mateo, Calif., 1988 Very Large Data Bases (Brisbane, Australia). WILSCHUT, A. N. 1993. Parallel query execution VLDB Endowment, 186.

Received January 1992; final revision accepted February 1993