Query Evaluation Techniques for Large Databases

GOETZ GRAEFE

Portland State University, Computer Science Department, P. O. Box 751, Portland, Oregon 97207-0751

Database management systems will continue to manage large data volumes. Thus, efficient algorithms for accessing and manipulating large sets and sequences will be required to provide acceptable performance. The advent of object-oriented and extensible systems will not solve this problem. On the contrary, modern data models exacerbate the problem: In order to manipulate large sets of complex objects as efficiently as today's database systems manipulate simple records, query processing algorithms and software will become more complex, and a solid understanding of algorithm and architectural issues is essential for the designer of database management software. This survey provides a foundation for the design and implementation of query execution facilities in new database management systems. It describes a wide array of practical query evaluation techniques for both relational and postrelational database systems, including iterative execution of complex query evaluation plans, the duality of sort- and hash-based set-matching algorithms, types of parallel query execution and their implementation, and special operators for emerging database application domains.

Categories and Subject Descriptors: E.5 [Data]: Files; H.2.4 [Database Management]: Systems—query processing

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Complex query evaluation plans, dynamic query evaluation plans, extensible database systems, iterators, object-oriented database systems, operator model of parallelization, parallel algorithms, relational database systems, set-matching algorithms, sort-hash duality

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1993 ACM 0360-0300/93/0600-0073 $01.50

CONTENTS

INTRODUCTION
1. ARCHITECTURE OF QUERY EXECUTION ENGINES
2. SORTING AND HASHING
   2.1 Sorting
   2.2 Hashing
3. DISK ACCESS
   3.1 File Scans
   3.2 Associative Access Using Indices
   3.3 Buffer Management
4. AGGREGATION AND DUPLICATE REMOVAL
   4.1 Aggregation Algorithms Based on Nested Loops
   4.2 Aggregation Algorithms Based on Sorting
   4.3 Aggregation Algorithms Based on Hashing
   4.4 A Rough Performance Comparison
   4.5 Additional Remarks on Aggregation
5. BINARY MATCHING OPERATIONS
   5.1 Nested-Loops Join Algorithms
   5.2 Merge-Join Algorithms
   5.3 Hash Join Algorithms
   5.4 Pointer-Based Joins
   5.5 A Rough Performance Comparison
6. UNIVERSAL QUANTIFICATION
7. DUALITY OF SORT- AND HASH-BASED QUERY PROCESSING ALGORITHMS
8. EXECUTION OF COMPLEX QUERY PLANS
9. MECHANISMS FOR PARALLEL QUERY EXECUTION
   9.1 Parallel versus Distributed Database Systems
   9.2 Forms of Parallelism
   9.3 Implementation Strategies
   9.4 Load Balancing and Skew
   9.5 Architectures and Architecture Independence
10. PARALLEL ALGORITHMS
   10.1 Parallel Selections and Updates
   10.2 Parallel Sorting
   10.3 Parallel Aggregation and Duplicate Removal
   10.4 Parallel Joins and Other Binary Matching Operations
   10.5 Parallel Universal Quantification
11. NONSTANDARD QUERY PROCESSING ALGORITHMS
   11.1 Nested Relations
   11.2 Temporal and Scientific Database Management
   11.3 Object-oriented Database Systems
   11.4 More Control Operators
12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT
   12.1 Precomputation and Derived Data
   12.2 Data Compression
   12.3 Surrogate Processing
   12.4 Bit Vector Filtering
   12.5 Specialized Hardware
SUMMARY AND OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

Effective and efficient management of large data volumes is necessary in virtually all computer applications, from business data processing to library information retrieval systems, multimedia applications with images and sound, computer-aided design and manufacturing, real-time process control, and scientific computation. While database management systems are standard tools in business data processing, they are only slowly being introduced to all the other emerging database application areas.

In most of these new application domains, database management systems have traditionally not been used for two reasons. First, restrictive data definition and manipulation languages can make application development and maintenance unbearably cumbersome. Research into semantic and object-oriented data models and into persistent database programming languages has been addressing this problem and will eventually lead to acceptable solutions. Second, data volumes might be so large or complex that the real or perceived performance advantage of file systems is considered more important than all other criteria, e.g., the higher levels of abstraction and programmer productivity typically achieved with database management systems. Thus, object-oriented database management systems that are designed for nontraditional database application domains and extensible database management system toolkits that support a variety of data models must provide excellent performance to meet the challenges of very large data volumes, and techniques for manipulating large data sets will find renewed and increased interest in the database community.

The purpose of this paper is to survey efficient algorithms and software architectures of database query execution engines for executing complex queries over large databases. A "complex" query is one that requires a number of query-processing algorithms to work together, and a "large" database uses files with sizes from several megabytes to many terabytes, which are typical for database applications at present and in the near future [Dozier 1992; Silberschatz et al. 1991]. This survey discusses a large variety of query execution techniques that must be considered when designing and implementing the query execution module of a new database management system: algorithms and their execution costs, sorting versus hashing, parallelism, resource allocation and scheduling issues in complex queries, special operations for emerging database application domains such as statistical and scientific databases, and general performance-enhancing techniques such as precomputation and compression.

While many, although not all, techniques discussed in this paper have been developed in the context of relational database systems, most of them are applicable to and useful in the query processing facility for any database management system and any data model, provided the data model permits queries over "bulk" data types such as sets and lists.

It is assumed that the reader possesses basic textbook knowledge of database query languages, in particular of relational algebra, and of file systems, including some basic knowledge of index structures. As shown in Figure 1, query processing fills the gap between database query languages and file systems. It can be divided into query optimization and query execution. A query optimizer translates a query expressed in a high-level query language into a sequence of operations that are implemented in the query execution engine or the file system. The goal of query optimization is to find a query evaluation plan that minimizes the most relevant performance measure, which can be the database user's wait for the first or last result item, CPU, I/O, and network time and effort (time and effort can differ due to parallelism), memory costs (as maximum allocation or as time-space product), total resource usage, even energy consumption (e.g., for battery-powered laptop systems or spacecraft), a combination of the above, or some other performance measure. Query optimization is a special form of planning, employing techniques from artificial intelligence such as plan representation, search including directed search and pruning, dynamic programming, branch-and-bound algorithms, etc. The query execution engine is a collection of query execution operators and mechanisms for operator communication and synchronization—it employs concepts from algorithm design, operating systems, networks, and parallel and distributed computation. The facilities of the query execution engine define the space of possible plans that can be chosen by the query optimizer.

Figure 1. Query processing in a database system. (The figure shows the layers User Interface, Database Query Language, Query Optimizer, Query Execution Engine, Files and Indices, I/O Buffer, and Disk.)

A general outline of the steps required for processing a database query is shown in Figure 2. Of course, this sequence is only a general guideline, and different database systems may use different steps or merge multiple steps into one. After a query or request has been entered into the database system, be it interactively or by an application program, the query is parsed into an internal form. Next, the query is validated against the metadata (data about the data, also called schema or catalogs) to ensure that the query contains only valid references to existing database objects. If the database system provides a macro facility such as relational views, referenced macros and views are expanded into the query [Stonebraker 1975]. Integrity constraints might be expressed as views (externally or internally) and would also be integrated into the query at this point in most systems [Motro 1989]. The query optimizer then maps the expanded query expression into an optimized plan that operates directly on the stored database objects. This mapping process can be very complex and might require substantial search and cost estimation effort. (Optimization is not discussed in this paper; a survey can be found in Jarke and Koch [1984].) The optimizer's output is called a query execution plan, query evaluation plan, QEP, or simply plan. Using a simple tree traversal algorithm, this plan is translated into a representation ready for execution by the database's query execution engine; the result of this translation can be compiled machine code or a semicompiled or interpreted language or data structure.

This survey discusses only read-only queries explicitly; however, most of the techniques are also applicable to update requests. In most database management systems, update requests may include a search predicate to determine which database objects are to be modified.

Figure 2. Query processing steps: Parsing, Query Validation, View Resolution, Optimization, Plan Compilation, Execution.

Standard query optimization and execution techniques apply to this search; the actual update procedure can be either applied in a second phase, a method called deferred updates, or merged into the search phase if there is no danger of creating ambiguous update semantics.¹ The problem of ensuring ACID semantics for updates—making updates Atomic (all-or-nothing semantics), Consistent (translating any consistent database state into another consistent database state), Isolated (from other queries and requests), and Durable (persistent across all failures)—is beyond the scope of this paper; suitable techniques have been described by many other authors, e.g., Bernstein and Goodman [1981], Bernstein et al. [1987], Gray and Reuter [1991], and Haerder and Reuter [1983].

Most research into providing ACID semantics focuses on efficient techniques for processing very large numbers of relatively small requests. For example, increasing the balance of one account and decreasing the balance of another account require exclusive access to only two database records and writing some information to an update log. Current research and development efforts in transaction processing target hundreds and even thousands of small transactions per second [Davis 1992; Serlin 1991]. Query processing, on the other hand, focuses on extracting information from a large amount of data without actually changing the database. For example, printing reports for each branch office with average salaries of employees under 30 years old requires shared access to a large number of records. Mixed requests are also possible, e.g., for crediting monthly earnings to a stock account by combining information about a number of sales transactions. The techniques discussed here apply to the search effort for such a mixed request, e.g., for finding the relevant sales transactions for each stock account.

Embedded queries, i.e., database queries that are contained in an application program written in a standard programming language such as Cobol, PL/1, C, or Fortran, are also not addressed specifically in this paper because all techniques discussed here can be used for interactive as well as embedded queries. Embedded queries usually are optimized when the program is compiled, in order to avoid the optimization overhead when the program runs. This method was pioneered in System R, including mechanisms for storing optimized plans and invalidating stored plans when they become infeasible, e.g., when an index is dropped from the database [Chamberlin et al. 1981b]. Of course, the cut between compile-time and run-time can be placed at any other point in the sequence in Figure 2.

Recursive queries are omitted from this survey, because the entire field of recursive query processing—optimization rules and heuristics, selectivity and cost estimation, algorithms and their parallelization—is still developing rapidly (suffice it to point to two recent surveys by Bancilhon and Ramakrishnan [1986] and Cacace et al. [1993]).

The present paper surveys query execution techniques; other surveys that pertain to the wide subject of database systems have considered data models and query languages [Gallaire et al. 1984; Hull and King 1987; Jarke and Vassiliou 1985; McKenzie and Snodgrass 1991; Peckham and Maryanski 1988], access methods [Comer 1979; Enbody and Du 1988; Faloutsos 1985; Samet 1984; Sockut and Goldberg 1979], compression techniques [Bell et al. 1989; Lelewer and Hirschberg 1987], distributed and heterogeneous systems [Batini et al. 1986; Litwin et al. 1990; Sheth and Larson 1990; Thomas et al. 1990], concurrency control and recovery [Barghouti and Kaiser 1991; Bernstein and Goodman 1981; Gray et al. 1981; Haerder and Reuter 1983; Knapp 1987], availability and reliability [Davidson et al. 1985; Kim 1984], query optimization [Jarke and Koch 1984; Mannino et al. 1988; Yu and Chang 1984], and a variety of other database-related topics [Adam and Wortmann 1989; Atkinson and Bunemann 1987; Katz 1990; Kemper and Wallrath 1987; Lyytinen 1987; Teoroy et al. 1986]. Bitton et al. [1984] have discussed a number of parallel-sorting techniques, only a few of which are really used in database systems. Mishra and Eich's [1992] recent survey of relational join algorithms compares their behavior using diagrams derived from one by Kitsuregawa et al. [1983] and also describes join methods using index structures and join methods for distributed systems. The present survey is much broader in scope as it also considers system architectures for complex query plans and for parallel execution, selection and aggregation algorithms, the relationship of sorting and hashing as it pertains to database query processing, special operations for nontraditional data models, and auxiliary techniques such as compression.

Section 1 discusses the architecture of query execution engines. Sorting and hashing, the two general approaches to managing and matching elements of large sets, are described in Section 2. Section 3 focuses on accessing large data sets on disk. Section 4 begins the discussion of actual data manipulation methods with algorithms for aggregation and duplicate removal, continued in Section 5 with binary matching operations such as join and intersection and in Section 6 with operations for universal quantification. Section 7 reviews the many dualities between sorting and hashing and points out their differences that have an impact on the performance of algorithms based on either one of these approaches. Execution of very complex query plans with many operators and with nontrivial plan shapes is discussed in Section 8. Section 9 is devoted to mechanisms for parallel execution, including architectural issues and load balancing, and Section 10 discusses specific parallel algorithms. Section 11 outlines some nonstandard operators for emerging database applications such as statistical and scientific database management systems. Section 12 is a potpourri of additional techniques that enhance the performance of many algorithms, e.g., compression, precomputation, and specialized hardware. The final section contains a brief summary and an outlook on query processing research and its future.

For readers who are more interested in some topics than others, most sections are fairly self-contained. Moreover, the hurried reader may want to skip the derivation of cost functions;² their results and effects are summarized later in diagrams.

¹ A standard example for this danger is the "Halloween" problem: Consider the request to "give all employees with salaries greater than $30,000 a 3% raise." If (i) these employees are found using an index on salaries, (ii) index entries are scanned in increasing salary order, and (iii) the index is updated immediately as index entries are found, then each qualifying employee will get an infinite number of raises.

² In any case, our cost functions cover only a limited, though important, aspect of query execution cost, namely I/O effort.

1. ARCHITECTURE OF QUERY EXECUTION ENGINES

This survey focuses on useful mechanisms for processing sets of items. These items can be records, tuples, entities, or objects. Furthermore, most of the techniques discussed in this survey apply to sequences, not only sets, of items, although most query processing research has assumed relations and sets. All query processing algorithm implementations iterate over the members of their input sets; thus, sets are always represented by sequences. Sequences can be used to represent not only sets but also other one-dimensional "bulk" types such as lists, arrays, and time series, and many database query processing algorithms and techniques can be used to manipulate these other bulk types as well as sets. The important point is to think of these algorithms as algebra operators consuming zero or more inputs (sets or sequences) and producing one (or sometimes more) outputs. A complete query execution engine consists of a collection of operators and mechanisms to execute complex expressions using multiple operators, including multiple occurrences of the same operator. Taken as a whole, the query processing algorithms form an algebra which we call the physical algebra of a database system.

The physical algebra is equivalent to, but quite different from, the logical algebra of the data model or the database system. The logical algebra is more closely related to the data model and defines what queries can be expressed in the data model; for example, the relational algebra is a logical algebra. A physical algebra, on the other hand, is system specific. Different systems may implement the same data model and the same logical algebra but may use very different physical algebras. For example, while one relational system may use only nested-loops joins, another system may provide both nested-loops join and merge-join, while a third one may rely entirely on hash join algorithms. (Join algorithms are discussed in detail later in the section on binary matching operators and algorithms.)

Another significant difference between logical and physical algebras is the fact that specific algorithms and therefore cost functions are associated only with physical operators, not with logical algebra operators. Because of the lack of an algorithm specification, a logical algebra expression is not directly executable and must be mapped into a physical algebra expression. For example, it is impossible to determine the execution time for the left expression in Figure 3, i.e., a logical algebra expression, without mapping it first into a physical algebra expression such as the query evaluation plan on the right of Figure 3. This mapping process can be trivial in some database systems but usually is fairly complex in real database systems because it involves algorithm choices and because logical and physical operators frequently do not map directly into one another, as shown in the following four examples. First, some operators in the physical algebra may implement multiple logical operators. For example, all serious implementations of relational join algorithms include a facility to output fewer than all attributes, i.e., a relational delta-project (a projection without duplicate removal) is included in the physical join operator. Second, some physical operators implement only part of a logical operator. For example, a duplicate removal algorithm implements only the "second half" of a relational projection operator. Third, some physical operators do not exist in the logical algebra. Concretely, a sort operator has no place in pure relational algebra because it is an algebra of sets, and sets are, by their definition, unordered. Finally, some properties that hold for logical operators do not hold, or only with some qualifications, for the counterparts in physical algebra. For example, while intersection and union are entirely symmetric and commutative, algorithms implementing them (e.g., nested loops or hybrid hash join) do not treat their two inputs equally.

The difference of logical and physical algebras can also be looked at in a different way. Any database system raises the level of abstraction above files and records; to do so, there are some logical type constructors such as tuple, relation, set, list, array, pointer, etc. Each logical type constructor is complemented by some operations that are permitted on instances of such types, e.g., attribute extraction, selection, insertion, deletion, etc.


Figure 3. Logical and physical algebra expressions. (Left, a logical expression: Intersection of Set A and Set B. Right, the corresponding query evaluation plan: Merge-Join (Intersect) over two Sort operators, each fed by a File Scan of A and B, respectively.)
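To make the logical-to-physical mapping just described concrete, the following toy sketch assumes a hypothetical set of operator names and a deliberately simplistic selection rule; a real optimizer's algorithm choice is far more involved.

```c
/* Toy sketch of mapping logical operators to physical algorithms.
 * Operator names and the selection rule are illustrative assumptions. */
#include <stdio.h>

typedef enum { LOG_INTERSECTION, LOG_JOIN, LOG_SELECT } logical_op;
typedef enum { PHYS_MERGE_JOIN, PHYS_HASH_JOIN, PHYS_NESTED_LOOPS, PHYS_FILTER } physical_op;

/* One logical operator may map to several physical algorithms; a system
 * might choose, e.g., based on whether sorted inputs are already available. */
static physical_op choose_algorithm(logical_op op, int inputs_sorted) {
    switch (op) {
    case LOG_INTERSECTION:               /* intersection treated like a join */
    case LOG_JOIN:
        return inputs_sorted ? PHYS_MERGE_JOIN : PHYS_HASH_JOIN;
    case LOG_SELECT:
        return PHYS_FILTER;
    }
    return PHYS_NESTED_LOOPS;            /* fallback */
}

int main(void) {
    /* with sorted inputs, intersection maps to merge-join as in Figure 3 */
    printf("%d\n", choose_algorithm(LOG_INTERSECTION, 1));
    return 0;
}
```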

On the physical or representation level, there is typically a smaller set of representation types and structures, e.g., file, record, record identifier (RID), and maybe very large byte arrays [Carey et al. 1986]. For manipulation, the representation types have their own operations, which will be different from the operations on logical types. Multiple logical types and type constructors can be mapped to the same physical concept. There may also be situations in which one logical type constructor can be mapped to multiple physical concepts, e.g., a set depending on its size. The mapping from logical types to physical representation types and structures is called physical database design. Query optimization is the mapping from logical to physical operations, and the query execution engine is the implementation of operations on physical representation types and of mechanisms for coordination and cooperation among multiple such operations in complex queries. The policies for using these mechanisms are part of the query optimizer.

Synchronization and data transfer between operators is the main issue to be addressed in the architecture of the query execution engine. Imagine a query with two joins, and consider how the result of the first join is passed to the second one. The simplest method is to create (write) and read a temporary file. The need for temporary files, whether they are kept in the buffer or not, is a direct result of executing an operator's input subplans completely before starting the operator. Alternatively, it is possible to create one process for each operator and then to use interprocess communication mechanisms (e.g., pipes) to transfer data between operators, leaving it to the operating system to schedule and suspend operator processes as pipes are full or empty. While such data-driven execution removes the need for temporary disk files, it introduces another cost, that of operating system scheduling and interprocess communication. In order to avoid both temporary files and operating system scheduling, Freytag and Goodman [1989] proposed writing rule-based translation programs that transform a plan represented as a tree structure into a single iterative program with nested loops and other control structures. However, the required rule set is not simple, in particular for algorithms with complex control logic such as sorting, merge-join, or even hybrid hash join (to be discussed later in the section on matching).

The most practical alternative is to implement all operators in such a way that they schedule each other within a single operating system process. The basic idea is to define a granule, typically a single record, and to iterate over all granules comprising an intermediate query result.³ Each time an operator needs another granule, it calls its input (operator) to produce one. This call is a simple procedure call, much cheaper than interprocess communication since it does not involve the operating system.

³ It is possible to use multiple granule sizes within a single query-processing system and to provide special operators with the sole purpose of translating from one granule size to another. An example is a query processing system that uses records as an iteration granule except for the inputs of merge-join (see later in the section on binary matching), for which it uses "value packets," i.e., groups of records with equal join attribute values.

The calling operator waits (just as any calling routine waits) until the input operator has produced an item. That input operator, in a complex query plan, might require an item from its own input to produce an item; in that case, it calls its own input (operator) to produce one. Two important features of operators implemented in this way are that they can be combined into arbitrarily complex query evaluation plans and that any number of operators can execute and schedule each other in a single process without assistance from or interaction with the underlying operating system. This model of operator implementation and scheduling resembles very closely those used in relational systems, e.g., System R (and later SQL/DS and DB2), Ingres, Informix, and Oracle, as well as in experimental systems, e.g., the E programming language used in EXODUS [Richardson and Carey 1987], Genesis [Batory et al. 1988a; 1988b], and Starburst [Haas et al. 1989; 1990]. Operators implemented in this model are called iterators, streams, synchronous pipelines, row-sources, or similar names in the "lingo" of commercial systems.

To make the implementation of operators a little easier, it makes sense to separate the functions (a) to prepare an operator for producing data, (b) to produce an item, and (c) to perform final housekeeping. In a file scan, these functions are called open, next, and close procedures; we adopt these names for all operators. Table 1 gives a rough idea of what the open, next, and close procedures for some operators do, as well as the principal local state that needs to be saved from one invocation to the next. (Later sections will discuss sort and join operations in detail.) The first three examples are trivial, but the hash join operator shows how an operator can schedule its inputs in a nontrivial manner. The interesting observations are that (i) the entire query plan is executed within a single process, (ii) operators produce one item at a time on request, (iii) this model effectively implements, within a single process, (special-purpose) coroutines and demand-driven data flow, (iv) items never wait in a temporary file or buffer between operators because they are never produced before they are needed, (v) therefore this model is very efficient in its time-space-product memory costs, (vi) iterators can schedule any tree, including bushy trees (see below), and (vii) no operator is affected by the complexity of the whole plan, i.e., this model of operator implementation and synchronization works for simple as well as very complex query plans. As a final remark, there are effective ways to combine the iterator model with parallel query processing, as will be discussed in Section 9.

Since query plans are algebra expressions, they can be represented as trees. Query plans can be divided into prototypical shapes, and query execution engines can be divided into groups according to which shapes of plans they can evaluate. Figure 4 shows prototypical left-deep, right-deep, and bushy plans for a join of four inputs. Left-deep and right-deep plans are different because join algorithms use their two inputs in different ways; for example, in the nested-loops join algorithm, the outer loop iterates over one input (usually drawn as left input) while the inner loop iterates over the other input. The set of bushy plans is the most general as it includes the sets of both left-deep and right-deep plans. These names are taken from Graefe and DeWitt [1987]; left-deep plans are also called "linear processing trees" [Krishnamurthy et al. 1986] or "plans with no composite inner" [Ono and Lehman 1990].

For queries with common subexpressions, the query evaluation plan is not a tree but an acyclic directed graph (DAG). Most systems that identify and exploit common subexpressions execute the plan equivalent to a common subexpression separately, saving the intermediate result in a temporary file to be scanned repeatedly and destroyed after the last scan.


Table 1. Examples of Iterator Functions

Print
  Open: open input
  Next: call next on input; format the item on screen
  Close: close input
  Local state: (none)

Scan
  Open: open file
  Next: read next item
  Close: close file
  Local state: open file descriptor

Select
  Open: open input
  Next: call next on input until an item qualifies
  Close: close input
  Local state: (none)

Hash join (without overflow resolution)
  Open: allocate hash directory; open left "build" input; build hash table calling next on build input; close build input; open right "probe" input
  Next: call next on probe input until a match is found
  Close: close probe input; deallocate hash directory
  Local state: hash directory

Merge-Join (without duplicates)
  Open: open both inputs
  Next: get next item from input with smaller key until a match is found
  Close: close both inputs
  Local state: (none)

Sort
  Open: open input; build all initial run files calling next on input; close input; merge run files until only one merge step is left; open file descriptors for run files
  Next: determine next output item; read new item from the correct run file
  Close: destroy remaining run files
  Local state: merge heap, open file descriptors for run files
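The open-next-close protocol summarized in Table 1 can be made concrete with a small C sketch that wires a filter iterator on top of a scan iterator using function pointers. This is a minimal sketch in the spirit of the design described in this section, not actual Volcano code; the names, the integer items, and the in-memory "scan" are illustrative assumptions.

```c
/* A minimal sketch of demand-driven iterators (open/next/close);
 * all names are hypothetical, and real operators are far richer. */
#include <stdio.h>

typedef struct iterator {
    void (*open)(struct iterator *self);
    int  (*next)(struct iterator *self, int *item);  /* returns 0 at end of stream */
    void (*close)(struct iterator *self);
    void *state;                                     /* operator-specific state record */
} iterator;

/* --- scan operator: produces items from an in-memory array --- */
typedef struct { const int *data; int count; int pos; } scan_state;

static void scan_open(iterator *self)  { ((scan_state *)self->state)->pos = 0; }
static int  scan_next(iterator *self, int *item) {
    scan_state *s = self->state;
    if (s->pos >= s->count) return 0;
    *item = s->data[s->pos++];
    return 1;
}
static void scan_close(iterator *self) { (void)self; }

/* --- filter operator: calls next on its input until an item qualifies --- */
typedef struct { iterator *input; int (*predicate)(int); } filter_state;

static void filter_open(iterator *self) {
    filter_state *s = self->state;
    s->input->open(s->input);
}
static int filter_next(iterator *self, int *item) {
    filter_state *s = self->state;
    while (s->input->next(s->input, item))
        if (s->predicate(*item)) return 1;           /* pass qualifying item upward */
    return 0;
}
static void filter_close(iterator *self) {
    filter_state *s = self->state;
    s->input->close(s->input);
}

static int is_even(int x) { return x % 2 == 0; }

int main(void) {
    const int data[] = { 3, 8, 5, 12, 7, 4 };
    scan_state   ss = { data, 6, 0 };
    iterator scan   = { scan_open, scan_next, scan_close, &ss };
    filter_state fs = { &scan, is_even };
    iterator filter = { filter_open, filter_next, filter_close, &fs };

    int item;
    filter.open(&filter);                            /* the consumer paces the producers */
    while (filter.next(&filter, &item))
        printf("%d\n", item);
    filter.close(&filter);
    return 0;
}
```

Note how the scan produces items only when the filter demands them; no item is ever buffered between the two operators, which is exactly observation (iv) above.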

Figure 4. Left-deep, bushy, and right-deep plans. (Each plan joins the four inputs A, B, C, and D; the bushy plan combines intermediate results such as Join A-B and Join C-D in a final join.)

Each plan fragment that is executed as a unit is indeed a tree. The alternative is a "split" iterator that can deliver data to multiple consumers, i.e., that can be invoked as iterator by multiple consumer iterators. The split iterator paces its input subtree as fast as the fastest consumer requires it and holds items until the slowest consumer has consumed them. If the consumers request data at about the same rate, the split operator does not require a temporary spool file; such a file and its associated I/O cost are required only if the data rate required by the consumers diverges above some predefined threshold.
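The three plan shapes of Figure 4 can also be written down as toy trees; the node type and the hand-built trees below are illustrative assumptions, much simpler than real plan nodes.

```c
/* Toy plan trees for a join of four inputs A, B, C, D, illustrating the
 * left-deep, right-deep, and bushy shapes. Illustrative only. */
#include <stdio.h>

typedef struct plan {
    const char *op;                 /* "join" or the name of a stored input */
    struct plan *left, *right;      /* NULL for leaves (stored inputs)      */
} plan;

static plan A = {"A", 0, 0}, B = {"B", 0, 0}, C = {"C", 0, 0}, D = {"D", 0, 0};

/* left-deep: every join's right (inner) input is a stored input */
static plan ab_l  = {"join", &A, &B};
static plan abc_l = {"join", &ab_l, &C};
static plan left_deep = {"join", &abc_l, &D};

/* right-deep: every join's left input is a stored input */
static plan cd_r  = {"join", &C, &D};
static plan bcd_r = {"join", &B, &cd_r};
static plan right_deep = {"join", &A, &bcd_r};

/* bushy: both inputs of the top join are themselves intermediate results */
static plan ab_b  = {"join", &A, &B};
static plan cd_b  = {"join", &C, &D};
static plan bushy = {"join", &ab_b, &cd_b};

static void print_plan(const plan *p) {
    if (!p->left) { printf("%s", p->op); return; }
    printf("(");
    print_plan(p->left);
    printf(" %s ", p->op);
    print_plan(p->right);
    printf(")");
}

int main(void) {
    print_plan(&left_deep);  printf("\n");   /* (((A join B) join C) join D) */
    print_plan(&right_deep); printf("\n");   /* (A join (B join (C join D))) */
    print_plan(&bushy);      printf("\n");   /* ((A join B) join (C join D)) */
    return 0;
}
```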

A box stands for a a~ors independent of the sem%tics ‘and representation of items in the data streams they are processing. This organi- zation using function pointers for input ~Since each operator in such a query execution operators is fairly standard in commer- system will accessa permanent relation, the name cial database management systems. “accesspath selection” used for System R optimiza- In order to make this discussion more tion, although including and actually focusing on join optimization, was entirely correct and more concrete, Figure 5 shows two operators in descriptive than “query optimization.” a query evaluation plan that prints se-


Figure 5. Two operators in a Volcano query plan. (The filter operator is represented by a small structure pointing to open-filter, next-filter, and close-filter and to a state record holding its arguments (a print function), its input pointer, and its state; the input pointer leads to the file scan operator's structure with open-filescan, next-filescan, and close-filescan and a state record whose argument is a predicate and which has no input.)

The purpose and capabilities of the filter operator in Volcano include printing items of a stream using a print function passed to the filter operator as one of its arguments. The small structure at the top gives access to the filter operator's iterator functions (the open, next, and close procedures) as well as to its state record. Using a pointer to this structure, the open, next, and close procedures of the filter operator can be invoked, and their local state can be passed to them as a procedure argument. The filter's iterator functions themselves, e.g., open-filter, can use the input pointer contained in the state record to invoke the input operator's functions, e.g., open-file-scan. Thus, the filter functions can invoke the file scan functions as needed and can pace the file scan according to the needs of the filter.

In this section, we have discussed general physical algebra issues and synchronization and data transfer between operators. Iterators are relatively straightforward to implement and are suitable building blocks for efficient, extensible query processing engines. In the following sections, we consider individual operators and algorithms including a comparison of sorting and hashing, detailed treatment of parallelism, special operators for emerging database applications such as scientific databases, and auxiliary techniques such as precomputation and compression.

2. SORTING AND HASHING

Before discussing specific algorithms, two general approaches to managing sets of data are introduced. The purpose of many query-processing algorithms is to perform some kind of matching, i.e., bringing items that are "alike" together and performing some operation on them. There are two basic approaches used for this purpose, sorting and hashing. This pair permeates many aspects of query processing, from indexing and clustering over aggregation and join algorithms to methods for parallelizing database operations. Therefore, we discuss these approaches first in general terms, without regard to specific algorithms. After a survey of specific algorithms for unary (aggregation, duplicate removal) and binary (join, semi-join, intersection, division, etc.) matching problems in the following sections, the duality of sort- and hash-based algorithms is discussed in detail.

2.1 Sorting

Sorting is used very frequently in database systems, both for presentation to the user in sorted reports or listings and for query processing in sort-based algorithms such as merge-join. Therefore, the performance effects of the many algorithmic tricks and variants of external sorting deserve detailed discussion in this survey. All sorting algorithms actually used in database systems use merging, i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output.

ACM Computing Surveys,Vol. 25, No. 2, June 1993 84 “ Goetz Graefe ing, i.e., the input data are written into algorithms, and the last question regard- initial sorted runs and then merged into ing substitute sorts will be touched on in larger and larger runs until only~ne run the section on surrogate processing. is left, the sorted output. Only in the All sort algorithms try to exploit the unusual case that a data set is smaller duality between main-memory mergesort than the available memorv can in- and quicksort. Both of these algorithms memory techniques such as q-uicksort be are recursive divide-and-conquer algo- used. An excellent reference for many of rithms. The difference is that merge sort the issues discussed here is Knuth [ 19731.-, first divides physically and then merges who analyzes algorithms much more ac- on logical keys, whereas quicksort first curately than we do in this introductory divides on logical keys and then com- survey. bines physically by trivially concatenat- In order to ensure that the sort module ing sorted subarrays. In general, one of interfaces well with the other operators, the two phases—dividing and combining e.g., file scan or merge-join, sorting should —is based on logical keys, whereas the be im~lemented as an iterator. i.e.. with other arranges data items only physi- open, ‘next, and close procedures as all cally. We call these the logical and the other operators of the physical algebra. physical phases. Sorting algorithms for In the Volcano query-processing system verv large data sets stored on disk or (which is based on iterators), most of the tap: are-also based on dividing and com- sort work is done during open -sort bining. Usually, there are two distinct [Graefe 1990a; 1993b]. This procedure sub algorithms, one for sorting within consumes the entire input and leaves ap- main memory and one for managing sub- propriate data structures for next-sort to sets of the data set on disk or tape. The produce the final sorted output. If the choices for mapping logical and physical entire input fits into the sort space in phases to dividing and combining steps main memory, open-sort leaves a sorted are independent for these two subalgo- array of pointers to records in 1/0 buffer rithms. For practical reasons, e.g., ensur- memory which is used by next-sort to ing that a run fits into main memory, the ~roduce the records in sorted order. If disk management algorithm typically ~he input is larger than main memory, uses physical dividing and logical com- the open-sort procedure creates sorted bining (merging). A point of practical im- runs and merges them until onlv one portance is the fan-in or degree of final merge ph~se is left, The last merge merging, but this is a parameter rather step is performed in the next-sort proce- than a defining algorithm property. dure, i.e., when demanded by the con- There are two alternative methods for sumer of the sorted stream, e.g., a creating initial runs, also called “level-O merge-join. The input to the sort module runs” here. First, an in-memory sort al- must be an iterator, and sort uses open, gorithm can be used, typically quicksort. next, and close procedures to request its Using this method, each run will have input; therefore, sort input can come from the size of allocated memory, and the a scan or a complex query plan, and the number of initial runs W will be W = sort operator can be inserted into a query [R\M] for input size R and memory size plan at any place or at several places. M. 
(Table 3 summarizes variables and Table 2, which summarizes a taxon- their meaning in cost calculations in this omy of parallel sort algorithms [ Graefe survey.) Second, runs can be produced 1990a], indicates some main characteris- using replacement selection [Knuth tics of database sort algorithms. The first 1973]. Replacement selection starts by few items apply to any database sort and filling memory with items which are or- will be discussed in this section. The ganized into a priority heap, i.e., a data questions pertaining to parallel inputs structure that efficiently supports the op- and outputs and to data exchange will be erations insert and remove-smallest. considered in a later section on parallel Next, the item with the smallest key is


Table 2. A Taxonomy of Database Sorting Algorithms

  Input division: logical keys (partitioning) or physical division
  Result combination: logical keys (merging) or physical concatenation
  Main-memory sort: quicksort or replacement selection
  Merging: eager or lazy or semi-eager; lazy and semi-eager with or without optimizations
  Read-ahead: no read-ahead or double-buffering or read-ahead with forecasting
  Input: single-stream or parallel
  Output: single-stream or parallel
  Number of data exchanges: one or multiple
  Data exchange: before or after local sort
  Sort objects: original records or key-RID pairs (substitute sort)

Next, the item with the smallest key is removed from the priority heap and written to a run file and then immediately replaced in the priority heap with another item from the input. With high probability, this new item has a key larger than the item just written and therefore will be included in the same run file. Notice that if this is the case, the first run file will be larger than memory. Now the second item (the currently smallest item in the priority heap) is written to the run file and is also replaced immediately in memory by another item from the input. This process repeats, always keeping the memory and the priority heap entirely filled. If a new item has a key smaller than the last key written, the new item cannot be included in the current run file and is marked for the next run file. In comparisons among items in the heap, items marked for the current run file are always considered "smaller" than items marked for the next run file. Eventually, all items in memory are marked for the next run file, at which point the current run file is closed, and a new one is created.

Using replacement selection, run files are typically larger than memory. If the input is already sorted or almost sorted, there will be only one run file. This situation could arise, for example, if a file is sorted on field A but should be sorted on A as the major and B as the minor sort key. If the input is sorted in reverse order, which is the worst case, each run file will be exactly as large as memory. If the input is random, the average run file will be twice the size of memory, except the first few runs (which get the process started) and the last run. On the average, the expected number of runs is about W = ⌈R/(2 × M)⌉ + 1, i.e., about half as many runs as created with quicksort. A more detailed discussion and an analysis of replacement selection were provided by Knuth [1973].

An additional difference between quicksort and replacement selection is the resulting I/O pattern during run creation. Quicksort results in bursts of reads and writes for entire memory loads from the input file and to initial run files, while replacement selection alternates between individual read and write operations. If only a single device is used, quicksort may result in faster I/O because fewer disk arm movements are required. However, if different devices are used for input and temporary files, or if the input comes as a stream from another operator, the alternating behavior of replacement selection may permit more overlap of I/O and processing and therefore result in faster sorting.

The problem with replacement selection is memory management. If input items are kept in their original pages in the buffer (in order to save copying data, a real concern for large data volumes), each page must be kept in the buffer until its last record has been written to a run file. On the average, half a page's records will be in the priority heap. Thus, the priority heap must be reduced to half the size (the number of items in the heap is one half the number of records that fit into memory), canceling the advantage of longer and fewer run files. The solution to this problem is to copy records into a holding space and to keep them there while they are in the priority heap and until they are written to a run file. If the input items are of varying sizes, memory management is more complex than for quicksort because a new item may not fit into the space vacated in the holding space by the last item written into a run file. Solutions to this problem will introduce memory management overhead and some amount of fragmentation, i.e., the size of runs will be less than twice the size of memory. Thus, the advantage of having fewer runs must be balanced with the different I/O pattern and the disadvantage of more complex memory management.
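The core of replacement selection as just described can be sketched in a few dozen lines. The sketch below sorts integer keys with a tiny in-memory heap and "writes" runs to standard output; buffering, variable-length records, and the memory management issues discussed above are deliberately ignored, and all names are illustrative.

```c
/* Replacement selection run generation, reduced to its core idea:
 * keep memory (a priority heap) full; an item smaller than the last key
 * written is tagged for the next run. Integer keys, heap of MEMORY items,
 * runs written to stdout. Illustrative sketch only. */
#include <stdio.h>

#define MEMORY 4

typedef struct { int run; int key; } entry;     /* ordered by (run, key) */

static entry heap[MEMORY];
static int heap_size = 0;

static int less(entry a, entry b) {
    return a.run != b.run ? a.run < b.run : a.key < b.key;
}
static void swap(int i, int j) { entry t = heap[i]; heap[i] = heap[j]; heap[j] = t; }

static void sift_down(int i) {
    for (;;) {
        int l = 2*i + 1, r = 2*i + 2, m = i;
        if (l < heap_size && less(heap[l], heap[m])) m = l;
        if (r < heap_size && less(heap[r], heap[m])) m = r;
        if (m == i) return;
        swap(i, m); i = m;
    }
}
static void sift_up(int i) {
    while (i > 0 && less(heap[i], heap[(i-1)/2])) { swap(i, (i-1)/2); i = (i-1)/2; }
}
static void push(entry e) { heap[heap_size] = e; sift_up(heap_size++); }
static entry pop(void)    { entry e = heap[0]; heap[0] = heap[--heap_size]; sift_down(0); return e; }

int main(void) {
    int input[] = { 17, 3, 29, 8, 22, 1, 35, 12, 6, 40, 2, 19 };
    int n = sizeof input / sizeof input[0], next = 0;
    int current_run = 0, last_key_written = 0;

    while (next < n && heap_size < MEMORY)          /* fill memory */
        push((entry){ 0, input[next++] });

    while (heap_size > 0) {
        entry e = pop();                            /* smallest eligible item */
        if (e.run != current_run) {                 /* current run exhausted */
            current_run = e.run;
            printf("--- new run ---\n");
        }
        printf("run %d: %d\n", current_run, e.key);
        last_key_written = e.key;
        if (next < n) {                             /* refill memory immediately */
            int k = input[next++];
            /* a key smaller than the last one written must go to the next run */
            push((entry){ k < last_key_written ? current_run + 1 : current_run, k });
        }
    }
    return 0;
}
```

Running this on the sample input produces a first run of seven keys and a second run of five, illustrating why replacement selection yields runs that are, on average, longer than memory.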


Table 3. Variables, Their Meaning, and Units

  M: memory size (pages)
  R, S: inputs or their sizes (pages)
  C: cluster or unit of I/O (pages)
  F, K: fan-in or fan-out (none)
  W: number of level-0 run files (none)
  L: number of merge levels (none)

The level-0 runs are merged into level-1 runs, which are merged into level-2 runs, etc., to produce the sorted output. During merging, a certain amount of buffer memory must be dedicated to each input run and the merge output. We call the unit of I/O a cluster in this survey, which is a number of pages located contiguously on disk. We indicate the cluster size with C, which we measure in pages just like memory and input sizes. The number of I/O clusters that fit in memory is the quotient of memory size and cluster size. The maximal merge fan-in F, i.e., the number of runs that can be merged at one time, is this quotient minus one cluster for the output. Thus, F = ⌊M/C⌋ − 1. Since the sizes of runs grow by a factor F from level to level, the number of merge levels L, i.e., the number of times each item is written to a run file, is logarithmic with the input size, namely L = ⌈log_F(W)⌉.

There are four considerations that can improve the merge efficiency. The first two issues pertain to scheduling of I/O operations. First, scans are faster if read-ahead and write-behind are used; therefore, double-buffering using two pages of memory per input run and two for the merge output might speed the merge process [Salzberg 1990; Salzberg et al. 1990]. The obvious disadvantage is that the fan-in is cut in half. However, instead of reserving 2 × F + 2 clusters, a predictive method called forecasting can be employed in which the largest key in each input buffer is used to determine from which input run the next cluster will be read. Thus, the fan-in can be set to any number in the range ⌊M/(2 × C)⌋ − 2 ≤ F ≤ ⌊M/C⌋ − 1. One or two read-ahead buffers per input disk are sufficient, and F = ⌊M/C⌋ − 3 will be reasonable in most cases because it uses maximal fan-in with one forecasting input buffer and double-buffering for the merge output.

Second, if the operating system and the I/O hardware support them, using large cluster sizes for the run files is very beneficial. Larger cluster sizes will reduce the fan-in and therefore may increase the number of merge levels.⁵ However, each merging level is performed much faster because fewer I/O operations and disk seeks and latency delays are required. Furthermore, if the unit of I/O is equal to a disk track, rotational latencies can be avoided entirely with a sufficiently smart disk controller.

⁵ In files storing permanent data, large clusters (units of I/O) containing many records may also create artificial buffer contention (if much more disk space is copied into the buffer than truly necessary for one record) and "false sharing" in environments with page (cluster) locks, i.e., artificial concurrency conflicts. Since run files in a sort operation are not shared but temporary, these problems do not exist in this context.

Usually, relatively small fan-ins with large cluster sizes are the optimal choice, even if the sort requires multiple merge levels [Graefe 1990a]. The precise tradeoff depends on disk seek, latency, and transfer times. It is interesting to note that the optimal cluster size and fan-in basically do not depend on the input size.

As a concrete example, consider sorting a file of R = 50 MB = 51,200 KB using M = 160 KB of memory. The number of runs created by quicksort will be W = ⌈51200/160⌉ = 320. Depending on the disk access and transfer times (e.g., 25 ms disk seek and latency, 2 ms transfer time for a page of 4 KB), C = 16 KB will typically be a good cluster size for fast merging. If one cluster is used for read-ahead and two for the merge output, the fan-in will be F = ⌊160/16⌋ − 3 = 7. The number of merge levels will be L = ⌈log₇(320)⌉ = 3. If a 16 KB I/O operation takes T = 33 ms, the total I/O time, including a factor of two for writing and reading at each merge level, for the entire sort will be 2 × L × ⌈R/C⌉ × T = 10.56 min.

An entirely different approach to determining optimal cluster sizes and the amount of memory allocated to forecasting and read-ahead is based on processing and I/O bandwidths and latencies. The cluster sizes should be set such that the I/O bandwidth matches the processing bandwidth of the CPU. Bandwidths for both I/O and CPU are measured here in records or bytes per unit time; instructions per unit time (MIPS) are irrelevant. It is interesting to note that the CPU's processing bandwidth is largely determined by how fast the CPU can assemble new pages, in other words, how fast the CPU can copy records within memory. This performance measure is usually ignored in modern CPU and cache designs geared towards high MIPS or MFLOPS numbers [Ousterhout 1990].

Tuning the sort based on bandwidth and latency proceeds in three steps. First, the cluster size is set such that the processing and I/O bandwidths are equal or very close to equal. If the sort is I/O bound, the cluster size is increased for less disk access overhead per record and therefore faster I/O; if the sort is CPU bound, the cluster size is decreased to slow the I/O in favor of a larger merge fan-in. Next, in order to ensure that the two processing components (I/O and CPU) never (or almost never) have to wait for one another, the amount of space dedicated to read-ahead is determined as the I/O time for one cluster multiplied by the processing bandwidth. Typically, this will result in one cluster of read-ahead space per disk used to store and read input runs into a merge. Of course, in order to make read-ahead effective, forecasting must be used. Finally, the same amount of buffer space is allocated for the merge output (access latency times bandwidth) to ensure that merge processing never has to wait for the completion of output I/O. It is an open issue whether these two alternative approaches to tuning cluster size and read-ahead space result in different allocations and sorting speeds or whether one of them is more effective than the other.

The third and fourth merging issues focus on using (and exploiting) the maximal fan-in as effectively and often as possible. Both issues require adjusting the fan-in of the first merge step using the formula given below, either for the first merge step of all merge steps or, in semi-eager merging [Graefe 1990a], for the first merge step after the end of the input has been reached. This adjustment is used for only one merge step, called the initial merge here, not for an entire merge level.

The third issue to be considered is that the number of runs W is typically not a power of F; therefore, some merges proceed with fewer than F inputs, which creates the opportunity for some optimization. Instead of always merging runs of only one level together, the optimal strategy is to merge as many runs as possible using the smallest run files available. The only exception is the fan-in of the first merge, which is determined to ensure that all subsequent merges will use the full fan-in F.


Figure 6. Naive and optimized merging.

Let us explain this idea with the example shown in Figure 6. Consider a sort with a maximal fan-in F = 10 and an input file that requires W = 12 initial runs. Instead of merging only runs of the same level as shown in Figure 6, merging is delayed until the end of the input has been reached. In the first merge step, only 3 of the 12 runs are combined, and the result is then merged with the other 9 runs, as shown in Figure 6. The I/O cost (measured by the number of memory loads that must be written to any of the runs created) for the first strategy is 12 + 10 + 2 = 24, while for the second strategy it is 12 + 3 = 15. In other words, the first strategy requires 60% more I/O to temporary files than the second one. The general rule is to merge just the right number of runs after the end of the input file has been reached, and to always merge the smallest runs available for merging. More detailed examples are given in Graefe [1990a]. One consequence of this optimization is that the merge depth L, i.e., the number of run files a record is written to during the sort or the number of times a record is written to and read from disk, is not uniform for all records. Therefore, it makes sense to calculate an average merge depth (as required in cost estimation during query optimization), which may be a fraction. Of course, there are much more sophisticated merge optimizations, e.g., cascade and polyphase merges [Knuth 1973].

Fourth, since some operations require multiple sorted inputs, for example merge-join (to be discussed in the section on matching), and since sort output can be passed directly from the final merge into the next operation (as is natural when using iterators), memory must be divided among multiple final merges. Thus, the final fan-in f and the "normal" fan-in F should be specified separately in an actual sort implementation. Using a final fan-in of 1 also allows the sort operator to produce output into a very slow operator, e.g., a display operation that allows scrolling by a human user, without occupying a lot of buffer memory for merging input runs over an extended period of time.⁶

Considering the last two optimization options for merging, the following formula determines the fan-in of the first merge. Each merge with normal fan-in F will reduce the number of run files by F − 1 (removing F runs, creating one new one). The goal is to reduce the number of runs from W to f and then to 1 (the final output). Thus, the first merge should reduce the number of runs to f + k(F − 1) for some integer k. In other words, the first merge should use a fan-in of F₀ = ((W − f − 1) mod (F − 1)) + 2. In the example of Figure 6, (12 − 10 − 1) mod (10 − 1) + 2 results in a fan-in for the initial merge of F₀ = 3. If the sort of Figure 6 were the input into a merge-join and if a final fan-in of 5 were desired, the initial merge should proceed with a fan-in of F₀ = (12 − 5 − 1) mod (10 − 1) + 2 = 8.

⁶ There is a similar case of resource sharing among the operators producing a sort's input and the run generation phase of the sort. We will come back to these issues later in the section on executing and scheduling complex queries and plans.
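The initial-merge fan-in formula is easy to check mechanically; the sketch below evaluates F₀ for the two scenarios of Figure 6 (the function name and structure are illustrative assumptions).

```c
/* Initial-merge fan-in as derived in the text: F0 = ((W - f - 1) mod (F - 1)) + 2.
 * The two calls reproduce the Figure 6 examples. Illustrative only. */
#include <stdio.h>

static int initial_fan_in(int W, int F, int f) {
    return (W - f - 1) % (F - 1) + 2;
}

int main(void) {
    /* 12 initial runs, maximal fan-in 10, final merge with full fan-in 10 */
    printf("F0 = %d\n", initial_fan_in(12, 10, 10));  /* prints F0 = 3 */
    /* same sort feeding a merge-join that leaves room for a final fan-in of 5 */
    printf("F0 = %d\n", initial_fan_in(12, 10, 5));   /* prints F0 = 8 */
    return 0;
}
```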

ACM Computing Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 89 size of the two inputs. For example, if 50-page (200 KB) 1/0 buffer, varying the two merge-join inputs are 1 MB and 9 cluster size from 1 page (4 KB) to 15 MB, and if 20 clusters are available for pages (60 KB). The initial run size was inputs into the two final merges, then 2 1,600 records, for a total of 63 initial clusters should be allocated for the first runs. We counted the number of 1/0 and 18 clusters for the second input (1/9 operations and the transferred pages for = 2/18). all run files, and calculated the total 1/0 Sorting is sometimes criticized because cost by charging 25 ms per 1/0 operation it requires, unlike hybrid hashing (dis- (for seek and rotational latency) and 2 cussed in the next section), that the en- ms for each transferred page (assuming 2 tire input be written to run files and then MB/see transfer rate). As can be seen in retrieved for merging. This difference has Table 4 and Figure 7, there is an optimal a particularly large effect for files only cluster size with minimal 1/0 cost. The slightly larger than memory, e.g., 1.25 curve is not as smooth as might have times the size of memory. Hybrid hash- been expected from the approximate cost ing determines dynamically how much of function because the curve reflects all the input data truly must be written to real-system effects such as rounding temporary disk files. In the example, only (truncating) the fan-in if the cluster size slightly more than one quarter of the is not an exact divisor of the memory memory size must be written to tempo- size, the effectiveness of merge optimiza- rary files on disk while the remainder of tion varying for different fan-ins, and the file remains in memory. In sorting, internal fragmentation in clusters. The the entire file (1.25 memory sizes) is detailed data in Table 4, however, reflect written to one or two run files and then the trends that larger clusters and read for merging. Thus, sorting seems to smaller fan-ins clearly increase the require five times more 1/0 for tempo- amount of data transferred but not the rary files in this example than hybrid number of 1/0 operations (disk and la- hashing. However, this is not necessarily tency time) until the fan-in has shrunk true. The simple trick is to write initial to very small values, e.g., 3. It is clearly runs in decreasing (reverse) order. When suboptimal to always choose the smallest the input is exhausted, and merging in cluster size (1 page) in order to obtain increasing order commences, buffer the largest fan-in and fewest merge lev- memory is still full of useful pages with els. Furthermore, it seems that the range small sort keys that can be merged im- of cluster sizes that result in near- mediately without 1/0 and that never optimal total 1/0 costs is fairly large; have to be written to disk. The effect of thus it is not as important to determine writing runs in reverse order is compara- the exact value as it is to use a cluster ble to that of hybrid hashing, i.e., it size “in the right ball park.” The optimal is particularly effective if the input is fan-in is typically fairly small; however, only slightly larger than the available it is not e or 3 as derived by Bratberg- memory. sengen [1984] under the (unrealistic) To demonstrate the effect of cluster assumption that the cost of an 1/0 oper- size optimization (the second of the four ation is independent of the amount of merging issues discussed above), we data being transferred. 
sorted 100,000 100-byte records, about 10 MB, with the Volcano query process- 2.2 Hashing ing system, which includes all merge op- timization described above with the ex- For many matching tasks, hashing is an ception of read-ahead and forecasting. alternative to sorting. In general, when (This experiment and a similar one were equality matching is required, hashing described in more detail earlier [Graefe should be considered because the ex- 1990a; Graefe et al. 1993].) We used a pected complexity of set algorithms based sort space of 40 pages (160 KB) within a on hashing is O(N) rather than

ACM Computing Surveys, Vol. 25, No. 2, June 1993 90 “ Goetz Graefe

Table 4. Effect of Cluster Size Optlmlzations Cluster Pages Total 1/0 Size Average Disk Transferred cost [ X 4 KB] Fan-in Depth Operations [X4KB] [see]

1 40 1.376 6874 6874 185.598 2 20 1.728 4298 8596 124.642 3 13 1.872 3176 9528 98.456 4 10 1.936 2406 9624 79.398 5 8 2.000 1984 9920 69.440 6 6 2.520 2132 12792 78.884 7 5 2.760 1980 13860 77.220 8 5 2.760 1718 13744 70.438 9 4 3.000 1732 15588 74.476 10 4 3.000 1490 14900 67.050 11 3 3.856 1798 19778 84.506 12 3 3.856 1686 20232 82.614 13 3 3.856 1628 21164 83.028 14 2 5.984 2182 30548 115.646 15 2 5.984 2070 31050 113.850

150 Total 1/0 Cost 100 [SW] 50

0 i I I I I I I o 2.5 5 7.5 10 12.5 15

Cluster Size [x 4 KB]

Figure 7. Effect of cluster size optimizations

0( IV log N) as for sorting. Of course. this must fit into memory. However, if the makes ‘intuitive sense- if hashing is required hash table is larger than mem- viewed as radix sorting on a virtual key ory, hash table ouerfZow occurs and must [Knuth 1973]. be dealt with. Hash-based query processing algo- There are basically two methods for rithms use an in-memory hash table of managing hash table overflow, namely database objects to perform their match- avoidance and resolution. In either case, ing task. If the entire hash table (includ- the input is divided into multiple parti- ing all records or items) fits into memory, tion files such that partitions can be pro- hash-based query processing algorithms cessed independently from one another, are very easy to design, understand, and and the concatenation of the results of all implement, and they outperform sort- partitions is the result of the entire oper- based alternatives. Note that for binary ation. Partitioning should ensure that the matching operations, such as join or in- partitioning files are of roughly even size tersection, only one of the two inputs and can be done using either hash parti -

ACM Computing Surveys,Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 91 tioning or range partitioning, i.e., based on keys estimated to be quantiles. Usu- ally, partition files can be processed us- ing the original hash-based algorithm. The maximal partitioning fan-out F, i.e., number of partition files created, is de- ‘.+-.. termined by the memory size ill divided ‘- over the cluster size C minus one cluster ‘.< Partition for the partitioning input, i.e., F = [M/C ‘= Files – 11, just like the fan-in for sorting. ‘.< On Disk In hash table overflow avoidance, the ‘* input set is partitioned into F partition ‘.. files before any in-memory hash table is Iash ‘* / built. If it turns out that fewer partitions Directory1~ than have been created would have been sufficient to obtain partition files that Figure 8. Hybrid hashing. will fit into memory, bucket tuning (col- lapsing multiple small buckets into larger ones) and dynamic destaging (determin- rithms. As many hash buckets as possi- ing which buckets should stay in mem- ble are kept in memory, e.g., as linked ory) can improve the performance of lists as indicated by solid arrows. The hash-based operations [Kitsuregawa et other hash buckets are spooled to tempo- al. 1989a; Nakayama et al. 1988]. rary disk files, called the overflow or par- Algorithms based on hash table over- tition files, and are processed in later flow resolution start with the assumption stages of the algorithm. Hybrid hashing that overflow will not occur, but resort to is useful if the input size R is larger than basically the same set of mechanisms as the memory size AL?but smaller than the hash table overflow avoidance once it memory size multiplied by the fan-out F, does occur. No real system uses this naive i.e., M< R< FXM. hash table overflow resolution because In order to predict the number of 1/0 so-called hybrid hashing is as efficient operations (which actually is not neces- but more flexible. Hybrid hashing com- sary for execution because the algorithm bines in-memory hashing and overflow adapts to its input size but may be desir- resolution [DeWitt et al. 1984; Shapiro able for cost estimation during query 1986]. Although invented for relational optimization), the number of required join and known as hybrid hash join, hy- partition files on disk must be deter- brid hashing is equally applicable to all mined. Call this number K, which must hash-based query processing algorithms. satisfy O s ~ s F. Presuming that the Hybrid hash algorithms start out with assignment of buckets to partitions is the (optimistic) premise that no overflow optimal and that each partition file is will occur; if it does, however, they parti- equal to the memory size ikf, the amount tion the input into multiple partitions of of data that may be written to ~ parti- which only one is written immediately to tion files is equal to K x M. The number temporary files on disk. The other F – 1 of required 1/0 buffers is 1 for the input partitions remain in memory. If another and K for the output partitions, leaving overflow occurs, another partition is M – (K+ 1) x C memory for the hash written to disk. If necessary, all F parti- table. The optimal K for a given input tions are written to disk. Thus, hybrid size R is the minimal K for which K X hash algorithms use all available mem- M + (M – (K + 1) x C) > R. 
Solving ory for in-memory processing, but at the this inequality and taking the smallest same time are able to process large input such K results in K = [(~ – M i- files by overflow resolution, Figure 8 C)/(M – C)]. The minimal possible 1/0 shows the idea of hybrid hash algo- cost, including a factor of 2 for writing

ACM Computing Surveys,Vol. 25, No 2, June 1993 92 “ Goetz Graefe and reading the partition files and mea- ets in the hash directory for a fan-out of sured in the amount of data that must be 10 [Schneider 19901. In other words. the written or read, is 2 x (~ – (lvl – (~ + fan~out is set a pri~ri by the query opti- 1) X C)). To determine the 1/0 time, this mizer based on the expected (estimated) amount must be divided by the cluster input size. Since the page size in Gamma size and multiplied with the 1/0 time for is relatively small, only a fraction of one cluster. memorv is needed for outrmt buffers. and For example, consider an input of R = an in-memory hash tab~e can be ‘used 240 pages, a memory of M = 80 pages, even while output partitions are being and a cluster size of C = 8 pages. The written to disk. Second. in bucket tuning maximal fan-out is F = [80/8 – 1] = 9. and dynamic destaging [Kitsuregawa e; The number of partition files that need al. 1989a; Nakayama 1988], a large num- to be created on disk is K = [(240 – 80 ber of small partition files is created and + 8)/(80 – 8)] = 3. In other words, in the then collamed into fewer partition files best case, K X C = 3 X 8 = 24 pages will no larger ~han memory. I; order to ob- be used as output buffers to write K = 3 tain a large number of partition files and, partition files of no more than M = 80 at the same time. retain some memorv pages, and M–(K+l)x C=80–4 for a hash table, ‘the cluster size is sent X 8 = 48 pages of memory will be used quite small, e.g., C = 1 page, and the as the hash table. The total amount of fan-out is very large though not maxi- data written to and read from disk is mal, e.g., F = M/C/2. In the example 2 X (240 – (80 – 4 X 8)) = 384 pages. If above, F = 40 output partitions with an writing or reading a cluster of C = 8 average size of R/F = 6 pages could be pages takes 40 msec, the total 1/0 time created, even though only K = 3 output is 384/8 X 40 = 1.92 sec. partitions are required. The smallest In the calculation of K, we assumed an partitions are assigned to fill an in-mem- optimal assignment of hash buckets to ory hash table of size M – K X C = 80 – partition files. If buckets were assigned 3 x 1 = 77 pages. Hopefully, the dy- in the most straightforward way, e.g., by namic destaging rule—when an overflow dividing the hash directory into F occurs, assign the largest partition still equal-size regions and assigning the in memorv . to disk—ensures that indeed buckets of one region to a partition as the smallest partitions are retained in indicated in Figure 8, all partitions were memory. The partitions assigned to disk of nearly the same size, and either all or are collapsed into K = 3 partitions of no none of them will fit into their output more than M = 80 pages, to be processed cluster and therefore into memory. In in K = 3 subsequent phases. In binary other words, once hash table overflow operations such as intersection and rela- occurred, all input was written to parti- tional join, bucket tuning is quite effec- tion files. Thus, we presumed in the ear- tive for skew in the first input, i.e., if the lier calculations that hash buckets were hash value distribution is nonuniform assigned more intelligently to output and if the partition files are of uneven partitions. sizes. It avoids spooling parts of the sec- There are three ways to assign hash ond (typically larger) input to temporary buckets to partitions. First, each time a ~artition. files because the ~artitions. 
in hash table overflow occurs, a fixed num- memory can be matched immediately us- ber of hash buckets is assigned to a new ing a hash table in the memory not re- output partition. In the Gamma database auired as outtmt buffer and because a . L machine, the number of disk partitions is number of small partitions have been col- chosen “such that each bucket [bucket lapsed into fewer, larger partitions, in- here means what is called an output par- creasing the memory available for the tition in this survey] can reasonably be hash table. For skew in the second inmt. expected to fit in memory” [DeWitt and bucket tuning and dynamic desta~ng Gerber 1985], e.g., 10% of the hash buck- have no advantage. Another disadvan-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques 8 93 tage of bucket tuning and dynamic destaging is that the cluster size has to be relatively small, thus requiring a large number of 1/0 operations with disk seeks and rotational latencies to write data to the overflow files. Third, statistics gath- ered before hybrid hashing commences can be used to assign hash buckets to partitions [Graefe 1993a]. Unfortunately, it is possible that one or more partition files are larger than ~ x= memory. In that case, partitioning is used recursively until the file sizes have Figure 9. Recursive partitioning. shrunk to memory size. Figure 9 shows how a hash-based algorithm for a unary operation, such as aggregation or dupli- A major problem with hash-based algo- cate removal, partitions its input file over rithms is that their performance depends multiple recursion levels. The recursion on the quality of the hash function. In terminates when the files fit into mem- many situations, fairly simple hash func- ory. In the deepest recursion level, hy- tions will perform reasonably well. brid hashing may be employed. Remember that the purpose of using If the partitioning (hash) function is hash-based algorithms usually is to find good and creates a uniform hash value database items with a specific key or to distribution, the file size in each recur- bring like items together; thus, methods sion level shrinks by a factor equal to the as simple as using the value of a join key fan-out, and therefore the number of re- as a hash value will frequently perform cursion levels L is logarithmic with the satisfactorily. For string values, good size of the input being partitioned. After hash values can be determined by using L partitioning levels, each partition file binary “exclusive or” operations or by de- is of size R’ = R/FL. In order to obtain termining cyclic redundancy check (CRC) partition files suitable for hybrid hashing values as used for reliable data storage (with M < R’ < F X itf), the number of and transmission. If the quality of the full recursion levels L, i.e., levels at which hash function is a potential problem, uni- hybrid hashing is not applied, is L = versal hash functions should be consid- [log~(R/M)]. The 1/0 cost of the re- ered [Carter and Wegman 1979]. maining step using hybrid hashing can If the partitioning is skewed, the re- be estimated using the hybrid hash for- cursion depth may be unexpectedly high, mula above with R replaced by R’ and making the algorithm rather slow. This multiplying the cost with FL because is analogous to the worst-case perfor- hybrid hashing is used for this number of mance of quicksort, 0(iV2 ) comparisons partition files. Thus, the total 1/0 cost for an array of N items, if the partition- for partitioning an input and using hy- ing pivots are chosen extremely poorly brid hashing in the deepest recursion and do not divide arrays into nearly equal level is sub arrays. Skew is the major danger for inferior 2X RX L+2XFL performance of hash-based query- X( R’–(M– KX C)) processing algorithms, There are several ways to deal with skew. For hash-based =2x( Rx(L+l)– F” algorithms using overflow avoidance, X( M–K XC)) bucket tuning and dynamic destaging are quite effective. Another method is to ob- =2 X( RX(L+l)– FLX(M tain statistical information about hash -[(R’ –M)/(M - C)l X C)). values and to use it to carefully assign

ACM Computing Surveys, Vol. 25, No. 2, June 1993 94 “ Goetz Graefe hash buckets to partitions. Such statisti- into database systems supporting and ex- cal information can be ke~t in the form of ploiting a deep storage hierarchy is still histomams and can ei;her come from in its infancy. perm~nent system catalogs (metadata), On the other hand, some researchers from sampling the input, or from previ- have considered in-memory or main- ous recursion levels. For exam~le. for an memory databases, motivated both by the intermediate query processing’ result for desire for faster transaction and query- which no statistical parameters are processing performance and by the de- known a priori, the first partitioning level creasing cost of semi-conductor memory might have to proceed naively pretending [Analyti and Pramanik 1992; Bitton et that the partitioning hash function is al. 1987; Bucheral et al. 1990; DeWitt et perfect, but the second and further recur- al. 1984; Gruenwald and Eich 1991; Ku- sion levels should be able to use statistics mar and Burger 1991; Lehman and Carey gathered in earlier levels to ensure that 1986; Li and Naughton 1988; Severance each partitioning step creates even parti- et al. 1990; Whang and Krishnamurthy tions, . i.e., . that the data is ~artitioned. 1990]. However, for most applications, an with maximal effectiveness [Graefe analysis by Gray and Putzolo [ 1987] 1993a]. As a final resort, if skew cannot demonstrated that main memory is cost be managed otherwise, or if distribution effective only for the most frequently ac- skew is not the problem but duplicates cessed data, The time interval between are, some systems resort to algorithms accesses with equal disk and memory that are not affected by data or hash costs was five minutes for their values of value skew. For example, Tandem’s hash memory and disk prices and was ex- join algorithm resorts to nested-loops join pected to grow as main-memory prices (to be discussed later) [Zeller and Gray decrease faster than disk prices. For the 19901. purposes of this survey, we will presume As- for sorting, larger cluster sizes re- a disk-based storage architecture and will sult in faster 1/0 at the expense of consider disk 1/0 one of the major costs smaller fan-outs, with the optimal fan-out of query evaluation over large databases. being fairly small [Graefe 1993a; Graefe et al. 1993]. Thus, multiple recursion lev- 3.1 File Scans els are not uncommon for large files, and statistics gathered on one level to limit The first operator to access base data is skew effects on the next level are a real- the file scan, typically combined with istic method for large files to control a built-in selection facility. There is the performance penalties of uneven not much to be said about file scan ex- partitioning. cept that it can be made very fast using read-ahead, particularly, large-chunk (“track-at-a-crack”) read-ahead. In some 3. DISK ACCESS database systems, e.g., IBMs DB2, the All query evaluation systems have to read-ahead size is coordinated with the access base data stored in the data- free space left for future insertions dur- base. For databases in the megabyte to ing database reorganization. If a free terabyte range, base data are typically page is left after every 15 full data pages, stored on secondary storage in the form the read-ahead unit of 16 pages (64 KB) of rotating random-access disks. How- ensures that overflow records are imme- ever, deeper storage hierarchies includ- diately available in the buffer. 
ing optical storage, (maybe robot-oper- Efficient read-ahead requires contigu- ated) tape archives, and remote storage ous file allocation, which is supported by servers will also have to be considered in many operating systems. Such contigu- future high-functionality high-volume ous disk regions are frequently called ex- database management systems, e.g., as tents. The UNIX operating system does outlined by Stonebraker [1991]. Research not provide contiguous files, and many

ACM Computmg Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 95

database systems running on UNIX use cation of storage structures, though not a “raw” devices instead, even though this new idea [Omiecinski 1985], is likely to means that the database management become an important research topic system must provide operating-system within database research over the next functionality such as file structures, disk few years as databases become larger space allocation, and buffering. and larger and are spread over many The disadvantages of large units of 1/0 disks and many nodes in parallel and are buffer fragmentation and the waste distributed systems. of 1/0 and bus bandwidth if only individ- While most current database system ual records are required. Permitting dif- implementations only use some form of ferent page sizes may seem to be a good B-trees, an amazing variety of index idea, even at the added complexity in the structures has been described in the lit- buffer manager [Carey et al. 1986; Sikeler erature [Becker et al. 1991; Beckmann et 1988], but this does not solve the prob- al. 1990; Bentley 1975; Finkel and Bent- lem of mixed sequential scans and ran- ley 1974; Guenther and Bilmes 1991; dom record accesses within one file. The Gunther and Wong 1987; Gunther 1989; common solution is to choose a middle-of- Guttman 1984; Henrich et al. 1989; Heel the-road page size, e.g., 8 KB, and to and Samet 1992; Hutflesz et al. 1988a; support multipage read-ahead. 1988b; 1990; Jagadish 1991; Kemper and Wallrath 1987; Kolovson and Stone- braker 1991; Kriegel and Seeger 1987; 3.2 Associative Access Using Indices 1988; Lomet and Salzberg 1990a; Lomet In order to reduce the number of accesses 1992; Neugebauer 1991; Robinson 1981; to secondary storage (which is relatively Samet 1984; Six and Widmayer 1988]. slow compared to main memory), most One of the few multidimensional index database systems employ associative structures actually implemented in a search techniques in the form of indices complete database management system that map key or attribute values to loca- are R-trees in Postgres [Guttman 1984; tor information with which database Stonebraker et al. 1990b]. objects can be retrieved. The best known Table 5 shows some example index and most often used database index structures classified according to four structure is the B-tree [Bayer and characteristics, namely their support McCreighton 1972; Comer 1979]. A large for ordering and sorted scans, their number of extensions to the basic struc- dynamic-versus-static behavior upon ture and its algorithms have been pro- insertions and deletions, their support posed, e.g., B ‘-trees for faster scans, fast for multiple dimensions, and their sup- loading from a sorted file, increased fan- port for point data versus range data. We out and reduced depth by prefix and suf- omitted hierarchical concatenation of at- fix truncation, B*-trees for better space tributes and uniqueness, because all in- utilization in random insertions, and dex structures can be implemented to top-down B-trees for better locking be- support these. The indication “no range havior through preventive maintenance data” for multidimensional index struc- [Guibas and Sedgewick 1978]. Interest- tures indicates that range data are not ingly, B-trees seem to be having a renais- part of the basic structure, although they sance as a research subject, in particular can be simulated using twice the number with respect to improved space utiliza- of dimensions. 
We included a reference or tion [Baeza-Yates and Larson 1989], con- two with each structure; we selected orig- currency control [ Srinivasan and Carey inal descriptions and surveys over the 1991], recovery [Lanka and Mays 1991], many subsequent papers on special as- parallelism [Seeger and Larson 1991], pects such as performance analyses, mul- and on-line creation of B-trees for very tidisk’ and multiprocessor implementa- large databases [Srinivasan and Carey tions, page placement on disk, concur- 1992]. On-line reorganization and modifi- rency control, recovery, order-preserving

ACM Computing Surveys, Vol. 25, No. 2, June 1993 96 = Goetz Graefe

Table 5. Classlflcatlon of Some Index Structures Structure Ordered Dynamic Multl-Dim. Range Data References

ISAM Yes No No No [Larson 1981] B-trees Yes Yes No No [Bayer and McCreighton 1972; Comer 1979] Quad-tree Yes Yes Yes No [Finkel and Bentley 1974; Samet 1984] kD-trees Yes Yes Yes No [Bentley 1975] KDB-trees Yes Yes Yes No [Robinson 1981] hB-trees Yes Yes Yes No [Lomet and Salzberg 1990a] R-trees Yes Yes Yes Yes [Guttman 1984] Extendible No Yes No No [Fagin et al. 1979] Hashing Linear Hashing No Yes No No [Litwin 1980] Grid Files Yes Yes Yes No [Nievergelt et al. 1984]

hashing, mapping range data of N di- information that can then be used to re- mensions into point data of 2 N dimen- trieve the actual data object. Typically, in sions, etc.—this list suggests the wealth relational systems, an attribute value is of subsequent research, in particular on mapped to a tuple or record identifier B-trees, linear hashing, and refined mul- (TID or RID). Different systems use dif- tidimensional index structures, ferent approaches, but it seems that most Storage structures typically thought of new designs do not firmly attach the as index structures may be used as pri- record lookup to the index scan. mary structures to store actual data or There are several advantages to sepa- as redundant structures (“access paths”) rating index scan and record lookup. that do not contain actual data but point- First, it is possible to scan an index only ers to the actual data items in a separate without ever retrieving records from the data file. For example, Tandem’s Non- underlying data file. For example, if only Stop SQL system uses B-trees for actual salary values are needed (e.g., to deter- data as well as for redundant index mine the count or sum of all salaries), it structures. In this case, a redundant in- is sufficient to access the salary index dex structure contains not absolute loca- only without actually retrieving the data tions of the data items but keys used to records. The advantages are that (i) fewer search the primary B-tree. If indices are 1/0s are required (consider the number redundant structures, they can still be of 1/0s for retrieving N successive index used to cluster the actual data items, i.e., entries and those to retrieve N index the order or organization of index entries entries plus N full records, in particular determines the order of items in the data if the index is nonclustering [Mackert file. Such indices are called clustering and Lehman 1989] and (ii) the remaining indices; other indices are called nonclus- 1/0 operations are basically sequential tering indices. Clustering indices do not along the leaves of the index (at least for necessarily contain an entry for each data B ‘-trees; other index types behave differ- item in the primary file, but only one ently). The optimizers of several commer- entry for each page of the primary file; in cial relational products have recently this case, the index is called sparse. Non- been revised to recognize situations in clustering indices must always be dense, which an index-only scan is sufficient. i.e., there are the same number of entries Second, even if none of the existing in- in the index as there are items in the dices is sufficient by itself, multiple in- primary file. dices may be “joined” on equal RIDs to The common theme for all index struc- obtain all attributes required for a query tures is that they associatively map some (join algorithms are discussed below in attribute of a data object to some locator the section on binary matching). For ex-

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 97 ample, by matching entries in indices on ences to items that must be retrieved, salaries and on names by equal RIDs, giving the functional join operator signif- the correct salary-name pairs are estab- icant freedom to fetch items from disk lished. If a query requires only names efficiently. Of course, this technique and salaries, this “join” has made access- works most effectively if no other trans- ing the underlying data file obsolete. actions or operators use the same disk Third, if two or more indices apply to drive at the same time. individual clauses of a query, it may be This idea has been generalized to as- more effective to take the union or inter- semble complex objects~In object-oriented section of RID lists obtained from two systems, objects can contain pointers to index scans than using only one index (identifiers of) other objects or compo- (algorithms for union and intersection are nents, which in turn may contain further also discussed in the section on binary pointers, etc. If multiple objects and all matching). Fourth, joining two tables can their unresolved references can be con- be accomplished by joining the indices on sidered concurrently when scheduling the two join attributes followed by record disk accesses, significant savings in disk retrievals in the two underlying data sets; seek times can be achieved [Keller et al. the advantage of this method is that only 1991]. those records will be retrieved that truly contribute to the join result [Kooi 1980]. 3.3 Buffer Management Fifth, for nonclustering indices, sets of RIDs can be sorted by physical location, 1/0 cost can be further reduced by and the records can be retrieved very caching data in an 1/0 buffer. A large efficiently, reducing substantially the number of buffer management tech- number of disk seeks and their seek dis- niques have been devised; we give only a tances. Obviously, several of these tech- few references. Effelsberg and Haerder niques can be combined. In addition, some [19841 survey many of the buffer man- systems such as Rdb/VMS and DB2 use agement issues, including those pertain- very sophisticated implementations of ing to issues of recovery, e.g., write-ahead multiindex scans that decide dynami- logging. In a survey paper on the interac- cally, i.e., during run-time, which indices tions of operating systems and database to scan, whether scanning a particular management systems, Stonebraker index reduces the resulting RID list suf- [ 1981] pointed out that the “standard” ficiently to offset the cost of the index buffer replacement policy, LRU (least re- scan, and whether to use bit vector filter- cently used), is wrong for many database ing for the RID list intersection (see a situations. For example, a file scan reads later section on bit vector filtering) a large set of pages but uses them only [Antoshenkov 1993; Mohan et al. 1990]. once, “sweeping” the buffer clean of all Record access performance for nonclus- other pages, even if they might be useful tering indices can be addressed without in the future and should be kept in mem- performing the entire index scan first (as ory. Sacco and Schkolnick [1982; 19861 required if all RIDs are to be sorted) by focused on the nonlinear performance using a “window” of RIDs. Instead of effects of buffer allocation to many rela- obtaining one RID from the index scan, tional algorithms, e.g., nested-loops join. 
retrieving the record, getting the next Chou [ 1985] and Chou and DeWitt [ 1985] RID from the index scan, etc., the lookup combined these two ideas in their DBMIN operator (sometimes called “functional algorithm which allocates a fixed number join”) could load IV RIDs, sort them into of buffer pages to each scan, depending a priority heap, retrieve the most conve- on its needs, and uses a local replace- niently located record, get another RID, ment policy for each scan appropriate to insert it into the heap, retrieve a record, its reference pattern. A recent study into etc. Thus, a functional join operator us- buffer allocation is by Faloutsos et al. ing a window always has N open refer- [1991] and Ng et al. [19911 on using

ACM Computing Surveys, Vol. 25, No. 2, June 1993 98 ● G’oetz Graefe marginal gain for buffer allocation. A very memory management scheme for inter- promising research direction for buffer mediate results and items passed from management in object-oriented database iterator to iterator. The cost of this deci- systems is the work by Palmer and sion is additional in-memory copying as Zdonik [1991] on saving reference pat- well as the possible inefficiencies associ- terns and using them to predict future ated with, in effect, two buffer and mem- object faults and to prevent them by ory managers. prefetching the required pages. The interactions of index retrieval and 4. AGGREGATION AND DUPLICATE buffer management were studied by REMOVAL Sacco [1987] as well as Mackert and Lehman [ 19891.-, and several authors Aggregation is a very important statisti- studied database buffer management and cal concept to summarize information virtual memory provided by the operat- about large amounts of data. The idea is ing system [Sherman and Brice 1976; to represent a set of items by a single Stonebraker 1981; Traiger 1982]. value or to classify items into groups and On the level of buffer manager imple- determine one value per group. Most mentation, most database buffer man- database systems support aggregate agers do not ~rovide read and write in. functions for minimum, maximum, sum, t~rfaces to th~ir client modules but fixing count, and average (arithmetic mean’). and unfixing, also called pinning and un- Other aggregates, e.g., geometric mean pinning. The semantics of fixing is that a or standard deviation, are typically not fixed page is not subject to replacement provided, but may be constructed in some or relocation in the buffer pool, and a systems with extensibility features. Ag- client module may therefore safely use a gregation has been added to both rela- memory address within a fixed page. If tional calculus and algebra and adds the the buffer manager needs to replace a same expressive power to each of them page but all its buffer frames are fixed, [Klug 1982]. some special action must occur such as Aggregation is typically supported in dynamic growth of the buffer pool or two forms, called scalar aggregates and transaction abort. aggregate functions [Epstein 1979]. The iterator implementation of query Scalar aggregates calculate a single evaluation algorithms can exploit the scalar value from a unary input relation, buffer’s fix/unfix interface by passing e.g., the sum of the salaries of all employ- pointers to items (records, objects) fixed ees. Scalar aggregates can easily be de- in the buffer from iterator to iterator. termined using a single pass over a data The receiving iterator then owns the fixed set. Some systems exploit indices, in par- item; it may unfix it immediately (e.g., ticular for minimum, maximum, and after a predicate fails), hold on to the count. fixed record for a while (e.g., in a hash Aggregate functions, on the other hand, table), or pass it on to the next iterator determine a set of values from a binary (e.g., if a predicate succeeds). Because the input relation, e.g., the sum of salaries iterator control and interaction of opera- for each department. Aggregate functions tors ensure that items are never pro- are relational operators, i.e., they con- duced and fixed before they are required, sume and produce relations. Figure 10 the iterator protocol is very eflic~ent in shows the output of the query “count of its buffer usage. 
employees by department.” The “by-list” Some implementors, however, have felt or grouping attributes are the key of the that intermediate results should not be new relation, the Department attribute materialized or kept in the database sys- in this example. tem’s 1/0 buffer, e.g., in order to ease Algorithms for aggregate functions re- implementation of transaction (ACID) se- quire grouping, e.g., employee items may mantics, and have designed a separate be grouped by department, and then one

ACM Computing Surveys. Vol 25, No 2, June 1993 Query Evaluation Techniques ● 99

on nested loops, sorting, and hashing. The first algorithm, which we call nested-loops aggregation, is the most Shoe 9 simple-minded one. Using a temporary Hardware 7 I file to accumulate the output, it loops for each input item over the output file accu- Figure 10. Count of employees by department, mulated so far and either aggregates the input item into the appropriate output item or creates a new output item and output item is calculated per group. This appends it to the output file. Obviously, grouping process is very similar to dupli- this algorithm is quite inefficient for large cate removal in which eaual data items inputs, even if some performance en- must be brought together; compared, and hancements can be applied.7 We mention removed. Thus, aggregate functions and it here because it corres~onds. to the al~o- duplicate removal are typically imple- rithm choices available for relational mented in the same module. There are joins and other binary matching prob- only two differences between aggregate lems (discussed in the next section), functions and duplicate removal. First, in which are the nested-loops join and the duplicate removal, items are compared more efficient sort-based and hash-based join algorithms. As for joins and binary on all their attributes. but onlv. on the attributes in the by-list of aggregate matching, where the nested-loops algo- functions. Second, an identical item is rithm is the only algorithm that can eval- immediately dropped from further con- uate any join predicate, the nested-loops sideration in duplicate removal whereas aggregation algorithm can support un- in aggregate functions some computation usual aggregations where the input items is ~erformed before the second item of are not divided into disjoint equivalence th~ same group is dropped. Both differ- classes but where a single input item ences can easily be dealt with using a may contribute to multiple output items. switch in an actual algorithm implemen- While such aggregations are not sup- tation. Because of their similarity, dupli- ported in today’s database systems, clas- cate removal and aggregation are de- sifications that do not divide the input scribed and used interchangeably here. into equivalence classes can be useful in In most existing commercial relational both commercial and scientific applica- systems, aggregation and duplicate re- tions. If the number of classifications is moval algorithms are based on sorting, small enough that all output items can following Epstein’s [1979] work. Since be kept in memory, the performance of aggregation requires that all data be con- this algorithm is acceptable. However, for sumed before any output can be pro- the more standard database aggregation duced, and since main memories were problems, sort-based and hash-based du- significantly smaller 15 years ago when plicate removal and aggregation algo- the prototypes of these systems were de- rithms are more appropriate. signed, these implementations used tem- porary files for output, not streams and iterator algorithms. However, there is no reason why aggregation and duplicate re- 7The possible improvements are (i) looping over pages or clusters rather than over records of input moval cannot be implemented using iter- and output items (block nested loops), (ii) speeding ators exploiting today’s memory sizes. 
the inner loop by an index (index nested loops), a method that has been used in some commercial relational sYstems, (iii) bit vector filtering to deter- 4.1 Aggregation Algorithms Based on mine without inner loop or index lookup that an Nested Loops item in the outer loop cannot possibly have a match in the inner loop. All three of these issues are There are three types of algorithms for discussed later in this survey as they apply to aggregation and duplicate removal based binary operations such as joins and intersection.

ACM Computing Surveys, Vol. 25, No 2, June 1993 100 “ Goetz Graefe

4.2 Aggregation Algorithms Based on the output cardinality (number of items) Sorting is G times less than the input cardinality (G = R/o), where G is called the aver. Sorting will bring equal items together, age group size or the reduction factor, and duplicate removal will then be easy. only the last ~logF(G)l merge levels, in- The cost of duplicate removal is domi- cluding the final merge, are affected by nated bv the sort cost. and the cost of early aggregation because in earlier lev- this na~ve dudicate removal al~orithm els more than G runs exist, and items based on sorting can be assume~ to be from each group are distributed over all that of the sort operation. For aggrega- those runs, giving a negligible chance of tion, items are sorted on their grouping early aggregation. attributes. In the first merge levels, all input items This simple method can be improved participate, and the cost for these levels by detecting and removing duplicates as can be determined without explicitly cal- early as possible, easily implemented in culating the size and number of run files the routines that write run files during on these levels. In the affected levels, the sorting. With such “early” duplicate re- size of the output runs is constant, equal moval or aggregation, a run file can never to the size of the final output O = R/G, contain more items than the final output while the number of run files decreases (because otherwise it would contain du- by a factor equal to the fan-in F in each plicates!), which may speed up the final level. The number of affected levels that merges significantly [Bitten and DeWitt create run files is Lz = [log~(G)l – 1; the 1983]. subtraction of 1 is necessary because the As for any external sort operation, the final merge does not create a run file but optimizations discussed in the section on the output stream. The number of unaf- sorting, namely read-ahead using fore- fected levels is LI = L – Lz. The number casting, merge optimizations, large clus- of input runs is W/Fz on level i (recall ter sizes, and reduced final fan-in for the number of initial runs W = R/M binary consumer operations, are fully ap- from the discussion of sorting). The total plicable when sorting is used for aggre- cost,8 including a factor 2 for writing and gation and duplicate removal. However, reading, is to limit the complexity of the formulas, we derive 1/0 cost formulas without the L–1 effects of these optimizations. 2X RX LI+2XOX ~W/Fz The amount of I/0 in sort-based a~- Z=L1 gregation is determined by the number ~f =2 XRXL1+2XOXW merge levels and the effect of early dupli- cate removal on each merge step. The x(l\FL1 – l/’F~)/’(l – l\F). total number of merge levels is unaf- fected by aggregation; in sorting with For example, consider aggregating R = 100 MB input into O = 1 MB output quicksort and without optimized merg- ing, the number of merge levels is L = (i.e., reduction factor G = 100) using a system with M = 100 KB memory and DogF(R/M)l for input size R, memory fan-in F = 10. Since the input is W = size M, and fan-in F. In the first merge 1,000 times the size of memory, L = 3 levels, the likelihood is negligible that merge levels will be needed. The last Lz items of the same group end up in the = log~(G) – 1 = 1 merge level into tem- same run file, and we therefore assume porary run files will permit early aggre- that the sizes of run files are unaffected until their sizes would exceed the size of gation. Thus, the total 1/0 will be the final output. 
Runs on the first few merge levels are of size M X F’ for level i, and runs of the last levels have the 8 Using Z~= ~ a z = (1 – a~+l)/(l – a) and Z~.KaL same size as the final output. Assuming = X$LO a’ – E~=-O] a’ = (aK – a“’’+l)/(l – a).

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques ● 101

2X1 OOX2+2X1X1OOO files (all partitioning files in any one re- x(l/lo2 – 1/103 )/’(1 – 1/10) cursion level) will basically be as large as the entire input because once a partition = 400 + 2 x 1000 x 0.009/0.9 is being written to disk, no further aggre- = 420 MB gation can occur until the partition files are read back into memory. which has to be divided by the cluster The amount of 1/0 for hash-based size used and multi~lied bv the time to aggregation depends on the number of read or write a clu~ter to “estimate the partitioning (recursion) levels required 1/0 time for aggregation based on sort- before the output (not the input) of one ing. Naive separation of sorting and sub- partition fits into memory. This will be sequent aggregation would have required the case when partition files have been reading and writing the entire input file reduced to the size G x M. Since the par- three times, for a total of 600 MB 1/0. titioning files shrink by a factor of F at Thus, early aggregation realizes almost each level (presuming hash value skew is 30% savings in this case. absent or effectively counteracted), the Aggrega~e queries may require that number of partitioning (recursion) levels duplicates be removed from the input set is DogF(R/G/M)l = [log~(O/M)l for in- to the aggregate functions,g e.g., if the put size R, output size O, reduction fac- SQL distinct keyword is used. If such an tor G, and fan-out 1’. The costs at each aggregate function is to be executed us- level are proportional to the input file ing sorting, early aggregation can be used size R. The total 1/0 volume for hashing only for the duplicate removal part. How- with overflow avoidance, including a fac- ever, the sort order used for duplicate tor of 2 for writing and reading, is removal can be suitable to .~ermit the subsequent aggregation as a simple filter 2 x R X [logF(O/~)l. operation on the duplicate removal’s out- put stream. The last partitioning level may use hy- brid hashing, i.e., it may not involve 1/0 for the entire input file. In that case, 4.3 Aggregation Algorithms Based on L = llog~( O\M )j complete recursion lev- Hashing els involving all input records are re- Hashing can also be used for aggregation quired, partitioning the input into files of by hashing on the grouping attributes. size R’ = R/FL. In each remaining hy- Items of the same group (or duplicate brid hash aggregation, the size limit for items in duplicate removal) can be found overflow files is M x G because such an and aggregated when inserting them into overflow file can be aggregated in mem- the hash table. Since only output items, ory. The number of partition files K must not input items, are kept in memory, satisfy KxMx G+(M– KxC)XG hash table overflow occurs only if the > R’, meaning K = [(R’/G – M)/(M – output does not fit into memory. How- C)l partition files will be created. The ever, if overflow does occur, the partition total 1/0 cost for hybrid hash aggrega- tion is

2X RX L+2XFL 9 Consider two queries, both counting salaries per department. In order to determine the number of x( R’–(M– KxC)XG) (salaried) employees per department, all salaries are counted without removing duplicate salary val- =2 X( RX(L+l)– FL ues. On the other hand, in order to assess salary x( M–Kx C)XG) differentiation in each department, one might want to determine the number of distinct salary levels in =2 X( RX(L+l)– FL each department. For this query, only distinct salaries are counted, i.e., duplicate department- x(M– [(R’/G –M)/(M– C)] salary pairs must be removed prior to counting. (This refers to the latter type of query.) xC) x G).

ACM Computing Surveys, Vol. 25, No 2, June 1993 102 “ Goetz Graefe

600 – 500 –

400 – I/o 300 – [MB] o Sorting without early aggregation 200 – A Sorting with early aggregation 100 – x Hashing without hybrid hashing o– ❑ Hashing with hybrid hashing I I I I I I I 1 I I I I I 1 23510 2030 50 100 200300500 1000

Group Size or Reduction Factor

Figure 11. Performance of sort- and hash-based aggregation.

As for sorting, if an aggregate query suits of Bitton and DeWitt [1983]. The requires duplicate removal for the input other algorithms all exhibit similar, set to the aggregate function,l” the group though far from equal, performance im- size or reduction factor of the duplicate provements for larger reduction factors. removal step determines the perfor- Sorting with early aggregation improves mance of hybrid hash duplicate removal. once the reduction factor is large enough The subsequent aggregation can be per- to affect not only the final but also previ- formed after the duplicate removal as an ous merge steps. Hashing without hybrid additional operation within each hash hashing improves in steps as the number bucket or as a simple filter operation on of partitioning levels can be reduced, with the duplicate removal’s output stream. “step” points where G = F’ for some i. Hybrid hashing exploits all available 4.4 A Rough Performance Comparison memory to improve performance and generally outperforms overflow avoid- It is interesting to note that the perfor- ance hashing. At points where overflow mance of both sort-based and hash-based avoidance hashing shows a step, hybrid aggregation is logarithmic and improves hashing has no effect, and the two hash- with increasing reduction factors. Figure ing schemes have the same performance. 11 compares the performance of sort- and While hash-based aggregation and du- hash-based aggregationll using the for- plicate removal seem superior in this mulas developed above for 100 MB input rough analytical performance compari- data, 100 KB memory, clusters of 8 KB, son, recall that the cost formula for sort- fan-in or fan-out of 10, and varying group based aggregation does not include the sizes or reduction factors. The output size effects of replacement selection or the is the input size divided by the group merge optimizations discussed earlier in size. the section on sorting; therefore, Figure It is immediately obvious in Figure 11 11 shows an upper bound for the 1/0 that sorting without early aggregation is cost of sort-based aggregation and dupli- not competitive because it does not limit cate removal. Furthermore, since the cost the sizes of run files, confirming the re- formula for hashing presumes optimal assignments of hash buckets to output partitions, the real costs of sort- and 1“ See footnote 9. hash-based aggregation will be much 11Aggregation by nested-loops methods is omitted from Figure 11 because it is not competitive for more similar than they appear in Figure large data sets. 11. The important point is that both their

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques “ 103

costs are logarithmic with the input size, cost calculation, and comparison of alter- improve with the group size or reduction native plans. For these applications, factor, and are quite similar overall. faster algorithms can be designed that rely either on a single sequential scan of the data (no run files, no overflow files) 4.5 Additional Remarks on Aggregation or on sampling [Astrahan et al. 1987; Some applications require multilevel ag- Hou and Ozsoyoglu 1991; 1993; Hou et gregation. For example, a report genera- al. 1991]. tion language might permit a request like “sum (employee. salary by employee.id by 5. BINARY MATCHING OPERATIONS employee. department by employee. divi- While aggregation is essential to con- siony to create a report with an entry for dense information, there are a number of each employee and a sum for each de- database operations that combine infor- partment and each division. In fact, spec- mation from two inputs, files, or sets and ifying such reports concisely was the therefore are essential for database driving design goal for the report genera- systems’ ability to provide more than tion language RPG. In SQL, this requires reliable shared storage and to perform multiple cursors within an application inferences, albeit limited. A group of op- program, one for each level of detail. This erators that all do basically the same is very undesirable for two reasons. First, task are called the one-to-one match op- the application program performs what erations here because an input item con- is essentially a join of three inputs. Such tributes to the output depending on its joins should be provided by the database match with one other item. The most system, not required to be performed prominent among these operations is the within application programs. Second, the relational join. Mishra and Eich [1992] database system more likely than not have recently written a survey of join executes the operations for these cursors algorithms, which includes an interest- independently from one another, result- ing analysis and comparison of algo- ing in three sort operations on the em- rithms focusing on how data items from ployee file instead of one. the two inputs are compared with one If complex reporting applications are another. The other one-to-one match op- to be supported, the query language erations are left and right semi-joins, left, should support direct requests (perhaps right, and symmetric outer-joins, left and similar to the syntax suggested above), right anti-semi-joins, symmetric anti- and the sort operator should be imple- join, intersection, union, left and right mented such that it can perform the en- differences, and symmetric or anti-dif- tire operation in a single sort and one ference.lz Figure 12 shows the basic final pass over the sorted data. An analo- gous algorithm based on hashing can be defined; however, if the aggregated data 12 The anti-semijoin of R and S is R SEMIJOIN are required in sort order, sort-based ag- S = R – (R SEMZJOIN S), i.e., the items in R gregation will be the algorithm of choice. without matches in S. The (symmetric) anti-join For some applications, exact aggregate contains those items from both inputs that do not functions are not required; reasonably have matches, suitably padded as in outer joins to close approximations will do. For exam- make them union compatible. 
Formally, the (sym- metric) anti-join of R” and S is R JOIN S =- (R ple, exploratory (rather than final pre- SEMIJOIN S ) U (S SEMIJOIN R ) with the tuples cise) data analysis is frequently very use- of the two union arguments suitably extended with ful in “approaching” a new set of data null values. The symmetric or anti-difference M the [Tukey 1977]. In real-time systems, pre- union of the two differences. Formally the anti-dif- cision and response time may be reason- ference of R and S is (R u S) – (R n S) = (R – S) u (S – R) [Maier 1983]. Among these three op- able tradeoffs. For database query opti- erations, the anti-semijoin is probably the most mization, approximate statistics are a useful one, as in the query to “find the courses that sufficient basis for selectivity estimation, don’t have any enrollment.”

ACM Computing Surveys, Vol. 25, No. 2, June 1993 104 “ Goetz Graefe

R s

A B c

m

output Match on all Match on some Attributes Attributes A Difference Anti-semi-join B Intersection Join, semi-join c Difference Anti-semi-join A, B Left outer join A, C Symmetric difference Anti-join B, C Right outer join A, B, C Union Symmernc outer join

Figure 12, Binary one-to-one matching.

principle underlying all these operations, surprising places. Consider an object- namely separation of the matching and oriented database system that uses a nonmatching components of two sets, table to map logical object identifiers called R and S in the figure, and produc- (OIDS) to physical locations (record iden- tion of appropriate subsets, possibly after tifiers or RIDs). Resolving a set of OIDS some transformation and combination of to RIDs can be regarded (as well as opti- records as in the case of a join. If the sets mized and executed) as a semi-join of the R and S have different schemas as in mapping table and the set of OIDS, and relational joins, it might make sense to all conventional join strategies can be think of the set B as two sets 11~ and employed. Another example that can oc- B~, i.e., the matching elements from R cur in a database management system and S. This distinction permits a clearer for any data model is the use of multiple definition of left semi-join and right indices in a query: the pointer (OID or semi-join, etc. Since all these operations RID) lists obtained from the indices must require basically the same steps and can be intersected (for a conjunction) or be implemented with the same algo- united (for a disjunction) to obtain the rithms, it is logical to implement them in list of pointers to items that satisfy the one general and efficient module. For whole query. Moreover, the actual lookup simplicity, only join algorithms are dis- of the items using the pointer list can be cussed here. Moreover, we discuss algo- regarded as a semi-join of the underlying rithms for only one join attribute since data set and the list, as in Kooi’s [1980] the algorithms for multi-attribute joins thesis and the Ingres product [Kooi and (and their performance) are not different Frankforth 1982] and a recent study by from those for single-attribute joins. Shekita and Carey [1990]. Finally, many Since set operations such as intersec- path expressions in object-oriented tion and difference will be used and must database systems such as “employee. de- be implemented efficiently for any data partment.manager. office.location” can model, this discussion is relevant to rela- frequently be interpreted, optimized, and tional, extensible, and object-oriented executed as a sequence of one-to-one database systems alike. Furthermore, bi- match operations using existing join and nary matching problems occur in some semi-join algorithms. Thus, even if rela-

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 105 tional systems were completely abolished are never written to disk, and because and replaced by object-oriented database these costs are equal for all algorithms. systems, set matching and join tech- niques developed in the relational con- 5.1 Nested-Loops Join Algorithms text would continue to be important for the performance of database systems. The simplest and, in some sense, most Most of today’s commercial database direct algorithm for binary matching is systems use only nested loops and the nested-loops join: for each item in one merge-join because an analysis per- input (called the outer input), scan the formed in connection with the System R entire other input (called the inner in- project determined that of all the join put) and find matches. The main advan- methods considered, one of these two al- tage of this algorithm is its simplicity. ways provided either the best or very Another advantage is that it can com- close to the best performance [Blasgen pute a Cartesian product and any @join and Eswaran 1976; 1977]. However, the of two relations, i.e., a join with an arbi- System R study did not consider hash trary two-relation comparison predicate. join algorithms, which are now regarded However, Cartesian products are avoided as more efficient in many cases. by query optimizers because their out- There continues to be a strong interest puts tend to contain many data items in join techniques, although the interest that will eventually not satisfy a query has shifted over the last 20 years from predicate verified later in the query eval- basic algorithmic concerns to parallel uation plan. techniques and to techniques that adapt Since the inner input is scanned re- to unpredictable run-time situations such peatedly, it must be stored in a file, i.e., a as data skew and changing resource temporary file if the inner input is pro- availability. Unfortunately, many new duced by a complex subplan. This situa- proposed techniques fail a very simple tion does not change the cost of nested test (which we call the “Guy Lehman test loops; it just replaces the first read of the for join techniques” after the first person inner input with a write. who pointed this test out to us), making Except for very small inputs, the per- them problematic for nontrivial queries. formance of nested-loops join is disas- The crucial test question is: Does this trous because the inner input is scanned new technique apply to joining three in- very often, once for each item in the outer puts without interrupting data flow be- input. There are a number of improve- tween the join operators? For example, a ments that can be made to this naiue technique fails this test if it requires ma- nested-loops join. First, for one-to-one terializing the entire intermediate join match operations in which a single match result for random sampling of both join carries all necessary information, e.g., inputs or for obtaining exact knowledge semi-join and intersection, a scan of the about both join input sizes. Given its im- inner input can be terminated after the portance, this test should be applied to first match for an item of the outer input. both proposed query optimization and Second, instead of scanning the inner in- query execution techniques. 
put once for each item from the outer For the 1/0 cost formulas given here, input, the inner input can be scanned we assume that the left and right inputs once for each page of the outer input, an have R and S pages, respectively, and algorithm called block nested-loops join that the memory size is M pages. We [Kim 1980]. Third, the performance can assume that the algorithms are imple- be improved further by filling all of mem- mented as iterators and omit the cost of ory except K pages with pages of the reading stored inputs and writing an op- outer input and by using the remaining eration’s output from the cost formulas K pages to scan the inner input and to because both inputs and output may be save pages of the inner input in memory. iterators, i.e., these intermediate results Finally, scans of the inner input can be


made a little faster by scanning the inner built on the fly, i.e., indices built on inter- input alternatingly forward and back- mediate query processing results. ward, thus reusing the last page of the A recent investigation by DeWitt et al. previous scan and therefore saving one [1993] demonstrated that index nested- 1/0 per inner scan. The 1/0 cost for this loops join can be the fastest join method version of nested-loops join is the product if one of the inputs is so small and if the of the number of scans (determined by other indexed input is so large that the the size of the outer input) and the cost number of index and data page re- per scan of the inner input, plus K 1/0s trievals, i.e., about the product of the because the first inner scan has to scan index depth and the cardinality of the or save the entire inner input. Thus, the smaller input, is smaller than the num- total cost for scanning the inner input ber of pages in the larger input. repeatedly is [R\(M – K)] x (S – K) + Another interesting idea using two or- K. This expression is minimized if K = 1 dered indices, e.g., a B-tree on each of the and R z S, i.e., the larger input should two join columns, is to switch roles of be the outer. inner and outer join inputs after each If the critical performance measure is index lookup, which leads to the name not the amount of data read in the re- “zig-zag join.” For example, for a join peated inner scans but the number of predicate R.a = S.a, a scan in the index 1/0 operations, more than one page on R .a finds the lower join attribute should be moved in each 1/0, even if value in R, which is then looked up in more memory has to be dedicated to the the index on S.a. A continuing scan in inner input and less to the outer input, the index on S.a yields the next possible thus increasing the number of passes over join attribute value, which is looked up the inner input. If C pages are moved in in the index on R .a, etc. It is not immedi- each 1/0 on the inner input and M – C ately clear under which circumstances pages for the outer input, the number of this join method is most efficient. 1/0s is [R/(M – C)] x (S/C) + 1, For complex queries, N-ary joins are which is minimized if C = M/2. In other sometimes written as a single module, words, in order to minimize the number i.e., a module that performs index lookups of large-chunk 1/0 operations, the clus- into indices of multiple relations and joins ter size should be chosen as half the all relations simultaneously. However, it available memory size [Hagmann 1986]. is not clear how such a multi-input join Finally, index nested-loops join ex- implementation is superior to multiple ploits a permanent or temporary index index nested-loops joins, on the inner input’s join attribute to re- place file scans by index lookups. In prin- 5.2 Merge-Join Algorithms ciple, each scan of the inner input in naive nested-loops join is used to find The second commonly used join method matches, i.e., to provide associativity. Not is the merge-join. It requires that both surprisingly, since all index structures inputs are sorted on the join attribute. are designed and used for the purpose of Merging the two inputs is similar to the associativity, any index structure sup- merge process used in sorting. An impor- porting the join predicate (such as = , <, tant difference, however, is that one of etc.) can be used for index nested-loops the two merging scans (the one which is join. 
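As a concrete illustration of the cost trade-off just described, the following sketch implements block nested-loops join over inputs represented as lists of pages, together with the I/O estimate [R/(M − K)] × (S − K) + K given above. It is a simplified model (record layout, the buffering of the K inner pages, and actual I/O are abstracted away), not the implementation of any particular system.

    import math

    def block_nested_loops_join(outer_pages, inner_pages, M, K=1,
                                match=lambda r, s: r == s):
        """Join two inputs given as lists of pages (each page a list of records):
        M - K memory pages buffer a chunk of the outer input; the inner input is
        scanned once per chunk."""
        chunk = M - K
        result = []
        for i in range(0, len(outer_pages), chunk):
            outer_chunk = [r for page in outer_pages[i:i + chunk] for r in page]
            for page in inner_pages:                   # one full inner scan per chunk
                for s in page:
                    result.extend((r, s) for r in outer_chunk if match(r, s))
        return result

    def block_nested_loops_io(R, S, M, K=1):
        """Page I/O estimate from the text: ceil(R / (M - K)) * (S - K) + K,
        minimized with K = 1 and the larger input as the outer input."""
        return math.ceil(R / (M - K)) * (S - K) + K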
The fastest indices for exact match advanced on equality, usually called the queries are hash indices, but any index inner input) must be backed up when structure can be used, ordered or un - both inputs contain duplicates of a join ordered (hash), single- or multi-attribute, attribute value and when the specific single- or multidimensional. Therefore, one-to-one match operation requires that indices on frequently used join attributes all matches be found, not just one match. (keys and foreign keys in relational sys- Thus, the control logic for merge-join tems) may be useful. Index nested-loops variants for join and semi-j oin are slightly join is also used sometimes with indices different. Some systems include the no-

ACM Computing Surveys, Vol 25, No 2. June 1993 Query Evaluation Techniques ● 107 tion of “value packet,” meaning all items [Cheng et al. 1991], combining elements with equal join attribute values [Kooi from index nested-loops join, merge-join, 1980; Kooi and Frankforth 1982]. An it- and techniques joining sorted lists of in- erator’s next call returns a value packet, dex leaf entries. After sorting the outer not an individual item, which makes the input on its join attribute, hybrid join control logic for merge-j oin much easier. uses a merge algorithm to “join” the outer If (or after) both inputs have been sorted, input with the leaf entries of a preexist- the merge-join algorithm typically does ing B-tree index on the join attribute of not require any 1/0, except when “value the inner input. The result file contains packets” are larger than memory. (See entire tuples from the outer input and footnote 1.) record identifiers (RIDs, physical ad- An input may be sorted because a dresses) for tuples of the inner input. stored database file was sorted, an or- This file is then sorted on the physical dered index was used, an input was locations, and the tuples of the inner re- sorted explicitly, or the input came from lation can then be retrieved from disk an operation that produced sorted out- very efficiently. This algorithm is not en- put, e.g., another merge-join. The last tirely new as it is a special combination point makes merge-join an efficient algo- of techniques explored by Blasgen and rithm if items from multiple sources are Eswaran [1976; 1977], Kooi [1980], and matched on the same join attribute(s) in Whang et al. [1984; 1985]. Blasgen and multiple binary steps because sorting in- Eswaran considered the manipulation of termediate results is not required for RID lists but concluded that either later merge-joins, which led to the con- merge-join or nested-loops join is the op- cept of interesting orderings in the Sys- timal choice in almost all cases; based on tem R query optimizer [Selinger et al. this study, only these two algorithms 1979]. Since set operations such as inter- were implemented in System R [Astra- section and union can be evaluated using han et al. 1976] and subsequent rela- any sort order, as long as the same sort tional database systems, Kooi’s optimizer order is present in both inputs, the effect treated an index similarly to a base rela- of interesting orderings for one-to-one tion and the lookup of data records from match operators based on merge-join can index entries as a join; this naturally always be exploited for set operations. permitted joining two indices or an index A combination of nested-loops join and with a base relation as in hybrid join. merge-join is the heap-filter merge -]”oin [Graefe 1991]. It first sorts the smaller 5.3 Hash Join Algorithms inner input by the join attribute and saves it in a temporary file. Next, it uses Hash join algorithms are based on the all available memory to create sorted idea of building an in-memory hash table runs from the larger outer input using on one input (the smaller one, frequently replacement selection. As discussed in the called the build input) and then probing section on sorting, there will be about this hash table using items from the other W = R/(2 x M) + 1 such runs for outer input (frequently called the probe input ). input size R. 
These runs are not written These algorithms have only recently to disk; instead, they are joined immedi- found greater interest [Bratbergsengen ately with the sorted inner input using 1984; DeWitt et al. 1984; DeWitt and merge-join. Thus, the number of scans of Gerber 1985; DeWitt et al. 1986; Fushimi the inner input is reduced to about one et al. 1986; Kitsuregawa et al. 1983; half when compared to block nested loops. 1989a; Nakayama et al. 1988; Omiecin- On the other hand, when compared to ski 1991; Schneider and DeWitt 1989; merge-join, it saves writing and read- Shapiro 1986; Zeller and Gray 1990]. One ing temporary files for the larger outer reason is that they work very fast, i.e., input. without any temporary files, if the build Another derivation of merge-join is the input does indeed fit into memory, inde- hybrid join used in IBMs DB2 product pendently of the size of the probe input.
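A minimal sketch of this no-overflow case, assuming the build input fits into memory (overflow avoidance and resolution are discussed next), might look as follows; the relation contents and key functions in the usage comment are invented for illustration.

    from collections import defaultdict

    def hash_join(build_input, probe_input, build_key, probe_key):
        """Classic in-memory hash join: build a hash table on the (smaller) build
        input, then probe it with the other input.  No temporary files are needed
        as long as the build input fits into memory."""
        table = defaultdict(list)
        for b in build_input:                      # build phase
            table[build_key(b)].append(b)
        for p in probe_input:                      # probe phase
            for b in table.get(probe_key(p), ()):  # all matches for this probe item
                yield (b, p)

    # Example with invented data: departments (dno, name) and employees (name, dno).
    # list(hash_join([(1, "toys"), (2, "tools")],
    #                [("ann", 1), ("bob", 2), ("cay", 2)],
    #                build_key=lambda d: d[0], probe_key=lambda e: e[1]))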


However, they require overflow avoid- ance or resolution methods for larger build inputs, and suitable methods were developed and experimentally verified Second only in the mid-1980’s, most notably in Join connection with the Grace and Gamma –-l -l-l database machine projects [DeWitt et al. Input 1986; 1990; Fushimi et al. 1986; Kitsure- gawa et al. 1983]. In hash-based join methods, build and L probe inputs are partitioned using the same partitioning function, e.g., the join First Join Input key value modulo the number of parti- tions. The final join result can be formed Figure 13. Effect of partitioning for join operations. by concatenating the join results of pairs of partitioning files. Figure 13 shows the effect of partitioning the two inputs of a binary operation such as join into hash buckets and partitions. (This figure was adapted from a similar diagram by IKit- suregawa et al. [ 1983]. Mishra and Eich [1992] recently adapted and generalized it in their survey and comparison of rela- tional join algorithms.) Without parti- tioning, each item in the first input must be compared with each item in the sec- ond input; this would be represented by complete shading of the entire diagram. With partitioning, items are grouped into Figure 14. Recursive partitioning in binary operations. partition files, and only pairs in the se- ries of small rectangles (representing the partitions) must be compared. If a build partition file is still larger file individually, hash-based binary than memory, recursive partitioning is matching operators are particularly ef- required. Recursive partitioning is used fective when the input sizes are very dif- for both build- and probe-partitioning ferent [Bratbergsengen 1984; Graefe et files using the same hash and partition- al. 1993]. ing functions. Figure 14 shows how both The 1/0 cost for binary hybrid hash input files are partitioned together. The operations can be determined by the partial results obtained from pairs of number of complete levels (i.e., levels partition files are concatenated to form without hash table) and the fraction of the result of the entire match operation. the input remaining in memory in the Recursive partitioning stops when the deepest recursion level. For memory size build partition fits into memory. Thus, M, cluster size C, partitioning fan-out the recursion depth of partitioning for F = [M\C – 1],build input size R, and binary match operators depends only on probe input size S, the number of com- the size of the build input (which there- plete levels is L = [logF( R/M)], after fore should be chosen to be the smaller which the build input partitions should input) and is independent of the size of be of size R’ = R/FL. The 1/0 cost for the probe input. Compared to sort-based the binary operation is the cost of parti- binary matching operators, i.e., variants tioning the build input divided by the of merge-join in which the number of size of the build input and multiplied by merge levels is determined for each input the sum of the input sizes. Adapting the


cost formula for unary hashing discussed earlier, the total amount of I/O for a recursive binary hash operation is

2 × (R × (L + 1) − F^L × (M − [(R′ − M + C)/(M − C)] × C))/R × (R + S),

which can be approximated with 2 × log_F(R/M) × (R + S). In other words, the cost of binary hash operations on large inputs is logarithmic; the main difference to the cost of merge-join is that the recursion depth (the logarithm) depends only on one file, the build input, and is not taken for each file individually.

As for all operations based on partitioning, partitioning (hash) value skew is the main danger to effectiveness. When using statistics on hash value distributions to determine which buckets should stay in memory in hybrid hash algorithms, the goal is to avoid as much I/O as possible with the least memory "investment." Thus, it is most effective to retain those buckets in memory with few build items but many probe items or, more formally, the buckets with the smallest value for r_i/(r_i + s_i), where r_i and s_i indicate the total size of a bucket's build and probe items [Graefe 1993b].

5.4 Pointer-Based Joins

Recently, links between data items have found renewed interest, be it in object-oriented systems in the form of object identifiers (OIDs) or as access paths for faster execution of relational joins. In a sense, links represent a limited form of precomputed results, somewhat similar to indices and join indices, and have the usual cost-versus-benefit tradeoff between query performance enhancement and maintenance effort. Kooi [1980] modeled the retrieval of actual records after index searches as "TID joins" (tuple identifiers permitting direct record access) in his query optimizer for Ingres; together with standard join commutativity and associativity rules, this model permitted exploring joins of indices of different relations (joining lists of key-TID pairs) or joins of one relation with another relation's index. In the Genesis data model and database system, Batory et al. [1988a; 1988b] modeled joins in a functional way, borrowing from research into the database languages FQL [Buneman et al. 1982; Buneman and Frankel 1979], DAPLEX [Shipman 1981], and Gem [Tsur and Zaniolo 1984; Zaniolo 1983] and permitting pointer-based join implementations in addition to traditional value-based implementations such as nested-loops join, merge-join, and hybrid hash join.

Shekita and Carey [1990] recently analyzed three pointer-based join methods based on nested-loops join, merge-join, and hybrid hash join. Presuming relations R and S, with a pointer to an S tuple embedded in each R tuple, the nested-loops join algorithm simply scans through R and retrieves the appropriate S tuple for each R tuple. This algorithm is very reminiscent of unclustered index scans and performs similarly poorly for larger set sizes. Their conclusion on naive pointer-based join algorithms is that "it is unwise for object-oriented database systems to support only pointer-based join algorithms."

The merge-join variant starts with sorting R on the pointers (i.e., according to the disk addresses they point to) and then retrieves all S items in one elevator pass over the disk, reading each S page at most once. Again, this idea was suggested before for unclustered index scans, and variants similar to heap-filter merge-join [Graefe 1991] and complex object assembly using a window and priority heap of open references [Keller et al. 1991] can be designed.

The hybrid hash join variant partitions only relation R on pointer values, ensuring that R tuples with S pointers to the same page are brought together, and then retrieves S pages and tuples. Notice that the two relations' roles are fixed by the direction of the pointers, whereas for standard hybrid hash join the smaller relation should be the build input. Differ-


[Figure 15 plots the I/O count (in thousands, from 0 to about 125) against the size of R (100 to 1,500, with S = 10 × R) for five join methods: pointer join with pointers from S to R, nested-loops join, merge-join with two sorts, hashing without hybrid hashing, and pointer join with pointers from R to S.]

Figure 15. Performance of alternative join methods.
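Figure 15 cannot be reproduced exactly from the formulas in this section alone; the sketch below only mimics its shape using simplified cost estimates that are consistent with the preceding discussion (block nested-loops join, merge-join with two external sorts whose merge depths are determined per input, and hash join whose recursion depth is determined by the smaller input) and the parameters stated in the text (memory of 100 KB in 8 KB clusters, fan-in and fan-out of 10, S = 10 × R). The exact cost functions, and the pointer-join variants, are deliberately omitted.

    import math

    M, F = 100 // 8, 10     # memory of 100 KB in 8 KB clusters; merge fan-in / partition fan-out

    def levels(n):
        """Number of merge levels or partitioning recursion levels for n pages."""
        return 0 if n <= M else math.ceil(math.log(n / M, F))

    def nested_loops_io(R, S):
        # Block nested-loops join; the larger input is the outer input (K = 1).
        outer, inner = max(R, S), min(R, S)
        return math.ceil(outer / (M - 1)) * (inner - 1) + 1

    def merge_join_io(R, S):
        # Merge-join with two external sorts: roughly one write and one read of
        # each input per merge level, with levels determined per input.
        return 2 * R * levels(R) + 2 * S * levels(S)

    def hash_join_io(R, S):
        # Recursive partitioning without hybrid hashing: recursion depth is
        # determined by the smaller (build) input only, cf. 2 x logF(R/M) x (R + S).
        return 2 * (R + S) * levels(min(R, S))

    for R in range(100, 1600, 200):
        S = 10 * R
        print(R, nested_loops_io(R, S), merge_join_io(R, S), hash_join_io(R, S))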

ently than standard hybrid hash join, without hybrid hashing, bucket tuning, relation S is not partitioned. This algo- or dynamic destaging; and pointer joins rithm performs somewhat faster than with pointers from R to S and from S to pointer-based merge-join if it keeps some R without grouping pointers to the same partitions of R in memory and if sorting target page together. This comparison is writes all R tuples into runs before merg- not precise; its sole purpose is to give a ing them. rough idea of the relative performance of Pointer-based join algorithms tend to the algorithm groups, deliberately ignor- outperform their standard value-based ing the many tricks used to improve and counterparts in many situations, in par- fine-tune the basic algorithms. The rela- ticular if only a small fraction of S actu- tion sizes vary; S is always 10 times ally participates in the join and can be larger than R. The memory size is 100 selected effectively using the pointers in KB; the cluster size is 8 KB; merge fan-in R. Historically, due to the difficulty of and partitioning fan-out are 10; and the correctly maintaining pointers (nones- number of R-records per cluster is 20. sential links ), they were rejected as a It is immediately obvious in Figure 15 relational access method in System R that nested-loops join is unsuitable for [Chamberlain et al. 1981al and subse- medium-size and large relations, because quently in basically all other systems, the cost of nested-loops join is propor- perhaps with the exception of Kooi’s tional to the size of the Cartesian prod- modified Ingres [Kooi 1980; Kooi and uct of the two inputs. Both merge-join Frankforth 1982]. However, they were (sorting) and hash join have logarithmic reevaluated and implemented in the cost functions; the sudden rise in merge- Starburst project, both as a test of Star- join and hash join cost around R = 1000 burst’s extensibility and as a means of is due to the fact that additional parti- supporting “more object-oriented” modes tioning or merging levels become neces- of operation [Haas et al. 1990]. sary at that point. The sort-based merge-join is not quite as fast as hash join because the merge levels are deter- 5.5 A Rough Performance Comparison mined individually for each file, includ- Figure 15 shows an approximate perfor- ing the bigger S file, while only the mance comparison using the cost formu- smaller build relation R determines the las developed above for block nested-loops partitioning depth of hash join. Pointer join; merge-join with sorting both inputs joins are competitive with hash and without optimized merging; hash join merge-joins due to their linear cost func-

tion, but only when the pointers are embedded in the smaller relation R. When S-records point to R-records, the cost of the pointer join is even higher than for nested-loops join.

The important point of Figure 15 is to illustrate that pointer joins can be very efficient or very inefficient, that one-to-one match algorithms based on nested-loops join are not competitive for medium-size and large inputs, and that sort- and hash-based algorithms for one-to-one match operations both have logarithmic cost growth. Of course, this comparison is quite naive since it uses only the simplest form of each algorithm. Thus, a comparison among alternative algorithms in a query optimizer must use the precise cost function for the available algorithm variant.

6. UNIVERSAL QUANTIFICATION¹²

Universal quantification permits queries such as "find the students who have taken all database courses"; the difference to one-to-one match operations is that a student qualifies because his or her transcript matches an entire set of courses, not only one item as in an existentially quantified query (e.g., "find students who have taken a (at least one) database course") that can be executed using a semi-join. In the past, universal quantification has been largely ignored for four reasons. First, typical database applications, e.g., record-keeping and accounting applications, rarely require universal quantification. Second, it can be circumvented using a complex expression involving a Cartesian product. Third, it can be circumvented using complex aggregation expressions. Fourth, there seemed to be a lack of efficient algorithms.

The first reason will not remain true for database systems supporting logic programming, rules, and quantifiers, and algorithms for universal quantification will become more important. The second reason is valid; however, the substitute expressions are very slow to execute because of the Cartesian product. The third reason is also valid, but replacing a universal quantifier may require very complex aggregation clauses that are easy to "get wrong" for the database user. Furthermore, they might be too complex for the optimizer to recognize as universal quantification and to execute with a direct algorithm. The fourth reason is not valid; universal quantification algorithms can be very efficient (in fact, as fast as semi-join, the operator for existential quantification), useful for very large inputs, and easy to parallelize. In the remainder of this section, we discuss sort- and hash-based direct and indirect (aggregation-based) algorithms for universal quantification.

In the relational world, universal quantification is expressed with the universal quantifier in relational calculus and with the division operator in relational algebra. We will explain algorithms for universal quantification using relational terminology. The running example in this section uses the relations Student (student-id, name, major), Course (course-no, title), Transcript (student-id, course-no, grade), and Requirement (major, course-no) with the obvious key attributes. The query to find the students who have taken all courses can be expressed in relational algebra as

π_student-id, course-no(Transcript) ÷ π_course-no(Course).

The projection of the Transcript relation is called the dividend, the projection of the Course relation the divisor, and the result relation the quotient. The quotient attributes are those attributes of the dividend that do not appear in the divisor. The dividend relation semi-joined with the divisor relation and projected on the quotient attributes, in the example the set of student-ids of Students who have taken at least one course, is called the set of quotient candidates here.

12 This section is a summary of earlier work [Graefe 1989; Graefe and Cole 1993].
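Before turning to algorithms, the effect of the division operator can be pinned down in a few lines of Python. This is only a statement of the quotient's definition over in-memory sets, with sample data modeled on the Jack/Jill example used later with Figure 16; it is not one of the division algorithms discussed below.

    def divide(dividend, divisor):
        """dividend: set of (quotient-attribute, divisor-attribute) pairs;
        divisor: set of divisor-attribute values.  The quotient contains every
        quotient value that appears with all divisor values."""
        candidates = {q for (q, d) in dividend}              # quotient candidates
        return {q for q in candidates
                if divisor <= {d for (q2, d) in dividend if q2 == q}}

    transcript = {("Jack", "Intro to Databases"),
                  ("Jill", "Intro to Databases"),
                  ("Jill", "Intro to Graphics"),
                  ("Jill", "Readings in Databases")}
    db_courses = {"Intro to Databases", "Readings in Databases"}
    print(divide(transcript, db_courses))                    # {'Jill'}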


Some universal quantification queries database courses both in the dividend seem to require relational division but (the Transcript relation) and in the divi- actually do not. Consider the query for sor (the Course relation). Counting only the students who have taken all courses database courses might be easy for the required for their major. This query can divisor relation, but requires a semi-join be answered with a sequence of one-to- of the dividend with the divisor relation one match operations. A join of Student to propagate the restriction on the divi- and Requirement projected on the stu- sor to the dividend if it is not known a dent-id and course-no attributes minus priori whether or not referential in- the Transcript relation can be projected tegrity holds between the dividend’s divi- on student-ids to obtain a set of students sor attributes and the divisor, i.e., who have not taken all their require- whether or not there are divisor attribute ments. An anti-semi-join of the Student values in the dividend that cannot be relation with this set finds the students found in the divisor. For example, course- who have satisfied all their require- nos in the Transcript relation that do not ments. This sequence will have accept- pertain to database courses (and are able performance because its required therefore not in the divisor) must be re- set-matching algorithms (join, difference, moved from the dividend by a semi-join anti-semi-join) all belong to the family of with the divisor. In general, if the divisor one-to-one match operations, for which is the result of a prior selection, any efficient algorithms are available as dis- referential integrity constraints known cussed in the previous section. for stored relations will not hold and must Division algorithms differ not only in be explicitly enforced using a semi-join. their performance but also in how they Furthermore, in order to ensure correct fit into complex queries, Prior to the divi- counting, duplicates have to be removed sion, selections on the dividend, e.g., only from either input if the inputs are projec- Transcript entries with “A” grades, or on tions on nonkey attributes. the divisor, e.g., only the database There are four methods to compute the courses, may be required. Restrictions on quotient of two relations, a sort- and a the dividend can easily be enforced with- hash-based direct method, and sort- and out much effect on the division operation, hash-based aggregation. Table 6 shows while restrictions on the divisor can im- this classification of relational division ply a significant difference for the query algorithms. Methods for sort- and hash- evaluation plan. Subsequent to the divi- based aggregation and the possible sort- sion operation, the resulting quotient re- or hash-based semi-join have already lation (e.g., a set of student-ids) may be been discussed, including their variants joined with another relation, e.g., the for inputs larger than memory and their Student relation to obtain student cost functions. Therefore, we focus here names. Thus, obtaining the quotient in a on the direct division algorithms. form suitable for further processing (e.g., The sort-based direct method, pro- join or semi-join with a third relation) posed by Smith and Chang [1975] and can be advantageous. called naizw diuision here, sorts the Typically, universal quantification can divisor input on all its attributes and the easily be replaced by aggregations. 
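The aggregation-based rewriting just mentioned can be sketched as follows. This is a hedged illustration rather than any specific system's plan; the semi-join with the divisor and the duplicate removal mirror the caveats discussed in the surrounding text.

    def divide_by_counting(dividend, divisor):
        """Indirect division: count, per quotient candidate, the distinct divisor
        values it occurs with, and keep the candidates whose count equals the
        divisor's cardinality."""
        divisor = set(divisor)                         # duplicate removal on the divisor
        restricted = {(q, d) for (q, d) in dividend    # semi-join with the divisor;
                      if d in divisor}                 # set semantics removes duplicates
        counts = {}
        for (q, _) in restricted:
            counts[q] = counts.get(q, 0) + 1
        return {q for q, c in counts.items() if c == len(divisor)}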
(In- dividend relation with the quotient at- tuitively, all universal quantification can tributes as major and the divisor at- be replaced by aggregation. However, we tributes as minor sort keys. It then pro- have not found a proof for this statement.) ceeds with a merging scan of the two For example, the example query about sorted inputs to determine which items database courses can be restated as “find belong in the quotient. Notice that the the students who have taken as many scan can be programmed such that it database courses as there are database ignores duplicates in either input (in case course s.” When specifying the aggregate those had not been removed yet in the function, it is important to count only sort) as well as dividend items that do


Table 6. Classification of Relational Division Algorithms

                           Based on Sorting                              Based on Hashing

Direct                     Naive division                                Hash-division

Indirect by semi-join      Sorting with duplicate removal, merge-join,   Hash-based duplicate removal, hybrid hash
and aggregation            sorting with aggregation                      join, hash-based aggregation
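Hash-division, the direct hash-based entry in Table 6, is described in detail in the following paragraphs; for orientation, here is a minimal in-memory sketch of its core (a divisor table assigning sequence numbers and a quotient-candidate table with one bit per divisor item). It is only an illustration, not the Volcano implementation referred to below, and it omits overflow handling and partitioning.

    def hash_division(dividend, divisor):
        """Direct hash-based division: one hash table for the divisor (assigning
        sequence numbers) and one for quotient candidates, each candidate keeping
        a bit map indexed by those sequence numbers."""
        divisor_table = {}                        # divisor value -> sequence number
        for d in divisor:
            divisor_table.setdefault(d, len(divisor_table))

        candidates = {}                           # quotient value -> bit map (as int)
        for (q, d) in dividend:
            seq = divisor_table.get(d)
            if seq is None:                       # no divisor match: ignore immediately
                continue
            candidates[q] = candidates.get(q, 0) | (1 << seq)   # duplicates just re-set a bit

        all_bits = (1 << len(divisor_table)) - 1
        return {q for q, bits in candidates.items() if bits == all_bits}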

not refer to items in the divisor. Thus, Stadent Coarse neither a preceding semi-join nor explicit Jack Intro to Databases duplicate removal steps are necessary for naive division. The 1/0 cost of naive di- Jill Intro to Databases vision is the cost of sorting the two in- Jill Intro to Graphics puts plus the cost of repeated scans of Jill Readings in Databases the divisor input. Figure 16 shows two tables, a dividend and a divisor, properly sorted for naive division. Concurrent scans of the “Jack” tuples (only one) in the dividend and of the entire divisor determine that “Jack Figure 16. Sorted inputs into naive division is not part of the quotient because he has not taken the “Readings in Databases” course. A continuing scan through the removal during insertion into the divisor “Jill” tuples in the dividend and a new table) and automatically ignores dupli- scan of the entire divisor include “Jill” in cates in the dividend as well as dividend the output of the naive division. The fact items that do not refer to items in the that “Jill” has also taken an “Intro to divisor (e.g., the AI course in the exam- Graphics” course is ignored by a suitably ple). Thus, neither prior semi-join nor general scan logic for naive division. duplicate removal are required. However, The hash-based direct method, called if both inputs are known to be duplicate hash-division, uses two hash tables, one free, the bit maps can be replaced by for the divisor and one for the quotient counters. Furthermore, if referential in- candidates. While building the divisor tegrity is known to hold, the divisor table table, a unique sequence number is as- can be omitted and replaced by a single signed to each divisor item. After the counter. Hash-division, including these divisor table has been built, the dividend variants, has been implemented in the is consumed. For each quotient candi- Volcano query execution engine and has date, a bit map is kept with one bit for shown better performance than the other each divisor item. The bit map is indexed three algorithms [Graefe 1989; Graefe with the sequence numbers assigned to and Cole 1993]. In fact, the performance the divisor items. If a dividend item does of hash-division is almost equal to a not match with an item in the divisor hash-based join or semi-join of table, it can be ignored immediately. dividend and divisor relations (a semi- Otherwise, a quotient candidate is either join corresponds to existential quantifica- found or created, and the bit correspond- tion), making universal quantification ing to the matching divisor item is set. and relational division realistic opera- When the entire dividend has been con- tions and algorithms to use in database sumed, the quotient consists of those applications. quotient candidates for which all bits are The aspect of hash-division that makes set. it an efficient algorithm is that the set of This algorithm can ignore duplicates in matches between a quotient candidate the divisor (using hash-based duplicate and the divisor is represented efficiently

ACM Computing Surveys, Vol. 25, No 2, June 1993 114 “ Goetz Graefe using a bit map. Bit maps are one of the ample above are partitioned into under- standard data structures to represent graduate and graduate courses, the final sets, and just as bit maps can be used for result consists of the students who have a number of set operations, the bit maps taken all undergraduate courses and all associated with each quotient candidate graduate courses, i.e., those that can be can also be used for a number of opera- found in the division result of each parti- tions similar to relational division. For tion. In quotient partitioning, the entire example, Carlis [ 1986] proposed a gener- divisor must be kept in memory for all alized division operator called “HAS” that partitions. The final result is the concate- nation (union) of all partial results. For included relational division as a s~ecialL case. The hash-division algorithm can example, if Transcript items are parti- easily be extended to compute quotient tioned by odd and even student-ids, the candidates in the dividend that match a final result is the union (concatenation) majority or given fraction of divisor items of all students with odd student-id who as well as (with one more bit in each bit have taken all courses and those with map) quotient candidates that do or do even student-id who have taken all not match exactly the divisor items courses. If warranted by the input data, For real queri~s containing a division, divisor partitioning and quotient parti- consider the operation that frequently tioning can be combined. follows a division. In the example, a Hash-division can be modified into an user is typically not really interested in algorithm for duplicate removal. Con- student-ids only but in information about sider the problem of removing duplicates the students. Thus, in many cases, rela- from a relation R(X, Y) where X and Y tional division results will be used to are suitably chosen attribute groups. This select items from another relation using relation can be stored using two hash a semi-join. The sort-based algorithms tables, one storing all values of X (simi- produce their output sorted, which will lar to the divisor table) and assigning facilitate a subsequent (semi-) merge-join. each of them a unique sequence number, The hash-based algorithms produce their the other storing all values of Y and bit output in hash order; if overflow oc- maps that indicate which X values have curred, there is no m-edictable. order at occurred with each Y value. Consider a all. However, both aggregation-based and brief example for this algorithm: Say re- direct hash-based algorithms use a hash lation R(X, Y) contains 1 million tuples, table on the quotient attributes, which but only 100,000 tuples if duplicates were may be used immediately for a subse- removed. Let X and Y be each 100 bytes quent (semi-) join. It seems quite long (total record size 200), and assume straightforward to use the same hash there are 4,000 unique values of each X table for the aggregation and a subse- and Y. For the standard hash-based du- quent join as well as to modify hash- plicate removal algorithm, 100,000 x 200 division such that it removes quotient bytes of memory are needed for duplicate candidates from the quotient table that removal without use of temporary files. do not belong to the final quotient and For the redesigned hash-division algo- then performs a semi-join with a third rithm, 2 x 4,000 x 100 bytes are needed input relation. 
for data values, 4,000 x 4 for unique se- If the two hash tables do not fit into quence numbers, and 4,000 x 4,000 bits memory, the divisor table or the quotient for bit maps. Thus, the new algorithm table or both can be partitioned, and in- works efficiently with less than 3 MB of dividual partitions can be held on disk memory while conventional duplicate re- for processing in multiple steps. In diui- moval requires slightly more than 19 MB sor partitioning, the final result consists of memory, or seven times more than the of those items that are found in all duplicate removal algorithm adapted partial results; the final result is the from hash-division. Clearly, choosing at- intersection of all partial results. For ex- tribute groups X and Y to find attribute ample, if the Course relations in the ex- groups with relatively few unique values


is crucial for the performance and mem- Both approaches permit in-memory ver- ory efficiency of this new algorithm. Since sions for small data sets and disk-based such knowledge is not available in most versions for larger data sets. If a data set systems and queries (even though some fits into memory, quicksort is the sort- efficient and helpful algorithms exist, e.g., based method to manage data sets while Astrahan et al. [1987]), optimizer heuris- classic (in-memory) hashing can be used tics for choosing this algorithm might be as a hashing technique. It is interesting difficult to design and verify. to note that both quicksort and classic To summarize the discussion on uni- hashing are also used in memory to oper- versal quantification algorithms, aggre- ate on subsets after “cutting” an entire gation can be used in systems that lack large data set into pieces. The cutting direct division algorithms, and hash- process is part of the divide-and-conquer division performs universal quantifica- paradigm employed for both sort- and tion and relational division generally, i.e., hash-based query-processing algorithms. it covers cases with duplicates in the in- This important similarity of sorting and puts and with referential integrity viola- hashing has been observed before, e.g., tions, and efficiently, i.e., it permits par- by Bratbergsengen [ 1984] and Salzberg titioning and using hybrid hashing tech- [1988]. There exists, however, an impor- niques similar to hybrid hash join, mak- tant difference. In the sort-based algo- ing universal quantification (division) as rithms, a large data set is divided into fast as existential quantification (semi- subsets using a physical rule, namely into join). As will be discussed later, it can chunks as large as memory. These chunks also be effectively parallelized. are later combined using a logical step, merging. In the hash-based algorithms, large inputs are cut into subsets using a 7. DUALITY OF SORT- AND HASH-BASED logical rule, by hash values. The result- QUERY PROCESSING ALGORITHMS’4 ing partitions are later combined using a

We conclude the discussion of individual physical step, i.e., by simply concatenat- query processing by outlining the many ing the subsets or result subsets. In other existing similarities and dualities of sort- words, a single-level merge in a sort algo- and hash-based query-processing algo- rithm is a dual to partitioning in hash rithms as well as the points where the algorithms. Figure 17 illustrates this du- two types of algorithms differ. The pur- ality and the opposite directions. pose is to contribute to a better under- This duality can also be observed in standing of the two approaches and their the behavior of a disk arm performing tradeoffs. We try to discuss the ap- the 1/0 operations for merging or parti- proaches in general terms, ignoring tioning. While writing initial runs after whether the algorithms are used for rela- sorting them with quicksort, the 1/0 is tional join, union, intersection, aggre- sequential. During merging, read opera- gation, duplicate removal, or other tions access the many files being merged operations. Where appropriate, however, and require random 1/O capabilities. we indicate specific operations. During partitioning, the 1/0 operations Table 7 gives an overview of the fea- are random, but when reading a parti- tures that correspond to one another. tion later on, they are sequential. For both approaches, sorting and hash- ing, the amount of available memory lim- its not only the amount of data in a basic 14 ~art~ of ~hi~ section have been derived from unit processed using quicksort or classic Graefe et al. [1993], which also provides experimen- hashing, but also the number of basic tal evidence for the relative performance of sort- units that can be accessed simultane- and hash-based query processing algorithms and ously. For sorting, it is well known that discusses simple cases of transferring tuning ideas merging is limited to the quotient of from one type of algorithm to the other. The discus- sion of this section is continued in Graefe [ 1993a; memory size and buffer space required 1993 C]. for each run, called the merge fan-in.
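The limits just described can be stated in a few lines. The sketch below is an illustration with the exact accounting of input and output buffers simplified; it computes the merge fan-in, the partitioning fan-out, and the number of levels, which is the same for merging and for recursive partitioning given the same fan and input size.

    import math

    def merge_fan_in(memory_pages, run_buffer_pages):
        """Merge fan-in: memory divided by the buffer space required per run."""
        return memory_pages // run_buffer_pages

    def partition_fan_out(memory_pages, cluster_pages):
        """Partitioning fan-out: one output cluster per partition file, keeping
        one cluster as the input buffer (cf. F = M/C - 1 earlier in the text)."""
        return memory_pages // cluster_pages - 1

    def levels(input_pages, memory_pages, fan):
        """Merge levels for sorting, or recursion depth for partitioning: both
        change the unit of work by a factor of `fan` per level."""
        if input_pages <= memory_pages:
            return 0
        return math.ceil(math.log(input_pages / memory_pages, fan))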

ACM Computing Surveys, Vol 25, No. 2, June 1993 116 ● Goetz Graefe

Table 7. Duality of Sort- and Hash-Based Algorithms

Aspect                                    Sorting                                          Hashing
In-memory algorithm                       Quicksort                                        Classic hash
Divide-and-conquer paradigm               Physical division, logical combination           Logical division, physical combination
Large inputs                              Single-level merge                               Partitioning
I/O patterns                              Sequential write, random read                    Random write, sequential read
Temporary files accessed simultaneously   Fan-in                                           Fan-out
I/O optimizations                         Read-ahead, forecasting                          Write-behind
                                          Double-buffering, striping merge output          Double-buffering, striping partitioning input
Very large inputs                         Multilevel merge                                 Recursive partitioning
                                          Merge levels                                     Recursion depth
                                          Nonoptimal final fan-in                          Nonoptimal hash table size
Optimizations                             Merge optimizations                              Bucket tuning
Better use of memory                      Reverse runs & LRU                               Hybrid hashing
                                          Replacement selection                            ?
                                          ?                                                Single input in memory
Aggregation and duplicate removal         Aggregation in replacement selection             Aggregation in hash table
Algorithm phases                          Run generation, intermediate and final merge     Initial and intermediate partitioning,
                                                                                           in-memory (hybrid) hashing
Resource sharing                          Eager merging                                    Depth-first partitioning
                                          Lazy merging                                     Breadth-first partitioning
Partitioning skew and effectiveness       Merging run files of different sizes             Uneven output file sizes
"Item value"                              log (run size)                                   log (build partition size / original build input size)
Bit vector filtering                      For both inputs and on each merge level?         For both inputs and on each recursion level
Interesting orderings, multiple joins     Multiple merge-joins without sorting             N-ary partitioning and joins
                                          intermediate results
Interesting orderings: grouping/          Sorted grouping on foreign key useful            Grouping while building the hash table
aggregation followed by join              for subsequent join                              in hash join
Interesting orderings in index            B-trees feeding into a merge-join                Merging in hash value order
structures

In order to keep the merge process active at all times, many merge imple- mentations use read-ahead controlled by forecasting, trading reduced 1/0 delays for a reduced fan-in. In the ideal case, the bandwidths of 1/0 and processing (merging) match, and 1/0 latencies for both the merge input and output are hid- Figure 17. Duahty of partitioning and mergmg, den by read-ahead and double-buffering, as mentioned earlier in the section on sorting. The dual to read-ahead during merging is write-behind during partition- ing, i.e., keeping a free output buffer that Similarly, partitioning is limited to the can be allocated to an output file while same fraction, called the fan-out since the previous page for that file is being the limitation is encountered while writ- written to disk. There is no dual to fore- ing partition files. casting because it is trivial that the next

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 117 output partition to write to is the one for section on sorting). There is no direct which an output cluster has just filled dual in hash-based algorithms for this up. Both read-ahead in merging and optimization. With respect to memory write-behind in partitioning are used to utilization, the fact that a partition file ensure that the processor never has to and therefore a hash table might actu- wait for the completion of an 1/0 opera- ally be smaller than memory is the clos- tion. Another dual is double-buffering and est to a dual. Utilizing memory more striping over multiple disks for the out- effectively and using less than the maxi- put of sorting and the input of mal fan-out in hashing has been ad- partitioning. dressed in research on bucket tuning Considering the limitation on fan-in [Kitsuregawa et al. 1989a] and on his- and fan-out, additional techniques must togram-driven recursive hybrid hash join be used for very large inputs. Merging [Graefe 1993a]. can be performed in multiple levels, each The development of hybrid hash algo- combining multiple runs into larger ones. rithms [DeWitt et al. 1984; Shapiro 1986] Similarly, partitioning can be repeated was a consequence of the advent of large recursively, i.e., partition files are repar- main memories that had led to the con- titioned, the results repartitioned, etc., sideration of hash-based join algorithms until the partition files flt into main in the first place. If the data set is only memory. In sorting and merging, the runs slightly larger than the available mem- grow in each level by a factor equal to ory, e.g., 109%0larger or twice as large, the fan-in. In partitioning, the partition much of the input can remain in memory files decrease in size by a factor equal to and is never written to a disk-resident the fan-out in each recursion level. Thus, partition file. To obtain the same effect the number of levels during merging is for sort-based algorithms, if the database equal to the recursion depth during par- system’s buffer manager is sufficiently titioning. There are two exceptions to be smart or receives and accepts appropri- made regarding hash value distributions ate hints, it is possible to retain some or and relative sizes of inputs in binary op- all of the pages of the last run written in erations such as join; we ignore those for memory and thus achieve the same effect now and will come back to them later. of saving 1/0 operations, This effect can If merging is done in the most naive be used particularly easily if the initial way, i.e., merging all runs of a level as runs are written in reverse (descending) soon as their number reaches the fan-in, order and scanned backward for merg- the last merge on each level might not be ing. However, if one does not believe in optimal. Similarly, if the highest possible buffer hints or prefers to absolutely en- fan-out is used in each partitioning step, sure these 1/0 savings, then using a the partition files in the deepest recur- final memory-resident run explicitly in sion level might be smaller than mem- the sort algorithm and merging it with ory, and less than the entire memory is the disk-resident runs can guarantee this used when processing these files. Thus, effect. in both approaches the memory re- Another well-known technique to use sources are not used optimally in the memory more effectively and to improve most naive versions of the algorithms. 
sort performance is to generate runs In order to make best use of the final twice as large as main memory using a merge (which, by definition, includes all priority heap for replacement selection output items and is therefore the most [Knuth 1973], as discussed in the earlier expensive merge), it should proceed with section on sorting. If the runs’ sizes are the maximal possible fan-in. Making best doubled, their number is cut in half. use of the final merge can be ensured by Therefore, merging can be reduced by merging fewer runs than the maximal some amount, namely log~(2) = fan-in after the end of the input file has l/logs(F) merge levels. This optimiza- been reached (as discussed in the earlier tion for sorting has no direct dual in the

ACM Computing Surveys, Vol. 25, No. 2, June 1993 118 0 Goetz Graefe realm of hash-based query-processing If an iterator interface is used for both algorithms. its input and output, and therefore mul- If two sort operations produce input tiple operators overlap in time, a sort data for a binarv -. ooerator such as a operator can be divided into three dis- merge-join and if both sort operators’ fi- tinct algorithm phases. First, input items nal merges are interleaved with the join, are consumed and sorted into initial runs. each final merge can employ only half Second, intermediate merging reduces the memorv. In hash-based one-to-one the number of runs such that only one match algo~ithms, only one of the two final merge step is left. Third, the final inputs resides in and consumes memory merge is performed on demand from the beyond a single input buffer, not both as consumer of the sorted data stream. Dur- in two final merges interleaved with a ing the first phase, the sort iterator has merge-join. This difference in the use of to share resources, most notably memory the two inputs is a distinct advantage of and disk bandwidth, with its producer hash-based one-to-one match al~orithms operators in a query evaluation plan. that does not have a dual in s~rt-based Similarly, the third phase must share algorithms. resources with the consumers. Interestingly, these two differences of In many sort implementations, namely sort- and hash-based one-to-one match those using eager merging, the first and algorithms cancel each other out. Cutting second phase interleave as a merge step the number of runs in half (on each merge is initiated whenever the number of runs level, including the last one) by using on one level becomes equal to the fan-in. replacement selection for run generation Thus, some intermediate merge steps exactly offsets this disadvantage of sort- cannot use all resources. In lazy merging, based one-to-one match operations. which starts intermediate merges only Run generation using replacement se- after all initial runs have been created, lection has a second advantage over the intermediate merges do not share quicksort; this advantage has a direct resources with other operators and can dual in hashing. If a hash table is used to use the entire memory allocated to a compute an aggregate function using query evaluation plan; “thus, intermedi- grouping, e.g., sum of salaries by depart- ate merges can be more effective in lazy ment, hash table overflow occurs only if merging than in eager merging. the operation’s output does not fit in Hash-based query processing algo- memory. Consider, for example, the sum rithms exhibit three similar phases. of salaries by department for 100,000 First, the first partitioning step executes employees in 1,000 departments. If the concurrently with the input operator or 1,000 result records fit in memory, clas- operators. Second, intermediate parti- sic hashing (without overflow) is suffi- tioning steps divide the partition files to cient. On the other hand, if sorting based ensure that they can be processed with on quicksort is used to compute ;his ag- hybrid hashing. Third, hybrid and in- gregate function, the input must fit into memory hash methods process these par- memory to avoid temporary files.15 If re- tition files and produce output passed to placement selection is used for run gen- the consumer operators. As in sorting, eration, however, the same behavior as the first and third phases must share with classic hashing is easy to achieve. 
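Run generation by replacement selection, referred to repeatedly in this section, can be sketched with a priority heap as follows. This is a memory-resident toy version (records are plain keys, runs are returned as lists) intended only to show why runs come out about twice the size of memory on randomly ordered input.

    import heapq

    def replacement_selection(records, memory_capacity):
        """Run generation with a priority heap (replacement selection): each
        record is tagged with a run number, and a record smaller than the key
        just written is deferred to the next run."""
        it = iter(records)
        heap = []
        for _ in range(memory_capacity):
            try:
                heap.append((0, next(it)))
            except StopIteration:
                break
        heapq.heapify(heap)

        runs = []                       # list of runs, each a sorted list of keys
        current_run = -1
        while heap:
            run_no, key = heapq.heappop(heap)
            if run_no != current_run:
                runs.append([])         # start a new run file
                current_run = run_no
            runs[-1].append(key)
            try:
                nxt = next(it)
            except StopIteration:
                continue
            # A smaller key cannot go into the current run; defer it to the next one.
            heapq.heappush(heap, (run_no if nxt >= key else run_no + 1, nxt))
        return runs

    # Example: replacement_selection([5, 3, 8, 1, 9, 2, 7], 3) yields two sorted runs.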
resources with other concurrent opera- tions in the same query evaluation plan. The standard implementation of hash- based query processing algorithms for verv large in~uts uses recursion, i.e., the ori~inal ‘algo~ithm is invoked for each 15A scheme usuw aulcksort and avoiding tem~o- partition file (or pair of partition files). rary 1/0 m this ~as~ can be devised but ~ould’be extremely cumbersome; we do not know of any While conce~tuallv sim~le. this method report or system with such a scheme. has the di~adva~tage ‘ that output is

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques 8 119 produced before all intermediate parti- I I tioning steps are complete. Thus, the op- n erators that consume the output must ,... A ~es allocate resources to receive this output, typically memory (e.g., a hash table). ‘i” k El Partitioning I Further intermediate partitioning steps * Merging - will have to share resources with the < consumer operators, making them less effective. We call this direct recursive im- Figure 18. Partitioning skew, plementation of hash-based partitioning depth-first partitioning and consider its behavior as well as its resource sharing must increase as partition files get and performance effects a dual to eager smaller in hash-based query processing merging in sorting. The alternative algorithms. For sorting, a suitable choice schedule is breadth-first partitioning, for such a value is the logarithm of the which completes each level of partition- run size [Graefe 1993c]. The value of a ing before starting the next one. Thus, sorted run is the product of the run’s size hybrid and in-memory hashing are not and the size’s logarithm. The optimal initiated until all partition files have be- merge effectiveness is achieved if each come small enough to permit hybrid and item’s value increases in each merge step in-memory hashing, and intermediate by the logarithm of the fan-in, and the partitioning steps never have to share overall value of all items increases with resources with consumer operators. this logarithm multiplied with the data Breadth-first partitioning is a dual to lazy volume participating in the merge step. merging, and it is not surprising that However, only if all runs in a merge step they are both equally more effective than are of the same size will the value of all depth-first partitioning and eager merg- items increase with the logarithm of the ing, respectively. fan-in. It is well known that partitioning skew In hash-based query processing, the reduces the effectiveness of hash-based corresponding value is the fraction of a algorithms. Thus, the situation shown in partition size relative to the original in- Figure 18 is undesirable. In the extreme put size [Graefe 1993c]. Since only the case, one of the partition files is as large build input determines the number of as the input, and an entire partitioning recursion levels in binary hash partition- step has been wasted. It is less well rec- ing, we consider only the build partition. ognized that the same issue also pertains If the partitioning is skewed, i.e., output to sort-based query processing algo- partition files are not of uniform length, rithms [Graefe 1993c]. Unfortunately, in the overall effectiveness of the partition- order to reduce the number of merge ing step is not optimal, i.e., equal to the steps, it is often necessary to merge files logarithm of the partitioning fan-out. from different merge levels and therefore Thus, preventing or managing skew in of different sizes. In other words, the partitioning hash functions is very im- goals of optimized merging and of maxi- portant [Graefe 1993a]. mal merge effectiveness do not always Bit vector filtering, which will be dis- match, and very sophisticated merge cussed later in more detail, can be used plans, e.g., polyphase merging, might be for both sort- and hash-based one-to-one required [Knuth 1973]. match operations, although it has been The same effect can also be observed if used mainly for parallel joins to date. 
“values” are attached to items in runs Basically, a bit vector filter is a large and in partition files. Values should re- array of bits initialized by hashing items flect the work already performed on an in the first input of a one-to-one match item. Thus, the value should increase operator and used to detect items in the with run sizes in sorting while the value second input that cannot possibly have a

ACM Computing Surveys, Vol. 25, No. 2, June 1993 120 “ Goetz Graefe , match in the first input. In effect, bit Merge-Join vector filtering reduces the second input to the items that truly participate in the /“\ Probe Vector 2 binary operation plus some “false passes” sort due to hash collisions in the bit vector I I filter, In a merge-join with two sort oper- sOrt Build Vector 2 ations, if the bit vector filter is used be- fore the second sort, bit vector filtering is I I Build Vector 1 Probe Vector 1 as effective as in hybrid hash join in reducing the cost of processing the sec- I I ond input. In merge-join, it can also be Scan Input 1 Scan Input 2 used symmetrically as shown in Figure 19. Notice that for the right input, bit Figure 19. Merge-join with symmetric bit vector filtering. vector filtering reduces the sort input size, whereas for the left input, it only reduces the merge-join input. In recur- sive hybrid hash join, bit vector filtering same attribute can be performed without can be used in each recursion level. The sorting intermediate join results. For effectiveness of bit vector filtering in- joining three relations, as shown in Fig- creases in deeper recursion levels, be- ure 20, pipelining data from one merge- cause the number of distinct data values join to the next without sorting trans- in each ~artition file decreases. thus re- lates into a 3:4 advantage in the number ducing tie number of hash collisions and of sort operations compared to two joins false passes if bit vector filters of the on different join keys, because the inter- same size are used in each recursion level. mediate result 01 does not need to be Moreover, it can be used in both direc- sorted. For joining N relations on the tions, i.e.: to reduce the second input us- same key, only N sorts are required in- ing a bit vector filter based on the first stead of 2 x N – 2 for joins on different input and to reduce the first input (in the attributes. Since set operations such as next recursion level) using a bit vector the union or intersection of N sets can filter based on the second input. The same always be performed using a merge-join effect could be achieved for sort-based algorithm without sorting intermediate binary operations requiring multilevel results, the effect of interesting orderings sorting and merging, although to do so is even more important for set operations implies switching back and forth be- than for relational joins. tween the two sorts for the two inputs Hash-based algorithms tend to produce after each merge level. Not surprisingly, their outputs in a very unpredictable or- switching back and forth after each merge der, depending on the hash function and level would be the dual to the .nartition- on overflow management. In order to take ing process of both inputs in recursive advantage of multiple joins on the same hybrid hash join. However, sort operators attribute (or of multiple intersections, that switch back and forth on each merge etc.) similar to the advantage derived level are not only complex to implement from interesting orderings in sort-based but may also inhibit the merge optimiza- query processing, the equality of at- tion discussed earlier. tributes has to be exploited during the The final entries in Table 7 concern logical step of hashing, i.e., during parti- interesting orderings used in the System tioning. In other words, such set opera- R query optimizer [Selinger et al. 
1979] tions and join queries can be executed and presumably other query optimizers effectively by a hash join algorithm that as well. A strong argument in favor of recursively partitions N inputs concur- sorting and merge-join is the fact that rently. The recursion terminates when merge-join delivers its output in sorted N – 1 inputs fit into memory and when order; thus, multiple merge-joins on the the Nth input is used to probe N – 1

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 121

Merge-Join b=b Merge-Join a=a /\ Sort on b Sort on b /\ 01I I Merge-Join a=a Sort on a Merge-Join a=a /\ I Sort on a Sort on a Input 13 / \ ‘n’”’” Sort on a Sort on a I I I I Input 11 Input 12 Input 11 Input 12

Figure 20. The effect of interesting orderings.

derings is an aggregation followed by a join. Many aggregations condense infor- mation about individual entities; thus, the aggregation operation is performed on a relation representing the “many” side of a many-to-one relationship or on the relation that represents relationship instances of a many-to-many relation- ship. For example, students’ grade point averages are computed by grouping and averaging transcript entries in a many- to-many relationship called transcript Figure 21. Partitioning in a multiinput hash join. between students and courses. The im- portant point to note here and in many similar situations is the grouping at- hash tables. Thus, the basic operation of tribute is a foreign key. In order to relate this N-ary join (intersection, etc.) is an the aggregation output with other infor- N-ary join of an N-tuple of partition files, mation pertaining to the entities about not pairs as in binary hash join with one which information was condensed, aggre- build and one m-obe file for each ~arti- gations are frequently followed by a join. tion. Figure 21 ‘illustrates recursiv~ par- If the grouping operation is based on titioning for a join of three inputs. In- sorting (on the grouping attribute, which stead of partitioning and joining a pair of very frequently is a foreign key), the nat- in~uts and ~airs of ~artition files as in ural sort order of the aggregation output tr~ditional binary hybrid hash join, there can be exploited for an efficient merge- are file triples (or N-tuples) at each step. join without sorting. However, N-ary recursive partitioning While this seems to be an advantage of is cumbersome to implement, in particu- sort-based aggregation and join, this lar if some of the “join” operations are combination of operations also permits a actually semi-join, outer join, set inter- special trick in hash-based query pro- section, union, or difference. Therefore, cessing [Graefe 1993 b]. Hash-based ag- until a clean implementation method for gregation is based on identifying items of hash-based N-ary matching has been the same group while building the hash found, it might well be that this distinc- table. At the end of the operation, the tion. /.,ioins on the same or on different hash table contains all output items attributes, contributes to the right choice hashed on the grouping attribute. If the between sort- and hash-based algorithms grouping attribute is the join attribute in for comdex aueries. the next operation, this hash table can Anot~er si~uation with interesting or- immediately be probed with the other

ACM Computing Surveys, Vol. 25, No 2, June 1993 122 “ Goetz Graefe join input. Thus, the combined aggrega- very poorly and create significantly tion-join operation uses only one hash higher costs than sorting and merge-join. table, not two hash tables as two sepa- If the quality of the hash function cannot rate o~erations would do. The differences be predicted or improved (tuned) dynam- to tw~ separate operations are that only ically [Graefe 1993a], sort-based query- one join input can be aggregated effi- processing algorithms are superior be- ciently and that the aggregated input cause they are less vulnerable to nonuni- must be the join’s build input. Both is- form data distributions. Since both cases, sues could be addressed by symmetric join of differently sized files and skewed hash ioins with a hash table on each of hash value distributions, are realistic sit- the in”~uts which would be as efficient as uations in database query processing, we sorting and grouping both join inputs. recommend that both sort- and hash- A third use of interesting orderings is based algorithms be included in a query- the positive interaction of (sorted, B-tree) processing engine and chosen by the index scans and merge-join. While it has query optimizer according to the two not been reported explicitly in the litera- cases above. If both cases arise simulta- ture. the leaves and entries of two hash neously, i.e., a join of differently sized indices can be merge-joinedjust like those inputs with unpredictable hash value of two B-trees, provided the same hash distribution, the query optimizer has to function was used to create the indices. estimate which one poses the greater For example, it is easy to imagine “merg- danger to system performance and choose ing” the leaves (data pages) of two ex- accordingly. tendible hash indices [Fagin et al. 1979], The important conclusion from these even if the key cardinalities and distribu- dualities is that neither the absolute in- tions are verv different. put sizes nor the absolute memory size In summa~y, there exist many duali- nor the input sizes relative to the mem- ties between sorting using multilevel ory size determine the choice between merging and recursive hash table over- sort- and hash-based query-processing flow management. Two special cases ex- algorithms. Instead, the choice should be ist which favor one or the other, however. governed by the sizes of the two inputs First, if two join inputs are of different into binary operators relative to each size (and the query optimizer can reli- other and by the danger of performance ably predict this difference), hybrid hash impairments due to skewed data or hash join outperforms merge-join because only value distributions. Furthermore, be- the smaller of the two inputs determines cause neither algorithm type outper- what fraction of the input files has to be forms the other in all situations, both written to temporary disk files during should be available in a query execution partitioning (or how often each record engine for a choice to be made in each has to be written to disk during recursive case by the query optimizer. partitioning), while each file determines its own disk 1/0 in sorting [Bratberg- 8. EXECUTION OF COMPLEX QUERY sengen 1984]. For example, sorting the PLANS larger of two join inputs using multiple merge levels is more expensive than When multiple operators such as aggre- writing a small fraction of that file to gations and joins execute concurrently in hash overflow files. 
This performance ad- a pipelined execution engine, physical re- vantage of hashing grows with the rela- sources such as memory and disk band- tive size difference of the two inputs, not width must be shared by all operators. with their absolute sizes or with the Thus, optimal scheduling of multiple op- memory size. erators and the division and allocation of Second, if the hash function is very resources in a complex plan are impor- poor, e.g., because of a prior selection on tant issues. the ioin attribute or a correlated at- In earlier relational execution engines, tribu~e, hash partitioning can perform these issues were largely ignored for two

ACM Computmg Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques “ 123 reasons. First, only left-deep trees were tention each time a new hash table is used for query execution, i.e., the right built. Most recently, Brown et al. [1992] (inner) input of a binary operator had to have considered resource allocation be a scan. In other words, concurrent tradeoffs among short transactions and execution of multiple subplans in a single complex queries. query was not possible. Second, under Schneider [1990] and Schneider and the assumption that sorting was needed DeWitt [1990] were the first to systemat- at each step and considering that sorting ically examine execution schedules and for nontrivial file sizes requires that the costs for right-deep trees, i.e., query eval- entire input be written to temporary files uation plans with multiple binary hash at least once, concurrency and the need joins for which all build phases proceed for resource allocation were basically ab- concurrently or at least could proceed sent. Today’s query execution engines concurrently (notice that in a left-deep consider more join algorithms that per- plan, each build phase receives its data mit extensive pipelining, e.g., hybrid hash from the probe phase of the previous join, join, and more complex query plans, in- limiting left-deep plans to two concurrent cluding bushy trees. Moreover, today’s joins in different phases). Among the systems support more concurrent users most interesting findings are that and use parallel-processing capabilities. through effective use of bit vector fil- Thus, resource allocation for complex tering (discussed later), memory re- queries is of increasing importance for quirements for right-deep plans might database query processing. actually be comparable to those of Some researchers have considered re- left-deep plans [Schneider 1991]. This source contention among multiple query work has recently been extended by processing operators with the focus on Chen et al. [1992] to bushy plans in- buffer management. The goal in these terpreted and executed as multiple efforts was to assign disk pages to buffer right-deep subplans, slots such that the benefit of each buffer For binary matching iterators to be slot would be maximized, i.e., the number used in bushy plans, we have identified of 1/0 operations avoided in the future. several concerns. First, some query- Sacco and Schkolnick [1982; 1986] ana- processing algorithms include a point at lyzed several database algorithms and which all data are in temporary files on found that their cost functions exhibit disk and at which no intermediate result steps when plotted over available buffer data reside in memory. Such “stop” points space, and they suggested that buffer can be used to switch efficiently between space should be allocated at the low end different subplans. For example, if two of a step for the least buffer use at a subplans produce and sort two merge-join given cost. Chou [1985] and Chou and inputs, stopping work on the first sub- DeWitt [1985] took this idea further by plan and switching to the second one combining it with separate page replace- should be done when the first sort opera- ment algorithms for each relation or scan, tor has all its data in sorted runs and following observations by Stonebraker when only the final merge is left but no [1981] on operating system support for output has been produced yet. Figure 22 database systems, and with load control, illustrates this point in time. 
Fortu- calling the resulting algorithm DBMIN. nately, this timing can be realized natu- Faloutsos et al. [1991] and Ng et al. [1991] rally in the iterator implementation of generalized this goal and used the classic sorting if input runs for the final merge economic concepts of decreasing marginal are opened in the first call of the next gain and balanced marginal gains for procedure, not at the end of the open maximal overall gain. Their measure of phase. A similar stop point is available in gain was the reduction in the number of hash join when using overflow avoidance. page faults. Zeller and Gray [1990] de- Second, since hybrid hashing produces signed a hash join algorithm that adapts some output data before the memory con- to the current memory and buffer con- tents (output buffers and hash table) can

ACM Computing Surveys, Vol. 25, No. 2, June 1993 124 - Goetz Graefe

merge join sort in its run generation phase shares resources with other operations, e.g., a sort following two sorts in their final /\ merges and a merge-join, it should also use resources proportional to its input ‘0” ‘~ size. For example, if two merge-join in- mm puts are of the same size and if the (done) to be done merge-join output which is sorted imme- diately following the merge-join is as Figure 22. The stop point during sorting large as the two inputs together, the two final merges should each use one quar- ter of memory while the run gener- be discarded and since, therefore, such a ation (quicksort) should use one half of stop point does not occur in hybrid hash memory. join, implementations of hybrid hash join Fifth, in recursive hybrid hash join, and other binary match operations should the recursion levels should be executed be parameterized to permit overflow level by level. In the most straightfor- avoidance as a run time option to be ward recursive algorithm, recursive invo- chosen by the query optimizer. This dy- cation of the original algorithm for each namic choice will permit the query opti- output partition results in depth-first mizer to force a stop point in some opera- partitioning, and the algorithm produces tors while using hybrid hash in most output as soon as the first leaf in the operations. recursion tree is reached. However, if the Third, binary-operator implementa- operator that consumes the output re- tions should include a switch that con- quires memory as soon as it receives in- trols which subplan is initiated first. In put, for example, hybrid hash join (ii) in Table 1 with algorithm outlines for itera- Figure 23 as soon as hybrid hash join (i) tors’ open, next, and close procedures, produces output, the remaining parti- the hash join open procedure executes tioning operations in the producer opera- the entire build-input plan first before tor (hybrid hash join (i)) must share opening the probe input. However, there memory with the consumer operator (hy- might be situations in which it would be brid hash join (ii)), effectively cutting the better to open the probe input before partitioning fan-out in the producer in executing the build input. If the probe half. Thus. hash-based recursive match- input does not hold any resources such ing algorithms should proceed in three as memory between open and next calls, distinct phases—consuming input and initiating the probe input first is not a initial partitioning, partitioning into files problem. However, there are situations suitable for hybrid hash join, and final in which it creates a big benefit, in par- hybrid hash join for all partitions—with ticular in bushy query evaluation plans phase two completed entirely before and in parallel systems to be discussed phase three commences. This sequence of later. partitioning steps was introduced as Fourth, if multiple operators are active breadth-first partitioning in the previous concurrently, memory has to be divided section as opposed to depth-first parti- among them. If two sorts produce input tioning used in the most straightforward data for a merge-join, which in turn recursive algorithms. Of course, the top- passes its output into another sort using most operator in a query evaluation plan quicksort, memory should be divided pro- does not have a consumer operator with portionally to the sizes of the three files which it shares resources; therefore, this involved. 
We believe that for multiple operator should use depth-first partition- sorts producing data for multiple merge- ing in order to provide a better response joins on the same attribute, proportional time, i.e., earlier delivery of the first data memory division will also work best. If a item.

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 125

Hybrid Hash Join (ii) of the principal ideas of Ingres Decompo- sition is a repetitive cycle consisting of 01/ \ three steps. First, the next step is se- Hybrid Hash Join (i) lected, e.g., a selection or join. Second, /“ \ 1“’”’13 the chosen step is executed into a tempo- Input 11 Input 12 rary table. Third, the query is simplified by removing predicates evaluated in the Figure 23. Plan for joining three inputs. completed execution step and replacing one range variable (relation) in the query with the new temporary table. The justi- fication and advantage of this approach Sixth, the allocation of resources other are that all earlier selectivities are known than memory, e.g., disk bandwidth and for each decision, because the intermedi- disk arms for seeking in partitioning and ate results are materialized. The disad- merging, is an open issue that should be vantage is that data flow between opera- addressed soon, because the different im- tors cannot be exploited, resulting in a m-ovement. rates in CPU and disk s~eeds. significant cost for writing and reading will increase the im~ortance of disk ~er- intermediate files. For very complex formance for over~ll query processing queries, we suggest modifying Decompo- performance. One possible alleviation of sition to decide on and execute multiple this .m-oblem might come from disk ar- steps in each cycle, e.g., 3 to 9 joins, rays configured exclusively for perfor- instead of executing only one selection or mance, not for reliability. Disk arrays join as in Ingres. Such a hybrid approach might not deliver the entire ~erformance might very well combine the advantages ga~n the large number of disk’drives could of a priori optimization, namely, in- provide if it is not possible to disable a memory data flow between iterators, and disk array’s parity mechanisms and to optimization with exactly known inter- access s~ecific disks within an arrav. mediate result sizes. particul~rly during partitioning aid An optimization and execution envi- merging. ronment even further tuned for very Finally, scheduling bushy trees in complex queries would anticipate possi- multiprocessor systems is not entirely ble outcomes of executing subplans and understood yet. While all considerations provide multiple alternative subsequent discussed above apply in principle, multi- plans. Figure 24 shows the structure of m-ocessors ~ermit trulv . concurrent exe- such a dynamic plan for a complex query. cution of multiple subplans in a bushy First, subplan A is executed, and statis- tree. However, it is a very hard problem tics about its result are gathered while it to schedule two or more subplans such is saved on disk. Depending on these that their result streams are available at statistics, either B or C is executed next. the right times and at the right rates, in If B is chosen and executed, one of D, E, particular in light of the unavoidable er- and F will complete the query; in the rors in selectivity and cost estimation case of C instead of B, it will be G or H. during query optimization [Christodoula- Notice that each letter A–H can be an kis 1984; Ioannidis and Christodoulakis arbitrarily complex subplan, although 19911. probably not more than 10 operations The last point, estimation errors, leads due to the limitations of current selectiv- us to suspect that plans with 30 (or even ity estimation methods. Unfortunately, 100) joins or other operations cannot be realization of such sophisticated query optimized completely before execution. 
optimizers will require further research, Thus. we susnect. that a techniaue. remi- e.g., into determination of when separate niscent of Ingres Decomposition [Wong cases are warranted and limitation of and Youssefi 1976; Youssefi and Wong the possibly exponential growth in the 1979] will prove to be more effective. One number of subplans.

ACM Computing Surveys, Vol. 25, No 2, June 1993 126 ● Goet.z Graefe

and scaleup are different, it should al- ways be clearly indicated which parallel efficiency measure is being reported. B c A third measure for the success of a parallel algorithm based on Amdahl’s law /’”% is the fraction of the sequential program DEF GH for which linear speedup was attained, defined byp=f Xs/d+(l-f)X sfor Figure 24. A decision tree of partial plans, sequential execution time s, parallel exe- cution time p, and degree of parallelism d. Resolved for f, this is f = (s – p)/ (s 9. MECHANISMS FOR PARALLEL QUERY – s/d) = ((s – p)/s)/((d – I)/d). For EXECUTION the example above, this fraction is f = Considering that all high-performance ((1200 – 100)/’1200)/((16 – 1)/’16) = computers today employ some form of 97.78%. Notice that this measure gives parallelism in their processing hardware, much higher percentage values than the it seems obvious that software written to parallel efficiency calculated earlier; manage large data volumes ought to be therefore, the two measures should not able to exploit parallel execution capabil- be confused. ities [DeWitt and Gray 1992]. In fact, we For query processing problems involv- believe that five years from now it will be ing sorting or hashing in which multiple argued that a database management sys- merge or partitioning levels are expected, tem without parallel query execution will the speedup can frequently be more than be as handicapped in the market place as linear, or superlinear. Consider a sorting one without indices. problem that requires two merge levels The goal of parallel algorithms and in a single machine. If multiple machines systems is to obtain speedup and scaleup, are used, the sort problem can be parti- and speedup results are frequently used tioned such that each machine sorts a to demonstrate the accomplishments of a fraction of the entire data amount. Such design and its implementation. Speedup partitioning will, in a good implementa- considers additional hardware resources tion, result in linear speedup. If, in addi- for a constant problem size; linear tion, each machine has its own memory speedup is considered optimal. In other such that the total memory in the system words, N times as many resources should grows with the size of the machine, fewer solve a constant-size problem in l\N of than two merge levels will suffice, mak- the time. Speedup can also be expressed ing the speedup superlinear. as parallel efficiency, i.e., a measure of how close a system comes to linear 9.1 Parallel versus Distributed Database speedup. For example, if solving a prob- Systems lem takes 1,200 seconds on a single ma- chine and 100 seconds on 16 machines, It might be useful to start the discussion the speedup is somewhat less than lin- of parallel and distributed query process- ear. The parallel efficiency is (1 x ing with a distinction of the two concepts. 1200)/(16 X 100) = 75%. In the database literature, “distributed” An alternative measure for a parallel usually implies “locally autonomous,” i.e., system’s design and implementation is each participating system is a complete scaleup, in which the problem size is al- database management system in itself, tered with the resources. Linear scaleup with access control, metadata (catalogs), is achieved when N times as many re- query processing, etc. In other words, sources can solve a problem with iV times each node in a distributed database man- as much data in the same amount of agement system can function entirely on time. 
Scaleup can also be expressed us- its own, whether or not the other nodes ing parallel efficiency, but since speedup are present or accessible. Each node per-

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques ● 127 forms its own access control, and co- be of the same types), or heterogeneous, operation of each node in a distributed meaning that multiple database manage- transaction is voluntary. Examples of ment systems work together using stan- distributed (research) systems are R* dardized interfaces but are internally [Haas et al. 1982; Traiger et al. 1982], different.lG Furthermore, distributed distributed Ingres [Epstein and Stone- systems may employ parallelism, e.g., braker 1980; Stonebraker 1986a], and by pipelining datasets between nodes SDD-1 [Bernstein et al. 1981; Rothnie et with the receiver already working on al. 1980]. There are now several commer- some items while the producer is still cial distributed relational database man- sending more. Parallel systems can be agement systems. Ozsu and Valduriez based on shared-memory (also called [1991a; 1991b] have discussed dis- shared-everything), shared-disk (multi- tributed database systems in much more ple processors sharing disks but not detail. If the cooperation among multiple memory), distributed-memory (with- database systems is only limited, the sys- out sharing disks, also called shared- tem can be called a “federated database nothing), or hierarchical computer system [ Sheth and Larson 1990]. architectures consisting of multiple clus- In parallel systems, on the other hand, ters, each with multiple CPUS and disks there is only one locus of control. In other and a large shared memory. Stone- words, there is only one database man- braker [ 1986b] compared the first three agement system that divides individual alternatives using several aspects of queries into fragments and executes the database management, and came to fragments in parallel. Access control to the conclusion that distributed memory data is independent of where data objects is the most promising database man- currently reside in the system. The query agement system platform. Each of optimizer and the query execution engine these approaches has advantages and typically assume that all nodes in the disadvantages; our belief is that the hi- system are available to participate in ef- erarchical architecture is the most gen- ficient execution of complex queries, and eral of these architectures and should participation of nodes in a given transac- be the target architecture for new data- tion is either presumed or controlled by a base software development [Graefe and global resource manager, but is not based Davison 1993], on voluntary cooperation as in dis- tributed systems. There are several par- 9.2 Forms of Parallelism allel research prototypes, e.g., Gamma There are several forms of parallelism [DeWitt et al. 1986; DeWitt et al. 1990], that are interesting to designers and im- Bubba [Boral 1988; Boral et al. 1990], plementors of query processing systems. Grace [Fushimi et al. 1986; Kitsuregawa Irzterquery parallelism is a direct result et al. 1983], and Volcano [Graefe 1990b; of the fact that most database manage- 1993b; Graefe and Davison 1993], and ment systems can service multiple re- products, e.g., Tandem’s NonStop SQL quests concurrently. In other words, [Englert et al. 1989; Zeller 1990], Tera- multiple queries (transactions) can be data’s DBC/1012 [Neches 1984; 1988; executing concurrently within a single Teradata 1983], and Informix [Davison database management system. In this 1992]. 
form of parallelism, resource contention Both distributed database systems and parallel database systems have been de- signed in various kinds, which may cre- 16 In some organizations, two different database ate some confusion. Distributed systems management systems may run on the came (fairly can be either homogeneous, meaning that large) computer. Their interactions could be called “nondistributed heterogeneous.” However, since the all participating database management rules governing such interactions are the same as systems are of the same type (the hard- for distributed heterogeneous systems, the case is ware and the operating system may even usually ignored in research and system design.

ACM Computing Surveys, Vol. 25, No 2, June 1993 128 * Goetz Graefe

is of great concern, in particular, con- represented sequences or time series in tention for memory and disk arms. a scientific database management sys- The other forms of parallelism are all tem, partitioning into subsets to be oper- based on the use of algebraic operations ated on independently would not be on sets for database query processing, feasible or would require additional e.g., selection, join, and intersection. The synchronization when putting the in- theory and practice of exploiting other dependently obtained results together. “bulk” types such as lists for parallel Both vertical interoperator parallelism database query execution are only now and intraoperator parallelism are used in developing. Interoperator parallelism is database query processing to obtain basically pipelining, or parallel execution higher performance. Beyond the obvious of different operators in a single query. opportunities for speedup and scaleup For example, the iterator concept dis- that these two concepts offer, they both cussed earlier has also been called “syn- have significant problems. Pipelining chronous pipelines” [Pirahesh et al. does not easily lend itself to load balanc- 1990]; there is no reason not to consider ing because each process or processor asynchronous pipelines in which opera- in the pipeline is loaded proportionally to tors work independently connected by a the amount of data it has to process. This buffering mechanism to provide flow amount cannot be chosen by the imple- control. mentor or the query optimizer and can- Interoperator parallelism can be used not be predicted very well. For intraoper- in two forms, either to execute producers ator, partitioning-based parallelism, load and consumers in pipelines, called uerti- balance and performance are optimal if cal in teroperator parallelism here, or to the partitions are all of equal size; how- execute independent subtrees in a com- ever, this can be hard to achieve if value plex bushy-query evaluation plan concur- distributions in the inputs are skewed. rently, called horizontal in teroperator or bushy parallelism here. A simple exam- 9.3 Implementation Strategies ple for bushy parallelism is a merge-join receiving its input data from two sort The purpose of the query execution en- processes. The main problem with bushy gine is to provide mechanisms for query parallelism is that it is hard or impossi- execution from which the query opti- ble to ensure that the two subplans start mizer can choose—the same applies for generating data at the right time and the means and mechanisms for parallel generate them at the right rates. Note execution. There are two general ap- that the right time does not necessarily proaches to parallelizing a query execu- mean the same time, e.g., for the two tion engine, which we call the bracket inputs of a hash join, and that the right and operator models and which are used, rates are not necessarily equal, e.g., if for example, in the Gamma and Volcano two inputs of a merge-join have different systems, respectively. sizes. Therefore, bushy parallelism pre- In the bracket model, there is a generic sents too many open research issues and process template that can receive and is hardly used in practice at this time. send data and can execute exactly one The final form of parallelism in operator at any point of time. 
A schematic database query processing is intraopera- diagram of a template process is shown tor parallelism in which a single opera- in Figure 25, together with two possible tor in a query plan is executed in multi- operators, join and aggregation. In order ple processes, typically on disjoint pieces to execute a specific operator, e.g., a join, of the problem and disjoint subsets of the the code that makes up the generic tem- data. This form, also called parallelism plate “loads” the operator into its place based on fragmentation or partitioning, (by switching to this operator’s code) and is enabled by the fact that query process- initiates the operator which then controls ing focuses on sets. If the underlying data execution; network 1/0 on the receiving

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 129

output when an entire query is evaluated on a single CPU (and could therefore be eval- uated in a single process, without inter- process communication and operating Input(s) system involvement) or when data do not need to be repartitioned among nodes in a network. An example for the latter is the query “joinCselAselB” in the Wiscon- sin Benchmark, which requires joining three inputs on the same attribute [De- Witt 1991], or any other query that per- mits interesting orderings [ Selinger et al. Figure 25. Bracket model of parallelization. 1979], i.e., any query that uses the same join attribute for multiple binary joins. and sending sides is performed as a ser- Thus, in queries with multiple operators vice to the operator on its request and (meaning almost all queries), interpro- initiation and is implemented as proce- cess communication and its overhead are dures to be called by the operator. The mandatory in the bracket model rather number of inputs that can be active at than optional. any point of time is limited to two since An alternative to the bracket model is there are only unary and binary opera- the operator model. Figure 26 shows a tors in most database systems. The oper- possible parallelization of a join plan us- ator is surrounded by generic template ing the operator model, i.e., by inserting code. which shields it from its environ- “parallelism” operators into a sequential ment, for example, the operator(s) that plan, called exchange operators in the produce its input and consume its out- Volcano system [Graefe 1990b; Graefe put. For parallel query execution, many and Davison 1993]. The exchange opera- templates are executed concurrently in tor is an iterator like all other operators the system, using one process per tem- in the system with open, next, and close plate. Because each operator is written procedures; therefore, the other opera- with the imdicit. assum~tion that this tors are entirely unaffected by the pres- operator controls all acti~ities in its pro- ence of exchange operators in a query cess, it is not possible to execute two evaluation plan. The exchange operator operators in one process without resort- does not contribute to data manipulation; ing to some thread or coroutine facility thus, on the logical level, it is a “no-op” i.e., a second implementation level of the that has no place in a logical query alge- process concept. bra such as the relational algebra. On In a query-processing system using the the physical level of algorithms and pro- bracket model, operators are coded in cesses, however, it provides control not such a way that network 1/0 is their provided by any of the normal operators, only means of obtaining input and deliv- i.e., process management, data redistri- ering output (with the exception of scan bution, and flow control. Therefore, it is and store operators). The reason is that a control operator or a meta-operator. each operator is its own locus of control, Separation of data manipulation from and network flow control must be used to process control and interprocess com- coordinate multiple operators, e.g., to munication can be considered an im- match two operators’ speed in a pro- portant advantage of the operator ducer-consumer relationship. 
Unfortu- model of parallel query processing, be- nately, this coordination requirement cause it permits design, implementation, also implies that passing a data item and execution of new data manipulation from one operator to another always in- algorithms such as N-ary hybrid hash volves expensive interprocess communi- join [Graefe 1993a] without regard to the cation system calls, even in the cases execution environment.

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 130 “ Goetz Graefe

Print I Print Exchange I Q Join /’”\ Join Exchange

/\ I Exchange Exchange scan I I scan scan

Figure 26. Operator model of parallehzation.

A second issue important to point out Figure 27. Processes created by exchange operators. is that the exchange operator only pro- vides mechanisms for parallel query pro- cessing; it does not determine or presup- pose policies for using its mechanisms. ous figure, with each circle representing Policies for parallel processing such as a process. Note that this set of processes the degree of parallelism, partitioning is only one possible parallelization, which functions, and allocation of processes to makes sense if the joins are on the same processors can be set either by a query join attributes. Furthermore, the degrees optimizer or by a human experimenter in of data parallelism, i.e., the number of the Volcano system as they are still sub- processes in each process group, can be ject to intense research. The design of the controlled using an argument to the ex- exchange operator permits execution of a change operator. complex query in a single process (by There is no reason to assume that the using a query plan without any exchange two models differ significantly in their operators, which is useful in single- performance if implemented with similar processor environments) or with a num- care. Both models can be implemented ber of processes by using one or more with a minimum of control overhead and exchange operators in the query evalua- can be combined with any partitioning tion plan. The mapping of a sequential scheme for load balancing. The only dif- plan to a parallel plan by inserting ex- ference with respect to performance is change operators permits one process per that the operator model permits multiple operator as well as multiple processes for data manipulation operators such as join one operator (using data partitioning) or in a single process, i.e., operator synchro- multiple operators per process, which is nization and data transfer between oper- useful for executing a complex query plan ators with a single procedure call with- with a moderate number of processes. out operating system involvement. The Earlier parallel query execution engines important advantages of the operator did not provide this degree of flexibility; model are that it permits easy paral- the bracket model used in the Gamma lelization of an existing sequential sys- design, for example, requires a separate tem as well as development and mainte- process for each operator [DeWitt et al. nance of operators and algorithms in a 1986]. familiar and relatively simple single-pro- Figure 27 shows the processes created cess environment [ Graefe and Davison by the exchange operators in the previ- 1993].

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 131

The bracket and operator models both tioned such that the processing load is provide pipelining and partitioning as nearly equal for each processor. Notice part of pipelined data transfer between that in particular for binary operations process groups. For most algebraic opera- such as join, equal processing loads can tors used in database query processing, be different from equal-sized partitions. these two forms of parallelism are suffi- There are several research efforts de- cient. However, not all operations can be veloping techniques to avoid skew or to easily supported by these two models. limit the effects of skew in parallel query For example, in a transitive closure oper- processing, e.g., Baru and Frieder [1989], ator, newly inferred data is equal to in- DeWitt et al. [ 1991b], Hua and Lee put data in its importance and role for [1991], Kitsuregawa and Ogawa [1990], creating further data. Thus, to paral- Lakshmi and Yu [1988; 1990], Omiecin- lelize a single transitive closure operator, ski [199 1], Seshadri and Naughton the newly created data must also be par- [1992], Walton [1989], Walton et al. titioned like the input data. Neither [1991], and Wolf et al. [1990; 1991]. How- bracket nor operator model immediately ever, all of these methods have their allow for this need. Hence, for transitive drawbacks, for example, additional re- closure operators, intraoperator paral- quirements for local processing to deter- lelism based on partitioning requires that mine quantiles. the processes exchange data among Skew management methods can be di- themselves outside of the stream vided into basically two groups. First, paradigm. skew avoidance methods rely on deter- The transitive closure operator is not mining suitable partitioning rules before the only operation for which this restric- data is exchanged between processing tion holds. Other examples include the nodes or processes. For range partition- complex object assembly operator de- ing, quantiles can be determined or esti- scribed by Keller et al. [1991] and opera- mated from sampling the data set to be tors for numerical optimizations as might partitioned, from catalog data, e.g., his- be used in scientific databases. Both tograms, or from a preprocessing step. models, the bracket model and the opera- Histograms kept on permament base data tor model, could be extended to provide a have only limited use for intermediate general and efficient solution to intraop- query processing results, in particular, if erator data exchange for intraoperator the partitioning attribute or a correlated parallelism. attribute has been used in a prior selec- tion or matching operation. However, for stored data they may be very beneficial. 9.4 Load Balancing and Skew Sampling implies that the entire popula- For optimal speedup and scaleup, pieces tion is available for sampling because the of the processing load must be assigned first memory load of an intermediate re- carefully to individual processors and sult may be a very poor sample for parti- disks to ensure equal completion times tioning decisions. Thus, sampling might for all pieces. In interoperator paral- imply that the data flow between opera- lelism, operators must be grouped to en- tors be halted and an entire intermediate sure that no one processor becomes the result be materialized on disk to ensure bottleneck for an entire pipeline. Bal- proper random sampling and subsequent anced processing loads are very hard to partitioning. 
However, if such a halt is achieve because intermediate set sizes required anyway for processing a large cannot be anticipated with accuracy and set, it can be used for both purposes. For certainty in database query optimization. example, while creating and writing ini- Thus, no existing or proposed query- tial run files without partitioning in a processing engine relies solely on inter- parallel sort, quantiles can be deter- operator parallelism. In intraoperator mined or estimated and used in a com- parallelism, data sets must be parti- bined partitioning and merging step.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 132 “ Goetz Graefe

10000 – solid — 99 9Z0confidence dotted — 95 % confidence Sample 10~_ o 1024 patitions Size ❑ 2 partitions ❑ Per ❑ ❑ Partition 10Q –

107 I I I I 1 1.25 1.5 1.75 2 Skew Limit

Figure 28. Skew hmit,confidence, andsamples izeperpartltion.

Second, skew resolution repartitions distribution. Figure 28 shows the re- some or all of the data if an initial parti- quired sample sizes per partition for var- tioning has resulted in skewed loads. ious skew limits, degrees of parallelism, Repartitioning is relatively easy in and confidence levels. For example, to shared-memory machines, but can also ensure a maximal skew of 1.5 among be done in distributed-memory architec- 1,000 partitions with 9570 confidence, 110 tures, albeit at the expense of more net- random samples must be taken at each work activity. Skew resolution can be site. Thus, relatively small samples suf- based on rehashing in hash partitioning fice for reasonably safe skew avoidance or on quantile adjustment in range parti- and load balancing, making precise tioning. Since hash partitioning tends to methods unnecessary. Typically, only create fairly even loads and since net- tens of samples per partition are needed, work bandwidth will increase in the near not several hundreds of samples at each future within distributed-memory ma- site. chines as well as in local- and wide-area For allocation of active processing ele- networks, skew resolution is a reason- ments, i.e., CPUS and disks, the band- able method for cases in which a prior width considerations discussed briefly in processing step cannot be exploited to the section on sorting can be generalized gather the information necessary for for parallel processes. In principal, all skew avoidance as in the sort example stages of a pipeline should be sized such above. that they all have bandwidths propor- In their recent research into sampling tional to their respective data volumes in for load balancing, DeWitt et al. [ 1991b] order to ensure that no stage in the and Seshadri and Naughton [1992] have pipeline becomes a bottleneck and slows shown that stratified random sampling the other ones down. The latency almost can be used, i.e., samples are selected unavoidable in data transfer between randomly not from the entire distributed pipeline stages should be hidden by the data set but from each local data set at use of buffer memory equal in size to the each site, and that even small sets of product of bandwidth and latency. samples ensure reasonably balanced loads. Their definition of skew is the quo- 9.5 Architectures and Architecture tient of sizes of the largest partition and Independence the average partition, i.e., the sum of sizes of all partitions divided by the de- Many database research projects have gree of parallelism. In other words, a investigated hardware architectures for skew of 1.0 indicates a perfectly even parallelism in database systems. Stone-

ACM Computmg Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 133 braker[1986b] compared shared-nothing cuted customized software on standard (distributed-memory), shared-disk (dis- processors with local disks [Boral et al. tributed-memory with multiported disks), 1990; DeWitt et al. 1990]. For scalabil- and shared-everything (shared-memory) ity and availability, both projects architectures for database use based on a used distributed-memory hardware number of issues including scalability, with single-CPU nodes and investi- communication overhead, locking over- gated scaling questions for very large head, and load balancing. His conclusion configurations. at that time was that shared-everything The XPRS system, on the other hand, excels in none of the points considered; has been based on shared memory [Hong shared-disk introduces too many locking and Stonebraker 1991; Stonebraker et al. and buffer coherency problems; and 1988a; 1988b]. Its designers believe that shared-nothing has the significant bene- modern bus architectures can handle up fit of scalability to very high degrees of to 2,000 transactions per second, and that parallelism. Therefore, he concluded that shared-memory architectures provide overall shared-nothing is the preferable automatic load balancing and faster architecture for database system imple- communication than shared-nothing mentation. (Much of this section has been machines and are equally reliable and derived from Graefe et al. [1992] and available for most errors, i.e., media Graefe and Davison [1993 ].) failures, software, and operator errors Bhide [1988] and Bhide and Stone- [Gray 1990]. However, we believe that braker [1988] compared architectural al- attaching 250 disks to a single machine ternatives for transaction processing and as necessary for 2,000 transactions per concluded that a shared-everything second [ Stonebraker et al. 1988b] re- (shared-memory) design achieves the best quires significant special hardware, e.g., performance, up to its scalability limit. channels or 1/0 processors, and it is quite To achieve higher performance, reliabil- likely that the investment for such hard- ity, and scalability, Bhide suggested ware can have greater impact on overall considering shared-nothing (distributed- system performance if spent on general- memory) machines with shared-every- purpose CPUS or disks. Without such thing parallel nodes. The same idea is special hardware, the performance limit mentioned in equally general terms by for shared-memory machines is probably Pirahesh et al. [1990] and Boral et al. much lower than 2,000 transactions per [1990], but none of these authors elabo- second. Furthermore, there already are rate on the idea’s generality or potential. applications that require larger storage Kitsuregawa and Ogawa’s [1990] new and access capacities. database machine SDC uses multiple Richardson et al. [1987] performed an shared-memory nodes (plus custom hard- analytical study of parallel join algo- ware such as the Omega network and a rithms on multiple shared-memory “clus- hardware sorter), although the effect of ters” of CPUS. They assumed a group of the hardware design on operators other clusters connected by a global bus with than join is not evaluated in their article. multiple microprocessors and shared Customized parallel hardware was in- memory in each cluster. Disk drives were vestigated but largely abandoned after attached to the busses within clusters. 
Boral and DeWitt’s [1983] influential Their analysis suggested that the best analysis that compared CPU and 1/0 performance is obtained by using only speeds and their trends. Their analysis one cluster, i.e., a shared-memory archi- concluded that 1/0, not processing, is tecture. We contend, however, that their the most likely bottleneck in future results are due to their parameter set- high-performance query execution. Sub- tings, in particular small relations (typi- sequently, both Boral and DeWitt em- cally 100 pages of 32 KB), slow CPUS barked on new database machine (e.g., 5 psec for a comparison, about 2-5 projects, Bubba and Gamma, that exe- MIPS), a slow global network (a bus with

ACM Computing Surveys, Vol. 25, No 2, June 1993 134 “ Goetz Graefe typically 100 Mbit/see), and a modest advantages. The important point is the number of CPUS in the entire system combina~ion of lo~al busies within (128). It would be very interesting to see shared-memory parallel machines and a the analysis with larger relations (e.g., dobal interconnection network amomz 1– 10 GB), a faster network, e.g., a mod- machines. The diagram is only a very ern hypercube or mesh with hardware general outline of such an architecture; routing. and consideration of bus load manv details are deliberately left out and u, . . and bus contention in each cluster, which unspecified. The network could be imple- might lead to multiple clusters being the mented using a bus such as an ethernet, better choice. On the other hand, commu- a ring, a hypercube, a mesh, or a set of nication between clusters will remain a ~oint-to-~oint connections. The local significant expense. Wong and Katz busses m-ay or may not be split into code [1983] developed the concept of “local and data or by address range to obtain sufficiency” that might provide guidance less contention and higher bus band- in declusterinp and re~lication to reduce width and hence higher scalability limits data moveme~t betw~en nodes. Other for the use of sh~red memory. “Design work on declustering and limiting declus- and placement of caches, disk controllers, tering includes Copeland et al. [1988], terminal connections. and local- and Fang et al. [1986], Ghandeharizadeh and wide-area network connections are also DeWitt [1990], Hsiao and DeWitt [1990], left open. Tape drives or other backup and Hua and Lee [ 19901. devices would be connected to local Finally, there are several hardware de- busses. signs that attempt to overcome the Modularity is a very important consid- shared-memory scaling problem, e.g., the eration for such an architecture. For ex- DASH project [Anderson et al. 1988], the ample, it should be possible to replace all Wisconsin Multicube [Goodman and CPU boards with uwn-aded.= models with- Woest 1988], and the Paradigm project out having to replace memories or disks. [Cheriton et al. 1991]. However, these Considering that new components will desires follow the traditional se~aration change communication demands, e.g., of operating system and application pro- faster CPUS might require more local bus gram. They rely on page or cache-line bandwidth, it is also important that the faulting and do not provide typical allocation of boards to local busses can be database concepts such as read-ahead changed, For example, it should be easy and dataflow. Lacking separation of to reconfigure a machine with 4 X 16 mechanism and policy in these designs CPUS into one with 8 X 8 CPUS. almost makes it imperative to implement Beyond the effect of faster communica- dataflow and flow control for database tion and synchronization, this architec- query processing within the query execu- ture can also have a simificant effect on tion engine. At this point, none of these control overhead, load balancing, and re- hardware designs has been experimen- sulting response time problems. Investi- tally tested for database query gations in the Bubba project at MCC processing. 
demonstrated that large degrees of paral- New software systems designed to ex- lelism mav reduce ~erformance unless ploit parallel hardware should be able to load imbal~nce and o~erhead for startup, exploit both the advantages of shared synchronization, and communication can memory, namely efficient communica- be kept low [Copeland et al. 1988]. For tion, synchronization, and load balanc- example, when placing 100 CPUS either ing, and of distributed memory, namely in 100 nodes or in 10 nodes of 10 CPUS scalability to very high degrees of paral- each, it is much faster to distribute query lelism and reliability and availability plans to all CPUS and much easier to through independent failures. Figure 29 achieve reasonable . balanced loads in the shows a general hierarchical architec- second case than in the first case. Within ture, which we believe combines these each shared-memory parallel node, load

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques “ 135

[ Interconnection Network I I I Loc .d Bus + CPU 1

+ CPU ]

+ CPU I

+ CPU ~

Ed Figure 29. Ahierarchical-memory architecture imbalance can be dealt with either by engine were discussed. In this section, compensating allocation of resources, e.g., individual algorithms and their special memory for sorting or hashing, or by rel- cases for parallel execution are consid- atively efficient reassignment of data to ered in more detail. Parallel database processors. query processing algorithms are typically Many of today’s parallel machines are based on partitioning an input using built as one of the two extreme cases of range or hash partitioning. Either form this hierarchical design: a distributed- of partitioning can be combined with sort- memory machine uses single-CPU nodes, and hash-based query processing algo- while a shared-memory machine consists rithms; in other words, the choices of of a single node. Software designed for partitioning scheme and local algorithm this hierarchical architecture will run on are almost always entirely orthogonal. either conventional design as well as a When building a parallel system, there genuinely hierarchical machine and will is sometimes a question whether it is allow the exploration of tradeoffs in the better to parallelize a slower sequential range of alternatives in between. The algorithm with better speedup behavior most recent version of Volcano’s ex- or a fast sequential algorithm with infe- change operator is designed for hierar- rior speedup behavior. The answer to this chical memory, demonstrating that the question depends on the design goal and operator model of parallelization also of- the planned degree of parallelism. In the fers architecture- and topology-indepen- few single-user database systems in use, dent parallel query evaluation [Graefe the goal has been to minimize response and Davison 1993]. In other words, the time; for this goal, a slow algorithm with parallelism operator is the only operator linear speedup implemented on highly that needs to “understand” the underly- parallel hardware might be the right ing architecture, while all data manipu- choice. In multi-user systems, the goal lation operators can be implemented typically is to minimize resource con- without concern for parallelism, data dis- sumption in order to maximize through- tribution, and flow control. put. For this goal, only the best sequen- tial algorithms should be parallelized. For example, Boral and DeWitt [1983] con- 10. PARALLEL ALGORITHMS cluded that parallelism is no substitute In the previous section, mechanisms for for effective and efficient indices. For a parallelizing a database query execution new parallel algorithm with impressive

ACM Computing Surveys, Vol 25, No. 2, June 1993 136 ● Goetz Graefe speedup behavior, the question of fied: the number of their parallel inputs whether or not the underlying sequential (e.g., scan or subplans executed in paral- algorithm is the most efficient choice lel) and the number of parallel outputs should always be considered. (consumers) [Graefe 1990a]. As sequen- tial input or output restrict the through- put of parallel sorts, we assume a 10.1 Parallel Selections and Updates multiple-input multiple-output paral- Since disk 1/0 is a performance bottle- lel sort here, and we further assume neck in many systems, it is natural to that the input items are partitioned parallelize it. Typically, either asyn- randomly with respect to the sort at- chronous 1/0 or one process per partici- tribute and that the outmut items pating 1/0 device is used, be it a disk or should be range-partitioned ~nd sorted an array of disks under a single con- within each range. troller. If a selection attribute is also the Considering that data exchange is ex- partitioning attribute, fewer than all pensive, both in terms of communication disks will contain selection results, and and synchronization delays, each data the number of processes and activated item should be exchanged only once be- disks can be limited. Notice that parallel tween m-ocesses. Thus. most ~arallel sort . L selection can be combined very effec- algorithms consist of a local sort and a tively with local indices, i.e., ind~ces cov- data exchange step. If the data exchange ering the data of a single disk or node. In step is done first, quantiles must be general, it is most efficient to maintain known to ensure load balancing during indices close to the stored data sets, i.e., the local sort step. Such quantil& can b: on the same node in a parallel database obtained from histograms in the catalogs system. or by sampling. It is not necessary that For updates of partitioning attributes the quantiles be precise; a reasonable in a partitioned data set, items may need approximation will suffice. to move between disks and sites, just as If the local sort is done first. the final items may move if a clustering attribute local merging should pass data directly is updated. Thus, updates of partitioning into the data exchange step. On each attributes may require setting up data receiving site, multiple sorted streams transfers from old to new locations of must be merged during the data ex- modified items in order to maintain the change step. bne of th~ possible prob- consistency of the partitioning. The fact lems is that all producers of sorted that updating partitioning attributes is streams first produce low key values, more expensive is one reason why im- limiting performance by the speed of the mutable (or nearly immutable) iden- first (single!) consumer; then all produc- tifiers or keys are usually used as ers switch to the next consumer, etc. partitioning attributes. If a different partitioning strategy than range partitioning is used, sorting with subsequent partitioning is not guaran- 10.2 Parallel Sorting teed to be deadlock free in all situations. Since sorting is the most expensive oper- Deadlock will occur if (1) multiple con- ation in many of today’s database man- sumers feed multiple producers, (2) each agement systems, much research has m-oducer m-educes a sorted stream, and been dedicated to parallel sorting ~ach con&mer merges multiple sorted [Baugsto and Greipsland 1989; Beck et streams, (3) some key-based partitioning al. 
1988; Bitton and Friedland 1982; rule is used other than range partition- Graefe 1990a; Iyer and Dias 1990; Kit- ing, i.e., hash partitioning, (4) flow con- suregawa et al, 1989b; Lorie and Young trol is enabled, and (5) the data distribu- 1989; Menon 1986; Salzberg et al. 1990]. tion is particularly unfortunate. There are two dimensions along which Figure 30 shows a scenario with two parallel sorting methods can be classi- producer and two consumer processes,

ACM Computing Surveys, Vol 25, No. 2, June 1993 Query Evaluation Techniques ● 137 i.e., both the producer operators and the consumer operators are executed with a degree of parallelism of two. The circles in Figure 30 indicate processes, and the arrows indicate data paths. Presume that the left sort produces the stream 1, 3, 5, 7 , 999, 1002, 1004, 1006, Ikii, :.., 2000 while the right sort pro- duces 2, 4, 6, 8,..., 1000, 1001, 1003, 1005, 1007,..., 1999. The merge opera- tions in the consumer processes must re- ceive the first item from each producer process before they can create their first output item and remove additional items from their input buffers. However, the producers will need to produce 500 items Figure 30. Scenario with possible deadlock. each (and insert them into one consumer’s input buffer, all 500 for one consumer) before they will send their first item to the other consumer. The data exchange buffer needs to hold 1,000 items at one point of time, 500 on each side of ! [$1 Figure 30. If flow control is enabled and Receive \ Receive ) if the exchange buffer (flow control slack) is less than 500 items, deadlock will odd Y I even occur. even The reason deadlock can occur in this m situation is that the producer processes Pa ..--.. need to ship data in the order obtained from their input subplan (the sort in Fig- k ) ure 30) while the consumer processes need to receive data in sorted order as F required by the merge. Thus, there are

two sides which both require absolute Figure 31. Deadlock-free scenario. control over the order in which data pass over the process boundary. If the two requirements are incompatible, an un- Our recommendation is to avoid the bounded buffer is required to ensure above situation, i.e., to ensure that such freedom from deadlock. query plans are never generated by the In order to avoid deadlock, it must be optimizer. Consider for which purposes ensured that one of the five conditions such a query plan would be used. The listed above is not satisfied. The second typical scenario is that multiple pro- condition is the easiest to avoid, and cesses perform a merge join of two in- should be focused on. If the receiving puts, and each (or at least one) input is processes do not perform a merge, i.e., sorted by several producer processes. An the individual input streams are not alternative scenario that avoids the prob- sorted, deadlock cannot occur because the lem is shown in Figure 31. Result data slack given in the flow control must be are partitioned and sorted as in the pre- somewhere, either at some producer or vious scenario. The important difference some consumer or several of them, and is that the consumer processes do not the process holding the slack can con- merge multiple sorted incoming streams. tinue to process data, thus preventing One of the conditions for the deadlock deadlock. problem illustrated in Figure 30 is that

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 138 . Goetz Graefe

Figure 32. Deadlock danger due to a binary operator in the consumer.

there are multiple producers and multi- If moving a sort operation into the con- ple consumers of a single logical data sumer process is not realistic, e.g., be- stream. However, a very similar deadlock cause the data already are sorted when situation can occur with single-process they are retrieved from disk as in a B-tree producers if the consumer includes an scan, alternative parallel execution operation that depends on ordering, typi- strategies must be found that do not re- cally merge-join. Figure 32 illustrates the quire repartitioning and merging of problem with a merge-join operation exe- sorted data between the producers and cuted in two consumer processes. Notice consumers. There are two possible cases. that the left and right producers in Fig- In the first case, if the input data are not ure 32 are different inputs of the merge- only sorted but also already partitioned join, not processes executing the same systematically, i.e., range or hash parti- operators as in Figures 30 and 31. The tioned, on the attribute(s) considered by consumer in Figure 32 is still one opera- the consumer operator, e.g., the by-list of tor executed by two processes. Presume an aggregate function or the join at- that the left sort produces the stream 1, tribute, the process boundary and data 3, 5, 7, ..., 999, 1002, 1004, 1006, exchange could be removed entirely. This 1008 ,...,2000 while the right sort pro- implies that the producer operator, e.g., duces 2, 4, 6, 8,...,1000, 1001, 1003, the B-tree scan, and the consumer, e.g., 1005, 1007,... , 1999. In this case, the the merge-join, are executed by the same merge-join has precisely the same effect process group and therefore with the as the merging of two parts of one logical same degree of parallelism. data stream in Figure 30. Again, if the In the second case, although sorted data exchange buffer (flow control slack) on the relevant attribute within each is too small, deadlock will occur. Similar partition, the operator’s data could be to the deadlock avoidance tactic in Fig- partitioned either round-robin or on a ure 31, deadlock in Figure 32 can be different attribute. For a join, a avoided by placing the sort operations fragment-and-replicate matching strat- into the consumer processes rather than egy could be used [Epstein et al. 1978; into the producers. However, there is an Epstein and Stonebraker 1980; Lehman additional solution for the scenario in et al. 1985], i.e., the join should execute Figure 32, namely, moving only one of within the same threads as the operator the sort operators, not both, into the con- producing sorted output while the second sumer processes. input is replicated to all instances of the

ACM Computmg Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 139

lo– Total Memory Size M =40 Input Size R = 125,000 8– Merge Depth 6–

4–

2– I I I I I I I 1 3 5 7 9 11 13 Degree of Parallelism

Figure 33. Merge depth as a function of parallelism.

join. Note that fragment-and-replicate A final remark on deadlock avoidance: methods do not work correctly for semi- Since deadlock can only occur if the con- join, outer join, difference, and union, i.e., sumer process merges, i.e., not only the when an item is replicated and is in- producer but also the consumer operator serted (incorrectly) multiple times into try to determine the order in which data the global output. A second solution that cross process boundaries, the deadlock works for all operators, not only joins, is problem only exists in a query execution to execute the consumer of the sorted engine based on sort-based set-processing data in a single thread. Recall that mul- algorithms. If hash-based algorithms tiple consumers are required for a dead- were used for aggregation, duplicate re- lock to occur. A third solution that is moval, join, semi-join, outer join, inter- correct for all operators is to send dummy section, difference, and union, the need items containing the largest key seen so for merging and therefore the danger of far from a producer to a consumer if no deadlock would vanish. data have been exchanged for a predeter- An interesting parallel sorting method mined amount of time (data volume, key with balanced communication and with- range). In the examples above, if a pro- out the possibility of deadlock in spite of ducer must send a key to all consumers local sort followed by data exchange (if at least after every 100 data items pro- the data distribution is known a priori) is cessed in the producer, the required to sort locally only by the position within buffer space is bounded, and deadlock the final partition and then exchange can be avoided. In some sense, this solu- data guaranteeing a balanced data flow. tion is very simple; however, it requires This method might be best seen in an that not only the data exchange mecha- example: Consider 10 partitions with key nism but also sort-based algorithms such values from O to 999 in a uniform distri- as merge-join must “understand” dummy bution. The goal is to have all key values items. Another solution is to exchange all between O to 99 sorted on site O, between data without regard to sort order, i.e., to 100 and 199 sorted on site 1, etc. First, omit merging in the data exchange mech- each partition is sorted locally at its orig- anism, and to sort explicitly after repar- inal site, without data exchange, on the titioning is complete. For this sort, re- last two digits only, ignoring the first placement selection might be more effec- digit. Thus, each site has a sequence such tive than quicksort for generating initial as 200, 301, 401, 902, 2, 603, 804, 605, runs because the runs would probably 105, 705,...,999, 399. Now each site be much larger than twice the size of sends data to its correct final destination. memory. Notice that each site sends data simulta-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 140 “ Goetz Graefe neously to all other sites, creating a bal- locally to resolve hash table overflow, anced data flow among all producers and items can be moved directlv . to their final consumers. While this method seems ele- site. Hopefully, this site can aggregate gant, its problem is that it requires fairly them immediately into the local hash detailed distribution information to en- table because a similar item already ex- sure the desired balanced data flow. ists. In manv recent distributed-memorv In shared-memory machines, memory machines, it” is faster to ship an item to must be divided over all concurrent sort another site than to do a local disk 1/0. processes. Thus, the more processes are In fact. some distributed-memorv ven- active, the less memory each one can get. dors attach disk drives not to tie pri- The importance of this memory division mary processing nodes but to special is the limitation it puts on the size of “1/0 nodes” because network delay is initial runs and on the fan-in in each negligible compared to 1/0 time, e.g., in merge process. In other words, large de- Intel’s iPSC/2 and its subsequent paral- grees of parallelism may impede perfor- lel architectures. mance because they increase the number The advantage is that disk 1/0 is re- of merge levels. Figure 33 shows how the quired only when the aggregation output number of merge levels grows with in- size does not fit into the aggregate mem- creasing degrees of parallelism, i.e., de- orv. available on all machines. while the creasing memory per process and merge standard local aggregation-exchange- fan-in. For input size R, total mem- global aggregation scheme requires local ory size ill, and P parallel processes, the disk 1/0 if any local output size does not merge depth L is L = log&f lP - ~((R\P)/ fit into a local memorv. The difference ( M/P)) = log ~, P- ~(R/M). The optimal between the two is d~termined by the degree of parallelism must be deter- degree to which the original input is al- mined considering the tradeoff between ready partitioned (usually not at all), parallel processi~g and large fan-ins, making this technique very beneficial. somewhat similar to the tradeoff be- tween fan-in and cluster size. Extending 10.4 Parallel Joins and Other Binary this argument using the duality of sort- Matching Operations ing and hashing, too much parallelism in hash partitioning on shared-memory ma- Binary matching operations such as join, chines can also be detrimental, both for semi-join, outer join, intersection, union, aggregation and for binary lmatching and difference are different than the pre- [Hong and Stonebraker 1993]. vious operations exactly because they are binary. For bushy parallelism, i.e., a join for which two subplans create the two 10.3 Parallel Aggregation and Duplicate inputs independently from one another Removal in parallel, we might consider symmetric Parallel algorithms for aggregation and hash join algorithms. Instead of differen- duplicate removal are best divided into a tiating between build and probe local step and a global step. First, dupli- inputs, the symmetric hash join uses two cates are eliminated locally, and then hash tables. one for each in~ut. When a data are partitioned to detect and re- data item (or packet of ite’ms) arrives, move duplicates from different original the join algorithm first determines which sites. For aggregation, local and global in~ut it came from and then ioins the aggregate functions may differ. 
For ex- ne’w data item with the hash t~ble built ample, to perform a global count, the from the other input as well as inserting local aggregation counts while the global the new data item into its hash table aggregation sums local counts into a such that data items from the other in- global count. put arriving later can be joined correctly. For local hash-based aggregation, a Such a symmetric hash join algorithm special technique might improve perfor- has been used in XPRS, a shared- mance. Instead of creating overflow files memory high-performance extensible-

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques 9 141

relational database system [Hong and outer join, difference, and union, namely, Stonebraker 1991; 1993; Stonebraker when an item is replicated and is in- et al. 1988a; 1988b] as well as in Pris- serted into the output (incorrectly) multi- ma/DB, a shared-nothing main-memory ple times. database system [Wilschut 1993; Wil- A technique for reducing network traf- schut and Apers 1993]. The advantage fic during join processing in distributed of symmetric matching algorithms is that database systems uses redundant semi- they are independent of the data rates of joins [Bernstein et al. 1981; Chiu and Ho the inputs; their disadvantage is that 1980; Gouda and Dayal 1981], an idea they require that both inputs fit in mem- that can also be used in distributed- ory, although one hash table can be memory parallel systems. For example, dropped when one input is exhausted. consider the join on a common attribute For parallelizing a single binary A of relations R and S stored on two matching operation, there are basically different nodes in a network, say r and two techniques, called here symmetric s. The semi-join method transfers a partitioning and fragment and replicate. duplicate-free projection of R on A to s, In both cases, the global result is the performs a semi-join there to determine union (concatenation) of all local results. the items in S that actually participate Some algorithms exploit the topology of in the join result, and ships these items certain architectures, e.g., ring- or cube- to r for the actual join. In other words, based communication networks [Baru based on the relational algebra law that and Frieder 1989; Omiecinski and Lin 1989]. In the symmetric partitioning meth- R JOINS ods, both inputs are partitioned on the = R JOIN (S SEMIJOIN R), attributes relevant to the operation (i.e., the join attribute for joins or all at- tributes for set operations), and then the cost savings of not shipping all of S were operation is performed at each site. Both realized at the expense of projecting and the Gamma and the Teradata database shipping the R. A-column and executing machines use this method. Notice that the semi-join. Of course, this idea can be the partitioning method (usually hashed) used symmetrically to reduce R or S or and the local join method are indepen- both, and all operations (projection, du- dent of each other; Gamma and Grace plicate removal, semi-join, and final join) use hash joins while Teradata uses can be executed in parallel on both r and merge-join. s or on more than two nodes using the In the fragment-and-replicate meth- parallel join strategies discussed earlier ods, one input is partitioned, and the in this section. Furthermore, there are other one is broadcast to all sites. Typi- probabilistic variants of this idea that cally, the larger input is partitioned by use bit vector filtering instead of semi- not moving it at all, i.e., the existing joins, discussed later in its own section. partitions are processed at their loca- Roussopoulos and Kang [ 1991] recently tions prior to the binary matching opera- showed that symmetric semi-joins are tion. Fragment-and-replicate methods particularly useful. 
Using the equalities were considered the join algorithms of (for a join of relations R and S on at- choice in early distributed database sys- tribute A) tems such as R*, SDD-1, and distributed Ingres, because communication costs overshadowed local processing costs and R JOINS because it was cheaper to send a small = R JOIN (S SEMIJOIN r~R ) input to a small number of sites than to = (R SEMIJOIN w~(S SEMIJOIN n.R)) partition both a small and a large input. JOIN (S SEMIJOIN(a)~~R) (a) Note that fragment-and-replicate meth- = (R SEMIJOIN ri-.(S SEMIJOIN WAR)) ods do not work correctly for semi-join, JOIN (S (SEMIJOIN@)r~R ) , (b)

ACM Computing Surveys, Vol 25, No. 2, ,June 1993 142 ● Goetz Graefe they designed a four-step procedure to I L I L I compute the join of two relations stored A A A i at two sites. First, the first relation’s join attribute column R. A is sent duplicate R free to the other relation’s site, s. Second, the first semi-join is computed at s, and either the matching values (term (a) above) or the nonmatching values (term (b) above) of the join column S. A are sent back to the first site, r. The choice s between (a) and (b) is made based on the number of matching and nonmatching17 Figure 34. Symmetric fragment-and-replicate join. values of S. A. Third, site r determines which items of R will participate in the join R JOIN S, i.e., R SEMIJOIN S. replicated within each row, while the Fourth, both input sites send exactly other input is partitioned and replicated those items that will participate in the over columns. Each item from one input join R JOIN S to the site that will com- “meets” each item from the other input pute the final result, which may or may at exactly one site, and the global join not be one of the two input sites. Of result is the concatenation of all local course, this two-site algorithm can be joins. used across any number of sites in a Avoiding partitioning as well as broad- parallel query evaluation system. casting for many joins can be accom- Typically, each data item is exchanged plished with a physical database design only once across the interconnection net- that considers frequently performed joins work in a parallel algorithm. However, and distributes and replicates data over for parallel systems with small communi- the nodes of a parallel or distributed sys- cation overhead, in particular for tem such that many joins already have shared-memory systems, and in parallel their input data suitably partitioned. processing systems with processors with- Katz and Wong formalized this notion as out local disk(s), it may be useful to local sufficiency [Katz and Wong 1983; spread each overflow file over all avail- Wong and Katz 1983]; more recent re- able nodes and disks in the system. The search on the issue was performed in the disadvantage of the scheme may be Bubba project [Copeland et al. 1988]. communication overhead; however, the For joins in distributed systems, a third advantages of load balancing and cumu- class of algorithms, called fetch-as- lative bandwidth while reading a parti- needed, was explored. The idea of these tion file have led to the use of this scheme algorithms is that one site performs the both in the Gamma and SDC database join by explicitly requesting (fetching) machines, called bucket spreading in the only those items from the other input SDC design [DeWitt et al. 1990; needed to perform the join [Daniels and Kitsuregawa and Ogawa 19901. Ng 1982; Williams et al. 1982]. If one For parallel non-equi-joins, a symmet- input is very small, fetching only the ric fragment-and-replicate method has necessary items of the larger input might been proposed by Stamos and Young seem advantageous. However, this algo- [Stamos and Young 1989]. As shown in rithm is a particularly poor implementa- Figure 34, processors are organized into tion of a semi-join technique discussed rows and columns. One input relation is above. 
Instead of requesting items or val- partitioned over rows, and partitions are ues one by one, it seems better to first project all join attribute values, ship (stream) them across the network, per- form the semi-join using any local binary 17SEMIJOIN stands for the anti-semi-join, which determines those items in the first input that do matching algorithm, and then stream ex- not have a match in the second input. actly those items back that will be re-

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 143 quired for the join back to the first site. replicated in the main memory of all par- The difference between the semi-join ticipating processors. After replication, technique and fetch-as-needed is that the all local hash-division operators work semi-join scans the first input twice, once completely independent of each other. to extract the join values and once to Clearly, replication is trivial for shared- perform the real join, while fetch as memory machines, in particular since a needed needs to work on each data item single copy of the divisor table can be only once. shared without synchronization among multiple processes once it is complete. When using divisor partitioning, the 10.5 Parallel Universal Quantification resulting partitions are processed in par- In our earlier discussion on sequential allel instead of in phases as discussed for universal quantification, we discussed hash table overflow. However, instead of four algorithms for universal quantifica- tagging the quotient items with phase tion or relational division, namely, naive numbers, processor network addresses division (a direct, sort-based algorithm), are attached to the data items, and the hash-division (direct, hash based), and collection site divides the set of all incom- sort- and hash-based aggregation (indi- ing data items over the set of processor rect ) algorithms, which might require network addresses. In the case that the semi-joins and duplicate removal in the central collection site is a bottleneck, the inputs. collection step can be decentralized using For naive division, pipelining can be quotient partitioning. used between the two sort operators and the division operator. However, both quo- 11. NONSTANDARD QUERY PROCESSING tient partitioning and divisor partition- ALGORITHMS ing can be employed as described below for hash-division. In this section, we briefly review the For algorithms based on aggregation, query processing needs of data models both pipelining and partitioning can be and database systems for nonstandard applied immediately using standard applications. In many cases, the logical techniques for parallel query execution. operators defined for new data models While partitioning seems to be a promis- can use existing algorithms, e.g., for in- ing approach, it has an inherent problem tersection. The reason is that for process- due to the possible need for a semi-join. ing, bulk data types such as array, set, Recall that in the example for universal bag (multi-set), or list are represented as quantification using Transcript and sequences similar to the streams used in Course relations, the join attribute in the the query processing techniques dis- semi-join (course-no) is different than the cussed earlier, and the algorithms to ma- grouping attribute in the subsequent ag- nipulate these bulk types are equal to gregation (student-id). Thus, the Tran- the ones used for sets of tuples, i.e., rela- script relation has to be partitioned twice, tions. However, some algorithms are gen- once for the semi-join and once for the uinely different from the algorithms we aggregation. have surveyed so far. In this section, we For hash-division, pipelining has only review operators for nested relations, limited promise because the entire temporal and scientific databases, division is performed within a single object-oriented databases, and more operator. 
However, both partitioning meta-operators for additional query pro- strategies discussed earlier for hash table cessing control. overflow can be employed for parallel ex- There are several reasons for integrat- ecution, i.e., quotient partitioning and di- ing these operators into an algebraic visor partitioning [Graefe 1989; Graefe query-processing system. First, it per- and Cole 1993]. mits efficient data transfer from the For hash-division with quotient par- database to the application embodied in titioning, the divisor table must be these operators. The interface between

ACM Computing Surveys, Vol. 25, No. 2, June 1993 144 “ Goetz Graefe database operators is designed to be as can be represented more naturally than efficient as possible; the same efficient in the fully normalized model; many fre- interface should also be used for applica- quent join operations can be avoided, and tions. Second, operator implementors can structural information can be used for take advantage of the control provided by physical clustering. Its disadvantage is the meta-operators. For example, an op- the added complexity, in particular, in erator for a scientific application can storage management and query be implemented in a single-process envi- m-ocessing. ronment and later parallelized with the Severa~ algebras for nested relations exchange operator. Third, query opti- have been defined, e.g., Deshpande and mization based on algebraic transform a- Larson [1991], Ozsoyoglu et al. [1987], tion rules can cover all operators, includ- Roth et al. [ 1988], Schek and Scholl ing operations that are normally consid- [1986], Tansel and Garnett [1992]. Our ered database application code. For ex- discussion here focuses not on the concep- ample, using algebraic optimization tools tual design of NF z algebras but on algo- such as the EXODUS and Volcano opti- rithms to manipulate nested relations. mizer generators [ Graefe and DeWitt Two operations required in NF 2 1987; Graefe et al. 1992; Graefe and database systems are operations that McKenna 1993], optimization rules that transform an NF z relation into a normal- can move an unusual database operator ized relation with atomic attributes onlv../ in a query plan are easy to implement. and vice versa. The first operation is For a sampling operator, a rule might frequently called unnest or flatten; the permit transforming an algebra expres- opposite direction is called the nest oper- sion to query a sample instead of sam- ation. The unnest operation can be per- pling a query result, formed in a single scan over the NF 2 relation that includes the nested subtu- ples; both normalized relations in Figure 11.1 Nested Relations 35 and their join can be derived readily Nested relations, or Non-First-Normal- enough from the NF z relation. The nest Form (NF 2) relations, permit relation- operation requires grouping of tuples in valued attributes in addition to atomic the detail relation and a join with the values such as integers and strings used master relation. Grouping and join can in the normal or “flat” relational model. be implemented using any of the algo- For example, in an order-processing ap- rithms for aggregate functions and bi- plication, the set of individual line items nary matching discussed earlier, i.e., sort- on each order could be represented as a and hash-based sequential and parallel nested relation, i.e., as part of an order methods. However, in order to ensure tuple. Figure 35 shows an NF z relation that unnest and nest o~erations. are ex- with two tuples with two and three nested act inverses of each other, some struc- tuples and the equivalent normalized re- tural information might have to be pre- lations, which we call the master and served in the unnest operation. Ozsoyoglu detail relations. Nested relations can be and Wang [1992] present a recent inves- used for all one-to-many relationships but tigation of “keying methods” for this pur- are particularly well suited for the repre- pose. 
sentation of “weak entities” in the All operations defined for flat relations Entity-Relationship (ER) Model [Chen can also be defhed for nested relations, 1976], i.e., entities whose existence and in particular: selection, join, and set op- identification depend on another entity erations (union, intersection, difference). as for order entries in Figure 35. In gen- For selections, additional power is gained eral, nested subtuples may include rela- with selection conditions on subtuples tion-valued attributes, with arbitrary and sets of subtuples using set compar- nesting depth. The advantages of the NF 2 isons or existential or universal auantifl-. model are that component relationships cation. In principle, since a nested rela-

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 145

Order Customer Date Items -No -No Part-No Count 110 911 910902 4711 8 2345 7 112 912 910902 9876 3 2222 1 2357 9

Order-No Part-No Quantity 110 4711 8 110 2345 7 112 9876 3 112 2222 1 112 2357 9

Figure 35. Nested relation and equivalent flat relations.

tion is a relation, any relational calculus Deshpande and Larson [ 1992] investi- and algebra expression should be permit- gated join algorithms for nested relations ted for it. In the example in Figure 35, because “the purpose of nesting in order there may be a selection of orders in to store precomputed joins is defeated which the ordered quantity of all items is if it is unnested every time a join is more than 100, which is a universal performed on a subrelation.” Their algo- quantification. The algorithms for selec- rithm, (parallel) partitioned nested- tions with quantifier are similar to the hashed-loops, joins one relation’s subre- ones discussed earlier for flat relations, lations with a second, flat relation by e.g., relational semi-join and division, but creating an in-memory hash table with are easier to implement because the the flat relation. If the flat relation is grouping process built into the flat- larger than memory, memory-sized seg- relational algorithms is inherent in the ments are loaded one at a time, and the nested tuple structure. nested relation is scanned repeatedly. For joins, similar considerations apply. Since an outer tuple of the nested rela- Matching algorithms discussed earlier tion might have matches in multiple seg- can be used in principle. They may be ments of the flat relation, a final merging more complex if the join predicate in- pass is required. This join algorithm is volves subrelations, and algorithm com- reminiscent of hash-division, the flat re- binations may be required that are de- lation taking the role of the divisor and rived from a flat-relation query over flat the nested tuples replacing the quotient relations equivalent to the NF 2 query table entries with their bit maps. over the nested relations. However, there Sort-based join algorithms for nested should be some performance improve- relations require either flattening the ments possible if the grouping of values nested tuples or scanning the sorted flat in the nested relations can be exploited, relation for each nested tuple, somewhat as for example, in the join algorithms reminiscent of naive division. Neither al- described by Rosenthal et al. [1991]. ternative seems very promising for large

ACM Computing Surveys, Vol. 25, No 2, June 1993 146 “ Goetz Graefe

inputs. Sort semantics and appropriate join. They do not reestablish relation- sort algorithms including duplicate re- ships based on identifying keys but match moval and grouping have been consid- data values that express a dimension in ered by Saake et al. [1989] and Kuespert which distance can be defined, in partic- et al. [ 1989]. Other researchers have fo- ular, time. Traditionally, such join predi- cused on storage and retrieval methods cates have been considered non-equi-joins for nested relations and operations possi- and were evaluated by a variant of ble with single scans [Dadam et al. 1986; nested-loops join. However, such “band Deppisch et al. 1986; Deshpande and Van joins” can be executed much more effi- Gucht 1988; Hafez and Ozsoyoglu 1988; ciently by a variant of merge-join that Ozsoyoglu and Wang 1992; Scholl et al. keeps a “window” of inner relation tuples 1987; Scholl 1988]. in memory or by a variant of hash join that uses range partitioning and assigns some build tuples to multiple partition 11.2 Temporal and Scientific Database files. A similar partitioning model must Management be used for parallel execution, requiring For a variety of reasons, management multi-cast for some tuples. Clearly, these and manipulation of statistical, tempo- variants of merge-join and hash join will ral, and scientific data are gaining inter- outperform nested loops for large inputs, est in the database research community. unless the band is so wide that the join Most work on temporal databases has result approaches the Cartesian product. focused on semantics and representation For storage and management of the in data models and query languages massive amounts of data resulting from [McKenzie and Snodgrass 1991; Snod- scientific experiments, database tech- grass 1990]; some work has considered niques are very desirable. Operators for special storage structures, e.g., Ahn and processing time series in scientific Snodgrass [1988], Lomet and Salzberg databases are based on an interpretation [ 1990b], Rotem and Segev [1987], Sever- of a stream between operators not as a ance and Lehman [1976], algebraic oper- set of items (as in most database applica- ators, e.g., temporal joins [Gunadhi and tions) but as a sequence in which the Segev 1991], and optimization of tempo- order of items in a stream has semantic ral queries, e.g., Gunadhi and Segev meaning. For example, data reduction [1990], Leung and Muntz [1990; 1992], using interpolation as well as extrapola- Segev and Gunadhi [1989]. While logical tion can be performed within the stream query algebras require extensions to ac- paradigm. Similarly, digital filtering commodate time, only some storage [Hamming 19771 also fits the stream- structures and algorithms, e.g., multidi- processing protocol very easily. Interpo- mensional indices, differential files, and lation, extrapolation, and digital filtering versioning, and the need for approximate were implemented in the Volcano system selection and matching (join) predicates with a single algorithm (physical opera- are new in the query execution algo- tor) to verify this fit, including their opti- rithms for temporal databases. mization and parallelization [ Graefe and A number of operators can be identi- Wolniewicz 1992; Wolniewicz and Graefe fied that both add functionality to 1993]. Another promising candidate is vi- database systems used to process scien- sualization of single-dimensional arrays tific data and fit into the database query such as time series, processing paradigm. DeWitt et al. 
Problems that do not fit the stream [1991a] considered algorithms for join paradigm, e.g., many matrix operations predicates that express proximity, i.e., such as transformations used in linear join predicates of the form R. A – c1 < algebra, Laplace or Fast Fourier Trans- S.B < R.A + Cz for some constants c1 form, and slab (multidimensional sub- and Cz. Such join predicates are very array) extraction, are not as easy to different from the usual use of relational integrate into database query processing

ACM Computing Surveys, Vol 25, No 2, June 1993 Query Evaluation Techniques ● 147 systems. Some of them seem to fit better gebraic optimization techniques will con- into the storage management subsystem tinue to be useful. rather than the algebraic query execu- Associative operations are an impor- tion engine. For example, slab extraction tant part in all object-oriented algebras has been integrated into the NetCDF because they permit reducing large storage and access software [Rew and amounts of data to the interesting subset Davis 1990; Unidata 1991]. However, it of the database suitable for further con- is interesting to note that sorting is a sideration and processing. Thus, set- suitable algorithm for permuting the lin- processing and set-matching algorithms ear representation of a multidimensional as discussed earlier in this survey will be array, e.g., to modify the hierarchy of found in object-oriented systems, imple- dimensions in the linearization (row- vs. mented in such a way that they can oper- column-major linearization). Since the ate on heterogeneous sets. The challenge final position of all elements can be for query optimization is to map a com- predicted from the beginning of the op- plex query involving complex behavior eration, such “sort” algorithms can be and complex object structures to primi- based on merging or range partitioning tives available in a query execution en- (which is yet another example of the gine. Translating an initial request with duality of sort- and hash- (partitioning-) abstract data types and encapsulated be- based data manipulation algorithms). havior coded in a computationally com- plete language into an internal form that both captures the entire query’s seman- 11.3 Object-Oriented Database Systems tics and allows effective query optimiza- Research into query processing for exten- tion is still an open research issue sible and object-oriented systems has [Daniels et al. 1991; Graefe and Maier been growing rapidly in the last few 1988]. years. Most proposals or implementa- Beyond associative indices discussed tions use algebras for query processing, earlier, object-oriented systems can also e.g., Albert [1991], Cluet et al. [1989], benefit from special relationship indices, Graefe and Maier [1988], Guo et al. i.e., indices that contain condensed infor- [199 1], Mitschang [1989], Shaw and mation about interobject references. In Zdonik [1989a; 1989b; 1990], Straube and principle, these index structures are sim- Ozsu [1989], Vandenberg and DeWitt ilar to join indices [Valduriez 1987] but [1991], Yu and Osborn [1991]. These al- can be generalized to support multiple gebras resemble relational algebra in the levels of referencing. Examples for in- sense that they focus on bulk data types dices in object-oriented database systems but are generalized to support operations include the work of Maier and Stein on arrays, lists, etc., user-defined opera- [1986] in the Gemstone object-oriented tions (methods) on instances, heteroge- database system product, Bertino [1990; neous bulk types, and inheritance. The 1991] and Bertino and Kim [1989], in the use of algebras permits several impor- Orion project and Kemper et al. [19911 tant conclusions. First, naive execution and Kemper and Moerkotte [1990a; models that execute programs as if all 1990b] in the GOM project. At this point, data were in memory are not the only it is too early to decide which index alternative. 
Second, data manipulation structures will be the most useful be- operators can be designed and imple- cause the entire field of query processing mented that go beyond data retrieval and in object-oriented systems is still devel- permit some amount of data reduction, oping rapidly, from query languages to aggregation, and even inference. Third, algebra design, algorithm repertoire, and algebraic execution techniques including optimization techniques. other areas of the stream paradigm and parallel execu- intense current research interest are tion can be used in object-oriented data buffer management and clustering of models and database systems, Fourth, al- objects on disk.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 148 “ Goetz Graefe

One of the big performance penalties or control operator. There are several in object-oriented database systems is other control o~erators. that can be used “pointer chasing” (using OID references) in database query processing, and we which may involve object faults and disk survey them briefly in this section. read operations at widely scattered loca- In situations in which an intermediate tions, also called “goto’s on disk.” In or- result is used repeatedly, e.g., a nested- der to reduce 1/0 costs, some systems loops join with a composite inner input, use what amounts to main-memory either the intermediate result is derived databases or map the entire database many times, or it is saved in a temporary into virtual memory. For systems with file during its first derivation and then an explicit database on disk and an in- retrieved from this file while serving sub- memory buffer, there are various tech- sequent requests. This situation arises niques to detect object faults; some not only with nested-loops join but also commercial object-oriented database sys- with other algorithms, e.g., sort-based tems use hardware mechanisms origi- universal quantification [Smith and nally perceived and implemented for Chang 1975]. Thus, it might be useful to virtual-memory systems. While such encapsulate this functionality in a new hardware support makes fault detection algorithm, which we call the store-and- faster. it does not address the .m-oblem of scan o~erator. expensive 1/0 operations. In order to re- The’ store-and-scan operator permits duce actual 1/0 cost, read-ahead and three generalizations. First, if the first planned buffering must be used. Palmer consumption of the intermediate result and Zdonik [1991] recently proposed might actually not need it entirely, e.g., a keeping access patterns or sequences and nested-loops semi-join which terminates activating read-ahead if accesses equal each inner scan after the first match, the or similar to a stored ~attern are de- operator should be switched to derive only tected. Another recent p~oposal for effi- the necessary data items (which implies cient assembly of complex objects uses a leaving the input plan ready to produce window (a small set) of open references more data later) or to save the entire and resolves, at any point of time, the intermediate result in the temporary file most convenient one by fetching this ob- right away in order to permit release of ject or component from disk, which has all resources in the subplan. Second, sub- shown dramatic improvements in disk sequent scans might permit starting not seek times and makes complex object re- at the beginning of the temporary file but trieval more efficient and more indepen- at some later point. This version is useful dent of object clustering [Keller et al. if manv . duplicates. exist in the inwts. of 19911. Policies and mechanisms for effi- one-to-one matching algorithms based on cient parallel complex object assembly are merge-join. Third, in some execution an important challenge for the develop- strategies for correlated SQL sub queries, ers of next-generation object-oriented the plan corresponding to the inner block database management systems [Maier et is executed once for each tuple in the al. 1992]. outer block. The tudes of the outer block provide different c~rrelation values, al- though each value may occur repeatedly. 
11.4 More Control Operators In order to ensure that the inner plan is The exchange operator used for parallel executed only once for each outer correla- query processing is not a normal opera- tion value, the store-and-scan operator tor in the sense that it does not manipu- could retain information about which part late, select, or transform data. Instead, of its temporary file corresponds to which the exchange operator provides control of correlation value and restrict each scan query processing in a way orthogonal to appropriately. Another use of a tem~orarv file is to what a query does and what algorithms .“ it uses. Therefore, we call it a meta- support common subexpressions, which

ACM Computing Surveys, Vol. 25, No 2, June 1993 Query Evaluation Techniques - 149 can be executed efficiently with an opera- could design data flow translation con- tor that passes the result of a common trol operators. The first such operator subexpression to multiple consumers, as which we call the active scheduler can be mentioned briefly in the section on the used between a demand-driven producer architecture of query execution engines. and a data-driven consumer. In this case, The problem is that multiple consumers, neither operator will schedule the other; typically demand-driven and demand- therefore, an active scheduler that de- driving their inputs, will request items of mands items from the producer and forces the common subexpression result at dif- them onto the consumer will glue these ferent times or rates. The two standard two operators together. An active- solutions are either to execute the com- scheduler schematic is shown in the third mon subexpression into a temporary file diagram of Figure 36. The opposite case, and let each consumer scan this file at a data-driven producer and a demand- will or to determine which consumer will driven consumer, has two operators, each be the first to require the result of the trying to schedule the other one. A sec- common subexpression, to execute the ond flow control operator, called the pas- common subexpression as part of this sive scheduler, can be built that accepts consumer and to create a file with the procedure calls from either neighbor and common subexpression result as a by- resumes the other neighbor in a corou- product of the first consumer’s execution. tine fashion to ensure that the resumed Instead, we suggest a new meta-operator, neighbor will eventually demand the item which we call the split operator, to be the scheduler just received. The final dia- placed at the top of the common subex- gram of Figure 36 shows the control flow pression’s plan and which can serve mul- of a passive scheduler. (Notice that this tiple consumers at their own paces. It case is similar to the bracket model of automatically performs buffering to ac- parallel operator implementations dis- count for different paces, uses temporary cussed earlier in which an operating sys- disk space if the discrepancies are too tem or networking software layer had to wide, and is suitably parameterized to be placed between data manipulation op- permit both standard solutions described erators and perform buffering and flow above. control.) In query processing systems, data flow Finally, for very complex queries, it is usually paced or driven from the top, might be useful to break the data flow the consumer. The leftmost diagram of between operators at some point, for two Figure 36 shows the control flow of nor- reasons. First, if too many operators run mal iterators. (Notice that the arrows in in parallel, contention for memory or Figure 36 show the control flow; the data temporary disks might be too intense, flow of all diagrams in Figure 36 is as- and none of the operators will run as sumed to be upward. In data-driven data efficiently as possible. A long series of flow, control and data flows point in the hybrid hash joins in a right-deep query same direction; in demand-driven data plan illustrates this situation. Second, flow, their directions oppose each other.) 
due to the inherent error in selectivity However, in real-time systems that cap- estimation during query optimization ture data from experiments, this ap- [Ioannidis and Christodoulakis 1991; proach may not be realistic because the Mannino et al. 1988], it might be worth- data source, e.g., a satellite receiver, has while to execute only a subset of a plan, to be able to unload data as they arrive. verify the correctness of the estimation, In such systems, data-driven operators, and then resume query processing with shown in the second diagram of Figure another few steps. After a few processing 36, might be more appropriate. To com- steps have been performed, their result bine the algorithms implemented and size and other statistical properties such used for query processing with such as minimum and maximum and approxi- real-time data capture requirements, one mate number of duplicate values can be

ACM Computing Surveys, Vol. 25, No. 2, June 1993 150 “ Goet.z Graefe

r+ Passive Standard Data-Driven I Active Iterator Opemtor Scheduler Scheduler

L + CDFigure 36. Operators, schedulers, and control flow

12. ADDITIONAL TECHNIQUES FOR PERFORMANCE IMPROVEMENT

In this section, we consider some additional techniques that have been proposed in the literature or used in real systems and that have not been discussed in earlier sections of this survey. In particular, we consider precomputation, data compression, surrogate processing, bit vector filters, and specialized hardware. Recently proposed techniques that have not been fully developed are not discussed here, e.g., "racing" equivalent plans and terminating the ones that seem not competitive after some small amount of time.

12.1 Precomputation and Derived Data

It is trivial to answer a query for which the answer is already known—therefore, precomputation of frequently requested information is an obvious idea. The problem with keeping preprocessed information in addition to base data is that it is redundant and must be invalidated or maintained on updates to the base data.

Precomputation and derived data such as relational views are duals. Thus, concepts and algorithms designed for one will typically work well for the other. The main difference is the database user's view: precomputed data are typically used after a query optimizer has determined that they can be used to answer a user query against the base data, while derived data are known to the user and can be queried without regard to the fact that they actually must be derived at run-time from stored base data. Not surprisingly, since derived data are likely to be referenced and requested by users and application programs, precomputation of derived data has been investigated both for relational and object-oriented data models.

Indices are the simplest form of precomputed data since they are a redundant and, in a sense, precomputed selection. They represent a compromise between a nonredundant database and one with complex precomputed data because they can be maintained relatively efficiently.

The next more sophisticated form of precomputation are inversions as provided in System R's "0th" prototype [Chamberlin et al. 1981a], view indices as analyzed by Roussopoulos [1991], two-relation join indices as proposed by Valduriez [1987], or domain indices as used in the ANDA project (called VALTREE there) [Deshpande and Van Gucht 1988] in which all occurrences of one domain (e.g., part number) are indexed together, and each index entry contains a relation identification with each record identifier. With join or domain indices, join queries can be answered very fast, typically faster than using multiple single-relation indices. On the other hand, single-clause selections and updates may be slightly slower if there are more entries for each indexed key.

For binary operators, there is a spectrum of possible levels of precomputations (as suggested by J. A. Blakeley), explored predominantly for joins. The simplest form of precomputation in support of binary operations is individual indices, e.g., clustering B-trees that ensure and maintain sorted relations. On the other extreme are completely materialized join results. Intermediate levels are pointer-based joins [Shekita and Carey 1990] (discussed earlier in the section on matching) and join indices [Valduriez 1987]. For each form of precomputed result, the required redundant data structures must be maintained each time the underlying base data are updated, and larger retrieval speedup might be paid for with larger maintenance overhead.

Babb [1982] explored storing only results of outer joins, but not the normalized base relations, in the content-addressable file store (CAFS), and called this encoding join normal form. Blakeley et al. [1989], Blakeley and Martin [1990], Larson and Yang [1985], Medeiros and Tompa [1985], Tompa and Blakeley [1988], and Yang and Larson [1987] investigated storing and maintaining materialized views in relational database systems. Their hope was to speed relational query processing by using derived data, possibly without storing all base data, and ensuring that their maintenance overhead would be less than their benefits in faster query processing. For example, Blakeley and Martin [1990] demonstrated that for a single join there exists a large range of retrieval and update mixes in which materialized views outperform both join indices and hybrid hash join. This investigation should be extended, however, for more complex queries, e.g., joins of three and four inputs, and for queries in object-oriented systems and emerging database applications.

Hanson [1987] compared query modification (i.e., query evaluation from base relations) against the maintenance costs of materialized views and considered in particular the cost of immediate versus deferred updates. His results indicate that for modest update rates, materialized views provide better system performance. Furthermore, for modest selectivities of the view predicate, deferred-view maintenance using differential files [Severance and Lohman 1976] outperforms immediate maintenance of materialized views. However, Hanson also did not include multi-input joins in his study.
Sellis [1987] analyzed caching of results in a query language called Quel+ (which is a subset of Postquel [Stonebraker et al. 1990b]) over a relational database with procedural (QUEL) fields [Sellis 1987]. He also considered the case of limited space on secondary storage used for caching query results, and replacement algorithms for query results in the cache when the space becomes insufficient.
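As one concrete point on the spectrum of precomputation for binary operations, the short Python sketch below (an illustration under simple in-memory assumptions, not taken from any cited system) builds a two-relation join index — pairs of record identifiers of matching tuples — and uses it to answer a join without rescanning and rematching the base relations; like any precomputed structure, it must be maintained whenever the base data change.

```python
# Illustrative sketch of a two-relation join index: precomputed pairs
# of record identifiers (RIDs) of tuples that join with each other.

# Base relations, addressed by RID (list position).
parts  = [("P1", "bolt"), ("P2", "nut"), ("P3", "washer")]
orders = [("O1", "P1"), ("O2", "P3"), ("O3", "P1")]

def build_join_index(parts, orders):
    """Precompute all (part_rid, order_rid) pairs whose tuples join."""
    by_key = {}
    for rid, (part_no, _) in enumerate(parts):
        by_key.setdefault(part_no, []).append(rid)
    index = []
    for orid, (_, part_no) in enumerate(orders):
        for prid in by_key.get(part_no, []):
            index.append((prid, orid))
    return index

join_index = build_join_index(parts, orders)

# Answering the join later requires only RID lookups, no matching.
result = [(parts[prid], orders[orid]) for prid, orid in join_index]
print(result)

# On an update to the base data, the redundant structure must be
# maintained, e.g., by appending the new matching RID pairs.
orders.append(("O4", "P2"))
new_orid = len(orders) - 1
join_index += [(prid, new_orid) for prid, (pno, _) in enumerate(parts)
               if pno == "P2"]
```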


Links between records (pointers of some sort, e.g., record, tuple, or object identifiers) are another form of precomputation. Links are particularly effective for system performance if they are combined with clustering (assignment of records to pages). Database systems for the hierarchical and network models have used physical links and clustering, but supported basically only queries and operations that were "precomputed" in this way. Some researchers tried to overcome this restriction by building relational query engines on top of network systems, e.g., Chen and Kuck [1984], Rosenthal and Reiner [1985], Zaniolo [1979]. However, with performance improvements in the relational world, these efforts seem to have been abandoned. With the advent of extensible and object-oriented database management systems, combining links and ad hoc query processing might become a more interesting topic again. A recent effort for an extensible-relational system is Starburst's pointer-based joins discussed earlier [Haas et al. 1990; Shekita and Carey 1990].

In order to ensure good performance for its extensive rule-processing facilities, Postgres uses precomputation and caching of the action parts of production rules [Stonebraker 1987; Stonebraker et al. 1990a; 1990b]. For automatic maintenance of such derived data, persistent "invalidation locks" are stored for detection of invalid data after updates to the base data.

Finally, the Cactis project focused on maintenance of derived data in object-oriented environments [Hudson and King 1989]. The conclusions of this project include that incremental maintenance coupled with a fairly simple adaptive clustering algorithm is an efficient way to propagate updates to derived data.

One issue that many investigations into materialized views ignore is the fact that many queries do not require views in their entirety. For example, if a relational student information system includes a view that computes each student's grade point average from the enrollment data, most queries using this view will select only a single student, not all students at the school. Thus, if the view definition is merged into the query before query optimization, as discussed in the introduction, only one student's grade point average, not the entire view, will be computed for each query. Obviously, the treatment of this difference will affect an analysis of costs and benefits of materialized views.
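The following small Python sketch is purely illustrative (the relation contents and function names are invented): it contrasts materializing the entire grade-point-average view with merging the view definition into the query so that only the one requested student's average is computed.

```python
# Illustrative sketch: evaluating a query through a view either by
# materializing the whole view or by merging the view definition
# into the query and pushing the selection down.

enrollment = [
    ("alice", "db", 4.0), ("alice", "os", 3.0),
    ("bob",   "db", 2.0), ("bob",   "ai", 3.0),
]

def gpa_view(rows):
    """The view: grade point average for every student in the input."""
    totals = {}
    for student, _, grade in rows:
        s = totals.setdefault(student, [0.0, 0])
        s[0] += grade
        s[1] += 1
    return {student: total / n for student, (total, n) in totals.items()}

# Naive evaluation: materialize the entire view, then select one row.
print(gpa_view(enrollment)["alice"])

# Merged evaluation: push the selection on the student into the view,
# so only that student's enrollment rows are aggregated.
print(gpa_view([r for r in enrollment if r[0] == "alice"])["alice"])
```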
12.2 Data Compression

A number of researchers have investigated the effect of compression on database systems and their performance [Graefe and Shapiro 1991; Lynch and Brownrigg 1981; Ruth and Keutzer 1972; Severance 1983]. There are two types of compression in database systems. First, the amount of redundancy can be reduced by prefix and suffix truncation, in particular, in indices, and by use of encoding tables (e.g., color combination "9" means "red car with black interior"). Second, compression schemes can be applied to attribute values, e.g., adaptive Huffman coding or Ziv-Lempel methods [Bell et al. 1989; Lelewer and Hirschberg 1987]. This type of compression can be exploited most effectively in database query processing if all attributes of the same domain use the same encoding, e.g., the "Part-No" attributes of data sets representing parts, orders, shipments, etc., because common encodings permit comparisons without decompression.

Most obviously, compression can reduce the amount of disk space required for a given data set. Disk space savings has a number of ramifications on I/O performance. First, the reduced data space fits into a smaller physical disk area; therefore, the seek distances and seek times are reduced. Second, more data fit into each disk page, track, and cylinder, allowing more intelligent clustering of related objects into physically near locations. Third, the unused disk space can be used for disk shadowing to increase reliability, availability, and I/O performance [Bitton and Gray 1988]. Fourth, compressed data can be transferred faster to and from disk. In other words, data compression is an effective means to increase disk bandwidth (not by increasing physical transfer rates but by increasing the information density of transferred data) and to relieve the I/O bottleneck found in many high-performance database management systems [Boral and DeWitt 1983]. Fifth, in distributed database systems and in client-server situations, compressed data can be transferred faster across the network than uncompressed data. Uncompressed data require either more network time or a separate compression step. Finally, retaining data in compressed form in the I/O buffer allows more records to remain in the buffer, thus increasing the buffer hit rate and reducing the number of I/Os. The last three points are actually more general. They apply to the entire storage hierarchy of tape, disk, controller caches, local and remote main memories, and CPU caches.

For query processing, compression can be exploited far beyond improved I/O performance because decompression can often be delayed until a relatively small data set is presented to the user or an application program. First, exact-match comparisons can be performed on compressed data. Second, projection and duplicate removal can be performed without decompressing data. The situation for aggregation is a little more complex since the attribute on which arithmetic is performed typically must be decompressed. Third, neither the join attributes nor other attributes need to be decompressed for most joins. Since keys and foreign keys are from the same domain, and if compression schemes are fixed for each domain, a join on compressed key values will give the same results as a join on normal uncompressed key values. It might seem unusual to perform a merge-join in the order of compressed values, but it nonetheless is possible and will produce correct results.

There are a number of benefits from processing compressed data. First, materializing output records is faster because records are shorter, i.e., less copying is required. Second, for inputs larger than memory, more records fit into memory. In hybrid hash join and duplicate removal, for instance, the fraction of the file that can be retained in the hash table and thus be joined without any I/O is larger. During sorting, the number of records in memory and thus per run is larger, leading to fewer runs and possibly fewer merge levels. Third, and very interestingly, skew is less likely to be a problem. The goal of compression is to represent the information with as few bits as possible. Therefore, each bit in the output of a good compression scheme has close to maximal information content, and bit columns seen over the entire file are unlikely to be skewed. Furthermore, bit columns will not be correlated. Thus, the compressed key values can be used to create a hash value distribution that is almost guaranteed to be uniform, i.e., optimal for hashing in memory and partitioning to overflow files as well as to multiple processors in parallel join algorithms.

We believe that data compression is undervalued in current query processing research, mostly because it was not realized that many operations can often be performed faster on compressed data than on uncompressed data, and we hope that future database management systems make extensive use of data compression. Considering the current growth rates in CPU and I/O performance, it might even make sense to exploit data compression on the fly for hash table overflow resolution.
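A minimal Python sketch of the domain-wide encoding idea (illustrative only; the dictionary-building scheme and data values are invented): if all Part-No attributes share one order-preserving dictionary, equality and ordering comparisons — and therefore joins and merge-joins — can be evaluated directly on the compressed codes.

```python
# Illustrative sketch: one order-preserving encoding table shared by
# all attributes of the same domain, so comparisons and joins can be
# performed on the compressed codes without decompression.

def build_dictionary(values):
    """Assign small integer codes in sort order (order-preserving)."""
    return {v: code for code, v in enumerate(sorted(set(values)))}

part_nos = ["P-900", "P-100", "P-500"]
dictionary = build_dictionary(part_nos + ["P-300"])   # shared domain codes

parts     = [(dictionary[p], desc) for p, desc in
             [("P-100", "bolt"), ("P-500", "nut"), ("P-900", "washer")]]
shipments = [(dictionary[p], qty) for p, qty in
             [("P-500", 10), ("P-100", 3), ("P-500", 7)]]

# Exact-match join on the compressed key values; no decompression needed.
matches = [(p, s) for p in parts for s in shipments if p[0] == s[0]]
print(matches)

# Because the codes are order-preserving, sorting or merge-joining in
# code order yields the same order as sorting on the original values.
assert sorted(d[0] for d in parts) == [dictionary[p] for p in sorted(part_nos)]
```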
12.3 Surrogate Processing

Another very useful technique in query processing is the use of surrogates for intermediate results. A surrogate is a reference to a data item, be it a logical object identifier (OID) used in object-oriented systems or a physical record identifier (RID) or location. Instead of keeping a complete record in memory, only the fields that are used immediately are kept, and the remainder replaced by a surrogate, which has in principle the same effect as compression. While this technique has traditionally been used to reduce main-memory requirements, it can also be employed to improve board- and CPU-level caching [Nyberg et al. 1993].

The simplest case in which surrogate processing can be exploited is in avoiding copying. Consider a relational join; when two items satisfy the join predicate, a new tuple is created from the two original ones. Instead of copying the data fields, it is possible to create only a pair of RIDs or pointers to the original records if they are kept in memory. If a record is 50 times larger than an RID, e.g., 8 vs. 400 bytes, the effort spent on copying bytes is reduced by that factor.
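A brief Python sketch of this idea (illustrative only; not the implementation of any cited system): the join produces pairs of RIDs instead of copied, concatenated records, and the full output records are assembled only when — and if — they are finally needed.

```python
# Illustrative sketch: joining on surrogates (RIDs) instead of copying
# whole records, with materialization of output records deferred.

customers = [("C1", "Smith", "..."), ("C2", "Jones", "...")]   # wide records
orders    = [("O1", "C2"), ("O2", "C1"), ("O3", "C2")]

def rid_join(left, right, left_key, right_key):
    """Return (left_rid, right_rid) pairs for matching records."""
    index = {}
    for rid, rec in enumerate(left):
        index.setdefault(left_key(rec), []).append(rid)
    return [(lrid, rrid)
            for rrid, rec in enumerate(right)
            for lrid in index.get(right_key(rec), [])]

rid_pairs = rid_join(customers, orders,
                     left_key=lambda c: c[0], right_key=lambda o: o[1])

# Only copy the (much larger) data fields for the rows actually needed,
# e.g., the first two result rows requested by the consumer.
for lrid, rrid in rid_pairs[:2]:
    print(customers[lrid] + orders[rrid])
```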
Copying is already a major part of the CPU time spent in many query processing systems, but it is becoming more expensive for two reasons. First, many modern CPU designs and implementations are optimized for an impressive number of instructions per second but do not provide the performance improvements in mundane tasks such as moving bytes from one memory location to another [Ousterhout 1990]. Second, many modern computer architectures employ multiple CPUs accessing shared memory over one bus because this design permits fast and inexpensive parallelism. Although alleviated by local caches, bus contention is the major bottleneck and limitation to scalability in shared-memory parallel machines. Therefore, reductions in memory-to-memory copying in database query execution engines permit higher useful degrees of parallelism in shared-memory machines.

A second example for surrogate processing was mentioned earlier in connection with indices. To evaluate a conjunction with multiple clauses, each of which is supported by an index, it might be useful to perform an intersection of RID-lists to reduce the number of records needed before actual data are accessed.

A third case is the use of indices and RIDs to evaluate joins, for example, in the query processing techniques used in Ingres [Kooi 1980; Kooi and Frankforth 1982] and IBM's hybrid join [Cheng et al. 1991] discussed in the section on binary matching.

Surrogate processing has also been used in parallel systems, in particular, distributed-memory implementations, to reduce network traffic. For example, Lorie and Young [1989] used RIDs to reduce the communication time in parallel sorting by sending (sort key, RID) pairs to a central site, which determines each record's global rank, and then repartitioning and merging records very quickly by their rank alone without further data comparisons.

Another form of surrogates are encodings with lossy compressions, such as superimposed coding used for efficient access methods [Bloom 1970; Faloutsos 1985; Sacks-Davis and Ramamohanarao 1983; Sacks-Davis et al. 1987]. Berra et al. [1987] and Chung and Berra [1988] considered indexing and retrieval organizations for very large (relational) knowledge bases and databases. They employed three techniques, concatenated code words (CCWs), superimposed code words (SCWs), and transformed inverted lists (TILs). TILs are normal index structures for all attributes of a relation that permit answering conjunctive queries by bitwise anding. CCWs and SCWs use hash values of all attributes of a tuple and either concatenate such hash values or bitwise or them together. The resulting code words are then used as keys in indices. In their particular architecture, Berra et al. and Chung and Berra consider associative memory and optical computing to search efficiently through such indices, although conventional software techniques could be used as well.

12.4 Bit Vector Filtering

In parallel systems, bit vector filters have been used very effectively for what we call here "probabilistic semi-joins." Consider a relational join to be executed on a distributed-memory machine with repartitioning of both input relations on the join attribute. It is clear that communication effort could be reduced if only the tuples that actually contribute to the join result, i.e., those with a match in the other relation, needed to be shipped across the network. To accomplish this, distributed database systems were designed to make extensive use of semi-joins, e.g., SDD-1 [Bernstein et al. 1981].

A faster alternative to semi-joins, which, as discussed earlier, requires basically the same computational effort as natural joins, is the use of bit vector filters [Babb 1979], also called Bloom-filters [Bloom 1970]. A bit vector filter with N bits is initialized with zeroes; all items in the first (preferably the smaller) input are hashed on their join key to 0, ..., N - 1. For each item, one bit in the bit vector filter is set to one; hash collisions are ignored. After the first join input has been exhausted, the bit vector filter is used to filter the second input. Data items of the second input are hashed on their join key value, and only items for which the bit is set to one can possibly participate in the join. There is some chance for false passes in the case of collisions, i.e., items of the second input pass the bit vector filter although they actually do not participate in the join, but if the bit vector filter is sufficiently large, the number of false passes is very small.

In general, if the number of bits is about twice the number of items in the first input, bit vector filters are very effective. If many more bits are available, the bit vector filter can be split into multiple subvectors, or multiple bits can be set for each item using multiple hash functions, reducing the number of false passes. Babb [1979] analyzed the use of multiple bit vector filters in detail.
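The sketch below (illustrative Python, with the bit-vector size and hash functions chosen arbitrarily) builds a bit vector filter from the build input's join keys and uses it to discard most probe items that cannot possibly find a match; collisions can only cause false passes, never lost matches.

```python
# Illustrative sketch of a bit vector (Bloom) filter used to prefilter
# the probe input of a join; hash choices here are arbitrary.

N = 64                       # number of bits, ideally ~2x the build size
HASHES = (lambda k: hash(("a", k)) % N,
          lambda k: hash(("b", k)) % N)

def build_filter(keys):
    bits = [0] * N
    for k in keys:
        for h in HASHES:
            bits[h(k)] = 1   # hash collisions are simply ignored
    return bits

def may_match(bits, key):
    """False means the key is definitely absent; True may be a false pass."""
    return all(bits[h(key)] for h in HASHES)

build_keys = [3, 17, 42, 99]
probe_rows = [(k, "payload") for k in range(0, 120, 7)]

bits = build_filter(build_keys)
surviving = [row for row in probe_rows if may_match(bits, row[0])]

# Only the surviving rows need to be shipped across the network (or
# written to an overflow file) and matched exactly against the build input.
print(len(probe_rows), "->", len(surviving))
```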
filtering can be used on the divisor at- The Gamma relational database ma- tributes to eliminate most of the dividend chine demonstrated the effectiveness of items that do not pertain to any divisor bit vector filtering in relational join pro- item. Thus, our earlier assessment that cessing on distributed-memory hardware universal quantification can be per- [DeWitt et al. 1986; 1988; 1990; Gerber formed as fast as existential quantifica- 1986]. When scanning and redistributing tion (a semi-join of dividend and divisor the build input of a join, the Gamma relations) even extends to special tech- machine creates a bit vector filter that is niques used to boost join performance. then distributed to the scanning sites of Bit vector filtering can also be ex- the probe input. Based on the bit vector ploited in sequential systems. Consider a

ACM Computmg Surveys, Vol. 25, No. 2, June 1993 156 * Goetz Graefe

merge-join with sort operations on both be used in hash-based duplicate removal. inputs. If the bit vector filter is built Since bit vector filters can only deter- based on the input of the first sort, i.e., mine safely which item has not been seen the ,bit vector filter is completed when all yet, but not which item has been seen yet data have reached the first sort operator. (due to possible hash collisions), bit vec- This bit vector filter can then be used to tor filters cannot be used in the most reduce the input into the second sort op- direct way in hash-based duplicate re- erator on the (presumably larger) second moval. However, hash-based duplicate input. Depending on how the sort opera- removal can be modified to become simi- tion is organized into phases, it might lar to a hash join or actually a hash-based even be possible to create a second bit set intersection. Consider a large file R vector filter from the second merge-join and a partitioning fan-out F. First, R is input and use it to reduce the first join partitioned into F/2 partitions. For each input while it is being merged. partition, two files are created; thus, this For sequential hash joins, bit vector step uses the entire fan-out to create a filters can be used in two ways. First, total of F files. Within each partition, a they can be used to filter items of the bit vector filter is used to determine probe input using a bit vector filter cre- whether an item belongs into the first or ated from items of the build input. This the second file of that partition. If an use of bit vector filters is analogous to bit item is guaranteed to be unique, i.e., there vector filter usage in parallel systems is no earlier item indicated in the bit and for merge-join. In Rdb/VMS and vector filter, the item is assigned to the DB2, bit vector filters are used when first file, and a bit in the bit vector filter intersecting large RID lists obtained from is set. Otherwise, the item is assigned multiple indices on the same table into the partition’s second file. At the end [Antoshenkov 1993; Mohan et ‘al. 1990]. of this partitioning step, there are F files, Second, new bit vector filters can be cre- half of them guaranteed to be free of ated and used for each partition in each duplicate data items. The possible size of recursion level. In the Volcano query- the duplicate-free files is limited by the processing system, the operator imple- size of the bit vector filters; therefore, menting hash join, intersection, etc. uses this step should use the largest bit vector the space used as anchor for each bucket’s filters possible. After the first partition- linked list for a small bit vector filter ing step, each partition’s pair of files is after the bucket has been spilled to an intersected using the duplicate-free file overflow file. Only those items from the as probe input. Recall that duplicate re- probe input that pass the bit vector filter moval for a join’s build input can be ac- are written to the probe overflow file. complished easily and inexpensively This technique is used in each recursion while building the in-memory hash table. level of overflow resolution. 
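The following Python sketch illustrates the partitioning step just described, under simplifying assumptions (in-memory lists standing in for files, one hash function per partition filter): each partition is split into a duplicate-free file and a file of suspected duplicates, guided by a small bit vector filter, and the two files of each partition are then matched to remove the remaining duplicates.

```python
# Illustrative sketch of hash-based duplicate removal with bit vector
# filters: each of F/2 partitions is split into a duplicate-free file
# and a file of suspected duplicates (false passes are possible).

FAN_OUT = 8                      # F: total number of output "files"
BITS_PER_FILTER = 32

def partition_with_filters(items):
    npart = FAN_OUT // 2
    dup_free = [[] for _ in range(npart)]     # guaranteed unique items
    suspects = [[] for _ in range(npart)]     # items that hit a set bit
    filters = [[0] * BITS_PER_FILTER for _ in range(npart)]
    for item in items:
        p = hash(item) % npart
        b = hash(("bit", item)) % BITS_PER_FILTER
        if filters[p][b]:
            suspects[p].append(item)          # possibly a duplicate
        else:
            filters[p][b] = 1
            dup_free[p].append(item)          # first item to set this bit
    return dup_free, suspects

def remove_duplicates(items):
    dup_free, suspects = partition_with_filters(items)
    result = []
    for unique_file, suspect_file in zip(dup_free, suspects):
        uniq = set(unique_file)
        # The suspect file plays the role of the build input: it is
        # deduplicated while building an in-memory hash table, and the
        # duplicate-free file then serves as the probe input.
        table = set(suspect_file)
        result.extend(unique_file)            # already known to be unique
        result.extend(x for x in table if x not in uniq)
    return result

data = [1, 5, 9, 1, 5, 33, 41, 9, 9, 77]
print(sorted(remove_duplicates(data)))        # [1, 5, 9, 33, 41, 77]
```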
In order to find and exploit a dual in the realm of sorting and merge-join to bit vector filtering in each recursion level of recursive hash join, sorting of multiple inputs must be divided into individual merge levels. In other words, for a merge-join of inputs R and S, the sort activity should switch back and forth between R and S, level by level, creating and using a new bit vector filter in each merge level. Unfortunately, even with a sophisticated sort implementation that supports this use of bit vector filters in each merge level, recursive hybrid hash join will make more effective use of bit vector filters because the inputs are partitioned, thus reducing the number of distinct values in each partition in each recursion level.

12.5 Specialized Hardware

Specialized hardware was considered by a number of researchers, e.g., in the forms of hardware sorters and logic-per-track selection. A relatively recent survey of database machine research is given by Su [1988]. Most of this research was abandoned after Boral and DeWitt's [1983] influential analysis that compared CPU and I/O speeds and their trends. They concluded that I/O is most likely the bottleneck in future high-performance query execution, not processing. Therefore, they recommended moving from research on custom processors to techniques for overcoming the I/O bottleneck, e.g., by use of parallel readout disks, disk caching and read-ahead, and indexing to reduce the amount of data to be read for a query. Other investigations also came to the conclusion that parallelism is no substitute for effective storage structures and query execution algorithms [DeWitt and Hawthorn 1981; Neches 1984]. An additional very strong argument against custom VLSI processors is that microprocessor speed is currently improving so rapidly that it is likely that, by the time a special hardware component has been designed, fabricated, tested, and integrated into a larger hardware and software system, the next generation of general-purpose CPUs will be available and will be able to execute database functions programmed in a high-level language at the same speed as the specialized hardware component. Furthermore, it is not clear what specialized hardware would be most beneficial to design, in particular, in light of today's directions toward extensible database systems and emerging database application domains. Therefore, we do not favor specialized database hardware modules beyond general-purpose processing, storage, and communication hardware dedicated to executing database software.

SUMMARY AND OUTLOOK

Database management systems provide three essential groups of services. First, they maintain both data and associated metadata in order to make databases self-contained and self-explanatory, at least to some extent, and to provide data independence. Second, they support safe data sharing among multiple users as well as prevention and recovery of failures and data loss. Third, they raise the level of abstraction for data manipulation above the primitive access commands provided by file systems with more or less sophisticated matching and inference mechanisms, commonly called the query language or query-processing facility. We have surveyed execution algorithms and software architectures used in providing this third essential service.

Query processing has been explored extensively in the last 20 years in the context of relational database management systems and is slowly gaining interest in the research community for extensible and object-oriented systems. This is a very encouraging development, because if these new systems have increased modeling power over previous data models and database management systems but cannot execute even simple requests efficiently, they will never gain widespread use and acceptance. Databases will continue to manage massive amounts of data; therefore, efficient query and request execution will continue to represent both an important research direction and an important criterion in investment decisions in the "real world." In other words, new database management systems should provide greater modeling power (this is widely accepted and intensely pursued), but also competitive or better performance than previous systems. We hope that this survey will contribute to the use of efficient and parallel algorithms for query processing tasks in new database management systems.

A large set of query processing algorithms has been developed for relational systems. Sort- and hash-based techniques have been used for physical-storage design, for associative index structures, for algorithms for unary and binary matching operations such as aggregation, duplicate removal, join, intersection, and division, and for parallel query processing using hash- or range-partitioning. Additional techniques such as precomputation and compression have been shown to provide substantial performance benefits when manipulating large volumes of data. Many of the existing algorithms will continue to be useful for extensible and object-oriented systems, and many can easily be generalized from sets of tuples to more general pattern-matching functions. Some emerging database applications will require new operators, however, both for translation between alternative data representations and for actual data manipulation.

The most promising aspect of current research into database query processing for new application domains is that the concept of a fixed number of parametrized operators, each performing a part of the required data manipulation and each passing an intermediate result to the next operator, is versatile enough to meet the new challenges. This concept permits specification of database queries and requests in a logical algebra as well as concise representation of database programs in a physical algebra. Furthermore, it allows algebraic optimizations of requests, i.e., optimizing transformations of algebra expressions and cost-sensitive translations of logical into physical expressions. Finally, it permits pipelining between operators to exploit parallel computer architectures and partitioning of stored data and intermediate results for most operators, in particular, for operators on sets but also for other bulk types such as arrays, lists, and time series.

We can hope that much of the existing relational technology for query optimization and parallel execution will remain relevant and that research into extensible optimization and parallelization will have a significant impact on future database applications such as scientific data. For database management systems to become acceptable for new application domains, their performance must at least match that of the file systems currently in use. Automatic optimization and parallelization may be crucial contributions to achieving this goal, in addition to the query execution techniques surveyed here.

ACKNOWLEDGMENTS

José A. Blakeley, Cathy Brand, Rick Cole, Diane Davison, David Helman, Ann Linville, Bill McKenna, Gail Mitchell, Shengsong Ni, Barb Peters, Leonard Shapiro, the students of "Readings in Database Systems" at the University of Colorado at Boulder (Fall 1991) and "Database Implementation Techniques" at Portland State University (Winter 1993), David Maier's weekly reading group at the Oregon Graduate Institute (Winter 1992), the anonymous referees, and the Computing Surveys editors Shamkant Navathe and Dick Muntz gave many valuable comments on earlier drafts of this survey, which have improved the paper very much. This paper is based on research partially supported by the National Science Foundation with grants IRI-8996270, IRI-8912618, IRI-9006348, IRI-9116547, IRI-9119446, and ASC-9217394, ARPA with contract DAAB 07-91-C-Q518, Texas Instruments, Digital Equipment Corp., Intel Supercomputer Systems Division, Sequent Computer Systems, ADP, and the Oregon Advanced Computing Institute (OACIS).

REFERENCES

ADAM, N. R., AND WORTMANN, J. C. 1989. Security-control methods for statistical databases: A comparative study. ACM Comput. Surv. 21, 4 (Dec.), 515.
AHN, I., AND SNODGRASS, R. 1988. Partitioned storage for temporal databases. Inf. Syst. 13, 4, 369.


ALBERT, J. 1991. Algebraic properties of bag data types. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 211.
ANALYTI, A., AND PRAMANIK, S. 1992. Fast search in main memory databases. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 215.
ANDERSON, D. P., TZOU, S. Y., AND GRAHAM, G. S. 1988. The DASH virtual memory system. Tech. Rep. 88/461, Univ. of California—Berkeley, CS Division, Berkeley, Calif.
ANTOSHENKOV, G. 1993. Dynamic query optimization in Rdb/VMS. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
ASTRAHAN, M. M., BLASGEN, M. W., CHAMBERLIN, D. D., ESWARAN, K. P., GRAY, J. N., GRIFFITHS, P. P., KING, W. F., LORIE, R. A., MCJONES, P. R., MEHL, J. W., PUTZOLU, G. R., TRAIGER, I. L., WADE, B. W., AND WATSON, V. 1976. System R: A relational approach to database management. ACM Trans. Database Syst. 1, 2 (June), 97.
ASTRAHAN, M. M., SCHKOLNICK, M., AND WHANG, K. Y. 1987. Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12, 1, 11.
ATKINSON, M. P., AND BUNEMAN, O. P. 1987. Types and persistence in database programming languages. ACM Comput. Surv. 19, 2 (June), 105.
BABB, E. 1982. Joined Normal Form: A storage encoding for relational databases. ACM Trans. Database Syst. 7, 4 (Dec.), 588.
BABB, E. 1979. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar.), 1.
BAEZA-YATES, R. A., AND LARSON, P. A. 1989. Performance of B+-trees with partial expansions. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 248.
BANCILHON, F., AND RAMAKRISHNAN, R. 1986. An amateur's introduction to recursive query processing strategies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 16.
BARGHOUTI, N. S., AND KAISER, G. E. 1991. Concurrency control in advanced database applications. ACM Comput. Surv. 23, 3 (Sept.), 269.
BARU, C. K., AND FRIEDER, O. 1989. Database operations in a cube-connected multicomputer system. IEEE Trans. Comput. 38, 6 (June), 920.
BATINI, C., LENZERINI, M., AND NAVATHE, S. B. 1986. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 4 (Dec.), 323.
BATORY, D. S., BARNETT, J. R., GARZA, J. F., SMITH, K. P., TSUKUDA, K., TWICHELL, B. C., AND WISE, T. E. 1988a. GENESIS: An extensible database management system. IEEE Trans. Softw. Eng. 14, 11 (Nov.), 1711.
BATORY, D. S., LEUNG, T. Y., AND WISE, T. E. 1988b. Implementation concepts for an extensible data model and data language. ACM Trans. Database Syst. 13, 3 (Sept.), 231.
BAUGSTO, B., AND GREIPSLAND, J. 1989. Parallel sorting methods for large data volumes on a hypercube database computer. In Proceedings of the 6th International Workshop on Database Machines (Deauville, France, June 19-21).
BAYER, R., AND MCCREIGHT, E. 1972. Organization and maintenance of large ordered indices. Acta Informatica 1, 3, 173.
BECK, M., BITTON, D., AND WILKINSON, W. K. 1988. Sorting large files on a backend multiprocessor. IEEE Trans. Comput. 37, 7 (July), 769.
BECKER, B., SIX, H. W., AND WIDMAYER, P. 1991. Spatial priority search: An access technique for scaleless maps. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 128.
BECKMANN, N., KRIEGEL, H. P., SCHNEIDER, R., AND SEEGER, B. 1990. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 322.
BELL, T., WITTEN, I. H., AND CLEARY, J. G. 1989. Modelling for text compression. ACM Comput. Surv. 21, 4 (Dec.), 557.
BENTLEY, J. L. 1975. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 9 (Sept.), 509.
BERNSTEIN, P. A., AND GOODMAN, N. 1981. Concurrency control in distributed database systems. ACM Comput. Surv. 13, 2 (June), 185.
BERNSTEIN, P. A., GOODMAN, N., WONG, E., REEVE, C. L., AND ROTHNIE, J. B. 1981. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec.), 602.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass.
BERRA, P. B., CHUNG, S. M., AND HACHEM, N. I. 1987. Computer architecture for a surrogate file to a very large data/knowledge base. IEEE Comput. 20, 3 (Mar.), 25.
BERTINO, E. 1991. An indexing technique for object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 160.
BERTINO, E. 1990. Optimization of queries using nested indices. In Lecture Notes in Computer Science, vol. 416. Springer-Verlag, New York.
BERTINO, E., AND KIM, W. 1989. Indexing techniques for queries on nested objects. IEEE Trans. Knowledge Data Eng. 1, 2 (June), 196.
BHIDE, A. 1988. An analysis of three transaction processing architectures. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 339.


BHIDE, A., AND STONEBRAKER, M. 1988. A performance comparison of two architectures for fast transaction processing. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 536.
BITTON, D., AND DEWITT, D. J. 1983. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June), 255.
BITTON-FRIEDLAND, D. 1982. Design, analysis, and implementation of parallel external sorting algorithms. Ph.D. Thesis, Univ. of Wisconsin—Madison.
BITTON, D., AND GRAY, J. 1988. Disk shadowing. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Aug.). VLDB Endowment, 331.
BITTON, D., DEWITT, D. J., HSIAO, D. K., AND MENON, J. 1984. A taxonomy of parallel sorting. ACM Comput. Surv. 16, 3 (Sept.), 287.
BITTON, D., HANRAHAN, M. B., AND TURBYFILL, C. 1987. Performance of complex queries in main memory database systems. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
BLAKELEY, J. A., AND MARTIN, N. L. 1990. Join index, materialized view, and hybrid hash-join: A performance analysis. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
BLAKELEY, J. A., COBURN, N., AND LARSON, P. A. 1989. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. Database Syst. 14, 3 (Sept.), 369.
BLASGEN, M., AND ESWARAN, K. 1977. Storage and access in relational databases. IBM Syst. J. 16, 4, 363.
BLASGEN, M., AND ESWARAN, K. 1976. On the evaluation of queries in a relational database system. IBM Res. Rep. RJ 1745, IBM, San Jose, Calif.
BLOOM, B. H. 1970. Space/time tradeoffs in hash coding with allowable errors. Commun. ACM 13, 7 (July), 422.
BORAL, H. 1988. Parallelism in Bubba. In Proceedings of the International Symposium on Databases in Parallel and Distributed Systems (Austin, Tex., Dec.), 68.
BORAL, H., AND DEWITT, D. J. 1983. Database machines: An idea whose time has passed? A critique of the future of database machines. In Proceedings of the International Workshop on Database Machines. Reprinted in Parallel Architectures for Database Systems. IEEE Computer Society Press, Washington, D.C., 1989.
BORAL, H., ALEXANDER, W., CLAY, L., COPELAND, G., DANFORTH, S., FRANKLIN, M., HART, B., SMITH, M., AND VALDURIEZ, P. 1990. Prototyping Bubba, a highly parallel database system. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 4.
BRATBERGSENGEN, K. 1984. Hashing methods and relational algebra operations. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 323.
BROWN, K. P., CAREY, M. J., DEWITT, D. J., MEHTA, M., AND NAUGHTON, J. F. 1992. Scheduling issues for complex database workloads. Computer Science Tech. Rep. 1095, Univ. of Wisconsin—Madison.
BUCHERAL, P., THEVENIN, J. M., AND VALDURIEZ, P. 1990. Efficient main memory data management using the DBGraph storage model. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 683.
BUNEMAN, P., AND FRANKEL, R. E. 1979. FQL—A functional query language. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 52.
BUNEMAN, P., FRANKEL, R. E., AND NIKHIL, R. 1982. An implementation technique for database query languages. ACM Trans. Database Syst. 7, 2 (June), 164.
CACACE, F., CERI, S., AND HOUTSMA, M. A. W. 1992. A survey of parallel execution strategies for transitive closures and logic programs. To appear in Distrib. Parall. Databases.
CAREY, M. J., DEWITT, D. J., RICHARDSON, J. E., AND SHEKITA, E. J. 1986. Object and file management in the EXODUS extensible database system. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 91.
CARLIS, J. V. 1986. HAS: A relational algebra operator, or divide is not enough to conquer. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 254.
CARTER, J. L., AND WEGMAN, M. N. 1979. Universal classes of hash functions. J. Comput. Syst. Sci. 18, 2, 143.
CHAMBERLIN, D. D., ASTRAHAN, M. M., BLASGEN, M. W., GRAY, J. N., KING, W. F., LINDSAY, B. G., LORIE, R., MEHL, J. W., PRICE, T. G., PUTZOLU, F., SELINGER, P. G., SCHKOLNICK, M., SLUTZ, D. R., TRAIGER, I. L., WADE, B. W., AND YOST, R. A. 1981a. A history and evaluation of System R. Commun. ACM 24, 10 (Oct.), 632.
CHAMBERLIN, D. D., ASTRAHAN, M. M., KING, W. F., LORIE, R. A., MEHL, J. W., PRICE, T. G., SCHKOLNICK, M., SELINGER, P. G., SLUTZ, D. R., WADE, B. W., AND YOST, R. A. 1981b. Support for repetitive transactions and ad hoc queries in System R. ACM Trans. Database Syst. 6, 1 (Mar.), 70.
CHEN, P. P. 1976. The entity relationship model—Toward a unified view of data. ACM Trans. Database Syst. 1, 1 (Mar.), 9.
CHEN, H., AND KUCK, S. M. 1984. Combining relational and network retrieval methods. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 131.
CHEN, M. S., LO, M. L., YU, P. S., AND YOUNG, H. C. 1992. Using segmented right-deep trees for the execution of pipelined hash joins. In Proceedings of the International Conference on Very Large Data Bases (Vancouver, B.C., Canada). VLDB Endowment, 15.


CHENG, J., HADERLE, D., HEDGES, R., IYER, B. R., MESSINGER, T., MOHAN, C., AND WANG, Y. 1991. An efficient hybrid join algorithm: A DB2 prototype. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 171.
CHERITON, D. R., GOOSEN, H. A., AND BOYLE, P. D. 1991. Paradigm: A highly scalable shared-memory multicomputer. IEEE Comput. 24, 2 (Feb.), 33.
CHIU, D. M., AND HO, Y. C. 1980. A methodology for interpreting tree queries into optimal semi-join expressions. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 169.
CHOU, H. T. 1985. Buffer management of database systems. Ph.D. thesis, Univ. of Wisconsin—Madison.
CHOU, H. T., AND DEWITT, D. J. 1985. An evaluation of buffer management strategies for relational database systems. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 127. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
CHRISTODOULAKIS, S. 1984. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163.
CHUNG, S. M., AND BERRA, P. B. 1988. A comparison of concatenated and superimposed code word surrogate files for very large data/knowledge bases. In Lecture Notes in Computer Science, vol. 303. Springer-Verlag, New York, 364.
CLUET, S., DELOBEL, C., LECLUSE, C., AND RICHARD, P. 1989. Reloops, an algebra based query language for an object-oriented database system. In Proceedings of the 1st International Conference on Deductive and Object-Oriented Databases (Kyoto, Japan, Dec. 4-6).
COMER, D. 1979. The ubiquitous B-tree. ACM Comput. Surv. 11, 2 (June), 121.
COPELAND, G., ALEXANDER, W., BOUGHTER, E., AND KELLER, T. 1988. Data placement in Bubba. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 99.
DADAM, P., KUESPERT, K., ANDERSON, F., BLANKEN, H., ERBE, R., GUENAUER, J., LUM, V., PISTOR, P., AND WALCH, G. 1986. A database management prototype to support extended NF2 relations: An integrated view on flat tables and hierarchies. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 356.
DANIELS, D., AND NG, P. 1982. Distributed query compilation and processing in R*. IEEE Database Eng. 5, 3 (Sept.).
DANIELS, S., GRAEFE, G., KELLER, T., MAIER, D., SCHMIDT, D., AND VANCE, B. 1991. Query optimization in Revelation, an overview. IEEE Database Eng. 14, 2 (June).
DAVIDSON, S. B., GARCIA-MOLINA, H., AND SKEEN, D. 1985. Consistency in partitioned networks. ACM Comput. Surv. 17, 3 (Sept.), 341.
DAVIS, D. D. 1992. Oracle's parallel punch for OLTP. Datamation (Aug. 1), 67.
DAVISON, W. 1992. Parallel index building in Informix OnLine 6.0. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 103.
DEPPISCH, U., PAUL, H. B., AND SCHEK, H. J. 1986. A storage system for complex objects. In Proceedings of the International Workshop on Object-Oriented Database Systems (Pacific Grove, Calif., Sept.), 183.
DESHPANDE, V., AND LARSON, P. A. 1992. The design and implementation of a parallel join algorithm for nested relations on shared-memory multiprocessors. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 68.
DESHPANDE, V., AND LARSON, P. A. 1991. An algebra for nested relations with support for nulls and aggregates. Computer Science Dept., Univ. of Waterloo, Waterloo, Ontario, Canada.
DESHPANDE, A., AND VAN GUCHT, D. 1988. An implementation for nested relational databases. In Proceedings of the International Conference on Very Large Data Bases (Los Angeles, Calif., Aug.). VLDB Endowment, 76.
DEWITT, D. J. 1991. The Wisconsin benchmark: Past, present, and future. In Database and Transaction Processing System Performance Handbook. Morgan-Kaufman, San Mateo, Calif.
DEWITT, D. J., AND GERBER, R. H. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the International Conference on Very Large Data Bases (Stockholm, Sweden, Aug.). VLDB Endowment, 151.
DEWITT, D. J., AND GRAY, J. 1992. Parallel database systems: The future of high-performance database systems. Commun. ACM 35, 6 (June), 85.
DEWITT, D. J., AND HAWTHORN, P. B. 1981. A performance evaluation of database machine architectures. In Proceedings of the International Conference on Very Large Data Bases (Cannes, France, Sept.). VLDB Endowment, 199.
DEWITT, D. J., GERBER, R. H., GRAEFE, G., HEYTENS, M. L., KUMAR, K. B., AND MURALIKRISHNA, M. 1986. GAMMA—A high performance dataflow database machine. In Proceedings of the International Conference on Very Large Data Bases. VLDB Endowment, 228. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
DEWITT, D. J., GHANDEHARIZADEH, S., AND SCHNEIDER, D. 1988. A performance analysis of the GAMMA database machine. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 350.


DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H. I., AND RASMUSSEN, R. 1990. The Gamma database machine project. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 44.
DEWITT, D. J., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1.
DEWITT, D., NAUGHTON, J., AND BURGER, J. 1993. Nested loops revisited. In Proceedings of Parallel and Distributed Information Systems (San Diego, Calif., Jan.).
DEWITT, D. J., NAUGHTON, J. E., AND SCHNEIDER, D. A. 1991a. An evaluation of non-equijoin algorithms. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 443.
DEWITT, D., NAUGHTON, J., AND SCHNEIDER, D. 1991b. Parallel sorting on a shared-nothing architecture using probabilistic splitting. In Proceedings of the International Conference on Parallel and Distributed Information Systems (Miami Beach, Fla., Dec.).
DOZIER, J. 1992. Access to data in NASA's Earth observing systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 1.
EFFELSBERG, W., AND HAERDER, T. 1984. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec.), 560.
ENBODY, R. J., AND DU, H. C. 1988. Dynamic hashing schemes. ACM Comput. Surv. 20, 2 (June), 85.
ENGLERT, S., GRAY, J., KOCHER, R., AND SHAH, P. 1989. A benchmark of NonStop SQL release 2 demonstrating near-linear speedup and scaleup on large databases. Tandem Computers Tech. Rep. 89.4, Tandem Corp., Cupertino, Calif.
EPSTEIN, R. 1979. Techniques for processing of aggregates in relational database systems. UCB/ERL Memo. M79/8, Univ. of California, Berkeley, Calif.
EPSTEIN, R., AND STONEBRAKER, M. 1980. Analysis of distributed data base processing strategies. In Proceedings of the International Conference on Very Large Data Bases (Montreal, Canada, Oct.). VLDB Endowment, 92.
EPSTEIN, R., STONEBRAKER, M., AND WONG, E. 1978. Distributed query processing in a relational database system. In Proceedings of the ACM SIGMOD Conference. ACM, New York.
FAGIN, R., NIEVERGELT, J., PIPPENGER, N., AND STRONG, H. R. 1979. Extendible hashing: A fast access method for dynamic files. ACM Trans. Database Syst. 4, 3 (Sept.), 315.
FALOUTSOS, C. 1985. Access methods for text. ACM Comput. Surv. 17, 1 (Mar.), 49.
FALOUTSOS, C., NG, R., AND SELLIS, T. 1991. Predictive load control for flexible buffer allocation. In Proceedings of the International Conference on Very Large Data Bases (Barcelona, Spain). VLDB Endowment, 265.
FANG, M. T., LEE, R. C. T., AND CHANG, C. C. 1986. The idea of declustering and its applications. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). VLDB Endowment, 181.
FINKEL, R. A., AND BENTLEY, J. L. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1, 1.
FREYTAG, J. C., AND GOODMAN, N. 1989. On the translation of relational queries into iterative programs. ACM Trans. Database Syst. 14, 1 (Mar.), 1.
FUSHIMI, S., KITSUREGAWA, M., AND TANAKA, H. 1986. An overview of the system software of a parallel relational database machine GRACE. In Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). ACM, New York, 209.
GALLAIRE, H., MINKER, J., AND NICOLAS, J. M. 1984. Logic and databases: A deductive approach. ACM Comput. Surv. 16, 2 (June), 153.
GERBER, R. H. 1986. Dataflow query processing using multiprocessor hash-partitioned algorithms. Ph.D. thesis, Univ. of Wisconsin—Madison.
GHANDEHARIZADEH, S., AND DEWITT, D. J. 1990. Hybrid-range partitioning strategy: A new declustering strategy for multiprocessor database machines. In Proceedings of the International Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 481.
GOODMAN, J. R., AND WOEST, P. J. 1988. The Wisconsin Multicube: A new large-scale cache-coherent multiprocessor. Computer Science Tech. Rep. 766, Univ. of Wisconsin—Madison.
GOUDA, M. G., AND DAYAL, U. 1981. Optimal semijoin schedules for query processing in local distributed database systems. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 164.
GRAEFE, G. 1993a. Volcano, an extensible and parallel dataflow query processing system. IEEE Trans. Knowledge Data Eng. To be published.
GRAEFE, G. 1993b. Performance enhancements for hybrid hash join. Available as Computer Science Tech. Rep. 606, Univ. of Colorado, Boulder.
GRAEFE, G. 1993c. Sort-merge-join: An idea whose time has passed? Revised in Portland State Univ. Computer Science Tech. Rep. 93-4.
GRAEFE, G. 1991. Heap-filter merge join: A new algorithm for joining medium-size inputs. IEEE Trans. Softw. Eng. 17, 9 (Sept.), 979.
GRAEFE, G. 1990a. Parallel external sorting in Volcano. Computer Science Tech. Rep. 459, Univ. of Colorado, Boulder.


GRAEFE, G. 1990b. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 102.
GRAEFE, G. 1989. Relational division: Four algorithms and their performance. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 94.
GRAEFE, G., AND COLE, R. L. 1993. Fast algorithms for universal quantification in large databases. Portland State Univ. and Univ. of Colorado at Boulder.
GRAEFE, G., AND DAVISON, D. L. 1993. Encapsulation of parallelism and architecture-independence in extensible database query processing. IEEE Trans. Softw. Eng. 19, 7 (July).
GRAEFE, G., AND DEWITT, D. J. 1987. The EXODUS optimizer generator. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 160.
GRAEFE, G., AND MAIER, D. 1988. Query optimization in object-oriented database systems: A prospectus. In Advances in Object-Oriented Database Systems, vol. 334. Springer-Verlag, New York, 358.
GRAEFE, G., AND MCKENNA, W. J. 1993. The Volcano optimizer generator: Extensibility and efficient search. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York.
GRAEFE, G., AND SHAPIRO, L. D. 1991. Data compression and database performance. In Proceedings of the ACM/IEEE-Computer Science Symposium on Applied Computing. ACM/IEEE, New York.
GRAEFE, G., AND WARD, K. 1989. Dynamic query evaluation plans. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 358.
GRAEFE, G., AND WOLNIEWICZ, R. H. 1992. Algebraic optimization and parallel execution of computations over scientific databases. In Proceedings of the Workshop on Metadata Management in Scientific Databases (Salt Lake City, Utah, Nov. 3-5).
GRAEFE, G., COLE, R. L., DAVISON, D. L., MCKENNA, W. J., AND WOLNIEWICZ, R. H. 1992. Extensible query optimization and parallel execution in Volcano. In Query Processing for Advanced Database Applications. Morgan-Kaufman, San Mateo, Calif.
GRAEFE, G., LINVILLE, A., AND SHAPIRO, L. D. 1993. Sort versus hash revisited. IEEE Trans. Knowledge Data Eng. To be published.
GRAY, J. 1990. A census of Tandem system availability between 1985 and 1990. Tandem Computers Tech. Rep. 90.1, Tandem Corp., Cupertino, Calif.
GRAY, J., AND PUTZOLU, F. 1987. The 5 minute rule for trading memory for disc accesses and the 10 byte rule for trading memory for CPU time. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 395.
GRAY, J., AND REUTER, A. 1991. Transaction Processing: Concepts and Techniques. Morgan-Kaufman, San Mateo, Calif.
GRAY, J., MCJONES, P., BLASGEN, M., LINDSAY, B., LORIE, R., PRICE, T., PUTZOLU, F., AND TRAIGER, I. 1981. The recovery manager of the System R database manager. ACM Comput. Surv. 13, 2 (June), 223.
GRUENWALD, L., AND EICH, M. H. 1991. MMDB reload algorithms. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 397.
GUENTHER, O., AND BILMES, J. 1991. Tree-based access methods for spatial databases: Implementation and performance evaluation. IEEE Trans. Knowledge Data Eng. 3, 3 (Sept.), 342.
GUIBAS, L., AND SEDGEWICK, R. 1978. A dichromatic framework for balanced trees. In Proceedings of the 19th Symposium on the Foundations of Computer Science.
GUNADHI, H., AND SEGEV, A. 1991. Query processing algorithms for temporal intersection joins. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 336.
GUNADHI, H., AND SEGEV, A. 1990. A framework for query optimization in temporal databases. In Proceedings of the 5th International Conference on Statistical and Scientific Database Management.
GUNTHER, O. 1989. The design of the cell tree: An object-oriented index structure for geometric databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 598.
GUNTHER, O., AND WONG, E. 1987. A dual space representation for geometric data. In Proceedings of the International Conference on Very Large Data Bases (Brighton, England, Aug.). VLDB Endowment, 501.
GUO, M., SU, S. Y. W., AND LAM, H. 1991. An association algebra for processing object-oriented databases. In Proceedings of the IEEE Conference on Data Engineering. IEEE, New York, 23.
GUTTMAN, A. 1984. R-Trees: A dynamic index structure for spatial searching. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 47. Reprinted in Readings in Database Systems. Morgan-Kaufman, San Mateo, Calif., 1988.
HAAS, L., CHANG, W., LOHMAN, G., MCPHERSON, J., WILMS, P. F., LAPIS, G., LINDSAY, B., PIRAHESH, H., CAREY, M. J., AND SHEKITA, E. 1990. Starburst mid-flight: As the dust clears. IEEE Trans. Knowledge Data Eng. 2, 1 (Mar.), 143.
HAAS, L., FREYTAG, J. C., LOHMAN, G., AND PIRAHESH, H. 1989. Extensible query processing in Starburst. In Proceedings of the ACM SIGMOD Conference. ACM, New York, 377.
HAAS, L. M., SELINGER, P. G., BERTINO, E., DANIELS, D., LINDSAY, B., LOHMAN, G., MASUNAGA, Y., MOHAN, C., NG, P., WILMS, P., AND YOST, R. 1982. R*: A research project on distributed relational database management. IBM Res. Division, San Jose, Calif.

ACM Computing Surveys, Vol. 25, No. 2, June 1993 164 ● Goetz Graefe

1982. R*: A research project on distributed HUA, K. A., AND LEE, C. 1990. An adaptive data relational database management. IBM Res. Di- placement scheme for parallel database com- vision, San Jose, Calif. puter systems. In Proceedings of the Interna- HAERDER, T., AND REUTER, A. 1983. Principles of tional Conference on Very Large Data Bases transaction-oriented database recovery. ACM (Brisbane, Australia). VLDB Endowment, 493, Comput. Suru. 15, 4 (Dec.). HUDSON, S. E., AND KING, R. 1989. Cactis: A self- HAFEZ, A., AND OZSOYOGLU, G. 1988. Storage adaptive, concurrent implementation of an ob- structures for nested relations. IEEE Database ject-oriented database management system. Eng. 11, 3 (Sept.), 31. ACM Trans. Database Svst. 14, 3 (Sept.), 291, HAGMANN, R. B. 1986. An observation on HULL, R., AND KING, R. 1987. Semantic database database buffering performance metrics. In modeling: Survey, applications, and research Proceedings of the International Conference on issues. ACM Comput. Suru, 19, 3 (Sept.), 201. Very Large Data Bases (Kyoto, Japan, Aug.). HUTFLESZ, A., SIX, H. W., AND WIDMAYER) P. 1990. VLDB Endowment, 289. The R-File: An efficient access structure for HAMMING, R. W. 1977. Digital Filters. Prentice- proximity queries. In proceedings of the IEEE Hall, Englewood Cliffs, N.J. Conference on Data Engineering. IEEE, New HANSON, E. N. 1987. A performance analysis of York, 372. view materialization strategies. In Proceedings HUTFLESZ, A., SIX, H. W., AND WIDMAYER, P. 1988a, of ACM SIGMOD Conference. ACM, New York, Twin grid files: Space optimizing access 440. schemes. In Proceedings of ACM SIGMOD HENRICH, A., Stx, H. W., AND WIDMAYER, P. 1989. Conference. ACM, New York, 183. The LSD tree: Spatial access to multi- HUTFLESZ, A., Sm, H, W., AND WIDMAYER, P, 1988b. dimensional point and nonpoint objects. In The twin grid file: A nearly space optimal index Proceedings of the International Conference on structure. In Lecture Notes m Computer Sci- Very Large Data Bases (Amsterdam, The ence, vol. 303, Springer-Verlag, New York, 352. Netherlands). VLDB Endowment, 45. IOANNIDIS, Y. E,, AND CHRISTODOULAIiIS, S. 1991, HOEL, E. G., AND SAMET, H. 1992. A qualitative On the propagation of errors in the size of join comparison study of data structures for large results. In Proceedl ngs of ACM SIGMOD Con- linear segment databases. In %oceedtngs of ference. ACM, New York, 268. ACM SIGMOD Conference. ACM. New York, 205. IYKR, B. R., AND DIAS, D. M. 1990, System issues in parallel sorting for database systems. In HONG, W., AND STONEBRAKER, M. 1993. Opti- Proceedings of the IEEE Conference on Data mization of parallel query execution plans in Engineering. IEEE, New York, 246. XPRS. Distrib. Parall. Databases 1, 1 (Jan.), 9. JAGADISH, H. V. 1991, A retrieval technique for HONG, W., AND STONEBRAKRR, M. 1991. Opti- similar shapes. In Proceedings of ACM mization of parallel query execution plans in XPRS. In Proceedings of the International Con- SIGMOD Conference. ACM, New York, 208. ference on Parallel and Distributed Information JARKE, M., AND KOCH, J. 1984. Query optimiza- Systems (Miami Beach, Fla., Dec.). tion in database systems. ACM Cornput. Suru. Hou, W. C., AND OZSOYOGLU, G. 1993. Processing 16, 2 (June), 111. time-constrained aggregation queries in CASE- JARKE, M., AND VASSILIOU, Y. 1985. A framework DB. ACM Trans. Database Syst. To be for choosing a database query language. ACM published. Comput. Sure,. 17, 3 (Sept.), 313. Hou, W. C., AND OZSOYOGLU, G. 1991. Statistical KATz, R. H. 1990. 
Towards a unified framework estimators for aggregate relational algebra for version modeling in engineering databases. queries. ACM Trans. Database Syst. 16, 4 ACM Comput. Suru. 22, 3 (Dec.), 375. (Dec.), 600. Hou, W. C., OZSOYOGLU, G., AND DOGDU, E. 1991. KATZ, R. H., AND WONG, E. 1983. Resolving con- Error-constrained COUNT query evaluation in flicts in global storage design through replica- relational databases. In Proceedings of ACM tion. ACM Trans. Database S.vst. 8, 1 (Mar.), SIGMOD Conference. ACM, New York, 278. 110. HSIAO, H. I., ANDDEWITT, D. J. 1990. Chained KELLER, T., GRAEFE, G., AND MAIE~, D. 1991. Ef- declustering: A new availability strategy for ficient assembly of complex objects. In Proceed- multiprocessor database machines. In Proceed- ings of ACM SIGMOD Conference. ACM. New ings of the IEEE Conference on Data Engineer- York, 148. ing. IEEE, New York, 456. KEMPER, A., AND MOERKOTTE G. 1990a. Access HUA, K. A., m~ LEE, C. 1991. Handling data support in object bases. In Proceedings of ACM skew in multicomputer database computers us- SIGMOD Conference. ACM, New York, 364. ing partition tuning. In Proceedings of the In- KEMPER, A., AND MOERKOTTE, G. 1990b. Ad- ternational Conference on Very Large Data vanced query processing in object bases using Bases (Barcelona, Spain). VLDB Endowment, access support relations. In Proceedings of the 525. International Conference on Very Large Data

ACM Computing Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques 9 165

Bases (Brisbane, Australia). VLDB Endow- KFUEGEL, H. P., AND SEEGER, B. 1988. PLOP- ment, 290. Hashing: A grid file without directory. In Pro- KEMPER, A.j AND WALI.RATH, M. 1987. An analy- ceedings of the IEEE Conference on Data Engi- sis of geometric modeling in database systems. neering. IEEE, New York, 369. ACM Comput. Suru. 19, 1 (Mar.), 47. KRIEGEL, H. P., AND SEEGER, B. 1987. Multidi- KEMPER,A., KILGER, C., AND MOERKOTTE, G. 1991. mensional dynamic hashing is very efficient for Function materialization in object bases. In nonuniform record distributions. In Proceed- Proceedings of ACM SIGMOD Conference. ings of the IEEE Conference on Data Enginee- ACM, New York, 258. ring. IEEE, New York, 10. KRRNIGHAN,B. W., ANDRITCHIE,D. M. 1978. VW KRISHNAMURTHY,R., BORAL, H., AND ZANIOLO, C. C Programming Language. Prentice-Hall, 1986. Optimization of nonrecursive queries. In Englewood Cliffs, N.J. Proceedings of the International Conference on Very Large Data Bases (Kyoto, Japan, Aug.). KIM, W. 1984. Highly available systems for VLDB Endowment, 128. database applications. ACM Comput. Suru. 16, 1 (Mar.), 71. KUESPERT, K.j SAAKE, G.j AND WEGNER, L. 1989. Duplicate detection and deletion in the ex- KIM, W. 1980. A new way to compute the prod- tended NF2 data model. In Proceedings of the uct and join of relations. In Proceedings of 3rd International Conference on the Founda- ACM SIGMOD Conference. ACM, New York, 179. tions of Data Organization and Algorithms (Paris, France, June). KITWREGAWA,M., AND OGAWA,Y, 1990. Bucket spreading parallel hash: A new, robust, paral- KUMAR, V., AND BURGER, A. 1991. Performance lel hash join method for skew in the super measurement of some main memory database database computer (SDC). In Proceedings of recovery algorithms. In Proceedings of the the International Conference on Very Large IEEE Conference on Data Engineering. IEEE, Data Bases (Brisbane, Australia). VLDB En- New York, 436. dowment, 210. LAKSHMI,M. S., AND Yu, P. S. 1990. Effectiveness KI’rSUREGAWA, M., NAKAYAMA, M., AND TAKAGI, M. of parallel joins. IEEE Trans. Knowledge Data 1989a. The effect of bucket size tuning in the Eng. 2, 4 (Dec.), 410. dynamic hybrid GRACE hash join method. In LAIWHMI, M. S., AND Yu, P. S, 1988. Effect of Proceedings of the International Conference on skew on join performance in parallel architec- Very Large Data Bases (Amsterdam, The tures. In Proceedings of the International Sym- Netherlands). VLDB Endowment, 257. posmm on Databases in Parallel and Dis- KITSUREGAWA7 M., TANAKA, H., AND MOTOOKA, T. tributed Systems (Austin, Tex., Dec.), 107. 1983. Application of hash to data base ma- LANKA, S., AND MAYS, E. 1991. Fully persistent chine and its architecture. New Gener. Com- B + -trees. In Proceedings of ACM SIGMOD p~t. 1, 1, 63. Conference. ACM, New York, 426. KITSUREGAWA, M., YANG, W., AND FUSHIMI, S. LARSON,P. A. 1981. Analysis of index-sequential 1989b. Evaluation of 18-stage ~i~eline. . hard- files with overflow chaining. ACM Trans. ware sorter. In Proceedings of the 6th In tern a- Database Syst. 6, 4 (Dec.), 671. tional Workshop on Database Machines LARSON, P., AND YANG, H. 1985. Computing (Deauville, France, June 19-21). queries from derived relations. In Proceedings KLUIG, A. 1982, Equivalence of relational algebra of the International Conference on Very Large and relational calculus query languages having Data Bases (Stockholm, Sweden, Aug.). VLDB aggregate functions. J. ACM 29, 3 (July), 699. Endowment, 259. KNAPP, E. 1987. Deadlock detection in dis- LEHMAN, T. 
J., AND CAREY, M. J. 1986. Query tributed databases. ACM Comput. Suru. 19, 4 processing in main memory database systems. (Dec.), 303. In Proceedings of ACM SIGMOD Conference. KNUTH, D, 1973. The Art of Computer Program- ACM, New York, 239. ming, Vol. III, Sorting and Searching. LELEWER, D. A., AND HIRSCHBERG, D. S. 1987. Addison-Wesley, Reading, Mass. Data compression. ACM Comput. Suru. 19, 3 KOLOVSON, C. P., ANTD STONEBRAKER M. 1991. (Sept.), 261. Segment indexes: Dynamic indexing tech- LEUNG, T. Y. C., AND MUNTZ, R. R. 1992. Tempo- niques for multi-dimensional interval data. In ral query processing and optimization in multi- Proceechngs of ACM SIGMOD Conference. processor database machines. In Proceedings of ACM, New York, 138. the International Conference on Very Large KC)OI. R. P. 1980. The optimization of queries in Data Bases (Vancouver, BC, Canada). VLDB relational databases. Ph.D. thesis, Case West- Endowment, 383. ern Reserve Univ., Cleveland, Ohio. LEUNG, T. Y. C., AND MUNTZ, R. R. 1990. Query KOOI, R. P., AND FRANKFORTH, D. 1982. Query processing in temporal databases. In Proceed- optimization in Ingres. IEEE Database Eng. 5, ings of the IEEE Conference on Data Englneer- 3 (Sept.), 2. mg. IEEE, New York, 200.

ACM Com~utina Survevs, Vol 25, No. 2, June 1993 166 “ Goetz Graefe

LI, K., AND NAUGHTON, J. 1988. Multiprocessor MAIER, D., GRAEFE, G., SHAFIRO, L., DANIELS, S., main memory transaction processing. In Pro- KELLER, T., AND VANCE, B. 1992 Issues in ceedings of the Intcrnatlonal Symposium on distributed complex object assembly In Prc~- Databases m Parallel and Dlstrlbuted Systems ceedings of the Workshop on Distributed Object (Austin, Tex., Dec.), 177. Management (Edmonton, BC, Canada, Aug.). LITWIN, W. 1980. Linear hashing: A new tool for MANNINO, M. V., CHU, P., ANrI SAGER, T. 1988. file and table addressing. In Proceedings of the Statistical profile estimation m database sys- International Conference on Ver

ACM Computmg Surveys, Vol 25, No 2. June 1993 Query Evaluation Techniques ● 167

ACM SIGMOD Conference. ACM, New York, data models. ACM Comput. Suru. 20, 3 (Sept.), 118. 153. NG, R., FALOUTSOS,C., ANDSELLIS,T. 1991. Flex- PmAHESH, H., MOHAN, C., CHENG, J., LIU, T. S., AND ible buffer allocation based on marginal gains. SELINGER, P. 1990. Parallelism in relational In Proceedings of ACM SIGMOD Conference. data base systems: Architectural issues and ACM, New York, 387. design approaches. In Proceedings of the Inter- NIEVERGELT, J., HINTERBERGER, H., AND SEVCIK, national Symposwm on Databases m Parallel K. C. 1984. The grid file: An adaptable, sym- and Distributed Systems (Dublin, Ireland, metric multikey file structure. ACM Trans. July). Database Syst. 9, 1 (Mar.), 38. QADAH, G. Z. 1988. Filter-based join algorithms NYBERG,C., BERCLAY,T., CVETANOVIC,Z., GRAY.J., on uniprocessor and dmtributed-memory multi- ANDLOMET,D. 1993. AlphaSort: A RISC ma- processor database machines. In Lecture Notes chine sort. Tech. Rep. 93.2. DEC San Francisco m Computer Science, vol. 303. Springer-Verlag, Systems Center. Digital Equipment Corp., San New York, 388. Francisco. REW, R. K., AND DAVIS, G. P. 1990. The Unidata OMIECINSK1,E. 1991. Performance analysis of a NetCDF: Software for scientific data access. In load balancing relational hash-join algorithm the 6th International Conference on Interactwe Information and Processing Swstems for Me- for a shared-memory multiprocessor. In Pro- teorology, Ocean ography,- aid Hydrology ceedings of the International Conference on Very (Anaheim, Calif.). Large Data Bases (Barcelona, Spain). VLDB Endowment, 375. RICHARDSON, J. E., AND CAREY, M. J. 1987. Pro- OMIECINSKLE. 1985. Incremental file reorgani- gramming constructs for database system im- plementation m EXODUS. In Proceedings of zation schemes. In Proceedings of the Interna- ACM SIGMOD Conference. ACM, New York, tional Conference on Very Large Data Bases 208. (Stockholm, Sweden, Aug.). VLDB Endowment, 346. RICHAR~SON,J. P., Lu, H., AND MIKKILINENI, K. 1987. Design and evaluation of parallel OMIECINSKI, E., AND LIN, E. 1989. Hash-based pipelined join algorithms. In Proceedings of and index-based join algorithms for cube and ACM SIGMOD Conference. ACM, New York, ring connected multicomputers. IEEE Trans. 399. Knowledge Data Eng. 1, 3 (Sept.), 329. ROBINSON,J. T. 1981. The K-D-B-Tree: A search ONO, K., AND LOHMAN, G. M. 1990. Measuring structure for large multidimensional dynamic the complexity of join enumeration in query indices. ln proceedings of ACM SIGMOD Con- optimization. In Proceedings of the Interna- ference. ACM, New York, 10. tional Conference on Very Large Data Bases (Brisbane, Australia). VLDB Endowment, 314. ROSENTHAL,A., ANDREINER,D. S. 1985. Query- ing relational views of networks. In Query Pro- OUSTERHOUT,J. 1990. Why aren’t operating sys- cessing in Database Systems. Springer, Berlin, tems getting faster as fast as hardware. In 109. USENIX Summer Conference (Anaheim, Calif., ROSENTHAL, A., RICH, C., AND SCHOLL, M. 1991. June). USENIX. Reducing duplicate work in relational join(s): A O~SOYOGLU, Z. M., AND WANG, J. 1992. A keying modular approach using nested relations. ETH method for a nested relational database man- Tech. Rep., Zurich, Switzerland. agement system. In Proceedings of the IEEE ROTEM, D., AND SEGEV, A. 1987. Physical organi- Conference on Data Engineering. IEEE, New zation of temporal data. In Proceedings of the York, 438. IEEE Conference on Data Engineering. IEEE, OZSOYOGLU,G., OZSOYOGLU,Z. M., ANDMATOS,V. New York, 547. 1987. 
Extending relational algebra and rela- ROTH, M. A., KORTH, H. F., AND SILBERSCHATZ, A. tional calculus with set-valued attributes and 1988. Extended algebra and calculus for aggregate functions. ACM Trans. Database nested relational databases. ACM Trans. Syst. 12, 4 (Dec.), 566. Database Syst. 13, 4 (Dec.), 389. Ozsu, M. T., AND VALDURIEZ, P. 1991a. Dis- ROTHNIE, J. B., BERNSTEIN, P. A., Fox, S., GOODMAN, tributed database systems: Where are we now. N., HAMMER, M., LANDERS, T. A., REEVE, C., IEEE Comput. 24, 8 (Aug.), 68. SHIPMAN, D. W., AND WONG, E. 1980. Intro- Ozsu, M. T., AND VALDURIEZ, P. 1991b. Principles duction to a system for distributed databases of Distributed Database Systems. Prentice-Hall, (SDD-1). ACM Trans. Database Syst. 5, 1 Englewood Cliffs, N.J. (Mar.), 1. PALMER, M., AND ZDONIK, S. B. 1991. FIDO: A ROUSSOPOULOS, N. 1991. An incremental access cache that learns to fetch. In Proceedings of the method for ViewCache: Concept, algorithms, International Conference on Very Large Data and cost analysis. ACM Trans. Database Syst. Bases (Barcelona, Spain). VLDB Endowment, 16, 3 (Sept.), 535. 255. ROUSSOPOULOS, N., AND KANG, H. 1991. A PECKHAM, J., AND MARYANSKI, F. 1988. Semantic pipeline N-way join algorithm based on the

ACM Computing Surveys, Vol. 25. No. 2, June 1993 168 “ Goetz Graefe

2-way semijoin program. IEEE Trans Knoul- mance evaluation of four parallel join algo- edge Data Eng. 3, 4 (Dec.), 486. rithms in a shared-nothing multiprocessor RUTH, S. S , AND KEUTZER, P J 1972. Data com- environment. In Proceedings of ACM SIGMOD pression for business files. Datamatlon 18 Conference ACM, New York, 110 (Sept.), 62. SCHOLL, M. H, 1988. The nested relational model SAAKD, G., LINNEMANN, V., PISTOR, P , AND WEGNER, —Efficient support for a relational database L. 1989. Sorting, grouping and duplicate interface. Ph.D. thesis, Technical Umv. Darm- elimination in the advanced information man- stadt. In German. agement prototype. In Proceedl ngs of th e Inter- SCHOLL, M., PAUL, H. B., AND SCHEK, H. J 1987 national Conference on Very Large Data Bases Supporting flat relations by a nested relational VLDB Endowment, 307 Extended version in kernel. In Proceedings of the International IBM Sci. Ctr. Heidelberg Tech. Rep 8903.008, Conference on Very Large Data Bases (Brigh- March 1989. ton, England, Aug ) VLDB Endowment. 137. S.4CX:0, G 1987 Index access with a finite buffer. SEEGER, B., ANU LARSON, P A 1991. Multl-disk In Procecdmgs of the International Conference B-trees. In Proceedings of ACM SIGMOD Con- on Very Large Data Bases (Brighton, England, ference ACM, New York. 436. Aug.) VLDB Endowment. 301. ,SEG’N, A . AND GtTN.ADHI. H. 1989. Event-Join op- SACCO, G. M., .4ND SCHKOLNIK, M. 1986, Buffer timization in temporal relational databases. In management m relational database systems. Proceedings of the IrLternatlOnctl Conference on ACM Trans. Database Syst. 11, 4 (Dec.), 473. Very Large Data Bases (Amsterdam, The SACCO, G M , AND SCHKOLNI~, M 1982. A mech- Netherlands), VLDB Endowment, 205. anism for managing the buffer pool m a rela- SELINGiZR, P. G., ASTRAHAN, M. M , CHAMBERLAIN. tional database system using the hot set model. D. D., LORIE, R. A., AND PRWE, T. G. 1979 In Proceedings of the International Conference Access path selectlon m a relational database on Very Large Data Bases (Mexico City, Mex- management system In Proceedings of AdM ico, Sept.). VLDB Endowment, 257. SIGMOD Conference. ACM, New York, 23. SACXS-DAVIS,R., AND RAMMIOHANARAO,K. 1983. Reprinted in Readings m Database Sy~tems A two-level superimposed coding scheme for Morgan-Kaufman, San Mateo, Calif., 1988. partial match retrieval. Inf. Syst. 8, 4, 273. SELLIS. T. K. 1987 Efficiently supporting proce- S.ACXS-DAVIS, R., KENT, A., ANn RAMAMOHANAR~O, K dures in relational database systems In Pro- 1987. Multikey access methods based on su- ceedl ngs of ACM SIGMOD Conference. ACM. perimposed coding techniques. ACM Trans New York, 278. Database Syst. 12, 4 (Dec.), 655 SEPPI, K., BARNES, J,, AND MORRIS, C. 1989, A SALZBERG, B. 1990 Mergmg sorted runs using Bayesian approach to query optimization m large main memory Acts Informatica 27, 195 large scale data bases The Univ. of Texas at SALZBERG, B. 1988, Fde Structures: An Analytlc Austin ORP 89-19, Austin. Approach. Prentice-Hall, Englewood Cliffs, NJ. SERLIN, 0. 1991. The TPC benchmarks. In SALZBER~,B , TSU~~RMAN,A., GRAY, J., STEWART, Database and Transaction Processing System M., UREN, S., ANrJ VAUGHAN, B. 1990 Fast- Performance Handbook. Morgan-Kaufman, San Sort: A distributed single-input single-output Mateo, Cahf external sort In Proceeduzgs of ACM SIGMOD SESHAnRI, S , AND NAUGHTON, J F. 1992 Sam- Conference ACM, New York, 94. pling issues in parallel database systems In SAMET, H. 1984. 
The quadtree and related hier- Proceedings of the International Conference archical data structures. ACM Comput. Saru. on Extending Database Technology (Vienna, 16, 2 (June), 187, Austria, Mar.). SCHEK, H. J., ANII SCHOLL, M. H. 1986. The rela- SEVERANCE. D. G. 1983. A practitioner’s guide to tional model with relation-valued attributes. data base compression. Inf. S.yst. 8, 1, 51. Inf. Syst. 11, 2, 137. SE\rERANCE, D., AND LOHMAN, G 1976. Differen- SCHNEIDER, D. A. 1991. Blt filtering and multi- tial files: Their application to the maintenance way join query processing. Hewlett-Packard of large databases ACM Trans. Database Syst. Labs, Palo Alto, Cahf. Unpublished Ms 1.3 (Sept.). SCHNEIDER, D. A. 1990. Complex query process- SEWERANCE, C., PRAMANK S., AND WOLBERG, P. ing in multiprocessor database machines. Ph.D. 1990. Distributed linear hashing and parallel thesis, Univ. of Wmconsin-Madison projection in mam memory databases. In Pro- SCHNF,mER. D. A., AND DEWITT, D. J. 1990. ceedings of the Internatl onal Conference on Very Tradeoffs in processing complex join queries Large Data Bases (Brmbane, Australia) VLDB via hashing in multiprocessor database Endowment, 674. machines. In Proceedings of the Interna- SHAPIRO, L. D. 1986. Join processing in database tional Conference on Very Large Data Bases systems with large main memories. ACM (Brisbane, Austraha). VLDB Endowment, 469 Trans. Database Syst. 11, 3 (Sept.), 239. SCHNEIDER, D., AND DEWITT, D. 1989. A perfor- SHAW, G. M., AND ZDONIIL S. B. 1990. A query

ACM Camputmg Surveys, Vol. 25, No. 2, June 1993 Query Evaluation Techniques ● 169

algebra for object-oriented databases. In Pro. STONEBRAKER, M. 1991. Managing persistent ob- ceedings of the IEEE Conference on Data Engl. jects in a multi-level store. In Proceedings of neering. IEEE, New York, 154. ACM SIGMOD Conference. ACM, New York, 2. SHAW, G., AND Z~ONIK, S. 1989a. An object- STONEBRAKER,M. 1987. The design of the POST- oriented query algebra. IEEE Database Eng. GRES storage system, In Proceedings of the 12, 3 (Sept.), 29. International Conference on Very Large Data SIIAW, G. M., AND ZDUNIK, S. B. 1989b. AU Bases (Brighton, England, Aug.). VLDB En- object-oriented query algebra. In Proceedings dowment, 289. Reprinted in Readings in of the 2nd International Workshop on Database Database Systems. Morgan-Kaufman, San Ma- Programming Languages. Morgan-Kaufmann, teo, Calif,, 1988. San Mateo, Calif., 103. STONEBRAKER, M. 1986a. The case for shared- SHEKITA, E. J., AND CAREY, M. J. 1990. A perfor- nothing. IEEE Database Eng. 9, 1 (Mar.), mance evaluation of pointer-based joins. In STONEBRAKER, M, 1986b. The design and imple- Proceedings of ACM SIGMOD (70nferen ce. mentation of distributed INGRES. In The ACM, New York, 300. INGRES Papers. Addison-Wesley, Reading, SHERMAN, S. W., AND BRICE, R. S. 1976. Perfor- Mass., 187. mance of a database manager in a virtual STONEBRAKER, M. 1981. Operating system sup- memory system. ACM Trans. Data base Syst. port for database management. Comrnun. ACM 1, 4 (Dec.), 317. 24, 7 (July), 412. SHETH, A. P., AND LARSON, J. A. 1990, Federated STONEBRAKER, M. 1975. Implementation of in- database systems for managing distributed, tegrity constraints and views by query modifi- heterogeneous, and autonomous databases. cation. In Proceedings of ACM SIGMOD Con- ACM Comput. Surv. 22, 3 (Sept.), 183. ference ACM, New York. SHIPMAN, D. W. 1981. The functional data model STONEBRAKER, M., AOKI, P., AND SELTZER, M. and the data 1anguage DAPLEX. ACM Trans. 1988a. Parallelism in XPRS. UCB/ERL Database Syst. 6, 1 (Mar.), 140. Memorandum M89\ 16, Univ. of California, SIKELER, A. 1988. VAR-PAGE-LRU: A buffer re- Berkeley. placement algorithm supporting different page STONEBRAWCR, M., JHINGRAN, A., GOH, J., AND sizes. In Lecture Notes in Computer Science, POTAMIANOS, S. 1990a. On rules, procedures, vol. 303. Springer-Verlag, New York, 336. caching and views in data base systems In SILBERSCHATZ, A., STONEBRAKER, M., AND ULLMAN, J. Proceedings of ACM SIGMOD Conference. 1991. Database systems: Achievements and ACM, New York, 281 opportunities. Commun. ACM 34, 10 (Oct.), STONEBRAKER,M., KATZ, R., PATTERSON,D., AND 110, OUSTERHOUT,J. 1988b. The design of XPRS, SIX, H. W., AND WIDMAYER, P. 1988. Spatial In Proceedings of the International Conference searching in geometric databases. In Proceed- on Very Large Data Bases (Los Angeles, Aug.). ings of the IEEE Conference on Data Enginee- VLDB Endowment, 318. ring. IEEE, New York, 496. STONEBBAKER, M., ROWE, L. A., AND HIROHAMA, M. SMITH,J. M., ANDCHANG,P. Y. T. 1975. Optimiz- 1990b. The implementation of Postgres. IEEE ing the performance of a relational algebra Trans. Knowledge Data Eng. 2, 1 (Mar.), 125. database interface. Commun. ACM 18, 10 STRAUBE, D. D., AND OLSU, M. T. 1989. Query (Oct.), 568. transformation rules for an object algebra, SNODGRASS. R, 1990. Temporal databases: Status Dept. of Computing Sciences Tech, Rep. 89-23, and research directions. ACM SIGMOD Rec. Univ. of Alberta, Alberta, Canada. 19, 4 (Dec.), 83. Su, S. Y. W. 1988. Database Computers: Princ- SOCKUT, G. H., AND GOLDBERG, R. P. 1979. 
iples, Archltectur-es and Techniques. McGraw- Database reorganization—Principles and prac- Hill, New York. tice. ACM Comput. Suru. 11, 4 (Dec.), 371. TANSEL, A. U., AND GARNETT, L. 1992. On Roth, SRINIVASAN, V., AND CAREY, M. J. 1992. Perfor- Korth, and Silberschat,z’s extended algebra and mance of on-line index construction algorithms. calculus for nested relational databases. ACM In Proceedings of the International Conference Trans. Database Syst. 17, 2 (June), 374. on Extending Database Technology (Vienna, TEOROy, T. J., YANG, D., AND FRY, J. P. 1986. A Austria, Mar.). logical design metb odology for relational SRINWASAN,V., AND CAREY, M. J. 1991. Perfor- databases using the extended entity-relation- mance of B-tree concurrency control algo- ship model. ACM Cornput. Suru. 18, 2 (June). rithms. In Proceedings of ACM SIGMOD 197. Conference. ACM, New York, 416. TERADATA. 1983. DBC/1012 Data Base Com- STAMOS,J. W., ANDYOUNG,H. C. 1989. A sym- puter, Concepts and Facilities. Teradata Corpo- metric fragment and replicate algorithm for ration, Los Angeles. distributed joins. Tech. Rep. RJ7 188, IBM Re- THOMAS, G., THOMPSON, G. R., CHUNG, C. W., search Labs, San Jose, Calif. BARKMEYER, E., CARTER, F., TEMPLETON, M.,

ACM Computing Surveys, Vol. 25, No. 2, June 1993 170 “ Goetz Graefe

Fox, S., AND HARTMAN, B. 1990. Heteroge- in a mam memory database system. Ph.D. the- neous distributed database systems for produc- sis, Univ. of Tweuk, The Netherlands. tion use, ACM Comput. Surv. 22. 3 (Sept.), WiLSCHUT, A. N., AND AP~RS, P. M. G. 1993. 237. Dataflow query execution m a parallel main- TOMPA, F. W., AND BLAKELEY, J A. 1988. Main- memory environment. Distrlb. Parall. taining materialized views without accessing Databases 1, 1 (Jan.), 103. base data. Inf Swt. 13, 4, 393. WOLF, J. L , DIAS, D. M , AND Yu, P. S. 1990. An TRAI~ER, 1. L. 1982. Virtual memory manage- effective algorithm for parallelizing sort merge ment for data base systems ACM Oper. S.vst. in the presence of data skew In Proceedl ngs of Reu. 16, 4 (Oct.), 26. the International Syrnpo.?lum on Data base~ 1n TRIAGER, 1. L., GRAY, J., GALTIERI, C A., AND Parallel and DLstrlbuted Systems (Dubhn, LINDSAY, B, G. 1982. Transactions and con- Ireland, July) sistency in distributed database systems. ACM WOLF, J. L., DIAS, D M., Yu, P. S . AND TUREK, ,J. Trans. Database Syst. 7, 3 (Sept.), 323. 1991. An effective algorithm for parallelizmg TSUR, S., AND ZANIOLO, C. 1984 An implementa- hash Joins in the presence of data skew. In tion of GEM—Supporting a semantic data Proceedings of the IEEE Conference on Data model on relational back-end. In Proceedings of Engtneermg. IEEE, New York. 200 ACM SIGMOD Conference. ACM, New York, WOLNIEWWZ, R. H., AND GRAEFE, G. 1993 Al- 286. gebralc optlmlzatlon of computations over TUKEY, J, W, 1977. Exploratory Data Analysls. scientific databases. In Proceedings of the Addison-Wesley, Reading, Mass, International Conference on Very Large UNTDATA 1991. NetCDF User’s Guide, An Inter- Data Bases. VLDB Endowment. face for Data Access, Verszon III. NCAR Tech WONG, E., ANO KATZ. R. H. 1983. Dlstributmg a Note TS-334 + 1A, Boulder, Colo database for parallelism. In Proceedings of VALDURIIIZ, P. 1987. Join indices. ACM Trans. ACM SIGMOD Conference. ACM, New York, Database S.yst. 12, 2 (June). 218. 23, WONG, E., AND Youssmv, K. 1976 Decomposition VANDEN~ERC, S. L., AND DEWITT, D. J. 1991. Al- —A strategy for query processing. ACM Trans gebraic support for complex objects with ar- Database Syst 1, 3 (Sept.), 223. rays, identity, and inhentance. In Proceedmg.s of ACM SIGMOD Conference. ACM, New York, YANG, H., AND LARSON, P. A. 1987. Query trans- 158, formation for PSJ-queries. In proceedings of the International Conference on Very Large WALTON, C B. 1989 Investigating skew and Data Bases (Brighton, England, Aug. ) VLDB scalabdlty in parallel joins. Computer Science Endowment, 245. Tech. Rep. 89-39, Umv. of Texas, Austin. YOUSS~FI. K, AND WONG, E 1979. Query pro- WALTON, C, B., DALE, A. G., AND JENEVEIN, R. M. cessing in a relational database management 1991. A taxonomy and performance model of system. In Proceedings of the Internatmnal data skew effects in parallel joins. In Proceed- Conference on Very Large Data Bases (Rio de ings of the Interns tlona 1 Conference on Very Janeiro, Ott ). VLDB Endowment, 409. Large Data Bases (Barcelona, Spain). VLDB Endowment, 537. Yu, C, T., AND CHANG) C. C. 1984. Distributed query processing. ACM Comput. Surt, 16, 4 WHANG. K. Y.. AND KRISHNAMURTHY, R. 1990. (Dec.), 399. Query optimization m a memory-resident do- mam relational calculus database system. ACM YLT, L., AND OSBORN, S, L, 1991. An evaluation Trans. Database Syst. 15, 1 (Mar.), 67. framework for algebralc object-oriented query WHANG, K. Y., WIEDERHOLD G., AND SAGALOWICZ, D. models. 
In Proceedings of the IEEE Conference 1985 The property of separabdity and Its ap- on Data Englneermg. VLDB Endowment, 670. plication to physical database design. In Query ZANIOIJ I, C. 1983 The database language Gem. Processing m Database Systems. Springer, In Proceedings of ACM SIGMOD Conference. Berlin, 297. ACM, New York, 207. Rcprmted m Readcngb WHANG, K. Y., WIEDERHOLD, G., AND SAGLOWICZ, D. zn Database Systems. Morgan-Kaufman, San 1984. Separability—An approach to physical Mateo, Calif., 1988. database design. IEEE Trans. Comput. 33, 3 ZANmLo, C. 1979 Design of relational views over (Mar.), 209. network schemas. In Proceedings of ACM WILLIANLS,P., DANIELS, D., HAAs, L., LAPIS, G., SIGMOD Conference ACM, New York, 179 LINDSAY,B., NG, P., OBERMARC~,R., SELINGER, ZELLER, H. 1990. Parallel query execution m P., WALKER, A., WILMS, P., AND YOST, R. 1982 NonStop SQL. In Dtgest of Papers, 35th C’omp- R’: An overview of the architecture. In Im- Con Conference. San Francisco. provmg Database Usabdlty and Responslue - ZELLER H,, AND GRAY, J. 1990 An adaptive hash ness. Academic Press, New York. Reprinted in join algorithm for multiuser environments. In Readings m Database Systems. Morgan-Kauf- Proceedings of the International Conference on man, San Mateo, Calif., 1988 Very Large Data Bases (Brisbane, Australia). WILSCHUT, A. N. 1993. Parallel query execution VLDB Endowment, 186.

Received January 1992; final revision accepted February 1993