Optimization Techniques For Queries with Expensive Methods

JOSEPH M. HELLERSTEIN

U.C. Berkeley

Object-Relational database management systems allow knowledgeable users to define new data types, as well as new methods (operators) for the types. This flexibility produces an attendant complexity, which must be handled in new ways for an Object-Relational database management system to be efficient.

In this paper we study techniques for optimizing queries that contain time-consuming methods. The focus of traditional query optimizers has been on the choice of join methods and orders; selections have been handled by "pushdown" rules. These rules apply selections in an arbitrary order before as many joins as possible, using the assumption that selection takes no time. However, users of Object-Relational systems can embed complex methods in selections. Thus selections may take significant amounts of time, and the query optimization model must be enhanced.

In this paper, we carefully define a query cost framework that incorporates both selectivity and cost estimates for selections. We develop an algorithm called Predicate Migration, and prove that it produces optimal plans for queries with expensive methods. We then describe our implementation of Predicate Migration in the commercial Object-Relational database management system Illustra, and discuss practical issues that affect our earlier assumptions. We compare Predicate Migration to a variety of simpler optimization techniques, and demonstrate that Predicate Migration is the best general solution to date. The alternative techniques we present may be useful for constrained workloads.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems--Query Processing

General Terms: Query Optimization, Expensive Methods

Additional Key Words and Phrases: Object-Relational DBMSs, extensibility, Predicate Migration, predicate placement

Preliminary versions of this work appeared in Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1993 and 1994. This work was funded in part by a National Science Foundation Graduate Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation. This work was also funded by NSF grant IRI-9157357. This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version.

Author's current address: University of California, Berkeley, EECS Computer Science Division, 387 Soda Hall #1776, Berkeley, California, 94720-1776. Email: [email protected]. Web: http://www.cs.berkeley.edu/~jmh/.

ACM Copyright Notice: Copyright 1997 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected].


1. INTRODUCTION

One of the major themes of database research over the last 15 years has been the introduction of extensibility into Database Management Systems (DBMSs). Relational DBMSs have begun to allow simple user-defined data types and functions to be utilized in ad-hoc queries [Pirahesh 1994]. Simultaneously, Object-Oriented DBMSs have begun to offer ad-hoc query facilities, allowing declarative access to objects and methods that were previously only accessible through hand-coded, imperative applications [Cattell 1994]. More recently, Object-Relational DBMSs were developed to unify these supposedly distinct approaches to data management [Illustra Information Technologies, Inc. 1994; Kim 1993].

Much of the original research in this area focused on enabling technologies, i.e. system and language designs that make it possible for a DBMS to support extensibility of data types and methods. Considerably less research explored the problem of making this new functionality efficient.

This paper addresses a basic efficiency problem that arises in extensible database systems. It explores techniques for efficiently and effectively optimizing declarative queries that contain time-consuming methods. Such queries are natural in modern extensible database systems, which were expressly designed to support declarative queries over user-defined types and methods. "Expensive", time-consuming methods are natural for complex user-defined data types, which are often large objects that encode a significant amount of information (e.g., arrays, images, sound, video, maps, circuits, documents, fingerprints, etc.). Processing queries that contain such expensive methods efficiently requires new optimization techniques.

1.1 A GIS Example

To illustrate the issues that can arise in processing queries with expensive methods, consider the following example over the satellite image data presented in the Sequoia benchmark for Geographic Information Systems (GIS) [Stonebraker et al. 1993]. The query retrieves names of digital "raster" images taken by a satellite; in particular, it selects the names of images from the first time period of observation that show a given level of vegetation in over 20% of their pixels:

Example 1.

SELECT name
FROM rasters
WHERE rtime = 1
AND veg(raster) > 20;

In this example, the method veg reads in 16 megabytes of raster image data (infrared and visual data from a satellite), and counts the percentage of pixels that have the characteristics of vegetation (these characteristics are computed per pixel using a standard technique in remote sensing [Frew 1995]). The veg method is very time-consuming, taking many thousands of instructions and I/O operations to compute. It should be clear that the query will run faster if the selection rtime = 1 is applied before the veg selection, since doing so minimizes the number of calls to veg. A traditional optimizer would order these two selections arbitrarily, and might well apply the veg selection first. Some additional logic must be added to the optimizer to ensure that selections are applied in a judicious order.
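To make the tradeoff concrete, the following small Python sketch compares the two selection orders for Example 1 under purely hypothetical selectivities and per-tuple costs; all numbers below are assumptions chosen for illustration, not measurements from Illustra.

num_rasters = 1000      # hypothetical number of tuples in rasters
sel_rtime   = 0.05      # assumed selectivity of "rtime = 1"
sel_veg     = 0.30      # assumed selectivity of "veg(raster) > 20"
cost_rtime  = 0.0001    # assumed per-tuple cost of the cheap comparison
cost_veg    = 10.0      # assumed per-tuple cost of the expensive veg() method

# Apply rtime = 1 first: veg() runs only on the tuples that survive the cheap test.
cost_rtime_first = num_rasters * cost_rtime + num_rasters * sel_rtime * cost_veg

# Apply veg() first: the expensive method runs on every tuple.
cost_veg_first = num_rasters * cost_veg + num_rasters * sel_veg * cost_rtime

print(cost_rtime_first)  # 500.1
print(cost_veg_first)    # 10000.03

Whatever the exact numbers, the order that defers the expensive method to the smaller intermediate result wins whenever its per-tuple cost dominates.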

While selection ordering such as this is important, correctly ordering selections within a table access is not sufficient to solve the general optimization problem of where to place predicates in a query execution plan. Consider the following example, which joins the rasters table with a table that contains notes on the rasters:

Example 2.

SELECT rasters.name, notes.note
FROM rasters, notes
WHERE rasters.rtime = notes.rtime
AND notes.author = 'Clifford'
AND veg(rasters.raster) > 20;

Traditionally, an optimizer would plan this query by applying all the single-table selections in the WHERE clause before performing the join of rasters and notes. This heuristic, often called "predicate pushdown", is considered beneficial since early selections usually lower the complexity of join processing, and are traditionally considered to be trivial to check [Palermo 1974]. However, in this example the cost of evaluating the expensive selection predicate may outweigh the benefit gained by doing selection before join. In other words, this may be a case where predicate pushdown is precisely the wrong technique. What is needed here is "predicate pullup", namely postponing the time-consuming selection veg(rasters.raster) > 20 until after computing the join of rasters and notes.

In general it is not clear how joins and selections should be interleaved in an optimal execution plan, nor is it clear whether the migration of selections should have an effect on the join orders and methods used in the plan. We explore these query optimization problems from both a theoretical point of view and from the experience of producing an industrial-strength implementation for the Illustra Object-Relational DBMS.

1.2 Benefits for RDBMSs: Subqueries

It is important to note that expensive methods do not exist only in next-generation Object-Relational DBMSs. Current relational languages, such as the industry standard SQL [ISO ANSI 1993], have long supported expensive predicate methods in the guise of subquery predicates. A subquery predicate is one of the form expression operator query. Evaluating such a predicate requires executing an arbitrary query and scanning its result for matches -- an operation that is arbitrarily expensive, depending on the complexity and size of the subquery. While some subquery predicates can be converted into joins (thereby becoming subject to traditional join-based optimization and execution strategies), even sophisticated SQL rewrite systems such as that of DB2/CS [Pirahesh et al. 1992; Seshadri et al. 1996] cannot convert all subqueries to joins. When one is forced to compute a subquery in order to evaluate a predicate, the predicate should be treated as an expensive method. Thus the work presented in this paper is applicable to the majority of today's production RDBMSs, which support SQL subqueries but do not intelligently place subquery predicates in a query plan.


1.3 Outline

Section 2 presents the theoretical underpinnings of Predicate Migration, an algorithm to optimally place expensive predicates in a query plan. Section 3 discusses three simpler alternatives to Predicate Migration, and uses the performance of queries in Illustra to demonstrate the situations in which each technique works. Section 4 discusses related work in the research literature. Concluding remarks and directions for future research appear in Section 5.

1.4 Environment for Experiments

In the course of the paper we will examine the behavior of various algorithms via both analysis and experiments. All the experiments presented in the paper were run in the commercial Object-Relational DBMS Illustra. A development version of Illustra was used, similar to the publicly released version 2.4.1. Except when otherwise noted, Illustra was run with its default configuration, with the exception of settings to produce traces of query plans and execution times. The machine used was a Sun Sparcstation 10/51 with 2 processors and 64 megabytes of RAM, running SunOS Release 4.1.3. One Seagate 2.1-gigabyte SCSI disk (model #ST12400N) was used to hold the databases. The binaries for Illustra were stored on an identical Seagate 2.1-gigabyte SCSI disk, Illustra's logs were stored on a Seagate 1.05-gigabyte SCSI disk (model #ST31200N), and 139 megabytes of swap space were allocated on another Seagate 1.05-gigabyte SCSI disk (model #ST31200N).

Due to restrictions from Illustra Information Technologies, most of the performance numbers presented in the paper are relative rather than absolute: the graphs are scaled so that the lowest data point has the value 1.0. Unfortunately, commercial database systems typically have a clause in their license agreements that prohibits the release of performance numbers. It is unusual for a database vendor to permit publication of any performance results at all, relative or otherwise [Carey et al. 1994]. The scaling of results in this paper does not affect the conclusions drawn from the experiments, which are based on the relative performance of the various approaches.

2. A THEORETICAL BASIS

In general it is not clear how joins and selections should be interleaved in an optimal execution plan, nor is it clear whether the migration of selections should have an effect on the join orders and methods used in the plan. This section describes and proves the correctness of the Predicate Migration Algorithm, which produces an optimal query plan for queries with expensive predicates. Predicate Migration modestly increases query optimization time: the additional cost factor is polynomial in the number of operators in a query plan. This compares favorably to the exponential join enumeration schemes used by standard query optimizers, and is easily circumvented when optimizing queries without expensive predicates -- if no expensive predicates are found while parsing the query, the techniques of this section need not be invoked. For queries with expensive predicates, the gains in execution speed should offset the extra optimization time. We have implemented Predicate Migration in Illustra, integrating it with Illustra's standard System R-style optimizer [Selinger et al. 1979]. With modest overhead in optimization time, Predicate Migration can reduce the execution time of many practical queries by orders of magnitude. This is illustrated further below.

2.1 Background: Optimizer Estimates

To develop our optimizations, we must enhance the traditional model for analyzing query plan cost. This will involve some modification of the usual metrics for the expense and selectivity of relational operators. This preliminary discussion of our model will prove critical to the analysis below.

A relational query in a language such as SQL may have a WHERE clause, which contains an arbitrary Boolean expression over constants and the range variables of the query. We break such clauses into a maximal set of conjuncts, or "Boolean factors" [Selinger et al. 1979], and refer to each Boolean factor as a distinct "predicate" to be satisfied by each result tuple of the query. When we use the term "predicate" below, we refer to a Boolean factor of the query's WHERE clause. A join predicate is one that refers to multiple tables, while a selection predicate refers only to a single table.

2.1.1 Selectivity. Traditional query optimizers compute selectivities for both joins and selections. That is, for any predicate p (join or selection) they estimate the value

    selectivity(p) = cardinality(output(p)) / cardinality(input(p)).

Typically these estimations are based on default values and statistics stored by the DBMS [Selinger et al. 1979], although recent work suggests that inexpensive sampling techniques can be used [Lipton et al. 1993; Hou et al. 1988; Haas et al. 1995]. Accurate selectivity estimation is a difficult problem in query optimization, and has generated increasing interest in recent years [Ioannidis and Christodoulakis 1991; Faloutsos and Kamel 1994; Ioannidis and Poosala 1995; Poosala et al. 1996; Poosala and Ioannidis 1996]. In Illustra, selectivity estimation for user-defined methods can be controlled through the selfunc flag of the create function command [Illustra Information Technologies, Inc. 1994]. In this paper we make the standard assumptions of most query optimization algorithms, namely that estimates are accurate and predicates have independent selectivities.
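As a small illustration of the estimate above, the sketch below (Python) derives per-predicate selectivities and combines them under the independence assumption. The catalog statistics and the default rules are assumptions in the spirit of System R-style estimation, not Illustra's actual code.

# Hypothetical catalog statistics, for illustration only.
stats = {"rasters.rtime": {"distinct_values": 50}}

def eq_selectivity(column):
    # System R-style default for an equality predicate: 1 / (number of distinct values).
    return 1.0 / stats[column]["distinct_values"]

def conjunct_selectivity(selectivities):
    # Independence assumption: selectivities of Boolean factors multiply.
    result = 1.0
    for s in selectivities:
        result *= s
    return result

sel_rtime = eq_selectivity("rasters.rtime")        # 0.02
sel_veg = 0.3                                      # assumed, as if supplied via a selfunc-style hook
print(conjunct_selectivity([sel_rtime, sel_veg]))  # approximately 0.006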

2.1.2 Differential Cost of User-Defined Methods. In an extensible system such as Illustra, arbitrary user-defined methods may be introduced into both selection and join predicates. These methods can be written in a general programming language such as C, or in a database query language, e.g., SQL. In this section we discuss programming language methods; we handle query language methods and subqueries in Section 2.1.3.

Given that user-defined methods may be written in a general purpose language such as C, it is difficult for the database to correctly estimate the cost of predicates containing these methods, at least initially.[1] As a result, Illustra includes syntax to give users control over the optimizer's estimates of cost and selectivity for user-defined methods.

[1] After repeated applications of a method, one could collect performance statistics and use curve-fitting techniques to make estimates about the method's behavior; see for example [Boulos et al. 1997].

To introduce a method to Illustra, a user first writes the method in C and compiles it, and then issues Illustra SQL's create function statement, which registers the method with the database system. To capture optimizer information, the create function statement accepts a number of special flags, which are summarized in Table 1.

flag name     description
percall_cpu   execution time per invocation, regardless of argument size
perbyte_cpu   execution time per byte of arguments
byte_pct      percentage of argument bytes that the method needs to access

Table 1. Method expense parameters in Illustra.

The cost of evaluating a predicate on a single tuple in Illustra is computed by adding up the costs for all the expensive methods in the predicate expression. Given an Illustra predicate p(a_1, ..., a_n), the expense per tuple is recursively defined as:

    e_p = sum_{i=1..n} e_{a_i} + percall_cpu(p)
          + perbyte_cpu(p) * (byte_pct(p) / 100) * sum_{i=1..n} bytes(a_i)
          + access_cost,                    if p is a method;

    e_p = 0,                                if p is a constant or tuple variable,

where e_{a_i} is the recursively computed expense of argument a_i, bytes(a_i) is the expected (return) size of the argument in bytes, and access_cost is the cost of retrieving any data necessary to compute the method.
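The per-tuple expense formula can be read as a recursion over the predicate's expression tree. The following Python sketch is one illustrative way to compute it; the class names and all numeric values are invented, and only the flag names mirror Table 1.

class Arg:
    """A constant or tuple variable: zero expense, known expected size in bytes."""
    def __init__(self, nbytes):
        self.nbytes = nbytes

    def expense(self):
        return 0.0

class Method:
    """A user-defined method annotated with Table 1-style cost flags (values invented)."""
    def __init__(self, percall_cpu, perbyte_cpu, byte_pct, access_cost, args, nbytes):
        self.percall_cpu = percall_cpu    # cost per invocation
        self.perbyte_cpu = perbyte_cpu    # cost per argument byte accessed
        self.byte_pct = byte_pct          # percentage of argument bytes the method touches
        self.access_cost = access_cost    # cost of retrieving data needed by the method
        self.args = args                  # argument subexpressions (Arg or Method)
        self.nbytes = nbytes              # expected size of the return value, in bytes

    def expense(self):
        # Recursive per-tuple expense, following the formula above.
        arg_expense = sum(a.expense() for a in self.args)
        arg_bytes = sum(a.nbytes for a in self.args)
        return (arg_expense + self.percall_cpu
                + self.perbyte_cpu * (self.byte_pct / 100.0) * arg_bytes
                + self.access_cost)

# e.g. veg(raster): a single 16-megabyte argument, every byte touched (all numbers assumed).
veg = Method(percall_cpu=1.0, perbyte_cpu=0.001, byte_pct=100,
             access_cost=2000.0, args=[Arg(16 * 1024 * 1024)], nbytes=8)
print(veg.expense())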

Note that the expense of a method reflects the cost of evaluating the method on a single tuple, not the cost of evaluating it for every tuple in a relation. While this "cost per tuple" metric is natural for method invocations, it is less natural for predicates, since predicates take relations as inputs. Predicate costs are typically expressed as a function of the cardinality of their input. Thus the per-tuple cost of a predicate can be thought of as the differential cost of the predicate on a relation -- i.e., the increase in processing time that would result if the cardinality of the input relation were increased by one. This is the derivative of the cost of the predicate with respect to the cardinality of its input.

2.1.3 Differential Cost of Query Language Methods. Since its inception, SQL has allowed a variety of subquery predicates of the form expression operator query. Such predicates require computation of an arbitrary SQL query for evaluation. Simple uncorrelated subqueries have no references to query blocks at higher nesting levels, while correlated subqueries refer to tuple variables in higher nesting levels.

In principle, the cost to check an uncorrelated subquery selection is the cost e_c of computing and materializing the subquery once, and the cost e_s of scanning the subquery's result once per tuple. Thus the differential cost of an uncorrelated subquery is e_s. This should be intuitive: since the cost of initially materializing an uncorrelated subquery must be paid regardless of the subquery's location in the plan, we can ignore the overhead of the computation and materialization cost e_c.

Correlated subqueries must be recomputed for each tuple that is checked against the subquery predicate, and hence the differential cost for correlated subqueries is e_c. We ignore e_s here since scanning can be done during each recomputation, and does not represent a separate cost. Illustra also allows for user-defined methods that can be written in SQL; these are essentially named, correlated subqueries.
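The rule above reduces to a one-line case split. The fragment below (Python, with illustrative costs) simply restates it: an uncorrelated subquery contributes only its per-tuple scan cost e_s to the differential cost, while a correlated subquery contributes its per-evaluation computation cost e_c.

def subquery_differential_cost(e_c, e_s, correlated):
    """e_c: cost of computing and materializing the subquery once.
    e_s: cost of scanning its materialized result once per tuple."""
    if correlated:
        # Recomputed for every probing tuple; scanning folds into each recomputation.
        return e_c
    # Materialized once regardless of placement; only the per-tuple scan matters.
    return e_s

print(subquery_differential_cost(e_c=500.0, e_s=2.5, correlated=False))  # 2.5
print(subquery_differential_cost(e_c=500.0, e_s=2.5, correlated=True))   # 500.0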

The cost estimates presented here for query language methods form a simple model, and raise some issues in setting costs for subqueries. The cost of a subquery predicate may be lowered by transforming it to another subquery predicate [Lohman et al. 1984], and by "early stop" techniques, which stop materializing or scanning a subquery as soon as the predicate can be resolved [Dayal 1987]. Incorporating such schemes is beyond the scope of this paper, but including them in the framework of the later sections merely requires more careful estimates of the differential subquery costs.

2.1.4 Estimates for Joins. In our subsequent analysis, we will be treating joins and selections uniformly in order to optimally balance their costs and benefits. In order to do this, we will need to measure the expense of a join per tuple of each of the join's inputs; that is, we need to estimate the differential cost of the join with respect to each input. We are given a join algorithm over outer relation R and inner relation S, with cost function f(|R|, |S|), where |R| and |S| are the numbers of tuples in R and S respectively. From this information, we compute the differential cost of the join with respect to its outer relation as ∂f/∂|R|; the differential cost of the join with respect to its inner relation is ∂f/∂|S|. We will see in Section 3 that these partial differentials are constants for all the well-known join algorithms, and hence the cost of a join per tuple of each input is typically well defined and independent of the cardinality of either input.

We also need to characterize the selectivity of a join with respect to each of its inputs. Traditional selectivity estimation [Selinger et al. 1979] computes the selectivity s_J of a join J of relations R and S as the expected number of tuples in the output of J (O_J) over the number of tuples in the Cartesian product of the input relations, i.e., s_J = |O_J| / |R × S| = |O_J| / (|R| · |S|). The selectivity s_J(R) of the join with respect to R can be derived from the traditional estimation: it is the size of the output of the join relative to the size of R, i.e., s_J(R) = |O_J| / |R| = s_J · |S|. The selectivity s_J(S) with respect to S is derived similarly as s_J(S) = |O_J| / |S| = s_J · |R|.
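For concreteness, the sketch below (Python) works through these definitions for a stylized linear join cost function f(|R|, |S|) = a·|R| + b·|S|, of the kind one might use for a hash join; the constants a and b and the cardinalities are assumptions for illustration only.

a, b = 1.0, 1.2              # assumed per-tuple processing costs for outer and inner input

def join_cost(card_R, card_S):
    # Stylized linear cost model: f(|R|, |S|) = a*|R| + b*|S|.
    return a * card_R + b * card_S

# Differential costs are the partial derivatives of f, which here are just the constants.
diff_cost_wrt_outer = a      # df / d|R|
diff_cost_wrt_inner = b      # df / d|S|

def per_input_selectivities(s_J, card_R, card_S):
    # Derived from the traditional estimate s_J = |O_J| / (|R| * |S|).
    s_J_R = s_J * card_S     # |O_J| / |R|
    s_J_S = s_J * card_R     # |O_J| / |S|
    return s_J_R, s_J_S

print(join_cost(10000, 2000))                                          # 12400.0
print(per_input_selectivities(s_J=0.001, card_R=10000, card_S=2000))   # (2.0, 10.0)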

Note that a query may contain multiple join predicates over the same set of relations. In an execution plan for a query, some of these predicates are used in processing a join, and we call these primary join predicates. Merge join, hash join, and index nested-loop join all have primary join predicates implicit in their processing. Join predicates that are not applicable in processing the join are merely used to filter its output, and we refer to these as secondary join predicates. Secondary join predicates are essentially no different from selection predicates, and we treat them as such. These predicates may then be reordered and even pulled up above higher join nodes, just like selection predicates. Note, however, that a secondary join predicate must remain above its corresponding primary join. Otherwise the secondary join predicate would be impossible to evaluate.[2]

[2] Nested-loop join without an index is essentially a Cartesian product followed by selection, but inexpensive predicates on an unindexed nested-loop join may be considered primary join predicates, since they will not be pulled up. All expensive join predicates are considered secondary, since they are not essential to the join method and may be pulled up in the plan.

2.2 Optimal Plans for Queries With Expensive Predicates

At first glance, the task of correctly optimizing queries containing expensive predicates appears exceedingly complex. Traditional query optimizers already search a plan space that is exponential in the number of relations being joined; multiplying this plan space by the number of permutations of the selection predicates could make traditional plan enumeration techniques prohibitively expensive. In this section we prove the reassuring results that:

(1) Given a particular query plan, its selection predicates can be optimally interleaved based on a simple sorting algorithm.

(2) As a result of the previous point, we need merely enhance the traditional join plan enumeration with techniques to interleave the predicates of each plan appropriately. This interleaving takes time that is polynomial in the number of operators in a plan.

2.2.1 Optimal Predicate Ordering in Table Accesses. We begin our discussion by focusing on the simple case of queries over a single table. Such queries can have an arbitrary number of selection predicates, each of which may be a complicated Boolean function over the table's range variables, possibly containing expensive subqueries or user-defined methods. Our task is to order these predicates in such a way as to minimize the expense of applying them to the tuples of the relation being scanned.

If the access path for the query is an index scan, then all the predicates that match the index and can be satisfied during the scan are applied first. This is because such predicates have essentially zero cost: they are not actually evaluated,[3] rather the indices are traversed to retrieve only those tuples that qualify. We will represent the subsequent non-index predicates as p_1, ..., p_n, where the subscript of a predicate represents its place in the order in which the predicates are applied to each tuple of the base table. We represent the (differential) expense of a predicate p_i as e_{p_i}, and its selectivity as s_{p_i}. Assuming the independence of distinct predicates, the cost of applying all the non-index predicates to the output of a scan containing t tuples is

    e = e_{p_1} t + s_{p_1} e_{p_2} t + ... + (s_{p_1} s_{p_2} ... s_{p_{n-1}}) e_{p_n} t.

[3] It is possible to index tables on method values as well as on table attributes [Maier and Stein 1986; Lynch and Stonebraker 1988]. If a scan is done on such a "method" index, then predicates over the method may be satisfied during the scan without invoking the method. As a result, these predicates are considered to have zero cost, regardless of the method's expense.

The following lemma demonstrates that this cost can be minimized by a simple sort on the predicates. It is analogous to the Least-Cost Fault Detection problem addressed by Monma and Sidney [Monma and Sidney 1979].

Lemma 1. The cost of applying expensive selection predicates to a set of tuples is minimized by applying the predicates in ascending order of the metric

    rank = (selectivity - 1) / (differential cost)
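A minimal sketch of Lemma 1 in Python, using hypothetical predicate costs and selectivities: predicates are sorted by rank = (selectivity - 1)/(differential cost), and a brute-force check over all permutations confirms that no other order of these particular predicates is cheaper.

from itertools import permutations

# Hypothetical predicates: (name, selectivity, differential cost per tuple).
preds = [("rtime = 1",        0.05, 0.0001),
         ("veg(raster) > 20", 0.30, 10.0),
         ("cheap_filter",     0.90, 0.001)]

def rank(p):
    _, sel, cost = p
    return (sel - 1.0) / cost

def sequence_cost(order, tuples=1000.0):
    """Cost of applying predicates in the given order to `tuples` input tuples."""
    cost, surviving = 0.0, tuples
    for _, sel, per_tuple_cost in order:
        cost += surviving * per_tuple_cost
        surviving *= sel
    return cost

by_rank = sorted(preds, key=rank)
assert all(sequence_cost(by_rank) <= sequence_cost(p) for p in permutations(preds))
print([name for name, _, _ in by_rank], sequence_cost(by_rank))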


[Fig. 1. Two execution plans for Example 1. Plan 1, generated without the rank-sort optimization, applies the expensive veg selection before the rtime selection; Plan 2 applies the selections in ascending order of rank, with rtime = 1 first.]

              Optimization Time         Execution Time
Query Plan    CPU        Elapsed        CPU               Elapsed
Plan 1        0.01 sec   0.02 sec       2 min 18.09 sec   3 min 25.40 sec
Plan 2        0.10 sec   0.10 sec       0 min  0.03 sec   0 min  0.10 sec

Table 2. Performance of plans for Example 1.

Thus we see that for single-table queries, predicates can be optimally ordered by simply sorting them by their rank. Swapping the position of predicates with equal rank has no effect on the cost of the sequence.

To see the effects of reordering selections, we return to Example 1 from the introduction. We ran the query in Illustra without the rank-sort optimization, generating Plan 1 of Figure 1, and with the rank-sort optimization, generating Plan 2 of Figure 1. As we expect from Lemma 1, the first plan has higher cost than the second plan, since the second is correctly ordered by rank. The optimization and execution times were measured for both runs, as illustrated in Table 2. We see that correctly ordering selections can improve query execution time by orders of magnitude, even for simple queries of two predicates and one relation.

2.2.2 Predicate Migration: Placing Selections Among Joins. In the previous section, we established an optimal ordering for selections. In this section, we explore the issue of ordering selections among joins. Since we will eventually be applying our optimization to each plan produced by a typical join-enumerating query optimizer, our model here is that we are given a fixed join plan, and want to minimize the plan's cost under the constraint that we may not change the order of the joins. This section develops a polynomial-time algorithm to optimally place selections and secondary join predicates in a given join plan. In Section 2.5 we show how to efficiently integrate this algorithm into a traditional optimizer, so that the optimal plan is chosen from the space of all possible join orders, join methods, and selection placements.

2.2.3 Definitions. The thrust of this section is to handle join predicates in our ordering scheme in the same way that we handle selection predicates: by having them participate in an ordering based on rank.

Definition 1. A plan tree is a tree whose leaves are scan nodes, and whose internal nodes are either joins or selections. Tuples are produced by scan nodes and flow upwards along the edges of the plan tree.[4]

[4] We do not consider common subexpressions or recursive queries, and hence disallow plans that are DAGs or general graphs.

Some optimization schemes constrain plan trees to be within a particular class, such as the left-deep trees, which have scans as the right child of every join. Our methods will not require this limitation.

Definition 2. A stream in a plan tree is a path from a leaf node to the root.

Figure 3 illustrates a plan tree, with one of its two plan streams outlined. Within the framework of a single stream, a join node is simply another predicate; although it has a different number of inputs than a selection, it can be treated in an identical fashion. For each input to the join, one can use the definitions of Section 2.1.4 to compute the differential cost of the join on that stream, the selectivity on that stream, and hence the rank of the join in that stream. These estimations require some assumptions about the join cost and selectivity modelling, which we revisit in Section 3. For the purposes of this section, however, we assume these costs and selectivities are estimated accurately.

In later analysis it will prove useful to assume that all nodes have distinct ranks. To make this assumption, we must prove that swapping nodes of equal rank has no effect on the cost of a plan.

Lemma 2. Swapping the positions of two equi-rank nodes has no effect on the cost of a plan tree.

Knowing this, we could achieve a unique ordering on rank by assigning unique ID numbers to each node in the tree and ordering nodes on the pair (rank, ID). Rather than introduce the ID numbers, however, we will make the simplifying assumption that ranks are unique.

In moving selections around a plan tree, it is possible to push a selection down to a location in which the selection cannot be evaluated. This notion is captured in the following definition:

Definition 3. A plan stream is semantically incorrect if some predicate in the stream refers to attributes that do not appear in the predicate's input. Otherwise it is semantically correct. A plan tree is semantically incorrect if it contains a semantically incorrect stream; otherwise it is semantically correct.

Trees can be rendered semantically incorrect by pushing a secondary join predicate below its corresponding primary join, or by pulling a selection from one input stream above a join and then pushing it down below the join into the other input stream. We will need to be careful later on to rule out these possibilities.

In our subsequent analysis, we will need to identify plan trees that are equivalent except for the location of their selections and secondary join predicates. We formalize this as follows:

Definition 4. Two plan trees T and T' are join-order equivalent if they contain the same set of nodes, and there is a bijection g from the streams of T to the streams of T' such that for any stream s of T, s and g(s) contain the same join nodes in the same order.


2.2.4 The Predicate Migration Algorithm: Optimizing a Plan Tree by Optimizing its Streams. Our approach to optimizing a plan tree will be to treat each of its streams individually, and sort the nodes in the streams based on their rank. Unfortunately, sorting a stream in a general plan tree is not as simple as sorting the selections in a table access, since the order of nodes in a stream is constrained in two ways. First, we are not allowed to reorder join nodes, since join-order enumeration is handled separately from Predicate Migration. Second, we must ensure that each stream remains semantically correct. In some situations, these constraints may preclude the option of simply ordering a stream by ascending rank, since a predicate p_1 may be constrained to precede a predicate p_2, even though rank(p_1) > rank(p_2). In such situations, we will need to find the optimal ordering of predicates in the stream subject to the precedence constraints.

Monma and Sidney [Monma and Sidney 1979] have shown that finding the optimal ordering for a single stream under these kinds of precedence constraints can be done fairly simply. Their analysis is based on two key results:

(1) A set S of plan nodes can be grouped into job modules, where a job module is defined as a subset of nodes S' ⊆ S such that for each element n of S - S', n has the same constraint relationship (must precede, must follow, or unconstrained) with respect to all nodes in S'. An optimal ordering for a job module forms a subset of an optimal ordering for the entire stream.

(2) For a job module {p_1, p_2} such that p_1 is constrained to precede p_2 and rank(p_1) > rank(p_2), an optimal ordering will have p_1 directly preceding p_2, with no other predicates in between.

Monma and Sidney use these principles to develop the Series-Parallel Algorithm Using Parallel Chains, an O(n log n) algorithm that can optimize an arbitrarily constrained stream. The algorithm repeatedly isolates job modules in a stream, optimizing each job module individually, and using the resulting orders for job modules to find a total order for the stream. We use a version of their algorithm as a subroutine in our optimization algorithm:

Predicate Migration Algorithm: To optimize a plan tree, push all predicates down as far as possible, and then repeatedly apply the Series-Parallel Algorithm Using Parallel Chains [Monma and Sidney 1979] to each stream in the tree, until no more progress can be made.

Pseudo-code for the Predicate Migration Algorithm is given in Figure 2, and we provide a brief explanation of the algorithm here. The constraints in a plan tree are not general series-parallel constraints, and hence our version of Monma and Sidney's Series-Parallel Algorithm Using Parallel Chains is somewhat simplified.

The function predicate_migration first pushes all predicates down as far as possible. This pre-processing is typically automatic in most System R-style optimizers. The rest of predicate_migration is made up of a nested loop. The outer do loop ensures that the algorithm terminates only when no more progress can be made (i.e., when all streams are optimally ordered). The inner loop cycles through all the streams in the plan tree, applying a simple version of Monma and Sidney's Series-Parallel Algorithm Using Parallel Chains.


/* Optimally locate selections in a query plan tree. */

predicate_migration(tree)

{

push all predicates down as far as possible;

do {

for (each stream in tree)

series_parallel(stream);

} until no progress can be made;

}

/* Monma & Sidney's Series-Parallel Algorithm */

series_parallel(stream)

{

for (each join node J in stream, from top to bottom) {

if (there is a node N constrained to follow J,

and N is not constrained to precede anything else)

/* nodes following J form a job module */

parallel_chains(all nodes constrained to follow J);

}

/* stream is now a job module */

parallel_chains(stream);

discard any constraints introduced by parallel_chains;

}

/* Monma and Sidney's Parallel Chains Algorithm */

parallel_chains(module)

{

chain = {nodes in module that form a chain of constraints};

/* By default, each node forms a group by itself */

find_groups(chain);

if (groups in module aren't sorted by their group's ranks)

sort nodes in module by their group's ranks; /* progress! */

/* the resulting order reflects the optimized module */

introduce constraints to preserve the resulting order;

}

/* find adjacent groups constrained to be ill-ordered & merge them. */

find_groups(chain)

{

initialize each node in chain to be in a group by itself;

while (any 2 adjacent groups a,b aren't ordered by ascending group rank) {

form group ab of a and b;

group_cost(ab) = group_cost(a) + (group_selectivity(a) * group_cost(b));

group_selectivity(ab) = group_selectivity(a) * group_selectivity(b);

}

}

Fig. 2. Predicate Migration Algorithm.


The series_parallel routine traverses the stream from the top down, repeatedly finding modules of the stream to optimize. Given a module, it calls parallel_chains to order the nodes of the module optimally. When parallel_chains finds the optimal ordering for the module, it introduces constraints to maintain that ordering as a chain of nodes. Thus series_parallel uses the parallel_chains subroutine to convert the stream, from the top down, into a chain. Once the lowest join node of the stream has been handled by parallel_chains, the resulting stream has a chain of nodes and possibly a set of unconstrained selections at the bottom. This entire stream is a job module, and parallel_chains can be called to optimize the stream into a single ordering.

Our version of the Parallel Chains algorithm expects as input a set of nodes that can be partitioned into two subsets: one of nodes that are constrained to form a chain, and another of nodes that are unconstrained relative to any node in the entire set. Note that by traversing the stream from the top down, series_parallel always provides correct input to parallel_chains.[5] The parallel_chains routine first finds groups of nodes in the chain that are constrained to be ordered sub-optimally (i.e., by descending rank). As shown by Monma and Sidney [Monma and Sidney 1979], there is always an optimal ordering in which such nodes are adjacent, and hence such nodes may be considered as an undivided group. The find_groups routine identifies the maximal-sized groups of poorly-ordered nodes. After all groups are formed, the module can be sorted by the rank of each group. The resulting total order of the module is preserved as a chain by introducing extra constraints. These extra constraints are discarded after the entire stream is completely ordered.

[5] Note also that for each module S' that series_parallel constructs from a stream S, each node of S - S' is constrained in exactly the same way with respect to each node of S': every element of S - S' is either a primary join predicate constrained to precede all of S', or a selection or secondary join predicate that is unconstrained with respect to all of S'. Thus parallel_chains is always passed a valid job module.
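The group-merging step at the heart of parallel_chains and find_groups can be sketched as follows (Python; a simplified, illustrative version that handles a single constrained chain rather than Illustra's full implementation). Adjacent groups that are ordered by descending rank are merged, with the merged group's cost and selectivity composed as in the find_groups pseudo-code of Figure 2.

# Each node/group is a (cost, selectivity) pair; list order encodes the precedence chain.

def group_rank(group):
    cost, sel = group
    return (sel - 1.0) / cost

def combine(a, b):
    # Cost and selectivity of applying group a and then group b (as in find_groups).
    cost_a, sel_a = a
    cost_b, sel_b = b
    return (cost_a + sel_a * cost_b, sel_a * sel_b)

def find_groups(chain):
    groups = [tuple(node) for node in chain]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups) - 1):
            if group_rank(groups[i]) > group_rank(groups[i + 1]):
                # Two adjacent groups constrained to be ill-ordered: merge them.
                groups[i:i + 2] = [combine(groups[i], groups[i + 1])]
                merged = True
                break
    return groups

# A chain whose middle node outranks its successor collapses into a merged group.
print(find_groups([(1.0, 0.9), (10.0, 0.1), (0.5, 0.5)]))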

When predicate_migration terminates, it leaves a tree in which each stream has been ordered by the Series-Parallel Algorithm Using Parallel Chains. The interested reader is referred to [Monma and Sidney 1979] for justification of why the Series-Parallel Algorithm Using Parallel Chains optimally orders a stream.

2.3 Predicate Migration: Proofs of Optimality

Upon termination, the Predicate Migration Algorithm produces a semantically correct tree in which each stream is well-ordered according to Monma and Sidney; that is, each stream, taken individually, is optimally ordered subject to its precedence constraints. We proceed to prove that the Predicate Migration Algorithm is guaranteed to terminate in polynomial time, and that the resulting tree of well-ordered streams represents the optimal choice of predicate locations for the entire plan tree.

Lemma 3. Given a join node J in a module, adding a selection or secondary join predicate R to the stream does not increase the rank of J's group.

Lemma 4. For any join J and selection or secondary join predicate R in a plan tree, if the Predicate Migration Algorithm ever places R above J in any stream, it will never subsequently place J below R.

[Fig. 3. Plans for Example 2, with and without Predicate Migration. Without Predicate Migration, the expensive selection veg(raster) > 50 is applied in an unoptimized plan stream below the join of rasters and notes; with Predicate Migration it is pulled above the join, while the cheap selection author = 'Clifford' remains below.]

As a corollary to Lemma 4, we can modify the parallel_chains routine: instead of actually sorting a module, it can simply pull up each selection or secondary join predicate above as many groups as possible, thus potentially lowering the number of comparisons in the routine. This optimization is implemented in Illustra.

Theorem 1. Given any plan tree as input, the Predicate Migration Algorithm is guaranteed to terminate in polynomial time, producing a semantically correct, join-order equivalent tree in which each stream is well-ordered.

We have now seen that the Predicate Migration Algorithm correctly orders each stream within a polynomial number of steps. All that remains is to show that the resulting tree is in fact optimal. We do this by showing that:

(1) There is only one semantically correct tree of well-ordered streams.

(2) Among all semantically correct trees, some tree of well-ordered streams is of minimum cost.

(3) Since the output of the Predicate Migration Algorithm is the semantically correct tree of well-ordered streams, it is a minimum-cost semantically correct tree.

Theorem 2. For every plan tree T_1 there is a unique semantically correct, join-order equivalent plan tree T_2 with only well-ordered streams. Moreover, among all semantically correct trees that are join-order equivalent to T_1, T_2 is of minimum cost.

2.4 Example 2 Revisited

Theorems 1 and 2 demonstrate that the Predicate Migration Algorithm produces our desired minimum-cost interleaving of predicates. As a simple illustration of the efficacy of Predicate Migration, we go back to Example 2 from the introduction. Figure 3 illustrates the plans generated for this query by Illustra running both with and without Predicate Migration. The performance measurements for the two plans appear in Table 3. It is clear from this example that failure to pull expensive selections above joins can cause performance degradation of orders of magnitude. A more detailed study of placing selections among joins appears in the next section.


                      Optimization Time         Execution Time
Query Plan            CPU        Elapsed        CPU               Elapsed
Without Pred. Mig.    0.10 sec   0.10 sec       2 min 24.49 sec   3 min 33.97 sec
With Pred. Mig.       0.30 sec   0.30 sec       0 min  0.04 sec   0 min  0.10 sec

Table 3. Performance of plans for Example 2.

2.5 Preserving Opportunities for Pruning

In the previous section we presented the Predicate Migration Algorithm, an algorithm for optimally placing selection and secondary join predicates within a plan tree. If applied to every possible join plan for a query, the Predicate Migration Algorithm is guaranteed to generate a minimum-cost plan for the query.

A traditional query optimizer, however, does not enumerate all possible plans for a query; it does some pruning of the plan space while enumerating plans [Selinger et al. 1979]. Although this pruning does not affect the basic exponential nature of join plan enumeration, it can significantly lower the amounts of space and time required to optimize queries with many joins. The pruning in a System R-style optimizer is done by a dynamic programming algorithm, which builds optimal plans in a bottom-up fashion. When all plans for some subexpression of a query are generated, most of the plans are pruned out because they are of suboptimal cost. Unfortunately, this pruning does not integrate well with Predicate Migration.

To illustrate the problem, we consider an example. We have a query that joins three relations, A, B, C, and performs an expensive selection on C. A relational algebra expression for such a query, after the traditional predicate pushdown, is A ⋈ B ⋈ σ_p(C). A traditional query optimizer would, at some step, enumerate all plans for B ⋈ σ_p(C), and discard all but the optimal plan for this subexpression. Assume that because selection predicate p has extremely high rank, it will always be pulled above all joins in any plan for this query. Then the join method that the traditional optimizer saved for B ⋈ σ_p(C) is quite possibly sub-optimal, since in the final tree we would want the optimal plan for the subexpression B ⋈ C, not B ⋈ σ_p(C). In general, the problem is that subexpressions of the dynamic programming algorithm may not actually form part of the optimal plan, since predicates may later migrate. Thus the pruning done during dynamic programming may actually discard part of an optimal plan for the entire query.

Although this looks troublesome, in many cases it is still possible to allow pruning to happen: in particular, a subexpression may have its plans pruned if they will not be changed by Predicate Migration. For example, pruning can take place for subexpressions in which there are no expensive predicates. The following lemma helps to isolate more situations in which pruning may take place:

Lemma 5. For a selection or secondary join predicate R in a subexpression, if the rank of R is greater than the rank of any join in any plan for the subexpression, then in the optimal complete tree R will appear above the highest join in a subtree for the subexpression.

This lemma can be used to allow some predicate pullup to happen during join enumeration. If all the expensive predicates in a subexpression have higher rank than any join in any subtree for the subexpression, then the expensive predicates may be pulled to the top of the subtrees, and the subexpression without the expensive predicates may be pruned as usual. As an example, we return to our subexpression above containing the join of B and C, and the expensive selection σ_p(C). Since we assumed that σ_p has higher rank than any join method for B and C, we can prune all subtrees for B ⋈ C (and C ⋈ B) except the one of minimal cost -- we know that σ_p will reside above any of these subtrees in an optimal plan tree for the full query, and hence the best subplan for joining B and C is all that needs to be saved.[6]

[6] Of course one may also choose to save particular subtrees for other reasons, such as "interesting orders" [Selinger et al. 1979].
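The pruning condition suggested by Lemma 5 amounts to a simple comparison of ranks. The sketch below (Python, with invented function names) is one way to phrase the test: if every expensive predicate in the subexpression outranks every join that could appear in any of its subtrees, the predicates will migrate above the subtree in any optimal plan, and the usual System R pruning remains safe.

def rank(selectivity, differential_cost):
    return (selectivity - 1.0) / differential_cost

def can_prune_with_pullup(expensive_predicate_ranks, candidate_join_ranks):
    """True if every expensive predicate in the subexpression outranks every join
    that could appear in any subtree for it, so the predicates end up above the
    subtree in an optimal plan and pruning can proceed as usual."""
    if not expensive_predicate_ranks:
        return True     # no expensive predicates: prune exactly as usual
    return min(expensive_predicate_ranks) > max(candidate_join_ranks)

# e.g. sigma_p with rank -0.07 against join methods for B and C with ranks near -40:
print(can_prune_with_pullup([rank(0.3, 10.0)], [-41.5, -38.2]))  # True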

Techniques of this sort, based on the observation of Lemma 5, will be used in Section 3.2.4 to allow Predicate Migration to be efficiently integrated into a System R-style optimizer. As an additional optimization, note that the choice of an optimal join algorithm is sometimes independent of the sizes of the inputs, and hence of the placement of selections. For example, if both of the inputs to a join are sorted on the join attributes, one may conclude that merge join will be a minimal-cost algorithm, regardless of the sizes of the inputs. This is not implemented in Illustra, but such cardinality-independent heuristics can be used to allow pruning to happen even when all selections cannot be pulled out of a subtree during join enumeration.

3. PRACTICAL CONSIDERATIONS

In the previous section we demonstrated that Predicate Migration produces provably optimal plans, under the assumptions of a theoretical cost model. In this section we consider bringing the theory into practice, by addressing a few important questions:

(1) Are the assumptions underlying the theory correct?

(2) Are there simple heuristics that work as well as Predicate Migration in general? In constrained situations?

(3) How can Predicate Migration be efficiently integrated with a standard optimizer? Does it require significant modification to the existing optimization code?

The goal of this section is to guide query optimizer developers in choosing a practical optimization solution for queries with expensive predicates; in particular, one whose implementation and performance complexity is suited to their application domain. As a reference point, we describe our experience implementing the Predicate Migration algorithm and three simpler heuristics in Illustra. We compare the performance of the four approaches on different classes of queries, attempting to highlight the simplest solution that works for each class.

Table 4 provides a quick reference to the algorithms, their applicability and limitations. When appropriate, the `C Lines' field gives a rough estimate of the total number of lines of C code (with comments) needed in Illustra's System R-style optimizer to support each algorithm. Note that much of the code is shared across algorithms: for PullUp, PullRank and Predicate Migration, the code for each entry forms a superset of the code of the preceding entries.

Algorithm             Works for...                     C Lines   Comments
PushDown+             queries without expensive           900    OK for single-table queries,
                      predicates, and queries                    and thus some OODBMSs.
                      without joins
PullUp                queries with either free           1400    OK when selection costs
                      or very expensive selections               dominate. May be OK for
                                                                 MMDBMSs.
PullRank              queries with at most one join      2000    Also used as a preprocessor
                                                                 for Predicate Migration.
                                                                 Minor estimation problems.
Predicate Migration   all queries                        3000    Can cause enlargement of
                                                                 System R plan space.
Exhaustive            all queries                        1100    Exponential complexity.

Table 4. Summary of algorithms.

3.1 Background: Analyzing Optimizer Effectiveness

This section analyzes the effectiveness of a variety of strategies for predicate placement. Predicate Migration produces optimal plans in theory, but we want to compare it with a variety of alternative strategies that -- though not theoretically optimal -- are easier to implement, and seem to be sensible optimization heuristics. Developing a methodology to carry out such a comparison is particularly tricky for query optimizers. In this section we discuss the choices for analyzing optimizer effectiveness in practice, and describe the motivation for our chosen evaluation approach.

3.1.1 The Difficulty of Optimizer Evaluation. Analyzing the effectiveness of an optimizer is a problematic undertaking. Optimizers choose plans from an enormous search space, and within that search space plans can vary in performance by orders of magnitude. In addition, optimization decisions are based on selectivity and cost estimations that are often erroneous [Ioannidis and Christodoulakis 1991; Ioannidis and Poosala 1995]. As a result, even an exhaustive optimizer that compares all plans may not choose the best one, since its cost and selectivity estimates can be inaccurate.[7]

[7] In fact, the pioneering designs in query "optimization" were more accurately described by their authors as schemes for "query decomposition" [Wong and Youssefi 1976] and "access path selection" [Selinger et al. 1979].

As a result, it is a truism in the database community that a query optimizer is "optimal enough" if it avoids the worst query plans and generally picks good query plans [Krishnamurthy et al. 1986; Mackert and Lohman 1986a; Mackert and Lohman 1986b; Swami and Iyer 1992]. What remains open to debate are the definitions of "generally" and "good" in the previous statement. In any situation where an optimizer chooses a suboptimal plan, a database and query can be constructed to make that error look arbitrarily detrimental. Database queries are by definition ad hoc, which leaves us with a significant problem: how does one intelligently analyze the practical efficacy of an inherently rough technique over an infinite space of inputs?

Three approaches to this problem have traditionally been taken in the literature.

- Micro-Benchmarks: Basic query operators can be executed, and an optimizer's cost and selectivity modeling can be compared to actual performance. This is the technique used to study the R* distributed DBMS [Mackert and Lohman 1986a; Mackert and Lohman 1986b], and it is very effective for isolating inaccuracies in an optimizer's cost model.

- Randomized Macro-Benchmarks: Random data sets and queries can be generated, and various optimization techniques used to generate competing plans. This approach has been used in many studies (e.g., [Swami and Gupta 1988], [Ioannidis and Kang 1990], [Hong and Stonebraker 1993], etc.) to give a rough sense of average-case optimizer effectiveness over a large space of workloads.

- Standard Macro-Benchmarks: An influential person or committee can define a standard representative workload (data and queries), and different strategies can be compared on this workload. Examples of such benchmarks include the Wisconsin benchmark [Bitton et al. 1983], AS3AP [Turbyfill et al. 1989], and TPC-D [Raab 1995]. Such domain-specific benchmarks [Gray 1991] are often based on models of real-world workloads. Standard benchmarks typically expose whether or not a system implements solutions to important details exposed by the benchmark, e.g. use of indices and reordering of joins in the Wisconsin benchmark, or intelligent handling of complex subqueries in TPC-D. If an optimizer chooses a particular execution strategy then it does well; otherwise it does quite poorly. The evaluation of the optimizer in these benchmarks is binary, in the sense that typically the relative performance of the good and bad strategies is not interesting; what is important is that the optimizer choose the "correct access plan" [Turbyfill et al. 1989]. It is worth noting that these binary benchmarks have proven extremely influential, both in the commercial arena and in justifying new optimizer research (especially in the case of TPC-D).

An alternative to benchmarking is to run queries that expose the logic that makes one optimization strategy work where another fails. This binary outlook is quite similar to the way in which the Wisconsin and TPC-D benchmarks reflect optimizer effectiveness, but is different in the sense that there is no claim that the workload reflects any typical real-world scenario. Rather than being a "performance study" in any practical sense, this is a form of empirical algorithm analysis, providing insight into the algorithms rather than a quantitative comparison. The conclusion of such an analysis is not to identify which optimization strategy should be used in practice, but rather to highlight when and why each of the various schemes succeeds and fails. This avoids the issue of identifying a "typical" workload, and hopefully presents enough information to predict the behavior of each strategy for any such workload.

Extensible database management systems are only now being deployed in commercial settings, so there is little consensus on the definition of a real-world workload containing expensive methods. As a result we decided not to define a macro-benchmark for expensive methods. We could have devised micro-benchmarks to test cost and selectivity estimation techniques for expensive predicates. However, the focus of our work was on optimization strategies rather than estimation techniques, so we have left this exercise for future work, as discussed in Section 5.3.

We rejected the idea of a randomized macro-benchmark as well, for two reasons. First, we do not yet feel able to define a random distribution of queries that would reflect "typical" use of expensive methods. More importantly, a macro-benchmark would not have provided us insight into the reasons why each optimization strategy succeeded or failed. Since our goal was to further our understanding of the optimization strategies in practice, we chose to do an algorithm analysis rather than a benchmark.

3.1.2 Experiments in This Section. This section presents an empirical algorithm analysis of Predicate Migration and alternate approaches, to illustrate the scenarios in which each approach works and fails. In order to do this, we picked the simplest queries we could develop that would illustrate the tradeoffs between different choices in predicate placement. As we will see, approaches other than Predicate Migration can fail even on simple queries. This is indicative of the difficulty of predicate placement, and supports our decision to forgo a large-scale randomized macro-benchmark.

In the course of this section, we will be using the performance of SQL queries run in Illustra to demonstrate the strengths and limitations of the algorithms. The database schema for these queries is based on the randomized benchmark of Hong and Stonebraker [Hong and Stonebraker 1993], with the cardinalities scaled up by a factor of 10. All tuples contain 100 bytes of user data. The tables are named with numbers in ascending order of cardinality; this will prove important in the analysis below. Attributes whose names start with the letter 'u' are unindexed, while all other attributes have B+-tree indices defined over them. Numbers in attribute names indicate the approximate number of times each value is repeated in the attribute. For example, each value in an attribute named ua20 is duplicated about 20 times. Some physical characteristics of the relations appear in Table 5. The entire database, with indices and catalogs, was about 155 megabytes in size. While this is not enormous by modern standards, it is non-negligible, and was a good deal larger than our virtual memory and buffer pool.

In the example queries below, we refer to user-defined methods. Numbers in the method names describe the cost of the methods in terms of random (i.e., non-sequential) database I/Os. For example, the method costly100 takes as much time per invocation as the I/O time used by a query that touches 100 unclustered tuples in the database. In our experiments, however, the methods did not perform any computation; rather, we counted how many times each method was invoked, multiplied that number by the method's cost, and added the total to the measurement of the running time for the query. This allowed us to measure the performance of queries with very expensive methods in a reasonable amount of time; otherwise, comparisons of good plans to suboptimal plans would have been prohibitively time-consuming.

Query 1:
    SELECT T3.a1
    FROM T3, T2
    WHERE T2.a1 = T3.ua1
    AND costly100(T3.ua1) < 0;

Fig. 4. Query execution times for Query 1 (bar chart comparing PushDown, PullRank, Predicate Migration, and PullUp; execution time of plan, scaled).

3.2 Predicate Placement Schemes, and the Queries They Optimize

In this section we analyze four algorithms for handling expensive predicate placement, each of which can be easily integrated into a System R-style query optimizer. We begin with the assumption that all predicates are initially placed as low as possible in a plan tree, since this is the typical default in existing systems.

3.2.1 PushDown with Rank-Ordering. In our version of the traditional selection pushdown algorithm, we add code to order selections. This enhanced heuristic guarantees optimal plans for queries on single tables.

The cost of invoking each selection predicate on a tuple is estimated through system metadata. The selectivity of each selection predicate is similarly estimated, and selections over a given relation are ordered in ascending order of rank. As we saw in Section 2, such ordering is optimal for selections, and intuitively it makes sense: the lower the selectivity of the predicate, the earlier we wish to apply it, since it will filter out many tuples. Similarly, the cheaper the predicate, the earlier we wish to apply it, since its benefits may be reaped at a low cost (see Footnote 8).
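To make the ordering rule concrete, the following minimal Python sketch (the Selection record and the example numbers are hypothetical illustrations, not Illustra's metadata or code) sorts the selections on a single relation by ascending rank, using rank = (selectivity - 1) / differential cost as defined in Section 2:

    from dataclasses import dataclass

    @dataclass
    class Selection:
        name: str
        selectivity: float   # estimated fraction of input tuples that satisfy the predicate
        cost: float          # estimated differential cost of one invocation

    def rank(sel: Selection) -> float:
        # rank = (selectivity - 1) / cost; a lower (more negative) rank means "apply earlier"
        return (sel.selectivity - 1.0) / sel.cost

    def order_selections(selections):
        # PushDown with rank-ordering: apply the selections on one relation
        # in ascending order of rank.
        return sorted(selections, key=rank)

    sels = [Selection("cheap_unselective", 0.9, 0.001),
            Selection("costly100_pred", 0.5, 100.0),
            Selection("cheap_selective", 0.1, 0.001)]
    print([s.name for s in order_selections(sels)])
    # prints: ['cheap_selective', 'cheap_unselective', 'costly100_pred']

A cheap, highly selective predicate sorts first; a very expensive predicate sorts last unless its selectivity is low enough to pay for its cost.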

Thus a crucial first step in optimizing queries with expensive selections is to order selections by rank. This represents the minimum gesture that a system can make towards optimizing such queries, providing significant benefits for any query with multiple selections. It can be particularly useful for current Object-Oriented DBMSs (OODBMSs), in which the currently typical ad hoc query is a collection scan, not a join (see Footnote 9). For systems supporting joins, however, PushDown may often produce very poor plans, as shown in Figure 4.

Footnote 8: This intuition applies when 0 ≤ selectivity ≤ 1. Though the intuition for selectivity > 1 is different, rank-ordering is optimal regardless of selectivity.

Footnote 9: This may change in the future, since most of the OODBMS vendors plan to support the OQL query language, which includes facilities for joins [Cattell 1994].


Query 2:
    SELECT T3.a1
    FROM T3, T10
    WHERE T10.a1 = T3.ua1
    AND costly100(T3.ua1) < 0;

Fig. 5. Query execution times for Query 2 (bar chart comparing PushDown, PullRank, Predicate Migration, and PullUp; execution time of plan, scaled).

All the remaining algorithms order their selections by rank, and we will not mention selection-ordering explicitly from this point on. In the remaining sections we focus on how the other algorithms order selections with respect to joins.

3.2.2 PullUp. PullUp is the converse of PushDown. In PullUp, all selections with non-trivial cost are pulled to the very top of each subplan that is enumerated during the System R algorithm; this is done before the System R algorithm chooses which subplans to keep and which to prune. The result is equivalent to removing the expensive predicates from the query, generating an optimal plan for the modified query, and then pasting the expensive predicates onto the top of that plan.
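As a rough illustration of this transformation, the following minimal Python sketch (with a deliberately simplified, hypothetical plan representation; not the Illustra implementation) pastes the expensive selections, sorted by rank, on top of a plan chosen for the query without them:

    def pull_up(plan, expensive_selections):
        # PullUp sketch: the plan is optimized as if the expensive selections
        # were absent, and the selections are then pasted onto the very top of
        # it, ordered by ascending rank = (selectivity - 1) / cost.
        by_rank = sorted(expensive_selections,
                         key=lambda s: (s["selectivity"] - 1.0) / s["cost"])
        for sel in by_rank:
            plan = ("select", sel["name"], plan)   # wrap the plan in a selection node
        return plan

    # Hypothetical example: a join of T3 and T2 with costly100 applied on top.
    print(pull_up(("join", "T3", "T2"),
                  [{"name": "costly100", "selectivity": 0.5, "cost": 100.0}]))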

PullUp represents the extreme in eagerness to pull up selections, and also the minimum complexity required, both in terms of implementation and running time, to intelligently place expensive predicates among joins. Most systems already estimate the selectivity of selections, so in order to add PullUp to an existing optimizer, one needs to add only three simple services: a facility to collect cost information for predicates, a routine to sort selections by rank, and code to pull selections up in a plan tree.

Though this algorithm is not particularly subtle, it can be a simple and effective solution for those systems in which predicates are either negligibly cheap (e.g., less time-consuming than an I/O) or extremely expensive (e.g., more costly than joining a number of relations in the database). It is difficult to quantify exactly where to draw the lines for these extremes in general, however, since the optimal placement of the predicates depends not only on the costs of the selections, but also on their selectivities, and on the costs and selectivities of the joins. Selectivities and join costs depend on the sizes and contents of relations in the database, so this is a data-specific issue. PullUp may be an acceptable technique in Main Memory Database Management Systems (MMDBMSs), for example, or in disk-based systems that store small amounts of data on which very complex operations are performed. Even in such systems, however, PullUp can produce very poor plans if join selectivities are greater than 1. This problem can be avoided by using method caching, as described in [Hellerstein and Naughton 1996].

Query 2 (Figure 5) is the same as Query 1, except that T10 is used instead of T2. This minor change causes PullUp to choose a suboptimal plan. Recall that table names reflect the relative cardinality of the tables, so in this case T10.ua1 has more values than T3.ua1, and hence the join of T10 and T3 has selectivity 1 over T3. As a result, pulling up the costly selection does not decrease the number of calls to costly100, and increases the cost of the join of T10 and T3. All the algorithms pick the same join method for Query 2, but PullUp incorrectly places the costly predicate above the join.

Query 3:
    SELECT T3.a1
    FROM T3, T10
    WHERE T10.a1 = T3.ua1
    AND costly1(T3.ua100) < 0;

Fig. 6. Query execution times for Query 3 (bar chart comparing PushDown, PullRank, Predicate Migration, and PullUp; execution time of plan, scaled).

Note, however, the unusual graph in Figure 5: all of the bars are about the same height! The point of this experiment is to highlight the fact that the error made by PullUp is insignificant. This is because the costly100 method requires 100 random I/Os per tuple, while a join typically costs at most a few I/Os per tuple, and usually much less. Thus in this case the extra work performed by the join in Query 2 is insignificant compared to the time taken for computing the costly selection. The lesson here is that in general, over-eager pullup is less dangerous than under-eager pullup, since a join is usually less expensive than an expensive predicate. As a heuristic, it is safer to overdo a cheap operation than an expensive one.

On the other hand, one would like to make as few over-eager pullup decisions as possible. As we see in Figure 6, over-eager pullup can indeed cause significant performance problems for some queries, especially if the predicates are not very expensive. Thus while it may be a "safer bet" in general to be over-eager in pullup rather than under-eager, neither heuristic is generally effective. In the remaining heuristics, we attempt to find a happy medium between PushDown and PullUp.

3.2.3 PullRank. Like PullUp, the PullRank heuristic works as a subroutine of the System R algorithm: every time a join is constructed for two (possibly derived) input relations, PullRank examines the selections over the two inputs. Unlike PullUp, PullRank does not always pull selections above joins; it makes decisions about selection pullup based on rank. The cost and selectivity of the join are calculated for both the inner and the outer stream, generating an inner rank and an outer rank for the join. Any selections in the inner stream that are of higher rank than the join's inner rank are pulled above the join. Similarly, any selections in the outer stream that are of higher rank than the join's outer rank are pulled up. As indicated in Lemma 5, PullRank never pulls a selection above a join unless the selection is pulled above the join in the optimal plan tree.
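The pullup decision for a single join can be sketched as follows (a minimal Python illustration with hypothetical inputs; the real decision is made inside the System R enumeration using the optimizer's estimates):

    def selection_rank(selectivity, cost):
        return (selectivity - 1.0) / cost

    def pullrank_one_stream(selections, join_rank_for_stream):
        # Partition the selections on one input stream of a join: those whose
        # rank exceeds the join's rank for that stream are pulled above the
        # join; the rest stay below it.
        below, above = [], []
        for name, selectivity, cost in selections:
            if selection_rank(selectivity, cost) > join_rank_for_stream:
                above.append(name)
            else:
                below.append(name)
        return below, above

    # Hypothetical example: with an outer rank of 0 for the join (as for J1 in
    # Figure 7), a selection of rank -.000865 is left below the join.
    print(pullrank_one_stream([("costly100_sel", 0.9135, 100.0)], 0.0))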

This algorithm is not substantially more difficult to implement than the PullUp algorithm; the only addition is the computation of costs and selectivities for joins (Section 3.3.1). Because of its simplicity, and the fact that it does rank-based ordering, we had hoped that PullRank would be a very useful heuristic. Unfortunately, it proved to be ineffective in many cases. As an example, consider the plan tree for Query 4 in Figure 7. In this plan, the outer rank of J1 is greater than the rank of the costly selection, so PullRank would not pull the selection above J1. However, the outer rank of J2 is low, and it may be appropriate to pull the selection above the pair J1 J2. PullRank does not consider such multi-join pullups. In general, if nodes are constrained to be in decreasing order of rank while ascending a stream in a plan tree, then it may be necessary to consider pulling up above groups of nodes, rather than one join node at a time. PullRank fails in such scenarios.

Query 4:
    SELECT T2.a100
    FROM T2, T1, T3
    WHERE T3.ua1 = T1.a1
    AND T2.ua100 = T3.a1
    AND costly100(T2.a100) < 10;

Fig. 7. A three-way join plan for Query 4: the costly selection (rank = -.000865) sits above T2 on the outer stream of join J1 (outer rank = 0) with T3, and the result of J1 is in turn the outer stream of join J2 (outer rank = -19.477) with T1.

Fig. 8. Another three-way join plan for Query 4: T3 and T1 are joined first, and the result is joined with T2 (inner rank = -19.477), allowing the costly selection (rank = -.000865) to be pulled to the top.

Fig. 9. Query execution times for Query 4 (bar chart comparing PushDown, PullRank, Predicate Migration, and PullUp; execution time of plan, scaled).

As a corollary to Lemma 5, it is easy to show that PullRank is an optimal algorithm for queries with only one join and selections on a single stream. PullRank can be made optimal for all single-join queries as well: given a join of two relations R and S, while optimizing the stream containing R, PullRank can treat any selections from S that are above the join as primary join predicates; this allows PullRank to view the join and the selections from S as a group. A similar technique can be used while optimizing the stream containing S. Unfortunately, PullRank does indeed fail in many multi-join scenarios, as illustrated in Figure 9, since there is no way for it to form a group of joins. Since PullRank cannot pull up the selection in the plan of Figure 7, it chooses a different join order in which the expensive selection can be pulled to the top (Figure 8). This join order chosen by PullRank is not a good one, however, and results in the poor performance shown in Figure 9. The best plan used the join order of Figure 7, but with the costly selection pulled to the top.

Query 5:
    SELECT T2.a100
    FROM T2, T1, T3, T4
    WHERE T3.ua1 = T1.a1
    AND T2.ua20 = T3.a1
    AND costly100(T2.ua100, T4.a1) = 0
    AND costly100(T2.ua20) < 10;

Fig. 10. Query execution times for Query 5 (bar chart comparing PushDown, PullRank, Predicate Migration, and PullUp; execution time of plan, scaled; the PullUp bar extends to 98.13).

3.2.4 Predicate Migration. The details of the Predicate Migration algorithm are presented in Section 2. In essence, Predicate Migration augments PullRank by also considering the possibility that two primary join nodes in a plan tree may be out of rank order; e.g., join node J2 may appear just above node J1 in a plan tree, with the rank of J2 being less than the rank of J1 (Figure 7). In such a scenario, it can be shown that J1 and J2 should be treated as a group for the purposes of pulling up selections: they are composed together as one operator, and the group rank is calculated as

    rank(J1 J2) = (selectivity(J1 J2) - 1) / cost(J1 J2)
                = (selectivity(J1) · selectivity(J2) - 1) / (cost(J1) + selectivity(J1) · cost(J2)).

Selections of higher rank than this group rank are pulled up above the pair. The Predicate Migration algorithm forms all such groups before attempting pullup.
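A small Python sketch of this group-rank computation (the numbers in the example are hypothetical; the selectivities and per-tuple costs would come from the optimizer's estimates):

    def rank(selectivity, cost):
        return (selectivity - 1.0) / cost

    def group(join1, join2):
        # Compose two adjacent join nodes (join2 directly above join1) into one
        # operator: selectivities multiply, and join2's per-tuple cost is scaled
        # by join1's selectivity, exactly as in the group-rank formula above.
        s1, c1 = join1
        s2, c2 = join2
        return (s1 * s2, c1 + s1 * c2)

    # Hypothetical example: J1 = (selectivity 2.0, cost 0.5) has rank 2.0 and
    # J2 = (selectivity 0.03, cost 0.05) has rank -19.4; grouped, the pair has a
    # single rank against which the selections below J1 can be compared.
    j1, j2 = (2.0, 0.5), (0.03, 0.05)
    s, c = group(j1, j2)
    print(rank(*j1), rank(*j2), rank(s, c))   # 2.0, -19.4, about -1.57

Here a selection of rank -.000865 (the rank of the costly selection in Figure 7) would not be pulled above J1 alone, but would be pulled above the grouped pair J1 J2.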

Predicate Migration is integrated with the System R join-enumeration algorithm as follows. We start by running System R with the PullRank heuristic, but one change is made to PullRank: when PullRank finds an expensive predicate and decides not to pull it above a join in a plan for a subexpression, we mark that subexpression as unpruneable. Subsequently, when constructing plans for larger subexpressions, we mark a subexpression unpruneable if it contains an unpruneable subplan within it. The System R algorithm is then modified to save not only those subplans that are min-cost or "interestingly ordered"; it also saves all plans for unpruneable subexpressions. In this way, we ensure that if multiple primary joins should become grouped in some plan, we will have maximal opportunity to pull expensive predicates over the group. At the end of the System R algorithm, a set of plans is produced, including the cheapest plan so far, the plans with interesting orders, and the unpruneable plans. Each of these plans is passed through the Predicate Migration algorithm, which optimally places the predicates in each plan. After reevaluating the costs of the modified plans, the new cheapest plan is chosen to be executed.
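The modification to System R pruning can be sketched as follows (a minimal, hypothetical subplan representation; not the Illustra code):

    from dataclasses import dataclass, field

    @dataclass
    class SubPlan:
        cost: float
        interesting_order: bool = False
        unpruneable: bool = False
        inputs: tuple = field(default_factory=tuple)

    def mark_after_pullrank(plan: SubPlan, expensive_sel_left_below: bool) -> SubPlan:
        # If PullRank decided not to pull some expensive selection above the join
        # at the root of this subplan, or if either input is already marked, the
        # subexpression must survive pruning so that Predicate Migration can later
        # consider pulling the selection above a group of joins.
        plan.unpruneable = (expensive_sel_left_below
                            or any(p.unpruneable for p in plan.inputs))
        return plan

    def keep(plan: SubPlan, cheapest_cost: float) -> bool:
        # Modified pruning rule: save min-cost plans, interestingly ordered plans,
        # and all plans for unpruneable subexpressions.
        return (plan.cost <= cheapest_cost
                or plan.interesting_order
                or plan.unpruneable)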

Note that Predicate Migration requires minimal modification to an existing System R optimizer: PullRank must be invoked for each subplan, pruning must be modified to save unpruneable plans, and before choosing the final plan Predicate Migration must be run on each remaining plan. This minimal intrusion into the existing optimizer is extremely important in practice. Many commercial implementors are wary of modifying their optimizers, because their experience is that queries that used to work well before the modification may run differently, poorly, or not at all afterwards. This can be very upsetting to customers [Lohman 1995; Gray 1995]. Predicate Migration has no effect on the way that the optimizer handles queries without expensive predicates. In this sense it is entirely safe to implement Predicate Migration in systems that newly support expensive predicates; no old plans will be changed as a result of the implementation (see Footnote 10).

Footnote 10: Note that the cost estimates in an optimizer need not even be adjusted to conform to the requirements of Section 3.3.1. The original cost estimates can be used for generating join orders. However, when evaluating join costs during PullRank and Predicate Migration, a "linear" cost model is used. This mixing of cost models is discouraged since it affects the global optimality of plans. But it may be an appropriate approach from a practical standpoint, since it preserves the behavior of the optimizer on queries without expensive predicates.

A drawback of Predicate Migration is the need to consider unpruneable plans. In the worst case, every subexpression has a plan in which an expensive predicate does not get pulled up, and hence every subexpression is marked unpruneable. In this scenario the System R algorithm exhaustively enumerates the space of join orders, never pruning any subplan. This is preferable to earlier approaches such as that of LDL [Chimenti et al. 1989] (cf. Section 4.1.2), and has not caused untoward difficulty in practice. The payoff of this investment in optimization time is apparent in Figure 10 (see Footnote 11). Note that Predicate Migration is the only algorithm to correctly optimize each of Queries 1-5.

Footnote 11: In this particular query the plan chosen by PullUp took almost 100 times as long as the optimal plan, although for purposes of illustration the bar for the plan is truncated in the graph. This poor performance happened because PullUp pulled the costly selection on T2 above the costly join predicate. The result of this was that the costly join predicate had to be evaluated on all tuples in the Cartesian product of T4 and the subtree containing T2, T1, and T3. This extremely bad plan required significant effort in caching at execution time (as discussed in [Hellerstein and Naughton 1996]) to avoid calling the costly selection multiple times on the same input.

3.3 Theory to Practice: Implementation Issues

The four algorithms described in the previous section were implemented in the Illustra DBMS. In this section we discuss the implementation experience, and some issues that arose in our experiments.

The Illustra "Object-Relational" DBMS is based on the publicly available POSTGRES system [Stonebraker and Kemnitz 1991]. Illustra extends POSTGRES in many ways, most significantly (for our purposes) by supporting an extended version of SQL and by bringing the POSTGRES prototype code to an industrial grade of performance and reliability.

The full Predicate Migration algorithm was originally implemented by the author in POSTGRES, an effort that took about two months of work: one month to add the PullRank heuristic, and another month to implement the Predicate Migration algorithm. Refining and upgrading that code for Illustra actually proved more time-consuming than writing it initially for POSTGRES. Since Illustra SQL is a significantly more complex language than POSTQUEL, some modest changes had to be made to the code to handle subqueries and other SQL-specific features. More significant, however, was the effort required to debug, test, and tune the code so that it was robust enough for use in a commercial product.

Of the three months spent on the Illustra version of Predicate Migration, about one month was spent upgrading the optimization code for Illustra. This involved extending it to handle SQL subqueries, making the code faster and less memory-consuming, and removing bugs that caused various sorts of system failures. The remaining time was spent fixing subtle optimization bugs.

Debugging a query optimizer is a difficult task, since an optimization bug does not necessarily produce a crash or a wrong answer; it often simply produces a suboptimal plan. It can be quite difficult to ensure that one has produced a minimum-cost plan. In the course of running the comparisons for this paper, a variety of subtle optimizer bugs were found, mostly in cost and selectivity estimation. The most difficult to uncover was that the original "global" cost model for Predicate Migration, presented in [Hellerstein and Stonebraker 1993], was inaccurate in practice. Typically, bugs were exposed by running the same query under the various different optimization heuristics, and comparing the estimated costs and actual running times of the resulting plans. When Predicate Migration, a supposedly superior optimization algorithm, produced a higher-cost or more time-consuming query plan than a simple heuristic, it usually meant that there was a bug in the optimizer.

The lesson to be learned here is that comparative benchmarking is absolutely crucial to thoroughly debugging a query optimizer and validating its cost model. It has been noted fairly recently that a variety of commercial products still produce very poor plans even on simple queries [Naughton 1993]. Thus benchmarks, particularly complex query benchmarks such as TPC-D [Raab 1995], are critical debugging tools for DBMS developers. In our case, we were able to easily compare our Predicate Migration implementation against various heuristics, to ensure that Predicate Migration always did at least as well as the heuristics. After many comparisons and bug fixes we found Predicate Migration to be stable, producing minimal-cost plans that were generally as fast as or faster in practice than those produced by the simpler heuristics.

3.3.1 A Differential Cost Model for Standard Operators. The Predicate Migration algorithm presented in Section 2.2.4 only works when the differential join costs are constants. In this section we examine how that cost model fits the standard join algorithms that are commonly used.

Given two relations R and S, and a join predicate J of selectivity s over them, we represent the selectivity of J over R as s·|S|, where |S| is the number of tuples that are passed into the join from S. Similarly, we represent the selectivity of J over S as s·|R|. In Section 3.3.2 we review the accuracy of these estimates in practice.

The cost of a selection predicate is its differential cost, as stored in the system metadata. A constant differential cost of a join predicate per input also needs to be computed, and this is only possible if one assumes that the cost model is well-behaved: in particular, for relations R and S the join costs must be of the form k·|R| + l·|S| + m; we do not allow any term of the form j·|R|·|S|. We proceed to demonstrate that this strict cost model is sufficiently robust to cover the usual join algorithms.

Recall that we treat traditional simple predicates as being of zero cost; similarly, here we ignore the CPU costs associated with joins. In terms of I/O, the costs of merge and hash joins given by Shapiro et al. [Shapiro 1986] fit our criterion (see Footnote 12). For nested-loop join with an indexed inner relation, the cost per tuple of the outer relation is the cost of probing the index (typically 3 I/Os or less), while the cost per tuple of the inner relation is essentially zero: since we never scan tuples of the inner relation that do not qualify for the join, they are filtered with zero cost. So nested-loop join with an indexed inner relation fits our criterion as well.

Footnote 12: Actually, we ignore the √S savings available in merge join due to buffering. We thus slightly over-estimate the costs of merge join for the purposes of predicate placement.

The trickiest issue is that of nested-loop join without an index. In this case, the cost is j·|R|·[S] + k·|R| + l·|S| + m, where [S] is the number of blocks scanned from the inner relation S. Note that the number of blocks scanned from the inner relation is a constant, irrespective of expensive selections on the inner relation. That is, in a nested-loop join, for each tuple of the outer relation one must scan every disk block of the inner relation, regardless of whether expensive selections are pulled up from the inner relation or not. So nested-loop join does indeed fit our cost model: [S] is a constant regardless of where expensive predicates are placed, and the equation above can be written as (j·[S] + k)·|R| + l·|S| + m (see Footnote 13). Therefore we can accurately model the differential costs of all the typical join methods per tuple of an input.

Footnote 13: This assumes that plan trees are left-deep, which is true for many systems including Illustra. Even for bushy trees this is not a significant limitation: one would be unlikely to have a nested-loop join with a bushy inner, since one might as well sort or hash the inner relation while materializing it.
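The per-input-tuple differential costs used during predicate placement can then be summarized as in the following Python sketch (the I/O constants are illustrative placeholders, not Illustra's actual estimates):

    def differential_join_costs(join_method, inner_blocks, index_probe_ios=3.0):
        # Returns (cost per outer tuple, cost per inner tuple) for join methods
        # whose total cost has the linear form k|R| + l|S| + m.
        if join_method == "nestloop_indexed":
            # Probing the inner index costs a few I/Os per outer tuple; inner
            # tuples that do not qualify are never scanned, so they cost ~0.
            return (index_probe_ios, 0.0)
        if join_method == "nestloop_unindexed":
            # Each outer tuple scans every block of the inner relation, and the
            # number of inner blocks [S] is fixed no matter where expensive
            # selections are placed: (j[S] + k) per outer tuple, l per inner tuple.
            j, k, l = 1.0, 0.0, 0.02   # illustrative per-block and per-tuple constants
            return (j * inner_blocks + k, l)
        if join_method in ("merge", "hash"):
            # Shapiro's I/O formulas for merge and hash join are already linear;
            # the per-tuple constants would be derived from them.
            return (0.1, 0.1)          # illustrative placeholders
        raise ValueError(join_method)

    # Hypothetical example: unindexed nested loop over an inner relation of 705 blocks.
    print(differential_join_costs("nestloop_unindexed", inner_blocks=705))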

These "linear" cost models have been evaluated experimentally for nested-loop and merge joins, and were found to be relatively accurate estimates of the performance of a variety of commercial systems [Du et al. 1992].

3.3.2 Influence of Performance Results on Estimates. The results of our performance experiments influenced the way that we estimate selectivities in Illustra. In the previous section, we estimated the selectivity of a join over table S as s·|R|. This was a rough estimate, however, since |R| is not well defined: it depends on the selections that are placed on R. Thus |R| could range from the cardinality of R with no selections, to the minimal output of R after all eligible selections are applied. In Illustra, we calculate |R| on the fly as needed, based on the number of selections over R at the time that we need to compute the selectivity of the join. Since some predicates over R may later be pulled up, this potentially under-estimates the selectivity of the join for S. Such an under-estimate results in an under-estimate of the rank of the join for S, possibly resulting in over-eager pullup of selections on S. This rough selectivity estimation was chosen as a result of our performance observations: it was decided that estimates resulting in somewhat over-eager pullup are preferable to estimates resulting in under-eager pullup. In part, this heuristic is based on the existence of method caching in Illustra, as described in [Hellerstein and Naughton 1996]: as long as methods are cached, pulling a selection above a join incorrectly cannot increase the number of times the selection method is actually computed; it can only increase the number of duplicate input values to the selection.

The potential for over-eager pullup in Predicate Migration is similar to, though not as flagrant as, the over-eager pullup in the earlier LDL approach [Chimenti et al. 1989], described in Section 4.1.2. Observe that if the Predicate Migration approach pulls up from inner inputs first, then the ranks of joins for inner inputs may be underestimated, but the ranks of joins for outer inputs are accurate, since pullup from the inner input has already been completed. This will produce over-eager pullup from inner tables, and accurate pullup from outer tables, which also occurs in LDL. This is the approach taken in Illustra, but unlike LDL, Illustra only exhibits this over-eagerness in a very constrained situation:

(1) when there are expensive selections on both inputs to a join, and

(2) some of the expensive selections on the outer input should be pulled up, and

(3) some of the expensive selections on the inner input should not be pulled up, and

(4) treating the inner input before the outer results in selectivity estimations that cause over-eager pullup of some of those selections from the inner input.

By contrast, the LDL approach pulls up expensive predicates from inner inputs in all circumstances.

4. RELATED WORK

4.1 Optimization and Expensive Methods

4.1.1 Predicate Migration. Stonebraker raised the issue of expensive predicate optimization in the context of the POSTGRES multi-level store [Stonebraker 1991]. The questions posed by Stonebraker are directly addressed in this paper.

Predicate Migration was first presented in 1992 [Hellerstein and Stonebraker 1993]. While the Predicate Migration Algorithm presented there was identical to the one described in Section 2.2.4 of this paper, the early paper included a "global" cost model which was inaccurate in modeling query cost; in particular, it modeled the selectivity of a join as being identical with respect to both inputs. A later paper [Hellerstein 1994] presented the corrected cost model described here in Section 3.3.1. This paper extends our previous work with a thorough and correct discussion of cost model issues, proofs of optimality and pseudocode for Predicate Migration, and new experimental measurements using a later version of Illustra enhanced with method caching techniques.

4.1.2 The LDL Approach. The first approach for optimizing queries with expensive predicates was pioneered in the LDL logic database system [Chimenti et al. 1989], and was proposed for extensible relational systems by Yajima et al. [Yajima et al. 1991]. We refer to this as the LDL approach. To illustrate this approach, consider the following example query:

    SELECT *
    FROM R, S
    WHERE R.c1 = S.c1
    AND p(R.c2)
    AND q(S.c2);

In the query, p and q are expensive user-defined Boolean methods.

Fig. 11. A query plan with expensive selections p and q: p is applied directly above the scan of R and q directly above the scan of S, below the join of R and S.

Fig. 12. The same query plan, with the selections modeled as joins: a bushy tree in which R is joined with the virtual relation p, S is joined with the virtual relation q, and the two results are joined.

Assume that the optimal plan for the query is as pictured in Figure 11, where both p and q are placed directly above the scans of R and S respectively. In the LDL approach, p and q are treated not as selections, but as joins with virtual relations of infinite cardinality. In essence, the query is transformed to:

    SELECT *
    FROM R, S, p, q
    WHERE R.c1 = S.c1
    AND p.c1 = R.c2
    AND q.c1 = S.c2;

with the joins over p and q having cost equivalent to the cost of applying the methods. At this point, the LDL approach applies a traditional join-ordering optimizer to plan the rewritten query. This does not integrate well with a System R-style optimization algorithm, however, since LDL increases the number of joins to order, and System R's complexity is exponential in the number of joins. Thus [Krishnamurthy and Zaniolo 1988] proposes using the polynomial-time IK-KBZ approach [Ibaraki and Kameda 1984; Krishnamurthy et al. 1986] for optimizing the join order. Unfortunately, both the System R and IK-KBZ optimization algorithms consider only left-deep plan trees, and no left-deep plan tree can model the optimal plan tree of Figure 11. That is because the plan tree of Figure 11, with selections p and q treated as joins, looks like the bushy plan tree of Figure 12. Effectively, the LDL approach is forced to always pull expensive selections up from the inner relation of a join, in order to get a left-deep tree. Thus the LDL approach can often err by making over-eager pullup decisions.

This deficiency of the LDL approach can be overcome in a number of ways. A System R optimizer can be modified to explore the space of bushy trees, but this increases the complexity of the LDL approach yet further. No known modification of the IK-KBZ optimizer can handle bushy trees. Yajima et al. [Yajima et al. 1991] successfully integrate the LDL approach with an IK-KBZ optimizer, but they use an exhaustive mechanism that requires time exponential in the number of expensive selections.

4.2 Other Related Work

Ibaraki and Kameda [Ibaraki and Kameda 1984], Krishnamurthy, Boral and Zaniolo [Krishnamurthy et al. 1986], and Swami and Iyer [Swami and Iyer 1992] have developed and refined a query optimization scheme that is built on the notion of rank. However, their scheme uses rank to reorder joins rather than selections. Their techniques do not consider the possibility of expensive selection predicates, and only reorder nodes of a single path in a left-deep query plan tree. Furthermore, their schemes are a proposal for a completely new method for query optimization, not an extension that can be applied to the plans of any query optimizer.

The System R project faced the issue of expensive predicate placement early on, since their SQL language had the notion of subqueries, which (especially when correlated) are a form of expensive selection. At the time, however, the emphasis of their optimization work was on finding optimal join orders and methods, and there was no published work on placing expensive predicates in a query plan. System R and R* both had simple heuristics for placing an expensive subquery in a plan. In both systems, subquery evaluation was postponed until simple predicates were evaluated. In R*, the issue of balancing selectivity and cost was discussed, but in the final implementation subqueries were ordered among themselves by a cost-only metric: the more difficult the subquery was to compute, the later it would be applied. Neither system considered pulling up subquery predicates from their lowest eligible position [Lohman and Haas 1993].

An orthogonal issue related to predicate placement is the problem of rewriting predicates into more efficient forms [Chaudhuri and Shim 1993; Chen et al. 1992]. In such semantic optimization work, the focus is on rewriting expensive predicates in terms of other, cheaper predicates. Another semantic rewriting technique called "Predicate Move-Around" [Levy et al. 1994] suggests rewriting SQL queries by copying predicates and then moving the copies across query blocks. The authors conjecture that this technique may be beneficial for queries with expensive predicates. A related idea occurs in Object-Oriented databases, which can have "path expression" methods that follow references from tuples to other tuples. A typical technique for such methods is to rewrite them as joins, and then submit the query for traditional join optimization (see Footnote 14). All of these ideas are similar to the query rewrite facility of Starburst [Pirahesh et al. 1992], in that they are heuristics for rewriting a declarative query, rather than cost-based optimizations for converting a declarative query into an execution plan. It is important to note that these issues are indeed orthogonal to our problem of predicate placement: once queries have been rewritten into cheaper forms, they still need to have their predicates optimally placed into a query plan.

Kemper, Steinbrunn, et al. [Kemper et al. 1994; Steinbrunn et al. 1995] address the problem of planning disjunctive queries with expensive predicates; their work is not easily integrated with a traditional (conjunct-based) optimizer. Like most System R-based optimizers, Illustra focuses on Boolean factors (conjuncts). Within Boolean factors, the operands of OR are ordered by the metric cost/selectivity. This can easily be shown to be the optimal ordering for a disjunction of expensive selections; the analysis is similar to the proof of Lemma 1 of Section 2.2.1.

Footnote 14: A mixture of path expressions and (expensive) user-defined methods is also possible, e.g., SELECT * FROM emp WHERE emp.children.photo.haircolor() = 'red'. Such expressions can be rewritten in functional notation (e.g., SELECT * FROM emp WHERE haircolor(emp.children.photo) = 'red'), and are typically converted into joins (i.e., something like SELECT emp.* FROM emp, kids WHERE emp.kid = kids.id AND haircolor(kids.photo) = 'red', possibly with variations to handle duplicate semantics). Note that an expensive method whose argument is a path expression becomes an expensive secondary join clause.

Fig. 13. Eagerness of pullup in the algorithms, from least eager to most eager: PushDown, PullRank, Predicate Migration, LDL, PullUp.
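A minimal sketch of that ordering rule (hypothetical numbers; the cost and selectivity estimates would come from the system metadata):

    def order_disjuncts(disjuncts):
        # Within a Boolean factor, apply the operands of OR in ascending order of
        # cost/selectivity: a cheap disjunct that is likely to be true short-circuits
        # the expensive disjuncts most of the time.
        return sorted(disjuncts, key=lambda d: d["cost"] / d["selectivity"])

    disjuncts = [{"name": "costly100(a)", "cost": 100.0, "selectivity": 0.5},
                 {"name": "cheap(b)",     "cost": 0.001, "selectivity": 0.2}]
    print([d["name"] for d in order_disjuncts(disjuncts)])
    # prints: ['cheap(b)', 'costly100(a)']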

Expensive methods can appear in various clauses of a query besides the predicates in the WHERE clause; in such situations, method caching is an important query execution technique for minimizing the costs of such queries. A detailed study of method caching techniques is presented in [Hellerstein and Naughton 1996], along with a presentation of the related work in that area. For the purposes of this paper, we assume that the cost of caching is negligibly cheap, i.e., less than 1 I/O per tuple. This assumption is guaranteed by the algorithms presented in [Hellerstein and Naughton 1996].

5. CONCLUSION

5.1 Summary

This paper presents techniques for efficiently optimizing queries that utilize a core feature of extensible database systems: user-defined methods and types. We present the Predicate Migration Algorithm, and prove that it produces optimal plans. We implemented Predicate Migration and three simpler predicate placement techniques in Illustra, and studied the effectiveness of each solution as compared to its implementation complexity. Developers can choose optimization solutions from among these techniques depending on their workloads and their willingness to invest in new code. In our experience, Predicate Migration was not difficult to implement. Since it should provide significantly better results than any known non-exhaustive algorithm, we believe that it is the appropriate solution for any general-purpose extensible DBMS.

Some roughness remains in our selectivity estimations. When forced to choose, we opt to risk over-eager pullup of selections rather than under-eager pullup. This is justified by our example queries, which showed that leaving selections too low in a plan was more dangerous than pulling them up too high. The algorithms considered form a spectrum of eagerness in pullup, as shown in Figure 13.

5.2 Lessons

The research and implementation effort that went into this paper provided numerous lessons. First, it became clear that query optimization research requires a combination of theoretical insight and significant implementation and testing. Our original cost models for Predicate Migration were fundamentally incorrect, but neither we nor many readers and reviewers of the early work were able to detect the errors. Only by implementing and experimenting were we able to gain the appropriate insights. As a result, we believe that implementation and experimentation are critical for query optimization: query optimization researchers must implement their ideas to demonstrate viability, and new complex query benchmarks are required to drive both researchers and industry into workable solutions to the many remaining challenges in query optimization.

Implementing an effective predicate placement scheme proved to be a manageable task, and doing so exposed many over-simplifications that were present in the original algorithms proposed. This highlights the complex issues that arise in implementing query optimizers. Perhaps the most important lesson we learned in implementing Predicate Migration in Illustra was that query optimizers require a great deal of testing before they can be trusted. In practice this means that commercial optimizers should be subjected to complex query benchmarks, and that query optimization researchers should invest time in implementing and testing their ideas in practice.

The limitations of the previously published algorithms for predicate placement are suggestive. Both algorithms suffer from the same problem: the choice of which predicates to pull up from one side of a join both depends on and influences the choice of which predicates to pull up from the other side of the join. This interdependency of separate streams in a query tree suggests a fundamental intractability in predicate placement, which may only be avoidable through the sorts of compromises found in the existing literature.

5.3 Open Questions

This paper leaves a number of interesting questions open. It would be satisfying to understand the ramifications of weakening the assumptions of the Predicate Migration cost model. Is our roughness in estimation required for a polynomial-time predicate placement algorithm, or is there a way to weaken the assumptions and still develop an efficient algorithm?

This question can be cast in different terms. If one combines the LDL approach for expensive methods [Chimenti et al. 1989] with the IK-KBZ rank-based join optimizer [Ibaraki and Kameda 1984; Krishnamurthy et al. 1986], one gets a polynomial-time optimization algorithm that handles expensive selections. However, as we pointed out in Section 4.1.2, this technique fails because IK-KBZ does not handle bushy join trees. So an isomorphic question to the one above is whether the IK-KBZ join-ordering algorithm can be extended to handle bushy trees in polynomial time. Recent work [Scheufele and Moerkotte 1997] shows that the answer to this question is in general negative.

Although Predicate Migration does not significantly affect the asymptotic cost of exponential join-ordering query optimizers, it can noticeably increase the memory utilization of the algorithms by preventing pruning. A useful extension of this work would be a deeper investigation of techniques to allow even more pruning than is already available via PullRank during Predicate Migration.

Our work on predicate placement has concentrated on the traditional non-recursive select-project-join query blocks that are handled by most cost-based optimizers. It would be interesting to try to extend our work to more difficult sorts of queries. For example, how can Predicate Migration be applied to recursive queries and common subexpressions? When should predicates be transferred across query blocks? When should they be duplicated across query blocks [Levy et al. 1994]? These questions will require significant effort to answer, since there are no generally accepted techniques for cost-based optimization in these scenarios, even for queries without expensive predicates.

5.4 Suggestions for Implementation

Practitioners are often wary of adding large amounts of code to existing systems. Fortunately, the work in this paper can be implemented incrementally, providing increasing benefits as more pieces are added. We sketch a plausible upgrade path for existing extensible systems:

(1) As a first step in optimization, selections should be ordered by rank (the rank-ordered PushDown heuristic of Section 3.2.1). Since typical optimizers already estimate selectivities, this only requires estimating selection costs, and sorting selections.

(2) As a further enhancement, the PullUp heuristic can be implemented. This requires code to pull selections above joins, which is a bit tricky since it requires handling projections and column renumbering. Note that PullUp is not necessarily more effective than PushDown for all queries, but it is probably a safer heuristic if methods are expected to be very expensive.

(3) The PullUp heuristic can be modified to do PullRank, by including code to compute the differential costs for joins. Since PullRank can leave methods too low in a plan, it is in some sense more dangerous than PullUp. On the other hand, it can correctly optimize all two-table queries, which PullUp cannot guarantee. A decision between PullUp and PullRank is thus somewhat difficult; a compromise may be reached by implementing both and choosing the cheaper plan that comes out of the two. Since PullRank is a preprocessor for a full implementation of Predicate Migration, it is a further step in the right direction.

(4) Predicate Migration itself, while somewhat complex to understand, is actually not difficult to implement. It requires implementation of the Predicate Migration Algorithm, PullRank, and also code to mark subplans as unpruneable.

(5) Finally, execution can be further improved by implementing a caching scheme such as Hybrid Cache [Hellerstein and Naughton 1996] for expensive methods. This requires some modification of the cost model in the optimizer as well.

ACKNOWLEDGMENTS

Special thanks are due to Jeff Naughton and Mike Stonebraker for their assistance with this work. Thanks are also due to the staff of Illustra Information Technologies, particularly Paula Hawthorn, Wei Hong, Michael Ubell, Mike Olson, Jeff Meredith, and Kevin Brown. In addition, the following all provided useful input on this paper: Surajit Chaudhuri, David DeWitt, James Frew, Jim Gray, Laura Haas, Eugene Lawler, Guy Lohman, Raghu Ramakrishnan, and Praveen Seshadri.

REFERENCES

Bitton, D., DeWitt, D. J., and Turbyfill, C. 1983. Benchmarking database systems, a systematic approach. In Proc. 9th International Conference on Very Large Data Bases (Florence, Italy, Oct. 1983).

Boulos, J., Viemont, Y., and Ono, K. 1997. Analytical Models and Neural Networks for Query Cost Evaluation. In Proc. 3rd International Workshop on Next Generation Information Technology Systems (Neve Ilan, Israel, 1997).

Carey, M. J., DeWitt, D. J., Kant, C., and Naughton, J. F. 1994. A Status Report on the OO7 OODBMS Benchmarking Effort. In Proc. Conference on Object-Oriented Programming Systems, Languages, and Applications (Portland, Oct. 1994), pp. 414-426.

Cattell, R. 1994. The Object Database Standard: ODMG-93 (Release 1.1). Morgan Kaufmann, San Mateo, CA.

Chaudhuri, S. and Shim, K. 1993. Query Optimization in the Presence of Foreign Functions. In Proc. 19th International Conference on Very Large Data Bases (Dublin, Aug. 1993), pp. 526-541.

Chen, H., Yu, X., Yamaguchi, K., Kitagawa, H., Ohbo, N., and Fujiwara, Y. 1992. Decomposition - An Approach for Optimizing Queries Including ADT Functions. Information Processing Letters 43, 6, 327-333.

Chimenti, D., Gamboa, R., and Krishnamurthy, R. 1989. Towards an Open Architecture for LDL. In Proc. 15th International Conference on Very Large Data Bases (Amsterdam, Aug. 1989), pp. 195-203.

Dayal, U. 1987. Of Nests and Trees: A Unified Approach to Processing Queries that Contain Nested Subqueries, Aggregates, and Quantifiers. In Proc. 13th International Conference on Very Large Data Bases (Brighton, Sept. 1987), pp. 197-208.

Du, W., Krishnamurthy, R., and Shan, M.-C. 1992. Query Optimization in Heterogeneous DBMS. In Proc. 18th International Conference on Very Large Data Bases (Vancouver, Aug. 1992), pp. 277-291.

Faloutsos, C. and Kamel, I. 1994. Beyond Uniformity and Independence: Analysis of R-trees Using the Concept of Fractal Dimension. In Proc. 13th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Minneapolis, May 1994), pp. 4-13.

Frew, J. 1995. Personal correspondence.

Gray, J., Ed. 1991. The Benchmark Handbook: For Database and Transaction Processing Systems. Morgan Kaufmann Publishers, Inc.

Gray, J. 1995. Personal correspondence.

Haas, P. J., Naughton, J. F., Seshadri, S., and Stokes, L. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In Proc. 21st International Conference on Very Large Data Bases (Zurich, Sept. 1995).

Hellerstein, J. M. 1994. Practical Predicate Placement. In Proc. ACM-SIGMOD International Conference on Management of Data (Minneapolis, May 1994), pp. 325-335.

Hellerstein, J. M. and Naughton, J. F. 1996. Query Execution Techniques for Caching Expensive Methods. In Proc. ACM-SIGMOD International Conference on Management of Data (Montreal, June 1996), pp. 423-424.

Hellerstein, J. M. and Stonebraker, M. 1993. Predicate Migration: Optimizing Queries With Expensive Predicates. In Proc. ACM-SIGMOD International Conference on Management of Data (Washington, D.C., May 1993), pp. 267-276.

Hong, W. and Stonebraker, M. 1993. Optimization of Parallel Query Execution Plans in XPRS. Distributed and Parallel Databases, An International Journal 1, 1 (Jan.), 9-32.

Hou, W.-C., Ozsoyoglu, G., and Taneja, B. K. 1988. Statistical Estimators for Relational Algebra Expressions. In Proc. 7th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Austin, March 1988), pp. 276-287.

Ibaraki, T. and Kameda, T. 1984. Optimal Nesting for Computing N-relational Joins. ACM Transactions on Database Systems 9, 3 (Oct.), 482-502.

Illustra Information Technologies, Inc. 1994. Illustra User's Guide, Illustra Server Release 2.1. Illustra Information Technologies, Inc.

Ioannidis, Y. and Christodoulakis, S. 1991. On the Propagation of Errors in the Size of Join Results. In Proc. ACM-SIGMOD International Conference on Management of Data (Denver, June 1991), pp. 268-277.

Ioannidis, Y. and Poosala, V. 1995. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. In Proc. ACM-SIGMOD International Conference on Management of Data (San Jose, May 1995), pp. 233-244.

Ioannidis, Y. E. and Kang, Y. C. 1990. Randomized Algorithms for Optimizing Large Join Queries. In Proc. ACM-SIGMOD International Conference on Management of Data (Atlantic City, May 1990), pp. 312-321.

ISO ANSI. 1993. Database Language SQL ISO/IEC 9075:1992.

Kemper, A., Moerkotte, G., Peithner, K., and Steinbrunn, M. 1994. Optimizing Disjunctive Queries with Expensive Predicates. In Proc. ACM-SIGMOD International Conference on Management of Data (Minneapolis, May 1994), pp. 336-347.

Kim, W. 1993. Object-Oriented Database Systems: Promises, Reality, and Future. In Proc. 19th International Conference on Very Large Data Bases (Dublin, Aug. 1993), pp. 676-687.

Krishnamurthy, R., Boral, H., and Zaniolo, C. 1986. Optimization of Nonrecursive Queries. In Proc. 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 128-137.

Krishnamurthy, R. and Zaniolo, C. 1988. Optimization in a Logic Based Language for Knowledge and Data Intensive Applications. In J. W. Schmidt, S. Ceri, and M. Missikoff, Eds., Proc. International Conference on Extending Data Base Technology, Advances in Database Technology - EDBT '88, Lecture Notes in Computer Science, Volume 303 (Venice, March 1988). Springer-Verlag.

Levy, A. Y., Mumick, I. S., and Sagiv, Y. 1994. Query Optimization by Predicate Move-Around. In Proc. 20th International Conference on Very Large Data Bases (Santiago, Sept. 1994), pp. 96-107.

Lipton, R. J., Naughton, J. F., Schneider, D. A., and Seshadri, S. 1993. Efficient Sampling Strategies for Relational Database Operations. Theoretical Computer Science 116, 195-226.

Lohman, G. M. 1995. Personal correspondence.

Lohman, G. M., Daniels, D., Haas, L. M., Kistler, R., and Selinger, P. G. 1984. Optimization of Nested Queries in a Distributed Relational Database. In Proc. 10th International Conference on Very Large Data Bases (Singapore, Aug. 1984), pp. 403-415.

Lohman, G. M. and Haas, L. M. 1993. Personal correspondence.

Lynch, C. and Stonebraker, M. 1988. Extended User-Defined Indexing with Application to Textual Databases. In Proc. 14th International Conference on Very Large Data Bases (Los Angeles, Aug.-Sept. 1988), pp. 306-317.

Mackert, L. F. and Lohman, G. M. 1986b. R* Optimizer Validation and Performance Evaluation for Distributed Queries. In Proc. 12th International Conference on Very Large Data Bases (Kyoto, Aug. 1986), pp. 149-159.

Mackert, L. F. and Lohman, G. M. 1986a. R* Optimizer Validation and Performance Evaluation for Local Queries. In Proc. ACM-SIGMOD International Conference on Management of Data (Washington, D.C., May 1986), pp. 84-95.

Maier, D. and Stein, J. 1986. Indexing in an Object-Oriented DBMS. In K. R. Dittrich and U. Dayal, Eds., Proc. Workshop on Object-Oriented Database Systems (Asilomar, Sept. 1986), pp. 171-182.

Monma, C. L. and Sidney, J. 1979. Sequencing with Series-Parallel Precedence Constraints. Mathematics of Operations Research 4, 215-224.

Naughton, J. 1993. Presentation at Fifth International High Performance Transaction Workshop.

Palermo, F. P. 1974. A Data Base Search Problem. In J. T. Tou, Ed., Information Systems COINS IV. New York: Plenum Press.

Pirahesh, H. 1994. Object-Oriented Features of DB2 Client/Server. In Proc. ACM-SIGMOD International Conference on Management of Data (Minneapolis, May 1994), pp. 483.

Pirahesh, H., Hellerstein, J. M., and Hasan, W. 1992. Extensible/Rule-Based Query Rewrite Optimization in Starburst. In Proc. ACM-SIGMOD International Conference on Management of Data (San Diego, June 1992), pp. 39-48.

Poosala, V. and Ioannidis, Y. 1996. Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. In Proc. 22nd International Conference on Very Large Data Bases (Bombay, Sept. 1996).

Poosala, V., Ioannidis, Y. E., Haas, P. J., and Shekita, E. J. 1996. Selectivity Estimation of Range Predicates Using Histograms. In Proc. ACM-SIGMOD International Conference on Management of Data (Montreal, June 1996).

Raab, F. 1995. TPC Benchmark D - Standard Specification, Revision 1.0. Transaction Processing Performance Council.

Scheufele, W. and Moerkotte, G. 1997. On the Complexity of Generating Optimal Plans with Cross Products. In Proc. 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (Tucson, May 1997), pp. 238-248.

Selinger, P. G., Astrahan, M., Chamberlin, D., Lorie, R., and Price, T. 1979. Access Path Selection in a Relational Database Management System. In Proc. ACM-SIGMOD International Conference on Management of Data (Boston, June 1979), pp. 22-34.

Seshadri, P., Hellerstein, J. M., Pirahesh, H., Leung, T. C., Ramakrishnan, R., Srivastava, D., Stuckey, P. J., and Sudarshan, S. 1996. Cost-Based Optimization for Magic: Algebra and Implementation. In Proc. ACM-SIGMOD International Conference on Management of Data (Montreal, June 1996), pp. 435-446.

Seshadri, P., Pirahesh, H., and Leung, T. C. 1996. Complex Query Decorrelation. In Proc. 12th IEEE International Conference on Data Engineering (New Orleans, Feb. 1996).

Shapiro, L. D. 1986. Join Processing in Database Systems with Large Main Memories. ACM Transactions on Database Systems 11, 3 (Sept.), 239-264.

Smith, W. E. 1956. Various Optimizers For Single-Stage Production. Naval Res. Logist. Quart. 3, 59-66.

Steinbrunn, M., Peithner, K., Moerkotte, G., and Kemper, A. 1995. Bypassing Joins in Disjunctive Queries. In Proc. 21st International Conference on Very Large Data Bases (Zurich, Sept. 1995).

Stonebraker, M. 1991. Managing Persistent Objects in a Multi-Level Store. In Proc. ACM-SIGMOD International Conference on Management of Data (Denver, June 1991), pp. 2-11.

Stonebraker, M., Frew, J., Gardels, K., and Meredith, J. 1993. The Sequoia 2000 Storage Benchmark. In Proc. ACM-SIGMOD International Conference on Management of Data (Washington, D.C., May 1993), pp. 2-11.

Stonebraker, M. and Kemnitz, G. 1991. The POSTGRES Next-Generation Database Management System. Communications of the ACM 34, 10, 78-92.

Swami, A. and Gupta, A. 1988. Optimization of Large Join Queries. In Proc. ACM-SIGMOD International Conference on Management of Data (Chicago, June 1988), pp. 8-17.

Swami, A. and Iyer, B. R. 1992. A Polynomial Time Algorithm for Optimizing Join Queries. Research Report RJ 8812 (June), IBM Almaden Research Center.

Turbyfill, C., Orji, C., and Bitton, D. 1989. AS3AP - A Comparative Relational Database Benchmark. In Proc. IEEE Compcon Spring '89 (Feb. 1989).

Wong, E. and Youssefi, K. 1976. Decomposition - A Strategy for Query Processing. ACM Transactions on Database Systems 1, 3 (Sept.), 223-241.

Yajima, K., Kitagawa, H., Yamaguchi, K., Ohbo, N., and Fujiwara, Y. 1991. Optimization of Queries Including ADT Functions. In Proc. 2nd International Symposium on Database Systems for Advanced Applications (Tokyo, April 1991), pp. 366-373.

APPENDIX A. PROOFS


Lemma 1. The cost of applying expensive selection predicates to a set of tuples is minimized by applying the predicates in ascending order of the metric

$$\mathit{rank} = \frac{\mathit{selectivity} - 1}{\mathit{differential\ cost}}.$$

Proof. This result dates back to early work in Operations Research [Smith 1956], but we review it in our context for completeness.

Assume the contrary. Then in a minimum-cost ordering $p_1, \ldots, p_n$, for some predicate $p_k$ there is a predicate $p_{k+1}$ where $rank(p_k) > rank(p_{k+1})$. Now, the cost of applying all the predicates to $t$ tuples is

$$e_1 = e_{p_1}t + s_{p_1}e_{p_2}t + \cdots + s_{p_1}s_{p_2}\cdots s_{p_{k-1}}e_{p_k}t + s_{p_1}s_{p_2}\cdots s_{p_{k-1}}s_{p_k}e_{p_{k+1}}t + \cdots + s_{p_1}\cdots s_{p_{n-1}}e_{p_n}t.$$

But if we swap $p_k$ and $p_{k+1}$, the cost becomes

$$e_2 = e_{p_1}t + s_{p_1}e_{p_2}t + \cdots + s_{p_1}s_{p_2}\cdots s_{p_{k-1}}e_{p_{k+1}}t + s_{p_1}s_{p_2}\cdots s_{p_{k-1}}s_{p_{k+1}}e_{p_k}t + \cdots + s_{p_1}\cdots s_{p_{n-1}}e_{p_n}t.$$

By subtracting $e_1$ from $e_2$ and factoring we get

$$e_2 - e_1 = t\,s_{p_1}s_{p_2}\cdots s_{p_{k-1}}\left(e_{p_{k+1}} + s_{p_{k+1}}e_{p_k} - e_{p_k} - s_{p_k}e_{p_{k+1}}\right).$$

Now recall that $rank(p_k) > rank(p_{k+1})$, i.e.

$$(s_{p_k} - 1)/e_{p_k} > (s_{p_{k+1}} - 1)/e_{p_{k+1}}.$$

After some simple algebra (taking into account the fact that expenses must be non-negative), this inequality reduces to

$$e_{p_{k+1}} + s_{p_{k+1}}e_{p_k} - e_{p_k} - s_{p_k}e_{p_{k+1}} < 0,$$

i.e. the parenthesized term in the equation for $e_2 - e_1$ is less than zero. By definition both $t$ and the selectivities must be non-negative, and hence $e_2 \le e_1$ (strictly less when $t$ and the selectivities are positive), demonstrating that the given ordering is not minimum-cost, a contradiction.
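
The rank rule is easy to check numerically. Below is a small, self-contained Python sketch (not from the paper; the predicate names, per-tuple costs, and selectivities are made up for illustration) that brute-forces all orderings under the cost model used in this proof and confirms that the ascending-rank ordering is cheapest.

    from itertools import permutations

    # Hypothetical predicates: (name, per-tuple cost e_p, selectivity s_p).
    PREDICATES = [("p1", 4.0, 0.50), ("p2", 1.0, 0.90), ("p3", 2.0, 0.10)]

    def rank(pred):
        # rank = (selectivity - 1) / differential cost, as in Lemma 1.
        _, cost, selectivity = pred
        return (selectivity - 1.0) / cost

    def ordering_cost(order, t=1000.0):
        # Cost of applying predicates in this order to t tuples: each
        # predicate is applied to the tuples surviving the ones before it.
        total, surviving = 0.0, t
        for _, cost, selectivity in order:
            total += surviving * cost
            surviving *= selectivity
        return total

    cheapest = min(permutations(PREDICATES), key=ordering_cost)
    by_rank = tuple(sorted(PREDICATES, key=rank))
    print([p[0] for p in cheapest], ordering_cost(cheapest))
    print([p[0] for p in by_rank], ordering_cost(by_rank))
    assert abs(ordering_cost(cheapest) - ordering_cost(by_rank)) < 1e-9

With these particular numbers the ascending-rank order is p3, p1, p2, and the exhaustive search finds no cheaper ordering.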

Lemma 2. Swapping the positions of two equi-rank nodes has no effect on the cost of a plan tree.

Proof. Note that swapping two nodes in a plan tree only affects the costs of those two nodes (this is the Adjacent Pairwise Interchange (API) property of [Smith 1956; Monma and Sidney 1979]). Consider two nodes p and q of equal rank, operating on input of cardinality t. If we order p before q, their joint cost is $e_1 = te_p + ts_p e_q$. Swapping them results in the cost $e_2 = te_q + ts_q e_p$. Since their ranks are equal, it is a matter of simple algebra to demonstrate that $e_1 = e_2$, and hence the cost of a plan tree is independent of the order of equi-rank nodes.
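
For completeness, the "simple algebra" can be written out; this is only an expansion of the step above, using the rank definition $rank(x) = (s_x - 1)/e_x$ and assuming positive per-tuple costs so that the ranks are defined:

$$rank(p) = rank(q) \;\Longrightarrow\; \frac{s_p - 1}{e_p} = \frac{s_q - 1}{e_q} \;\Longrightarrow\; (s_p - 1)e_q = (s_q - 1)e_p \;\Longrightarrow\; e_p + s_p e_q = e_q + s_q e_p.$$

Multiplying the last equality by $t$ gives $e_1 = te_p + ts_p e_q = te_q + ts_q e_p = e_2$.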

Lemma 3. Given a join node J in a module, adding a selection or secondary join predicate R to the stream does not increase the rank of J's group.

Proof. Assume J is initially in a group of k members, $\overline{p_1 \cdots p_{j-1}\,J\,p_{j+1} \cdots p_k}$ (from this point on we will represent grouped nodes as an overlined string). If R is not constrained with respect to any of the members of this group, then it will not affect the rank of the group: it will be placed either above or below the group, as appropriate. If R is constrained with some member $p_i$ of the group, it is constrained to be above $p_i$ (by semantic correctness); no selection or secondary join predicate is ever constrained to be below any node. Now, the Predicate Migration Algorithm will eventually call parallel chains on the module of all nodes constrained to follow $p_i$, and R will be pulled up within that module so that it is ordered by ascending rank with the other groups in the module. Thus if R is part of J's group in any module, it is only because the nodes below R form a group of higher rank than R. (The other possibility, i.e. that the nodes above R formed a group of lower rank, could not occur, since parallel chains would have pulled R above such a group.)

Given predicates $p_1, p_2$ such that $rank(p_1) > rank(p_2)$, it is easy to show that $rank(p_1) > rank(\overline{p_1 p_2})$; a derivation is sketched below. Therefore, since R can only be constrained to be above another node, when it is added to a subgroup it will not raise the subgroup's rank. Although R may not be at the top of the total group including J, it should be evident that since it lowers the rank of a subgroup, it will lower the rank of the complete group. Thus if the rank of J's group changes, it can only change by decreasing.
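
The fact that $rank(p_1) > rank(p_2)$ implies $rank(p_1) > rank(\overline{p_1 p_2})$ can be verified directly. The sketch below assumes the usual series combination for a group's selectivity and per-tuple cost, namely $s_{\overline{p_1 p_2}} = s_{p_1}s_{p_2}$ and $e_{\overline{p_1 p_2}} = e_{p_1} + s_{p_1}e_{p_2}$ (which I take to match the group definitions used earlier in the paper), together with positive costs and selectivities:

$$\frac{s_{p_1} - 1}{e_{p_1}} > \frac{s_{p_1}s_{p_2} - 1}{e_{p_1} + s_{p_1}e_{p_2}} \iff (s_{p_1} - 1)(e_{p_1} + s_{p_1}e_{p_2}) > (s_{p_1}s_{p_2} - 1)e_{p_1}$$

$$\iff s_{p_1}(s_{p_1} - 1)e_{p_2} > s_{p_1}(s_{p_2} - 1)e_{p_1} \iff \frac{s_{p_1} - 1}{e_{p_1}} > \frac{s_{p_2} - 1}{e_{p_2}}.$$

So the combined group's rank falls strictly below $rank(p_1)$ exactly when $rank(p_1) > rank(p_2)$.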

Lemma 4. For any join J and selection or secondary join predicate R in a plan tree, if the Predicate Migration Algorithm ever places R above J in any stream, it will never subsequently place J below R.

Proof. Assume the contrary, and consider the first time that the Predicate Migration Algorithm pushes a selection or secondary join predicate R back below a join J. This can happen only because the rank of the group that J is now in is higher than the rank of J's group at the time R was placed above J. By Lemma 3, pulling up nodes cannot raise the rank of J's group. Since this is the first time that a node is pushed down, it is not possible that the rank of J's group has gone up, and hence R would not have been pushed below J, a contradiction.

Theorem 1. Given any plan tree as input, the Predicate Migration Algorithm is guaranteed to terminate in polynomial time, producing a semantically correct, join-order equivalent tree in which each stream is well-ordered.

Proof. From Lemma 4, we know that after the pre-processing phase, the Predicate Migration Algorithm only moves predicates upwards in a stream. In the worst-case scenario, each pass through the do loop of predicate migration makes minimal progress, i.e. it pulls a single predicate above a single join in only one stream. Each predicate can only be pulled up as far as the top of the tree, i.e. h times, where h is the height of the tree. Thus the Predicate Migration Algorithm visits each stream at most hk times, where k is the number of expensive selection and secondary join predicates in the tree. The tree has r streams, where r is the number of relations referenced in the query, and each time the Predicate Migration Algorithm visits a stream of height h it performs Monma and Sidney's $O(h \log h)$ algorithm on the stream. Thus the Predicate Migration Algorithm terminates in $O(hkrh \log h)$ steps.

Now the number of selections, the height of the tree, and the number of relations referenced in the query are all bounded by n, the number of operators in the plan tree. Hence a trivial upper bound for the Predicate Migration Algorithm is $O(n^4 \log n)$. Note that this is a very conservative bound, which we present merely to demonstrate that the Predicate Migration Algorithm is of polynomial complexity. In general the Predicate Migration Algorithm should perform with much greater efficiency. After some number of steps in $O(n^4 \log n)$, the Predicate Migration Algorithm will have terminated, with each stream well-ordered subject to the constraints of the given join order and semantic correctness.
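
To spell out where the $O(n^4 \log n)$ bound comes from, the quantities counted in the proof simply multiply together:

$$\underbrace{hk}_{\text{passes}} \;\times\; \underbrace{r}_{\text{streams per pass}} \;\times\; \underbrace{O(h \log h)}_{\text{work per stream}} \;=\; O(h^2 k r \log h) \;\le\; O(n^4 \log n),$$

since each of h, k, and r is at most n.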

Theorem 2. For every plan tree $T_1$ there is a unique semantically correct, join-order equivalent plan tree $T_2$ with only well-ordered streams. Moreover, among all semantically correct trees that are join-order equivalent to $T_1$, $T_2$ is of minimum cost.

Proof. Theorem 1 demonstrates that for each tree there exists a semantically correct, join-order equivalent tree of well-ordered streams (since the Predicate Migration Algorithm is guaranteed to terminate). To prove that the tree is unique, we proceed by induction on the number of join nodes in the tree. Following the argument of Lemma 2, we assume that all groups are of distinct rank; equi-rank groups may be disambiguated via the IDs of the nodes in the groups.

Base case: The base case of zero join nodes is simply a Scan node followed by a series of selections, which can be uniquely ordered as shown in Lemma 1.

Induction Hypothesis: For any tree with k join nodes or less, there is a unique semantically correct, join-order equivalent tree with well-ordered streams.

Induction: We consider two semantically correct, join-order equivalent plan trees, T and T', each having k + 1 join nodes and well-ordered streams. We will show that these trees are identical, hence proving the uniqueness property of the theorem.

As illustrated in Figure 14, we refer to the uppermost join nodes of T and T' as J and J' respectively. We refer to the uppermost join or scan in the outer and inner input streams of J as O and I respectively (O' and I' for J'). We denote the set of selections and secondary join predicates above a given join node p as $R_p$, and hence we have, as illustrated, $R_J$ above J, $R_{J'}$ above J', $R_O$ between O and J, etc. We call a predicate in such a set mobile if there is a join below it in the tree, and the predicate refers to the attributes of only one input to that join. Mobile predicates can be moved below such joins without affecting the semantics of the plan tree. First we establish that the subtrees O and O' are identical. The corresponding proof for I and I' is analogous.

Consider a plan tree $O^+$ composed of subtree O with a rank-ordered set $R_{O^+}$ of predicates above it, where $R_{O^+}$ is made up of the union of $R_O$ and those predicates of $R_J$ that do not refer to attributes from I. If O and J are grouped together in T, then let the cost and selectivity of O in $O^+$ be modified to include the cost and selectivity of J. Consider an analogous tree $O^{+'}$, with $R_{O^{+'}}$ being composed of the union of $R_{O'}$ and those predicates of $R_{J'}$ that do not refer to I'. Modify the cost and selectivity of O' in $O^{+'}$ as before. It should be clear that $O^+$ and $O^{+'}$ are join-order equivalent trees of less than k nodes. Since T and T' are assumed to have well-ordered streams, then clearly so do $O^+$ and $O^{+'}$. Hence by the induction hypothesis $O^+$ and $O^{+'}$ are identical, and therefore the subtrees O and O' are identical.

Fig. 14. Two semantically correct, join-order equivalent plan trees with well-ordered streams.

Thus the only differences between T and T' must occur above O, I, O', and I'. Now since the sets of predicates in the two trees are equal, and since O and O', I and I' are identical, it must be that $R_O \cup R_I \cup R_J = R_{O'} \cup R_{I'} \cup R_{J'}$. Semantically, predicates can only travel downward along a single stream, and hence we see that $R_O \cup R_J = R_{O'} \cup R_{J'}$, and $R_I \cup R_J = R_{I'} \cup R_{J'}$. Thus if we can show that $R_J = R_{J'}$, we will have shown that T and T' are identical.

Assume the contrary, i.e. $R_J \neq R_{J'}$. Without loss of generality we can also assume that $R_J - R_{J'} \neq \emptyset$. Recalling that both trees are well-ordered, this implies that either

- the minimum-rank mobile predicate of $R_J$ has lower rank than the minimum-rank mobile predicate of $R_{J'}$, or
- $R_{J'}$ contains no mobile predicates.

In either case, we see that $R_J$ is a proper superset of $R_{J'}$.

Knowing that, we proceed to show that $R_J$ cannot contain any predicate not in $R_{J'}$, hence demonstrating that $R_J = R_{J'}$, and therefore that T is identical to T', completing the uniqueness portion of the proof.

We have assumed that T and T' have only well-ordered streams. The only distinction between T and T' is that more predicates have been pulled above J than above J'. Consider the lowest predicate p in $R_J$. Since $R_J \supset R_{J'}$, p cannot be in $R_{J'}$; assume without loss of generality that p is in $R_{O'}$. If we consider the stream in T containing O and J, p must have higher rank than J since the stream is well-ordered and p is mobile; i.e., if p had lower rank than J it would be below J. Further, J must be in a group by itself in this stream, since p is directly above J and of higher rank than J. Now, consider the stream in T' containing O' and J'. In this stream, J' can have rank no greater than the rank of J, since J is in a group by itself and Lemma 3 tells us that adding nodes to a group can only lower the group's rank. Since p has higher rank than J, and the rank of J' is no higher than that of J, p must have higher rank than J'. This contradicts our earlier assumption that T' is well-ordered, and hence it must be that T and T' were identical to begin with; i.e. there is only one unique tree with well-ordered streams.

It is easy to see that a minimum-cost tree is well-ordered, and hence that some well-ordered tree has minimum cost. Assume the contrary, i.e. there is a semantically correct, join-order equivalent tree T of minimum cost that has a stream that is not well-ordered. Then in this stream there is a group v adjacent to a group w such that v and w are not well-ordered, and v and w may be swapped without violating the constraints. Since swapping the order of these two groups affects only the cost of the nodes in v and w, the total cost of T can be made lower by swapping v and w, contradicting our assumption that T was of minimum cost.

Since $T_2$ is the only semantically correct tree of well-ordered streams that is join-order equivalent to $T_1$, it follows that $T_2$ is of minimum cost. This completes the proof.

Lemma 5. For a selection or secondary join predicate R in a subexpression, if the rank of R is greater than the rank of any join in any plan for the subexpression, then in the optimal complete tree R will appear above the highest join in a subtree for the subexpression.

Proof. Recall that $rank(p_1) > rank(\overline{p_1 p_2})$. Thus in any full plan tree T containing a subtree $T_0$ for the subexpression, the highest-rank group containing nodes from $T_0$ will be of rank less than or equal to the rank of the highest-rank join node in $T_0$. A selection or secondary join predicate of higher rank than the highest-rank join node of $T_0$ is therefore certain to be placed above $T_0$ in T.