<<

A Performance Study of Query Optimization Algorithms on a System Supporting Procedures t

Anant Jhingran EECS Department University of California, Berkeley

Abstract . POSTGRESallows fields of a to have pro- However, a number of recent proposals which cedural (executable) objects. POSTQUEL is the query enhance Codd’s [CODD70] model require the language supporting access to these fields, and in this modification of the existing algorithms to optimize the paper we consider the optimizing process for such new set of queries that were not possible before. In this queries. The simplest algorithm for optimization assumes paper we study the query optimization problem in one that the procedural objects are executed in full, whenever such extended relational model, namely POSTGRES. needed. As a refinement to this basic process, we pro- We present a number of algorithms for optimization of pose an algorithm wherein cost savings are achieved by queries in such an environment, and do a performance modifying the procedural queries before executing them. study of each. In another direction of refinement, we consider the cach- The rest of the paper is organized as follows. In ing of the materialized results. Two caching strategies- Section 2 we present the extensions in POSTGRES caching in tuples, and separatecaching-are considered. relevant to our study. We also discuss the previous work The fifth algorithm is flattening, where a POSTQUEL on optimization of queries on procedural objects. The query is modified into an equivalent flat query, and then optimizing paradigm and the details of the algorithms optimized through a traditional optimizer. We study the under consideration are then discussed in Section 3. In relative performancesof these algorithms under varying Section 4 we present the framework in which we com- conditions and parameters.Our results show that caching pare the various algorithms. Section 5 presents the wins when updates do not occur with a high frequency, results of our study. Finally, this paper ends with the and that separatecaching is, in general, better than cach- conclusions on the viability of each algorithm. ing in tuples. We further show that when the composition of the objects in the procedural field is predictable and 2. POSTGRES PROCEDURES parameterizable,flattening is a good option. The relational model has been found deficient in 1. INTRODUCTION many areas of database applications (e.g., knowledge management [ZANI85], and engineering applications Query optimization in relational databasesystems [STON83]). As a result, there have been several propo- has been a traditional research problem. A number of sals to enhance the relational model. These extensions algorithms for optimizing queries have been proposed either address specihc deficiencies (e.g., ADT INGRES [SELI79, WONG761. They arc based on a variety of [STON83]), or a broad spectrum of inadequacies (e.g., paradigms [JARK841, and work well for the traditional Starburst, EXODUS, GENESIS, DASDBS and POSTGRES[lEEE87]). t This research was sponsored by tic Dcfcnse Advanced Research Project Agency (DoD), Arpa Order No. 4781, monitored by POSTGRES [STON86], a new Space and Naval Warfare Systems Command under Contract N00039- system currently being developed at Berkeley, extends 84-c-0089. the relational model in several ways. The one relevant to this study is the addition of procedural data types. Thus, in addition to the standard data types permitted by all systems (integer, real, character etc.) and abstract data Permission to copy without fee alI or Put of thin mat&l is types permitted by some systems[STON83], fields of the grPntedprovidsdthrtthecopieruenotmrdear~~for relations in POSTGREScan contain procedural objects. direct commucid advatage, the VLDB cqyxight notice d Procedural objects are executable programming the title of the ~blication ad its date qpca, and notice ir giva~ constructs. In this paper we restrict ourselves to those thatcopyingisbypermissionoftheVuyLargeDataBus procedures that are queries on the underlying database. Endowment. To copy othawisc, or to republish, requires a fee The fields containing such queries are called POST- d/orspccialpcrmissionfmmthcEndowment. QUEL fields. The presenceof procedural objectslends a

Proceedings of the 14th VLDB Conference Los Angeles,California 1988 88 high degree of flexibility to the design of a database POSTQUEL extends QUEL in many ways schema and allows many complex problems (e.g., [ROWE87]. The extension which deals with the pro- storage of query plans, representation of hierarchical cedural objects is the multiple dot notation (like GEM information, and sharing of subobjects by complex lZANI831). The execution of the queries in the pro- objects) to be naturally addressed[STON87]. cedural fields in a multiple dot notation is implicit. For Consider the following relations in a database example, in Query 1, the target field EMP.hobbies.day schema: can only be determined after the queries in EMP.hobbies are executed. DEPT (number = ~10, name = ~10, mgr = ~10) EMP (name = ~10, hobbies = POSTQUEL, The depth of a field is one less than the number of dept = PQSTQUEL) dots in its multiple dot representation. Thus EMP.name SOFTBALL (name = ~10, day = ~10, position = ~10) is at depth zero, and EMP.hobbies.day is at depth one. FOOTBALL (name = ~10, day = ~10, position = ~10) Clausescontaining fields with depth greater than zero are MUSIC (name = ~10, instrument = c10) called extended clauses. The rest are called ordinary clauses. The field day in SOFTBALL and FOOTBALL refers to the day the person plays that game. Each employee has The two PQSTQUEL fields in EMP are fundamen- zero or more hobbies. The field EMP.hobbies conse- tally different in terms of the nature of the objects they quently contains up to three POSTQUEL queries, one for contain. EMP.hobbies contains objects of unpredictable each hobby. Each employee belongs to exactly one composition. For example, as the database evolves, clliartment. For example, the relation EMP may look employees may take up new hobbies, and give up old ones. As a result, the set of queries that occupy name hobbies dept EMP.hobbies may change dynamically, and can only be John determined after each tuple in EMP is fetched and exam- retrieve (SOFTBALL.day. r&eve @EPT.all) SOFl’BALLqosition) where ined. In contrast, the composition of objects in EMP.dept where SOFTBALLname = “John” DEPT.munber = 9 is fixed-each tuple of EMP contains exactly one object of the form: retrieve (FOOTBALL.day, FOOTBALL.position) retrieve (DEPT.all) where where FOOTBALL.namc = “John” DEPT.number= $dept-number

Mary retrieve (SOFTBALLday, retrieve @BpT.aB) where $dept-numbermay differ acrossthe tuples. SOFTBALL.positicm) where In the case of EMP.dept, it makes sense to store when: SORBALL.name = “Mary” DEPT.number = 8 only the parameter $dept-number instead of the entire retrieve (MUSIC.instmment) POSTQUEL query. The relation EMP would thus con- where MUSIC.name = “Mary” tain: name hobbies dept 1 gives a set of queries which would be used John retrieve (SOFTBALL.day. SOFTBALLposition) 9 to illustrate the optimizing algorithms. where SOFTBALLname = “John”

retrieve (FOOTBALLday. FOOTBALLposition) where FOOTBALL.name = “John”

Mary retrieve (SOFlBALL.day. SOFTBALL.position) 8 2 retrieve (BMP.name) where where SOFTBALLname = “Mary” EMP.dept.name = “TOY” retrieve (MUSIC.instmment) Table 1: Example PQSTQUEL queries where MUSIC.name = “Mary” Query 1 retrieves the days John plays something, and Now consider Query 2. It can be converted into the fol- query 2 retrieves the namesof the cmployecs who work lowing “flattened” query: in the “TOY” dcpartmcnt. retrieve (EMP.name) where The PQSTQUEL fields can contain arbitrary EMP.dept = DEPT.number and PQSTQUEL queries. To distinguish between the POST- DEPT.name= “TOY” QUEL queries present in the procedural fields (such as those in EMP.hobbics) and the queries used to access This sort of query modification, which removes the these fields (such as those in Table l), we refer to the extended clauses in a PQSTQUEL query, is generally former as objects and the latter as queries. possible if the structure of the objects in a POSTQUEL field is the same across all the tuples. Such query modilication is referred to as Jlaftening. It is similar in,

89 concept to the flattening of SQL queries introduced in 3. OPI’IMIZATION ALGORITHMS [KIM82]. We have developed five basic algorithms for optimizing POSTQUEL queries. These algorithms treat 2.1. Previous Work ordinary clauses similarly; they differ only in their han- Optimization of queries on procedural objects has dling of the extended clauses.Each algorithm assumesa been studied from a perspective different from ours in different query processingstrategy. Accordingly, the dis- [SELL871 and [HANS88]. [SELL871 is an overview of cussion of the optimization algorithms is also a discus- the preliminary ideas on optimizing POSTQUEL sion of the correspondingquery processingstrategies. queries. It discussesan optimizing strategy for POST- QUEL queries based on the decomposition and tuple 3.1. The Basic Strategy substitution strategy in [WONG76]. This optimizer post- The SystemR query optimizer looks through most pones the evaluation of the extended clauses to the very of the viable query plans and estimatesthe cost of each. end of the . Furthermore, it is based on a It then selects the plan with the least estimated cost greedy approach. Consequently, the plans that it pro- [SELI79]. To do so, it needs the estimatesof the selec- duces may be suboptimal. In contrast, our optimization rivities of the clausesin the query. The cost of a plan is a strategiesare basedon an exhaustive searchapproach as function of the selectivities and the costs of relational used in System R, with the extended clauses integrated accessesand joins. into such a framework. We have shown the viability and advantagesof such an approachin [JHIN87]. The functions used for sclectivities and costs must be modified in order to be applicable to extended clauses A significant part of [SELL871 is devoted to dis- [JHIN87]. To discover if a tuple satislies an extended cussions on the caching strategies for materialized clause may involve execution of one or more procedural objects. Both finite and infinite cache space are con- objects. Determining the cost and selectivity of this pro- sidered. However, the discussion is exploratory in nature cess requires some statistical information about the and fails to reach specific conclusions. queries in the procedural fields. The execution of pro- Hanson [HANS881 studies the relative pcrfor- cedural objects is also termed as materializarion. mance of three algorithms for dealing with proceduraJ Ordinary clausescan be used to reduce the cost of fields-Always Recompute, Cache and Invalidate, and relational accesses if suitable indexes are available Update Cache.The last two algorithms differ in the ways [SELI79]. In other words, tuples not satisfying ordinary in which the cached set of objects is kept current. An ela- clause(s) may be filtered out by using a suitable access borate parameterized model is presented, which is then path. The same is not true for extended clauses, which used to compare the three algorithms. The study can generally be tested for only after the tuples have assumesthe availability of int’inite cache spaceand con- been fetched. For example, while an index on EMP.name cludes that caching strategies win if the probability of would help in Query 1 (where one may fetch only the update to the cachedobjects is low. tuples satisfying the clause), the clause in Query 2 can- Our study has points in common with Hanson’s, not act as a filter for accessing tuples of EMP. For but is more extensive in many aspects.We study two optimization purposes, extended clauses can be treated Cache and Invalidate strategies(as opposedto just one in like ordinary clauses, provided the modified cost and [HANS88]), and make the realistic assumption of selectivity functions are used, and the above limitation bounded cache space.Furthermore, we discussthe actual (of when extended clauses can be tested for) is kept in query optimization problem. In [HANS881 the objective mind. All our algorithms (except FLAT) use this function for the optimization algorithms is the expected approach,but differ in their cost functions for evaluating cost of accessing one procedural object. A similar extended clauses. They use an exhaustive search stra- assumptionis made in [SELL871 when the caching alter- tegy, and evaluate the extended clausesfrom left to right. natives are discuss& We, however, have the objective For example, in Query 2 (having the extended clause function as the cost of an entire POSTQUEL query, [email protected]= “TOY”), a plan would involve fctch- which may involve processing many objects. In particu- ing a tuple of EMP and materializing the object in the lar, we identify two different classes of POSTQUEL dept field of that tuple to determine the matching tuple of queries which result in very different algorithm perfor- DEPT. In casethe extended clause is deeper,this process mance. We also discuss query modification strategies would continue further. which reduce query costs substantially under some cir- The first algorithm, Complete Materiaiizafion with cumstances. No Caching (CM) is the simplest one. CM assumesthat the cost of executing an object has to be paid in full every time materialization is needed.There is no concept of storing these materialized results for future use. For example, in Query 2, the cost of the plan in CM includes

90 the cost of fetching the tuples of EMP, the cost of exe- Caching strategies can be broadly classified into cuting the procedural object in each of these tuples, and result caching and plan caching. In this study, only the the cost of checking for each of these materializations if former is considered since our model does not take into the result has the name field as “TOY.” account the cost of generating a plan. Even result cach- ing can be accomplished in various ways. Here we dis- 3.2. Restricted Materialization (REST) cuss two result caching strategies,which lead to two dif- Materialization returns all the subobjectsof a pro- ferent optimizing algorithms. Both these algorithms cedural object. Sometimeswe are not interested in the materialize an object only if its current version does not entire relation returned by the materialization of a POST- exist in cache. In all other aspects, they are similar to QUEL object; a subsetof the tuples might suflice. Under CM. these circumstances,it is possible that cost savings may be achieved by modifying the POSTQUEL object(s) 3.3.1. Complete Materialization-Cache Separately before executing them so that they only return the tuples (W of interest. In Query 2, the plan in REST pays the cost In CS, the materialized objects are cached in a of fetching the tuplcs of EMP, and for each tuple e in separatecache relation on the disk. Each object in the EMP, the cost of the following resrricted materialization: database has a unique-id which is a function of the retrieve (DEPT.all) where query_block (the structure of the object), and DEPT.number= “emp-dcpt” list-ofgarameters (the set of parametersthat uniquely and DEPT.name= “TOY” identify a particular object within the objects that have the same query-block). The unique-id is the input to a where “emp-dept” is the actual value of the parameter hash function that determines the slot in the cache rela- $dept-number in e. Note that under such a plan, there is tion where the object should be cached. no need to check if the tuplc(s) returned by the (res- tricted) materialization have their name field(s) as Whenever an object needs to be evaluated, CS “TOY.” determinesits unique-id and then hashesinto cache. If a current version of the materialized object is found, it is It is thus possible that the extended clause in a retrieved and the cost of executing the object is avoided. query can be used to modify some or all the objects that If such a version does not exist, the object is materialized riced to be materialized. REST checks if such an object and stored in cache if space permits. Note that under modification is semantically valid. If this is the case,the these assumptions,one page accessis required to cheek plan includes restricted materialization of these objects. if a result is cached. The number of page accessesto Otherwise, it is similar to CM in all aspects. retrieve a cachedrelation, of course, dependson its size.

3.3. Caching Strategies 3.3.2. Complete Materialization-cache in Tuples The two algorithms mentioned above keep no his- (CT) tory. Consider the case where the employees “Mary” In this approach, the materialized objects are and “John” belong to the same department, and there- stored in the tuples themselves. When the results are fore contain identical objects in their corresponding dept small and there is some free space in each page, the field. Moreover, assume that the tuple for “John” is materialized result can be cachedin the samepage as the accessedbefore that of “Mary” in answering Query 2. tuple containing the object. As a result, if a small object The POSTQUEL object in John’s tuplc will be executed is cached, it can often be retrieved without paying any first. If the result of this query execution could be stored, extra cost of I/O. For large objects, WC make the then the execution of the object in Mary’s tuple could be assumption that the first page of the cached object is avoided. It is thus clear that caching of mat.erialized stored clustered with the tuple. Under these assump- objects might help in reducing the cost of executing a tions, it follows that the number of page accesses POSTQUEL query. required to retrieve a cachedobject in CT is one less than There is another important bcnelit of caching. the number of accessesrequired in CS. We refer to the Consider a sequenceof queries, Q 1, . . . , Qn, which are extra cost in CS as the cost of cache lookup. Note that in not submitted as a batch (and hence global query optimi- CS, this cost has to be paid cvcn if the object is not zation algorithms such as in [SELL881do not work). The cached. processing of any query would involve materializing On the other hand, if the fraction of all objects that some objects. It is possible that the execution of a query are cached is cachedfraction, then CT may need to Qi can utilize one or more of the objects materialized by cache many more objects (and hence require much more the queries (Qi:i ci). Of course, updateswill invalidate space) to achicvc the same cachedfracfion as CS. This materialized objects, but where they arc infrcqucnt, happensbecause objects may be repeated across tuples. caching is likely to be bcnclicial [SELL87, STON87]. For example, consider the objects in EMP.dept. If there

91 are 100,000 employees, and 500 dcpartmcnts, then to databasecharacteristics. In this section we discuss our achieve a cachedfraction equal to one, CS would need model in detail. to cache 500 objects. whereasCT would need 100,000. 4.1. POSTQUEL queries 3.4. Flattening (FLAT) We restrict our study to the POSTQUEL qucrics of Flattening as a meansof evaluating a POSTQUEL two types: query has been discussed in Section 2. A flattened ver- sion of a POSTQUEL query can be passed through a traditional optimizer and a plan gencratcd. This plan can retrieve (REL.Procf ield.Ordf ield ,) whcrc be no worse than the plan of CM and REST. The other algorithms evaluate an extended clause in a top down approach (i.e., from left to right). This order of relational accessesis just one of the many options available in FLAT query. If the other options are cheaper, then FLAT would do better. Ordfieldl is an ordinary field in the target list of the Consider Query 2, and its flattened version: POSTQUEL queries in the POSTQUEL field REL.Procfield. Ordfieldz is an ordinary attribute of retrieve (EMP.name)where REL . In queries of Type 1, all the objects in the POST- EMP.dcpt = DEPTnumber and QUEL field of the selected tuplc of REL need to be DEPT.name= “TOY” materialized. In contrast, in queries of Type 2, the The other algorithms pick a tuple of the EMP relation, materialization process can stop as soon as a match is and for each tuple, fetch the “matching” tuples of found. The relative performance of the algorithms is DEPT. This is equivalent to a nestedloop in a FLAT highly dependent on the type of the query involved. strategy with EMP as the outer relation, and DEPT as the More complicated queries (deeper and/or more extended inner relation. The cost of an inner fetch in FLAT clauses) are beyond the scope of this paper, and further, correspondsexactly to the cost of a (resuictcd) materiali- their behavior is often similar to the simpler ones of our zation in REST. FLAT is likely to win if a merge join choice. (i.e., join after sorting EMP and DEPT on the fields Apart from the Type, there is one more parameter EMP.dcpt and DEPT.numbcr respectively), or a bottom associatedwith the POSTQUEL queries. For queries of up evaluation (i.e., using DEPT as the outer and EMP as Type 1, Selbot is the selectivity of the clause in the the inner relation) is cheaper. qualification. In Type 2 queries, Selbot is the selectivity There are two factors that mitigate the seeming of the clause that is used to modify a POSTQUEL object superiority of FLAT. The first is that there is no hope of in REST. The relative performance of the query caching materialized objects. Thus, while FLAT would modilication algorithms (both REST and FLAT) is certainly be better than REST or CM, it may be worse dependenton this parameter. than CS or CT. The second is a more practical reason.If the number of objects in a tuple is large, and/or their 4.2. Update Model composition is unpredictable, then flattening is unviable. Updates to the underlying database invalidate For example, consider Query 1. Since the set and struc- some or all of the objects materialized by the POST- ture of queries in EMP.hobbics may change dynamically, QUEL queries. This section discusses the model for it is not possible to store the parametersof these objects studying the performance of the caching algorithms in in suitable field(s). the presenceof updates. Techniques for flattening a POSTQUEL query parallel various modifcation algorithms [STON75], 4.2.1. Object Invalidation and their discussion is beyond the scopeof this paper. When an object is materialized and cached for the first time, I-locks (Invalidation locks) arc set on the 4. MODEL tuples and index intervals read during the execution of In order to compare the various algorithms, we that object. Updates to the tuples and index intervals have constructed opdmizcrs of limited functionalitics involve invalidation of all those objects which have I- which gencratc the cheapestplan and its expected cost. locks on them. Updatesdo not remove the I-locks set on This cxpcctcd cost is the yardstick used to evaluate the the tuplcs, and hence only the first materialization and corresponding processing strategy. To simplify our caching of an object requires the setting of I-locks. How- evaluation task, we have made certain assumptionsand ever, all updates do involve paying the extra cost of parametcrized some conditions. Thcsc parameters invalidation over the casewhere no caching is done. characterizethe POSTQUEL query and other systemand

92 With the passageof time, the number of objects top down query plan accessesthe tuples of REL , and for that have been materialized and cached at least once each such tuple, determines the matching tuples of increases.We make the simplifying assumption that all ObjRel by materializing (if necessary) the object(s) in the objects have been cached at least once before we REL.Procfield. A bottom up plan accesses ObjRel begin comparing the pcrformancc (in other words, we before REL . Such a bottom up plan is possible only in a ignore the transient behavior). Thus the cost of setting flattened query. I-locks doesnot enter our calculations. The cost of an object depends on (among other things) the index on the attribute in its qualification. This 4.2.2. Query Sequence index is called Object-Index, and its type determinesthe Consider a random sequence of queries, where cost of the objects: each query is an update with probability Pr(UPDATE) [l] Easy: These objects have Object-Zndex as C-SEC, and a retrieve with probability 1 - Pr(UPDATE). All and typically cost 2-3 page accessesfor their retrieve queries are POSTQUJZLqueries made solely of materialization. either Type 1 or of Type 2 queries (depending on the experiment under consideration). All updatesare POST- [2] Hard: These objects have Object-Index as NEX- QUEL replace commands (without any extended IST, and cost a complete SEGMENT scan for clauses, though) which update a fraction of the tuples in materialization. the relation(s) touched during the materialization of A PROC-MZX fraction of the objects in each POST- POSTQUEL objects. This fraction is fixed at 0.01 for the QUEL field are hard, and 1 - PROC-MIX are easy. remainder of the study. The tuples that are updated are Thus a PROC-MIX near zero represents a majority of selcctcdat random. easy objects, and a PROC-MIX near one, a majority of The fundamental nature of the caching algorithms hard objects. is best brought out by their avcragc case responses;and Use-Fuctor is the expected number of times each not the responses to some particular query sequence. object is repeated in a . Thus if there are P dis- Therefore, for a fixed set of parameters,we have used tinct objects (i.e., no two having the same query-block random query sequencesof length hundred, and then and list-ofparameters) in REL.Procf ield , then averaged the behavior over a hundred such sequences. The averageresponses arc fairly stable at this point. Use-Factor = I REL I xobjectsger-tuple I P In the absenceof updatesand with limited Size Cache, 4.3. Database Structure the cachedfractions in CS and CT are in this ratio. The relations in the database contain the fields There is one more parameter of inlerest- required to make the objects and the qucrics syntactically Flat-Index. This is the index on the fields that store the correct. The indexes on the various fields can be of the parametersfor FLAT. The bottom up plan in FLAT is type Primary (PRIM), Clustered Secondary (C-SEC), aided by the presence of indexes on these fields. For Non Clustered Secondary (NC-SEC) or non-existent example. a FlatJndex = C-SEC means that there exists (NEXISTJ. In the absenceof any index of thesetypes, a a clustered secondary index on EMP.dept when it stores relation can be accessed through a SEGMENT scan the parameter$dept-number. In a bottom up query plan [SELI79]. The assumptions of the cost of relational for the flattened version of Query 2, a tuple e in EMP accessthrough these indexes (as well as for joins) are the that matches a tuple d in DEPT satisfies the following sameas in [SELI79]. condition: e.dept = “particular-dept-number.” Here The POSTQUEL field contains objectsger tuple “particular-dcpt-number” is the value in the field POSTQUEL objects in a tuple of REL. These cbjects d.number. A nested loop/merge join is facilitated by the are simple selections and projections on a (set of) presenceof Flat-Index. relation(s) such that the POSTQUEL queries of Type 1 and 2 are syntactically valid. Each object is a single 4.4. Parameters of Study relation query containing cxacrly one qualilication Table 2 shows the parametersof the study. along with clause--a selection. Thus its structure is: their default values. On the basis of some fixed parame- rctricvc (ObjReLOrdf ield 1) where ters (not shown above), the size of the databaserelation ObjRel.Ordf ield3 operator value is about 50 MBytes. The number of tuples returned by an object is fixed at ten 5. PERFORMANCE RESULTS for the remainder of this study. (Note that within the same column, ObjRel may not bc the same in all the In this Section, the results of the performance objects. For example, EMP.hobbies contains objects with analysis of the optimizing algorithms is presented.The ObjRel as SOFTBALL, FOOTBALL and MUSIC.) A

93 cachedfraction for CS makesit win. Query Parameters As mentioned before, the cost of an inner fetch in I a top down, nested loop join in FLAT is the sameas the Name Default cost of a materialization in REST (and hence, the same Type lor2 as CM). Thus if PROC-MIX is low, then this plan is the Selbot 0.1 cheapestfor FLAT. Since this plan is identical to CM’s, DatabaseParameters FLAT follows the curve of CM. When PROC-MIX Name Default becomessufficiently large, nested loop join is no longer objects per tuple 1 the cheapest, and flat switches to merge scan, while maintaining a top down accessof relations. For higher values of PROC-MIX, it abandons the top down approachaltogether, and does a bottom up query evalua- tion. Under these circumstances, the cost of FLAT becomesindependent of the cost of the objects. Other Parameters Name Default 5.1.2. Type 2 Queries Size Cache 10 MBytes In a flattened version of a query of Type 2, there is Pr(UPDATE) 0.2 a restriction on ObjRel. This, together with the presence of a default secondary index on REL.Procfield Table 2: Paramctcrsof Study (Flat-Index = NC-SEC), makes the bottom up plan the best for a FLAT query. Since this plan is unaffected by cost of an algorithm is the estimated cost of the plan it PROC MIX, the curve for FLAT is a horizontal line, generates.The lower this cost, the better the algorithm is. which% substantially lower than the other curves. Thus We first discuss the cost characteristicsof the algorithms FLAT is definitely superior for queries of Type 2. espe- as functions of some of the important paramcters- cially for high PROC MIX. CS and CT show a behavior PROC-MIX, Pr (UPDATE ), Size-Cache and similar to Type 1 queries. Use-Factor. In the accompanying graphs, all costs have For queries of Type 2, REST is always cheaper been normalized such that Cost(CM) = 1 at the smallest than CM. This is primarily due to the fact that the cost of x coordinate. We next study the behavior of these algo- materializing modified objects is never more than that of rithms as functions of pairs of these parameters.Finally, the corresponding objects. The extra clause (a selection the effects of the other parametersnot included in the list on ObjRel.Ordf ieldl) helps in reducing the cost of above are discussed. materialization if an index exists on ObjRel.Ordf ield 1, and if the access of ObjRel through such an index is 5.1. Cost of the Objects cheaper than the other indexes. The default parameters Figure l(a) plots the normalized costsas a function provide for an NC-SEC on ObjRel.Ordfieldl. At low of PROC-MIX for queries of Type 1, while Figure l(b) PROC-MIX, a scan of ObjRel through this index is not does the same for queries of Type 2. Note that the cheapest,but beyond a certain PROC-MIX, this is PROC-MIX is an indicator of the expected cost of the best plan. From the figure we note that for materialization of the procedural objects. PROC-MIX > 0.1, materialization of a modified object is cheaper than the corresponding object, and is indepen- 5.1.1. Type 1 Queries dent of PROC-MIX. For very high PROC-MIX, REST is even better than the caching algorithms. For these types of queries, a top down plan involves restricting REL using the qualilication, and then 5.2. Updates executing the objects in the selected tuples to determine the target field values. No modilication of the objects (as Figure 2 plots the curves for Type 1 query as a desired in REST) is possible, and hence REST performs function of Pr(UPDATE). It can be seen that as identically to CM. CT and CS are better than CM for all Pr (UPDATE) increases, the two caching algorithms choices of PROC-MIX. Thus caching is a clear winner deteriorate. With an increase in the frequency of under thesecircumstances. updates, there is an increase in the number of objects being invalidated. This has a two-fold effect. First, At low PROC-MIX, Ihe extra cost of cache invalidation costs increase. Second,a retrieve query sees lookup is a significant fraction of the total cost. As a fewer cachedobjects on the average;and hence has to do result, CT performs better than CS, in spite of having a more materialization, and pay a higher processingcost. lower cachedfraclion . At higher PROC-MIX, the lookup cost is negligible, and the higher An update in a query sequenceinvalidates a higher number of cached objects if they are being cached in

94 0.d i 8 0.00001 0.0001 0.001 0.01 0.1 1.0 o.oooo1 0.0001 0.001 0.01 0.1 1.0 PROC-MIX PROC-MIX Query Type 1 Query Type 2 Figure 1: Normalized Costsas a function of PROC-MlX tuples compared to when they are cached separately (in 1. Note how the ratio falls as Size-Cache is raised to 25 an approximate ratio of Use-Factor ). As a consequence, MBytes (which is sufficient to cache all objects in CT). updatespenalize CT more than CS. Even with this Size-Cache, the curve of Cost(CT)/Cost(CS) is above one. The reason for this is 5.3. Size of the Cache the extra penalties of updates in CT. When According to the parameters,CS requires a cache Pr (UPDATE) is made zero (and Size-Cache is still 25 spaceof = 12.5 MBytes to achieve a cachedfraction of MBytes), the ratio of costs drops to below one. This one. At sizes more than this, only CT benelits. Figure 3 curve represents the ideal conditions for a caching plots the cost characteristics of the two caching algo- algorithm-enough cache space, and no updates.Under rithms and CM as a function of Size Cache. The curves thesecircumstances, CT is definitely superior. for REST and FLAT are omitted forthe sake of clarity. From now on, we restrict ourselves to CT is better than CS for either very small Size-Cache Size-Cache I 1OMBytes. This is in keeping with our (where the performance penalties of extra lookup are assumption of a bounded cache space. The larger sizes more than the extra caching benelits of CS); or for a very which we encounteredso far were used only to bring out large Size-Cache (= 22 Mbytes), where the the fundamental differences in the algorithms. cachedfraction in CT is close to one. 5.5. Regions of Optimal Performance 5.4. Level of Sharing We now turn to the behavior of the algorithms as An increase in Use-Factor has a two-fold effect functions of pairs of the above parametersby plotting the on the cost of CS. First, for a given Size-Cache, the regions where each algorithm performs the best. cachedfraction increases. Second, an update causesa In Figures 6(a) and 6(b), the regions as functions lower number of invalidations. Thus it is obvious that as of Pr (UPDATE) and Use-Factor are shown for queries Use-Factor increases, CS would be more and more of Type 1. It is clear that for a sufficiently high appealing. Figure 4 plots ‘OS’ CT 7I7d&i as a function of the Pr (UPDATE ), the caching algorithms would prove to be Use-Factor for the four possible choices of PROC-MIX non-competitive. Referring to Figure 6(a), consider a and query type. The significant point of note is the earlier horizontal line drawn through Use-Factor = 2. CT is the flattening in low PROC-MIX queries. Thus for inexpen- best algorithm until Pr(UPDATE) = 0.4. Then CS sive objects, CT gives a comparableperformance to CS, becomes the best. As Pr (UPDATE) increases further, for all values of Use-Factor. even CS fails to be better than the other algorithms. Note that for Use-Factor < 1.5, CS never wins. We have seenvarious reasonswhy CT, in general, performs worse than CS. .In Figure 5 we attempt to cap- When the objects are expensive (Figure 6(b)), CT ture these reasonsfor a high PROC-MIX query of Type is never preferred. CS is the best for high Use Factor

95 0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Pr(UPDATE) Figure 2: Costs vs. Pr(UPDATE) Figure 3: Costsvs. Sire-Cache (Type 1 Queries, PROC-MlX=OA) (Type 1 Queries, PROC_MIX=O.OOOl) 1000.0 lo()c)O --.--__-T ______T-_---_-_ :‘---~ ----: -_-- ‘--~-: -____ ~---: ______: 1 ! ! j i ! ! i

,100.o a : 10.0

0

1.0

0.1 4 0.14 : : : : i 1 i 1 2 4 8 16 32 64 128 1 2 4 8 16 32 64 128 Use-Factor Use-Factor Figure 4: Cost(CT)/Cost(CS) vs Use-Factor Figure 5: Dependence of Cost(CT)/Cost(CS) on Size-Cache and Pr(UPDATE)

a.0 0.2 0.4 0.6 0.8 Pr( UPDATE) Pr(UPDATE) 6(a): PROC-MIX = 0.0001 6(b): PROC-MIX = 0.4 Figure 6: Regions of best performance as functions of Use-Factor and Pr (UPDATE)

96 and/or low Pr (UPDATE ). From both the figures, it is choice of objectsger-tuple = 1. We now discuss the clear that if the Use-Factor > 100, CS is extremely implications of objectsger-zuple > 1 on FLAT. Con- competitive unlessPr (UPDATE) = 1. sider the following schema: Figure 7 plots the regions as a function of the two PAIRS (seed= i4, partners = PQSTQUEL) most important parametersfor the caching algorithms- h4EN (seed= i4, name = ~10, country = ~10) Pr(UPDATE) and Size Cache. It can be seen that for WOMEN (seed= i4, name = ~10, country = ~10) low Pr(UPDATE), CT% better than CS for low cache which describes the players taking part in a tennis tour- sizes. Taking a vertical slice, (say at nament.PAIRS contains the data about the mixed double Size-Cache = 10 MBytes), for Pr (UPDATE) c 0.4, CT tournament, and MEN and WOMEN about the singles is the best, for 0.4 < Pr (UPDATE) c 0.6, CS is the best, tournament for men and women respectively. and for Pr (UPDATE) > 0.6, the non-caching algorithms PAIRS.partners contains two queries of the form: perform the best. This result is similar to what was obtained in Figure 6(a) along the line Usefactor = 2. retrieve (MENall) where MEN.name = $paraml The next figure (Figure 8) capturesthe behavior as function of PROC-MIX and Pr (UPDATE). In the upper retrieve (WOMEN.all) where left comer, note how an increase in PROC MIX makes WOMEN.name = $param2 CS more and more competitive. This is-because as The PQSTQUEL query PROC-MIX increases, so dots the cost of materializa- tion. Consequently, the benefits of caching go up. This retrieve (PAIRS.seed)where continues until the bottom up algorithm for FLAT beats PAIRS.partner.seed< 5 any caching approach(also seeFigure la). returns the seedsof the mixed double teamswhere either partner has a seedbetter than 5 in his/her respective tour- 5.6. Other Parameters nament. Assume that we store the parameters of these In this subsection we briefly discuss the other objects (instead of the full queries) in the fields parametersof our study. PAIRS.male and PAIRS.female. We have seen before that FLAT performs similar to REST if it choosesa top 5.6.1. Number of Objects per Tuple down approach. This holds even if objectsger-tuple > 1, as is in this case. FLAT has been shown to be distinctly superior in case of queries of Type 2 and under somecircumstances In contrast, in a bottom up plan (i.e., accessing for queries of Type 1. This is partly a result of the default MEN and WOMEN before PAIRS), FLAT needs to

l*ol ______-______-_.___ ~‘--~~ ______- ______-_____ ~: . ------.------.----3------~~-----~ 0.8. ______1.______-______-.-______j

T . ______.______CL-~ __..______------..------; E) 0.2. _____-__-____---._____-- _-__._---- 3___. __~~-_-_.-~~~~_.~-~~~~~.~~~~~~ j ...... --...... i...... I o.o* 4 0.1 1.0 10.0 o.OOoO10.0001 0.001 0.01 0.1 1.0 Size-Cache(MBytes) PROC_MIX Figure 7: Regions of best performance Figure 8: Regions of best performance as functions of Sire-Cache and Pr(UPDATE) as functions of PROC-MIX and Pr(UPDATE) (Type 1 Queries, PROC~MIX=O.OOOl) (Type 1 Queries)

97 execute (an equivalent of) the following two queries: a competitive edge till a high value of Selbot if retrieve (PAIRS.seed)where Flat Index is C-SEC. On the other hand, the absenceof PAIRS.male = MEN.name and this index makesFLAT switch to a top down plan at low MEN.seed < 5 values of Selbot . As objectsger-tuple increase, a Flat Index is retrieve (PAIRSseed) where neededon each field that stores a parameter, iFFLAT is PAIRS.female = WOMENname and to perform better than other algorithms. WOMEN.seed < 5 Assuming the availability of indexes on PAIRS.male and 6. CONCLUSIONS PAIRS.female, the best plan for each subquery is bottom We have shown that the caching algorithms are up. If the total cost of these two subqueriesis more than competitive in situations of low to moderate update pro- a top down plan, the latter is chosen. Otherwise, FLAT bability. In this, our conclusions are similar to choosesthe option of executing thesetwo subqucries. lFIANS881. Moreover, it has been demonstrated that In general, for low objectsger-tuple , a sequence separatecaching is better than caching in tuples under of subqueries performing an equivalent task would be most circumstances. This is especially true when the cheaper. As objectsger-tuple increases (and so does cache size is limited and Use-Factor is high because the number of subqueries),the bottom up approach is no separate caching is able to achieve a higher longer the best, and then FLAT switches to top down and cachedfraction. Furthermore, updates penalize CT performs similar to REST. This is confirmed in Figure 9. more than they do CS. There are two factors that may mitigate this superiority of CS. First, if the objects are 5.6.2. Flat Index cheap, then CS would suffer becauseof the extra lookup Figure 10 plots the curve for FLAT and CM as a costs. The second factor is the implementation problems of CS. We have assumedthe availability of “hashing” function of Selbot for different choices of Fiat Index. In queries of Type 2, FLAT has a bottom up pl6 if the into a cache relation. As the objects become more com- plex, so would the hashing strategy. Since our model clause on ObjRel is highly selective. As this clause does not take this into account, its effects have not becomes less selective, the cost of the bottom up plan increases,and after a point FLAT switches to top down. enteredthe picture. As we have seen before, a Flat-Index helps lower the In caseswhere the number of objects in a tuple is cost of a bottom up plan. Consequently,FLAT maintains near one and their composition is predictable and easily parameterizable,it has been further shown that flattening

0.0 ! 1 1 2 3 4 5 6

objectsger-tuple Selbot Figure 9: Effect of objectsger-tuple Figure 10: Effect of Flat_lndex on FLAT (Query Type 2, PROC MIX =O.OOOl) (Query Type 2, PROC-MIX =O.OOOl)

98 is a good option. This is especially true if the query is California at Berkeley, May 1987. best solved by a bottom up approach. As the number of [KIM821 Kim, W., “On Optimizing an SQL-like objects per tuple increases, FLAT loses its competitive Nested Query,” ACM Trans. on Database edge. To emphasize again, flattening is not possible Systems,7.3, Sept. 1982. when the composition of the objects is unpredictable. [ROWE871Rowe, L., and Stonebraker, M., “The It is clear that CM is the preferred alternative in POSTGRES Data Model,” Proc. 13th presenceof frequent updates,and where flattening is not VLDB Conf., Brighton, 1987. viable. REST is never worse than CM, but its marginal utility is often negligible Moreover, if the cost of gen- [SELI79] Selinger, P. et al., “Access Path Selection in erating the plans (which has not cntercd our picture) is a Relational Database Management Sys- also a criterion, then REST would perform worst than it tem,” Proc. ACM-SIGMOD Conference on does in our studies. Managementof Data, 1979. It may seem that caching and restricted materiali- [SELL871 Sellis, T., “Efficiently Supporting Pro- zation are orthogonal (and thus the two may be applied cedures in Relational Database Systems,” together in a strategy). However, it can be shown that Proc. ACM-SIGMOD Conference on caching benefits restricted materialization only within a Managementof Data, May 1987. query (Section 3.3) with inter query bcnclits being highly [SELL881 Sellis, T., “Multiple-Query Optimization,” unlikely. We consider the latter as much more important, ACM Trans. on Database Systems, 13.1, and have therefore not examined this possibility. March 1988. A real query optimizer will, in general, be based @TON751 Stonebraker,M., “Implementation of Views on one or more of the above strategies. The actual and Integrity Control by Query choice(s) of the strategies will depend strongly on the Modification,‘* Proc. ACM-SIGMOD factors discussed in this study. It is necessaryto deter- Conferenceon Managementof Data, 1975. mine theseparameters before such a choice can be made. [STON83] Stonebraker, M., et al., “Application of Though we have discussed the optimizing algo- Abstract Data Types and Abstract Indices to rithms in a specific environment, the discussion on the CAD ,” Proc. ACM-SIGMOD various strategies should extend to any system support- Conference on Engineering Design Applica- ing procedural objects. tions, 1983. [STON86] Stonebmker, M., and Rowe, L., “Design of Acknowledgements POSTGRES,” Proc. ACM-SIGMOD The author wishes to thank his advisor. Professor Conferenceon Managementof Data, 1986. ,for his advice on the work and on [STON87] Stonebmker,M., et al., “Extending a Data- the preliminary drafts of this paper. Thanks are also due baseSystem with Procedures,” ACM Trans. to the anonymous refereesfor their valuabIc suggestions. on DatabaseSystems, Sept. 1987 REFERENCES WONG761 Wong, E., and Youssefi, K., “Decomposition-A strategy for query pro- [CODD70] Codd, E.F., “A Relational Model of Data cessing,” ACM Trans. on Database Sys- for Large Shared Data Banks,” Comm. of tems, Sept. 1976 ACM, June 1970. [ZANI83] Zaniolo, C., “The Database Language [HANS883 Hanson, E., “Processing Qucrics against GEM,” Proc. ACM-SIGMOD Conference Database Proccdurcs: A Performance on Managementof Data, 1983. Analysis,” Proc. ACM-SIGMOD Confer- ence on Managcmcnt of Data, June 1988. [ZANI853 Zaniolo, C., “The Representation and Deductive Retrieval of Complex Objects,” [IEEE871 Bulletin of the Computer Society of the Proc. International Conference on VLDB, IEEE TC on DatabaseEngineering, Special 1985. Issue on Extensible Database Systems, Carey, M. cd., June 1987. [JARK84] Jarke, M., and Koch, J., “Query Optimiza- tion in DatabaseSystems,” ACM Comput- ing Surveys,June 1984. [JHIN87] Jhingran, A., “A Compile Time Optimizer for Database Systems Supporting Pro- CedlUeS,” Master’s Report, University of

99