<<

Managing Intra-operator Parallelism in Database Systems*

Manish Mehta David J. Dewitt IBM Alamaden Research Center Sciences Department San Jose, CA, USA. University of Wisconsin-Madison [email protected] [email protected]

Abstract dat,a. However, it is not evident how intra-operat,or parallelism should be used in a multi-query environ- Int#ra-operator (or partitioned) parallelism is a rnent, where multiple concurrent queries are cont,end- well-established mechanism for achieving high per- ing for system resources. Selecting low degrees of op- formance in parallel database systems. How- era.tor parallelism can lead to underutilizat,ion of the ever, the problem of how to exploit intra-operator system and reduced performance. On the other hand. parallelism in a multi-query environment is not high degrees of parallelism can give “too many” re- well underst,ood. This paper presents a detailed sources to a single query and lead t.o high resource performance evaluation of several algorithms for contention. This paper explores this problem of deter- managing intra-operator parallelism in a parallel mining the %ppropriat,e” degree of intra-operator par- database system. A dynamic scheme based on the allelism for queries in a multi-query pa.rallel dat.aba.se concept of matching the ra.te of flow of tuples be- system. tween operat,ors is shown to perform well on a va- T.here are two important issues that need to be ad- riety of workloads and configurations. dressed. First, for each query, t,he algorithm must de- termine the degree of parallelism of each operator in the query plan. Second, the algorithm must a.ssign 1 Introduction specific processors to execute each operat.or inst,ance. Highly-parallel databa.se systems are increasingly be- The degree of parallelism of an operator should be ing used for large-scale database applications. Exam- selected such that the cost of starting and terminat- ples of these systems include products like Teradata ing all of the instances of t,he operat,or is more than DBC/1012 [26], Tandem Hima,laya [13], IBM SP1/2 offset, by the performance improvement due t,o paral- [16], and research prot,otypes like Bubba [8], Ga.mma lelism. Since startup and termination costs are a func- [7], and Volcano [12]. Inlru-operafor parallelism (or t,ion of the configuration and workload, the degree of partitioned parallelism [4]) is a well-established tech- parallelism should change for different workloads and nique for increasing performance in these systems. Ba- configurations. In this paper, we present, several al- sically, by allowing the input data to be partitioned gorithms for determining the degree of parallelism of among multiple processors and memories, this tech- operat,ors. A detailed performance evaluation shows nique enables a database operator to be split into many that a new dynamic algorit,hm based on t,he concept, independent operators each working on a part of the of matching the rat,e of flow of tuples bet.ween oper- at,ors provides good performance across a variet,y of *This research was performed while the first author was at workloads and configurations. the University of Wisconsin-Madison and was supported by an IBM Research Initiation Grant. The primary objective of an algorithm for assigning Pwmission to copy without jet all OT part of this material is processors to operators is load balancing. Processors granted provided that the copies are not made or distributed for should be assigned t#ooperators such that, the workload direct commercial advantage, the VLDB copyright notice and is uniformly distributed and all the nodes are equally the title of the publication and its date appear, and notice is utilized. This paper presents several alternative meth- given that copying is by permission of the ,veery Large Data Base Endowment. To copy otherwise, or to republish, requires a jee ods of assigning processors to operators. The results and/or special permission jvom the Endowment. show that algorithms that utilize information about Proceedings of the Zlst VLDB Conference the system workload in assignment decisions perform Zurich, Swizerland, 1995 better than algorithms t,hat assign processors stati-

382 tally. also has the highest startup and termination costs a,nd The rest, of the paper is organized as follows. Sec- leads to the highest resource contention. Moreover, tion 2 presents the system archit,ecture of a typical it is a static algorithm that selects the same degree highly-para.llel database syst,em. The algorithms for of parallelism for all operators regardless of the query selectmingthe degree of parallelism are presented in Sec- type and the workload. tion 3 while algorithms for mapping operators t,o pro- 3.2 MinDp cessors are presented in Section 4. Section 5 discusses how these algorithms can be combined to produce a The degree of parallelism selected by this algorit,hm is comprehensive processor allocation scheme. The simu- equal to the minimum of the degree of parallelism of lation model used for the performance evaluation is de- all the input streams. For example, consider a binary scribed in Section G followed by a, description of the ex- hash-join query where the degrees of parallelism of the periment#al parameters in Section 7. Section 8 presents selects on the inner and outer relations are Inner-dp a performance evaluation of algorithms for determin- and Outer-dp, respectively. The MinDp algorithm will ing the degree of parallelism and Section 9 cont,ains the select the join’s degree of parallelism as min(Inner-dp, evaluation of algorithms for mapping the operators to Outer-dp). processors. Related work is discussed in Section 10 3.3 MaxDp and Section 11 contains our conclusions and sugges- The MaxDp algorithm is sets t,he degree of join paral- tions for future work. lelism to be the maximum of the degree of parallelism 2 System Architecture of the input &earns. Note that in the case of unary op- erators like store, the MaxDp and MinDp a.lgorithms Highly parallel systems are typically constructed using are identical. a sha,red-nothing [25] archit#ecture. The system con- sists of a set of external terminals from which trans- 3.4 RateMatch actions a.re submitted. The transactions a.re sent to The RateMatch algorithm is based on the idea of a. randomly-selected scheduling node. The execution matching processing rates of operators. If the rate of each transaction on the processing nodes is coor- at which tuples are sent to an operator is much higher dinated by a specialized process called the schedvler. than the rate at which the tuples can be processed by The scheduler allocates resources (memory and proces- the operator, incoming messa,ges will accumulate and sors) to the transaction and is responsible for sta.rting the message buffer can overflow, forcing the sender to and terminating all the operators in a t,ransaction. The block. On t,he other hand, if the rate at which t,u- processing nodes are composed of a CPU, memory, and ples are received by an operat,or is much lower than one or more disk drives’. There is no shared mem- the maximum possible processing rat,e, the operator ory/disk between the nodes, hence the term Sha.red- will frequently be idle and will wast,e syst,em resources Nothing. All inter-node communicat,ion is via message (specifically memory). By matching operator process- pa.ssing on the int,erconnection network. ing rates, the RateMatch algorithm prevents senders 3 Determining the Degree of Operator from blocking and, at the same time, conserves system Paralleliqm resources by avoiding idle operators. We next present. t,he formulas used by the RateM- In most shared-nothing database syst8ems, the only atch algorit.hm to calculate the rate at which t.uples are way t,o access data on a node is to schedule a select processed by the select and hash-join operators. Sim- operator on that node. This implies that the degree of ilar formulas can also be developed for other database parallelism of select operators is fixed a-priori by the operators. These formulas adjust the degree of paral- data placement mechanism. However, the degree of lelism based on the current, CPU utilization and disk parallelism for other operators, like joins and stores, response times, and therefore allow the RateMatch al- can be chosen independently of the initial data place- gorit,hm t.o adapt to different workloads and config- ment. We consider four algorithms for determining the urations. We first develop formulas for a single-user degree of parallelism,of such operators. system assuming no buffering of messages, and then 3.1 Maximum incorporate the effect of multiple users and message buffering. The degree of parallelism chosen by this algorithm is equal to t,he number of nodes in t#he system. Maxi- 3.4.1 Processing Rate of Select Operators mum therefore achieves t,he highest parallelism, but it The total time (in seconds) taken by each select oper-

‘For the rest of this paper, the term node is used to collec- ator to process a data page is tively refer to a processor, its l&xl memory and the attached set of disks.

383 where TIIO~,,,,, is the time taken to perform one I/O Maximum Memory Allocation: In the case of and Tcpuselect is the CPU time taken for processing maximum memory allocation, join operators do not one data page. Note that the above equation assumes perform any I/OS. During the build phase, the join a.n overlap in CPU and I/O processing. TI/o~,,,,~ is operators read each incoming tuple and insert them calculated as the sum of the time taken to initiate an into a hash table. Therefore, the time taken to pro- I/O request and the actual time taken to perform the cess a data packet in the build phase is I/O. Therefore, 1R’~ev + (IRead + IH~~~) * TupsPerPkt II/O TBuild = $ AvgDiskReadTime CPUSpeed TIl Osdect = CPUSpeed where Ill0 is the number of instructions needed to ini- where IReV is the number of in&ructions needed to re- tiate an I/O and CPUSpeed is the speed of the CPU ceive a data packet, IHash is t,he number of instructions in instructions per second (MIPS * 106). TcpuSelcct needed to hash a tuple (this assumes that the tuple is includes the time taken by the selects to examine each already in memory and no copying is required), a.nd tuple on the data page, apply the selection predicate, TupsPerPkt is the number of tuples in each network and send the selected tuples to the next operator. packet. In the probe phase, the join reads incoming tu- Therefore, ples and probes the hash table for matches. If a match is found, the result tuples are composed and sent to = (IRONS + Ipred) * TwPerPw T the parent operator. Therefore, CPU,,,,,, CPCTSpeed ISend * Selection,~electivity IPerTupde = IRead + I$+.& + Iconapode * Join.Sek + CPUSpeed I&o + IPerTuple * TupsPerPFt where IRead is the number of instructions for reading T Probe = CPUSpeed a tuple in memory, Ipred is the number of instructions for applying a predicate, ISend is the time taken to + knd * JoinSel send a page (including the cost of copying tuples to CPlJSpeed the network buffer), TupsPerPage is the number of where Ip&e is the number of instructions needed tuples in each page, and SelectionSelectivity is the to probe the hash table, Iconapose is the number of fraction of tuples that satisfy the selection predicate. instructions needed to produce a result tuple, and Therefore the total rate at which pages are pro- JoinSel is the join selectivity. cessed by the select operators is NSelect Minimum Memory Allocation: If the joins are ProcRateseleet = ~~eleel allocated their minimum memory allocation, the in- where Nseleet is the degree of parallelism of the select coming tuples are divided into disk-resident partitions. operator. Similarly, the total rate (in pages/second) Since select, operators send tuples to join. operators at which tuples are produced by the select operators only during the partitioning phase, the rate calcula- is tion ma,tches processing rates only for the partitioning phase. In both the build and probe phases, the incom- RateSelect = ProcRateSelect * SelectionSelectivity ing tuples are read, hashed and then written to the N select * SelectionSelectivity corresponding disk partition’. Therefore, = T select IRev + II/O Finally, since the rate at which tuples are produced TBuildll+-,,be = AvgDiskWriteTime + may be different in the build and probe phases due CPUSpeed to a different degrees of select parallelism in the build and probe phase, the above calculaGon is carried out +(IRead + IHash -t Icopy),* TupsPerPkt for each phase separately. These rates are denoted as CPUSpeed RateserectB,,,ld and Rateseleetp,,b,, respectively. where IcO’opYis the number of instructions needed to 3.4.2 Processing Rate of Hash-Join Operators copy the tuples to the outgoing disk page. Once the time t.aken for processing in each phase has The rate at which tuples are processed by a hash-join been calculated, the number of join processes needed operator depends on the amount of memory allocated to it. Here, we present formulas only for maximum ‘Recall that if joins are given their minimum memory alloca- tion, the build and probe phases refer to the initial phases that and minimum join memory allocation. The formulas distribute the inner and outer relations, respectively. The result for intermediate memory allocations can be developed tuples are produced in a third phase where each participating similarly. join processor processes its disk-resident partitions.

384 to absorb the incoming tuples in the build phase, messa.gebuffer overflow. Let, M be t,he size of the mes- NJOilZ&,ld 9 is calculated by equating the rat,e at which sage buffer, T the t,ime taken by the select to process tuples are processed by t,he joins t,o t.he rate at which all of the t,uples, and NJ~~ and Nselect are the the de- t,uples are produced by the select,s. grees of join and select, parallelism, respectively. The total number of message packets accumulated per sec- NJOi~EWd RateJoin,,,,,d = = RateselectB,,,,d ond at the join operators is the difference in the rate Tkhdd at which tuples are sent, by the select operat,ors and Therefore: the rate at which they are consumed by the join op- erators. Therefore, the total number of data packets l~Join&,,d = RateselectBulld * TBuild accumulated over the period of the query is given by:

Similarly: #AccumulatedMsgs =

NJoinprobe = RatwelectProbe * TProbe (Nse~eet * Rateseleet - NJG~ * RateJoin) *T If t,his number is equal to the total message buffer size Finally, the number of join sites should be such that of the join operators (i.e. M * NJ~~~~), there will be select, operators do not block in either the build or the no message overflow. Therefore, probe phase, so: #AccumulatedMsgs = M * NJoins l(l)

The total time taken to process the input is, in turn, where N,,, is the total number of nodes in the system. estimated as. Tseleet * InputSize 3.4.3 Extension to Multiple Users T= Nseleet The only extension needed in the formulas for a multiple-user syst.em is to modify the value of the where InputSize is the number of pages accessed from CPU CPUSpeed parameter to incorporate the fact t,he input relation. Substituting the value of T and that other users in the system will a.lso be using the #AccumulateMsgs in Equation 1 and simplifying. the CPU simultaneously. Note t.hat the effect of multiple degree of join parallelism is calculated as: users at the disk is already incorporated in the value of Nseleet * Rateselect* Tselect* InputSize the Average Disk Read/Write parameters. Therefore, N~oan= M * Nseleet + RateJ,i, * Tseleet * InputSize the value of CPUSpeed is modified to EffectiveCPUS- peed using the formula for service time, S(x), for a where Tselzet is the time taken by a select operator to task wit,h service demand x on an M/G/l server with process one data’page. round-robin scheduling [17], i.e.: 4 Processor Assignment Algorithms 2 S(x) = Once the degree of parallelism for an operator has been 1 - Utilization determined, each instance of an operator must, be as- where x is the service demand of the arriving job. signed to a specific processor. Six algorithms for pro- Therefore, the time taken for each CPU processing cessor assignment are considered in this paper: task should be modified using the following equat.ion. 4.1 Random CPlTInstructi0n.s The desired number of processors are chosen rahdomly TCPU = Ef fectiveCPUSpeed in this algorithm. Although the Random algorithm is simple to implement, it, can lead to load imbal- CPUInstructions ance (since it does not use any information about the = CPUSpeed * (1 - Utilization) present state of the system). 4.2 Round-Robin 3.4.4 Effect of Message Buffer Size The Round-Robin algorithm chooses processors in a The previous formulas assumed that there is no mes- round-robin fashion (i.e. if the first operator is exe- sage buffering and therefore t,hat the processing rates cuted on nodes l-10, the next operator is executed need to match exactly. However, in practice, the oper- on nodes 11 onwards). This algorithm distributes the ators buffer only a limited number of message packets. processing load bet.ter than Random, but it can also In this case. the join processing rate may be slower lead to load imbalances because it ignores the actua1 than the select, processing rate as long as there is no distribution of the load in the system.

385 4.3 Avail-Memory and disk-utilization is artificially increased by 5% in- between periods of sta.tistics collection. This algorithm The third algorithm was proposed in [24] and assumes is not useful for operators tha.t do not perform a signif- t,hat the processing load of an operator is proportional ica.nt amount of disk I/O. An example is a hash-join to its memory requirement and chooses the processors operator wit,h maximum memory allocation. Such a with the most free memory. This assumption is ap- join operator performs only CPU processing since the plicable t,o memory-intensive operat,ors like joins and input relations are read by separate select operators. sorts. Since memory allocation is performed at the scheduling nodes, memory utilization figures are al- 4.6 Input ready available to the query schedulers. Therefore, this algorithm does not entail any extra communica- The Input, strat.egy can only be used with the MinDp tion bet#ween the scheduling and processing nodes3. and the MaxDp algorithms. Recall that, the degree of parallelism selected by the MinDp and MaxDp algo- 4.4 CPU-Util rithms is equal to the degree of parallelism of one of the input opera,tors; the input operator w&h t(he maxi- The CPU-Util algorithm was first proposed in [23] and mum parallelism for MaxDp and minimum parallelism assigns the lea&utilized processors to an operator. for MinDp. The Input strategy executes an opera.tor The performance of this algorithm depends on the fre- on the same set of processors as the selected input op- quency with which CPU utilization statist,ics can be erator. For example, if the MinDp policy selects the updated at the scheduling nodes. Also. in order to inner relation data stream for a. hash-join operat,or, t(he prevent, two successive operat#ors from being scheduled Input, strategy will assign the join operator to the set on the same set of nodes, once a set, of processors have of nodes where the inner relation is being accessed. been chosen, their CPU utilizations are increased “ar- tificially”. This artificial increase in CPU utilization 5 Processor Allocation Strategies prevent,s successive operators from being scheduled on the sa.me set of processors, and it is cancelled the next The algorithms for determining the degree of operator time the st,a.tistics are updated. The amount by which parallelism can be combined with t,he algorithms for the utilization should be increased is difficult to esti- processor assignment to obtain a wide variety of pro- mate, however. If the amount is too low, it may not cessor allocation algorithms. Most, of the combined al- prevent two successive operators from being executed gorithms perform processor allocation in t,wo phases. on the same set of nodes. Conversely, if the amount The degree of parallelism is determined in the first is too high, it can lead to a large difference between phase. The degree of operator parallelism and the to- the a.ctual utilization of the node and the utilization as t,al memory allocation to the operator are then used seen by the query schedulers, leading the CPU-Util al- to determine the memory needed per processor. In gorithm t,o schedule queries on nodes that are already the second phase, a list is made of candidate proces- more heavily utilized. sors that have enough memory available in their buffer In order to select these parameters, we performed pools. Finally, the processor assignment algorithm is a detailed sensitivity analysis of CPU-Util [Meht94]. used to select a subset of the processors from the can- As a result of the ana.lysis, our simulation model up- didate list. If the number of processors in t#he candi- dates utilization statistics at the scheduling nodes ev- date list is less than the degree of parallelism of the ery 5 seconds. Also, once an operator is scheduled on operator, the query blocks and waits for memory to a node, its utilization is increased “artificially” by 5%. become available. The only exceptions to this process Not,e that this algorithm is not useful for operators like occur with the Maximum policy, which executes each store, that do not perform much CPU processing. operator on all processors, and the Input processor as- signment policy, which executes the operator on the 4.5 Disk-Util nodes where the input data stream is produced. The Disk-Util algorithm chooses the processors on the nodes with the least disk-utilization. Similar to 6 Simulation Model the CPU-Util algorithm, disk utilization statistics are The performance studies presented in this paper are reported to the scheduling nodes every 5 seconds, based on a detailed simulation model of a shared- nothing parallel database system. The simulator is 3 If there are multiple scheduling nodes, some communication is needed among the scheduling nodes to maintain an accurate written in the CSIM/C++ process-orient,ed simulation estimate of memory consumption at the processing nodes. How- language [SchwSO] and models the database system as ever, these costs are ignored in this paper since inter-scheduler a closed queueing system. The following sections de- communication for memory management is needed in all pro- scribe the configuration, database and workload mod- cessor allocation algorithms. els of the simulator in more detail.

386 6.1 Configuration Model “point’-t’o-point” and no broadcast mechanism is used The terminals model the external workload source for for communica.tion. Table 1 summarizes the configu- t,he syst,em. Ea.ch terminal sequentially submits a ration parameters and Ta.ble 2 shows the CPU instruc- stream of transactions. Each t.erminal has an expo- tion costs used in the simulator for various dat,a,base nentia.lly distributed “thinktime” t,o creat,e variations operations. in a.rrival rates. All experiments in this paper use a. Parameter 1 Value configuration consisting of 128 nodes. The nodes are 1 Number of Nodes I 128 modeled a.s a CPU, a buffer pool of 16 Mbyt.es4 with Memory Per Node 16 Mbytes 8 Kbyt.e data pages. and one or more disk drives. The CPU Speed 10 MIPS CPU uses a round-robin scheduling policy with a 5 Number of Disks per Node 1 Page Size 8 Kbytes millisecond timeslice. The buffer pool models a set of Disk Seek Factor 0.617 main memory page frames whose replacement is con- Disk Rotation Time 16.667 msec trolled via t#he LRU policy extended with “love/hat,e” Disk Settle Time 2.0 msec hints. These hints are provided by the various rela- Disk Transfer Rate 3.09 Mbytes/set Disk Cache Context Size 4 pages tional operators when fixed pages ark unpinned. For Disk Cache Size 8 contexts example, “love” hints are given by the index scan op- Message Wire Delay 500 psec erator t,o keep index pages in ‘memory; “hate” hints a.re used by the sequential scan operator to prevent Table 1: Simulator Parameters: Configuration buffer pool flooding. In , a memory reserva- 7 Experimental Parameters tion system under the control of the scheduler task 7.1 Configuration allows buffer pool memory to be reserved for a par- ticular operator. This memory reservation mechanism Although we have experimented with both a disk- is used by hash join operators to ensure t#hat enough intensive and CPU-intensive configuration, for the memory is available to prevent their hash table frames sake of brevity, results are presented in this paper only from being stolen by other operators. for a CPU-intensive configuration. The results of t,he The simulated disks model a Fujitsu Model M2266 disk-intensive configuration are briefly summarized in (1 Gbyte, 5.25”) disk drive. This disk provides a cache each performance section and t,he interestsed reader is t&hat, is divided into 32 Kbyt,e cache contexts for use referred bo [20] for detailed experimental results. The in prefetching pages for sequential scans. In the disk CPU-intensive configuration consists of a 10 MIPS model, which slightly simplifies the actual operation of CPU and four disks per processor. The CPU speed the disk. the cache is managed as follows: each I/O re- was chosen to be artificially low so that the processors quest, a,long with the required page number, specifies could be saturated with only 4 disks per node, thus whether or not prefetching is desired. If prefet#ching reducing the running time of the simulations time5. is requested. four pages are read from the disk into a A message buffer of 256 Kbytes is provided for each cache context as part, of transferring the page originally operator. This implies that at most, 32 8Kbyte pages requested from the disk into memory. Subsequent re- can be buffered by each operator. Operat,ors st,op send- quests to one of the prefetched blocks can then be ing messages when they detect that, the receiver’s mes- satisfied without incurring an I/O operation. A sim- sage buffer is full. ple round-robin replacement, policy is used t,o allocate 5For a faster processor, we would need to simulate many more cache contexts if the number of concurrent prefetch disks per processor. For instance, it takes upto 16 disks with requests exceeds the number of ava.ilable cache con- high I/O prefetching to saturate one Alpha AXP processor [5] texts. The disk queue is ma.naged using an elevator algorithm. The interconnection is modeled as an infinite band- Operation Instr. width net,work so t.here is no network contention for Initiate Select, Operator 20000 Terminate Select Operator 5000 messages. This is based on previous experience with Initiate Join Operator 40000 the GAMMA prototype [7] which showed that network Terminate Join Operator 10000 contsention is minimal in typical sha,red-nothing PDBs. Apply a Predicate 100 Messages do, however, incur an end-to-end transmis- Read Tuple from Buffer 300 Probe Hash Table 200 sion delay of 500 microseconds. All messages are Insert Tuple in Hash Table 100 Start an I/O 10000 4The simulated buffer pool size is smaller than buffer pools in Copy a Byte in Memory 1 typical configurations. Unfortunately, simulating a larger buffer Send(Receive) an SK Message 10000 pool size would require enormous amounts of resources. Some of our simulations took up to 36 MBytes and ran for 24 hours on an IBM RS/SOOO even with 16 Mbytes of memory per node. Table 2: Simulation Parameters: CPU Costs

387 7.2 Database The performance of all of the algorithms is examined A simplified database is used in all of the experiments. under various system loads by increasing the number Each database relat,ion contains five million tuples and of query terminals from 10 to 40. is fully declustered. Although the relations are fully 8 Determining Degree of Parallelism declustered, we model range queries to explore the ef- fect of reading data on only a subset of nodes. The This section presents a performance comparison of al- tuple size is fixed at 200 bytes and a clustered index is gorithms that, determine the degree of parallelism. The modeled on each relation. CPU-Util algorithm is used in all the experiments t,o perform processor assignment; the reason for using this 7.3 Workload algorithm here will become evident in Section 9. The workload contains only binary hybrid hash-join [6] 8.1 Maximum Memory Allocation queries. The hybrid hash-join method was chosen since it ha.s been shown to be superior to other join meth- The first experiment compares the performance of the ods. Binary join queries were chosen so that issues, algorithms on the Small workload (clustered index like pipelining and query scheduling, that arise while scans with 1% indexselectivity) when each query is processing complex queries could be ignored. This is a given its maximum memory allocation. We assume rea.sona.ble simplification as most, commercial database that range declustering is used and the degree of paral- systems execute queries comprised of multiple joins as lelism of the select operators varies uniformly between a series of binary joins, and do not pipeline tuples be- 1 and 128. Figure 1 shows the average query response t(ween adjacent joins in the query tree. Each binary time and the degree of join parallelism as the load in- join query is composed of two select, operators (one for creases from 10 to 40 terminals. the inner and one for the outer relation) plus a join Since the queries in t,his work1oa.d are small, st,artup operat,or and a store operator. The select operators and t,ermination costs form a large fraction of t,he execut,e wherever the input relations are declustered. query response time. Therefore, the rela.tive order Therefore, processor allocation needs t,o be determined of the algorithms is determined by the startup and only for the join and store operators. In order to termination costs, which, in turn, a.re determined by simplify this performance study, a simplistic proces- the degree of join parallelism. The Maximum algo- sor allocation policy is used for store operators - the rithm, which selects the highest degree of join par- store operator of each query executes on the same set allelism (128), has the highest startup and termina- of nodes as the join operator of the query. There- tion costs and, consequently, the highest, average query fore, the degree of parallelism and specific processor response time. The MaxDp and MinDp algorithms assignments need to be determined only for the join choose smaller degrees of parallelism (85 and 38, re- operator in a query. Moreover, the join selectivity has spectively) and therefore achieve layer average query been fixed at 1% to make the size of the join out- response times (as compared to the Maximum algo- put small to reduce the impact of store operators on rithm). The lowest query response time is achieved the performance results. Based on the results of ear- by the RateMatch algorithm. This algorithm realizes lier memory allocation studies [21] [28] [l] [3], joins are that the sizes of the inner and outer relations of t,he given either t,heir maximum or their minimum memory join queries are small, and that the join and select allocation. Three kinds of workloads are considered: processing rates can be ma,tched with a low degree of join parallelism. Moreover, unlike the other algo- Small, Medium and Large. Table 3 summarizes the rithms, the RateMatch algorithm dynamically adapts important parameters of these three workloads. to the query workload: it selects a higher degree of Workload Access Method indexselectivity join parallelism as the system load increases because Small Clustered Index Scan 1% the CPU-utilization increases. The degree of join par- Medium Clustered Index Scan 25% allelism increases from 25 to 32 as the 1oa.d increases Large File Scan N/A from 10 to 40 terminals. Table 3: Workload Parameters The next experiment explores the relative perfor- We also assume the presence of some mechanism mance of the algorithms on the Medium workload that, can be used to direct the select operators to only (clustered index scans with 25% indexSelectivity). a subset of the nodes.6. Therefore, the degree of select Figure 2 shows the average query response time and operator parallelism is chosen randomly from 1 to 128. the degree of join parallelism chosen by the algorithms. Note that, except for the RateMatch algorithm, the GEven though all the relations are fully declustered, select. degrees of join parallelism chosen by all the other al- operators can be directed to a subset of nodes in several cases (e.g. when range de&&wing [ll] is used to map tuples to gorithms are the same as in the previous experiment. relation partitions). This is, because, t-heir choice of the degree of paral-

388 4: + Maximum 140- ---&-- MaxDp 5;u : 0 l ---+--- MinDp 120- 3 :. - --A- - - RateMatch .-EaB 3- : 5 c : g IOO- 8 . Le 80- I) .-.-. -.a-.-.-.-m.- ._._. 1 ez : .2 2 2: : 2 60- .9 0 4 : t b 40- + .-.-.-. l -.-. -.-*.- .____ + s e 1: B A _____ *-----+-----A f 20- 4 :. 0- 01 0 10 20 30 40 0 IO 20 30 40 Number of Terminals Number of Terminals Figure 1: Performance of Algorithms for Determining the Degree of Parallelism - Small Workload Maximum Memory Allocation

40- + Maximum 140 3 -.-B--- MaxDp a m * - 0 k: ---+-- MinDp 120 b 1 g - - + - - RateMatch & 2 IOO- G 8 ce 80- ..-.-.-..-.-.-.-I.-.-.-.. 2 % 20- e 2 5 60- .5 0 *-----*t-----*-----A 8 4 $ 40- * .-.-. -.*-.- ._._ * ______+ Sk e d Y 20-

Figure 2: Performance of Algorithms for Determining the Degree of Parallelism - Medium Workload Maximum Memory Allocation lelism depends only on the data pla.cement and config- served in the previous workload since the queries were uration size, both of which remain static for all work- much smaller (1% indexselectivity) and the message loads. buffer size (256 KB) was large enough to prevent over- The performance results for the Medium workload flow. The RateMatch algorithm dynamically selects a are quit,e different from those of the Small workload. higher degree of join parallelism for this workload to The MinDp algorithm has the highest average query prevent the message buffers of the join operators from response time for the Medium workload for less than 30 overflowing. The MaxDp and Maximum algorithms terminals, followed by the Maximum, then the MaxDp, avoid message buffer overflow but, they incur higher and finally the RateMatch algorithm. The important startup and termination costs than the RateMatch al- thing to note is that MinDp has the worst perfor- gorit,hm since they select degrees of join parallelism mance, even though Figure 2 shows that it chooses that are “too” high. However, startup and termina- the smallest degree of join parallelism (38). tion costs are a smaller fraction of the total response time of the queries in this workload, so the average The inferior performance of MinDp is due to the query response times achieved by the Maximum and fa.ct it selects a degree of join parallelism that is so low MaxDp algorithms are only 17% and 9% higher, re- that the join operators cannot process tuples at the spectively, than those of the RateMatch algorithm. rate at which they are produced by the selects, so the message buffers of the join operators overflow. This However, as the number of terminals is increased, causes the select operators to block leading t#o lower the relative performance of the MinDp algorithm im- CPU and disk utilizations and higher query response proves as the load increases (more than 20 termi- times. Note that message buffer overflow was not ob- nals). This is, because, even though some of the op-

389 160- b Maximum 1407 _._. ..-.- Max&, 0 aw - 0 _._. +.-.- MinDp 120- ---A--- RateMatch % i IOO- $e 80- ..-.-.--~-.-.-.-..-.-.-.~ .B 2 60- 4 0 ~ __.___. i ___.-. -A--.-‘-. 0 p 40- ..-.-.-.*-.-.-.-..-.-.-.. a 20 1 04 ,.,...... ,....,....,...... ,.,..,.... ( 0 I . , . , . , . . , 0 10 20 30 40 0 IO 20 30 40 Number of Terminals Number of Terminals Figure 3: Performance of Algorit,hms for Det,ermining the Degree of Parallelism - Large Workload Maximum Memory Allocation

era,tors block in the MinDp a.lgorithm due to mes- select operators is the same for both workloads. The sage buffer overflow, there is enough concurrent, ac- rate depends on t,he degree of parallelism of t,he se- tivity in t,he syst,em t,o achieve high processor ut,iliza- lect operators. Since the degree of parallelism of select tion, and there is no increase in t,he average response operators is identical for bot,h workloads, t,he rat,e at, t,ime. The Maximum and Ma.xDp algorit8hms, on t.he which t,uples a.re sent. t,o t,he join opera,tors in t#he two ot,her hand, perform relatively worse because of their workloads is also identical. Therefore, the same degree higher sta.rt,up and termination c0st.s. The Ra,teM- of join pa,rallelism can be used for hot#h t,he workloads. a.tch algorithm chooses a. much smaller degree of join This implies that any algorithm which chooses the de- parallelism (compared to Maximum and MaxDp). and gree of join parallelism based on the size of t,he input therefore performs well. At the highest load of 40 ter- relations will be non-optimal. The degree of join paral- minals. RateMatch resu1t.s in an average response time lelism chosen by such an algorithm for the Large work- t,hat, is only 2% higher than the average response time load would be four t.imes that of t.he Medium workload, achieved by MinDp. while these experiments have shown t#hat the degree of The next experiment examines the performance of join parallelism should remain the same for both work- the algorithms on the Large query workload (file scans loads. wit#h 100% selectionSelect,ivity). Figure 3 shows the The last t,hree experiments have shown that the average query response time and the degree of join MinDp algorithm performs well for small-query work- parallelism for each algorithm. The resu1t.s show that loads but can lead to a underutilized system and higher RateMat,ch still has t,he best performance, but t,he query response times for larger queries (since it can performance of the other algorit#hms is much closer; underestimate the proper degree of join parallelism Maximum a.nd MaxDp provide response times that are causing select operators to block). The Maximum only ll’% and 9% higher, respectively. The reason is and MaxDp algorithms perform poorly for small query the same as before - file scans increase the execution workloads due to high startup and termina.tion cost,s time of t#he queries, so the effect of extra startup and but can provide reasonable performance as query sizes termination costs in the MaxDp and Maximum algo- increase. Finally, the RateMatch algorithm consis- rithms become less and less significant. The MinDp tently shows good performance. The RateMatch al- algorithm behaves similar to the last experiment. It gorithm performs well for the small query workload selects a low degree of parallelism, causing the select because it, reduces startup and termination costs. At operat,ors to block; thus, leading t(o high average query the same time, it can dynamically increase the degree response times at low query loads. As the load in- of parallelism for larger queries to prevent operators creases, t,he effect of blocking diminishes and MinDp from blocking. is able to achieve lower average query response times. 8.1.1 Minimum Memory Allocation It is interesting to note that the degree of join par- allelism chosen by the RateMatch algorithm for the In each of the previous experiments, queries were given Medium and for the Large workloads is nearly identi- their maximum memory allocation. The next exper- cal (even though the Large queries process nearly four iment compares the performance of the algorithms times the data processed by the Medium queries). The when queries are given their minimum memory allo- reason is t,hat the rate at which tuples are sent by the cation. Figure 4 shows the average query response

390 *‘I + Maximum 140- 3 -.-m--- MaxDp 8 120- c ---+--- MinDp p 60- E ---A--- RateMatch 2 IOO- F 2e 80- ..-.-.-.~‘-.-.-.-..-.-.-.~ .z 2 60- 0 P b 40- ..-.-.-.*-.-.-.-~.-.-.-.~ d 20- 4

010 10 20 30 40 OV IO 20 30 40 Number of Terminals Number of Terminals Figure 4: Performance of Algorithms for Determining the Degree of Parallelism - Medium Workloa,d Minimum Memory Allocation time and t)he degree of join parallelism for the medium small query workloads due to high startup and t,ermi- workload. Result#s are not presented for the Small nation cost,s, but t#hey provide rea.sona,ble performance work1oa.d since there is not much of a difference be- for larger query sizes. On the other hand, the RateM- tween maximum and minimum memory allocation for atch algorithm can dynamically adapt the degree of queries in the Small workload. Additionally, Large parallelism to provide good performance for bot.h small workload results were qualitatively very similar to and large query workloads. the Medium workload results and have therefore been If minimum memory allocation is used for queries. omitted. a higher degree of jqin parallelism improves response Figure 4 shows that, if queries are allocated their times. Therefore, the Maximum and RateMat#ch al- minimum memory allocation, the higher the degree of gorithms perform well. but, the MinDp and MaxDp join parallelism, the lower the average query response algorithms lead to higher response times. time. This occurs for two reasons. First, minimum The relat,ive performance of the algorithms is also memory allocation implies t#hat input relations must the same in a disk-int,ensive configuration [20] except be pa.rtitioned by the join operators into disk-resident, that all the algorithms have nearly identical perfor- buckets. Since writing out tuples t,o disk is slow, the mance when queries are given their maximum mem- processing rate of join operators decreases. Therefore, ory allocation. Response times are dominated by I/O too little join parallelism can cause the select opera- processing time in a disk-intensive configuration; since tors to block. Second, once partitioning of the input all t,he algorithms perform the same I/O processing if relat.ions is complete, a higher degree of join paral- queries are allocated their maximum memory alloca- lelism implies faster processing for each disk-resident tion, their performance is also identical. bucket,. Consequently, the Maximum and RateMatch algorithms, which both select high degrees of join par- 9 Determining Processor Assignment allelism (128 and 122, respectively), provide better performance than the MinDp and MaxDp algorithms, So far we have compared the performance of t,he al- which select lower degrees of join parallelism (85 and gorithms for selecting the degree of join parallelism. 38, respectively). This section presents a performance evaluation of the 8.2 Result Summary six processor assignment algorithms discussed in Sec- tion 4. The results of the previous experiments have shown that when queries are allocated their maximum mem- 9.1 Maximum Memory Allocation ory allocation, the MinDp algorithm performs well for small queries since the cost of startup and termina- The first experiment in t,his section compares the per- tion constitutes a large fraction of t,heir response time. forma.nce of the algorithms on the Small workload. However, the MinDp algorithm’s performance can de- Maximum memory allocation is used for the queries teriorate as the size of the input relations increases and the degree of select parallelism varies varies uni- because it can underestimate the degree of operator formly between 1 and 128. Figure 5 shows the perfor- parallelism, thus causing other operators t.o block. The mance of the various processor assignment algorit.hms Maximum and MaxDp algorithms perform poorly for when t,he RateMatch algorithm is used to determine

391 the degree of join parallelism7. As explained previ- gorithms. Results for the Large workload were quali- ously, the Input algorithm cannot be used with the tatively similar and ha,ve been omitted. Rate algorithm, so it is not shown in Figure 5. The 3oT ---*-- Random Disk-ITtil algorithm is also absent since it is used only 3 ---m-- RoundRobin L u with minimum memory allocation. 3 - ---e-- Avail-Mem ,Q $3 P : ---A--- CPU-Util ,I IN i= 20- $:,y’ ,I Y P5 , I ,.I,,c. ; ’t B 5,C,‘, * d ,rS, .’ e I *:* 2 z lo- ‘gc-’ 8 m e . % ---@--- Random - - -) - - Round-Robin 0 .“~““‘,““,“,‘,““““‘,,““““, 0 10 20 30 40 - - -+ - - Avail-Mem Number of Terminals ---A--- CPU-W Figure 6: Processor Assignment Algorithms Medium Workload, Maximum Memory Allocation Ofi Number of Terminals The resuhs of these experiments show that the CPU-Util algorithm for processor assignment, achieves Figure 5: Processor Assignment Algorithms Small Workload, Maximum Memory Allocation the lowest response times when queries are given t.heir maximum memory allocation. However, as query sizes Figure 5 shows that t,he Random algorithm leads increase, simpler algorithms like R.ound-Robin and t,o the highest response times since it. sometimes as- Avail-Mem ca.n also perform quite well. signs even heavily loaded processors to a join. The Round-Robin and Avail-Memory algorithms distribute 9.2 Minimum Memory Allocation the join workload more uniformly than the R.andom algorithm and therefore achieve lower response times. The next experiment with the CPTT-intensi.ve configu- However, both t,hese algorithms ignore the CPU load ration explores the performance of the processor allo- from the select operators, and thus do not perform cation algorithms when joins are given their minimum as well as the CPU-Util algorithm. The CPU-Util memory allocation. As in Section 8.1.1, results are re- achieves the lowest query response times, but it is ported only for the Medium workload (since there is only about 10% better than the Round-Robin algo- not much difference between the maximum and mini- rithm. This is because the CPU-Util algorithm uses mum memory allocations for the Small workload and utilization statistics that are updat,ed every 5 seconds. the results for the Large workload are similar to the Since queries in this workload are very small (1% se- results of Medium). Figure 7 shows the average query lectivity on the input relations), multiple queries often response times for the different processor assignment a.rrive in the system within the 5 second int,ervals when algorithms under various system loads. Since join op- the CPU-utilization statistics are out-of-date. These erators perform I/OS with minimum memory alloca- queries therefore get executed on nodes that do not tion, the performance of the Disk-Util algorithm is also necessarily have the lowest CPU-utilization. As a re- included. sult, the performance of the CPU-Util algorithm is Figure 7 shows that all the processor assignment al- only slightly better than the simpler Round-Robin al- gorithms have virtually identical performance in this gorithm. case. This is mainly because t,he number of proces- The relative performance of the processor assign- sors chosen by RateMatch is high for this workload. ment algorit,hms is also similar with the Medium work- The average degree of join parallelism is 122. There- load. Figure 6 shows the performance of the different fore, the set of processors chosen by the CPU-Util al- algorithms when the RateMatch algorithm is used to gorithm, for example, is not very different from the select the degree of join parallelism. CPU-Util pro- set of processors chosen by the Random algorithm. As vides the best performance in all the cases, followed a result, all of the algorithms have basically the same by the Avail-Mem, Round-Robin, and the Random al- performance for this workload. This experiment shows that as the degree of operator parallelism increases, the 7[20] also contains experimental results when other algo- rit.hms like MaxDp are used determine the degree of join paral- differences in the performance of the various processor lelism and the results are qualitatively very similar. allocation algorithms virtually disappear.

392 50-$ ---*-- Random The formula assumes that if S is the startup cost, of 3 : - - -) - - Round-Robin a’n operation, and P is the per-tuple processing cost,, 8 40: ---+-- Avail-Mem ,m PN i ---A--- CPU-Util 4’ then the optimal degree of parallelism, noFt = 2 8’ d- i; ---X--- Disk-UN Note that, t,his formula is based only on t,he size of t”hd t 30 z p” operand and disregards the rate of flow of tuples be- % #’ tween operators. The results presented earlier in the g.9 20 p” paper (Section 8) show that this can lead to exces- sively high degrees of parallelism. Rahm and Marek : l /4 “Me IO [23] study algorithms to determine the degree of par- 9 allelism for t,he join operator and also st,udy proces- d sor assignment algorithms. The authors consider only small queris and decrease the degree ofjoin parallelism based on the CPU-utilization in a multi-user environ- ment. Our results, however, show t,hat, reducing the Figure 7: Processor Assignment AlgoritShms degree of parallelism even in CPU-int,ensive configu- Medium Workload, Minimum Memory Allocation rations does not affect performa.nce significantly, es- pecially for large query sizes. Moreover, the algorit,hm 9.3 Result Summary presented in [23] uses the optimal parallelism in single- This section has presented the performance of several user mode as input. This parameter can be hard to processor assignment algorithms. The results show estimate especially since it is a complex of that the choice of the processor assignment algorithm the configuration, memory, and the memory alloca- has a significant performance impact only if the work- tion policy. Algorithms for processor and memory al- load is CPU-intensive and queries a.re given their max- location were also studied in [24]. [MurpSl] use the imum memory allocation. In this case, the CPU-Util concept of matching processing rates of operators in a algorithm, which selects the least utilized processors, query plan to determine buffer allocation. achieves the lowest response time.’ In all other cases9, the choice of a processor assignment algorithm has a 11 Conclusions small performance impact and therefore a simple algo- This paper has investigated the problem of managing rithm like Round-Robin is sufficient t#oobtain reason- intra-operator parallelism in a multi-query environ- able performance. Finally, the experiments show that the choice of the processor assignment algorithm is not ment for a parallel database system. Four algorithms as important as the choice of the algorithm used to de- for deciding the degree of operat,or para.llelism and six algorithms for selecting the assignment of operat,or in- cide the degree of join parallelism; the differences be- stances to processors were considered. A detailed per- tween the performance of the processor assignment al- formance evaluation of the algorithms showed that us- gorithms are smaller than the differences in the perfor- ing the RateMatch algorithm for deciding the degree mance of the algorithms for select,ing join parallelism. of parallelism and the CPU-Util algorithm for select- ing processors achieves the best performance irrespec- 10 Related Work tive of the workload and hardware configuration. The RateMatch algorithm calculates the degree of paral- Intra-operator parallelism and processor assignment lelism based on the rate at which tuples are processed has been an active area of database research. The by various operat,ors, while the CPU-Util algorithm topic has been studied extensively in t,he context of selects the processor with the least CPU-Utilization. load balancing in shared-everyt,hing systems [14] [19]. Both algorithms use information about the current Several researchers have focused on processor assign- system state and can therefore dynamically adapt to ment for queries with multiple join operators [lo] [22] different workloads. However, experiments also show [9]. Processor assignment has also been studied exam- that if the workload consists of large queries, or if the ined in t,he context of distributed database systems [2] configuration is disk-intensive, simpler allocation al- WI. gorithms like MaxDp can perform equally well. This A formula for determining the optimal degree of implies that processor allocation can be significantly parallelism of database operators was presented in [27]. simplified in several cases. 8These results explain the use of the CPU-Util algorithm in In this paper, the RateMatch algorithm was used all of the experiments comparing algorithms for selecting join only to determine the degree of join parallelism for bi- performance (Section 8). gThe experiments with the disk-intensive configuration [20] nary join queries in a shared-nothing system. However, also show that all the algorithms perform similarly. we feel that the paradigm of matching the rate of tu-

393 ple flow between operators can be used in other cases [ll] S. Ghandeharizadeh. Ph,ysical Database Design in also. For example, it can be used in a complex query Multiprocessor Systems. PhD thesis, TJniversity of to match the rat#e of flow bet,ween the opera.tors in a Wisconsin-Madison, 1990. parallel-query pipeline. Similarly, the algorithm can [la] G. Graefe. Volcano: An extensible and parallel be used to decide the degree of parallelism for other dataflow query processing system. Technical report, operators like sorts and aggregates. We plan to explore Oregon Graduate Cent,er, June 1989. these issues further in the future. Another direction of [13] Tandem Performance Group. A benchmark of non- future research is t,he application of the RateMatch stop sql on the debit credit transaction. In Proc. SIG- algorithm t,o shared-memory and sha,red-disk sy&ems. MOD Con!.. Chicago, IL, November 1988. The results presented in this st,udy have also shown [14] W. Hong. Exploiting inter-operation parallelism in the importance of decoupling processor allocation from xprs. In Proc. ACM SIGMOD Conj., San Diego, CA, data placement. The MinDp algorithm, for inst.ance, June 1992. can underestimate the degree of operator parallelism and cause high query response times. Similarly, the [15] Ii. A. Hua and C. Lee. An adaptive data placement MaxDp algorit,hm can overest,imate operator paral- scheme for parallel database computer systems. In Proc. VLDB Conf., Brisbane, Aust,ralia, 1990. lelism and lead to higher startup and termination costs. The decoupling of processor allocation a.nd [16] Int’l Business Machines. Scalable PO WERparollel Sys- da.ta placement can have a significant impact on sev- terns, GA23-2475-02 edition, February 1995. eral other area.s of research in shared-nothing paral- [li’] L. Kleinrock. Queueing Systems - Vol. 2: Computer lel database syst,ems as well. For example, all of the Applications. John Wiley and Sons, 1976. studies on declustering policies [ll] [15] also make the [18] H. Lu and M. Carey. Some experimental results on implicit, assumption that operations like joins are ex- dist,ributed join algorit,hms in a local network. In Proc. e&ed on the nodes where data is accessed. A re- VLDB Conf., Stockholm. Sweden. August 1985. exa.mination of the algorithms proposed in t,hese st,ud- ies will be required if this assumption is removed. [19] H. Lu and K. Tan. Dynamic and load-balanced task- orient,ed database query processing in paralle systems. References In Proc. EDBT Conf.. 1992. [l] K. Brown, M. Mehta, M. Carey, and M. Livnp. To- [20] M. Meh&. Resource illlocation for Parallel Shared- wards automated performance tuning for complex Nothing Database Systems. PhD thesis, Univer- workloads. In Proc. VLDB Conf., Santiago, Chile, sity of Wisconsin-Madison, 1994. available at. September 1994. http://www.cs.wisc.edu/ mmehta/mmehta.html. [2] M. Carey, M. Livny, and H. Lu. Dynamic task alloca- [21] M. Meht,a and D. Dewitt. Dynamic memory allo- tion in a distributed database system. In Proc. Intl. cation for multiple-query workloads. In Proc. VLDB Conf. On Disk. Computing Systems, May 1985. Conj., Dublin. Ireland, August 1993. [3] D. Davison and G. Graefe. Dynamic resource broker- [22] P. S. Yu Ming-Syan Chen and K. L. Wu. Schedul- ing for multi-user query execution. In Proc. SIGMOD ing and processor allocation for parallel execution of Cont., San Jose, Ca, May 1995. multi-join queries. In Proc. of the 8th Int. Conj. on [4] D. Dewitt and J. Gray. Parallel database systems: Data Engineering, Pheonix, AZ, February 1992. The future of high performance database systems. [23] E. Rahm and R. Marek. Analysis of dynamic load bal- CACM, 35(6), March 1992. ancing strategies for parallel shared nothing database [5] C. Nyberg et. al. Alphasort: A rise machine sort. In systems. In VLDB, Dublin, Ireland, August 1993. Proc. SIGMOD Conf., Minneapolis, MN, May 1994. [24] E. Rahm and R. Marek. Dynamic multi-resource load [6] D. Dewitt et. al. Implementation techniques for main balancing in parallel database systems. Technical Re- memory databases. In Proc. SIGMOD Conj., Boston, port 2 (1994), University of Leipzig, June 1994. MA, June 1984. [25] M. Stonebraker. The case for shared nothing. Data [i’] D. Dewitt et. al. The gamma database machine Engineering, 9(l), March 1986. project. IEEE Trans. on Knowledge and Data Engg., [26] Teradata Corp. DBC’/l012 Data Base Computer Sys- 2(l), March 1990. tem Manual, document no. clO-0001-02, release 2.0 [8] H. Boral et. al. Prototyping bubba, a highly paral- edition, November 1985. lel database system. IEEE Trans. on Knowledge and [27] A. Wilschut, J. Flokstra, and P. Apers. Paral- Data Engg., 2(l), March 1990. lelism in a main-memory dbms: The performance [9] Ming-Ling Lo et. al. On optimal processor allocation of prisma/db. In Proc. VLDB C&f., Vancouver, to support pipelined hash joins. In SIGMOD, pages Canada, August 1992. 69-78, Washington DC, June 199’3. [28] P. Yu and D. Cornell. Buffer management based on lo] Ming-Syan Chen et. al. Using segmented right,-deep return on consumption in a multi-query environment. trees for the execution of pipelined hash joins. In VLDB Journal, 2(l), January 1993. VLDB, Vancouver, Canada, August 1992.

394