Realization of DBS 11. Operations – Implementation

Theo Härder www.haerder.de

Goals
- Systematic development of relational processing concepts for a single table or for several tables
- Realization of plan operators

Main reference: Theo Härder, Erhard Rahm: Datenbanksysteme – Konzepte und Techniken der Implementierung, Springer, 2001, Chapter 11.

Goetz Graefe: Query Evaluation Techniques for Large Databases, ACM Computing Surveys 25:2, June 1993, pp. 73-170.

Realization of Database Systems – SS 2011 © 2011 AG DBIS

Table Operations - Implementation

■ Operations of the relational algebra
  - Unary operations: π, σ
  - Binary operations: ⋈, ×, ÷, ∩, ∪, −

■ SQL queries contain logical expressions which can be mapped to the operations of the relational algebra. These are further transformed into access plans. So-called plan operators implement these logical operations.

■ Plan operators on a single table
  - Selection

■ Operators across several tables

■ Join algorithms
  - Nested-loops join, sort-merge join
  - Hash join (classic hashing, simple hash join, hybrid hash join)
  - Exploitation of type-spanning access paths
  - Distributed join algorithms

■ Further binary operations (set operations)

Plan Operators on a Single Table

■ Selection: general ways of evaluation
  • Direct access via a given TID, via a hash method, or via a one- or multi-dimensional index structure
  • Sequential search in a table
  • Search via an index structure (index table, bitlist)
  • Selection using several pointer lists, where more than a single index structure can be exploited
  • Search via a multi-dimensional index structure

■ Projection
  is typically performed in combination with sorting, selection, or join

■ Modification
  • Updates are set-oriented in SQL, but restricted to a single table
  • INSERT, DELETE, and UPDATE are directly mapped to the corresponding operations of the storage structures
  • "Automatic" execution of maintenance operations
    - to maintain access paths,
    - to guarantee clustering, reorganization, etc.
  • Provisions for logging and recovery, etc.
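Selection using several pointer lists can be illustrated with a small sketch. It assumes in-memory dictionaries standing in for the table and its indexes; the names `build_index` and `select_via_tid_lists`, the `(page, slot)` TID layout, and the sample records are hypothetical and only serve the illustration. Only a conjunction (AND) of equality predicates is handled.

```python
from typing import Any, Dict, List, Set, Tuple

TID = Tuple[int, int]          # assumed TID layout: (page number, slot number)
Record = Dict[str, Any]
Index = Dict[Any, Set[TID]]    # attribute value -> TID list

def build_index(table: Dict[TID, Record], attr: str) -> Index:
    """Build a simple index that maps each attribute value to its TID list."""
    index: Index = {}
    for tid, record in table.items():
        index.setdefault(record[attr], set()).add(tid)
    return index

def select_via_tid_lists(table: Dict[TID, Record],
                         indexes: Dict[str, Index],
                         conjuncts: List[Tuple[str, Any]]) -> List[Record]:
    """Evaluate attr1 = v1 AND attr2 = v2 ... using one index per attribute.

    Requires at least one conjunct and an index for every attribute used.
    """
    # Locate the TID lists (of variable length), one per usable index.
    tid_lists = [indexes[attr].get(value, set()) for attr, value in conjuncts]
    # Boolean connection of the lists; starting with the shortest keeps
    # the intermediate hit list small.
    tid_lists.sort(key=len)
    hits = set(tid_lists[0])
    for tid_list in tid_lists[1:]:
        hits &= tid_list
    # Access the records according to the hit list, in page order.
    return [table[tid] for tid in sorted(hits)]

EMP = {
    (1, 0): {"Name": "Hans", "Dno": 47, "City": "Munich"},
    (1, 1): {"Name": "Anna", "Dno": 47, "City": "Frankfurt"},
    (2, 0): {"Name": "Otto", "Dno": 39, "City": "Munich"},
}
EMP_INDEXES = {attr: build_index(EMP, attr) for attr in ("Dno", "City")}
MATCHES = select_via_tid_lists(EMP, EMP_INDEXES, [("Dno", 47), ("City", "Munich")])
```

Records are fetched only after the lists have been combined, so pages without any hit are never touched.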

Plan Operators for the Selection

■ Use of scan operators
  • Definition of start and stop conditions
  • Definition of simple search arguments

■ Plan operators
  1. Table scan (relation scan)
     - Always possible
     - SCAN operator implements the selection operation
  2. Index scan
     - Selection of the most cost-effective index
     - Specification of the search range (start and stop condition)
  3. k-d scan
     - Evaluation of multi-dimensional search criteria
     - Use of differing evaluation directions by navigation
  4. TID algorithm
     - Evaluation of all "usable" index structures
     - Location of TID lists of variable lengths
     - Boolean connection of the lists
     - Access to the records according to the hit list (result list)

■ Further plan operators in combination with selection
  • Sorting
  • Grouping (see sort operator)
  • Special operators, e.g., in data-warehouse applications for grouping and aggregation (CUBE operator)

Operators Across Several Tables

■ SQL allows complex queries across k tables
  • One-variable expressions: describe conditions for the selection of elements of a table
  • Two-variable expressions: describe conditions for the combination of elements from two tables
  • Typically, k-variable expressions are decomposed into one- and two-variable expressions and evaluated by corresponding plan operators

■ Plan operators across several tables
  • General ways for the evaluation:
    - Nested iteration
      for each element of the outer table To: traversal of the inner table Ti
      • O(No · Ni + No)
      • important application: nested-loops join
    - Merge method
      interleaved traversals through T1, T2
      • O(N1 + N2)
      • additional sort costs, if necessary
      • important application: merge join
    - Hashing
      Partitioning of the inner table Ti and partition-wise loading into a hash table HT in memory. "Probing" by the outer table To or its partitions using HT
      → O(p · No + Ni)
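To get a feeling for the three cost expressions, the following sketch evaluates them for sample cardinalities. It treats the O-terms as plain element counts, ignoring constant factors and any sort costs; the function names and the sample sizes are assumptions made for illustration.

```python
def nested_iteration_cost(n_o: int, n_i: int) -> int:
    """O(No * Ni + No): the inner table is traversed once per outer element."""
    return n_o * n_i + n_o

def merge_cost(n_1: int, n_2: int) -> int:
    """O(N1 + N2): one interleaved pass over both (already sorted) tables."""
    return n_1 + n_2

def hashing_cost(n_o: int, n_i: int, p: int) -> int:
    """O(p * No + Ni): the inner table is read once, the outer table once per partition."""
    return p * n_o + n_i

N_OUTER, N_INNER, PARTITIONS = 1_000, 10_000, 3
COSTS = {
    "nested iteration": nested_iteration_cost(N_OUTER, N_INNER),
    "merge": merge_cost(N_OUTER, N_INNER),
    "hashing": hashing_cost(N_OUTER, N_INNER, PARTITIONS),
}
for method, cost in COSTS.items():
    print(f"{method:>16}: {cost:>10,}")
```

For these sizes, nested iteration touches about ten million elements, while the merge and hashing methods stay in the range of ten thousand, provided the sort order or the partitioning is available at acceptable cost.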

Operators Across Several Tables (2)

■ n-way joins
  • Decomposition into n-1 two-way joins²
  • Number of possible join sequences is dependent on the join attributes chosen
  • At most n! different sequences possible
  • Use of pipelining techniques
  • Optimal evaluation sequence dependent on
    - Plan operators
    - "Fitting" sort orders for join attributes
    - Size of operands, etc.

■ Some join sequences using two-way joins (n = 5)
  (figure: three join trees over T1, ..., T5, each delivering a result: left-deep tree, bushy tree, right-deep tree)

■ Analogous procedure in case of set operations

2. Practicality test (Guy Lohman test for join techniques): Does a new technique apply to joining three inputs without interrupting data flow between the join operators?

Plan Operators for the Join

■ Join
  • Record-type-spanning operation: usually very expensive
  • Frequent use: important optimization candidate
  • Typical application: equi-join
  • General Θ-join infrequent

■ The implementation of the join operation can process, at the same time, selections (and projections) on the participating tables R and S

    SELECT *
    FROM R, S
    WHERE R.JA Θ S.JA AND PR AND PS

  • JA: join attribute
  • PR and PS: predicates defined on selection attributes (SA) of R and S

■ Possible access paths
  • Scans over R and S (always)
  • Scans over IR(JA), IS(JA) (if present)
    → deliver sort sequence according to JA
  • Scans over IR(SA), IS(SA) (if present)
    → if necessary, fast selection for PR and PS
  • Scans over other index structures (if present)
    → if necessary, faster location of all records

Nested-Loops Join

■ Assumptions
  • Records in R and S are not ordered according to the join attributes
  • Index structures IR(JA) and IS(JA) do not exist

■ Algorithm for Θ-join

    Scan over S,
    for each record s, if PS:
        scan over R,
        for each record r, if PR AND (r.JA Θ s.JA):
            execute join, i.e., write combined record (r, s) into the result set.

■ Complexity: O(N · M)

■ Nested-loops join using index access

    Scan over S,
    for each record s, if PS:
        determine via access to IR(JA) all TIDs for records satisfying r.JA = s.JA,
        for each TID:
            fetch record r, if PR:
                write combined record (r, s) into the result set.

■ Nested-block join

    Scan over S,
    for each page (or set of contiguous pages) of S:
        scan over R,
        for each page (or set of contiguous pages) of R:
            for each record s of the S-page, if PS:
                for each record r of the R-page, if PR AND (r.JA Θ s.JA):
                    write combined record (r, s) into the result set.

Sort-Merge Join

■ Algorithm consists of 2 phases
  • Phase 1: Sorting of R and S w.r.t. R(JA) and S(JA) (if not already sorted); in doing so, early elimination of records not needed (PR, PS)
  • Phase 2: Interleaved scans over the sorted R- and S-records, where the join is performed in case of r.JA = s.JA

■ Complexity: O(N log N)

■ Special case
  If either IR(JA) and IS(JA) or a GAPS over R(JA) and S(JA) (join index) is present:
  → exploitation of the index structures on the join attributes

    Interleaved scans over IR(JA) and IS(JA):
    for each pair of keys from IR(JA) and IS(JA), if r.JA = s.JA:
        fetch the records using the related TIDs,
        if PR and PS:
            write combined record (r, s) into the result set.
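The nested-loops join and the sort-merge join can be compared side by side in a minimal in-memory sketch. It assumes records are dictionaries with the join attribute stored under the key `"JA"`, tables are Python lists, and PR and PS are passed as optional predicate functions; all names and sample values are chosen for illustration only. The sort-merge variant covers the equi-join case.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Pred = Callable[[Record], bool]

def always(_: Record) -> bool:
    return True

def nested_loops_join(R: List[Record], S: List[Record],
                      theta: Callable[[Any, Any], bool] = operator.eq,
                      p_r: Pred = always, p_s: Pred = always) -> List[Tuple[Record, Record]]:
    """Theta-join by nested iteration: S is the outer table, R the inner one."""
    result = []
    for s in S:                      # scan over S
        if not p_s(s):
            continue
        for r in R:                  # full scan over R for each qualified s
            if p_r(r) and theta(r["JA"], s["JA"]):
                result.append((r, s))
    return result

def sort_merge_join(R: List[Record], S: List[Record],
                    p_r: Pred = always, p_s: Pred = always) -> List[Tuple[Record, Record]]:
    """Equi-join in two phases: sort both inputs, then merge them."""
    # Phase 1: sort on JA and eliminate records not needed (PR, PS) early.
    rs = sorted((r for r in R if p_r(r)), key=lambda r: r["JA"])
    ss = sorted((s for s in S if p_s(s)), key=lambda s: s["JA"])
    # Phase 2: interleaved scans over the sorted inputs.
    result, i, j = [], 0, 0
    while i < len(rs) and j < len(ss):
        if rs[i]["JA"] < ss[j]["JA"]:
            i += 1
        elif rs[i]["JA"] > ss[j]["JA"]:
            j += 1
        else:
            # Collect the runs of equal join values on both sides.
            key, i_end, j_end = rs[i]["JA"], i, j
            while i_end < len(rs) and rs[i_end]["JA"] == key:
                i_end += 1
            while j_end < len(ss) and ss[j_end]["JA"] == key:
                j_end += 1
            result.extend((r, s) for r in rs[i:i_end] for s in ss[j:j_end])
            i, j = i_end, j_end
    return result

DEPT = [{"JA": 47, "Loc": "F"}, {"JA": 39, "Loc": "M"}, {"JA": 64, "Loc": "K"}]
EMP = [{"JA": 47, "Name": "Hans"}, {"JA": 47, "Name": "Anna"}, {"JA": 44, "Name": "Max"}]
NL_RESULT = nested_loops_join(DEPT, EMP)
SM_RESULT = sort_merge_join(DEPT, EMP)
```

Both functions deliver the same pairs for an equi-join; the sort-merge result additionally arrives ordered by the join attribute, which a subsequent operator may exploit.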

Hash Join

■ Simplest case (classic hashing)
  • Step 1: Partitioned read of the (smaller) table R and construction of a hash table using hH(r(JA)) w.r.t. the values of R(JA) for one partition Ri (1 ≤ i ≤ p): each partition fits into the available memory and each record satisfies PR
  • Step 2: Probing for records of S using PS; if successful, execution of the join
  • Step 3: Repeat steps 1 and 2 until R is exhausted

■ Construction of hash tables and probing
  Scan over R; building hash tables Hi (1 ≤ i ≤ p) one at a time in memory
  (figure: for each partition i = 1, ..., p of R, the hash table Hi is built in memory, followed by a scan over S with probing of Hi)

■ Complexity: O(p · N)

■ Special case: R fits into memory, i.e., one partition (p = 1)
  → a single scan over S is sufficient!

Hash Join (2)

■ Partitioning of R with hP(r(JA))
  (figure: two histograms; top: number of records per JA value over the JA domain from 0 to 100; bottom: number of records per JA' value after applying hP(r(JA)), over the range from 0 to 1, split at 0.33 and 0.66 into the partitions R1, R2, R3)
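The classic hash join can be sketched as follows. The sketch assumes that the partitions Ri are simply consecutive chunks of the qualified R-records, each holding at most `memory_capacity` records, and that a Python dictionary plays the role of the hash table built with hH; these choices, like all names in the block, are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Pred = Callable[[Record], bool]

def always(_: Record) -> bool:
    return True

def classic_hash_join(R: List[Record], S: List[Record], memory_capacity: int,
                      p_r: Pred = always, p_s: Pred = always
                      ) -> Tuple[List[Tuple[Record, Record]], int]:
    """Equi-join on "JA"; returns the result and the number of scans over S."""
    result: List[Tuple[Record, Record]] = []
    scans_over_s = 0
    qualified = [r for r in R if p_r(r)]          # each stored record satisfies PR
    # Step 3: repeat steps 1 and 2 until R is exhausted.
    for start in range(0, len(qualified), memory_capacity):
        # Step 1: build the hash table H_i for the next partition R_i.
        hash_table = defaultdict(list)
        for r in qualified[start:start + memory_capacity]:
            hash_table[r["JA"]].append(r)
        # Step 2: scan S completely and probe H_i.
        scans_over_s += 1
        for s in S:
            if p_s(s):
                for r in hash_table.get(s["JA"], ()):
                    result.append((r, s))
    return result, scans_over_s

DEPT = [{"JA": 47, "Loc": "F"}, {"JA": 39, "Loc": "M"}, {"JA": 64, "Loc": "K"}]
EMP = [{"JA": 47, "Name": "Hans"}, {"JA": 47, "Name": "Anna"}, {"JA": 44, "Name": "Max"}]
PAIRS, SCANS = classic_hash_join(DEPT, EMP, memory_capacity=2)
```

With three R-records and room for two, p = 2 partitions are formed and S is scanned twice; with `memory_capacity=3` a single scan over S suffices, which is the special case p = 1.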

Hash Join (3)

■ Partitioning
  • Partitioning of R into subsets R1, R2, ..., Rp: a record r of R is in Ri if h(r) is in Hi
  (figure: R is split into the partitions H1, H2, ..., Hp)

■ Why is this partitioning a critical operation?
  Which auxiliary operations may be required?
  Is the use of a hash function needed for partitioning?
  • Table S is partitioned with the same function hP while evaluating PS

Hash Join (4)

■ Variants of hash join are primarily distinguished by the kind of partitioning

■ Partitioning technique in case of simple hash join, shown for construction and probing of H1
  (figure: 1st iteration; step 1: the scan over R fills H1 and writes Rrest; step 2: the scan over S probes H1 and writes Srest)

■ Simple hash join
  • Step 1: Execute a scan on R (smaller table), evaluate PR, and apply hash function hP to each qualified record r. If hP(r(JA)) lies in the chosen range, insert the record into H1. Otherwise, write r to an output buffer for a file Rrest of "deferred" r-records.
  • Step 2: Execute a scan on S, evaluate PS, and apply hash function hP to each qualified record s. If hP(s(JA)) lies in the chosen range, search for a join counterpart (probing) in H1. If successful, form a join record and add it to the result. Otherwise, write s to an output buffer for a file Srest of "deferred" s-records.
  • Step 3: Repeat steps 1 and 2 for the next hash table Hi using the records deferred so far, until Rrest is exhausted. Here, evaluation of PR and PS is no longer required.
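The three steps of the simple hash join can be sketched in memory as follows. The sketch assumes that hP maps a join value into the interval [0, 1) and that iteration i handles the range [i/p, (i+1)/p); Python lists stand in for the files Rrest and Srest, and PR and PS are applied once before the first iteration. The names, the hash function, and the sample data are illustrative assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Pred = Callable[[Record], bool]

def always(_: Record) -> bool:
    return True

def simple_hash_join(R: List[Record], S: List[Record], p: int,
                     h_p: Callable[[Any], float],
                     p_r: Pred = always, p_s: Pred = always
                     ) -> List[Tuple[Record, Record]]:
    """Equi-join on "JA"; h_p must return values in [0, 1)."""
    result: List[Tuple[Record, Record]] = []
    r_rest = [r for r in R if p_r(r)]   # PR and PS are evaluated only once
    s_rest = [s for s in S if p_s(s)]
    for i in range(p):
        low, high = i / p, (i + 1) / p  # chosen range of this iteration
        h_i = defaultdict(list)
        r_next, s_next = [], []
        # Step 1: records in the range go into H_i, all others are deferred.
        for r in r_rest:
            if low <= h_p(r["JA"]) < high:
                h_i[r["JA"]].append(r)
            else:
                r_next.append(r)
        # Step 2: records in the range probe H_i, all others are deferred.
        for s in s_rest:
            if low <= h_p(s["JA"]) < high:
                result.extend((r, s) for r in h_i.get(s["JA"], ()))
            else:
                s_next.append(s)
        # Step 3: continue with the deferred records until R_rest is exhausted.
        r_rest, s_rest = r_next, s_next
        if not r_rest:
            break
    return result

def h_p_example(ja: int) -> float:
    """Hypothetical partitioning function for join values between 0 and 99."""
    return (ja % 100) / 100

DEPT = [{"JA": 47, "Loc": "F"}, {"JA": 39, "Loc": "M"}, {"JA": 64, "Loc": "K"}]
EMP = [{"JA": 47, "Name": "Hans"}, {"JA": 47, "Name": "Anna"}, {"JA": 64, "Name": "Ida"}]
SIMPLE_RESULT = simple_hash_join(DEPT, EMP, p=3, h_p=h_p_example)
```

Deferred records are rewritten in every iteration in which they do not qualify, which is the main cost of this variant when many iterations are needed.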

Hash Join (5)

■ Grace join
  • Partitioning of R and S takes place before the join starts
  • Partitions Ri and Si are stored in temporary files on disk
  • Construction of Hi (≤ M pages) in memory with Ri and probing with Si
  (figure: for each i = 1, ..., P, the hash table Hi is built from Ri, followed by a scan over Si with probing of Hi)

■ What is the minimal memory size required?
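A minimal sketch of the Grace join, assuming that Python lists stand in for the temporary partition files and that Python's built-in `hash` modulo p serves as the partitioning function hP. These choices and all names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Pred = Callable[[Record], bool]

def always(_: Record) -> bool:
    return True

def grace_join(R: List[Record], S: List[Record], p: int,
               p_r: Pred = always, p_s: Pred = always
               ) -> List[Tuple[Record, Record]]:
    """Equi-join on "JA" with a separate partitioning phase before the join."""
    # Partitioning phase: both tables are split completely with the same function.
    r_parts: List[List[Record]] = [[] for _ in range(p)]
    s_parts: List[List[Record]] = [[] for _ in range(p)]
    for r in R:
        if p_r(r):
            r_parts[hash(r["JA"]) % p].append(r)
    for s in S:
        if p_s(s):
            s_parts[hash(s["JA"]) % p].append(s)
    # Join phase: build H_i from R_i, then scan S_i with probing of H_i.
    result: List[Tuple[Record, Record]] = []
    for r_i, s_i in zip(r_parts, s_parts):
        h_i = defaultdict(list)
        for r in r_i:
            h_i[r["JA"]].append(r)
        for s in s_i:
            result.extend((r, s) for r in h_i.get(s["JA"], ()))
    return result

DEPT = [{"JA": 47, "Loc": "F"}, {"JA": 39, "Loc": "M"}, {"JA": 64, "Loc": "K"}]
EMP = [{"JA": 47, "Name": "Hans"}, {"JA": 47, "Name": "Anna"}, {"JA": 64, "Name": "Ida"}]
GRACE_RESULT = grace_join(DEPT, EMP, p=3)
```

Because matching records always fall into partitions with the same number, each Si needs to be probed against its own Hi only.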

Hash Join (6)

■ Hybrid hash join
  • Optimization such that construction and probing of H1 is done in parallel to partitioning
  1) a) Scan over R: R1 is constructed in H1 in memory; R2, R3, ..., RP are written out via a memory area of 1 page each
     b) Scan over S: immediate probing of the S1-records against H1; S2, S3, ..., SP are written out
  2) Scan over R2 to build H2, then scan over S2 with probing, as in case of Grace join
  3) ...
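The hybrid hash join differs from the Grace join only in the treatment of the first partition. The sketch below assumes Python lists as stand-ins for the partition files and the built-in `hash` modulo p as partitioning function; partition number 0 of the code corresponds to R1/S1 of the slide. All names are illustrative assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Pred = Callable[[Record], bool]

def always(_: Record) -> bool:
    return True

def hybrid_hash_join(R: List[Record], S: List[Record], p: int,
                     p_r: Pred = always, p_s: Pred = always
                     ) -> List[Tuple[Record, Record]]:
    """Equi-join on "JA"; the first partition is joined while partitioning."""
    result: List[Tuple[Record, Record]] = []
    h_1 = defaultdict(list)
    r_parts: List[List[Record]] = [[] for _ in range(p)]  # r_parts[0] stays empty
    s_parts: List[List[Record]] = [[] for _ in range(p)]  # s_parts[0] stays empty
    # 1a) Scan over R: the first partition goes directly into H_1,
    #     the other partitions are written out.
    for r in R:
        if p_r(r):
            i = hash(r["JA"]) % p
            if i == 0:
                h_1[r["JA"]].append(r)
            else:
                r_parts[i].append(r)
    # 1b) Scan over S: records of the first partition probe H_1 immediately,
    #     the other partitions are written out.
    for s in S:
        if p_s(s):
            i = hash(s["JA"]) % p
            if i == 0:
                result.extend((r, s) for r in h_1.get(s["JA"], ()))
            else:
                s_parts[i].append(s)
    # 2), 3), ...: remaining partitions are joined as in the Grace join.
    for i in range(1, p):
        h_i = defaultdict(list)
        for r in r_parts[i]:
            h_i[r["JA"]].append(r)
        for s in s_parts[i]:
            result.extend((r, s) for r in h_i.get(s["JA"], ()))
    return result

DEPT = [{"JA": 48, "Loc": "F"}, {"JA": 39, "Loc": "M"}, {"JA": 64, "Loc": "K"}]
EMP = [{"JA": 48, "Name": "Hans"}, {"JA": 48, "Name": "Anna"}, {"JA": 64, "Name": "Ida"}]
HYBRID_RESULT = hybrid_hash_join(DEPT, EMP, p=3)
```

In the sample data, the join value 48 falls into the first partition (48 mod 3 = 0), so Hans and Anna are joined during the partitioning scans, while the value 64 is handled in a later step.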

Hash Join - Example

■ I. Partitioning
  a) Partitioning of R with hP(r(JA))
     (figure: histogram of the number of records per JA value over the JA domain from 0 to 100, mapped by hP(r(JA)) to a histogram of the number of records per JA' value over the range from 0 to 1, split at 0.33 and 0.66 into R1, R2, R3)
  b) Partitioning of S with hP(s(JA))

■ II. Join
  1) H1: R1 (JA': 0.0 to 0.33) is loaded into memory with hH(r(JA)); S1 (JA': 0.0 to 0.33) is read, probing with hH(s(JA))
  2) H2: R2 (JA': 0.34 to 0.66), S2 (JA': 0.34 to 0.66)
  3) H3: R3 (JA': 0.67 to 1.0), S3 (JA': 0.67 to 1.0)

Use of Type-Spanning Access Paths

■ Join via link structures
  • Use of hierarchical access paths for equi-join

    Scan over R (owner table),
    for each record r, if PR:
        scan over related link structure LR-S(JA),
        for each record s, if PS:
            write combined record (r, s) into the result set.

■ Further methods
  • Join indexes which are built for certain Θ-joins

    Logical view         Index for TIDR (VIR)    Index for TIDS (VIS)
    R        S           R        S              S        R
    TIDr2    TIDs4       TIDr1    TIDs3          TIDs2    TIDr2
    TIDr1    TIDs3       TIDr2    TIDs2          TIDs3    TIDr1
    TIDr2    TIDs2       TIDr2    TIDs4          TIDs4    TIDr2
    TIDr2    TIDs6       TIDr2    TIDs6          TIDs6    TIDr2
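The join index with its two access directions can be sketched using the TID pairs of the example. The sketch assumes that both indexes are kept as sorted Python lists searched with `bisect`; the function `partners` is a hypothetical helper, and a real system would store the pairs in B*-trees or similar structures.

```python
from bisect import bisect_left
from typing import List, Tuple

# Logical view of the join index: (TID of R-record, TID of S-record)
LOGICAL_VIEW: List[Tuple[str, str]] = [
    ("TIDr2", "TIDs4"),
    ("TIDr1", "TIDs3"),
    ("TIDr2", "TIDs2"),
    ("TIDr2", "TIDs6"),
]

# VIR: pairs ordered by the R-TID; VIS: pairs ordered by the S-TID.
VI_R = sorted(LOGICAL_VIEW)
VI_S = sorted((s, r) for r, s in LOGICAL_VIEW)

def partners(index: List[Tuple[str, str]], tid: str) -> List[str]:
    """Return all join partners stored for the given TID in a sorted index."""
    # (tid,) sorts before every pair (tid, x), so this finds the first entry.
    position = bisect_left(index, (tid,))
    found = []
    while position < len(index) and index[position][0] == tid:
        found.append(index[position][1])
        position += 1
    return found

S_PARTNERS_OF_R2 = partners(VI_R, "TIDr2")
R_PARTNERS_OF_S3 = partners(VI_S, "TIDs3")
```

Keeping both orderings lets the join start from either table: VIR answers "which S-records belong to this R-record", and VIS answers the reverse question.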

Use of Type-Spanning Access Paths (2)

  • Use of generalized access path structures (GAPS)
  (figure: tree-structured index over key values with root key K53 and the keys K25, K36, K47, K58, K78, K88 on the next level; a leaf entry for key K55 holds the list lengths 1, 3, 1, 4 followed by the TID lists for the tables Dept, Mgr, Emp, and Equipment; leaf pages carry PRIOR and NEXT pointers and an optional reference to an overflow page)

Join Algorithms - Comparison

(figure: search space spanned by input stream 1 (e11, e12, e13, ...) and input stream 2 (e21, e22, e23, ...) for (a) nested-loops join, (b) merge join, and (c) hash join with hash partitions; the marks distinguish element comparisons from successful element comparisons)

■ Nested-loops join is always applicable; however, the complete search space has to be scanned.

■ Merge join has the lowest search costs but requires sorted input streams. Index structures on both join attributes satisfy this prerequisite. Otherwise, explicit sorting of both tables w.r.t. the join attributes reduces the cost advantage substantially. Nevertheless, sort-merge join can have additional advantages if the result is required in sorted sequence and sorting the large result is more expensive than sorting the two smaller inputs.

■ Hash join partitions the search space. Fig. c assumes that the same hash function h is applied to tables R and S. The partition size of the (smaller) table is given by the available buffer size in memory. A reduction of the partition size, to approximate case b, causes higher preparation costs and is therefore not advisable.

Join Algorithms in Distributed DBS

■ Problem statement
  • Query at node K, which requires a join between (sub-)table R at node KR and (sub-)table S at node KS
  • Determination of the processing node: K, KR, or KS

■ Determination of evaluation strategy
  • Send the participating tables completely to one node and compute the join locally ("ship whole")
    - Minimal number of messages
    - Very high transfer volumes
  • For every join value in the first table, request the related records from the second table ("fetch as needed")
    - Large number of messages
    - Only relevant records are considered
  • Trade-off solution: semi-join or extensions such as bit-vector join (hash filter join)

■ Semi-join
  • Shipping of a list of JA values of R to the node of S
  • Determination of join counterparts in S and returning them to the node of R
  • Then join processing at the node of R

■ Bit-vector join
  • Similar to semi-join, but only a bit vector (Bloom filter) created using a hash function is shipped
  • Returning a superset of join counterparts in S

Semi-Join and Bit-Vector Join

(figure, top: semi-join. Dept (Dno, Loc, Mgr) at Frankfurt holds the Dno values 47, 39, 64; Emp (Dno, Name, Address, Phone) at Munich holds the Dno values 69, 28, 75, 47 (Hans), 47 (Anna), 44. Frankfurt ships the whole JA column (47, 39, 64) to Munich; Munich finds the join counterparts and returns projections of the join counterpart records (47 Hans, 47 Anna); the join is executed at Frankfurt.)

(figure, bottom: bit-vector join. Frankfurt creates the bit vector 1 0 0 1 1 0 0 0 by hashing the Dno values 47, 39, 64 of Dept and ships it to Munich; Munich hashes the Dno values 69, 28, 75, 47, 91, 44 of Emp to find potential join candidates and returns them; Frankfurt checks the candidates and executes the join.)
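Both strategies can be simulated in a single process to see what is shipped in each direction. The sketch uses the Dno values of the semi-join example; the Loc values, the names of the non-matching employees, and the hash function `v % 5` with a 5-bit vector are assumptions made for illustration, so the bit vector differs from the one shown in the figure.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]

DEPT = [{"Dno": 47, "Loc": "L1"}, {"Dno": 39, "Loc": "L2"}, {"Dno": 64, "Loc": "L3"}]
EMP = [{"Dno": 69, "Name": "E1"}, {"Dno": 28, "Name": "E2"}, {"Dno": 75, "Name": "E3"},
       {"Dno": 47, "Name": "Hans"}, {"Dno": 47, "Name": "Anna"}, {"Dno": 44, "Name": "E4"}]

def local_join(dept: List[Record], emp: List[Record]) -> List[Tuple[Record, Record]]:
    """Final join at the node of Dept."""
    return [(d, e) for d in dept for e in emp if d["Dno"] == e["Dno"]]

def semi_join(dept: List[Record], emp: List[Record]):
    """Ship the JA values, return exactly the join counterparts."""
    shipped_ja = {d["Dno"] for d in dept}                  # Frankfurt -> Munich
    returned = [e for e in emp if e["Dno"] in shipped_ja]  # Munich -> Frankfurt
    return returned, local_join(dept, returned)

def bit_vector_join(dept: List[Record], emp: List[Record], m: int = 5,
                    h: Callable[[int], int] = lambda v: v % 5):
    """Ship only a bit vector, return a superset of the join counterparts."""
    bits = [0] * m
    for d in dept:
        bits[h(d["Dno"])] = 1                              # Frankfurt -> Munich
    candidates = [e for e in emp if bits[h(e["Dno"])]]     # Munich -> Frankfurt
    return bits, candidates, local_join(dept, candidates)  # check + join

SEMI_RETURNED, SEMI_RESULT = semi_join(DEPT, EMP)
BITS, CANDIDATES, BV_RESULT = bit_vector_join(DEPT, EMP)
```

With this hash function, the employees with Dno 69 and 44 collide with the bit set for 39 and 64, so they are returned as false candidates and discarded by the final check; the join result is the same as for the semi-join.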

Set Operations³

■ Which set operations are needed?
  R, S: union-compatible input streams
  A, B, C: element sets
  (figure: Venn diagram of R and S with the regions A, B, and C)

  operation result   matching in all attributes   matching in one or several attributes
  A                  difference (R-S)             anti-semi-join (S, R)
  B                  intersection                 join, semi-join (S, R)
  C                  difference (S-R)             anti-semi-join (R, S)
  A, B               -                            left-sided outer join
  A, C               anti-difference              anti-join
  B, C               -                            right-sided outer join
  A, B, C            union                        symmetrical outer join

■ Which algorithms can be used for these set operations?
  • What has to be compared at a time?
  • How can a relationship to the join algorithms be found?

3. Graefe, G.: Query evaluation techniques for large databases, ACM Computing Surveys 25:2, 1993, pp. 73-170

Set Operations (2)

■ Binary matching operations
  • Solve the same task, in principle: "one-to-one matching operations"
  • An input element contributes to the output depending on its "match" with another input element
  • The operations repeatedly require the same steps and, therefore, can be implemented using the same algorithms
  • Set and join operations are closely connected!

■ Same logical procedure
  • Three element sets are formed from R and S: A, B, C
  • Elements in B fit together!
  • How can these three element sets be formed?
    - Using nested iteration
    - Using merge method
    - Using hash method

■ Unified realization concept
  • Comparison of join attributes vs. primary-key attributes
  • Commonality: records are grouped on the basis of attribute values
  • Some unary operations are possible with special measures
    - Grouping and sorting enable simple duplicate elimination
    - In case of aggregation, an attribute value per group is determined
    - In case of join, grouping of potential join counterparts is cost-effective (either in partitions or in a sort order)
    - Using set operations, the element sets A, B, C can be found; at the same time, duplicate elimination is possible
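How the element sets A, B, and C arise from a single build-and-probe pass can be sketched with the hash method. The sketch assumes that records are hashable values (for example tuples) that match in all attributes; the function name `element_sets` and the sample inputs are illustrative.

```python
from typing import Dict, Hashable, List, Tuple

def element_sets(R: List[Hashable], S: List[Hashable]
                 ) -> Tuple[List[Hashable], List[Hashable], List[Hashable]]:
    """Return (A, B, C): only in R, in both inputs, only in S; duplicates removed."""
    # Build: hash table over R; the flag records whether a match in S was found.
    matched: Dict[Hashable, bool] = {r: False for r in R}
    # Probe: every element of S either marks its match or belongs to C.
    only_in_s: Dict[Hashable, None] = {}
    for s in S:
        if s in matched:
            matched[s] = True
        else:
            only_in_s[s] = None
    a = [r for r, hit in matched.items() if not hit]
    b = [r for r, hit in matched.items() if hit]
    return a, b, list(only_in_s)

A, B, C = element_sets([1, 2, 3, 3], [3, 4, 4, 5])

DIFFERENCE_R_S = A          # R - S
INTERSECTION = B
DIFFERENCE_S_R = C          # S - R
ANTI_DIFFERENCE = A + C
UNION = A + B + C
```

The dictionaries remove duplicates as a side effect, which corresponds to the observation that duplicate elimination is possible at the same time.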

Summary

■ Selection operations
  • Existing access path types require tailor-made operations and efficient mapping
  • Combination of various access paths possible (TID algorithm)

■ General classes of evaluation methods for binary operations
  • Nested iteration
  • Merge method
  • Hashing

■ Many options for processing of join operations
  • Nested-loops join
  • Sort-merge join
  • Hash join
  • And variations

■ Set operations
  • Use of the same algorithm classes, in principle
  • Variation of executing comparisons

■ Extensibility infrastructure in object-relational DBMS
  • Creation of user-defined functions and operators
  • Generalization: user-defined table operators with n input tables and m output tables