<<

Multi-Dimensional Indexing

Torsten Grust Chapter 5 Multi-Dimensional Indexing k Indexing Points and Regions in Dimensions B+-trees...... over composite keys Architecture and Implementation of Database Systems Point Quad Trees Point (Equality) Search Summer 2016 Inserting Data Region Queries

k-d Trees Balanced Construction

K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees Searching and Inserting Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Torsten Grust Range Queries Wilhelm-Schickard-Institut für Informatik Spaces with High Dimensionality

Universität Tübingen Wrap-Up 1 Multi-Dimensional More Dimensions. . . Indexing

Torsten Grust

One SQL query, many range predicates

1 SELECT* 2 FROM CUSTOMERS 3 WHERE ZIPCODE BETWEEN 8880 AND 8999 B+-trees...... over composite keys 4 AND REVENUE BETWEEN 3500 AND 6000 Point Quad Trees Point (Equality) Search range predicate two Inserting Data This query involves a in dimensions. Region Queries

k-d Trees Balanced Construction Typical use cases for multi-dimensional data include K-D-B-Trees K-D-B-Tree Operations • geographical data (GIS: Geographical Information Splitting a Point Page Splitting a Region Page

Systems), R-Trees multimedia retrieval e.g. Searching and Inserting • ( , image or video search), Splitting R-Tree Nodes • OLAP (Online Analytical Processing), UB-Trees Bit Interleaving / Z-Ordering • queries against encoded data (e.g., XML) B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 2 Multi-Dimensional . . . More Challenges. . . Indexing

Torsten Grust Queries and data can be points or regions

B+-trees...... over composite keys point queryregion query Point Quad Trees region inclusion Point (Equality) Search region data or intersection Inserting Data Region Queries point data k-d Trees Balanced Construction

K-D-B-Trees most interesting: k-nearest-neighbor search (k-NN) K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees Searching and Inserting . . . and you can come up with many more meaningful types of Splitting R-Tree Nodes UB-Trees queries over multi-dimensional data. Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Note: All equality searches can be reduced to one-dimensional queries. Range Queries Spaces with High Dimensionality

Wrap-Up 3 Multi-Dimensional Points, Lines, and Regions Indexing Tarifzonen | Fare zones Torsten Grust

FEUERTHALEN

FLURLINGEN16 SCHLOSS LAUFEN A. RH.

DACHSEN

RHEINAU STAMMHEIM MARTHALEN OSSINGEN 62 OBERSTAMM- 15 61 HEIM OBERNEUNFORN WIL ZH RAFZ HÜNTWANGEN KLEINANDELFINGEN 14 ANDELFINGEN WASTERKINGEN HÜNTWANGEN- RÜDLINGEN WIL 24 60 FLAACH THALHEIM-ALTIKON KAISERSTUHL AG 13 BUCHBERG ALTIKON HENGGART B+-trees. . . ZWEIDLEN ELLIKON EGLISAU DINHARD AN DER THUR TEUFEN 63 GLATTFELDEN HETTLINGEN . . . over composite keys 18 SEUZACH RICKENBACH-ATTIKON STADEL NEFTENBACH REUTLINGEN GUNDETSWIL BEI NIEDERGLATT BÜLACH 23 WALLRÜTI Point Quad Trees BACHS WIESENDANGEN EMBRACH- PFUNGEN- OBERWINTERTHUR NEFTENBACH ELSAU HÖRI RORBAS HEGI ELGG NIEDERWENINGENNIEDERWENINGENDORFSCHÖFFLISDORF- GRÜZE OBERWENINGEN 12 WÜLFLINGEN Point (Equality) Search STEINMAUR RÄTERSCHEN SCHOTTIKON NIEDERGLATT OBEREMBRACH TÖSS SCHLEINIKON SEEN ELGG DIELSDORF Inserting Data * EIDBERG 64 REGENSBERG NIEDERHASLI BRÜTTEN 17 20 SCHLATT BEI SENNHOF-KYBURG BOPPELSEN FLUGHAFEN 21 WINTERTHUR KLOTEN Region Queries BUCHS- KOLLBRUNN DÄLLIKON NÜRENSDORF KYBURG OTELFINGEN OBERGLATT RÄMISMÜHLE- RÜMLANG KEMPTTHAL ZELL HÜTTIKON TURBENTHAL OTELFINGEN 11 BALSBERG RIKON GOLFPARK GLATTBRUGG WEISSLINGEN REGENSDORF- BASSERSDORF k-d Trees OETWIL AN WATT AFFOLTERN OPFIKON DIETLIKON 70 SITZBERG DER 22 71 WALLISELLEN WILDBERG SPREITENBACH SEEBACH WILA SHOPPING CENTER GEROLDSWIL EFFRETIKON Balanced Construction UNTERENGSTRINGEN * ILLNAU DIETIKON 10 OERLIKON 84 SCHLIEREN ALTSTETTEN 21 VOLKETSWIL 35 STELZENACKER WIPKINGEN DÜBENDORF RUSSIKON SALAND STERNENBERG GLANZENBERG HARDBRÜCKE STETTBACH FEHRALTORF K-D-B-Trees BERGDIETIKON URDORF ZÜRICHHB 30 HITTNAU WIEDIKON SCHWERZENBACH ZH REPPISCHHOF SELNAU BAUMA STADELHOFEN NÄNIKON-GREIFENSEE UITIKON WEIHERMATT FÄLLANDEN K-D-B-Tree Operations WALDEGG PFÄFFIKON ZH GIESSHÜBEL BIRMENSDORF ZH ENGE 31 REHALP STEG TIEFENBRUNNEN USTER 72 Splitting a Point Page ZOLLIKERBERG AATHAL BÄRETSWIL 54 WOLLISHOFEN 30 KEMPTEN FISCHENTHAL ZUMIKON MAUR Splitting a Region Page BONSTETTEN- LEIMBACH GOLDBACH WETTSWIL 40 73 KILCHBERG WETZIKON KÜSNACHT FORCH GOSSAU ZH RINGWIL GIBSWIL MÖNCHALTORF OBERLUNKHOFEN SOOD- ZH ERLENBACH ZH OBERLEIMBACH WINKEL EGG R-Trees 55 RÜSCHLIKON AM ZÜRICHSEE 32 HINWIL ADLISWIL 50 - 33 HEDINGEN ESSLINGEN 34 SIHLAU GRÜNINGEN OBERDÜRNTEN OTTENBACH WILDPARK-HÖFLI Searching and Inserting 41 42 WALD FALTIGBERG AFFOLTERN LANGNAU-GATTIKON OETWIL 43 UETIKON TANN-DÜRNTEN AM OBERRIEDEN DORF AM SEE Splitting R-Tree Nodes 33 OBFELDEN AEUGST SIHLWALD MÄNNEDORF RÜTI ZH AM ALBIS HORGEN OBERDORF SIHLBRUGG 56 AU ZH METTMENSTETTEN HAUSEN 51 AM ALBIS STÄFA FELDBACH 80 WAGEN UB-Trees MASCHWANDEN WÄDENSWIL KNONAU KAPPEL 52 BLUMENAU AM ALBIS Bit Interleaving / HIRZEL 83 BURGHALDEN RICHTERSWILBÄCH 82 Z-Ordering PFÄFFIKON 53 GRÜNENFELD FREIENBACH SCHÖNENBERG ZH WOLLERAU SAMSTAGERN B+-Trees over Z-Codes SZ

Bahnstrecke mit Bahnhof Railway with Station SOB HÜTTEN Regionalbus Regional bus SCHINDELLEGI- 81 Range Queries FEUSISBERG Schiff Boat 51 Zonennummer 51 Zone number Die Tarifzonen 10 (Stadt Zürich) Fare zones 10 (Zurich City) and 20 10 10 Spaces with High * und 20 (Stadt Winterthur) werden * (Winterthur City) each count as two für die Preisberechnung doppelt zones when calculating the fare. 20* 20* gezählt. Dimensionality

Verbundfahrausweise berechtigen You can travel as often as you like on während der Gültigkeitsdauer zu your ZVV ticket within the fare zones beliebigen Fahrten in den gelösten and period shown on the ticket. Zonen. © Zürcher Verkehrsverbund/PostAuto Region Zürich, 12.2007 Wrap-Up 4 Multi-Dimensional . . . More Solutions Indexing

Torsten Grust

Recent proposals for multi-dimension index structures Quad Tree [Finkel 1974] K-D-B-Tree [Robinson 1981] R-tree [Guttman 1984] Grid File [Nievergelt 1984] + B+-trees. . . R -tree [Sellis 1987] LSD-tree [Henrich 1989] . . . over composite keys

R*-tree [Geckmann 1990] hB-tree [Lomet 1990] Point Quad Trees Point (Equality) Search Vp-tree [Chiueh 1994] TV-tree [Lin 1994] Inserting Data Π UB-tree [Bayer 1996] hB- -tree [Evangelidis 1995] Region Queries k-d Trees SS-tree [White 1996] X-tree [Berchtold 1996] Balanced Construction

M-tree [Ciaccia 1996] SR-tree [Katayama 1997] K-D-B-Trees K-D-B-Tree Operations Pyramid [Berchtold 1998] Hybrid-tree [Chakrabarti 1999] Splitting a Point Page DABS-tree [Böhm 1999] IQ-tree [Böhm 2000] Splitting a Region Page Slim-tree [Faloutsos 2000] Landmark file [Böhm 2000] R-Trees Searching and Inserting P-Sphere-tree [Goldstein 2000] A-tree [Sakurai 2000] Splitting R-Tree Nodes UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes None of these provides a “fits all” solution. Range Queries Spaces with High Dimensionality

Wrap-Up 5 + Multi-Dimensional Can We Just Use a B -tree? Indexing

Torsten Grust • Can two B+-trees, one over ZIPCODE and over REVENUE, adequately support a two-dimensional range query?

Querying a rectangle

B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search Inserting Data Region Queries

k-d Trees Balanced Construction

K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page • Can only scan along either index at once, and both of them R-Trees Searching and Inserting produce many false hits. Splitting R-Tree Nodes UB-Trees Bit Interleaving / • If all you have are these two indices, you can do index Z-Ordering intersection: B+-Trees over Z-Codes perform both scans in separation to obtain the Range Queries

rids of candidate tuples. Then compute the intersection Spaces with High between the two rid lists (DB2: IXAND). Dimensionality Wrap-Up 6 Multi-Dimensional IBM DB2: Index Intersection Indexing

Torsten Grust

IBM DB2 uses index Plan selected by the intersection (operator IXAND) query optimizer B+-trees. . . 1 SELECT* . . . over composite keys 2 FROM ACCEL Point Quad Trees 3 WHERE PRE BETWEEN0 AND 10000 Point (Equality) Search Inserting Data 4 AND POST BETWEEN 10000 AND 20000 Region Queries

k-d Trees Relevant indexes defined over table Balanced Construction K-D-B-Trees ACCEL: K-D-B-Tree Operations Splitting a Point Page 1 IPOST over column POST Splitting a Region Page R-Trees 2 SQL0801192024500 over Searching and Inserting column PRE (primary key) Splitting R-Tree Nodes UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 7 Multi-Dimensional Can Composite Keys Help? Indexing Indexes with composite keys Torsten Grust

REVENUE REVENUE

B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search Inserting Data Region Queries

ZIPCODE ZIPCODE k-d Trees Balanced Construction

hREVENUE, ZIPCODEi index hZIPCODE, REVENUEi index K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees • Almost the same thing! Searching and Inserting Splitting R-Tree Nodes

Indices over composite keys are not symmetric: The major UB-Trees + Bit Interleaving / attribute dominates the organization of the B -tree. Z-Ordering B+-Trees over Z-Codes • But: Since the minor attribute is also stored in the index Range Queries (part of the keys k), we may discard non-qualifying tuples Spaces with High Dimensionality before fetching them from the data pages. Wrap-Up 8 Multi-Dimensional Multi-Dimensional Indices Indexing

Torsten Grust

• B+-trees can answer one-dimensional queries only.1 We would like to have a multi-dimensional index structure • B+-trees. . . that . . . over composite keys Point Quad Trees • is symmetric in its dimensions, Point (Equality) Search • clusters data in a space-aware fashion, Inserting Data Region Queries

• is dynamic with respect to updates, and k-d Trees • provides good support for useful queries. Balanced Construction K-D-B-Trees K-D-B-Tree Operations • We will start with data structures that have been designed Splitting a Point Page for in-memory use, then tweak them into disk-aware Splitting a Region Page R-Trees database indices (e.g., organize data in a page-wise fashion). Searching and Inserting Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High 1Toward the end of this chapter, we will see UB-trees, a nifty trick that Dimensionality + uses B -trees to support some multi-dimensional queries. Wrap-Up 9 Multi-Dimensional “Binary” Search Tree Indexing

Torsten Grust In k dimensions, a “binary tree” becomes a 2k -ary tree.

Point quad tree (k = 2) • Each data point partitions the data k B+-trees. . . space into 2 disjoint . . . over composite keys regions. Point Quad Trees Point (Equality) Search • In each node, a region Inserting Data points to another node Region Queries k-d Trees (representing a refined Balanced Construction partitioning) or to a K-D-B-Trees K-D-B-Tree Operations special null value. Splitting a Point Page Splitting a Region Page

• This data structure is a R-Trees point quad tree. Searching and Inserting Splitting R-Tree Nodes % Finkel and Bentley. Quad Trees: A Data Structure for UB-Trees Bit Interleaving / Retrieval on Composite Keys. Z-Ordering Acta Informatica, vol. 4, 1974. B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 10 Multi-Dimensional Searching a Point Quad Tree Indexing

Torsten Grust

Point quad tree (k = 2) Point quad tree search

1 Function: p_search (q, node)

2 if q matches data point in node then B+-trees. . . 3 return data point; . . . over composite keys

4 else Point Quad Trees ? 5 P ← partition containing q; Point (Equality) Search Inserting Data 6 if P points to null then Region Queries 7 return not found; k-d Trees Balanced Construction 8 else 0 K-D-B-Trees 9 node ← node pointed to by K-D-B-Tree Operations P; ? Splitting a Point Page 10 return Splitting a Region Page p_search (q, node0); R-Trees Searching and Inserting Splitting R-Tree Nodes

UB-Trees 1 Function: q Bit Interleaving / pointsearch ( ) Z-Ordering 2 return p_search (q, root) ; B+-Trees over Z-Codes ! Range Queries

Spaces with High Dimensionality

Wrap-Up 11 Multi-Dimensional Inserting into a Point Quad Tree Indexing

Inserting a point qnew into a quad tree happens analogously to Torsten Grust an insertion into a binary tree: 1 Traverse the tree just like during a search for qnew until you encounter a partition P with a null pointer. 2 Create a new node n0 that spans the same area as P and is B+-trees. . . partitioned by qnew, with all partitions pointing to null. . . . over composite keys 0 3 Let P point to n . Point Quad Trees Point (Equality) Search Inserting Data Note that this procedure does not keep the tree balanced. Region Queries k-d Trees Example (Insertions into an empty point quad tree) Balanced Construction K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees → → Searching and Inserting Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes → → Range Queries Spaces with High Dimensionality

Wrap-Up 12 Multi-Dimensional Range Queries Indexing

Torsten Grust To evaluate a range query2, we may need to follow several children of a quad tree node node: Point quad tree range search

B+-trees...... over composite keys 1 Function: r_search (Q, node) Point Quad Trees 2 if data point in node is in Q then Point (Equality) Search Inserting Data 3 append data point to result ; Region Queries

k-d Trees 4 foreach partition P in node that intersects with Q do 0 Balanced Construction 5 node ← node pointed to by P ; K-D-B-Trees 0 6 r_search (Q, node ) ; K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees 1 Function: regionsearch (Q) Searching and Inserting Splitting R-Tree Nodes 2 return r_search (Q, root) ; UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High 2We consider rectangular regions only; other shapes may be answered by Dimensionality querying for the bounding rectangle and post-processing the output. Wrap-Up 13 Multi-Dimensional Point Quad Trees—Discussion Indexing

Torsten Grust

Point quad trees " are symmetric with respect to all dimensions and

" can answer point queries and region queries. B+-trees...... over composite keys But Point Quad Trees Point (Equality) Search % insertion order Inserting Data the shape of a quad tree depends on the of Region Queries

its content, in the worst case degenerates into a linked list, k-d Trees % null space inefficient k Balanced Construction pointers are (particularly for large ). K-D-B-Trees K-D-B-Tree Operations In addition, Splitting a Point Page Splitting a Region Page - they can only store point data. R-Trees Searching and Inserting Splitting R-Tree Nodes Also remember that quad trees were designed for main memory UB-Trees Bit Interleaving / operation. Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 14 Multi-Dimensional k-d Trees Indexing

Torsten Grust

Sample k-d tree (k = 2) • Index k-dimensional data, but keep the tree binary. • For each tree level l use B+-trees. . . discriminator . . . over composite keys a different Point Quad Trees dimension dl along Point (Equality) Search Inserting Data which to partition. Region Queries • Typically: round k-d Trees Balanced Construction

robin K-D-B-Trees K-D-B-Tree Operations • This is a k-d tree. Splitting a Point Page % Bentley. Multidimensional Splitting a Region Page Binary Search Trees Used for R-Trees Associative Searching. Comm. Searching and Inserting ACM, vol. 18, no. 9, Sept. 1975. Splitting R-Tree Nodes UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 15 Multi-Dimensional k-d Trees Indexing k-d trees inherit the positive properties of the point quad trees, Torsten Grust but improve on space efficiency. For a given point set, we can also construct a balanced k-d tree:3 k-d tree construction (Bentley’s OPTIMIZE algorithm)

B+-trees...... over composite keys 1 Function: kdtree (pointset, level) Point Quad Trees Point (Equality) Search 2 if pointset is empty then Inserting Data 3 return null ; Region Queries k-d Trees 4 else Balanced Construction

5 p ← median from pointset (along dlevel ); K-D-B-Trees K-D-B-Tree Operations 6 pointsleft ← {v ∈ pointset where vd < pd }; level level Splitting a Point Page 7 pointsright ← Splitting a Region Page R-Trees {v ∈ pointset where vdlevel ≥ pdlevel }\{p}; Searching and Inserting 8 n ← new k-d tree node, with data point p ; Splitting R-Tree Nodes

9 n.left ← kdtree (pointsleft, level + 1) ; UB-Trees 10 n.right ← points level + 1 ; Bit Interleaving / kdtree ( right, ) Z-Ordering 11 return n ; B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality 3 vi : coordinate i of point v. Wrap-Up 16 Multi-Dimensional Balanced k-d Tree Construction Indexing

Torsten Grust

Example (Step-by-step balanced k-d tree construction)

→ → → B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search Inserting Data f Region Queries a e k-d Trees → → → → c h Balanced Construction b d g K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page d Splitting a Region Page Resulting tree shape: R-Trees e c Searching and Inserting Splitting R-Tree Nodes g a h f UB-Trees Bit Interleaving / Z-Ordering b B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 17 Multi-Dimensional K-D-B-Trees Indexing

Torsten Grust

k-d trees improve on some of the deficiencies of point quad trees: " We can balance a k-d tree by re-building it.

(For a limited number of points and in-memory processing, this B+-trees. . . may be sufficient.) . . . over composite keys Point Quad Trees " We’re no longer wasting big amounts of space. Point (Equality) Search Inserting Data Region Queries It’s time to bring k-d trees to the disk. The K-D-B-Tree k-d Trees Balanced Construction • uses pages as an organizational unit K-D-B-Trees K-D-B-Tree Operations (e.g., each node in the K-D-B-tree fills a page) and Splitting a Point Page k-d tree-like layout Splitting a Region Page • uses a to organize each page. R-Trees Searching and Inserting Splitting R-Tree Nodes

% John T. Robinsion. The K-D-B-Tree: A Search Structure for Large UB-Trees Bit Interleaving / Multidimensional Dynamic Indexes. SIGMOD 1981. Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 18 Multi-Dimensional K-D-B-Tree Idea Indexing

Torsten Grust Region pages: ··· ◦ page 0 ··· ◦ ◦ ◦ ··· • contain entries hregion, pageIDi ◦ ◦ no null pointers • B+-trees. . . • form a balanced . . . over composite keys Point Quad Trees page 3 tree Point (Equality) Search page 1 ◦ Inserting Data ◦ ··· all regions disjoint Region Queries ··· ◦ ◦ • ◦ ··· page 2 and rectangular k-d Trees ··· ◦ ◦ ◦ Balanced Construction

K-D-B-Trees K-D-B-Tree Operations . Splitting a Point Page . Splitting a Region Page . . Point pages: . R-Trees Searching and Inserting page 4 • contain entries Splitting R-Tree Nodes page 5 hpoint, ridi UB-Trees + Bit Interleaving / • B -tree leaf Z-Ordering B+-Trees over Z-Codes ;nodes Range Queries Spaces with High Dimensionality

Wrap-Up 19 Multi-Dimensional K-D-B-Tree Operations Indexing

Torsten Grust

• Searching a K-D-B-Tree works straightforwardly: • Within each page determine all regions Ri that contain the query point q (intersect with the query region Q). B+-trees. . . R . . . over composite keys • For each of the i , consult the page it points to and Point Quad Trees recurse. Point (Equality) Search Inserting Data • On point pages, fetch and return the corresponding Region Queries record for each matching data point pi . k-d Trees Balanced Construction

K-D-B-Trees K-D-B-Tree Operations • When inserting data, we keep the K-D-B-Tree balanced, Splitting a Point Page + much like we did in the B -tree: Splitting a Region Page R-Trees • Simply insert a hregion, pageIDi (hpoint, ridi) entry into Searching and Inserting a region page (point page) if there’s sufficient space. Splitting R-Tree Nodes UB-Trees • Otherwise, split the page. Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 20 Multi-Dimensional K-D-B-Tree: Splitting a Point Page Indexing

Torsten Grust

Splitting a point page p

1 Choose a dimension i and an i-coordinate xi along which to B+-trees. . . split, such that the split will result in two pages that are not . . . over composite keys Point Quad Trees overfull. Point (Equality) Search Inserting Data 2 Move data points p with pi < xi and pi > xi to new pages Region Queries pleft and pright (respectively). k-d Trees Balanced Construction 3 Replace hregion, pi in the parent of p with hleft region, plefti K-D-B-Trees K-D-B-Tree Operations hright region, prighti. Splitting a Point Page Splitting a Region Page

R-Trees Searching and Inserting Step 3 may cause an overflow in p’s parent and, hence, lead to a Splitting R-Tree Nodes UB-Trees split of a region page. Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 21 Multi-Dimensional K-D-B-Tree: Splitting a Region Page Indexing

Torsten Grust

• Splitting a point page and moving its data points to the resulting pages is straightforward. • In case of a region page split, by contrast, some regions may intersect with both sides of the split. B+-trees...... over composite keys

Point Quad Trees Consider, e.g., page 0 on slide 19: Point (Equality) Search Inserting Data Region Queries

k-d Trees Balanced Construction

K-D-B-Trees split K-D-B-Tree Operations Splitting a Point Page region crossing the split Splitting a Region Page R-Trees Searching and Inserting Splitting R-Tree Nodes • Such regions need to be split, too. UB-Trees Bit Interleaving / recursive downward Z-Ordering • This can cause a splitting (!) the tree. B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 22 Multi-Dimensional Example: Page 0 Split in K-D-B-Tree of slide 19 Indexing

Torsten Grust Split of region page 0

new root

page 6 B+-trees...... over composite keys

◦ ◦ Point Quad Trees Point (Equality) Search page 0 Inserting Data ◦ ◦ Region Queries

k-d Trees Balanced Construction page 3 K-D-B-Trees page 7 K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page page 1 page 2 R-Trees Searching and Inserting Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering • Root page 0 ⇒ pages 0 and 6 ( creation of new root). B+-Trees over Z-Codes Range Queries • Region page 1 ⇒ pages 1 and 7; (point pages not shown). Spaces with High Dimensionality

Wrap-Up 23 Multi-Dimensional K-D-B-Trees: Discussion Indexing

Torsten Grust

K-D-B-Trees " are symmetric with respect to all dimensions, " cluster data in a space-aware and page-oriented fashion, B+-trees. . . " are dynamic with respect to updates, and . . . over composite keys Point Quad Trees " can answer point queries and region queries. Point (Equality) Search Inserting Data Region Queries

However, k-d Trees - we still don’t have support for region data and Balanced Construction K-D-B-Trees - K-D-B-Trees (like k-d trees) will not handle deletes K-D-B-Tree Operations Splitting a Point Page dynamically. Splitting a Region Page R-Trees Searching and Inserting This is because we always partitioned the data space such that Splitting R-Tree Nodes • every region is rectangular and UB-Trees Bit Interleaving / Z-Ordering • regions never intersect. B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 24 Multi-Dimensional R-Trees Indexing

Torsten Grust

R-trees do not have the disjointness requirement: • R-tree inner or leaf nodes contain hregion, pageIDi or hregion, ridi B+-trees. . . entries (respectively). . . . over composite keys

region is the minimum bounding rectangle that spans all Point Quad Trees Point (Equality) Search data items reachable by the respective pointer. Inserting Data • Every node contains between d and 2d entries ( B+-tree).4 Region Queries k-d Trees • Insertion and deletion algorithms keep an R-tree; balanced Balanced Construction K-D-B-Trees at all times. K-D-B-Tree Operations Splitting a Point Page R-trees allow the storage of point and region data. Splitting a Region Page R-Trees Searching and Inserting % Antonin Guttman. R-Trees: A Dynamic Index Structure for Spatial Splitting R-Tree Nodes

Searching. SIGMOD 1984. UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality 4 except the root node Wrap-Up 25 Multi-Dimensional R-Tree: Example Indexing

Torsten Grust

• order: d = 2 ◦

• region data ◦

◦ page 0 B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search

inner nodes Inserting Data ◦ Region Queries ◦ ··· ◦ ◦ k-d Trees ··· ◦ page 3 ··· ◦ Balanced Construction ◦ ◦ page 1 ··· K-D-B-Trees page 2 K-D-B-Tree Operations Splitting a Point Page Splitting a Region Page

R-Trees Searching and Inserting Splitting R-Tree Nodes page 8 page 9 UB-Trees Bit Interleaving /

leaf nodes Z-Ordering page 6 B+-Trees over Z-Codes page 7 Range Queries

Spaces with High Dimensionality

Wrap-Up 26 Multi-Dimensional R-Tree: Searching and Inserting Indexing

Torsten Grust

While searching an R-tree, we may have to descend into more than one child node for point and region queries (% range search in point quad trees, slide 13).

B+-trees...... over composite keys + Inserting into an R-tree (cf. B -tree insertion) Point Quad Trees Point (Equality) Search Inserting Data 1 Choose a leaf node n to insert the new entry. Region Queries k-d Trees • Try to minimize the necessary region enlargement(s). Balanced Construction 2 If n is full, split it (resulting in n and n0) and distribute old K-D-B-Trees K-D-B-Tree Operations 0 and new entries evenly across n and n . Splitting a Point Page Splitting a Region Page

• Splits may propagate bottom-up and eventually reach R-Trees + the root (% B -tree). Searching and Inserting Splitting R-Tree Nodes 3 After the insertion, some regions in the ancestor nodes of n UB-Trees Bit Interleaving / may need to be adjusted to cover the new entry. Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 27 Multi-Dimensional Splitting an R-Tree Node Indexing

Torsten Grust To split an R-tree node, we generally have more than one alternative:

B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search Inserting Data Region Queries

k-d Trees Balanced Construction

K-D-B-Trees K-D-B-Tree Operations “bad” split “good” split Splitting a Point Page Splitting a Region Page Heuristic: Minimize the totally covered area. But: R-Trees Searching and Inserting • Exhaustive search for the best split is infeasible. Splitting R-Tree Nodes UB-Trees • Guttman proposes two ways to approximate the search. Bit Interleaving / Z-Ordering e.g. B+-Trees over Z-Codes • Follow-up papers ( , the R*-tree) aim at improving the Range Queries

quality of node splits. Spaces with High Dimensionality

Wrap-Up 28 Multi-Dimensional R-Tree: Deletes Indexing

Torsten Grust All R-tree invariants (see 25) are maintained during deletions.

1 If an R-tree node n underflows (i.e., less than d entries are left after a deletion), the whole node is deleted.

2 Then, all entries that existed in n are re-inserted into the B+-trees. . . R-tree. . . . over composite keys Point Quad Trees Note that Step 1 may lead to a recursive deletion of n’s parent. Point (Equality) Search Inserting Data • Deletion, therefore, is a rather expensive task in an R-tree. Region Queries k-d Trees Balanced Construction

K-D-B-Trees K-D-B-Tree Operations Spatial indexing in mainstream database implementations Splitting a Point Page Splitting a Region Page • Indexing in commercial database systems is typically based R-Trees Searching and Inserting on R-trees. Splitting R-Tree Nodes e.g. UB-Trees • Yet, only few systems implement them out of the box ( , Bit Interleaving / Z-Ordering PostgreSQL). Most require the licensing/use of specific B+-Trees over Z-Codes extensions. Range Queries Spaces with High Dimensionality

Wrap-Up 29 Multi-Dimensional Bit Interleaving Indexing • We saw earlier that a B+-tree over concatenated fields Torsten Grust ha, bi does not help our case, because of the asymmetry between the role of a and b in the index key. • What happens if we interleave the bits of a and b (hence, + make the B -tree “more symmetric”)? B+-trees. . . a = b = . . . over composite keys 42 17 Point Quad Trees 00101010 00010001 Point (Equality) Search Inserting Data Region Queries

k-d Trees Balanced Construction 0010101000010001 K-D-B-Trees K-D-B-Tree Operations ha, bi (concatenation) Splitting a Point Page Splitting a Region Page a = 42 b = 17 R-Trees Searching and Inserting 00101010 00010001 Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes 0000100110001001 Range Queries Spaces with High a and b interleaved Dimensionality Wrap-Up 30 Multi-Dimensional Z-Ordering Indexing

Torsten Grust

a a

B+-trees...... over composite keys

Point Quad Trees Point (Equality) Search Inserting Data Region Queries

k-d Trees Balanced Construction

K-D-B-Trees b b K-D-B-Tree Operations ha, bi (concatenated) a and b interleaved Splitting a Point Page Splitting a Region Page • Both approaches linearize all coordinates in the value space R-Trees Searching and Inserting according to some order. % see also slide 8 Splitting R-Tree Nodes UB-Trees • Bit interleaving leads to what is called the Z-order. Bit Interleaving / Z-Ordering • The Z-order (largely) preserves spatial clustering. B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 31 + Multi-Dimensional UB-Tree: B -trees Over Z-Codes Indexing

Torsten Grust • Use a B+-tree to index Z-codes of multi-dimensional data. • Each leaf in the B+-tree describes an interval in the Z-space. region • Each interval in the Z-space describes a in the B+-trees. . . multi-dimensional data space: . . . over composite keys Point Quad Trees Point (Equality) Search + UB-Tree: B -tree over Z-codes Inserting Data Region Queries

k-d Trees Balanced Construction x8 K-D-B-Trees x10 x x K-D-B-Tree Operations 7 9 Splitting a Point Page x6 Splitting a Region Page

R-Trees 0 3 4 20 21 35 36 47 48 63 x2 x4 Searching and Inserting x1 x3 x5 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 Splitting R-Tree Nodes

UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes • To retrieve all data points in a query region Q, try to touch Range Queries intersect Q Spaces with High only those leave pages that with . Dimensionality

Wrap-Up 32 Multi-Dimensional UB-Tree: Range Queries Indexing

Torsten Grust After each page processed, perform an index re-scan ( ) to fetch the next leaf http://mistral.in.tum.de/ page that intersects with Q.

B+-trees...... over composite keys

Point Quad Trees UB-tree range search (Function ub_range(Q)) Point (Equality) Search Inserting Data 1 cur ← z(Qbottom,left) ; Region Queries 2 while true do k-d Trees // search B+-tree page containing cur (% slide 0.0) Example by Volker Markl and Rudolf Bayer, taken from Balanced Construction K-D-B-Trees 3 page ← search (cur); K-D-B-Tree Operations 4 foreach data point p on page do Splitting a Point Page Splitting a Region Page 5 if p is in Q then R-Trees 6 append p to result ; Searching and Inserting Splitting R-Tree Nodes

7 if region in page reaches beyond Qtop,right then UB-Trees Bit Interleaving / 8 break ; Z-Ordering B+-Trees over Z-Codes // compute next Z-address using Q and data on current Range Queries Spaces with High page Dimensionality 9 cur ← get_next_z_address (Q, page); Wrap-Up 33 Multi-Dimensional UB-Trees: Discussion Indexing

Torsten Grust

• Routine get_next_z_address() (return next Z-address

lying in the query region) is non-trivial and depends on the B+-trees. . . shape of the Z-region. . . . over composite keys Point Quad Trees • UB-trees are fully dynamic, a property inherited from the Point (Equality) Search + Inserting Data underlying B -trees. Region Queries k-d Trees • The use of other space-filling curves to linearize the data Balanced Construction space is discussed in the literature, too. E.g., Hilbert K-D-B-Trees curves K-D-B-Tree Operations . Splitting a Point Page Splitting a Region Page I UB-trees have been commercialized in the Transbase® R-Trees Searching and Inserting database system. Splitting R-Tree Nodes UB-Trees Bit Interleaving / Z-Ordering B+-Trees over Z-Codes Range Queries

Spaces with High Dimensionality

Wrap-Up 34 Multi-Dimensional Spaces with High Dimensionality Indexing

Torsten Grust For large k, all techniques that have been discussed become ineffective: • For example, for k = 100, we get 2100 ≈ 1030 partitions per node in a point quad tree. Even with billions of data points, B+-trees. . . almost all of these are empty. . . . over composite keys Point Quad Trees • Consider a really big search region, cube-sized covering Point (Equality) Search Inserting Data 95 % of the range along each dimension: Region Queries data space k-d Trees Balanced Construction

query region K-D-B-Trees K-D-B-Tree Operations For k = 100, the probability of a point Splitting a Point Page Splitting a Region Page

being in this region is still only R-Trees 100 0.95 ≈ 0.59 %. Searching and Inserting Splitting R-Tree Nodes

95 % UB-Trees Bit Interleaving / 100 % Z-Ordering B+-Trees over Z-Codes • We experience the curse of dimensionality here. Range Queries Spaces with High Dimensionality

Wrap-Up 35 Multi-Dimensional Page Selectivty for k-NN Search Indexing

Torsten Grust N=50’000, image database, k=10

140 Scan R*-Tree

120 X-Tree B+-trees. . . VA-File . . . over composite keys 100 Point Quad Trees Point (Equality) Search Inserting Data 80 Region Queries k-d Trees 60 Balanced Construction K-D-B-Trees K-D-B-Tree Operations 40 Splitting a Point Page Splitting a Region Page

% Vector/Leaf blocks visited 20 R-Trees Searching and Inserting Splitting R-Tree Nodes 0 UB-Trees 0 5 10 15 20 25 30 35 40 45 Bit Interleaving / Number of dimensions in vectors Z-Ordering B+-Trees over Z-Codes Range Queries Data: Stephen Bloch. What’s Wrong with High-Dimensionality Search. VLDB 2008. Spaces with High Dimensionality

Wrap-Up 36 Multi-Dimensional Wrap-Up Indexing

Torsten Grust

Point Quad Tree k-dimensional analogy to binary trees; main memory only. k-d Tree, K-D-B-Tree k-d tree: partition space one dimension at a time B+-trees. . . + . . . over composite keys (round-robin); K-D-B-Tree: B -tree-like organization with Point Quad Trees pages as nodes, nodes use a k-d-like structure internally. Point (Equality) Search Inserting Data R-Tree Region Queries k-d Trees regions within a node may overlap; fully dynamic; for point Balanced Construction and region data. K-D-B-Trees K-D-B-Tree Operations Splitting a Point Page UB-Tree Splitting a Region Page use space-filling curve (Z-order) to linearize k-dimensional R-Trees + Searching and Inserting data, then use B -tree. Splitting R-Tree Nodes Curse Of Dimensionality UB-Trees Bit Interleaving / Z-Ordering most indexing structures become ineffective for large k; fall B+-Trees over Z-Codes back to sequential scanning and approximation/compression. Range Queries Spaces with High Dimensionality

Wrap-Up 37