Similarity Searching: Indexing, Nearest Neighbor Finding, Dimensionality Reduction, and Embedding Methods, for Applications in Multimedia

Outline

1. Similarity searching
2. Distance-based indexing
3. Dimension reduction
4. Embedding methods
5. Nearest neighbor finding

Hanan Samet∗ [email protected]

Department of Computer Science University of Maryland College Park, MD 20742, USA

Based on joint work with Gisli R. Hjaltason. *Currently a Science Foundation Ireland (SFI) Walton Fellow at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM)


Similarity Searching

Important task when trying to find patterns in applications involving mining different types of data such as images, video, text documents, DNA sequences, etc.
Similarity searching module is a central component of content-based retrieval in multimedia databases
Problem: finding objects in a data set S that are similar to a query object q based on some distance measure d, which is usually a distance metric
Sample queries:
1. point: objects having particular feature values
2. range: objects whose feature values fall within a given range or where the distance from some query object falls into a certain range
3. nearest neighbor: objects whose features have values similar to those of a given query object or set of query objects
4. closest pairs: pairs of objects from the same set or different sets which are sufficiently similar to each other (variant of spatial join)
Responses invariably use some variant of nearest neighbor finding

Voronoi Diagrams

Apparently straightforward solution:
1. Partition space into regions where all points in the region are closer to the region's data point than to any other data point
2. Locate the Voronoi region corresponding to the query point
Problem: storage and construction cost for N d-dimensional points is Θ(N^⌈d/2⌉)
Impractical unless resort to some high-dimensional approximation of a Voronoi diagram (e.g., OS-tree), which results in approximate nearest neighbors
Exponential factor corresponding to the dimension d of the underlying space in the complexity bounds when using approximations of Voronoi diagrams (e.g., (t,ε)-AVD) is shifted to be in terms of the error threshold ε rather than in terms of the number of objects N in the underlying space
1. (1,ε)-AVD: O(N/ε^(d−1)) space and O(log(N/ε^(d−1))) time for nearest neighbor query
2. (1/ε^((d−1)/2), ε)-AVD: O(N) space and O(t + log N) time for nearest neighbor query


Approximate Voronoi Diagrams (AVD)

Partition underlying domain so that for ε ≥ 0, every block b is associated with some element r_b in S such that r_b is an ε-nearest neighbor for all of the points in b (e.g., (1,0.25)-AVD)

[Figure: example partitions of space induced by ε-nearest neighbor sets for ε = 0.10, 0.30, and 0.50; darkness of shading indicates cardinality of the nearest neighbor sets, with white corresponding to 1]

Approximate Voronoi Diagrams (AVD) Representations

Allow up to t ≥ 1 elements r_ib (1 ≤ i ≤ t) of S to be associated with each block b for a given ε, where each point in b has one of the r_ib as its ε-nearest neighbor (e.g., (3,0)-AVD)


Curse of Dimensionality

Number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with the number of variables (i.e., dimensions) that comprise it (Bellman)
For similarity searching, curse means that the number of points in the data set that need to be examined in deriving the estimate (i.e., the nearest neighbor) grows exponentially with the underlying dimension
Effect on nearest neighbor finding is that the process may not be meaningful in high dimensions
Ratio of the variance of the distances to the expected distances, between two random points p and q drawn from the data and query distributions, converges to zero as dimension d gets very large (Beyer et al.):

  lim_{d→∞} Variance[dist(p,q)] / Expected[dist(p,q)] = 0

1. distance to the nearest neighbor and distance to the farthest neighbor tend to converge as the dimension increases (see the sketch below)
2. implies that nearest neighbor searching is inefficient, as it is difficult to differentiate the nearest neighbor from the other objects
3. assumes uniformly distributed data
Partly alleviated by fact that real-world data is rarely uniformly distributed

Alternative View of Curse of Dimensionality

Probability density function (analogous to histogram) of the distances of the objects is more concentrated and has a larger mean value
Implies algorithms need to do more work
Worst case when d(x,x) = 0 and d(x,y) = 1 for all y ≠ x
  implies must compare every object with every other object
1. can't always use triangle inequality to prune objects from consideration
2. triangle inequality (i.e., d(q,p) ≤ d(p,x) + d(q,x)) implies that any x such that |d(q,p) − d(p,x)| > ε cannot be at a distance of ε or less from q, as d(q,x) ≥ |d(q,p) − d(p,x)| > ε
3. when ε is small while the probability density function is large at d(p,q), then the probability of eliminating an object from consideration via the triangle inequality corresponds to the remaining area under the curve, which is small (left), in contrast to the case when the distances are more uniform (right)
[Figure: (a) concentrated vs. (b) more uniform probability density of d(p,x), each with the pruning window of width 2ε around d(p,q)]
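A minimal numpy sketch (not from the slides) of the Beyer et al. concentration effect referenced above: for uniformly distributed data, the variance-to-mean ratio of distances shrinks as d grows, and the nearest and farthest neighbors become nearly indistinguishable.

    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 2000):
        pts = rng.random((1000, d))      # 1000 uniform data points in [0,1]^d
        q = rng.random(d)                # one random query point
        dist = np.linalg.norm(pts - q, axis=1)
        print(f"d={d:5d}  var/mean={dist.var() / dist.mean():.4f}  "
              f"(max-min)/min={(dist.max() - dist.min()) / dist.min():.4f}")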

Other Problems

Point and range queries are less complex than nearest neighbor queries
1. easy to do with multi-dimensional index as just need comparison tests
2. nearest neighbor requires computation of distance: needs d multiplications and d − 1 additions
Often we don't know the features describing the objects and thus need the aid of domain experts to identify them

Solutions Based on Indexing

1. Map objects to a low-dimensional vector space which is then indexed using one of a number of different data structures such as k-d trees, R-trees, quadtrees, etc.
  use: representative points, SVD, DFT, etc.
2. Directly index the objects based on distances from a subset of the objects, making use of data structures such as the vp-tree, M-tree, etc.
  useful when only have a distance function indicating similarity (or dis-similarity) between all pairs of N objects
  if change distance metric, then need to rebuild index (not so for multidimensional index)
3. If only have distance information available, then embed the data objects in a vector space so that the distances of the embedded objects as measured by the distance metric in the embedding space approximate the actual distances
  commonly known embedding methods include multidimensional scaling (MDS), Lipschitz embeddings, FastMap, etc.
  once a satisfactory embedding has been obtained, the actual search is facilitated by making use of conventional indexing methods, perhaps coupled with dimensionality reduction


Outline

1. Indexing low and high dimensional spaces
2. Distance-based indexing
3. Dimensionality reduction
4. Embedding methods
5. Nearest neighbor searching

Part 1: Indexing Low and High Dimensional Spaces

1. Quadtree variants
2. k-d tree
3. R-tree
4. Bounding sphere methods
5. Hybrid tree
6. Avoiding overlapping all of the leaf blocks
7. Pyramid technique
8. Methods based on a sequential scan


Simple Non-Hierarchical Data Structures

Sequential list:

  Name     X   Y
  Chicago  35  42
  Mobile   52  10
  Toronto  62  77
  Buffalo  82  65
  Denver    5  45
  Omaha    27  35
  Atlanta  85  15
  Miami    90   5

Inverted lists:

  X: Denver, Omaha, Chicago, Mobile, Toronto, Buffalo, Atlanta, Miami
  Y: Miami, Mobile, Atlanta, Omaha, Chicago, Denver, Buffalo, Toronto

1. 2 sorted lists
2. data is pointers
3. enables pruning the search with respect to one key

Grid Method

Divide space into squares of width equal to the search region
Each cell contains a list of all points within it
Assume L∞ distance metric (i.e., Chessboard)
Assume C = uniform distribution of points per cell
Average search time for k-dimensional space is O(F · 2^k)
  F = number of records found = C, since query region has the width of a cell
  2^k = number of cells examined
[Figure: the eight cities plotted in the space (0,0)-(100,100) with a grid overlay]
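A minimal sketch of the grid method for L∞ range search (the class name and helpers are illustrative, not from the slides): cells are squares whose width equals the search radius, so a query inspects only a small, constant block of cells.

    from collections import defaultdict

    class Grid:
        def __init__(self, width):
            self.width = width              # cell width = search region width
            self.cells = defaultdict(list)  # cell coords -> points in the cell

        def _cell(self, x, y):
            return (int(x // self.width), int(y // self.width))

        def insert(self, name, x, y):
            self.cells[self._cell(x, y)].append((name, x, y))

        def search(self, qx, qy):
            """All points within L-infinity distance self.width of (qx, qy);
            the query square overlaps at most 2^2 = 4 cells, scanned here as
            a conservative 3x3 neighborhood of the query's cell."""
            cx, cy = self._cell(qx, qy)
            found = []
            for i in (cx - 1, cx, cx + 1):
                for j in (cy - 1, cy, cy + 1):
                    for name, x, y in self.cells[(i, j)]:
                        if max(abs(x - qx), abs(y - qy)) <= self.width:
                            found.append(name)
            return found

    g = Grid(width=20)
    for name, x, y in [("Chicago", 35, 42), ("Mobile", 52, 10), ("Toronto", 62, 77),
                       ("Buffalo", 82, 65), ("Denver", 5, 45), ("Omaha", 27, 35),
                       ("Atlanta", 85, 15), ("Miami", 90, 5)]:
        g.insert(name, x, y)
    print(g.search(80, 10))                 # ['Atlanta', 'Miami']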

Point Quadtree (Finkel/Bentley)

Marriage between a uniform grid and a binary search tree
[Figure: point quadtree for the eight cities with Chicago at the root]

PR Quadtree

1. Regular decomposition point representation
2. Decompose whenever a block contains more than one point
3. Maximum level of decomposition depends on minimum point separation
  if two points are very close, then decomposition can be very deep
  can be overcome by viewing blocks as buckets with capacity c and only decomposing a block when it contains more than c points
[Figure: PR quadtree for the eight cities with blocks labeled A-F]
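A minimal PR quadtree insertion sketch (illustrative names, bucket capacity 1): a block is split by regular decomposition whenever it would hold more than one point. As the slide notes, two coincident points would recurse forever; a bucket capacity c > 1 avoids this.

    class PRQuadtree:
        def __init__(self, cx, cy, half):
            self.cx, self.cy, self.half = cx, cy, half  # square block: center, half-width
            self.point = None                           # (name, x, y) if leaf holds a point
            self.children = None                        # four sub-blocks after a split

        def insert(self, name, x, y):
            if self.children is None:
                if self.point is None:                  # empty leaf: store here
                    self.point = (name, x, y)
                    return
                old, self.point = self.point, None      # occupied leaf: split block
                self.children = [PRQuadtree(self.cx + dx * self.half / 2,
                                            self.cy + dy * self.half / 2,
                                            self.half / 2)
                                 for dx in (-1, 1) for dy in (-1, 1)]
                self.insert(*old)                       # reinsert old point first
            quad = (2 if x >= self.cx else 0) + (1 if y >= self.cy else 0)
            self.children[quad].insert(name, x, y)      # SW=0, NW=1, SE=2, NE=3

    t = PRQuadtree(50, 50, 50)                          # covers (0,0)-(100,100)
    for name, x, y in [("Chicago", 35, 42), ("Mobile", 52, 10), ("Toronto", 62, 77)]:
        t.insert(name, x, y)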

Region Search

Ex: Find all points within radius r of point A
Use of quadtree results in pruning the search space
If a quadrant subdivision point p lies in a region l, then search the quadrants of p specified by l:
  1. SE          5. SW, NW      9. All but NW   13. All
  2. SE, SW      6. NE         10. All but NE
  3. SW          7. NE, NW     11. All but SW
  4. SE, NE      8. NW         12. All but SE
[Figure: search circle of radius r around A overlapping quadtree blocks in the regions numbered 1-13]

Finding Nearest Object

Ex: find the nearest object to P
Assume PR quadtree for points (i.e., at most one point per block)
Search neighbors of block 1 in counterclockwise order
Points are sorted with respect to the space they occupy, which enables pruning the search space
Algorithm:
1. start at block 2 and compute distance to P from A
2. ignore block 3, even if nonempty, as A is closer to P than any point in 3
3. examine block 4 as distance to SW corner is shorter than the distance from P to A; however, reject B as it is further from P than A
4. ignore blocks 6, 7, 8, 9, and 10 as the minimum distance to them from P is greater than the distance from P to A
5. examine block 11 as the distance from P to the S border of 1 is shorter than the distance from P to A; but, reject F as it is further from P than A
If F was moved, a better order would have started with block 11, the southern neighbor of 1, as it is closest to the new F
[Figure: blocks numbered 1-13 around P with objects A-F]

k-d tree (Bentley)

Test one attribute at a time instead of all simultaneously as in the point quadtree
Usually cycle through all the attributes
Shape of the tree depends on the order in which the data is encountered
[Figure: k-d tree for the eight cities, alternating x and y tests, with Chicago at the root]

Adaptive k-d tree

Data is only stored in terminal nodes
An interior node contains the median of the set as the discriminator
The discriminator key is the one for which the spread of the values of the key is a maximum
[Figure: adaptive k-d tree for the cities with discriminators (57,X); (31,X), (40,Y); (16,X), (26,Y), (10,Y), (72,X); and the cities in the leaves]
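A minimal k-d tree insertion sketch (illustrative, not Bentley's original code): exactly one attribute is compared per level, cycling through the attributes, so the tree's shape depends on insertion order.

    class KDNode:
        def __init__(self, point, axis):
            self.point, self.axis = point, axis      # axis 0 = x, axis 1 = y
            self.left = self.right = None

    def kd_insert(node, point, axis=0):
        if node is None:
            return KDNode(point, axis)
        if point[node.axis] < node.point[node.axis]:  # test one attribute only
            node.left = kd_insert(node.left, point, (node.axis + 1) % len(point))
        else:
            node.right = kd_insert(node.right, point, (node.axis + 1) % len(point))
        return node

    root = None
    for pt in [(35, 42), (52, 10), (62, 77), (82, 65), (5, 45)]:  # Chicago first
        root = kd_insert(root, pt)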


Minimum Bounding Rectangles: R-tree (Guttman)

Objects grouped into hierarchies, stored in a structure similar to a B-tree
Object has single bounding rectangle, yet area that it spans may be included in several bounding rectangles
Drawback: not a disjoint decomposition of space (e.g., Chicago in R1 and R2)
Order (m, M) R-tree:
1. between m ≤ M/2 and M entries in each node except root
2. at least 2 entries in root unless a leaf node
X-tree (Berchtold/Keim/Kriegel): if a split creates too much overlap, then instead of splitting, create a supernode
[Figure: R-tree for the eight cities; R0 contains R1 and R2; R1 contains R3 (Denver, Omaha) and R4 (Buffalo, Toronto); R2 contains R5 (Chicago, Mobile) and R6 (Atlanta, Miami)]

R*-tree (Beckmann et al.)

Goal: minimize overlap for leaf nodes and area increase for nonleaf nodes
Changes from R-tree:
1. insert into leaf node p for which resulting bounding box has minimum increase in overlap with bounding boxes of p's brothers
  compare with R-tree, where insert into leaf node for which increase in area is a minimum (minimizes coverage)
2. in case of overflow in p, instead of splitting p as in R-tree, reinsert a fraction of the objects in p (e.g., farthest from centroid)
  known as 'forced reinsertion' and similar to 'deferred splitting' or 'rotation' in B-trees
3. in case of true overflow, use a two-stage process (goal: low coverage)
  determine axis along which the split takes place
    a. sort bounding boxes for each axis on low/high edge to get 2d lists for d-dimensional data
    b. choose axis yielding lowest sum of perimeters for splits based on sorted orders
  determine position of split
    a. position where overlap between two nodes is minimized
    b. resolve ties by minimizing total area of bounding boxes
Works very well but takes time due to forced reinsertion
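A sketch contrasting the two leaf-choice criteria above (illustrative helper functions under the assumption of 2-d boxes (xlo, ylo, xhi, yhi); not the full insertion algorithms): the R-tree minimizes area enlargement, the R*-tree minimizes the increase in overlap with sibling boxes.

    def area(r):
        return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

    def enlarge(r, obj):                       # bounding box of r and obj
        return (min(r[0], obj[0]), min(r[1], obj[1]),
                max(r[2], obj[2]), max(r[3], obj[3]))

    def overlap(r, s):                         # area of the intersection
        return area((max(r[0], s[0]), max(r[1], s[1]),
                     min(r[2], s[2]), min(r[3], s[3])))

    def rtree_choice(entries, obj):            # R-tree: minimize area increase
        return min(entries, key=lambda r: area(enlarge(r, obj)) - area(r))

    def rstar_choice(entries, obj):            # R*-tree: minimize overlap increase
        def overlap_increase(r):
            grown = enlarge(r, obj)
            others = [s for s in entries if s is not r]
            return (sum(overlap(grown, s) for s in others) -
                    sum(overlap(r, s) for s in others))
        return min(entries, key=overlap_increase)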

Minimum Bounding Hyperspheres

SS-tree (White/Jain):
1. make use of hierarchy of minimum bounding hyperspheres
2. based on observation that a hierarchy of minimum bounding hyperspheres is more suitable for hyperspherical query regions
3. specifying a minimum bounding hypersphere requires slightly over one half the storage for a minimum bounding hyperrectangle
  enables greater fanout at each node, resulting in shallower trees
4. drawback over minimum bounding hyperrectangles is that it is impossible to cover space with minimum bounding hyperspheres without some overlap
SR-tree (Katayama/Satoh):
1. bounding region is intersection of minimum bounding hyperrectangle and minimum bounding hypersphere
2. motivated by desire to improve performance of SS-tree by reducing volume of minimum bounding boxes
[Figure: hierarchy of bounding hyperspheres S1-S4 over the eight cities]

K-D-B-tree (Robinson)

Rectangular embedding space is hierarchically decomposed into disjoint rectangular regions
No dead space in the sense that at any level of the tree, the entire embedding space is covered by one of the nodes
Aggregate blocks of k-d tree partition of space into nodes of finite capacity
When a node overflows, it is split along one of the axes
Originally developed to store points but may be extended to non-point objects represented by their minimum bounding boxes
Drawback: to get area covered by object, must retrieve all cells it occupies
[Figure: K-D-B-tree partition with split at x=50 and regions A-J]

Hybrid tree (Chakrabarti/Mehrotra)

1. Variant of k-d-B-tree that avoids splitting the region and point pages that intersect a partition line l along partition axis a with value v by slightly relaxing the disjointness requirement
2. Add two partition lines at x = 70 for region low and x = 50 for region high
  a. A, B, C, D, and G with region low
  b. E, F, H, I, and J with region high
3. Associating two partition lines with each partition region is analogous to associating a bounding box with each region (also spatial k-d tree)
  similar to bounding box in R-tree but not minimum bounding box
  store approximation of bounding box by quantizing coordinate values along each dimension to b bits for a total of 2bd bits for each box
[Figure: hybrid tree partition with overlapping partition lines x=50 and x=70]

Avoiding Overlapping All of the Leaf Blocks

Assume uniformly-distributed data (numbers checked in the sketch below)
1. most data points lie near the boundary of the space that is being split
  Ex: for d = 20, 98.5% of the points lie within 10% of the surface
  Ex: for d = 100, 98.3% of the points lie within 2% of the surface
2. rarely will all of the dimensions be split even once
  Ex: assuming at least M/2 points per leaf node block, and at least one split along each dimension, then total number of points N must be at least 2^d · M/2
  if d = 20 and M = 10, then N must be at least 5 million to split along all dimensions once
3. if each region is split at most once and, without loss of generality, the split is in half, then the query region usually intersects all the leaf node blocks
  query selectivity of 0.01% for d = 20 leads to 'side length of query region' = 0.63, which means that it intersects all the leaf node blocks
  implies a range query will visit each leaf node block
One solution: use a 3-way split along each dimension into three parts of proportion r, 1 − 2r, and r, thereby reducing fanout of each node (Henrich)
Sequential scan may be cheaper than using an index due to high dimensions
We assume our data is not of such high dimensionality!
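A quick arithmetic check (not from the slides) of the two numbers used above:

    # side length of a hypercube query with 0.01% selectivity in d = 20 dims
    print(0.0001 ** (1 / 20))      # ~0.63
    # points needed so every dimension is split once with M = 10, d = 20
    print(2 ** 20 * 10 // 2)       # 5242880, i.e., about 5 million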


Pyramid Technique (Berchtold/Böhm/Kriegel)

Subdivide data space as if it is an onion by peeling off hypervolumes that are close to the boundary
Subdivide hypercube into 2d pyramids having the center of the data space as the tip of their cones
Each of the pyramids has one of the faces of the hypercube as its base
Each pyramid is decomposed into slices parallel to its base
Useful when query region side length is greater than half the width of the data space, as won't have to visit all leaf node blocks
Pyramid containing q is the one corresponding to the coordinate i whose distance from the center point of the space is greater than all others
Analogous to iMinMax method (Ooi/Tan/Yu/Bressan), with the exception that iMinMax associates a point with its closest surface, but the result is still a decomposition of the underlying space into 2d pyramids

Methods Based on a Sequential Scan

1. If neighbor finding in high dimensions must access every disk page at random, then a linear scan may be more efficient
  advantage of sequential scan over hierarchical indexing methods is that actual I/O cost is reduced by being able to scan the data sequentially instead of at random, as only need one disk seek
2. VA-file (Weber et al.)
  use b_i bits per feature i to approximate feature
  impose a d-dimensional grid with 2^b grid cells, where b = Σ_{i=1..d} b_i
  sequentially scan all grid cells as a filter step to determine possible candidates, which are then checked in their entirety via a disk access
  VA-file is an additional representation in the form of a grid which is imposed on the original data
3. Other methods apply more intelligent quantization processes
  VA+-file (Ferhatosmanoglu et al.): decorrelate the data with KLT yielding new features and vary number of bits, as well as use clustering to determine the region partitions
  IQ-tree (Berchtold et al.): hierarchical like an R-tree with unordered minimum bounding rectangles
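A sketch of the pyramid mapping just described, under the usual assumption that data lies in [0,1]^d (a simplified rendering of the published technique, not the slides' code): the dominant coordinate picks one of the 2d pyramids, and the height within it yields a 1-d key that an ordinary B+-tree can index.

    def pyramid_value(x):                  # x: point in [0,1]^d
        d = len(x)
        jmax = max(range(d), key=lambda j: abs(x[j] - 0.5))  # dominant coordinate
        height = abs(x[jmax] - 0.5)
        i = jmax if x[jmax] < 0.5 else jmax + d              # which of 2d pyramids
        return i + height                                    # 1-d key: pyramid no. + height

    print(pyramid_value([0.8, 0.3]))       # pyramid 2, value 2.3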


Part 2: Distance-Based Indexing

1. Basic definitions
2. Properties for pruning the search
3. Ball partitioning methods
  a. vp-tree
  b. mvp-tree
4. General hyperplane partitioning methods
  a. gh-tree
  b. GNAT
  c. Bisector trees and mb-trees
5. M-tree
6. sa-tree
7. Distance matrix methods
  a. AESA
  b. LAESA

Basic Definitions

1. Often only information available is a distance function indicating degree of similarity (or dis-similarity) between all pairs of N data objects
2. Distance metric d: objects must reside in a finite metric space (S, d), where for o1, o2, o3 in S, d must satisfy
  d(o1,o2) = d(o2,o1) (symmetry)
  d(o1,o2) ≥ 0, and d(o1,o2) = 0 iff o1 = o2 (non-negativity)
  d(o1,o3) ≤ d(o1,o2) + d(o2,o3) (triangle inequality)
3. Triangle inequality is a key property for pruning the search space
  computing distance is expensive
4. Non-negativity property enables ignoring negative values in derivations


Pivots

Identify a distinguished object or subset of the objects termed pivots or vantage points
1. sort remaining objects based on
  a. distances from the pivots, or
  b. which pivot is the closest
2. and build index
3. use index to achieve pruning of other objects during search
Given pivot p ∈ S, for all objects o ∈ S' ⊆ S, we know:
1. exact value of d(p,o),
2. d(p,o) lies within range [r_lo, r_hi] of values (ball partitioning)
  drawback is asymmetry of partition, as the outer shell is usually narrow
3. o is closer to p than to some other object p2 ∈ S (generalized hyperplane partitioning)
[Figure: ball partitioning of S into S1 and S2 by pivot p and radius r; generalized hyperplane partitioning by pivots p and p2]

Pruning: Two Distances

Lemma 1: Knowing distance d(p,q) from p to q and distance d(p,o) from p to o enables bounding the distance d(q,o) from q to o:

  |d(q,p) − d(p,o)| ≤ d(q,o) ≤ d(q,p) + d(p,o)

Distances from pivots are useful in pruning the search
[Figure: lower bound |d(q,p) − d(p,o)| and upper bound d(q,p) + d(p,o) on d(q,o)]
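A direct transcription of Lemma 1 as code (illustrative function names): a single known pivot distance confines d(q,o) to an interval, and an object can be pruned from a range query as soon as its lower bound exceeds the radius.

    def lemma1_bounds(d_qp, d_po):
        lo = abs(d_qp - d_po)          # |d(q,p) - d(p,o)| <= d(q,o)
        hi = d_qp + d_po               # d(q,o) <= d(q,p) + d(p,o)
        return lo, hi

    def can_prune(d_qp, d_po, eps):
        """o cannot lie within eps of q if its lower bound already exceeds eps."""
        return lemma1_bounds(d_qp, d_po)[0] > eps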

Pruning: One Distance and One Range

Lemma 2: Knowing distance d(p,q) from p to q and that distance d(p,o) from p to o is in the range [r_lo, r_hi] enables bounding the distance d(q,o) from q to o:

  max{d(q,p) − r_hi, r_lo − d(q,p), 0} ≤ d(q,o) ≤ d(q,p) + r_hi

[Figure: queries q1, q2, q3 against the shell [r_lo, r_hi] around p, illustrating the bounds r_hi − d(q1,p), d(q2,p) − r_lo, and d(q2,p) + r_hi]

Pruning: Two Ranges

Lemma 3: Knowing that the distance d(p,q) from p to q is in the range [s_lo, s_hi] and that distance d(p,o) from p to o is in the range [r_lo, r_hi] enables bounding the distance d(q,o) from q to o:

  max{s_lo − r_hi, r_lo − s_hi, 0} ≤ d(q,o) ≤ r_hi + s_hi

[Figure: shells [r_lo, r_hi] around p and [s_lo, s_hi] around q, illustrating the bounds s_lo − r_hi and r_hi + s_hi]


Pruning: Two Objects and Identity of Closest

Lemma 4: Knowing the distances d(q,p1) and d(q,p2) from q to pivot objects p1 and p2, and that o is closer to p1 than to p2 (or equidistant from both, i.e., d(p1,o) ≤ d(p2,o)), enables a lower bound on the distance d(q,o) from q to o:

  max{(d(q,p1) − d(q,p2))/2, 0} ≤ d(q,o)

Lower bound is attained when q is anywhere on the line from p1 to p2
Lower bound decreases as q is moved off the line
No upper bound, as objects can be arbitrarily far from p1 and p2
[Figure: q on the line between p1 and p2, with lower bound (d(q,p1) − d(q,p2))/2]

vp-tree (Uhlmann; Yianilos)

Ball partitioning method
Pick p from S and let r be the median of the distances of the other objects from p
Partition S into two sets S1 and S2 where:

  S1 = {o ∈ S \ {p} | d(p,o) < r}
  S2 = {o ∈ S \ {p} | d(p,o) ≥ r}

Apply recursively, yielding a binary tree with pivot and radius values at internal nodes
Choosing pivots:
1. simplest is to pick at random
2. choose a random sample and then select median
[Figure: ball of radius r around pivot p separating S1 (inside) from S2 (outside)]


vp-tree Example

[Figure: vp-tree for objects a-w; the root pivot p with ball radius r partitions the objects, and the subtrees are partitioned recursively]

Range Searching with vp-tree

Find all objects o such that d(q,o) ≤ ε
Use Lemma 2, as know distance from pivot and bounds on the ranges in the two subtrees:

  max{d(q,p) − r_hi, r_lo − d(q,p), 0} ≤ d(q,o) ≤ d(q,p) + r_hi

1. visit left subtree iff d(q,p) − r ≤ ε ⇒ d(q,p) ≤ r + ε (here r_lo = 0 and r_hi = r)
2. visit right subtree iff r − d(q,p) ≤ ε ⇒ d(q,p) ≥ r − ε (here r_lo = r and r_hi = ∞)
[Figure: query ball of radius ε straddling the pivot ball, so both the "inside" and the "outside" subtree must be visited]
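A minimal vp-tree sketch combining the construction and the two visit conditions above (illustrative, with a random pivot and median radius; `dist` is any metric):

    import random

    def build_vp(objects, dist):
        """Returns (pivot, radius, inside_subtree, outside_subtree) or None."""
        if not objects:
            return None
        p = random.choice(objects)                     # simplest pivot choice
        rest = [(o, dist(p, o)) for o in objects if o is not p]
        if not rest:
            return (p, 0.0, None, None)
        r = sorted(d for _, d in rest)[len(rest) // 2] # median distance
        inside = [o for o, d in rest if d < r]         # S1 = {o | d(p,o) < r}
        outside = [o for o, d in rest if d >= r]       # S2 = {o | d(p,o) >= r}
        return (p, r, build_vp(inside, dist), build_vp(outside, dist))

    def vp_range(node, q, eps, dist, out):
        """Collects all objects o with dist(q,o) <= eps."""
        if node is None:
            return
        p, r, inside, outside = node
        d_qp = dist(q, p)
        if d_qp <= eps:
            out.append(p)
        if d_qp <= r + eps:                  # visit "inside" (Lemma 2, r_lo=0, r_hi=r)
            vp_range(inside, q, eps, dist, out)
        if d_qp >= r - eps:                  # visit "outside" (r_lo=r, r_hi=inf)
            vp_range(outside, q, eps, dist, out)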


Increasing Fanout in vp-tree

Fanout of a node in vp-tree is low
Options:
1. increase fanout by splitting S into m equal-sized subsets based on m + 1 bounding values r_0, ..., r_m, or even let r_0 = 0 and r_m = ∞
2. mvp-tree
  each node is equivalent to collapsing nodes at several levels of vp-tree
  use same pivot for each subtree at a level, although the ball radius values differ
  rationale: only need one distance computation per level to visit all nodes at the level (useful when search backtracks)
  a. first pivot i partitions into ball of radius r1
  b. second pivot p partitions inside of the ball for i into subsets S1 and S2, and outside of the ball for i into subsets S3 and S4
[Figure: (left) vp-tree node with rings r1, r2, r3 around p; (right) mvp-tree node with pivots i and p and subsets S1 = {a,h,q,t}, S2 = {d,j,s}, S3 = {c,g,k,m,o,t}, S4 = {b,e,f,n,u,v,w}]

gh-tree (Metric tree; Uhlmann)

Generalized hyperplane partitioning method
Pick p1 and p2 from S and partition S into two sets S1 and S2 where:

  S1 = {o ∈ S \ {p1, p2} | d(p1,o) ≤ d(p2,o)}
  S2 = {o ∈ S \ {p1, p2} | d(p2,o) < d(p1,o)}

Objects in S1 are closer to p1 than to p2 (or equidistant from both), and objects in S2 are closer to p2 than to p1
Hyperplane corresponds to all points o satisfying d(p1,o) = d(p2,o)
  can also "move" the hyperplane by using d(p1,o) = d(p2,o) + m
Apply recursively, yielding a binary tree with two pivots at internal nodes
[Figure: hyperplane between p1 and p2 separating S1 from S2]

gh-tree Example

[Figure: (a) gh-tree partition of objects a-v; (b) corresponding tree with leaf sets {o,p}, {q}, {r}, {s}, {t}, {u}, {v}]

Range Searching with gh-tree

Find all objects o such that d(q,o) ≤ ε
Lower bound on d(q,o) is distance to hyperplane (or zero)
  can only use directly in Euclidean spaces
  otherwise, no direct representation of the "generalized hyperplane"
But, can use Lemma 4 with the distances from the pivots:

  max{(d(q,p1) − d(q,p2))/2, 0} ≤ d(q,o)

1. visit left subtree iff (d(q,p1) − d(q,p2))/2 ≤ ε ⇒ d(q,p1) ≤ d(q,p2) + 2ε
2. visit right subtree iff (d(q,p2) − d(q,p1))/2 ≤ ε ⇒ d(q,p2) ≤ d(q,p1) + 2ε
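The same visit test as code (a sketch assuming each internal node is a tuple (p1, p2, left, right), mirroring the vp-tree sketch earlier):

    def gh_range(node, q, eps, dist, out):
        if node is None:
            return
        p1, p2, left, right = node          # left holds S1, right holds S2
        d1, d2 = dist(q, p1), dist(q, p2)
        if d1 <= eps:
            out.append(p1)
        if d2 <= eps:
            out.append(p2)
        if (d1 - d2) / 2 <= eps:            # d(q,p1) <= d(q,p2) + 2*eps
            gh_range(left, q, eps, dist, out)
        if (d2 - d1) / 2 <= eps:            # d(q,p2) <= d(q,p1) + 2*eps
            gh_range(right, q, eps, dist, out)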


Increasing Fanout in gh-tree

Fanout of a node in gh-tree is low
Geometric Near-neighbor Access tree (GNAT; Brin):
1. increase fanout by adding m pivots P = {p1, ..., pm} to split S into S1, ..., Sm based on which of the objects in P is the closest
2. for any object o ∈ S \ P, o is a member of Si if d(pi,o) ≤ d(pj,o) for all j = 1, ..., m
3. store information about ranges of distances between pivots and objects in the subtrees to facilitate pruning the search

Bisector tree (bs-tree) (Kalantari/McDonald)

1. gh-trees with covering balls
2. Drawback: radius of covering ball of a node is sometimes smaller than the radii of the covering balls of its descendants (termed eccentric)
3. Bad for pruning, as ideally we want radii of covering balls to decrease as the search descends
[Figure: pivots p1, p2, pa and objects o1, o2 in the square [-10,10] x [-10,10]]


mb-tree (Dehne/Noltemeier)

1. Inherit one pivot from ancestor node
2. Fewer pivots and fewer distance computations but perhaps deeper tree
3. Like bucket (k > 1) PR k-d tree, as split whenever a region has more than k objects, but region partitions are implicit (defined by pivot objects) instead of explicit
[Figure: (a) mb-tree partition of objects a-v; (b) corresponding tree with leaf sets {g,q}, {p}, {h,i}, {j,r,s}, {k}, {m,t}, {n}, {f,l}]

Comparison of mb-tree (BSP tree) and PR k-d tree

Partition of underlying space analogous to that of BSP tree for points
[Figure: PR k-d tree partition of the eight cities vs. BSP tree partition whose hyperplanes are defined by pivot pairs such as Chicago-Mobile, Chicago-Toronto, Mobile-Atlanta, Chicago-Denver, Toronto-Buffalo, and Omaha-Chicago]


PR k-d tree

1. Regular decomposition point representation
2. Decompose whenever a block contains more than one point, while cycling through attributes
3. Maximum level of decomposition depends on minimum point separation
  if two points are very close, then decomposition can be very deep
  can be overcome by viewing blocks as buckets with capacity c and only decomposing a block when it contains more than c points
[Figure: PR k-d tree for the eight cities]

M-tree (Ciaccia et al.)

Dynamic structure based on R-tree (actually SS-tree)
All objects in leaf nodes
Balls around "routing" objects (like pivots) play same role as minimum bounding boxes
Pivots play similar role as in GNAT, but:
1. all objects are stored in the leaf nodes, and an object may be referenced several times in the M-tree, as it could be a routing object in more than one nonleaf node
2. for an object o in a subtree of node n, the subtree's pivot p is not always the one closest to o among all pivots in n
3. object o can be inserted into subtrees of several pivots: a choice
Each nonleaf node n contains up to c entries of format (p, r, D, T):
1. p is the pivot (i.e., routing object)
2. r is the covering radius
3. D is distance from p to its parent pivot p'
4. T points to the subtree
[Figure: routing balls around pivots p1, p2, p3, with object o covered by more than one ball]

Delaunay Graph

Definition:
1. each object is a node, and two nodes have an edge between them if their Voronoi cells have a common boundary
2. explicit representation of neighbor relations that are implicitly represented in a Voronoi diagram
  equivalent to an index or access structure for the Voronoi diagram
3. search for a nearest neighbor of q starts with an arbitrary object and then proceeds to a neighboring object closer to q as long as this is possible
Unfortunately, we cannot construct Voronoi cells explicitly if we only have interobject distances
[Figure: point set a-w and its Delaunay graph]

sa-tree (Navarro)

Spatial Approximation tree (sa-tree): approximation of the Delaunay graph
Definition:
1. choose arbitrary object a as root of tree
2. find N(a), smallest possible set of neighbors of a, so that any neighbor is closer to a than to any other object in N(a)
  i.e., x is in N(a) iff for all y ∈ N(a) − {x}, d(x,a) < d(x,y)
  all objects in S \ N(a) are closer to some object in N(a) than to a
3. objects in N(a) become children of a
4. associate remaining objects in S with closest child of a, and recursively define subtrees for each child of a
Example:
1. a is root
2. N(a) = {b,c,d,e}
3. second level
4. h ∉ N(a) and N(b), as h is closer to f than to b or a
5. fourth level
Use heuristics to construct sa-tree, as N(a) is used in the definition, which makes it circular; thus the resulting tree is not necessarily minimal and not unique


Range Searching with sa-tree

Search algorithms make use of Lemma 4, which provides a lower bound on distances
1. know that for c in {a} ∪ N(a), b in N(a), and o in tree rooted at b, then o is closer to b than to c
  therefore, (d(q,b) − d(q,c))/2 ≤ d(q,o) from Lemma 4
2. want to avoid visiting as many children of a as possible
  must visit any object o for which d(q,o) ≤ ε
  must visit any object o in b if lower bound (d(q,b) − d(q,c))/2 ≤ ε
  no need to visit any objects o in b for which there exists c in {a} ∪ N(a) so that (d(q,b) − d(q,c))/2 > ε
  higher lower bound implies less likely to visit
  lower bound is maximized when d(q,c) is minimized: c is the object in {a} ∪ N(a) which is closest to q
3. choose c so that lower bound (d(q,b) − d(q,c))/2 on d(q,o) is maximized
  c is the object in {a} ∪ N(a) closest to q
  once find c, traverse each child b ∈ N(a) except those for which (d(q,b) − d(q,c))/2 > ε

kNN Graphs (Sebastian/Kimia)

1. Each vertex has an edge to each of its k nearest neighbors
[Figure: point set and its 1NN, 2NN, 3NN, and 4NN graphs]
2. Problems
  graph is not necessarily connected
  even if increase k so graph is connected, search may halt at object p which is closer to q than any of the k nearest neighbors of p but not closer than all of the objects in p's neighbor set (e.g., the (k+1)st nearest neighbor)
  Ex: search for nearest neighbor of X in 4NN graph starting at any one of {e,f,j,k,l,m,n} will return k instead of r
  overcome by
    extending size of search neighborhood as in approximate nearest neighbor search
    use several starting points for search (i.e., seeds)
3. Does not require triangle inequality and thus works for arbitrary distances

Alternative Approximations of the Delaunay Graph

1. Other approximation graphs of the Delaunay graph are connected by virtue of being supersets of the minimal spanning tree (MST) of the vertices
2. Relative neighborhood graph (RNG): an edge between vertices u and v if for all vertices p, u is closer to v than is p or v is closer to u than is p; that is, d(u,v) ≤ max{d(p,u), d(p,v)}
3. Gabriel graph (GG): an edge between vertices u and v if for all other vertices p we have that d(u,p)^2 + d(v,p)^2 ≥ d(u,v)^2
4. RNG and GG are not restricted to Euclidean plane or Minkowski metrics
5. MST(E) ⊂ RNG(E) ⊂ GG(E) ⊂ DT(E) in Euclidean plane with edges E
6. MST(E) ⊂ RNG(E) ⊂ GG(E) in any metric space, as DT is only defined for the two-dimensional Euclidean plane
[Figure: point set and its MST, RNG, GG, and Delaunay graph (DT)]

Use of Delaunay Graph Approximations

1. Unless approximation graph is a superset of the Delaunay graph (which it is not), to be useful in nearest neighbor searching, we need to be able to force the algorithm to move to other neighbors of the current object p even if they are farther from q than p
2. Examples:
  kNN graph: use extended neighborhood
  sa-tree: prune search when can show (with aid of triangle inequality) that it is impossible to reach the nearest neighbor via a transition to nearest neighbor or set of neighbors
  RNG and GG have the advantage that they are always connected and don't need seeds
  advantage of kNN graph is that the k nearest neighbors are precomputed

Spatial Approximation Sample Hierarchy (SASH) (Houle)

Hierarchy of random samples of set of objects S of size S/2, S/4, S/8, ..., 1
Makes use of approximate nearest neighbors
Has similar properties as the kNN graph
1. both do not require that the triangle inequality be satisfied
2. both are indexes
O(N^2) time to build kNN graph, as no existing index
SASH is built incrementally level by level, starting at the root, with samples of increasing size, making use of the index already built for the existing levels, thereby taking O(N log N) time
Each level of SASH is a kNN tree with maximum k = c
Key to approximation is to treat the "nearest neighbor relation" as an "equivalence relation" even though this is not generally true
1. assumption of "equivalence" relation is the analog of ≤
2. no symmetry: x being approximate nearest neighbor of x' does not mean that x' must be an approximate nearest neighbor of x
3. no transitivity: x being approximate nearest neighbor of q and x' being approximate nearest neighbor of x does not mean that x' must be an approximate nearest neighbor of q
4. construction of SASH is analog of UNION operation
5. finding approximate nearest neighbor is analog of FIND operation

SASH vis-a-vis Triangle Inequality

Triangle inequality is analogous to transitivity, with ≤ corresponding to the "approximate nearest neighbor" relation
Appeal to triangle inequality, d(x',q) ≤ d(q,x) + d(x',x), regardless of whether or not it holds
1. to establish links to objects likely to be neighbors of query object q
  when d(q,x) and d(x',x) are both very small, then d(q,x') is also very small (analogous to "nearest")
  implies if x' ∈ S \ S' is a highly ranked neighbor of both q and x ∈ S' among objects in S \ S', then x' is also likely to be a highly ranked neighbor of q among objects in S \ S'
  x is a highly ranked neighbor of x' (symmetry) AND x is a highly ranked neighbor of q
  RESULT: x' is a highly ranked neighbor of q (transitivity)
2. INSTEAD of using it to eliminate objects that are guaranteed not to be neighbors

Mechanics of SASH

SASH construction (UNION of UNION-FIND):
1. form hierarchy of samples
2. assume SASH_i has been built and process sample S'
  know that x in SASH_i \ SASH_{i−1} is one of the p approximate nearest neighbors of x' ∈ S', and use SASH_i to determine x
  infer that x' is one of c > p approximate nearest neighbors in S' of x (symmetry)
3. special handling to ensure that every object at level i+1 is an approximate nearest neighbor of at least one object at level i (i.e., no orphan objects)
Finding k approximate nearest neighbors of q (FIND of UNION-FIND):
1. follow links from level i−1 of SASH to level i, retaining in U_i the k_i approximate nearest neighbors of q at level i of the SASH
2. determine k approximate nearest neighbors of q from the union of U_i over all levels of the SASH
3. know that x in U_i is an approximate nearest neighbor of q
4. know that x' in U_{i+1} is an approximate nearest neighbor of x in U_i
5. infer that x' in U_{i+1} is an approximate nearest neighbor of q (transitivity)

Example of SASH Construction

Ex: p = 2, c = 5
Levels of the hierarchy: 1: g; 2: e; 3: j, b; 4: k, m, h, r, a; 5: f, n, p, l, c, d, i, q, o
Initially, no choice in the first 3 levels
Find two closest objects at level 4 for each object at level 5:
  f: k,m; n: k,m; p: k,r; l: k,m; c: a,h; d: a,h; i: h,k; q: h,r; o: k,r
Retain 5 nearest neighbors at level 5 to each object at level 4:
  k: {f,n,p,l,i}; m: {f,n,l}; h: {c,d,i,q}; a: {c,d}
Ignore o, as k has 5 closer neighbors


Example of SASH Approximate k Nearest Neighbor Finding

Ex: k = 3 and query object c
Let f(k,i) = k_i = k^(1−(h−i)/log2 N), yielding k_i = (1, 1, 2, 2, 3)
U_1 = root g of SASH
U_2 = objects reachable from U_1, which is e
U_3 = objects reachable from U_2, which is b and j, which are retained as k_3 = 2
U_4 = objects reachable from U_3, which is {a,h,k,m,r}, and we retain just a and h in U_4, as k_4 = 2
U_5 = objects reachable from U_4, which is {c,d,i,q}, and we retain just c, d, and q in U_5, as k_5 = 3
Take union of U_1, U_2, U_3, U_4, U_5, which is the set {a,b,c,d,e,g,h,i,j,k,m,q,r}, and the closest three neighbors to query object c are a, b, and d

Drawback of SASH

Assumes that if a at level i is an approximate nearest neighbor of o at level i+1, then by symmetry o is likely to be an approximate nearest neighbor of a, which is not generally true
Ex: objects at level i are not necessarily linked to their nearest neighbors at level i+1
[Figure: level i with P1-P4 and level i+1 with C1-C9]
P3 and P4 at level i are linked to the sets of three objects {C4, C5, C6} and {C7, C8, C9}, respectively, at level i+1, instead of to their nearest neighbors C1, C2, and C3 at level i+1


AESA (Vidal Ruiz)

Precomputes O(N^2) interobject distances between all N objects in S and stores them in a distance matrix
Distance matrix is used to provide lower bounds on distances from query object q to objects whose distances have not yet been computed
Only useful if static set of objects and number of queries >> N, as otherwise can use brute force to find nearest neighbor with N distance computations
Algorithm for range search:
  S_u: objects whose distance from q has not been computed and that have not been pruned, initially S
  d_lo(q,o): lower bound on d(q,o) for o ∈ S_u, initially zero
1. remove from S_u the object p with lowest value d_lo(q,p)
  terminate if S_u is empty or if d_lo(q,p) > ε
2. compute d(q,p), adding p to result if d(q,p) ≤ ε
3. for all o ∈ S_u, update d_lo(q,o) if possible:
  d_lo(q,o) ← max{d_lo(q,o), |d(q,p) − d(p,o)|}
  lower bound property by Lemma 1: |d(q,p) − d(p,o)| ≤ d(q,o)
4. go to step 1
Other heuristics possible for choosing next object: random, highest d_lo, etc.

LAESA (Micó et al.)

AESA is costly, as it treats all N objects as pivots
Choose a fixed number M of pivots
Similar approach to searching as in AESA, but
1. non-pivot objects in S_c do not help in tightening lower bound distances of the objects in S_u
2. eliminating pivot objects in S_u may hurt later in tightening the distance bounds
Differences:
1. selecting a pivot object in S_u over any non-pivot object, and
2. eliminating pivot objects from S_u only after a certain fraction f of the pivot objects have been selected into S_c (f can range from 0 to 100%)
  if f = 100%, then pivots are never eliminated from S_u
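A direct sketch of the AESA range-search loop above (illustrative names; `dmat[p][o]` is the precomputed distance matrix and `dist_to_q` the one expensive distance computation):

    def aesa_range(S, dmat, dist_to_q, eps):
        d_lo = {o: 0.0 for o in S}          # lower bounds, initially zero
        S_u = set(S)                        # not yet computed and not pruned
        result = []
        while S_u:
            p = min(S_u, key=lambda o: d_lo[o])   # most promising object
            if d_lo[p] > eps:               # everything left is pruned
                break
            S_u.remove(p)
            d_qp = dist_to_q(p)             # the expensive distance computation
            if d_qp <= eps:
                result.append(p)
            for o in S_u:                   # tighten lower bounds via Lemma 1
                d_lo[o] = max(d_lo[o], abs(d_qp - dmat[p][o]))
            S_u = {o for o in S_u if d_lo[o] <= eps}   # prune
        return result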


Classifying Distance-Based Indexing Methods

1. Pivot-based methods:
  pivots, assuming k of them, can be viewed as coordinates in a k-dimensional space, and the result of the distance computation for object x is equivalent to a mapping of x to a point (x_0, x_1, ..., x_{k−1}) where coordinate value x_i is the distance d(x, p_i) of x from pivot p_i
  result is similar to embedding methods
  also includes distance matrix methods, which contain precomputed distances between some (e.g., LAESA) or all (e.g., AESA) objects
    difference from ball partitioning, as no hierarchical partitioning of data set
2. Clustering-based methods:
  partition data into spatial-like zones based on proximity to a distinguished object called the cluster center
  each object associated with closest cluster center
  also includes sa-tree, which records a subset of the Delaunay graph of the data set, which is a graph whose vertices are the Voronoi cells
    different from pivot-based methods, where an object o is associated with a pivot p on the basis of o's distance from p rather than because p is the closest pivot to o

Pivot-Based vs. Clustering-Based Indexing Methods

1. Both achieve a partitioning of the underlying data set into spatial-like zones
2. Difference:
  pivot-based: boundaries of zones are more well-defined, as they can be expressed explicitly using a small number of objects and a known distance value
  clustering-based: boundaries of zones are usually expressed implicitly in terms of the cluster centers, instead of explicitly, which may require quite a bit of computation to determine
    in fact, very often, the boundaries cannot be expressed explicitly as, for example, in the case of an arbitrary metric space (in contrast to a Euclidean space), where we do not have a direct representation of the 'generalized hyperplane' that separates the two partitions

Distance-Based vs. Multidimensional Indexing

1. Distance computations are used to build the index in distance-based indexing, but once the index has been built, similarity queries can often be performed with significantly fewer distance computations than a sequential scan of the entire dataset
2. Drawback is that if we want to use a different distance metric, then we need to build a separate index for each metric in distance-based indexing
  not the case for multidimensional indexing methods, which can support arbitrary distance metrics when performing a query, once the index has been built
  however, multidimensional indexing is not very useful if we don't have a feature value and only know relative interobject distances (e.g., DNA sequences)

Part 3: Dimension Reduction

1. Motivation
  overcoming curse of dimensionality
  want to use traditional indexing methods (e.g., R-tree and quadtree variants), which lose effectiveness in higher dimensions
2. Searching in a dimensionally-reduced space
3. Using only one dimension
4. Representative point methods
5. Singular value decomposition (SVD, PCA, KLT)
6. Discrete Fourier transform (DFT)


Searching in a Dimensionally-Reduced Space

Want a mapping f so that d(v,u) ≈ d'(f(v), f(u)), where d' is the distance in the transformed space
Range searching:
1. reduce query radius
  implies more precision, as reduce false hits
2. increase query radius
  implies more recall, as reduce false dismissals
3. d'(f(a), f(b)) ≤ d(a,b) for any pair of objects a and b
  mapping f is contractive and yields 100% recall

Nearest Neighbors in a Dimensionally-Reduced Space

1. Ideally, d(a,b) ≤ d(a,c) implies d'(f(a), f(b)) ≤ d'(f(a), f(c)) for any objects a, b, and c
  proximity preserving property
  implies that nearest neighbor queries can be performed directly in the transformed space
  rarely holds
  a. holds for translation and scaling with any Minkowski metric
  b. holds for rotation when using Euclidean metric in both original and transformed space
2. Use "filter-and-refine" algorithm with no false dismissals (i.e., 100% recall) as long as f is contractive
  if o is nearest neighbor of q, contractiveness ensures that 'filter' step finds all candidate objects o' such that d'(f(q), f(o')) ≤ d(q,o)
  'refine' step calculates actual distance to determine actual nearest neighbor
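A minimal filter-and-refine sketch, assuming f is contractive (illustrative names): candidates are ranked by the cheap embedded distance, and the scan can stop as soon as the embedded distance alone exceeds the best actual distance found.

    def filter_and_refine_nn(q, objects, f, d, d_embed):
        fq = f(q)
        # filter: rank candidates by the (cheap) distance in the embedded space
        candidates = sorted(objects, key=lambda o: d_embed(fq, f(o)))
        best, best_d = None, float("inf")
        for o in candidates:
            if d_embed(fq, f(o)) > best_d:   # contractive: d_embed <= d, so no
                break                        # remaining candidate can beat best_d
            d_qo = d(q, o)                   # refine with the actual distance
            if d_qo < best_d:
                best, best_d = o, d_qo
        return best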


Using Only One Dimension

1. Can keep just one of the features
  global: feature f with largest range
  local: feature f with largest range of expected values about the value of feature f for query object q
  always contractive if distance metric for the single feature is suitably derived from the distance metric used on all of the features
2. Combine all features into one feature
  concatenate a few bits from each feature
  use bit interleaving or Peano-Hilbert code
  not contractive: points (4,3) and (4,4) are adjacent, but codes 26 and 48 are not!
[Figure: 8x8 grid labeled with bit-interleaved codes; cell (4,3) has code 26 and cell (4,4) has code 48]

Representative Points

Often, objects with spatial extent are represented by a representative point, such as a sphere by its radius and the coordinate values of its center
Really a transformation into a point in a higher dimensional space, and thus not a dimensional reduction
Transformation is usually not contractive, as the distance between the transformed objects is greater than the distance between the original objects
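A bit-interleaving sketch (not from the slides) reproducing the example above: taking the y bit before the x bit at each position maps (4,3) to 26 and (4,4) to 48, so two adjacent points end up far apart in code order, which is why the mapping is not contractive.

    def morton(x, y, bits=3):
        code = 0
        for i in reversed(range(bits)):     # interleave y bit, then x bit
            code = (code << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
        return code

    print(morton(4, 3), morton(4, 4))       # 26 48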


Transformation into a Different and Smaller Feature Set

Rotate x, y axes to obtain x', y' axes
x' is the dominant axis, and can even drop axis y'
Start with m n-dimensional points
[Figure: points A-D in (a) the original x, y axes and (b) the rotated x', y' axes]

SVD (KLT, PCA)

Method of finding a linear transformation of n-dimensional feature vectors that yields good dimensionality reduction
1. after transformation, project feature vectors on "first" k axes, yielding k-dimensional vectors (k ≤ n)
2. projection minimizes the sum of the squares of the Euclidean distances between the set of n-dimensional feature vectors and their corresponding k-dimensional feature vectors
Letting F denote the original feature vectors, calculate V, the SVD transform matrix, and obtain transformed feature vectors T so that FV = T
F = UΣV^T; retain the k most discriminating values in Σ (i.e., the largest ones), zeroing the remaining ones
Drawback is the need to know all of the data in advance, which means that we need to recompute if any of the data values change
Transformation preserves Euclidean distance, and thus the projection is contractive
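A numpy sketch of the reduction described above (illustrative, uncentered as on the slide): compute F = UΣV^T, keep the top-k right singular directions, and use T = F·V_k as the reduced vectors; since the projection is onto orthonormal axes, Euclidean distances can only shrink.

    import numpy as np

    def svd_reduce(F, k):                   # F: m x n matrix of feature vectors
        U, s, Vt = np.linalg.svd(F, full_matrices=False)
        Vk = Vt[:k].T                       # n x k transform matrix (top-k axes)
        return F @ Vk, Vk                   # T: m x k reduced feature vectors

    F = np.random.default_rng(1).random((100, 8))
    T, Vk = svd_reduce(F, 3)
    # contractive check on one pair of vectors:
    print(np.linalg.norm(T[0] - T[1]) <= np.linalg.norm(F[0] - F[1]) + 1e-12)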


Drawback of SVD: need to recompute when one feature vector is modified Ex: decomposition of real-valued five-dimensional feature vector ~x = DFT is a transformation from time domain to frequency domain or vice (1.2,1.4,0.65,-0.25,-0.75) versa 1.5 DFT of a feature vector has same number of components (termed coefficients) as original feature vector )

t 1 DFT results in the replacement of a sequence of values at different instances of time by a sequence of an equal number of coefficients in the 0.5 frequency domain Analogous to a mapping from a high-dimensional space to another space 0 of equal dimension Provides insight into time-varying data by looking into the dependence of −0.5 the variation on time as well as its repeatability, rather than just looking at feature vector component value ( x the strength of the signal (i.e., the amplitude) as can be seen from the

−1 conventional representation of the signal in the time domain −1 0 1 2 3 4 5 feature vector component ( t ) Cosine basis functions are solid Sine basis functions are broken Solid circle shows original feature vector

Copyright 2009: Hanan Samet Similarity Searching for Multimedia Databases Applications – p.73/114 Copyright 2009: Hanan Samet Similarity Searching for Multimedia Databases Applications – p.74/114

Use of DFT for Similarity Searching Part 4: Embedding Methods

Euclidean distance norm of feature vector and its DFT are equal 1. Problem statement Can apply a form of dimension reduction by eliminating some of the 2. Lipschitz embeddings Fourier coefficients 3. SparseMap Zeroth coefficient is average of components of feature vector 4. FastMap Hard to decide which coefficients to retain 5. MetricMap 1. choose just the first k coefficients 2. find dominant coefficients (i.e., highest magnitude, mean, variance, etc.) requires knowing all of the data and not so dynamic

Copyright 2009: Hanan Samet Similarity Searching for Multimedia Databases Applications – p.75/114 Copyright 2009: Hanan Samet Similarity Searching for Multimedia Databases Applications – p.76/114

Overview of Embedding Methods Exact Distance Preservation May Be Impossible

1. Given a finite set of N objects and a distance metric d indicating distance Ex: 4 objects a, b, c, e between them 1. d(a, b) = d(b, c) = d(a, c) = 2 and d(e, a) = d(e, b) = d(e, c) = 1.1 2. Find function F that maps N objects into a vector space of dimension k d satisfies triangle inequality using a distance function d0 in this space Cannot embed objects into a 3-d Euclidean space — that is, with d0 as ideally, k is low: k N  the Euclidean distance while preserving d computing F should be fast — O(N) or O(N log N) 2. Can embed if distance between e and a, b, and c is at least 2/√3 avoid examining all 2 inter-object distance pairs O(N ) place a, b, and c in plane p and place e on line perpendicular to p that fast way of obtaining F (o) given o passes through the centroid of the triangle in p formed by a, b, and c 3. Problem setting also includes situation where the original objects are N 3. Also possible if use City Block distance metric dA (L1) described by an n-dimensional feature vector place a, b, and c at (0,0,0), (2,0,0), and (1,1,0), respectively, and e at 4. Ideally, the distances between the objects are preserved exactly by the (1,0,0.1) mapping F exact preservation means that (S, d) and (F (S), d0) are isometric possible when d and d0 are both Euclidean, in which case it is always true when k = N 1 − difficult in general for arbitrary combinations of d and d0 regardless of value of k


Exact Distance Preservation Always Possible with Chessboard Distance

One dimension for each object
Map object o into vector {d(o, o1), d(o, o2), . . . , d(o, oN)}
For any pair of objects oi and oj,
  d'(F(oi), F(oj)) = dM(F(oi), F(oj)) = max_l {|d(oi, ol) − d(oj, ol)|}
For any l, |d(oi, ol) − d(oj, ol)| ≤ d(oi, oj) by the triangle inequality
|d(oi, ol) − d(oj, ol)| = d(oi, oj) for l = i and for l = j, in which case d'(F(oi), F(oj)) = d(oi, oj)
Therefore, distances are preserved by F when using the Chessboard metric dM (L∞)
Number of dimensions here is high: k = N
At times, define F in terms of a subset of the objects

Properties of Embeddings

1. Contractiveness: d'(F(a), F(b)) ≤ d(a, b)
   alternative to exact distance preservation
   ensures 100% recall when using the same search radius in both the original and embedding space, as no correct responses are missed
   but precision may be less than 100% due to false candidates
2. Distortion: measures how much larger or smaller the distances d'(F(o1), F(o2)) in the embedding space are than the corresponding distances d(o1, o2) in the original space
   defined as c1·c2, where (1/c1)·d(o1, o2) ≤ d'(F(o1), F(o2)) ≤ c2·d(o1, o2) for all object pairs o1 and o2, with c1, c2 ≥ 1
   similar effect to contractiveness
3. SVD is optimal way of linearly transforming n-dimensional points to k-dimensional points (k ≤ n)
   ranks features by importance
   drawbacks:
   a. can't be applied if only know distance between objects
   b. slow: O(N·m²) where m is dimension of original space
   c. only works if d and d' are the Euclidean distance
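Returning to the distance-matrix construction above, a small runnable check that the L∞ metric preserves distances exactly (the toy metric space is an assumption for illustration):

    import numpy as np

    def chessboard_embed(objects, d):
        """F(o) = (d(o, o1), ..., d(o, oN)): one dimension per object."""
        return {o: np.array([d(o, x) for x in objects]) for o in objects}

    def d_M(u, v):
        # Chessboard (L_inf) metric in the embedding space
        return np.max(np.abs(u - v))

    objects = [0.0, 1.0, 3.5, 7.0]          # points on a line
    d = lambda a, b: abs(a - b)              # with d = absolute difference
    F = chessboard_embed(objects, d)
    for a in objects:
        for b in objects:
            assert abs(d_M(F[a], F[b]) - d(a, b)) < 1e-12   # exact preservation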

Lipschitz Embeddings (Linial et al.)

Based on defining a coordinate space where each axis corresponds to a reference set, which is a subset of the objects
Definition:
1. set R of subsets of S: R = {A1, A2, . . . , Ak}
2. d(o, A) = min_{x∈A} d(o, x) for A ⊂ S
3. F(o) = (d(o, A1), d(o, A2), . . . , d(o, Ak))
Coordinate values of o are distances from o to the closest element in each of the Ai
Saw one such embedding earlier using L∞, where R is all singleton subsets of S, that is, R = {{o1}, {o2}, . . . , {oN}}

Motivation for Lipschitz Embeddings

If x is an arbitrary object, can obtain some information about d(o1, o2) for arbitrary objects o1 and o2 by comparing d(o1, x) and d(o2, x), that is, |d(o1, x) − d(o2, x)|
|d(o1, x) − d(o2, x)| ≤ d(o1, o2) by Lemma 1
Extend to subset A so that |d(o1, A) − d(o2, A)| ≤ d(o1, o2)
Proof:
1. let x1, x2 ∈ A be such that d(o1, A) = d(o1, x1) and d(o2, A) = d(o2, x2)
2. d(o1, x1) ≤ d(o1, x2) and d(o2, x2) ≤ d(o2, x1) implies |d(o1, A) − d(o2, A)| = |d(o1, x1) − d(o2, x2)|
3. whether d(o1, x1) − d(o2, x2) is positive or negative, |d(o1, x1) − d(o2, x2)| ≤ max{|d(o1, x1) − d(o2, x1)|, |d(o1, x2) − d(o2, x2)|}
4. from triangle inequality, max{|d(o1, x2) − d(o2, x2)|, |d(o1, x1) − d(o2, x1)|} ≤ d(o1, o2)
5. therefore, |d(o1, A) − d(o2, A)| ≤ d(o1, o2), as |d(o1, A) − d(o2, A)| = |d(o1, x1) − d(o2, x2)|
By using several subsets in R, we increase the likelihood that d(o1, o2) is captured by d'(F(o1), F(o2))


Mechanics of Lipschitz Embeddings

Linial et al. let R be O(log² N) randomly selected subsets of S
For d' = Lp, define F so that F(o) = (d(o, A1)/q, d(o, A2)/q, . . . , d(o, Ak)/q), where q = k^(1/p)
F satisfies (c/⌊log2 N⌋)·d(o1, o2) ≤ d'(F(o1), F(o2)) ≤ d(o1, o2)
Distortion of O(log N) is large and may make F ineffective at preserving relative distances, as want to use distance value in original space
Since the sets Ai are chosen at random, the proof is probabilistic, and c is a constant with high probability
Embedding is impractical:
1. large number and sizes of subsets in R mean that there is a high probability that all N objects appear in a subset of R; implies need to compute distance between query object q and all objects in S
2. number of coordinates is ⌊log2 N⌋², which is relatively large: N = 100 yields k = 36, which is too high

SparseMap (Hristescu/Farach-Colton)

Attempts to overcome high cost of computing Lipschitz embedding of Linial et al. in terms of number of distance computations and dimensions
Uses regular Lipschitz embedding instead of Linial et al. embedding:
1. does not divide the distances d(o, Ai) by q = k^(1/p)
2. uses Euclidean distance metric
Two heuristics:
1. reduce number of distance computations by calculating an upper bound d̂(o, Ai) instead of the exact value d(o, Ai); only calculate a fixed number of distance values for each object, as opposed to |Ai| distance values
2. reduce number of dimensions by using a "high quality" subset of R instead of the entire set; use greedy resampling to reduce number of dimensions by eliminating poor reference sets
Heuristics do not lead to a contractive embedding, but it can be made contractive (Hjaltason and Samet):
1. modify first heuristic to compute actual value d(o, Ai), not an upper bound
2. use dM (L∞) as d' instead of dE (L2)
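A minimal sketch of that contractive variant, with exact d(o, Ai) values and L∞ in the embedding space (the random choice of reference sets and all names are illustrative assumptions):

    import random

    def lipschitz_embed(o, ref_sets, d):
        """F(o) = (d(o, A1), ..., d(o, Ak)) with d(o, A) = min over x in A."""
        return [min(d(o, x) for x in A) for A in ref_sets]

    def d_inf(u, v):
        # L_inf keeps the embedding contractive, since
        # |d(o1, A) - d(o2, A)| <= d(o1, o2) holds for every coordinate
        return max(abs(a - b) for a, b in zip(u, v))

    random.seed(0)
    S = [random.uniform(0, 100) for _ in range(50)]      # toy 1-d metric space
    d = lambda a, b: abs(a - b)
    ref_sets = [random.sample(S, 3) for _ in range(4)]   # k = 4 reference sets
    F = {o: lipschitz_embed(o, ref_sets, d) for o in S}
    o1, o2 = S[0], S[1]
    assert d_inf(F[o1], F[o2]) <= d(o1, o2) + 1e-12      # contractiveness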


FastMap (Faloutsos/Lin)

Inspired by dimension reduction methods for Euclidean space based on linear transformations such as SVD, KLT, PCA
Claimed to be general, but assumes that d is Euclidean, as is d'; only for these cases is it contractive
Objects are assumed to be points

Mechanics of FastMap

Obtain coordinate values for points by projecting them on k mutually orthogonal coordinate axes
Compute projections using the given distance function d
Construct coordinate axes one-by-one:
1. choose two objects (pivots) at each iteration
2. draw a line between them that serves as the coordinate axis
3. determine coordinate value along this axis for each object o by mapping (i.e., projecting) o onto this line
Prepare for next iteration:
1. determine the (m − 1)-dimensional hyperplane H perpendicular to the line that forms the previous coordinate axis
2. project all of the objects onto H
   perform projection by defining a new distance function dH measuring distance between projections of objects on H
   dH is derived from original distance function d and coordinate axes determined so far
3. recur on original problem with m and k reduced by one, and a new distance function dH
Continue process until have enough coordinate axes


Choosing Pivot Objects

Pivot objects serve to anchor the line that forms the newly-formed coordinate axis
Ideally want a large spread of the projected values on the line between the pivot objects:
1. greater spread generally means that more distance information can be extracted from the projected values; for objects a and b, more likely that |xa − xb| is large, thereby providing more information
2. similar to principle in KLT, but different, as spread is a weaker notion than variance, which is used in KLT; a large spread can be caused by a few outliers, while a large variance means values are really scattered over a wide range
Use an O(N) heuristic instead of O(N²) process for finding approximation of farthest pair:
1. arbitrarily choose one of the objects a
2. find object r which is farthest from a
3. find object s which is farthest from r
Could iterate more times to obtain a better estimate of farthest pair

Deriving First Coordinate Value

1. Two possible positions for projection of object a for first coordinate
   [Figure: the two configurations of pivots r and s with object a, showing d(r, a), d(s, a), and the projection xa on the line through r and s of length d(r, s)]
   xa obtained by solving d(r, a)² − xa² = d(s, a)² − (d(r, s) − xa)²
   Expanding and rearranging yields xa = (d(r, a)² + d(r, s)² − d(s, a)²) / (2·d(r, s))
2. Used Pythagorean Theorem, which is only applicable to Euclidean space
   implicit assumption that d is Euclidean distance
   equation is only a heuristic when used for general metric spaces; implies embedding may not be contractive
3. Observations about xa:
   can show |xa| ≤ d(r, s)
   maximum spread between arbitrary a and b is 2·d(r, s)
   bounds may not hold if d is not Euclidean, as then the distance function used in subsequent iterations may possibly not satisfy triangle inequality
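A sketch of the pivot heuristic and the projection formula above (the helper names and sample points are assumptions for illustration):

    def choose_pivots(objects, d):
        """O(N) farthest-pair heuristic: arbitrary a -> farthest r -> farthest s."""
        a = objects[0]
        r = max(objects, key=lambda o: d(a, o))
        s = max(objects, key=lambda o: d(r, o))
        return r, s

    def first_coordinate(o, r, s, d):
        # x_o = (d(r,o)^2 + d(r,s)^2 - d(s,o)^2) / (2 d(r,s))
        return (d(r, o)**2 + d(r, s)**2 - d(s, o)**2) / (2 * d(r, s))

    pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0), (3.0, 1.0)]
    d = lambda p, q: ((p[0]-q[0])**2 + (p[1]-q[1])**2) ** 0.5
    r, s = choose_pivots(pts, d)
    xs = [first_coordinate(o, r, s, d) for o in pts]
    assert all(abs(x) <= d(r, s) + 1e-9 for x in xs)   # |x_o| <= d(r,s) holds here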

Projected Distance

Ex: 3-d space and just before determining second coordinate
dH: distance function for the distances between 2 objects when projected onto the hyperplane H perpendicular to the first coordinate axis (through pivots r and s)
[Figure: objects t and u with projections t' and u' on H; the right angle at C in triangle tuC has legs |xt − xu| along the first coordinate axis and d(C, u)]
Determining dH(t, u) for some objects t and u:
1. let t' and u' be their projections on H; dH(t, u) equals distance between t' and u'; also know: d(t', u') = d(C, u)
2. angle at C in triangle tuC is 90°, so can apply Pythagorean theorem:
   d(t, u)² = d(t, C)² + d(C, u)² = (xt − xu)² + d(t', u')²
3. rearranging and dH(t, u) = d(t', u') yields
   dH(t, u)² = d(t, u)² − (xt − xu)²
Implicit assumption that d is Euclidean distance

Side-Effects of Non-Euclidean Distance d

1. dH can fail to satisfy triangle inequality
   produce coordinate values that lead to non-contractiveness
2. Non-contractiveness may cause negative values of dH(a, b)²
   complicates search for pivot objects
   problem: square root of a negative number is a complex number, which means that a and b (really their projections) cannot serve as pivot objects


Subsequent Iterations

Distance function at iteration i is the distance function dH from the previous iteration
Notation:
1. xo^i: ith coordinate value obtained for object o
2. Fi(o) = {xo^1, xo^2, . . . , xo^i}: first i coordinate values of F(o)
3. di: distance function used in iteration i
4. p1^i and p2^i: two pivot objects chosen in iteration i
   p2^i is the farthest object from p1^i
   xo^i = (di(p1^i, o)² + di(p1^i, p2^i)² − di(p2^i, o)²) / (2·di(p1^i, p2^i))
Recursive distance function:
   d1(a, b) = d(a, b)
   di(a, b)² = di−1(a, b)² − (xa^(i−1) − xb^(i−1))² = d(a, b)² − dE(Fi−1(a), Fi−1(b))²
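Putting the projection formula and the recursive distance function together, a compact sketch of the whole construction (a simplified reading for illustration, not the authors' exact code; the clamp at zero is an added safeguard for non-Euclidean d):

    def fastmap(objects, d, k):
        """Map each object to a k-dimensional point using only d."""
        coords = {o: [] for o in objects}

        def di(a, b):
            # d_i(a,b)^2 = d(a,b)^2 - sum of squared coordinate differences so far
            s = d(a, b)**2 - sum((xa - xb)**2
                                 for xa, xb in zip(coords[a], coords[b]))
            return max(s, 0.0) ** 0.5     # s can go negative if d is not Euclidean

        for _ in range(k):
            a = objects[0]                             # arbitrary starting object
            r = max(objects, key=lambda o: di(a, o))   # farthest-pair heuristic
            s = max(objects, key=lambda o: di(r, o))
            drs = di(r, s)
            if drs == 0.0:
                break                                  # no spread left: stop early
            for o in objects:
                coords[o].append((di(r, o)**2 + drs**2 - di(s, o)**2) / (2 * drs))
        return coords

For Euclidean inputs this reproduces the distances once k reaches min{m, N − 1}; the clamp hides, rather than fixes, the non-Euclidean issues discussed on the following slides.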

Computational Complexity

1. O(k·N) distance computations to map N objects to k-dimensional space
   O(N) distance computations at each iteration
2. O(k·N) space to record the k coordinate values of each of the points corresponding to the N objects
3. 2 × k array to record identities of k pairs of pivot objects, as this information is needed to process queries
4. Query objects are transformed to k-dimensional points by applying same algorithm used to construct points corresponding to original objects, except that we use existing pivot objects
   O(k) process, as O(k) distance computations
5. Can also record distance between pivot objects so no need to recompute

Properties of FastMap

1. Contractiveness
   yes, as long as d and d' are both Euclidean
   a. no if d is Euclidean and d' is not
      Ex: use city block distance dA (L1) for d', as dA((0, 0), (3, 4)) = 7 while dE((0, 0), (3, 4)) = 5
   b. no if d is not Euclidean, regardless of d'
      Ex: four objects a, b, c, and e with distances d(a, b) = 10, d(a, c) = 4, d(a, e) = 5, d(b, c) = 8, d(b, e) = 7, and d(c, e) = 1; letting a and b be pivots in the first iteration results in xe − xc = 6/5 = 1.2 > 1 = d(c, e)
      if d non-Euclidean, then eventually non-contractive if enough iterations
2. With Euclidean distances, distances can be preserved given enough iterations: min{m, N − 1} for m-dimensional space and N points
3. Distance expansion can be very large if non-contractive
4. If d is not Euclidean, then dH could violate triangle inequality
   Ex: four objects a, b, c, and e with distances d(a, b) = d(c, e) = 6, d(a, c) = 5, d(a, e) = d(b, e) = 4, and d(b, c) = 3; letting a and b be pivots yields dH(a, c) + dH(a, e) ≈ 5.141 < 5.850 ≈ dH(c, e), violating triangle inequality

Implications of Non-Contractiveness of FastMap

1. Not guaranteed to be able to determine k coordinate axes
   limits extent of distance preservation
   failure to determine more coordinate axes does not necessarily imply that relative distances among the objects are effectively preserved
2. Distance distortion can be very large
3. Presence of many non-positive, or very small positive, distance values (which can cause large distortion) in the intermediate distance functions (i.e., those used to determine the second and subsequent coordinate axes) may cause FastMap to no longer satisfy the claimed O(N) bound on the number of distance computations in each iteration
   finding a legal pivot pair may, in the worst case, require examining the distances between a significant fraction of all possible pairs of objects, or Ω(N²) distance computations
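Both numeric claims in these examples can be checked directly with the projection and dH formulas from the earlier slides (a worked verification, not part of the original):

    # first example: x_e - x_c exceeds d(c, e), so the embedding expands distances
    D = {('a','b'): 10, ('a','c'): 4, ('a','e'): 5,
         ('b','c'): 8, ('b','e'): 7, ('c','e'): 1}
    d = lambda p, q: 0 if p == q else D.get((p, q), D.get((q, p)))
    x = lambda o, r, s: (d(r,o)**2 + d(r,s)**2 - d(s,o)**2) / (2 * d(r,s))
    print(x('e','a','b') - x('c','a','b'))            # 1.2 > 1 = d(c, e)

    # second example: d_H violates the triangle inequality
    D = {('a','b'): 6, ('c','e'): 6, ('a','c'): 5,
         ('a','e'): 4, ('b','e'): 4, ('b','c'): 3}
    dH = lambda p, q: (d(p,q)**2 - (x(p,'a','b') - x(q,'a','b'))**2) ** 0.5
    print(dH('a','c') + dH('a','e'), dH('c','e'))     # ~5.141 < ~5.850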

MetricMap (Wang et al.)

1. Similar to SVD, FastMap, and a special class of Lipschitz embeddings
   in Euclidean spaces, equivalent to applying SVD for dimension reduction
   based on an analogy to rotation and projection in Euclidean spaces
2. Differs from FastMap as embedding space is pseudo-Euclidean
   some coordinate axes make a negative contribution to "distances" between the points
3. Makes use of 2k reference objects which form a coordinate space in a (2k − 1)-dimensional space
   one reference object is mapped to origin and rest are mapped to unit vectors in the (2k − 1)-dimensional space
   forms a matrix that preserves distance between reference objects
4. Mapping each object is less expensive than in FastMap
   only need k + 1 distance computations
5. Employs different strategy to handle non-Euclidean metrics
   maps into a pseudo-Euclidean space, which may result in less distortion in the distances
   may possibly not be contractive

Part 5: Nearest Neighbor Finding

1. Classical methods such as branch and bound
2. K nearest neighbors
3. Incremental nearest neighbor finding
   a. general method
   b. permitting duplicate instances of objects
4. Approximate nearest neighbor finding
5. Probably approximately correct (PAC) nearest neighbor finding

Branch and Bound Algorithm (Fukunaga/Narendra)

1. Visit elements in hierarchy using a depth-first traversal
   maintain a list L of current candidate k nearest neighbors
2. Dk: distance between q and the farthest object in L
   Dk = max_{o∈L} {d(q, o)}, or ∞ if L contains fewer than k objects
   Dk is monotonically non-increasing over the course of the traversal, and eventually reaches the distance of the kth nearest neighbor of q
3. If element et being visited represents an object o (i.e., t = 0), then insert o into L, removing farthest if |L| > k
4. Otherwise, et (t ≥ 1) is not an object
   construct an active list A(et) of child elements of et, ordered by "distance" from q
   recursively visit the elements in A(et) in order, backtracking when
   a. all elements have been visited, or
   b. reaching an element et' ∈ A(et) with dt'(q, et') > Dk
   condition ensures that all objects at distance of kth nearest neighbor are reported; if sufficient to report k objects, then use dt'(q, et') ≥ Dk

Branch and Bound Enhancements

Process elements of active list in an order more closely correlated with finding the k nearest neighbors:
1. process elements that are more likely to contain the nearest neighbors before those that are less likely to do so
2. possibly prune elements from further consideration by virtue of being farther away from the query object than any of the members of list L of the current candidate k nearest neighbors
   in case of distance-based indexes for metric space searching, prune with aid of triangle inequality
Can use cost estimate functions:
1. MinDistObject(q, n) is least possible distance from query object q to an object in tree rooted at n
2. MaxDistObject(q, n) is greatest possible distance between q and an object in tree rooted at n
When using a spatial index with bounding box hierarchies, order on basis of minimum distance to the bounding box associated with each element
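A minimal sketch of the depth-first scheme with the ordering and pruning enhancements above (the node interface, i.e., is_object, value, children, mindist, is an assumption for illustration):

    import heapq

    def knn_branch_and_bound(q, node, k, d, results=None):
        """Depth-first k-NN; results is a max-heap of (-dist, id, object)."""
        if results is None:
            results = []
        if node.is_object:
            heapq.heappush(results, (-d(q, node.value), id(node.value), node.value))
            if len(results) > k:
                heapq.heappop(results)              # drop current farthest (|L| > k)
        else:
            # active list A: children ordered by least possible distance to q
            for child in sorted(node.children, key=lambda c: c.mindist(q)):
                D_k = -results[0][0] if len(results) == k else float('inf')
                if child.mindist(q) > D_k:
                    break                           # backtrack: prune the rest
                knn_branch_and_bound(q, child, k, d, results)
        return results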


Incremental Nearest Neighbors (Hjaltason/Samet)

Motivation:
1. often don't know in advance how many neighbors will need
2. e.g., want nearest city to Chicago with population > 1 million
Several approaches:
1. guess some area range around Chicago and check populations of cities in range
   if find a city with population > 1 million, must make sure that there are no other cities that are closer with population > 1 million
   inefficient as have to guess size of area to search
   problem with guessing is we may choose too small a region or too large a region
   a. if size too small, area may not contain any cities with right population and need to expand the search region
   b. if size too large, may be examining many cities needlessly
2. sort all the cities by distance from Chicago
   impractical as we need to re-sort them each time we pose a similar query with respect to another city
   also sorting is overkill when only need first few neighbors
3. find k closest neighbors and check population condition

Mechanics of Incremental Nearest Neighbor Algorithm

Make use of a search hierarchy (e.g., tree) where:
1. objects at lowest level
2. object approximations are at next level (e.g., bounding boxes in an R-tree)
3. nonleaf nodes in a tree-based index
Traverse search hierarchy in a "best-first" manner similar to A*-algorithm, instead of more traditional depth-first or breadth-first manners:
1. at each step, visit element with smallest distance from query object among all unvisited elements in the search hierarchy (i.e., all unvisited elements whose parents have been visited)
2. use a global list of elements, organized by their distance from query object
   use a priority queue as it supports necessary insert and delete minimum operations
   ties in distance: priority to lower type numbers; if still tied, priority to elements deeper in search hierarchy


Incremental Nearest Neighbor Algorithm

Algorithm:

INCNEAREST(q, S, T)
1   Q ← NEWPRIORITYQUEUE()
2   et ← root of the search hierarchy induced by q, S, and T
3   ENQUEUE(Q, et, 0)
4   while not ISEMPTY(Q) do
5       et ← DEQUEUE(Q)
6       if t = 0 then /* et is an object */
7           Report et as the next nearest object
8       else
9           for each child element et' of et do
10              ENQUEUE(Q, et', dt'(q, et'))

1. Lines 1-3 initialize priority queue with root
2. In main loop take element et closest to q off the queue
   report et as next nearest object if et is an object
   otherwise, insert child elements of et into priority queue

Example of INCNEAREST

Initially, algorithm descends tree to leaf node containing q
   expand n
   expand n1
   start growing search region
   expand n3
   report e as nearest neighbor
[Figure: snapshots of the search region and of the priority queue (front shown first) as n, n1, and n3 are expanded and e is reported]
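A runnable sketch of the same best-first loop using Python's heapq (the node interface and the two distance helpers are assumptions for illustration):

    import heapq, itertools

    def inc_nearest(q, root, dist_node, dist_obj):
        """Yield objects in increasing distance from q, one at a time."""
        counter = itertools.count()                   # tie-breaker for equal keys
        queue = [(0.0, next(counter), root, False)]   # (dist, tie, elem, is_obj)
        while queue:
            d, _, elem, is_obj = heapq.heappop(queue)
            if is_obj:
                yield elem, d                         # next nearest object
            else:
                for child, child_is_obj in elem.children():
                    dc = dist_obj(q, child) if child_is_obj else dist_node(q, child)
                    heapq.heappush(queue, (dc, next(counter), child, child_is_obj))

    # usage: stop whenever enough neighbors have been reported, e.g.
    # for city, dist in inc_nearest(chicago, tree_root, box_mindist, point_dist):
    #     if city.population > 1_000_000: break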

VASCO Spatial Applet

http://www.cs.umd.edu/~hjs/quadtree/index.html

Complexity Analysis

Algorithm is I/O optimal: no nodes outside search region are accessed
   better pruning than branch and bound algorithm
Observations for finding k nearest neighbors for uniformly-distributed two-dimensional points:
   expected # of points on priority queue: c·√k
   expected # of leaf nodes intersecting search region: c·(k + √k)
In worst case, priority queue will be as large as entire data set
   e.g., when data objects are all nearly equidistant from query object
   probability of worst case very low, as it depends on a particular configuration of both the data objects and the query object (but: curse of dimensionality!)


Objects with extent such as lines, rectangles, regions, etc. are indexed by INCNEARESTDUP(q, S, T )

methods that associate the objects with the different blocks that they 1 Q NEWPRIORITYQUEUE() occupy ← 2 et root of the search hierarchy induced by q, S, and T ← Indexes employ a disjoint decomposition of space in contrast to 3 ENQUEUE(Q, et, 0) non-disjoint as is the case for bounding box hierarchies (e.g., R-tree) 4 while not ISEMPTY(Q) do 5 et DEQUEUE(Q) Search hierarchies will contain multiple references to some objects ← 6 if t = 0 then /* et is an object */ Adapting incremental nearest neighbor algorithm: 7 while et = FIRST(Q) do 1. make sure to detect all duplicate instances that are currently in priority 8 DELETEFIRST(Q) queue 9 Report et as the next nearest object 10 else /* e is not an object */ 2. avoid inserting duplicate instances of an object that has already been t 11 for each child element et0 of et do reported 0 12 if t > 0 or d 0 (q, e 0 ) dt(q, et) then t t ≥ 13 ENQUEUE(Q, et0 , dt0 (q, et0 ))


Differences from INCNEAREST

1. Object o (et') is enqueued only if o has not yet been reported
   check if o's distance from q is less than distance from et to q (line 12)
   if it is less, then o must have been encountered in an element et'' which was closer to q, and hence o has already been reported
2. Check for multiple instances of object o and report only once (lines 7-9)
3. Order objects in queue by identity when at same distance
4. Retrieve all nodes in the queue before objects at same distance
   important because an object can have several ancestor nodes of the same type
   contrast with INCNEAREST, where we want to report neighbors as soon as possible and thus break ties by giving priority to elements with lower type numbers

VASCO Spatial Applet

http://www.cs.umd.edu/~hjs/quadtree/index.html


INCNEAREST Extensions

1. Incremental range query
2. Incremental retrieval of k nearest neighbors
   need an extra queue to keep track of k neighbors found so far, and can use distance dk from q of the kth candidate nearest neighbor ok to reduce number of priority queue operations
3. Farthest neighbor
4. Pairs of objects
   distance join
   distance semi-join

Approximate Nearest Neighbors

1. Often, obtaining exact results is not critical, and willing to trade off accuracy for improved performance
2. Let ε denote the approximation error tolerance
   common criterion is that the distance between q and the resulting candidate nearest neighbor o' is within a factor of 1 + ε of the distance to the actual nearest neighbor o
   i.e., d(q, o') ≤ (1 + ε)·d(q, o)
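A sketch of one way to realize this tolerance incrementally, by inflating the queue keys of non-object elements by 1 + ε as described on the next slide (interface as in the earlier inc_nearest sketch):

    import heapq, itertools

    def inc_nearest_approx(q, root, dist_node, dist_obj, eps):
        """Best-first search whose reported neighbors satisfy
        d(q, o') <= (1 + eps) * d(q, o) for the true nearest neighbor o."""
        counter = itertools.count()
        queue = [(0.0, next(counter), root, False)]
        while queue:
            key, _, elem, is_obj = heapq.heappop(queue)
            if is_obj:
                yield elem, key
            else:
                for child, child_is_obj in elem.children():
                    if child_is_obj:
                        dc = dist_obj(q, child)
                    else:
                        # inflating node keys delays node expansion, so objects
                        # may be reported "before their time", within 1 + eps
                        dc = (1 + eps) * dist_node(q, child)
                    heapq.heappush(queue, (dc, next(counter), child, child_is_obj))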


Approximate Nearest Neighbors with INCNEAREST

1. Modify INCNEAREST by multiplying the key values for non-object elements on the priority queue by 1 + ε
   in a practical sense, non-object element et is enqueued with a larger distance value, that is, larger by a factor of 1 + ε
   implies that we delay its processing, thereby allowing objects to be reported 'before their time'
   e.g., once et is finally processed, all objects o satisfying d(q, o) ≤ (1 + ε)·dt(q, et) (which is greater than dt(q, et) if ε > 0) would have already been reported
   thus an object c in et with a distance d(q, c) ≤ d(q, o) could exist, yet o is reported before c
   algorithm does not necessarily report the resulting objects in strictly increasing order of their distance from q
2. Different from Arya/Mount algorithm, which cannot be incremental as priority queue only contains non-object elements
   shrinks distance r from q to the closest object o by a factor of 1 + ε, and only inserts a non-object element e into the priority queue if the distance d(b, q) of e's corresponding block b from q is less than the shrunken distance

Probably Approximately Correct (PAC) Nearest Neighbors (Ciaccia/Patella)

Relax approximate nearest neighbor condition by stipulating a maximum probability δ for tolerating failure, thereby enabling the decision process to halt sooner at the risk δ of being wrong
Object o' is considered a PAC-nearest neighbor of q if the probability that d(q, o') ≤ (1 + ε)·d(q, o) is at least 1 − δ, where o is the actual nearest neighbor
Alternatively, given ε and δ, 1 − δ is the minimum probability that o' is the (1 + ε)-approximate nearest neighbor of q
Ciaccia and Patella use information about the distances between q and the data objects to derive an upper bound s on the distance between q and a PAC-nearest neighbor o'
Distance bound s is used during the actual nearest neighbor search as a pre-established halting condition: the search can be halted once an object o' with d(q, o') ≤ s is located
Method is analogous to executing a variant of a range query, where the range is defined by the distance bound s, which halts on the first object in the range
Difficulty is determining a relationship between δ and the distance bound s

Concluding Remarks

1. Similarity search is a broad area of research
2. Much relation to geometry; geometric setting is usually missing
3. Progress is heavily influenced by applications
4. Need to look at old literature to be able to evaluate current research results
5. Much is left to do, as it is difficult to say what is the best solution

Selected Overview References

H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading, MA, 1990.
H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, Reading, MA, 1990.
V. Gaede and O. Günther. Multidimensional access methods. ACM Computing Surveys, 30(2):170-231, June 1998.
C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322-373, Sept. 2001.
E. Chávez, G. Navarro, R. Baeza-Yates, and J. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33(3):273-322, Sept. 2001.
G. R. Hjaltason and H. Samet. Properties of embedding methods for similarity searching in metric spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):530-549, May 2003.
Also University of Maryland Computer Science TR-4102.
G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28(4):517-580, Dec. 2003.
H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan-Kaufmann, San Francisco, CA, 2006.
