Similarity Searching: Indexing, Nearest Neighbor Finding, Dimensionality Reduction, and Embedding Methods, for Applications in Multimedia Databases

Hanan Samet* ([email protected])
Department of Computer Science, University of Maryland, College Park, MD 20742, USA
Based on joint work with Gisli R. Hjaltason.
* Currently a Science Foundation of Ireland (SFI) Walton Fellow at the Centre for Geocomputation at the National University of Ireland at Maynooth (NUIM)
Copyright 2009: Hanan Samet

Outline
1. Similarity searching
2. Distance-based indexing
3. Dimension reduction
4. Embedding methods
5. Nearest neighbor finding

Similarity Searching
Important task when trying to find patterns in applications involving mining different types of data, such as images, video, time series, text documents, DNA sequences, etc.
The similarity searching module is a central component of content-based retrieval in multimedia databases.
Problem: find the objects in a data set S that are similar to a query object q based on some distance measure d, which is usually a distance metric.
Sample queries:
1. point: objects having particular feature values
2. range: objects whose feature values fall within a given range, or where the distance from some query object falls into a certain range
3. nearest neighbor: objects whose features have values similar to those of a given query object or set of query objects
4. closest pairs: pairs of objects from the same set or from different sets which are sufficiently similar to each other (a variant of the spatial join)
Responses invariably use some variant of nearest neighbor finding.

Voronoi Diagrams
Apparently straightforward solution:
1. Partition space into regions where all points in a region are closer to the region's data point than to any other data point.
2. Locate the Voronoi region corresponding to the query point.
Problem: the storage and construction cost for N d-dimensional points is Θ(N^⌈d/2⌉).
Impractical unless we resort to some high-dimensional approximation of a Voronoi diagram (e.g., the OS-tree), which yields approximate nearest neighbors.
The exponential factor corresponding to the dimension d of the underlying space in the complexity bounds when using approximations of Voronoi diagrams (e.g., the (t,ε)-AVD) is shifted to be in terms of the error threshold ε rather than in terms of the number of objects N in the underlying space:
1. (1,ε)-AVD: O(N/ε^(d−1)) space and O(log(N/ε^(d−1))) time for a nearest neighbor query
2. (1/ε^((d−1)/2), ε)-AVD: O(N) space and O(t + log N) time for a nearest neighbor query

Approximate Voronoi Diagrams (AVD): Representations
[Figure: example partitions of space induced by neighbor sets for ε = 0.10, 0.30, and 0.50; the darkness of the shading indicates the cardinality of the nearest neighbor sets, with white corresponding to 1.]
(1,ε)-AVD: partition the underlying domain so that, for a given ε ≥ 0, every block b is associated with some element r_b of S such that r_b is an ε-nearest neighbor for all of the points in b (e.g., a (1,0.25)-AVD).
(t,ε)-AVD: allow up to t ≥ 1 elements r_ib (1 ≤ i ≤ t) of S to be associated with each block b, where each point in b has one of the r_ib as its ε-nearest neighbor (e.g., a (3,0)-AVD).

Problem: Curse of Dimensionality
Number of samples needed to estimate an arbitrary function
with a given level of accuracy grows exponentially with the number of variables (i.e., dimensions) that comprise it (Bellman).
For similarity searching, the curse means that the number of points in the data set that need to be examined in deriving the estimate (i.e., the nearest neighbor) grows exponentially with the underlying dimension.
Effect on nearest neighbor finding is that the process may not be meaningful in high dimensions.
The ratio of the variance of the distances to the expected distances, between two random points p and q drawn from the data and query distributions, converges to zero as the dimension d gets very large (Beyer et al.):
lim_{d→∞} Variance[dist(p,q)] / Expected[dist(p,q)] = 0
1. the distance to the nearest neighbor and the distance to the farthest neighbor tend to converge as the dimension increases
2. implies that nearest neighbor searching is inefficient, as it is difficult to differentiate the nearest neighbor from the other objects
3. assumes uniformly distributed data
Partly alleviated by the fact that real-world data is rarely uniformly distributed.

Alternative View of the Curse of Dimensionality
The probability density function (analogous to a histogram) of the distances between the objects is more concentrated and has a larger mean value.
Implies similarity search algorithms need to do more work:
1. we can't always use the triangle inequality to prune objects from consideration
2. the triangle inequality (i.e., d(q,p) ≤ d(p,x) + d(q,x)) implies that any x such that |d(q,p) − d(p,x)| > ε cannot be at a distance of ε or less from q, as d(q,x) ≥ d(q,p) − d(p,x)
3. when ε is small while the probability density function is large at d(p,q), the probability of eliminating an object from consideration via the triangle inequality (the remaining area under the curve) is small (case (a)), in contrast to the case when the distances are more uniform (case (b))
The worst case is when d(x,x) = 0 and d(x,y) = 1 for all y ≠ x, which implies that we must compare every object with every other object.
[Figure: probability density of d(p,x) with a window of width 2ε centered at d(p,q); (a) concentrated distances, (b) more uniform distances.]

Other Problems
Point and range queries are less complex than nearest neighbor queries:
1. they are easy to do with a multi-dimensional index, as we just need comparison tests
2. nearest neighbor queries require the computation of distances (the Euclidean distance needs d multiplications and d − 1 additions)
Often we don't know the features describing the objects, and thus need the aid of domain experts to identify them.

Solutions Based on Indexing
1. Map the objects to a low-dimensional vector space, which is then indexed using one of a number of different data structures such as k-d trees, R-trees, quadtrees, etc.
   - use dimensionality reduction: representative points, SVD, DFT, etc.
2. Directly index the objects based on their distances from a subset of the objects, making use of data structures such as the vp-tree, M-tree, etc.
   - useful when we only have a distance function indicating the similarity (or dissimilarity) between all pairs of the N objects
   - if the distance metric changes, then the index must be rebuilt (not so for a multidimensional index)
3. If only distance information is available, then embed the data objects in a vector space so that the distances of the embedded objects, as measured by the distance metric in the embedding space, approximate the actual distances.
   - commonly known embedding methods include multidimensional scaling (MDS), Lipschitz embeddings, FastMap, etc.
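The triangle-inequality pruning rule discussed earlier (any x with |d(q,p) − d(p,x)| > ε cannot be within ε of q) is also what distance-based indexes such as the vp-tree exploit. A minimal Python sketch, where the single-pivot setup and all function names are illustrative assumptions rather than anything from the slides:

```python
import math

def range_search(query, objects, pivot, dist, eps):
    """Return all objects within eps of query, pruning with the
    triangle inequality: if |d(q,p) - d(p,x)| > eps, then
    d(q,x) >= |d(q,p) - d(p,x)| > eps, so x is skipped without
    ever computing d(q,x)."""
    d_qp = dist(query, pivot)
    result = []
    for x, d_px in objects:            # pairs (object, precomputed d(pivot, x))
        if abs(d_qp - d_px) > eps:     # pruned by the triangle inequality
            continue
        if dist(query, x) <= eps:      # distance computed only for survivors
            result.append(x)
    return result

def euclidean(a, b):
    return math.dist(a, b)

# Tiny usage example: one pivot, three points in the plane.
pivot = (0.0, 0.0)
data = [(1.0, 0.0), (5.0, 5.0), (0.0, 1.2)]
objects = [(x, euclidean(pivot, x)) for x in data]
print(range_search((0.0, 0.5), objects, pivot, euclidean, 1.0))  # [(0.0, 1.2)]
```

Note that when the distances are concentrated, as in case (a) above, the test |d(q,p) − d(p,x)| > ε rarely fires, which is exactly why the curse of dimensionality forces such methods to do more work.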
   - once a satisfactory embedding has been obtained, the actual search is facilitated by making use of conventional indexing methods, perhaps coupled with dimensionality reduction

Outline
1. Indexing low and high dimensional spaces
2. Distance-based indexing
3. Dimensionality reduction
4. Embedding methods
5. Nearest neighbor searching

Part 1: Indexing Low and High Dimensional Spaces
1. Quadtree variants
2. k-d tree
3. R-tree
4. Bounding sphere methods
5. Hybrid tree
6. Avoiding overlapping all of the leaf blocks
7. Pyramid technique
8. Methods based on a sequential scan

Simple Non-Hierarchical Data Structures
Sequential list:
Name     X   Y
Chicago  35  42
Mobile   52  10
Toronto  62  77
Buffalo  82  65
Denver    5  45
Omaha    27  35
Atlanta  85  15
Miami    90   5
Inverted lists:
X: Denver, Omaha, Chicago, Mobile, Toronto, Buffalo, Atlanta, Miami
Y: Miami, Mobile, Atlanta, Omaha, Chicago, Denver, Buffalo, Toronto
Properties of inverted lists:
1. two sorted lists, one per key
2. the data consists of pointers to the records
3. enables pruning the search with respect to one key
[Figure: the eight cities plotted as points in the square with corners (0,0) and (100,100).]

Grid Method
Divide space into squares of width equal to the search region.
Each cell contains a list of all the points within it.
Assume the Chessboard (L∞) distance metric.
Assume C = uniform distribution of points per cell.
Average search time for a k-dimensional space is O(F · 2^k):
F = number of records found = C, since the query region has the width of a cell
2^k = number of cells examined

Point Quadtree (Finkel/Bentley)
Marriage between a uniform grid and a binary search tree:
1.

PR Quadtree
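The grid method above can be sketched as follows; the `Grid` class, its cell width, and the L∞ range query mirror the slide, but all names are illustrative and the structure is hardcoded to 2D (k = 2) for brevity:

```python
from collections import defaultdict

class Grid:
    """Uniform grid over k-dimensional space (2D here for simplicity):
    square cells of width w, each holding a list of the points inside it.
    If the query region (a square of side 2r) is no wider than a cell,
    i.e. 2*r <= w, at most 2^k cells have to be examined."""

    def __init__(self, width):
        self.width = width
        self.cells = defaultdict(list)

    def _cell(self, p):
        return tuple(int(c // self.width) for c in p)

    def insert(self, p):
        self.cells[self._cell(p)].append(p)

    def search(self, q, r):
        """All points within Chessboard (L-infinity) distance r of q."""
        lo = self._cell((q[0] - r, q[1] - r))
        hi = self._cell((q[0] + r, q[1] + r))
        found = []
        for i in range(lo[0], hi[0] + 1):       # at most 2 cells per axis
            for j in range(lo[1], hi[1] + 1):   # when 2*r <= width
                for p in self.cells[(i, j)]:
                    if max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= r:
                        found.append(p)
        return found

# The eight cities from the sequential list above.
cities = [(35, 42), (52, 10), (62, 77), (82, 65),
          (5, 45), (27, 35), (85, 15), (90, 5)]
g = Grid(width=20)
for c in cities:
    g.insert(c)
print(sorted(g.search((30, 40), 10)))  # [(27, 35), (35, 42)]
```

Each examined cell contributes C points on average, so the expected work matches the O(F · 2^k) bound on the slide: 2^k cells scanned, F = C records reported.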