
Algorithms for Similarity Search
Yury Lifshits (Yahoo! Research)
ESSCASS 2010

Outline
1. High-level overview
2. Locality-sensitive hashing
3. Combinatorial approach

Yahoo! Research Content Optimization:
1. High-level overview
2. Explore/exploit for Yahoo! frontpage
3. Ediscope project for social engagement analysis

Similarity Search: an Example
Input: a set of objects.
Task: preprocess it.
Query: a new object q.
Task: find the most similar object in the dataset.
(Figure: dataset points p1, ..., p6 and a query q; p2 is marked as the most similar.)

More Formally
Search space: object domain U, similarity function σ.
Input: database S = {p1, ..., pn} ⊆ U.
Query: q ∈ U.
Task: find argmax_{p_i ∈ S} σ(p_i, q).
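To make the task concrete, here is a minimal brute-force sketch of the problem as stated above. Python is used only for illustration; the function name nearest_neighbor and the cosine choice of σ are assumptions (the slides leave σ abstract). Everything discussed later tries to beat this O(n)-per-query scan by preprocessing.

```python
import numpy as np

def nearest_neighbor(database, q, sim):
    """Brute-force similarity search: return the index of argmax_i sim(p_i, q).

    Costs one similarity evaluation per database object, i.e. O(n) per query.
    """
    best_i, best_s = None, -np.inf
    for i, p in enumerate(database):
        s = sim(p, q)
        if s > best_s:
            best_i, best_s = i, s
    return best_i

# Illustrative choice of sigma: cosine similarity on small 2-d vectors.
cosine = lambda p, q: np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
S = [np.array([1.0, 0.0]), np.array([0.7, 0.7]), np.array([0.0, 1.0])]
print(nearest_neighbor(S, np.array([0.6, 0.8]), cosine))  # -> 1
```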
Applications (1/5): Information Retrieval
- Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
- Spelling correction
- Geographic databases (post-office problem)
- Searching for similar DNA sequences
- Related pages web search
- Semantic search, concept matching

Applications (2/5): Machine Learning
- kNN classification rule: classify by the majority of the k nearest training examples, e.g. recognition of faces, fingerprints, speaker identity, optical characters
- Nearest-neighbor interpolation

Applications (3/5): Data Mining
- Near-duplicate detection
- Plagiarism detection
- Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation, ...)
- Key difference: these are mostly off-line problems

Applications (4/5): Bipartite Problems
- Recommendation systems (most relevant movie for a set of already watched ones)
- Personalized news aggregation (most relevant news articles for a given user's profile of interests)
- Behavioral targeting (most relevant ad to display to a given user)
- Key differences: query and database objects have a different nature; objects are described by features and connections

Applications (5/5): As a Subroutine
- Coding theory (maximum-likelihood decoding)
- MPEG compression (searching for similar fragments in the already compressed part)
- Clustering

Variations of the Computation Task
Additional requirements:
- Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function
Related problems:
- Nearest neighbor: the nearest museum to my hotel
- Reverse nearest neighbor: all museums for which my hotel is the nearest one
- Range queries: all museums up to 2 km from my hotel
- Closest pair: the closest pair of museum and hotel
- Spatial join: pairs of hotels and museums that are at most 1 km apart
- Multiple nearest neighbors: the nearest museums for each of these hotels
- Metric facility location: how to build hotels to minimize the sum of "museum to nearest hotel" distances

Brief History
- 1908: Voronoi diagram
- 1967: kNN classification rule by Cover and Hart
- 1973: post-office problem posed by Knuth
- 1997: the paper by Kleinberg, beginning of provable upper/lower bounds
- 2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
- 2008: First International Workshop on Similarity Search
- 2009: Amazon acquires SnapTell

Ideal Algorithm
+ Any data model
+ Any dimension
+ Any dataset
+ Near-linear space
+ Logarithmic search time
+ Exact nearest neighbor
+ Zero probability of error
+ Provable efficiency

Nearest Neighbors in Theory
Proposed data structures include: Sphere Rectangle Tree, Orchard's Algorithm, k-d-B tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, AESA, LAESA, Vantage-point tree, R*-tree, Burkhard-Keller tree, BBD tree, Navigating Nets, Voronoi tree, Balanced aspect ratio tree, Metric tree, vps-tree, M-tree, Locality-Sensitive Hashing, SS-tree, R-tree, Spatial approximation tree, Multi-vantage point tree, Bisector tree, mb-tree, Cover tree, Hybrid tree, Generalized hyperplane tree, Slim tree, Spill tree, Fixed queries tree, X-tree, k-d tree, Balltree, Quadtree, Octree, Post-office tree.

Theory: Four Techniques
- Branch and bound (figure: a tree partitioning (p1, ..., p5) into (p1, p2, p3) and (p4, p5), then (p1, p3) and p2, searched for a query q)
- Greedy walks
- Mappings: LSH, random projections, minhashing
- Epsilon nets: works for small intrinsic dimension

Future of Similarity Search
- Cloud implementation, Hadoop stack
- API (as in OpenCalais, Google Language API, WolframAlpha)
- Focus on memory

Locality-Sensitive Hashing

LSH vs. the Ideal Algorithm
− Any data model
− Exact nearest neighbor
− Zero probability of error
+ Provable efficiency
+ Any dimension
+ Any dataset
+* Near-linear space
+* Logarithmic search time

Approximate Algorithms
- c-approximate r-range query: if there is at least one p ∈ S with d(q, p) ≤ r, return some p′ with d(q, p′) ≤ cr.
- c-approximate nearest neighbor query: return some p′ ∈ S with d(p′, q) ≤ c·r_NN, where r_NN = min_{p ∈ S} d(p, q).
- Today we consider only range queries.

Today's Focus
- Data model: d-dimensional Euclidean space R^d
- Our goal: provable performance bounds, sublinear search time, near-linear preprocessing space
- Still an open problem: approximate nearest neighbor search with logarithmic search time and linear preprocessing space

Locality-Sensitive Hashing: General Scheme

Definition of LSH (Indyk and Motwani, 1998)
A hash family H is locality-sensitive with parameters (c, r, P1, P2) if:
- whenever ‖p − q‖ ≤ r, then Pr_H[h(p) = h(q)] ≥ P1;
- whenever ‖p − q‖ ≥ cr, then Pr_H[h(p) = h(q)] ≤ P2.

The Power of LSH
Notation: ρ = log(1/P1) / log(1/P2) < 1.
Theorem: any (c, r, P1, P2)-locality-sensitive hash family leads to an algorithm for c-approximate r-range search with (roughly) n^ρ query time and n^(1+ρ) preprocessing space.
Proof in the next four slides.

LSH: Preprocessing
Composite hash function: g(p) = ⟨h1(p), ..., hk(p)⟩.
Preprocessing with parameters L, k:
1. Choose at random L composite hash functions of k components each.
2. Hash every p ∈ S into the buckets g1(p), ..., gL(p).
Preprocessing space: O(Ln).

LSH: Search
1. Compute g1(q), ..., gL(q).
2. Go to the corresponding buckets and explicitly check whether d(p, q) ≤ cr for every point there.
3. Stopping conditions: (1) we found a satisfying object, or (2) we have tried at least 3L objects.
Search time: O(L).

LSH: Analysis (1/2)
Fact: 1 − x ≤ e^(−x) for x ∈ [0, 1].
To have probability of error at most δ, we set k and L such that:
- the expected number of cr-far objects to be tried is P2^k · L · n = L;
- the chance that a true r-neighbor is never hashed to the same bucket as q, which is (1 − P1^k)^L ≤ e^(−P1^k · L), is at most δ.
Rewriting these constraints: P2^k · n = 1 and L = P1^(−k) · log(1/δ).

LSH: Analysis (2/2)
Solution:
- k = log n / log(1/P2)
- L = P1^(−k) · log(1/δ) = n^(log(1/P1)/log(1/P2)) · log(1/δ) = n^ρ · log(1/δ)
Preprocessing space: O(Ln) ≈ n^(1+ρ+o(1)). Search time: O(L) ≈ n^(ρ+o(1)).

LSH Based on p-Stable Distributions (Datar, Immorlica, Indyk and Mirrokni, 2004)
h(p) = ⌊(p · r)/w + b⌋
- r: a random vector whose coordinates come from N(0, 1)
- b: a random shift from [0, 1]
- w: quantization width
For this family, ρ(c) < 1/c.
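Pulling the pieces above together, here is a minimal sketch of the whole scheme: the p-stable (Gaussian projection) hash family, composite hash functions of k components, L hash tables, and the bounded-work range search. NumPy is assumed; the class name PStableLSH, the default width w, the random seed, and the choose_parameters helper are illustrative choices, not part of the slides.

```python
import math
from collections import defaultdict

import numpy as np


def choose_parameters(n, p1, p2, delta=0.1):
    """Parameter choice following the analysis above:
    k = log n / log(1/P2), L = P1^(-k) * log(1/delta) ~ n^rho * log(1/delta)."""
    k = max(1, round(math.log(n) / math.log(1.0 / p2)))
    L = max(1, round(p1 ** (-k) * math.log(1.0 / delta)))
    return k, L


class PStableLSH:
    """Sketch of an LSH index for c-approximate r-range search in R^d,
    using the p-stable hash family h(p) = floor(p.r / w + b)."""

    def __init__(self, dim, k, L, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                   # quantization width
        self.L = L
        self.r = rng.standard_normal((L, k, dim))    # random directions, N(0, 1) coordinates
        self.b = rng.uniform(0.0, 1.0, size=(L, k))  # random shifts from [0, 1]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = None

    def _keys(self, p):
        # Composite hashes g_j(p) = <h_1(p), ..., h_k(p)>, one k-tuple per table.
        proj = np.floor(self.r @ p / self.w + self.b)  # shape (L, k)
        return [tuple(row.astype(int)) for row in proj]

    def build(self, points):
        # Preprocessing: hash every database point into its L buckets, O(Ln) space.
        self.points = np.asarray(points, dtype=float)
        for i, p in enumerate(self.points):
            for table, key in zip(self.tables, self._keys(p)):
                table[key].append(i)

    def range_query(self, q, c, r):
        # Search: scan q's L buckets, check d(p, q) <= cr explicitly,
        # and stop after a success or after trying 3L candidates.
        q = np.asarray(q, dtype=float)
        tried = 0
        for table, key in zip(self.tables, self._keys(q)):
            for i in table.get(key, []):
                if np.linalg.norm(self.points[i] - q) <= c * r:
                    return i                 # index of a cr-near neighbor
                tried += 1
                if tried >= 3 * self.L:
                    return None              # give up: bounded work per query
        return None


# Tiny usage example on synthetic data (illustrative parameters).
rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 20))
k, L = choose_parameters(n=1000, p1=0.9, p2=0.5, delta=0.1)
index = PStableLSH(dim=20, k=k, L=L)
index.build(data)
query = data[42] + 0.01 * rng.normal(size=20)
print(index.range_query(query, c=2.0, r=0.5))  # typically prints 42, a cr-near index
```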