Algorithms for Similarity Search
Yury Lifshits
Yahoo! Research Content Optimization
ESSCASS 2010

Outline
1 High-level overview
2 Locality-sensitive hashing
3 Combinatorial approach



Similarity Search: an Example
- Input: a dataset of objects
- Task: preprocess it
- Query: a new object q
- Task: find the most similar one in the dataset
[Figure: dataset points p1, ..., p6 with the query; one point is marked as the most similar]

More Formally
- Search space: object domain U, similarity function σ
- Input: database S = {p1, ..., pn} ⊆ U
- Query: q ∈ U
- Task: find argmax_{pi} σ(pi, q)
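As a reference point, here is a minimal brute-force sketch of the task just defined (Python; the cosine similarity used as σ and the toy data are illustrative choices, not from the slides):

```python
import math

def cosine_similarity(p, q):
    """An example similarity function sigma: U x U -> R (illustrative choice)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))
    return dot / norm

def nearest_neighbor(S, q, sigma=cosine_similarity):
    """Linear scan: return argmax over pi in S of sigma(pi, q)."""
    return max(S, key=lambda p: sigma(p, q))

# Usage: a toy database of 2-d points and a query.
S = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
q = [0.6, 0.8]
print(nearest_neighbor(S, q))  # -> [0.7, 0.7], the most similar object; O(n) per query
```

The point of the preprocessing step, and of the rest of the lecture, is to answer such queries faster than this linear scan.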

Applications (1/5): Information Retrieval
- Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
- Spelling correction
- Geographic databases (post-office problem)
- Searching for similar DNA sequences
- Related pages web search
- Semantic search, concept matching

Applications (2/5): Machine Learning
- kNN classification rule: classify by the majority of the k nearest training examples, e.g. recognition of faces, fingerprints, speaker identity, optical characters
- Nearest-neighbor interpolation


Applications (3/5): Data Mining
- Near-duplicate detection
- Plagiarism detection
- Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation, ...)
- Key difference: mostly off-line problems

Applications (4/5): Bipartite Problems
- Recommendation systems (most relevant movie to a set of already watched ones)
- Personalized news aggregation (most relevant news articles to a given user's profile of interests)
- Behavioral targeting (most relevant ad for displaying to a given user)
- Key differences: query and database objects have different nature; objects are described by features and connections

Applications (5/5): As a Subroutine
- Coding theory (maximum likelihood decoding)
- MPEG compression (searching for similar fragments in the already compressed part)
- Clustering

Variations of the Computation Task
Additional requirements:
- Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function
Related problems:
- Nearest neighbor: nearest museum to my hotel
- Reverse nearest neighbor: all museums for which my hotel is the nearest one
- Range queries: all museums up to 2 km from my hotel
- Closest pair: the closest pair of a museum and a hotel
- Spatial join: pairs of hotels and museums which are at most 1 km apart
- Multiple nearest neighbors: the nearest museums for each of these hotels
- Metric facility location: how to build hotels to minimize the sum of museum-to-nearest-hotel distances


Brief History
- 1908: Voronoi diagram
- 1967: kNN classification rule by Cover and Hart
- 1973: Post-office problem posed by Knuth
- 1997: The paper by Kleinberg, beginning of provable upper/lower bounds
- 2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
- 2008: First International Workshop on Similarity Search
- 2009: Amazon acquires SnapTell

Ideal Algorithm
+ Any data model
+ Any dimension
+ Any dataset
+ Near-linear space
+ Logarithmic search time
+ Exact nearest neighbor
+ Zero probability of error
+ Provable efficiency

Nearest Neighbors in Theory
A zoo of data structures has been proposed: k-d tree, k-d-B tree, R-tree, R*-tree, SS-tree, X-tree, BBD tree, M-tree, vantage-point tree, multi-vantage point tree, mvp-tree, vps-tree, Voronoi tree, balanced aspect ratio tree, geometric near-neighbor access tree, excluded middle vantage point forest, fixed-height fixed-queries tree, fixed queries tree, Burkhard-Keller tree, AESA, LAESA, Orchard's algorithm, sphere/rectangle tree, spatial approximation tree, bisector tree, mb-tree, generalized hyperplane tree, hybrid tree, slim tree, spill tree, balltree, cover tree, navigating nets, post-office tree, locality-sensitive hashing, epsilon nets, ...

Theory: Four Techniques
- Branch and bound
- Greedy walks
- Mappings: LSH, random projections, minhashing
- Epsilon nets: works for small intrinsic dimension
[Figure: branch-and-bound example: the dataset (p1, ..., p5) is split into (p1, p2, p3) and (p4, p5), then further into (p1, p3), p2, p4, p5, with query q shown]


Future of Similarity Search
- Cloud implementation
- Hadoop stack
- API (as in OpenCalais, Google Language API, WolframAlpha)
- Focus on memory

Locality-Sensitive Hashing

Approximate Algorithms
- c-approximate r-range query: if there is at least one p ∈ S with d(q, p) ≤ r, return some p' with d(q, p') ≤ cr
- c-approximate nearest neighbor query: return some p' ∈ S with d(p', q) ≤ c·rNN, where rNN = min_{p ∈ S} d(p, q)
- Today we consider only range queries
[Figure: query q with the inner radius r and the outer radius cr]

LSH vs. the Ideal Algorithm
– Any data model
+ Any dimension
+ Any dataset
+ Near-linear space
+* Logarithmic search time
– Exact nearest neighbor
– Zero probability of error
+ Provable efficiency

Today's Focus
- Data model: d-dimensional Euclidean space R^d
- Our goal: provable performance bounds
- Sublinear search time, near-linear preprocessing space
- Still an open problem: approximate search with logarithmic search time and linear preprocessing space

Locality-Sensitive Hashing: General Scheme

Definition of LSH (Indyk and Motwani '98)
A hash family H is locality-sensitive with parameters (c, r, P1, P2) if:
- If ||p − q|| ≤ r then Pr_H[h(p) = h(q)] ≥ P1
- If ||p − q|| ≥ cr then Pr_H[h(p) = h(q)] ≤ P2

The Power of LSH
Notation: ρ = log(1/P1) / log(1/P2) < 1
Theorem: Any (c, r, P1, P2)-locality-sensitive hash family leads to an algorithm for c-approximate r-range search with (roughly) n^ρ query time and n^(1+ρ) preprocessing space.
Proof in the next four slides.
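For concreteness, one standard family satisfying this definition (a textbook example, not taken from these slides) is bit sampling for Hamming distance over {0,1}^d: h picks a random coordinate, so two vectors at Hamming distance t collide with probability 1 − t/d, giving P1 = 1 − r/d and P2 = 1 − cr/d.

```python
import random

def make_bit_sampling_hash(d):
    """One random h from the bit-sampling LSH family for Hamming distance.
    For binary vectors p, q in {0,1}^d: Pr[h(p) = h(q)] = 1 - hamming(p, q)/d,
    so P1 = 1 - r/d and P2 = 1 - c*r/d in the (c, r, P1, P2) definition."""
    i = random.randrange(d)
    return lambda p: p[i]

# Quick check of the collision probability for two vectors at Hamming distance 2.
d = 16
p = [0] * d
q = [0] * d
q[0] = q[1] = 1                      # hamming(p, q) = 2
hs = [make_bit_sampling_hash(d) for _ in range(10000)]
print(sum(h(p) == h(q) for h in hs) / len(hs))   # close to 1 - 2/16 = 0.875
```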


LSH: Preprocessing
Composite hash function: g(p) = <h1(p), ..., hk(p)>
Preprocessing with parameters L, k:
1 Choose at random L composite hash functions of k components each
2 Hash every p ∈ S into buckets g1(p), ..., gL(p)
Preprocessing space: O(Ln)

LSH: Search
1 Compute g1(q), ..., gL(q)
2 Go to the corresponding buckets and explicitly check whether d(p, q) ≤ cr for every point there
3 Stopping conditions: (1) we found a satisfying object, or (2) we tried at least 3L objects
Search time is O(L)
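A minimal sketch of this preprocessing/search scheme (Python; `make_hash` is assumed to be a factory producing one random hash function from some locality-sensitive family, e.g. the bit-sampling example above; the bucket layout and helper names are illustrative, while the k, L, and 3L logic follows the slides):

```python
from collections import defaultdict

def build_index(S, make_hash, k, L):
    """Preprocessing: L composite functions g_j = <h_1, ..., h_k>; every point
    is hashed into its L buckets. Space is O(L * n)."""
    gs = [[make_hash() for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in S:
        for g, table in zip(gs, tables):
            key = tuple(h(p) for h in g)
            table[key].append(p)
    return gs, tables

def range_query(q, gs, tables, dist, c, r, L):
    """Search: probe the L buckets of q, check candidates explicitly,
    stop after a satisfying object or after roughly 3L candidates."""
    tried = 0
    for g, table in zip(gs, tables):
        key = tuple(h(q) for h in g)
        for p in table.get(key, []):
            if dist(p, q) <= c * r:
                return p            # stopping condition (1): satisfying object found
            tried += 1
            if tried >= 3 * L:
                return None         # stopping condition (2): tried at least 3L objects
    return None
```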

LSH: Analysis (1/2)
- Fact: 1 − x ≤ e^(−x) for x ∈ [0, 1]
- The expected number of cr-far objects to be tried is P2^k · L · n = L (using the parameter choice below)
- The chance that a true r-neighbor is never hashed to the same bucket as q is (1 − P1^k)^L ≤ e^(−P1^k · L), which is at most δ

LSH: Analysis (2/2)
In order to have probability of error at most δ we set k, L such that:
  P2^k · n = 1,   L = P1^(−k) · log(1/δ)
Solution:
  k = log n / log(1/P2)
Rewriting these constraints:
  L = P1^(−k) · log(1/δ) = n^(log(1/P1)/log(1/P2)) · log(1/δ) = n^ρ · log(1/δ)
Preprocessing space: O(Ln) ≈ n^(1+ρ+o(1))
Search: O(L) ≈ n^(ρ+o(1))
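A worked instance of this parameter setting, with illustrative numbers not taken from the slides (P1 = 0.8, P2 = 0.5, n = 10^6, δ = 0.1):

```python
import math

P1, P2, n, delta = 0.8, 0.5, 10**6, 0.1

rho = math.log(1 / P1) / math.log(1 / P2)   # exponent rho, here about 0.32
k = math.log(n) / math.log(1 / P2)          # from P2^k * n = 1, about 19.9 components per g
L = P1 ** (-k) * math.log(1 / delta)        # = n^rho * log(1/delta), about 197 tables

print(rho, k, L)   # ~0.32, ~19.9, ~197: far fewer than n = 10^6 buckets to probe
```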

LSH Based on p-Stable Distributions (Datar, Immorlica, Indyk and Mirrokni '04)

h(p) = ⌊ (p · r)/w + b ⌋

- r: random vector, every coordinate comes from N(0, 1)
- b: random shift from [0, 1]
- w: quantization width

For this family, ρ(c) < 1/c.
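A minimal sketch of this hash function (Python; the dimension and the width w = 4.0 are arbitrary illustrative values):

```python
import math
import random

def make_pstable_hash(d, w):
    """One hash function h(p) = floor(p.r / w + b), with r ~ N(0,1)^d
    and b uniform in [0, 1], following the formula on the slide."""
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0.0, 1.0)
    def h(p):
        projection = sum(pi * ri for pi, ri in zip(p, r)) / w
        return math.floor(projection + b)
    return h

# Usage: nearby points in R^3 usually collide, distant ones usually do not.
h = make_pstable_hash(d=3, w=4.0)
print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 3.0]))     # likely equal
print(h([1.0, 2.0, 3.0]), h([50.0, -40.0, 9.0]))  # likely different
```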
