Algorithms for Similarity Search
Yury Lifshits
Yahoo! Research Content Optimization
ESSCASS 2010

Outline
1 High-level overview
2 Locality-sensitive hashing
3 Combinatorial approach



Similarity Search: an Example
- Input: a dataset of objects
- Task: preprocess it
- Query: a new object q
- Task: find the most similar one in the dataset
[Figure: dataset points p1, ..., p6 with the query; one point is marked as the most similar]

More Formally
- Search space: object domain U, similarity function σ
- Input: database S = {p1, ..., pn} ⊆ U
- Query: q ∈ U
- Task: find argmax_{pi} σ(pi, q)
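As a reference point, here is a minimal brute-force sketch of the task just defined (Python; the cosine similarity used as σ and the toy data are illustrative choices, not from the slides):

```python
import math

def cosine_similarity(p, q):
    """An example similarity function sigma: U x U -> R (illustrative choice)."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(x * x for x in q))
    return dot / norm

def nearest_neighbor(S, q, sigma=cosine_similarity):
    """Linear scan: return argmax over pi in S of sigma(pi, q)."""
    return max(S, key=lambda p: sigma(p, q))

# Usage: a toy database of 2-d points and a query.
S = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
q = [0.6, 0.8]
print(nearest_neighbor(S, q))  # -> [0.7, 0.7], the most similar object; O(n) per query
```

The point of the preprocessing step, and of the rest of the lecture, is to answer such queries faster than this linear scan.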

Applications (1/5): Information Retrieval
- Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
- Spelling correction
- Geographic databases (post-office problem)
- Searching for similar DNA sequences
- Related pages web search
- Semantic search, concept matching

Applications (2/5): Machine Learning
- kNN classification rule: classify by the majority of the k nearest training examples, e.g. recognition of faces, fingerprints, speaker identity, optical characters
- Nearest-neighbor interpolation


Applications (3/5): Data Mining
- Near-duplicate detection
- Plagiarism detection
- Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation, ...)
- Key difference: mostly off-line problems

Applications (4/5): Bipartite Problems
- Recommendation systems (most relevant movie to a set of already watched ones)
- Personalized news aggregation (most relevant news articles to a given user's profile of interests)
- Behavioral targeting (most relevant ad for displaying to a given user)
- Key differences: query and database objects have different nature; objects are described by features and connections

Applications (5/5): As a Subroutine
- Coding theory (maximum likelihood decoding)
- MPEG compression (searching for similar fragments in the already compressed part)
- Clustering

Variations of the Computation Task
Additional requirements:
- Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function
Related problems:
- Nearest neighbor: nearest museum to my hotel
- Reverse nearest neighbor: all museums for which my hotel is the nearest one
- Range queries: all museums up to 2 km from my hotel
- Closest pair: the closest pair of a museum and a hotel
- Spatial join: pairs of hotels and museums which are at most 1 km apart
- Multiple nearest neighbors: the nearest museums for each of these hotels
- Metric facility location: how to build hotels to minimize the sum of museum-to-nearest-hotel distances


Brief History
- 1908: Voronoi diagram
- 1967: kNN classification rule by Cover and Hart
- 1973: Post-office problem posed by Knuth
- 1997: The paper by Kleinberg, beginning of provable upper/lower bounds
- 2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
- 2008: First International Workshop on Similarity Search
- 2009: Amazon acquires SnapTell

Ideal Algorithm
+ Any data model
+ Any dimension
+ Any dataset
+ Near-linear space
+ Logarithmic search time
+ Exact nearest neighbor
+ Zero probability of error
+ Provable efficiency

Nearest Neighbors in Theory
A zoo of data structures has been proposed: k-d tree, k-d-B tree, R-tree, R*-tree, SS-tree, X-tree, BBD tree, M-tree, vantage-point tree, multi-vantage point tree, mvp-tree, vps-tree, Voronoi tree, balanced aspect ratio tree, geometric near-neighbor access tree, excluded middle vantage point forest, fixed-height fixed-queries tree, fixed queries tree, Burkhard-Keller tree, AESA, LAESA, Orchard's algorithm, sphere/rectangle tree, spatial approximation tree, bisector tree, mb-tree, generalized hyperplane tree, hybrid tree, slim tree, spill tree, balltree, cover tree, navigating nets, post-office tree, locality-sensitive hashing, epsilon nets, ...

Theory: Four Techniques
- Branch and bound
- Greedy walks
- Mappings: LSH, random projections, minhashing
- Epsilon nets: works for small intrinsic dimension
[Figure: branch-and-bound example: the dataset (p1, ..., p5) is split into (p1, p2, p3) and (p4, p5), then further into (p1, p3), p2, p4, p5, with query q shown]


Future of Similarity Search
- Cloud implementation
- Hadoop stack
- API (as in OpenCalais, Google Language API, WolframAlpha)
- Focus on memory

Locality-Sensitive Hashing

Approximate Algorithms
- c-approximate r-range query: if there is at least one p ∈ S with d(q, p) ≤ r, return some p' with d(q, p') ≤ cr
- c-approximate nearest neighbor query: return some p' ∈ S with d(p', q) ≤ c·rNN, where rNN = min_{p ∈ S} d(p, q)
- Today we consider only range queries
[Figure: query q with the inner radius r and the outer radius cr]

LSH vs. the Ideal Algorithm
– Any data model
+ Any dimension
+ Any dataset
+ Near-linear space
+* Logarithmic search time
– Exact nearest neighbor
– Zero probability of error
+ Provable efficiency

Today's Focus
- Data model: d-dimensional Euclidean space R^d
- Our goal: provable performance bounds
- Sublinear search time, near-linear preprocessing space
- Still an open problem: approximate search with logarithmic search time and linear preprocessing space

Locality-Sensitive Hashing: General Scheme

Definition of LSH (Indyk and Motwani '98)
A hash family H is locality-sensitive with parameters (c, r, P1, P2) if:
- If ||p − q|| ≤ r then Pr_H[h(p) = h(q)] ≥ P1
- If ||p − q|| ≥ cr then Pr_H[h(p) = h(q)] ≤ P2

The Power of LSH
Notation: ρ = log(1/P1) / log(1/P2) < 1
Theorem: Any (c, r, P1, P2)-locality-sensitive hash family leads to an algorithm for c-approximate r-range search with (roughly) n^ρ query time and n^(1+ρ) preprocessing space.
Proof in the next four slides.
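For concreteness, one standard family satisfying this definition (a textbook example, not taken from these slides) is bit sampling for Hamming distance over {0,1}^d: h picks a random coordinate, so two vectors at Hamming distance t collide with probability 1 − t/d, giving P1 = 1 − r/d and P2 = 1 − cr/d.

```python
import random

def make_bit_sampling_hash(d):
    """One random h from the bit-sampling LSH family for Hamming distance.
    For binary vectors p, q in {0,1}^d: Pr[h(p) = h(q)] = 1 - hamming(p, q)/d,
    so P1 = 1 - r/d and P2 = 1 - c*r/d in the (c, r, P1, P2) definition."""
    i = random.randrange(d)
    return lambda p: p[i]

# Quick check of the collision probability for two vectors at Hamming distance 2.
d = 16
p = [0] * d
q = [0] * d
q[0] = q[1] = 1                      # hamming(p, q) = 2
hs = [make_bit_sampling_hash(d) for _ in range(10000)]
print(sum(h(p) == h(q) for h in hs) / len(hs))   # close to 1 - 2/16 = 0.875
```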


LSH: Preprocessing
Composite hash function: g(p) = <h1(p), ..., hk(p)>
Preprocessing with parameters L, k:
1 Choose at random L composite hash functions of k components each
2 Hash every p ∈ S into buckets g1(p), ..., gL(p)
Preprocessing space: O(Ln)

LSH: Search
1 Compute g1(q), ..., gL(q)
2 Go to the corresponding buckets and explicitly check whether d(p, q) ≤ cr for every point there
3 Stopping conditions: (1) we found a satisfying object, or (2) we tried at least 3L objects
Search time is O(L)
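A minimal sketch of this preprocessing/search scheme (Python; `make_hash` is assumed to be a factory producing one random hash function from some locality-sensitive family, e.g. the bit-sampling example above; the bucket layout and helper names are illustrative, while the k, L, and 3L logic follows the slides):

```python
from collections import defaultdict

def build_index(S, make_hash, k, L):
    """Preprocessing: L composite functions g_j = <h_1, ..., h_k>; every point
    is hashed into its L buckets. Space is O(L * n)."""
    gs = [[make_hash() for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in S:
        for g, table in zip(gs, tables):
            key = tuple(h(p) for h in g)
            table[key].append(p)
    return gs, tables

def range_query(q, gs, tables, dist, c, r, L):
    """Search: probe the L buckets of q, check candidates explicitly,
    stop after a satisfying object or after roughly 3L candidates."""
    tried = 0
    for g, table in zip(gs, tables):
        key = tuple(h(q) for h in g)
        for p in table.get(key, []):
            if dist(p, q) <= c * r:
                return p            # stopping condition (1): satisfying object found
            tried += 1
            if tried >= 3 * L:
                return None         # stopping condition (2): tried at least 3L objects
    return None
```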

LSH: Analysis (1/2)
- Fact: 1 − x ≤ e^(−x) for x ∈ [0, 1]
- The expected number of cr-far objects to be tried is P2^k · L · n = L (using the parameter choice below)
- The chance that a true r-neighbor is never hashed to the same bucket as q is (1 − P1^k)^L ≤ e^(−P1^k · L), which is at most δ

LSH: Analysis (2/2)
In order to have probability of error at most δ we set k, L such that:
  P2^k · n = 1,   L = P1^(−k) · log(1/δ)
Solution:
  k = log n / log(1/P2)
Rewriting these constraints:
  L = P1^(−k) · log(1/δ) = n^(log(1/P1)/log(1/P2)) · log(1/δ) = n^ρ · log(1/δ)
Preprocessing space: O(Ln) ≈ n^(1+ρ+o(1))
Search: O(L) ≈ n^(ρ+o(1))
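A worked instance of this parameter setting, with illustrative numbers not taken from the slides (P1 = 0.8, P2 = 0.5, n = 10^6, δ = 0.1):

```python
import math

P1, P2, n, delta = 0.8, 0.5, 10**6, 0.1

rho = math.log(1 / P1) / math.log(1 / P2)   # exponent rho, here about 0.32
k = math.log(n) / math.log(1 / P2)          # from P2^k * n = 1, about 19.9 components per g
L = P1 ** (-k) * math.log(1 / delta)        # = n^rho * log(1/delta), about 197 tables

print(rho, k, L)   # ~0.32, ~19.9, ~197: far fewer than n = 10^6 buckets to probe
```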

LSH Based on p-Stable Distributions (Datar, Immorlica, Indyk and Mirrokni '04)

h(p) = ⌊ (p · r)/w + b ⌋

- r: random vector, every coordinate comes from N(0, 1)
- b: random shift from [0, 1]
- w: quantization width

For this family, ρ(c) < 1/c.
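A minimal sketch of this hash function (Python; the dimension and the width w = 4.0 are arbitrary illustrative values):

```python
import math
import random

def make_pstable_hash(d, w):
    """One hash function h(p) = floor(p.r / w + b), with r ~ N(0,1)^d
    and b uniform in [0, 1], following the formula on the slide."""
    r = [random.gauss(0.0, 1.0) for _ in range(d)]
    b = random.uniform(0.0, 1.0)
    def h(p):
        projection = sum(pi * ri for pi, ri in zip(p, r)) / w
        return math.floor(projection + b)
    return h

# Usage: nearby points in R^3 usually collide, distant ones usually do not.
h = make_pstable_hash(d=3, w=4.0)
print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 3.0]))     # likely equal
print(h([1.0, 2.0, 3.0]), h([50.0, -40.0, 9.0]))  # likely different
```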
