SRS: Solving C-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index

Total Pages: 16

File Type: PDF, Size: 1020 KB

SRS: Solving c-Approximate Nearest Neighbor Queries in High Dimensional Euclidean Space with a Tiny Index

Yifang Sun†, Wei Wang†, Jianbin Qin†, Ying Zhang‡, Xuemin Lin†
†University of New South Wales, Australia   ‡University of Technology Sydney, Australia
{yifangs, weiw, jqin, [email protected]}, [email protected]

ABSTRACT

Nearest neighbor searches in high-dimensional space have many important applications in domains such as data mining and multimedia databases. The problem is challenging due to the phenomenon called "curse of dimensionality". An alternative solution is to consider algorithms that return a c-approximate nearest neighbor (c-ANN) with guaranteed probabilities. Locality Sensitive Hashing (LSH) is among the most widely adopted methods, and it achieves high efficiency both in theory and practice. However, it is known to require an extremely high amount of space for indexing, hence limiting its scalability.

In this paper, we propose several surprisingly simple methods to answer c-ANN queries with theoretical guarantees requiring only a single tiny index. Our methods are highly flexible and support a variety of functionalities, such as finding the exact nearest neighbor with any given probability. In the experiments, our methods demonstrate superior performance against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC.

1. INTRODUCTION

Given a d-dimensional query point q in a Euclidean space, a nearest neighbor (NN) query returns the point o in a dataset D such that the Euclidean distance between q and o is the minimum. It is a fundamental problem and has wide applications in many domain areas, such as computational geometry, data mining, information retrieval, multimedia databases, and data compression.

While efficient solutions exist for NN queries in low-dimensional space, they are challenging in high-dimensional space. This is mainly due to the phenomenon of "curse of dimensionality", where indexing-based methods are outperformed by the brute-force linear scan method [45].²

The curse of dimensionality can be largely mitigated by allowing a small amount of error. This is the c-ANN query, which returns a point o′ such that its distance to q is no more than c times the distance of the nearest point. Locality sensitive hashing (LSH) [21] is a widely adopted method that answers such a query in sublinear time with constant probability. It also enjoys great success in practice, due to its excellent performance and ease of implementation.

A major limitation of LSH-based methods is that their indexes are typically very large. The original LSH method requires indexing space superlinear in the number of data points [14], and the index typically includes hundreds or thousands of hash tables in practice. Recently, several LSH-based variants, such as LSB-forest [42] and C2LSH [17], have been proposed to build a smaller index without losing the performance guarantees of the original LSH method. However, their index sizes are still superlinear in the number of points; this prevents their usage for large datasets.¹

In this paper, we propose a surprisingly simple algorithm to solve the c-ANN problem with (1) rigorous theoretical guarantees, (2) an extremely small index, and (3) superior empirical performance compared to existing methods. Our methods are based on projecting high-dimensional points into an appropriately chosen low-dimensional space via 2-stable random projections. The key observation is that the ratio of the inter-point distance in the projected space (called the projected distance) to that in the original high-dimensional space follows a known distribution, which has a sharp concentration bound. Therefore, given any threshold on the projected distance, and for any point o, we can compute exactly the probability that o's projected distance is within the threshold.
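To make the key observation concrete, here is a minimal sketch (ours, not the paper's implementation; the dimensions, threshold, and function names are made up for illustration). It draws a 2-stable (Gaussian) random projection and computes, for a point at a given original distance, the exact probability that its projected distance falls within a threshold, using the fact that ‖A·(o − q)‖² / ‖o − q‖² follows a chi-squared distribution with m degrees of freedom when A has i.i.d. N(0, 1) entries:

    import numpy as np
    from scipy.stats import chi2

    def make_projection(d, m, seed=0):
        """Draw an m x d matrix with i.i.d. N(0, 1) entries (a 2-stable random projection)."""
        return np.random.default_rng(seed).standard_normal((m, d))

    def prob_projected_within(orig_dist, threshold, m):
        """P[projected distance <= threshold] for a point at original distance orig_dist.
        Uses ||A(o - q)||^2 / ||o - q||^2 ~ chi-squared with m degrees of freedom."""
        return chi2.cdf(threshold**2 / orig_dist**2, df=m)

    # Tiny illustration with made-up numbers.
    d, m = 128, 6
    A = make_projection(d, m)
    o_minus_q = np.random.default_rng(1).standard_normal(d)
    projected = A @ o_minus_q                       # image of (o - q) in the m-dimensional space
    print(np.linalg.norm(projected))                # its projected distance
    print(prob_projected_within(np.linalg.norm(o_minus_q), threshold=25.0, m=m))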
This observation gives rise to our basic algorithm, which reduces a d-dimensional c-ANN query to an m-dimensional exact k-NN query equipped with an early-termination test. It can be shown that the resulting algorithm answers a c-ANN query with at least constant probability using γ1 · n I/Os in the worst case, using an index of γ2 · n pages, where γ1 ≪ 1 and γ2 ≪ 1 are two small constants. For example, in a typical setting, we have γ1 = 0.0083 and γ2 = 0.0059. We also derive several variants of the basic algorithm, which offer various new functionalities (as described in Section 5.3.2 and shown in Table 1). Our proposed algorithms can also be extended to support returning the c-approximate k-nearest neighbors (i.e., c-k-ANN queries) with conditional guarantees.

The contributions of the paper are summarized below:

• We proposed a novel solution to answer c-ANN queries for high-dimensional points. Our methods use a tiny amount of space for the index, and answer the query with bounded I/O costs and guaranteed success probability. They also have excellent empirical performance and allow the user to fine-tune the tradeoff between query time and result quality. We present the technical details in Sections 4 and 5 and the theoretical analysis in Section 6.

• We obtained several new results with our proposal: (i) our SRS-12 algorithm is the first c-ANN query processing algorithm with guarantees that uses a linear-sized index and works with any approximation ratio. Previous methods use superlinear index size and/or cannot work with non-integer c or c < 2 (see Table 1 for a comparison). (ii) Our SRS-2 algorithm returns the exact nearest neighbor with any given probability, which is not possible for LSH-based methods since they require c > 1. (iii) There is no approximation ratio guarantee for c-k-ANN query results by existing LSH-based methods for k > 1, while our SRS-12 provides such a guarantee under certain conditions.

• We performed an extensive performance evaluation against the state-of-the-art LSH-based methods in the external memory setting. In particular, we used datasets with 8 million and 1 billion points, which is much larger than those previously reported. Experiments on these large datasets not only demonstrate the need for a linear-sized index, but …

¹ Note that the data size is always O(n · d).
² As will be pointed out in Section 3, existing methods using space linear in the number of data points (n) typically cannot outperform the linear scan method, whose query complexity is 0.25n in this example.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 1. Copyright 2014 VLDB Endowment 2150-8097/14/09.

Table 1: Comparison of Algorithms
Algorithm | Success Probability | Index Size | Query Cost | Constraint on Approximation Ratio c | Comment
LSB-forest [42] | 1/2 − 1/e | O((dn/B)^1.5) | O((dn/B)^0.5 log_B n) | c ≥ 4 and must be the square of an integer | Index size superlinear in both n and d
C2LSH [17] | 1/2 − 1/e | O((n log n)/B) (or O(n/B)) | O((n log n)/B) (or O(n/B)) | c ≥ 4 and must be the square of an integer | β = O(1) (or β = O(1/n))
SRS-12(T, c, p_τ) | 1/2 − 1/e | γ2 · n (worst case), γ2 ≪ 1 | γ1 · n (worst case), γ1 ≪ 1 | c ≥ 1 | Fastest
SRS-1(T, c) | 1/2 − 1/e | γ2 · n (worst case), γ2 ≪ 1 | γ1 · n (worst case), γ1 ≪ 1 | c ≥ 1 | Better quality
SRS-12⁺(T, c′, p_τ) | 1/2 − 1/e | γ2 · n (worst case), γ2 ≪ 1 | γ1 · n (worst case), γ1 ≪ 1 | c ≥ 1 | Tunable quality
SRS-2(c, p) | p | γ2 · n (worst case), γ2 ≪ 1 | O(n/B) | c ≥ 1 | Supports finding NN
In this paper, we consider a dataset D which contains n points in a d-dimensional Euclidean space ℝ^d. We are particularly interested in the high-dimensional case where d is a fairly large number (e.g., d ≥ 50). The coordinate value of o on the i-th dimension is denoted as o[i]. The Euclidean distance between two points, dist(o1, o2), is defined as sqrt( Σ_{i=1}^{d} (o1[i] − o2[i])² ). Throughout the rest of the paper, we are concerned with a point's distance with respect to a query point q, so we use the shorthand notation dist(o) := dist(o, q), and refer to it as the distance of point o.

For simplicity, we assume there is no draw (tie) in terms of distances. This makes the following definitions unique and deterministic. The results in this paper still hold without such an assumption by a straightforward extension.

Given a query point q, the nearest neighbor (NN) of q (denoted as o*) is the point in D that has the smallest distance. We can generalize the concept to the i-th nearest neighbor (denoted as o*_i). A k nearest neighbor query, or k-NN query, returns the ordered set { o*_1, o*_2, ..., o*_k }.

Given an approximation ratio c > 1, a c-approximate nearest neighbor query, or c-ANN query, returns a point o ∈ D such that dist(o) ≤ c · dist(o*); such a point o is also called a c-ANN point. Similarly, a c-approximate k-nearest neighbor query, or c-k-ANN query, returns k points o_i ∈ D (1 ≤ i ≤ k) such that dist(o_i) ≤ c · dist(o*_i).
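As a concrete reference for these definitions, the following sketch (ours, with made-up data and parameters, not code from the paper) answers an exact k-NN query by linear scan and checks whether a candidate point satisfies the c-ANN condition dist(o) ≤ c · dist(o*):

    import numpy as np

    def knn_linear_scan(data, q, k):
        """Exact k-NN by brute force: indices and distances of the k closest points to q."""
        dists = np.linalg.norm(data - q, axis=1)
        order = np.argsort(dists)[:k]
        return order, dists[order]

    def is_c_ann(data, q, candidate_idx, c):
        """Check the c-ANN condition: dist(candidate) <= c * dist(true nearest neighbor)."""
        dists = np.linalg.norm(data - q, axis=1)
        return dists[candidate_idx] <= c * dists.min()

    rng = np.random.default_rng(0)
    data = rng.standard_normal((10_000, 64))
    q = rng.standard_normal(64)
    idx, d = knn_linear_scan(data, q, k=5)
    print(idx, is_c_ann(data, q, idx[0], c=2.0))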
Recommended publications
  • Lecture 04 Linear Structures Sort
Algorithmics (6EAP) MTAT.03.238: Linear structures, sorting, searching, etc. Jaak Vilo, 2018 Fall.

Big-Oh notation classes:
• f(n) ∈ o(g(n)): f is dominated by g (strictly below, <)
• f(n) ∈ O(g(n)): bounded from above (upper bound, ≤)
• f(n) ∈ Θ(g(n)): bounded from above and below ("equal to", =)
• f(n) ∈ Ω(g(n)): bounded from below (lower bound, ≥)
• f(n) ∈ ω(g(n)): f dominates g (strictly above, >)

Conclusions: algorithm complexity deals with long-term behavior: worst case (typical), average case (quite hard), best case (bogus, cheating). In practice the long-term view is sometimes unnecessary; e.g. for sorting 20 elements you don't need fancy algorithms.

Linear, sequential, ordered lists: memory, disk, tape, etc. are ordered, sequentially addressed media. A physical ordered list is essentially an array: memory (addresses, garbage collection), files (character/byte lists, lines in a text file), disk (disk fragmentation).

Linear data structures, arrays: array, bidirectional map, bit array, bit field, bitboard, bitmap, circular buffer, control table, image, dynamic array, gap buffer, hashed array tree, heightmap, lookup table, matrix, parallel array, sorted array, sparse array, sparse matrix, Iliffe vector, variable-length array.

Linear data structures, lists: doubly linked list, array list, XOR linked list, linked list, zipper, self-organizing list, doubly connected edge list, skip list, unrolled linked list, difference list, VList.

Lists stored in an array: an array L = int[MAX_SIZE] holding the first "size" elements, e.g. 3, 6, 7, 5, 2.
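The slide's closing sketch of a list stored in an array (L = int[MAX_SIZE] holding 3, 6, 7, 5, 2) can be written out as a minimal array-backed list; this is our own illustration, not code from the lecture:

    class ArrayList:
        """Fixed-capacity list backed by a preallocated array (illustrative sketch)."""
        def __init__(self, capacity):
            self._data = [0] * capacity   # the MAX_SIZE-sized backing array
            self._size = 0                # number of slots currently in use

        def append(self, value):
            if self._size == len(self._data):
                raise OverflowError("array list is full")
            self._data[self._size] = value
            self._size += 1

        def get(self, i):
            if not 0 <= i < self._size:
                raise IndexError(i)
            return self._data[i]          # O(1) random access by index

    lst = ArrayList(capacity=8)
    for x in (3, 6, 7, 5, 2):
        lst.append(x)
    print(lst.get(0), lst.get(4))         # 3 2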
  • Mining of Massive Datasets
Mining of Massive Datasets. Anand Rajaraman (Kosmix, Inc.), Jeffrey D. Ullman (Stanford Univ.). Copyright © 2010, 2011 Anand Rajaraman and Jeffrey D. Ullman.

Preface. This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled "Web Mining," was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the Book Is About. At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to "train" a machine-learning engine of some sort. The principal topics covered are:
1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google's PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6.
  • Applied Statistics
ISSN 1932-6157 (print), ISSN 1941-7330 (online). THE ANNALS OF APPLIED STATISTICS, an official journal of the Institute of Mathematical Statistics. Special section in memory of Stephen E. Fienberg (1942–2016), AOAS Editor-in-Chief 2013–2015.

Contents:
• Editorial, iii
• On Stephen E. Fienberg as a discussant and a friend (Donald B. Rubin), 683
• Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election (Xiao-Li Meng), 685
• Hypothesis testing for high-dimensional multinomials: A selective review (Sivaraman Balakrishnan and Larry Wasserman), 727
• When should modes of inference disagree? Some simple but challenging examples (D. A. S. Fraser, N. Reid and Wei Lin), 750
• Fingerprint science (Joseph B. Kadane), 771
• Statistical modeling and analysis of trace element concentrations in forensic glass evidence (Karen D. H. Pan and Karen Kafadar), 788
• Loglinear model selection and human mobility (Adrian Dobra and Reza Mohammadi), 815
• On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example (Yen-Chi Chen, Y. Samuel Wang and Elena A. Erosheva), 846
• Providing accurate models across private partitioned data: Secure maximum likelihood estimation (Joshua Snoke, Timothy R. Brick, Aleksandra Slavković and Michael D. Hunter), 877
• Clustering the prevalence of pediatric …
  • Arxiv:2102.08942V1 [Cs.DB]
    A Survey on Locality Sensitive Hashing Algorithms and their Applications OMID JAFARI, New Mexico State University, USA PREETI MAURYA, New Mexico State University, USA PARTH NAGARKAR, New Mexico State University, USA KHANDKER MUSHFIQUL ISLAM, New Mexico State University, USA CHIDAMBARAM CRUSHEV, New Mexico State University, USA Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any other prior survey, we present how Locality Sensitive Hashing is utilized in different application domains. CCS Concepts: • General and reference → Surveys and overviews. Additional Key Words and Phrases: Locality Sensitive Hashing, Approximate Nearest Neighbor Search, High-Dimensional Similarity Search, Indexing 1 INTRODUCTION Finding nearest neighbors in high-dimensional spaces is an important problem in several diverse applications, such as multimedia retrieval, machine learning, biological and geological sciences, etc. For low-dimensions (< 10), popular tree-based index structures, such as KD-tree [12], SR-tree [56], etc. are effective, but for higher number of dimensions, these index structures suffer from the well-known problem, curse of dimensionality (where the performance of these index structures is often out-performed even by linear scans) [21]. Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results.
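As a toy illustration of the LSH idea described in the survey (our own sketch, not code from the survey; names and parameters are made up), the following hashes vectors with random hyperplanes so that similar vectors tend to land in the same bucket, and a query then compares only against its bucket instead of the whole dataset:

    import numpy as np

    class HyperplaneLSH:
        """Random-hyperplane LSH for cosine similarity (illustrative sketch)."""
        def __init__(self, dim, num_bits=16, seed=0):
            self.planes = np.random.default_rng(seed).standard_normal((num_bits, dim))
            self.buckets = {}

        def _key(self, v):
            # The sign pattern against each hyperplane forms the bucket key.
            return tuple((self.planes @ v > 0).astype(int))

        def index(self, vectors):
            for i, v in enumerate(vectors):
                self.buckets.setdefault(self._key(v), []).append(i)

        def query(self, q):
            # Candidates are only the points sharing q's bucket (possibly none).
            return self.buckets.get(self._key(q), [])

    rng = np.random.default_rng(1)
    data = rng.standard_normal((5000, 32))
    lsh = HyperplaneLSH(dim=32)
    lsh.index(data)
    print(lsh.query(data[0]))   # candidate indices, expected to include 0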
  • Cover Trees for Nearest Neighbor
Cover Trees for Nearest Neighbor. Alina Beygelzimer ([email protected]), IBM Thomas J. Watson Research Center, Hawthorne, NY 10532; Sham Kakade ([email protected]), TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637; John Langford ([email protected]), TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637.

Abstract. We present a tree data structure for fast nearest neighbor operations in general n-point metric spaces (where the data set consists of n points). The data structure requires O(n) space regardless of the metric's structure yet maintains all performance properties of a navigating net [KL04a]. If the point set has a bounded expansion constant c, which is a measure of the intrinsic dimensionality (as defined in [KR02]), the cover tree data structure can be constructed in O(c^6 n log n) time. Furthermore, nearest neighbor queries require time only logarithmic …

The basic nearest neighbor problem is as follows: given a set S of n points in some metric space (X, d), the problem is to preprocess S so that given a query point p ∈ X, one can efficiently find a point q ∈ S which minimizes d(p, q).

Context. For general metrics, finding (or even approximating) the nearest neighbor of a point requires Ω(n) time. The classical example is a uniform metric where every pair of points is near the same distance, so there is no structure to take advantage of. However, the metrics of practical interest typically do have some structure which can be exploited to yield significant computational speedups.
  • Faster Cover Trees
Faster Cover Trees. Mike Izbicki ([email protected]), Christian Shelton ([email protected]), University of California Riverside, 900 University Ave, Riverside, CA 92521.

Abstract. The cover tree data structure speeds up exact nearest neighbor queries over arbitrary metric spaces (Beygelzimer et al., 2006). This paper makes cover trees even faster. In particular, we provide:
1. A simpler definition of the cover tree that reduces the number of nodes from O(n) to exactly n,
2. An additional invariant that makes queries faster in practice,
3. Algorithms for constructing and querying the tree in parallel on multiprocessor systems, and
4. A more cache efficient memory layout.
On standard benchmark datasets, we reduce the …

… the nearest neighbor of p in X is defined as p_nn = argmin_{q ∈ X − {p}} d(p, q). The naive method for computing p_nn involves a linear scan of all the data points and takes time Θ(n), but many data structures have been created to speed up this process. The kd-tree (Friedman et al., 1977) is probably the most famous. It is simple and effective in practice, but it can only be used on Euclidean spaces. We must turn to other data structures when given an arbitrary metric space. The simplest and oldest of these structures is the ball tree (Omohundro, 1989). Although attractive for its simplicity, it provides only the trivial runtime guarantee that queries will take time O(n). Subsequent research focused on providing stronger guarantees, producing more complicated data structures like the metric skip list (Karger & Ruhl, 2002) and the navigating net (Krauthgamer & Lee, 2004).
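For reference, the naive Θ(n) computation of p_nn that both cover-tree papers improve upon looks like the sketch below (our own illustration; the metric and data are placeholders):

    def nearest_neighbor(p, points, d):
        """Naive linear scan: return argmin over q in points, q != p, of d(p, q)."""
        best, best_dist = None, float("inf")
        for q in points:
            if q is p:
                continue
            dq = d(p, q)
            if dq < best_dist:
                best, best_dist = q, dq
        return best

    # Example with a one-dimensional metric d(x, y) = |x - y|.
    pts = [4.0, -1.5, 7.2, 3.1]
    p = pts[0]
    print(nearest_neighbor(p, pts, d=lambda x, y: abs(x - y)))   # 3.1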
  • Canopy – Fast Sampling Using Cover Trees
Canopy: Fast Sampling using Cover Trees. Manzil Zaheer.

Abstract.

Background. The amount of unorganized and unlabeled data, specifically images, is growing. To extract knowledge or insights from the data, exploratory data analysis tools such as searching and clustering have become the norm. The ability to search and cluster the data semantically facilitates tasks like user browsing, studying medical images, content recommendation, and computational advertisements among others. However, large image collections are plagued by lack of a good representation in a relevant metric space, thus making organization and navigation difficult. Moreover, the existing methods do not scale for basic operations like searching and clustering.

Aim. To develop a fast and scalable search and clustering framework to enable discovery and exploration of extremely large scale data, in particular images.

Data. To demonstrate the proposed method, we use two datasets of different nature. The first dataset comprises satellite images of seven urban areas from around the world of size 200GB. The second dataset comprises 100M images of size 40TB from Flickr, consisting of a variety of photographs from day to day life without any labels or sometimes with noisy tags (~1% of data).

Methods. The excellent ability of deep networks in learning general representations is exploited to extract features from the images. An efficient hierarchical space partitioning data structure, called cover trees, is utilized so as to reduce search and reuse computations.

Results. The visual search-by-example on satellite images runs in real time on a single machine, and the task of clustering into 64k classes converges in a single day using just 16 commodity machines.
  • SURVEY on VARIOUS-WIDTH CLUSTERING for EFFICIENT K-NEAREST NEIGHBOR SEARCH
Vol. 3, Issue 2, 2017. IJARIIE ISSN(O) 2395-4396. SURVEY ON VARIOUS-WIDTH CLUSTERING FOR EFFICIENT k-NEAREST NEIGHBOR SEARCH. Sneha N. Aware (PG Student, Department of Computer Engineering, Matoshri College of Engineering & Research Centre, Nashik, Maharashtra, India), Prof. Dr. Varsha H. Patil (Head of Department of Computer Engineering, Matoshri College of Engineering & Research Centre, Nashik, Maharashtra, India).

ABSTRACT. Over recent decades, database sizes have grown large. This creates new challenges, because many machine learning algorithms are not able to process such a large volume of information. The k-nearest neighbor (k-NN) method is widely used in many machine learning problems, but the k-NN approach incurs a large computational cost. In this research a k-NN approach based on various-width clustering (VWC) is presented. The VWC-based k-NN search technique efficiently finds k-NNs for a query object from a given data set and creates clusters with various widths. This reduces clustering time in addition to balancing the number of produced clusters and their respective sizes. Keywords: clustering, k-nearest neighbor, various-width clustering, high dimensional, etc.

1. INTRODUCTION. The k-nearest neighbor approach (k-NN) has been extensively used as a powerful non-parametric technique in many scientific and engineering applications. A k-NN technique is widely used in many application areas to find the closest results, but this approach incurs a large computational cost. The general k-NN problem states that, given n objects and a query object Q, the distance from the query object to all training objects needs to be calculated in order to find the k closest objects to Q.
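The general k-NN problem stated above can be accelerated by pruning whole clusters with the triangle inequality, which is the flavor of idea the surveyed method builds on. The sketch below is our own simplified illustration and not the paper's VWC algorithm; clusters, data, and parameters are made up:

    import heapq
    import numpy as np

    def knn_with_cluster_pruning(clusters, q, k):
        """k-NN over clusters given as (center, radius, member_points) triples.
        Clusters whose optimistic lower bound cannot beat the current k-th best are skipped."""
        best = []  # max-heap via negated distances
        order = sorted(clusters, key=lambda c: np.linalg.norm(q - c[0]) - c[1])
        for center, radius, members in order:
            lower = np.linalg.norm(q - center) - radius
            if len(best) == k and lower > -best[0][0]:
                break  # no point in this or any later cluster can improve the answer
            for p in members:
                dist = np.linalg.norm(q - p)
                if len(best) < k:
                    heapq.heappush(best, (-dist, tuple(p)))
                elif dist < -best[0][0]:
                    heapq.heapreplace(best, (-dist, tuple(p)))
        return sorted((-nd, pt) for nd, pt in best)

    rng = np.random.default_rng(0)
    pts = rng.standard_normal((1000, 16))
    centers = pts[rng.choice(1000, 10, replace=False)]
    assign = np.argmin(np.linalg.norm(pts[:, None] - centers[None], axis=2), axis=1)
    clusters = [(centers[j],
                 float(np.max(np.linalg.norm(pts[assign == j] - centers[j], axis=1))),
                 pts[assign == j]) for j in range(10) if np.any(assign == j)]
    print([round(dd, 3) for dd, _ in knn_with_cluster_pruning(clusters, rng.standard_normal(16), k=3)])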
  • Similarity Search Using Locality Sensitive Hashing and Bloom Filter
    Similarity Search using Locality Sensitive Hashing and Bloom Filter Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Computer Science and Engineering Submitted By Sachendra Singh Chauhan (Roll No. 801232021) Under the supervision of Dr. Shalini Batra Assistant Professor COMPUTER SCIENCE AND ENGINEERING DEPARTMENT THAPAR UNIVERSITY PATIALA – 147004 June 2014 ACKNOWLEDGEMENT No volume of words is enough to express my gratitude towards my guide, Dr. Shalini Batra, Assistant Professor, Computer Science and Engineering Department, Thapar University, who has been very concerned and has aided for all the material essential for the preparation of this thesis report. She has helped me to explore this vast topic in an organized manner and provided me with all the ideas on how to work towards a research-oriented venture. I am also thankful to Dr. Deepak Garg, Head of Department, CSED and Dr. Ashutosh Mishra, P.G. Coordinator, for the motivation and inspiration that triggered me for the thesis work. I would also like to thank the faculty members who were always there in the need of the hour and provided with all the help and facilities, which I required, for the completion of my thesis. Most importantly, I would like to thank my parents and the Almighty for showing me the right direction out of the blue, to help me stay calm in the oddest of the times and keep moving even at times when there was no hope. Sachendra Singh Chauhan 801232021 ii Abstract Similarity search of text documents can be reduced to Approximate Nearest Neighbor Search by converting text documents into sets by using Shingling.
  • Compressed Slides
COMPSCI 514: Algorithms for Data Science. Cameron Musco, University of Massachusetts Amherst. Spring 2020. Lecture 7.

Logistics: Problem Set 1 is due tomorrow at 8pm in Gradescope. No class next Tuesday (it's a Monday at UMass). Talk today: Vatsal Sharan at 4pm in CS 151, "Modern Perspectives on Classical Learning Problems: Role of Memory and Data Amplification."

Summary. Last class: hashing for Jaccard similarity; MinHash for estimating the Jaccard similarity; locality sensitive hashing (LSH); application to fast similarity search. This class: finish up MinHash and LSH; the Frequent Elements (heavy-hitters) problem; Misra-Gries summaries.

Jaccard similarity. Jaccard Similarity: J(A, B) = |A ∩ B| / |A ∪ B| = (# shared elements) / (# total elements). Two common use cases:
• Near Neighbor Search: have a database of n sets/bit strings and, given a set A, want to find if it has high similarity to anything in the database. Naively Ω(n) time.
• All-pairs Similarity Search: have n different sets/bit strings. Want to find all pairs with high similarity. Naively Ω(n²) time.

MinHashing. MinHash(A) = min_{a ∈ A} h(a) where h : U → [0, 1] is a random hash. Locality sensitivity: Pr[MinHash(A) = MinHash(B)] = J(A, B). Represents a set with a single number that captures Jaccard similarity information! Given a collision-free hash function g : [0, 1] → [m], Pr[g(MinHash(A)) = g(MinHash(B))] = J(A, B). What is Pr[g(MinHash(A)) = g(MinHash(B))] if g is not collision free? It will be a bit larger than J(A, B).

LSH for similarity search. When searching for similar items, only search for matches that land in the same hash bucket.
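A toy MinHash sketch in the spirit of the slides (our own; it salts Python's built-in hash instead of using a true random hash to [0, 1], which is fine for illustration since matching signature positions still estimate Jaccard similarity):

    import random

    def minhash_signature(s, num_hashes=100, seed=0):
        """One MinHash value per salted hash function."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(64) for _ in range(num_hashes)]
        return [min(hash((salt, x)) for x in s) for salt in salts]

    def estimate_jaccard(sig_a, sig_b):
        """Fraction of matching MinHash coordinates approximates J(A, B)."""
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    A = {"the", "quick", "brown", "fox"}
    B = {"the", "quick", "red", "fox"}
    print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))  # roughly 3/5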
  • Lecture Note
Algorithms for Data Science: Lecture on Finding Similar Items. Barna Saha.

1. Finding Similar Items. Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar to detect plagiarism, mirror websites, multiple versions of the same article, etc. Finding similar items is useful for building recommender systems as well, where we want to find users with similar buying patterns. In Netflix, two movies can be deemed similar if they are rated highly by the same customers. While there are many measures of similarity, in this lecture we will concentrate on one such popular measure known as Jaccard similarity.

Definition (Jaccard Similarity). Given two sets S1 and S2, the Jaccard similarity of S1 and S2 is defined as |S1 ∩ S2| / |S1 ∪ S2|.

Example 1. Let S1 = {1, 2, 3, 4, 7} and S2 = {1, 4, 9, 7, 5}; then |S1 ∪ S2| = 7 and |S1 ∩ S2| = 3. Thus the Jaccard similarity of S1 and S2 is 3/7.

1.1 Document Similarity. To compare two documents to know how similar they are, here is a simple approach:
• Compute k-shingles for a suitable value of k. The k-shingles are all substrings of length k that appear in the document. For example, if a document is abcdabd and k = 2, then the 2-shingles are ab, bc, cd, da, ab, bd.
• Compare the sets of k-shingles based on Jaccard similarity. One will often map the set of shingles to a set of integers via hashing. What should be the right size of the hash table? Select a hash table size to avoid collisions.
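The note's 2-shingle and Jaccard examples, written out as a small script of our own:

    def shingles(doc, k):
        """All length-k substrings of doc, as a set (duplicates collapse)."""
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    def jaccard(s1, s2):
        return len(s1 & s2) / len(s1 | s2)

    print(shingles("abcdabd", 2))                        # {'ab', 'bc', 'cd', 'da', 'bd'}
    print(jaccard({1, 2, 3, 4, 7}, {1, 4, 9, 7, 5}))     # 3/7, about 0.4286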