Set Similarity Search Beyond MinHash∗

Tobias Christiani and Rasmus Pagh
IT University of Copenhagen, Copenhagen, Denmark
[email protected], [email protected]

arXiv:1612.07710v2 [cs.DS] 18 Apr 2017

ABSTRACT

We consider the problem of approximate set similarity search under Braun-Blanquet similarity B(x, y) = |x ∩ y| / max(|x|, |y|). The (b1, b2)-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets P such that, given a query set q, if there exists x ∈ P with B(q, x) ≥ b1, then we can efficiently return x′ ∈ P with B(q, x′) > b2.

We present a simple data structure that solves this problem with space usage O(n^(1+ρ) log n + Σ_{x ∈ P} |x|) and query time O(|q| n^ρ log n), where n = |P| and ρ = log(1/b1) / log(1/b2). Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014), we show that this value of ρ is tight across the parameter space, i.e., for every choice of constants 0 < b2 < b1 < 1. In the case where all sets have the same size, our solution strictly improves upon the value of ρ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998), such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).

∗ The research leading to these results has received funding from the European Research Council under the European Union's 7th Framework Programme (FP7/2007-2013) / ERC grant agreement no. 614331.

1 INTRODUCTION

In this paper we consider the approximate set similarity problem or, equivalently, the problem of approximate Hamming near neighbor search in sparse vectors. Data that can be represented as sparse vectors is ubiquitous — a typical example is the representation of text documents as term vectors, where non-zero vector entries correspond to occurrences of words (or shingles). In order to perform identification of near-identical text documents in web-scale collections, Broder et al. [11, 12] designed and implemented MinHash (a.k.a. min-wise hashing), now understood as a locality-sensitive hash function [21]. This allowed approximate answers to similarity queries to be computed much faster than by other methods, and in particular made it possible to cluster the web pages of the AltaVista search engine (for the purpose of eliminating near-duplicate search results). Almost two decades after it was first described, MinHash remains one of the most widely used locality-sensitive hashing methods, as witnessed by thousands of citations of [11, 12].

A similarity measure maps a pair of vectors to a similarity score in [0, 1]. It will often be convenient to interpret a vector x ∈ {0, 1}^d as the set {i | x_i = 1}. With this convention the Jaccard similarity of two vectors can be expressed as J(x, y) = |x ∩ y| / |x ∪ y|. In approximate similarity search we are interested in the problem of searching a data set P ⊆ {0, 1}^d for a vector of similarity at least j1 with a query vector q ∈ {0, 1}^d, but allow the search algorithm to return a vector of similarity j2 < j1. To simplify the exposition we will assume throughout the introduction that all vectors are t-sparse, i.e., have the same Hamming weight t.

Recent theoretical advances in data structures for approximate near neighbor search in Hamming space [5] make it possible to beat the asymptotic performance of MinHash-based Jaccard similarity search (using the LSH framework of [21]) in cases where the similarity threshold j2 is not too small. However, numerical computations suggest that MinHash is always better when j2 < 1/45.

In this paper we address the problem: Can similarity search using MinHash be improved in general? We give an affirmative answer in the case where all sets have the same size t by introducing Chosen Path: a simple data-independent search method that strictly improves MinHash, and is always better than the data-dependent method of [5] when j2 < 1/9. Similar to data-independent locality-sensitive filtering (LSF) methods [9, 16, 24], our method works by mapping each data (or query) vector to a set of keys that must be stored (or looked up). The name Chosen Path stems from the way the mapping is constructed: as paths in a layered random graph where the vertices at each layer are identified with the set {1, ..., d} of dimensions, and where a vector x is only allowed to choose paths that stick to non-zero components x_i. This is illustrated in Figure 1.

Figure 1: Chosen Path uses a branching process to associate each vector x ∈ {0, 1}^d with a set M_k(x) ⊆ {1, ..., d}^k of paths of length k (in the picture k = 3). The paths associated with x are limited to indices in the set {i | x_i = 1}, represented by an ellipsoid at each level in the illustration. In the example the set sizes are |M_3(x)| = 4 and |M_3(y)| = |M_3(x′)| = 3. Parameters are chosen such that a query y that is similar to x ∈ P is likely to have a common path in x ∩ y (shown as a bold line), whereas it shares few paths in expectation with vectors such as x′ that are not similar.
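To make the path construction concrete, here is a minimal Python sketch of the idea in the figure caption (our illustration, not the authors' implementation; the path length k and survival probability p are arbitrary illustrative choices rather than the parameters the paper derives). Every vector grows paths over its own non-zero coordinates, but whether a given path survives is decided by a pseudo-random coin shared by all vectors, so two vectors can only share surviving paths that lie entirely inside their intersection.

```python
import hashlib

def survives(path, p):
    # Deterministic pseudo-random coin for a path, shared by all vectors.
    h = hashlib.sha256(repr(path).encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < p

def chosen_paths(x, k, p):
    # Grow the set M_k(x) of surviving length-k paths over the support of x.
    level = [(i,) for i in sorted(x) if survives((i,), p)]
    for _ in range(k - 1):
        level = [q + (i,) for q in level for i in sorted(x) if survives(q + (i,), p)]
    return set(level)

x = {1, 4, 5, 7}          # support of a sparse 0/1 vector
y = {1, 4, 5, 9}          # a similar vector: |x ∩ y| = 3
common = chosen_paths(x, 3, 0.5) & chosen_paths(y, 3, 0.5)
print(common)             # surviving paths confined to x ∩ y = {1, 4, 5}
```

For t-sparse vectors each set retains about (pt)^k paths in expectation, while a query y and a data vector x share about (p·|x ∩ y|)^k of them, which is what makes path collisions similarity-sensitive.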
1.1 Related Work

High-dimensional approximate similarity search methods can be characterized in terms of their ρ-value, which is the exponent for which queries can be answered in time O(d n^ρ), where n is the size of the set P and d denotes the dimensionality of the space. Here we focus on the "balanced" case where we aim for space O(n^(1+ρ) + dn), but note that there now exist techniques for obtaining other trade-offs between query time and space overhead [4, 16].
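As a quick worked example of what these exponents mean (our numbers, plugging into the bounds quoted in the abstract): for b1 = 1/2 and b2 = 1/4 the data structure above achieves ρ = log(1/b1)/log(1/b2) = log 2 / log 4 = 0.5. For sets of equal size, a Braun-Blanquet similarity of b corresponds to a Jaccard similarity of j = b/(2 − b), so MinHash, whose exponent under the LSH framework is log(1/j1)/log(1/j2), operates at j1 = 1/3 and j2 = 1/7 and obtains ρ = log 3 / log 7 ≈ 0.565, strictly worse, consistent with the improvement claimed in the abstract.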
For cosine similarity the the ¹r;crº-approximate near neighbor problem: Preprocess a set of SimHash family Hsimhash, introduced by Charikar [13], works by d points P ⊆ f0; 1gd such that, given a query point q, if there exists sampling a random hyperplane in R that passes through the origin 0 0 x 2 P with kx − qk1 ≤ r, then return x 2 P with kx − qk1 < cr. In and hashing x according to what side of the hyperplane it lies on. For the literature this problem is often presented as the c-approximate h ∼ Hsimhash we have that Pr»h¹xº = h¹yº¼ = 1 − arccos¹C¹x; y))/π, near neighbor problem where bounds for the ρ-value are stated in which can be used to derive a solution for cosine similarity, al- terms of c and, in the case of upper bounds, hold for every choice though not the clean solution that we could have hoped for in the of r, while lower bounds may only hold for specic choices of r. style of MinHash for Jaccard similarity. There exists a number of O’Donnell et al. [30] have shown that the value ρ = 1/c for dierent data-independent LSH approaches [2, 3, 34] that improve c-approximate near neighbor search in Hamming space, obtained upon the ρ-value of SimHash. Perhaps surprisingly, it turns out that in the seminal work of Indyk and Motwani [23], is the best possible these approaches yield lower ρ-values for the ¹j1; j2º-approximate in terms of c for schemes based on Locality-Sensitive Hashing Jaccard similarity search problem compared to MinHash for cer- (LSH). However, the lower bound only applies when the distances tain combinations of ¹j1; j2º. Unfortunately, while asymptotically of interest, r and cr, are relatively small compared to d, and better superior these techniques suer from a non-trivial on¹1º-term in upper bounds are known for large distances. Notably, other LSH the exponent that only decreases very slowly with n. In compari- schemes for angular distance on the unit sphere such as cross- son, both MinHash and SimHash are simple to describe and have polytope LSH [2] give lower ρ-values for large distances.
Recommended publications
  • Challenges in Web Search Engines
    Challenges in Web Search Engines
    Monika R. Henzinger (Google Inc., 2400 Bayshore Parkway, Mountain View, CA 94043), Rajeev Motwani* (Department of Computer Science, Stanford University, Stanford, CA 94305), Craig Silverstein (Google Inc., 2400 Bayshore Parkway, Mountain View, CA 94043). [email protected], [email protected], [email protected]

    Abstract: This article presents a high-level discussion of some problems that are unique to web search engines. The goal is to raise awareness and stimulate research in these areas.

    1 Introduction: Web search engines are faced with a number of difficult problems in maintaining or enhancing the quality of their performance. These problems are either unique to this domain, or novel variants of problems that have been studied in the literature. Our goal in writing this article is to raise awareness of [...]

    [...] or a combination thereof. There are web ranking optimization services which, for a fee, claim to place a given web site highly on a given search engine. Unfortunately, spamming has become so prevalent that every commercial search engine has had to take measures to identify and remove spam. Without such measures, the quality of the rankings suffers severely. Traditional research in information retrieval has not had to deal with this problem of "malicious" content in the corpora. Quite certainly, this problem is not present in the benchmark document collections used by researchers in the past; indeed, those collections consist exclusively of high-quality content such as newspaper or scientific articles. Similarly, the spam problem is not present in the context of intranets, the web that exists within a corporation.
  • Mining of Massive Datasets
    Mining of Massive Datasets
    Anand Rajaraman (Kosmix, Inc.) and Jeffrey D. Ullman (Stanford Univ.). Copyright © 2010, 2011 Anand Rajaraman and Jeffrey D. Ullman.

    Preface: This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled "Web Mining," was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

    What the Book Is About: At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to "train" a machine-learning engine of some sort. The principal topics covered are:

    1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
    2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing.
    3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
    4. The technology of search engines, including Google's PageRank, link-spam detection, and the hubs-and-authorities approach.
    5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
    6.
  • Applied Statistics
    THE ANNALS OF APPLIED STATISTICS, an official journal of the Institute of Mathematical Statistics. ISSN 1932-6157 (print), ISSN 1941-7330 (online). Special section in memory of Stephen E. Fienberg (1942-2016), AOAS Editor-in-Chief 2013-2015.

    Editorial ......... iii
    On Stephen E. Fienberg as a discussant and a friend ......... DONALD B. RUBIN 683
    Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election ......... XIAO-LI MENG 685
    Hypothesis testing for high-dimensional multinomials: A selective review ......... SIVARAMAN BALAKRISHNAN AND LARRY WASSERMAN 727
    When should modes of inference disagree? Some simple but challenging examples ......... D. A. S. FRASER, N. REID AND WEI LIN 750
    Fingerprint science ......... JOSEPH B. KADANE 771
    Statistical modeling and analysis of trace element concentrations in forensic glass evidence ......... KAREN D. H. PAN AND KAREN KAFADAR 788
    Loglinear model selection and human mobility ......... ADRIAN DOBRA AND REZA MOHAMMADI 815
    On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example ......... YEN-CHI CHEN, Y. SAMUEL WANG AND ELENA A. EROSHEVA 846
    Providing accurate models across private partitioned data: Secure maximum likelihood estimation ......... JOSHUA SNOKE, TIMOTHY R. BRICK, ALEKSANDRA SLAVKOVIĆ AND MICHAEL D. HUNTER 877
    Clustering the prevalence of pediatric
  • Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model
    Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model
    Grigori Sidorov1, Alexander Gelbukh1, Helena Gomez-Adorno1, and David Pinto2. 1 Centro de Investigacion en Computacion, Instituto Politecnico Nacional, Mexico D.F., Mexico; 2 Facultad de Ciencias de la Computacion, Benemerita Universidad Autonoma de Puebla, Puebla, Mexico. {sidorov,gelbukh}@cic.ipn.mx, [email protected], [email protected]

    Abstract: We show how to consider similarity between features for calculation of similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, words "play" and "game" are different but [...]

    1 Introduction: Computation of similarity of specific objects is a basic task of many methods applied in various problems in natural language processing and many other fields. In natural language processing, text similarity plays crucial role in many tasks from plagiarism detection [18] and question answering [3] to sentiment analysis [14-16]. The most common manner to represent objects is the Vector Space Model (VSM) [17]. In this model, the objects are represented as vectors of values of features. The features characterize each object and have numeric values.
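For readers who want to see the measure in action: in the full paper the soft cosine takes the quadratic form sim(a, b) = (Σ_{i,j} s_ij a_i b_j) / (√(Σ_{i,j} s_ij a_i a_j) √(Σ_{i,j} s_ij b_i b_j)), where s_ij is the similarity of feature i to feature j. A minimal sketch, assuming the feature-similarity matrix s is given (say, from a synonym dictionary, as the abstract suggests):

```python
import numpy as np

def soft_cosine(a, b, s):
    """Soft cosine of feature vectors a, b given a feature-similarity
    matrix s (s[i][j] = similarity of feature i to feature j, s[i][i] = 1)."""
    num = a @ s @ b
    den = np.sqrt(a @ s @ a) * np.sqrt(b @ s @ b)
    return num / den

# Toy example: features 0 and 1 are near-synonyms (similarity 0.8).
s = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
a = np.array([1.0, 0.0, 0.0])   # document using feature 0
b = np.array([0.0, 1.0, 0.0])   # document using the near-synonym feature 1
print(soft_cosine(a, b, s))     # 0.8, whereas plain cosine would give 0.0
```

With an identity matrix s this reduces to the ordinary cosine; the off-diagonal entries are what let two documents that use different but related words score above zero.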
  • arXiv:2102.08942v1 [cs.DB]
    A Survey on Locality Sensitive Hashing Algorithms and their Applications
    OMID JAFARI, PREETI MAURYA, PARTH NAGARKAR, KHANDKER MUSHFIQUL ISLAM, CHIDAMBARAM CRUSHEV (New Mexico State University, USA)

    Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any other prior survey, we present how Locality Sensitive Hashing is utilized in different application domains. CCS Concepts: • General and reference → Surveys and overviews. Additional Key Words and Phrases: Locality Sensitive Hashing, Approximate Nearest Neighbor Search, High-Dimensional Similarity Search, Indexing

    1 INTRODUCTION: Finding nearest neighbors in high-dimensional spaces is an important problem in several diverse applications, such as multimedia retrieval, machine learning, biological and geological sciences, etc. For low dimensions (< 10), popular tree-based index structures, such as the KD-tree [12], SR-tree [56], etc., are effective, but for a higher number of dimensions these index structures suffer from the well-known curse of dimensionality, where their performance is often beaten even by linear scans [21]. Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results.
  • Stanford University's Economic Impact Via Innovation and Entrepreneurship
    Impact: Stanford University's Economic Impact via Innovation and Entrepreneurship
    Charles E. Eesley, Associate Professor in Management Science & Engineering, W.M. Keck Foundation Faculty Scholar, School of Engineering, Stanford University; William F. Miller†, Herbert Hoover Professor of Public and Private Management Emeritus, Professor of Computer Science Emeritus and former Provost, Stanford University, and Faculty Co-director, SPRIE. Boston - Delft. Full text available at: http://dx.doi.org/10.1561/0300000074

    Contents
    1 Executive Summary (2): 1.1 Regional and Local Impact (3); 1.2 Stanford's Approach (4); 1.3 Nonprofits and Social Innovation (8); 1.4 Alumni Founders and Leaders (9)
    2 Creating an Entrepreneurial Ecosystem (12): 2.1 History of Stanford and Silicon Valley (12)
    3 Analyzing Stanford's Entrepreneurial Footprint (17): 3.1 Case Study: Google Inc., the Global Reach of One Stanford Startup (20); 3.2 The Types of Companies Stanford Graduates Create (22); 3.3 The BASES Study (30)
    4 Funding Startup Businesses (33): 4.1 Study of Investors (38); 4.2 Alumni Initiatives: Stanford Angels & Entrepreneurs Alumni Group (44); 4.3 Case Example: Clint Korver (44)
    5 How Stanford's Academic Experience Creates Entrepreneurs (46)
    6 Changing Patterns in Entrepreneurial Career Paths (52)
    7 Social Innovation, Non-Profits, and Social Entrepreneurs (57): 7.1 Case Example: Eric Krock (58); 7.2 Stanford Centers and Programs for Social Entrepreneurs (59); 7.3 Case Example: Miriam Rivera (61); 7.4 Creating Non-Profit Organizations (63)
    8 The Lean Startup (68)
    9 How Stanford Supports Entrepreneurship: Programs, Centers, Projects (77): 9.1 Stanford Technology Ventures Program
  • SIGMOD Flyer
    SIGMOD 2006: 25th ACM SIGMOD International Conference on Management of Data, June 26 - June 29, 2006, Chicago, USA.

    DATES: Research paper abstracts: Nov. 15, 2005. Research papers, demonstrations, industrial talks, tutorials, panels: Nov. 29, 2005. Author notification: Feb. 24, 2006.

    The annual ACM SIGMOD conference is a leading international forum for database researchers, developers, and users to explore cutting-edge ideas and results, and to exchange techniques, tools, and experiences. We invite the submission of original research contributions as well as proposals for demonstrations, tutorials, industrial presentations, and panels. We encourage submissions relating to all aspects of data management defined broadly, and particularly encourage work that represents deep technical insights or presents new abstractions and novel approaches to problems of significance. We especially welcome submissions that help identify and solve data management systems issues by leveraging knowledge of applications and related areas, such as information retrieval and search, operating systems & storage technologies, and web services. Areas of interest include but are not limited to:

    • Benchmarking and performance evaluation
    • Data cleaning and integration
    • Database monitoring and tuning
    • Data privacy and security
    • Data warehousing and decision-support systems
    • Embedded, sensor, mobile databases and applications
    • Managing uncertain and imprecise information
    • Peer-to-peer data management
    • Personalized information systems
    • Query processing and optimization
    • Replication, caching, and publish-subscribe systems
    • Text search and database querying
    • Semi-structured data

    ORGANIZERS: General Chair: Clement Yu, U. of Illinois at Chicago. Vice Gen. Chair: Peter Scheuermann, Northwestern Univ. PC Chair: Surajit Chaudhuri, Microsoft Research. Demo Chair: Anastassia Ailamaki, CMU. Industrial PC Chair: Alon Halevy, U. of Washington, Seattle. Panels Chair: Christian S. Jensen, Aalborg University. Tutorials Chair: David DeWitt, U.
  • The Best Nurturers in Computer Science Research
    The Best Nurturers in Computer Science Research
    Bharath Kumar M. and Y. N. Srikant. IISc-CSA-TR-2004-10, http://archive.csa.iisc.ernet.in/TR/2004/10/. Computer Science and Automation, Indian Institute of Science, India, October 2004.

    Abstract: The paper presents a heuristic for mining nurturers in temporally organized collaboration networks: people who facilitate the growth and success of the young ones. Specifically, this heuristic is applied to the computer science bibliographic data to find the best nurturers in computer science research. The measure of success is parameterized, and the paper demonstrates experiments and results with publication count and citations as success metrics. Rather than just the nurturer's success, the heuristic captures the influence he has had in the independent success of the relatively young in the network. These results can hence be a useful resource to graduate students and post-doctoral candidates. The heuristic is extended to accurately yield ranked nurturers inside a particular time period. Interestingly, there is a recognizable deviation between the rankings of the most successful researchers and the best nurturers, which although is obvious from a social perspective has not been statistically demonstrated. Keywords: Social Network Analysis, Bibliometrics, Temporal Data Mining.

    1 Introduction: Consider a student Arjun, who has finished his under-graduate degree in Computer Science, and is seeking a PhD degree followed by a successful career in Computer Science research. How does he choose his research advisor? He has the following options with him: 1. Look up the rankings of various universities [1], and apply to any "reasonably good" professor in any of the top universities.
  • Evaluating Vector-Space Models of Word Representation, Or, the Unreasonable Effectiveness of Counting Words Near Other Words
    Evaluating Vector-Space Models of Word Representation, or, The Unreasonable Effectiveness of Counting Words Near Other Words
    Aida Nematzadeh, Stephan C. Meylan, and Thomas L. Griffiths. University of California, Berkeley. {nematzadeh, smeylan, tom_griffiths}@berkeley.edu

    Abstract: Vector-space models of semantics represent words as continuously-valued vectors and measure similarity based on the distance or angle between those vectors. Such representations have become increasingly popular due to the recent development of methods that allow them to be efficiently estimated from very large amounts of data. However, the idea of relating similarity to distance in a spatial representation has been criticized by cognitive scientists, as human similarity judgments have many properties that are inconsistent with the geometric constraints that a distance metric must obey. We show that two popular vector-space models, Word2Vec and GloVe, are unable to capture certain critical aspects of human word association data as a consequence of these constraints.

    [...] angle between word vectors (e.g., Mikolov et al., 2013b; Pennington et al., 2014). In this paper, we examine whether these constraints imply that Word2Vec and GloVe representations suffer from the same difficulty as previous vector-space models in capturing human similarity judgments. To this end, we evaluate these representations on a set of tasks adopted from Griffiths et al. (2007) in which the authors showed that the representations learned by another well-known vector-space model, Latent Semantic Analysis (Landauer and Dumais, 1997), were inconsistent with patterns of semantic similarity demonstrated in human word association data. However, a probabilistic topic model estimated from a rela-
  • Practical Considerations in Benchmarking Digital Testing Systems
    Foreword: The papers in these proceedings were presented at the 39th Annual Symposium on Foundations of Computer Science (FOCS '98), sponsored by the IEEE Computer Society Technical Committee on Mathematical Foundations of Computing. The conference was held in Palo Alto, California, November 9-11, 1998. The program committee consisted of Miklos Ajtai (IBM Almaden), Mihir Bellare (UC San Diego), Allan Borodin (Toronto), Edith Cohen (AT&T Labs), Sally Goldman (Washington), David Karger (MIT), Jon Kleinberg (Cornell), Rajeev Motwani (chair, Stanford), Seffi Naor (Technion), Christos Papadimitriou (Berkeley), Toni Pitassi (Arizona), Dan Spielman (MIT), Eli Upfal (Brown), Emo Welzl (ETH Zurich), David Williamson (IBM TJ Watson), and Frances Yao (Xerox PARC). The program committee met on June 26-28, 1998, and selected 76 papers from the 204 detailed abstracts submitted. The submissions were not refereed, and many of them represent reports of continuing research. It is expected that most of these papers will appear in a more complete and polished form in scientific journals in the future. The program committee selected two papers to jointly receive the Machtey Award for the best student-authored paper. These two papers were: "A Factor 2 Approximation Algorithm for the Generalized Steiner Network Problem" by Kamal Jain, and "The Shortest Vector in a Lattice is Hard to Approximate to within Some Constant" by Daniele Micciancio. There were many excellent candidates for this award, each one deserving. At this conference we organized, for the first time, a set of three tutorials: "Geometric Computation and the Art of Sampling" by Jiri Matousek; "Theoretical Issues in Probabilistic Artificial Intelligence" by Michael Kearns; and "Information Retrieval on the Web" by Andrei Broder and Monika Henzinger.
  • Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing
    Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing
    Tim vor der Brück and Marc Pouly. School of Information Technology, Lucerne University of Applied Sciences and Arts, Switzerland. [email protected], [email protected]

    Abstract: The prevalent way to estimate the similarity of two documents based on word embeddings is to apply the cosine similarity measure to the two centroids obtained from the embedding vectors associated with the words in each document. Motivated by an industrial application from the domain of youth marketing, where this approach produced only mediocre results, we propose an alternative way of combining the word vectors using matrix norms. The evaluation shows superior results for most of the investigated matrix norms in comparison to both the classical cosine measure and several other document similarity estimates.

    1 Introduction: Estimating semantic document similarity is of utmost importance in a lot of different areas, like plagiarism detection, information retrieval, or text summarization. [...]

    [...] determined beforehand on a very large corpus, typically using either the skip gram or the continuous bag of words variant of the Word2Vec model (Mikolov et al., 2013).
    2. The centroid over all word embeddings belonging to the same document is calculated to obtain its vector representation.

    If vector representations of the two documents to compare were successfully established, a similarity estimate can be obtained by applying the cosine measure to the two vectors. Let x_1, ..., x_m and y_1, ..., y_n be the word vectors of two documents. The cosine similarity value between the two document centroids C_1 and C_2 is given by:

    cos(∠((1/m) Σ_{i=1}^{m} x_i , (1/n) Σ_{i=1}^{n} y_i))    (1)
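The two-step procedure described above (embed each word, average, then compare with Eq. (1)) is short enough to sketch directly; a minimal version (ours, with a toy embedding table standing in for pretrained Word2Vec vectors):

```python
import numpy as np

def centroid_cosine(doc_x, doc_y, embed):
    """Cosine between document centroids, as in Eq. (1):
    each document is represented by the mean of its word embeddings."""
    c1 = np.mean([embed[w] for w in doc_x], axis=0)
    c2 = np.mean([embed[w] for w in doc_y], axis=0)
    return c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2))

# Toy embeddings standing in for Word2Vec vectors (assumption, for illustration).
embed = {"cat": np.array([1.0, 0.2]),
         "dog": np.array([0.9, 0.4]),
         "car": np.array([0.1, 1.0])}
print(centroid_cosine(["cat", "dog"], ["dog", "car"], embed))
```

This is the centroid baseline the abstract argues can be outperformed by matrix-norm combinations of the same word vectors.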
  • The Pagerank Citation Ranking: Bringing Order to the Web
    The PageRank Citation Ranking: Bringing Order to the Web
    Marlon Dias ([email protected]). Information Retrieval, DCC/UFMG, 2017.

    Introduction. Paper: The PageRank Citation Ranking: Bringing Order to the Web, 1999. Authors: Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. Page and Brin were MS students at Stanford; they founded Google in September '98. Most of this presentation is based on the original paper (link).

    "The Initiative's focus is to dramatically advance the means to collect, store, and organize information in digital forms, and make it available for searching, retrieval, and processing via communication networks -- all in user-friendly ways." (Stanford WebBase project)

    PageRank Motivation
    ▪ The web is vastly large and heterogeneous ▫ The original paper's estimates were over 150 M pages and 1.7 billion links
    ▪ Pages are extremely diverse ▫ Ranging from "What does the fox say?" to journals about IR
    ▪ Web pages present some "structure" ▫ PageRank takes advantage of the link structure

    PageRank Motivation
    ▪ Inspiration: academic citation
    ▪ Papers ▫ are well-defined units of work ▫ are roughly similar in quality ▫ are used to extend the body of knowledge ▫ can have their "quality" measured in number of citations

    PageRank Motivation
    ▪ Web pages, on the other hand ▫ proliferate free of quality control or publishing costs ▫ huge numbers of pages can be created easily ▸ artificially inflating citation counts ▫ vary on a much wider scale than academic papers in quality, usage, citations, and length

    PageRank Motivation
    A research article about the effects of cellphone use on driver attention is very different from an advertisement for a particular cellular provider. A random archived message posting asking an obscure question about an IBM computer is very different from the IBM home page.

    "The average web page quality experienced by a user is higher than the quality of the average web page."
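For reference, the ranking these slides introduce is computed by the standard PageRank power iteration; a minimal sketch (ours, not taken from the slides; the damping factor 0.85 is the conventional choice):

```python
import numpy as np

def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank with damping d on an adjacency dict
    {page: [pages it links to]} (standard formulation, our sketch)."""
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.full(n, (1.0 - d) / n)   # teleportation mass
        for p, outs in links.items():
            if outs:
                share = d * r[idx[p]] / len(outs)
                for q in outs:
                    new[idx[q]] += share
            else:                          # dangling page: spread rank uniformly
                new += d * r[idx[p]] / n
        r = new
    return dict(zip(pages, r))

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))
```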