A Probabilistic Molecular Fingerprint for Big Data Settings

Total Page:16

File Type:pdf, Size:1020Kb

A Probabilistic Molecular Fingerprint for Big Data Settings Probst and Reymond J Cheminform (2018) 10:66 https://doi.org/10.1186/s13321-018-0321-8 Journal of Cheminformatics RESEARCH ARTICLE Open Access A probabilistic molecular fngerprint for big data settings Daniel Probst* and Jean‑Louis Reymond Abstract Background: Among the various molecular fngerprints available to describe small organic molecules, extended connectivity fngerprint, up to four bonds (ECFP4) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representa‑ tions ( 1024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem≥ or ZINC to perform very slowly due to the curse of dimensionality. Results: Herein we report a new fngerprint, called MinHash fngerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally diferent manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the applica‑ tion of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. By leveraging locality sensitive hashing, LSH approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fngerprints in terms of speed and relative recovery rate, while operating in very sparse and high-dimensional binary chemical space. Conclusion: MHFP6 is a new molecular fngerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https​://githu​b.com/reymo​nd- group​/mhfp). Keywords: Virtual screening, Similarity search, Fingerprints, Locality sensitive hashing, Approximate k-nearest neighbor search Introduction Among the assortment of fngerprints for the com- Many uses of cheminformatics require the quantifcation parison of molecules in use today, extended connectiv- of the similarity between molecules. As the underlying ity fngerprint (ECFP) is the most prominent due to its data structure used to represent molecules is a graph, this outstanding performance in molecular structure com- problem is equivalent to a subgraph isomerism problem, parisons requiring the identifcation of compounds with which is at least NP-complete [1]. Molecular fngerprints similar bioactivity, as assessed in benchmarking stud- reduce this problem to the comparison of vectors, ena- ies [6, 7]. However, the performance of ECFP results bling further application of approximation methods and from a precise encoding of molecular structure, which heuristics, thus speeding up the computation [2–5]. is achieved by using high-dimensional vectors, typically d ≥ 1024 , with the consequence that linear searching becomes slow when applied to very large databases such as GDB, PubChem or ZINC [8–10]. For more complex *Correspondence: [email protected] k Department of Chemistry and Biochemistry, National Center tasks such as constructing -nearest neighbor graphs, 2 for Competence in Research NCCR TransCure, University of Berne, linear search takes O(dn ) time, becoming prohibitively Freiestrasse 3, 3012 Bern, Switzerland © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creat​iveco​mmons​.org/licen​ses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creat​iveco​mmons​.org/ publi​cdoma​in/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Probst and Reymond J Cheminform (2018) 10:66 Page 2 of 12 slow. Tis problem occurs even when applying com- circular nature of ECFP with w-shingling and MinHash, monly used optimized search algorithms such as k–d or which are encoding and comparison methods used in ball trees, as well as algorithms from the R- and B-tree natural language processing and text mining [19–21] families, because their performance degrades to linear (Fig. 1). Tese methods are commonly used in appli- time due to the curse of dimensionality [11–13]. In addi- cations such as discarding already indexed web pages tion, given the often binary, relatively sparse, and high during web-crawling, signal processing or plagiarism p dimensional nature of ECFP, L metrics generally per- detection [22, 23]. We obtain our MHFP by frst writing form badly, further limiting the number of available opti- out circular substructures around each atom as SMILES, mization techniques. In the past, several approaches to a process which we call molecular shingling in analogy to remove the curse of dimensionality’s impact on nearest the w-shingling scheme used for the above-mentioned neighbor searching have been presented by the chem- text mining applications. We then apply the MinHash informatics community. Most notably the BitBound hashsing scheme to assign these SMILES to bit values in method, which exploits simple bounds on similarity our MHFP. measures and indexing to achieve sub-linear speed on MinHash is a locality sensitive hashing (LSH) scheme exact nearest neighbor searches with a time complexity which applies a family of hashing functions to the sub- 0.6 of O(n ) for many metrics, including Jaccard similarity strings in a molecular shingling and stores the mini- [14, 15]. In our efort to facilitate the exploration of very mum hash generated from each hashing function in a large databases such as GDB, we previously used lower set. Tese sets, containing the minimum hash values, dimensionality fngerprints such as MQN (Molecular have the interesting property that they can be indexed Quantum Number, 42D) or SMIfp (SMILES fngerprint, by an LSH algorithm for approximate nearest neighbor 34D) for similarity searches, however, such fngerprints search (ANN), removing the curse of dimensionality only encode molecular composition and do not allow [24]. While a previously reported LSH implementation precise structural similarity calculation [16–18]. for chemical structure indexing and searching was based Herein we report a new family of fngerprints termed on embeddings in Euclidean space, MinHash allows for MHFP (MinHash fngerprint) which combine the the indexing of chemical structures in extremely sparse Fig. 1 MHFP, ECFP workfow comparison. a Comparison of hashing and approximate nearest neighbor search indexing of ECFP with Annoy (gray) and MHFP via molecular shingling and MinHash with LSH Forest (orange). In addition, MinHash is applied to unfolded ECFP hashes and indexed using LSH Forest as well (green), resulting in the hybrid fngerprint MHECFP. The latter was used as a control to separate the infuences of molecular shingling and applying MinHash on the measured performance. b Circular substructure SMILES of an input molecule are computed with each heavy atom as the center (examples for MHFP4 shown in red and blue). In addition, SMILES for each ring are extracted (examples shown in black). Circular substructure SMILES are rooted at the central atom. All substructure SMILES are canonicalized and kekulized Probst and Reymond J Cheminform (2018) 10:66 Page 3 of 12 Jaccard (Tanimoto) space, a metric more appropriate for F = {S1, ...Sn} over H where each set represents a mol- fngerprint-based similarity calculations [25, 26]. Note ecule, the MinHash function hmin(Si, a, b) is applied to that LSH search algorithms cannot be directly applied each set Si in F . Let s be the vector form of a set S from F 61 to ECFP hashes due to the nature of the primary hashing and p be the Mersenne prime 2 − 1 . Te MinHash of a scheme used to assign circular substructures to bit val- molecular graph is then calculated as: ues. Furthermore, ECFP encodes circular substructures T 32 by iteratively hashing atomic invariants. Common imple- hmin(si, a, b) = min a · si + b modp mod 2 − 1 mentations of ECFP, as found in RDKit or Open Babel, (2) contain a default or hardcoded selection of atomic invari- Te set form Smin of smin can then be used to estimate ants to be hashed that is targeted towards applications the Jaccard similarity coefcient of two sets Si , Sj using in medicinal chemistry, thereby making assumptions Eq. 1 [31]. regarding the importance of atomic features such as acid- Te expected error of estimating the Jaccard similarity ity or charge, thereby introducing a potential bias which 1 coefcient between two sets using MinHash is O log(n) , is entirely avoided in MHFP, as it takes all information n encoded in the SMILES into account [6, 27–29]. where is the number of hash functions used [32]. To assess the performance of MHFP we compare it to variants of ECFP as well as to a hybrid fngerprint LSH forest MHECFP which applies MinHash to unfolded ECFP Te local sensitivity hashing (LSH) forest algorithm is hashes. We fnd that the performance of MHFP surpasses an extension to LSH similarity indexing [33, 34]. Intro-
Recommended publications
  • Mining of Massive Datasets
    Mining of Massive Datasets Anand Rajaraman Kosmix, Inc. Jeffrey D. Ullman Stanford Univ. Copyright c 2010, 2011 Anand Rajaraman and Jeffrey D. Ullman ii Preface This book evolved from material developed over several years by Anand Raja- raman and Jeff Ullman for a one-quarter course at Stanford. The course CS345A, titled “Web Mining,” was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates. What the Book Is About At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to “train” a machine-learning engine of some sort. The principal topics covered are: 1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data. 2. Similarity search, including the key techniques of minhashing and locality- sensitive hashing. 3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost. 4. The technology of search engines, including Google’s PageRank, link-spam detection, and the hubs-and-authorities approach. 5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements. 6.
    [Show full text]
  • Applied Statistics
    ISSN 1932-6157 (print) ISSN 1941-7330 (online) THE ANNALS of APPLIED STATISTICS AN OFFICIAL JOURNAL OF THE INSTITUTE OF MATHEMATICAL STATISTICS Special section in memory of Stephen E. Fienberg (1942–2016) AOAS Editor-in-Chief 2013–2015 Editorial......................................................................... iii OnStephenE.Fienbergasadiscussantandafriend................DONALD B. RUBIN 683 Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election . ......................XIAO-LI MENG 685 Hypothesis testing for high-dimensional multinomials: A selective review SIVARAMAN BALAKRISHNAN AND LARRY WASSERMAN 727 When should modes of inference disagree? Some simple but challenging examples D. A. S. FRASER,N.REID AND WEI LIN 750 Fingerprintscience.............................................JOSEPH B. KADANE 771 Statistical modeling and analysis of trace element concentrations in forensic glass evidence.................................KAREN D. H. PAN AND KAREN KAFADAR 788 Loglinear model selection and human mobility . ................ADRIAN DOBRA AND REZA MOHAMMADI 815 On the use of bootstrap with variational inference: Theory, interpretation, and a two-sample test example YEN-CHI CHEN,Y.SAMUEL WANG AND ELENA A. EROSHEVA 846 Providing accurate models across private partitioned data: Secure maximum likelihood estimation....................................JOSHUA SNOKE,TIMOTHY R. BRICK, ALEKSANDRA SLAVKOVIC´ AND MICHAEL D. HUNTER 877 Clustering the prevalence of pediatric
    [Show full text]
  • Arxiv:2102.08942V1 [Cs.DB]
    A Survey on Locality Sensitive Hashing Algorithms and their Applications OMID JAFARI, New Mexico State University, USA PREETI MAURYA, New Mexico State University, USA PARTH NAGARKAR, New Mexico State University, USA KHANDKER MUSHFIQUL ISLAM, New Mexico State University, USA CHIDAMBARAM CRUSHEV, New Mexico State University, USA Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many diverse application domains. Locality Sensitive Hashing (LSH) is one of the most popular techniques for finding approximate nearest neighbor searches in high-dimensional spaces. The main benefits of LSH are its sub-linear query performance and theoretical guarantees on the query accuracy. In this survey paper, we provide a review of state-of-the-art LSH and Distributed LSH techniques. Most importantly, unlike any other prior survey, we present how Locality Sensitive Hashing is utilized in different application domains. CCS Concepts: • General and reference → Surveys and overviews. Additional Key Words and Phrases: Locality Sensitive Hashing, Approximate Nearest Neighbor Search, High-Dimensional Similarity Search, Indexing 1 INTRODUCTION Finding nearest neighbors in high-dimensional spaces is an important problem in several diverse applications, such as multimedia retrieval, machine learning, biological and geological sciences, etc. For low-dimensions (< 10), popular tree-based index structures, such as KD-tree [12], SR-tree [56], etc. are effective, but for higher number of dimensions, these index structures suffer from the well-known problem, curse of dimensionality (where the performance of these index structures is often out-performed even by linear scans) [21]. Instead of searching for exact results, one solution to address the curse of dimensionality problem is to look for approximate results.
    [Show full text]
  • A Generalized Framework for Analyzing Taxonomic, Phylogenetic, and Functional Community Structure Based on Presence–Absence Data
    Article A Generalized Framework for Analyzing Taxonomic, Phylogenetic, and Functional Community Structure Based on Presence–Absence Data János Podani 1,2,*, Sandrine Pavoine 3 and Carlo Ricotta 4 1 Department of Plant Systematics, Ecology and Theoretical Biology, Institute of Biology, Eötvös University, H-1117 Budapest, Hungary 2 MTA-ELTE-MTM Ecology Research Group, Eötvös University, H-1117 Budapest, Hungary 3 Centre d'Ecologie et des Sciences de la Conservation (CESCO), Muséum national d’Histoire naturelle, CNRS, Sorbonne Université, 75005 Paris, France ; [email protected] 4 Department of Environmental Biology, University of Rome ‘La Sapienza’, 00185 Rome, Italy; [email protected] * Correspondence: [email protected] Received: 17 October 2018; Accepted: 6 November 2018; Published: 12 November 2018 Abstract: Community structure as summarized by presence–absence data is often evaluated via diversity measures by incorporating taxonomic, phylogenetic and functional information on the constituting species. Most commonly, various dissimilarity coefficients are used to express these aspects simultaneously such that the results are not comparable due to the lack of common conceptual basis behind index definitions. A new framework is needed which allows such comparisons, thus facilitating evaluation of the importance of the three sources of extra information in relation to conventional species-based representations. We define taxonomic, phylogenetic and functional beta diversity of species assemblages based on the generalized Jaccard dissimilarity index. This coefficient does not give equal weight to species, because traditional site dissimilarities are lowered by taking into account the taxonomic, phylogenetic or functional similarity of differential species in one site to the species in the other. These, together with the traditional, taxon- (species-) based beta diversity are decomposed into two additive fractions, one due to taxonomic, phylogenetic or functional excess and the other to replacement.
    [Show full text]
  • A Similarity-Based Community Detection Method with Multiple Prototype Representation Kuang Zhou, Arnaud Martin, Quan Pan
    A similarity-based community detection method with multiple prototype representation Kuang Zhou, Arnaud Martin, Quan Pan To cite this version: Kuang Zhou, Arnaud Martin, Quan Pan. A similarity-based community detection method with multiple prototype representation. Physica A: Statistical Mechanics and its Applications, Elsevier, 2015, pp.519-531. 10.1016/j.physa.2015.07.016. hal-01185866 HAL Id: hal-01185866 https://hal.archives-ouvertes.fr/hal-01185866 Submitted on 21 Aug 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. A similarity-based community detection method with multiple prototype representation Kuang Zhoua,b,∗, Arnaud Martinb, Quan Pana aSchool of Automation, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, PR China bDRUID, IRISA, University of Rennes 1, Rue E. Branly, 22300 Lannion, France Abstract Communities are of great importance for understanding graph structures in social networks. Some existing community detection algorithms use a single prototype to represent each group. In real applications, this may not adequately model the different types of communities and hence limits the clustering performance on social networks. To address this problem, a Similarity-based Multi-Prototype (SMP) community detection approach is proposed in this paper.
    [Show full text]
  • Similarity Indices I: What Do They Measure?
    NRC-1 Addendum Similarity Indices I: What Do They Measure? by J. W. Johnston November 1976 Prepared for The Nuclear Regulatory Commission Pacific Northwest Laboratories NOTICE This report was prepared as an account of work sponsored by the United States Government. Neither the United States nor the Energy Research and Development Administration, nor any of their employees, nor any of their contractors, subcontractors, or their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness or usefulness of any information, apparatus, product or process disclosed, or represents that its use would not infringe privately owned rights. PACIFIC NORTHWEST LABORATORY operated by BATTELLE for the ENERGY RESEARCH AND DEVELOPMENT ADMIFulSTRATION Under Contract E Y-76-C-06-7830 Printed in the Ucited States of America Avai!able from hcat~onalTechntcal lnformat~onService U.S. Department of Commerce 5285 Port Royal Road Spr~ngf~eld,Virginta 22151 Price: Pr~ntedCODV $*; Microfiche $3.00 NTI5 'Pages Selling Price ERDA-RL RICHL&\D M 4 BNWL-2152 NRC-1 Addendum Similarity Indices I: What Do They Measure? by J.W. Johnston November 1976 Prepared for The Nuclear Regulatory Commission Battelle Pacific Northwest Laboratories Richland, Washington 99352 SUMMARY The characteristics of 25 similarity indices used in studies of eco- logical communities were investigated. The type of data structure, to which these indices are frequently applied, was described as consisting of vectors of measurements on attributes (species) observed in a set of sam- ples. A genera1 similarity index was characterized as the result of a two step process defined on a pair of vectors.
    [Show full text]
  • A New Similarity Measure Based on Simple Matching Coefficient for Improving the Accuracy of Collaborative Recommendations
    I.J. Information Technology and Computer Science, 2019, 6, 37-49 Published Online June 2019 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2019.06.05 A New Similarity Measure Based on Simple Matching Coefficient for Improving the Accuracy of Collaborative Recommendations Vijay Verma Computer Engineering Department, National Institute of Technology, Kurukshetra, Haryana, India-136119 E-mail: [email protected] Rajesh Kumar Aggarwal Computer Engineering Department, National Institute of Technology, Kurukshetra, Haryana, India-136119 E-mail: [email protected] Received: 13 February 2019; Accepted: 22 March 2019; Published: 08 June 2019 Abstract—Recommender Systems (RSs) are essential tools of an e-commerce portal in making intelligent I. INTRODUCTION decisions for an individual to obtain product Nowadays, the world wide web (or simply web) serves recommendations. Neighborhood-based approaches are as an important means for e-commerce businesses. The traditional techniques for collaborative recommendations rapid growth of e-commerce businesses presents a and are very popular due to their simplicity and efficiency. potentially overwhelming number of alternatives to their Neighborhood-based recommender systems use users, this frequently results in the information overload numerous kinds of similarity measures for finding similar problem. Recommender Systems (RSs) are the critical users or items. However, the existing similarity measures tools for avoiding information overload problem and function only on common ratings between a pair of users provide useful suggestions that may be relevant to the (i.e. ignore the uncommon ratings) thus do not utilize all users [1]. RSs help individuals by providing personalized ratings made by a pair of users.
    [Show full text]
  • The Duality of Similarity and Metric Spaces
    applied sciences Article The Duality of Similarity and Metric Spaces OndˇrejRozinek 1,2,* and Jan Mareš 1,3,* 1 Department of Process Control, Faculty of Electrical Engineering and Informatics, University of Pardubice, 530 02 Pardubice, Czech Republic 2 CEO at Rozinet s.r.o., 533 52 Srch, Czech Republic 3 Department of Computing and Control Engineering, Faculty of Chemical Engineering, University of Chemistry and Technology, 166 28 Prague, Czech Republic * Correspondence: [email protected] (O.R.); [email protected] (J.M.) Abstract: We introduce a new mathematical basis for similarity space. For the first time, we describe the relationship between distance and similarity from set theory. Then, we derive generally valid relations for the conversion between similarity and a metric and vice versa. We present a general solution for the normalization of a given similarity space or metric space. The derived solutions lead to many already used similarity and distance functions, and combine them into a unified theory. The Jaccard coefficient, Tanimoto coefficient, Steinhaus distance, Ruzicka similarity, Gaussian similarity, edit distance and edit similarity satisfy this relationship, which verifies our fundamental theory. Keywords: similarity metric; similarity space; distance metric; metric space; normalized similarity metric; normalized distance metric; edit distance; edit similarity; Jaccard coefficient; Gaussian similarity 1. Introduction Mathematical spaces have been studied for centuries and belong to the basic math- ematical theories, which are used in various real-world applications [1]. In general, a mathematical space is a set of mathematical objects with an associated structure. This Citation: Rozinek, O.; Mareš, J. The structure can be specified by a number of operations on the objects of the set.
    [Show full text]
  • Similarity Search Using Locality Sensitive Hashing and Bloom Filter
    Similarity Search using Locality Sensitive Hashing and Bloom Filter Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Computer Science and Engineering Submitted By Sachendra Singh Chauhan (Roll No. 801232021) Under the supervision of Dr. Shalini Batra Assistant Professor COMPUTER SCIENCE AND ENGINEERING DEPARTMENT THAPAR UNIVERSITY PATIALA – 147004 June 2014 ACKNOWLEDGEMENT No volume of words is enough to express my gratitude towards my guide, Dr. Shalini Batra, Assistant Professor, Computer Science and Engineering Department, Thapar University, who has been very concerned and has aided for all the material essential for the preparation of this thesis report. She has helped me to explore this vast topic in an organized manner and provided me with all the ideas on how to work towards a research-oriented venture. I am also thankful to Dr. Deepak Garg, Head of Department, CSED and Dr. Ashutosh Mishra, P.G. Coordinator, for the motivation and inspiration that triggered me for the thesis work. I would also like to thank the faculty members who were always there in the need of the hour and provided with all the help and facilities, which I required, for the completion of my thesis. Most importantly, I would like to thank my parents and the Almighty for showing me the right direction out of the blue, to help me stay calm in the oddest of the times and keep moving even at times when there was no hope. Sachendra Singh Chauhan 801232021 ii Abstract Similarity search of text documents can be reduced to Approximate Nearest Neighbor Search by converting text documents into sets by using Shingling.
    [Show full text]
  • Compressed Slides
    compsci 514: algorithms for data science Cameron Musco University of Massachusetts Amherst. Spring 2020. Lecture 7 0 logistics • Problem Set 1 is due tomorrow at 8pm in Gradescope. • No class next Tuesday (it’s a Monday at UMass). • Talk Today: Vatsal Sharan at 4pm in CS 151. Modern Perspectives on Classical Learning Problems: Role of Memory and Data Amplification. 1 summary Last Class: Hashing for Jaccard Similarity • MinHash for estimating the Jaccard similarity. • Locality sensitive hashing (LSH). • Application to fast similarity search. This Class: • Finish up MinHash and LSH. • The Frequent Elements (heavy-hitters) problem. • Misra-Gries summaries. 2 jaccard similarity jA\Bj # shared elements Jaccard Similarity: J(A; B) = jA[Bj = # total elements : Two Common Use Cases: • Near Neighbor Search: Have a database of n sets/bit strings and given a set A, want to find if it has high similarity to anything in the database. Naively Ω(n) time. • All-pairs Similarity Search: Have n different sets/bit strings. Want to find all pairs with high similarity. Naively Ω(n2) time. 3 minhashing MinHash(A) = mina2A h(a) where h : U ! [0; 1] is a random hash. Locality Sensitivity: Pr[MinHash(A) = MinHash(B)] = J(A; B): Represents a set with a single number that captures Jaccard similarity information! Given a collision free hash function g :[0; 1] ! [m], Pr [g(MinHash(A)) = g(MinHash(B))] = J(A; B): What is Pr [g(MinHash(A)) = g(MinHash(B))] if g is not collision free? Will be a bit larger than J(A; B). 4 lsh for similarity search When searching for similar items only search for matches that land in the same hash bucket.
    [Show full text]
  • Lecture Note
    Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar to detect plagiarism, mirror websites, multiple versions of the same article etc. Finding similar items is useful for building recommender systems as well where we want to find users with similar buying patterns. In Netflix two movies can be deemed similar if they are rated highly by the same customers. While, there are many measures of similarity, in this lecture, we will concentrate one such popular measure known as Jaccard Similarity. Definition (Jaccard Similairty). Given two sets S1 and S2, Jaccard similarity of S1 and S2 is defined as |S1∩S2 S1∪S2 Example 1. Let S1 = {1, 2, 3, 4, 7} and S2 = {1, 4, 9, 7, 5} then |S1 ∪ S2| = 7 and |S1 ∩ S2| = 3. 3 Thus the Jaccard similarity of S1 and S2 is 7 . 1.1 Document Similarity To compare two documents to know how similar they are, here is a simple approach: • Compute k shingles for a suitable value of k k shingles are all substrings of length k that appear in the document. For example, if a document is abcdabd and k = 2, then the 2-shingles are ab, bc, cd, da, ab, bd. • Compare the set of k shingled based on Jaccard similarity. One will often map the set of shingled to a set of integers via hashing. What should be the right size of the hash table? Select a hash table size to avoid collision.
    [Show full text]
  • Distributed Clustering Algorithm for Large Scale Clustering Problems
    IT 13 079 Examensarbete 30 hp November 2013 Distributed clustering algorithm for large scale clustering problems Daniele Bacarella Institutionen för informationsteknologi Department of Information Technology Abstract Distributed clustering algorithm for large scale clustering problems Daniele Bacarella Teknisk- naturvetenskaplig fakultet UTH-enheten Clustering is a task which has got much attention in data mining. The task of finding subsets of objects sharing some sort of common attributes is applied in various fields Besöksadress: such as biology, medicine, business and computer science. A document search engine Ångströmlaboratoriet Lägerhyddsvägen 1 for instance, takes advantage of the information obtained clustering the document Hus 4, Plan 0 database to return a result with relevant information to the query. Two main factors that make clustering a challenging task are the size of the dataset and the Postadress: dimensionality of the objects to cluster. Sometimes the character of the object makes Box 536 751 21 Uppsala it difficult identify its attributes. This is the case of the image clustering. A common approach is comparing two images using their visual features like the colors or shapes Telefon: they contain. However, sometimes they come along with textual information claiming 018 – 471 30 03 to be sufficiently descriptive of the content (e.g. tags on web images). Telefax: The purpose of this thesis work is to propose a text-based image clustering algorithm 018 – 471 30 00 through the combined application of two techniques namely Minhash Locality Sensitive Hashing (MinHash LSH) and Frequent itemset Mining. Hemsida: http://www.teknat.uu.se/student Handledare: Björn Lyttkens Lindén Ämnesgranskare: Kjell Orsborn Examinator: Ivan Christoff IT 13 079 Tryckt av: Reprocentralen ITC ”L’arte rinnova i popoli e ne rivela la vita.
    [Show full text]