A Probabilistic Molecular Fingerprint for Big Data Settings

Probst and Reymond J Cheminform (2018) 10:66 https://doi.org/10.1186/s13321-018-0321-8 Journal of Cheminformatics RESEARCH ARTICLE Open Access A probabilistic molecular fngerprint for big data settings Daniel Probst* and Jean‑Louis Reymond Abstract Background: Among the various molecular fngerprints available to describe small organic molecules, extended connectivity fngerprint, up to four bonds (ECFP4) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representa‑ tions ( 1024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem≥ or ZINC to perform very slowly due to the curse of dimensionality. Results: Herein we report a new fngerprint, called MinHash fngerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally diferent manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the applica‑ tion of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. By leveraging locality sensitive hashing, LSH approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fngerprints in terms of speed and relative recovery rate, while operating in very sparse and high-dimensional binary chemical space. Conclusion: MHFP6 is a new molecular fngerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond- group/mhfp). Keywords: Virtual screening, Similarity search, Fingerprints, Locality sensitive hashing, Approximate k-nearest neighbor search Introduction Among the assortment of fngerprints for the com- Many uses of cheminformatics require the quantifcation parison of molecules in use today, extended connectiv- of the similarity between molecules. As the underlying ity fngerprint (ECFP) is the most prominent due to its data structure used to represent molecules is a graph, this outstanding performance in molecular structure com- problem is equivalent to a subgraph isomerism problem, parisons requiring the identifcation of compounds with which is at least NP-complete [1]. Molecular fngerprints similar bioactivity, as assessed in benchmarking stud- reduce this problem to the comparison of vectors, ena- ies [6, 7]. However, the performance of ECFP results bling further application of approximation methods and from a precise encoding of molecular structure, which heuristics, thus speeding up the computation [2–5]. is achieved by using high-dimensional vectors, typically d ≥ 1024 , with the consequence that linear searching becomes slow when applied to very large databases such as GDB, PubChem or ZINC [8–10]. For more complex *Correspondence: [email protected] k Department of Chemistry and Biochemistry, National Center tasks such as constructing -nearest neighbor graphs, 2 for Competence in Research NCCR TransCure, University of Berne, linear search takes O(dn ) time, becoming prohibitively Freiestrasse 3, 3012 Bern, Switzerland © The Author(s) 2018. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/ publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Probst and Reymond J Cheminform (2018) 10:66 Page 2 of 12 slow. Tis problem occurs even when applying com- circular nature of ECFP with w-shingling and MinHash, monly used optimized search algorithms such as k–d or which are encoding and comparison methods used in ball trees, as well as algorithms from the R- and B-tree natural language processing and text mining [19–21] families, because their performance degrades to linear (Fig. 1). Tese methods are commonly used in appli- time due to the curse of dimensionality [11–13]. In addi- cations such as discarding already indexed web pages tion, given the often binary, relatively sparse, and high during web-crawling, signal processing or plagiarism p dimensional nature of ECFP, L metrics generally per- detection [22, 23]. We obtain our MHFP by frst writing form badly, further limiting the number of available opti- out circular substructures around each atom as SMILES, mization techniques. In the past, several approaches to a process which we call molecular shingling in analogy to remove the curse of dimensionality’s impact on nearest the w-shingling scheme used for the above-mentioned neighbor searching have been presented by the chem- text mining applications. We then apply the MinHash informatics community. Most notably the BitBound hashsing scheme to assign these SMILES to bit values in method, which exploits simple bounds on similarity our MHFP. measures and indexing to achieve sub-linear speed on MinHash is a locality sensitive hashing (LSH) scheme exact nearest neighbor searches with a time complexity which applies a family of hashing functions to the sub- 0.6 of O(n ) for many metrics, including Jaccard similarity strings in a molecular shingling and stores the mini- [14, 15]. In our efort to facilitate the exploration of very mum hash generated from each hashing function in a large databases such as GDB, we previously used lower set. Tese sets, containing the minimum hash values, dimensionality fngerprints such as MQN (Molecular have the interesting property that they can be indexed Quantum Number, 42D) or SMIfp (SMILES fngerprint, by an LSH algorithm for approximate nearest neighbor 34D) for similarity searches, however, such fngerprints search (ANN), removing the curse of dimensionality only encode molecular composition and do not allow [24]. While a previously reported LSH implementation precise structural similarity calculation [16–18]. for chemical structure indexing and searching was based Herein we report a new family of fngerprints termed on embeddings in Euclidean space, MinHash allows for MHFP (MinHash fngerprint) which combine the the indexing of chemical structures in extremely sparse Fig. 1 MHFP, ECFP workfow comparison. a Comparison of hashing and approximate nearest neighbor search indexing of ECFP with Annoy (gray) and MHFP via molecular shingling and MinHash with LSH Forest (orange). In addition, MinHash is applied to unfolded ECFP hashes and indexed using LSH Forest as well (green), resulting in the hybrid fngerprint MHECFP. The latter was used as a control to separate the infuences of molecular shingling and applying MinHash on the measured performance. b Circular substructure SMILES of an input molecule are computed with each heavy atom as the center (examples for MHFP4 shown in red and blue). In addition, SMILES for each ring are extracted (examples shown in black). Circular substructure SMILES are rooted at the central atom. All substructure SMILES are canonicalized and kekulized Probst and Reymond J Cheminform (2018) 10:66 Page 3 of 12 Jaccard (Tanimoto) space, a metric more appropriate for F = {S1, ...Sn} over H where each set represents a mol- fngerprint-based similarity calculations [25, 26]. Note ecule, the MinHash function hmin(Si, a, b) is applied to that LSH search algorithms cannot be directly applied each set Si in F . Let s be the vector form of a set S from F 61 to ECFP hashes due to the nature of the primary hashing and p be the Mersenne prime 2 − 1 . Te MinHash of a scheme used to assign circular substructures to bit val- molecular graph is then calculated as: ues. Furthermore, ECFP encodes circular substructures T 32 by iteratively hashing atomic invariants. Common imple- hmin(si, a, b) = min a · si + b modp mod 2 − 1 mentations of ECFP, as found in RDKit or Open Babel, (2) contain a default or hardcoded selection of atomic invari- Te set form Smin of smin can then be used to estimate ants to be hashed that is targeted towards applications the Jaccard similarity coefcient of two sets Si , Sj using in medicinal chemistry, thereby making assumptions Eq. 1 [31]. regarding the importance of atomic features such as acid- Te expected error of estimating the Jaccard similarity ity or charge, thereby introducing a potential bias which 1 coefcient between two sets using MinHash is O log(n) , is entirely avoided in MHFP, as it takes all information n encoded in the SMILES into account [6, 27–29]. where is the number of hash functions used [32]. To assess the performance of MHFP we compare it to variants of ECFP as well as to a hybrid fngerprint LSH forest MHECFP which applies MinHash to unfolded ECFP Te local sensitivity hashing (LSH) forest algorithm is hashes. We fnd that the performance of MHFP surpasses an extension to LSH similarity indexing [33, 34]. Intro-

A Probabilistic Molecular Fingerprint for Big Data Settings

Mining of Massive Datasets

Applied Statistics

Arxiv:2102.08942V1 [Cs.DB]

A Generalized Framework for Analyzing Taxonomic, Phylogenetic, and Functional Community Structure Based on Presence–Absence Data

A Similarity-Based Community Detection Method with Multiple Prototype Representation Kuang Zhou, Arnaud Martin, Quan Pan

Similarity Indices I: What Do They Measure?

A New Similarity Measure Based on Simple Matching Coefficient for Improving the Accuracy of Collaborative Recommendations

The Duality of Similarity and Metric Spaces

Similarity Search Using Locality Sensitive Hashing and Bloom Filter

Compressed Slides

Lecture Note

Distributed Clustering Algorithm for Large Scale Clustering Problems