Bioactivity Descriptors for Uncharacterized Chemical Compounds
Total Page:16
File Type:pdf, Size:1020Kb
ARTICLE https://doi.org/10.1038/s41467-021-24150-4 OPEN Bioactivity descriptors for uncharacterized chemical compounds ✉ Martino Bertoni 1,6, Miquel Duran-Frigola 1,2,6 , Pau Badia-i-Mompel 1,6, Eduardo Pauls1, Modesto Orozco-Ruiz1, Oriol Guitart-Pla1, Víctor Alcalde1, Víctor M. Diaz 3,4, Antoni Berenguer-Llergo 1, ✉ Isabelle Brun-Heath 1, Núria Villegas 1, Antonio García de Herreros3 & Patrick Aloy 1,5 Chemical descriptors encode the physicochemical and structural properties of small mole- 1234567890():,; cules, and they are at the core of chemoinformatics. The broad release of bioactivity data has prompted enriched representations of compounds, reaching beyond chemical structures and capturing their known biological properties. Unfortunately, bioactivity descriptors are not available for most small molecules, which limits their applicability to a few thousand well characterized compounds. Here we present a collection of deep neural networks able to infer bioactivity signatures for any compound of interest, even when little or no experimental information is available for them. Our signaturizers relate to bioactivities of 25 different types (including target profiles, cellular response and clinical outcomes) and can be used as drop-in replacements for chemical descriptors in day-to-day chemoinformatics tasks. Indeed, we illustrate how inferred bioactivity signatures are useful to navigate the chemical space in a biologically relevant manner, unveiling higher-order organization in natural product collec- tions, and to enrich mostly uncharacterized chemical libraries for activity against the drug- orphan target Snail1. Moreover, we implement a battery of signature-activity relationship (SigAR) models and show a substantial improvement in performance, with respect to chemistry-based classifiers, across a series of biophysics and physiology activity prediction benchmarks. 1 Joint IRB-BSC-CRG Programme in Computational Biology, Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain. 2 Ersilia Open Source Initiative, Cambridge, UK. 3 Programa de Recerca en Càncer, Institut Hospital del Mar d’Investigacions Mèdiques (IMIM) and Departament de Ciències de la Salut, Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain. 4 Faculty of Medicine and Health Sciences, International University of Catalonia, Barcelona, Catalonia, Spain. 5 Institució Catalana de Recerca i Estudis Avançats (ICREA), ✉ Barcelona, Catalonia, Spain. 6These authors contributed equally: Martino Bertoni, Miquel Duran-Frigola, Pau Badia-i-Mompel. email: [email protected]; [email protected] NATURE COMMUNICATIONS | (2021) 12:3932 | https://doi.org/10.1038/s41467-021-24150-4 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-021-24150-4 ost of the chemical space remains uncharted and identi- structures of the molecules, targets and metabolic genes, network fying its regions of biological relevance is key to medicinal properties of the targets, cell response profiles, drug indications, M 1,2 chemistry and chemical biology . To explore and catalog and side effects, among others (Fig. 1a). In the CC, each molecule this vast space, scientists have invented a variety of chemical is annotated with multiple n-dimensional vectors (i.e., bioactivity descriptors, which encode physicochemical and structural properties signatures) corresponding to the spaces where experimental of small molecules. Molecular fingerprints are a widespread form of information is available. As a result, chemistry (A) signatures are descriptors consisting of binary (1/0) vectors describing the presence widely available (~106 compounds), whereas cell-based assays (D) or absence of certain molecular substructures. These encodings are at cover about 30,000 molecules, and clinical (E) signatures are the core of chemoinformatics and are fundamental in compound known for only a few thousand drugs (Fig. 1b). We thus sought to similarity searches and clustering, and are applied to computational infer missing signatures for any compound in the CC, based on drug discovery (CDD), structure optimization, and target prediction. the observation that the different bioactivity spaces are not The corpus of bioactivity records available suggests that other completely independent and can be correlated. numerical representations of molecules are possible, reaching beyond Bioactivity signatures must be amenable to similarity calcula- chemical structures and capturing their known biological properties. tions, ideally by conventional metrics such as cosine or Euclidean Indeed, it has been shown that an enriched representation of mole- distances, so that short distances between molecule signatures cules can be achieved through the use of bioactivity signatures3. reflect a similar biological behavior. Therefore, inference of bioac- Bioactivity signatures are multi-dimensional vectors that capture tivity signatures can be posed as a metric learning problem where the biological traits of the molecule in a format that is akin to observed compound–compound similarities of a given kind are the structural descriptors or fingerprints used in the field of che- correlated to the full repertoire of CC signatures, so that similarity moinformatics. The first attempts to develop biological descriptors measures are possible for any compound of interest, including those for chemical compounds encapsulated ligand-binding affinities4,and that are not annotated with experimental data. In practice, for each fi fi ngerprints describing the target pro le of small molecules unveiled CC space (Si), we tackle the metric learning problem with a so- many unanticipated and physiologically relevant associations5.Cur- called Siamese Neural Network (SNN), having as input a stacked rently, public databases contain experimentally determined bioac- array of CC signatures available for the compound (belonging to – – tivity data for about a million molecules, which represent only a small any of the A1 E5 layers, S1 S25)andasoutput,ann-dimensional percentage of commercially available compounds6 and a negligible embedding optimized to discern between similar and dissimilar 7 fi fraction of the synthetically accessible chemical space .Inpractical molecules in Si. More speci cally, we feed the SNN with triplets of terms, this means bioactivity signatures cannot be derived for most molecules (an anchor molecule, one that is similar to the anchor compounds, and CDD methods are limited to using chemical (positive) and one that is not (negative)), and we ask the SNN to information alone as a primary input, thereby hindering their correctly classify this pattern with a distance measurement per- performance and not fully exploiting the bioactivity knowledge formed in the embedding space (Fig. 1a and Supplementary Fig. 1). produced over the years by the scientificcommunity. We trained 25 such SNNs, corresponding to the 25 spaces available Recently, we integrated the major chemogenomics and drug in the CC. We used 107 molecule triplets and chose an SNN databases in a single resource named the Chemical Checker (CC), embedding dimension of 128 for all CC spaces, scaling it to which is the largest collection of small-molecule bioactivity signatures the norm so as to unify the distance magnitude across SNNs (see available to date8. In the CC, bioactivity signatures are organized by “Methods” section for details). As a result of this procedure, we fi ‘ ’ data type (ligand-receptor binding, cell sensitivity pro les, toxicology, obtained 25 SNN signaturizers (S1–25), each of them devoted to etc.), following a chemistry-to-clinics rationale that facilitates the one of the CC spaces (Si). A signaturizer takes as input the subset of selection of relevant signature classes at each step of the drug dis- CC signatures available for a molecule and produces a 128D sig- covery pipeline. In essence, the CC is an alternative representation of nature that, in principle, captures the similarity profile of the the small-molecule knowledge deposited in the public domain and, as molecule in the Si CC space, where experimental information may such, it is also limited by the availability of experimental data and the not be available for the compound. coverage of its source databases (e.g., ChEMBL9 or DrugBank10). To handle the acute incompleteness of experimental signatures Thus, the CC is most useful when a substantial amount of bioactivity accessible for training the SNNs (Fig. 1b), we devised a signature- information is available for the molecules and remains of limited dropout sampling scheme that simulates a realistic prediction 11 value for poorly characterized compounds . In the current study, we scenario where, depending on the CC space of interest (Si), sig- present a methodology to infer CC bioactivity signatures for any natures from certain spaces will be available while others may not. compound of interest, based on the observation that the different For example, in the CC, biological pathway signatures (C3) are bioactivity spaces are not completely independent, and thus simila- directly derived from binding signatures (B4), thus implying that, rities of a given bioactivity type (e.g., targets) can be transferred to in a real B4 prediction case, C3 will never serve as a covariate. In other data kinds (e.g., therapeutic indications). Overall, we make practice, signature sampling probabilities for each CC space Si – bioactivity signatures available for