The ChEMBL Database in 2017

Published online 28 November 2016
Nucleic Acids Research, 2017, Vol. 45, Database issue D945–D954
doi: 10.1093/nar/gkw1074

The ChEMBL database in 2017

Anna Gaulton1, Anne Hersey1, Michał Nowotka1, A. Patrícia Bento1,2, Jon Chambers1, David Mendez1, Prudence Mutowo1, Francis Atkinson1, Louisa J. Bellis1, Elena Cibrián-Uhalte1, Mark Davies1, Nathan Dedman1, Anneli Karlsson1, María Paula Magariños1,2, John P. Overington1, George Papadatos1, Ines Smit1 and Andrew R. Leach1,*

1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK and 2Open Targets, Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK

Received September 23, 2016; Revised October 21, 2016; Editorial Decision October 24, 2016; Accepted October 30, 2016

*To whom correspondence should be addressed. Tel: +44 1223 494333; Fax: +44 1223 494468; Email: [email protected]
Present addresses: Louisa J. Bellis, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK. John P. Overington, Mark Davies and Anneli Karlsson, BenevolentAI, 40 Churchway, London NW1 1LW, UK. Nathan Dedman, Local Measure, 87 Leonard St, London EC2A 4QS, UK. George Papadatos, GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts SG1 2NY, UK.

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services.

INTRODUCTION

Since its inception a major component of ChEMBL's content has been bioactivity data regularly extracted from the medicinal chemistry literature (1,2). Among many other applications such data enables researchers to identify tool compounds for potential therapeutic targets, to probe the available SAR data for a target, investigate phenotypic data associated with similar compounds and to identify potential off-target effects of specific chemotypes. In order to provide a more complete perspective, based in part on user feedback, the scope and data types within ChEMBL have gradually expanded, with some major new areas included in recent releases: compounds in clinical development, data from patents, direct depositions for neglected diseases and agrochemical data.

Drug discovery remains a costly process with a high failure rate (3–6). To provide a more complete picture across the drug discovery and development process, and to help researchers better understand what makes a successful medicine, we have extended the ChEMBL data model to include, for the first time, data typically generated in the pre-clinical and clinical phases of drug discovery, specifically drug metabolism and disposition data. Another common approach to understanding pharmaceutical attrition is to learn from successful drugs and failed drug candidates (7–9). We have therefore extended our set of drug-target annotations to include those for clinical candidates and have also mapped these chemical entities to their therapeutic indications.

At the end of 2013, EMBL-EBI took over the operation, development and support of the SureChem patent system (now called SureChEMBL (10)) from Digital Science Ltd. Access to this resource has highlighted the potential value to scientists of bioactivity data not yet published in the scientific literature. However, the current SureChEMBL system only extracts compound structures from the patents and not associated bioactivity data. As a first step to address this opportunity we have worked with BindingDB to incorporate the BindingDB patent data into ChEMBL (11).

Neglected disease research continues to be a field of drug discovery conducted largely (though not exclusively) by not-for-profit organisations that aim to expedite research by encouraging sharing of experimental data with the community (12–15). Depositions of this type of data into ChEMBL have continued to increase since the original malaria depositions in 2010.

Whilst the pharmaceutical and drug discovery community continues to be the major user and consumer of ChEMBL data, other life sciences communities also work with similar types of data. The agrochemical industry is one such community where specific efforts have been made to widen the coverage of data relevant to the discovery and development of herbicides, pesticides and fungicides (16).

In the next sections, we describe the new data types now integrated into the ChEMBL database, the annotations we have undertaken to enable structured organisation and access to the data and how the new data and annotations can be viewed via the web interface.

DATA CONTENT

Current data content

ChEMBL's content continues to grow; release 22 of the ChEMBL database contains information extracted from more than 65 000 publications, together with 50 deposited data sets, and data drawn from other databases (Table 1). In total, there are >1.6 million distinct compound structures represented in the database, with 14 million activity values from >1.2 million assays. These assays are mapped to ∼11 000 targets, including 9052 proteins (of which 4255 are human).

Deposited data sets

ChEMBL continues to receive data sets from both not-for-profit and commercial organisations that wish to share data with the scientific community. These deposited data sets contain many novel chemical structures and associated bioactivity data. A good example is a library of small molecular weight (∼320 Da average) natural product-like compounds created at the University of Dundee with funding from the Gates Foundation. This library is being screened in a number of neglected disease assays. To date, information about the compound structures and their activity in cytotoxicity assays is available in ChEMBL; further assay data will be deposited as the screening is completed. Another example is the Malaria Box Compound Set (http://www.mmv.org/research-development/open-access-malaria-box), a set of 400 compounds with antimalarial activity that was made available by the Medicines for Malaria Venture (MMV) for research groups to request and screen (15). The latest

Beyond the area of neglected diseases, the University of Vienna and Roche have deposited supplementary data associated with publications already in ChEMBL (17,18). This is complementary to the similar sets already deposited by GlaxoSmithKline and we encourage similar depositions from other authors. AstraZeneca have taken a different approach to direct data deposition. They identified compounds already in ChEMBL and then provided data on these compounds from a variety of in vitro ADME and physicochemical screens including protein binding, microsome and hepatocyte clearance, solubility, pKa and lipophilicity. It is important to note that for all such deposited data sets, ChEMBL provides a DOI so that the data can be cited in subsequent publications. For example, the DOI for the AstraZeneca deposited data is 10.6019/CHEMBL3301361 and this resolves to the ChEMBL Document Report Card for the complete data set.

Deposited data sets can be identified through the ChEMBL interface by selecting the relevant entry in the 'Activity Source Filter' (located to the right of the keyword search bar) and then performing a keyword search with a wildcard (*) against 'Documents' or 'Assays'. This will return all documents or assays associated with that source or depositor, from which point further information regarding compounds, targets and activity measurements can be navigated.

Crop protection data

In order to broaden the utility of ChEMBL in crop protection research, a data set of >40 000 compounds and 245 000 activity data points has been extracted from crop protection-related publications and added to ChEMBL (16). This data set significantly increased the content of pesticide, herbicide and insecticide assays in the database. The ChEMBL taxonomy tree browser has been extended to allow easier retrieval of this data (https://www.ebi.ac.uk/chembl/target/browser). In addition, known pesticides already in ChEMBL were assigned a mechanism of action classification, following the Fungicide Resistance Action Committee (FRAC: http://www.frac.info/publications), Herbicide Resistance Action Committee (HRAC: http://hrac.tsstaging.com/tools/classification-lookup) or Insecticide Resistance Action Committee (IRAC: http://www.irac-online.org/documents/moa-brochure/?ext=pdf)
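
The text on deposited data sets states that each one carries a DOI, giving 10.6019/CHEMBL3301361 as the AstraZeneca example. As a minimal sketch of how such a DOI might be handled in a script, the Python below builds the resolver link and pulls out the trailing identifier. It rests on two assumptions not stated in the article: that the standard https://doi.org/ resolver is used, and that the part after the 10.6019/ prefix is the ChEMBL document identifier. The function names are hypothetical and are not part of any ChEMBL library.

```python
from urllib.parse import quote

# Assumption: the generic DOI resolver; the article only says the DOI "resolves".
DOI_RESOLVER = "https://doi.org/"
# Assumption: prefix taken from the single example DOI quoted in the text.
CHEMBL_DOI_PREFIX = "10.6019/"


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI (hypothetical helper)."""
    return DOI_RESOLVER + quote(doi.strip(), safe="/")


def chembl_id_from_doi(doi: str) -> str:
    """Return the suffix of a deposited-data-set DOI, assumed to be the
    ChEMBL document identifier (hypothetical helper)."""
    doi = doi.strip()
    if not doi.startswith(CHEMBL_DOI_PREFIX):
        raise ValueError(f"not a ChEMBL data-set DOI: {doi!r}")
    return doi[len(CHEMBL_DOI_PREFIX):]


if __name__ == "__main__":
    example = "10.6019/CHEMBL3301361"  # DOI quoted in the article
    print(doi_to_url(example))
    print(chembl_id_from_doi(example))
```

Keeping the prefix check explicit means a DOI from another publisher fails loudly instead of being silently treated as a ChEMBL identifier.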
