EMBL-EBI Now and in the Future

SureChEMBL: Open Patent Data
Chemaxon UGM, Budapest, 21/05/2014
Mark Davies
ChEMBL Group, EMBL-EBI

EMBL-EBI Resources
• Genes, genomes & variation: European Nucleotide Archive, Ensembl, European Genome-phenome Archive, Ensembl Genomes, Metagenomics portal, 1000 Genomes
• Gene, protein & metabolite expression: ArrayExpress, MetaboLights, Expression Atlas, PRIDE
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, BioSamples, Enzyme Portal

ChEMBL: Data for Drug Discovery
1. Scientific facts
2. Organization, integration, curation and standardization of pharmacology data
3. Insight, tools and resources for translational drug discovery
[Figure: a compound linked to an assay/target by bioactivity data]
• Compound
• Bioactivity data: Ki = 4.5nM, APTT = 11 min.
• Assay/Target:
>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

Patent Data
• Do we include patent data in the ChEMBL database?
  • We do provide cross-references (UniChem), but not the underlying chemical data
• Most common question asked about ChEMBL during training and outreach
• Why is this important to Drug Discovery researchers?
  • Patent literature is 2-3 years ahead of published literature
  • Prior art and freedom to operate
  • Lots more data, but high cost to extract + lots of noisy data

SureChem = SureChEMBL
• December 2013: EMBL-EBI acquired SureChem, a leading 'chemistry patent mining' product, from Digital Science, Macmillan Group
• SureChem not aligned with core future academic business
• SureChem provides a live (updated daily) view of chemical patent space
• Existing SureChem user base
  • Free (SureChemOpen)
  • Paying (SureChemPro + API)
• EMBL-EBI will support existing licensees (all have expired now)
• EMBL-EBI will provide an ongoing, free and open resource to the entire community
• Rebranded SureChEMBL

EMBL-EBI Chemistry Resources
[Figure: chemistry resources exposed through RDF and REST API interfaces]
Resource         Content                                                            Compounds
Atlas            Ligand induced transcript response                                 750
PDBe             Ligand structures from structurally defined protein complexes      15K
ChEBI            Nomenclature of primary and secondary metabolites.                 24K
                 Chemical Ontology
ChEMBL           Bioactivity data from literature and depositions                   1.5M
SureChEMBL       Ligand structures from patent literature                           15M
3rd Party Data   ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG,         ~55M
                 NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, ….
UniChem: InChI-based chemical resolver (full + relaxed 'lenses'), >70M, REST API interface

SureChEMBL System
[Screenshot: search page at https://www.surechembl.org/ with callouts]
• Keyword search
• Patent number search
• Structure search: sketch, or paste SMILES, MOL, name
• Types of chemistry
• Filter by authority
• Filter by date
• Filter by document section
• Help

SureChEMBL System
[Screenshots: Data Export and View Patent Family]

SureChEMBL System
[Screenshot: SureChEMBL API Access]

System Capabilities
• Searching capabilities
  • Free text keywords and Lucene fields
  • Patent IDs & bibliographic information
  • Patent authority & date
  • Chemical structure
• Retrieving capabilities
  • Retrieve chemistry (with additional filters)
  • Retrieve patent family information
  • Retrieve annotated full patent text
• Accessible via Web Interface and API

SureChEMBL Data Coverage
Source            Data        Description & Languages            Years
EP applications   Bib. data   DocDB + Original                   From 1978
                  Full text   Original (EN, DE, FR)
EP granted        Bib. data   DocDB + Original                   From 1980
                  Full text   Original (EN, DE, FR)
WO applications   Bib. data   DocDB + Original                   From 1978
                  Full text   Original (EN, DE, FR, ES, RU)      From 1978
US applications   Bib. data   DocDB + Original                   From 2001
                  Full text   Original (EN)                      From 2001
US granted        Bib. data   DocDB + Original                   From 1920
                  Full text   Original (EN)                      From 1976
JP applications   Bib. data   DocDB                              From 1973
                  Full text   PAJ - English abstracts/titles     From 1976
JP granted        Bib. data   DocDB                              From 1994
90+ countries     Bib. data   DocDB                              From 1920
All patents from the above data sources are searchable via SureChEMBL

SureChEMBL Chemistry Data Coverage
• Exemplified structures from patent title, description, abstract and claims
• Structures from text 1976 onwards
• Structures from images 2007 onwards
• USPTO have provided 'Complex Work Units' since 2001
  • CWU file types include MOL and CDX
  • CWUs processed as part of pipeline

SureChEMBL Data Processing
[Figure: SureChEMBL System pipeline diagram; labels shown]
• Patent Offices: WO; EP (applications & granted); US (applications & granted); JP (abstracts)
• Patent PDFs (service)
• SureChem IP: OCR; Entity Recognition; Name to Structure (five methods); Image to Structure (one method)
• Processed patents database (service)
• Chemistry Database
• Application Server, API
• Users
• Example name passed to Name to Structure: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

SureChEMBL and Chemaxon
[Figure: the same SureChEMBL System pipeline diagram as on the previous slide]

ChEMBL Overlap
• InChI based comparison using filtered parent compounds
  • SureChEMBL: 12.2M (Exported 08/05/14)
  • ChEMBL: 1.3M (ChEMBL 18)
  • Overlap: 235K (18.4%)
• Filters
  • MW between 100 and 1200
  • #Atoms between 6 and 70
  • ALogP between -10 and 10
  • #Rings > 0
  • #C > 0
  • #C != #Atoms
  • RTB <= 20

SureChEMBL and UniChem
• 12.2M SureChEMBL compounds are being loaded into UniChem, an InChI based 'Unified Chemical Identifier' system
• SureChem drug-like subset (~5M) previously loaded
• Other UniChem sources include: [source logos]
https://www.ebi.ac.uk/unichem/

Migration Status
• System currently built and optimised to run on Amazon Web Services
• The time and cost are considered too high to move away from AWS in the short term
• Long term plan will be to migrate on to EBI infrastructure
• 3 Phase migration process
  1. Patent Data Server (IFI Claims/Fairview): done
  2. Data Processing Pipeline: done
  3. Web Application/API: work in progress

Technical Challenges
• User account migration
  • System currently uses Digital Science authentication system
  • User account required by Pro account service
  • Plan to move over to an open system, e.g. OAuth/OpenID
  • We aim to minimise impact on existing users, but may require users to sign up again
• Impact of providing free and open access to the Pro account service and API
  • Need to monitor usage
  • Usage limitations may be required

Enhanced Entity Extraction
• Identify new entity types, e.g. proteins, diseases, cell lines, assays
  • Extend using ChEMBL dictionaries + others
  • Ontology mapping/Semantic tagging
• Protein/biotherapeutic sequence extraction
  • Sequence based patent searches
• Currently the system provides minimal cross referencing
  • Quickly enhance using UniChem
  • Tag up all commonly used identifiers (ChEBI, ChEMBL, PubChem, UniProt,…)

Bioactivity Data Extraction?
[Figure: Compounds, Target/Assay, Bioactivity]

Markush Structure Extraction?
[Figure: Markush structure with substituent classes -alkyl, -aryl, -heteroaryl, -heterocyclyl, -cycloalkyl, ….]

Image Processing
• Image extraction starts from 01/01/2007
• Use Amazon EC2 Spot Instances to process pre-2007 image data
• Spot instances are significantly cheaper, e.g. m1.xlarge instance costs:
  • Standard Cost = $0.52/hour
  • Spot Instance Cost = ~$0.125/hour

Image Processing
• New methods and tools can be introduced to improve compound image extraction
  • System currently uses CLiDE, alternatives include OSRA and Imago
  • Document segmentation, developed as part of curation system, could be applied to complex patent images

Open PHACTS Extension
• Open PHACTS project is keen to include patent data in future extensions to the project
• ENSO approved - funding to include SureChEMBL data in Open PHACTS
  • RDF conversion, target indexing and API development
  • EBI-RDF project will benefit from RDF conversion
• SureChEMBL is updated daily, compared to quarterly ChEMBL updates
  • Interesting challenge for us creating exports and systems loading SureChEMBL

More Future Plans
• Refactor interface for EMBL look and feel
• Third party user support system migration
• Workflow tool enhancements
  • Update and release existing KNIME protocols + Pipeline Pilot
• Lots of interest to bring the system in-house for internal document processing and searching
  • Complex licensing issues
  • AWS setup makes this easier
• Ligand Ensemble-based mapping of ChEMBL literature to patents
• Provide weekly/monthly feed of patent structures to PubChem

Rebranding Complete
[Screenshot: rebranded SureChEMBL site]

People and Groups Involved
The ChEMBL Group
• John Overington
• Mark Davies
• George Papadatos
• Jon Chambers
• Anne Hersey
Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
• Tom Llewellyn

ChEMBL 18 Released
• 1,359,508 compounds
• 12,419,715 activities
• 1,042,374 assays
• 9,414 targets
• 53,298 documents
• 19 bioactivity sources
• Access: Website, Web Services, Virtual Machine, Semantic Web, Widgets, Downloads
https://www.ebi.ac.uk/chembl/

myChEMBL Update Coming Soon
http://chembl.blogspot.co.uk/
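
The compound filters listed on the ChEMBL Overlap slide can be sketched as a single predicate. This is only an illustration, not the pipeline actually used for the comparison: the `Descriptors` container and its field names are hypothetical, the descriptor values are assumed to be precomputed by some cheminformatics toolkit, "between" is assumed to mean inclusive bounds, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one parent compound (hypothetical container)."""
    mol_weight: float   # MW
    n_atoms: int        # #Atoms (assumed here to be the atom count used on the slide)
    alogp: float        # ALogP
    n_rings: int        # #Rings
    n_carbons: int      # #C
    n_rot_bonds: int    # RTB (rotatable bonds)


def passes_overlap_filters(d: Descriptors) -> bool:
    """Apply the slide's filters; bounds are assumed to be inclusive."""
    return (
        100 <= d.mol_weight <= 1200
        and 6 <= d.n_atoms <= 70
        and -10 <= d.alogp <= 10
        and d.n_rings > 0
        and d.n_carbons > 0
        and d.n_carbons != d.n_atoms   # reject carbon-only structures
        and d.n_rot_bonds <= 20
    )


# Illustrative, made-up descriptor values.
drug_like = Descriptors(474.6, 33, 2.3, 4, 22, 7)
too_small = Descriptors(78.1, 6, 2.0, 1, 6, 0)   # fails MW and #C != #Atoms

print(passes_overlap_filters(drug_like))
print(passes_overlap_filters(too_small))
```

Keeping the filter as a pure function over precomputed values means the same rule can be applied to both the SureChEMBL export and the ChEMBL set before the InChI comparison.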

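
The m1.xlarge prices quoted on the Image Processing slide work out to a saving of roughly 76% per instance-hour. The short calculation below shows the arithmetic; the two hourly rates come from the slide, while the 10,000 instance-hour workload is a purely hypothetical figure chosen for illustration.

```python
STANDARD_USD_PER_HOUR = 0.52    # from the slide
SPOT_USD_PER_HOUR = 0.125       # approximate spot price from the slide


def saving_fraction(standard: float, spot: float) -> float:
    """Fraction of the standard price saved by paying the spot price."""
    return 1 - spot / standard


def job_cost(instance_hours: float, usd_per_hour: float) -> float:
    """Total cost of a job that needs the given number of instance-hours."""
    return instance_hours * usd_per_hour


saving = saving_fraction(STANDARD_USD_PER_HOUR, SPOT_USD_PER_HOUR)
print(f"Saving per instance-hour: {saving:.0%}")

hypothetical_hours = 10_000     # assumption, not a figure from the talk
print(f"Standard: ${job_cost(hypothetical_hours, STANDARD_USD_PER_HOUR):,.0f}")
print(f"Spot:     ${job_cost(hypothetical_hours, SPOT_USD_PER_HOUR):,.0f}")
```

Spot prices fluctuate, so the real saving depends on the market price at the time the instances run.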