SureChEMBL: Open Patent Data
Chemaxon UGM, Budapest, 21/05/2014
Mark Davies, ChEMBL Group, EMBL-EBI
EMBL-EBI Resources
• Genes, genomes & variation: European Nucleotide Archive, Ensembl, European Genome-phenome Archive, Ensembl Genomes, Metagenomics portal, 1000 Genomes
• Gene, protein & metabolite expression: ArrayExpress, MetaboLights, Expression Atlas, PRIDE
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, BioSamples, Enzyme Portal
ChEMBL – Data for Drug Discovery
1. Scientific facts
2. Organization, integration, curation and standardization of pharmacology data
3. Insight, tools and resources for translational drug discovery
[Figure: a compound, its assay/target (the thrombin protein sequence) and the bioactivity data linking them, e.g. Ki = 4.5 nM, APTT = 11 min.]
Patent Data
• Do we include patent data in the ChEMBL database?
  - We do provide cross-references (UniChem), but not the underlying chemical data
  - Most common question asked about ChEMBL during training and outreach
• Why is this important to drug discovery researchers?
  - Patent literature is 2-3 years ahead of published literature
  - Prior art and freedom to operate
  - Lots more data, but high cost to extract and lots of noisy data
SureChem = SureChEMBL
• December 2013: EMBL-EBI acquired SureChem, a leading ‘chemistry patent mining’ product from Digital Science, Macmillan Group
  - SureChem not aligned with core future academic business
• SureChem provides a live (updated daily) view of chemical patent space
• Existing SureChem user base
  - Free (SureChemOpen)
  - Paying (SureChemPro + API)
• EMBL-EBI will support existing licensees (all have expired now)
• EMBL-EBI will provide an ongoing, free and open resource to the entire community
• Rebranded SureChEMBL
EMBL-EBI Chemistry Resources
RDF and REST API interfaces
• Atlas (750): ligand-induced transcript response
• PDBe (15K): ligand structures from structurally defined protein complexes
• ChEBI (24K): nomenclature of primary and secondary metabolites; Chemical Ontology
• ChEMBL (1.5M): bioactivity data from literature and depositions
• SureChEMBL (15M): ligand structures from patent literature
• 3rd Party Data (~55M): ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, ….
UniChem – InChI-based chemical resolver (full + relaxed ‘lenses’): >70M
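The idea of an InChI-based resolver with "full" and "relaxed" lenses can be illustrated with a toy index. This is only a sketch: the index layout and function names are hypothetical, not UniChem's schema or API, and it assumes the relaxed lens can be approximated by matching only the first (connectivity) block of the standard InChIKey. The two aspirin identifiers are real public records.

```python
from collections import defaultdict

# (source, compound id, standard InChIKey); both records are aspirin.
RECORDS = [
    ("ChEMBL", "CHEMBL25", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
    ("ChEBI", "CHEBI:15365", "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"),
]


def connectivity_block(inchikey):
    """Return the first InChIKey block, which hashes the connectivity layer."""
    return inchikey.split("-")[0]


def build_index(records):
    """Build two lookup tables: exact InChIKey and connectivity block only."""
    full = defaultdict(list)
    relaxed = defaultdict(list)
    for source, compound_id, inchikey in records:
        full[inchikey].append((source, compound_id))
        relaxed[connectivity_block(inchikey)].append((source, compound_id))
    return full, relaxed


def resolve(inchikey, full, relaxed, lens="full"):
    """Return all (source, id) pairs matching the key under the chosen lens."""
    if lens == "full":
        return list(full.get(inchikey, []))
    if lens == "relaxed":
        return list(relaxed.get(connectivity_block(inchikey), []))
    raise ValueError("lens must be 'full' or 'relaxed'")


FULL, RELAXED = build_index(RECORDS)
print(resolve("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", FULL, RELAXED))
```

A key that differs only in its second block (for example a different stereo or isotope form) would miss under the full lens but still resolve under the relaxed one.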
REST API Interface
SureChEMBL System
[Screenshot of the search interface, annotated with:]
• Keyword search
• Patent number search
• Filter by authority
• Filter by date
• Filter by document section
• Types of chemistry search
• Structure sketch
• Paste SMILES, MOL, name
• Help
https://www.surechembl.org/
SureChEMBL System
• Data Export and View Patent Family
SureChEMBL API Access
System Capabilities
• Searching capabilities
  - Free text keywords and Lucene fields
  - Patent IDs & bibliographic information
  - Patent authority & date
  - Chemical structure
• Retrieving capabilities
  - Retrieve chemistry (with additional filters)
  - Retrieve patent family information
  - Retrieve annotated full patent text
• Accessible via Web Interface and API
SureChEMBL Data Coverage
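As a rough illustration of the "free text keywords and Lucene fields" search capability combined with authority and date filters, the sketch below composes a query string in standard Lucene query syntax. The field names `authority` and `pubdate` are hypothetical placeholders; the actual SureChEMBL field names are not given in these slides.

```python
def quote_term(term):
    """Quote multi-word terms so Lucene treats them as a single phrase."""
    escaped = term.replace('"', r'\"')
    return f'"{escaped}"' if " " in escaped else escaped


def lucene_query(keywords, authority=None, date_range=None):
    """Compose a Lucene query string.

    keywords   -- list of free-text terms, all required (AND)
    authority  -- patent authority code such as "WO" (hypothetical field)
    date_range -- (start, end) as YYYYMMDD strings (hypothetical field)
    """
    clauses = []
    if keywords:
        clauses.append("(" + " AND ".join(quote_term(k) for k in keywords) + ")")
    if authority:
        clauses.append(f"authority:{authority}")
    if date_range:
        start, end = date_range
        clauses.append(f"pubdate:[{start} TO {end}]")
    return " AND ".join(clauses)


query = lucene_query(
    ["kinase", "thrombin inhibitor"],
    authority="WO",
    date_range=("20070101", "20140521"),
)
print(query)
```

The printed string is `(kinase AND "thrombin inhibitor") AND authority:WO AND pubdate:[20070101 TO 20140521]`.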
Data, description & languages, years:
• EP applications (from 1978): bib. data DocDB + original; full text original (EN, DE, FR)
• EP granted (from 1980): bib. data DocDB + original; full text original (EN, DE, FR)
• WO applications: bib. data DocDB + original, from 1978; full text original (EN, DE, FR, ES, RU), from 1978
• US applications: bib. data DocDB + original, from 2001; full text original (EN), from 2001
• US granted: bib. data DocDB + original, from 1920; full text original (EN), from 1976
• JP applications: bib. data DocDB, from 1973; full text PAJ English abstracts/titles, from 1976
• JP granted: bib. data DocDB, from 1994
• 90+ countries: bib. data DocDB, from 1920
All patents from the above data sources are searchable via SureChEMBL
SureChEMBL Chemistry Data Coverage
• Exemplified structures from patent title, description, abstract and claims
• Structures from text 1976 onwards
• Structures from images 2007 onwards
• USPTO have provided ‘Complex Work Units’ (CWUs) since 2001
  - CWU file types include MOL and CDX
  - CWUs processed as part of pipeline
SureChEMBL Data Processing
[Diagram: SureChEMBL System data flow, with the following labels:]
• Patent Offices: WO; EP applications & granted; US applications & granted; JP abstracts
• Processing: SureChem IP, Patent PDFs (service), OCR, Entity Recognition, Name to Structure (five methods), Image to Structure (one method), Processed patents (service)
• Storage and access: Database, Chemistry Database, Application Server, API, Users
• Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
SureChEMBL and Chemaxon
[Diagram: the SureChEMBL System data flow, shown again.]
ChEMBL Overlap
• InChI-based comparison using filtered parent compounds
• SureChEMBL (exported 08/05/14): 12.2M
• ChEMBL (ChEMBL 18): 1.3M
• Overlap: 235K (18.4%)
Filters
• MW between 100 and 1200
• #Atoms between 6 and 70
• ALogP between -10 and 10
• #Rings > 0
• #C > 0
• #C != #Atoms
• RTB <= 20
SureChEMBL and UniChem
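The filters used for the ChEMBL overlap comparison (MW, #Atoms, ALogP, #Rings, #C, RTB) can be written as a single predicate over precomputed descriptors. This is a minimal sketch: the `Descriptors` class and its field names are illustrative, and it assumes that "between" is inclusive and that the descriptors were already calculated by some cheminformatics toolkit. The descriptor values in the example are illustrative as well.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mw: float      # molecular weight
    atoms: int     # atom count (#Atoms)
    alogp: float   # calculated logP (ALogP)
    rings: int     # ring count (#Rings)
    carbons: int   # carbon count (#C)
    rtb: int       # rotatable bonds (RTB)


def passes_filters(d):
    """Apply the slide's filters; 'between' is assumed to be inclusive."""
    return (
        100 <= d.mw <= 1200
        and 6 <= d.atoms <= 70
        and -10 <= d.alogp <= 10
        and d.rings > 0
        and d.carbons > 0
        and d.carbons != d.atoms   # reject structures made only of carbon
        and d.rtb <= 20
    )


# Illustrative descriptor values for an aspirin-like and a benzene-like record.
aspirin_like = Descriptors(mw=180.16, atoms=13, alogp=1.3, rings=1, carbons=9, rtb=3)
benzene_like = Descriptors(mw=78.11, atoms=6, alogp=2.0, rings=1, carbons=6, rtb=0)

print(passes_filters(aspirin_like))   # True
print(passes_filters(benzene_like))   # False: MW below 100 and #C == #Atoms
```

Keeping the rules in one function makes it easy to apply the same filter to both compound sets before comparing their InChIs.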
• 12.2M SureChEMBL compounds are being loaded into UniChem, the InChI-based ‘Unified Chemical Identifier’ system
• SureChem drug-like subset (~5M) previously loaded
• Other UniChem sources include:
https://www.ebi.ac.uk/unichem/
Migration Status
• System currently built and optimised to run on Amazon Web Services
• The time and cost are considered too high to move away from AWS in the short term
• Long-term plan is to migrate on to EBI infrastructure
• 3-phase migration process
  1. Patent Data Server (IFI Claims/Fairview): done
  2. Data Processing Pipeline: done
  3. Web Application/API: work in progress
Technical Challenges
• User account migration
  - System currently uses Digital Science authentication system
  - User account required by Pro account service
  - Plan to move over to an open system, e.g. OAuth/OpenID
  - We aim to minimise impact on existing users, but may require users to sign up again
• Impact of providing free and open access to the Pro account service and API
  - Need to monitor usage
  - Usage limitations may be required
Entity Extraction
Enhanced Entity Extraction
• Identify new entity types, e.g. proteins, diseases, cell lines, assays
  - Extend using ChEMBL dictionaries + others
  - Ontology mapping/semantic tagging
• Protein/biotherapeutic sequence extraction
  - Sequence-based patent searches
• The system currently provides minimal cross-referencing
  - Quickly enhance using UniChem
  - Tag up all commonly used identifiers (ChEBI, ChEMBL, PubChem, UniProt, …)
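Dictionary-based tagging of new entity types could look roughly like the sketch below. It is not the SureChEMBL implementation: the mini-dictionaries are hypothetical stand-ins for ChEMBL-derived term lists, and matching is a plain case-insensitive whole-word lookup with no disambiguation or ontology mapping.

```python
import re

# Hypothetical mini-dictionaries; real ones would come from ChEMBL and others.
DICTIONARIES = {
    "protein": ["thrombin", "factor Xa"],
    "disease": ["thrombosis"],
    "cell_line": ["HeLa"],
}


def build_matcher(dictionaries):
    """Compile one regex over all terms, longest first, plus a term -> type map."""
    term_to_type = {
        term.lower(): entity_type
        for entity_type, terms in dictionaries.items()
        for term in terms
    }
    ordered = sorted(term_to_type, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return pattern, term_to_type


def tag_entities(text, dictionaries=DICTIONARIES):
    """Return (start, end, matched text, entity type) for every dictionary hit."""
    pattern, term_to_type = build_matcher(dictionaries)
    return [
        (m.start(), m.end(), m.group(0), term_to_type[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Thrombin inhibitors for thrombosis were tested in HeLa cells."
for hit in tag_entities(sentence):
    print(hit)
```

The character offsets returned with each hit are what an annotated full-text view would need in order to highlight the tagged terms.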
Bioactivity Data Extraction?
• Compounds
• Target/Assay
• Bioactivity
Markush Structure Extraction?
• -alkyl, -aryl, -heteroaryl, -heterocyclyl, -cycloalkyl, ….
Image Processing
• Image extraction starts from 01/01/2007
• Use Amazon EC2 Spot Instances to process pre-2007 image data
• Spot instances are significantly cheaper, e.g. m1.xlarge instance costs:
  - Standard cost = $0.52/hour
  - Spot instance cost = ~$0.125/hour
Image Processing
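The spot versus standard pricing quoted for m1.xlarge can be turned into a quick cost estimate. The two hourly rates come from the slide; the 10,000 instance-hour workload is a made-up figure for illustration only, since the slides do not say how much compute the pre-2007 backlog needs.

```python
STANDARD_RATE = 0.52   # $/hour, standard m1.xlarge price from the slide
SPOT_RATE = 0.125      # $/hour, approximate spot price from the slide


def cost(instance_hours, rate):
    """Total cost in dollars for a number of instance-hours at an hourly rate."""
    return instance_hours * rate


def spot_saving_fraction(standard=STANDARD_RATE, spot=SPOT_RATE):
    """Fraction of the standard price saved by running on spot instances."""
    return 1 - spot / standard


hours = 10_000  # hypothetical workload size
print(f"standard: ${cost(hours, STANDARD_RATE):,.0f}")
print(f"spot:     ${cost(hours, SPOT_RATE):,.0f}")
print(f"saving:   {spot_saving_fraction():.0%}")
```

At these rates the spot price is roughly a quarter of the standard price, a saving of about 76%.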
• New methods and tools can be introduced to improve compound image extraction
  - System currently uses CLiDE; alternatives include OSRA and Imago
  - Document segmentation, developed as part of curation system, could be applied to complex patent images
Open PHACTS Extension
• Open PHACTS project is keen to include patent data in future extensions to the project
• ENSO approved: funding to include SureChEMBL data in Open PHACTS
  - RDF conversion, target indexing and API development
  - EBI-RDF project benefits from RDF conversion
• SureChEMBL is updated daily, compared to quarterly ChEMBL updates
  - Interesting challenge for us creating exports and for systems loading SureChEMBL
More Future Plans
• Refactor interface for EMBL look and feel
• Third-party user support system migration
• Workflow tool enhancements
  - Update and release existing KNIME protocols + Pipeline Pilot
• Lots of interest in bringing the system in-house for internal document processing and searching
  - Complex licensing issues
  - AWS setup makes this easier
• Ligand Ensemble-based mapping of ChEMBL literature to patents
• Provide weekly/monthly feed of patent structures to PubChem
Rebranding Complete
People and Groups Involved
The ChEMBL Group
• John Overington
• Mark Davies
• George Papadatos
• Jon Chambers
• Anne Hersey
Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
• Tom Llewellyn
ChEMBL 18 Released
• 1,359,508 compounds
• 12,419,715 activities
• 1,042,374 assays
• 9,414 targets
• 53,298 documents
• 19 bioactivity sources
• Access via: Website, Web Services, Virtual Machine, Semantic Web, Widgets, Downloads
https://www.ebi.ac.uk/chembl/
myChEMBL Update Coming Soon
http://chembl.blogspot.co.uk/