SureChEMBL: Open Patent Data
Chemaxon UGM, Budapest, 21/05/2014

Mark Davies
ChEMBL Group, EMBL-EBI

EMBL-EBI Resources

Genes, genomes & variation: European Nucleotide Archive, Ensembl, European Genome-phenome Archive, Metagenomics portal, 1000 Genomes

Gene, protein & metabolite expression: ArrayExpress, MetaboLights, Expression Atlas, PRIDE

Protein sequences, families & motifs: InterPro, Pfam, UniProt

Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

Chemical biology: ChEMBL, ChEBI

Reactions, interactions & pathways: IntAct, Reactome, MetaboLights

Systems: BioModels, BioSamples, Enzyme Portal

ChEMBL – Data for Drug Discovery

1. Scientific facts
2. Organization, integration, curation and standardization of pharmacology data
3. Insight, tools and resources for translational drug discovery

[Figure: Compound, Assay/Target (thrombin sequence, below) and Bioactivity data (Ki = 4.5nM, APTT = 11 min.)]

>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

Patent Data

• Do we include patent data in the ChEMBL database?
  • We do provide cross-references (UniChem), but not the underlying chemical data
  • Most common question asked about ChEMBL during training and outreach
• Why is this important to Drug Discovery researchers?
  • Patent literature 2-3 years ahead of published literature
  • Prior art and freedom to operate
  • Lots more data, but high cost to extract + lots of noisy data

SureChem = SureChEMBL

• December 2013: EMBL-EBI acquired SureChem, a leading ‘chemistry patent mining’ product, from Digital Science, Macmillan Group
• SureChem not aligned with core future academic business
• SureChem provides a live (updated daily) view of chemical patent space
• Existing SureChem user base
  • Free (SureChemOpen)
  • Paying (SureChemPro + API)
• EMBL-EBI will support existing licensees (all have expired now)
• EMBL-EBI will provide an ongoing, free and open resource to the entire community
• Rebranded SureChEMBL

EMBL-EBI Chemistry Resources

RDF and REST API interfaces

Atlas (750): Ligand induced transcript response
PDBe (15K): Ligand structures from structurally defined protein complexes
ChEBI (24K): Nomenclature of primary and secondary metabolites. Chemical Ontology
ChEMBL (1.5M): Bioactivity data from literature and depositions
SureChEMBL (15M): Ligand structures from patent literature
3rd Party Data (~55M): ZINC, PubChem, ThomsonPharma, DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, ….

UniChem – InChI-based chemical resolver (full + relaxed ‘lenses’): >70M
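To make the idea of an InChI-based resolver with "full" and "relaxed" lenses concrete, here is a minimal in-memory sketch. It is not UniChem's implementation or API: the class name, method names and the placeholder keys are invented for illustration, and treating the "relaxed" lens as a match on the first block of the Standard InChIKey (the part that encodes the molecular skeleton) is an assumption.

```python
from collections import defaultdict


class InchiKeyResolver:
    """Toy cross-reference index keyed on InChIKey (illustrative only)."""

    def __init__(self):
        # Full lens: the whole key must match.
        self._full = defaultdict(set)
        # Relaxed lens (assumption): only the first InChIKey block must match.
        self._relaxed = defaultdict(set)

    def add(self, source, source_id, inchikey):
        """Register one (source, identifier) pair under its InChIKey."""
        self._full[inchikey].add((source, source_id))
        self._relaxed[inchikey.split("-")[0]].add((source, source_id))

    def resolve(self, inchikey, relaxed=False):
        """Return all (source, identifier) pairs that share the structure."""
        if relaxed:
            hits = self._relaxed.get(inchikey.split("-")[0], set())
        else:
            hits = self._full.get(inchikey, set())
        return sorted(hits)


# Placeholder keys, not real compounds: same first block, different second block.
resolver = InchiKeyResolver()
resolver.add("chembl", "ID-A", "AAAAAAAAAAAAAA-BBBBBBBBSA-N")
resolver.add("surechembl", "ID-B", "AAAAAAAAAAAAAA-CCCCCCCCSA-N")
```

With these two placeholder entries, a full lookup on the first key returns only the `chembl` record, while `relaxed=True` returns both records because they share the first block.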

REST API Interface

SureChEMBL System

[Screenshot: SureChEMBL search page with callouts: Keyword search; Patent number search; Structure search sketch; Paste SMILES, MOL, name; Types of chemistry; Filter by authority; Filter by date; Filter by document section; Help]

https://www.surechembl.org/

[Screenshot: Data Export and View Patent Family]

[Screenshot: SureChEMBL API Access]

System Capabilities
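The "free text keywords and Lucene fields" search capability on this slide can be illustrated with a small query builder. This is a sketch of the classic Lucene query syntax only (fielded terms, quoted phrases, inclusive ranges, AND). The field names `ttl`, `ab` and `pd` are hypothetical placeholders, not confirmed SureChEMBL field names.

```python
import re

# Characters and operators with special meaning in the classic Lucene query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape(term):
    """Backslash-escape Lucene special characters in a single term."""
    return _SPECIAL.sub(r"\\\1", term)


def field(name, value):
    """Build a fielded clause; multi-word values become quoted phrases."""
    if " " in value:
        phrase = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{name}:"{phrase}"'
    return f"{name}:{escape(value)}"


def date_range(name, start, end):
    """Build an inclusive range clause, e.g. pd:[20070101 TO 20140521]."""
    return f"{name}:[{start} TO {end}]"


def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)


# Hypothetical field names: ttl (title), ab (abstract), pd (publication date).
query = all_of(
    field("ttl", "kinase"),
    field("ab", "thrombin inhibitor"),
    date_range("pd", "20070101", "20140521"),
)
```

Keeping escaping in one place means user-supplied keywords containing characters such as `-` or `:` do not change the structure of the query.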

• Searching capabilities
  • Free text keywords and Lucene fields
  • Patent IDs & bibliographic information
  • Patent authority & date
  • Chemical structure
• Retrieving capabilities
  • Retrieve chemistry (with additional filters)
  • Retrieve patent family information
  • Retrieve annotated full patent text
• Accessible via Web Interface and API

SureChEMBL Data Coverage

Source            Data        Description & Languages          Years
EP applications   Bib. data   DocDB + Original                 From 1978
                  Full text   Original (EN, DE, FR)            From 1978
EP granted        Bib. data   DocDB + Original                 From 1980
                  Full text   Original (EN, DE, FR)            From 1980
WO applications   Bib. data   DocDB + Original                 From 1978
                  Full text   Original (EN, DE, FR, ES, RU)    From 1978
US applications   Bib. data   DocDB + Original                 From 2001
                  Full text   Original (EN)                    From 2001
US granted        Bib. data   DocDB + Original                 From 1920
                  Full text   Original (EN)                    From 1976
JP applications   Bib. data   DocDB                            From 1973
                  Full text   PAJ - English abstracts/titles   From 1976
JP granted        Bib. data   DocDB                            From 1994
90+ countries     Bib. data   DocDB                            From 1920

All patents from above data sources are searchable via SureChEMBL

SureChEMBL Chemistry Data Coverage

• Exemplified structures from patent title, description, abstract and claims
  • Structures from text: 1976 onwards
  • Structures from images: 2007 onwards
• USPTO have provided ‘Complex Work Units’ (CWUs) since 2001
  • CWU file types include MOL and CDX
  • CWUs processed as part of pipeline

SureChEMBL Data Processing

Figure: SureChEMBL System data processing pipeline
Patent Offices: WO; EP (Applications & Granted); US (Applications & granted); JP (Abstracts)
Components: SureChem IP; Patent PDFs (service); OCR; Entity Recognition; Name to Structure (five methods); Image to Structure (one method); Processed patents Database (service); Chemistry Database; Application Server; API; Users
Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

SureChEMBL and Chemaxon

Figure: the SureChEMBL System data processing pipeline diagram, repeated from the previous slide

ChEMBL Overlap

• InChI based comparison using filtered parent compounds

[Venn diagram: ChEMBL (ChEMBL 18): 1.3M; SureChEMBL (exported 08/05/14): 12.2M; overlap: 235K (18.4%)]
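A minimal sketch of this comparison, assuming each compound record already carries its parent-structure InChI and precomputed descriptors. The `Compound` record, its field names and the inclusive treatment of "between" are assumptions made for illustration; the thresholds are the ones listed under Filters on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    inchi: str      # InChI of the parent compound (assumed precomputed)
    mw: float       # molecular weight
    atoms: int      # atom count
    alogp: float    # ALogP
    rings: int      # ring count
    carbons: int    # carbon count
    rtb: int        # rotatable bonds


def passes_filters(c):
    """Apply the slide's filters; bounds are treated as inclusive (assumption)."""
    return (
        100 <= c.mw <= 1200
        and 6 <= c.atoms <= 70
        and -10 <= c.alogp <= 10
        and c.rings > 0
        and c.carbons > 0
        and c.carbons != c.atoms
        and c.rtb <= 20
    )


def overlap(set_a, set_b):
    """Return (shared, size_a, size_b) counted on filtered, distinct InChIs."""
    keys_a = {c.inchi for c in set_a if passes_filters(c)}
    keys_b = {c.inchi for c in set_b if passes_filters(c)}
    return len(keys_a & keys_b), len(keys_a), len(keys_b)
```

Dividing the shared count by the size of either filtered set gives a percentage overlap of the kind quoted in the diagram.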

Filters
• MW between 100 and 1200
• #Atoms between 6 and 70
• ALogP between -10 and 10
• #Rings > 0
• #C > 0
• #C != #Atoms
• RTB <= 20

SureChEMBL and UniChem

• 12.2M SureChEMBL compounds are being loaded into UniChem, the InChI based ‘Unified Chemical Identifier’ system
• SureChem drug-like subset (~5M) previously loaded
• Other UniChem sources include:

https://www.ebi.ac.uk/unichem/

Migration Status

• System currently built and optimised to run on Amazon Web Services
• The time and cost considered too high to move away from AWS in the short term
• Long-term plan is to migrate on to EBI infrastructure
• 3 phase migration process
  1. Patent Data Server (IFI Claims/Fairview) – done
  2. Data Processing Pipeline – done
  3. Web Application/API – work in progress

Technical Challenges

• User account migration
  • System currently uses Digital Science authentication system
  • User account required by Pro account service
  • Plan to move over to an open system, e.g. OAuth/OpenID
  • We aim to minimise impact on existing users, but may require users to sign up again
• Impact of providing free and open access to the Pro account service and API
  • Need to monitor usage
  • Usage limitations may be required

Entity Extraction

Enhanced Entity Extraction

• Identify new entity types, e.g. diseases, cell lines, assays, ...
  • Extend using ChEMBL dictionaries + others
  • Ontology mapping/Semantic tagging
• Protein/biotherapeutic sequence extraction
  • Sequence based patent searches
• Currently the system provides minimal cross-referencing
  • Quickly enhance using UniChem
  • Tag up all commonly used identifiers (ChEBI, ChEMBL, PubChem, UniProt,…)
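As an illustration of "tagging up" commonly used identifiers in patent text, the sketch below uses regular expressions for three identifier styles. It is not the SureChEMBL entity extraction pipeline: the function name and pattern table are invented here, the ChEMBL and ChEBI patterns simply follow the usual `CHEMBL<digits>` and `CHEBI:<digits>` forms, and the UniProt pattern follows the published accession format. A dictionary-based tagger, as the slide suggests, would be needed for diseases, cell lines and assays.

```python
import re

# Pattern table (illustrative). PubChem CIDs are plain integers and are
# left out because they cannot be recognised reliably without context.
PATTERNS = {
    "ChEMBL": re.compile(r"\bCHEMBL\d+\b"),
    "ChEBI": re.compile(r"\bCHEBI:\d+\b"),
    "UniProt": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
}


def tag_identifiers(text):
    """Return (start, end, kind, matched_text) tuples in document order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), kind, match.group()))
    return sorted(hits)


sample = "See CHEMBL192, CHEBI:9139 and P00734."
tags = tag_identifiers(sample)
```

Each hit keeps its character offsets, so the tags can be stored as annotations over the full patent text without altering it.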

Bioactivity Data Extraction?

[Figure: Compounds, Target/Assay, Bioactivity]

Markush Structure Extraction?

[Figure: Markush structure with substituent classes -alkyl, -aryl, -heteroaryl, -heterocyclyl, -cycloalkyl, ….]

Image Processing

• Image extraction starts from 01/01/2007
• Use Amazon EC2 Spot Instances to process pre-2007 image data
• Spot instances are significantly cheaper, e.g. m1.xlarge instance costs:
  • Standard Cost = $0.52/hour
  • Spot Instance Cost = ~$0.125/hour

Image Processing
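The Spot Instance saving can be worked through from the two m1.xlarge prices quoted on the previous slide. The hourly rates come from the slide; the 1,000 instance-hour workload is a hypothetical figure chosen only to show the arithmetic.

```python
STANDARD_USD_PER_HOUR = 0.52   # m1.xlarge standard cost, from the slide
SPOT_USD_PER_HOUR = 0.125      # m1.xlarge approximate spot cost, from the slide


def saving_fraction(standard, spot):
    """Fraction of the standard price saved by paying the spot price."""
    return 1 - spot / standard


def job_cost(instance_hours, usd_per_hour):
    """Total cost of a job that needs the given number of instance-hours."""
    return instance_hours * usd_per_hour


saving = saving_fraction(STANDARD_USD_PER_HOUR, SPOT_USD_PER_HOUR)  # about 0.76

# Hypothetical workload size, for illustration only.
HOURS = 1000
standard_total = job_cost(HOURS, STANDARD_USD_PER_HOUR)  # about 520 USD
spot_total = job_cost(HOURS, SPOT_USD_PER_HOUR)          # 125 USD
```

At these rates the spot price is roughly a quarter of the standard price, a saving of about 76%, provided the workload tolerates instances being reclaimed.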

• New methods and tools can be introduced to improve compound image extraction
• System currently uses CLiDE; alternatives include OSRA and Imago
• Document segmentation, developed as part of curation system, could be applied to complex patent images

Open PHACTS Extension

• Open PHACTS project is keen to include patent data in future extensions to the project
• ENSO approved: funding to include SureChEMBL data in Open PHACTS
  • RDF conversion, target indexing and API development
  • EBI-RDF project benefits from RDF conversion
• SureChEMBL is updated daily, compared to quarterly ChEMBL updates
  • Interesting challenge for us creating exports and systems loading SureChEMBL

More Future Plans

• Refactor interface for EMBL look and feel
• Third party user support system migration
• Workflow tool enhancements
  • Update and release existing KNIME protocols + Pipeline Pilot
• Lots of interest to bring the system in-house for internal document processing and searching
  • Complex licensing issues
  • AWS setup makes this easier
• Ligand Ensemble-based mapping of ChEMBL literature to patents
• Provide weekly/monthly feed of patent structures to PubChem

Rebranding Complete

People and Groups Involved

The ChEMBL Group
• John Overington
• Mark Davies
• George Papadatos
• Jon Chambers
• Anne Hersey

Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
• Tom Llewellyn

ChEMBL 18 Released

[Figure: ChEMBL 18 access routes: Website, Web Services, Virtual Machine, Semantic Web, Widgets, Downloads]

• 1,359,508 compounds
• 12,419,715 activities
• 1,042,374 assays
• 9,414 targets
• 53,298 documents
• 19 bioactivity sources

https://www.ebi.ac.uk/chembl/

myChEMBL Update Coming Soon

http://chembl.blogspot.co.uk/