Chemistry Connect: AstraZeneca’s cheminformatics platform for large-scale integration of structure and bioactivity data
ICIC 2012 14-17 October Berlin
Sorel Muresan
AstraZeneca R&D Mölndal, Chemistry Innovation Centre, Discovery Sciences
Chemistry Connect – a team effort
Discovery Sciences | CIC
http://dx.doi.org/10.1016/j.drudis.2011.10.005
Driver – explosion in SAR data
• Chemical information landscape changing fast
• Make every SAR point count, access all available chemistry
• Internal & external data sources
[Figure: panels labelled 2006 and 2008]
Southan, C.; Varkonyi, P.; Muresan, S., J. Cheminfo. 2009
SAR key entities and relationships
Unstructured Data from Documents → Expert Extraction or Text Mining → Structured Entries in Relational Databases
Southan, C.; Boppana, K.; Jagarlapudi, S.; Muresan, S. J. Cheminfo. 2011
Manually extracted SAR data (commercial)
• GOSTAR (GVKBIO Online Structure Activity Relationship Database) is a comprehensive database that captures explicit relationships between the three entities of publications, compounds and targets.
SAR data (public)
• PubChem
  • the NCBI public informatics backbone for the NIH Molecular Libraries Initiative, focused on small molecules as systems biology probes and potential therapeutic agents.
• ChEMBL
  • includes drugs, small molecules from the medicinal chemistry or biochemical literature and their targets.
Extracting chemical entities from text
Collaboration with IBM Research Almaden to apply text analytics technology to analyze intellectual property and scientific literature
- 10 million full text patents
- 11 million structures
- 17% of the 58M parent structures in Chemistry Connect
Chemical Named Entity Recognition (NER)
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
Name-to-Structure software
CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl
Extracting chemical entities from text
The biggest cause of missing compounds when extracting chemical entities from text is the presence of typographical errors: human errors, OCR failures, hyphenation and multiple line issues, etc.
• Automated spelling correction with CaffeineFix from NextMove Software
• CaffeineFix significantly improves extraction rates (22% increase from D=0 to D=1)
• Name-to-structure (n2s) tools are complementary (40% of the structures come from a single n2s tool)
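The two failure modes named above, line-break hyphenation and single-character typos, can be illustrated with a minimal Python sketch. It is not CaffeineFix's algorithm: it uses a plain linear scan over a small dictionary and assumes that D denotes the number of permitted single-character edits. The fragment dictionary and the function names are hypothetical.

```python
import re
from typing import Optional, Set


def rejoin_hyphenated(text: str) -> str:
    """Remove whitespace or line breaks that follow a hyphen inside a name."""
    # Only join when the hyphen sits between two word characters,
    # so a spaced dash in prose ("compound - target") is left alone.
    return re.sub(r"(?<=\w)-\s+(?=\w)", "-", text)


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution
    return a[i:] == b[i + 1:]  # one character missing from a


def correct(token: str, dictionary: Set[str], max_edits: int = 1) -> Optional[str]:
    """Return the dictionary term for token, or None if absent or ambiguous."""
    token = token.casefold()
    if token in dictionary:
        return token  # D=0: exact match
    if max_edits < 1:
        return None
    candidates = sorted(w for w in dictionary if within_one_edit(token, w))
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical dictionary of name fragments.
FRAGMENTS = {"chloro", "dihydro", "methyl", "phenyl", "benzodiazepin"}

fixed_name = rejoin_hyphenated("1-METHYL-5- PHENYL-2H")
fixed_token = correct("pheny1", FRAGMENTS, max_edits=1)
```

With `max_edits=0` the OCR-style token `pheny1` is rejected, while `max_edits=1` recovers `phenyl`, which is the kind of gain the D=0 to D=1 comparison describes.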
Sayle, R.; Xie, P.; Muresan, S., JCIM 2012
Chemistry Connect
Compound
Document
Test & Result
Target
Chemistry Connect
Exact match source comparisons
[Figure annotations: sources that include predominantly patent-derived compounds; known drugs]
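An exact-match comparison of this kind can be sketched with set operations. The example below assumes each structure has already been reduced to a canonical key (for example an InChIKey after normalization). The keys shown are placeholders, not real data, and the function names are hypothetical.

```python
from itertools import combinations

# Placeholder structure keys per source. Real keys would be canonical
# identifiers computed after structure normalization.
SOURCES = {
    "GVKBIO": {"key-A", "key-B", "key-C"},
    "ChEMBL": {"key-B", "key-C", "key-D"},
    "SureChem": {"key-C", "key-E"},
}


def pairwise_overlap(sources):
    """Count structures shared by each pair of sources (exact key match)."""
    return {
        (a, b): len(sources[a] & sources[b])
        for a, b in combinations(sorted(sources), 2)
    }


def unique_to(name, sources):
    """Return the keys that appear in one source and in no other."""
    others = set().union(
        *(keys for other, keys in sources.items() if other != name)
    )
    return sources[name] - others


overlap = pairwise_overlap(SOURCES)
only_surechem = unique_to("SureChem", SOURCES)
```

Exact matching is deliberately strict: two records that differ only in salt form or stereo annotation count as different unless the keys were computed on normalized parent structures.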
Finding a common language
[Figure: cloud of >1000 synonyms and identifiers for acetaminophen, including:]
• chemical names: 4-Acetamidophenol, 4-(Acetylamino)phenol, 4'-Hydroxyacetanilide, ACETAMIDE, N-(4-HYDROXYPHENYL)-, ACETANILIDE, 4'-HYDROXY-
• spelling variants: Acetaminofen, acetaminophene, ACETOMINOPHEN
• trade names: Abenol, Acamol, Acephen, Actimol, Anacin-3, Anexsia, Apacet
• registry numbers and codes: 103-90-2, AC112578, AG10223, 882-720-13
• related entities: [3H]Acetaminophen, acetaminophen sulfate, Acetaminophen glucuronide, 3-hydroxyacetaminophen
Word of the Day: Crowdsourcing
Chemistry Connect
Technical Overview - ETL
Data Sources → Extraction → Transformation → Loading
• Data Sources: Text Files, Web Service
• Extraction: Python Scripts
• Transformation: Structure Normalization (chemistry), Property calc, Pipeline Pilot (biological results)
• Loading: Oracle PL/SQL, Oracle DB (ext tables)
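The extract, transform and load stages can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: `sqlite3` stands in for the Oracle database, stripping whitespace stands in for real structure normalization (which needs a chemistry toolkit), and the compound IDs and IC50 values are invented for the example.

```python
import csv
import io
import math
import sqlite3

# Hypothetical tab-delimited input; IDs and IC50 values are made up.
RAW = """compound_id\tsmiles\tic50_nm
CMPD-1\t CN1c2ccc(cc2C(=NCC1=O)c3ccccc3)Cl \t25
CMPD-2\tCC(=O)Nc1ccc(O)cc1\t1000
"""


def extract(raw_text):
    """Extraction: parse the delimited text into one dict per row."""
    return list(csv.DictReader(io.StringIO(raw_text), delimiter="\t"))


def transform(rows):
    """Transformation: tidy the structure string and derive pIC50."""
    records = []
    for row in rows:
        smiles = row["smiles"].strip()  # stand-in for structure normalization
        # pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
        pic50 = 9.0 - math.log10(float(row["ic50_nm"]))
        records.append((row["compound_id"], smiles, round(pic50, 2)))
    return records


def load(records, connection):
    """Loading: insert the transformed records into a results table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sar "
        "(compound_id TEXT, smiles TEXT, pic50 REAL)"
    )
    connection.executemany("INSERT INTO sar VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW)), connection)
rows = connection.execute(
    "SELECT compound_id, smiles, pic50 FROM sar ORDER BY compound_id"
).fetchall()
```

Keeping each stage as a separate function mirrors the slide's division of labour, so a different loader or normalizer can be swapped in without touching the other stages.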
Technical Overview - Application
HTML
Java
Oracle 11g WebLogic Server REST (and SOAP) services Direct 7 .Net
PipelinePilot Knime Excel
Chemistry Connect Apps
Canvas
Chemistry Connect
SARConnect
Plato
Key compounds
Canvas is…
…a Rosetta stone for compounds (196 B.C. → 2012 A.D.)
It automatically translates AZ numbers, SNs, chemical names, structures, SMILES, development IDs, reagent IDs, trade names, legacy Astra & Zeneca IDs…
…really easy to use
Copy a compound name or structure to the clipboard and let Canvas do the rest of the work
…a portal to information
It acts as a springboard to let you access Chemistry Connect, ISAC, IBEX, IBIS data, Compound View, ELN data, AZ Patent Db, Integrity...
…a compound design tool
It quickly calculates C-lab properties, chemical names, molecular weights, checks novelty…
…and in 2011, 1750 AZ scientists did
Biologists, Med chem, Synthetic chem, Patent attorneys, Crystallographers, DMPK, Comp chem, Safety assessment & many others…
Jon Winter, Oncology iMed
Utopia Documents
http://getutopia.com/index.php
Key compound prediction from patents
From WO1996025405, the earliest patent which claims it, can you work out the structure of Bextra (Valdecoxib), the Pfizer NSAID?
74 exemplified compounds
Tyrchan, C. et al JCIM 2012
WO1996025405 - Bextra
Source    #compounds  Bextra exists  Bextra ranked
GVKBIO    74          Y              1 (broad core), 1 (narrow core)
SureChem  501         Y              1 (broad core), 1 (narrow core)
EP268956 - Aciphex
Source    #compounds  Aciphex exists  Aciphex ranked
GVKBIO    27          Y               2 (core1), 1 (core2)
SureChem  168         Y               1 (core1), 1 (core2)
PLATO for Safety/Tox – General Concept
Input Molecule → Job → Results Summary
Services: Predictive Secondary Pharmacology, Expert Systems, QSAR Models, BioSim, Pharma Connect, Additional Services
All services in Plato are complementary to find an overall answer to your problem!
Scott Boyer, Catrin Hasselgren, Lars Carlsson, Tobias Noeske
Predictive Secondary Pharmacology strategy
Input Molecule → Similarity or substructure search → Chemistry Connect
• Compound information: name & structure
• Target information: bioactivity data
Compound - Target associations
• Potential off-target effects?
• Part of a safety risk assessment
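The similarity-search step of this strategy can be sketched as a nearest-neighbour lookup: compare the input molecule with reference compounds and collect the targets of those that are similar enough. The example below uses the Tanimoto coefficient on sets of fingerprint bits. The fingerprints, compound names, target labels and the 0.5 threshold are all hypothetical, and a real system would generate the fingerprints with a chemistry toolkit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: shared bits divided by the union of bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical reference set: name -> (fingerprint bits, known targets).
REFERENCE = {
    "ref-1": (frozenset({1, 2, 3, 4}), {"TARGET_A"}),
    "ref-2": (frozenset({1, 2, 3, 9}), {"TARGET_A", "TARGET_B"}),
    "ref-3": (frozenset({7, 8}), {"TARGET_C"}),
}


def predicted_targets(query, reference, threshold=0.5):
    """Map each target to the best similarity among neighbours above threshold."""
    hits = {}
    for _name, (fingerprint, targets) in reference.items():
        score = tanimoto(query, fingerprint)
        if score >= threshold:
            for target in targets:
                hits[target] = max(score, hits.get(target, 0.0))
    # Highest similarity first, ties broken alphabetically.
    return dict(sorted(hits.items(), key=lambda item: (-item[1], item[0])))


query = frozenset({1, 2, 3, 4})
associations = predicted_targets(query, REFERENCE)
```

Targets that surface only through neighbours, rather than through the query's own data, are the candidates to flag as potential off-target effects.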
Similarity concept: Similar compounds bind to similar targets.
M Johnson et al., Prog Clin Biol Res (1989), 291:167
P Willett, Drug Discov Today (2006), 11:1046
Scott Boyer, Catrin Hasselgren, Lars Carlsson, Tobias Noeske
SARConnect – navigate SAR landscape
Test & Results Compound hierarchy
Target hierarchy
Eriksson, M. et al Molecular Informatics 2012
SARConnect – structure classification
Compound structure Molecular Framework Topological Framework Terminal Rings & Bonds
Level 1 Level 2 Level 3 Level 4
SARConnect – target classification
Level 1 (Broad target class)
Enzyme NHR GPCR Ion Channel Other
Level 2 (Swiss-Prot family class) GPCR Signal Transmembrane Transmembrane
Level 3 (Sub-families) Class A Class B Class C Frizzled Family
Symbol
CALCR CALCLR CRHR2 GCGR GIPR GLP1R PTH2R VIPR1 VIPR2
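A target hierarchy like this lets per-target data be rolled up to any level. The sketch below assumes a simple symbol-to-path mapping and invented data-point counts. Level 2 is omitted because the slide's labels for it are not recoverable, and `ADRB2` is not on the slide but is added as a Class A example for contrast.

```python
from collections import Counter

# Symbol -> (broad target class, sub-family). The mapping is illustrative.
TARGET_PATH = {
    "CALCR": ("GPCR", "Class B"),
    "GLP1R": ("GPCR", "Class B"),
    "VIPR1": ("GPCR", "Class B"),
    "ADRB2": ("GPCR", "Class A"),  # not on the slide; added for contrast
}


def roll_up(counts_by_symbol):
    """Aggregate per-symbol counts onto every ancestor in the hierarchy."""
    totals = Counter()
    for symbol, count in counts_by_symbol.items():
        path = TARGET_PATH[symbol]
        # Credit each prefix of the path: ("GPCR",), ("GPCR", "Class B"), ...
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += count
        totals[path + (symbol,)] += count  # leaf node for the symbol itself
    return totals


# Invented numbers of SAR data points per target.
counts = {"CALCR": 10, "GLP1R": 25, "VIPR1": 5, "ADRB2": 40}
totals = roll_up(counts)
```

Keying the totals by path tuple means the same structure answers questions at the broad-class, sub-family or single-target level.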
SARConnect – navigate SAR landscape
Take-home messages
• Chemistry Connect is enabling AstraZeneca to intensify its exploitation of synergies between its internal and external SAR estate and to shorten the time to hypothesis generation during design-make-test-analyse (DMTA) cycles
• Our Chemical Dictionary of 120 million chemical terms has become a crucial cross-mapping resource between chemistry and the scientific literature
• We cannot wave a magic wand over data quality, provenance issues, drug name space, and the inherent challenges of chemistry representation, but Chemistry Connect gives us a unique overview and amelioration options for each source
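The cross-mapping role of a chemical dictionary can be sketched as a normalized synonym index: many surface forms resolve to one compound key. This is a minimal illustration, not the Chemistry Connect implementation. The compound key `CPD-0001` and the function names are hypothetical, and the normalization rule (case-folding plus removal of whitespace, hyphens and apostrophes) is an assumption that would be too aggressive for some names.

```python
import re


def normalize(term: str) -> str:
    """Case-fold a term and drop whitespace, hyphens and apostrophes."""
    return re.sub(r"[\s\-'’]", "", term.casefold())


# A handful of the acetaminophen synonyms shown earlier in the deck,
# mapped to a hypothetical internal compound key.
ENTRIES = {
    "CPD-0001": [
        "Acetaminophen",
        "4-Acetamidophenol",
        "4'-Hydroxyacetanilide",
        "Acamol",
        "103-90-2",
    ],
}


def build_index(entries):
    """Build a lookup from normalized term to compound key."""
    index = {}
    for key, terms in entries.items():
        for term in terms:
            index[normalize(term)] = key
    return index


def lookup(term, index):
    """Return the compound key for a term, or None if it is unknown."""
    return index.get(normalize(term))


index = build_index(ENTRIES)
found = lookup("4-HYDROXYACETANILIDE", index)
```

The lookup resolves `4-HYDROXYACETANILIDE` even though only `4'-Hydroxyacetanilide` was registered, because both normalize to the same string.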
A Democracy of Ideas (Acknowledgements)
• Plamen Petrov
• Chris Southan
• Paul Xie
• Peter Varkonyi
• Thierry Kogej
• Christian Tyrchan
• Magnus Kjellberg
• Håkan Nilsson
• Mats Eriksson
• Jonas Ekengren
• Ithipol Suriyawongkul
• Niklas Blomberg
• Jon Winter
• John Cumming
• Scott Boyer
• Catrin Hasselgren
• Lars Carlsson
• Tobias Noeske
• and many others…
Thank you!