http://www.orcid.org/0000-0002-2668-4821
Cheminformatics approaches to support chemical identification delivered via the EPA CompTox Chemicals Dashboard
Antony Williams1, Andrew D. McEachran2, Chris Grulke1, Elin Ulrich3 and Jon R. Sobus3
1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC
2) Oak Ridge Institute of Science and Education (ORISE) Research Participant, RTP, NC
3) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC
The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
Spring 2019 ACS Spring Meeting, Orlando
Suspect Screening and Non-Targeted Analysis Workflows
[Figure: suspect screening and non-targeted analysis workflow diagram.]
Color key: Red = Analytical Chemistry; Blue = Data Processing & Analysis; Purple = Mathematical & QSPR Modeling; Green = Informatics & Web Services
Suspect screening labels: Raw Samples, Extracted Samples, Raw Features, “Molecular Features”, DSSTox Chemical Database, Matched Formulas, Mapped Structures, Prioritized Structures (using ToxPi), Confirmed Structures (using ToxCast standards), Predicted Concentrations
Non-targeted analysis labels: Processed Features, Prioritized Features, Predicted Formulas, Candidate Structures, Sorted Structures, Predicted Retention Times, Predicted Mass Spectra, Predicted/Observed Functional Use, Predicted/Observed Media Occurrence, Methodological Concordance, Top Candidate Structure(s)
1 CompTox Chemicals Dashboard https://comptox.epa.gov/dashboard
875k Chemical Substances
2 BASIC Search
3 Detailed Chemical Pages
4 Access to Chemical Hazard Data
5 In Vitro Bioassay Screening ToxCast and Tox21
6 Sources of Exposure to Chemicals
7 Link Access
8 NIST WebBook https://webbook.nist.gov/chemistry/
9 MassBank of North America https://mona.fiehnlab.ucdavis.edu
10 m/z CLOUD https://www.mzcloud.org/
11 DO WE REALLY NEED ANOTHER DATABASE?
12 Is a bigger database better?
• ChemSpider was 26 million chemicals then • Much BIGGER today • Is bigger better??
13 Comparing Search Performance
• Dashboard content was 720k chemicals • Only 3% of ChemSpider size • What was the comparison in performance?
14 SAME dataset for comparison
15 How did performance compare?
16 Data Quality is important
• Data quality in free web-based databases!
17 Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search
18 Comparing ChemSpider Structures
19 Comparing ChemSpider Structures
20 Other Searches
21 Delivering a Better Database
• An ideal database would provide:
– Curated CAS Number-Name mappings with “correct” chemical structures
• We have full-time curators checking data
22 MASS AND FORMULA SEARCHING (and metadata ranking)
23 Advanced Searches Mass and Formula Based Search
24 Advanced Searches Mass and Formula Based Search
25 Using Metadata for Ranking
• Use available metadata to rank candidates
– Associated data sources
  • Associated lists in DSSTox database
  • Associated sources in PubChem
  • Specific types (e.g. water, surfactants, pesticides)
– Number of associated PubMed articles
– Number of products/categories containing the chemical
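The ranking idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Candidate` class, the field names, the counts, and the tie-breaking order (data sources first, then PubMed articles, then products) are all hypothetical and are not the Dashboard's actual scoring scheme.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    data_sources: int      # associated lists/sources (hypothetical counts)
    pubmed_articles: int   # associated PubMed articles (hypothetical counts)
    products: int          # products/categories containing the chemical


def rank_candidates(candidates):
    # Sort descending: data sources first, ties broken by PubMed, then products.
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_articles, c.products),
        reverse=True,
    )


# Made-up candidates that share one formula or mass; not real Dashboard records.
candidates = [
    Candidate("candidate_A", data_sources=3, pubmed_articles=0, products=0),
    Candidate("candidate_B", data_sources=41, pubmed_articles=120, products=15),
    Candidate("candidate_C", data_sources=41, pubmed_articles=8, products=30),
]
ranked = rank_candidates(candidates)
for position, c in enumerate(ranked, start=1):
    print(position, c.name)
```

A lexicographic sort is the simplest choice; a weighted sum of the same counts would be an equally plausible alternative.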
26 Metadata rank ordering
27 SPECIFIC APPLICATIONS TO MASS SPEC.
28 Mass Spec Focused Applications
29 “MS-Ready Structures” https://doi.org/10.1186/s13321-018-0299-2
31 MS-Ready Mappings
32 MS-Ready Mappings Set
33 Advanced Searches Mass Search
34 Advanced Searches Mass Search
35 MS-Ready Structures for Formula Search
36 MS-Ready Mappings
• EXACT Formula: C10H16N2O8: 3 Hits
37 MS-Ready Mappings
• Same Input Formula: C10H16N2O8 • MS Ready Formula Search: 125 Chemicals
38 MS-Ready Mappings
• 125 chemicals returned in total
– 8 of the 125 are single component chemicals
– 3 of the 8 are isotope-labeled
– 3 are neutral compounds and 2 are charged
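The difference between an exact formula search and an MS-Ready formula search can be illustrated with a toy index. This is a sketch under stated assumptions, not the Dashboard's implementation: the three records and the `build_index` helper are hypothetical, and the MS-Ready formula is simply supplied by hand here (the salt is mapped to its neutral parent form) rather than computed.

```python
from collections import defaultdict

# (name, formula as registered, MS-Ready formula); illustrative records only.
RECORDS = [
    ("EDTA", "C10H16N2O8", "C10H16N2O8"),
    ("EDTA disodium salt", "C10H14N2Na2O8", "C10H16N2O8"),
    ("Caffeine", "C8H10N4O2", "C8H10N4O2"),
]


def build_index(records, column):
    # Map each formula in the chosen column to the names registered under it.
    index = defaultdict(list)
    for record in records:
        index[record[column]].append(record[0])
    return index


exact_index = build_index(RECORDS, 1)      # formula as registered
ms_ready_index = build_index(RECORDS, 2)   # formula of the MS-Ready form

query = "C10H16N2O8"
print("exact:", exact_index[query])
print("MS-Ready:", ms_ready_index[query])
```

The exact search finds only the substance registered with that formula, while the MS-Ready search also returns substances whose MS-Ready form has that formula, which is why the hit count grows.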
39 Batch Searching
• Singleton searches are useful but we work with thousands of masses and formulae!
• Typical questions
– What is the list of chemicals for the formula CxHyOz?
– What is the list of chemicals for a mass +/- error?
– Can I get chemical lists in Excel files? In SDF files?
– Can I include properties in the download file?
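A batch "mass +/- error" lookup of the kind asked about above can be sketched as follows. The function names, the ppm tolerance, and the small candidate list are assumptions for illustration; a real batch search would run against the Dashboard's database rather than an in-memory list.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes for a few elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
    "Na": 22.9897693,
}


def monoisotopic_mass(formula):
    # Parse simple formulas such as "C10H16N2O8" (no brackets or charges).
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def batch_mass_search(query_masses, formulas, ppm=5.0):
    # For each query mass, keep formulas whose mass lies within +/- ppm.
    results = {}
    for mass in query_masses:
        tolerance = mass * ppm / 1e6
        results[mass] = [
            f for f in formulas if abs(monoisotopic_mass(f) - mass) <= tolerance
        ]
    return results


formulas = ["C10H16N2O8", "C8H10N4O2"]   # hypothetical candidate list
hits = batch_mass_search([292.0907, 194.0804, 500.0], formulas)
for mass, matched in hits.items():
    print(mass, matched)
```

The resulting dictionary could then be written out with Python's `csv` module for use in a spreadsheet.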
40 Batch Searching Formula/Mass
41 Searching batches using MS-Ready Formula (or mass) searching
42 RELATED APPLICATIONS OF INTEREST TO MASS SPEC.
43 Find me “related structures” Formula-Based Search
44 Select Chemicals of Interest
45 Find me “related structures” Based on Structure Similarity
46 Find me “related structures” Based on Structure Similarity
47 Find me “related structures” Structure Similarity – sort on mass
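A similarity-based "related structures" search followed by a sort on mass can be sketched with Tanimoto similarity over fingerprint bit sets. The slides do not state which similarity measure or fingerprint the Dashboard uses, so Tanimoto, the 0.5 threshold, the bit sets, and the library entries below are all assumptions.

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto coefficient: shared on-bits divided by total distinct on-bits.
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 0.0


def related_structures(query_fp, library, threshold=0.5):
    # Keep entries at or above the similarity threshold, then sort on mass.
    hits = [(name, mass, tanimoto(query_fp, fp)) for name, mass, fp in library]
    hits = [hit for hit in hits if hit[2] >= threshold]
    return sorted(hits, key=lambda hit: hit[1])


# (name, monoisotopic mass, set of on-bits); made-up values for illustration.
library = [
    ("analog_1", 310.12, {1, 2, 3, 4}),
    ("analog_2", 250.08, {1, 2, 3, 9}),
    ("unrelated", 180.06, {7, 8}),
]
query = {1, 2, 3}
hits = related_structures(query, library)
for name, mass, score in hits:
    print(name, mass, round(score, 2))
```

In practice the bit sets would come from a cheminformatics toolkit; plain Python sets keep the sketch self-contained.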
48 Literature Searching
49 Literature Searching
50 Literature Searching
51 FOCUSED CHEMICAL LISTS OF INTEREST
52 Chemical Lists
53 EPAHFR: Hydraulic Fracturing
54 PFAS lists of Chemicals
55 COMPLEX CHEMICAL SUBSTANCES
56 UVCB Chemicals
57 Many Hydraulic Fracturing Chemicals are “Complex”
58 “Markush Structures” https://en.wikipedia.org/wiki/Markush_structure
59 WORK IN PROGRESS
60 Suspect Screening and Non-Targeted Analysis Workflow
[Figure: workflow diagram repeated from the opening slide.]
61 Work in Progress
• Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database
62 Predicted Mass Spectra http://cfmid.wishartlab.com/
• MS/MS spectra prediction for ESI+, ESI-, and EI • Predictions generated and stored for >800,000 structures, to be accessible via Dashboard
63 Search Expt. vs. Predicted Spectra
64 Search Expt. vs. Predicted Spectra
65 Spectral Viewer Comparison
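Comparing an experimental spectrum with a predicted one can be sketched with a binned cosine score. The slides do not say which scoring function is used for the search, so cosine similarity, the 0.01 m/z bin width, and the peak lists below are assumptions.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    # Sum intensities of peaks that fall into the same m/z bin.
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_score(experimental, predicted, bin_width=0.01):
    # Cosine of the angle between the two binned intensity vectors.
    a = bin_spectrum(experimental, bin_width)
    b = bin_spectrum(predicted, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up (m/z, relative intensity) peak lists for illustration.
experimental = [(100.0, 1.0), (150.0, 0.5)]
predicted_match = [(100.0, 1.0), (150.0, 0.5)]
predicted_other = [(120.0, 1.0), (175.0, 0.3)]

print(cosine_score(experimental, predicted_match))
print(cosine_score(experimental, predicted_other))
```

Candidates with higher scores against their predicted spectra would be ranked higher; the score could be combined with the metadata counts described earlier.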
66 Work in Progress
• Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction
67 Retention Time Prediction for Ranking
68 Moving to Relative Retention Times
69 Work in Progress
• Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction • Structure/substructure/similarity search
70 Prototype Development
71 Prototype Development
72 Work in Progress
• Predicted Spectra for candidate ranking – Viewing and Downloading pre-predicted spectra – Search spectra against the database • Retention Time Index Prediction • Structure/substructure/similarity search • Access to API and web services for programmatic access
73 API services and Open Data
• Groups waiting on our API and web services • Mass Spec companies instrument integration • Release will be in iterations but for now our data are available
74 SIDE EFFECTS OF SHARING OPEN DATA
75 NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236
76 Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2
77 Conclusion
• Dashboard access to data for ~875,000 chemicals
• MS-Ready data facilitates structure identification
• Related metadata facilitates candidate ranking
• Relationship mappings and chemical lists of great utility
• Dashboard and contents are one part of the solution
• We are committed to open API development over time
78 Acknowledgements
• THANK YOU for the invitation!
• IT Development team – especially Jeff Edwards and Jeremy Dunne
• Chris Grulke for the ChemReg system
• NERL colleagues – Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton
• Emma Schymanski, LCSB, Luxembourg
• The NORMAN Network and all contributors
79 Contact
Antony Williams
US EPA Office of Research and Development
National Center for Computational Toxicology
EMAIL: [email protected]
ORCID: https://orcid.org/0000-0002-2668-4821