
http://www.orcid.org/0000-0002-2668-4821

Cheminformatics approaches to support chemical identification delivered via the EPA CompTox Chemicals Dashboard

Antony Williams1, Andrew D. McEachran2, Chris Grulke1, Elin Ulrich3 and Jon R. Sobus3
1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC
2) Oak Ridge Institute for Science and Education (ORISE) Research Participant, RTP, NC
3) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC

The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA

Spring 2019 ACS Meeting, Orlando

Suspect Screening and Non-Targeted Analysis Workflows

[Figure: workflow diagram covering Suspect Screening and Non-Targeted Analysis.
Color key: Red = Analytical Chemistry; Blue = Data Processing & Analysis; Purple = Mathematical & QSPR Modeling; Green = Informatics & Web Services.
Diagram labels: Raw Samples; Extracted Samples; Raw Features; “Molecular Features”; Processed Features; Prioritized Features; Predicted Formulas; Candidate Structures; DSSTox Chemical Database; Matched Formulas; Mapped Structures; Prioritized Structures (using ToxPi); Confirmed Structures (using ToxCast standards); Sorted Structures; Predicted Retention Times; Predicted Mass Spectra; Predicted/Observed Functional Use; Predicted/Observed Media Occurrence; Methodological Concordance; Predicted Concentrations; Top Candidate Structure(s).]

1 CompTox Chemicals Dashboard https://comptox.epa.gov/dashboard

875k Chemical Substances

2 BASIC Search

3 Detailed Chemical Pages

4 Access to Chemical Hazard Data

5 In Vitro Bioassay Screening ToxCast and Tox21

6 Sources of Exposure to Chemicals

7 Link Access

8 NIST WebBook https://webbook.nist.gov/chemistry/

9 MassBank of North America https://mona.fiehnlab.ucdavis.edu

10 m/z CLOUD https://www.mzcloud.org/

11 DO WE REALLY NEED ANOTHER DATABASE?

12 Is a bigger database better?

• ChemSpider was 26 million chemicals then
• Much BIGGER today
• Is bigger better??

13 Comparing Search Performance

• Dashboard content was 720k chemicals
• Only 3% of ChemSpider size
• What was the comparison in performance?

14 SAME dataset for comparison

15 How did performance compare?

16 Data Quality is important

• Data quality in free web-based databases!

17 Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search

18 Comparing ChemSpider Structures

19 Comparing ChemSpider Structures

20 Other Searches

21 Delivering a Better Database

• An ideal database would provide:
  – Curated CAS Number-Name mappings with “correct” chemical structures
• We have full-time curators checking data

22 MASS AND FORMULA SEARCHING (and metadata ranking)

23 Advanced Searches Mass and Formula Based Search

24 Advanced Searches Mass and Formula Based Search

25 Using Metadata for Ranking

• Use available metadata to rank candidates
  – Associated data sources
    • Associated lists in DSSTox database
    • Associated sources in PubChem
    • Specific types (e.g. water, surfactants, pesticides etc.)
  – Number of associated PubMed articles
  – Number of products/categories containing the chemical
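The ranking idea above can be sketched in a few lines of Python. This is a minimal illustration, not the Dashboard's implementation: the `Candidate` class, its field names, the sort order of the keys, and all counts are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate record; field names are assumptions."""
    name: str
    data_sources: int      # number of associated data sources/lists
    pubmed_articles: int   # number of associated PubMed articles
    products: int          # number of products/categories containing the chemical


def rank_candidates(candidates):
    """Sort candidates so the most widely referenced chemical comes first.

    Data source count is the primary key; PubMed and product counts
    break ties. This key order is an illustrative choice.
    """
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_articles, c.products),
        reverse=True,
    )


# Invented candidates that share one formula or mass.
candidates = [
    Candidate("candidate A", data_sources=3, pubmed_articles=0, products=0),
    Candidate("candidate B", data_sources=41, pubmed_articles=120, products=15),
    Candidate("candidate C", data_sources=3, pubmed_articles=7, products=1),
]

ranked = rank_candidates(candidates)
for position, c in enumerate(ranked, start=1):
    print(position, c.name, c.data_sources)
```

A weighted score combining the counts would be an equally plausible design; the tuple sort is used here only because it is easy to read.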

26 Metadata rank ordering

27 SPECIFIC APPLICATIONS TO MASS SPEC.

28 Mass Spec Focused Applications

29 “MS-Ready Structures” https://doi.org/10.1186/s13321-018-0299-2

30 31 MS-Ready Mappings

32 MS-Ready Mappings Set

33 Advanced Searches Mass Search

34 Advanced Searches Mass Search

35 MS-Ready Structures for Formula Search

36 MS-Ready Mappings

• EXACT Formula: C10H16N2O8: 3 Hits

37 MS-Ready Mappings

• Same Input Formula: C10H16N2O8
• MS Ready Formula Search: 125 Chemicals

38 MS-Ready Mappings

• 125 chemicals returned in total
  – 8 of the 125 are single component chemicals
  – 3 of the 8 are isotope-labeled
  – 3 are neutral compounds and 2 are charged
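The difference between an exact formula search and an MS-Ready formula search can be illustrated with a small sketch. The record layout, function names, and the four records below are assumptions made for the example; they are not the DSSTox schema, and the hit counts do not reproduce the 3 and 125 hits reported on the slides.

```python
from collections import defaultdict

# Illustrative records: (substance name, formula as registered, MS-Ready formula).
# The MS-Ready formula is assumed here to be the desalted, neutralized form
# that the mass spectrometer would be expected to observe.
records = [
    ("EDTA", "C10H16N2O8", "C10H16N2O8"),
    ("EDTA disodium salt", "C10H14N2Na2O8", "C10H16N2O8"),
    ("EDTA tetrasodium salt", "C10H12N2Na4O8", "C10H16N2O8"),
    ("Caffeine", "C8H10N4O2", "C8H10N4O2"),
]


def exact_formula_search(formula):
    """Return substances whose registered formula matches exactly."""
    return [name for name, registered, _ in records if registered == formula]


# Index every substance under its MS-Ready formula.
by_ms_ready = defaultdict(list)
for name, _, ms_ready in records:
    by_ms_ready[ms_ready].append(name)


def ms_ready_formula_search(formula):
    """Return every substance mapped to the given MS-Ready formula."""
    return list(by_ms_ready.get(formula, []))


print(exact_formula_search("C10H16N2O8"))
print(ms_ready_formula_search("C10H16N2O8"))
```

In this toy data the exact search finds only the free acid, while the MS-Ready search also returns the salts that map to the same observable form.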

39 Batch Searching

• Singleton searches are useful but we work with thousands of masses and formulae!

• Typical questions
  – What is the list of chemicals for the formula CxHyOz?
  – What is the list of chemicals for a mass +/- error?
  – Can I get chemical lists in Excel files? In SDF files?
  – Can I include properties in the download file?
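The "mass +/- error" question, applied to a batch of masses, might look like the sketch below. It assumes a small local table of monoisotopic masses and a tolerance expressed in ppm; the table contents (rounded values), the function names, and the 5 ppm default are assumptions for illustration and do not describe the Dashboard's batch search service.

```python
import bisect

# Hypothetical local table of (monoisotopic mass, name), sorted by mass.
# Masses are rounded and included only for illustration.
chemicals = sorted([
    (194.0804, "caffeine"),
    (292.0907, "EDTA"),
    (151.0633, "acetaminophen"),
])
masses = [mass for mass, _ in chemicals]


def search_mass(query_mass, ppm=5.0):
    """Return names whose mass lies within +/- ppm of query_mass."""
    tolerance = query_mass * ppm / 1e6
    lo = bisect.bisect_left(masses, query_mass - tolerance)
    hi = bisect.bisect_right(masses, query_mass + tolerance)
    return [name for _, name in chemicals[lo:hi]]


def batch_search(query_masses, ppm=5.0):
    """Run search_mass for each query and collect results by query mass."""
    return {mass: search_mass(mass, ppm) for mass in query_masses}


results = batch_search([194.0803, 292.0910, 300.0000])
for mass, names in results.items():
    print(mass, names)
```

Keeping the table sorted lets each query use a binary search, which matters once the batch holds thousands of masses.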

40 Batch Searching Formula/Mass

41 Searching batches using MS-Ready Formula (or mass) searching

42 RELATED APPLICATIONS OF INTEREST TO MASS SPEC.

43 Find me “related structures” Formula-Based Search

44 Select Chemicals of Interest

45 Find me “related structures” Based on Structure Similarity

46 Find me “related structures” Based on Structure Similarity

47 Find me “related structures” Structure Similarity – sort on mass
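A similarity search followed by a sort on mass can be sketched as follows. The slides do not state which fingerprint or similarity metric the Dashboard uses, so this example assumes a Tanimoto coefficient over sets of feature identifiers; the fingerprints, masses, names, and the 0.7 threshold are all invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of feature ids."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical query fingerprint and library of (name, mass, fingerprint).
query_fp = {1, 4, 7, 9}
library = [
    ("analog 1", 250.1, {1, 4, 7, 9, 12}),
    ("analog 2", 236.1, {1, 4, 7}),
    ("unrelated", 180.0, {2, 3}),
]

# Score every library entry against the query.
hits = [(name, mass, tanimoto(query_fp, fp)) for name, mass, fp in library]

# Keep entries above an assumed similarity threshold, then sort on mass.
similar = [hit for hit in hits if hit[2] >= 0.7]
similar_by_mass = sorted(similar, key=lambda hit: hit[1])

for name, mass, score in similar_by_mass:
    print(name, mass, round(score, 2))
```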

48 Literature Searching

49 Literature Searching

50 Literature Searching

51 FOCUSED CHEMICAL LISTS OF INTEREST

52 Chemical Lists

53 EPAHFR: Hydraulic Fracturing

54 PFAS lists of Chemicals

55 COMPLEX CHEMICAL SUBSTANCES

56 UVCB Chemicals

57 Many Hydraulic Fracturing Chemicals are “Complex”

58 “Markush Structures” https://en.wikipedia.org/wiki/Markush_structure

59 WORK IN PROGRESS

60 Suspect Screening and Non-Targeted Analysis Workflow

[Figure: the suspect screening and non-targeted analysis workflow diagram from the opening slide, shown again with the same color key and labels.]

61 Work in Progress

• Predicted Spectra for candidate ranking
  – Viewing and Downloading pre-predicted spectra
  – Search spectra against the database

62 Predicted Mass Spectra http://cfmid.wishartlab.com/

• MS/MS spectra prediction for ESI+, ESI-, and EI
• Predictions generated and stored for >800,000 structures, to be accessible via Dashboard

63 Search Expt. vs. Predicted Spectra

64 Search Expt. vs. Predicted Spectra

65 Spectral Viewer Comparison
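Comparing an experimental spectrum against predicted spectra can be sketched with a binned cosine similarity. This is a generic scoring approach chosen for illustration; the slides do not say which score the Dashboard or CFM-ID uses. The peak lists, candidate names, and the 0.01 m/z bin width are assumptions.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two peak lists after binning."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented experimental spectrum and predicted spectra for two candidates.
experimental = [(110.07, 40.0), (138.07, 100.0), (195.09, 60.0)]
predicted = {
    "candidate A": [(110.07, 35.0), (138.07, 100.0), (195.09, 50.0)],
    "candidate B": [(91.05, 100.0), (119.05, 20.0)],
}

# Score each candidate and pick the best match.
scores = {name: cosine_similarity(experimental, peaks) for name, peaks in predicted.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Fixed-width binning is simple but can split a peak that falls near a bin edge; a tolerance-based peak matcher would avoid that at the cost of more code.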

66 Work in Progress

• Predicted Spectra for candidate ranking
  – Viewing and Downloading pre-predicted spectra
  – Search spectra against the database
• Retention Time Index Prediction

67 Retention Time Prediction for Ranking

68 Moving to Relative Retention Times

69 Work in Progress

• Predicted Spectra for candidate ranking
  – Viewing and Downloading pre-predicted spectra
  – Search spectra against the database
• Retention Time Index Prediction
• Structure/substructure/similarity search

70 Prototype Development

71 Prototype Development

72 Work in Progress

• Predicted Spectra for candidate ranking
  – Viewing and Downloading pre-predicted spectra
  – Search spectra against the database
• Retention Time Index Prediction
• Structure/substructure/similarity search
• Access to API and web services for programmatic access

73 API services and Open Data

• Groups waiting on our API and web services
• Mass Spec companies instrument integration
• Release will be in iterations but for now our data are available

74 SIDE EFFECTS OF SHARING OPEN DATA

75 NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236

76 Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2

77 Conclusion

• Dashboard access to data for ~875,000 chemicals
• MS-Ready data facilitates structure identification
• Related metadata facilitates candidate ranking
• Relationship mappings and chemical lists are of great utility
• Dashboard and contents are one part of the solution
• We are committed to developing an open API over time

78 Acknowledgements

• THANK YOU for the invitation!
• IT Development team – especially Jeff Edwards and Jeremy Dunne
• Chris Grulke for the ChemReg system
• NERL colleagues – Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton
• Emma Schymanski, LCSB, Luxembourg
• The NORMAN Network and all contributors

79 Contact

Antony Williams
US EPA Office of Research and Development
National Center for Computational Toxicology
EMAIL: [email protected]
ORCID: https://orcid.org/0000-0002-2668-4821
