
ChemSpider – Is This The Future of Linked Chemistry on the Internet?

Antony Williams
BAGIM, Boston, August 2010

Our dog has fleas
It’s not an Advantage…

What is the structure of “Advantage”?

. Audience Participation Time…

. Where would you look?
. What would you trust?
. Where would you look ONLINE?

What is the Structure of Vitamin K? MeSH

. A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 () derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K.

What is the Structure of Vitamin K1? Wikipedia

What is the Structure of Vitamin K1? CAS’s Common Chemistry

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)-1,4-dione”

. Variants of systematic names on PubChem

. 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
. 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
. 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
. 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
. 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
. 2-methyl-3-[(E)-3,7,11,15-tetramethyl
. 2-methyl-3-(3,7,11,15-tetramethyl
. 2-methyl-3-[(E)-3,7,11,15-tetramethyl

Bioassay Data are Associated…

Structures on DailyMed

Lack of Stereochemistry

Does Stereochemistry Matter?

Does one stereocenter matter?

. Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, Softenon,

Incorrect Structures

Wow!

ChEBI – Manual Curation

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

PubChem InChIKeys

. MBWXNTAXLNYFJB-NKFFZRIASA-N
. MBWXNTAXLNYFJB-LKUDQCMESA-N
. MBWXNTAXLNYFJB-UHFFFAOYSA-N
. MBWXNTAXLNYFJB-FAKCLFGASA-N
. MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
. MBWXNTAXLNYFJB-ODDKJFTJSA-N
. MBWXNTAXLNYFJB-KSVLJPARSA-N
. MBWXNTAXLNYFJB-UDCSOKOMSA-N
. MBWXNTAXLNYFJB-JHBCSKSVSA-N
. MBWXNTAXLNYFJB-JXAKDHTRSA-N
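
A minimal Python sketch of why the ten PubChem keys listed above all begin with the same block. An InChIKey has three hyphen-separated parts: a 14-character hash of the connectivity (the skeleton), a 10-character block covering the remaining layers such as stereochemistry and isotopes (including a flag and a version character), and a single protonation character. The helper names used here (split_inchikey, group_by_skeleton, skeleton_search, full_search) are illustrative assumptions and are not part of ChemSpider, PubChem, or the InChI software. The sketch only compares strings; it does not generate InChIs from structures.

```python
from collections import defaultdict

# InChIKeys copied from the PubChem list on the slide.
KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]


def split_inchikey(key):
    """Split an InChIKey into (skeleton, detail, protonation) blocks."""
    parts = key.split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def group_by_skeleton(keys):
    """Group keys by their first block (same connectivity)."""
    groups = defaultdict(list)
    for key in keys:
        skeleton, _detail, _protonation = split_inchikey(key)
        groups[skeleton].append(key)
    return dict(groups)


def skeleton_search(query, keys):
    """Return every key that shares the query's first block."""
    skeleton = split_inchikey(query)[0]
    return [k for k in keys if split_inchikey(k)[0] == skeleton]


def full_search(query, keys):
    """Return only exact matches on the whole key."""
    return [k for k in keys if k == query]


if __name__ == "__main__":
    query = "MBWXNTAXLNYFJB-UHFFFAOYSA-N"
    print(len(group_by_skeleton(KEYS)), "distinct skeleton(s)")
    print(len(skeleton_search(query, KEYS)), "skeleton hits")
    print(len(full_search(query, KEYS)), "full hits")
```

Matching on the first block alone is the idea behind a skeleton search; matching the whole key is a full search. The second block UHFFFAOYSA is the value a standard InChIKey carries when the InChI has no stereo or isotopic layers, so that key stands for the structure drawn without stereochemistry.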

InChIs

. InChIs are proliferating across databases
. InChIs are increasingly used by publishers
. Single code base – no multi-flavored SMILES
. InChIs are “incomplete” but very useful…
  – Search the Internet

Full Skeleton Search: 104 Hits

Full Search: 4 Hits

Is this the structure of Vitamin K1?

Where is chemistry online?

. Encyclopedic articles (Wikipedia)
. Chemical vendor databases
. Metabolic pathway databases
. Property databases
. Patents with chemical structures
. Drug Discovery data
. Scientific publications
. Compound aggregators
. Blogs/Wikis and Open Notebook Science

Linked Data on the Web

Taken from: Rafael Sidis’ Blog

Where Would You Look? What Do You Trust?

Question Everything online: www.dhmo.org

It’s all on Wikipedia…

What’s ?

What’s Methane?

What ELSE is Methane???

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Feedback from Steve Ritter

. “As for where we source our structures, our primary source is the researcher and peer-reviewed papers, because many compounds are novel.
. …we always double check them against one or more primary sources, typically and SciFinder.
. Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”

. “As a rule, we at C&EN don’t use Wikipedia as a primary source for structures or chemical information, and I recommend that policy to anyone.”

. “It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”

A vision…

. Authoritative web-based source of standard, well-drawn structures
. With associated data – spectra, property data, ADME/Tox data, Bioassay data
. Linked to encyclopedic articles, publications, patents, MSDS/safety sheets
. Links to chemical vendors
. Links to property predictions

A Pragmatic Vision

“Build a Structure Centric Community”

. December 2006 – A hobby project initiated to connect chemistry on the web
. Integrate chemical structure data on the web
. Create a “structure-based hub” to information and data
. Provide access to structure-based “algorithms”
. Let chemists contribute their own data
. Allow the community to curate/correct data

What do humans want?

media.obsessable.com

As few interfaces as possible

www.chemspider.com

We’re Out to Answer Questions

. Questions a chemist might ask…
. What is the melting point of n-heptanol?
. What is the chemical structure of Xanax?
. Chemically, what is ?
. What are the stereocenters of ?
. Where can I find publications about xylene?
. What are the different trade names for ?
. What is the NMR spectrum of ?
. What are the safety handling issues for Blue?

Search for a Chemical…by name

Available Information…

. Linked to vendors, safety data, toxicity, metabolism

Search for a chemical…by structure

Substructure search coming…

Annotating, Cleaning and Growing...

. Almost 25 million chemicals from 400 diverse data sources

. “Diverse” data sources…
. Quality ranges from high through questionable to wrong
. Content ranges from rich records with Wikipedia links, YouTube videos and photographs to “Stub Records” containing “just a structure”

. All records can be further enhanced… 25 million compounds need annotation by the masses

Search “Vitamin H”

“Curate” Identifiers

. General curation activities
. Remove incorrect names
. Correct spellings
. Remove names with/without stereo compared to the structure
. Correct registry numbers and other numeric identifiers (Beilstein, EINECS etc)
. Add multilingual names
. Add alternative names

Crowdsourced “Annotations”

. Registered Users can add
. Descriptions/Syntheses/Commentaries
. Links to PubMed articles
. Links to articles via DOIs
. Add spectral data
. Add Crystallographic Information Files
. Add photos
. Add MP3 files
. Add Videos

Spectra Linked

Link off a structure in ChemSpider

. Chemical suppliers
. Other publications
. Analytical Data
. Related Reactions
. Wikipedia
. Patents
. “Everything”

Semantic Markup: Project Prospect

Success Depends on Dictionaries

Semantic Linking of Structures

. What would you want to link off a structure?
. Chemical suppliers
. Other publications
. Analytical Data
. Related Reactions
. Wikipedia
. Patents
. “Everything”

“Chemicalizing” Pages

ChemSpider SyntheticPages

ChemSpider Everywhere: What do computers want?

Web Services

ChemSpider Everywhere

. Linked from Wikipedia and many Public Databases

. Linked from Open Notebook Science sites

. Linked from Blogs using Structure/Spectra EMBED

. Integrated into structure drawing packages

. Integrated into software offerings from Thermo, Waters, Agilent, Bruker

Structure Database Lookup

Reaction Database Look-up

There will always be gaps...

. What ChemSpider does not deal with, yet...

. Materials
. Minerals
. Polymers
. Biological macromolecules

ChemSpider Tomorrow

. 6 months: >1.2M compounds/month
. 6 months: >800,000 new uniques
. 6 months: >60 new data sources added

. Continue the curation effort and keep cleaning
. Finish depositions – millions left to deposit
. Integrate RSC content – a massive archive!
. Integrate RSC publishing workflows and databases
. Enable the semantic web for chemistry – RDF was layered on last week

The Future of Linked Chemistry on the Internet?

. I can buy my wife a “methane ring” for Xmas
. There are more than 10 compounds called Vitamin K1 on PubChem…
. Most databases online cannot be annotated
. The public funds the generation of data that is then mis-associated, cannot be used for modeling, for reference, for…
. Low quality databases become authorities
. The community accepts the status quo

The PREFERABLE Future of Linked Chemistry on the Internet?

. Public compound databases federate to build a truly linked environment of validated data!
. Data validation needs are not ignored
. Publishers layer on information to make publications discoverable
. Public-Private databases can be linked
. proliferate
. RDF is everywhere
. Business models WILL change

Thank you

Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams