Open Knowledge: Reproducibility in Cheminformatics with Open Data, Open Source and Open Standards
Egon Willighagen
Bioclipse & Proteochemometric Group (Prof. Wikberg), Department of Pharmaceutical Biosciences, Uppsala University
2009-08-31

The Setting...
1998: Organic chemistry... beautiful science! But ... why, how, what, ...
PJJA Buijnsters et al., Eur.J.Org.Chem, 2002, 1397–1406
2009-08-31 Bioclipse & Proteochemometric Group - 2 - Egon Willighagen | chem-bla-ics.blogspot.com

Reliable Knowledge: Trust
How to build Trust
• track record
• transparency: citation
• reproducibility: details
Open {Data|Standards|Source|...}

Knowledge Representation...
What are the organic normal conditions?

The Problem: Reproducibility...
Where reproducibility is severely hampered:
• recalculate basic atom and bond properties
• access to QSAR/QSPR data
• well-defined algorithms
• publications destroy information

Solutions...
Openness
• license that allows modification and redistribution
• hiding behind public domain is not helpful
Semantic Web
• be explicit in what you mean
• both in facts and in algorithms

Reproducibility needs ODOSOS
Open Data
No Intellectual Monopoly
Open Source
• algorithms are complex
• implementations even more
• strong interaction with representation
Open Standards
• Semantic Web formats
• unique identifiers
http://en.wikipedia.org/wiki/Glyn_Moody

Jmol
• Started in 1997 by Dan Gezelter (Notre Dame)
• Leaders: Bradley Smith, me, Miguel Howard, Bob Hanson
E.L. Willighagen, M. Howard, Nature Precedings, 2005
http://www.jmol.org/

The Chemistry Development Kit
A Family of Projects
• CDK-Taverna (chemoinformatics workflows)
• JChemPaint (semantic 2D editor)
• ChemoJava (GPL-ed extension)
Goals
• library of cheminformatics algorithms
• educational
Usage
• CDK 2003: 75+ times cited in literature
• Bioclipse, KNIME, Jumbo (CML), AMBIT, ...
C. Steinbeck et al., J.Chem.Inf.Comput.Sci, 2003
C. Steinbeck et al., Curr.Pharm.Design, 2006

CDK: an Open Project
Features
• open mailing list and bug tracker
• open source repository
• release soon, release often
Offer
• Review: senior developers review patches

Bioclipse
O. Spjuth et al., BMC Bioinformatics 2007, 8:59

Integration
Services
• databases: PubChem
• web services
• Google Spreadsheets
• MyExperiment.org: Bioclipse Scripting Language
• Twitter, ...
• journals, ...
Techniques
• SOAP, REST, XMPP, ...
• Resource Description Framework
• dedicated APIs

MyExperiment: Bioclipse Scripting Language

XMPP
XMPP
• Jabber protocol
• Alternative to HTTP
Features
• Asynchronous
• XML-based: improved semantics
J. Wagener et al., BMC Bioinformatics, 2009, in production
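
A minimal sketch of what "asynchronous" and "XML-based" can mean in practice, using only Python's standard library. A request is an XML stanza that carries an id, and the reply is matched to its request by that id whenever it arrives. The helper names (build_iq_get, handle_incoming), the address service.example.org and the hand-written reply are illustrative assumptions, not taken from the talk or from any XMPP library; the disco#info namespace is the one defined in XEP-0030.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# Service discovery namespace from XEP-0030, used here only as a sample payload.
DISCO_INFO = "http://jabber.org/protocol/disco#info"

# Requests that were sent but not yet answered, keyed by stanza id.
pending: Dict[str, Callable[[ET.Element], None]] = {}


def build_iq_get(stanza_id: str, to: str,
                 on_result: Callable[[ET.Element], None]) -> str:
    """Serialize an <iq type="get"> stanza and remember who wants the reply."""
    iq = ET.Element("iq", {"type": "get", "id": stanza_id, "to": to})
    ET.SubElement(iq, "query", {"xmlns": DISCO_INFO})
    pending[stanza_id] = on_result
    return ET.tostring(iq, encoding="unicode")


def handle_incoming(xml_text: str) -> None:
    """Dispatch a reply to the callback registered under the same id."""
    stanza = ET.fromstring(xml_text)
    callback = pending.pop(stanza.get("id"), None)
    if callback is not None:
        callback(stanza)


results = []
# Hypothetical recipient; the caller does not block while waiting for a reply.
request = build_iq_get("q1", "service.example.org", results.append)

# Some time later a reply arrives (hand-written here for illustration).
handle_incoming('<iq type="result" id="q1" from="service.example.org"/>')
print(results[0].get("type"))
```

The sender never waits on the connection: the id is the only link between request and reply, which is what lets long-running calculations answer minutes later.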

Resource Description Framework
Facts as Triples
• subject
• predicate (relation)
• object
Examples
• wp:Benzene chem:hasSMILES "c1ccccc1"
• wp:Benzene owl:sameAs chemspider:123
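
The two example facts can be written down as plain (subject, predicate, object) tuples. The sketch below is a toy in-memory store with wildcard matching, assuming prefixed names are kept as plain strings; a real RDF store would expand them to full URIs and distinguish literals from resources. The function name match is a hypothetical helper, not part of any RDF library.

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# The two facts from the slide. The SMILES string is a literal value,
# the other terms are identifiers for resources and relations.
TRIPLES: Set[Triple] = {
    ("wp:Benzene", "chem:hasSMILES", "c1ccccc1"),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
}


def match(triples: Iterable[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples agreeing with every given field; None acts as a wildcard."""
    for s, p, o in sorted(triples):
        if subject in (None, s) and predicate in (None, p) and obj in (None, o):
            yield (s, p, o)


# "What is the SMILES of wp:Benzene?"
smiles = [o for _, _, o in match(TRIPLES, subject="wp:Benzene",
                                 predicate="chem:hasSMILES")]
print(smiles)

# "Everything known about wp:Benzene"
for triple in match(TRIPLES, subject="wp:Benzene"):
    print(triple)
```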

OpenMolecules RDF
http://rdf.openmolecules.net/

Blue Obelisk
R Guha et al., J.Chem.Inf.Model., 2006

Which License?
Choice
• GPL v2 or v3, LGPL v2 or v3, Apache, BSD, MIT, ...
• FDL, CC0, PDDL
• Important: redistribution, modification
Bad Practice
• not explicitly stating your intentions
• Public Domain

Mixing Data?
License Incompatibility
• Ask about the copyright holder's intention!
Use Open Standard Interfaces
• Resource Description Framework
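
One way to picture RDF as an interface for mixing data: each source publishes triples, and combining them is a set union in which every fact keeps a tag naming its source, so the terms of each source stay traceable. This is a sketch under stated assumptions. The source names, the ex:label predicate and the label fact are hypothetical, and blank nodes and full URIs are ignored.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # triple plus the name of its source

source_a: Set[Triple] = {
    ("wp:Benzene", "chem:hasSMILES", "c1ccccc1"),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
}
source_b: Set[Triple] = {
    ("chemspider:123", "ex:label", "benzene"),  # hypothetical fact
}


def merge(named_sources: Dict[str, Set[Triple]]) -> Set[Quad]:
    """Union all sources, tagging every triple with where it came from."""
    return {(s, p, o, name)
            for name, triples in named_sources.items()
            for s, p, o in triples}


def same_as_closure(quads: Set[Quad], start: str) -> Set[str]:
    """Resources linked to start by owl:sameAs, followed in both directions."""
    seen = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for s, p, o, _ in quads:
            if p != "owl:sameAs":
                continue
            for a, b in ((s, o), (o, s)):
                if a == node and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def describe(quads: Set[Quad], resource: str) -> List[Tuple[str, str, str]]:
    """Facts about a resource and its aliases, each with its source name."""
    aliases = same_as_closure(quads, resource)
    return sorted((p, o, src) for s, p, o, src in quads
                  if s in aliases and p != "owl:sameAs")


quads = merge({"source_a": source_a, "source_b": source_b})
facts = describe(quads, "wp:Benzene")
for fact in facts:
    print(fact)
```

Keeping the source name on every fact does not resolve a license incompatibility, but it makes it possible to see which facts fall under which terms before redistributing the merged set.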

Conclusions
No Intellectual Monopoly
Achieved
Jmol, CDK, JChemPaint, Bioclipse
• A huge success!
Open Data in chemistry is still way behind
• Open Access trap
• Public Domain trap
Semantics is showing up
• in RDF
• in Publishing

The Details
http://www.citeulike.org/user/egonw/tag/papers
http://chem-bla-ics.blogspot.com
mailto: [email protected]