Open Knowledge: Reproducibility in with Open Data, Open Source and Open Standards

Egon Willighagen

Bioclipse & Proteochemometric Group (Prof. Wikberg), Department of Pharmaceutical Biosciences, Uppsala University

2009-08-31

The Setting...

Problem
Solution
Results
Discussions
Conclusion

1998: Organic chemistry... beautiful science! But ... why, how, what, ...

PJJA Buijnsters et al., Eur.J.Org.Chem, 2002, 1397–1406

2009-08-31 Bioclipse & Proteochemometric Group - Egon Willighagen | chem-bla-ics.blogspot.com

Reliable Knowledge: Trust


How to build Trust
• track record
• transparency: citation
• reproducibility: details

Open {Data|Standards|Source|...}

Knowledge Representation...

What are the organic normal conditions?

The Problem: Reproducibility...

Where reproducibility is severely hampered:
• recalculate basic atom and bond properties
• access to QSAR/QSPR data
• well-defined algorithms
• publications destroy information

Solutions...

Openness
• license that allows modification and redistribution
• hiding behind public domain is not helpful

Semantic Web
• be explicit in what you mean, both in facts and in algorithms

Reproducibility needs ODOSOS

Open Data
• No Intellectual Monopoly

Open Source
• algorithms are complex
• implementations even more
• strong interaction with representation

Open Standards
• Semantic Web formats
• unique identifiers

http://en.wikipedia.org/wiki/Glyn_Moody

Jmol

Started in 1997 by Dan Gezelter (Notre Dame)

Leaders: Bradly Smith, me, Miguel Howard, Bob Hanson

E.L. Willighagen, M. Howard, Nature Precedings, 2005

http://www.jmol.org/

The Chemistry Development Kit

A Family of Projects
• CDK-Taverna (chemoinformatics workflows)
• JChemPaint (semantic 2D editor)
• ChemoJava (GPL-ed extension)

Goals
• library of cheminformatics algorithms
• educational

Usage
• CDK 2003: 75+ times cited in literature
• Bioclipse, KNIME, Jumbo (CML), AMBIT, ...

C. Steinbeck et al., J.Chem.Inf.Comput.Sci, 2003

C. Steinbeck et al., Curr.Pharm.Design, 2006

CDK: an Open Project

Features
• open mailing list and bug tracker
• open source repository
• release soon, release often

Offer Review
• senior developers review patches

Bioclipse

O. Spjuth et al., BMC Bioinformatics, 2007, 8:59

Integration

Services
• databases: PubChem
• web services
• Google Spreadsheets
• MyExperiment.org: Bioclipse Scripting Language
• Twitter, ...
• journals, ...

Techniques
• SOAP, REST, XMPP, ...
• Resource Description Framework
• dedicated APIs

MyExperiment: Bioclipse Scripting Language

XMPP

XMPP
• Jabber protocol
• Alternative to HTTP

Features
• Asynchronous
• XML-based: improved semantics

J. Wagener et al., BMC Bioinformatics, 2009, in production

Resource Description Framework

Facts as Triples
• subject
• predicate (relation)
• object

Examples
• wp:Benzene chem:hasSMILES "c1ccccc1"
• wp:Benzene owl:sameAs chemspider:123
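The subject, predicate, object structure can be illustrated with a short, library-free Python sketch. It stores the two example facts as plain tuples and looks them up by pattern. The prefixed names are copied from the examples; the `match` helper and the `TRIPLES` list are hypothetical names introduced only for illustration, and a real application would more likely use an RDF library and full URIs.

```python
# Minimal sketch: RDF-style facts as (subject, predicate, object) tuples.
# Prefixed names come from the slide examples; match() and TRIPLES are
# hypothetical helpers, not part of any RDF library.
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

TRIPLES: List[Triple] = [
    ("wp:Benzene", "chem:hasSMILES", '"c1ccccc1"'),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


if __name__ == "__main__":
    # Ask: which SMILES is recorded for wp:Benzene?
    for s, p, o in match(TRIPLES, subject="wp:Benzene", predicate="chem:hasSMILES"):
        print(s, p, o)
```

Because every fact has the same three-part shape, one generic lookup works for any question about the data, which is part of what makes the representation explicit.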

OpenMolecules RDF

http://rdf.openmolecules.net/

Guha et al., J.Chem.Inf.Model., 2006

Which License?

Choice
• GPL v2 or v3, LGPL v2 or v3, Apache, BSD, MIT, ...
• FDL, CC0, PDDL

Important: redistribution, modification

Bad Practice
• not explicitly stating your intentions
• Public Domain

Mixing Data?

License Incompatibility
• Ask about the copyright holder's intention!

Use Open Standard Interfaces
• Resource Description Framework
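As a rough illustration of why a shared triple format helps when mixing data, the Python sketch below merges facts from two sources and follows one hop of owl:sameAs links to collect everything known about a resource. The two source sets, the `chem:hasFormula` predicate, and the `facts_about` helper are assumptions made for this example; it does not address the licensing question, which still has to be settled with the copyright holders.

```python
# Sketch under stated assumptions: two hypothetical sources that both
# publish (subject, predicate, object) triples. chem:hasFormula and
# facts_about() are illustrative names, not an existing vocabulary or API.

SOURCE_A = {
    ("wp:Benzene", "chem:hasSMILES", '"c1ccccc1"'),
    ("wp:Benzene", "owl:sameAs", "chemspider:123"),
}

SOURCE_B = {
    ("chemspider:123", "chem:hasFormula", '"C6H6"'),
}


def facts_about(resource, *sources):
    """Merge the sources and return facts about the resource and its aliases."""
    merged = set().union(*sources)

    # Collect aliases by following owl:sameAs in either direction (one hop only).
    aliases = {resource}
    for s, p, o in merged:
        if p == "owl:sameAs":
            if s == resource:
                aliases.add(o)
            elif o == resource:
                aliases.add(s)

    # Return the non-linking facts, sorted for a stable order.
    return sorted(t for t in merged if t[0] in aliases and t[1] != "owl:sameAs")


if __name__ == "__main__":
    for triple in facts_about("wp:Benzene", SOURCE_A, SOURCE_B):
        print(*triple)
```

The merge itself is a plain set union: no source-specific parsing or schema mapping is needed once both sides use the same triple structure and identifiers.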

Conclusions

No Intellectual Monopoly

Achieved: Jmol, CDK, JChemPaint, Bioclipse
• A huge success!

Open Data in chemistry is still way behind
• Open Access trap
• Public Domain trap

Semantics is showing up
• in RDF
• in Publishing

The Details

http://www.citeulike.org/user/egonw/tag/papers

http://chem-bla-ics.blogspot.com

mailto: [email protected]
