Enhancing Systems Biology Models Through Semantic Data Integration
Total Page:16
File Type:pdf, Size:1020Kb
School of Computing Science Enhancing Systems Biology Models Through Semantic Data Integration Allyson L. Lister Submitted for the degree of Doctor of Philosophy in the School of Computing Science, Newcastle University December 2011 c 2011, Allyson L. Lister Abstract Studying and modelling biology at a systems level requires a large amount of data of different ex- perimental types. Historically, each of these types is stored in its own distinct format, with its own internal structure for holding the data produced by those experiments. While the use of community data standards can reduce the need for specialised, independent formats by providing a common syn- tax, standards uptake is not universal and a single standard cannot yet describe all biological data. In the work described in this thesis, a variety of integrative methods have been developed to reuse and restructure already extant systems biology data. SyMBA is a simple Web interface which stores experimental metadata in a published, common format. The creation of accurate quantitative SBML models is a time-intensive manual process. Modellers need to understand both the systems they are modelling and the intricacies of the SBML format. However, the amount of relevant data for even a relatively small and well-scoped model can be overwhelming. Saint is a Web application which accesses a number of external Web services and which provides suggested annotation for SBML and CellML models. MFO was developed to for- malise all of the knowledge within the multiple SBML specification documents in a manner which is both human and computationally accessible. Rule-based mediation, a form of semantic data inte- gration, is a useful way of reusing and re-purposing heterogeneous datasets which cannot, or are not, structured according to a common standard. This method of ontology-based integration is generic and can be used in any context, but has been implemented specifically to integrate systems biology data and to enrich systems biology models through the creation of new biological annotations. The work described in this thesis is one step towards the formalisation of biological knowledge useful to systems biology. Experimental metadata has been transformed into common structures, a Web application has been created for the retrieval of data appropriate to the annotation of systems biology models and multiple data models have been formalised and made accessible to semantic integration techniques. i This thesis is dedicated to my parents, who encouraged me in all things, and is particularly dedi- cated to my husband and my son, without whose patience and support I could not have finished this research. “Among those who have endeavoured to promote learning and rectify judgement, it has long been customary to complain of the abuse of words, which are often admitted to signify things so different, that, instead of assisting the understanding as vehicles of knowledge, they produce error, dissension, and perplexity....” Dr. Samuel Johnson, 1752, via Nature Structural & Molecular Biology 14, 681 (2007). “Metadata is a love note to the future.” (NYPL Labs, http://twitpic.com/ 6ry6ar, via @CameronNeylon @kissane) Acknowledgements Many thanks go to my supervisors Dr. Anil Wipat, Dr. Phillip Lord and Dr. Matthew Pocock. A special thanks goes to those people who found extra commas and other errors: Phoebe Boatright, Dr. Dagmar Waltemath, Lucy Slattery, Mélanie Courtot and Dr. Paul Williams, Jr. Past and present colleagues at Newcastle University have provided much support and inspiration, including Dr. Frank Gibson and Dr. Katherine James. I gratefully acknowledge the support of the BBSRC and the EPSRC for funding CISBAN at New- castle University. I also acknowledge the support of the Newcastle University Systems Biology Resource Centre, the Newcastle University Bioinformatics Support Unit and the North East regional e-Science centre. iii Declaration I declare that the following work embodies the result of my own work, that it has been composed by myself and does not include work forming part of a thesis presented successfully for a degree in this or another University. Allyson Lister iv During the course of this work, I have been involved both in the development of a number of standards efforts and in the publishing of a number of papers. The list below describes the standards efforts to which I have contributed and the roles I have had within those efforts: • a developer of UniProt/TrEMBL [1]; • a program developer and contributor to the Functional Genomics Experiment (FuGE)[2,3,4], a standard XML syntax for describing experiments; • a core developer and coordinator of the Ontology of Biomedical Investigations (OBI)[5,6,7], a standard semantics for describing experiments; • an early developer of the ISA-TAB [8] tab-delimited syntax for describing experiments; • a developer of the minimal information checklist MIGS/MIMS for genomics and metage- nomics information [9]; • an advisor for the Systems Biology Ontology (SBO), the Kinetic Simulation Algorithm Ontology (KiSAO) and the TErminology for the Description of DYnamics (TEDDY); • a co-author of the Systems Biology Markup Language (SBML) Level 3 Annotation pack- age [10]; • an advisor for the Cell Behavior Ontology1; and • an advisor in the nascent synthetic biology standards2. I have co-authored 20 papers, specification documents, articles and technical reports during the course of this work: 1. Mélanie Courtot, Nick Juty, Christian Knupfer, Dagmar Waltemath, Anna Zhukova, Andreas Drager, Michel Dumontier, Andrew Finney, Martin Golebiewski, Janna Hastings, Stefan Hoops, Sarah Keating, Douglas B. Kell, Samuel Kerrien, James Lawson, Allyson Lister, James Lu, Rainer Machne, Pedro Mendes, Matthew Pocock, Nicolas Rodriguez, Alice Villeger, Darren J. Wilkinson, Sarala Wimalaratne, Camille Laibe, Michael Hucka, and Nicolas Le Novère. Controlled vocabularies and semantics in systems biology. Molecular Systems Biology, 7(1), October 2011. 2. Stephen G. Addinall, Eva-Maria Holstein, Conor Lawless, Min Yu, Kaye Chapman, A. Peter Banks, Hien-Ping Ngo, Laura Maringele, Morgan Taschuk, Alexander Young, Adam Ciesi- olka, Allyson L. Lister, Anil Wipat, Darren J. Wilkinson, and David Lydall. Quantitative fit- ness analysis shows that NMD proteins and many other protein complexes suppress or enhance distinct telomere cap defects. PLoS Genet, 7(4):e1001362+, April 2011. 1http://cbo.biocomplexity.indiana.edu 2http://www.sbolstandard.org/ v 3. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1):23–33, January 2011. 4. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML mod- els through rule-based semantic integration. Journal of biomedical semantics, 1 Suppl 1(Suppl 1):S3+, 2010. 5. Andrew R. Jones and Allyson L. Lister. Managing experimental data using FuGE. Methods in molecular biology (Clifton, N.J.), 604:333–343, 2010. 6. Allyson L. Lister. Semantic integration in the life sciences. Ontogenesis, http://ontogenesis.knowledgeblog.org/126. January 2010. 7. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bet- tina Roth, and Reinhard Schneider. Live coverage of intelligent systems for molecular Biol- ogy/European conference on computational biology (ISMB/ECCB) 2009. PLoS Comput Biol, 6(1):e1000640+, January 2010. 8. Allyson L. Lister, Ruchira S. Datta, Oliver Hofmann, Roland Krause, Michael Kuhn, Bettina Roth, and Reinhard Schneider. Live coverage of scientific conferences using Web technolo- gies. PLoS Comput Biol, 6(1):e1000563+, January 2010. 9. Allyson Lister, Varodom Charoensawan, Subhajyoti De, Katherine James, Sarath Chandra C. Janga, and Julian Huppert. Interfacing systems biology and synthetic biology. Genome biology, 10(6):309+, 2009. 10. Allyson L. Lister, Matthew Pocock, Morgan Taschuk, and Anil Wipat. Saint: a lightweight integration environment for model annotation. Bioinformatics, 25(22):3026–3027, November 2009. 11. Mélanie Courtot, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan R. Brinkman, and Alan Ruttenberg. MIREOT: the minimum information to reference an exter- nal ontology term. In Barry Smith, editor, International Conference on Biomedical Ontology, pages 87–90. University at Buffalo College of Arts and Sciences, National Center for Onto- logical Research, National Center for Biomedical Ontology, July 2009. 12. Allyson L. Lister, Phillip Lord, Matthew Pocock, and Anil Wipat. Annotation of SBML models through Rule-Based semantic integration. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 49+, June 2009. 13. The OBI Consortium. Modeling biomedical experimental processes with OBI. In Phillip Lord, Susanna-Assunta Sansone, Nigam Shah, Susie Stephens, and Larisa Soldatova, editors, The 12th Annual Bio-Ontologies Meeting, ISMB 2009, pages 41+, June 2009. 14. Andrew R. Jones, Allyson L. Lister, Leandro Hermida, Peter Wilkinson, Martin Eisenacher, Khalid Belhajjame, Frank Gibson, Phil Lord, Matthew Pocock, Heiko Rosenfelder, Javier Santoyo-Lopez, Anil Wipat, and Norman W. W. Paton. Modeling and managing experimental data using FuGE. OMICS: A Journal of Integrative Biology, 13(3):239–251, June 2009. 15. Mélanie Courtot, William Bug, Frank Gibson, Allyson L. Lister, James Malone, Daniel Schober, Ryan