The Pennsylvania State University

The Graduate School
The Huck Institutes of the Life Sciences

FORMAL METHODS FOR GENOMIC DATA INTEGRATION

A Thesis in Integrative Biosciences
by
Nigam Shah

© 2005 Nigam Shah

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2005

The thesis of Nigam Shah was reviewed and approved* by the following:

Nina V. Fedoroff
Willaman Professor of Life Sciences and Evan Pugh Professor
Acting Co-Director, Integrative Biosciences Graduate Program
Huck Institutes of the Life Sciences
Thesis Advisor
Chair of Committee

Mark D. Shriver
Associate Professor of Anthropology and Genetics

Wojciech Makalowski
Associate Professor of Biology

Francesca Chiaromonte
Associate Professor of Statistics and Health Evaluation Sciences

Gustavo A. Stolovitzky
Manager, Functional Genomics & Systems Biology
IBM T.J. Watson Research Center
Special Member

*Signatures are on file in the Graduate School

ABSTRACT

The rapid growth of life sciences research and the associated literature over the past decade, the rapid expansion of biological databases, and the invention of high-throughput techniques that permit collection of data on many genes and proteins simultaneously have created an acute need for new computational tools to support the biologist in collecting, evaluating and integrating large amounts of information of many disparate kinds. This thesis presents methods for the representation, manipulation and conceptual integration of diverse biological data with prior biological knowledge to facilitate both the interpretation of data and the evaluation of hypotheses. We have developed a tool (called CLENCH) that assists in the interpretation of gene lists resulting from microarray data analysis by integrating and visualizing Gene Ontology (GO) annotations and transcription factor binding site information with gene expression data.
During the development of CLENCH, it became evident that developing a unified framework for representing prior knowledge and information can increase our ability to integrate new data with existing knowledge. In subsequent work, we developed the HyBrow (Hypothesis Browser) system as a prototype tool for designing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a conceptual framework with the ability to represent diverse biological information types, an ontology for describing biological processes at different levels of detail, a database to query information in the ontology, and programs to design, evaluate and revise hypotheses. We demonstrate the HyBrow prototype using the galactose gene network in Saccharomyces cerevisiae as a test system.

Along with the increase in available information, knowledgebases, which provide structured descriptions of biological processes, are proliferating rapidly. To support computer-aided information integration tools like HyBrow, a knowledgebase should be trustworthy and should structure information in a sufficiently expressive manner to represent biological systems at multiple scales. We extend and adapt the conceptual framework underlying HyBrow and use it to verify the trustworthiness and usefulness of the Reactome knowledgebase.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
Chapter 1: Introduction
Chapter 2: Managing and interpreting large scale gene expression data
  Managing high volume microarray data
  Using the Gene Ontology for interpreting microarray expression datasets
  Signaling pathways as an organizing framework for expression data
  Summary
Chapter 3: Towards a unified formal representation for genomics data
  Challenges for developing a unified formal representation
  Description of relevant related efforts
Chapter 4: A novel conceptual framework
  Extensions to the conceptual framework
  Comparison with other conceptual frameworks
Chapter 5: Prototype implementation of HyBrow
  Hypothesis ontology
  Inference rules and constraints
  Database and information gathering
  User interfaces
  The hypothesis evaluation process
  Test runs with sample hypotheses
Chapter 6: Lessons learned from the prototype
  Revision of the hypothesis ontology
  Bottleneck for structuring data and role of knowledgebases
Chapter 7: Comparison with related efforts
  The Riboweb project
  Modeling biological processes as workflows
  Pathway logic
  Summary
  Role of Knowledgebases
Chapter 8: Proofreading the Reactome knowledgebase
  Background
  Methods
  Results
  Summary
Chapter 9: Summary and Future directions
  Future directions
References
Appendix A – Formal specification of the hypothesis grammar
Appendix B – Using the GUI

LIST OF FIGURES

Figure 1: Flow-chart showing the microarray data preprocessing pipeline.
Figure 2: The types of plots that can be made by ProcessGprfile.pl.
Figure 3: Visualizing the expression, annotation and TF binding site data.
Figure 4: Directed acyclic graph showing the relationships among GO categories.
Figure 5: A sample row from the CLENCH result table.
Figure 6: Components of a formal representation.
Figure 7: Examples of different types of ontology specifications.
Figure 8: An overview of the ontology.
Figure 9: Outline of the binds to promoter rule.
Figure 10: Screen shots of the visual and widget interfaces.
Figure 11: The hypothesis evaluation process.
Figure 12: Screen shot of the result page.
Figure 13: Properties of agents in the revised ontology.

LIST OF TABLES

Table 1: A comparison of the properties of different conceptual frameworks
Table 2: Numbers of Well-Formed Pathways
Table 3: Property comparison for the latest releases of Reactome

ACKNOWLEDGEMENTS

First of all I would like to acknowledge my advisor, Nina Fedoroff, for her mentoring and support throughout my graduate studies. She has had the most profound influence on the way I think (and write!) about science and my approach to research in general. I feel privileged to have studied under a scientist of her stature. I would also like to acknowledge Stephen Racunas, my colleague and a very dear friend, for making my graduate studies at Penn State a memorable and enriching experience. I feel honored to know and work with someone like him. I am also very grateful to Dilip Desai, a close family friend, who along with my parents (Haresh and Chaula Shah) has played a major role in shaping my personality and outlook towards life. Finally and most importantly, I am grateful to my wife Prachi for always being with me and for her unconditional love during the ups and downs of graduate life.

Chapter 1: Introduction

With the advent of high-throughput technologies, molecular biology is undergoing a revolution in terms of the amount and types of data available to the scientist. On the one hand there is an abundance of individual data types such as gene and protein sequences, gene expression data, protein structures, protein interactions and annotations. On the other hand there is a shortage of tools and methods that can handle this deluge of information and allow a biologist to draw meaningful inferences. A significant amount of time and energy is spent in merely locating and retrieving information rather than thinking about what that information means.
In this situation it becomes extremely difficult to integrate current knowledge about the relationships within biological systems and to formulate hypotheses about a large number of genes and proteins[1]. It also becomes difficult to determine whether hypotheses are consistent internally or with data, to refine inconsistent hypotheses and to understand the implications of complicated hypotheses[2]. Rectifying this situation requires tools that automate repetitive tasks and that apply formal methods to query and interpret the information at hand[3].

My thesis work is focused on developing methods for integrating large data sets with prior biological knowledge to facilitate their interpretation. My initial efforts were focused on interpreting results from microarray expression data using the Gene Ontology and known biological pathways. During this work, which is described in the next chapter, it became evident that explicitly structuring prior knowledge and formally representing current information facilitates the integration of new data with prior knowledge by increasing our ability to fit the new data into the big picture. Subsequently, in collaboration with Stephen Racunas, an engineering doctoral student, I developed a prototype system for integrating biological data and existing knowledge in an environment that supports the formulation and evaluation of alternative hypotheses about biological systems. This work is described beginning in Chapter 3.

Chapter 2: Managing and interpreting large scale gene expression data

Microarray technology is a high-throughput method of measuring the expression levels of thousands of genes in parallel. It is also the most widely used of the several high-throughput technologies for collecting data on the levels of biological entities such as mRNA and proteins in cells.
My efforts to manage genomic data were focused on preprocessing, analyzing and interpreting microarray gene expression data. I developed programs for rapid preprocessing of raw microarray data and interpreting gene-groups that result from analyzing those data. While developing these methods for interpreting microarray data,
