Rich Probabilistic Models for Genomic Data
Total Page:16
File Type:pdf, Size:1020Kb
RICH PROBABILISTIC MODELS FOR GENOMIC DATA a dissertation submitted to the department of computer science and the committee on graduate studies of stanford university in partial fulfillment of the requirements for the degree of doctor of philosophy Eran Segal August 2004 c Copyright by Eran Segal 2004 All Rights Reserved ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Daphne Koller Department of Computer Science Stanford University (Principal Adviser) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Nir Friedman School of Computer Science and Engineering Hebrew University I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Stuart Kim Department of Developmental Biology Stanford University I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. David Haussler Department of Computer Science University of California Santa Cruz Approved for the University Committee on Graduate Studies. iii iv To Gal. v vi Abstract Genomic datasets, spanning many organisms and data types, are rapidly being pro- duced, creating new opportunities for understanding the molecular mechanisms un- derlying human disease, and for studying complex biological processes on a global scale. Transforming these immense amounts of data into biological information is a challenging task. In this thesis, we address this challenge by presenting a statisti- cal modeling language, based on Bayesian networks, for representing heterogeneous biological entities and modeling the mechanism by which they interact. We use sta- tistical learning approaches in order to learn the details of these models (structure and parameters) automatically from raw genomic data. The biological insights are then derived directly from the learned model. We describe three applications of this framework to the study of gene regulation: Understanding the process by which DNA patterns (motifs) in the control re- • gions of genes play a role in controlling their activity. Using only DNA sequence and gene expression data as input, these models recovered many of the known motifs in yeast and several known motif combinations in human. Finding regulatory modules and their actual regulator genes directly from gene • expression data. Some of the predictions from this analysis were tested success- fully in the wet-lab, suggesting regulatory roles for three previously uncharac- terized proteins. Combining gene expression profiles from several organisms for a more robust • prediction of gene function and regulatory pathways, and for studying the degree to which regulatory relationships have been conserved across evolution. vii viii Acknowledgments First and foremost I would like to thank my adviser, Daphne Koller, for her guidance and support throughout my years at Stanford. Of all the mentors I had, Daphne had the most profound impact on this thesis and on my academic development. I have learned tremendously from her brilliance, high standards and strive for excellence, including how to choose and pursue interesting research directions and how to present them both in papers and in talks. Of all the researchers I know, I cannot imagine a better choice than Daphne for an adviser. I feel priviliged to have had the opportunity to work closely and learn from her. Excluding Daphne, Nir Friedman, my unofficial co-adviser, had the greatest in- fluence on this thesis and on teaching me how to do research. Most importantly, Nir introduced me to and got me excited about the field of computational biology. Nir's contribution to the work presented in this thesis is fundamental, from the basic ideas of integrating heterogeneous genomic data into a single probabilistic model, through the development of the algorithms that learn these models automatically from data, and finally with ways to present, analyze, and visualize the results. I greatly enjoyed working with Nir and am indebted to him for his dedication and for the many research skills I learned from him. I would also like to thank Stuart Kim for taking me into his lab and having the patience to teach me biology and the biologists' way of thinking. Through the interaction with Stuart and the members of his lab, I learned how to think critically about biological results and how to select convincing controls to perform. I am very grateful to David Haussler, the fourth member of my reading committee, for insightful comments and motivating discussions throughout this past year. I also ix thank Matt Scott for agreeing to chair my Orals committee and for a stimulating and enjoyable collaboration on the fifth chapter of this thesis. The work in this thesis is the product of a collaboration with a number of re- searchers. In particular, I had the good fortune of collaborating with Aviv Regev, a truly phenomenal colleague. Aviv played a key role in the development of the module network framework presented in Chapter 4. This project would not have been pos- sible without her creative ideas, diligence, and thoroughness. My collaboration with Aviv has not only been the most fruitful one, but also a very enjoyable one on the personal level and I look forward to our continued working together. I would also like to thank David Botstein for his inspiring vision and for his belief in our methods. The collaboration with David resulted in the testing of our computational predictions in his lab and was a significant milestone in my Ph.d. I also thank Michael Shapira for carrying out these experiments. I am grateful to every one of my other co-authors over the last several years. Audrey Gasch, the first biologist who was brave enough to work with us. Ben Taskar, with whom I worked on the first project of this research direction. Yoseph Barash, who played a major role in developing the ideas and infrastructure of Chapter 3 and Roman Yelensky, for his dedication on continuing the work on this project. Dana Pe'er, for her friendship and numerous insightful discussions and working sessions that lead to the development of the module network framework. It was a great pleasure to work with Josh Stuart on the multi-species project. The intense hours we spent together for over a year were productive, fun, and lead to a great friendship. I would also like to thank other co-authors with whom I enjoyed working on projects that were not included in this thesis: Alexis Battle, Asa Ben-Hur, Doug Brutlag, Lise Getoor, Uri Lerner, Dirk Ormoneit, Saharon Rosset, Roded Sharan, and Haidong Wang. As part of the research on this thesis and to facilitate our collaboration with biologists, we developed GeneXPress, a statistical analysis and visualization software. GeneXPress has been a critical component of all the projects presented in this thesis and I would like to thank the many students with whom I worked and who contributed to the development of GeneXPress: David Blackman, Jonathan Efrat, Todd Fojo, x Jared Jacobs, Amit Kaushal, William Lu, Tuan Pham, Daniel Salinas, Mark Tong, Paul Wilkins, and Roman Yelensky. Still on the GeneXPress token, thanks to Nir Friedman for the initial motivation and for the many useful feedbacks along the way. Thanks also to Aviv Regev, our number one user, whose many comments and feedback greatly helped in shaping the tool. In the initial development of the code infrastructure for this thesis, I benefited greatly from the software infrastructure of PHROG (Probabilistic Reasoning and Optimizing Gadgets). I thank Lise Getoor, Uri Lerner and Ben Taskar for their work on PHROG and for sharing their code. I am proud to have been a member of Daphne's research group. I was greatly motivated and inspired by the productivity of this group and enjoyed the stimulating discussions during our weekly meetings. The high quality and impact of the work produced by this group is a testament to the extraordinary members of the group and to Daphne's vision and advising skills. I would like to thank: Pieter Abbeel, Drago Angelov, Alexis Battle, Gal Chechik, Lise Getoor, Carlos Guestrin, Uri Lerner, Uri Nodelman, Dirk Ormoneit, Ron Parr, Christian Shelton, Ben Taskar, Simon Tong, David Vickrey, Haidong Wang, and Roman Yelensky. I am grateful to Yossi Matias for taking me through my first steps of academic research and for his help in my application process and in encouraging me to come to Stanford. Thanks also to my friends and colleagues at Zapa, in particular Eyal Gever and Doron Gill, for their encouragement, help and support in my application process to Stanford. A great deal of the writing part of this thesis was done in the Morning Due Cafe in San Francisco. The friendliness of the owners and customers and the large amounts of coffee I had there were critical for this writing. Lots of thanks to the many friends who distracted me and kept me sane. The list is too long to fit in these lines, but I am grateful to my friends from back home in Israel, and to my new friends from Stanford, for their ongoing support and encouragement. I am grateful to my parents, Rachel and Yoffi, for everything they taught me and for the values they have given me. Without them, I would not arrive at this point. I also thank my brother Amir, for being such a great friend, and my little brother xi Tomer, for all the joy he brought to our family.