Probabilistic Models for Collecting, Analyzing, and Modeling Expression Data
Hai-Son Phuoc Le

May 2013
CMU-ML-13-101

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Thesis Committee
Ziv Bar-Joseph, Chair
Christopher Langmead
Roni Rosenfeld
Quaid Morris

Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy.

Copyright © 2013 Hai-Son Le

This research was sponsored by the National Institutes of Health under grant numbers 5U01HL108642 and 1R01GM085022, the National Science Foundation under grant numbers DBI0448453 and DBI0965316, and the Pittsburgh Life Sciences Greenhouse. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: genomics, gene expression, gene regulation, microarray, RNA-Seq, transcriptomics, error correction, comparative genomics, regulatory networks, cross-species, expression database, Gene Expression Omnibus, GEO, orthologs, microRNA, target prediction, Dirichlet Process, Indian Buffet Process, hidden Markov model, immune response, cancer.

To Mom and Dad.

Abstract

Advances in genomics allow researchers to measure the complete set of transcripts in cells. These transcripts include messenger RNAs (which encode proteins) and microRNAs, short RNAs that play an important regulatory role in cellular networks. While this data is a great resource for reconstructing the activity of networks in cells, it also presents several computational challenges.
These challenges include the data collection stage, which often results in incomplete and noisy measurements; the development of methods to integrate several experiments within and across species; and the design of methods that can use this data to map the interactions and networks that are activated in specific conditions. Novel and efficient algorithms are required to successfully address these challenges. In this thesis, we present probabilistic models to address the set of challenges associated with expression data. First, we present a novel probabilistic error correction method for RNA-Seq reads. RNA-Seq generates large and comprehensive datasets that have revolutionized our ability to accurately recover the set of transcripts in cells. However, sequencing reads inevitably contain errors, which affect all downstream analyses. To address these problems, we develop an efficient hidden Markov model-based error correction method for RNA-Seq data. Second, for the analysis of expression data across species, we develop clustering and distance function learning methods for querying large expression databases. The methods use a Dirichlet Process Mixture Model with latent matchings and infer soft assignments between genes in two species to allow comparison and clustering across species. Third, we introduce new probabilistic models to integrate expression and interaction data in order to predict targets and networks regulated by microRNAs. Combined, the methods developed in this thesis provide a solution to the pipeline of expression analysis used by experimentalists when performing expression experiments.

Acknowledgements

A Ph.D. may be the highest personal academic reward which many wish to achieve, but the road leading to a Ph.D. is certainly not the work of a single person. I would like to express my deepest gratitude to the multitude of people who have taught, helped, and supported me during the joyful but also adventurous and challenging time at Carnegie Mellon University.
Certainly for me, writing these acknowledgements is one of the most wonderful exercises in graduate school.

I am indebted to the generous support of my advisor, Ziv Bar-Joseph, who is truly a source of inspiration and ideas. Not long after I started school, it became clear to me that his academic success is a product of a remarkable balance of work, family, and life-long passions. He is not only an academic father but also a life role model. Ziv is always persistent and patient in answering my myriad questions. Every week, I look forward to our meeting with a list of questions and always leave with more ideas to work on. His instinct and fast thinking cut through the conceptual layers of many problems so quickly and lead to questions that usually take me weeks to answer well. Not only did I learn the technical and research methodology, but I also developed an appreciation for high-impact research, which must be well-motivated and driven by deliberate applications and substantial findings. He is deeply dedicated to the research and attentive to the details of the results. On one occasion, Ziv showed up at my office late in the evening, to my surprise. It turned out that he had gone home earlier, forgetting to send materials needed for our paper submission due at midnight. He walked back to school and gave them to me in person.

I thank the committee, Roni Rosenfeld, Chris Langmead, and Quaid Morris, for their advice, comments, and suggestions to improve the work in this thesis and my oral presentation. In particular, Quaid meticulously read the draft and suggested ways to make it more readable. Roni insisted on making the presentation more accessible to the audience.

I want to thank past and current members of the Systems Biology group at CMU. Marcel Schulz is remarkable at selling new ideas and dedicated to new research collaboration. His openness to sharing knowledge led to the first part of this thesis.
Saket Navlakha is instrumental in helping me improve my presentation skills. Our discussions about research, philosophical aspects of life, and religion are always entertaining and make lunch more enjoyable. I enjoy small chats with Anthony Gitter, Shan Zhong, Aaron Wise, and Guy Zinman, with whom I shared rooms and explored new cities during conferences.

I am grateful for the administrative help of Diane Stidle and Michelle Martin in scheduling meetings, talks and paperwork.

I would like to thank my parents, Hai Le and Lanh Chau, for their unconditional love and support. My dad, who taught me maths in first grade, introduced me to the world of logical thinking. My humble and warm-hearted mom taught me how to listen and treat people with care and respect. Although neither was physically with me during my undergraduate and graduate study, their presence was always in my heart. My sister, Tram Le, and her family are a source of comfort and encouragement in difficult times.

I cherish the time spent with many new friends in Pittsburgh. Thao Pham, Hoang Tran and Ha Nguyen cook delicious food and always welcome me to share their culinary delights. I enjoy listening to Hang Nguyen discussing, debating and ranting about politics, history or horoscopes. Hoan Ho always reminds me of a determined and strong-willed person when I face a difficult task. Phuong Pham and Thang Ho motivated me to run and trained with me. I am thankful for Suze Ninh's diligent care and effort in revising my writing and listening to my practice talks. Her sincere love comforts me in stressful times. Tuan Nguyen's sense of humor puts away worries and troubles. They are always available to listen to my problems and cheerfully enjoy sips of whiskey or a bottle of beer. I also thank other friends that I have exchanged ideas and interacted with: Lucia Castellanos, Rob Hall, Tzu-Kuo Huang, Wooyoung Lee, Ankur Parikh, Liang Xiong, Min Xu and Yang Xu.
I will miss playing tennis and going to the gym with Hoan, Marcel, Saket, Yang, Chao Shen, and Hua Shan.

Pittsburgh, PA
May 2013

Contents

Contents
List of Figures
List of Tables
List of Notations

1 Introduction
    1.1 Growing amount of genomics data
    1.2 Review of probabilistic models
    1.3 Overview of this thesis
    1.4 Organization of this thesis

I Collecting and preprocessing gene expression data

2 SEECER: a probabilistic method for error correction of RNA-Seq
    2.1 Introduction
    2.2 Methods
    2.3 Experimental setup
    2.4 Robustness and comparison with other methods
    2.5 Assembly of error corrected RNA-Seq sea cucumber data
    2.6 Discussion

II Cross-species analysis of functional genomics pathways

3 Querying large cross-species databases of expression experiments
    3.1 Introduction
    3.2 Methods
    3.3 Results: Testing distance metrics on data from human and mouse tissues
    3.4 Results: Identifying similar experiments in GEO
    3.5 Conclusions and future work

4 Cross-species Expression Analysis with Latent Matching of Genes
    4.1 Introduction
    4.2 Problem definition
    4.3 Model
    4.4 Experiments and Results
    4.5 Conclusions

III Using expression data to infer condition-specific miRNA targets

5 GroupMiR: Inferring Interaction Networks using the Indian Buffet Process
    5.1 Introduction
    5.2 Interaction model
    5.3 Regression model for mRNA expression
    5.4 Inference by MCMC
    5.5 Results
    5.6 Conclusions

6 PIMiM: Protein Interaction based MicroRNA Modules
    6.1 Introduction
    6.2 Methods
    6.3 Constraint module learning for multiple condition analysis
    6.4 Results
    6.5 Conclusions

7 Conclusions and Future Work
    7.1 Conclusions
    7.2 Themes shared by the methods in this thesis
    7.3 Future work

A Supplementary materials for Chapter 2
    A.1 Detailed analysis of false positive and false negatives after TopHat alignment with the human data
    A.2 De novo assembly results by expression
    A.3 Detailed analysis of types of corrections made by SEECER