Statistical Methods to Infer Biological Interactions George Jay Tucker
Statistical methods to infer biological interactions
by
George Jay Tucker
B.S., Harvey Mudd College (2008)

Submitted to the Department of Mathematics in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Mathematics, May 1, 2014
Certified by: Bonnie Berger, Professor of Applied Mathematics, Thesis Supervisor
Accepted by: Michel X. Goemans, Chairman, Applied Mathematics Committee

Statistical methods to infer biological interactions
by
George Jay Tucker

Submitted to the Department of Mathematics on May 1, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Applied Mathematics

Abstract

Biological systems are extremely complex, and our ability to experimentally measure interactions in these systems is limited by inherent noise. Technological advances have allowed us to collect unprecedented amounts of raw data, increasing the need for computational methods to disentangle true interactions from noise. In this thesis, we focus on statistical methods to infer two classes of important biological interactions: protein-protein interactions and the link between genotypes and phenotypes. In the first part of the thesis, we introduce methods to infer protein-protein interactions from affinity purification mass spectrometry (AP-MS) and from luminescence-based mammalian interactome mapping (LUMIER). Our work reveals novel context-dependent interactions in the MAPK signaling pathway and insights into the protein homeostasis machinery. In the second part, we focus on methods to understand the link between genotypes and phenotypes.
First, we characterize the effects of related individuals on standard association statistics for genome-wide association studies (GWAS) and introduce a new statistic that corrects for relatedness. Then, we introduce a statistically powerful association testing framework that corrects for confounding from population structure in large-scale GWAS. Lastly, we investigate regularized regression for phenotype prediction from genetic data.

Thesis Supervisor: Bonnie Berger
Title: Professor of Applied Mathematics

Acknowledgments

This journey would not have been possible without the many people who have supported me throughout my time at MIT. First, I would like to thank my advisor, Bonnie Berger, for her guidance and her unwavering support. I would also like to thank Alkes Price for welcoming me into his group meetings and journal clubs, for countless interesting discussions about medical genomics, and for mentorship as I learned about statistical genetics. My family and friends have supported me throughout my time here. In particular, my heartfelt thanks go to:

Po-Ru Loh, for being an inspiration and exceptional friend. I will always be grateful that I had the chance to work with someone so hard-working, careful, considerate, and humble.

Mark Lipson, for our discussions about research and everyday life over our weekly lunches.

Jian Peng, for teaching me that research is difficult even if it feels like it shouldn't be and that's okay.

The rest of the members of the Berger lab, in particular Alex Levin, Irene Kaplow, Leonid Chindelevitch, Deniz Yorukoglu, Fulton Wang and Sean Simmons, for teaching me how to do research.

Patrice Macaluso, for keeping us sane and staving off chaos.

Mikko Taipale, for being an amazing collaborator and general academic badass.

Polina Golland, for welcoming me into her reading group and teaching me to ask questions about the "trivial" things that usually turn out to confuse everyone.
Mark Behrens, for mentoring me through my short trip in Algebraic Topology and for being understanding.

Lastly, I would like to thank my love, Holly Johnsen, who has been with me through the best and the darkest times of this journey.

Contents

1 Introduction
  1.1 Inferring protein-protein interactions
  1.2 Statistical genetics

I Inferring protein-protein interactions

2 Proteomic and Functional Genomic Landscape of Receptor Tyrosine Kinase and Ras to Extracellular Signal-Regulated Kinase Signaling
  2.1 Introduction
  2.2 Results
    2.2.1 An RTK-Ras-ERK interaction network
  2.3 Discussion
  2.4 Materials and Methods
    2.4.1 RNAi screening
    2.4.2 TAP and mass spectrometry
    2.4.3 Computational analysis of TAP-MS data
    2.4.4 Additional statistical analysis
    2.4.5 Western blotting and coimmunoprecipitation
    2.4.6 In vivo analysis

3 Incorporating quantitative mass spectrometry data in protein interaction analysis
  3.1 Introduction
  3.2 Results
    3.2.1 Sampling framework
    3.2.2 Validation on three AP-MS data sets
  3.3 Discussion
    3.3.1 Characterization of methods
    3.3.2 Low rank plus sparse matrix framework
    3.3.3 Moving toward complexes
  3.4 Conclusions
  3.5 Methods
    3.5.1 AP-MS data sets
    3.5.2 Validation data sets
    3.5.3 Implementation

4 Inferring interactors from LUMIER using mixture models
  4.1 Introduction
  4.2 LUMIER
  4.3 Methods
    4.3.1 Spatial Bias Model
    4.3.2 Background Luminescence Model
  4.4 An application to mapping chaperone, co-chaperone, and client interactions
    4.4.1 Experiment setup
    4.4.2 Preprocessing
    4.4.3 Validation
    4.4.4 Results
  4.5 Conclusion

II Statistical genetics

5 Mixed models with related individuals
  5.1 Introduction
    5.1.1 MLM statistics
    5.1.2 Expected statistics with unrelated individuals
    5.1.3 Expected statistics with related individuals
  5.2 Results
    5.2.1 Simulated genotypes and phenotypes
    5.2.2 CARe genotypes
    5.2.3 CARe phenotypes
  5.3 Statistical Methods
    5.3.1 MLM statistics
    5.3.2 Two variance component MLM statistics
  5.4 Conclusion

6 Improving the Power of GWAS and Avoiding Confounding from Population Stratification with PC-Select
  6.1 Introduction
  6.2 Results
  6.3 Discussion
  6.4 Methods
    6.4.1 MS dataset
    6.4.2 Statistical methods

7 Phenotype prediction using regularized regression on genetic data in the DREAM5 Systems Genetics B Challenge
  7.1 Introduction
  7.2 Materials and Methods
    7.2.1 Dataset and challenge setup
    7.2.2 Preliminary ranking of predictors by correlation
    7.2.3 Rank transformation to reduce phenotype outliers
    7.2.4 Basis expansion to boolean combinations of genotype variables
    7.2.5 Regularized regression modeling
  7.3 Results
    7.3.1 Modest performance of all regression techniques on training dataset
    7.3.2 Effectiveness of rank transformation on phenotype 1
    7.3.3 Strong regularization in best-fit models
    7.3.4 High variance in performance on individual cross-validation folds and test set
    7.3.5 Official DREAM5 challenge results
  7.4 Discussion

A Supporting Information for Incorporating quantitative mass spectrometry data in protein interaction analysis

B Supporting Information for PC-Select
  B.1 Model performance as the number of top SNPs to include in the GRM is varied
  B.2 Implementation

Chapter 1

Introduction

In this thesis, we focus on inferring two classes of important biological interactions: protein-protein interactions (PPI) and the link between genotypes and phenotypes. Biological systems are extremely complex and our ability to experimentally measure interactions is limited by inherent noise.
Computational methods can identify patterns that are invisible to the human eye and disentangle true interactions from noise. Throughout this thesis we draw on a wide range of statistical methods to achieve this goal. In this chapter, we set the context for and summarize the main contributions of this thesis.

1.1 Inferring protein-protein interactions

Proteins are the building blocks of cells, constituting most of the cell's dry mass and executing nearly all cell functions. However, proteins do not act alone; all proteins interact with other molecules, from enzymes catalyzing chemical reactions to proteins transmitting extracellular signals to change gene expression and protein levels. The biological properties of a protein depend on its physical interactions with other proteins. As such, an important way to begin characterizing the biological role of a protein is to identify its binding partners.

In the past two decades, significant effort has been devoted to generating comprehensive PPI networks (e.g., [141, 61, 49, 51, 53]) to uncover the molecular basis of genetic interactions and provide functional roles for proteins. These networks have been used as scaffolds to transfer known annotations to uncharacterized proteins in our lab and others. For example, IsoRank [119] and IsoRankN [80] predict functional orthologs across species by aligning PPI networks. In signaling network reconstruction, perturbation studies are used to reveal the critical components of the pathway. However, in many cases, these studies identify proteins that are not directly part of the core pathway. Huang et al. [59] and Yeger-Lotem et al. [154] developed methods that use network flows and minimal trees in the PPI network to organize these disparate proteins into functionally coherent pathways.

Before we can realize the benefits of a comprehensive PPI network, we first have to generate the interaction network.
Mapping protein-protein interactions is extremely time- and labor-intensive because of the sheer number of potential interactions. Mass spectrometry or affinity purification mass spectrometry (AP-MS) and yeast two-hybrid (Y2H) are two widely used high-throughput techniques for identifying protein interactions. The first large-scale PPI networks were generated for the model organism Saccharomyces cerevisiae, initially using Y2H screens [141, 61] and subsequently by AP-MS [49, 56].