Integrative Analysis of Heterogeneous Genomic Datasets to Discover Genetic Etiology of Autism Spectrum Disorders

by

Sumaiya Nazeen

B.Sc. in Computer Science and Engineering, Bangladesh University of Engineering and Technology (2011)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Science in Electrical Engineering and Computer Science

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author (signature redacted): Department of Electrical Engineering and Computer Science, August 28, 2014

Certified by (signature redacted): Bonnie A. Berger, Professor of Applied Mathematics and Computer Science, Thesis Supervisor

Accepted by (signature redacted): Leslie A. Kolodziejski, Professor of Electrical Engineering, Chair, Department Committee on Graduate Students

Integrative Analysis of Heterogeneous Genomic Datasets to Discover Genetic Etiology of Autism Spectrum Disorders

by

Sumaiya Nazeen

Submitted to the Department of Electrical Engineering and Computer Science on August 28, 2014, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science

Abstract

Understanding the genetic background of complex diseases is crucial to medical research, with implications for diagnosis, treatment, and drug development. As molecular approaches to this challenge are time-consuming and costly, computational approaches offer an efficient alternative. Such approaches aim to predict and prioritize genes for a particular disease of interest.
State-of-the-art gene prediction and prioritization methods rely on the observation that disease-causing genes share some form of functional similarity, based on sequence, phenotype, protein-protein interaction (PPI) network, or functional annotation. Another increasingly accepted view is that human diseases result from perturbations of molecular networks, and that genes causing the same or similar diseases tend to lie close to one another in those networks. These observations have formed the basis for a large collection of computational approaches to finding previously unknown genes associated with particular diseases. The majority of these methods are built on protein interactome networks, integrated with other large-scale omics data, to infer how likely it is that a gene is associated with a disease.

In this thesis, we set out to address the outstanding challenge of understanding the genetic etiology of autism spectrum disorder (ASD), which refers to a group of complex neurodevelopmental disorders sharing the common feature of dysfunctional reciprocal social interaction. We introduce three novel methods for computing how likely a given gene is to be involved in ASDs, based on copy number variations (CNVs), phenotype similarity, and protein interactome network topology. We also customize a random walk with restarts algorithm for ASD gene prioritization for the first time. Finally, we provide a novel integrative approach for combining CNV, phenotype similarity, and topology-related information with existing knowledge from the literature. Our integrative approach outperforms the individual schemes in identifying and ranking ASD-related genes. Our candidate gene set is enriched for a number of signaling, cell-adhesion, and neurological pathways, molecular functions, and biological processes that merit further investigation in connection with ASDs.
We also find evidence for a connection between gastrointestinal disorders, particularly inflammatory bowel diseases (IBD), and ASDs. The subnetworks we identify suggest that subclasses of disorders may exist along the autism spectrum.

Thesis Supervisor: Bonnie A. Berger
Title: Professor of Applied Mathematics and Computer Science

Acknowledgments

This thesis owes its existence to Professor Bonnie Berger. It has been an amazing experience to work with her. She has been an excellent source of encouragement and inspiration to me. She has been incredibly patient with me and has always put my personal growth as a researcher first. I cannot thank her enough for teaching me how to approach the process of learning and research.

I am indebted to Dr. Rohit Singh for his constant help, advice, support, and mentorship in all aspects of my thesis. This work would not have been possible without him. I remember countless meetings with him in which I walked in frustrated, yet walked out encouraged and excited again. I'd like to thank Rohit for his warm support and his patience in teaching me how to face the moments when progress seems slow.

I would like to thank Professor Isaac Kohane, Dr. Nathan Palmer, and Dr. Finale Doshi-Velez for sharing their knowledge of autism spectrum disorders with me.

Many thanks to the members of the Berger lab for sharing my exciting as well as frustrating moments. I'd like to thank Patrice for brightening my days with her warm greetings. I am grateful to George, Hoon, Sean, and Christina for having discussions with me and encouraging me along in my research. Thanks to Andrew, Deniz, Jian, Noah, and William for being there whenever I needed help.

I owe my gratitude to the Bangladeshi Students Association at MIT, which has become my family in Boston. As always, I am ever grateful to my parents and siblings for their love and constant support.
Finally, I express my utmost gratitude to my greatest supporter: to the Almighty Allah, who has bestowed good health upon me, kept me free from anxiety, and filled my every day with joy and hope.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 State of the art
    1.2.1 General Trends in Disease Gene Prediction
    1.2.2 Computational Advances in ASD Gene Prediction
  1.3 Contributions
  1.4 Outline
2 Predicting and Prioritizing Candidate Genes for ASD
  2.1 CNV Information Entropy based Prioritizer
    2.1.1 Copy Number Variation (CNV)
    2.1.2 Copy Number Variants in ASD
    2.1.3 Calculating Information Entropy Score from CNVs
    2.1.4 Quality of CNV Information Entropy based Prioritization
  2.2 ASD Similarity based Prioritizer
    2.2.1 Similarity of Phenotypes or Diseases
    2.2.2 Gene-Phenotype Association Data
    2.2.3 Calculating ASD Similarity Scores
    2.2.4 Performance of ASD Similarity based Prioritizer
  2.3 Diffusion State ASD Proximity based Prioritizer
    2.3.1 Diffusion State Distance (DSD) in PPI Network
    2.3.2 Calculating Diffusion State ASD Proximity (DSAP) of Genes
    2.3.3 Quality of DSAP-based Ranking
  2.4 Network Crosstalk based Prioritizer
    2.4.1 Motivation
    2.4.2 Problem Formulation
    2.4.3 Calculating Network Crosstalk Scores
    2.4.4 Dealing with Statistical Bias
    2.4.5 Performance of Network Crosstalk based Prioritizer
3 Integrative Approach for Identifying ASD Risk Genes
  3.1 Background
    3.1.1 Lasso-penalized Logistic Regression
  3.2 Predicting ASD Association via Logistic Regression based Integrative Approach
    3.2.1 Preparing Data for Training and Validation
    3.2.2 Constructing Lasso-regularized Binomial Regression Model
    3.2.3 Selecting Model Coefficients
    3.2.4 Creating Regularized Model and Making Predictions
  3.3 Performance Analysis
4 ASD Genetics: Implications from Candidate ASD Risk Genes
  4.1 Gene Sets for Analysis
  4.2 Hypergeometric Test for Enrichment
  4.3 Pathway Enrichment Analysis
    4.3.1 An Interesting Connection with Inflammatory Bowel Disease (IBD)
  4.4 Enrichment Analysis on GO gene sets
  4.5 Enrichment Analysis for Subnetworks
  4.6 Functional Analysis for Overlap with Diseases and Bio-functions
5 Conclusion
Appendix A SFARI Genes for Autism Spectrum Disorders
Appendix B Risk Genes for ASDs Identified by Integrative Approach
Appendix C Subnetworks in ASD Risk Gene Set
Bibliography

List of Figures

2-1 Copy number variations in a pair of chromosomes.
2-2 Steps in CNV-based prediction-prioritization of ASD genes.
2-3 Receiver operating characteristic curves for CNV-based prioritizer using different scaling factors.
2-4 Lift chart for CNV-based prioritizer.
2-5 Receiver operating characteristic curve for ASD