Gene and Gene-Set Analysis for Genome-Wide Association Studies
Inti Pedroso
National Institute of Health Research Specialist Biomedical Research Centre for Mental Health & Medical Research Council Social, Genetic and Developmental Psychiatry Research Centre
King’s College London

This dissertation is submitted for the degree of Doctor of Philosophy
March 2011

Abstract

Genome-wide association studies (GWAS) have identified hundreds of loci at very stringent levels of statistical significance across many different human traits. However, it is now clear that very large samples (n ~ 10^4-10^5) are needed to find the majority of genetic variants underlying risk for most human diseases. Therefore, the field has engaged itself in a race to increase study sample sizes, with some studies yielding very successful results but others providing little or no new insight. This project started early on in this new wave of studies and I decided to use an alternative approach that uses prior biological knowledge to improve both the interpretation and the power of GWAS. The project aimed to a) implement and develop new gene-based methods that derive gene-level statistics, so that GWAS results can be used in well-established systems biology tools; b) use these gene-level statistics in network and gene-set analyses of GWAS data; c) mine GWAS of neuropsychiatric disorders using gene, gene-set and integrative biology analyses with gene-expression studies; and d) explore the ability of these methods to improve the analysis of GWAS of disease sub-phenotypes, which usually suffer from very small sample sizes. In this project, we focused on the analysis of GWAS of bipolar disorder. However, GWAS of other complex disorders have also been analysed and are used throughout to compare the performance of the methods developed across disorders that, very likely, have different genetic architectures. From the analysis of these datasets, it was possible to draw the following conclusions.
I found that, from a computational perspective, the application of gene and gene-set analyses to GWAS is feasible, even if calculations are performed on a common desktop computer. As a genetic mapping tool, gene-based association of GWAS provides a valuable complement to single-SNP analysis methods. It highlights true disease loci and generates an accurate association statistic for a gene. These gene-level statistics proved of merit for meta-analyses of GWAS and for integrative analysis with protein-protein interaction networks or gene-expression studies. In this thesis, I combined results of GWAS and gene-expression studies and was able to provide evidence of association with bipolar disorder for a protein-interaction network. This demonstrates the possibility of using GWAS of neuropsychiatric disorders to extract information about biological systems. This information is not readily available from single-SNP analyses, which also suffer from genetic heterogeneity. Therefore, these methods help to improve interpretation and provide an alternative to the simple but costly method of improving statistical power by increasing sample size. This ability proved to be important in our analyses of bipolar disorder sub-phenotypes, which had considerably smaller sample sizes. Analysis of disease sub-phenotypes showed that some genetic associations are shared with the main diagnosis but that others seemed largely specific to the sub-phenotypes. This evidence allowed me to suggest that genetic association with sub-phenotypes should be analysed both as a follow-up of association with the main diagnosis and independently, if the available evidence suggests that the genetic associations observed are largely independent of those with the main diagnosis (e.g., as we found for age of onset of bipolar disorder).
Despite the interesting new insights and potential relevance, our gene and gene-set associations with bipolar disorder require replication in independent samples. This is especially important for the associations found with the bipolar disorder sub-phenotypes, because their smaller sample sizes also make them more prone to being false positives. This project has also generated the software suite FORGE, which allows users to perform fast and robust analysis of GWAS. It allows researchers with little experience in bioinformatics and statistical genetics, e.g. wet-laboratory-based biologists, to perform both gene-based and pathway analyses and to generate a systems biology interpretation of their phenotype of interest.

Acknowledgements

I am most grateful to everyone who contributed to the work of this thesis, for giving advice and encouragement. First of all, I would like to thank my supervisor Gerome Breen. His imagination and scientific twist have always prompted me to look at scientific problems from a different perspective. I would also like to thank Mike Barnes, who never spared words for strong criticism but also gave support and prompted me to be critical, ruthless and to grasp beyond the obvious. Finally, Tom Price, who joined my PhD supervision in the last turn of the race but provided timely and sharp advice and time for enjoyable scientific conversation. Thanks to you since you have helped the most. Of all others I shall first thank my wife Katherinne for cooking and gardening our love. I must thank my parents for giving me space to be or not to be and for giving me a whole life of love. I must also thank my grandmother and my uncle Gonzalo for bringing passion into this endless route of discovery and for making me believe in people and what they can do. Thanks to you because you made me what I am. I cannot forget my friends, both in Chile and the UK, who have helped me from the beginning and continue to help me now.
In particular I would like to thank Anbarasu L, James R, Sara Campos, Sarah Cohen, Sarah Jugurnauth, Margarita, Shaza, Katherine Tansey, Ursula, Cerisse, Jo G, Carol Shum, Sietske, Evangelos Vassos, Bahare Azadi, Chloe Wong and the many other people who have made my years as a student in the SGDP a pleasure. Because of my sweet tooth I must also express my gratitude to the Fellowship of the Cake, for baking every month and sharing with me the sweetest of friendships.

Thanks to David Collier, Cathryn Lewis, Stuart Newman, Ammar Al-Chalabi, all members of the Depression Genetic Consortium led by Prof Peter McGuffin and many others for small and big gestures of help or advice. I am immensely grateful for the funding provided by the NIHR BRC for Mental Health and the Overseas Research Studentship Award Scheme. Finally, I must thank the patients and researchers who gathered the data I analysed. Without their hard work this thesis would not have been possible.

Do we need to do all of this?

As a first-year PhD student I used to ask myself: Do we need all these genetic studies? Are they relevant at all? Are we just teasing ourselves while pleasing our big brain’s curiosity? Is the answer relevant just because we want to know it or can know it? Unfortunately, I still have no answer. I have learnt that, beyond pleasing ourselves, genetic research can be tremendously valuable when a real breakthrough occurs and people out there benefit from it. That is the reason I focused my efforts on helping to transform GWAS into something useful to biologists, with the hope that it would help produce solutions for real people. I do think more genetic research should be funded and more genetic findings should end up in systems biology efforts, where causality and complexity can be studied systematically. At the end of the day, complexity is at the heart of why we are in love with biology and why it is so challenging. However, I have also learnt that we may be missing an important bit here.
Many diseases do have a strong genetic liability but also have an important environmental contribution. Many have risen in frequency beyond what genetics alone can explain. We inherited our genes and not our diseases. Changes in lifestyle may be an important aspect of decreasing disease morbidity, probably in some cases beyond what genetic findings can do. I think human genetics will soon face great scientific challenges. The massive amounts of incoming data will enable us to test some of our wildest hypotheses with an accuracy and depth not seen before. New sequencing technologies are opening the door to systematic and almost complete sampling of genetic material. Protein measurements will probably follow soon, as will quantification and sequencing of lipids and sugars. Biology will move away from a reductionistic science into a data-driven technology. Quantitative models will supersede speculation and medicine will become a science of forecasting, in the same way that weather forecasting changed in the last 80 years. Nevertheless, I think you will still have to go to the gym to lose weight and get fit for the summer.
Contents

I INTRODUCTION
  1.1 GENETIC MAPPING
    1.1.1 Sweet and sour: From Mendelian to complex diseases
    1.1.2 The common disease-common variant model
    1.1.3 Construction of public resources to analyse human genetic variation
    1.1.4 Genome-wide association studies
  1.2 CHALLENGES OF CURRENT GWAS
  1.3 GENE-SET AND NETWORK ANALYSES
  1.4 GSA AND NETWORK ANALYSES IN GWAS
  1.5 OBJECTIVES OF THIS THESIS

II A SOFTWARE SUITE TO PERFORM GENE-BASED AND GENE-SET ANALYSES OF GENOME-WIDE ASSOCIATION STUDIES
  2.1 INTRODUCTION
  2.2 GENE-BASED ANALYSES OF GWAS
  2.3 GENE-SET ANALYSES
    2.3.1 Different null hypotheses
    2.3.2 Using SNP versus gene-based statistics
    2.3.3 Methods implemented in FORGE
  2.4 METHODS
    2.4.1 Region or gene-wide statistics
      2.4.1.1 Sidak’s correction of the minimum p-value
      2.4.1.2 Fisher’s method to combine correlated p-values
      2.4.1.3 Fixed-effect z-score statistic
      2.4.1.4 Random-effect z-score statistic
      2.4.1.5 Significance of gene-wide statistics by sampling from a MND
  2.5 GENE-SET ANALYSES
    2.5.1 SNP to gene-sets strategy
      2.5.1.1 Gene-sets analysis with gene p-values
      2.5.1.2 Reducing the number of simulations needed of p-values << 1.