Genome-Wide Association Study (GWAS)
Total Page:16
File Type:pdf, Size:1020Kb
Linkage disequilibrium based eQTL analysis and comparative evolutionary epigenetic regulation of gene transcription by NING JIANG A thesis submitted to The University of Birmingham for the degree of DOCTOR OF PHILOSOPHY School of Biosciences The University of Birmingham November 2012 University of Birmingham Research Archive e-theses repository This unpublished thesis/dissertation is copyright of the author and/or third parties. The intellectual property rights of the author or third parties in respect of this work are as defined by The Copyright Designs and Patents Act 1988 or as modified by any successor legislation. Any use made of information contained in this thesis/dissertation must be in accordance with that legislation and must be properly acknowledged. Further distribution or reproduction in any format is prohibited without the permission of the copyright holder. ABSTRACT The present thesis contains two independent parts of research, which are summarised as below. Part I Genome-Wide Association Study (GWAS) has recently been proposed as a powerful strategy for detecting the many subtle genetic variants that underlie phenotypic variation of complex polygenic traits in population-based samples. One of the main obstacles to successfully using the linkage disequilibrium based methods is knowledge of any underlying population structure. The presence of subgroups within a population can result in spurious association. A robust statistical method is developed to remove the population structure interference in GWAS by incorporating single control marker into testing for significance of genetic association of a polymorphic marker (SNP) with phenotypic variance of a complex trait. The novel approach avoids the need of structure prediction which could be infeasible or inadequate in practice and accounts properly for a varying effect of population stratification on different regions of the genome under study. Both intensive computer simulation study and eQTL analysis in genetically divergent human populations show that the new method confers an improved statistical power for detecting genuine genetic association in subpopulations and an effective control of spurious associations stemmed from population structure. Part II Regulation of gene transcription (or expression) plays an essential role for viruses, prokaryotes and eukaryotes when the genetic information is turned into functional products—RNA. It can introduce the variability and adjustability of an organism through modulating gene expression levels. Gene expression process may be regulated by several aspects, from transcription initiation to methylation-based regulation. In epigenetics, the addition of methyl groups to cytosine bases in DNA sequence is considered as heritable chemical markers. DNA methylation has been confirmed that it plays a fundamental role for regulation of gene expression and is widespread among eukaryotic species, particular in vertebrates. In these analyses, two highly conserved but distinct methylation-based promoter regulatory patterns have been detected in higher vertebrates. These two diverse promoter classifications show distinct features and properties in genomic, methylation, expression and function levels. However, they present a highly conserved performance among several higher vertebrate species. ACKNOWLEDGEMENTS The progress of this project owes a great deal to the guidance and encouragement of my supervisor, Professor Zewei Luo. Without his invaluable patience, advice and insight, this project would not be possible to complete. Thank you very much for guiding me in both of my academic research and daily life. I would also like to thank my second supervisors, Dr Christine Hackett and Dr David Marshall. Your support and encouragement have always been helpful throughout my PhD project. Additionally, I want to express my special thanks to Professor Michael J. Kearsey and Dr Lindsey Leach for their help, suggestion and advice. I appreciate all the help from my colleagues in University of Birmingham, including Dr Minghui Wang, Dr Tianye Jia, Dr Elena Potokina, Dr Joseph Abraham, Yiyuan Liu and Jing Chen. Thanks for giving me support and sharing your experiences in research. It has been a pleasure to work with all of you. I express my heartfelt gratitude to my parents Qihe Jiang and Lifang Nie, my grandma Yusheng Sun for helping me a lot in various ways to complete my PhD project. A very special thank you to my love Mrs Hui Jiang whose support and love has been incredibly important to me. It is my great fortune to live with you and always will be. Lastly but certainly not least, thanks also go to Scottish Crop Research Institute (SCRI) and University of Birmingham. I have been fortunate to receive a joint studentship to make this project possible. Table of Contents Part I Developing the linkage disequilibrium (LD) based association method for eQTL analysis Chapter Ι........................................................................................................................... 2 Overall introduction: association mapping, gene transcription process and expression quantitative trait loci (eQTLs) analysis.............................................................................. 2 1. 1 Related publications .............................................................................................. 2 1.2 Overview............................................................................................................... 2 1.3 Complex traits ....................................................................................................... 4 1.4 Genetic linkage and linkage analysis...................................................................... 6 1.5 Linkage disequilibrium and association studies...................................................... 8 1.5.1 Definition of LD ............................................................................................ 8 1.5.2 Genetic association studies........................................................................... 11 1.5.3 Advantages of association studies ................................................................ 12 1.5.4 Population structure: A big challenge in association study ........................... 14 1.6 Genetic markers and high-throughput genotyping technologies ........................... 17 1.6.1 Genotyping SNP markers using microarray platforms.................................. 19 1.6.2 Discovering and genotyping SNPs based on NGS technologies.................... 20 1.6.3 Prediction of genetic polymorphisms from expression microarray dataset.... 22 1.7 Gene expression process...................................................................................... 25 1.8 Gene transcription ............................................................................................... 28 1.9 Expression quantitative trait loci (eQTLs) analysis .............................................. 33 1.9.1 A frame work of eQTL analysis................................................................... 35 1.9.2 Extracting genome wide gene expression profile for eQTL mapping............ 36 1.9.3 eQTL Hotspots ............................................................................................ 40 1.9.4 An important application of eQTL studies.................................................... 41 1.10 References........................................................................................................... 44 Chapter II........................................................................................................................ 51 Developing a robust statistical method ............................................................................ 51 for association-based....................................................................................................... 51 expression quantitative trait analysis ............................................................................... 51 2.1 Related publication.............................................................................................. 51 2.2 Overview............................................................................................................. 51 2.3 Introduction......................................................................................................... 53 2.3.1 Family-based association studies.................................................................. 54 2.3.2 Genomic control (GC) method and the structure association (SA) analysis .. 55 2.3.3 Principal component analysis (PCA) based method— EIGENSTRAT ......... 58 2.3.4 Correcting the confound effect of population structure using only one genetic marker 59 2.4 Statistical models and methods ............................................................................ 62 2.4.1 Method 1: a novel regression analysis with correcting population structure.. 62 2.4.1.1 The novel linear regression model and theoretical analysis................... 65 2.4.1.2 Significant test of the regression coefficient ...................................... 68 b1 2.4.1.3 Selection of the control marker............................................................. 69 2.4.2 Method 2 (Regression analysis without correcting population structure) ...... 70 2.4.3 Method 3 (multiple regression analysis)....................................................... 71 2.5 Simulation study.................................................................................................. 73 2.5.1 Simulations