UNIVERSITY OF CALIFORNIA, SAN DIEGO

Effective Design and Analysis of Systems Genetics Studies

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Hyun Min Kang

Committee in charge:
Professor Pavel Pevzner, Chair
Professor Eleazar Eskin, Co-Chair
Professor Vineet Bafna
Professor Sanjoy Dasgupta
Professor Trey Ideker
Professor Nicholas J. Schork

2009

Copyright Hyun Min Kang, 2009
All rights reserved.

The dissertation of Hyun Min Kang is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair
Chair

University of California, San Diego
2009

DEDICATION

To Jihye and Joseph.

EPIGRAPH

Get the facts, or the facts will get you. And when you get them, get them right, or they will get you wrong.
— Thomas Fuller

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita and Publications
Abstract of the Dissertation

Chapter 1 Introduction

Chapter 2 A high-density haplotype resource of 94 inbred mouse strains
  2.1 Motivation
  2.2 Results
    2.2.1 The mouse HapMap resource
    2.2.2 Haplotype structure among the strains
    2.2.3 Integrating NIEHS/Perlegen resequencing and HapMap data
    2.2.4 Effects of larger resources
    2.2.5 Trait mapping with the mouse HapMap resource
  2.3 Discussion
  2.4 Methods

Chapter 3 An adaptive and memory efficient algorithm for genotype imputation
  3.1 Motivation
  3.2 Materials and methods
    3.2.1 The imputation problem
    3.2.2 Imputation algorithm for haploid model
    3.2.3 Extension to unphased genotypes (diploid model)
  3.3 Results
    3.3.1 Genotype imputation of 94 inbred mouse strains
    3.3.2 Imputation of HapMap SNPs in WTCCC samples
  3.4 Conclusion

Chapter 4 Efficient control of population structure in model organism association mapping
  4.1 Motivation
  4.2 Materials and methods
    4.2.1 Genotypes and phenotypes
    4.2.2 Efficient mixed model association (EMMA)
    4.2.3 Similarity-based kinship matrix
    4.2.4 Phylogenetic control
    4.2.5 Statistical tests and multiple hypothesis testing
    4.2.6 Simulation studies
    4.2.7 Derivation of restricted likelihood and derivatives
  4.3 Results
    4.3.1 Comparison with previous methods
    4.3.2 High resolution genome-wide association mapping in inbred mouse strains
    4.3.3 Power of inbred association mapping
  4.4 Discussion

Chapter 5 Accounting for sample structure in large scale genome-wide association studies using a variance component model
  5.1 Motivation
  5.2 Materials and methods
    5.2.1 Variance component model to account for sample structure
    5.2.2 Estimating marker specific inflation factor
    5.2.3 Accounting for large effect sizes at some SNPs
    5.2.4 Application to case control datasets
    5.2.5 Genotype and phenotype data
  5.3 Results
    5.3.1 NFBC66
    5.3.2 Application to WTCCC case-control data
    5.3.3 Marker specific inflation factors
  5.4 Discussion

Chapter 6 Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots
  6.1 Motivation
  6.2 Results
    6.2.1 Spurious regulatory hotspots in recombinant inbred mice
    6.2.2 Inter-sample correlation as signatures of systematic confounding effects
    6.2.3 Inter-sample Correlation Emended (ICE) eQTL mapping
    6.2.4 Some trans-regulatory bands in high quality datasets are likely to correspond to real genetic effects
    6.2.5 Correcting for confounding effects in human lymphoblastoid cell line expression
    6.2.6 Comparison with previous methods
  6.3 Discussion
  6.4 Materials and methods
    6.4.1 Gene expression data and genetic maps
    6.4.2 Traditional eQTL mapping and genome wide eQTL maps
    6.4.3 Explicit batch effect correction and Surrogate Variable Analysis
    6.4.4 Genome wide inter-sample correlation
    6.4.5 Simulation studies
    6.4.6 Variance component test
    6.4.7 ICE eQTL mapping
    6.4.8 Assessing the statistical significance of trans-regulatory bands

Chapter 7 A High Resolution Association Mapping Panel for the Dissection of Complex Traits in Mice
  7.1 Motivation
  7.2 Results
    7.2.1 Design principles of mouse association studies
    7.2.2 Strain selection for the Hybrid Mouse Diversity Panel
    7.2.3 Validating the statistical power of the HMDP through mapping metabolic clinical traits
    7.2.4 Resolution of mouse association studies
    7.2.5 Application of the HMDP by mapping metabolic clinical traits
    7.2.6 Comparison to previous mouse association studies
  7.3 Discussion
  7.4 Materials and methods
    7.4.1 Animals
    7.4.2 Phenotypes/phenotyping protocols
    7.4.3 Genotyping
    7.4.4 RNA isolation and expression profiling
    7.4.5 Gene expression analysis
    7.4.6 Genome-wide association mapping accounting for population structure
    7.4.7 Estimation of power and mapping resolution
    7.4.8 Genome-wide significance threshold
    7.4.9 Validation of clinical and expression associations

Chapter 8 Conclusion and future work
  8.1 Summary and conclusion
  8.2 Future work
    8.2.1 GWAS with unstratified populations
    8.2.2 Exploring multiple rare variants hypothesis
    8.2.3 Capturing unmodeled confounding effects inherent in various high-throughput data
    8.2.4 Challenges in sequence-based association mapping

Bibliography

LIST OF FIGURES

Figure 1.1: A conceptual diagram of systems genetics studies
Figure 2.1: Classification of 94 strains used in the mouse HapMap projects
Figure 2.2: Fraction of genome covered by shared segments
Figure 2.3: Fraction of pairwise shared genomic segments
Figure 2.4: Estimated imputation accuracy and coverage
Figure 2.5: Phenotypic variance explained by population structure
Figure 2.6: Number of phenotypes with significant associations
Figure 2.7: Comparison of genomic control inflation factors
Figure 3.1: An example of the imputation problem
Figure 3.2: An example of HMM
Figure 4.1: Comparison between different methods
Figure 4.2: Cumulative distribution of p-values
Figure 4.3: Genome-wide association plots
Figure 4.4: Power estimates based on real phenotypes
Figure 4.5: Simulation-based power estimates
Figure 5.1: Principal components and geographical information
Figure 5.2: Scatterplot of five principal components
Figure 5.3: QQ plot of LDL association
Figure 5.4: Inflation by the number of principal components
Figure 5.5: Comparison between IBS and IBD estimates
Figure 5.6: QQ plots for NFBC66 association mapping
Figure 5.7: Comparison between EMMA and EMMAX
Figure 5.8: Comparisons of LDL association plots
Figure 5.9: Concordance between different methods
Figure 5.10: Differences in beta estimates
Figure 5.11: QQ plots in WTCCC association mapping
Figure 5.12: Distribution of the marker specific inflation factors
Figure 5.13: QQ plots comparison using simulated phenotypes
Figure 5.14: Concordance of per-marker inflation factor
Figure 6.1: Comparison of regulatory hotspots
Figure 6.2: Genome wide eQTL maps
Figure 6.3: Genome wide inter-sample correlation with replicated samples
Figure 6.4: eQTL maps of simulated expression datasets
Figure 6.5: Statistical power under various systematic confounding effects
Figure 6.6: trans-regulatory bands under various systematic confounding effects
Figure 6.7: Number of genes with significant eQTLs
Figure 6.8: Concordance of eQTLs
Figure 6.9: Inter-sample correlation with M430v2 chip
Figure 6.10: Regulatory hotspots between original and simulated datasets
Figure 6.11: Genome wide eQTL maps using Surrogate Variable Analysis
Figure 6.12: P-values for differential expressions across populations
Figure 6.13: QQ plot of differential expressions between populations
Figure 6.14: Cis associations within each HapMap population
Figure 7.1: Power Calculations
Figure 7.2: Detection of associations for plasma lipids in HMDP strains
Figure 7.3: Expression traits demonstrate high resolution of HMDP
Figure 7.4: Correcting for population structure dramatically reduces false positives