Case-Control Association Tests Correcting for Population Stratification
Total Page:16
File Type:pdf, Size:1020Kb
Case-Control Association Tests Correcting for Population Stratification Dissertation zur Erlangung des Doktorgrades der Mathematisch–Naturwissenschaftlichen Fakult¨aten der Georg–August–Universit¨atzu G¨ottingen vorgelegt von Karola K¨ohler aus G¨ottingen G¨ottingen 2005 D7 Referent: Prof. Dr. Manfred Denker Korreferentin: Prof. Dr. Heike Bickeb¨oller Tag der m¨undlichen Pr¨ufung: 25.01.2006 Danksagung Ganz besonders m¨ochte ich mich bei Frau Prof. Dr. Heike Bickeb¨oller bedanken, die mir das interessante Thema vorgeschlagen und mich w¨ahrend der Entstehung der Arbeit maßgeblich begleitet hat. Sie hat es mir erm¨oglicht, an vielen Workshops und Tagungen teilzunehmen, meine Arbeit vorzustellen und neue Anregungen zu bekommen. Außerdem danke ich Herrn Prof. Dr. Manfred Denker f¨urdie Ubernahme¨ des Erstgutachtens. Weiterhin m¨ochte ich mich bei Herrn Prof. Dr. Jonathan K. Pritchard daf¨urbedanken, mir den Quellcode seiner Programme STRUCTURE und STRAT sowie sein eigenes Simulationsprogramm zur Verf¨ugung gestellt zu haben. Ein besonderes Dankesch¨ongilt auch Dr. med. Michael Steffens f¨urdie Diskussionen ¨uber die Genomic Control Studie sowie Melanie Bergmann f¨ur das sehr sorgf¨altige und zeitaufwendige Korrekturlesen meiner Arbeit. Außerdem bedanke ich mich bei meinen Kolleginnen und Kollegen der Abteilungen Genetische Epidemiologie und Medizinische Statisitik, die mir immer geholfen haben, offene Fragen zu kl¨aren. Nicht vergessen zu erw¨ahnen m¨ochte ich meine Familie und meine Freunde, die jederzeit eine große moralische Unterst¨utzung f¨urmich waren. Die Arbeit wurde teilweise durch das Bundesministerium f¨urBildung und Forschung (BMBF) im Rahmen der genetisch-epidemiologischen Methodenzentren im nationalen Genomforschungsnetz gef¨ordert(F¨ordernummern: 01GS0204, 01GR0462, 01GS0422). Contents 1 Introduction 1 2 Population genetics 6 2.1 Introduction to Genetics . 6 2.2 Random mating populations . 8 2.2.1 Hardy-Weinberg equilibrium . 8 2.2.2 Linkage equilibrium . 9 2.2.3 Linkage disequilibrium due to tight linkage . 10 2.2.4 Extension to multiple alleles . 10 2.3 Models for structured populations . 11 2.3.1 The general concept of inbreeding and allelic correlations . 11 2.3.2 Theoretical models for discrete subpopulations . 13 2.3.3 Extension of the models to multiple alleles . 16 2.3.4 Incorporating admixture . 17 2.4 Inference in subdivided populations . 18 2.4.1 Estimation of the fixation index FST .................. 18 2.4.2 The prediction rate . 19 3 Genetic case-control association studies 22 3.1 Concepts . 22 3.1.1 General concepts of case-control studies . 22 3.1.2 Concepts of genetic case-control studies . 23 3.2 Analysis of genetic association in a homogeneous population . 24 3.2.1 The genotypic odds ratio and the genotypic relative risks . 24 3.2.2 The multiplicative penetrance model and the allelic odds ratio . 25 3.2.3 Commonly applied tests for association . 27 3.2.4 Derivation of test statistics for the allelic 2 × 2 table . 28 3.3 Stratified analysis . 30 3.3.1 The bias of the allelic χ2-test . 30 3.3.2 Stratified tests for a series of 2 × 2 tables . 31 3.4 Logistic regression . 36 3.4.1 The logistic regression model for case-control data . 36 3.4.2 The logistic regression model applied to genetic data . 37 4 Genomic Control 39 4.1 The variance inflation of the allelic χ2-test . 39 4.2 The method of Genomic Control . 43 4.2.1 The general concept . 43 4.2.2 The observed type-I error rate . 45 5 Structured Association 47 5.1 Inference on population structure . 47 5.1.1 The standard mixture model . 48 5.1.2 The mixture model for case-control data . 50 5.1.3 The bias of the standard EM algorithm . 52 5.1.4 Extensions of the EM algorithm . 54 5.1.5 Estimation of the number of subpopulations . 55 5.1.6 The EM algorithm with admixture of Purcell and Sham (2004) . 56 5.1.7 The Bayesian admixture model of Pritchard et al. (2000a) . 57 5.2 Association tests based on the inferred structure . 59 5.2.1 The likelihood function . 59 5.2.2 The likelihood ratio test of Pritchard et al. (2000b) . 61 5.2.3 The Wald test . 62 5.2.4 Logistic regression as an alternative approach . 65 5.3 One-step approaches . 66 5.3.1 The one-step mixture model of Satten et al. (2001) . 66 5.3.2 The one-step Bayesian model of Hoggart et al. (2003) . 67 5.4 Discussion . 68 6 Results from population based studies 70 6.1 Results from previous studies . 70 6.1.1 The amount of stratification in real populations . 70 6.1.2 The impact of population stratification on case-control studies . 71 6.2 The German Genomic Control Study . 73 6.2.1 The study sample . 74 6.2.2 Population structure in the study sample . 74 6.2.3 Consequences for case-control studies within Germany . 76 6.3 Discussion . 77 7 Simulations 79 7.1 The set-up of the simulation study . 79 7.1.1 The simulation model . 79 7.1.2 Details of the simulations . 80 7.1.3 The simulation parameters . 81 7.2 Results . 82 7.2.1 Inference on population structure . 82 7.2.2 Theoretical variance inflation for the simulation scenarios . 85 7.2.3 Type-I-error rate of the association tests . 85 7.2.4 Power of the association tests . 88 7.3 Discussion . 91 7.3.1 Summary . 91 7.3.2 Comparison to other simulation studies . 93 7.3.3 Limitations of the simulations . 94 8 Summary and outlook 96 A Appendix 98 A.1 Notation . 98 A.2 Statistical Theory . 100 A.2.1 Likelihood based tests of significance . 100 A.2.2 The EM algorithm . 102 References 104 Index 110 Curriculum Vitae 112 1 Introduction In a case-control study a group of subjects affected with a disease (cases) is compared to a group of unaffected subjects (controls) with respect to potential exposures of risk factors. In genetic case-control association studies the exposures of primary interest are genetic factors. In the candidate gene approach so called candidate genes expected to be functionally related to disease causing pathways are chosen to be investigated. Each study participant is genotyped at one or a few genetic marker loci which are specific positions on the chromosomes lying directly on or very close to the candidate genes. As the geno- typing result the genotype consisting of two alleles at the marker locus is determined for each individual. Most often these marker loci are single nucleotide polymorphisms which have two possible alleles and three possible genotypes in the population. The genotype distribution at a marker locus is investigated with respect to differences between cases and controls. In case of such a difference marker locus and disease are said to be associated. The simplest test for association between the alleles at the marker locus and the disease status is Pearson’s χ2-test for a 2 × 2 table. However, there are different reasons for an association. The first reason is that the marker locus is itself a locus directly related to the disease, thus the association is causal. An association also can be observed if marker and disease locus are in close proximity on the genome due to linkage disequilibrium caused by the tendency of certain alleles appearing together on short chromosomal segments. However, an association can also be observed due to unobserved population stratification. Li (1969) first noted the possible importance of unobserved population structure for ge- netic association studies. Population structure may lead to spurious associations if the allele distribution is different between the subpopulations and if the general disease risk varies between the subpopulations. Under these conditions population stratification is said to act as a confounder in the case-control association study if the general epidemiological terminology is used. If population stratification is not accounted for the number of false positive association tests increases. In the middle of the nineties with improvements in genotyping technology the concerns about population stratification became more relevant (e.g. Lander and Schork, 1994). Furthermore, it turned out that association analysis would become an important tool to find the genetic basis of complex diseases (e.g. Risch and Merikangas, 1996). Most of the common diseases are complex diseases where there clearly is a genetic basis but the genetic model is not clear. Many genes with small genetic effects are expected to contribute to a complex disease but the knowledge about complex diseases is still rather small. To overcome the problems of population stratification Spielman et al. (1993) introduced the TDT (transmission disequilibrium test) as one of the first family based association tests and Ewens and Spielman (1995) explicitly showed that this test for association is robust against population structure. Case-parents trios consisting of one affected subject and its parents have to be recruited and the transmission of alleles 2 1 Introduction from the parents to the affected offspring is investigated. Subsequently, the number of family-based association tests has been grown rapidly including sibs of the proband and allowing for missing parental information (e.g. Spielman and Ewens, 1998; Knapp, 1999). Rabinowitz and Laird (2000) developed FBAT as a unified approach for the analysis of family based association studies including pedigrees with arbitrary structure and arbitrary missing marker information. However, there are also a lot of convincing arguments against the use of family-based association tests (e.g Risch and Teng, 1998). The most important argument is that case-control studies are more easy to be implemented than family-based studies. For complex diseases large samples are required to identify genes with small ef- fects.