MSc (2nd cycle), FCUP, 2019

Quantifying the genetic predisposition to a complex disease through genome-wide association

Ana Margarida Carrapatoso Macedo

Master’s degree thesis presented to Faculdade de Ciências da Universidade do Porto in Mathematical Engineering

Department of Mathematics
2019
Supervisor Alexandra Lopes, Assistant Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto
Co-supervisor Nádia Pinto, Junior Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto, CMUP – Centro de Matemática da Universidade do Porto

All corrections determined by the jury, and only those, have been made.

The President of the Jury,

Porto, ______/______/______
Acknowledgements (Agradecimentos)

To Ipatimup and the Population Genetics group, who welcomed me into the world of scientific research.

To my supervisors Alexandra Lopes and Nádia Pinto, who were tireless and always available for my questions, believed in me and gave me the opportunity to be part of this project until the end.

To my brother, who corrected my strangest paragraphs whenever the English language failed me.

To my everyday companion, who never stops believing in me and encourages me to always do more.

To my parents, who provided me with a higher education and never doubted my choices.
Abstract
The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under analysis.

The methods explored in detail included the quality control steps of genetic data, statistical tests for common-variant association (Pearson’s chi-squared test, Fisher’s exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.

Subsequently, the methods were applied to a data set of Alzheimer’s disease (AD) patients and healthy controls from the north of the Iberian Peninsula, in the scope of the multicenter study "AD-EEGWA". The data set is new and comes from a population that is still understudied with regard to genetic association with this disease. Besides the genetic component and biographic data, we had access to electroencephalography measures for most of the study participants. The project is ongoing, and more biological samples are being collected to increase the power of the genetic analyses.

The SKAT-O method identified one gene (PLEKHA5) with significantly different minor allele frequencies between cases and controls in our sample: collectively, the rare alleles were present in 10% of controls, but in just 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between AD cases and controls.

The common-variant association methods identified nine SNPs with nominally significant differences in allele and genotype distribution between cases and controls, three of which had been associated with Alzheimer’s disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from EEGs. Four out of nine SNPs showed significant differences for some brainwaves, concerning mean relative power values within cases and controls with different alleles/genotypes. However, the results did not match what was expected, considering the EEG brainwave behavior in AD cases and controls and the possible risk allele of each SNP.

All in all, we concluded that, even in small samples, it is possible to find an association between phenotypes and the aggregate effects of rare variants. It will be interesting to replicate the study in a larger sample and see whether these results hold. The lack of agreement between the obtained and the expected results in the analysis of the genetics-EEG relation motivates new approaches to the problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined disease endophenotypes, to help diagnosis.

Keywords: association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare variant, endophenotype, electroencephalogram
Summary (Resumo)

The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under study.

The methods studied in most detail included the quality control steps for genetic data, statistical tests for common-variant association (Pearson’s chi-squared test, Fisher’s exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.

Subsequently, the methods were applied to a data set of Alzheimer’s patients and healthy controls from the northern region of the Iberian Peninsula, within the multicenter project "AD-EEGWA". This is a new data set from a population that is still little studied from the point of view of genetic association with the disease. Besides the genetic component and biographic data, we had access to electroencephalogram measures for most of the study participants. The project is still ongoing, and more biological samples are being collected in order to bring more power to the genetic analyses.

With the SKAT-O method, it was possible to identify one gene (PLEKHA5) with significant differences in the frequencies of its rare alleles between cases and controls in our sample: collectively, the rare alleles were present in about 10% of controls, but in only 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between Alzheimer’s cases and controls.

The common-variant association methods identified nine SNPs with nominally significant differences in the distribution of alleles and genotypes between cases and controls, three of which had already been associated with Alzheimer’s disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained with the EEGs. Of the nine SNPs, four showed significant differences for some brainwaves, regarding the mean "relative power" values in cases and controls with different alleles/genotypes. However, the results obtained did not match what was expected, taking into account the EEG values in Alzheimer’s cases and controls and the possible risk allele of each SNP.

We concluded that, even with small samples, it is possible to find an association between phenotypes and the aggregate effect of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of agreement between the results of the analysis of the genetics-EEG relation and what was expected motivates new approaches to this problem. The complexity of this disease strongly encourages insisting on an interdisciplinary approach that explores the effect of genotypes and well-defined endophenotypes of the disease, to aid in its diagnosis.

Keywords: association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare variant, endophenotype, electroencephalogram
Contents

Acknowledgements (Agradecimentos)
Abstract
Summary (Resumo)
List of Tables
List of Figures
List of Abbreviations

Introduction

1 Theoretical framework
  1.1 Introductory concepts of biology and genetics
  1.2 Population genetics concepts
  1.3 Association studies
  1.4 An introduction to Alzheimer’s disease

2 Study design and data quality control
  2.1 Study design
  2.2 Data collection and variant calling
  2.3 Variant quality control

3 Models of association
  3.1 Statistical tests for common variants
  3.2 Rare variant association approaches
  3.3 P-value adjustment
  3.4 The odds ratio

4 An application of association studies to Alzheimer’s disease
  4.1 Aim and objectives
  4.2 Subjects and methods
  4.3 Data quality control
  4.4 Rare-variant analysis
  4.5 Analysis of electroencephalography data
  4.6 Discussion

Conclusion

Appendices
  A Informed consent of participation in research study
  B Mini Mental State Examination (MMSE)
  C Lists of nominally significant variants in SKAT-O
    C.1 Model 1 (Sex as covariate)
    C.2 Model 2 (Sex, age, PC1 and PC2 as covariates)
List of Tables

1.1 (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).

1.2 Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.

1.3 Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.

1.4 Observed haplotype frequencies in the population, considering linkage disequilibrium.

1.5 Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.

1.6 Mating outcomes assuming Hardy-Weinberg equilibrium.

1.7 Causes of Alzheimer’s disease.

1.8 APOE allele according to the genotype for SNPs rs429358 and rs7412.

3.1 (a) Contingency table of allele counts; (b) Contingency table of genotype counts.

3.2 Counts of cases and controls in each of the n exposure categories Ej, in a sample of c individuals.

4.1 Summary table of the individual counts after each per-individual QC step, according to disease status, gender and country of origin (note: whenever male and female counts did not add up to the "Total" column, it was due to the presence of individuals with unknown gender; this issue was overcome as of the sex check step).

4.2 Summary table of probe/variant counts at each per-marker QC step.

4.3 Disease status vs. gender distribution of the QCed sample.

4.4 Identifier and description of each gene list to be tested for association, and number of genes and rare variants they contain.

4.5 Number of significant genes without and with p-value correction in each gene list and model (Model 1 – sex as the only covariate; Model 2 – sex, age, PC1 and PC2 as covariates).

4.6 Frequency and properties of PLEKHA5 rare variants identified in our sample. (*) in complete LD (r² = 1.0). Notes: "PHRED" refers to the PHRED-scaled CADD score; MA - minor allele; MAF - minor allele frequency; "gnomAD NFE" refers to the Non-Finnish European population of the gnomAD database (the number of genotyped alleles for each SNP was, respectively, 129 088, 129 088, 75 296 and 113 128); the p-values refer to Fisher’s exact test for differences between control frequencies in our data and the gnomAD database.

4.7 P-values obtained in ANOVA tests when testing for differences in RP of each of the brainwaves between cases and controls.

4.8 P-values obtained in Tukey test for multiple comparisons of the RP in each brainwave between controls and cases in each disease stage (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD). The values below 0.05/30 = 1.67 × 10⁻³ are underlined and in bold.

4.9 Gene and p-values obtained in the allelic and genotypic tests for the SNPs which were nominally significant in both at the α = 0.05 level. (*) SNPs previously associated to AD.

4.10 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different alleles for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.11 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different genotypes for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.12 Minor and alternative alleles of each SNP and their respective frequencies of the minor allele in the sample, in cases and in controls. (*) SNP previously associated to AD.

C.1 Nominally significant genes in SKAT-O under null Model 1 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.2 Nominally significant genes in SKAT-O under null Model 1 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.3 Nominally significant genes in SKAT-O under null Model 1 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.4 Nominally significant genes in SKAT-O under null Model 1 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.5 Nominally significant genes in SKAT-O under null Model 2 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.6 Nominally significant genes in SKAT-O under null Model 2 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.7 Nominally significant genes in SKAT-O under null Model 2 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.8 Nominally significant genes in SKAT-O under null Model 2 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
List of Figures

1.1 Mendel’s fundamental experiment.

4.1 Missing data rate vs. heterozygosity across individuals passing the DQC and QCCR steps. Shading indicates sample density; the dashed lines represent the defined heterozygosity threshold; the outliers are highlighted in red.

4.2 (a) Plot of the first principal component against the second, calculated with the software EIGENSOFT; (b) Plot of the same PCs as in (a), zoomed in on the cluster which contains the European populations (the outliers are encircled in red). In the legends, PT and ES represent the Portuguese and Spanish individuals in our sample, respectively; the remaining populations come from the 1KGP dataset.

4.3 Distribution of age at the time of sample collection.

4.4 Proportion of variance explained by the first i principal components; (b) is a zoom-in of (a) on the first 100 PCs.

4.5 Plot of the first and second principal components, restricted to the sample subjects.

4.6 Probability density function of Beta(p, 1, 25).

4.7 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in cases and controls.

4.8 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in controls and mild, moderate and severe AD cases (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD).

4.9 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs71336232.

4.10 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs7144273.

4.11 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833211.

4.12 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833214.
List of Abbreviations

1KGP    The 1000 Genomes Project
AD      Alzheimer’s disease
APT     Affymetrix Power Tools
ANOVA   analysis of variance
CADD    combined annotation dependent depletion
CDCV    common disease common variant
CDRV    common disease rare variant
CG      candidate gene
CT      computed tomography
df      degrees of freedom
DGE     differential gene expression
DNA     deoxyribonucleic acid
DQC     dish quality control
EEG     electroencephalogram
GUI     graphical user interface
GWA     genome-wide association
HWE     Hardy-Weinberg equilibrium
IBD     identity by descent
IBS     identity by state
indel   insertion/deletion variant
LD      linkage disequilibrium
MAF     minor allele frequency
MCI     mild cognitive impairment
MIL     mild Alzheimer’s disease
MMSE    Mini Mental State Examination
MOD     moderate Alzheimer’s disease
MRI     magnetic resonance imaging
NFE     non-Finnish European
OR      odds ratio
PC(A)   principal component (analysis)
QC      quality control
QCCR    quality control call rate
RNA     ribonucleic acid
SEV     severe Alzheimer’s disease
SKAT    sequence kernel association test
SNP     single nucleotide polymorphism
SNV     single nucleotide variant
UTR     untranslated region
Introduction
This work aimed to describe the general procedures and methods used in genome-wide association studies, and to apply them to a data set of Alzheimer’s disease cases and controls from the Iberian Peninsula. The work is reported in the present dissertation, which is structured as follows.

To place the work in context, Chapter 1 introduces some basic concepts of biology and population genetics, and the general idea behind association studies. It also briefly introduces Alzheimer’s disease, its symptoms, causes and means of diagnosis.

Chapter 2 describes the procedures for study design and quality control of genomic data. Here, we present relevant thresholds for various quality measures, used as criteria to keep or discard individuals and genetic variants from the study.

Chapter 3 describes in detail the models of association used to study the effect of both common and rare variation on a complex phenotype. We focus on case-control studies, but also consider quantitative traits whenever relevant.

Chapter 4 applies the previously described methods to a data set of Alzheimer’s patients and controls. The analysis is split into two distinct phases. In the first, given the modest sample size, we focus on the aggregate effect of rare variation on the disease, using the SKAT-O method. In the second, we assess the behavior of different brainwaves at various stages of the disease; we also seek a possible association between a set of genetic variants and the EEG values at different frequency bands.

Finally, in the Conclusion, we make some considerations about the work, namely on the importance of an approach combining genetic data and other endophenotypes associated with the disease, for a hopefully more precise diagnosis.
Chapter 1
Theoretical framework
1.1 Introductory concepts of biology and genetics
All our ideas about the transmission of specific characters and changes in the characteristics of populations make use of concepts introduced in 1865 by Gregor Mendel, who is often referred to as the Father of Genetics [1]. Working in a small garden with nothing but peas as his material, he was able to formulate a hypothesis that explains the inheritance of some traits in a very simple way.

The experiment, in simplified form, was as follows: Mendel crossed purebred plants with green peas and purebred plants with yellow peas, obtaining a first hybrid generation of all yellow peas; he then crossed these hybrid plants among each other and obtained peas of both colors, in a proportion of approximately 3 yellow to 1 green. This experiment is illustrated in Figure 1.1 and Table 1.1.
Initial generation:       AA   aa
1st hybrid generation:    Aa   Aa
2nd hybrid generation:    AA   Aa   Aa   aa

Fig. 1.1 – Mendel’s fundamental experiment.
From these results, Mendel formulated his hypothesis for sexual reproduction, which can be expressed as follows:
1. Each character of an individual is controlled by two "factors", the alleles, one of which the individual receives from his father, and the other from his mother.

2. Of the two alleles carried by the individual, one is expressed (dominant), while the effect of the other may not be apparent (recessive).

3. A reproductive cell (egg or sperm, in humans) produced by an individual bears, for each character, one and only one of the two alleles which the individual carries.
In the initial generation, the plants with yellow peas carried only "yellow" alleles; their genetic representation for this trait is (AA). The plants with green peas carried only "green" alleles, hence having genetic constitution (aa). Crossing individuals of these two types can only generate individuals of one type, (Aa). When crossing (Aa) individuals with each other, it is possible to obtain individuals with genetic constitution (AA), (aa) or (Aa), with probabilities 1/4, 1/4 and 1/2, respectively. Because A is dominant over a, the (Aa) peas are yellow, which explains the observed proportions: all yellow peas in the 1st hybrid generation, and 3 yellow peas to 1 green in the 2nd.
(a)
          a       a
   A    (Aa)    (Aa)
   A    (Aa)    (Aa)

(b)
          A       a
   A    (AA)    (Aa)
   a    (Aa)    (aa)
Tbl. 1.1 – (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).
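The probabilities quoted above can be checked by enumerating the gametes of each parent. The following is a minimal sketch in Python, assuming a single locus with two alleles written as "A" and "a"; the function names `cross` and `yellow_fraction` are illustrative and not part of the original work.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Return offspring genotype probabilities for a one-locus cross.

    Each parent is a two-letter genotype such as "Aa"; every gamete
    carries one of the parent's two alleles with probability 1/2.
    """
    counts = Counter(
        # Sort so that "aA" and "Aa" count as the same genotype,
        # with the uppercase (dominant) allele written first.
        "".join(sorted(g1 + g2, key=lambda allele: (allele.lower(), allele.islower())))
        for g1, g2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def yellow_fraction(offspring: dict[str, Fraction]) -> Fraction:
    """Probability of the dominant (yellow) phenotype: at least one 'A'."""
    return sum((p for g, p in offspring.items() if "A" in g), Fraction(0))


first_hybrid = cross("AA", "aa")    # every offspring is (Aa)
second_hybrid = cross("Aa", "Aa")   # (AA), (Aa) and (aa) offspring
```

Here `second_hybrid` assigns 1/4 to (AA), 1/2 to (Aa) and 1/4 to (aa), and `yellow_fraction(second_hybrid)` is 3/4, matching the 3 to 1 proportion of yellow to green peas.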
Mendel’s experiment was a starting point for modeling the inheritance of other, more complex phenotypes, namely diseases, which involve the contribution and/or interaction of various genes [2].

The human body is composed of many billions of cells of different types. Cells are constantly dying and being replaced by newly formed ones through the process of cell division, the source of an individual’s development and growth. The very first cell of an individual already carries all of the genetic information he/she will bear throughout his/her whole life, encoded in its DNA.

DNA, or deoxyribonucleic acid, is a double helix molecule about 3 billion base pairs long, made of two complementary chains of nucleotides. A nucleotide is composed of one of four chemical bases, adenine (A), thymine (T), cytosine (C) or guanine (G), attached to one sugar molecule and one phosphate molecule. Each A in one chain pairs up with a T in the opposite chain, and each C with a G. Therefore, taking one of the chains as reference, DNA can essentially be seen as a single sequence of As, Ts, Cs and Gs. During cell division, DNA replicates itself, providing each new cell with an identical copy of all genetic material (if no mutation occurs). This process is called "replication".
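Because of this pairing rule, one chain determines the other. A minimal Python sketch of the idea follows; the sequence is made up for illustration, and the function name `complementary_chain` is hypothetical.

```python
# Base pairing as described in the text: A with T, and C with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_chain(reference: str) -> str:
    """Return the base facing each position of the reference chain."""
    return "".join(PAIRING[base] for base in reference.upper())


reference = "ATGCCGTA"                      # illustrative sequence, not real data
opposite = complementary_chain(reference)   # base facing each reference position
```

Applying the function twice returns the reference chain, which is why a single sequence of As, Ts, Cs and Gs is enough to describe the molecule.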
DNA is tightly coiled into structures called "chromosomes". Most human cells, the somatic cells, contain 23 pairs of chromosomes, for a total of 46, and are hence called "diploid". The exceptions are the reproductive cells or gametes, which carry only half of an individual’s genetic material, i.e., 23 unpaired chromosomes, and are thus "haploid".

During cell division, a phenomenon called "genetic recombination" can occur: the arms of paired chromosomes cross over, leading to the exchange of genetic material between them. Recombination occurs between homologous regions, meaning that the alleles or genes are arranged in a similar order in both chromosomes.

In somatic cells, all chromosome pairs but one are termed "autosomes"; the one pair that differs in its structure, usually referred to as the 23rd pair, is the pair of sex chromosomes, which determines human sex: females have two X chromosomes, whereas males have one X and one Y. In each reproductive cell, since only half of the genetic material is present, there is an X chromosome in female gametes and either an X or a Y in male ones; indeed, it is the information carried in the paternal gamete that determines the sex of the offspring.
The set of all genetic information of an individual is called his/her "genome". The "exome" is the small fraction (about 1%) of the genome known to encode proteins. The information to produce a functional protein is encoded in a special unit of DNA at a determined genetic site ("locus"), called a "gene". Each gene can have various modes of action, determined by variants of its sequence. The multiple versions of a gene are called "alleles". A gene is said to be "polymorphic" if, for its locus, the minor allele frequency (MAF) is at least 1% within a population [3].
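As an illustration of the 1% criterion, the minor allele frequency at a biallelic locus can be computed from genotype counts. The sketch below is in Python and assumes diploid individuals and two alleles, A and a; the counts and function names are made up for the example.

```python
def allele_frequencies(n_AA: int, n_Aa: int, n_aa: int) -> tuple[float, float]:
    """Frequencies of alleles A and a from genotype counts at one locus."""
    n_alleles = 2 * (n_AA + n_Aa + n_aa)   # each individual carries two alleles
    p_A = (2 * n_AA + n_Aa) / n_alleles
    return p_A, 1.0 - p_A


def minor_allele_frequency(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Frequency of the less common of the two alleles."""
    return min(allele_frequencies(n_AA, n_Aa, n_aa))


def is_polymorphic(maf: float, threshold: float = 0.01) -> bool:
    """Apply the MAF >= 1% criterion quoted in the text."""
    return maf >= threshold


# Made-up sample of 100 individuals: 90 (AA), 9 (Aa) and 1 (aa).
maf = minor_allele_frequency(n_AA=90, n_Aa=9, n_aa=1)
```

In this made-up sample, allele a appears 11 times among 200 alleles, so the MAF is 0.055 and the locus counts as polymorphic.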
Because humans have twenty-two pairs of autosomes, most genes are represented twice in our genome, through alleles that may or may not be identical. A "genotype" is defined as the set of alleles found at a given locus of an individual. A "phenotype" is an observable trait that concerns a particular locus (or the combined action of several loci). For example, in Mendel’s experiment described above, the color of the peas is a phenotype, whereas the genetic constitution, (AA), (Aa) or (aa), is the genotype.

An "endophenotype" is somewhere between the previous two definitions: it is a quantitative, non-observable trait that also shows a genetic connection. A set of genotypes observed at linked loci of one individual is called his/her "haplotype".

For each autosomal locus, an individual is said to be "homozygous" if he/she carries two copies of the same allele, and "heterozygous" if the alleles are different. We can also refer to homozygous individuals as "homozygotes", and to heterozygous individuals as "heterozygotes".

It is the union of one male and one female reproductive cell that generates a "zygote", i.e., a fertilized egg cell, with the full complement of hereditary information necessary for the development of a human being. At the moment of fertilization, the new individual receives, for each autosomal locus, one allele from the father and one from the mother; as for the 23rd pair, if the new individual is female, then she has inherited one X chromosome from each of her parents, while if he is male, he has inherited an X chromosome from his mother and a Y from his father. Women can be homozygous or heterozygous for genes in the 23rd chromosome pair; men can only be "hemizygous", because X and Y are not homologous along their whole length, except for short "pseudoautosomal regions" at their tips.

Individuals with different genotypes may show similar phenotypes due to dominance-recessiveness relationships between alleles. Allele A is said to be "dominant" to a (or, equivalently, a is "recessive" to A) if the action of A, but not that of a, is manifested in the phenotype of an (Aa) heterozygote. Alleles A and B are said to be co-dominant if they are both expressed in an (AB) individual’s phenotype. An example of co-dominant expression is the AB blood type in the human ABO blood system, as shown in Table 1.2 [3]. This table also illustrates the dominance of alleles A and B over allele O.
          A      B      O
   A      A      AB     A
   B      AB     B      B
   O      A      B      O

Tbl. 1.2 – Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.
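The dominance rules behind Table 1.2 can be written as a small mapping from genotype to phenotype. This is a minimal Python sketch under the simplified three-allele model used in the text; the function name `abo_blood_group` is illustrative.

```python
def abo_blood_group(allele_1: str, allele_2: str) -> str:
    """Blood group (phenotype) for a pair of ABO alleles (genotype)."""
    alleles = {allele_1, allele_2}
    if not alleles.issubset({"A", "B", "O"}):
        raise ValueError("alleles must be 'A', 'B' or 'O'")
    if alleles == {"A", "B"}:
        return "AB"   # co-dominance: both alleles are expressed
    if "A" in alleles:
        return "A"    # A is dominant over O
    if "B" in alleles:
        return "B"    # B is dominant over O
    return "O"        # only (OO) individuals show group O


# Rebuild the whole table: one entry per pair of inherited alleles.
table = {(a1, a2): abo_blood_group(a1, a2) for a1 in "ABO" for a2 in "ABO"}
```

Only the unordered pair of alleles matters, so the resulting table is symmetric, as in Table 1.2.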
1.2 Population genetics concepts
Population genetics involves the study of genetic variation within and between populations, by examining allele frequencies at different loci over time and space. Mathematical models are used to investigate and predict the occurrence of specific alleles (or combinations of alleles) in populations, based on the ever-increasing understanding of genetics and evolution. As such, it becomes necessary to introduce some concepts regarding genetic variants, how they relate to each other and how they influence the evolution of species.
1.2.1 Genetic variants
Chromosomes are not perfectly stable entities: changes in the DNA sequence may occur as a result of external or internal factors, such as the interaction with radiation, chemicals or viruses, or simply an error during the replication process. These changes are called "mutations", and give rise to genetic variants; if their frequency in a population is above 1%, they are considered common and called "polymorphisms" [3]. For example, the replacement of one nucleotide by another is called a "single nucleotide variant" (SNV), or a "single nucleotide polymorphism" (SNP) in case it is common. The most common single base-pair changes occur within each of the two classes of bases, purines (A↔G) and pyrimidines (C↔T), and most SNPs in a population are "biallelic", i.e., they have only two alleles. Another class of well-known variants are indels: insertions or deletions of a portion of DNA no larger than 1 000 bases.

Variants are part of the evolution of species. Some variants do not change an individual’s phenotype, while others may greatly affect it; some variants can increase an individual’s fitness in the surrounding environment, while others can have deleterious effects and generate disease. "Monogenic" diseases result from deleterious variants in a single gene. They are inherited according to Mendel’s laws, hence also being called "Mendelian" diseases. If a disease results from the joint contribution of a number of independently acting or interacting genes, it is called "polygenic".

Variants can be classified in numerous ways, depending on their length, placement and function. Exonic variants are located in portions of a gene that encode a part of the final mature RNA produced by that gene, and hence may change the resulting amino-acid sequence. UTR variants are in untranslated regions but may have an important role in regulating gene expression. These are variants of potential functional importance and could be good candidates for further analysis in association studies.

A SNV that is in a coding region of the genome but results in no change to the encoded amino acid is called a "synonymous" substitution; when a SNV changes the encoded amino-acid sequence, it is termed "non-synonymous". Indel variants may yield "frameshift" variations, which are potentially deleterious. "Stop-gain" and "stop-loss" variations result, respectively, in a premature termination and an abnormal extension of the protein translation process, and thus alter the protein itself. Of these classes of genetic variants, the synonymous substitutions are the least functionally relevant, and it is not uncommon to prioritize the analysis of variants falling in the remaining classes when searching for an association with a phenotype or disease [3].

Some genetic variants are known to be more likely to "travel together" from generation to generation than would be expected if different loci associated in a random manner. This phenomenon of non-random association is termed "linkage disequilibrium".
1.2.2 Linkage disequilibrium
Various studies have confirmed that the inheritance of certain alleles within a population is often correlated, causing many individuals to share the same haplotype. The alleles are thus said to be in linkage disequilibrium (LD). Even though genetic distance influences LD, it does not necessarily cause it; two loci being in LD simply means that the alleles appear together in the same population more (or less) frequently than would be expected by chance.
Suppose that allele A at locus 1 and allele B at locus 2 are found at frequencies p and q, respectively, in the population. If the two loci were independent, then we would expect to see the [AB] haplotype at frequency pq; however, if the frequency of the [AB] haplotype is either higher or lower than pq, then the two loci may be in LD.
Let us consider two biallelic loci on the same chromosome, with alleles A and a at the first locus, and B and b at the second. Their allelic frequencies in the population are pA, pa, pB and pb (note that, because the loci are biallelic, pa = 1 − pA and pb = 1 − pB); the haplotype frequencies are pAB, pAb, paB and pab. Table 1.3 shows the observed haplotype frequencies and those expected under linkage equilibrium; Table 1.4 shows how the observed haplotype frequencies are written once a measure of LD, D, is included.
     Observed frequencies          Expected frequencies
        B        b                    B         b
A      pAB      pAb           A      pApB      pApb
a      paB      pab           a      papB      papb
Tbl. 1.3 – Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.
Observed frequencies
           B             b             Total
A          pApB + D      pApb − D      pA
a          papB − D      papb + D      pa
Total      pB            pb
Tbl. 1.4 – Observed haplotype frequencies in the population, considering linkage disequilibrium.
The measure of linkage disequilibrium D is the difference between the observed frequency of the [AB] haplotype and the frequency expected under equilibrium, defined as
\[ D = p_{AB} - p_A p_B. \]
In order to standardize D, we need to find its boundaries. Using the fact that the observed frequencies must be non-negative, we obtain
pApB + D ≥ 0 ⇔ D ≥ −pApB
papb + D ≥ 0 ⇔ D ≥ −papb
pApb − D ≥ 0 ⇔ D ≤ pApb
papB − D ≥ 0 ⇔ D ≤ papB
Thus we can define
\[ D' = \frac{D}{D_{\max}} \]
where
\[ D_{\max} = \begin{cases} \min\{p_A p_b,\; p_a p_B\}, & \text{if } D > 0 \\ \max\{-p_A p_B,\; -p_a p_b\}, & \text{if } D < 0. \end{cases} \]
Because Dmax takes the sign of D, this normalization causes D′ to range between 0 and 1; a common alternative convention divides D by |Dmax|, so that D′ keeps the sign of D and ranges between −1 and 1. When |D′| = 1, at least one of the haplotypes was not observed; if allele frequencies are similar, a high D′ value means the markers are good surrogates for each other.
Another widely used measure of LD between loci, preferred by population geneticists, is Pearson’s coefficient of correlation r,
\[ r = \frac{D}{\sqrt{p_A\, p_a\, p_B\, p_b}} \]
or, more commonly, its squared value (r²). When r² = 1, the two loci are in total linkage disequilibrium, i.e., they provide identical information; if r² = 0, they are in perfect equilibrium, i.e., the genetic information is transmitted independently [4]. Typically, two loci are considered to be correlated when r² is greater than 0.2 [5].
Linkage disequilibrium is of major importance in association studies, namely at the level of marker selection in the study design phase. Indeed, mapping LD across the human genome has made it possible to deduce an individual’s genotype at a given locus from others in high disequilibrium. This is done by strategically choosing single tagSNPs to represent entire haplotypes of regions in high LD, which results in a less costly study.
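The LD measures defined above can be illustrated with a minimal Python sketch. It follows the definition of Dmax given in the text (where Dmax takes the sign of D); the function name `ld_measures` and the example frequencies are hypothetical and serve only as an illustration.

```python
from math import sqrt


def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r^2) for two biallelic loci from haplotype and allele frequencies."""
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B                      # observed minus expected [AB] frequency
    if D > 0:
        D_max = min(p_A * p_b, p_a * p_B)
    elif D < 0:
        D_max = max(-p_A * p_B, -p_a * p_b)   # negative bound, as defined in the text
    else:
        D_max = None                          # D = 0: the loci are in equilibrium
    D_prime = 0.0 if D_max is None else D / D_max
    r = D / sqrt(p_A * p_a * p_B * p_b)       # Pearson correlation between the loci
    return D, D_prime, r * r


# Hypothetical frequencies, chosen only for illustration.
D, D_prime, r2 = ld_measures(p_AB=0.1, p_A=0.5, p_B=0.5)
print(f"D = {D:.3f}, D' = {D_prime:.3f}, r^2 = {r2:.3f}")
```

The sketch assumes both loci are polymorphic (allele frequencies strictly between 0 and 1); otherwise the denominator of r is zero.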
1.2.3 Identity by descent
Even after taking LD into account, loci which are independent within a population may still show significant similarities among individuals, introducing a degree of relatedness which must be accounted for, especially when performing an association study with a sample of unrelated individuals. If relatives are present, a bias may be introduced to the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies in the entire population.
An important measure of relatedness used to identify such cases is identity by descent (IBD), a degree of recent shared ancestry for a pair of individuals. Two alleles are said to be IBD if and only if they have descended from the same ancestral allele. Mutation breaks identity by descent. Two individuals are said to be related if they may share IBD alleles. There is some point in the past beyond which individuals are assumed to be unrelated. Identical twins are expected to have a proportion of shared IBD alleles equal to 1; first-degree relatives, 0.5; second-degree relatives, 0.25; and so on [5].
A similar concept is that of identity by state (IBS), which is based on the average proportion of indistinguishable alleles shared at genotyped variants for each pair of individuals. Therefore, two alleles which are IBD are also IBS, but the opposite may not be true, because alleles IBS may not originate from the same common ancestor; similarly, a pair of individuals may share more alleles IBS than IBD, but the opposite can never occur.
Purcell et al. [6] considered a method-of-moments approach to estimate the probability of sharing 0, 1, or 2 IBD alleles for any pair of individuals from the same homogeneous, random-mating population. Denoting IBS states as I and IBD states as Z (in both cases, the possible states being 0, 1, and 2), we have that
\[ P(Z = 0) = \frac{N(I = 0)}{N(I = 0 \mid Z = 0)} \]
\[ P(Z = 1) = \frac{N(I = 1) - P(Z = 0)\, N(I = 1 \mid Z = 0)}{N(I = 1 \mid Z = 1)} \]
\[ P(Z = 2) = \frac{N(I = 2) - P(Z = 0)\, N(I = 2 \mid Z = 0) - P(Z = 1)\, N(I = 2 \mid Z = 1)}{N(I = 2 \mid Z = 2)} \]
where N(I = i | Z = z) is the expected count of variants with IBS state I = i conditional on IBD state Z = z for the entire genome, and is defined as
\[ N(I = i \mid Z = z) = \sum_{m=1}^{L} P(I = i \mid Z = z) \]
where the summation is over all variants with genotype data on both individuals, and the conditional probabilities are calculated as in Table 1.5. We can thus define the proportion of alleles shared IBD as
\[ \hat{\pi} = \frac{P(Z = 1)}{2} + P(Z = 2). \]
I    Z    P(I | Z)
0    0    2p²q²
1    0    4p³q + 4pq³
2    0    p⁴ + 4p²q² + q⁴
0    1    0
1    1    2p²q + 2pq²
2    1    p³ + p²q + pq² + q³
0    2    0
1    2    0
2    2    1
Tbl. 1.5 – Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.
Because of genotyping errors, LD and population structure, a π̂ value above 0.98 is considered sufficient to treat two samples under analysis as duplicates. The usual procedure is to remove one individual from each pair with π̂ > 0.1875 (the value halfway between the expected IBD proportions for second- and third-degree relatives, 0.25 and 0.125) [5].
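The method-of-moments estimator above can be sketched in Python as follows. This is only an illustration under stated assumptions: N(I = i) is taken to be the observed number of variants in IBS state i for the pair, p is the frequency of one allele at each variant, and the function names, the toy input and the omission of any truncation of the estimates to [0, 1] are choices made here, not a description of what PLINK does internally.

```python
def ibs_given_ibd(p):
    """P(I = i | Z = z) for one biallelic variant, keyed by (i, z), as in Table 1.5."""
    q = 1.0 - p
    return {
        (0, 0): 2 * p**2 * q**2,
        (1, 0): 4 * p**3 * q + 4 * p * q**3,
        (2, 0): p**4 + 4 * p**2 * q**2 + q**4,
        (0, 1): 0.0,
        (1, 1): 2 * p**2 * q + 2 * p * q**2,
        (2, 1): p**3 + p**2 * q + p * q**2 + q**3,
        (0, 2): 0.0,
        (1, 2): 0.0,
        (2, 2): 1.0,
    }


def estimate_ibd(ibs_states, allele_freqs):
    """Return (P(Z=0), P(Z=1), P(Z=2), pi_hat) for one pair of individuals."""
    states = list(ibs_states)
    n_obs = [states.count(i) for i in (0, 1, 2)]      # assumed meaning of N(I = i)
    n_exp = {(i, z): 0.0 for i in (0, 1, 2) for z in (0, 1, 2)}
    for p in allele_freqs:                            # sum over the L variants
        for key, prob in ibs_given_ibd(p).items():
            n_exp[key] += prob
    z0 = n_obs[0] / n_exp[(0, 0)]
    z1 = (n_obs[1] - z0 * n_exp[(1, 0)]) / n_exp[(1, 1)]
    z2 = (n_obs[2] - z0 * n_exp[(2, 0)] - z1 * n_exp[(2, 1)]) / n_exp[(2, 2)]
    return z0, z1, z2, z1 / 2 + z2


# Removal threshold quoted in the text: halfway between 0.25 and 0.125.
RELATED_THRESHOLD = (0.25 + 0.125) / 2

# Toy pair sharing both alleles IBS at every variant (e.g. a duplicated sample).
z0, z1, z2, pi_hat = estimate_ibd([2, 2, 2, 2], [0.3, 0.5, 0.2, 0.4])
print(pi_hat, pi_hat > RELATED_THRESHOLD)
```

With real data the raw estimates can fall outside [0, 1] because of sampling noise, so an implementation would need to constrain them; that step is left out of this sketch.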
1.2.4 The Hardy-Weinberg equilibrium
The Hardy-Weinberg equilibrium (HWE) is a law of genetics which states that allele and genotype frequencies in a population will remain constant from generation to generation, under the following assumptions:
(a) the population size is so large that it can be treated as infinite;
(b) generations are discrete, and individuals from different generations do not breed together;
(c) mating is at random;
(d) migration does not occur;
(e) selection does not occur (i.e., individuals with different genotypes are assumed to have equal fitness to the environment);
(f) mutations do not occur (i.e., individuals with genotype (AiAj) can only produce gametes with an Ai or an Aj allele at that locus);
(g) initial genotype frequencies are equal in the two sexes.
The equilibrium in autosomal loci
Let’s suppose, for simplicity and because most human loci are biallelic, that there are n = 2 observed alleles, A1 and A2 with proportions p and q = 1 − p, respectively, for a given locus in a population. There are 3 possible genotypes, (A1A1), (A1A2) (identical to (A2A1)) and (A2A2), with initial proportions u, v and w, respectively.
From the genotype proportions, it is possible to deduce the allele proportions:
\[ p = u + \tfrac{1}{2} v, \qquad q = w + \tfrac{1}{2} v \]
Under the stated assumptions, the next generation will be composed as shown in Table 1.6.
Mating type            Frequency    Nature of offspring
(A1A1) × (A1A1)        u²           (A1A1)
(A1A1) × (A1A2)        2uv          ½ (A1A1) + ½ (A1A2)
(A1A1) × (A2A2)        2uw          (A1A2)
(A1A2) × (A1A2)        v²           ¼ (A1A1) + ½ (A1A2) + ¼ (A2A2)
(A1A2) × (A2A2)        2vw          ½ (A1A2) + ½ (A2A2)
(A2A2) × (A2A2)        w²           (A2A2)
Tbl. 1.6 – Mating outcomes assuming Hardy-Weinberg equilibrium.
The obtained frequencies for the three genotypes (A1A1), (A1A2) and (A2A2) for the first generation are, respectively,
\[
\begin{aligned}
u^2 + uv + \tfrac{1}{4} v^2 &= \left(u + \tfrac{1}{2} v\right)^2 = p^2 \\
uv + 2uw + \tfrac{1}{2} v^2 + vw &= 2 \left(u + \tfrac{1}{2} v\right)\left(w + \tfrac{1}{2} v\right) = 2pq \\
\tfrac{1}{4} v^2 + vw + w^2 &= \left(w + \tfrac{1}{2} v\right)^2 = q^2
\end{aligned}
\tag{1.1}
\]
and, for the second generation,
\[
\begin{aligned}
\left(p^2 + \tfrac{1}{2}\, 2pq\right)^2 &= [p(p + q)]^2 = p^2 \\
2 \left(p^2 + \tfrac{1}{2}\, 2pq\right)\left(q^2 + \tfrac{1}{2}\, 2pq\right) &= 2\, p(p + q)\, q(p + q) = 2pq \\
\left(q^2 + \tfrac{1}{2}\, 2pq\right)^2 &= [q(p + q)]^2 = q^2
\end{aligned}
\tag{1.2}
\]
meaning that, after a single round of random mating under the conditions above, the genotype frequencies stabilize at Hardy-Weinberg proportions [7].
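The derivation above can be checked numerically with a short Python sketch. The function `next_generation` (a name introduced here) applies the mating table once to arbitrary starting genotype proportions u, v and w; the starting values are hypothetical.

```python
def next_generation(u, v, w):
    """Genotype proportions of (A1A1), (A1A2), (A2A2) after one round of random mating."""
    a1a1 = u * u + u * v + 0.25 * v * v
    a1a2 = u * v + 2 * u * w + 0.5 * v * v + v * w
    a2a2 = 0.25 * v * v + v * w + w * w
    return a1a1, a1a2, a2a2


# Hypothetical starting proportions that are NOT in Hardy-Weinberg proportions.
u, v, w = 0.5, 0.2, 0.3
p = u + v / 2            # allele frequency of A1
q = w + v / 2            # allele frequency of A2

gen1 = next_generation(u, v, w)
gen2 = next_generation(*gen1)
print("p^2, 2pq, q^2:", p * p, 2 * p * q, q * q)
print("generation 1: ", gen1)
print("generation 2: ", gen2)
```

Up to floating-point rounding, both generations should match (p², 2pq, q²), which is the content of equations (1.1) and (1.2).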
Testing for equilibrium
Departures from HWE are generally measured at a given SNP using a χ² goodness-of-fit test between the observed and expected genotype counts. The χ² statistic is defined as
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{1.3} \]
where Oi and Ei are the observed and expected absolute frequencies of each genotype in a population at that locus. For a biallelic locus, with three genotype classes and one allele frequency estimated from the data, this test statistic has a χ² distribution with one degree of freedom [8].
A deviation from HWE implies a violation of at least one of the assumptions stated above; it is usually an indication of the presence of population substructure or the occurrence of a genotyping error.
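As an illustration of the test in equation (1.3), the sketch below computes the statistic for a biallelic SNP from hypothetical genotype counts. It assumes one degree of freedom, and obtains the p-value from the identity P(χ²₁ > x) = erfc(√(x/2)), so that only the Python standard library is needed; the function name `hwe_test` is introduced here for the example.

```python
from math import erfc, sqrt


def hwe_test(n_11, n_12, n_22):
    """Chi-squared HWE test from counts of (A1A1), (A1A2), (A2A2); returns (chi2, p_value)."""
    n = n_11 + n_12 + n_22
    p = (2 * n_11 + n_12) / (2 * n)           # estimated frequency of allele A1
    q = 1.0 - p
    observed = (n_11, n_12, n_22)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2))            # upper tail of chi-squared with 1 df
    return chi2, p_value


# Hypothetical counts with a marked deficit of heterozygotes.
chi2, p_value = hwe_test(50, 0, 50)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

The sketch assumes the SNP is polymorphic in the sample; for a monomorphic marker the expected counts contain zeros and the statistic is undefined.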
1.2.5 Population substructure
Population substructure, also referred to as population admixture or population stratification, is the presence of genetic differences between subpopulations of an apparently homogeneous population due to genetic history (e.g., migration, selection, and/or ethnic integration). Principal component analysis (PCA) is widely used to detect and visualize hidden population substructure that is not apparent in the data and which may lead to spurious results when characteristics of the population are analyzed as a whole [8].
The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables (in the case of an association study, these are the thousands of genetic markers), while retaining as much of the variation present in the data set as possible. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.
Assuming our markers are biallelic, the data can be seen as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker, there is a reference and an alternative allele. We suppose there are n markers and m individuals, and that the number of markers is much larger than the number of samples, n ≫ m. Let C(i, j) be the number of reference alleles for marker j, individual i. Thus, for autosomal loci, we have C(i, j) ∈ {0, 1, 2}. For each column of C, we calculate its mean µ(j) and standard deviation σ(j), and obtain a new matrix M
\[ M(i, j) = \frac{C(i, j) - \mu(j)}{\sigma(j)}, \]
generally called the variance-standardized genetic relationship matrix. This normalization step is intended to make the markers (covariates) comparable, setting their mean and variance to 0 and 1, respectively. With this matrix, we can now define
\[ X = M M^{T}, \]
a square m × m matrix, with dimensions equal to the number of sampled individuals. We then compute the eigenvalues of matrix X and the corresponding eigenvectors, which are called the PCs. The eigenvector corresponding to the highest eigenvalue is called the “first principal component”, denoted PC1; the eigenvector corresponding to the second highest eigenvalue is PC2; and so on. The variation explained by PCs decreases, with the first PC explaining the most variation [9].
Plotting PCs against each other can show evidence of population substructure, by clustering the individual data across these new axes of variation. Those PCs which are found to be significant can subsequently be used as covariates in regression models (see Chapter 3).
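The PCA steps described above (standardize each marker, form X = MMᵀ, extract eigenvectors) can be sketched with NumPy. This is a minimal illustration rather than the EIGENSOFT implementation: the function name `genotype_pcs`, the exclusion of monomorphic markers and the tiny toy matrix are assumptions made for the example.

```python
import numpy as np


def genotype_pcs(C, n_pcs=2):
    """Return the top eigenvalues and eigenvectors (PCs) of X = M M^T.

    C is an (individuals x markers) matrix of reference-allele counts in {0, 1, 2}.
    """
    C = np.asarray(C, dtype=float)
    mu = C.mean(axis=0)                     # per-marker mean
    sigma = C.std(axis=0)                   # per-marker standard deviation
    keep = sigma > 0                        # assumption: drop monomorphic markers
    M = (C[:, keep] - mu[keep]) / sigma[keep]
    X = M @ M.T                             # m x m matrix over individuals
    eigvals, eigvecs = np.linalg.eigh(X)    # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_pcs]
    return eigvals[order], eigvecs[:, order]


# Toy data: 6 individuals, 5 markers; the first three markers differ between
# two hypothetical subpopulations (rows 0-2 versus rows 3-5).
C = [
    [2, 2, 2, 1, 0],
    [2, 2, 1, 0, 1],
    [2, 1, 2, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
]
eigvals, pcs = genotype_pcs(C)
print("PC1 coordinates:", np.round(pcs[:, 0], 3))
```

In this toy example the sign of the PC1 coordinate separates the two groups of rows; the overall sign of an eigenvector is arbitrary, so only the relative positions are meaningful.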
1.3 Association studies
Variation in a DNA sequence can influence the risk of developing disease. Early studies investigated genetic variants underlying rare conditions that showed clear Mendelian inheritance patterns in families, and turned out to be very successful because these variants carry 100% disease risk [10].
Scientific efforts have been made to gather as much information about human genome variation as possible, allowing for better designed and less costly studies. Such efforts include the Human Genome Project (1990-2003), the International HapMap Project (2002-2009) and, more recently, the 1000 Genomes Project (2008-2015).
Investigating the causes of complex diseases has proven to be a much more difficult task, because there is not one single cause, but rather the combined action of many causal factors, genetic and/or environmental, that predispose to disease development. This means that even variants with a low increased relative risk, when found together in the same genome, may contribute significantly to the manifestation of the disorder in question. Genetic association studies aim to detect such variants involved in complex diseases.
The fundamental idea behind association studies is the comparison of allele or genotype frequencies between cases and controls, in order to relate genetic variants to a certain phenotype (such as a disease); if a particular allele/genotype is more common among cases than controls, it may be a risk factor and may be subject to further study [8].
There are two main hypotheses regarding disease-associated variants: the “common disease common variant” (CDCV) hypothesis, and the “common disease rare variant” (CDRV) hypothesis. These hypotheses hold opposing views on the penetrance of the variants involved, i.e., the proportion of individuals carrying a particular allele that also express an associated trait. While the first argues that genetic variations with appreciable frequency in the population, but relatively low penetrance, are the major contributors to genetic susceptibility to common diseases, the second holds that multiple rare DNA sequence variations, each with relatively high penetrance, are the major contributors. Both hypotheses stand on empirical evidence, and each uses specific methods for association analysis [11].
In the early days of association studies, the initial choice was to focus on common variants, mainly due to genome-wide surveys of rare variation requiring many more assays than the arrays available at the time could support. However, there was strong motivation to support the CDRV hypothesis, namely the idea that deleterious variants are likely to be rare due to purifying selection; indeed, loss-of-function variants, which prevent the generation of functional proteins, are especially rare [12].
A number of software tools have been developed to deal with genetic data files, which are very large due to the thousands of variants analyzed in most genetic studies. These programs are mostly command-line based, which makes dealing with these types of files more computationally efficient than working with GUI-based software; they are also readily and freely available online. One such program is PLINK [6], which offers multiple basic commands such as calculating allele frequencies, IBD and heterozygosity proportions, converting between multiple file types and performing basic allelic/genotypic chi-squared association tests. It also performs PCA, but a more specific command-line program for this purpose is EIGENSOFT. Among others, this program contains the EIGENSTRAT stratification correction method, which uses principal component analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation.
One last command-line program worth mentioning is ANNOVAR. This tool, given a list of variants and their corresponding genetic coordinates, uses a number of databases to functionally annotate them. The resulting features of each variant include the gene it belongs to, whether or not it falls in a coding region, and whether or not it changes the encoded amino acid, among others. An important feature provided by ANNOVAR is the Combined Annotation Dependent Depletion (CADD) score, a measure of the deleteriousness of SNVs and indels in the human genome. The CADD scores are “PHRED-scaled”, meaning that they express the rank of a variant in order-of-magnitude terms rather than as the precise rank itself. For example, variants in the top 10% of CADD scores are assigned to CADD-10, top 1% to CADD-20, top 0.1% to CADD-30, and so on. ANNOVAR also provides information from the Genome Aggregation Database (gnomAD) on allelic frequencies in various populations from around the world, among which are Non-Finnish Europeans (NFE), which include the Iberian population.
Several psychiatric disorders, such as schizophrenia [13], depression [14] or Alzheimer’s disease (AD) [15], have been found to be polygenic and were associated with several genetic variants. In this work, we describe the general procedures of association studies and the corresponding statistical methods, followed by an application to the case of AD in patients from the Iberian Peninsula.
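Returning to the PHRED-like scaling of CADD scores described above, the relation between rank and scaled score can be written as −10·log10(rank/total). The sketch below is only an illustration of that scaling; the function name `phred_scaled` and the ranks used are hypothetical, and it does not reproduce how raw CADD scores are computed.

```python
from math import log10


def phred_scaled(rank, total):
    """PHRED-like score of the variant ranked `rank` (1 = most deleterious) among `total`."""
    return -10.0 * log10(rank / total)


# Variants at the boundary of the top 10%, 1% and 0.1% of a hypothetical ranking.
total = 1_000_000
for fraction in (0.1, 0.01, 0.001):
    rank = int(total * fraction)
    print(f"top {fraction:.1%}: scaled score ~ {phred_scaled(rank, total):.0f}")
```

Each tenfold improvement in rank adds 10 to the scaled score, which is why the thresholds 10, 20 and 30 correspond to the top 10%, 1% and 0.1% of variants.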
1.4 An introduction to Alzheimer’s disease
1.4.1 Symptoms, causes and available diagnosis
Alzheimer’s disease (AD) is the most common type of dementia. It usually begins with subtle memory failure, which worsens over time and begins to affect an individual’s daily living. A person suffering from this condition will eventually have trouble recognizing people, naming objects, dealing with everyday chores and personal care, and behaving appropriately in social situations, among others. At an advanced stage of the disease, the patient will require constant care. After the first symptoms appear, an individual usually survives 8 to 10 years, but the course of the disease can go up to 25 years, ending in death by pneumonia, malnutrition or general inanition [16].
There are three stages to AD. The early stage is mild Alzheimer’s disease, when a person can still function independently but has a few memory lapses, such as forgetting familiar words or the location of everyday objects. Individuals with mild AD are first diagnosed with a condition called mild cognitive impairment (MCI), whose symptoms are similar to those of the early stage of AD; deciding whether the MCI observed in an individual is due to AD relies on brain imaging and cerebrospinal fluid tests. The middle stage, or moderate Alzheimer’s disease, is typically the longest stage; the symptoms become more pronounced and the patient will require more care. The late stage is called severe Alzheimer’s disease, when individuals lose the ability to respond to the environment and need constant help in performing daily activities. It is often not easy to place an individual at a specific stage, as the stages may overlap [16].
AD can be classified according to the age of onset. The most common type is late-onset Alzheimer’s disease, which constitutes approximately 95% of the cases, and affects individuals whose first symptoms appeared after the age of 65; early-onset Alzheimer’s disease affects the other 5% of the cases, for whom the age of onset is below 65 [17].
Cause                                                    % of cases
Late-onset familial                                      15-25
Early-onset familial                                     <2
Down syndrome                                            <1
Unknown (includes genetic/environment interactions)      ∼75
Tbl. 1.7 – Causes of Alzheimer’s disease.
The main causes of AD are described in Table 1.7. Approximately 25% of all AD is familial (i.e., ≥3 persons in a family have AD) and 75% is nonfamilial (i.e., an individual with AD and no known family history of AD); the onset of nonfamilial Alzheimer’s disease is usually at an advanced age. Because familial and nonfamilial AD appear to have the same clinical and pathologic phenotypes (observable manifestation), they can only be distinguished by family history and/or by molecular genetic testing [16].
Most cases of early-onset AD are due to genetic factors transmitted from parent to child. Research has shown that this form of the disease mostly results from a variation in one of three genes: APP, PSEN1 or PSEN2. When any of these genes is altered, large amounts of amyloid β-peptide, a toxic protein fragment, are produced in the brain. This peptide builds up to form clumps called “amyloid plaques”, characteristic of Alzheimer’s disease, which lead to the death of nerve cells and the progressive signs and symptoms of this disorder [16].
Some evidence indicates that essentially all persons with Down syndrome develop the neuropathologic hallmarks of AD after age 40. Down syndrome, a condition characterized by intellectual disability and other health problems, occurs when a person is born with an extra copy of chromosome 21 in each cell. The presumed reason for the association between these two conditions is the lifelong overexpression of APP on chromosome 21, and the resultant overproduction of β-amyloid in the brain [17].
Research has come to support the concept that late-onset Alzheimer’s disease is a complex disorder, with many susceptibility genes involved, as well as environmental factors (such as higher education, or exposure to electromagnetic fields [18]).
The gene APOE has been extensively studied and proven to have great influence on the manifestation of AD. APOE is polymorphic, with three major alleles: ε2, ε3 and ε4. The presence of the ε4 allele in heterozygous (ε3ε4) or homozygous (ε4ε4) state increases the risk for AD threefold and 15-fold, respectively. The APOE ε2 allele has been shown to have a protective effect [16]. APOE alleles are determined by the two SNPs rs429358 and rs7412 as shown in Table 1.8.
rs429358    rs7412    Allele
C           T         ε1
T           T         ε2
T           C         ε3
C           C         ε4
Tbl. 1.8 – APOE allele according to the genotype for SNPs rs429358 and rs7412.
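The correspondence in Table 1.8 amounts to a simple lookup, sketched below in Python. The sketch assumes that the two SNP alleles are given for a single phased haplotype (one chromosome), on the same strand as in the table; the names `APOE_HAPLOTYPES` and `apoe_allele` are introduced here for illustration only.

```python
# Mapping (rs429358 allele, rs7412 allele) -> APOE allele, as in Table 1.8.
APOE_HAPLOTYPES = {
    ("C", "T"): "ε1",
    ("T", "T"): "ε2",
    ("T", "C"): "ε3",
    ("C", "C"): "ε4",
}


def apoe_allele(rs429358, rs7412):
    """Return the APOE allele carried by one haplotype, given its alleles at the two SNPs."""
    return APOE_HAPLOTYPES[(rs429358.upper(), rs7412.upper())]


# A hypothetical individual carrying one ε3 and one ε4 haplotype.
haplotypes = [("T", "C"), ("C", "C")]
print("/".join(apoe_allele(a, b) for a, b in haplotypes))
```

With unphased genotype data, an individual heterozygous at both SNPs would be ambiguous (ε2/ε4 or ε1/ε3), which is why the sketch works on haplotypes rather than genotypes.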
However, the presence of APOE ε4 does not determine that an individual will develop the disease; in fact, approximately 42% of individuals with AD do not have any APOE ε4 allele. Similarly, the absence of APOE ε4 does not rule out the possibility of developing AD.
Currently, the only definitive way to establish a diagnosis of AD is to microscopically examine a section of the person’s brain tissue after death. However, there are several approaches that have been proven to be highly effective in diagnosing Alzheimer’s disease in a living patient.
The initial step is to consult a specialized doctor (psychiatrist), who will review the individual’s medical history and analyze the symptoms, as well as conduct a series of tests of the cognitive and physical abilities of the individual. The Mini Mental State Examination (MMSE) is one such test widely used for this purpose. It is not unusual for the doctor to interview friends and family of the patient, to better understand their behavioral changes over time. This series of clinical assessments often provides enough information to perform a correct diagnosis [16]; however, the diagnosis is not always clear, and may require further, more advanced testing.
Analysis of electroencephalograms (EEGs) can also be used as a means of diagnosis. EEGs are used to register electrical activity in the brain, focusing namely on spectral measures, which include the classical brainwaves in delta, theta, alpha, beta and gamma frequencies. Each one of these brainwaves is an endophenotype of Alzheimer’s disease. Brainwaves are activated according to our actions, feelings and circadian rhythm, and some disorders may trigger the over-expression or inhibition of a given brainwave. In order to interpret the EEG, it is important to understand which behaviors lead to certain variations in the activity of each brainwave.
Delta (δ, 0.5-4 Hz) waves suspend external awareness and are the source of empathy; they are generated in deep meditation and dreamless sleep, when healing and regeneration processes are triggered. Theta (θ, 4-8 Hz) brainwaves are connected with the learning, memory, and intuition functions. Alpha (α, 8-13 Hz) waves aid overall mental coordination and learning. Beta (β, 13-30 Hz) brainwaves are present when we are alert, engaged in problem solving, judgment, decision making, or focused mental activity; they dominate our normal waking state of consciousness. Gamma (γ, >30 Hz) brainwaves are the fastest of brain waves, and relate to simultaneous processing of information from different brain areas. More detailed information can be found online at https://brainworksneurotherapy.com/what-are-brainwaves.
Other means of diagnosis include laboratory testing, which is usually performed as a way of ruling out conditions that cause similar symptoms to Alzheimer’s, such as nutritional deficiencies or other diseases that could be affecting the person’s memory. These tests make use of blood, urine and cerebrospinal fluid samples [19].
One final diagnosis method worth mentioning is brain-imaging testing, such as computed tomography (CT) or magnetic resonance imaging (MRI) scans. These make it possible to look for evidence of trauma, tumors, and stroke that could be causing dementia, and to look for brain atrophy, a shrinkage that may be present later in the progression of Alzheimer’s disease. These tests require that the person remain still for a period of time [19].
All the methods above provide information that makes it possible to rule out a series of conditions that cause symptoms similar to AD. Such conditions are, for instance, past strokes, Parkinson’s disease and depression [19].
1.4.2 The association studies approach
The methods of diagnosis described above require that the individual is showing symptoms of the disease. Some of them may be considered invasive, such as lab testing (which requires lumbar puncture and spinal fluid collection); others may be inaccessible to the majority of the population due to their high cost, such as an MRI scan. In addition, imaging techniques may provide poor quality results, as Alzheimer’s patients tend to have a hard time remaining still even for short periods of time, especially at an advanced stage of the disease.
Genetic testing arises as an alternative to the previously described methods. By identifying alleles which increase the risk of developing the disease, it could be possible to make an early diagnosis, since the genetic material we carry is the same throughout our lifetime (except for mutations that may occur). This way, the disease could be prevented even before the appearance of symptoms, thus prolonging the quality of life of a potential future AD patient. This would also be a cheaper alternative, and can be made less invasive, while maintaining sample quality, through the use of oral swabs (instead of blood collection) to obtain the genetic material.
Chapter 2
Study design and data quality control
Genetic association studies can essentially be divided into candidate gene (CG) and genome-wide association (GWA) studies. CG studies are based on the prior hypothesis of a potential role of selected genes or genetic regions on a specific phenotype or disease, taking into consideration their biological function or association in previous studies. Genome-wide association studies, on the other hand, make use of information on the variation across the entire human genome, and are useful for hypothesis-generating purposes [10].
GWA analyses usually target relatively common SNPs. CG studies, however, focus on the effects of rare variants, which may be hard to detect, especially when dealing with small sample sizes. Besides being more cost effective than sequencing an entire human genome, studying the part that rare variants play on disease has been largely motivated by the CDRV hypothesis [11]. Indeed, if a certain variant has a large deleterious effect, it may also impact fitness and thus become less and less frequent in each generation.
This chapter describes the general procedures of an association study from the study design, through the process of data collection and up to the quality control steps, taking into account which methods suit each type of study, depending on the chosen approach.
2.1 Study design
Describing the phenotype accurately
The phenotype of interest must be defined as accurately and specifically as possible, in a way that minimizes the likely causal heterogeneity based on existing clinical and biological evidence. Such definitions may change as more information becomes available. An accurate definition increases the power to detect an effect and allows for replication studies [10].
Checking disease heritability
Heritability is a measure of how well differences between individuals’ genes account for differences in their traits, i.e., how much of the variation in a given trait can be attributed to genetic variation (as opposed to environmental causes) [3].
Heritability is assessed by studying disease patterns in family members, namely by comparing monozygotic with dizygotic twins. Because monozygotic twins are genetically identical (the two alleles in each locus are IBD), while dizygotic twins are expected to share, on average, half of their alleles, comparing disease status in twins can shed light on the role of genetic factors [10].
Diseases which have been shown to have low heritability will likely need very large sample sizes in order to find etiological genetic variants. Moreover, for diseases with heritability close to zero, there is little advantage in conducting a genetic case-control study [10].
Choosing the best approach to the problem
Concerning sample relatedness, association studies can be sorted into two categories: population-based case-control studies, and family-based studies. The first approach may require several thousands of cases of the phenotype of interest. This number can be decreased, and the power of the study increased, by recruiting cases with a family history of the condition, or even multiple cases from the same family (adjusting for familial correlation), for a sample with a more homogeneous genetic background; this is called “enrichment sampling” [20]. This sampling method does not always increase power in genetic studies, as familial aggregation may be due to shared environmental factors, for example [10]. Another premise for choosing a population-based case-control approach is the assumption that one or more of the underlying genetic variants are common. Moderately rare variants could also be detected, but only if they carry a large effect. A prior hypothesis that all undetected variants are rare and of small effect would require an unfeasibly large sample size in order to have power to detect the effect of single variants [21]. If the case definition is a phenotype that shows clear segregation in families, then a population-based case-control approach is no longer suitable, and a family-based study is preferable [10].
Control selection
The golden rule of control selection for any case-control study is that cases and controls should belong to the same population, and that controls must be representative of those members of that population who would have become cases had they developed the disease, according to the case definition and the recruitment strategies for the study. This minimizes false positives and confounding [22].
Bias due to environmental factors is generally not a problem in association studies; the most important type of bias is related to the ethnic origin of cases and controls. This is commonly referred to as “population stratification”, and is an example of a confounding variable. Under this situation, differences in allelic frequencies between cases and controls are due to the underlying sampling scheme, rather than an actual effect of the variant on disease risk [23].
The effects of population stratification can sometimes be avoided at the study design level (by matching controls to cases on potentially important confounders) or at the data analysis level (by adjusting the results for these confounders). Matching is only essential when the effect of the confounder cannot be accurately measured or is too large to be adjusted for in the analysis [24].
Population stratification is minimized when controls are matched to cases on ethnicity, or when the sample is restricted to a particular ethnic group. Further matching on sex can reduce population stratification in situations where there are gender differences in disease prevalence. Matching on age may improve the power of the study by ensuring that controls had the same opportunity as cases to develop (and be diagnosed with) the disease. Unequal opportunity of this kind could be a problem when dealing with age-related diseases such as Alzheimer’s disease. Whether or not further matching is necessary and decreases population stratification will depend on the disease in question [10]. Remaining stratification can be investigated and controlled (to some extent) by analytical methods [25, 26].
As a method of control selection, GWA studies often resort to banks of already genotyped shared healthy controls, mainly because this is a much more economical approach. It is important that the basic characteristics of such panels are known, such as ethnicity, sex, age and area of recruitment, so that they can be matched on in the design or adjusted for in the analysis [10].
The described methods of control selection are specific to studies intended to assess genetic risk, and are no longer suitable if environmental factors are incorporated [10].
Sample size
Sample sizes for each study will depend on the existence of case sub-groups and a priori hypotheses to be tested, on whether it is a CG or a GWA approach, among (many) other factors. Estimating the required sample size often relies on empirical results from simulation studies [10].
The lack of availability of genetic information from cases for an association study often relates to economic issues. When testing many SNPs, a one-stage design can be very expensive, so one can resort to a multi-stage design, where all SNPs are tested in a random subset of cases and controls, and those found significant are taken through to be tested in the remainder of the study sample [27]. The power of a study can also be potentially improved with an increased control/case ratio [24].
Replication studies
Theoretical considerations prove that, when true discovery is claimed based on crossing a threshold of statistical significance and the discovery study is underpowered, the observed effects are expected to be inflated. Furthermore, flexible analyses coupled with selective reporting may inflate the published discovered effects. Therefore, a study designed to replicate a finding should base sample size calculations on smaller effect sizes [28].
A true replication study must be performed on a population comparable to the original, i.e., it must involve the analysis of the same polymorphism in the same direction of the effect, in the same ethnic population measured on the same phenotype. Failure to replicate findings in a different population does not allow judgement of the validity of the results in the original study; it can only shed light on the lack of effect in the second population [10].
2.2 Data collection and variant calling
The study design is followed by the collection of data for analysis. Genotyping individuals in association studies is usually done with DNA microarrays. These consist of specific DNA sequences (known as probes), each corresponding to a short section of a gene or other DNA sequence of the human genome. Probes are usually 100 to 10 000 bases long and fluorescently labeled. Among the manufacturers of DNA microarrays was Affymetrix, Inc., a company now owned by Thermo Fisher Scientific. This company developed the GeneChip array technology and the Affymetrix Power Tools (APT), which can be used for variant calling, quality control and genotyping. A GeneChip array can contain up to thousands of DNA probes, designed to vary in specific locations matching those of known human genome variation. When the probes and a DNA sample are placed in the same environment, the DNA breaks up into fragments which attach to the corresponding probes in a process called hybridization, emitting a measurable fluorescent signal that makes it possible to identify the nucleotide sequence in each fragment and thus determine the sequence of the DNA sample. Two important measures to consider when assessing the quality of the variant calling process are the dish quality control (DQC) and the quality control call rate (QCCR). DQC is a measure of the contrast between the adenine-thymine (AT) and cytosine-guanine (CG) signals, and is defined as
\[ \mathrm{DQC} = \frac{\text{AT Signal} - \text{CG Signal}}{\text{AT Signal} + \text{CG Signal}} \]
QCCR is the proportion of non-missing data for each individual. The “Axiom Genotyping Analysis Guide” by Affymetrix provides threshold values for these measures, and any subject falling below them should be excluded from further study. Their best practices guide can be found online at https://assets.thermofisher.com/TFS-Assets/LSG/manuals/axiom_genotyping_solution_analysis_guide.pdf. Another relevant measure when doing probe QC is the heterozygosity rate. When this value is too high (usually higher than µ + 3σ) for a given individual, it could hint at sample contamination; when it is too low (below µ − 3σ), it could mean that there are related individuals in the sample. In either case, it is recommended that the individuals falling outside this interval be discarded from further analysis [5]. One final quality control step for variants before genotyping is to sort them into categories according to allelic intensities. For each probe, let alleles A and B correspond to the two possible bases at that locus, A being whichever comes first alphabetically; for example, for a [C/T] SNP, the (CC), (CT) and (TT) genotypes are named (AA), (AB) and (BB), respectively. The intensities of alleles A and B are calculated, and using the obtained Asignal and Bsignal values, we plot X against Y, where
\[ X = \text{Contrast} = \log_2(A_{\text{signal}}) - \log_2(B_{\text{signal}}) \]
and
\[ Y = \text{Size} = \frac{\log_2(A_{\text{signal}}) + \log_2(B_{\text{signal}})}{2} \]
The values obtained for these measures for each individual allow each probe to be sorted into one of 7 categories: “Poly High Resolution”, “Mono High Resolution”, “No Minor Homozygote”, “Hemizygous”, “Off-Target Variants”, “Call Rate Below Threshold” and “Other”. The first four categories contain high quality variants. “Poly High Resolution” variants are characterized by three very clearly defined clusters, one for each of the three possible genotypes. “Mono High Resolution” is a category for variants with a single cluster corresponding to one of the states. “No Minor Homozygote” refers to variants which show only two possible states: one homozygous and one heterozygous. “Hemizygous” variants are the ones present in the sex chromosomes. Variants classified as “Off-Target” are usually distributed across more than three clusters. Variants with “Call Rate Below Threshold” have well defined clusters, but the missing data rate is too high for these variants to be considered in further study. Variants with other characteristics fall in the “Other” category. Affymetrix does not recommend proceeding with variants falling in any of these last three categories, and they are therefore discarded. The final step, after removing all “problematic” variants, is to genotype the remaining ones, which can be done using APT in the command line.
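The measures defined in this section are simple functions of the raw signal intensities. The following minimal Python sketch computes them under stated assumptions: the function names and the example intensities are hypothetical, and the DQC function follows the definition given in this section rather than any vendor implementation.

```python
import math


def dqc(at_signal, cg_signal):
    """Contrast between the AT and CG signals, as defined in this section."""
    return (at_signal - cg_signal) / (at_signal + cg_signal)


def contrast_and_size(a_signal, b_signal):
    """Return (X, Y) = (Contrast, Size) for one probe in one individual."""
    log_a = math.log2(a_signal)
    log_b = math.log2(b_signal)
    contrast = log_a - log_b
    size = (log_a + log_b) / 2
    return contrast, size


# Hypothetical intensities, chosen only to make the arithmetic easy to follow.
print(contrast_and_size(1024.0, 256.0))  # (2.0, 9.0)
print(dqc(3.0, 1.0))                     # 0.5
```

Plotting the (X, Y) pairs of all individuals for a given probe would then give the cluster picture on which the seven categories are based; the clustering step itself is not sketched here.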
2.3 Variant quality control
Before conducting the actual “association” part of the association study, one must still check the quality of the genotyped data, namely by testing whether the “same-population sampling” process was successful. These quality control (QC) steps aim to remove individuals and/or markers which may contain high error rates, as they could introduce bias to the study and increase false-positive and false-negative rates. In what follows, we will introduce some standard QC steps in both GWA and CG studies.
2.3.1 Genome-wide association studies
It is not uncommon for GWA studies to test many thousands and even millions of SNPs for association, hence even a low error rate can be detrimental. Moreover, each marker removed is a potentially overlooked disease association, which can have a greater impact on the final results than removing a handful of individuals. Therefore, in order to maximize the number of markers that remain in the study, QC steps are taken on a ‘per-individual’ basis prior to a ‘per-marker’ basis [5].
Per-individual quality control
Quality control in subjects of the study consists of four main steps:
1. Identification of individuals with discordant sex information. The best way to detect discrepancies between an individual’s genotype information and his/her assigned sex is by comparing the homozygosity rate across all X-chromosome SNPs for each individual in the sample with the expected rate. Males are expected to have a homozygosity rate of 1, and females less than 0.2 [5]. Individuals found to have discordant sex information should be removed from further analysis, unless the sample can be correctly identified using existing genotype data or it can be confirmed that sex was recorded incorrectly.
2. Identification of individuals with outlying missing genotype/heterozygosity rates. Genotype failure and heterozygosity rates per individual have been used as measures of the quality of DNA samples. As a guideline, individuals with more than 3-7% missing genotypes have been removed in GWA studies [29, 30]. An excessive proportion of heterozygous genotypes in an individual may be indicative of DNA sample contamination; a low proportion of heterozygotes, in turn, may constitute evidence of inbreeding [5].
3. Identification of duplicate or related individuals. If duplicates or relatives are present, a bias may be introduced to the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies in the entire population [5]. Measures of relatedness are used to identify such cases:
• Identity by state (IBS). A value calculated for each pair of individuals based on the average proportion of alleles shared at genotyped SNPs (note that duplicate individuals will have an IBS of 1). This measure works best with independent SNPs, which is why selected regions are pruned, so that no SNPs in a given window are correlated.
• Identity by descent (IBD). A degree of recent shared ancestry for a pair of individuals. Duplicates are expected to have IBD=1; first-degree relatives, IBD=0.5; second-degree relatives, IBD=0.25; and so on. Due to genotyping errors, LD or population structure, an IBD value for a pair of subjects higher than 0.98 is enough to consider them duplicates. The usual procedure is to remove one individual from each pair with IBD>0.1875 [5], which is halfway between the expected IBD for second- and third-degree relatives.
4. Identification of individuals of divergent ancestry. The main source of confounding in association studies is the existence of population stratification: differences found between cases and controls may be due to diverse ancestries, rather than to underlying differences directly related to disease status [23]. Even when drawing cases and controls from the same population, genetic substructure may still be present, and confounding occurs when that substructure is not equally distributed between the two phenotypes. The most common method for identifying ancestry differences is PCA – see 1.2.5 for more details on this subject.
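Some of the per-individual filters above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, with None for a failed call; the 5% missingness cut-off is one arbitrary value inside the 3-7% range quoted above; the µ ± 3σ heterozygosity interval is the one mentioned in Section 2.2; and the pairwise IBD estimates are assumed to have been computed elsewhere. All function names are hypothetical.

```python
from statistics import mean, pstdev


def missing_rate(genotypes):
    """Proportion of failed calls (encoded here as None) for one individual."""
    return sum(g is None for g in genotypes) / len(genotypes)


def heterozygosity_rate(genotypes):
    """Proportion of heterozygous calls (coded 1) among the non-missing genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this individual")
    return sum(g == 1 for g in called) / len(called)


def flag_individuals(sample, max_missing=0.05, n_sd=3.0):
    """Map each flagged individual to the list of reasons for flagging.

    sample: dict of individual id -> list of genotypes coded 0, 1, 2 or None.
    """
    het = {iid: heterozygosity_rate(g) for iid, g in sample.items()}
    rates = list(het.values())
    mu, sigma = mean(rates), pstdev(rates)
    flagged = {}
    for iid, genotypes in sample.items():
        reasons = []
        if missing_rate(genotypes) > max_missing:
            reasons.append("missingness")
        if abs(het[iid] - mu) > n_sd * sigma:
            reasons.append("heterozygosity")
        if reasons:
            flagged[iid] = reasons
    return flagged


def related_pairs(ibd, threshold=0.1875):
    """Pairs whose IBD estimate exceeds the threshold.

    ibd: dict of (id_1, id_2) -> estimated IBD; one member of each returned
    pair would be removed.
    """
    return [pair for pair, value in ibd.items() if value > threshold]
```

The sex check and the ancestry check are left out of this sketch, since they depend on X-chromosome data and on PCA, respectively.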
Per-marker quality control
Criteria to filter out SNPs differ from study to study; nevertheless, when filtering out SNPs, one must always keep in mind that one may be discarding a potentially disease-associated variant. Quality control in genotyped markers consists of four main steps:
1. Identification of SNPs with an excessive missing genotype rate. Usually, markers with a call rate lower than 95% (i.e., markers for which more than 5% of individuals were not successfully genotyped) are removed from further study [30, 31]. For low frequency markers, higher thresholds have been defined, so as not to lose potentially crucial information from rare variants [29].
2. Identification of SNPs demonstrating a significant deviation from the Hardy-Weinberg equilibrium (HWE). Markers which show a large deviation from HWE could hint at genotyping or genotype-calling errors; however, departure from HWE can also indicate selection, so a case sample can show deviations from HWE at disease-associated loci [32]. For this reason, only control samples should be tested for deviations from HWE.
3. Identification of SNPs with significantly different missing genotype rates between cases and controls. This reduces confounding and removes poorly genotyped SNPs. When cases and controls come from several different sources, it is wiser to test for significant differences in call rate, allele frequency and genotype frequency between the various groups, to make sure that it is fair to treat the combined set as homogeneous [5].
4. Removal of all markers with a very low minor allele frequency (MAF). Since power to detect association at rare variants is very low [33], targeting common variants does not overly impact GWA studies. Typically, SNPs with 1-2% or lower MAF are removed, but higher thresholds may be set when working with small sample sizes [5].
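Three of the per-marker filters above (call rate, HWE in controls and MAF) can be sketched as follows. This is an illustrative sketch, not a reference implementation: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, with None for a failed call; the HWE check is written as a 1 df chi-squared goodness-of-fit test, which is only one of several possible tests; and the significance level `hwe_alpha` is a hypothetical value, since no threshold is given in the text. All function names are hypothetical as well.

```python
import math


def call_rate(genotypes):
    """Proportion of individuals successfully genotyped at one marker."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF at one marker; assumes at least one non-missing genotype."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1 df chi-squared goodness-of-fit test for HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_marker_qc(all_genotypes, control_genotypes,
                     min_call_rate=0.95, min_maf=0.01, hwe_alpha=1e-4):
    """Apply the call rate and MAF filters to everyone, and the HWE filter to controls only."""
    if call_rate(all_genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(all_genotypes) < min_maf:
        return False
    counts = [control_genotypes.count(k) for k in (0, 1, 2)]
    return hwe_chi2_pvalue(*counts) >= hwe_alpha
```

The comparison of missing genotype rates between cases and controls (step 3) is not included; it could be approached with a test on a 2 × 2 table of called versus missing genotypes by disease status.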
2.3.2 Candidate-gene studies
CG association studies work with far fewer SNPs, so many of the quality control steps used in GWAS cannot be undertaken. Using fewer SNPs greatly reduces our ability to get accurate estimates of DNA quality, population ancestry and familial relationships with other subjects. One should still attempt to identify and remove individuals with very low call rates, which are typically lower in CG studies than in GWAS, due to the reduced number of genotyped SNPs. However, excluding markers with a high failure rate may seriously impair a CG study, since SNPs are chosen based on their “tagging” properties [34]; in this situation, it is advisable to return to the design stage and select a different tagSNP [5]. Detection of deviations from the Hardy-Weinberg equilibrium in controls is still relevant in CG studies, for genotype quality checking purposes [5].
Chapter 3
Models of association
Association studies aim to predict a phenotype based on individuals’ characteristics, with the particularity of dealing with an abnormally large number of predictor variables – the genetic markers. When dealing with quantitative traits, such as height or blood pressure, linear regression models are applied. The type of phenotype we are most interested in is the presence or absence of a disease, which is the focus of a case-control study; this is an example of a qualitative trait, for which logistic regression is used to predict the disease probability. Disease risk is expected to be modified by environmental effects, such as epidemiological risk factors (gender, for instance), clinical variables (such as disease severity and age of onset) and population stratification (measured in principal components capturing variation due to differential ancestry), but also by the interactive and joint effects of genetic factors [35]. The challenge is to determine the set of genetic variants which are most likely to yield significant results when used as covariates. Depending on the minor allele frequency (MAF) of the variants of interest to be tested, different methods have been developed.
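As an illustration of the logistic regression setting described above, a single-marker model with covariates can be written as follows. The notation is an assumption made for this sketch only: $Y$ is the disease status (1 for cases, 0 for controls), $G$ is the genotype at the tested marker coded as the number of copies of the tested allele (0, 1 or 2), and $C_1, \dots, C_K$ are covariates such as sex or principal components of ancestry.

```latex
\[
  \operatorname{logit} P(Y = 1 \mid G, C)
  = \log \frac{P(Y = 1 \mid G, C)}{1 - P(Y = 1 \mid G, C)}
  = \beta_0 + \beta_G \, G + \sum_{k=1}^{K} \beta_k \, C_k
\]
```

Under this coding, testing the marker for association amounts to testing whether $\beta_G = 0$, and $e^{\beta_G}$ is the odds ratio per additional copy of the tested allele.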
3.1 Statistical tests for common variants
The main idea when looking for association between a certain allele or genotype and a (binary) phenotype is to find differences in the frequencies of said alleles or genotypes between affected and unaffected individuals. This is feasible when dealing with common variants; other methods should be used with rare variants [12].
If we look at disease status and allelic/genotypic status as two categorical variables, then these comparisons can be done with simple statistical tests on contingency tables (Table 3.1). Such tests include the Cochran-Armitage trend test [36, 37], Pearson’s Chi-squared test [38] and Fisher’s exact test [39], which are widely used when looking for evidence of common-variant association. For quantitative traits, the Wald statistical test is usually employed [40].
(a)
            a       A       Sum
Cases       m11     m12     m1·
Controls    m21     m22     m2·
Sum         m·1     m·2     m

(b)
            (aa)    (Aa)    (AA)    Sum
Cases       n11     n12     n13     n1·
Controls    n21     n22     n23     n2·
Sum         n·1     n·2     n·3     n
Tbl. 3.1 – (a) Contingency table of allele counts; (b) Contingency table of genotype counts.
Let us consider a genetic marker of study: a single biallelic locus with possible alleles A and a (and possible genotypes (AA), (Aa) and (aa)). We define a penetrance parameter γ (γ > 1) of the disease relating to a certain allele or genotype, which is associated with the proportion of individuals carrying that particular variant that also express the disease. Models for disease penetrance include the additive model, the multiplicative model, the common dominant model and the common recessive model. We assume, without loss of generality, that A is the presumed risk allele being tested for. An additive model indicates that the risk of disease increases by γ-fold for an individual with genotype (Aa), and by 2γ-fold with genotype (AA), relative to an (aa) genotype (this is equivalent to assuming a co-dominant model); a multiplicative model indicates that each additional A allele increases disease risk by γ-fold; a dominant model indicates that one copy of allele A is sufficient to increase disease risk by γ-fold, so (Aa) and (AA) genotypes can be grouped into a single category; as for the recessive model, it indicates that two copies of the risk allele are necessary to increase disease risk by γ-fold, so in this case it is the (aa) and (Aa) genotypes that are grouped [35].
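The four penetrance models can be summarized by the relative risk that each one assigns to the genotypes (aa), (Aa) and (AA), taking (aa) as the reference. The following short Python sketch tabulates these values exactly as described above; the function name and the example value of γ are hypothetical.

```python
def genotype_relative_risks(model, gamma):
    """Relative risks for genotypes (aa, Aa, AA), with (aa) as the reference."""
    if gamma <= 1:
        raise ValueError("the penetrance parameter gamma is assumed to exceed 1")
    risks = {
        "additive": (1.0, gamma, 2 * gamma),         # gamma-fold, then 2*gamma-fold
        "multiplicative": (1.0, gamma, gamma ** 2),  # gamma-fold per copy of A
        "dominant": (1.0, gamma, gamma),             # one copy of A is sufficient
        "recessive": (1.0, 1.0, gamma),              # two copies of A are required
    }
    return risks[model]


for model in ("additive", "multiplicative", "dominant", "recessive"):
    print(model, genotype_relative_risks(model, 1.5))
```

The grouping of genotypes under the dominant and recessive models is visible in the repeated values: (Aa) and (AA) share a risk under the dominant model, while (aa) and (Aa) share one under the recessive model.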
3.1.1 Chi-squared test
Under the null hypothesis of no association, we expect the relative allele or genotype frequencies to be the same in case and control groups. A test of association is thus given by a simple chi-squared test for independence of rows and columns of the underlying contingency table (note that when doing genotype comparisons, the contingency table is 2 × 3, hence resulting in a 2 df test; under the dominant or recessive models, or with allele comparisons, the table’s dimensions are 2 × 2, producing a 1 df test).
Considering an m × n contingency table where $n_{ij}$ is the count in cell (i, j), the chi-squared test statistic is thus given by
\[ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})} \]
where $n_{i\cdot} = \sum_j n_{ij}$, $n_{\cdot j} = \sum_i n_{ij}$, $n = \sum_i \sum_j n_{ij}$ and $E(n_{ij}) = \frac{n_{i\cdot} \, n_{\cdot j}}{n}$. This statistic approximately follows a $\chi^2$ distribution with (m − 1)(n − 1) degrees of freedom.
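The computation of the chi-squared statistic can be sketched in plain Python as follows. The function names and the example table are hypothetical; the p-value helper only covers the 1 df and 2 df cases that arise for the 2 × 2 and 2 × 3 tables discussed here, for which the upper tail of the χ² distribution has a simple closed form, and the sketch assumes that every row and column total is non-zero.

```python
import math


def chi_squared_statistic(table):
    """Return (statistic, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def chi_squared_pvalue(statistic, df):
    """Upper-tail probability of the chi-squared distribution for df = 1 or 2."""
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if df == 2:
        return math.exp(-statistic / 2)
    raise NotImplementedError("only 1 and 2 degrees of freedom are covered")


# Hypothetical allele counts: rows are cases and controls, columns are alleles a and A.
statistic, df = chi_squared_statistic([[10, 20], [20, 10]])
print(statistic, df, chi_squared_pvalue(statistic, df))
```

For larger tables, a general χ² survival function from a statistics library would be needed in place of the two closed forms.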
The χ² distribution is a good approximation to the sampling distribution of the test statistic when dealing with large samples; with small sample sizes, it is usual to apply Fisher’s exact test instead, which allows the exact significance of the deviation from the null hypothesis to be calculated. This test is only feasible by hand in the case of 2 × 2 contingency tables (due to the single df); computational methods have been developed to extend the test to the general case of an m × n table.
Considering Table 3.1 (a), the probability of observing such an arrangement of the sample, under the null hypothesis of no association between disease status and allelic distribution, can be obtained as follows: