Quantifying the genetic predisposition to a complex disease through genome-wide association

Ana Margarida Carrapatoso Macedo

Master’s degree thesis presented to Faculdade de Ciências da Universidade do Porto in Mathematical Engineering

MSc (2.º Ciclo), FCUP, Department of Mathematics, 2019

Supervisor Alexandra Lopes, Assistant Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto

Co-supervisor Nádia Pinto, Junior Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto, CMUP – Centro de Matemática da Universidade do Porto

All corrections determined by the jury, and only those, have been made.

The President of the Jury,

Porto, ______/______/______

Acknowledgements (Agradecimentos)

To Ipatimup and the Population Genetics group, who welcomed me into the world of scientific research.

To my supervisors Alexandra Lopes and Nádia Pinto, who were tireless and always available for my questions, who believed in me and gave me the opportunity to be part of this project to the end.

To my brother, who corrected my strangest paragraphs whenever the English language failed me.

To my everyday companion, who never stops believing in me and encourages me to always do more.

To my parents, who provided me with a higher education and never doubted my choices.

Abstract

The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under analysis.

The methods explored in detail included the quality control steps of genetic data, statistical tests for common-variant association (Pearson’s chi-squared test, Fisher’s exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.

Subsequently, the methods were applied to a data set of Alzheimer’s disease (AD) patients and healthy controls from the north of the Iberian Peninsula, in the scope of the multicenter study “AD-EEGWA”. It features new data from a population that is still understudied with regard to genetic association with this disease. Besides the genetic component and biographic data, we had access to electroencephalography measures for most of the study participants. The project is currently ongoing, and more biological samples are being collected to increase the power of the genetic analyses.

The SKAT-O method allowed us to identify one gene (PLEKHA5) with significantly different minor allele frequencies between cases and controls in our sample: collectively, the rare alleles were present in 10% of controls, but in just 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between AD cases and controls.

The common-variant association methods allowed us to identify nine SNPs with nominally significant differences in allele and genotype distribution between cases and controls, three of which had been associated with Alzheimer’s disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from EEGs. Four out of the nine SNPs showed significant differences for some brainwaves in the mean relative power values within cases and controls with different alleles/genotypes. However, the results did not match what was expected, considering the EEG brainwave behavior in AD cases and controls and the possible risk allele of each SNP.

All in all, we concluded that, even in small samples, it is possible to find association between phenotypes and the aggregate effects of rare variants. It will be interesting to replicate the study in a larger sample and see if these results hold. The lack of agreement between the obtained and the expected results in the analysis of the genetics-EEG relation motivates new approaches to the problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined disease endophenotypes, to help diagnosis.

Keywords: association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare variant, endophenotype, electroencephalogram

Summary (Resumo)

The main objective of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under study.

The methods studied in greater detail included the quality control steps for genetic data, statistical tests for common-variant association (Pearson’s chi-squared test, Fisher’s exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines the burden and non-burden approaches to the study of rare-variant association.

The methods were then applied to a data set of Alzheimer’s patients and healthy controls from the northern region of the Iberian Peninsula, within the scope of the multicenter project “AD-EEGWA”. This is a new data set from a population that is still little studied with respect to genetic association with the disease. In addition to the genetic component and biographic data, we had access to electroencephalogram measures for most of the study participants. The project is still ongoing, and more biological samples are being collected to bring more power to the genetic analyses.

With the SKAT-O method, it was possible to identify one gene (PLEKHA5) with significant differences in the frequencies of its rare alleles between cases and controls in our sample: collectively, the rare alleles were present in about 10% of controls, but in only 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between Alzheimer’s cases and controls.

The association methods for common variants made it possible to identify nine SNPs with nominally significant differences in the distribution of alleles and genotypes between cases and controls, three of which had already been associated with Alzheimer’s disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from the EEGs. Of the nine, four SNPs showed significant differences for some brainwaves, with respect to the mean “relative power” values in cases and controls with different alleles/genotypes. However, the results obtained did not correspond to what was expected, considering the EEG values in Alzheimer’s cases and controls and the possible risk allele of each SNP.

We concluded that, even with small samples, it is possible to find association between phenotypes and the aggregate effect of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of agreement between the results of the analysis of the genetics-EEG relation and what was expected motivates new approaches to this problem. The complexity of this disease strongly encourages insistence on an interdisciplinary approach that explores the effect of genotypes and well-defined endophenotypes of the disease, to assist in its diagnosis.

Keywords (Palavras-chave): association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare variant, endophenotype, electroencephalogram

Contents

Acknowledgements (Agradecimentos)
Abstract
Summary (Resumo)
List of Tables
List of Figures
List of Abbreviations

Introduction

1 Theoretical framework
1.1 Introductory concepts of biology and genetics
1.2 Population genetics concepts
1.3 Association studies
1.4 An introduction to Alzheimer’s disease

2 Study design and data quality control
2.1 Study design
2.2 Data collection and variant calling
2.3 Variant quality control

3 Models of association
3.1 Statistical tests for common variants
3.2 Rare variant association approaches
3.3 P-value adjustment
3.4 The odds ratio

4 An application of association studies to Alzheimer’s disease
4.1 Aim and objectives
4.2 Subjects and methods
4.3 Data quality control
4.4 Rare-variant analysis
4.5 Analysis of electroencephalography data
4.6 Discussion

Conclusion

Appendices
A Informed consent of participation in research study
B Mini Mental State Examination (MMSE)
C Lists of nominally significant variants in SKAT-O
C.1 Model 1 (Sex as covariate)
C.2 Model 2 (Sex, age, PC1 and PC2 as covariates)

List of Tables

1.1 (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).

1.2 Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.

1.3 Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.

1.4 Observed haplotype frequencies in the population, considering linkage disequilibrium.

1.5 Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.

1.6 Mating outcomes assuming Hardy-Weinberg equilibrium.

1.7 Causes of Alzheimer’s disease.

1.8 APOE allele according to the genotype for SNPs rs429358 and rs7412.

3.1 (a) Contingency table of allele counts; (b) Contingency table of genotype counts.

3.2 Counts of cases and controls in each of the n exposure categories Ej, in a sample of c individuals.

4.1 Summary table of the individual counts after each per-individual QC step, according to disease status, gender and country of origin (note: whenever male and female counts did not add up to the “Total” column, it was due to the presence of individuals with unknown gender; this issue was overcome as of the sex check step).

4.2 Summary table of probe/variant counts at each per-marker QC step.

4.3 Disease status vs. gender distribution of the QCed sample.

4.4 Identifier and description of each gene list to be tested for association, and number of genes and rare variants they contain.

4.5 Number of significant genes without and with p-value correction in each gene list and model (Model 1 – sex as the only covariate; Model 2 – sex, age, PC1 and PC2 as covariates).

4.6 Frequency and properties of PLEKHA5 rare variants identified in our sample. (*) in complete LD (r² = 1.0). Notes: “PHRED” refers to the PHRED-scaled CADD score; MA - minor allele; MAF - minor allele frequency; “gnomAD NFE” refers to the Non-Finnish European population of the gnomAD database (the number of genotyped alleles for each SNP was, respectively, 129 088, 129 088, 75 296 and 113 128); the p-values refer to Fisher’s exact test for differences between control frequencies in our data and the gnomAD database.

4.7 P-values obtained in ANOVA tests when testing for differences in RP of each of the brainwaves between cases and controls.

4.8 P-values obtained in Tukey test for multiple comparisons of the RP in each brainwave between controls and cases in each disease stage (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD). The values below 0.05/30 = 1.67 × 10⁻³ are underlined and in bold.

4.9 Gene and p-values obtained in the allelic and genotypic tests for the SNPs which were nominally significant in both at the α = 0.05 level. (*) SNPs previously associated to AD.

4.10 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different alleles for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.11 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different genotypes for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.12 Minor and alternative alleles of each SNP and their respective frequencies of the minor allele in the sample, in cases and in controls. (*) SNP previously associated to AD.

C.1 Nominally significant genes in SKAT-O under null Model 1 tested for “All Genes” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.2 Nominally significant genes in SKAT-O under null Model 1 tested for “Dementia” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.3 Nominally significant genes in SKAT-O under null Model 1 tested for “DGE Brain” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.4 Nominally significant genes in SKAT-O under null Model 1 tested for “AD Disgenet” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.5 Nominally significant genes in SKAT-O under null Model 2 tested for “All Genes” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.6 Nominally significant genes in SKAT-O under null Model 2 tested for “Dementia” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.7 Nominally significant genes in SKAT-O under null Model 2 tested for “DGE Brain” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.8 Nominally significant genes in SKAT-O under null Model 2 tested for “AD Disgenet” list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

List of Figures

1.1 Mendel’s fundamental experiment.

4.1 Missing data rate vs. heterozygosity across individuals passing the DQC and QCCR steps. Shading indicates sample density; the dashed lines represent the defined heterozygosity threshold; the outliers are highlighted in red.

4.2 (a) Plot of the first principal component against the second, calculated with the software EIGENSOFT; (b) Plot of the same PCs as in (a), zoomed in on the cluster which contains the European populations (the outliers are encircled in red). In the legends, PT and ES represent the Portuguese and Spanish individuals in our sample, respectively; the remaining populations come from the 1KGP dataset.

4.3 Distribution of age at the time of sample collection.

4.4 Proportion of variance explained by the first i principal components; (b) is a zoom-in of (a) on the first 100 PCs.

4.5 Plot of the first and second principal components, restricted to the sample subjects.

4.6 Probability density function of Beta(p, 1, 25).

4.7 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in cases and controls.

4.8 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in controls and mild, moderate and severe AD cases (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD).

4.9 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs71336232.

4.10 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs7144273.

4.11 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833211.

4.12 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833214.

List of Abbreviations

1KGP - The 1000 Genomes Project
AD - Alzheimer’s disease
APT - Affymetrix Power Tools
ANOVA - analysis of variance
CADD - combined annotation dependent depletion
CDCV - common disease common variant
CDRV - common disease rare variant
CG - candidate gene
CT - computed tomography
df - degrees of freedom
DGE - differential gene expression
DNA - deoxyribonucleic acid
DQC - dish quality control
EEG - electroencephalogram
GUI - graphical user interface
GWA - genome-wide association
HWE - Hardy-Weinberg equilibrium
IBD - identity by descent
IBS - identity by state
indel - insertion/deletion variant
LD - linkage disequilibrium
MAF - minor allele frequency
MCI - mild cognitive impairment
MIL - mild Alzheimer’s disease
MMSE - Mini Mental State Examination
MOD - moderate Alzheimer’s disease
MRI - magnetic resonance imaging
NFE - non-Finnish European
OR - odds ratio
PC(A) - principal component (analysis)
QC - quality control
QCCR - quality control call rate
RNA - ribonucleic acid
SEV - severe Alzheimer’s disease
SKAT - sequence kernel association test
SNP - single nucleotide polymorphism
SNV - single nucleotide variant
UTR - untranslated region

Introduction

This work aimed to describe the general procedures and methods used in genome-wide association studies, and to subsequently apply them to a data set composed of cases and controls of Alzheimer’s disease from the Iberian Peninsula. The work is reflected in the present dissertation, which is structured as follows.

To place the work in context, we start by introducing in Chapter 1 some basic concepts of biology and population genetics, and the general idea behind association studies. We also briefly introduce Alzheimer’s disease, its symptoms, causes and means of diagnosis.

In Chapter 2, we describe the procedures relative to study design and quality control of genomic data. Here, we present relevant thresholds for various quality measures, as criteria to keep or discard individuals and genetic variants from the study.

Chapter 3 portrays in detail the models of association used to study the effect of both common and rare variation on a complex phenotype. We focus on case-control studies, but also consider the case of quantitative traits whenever relevant.

Chapter 4 applies the previously described methods to a data set of Alzheimer’s patients and controls. It is split into two distinct phases of analysis. In a first stage, given the modest sample size, we focus on the aggregate effect of rare variation on the disease, using the SKAT-O method. In a second and final stage, we assess the behavior of different brainwaves at various stages of the disease; we also seek a possible association between a set of genetic variants and the values of EEG at different frequency bands.

Finally, in Chapter 5, we make some considerations about the work, namely on the importance of an approach combining genetic data and other endophenotypes associated with the disease, for a hopefully more precise diagnosis.

Chapter 1

Theoretical framework

1.1 Introductory concepts of biology and genetics

All our ideas about the transmission of specific characters and changes in the characteristics of populations make use of concepts introduced in 1865 by Gregor Mendel, who is often referred to as the Father of Genetics [1]. Working in a small garden with nothing but peas as his material, he was able to formulate a hypothesis that explains the inheritance of some traits in a very simple way.

The simplified experiment was as follows: Mendel crossed purebred plants with green peas and purebred plants with yellow peas, obtaining a first hybrid generation of all yellow peas; he then crossed these hybrid plants among each other and obtained peas of both colors, in a proportion of approximately 3 yellow to 1 green. This experiment is illustrated in Figure 1.1 and Table 1.1.

Initial generation: AA × aa
1st hybrid generation: Aa, Aa
2nd hybrid generation: AA, Aa, Aa, aa

Fig. 1.1 – Mendel’s fundamental experiment.

From these results, Mendel formulated his hypothesis for sexual reproduction, which can be expressed as follows:

1. Each character of an individual is controlled by two “factors”, the alleles, one of which the individual receives from the father, and the other from the mother.

2. From the two alleles carried by the individual, one is expressed (dominant), while the effect of the other may not be apparent (recessive).

3. A reproductive cell (egg and sperm, in humans) produced by an individual bears, for each character, one and only one of the two alleles which the individual carries.

In the initial generation, the plants with yellow peas carried only “yellow” genes; their genetic representation for this trait is (AA). The plants with green peas carried only “green” genes, hence having genetic constitution (aa). Crossing individuals of these two types can only generate individuals of one type, (Aa). When crossing (Aa) individuals with each other, it is possible to obtain individuals with genetic constitution (AA), (aa) or (Aa), with probabilities 1/4, 1/4 and 1/2, respectively. Because A is dominant over a, the (Aa) peas are yellow, hence the obtained proportions of all yellow peas in the 1st hybrid generation, and 3 yellow peas to one green in the 2nd.

(a)      a       a            (b)      A       a
 A     (Aa)    (Aa)            A     (AA)    (Aa)
 A     (Aa)    (Aa)            a     (Aa)    (aa)

Tbl. 1.1 – (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).
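The crosses in Table 1.1 can be reproduced with a short sketch. It assumes, as in Mendel's hypothesis above, that each gamete carries either of a parent's two alleles with probability 1/2; the function names `cross` and `yellow_fraction` are illustrative and not part of the original work.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict:
    """Return offspring genotype probabilities for a one-locus cross.

    Each parent is a two-letter genotype such as "AA", "Aa" or "aa".
    Every gamete is assumed to carry one of the parent's two alleles
    with probability 1/2 (Mendel's third point above).
    """
    counts = Counter(
        "".join(sorted(a1 + a2))  # sorted() places "A" before "a", so "aA" -> "Aa"
        for a1, a2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def yellow_fraction(genotype_probs: dict) -> Fraction:
    """Probability of the dominant (yellow) phenotype: at least one 'A' allele."""
    return sum((p for g, p in genotype_probs.items() if "A" in g), Fraction(0))


first_hybrid = cross("AA", "aa")   # only (Aa) offspring
second_hybrid = cross("Aa", "Aa")  # (AA), (Aa) and (aa) offspring
print(first_hybrid, second_hybrid, yellow_fraction(second_hybrid))
```

Exact fractions are used instead of floats so that the 1/4, 1/2, 1/4 proportions quoted in the text can be compared without rounding.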

Mendel’s experiment was a starting point for modeling the inheritance of other, more complex phenotypes, namely diseases, involving the contribution and/or interaction of various genes [2].

The human body is composed of trillions of cells of many different types. Cells are constantly dying and being replaced by newly formed ones through the process of cell division, the source of an individual’s development and growth. The very first cell of an individual already carries all of the genetic information he/she will bear throughout his/her whole life, encoded in its DNA.

DNA, or deoxyribonucleic acid, is a double-helix molecule, about 3 billion base pairs long, made of two complementary chains of nucleotides. A nucleotide is composed of one of four chemical bases – adenine (A), thymine (T), cytosine (C) and guanine (G) – attached to one sugar molecule and one phosphate molecule. In each nucleotide chain, A pairs up with T in the opposite chain, and C with G. Therefore, taking one of the chains as reference, DNA can essentially be seen as a single sequence of As, Ts, Cs and Gs. During cell division, DNA replicates itself, providing each new cell with an identical copy of all genetic material (if no mutation occurs). This process is called “replication”.
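As a small illustration of the pairing rule (A with T, C with G), the sketch below derives the opposite chain from a reference chain. The sequence is hypothetical, and the sketch deliberately ignores strand orientation: it returns the base-by-base complement only.

```python
# Pairing rule described in the text: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand.

    Raises KeyError if the strand contains a character other than A, T, C or G.
    """
    return "".join(PAIRING[base] for base in strand.upper())


reference = "ATGCCGTA"  # hypothetical sequence, for illustration only
opposite = complement(reference)
print(reference)
print(opposite)
```

Applying `complement` twice returns the reference chain, which is why one chain is enough to describe the whole molecule.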

DNA is tightly coiled into structures called “chromosomes”. Most human cells, the somatic cells, contain 23 pairs of chromosomes, for a total of 46, and are hence called “diploid”. The exceptions are the reproductive cells or gametes, which carry only half of an individual’s genetic material, i.e., 23 unpaired chromosomes, thus being “haploid”.

During cell division, a phenomenon called “genetic recombination” occurs: the arms of each chromosome pair in the cell cross over, leading to the exchange of genetic material between them. Recombination occurs between homologous regions, meaning that the alleles or genes are arranged in a similar order in both chromosomes.

In somatic cells, all chromosome pairs but one are termed “autosomes”; the one pair that differs in its structure, usually referred to as the 23rd pair, is the pair of sex chromosomes, which determines human sex: females have two X chromosomes, whereas males have one X and one Y. In each reproductive cell, since only half of the genetic material is present, there is an X chromosome in female gametes and either an X or a Y in male ones; indeed, it is the information carried in the paternal gamete that determines the sex of the offspring.

The set of all genetic information of an individual is called his/her “genome”. The “exome” is the small fraction (about 1%) of the genome known to encode for protein production. The information to produce a functional protein is encoded in a special unit of DNA at a determined genetic site (“locus”), called a “gene”. Each gene can have various modes of action, determined by variants of its sequence. The multiple versions of a gene are called “alleles”. A gene is said to be “polymorphic” if, for its locus, the minor allele frequency (MAF) is at least 1% within a population [3].
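The 1% criterion can be made concrete with a minimal sketch for a biallelic locus. The genotype counts below are invented for illustration, and the helper names are assumptions rather than functions from any genetics package.

```python
def minor_allele_frequency(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """Minor allele frequency at a biallelic locus, from genotype counts.

    Each individual contributes two alleles, so the frequency of allele A
    is (2 * n_AA + n_Aa) / (2 * N); the MAF is the smaller of the two
    allele frequencies.
    """
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("at least one genotyped individual is required")
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return min(freq_A, 1.0 - freq_A)


def is_polymorphic(maf: float, threshold: float = 0.01) -> bool:
    """Apply the 1% criterion quoted in the text."""
    return maf >= threshold


# Hypothetical counts of (AA), (Aa) and (aa) individuals.
maf = minor_allele_frequency(90, 9, 1)
print(round(maf, 4), is_polymorphic(maf))
```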

Because humans have twenty-two pairs of autosomes, most genes are represented twice in our genome, through alleles that may or may not be identical. A “genotype” is defined as the set of alleles found at a given locus of an individual. A “phenotype” is an observable trait that concerns a particular locus (or the combined action of several loci). For example, in Mendel’s experiment illustrated in Figure 1.1, the color of the peas is a phenotype, whereas the genetic constitution – (AA), (Aa) or (aa) – is the genotype.

An “endophenotype” is somewhere between the previous two definitions: it is a quantitative, non-observable trait that also shows a genetic connection. A set of genotypes observed at linked loci of one individual is called his/her “haplotype”.

For each autosomal locus, an individual is said to be “homozygous” if he/she carries two copies of the same allele, and “heterozygous” if the alleles are different. We can also refer to homozygous individuals as “homozygotes”, and to heterozygous individuals as “heterozygotes”.

It is the union of one male and one female reproductive cell that generates a “zygote”, i.e., a fertilized egg cell with the full complement of hereditary information necessary for the development of a human being. At the moment of fertilization, the new individual receives, for each autosomal locus, one allele from the father and one from the mother; as for the 23rd pair, a female has inherited one X chromosome from each of her parents, while a male has inherited an X chromosome from his mother and a Y from his father. Women can be homozygous or heterozygous for genes on the X chromosome; men can only be “hemizygous”, because X and Y are not homologous along their full length, except for short “pseudoautosomal regions” at their tips.

Individuals with different genotypes may show similar phenotypes due to genetic dominance-recessiveness relationships. Allele A is said to be “dominant” to a (or, equivalently, a is “recessive” to A) if the action of A, but not that of a, is manifested in the phenotype of an (Aa) heterozygote. Alleles A and B are said to be co-dominant if they are both expressed in an (AB) individual’s phenotype. An example of co-dominant expression is the AB blood type in the human ABO blood system, as shown in Table 1.2 [3]. This table also illustrates the dominance of alleles A and B over allele O.

         A       B       O
 A       A       AB      A
 B       AB      B       B
 O       A       B       O

Tbl. 1.2 – Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.
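The dominance and co-dominance rules behind Table 1.2 can be written as a short function. This is a sketch of the rules as stated in the text (A and B co-dominant, both dominant over O); the function name is illustrative.

```python
def abo_blood_group(allele1: str, allele2: str) -> str:
    """Blood group (phenotype) for a pair of inherited ABO alleles (genotype)."""
    alleles = {allele1, allele2}
    if not alleles <= {"A", "B", "O"}:
        raise ValueError("alleles must be 'A', 'B' or 'O'")
    if alleles == {"A", "B"}:
        return "AB"  # co-dominance: both alleles are expressed
    if "A" in alleles:
        return "A"   # A is dominant over O
    if "B" in alleles:
        return "B"   # B is dominant over O
    return "O"       # only (OO) individuals show blood group O


# Rebuild the table row by row.
for first in "ABO":
    print(first, [abo_blood_group(first, second) for second in "ABO"])
```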

1.2 Population genetics concepts

Population genetics involves the study of genetic variation within and between populations, by examining allele frequencies at different loci over time and space. Mathematical models are used to investigate and predict the occurrence of specific alleles (or combinations of alleles) in populations, based on the ever increasing understanding of genetics and evolution. As such, it becomes necessary to introduce some concepts regarding genetic variations, and how they relate to each other and influence the evolution of species.

1.2.1 Genetic variants

Chromosomes are not perfectly stable entities: changes in the DNA sequence may occur as a result of external or internal factors, such as exposure to radiation, chemicals or viruses, or simply an error during the replication process. These changes are called "mutations", and give rise to genetic variants; if their frequency in a population is above 1%, they are considered common and called "polymorphisms" [3]. For example, the replacement of one nucleotide by another is called a "single nucleotide variant" (SNV), or a "single nucleotide polymorphism" (SNP) if it is common. The most common single base-pair changes occur within each of the two classes of nucleotides, purines (A↔G) and pyrimidines (C↔T); thus, most SNPs in a population are "biallelic". Another well-known class of variants is that of indels, the insertion or deletion of a portion of DNA, no larger than 1 000 bases, in the genome.

Variants are part of the evolution of species. Some variants do not change an individual's phenotype, while others may greatly affect it; some variants can increase an individual's fitness in the surrounding environment, while others can have deleterious effects and generate disease.

"Monogenic" diseases result from deleterious variants in a single gene. They are inherited according to Mendel's laws, and are hence also called "Mendelian" diseases. If a disease results from the joint contribution of a number of independently acting or interacting genes, it is called "polygenic".

Variants can be classified in numerous ways depending on their length, placement and function. Exonic variants are located in portions of a gene that encode a part of the final mature RNA produced by that gene, and hence may change the resulting amino-acid sequence. UTR variants are located in untranslated regions but may have an important role in regulating gene expression. These are variants of potential functional importance and could be good candidates for further analysis in association studies.

A SNV that is in a coding region of the genome but results in no change to the encoded amino acid is called a "synonymous" substitution; when a SNV changes the encoded amino acid, it is termed "non-synonymous". Indel variants may yield "frameshift" variations, which are potentially deleterious. "Stop-gain" and "stop-loss" variations result, respectively, in a premature termination and an abnormal extension of the protein translation process, and thus alter the protein itself. Of these classes of genetic variants, synonymous substitutions are the least functionally relevant, and it is not uncommon to prioritize the analysis of variants falling in the remaining classes when searching for an association with a phenotype or disease [3].

Some genetic variants are known to be more likely to "travel together" from generation to generation than would be expected if different loci associated in a random manner. This phenomenon of non-random association is termed "linkage disequilibrium".
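As a small illustration of the nucleotide classes and the 1% frequency cut-off mentioned above, the following sketch classifies a single-nucleotide change and flags common variants. The terms "transition" (a change within a class, A↔G or C↔T) and "transversion" (a change between classes) are standard but are not used in the text, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("ref and alt must be two different bases among A, C, G, T")
    # A transition stays within one class (purine<->purine or pyrimidine<->pyrimidine).
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def is_polymorphism(allele_frequency: float, threshold: float = 0.01) -> bool:
    """Apply the 1% population-frequency cut-off quoted in the text."""
    return allele_frequency > threshold
```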

1.2.2 Linkage disequilibrium

Various studies have confirmed that the inheritance of certain alleles within a population is often correlated, causing many individuals to share the same haplotype. The alleles are thus said to be in linkage disequilibrium (LD). Even though genetic distance influences LD, it does not necessarily cause it; two loci being in LD simply means that the alleles appear together in the same population more (or less) frequently than would be expected by chance.

Suppose that allele A at locus 1 and allele B at locus 2 are found at frequencies p and q, respectively, in the population. If the two loci were independent, then we would expect to see the [AB] haplotype at frequency pq; however, if the frequency of the [AB] haplotype was either higher or lower than pq, then the two loci could be in LD.

Let us consider two biallelic loci on the same chromosome, with alleles A and a at the first locus, and B and b at the second. Their allelic frequencies in the population are $p_A$, $p_a$, $p_B$ and $p_b$ (note that, because the loci are biallelic, $p_a = 1 - p_A$ and $p_b = 1 - p_B$); the haplotype frequencies are $p_{AB}$, $p_{Ab}$, $p_{aB}$ and $p_{ab}$. Table 1.3 shows the observed and expected haplotype frequencies under linkage equilibrium; Table 1.4 illustrates a scenario where a measure of LD, D, is incorporated into the observed haplotype frequencies.

| | Observed: B | Observed: b | Expected: B | Expected: b |
|---|---|---|---|---|
| A | $p_{AB}$ | $p_{Ab}$ | $p_A p_B$ | $p_A p_b$ |
| a | $p_{aB}$ | $p_{ab}$ | $p_a p_B$ | $p_a p_b$ |

Tbl. 1.3 – Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.

| Observed frequencies | B | b | Total |
|---|---|---|---|
| A | $p_A p_B + D$ | $p_A p_b - D$ | $p_A$ |
| a | $p_a p_B - D$ | $p_a p_b + D$ | $p_a$ |
| Total | $p_B$ | $p_b$ | |

Tbl. 1.4 – Observed haplotype frequencies in the population, considering linkage disequilibrium.

The measure of linkage disequilibrium D is the difference between the observed frequency of the [AB] haplotype and the frequency expected under linkage equilibrium, defined as

$$D = p_{AB} - p_A p_B.$$

In order to standardize D, we need to find its boundaries. Using the fact that the observed frequencies must be non-negative, we obtain

$$\begin{aligned}
p_A p_B + D \ge 0 &\iff D \ge -p_A p_B \\
p_a p_b + D \ge 0 &\iff D \ge -p_a p_b \\
p_A p_b - D \ge 0 &\iff D \le p_A p_b \\
p_a p_B - D \ge 0 &\iff D \le p_a p_B
\end{aligned}$$

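Thus we can define

$$D' = \frac{D}{D_{\max}}$$

where

$$D_{\max} = \begin{cases} \min\{p_A p_b,\; p_a p_B\}, & \text{if } D > 0 \\ \min\{p_A p_B,\; p_a p_b\}, & \text{if } D < 0. \end{cases}$$

For $D < 0$, $D_{\max}$ is the magnitude of the lower bound $\max\{-p_A p_B,\; -p_a p_b\}$ obtained above, so that $D'$ keeps the sign of $D$.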

This normalization causes $D'$ to range between −1 and 1. When $D' = \pm 1$, at least one of the haplotypes was not observed; if allele frequencies are similar, a high $D'$ value means the markers are good surrogates for each other.

Another widely used measure of LD between loci, preferred by population geneticists, is Pearson's coefficient of correlation r,

$$r = \frac{D}{\sqrt{p_A p_a p_B p_b}}$$

or, more commonly, its squared value ($r^2$). When $r^2 = 1$, the two loci are in total linkage disequilibrium, i.e., they provide identical information; if $r^2 = 0$, they are in perfect equilibrium, i.e., the genetic information is transmitted independently [4]. Typically, two loci are considered to be correlated when their $r^2$ value is greater than 0.2 [5].

Linkage disequilibrium is of major importance in association studies, namely at the level of marker selection in the study design phase. Indeed, mapping LD across the genome has made it possible to deduce an individual's genotype at a given locus from others in high disequilibrium with it. This is done by strategically choosing single tagSNPs to represent entire haplotypes of regions in high LD, which results in a less costly study.
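The LD measures defined in this section can be computed directly from one haplotype frequency and the two allele frequencies. The sketch below is a minimal illustration with a hypothetical function name; for D < 0 it divides by the magnitude of the lower bound, so that D′ falls between −1 and 1 as stated in the text, and this sign convention is an assumption of the sketch.

```python
from math import sqrt


def ld_measures(p_AB: float, p_A: float, p_B: float) -> dict:
    """Return D, D' and r^2 for two biallelic loci.

    p_AB is the observed frequency of the [AB] haplotype;
    p_A and p_B are the frequencies of alleles A and B.
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B  # observed minus expected under equilibrium

    # Largest magnitude D can take given the allele frequencies.
    if D > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    elif D < 0:
        d_max = min(p_A * p_B, p_a * p_b)
    else:
        d_max = 0.0
    d_prime = D / d_max if d_max > 0 else 0.0

    denom = p_A * p_a * p_B * p_b
    r = D / sqrt(denom) if denom > 0 else 0.0
    return {"D": D, "D_prime": d_prime, "r2": r * r}
```

For instance, with $p_A = 0.6$, $p_B = 0.7$ and $p_{AB} = 0.5$, the function gives D = 0.08, which is below the $r^2 > 0.2$ threshold quoted above once converted to $r^2$.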

1.2.3 Identity by descent

Even after taking LD into account, loci which are independent within a population may still show significant similarities among individuals, introducing a degree of relatedness which must be accounted for, especially when performing an association study with a sample of unrelated individuals. If relatives are present, a bias may be introduced into the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies in the entire population.

An important measure of relatedness used to identify such cases is identity by descent (IBD), a degree of recent shared ancestry for a pair of individuals. Two alleles are said to be IBD if and only if they have descended from the same ancestral allele. Mutation breaks identity by descent. Two individuals are said to be related if they may share IBD alleles. There is some point in the past beyond which individuals are assumed to be unrelated. Identical twins are expected to have a proportion of shared IBD alleles equal to 1; first-degree relatives, 0.5; second-degree relatives, 0.25; and so on [5].

A similar concept is that of identity by state (IBS), which is based on the average proportion of indistinguishable alleles shared at genotyped variants for each pair of individuals. Therefore, two alleles which are IBD are also IBS, but the opposite may not be true, because alleles that are IBS may not originate from the same common ancestor; similarly, a pair of individuals may share more alleles IBS than IBD, but the opposite can never occur.

Purcell et al. [6] considered a method-of-moments approach to estimate the probability of sharing 0, 1, or 2 IBD alleles for any pair of individuals from the same homogeneous, random-mating population. Denoting IBS states as I and IBD states as Z (in both cases, the possible states being 0, 1, and 2), we have that

$$P(Z = 0) = \frac{N(I = 0)}{N(I = 0 \mid Z = 0)}$$

$$P(Z = 1) = \frac{N(I = 1) - P(Z = 0)\,N(I = 1 \mid Z = 0)}{N(I = 1 \mid Z = 1)}$$

$$P(Z = 2) = \frac{N(I = 2) - P(Z = 0)\,N(I = 2 \mid Z = 0) - P(Z = 1)\,N(I = 2 \mid Z = 1)}{N(I = 2 \mid Z = 2)}$$

where $N(I = i \mid Z = z)$ is the expected count of variants with IBS state I = i conditional on IBD state Z = z for the entire genome, and is defined as

$$N(I = i \mid Z = z) = \sum_{m=1}^{L} P(I = i \mid Z = z)$$

where the summation is over all L variants with genotype data on both individuals, and the conditional probabilities are calculated as in Table 1.5. We can thus define the proportion of alleles shared IBD as

$$\hat{\pi} = \frac{P(Z = 1)}{2} + P(Z = 2).$$

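| I | Z | $P(I \mid Z)$ |
|---|---|---|
| 0 | 0 | $2p^2q^2$ |
| 1 | 0 | $4p^3q + 4pq^3$ |
| 2 | 0 | $p^4 + 4p^2q^2 + q^4$ |
| 0 | 1 | $0$ |
| 1 | 1 | $2p^2q + 2pq^2$ |
| 2 | 1 | $p^3 + p^2q + pq^2 + q^3$ |
| 0 | 2 | $0$ |
| 1 | 2 | $0$ |
| 2 | 2 | $1$ |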

Tbl. 1.5 – Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.

Because of genotyping errors, LD and population structure, the estimate for duplicated samples rarely equals exactly 1, so a $\hat{\pi}$ value higher than 0.98 is taken as sufficient to treat two samples under analysis as duplicates. The usual procedure is to remove one individual from each pair with $\hat{\pi} > 0.1875$ (a value halfway between the expected IBD for second- and third-degree relatives) [5].
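The method-of-moments estimator above can be sketched in a few lines. The code below is an illustrative, simplified version and not the PLINK implementation: it takes the observed IBS counts $N(I = i)$ for one pair of individuals and one allele frequency per variant, builds the expected counts from Table 1.5, and applies the three formulas in order. The function names are hypothetical, and the raw estimates are not constrained to lie between 0 and 1.

```python
def ibs_given_ibd(p: float) -> dict:
    """P(I = i | Z = z) for one biallelic variant (Table 1.5), keyed by (i, z)."""
    q = 1.0 - p
    return {
        (0, 0): 2 * p**2 * q**2,
        (1, 0): 4 * p**3 * q + 4 * p * q**3,
        (2, 0): p**4 + 4 * p**2 * q**2 + q**4,
        (0, 1): 0.0,
        (1, 1): 2 * p**2 * q + 2 * p * q**2,
        (2, 1): p**3 + p**2 * q + p * q**2 + q**3,
        (0, 2): 0.0,
        (1, 2): 0.0,
        (2, 2): 1.0,
    }


def estimate_ibd(ibs_counts, allele_freqs):
    """Return (P(Z=0), P(Z=1), P(Z=2), pi_hat) for one pair of individuals.

    ibs_counts   -- observed counts (N(I=0), N(I=1), N(I=2)) over all variants
    allele_freqs -- one allele frequency p per variant
    """
    # Expected counts N(I = i | Z = z): sum of the probabilities over variants.
    expected = {key: 0.0 for key in ibs_given_ibd(0.5)}
    for p in allele_freqs:
        for key, prob in ibs_given_ibd(p).items():
            expected[key] += prob

    n0, n1, n2 = ibs_counts
    z0 = n0 / expected[(0, 0)]
    z1 = (n1 - z0 * expected[(1, 0)]) / expected[(1, 1)]
    z2 = (n2 - z0 * expected[(2, 0)] - z1 * expected[(2, 1)]) / expected[(2, 2)]
    return z0, z1, z2, z1 / 2 + z2


def flag_pair(pi_hat: float) -> str:
    """Apply the thresholds quoted in the text (0.98 and 0.1875)."""
    if pi_hat > 0.98:
        return "duplicate"
    return "related" if pi_hat > 0.1875 else "unrelated"
```

With 1 000 variants of allele frequency 0.5, a pair sharing exactly one allele IBD everywhere (as a parent and child would) has expected IBS counts (0, 500, 500), for which the sketch returns $\hat{\pi} = 0.5$.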

1.2.4 The Hardy-Weinberg equilibrium

The Hardy-Weinberg equilibrium (HWE) is a law of genetics which states that allele and genotype frequencies in a population will remain constant from generation to generation, under the following assumptions:

(a) the population size is so large that it can be treated as infinite;

(b) generations are discrete, and individuals from different generations do not breed together;

(c) mating is at random;

(d) migration does not occur;

(e) selection does not occur (i.e., individuals with different genotypes are assumed to have equal fitness to the environment);

(f) mutations do not occur (i.e., individuals with genotype ($A_iA_j$) can only produce gametes with an $A_i$ or an $A_j$ allele at that locus);

(g) initial genotype frequencies are equal in the two sexes.

The equilibrium in autosomal loci

Let’s suppose, for simplicity and because most human loci are biallelic, that there are n = 2 observed alleles, A1 and A2 with proportions p and q = 1 − p, respectively, for a given locus in a population. There are 3 possible genotypes, (A1A1), (A1A2) (identical to (A2A1)) and (A2A2), with initial proportions u, v and w, respectively.

From the genotype proportions, it is possible to deduce the allele proportions:

$$p = u + \frac{1}{2}v, \qquad q = w + \frac{1}{2}v$$

Under the stated assumptions, the next generation will be composed as shown in Table 1.6.

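| Mating type | Frequency | Nature of offspring |
|---|---|---|
| $(A_1A_1) \times (A_1A_1)$ | $u^2$ | $(A_1A_1)$ |
| $(A_1A_1) \times (A_1A_2)$ | $2uv$ | $\frac{1}{2}(A_1A_1) + \frac{1}{2}(A_1A_2)$ |
| $(A_1A_1) \times (A_2A_2)$ | $2uw$ | $(A_1A_2)$ |
| $(A_1A_2) \times (A_1A_2)$ | $v^2$ | $\frac{1}{4}(A_1A_1) + \frac{1}{2}(A_1A_2) + \frac{1}{4}(A_2A_2)$ |
| $(A_1A_2) \times (A_2A_2)$ | $2vw$ | $\frac{1}{2}(A_1A_2) + \frac{1}{2}(A_2A_2)$ |
| $(A_2A_2) \times (A_2A_2)$ | $w^2$ | $(A_2A_2)$ |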

Tbl. 1.6 – Mating outcomes assuming Hardy-Weinberg equilibrium.

The frequencies obtained for the three genotypes ($A_1A_1$), ($A_1A_2$) and ($A_2A_2$) in the first generation are, respectively,

$$\begin{aligned}
u^2 + uv + \tfrac{1}{4}v^2 &= \left(u + \tfrac{1}{2}v\right)^2 = p^2 \\
uv + 2uw + \tfrac{1}{2}v^2 + vw &= 2\left(u + \tfrac{1}{2}v\right)\left(w + \tfrac{1}{2}v\right) = 2pq \\
\tfrac{1}{4}v^2 + vw + w^2 &= \left(w + \tfrac{1}{2}v\right)^2 = q^2
\end{aligned} \tag{1.1}$$

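and, for the second generation,

$$\begin{aligned}
\left(p^2 + \tfrac{1}{2} \cdot 2pq\right)^2 &= [p(p + q)]^2 = p^2 \\
2\left(p^2 + \tfrac{1}{2} \cdot 2pq\right)\left(q^2 + \tfrac{1}{2} \cdot 2pq\right) &= 2p(p + q)\,q(p + q) = 2pq \\
\left(q^2 + \tfrac{1}{2} \cdot 2pq\right)^2 &= [q(p + q)]^2 = q^2
\end{aligned} \tag{1.2}$$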

meaning that, after a single round of random mating under the conditions above, the genotype frequencies stabilize at Hardy-Weinberg proportions [7].
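The derivation above can be checked numerically. The sketch below, with hypothetical function names, applies the offspring frequencies of Table 1.6 to arbitrary starting genotype proportions (u, v, w) and shows that one round of random mating already yields $p^2$, $2pq$ and $q^2$, which a second round leaves unchanged.

```python
def allele_freqs(u: float, v: float, w: float):
    """Allele proportions p and q from genotype proportions (A1A1, A1A2, A2A2)."""
    return u + v / 2, w + v / 2


def next_generation(u: float, v: float, w: float):
    """Genotype proportions after one round of random mating (Table 1.6)."""
    new_u = u * u + u * v + v * v / 4
    new_v = u * v + 2 * u * w + v * v / 2 + v * w
    new_w = v * v / 4 + v * w + w * w
    return new_u, new_v, new_w


# Starting proportions that are NOT in Hardy-Weinberg proportions.
start = (0.5, 0.2, 0.3)
p, q = allele_freqs(*start)          # p = 0.6, q = 0.4
first = next_generation(*start)      # approximately (p^2, 2pq, q^2)
second = next_generation(*first)     # unchanged up to rounding
```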

Testing for equilibrium

Departures from HWE are generally measured at a given SNP using a $\chi^2$ goodness-of-fit test between the observed and expected genotype counts. The $\chi^2$ statistic is defined as

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{1.3}$$

where $O_i$ and $E_i$ are the observed and expected absolute frequencies of each of the n genotypes in a population at that locus. This test statistic approximately follows a $\chi^2$ distribution with n − 1 degrees of freedom [8], reduced by one for each allele frequency estimated from the same data; for a biallelic locus this leaves a single degree of freedom. A deviation from HWE implies a violation of at least one of the assumptions stated above; it is usually an indication of the presence of population substructure or the occurrence of a genotyping error.
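A minimal sketch of the goodness-of-fit test in equation (1.3) for a biallelic SNP is given below. The function name is hypothetical; the allele frequency is estimated from the observed genotype counts, and the p-value assumes one degree of freedom, for which the upper tail of the $\chi^2$ distribution can be written with the complementary error function.

```python
from math import erfc, sqrt


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic SNP.

    Arguments are the observed genotype counts. Returns (chi2, p_value),
    with the p-value taken from a chi-square distribution with 1 df.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # estimated frequency of the first allele
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip empty classes (monomorphic SNP) to avoid dividing by zero.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value
```

For example, counts of (400, 400, 200) give an estimated allele frequency of 0.6 and expected counts of (360, 480, 160), showing a deficit of heterozygotes relative to HWE.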

1.2.5 Population substructure

Population substructure, also referred to as population admixture or population stratification, is the presence of genetic differences between subpopulations of an apparently homogeneous population due to genetic history (e.g., migration, selection, and/or ethnic integration).

Principal component analysis (PCA) is widely used to detect and visualize hidden population substructure that is not apparent in the data and which may lead to spurious results when analyzing characteristics of the population as a whole [8].

The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables (in the case of an association study, these are the thousands of genetic markers), while retaining as much of the variation present in the data set as possible. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables.

Assuming our markers are biallelic, the data can be seen as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker, there is a reference and an alternative allele. We suppose there are n markers and m individuals, and that the number of markers is much larger than the number of samples, $n \gg m$. Let C(i, j) be the number of reference alleles for marker j in individual i. Thus, for autosomal loci, we have C(i, j) ∈ {0, 1, 2}. For each column of C, we calculate its mean µ(j) and standard deviation σ(j), and obtain a new matrix M,

$$M(i, j) = \frac{C(i, j) - \mu(j)}{\sigma(j)},$$

generally called the variance-standardized genetic relationship matrix. This normalization step is intended to make the markers (co-variables) comparable, setting their mean and variance to 0 and 1, respectively. With this matrix, we can now define

$$X = MM^T,$$

a square m × m matrix, with dimensions equal to the number of sampled individuals. We then compute the eigenvalues of matrix X and the corresponding eigenvectors, which are called the PCs. The eigenvector corresponding to the highest eigenvalue is called the "first principal component", denoted PC1; the eigenvector corresponding to the second highest eigenvalue is PC2; and so on. The variation explained by the PCs decreases, with the first PC explaining the most variation [9].

Plotting PCs against each other can show evidence of population substructure, by clustering the individual data across these new axes of variation. Those PCs which are found to be significant can subsequently be used as co-variables in regression models (see Chapter 3).
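To make the construction of M and X concrete, the sketch below standardizes a small genotype matrix and extracts PC1 by power iteration, a simple way of finding the eigenvector with the largest eigenvalue. It is only an illustration in pure Python with hypothetical function names: real analyses rely on dedicated software and a full eigendecomposition, and the use of the population standard deviation (dividing by m) is an assumption, since the text does not specify it.

```python
import math


def standardize(C):
    """Column-wise standardization: M(i, j) = (C(i, j) - mu(j)) / sigma(j)."""
    m, n = len(C), len(C[0])
    M = [[0.0] * n for _ in range(m)]
    for j in range(n):
        col = [C[i][j] for i in range(m)]
        mu = sum(col) / m
        sigma = math.sqrt(sum((c - mu) ** 2 for c in col) / m)
        if sigma == 0:  # monomorphic marker: uninformative, left as zeros
            continue
        for i in range(m):
            M[i][j] = (col[i] - mu) / sigma
    return M


def first_pc(C, iterations=200):
    """Leading eigenvector of X = M M^T (one coordinate per individual)."""
    M = standardize(C)
    m = len(M)
    X = [[sum(a * b for a, b in zip(M[i], M[k])) for k in range(m)]
         for i in range(m)]
    v = [float(i + 1) for i in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(X[i][k] * v[k] for k in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


# Toy data (made up): 6 individuals x 4 markers, two groups of three.
genotypes = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 1, 0],
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 1, 2],
]
pc1 = first_pc(genotypes)
```

In this toy example the first three individuals receive PC1 coordinates of one sign and the last three of the opposite sign, which is the kind of clustering a PC plot reveals. The overall sign of an eigenvector is arbitrary.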

1.3 Association studies

Variation in a DNA sequence can influence the risk of developing disease. Early studies investigated genetic variants underlying rare conditions that showed clear Mendelian inheritance patterns in families, and turned out to be very successful because these variants carried 100% disease risk [10].

Scientific efforts have been made to gather as much information about human genome variation as possible, allowing for better designed and less costly studies. Such efforts include the Human Genome Project (1990-2003), the International HapMap Project (2002-2009) and, more recently, the 1000 Genomes Project (2008-2015).

Investigating the causes of complex diseases has proven to be a much more difficult task, because there is not one single cause, but rather the combined action of many causal factors, genetic and/or environmental, that predispose to disease development. This means that even variants with a low increased relative risk, when found together in the same genome, may contribute significantly to the manifestation of the disorder in question. Genetic association studies aim to detect such variants involved in complex diseases.

The fundamental idea behind association studies is the comparison of allele or genotype frequencies between cases and controls, in order to relate genetic variants to a certain phenotype (such as a disease); if a particular allele/genotype is more common among cases than controls, it may be a risk factor and may be subject to further study [8].

There are two main hypotheses regarding disease-associated variants: the "common disease common variant" (CDCV) hypothesis, and the "common disease rare variant" (CDRV) hypothesis. These hypotheses take opposing views on the penetrance of the variants involved, i.e., the proportion of individuals carrying a particular allele that also express an associated trait. While the first argues that genetic variations with appreciable frequency in the population, but relatively low penetrance, are the major contributors to genetic susceptibility to common diseases, the second reasons that multiple rare DNA sequence variations, each with relatively high penetrance, are the major contributors. Both hypotheses stand on empirical evidence, and each uses specific methods for association analysis [11].
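The case-control comparison of allele frequencies described in this section can be illustrated with a basic allelic test on a 2 × 2 table of allele counts. The sketch below is a simplified example with a hypothetical function name and made-up counts; it uses the Pearson $\chi^2$ statistic with one degree of freedom and reports the odds ratio as a measure of effect, and it does not adjust for covariates or stratification.

```python
from math import erfc, sqrt


def allelic_association(case_alleles, control_alleles):
    """Allelic chi-square test for one biallelic variant.

    case_alleles    -- (count of allele A, count of allele a) among cases
    control_alleles -- the same two counts among controls
    Returns (chi2, p_value, odds_ratio).
    """
    a, b = case_alleles
    c, d = control_alleles
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom > 0 else 0.0
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")
    return chi2, p_value, odds_ratio


# Made-up counts: allele A is seen on 60 of 200 case chromosomes
# and on 40 of 200 control chromosomes.
chi2, p_value, odds_ratio = allelic_association((60, 140), (40, 160))
```

Here allele A is more frequent among cases (30%) than controls (20%), so the odds ratio is above 1 and the allele would be a candidate risk factor for further study.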

In the early days of association studies, the initial choice was to focus on common variants, mainly because genome-wide surveys of rare variation required many more assays than the arrays available at the time could support. However, there was strong motivation to support the CDRV hypothesis, namely the idea that deleterious variants are likely to be rare due to purifying selection; indeed, loss-of-function variants, which prevent the generation of functional proteins, are especially rare [12].

A number of software tools have been developed to deal with genetic data files, which are very large due to the thousands of variants analyzed in most genetic studies. These programs are mostly command-line based, which makes handling these types of files more computationally efficient than working with GUI-based software; they are also freely available online. One such program is PLINK [6], which provides multiple basic commands, such as calculating allele frequencies, IBD and heterozygosity proportions, converting between multiple file types and performing basic allelic/genotypic chi-squared association tests. It also performs PCA, but a more specialized command-line program for this purpose is EIGENSOFT. Among other tools, this program contains the EIGENSTRAT stratification correction method, which uses principal component analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation.

One last command-line program worth mentioning is ANNOVAR. This tool, given a list of variants and their corresponding genetic coordinates, uses a number of databases to functionally annotate them. The resulting features of each variant include the gene it belongs to, whether or not it falls in a coding region, and whether or not it changes the encoded amino acid, among others. An important feature provided by ANNOVAR is the Combined Annotation Dependent Depletion (CADD) score, a measure of the deleteriousness of SNVs and indels in the human genome. The CADD scores are "PHRED-scaled", meaning that they express the rank of a variant in order-of-magnitude terms rather than as the precise rank itself. For example, variants in the top 10% of CADD scores are assigned to CADD-10, the top 1% to CADD-20, the top 0.1% to CADD-30, and so on. ANNOVAR also provides information from the Genome Aggregation Database (gnomAD) on allelic frequencies in various populations from around the world, among which are Non-Finnish Europeans (NFE), a group that includes the Iberian population.

Several psychiatric disorders, such as schizophrenia [13], depression [14] or Alzheimer's disease (AD) [15], have been found to be polygenic and were associated with several genetic variants. In this work, we describe the general procedures of association studies and the corresponding statistical methods, followed by an application to the case of AD in patients from the Iberian Peninsula.
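The PHRED scaling of CADD scores described above corresponds to $-10 \log_{10}$ of the rank fraction, which reproduces the examples given (top 10% → 10, top 1% → 20, top 0.1% → 30). The sketch below illustrates only this scaling step, with a hypothetical function name; it does not compute the underlying deleteriousness score.

```python
from math import log10


def phred_scaled(rank: int, total: int) -> float:
    """PHRED-scaled score of the variant ranked `rank` out of `total`.

    rank = 1 is the most deleterious variant; a variant exactly at the
    boundary of the top 10% gets 10, of the top 1% gets 20, and so on.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10 * log10(rank / total)
```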

1.4 An introduction to Alzheimer’s disease

1.4.1 Symptoms, causes and available diagnosis

Alzheimer's disease (AD) is the most common type of dementia. It usually begins with subtle memory failure, which worsens over time and begins to affect an individual's daily living. A person suffering from this condition will eventually have trouble recognizing people, naming objects, dealing with everyday chores and personal care, and behaving appropriately in social situations, among others. At an advanced stage of the disease, the patient will require constant care. After the first symptoms appear, an individual usually survives 8 to 10 years, but the course of the disease can last up to 25 years, ending in death by pneumonia, malnutrition or general inanition [16].

There are three stages to AD. The early stage is mild Alzheimer's disease, when a person can still function independently but has a few memory lapses, such as forgetting familiar words or the location of everyday objects. Individuals with mild AD are first diagnosed with a condition called mild cognitive impairment (MCI), whose symptoms are similar to those of the early stage of AD; deciding whether the MCI observed in an individual is due to AD relies on brain imaging and cerebrospinal fluid tests. The middle stage, or moderate Alzheimer's disease, is typically the longest stage; the symptoms become more pronounced and the patient will require more care. The late stage is called severe Alzheimer's disease, when individuals lose the ability to respond to the environment and need constant help in performing daily activities. It is often not easy to place an individual at a specific stage, as the stages may overlap [16].

AD can be classified according to the age of onset. The most common type is late-onset Alzheimer's disease, which constitutes approximately 95% of the cases, and affects individuals whose first symptoms appeared after the age of 65; early-onset Alzheimer's disease affects the other 5% of the cases, for whom the age of onset is below 65 [17].

| Cause | % of cases |
|---|---|
| Late-onset familial | 15-25 |
| Early-onset familial | <2 |
| Down syndrome | <1 |
| Unknown (includes genetic/environment interactions) | ∼75 |

Tbl. 1.7 – Causes of Alzheimer’s disease.

The main causes of AD are described in Table 1.7. Approximately 25% of all AD is familial (i.e., ≥3 persons in a family have AD) and 75% is nonfamilial (i.e., an individual with AD and no known family history of AD); the onset of nonfamilial Alzheimer's disease is usually at an advanced age. Because familial and nonfamilial AD appear to have the same clinical and pathologic phenotypes (observable manifestation), they can only be distinguished by family history and/or by molecular genetic testing [16].

Most cases of early-onset AD are due to genetic factors transmitted from parent to child. Research has shown that this form of the disease mostly results from a variation in one of three genes: APP, PSEN1 or PSEN2. When any of these genes is altered, large amounts of amyloid β-peptide, a toxic protein fragment, are produced in the brain. This peptide builds up to form clumps called "amyloid plaques", characteristic of Alzheimer's disease, which lead to the death of nerve cells and the progressive signs and symptoms of this disorder [16].

Some evidence indicates that essentially all persons with Down syndrome develop the neuropathologic hallmarks of AD after age 40. Down syndrome, a condition characterized by intellectual disability and other health problems, occurs when a person is born with an extra copy of chromosome 21 in each cell. The presumed reason for the association between these two conditions is the lifelong overexpression of APP on chromosome 21, and the resultant overproduction of β-amyloid in the brain [17].

Research has come to support the concept that late-onset Alzheimer's disease is a complex disorder, with many susceptibility genes involved, as well as environmental factors (such as higher education, or exposure to electromagnetic fields [18]).

The gene APOE has been extensively studied and proven to have great influence on the manifestation of AD. APOE is polymorphic, with three major alleles: ε2, ε3 and ε4. The presence of the ε4 allele in the heterozygous (ε3ε4) or homozygous (ε4ε4) state increases the risk for AD threefold and 15-fold, respectively. The APOE ε2 allele has been shown to have a protective effect [16]. APOE alleles are determined by the two SNPs rs429358 and rs7412, as shown in Table 1.8.

| rs429358 | rs7412 | Allele |
|---|---|---|
| C | T | ε1 |
| T | T | ε2 |
| T | C | ε3 |
| C | C | ε4 |

Tbl. 1.8 – APOE allele according to the genotype for SNPs rs429358 and rs7412.
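Table 1.8 amounts to a lookup from the pair of nucleotides carried at rs429358 and rs7412 to an APOE allele. The sketch below encodes this lookup with hypothetical function names; it assumes that the two SNPs have been phased, i.e., that we know which nucleotides sit together on each of the individual's two chromosomes, which unphased genotype data does not always provide.

```python
# (rs429358, rs7412) on one chromosome -> APOE allele, as in Table 1.8.
APOE_ALLELES = {
    ("C", "T"): "ε1",
    ("T", "T"): "ε2",
    ("T", "C"): "ε3",
    ("C", "C"): "ε4",
}


def apoe_allele(rs429358: str, rs7412: str) -> str:
    """APOE allele carried by one chromosome."""
    return APOE_ALLELES[(rs429358.upper(), rs7412.upper())]


def apoe_genotype(haplotype_1, haplotype_2) -> str:
    """APOE genotype from two phased (rs429358, rs7412) haplotypes."""
    alleles = sorted([apoe_allele(*haplotype_1), apoe_allele(*haplotype_2)])
    return "/".join(alleles)
```

For instance, an individual carrying (T, C) on one chromosome and (C, C) on the other has genotype ε3/ε4.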

However, the presence of APOE ε4 does not determine that an individual will develop the disease; in fact, approximately 42% of individuals with AD do not have any APOE ε4 allele. Similarly, the absence of APOE ε4 does not rule out the possibility of developing AD.

Currently, the only definitive way to establish a diagnosis of AD is to microscopically examine a section of the person's brain tissue after death. However, several approaches have proven highly effective in diagnosing Alzheimer's disease in a living patient. The initial step is to consult a specialized doctor (psychiatrist), who will review the individual's medical history and analyze the symptoms, as well as conduct a series of tests of the individual's cognitive and physical abilities. The Mini Mental State Examination (MMSE) is one such test widely used for this purpose. It is not unusual for the doctor to interview friends and family of the patient, to better understand their behavioral changes over time. This series of clinical assessments often provides enough information to reach a correct diagnosis [16]; however, the picture is not always clear, and further, more advanced testing may be required.

Analysis of electroencephalograms (EEGs) can also be used as a means of diagnosis. EEGs are used to register electrical activity in the brain, focusing in particular on spectral measures, which include the classical brainwaves in the delta, theta, alpha, beta and gamma frequencies. Each one of these brainwaves is an endophenotype of Alzheimer's disease. Brainwaves are activated according to our actions, feelings and circadian rhythm, and some disorders may trigger the over-expression or inhibition of a given brainwave. In order to interpret the EEG, it is important to understand which behaviors lead to certain variations in the activity of each brainwave.

Delta (δ, 0.5-4 Hz) waves suspend external awareness and are the source of empathy; they are generated in deep meditation and dreamless sleep, when healing and regeneration processes are triggered. Theta (θ, 4-8 Hz) brainwaves are connected with the learning, memory, and intuition functions. Alpha (α, 8-13 Hz) waves aid overall mental coordination and learning. Beta (β, 13-30 Hz) brainwaves are present when we are alert, engaged in problem solving, judgment, decision making, or focused mental activity; they dominate our normal waking state of consciousness. Gamma (γ, >30 Hz) brainwaves are the fastest of the brainwaves, and relate to the simultaneous processing of information from different brain areas. More detailed information can be found online at https://brainworksneurotherapy.com/what-are-brainwaves.

Other means of diagnosis include laboratory testing, which is usually performed as a way of ruling out conditions that cause symptoms similar to those of Alzheimer's, such as nutritional deficiencies or other diseases that could be affecting the person's memory. These tests make use of blood, urine and cerebrospinal fluid samples [19].

One final diagnosis method worth mentioning is brain-imaging testing, such as computed tomography (CT) or magnetic resonance imaging (MRI) scans. They make it possible to look for evidence of trauma, tumors and stroke that could be causing dementia, and for brain atrophy, a shrinkage that may be present later in the progression of Alzheimer's disease. These tests require that the person remain still for a period of time [19].

All the methods above provide information that makes it possible to rule out a series of conditions that cause symptoms similar to AD. Such conditions are, for instance, past strokes, Parkinson's disease and depression [19].

1.4.2 The association studies approach

The methods of diagnosis described above require that the individual is showing symptoms of the disease. Some of them may be considered invasive, such as lab testing (which requires lumbar puncture and spinal fluid collection); others may be inaccessible to the majority of the population due to their high cost, such as an MRI scan. In addition, imaging techniques may provide poor quality results, as Alzheimer's patients tend to have a hard time staying still even for short periods of time, especially at an advanced stage of the disease.

Genetic testing arises as an alternative to the previously described methods. By identifying alleles which increase the risk of developing the disease, it could be possible to make an early diagnosis, since the genetic material we carry is the same throughout our lifetime (except for mutations that may occur). This way, the disease could be prevented even before the appearance of symptoms, thus prolonging the quality of life of a potential future AD patient. This would also be a cheaper alternative, and can be made less invasive, while maintaining sample quality, through the use of oral swabs (instead of blood collection) to obtain the genetic material.

Chapter 2

Study design and data quality control

Genetic association studies can essentially be divided into candidate gene (CG) and genome-wide association (GWA) studies. CG studies are based on the prior hypothesis of a potential role of selected genes or genetic regions on a specific phenotype or disease, taking into consideration their biological function or association in previous studies. Genome-wide association studies, on the other hand, make use of information on the variation across the entire human genome, and are useful for hypothesis-generating purposes [10].

GWA analyses usually target relatively common SNPs. CG studies, however, focus on the effects of rare variants, which may be hard to detect, especially when dealing with small sample sizes. Besides being more cost effective than sequencing an entire human genome, studying the part that rare variants play on disease has been largely motivated by the CDRV hypothesis [11]. Indeed, if a certain variant has a large deleterious effect, it may also impact fitness and thus become less and less frequent in each generation.

This chapter describes the general procedures of an association study from the study design, through the process of data collection and up to the quality control steps, taking into account which methods suit each type of study, depending on the chosen approach.

2.1 Study design

Describing the phenotype accurately

The phenotype of interest must be defined as accurately and specifically as possible, in a way that minimizes the likely causal heterogeneity based on existing clinical and biological evidence. Such definitions may change as more information becomes available. An accurate definition increases the power to detect an effect and allows for replication studies [10].

Checking disease heritability

Heritability is a measure of how well differences between individuals' genes account for differences in their traits, i.e., how much of the variation in a given trait can be attributed to genetic variation (as opposed to environmental causes) [3].

Heritability is assessed by studying disease patterns in family members, namely by comparing monozygotic with dizygotic twins. Because monozygotic twins are genetically identical (the two alleles at each locus are IBD), while dizygotic twins are expected to share, on average, half of their alleles, comparing disease status in twins can shed light on the role of genetic factors [10].

Diseases which have been shown to have low heritability will likely need very large sample sizes in order to find etiological genetic variants. Moreover, for diseases with heritability close to zero, there is little advantage in conducting a genetic case-control study [10].
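One classical way of turning the twin comparison into a number, which is not given in the text and is shown here only as an illustration, is Falconer's formula: heritability is approximated as twice the difference between the trait correlation in monozygotic twin pairs and that in dizygotic twin pairs. The function name and the input values below are hypothetical.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's estimate: h^2 = 2 * (r_MZ - r_DZ).

    r_mz -- trait correlation between monozygotic twins
    r_dz -- trait correlation between dizygotic twins
    """
    return 2 * (r_mz - r_dz)


# Hypothetical correlations, for illustration only.
h2 = falconer_heritability(r_mz=0.8, r_dz=0.5)
```

The factor of 2 reflects the sharing described above: monozygotic twins share all of their alleles and dizygotic twins half on average, so the difference between the two correlations captures roughly half of the genetic contribution.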

Choosing the best approach to the problem

Concerning sample relatedness, association studies can be sorted into two categories: population-based case-control studies, and family-based studies. The first approach may require several thousands of cases of the phenotype of interest. This number can be decreased, and power of the study increased, by recruiting cases with family history of the condition, or even multiple cases from the same family (adjusting for familial correlation), for a sample with a more homogeneous genetic background; this is called “enrichment sampling” [20]. This sampling method does not always increase power in genetic studies, as familial aggregation may be due to shared environmental factors, for example [10]. Another premise for choosing a population-based case-control approach is to assume that one or more of the underlying genetic variants are common. Moderately rare variants could also be detected, but only if they carry a large effect. A prior hypothesis that all undetected variants are rare and of small effects would require an unfeasibly large sample size, in order to have power to detect the effect of single variants [21]. If the case definition is a phenotype that shows clear segregation in families, then a population-based case-control approach is no longer suitable, and a family-based study is preferable [10].

Control selection

The golden rule of control selection for any case-control study is that cases and controls should belong to the same population, and controls must be representative of the individuals in that population who would have become cases, according to the case definition and the recruitment strategies for the study. This minimizes false positives and confounding [22]. Bias due to environmental factors is generally not a problem in association studies; the most important type of bias is related to the ethnic origin of cases and controls. This is commonly referred to as “population stratification”, and is an example of confounding. Under this situation, differences in allelic frequencies between cases and controls are due to the underlying sampling scheme, rather than an actual effect of the variant on disease risk [23]. The effects of population stratification can sometimes be avoided at the study design level (by matching controls to cases on potentially important confounders) or at the data analysis level (by adjusting the results for these confounders). Matching is only essential when the effect of the confounder cannot be accurately measured or is too large to be adjusted for in the analysis [24]. Population stratification is minimized when controls are matched to cases on ethnicity, or when the sample is restricted to a particular ethnic group. Further matching on sex can reduce population stratification in situations where there are gender differences in disease prevalence. Matching on age may improve power of the study by ensuring that controls had the same opportunity as cases to develop (and be diagnosed with) the disease. This is particularly relevant when dealing with age-related diseases such as Alzheimer’s disease. Whether or not further matching is necessary and decreases population stratification will depend on the disease in question [10]. Remaining stratification can be investigated and controlled (to some extent) by analytical methods [25, 26].
As a method of control selection, GWA studies often resort to banks of already genotyped shared healthy controls, mainly because this is a much more economical approach. It is important that basic characteristics of such panels are known, such as ethnicity, sex, age and area of recruitment, so that they can be matched on in the design or adjusted for in the analysis [10].

The described methods of control selection are specific to studies intended to assess genetic risk, and are no longer suitable if environmental factors are incorporated [10].

Sample size

Sample sizes for each study will depend on the existence of case sub-groups and a priori hypotheses to be tested, on whether it is a CG or a GWA approach, among (many) other factors. Estimating the required sample size often relies on empirical results from simulation studies [10].

The lack of availability of genetic information from cases for an association study often relates to economic issues. When testing many SNPs, a one-stage design can be very expensive, so one can resort to a multi-stage design, where all SNPs are tested in a random subset of cases and controls, and those found significant are taken through to be tested in the remainder of the study sample [27]. The power of a study can also be potentially improved with an increased control/case ratio [24].

Replication studies

Theoretical considerations prove that, when true discovery is claimed based on crossing a threshold of statistical significance and the discovery study is underpowered, the observed effects are expected to be inflated. Furthermore, flexible analyses coupled with selective reporting may inflate the published discovered effects. Therefore, a study designed to replicate a finding should base sample size calculations on smaller effect sizes [28].

A true replication study must be performed on a population comparable to the original, i.e., it must involve the analysis of the same polymorphism in the same direction of the effect, in the same ethnic population measured on the same phenotype. Failure to replicate findings in a different population does not allow judgement of the validity of the results in the original study; it can only shed light on the lack of effect in the second population [10].

2.2 Data collection and variant calling

Once the study has been designed, the next step is collecting the data for analysis. Genotyping individuals in association studies is usually done with DNA microarrays. These consist of specific DNA sequences (known as probes) corresponding to a short section of a gene or other DNA sequence of the human genome. Probes are usually 100 to 10 000 bases long and fluorescently labeled. Among the manufacturers of DNA microarrays was Affymetrix, Inc., a company now owned by Thermo Fisher Scientific. This company developed the GeneChip array technology and the Affymetrix Power Tools (APT), which can be used for variant calling, quality control and genotyping. A GeneChip array can contain up to thousands of DNA probes, designed to vary in specific locations matching those of known human genome variation. When the probes and a DNA sample are placed in the same environment, the DNA breaks up into fragments which attach to the corresponding probe in a process called hybridization, issuing a measurable fluorescent signal that allows the nucleotide sequence in each fragment to be identified and thus the DNA sample sequence to be determined. Two important measures to consider when assessing the quality of the variant calling process are the dish quality control (DQC) and the quality control call rate (QCCR). DQC is a measure of the contrast between the adenine-thymine (AT) and cytosine-guanine (CG) signals, and is defined as

$$\mathrm{DQC} = \frac{\text{AT Signal} - \text{CG Signal}}{\text{AT Signal} + \text{CG Signal}}$$

QCCR is the proportion of non-missing data for each individual. The “Axiom Genotyping Analysis Guide” by Affymetrix provides guidelines for these measures, and any subject falling below these values should be eliminated from further study. Their best practices guide can be found online at https://assets.thermofisher.com/TFS-Assets/LSG/manuals/axiom_genotyping_solution_analysis_guide.pdf. Another relevant measure when doing probe QC is the heterozygosity rate. When this value is too high (usually higher than µ + 3σ) for a given individual, it could hint at sample contamination; when it is too low (below µ − 3σ), it could mean that there are related individuals in the sample. In any case, it is recommended that the individuals falling outside this interval be discarded from further analysis [5]. One final quality control step for variants before genotyping is to sort them into categories according to allelic intensities. For each probe, let alleles A and B correspond to the two possible bases at that locus, A being whichever comes first alphabetically; for example, for a [C/T] SNP, the (CC), (CT) and (TT) genotypes are named (AA), (AB) and (BB), respectively. The intensities of alleles A and B are calculated, and using the obtained $A_{signal}$ and $B_{signal}$ values, we plot X against Y, where

$$X = \text{Contrast} = \log_2(A_{signal}) - \log_2(B_{signal})$$

and

$$Y = \text{Size} = \frac{\log_2(A_{signal}) + \log_2(B_{signal})}{2}$$

The obtained values for these measures for each individual allow sorting each probe into one of 7 categories: “Poly High Resolution”, “Mono High Resolution”, “No Minor Homozygote”, “Hemizygous”, “Off-Target Variants”, “Call Rate Below Threshold” and “Other”. The first four categories contain high quality variants. “Poly High Resolution” variants are characterized by three very clearly defined clusters, one for each of the three possible genotypes. “Mono High Resolution” is a category for variants with a single cluster corresponding to one of the states. “No Minor Homozygote” refers to variants which show only two possible states: one homozygous and one heterozygous. “Hemizygous” variants are the ones present in the sex chromosomes. Variants classified as “Off-Target” are usually distributed across more than three clusters. Variants with “Call Rate Below Threshold” have well defined clusters, but the missing data rate is too high for these variants to be considered in further study. Variants with other characteristics fall in the “Other” category. Affymetrix does not recommend proceeding with variants falling in any of the last three categories, and these are therefore discarded. The final step, after removing all “problematic” variants, is to genotype the remaining ones, which can be done using APT in the command line.
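The intensity-based measures of this section are simple to compute once the signals are available. The following Python sketch is only illustrative: the function names, the use of the population standard deviation and the toy values are assumptions, and in practice these quantities are produced by APT rather than computed by hand.

```python
import math
import statistics

def dish_qc(at_signal, cg_signal):
    """DQC: contrast between the AT and CG signals."""
    return (at_signal - cg_signal) / (at_signal + cg_signal)

def contrast_and_size(a_signal, b_signal):
    """Return (X, Y) = (contrast, size) for one probe in one sample."""
    log_a, log_b = math.log2(a_signal), math.log2(b_signal)
    return log_a - log_b, (log_a + log_b) / 2

def heterozygosity_outliers(het_rates, k=3.0):
    """Indices of samples whose heterozygosity lies outside mean +/- k sd."""
    mu = statistics.mean(het_rates)
    sigma = statistics.pstdev(het_rates)
    return [i for i, h in enumerate(het_rates) if abs(h - mu) > k * sigma]

# Toy values (hypothetical): 20 typical samples and one with a very high rate.
rates = [0.30] * 20 + [0.60]
flagged = heterozygosity_outliers(rates)      # index of the outlying sample
x, y = contrast_and_size(1024.0, 256.0)       # contrast 2.0, size 9.0
```

Note that, with the µ ± 3σ rule, a single outlier can only be flagged in reasonably large samples, since one extreme value also inflates σ.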

2.3 Variant quality control

Before conducting the actual “association” part of the association study, one has yet to control the genotyped data for its quality, namely by testing whether the “same-population sampling” process was successful. These quality control (QC) steps aim to remove individuals and/or markers which may contain high error rates, as they could introduce bias to the study and increase false-positive and false-negative rates. In what follows, we will introduce some standard QC steps in both GWA and CG studies.

2.3.1 Genome-wide association studies

It is not uncommon for GWA studies to test many thousands and even millions of SNPs for association, hence even a low error rate can be detrimental. However, each marker removed is a potentially overlooked disease association, which can be more impactful for the final results than removing a handful of individuals. Therefore, in order to maximize the number of markers that remain in the study, QC steps are taken on a ’per-individual’ basis prior to a ’per-marker’ basis [5].

Per-individual quality control

Quality control in subjects of the study consists of four main steps:

1. Identification of individuals with discordant sex information. The best way to detect discrepancies between an individual’s genotype information and his/her assigned sex is by comparing the homozygosity rate across all X-chromosome SNPs for each individual in the sample with the expected rate. Males are expected to have a homozygosity rate of 1, and females less than 0.2 [5]. Individuals found to have discordant sex information should be removed from further analysis, unless the sample can be correctly identified using existing genotype data or it can be confirmed that sex was recorded incorrectly.

2. Identification of individuals with outlying missing genotype/heterozygosity rates. Genotype failure and heterozygosity rates per individual have been used as measures of the quality of DNA samples. As a guideline, individuals with more than 3-7% missing genotypes have been removed in GWA studies [29, 30]. An excessive proportion of heterozygous genotypes in an individual may indicate DNA sample contamination; a low proportion of heterozygotes, in turn, may constitute evidence of inbreeding [5].

3. Identification of duplicate or related individuals. If duplicates or relatives are present, a bias may be introduced to the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies in the entire population [5]. Measures of relatedness are used to identify such cases:

• Identity by state (IBS): A value calculated for each pair of individuals based on the average proportion of alleles shared at genotyped SNPs (note that duplicate individuals will have an IBS of 1). This measure works best with independent SNPs, which is why selected regions are pruned, so that no SNPs in a given window are correlated.

• Identity by descent (IBD): A degree of recent shared ancestry for a pair of individuals. Duplicates are expected to have IBD=1; first-degree relatives, IBD=0.5; second-degree relatives, IBD=0.25; and so on. Due to genotyping errors, LD or population structure, an IBD value for a pair of subjects higher than 0.98 is enough to consider them duplicates. The usual procedure is to remove one individual from each pair with IBD>0.1875 [5], which is halfway between the expected IBD for second- and third-degree relatives.

4. Identification of individuals of divergent ancestry. The main source of confounding in association studies is the existence of population stratification: differences found between cases and controls will be due to diverse ancestries, rather than underlying differences directly related to disease status [23]. Even when drawing cases and controls from the same population, genetic substructure may still be present, and confounding occurs when that substructure is not equally distributed between the two phenotypes. The most common method for identifying ancestry differences is PCA (see Section 1.2.5 for more details on this subject).
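Two of the per-individual checks above, the missing genotype rate and the pairwise IBS, can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the cited studies: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with `None` for a failed call, the 5% missingness threshold is one hypothetical choice inside the 3-7% range quoted above, and all function names are made up for the example.

```python
def missing_rate(genotypes):
    """Fraction of missing calls (None) for one individual."""
    return sum(g is None for g in genotypes) / len(genotypes)

def ibs(g1, g2):
    """Average proportion of alleles shared identical by state by two
    individuals, over the SNPs successfully genotyped in both."""
    shared = [(2 - abs(a - b)) / 2 for a, b in zip(g1, g2)
              if a is not None and b is not None]
    return sum(shared) / len(shared)

def individuals_to_remove(genotypes_by_id, max_missing=0.05):
    """Identifiers of individuals whose missing rate exceeds the threshold."""
    return sorted(i for i, g in genotypes_by_id.items()
                  if missing_rate(g) > max_missing)

# Toy data (hypothetical): s1 and s2 are duplicates, s3 has a failed call.
sample = {
    "s1": [0, 1, 2, 1],
    "s2": [0, 1, 2, 1],
    "s3": [2, None, 0, 1],
}
removed = individuals_to_remove(sample)        # only s3 exceeds 5% missingness
duplicate_ibs = ibs(sample["s1"], sample["s2"])  # duplicates share all alleles
```

Estimating IBD from IBS additionally requires population allele frequencies and is usually delegated to dedicated software, so it is left out of this sketch.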

Per-marker quality control

Criteria to filter out SNPs differ from study to study; nevertheless, when filtering out SNPs, one must always keep in mind that a potentially disease-associated variant may be discarded. Quality control in genotyped markers consists of four main steps:

1. Identification of SNPs with an excessive missing genotype rate. Usually, markers with a call rate lower than 95% (i.e., markers for which more than 5% of individuals were not successfully genotyped) are removed from further study [30, 31]. For low frequency markers, higher thresholds have been defined, so as not to lose potentially crucial information from rare variants [29].

2. Identification of SNPs demonstrating a significant deviation from the Hardy-Weinberg equilibrium (HWE). Markers which show large deviation from HWE could hint at genotyping or genotype calling errors; however, departure from HWE can also indicate selection, so a case sample can show deviations from HWE at disease associated loci [32]. For this reason, only control samples should be tested for deviations from HWE.

3. Identification of SNPs with significantly different missing genotype rates between cases and controls. This reduces confounding and removes poorly genotyped SNPs. When cases and controls come from several different sources, it is wiser to test for significant differences in call rate, allele frequency and genotype frequency between the various groups, to make sure that it is fair to treat the combined set as homogeneous [5].

4. Removal of all markers with a very low minor allele frequency (MAF). Since power to detect association at rare variants is very low [33], targeting common variants does not overly impact GWA studies. Typically, SNPs with 1-2% or lower MAF are removed, but higher thresholds may be set when working with small sample sizes [5].
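The per-marker filters above reduce to a few summary statistics per SNP. The Python sketch below is illustrative only: genotypes are assumed to be coded as 0, 1 or 2 copies of allele A with `None` for a missing call, the HWE check uses the usual 1 df chi-squared goodness-of-fit test (one common choice, the text does not prescribe a specific test), and the function names are hypothetical.

```python
import math

def call_rate(genotypes):
    """Proportion of individuals successfully genotyped at one marker."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """MAF computed from the called genotypes of one marker."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def hwe_chi2(n_aa, n_het, n_AA):
    """Chi-squared statistic and p-value (1 df) for deviation from HWE,
    computed from genotype counts (controls only, as recommended above)."""
    n = n_aa + n_het + n_AA
    p = (2 * n_AA + n_het) / (2 * n)      # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:                  # monomorphic marker: nothing to test
        return 0.0, 1.0
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    observed = (n_aa, n_het, n_AA)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-squared variable with 1 df.
    return stat, math.erfc(math.sqrt(stat / 2))

in_equilibrium = hwe_chi2(25, 50, 25)     # counts match HWE exactly
no_heterozygotes = hwe_chi2(50, 0, 50)    # strong deviation from HWE
```

A marker would then be kept only if its call rate, MAF and HWE p-value in controls all pass the thresholds chosen for the study.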

2.3.2 Candidate-gene studies

CG association studies work with far fewer SNPs, so many of the quality control steps used in GWAS cannot be undertaken. Using fewer SNPs greatly reduces our ability to get accurate estimates of DNA quality, population ancestry and familial relationships with other subjects. One should still attempt to identify and remove individuals with very low call rates, which are typically lower in CG studies than in GWAS, due to the reduced number of genotyped SNPs. However, excluding markers with a high failure rate may seriously impair a CG study, due to SNPs being chosen based on their “tagging” properties [34]; in this situation, it is advisable to return to the design stage and select a different tagSNP [5]. Detection of deviations from the Hardy-Weinberg equilibrium in controls is still relevant in CG studies, for genotype quality checking purposes [5].

Chapter 3

Models of association

Association studies aim to predict a phenotype based on individuals’ characteristics, with the particularity of dealing with an abnormally large number of predictor variables: the genetic markers. When dealing with quantitative traits, such as height or blood pressure, linear regression models are applied. The type of phenotype we are most interested in is the presence or absence of a disease, which is the focus of a case-control study; this is an example of a qualitative trait, for which logistic regression is used to predict the disease probability. Disease risk is expected to be modified by environmental effects, such as epidemiological risk factors (e.g., gender), clinical variables (such as disease severity and age of onset) and population stratification (measured in principal components capturing variation due to differential ancestry), but also by the interactive and joint effects of genetic factors [35]. The challenge is to determine the set of genetic variants which are most likely to yield significant results when used as covariates. Depending on the minor allele frequency (MAF) of the variants of interest to be tested, different methods have been developed.

3.1 Statistical tests for common variants

The main idea when looking for association between a certain allele or genotype and a (binary) phenotype is to find differences in frequencies of said genotypes between affected and unaffected individuals. This is feasible when dealing with common variants; other methods should be used with rare variants [12]. If we look at disease status and allelic/genotypic status as two categorical variables, then these comparisons can be done with simple statistical tests on contingency tables (Table 3.1). Such tests include the Cochran-Armitage trend test [36, 37], Pearson’s Chi-squared test [38] and Fisher’s exact test [39], which are widely used when looking for evidence of common-variant association. For quantitative traits, the Wald statistical test is usually employed [40].

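(a)

| | a | A | Sum |
|---|---|---|---|
| Cases | $m_{11}$ | $m_{12}$ | $m_{1\cdot}$ |
| Controls | $m_{21}$ | $m_{22}$ | $m_{2\cdot}$ |
| Sum | $m_{\cdot 1}$ | $m_{\cdot 2}$ | $m$ |

(b)

| | (aa) | (Aa) | (AA) | Sum |
|---|---|---|---|---|
| Cases | $n_{11}$ | $n_{12}$ | $n_{13}$ | $n_{1\cdot}$ |
| Controls | $n_{21}$ | $n_{22}$ | $n_{23}$ | $n_{2\cdot}$ |
| Sum | $n_{\cdot 1}$ | $n_{\cdot 2}$ | $n_{\cdot 3}$ | $n$ |

Tbl. 3.1 – (a) Contingency table of allele counts; (b) Contingency table of genotype counts.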

Let us consider a genetic marker under study: a single biallelic locus with possible alleles A and a (and possible genotypes (AA), (Aa) and (aa)). We define a penetrance parameter γ (γ > 1) of the disease relating to a certain allele or genotype, which is associated with the proportion of individuals carrying that particular variant that also express the disease. Models for disease penetrance include the additive model, the multiplicative model, the common dominant model and the common recessive model. We assume, without loss of generality, that A is the presumed risk allele being tested for. An additive model indicates that the risk of disease increases by γ-fold for an individual with genotype (Aa), and by 2γ-fold with genotype (AA), relative to an (aa) genotype (this is equivalent to assuming a co-dominant model); a multiplicative model indicates that each additional A allele increases disease risk by γ-fold; a dominant model indicates that one copy of allele A is sufficient to increase disease risk by γ-fold, so (Aa) and (AA) genotypes can be grouped into a single category; as for the recessive model, it indicates that two copies of the risk allele are necessary to increase disease risk by γ-fold, so in this case it is the (aa) and (Aa) genotypes that are grouped [35].
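The four penetrance models can be summarized by the relative risk they assign to each genotype. The short Python sketch below simply tabulates the definitions given above; the function name and the dictionary layout are hypothetical, chosen only for illustration.

```python
def relative_risks(model, gamma):
    """Relative risks of genotypes (aa), (Aa), (AA) with respect to (aa),
    for a penetrance parameter gamma > 1 and risk allele A."""
    table = {
        "additive":       (1.0, gamma, 2 * gamma),
        "multiplicative": (1.0, gamma, gamma ** 2),
        "dominant":       (1.0, gamma, gamma),     # (Aa) and (AA) grouped
        "recessive":      (1.0, 1.0, gamma),       # (aa) and (Aa) grouped
    }
    return table[model]

risks = {model: relative_risks(model, 1.5)
         for model in ("additive", "multiplicative", "dominant", "recessive")}
```

Reading the tuples side by side makes clear that the models only differ in how the heterozygote (Aa) and the homozygote (AA) are treated relative to each other.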

3.1.1 Chi-squared test

Under the null hypothesis of no association, we expect the relative allele or genotype frequencies to be the same in case and control groups. A test of association is thus given by a simple chi-squared test for independence of rows and columns of the underlying contingency table (note that when doing genotype comparisons, the contingency table is 2 × 3, hence resulting in a 2 df test; under the dominant or recessive models, or with allele comparisons, the table’s dimensions are 2 × 2, producing a 1 df test).

Considering an m × n contingency table where nij is the value for count in cell (i, j), the chi-squared test statistic is thus given by

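$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{[n_{ij} - E(n_{ij})]^2}{E(n_{ij})}$$

where $n_{i\cdot} = \sum_j n_{ij}$, $n_{\cdot j} = \sum_i n_{ij}$, $n = \sum_i \sum_j n_{ij}$ and $E(n_{ij}) = \frac{n_{i\cdot} n_{\cdot j}}{n}$. Under the null hypothesis, this statistic approximately follows a $\chi^2$ distribution with $(m - 1)(n - 1)$ degrees of freedom.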
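As a worked illustration of the statistic above, the following Python sketch computes it for an arbitrary contingency table given as a list of rows. The counts are made up, the function names are hypothetical, and the p-value helper only covers the 1 df and 2 df cases relevant here, for which the chi-squared survival function has a simple closed form.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom of a table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def chi2_p_value(stat, df):
    """Upper-tail probability of a chi-squared variable with 1 or 2 df."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("only 1 and 2 degrees of freedom are handled in this sketch")

# Hypothetical allele counts (2 x 2) and genotype counts (2 x 3).
allelic = chi_squared([[30, 20], [20, 30]])            # 1 df test
genotypic = chi_squared([[10, 20, 30], [30, 20, 10]])  # 2 df test
p_allelic = chi2_p_value(*allelic)
```

For the allelic table every expected count is 25, so the statistic is 4 with 1 df, which corresponds to a p-value of roughly 0.046.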

The chi-squared distribution is a good approximation to the sampling distribution of the test statistic when dealing with large samples; with small sample sizes, it is usual to apply Fisher’s exact test instead, which allows the exact significance of the deviation from the null hypothesis to be calculated. This test is only feasible by hand in the case of 2 × 2 contingency tables (due to the single df); computational methods have been developed to extend the test to the general case of an m × n table.

Considering the contingency table 3.1 (a), the probability of observing such an arrangement of the sample, under the null hypothesis of no association between disease status and allelic distribution, can be obtained as such:

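$$p = \frac{\binom{m_{1\cdot}}{m_{11}} \binom{m_{2\cdot}}{m_{21}}}{\binom{m}{m_{\cdot 1}}} = \frac{\binom{m_{1\cdot}}{m_{12}} \binom{m_{2\cdot}}{m_{22}}}{\binom{m}{m_{\cdot 2}}} = \frac{m_{1\cdot}!\, m_{2\cdot}!\, m_{\cdot 1}!\, m_{\cdot 2}!}{m!\, m_{11}!\, m_{12}!\, m_{21}!\, m_{22}!}$$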
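The probability above is that of a single table. A p-value is then obtained by summing the probabilities of all tables with the same margins that are at least as unlikely as the observed one; this is the usual two-sided convention, stated here as an assumption since the text does not spell it out. A minimal Python sketch, with hypothetical function names and toy counts:

```python
from math import comb

def table_probability(m11, m12, m21, m22):
    """Probability of a 2 x 2 table given its margins, under no association."""
    row1, row2, col1 = m11 + m12, m21 + m22, m11 + m21
    return comb(row1, m11) * comb(row2, m21) / comb(row1 + row2, col1)

def fisher_two_sided(m11, m12, m21, m22):
    """Sum of the probabilities of all tables with the observed margins
    that are no more probable than the observed table."""
    row1, row2, col1 = m11 + m12, m21 + m22, m11 + m21
    p_observed = table_probability(m11, m12, m21, m22)
    total = 0.0
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_k = table_probability(k, row1 - k, col1 - k, row2 - (col1 - k))
        if p_k <= p_observed * (1 + 1e-9):   # tolerance for rounding error
            total += p_k
    return total

p_table = table_probability(3, 1, 1, 3)   # 16/70
p_value = fisher_two_sided(3, 1, 1, 3)    # 34/70
```

With row and column totals all equal to 4, the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value of the observed table is 34/70.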

3.1.2 Cochran-Armitage trend test

Any penetrance model specifying some kind of trend in risk with increasing numbers of A alleles can be examined using the Cochran-Armitage trend test. This method modifies chi-squared tests to incorporate a suspected order in the effects of the exposure categories. This test has been shown to improve power relative to the chi-squared test, as long as the disease risks associated with the (Aa) genotype are intermediate to those associated with the (AA) and (aa) genotypes.

A Cochran-Armitage trend test of association between disease and a marker is given by

$$T^2 = \frac{\left[ \sum_{i=1}^{3} w_i (n_{1i} n_{2\cdot} - n_{2i} n_{1\cdot}) \right]^2}{\frac{n_{1\cdot} n_{2\cdot}}{n} \left[ \sum_{i=1}^{3} w_i^2 n_{\cdot i} (n - n_{\cdot i}) - 2 \sum_{i=1}^{2} \sum_{j=i+1}^{3} w_i w_j n_{\cdot i} n_{\cdot j} \right]}$$

which has a $\chi^2$ distribution with 1 df under the null hypothesis of no association.

In the equation above, w = (w1, w2, w3) are weights chosen to detect particular types of association, where w1, w2 and w3 refer, respectively, to genotypes (aa), (aA) and (AA). To test whether allele A is dominant over a, we use the vector of weights w = (0, 1, 1). To test if A is recessive to a, we use w = (0, 0, 1). In genetic association studies, the most widely used weights are w = (0, 1, 2), to test for an additive effect of allele A.
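The trend statistic can be computed directly from the genotype counts of Table 3.1 (b). The Python sketch below follows the formula term by term; the function name and the counts are hypothetical.

```python
def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic T^2 from genotype counts ordered
    as (aa), (Aa), (AA); weights (0, 1, 2) test for an additive effect."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    cols = [a + b for a, b in zip(cases, controls)]
    numerator = sum(weights[i] * (cases[i] * n2 - controls[i] * n1)
                    for i in range(3)) ** 2
    bracket = sum(weights[i] ** 2 * cols[i] * (n - cols[i]) for i in range(3))
    bracket -= 2 * sum(weights[i] * weights[j] * cols[i] * cols[j]
                       for i in range(3) for j in range(i + 1, 3))
    return numerator / (n1 * n2 / n * bracket)

cases, controls = (10, 20, 30), (30, 20, 10)
additive = cochran_armitage(cases, controls)                     # w = (0, 1, 2)
dominant = cochran_armitage(cases, controls, weights=(0, 1, 1))
recessive = cochran_armitage(cases, controls, weights=(0, 0, 1))
```

For these counts the dominant weights give 15, the same value as the 1 df chi-squared statistic of the collapsed 2 × 2 table with (Aa) and (AA) grouped, which is a convenient sanity check.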

3.2 Rare variant association approaches

Rare variants, typically defined as having a minor allele frequency (MAF) of less than approximately 0.01, could play an important role in the etiology of complex traits, as well as account for missing heritability unexplained by common variants. However, association tests similar to the ones performed on common variants, such as single-variant tests, show little power when performed on rare variants. As such, several methods [41] have been developed for the specific case of rare-variant association, focusing on the cumulative effects of rare variants in specific genetic regions, such as genes. These tests can essentially be labeled as burden and nonburden, each with their own strengths and weaknesses. In this section, we present both approaches, as well as a third that combines the two methods in an optimized variable, as described by Lee et al. [42].

Let us assume that n subjects are sequenced for m variants in a given genetic region. For the ith subject, let $y_i$ denote a dichotomous phenotype, $G_i = (g_{i1}, \ldots, g_{im})^T$ the genotypes of the m variants ($g_{ij} = 0, 1, 2$) and $X_i = (x_{i1}, \ldots, x_{is})^T$ the remaining covariates. Assuming, without loss of generality, an additive model and a binary outcome (results are similar for quantitative traits), we consider the following logistic regression model:

$$\mathrm{logit}(\pi_i) = \gamma_0 + \gamma_1^T X_i + \beta^T G_i \tag{3.1}$$

where $\pi_i$ is the disease probability, $\gamma_1$ is an $s \times 1$ vector of regression coefficients of the covariates, and $\beta = (\beta_1, \ldots, \beta_m)^T$ is an $m \times 1$ vector of regression coefficients of the genetic variants. Suppose that $\hat{\pi}_i$ is the estimated probability of the outcome $y_i$ of individual i, under the null hypothesis $H_0: \beta = 0$; hence $\hat{\pi}_i$ can be calculated by fitting the null model

$$\mathrm{logit}(\pi_i) = \gamma_0 + \gamma_1^T X_i \tag{3.2}$$

We define the statistic of the marginal model for variant j as

$$S_j = \sum_{i=1}^{n} g_{ij} (y_i - \hat{\pi}_i)$$

Note that Sj is positive when variant j is associated with increased disease risk and negative when variant j is associated with decreased risk. The standard m degrees of freedom (df) test for no genetic association has little statistical power when m is large. Burden and non-burden tests attempt to reduce the number of dfs and thus increase analysis power.

3.2.1 Burden testing

Burden tests collapse rare variants of a genetic region into a single burden variable, and then regress the phenotype on that variable to test for the cumulative effects of variants in the region.

Burden tests treat the $\beta_j$s as the same up to a weight function, i.e., $\beta_j = w_j \beta_c$, where $w_j$ is the weight function that may depend on the properties of the jth variant (such as its MAF, for instance). Then, equation 3.1 becomes

$$\mathrm{logit}(\pi_i) = \gamma_0 + \gamma_1^T X_i + \beta_c \left( \sum_{j=1}^{m} w_j g_{ij} \right) \tag{3.3}$$

and the association between the m genetic variants and a dichotomous trait can be tested through the one-df test, $H_0: \beta_c = 0$. Some simple algebraic manipulation yields the following definition of the burden score statistic:

$$Q_B = \left( \sum_{j=1}^{m} w_j S_j \right)^2 \tag{3.4}$$

The main weakness of this test is the assumption that all variants in a region are causal and affect the phenotype in the same direction and with similar magnitudes, leading to a substantial loss of power when any of these assumptions are violated. Hence, finding signal for association in genetic regions containing protective and also risk variants, and even noncausal variants, will be compromised from the start.
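To make the score statistics $S_j$ and the burden statistic $Q_B$ concrete, here is a minimal Python sketch. It assumes a null model without covariates, in which case the fitted probability of every individual is simply the proportion of cases; with covariates, the null model (Equation 3.2) would have to be fitted by logistic regression instead. The function names and the toy data are hypothetical.

```python
def score_statistics(G, y):
    """S_j = sum_i g_ij (y_i - pi_hat_i) for each variant j, assuming an
    intercept-only null model, so pi_hat_i is the proportion of cases."""
    pi_hat = sum(y) / len(y)
    n_variants = len(G[0])
    return [sum(row[j] * (yi - pi_hat) for row, yi in zip(G, y))
            for j in range(n_variants)]

def burden_statistic(S, w):
    """Q_B: square of the weighted sum of the per-variant score statistics."""
    return sum(wj * sj for wj, sj in zip(w, S)) ** 2

# Toy data: 4 individuals (rows) by 2 rare variants (columns); cases first.
G = [[1, 1],
     [1, 0],
     [0, 0],
     [0, 0]]
y = [1, 1, 0, 0]
S = score_statistics(G, y)            # both variants are enriched in cases
Q_B = burden_statistic(S, [1.0, 1.0])
```

Both score statistics are positive here (1.0 and 0.5), so they reinforce each other in $Q_B$; the p-value computation is not covered by this sketch.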

3.2.2 Sequence Kernel Association Test (SKAT)

SKAT is a nonburden test. It assumes that the $\beta_j$s in Equation 3.1 are independent, and follow an arbitrary distribution with mean 0 and variance $w_j^2 \tau$. The null hypothesis $H_0: \beta = 0$ in the model in Equation 3.1 is equivalent to the hypothesis $H_0: \tau = 0$. Hence, SKAT is a variance-component test under the induced logistic mixed model. Specifically, under the logistic model (Equation 3.1), the SKAT statistic can be written as

$$Q_S = (y - \hat{\pi})^T K (y - \hat{\pi}) \tag{3.5}$$

where $\hat{\pi} = (\hat{\pi}_1, \ldots, \hat{\pi}_n)^T$ is a vector of the estimated probability of the outcome $y = (y_1, \ldots, y_n)^T$ under the null model (Equation 3.2), and $K = G W W G^T$ is an $n \times n$ kernel matrix, where $G = (G_1, \ldots, G_n)^T$ is an $n \times m$ genotype matrix and $W = \mathrm{diag}(w_1, \ldots, w_m)$ is an $m \times m$ diagonal weight matrix. The SKAT statistic can be simplified as the weighted sum of the individual SNP score statistics as

$$Q_S = \sum_{j=1}^{m} w_j^2 S_j^2 \tag{3.6}$$

which asymptotically follows a mixture of chi-square distributions.

The weight function $w_j$ can be flexibly chosen using the observed data. For example, the beta density function of MAF can be used as a weight function in which $w_j = \mathrm{Beta}(p_j, a_1, a_2)$, where $p_j$ is the estimated MAF for SNP j using all cases and controls, and the parameters $a_1$ and $a_2$ are prespecified (note that for $a_1 = a_2 = 1$, all considered variants are weighted equally; for $a_1 = 1$ and $a_2 > 1$, rare variants are more highly weighted, and there is a loss of power for association with common variants).

A quick comparison of Equations 3.4 and 3.6 shows that, because SKAT collapses $S_j^2$ instead of $S_j$, as is done in burden tests, SKAT is robust to groupings that include both variants with positive and negative effects.
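The contrast between Equations 3.4 and 3.6 can be checked numerically. The Python sketch below assumes an intercept-only null model (so that the fitted probability is the proportion of cases) and uses toy data in which one variant is carried only by cases and the other only by controls. The function names are hypothetical, and the default beta parameters (1, 25) are merely an example of a choice that up-weights rare variants.

```python
import math

def score_statistics(G, y):
    """Per-variant score statistics under an intercept-only null model."""
    pi_hat = sum(y) / len(y)
    return [sum(row[j] * (yi - pi_hat) for row, yi in zip(G, y))
            for j in range(len(G[0]))]

def beta_weight(maf, a1=1.0, a2=25.0):
    """Beta(a1, a2) density evaluated at the MAF of a variant."""
    norm = math.gamma(a1) * math.gamma(a2) / math.gamma(a1 + a2)
    return maf ** (a1 - 1) * (1 - maf) ** (a2 - 1) / norm

def burden_statistic(S, w):
    return sum(wj * sj for wj, sj in zip(w, S)) ** 2

def skat_statistic(S, w):
    return sum((wj * sj) ** 2 for wj, sj in zip(w, S))

# Variant 1 is seen only in cases, variant 2 only in controls.
G = [[1, 0],
     [1, 0],
     [0, 1],
     [0, 1]]
y = [1, 1, 0, 0]
S = score_statistics(G, y)             # opposite signs: [1.0, -1.0]
flat = [1.0, 1.0]
Q_B = burden_statistic(S, flat)        # the two signals cancel out
Q_S = skat_statistic(S, flat)          # the two signals add up
```

With equal weights the burden statistic is exactly zero while the SKAT statistic is 2, which illustrates why SKAT retains power when risk and protective variants share a region.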

3.2.3 SKAT-O: Optimal Unified Association Test

With the previous discussions in mind, it seems that burden tests are not powerful when the target region has many noncausal variants or when causal variants have different directions of association, whereas SKAT is powerful in these situations. However, if the target region has a high proportion of causal variants with effects in the same direction, burden tests can be more powerful than SKAT. Because prior biological functions of the target variants are often unknown, and due to the variety of genetic mechanisms across the genome, a test that is optimal in both scenarios is of substantial interest. The proposed unified test is quite literally a weighted average of the SKAT and the burden test statistics:

$$Q_\rho = \rho Q_B + (1 - \rho) Q_S, \quad 0 \le \rho \le 1$$

One can easily see that this new class of tests holds the burden test and SKAT as special cases (when ρ = 1 and ρ = 0, respectively). One can also show that the unified test is equivalent to a generalized SKAT, derived as the variance component score statistic assuming the regression coefficients $\beta_j$ in Equation 3.1 follow an arbitrary distribution with mean 0 and variance $w_j^2 \tau$ and pairwise correlation ρ between different $\beta_j$s as

$$Q_\rho = (y - \hat{\pi})^T K_\rho (y - \hat{\pi}),$$

where $K_\rho = G W R_\rho W G^T$ is an $n \times n$ kernel matrix, $R_\rho = (1 - \rho) I + \rho \mathbf{1} \mathbf{1}^T$ is an $m \times m$ compound symmetric matrix, and $\mathbf{1} = (1, \ldots, 1)^T$. Thus the weight ρ can be interpreted as the correlation of the regression coefficients $\beta_j$s. If they are perfectly correlated (ρ = 1), they will be all the same after weighting, and one should collapse the variants first before running regression, i.e., using the burden test; if the regression coefficients are unrelated to each other, one should use SKAT. The optimal weight ρ is unknown, and needs to be estimated from the data to maximize power as follows:

$$Q_{optimal} = \min_{0 \le \rho \le 1} p_\rho$$

where $p_\rho$ is the p-value computed on the basis of a given ρ. The optimal unified test statistic can be efficiently estimated by setting a grid $0 = \rho_1 < \rho_2 < \cdots < \rho_b = 1$ and determining

$$Q_{optimal} = \min \{ p_{\rho_1}, \ldots, p_{\rho_b} \}$$

It has been shown that $Q_\rho$ can be decomposed into a mixture of two random variables: one asymptotically follows a chi-square distribution with one df, whereas the other can be asymptotically approximated by a mixture of chi-square distributions. Hence, the p-value of $Q_{optimal}$ can be quickly obtained analytically with the use of a one-dimensional numerical integration.
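The equivalence between the weighted-average form and the kernel form of the unified statistic can be checked numerically. The sketch below is only an illustration on made-up toy data: the function names, residuals, genotypes and weights are all hypothetical, and it does not reproduce the p-value computation of the SKAT package.

```python
import math

# Illustrative sketch only: toy data, hypothetical names, no p-value computation.

def weighted_scores(residuals, genotypes, weights):
    """Return w_j * sum_i g_ij * r_i for each variant j (the entries of W G^T r)."""
    return [
        w * sum(row[j] * r for row, r in zip(genotypes, residuals))
        for j, w in enumerate(weights)
    ]

def burden_statistic(scores):
    """Q_B: square of the sum of the weighted scores."""
    return sum(scores) ** 2

def skat_statistic(scores):
    """Q_S: sum of the squared weighted scores."""
    return sum(s * s for s in scores)

def unified_statistic(scores, rho):
    """Q_rho = s^T R_rho s, with R_rho = (1 - rho) I + rho 1 1^T."""
    q = 0.0
    for j, s_j in enumerate(scores):
        for k, s_k in enumerate(scores):
            r_jk = 1.0 if j == k else rho  # entry (j, k) of R_rho
            q += s_j * r_jk * s_k
    return q

# Toy example: 5 individuals, 3 rare variants (minor-allele counts).
residuals = [0.6, -0.4, 0.6, -0.4, -0.4]   # y_i - pi_hat_i under the null model
genotypes = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 0],
]
weights = [1.0, 2.0, 0.5]

scores = weighted_scores(residuals, genotypes, weights)
for rho in (0.0, 0.3, 1.0):
    lhs = unified_statistic(scores, rho)
    rhs = rho * burden_statistic(scores) + (1 - rho) * skat_statistic(scores)
    print(rho, math.isclose(lhs, rhs))
```

Working with the m weighted scores rather than the full n × n kernel matrix is a design choice of this sketch; it gives the same quadratic form while avoiding the construction of $K_\rho$.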

Software

As of August 2017, the R statistical software package "SKAT" is available with functions for kernel-regression-based association tests, including the ones discussed above: burden test, SKAT and SKAT-O. This package was co-developed and is maintained by the author of the paper [42] upon which this section was based.

3.3 P-value adjustment

All the methods above, for both common- and rare-variant analysis, involve multiple testing, which requires adjusting the allowed type I error rate. Suppose we are testing m SNPs for association with a phenotype, or analyzing the cumulative effects of variants in m different genes; in essence, we are running m association tests. The significance level $\alpha$ now serves as a threshold for the probability of observing at least one false positive among all the tests undertaken, and it should not be exceeded. Let $\alpha'$ be the probability of a type I error in a single test. Then the probability of not observing a type I error in that test is $1 - \alpha'$, which, assuming independent tests, makes the probability of not observing a type I error in any of the m tests $(1 - \alpha')^m$. The probability of observing a type I error in at least one of the conducted tests is therefore $1 - (1 - \alpha')^m \approx m\alpha'$ for small $\alpha'$. Requiring this probability not to exceed $\alpha$ leads to the well-known Bonferroni correction for multiple testing: $\alpha' = \alpha/m$ [43]. Suppose, for instance, that we were interested in testing one million SNPs for association with a disease, with an overall false positive rate no higher than the classical $\alpha = 5 \times 10^{-2}$; the corrected p-value threshold would then be $5 \times 10^{-8}$, meaning that, for each conducted test, the obtained p-value should not exceed $5 \times 10^{-8}$ for the association between the disease and the corresponding SNP to be considered statistically significant.
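The arithmetic behind the correction can be illustrated with a short sketch. The function names are hypothetical, and the family-wise computation assumes independent tests, as in the derivation above.

```python
import math

def familywise_error_rate(per_test_alpha, m):
    """P(at least one type I error) in m independent tests: 1 - (1 - a)^m."""
    # expm1/log1p keep numerical precision when per_test_alpha is tiny.
    return -math.expm1(m * math.log1p(-per_test_alpha))

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the family-wise rate below alpha."""
    return alpha / m

m = 1_000_000                      # e.g. one million SNPs
threshold = bonferroni_threshold(0.05, m)

print(threshold)                              # per-test threshold
print(familywise_error_rate(0.05, 20))        # uncorrected, only 20 tests
print(familywise_error_rate(threshold, m))    # corrected, one million tests
```

Without correction, even 20 tests at the 0.05 level give a probability of roughly 64% of at least one false positive, whereas the corrected threshold keeps the family-wise rate just under 0.05.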

3.4 The odds ratio

The odds ratio is a very important measure when working with case-control phenotypes and logistic regression models. It is defined as the ratio of the odds of disease in a group of individuals subject to a given exposure, relative to the remaining non-exposed individuals. In a logistic regression model, the response variable is the logarithm of the odds of disease.

Let us consider the contingency Table 3.2, where the Ejs are a number of exposures such as gender, age, allele or genotype counts, etc., with n ≥ 2, and cij the individual count on row i and column j.

Exposure    E1     E2     ···    En     Sum
Cases       c11    c12    ···    c1n    c1·
Controls    c21    c22    ···    c2n    c2·
Sum         c·1    c·2    ···    c·n    c

Tbl. 3.2 – Counts of cases and controls in each of the n exposure categories Ej, in a sample of c individuals.

We define the odds of developing the disease given exposure $E_j$ as $c_{1j}/c_{2j}$. Hence, the odds ratio of developing disease for an individual exposed to $E_j$, relative to an individual with the exact same characteristics except for being exposed to $E_k$, is defined as [44]

$$OR = \frac{c_{1j}/c_{2j}}{c_{1k}/c_{2k}} = \frac{c_{1j}\,c_{2k}}{c_{1k}\,c_{2j}}.$$

Considering, for instance, Table 3.1 (b), the genotypic odds ratio for genotype (AA) relative to genotype (aa) is $OR_{(AA)} = \frac{n_{13}\,n_{21}}{n_{11}\,n_{23}}$, and the genotypic odds ratio for genotype (Aa) relative to genotype (aa) is $OR_{(Aa)} = \frac{n_{12}\,n_{21}}{n_{11}\,n_{22}}$ [35].
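As a minimal sketch of the definition above, the odds ratio between two exposure categories can be computed directly from the rows of a contingency table laid out as in Table 3.2. The counts and the function name are hypothetical.

```python
def odds_ratio(cases, controls, j, k):
    """Odds ratio of exposure j relative to exposure k (0-based column indices).

    cases[j] and controls[j] play the roles of c_1j and c_2j in Table 3.2.
    """
    return (cases[j] * controls[k]) / (cases[k] * controls[j])

# Hypothetical counts for three exposure categories E1, E2, E3.
cases = [10, 20, 40]
controls = [30, 15, 20]

print(odds_ratio(cases, controls, 0, 1))  # E1 relative to E2: (10*15)/(20*30)
print(odds_ratio(cases, controls, 2, 0))  # E3 relative to E1: (40*30)/(10*20)
```

An odds ratio below 1 indicates lower odds of disease under the first exposure than under the reference exposure, and a value above 1 indicates higher odds.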

Chapter 4

An application of association studies to Alzheimer’s disease

4.1 Aim and objectives

The main goal of this study was to identify genetic variants with some degree of relevance in the development of AD, in a group of cases and controls from the Iberian Peninsula. We studied rare and common variants separately. In a first stage, we analyzed the influence of rare variants on the disease, using the SKAT-O method described in the previous chapter. The second objective of this work was to assess differences in genotype frequencies of common variants, and the behavior of selected EEG measures in individuals with each genotype.

4.2 Subjects and methods

This work was developed in the scope of the project "AD-EEGWA: Análisis y correlación entre el genoma completo y la actividad cerebral para la ayuda en el diagnóstico de la enfermedad de Alzheimer", which arises from the collaboration between "IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto, Porto, Portugal", "Grupo de Ingeniería Biomédica, Universidade de Valladolid, Valladolid, Spain", "Asociación de Familiares y Amigos de Enfermos de Alzheimer y otras demencias de Zamora, Zamora, Spain" and "Associação Portuguesa de Familiares e Amigos de Doentes de Alzheimer, Porto, Portugal". This project was approved by the Ethics Committee of Universidade do Porto. Additional information can be found at http://www.gib.tel.uva.es/ad-eegwa/.

The initial goal was to constitute a sample of 250 individuals, equally distributed by country of origin and each of the five possible AD diagnoses. Controls should not be younger than 70 years old. An effort was made to recruit as many participants to the study as possible, and these goals were surpassed in every category.

The biological sample consisted of 285 individuals from the Iberian Peninsula, of which 144 were from Northern Portugal and 141 from the Castile and León region in Spain. Each subject was assessed by a neuropsychologist through the Mini Mental State Examination (MMSE, in the appendix), and assigned one of the five disease statuses CON, MCI, MIL, MOD or SEV, ending up with 220 cases and 65 controls. Of the 220 cases, 54 had mild cognitive impairment, 58 had mild AD, 53 had moderate AD and 55 had severe AD.

Although MCI individuals were initially included amongst cases, this is a condition which may or may not be due to Alzheimer’s disease. In the spirit of keeping our phenotype as precise as possible, we considered excluding these individuals from the analyses.

Most individuals were sampled for DNA through the Oragene Saliva Kit; when that was not possible (especially in the most severe cases), buccal swabs were used instead. The 285 biological samples were collected by "Associação Portuguesa de Familiares e Amigos de Doentes de Alzheimer, Porto, Portugal" and "Asociación de Familiares y Amigos de Enfermos de Alzheimer y otras demencias de Zamora, Zamora, Spain".

Of the totality of study subjects, 253 also underwent a five-minute electroencephalogram (EEG) test in resting state, sitting down with eyes closed. The EEGs were measured by "Grupo de Ingeniería Biomédica, Universidade de Valladolid, Valladolid, Spain".

Each subject was assigned an identification code so that the data would be handled anonymously. A copy of the informed consent document, which was delivered to and signed by the controls, patients with MCI and legal representatives of AD patients, can be found in the appendix.

The sequencing of the DNA samples was done by the "National Genotyping Center (CEGEN), Genomics Medicine Group, University of Santiago de Compostela, Santiago de Compostela, Spain". Three GeneChip array plates (Axiom Spain Biobank Array) with 96 sample capacity were used to obtain intensity data on 814 923 probes. Each plate was used to process 95 samples and one plate control. For data quality control, variant annotation and genotyping we used the command-line programs Affymetrix Power Tools (APT), ANNOVAR and PLINK. The R statistical software was used both in rare-variant analysis and in EEG analysis.

4.3 Data quality control

Prior to any analysis, we performed several steps of per-individual and per-marker quality control (QC). These steps were in accordance with the best practices guide by Affymetrix. The individual and marker counts in each QC step are summarized in Tables 4.1 and 4.2.

4.3.1 Variant calling

Dish quality control (DQC) and quality control call rates (QCCR)

The first per-individual QC performed concerned the identification of poor quality DNA samples, and was based on DQC and QCCR values. We found that 40 subjects had DQC or QCCR values below the defined thresholds for these measures (0.82 and 0.97, respectively). Of these, one was a control and the remaining 39 were cases, of which three had mild cognitive impairment (MCI), three had mild AD, 10 had moderate AD and 23 had severe AD. These individuals were not considered in further steps.

Heterozygosity assessment

After annotating each probe for the corresponding variant and genotyping the variants through the provided annotation files (build version hg19), we proceeded to calculate the proportion of heterozygous variants for each individual. As mentioned before, higher values than expected for heterozygosity could indicate sample contamination, and lower values could hint at population substructure. A frequently used threshold to eliminate heterozygosity outliers is µ ± 3σ [5]. We found that 7 individuals did not meet the criteria to remain in the study sample; however, due to the small number of subjects available, we chose not to eliminate the 6 below the defined threshold, hinting at population substructure, but just the one above, suggesting contamination. Individuals' missing data rates are plotted against heterozygosity rates in Figure 4.1.

Fig. 4.1 – Missing data rate vs. heterozygosity across individuals passing the DQC and QCCR steps. Shading indicates sample density; the dashed lines represent the defined heterozygosity threshold; the outliers are highlighted in red.
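The µ ± 3σ rule can be sketched as follows. This is an illustration on made-up heterozygosity rates, not the thesis data; the function name is hypothetical and the use of the sample standard deviation is an assumption. Low and high outliers are returned separately, since only the high ones were removed in this study.

```python
import statistics

def heterozygosity_outliers(rates, n_sd=3.0):
    """Return (low, high): indices of individuals below mean - n_sd*sd
    and above mean + n_sd*sd, respectively."""
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample standard deviation (assumption)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    low = [i for i, r in enumerate(rates) if r < lower]
    high = [i for i, r in enumerate(rates) if r > upper]
    return low, high

# Made-up rates: 30 typical individuals plus one with excess heterozygosity.
rates = [0.300 + 0.001 * (i % 5) for i in range(30)] + [0.400]

low, high = heterozygosity_outliers(rates)
print("possible substructure:", low)
print("possible contamination:", high)
```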

Cluster plots analysis

After preliminary per-individual quality control and probe calling, we proceeded to sort each variant into categories, based on allelic intensities. We sorted our probes into the 7 probe categories, but proceeded with only four, recommended by Thermo Fisher Scientific's best practices guide: "Poly High Resolution", "Mono High Resolution", "No Minor Homozygote" and "Hemizygous". The variants not falling in one of these categories were eliminated from further study, thus keeping 730 887 (≈90%) out of the initial 814 923. After this preliminary QC analysis, we annotated the passing variants according to the Genome Reference Consortium Human Build 37 (GRCh37) SNP assembly. This process resulted in the annotation of 701 192 SNPs; after removing duplicates, we obtained a final set of 692 768 SNPs for analysis. We then proceeded to per-individual quality control steps.

4.3.2 Per-individual QC

Identification of individuals with discordant sex information

The first step was to assess individuals for discordant sex information, by checking the sex chromosomes' homozygosity rate for each individual. Since, as said before, males are expected to have a homozygosity rate of 1 and females of less than 0.2, it has been established that any individual falling in the 0.2–0.8 interval leads to inconclusive results as to the presumed sex. Our data contained 3 individuals with discordant sex information: two females inside the interval above, and one individual with no information on sex and predicted to be male, with homozygosity rate 1. Therefore, we decided to keep the two female samples unchanged, and set the individual with unknown gender as male.
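The decision rule described above can be written down as a small sketch. The function names are hypothetical, and the handling of values lying exactly on the 0.2 and 0.8 boundaries is an assumption, as the text does not specify it.

```python
def infer_sex(x_homozygosity, female_max=0.2, male_min=0.8):
    """Classify an individual from the sex-chromosome homozygosity rate."""
    if x_homozygosity < female_max:
        return "female"
    if x_homozygosity > male_min:
        return "male"
    return "inconclusive"

def is_discordant(reported_sex, x_homozygosity):
    """True when the reported sex is missing or disagrees with the inferred one."""
    return reported_sex != infer_sex(x_homozygosity)

print(infer_sex(1.0))                 # male
print(infer_sex(0.05))                # female
print(is_discordant("female", 0.5))   # True: falls in the 0.2-0.8 interval
print(is_discordant(None, 1.0))       # True: no reported sex
```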

Identification of duplicate or related individuals

This step makes use of the previously introduced concepts of linkage disequilibrium (LD) and identity by descent (IBD). By considering a reduced dataset without SNPs from extended regions of high LD, IBD was calculated for each pair of individuals, and one individual from each pair with IBD>0.1875 was removed. Five pairs of individuals were found to be in these conditions, and hence five individuals were removed from further analyses, in such a way that we would keep the individuals with highest call rates, and as many females and cases as possible.

Identification of individuals with divergent ancestry

The final step of per-individual quality control was the analysis of the principal components (PCs). In order to determine them, we used EIGENSOFT. We merged our data with a publicly available dataset from the 1000 Genomes Project (1KGP), containing 12 different populations from 4 ancestry groups: European – Utah Residents with Northern and Western European Ancestry (CEU), Toscani in Italy (TSI), Finnish in Finland (FIN) and British in England and Scotland (GBR) –, East Asian – Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT) and Southern Han Chinese (CHS) –, Admixed Americans – Mexican Ancestry from Los Angeles USA (MXL) and Puerto Ricans from Puerto Rico (PUR) – and African – Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK) and Americans of African Ancestry in SW USA (ASW). The first two PCs are plotted in Figure 4.2. In subfigure (a), we can clearly distinguish three clusters, corresponding to the East Asian (bottom), the African (top left) and the European and Admixed American (top right) subpopulations. In subfigure (b), we see that our Portuguese and Spanish individuals overlap, as do the British and Northwestern European populations from 1KGP.

(a) (b)

Fig. 4.2 – (a) Plot of the first principal component against the second, calculated with the software EIGENSOFT; (b) Plot of the same PCs as in (a), zoomed in on the cluster which contains the European populations (the outliers are encircled in red). In the legends, PT and ES represent the Portuguese and Spanish individuals in our sample, respectively; the remaining populations come from the 1KGP dataset.

It is usual to use the µ ± 5σ threshold as a reference for keeping or discarding individuals; any individual falling outside this threshold in at least one of the first 10 principal components should not be considered for further study. By using this measure, we excluded two controls from our dataset, which are encircled in red in Figure 4.2 (b). As a final per-individual QC step, we excluded all individuals with MCI from further study, which eliminated 50 cases from our sample. Table 4.1 summarizes the sample counts for disease status, gender and country of origin in each per-individual QC step. In total, out of the 285 initially considered, 98 individuals (≈34%) were removed.

                                      Disease status      Gender          Country of origin
QC step                               Cases   Controls    Male   Female   Portugal   Spain     Total
Initial counts                        220     65          86     193      144        141       285
After DQC                             194     64          80     176      121        137       258
After QCCR                            181     64          77     167      115        130       245
After heterozygosity assessment       181     63          77     166      115        129       244
After discordant sex assessment       181     63          78     166      115        129       244
After relatedness assessment (IBD)    178     61          75     164      113        126       239
After PCA                             178     59          73     164      111        126       237
After MCI removal                     128     59          59     128      87         100       187
Final counts                          128     59          59     128      87         100       187

Tbl. 4.1 – Summary table of the individual counts after each per-individual QC step, according to disease status, gender and country of origin (note: whenever male and female counts did not add up to the "Total" column, it was due to the presence of individuals with unknown gender; this issue was overcome as of the sex check step).

4.3.3 Per-marker QC

Firstly, in order to identify genotyping and genotype calling errors, we eliminated SNPs with a significant deviation from HWE in control samples. By performing the HWE exact test, 51 variants with a p-value lower than $10^{-6}$ were removed. We proceeded to assess variant missingness rates, which could be different after per-individual QC. Using 5% as the threshold, we found that 363 variants from our dataset had excessive missingness, and they were hence excluded. Finally, markers with significant differences in missing genotype rates between cases and controls were identified through Fisher's exact test on case/control missing call counts at each variant. Differences were considered significant whenever the p-value was lower than $10^{-5}$; since all variants in our dataset presented a p-value above $8 \times 10^{-4}$, none was removed at this step. One final necessary step was to remove any markers which took a single value across all study subjects (i.e., SNPs for which MAF=0). This was to make sure that every variant tested in the subsequent analyses was, indeed, a variant in our sample. After per-marker QC, we were left with 645 321 markers for analysis (out of the initial 692 768), of which 534 535 were common and 110 786 rare, using 2% as the MAF threshold. Table 4.2 shows a summary of probe/variant counts after each per-marker quality control step.

QC step                                                 Probe/variant count
Initial count                                           814 923 probes
After variant calling                                   730 887 probes
After annotation                                        701 192 variants
After duplicate removal                                 692 768 variants
After HWE exact test                                    692 717 variants
After missingness rate QC                               692 354 variants
After filtering out variants not found in our sample    645 321 variants
Final count                                             645 321 variants (110 786 rare + 534 535 common)

Tbl. 4.2 – Summary table of probe/variant counts at each per-marker QC step.
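The last two per-marker decisions (dropping monomorphic markers and splitting the rest into rare and common at the 2% MAF threshold) can be sketched from genotype counts. The function names and the example counts are hypothetical; in this work these steps were carried out with the command-line tools mentioned in Section 4.2.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """MAF of a biallelic SNP from its three genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    alt_freq = (2 * n_hom_alt + n_het) / total_alleles
    return min(alt_freq, 1 - alt_freq)

def classify_variant(maf, rare_threshold=0.02):
    """Label a marker as monomorphic (MAF = 0), rare (MAF < threshold) or common."""
    if maf == 0:
        return "monomorphic"
    return "rare" if maf < rare_threshold else "common"

# Hypothetical genotype counts in a sample of 187 individuals.
for counts in [(187, 0, 0), (180, 7, 0), (100, 70, 17)]:
    maf = minor_allele_frequency(*counts)
    print(counts, round(maf, 4), classify_variant(maf))
```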

4.4 Rare-variant analysis

The rare-variant analysis stage of this work was motivated by the "common disease rare variant" (CDRV) hypothesis, as well as by the fact that the effects of common variants on diseases have by now been extensively studied, while the potential contribution of rare variants has not been given equal importance. This type of analysis is not done by testing all rare variants at once and interpreting the obtained results, since these variants have very low frequencies and, in order to achieve power to detect an effect at a significant level for any given variant, extremely large sample sizes would be required. Instead, a set of previously defined target regions is selected based on functional criteria, and taken for further study. We selected the 110 786 rare variants (at a permissive MAF<2% in our global sample) and, using ANNOVAR to functionally annotate them, kept only exonic variants and variants in untranslated regions (UTR). After filtering out the synonymous variants, we were left with 16 877 variants for analysis.

4.4.1 Exploratory data analysis

After applying the QC steps, we had 187 subjects, distributed by gender and disease categories as per Table 4.3. Of the 128 cases, 55 had mild AD, 43 had moderate AD and 30 had severe AD. A simple chi-squared test for independence showed that there are significant differences in gender proportions between cases and controls ($p = 7.6 \times 10^{-3}$).

          Case    Control    Sum
Male      32      27         59
Female    96      32         128
Sum       128     59         187

Tbl. 4.3 – Disease status vs. gender distribution of the QCed sample.
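The test on Table 4.3 can be sketched with the standard library alone. The text does not state whether a continuity correction was applied; the sketch below assumes Yates' correction, which appears consistent with the reported p-value, and also shows the uncorrected statistic for comparison. The function name is hypothetical.

```python
import math

def chi2_independence_2x2(a, b, c, d, yates=True):
    """Pearson chi-squared test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)  # Yates' continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: male, female; columns: cases, controls (Table 4.3).
stat, p = chi2_independence_2x2(32, 27, 96, 32)
stat_raw, p_raw = chi2_independence_2x2(32, 27, 96, 32, yates=False)

print(f"with correction:    chi2 = {stat:.3f}, p = {p:.2e}")
print(f"without correction: chi2 = {stat_raw:.3f}, p = {p_raw:.2e}")
```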

The age of the subjects at the time of sample collection varies from 62 to 96 years old, with 5 missing values. By performing an ANOVA test, it is possible to verify that no significant differences are found in the distribution of age between cases and controls (p = 0.40) or between case categories (p = 0.56). This is also illustrated in Figure 4.3.

Fig. 4.3 – Distribution of age at the time of sample collection.

We then proceeded to decide which principal components should be included in the analysis. A plot of the cumulative sum of the eigenvalues divided by their total showed that the first two principal components account for about 10% of the variation in the data, and that adding further PCs brings no particularly relevant gain to the understanding of the data. This is illustrated in Figure 4.4. Additionally, Figure 4.5 shows a single cluster when plotting the first two principal components against each other, with no clear distinction between cases and controls of our sample, nor between subjects of different nationalities. Regressing the phenotype on the first 20 principal components through a logistic model did not yield a significant effect for any PC either.

As we have pointed out before, the APOE gene has been proven to be extremely relevant to the development of Alzheimer's disease in a number of studies, and it is often included as a covariate in the regression models. The APOE genotype (ε1, ε2, ε3 or ε4) is defined by the two SNPs rs429358 and rs7412 as shown in Table 1.8. Unfortunately, the only probe that contained intensity data for SNP rs7412 was eliminated at the variant calling step of the QC, because it was sorted into the "Call Rate Below Threshold" class in cluster analysis. These two SNPs were genotyped by Sanger sequencing, in an attempt to retrieve this information, but the data was not available in time to be included in this work.

(a) (b)

Fig. 4.4 – Proportion of variance explained by the first i principal components; (b) is a zoom-in of (a) on the first 100 PCs.

Fig. 4.5 – Plot of the first and second principal components, restricted to the sample subjects.

4.4.2 Test parameters

We ran the SKAT-O method based on two different null models: Model 1 included only sex as a covariate, and Model 2 adjusted for sex, age and the first two principal components. Due to the five missing age values, five samples were excluded from the analyses with Model 2, all of which were cases with mild AD. The variants' weights were a beta density function of their MAF, so that the weight of variant j with MAF $p_j$ was defined as $w_j = \text{Beta}(p_j, a_1, a_2)$, and $(a_1, a_2) = (1, 25)$ were the defined parameters [42]. Figure 4.6 shows the probability density function of this distribution.

Fig. 4.6 – Probability density function of Beta(p, 1, 25).
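To make the effect of this weighting scheme concrete, the Beta(1, 25) density can be evaluated at a few MAF values. This is an illustrative sketch with hypothetical function names, not the implementation used by the SKAT package; for these parameters the density reduces to $25(1 - p)^{24}$.

```python
import math

def beta_density(p, a, b):
    """Probability density of the Beta(a, b) distribution at p, for 0 <= p <= 1."""
    beta_fn = math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    return p ** (a - 1) * (1 - p) ** (b - 1) / beta_fn

def variant_weight(maf, a1=1, a2=25):
    """Weight w_j of a variant with minor allele frequency maf."""
    return beta_density(maf, a1, a2)

# Rarer variants receive larger weights.
for maf in (0.001, 0.005, 0.01, 0.02):
    print(f"MAF = {maf:.3f}  weight = {variant_weight(maf):.2f}")
```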

We tested each model considering four subsets of the exonic and UTR variants, excluding synonymous substitutions and unique instances, each subset corresponding to a different gene list, as described in Table 4.4 (here, a "unique instance" is any variant that is the only one in its respective gene). The SKAT-O method performs a separate test for each set of variants grouped by gene, so the significance level α = 0.05 must be corrected for the number of genes tested. Hence, we obtain four new significance levels, $\alpha_{All} = 1.4 \times 10^{-5}$, $\alpha_{Dem} = 4.5 \times 10^{-3}$, $\alpha_{DGE} = 3.7 \times 10^{-4}$ and $\alpha_{Dis} = 1.1 \times 10^{-4}$, for the gene lists "All Genes", "Dementia", "DGE Brain" and "AD Disgenet", respectively.

                Before removing unique instances    After removing unique instances
List            # of genes     # of variants        # of genes     # of variants
All Genes       10 013         16 877               3 481          10 345
Dementia        34             61                   11             38
DGE Brain       498            755                  136            393
AD Disgenet     1286           2406                 462            1582

Filter used to define each list:
All Genes: none.
Dementia: genes with rare variants associated to dementia which are also relevant to AD development [45].
DGE Brain: genes with differential expression between AD cases and controls [46].
AD Disgenet: genes previously associated or with a functional link to AD, retrieved from the online database of gene-disease associations "Disgenet".

Tbl. 4.4 – Identifier and description of each gene list to be tested for association, and number of genes and rare variants they contain.

4.4.3 Results

Table 4.5 shows the number of genes tested for each list and the number of genes that were significant, both nominally and after the p-value correction. After p-value correction, a single gene showed significance in four of the eight tests: the protein-coding gene PLEKHA5, or Pleckstrin Homology Domain Containing A5. Using Model 1 as the null model, the obtained p-value was $1.3 \times 10^{-5}$ and the value of ρ was 0.4 (here, ρ is the weight of the burden score and 1 − ρ is the weight of the SKAT score in the optimal SKAT-O test statistic); Model 2 as the null model yielded a p-value of $1.5 \times 10^{-5}$ and a ρ value of 0.9. Our sample contained four genotyped rare SNVs for gene PLEKHA5: rs76626801, rs77598867, rs200349314 and rs140734813 (Table 4.6). In total, rare alleles of gene PLEKHA5 appear in 8 of 59 controls, and in one of 128 cases, so the rare variants found in this gene appear to have a protective effect against AD. The one case with a rare allele could explain the obtained values of ρ.

Gene list      # of genes in list   Model     # of nominally significant genes   # of significant genes after p-value adjustment
All Genes      3 481                Model 1   250                                1
All Genes      3 481                Model 2   267                                1
Dementia       11                   Model 1   2                                  0
Dementia       11                   Model 2   2                                  0
DGE Brain      136                  Model 1   8                                  1
DGE Brain      136                  Model 2   10                                 1
AD Disgenet    462                  Model 1   29                                 0
AD Disgenet    462                  Model 2   37                                 0

Tbl. 4.5 – Number of significant genes without and with p-value correction in each gene list and model (Model 1 – sex as the only covariate; Model 2 – sex, age, PC1 and PC2 as covariates).

SNP              Functional consequence   CADD score (PHRED)   MA   MAF Cases (n=128)   MAF Controls (n=59)   MAF Controls gnomAD NFE   Fisher Test p-value
rs76626801(*)    Non-synonymous           9                    C    0.00                5.08 × 10^-2          7.20 × 10^-3              2.36 × 10^-4
rs77598867(*)    Non-synonymous           13.12                G    0.00                5.08 × 10^-2          7.21 × 10^-3              2.37 × 10^-4
rs200349314      Non-synonymous           14.15                T    3.91 × 10^-3        8.47 × 10^-3          1.99 × 10^-3              2.11 × 10^-1
rs140734813      Non-synonymous           23.5                 G    0.00                8.47 × 10^-3          7.07 × 10^-5              9.34 × 10^-3

Tbl. 4.6 – Frequency and properties of PLEKHA5 rare variants identified in our sample. (*) in complete LD ($r^2 = 1.0$). Notes: "PHRED" refers to the PHRED-scaled CADD score; MA - minor allele; MAF - minor allele frequency; "gnomAD NFE" refers to the Non-Finnish European population of the gnomAD database (the number of genotyped alleles for each SNP was, respectively, 129 088, 129 088, 75 296 and 113 128); the p-values refer to Fisher's exact test for differences between control frequencies in our data and the gnomAD database.

We proceeded to build the two logistic regression models described above, now including the effects of gene PLEKHA5. For each model, we tested both the separate and aggregate effects of the variants on the phenotype. rs76626801 was not included in the models because the information it contained was equivalent to rs77598867; in addition, it had one individual with missing genotype.

When testing for the separate effect of the variants in either of the two models, only the effect of sex was statistically significant, with p-values of $3.2 \times 10^{-3}$ in Model 1 and $5.6 \times 10^{-4}$ in Model 2. The substantially lower p-value obtained with Model 2 is most likely due to the additional variable age having five missing values, which results in five male cases not being considered in the analysis. As for the models which adjusted for the combined effects of the three variants in gene PLEKHA5, both sex and the combined variants variable were statistically significant, with respective p-values $4.0 \times 10^{-3}$ and $4.3 \times 10^{-3}$ in Model 1, and $7.1 \times 10^{-4}$ and $2.2 \times 10^{-3}$ in Model 2.

In any of the four models considered, the coefficients of both significant variables were negative, where the category of reference for variable sex was ”female”. This means that being male is protective against the disease, as is having at least one copy of the rare allele in any of the referred variants. Let us take, for instance, the obtained model when adjusting for sex and the aggregate effects of the SNVs of PLEKHA5,

$$\text{logit}(\hat{\pi}_i) = \log\left(\frac{\hat{\pi}_i}{1 - \hat{\pi}_i}\right) = 1.27 - 0.99\,\text{Sex}_i - 3.09\,\text{PLEKHA5}_i$$

where $\hat{\pi}_i$ is the estimated probability of individual i being an AD patient, the response variable $\text{logit}(\hat{\pi}_i)$ is the logarithm of the odds of disease, $\text{Sex}_i$ is a variable that takes the value 1 if individual i is male and 0 otherwise, and $\text{PLEKHA5}_i$ is the total number of rare alleles for all three variants found in individual i.

If we assume that our sample is representative of the Iberian population, this means that, under this simplified model, the probability that a female with no rare alleles in any of the three SNVs of PLEKHA5 develops the disease is $(1 + \exp(-1.27))^{-1} \approx 78\%$. It also means that the odds of developing AD for a male are $\exp(-0.99) \approx 37\%$ of those for a female with the same number of rare alleles in gene PLEKHA5. Finally, it means that the odds of developing AD for an individual with k rare alleles in any of the PLEKHA5 variants are $\exp(-3.09) \approx 5\%$ of those for an individual of the same sex with k − 1 rare alleles.
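The three figures quoted above follow directly from the fitted coefficients, as the following sketch shows. The coefficients are the ones reported in the model; the function and constant names are hypothetical.

```python
import math

# Coefficients of the fitted model (sex + aggregate PLEKHA5 effect).
INTERCEPT = 1.27
B_SEX = -0.99       # Sex = 1 for males, 0 for females
B_PLEKHA5 = -3.09   # per rare allele across the three variants

def predicted_probability(is_male, rare_alleles):
    """Estimated probability of being an AD patient under the model."""
    log_odds = INTERCEPT + B_SEX * is_male + B_PLEKHA5 * rare_alleles
    return 1 / (1 + math.exp(-log_odds))

def odds_ratio(coefficient):
    """Multiplicative change in the odds per unit increase of a covariate."""
    return math.exp(coefficient)

print(round(predicted_probability(0, 0), 2))  # female, no rare alleles
print(round(odds_ratio(B_SEX), 2))            # male relative to female
print(round(odds_ratio(B_PLEKHA5), 3))        # per additional rare allele
```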

4.5 Analysis of electroencephalography data

In addition to age, sex and genotypes of the 187 subjects of this study, we also had electroencephalography (EEG) data for 155, of which 45 were controls, 46 cases with mild AD, 39 with moderate AD and 25 with severe AD. We also had EEGs measured for a number of MCI individuals, but as before, these were not included in the analysis. We focused on the study of the behavior of spectral measures among individuals with different genotypes, specifically of relative power (RP) in the classical EEG frequency bands delta (δ, 0.5-4 Hz), theta (θ, 4-8 Hz), alpha (α, 8-13 Hz), beta-1 (β1, 13-19 Hz) and beta-2 (β2, 19-30 Hz). The gamma frequency band was not included in the analyses, due to possible contamination from muscle artifacts in this frequency band [47]. Previous studies [48, 49] have shown that the RP of the delta and theta frequency bands tends to increase in AD patients relative to healthy controls; the remaining frequency bands, alpha and beta, behave in the opposite way, with a higher RP in controls than in cases. In the present analysis we addressed some questions about the EEG data in our Iberian sample, such as: (i) Do brainwaves differ between AD cases and controls? (ii) Does their behavior also differ in patients at different stages of the disease? (iii) Is there any correlation between genetic and EEG data?

Results

(i) Do brainwaves differ between AD cases and controls?

Figure 4.7 shows that our sample conforms to expectations, in the sense that the RP of the delta and theta waves is higher in cases than in controls, whereas the RP of the alpha and beta waves is lower. We conducted ANOVA tests for each measure, and the differences in RP between classes are, in fact, significant (Table 4.7).

Fig. 4.7 – Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in cases and controls.

           delta           theta           alpha           beta-1           beta-2
p-value    2.98 × 10^-6    4.63 × 10^-6    3.99 × 10^-6    5.12 × 10^-11    5.13 × 10^-6

Tbl. 4.7 – P-values obtained in ANOVA tests when testing for differences in RP of each of the brainwaves between cases and controls.

(ii) Does their behavior also differ in patients at different stages of the disease?

For testing differences between RP values of each brainwave, we used the Tukey statistical test for multiple comparisons [50]. The obtained p-values are in Table 4.8. The delta and alpha brainwaves show statistically significant differences in their mean values between the early and advanced stages of disease, but our data does not provide evidence that EEG analysis alone could distinguish between mod- erate AD and the other two disease status. We can also observe in Figure 4.8 that the increase/decrease in relative power becomes more evident with disease progression relatively to healthy controls. FCUP 60 Quantifying the genetic predisposition to a complex disease through genome-wide association

Additionally, the beta-1 brainwave could act as a case-control indicator in our sample: the differences in relative power between controls and each of the disease stages are significant, but the three disease stages are statistically indistinguishable. Without the p-value adjustment, the theta and beta-2 brainwaves show similar properties.

| | delta | theta | alpha | beta-1 | beta-2 |
|---|---|---|---|---|---|
| CON-MIL | 1.69 × 10^-1 | 4.96 × 10^-3 | 2.08 × 10^-1 | 1.98 × 10^-5 † | 6.40 × 10^-3 |
| CON-MOD | 1.32 × 10^-4 † | 1.23 × 10^-4 † | 4.52 × 10^-5 † | 1.00 × 10^-7 † | 2.34 × 10^-3 |
| CON-SEV | < 10^-7 † | 2.82 × 10^-3 | < 10^-7 † | < 10^-7 † | 2.09 × 10^-5 † |
| MIL-MOD | 8.03 × 10^-2 | 6.58 × 10^-1 | 3.16 × 10^-2 | 5.31 × 10^-1 | 9.71 × 10^-1 |
| MIL-SEV | 4.00 × 10^-7 † | 8.93 × 10^-1 | 6.10 × 10^-6 † | 2.81 × 10^-2 | 1.77 × 10^-1 |
| MOD-SEV | 3.84 × 10^-3 | 9.92 × 10^-1 | 5.16 × 10^-2 | 3.93 × 10^-1 | 3.78 × 10^-1 |

Tbl. 4.8 – P-values obtained in Tukey test for multiple comparisons of the RP in each brainwave between controls and cases in each disease stage (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD). The values below 0.05/30 = 1.67 × 10^-3 are marked with †.
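The adjusted threshold in the caption can be checked directly against the table. The sketch below applies 0.05/30 to the values of Table 4.8 and lists, for a given band, the pairwise comparisons that remain significant. Reading the 30 tests as 6 pairwise comparisons times 5 bands is our interpretation of the caption, entries reported as "< 10^-7" are stored as the upper bound 1e-7, and the function name is hypothetical.

```python
ALPHA = 0.05
N_TESTS = 30  # presumably 6 pairwise comparisons x 5 frequency bands
THRESHOLD = ALPHA / N_TESTS

BAND_NAMES = ["delta", "theta", "alpha", "beta-1", "beta-2"]

# P-values copied from Table 4.8, one list per pairwise comparison.
# Entries reported as "< 10^-7" are stored as the upper bound 1e-7.
P_VALUES = {
    "CON-MIL": [1.69e-1, 4.96e-3, 2.08e-1, 1.98e-5, 6.40e-3],
    "CON-MOD": [1.32e-4, 1.23e-4, 4.52e-5, 1.00e-7, 2.34e-3],
    "CON-SEV": [1e-7, 2.82e-3, 1e-7, 1e-7, 2.09e-5],
    "MIL-MOD": [8.03e-2, 6.58e-1, 3.16e-2, 5.31e-1, 9.71e-1],
    "MIL-SEV": [4.00e-7, 8.93e-1, 6.10e-6, 2.81e-2, 1.77e-1],
    "MOD-SEV": [3.84e-3, 9.92e-1, 5.16e-2, 3.93e-1, 3.78e-1],
}


def significant_pairs(band, threshold=THRESHOLD):
    """List the pairwise comparisons whose p-value for `band` is below `threshold`."""
    column = BAND_NAMES.index(band)
    return [pair for pair, values in P_VALUES.items() if values[column] < threshold]


beta1_adjusted = significant_pairs("beta-1")
theta_nominal = significant_pairs("theta", threshold=ALPHA)
```

Calling the function with `threshold=ALPHA` reproduces the unadjusted reading discussed in the text for the theta and beta-2 bands.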

Fig. 4.8 – Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in controls and mild, moderate and severe AD cases (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD).

(iii) Is there any correlation between genetic and EEG data?

In order to select the set of variants to study further, we performed the allelic and genotypic chi-squared tests on 3 396 common variants from a list of 15 selected candidate genes with functional relevance to AD [51, 52]. As expected given the modest sample size, none of the variants was significant after applying the p-value adjustment for such a large number of variants (αadj = 9.4 × 10^-8), but some SNPs were still nominally significant at the α = 0.05 level in both tests, as can be seen in Table 4.9.

| RefSNP | Gene(s) | Allelic test p-value | Genotypic test p-value |
|---|---|---|---|
| rs3737002(*) | CR1 | 8.90 × 10^-3 | 2.69 × 10^-2 |
| rs71336232 | BIN1;CYP27C1 | 1.42 × 10^-2 | 4.13 × 10^-2 |
| rs56031191 | EPHX2 | 1.81 × 10^-2 | 2.19 × 10^-2 |
| rs10833211 | NAV2 | 8.87 × 10^-3 | 4.13 × 10^-2 |
| rs10833214 | NAV2 | 1.14 × 10^-2 | 3.79 × 10^-2 |
| rs7232(*) | MS4A6A | 1.82 × 10^-2 | 1.81 × 10^-4 |
| rs6589894 | SORL1;RNU6-256P | 2.91 × 10^-2 | 3.39 × 10^-2 |
| rs7144273(*) | SLC24A4 | 1.13 × 10^-2 | 6.75 × 10^-3 |
| rs741780 | TOMM40 | 2.32 × 10^-2 | 4.64 × 10^-2 |

Tbl. 4.9 – Gene and p-values obtained in the allelic and genotypic tests for the SNPs which were nominally significant in both at the α = 0.05 level. (*) SNPs previously associated to AD.
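For reference, the two tests summarised in Table 4.9 can be sketched as Pearson chi-squared tests on a 2 × 3 table of genotype counts (genotypic test, 2 degrees of freedom) and on the derived 2 × 2 table of allele counts (allelic test, 1 degree of freedom). The genotype counts below are invented for illustration and the helper names are hypothetical; the p-values use the closed forms of the chi-squared tail that exist for 1 and 2 degrees of freedom, and the sketch assumes that no expected count is zero.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value(stat, df):
    """Upper-tail chi-squared probability (closed forms for df = 1 and df = 2 only)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 and df = 2 are supported in this sketch")


def allele_counts(genotype_counts):
    """Turn (AA, Aa, aa) genotype counts into (A, a) allele counts."""
    hom_major, het, hom_minor = genotype_counts
    return [2 * hom_major + het, het + 2 * hom_minor]


# Hypothetical genotype counts (AA, Aa, aa) for one SNP.
cases_genotypes = [50, 60, 18]
controls_genotypes = [32, 23, 4]

genotypic_stat = chi_squared([cases_genotypes, controls_genotypes])
genotypic_p = p_value(genotypic_stat, df=2)

allelic_table = [allele_counts(cases_genotypes), allele_counts(controls_genotypes)]
allelic_stat = chi_squared(allelic_table)
allelic_p = p_value(allelic_stat, df=1)
```

The allelic test counts each individual twice, which is why it is usually read alongside the genotypic test rather than in place of it.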

The three SNPs rs3737002, rs7232 and rs7144273 had been previously reported as being associated with Alzheimer’s disease [53, 54, 55]; for rs7232 and rs7144273, the reported risk allele is also more frequent among cases than among controls in our sample. After selecting the set of SNPs to analyze, we performed Wald tests to check for differences in relative power over each frequency band, for individuals with different alleles and genotypes (cases and controls were analyzed separately). The obtained p-values are in Tables 4.10 and 4.11.
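One common way to set up such a Wald test, shown here only as a sketch, is to regress the RP on the number of copies of a given allele (0, 1 or 2) and test whether the slope differs from zero. The additive coding, the large-sample chi-squared approximation with 1 degree of freedom, the function name and the data are all assumptions made for illustration and are not necessarily the model fitted in this work.

```python
import math


def wald_test_slope(dosage, rp):
    """Wald test for the slope of a simple linear regression of rp on dosage.

    Returns (slope, standard error, Wald statistic, approximate p-value).
    """
    n = len(dosage)
    if n != len(rp) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosage) / n
    mean_y = sum(rp) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    if sxx == 0:
        raise ValueError("all individuals carry the same allele count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, rp))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosage, rp))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    if std_err == 0:
        raise ValueError("perfect fit: the Wald statistic is undefined")
    wald = (slope / std_err) ** 2
    # Large-sample approximation: the statistic is compared with a chi-squared(1).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return slope, std_err, wald, p_value


# Hypothetical allele counts and theta RP values for eight individuals.
dosage = [0, 0, 0, 1, 1, 1, 2, 2]
theta_rp = [0.21, 0.19, 0.22, 0.18, 0.20, 0.17, 0.15, 0.16]
slope, std_err, wald, p_value = wald_test_slope(dosage, theta_rp)
```

In samples as small as the per-genotype groups considered here, a t reference distribution would be more conservative than the chi-squared approximation used in the sketch.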

| RefSNP | delta (cases) | delta (controls) | theta (cases) | theta (controls) | alpha (cases) | alpha (controls) | beta-1 (cases) | beta-1 (controls) | beta-2 (cases) | beta-2 (controls) |
|---|---|---|---|---|---|---|---|---|---|---|
| rs3737002 | 0.918 | 0.904 | 0.969 | 0.110 | 0.696 | 0.291 | 0.849 | 0.710 | 0.824 | 0.715 |
| rs71336232 | 0.479 | 0.393 | 0.531 | 0.175 | 0.198 | 0.926 | 0.211 | 0.660 | 0.811 | 0.033 † |
| rs56031191 | 0.283 | 0.462 | 0.092 | 0.446 | 0.131 | 0.263 | 0.105 | 0.288 | 0.151 | 0.253 |
| rs10833211 | 0.172 | 0.849 | 0.014 † | 0.918 | 0.830 | 0.449 | 0.594 | 0.204 | 0.641 | 0.818 |
| rs10833214 | 0.074 | 0.997 | 0.021 † | 0.754 | 0.949 | 0.256 | 0.341 | 0.280 | 0.367 | 0.942 |
| rs7232 | 0.161 | 0.182 | 0.613 | 0.978 | 0.146 | 0.771 | 0.819 | 0.504 | 0.266 | 0.088 |
| rs6589894 | 0.970 | 0.416 | 0.663 | 0.948 | 0.506 | 0.077 | 0.404 | 0.051 | 0.881 | 0.728 |
| rs7144273 | 0.290 | 0.275 | 0.712 | 0.124 | 0.226 | 0.653 | 0.269 | 0.257 | 0.878 | 0.020 † |
| rs741780 | 0.073 | 0.374 | 0.650 | 0.596 | 0.236 | 0.181 | 0.182 | 0.654 | 0.169 | 0.872 |

Tbl. 4.10 – Wald test p-values for the variation of relative power across each frequency band in cases and controls with different alleles for each of the considered SNPs. The values below 0.05 are marked with †.

| RefSNP | delta (cases) | delta (controls) | theta (cases) | theta (controls) | alpha (cases) | alpha (controls) | beta-1 (cases) | beta-1 (controls) | beta-2 (cases) | beta-2 (controls) |
|---|---|---|---|---|---|---|---|---|---|---|
| rs3737002 | 0.950 | 0.114 | 0.381 | 0.305 | 0.600 | 0.390 | 0.983 | 0.821 | 0.240 | 0.815 |
| rs71336232 | 0.698 | 0.220 | 0.808 | 0.378 | 0.358 | 0.903 | 0.234 | 0.864 | 0.970 | 0.131 |
| rs56031191 | 0.329 | 0.681 | 0.222 | 0.672 | 0.263 | 0.431 | 0.144 | 0.386 | 0.375 | 0.370 |
| rs10833211 | 0.055 | 0.474 | 0.028 † | 0.493 | 0.805 | 0.265 | 0.478 | 0.457 | 0.328 | 0.604 |
| rs10833214 | 0.028 † | 0.753 | 0.034 † | 0.550 | 0.918 | 0.383 | 0.416 | 0.611 | 0.200 | 0.669 |
| rs7232 | 0.255 | 0.452 | 0.847 | 0.950 | 0.219 | 0.608 | 0.918 | 0.222 | 0.489 | 0.270 |
| rs6589894 | 0.625 | 0.126 | 0.902 | 0.663 | 0.662 | 0.272 | 0.229 | 0.059 | 0.205 | 0.891 |
| rs7144273 | 0.756 | 0.444 | 0.965 | 0.566 | 0.452 | 0.916 | 0.609 | 0.546 | 0.833 | 0.169 |
| rs741780 | 0.171 | 0.457 | 0.864 | 0.719 | 0.447 | 0.127 | 0.380 | 0.430 | 0.335 | 0.506 |

Tbl. 4.11 – Wald test p-values for the variation of relative power across each frequency band in cases and controls with different genotypes for each of the considered SNPs. The values below 0.05 are marked with †.

None of these tests was statistically significant at the adjusted significance level. Only four out of nine SNPs showed nominally significant differences in at least one of the frequency bands. For these SNPs, we present in Table 4.12 the minor and alternative alleles, as well as the frequency of the minor allele in affected and unaffected individuals and in the entire sample. We also present, in Figures 4.9 to 4.12, barplots of the mean relative power in individuals (either cases or controls) with each allele (a) and genotype (b), for the five considered brainwaves.

When testing for differences in RP values between controls with each allele/genotype, we found that only the differences in the beta-2 brainwave for the SNPs rs71336232 and rs7144273 were nominally significant between different alleles. For these SNPs, alleles G and C, respectively, were more frequent among cases than among controls; however, subfigure (a) of Figures 4.9 and 4.10 shows that these alleles are associated with a higher beta-2 RP. This is surprising since, as we have seen in Figure 4.7, cases are expected to have lower RP values than controls for this brainwave.

As for the differences in brainwaves between cases with each allele/genotype, the values of the theta band were nominally significantly different between alleles and genotypes of the SNPs rs10833211 and rs10833214, as were the values of delta RP between the different genotypes of the second SNP. For these variants, alleles C and T, respectively, are more common among cases than among controls, but subfigure (a) of Figures 4.11 and 4.12 leads to a similar conclusion as before: they are associated with a lower theta RP. This once again contradicts what was expected, as theta has been shown to have higher RP values in cases than in controls (Figure 4.7).

| RefSNP | Minor allele | Frequency in cases | Frequency in controls | Alternative allele |
|---|---|---|---|---|
| rs71336232 | G | 0.4429 | 0.3333 | A |
| rs10833211 | C | 0.3704 | 0.2333 | T |
| rs10833214 | T | 0.3645 | 0.2333 | C |
| rs7144273(*) | T | 0.1963 | 0.3556 | C |

Tbl. 4.12 – Minor and alternative alleles of each SNP and their respective frequencies of the minor allele in the sample, in cases and in controls. (*) SNP previously associated to AD.
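Which allele is enriched in cases follows directly from Table 4.12: assuming each SNP is biallelic, the alternative allele has frequency one minus that of the minor allele, so it is more frequent in cases exactly when the minor allele is less frequent in cases. The short sketch below makes this explicit; the dictionary layout and the function name are hypothetical.

```python
# Minor-allele frequencies copied from Table 4.12.
TABLE_4_12 = {
    "rs71336232": {"minor": "G", "alt": "A", "cases": 0.4429, "controls": 0.3333},
    "rs10833211": {"minor": "C", "alt": "T", "cases": 0.3704, "controls": 0.2333},
    "rs10833214": {"minor": "T", "alt": "C", "cases": 0.3645, "controls": 0.2333},
    "rs7144273": {"minor": "T", "alt": "C", "cases": 0.1963, "controls": 0.3556},
}


def allele_enriched_in_cases(snp):
    """Return the allele that is more frequent in cases than in controls.

    Assumes a biallelic SNP, so the alternative-allele frequency is
    1 - minor-allele frequency in each group.
    """
    row = TABLE_4_12[snp]
    return row["minor"] if row["cases"] > row["controls"] else row["alt"]


enriched = {snp: allele_enriched_in_cases(snp) for snp in TABLE_4_12}
```

For rs7144273 the minor allele T is less frequent in cases, which is why the allele discussed in the text for this SNP is C.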

Fig. 4.9 – Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs71336232.

Fig. 4.10 – Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs7144273.

Fig. 4.11 – Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833211.

Fig. 4.12 – Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833214.

4.6 Discussion

Immediately at the QC stage, the struggle to obtain good quality DNA samples from AD patients was evident, especially from those at advanced stages of the disease. Indeed, of the 44 non-MCI individuals failing QC steps, 6 were controls, while 38 were patients – 3 with mild, 10 with moderate and 25 with severe AD. This is most likely due to the method of saliva sampling used in patients at more advanced stages of the disease (buccal swabs instead of Oragene collector, as mentioned in 4.2).

In the rare-variant association stage of this work, gene PLEKHA5 showed significant differences in allelic frequencies between cases and controls when the effects of its multiple rare variants were combined. The rare alleles of these variants showed up in 8 of 59 controls, and in 1 of 128 cases, so if these variants are in fact associated with AD, their effect is presumably protective against the disease.
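As a rough sanity check on the direction and size of this effect, the carrier counts quoted above can be arranged in a 2 × 2 table and summarised with an odds ratio and a two-sided Fisher exact test. This is a simplified carrier-based check written for illustration, not the SKAT-O analysis used in this work, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    p_observed = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Carrier counts quoted in the text.
carriers_cases, n_cases = 1, 128
carriers_controls, n_controls = 8, 59

a, b = carriers_cases, n_cases - carriers_cases
c, d = carriers_controls, n_controls - carriers_controls

odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
```

An odds ratio below one means that carriers are under-represented among cases, which is the pattern expected of a protective effect.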

Even though our sample size is modest, these variants appear to be much more frequent in our controls than in controls from large databases such as gnomAD (Table 4.6). It will be interesting to validate these results in a larger sample of Iberian individuals, in order to clarify whether this is a population-specific effect; indeed, this is a work in progress, and we expect to have access to new data in the future.

PLEKHA5 is an interesting candidate for an impact on AD pathology, as it is thought to play an important role at the blood–brain barrier [56]. This tightly sealed monolayer of brain endothelial cells, which keeps neurotoxic plasma-derived components, leukocytes and pathogens out of the central nervous system, has been shown to become compromised in AD and other dementias [57]. PLEKHA5 has also been shown to be differentially expressed between AD patients and controls in astrocytes [46], a brain cell type known to support the endothelial cells that form the blood–brain barrier.

We were able to find statistically significant results among the rare variation in our sample. This fits well with the hypothesis of genetic heterogeneity in complex diseases, with contribution of both common and rare alleles for disease risk.

Table 4.8 shows some interesting results, namely in the differences of mean RP values of the theta and beta frequency bands. For these brainwaves, the differences between controls and each disease stage were statistically significant (for theta and beta-2, only at the nominal level), but none of the differences between AD stages reached significance. This property makes the theta and beta brainwaves good candidate indicators of case-control status.

We aimed to correlate the EEG and genetic data; for that we selected 9 SNPs which showed nominally significant differences in allele and genotype distribution between our cases and controls, from a list of 15 selected candidate genes with functional relevance to AD. None of these SNPs had been analyzed in previous studies covering genetics-EEG interactions. SNPs rs3737002, rs7232 and rs7144273 had already shown evidence of association with Alzheimer’s disease; for two of these, the reported risk allele was also the one more frequent in cases than in controls in our sample.

Figures 4.9 to 4.12 and Table 4.12 yield some contradictory results, considering what was expected from the differences observed in brainwaves between cases and controls. Indeed, the risk allele (the one more frequent in cases) was nominally associated with the brainwave phenotype in the direction opposite to what would be expected. For example, the RP of theta waves is higher in AD patients, but the effect of the risk allele of rs10833211 (C) is in the opposite direction (lower RP).

We also noticed that, in subfigure (b) of Figures 4.9 to 4.12, sometimes it is the heterozygous state that differs from either of the homozygous states, which clashes with the hypothesis that each additional risk allele in a disease-associated variant adds to the disease risk itself.

These observations suggest that, even though these alleles/genotypes are more frequent in our patients than in controls and also have an effect on the brainwave phenotype, these associations are independent. In the future we aim to explore further the impact of genome-wide variation on the brainwave phenotype.

The two distinct analyses can complement each other, for example, by using the common set of variants of the PLEKHA5 gene as candidates for conferring disease susceptibility, and studying how EEG measures behave in individuals with each allele/genotype for these variants.

In the future, we most look forward to integrating APOE as a covariate in our models and analyses. Additionally, we are interested in further exploring the contribution of rare variants to AD risk, namely by testing other models in SKAT-O and by studying other types of variants such as copy number variants (CNVs). One aim is to use the CADD scores of the variants as their weights in SKAT-O, instead of a function of their MAFs. We are also interested in studying further other genes which were only nominally significant but had quite low p-values, and how the possible interaction of two or more genes could affect the disease. We also aim to explore other methods for rare-variant analysis that we have not yet tried, and to compare the results obtained with the different approaches.

In addition to the present sample, an effort is being made to recruit more participants and obtain their genomic data (only the biological sample is being enriched; no further EEG measurements are being done). This will allow us to validate our conclusions in the new data, and will most likely yield other, different, but still interesting results.

As for the EEG analysis, this was a first, more conservative approach to the problem. In future work, we would like to carry out this study in the opposite direction: first finding variants with significant differences in brainwave values between alleles/genotypes, and then comparing their allelic or genotypic frequencies between AD cases and controls. Because there is more power to find associations in small samples when dealing with quantitative traits than with case-control phenotypes, we are confident that the results will be much different.

In spite of all the genetic studies conducted to date on Alzheimer’s disease, there is still a fraction of heritability that remains unexplained [58]. It is important that we insist on enriching our current sample with more AD cases and healthy controls, in order to replicate this study in the new subjects and gain a better understanding of the genetic patterns underlying this complex disease.

Conclusion

In this work, we gained a general understanding of the methods used to analyze the effects of human genome variation on complex phenotypes. We would like to highlight that the methods developed to date rely on strong mathematical foundations and allow large quantities of data to be processed quickly and efficiently, without loss of information, while still achieving statistical significance.

Our analyses brought to light new genetic associations in new data. We were able to find significant results with a modest sample size when analyzing the effect of rare variation on Alzheimer’s disease (particularly the potentially protective effect of gene PLEKHA5). This is a very difficult task to accomplish, but it can be done if the variants carry a large effect, which was presumably the case.

Nevertheless, the limited results provided by these methods when applied to AD were quite evident, not only in our study but also in collaborations under international consortia. This motivated a combined approach to studying complex phenotypes. By incorporating into the genetic component the behavior of other endophenotypes of the disease in cases and controls, it could be possible to ascertain the existence of an interaction between them and eventually provide a more accurate diagnosis.

I could not help but notice that several of the methods used in association studies today were first devised decades ago. This is one of the beautiful properties of mathematics: what is true today will still be true hundreds of years from now, and what is true here is true everywhere else. It is very interesting that such a universally true science can be of such help to areas evolving as rapidly as the life and health sciences.

We live at a time when attention is focused on unlocking the power of data. It will be interesting to find out in the future what other methods being developed for the analysis of big data could add to the case of genetic association. Combining efforts from different areas of knowledge could enable the design of smarter strategies to “crack” the genetic code, and to make the most of this information to fight human disease.

Bibliography

[1] G. J. Mendel, “Experiments on Plant Hybridization,” 1866.

[2] A. Jacquard, The Genetic Structure of Populations, vol. 5. Springer-Verlag, 1974.

[3] National Library of Medicine (US). Genetics Home Reference [Internet]. Bethesda (MD): The Library, https://ghr.nlm.nih.gov/primer, Help Me Understand Genetics.

[4] D. Stram, Design, Analysis, and Interpretation of Genome-Wide Association Scans. Springer Science+Business Media, 2014.

[5] C. Anderson, F. Pettersson, G. Clarke, L. Cardon, A. Morris, and K. Zondervan, “Data quality control in genetic case-control association studies,” Nature Protocols, vol. 5, no. 9, pp. 1564–1573, 2010.

[6] S. Purcell, B. Neale, K. Todd-Brown, L. Thomas, M. A. R. Ferreira, D. Bender, J. Maller, P. Sklar, P. I. W. de Bakker, M. J. Daly, and P. C. Sham, “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses,” The American Journal Of Human Genetics, vol. 81, pp. 559–575, sep 2007.

[7] K. Lange, Mathematical and Statistical Methods for Genetic Analysis, vol. 2. Springer-Verlag, 2002.

[8] E. Reed, S. Nunez, D. Kulp, J. Qian, M. Reilly, and A. Foulkes, “A guide to genome-wide association analysis and post-analytic interrogation,” Statistics In Medicine, vol. 34, no. 28, pp. 3769–3792, 2015.

[9] N. Patterson, A. Price, and D. Reich, “Population Structure and Eigenanalysis,” Plos Genetics, vol. 2, no. 12, pp. 2074–2093, 2006.

[10] K. Zondervan and L. Cardon, “Designing candidate gene and genome-wide case–control association studies,” Nature Protocols, vol. 2, no. 10, pp. 2492–2501, 2007.

[11] B. Li and S. M. Leal, “Methods for Detecting Associations with Rare Variants for Common Diseases: Application to Analysis of Sequence Data,” American Journal of Human Genetics, vol. 83, no. 3, pp. 311–321, 2008.

[12] S. Lee, G. Abecasis, M. Boehnke, and X. Lin, “Rare-Variant Association Analysis : Study Designs and Statistical Tests,” The American Journal Of Human Genetics, pp. 5–23, 2014.

[13] H. Williams, M. Owen, and M. O’Donovan, “New findings from genetic association studies of schizophrenia,” Journal Of Human Genetics, vol. 54, no. 1, pp. 9–14, 2009.

[14] J. Ormel, C. Hartman, and H. Snieder, “The genetics of depression: successful genome-wide association studies introduce new challenges,” Translational Psychiatry, vol. 9, no. 1, 2019.

[15] J. C. Bis, X. Jian, B. W. Kunkle, Y. Chen, et al., “Whole exome sequencing study identifies novel rare and common Alzheimer’s-Associated variants involved in immune response and transcriptional regulation,” Molecular Psychiatry, 2018.

[16] T. Bird, Alzheimer Disease Overview. GeneReviews® [Internet]. Seattle (WA): University of Washington, Seattle, oct 1998.

[17] Genetics Home Reference [Internet], “Alzheimer disease,” 2013.

[18] H. Harmanci, M. Emre, H. Gurvit, B. Bilgic, H. Hanagasi, E. Gurol, H. Sahin, and S. Tinaz, “Risk Factors for Alzheimer Disease: A Population-Based Case-Control Study in Istanbul, Turkey,” Alzheimer Disease & Associated Disorders, vol. 17, no. 3, 2003.

[19] Mayo Clinic (Mayo Foundation for Medical Education and Research), “Diagnosing Alzheimer’s: How Alzheimer’s is diagnosed.”

[20] A. C. Antoniou and D. F. Easton, “Polygenic inheritance of breast cancer: Implications for design of association studies,” Genetic epidemiology, vol. 25, pp. 190–202, nov 2003.

[21] K. T. Zondervan and L. R. Cardon, “The complex interplay among factors that influence allelic association,” Nature reviews. Genetics, vol. 5, pp. 89–100, feb 2004.

[22] K. Rothman and S. Greenland, Modern epidemiology. Philadelphia: Lippincott-Raven, 1998.

[23] L. R. Cardon and L. J. Palmer, “Population stratification and spurious allelic association,” Lancet (London, England), vol. 361, pp. 598–604, feb 2003.

[24] J. J. Schlesselman and M. A. Schneiderman, “Case Control Studies: Design, Conduct, Analysis,” Journal of Occupational and Environmental Medicine, vol. 24, no. 11, 1982.

[25] H.-J. Tsai, S. Choudhry, M. Naqvi, W. Rodriguez-Cintron, E. G. Burchard, and E. Ziv, “Comparison of three methods to estimate genetic ancestry and control for stratification in genetic association studies among admixed populations,” Human genetics, vol. 118, pp. 424–433, dec 2005.

[26] P. Gorroochurn, G. A. Heiman, S. E. Hodge, and D. A. Greenberg, “Centralizing the non-central chi-square: A new method to correct for population stratification in genetic case-control association studies,” Genetic epidemiology, vol. 30, pp. 277–289, may 2006.

[27] D. Thomas, R. Xie, and M. Gebregziabher, “Two-Stage sampling designs for gene association studies,” Genetic epidemiology, vol. 27, pp. 401–414, dec 2004.

[28] J. Ioannidis, “Why most discovered true associations are inflated,” Epidemiology, vol. 20, p. 629, jul 2009.

[29] P. R. Burton, D. G. Clayton, L. R. Cardon, N. Craddock, et al., “Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls,” Nature, vol. 447, no. 7145, pp. 661–678, 2007.

[30] M. S. Silverberg, J. H. Cho, J. D. Rioux, D. P. B. McGovern, et al., “Ulcerative colitis-risk loci on chromosomes 1p36 and 12q15 found by genome-wide association study,” Nature genetics, vol. 41, pp. 216–220, feb 2009.

[31] S. A. Fisher, M. Tremelling, C. A. Anderson, R. Gwilliam, et al., “Genetic determinants of ulcerative colitis include the ECM1 locus and five loci implicated in Crohn’s disease,” Nature genetics, vol. 40, pp. 710–712, jun 2008.

[32] J. K. Wittke-Thompson, A. Pluzhnikov, and N. J. Cox, “Rational inferences about departures from Hardy-Weinberg equilibrium,” American journal of human genetics, vol. 76, pp. 967–986, jun 2005.

[33] A. P. Morris and E. Zeggini, “An evaluation of statistical approaches to rare variant analysis in genetic association studies,” Genetic epidemiology, vol. 34, pp. 188–193, feb 2010.

[34] F. Pettersson, C. Anderson, G. Clarke, J. Barrett, L. Cardon, A. Morris, and K. Zondervan, “Marker selection for genetic case-control association studies,” Nature Protocols, vol. 4, no. 5, pp. 743–752, 2009.

[35] G. Clarke, C. Anderson, F. Pettersson, L. Cardon, A. Morris, and K. Zondervan, “Basic statistical analysis in genetic case-control studies,” Nature Protocols, vol. 6, no. 2, pp. 121–133, 2011.

[36] W. G. Cochran, “Some Methods for Strengthening the Common Chi-Squared Tests,” Biometrics, vol. 10, no. 4, pp. 417–451, 1954.

[37] P. Armitage, “Tests for Linear Trends in Proportions and Frequencies,” Biometrics, vol. 11, no. 3, pp. 375–386, 1955.

[38] K. Pearson, “On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling,” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 50, pp. 157–175, jul 1900.

[39] R. A. Fisher, “On the Interpretation of χ² from Contingency Tables, and the Calculation of P,” Journal of the Royal Statistical Society, vol. 85, no. 1, pp. 87–94, 1922.

[40] D. J. Balding, M. Bishop, and C. Cannings, Handbook of Statistical Genetics, vol. 1. Wiley, 2007.

[41] S. Lee, G. R. Abecasis, M. Boehnke, and X. Lin, “Rare-variant association analysis: Study designs and statistical tests,” American Journal of Human Genetics, vol. 95, no. 1, pp. 5–23, 2014.

[42] S. Lee, M. J. Emond, M. J. Bamshad, K. C. Barnes, M. J. Rieder, D. A. Nickerson, D. C. Christiani, M. M. Wurfel, and X. Lin, “Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies,” The American Journal of Human Genetics, vol. 91, no. 2, pp. 224–237, 2012.

[43] M. Jafari and N. Ansari-Pour, “Why, when and how to adjust your P values?,” Cell Journal, vol. 20, no. 4, pp. 604–607, 2019.

[44] M. Szumilas, “Explaining odds ratios,” Journal of the Canadian Academy of Child and Adolescent Psychiatry = Journal de l’Academie canadienne de psychiatrie de l’enfant et de l’adolescent, vol. 19, pp. 227–229, aug 2010.

[45] E. E. Blue, J. C. Bis, M. O. Dorschner, D. W. Tsuang, et al., “Genetic Variation in Genes Underlying Diverse Dementias May Explain a Small Proportion of Cases in the Alzheimer’s Disease Sequencing Project,” Dementia and Geriatric Cognitive Disorders, vol. 45, no. 1-2, pp. 1–17, 2018.

[46] H. Mathys, J. Davila-Velderrain, Z. Peng, F. Gao, S. Mohammadi, J. Z. Young, M. Menon, L. He, F. Abdurrob, X. Jiang, A. J. Martorell, R. M. Ransohoff, B. P. Hafler, D. A. Bennett, M. Kellis, and L. H. Tsai, “Single-cell transcriptomic analysis of Alzheimer’s disease,” Nature, vol. 570, no. 7761, pp. 332–337, 2019.

[47] E. M. Whitham, K. J. Pope, S. P. Fitzgibbon, T. Lewis, C. R. Clark, S. Loveless, M. Broberg, A. Wallace, D. DeLosAngeles, P. Lillie, A. Hardy, R. Fronsko, A. Pulbrook, and J. O. Willoughby, “Scalp electrical recording during paralysis: quantitative evidence that EEG frequencies above 20 Hz are contaminated by EMG,” Clinical neurophysiology : official journal of the International Federation of Clinical Neurophysiology, vol. 118, pp. 1877–1888, aug 2007.

[48] N. V. Ponomareva, G. I. Korovaitseva, and E. I. Rogaev, “EEG alterations in non-demented individuals related to apolipoprotein E genotype and to risk of Alzheimer disease,” Neurobiology of Aging, vol. 29, pp. 819–827, 2008.

[49] H. de Waal, C. J. Stam, W. de Haan, E. C. W. van Straaten, M. A. Blankenstein, P. Scheltens, and W. M. van der Flier, “Alzheimer’s disease patients not carrying the apolipoprotein E ε4 allele show more severe slowing of oscillatory brain activity,” Neurobiology of Aging, vol. 34, pp. 2158–2163, 2013.

[50] J. W. Tukey, “Comparing Individual Means in the Analysis of Variance,” Biometrics, vol. 5, no. 2, pp. 99–114, 1949.

[51] M. W. Weiner, D. P. Veitch, P. S. Aisen, L. A. Beckett, N. J. Cairns, R. C. Green, D. Harvey, C. R. J. Jack, W. Jagust, J. C. Morris, R. C. Petersen, A. J. Saykin, L. M. Shaw, A. W. Toga, and J. Q. Trojanowski, “Recent publications from the Alzheimer’s Disease Neuroimaging Initiative: Reviewing progress toward improved AD clinical trials,” Alzheimer’s & dementia : the journal of the Alzheimer’s Association, vol. 13, pp. e1–e85, apr 2017.

[52] I. E. Jansen, J. E. Savage, K. Watanabe, J. Bryois, et al., “Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk,” Nature Genetics, vol. 51, no. 3, pp. 404–413, 2019.

[53] X.-Y. Ma, J.-T. Yu, M.-S. Tan, F.-R. Sun, D. Miao, and L. Tan, “Missense variants in CR1 are associated with increased risk of Alzheimer’s disease in Han Chinese,” Neurobiology of aging, vol. 35, pp. 443.e17–21, feb 2014.

[54] P. Proitsi, S. H. Lee, K. Lunnon, A. Keohane, J. Powell, C. Troakes, S. Al-Sarraj, S. Furney, H. Soininen, I. Kłoszewska, P. Mecocci, M. Tsolaki, B. Vellas, S. Lovestone, and A. Hodges, “Alzheimer’s disease susceptibility variants in the MS4A6A gene are associated with altered levels of MS4A6A expression in blood,” Neurobiology of Aging, vol. 35, no. 2, pp. 279–290, 2014.

[55] T. Patel, K. J. Brookes, J. Turton, S. Chaudhury, T. Guetta-Baranes, R. Guerreiro, J. Bras, D. Hernandez, A. Singleton, P. T. Francis, J. Hardy, and K. Morgan, “Whole-exome sequencing of the BDR cohort: evidence to support the role of the PILRA gene in Alzheimer’s disease,” Neuropathology and applied neurobiology, vol. 44, pp. 506–521, aug 2018.

[56] S. C. Eisele, C. M. Gill, G. M. Shankar, and P. K. Brastianos, “PLEKHA5: A Key to Unlock the Blood-Brain Barrier?,” Clinical cancer research : an official journal of the American Association for Cancer Research, vol. 21, pp. 1978–1980, may 2015.

[57] A. Montagne, Z. Zhao, and B. V. Zlokovic, “Alzheimer’s disease: A matter of blood-brain barrier dysfunction?,” The Journal of experimental medicine, vol. 214, pp. 3151–3169, nov 2017.

[58] P. G. Ridge, S. Mukherjee, P. K. Crane, and J. S. K. Kauwe, “Alzheimer’s disease: analyzing the missing heritability,” PloS one, vol. 8, no. 11, p. e79771, 2013.

Appendices

Appendix A

Informed consent of participation in research study

Project 0378_AD_EEGWA_2_P: “Analysis and correlation between the whole genome and brain activity to aid in the diagnosis of Alzheimer’s disease”

Consent Form for Research on Alzheimer’s Disease

Project title: Analysis and correlation between the whole genome and brain activity to aid in the diagnosis of Alzheimer’s disease (INTERREG V A Spain - Portugal)

I, ______, control/Alzheimer’s patient/legal representative of the Alzheimer’s patient/primary caregiver of the Alzheimer’s patient (strike out what does not apply), authorize the collection of saliva/buccal mucosa for the purpose of research into the disease. □ YES □ NO

I have been clearly and thoroughly informed about the project being carried out, and I understand that, at my request, the information collected may be withdrawn from the study at any time, without my having to give any explanation and without any effect on the health care to be provided. I authorize the clinical data associated with the disease to be recorded and processed electronically in this and in other scientific studies that may follow from it, with the anonymity of the samples preserved: □ YES □ NO

The anonymity of the samples will be guaranteed by assigning a code to each patient. The results obtained within the scope of these studies will always be covered by a commitment to professional confidentiality.

Full name of the patient:

Place and date:

Signature of the patient/legal representative of the patient/primary caregiver of the patient (strike out what does not apply):

Signature of the project representative:

Contacts: Nádia Pinto ([email protected]); IPATIMUP/i3S R. Alfredo Allen, 4200-135 Porto, Portugal; Tel: 22 040 8800. Patrícia Sousa or Luís Durães ([email protected]; [email protected]); Delegação Norte da Alzheimer Portugal Rua do Farol Nascente, 74A, R/C, 4455-301 Lavra, Portugal; Tel: 22 926 0912.


Appendix B

Mini Mental State Examination (MMSE)

The table below gives the range of MMSE scores that defines each AD stage. The MMSE test itself is reproduced after the table.

| | Control (CON) | Mild cognitive impairment (MCI) | Mild AD (MIL) | Moderate AD (MOD) | Severe AD (SEV) |
|---|---|---|---|---|---|
| MMSE correct answers (out of 30) | 27-30 | 21-26 | 21-26/28 | 11-20 | 0-10 |

Mini Mental State Examination (MMSE)

1. Orientation (1 point for each correct answer)

What year is it? ____
What month is it? ____
What day of the month is it? ____
What day of the week is it? ____
What season of the year is it? ____
Score: ____

What country are we in? ____
What district do you live in? ____
What town do you live in? ____
What house are we in? ____
What floor are we on? ____
Score: ____

2. Registration (1 point for each word correctly repeated)

"I am going to say three words; I would like you to repeat them, but only after I have said all of them; try to learn them by heart." Pear ____ Cat ____ Ball ____ Score: ____

3. Attention and Calculation (1 point for each correct answer. If the subject gives a wrong answer but then continues to subtract correctly, the following answers are counted as correct. Stop after 5 answers)

"Now I ask you to tell me what 30 minus 3 is, then take 3 away from the number you found, and repeat this until I tell you to stop." 27 _ 24 _ 21 _ 18 _ 15 _ Score: ____

4. Recall (1 point for each correct answer)

"See if you can tell me the three words I asked you to memorise a moment ago." Pear ____ Cat ____ Ball ____ Score: ____

5. Language (1 point for each correct answer)

a. "What is this called?" Show the objects: Watch ____ Pencil ____ Score: ____
b. "Repeat the sentence I am going to say: O RATO ROEU A ROLHA." (literally, "the rat gnawed the cork") Score: ____
c. "When I give you this sheet of paper, take it with your right hand, fold it in half and put it on the table"; hand over the sheet holding it with both hands. Takes it with the right hand ____ Folds it in half ____ Puts it where it should go ____ Score: ____
d. "Read what is on this card and do what it says." Show a card with the clearly legible sentence "FECHE OS OLHOS" ("close your eyes"); if the subject is illiterate, read the sentence aloud. Closed the eyes ____ Score: ____
e. "Write a complete sentence here." It must have a subject and a verb and make sense; grammatical errors do not affect the score.

Sentence: Score: ____

6. Constructional ability (1 point for a correct copy)

The subject must copy a drawing. Two partially overlapping pentagons; each must have 5 sides, two of which intersect. Disregard tremor or rotation.

Copy:

Score: ____ TOTAL (maximum 30 points): ____

Cognitive impairment is considered present for:
• illiterate subjects: ≤ 15 points
• 1 to 11 years of schooling: ≤ 22 points
• more than 11 years of schooling: ≤ 27 points
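The education-adjusted cut-offs that close the MMSE form amount to a simple decision rule. The sketch below is only an illustration of that rule: the function name `has_cognitive_deficit` is hypothetical, and treating "illiterate" as zero years of schooling is an assumption made here for the example, not something stated on the form.

```python
def has_cognitive_deficit(score: int, years_of_schooling: int) -> bool:
    """Apply the education-adjusted MMSE cut-offs listed on the form.

    Assumption (not in the form): an illiterate subject is encoded as
    0 years of schooling.
    """
    if not 0 <= score <= 30:
        raise ValueError("MMSE total must be between 0 and 30")
    if years_of_schooling < 0:
        raise ValueError("years_of_schooling cannot be negative")

    if years_of_schooling == 0:       # illiterate: deficit if score <= 15
        cutoff = 15
    elif years_of_schooling <= 11:    # 1 to 11 years: deficit if score <= 22
        cutoff = 22
    else:                             # more than 11 years: deficit if score <= 27
        cutoff = 27
    return score <= cutoff


# Usage with made-up subjects (score, years of schooling).
for score, years in [(15, 0), (23, 9), (27, 14)]:
    print(score, years, has_cognitive_deficit(score, years))
```

Note that this rule is separate from the staging table in this appendix, which assigns an AD stage from the score range rather than a yes/no impairment flag.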

Appendix C

Lists of nominally significant variants in SKAT-O

C.1 Model 1 (Sex as covariate)

C.1.1 “All Genes” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PLEKHA5 | rs140734813, rs200349314, rs76626801, rs77598867 | 1.28269072183063e-05 | 0.4 |
| SV2B | rs111403670, rs117484961, rs75467612 | 0.000206253233447403 | 0 |
| NLE1 | rs199557933, rs2306512, rs2306513, rs34325923 | 0.000272358434131565 | 0.9 |
| TTC28 | rs201500299, rs77885044 | 0.0010474038243793 | 1 |
| CPS1 | rs114819130, rs141373204, rs201407486 | 0.00125353521984552 | 1 |
| MCOLN2 | rs116001957, rs146406791, rs61734232, rs6704203 | 0.00138557557036122 | 0.1 |
| PLOD1 | rs138490756, rs142978362 | 0.00144483735611393 | 0 |
| ZNF816; ZNF321P; ZNF816 | rs138196750, rs142612844, rs57088011 | 0.00302071172078148 | 0 |
| SERPINA9 | rs35347445, rs45438596 | 0.00308212127305385 | 1 |
| CCP110 | rs112583917, rs147470192, rs151214000 | 0.00321539379050365 | 0.4 |
| CHRNE; C17orf107 | rs121909512, rs144169073, rs146931108, rs372635387 | 0.00321762450183518 | 0.1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| SLC28A1 | rs17222428, rs185043976, rs45523532 | 0.00339455997904208 | 0 |
| USP34 | rs184824807, rs185706364 | 0.00374367662200369 | 1 |
| ANKRD52 | rs11614777, rs201680602 | 0.00393110116837765 | 1 |
| PPL | rs140094738, rs142496602, rs148151950, rs1801936, rs549852586 | 0.00394566831328255 | 0 |
| TMTC1 | rs142394560, rs79931373 | 0.00396355529971623 | 1 |
| TNRC6A | rs113388806, rs72770407, rs72770408 | 0.00415638070167071 | 0 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00417122661142173 | 0 |
| C11orf82 | rs35533646, rs61902276 | 0.00432868709070741 | 1 |
| PLIN1 | rs74407840, rs8179071 | 0.00481371051365119 | 0 |
| DDX51 | rs61729150, rs61760237 | 0.00500479014265092 | 1 |
| MYBPC2 | rs199625688, rs369687162 | 0.00506749129114119 | 1 |
| NTHL1 | rs148104494, rs1805378 | 0.00518082903567628 | 0 |
| C2orf44 | rs139596495, rs144078821 | 0.00522940772889109 | 0.8 |
| EHMT2 | rs115884658, rs149384831 | 0.00524041605898073 | 0.9 |
| DTWD2 | rs145178816, rs183646069 | 0.00529831101655178 | 0.7 |
| KIAA1522 | rs116386430, rs181736501, rs372313200 | 0.00535278788464469 | 1 |
| FSIP2 | rs116029352, rs116639972 | 0.00609393928112853 | 0 |
| DMGDH | rs41272262, rs77116243 | 0.00613046878101422 | 1 |
| APOBEC1 | rs12820011, rs139646668, rs61753204 | 0.00615386446749772 | 0 |
| ADAD2; RP11-486L19.2 | rs141742506, rs72800737, rs78344060 | 0.00624114517250234 | 0 |
| TTC28-AS1; TTC28 | rs41277943, rs7286215 | 0.00636602277001337 | 0 |
| CD244 | rs115868021, rs12145141 | 0.00666159668529894 | 0 |
| BZRAP1 | rs142234067, rs199845425 | 0.00692257348417964 | 0 |
| DIS3L2 | rs148474013, rs184764939 | 0.00692257348417964 | 0 |
| DTNB | rs200652614, rs79208578 | 0.00692257348417964 | 0 |
| ENDOU; RP1-197B17.3 | rs141369443, rs145477608 | 0.00692257348417964 | 0 |
| NAT8 | rs140837195, rs62000430 | 0.00692257348417964 | 0 |
| PALLD | rs115372194, rs150764613 | 0.00692257348417964 | 0 |
| QSER1 | rs200945717, rs201036288 | 0.00692257348417964 | 0 |
| TRAF3IP1 | rs138469421, rs34723381 | 0.00692257348417964 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| ZNF223 | rs145131242, rs148654869 | 0.00692257348417964 | 0 |
| FAM129B | rs138792807, rs148940802 | 0.00704843600551877 | 0 |
| THADA | rs112504846, rs114186945, rs138256193, rs56269749, rs61754254 | 0.00712556889023542 | 0 |
| NUFIP2 | rs139731845, rs146112698 | 0.00717738617044695 | 0 |
| C1QL4 | rs143197428, rs144219849 | 0.0071773861704476 | 0 |
| PKD1; TSC2 | rs554255347, rs577103612 | 0.00826869445303226 | 1 |
| PIAS2 | rs117151539, rs56202233 | 0.00886376657571155 | 0.5 |
| HIF1A; HIF1A-AS2; RP11-618G20.1 | rs11549467, rs138451482, rs41508050 | 0.00920060904037799 | 1 |
| AGBL1 | rs144777747, rs181958589, rs184053635, rs73459659 | 0.00980109047683354 | 1 |
| KMT2D | rs146044282, rs3741626, rs833819 | 0.00981829147008739 | 0.8 |
| LRRC31 | rs113189539, rs187788865 | 0.00992902063328839 | 0.9 |
| AP1AR | rs116585845, rs35367822 | 0.00998458643490209 | 1 |
| COL4A4 | rs11556632, rs13027659, rs147109071, rs147910396, rs149243282, rs17353916, rs181652003, rs185029960, rs190148408, rs371717486, rs530491908, rs540904446, rs55836847, rs566586172, rs569681869 | 0.0100028568105685 | 0 |
| LDHAL6B; MYO1E | rs117906902, rs148423958 | 0.0101717606398347 | 1 |
| KIT | rs138585275, rs72549294 | 0.0107001667800719 | 0 |
| COLEC12 | rs149622251, rs188943034 | 0.010846600783724 | 0 |
| NUTM1 | rs118111266, rs141497257, rs61737331, rs61737334, rs78773193 | 0.0111416271193224 | 1 |
| PPP6R2 | rs140861368, rs144758234, rs45484793 | 0.0111704533914276 | 0 |
| ZNF335 | rs117132825, rs117802609, rs41305805 | 0.0121195168530767 | 0.9 |
| GRIN3A; PPP3R2 | rs41282159, rs62575856 | 0.0121539858551327 | 1 |
| TMEM232 | rs61732341, rs77909347, rs78510152 | 0.0121662178348798 | 1 |
| VWA8 | rs138075452, rs200944707, rs41288291, rs73464952, rs78161810 | 0.0124737840924437 | 1 |
| TAGAP; RP1-111C20.4 | rs144047559, rs41267765 | 0.0125423441127027 | 0 |
| SIPA1L3 | rs116981346, rs138476311, rs148704615, rs377395166, rs551633325, rs61729129, rs61729138, rs78686793 | 0.0134175506189133 | 1 |
| BCAR3 | rs151322895, rs527237451, rs61752468 | 0.0148068473141629 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| UBQLNL; HBG2 | rs117149898, rs7933557 | 0.0149792196536852 | 0 |
| TNS3 | rs190294673, rs2293362, rs41280696 | 0.0151902756679531 | 1 |
| POLR1A | rs146078741, rs80106112 | 0.015328280994014 | 1 |
| ARHGAP23 | rs138289112, rs151011696 | 0.0156416234993084 | 1 |
| FHOD3 | rs202146466, rs61735998, rs78740177, rs9964535 | 0.016098883614128 | 1 |
| SAMD12 | rs117020479, rs74405593 | 0.0162690152857995 | 1 |
| POLR1D | rs118191175, rs8459 | 0.0162775485327143 | 0 |
| PLAT | rs114878147, rs139316969, rs151006962 | 0.0163498154817973 | 0 |
| DBH | rs145059403, rs5324, rs76856960 | 0.0163585498491526 | 0 |
| XIRP2 | rs143400009, rs151294213, rs184983098, rs373331298 | 0.0164348618049401 | 0.2 |
| CGN | rs140720174, rs142913144, rs41272459, rs535976236 | 0.0166492880536009 | 0.1 |
| TBC1D14 | rs11731231, rs140427960 | 0.016806659366145 | 1 |
| CNTNAP4 | rs145609341, rs34251012 | 0.0171215515498836 | 0 |
| KDM1B | rs150757038, rs72840622 | 0.0173769860475965 | 1 |
| APOB | rs1042023, rs12713559, rs12713843, rs12713844, rs12720854, rs1801702, rs1801703, rs199893862, rs61736761, rs61743502, rs61744153, rs6752026, rs72653095, rs72654423, rs72654426 | 0.017435747798304 | 0 |
| OR10G9 | rs117973239, rs150573636 | 0.0176002044365379 | 0.1 |
| ITGA11 | rs148886354, rs201928196, rs2271725, rs61729767 | 0.0180104654928966 | 1 |
| SMYD1 | rs10170209, rs112558914, rs145275251 | 0.0180778758950091 | 1 |
| PLEKHG4 | rs142391556, rs145862477, rs149229558, rs17680862 | 0.0183389388216024 | 0.4 |
| MAST4 | rs114551553, rs17221458, rs17221521, rs200131786, rs55970008, rs56337909 | 0.0184235889719145 | 0.1 |
| GPR179 | rs149252987, rs149998444, rs183799079, rs62073368 | 0.01882772610412 | 0 |
| GALR2 | rs61745847, rs8192514 | 0.0189845481203674 | 0 |
| RP11-379P15.1 | rs117627469, rs6578940 | 0.0192289681078803 | 0.3 |
| C10orf12 | rs112594620, rs7082522, rs7894200 | 0.0196841742092002 | 0 |
| BRINP1 | rs139063583, rs139590963, rs142894245, rs150796528 | 0.0197120994192905 | 1 |
| RPGRIP1L | rs137982921, rs1420574, rs79892445 | 0.019885169616512 | 1 |
| IL12RB1 | rs140254802, rs145590794 | 0.0199246173167652 | 0.1 |
| ZNF592 | rs150829393, rs61737677 | 0.0200707309157494 | 0 |
| PDIA4 | rs140572691, rs142233058, rs148077813, rs2290971 | 0.020160354689889 | 0 |
| IP6K3 | rs34343647, rs34573836 | 0.0204662460249079 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PINK1; PINK1-AS | rs45515602, rs74315358 | 0.0205188503117583 | 0.7 |
| ZMYM1 | rs184831579, rs190358882 | 0.0205890139954527 | 0.7 |
| ACCSL | rs138904993, rs202158584 | 0.0207016532781156 | 0.7 |
| CLCA2 | rs145061827, rs55906076 | 0.0207184482733212 | 0.7 |
| PCCB | rs142403318, rs145135400, rs150555106 | 0.0208708715456132 | 0.8 |
| DEFB1 | rs1800968, rs2738047 | 0.0208899682443203 | 0.7 |
| ZNF532 | rs200279390, rs77308563 | 0.0209590333950379 | 0.7 |
| POLK | rs181098391, rs186798689 | 0.0209763086682901 | 0.7 |
| PHLDB3 | rs139745869, rs76853904 | 0.0210442229579007 | 0.7 |
| GAPVD1 | rs2030148, rs55779102 | 0.021056400921894 | 0.7 |
| LRRC28 | rs145009665, rs146950856 | 0.0211681072902089 | 0.7 |
| CYP2A6; CTC-490E21.12 | rs145308399, rs199916117, rs2644906, rs28399454, rs6413474, rs8192730 | 0.0212384461818441 | 0 |
| SH2D3A | rs139285655, rs147966200, rs148876828 | 0.0212758373779324 | 1 |
| ZFP28; AC007228.11 | rs141875487, rs549712142 | 0.0213326931888908 | 0.7 |
| SLU7 | rs41275313, rs74574815 | 0.0213395121888099 | 0.7 |
| CYP2J2 | rs146801076, rs56053398 | 0.0213776944463709 | 0.7 |
| PAN2 | rs117027379, rs34404784 | 0.0215674414370597 | 0 |
| RFTN2 | rs143754556, rs149995388 | 0.0215883914474104 | 0.7 |
| TRIOBP | rs150690007, rs193043234, rs199646135 | 0.0218517562101082 | 0 |
| ZNF284; ZNF223 | rs192483922, rs200252807 | 0.0218536409899198 | 0.7 |
| ALOX15B | rs141534086, rs146833910, rs149652899, rs7225107 | 0.0219124853170757 | 0 |
| OR51I1; HBE1; HBG2; AC104389.28 | rs61736831, rs61737027, rs76233016 | 0.0219165312733349 | 1 |
| SLC25A41 | rs183437348, rs191002996 | 0.0221690885582743 | 1 |
| COL15A1 | rs138655830, rs142838918 | 0.0221985123096633 | 0 |
| GLI1 | rs139792497, rs149817893 | 0.0223138288802153 | 0.7 |
| TRERF1 | rs147004250, rs371625080, rs61756353 | 0.0223215840264146 | 1 |
| BBS4 | rs147202164, rs41277724 | 0.0224909329384584 | 0 |
| ZFP41 | rs147227382, rs72695731 | 0.0225151489004749 | 1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| INA | rs185636023, rs34440112, rs78099843 | 0.0226205855092912 | 0 |
| SRD5A2; AL133247.2 | rs28383064, rs28383082 | 0.0226627000131504 | 0 |
| DHX38 | rs11554765, rs201984534, rs35794819, rs36064538, rs61749037 | 0.0226702520677626 | 0.2 |
| PEX6 | rs115960224, rs141238034, rs61753220 | 0.0228933457914512 | 0 |
| HTR3B | rs540079693, rs72466469, rs78418698 | 0.0229178061139528 | 0 |
| AC007271.3; IL1R1 | rs3917327, rs41294846 | 0.0234165251463081 | 1 |
| ITGA10 | rs116524970, rs146565671 | 0.0236344091660727 | 1 |
| CLSTN2 | rs115904131, rs200466942, rs36054782 | 0.0237660937590261 | 0 |
| WDR88 | rs112854679, rs142717725 | 0.0238192033369318 | 0 |
| LMO4 | rs12033984, rs41311178 | 0.0238203520410873 | 0 |
| NWD1 | rs117353506, rs138264504, rs144054207, rs146553378, rs147984852, rs61746179, rs73008597 | 0.0241486788198817 | 0 |
| SLC22A18 | rs141445711, rs143044180 | 0.0241965089995357 | 0 |
| CST9L | rs1054613, rs376798152, rs79890466 | 0.0243172005774492 | 1 |
| WDR60 | rs150548113, rs73167274 | 0.0245041461512109 | 0 |
| EIF2B3 | rs139445917, rs151056457, rs77068026 | 0.0245072179926894 | 1 |
| FRMPD2 | rs116143480, rs144770940, rs55802136 | 0.0248977429743919 | 0 |
| TET1 | rs117273115, rs140677396, rs199602262, rs74925160 | 0.025092572048848 | 0.4 |
| NPC1L1 | rs116204045, rs148087541, rs201908509, rs35803101, rs52815063, rs76327525 | 0.0251626392787587 | 0 |
| GPX6 | rs138819577, rs34955392, rs35701070 | 0.0255445170675985 | 0.2 |
| ITGA2 | rs13173706, rs143262642, rs80331976 | 0.0255948763874377 | 0 |
| FGD2 | rs147148066, rs201398945, rs708016 | 0.0257197378034163 | 1 |
| LBP | rs2232585, rs2232597, rs2232607, rs36015492, rs5744212 | 0.0258770444481264 | 0.2 |
| TGFBR3 | rs17882736, rs2228363 | 0.0258781121326818 | 1 |
| OR1I1 | rs75278395, rs75323205 | 0.0260123091087723 | 0 |
| EPB41L1 | rs6089016, rs73101499 | 0.0261568998880019 | 1 |
| ZNF451; RP11-203B9.4 | rs149876604, rs536946478 | 0.0267106128455809 | 0 |
| ANO5 | rs137854523, rs200631556, rs375834855, rs61910685, rs72982058, rs78428314 | 0.0268578140810088 | 1 |
| DEFB129 | rs142429237, rs41275426 | 0.0268627842547565 | 1 |
| SLIT3 | rs144594798, rs34260167, rs35305517 | 0.0269053066210037 | 1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| ABCA1 | rs138880920, rs140365800, rs28933692, rs35819696, rs9282543 | 0.0277169060915612 | 1 |
| ABCA6 | rs199889249, rs77542162 | 0.0280772345139368 | 0 |
| VPS13D | rs116415833, rs12407578, rs143861538, rs41279452, rs61774897 | 0.0282487939686733 | 0 |
| CCDC116 | rs11705259, rs28513567, rs35113423, rs76090341 | 0.0282587326626214 | 0 |
| SMCO1 | rs11926701, rs73219650, rs9869292 | 0.0284657467188571 | 1 |
| TG | rs114322847, rs114944116, rs116062097, rs142998186, rs202227846, rs2069548, rs35301433 | 0.0285602988953173 | 0.7 |
| GBP6 | rs116475216, rs144565821, rs75966734 | 0.028941927753715 | 1 |
| LURAP1L | rs140682520, rs61755264 | 0.0289995517430901 | 0 |
| PNLIPRP3 | rs144087426, rs2116286 | 0.0291950680513361 | 0 |
| ATP6V0A4 | rs150777839, rs61747674 | 0.0294276080175521 | 1 |
| FAM161A | rs139266382, rs187695569 | 0.0297796835316252 | 1 |
| RAB11FIP1 | rs139211525, rs16887092, rs75558150 | 0.0299604606841069 | 0.1 |
| GPR98 | rs111033430, rs114137750, rs13171868, rs145294917, rs145556097, rs16868974, rs186999408, rs200789563, rs200945405, rs201586455, rs41302834, rs41303344, rs41304892, rs41308846, rs537366992, rs72782753 | 0.0304827638267233 | 1 |
| MANBA | rs569997475, rs75826658 | 0.0305369950679346 | 1 |
| LYSMD4 | rs145468893, rs8025780 | 0.0311673070101934 | 1 |
| TCN2 | rs35838082, rs35915865 | 0.0312188488650053 | 0.3 |
| COL11A2 | rs1054531, rs146555195 | 0.0313516533318306 | 0.3 |
| C1orf168 | rs140047407, rs17114336, rs41305876 | 0.0313527022243418 | 1 |
| AHI1 | rs117447608, rs146416468, rs190854744, rs41288013, rs6940875 | 0.0316041450813754 | 0 |
| CNGB1 | rs147593839, rs148999583, rs16942445, rs79889567, rs8055343 | 0.0329488005645297 | 1 |
| PITPNM1 | rs143726971, rs144939807, rs61755426 | 0.0330451335363045 | 0 |
| IQGAP2 | rs10454915, rs34950321, rs34968964 | 0.0336677154087465 | 1 |
| ZNF192P1 | rs1150669, rs41269281 | 0.0338256769132504 | 1 |
| PLEKHH1 | rs111462449, rs200119528, rs45616031, rs7150973 | 0.0340243827372246 | 1 |
| COL17A1 | rs141174922, rs146267259, rs146841330, rs147785714, rs73329731 | 0.0342448049454783 | 0 |
| WFS1 | rs142428158, rs35031397, rs55993016 | 0.0343056941309401 | 0 |
| HTR4 | rs140360260, rs148952645, rs201078356 | 0.0346103984672113 | 1 |
| OAS3; RP1-71H24.1 | rs45585037, rs61732395 | 0.0347266050646556 | 1 |
| DDX42 | rs112345375, rs117181531 | 0.0348218463959154 | 1 |
| FASN | rs145866788, rs199640764, rs2228306, rs2229426 | 0.0349147627336826 | 0 |
| ALOXE3 | rs121434235, rs3027233, rs540629378 | 0.0349671148526833 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| GUCY2D | rs28743021, rs9905402 | 0.0351506597719825 | 0.5 |
| IQCH | rs35933176, rs62014646 | 0.0351759643380504 | 0 |
| HORMAD2 | rs34150968, rs975704 | 0.0355853319755312 | 1 |
| SNAPC4 | rs117208709, rs138223741, rs142043016, rs182356747, rs3812561 | 0.0358793933378758 | 0 |
| CIITA | rs2229317, rs45447897, rs45538933 | 0.0358934783025569 | 1 |
| PAPSS2 | rs150809990, rs17173698, rs45467596 | 0.0360290790678494 | 0 |
| RASAL3 | rs201042593, rs56209154 | 0.0362699401142416 | 1 |
| SLC16A10 | rs118149501, rs17072442 | 0.0366640597458007 | 1 |
| NBEA | rs189755961, rs41292197, rs41292207 | 0.0372577678260197 | 1 |
| PTPRB | rs111634123, rs149998223, rs61758735 | 0.0372709603241508 | 1 |
| BIVM-ERCC5; ERCC5 | rs142438319, rs375592438, rs4150342 | 0.0375234697933093 | 1 |
| FUS | rs140875749, rs201608365, rs387906627, rs929867 | 0.0378085788151265 | 1 |
| RBM6 | rs143972186, rs562534362, rs61731329 | 0.0378112280869849 | 1 |
| GSTM5 | rs113130058, rs150417585 | 0.0381904507346418 | 0.8 |
| DNAH2 | rs11656500, rs117487916, rs118057786, rs11868946, rs140035206, rs141976760, rs142532084, rs142627042, rs146533727, rs35788701, rs57926692, rs73232344 | 0.0386180787790155 | 0 |
| DYNC2H1 | rs191381310, rs201043335, rs61898615 | 0.0386531913044877 | 0 |
| RAD54L | rs138546115, rs28363192 | 0.0388548450954659 | 0 |
| APOL1 | rs150247892, rs180731649 | 0.0388693317457725 | 0 |
| SLC39A4 | rs200073988, rs75920625 | 0.0388727238584385 | 0.9 |
| MPP4 | rs373969080, rs542804992, rs573231898 | 0.0390158152291755 | 0 |
| NID2 | rs117965887, rs144461334, rs150406341, rs35147930 | 0.0390927572196594 | 0.1 |
| KRT71 | rs141534890, rs144618122, rs34468387, rs35988863, rs74095123 | 0.0391205501522466 | 1 |
| LY6G5C | rs11575852, rs145921744 | 0.0393459723217628 | 0 |
| MDC1; MDC1-AS1 | rs148637924, rs2517560, rs28986467, rs3132589, rs58344693, rs75355880, rs80087328 | 0.0394736995152022 | 0.8 |
| COLGALT2 | rs140276948, rs201381167 | 0.0394812441325765 | 0 |
| ZNF644 | rs140271599, rs143932357 | 0.039635523208475 | 0 |
| CHST11 | rs1048662, rs148230565, rs1565814, rs76161287 | 0.0397463544807108 | 1 |
| IL27RA | rs149202847, rs76543168 | 0.0397525333406776 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| TMEM95; RP11-542C16.1; KCTD11 | rs534565673, rs79041177 | 0.0398153470041515 | 0 |
| CELSR1 | rs140996267, rs575259420, rs61737811, rs75983687 | 0.0399276380691761 | 0 |
| ZNF678 | rs116552491, rs61744724 | 0.0406933537455352 | 0 |
| OR5K3; RP11-325B23.2 | rs74627725, rs80178587 | 0.0406933537455381 | 0 |
| CC2D1A | rs201251295, rs201884654, rs61740117 | 0.0407765503053354 | 0 |
| LARS | rs112912805, rs150148403 | 0.040931163349354 | 0 |
| CYP4V2 | rs138739819, rs149684063, rs61745524, rs72646291 | 0.0409730338218514 | 0 |
| CRTAC1 | rs140424345, rs141406261, rs187123855, rs200414761, rs546782945 | 0.0412310268556996 | 0 |
| AHCTF1 | rs142603415, rs146417813 | 0.0413441939390779 | 0 |
| LAMC2 | rs140949383, rs142335339, rs17481405 | 0.0414635245733795 | 0 |
| SOGA2 | rs34877994, rs368883249, rs371815051, rs55676538, rs72942342 | 0.0415537493628491 | 0 |
| RP11-327F22.4; CYLD | rs116979331, rs3743781, rs9646285 | 0.0416071036568635 | 1 |
| DIS3L; RP11-352G18.2 | rs35711894, rs372600238 | 0.0416147465327248 | 1 |
| AGBL3 | rs141725502, rs148526262, rs17804854, rs561505613 | 0.0428668703205987 | 1 |
| CYP3A4 | rs28371759, rs28371763, rs3091339, rs4986910 | 0.0431845144911781 | 0.8 |
| MELK | rs114617403, rs117713820, rs35142210 | 0.0435665533855277 | 1 |
| PAM | rs116526321, rs2230458, rs35658696, rs61736661, rs78753846 | 0.0436155616686695 | 0.4 |
| FSIP2; AC008174.3 | rs115715789, rs142675481, rs202032936, rs74508015 | 0.0440382206714343 | 0.8 |
| IKZF1 | rs200731248, rs73695637, rs79731631 | 0.0443499257361525 | 0 |
| TSNAXIP1 | rs202129606, rs562641068, rs74684664 | 0.0447377855523133 | 0 |
| NLRP5 | rs138053847, rs199871361, rs200541204, rs202104062, rs34175666 | 0.0449402037447389 | 0 |
| SPATA6L; PPAPDC2 | rs41279557, rs62543898, rs71496487 | 0.0451831405155117 | 0 |
| ANK2 | rs121912706, rs138842207, rs35249198, rs36210417, rs66785829, rs72544141 | 0.0452962561187048 | 0.5 |
| ZNF846 | rs182054421, rs78118057, rs79195449 | 0.0463797351231157 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| UNC13B | rs138440338, rs139577182, rs41276043, rs41315995 | 0.0463803216514959 | 0 |
| PAH | rs5030851, rs5030860, rs62508588 | 0.0467106983679985 | 0 |
| PLXNC1 | rs114905217, rs115651556 | 0.0467789179130978 | 0.8 |
| HLA-E | rs149396632, rs17195369 | 0.0468751182577532 | 0.5 |
| PTPN23 | rs147122610, rs147400377, rs149563514, rs56349424 | 0.0473848527957535 | 0.2 |
| SEC63 | rs111637159, rs140498555, rs143966094, rs186286561, rs386704681, rs672444, rs687374 | 0.0474597915306307 | 0 |
| TLL2 | rs113406248, rs138680811, rs41291632, rs542374653 | 0.0477176174975066 | 0.6 |
| RP11-243A14.1 | rs74008600, rs78170742 | 0.0477224214651125 | 1 |
| TMC7 | rs118019760, rs151053735, rs55796412 | 0.0480243619478897 | 1 |
| GSTZ1 | rs140540096, rs149972480 | 0.0481135336546697 | 1 |
| OR1J4 | rs145918200, rs568140238 | 0.0490473421706158 | 1 |
| WRN | rs34477820, rs78488552 | 0.0493336081894851 | 1 |
| OVCH1; OVCH1-AS1 | rs12305672, rs77444163 | 0.0495362522344141 | 0 |
| BPIFB1 | rs45473499, rs61737874, rs61739245 | 0.0498943552800507 | 1 |

Tbl. C.1 – Genes nominally significant in SKAT-O under null Model 1 for the “All Genes” list, with their rare variants, p-values and values of ρ (ρ is the weight given to the Burden test within SKAT-O, equivalently the assumed correlation between the variants' regression coefficients).
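To make the role of ρ in these tables concrete, the following sketch computes the SKAT-O statistic for a fixed ρ as the standard weighted combination Q_ρ = (1 − ρ) Q_SKAT + ρ Q_Burden, so that ρ = 0 reduces to SKAT and ρ = 1 to the Burden test. It is a minimal illustration, not the pipeline used in this work: the function name `skat_o_q_rho`, the toy genotypes, residuals and weights are all assumptions, and the p-value step (which in SKAT-O involves searching over a grid of ρ values) is omitted.

```python
import numpy as np


def skat_o_q_rho(genotypes, residuals, weights, rho):
    """Return Q_rho = (1 - rho) * Q_SKAT + rho * Q_Burden for one gene.

    genotypes: (n_samples, n_variants) minor-allele counts
    residuals: (n_samples,) phenotype minus fitted value under the null model
    weights:   (n_variants,) per-variant weights
    """
    genotypes = np.asarray(genotypes, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")

    # Weighted per-variant score statistics w_j * sum_i G_ij * r_i.
    scores = weights * (genotypes.T @ residuals)
    q_skat = float(np.sum(scores ** 2))    # sum of squared weighted scores
    q_burden = float(np.sum(scores) ** 2)  # square of the summed weighted scores
    return (1.0 - rho) * q_skat + rho * q_burden


# Toy data (made up): 3 samples, 2 rare variants, unit weights.
G = [[1, 0], [0, 1], [1, 1]]
r = [0.5, -0.5, 0.2]
w = [1.0, 1.0]
for rho in (0.0, 0.5, 1.0):
    print(rho, skat_o_q_rho(G, r, w, rho))
```

In the toy data the two variants have score statistics of opposite sign, so the Burden component is smaller than the SKAT component; this is the kind of situation in which a small ρ would be favoured.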

C.1.2 “Dementia” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| EIF2B3 | rs139445917, rs151056457, rs77068026 | 0.0245072179926894 | 1 |
| FUS | rs140875749, rs201608365, rs387906627, rs929867 | 0.0378085788151265 | 1 |

Tbl. C.2 – Genes nominally significant in SKAT-O under null Model 1 for the “Dementia” list, with their rare variants, p-values and values of ρ (ρ is the weight given to the Burden test within SKAT-O, equivalently the assumed correlation between the variants' regression coefficients).

C.1.3 “DGE Brain” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PLEKHA5 | rs140734813, rs200349314, rs76626801, rs77598867 | 1.28269072183063e-05 | 0.4 |
| CCP110 | rs112583917, rs147470192, rs151214000 | 0.00321539379050365 | 0.4 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00417122661142173 | 0 |
| LDHAL6B; MYO1E | rs117906902, rs148423958 | 0.0101717606398347 | 1 |
| ITGA2 | rs13173706, rs143262642, rs80331976 | 0.0255948763874377 | 0 |
| AHI1 | rs117447608, rs146416468, rs190854744, rs41288013, rs6940875 | 0.0316041450813754 | 0 |
| IQCH | rs35933176, rs62014646 | 0.0351759643380504 | 0 |
| COLGALT2 | rs140276948, rs201381167 | 0.0394812441325765 | 0 |

Tbl. C.3 – Genes nominally significant in SKAT-O under null Model 1 for the “DGE Brain” list, with their rare variants, p-values and values of ρ (ρ is the weight given to the Burden test within SKAT-O, equivalently the assumed correlation between the variants' regression coefficients).

C.1.4 “AD Disgenet” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| TNRC6A | rs113388806, rs72770407, rs72770408 | 0.00415638070167071 | 0 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00417122661142173 | 0 |
| ADAMTSL4; RP11-54A4.2 | rs115937511, rs74124919 | 0.00530661540824785 | 1 |
| NAT8 | rs140837195, rs62000430 | 0.00692257348417964 | 0 |
| PKD1; TSC2 | rs554255347, rs577103612 | 0.00826869445303226 | 1 |
| HIF1A; HIF1A-AS2; RP11-618G20.1 | rs11549467, rs138451482, rs41508050 | 0.00920060904037799 | 1 |
| KIT | rs138585275, rs72549294 | 0.0107001667800719 | 0 |
| GRIN3A; PPP3R2 | rs41282159, rs62575856 | 0.0121539858551327 | 1 |
| UBQLNL; HBG2 | rs117149898, rs7933557 | 0.0149792196536852 | 0 |
| PLAT | rs114878147, rs139316969, rs151006962 | 0.0163498154817973 | 0 |
| DBH | rs145059403, rs5324, rs76856960 | 0.0163585498491526 | 0 |
| APOB | rs1042023, rs12713559, rs12713843, rs12713844, rs12720854, rs1801702, rs1801703, rs199893862, rs61736761, rs61743502, rs61744153, rs6752026, rs72653095, rs72654423, rs72654426 | 0.017435747798304 | 0 |
| PINK1; PINK1-AS | rs45515602, rs74315358 | 0.0205188503117583 | 0.7 |
| OR51I1; HBE1; HBG2; AC104389.28 | rs61736831, rs61737027, rs76233016 | 0.0219165312733349 | 1 |
| GLI1 | rs139792497, rs149817893 | 0.0223138288802153 | 0.7 |
| AC007271.3; IL1R1 | rs3917327, rs41294846 | 0.0234165251463081 | 1 |
| LMO4 | rs12033984, rs41311178 | 0.0238203520410873 | 0 |
| TET1 | rs117273115, rs140677396, rs199602262, rs74925160 | 0.025092572048848 | 0.4 |
| LBP | rs2232585, rs2232597, rs2232607, rs36015492, rs5744212 | 0.0258770444481264 | 0.2 |
| SLIT3 | rs144594798, rs34260167, rs35305517 | 0.0269053066210037 | 1 |
| ABCA1 | rs138880920, rs140365800, rs28933692, rs35819696, rs9282543 | 0.0277169060915612 | 1 |
| COL11A2 | rs1054531, rs146555195 | 0.0313516533318306 | 0.3 |
| ANK2 | rs121912706, rs66785829 | 0.0317837787769395 | 1 |
| HTR4 | rs140360260, rs148952645, rs201078356 | 0.0346103984672113 | 1 |
| FASN | rs145866788, rs199640764, rs2228306, rs2229426 | 0.0349147627336826 | 0 |
| IL27RA | rs149202847, rs76543168 | 0.0397525333406776 | 0 |
| LAMC2 | rs140949383, rs142335339, rs17481405 | 0.0414635245733795 | 0 |
| CYP3A4 | rs28371759, rs28371763, rs3091339, rs4986910 | 0.0431845144911781 | 0.8 |
| AHI1 | rs146416468, rs41288013, rs6940875 | 0.0485310038440118 | 0 |

Tbl. C.4 – Genes nominally significant in SKAT-O under null Model 1 for the “AD Disgenet” list, with their rare variants, p-values and values of ρ (ρ is the weight given to the Burden test within SKAT-O, equivalently the assumed correlation between the variants' regression coefficients).

C.2 Model 2 (Sex, age, PC1 and PC2 as covariates)

C.2.1 “All Genes” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PLEKHA5 | rs140734813, rs200349314, rs76626801, rs77598867 | 1.50254508876202e-05 | 0.9 |
| SV2B | rs111403670, rs117484961, rs75467612 | 0.000178643122383135 | 0 |
| NLE1 | rs199557933, rs2306512, rs2306513, rs34325923 | 0.000244188077878442 | 0.5 |
| HTR4 | rs140360260, rs148952645, rs201078356 | 0.000777045997368726 | 0 |
| TNRC6A | rs113388806, rs72770407, rs72770408 | 0.000836140698784495 | 0.4 |
| MCOLN2 | rs116001957, rs146406791, rs61734232, rs6704203 | 0.00119637578233909 | 0.1 |
| CPS1 | rs114819130, rs141373204, rs201407486 | 0.0014023121199455 | 1 |
| TTC28 | rs201500299, rs77885044 | 0.00146023253185451 | 1 |
| PLOD1 | rs138490756, rs142978362 | 0.00147126818548714 | 0 |
| SERPINA9 | rs35347445, rs45438596 | 0.00185860755012535 | 1 |
| ZNF816; ZNF321P; ZNF816 | rs138196750, rs142612844, rs57088011 | 0.00217129665422722 | 0 |
| PALLD | rs115372194, rs150764613 | 0.00299435881830362 | 0 |
| CHRNE; C17orf107 | rs121909512, rs144169073, rs146931108, rs372635387 | 0.00304860494035021 | 0 |
| PPL | rs140094738, rs142496602, rs148151950, rs1801936, rs549852586 | 0.00320290456631349 | 0 |
| KIAA1522 | rs116386430, rs181736501, rs372313200 | 0.00330644495545055 | 0.9 |
| TMTC1 | rs142394560, rs79931373 | 0.0033524556764779 | 1 |
| QSER1 | rs200945717, rs201036288 | 0.00405311101271283 | 0 |
| C2orf44 | rs139596495, rs144078821 | 0.00472710540729702 | 1 |
| DTWD2 | rs145178816, rs183646069 | 0.00486245532725343 | 1 |
| C11orf82 | rs35533646, rs61902276 | 0.00501918100397406 | 1 |
| SLC28A1 | rs17222428, rs185043976, rs45523532 | 0.00510475945845344 | 0 |
| TTC28-AS1; TTC28 | rs41277943, rs7286215 | 0.0052930311672276 | 0 |
| DDX51 | rs61729150, rs61760237 | 0.00558382724282616 | 1 |
| BZRAP1 | rs142234067, rs199845425 | 0.00561912847067032 | 0 |
| CCP110 | rs112583917, rs147470192, rs151214000 | 0.00563643423142686 | 0.4 |
| ANKRD52 | rs11614777, rs201680602 | 0.00573979004494657 | 1 |
| MYBPC2 | rs199625688, rs369687162 | 0.00611893227974381 | 1 |
| TMEM232 | rs61732341, rs77909347, rs78510152 | 0.00626777840105006 | 1 |
| TRAF3IP1 | rs138469421, rs34723381 | 0.00627888288394177 | 0 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00644068645041817 | 0 |
| USP34 | rs184824807, rs185706364 | 0.00653042481347841 | 1 |
| SCRN2 | rs138415129, rs145155073 | 0.00666121709791486 | 0.5 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| FHOD3 | rs202146466, rs61735998, rs78740177, rs9964535 | 0.00676982224344123 | 1 |
| PLIN1 | rs74407840, rs8179071 | 0.00682616047883803 | 0 |
| SIPA1L3 | rs116981346, rs138476311, rs148704615, rs377395166, rs551633325, rs61729129, rs61729138, rs78686793 | 0.00699271657934568 | 1 |
| AP1AR | rs116585845, rs35367822 | 0.00702103870955583 | 1 |
| LRRC31 | rs113189539, rs187788865 | 0.00708155867222073 | 1 |
| PCCB | rs142403318, rs145135400, rs150555106 | 0.00716866059411653 | 1 |
| ADAD2; RP11-486L19.2 | rs141742506, rs72800737, rs78344060 | 0.00737818761922772 | 0 |
| ENDOU; RP1-197B17.3 | rs141369443, rs145477608 | 0.00742178110149353 | 0 |
| DMGDH | rs41272262, rs77116243 | 0.00742623559891306 | 1 |
| DIS3L2 | rs148474013, rs184764939 | 0.00758703327967919 | 1 |
| ACER2 | rs10964136, rs41270117 | 0.00767163897232399 | 1 |
| FAM129B | rs138792807, rs148940802 | 0.007698917317014 | 0 |
| FSIP2 | rs116029352, rs116639972 | 0.00784012946571176 | 0 |
| KMT2D | rs146044282, rs3741626, rs833819 | 0.00791068952262381 | 1 |
| ZNF223 | rs145131242, rs148654869 | 0.00825125538439621 | 0 |
| EHMT2 | rs115884658, rs149384831 | 0.00845086744515336 | 0 |
| PIAS2 | rs117151539, rs56202233 | 0.00860459589933969 | 0.4 |
| CD244 | rs115868021, rs12145141 | 0.00870443769868754 | 0 |
| APOBEC1 | rs12820011, rs139646668, rs61753204 | 0.00872512936414573 | 0 |
| LRRC28 | rs145009665, rs146950856 | 0.00893096957775697 | 0 |
| KIT | rs138585275, rs72549294 | 0.00921411786255708 | 0 |
| NTHL1 | rs148104494, rs1805378 | 0.0102706165648086 | 0 |
| PKD1; TSC2 | rs554255347, rs577103612 | 0.0103760306833386 | 1 |
| NAT8 | rs140837195, rs62000430 | 0.0106260508015023 | 0 |
| NUTM1 | rs118111266, rs141497257, rs61737331, rs61737334, rs78773193 | 0.0113113874293209 | 1 |
| LMO4 | rs12033984, rs41311178 | 0.011366270027716 | 0 |
| ACAN | rs75169935, rs77572130 | 0.0114465940606828 | 0 |
| THADA | rs112504846, rs114186945, rs138256193, rs56269749, rs61754254 | 0.011651667173751 | 0 |
| C1QL4 | rs143197428, rs144219849 | 0.0116843552626357 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| COL4A4 | rs11556632, rs13027659, rs147109071, rs147910396, rs149243282, rs17353916, rs181652003, rs185029960, rs190148408, rs371717486, rs530491908, rs540904446, rs55836847, rs566586172, rs569681869 | 0.0117572046707965 | 0 |
| NUFIP2 | rs139731845, rs146112698 | 0.0117983017600076 | 0 |
| LBP | rs2232585, rs2232597, rs2232607, rs36015492, rs5744212 | 0.0118076418785924 | 0.1 |
| WDR88 | rs112854679, rs142717725 | 0.0121390712539442 | 0 |
| SLC22A18 | rs141445711, rs143044180 | 0.0124847078170114 | 0.3 |
| DTNB | rs200652614, rs79208578 | 0.0128267351308665 | 0 |
| HIF1A; HIF1A-AS2; RP11-618G20.1 | rs11549467, rs138451482, rs41508050 | 0.0133799977531176 | 1 |
| PAM | rs116526321, rs2230458, rs35658696, rs61736661, rs78753846 | 0.0140590840422535 | 0.3 |
| DBH | rs145059403, rs5324, rs76856960 | 0.0142464969009851 | 0 |
| PTPRB | rs111634123, rs149998223, rs61758735 | 0.0144794706874032 | 1 |
| VWA8 | rs138075452, rs200944707, rs41288291, rs73464952, rs78161810 | 0.0145636837825792 | 1 |
| INA | rs185636023, rs34440112, rs78099843 | 0.0146424713095899 | 0 |
| PPP6R2 | rs140861368, rs144758234, rs45484793 | 0.0147993040155834 | 0 |
| BCAR3 | rs151322895, rs527237451, rs61752468 | 0.015584028666548 | 0.1 |
| RPGRIP1L | rs137982921, rs1420574, rs79892445 | 0.015602206252904 | 1 |
| OR10G9 | rs117973239, rs150573636 | 0.0157755693519727 | 0.8 |
| ZFP41 | rs147227382, rs72695731 | 0.0160848677850381 | 1 |
| WDR41 | rs389319, rs41272254 | 0.0163445241164171 | 1 |
| CLCA2 | rs145061827, rs55906076 | 0.0170008455367421 | 0.7 |
| ZNF846 | rs182054421, rs78118057, rs79195449 | 0.0172249087726115 | 0 |
| TNS3 | rs190294673, rs2293362, rs41280696 | 0.0172570782741986 | 1 |
| LAMC2 | rs140949383, rs142335339, rs17481405 | 0.0172906952770334 | 0 |
| TSNAXIP1 | rs202129606, rs562641068, rs74684664 | 0.0173480496827564 | 0 |
| GLI1 | rs139792497, rs149817893 | 0.0175359302372476 | 1 |
| CCDC116 | rs11705259, rs28513567, rs35113423, rs76090341 | 0.0176251531412645 | 0 |
| TRIOBP | rs150690007, rs193043234, rs199646135 | 0.0177038308174322 | 0 |
| ABCA6 | rs199889249, rs77542162 | 0.0177507777671656 | 0 |
| TG | rs114322847, rs114944116, rs116062097, rs142998186, rs202227846, rs2069548, rs35301433 | 0.0178169544336766 | 0.7 |
| TIAM1 | rs113219144, rs200132537, rs200252911, rs34882418 | 0.0178483196533519 | 0 |
| GYG1 | rs143137713, rs566956486 | 0.0180559231043166 | 1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| KDM1B | rs150757038, rs72840622 | 0.0183141939158937 | 1 |
| XIRP2 | rs143400009, rs151294213, rs184983098, rs373331298 | 0.0186851117441058 | 0.2 |
| SLIT3 | rs144594798, rs34260167, rs35305517 | 0.0186851482902388 | 1 |
| ACCSL | rs138904993, rs202158584 | 0.0187172278291208 | 0.6 |
| ZMYM1 | rs184831579, rs190358882 | 0.0188216306963231 | 0.9 |
| POLR1A | rs146078741, rs80106112 | 0.0188853396822383 | 1 |
| SLC16A10 | rs118149501, rs17072442 | 0.0190297084390874 | 1 |
| CKAP2 | rs143514154, rs202229967 | 0.0190342057236467 | 0.1 |
| CYP2A6; CTC-490E21.12 | rs145308399, rs199916117, rs2644906, rs28399454, rs6413474, rs8192730 | 0.0192029767298586 | 0 |
| ZNF592 | rs150829393, rs61737677 | 0.0192238756481847 | 0 |
| IP6K3 | rs34343647, rs34573836 | 0.0192338758741167 | 0 |
| RRP7A | rs146723886, rs61731241, rs645625 | 0.0192343688931491 | 0 |
| POLK | rs181098391, rs186798689 | 0.0193989113963823 | 0.9 |
| GRIN3A; PPP3R2 | rs41282159, rs62575856 | 0.0198487505809034 | 1 |
| NPC1L1 | rs116204045, rs148087541, rs201908509, rs35803101, rs52815063, rs76327525 | 0.0200565061174125 | 0 |
| EPB41L1 | rs6089016, rs73101499 | 0.020065932276915 | 1 |
| NWD1 | rs117353506, rs138264504, rs144054207, rs146553378, rs147984852, rs61746179, rs73008597 | 0.0201584915666096 | 0 |
| SH2D3A | rs139285655, rs147966200, rs148876828 | 0.0203969012015729 | 1 |
| POLR1D | rs118191175, rs8459 | 0.0206868765621305 | 0 |
| GBP6 | rs116475216, rs144565821, rs75966734 | 0.0210723208957813 | 1 |
| DYNC2H1 | rs191381310, rs201043335, rs61898615 | 0.0212883400565177 | 0 |
| BBS4 | rs147202164, rs41277724 | 0.0215706904380262 | 0 |
| RFTN2 | rs143754556, rs149995388 | 0.0218475141907273 | 0.1 |
| PHLDB3 | rs139745869, rs76853904 | 0.0221262197379432 | 1 |
| UBQLNL; HBG2 | rs117149898, rs7933557 | 0.0221915354023159 | 0 |
| NBEA | rs189755961, rs41292197, rs41292207 | 0.0222508893959804 | 1 |
| PDIA4 | rs140572691, rs142233058, rs148077813, rs2290971 | 0.0225629931801491 | 0 |
| PIM1 | rs34698101, rs36084391 | 0.0226029119923128 | 1 |
| PAPSS2 | rs150809990, rs17173698, rs45467596 | 0.0228338839841587 | 0 |
| ZNF532 | rs200279390, rs77308563 | 0.0229091761342515 | 1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| RP11-379P15.1 | rs117627469, rs6578940 | 0.0235250409561545 | 0.4 |
| MAST4 | rs114551553, rs17221458, rs17221521, rs200131786, rs55970008, rs56337909 | 0.0236659756560168 | 0.1 |
| ZNF284; ZNF223 | rs192483922, rs200252807 | 0.0237028039727898 | 1 |
| TRERF1 | rs147004250, rs371625080, rs61756353 | 0.0240945104254255 | 1 |
| ITGA10 | rs116524970, rs146565671 | 0.0241403886518612 | 1 |
| SNAPC4 | rs117208709, rs138223741, rs142043016, rs182356747, rs3812561 | 0.024235089074704 | 0 |
| MNDA | rs148142374, rs35417083 | 0.024543570714563 | 0.9 |
| APOB | rs1042023, rs12713559, rs12713843, rs12713844, rs12720854, rs1801702, rs1801703, rs199893862, rs61736761, rs61743502, rs61744153, rs6752026, rs72653095, rs72654423, rs72654426 | 0.0246206953550908 | 0 |
| WDR60 | rs150548113, rs73167274 | 0.0247831638439548 | 0 |
| SORCS2 | rs144615557, rs16840892, rs35935435 | 0.0248644649430154 | 1 |
| AGBL1 | rs144777747, rs181958589, rs184053635, rs73459659 | 0.0248770012453626 | 1 |
| HORMAD2 | rs34150968, rs975704 | 0.0251661795782988 | 1 |
| CRTAC1 | rs140424345, rs141406261, rs187123855, rs200414761, rs546782945 | 0.0253062769332999 | 0 |
| CYP3A4 | rs28371759, rs28371763, rs3091339, rs4986910 | 0.0253225281289883 | 1 |
| FASN | rs145866788, rs199640764, rs2228306, rs2229426 | 0.0256925759218217 | 0 |
| HTR3B | rs540079693, rs72466469, rs78418698 | 0.0257290026170584 | 0 |
| GTF3C5 | rs191292694, rs374767427 | 0.0257392969884936 | 1 |
| FRMPD2 | rs116143480, rs144770940, rs55802136 | 0.025831389866472 | 0 |
| GLDN | rs140063339, rs151045681 | 0.0263851305649508 | 0 |
| COLEC12 | rs149622251, rs188943034 | 0.0266499595755794 | 0 |
| PEX6 | rs115960224, rs141238034, rs61753220 | 0.0268191257399536 | 0 |
| OR1I1 | rs75278395, rs75323205 | 0.0273806601418587 | 0 |
| LURAP1L | rs140682520, rs61755264 | 0.0278791957857137 | 0 |
| BIVM-ERCC5; ERCC5 | rs142438319, rs375592438, rs4150342 | 0.0281214036133539 | 1 |
| ANO5 | rs137854523, rs200631556, rs375834855, rs61910685, rs72982058, rs78428314 | 0.028605945174274 | 1 |
| SCN4A | rs372019457, rs41280102, rs80338952 | 0.0290303337384318 | 1 |
| CLSTN2 | rs115904131, rs200466942, rs36054782 | 0.0291319896301949 | 0 |
| ZNF335 | rs117132825, rs117802609, rs41305805 | 0.0294848641227067 | 1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| C10orf12 | rs112594620, rs7082522, rs7894200 | 0.0294866646057862 | 0 |
| CYP7A1 | rs142708991, rs528066345 | 0.0296601055364116 | 0 |
| C1orf168 | rs140047407, rs17114336, rs41305876 | 0.0298545963032464 | 1 |
| PAN2 | rs117027379, rs34404784 | 0.0299214402377285 | 0 |
| HPD | rs137852868, rs36023382 | 0.0299672690944064 | 1 |
| ITGA2 | rs13173706, rs143262642, rs80331976 | 0.0305020687440385 | 0 |
| ZNF451; RP11-203B9.4 | rs149876604, rs536946478 | 0.030677870658024 | 0 |
| PTPN18 | rs111873277, rs74409947 | 0.0308535587487031 | 0 |
| DHX38 | rs11554765, rs201984534, rs35794819, rs36064538, rs61749037 | 0.0309073120760382 | 0.1 |
| GPR179 | rs149252987, rs149998444, rs183799079, rs62073368 | 0.031190259611201 | 0 |
| SRD5A2; AL133247.2 | rs28383064, rs28383082 | 0.0313914622866728 | 0 |
| PLAT | rs114878147, rs139316969, rs151006962 | 0.0315530938148344 | 0 |
| ZNF192P1 | rs1150669, rs41269281 | 0.0319349882300523 | 1 |
| CC2D1A | rs201251295, rs201884654, rs61740117 | 0.0320377363693387 | 0 |
| CST9L | rs1054613, rs376798152, rs79890466 | 0.0321664662571288 | 1 |
| ZFP28; AC007228.11 | rs141875487, rs549712142 | 0.0323365637519039 | 0.7 |
| OR51I1; HBE1; HBG2; AC104389.28 | rs61736831, rs61737027, rs76233016 | 0.032348672763517 | 0.9 |
| SAMD12 | rs117020479, rs74405593 | 0.0325844459974752 | 1 |
| ITGA11 | rs148886354, rs201928196, rs2271725, rs61729767 | 0.0326454782918444 | 1 |
| CNTNAP4 | rs145609341, rs34251012 | 0.0330045166331927 | 0 |
| KIF17 | rs34232864, rs376988005, rs41310420 | 0.0336034896838567 | 0 |
| LYSMD4 | rs145468893, rs8025780 | 0.0336643712034059 | 0.8 |
| GPR108 | rs139348331, rs144129725, rs201134279, rs4807897 | 0.0338328669780701 | 0 |
| HAS1 | rs148269958, rs34682338, rs45625331 | 0.0338699339557596 | 0 |
| TGFBR3 | rs17882736, rs2228363 | 0.034152032169158 | 1 |
| ARHGAP23 | rs138289112, rs151011696 | 0.0344789466775748 | 1 |
| FGD2 | rs147148066, rs201398945, rs708016 | 0.0346853644564203 | 1 |
| DCTN1 | rs121909344, rs55862001 | 0.034781208679355 | 1 |
| AHI1 | rs117447608, rs146416468, rs190854744, rs41288013, rs6940875 | 0.0348901387762959 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| TCN2 | rs35838082, rs35915865 | 0.0349430674881751 | 0.2 |
| TBC1D14 | rs11731231, rs140427960 | 0.0351080740632916 | 1 |
| NCOR2 | rs142292731, rs1472840, rs184942554, rs200297509, rs36081651, rs61755988 | 0.0355086592828766 | 0 |
| SEC63 | rs111637159, rs140498555, rs143966094, rs186286561, rs386704681, rs672444, rs687374 | 0.0357937637874845 | 0.5 |
| IQCH | rs35933176, rs62014646 | 0.0360494177441141 | 0 |
| ATP6V0A4 | rs150777839, rs61747674 | 0.0360774292699474 | 1 |
| ALOX15B | rs141534086, rs146833910, rs149652899, rs7225107 | 0.0366216677001951 | 0 |
| SIGLEC12 | rs145853613, rs73051357 | 0.0368765536343258 | 0.8 |
| WRN | rs34477820, rs78488552 | 0.0371677139583508 | 1 |
| VPS13D | rs116415833, rs12407578, rs143861538, rs41279452, rs61774897 | 0.0372812318090968 | 0 |
| C1orf95 | rs149458463, rs41305727, rs41314286 | 0.0373476882197225 | 1 |
| EIF2B3 | rs139445917, rs151056457, rs77068026 | 0.0379421059141253 | 1 |
| SOGA2 | rs34877994, rs368883249, rs371815051, rs55676538, rs72942342 | 0.0380210362322233 | 0 |
| IL12RB1 | rs140254802, rs145590794 | 0.0384824069084902 | 0.7 |
| XAB2 | rs35275272, rs61761630 | 0.0391572517457411 | 0 |
| RNF165 | rs16978564, rs61744490, rs76095029 | 0.0393093532050838 | 1 |
| SMCO1 | rs11926701, rs73219650, rs9869292 | 0.0397104119420714 | 1 |
| ANKLE1 | rs145927575, rs77683348 | 0.039987158434491 | 0 |
| GABRA4 | rs114810504, rs115206335 | 0.0403809018886395 | 0 |
| TMC7 | rs118019760, rs151053735, rs55796412 | 0.0405209031024033 | 1 |
| USH1C | rs145013633, rs41282932 | 0.04054544945038 | 0 |
| IL27RA | rs149202847, rs76543168 | 0.0405733216327705 | 0 |
| FAM81B | rs117093563, rs148203459, rs1541797, rs200859980, rs553700628, rs76962324 | 0.0406043976222065 | 0.4 |
| CHST11 | rs1048662, rs148230565, rs1565814, rs76161287 | 0.0406195690942903 | 1 |
| ZBBX | rs150765050, rs200250021, rs200749252, rs34465133 | 0.0409406066410592 | 0 |
| IKZF1 | rs200731248, rs73695637, rs79731631 | 0.040945969086711 | 0 |
| ALOXE3 | rs121434235, rs3027233, rs540629378 | 0.0411044121293398 | 0 |
| TTLL6 | rs147901482, rs149873082 | 0.041133460490088 | 0 |
| ERRFI1 | rs41278968, rs74761277 | 0.0412368035680894 | 1 |
| SLC39A4 | rs200073988, rs75920625 | 0.0412570193381026 | 1 |
| SBNO1 | rs114314586, rs181914282, rs61760909 | 0.0414417527692715 | 0.2 |
| CGN | rs140720174, rs142913144, rs41272459, rs535976236 | 0.0415041810014879 | 0.2 |
| RAB11FIP1 | rs139211525, rs16887092, rs75558150 | 0.0416105868433961 | 0.1 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PLCD4 | rs200161654, rs61733653 | 0.0420855206164819 | 0 |
| SLC25A41 | rs183437348, rs191002996 | 0.0421425761632844 | 1 |
| TTLL2 | rs12528714, rs145779681, rs148797490, rs149701744 | 0.042451095562093 | 0 |
| HPS1 | rs116698870, rs58548334 | 0.0428722632340464 | 1 |
| RAD54L | rs138546115, rs28363192 | 0.0430424617056263 | 0 |
| STARD13 | rs34425674, rs41306650 | 0.0431469699831179 | 0 |
| PRDM2 | rs148083107, rs148892113, rs17350795, rs41269799 | 0.0434639821877816 | 1 |
| SLC12A1 | rs34661166, rs34819316 | 0.0435192742105717 | 0.5 |
| ZC3H3 | rs145312531, rs149025999 | 0.0437957390105376 | 0 |
| GALR2 | rs61745847, rs8192514 | 0.0438023616847217 | 0 |
| PRSS12 | rs35996030, rs558897637 | 0.0440455050007418 | 1 |
| ABCA1 | rs138880920, rs140365800, rs28933692, rs35819696, rs9282543 | 0.0441439207185963 | 1 |
| MPP4 | rs373969080, rs542804992, rs573231898 | 0.044401846584119 | 0 |
| PLXNC1 | rs114905217, rs115651556 | 0.0445129908768141 | 0.8 |
| FSIP2; AC008174.3 | rs115715789, rs142675481, rs202032936, rs74508015 | 0.0447925463240687 | 1 |
| GLIS3 | rs117876027, rs148572278, rs200265407, rs35154632 | 0.044851218915117 | 0 |
| GSTM5 | rs113130058, rs150417585 | 0.0448736431974401 | 1 |
| TMEM95; RP11-542C16.1; KCTD11 | rs534565673, rs79041177 | 0.0449051899923299 | 0 |
| WFS1 | rs142428158, rs35031397, rs55993016 | 0.0449112948530476 | 0 |
| AHNAK | rs112663036, rs114515655, rs115627503, rs116243978, rs139375615, rs149505116, rs201078275, rs529488235, rs61312994, rs74853209, rs75436331, rs75855515, rs76414066, rs77055528 | 0.0450163025780678 | 0 |
| TCF21; RP3-323P13.2 | rs56412384, rs61729591 | 0.0454710510914957 | 1 |
| DNAH2 | rs11656500, rs117487916, rs118057786, rs11868946, rs140035206, rs141976760, rs142532084, rs142627042, rs146533727, rs35788701, rs57926692, rs73232344 | 0.0455739487008744 | 0.1 |
| DIS3L; RP11-352G18.2 | rs35711894, rs372600238 | 0.0456710981525653 | 1 |
| COL17A1 | rs141174922, rs146267259, rs146841330, rs147785714, rs73329731 | 0.0457693522588996 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| NCOA2 | rs189421236, rs201963563 | 0.0458682410669888 | 1 |
| AC007271.3; IL1R1 | rs3917327, rs41294846 | 0.0461057456307039 | 1 |
| DEFB1 | rs1800968, rs2738047 | 0.0461192018261696 | 0.7 |
| LARS | rs112912805, rs150148403 | 0.0461931837831271 | 0 |
| FNBP1 | rs3739861, rs41279162 | 0.0462437340046724 | 0 |
| FAM161A | rs139266382, rs187695569 | 0.0464046989826674 | 1 |
| TPMT | rs115106679, rs1800462, rs6921269, rs72552738 | 0.0465141949054824 | 0 |
| TET3 | rs199849765, rs56254597, rs72816199, rs72818011 | 0.0466787269072474 | 1 |
| COL11A2 | rs1054531, rs146555195 | 0.0466856771119994 | 0.4 |
| SPATA31D1 | rs145142439, rs75742550 | 0.0468336314078605 | 0 |
| TSPAN15 | rs143114858, rs62625031 | 0.0468558170847533 | 1 |
| RBM6 | rs143972186, rs562534362, rs61731329 | 0.0472878594640195 | 1 |
| SCN1A; AC010127.3 | rs121917910, rs544692790 | 0.0473494782478932 | 0 |
| ADAMDEC1; RP11-624C23.1 | rs147608974, rs200134300, rs61752046 | 0.047416873559009 | 1 |
| OR4D5 | rs116430791, rs77776318 | 0.0474472315880398 | 0.9 |
| TET1 | rs117273115, rs140677396, rs199602262, rs74925160 | 0.047538305689707 | 0.4 |
| KNDC1 | rs145440398, rs34697182 | 0.0475910329109525 | 0.8 |
| PLEKHH1 | rs111462449, rs200119528, rs45616031, rs7150973 | 0.0476120121460348 | 1 |
| TAGAP; RP1-111C20.4 | rs144047559, rs41267765 | 0.0476531034835423 | 0 |
| PAH | rs5030851, rs5030860, rs62508588 | 0.0478371817930683 | 0.5 |
| PINK1; PINK1-AS | rs45515602, rs74315358 | 0.0479352111127964 | 0.6 |
| SMYD1 | rs10170209, rs112558914, rs145275251 | 0.048079539753303 | 1 |
| SLC8A3 | rs34816272, rs76324914 | 0.0480950703278233 | 0.7 |
| BPIFB1 | rs45473499, rs61737874, rs61739245 | 0.0485039597118579 | 1 |
| OR1J4 | rs145918200, rs568140238 | 0.0485819338914997 | 1 |
| DEFB129 | rs142429237, rs41275426 | 0.0486744940816048 | 1 |
| FRMD4B | rs114083486, rs148389023, rs199706552, rs202099937 | 0.0487321483488207 | 1 |
| CHL1 | rs142251617, rs73105057 | 0.0487725622945439 | 0 |
| TRPC6 | rs116690727, rs142655335, rs267602665 | 0.0488382770645735 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| COL15A1 | rs138655830, rs142838918 | 0.0489046805207205 | 0 |
| FUS | rs140875749, rs201608365, rs387906627, rs929867 | 0.049302533617717 | 1 |
| GAPVD1 | rs2030148, rs55779102 | 0.0496139356419425 | 0.7 |

Tbl. C.5 – Nominally significant genes in SKAT-O under null Model 2 for the “All Genes” list, with their respective rare variants, p-values and values of ρ (ρ is the weight of the Burden test within SKAT-O, equivalently the correlation assumed between regression coefficients).
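For reference, the role of ρ described in the caption can be written out explicitly. The block below is a sketch of the usual SKAT-O formulation rather than something stated in this appendix; the symbols $Q_{\mathrm{SKAT}}$, $Q_{\mathrm{Burden}}$, $S_j$, $w_j$, $g_{ij}$ and $\hat{\mu}_i$ are notation assumed here for illustration.

```latex
% Score for variant j (g_{ij}: genotype of individual i at variant j,
% \hat{\mu}_i: fitted value under the null model, w_j: variant weight)
S_j = \sum_{i=1}^{n} g_{ij}\,\bigl(y_i - \hat{\mu}_i\bigr)

% The two component statistics over the m rare variants of a gene
Q_{\mathrm{SKAT}} = \sum_{j=1}^{m} w_j^{2} S_j^{2},
\qquad
Q_{\mathrm{Burden}} = \Bigl(\sum_{j=1}^{m} w_j S_j\Bigr)^{2}

% SKAT-O statistic for a given \rho \in [0, 1]
Q_{\rho} = (1-\rho)\,Q_{\mathrm{SKAT}} + \rho\,Q_{\mathrm{Burden}}
```

Read this way, ρ = 0 corresponds to the pure SKAT test and ρ = 1 to the pure Burden test, while intermediate values mix the two. The ρ listed for each gene would then be the value, among those searched, that gives the smallest p-value.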

C.2.2 “Dementia” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| EIF2B3 | rs139445917, rs151056457, rs77068026 | 0.0379421059141253 | 1 |
| FUS | rs140875749, rs201608365, rs387906627, rs929867 | 0.049302533617717 | 1 |

Tbl. C.6 – Nominally significant genes in SKAT-O under null Model 2 for the “Dementia” list, with their respective rare variants, p-values and values of ρ (ρ is the weight of the Burden test within SKAT-O, equivalently the correlation assumed between regression coefficients).

C.2.3 “DGE Brain” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| PLEKHA5 | rs140734813, rs200349314, rs76626801, rs77598867 | 1.50254508876202e-05 | 0.9 |
| CCP110 | rs112583917, rs147470192, rs151214000 | 0.00563643423142686 | 0.4 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00644068645041817 | 0 |
| SCRN2 | rs138415129, rs145155073 | 0.00666121709791486 | 0.5 |
| SORCS2 | rs144615557, rs16840892, rs35935435 | 0.0248644649430154 | 1 |
| ITGA2 | rs13173706, rs143262642, rs80331976 | 0.0305020687440385 | 0 |
| AHI1 | rs117447608, rs146416468, rs190854744, rs41288013, rs6940875 | 0.0348901387762959 | 0 |
| IQCH | rs35933176, rs62014646 | 0.0360494177441141 | 0 |
| AHNAK | rs112663036, rs114515655, rs115627503, rs116243978, rs139375615, rs149505116, rs201078275, rs529488235, rs61312994, rs74853209, rs75436331, rs75855515, rs76414066, rs77055528 | 0.0450163025780678 | 0 |
| KNDC1 | rs145440398, rs34697182 | 0.0475910329109525 | 0.8 |

Tbl. C.7 – Nominally significant genes in SKAT-O under null Model 2 for the “DGE Brain” list, with their respective rare variants, p-values and values of ρ (ρ is the weight of the Burden test within SKAT-O, equivalently the correlation assumed between regression coefficients).

C.2.4 “AD Disgenet” list

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| HTR4 | rs140360260, rs148952645, rs201078356 | 0.000777045997368726 | 0 |
| TNRC6A | rs113388806, rs72770407, rs72770408 | 0.000836140698784495 | 0.4 |
| ADAMTSL4; RP11-54A4.2 | rs115937511, rs74124919 | 0.00487858432147673 | 1 |
| MYO5C | rs145997599, rs200009646, rs55712142, rs56250328, rs62623565, rs72734946 | 0.00644068645041817 | 0 |
| KIT | rs138585275, rs72549294 | 0.00921411786255708 | 0 |
| PKD1; TSC2 | rs554255347, rs577103612 | 0.0103760306833386 | 1 |
| NAT8 | rs140837195, rs62000430 | 0.0106260508015023 | 0 |
| LMO4 | rs12033984, rs41311178 | 0.011366270027716 | 0 |
| ACAN | rs75169935, rs77572130 | 0.0114465940606828 | 0 |
| LBP | rs2232585, rs2232597, rs2232607, rs36015492, rs5744212 | 0.0118076418785924 | 0.1 |
| HIF1A; HIF1A-AS2; RP11-618G20.1 | rs11549467, rs138451482, rs41508050 | 0.0133799977531176 | 1 |
| DBH | rs145059403, rs5324, rs76856960 | 0.0142464969009851 | 0 |
| LAMC2 | rs140949383, rs142335339, rs17481405 | 0.0172906952770334 | 0 |
| GLI1 | rs139792497, rs149817893 | 0.0175359302372476 | 1 |
| SLIT3 | rs144594798, rs34260167, rs35305517 | 0.0186851482902388 | 1 |
| GRIN3A; PPP3R2 | rs41282159, rs62575856 | 0.0198487505809034 | 1 |
| UBQLNL; HBG2 | rs117149898, rs7933557 | 0.0221915354023159 | 0 |
| APOB | rs1042023, rs12713559, rs12713843, rs12713844, rs12720854, rs1801702, rs1801703, rs199893862, rs61736761, rs61743502, rs61744153, rs6752026, rs72653095, rs72654423, rs72654426 | 0.0246206953550908 | 0 |
| SORCS2 | rs144615557, rs16840892, rs35935435 | 0.0248644649430154 | 1 |
| CYP3A4 | rs28371759, rs28371763, rs3091339, rs4986910 | 0.0253225281289883 | 1 |
| FASN | rs145866788, rs199640764, rs2228306, rs2229426 | 0.0256925759218217 | 0 |
| HPD | rs137852868, rs36023382 | 0.0299672690944064 | 1 |
| PLAT | rs114878147, rs139316969, rs151006962 | 0.0315530938148344 | 0 |

| Gene | SNPs | p-value | ρ |
|---|---|---|---|
| OR51I1; HBE1; HBG2; AC104389.28 | rs61736831, rs61737027, rs76233016 | 0.032348672763517 | 0.9 |
| IL27RA | rs149202847, rs76543168 | 0.0405733216327705 | 0 |
| SBNO1 | rs114314586, rs181914282, rs61760909 | 0.0414417527692715 | 0.2 |
| TTN; TTN-AS1 | rs114331773, rs116676813, rs12471771, rs13398235, rs146181116, rs148067743, rs17354992, rs17452588, rs199895260, rs200181804, rs2306636, rs2562832, rs2742347, rs3731745, rs3813245, rs3813246, rs4893852, rs4893853, rs4894048, rs55675869, rs55742743, rs55762754, rs55801134, rs55837610, rs55842557, rs55880440, rs55886356, rs56018860, rs56142888, rs72646808, rs72646809, rs72646891, rs72648237, rs72648257, rs72648272, rs72648273, rs72648927, rs72648937, rs72648942, rs72648970 | 0.0427422954992841 | 1 |
| HPS1 | rs116698870, rs58548334 | 0.0428722632340464 | 1 |
| ABCA1 | rs138880920, rs140365800, rs28933692, rs35819696, rs9282543 | 0.0441439207185963 | 1 |
| GLIS3 | rs117876027, rs148572278, rs200265407, rs35154632 | 0.044851218915117 | 0 |
| AC007271.3; IL1R1 | rs3917327, rs41294846 | 0.0461057456307039 | 1 |
| COL11A2 | rs1054531, rs146555195 | 0.0466856771119994 | 0.4 |
| TET1 | rs117273115, rs140677396, rs199602262, rs74925160 | 0.047538305689707 | 0.4 |
| PINK1; PINK1-AS | rs45515602, rs74315358 | 0.0479352111127964 | 0.6 |
| SLC8A3 | rs34816272, rs76324914 | 0.0480950703278233 | 0.7 |
| CHL1 | rs142251617, rs73105057 | 0.0487725622945439 | 0 |
| TRPC6 | rs116690727, rs142655335, rs267602665 | 0.0488382770645735 | 0 |

Tbl. C.8 – Nominally significant genes in SKAT-O under null Model 2 for the “AD Disgenet” list, with their respective rare variants, p-values and values of ρ (ρ is the weight of the Burden test within SKAT-O, equivalently the correlation assumed between regression coefficients).