Efficient SNP based Heritability Estimation and Multiple Phenotype-Genotype Association Analysis in Large Scale Cohort studies

A DISSERTATION SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY

SOUVIK SEAL

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Advised by Dr. Saonli Basu

August, 2020

© SOUVIK SEAL 2020 ALL RIGHTS RESERVED

Acknowledgements

There are many people to whom I am grateful for their contribution to my time in graduate school. I would like to thank Dr. Saonli Basu for mentoring me through the past few years. Thank you for introducing me to challenging datasets and relevant problems in the world of Statistical Genetics. I would like to thank Dr. Abhirup Datta from Johns Hopkins University for guiding me in my most recent work, described in chapter 3. Thank you for exposing me to Spatial Statistics, a branch which was entirely new to me. I would also like to thank Dr. Matt McGue for introducing me to the Minnesota Center for Twin and Family Research (MCTFR) study, which I have used in two of my projects. I am thankful to all my teachers, whose classes have broadened my biostatistical intellect and greatly helped in my research. I am grateful to my parents Mr. Sanjay Seal and Mrs. Srabani Seal and my girlfriend Manjari Das from Carnegie Mellon University for their incredible support throughout the past four years. Manjari, special thanks to you not only for your brilliant ideas from time to time that have helped me in advancing my research but also for helping me get through tough times. Finally, I would like to thank Dr. Saurabh Ghosh and Dr. Kiranmoy Das from Indian Statistical Institute, Kolkata for motivating me to join the Biostatistics PhD program at the University of Minnesota.

Dedication

The thesis is dedicated to my grandmother Rani Seal, whose company I dearly miss every day. You and my parents have been pivotal in shaping my life.

Abstract

Recent developments in genotyping technologies have opened up many new possibilities of unravelling the genetic basis of common diseases. The past decade has seen the advent of many large scale cohort studies, giving researchers access to an unprecedented wealth of data on millions of genetic variants and numerous diseases/traits on millions of individuals. But efficient analysis of such high-dimensional data demands novel, non-traditional statistical techniques. The development of a complex human disease is an intricate interplay of genetic and environmental factors. In order to better understand such traits, we are often interested in estimating the overall trait heritability: the proportion of total trait variance due to genetic factors within a given population. Accurate estimation and inference of heritability gives us some basic understanding of disease risk and etiology. Traits with high estimated heritability incite interest among researchers in a further Genome-Wide Association Study (GWAS) to pinpoint the significant genetic variants. As we move into the era of genome editing and personalized medicine, addressing the shared genetic basis of multiple diseases/traits, or the genetic basis of a single disease/trait over multiple time-points, becomes more and more important. In light of these exciting statistical problems, my thesis focuses on developing robust tools for estimating heritability and performing GWAS in large scale cohort studies in both a univariate and a multivariate context.

Contents

Acknowledgements
Dedication
Abstract
List of Tables
List of Figures

1 Introduction

2 Heritability Estimation and Genetic Association Testing in Longitudinal Twins Study
  2.1 Introduction
  2.2 Materials and Methods
    2.2.1 Cross-sectional Family Study
    2.2.2 Existing Methods for longitudinal Twins Study
    2.2.3 Proposed model
  2.3 Estimation in RMFM and MFM
    2.3.1 RMFM
    2.3.2 MFM Method of Moments (MFM-MOM)
  2.4 Results
    2.4.1 Comparing heritability in simulation setup by different approaches
    2.4.2 Univariate heritabilities in Real Data by different approaches
  2.5 Discussion

3 Efficient SNP-based Heritability estimation using Gaussian Predictive Process
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 Genome-based Restricted Maximum Likelihood Approach
    3.2.2 Proposed Method
  3.3 Results
    3.3.1 Simulation using Coalescent Theory
    3.3.2 Simulation using UK Biobank data
    3.3.3 Analysis of real UK Biobank traits
  3.4 Time Comparison
  3.5 Discussion

4 Multivariate Association Analysis of Correlated Traits in Related Individuals
  4.1 Introduction
  4.2 Material and Method
    4.2.1 Existing Methods
    4.2.2 Proposed Method
  4.3 Results
    4.3.1 Simulation Study
    4.3.2 Real Data Analysis
  4.4 Discussion

References

Appendix A.
  A.1 Calculating variance of MFM-MOM heritability estimate
    A.1.1 Theorems

Appendix B.
  B.1 Additional Figures
    B.1.1 Simulation from section 3.3.2.1
    B.1.2 Simulation from section 3.3.2.2
    B.1.3 Simulation from section 3.3.2.3

Appendix C.
  C.1 Positive semi-definiteness of RMultiPAR's covariance assumption
  C.2 Comparing the assumptions of RMultiPAR with the traditional approach
  C.3 Development of Adjusted RMultiPAR
  C.4 Discussion about Adjusted RMultiPAR
  C.5 Simulation with more number of traits
    C.5.1 Manhattan Plots

List of Tables

2.1 The table compares the computational time, in seconds, of fitting RMFM using Optim in R with the proposed two stage approach of fitting RMFM and also the MFM-MOM. Under the simulation setup described in section 2.4, each of the methods was run 100 times and the minimum, maximum and mean values were listed.

2.2 Mean of the univariate heritabilities by OpenMx

2.3 Univariate Heritabilities ($h_k^2$) in Real Data

3.1 Mean comparison of different methods for two cases: Case (1) and Case (2) with true $h^2 = 0.8$

3.2 Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes with true $h^2 = 0.7$ and 40,000 individuals

3.3 Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (a) with true $h^2 = 0.7$.

3.4 Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (b) with true $h^2 = 0.2$.

3.5 Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (a) with true $h^2 = 0.6$.

3.6 Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (b) with true $h^2 = 0.2$.

3.7 Time comparison of different methods in seconds for the simulation from section (3.3.1) with 5k (8k SNPs) and 8k (13k SNPs) individuals.

3.8 Time comparison of PredLMM in minutes for varying knot (subsample) sizes with Bolt-REML under the simulation from section (3.3.2.3)

4.1 The table lists the common SNPs detected by all three methods: RMultiPAR, PCA and MinP at p-value threshold of $1 \times 10^{-8}$.

4.2 The table lists the SNPs detected only by RMultiPAR at p-value threshold of $1 \times 10^{-8}$.

C.1 The table lists the mean (and sd in the bracket) of the values the function $f$ takes in the interval $((1 - \sqrt{\tau})^2, (1 + \sqrt{\tau})^2)$ for datapoints separated by 0.002.

List of Figures

2.1 Comparing univariate heritabilities obtained by three different methods. Shows that the estimates of the heritabilities are biased from the true value 0.8 in case of OpenMx.

2.2 Histograms of the univariate heritabilities obtained by marginal ACE models

2.3 Histograms of the univariate heritabilities obtained by MFM-MOM

2.4 Histogram of the multivariate heritability by marginal ACE models

2.5 Histogram of the multivariate heritabilities obtained by MFM-MOM

3.1 The figure compares MSE of different methods for case (1) and (2).

3.2 The figure plots the pairwise principal components of the genetic data of the individuals from the UK Biobank cohort along with their self-reported ancestries.

3.3 The figure compares MSE of GREML (sub) and PredLMM for three different subsample (knot) sizes, 4000, 8000 and 16,000.

3.4 The figure compares MSE of GREML (sub) and PredLMM for five different subsample sizes under case (a) (top) and case (b) (bottom).

3.5 The figure compares MSE of GREML (sub) and PredLMM for five different subsample sizes for case (a) (top) and case (b) (bottom).

3.6 The figure shows barplot of the heritability estimates by two methods with different subsample sizes.

4.1 The plot shows the type 1 error and power of the methods under the simulation setup of the section (4.3.1.1).

4.2 The plot compares the type 1 error and power of different methods under the simulation setup of the section (4.3.1.2) for three different values of $\rho$ at level 0.05

4.3 The histogram of RMMLR test statistic for the case of $\rho = 0.5$ (on the left) and $\rho = 0.8$ (on the right) in section (4.3.1.3)

4.4 The plot compares the type 1 error and power of different methods under the simulation setup of the section (4.3.1.3) at level 0.05.

4.5 This is a venn diagram showing the common SNPs detected by the three methods at a p-value threshold of $1 \times 10^{-8}$.

B.1 The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under simulation from section (3.3.2.1). The dotted line corresponds to the true heritability $h^2 = 0.7$ and the blue dot inside each boxplot refers to the mean point of the estimates.

B.2 The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (a) from section (3.3.2.2). The dotted line corresponds to the true heritability $h^2 = 0.7$.

B.3 The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (b) from section (3.3.2.2). The dotted line corresponds to the true heritability $h^2 = 0.2$.

B.4 The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (a) from section (3.3.2.3). The dotted line corresponds to the true heritability $h^2 = 0.6$.

B.5 The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (b) from section (3.3.2.3). The dotted line corresponds to the true heritability $h^2 = 0.2$.

C.1 Plot of function $f$ for $x \in ((1 - \sqrt{\tau})^2, (1 + \sqrt{\tau})^2)$

C.2 The plot compares the type 1 error and power of different methods under the simulation setup of the section (4.3.1.2) of the main paper with respectively 6 and 8 traits.

C.3 The figure shows the Manhattan plots of the negative of the log p-values from the different methods. The blue line corresponds to negative of the logarithm of $1 \times 10^{-5}$. The red line corresponds to negative of the logarithm of the threshold of $1 \times 10^{-8}$.

Chapter 1

Introduction

Recent developments in genotyping technologies have opened up many new possibilities of unravelling the genetic basis of common diseases. The availability of rich genomic and phenotypic data from large scale cohort studies and biobanks is giving us access to an unprecedented wealth of data (Allen et al., 2014). But it is not trivial to extract meaningful information from such a large volume of data. Efficient and robust analysis of such high-dimensional data demands non-traditional statistical techniques.

The development of a complex human trait/phenotype is an intricate interplay of genetic and environmental factors. In order to better understand such traits, we are often interested in estimating the overall trait heritability: the proportion of total trait variance due to genetic factors within a given population. Accurate estimation and inference of heritability gives us some basic understanding of disease risk and etiology. Traits with high estimated heritability incite interest among researchers in a further Genome-Wide Association Study (GWAS) to pinpoint the significant genetic variants. Moreover, the availability of multiple correlated traits on individuals has generated ample interest in studying the shared genetic basis of these correlated traits. Pleiotropy is a phenomenon where a genetic variant affects multiple traits. We have known for decades that pleiotropy is widespread, and the current availability of data on multiple traits has generated huge interest in the detection of such pleiotropic variants. However, performing a genome-wide association analysis with multiple correlated traits poses both statistical and computational challenges. In light of these exciting statistical problems, my thesis focuses on developing robust tools for estimating heritability and performing GWAS in large scale cohort studies in both a univariate and a multivariate context.

Heritability of a trait is mostly defined from the perspective of a cross sectional study design. However, we often encounter longitudinal studies with repeated measurements on a trait. My first project aims to develop a statistical model that handles repeated measures of a trait over time and allows for more precise estimation of heritability. The classical twin study model, also known as the ACE model (Rabe-Hesketh et al., 2008), with monozygotic (MZ) and dizygotic (DZ) twin pairs, is widely used for estimating the trait heritability. It is a Linear Mixed Model (LMM) which decomposes the trait variance into three latent components: the additive genetic component ($\sigma_A^2$), the shared environmental component ($\sigma_C^2$) and the non-shared environmental component ($\sigma_E^2$). Heritability ($h^2$) is calculated as the proportion of variance explained by the additive genetic effect, $\sigma_A^2 / (\sigma_A^2 + \sigma_C^2 + \sigma_E^2)$. The parameter $h^2$ can also be written as twice the difference between the correlation of MZ twin pairs and the correlation of DZ twin pairs. Thus, another approach to estimating $h^2$ without directly estimating the latent variance components is Falconer's method (Falconer, 1960), which uses the sample MZ and DZ correlations to estimate $h^2$. A multivariate (longitudinal) extension of the ACE model and a corresponding definition of multivariate (longitudinal) heritability have been proposed in this context (Klingenberg and Monteiro, 2005; Ge et al., 2016). But such a modelling framework involves a large number of variance-covariance parameters, making Maximum Likelihood (ML) based parameter estimation extremely difficult, and infeasible altogether when the number of time-points is too large. Also, in some specific scenarios, unique MLEs of the variance components may not exist (Roś et al., 2016), which can severely bias the estimation of the heritability, as seen in chapter 2.

As the first goal of this thesis, we extend the previously discussed Falconer's method to a longitudinal context to estimate multivariate (longitudinal) heritability. The new modelling framework, which we term the Multivariate Falconer's Model (MFM), is based on comparing MZ and DZ correlations over different time-points. We develop a method of moments approach for obtaining the parameter estimates in MFM. MFM can be shown to be theoretically equivalent to the Multivariate ACE (MACE) model. But a major shortcoming of MACE can be overcome in MFM by allowing different variances for the different types of twin pairs (MZ and DZ) to accommodate the differences in shared environments (Hur et al., 2008). We also propose a reasonably simplified version of MFM, which we name Reduced MFM (RMFM), that involves far fewer variance-covariance parameters than MFM or the MACE model, making ML based parameter estimation feasible in a high-dimensional scenario. We develop a rapid two-stage ML based estimation procedure for obtaining the parameter estimates in RMFM. We use these methods to estimate both univariate and multivariate (longitudinal) heritability, along with their standard errors, for the alcohol consumption trait in the Minnesota Center for Twin and Family Research (MCTFR) study.

Twin study methodologies for estimating heritability have long been critiqued for being based upon a number of faulty assumptions (Burt and Simons, 2014).

In recent years, advances in genome sequencing have generated genetic data on large scale cohort studies, such as UK Biobank (Allen et al., 2014), the Precision Medicine cohort (Khoury and Evans, 2015) and the Million Veteran Program (Gaziano et al., 2016). These studies consist of information on millions of genetic markers and numerous diseases/traits on millions of individuals. This enormous wealth of genomic data has provided the opportunity to use SNP-based methods for estimating heritability (Yang et al., 2010). Throughout the past few decades, GWA studies have identified hundreds of SNPs, revealing information on the genetic variation of complex diseases and traits. For most traits, however, the associated SNPs from GWAS explain only a small fraction of the heritability estimated from twin studies. In search of this so-called "missing heritability", instead of focusing just on the associated SNPs, researchers nowadays try to capture even infinitesimal SNP effects by taking into account all the SNPs in an LMM framework (Yang et al., 2011; Lippert et al., 2011; Loh et al., 2015; Chen et al., 2016). This SNP-based LMM setup, often referred to as the Genome-based Restricted Maximum Likelihood (GREML) approach, usually involves distantly related people, that is, apparently unrelated individuals who share genetic relatedness due to their evolutionary history (Weir et al., 2006). The GREML approach has become widely popular and has immense potential for unveiling the genetic basis of the numerous traits available from large scale cohort studies.

Several software packages, such as Genome-wide Complex Trait Analysis (GCTA) (Yang et al., 2011) and Genome-wide Efficient Mixed Model Association (GEMMA) (Zhou and Stephens, 2012), implement efficient algorithms to fit the GREML model. All these methods have per iteration computational complexity of $O(N^3)$ ($N$ being the number of individuals) and struggle badly when $N$ is large, as we have seen in our analysis of UK Biobank data (with even $N = 100,000$ individuals). A much more efficient Monte Carlo Average Information REML algorithm has been implemented in the software named Bolt-REML (Loh et al., 2015), which can handle a considerably larger number of individuals ($N > 100,000$).

As the second goal of this thesis, we approximate the likelihood corresponding to the GREML approach to develop a significantly more rapid algorithm for estimating heritability. The approximation is motivated by unifying concepts of genetic coalescence (Kingman, 2000; Degnan and Salter, 2005) and Gaussian Predictive Process models (Banerjee et al., 2008; Finley et al., 2009). Our method, which we name PredLMM, exploits the structure of the genetic relationship matrix (GRM) to ease the computationally demanding linear algebraic steps of the standard GREML algorithm, such as calculating the determinant or inverse of a high dimensional ($N \times N$) matrix at every iteration. It reduces per iteration computational complexity from $O(N^3)$ FLOPS (floating point operations) to $O(r^2 N) + O(r^3)$ FLOPS, where $r$ is much smaller than $N$. We verify the reliability and robustness of our proposed approach through extensive simulation studies. We also analyze a section of the UK Biobank cohort, estimating the heritability of multiple quantitative traits. We implement PredLMM in an efficient Python module which is to be released soon.

Estimating heritability is the first step towards precisely identifying the genetic variants behind a disease/trait. In recent years, the study of pleiotropy, or identifying shared genetic determinants of multiple correlated traits, has been gaining momentum as we move into the era of personalized medicine and genome editing. In the world of personalized medicine, it is really important to ensure that a drug tailored according to the genetic basis of a disease does not have any dire side effect causing one or more other diseases. This can be avoided by analyzing the diseases jointly to reveal their shared genetic basis. Modern-day biobanks have rich information on multiple correlated traits and on millions of genetically related individuals (families and distantly related people). For testing genetic association with multiple traits, the most commonly used software packages, such as Multi Trait Mixed Modeling (MTMM) (Korte et al., 2012) and Genome-wide Efficient Mixed Model Association (GEMMA) (Zhou and Stephens, 2014), are based on a Multivariate Linear Mixed Model (MVLMM). MVLMM is similar to the MACE model discussed earlier and involves a large number of variance-covariance components. This makes parameter estimation and subsequent association inference incredibly difficult, especially when the number of traits and the number of individuals are high, as in a large scale cohort study like UK Biobank.

The third goal of this thesis is to develop a rapid SNP-based association test for detection of pleiotropic variants using multiple correlated traits in genetically related individuals. The test is inspired by the MVLMM framework but involves fewer variance-covariance parameters, alleviating the computational burden. It is based on two steps: in the first, the genetic dependency for each trait is modeled by a univariate LMM (Kang et al., 2010; Lippert et al., 2011), and in the second, the between-trait dependency is captured through a Seemingly Unrelated Regression (SUR) model (Zellner, 1962). We name the method Rapid Multiple Phenotype Association Analysis in Related individuals (RMultiPAR). The test has far better computational complexity than MTMM, and also than GEMMA in some cases. We also derive the theoretical connection of RMultiPAR with the existing methods and discuss their equivalence in particular scenarios. We study the performance of RMultiPAR in terms of power and type 1 error through extensive simulations. We have analyzed the monozygotic twins and full sibling pairs from the UK Biobank data to test SNP-based association with four anthropometric traits: standing height, weight, hip circumference and waist circumference. Code implementing RMultiPAR in R can be found here.

The thesis is organized as follows. Chapter 2 contains details of heritability estimation with a longitudinal trait in the twin study framework, along with results from our extensive simulation studies as well as the MCTFR data analysis. Chapter 3 contains the most exciting work, which develops an efficient SNP-based heritability estimation approach named PredLMM, unifying the ideas of genetic coalescence and Gaussian Predictive Process modeling. A detailed simulation study verifying the robustness of the proposed estimator of heritability is provided. We also estimate the heritability of multiple traits, including standing height, weight, BMI, diastolic and systolic blood pressure, and hip and waist circumference, using around 160,000 individuals from the UK Biobank cohort data. Chapter 4 contains the details of the test RMultiPAR, a rapid SNP-based association test of multiple correlated traits with genetically related individuals. An extensive simulation study judging the performance of the proposed test under varying scenarios, and a real data analysis with four traits (standing height, weight, hip and waist circumference) on MZ twins and full sibling pairs (around 45,000 individuals) from the UK Biobank cohort, are also provided. Supplementary materials for each chapter are provided in appendices at the end of this thesis.
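To make the complexity claim for PredLMM more concrete (per iteration cost falling from $O(N^3)$ to $O(r^2 N) + O(r^3)$), the following is a minimal NumPy sketch of the generic linear algebra that usually underlies such reductions. It is not the PredLMM implementation. It assumes a covariance of the illustrative form $V = U C U^T + D$, with $U$ of size $N \times r$ and $D$ diagonal, and applies the Woodbury identity so that only an $r \times r$ system is solved. The names `woodbury_solve`, `U`, `C` and `d` are hypothetical.

```python
import numpy as np

def woodbury_solve(U, C, d, y):
    """Solve (U C U^T + diag(d)) x = y via the Woodbury identity.

    U: (N, r) matrix, C: (r, r) symmetric positive definite matrix,
    d: (N,) positive diagonal, y: (N,) right-hand side.
    No N x N matrix is formed; the cost is O(r^2 N + r^3).
    """
    Dinv_y = y / d                          # D^{-1} y, O(N)
    Dinv_U = U / d[:, None]                 # D^{-1} U, O(r N)
    # r x r matrix C^{-1} + U^T D^{-1} U, O(r^2 N + r^3)
    cap = np.linalg.inv(C) + U.T @ Dinv_U
    return Dinv_y - Dinv_U @ np.linalg.solve(cap, U.T @ Dinv_y)

# Small check against the direct O(N^3) solve (illustrative sizes only).
rng = np.random.default_rng(0)
N, r = 500, 10
U = rng.standard_normal((N, r))
A = rng.standard_normal((r, r))
C = A @ A.T + r * np.eye(r)                 # symmetric positive definite
d = rng.uniform(0.5, 1.5, size=N)
y = rng.standard_normal(N)

x_fast = woodbury_solve(U, C, d, y)
x_direct = np.linalg.solve(U @ C @ U.T + np.diag(d), y)
max_abs_diff = np.max(np.abs(x_fast - x_direct))
```

The direct solve is kept only as a correctness check at small $N$; at biobank scale the point of the low-rank form is that the $N \times N$ matrix never has to be built.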

Chapter 2

Heritability Estimation and Genetic Association Testing in Longitudinal Twins Study

2.1 Introduction

A standard longitudinal study is based on the repeated evaluation of one or more measurable characteristics (phenotypes or traits) for a set of unrelated individuals. Its prime goal is to study the pattern and determinants of systematic changes in a phenotype of interest over time. A longitudinal study with twins (or families) involves analysis of complex correlated data: modeling within-individual correlation across time-points and modeling familial correlation among family members at each time point (Basson et al., 2016). Ignoring longitudinal trends when estimating the heritability of such traits in an age-diverse or rapidly changing group of subjects will lead to estimates that could be biased in unexpected ways; there may be complex temporal patterns which are heritable but cannot be seen by examining a single time point.

In a longitudinal study, sometimes modeling the complex correlation (or covariance) structure is of prime interest, as in heritability estimation (Ge et al., 2016), and sometimes it is secondary, where we are primarily interested in studying the association of a genetic variant with a phenotype over time. Even in the second scenario, accuracy in the estimation of the covariance matrix could greatly influence the inference of association.

In the context of a cross-sectional twin study, estimation of variance components within a linear mixed effect model (LMM) is often of interest. For example, the widely used ACE model (Rabe-Hesketh et al., 2008) is an LMM that uses random effects to capture correlation between family members. The ACE model decomposes the total phenotypic variance into additive genetic variance (A), shared environment (C) and non-shared environment (E), and imposes a certain structure on the twin covariance matrix to reflect the extent of genome sharing as well as sharing of the environment. Narrow sense heritability, in this context, is calculated as the ratio of additive genetic variance to the phenotypic variance. The ACE model can further be used to test the impact of a SNP on the phenotype, where the SNP is added as a fixed effect in the model. There are several R packages, such as nlme (Bates et al., 2014), OpenMx (Neale et al., 2016) and GMMAT (Chen and Conomos, 2015), which can be used to implement this model. However, for association analysis with longitudinal twin (family) data, most of the existing software does not allow flexible modeling of both modes of correlation, the familial and the longitudinal. The common practice is to simplify either or both of the two modes of correlation in order to perform the analysis using the existing packages in R or SAS. Shi et al. (2009) and Sung et al. (2014) used PROC MIXED in SAS to fit an LMM whose covariance matrix is a Kronecker product of a compound symmetry correlation structure capturing the familial dependence and an unstructured correlation capturing the longitudinal dependence. Similarly, Basson et al. (2016) discuss three other ways of simplifying the complex correlation.

In the context of estimating multidimensional heritability, Ge et al. (2016) proposed an intuitive extension of the covariance structure of the univariate ACE model to a multivariate setup. But such a modelling framework involves estimation of a large number of latent variance-covariance parameters, which makes a likelihood based approach very difficult to use. Also, unique MLEs of the variance-covariance components often do not exist (Roś et al., 2016), which can severely bias the estimation of the heritability, as we will see later. They proposed marginal estimation of heritability separately at each time point, with the multivariate heritability estimated as a weighted average of the univariate heritabilities over different time-points. For obtaining the corresponding standard errors they used a bootstrapping approach. This type of covariance modeling can also be used for testing genetic association, but again it would involve estimation of many latent variance-covariance parameters.

We develop a model which is motivated by the multivariate ACE model and bears a few additional novelties. We demonstrate that our approach is a longitudinal extension of the traditional Falconer's approach (Falconer, 1960). In our model, we allow different variances for the two types of twins (MZ and DZ) to accommodate the differences in shared environments. The Kronecker product model of Shi et al. (2009) (referred to as the KP model from now on) can be shown to be a special case of our model. The parameter estimation in our model can be performed by maximum likelihood estimation. However, this maximization can become computationally intensive and prone to errors as the number of time-points increases. We propose a theoretically sound two-stage method of estimation of the parameters and also a method of moments approach, both of which are rapid and hence particularly suitable on a genome wide scale. Apart from extensive simulation studies under different realistic scenarios, we have also analyzed the alcoholism trait from the MCTFR data, available at four different ages (17, 20, 24 and 29) for MZ and DZ twin pairs.
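The Kronecker product structure attributed above to Shi et al. (2009) can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: one twin pair observed at three time-points, a hypothetical within-pair correlation `rho`, an arbitrary made-up longitudinal correlation matrix `R_time`, and observations stacked by time-point with the twin index varying fastest (so the longitudinal factor is the outer one).

```python
import numpy as np

def compound_symmetry(n, rho):
    """n x n correlation matrix with 1 on the diagonal and rho elsewhere."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

# Familial (within-pair) correlation: compound symmetry for a twin pair.
rho = 0.5                                  # hypothetical value
R_family = compound_symmetry(2, rho)

# Longitudinal correlation: unstructured, values chosen for illustration.
R_time = np.array([[1.0, 0.6, 0.4],
                   [0.6, 1.0, 0.5],
                   [0.4, 0.5, 1.0]])

# Correlation of the stacked vector (Y_11, Y_21, Y_12, Y_22, Y_13, Y_23),
# where Y_jk is twin j at time k: entry ((k, j), (k', j')) equals
# R_time[k, k'] * R_family[j, j'].
Sigma = np.kron(R_time, R_family)

# Example: twin 2 at time 1 versus twin 1 at time 2.
cross_corr = Sigma[1, 2]                   # R_time[0, 1] * R_family[1, 0]
```

The product form forces every cross-time, cross-twin correlation to factor into a time part and a family part, which is the simplification that a more flexible model would relax.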

2.2 Materials and Methods

2.2.1 Cross-sectional Family Study

Consider a family i with ni individuals (for twins ni = 2). Let there be a con- tinuous phenotype Yij which corresponds to j-th individual of i th family. Let the corresponding SNP genotype be Gij which takes values 0, 1, 2 based on the number of minor alleles, and the vector of other covariates be Sij. Throughout the paper Ir denotes the identity matrix of dimension r × r, Jr denotes a matrix of dimension r × r with all the elements being 1 and 1r denotes r × 1 row vector of all 1’s. The univariate ACE model (Rabe-Hesketh et al., 2008) captures the total variance of the phenotype through three random effects: additive genetic

(Ai), common environment (Ci), and unique environment (Ei). The relationship between the phenotype Yi. and a SNP can be written as,

1 T Yi. = β0 2 + αGi. + Si. η + Ai + Ci + Ei (2.1)

11 Chapter 2 – Heritability Estimation and Genetic Association Testing in Longitudinal Twins Study 1 T where E(Yi.) = β0 2 + αGi. + Si. η and cov(Yi.) = cov(Ai) + cov(Ci) + cov(Ei). The distribution of ACE random effects are modeled as,

2 2 2 I Ai ∼ N(0, σA2Φi),Ci ∼ N(0, σC Ji),Ei ∼ N(0, σE ) (2.2)

where $2\Phi_i$ is the genetic relationship matrix (GRM) or kinship correlation matrix of the individuals in a family. For heritability analysis with families, each family $i$ is taken as an independent unit and $\Phi_i$ is either known or is estimated from genome-wide SNP data. For example, for a pair of twins of type $z$ ($z = 1$ for MZ, $z = 2$ for DZ), the GRM is defined as

$$2\Phi_z = \begin{pmatrix} 1 & 1/z \\ 1/z & 1 \end{pmatrix}.$$

We are often interested in the narrow sense heritability and the proportion of trait variance due to shared environmental effects, which are respectively defined as

$$h^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_C^2 + \sigma_E^2} \quad \text{and} \quad c^2 = \frac{\sigma_C^2}{\sigma_A^2 + \sigma_C^2 + \sigma_E^2}.$$

MLEs of the parameters can be obtained in R using packages like nlme (Bates et al., 2014) and OpenMx (Neale et al., 2016). Denote the MLEs as $\hat{\beta}_0, \hat{\alpha}, \hat{\sigma}_A^2, \hat{\sigma}_C^2, \hat{\sigma}_E^2$ and the variance estimate of $\hat{\alpha}$ as $\widehat{var}(\hat{\alpha})$. The association test, i.e., testing the hypothesis $H_0: \alpha = 0$, is done via the Wald test statistic $\hat{\alpha} / \sqrt{\widehat{var}(\hat{\alpha})}$, which asymptotically follows $N(0, 1)$. Also, the MLEs of $h^2$ and $c^2$ are obtained by substituting $\sigma_A^2, \sigma_C^2, \sigma_E^2$ with their respective MLEs.

From the variance-covariance assumptions in equation (2.2), one can observe that the correlation between two individuals in family $i$ is

$$corr(Y_{ij}, Y_{ij'}) = r_{jj'} = \frac{2\sigma_A^2 \Phi_{jj'} + \sigma_C^2}{\sigma_A^2 + \sigma_C^2 + \sigma_E^2}.$$

For example, for a twin-pair of type $z$,

$$corr(Y^z_{i1}, Y^z_{i2}) = r_z = \frac{\sigma_A^2 / z + \sigma_C^2}{\sigma_A^2 + \sigma_C^2 + \sigma_E^2}.$$

It can thus be noted that the heritability can be written as $h^2 = 2(r_1 - r_2)$ and the proportion of trait variance due to shared environmental effects can be written as $c^2 = 2r_2 - r_1$. Scottish geneticist Douglas Falconer (Falconer, 1960) proposed estimating $h^2$ and $c^2$ by replacing the $r_z$'s with their sample estimates, i.e., replacing $r_1$ by the sample correlation between the twins of an MZ pair and $r_2$ by the sample correlation between the twins of a DZ pair. As demonstrated in our paper (Arbet et al., 2017), the heritability estimates can be severely biased when equal variances are wrongly assumed for the MZ and DZ populations. Falconer's approach allows the flexibility of assuming different variances for the MZ and DZ populations.
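As a small illustration of Falconer's estimator described above, the following Python sketch computes the sample twin correlations and plugs them into $h^2 = 2(r_1 - r_2)$ and $c^2 = 2r_2 - r_1$. The function names and the toy data are hypothetical and are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def falconer(r_mz, r_dz):
    """Falconer's estimates: h^2 = 2(r1 - r2) and c^2 = 2*r2 - r1."""
    return 2.0 * (r_mz - r_dz), 2.0 * r_dz - r_mz

# Hypothetical toy data: (twin 1, twin 2) phenotype values for each pair.
mz_pairs = [(1.0, 1.2), (2.0, 1.9), (3.1, 3.0), (4.2, 4.0), (5.0, 5.3)]
dz_pairs = [(1.0, 2.0), (2.0, 1.5), (3.0, 3.5), (4.0, 3.0), (5.0, 4.5)]

r1 = pearson([a for a, _ in mz_pairs], [b for _, b in mz_pairs])
r2 = pearson([a for a, _ in dz_pairs], [b for _, b in dz_pairs])
h2_hat, c2_hat = falconer(r1, r2)
```

Because the MZ and DZ correlations are computed separately, nothing in this calculation forces the two twin types to share the same phenotypic variance.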

2.2.2 Existing Methods for longitudinal Twins Study

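The methods to be discussed can be generalized to any type of family, but for notational simplicity we discuss them only in the context of twins. Let $Y^z_{ijk}$ denote the phenotype at time $k$ of twin $j$ of the $i$-th twin-pair of type $z$. Let there be $K$ time-points for each individual. Define $\mathbf{Y}^z_{i\cdot k} = (Y^z_{i1k}, Y^z_{i2k})^T$, $\mathbf{Y}^z_{i\cdot\cdot} = (\mathbf{Y}^{zT}_{i\cdot 1}, \mathbf{Y}^{zT}_{i\cdot 2}, \ldots, \mathbf{Y}^{zT}_{i\cdot K})^T$, $\mathbf{Y}^z_{\cdot\cdot k} = (\mathbf{Y}^{zT}_{1\cdot k}, \mathbf{Y}^{zT}_{2\cdot k}, \ldots, \mathbf{Y}^{zT}_{m_z \cdot k})^T$, where $m_z$ is the number of twin-pairs of type $z$ (so that $m = m_1 + m_2$), $\mathbf{Y}_{\cdot\cdot k} = (\mathbf{Y}^{1T}_{\cdot\cdot k}, \mathbf{Y}^{2T}_{\cdot\cdot k})^T$ and $\mathbf{G} = (G_{11}, G_{12}, \ldots, G_{m2})^T$. Let $S_{ijk}$ be a vector of $p$ covariates for the $j$-th individual in the $i$-th twin-pair at the $k$-th time-point, so we assume that the covariates vary over time. Define $\mathbf{S}_{i\cdot k} = (S_{i1k}, S_{i2k})^T$ and $\mathbf{S}_{i\cdot\cdot} = (\mathbf{S}_{i\cdot 1}, \mathbf{S}_{i\cdot 2}, \ldots, \mathbf{S}_{i\cdot K})^T$. Note that we are interested in testing the effect of the SNP ($\mathbf{G}$) on the phenotype ($\mathbf{Y}_{\cdot\cdot k}$) at each time-point $k$.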

2.2.2.1 Multivariate ACE (MACE)

Model: In the multivariate context when there are no SNP and covariates, assuming the phenotypes to be centered, Ge et al. (2016) extended equation (2.1) as

$$\mathbf{Y}^z_{i\cdot\cdot} = \mathbf{A}^z_{i\cdot\cdot} + \mathbf{C}_{i\cdot\cdot} + \mathbf{E}_{i\cdot\cdot} \tag{2.3}$$

with $cov(\mathbf{Y}^z_{i\cdot\cdot}) = cov(\mathbf{A}^z_{i\cdot\cdot}) + cov(\mathbf{C}_{i\cdot\cdot}) + cov(\mathbf{E}_{i\cdot\cdot})$. Here $\mathbf{A}^z_{i\cdot\cdot} = (A^z_{i11}, A^z_{i21}, \ldots, A^z_{i2K})^T$, $\mathbf{C}_{i\cdot\cdot} = (C_{i11}, C_{i21}, \ldots, C_{i2K})^T$ and $\mathbf{E}_{i\cdot\cdot} = (E_{i11}, E_{i21}, \ldots, E_{i2K})^T$. $A^z_{ijk}$, $C_{ijk}$ and $E_{ijk}$ are the additive genetic, shared environment and unique environment components respectively at time-point $k$ for the $j$-th individual in the $i$-th twin-pair of type $z$, with the following distributional assumption,

$$\mathbf{A}^z_{i\cdot\cdot} \sim N(0, \Sigma_A \otimes 2\Phi_z), \quad \mathbf{C}_{i\cdot\cdot} \sim N(0, \Sigma_C \otimes J_2), \quad \mathbf{E}_{i\cdot\cdot} \sim N(0, \Sigma_E \otimes I_2) \tag{2.4}$$

$\Sigma_A$, $\Sigma_C$, $\Sigma_E$ are unstructured covariance matrices. These covariance assumptions indicate that the three components can be correlated across time-points. Let the diagonals of $\Sigma_A$, $\Sigma_C$ and $\Sigma_E$ respectively be $\sigma^2_{Ak}, \sigma^2_{Ck}, \sigma^2_{Ek}$ for $k = 1, \ldots, K$. Note that, independent of $z$, $var(Y^z_{ijk}) = \sigma^2_k = \sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek}$. So this model assumes the same variance at a time-point $k$ for both an MZ and a DZ individual. Ge et al. (2016) were interested only in the multidimensional heritability, defined as

$$h^2_{mult} = \frac{trace(\Sigma_A)}{trace(\Sigma_A) + trace(\Sigma_C) + trace(\Sigma_E)} = \frac{\sum_{k=1}^K \sigma^2_{Ak}}{\sum_{k=1}^K \sigma^2_{Ak} + \sum_{k=1}^K \sigma^2_{Ck} + \sum_{k=1}^K \sigma^2_{Ek}}.$$

$h^2_{mult}$ can also be written as a weighted sum of the univariate heritabilities at each $k$, i.e.,

$$h^2_{mult} = \sum_{k=1}^K \frac{\sigma^2_k}{\sum_{k'=1}^K \sigma^2_{k'}} h^2_k, \qquad h^2_k = \frac{\sigma^2_{Ak}}{\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek}}.$$

This definition of multidimensional heritability was earlier used by Klingenberg and Monteiro (2005).
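The equivalence of the trace-ratio form and the weighted-average form of $h^2_{mult}$ can be checked numerically. The sketch below is only an illustration; the variance components are hypothetical values chosen for the example.

```python
def h2_mult(var_a, var_c, var_e):
    """Trace-ratio form: sum of additive variances over sum of total variances."""
    total = sum(var_a) + sum(var_c) + sum(var_e)
    return sum(var_a) / total

def h2_mult_weighted(var_a, var_c, var_e):
    """Same quantity written as a variance-weighted average of the h_k^2."""
    totals = [a + c + e for a, c, e in zip(var_a, var_c, var_e)]
    h2_k = [a / t for a, t in zip(var_a, totals)]
    weights = [t / sum(totals) for t in totals]
    return sum(w * h for w, h in zip(weights, h2_k))

# Hypothetical variance components at K = 3 time-points.
var_a = [48.0, 30.0, 20.0]
var_c = [0.0, 10.0, 5.0]
var_e = [12.0, 20.0, 25.0]

ratio_form = h2_mult(var_a, var_c, var_e)
weighted_form = h2_mult_weighted(var_a, var_c, var_e)
```

Both forms use only the diagonal elements of $\Sigma_A$, $\Sigma_C$ and $\Sigma_E$, so neither depends on how the components are correlated across time.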

Estimation: The formula for $h^2_{mult}$ does not involve any longitudinal component (the off-diagonal elements of $\Sigma_A$, $\Sigma_C$, $\Sigma_E$), which is why Ge et al. (2016) do not estimate those. Ignoring the longitudinal dependence and considering each time-point separately, they fit $K$ independent univariate ACE models (without SNP or covariates). Thus they obtain the estimates of $\sigma^2_{Ak}, \sigma^2_{Ck}, \sigma^2_{Ek}$ and the estimate of the univariate heritability $h^2_k$ for each $k$. Once the estimates of the $h^2_k$'s are obtained, $h^2_{mult}$ is estimated using the weighted average formula, and bootstrapping is used to estimate the corresponding standard errors. But the estimation of the longitudinal components would be necessary to find a closed form expression of the standard error of their estimate of the multivariate heritability.

There are other definitions of multidimensional heritability which originate from the Breeder's equation (Klingenberg and Monteiro, 2005) and depend on the off-diagonals of $\Sigma_A$, $\Sigma_C$, $\Sigma_E$. For example, $\frac{|\Sigma_A|}{|\Sigma_A + \Sigma_C + \Sigma_E|}$ would be a valid definition of multidimensional heritability which would require estimation of the full matrices $\Sigma_A$, $\Sigma_C$, $\Sigma_E$. Moreover, when our goal is to do association testing, ignoring the longitudinal dependence altogether will result in loss of power (Burton et al., 2005; Sung et al., 2014). So we can consider a likelihood based approach. Let us denote all the variance-covariance parameters as $\boldsymbol{\theta}$ and the log likelihood as $l(\boldsymbol{\pi}, \boldsymbol{\theta})$,

$$l(\boldsymbol{\pi}, \boldsymbol{\theta}) = -\frac{1}{2} \log(|\boldsymbol{\Sigma}(\boldsymbol{\theta})|) - \frac{1}{2} (\mathbf{Y} - \mathbf{X}\boldsymbol{\pi})^T \boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1} (\mathbf{Y} - \mathbf{X}\boldsymbol{\pi}) + \text{constant} \tag{2.5}$$

where $cov(\mathbf{Y}) = \boldsymbol{\Sigma}(\boldsymbol{\theta})$, $\mathbf{X}$ is the design matrix holding the intercept, SNP and covariate columns, and $\boldsymbol{\pi}$ is the corresponding vector of fixed effects. If $\boldsymbol{\theta}$ is known, by Aitken's theorem (Goldberger and Goldberger, 1991), the minimum variance unbiased estimator of $\boldsymbol{\pi}$ is the GLS estimator, which is also the MLE. The estimator and its covariance are

$$\hat{\boldsymbol{\pi}}_{GLS} = (\mathbf{X}^T \boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1} \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1} \mathbf{Y} \tag{2.6}$$

$$cov(\hat{\boldsymbol{\pi}}_{GLS}) = (\mathbf{X}^T \boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1} \mathbf{X})^{-1} \tag{2.7}$$

For association testing, i.e., for testing $H_0: M\boldsymbol{\pi} = 0$, a Wald test could be performed using the test statistic $(M\hat{\boldsymbol{\pi}}_{GLS})^T \{M \, cov(\hat{\boldsymbol{\pi}}_{GLS}) M^T\}^{-1} (M\hat{\boldsymbol{\pi}}_{GLS})$, which would follow a $\chi^2_K$ distribution. But in real data, when $\boldsymbol{\theta}$ is unknown, we can maximize $l(\boldsymbol{\pi}, \boldsymbol{\theta})$ w.r.t. $\boldsymbol{\pi}$ and $\boldsymbol{\theta}$ jointly. By Magnus (1978), the MLE of $\boldsymbol{\pi}$ should be of the form $\hat{\boldsymbol{\pi}}_{MLE} = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{MLE})^{-1} \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{MLE})^{-1} \mathbf{Y}$ with an estimated asymptotic covariance of $cov(\hat{\boldsymbol{\pi}}_{MLE}) = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{MLE})^{-1} \mathbf{X})^{-1}$. We can use the R package OpenMx for obtaining the MLEs. However, in section (2.4.1), we show that for a moderate sample size, $\boldsymbol{\theta}$, or rather $\Sigma_A$, $\Sigma_C$ and $\Sigma_E$, are estimated with heavy bias by the MLE approach. As a result, the estimates of the heritabilities, the $h^2_k$'s and $h^2_{mult}$, are also heavily biased. So for a moderate sample size, simplifying the covariance structure sensibly may help to avoid these estimation issues.
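The GLS estimator in (2.6), its covariance in (2.7) and the Wald statistic can be sketched in Python with NumPy as follows. This is a minimal illustration that assumes $\boldsymbol{\Sigma}(\boldsymbol{\theta})$ is known. The toy design (a single time-point, 200 twin pairs, within-pair correlation 0.5) and all names are hypothetical and do not reproduce the analysis of this chapter.

```python
import numpy as np

def gls(X, Y, Sigma):
    """GLS estimate of pi and its covariance for a known cov(Y) = Sigma."""
    Sinv_X = np.linalg.solve(Sigma, X)       # Sigma^{-1} X
    cov_pi = np.linalg.inv(X.T @ Sinv_X)     # (X^T Sigma^{-1} X)^{-1}
    pi_hat = cov_pi @ (Sinv_X.T @ Y)         # uses the symmetry of Sigma
    return pi_hat, cov_pi

def wald_statistic(pi_hat, cov_pi, M):
    """Wald statistic for H0: M pi = 0."""
    d = M @ pi_hat
    return float(d @ np.linalg.solve(M @ cov_pi @ M.T, d))

# Hypothetical toy design: intercept plus one SNP, independent twin pairs.
rng = np.random.default_rng(0)
m = 200                                            # number of twin pairs (assumed)
V = np.array([[1.0, 0.5], [0.5, 1.0]])             # assumed within-pair covariance
Sigma = np.kron(np.eye(m), V)                      # block diagonal over pairs
G = rng.integers(0, 3, size=2 * m).astype(float)   # genotypes coded 0/1/2
X = np.column_stack([np.ones(2 * m), G])
Y = X @ np.array([1.0, 0.3]) + np.linalg.cholesky(Sigma) @ rng.standard_normal(2 * m)

pi_hat, cov_pi = gls(X, Y, Sigma)
M = np.array([[0.0, 1.0]])                         # picks out the SNP effect
W = wald_statistic(pi_hat, cov_pi, M)
```

In this toy case $M$ has a single row, so the statistic would be compared with a chi-square distribution with one degree of freedom. With $K$ time-points, $M$ would have one row per time-specific SNP effect.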

2.2.2.2 Kronecker Product model (KP model)

Model: Shi et al. (2009) and Sung et al. (2014), in the context of only association testing, used a model similar to equation (2.5) without A, C, E random effect terms,

$$\mathbf{Y}^z_{i\cdot\cdot} = \mathbf{X}_{i\cdot\cdot}\boldsymbol{\pi} + \boldsymbol{\epsilon}_{i\cdot\cdot}, \quad E(\mathbf{Y}^z_{i\cdot\cdot}) = \mathbf{X}_{i\cdot\cdot}\boldsymbol{\pi}, \quad cov(\mathbf{Y}^z_{i\cdot\cdot}) = cov(\boldsymbol{\epsilon}_{i\cdot\cdot}) = \Sigma \otimes V \tag{2.8}$$

with $\Sigma$ being a $K$ dimensional unstructured covariance matrix capturing the longitudinal dependence and $V$ being the correlation matrix capturing the familial dependence. In this paper, we try to establish a correspondence between the Multivariate ACE model and the KP model.

Estimation: They used PROC MIXED in SAS to get the MLEs of the parameters. For testing $H_0: M\boldsymbol{\pi} = 0$, they performed a likelihood ratio test.

2.2.3 Proposed model

2.2.3.1 Reduced Multivariate ACE (RACE)

We start with a special case of the MACE model specified in equations (2.3) and (2.4). We keep all the assumptions of MACE the same, except that we assume that the covariance matrices $\Sigma_A$, $\Sigma_C$ and $\Sigma_E$ have the same underlying correlation structure $R$, i.e.,

$$\Sigma_A = S_A R S_A^T, \quad \Sigma_C = S_C R S_C^T, \quad \Sigma_E = S_E R S_E^T \tag{2.9}$$

where $R = ((\rho_{kk'}))_{K \times K}$ is a correlation matrix and $S_A, S_C, S_E$ are $K$ dimensional diagonal matrices with $k$-th diagonal element being $\sigma_{Ak}, \sigma_{Ck}, \sigma_{Ek}$ respectively. We name this simplified model the Reduced Multivariate ACE model, or RACE. The point of discussing this model is that it reduces the number of variance-covariance parameters substantially and thereby eases the whole ML estimation process. Standard R optimization functions like optim and nlminb can be used for estimating the parameters.
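To make the reduction concrete, the sketch below counts the variance-covariance parameters of MACE (three unstructured $K \times K$ covariance matrices) and RACE (one shared correlation matrix plus $3K$ standard deviations), and builds a covariance matrix of the form $S R S^T$. The counting formulas follow from the definitions above; the function names are hypothetical.

```python
def n_params_mace(K):
    """Three unstructured K x K covariance matrices: 3 * K(K+1)/2 parameters."""
    return 3 * K * (K + 1) // 2

def n_params_race(K):
    """One shared correlation matrix, K(K-1)/2, plus 3K standard deviations."""
    return K * (K - 1) // 2 + 3 * K

def scaled_cov(sd, R):
    """Sigma = S R S^T for a diagonal S holding the standard deviations sd."""
    K = len(sd)
    return [[sd[k] * R[k][l] * sd[l] for l in range(K)] for k in range(K)]

counts = {K: (n_params_mace(K), n_params_race(K)) for K in (2, 4, 10)}
Sigma_A = scaled_cov([2.0, 3.0], [[1.0, 0.5], [0.5, 1.0]])
```

The MACE count grows like $1.5K^2$ while the RACE count grows like $0.5K^2$, so the saving becomes larger as the number of time-points increases.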

2.2.3.2 Reduced Multivariate Falconer’s model (RMFM)

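The correlations between any pair of phenotypes under RACE are

$$\begin{aligned}
cor(Y^z_{ijk}, Y^z_{ijk'}) &= \rho_{kk'} \frac{\sigma_{Ak}\sigma_{Ak'} + \sigma_{Ck}\sigma_{Ck'} + \sigma_{Ek}\sigma_{Ek'}}{\sqrt{(\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek})(\sigma^2_{Ak'} + \sigma^2_{Ck'} + \sigma^2_{Ek'})}} \\
cor(Y^z_{ijk}, Y^z_{ij'k}) &= r_{zk} = \frac{\sigma^2_{Ak}/z + \sigma^2_{Ck}}{\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek}} \\
cor(Y^z_{ijk}, Y^z_{ij'k'}) &= \rho_{kk'} \frac{\sigma_{Ak}\sigma_{Ak'}/z + \sigma_{Ck}\sigma_{Ck'}}{\sqrt{(\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek})(\sigma^2_{Ak'} + \sigma^2_{Ck'} + \sigma^2_{Ek'})}}
\end{aligned} \tag{2.10}$$

In the new model we change two of the above correlation assumptions as

$$\begin{aligned}
cor(Y^z_{ijk}, Y^z_{ijk'}) &= \rho_{kk'} \\
cor(Y^z_{ijk}, Y^z_{ij'k}) &= r_{zk} \\
cor(Y^z_{ijk}, Y^z_{ij'k'}) &= \rho_{kk'} \frac{\sqrt{\sigma^2_{Ak}/z + \sigma^2_{Ck}} \sqrt{\sigma^2_{Ak'}/z + \sigma^2_{Ck'}}}{\sqrt{(\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek})(\sigma^2_{Ak'} + \sigma^2_{Ck'} + \sigma^2_{Ek'})}} = \rho_{kk'} \sqrt{r_{zk} r_{zk'}}
\end{aligned} \tag{2.11}$$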

These changes are not arbitrary. Note the following two inequalities due to Cauchy-Schwarz (Weisstein, 2015),

$$\sqrt{(\sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek})(\sigma^2_{Ak'} + \sigma^2_{Ck'} + \sigma^2_{Ek'})} \geq \sigma_{Ak}\sigma_{Ak'} + \sigma_{Ck}\sigma_{Ck'} + \sigma_{Ek}\sigma_{Ek'}$$

$$\sqrt{(\sigma^2_{Ak}/z + \sigma^2_{Ck})(\sigma^2_{Ak'}/z + \sigma^2_{Ck'})} \geq \sigma_{Ak}\sigma_{Ak'}/z + \sigma_{Ck}\sigma_{Ck'}$$

Equality occurs if and only if $\frac{\sigma^2_{Ak}}{\sigma^2_{Ak'}} = \frac{\sigma^2_{Ck}}{\sigma^2_{Ck'}} = \frac{\sigma^2_{Ek}}{\sigma^2_{Ek'}} = \text{constant}$, in which case the correlations in (2.10) and (2.11) coincide. In the set of equations (2.10), $\rho_{kk'}$ is interpreted as the common longitudinal correlation of the ACE components, i.e., $cor(A^z_{ijk}, A^z_{ijk'}) = cor(C_{ijk}, C_{ijk'}) = cor(E_{ijk}, E_{ijk'}) = \rho_{kk'}$. But now, in (2.11), $\rho_{kk'}$ is interpreted as just the longitudinal correlation of an individual. Unlike MACE, we allow a separate variance for an MZ and a DZ individual at each $k$ as $var(Y^z_{ijk}) = \sigma^2_{zk}$. Then the expectation and the covariance matrix of $\mathbf{Y}^z_{i\cdot\cdot}$ under RMFM can be written in a compact form as

$$E(\mathbf{Y}^z_{i\cdot\cdot}) = \mathbf{X}_{i\cdot\cdot}\boldsymbol{\pi}, \quad cov(\mathbf{Y}^z_{i\cdot\cdot}) = (\Sigma_z \otimes J_2) \circ N_z \tag{2.12}$$

where $\circ$ denotes the Hadamard (element-wise) product, $\Sigma_z = S_z R S_z^T$ with $S_z$ being a $K$-dimensional diagonal matrix having $k$-th diagonal element $\sigma_{zk}$, and $N_z = ((n_{rs}))_{2K \times 2K}$ with $n_{(2k-1)(2k'-1)} = n_{(2k)(2k')} = 1$ and $n_{(2k-1)(2k')} = n_{(2k)(2k'-1)} = \sqrt{r_{zk} r_{zk'}}$. Note that when there is only one time-point, i.e., $K = 1$, and if we assume that an MZ and a DZ individual have the same variance, the model becomes equivalent to the univariate ACE model. The estimation procedure will be discussed in section (2.3).
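The relationship between the cross-twin, cross-time correlations in (2.10) and (2.11) can be checked numerically. The sketch below takes the variances $(\sigma^2_{Ak}, \sigma^2_{Ck}, \sigma^2_{Ek})$ at two time-points as input. The function names are hypothetical, and the second set of variances is an arbitrary example chosen so that the ratios are not constant.

```python
from math import sqrt

def r_zk(z, s2a, s2c, s2e):
    """Within-time twin correlation r_zk from equation (2.10)."""
    return (s2a / z + s2c) / (s2a + s2c + s2e)

def race_cross_cor(z, rho, comp_k, comp_l):
    """Cross-twin, cross-time correlation under RACE (third line of 2.10)."""
    (a1, c1, e1), (a2, c2, e2) = comp_k, comp_l
    num = sqrt(a1 * a2) / z + sqrt(c1 * c2)
    return rho * num / sqrt((a1 + c1 + e1) * (a2 + c2 + e2))

def rmfm_cross_cor(z, rho, comp_k, comp_l):
    """Cross-twin, cross-time correlation under RMFM (third line of 2.11)."""
    return rho * sqrt(r_zk(z, *comp_k) * r_zk(z, *comp_l))

same = (48.0, 0.0, 12.0)        # identical variances at both time-points
other = (10.0, 30.0, 20.0)      # arbitrary variances with non-constant ratios

equal_case = [(race_cross_cor(z, 0.5, same, same),
               rmfm_cross_cor(z, 0.5, same, same)) for z in (1, 2)]
unequal_case = [(race_cross_cor(z, 0.5, same, other),
                 rmfm_cross_cor(z, 0.5, same, other)) for z in (1, 2)]
```

With identical variances at the two time-points, both models give 0.4 for MZ pairs and 0.2 for DZ pairs. In the second case the RACE value is smaller than the RMFM value, as the Cauchy-Schwarz inequality implies.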

KP model as a special case: It can be observed that when the $r_{zk}$'s are constant over the time-points, i.e., $r_{zk} = r_z$ for $k = 1, 2, \ldots, K$, the correlation assumptions in equation (2.11) get simpler,

$$cor(Y^z_{ijk}, Y^z_{ijk'}) = \rho_{kk'}, \quad cor(Y^z_{ijk}, Y^z_{ij'k}) = r_z, \quad cor(Y^z_{ijk}, Y^z_{ij'k'}) = r_z \rho_{kk'} \tag{2.13}$$

Note that $N_z$, in this case, has the simpler form $N_z = J_K \otimes V_z$ where $V_z = \begin{pmatrix} 1 & r_z \\ r_z & 1 \end{pmatrix}$. Using the properties of the Kronecker product and the Hadamard product (Liu and Trenkler, 2008), the covariance matrix of $\mathbf{Y}^z_{i\cdot\cdot}$ further simplifies to

$$cov(\mathbf{Y}^z_{i\cdot\cdot}) = (\Sigma_z \otimes J_2) \circ (J_K \otimes V_z) = (\Sigma_z \circ J_K) \otimes (J_2 \circ V_z) = \Sigma_z \otimes V_z$$

Thus this special case of RMFM is similar to the KP model discussed in equation (2.8). But it is more general in the sense that it uses separate parameters (the $r_z$'s) to model the familial dependence within MZ and DZ twin-pairs and allows separate variances ($\sigma^2_z$) for the two types of twin-pairs.
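The simplification $(\Sigma_z \otimes J_2) \circ (J_K \otimes V_z) = \Sigma_z \otimes V_z$ can be verified on a small example. The sketch below uses plain Python lists, with hypothetical values for $\Sigma_z$ and $r_z$ and $K = 2$.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Hypothetical inputs with K = 2 time-points.
Sigma_z = [[4.0, 1.0], [1.0, 9.0]]
V_z = [[1.0, 0.8], [0.8, 1.0]]
J = [[1.0, 1.0], [1.0, 1.0]]       # J_2, which also serves as J_K since K = 2

lhs = hadamard(kron(Sigma_z, J), kron(J, V_z))
rhs = kron(Sigma_z, V_z)
```

Here `lhs` and `rhs` agree entry by entry, which is the mixed-product property used in the derivation above.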

2.2.3.3 Multivariate Falconer’s model (MFM)

RMFM can further be generalized by allowing $cor(Y^z_{ijk}, Y^z_{ij'k'})$ in equation (2.11) to be unstructured instead of the earlier special structure,

$$\begin{aligned}
cor(Y^z_{ijk}, Y^z_{ijk'}) &= \rho_{zkk'}, \quad cor(Y^z_{ijk}, Y^z_{ij'k}) = r_{zk} \\
cor(Y^z_{ijk}, Y^z_{ij'k'}) &= cor(Y^z_{ij'k}, Y^z_{ijk'}) = r_{zkk'}
\end{aligned} \tag{2.14}$$

Basically, we are allowing an MZ pair and a DZ pair to have two entirely different covariance structures, which is the most general assumption one can make. Clearly, it will have a lot of variance-covariance parameters to be estimated. That is why we do not use a likelihood based approach for the estimation. We call this model the Multivariate Falconer's Model (MFM) because it is a more general version of the RMFM discussed earlier. Notice that in the Multivariate ACE model, if the phenotypes at each $k$ are scaled to have variance 1, i.e., $\sigma^2_k = \sigma^2_{Ak} + \sigma^2_{Ck} + \sigma^2_{Ek} = 1$, then

$$cor(Y^z_{ijk}, Y^z_{ij'k'}) = r_{zkk'} = \sigma_{Akk'}/z + \sigma_{Ckk'},$$

where $\sigma_{Akk'}$ and $\sigma_{Ckk'}$ denote the $(k, k')$-th elements of $\Sigma_A$ and $\Sigma_C$. We observe that $2(r_{1kk'} - r_{2kk'}) = \sigma_{Akk'}$ and $(2r_{2kk'} - r_{1kk'}) = \sigma_{Ckk'}$. Thus, there exists a one to one correspondence between the MACE and MFM frameworks. This is very similar to what Falconer noticed in the univariate context. The $r_{zkk'}$'s can be estimated by their sample analogues, which we denote by $\hat{r}_{zkk'}$.
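The two identities above follow by writing out the scaled MACE correlation for $z = 1$ (MZ) and $z = 2$ (DZ) and solving the resulting pair of linear equations:

```latex
\begin{align*}
r_{1kk'} &= \sigma_{Akk'} + \sigma_{Ckk'}, \qquad
r_{2kk'} = \tfrac{1}{2}\sigma_{Akk'} + \sigma_{Ckk'} \\
2(r_{1kk'} - r_{2kk'}) &= 2\left(\sigma_{Akk'} - \tfrac{1}{2}\sigma_{Akk'}\right) = \sigma_{Akk'} \\
2r_{2kk'} - r_{1kk'} &= (\sigma_{Akk'} + 2\sigma_{Ckk'}) - (\sigma_{Akk'} + \sigma_{Ckk'}) = \sigma_{Ckk'}
\end{align*}
```

Setting $k = k'$ recovers Falconer's univariate formulas for $h^2_k$ and $c^2_k$ on the unit-variance scale.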

2.3 Estimation in RMFM and MFM

2.3.1 RMFM

Let $\boldsymbol{\theta}_{1k} = (\sigma^2_{1k}, \sigma^2_{2k}, r_{1k}, r_{2k})^T$, $\boldsymbol{\theta}_1 = (\boldsymbol{\theta}_{11}^T, \ldots, \boldsymbol{\theta}_{1K}^T)^T$ and $\boldsymbol{\theta}_2 = (\rho_{12}, \rho_{13}, \ldots, \rho_{K-1,K})^T$. The vector of variance-covariance parameters is $\boldsymbol{\theta} = (\boldsymbol{\theta}_1^T, \boldsymbol{\theta}_2^T)^T$. From equation (2.12), combining all the twin pairs, $\boldsymbol{\Sigma}(\boldsymbol{\theta})$ takes the form

$$cov(\mathbf{Y}) = \boldsymbol{\Sigma}(\boldsymbol{\theta}) = \begin{pmatrix} I_{m_1} \otimes ((\Sigma_1 \otimes J_2) \circ N_1) & 0 \\ 0 & I_{m - m_1} \otimes ((\Sigma_2 \otimes J_2) \circ N_2) \end{pmatrix}$$

The estimation procedure is similar to that for MACE discussed in section (2.2.2.1). When $\boldsymbol{\theta}$ is known, the estimation of $\boldsymbol{\pi}$ and its covariance can be done using the formulas in (2.6) and (2.7). When $\boldsymbol{\theta}$ is unknown, we can maximize the likelihood $l(\boldsymbol{\pi}, \boldsymbol{\theta})$ w.r.t. both $\boldsymbol{\pi}$ and $\boldsymbol{\theta}$. It can be done in R using functions like optim and nlminb. But it gets computationally very demanding for even a moderately large $m$ and $K$ (refer to table 2.1). Thus it is not practical on a GWAS scale, where there are millions of SNPs. That is why we look for a computationally more feasible alternative. From section 27.5 of Goldberger and Goldberger (1991), a Feasible Generalized Least Squares (FGLS) estimator of $\boldsymbol{\pi}$ can be obtained by replacing $\boldsymbol{\theta}$ in equations (2.6) and (2.7) by some consistent estimate of it. We will construct a FGLS estimator of $\boldsymbol{\pi}$ by first finding a consistent estimate of $\boldsymbol{\theta}$.

Two stage Estimation in RMFM: At first, we get a consistent estimate of $\boldsymbol{\theta}$ by the two-stage approach described below.

1. We look at the data at each time-point independently and estimate just the familial dependence in this step. From equation (2.12), note the marginal distribution of $\mathbf{Y}^z_{i\cdot k}$,

$$\mathbf{Y}^z_{i\cdot k} \sim N(\mathbf{X}_{i\cdot k}\boldsymbol{\pi}_k, V^z_k)$$

$$\mathbf{X}_{i\cdot k} = \begin{bmatrix} \mathbf{1}_2 & \mathbf{G}_{i\cdot} & \mathbf{S}_{i\cdot k}^T \end{bmatrix}, \quad \boldsymbol{\pi}_k = (\beta_k, \alpha_k, \eta_k)^T, \quad V^z_k = \sigma^2_{zk} \begin{pmatrix} 1 & r_{zk} \\ r_{zk} & 1 \end{pmatrix}, \quad z = 1, 2$$

It is to be noted that $V^1_k$ and $V^2_k$ are functions of $\boldsymbol{\theta}_{1k}$ only. We maximize the sum of log-likelihoods of the $\mathbf{Y}^z_{i\cdot k}$'s for a particular $k$ w.r.t. $\boldsymbol{\gamma}_k = (\boldsymbol{\pi}_k^T, \boldsymbol{\theta}_{1k}^T)^T$. For this maximization, we have used the R package RFGLS (Li et al., 2011), which is computationally very fast. The MLE of $\boldsymbol{\gamma}_k$ is denoted as $\hat{\boldsymbol{\gamma}}_k = (\hat{\boldsymbol{\pi}}_k^T, \hat{\boldsymbol{\theta}}_{1k}^T)^T$. Also, $\hat{\boldsymbol{\pi}} = (\hat{\boldsymbol{\pi}}_1^T, \ldots, \hat{\boldsymbol{\pi}}_K^T)^T$ and $\hat{\boldsymbol{\theta}}_1 = (\hat{\boldsymbol{\theta}}_{11}^T, \ldots, \hat{\boldsymbol{\theta}}_{1K}^T)^T$. We have shown in Theorem 1 (refer to the Appendix) why these estimates are consistent. This step is quite similar to Ge et al. (2016)'s ad-hoc method of fitting $K$ marginal ACE models to get heritability estimates, which we discussed earlier.

2. In this step, we estimate the longitudinal dependence by looking at independent individuals over time. Denote $\boldsymbol{\theta}^*_2 = (\sigma^2_{11}, \ldots, \sigma^2_{1K}, \sigma^2_{21}, \ldots, \sigma^2_{2K}, \boldsymbol{\theta}_2^T)^T$. From equation (2.12), notice the marginal distribution of $\mathbf{Y}^z_{ij\cdot}$,

$$\mathbf{Y}^z_{ij\cdot} \sim N(\mathbf{X}_{ij\cdot}\boldsymbol{\pi}, \Sigma_z), \quad \mathbf{X}_{ij\cdot} = \begin{bmatrix} I_K & G_{ij} I_K & \mathbf{S}_{ij\cdot}^T \end{bmatrix}$$

Note that $\Sigma_z$ is a function of $\boldsymbol{\theta}^*_2$ only. At first, we consider only the first individual from each twin-pair $i$, i.e., only the $\mathbf{Y}^z_{i1\cdot}$'s, and maximize the sum of the corresponding log likelihoods w.r.t. $\boldsymbol{\zeta} = (\boldsymbol{\pi}^T, \boldsymbol{\theta}^{*T}_2)^T$. We obtain the MLE of $\boldsymbol{\zeta}$ as $\hat{\boldsymbol{\zeta}}^1 = (\hat{\boldsymbol{\pi}}^{1T}, \hat{\boldsymbol{\theta}}^{*1T}_2)^T$.

Next, consider only the second individual from each twin-pair $i$, i.e., only the $\mathbf{Y}^z_{i2\cdot}$'s, and maximize the sum of the corresponding log likelihoods. Obtain the MLE of $\boldsymbol{\zeta}$ as $\hat{\boldsymbol{\zeta}}^2 = (\hat{\boldsymbol{\pi}}^{2T}, \hat{\boldsymbol{\theta}}^{*2T}_2)^T$. Also define $\hat{\boldsymbol{\theta}}^{*3}_2 = \frac{1}{2}(\hat{\boldsymbol{\theta}}^{*1}_2 + \hat{\boldsymbol{\theta}}^{*2}_2)$. The reason for taking this average of the parameter estimates is that we believe it would increase asymptotic efficiency because we are using more information. For both maximizations we use the R function gls from the nlme (Bates et al., 2014) package, which is computationally very efficient. We have shown in Theorem 2 (refer to the Appendix) why these estimates are consistent.

Note that we are estimating the $\sigma^2_{zk}$'s twice in the above two steps. We can take their average as the final estimates. Also, both the steps give us estimates of $\boldsymbol{\pi}$ but we do not need to use them. One can argue for centering the data and removing $\boldsymbol{\pi}$ altogether from the above two steps, which would further expedite the process but may decrease the asymptotic efficiency of the estimates of the variance parameters.
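The structure of the two stages can be sketched in Python with NumPy as below. This is not the ML procedure used in this chapter (which relies on RFGLS and nlme::gls in R): as a simplifying assumption, the likelihood fits are replaced by plain sample moments on centered data without covariates, and each twin type is handled separately. The array layout (pairs, twin, time) and all names are hypothetical.

```python
import numpy as np

def stage1_familial(y):
    """Stage 1: per time-point variance and twin correlation.

    y has shape (pairs, 2, K) and holds one twin type (MZ or DZ).
    """
    K = y.shape[2]
    var_k = np.array([y[:, :, k].var(ddof=1) for k in range(K)])
    r_k = np.array([np.corrcoef(y[:, 0, k], y[:, 1, k])[0, 1] for k in range(K)])
    return var_k, r_k

def stage2_longitudinal(y):
    """Stage 2: K x K longitudinal correlation, averaged over the two twins."""
    R1 = np.corrcoef(y[:, 0, :], rowvar=False)   # first twin of every pair
    R2 = np.corrcoef(y[:, 1, :], rowvar=False)   # second twin of every pair
    return 0.5 * (R1 + R2)

# Hypothetical toy data: a shared effect induces twin correlation 0.5.
rng = np.random.default_rng(1)
pairs, K = 2000, 3
shared = rng.standard_normal((pairs, 1, K))
y_pairs = shared + rng.standard_normal((pairs, 2, K))

var_hat, r_hat = stage1_familial(y_pairs)
R_hat = stage2_longitudinal(y_pairs)
```

Stage 1 uses both twins at one time-point, while stage 2 uses one twin at a time across all time-points, so each stage works with a much smaller covariance matrix than the joint likelihood.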

To reiterate, in the first step, we have obtained the estimates of the $r_{zk}$'s and in the second step, we have estimated the $\rho_{kk'}$'s. Thus we have obtained an estimate of $\boldsymbol{\theta}$, call it $\hat{\boldsymbol{\theta}}$. Using these parameter estimates we compute the univariate heritabilities ($h^2_k$), and then the multivariate heritability is calculated as a weighted combination of the univariate heritabilities,

$$\hat{h}^2_{mult} = \sum_{k=1}^K \frac{\hat{\sigma}^2_{1k} + \hat{\sigma}^2_{2k}}{\sum_{k'=1}^K (\hat{\sigma}^2_{1k'} + \hat{\sigma}^2_{2k'})} \hat{h}^2_k$$

We slightly change the earlier definition of the multivariate heritability to accommodate the different variances within an MZ and a DZ pair. For calculating its standard error we use bootstrapping with 1000 resamples. Plugging $\hat{\boldsymbol{\theta}}$ into equations (2.6) and (2.7), we obtain a FGLS estimate of $\boldsymbol{\pi}$ as $\hat{\boldsymbol{\pi}}_{FGLS}$,

$$\hat{\boldsymbol{\pi}}^{RMFM}_{FGLS} = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1} \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1} \mathbf{Y} \tag{2.15}$$

$$cov(\hat{\boldsymbol{\pi}}^{RMFM}_{FGLS}) = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1} \mathbf{X})^{-1} \tag{2.16}$$

From Goldberger and Goldberger (1991), $\hat{\boldsymbol{\pi}}_{FGLS}$ has the same asymptotic properties as $\hat{\boldsymbol{\pi}}_{GLS}$ in this case, as $\hat{\boldsymbol{\theta}}$ is consistent. For inference, we perform a Wald test using the test statistic $(M\hat{\boldsymbol{\pi}}^{RMFM}_{FGLS})^T \{M \, cov(\hat{\boldsymbol{\pi}}^{RMFM}_{FGLS}) M^T\}^{-1} (M\hat{\boldsymbol{\pi}}^{RMFM}_{FGLS})$, which asymptotically follows a $\chi^2_K$ distribution.

2.3.2 MFM Method of Moments (MFM-MOM)

The number of variance-covariance parameters, i.e., the dimension of $\boldsymbol{\theta}$, under MFM is significantly higher than that in RMFM. Therefore, applying a likelihood based maximization approach to get the parameter estimates is even more tedious and computationally challenging. That is why, similar to the earlier case, we again construct a FGLS estimator of $\boldsymbol{\pi}$ by first finding a consistent estimator of $\boldsymbol{\theta}$. $\boldsymbol{\theta}$ consists of the variance terms $var(Y^z_{ijk}) = \sigma^2_{zk}$ and the correlation terms $cor(Y^z_{ijk}, Y^z_{ij'k'}) = r_{zkk'}$. We use an idea similar to what Falconer did in the univariate context to estimate the heritability. He estimated the correlation parameters by their sample analogues. Similarly, we estimate the $\sigma^2_{zk}$'s and $r_{zkk'}$'s by their sample analogues without using any information on the SNP or the covariates. We call the estimate of $\boldsymbol{\theta}$ thus obtained $\hat{\boldsymbol{\theta}}_{sample}$. Next, the univariate heritabilities and their asymptotic variance-covariances are calculated as (see the Appendix for more details),

$$\begin{aligned}
\hat{h}^2_k &= 2(\hat{r}_{1k} - \hat{r}_{2k}) \\
\widehat{var}(\hat{h}^2_k) &= 4 \left[ \frac{1}{n_1} (1 - \hat{r}^2_{1k})^2 + \frac{1}{n_2} (1 - \hat{r}^2_{2k})^2 \right] \\
\widehat{cov}(\hat{h}^2_k, \hat{h}^2_{k'}) &= \frac{4}{n_1} \left[ (1 + \hat{r}_{1k}\hat{r}_{1k'})(\hat{\rho}^2_{1kk'} + \hat{r}^2_{1kk'}) - 2\hat{\rho}_{1kk'}\hat{r}_{1kk'}(\hat{r}_{1k} + \hat{r}_{1k'}) \right] \\
&\quad + \frac{4}{n_2} \left[ (1 + \hat{r}_{2k}\hat{r}_{2k'})(\hat{\rho}^2_{2kk'} + \hat{r}^2_{2kk'}) - 2\hat{\rho}_{2kk'}\hat{r}_{2kk'}(\hat{r}_{2k} + \hat{r}_{2k'}) \right]
\end{aligned}$$

where $n_1$ and $n_2$ denote the numbers of MZ and DZ twin-pairs. Again, the multivariate heritability is calculated as a weighted combination of the univariate heritabilities as

$$\hat{h}^2_{mult} = \sum_{k=1}^K \frac{\hat{\sigma}^2_{1k} + \hat{\sigma}^2_{2k}}{\sum_{k'=1}^K (\hat{\sigma}^2_{1k'} + \hat{\sigma}^2_{2k'})} \hat{h}^2_k$$
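As a worked example of the method of moments formulas above, the following Python sketch evaluates $\hat{h}^2_k$, its estimated variance and the weighted multivariate heritability. It assumes that $n_1$ and $n_2$ are the numbers of MZ and DZ pairs. The input values are hypothetical, chosen to mimic a setting with $r_{1k} = 0.8$, $r_{2k} = 0.4$ and 250 pairs of each type.

```python
from math import sqrt

def h2_k(r1, r2):
    """Univariate heritability from the MZ and DZ twin correlations."""
    return 2.0 * (r1 - r2)

def var_h2_k(r1, r2, n1, n2):
    """Estimated variance: 4 * [ (1 - r1^2)^2 / n1 + (1 - r2^2)^2 / n2 ]."""
    return 4.0 * ((1.0 - r1 ** 2) ** 2 / n1 + (1.0 - r2 ** 2) ** 2 / n2)

def h2_mult(var_mz, var_dz, h2):
    """Weighted average with weights proportional to sigma_1k^2 + sigma_2k^2."""
    w = [a + b for a, b in zip(var_mz, var_dz)]
    return sum(wk * hk for wk, hk in zip(w, h2)) / sum(w)

# Hypothetical inputs for one time-point.
h2_hat = h2_k(0.8, 0.4)
se_hat = sqrt(var_h2_k(0.8, 0.4, n1=250, n2=250))

# Hypothetical variances and heritabilities at K = 2 time-points.
mult_hat = h2_mult([60.0, 60.0], [60.0, 30.0], [0.8, 0.6])
```

With these inputs the standard error of $\hat{h}^2_k$ is about 0.116, most of which comes from the DZ term because $(1 - r^2)^2$ is larger for the smaller correlation.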

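The asymptotic standard error of this estimate is obtained using the derived variance-covariance formulae of the univariate heritabilities. Next, we substitute this estimate of $\boldsymbol{\theta}$ in (2.6) and (2.7) to get the corresponding FGLS estimators,

$$\hat{\boldsymbol{\pi}}^{MFM}_{FGLS} = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{sample})^{-1} \mathbf{X})^{-1} \mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{sample})^{-1} \mathbf{Y} \tag{2.17}$$

$$cov(\hat{\boldsymbol{\pi}}^{MFM}_{FGLS}) = (\mathbf{X}^T \boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}}_{sample})^{-1} \mathbf{X})^{-1} \tag{2.18}$$

For inference, we perform a Wald test using the test statistic $(M\hat{\boldsymbol{\pi}}^{MFM}_{FGLS})^T \{M \, cov(\hat{\boldsymbol{\pi}}^{MFM}_{FGLS}) M^T\}^{-1} (M\hat{\boldsymbol{\pi}}^{MFM}_{FGLS})$, which asymptotically follows a $\chi^2_K$ distribution.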

2.3.2.1 Computational time comparison of the different methods

Table 2.1: Computational time, in seconds, of fitting RMFM using optim in R, of the proposed two stage approach for fitting RMFM, and of MFM-MOM. Under the simulation setup described in section 2.4, each of the methods was run 100 times and the minimum, mean and maximum times are listed.

Methods           Minimum   Mean    Maximum
Optim RMFM        166.8     176.9   194.2
Two Stage RMFM    33.47     35.32   38.89
MFM-MOM           2.664     3.317   4.115

For estimating the variance-covariance parameters without any covariates in our simulation setup, the standard optimization routine takes around 3 minutes on average (176.9 seconds). The two stage approach is about five times faster (35.32 seconds), and the method of moments approach, as expected, is the fastest (3.317 seconds).

2.4 Results

We compare the heritability estimates obtained by the different methods under several simulation scenarios.

2.4.1 Comparing heritability in simulation setup by different approaches

We simulate the $Y^z_{ijk}$'s under the RACE model with no covariates. There were $K = 4$ time-points. The mean was kept at 0, i.e., $\boldsymbol{\pi} = 0$. The parameter values considered were $\sigma^2_{Ak} = 48$, $\sigma^2_{Ck} = 0$, $\sigma^2_{Ek} = 12$ (so we are keeping the variances of an MZ and a DZ twin the same at all the time-points), $\rho_{kk'} = 0.5$ and $h^2_k = 0.8$. So $r_{1k} = 0.8$ and $r_{2k} = 0.4$. Note that by the Cauchy-Schwarz inequality discussed earlier, RACE and RMFM are equivalent under these parameter values. So, from equation (2.11), $cor(Y^1_{ijk}, Y^1_{ij'k'}) = \sqrt{r_{1k} r_{1k'}} \rho_{kk'} = 0.4$ and $cor(Y^2_{ijk}, Y^2_{ij'k'}) = \sqrt{r_{2k} r_{2k'}} \rho_{kk'} = 0.2$. We considered 4 different values of the sample size, $m = 500, 1000, 2500, 5000$, of which the first 50% were MZ and the last 50% were DZ. The number of simulations was 1000.
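A single replicate of this simulation design can be sketched in Python with NumPy as follows. The parameter values are those listed above. The covariance is assembled as in equation (2.4) with $\Sigma_A = \sigma^2_A R$ and so on, and the heritability at each time-point is estimated with the Falconer-type formula $2(\hat{r}_{1k} - \hat{r}_{2k})$. The function names, the seed and the array layout are assumptions made for this illustration, not the code used for the reported results.

```python
import numpy as np

def simulate_pairs(n_pairs, z, K=4, s2a=48.0, s2c=0.0, s2e=12.0, rho=0.5, rng=None):
    """Draw phenotypes of shape (n_pairs, 2, K) for twin type z (1 = MZ, 2 = DZ)."""
    rng = np.random.default_rng() if rng is None else rng
    R = np.full((K, K), rho) + (1.0 - rho) * np.eye(K)   # longitudinal correlation
    Phi2 = np.array([[1.0, 1.0 / z], [1.0 / z, 1.0]])    # GRM 2 * Phi_z
    J2, I2 = np.ones((2, 2)), np.eye(2)
    # cov(Y_i..) = Sigma_A (x) 2Phi_z + Sigma_C (x) J_2 + Sigma_E (x) I_2
    cov = np.kron(s2a * R, Phi2) + np.kron(s2c * R, J2) + np.kron(s2e * R, I2)
    y = rng.multivariate_normal(np.zeros(2 * K), cov, size=n_pairs)
    # Columns are ordered (time 1: twin 1, twin 2), (time 2: twin 1, twin 2), ...
    return y.reshape(n_pairs, K, 2).transpose(0, 2, 1)

def twin_cor(y):
    """Sample twin correlation at each time-point."""
    return np.array([np.corrcoef(y[:, 0, k], y[:, 1, k])[0, 1]
                     for k in range(y.shape[2])])

rng = np.random.default_rng(2020)
m = 5000                                   # largest sample size used in the text
y_mz = simulate_pairs(m // 2, z=1, rng=rng)
y_dz = simulate_pairs(m // 2, z=2, rng=rng)
h2_hat = 2.0 * (twin_cor(y_mz) - twin_cor(y_dz))
```

Repeating the last four lines over many seeds and sample sizes would give empirical distributions of the estimates for this moment-based estimator.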

[Figure 2.1: density plots of the univariate heritability estimates at Time 1 to Time 4 (x-axis: Heritability, y-axis: density), with one panel each for OpenMx_MACE, Two-stage RMFM and Method of Moments_MFM.]

Figure 2.1: Comparing univariate heritabilities obtained by three different methods. The estimates of the heritabilities are biased away from the true value 0.8 in the case of OpenMx.

Table 2.2: Mean of the univariate heritabilities by OpenMx

Sample size   Time 1   Time 2   Time 3   Time 4
500           0.6596   0.6988   0.7110   0.7138
1000          0.6843   0.7015   0.7282   0.7313
2500          0.7396   0.748    0.7575   0.7594
5000          0.7584   0.7726   0.7738   0.7733

Figure 2.1 shows the estimated empirical density of the heritability estimates by OpenMx, by Two-stage RMFM and by Method of Moments MFM, at each of the 4 time-points for sample size $m = 500$. We see that, unlike for the two-stage RMFM and Method of Moments MFM, the distribution of the heritability estimates at each time-point looks right-skewed and non-normal in the case of OpenMx, with some extreme outliers. Table 2.2 lists the mean of the heritability estimates by OpenMx at the different time-points for varying $m$. It can be seen that the heritabilities are in general under-estimated by the full likelihood maximization in OpenMx. With increase in sample size, the mean of the heritability estimates at a time-point gets closer to the true value of 0.8. This supports our claim that dimension reduction of a fully general MACE model should be sought, especially when the number of time-points is high or the sample size is not large enough.

Next we see the impact of assuming different variances for an MZ and a DZ twin. Here we compare Ge et al. (2016)'s approach of fitting marginal ACE models, which assumes the same variance for an MZ and a DZ twin, with our MFM Method of Moments, which allows different variances. We keep the univariate heritabilities at each time-point the same, $h^2_k = 0.6$ (therefore the true multivariate heritability is 0.6), and the longitudinal correlation is the same as earlier, $\rho_{kk'} = 0.5$. But we consider a different set of latent variances, i.e., for a twin-pair of type $z$ we consider latent variances $\sigma^{z2}_{Ak} = 48/z$, $\sigma^{z2}_{Ck} = 0$ and $\sigma^{z2}_{Ek} = 12/z$ for $z = 1, 2$. We first look at the histogram plots of the univariate heritabilities for the two different methods,

Figure 2.2: Histograms of the univariate heritabilities obtained by marginal ACE models

Figure 2.3: Histograms of the univariate heritabilities obtained by MFM-MOM

From figures 2.2 and 2.3 we can see that the marginal ACE models overestimate the univariate heritabilities, but MFM-MOM gives correctly centered univariate heritability estimates. Next we plot the histogram of the multivariate heritability estimate by both the methods,

Figure 2.4: Histogram of the multivariate heritability by marginal ACE models

Figure 2.5: Histogram of the multivariate heritabilities obtained by MFM-MOM

Again we see that the marginal ACE models overestimate the multivariate heritability, but the MFM-MOM estimates are correctly centered around the true value 0.6.

2.4.2 Univariate heritabilities in Real Data by different approaches

We used our proposed methods on the data collected in The Minnesota Center for Twins and Family Study (MCTFS) (Miller et al., 2012). The MCTFS is a longitudinal study in which data were collected from a cohort of twins at five different time points: ages 11, 17, 20, 24, and 29. Out of the many available traits, we look at the alcohol habit trait, which is available at the last 4 time-points. After filtering out the missing values, our data-set has 526 MZ twin pairs and 256 DZ twin pairs. The ratios of the sample variance of an MZ individual to that of a DZ individual at ages 17, 20, 24 and 29 are, respectively, 0.985, 0.9681, 0.974 and 0.851. So the MZ and DZ variances are almost equal except at age 29. We list the univariate heritabilities at the 4 different time-points by the different approaches in the following table,

Table 2.3: Univariate heritabilities ($h^2_k$) in real data

Methods           Time 1   Time 2   Time 3   Time 4
OpenMx            0.3245   0.4668   0.5048   0.4886
Two Stage RMFM    0.3226   0.4635   0.5291   0.4454
MFM-MOM           0.3239   0.4643   0.5310   0.4435

The heritability estimates at the different ages by all the methods are very close except at age 29 (Time 4). At age 29, the heritability estimate by OpenMx is slightly higher than those of the other two methods, probably because the assumption of equal variance of an MZ and a DZ twin pair made by MACE fails here. The estimate of the multivariate heritability would just be the weighted average of the univariate heritabilities.

2.5 Discussion

The multivariate (longitudinal) extension of the univariate ACE model (MACE) (Klingenberg and Monteiro, 2005; Ge et al., 2016) involves a lot of variance-covariance parameters, making Maximum Likelihood (ML) based parameter estimation, and thus estimation of multivariate (longitudinal) heritability, extremely difficult. It becomes infeasible altogether when the number of time-points is too large. Also, in some particular cases, unique MLEs of the variance components may not exist (Roś et al., 2016), yielding highly biased parameter estimates. We have observed such a scenario in section (2.4.1), where the MLEs of the univariate and multivariate heritabilities turn out to be heavily biased.

In this paper, we have extended the univariate Falconer's method to a longitudinal context for estimating multivariate (longitudinal) heritability. The new modelling framework, which we term the Multivariate Falconer's Model (MFM), is based on comparing MZ and DZ correlations over different time-points. We develop a method of moments approach for obtaining the parameter estimates in MFM. MFM can be shown to be theoretically equivalent to the Multivariate ACE (MACE) model. Another widely used framework, the Kronecker Product (KP) model (Shi et al., 2009; Sung et al., 2014), is shown to be a special case of MFM. We have also proposed a reasonably simplified version of MFM, which we name Reduced MFM (RMFM), that involves far fewer variance-covariance parameters than MFM or the MACE model, making ML based parameter estimation feasible in a high-dimensional scenario. We develop a rapid two-stage ML based estimation procedure for obtaining the parameter estimates in RMFM. For the method of moments approach, we have also obtained closed form expressions of the standard error of the heritability estimates.

We used all the methods to analyze the alcoholism trait available over four different ages with MZ and DZ twin pairs from The Minnesota Center for Twins and Family Study (MCTFS). We found that the univariate heritability estimates by the different methods are close to each other at every time-point except the last one. A possible justification is that the assumption of equal variance of the individuals from an MZ and a DZ twin pair seems to be violated at that particular time-point, causing slight overestimation in the OpenMx method (univariate ACE model). This particular assumption is a major shortcoming of the univariate ACE as well as the MACE model. It is overcome in our proposed MFM by allowing different variances for the different types of twin pairs (MZ and DZ), accommodating the differences in shared environments (Hur et al., 2008).

There are a few things that we would like to examine further. One of them is that the variance-covariance matrices resulting from MACE are not always positive definite, especially for MZ twins. We want to investigate what exact conditions on the variance-covariance parameters would ensure positive definiteness. Also, we would like to perform a longitudinal genome-wide association test on the alcohol consumption trait from the MCTFR data, modelling the covariance structures following our proposed methods.

Chapter 3

Efficient SNP-based Heritability estimation using Gaussian Predictive Process

3.1 Introduction

In past few decades, Genome Wide Association Studies (GWAS) have identified hundreds of SNPs revealing the genetic variation of complex diseases and traits. For most traits, however, the associated SNPs from GWAS only explain a small fraction of the heritability usually estimated using twin and family studies. In search of this so called ”missing heritability”, instead of focusing just on the as- sociated SNPs, the researchers nowadays try to capture even infinitesimal SNP effects by taking into account all the SNPs in a LMM framework (Yang et al., 2011; Lippert et al., 2011; Loh et al., 2015; Chen et al., 2016). The SNP-based LMM framework, often known as Genome-based Restricted Maximum Likelihood

33 Chapter 3 – Efficient SNP-based Heritability estimation using Gaussian Predictive Process (GREML) approach, usually involves distantly related people which refers to ap- parently unrelated individuals who share genetic relatedness due to their evolu- tionary history (Weir et al., 2006). This LMM, classically used in twin and family studies with known pedigree relationships (Neale et al., 1994; Rabe-Hesketh et al., 2008), is now used with a Genetic Relationship Matrix (GRM) constructed using genome-wide SNP data of available individuals. Heritability is calculated as the ratio of two variance components of the LMM which are usually estimated using a Restricted Maximum Likelihood (REML) approach. There are several softwares like Genome-wide Complex Trait Analysis (GCTA) (Yang et al., 2011), Genome-wide Efficient Mixed Model Association (GEMMA) (Zhou and Stephens, 2012) which have implemented efficient algorithms to fit the GREML. All these methods have per iteration computational complexity of O(N 3) (N being the number of individuals) and badly struggle when N is large. We have seen in our analysis even with N = 100, 000 individuals, these softwares stay stuck even for 10-15 hours on a highly capable super-computing server. A much efficient Monte Carlo Average Information REML algorithm has been implemented in the software named Bolt-REML (Loh et al., 2015) that can handle considerably large number of individuals. In the same analysis with N = 100, 000, Bolt-REML took around 5 hours to finish. It has much better computational complexity of O(MN 1.5)(M being the number of SNPs) per iteration. But the complexity is still not linear in N which makes it challenging to use on a further large scale. In recent years, advances in genome sequencing have generated genetic data on large scale cohort studies, such as UK Biobank (Allen et al., 2014), Precision Medicine cohort (Khoury and Evans, 2015), Million Veterans Program (Gaziano et al., 2016). 
These studies consist of information on millions of genetic markers and numerous diseases/traits on millions of individuals. Therefore, to be applicable to such massive cohort studies, GREML-based methods need to be as time and memory efficient as possible.

In this paper, we approximate the likelihood corresponding to the GREML approach to develop a substantially faster algorithm for estimating heritability. The approximation is motivated by the concepts of genetic coalescence (Kingman, 2000; Degnan and Salter, 2005) and Gaussian Predictive Process models (Banerjee et al., 2008; Finley et al., 2009). Our method, which we call PredLMM, exploits the structure of the GRM to ease the computationally demanding linear algebraic steps of the standard GREML algorithm, such as calculating the determinant or inverse of a high dimensional ($N \times N$) matrix at every iteration. It reduces the per-iteration computational complexity from $O(N^3)$ FLOPS (floating point operations) to $O(r^2 N) + O(r^3)$ FLOPS, where $r$ is much smaller than $N$. We verify the reliability and robustness of our proposed approach through extensive simulation studies replicating many possible realistic scenarios. We also analyze a section of the UK Biobank cohort, estimating the heritability of multiple quantitative traits like Standing Height, Weight, BMI, Systolic and Diastolic blood pressure, and Hip and Waist circumference. We implement PredLMM in an efficient Python module which is to be released soon.
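To give a rough sense of scale for the complexity claim above, the following sketch compares the leading-order operation counts $N^3$ and $r^2 N + r^3$. The values of N and r are illustrative choices (they match sizes used later in the chapter), and constant factors are ignored, so the ratio is only an order-of-magnitude indication and not a predicted speed-up.

```python
# Leading-order FLOP counts per iteration, ignoring constant factors.
N = 100_000   # number of individuals (illustrative)
r = 8_000     # number of knots (illustrative)

flops_greml = N ** 3                 # dense inverse/determinant of an N x N matrix
flops_predlmm = r ** 2 * N + r ** 3  # low-rank plus diagonal structure

ratio = flops_greml / flops_predlmm
print(f"GREML: {flops_greml:.2e}, PredLMM: {flops_predlmm:.2e}, ratio: {ratio:.0f}")
```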

3.2 Materials and Methods

3.2.1 Genome-based Restricted Maximum Likelihood Approach

Let $Y$ denote the $N \times 1$ vector of phenotypes corresponding to $N$ individuals, $X$ denote the $N \times p$ matrix of covariates, and $W$ denote the $N \times M$ mean- and variance-scaled genotype matrix for the $N$ individuals and $M$ SNPs, i.e., $E(w_{ij}) = 0$ and $\mathrm{Var}(w_{ij}) = 1$. Consider the following LMM,

$$Y = X\beta + W\gamma + \epsilon; \quad \gamma \sim N_M(0, \sigma_w^2 I_M), \quad \epsilon \sim N_N(0, \sigma_e^2 I_N) \tag{3.1}$$

The corresponding marginal model can be written as

$$Y \sim N_N(X\beta, \sigma_h^2 A + \sigma_e^2 I_N); \quad \sigma_h^2 = M\sigma_w^2, \quad A = \frac{1}{M} W W^\top \tag{3.2}$$

where $A$ is formally known as the Genetic Relationship Matrix (GRM). Heritability is calculated as $h^2 = \sigma_h^2/(\sigma_h^2 + \sigma_e^2)$. To estimate the variance parameters $\sigma_h^2$, $\sigma_e^2$ and eventually $h^2$, the most common practice, followed in software such as GCTA (Yang et al., 2011), FaST-LMM (Lippert et al., 2011) and GEMMA (Zhou and Stephens, 2014), is to use a Restricted Maximum Likelihood (REML) approach. Thus, the entire framework is referred to as the Genome-based Restricted Maximum Likelihood (GREML) approach.

Optimizing the likelihood corresponding to Eq 3.2 requires computing the inverse and determinant of the $N \times N$ dense matrix $V = \sigma_h^2 A + \sigma_e^2 I$ at every iteration, which takes $O(N^3)$ FLOPS per iteration. Thus, as expected, the aforementioned packages either crash or require prohibitive run times as $N$ increases. For example, in our analysis of UK Biobank data with even N = 100,000, GCTA had not finished after ten hours (more details later). More notably, Bolt-REML (Loh et al., 2015) implements a much more efficient Monte Carlo AI-REML algorithm which has a per-iteration computational complexity of $O(MN^{1.5})$. In the analysis previously mentioned, with N = 100,000 and M = 350,000, Bolt-REML took around five hours. It should be kept in mind, however, that unlike the previous methods, Bolt-REML does not use a precomputed GRM $A$. It works directly with the SNP matrix $W$, which is why its computational complexity increases linearly with $M$. With larger $M$, the computational time of Bolt-REML increased significantly.
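As an illustration of the quantities defined in (3.2), the sketch below builds a GRM from a toy genotype matrix and evaluates the heritability ratio. The sizes, the allele-frequency range and the helper name `heritability` are assumptions made for the example; they are not taken from the GCTA, GEMMA or Bolt-REML implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 1000                      # toy sizes, far smaller than a real cohort

# Raw genotypes coded 0/1/2 with SNP-specific allele frequencies (assumed range).
p = rng.uniform(0.1, 0.9, size=M)
G = rng.binomial(2, p, size=(N, M)).astype(float)

# Center and scale each SNP so that it has sample mean 0 and variance 1.
W = (G - G.mean(axis=0)) / G.std(axis=0)

# GRM from (3.2): A = W W^T / M.
A = W @ W.T / M

def heritability(sigma_h2, sigma_e2):
    """h^2 = sigma_h^2 / (sigma_h^2 + sigma_e^2)."""
    return sigma_h2 / (sigma_h2 + sigma_e2)

print(A.shape, np.trace(A) / N, heritability(70.0, 30.0))
```

With this scaling the average diagonal entry of the GRM equals one, which is why $\sigma_h^2$ and $\sigma_e^2$ can be read as the genetic and residual shares of the total trait variance.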

3.2.2 Proposed Method

3.2.2.1 Asymptotic limit of the GRM

First, we show that under certain assumptions, as the number of SNPs M goes to infinity, the likelihood for (3.2) converges almost-surely to a Gaussian Process (GP) likelihood. The assumptions are as follows,

1. Assumption 1 (Correlation across individuals): We assume that each individual $i = 1, 2, \ldots, N$ can be represented by a point (location) $s_i$ in some spatial manifold $D$ equipped with a distance $d$. The correlation between the genotypes of individuals $i$ and $i'$ at the $j$th SNP is given by $\mathrm{Cov}(w_{ij}, w_{i'j}) = C_j(s_i, s_{i'})$, where $C_j$ is a valid covariance function in $D$ which decreases monotonically with increasing distance $d(s_i, s_{i'})$.

This assumption is rooted in the theory of genetic coalescence (Kingman, 2000; Degnan and Salter, 2005), which postulates that every individual in a population can be traced back to a common ancestor. Under coalescence, the correlation between genotypes of individuals will vary inversely with the time to coalescence, i.e., the number of ancestral generations till the most recent common ancestor. Hence, the individuals in a population can be assigned to nodes of a phylogenetic tree. Trees are equipped with a valid distance metric (shortest distance between nodes), and models for tree-structured objects commonly specify the correlation as a decreasing function of the distance (Basseville et al., 1992). Our assumption of a latent embedding, however, does not rely on the manifold being a tree, and accommodates any manifold with a notion of distance. The maximum likelihood estimate of $h^2$ from (3.2) has been shown to be consistent in Jiang et al. (2016). The theory relies on the assumption that the genotype distributions are independent across individuals (up to standardization). Formally, $w_i \perp w_{i'}$ for any two individuals $i \neq i'$, where $w_i$, the $i$th row of $W$, is the genotype vector for the $i$th individual. Such an assumption of between-individual independence of genotype distributions is in sharp violation of the principles of coalescent theory.

2. Assumption 2 (Stationarity and ergodicity across the SNPs): We assume that the centered and scaled genotype process $\tilde{w}_j = (w_{1j}, \ldots, w_{Nj})'$ is second-order stationary and ergodic for $j = 1, 2, \ldots$. Stationarity translates to $\mathrm{Cov}(\tilde{w}_j) = \mathrm{Cov}(\tilde{w}_{j'}) = C$ for all $j, j'$, implying that the covariance functions $C_j = C$ for all $j = 1, 2, \ldots$. Ergodicity implies that as the number of SNPs grows, we have

$$\lim_{M \to \infty} A = \lim_{M \to \infty} \frac{1}{M} \sum_{j=1}^{M} \tilde{w}_j \tilde{w}_j^\top = C = \mathrm{Cov}(\tilde{w}_1) \tag{3.3}$$

The simplest setting where this assumption is satisfied is when the scaled and centered genotype processes $\{\tilde{w}_j\}_{j=1,2,\ldots}$ are assumed to be iid. The assumption of iid genotypes is common in theoretical studies of heritability estimation (Jiang et al., 2016), but it is only sufficient and not necessary for us. More realistic scenarios, like the presence of Linkage Disequilibrium (LD) that induces correlation across genotypes, can also be accommodated as long as ergodicity is ensured. Correlation structures arising from absolutely regular mixing processes (Bradley, 2005), like autoregressive (AR(p)), moving average (MA(q)) or ARMA(p, q) processes (Mokkadem, 1988), will satisfy the strong law of convergence in Equation (3.3) (Nobel and Dembo, 1993).

Under Assumption 2 we have the following assertion.

$$\lim_{M \to \infty} N_N(Y \mid X\beta, \sigma_h^2 A + \sigma_e^2 I) = N_N(Y \mid X\beta, \sigma_h^2 C + \sigma_e^2 I) \tag{3.4}$$

where $N_N(Y \mid \mu, \Sigma)$ denotes the normal likelihood for a realization $Y$ with mean $\mu$ and variance $\Sigma$. Thus the likelihood used in heritability estimation converges to a likelihood for data $Y$, a partial realization of a Gaussian process on $D$ with mean 0 and covariance function $C$, observed at the $N$ latent locations $s_1, s_2, \ldots, s_N$. It is expected that estimation of heritability using the limiting likelihood (3.4) will be similar to that from the exact likelihood (3.2), as the number of SNPs $M$ is usually very large.
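The convergence in (3.3) can be checked numerically in the simplest iid setting. The sketch below is a toy illustration under stated assumptions: the latent locations are points on a line, the covariance is exponential, and the genotype vectors are Gaussian stand-ins instead of real standardized genotypes. It draws $M$ iid vectors with covariance $C$ and reports the relative Frobenius distance between the resulting GRM and $C$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
s = np.sort(rng.uniform(0, 10, size=N))        # hypothetical latent locations on a line
C = np.exp(-np.abs(s[:, None] - s[None, :]))   # exponential covariance, unit variance
L = np.linalg.cholesky(C)

def grm_error(M):
    """Relative Frobenius distance between A = W W^T / M and C for M iid SNPs."""
    W = L @ rng.standard_normal((N, M))        # columns are iid N(0, C) vectors
    A = W @ W.T / M
    return np.linalg.norm(A - C) / np.linalg.norm(C)

errors = {M: grm_error(M) for M in (100, 2_000, 40_000)}
for M, err in errors.items():
    print(M, round(err, 4))
```

The error shrinks roughly like $1/\sqrt{M}$ in this setting, which is the behaviour the limiting likelihood (3.4) relies on.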

3.2.2.2 PredLMM:

Just switching to the limiting likelihood (3.4) does not ease any of the computational burden, as GP likelihoods also require $O(N^3)$ FLOPS. However, over the last two decades a series of increasingly sophisticated algorithms have been proposed for fast approximate GP likelihoods (see Heaton et al., 2019, for a recent review). Our approach uses the Predictive Process (PP) (Banerjee et al., 2008), which results in a low-rank plus diagonal approximation of the dense matrix $C$. Let $S = \{s_1, s_2, \ldots, s_N\}$ denote the set of $N$ latent locations, and $S^* = \{s_1^*, s_2^*, \ldots, s_r^*\}$ denote a set of $r \ll N$ locations in $D$ referred to as the knots. Also, for two sets $A$ and $B$ in $D$, let $C_{A,B}$ denote the $|A| \times |B|$ matrix $(C(s_i, s_{i'}))_{i \in A, i' \in B}$. The predictive process approximation of $C$ is given by

$$\tilde{C}_{PP} = C_{S,S^*} C_{S^*,S^*}^{-1} C_{S^*,S} + \mathrm{diag}(C - C_{S,S^*} C_{S^*,S^*}^{-1} C_{S^*,S}). \tag{3.5}$$

The first term is a low-rank factorization, as the number of knots is much smaller than the sample size. Banerjee et al. (2008) showed that this low-rank term is the optimal (in terms of reverse Kullback-Leibler divergence) low-rank approximation of $C$ using the knots $S^*$. Finley et al. (2009) proposed adding the diagonal matrix (second term) to eliminate a positive bias on the diagonal entries. For moderate choices of $r \ll N$, inference from the predictive process likelihood provides an excellent approximation to that from the full GP likelihood. Computationally, the Predictive Process only requires $O(Nr^2 + r^3)$ FLOPS and, as $r \ll N$, the approximation results in massive gains in run times. Consequently, the Predictive Process is one of the most popular approximations of the full GP likelihood and is widely adopted in many spatial applications.

In our setting, direct usage of the Predictive Process likelihood is not recommended for two reasons. First, the locations $s_i$ are unknown to us. Hence, $\tilde{C}_{PP}$ can only be calculated using approximate locations like a vector of the top few PC scores. The impact of such choices of locations is less intuitive. Second, covariance functions usually involve additional spatial parameters $\theta$, thereby increasing the number of unknown parameters to be estimated.

Instead, we consider the following strategy. We choose $S^*$ to be a subset of $S$, and define $I$ to be the subset of $B = \{1, 2, \ldots, N\}$ containing the indices corresponding to $S^*$. We can decompose the GRM $A$ as,

$$A = A_{B,B} = A_{B,I} A_{I,I}^{-1} A_{I,B} + (A_{B,B} - A_{B,I} A_{I,I}^{-1} A_{I,B})$$

The decomposition is inspired by the concept of conditional variance (Eaton, 1983), where the second term on the right is the conditional GRM of the individuals from the subset $B \cap I^c$ given the individuals in the subset $I$. Replacing this second term with its diagonal, we then have a direct low-rank plus diagonal approximation of $A$ as

$$\tilde{A}_{PP} = A_{B,I} A_{I,I}^{-1} A_{I,B} + \mathrm{diag}(A_{B,B} - A_{B,I} A_{I,I}^{-1} A_{I,B}).$$

We propose using the likelihood $N_N(Y \mid X\beta, \sigma_h^2 \tilde{A}_{PP} + \sigma_e^2 I)$ for heritability estimation. It is clear that $A$ and $\tilde{A}_{PP}$ agree on the diagonals, and on the sub-matrix corresponding to the knots $I$. Also, $\lim_{M \to \infty} \tilde{A}_{PP} = \tilde{C}_{PP}$. Hence, using the triangle inequality, we can write

$$\|A - \tilde{A}_{PP}\| \leq \|A - C\| + \|C - \tilde{C}_{PP}\| + \|\tilde{C}_{PP} - \tilde{A}_{PP}\|$$

Under Assumption 2, the first and third terms vanish as $M \to \infty$. For a well-chosen set of knots $S^*$, when $C$ is a decreasing function of the distance as postulated in Assumption 1, the predictive process approximation $\tilde{C}_{PP}$ is close to $C$ and hence the middle term will also be small. This justifies why, for large $M$, $\tilde{A}_{PP}$ is expected to be close to $A$.
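The low-rank plus diagonal approximation of the GRM described above can be sketched in a few lines. This is a minimal illustration, not the released PredLMM module: the function name `predlmm_approx`, the toy sizes and the Gaussian stand-in for standardized genotypes are all assumptions, and the full matrix $A$ is formed here only so that the approximation can be compared against it.

```python
import numpy as np

def predlmm_approx(A, knots):
    """Low-rank plus diagonal approximation of a GRM A using the indices in `knots`."""
    A_BI = A[:, knots]                                # N x r block
    A_II = A[np.ix_(knots, knots)]                    # r x r block
    low_rank = A_BI @ np.linalg.solve(A_II, A_BI.T)   # A_BI A_II^{-1} A_IB
    resid_diag = np.diag(A) - np.diag(low_rank)       # diagonal of the conditional GRM
    return low_rank + np.diag(resid_diag)

rng = np.random.default_rng(2)
N, M, r = 300, 2000, 60
W = rng.standard_normal((N, M))        # stand-in for a standardized genotype matrix
A = W @ W.T / M
knots = rng.choice(N, size=r, replace=False)
A_pp = predlmm_approx(A, knots)

# The approximation matches A on the diagonal and on the knot-by-knot block.
print(np.allclose(np.diag(A_pp), np.diag(A)))
print(np.allclose(A_pp[np.ix_(knots, knots)], A[np.ix_(knots, knots)]))
```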

3.2.2.3 Computational gains:

Evaluation of our PredLMM likelihood $N_N(Y \mid X\beta, \sigma_h^2 \tilde{A}_{PP} + \sigma_e^2 I)$ does not require computing or storing the entire $N \times N$ GRM $A$; it can be calculated using only the $N \times r$ tall thin sub-matrix $A_{B,I}$, the small square matrix $A_{I,I}$, and the diagonal elements of $A$. This reduces storage from $O(N^2)$ to $O(Nr)$, a substantial gain for biobank-scale studies with large $N$.

Subsequently, the low-rank plus diagonal structure of $\tilde{A}_{PP}$ facilitates fast evaluation of the likelihood. Inversion of $\sigma_h^2 \tilde{A}_{PP} + \sigma_e^2 I$ becomes feasible and fast using the Woodbury matrix identity (Riedel, 1992), while the matrix determinant lemma (Harville, 1998) is leveraged for scalable computation of the determinant. Both steps involve $O(Nr^2 + r^3)$ FLOPS; as $r \ll N$, the computation thus becomes linear in $N$, a drastic reduction from the $O(N^3)$ FLOPS required for evaluating the true likelihood.
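A hedged sketch of how the Woodbury identity and the matrix determinant lemma combine to evaluate a Gaussian log-likelihood with low-rank plus diagonal covariance is given below. It assumes the mean $X\beta$ has already been subtracted (the vector `resid`), uses a plain likelihood and not REML, and the function name `lowrank_loglik` is hypothetical; it is not the PredLMM implementation.

```python
import numpy as np

def lowrank_loglik(resid, U, K, d, sigma_h2, sigma_e2):
    """Log-density of resid ~ N(0, V) with
    V = sigma_h2 * (U K^{-1} U^T + diag(d)) + sigma_e2 * I, without forming V."""
    N, r = U.shape
    delta = sigma_h2 * d + sigma_e2          # diagonal part of V
    Ud = U / delta[:, None]                  # Delta^{-1} U, cost O(N r)
    S = K / sigma_h2 + U.T @ Ud              # r x r matrix, cost O(N r^2)
    # Woodbury: resid^T V^{-1} resid
    t = Ud.T @ resid
    quad = resid @ (resid / delta) - t @ np.linalg.solve(S, t)
    # Matrix determinant lemma: log det V
    logdet = (np.linalg.slogdet(S)[1] - np.linalg.slogdet(K)[1]
              + r * np.log(sigma_h2) + np.log(delta).sum())
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad)

# Toy check against the dense computation.
rng = np.random.default_rng(5)
N, M, r = 200, 1500, 20
W = rng.standard_normal((N, M))
A = W @ W.T / M
knots = rng.choice(N, size=r, replace=False)
U, K = A[:, knots], A[np.ix_(knots, knots)]
d = np.diag(A) - np.einsum("ij,ji->i", U, np.linalg.solve(K, U.T))
resid = rng.standard_normal(N)
ll_fast = lowrank_loglik(resid, U, K, d, sigma_h2=0.7, sigma_e2=0.3)
print(ll_fast)
```

Only the $N \times r$ block, the $r \times r$ block and a length-$N$ diagonal enter the function, which reflects the storage and run-time gains described above.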

3.2.2.4 Choice of knots design and number:

In traditional spatial applications, where the domain $D$ is known and the locations $s_i$ are observed, the knots need not coincide with the data locations. Recommended choices for the knot-set include space-filling designs and lattices (Banerjee et al., 2008). In our case, the locations are artificial constructs to motivate our direct approximation. Hence, restricting the knot set to be a subsample of these hypothetical data locations is necessary to ensure that the direct approximation $\tilde{A}_{PP}$ can be calculated using submatrices of $A$. However, our practice has precedent even in conventional spatial settings. Using some of the data locations has been shown to improve the performance of the predictive process (Banerjee et al., 2008), while related approaches like splines and other basis function expansions also commonly use data locations as knots. We used random subsamples as knots, as the empirical experiments highlighted in section (3.3) demonstrated considerable robustness of the results to the choice of subsample. More informed choices for knots, like cluster representatives from PCA-, GRM-, or dendrogram-based clustering, can also be used.

The choice of the number of subsamples $r$ to be used for PredLMM is more nuanced. The performance of the predictive process is generally more sensitive to the number than to the design of the knots (Banerjee et al., 2008). Increasing $r$ improves the quality of the approximation, with $\tilde{A}_{PP}$ equalling $A$ when $r = N$. However, as the computation is cubic in $r$, use of a very large $r$ would defeat the purpose of the approximation. In practice, estimating $h^2$ for a few increasing choices of $r$ is recommended, stopping when the estimates of $h^2$ no longer change substantially. Parallel computing resources, if available, can be heavily deployed for this step.
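The stopping rule suggested above can be sketched as a short loop. Everything here is an assumption made for illustration: `estimate_h2` stands for any routine that returns a heritability estimate for a given knot size, the tolerance `tol` is an arbitrary choice, and the lookup table simply reuses the PredLMM means reported later in Table 3.3 as a stand-in for real fits.

```python
def choose_knot_size(estimate_h2, candidate_sizes, tol=0.01):
    """Return (r, h2) for the first r at which the estimate moves by less than tol.

    Falls back to the largest candidate if the estimates never stabilize.
    """
    previous = None
    for r in candidate_sizes:
        current = estimate_h2(r)
        if previous is not None and abs(current - previous) < tol:
            return r, current
        previous = current
    return candidate_sizes[-1], previous

# Stand-in for real fits: PredLMM means from Table 3.3 (true h^2 = 0.7).
table_3_3 = {1000: 0.6023, 2000: 0.6206, 4000: 0.6385,
             8000: 0.6534, 16000: 0.6722, 24000: 0.6841}
sizes = sorted(table_3_3)
chosen = choose_knot_size(table_3_3.get, sizes, tol=0.012)
print(chosen)
```

Since the fits for different values of $r$ are independent, they could also be run in parallel and compared afterwards.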

3.3 Results

3.3.1 Simulation using Coalescent Theory

We start by assessing the performance of PredLMM under a setup consistent with genetic coalescent theory (Nordborg, 2004). We consider relatively small values of $N$; case (1): N = 5000 and case (2): N = 8000, with M = 8000 and M = 13,000 SNPs respectively. We assume that there are 4 different sub-populations originating from a common ancestor. Allele frequencies of the SNPs for each distinct sub-population are generated using the Balding-Nichols model (Balding and Nichols, 1995; Price et al., 2006), assuming SNPs to be independent. First, for each SNP $j$, the allele frequency $p_{0j}$ in the ancestral population is drawn from a uniform distribution on [0.1, 0.9]. The allele frequency in each population, $p_{kj}$, is generated from a beta distribution with parameters $p_{0j}(1-\theta_k)/\theta_k$ and $(1-p_{0j})(1-\theta_k)/\theta_k$. In all simulations, we set all $\theta_k$ to a common value, $F_{ST}$. Next, the genotypes of individuals in each population are generated from a binomial distribution $Bin(2, p_{kj})$, assuming Hardy-Weinberg equilibrium. We then randomly select $m_{causal}$ causal SNPs ($W_k^{causal}$) and generate their effects ($u_k$) from $N(0, h^2/m_{causal})$, and the residual effect $e$ is generated from $N(0, (1/h^2 - 1) I_N)$. Finally, the phenotype vector is formed as $Y = \sum_{k=1}^{m_{causal}} W_k^{causal} u_k + e$. We keep heritability at $h^2 = 0.8$. We repeat the procedure 100 times to generate 100 traits and the corresponding genotype matrices. Figure 3.1 shows the Mean Squared Error (MSE) of different methods under the above simulation setup.
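The genotype-generation steps of the Balding-Nichols setup above can be sketched as follows. The sample sizes are toy values and the common $F_{ST}$ value of 0.05 is an assumption, since the value used in the simulations is not stated here; the phenotype step is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pops, n_per_pop, M = 4, 250, 500   # toy sizes
fst = 0.05                           # assumed common value of theta_k

# Ancestral allele frequencies, one per SNP.
p0 = rng.uniform(0.1, 0.9, size=M)

# Sub-population allele frequencies from the Balding-Nichols beta distribution.
a = p0 * (1 - fst) / fst
b = (1 - p0) * (1 - fst) / fst
pk = rng.beta(a, b, size=(n_pops, M))

# Genotypes under Hardy-Weinberg equilibrium: Bin(2, p_kj) for each
# individual in sub-population k, stacked into one N x M matrix.
G = np.vstack([rng.binomial(2, pk[k], size=(n_per_pop, M)) for k in range(n_pops)])
print(G.shape)
```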

Figure 3.1: The figure compares MSE of different methods for case (1) and (2).

GCTA (Full), GEMMA and Bolt-REML are all GREML based methods. They consider the same LMM from equation 3.2 and implement different REML algorithms, which is why they display nearly identical, and the lowest, MSE values. GREML (500) and GREML (2000) respectively refer to fitting GREML with a random subsample of size 500 and 2000 of the full data. We used our own Python module to fit these random subsample based GREMLs (we have verified that the results match existing software like GCTA). PredLMM (500) and PredLMM (2000) refer to fitting PredLMM with a subsample (set of knots) of size 500 and 2000 respectively. GREML (500) performs very poorly compared to PredLMM (500) in both cases, with the latter having a much lower MSE. GREML (2000) performs better but still has a higher MSE than PredLMM (2000). Table 3.1 lists the mean of the estimates of different methods. We notice severe underestimation by GREML (500) and a very slight underestimation by PredLMM (500).

Table 3.1: Mean comparison of different methods for two cases: Case (1) and Case (2) with true h2 = 0.8

           GCTA    GEMMA   Bolt-REML  GREML (500)  GREML (2000)  PredLMM (500)  PredLMM (2000)
Case (1)   0.7928  0.7928  0.7934     0.642        0.798         0.788          0.7915
Case (2)   0.7974  0.7978  0.7979     0.681        0.7942        0.787          0.795

So, this simulation exercise replicates a scenario in which Assumption 1 from section (3.2.2.1) holds, i.e., every individual from a particular sub-population originates from a common ancestor and those ancestors, in turn, originate from a common ancestor (an evolutionary structure like a phylogenetic tree). We see that under this assumption, even with a very small set of knots (subsample), PredLMM can yield a highly robust estimate of heritability, as the theory also suggests.

3.3.2 Simulation using UK Biobank data

To replicate more realistic scenarios, we next move on to simulations using the real UK Biobank cohort data (Allen et al., 2014). UK Biobank is a large long-term biobank study in the United Kingdom which is investigating the respective contributions of genetic predisposition and environmental exposure to the development of various diseases. We have access to data on 784,256 markers and multiple phenotypes for 502,628 individuals. The population is of predominantly British ethnicity (442,687); a few other major ethnicities are Irish (13,213), Other White (16,340), Asian (9839) and Black (8038). People of different ethnicities show clear genetic differences, as we can see from Figure 3.2. There appear to be clear subclusters based on ethnicity in the UK Biobank population. Further subtle substructures can be found by applying sophisticated clustering algorithms on the principal components of the genetic data, as explored in Galinsky et al. (2016).

Figure 3.2: The figure plots the pairwise principal components of the genetic data of the individuals from the UK Biobank cohort along with their self-reported ancestries.

Keeping these substructures in mind, we want to look into a diverse group of simulations using the real genetic data from the UK Biobank study, replicating many possible realistic scenarios. After standard quality control steps as advised in Bycroft et al. (2017) (removing SNPs with MAF less than 0.01 and missingness over 10%, removing individuals with a high missing genotype rate), we are left with around 320,000 individuals and 566,000 SNPs. We compute the full Genetic Relationship Matrix ($A$) only once using GCTA.

3.3.2.1 Simulation with 40,000 individuals from a heterogeneous sub-population

The first goal is to investigate a highly heterogeneous sub-population of relatively small size from the whole. So, we create a sub-population of size 50,000 from the entire population by choosing individuals with different ethnicities. We randomly choose 28,000 people with British ethnicity, 9000 with 'Other White' ethnicity, 6000 with Irish ethnicity etc. This sub-population clearly under-represents people with British ethnicity, but our intention here is to see how the methods perform when we have clearly distinct sub-clusters in a population. To simulate $k$ traits, every time we randomly sample 40,000 individuals from the sub-population of 50,000 and generate a multivariate normal random vector as $Y_{40000} \sim N_{40000}(\mu, 70 A_{40000} + 30 I_{40000})$, where $A_{40000}$ is the corresponding GRM. So the true heritability is $h^2 = 70/(70 + 30) = 0.7$. We point out that, unlike in section (3.3.1), instead of assuming that only a percentage of SNPs have a causal effect on the trait, we are now assuming that all the SNPs affect the trait in some way. We are simulating directly using the standard LMM described in equation 3.2. The mean $\mu$ is kept at 0.

We check the performance of PredLMM for different subsample (knot) sizes: 4000, 8000 and 16,000. We also run GCTA on those chosen subsamples (knots) to see how GREML based methods perform on smaller subsamples. We refer to GREML on a random subsample as GREML (sub) from now on. The comparison of Mean Squared Error (MSE) is shown in Figure 3.3, with Table 3.2 listing the mean (and variance) values of the estimates. For the corresponding boxplot, refer to Figure B.1 in the supplementary.
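Simulating a trait directly from the marginal model, as described above, amounts to one draw from a multivariate normal with covariance $70A + 30I$. The sketch below uses a small Gaussian stand-in for the genotype matrix and a dense Cholesky factor; both are assumptions for illustration, and the sizes are far below the 40,000 individuals used in the actual simulation.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 400, 3000
W = rng.standard_normal((N, M))   # stand-in for a standardized genotype matrix
A = W @ W.T / M                   # toy GRM

sigma_h2, sigma_e2 = 70.0, 30.0
V = sigma_h2 * A + sigma_e2 * np.eye(N)

# One draw Y ~ N(0, V) via the Cholesky factor V = L L^T.
L = np.linalg.cholesky(V)
Y = L @ rng.standard_normal(N)

true_h2 = sigma_h2 / (sigma_h2 + sigma_e2)
print(Y.shape, true_h2)
```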

Figure 3.3: The figure compares MSE of GREML (sub) and PredLMM for three different subsample (knot) sizes, 4000, 8000 and 16,000.

Table 3.2: Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes with true h2 = 0.7 and 40,000 individuals

              4k               8k               16k
GREML (sub)   0.7153 (0.0127)  0.6990 (0.0046)  0.7083 (0.0011)
PredLMM       0.6683 (0.0028)  0.6783 (0.0017)  0.7033 (0.0009)

From Table 3.2, we notice a slight underestimation by PredLMM when the subsample (knot) size is smaller. As expected, the bias vanishes as we increase the subsample size. From Figure 3.3, we see that PredLMM has a much better MSE for all the subsample sizes, with the difference being most prominent when the subsample size is the smallest, i.e., 4000. Thus, even using a small set of knots, we achieve a very robust estimate of heritability with PredLMM.

3.3.2.2 Simulation with 80,000 individuals from a homogeneous sub-population

In theory, PredLMM is supposed to perform well when there are distinct subclusters in the population. In the simulation of the last section, we had created a sub-population from the entire data ensuring the existence of subclusters based on ethnicity. Next, we want to check PredLMM's performance when the sub-population is relatively homogeneous with no apparent subclusters. So, we create a sub-population of 120,000 individuals with only British ancestry from the available pool of 442,687.

We consider two different values of heritability: (a) $h^2 = 0.7$ and (b) $h^2 = 0.2$. To generate the traits under case (a), from the sub-population of size 120,000, every time we randomly sample 80,000 people and simulate a trait as $Y_{80,000} \sim N_{80,000}(\mu, 70 A_{80,000} + 30 I_{80,000})$. Thus, the true heritability in this case is $h^2 = 70/(70 + 30) = 0.7$. For case (b), we simulate as $Y_{80,000} \sim N_{80,000}(\mu, 50 A_{80,000} + 200 I_{80,000})$. Thus, the true heritability in this case is $h^2 = 50/(50 + 200) = 0.2$. $A_{80,000}$ is the corresponding GRM. The mean $\mu$ is kept at 0.

We compare PredLMM with GREML (sub) for six different subsample sizes: 1000, 2000, 4000, 8000, 16,000 and 24,000. The comparison of MSE is shown in Figure 3.4.


Figure 3.4: The figure compares MSE of GREML (sub) and PredLMM for five different subsample sizes under case (a) (top) and case (b) (bottom).

We notice that PredLMM, even in this supposedly homogeneous sub-population, has a much better MSE compared to subsample based GREML. However, as the subsample size (knot size) increases in case (a), we notice that GREML catches up with PredLMM in terms of MSE. But in case (b), PredLMM always has a much better MSE compared to subsample based GREML. Tables 3.3 and 3.4 respectively list the mean (and variance) of the estimates by the different methods.

Table 3.3: Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (a) with true h2 = 0.7.

              1k                2k                4k                8k                16k                24k
GREML (sub)   0.6783 (0.1012)   0.6701 (0.03328)  0.6957 (0.01106)  0.6959 (0.00211)  0.6994 (0.00057)   0.6985 (0.0002)
PredLMM       0.6023 (0.00074)  0.6206 (0.00029)  0.6385 (0.00025)  0.6534 (0.00010)  0.6722 (0.000044)  0.6841 (0.00004)

Table 3.4: Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (b) with true h2 = 0.2.

              1k                2k                4k                8k                 16k               24k
GREML (sub)   0.4148 (0.09928)  0.2392 (0.05251)  0.1931 (0.01371)  0.1923 (0.00229)   0.1919 (0.00055)  0.1983 (0.00042)
PredLMM       0.2524 (0.00035)  0.2355 (0.00056)  0.2263 (0.00016)  0.2192 (0.000016)  0.2139 (0.00011)  0.2097 (0.00008)

With PredLMM, there is a slight downward bias in the mean of the estimates in case (a), i.e., when the true heritability is high ($h^2 = 0.7$), and a slight upward bias in case (b), i.e., when the true heritability is low ($h^2 = 0.2$). In both cases, the bias diminishes as the subsample size increases. But the main take-away is that even with a subsample size (knot size) of just 1000 (1.25% of 80,000), we can get a reliable estimate of heritability by PredLMM, whereas GREML on a subsample of the same size yields a severely biased estimate, especially in case (b). Figures B.2 and B.3 in the supplementary show the boxplots corresponding to case (a) and case (b).
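The relation between the tabulated means and variances and the MSE shown in the figures can be illustrated with the usual decomposition MSE = bias² + variance. The sketch below applies it to the 1k column of Table 3.4; it assumes the reported variance is the average squared deviation across replicates, so the result may differ slightly from the plotted MSE if a different variance convention was used.

```python
def mse(mean_estimate, variance, truth):
    """Mean squared error from the bias-variance decomposition."""
    return (mean_estimate - truth) ** 2 + variance

# Case (b), subsample size 1000, values read from Table 3.4 (true h^2 = 0.2).
greml_mse = mse(0.4148, 0.09928, 0.2)
predlmm_mse = mse(0.2524, 0.00035, 0.2)
print(round(greml_mse, 5), round(predlmm_mse, 5))
```

Under this reading, most of the GREML (sub) error at small subsample sizes comes from its variance, while the PredLMM error is dominated by a small bias.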

3.3.2.3 Simulation with a larger heterogeneous sub-population that is better representative of the entire population

So far we have looked into two sub-populations of moderate sizes, with one of them being highly heterogeneous and the other being mostly homogeneous in terms of ethnicity. Next, we focus on a larger sub-population that takes into account all the ethnicities, keeping in mind their relative proportions in the entire population. We randomly select 120,000 individuals with British ancestry and 37,000 people with other ancestries (Asian, Black, Irish etc.), creating a mixed sub-population of 157,000. This sub-population is expected to replicate the full data and its characteristics better.

We consider two different values of heritability: (a) $h^2 = 0.6$ and (b) $h^2 = 0.2$. To generate the traits under case (a), every time we randomly sample 100,000 people and simulate a trait as $Y_{100,000} \sim N_{100,000}(\mu, 60 A_{100,000} + 40 I_{100,000})$, where $A_{100,000}$ is the corresponding GRM. So the true heritability is $h^2 = 60/(60 + 40) = 0.6$. For case (b), we simulate as $Y_{100,000} \sim N_{100,000}(\mu, 50 A_{100,000} + 200 I_{100,000})$. Thus, the true heritability in this case is $h^2 = 50/(50 + 200) = 0.2$. The mean $\mu$ is kept at 0.

We compare PredLMM with GREML (sub) for six different subsample sizes: 1000, 2000, 4000, 8000, 16,000 and 24,000. The comparison of MSE is shown in Figure 3.5. Tables 3.5 and 3.6 respectively list the mean (and variance) of the estimates by the methods under case (a) and case (b).


Figure 3.5: The figure compares MSE of GREML (sub) and PredLMM for five different subsample sizes for case (a) (top) and case (b) (bottom).

We notice that PredLMM, similar to the earlier simulations, has a much better MSE compared to subsample based GREML for all the subsample (knot) sizes. The difference is more substantial for lower subsample sizes like 1000, 2000 and 4000. The MSE of both methods decreases rapidly as the subsample size increases. Unlike in the simulation from section (3.3.2.2), subsample based GREML never catches up with PredLMM in terms of MSE, even for larger subsample sizes.

Table 3.5: Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (a) with true h2 = 0.6.

              1k                2k                4k                8k                16k               24k
GREML (sub)   0.6186 (0.15187)  0.6168 (0.05707)  0.6231 (0.01459)  0.5941 (0.0052)   0.6038 (0.00097)  0.5987 (0.00046)
PredLMM       0.5432 (0.00091)  0.5612 (0.00064)  0.5717 (0.0005)   0.5739 (0.00044)  0.5859 (0.00010)  0.5906 (0.00023)

Table 3.6: Mean (Variance) comparison of GREML (sub) and PredLMM with different subsample sizes under case (b) with true h2 = 0.2.

              1k               2k                4k                8k                16k               24k
GREML (sub)   0.2632 (0.0864)  0.1321 (0.02264)  0.1794 (0.01661)  0.1977 (0.00134)  0.2058 (0.00144)  0.2040 (0.00064)
PredLMM       0.2617 (0.001)   0.2447 (0.00067)  0.2363 (0.00025)  0.2247 (0.00015)  0.2166 (0.00007)  0.2112 (0.00005)

As in the previous sections, with PredLMM there is a slight downward bias in the mean of the estimates in case (a), i.e., when the true heritability is high ($h^2 = 0.6$), and a slight upward bias in case (b), i.e., when the true heritability is low ($h^2 = 0.2$). In both cases, the bias vanishes with increasing subsample (knot) size. In case (a), for lower subsample sizes, GREML (sub) shows slight overestimation. In case (b), GREML (sub) shows underestimation in most cases, especially for the subsample of size 2000. Once again, we argue that one can get a reliable and precise estimate of heritability by PredLMM even with a very small subsample (knot) size. The corresponding boxplots shown in Figures B.4 and B.5 in the supplementary strengthen the same argument.

3.3.3 Analysis of real UK Biobank traits

Finally, we estimate the heritability of a few quantitative traits like standing height, weight, BMI, systolic and diastolic blood pressure, and hip and waist circumference, using the mixed sub-population of size 157,000 that we had created in section (3.3.2.3). We adjust for several quantitative covariates (age, squared age and the top 10 principal components of the genetic data) and for categorical covariates like sex. We perform GREML (sub) and PredLMM with four different subsample (knot) sizes: 8000, 16,000, 24,000 and 40,000. We also apply Bolt-REML on the entire sub-population of size 157,000. The estimates by Bolt-REML can be thought of as the expected truth, since it fits REML based on the likelihood corresponding to equation 3.2 with the entire sub-population, i.e., the likelihood corresponding to $Y_{157,000} \sim N_{157,000}(X\beta, \sigma_h^2 A_{157,000} + \sigma_e^2 I_{157,000})$. However, due to time constraints, we only use the secant optimization algorithm in Bolt-REML (not the more accurate trust region optimization that is also implemented in the software). Even so, it takes almost 20 hours for each trait. Figure 3.6 shows a barplot of the heritability estimates by the two methods with different subsample sizes for all the traits. To capture the variability of the estimates, we have repeated the methods with subsample size 8000 ten times with ten different subsamples. Thus, the barplot corresponding to subsample size 8000 consists of 10 points (and an interval showing the 95% CI), whereas the barplots corresponding to the other subsample sizes have one point.


Figure 3.6: The figure shows barplot of the heritability estimates by two methods with different subsample sizes.

We start by observing that the PredLMM estimates have a slight upward bias for all the traits and gradually approach the Bolt-REML estimates (expected truth) as the knot size increases. We saw the same trend in case (b) of the simulation studies in sections (3.3.2.2) and (3.3.2.3). Once again, the PredLMM estimate (with knot size 8000) shows very low variability. So it is reasonable to use PredLMM even with a small knot size (5% of the entire population) for a preliminary heritability analysis of a trait in a large scale study. That estimate can then be made more precise by increasing the knot size, depending on the availability of computing resources. Subsample based GREML, however, is very unreliable with a small subsample size. With a subsample of size 8000, we notice that GREML (sub) yields a highly biased estimate of heritability, especially for traits like Height, DiBP (diastolic blood pressure) and sysBP (systolic blood pressure). The estimate also has very high variance and thus a much wider CI. It should also be noted that, unlike PredLMM, increasing the subsample size does not always ensure more accuracy in the case of GREML (sub). We mostly see no systematic pattern in the GREML (sub) estimates with respect to subsample size (for PredLMM, there is a clear decreasing trend for most of the traits).

3.4 Time Comparison

We ran all the methods on an HP Linux cluster whose nodes have 24 Haswell E5-2680v3 processor cores. Here we list the time taken by the different methods (using a single node with 24 cores) under the simulations from sections (3.3.1) and (3.3.2.3).

Table 3.7: Time comparison of different methods in seconds for the simulation from section (3.3.1) with 5k (8k SNPs) and 8k (13k SNPs) individuals.

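| Individuals | GCTA | GEMMA | Bolt-REML | GREML (500) | GREML (2000) | PredLMM (500) | PredLMM (2000) |
|---|---|---|---|---|---|---|---|
| 5k | 15.5 | 351.07 | 13.25 | 3.41 | 6.7 | 5.398 | 16.77 |
| 8k | 33.5 | 1293.44 | 27.87 | 5.7 | 3.3 | 13.67 | 28.46 |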

Table 3.7 refers to the simulation from section (3.3.1). We see that GCTA and Bolt-REML take a similar amount of time, whereas PredLMM with 500 knots takes around 40% of that. PredLMM with 2000 knots takes time similar to Bolt-REML. The time advantage is more prominent in Table 3.8, which refers to the simulation based on real genetic data from section (3.3.2.3). Because of time constraints, we use the secant optimization algorithm in the Bolt-REML software (not the more accurate trust region optimization that is also implemented in it).

Table 3.8: Time comparison (in minutes) of PredLMM for different knot (subsample) sizes with Bolt-REML under the simulation from section (3.3.2.3).

| Method | Knot size | Time (minutes) |
|---|---|---|
| PredLMM | 1000 | 10 |
| PredLMM | 2000 | 13 |
| PredLMM | 4000 | 14 |
| PredLMM | 8000 | 20 |
| PredLMM | 16000 | 45 |
| PredLMM | 24000 | 75 |
| Bolt-REML | - | 420 |

We see that PredLMM takes just a fraction of the time taken by Bolt-REML, even for a large knot size of 24,000. For knot sizes of 1000, 2000 and 4000, the times taken by PredLMM are very similar. There is a large jump in the time taken by PredLMM when going from a knot size of 8000 to 16,000 and beyond. Recall that the per iteration computational complexity of PredLMM is $O(Nr^2 + r^3)$, i.e., the complexity is cubic in the knot size ($r$), which explains the jump in time taken. One may argue that it would be wise to use just 8000 knots, since this yields a reasonable estimate in a very reasonable time. We should also mention that we used a pre-computed GRM (using GCTA) in all these analyses (we computed the GRM for the entire population and used its sub-matrices as necessary). Computing the GRM is an arduous task that can take multiple hours, depending on the number of SNPs and the number of individuals. However, it is usually of less concern, since the computation is done only once and the computed GRM can then be used in multiple analyses. Bolt-REML does not use a pre-computed GRM and works with the original genetic data for each trait, which would be very time consuming for hundreds of traits.
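To make the scaling argument concrete, the following minimal sketch evaluates the leading-order operation count $Nr^2 + r^3$ for the knot sizes of Table 3.8 with $N = 157{,}000$. The function name `predlmm_ops` is hypothetical, constants and memory effects are ignored, and the counts are not meant to reproduce the measured timings.

```python
def predlmm_ops(n, r):
    """Leading-order per-iteration operation count N*r^2 + r^3 (constants ignored)."""
    return n * r**2 + r**3


N = 157_000  # sub-population size used in the text
knots = [1000, 2000, 4000, 8000, 16000, 24000]  # knot sizes from Table 3.8

base = predlmm_ops(N, knots[0])
for r in knots:
    ops = predlmm_ops(N, r)
    # Report the growth relative to r = 1000 and the share of the cubic term.
    print(f"r={r:>6}: ops={ops:.3e}, "
          f"growth vs r=1000: {ops / base:.1f}x, "
          f"r^3 share: {r**3 / ops:.1%}")
```

Under this rough count, doubling the knot size at least quadruples the per iteration work, and the share contributed by the $r^3$ term grows with $r$.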

3.5 Discussion

The Genome-based Restricted Maximum Likelihood (GREML) approach for estimating heritability has become widely popular with the advent of large scale cohort studies. But most of the existing software packages implementing GREML, such as Fast-LMM, GEMMA, GCTA and Bolt-REML, either take excessive time or fail entirely when the number of individuals ($N$) is too large. In this paper, we have developed a substantially faster algorithm for estimating heritability, named PredLMM, which approximates the likelihood corresponding to GREML. The approximation is achieved by unifying the concepts of genetic coalescence and Gaussian Predictive Process models. The algorithm reduces the usual per iteration computational complexity from $O(N^3)$ to $O(Nr^2 + r^3)$, where $r$ (the knot size) is much smaller than $N$.

From the simulation study of section (3.3.1), we have seen that in the presence of genetic coalescence, PredLMM yields a highly robust estimate of heritability even with a small knot size ($r$). To replicate more realistic scenarios, we next performed a set of simulation exercises using the real genetic data from the UK Biobank cohort study. We checked the performance of PredLMM in three cases: a highly heterogeneous sub-population (see section (3.3.2.1)), a highly homogeneous sub-population (see section (3.3.2.2)) and a moderately heterogeneous sub-population more representative of the full data (see section (3.3.2.3)). With small knot sizes, there is a slight downward bias in the PredLMM estimates when the true heritability is high and an upward bias when the true heritability is low. However, the bias vanishes gradually as we increase the knot size. The bottom line is that even with a very small knot size (10% of $N$), one can achieve an extremely reliable estimate of heritability (with very low variance). We also estimated the heritability of some quantitative traits, namely Standing Height, Weight, BMI, Diastolic and Systolic blood pressure, and Hip and Waist circumference, with 157,000 individuals from the UK Biobank data.

Our future goal is to analyze the entire UK Biobank data with some of the more interesting behavioral traits, like alcohol consumption and CPD (cigarettes smoked per day). An efficient module implementing PredLMM in Python will be released soon. Also, so far we have used GCTA for computing the GRM ($A$), which takes up $O(N^2)$ storage and costs too much time. PredLMM does not actually need the full GRM but only a sub-section of it, requiring storage space of just $O(Nr^2)$. Therefore, in order to use fewer resources and further speed up the analysis of large scale data, we would like to incorporate into our module the feature of computing just the required section of the GRM.

Chapter 4

Multivariate Association Analysis of Correlated Traits in Related Individuals

4.1 Introduction

Comparison of discoveries across many Genome Wide Association (GWA) studies for different traits indicates that pleiotropy is a common phenomenon (Pendergrass et al., 2013; Pickrell et al., 2015; Visscher and Yang, 2016). Indeed, it is increasingly common to analyze multiple correlated traits to improve power by exploiting their common genetic basis. Combined analyses of multiple correlated traits have demonstrated better power to detect genetic variants that could not be detected within a univariate framework (Li et al., 2015; Ellinghaus et al., 2016). Further, analyses of genetic correlation using genome-wide panels of SNPs have identified groups of traits that are likely to share many underlying genetic variants of small effects (Lee et al., 2013; Chen et al., 2014; Bulik-Sullivan et al., 2015; Ji et al., 2017). Here we propose a method that takes advantage of genotypic and phenotypic correlations to construct a more powerful association test.

In the context of unrelated individuals, many articles have illustrated the advantages of joint analysis of multiple correlated traits over combining results from separate univariate analyses (O'Reilly et al., 2012; Basu et al., 2013; Ried et al., 2014). As discussed by Ray et al. (2016), several such tests broadly fall under the category of Multivariate Analysis of Variance (MANOVA) or, more generally, under the Multivariate Multiple Linear Regression (MMLR) approach. These types of tests explicitly model the correlation between the traits and can be very powerful compared to tests that do not directly model such correlation.

There is a recent surge of large biobank data (Allen et al., 2014; Khoury and Evans, 2015; Gaziano et al., 2016) with genetically related individuals (families and distantly related people). Distant relationship means that apparently independent people also share genetic relatedness due to their evolutionary history (Weir et al., 2006). The biobanks have rich information on multiple correlated traits. It is not straightforward to generalize the MMLR in this context because of an additional mode of dependency ('familial' or 'genetic' relatedness) among the individuals. Researchers often simplify either the familial or the between-trait dependency structure, as in the Kronecker Product (KRC) approach (Sung et al., 2014) and the Rapid Multivariate Multiple Linear Regression (RMMLR) approach (Basu et al., 2013). But we show that such simplifications of the covariance structure can easily bias the association study in a general scenario.

A more general framework known as the Multivariate Linear Mixed Model (MVLMM) has been used by researchers like Korte et al. (2012), Zhou and Stephens (2014) and Ge et al. (2016). The framework involves a large number of variance-covariance components, which makes parameter estimation very difficult, especially when the number of traits ($L$) and the number of individuals ($n$) are large. Some of the current algorithms for fitting a single MVLMM, such as Wombat (Meyer, 2007) and GCTA (Yang et al., 2011), have computational complexity of $O(t_1 n^3 L^3 + t_2 n^3 L^7)$, where $t_1$ is the number of optimization iterations required in an expectation-maximization-like (EM) step and $t_2$ is the number of optimization iterations required in a Newton-Raphson-like (NR) step (Zhou and Stephens, 2012). Theoretically, for $s$ SNPs the total computational complexity of these algorithms will be $O(s(t_1 n^3 L^3 + t_2 n^3 L^7))$, which is not at all feasible. To reduce the computational burden, Multi Trait Mixed Modelling (MTMM), developed by Korte et al. (2012), computes the variance-covariance parameters only once by fitting a single MVLMM without using any marker data. With the estimated variance-covariance parameters, MTMM performs an LRT for each of the SNPs. MTMM can handle only two traits and has computational complexity of $O(t_1 n^3 L^3 + t_2 n^3 L^7 + s n^2 L^2)$, which is still high. On the other hand, Zhou and Stephens (2014) developed an efficient algorithm named Genome-wide Efficient Mixed Model Association (GEMMA) to fit a separate MVLMM for each SNP with much better computational complexity: $O(n^3 + n^2 L + s(n^2 + t_1 n L^2 + t_2 n L^6))$. They mention that the software can handle a reasonably large GWAS (e.g., 50,000 individuals) and a modest number of traits (e.g., 2-10 traits). But we encountered difficulty using GEMMA with around 45,000 individuals and 4 traits in our analysis of UK Biobank data.

In this paper, we develop a rapid SNP-based association test of multiple correlated traits with genetically related individuals. The test is inspired by the MVLMM framework but involves fewer variance-covariance parameters, which helps in alleviating the computational burden. It is based on two steps: in the first step, the genetic dependency for each trait is modeled by univariate Linear Mixed Modeling (LMM) (Kang et al., 2010; Lippert et al., 2011), and in the second step, the between-trait dependency is captured through a Seemingly Unrelated Regression (SUR) model (Zellner, 1962). We name the method Rapid Multiple Phenotype Association Analysis in Related individuals (RMultiPAR). It has computational complexity of $O(L(n^3 + t_3 n + s n^2))$, where $t_3$ is the number of optimization iterations required for the Newton-Raphson (NR) method. Thus, the complexity is much better than that of MTMM and in some cases better than that of GEMMA. We also derive the theoretical connection of RMultiPAR with the existing methods and discuss their equivalence in particular scenarios. We study the performance of RMultiPAR in terms of power and type 1 error through extensive simulations. We have analyzed the monozygotic twins and full sibling pairs from the UK Biobank data to test SNP-based association with four anthropometric traits: standing height, weight, hip circumference and waist circumference. Code for implementing RMultiPAR in R can be found here.

4.2 Material and Method

Let $y_{lij}$ denote the measured phenotype $l$ on individual $j$ in family $i$ ($i = 1, \ldots, m$; $j = 1, \ldots, n_i$; $\sum_i n_i = n$; and $l = 1, \ldots, L$). Let $G_{ij}$ denote the additive genetic score of a SNP with alleles 'A' and 'a' for individual $j$ in the $i$-th family. When there are only distantly related people in the data, we have $m = 1$ and $n_1 = n$, i.e., the scenario can be thought of as having a single large family. Let $Y_{li\cdot} = (Y_{li1}, \ldots, Y_{lin_i})^T$, $Y_l = [Y_{l1\cdot}^T, \ldots, Y_{lm\cdot}^T]^T$ and $G_{i\cdot} = (G_{i1}, \ldots, G_{in_i})^T$.

Suppose there are $s$ SNPs in total. Let $C_i$ be an $n_i \times p$ matrix with $p$ covariates corresponding to the individuals of family $i$. Throughout the paper, $1_r$ and $I_r$ respectively denote the vector of all 1's of length $r$ and the identity matrix of dimension $r$. Define the matrices $X_i = [1_{n_i} \;\; C_i \;\; G_{i\cdot}]$ and $X = [X_1^T, \ldots, X_m^T]^T$.

4.2.1 Existing Methods

We divide the existing methods into two groups: 1) methods that analyze the traits independently, i.e., do not model the cross-trait dependency, and combine the results for joint inference, and 2) methods that analyze the traits jointly by modeling the cross-trait dependency among them.

4.2.1.1 Combination tests that do not model the cross trait covariance

For trait $l$, consider the following linear regression model,

$$Y_l = X\boldsymbol{\beta}_l + \boldsymbol{\epsilon}_l, \qquad \boldsymbol{\beta}_l = (\alpha_l, \boldsymbol{\beta}_{lc}^T, \beta_l)^T \qquad (4.1)$$

where $\alpha_l$ is the intercept, $\beta_l$ is the additive fixed effect of the SNP, $\boldsymbol{\beta}_{lc}$ is a length $p$ vector of fixed effects corresponding to the $p$ covariates, and $\boldsymbol{\epsilon}_l$ is the vector of random errors. If the individuals were independent, the covariance matrix could simply be modeled as $V_l = \mathrm{cov}(\boldsymbol{\epsilon}_l) = \sigma_l^2 I_n$. But with related individuals, the standard approach is to consider

$$V_l = \sum_{k=1}^{K} \sigma_{kll} C_k + \sigma_{Ell} I_n \qquad (4.2)$$

where the $\sigma_{kll}$'s are the variance components, the $C_k$'s are $n \times n$ relationship matrices and $\sigma_{Ell}$ is the unique environmental variance component. In most cases the number of variance components is $K = 1$, and $\sigma_{1ll}$ is termed the additive genetic component. $C_1$ is the GRM, either derived using known familial relationships (Neale et al., 1994; Almasy and Blangero, 1998) or estimated from a large number of genetic variants when the relationships are unknown (Kang et al., 2010; Yang et al., 2011; Zhou and Stephens, 2012). In family studies (and twin studies), usually $K = 2$; $\sigma_{2ll}$ is termed the shared environmental variance component (Neale et al., 1994) and $C_2$ is a matrix that reflects household sharing: for subjects $j, j'$ in the same family $i$, the corresponding element of $C_2$ is 1. The framework can also be thought of as an LMM with the variance components implicitly corresponding to some random effects.

To test the null hypothesis $H_{0l}: \beta_l = 0$, software such as EMMA, EMMAX (Kang et al., 2010), GCTA-MLMA (Yang et al., 2011) and the R package RFGLS (Li et al., 2011) can be used. Let $p_l$ denote the corresponding p-value. These types of analyses can save computational time, but not modeling the cross-trait covariance can greatly reduce the power, as we will see in the simulations.

MinP Test and Fisher's test: The MinP test statistic is based on the minimum of adjusted p-values, where the adjustment is usually done by Bonferroni's method to account for multiple testing. The statistic is given by $p_{\min} = \min_{l=1}^{L} L p_l$. Under $H_0: \cap_{l=1}^{L} H_{0l}$ and the assumption of independence between the phenotypes, $p_{\min}$ is distributed as the minimum of independent $U(0, 1)$ variables. In the presence of correlation between the traits, this test can be highly conservative (Van der Sluis et al., 2013). Fisher's method combines the logarithmic transformations of the p-values $p_1, \ldots, p_L$. The test statistic is $-2\sum_{l=1}^{L} \log p_l$, which under $H_0$ and the assumption of independent tests has a $\chi^2_{2L}$ distribution. In the presence of strong correlation among traits, the test is anti-conservative, i.e., inflated type 1 error is observed (Basu et al., 2013; Ray et al., 2016).
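The two combination rules can be sketched in a few lines. This is an illustrative sketch only: the function names and the per-trait p-values are hypothetical, and the Fisher p-value uses the closed-form survival function of a chi-square distribution with an even number ($2L$) of degrees of freedom, which avoids any external library.

```python
import math


def minp_bonferroni(pvals):
    """Bonferroni-adjusted minimum p-value: min over l of L * p_l, capped at 1."""
    L = len(pvals)
    return min(1.0, L * min(pvals))


def fisher_combined(pvals):
    """Fisher's statistic -2 * sum(log p_l) and its p-value under chi^2 with 2L df.

    For even degrees of freedom 2L, the chi-square survival function is
    exp(-x/2) * sum_{k=0}^{L-1} (x/2)^k / k!.
    """
    L = len(pvals)
    stat = -2.0 * sum(math.log(p) for p in pvals)
    half = stat / 2.0
    p_comb = math.exp(-half) * sum(half**k / math.factorial(k) for k in range(L))
    return stat, p_comb


# Hypothetical univariate p-values for L = 4 traits at one SNP.
pvals = [0.03, 0.20, 0.65, 0.01]
print("MinP (Bonferroni):", minp_bonferroni(pvals))
print("Fisher statistic and p-value:", fisher_combined(pvals))
```

Both rules assume independent tests; as noted above, correlated traits make MinP conservative and Fisher's method anti-conservative.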

PCA of the traits: In this test, the traits are combined instead of the p-values (Galesloot et al., 2014). Consider the first principal component of all the traits and treat it as a new variable, say $Y_{comb}$. Next, fit the following univariate LMM, similar to equations (4.1) and (3.2),

$$Y_{comb} = X\boldsymbol{\beta}_{comb} + \boldsymbol{\epsilon}_{comb}, \qquad \mathrm{cov}(\boldsymbol{\epsilon}_{comb}) = \sigma_{1,comb} C_1 + \sigma_{E,comb} I_n,$$

where $\boldsymbol{\beta}_{comb} = (\alpha_{comb}, \boldsymbol{\beta}_{comb,c}^T, \beta_{comb})^T$, $\alpha_{comb}$ is the intercept, $\beta_{comb}$ is the additive fixed effect of the SNP and $\boldsymbol{\beta}_{comb,c}$ is the vector of fixed effects corresponding to the covariates. $\sigma_{1,comb}$ and $\sigma_{E,comb}$ are respectively the additive genetic and the unique environmental variance components. The hypothesis of interest is $H_{comb,0}: \beta_{comb} = 0$. After computing the first PC of the traits, $Y_{comb}$, the association analysis can be performed using existing software such as EMMA, EMMAX and RFGLS.
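As an illustration of the trait-combination step only, the sketch below computes first principal component scores of an $n \times L$ trait matrix with NumPy. The helper name and the toy data are hypothetical, and the choice to center but not scale the traits is an assumption, since the text does not specify the preprocessing.

```python
import numpy as np


def first_principal_component(Y):
    """Return the n-vector of first PC scores of an n x L trait matrix Y.

    Traits are centered but not scaled (an assumption of this sketch).
    """
    Yc = Y - Y.mean(axis=0)
    # Rows of Vt are the principal axes (loadings), ordered by singular value.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[0]


# Toy data: L = 4 traits driven by one shared factor plus noise.
rng = np.random.default_rng(0)
n, L = 200, 4
shared = rng.normal(size=(n, 1))
Y = shared + 0.5 * rng.normal(size=(n, L))

y_comb = first_principal_component(Y)
print(y_comb.shape)
```

The resulting vector `y_comb` would then play the role of $Y_{comb}$ in a univariate LMM fit.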

4.2.1.2 Methods that analyze the traits jointly

Now we discuss the tests that analyze the traits jointly, taking into account the between-trait dependency.

Mean Model: Consider the following MMLR model,

$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_L \end{pmatrix} = \underbrace{\begin{pmatrix} X & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & X \end{pmatrix}}_{I_L \otimes X} \underbrace{\begin{pmatrix} \boldsymbol{\beta}_1 \\ \vdots \\ \boldsymbol{\beta}_L \end{pmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{pmatrix} \boldsymbol{\epsilon}_1 \\ \vdots \\ \boldsymbol{\epsilon}_L \end{pmatrix}}_{\boldsymbol{\epsilon}} \qquad (4.3)$$

The null hypothesis of interest is $H_0: \cap_{l=1}^{L} H_{0l} : H\boldsymbol{\beta} = \mathbf{0}$, where $H$ is a matrix such that $H\boldsymbol{\beta} = (\beta_1, \ldots, \beta_L)^T$. Even though we are interested in testing the mean parameters, it is important to model the dependency structure among the related individuals as well as the correlation among the traits. Misspecification of the covariance structure could lead to potential bias in the association inference (Kang et al., 2010; Li et al., 2011).

Covariance Model: In a study with unrelated individuals, the joint covariance of $\boldsymbol{\epsilon} = (\boldsymbol{\epsilon}_1^T, \ldots, \boldsymbol{\epsilon}_L^T)^T$ is usually modeled as $V = \mathrm{cov}(\boldsymbol{\epsilon}) = \Sigma \otimes I_n$, where $\Sigma = [\sigma_{ll'}]_{L \times L}$ is a positive definite matrix representing the residual covariance among the traits. So in this setup,

$$V_l = \mathrm{cov}(\boldsymbol{\epsilon}_l) = \sigma_l^2 I_n, \qquad V_{ll'} = \mathrm{cov}(\boldsymbol{\epsilon}_l, \boldsymbol{\epsilon}_{l'}) = \sigma_{ll'} I_n \qquad (4.4)$$

However, for related individuals $V_l$ and $V_{ll'}$ should capture the familial relatedness. The most popular approach to modeling $V_l$ and $V_{ll'}$ (Korte et al., 2012; Zhou and Stephens, 2014) is the following,

$$V_l = \sigma_{1ll} C_1 + \sigma_{Ell} I_n; \qquad V_{ll'} = \sigma_{1ll'} C_1 + \sigma_{Ell'} I_n \qquad (4.5)$$

where $\Sigma_A = [\sigma_{1ll'}]_{L \times L}$ and $\Sigma_E = [\sigma_{Ell'}]_{L \times L}$ are the positive definite matrices corresponding to the additive genetic and unique environmental covariance respectively. We refer to this covariance modeling as the traditional approach throughout the paper. The framework is often referred to as a Multivariate Linear Mixed Model or MVLMM.
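With the traits stacked as in equation (4.3), the blocks of equation (4.5) assemble into $V = \Sigma_A \otimes C_1 + \Sigma_E \otimes I_n$. The following sketch builds this matrix for a toy example and checks one off-diagonal block; the function name, the tiny sibling-pair GRM and the numerical values of $\Sigma_A$ and $\Sigma_E$ are illustrative assumptions.

```python
import numpy as np


def traditional_cov(Sigma_A, Sigma_E, C1):
    """Joint covariance of the stacked traits under equation (4.5).

    Block (l, l') of the result equals Sigma_A[l, l'] * C1 + Sigma_E[l, l'] * I_n.
    """
    n = C1.shape[0]
    return np.kron(Sigma_A, C1) + np.kron(Sigma_E, np.eye(n))


# Toy GRM: 3 full-sibling pairs (1 on the diagonal, 0.5 within a pair).
C1 = np.kron(np.eye(3), np.array([[1.0, 0.5], [0.5, 1.0]]))
n = C1.shape[0]

# Hypothetical genetic and environmental trait covariance matrices (L = 2).
Sigma_A = np.array([[48.0, 24.0], [24.0, 48.0]])
Sigma_E = np.array([[12.0, 3.0], [3.0, 12.0]])

V = traditional_cov(Sigma_A, Sigma_E, C1)
V_12 = V[:n, n:]  # cross-trait block between trait 1 and trait 2
print(np.allclose(V_12, Sigma_A[0, 1] * C1 + Sigma_E[0, 1] * np.eye(n)))
print(np.linalg.eigvalsh(V).min() > 0)  # positive definiteness check
```

The number of free parameters in $\Sigma_A$ and $\Sigma_E$ grows quadratically in $L$, which is one source of the computational burden discussed next.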

Methods following the traditional approach: Performing an association analysis with multiple traits using the traditional approach can be computationally very expensive. Here we discuss the methods that follow the traditional covariance modeling but use techniques to speed up the computation for estimation and inference. We denote the vector of all the variance-covariance parameters by $\boldsymbol{\sigma}$, so that $V$ can be thought of as a function of $\boldsymbol{\sigma}$: $V \equiv V(\boldsymbol{\sigma})$.

Likelihood based methods: Denote the log likelihood corresponding to the model in equation (4.3) with the covariance structure described in equation (4.5) by $l(\boldsymbol{\beta}, \boldsymbol{\sigma})$. On a genome-wide scale, finding the MLEs of $\boldsymbol{\beta}$ and $\boldsymbol{\sigma}$ jointly for each SNP becomes computationally very demanding, especially when $L$ or $n$ is large. That is why approximate methods like EMMAX (Kang et al., 2010) and GCTA-MLMA (Yang et al., 2011) in the univariate context ($L = 1$) ease the computational burden by estimating $\boldsymbol{\sigma}$ (as $\hat{\boldsymbol{\sigma}}$) only once without using any SNP data. Then, for each SNP, $l(\boldsymbol{\beta}, \hat{\boldsymbol{\sigma}})$ is used to conduct a Likelihood Ratio Test (LRT). In the bivariate context ($L = 2$), MTMM (Korte et al., 2012) follows a similar procedure, where $\hat{\boldsymbol{\sigma}}$ is first estimated by a likelihood based approach without any SNP data. Next, for each SNP an F-test is constructed considering $l(\boldsymbol{\beta}, \hat{\boldsymbol{\sigma}})$. This method has computational complexity of $O(t_1 n^3 L^3 + t_2 n^3 L^7 + s n^2 L^2)$, where $t_1$ is the number of optimization iterations required in the EM step and $t_2$ is the number of optimization iterations required in the NR step. Clearly, even a moderately large value of $L$ would make this method computationally infeasible. Zhou and Stephens (2014) greatly improve the computational complexity in their method named GEMMA, which has computational complexity of $O(n^3 + n^2 L + s(n^2 + t_1 n L^2 + t_2 n L^6))$. An additional novelty of GEMMA is that, unlike MTMM, it is an exact test, which means that $\boldsymbol{\beta}$ and $\boldsymbol{\sigma}$ are jointly estimated for each SNP by maximizing $l(\boldsymbol{\beta}, \boldsymbol{\sigma})$.

Moment based approach: In a study of variance components estimation with multiple traits and related individuals, Ge et al. (2016) proposed a moment-based method of estimating $\boldsymbol{\sigma}$ with computational complexity of $O(n^2 L + n L^2)$. The authors do not use the estimate $\hat{\boldsymbol{\sigma}}_{MM}$ in association studies. But one can get a Feasible Generalized Least Squares (FGLS) estimate (Magnus, 1978) of $\boldsymbol{\beta}(\boldsymbol{\sigma})$ as

$$\hat{\boldsymbol{\beta}}(\hat{\boldsymbol{\sigma}}_{MM}) = \left(X'^T V(\hat{\boldsymbol{\sigma}}_{MM})^{-1} X'\right)^{-1} X'^T V(\hat{\boldsymbol{\sigma}}_{MM})^{-1} Y, \qquad \mathrm{cov}\left(\hat{\boldsymbol{\beta}}(\hat{\boldsymbol{\sigma}}_{MM})\right) = \left(X'^T V(\hat{\boldsymbol{\sigma}}_{MM})^{-1} X'\right)^{-1}.$$

With $\hat{\boldsymbol{\beta}}(\hat{\boldsymbol{\sigma}}_{MM})$, a Wald test can be performed to test the null hypothesis. The testing step has computational complexity of $O(s n^2 L^2)$ for $s$ SNPs. Thus, the entire approach would be computationally much more feasible than the likelihood based methods. But one must be aware that a moment-based estimator can often be heavily biased compared to a maximum likelihood estimate (Robertson and Fryer, 1970).

4.2.1.3 Tests that simplify the covariance structure

Here we discuss a few tests that fit MVLMMs with much simpler modeling of $V_l$ and $V_{ll'}$ than the traditional approach. These tests were developed specifically for analyzing family data and are not suitable for a set of distantly related people. With family data, $V_l$ and $V_{ll'}$ are block-diagonal matrices with, respectively, the $V_{li} = \mathrm{cov}(\boldsymbol{\epsilon}_{li\cdot})$'s and the $V_{ll'i} = \mathrm{cov}(\boldsymbol{\epsilon}_{li\cdot}, \boldsymbol{\epsilon}_{l'i\cdot})$'s on the diagonal.

KRC approach: Sung et al. (2014) model the covariance structure of a particular family $i$ as $V_{li} = \sigma_l^2 \Omega_{family}$ and $V_{ll'i} = \sigma_{ll'} \Omega_{family}$, where $\Omega_{family}$ is an unknown compound symmetric correlation matrix. So the approach does not take into account the GRM and assumes the same correlation among all the individuals, an assumption known not to be realistic (Burkett et al., 2015). For example, siblings would be more genetically correlated than second-degree relatives.

Rapid Multivariate Multiple Linear Regression (RMMLR): Basu et al. (2013) model the $V_l$'s as in equation (4.2). They do not explicitly discuss modeling $V_{ll'}$. But it is implicit that the $V_l$'s and $V_{ll'}$'s must have a very special form, namely $V_l = a_l^2 M$ and $V_{ll'} = \rho_{ll'} a_l a_{l'} M$, with $a_l > 0$ and $|\rho_{ll'}| < 1$ for $l \neq l'$ and $M$ being a positive definite matrix. This is a very stringent assumption, and its failure to hold can easily bias the association inference, as will be seen in our simulation studies.

Generalized Estimating Equation: The Generalized Estimating Equation approach, commonly known as GEE (Ziegler et al., 1998), is a widely used technique in the analysis of correlated traits. GEE estimates are obtained by solving the following estimating equation,

$$U(\boldsymbol{\beta}) = \sum_{i=1}^{m} \frac{\partial \mu_i}{\partial \boldsymbol{\beta}} M_i^{-1} (Y_{\cdot i \cdot} - \mu_i)$$

where $Y_{\cdot i \cdot} = (Y_{1i1}, \cdots, Y_{Li1}, Y_{1i2}, \cdots, Y_{Li2})^T$ and $\mu_i = [(X_i \boldsymbol{\beta}_1)^T, \ldots, (X_i \boldsymbol{\beta}_L)^T]^T$. One can consider unstructured working correlation structures ($M_i$) for different family types. So this approach too, like KRC, does not consider any known or estimated genetic relationship values. After obtaining the mean parameter estimates and their covariance matrix, a Wald test statistic can be constructed.

4.2.2 Proposed Method

4.2.2.1 Model:

We consider the mean model from equation (4.3). We model the $V_l$'s in the same way as the traditional approach, i.e., $V_l = \sigma_{1ll} C_1 + \sigma_{Ell} I_n$. But we model the cross-trait covariance $V_{ll'}$ in a different way:

$$V_{ll'} = \rho_{ll'} V_l^{1/2} V_{l'}^{1/2} \qquad (4.6)$$

Unlike the traditional modeling of $V_{ll'}$ in equation (4.5), we do not separately consider the cross-trait genetic dependency ($\sigma_{1ll'}$) and the cross-trait unique environmental dependency ($\sigma_{Ell'}$) components. Instead, we use a single correlation parameter ($\rho_{ll'}$) to capture the overall cross-trait dependency. This not only reduces the number of variance-covariance parameters to be estimated but also enables us to develop a rapid two-stage estimation strategy, which is discussed in the next section. We show that the resultant $V = \mathrm{cov}(Y)$ under our covariance assumption is positive semi-definite, as required for it to be a valid covariance matrix. We also derive the theoretical connection of our cross-trait assumption with the traditional one and discuss their equivalence in certain scenarios. The details can be found in Web Appendix A of the Supplementary Information. It can also be observed that RMMLR's covariance assumption is a special case of ours.
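A small numerical sketch of the covariance assumption in equation (4.6) is given below. It assembles the full covariance of the stacked traits from per-trait matrices $V_l$ and a trait correlation matrix, and checks symmetry and positive semi-definiteness on a toy example. The function names, the toy GRM and the variance component values are illustrative assumptions, not part of the RMultiPAR code.

```python
import numpy as np


def sym_sqrt(V):
    """Symmetric square root of a positive semi-definite matrix via eigh."""
    w, Q = np.linalg.eigh(V)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T


def proposed_cov(V_list, R):
    """Stacked covariance with diagonal blocks V_l and
    off-diagonal blocks R[l, k] * V_l^{1/2} V_k^{1/2}, as in equation (4.6)."""
    L = len(V_list)
    roots = [sym_sqrt(V) for V in V_list]
    blocks = [[V_list[l] if l == k else R[l, k] * (roots[l] @ roots[k])
               for k in range(L)] for l in range(L)]
    return np.block(blocks)


# Toy GRM: 3 full-sibling pairs.
C1 = np.kron(np.eye(3), np.array([[1.0, 0.5], [0.5, 1.0]]))
n = C1.shape[0]

# Hypothetical per-trait variance components (L = 2) and trait correlation.
V1 = 48.0 * C1 + 12.0 * np.eye(n)
V2 = 20.0 * C1 + 30.0 * np.eye(n)
R = np.array([[1.0, 0.8], [0.8, 1.0]])

V = proposed_cov([V1, V2], R)
print(np.allclose(V, V.T))
print(np.linalg.eigvalsh(V).min() >= -1e-8)
```

Because every $V_l$ here is a linear combination of $C_1$ and $I_n$, the square roots share eigenvectors and commute, so the cross-trait block is itself symmetric.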

4.2.2.2 Estimation and Inference:

Our estimation strategy is based on two steps. In the first step, we consider each trait separately and find the marginal Restricted Maximum Likelihood Estimate (REMLE) of $V_l$ from the following model:

$$Y_l = 1_n \alpha_l + C\boldsymbol{\beta}_{lc} + \boldsymbol{\epsilon}_l \qquad (4.7)$$

Let $\hat{V}_l^{RF}$ be the REMLE of $V_l$ and $(\hat{V}_l^{RF})^{-1/2}$ be a square root of the matrix $(\hat{V}_l^{RF})^{-1}$. Next, transform the data as $Y_l^* = (\hat{V}_l^{RF})^{-1/2} Y_l$ and $X_l^* = (\hat{V}_l^{RF})^{-1/2} X$. When $\hat{V}_l^{RF}$ is a very good approximation of the true $V_l$, $\mathrm{cov}(Y_l^*) = (\hat{V}_l^{RF})^{-1/2} V_l (\hat{V}_l^{RF})^{-1/2} \approx I_n$. The covariance between the transformed vectors is

$$V_{ll'}^{**} = \mathrm{cov}(Y_l^*, Y_{l'}^*) = \mathrm{cov}\left((\hat{V}_l^{RF})^{-1/2} Y_l,\; (\hat{V}_{l'}^{RF})^{-1/2} Y_{l'}\right) \approx V_l^{-1/2} V_{ll'} V_{l'}^{-1/2} = \rho_{ll'} I_n \qquad (4.8)$$
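The whitening argument behind equation (4.8) can be checked numerically. The sketch below uses the true $V_l$ in place of the REML estimate $\hat{V}_l^{RF}$ (an idealization made only for illustration), with a hypothetical toy GRM and variance components, and verifies that the transformed covariance is the identity and the transformed cross-covariance is $\rho_{ll'} I_n$.

```python
import numpy as np


def sym_power(V, power):
    """Symmetric matrix power of a positive definite V via its eigendecomposition."""
    w, Q = np.linalg.eigh(V)
    return (Q * w**power) @ Q.T


# Toy GRM: 3 full-sibling pairs.
C1 = np.kron(np.eye(3), np.array([[1.0, 0.5], [0.5, 1.0]]))
n = C1.shape[0]

# Hypothetical per-trait covariances and cross-trait correlation.
V1 = 48.0 * C1 + 12.0 * np.eye(n)
V2 = 20.0 * C1 + 30.0 * np.eye(n)
rho = 0.8
V12 = rho * (sym_power(V1, 0.5) @ sym_power(V2, 0.5))  # equation (4.6)

# Whitening matrices V_l^{-1/2}; the true V_l stands in for its REML estimate.
W1, W2 = sym_power(V1, -0.5), sym_power(V2, -0.5)

cov_Y1_star = W1 @ V1 @ W1.T      # covariance of the transformed trait 1
cross_cov_star = W1 @ V12 @ W2.T  # cross-covariance of the transformed traits

print(np.allclose(cov_Y1_star, np.eye(n)))
print(np.allclose(cross_cov_star, rho * np.eye(n)))
```

In practice the quality of this approximation depends on how close the first-step estimate is to the true $V_l$.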

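Using the transformed variables, we write down a Seemingly Unrelated Regression (SUR) model (Zellner, 1962) in the following way,

$$Y^* = \begin{pmatrix} Y_1^* \\ \vdots \\ Y_L^* \end{pmatrix} = \underbrace{\begin{pmatrix} X_1^* & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & X_L^* \end{pmatrix}}_{X^*} \underbrace{\begin{pmatrix} \boldsymbol{\beta}_1 \\ \vdots \\ \boldsymbol{\beta}_L \end{pmatrix}}_{\boldsymbol{\beta}} + \underbrace{\begin{pmatrix} \boldsymbol{\epsilon}_1^* \\ \vdots \\ \boldsymbol{\epsilon}_L^* \end{pmatrix}}_{\boldsymbol{\epsilon}^*} \qquad (4.9)$$

where the $\boldsymbol{\beta}_l$'s are the coefficient vectors defined earlier and the residual error $\boldsymbol{\epsilon}^*$ follows $N_{nL}(0, \Sigma^* \otimes I_n)$, with $\Sigma^* = ((\rho_{ll'}))_{L \times L}$ being a positive definite matrix. Thus we have $E(Y^*) = X^* \boldsymbol{\beta}$ and $\mathrm{cov}(Y^*) = \Sigma^* \otimes I_n$, so the covariance structure of the transformed vectors becomes very similar to equation (4.4), i.e., the case of unrelated individuals. Finding the MLEs of $\boldsymbol{\beta}$ and $\Sigma^*$ jointly may be cumbersome if $n$ and $L$ are large.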

That is why we use an FGLS estimator given by

$$\hat{\boldsymbol{\beta}}_{FGLS} = \left(X^{*T} (\hat{\Sigma}^* \otimes I_n)^{-1} X^*\right)^{-1} X^{*T} (\hat{\Sigma}^* \otimes I_n)^{-1} Y^*, \qquad \mathrm{cov}(\hat{\boldsymbol{\beta}}_{FGLS}) = \left(X^{*T} (\hat{\Sigma}^* \otimes I_n)^{-1} X^*\right)^{-1} \qquad (4.10)$$

where $\hat{\Sigma}^*$ is a consistent estimator of $\Sigma^*$. To obtain $\hat{\Sigma}^*$, we first fit an Ordinary Least Squares (OLS) model to estimate the residuals $\boldsymbol{\epsilon}_l^*$ and then estimate each element of $\Sigma^*$ as $\hat{\sigma}_{ll'}^* = \hat{\boldsymbol{\epsilon}}_l^{*T} \hat{\boldsymbol{\epsilon}}_{l'}^* / n$. For testing $H_0: H\boldsymbol{\beta} = \mathbf{0}$, we consider the Wald test statistic

$$T_M = (H\hat{\boldsymbol{\beta}}_{FGLS})^T \left(H\, \mathrm{cov}(\hat{\boldsymbol{\beta}}_{FGLS})\, H^T\right)^{-1} (H\hat{\boldsymbol{\beta}}_{FGLS}) \qquad (4.11)$$

$T_M$ asymptotically follows a simple $\chi^2_L$ distribution. We name the method Rapid Multiple Phenotype Association analysis in Related individuals (RMultiPAR). It has computational complexity of $O(L(n^3 + t_3 n + s n^2))$. For estimation we use the R packages GENESIS (Conomos et al., 2019) and systemfit (Henningsen et al., 2007).
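The second step can be sketched as follows. This is an illustrative NumPy re-expression of equations (4.10) and (4.11), not the authors' R implementation based on GENESIS and systemfit. The function name `sur_fgls_wald` is hypothetical, the inputs are assumed to be already whitened, and the SNP is assumed to sit in the last column of each design matrix. The toy data are simulated under the null with no relatedness, so no whitening is needed.

```python
import math
import numpy as np


def sur_fgls_wald(Y_star, X_star):
    """FGLS fit of the SUR model (4.9) and Wald statistic (4.11) for the SNP effects.

    Y_star: list of L arrays of shape (n,); X_star: list of L arrays of shape (n, q),
    with the SNP in the last column (an assumption of this sketch).
    """
    L, n, q = len(Y_star), Y_star[0].shape[0], X_star[0].shape[1]
    # Per-trait OLS residuals give the estimate of Sigma*.
    resid = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                             for y, X in zip(Y_star, X_star)])
    Sigma_hat = resid.T @ resid / n
    S_inv = np.linalg.inv(Sigma_hat)
    # Normal equations of (4.10), built block by block so that the
    # nL x nL matrix (Sigma* kron I_n) is never formed.
    A = np.zeros((L * q, L * q))
    b = np.zeros(L * q)
    for l in range(L):
        for k in range(L):
            A[l*q:(l+1)*q, k*q:(k+1)*q] = S_inv[l, k] * (X_star[l].T @ X_star[k])
            b[l*q:(l+1)*q] += S_inv[l, k] * (X_star[l].T @ Y_star[k])
    cov_beta = np.linalg.inv(A)
    beta = cov_beta @ b
    # H picks the SNP coefficient of each trait.
    idx = [l * q + (q - 1) for l in range(L)]
    b_snp = beta[idx]
    T_M = b_snp @ np.linalg.solve(cov_beta[np.ix_(idx, idx)], b_snp)
    return beta, Sigma_hat, T_M


# Toy null data: n unrelated individuals, L = 2 correlated traits, MAF 0.2.
rng = np.random.default_rng(1)
n = 500
G = rng.binomial(2, 0.2, size=n)
X = np.column_stack([np.ones(n), G])
E = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
Y_star = [E[:, 0], E[:, 1]]
X_star = [X, X]

beta, Sigma_hat, T_M = sur_fgls_wald(Y_star, X_star)
p_value = math.exp(-T_M / 2.0)  # chi-square survival function, valid for 2 df only
print(T_M, p_value)
```

When all traits share the same design matrix, as in this toy example, the FGLS estimate coincides with per-trait OLS; the two differ once each trait is whitened with its own $(\hat{V}_l^{RF})^{-1/2}$.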

4.2.2.3 Distribution under Covariance Misspecification:

The asymptotic $\chi^2_L$ distribution of $T_M$ is based on the cross-trait covariance assumption we made in equation (4.6). To be comprehensive, we also derive the asymptotic distribution of $T_M$ when the traditional cross-trait covariance assumption from equation (4.5) holds instead of ours. In such a scenario, $T_M$ follows an $a\chi^2_b$ distribution, where $a$ and $b$ can be estimated from the data. We refer to this version of the test (which uses the $a\chi^2_b$ distribution) as Adjusted RMultiPAR. However, we argue that in most realistic scenarios $a$ and $b$ will be close to 1 and $L$ respectively, meaning that the Adjusted RMultiPAR and the usual RMultiPAR will perform very similarly. We discuss these details in Web Appendix B of the Supplementary Information.

4.3 Results

4.3.1 Simulation Study

We perform extensive simulations to compare our method with the existing ones. Our main purpose is to study how the type 1 error and power of the different approaches are affected under different possible trait-covariance structures. We consider highly genetically correlated samples, since our primary objective is to study how the simplifications of the covariance structure adopted in different methods affect their respective inference. We primarily focus on simulating MZ twins and full siblings, who share genetic correlations of 1 and 0.5 respectively.

4.3.1.1 Simulation under the traditional covariance assumption

First we study how the usual RMultiPAR (with the $\chi^2_L$ distribution) and Adjusted RMultiPAR (with the $a\chi^2_b$ distribution) perform under covariance misspecification, i.e., when the traditional covariance assumption from equation (4.5) holds instead of our assumption from equation (4.6). We compare our methods with the MTMM approach of Korte et al. (2012), which models the covariance in the traditional way. MTMM can handle only two traits and relies on commercial software named ASReml to estimate the variance-covariance parameters. Since we do not have access to ASReml, we use bivariate GCTA-GREML (Yang et al., 2011) to estimate the variance-covariance parameters and then use the estimates to construct the appropriate Wald test for each SNP in R.

Mean model: We consider $L = 2$ traits and $m = 500$ families, each having a pair of individuals, i.e., $n_i = 2$ for all $i$, so the total number of individuals is $n = \sum_i n_i = 2m = 1000$. We simulate the SNP genotype ($G_{ij}$) under Hardy-Weinberg equilibrium, keeping the minor allele frequency (MAF) at 0.2. Considering the mean model from equation (4.3) without any covariates, we simulate the trait $y_{lij}$ under the following mean models: a) the SNP is associated with no trait, i.e., the effect sizes are $\beta_1 = \beta_2 = 0$; b) the SNP is associated with only one trait, i.e., either $\beta_1 = 0$ and $\beta_2 > 0$, or $\beta_1 > 0$ and $\beta_2 = 0$; and c) the SNP is associated with both traits, i.e., $\beta_1 > 0$, $\beta_2 > 0$.
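As an illustration of the genotype simulation described above, the following is a minimal Python sketch. It assumes that MZ twins share a single genotype drawn from Binomial(2, MAF) and that full siblings are generated by random transmission of alleles from two parents whose alleles are drawn under Hardy-Weinberg equilibrium. The function name `simulate_pair_genotypes` and the seed are hypothetical choices made for this sketch and are not taken from the thesis.

```python
import numpy as np

def simulate_pair_genotypes(m, maf, kind, rng):
    """Return an (m, 2) array of minor-allele counts (0/1/2) for m pairs.

    kind = "MZ":  both members share one genotype drawn under HWE.
    kind = "sib": two children of parents whose alleles are drawn under HWE.
    """
    if kind == "MZ":
        g = rng.binomial(2, maf, size=m)           # HWE: Binomial(2, MAF)
        return np.column_stack([g, g])
    if kind == "sib":
        # Parental alleles: shape (m families, 2 parents, 2 alleles).
        parents = rng.binomial(1, maf, size=(m, 2, 2))
        fam = np.arange(m)[:, None]
        par = np.arange(2)[None, :]
        sibs = []
        for _ in range(2):
            # Each parent transmits one of its two alleles at random.
            pick = rng.integers(0, 2, size=(m, 2))
            sibs.append(parents[fam, par, pick].sum(axis=1))
        return np.column_stack(sibs)
    raise ValueError("kind must be 'MZ' or 'sib'")

rng = np.random.default_rng(2020)
G_mz = simulate_pair_genotypes(500, 0.2, "MZ", rng)
G_sib = simulate_pair_genotypes(500, 0.2, "sib", rng)
```

Under this scheme the genotype correlation within a sibling pair is 0.5 in expectation, which matches the off-diagonal entry of the sibling GRM block used in the covariance model.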

Covariance model: We consider different cases with the traditional covariance structure from equation (4.5).

1. In the first two cases, we consider all the pairs to be identical (MZ) twins with the covariance structure

$$V_l = 48\,C_1^{MZ} + 12\,I_{1000}, \qquad V_{ll'} = \rho\,48\,C_1^{MZ} + \frac{\rho}{2}\,12\,I_{1000},$$

where $\rho = 0.5$ in case (a) and $\rho = 0.8$ in case (b). $C_1^{MZ}$ is the known Genetic Relationship Matrix (GRM), which is block diagonal with each block being a $2 \times 2$ matrix of all 1's.

2. In the next two cases, we consider all the pairs to be full siblings with the covariance structure

$$V_l = 48\,C_1^{sib} + 12\,I_{1000}, \qquad V_{ll'} = \rho\,48\,C_1^{sib} + \frac{\rho}{2}\,12\,I_{1000},$$

where $\rho = 0.5$ in case (c) and $\rho = 0.8$ in case (d). $C_1^{sib}$ is the known GRM, which is block diagonal with each block being a $2 \times 2$ matrix with 1's on the diagonal and 0.5 on the off-diagonal.
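The covariance cases above can be sketched in Python as follows. The sketch assumes that the two trait vectors are stacked as $(y_1, y_2)$, that the environmental cross-covariance is $(\rho/2)\cdot 12\, I$ as in the structure above, and that traits are drawn from a multivariate normal distribution with zero mean. The helper names `pair_grm` and `traditional_cov` are hypothetical, and a small number of families is used only to keep the dense matrices cheap.

```python
import numpy as np

def pair_grm(m, r):
    """Block-diagonal GRM for m pairs with within-pair relatedness r."""
    block = np.array([[1.0, r], [r, 1.0]])
    return np.kron(np.eye(m), block)

def traditional_cov(C, rho, sg2=48.0, se2=12.0):
    """Covariance of the stacked vector (y_1, y_2) for L = 2 traits."""
    n = C.shape[0]
    V = sg2 * C + se2 * np.eye(n)                        # V_l
    V12 = rho * sg2 * C + (rho / 2) * se2 * np.eye(n)    # V_{ll'}
    return np.block([[V, V12], [V12, V]])

m = 50                                  # the simulation study uses m = 500
C_sib = pair_grm(m, 0.5)                # use r = 1.0 for MZ pairs
Sigma = traditional_cov(C_sib, rho=0.8)

rng = np.random.default_rng(1)
y = rng.multivariate_normal(np.zeros(2 * 2 * m), Sigma)
y1, y2 = y[: 2 * m], y[2 * m :]         # trait 1 and trait 2
```

A nonzero SNP effect can then be added to `y1` or `y2` to move from the null model to one of the alternative mean models.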

Result:


Figure 4.1: The plot shows the type 1 error and power of the methods under the simulation setup of the section (4.3.1.1).

When there are only MZ pairs, the usual RMultiPAR test produces a slightly inflated type 1 error. The inflation is more prominent when the trait correlation ρ is high (ρ = 0.8). The Adjusted RMultiPAR, as expected, shows no inflation in the type 1 error in either case (a) or case (b). When there are only full sibling pairs, however, the usual RMultiPAR shows no inflation in the type 1 error and performs almost identically to the Adjusted RMultiPAR. We found that the inflation depends strongly on the distribution of the eigenvalues of the GRM: it occurs only if a large number of the eigenvalues are exactly 0. To our understanding, such a scenario can arise only if the dataset contains many MZ twin pairs. In biobank datasets, however, individuals are usually distantly related and the eigenvalues of the corresponding GRM are evenly spread. The usual RMultiPAR should therefore work without bias in those scenarios, obviating the need to estimate the parameters a and b. We discuss this issue in detail in Web Appendix B.

Going back to Figure 4.1, when ρ is small, as in cases (a) and (c), the power of all the methods increases as the number of associated traits increases. Interestingly, when ρ is higher, as in cases (b) and (d), all the methods have the highest power when only one of the traits is associated. We also see that the power of RMultiPAR is as good as that of the MTMM approach, which highlights the advantage of RMultiPAR over the computationally demanding and difficult-to-implement MTMM approach.
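The remark about zero eigenvalues can be checked directly for the pair-based GRMs used in this section. The sketch below relies only on the fact that a block-diagonal matrix has the union of its blocks' eigenvalues; the function name `zero_eigenvalue_fraction` and the tolerance are hypothetical choices for illustration.

```python
import numpy as np

def zero_eigenvalue_fraction(n_mz_pairs, n_sib_pairs, tol=1e-10):
    """Fraction of GRM eigenvalues that are numerically zero for a sample
    made of MZ pairs and full-sibling pairs."""
    mz = np.linalg.eigvalsh(np.array([[1.0, 1.0], [1.0, 1.0]]))   # 0 and 2
    sib = np.linalg.eigvalsh(np.array([[1.0, 0.5], [0.5, 1.0]]))  # 0.5 and 1.5
    # A block-diagonal matrix has the union of its blocks' eigenvalues.
    eig = np.concatenate([np.tile(mz, n_mz_pairs), np.tile(sib, n_sib_pairs)])
    return float(np.mean(np.abs(eig) < tol))

frac_mz = zero_eigenvalue_fraction(500, 0)    # all MZ pairs
frac_sib = zero_eigenvalue_fraction(0, 500)   # all full-sibling pairs
```

With only MZ pairs, half of the eigenvalues are zero, whereas with only sibling pairs none are, which is in line with the inflation being seen in cases (a) and (b) but not in cases (c) and (d).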

4.3.1.2 Simulation under a simple case of RMultiPAR’s covariance assumption

Next we compare the usual RMultiPAR, in terms of type 1 error and power, with combination tests such as MinP, Fisher's test and the PCA test discussed in section (4.2.1.1), and with joint tests such as GEE and RMMLR discussed in section (4.2.1.3), under different covariance structures. We consider $L = 4$ traits for $m = 1000$ families, each consisting of a pair of individuals, i.e., $n_i = 2$ for all $i$, so the total number of individuals is $n = \sum_i n_i = 2m = 2000$. The first 500 families have MZ twin pairs and the rest have full sibling pairs. Let $C_1$ be the corresponding known GRM.

Mean model: We simulate the SNP genotype ($G_{ij}$) under Hardy-Weinberg equilibrium, keeping the minor allele frequency (MAF) at 0.2. Considering the mean model from equation (4.3) without any covariates, we simulate the trait $y_{lij}$ for the following cases: 1) the SNP is associated with no trait, 2) the SNP is associated with only one of the traits, 3) the SNP is associated with two of the traits, and 4) the SNP is associated with all the traits.

Covariance model: As the covariance matrix of Y we consider

$$V_l = 48\,C_1 + 12\,I_{2000}, \qquad V_{ll'} = (48\rho)\,C_1 + (12\rho)\,I_{2000} = \rho V_l.$$

We consider three different values of ρ, namely 0, 0.5 and 0.8, and conduct 1000 simulations. Notice that the structure of $V_{ll'}$ agrees with both the traditional assumption in equation (4.5) and RMultiPAR's assumption in equation (4.6). Notice further that the structure also complies with RMMLR's covariance assumptions discussed in section (4.1.3.2), which makes RMultiPAR and RMMLR theoretically equivalent in this particular scenario.
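A short worked step may help to see why this setup is so simple. Stacking the trait vectors as $Y = (y_1^\top, \ldots, y_L^\top)^\top$ (a notation assumed here for illustration, together with the matrix $R$ and the Kronecker product $\otimes$), the condition $V_{ll'} = \rho V_l$ with a common $V_l = V$ gives a separable covariance:

```latex
\operatorname{Cov}(Y) =
\begin{pmatrix}
V      & \rho V & \cdots & \rho V \\
\rho V & V      & \cdots & \rho V \\
\vdots & \vdots & \ddots & \vdots \\
\rho V & \rho V & \cdots & V
\end{pmatrix}
= R \otimes V,
\qquad
R = (1-\rho)\, I_L + \rho\, \mathbf{1}_L \mathbf{1}_L^\top,
\qquad
V = 48\, C_1 + 12\, I_{2000}.
```

The matrix $R$ is positive definite whenever $-1/(L-1) < \rho < 1$, so all three values $\rho = 0, 0.5, 0.8$ give a valid covariance, and $\rho = 0$ reduces to $L$ independent traits sharing the same $V$.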

Results:

Figure 4.2: The plot compares the type 1 error and power of different methods under the simulation setup of section (4.3.1.2) for three different values of ρ at level 0.05.

When ρ = 0, i.e., the traits are independent of each other, all the approaches have correct type 1 error. The combination methods (MinP, Fisher's test and PCA) perform poorly in terms of power compared to the other methods. All the methods gain power as the number of associated traits increases. When ρ > 0, Fisher's test and GEE have highly inflated type 1 error. RMultiPAR maintains proper type 1 error and also has consistently better power than the combination methods. Interestingly, it has the highest power when only half of the traits are associated and loses power when the SNP is associated with all four traits. As mentioned earlier, under this particular covariance setup the RMultiPAR and RMMLR test statistics are theoretically equivalent, which explains their similarity in the power plots. Another interesting observation is that, as the number of associated traits increases, the PCA approach gains substantial power and even outperforms MinP. A power plot with a larger number of traits under the same covariance structure can be found in Appendix C.
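The inflation of Fisher's test under correlated traits can be reproduced with a small stand-alone experiment. The Python sketch below is not the simulation pipeline of this section: it assumes that the per-trait test statistics are z-scores that are equicorrelated with correlation ρ under the null, and it applies Fisher's combination in its usual form with a $\chi^2_{2L}$ reference distribution. The function name `fisher_type1_error` and the simulation sizes are hypothetical.

```python
import numpy as np
from scipy import stats

def fisher_type1_error(rho, L=4, n_sim=20000, alpha=0.05, seed=0):
    """Empirical type 1 error of Fisher's combination test when the L
    per-trait z-scores are equicorrelated with correlation rho under H0."""
    rng = np.random.default_rng(seed)
    R = (1 - rho) * np.eye(L) + rho * np.ones((L, L))
    z = rng.multivariate_normal(np.zeros(L), R, size=n_sim)
    p = 2 * stats.norm.sf(np.abs(z))               # two-sided p-values
    fisher_stat = -2 * np.log(p).sum(axis=1)
    # The chi-square reference with 2L df is exact only under independence.
    p_comb = stats.chi2.sf(fisher_stat, df=2 * L)
    return float(np.mean(p_comb < alpha))

err_indep = fisher_type1_error(rho=0.0)   # close to the nominal 0.05
err_corr = fisher_type1_error(rho=0.8)    # clearly above 0.05
```

The reference distribution ignores the dependence between the p-values, so the combined statistic has a heavier right tail than $\chi^2_{2L}$ when ρ > 0, and the rejection rate exceeds the nominal level.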

4.3.1.3 Simulation under a more general case of RMultiPAR's covariance assumption

We consider a covariance structure that is more general than the earlier case. We assume the $V_l$'s to be different for different traits:

$$V_1 = 48\,C_1 + 12\,I_{2000}, \qquad V_l = 24\,C_1 + 36\,I_{2000} \ \text{ for } l \neq 1, \qquad V_{ll'} = \rho\,V_l^{1/2} V_{l'}^{1/2} \ \text{ for } l \neq l'.$$
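The cross-covariance $V_{ll'} = \rho V_l^{1/2} V_{l'}^{1/2}$ can be assembled numerically as in the following Python sketch. It assumes the symmetric (principal) matrix square root, computed through an eigendecomposition, and a trait-wise stacking of Y; the helper names `sym_sqrt` and `general_cov` are hypothetical, and a toy GRM with five MZ pairs and five sibling pairs stands in for $C_1$.

```python
import numpy as np
from scipy.linalg import block_diag

def sym_sqrt(V):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, U = np.linalg.eigh(V)
    return (U * np.sqrt(w)) @ U.T

def general_cov(C, rho, L=4):
    """Covariance of the stacked traits with V_1 != V_l for l > 1."""
    n = C.shape[0]
    I = np.eye(n)
    V = [48 * C + 12 * I] + [24 * C + 36 * I for _ in range(L - 1)]
    roots = [sym_sqrt(v) for v in V]
    blocks = [[V[l] if l == k else rho * roots[l] @ roots[k]
               for k in range(L)] for l in range(L)]
    return np.block(blocks)

# Toy GRM: 5 MZ pairs followed by 5 full-sibling pairs.
mz = np.kron(np.eye(5), np.ones((2, 2)))
sib = np.kron(np.eye(5), np.array([[1.0, 0.5], [0.5, 1.0]]))
C1 = block_diag(mz, sib)
Sigma = general_cov(C1, rho=0.8)
```

Because every $V_l$ here is a function of the same $C_1$, the square roots commute and each off-diagonal block is symmetric, so `Sigma` is a valid (positive definite) covariance matrix for ρ in [0, 1).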

We consider ρ = 0.5, 0.8 and the mean model from section (4.3.1.2). Notice that this covariance setup does not satisfy RMMLR's assumptions discussed in section (4.2.1.3). We plot the histogram of the RMMLR test statistic under $H_0$ for both values of ρ.


Figure 4.3: The histogram of RMMLR test statistic for the case of ρ = 0.5 (on the left) and ρ = 0.8 (on the right) in section (4.3.1.3)

From the above plot we see that the RMMLR test statistic, which should always be strictly positive, takes negative values quite a few times. This demonstrates that RMMLR is not valid when its underlying covariance assumptions do not hold, so we remove RMMLR from further comparisons.

Results: We plot the type 1 error and power of RMultiPAR, GEE and the combination tests.


Figure 4.4: The plot compares the type 1 error and power of different methods under the simulation setup of the section (4.3.1.3) at level 0.05.

We see a similar trend as in the earlier plots from section (4.3.1.2). Fisher’s test and GEE still have inflated type 1 error with the inflation being higher when ρ is higher. RMultiPAR has better power than all the combination tests. Similar to the earlier section, when only half of the traits are associated with the SNP, RMultiPAR has the highest power.

4.3.2 Real Data Analysis

UK Biobank (Allen et al., 2014) is a large long-term biobank study in the United Kingdom that investigates the respective contributions of genetic predisposition and environmental exposure to the development of various diseases. Our goal is to focus on people with high genetic dependency, which is why we consider monozygotic twins and full siblings from the UK Biobank cohort. To reduce the effects of strong population structure, we consider only individuals with self-reported British ancestry and further prune the sample with a robust sparse K-means clustering algorithm (Kondo et al., 2016) using the top five principal components of the genetic data. Our final dataset has 44739 individuals spread over 18151 families, each of size between 1 and 6. There are 467843 SNPs that have minor allele frequency (MAF) greater than 0.01 and missingness less than 10%, and that pass the Hardy-Weinberg equilibrium test at threshold 0.001. We also prune the SNPs with an $r^2$ linkage disequilibrium (LD) threshold of 0.9. We use PLINK for all the quality control steps. We consider four anthropometric phenotypes: standing height, weight, hip circumference and waist circumference. Sex, age, squared age, batch effect and the top fifteen principal components are treated as covariates.

We conduct four different tests: a) univariate association analyses of the traits under the assumption of independence between individuals (simple linear regression for each trait), b) MinP, c) PCA and d) RMultiPAR. The univariate association analyses under the independence assumption give a very high median-based genomic inflation factor (λ) for all the traits: 1.48, 1.27, 1.24 and 1.23 for height, weight, hip circumference and waist circumference, respectively. The genomic lambdas improve greatly when, instead of assuming independence, the genetic dependency is taken into account ($V_l$ is modeled as in equation (4.2)); the values become 1.18, 1.098, 1.09 and 1.089. This reaffirms the importance of properly modeling the genetic dependency in association studies to avoid bias in the inference. However, the genomic lambdas are still slightly inflated, which can be attributed to the highly polygenic nature of the phenotypes (Yang et al., 2011; Boyle et al., 2017) and also to the existence of further subtle population substructure in the UK Biobank data (Yengo et al., 2018). RMultiPAR and PCA have genomic inflation factors of 1.086 and 1.098, respectively.
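For reference, a median-based genomic inflation factor can be computed from a vector of p-values as in the Python sketch below. It assumes the common definition, namely the median of the observed chi-square statistics divided by the median of the reference chi-square distribution; the function name `genomic_inflation`, the `df` argument and the synthetic p-values are hypothetical, and the thesis does not state which degrees of freedom were used for the multi-trait tests.

```python
import numpy as np
from scipy import stats

def genomic_inflation(pvalues, df=1):
    """Median-based genomic inflation factor for tests with `df` degrees
    of freedom under the null."""
    p = np.asarray(pvalues, dtype=float)
    observed = stats.chi2.isf(p, df)       # p-values back to chi-square scale
    return float(np.median(observed) / stats.chi2.ppf(0.5, df))

rng = np.random.default_rng(7)

# Well-calibrated test: p-values are uniform, so lambda is close to 1.
lam_null = genomic_inflation(rng.uniform(size=200_000))

# Statistics inflated by a factor of 1.2, so lambda is close to 1.2.
inflated_stats = 1.2 * rng.chisquare(1, size=200_000)
lam_inflated = genomic_inflation(stats.chi2.sf(inflated_stats, 1))
```

Values well above 1, such as the 1.48 reported for height under the independence assumption, indicate that the test statistics are systematically larger than expected under the null.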

Figure 4.5: Venn diagram showing the common SNPs detected by the three methods at a p-value threshold of $1 \times 10^{-8}$.

From the Venn diagram in Figure 4.5 we see that, at a p-value threshold of $1 \times 10^{-8}$ (accounting for Bonferroni correction), MinP and RMultiPAR detect 169 and 190 SNPs as significant, respectively, whereas PCA detects only 19 SNPs. We have seen in the simulation studies that PCA has respectable power only if all the phenotypes are associated with the SNP. In the real data that is rarely the case, which is the probable reason behind the small number of detections by PCA. In the following table, we list the 13 SNPs detected by all three methods, along with their p-values.

Table 4.1: The table lists the common SNPs detected by all three methods (RMultiPAR, PCA and MinP) at a p-value threshold of $1 \times 10^{-8}$.

                                                                Univariate Analysis
CHR  SNP          RMultiPAR  MinP       PCA        Height     Weight     Hip        Waist
1    rs10913469   1.30e-11   1.50e-12   4.96e-11   8.96e-01   4.66e-11   3.76e-13   1.12e-08
1    rs633715     2.03e-13   6.73e-14   3.88e-12   5.42e-01   1.37e-12   1.68e-14   7.64e-09
2    rs62104180   6.47e-10   8.72e-11   3.12e-11   8.51e-01   2.34e-10   2.98e-10   2.18e-11
2    rs62106258   6.28e-11   7.32e-12   1.20e-12   6.03e-01   7.57e-12   3.33e-11   1.83e-12
3    rs1344672    8.15e-37   1.96e-36   4.19e-10   4.91e-37   8.68e-11   1.94e-07   7.35e-04
16   rs1121980    3.61e-22   6.75e-21   5.12e-20   8.09e-01   1.69e-21   3.70e-14   2.55e-18
16   rs11642841   1.18e-11   1.18e-11   4.87e-11   9.46e-01   2.94e-12   3.80e-08   1.48e-09
16   rs1421085    3.46e-21   2.22e-19   1.34e-18   6.40e-01   5.55e-20   5.74e-13   1.55e-17
16   rs2058908    1.53e-09   3.23e-09   3.81e-09   9.00e-01   8.07e-10   3.39e-06   2.15e-08
16   rs9922619    7.32e-19   2.92e-17   2.16e-16   6.48e-01   7.31e-18   3.13e-11   5.54e-15
18   rs10871777   1.28e-10   7.85e-12   1.13e-11   2.39e-04   1.96e-12   1.48e-08   1.02e-08
18   rs12970134   8.78e-11   3.11e-12   2.88e-12   3.63e-03   7.78e-13   4.50e-09   5.16e-10
18   rs489693     6.47e-10   5.10e-11   2.22e-11   4.17e-04   1.27e-11   4.76e-09   5.67e-09

From Table 4.1, we notice that the SNP rs1344672, which has earlier been reported to be associated with height (Kim et al., 2010), is also jointly associated with the four phenotypes. Similarly, a few SNPs on chr 16, namely rs1121980, rs1421085, etc., which have earlier been reported to be associated with obesity and fat mass (Kawajiri et al., 2012), turn out to be jointly associated with our phenotypes as well. Next we list all the SNPs detected as significant by RMultiPAR but not by MinP or PCA.

Table 4.2: The table lists the SNPs detected only by RMultiPAR at a p-value threshold of $1 \times 10^{-8}$.

                                                                Univariate Analysis
CHR  SNP          RMultiPAR  MinP       PCA        Height     Weight     Hip        Waist
1    rs17391694   2.18e-11   2.10e-08   2.22e-07   5.25e-09   6.11e-09   2.44e-05   1.40e-03
1    rs2605100    2.95e-11   7.09e-05   1.07e-01   3.22e-01   1.56e-01   1.77e-05   7.35e-01
1    rs2820441    3.23e-11   1.29e-05   3.56e-02   1.63e-01   5.74e-02   3.23e-06   4.03e-01
1    rs2820436    1.45e-10   1.34e-05   2.51e-02   1.20e-01   4.17e-02   3.35e-06   3.37e-01
1    rs4846565    2.71e-10   1.27e-05   1.87e-02   8.03e-02   2.73e-02   3.18e-06   3.09e-01
1    rs4846567    3.60e-10   9.33e-05   6.31e-02   8.53e-02   8.55e-02   2.33e-05   6.14e-01
1    rs6541227    4.01e-10   1.53e-05   2.98e-02   2.95e-01   5.36e-02   3.83e-06   2.78e-01
1    rs1417066    6.52e-09   4.29e-05   3.66e-02   4.62e-01   5.56e-02   1.07e-05   3.15e-01
2    rs56292657   1.71e-09   2.41e-08   2.92e-01   6.01e-09   2.87e-01   2.96e-02   1.20e-01
2    rs2303565    3.31e-09   1.32e-06   6.59e-03   3.29e-07   8.53e-03   3.44e-03   5.71e-04
4    rs951252     2.37e-09   7.08e-07   1.41e-04   1.77e-07   6.74e-05   8.27e-06   3.88e-02
6    rs72961013   1.85e-20   2.51e-03   7.91e-01   4.87e-02   8.85e-01   6.28e-04   3.02e-02
6    rs114344942  1.77e-14   4.20e-06   5.93e-01   1.05e-06   9.15e-01   3.24e-02   1.53e-01
6    rs115447786  5.19e-12   4.50e-05   8.40e-01   1.13e-05   8.15e-01   3.61e-02   2.78e-01
6    rs1936799    2.25e-11   9.73e-02   4.21e-01   1.04e-01   4.27e-01   2.43e-02   7.73e-02
6    rs1052486    8.22e-11   1.29e-08   6.54e-01   3.23e-09   2.61e-01   4.34e-01   6.11e-02
6    rs1049281    8.44e-11   5.42e-08   4.11e-02   1.35e-08   6.62e-03   9.99e-03   7.09e-01
6    rs56005336   1.31e-10   7.30e-05   8.15e-01   1.82e-05   4.96e-01   2.23e-02   6.04e-01
6    rs114355919  3.31e-10   2.39e-05   3.26e-01   5.96e-06   5.75e-01   2.74e-01   1.01e-01
6    rs72976106   4.06e-10   2.11e-02   8.00e-01   2.82e-01   5.76e-01   5.26e-03   2.22e-01
6    rs3132449    6.60e-10   1.23e-07   1.13e-04   3.07e-08   9.27e-06   1.20e-04   8.51e-02
6    rs537160     1.00e-09   2.43e-08   7.55e-01   6.08e-09   4.09e-01   4.61e-01   7.19e-02
6    rs77393224   1.19e-09   7.06e-06   5.18e-02   1.77e-06   7.90e-02   7.10e-01   2.37e-02
6    rs3132450    1.71e-09   3.21e-07   2.29e-04   8.01e-08   1.91e-05   2.07e-04   1.20e-01
6    rs1150753    1.80e-09   1.75e-06   1.59e-04   4.38e-07   1.19e-05   8.14e-05   1.05e-01
6    rs1936801    2.01e-09   1.00e-01   5.23e-01   8.81e-01   8.66e-01   7.34e-02   2.51e-02
6    rs2229642    2.01e-09   1.63e-08   9.72e-02   4.07e-09   1.49e-01   1.16e-02   8.33e-03
6    rs3131383    2.04e-09   3.27e-07   8.75e-05   8.19e-08   7.91e-06   9.02e-05   6.64e-02
6    rs1270942    2.16e-09   1.06e-06   1.21e-04   2.64e-07   9.43e-06   7.92e-05   8.45e-02
6    rs4711750    2.54e-09   2.49e-01   9.73e-01   4.42e-01   4.00e-01   6.22e-02   9.04e-02
6    rs11755266   2.59e-09   8.68e-08   1.66e-06   2.17e-08   1.26e-06   2.64e-06   1.22e-03
6    rs809871     4.07e-09   4.48e-08   1.56e-04   1.12e-08   4.45e-04   1.14e-04   4.89e-03
6    rs2745353    5.28e-09   8.22e-02   5.52e-01   8.11e-01   9.70e-01   1.05e-01   2.05e-02
6    rs3094013    8.25e-09   2.19e-06   6.65e-04   5.46e-07   1.04e-04   1.77e-04   2.01e-01
6    rs2524066    8.88e-09   1.48e-06   5.67e-02   3.71e-07   1.19e-02   1.76e-02   7.36e-01
6    rs2844531    9.63e-09   1.75e-06   7.53e-04   4.37e-07   9.00e-05   3.18e-04   2.06e-01
6    rs1150758    9.82e-09   2.41e-06   1.91e-05   9.49e-06   6.01e-07   1.61e-04   1.97e-02
6    rs7738642    9.90e-09   1.07e-08   4.11e-01   2.68e-09   5.34e-01   7.53e-01   8.99e-01
7    rs849140     3.60e-10   1.19e-07   1.55e-01   2.98e-08   1.35e-01   1.07e-02   6.35e-02
7    rs4731702    1.24e-09   2.62e-08   1.27e-04   8.17e-01   5.88e-04   6.54e-09   1.37e-03
12   rs10843209   1.22e-09   1.32e-07   2.17e-01   3.31e-08   2.13e-01   3.91e-01   1.97e-02
15   rs4842918    2.79e-11   4.09e-08   2.00e-01   1.02e-08   5.54e-01   2.50e-01   2.37e-01
15   rs62027818   2.18e-09   1.88e-08   1.94e-02   4.70e-09   4.77e-02   9.62e-03   1.73e-01
15   rs11633181   7.51e-09   3.42e-06   4.47e-02   8.55e-07   1.81e-01   5.58e-02   5.57e-02
15   rs4483821    8.67e-09   4.22e-08   3.35e-01   1.06e-08   5.76e-01   4.37e-01   6.20e-01
16   rs1477196    2.98e-09   3.48e-08   4.93e-08   2.52e-01   3.53e-08   2.57e-05   8.70e-09
16   rs7187961    5.58e-09   3.51e-08   3.70e-08   3.53e-01   8.76e-09   1.94e-06   1.63e-07
17   rs2070776    9.54e-09   2.31e-08   1.02e-02   5.78e-09   4.79e-03   4.25e-03   4.55e-01
18   rs34302357   6.18e-09   1.50e-08   4.48e-05   3.75e-09   5.97e-05   4.57e-03   6.75e-04

It is interesting that rs2605100 on chr 1, in spite of not being individually significant for any of the traits, is detected to be jointly associated. rs2605100 has earlier been reported to be influential in adiposity and fat distribution (Lindgren et al., 2009). rs849141 on chr 7 has previously been reported to be associated with height (Soranzo et al., 2009). The Manhattan plots of the log p-values of the methods can be found in Web Appendix C of the Supplementary Information.

We note that in the real data analysis we have also considered the Adjusted RMultiPAR (estimating the parameters a and b using a moment-based approach discussed in Web Appendix B). We choose not to report it since the result is very similar to that of the usual RMultiPAR. The mean values of the estimated a and b over different SNPs are 0.99 and 3.99, respectively (with sd 0.00097 and 0.00095). This strengthens our point, argued in section (4.3.1.1) and Web Appendix B of the Supplementary Information, that the simple $\chi^2_L$ distributional assumption of RMultiPAR works reasonably well in most cases. It obviates the need to consider the $a\chi^2_b$ distribution and thus the need to estimate the parameters a and b.
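To see why a ≈ 0.99 and b ≈ 3.99 leave the inference essentially unchanged when L = 4, the following Python sketch matches a scaled chi-square $a\chi^2_b$ to a given mean and variance and compares the resulting p-value with the plain $\chi^2_4$ one. This is a generic two-moment matching written for illustration; the estimator actually used for the Adjusted RMultiPAR is described in Web Appendix B and may differ. The function names and the example statistic are hypothetical.

```python
import numpy as np
from scipy import stats

def match_scaled_chi2(mean, var):
    """Return (a, b) such that a * chi2_b has the given mean and variance,
    using E[a * chi2_b] = a * b and Var[a * chi2_b] = 2 * a**2 * b."""
    a = var / (2.0 * mean)
    b = 2.0 * mean**2 / var
    return a, b

def scaled_chi2_pvalue(stat, a, b):
    """Upper-tail p-value of `stat` under the a * chi2_b reference."""
    return float(stats.chi2.sf(stat / a, df=b))

# A statistic with mean 4 and variance 8 is matched by chi2 with 4 df.
a, b = match_scaled_chi2(mean=4.0, var=8.0)

stat = 40.0                                         # illustrative test statistic
p_usual = scaled_chi2_pvalue(stat, 1.0, 4.0)        # plain chi2_4 reference
p_adjusted = scaled_chi2_pvalue(stat, 0.99, 3.99)   # averages reported above
```

With these values the two p-values are of the same order of magnitude, which is consistent with the Adjusted and usual RMultiPAR giving very similar results in the real data analysis.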

4.4 Discussion

The study of pleiotropy, or the shared genetic basis of multiple correlated traits, has been gaining momentum in recent years, particularly as we move into the era of personalized medicine and genome editing. The majority of statistical techniques for SNP-based association testing with multivariate traits assume unrelated individuals, and these methods cannot be easily extended to include genetically related individuals. In this paper, we have discussed some common practices for multi-trait association analysis and have shown that it is extremely important to correctly model the complex dependency structure in order to maintain correct type 1 error and achieve respectable power. The most commonly used MVLMM framework, which models the dependency structure extensively, also includes many variance-covariance components, making parameter estimation extremely difficult for a large number of traits or individuals. Researchers have focused on developing efficient algorithms to implement the traditional MVLMM (Korte et al., 2012; Zhou and Stephens, 2014), but most of them still cannot handle an extremely large scale cohort such as UK Biobank (Allen et al., 2014). On the other hand, some researchers have overly simplified the dependency structure to make the computation feasible. Most of these tests cannot accommodate distantly related individuals and lack reliability, as we have shown in the simulations.

In this paper, we have reasonably simplified the traditional MVLMM framework to construct a test named RMultiPAR, which has much better computational complexity than some of the existing algorithms for fitting the MVLMM, such as MTMM (Korte et al., 2012) and, in some cases, GEMMA (Zhou and Stephens, 2014). We have shown that RMultiPAR maintains correct type 1 error even under strong genetic and between-trait dependence and has significantly better power than combination methods such as MinP and PCA or simplified MVLMM approaches such as RMMLR (Basu et al., 2013) and GEE. We have also applied RMultiPAR to MZ twins and full sibling pairs from the UK Biobank data to perform single-SNP association analyses with four anthropometric traits (standing height, weight, hip circumference and waist circumference) and identified new significant associations. One of our future goals is to analyze all 500k distantly related individuals from the UK Biobank cohort using RMultiPAR to study behavioral traits such as alcohol and cigarette consumption. We would also like to extend our method to accommodate categorical traits. The RMultiPAR codes and a comprehensive user guide can be found here.

Bibliography

Allen, N. E., Sudlow, C., Peakman, T., Collins, R., et al. (2014). Uk biobank data: come and get it.

Almasy, L. and Blangero, J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. The American Journal of Human Genetics 62, 1198–1211.

Arbet, J., McGue, M., and Basu, S. (2017). A robust and unified framework for estimating heritability in twin studies using generalized estimating equations. arXiv preprint arXiv:1710.09326 .

Balding, D. J. and Nichols, R. A. (1995). A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12.

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 825–848.

Basseville, M., Benveniste, A., Chou, K. C., Golden, S. A., Nikoukhah, R., and Willsky, A. S. (1992). Modeling and estimation of multiresolution stochastic processes. IEEE Transactions on Information Theory 38, 766–784.

Basson, J., Sung, Y. J., de las Fuentes, L., Schwander, K. L., Vazquez, A., and Rao, D. C. (2016). Three approaches to modeling gene-environment interactions in longitudinal family data: Gene-smoking interactions in blood pressure. Genetic epidemiology 40, 73–80.

Basu, S., Zhang, Y., Ray, D., Miller, M. B., Iacono, W. G., and McGue, M. (2013). A rapid gene-based genome-wide association test with multivariate traits. Human heredity 76, 53–63.

Bates, D., Maechler, M., Bolker, B., Walker, S., et al. (2014). lme4: Linear mixed-effects models using eigen and s4. R package version 1, 1–23.

Bickel, P. J. and Doksum, K. A. (2015). Mathematical statistics: basic ideas and selected topics, volume I, volume 117. CRC Press.

Boos, D. D. and Stefanski, L. A. (2013). Essential statistical inference: theory and methods, volume 120. Springer Science & Business Media.

Boyle, E. A., Li, Y. I., and Pritchard, J. K. (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186.

Bradley, R. C. (2005). Basic properties of strong mixing conditions. a survey and some open questions. arXiv preprint math/0511078 .

Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., Daly, M. J., Price, A. L., and Neale, B. M. (2015). LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics 47, 291–295.

Burkett, K. M., Roy-Gagnon, M.-H., Lefebvre, J.-F., Wang, C., Fontaine-Bisson, B., and Dubois, L. (2015). A comparison of statistical methods for the discovery of genetic risk factors using longitudinal family study designs. Frontiers in immunology 6, 589.

Burt, C. H. and Simons, R. L. (2014). Pulling back the curtain on heritability studies: Biosocial criminology in the postgenomic era. Criminology 52, 223–262.

Burton, P. R., Scurrah, K. J., Tobin, M. D., and Palmer, L. J. (2005). Covariance components models for longitudinal family data. International Journal of Epidemiology 34, 1063–1077.

Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J., et al. (2017). Genome-wide genetic data on ~500,000 uk biobank participants. BioRxiv page 166298.

Chansangiam, P., Hemchote, P., and Pantaragphong, P. (2009). Inequalities for kronecker products and hadamard products of positive definite matrices. Sci. Asia 35, 106–110.

Chen, G.-B., Lee, S. H., Brion, M.-J. A., Montgomery, G. W., Wray, N. R., Radford-Smith, G. L., Visscher, P. M., and Consortium, I. I. G. (2014). Estimation and partitioning of (co)heritability of inflammatory bowel disease from gwas and immunochip data. Human molecular genetics 23, 4710–4720.

Chen, H. and Conomos, M. P. (2015). Gmmat: Generalized linear mixed model association tests version 0.6.

Chen, H., Wang, C., Conomos, M. P., Stilp, A. M., Li, Z., Sofer, T., Szpiro, A. A., Chen, W., Brehm, J. M., Celedón, J. C., et al. (2016). Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. The American Journal of Human Genetics 98, 653–666.

Conomos, M. P., Gogarten, S. M., Brown, L., Chen, H., Rice, K., Sofer, T., Thornton, T., and Yu, C. (2019). GENESIS: GENetic EStimation and Inference in Structured samples (GENESIS): Statistical methods for analyzing genetic data from samples with population structure and/or relatedness. R package version 2.12.4.

Degnan, J. H. and Salter, L. A. (2005). Gene tree distributions under the coalescent process. Evolution 59, 24–37.

Eaton, M. L. (1983). Multivariate statistics: a vector space approach. John Wiley & Sons, New York.

Ellinghaus, D., Jostins, L., Spain, S. L., Cortes, A., Bethune, J., Han, B., Park, Y. R., Raychaudhuri, S., Pouget, J. G., Hübenthal, M., et al. (2016). Analysis of five chronic inflammatory diseases identifies 27 new associations and highlights disease-specific patterns at shared loci. Nature genetics 48, 510.

Falconer, D. S. (1960). Introduction to quantitative genetics. Oliver And Boyd; Edinburgh; London.

Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational statistics & data analysis 53, 2873–2884.

Galesloot, T. E., Van Steen, K., Kiemeney, L. A., Janss, L. L., and Vermeulen, S. H. (2014). A comparison of multivariate genome-wide association methods. PloS one 9, e95923.

Galinsky, K. J., Loh, P.-R., Mallick, S., Patterson, N. J., and Price, A. L. (2016). Population structure of uk biobank and ancient eurasians reveals adaptation at genes influencing blood pressure. The American Journal of Human Genetics 99, 1130–1139.

Gaziano, J. M., Concato, J., Brophy, M., Fiore, L., Pyarajan, S., Breeling, J., Whitbourne, S., Deen, J., Shannon, C., Humphries, D., et al. (2016). Million veteran program: a mega-biobank to study genetic influences on health and disease. Journal of clinical epidemiology 70, 214–223.

Ge, T., Reuter, M., Winkler, A. M., Holmes, A. J., Lee, P. H., Tirrell, L. S., Roffman, J. L., Buckner, R. L., Smoller, J. W., and Sabuncu, M. R. (2016). Multidimensional heritability analysis of neuroanatomical shape. Nature communications 7, 13291.

Goldberger, A. S. and Goldberger, A. S. G. (1991). A course in econometrics. Harvard University Press.

Harville, D. A. (1998). Matrix algebra from a statistician’s perspective.

Hayashi, F. (2000). Econometrics. Princeton University Press, Princeton, New Jersey, USA.

Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., Gerber, F., Gramacy, R. B., Hammerling, D., Katzfuss, M., et al. (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics 24, 398–425.

Henningsen, A., Hamann, J. D., et al. (2007). systemfit: A package for estimating systems of simultaneous equations in r. Journal of Statistical Software 23, 1–40.

Hur, Y., Kaprio, J., Iacono, W., Boomsma, D., McGue, M., Silventoinen, K., Martin, N., Luciano, M., Visscher, P., Rose, R., et al. (2008). Genetic influences on the difference in variability of height, weight and body mass index between caucasian and east asian adolescent twins. International Journal of Obesity 32, 1455.

Ji, S.-G., Juran, B. D., Mucha, S., Folseraas, T., Jostins, L., Melum, E., Kumasaka, N., Atkinson, E. J., Schlicht, E. M., Liu, J. Z., et al. (2017). Genome-wide association study of primary sclerosing cholangitis identifies new risk loci and quantifies the genetic relationship with inflammatory bowel disease. Nature genetics 49, 269.

Jiang, J., Li, C., Paul, D., Yang, C., and Zhao, H. (2014). High-dimensional genome-wide association study and misspecified mixed model analysis. arXiv preprint arXiv:1404.2355 .

Jiang, J., Li, C., Paul, D., Yang, C., Zhao, H., et al. (2016). On high-dimensional misspecified mixed model analysis in genome-wide association study. The Annals of Statistics 44, 2127–2160.

Joyce, J. M. (2011). Kullback-leibler divergence. In International encyclopedia of statistical science, pages 720–722. Springer.

Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-y., Freimer, N. B., Sabatti, C., Eskin, E., et al. (2010). Variance component model to account for sample structure in genome-wide association studies. Nature genetics 42, 348.

Kawajiri, T., Osaki, Y., and Kishimoto, T. (2012). Association of gene polymorphism of the fat mass and obesity associated gene with metabolic syndrome: a retrospective cohort study in japanese workers. Yonago acta medica 55, 29.

Khoury, M. J. and Evans, J. P. (2015). A public health perspective on a national precision medicine cohort: balancing long-term knowledge generation with early health benefit. Jama 313, 2117–2118.

Kim, J.-J., Lee, H.-I., Park, T., Kim, K., Lee, J.-E., Cho, N. H., Shin, C., Cho, Y. S., Lee, J.-Y., Han, B.-G., et al. (2010). Identification of 15 loci influencing height in a korean population. Journal of human genetics 55, 27.

Kingman, J. F. (2000). Origins of the coalescent: 1974-1982. Genetics 156, 1461–1463.

Klingenberg, C. P. and Monteiro, L. R. (2005). Distances and directions in multidimensional shape spaces: implications for morphometric applications. Systematic Biology 54, 678–688.

Kondo, Y., Salibian-Barrera, M., Zamar, R., et al. (2016). Rskc: an r package for a robust and sparse k-means clustering algorithm. Journal of Statistical Software 72, 1–26.

Korte, A., Vilhjálmsson, B. J., Segura, V., Platt, A., Long, Q., and Nordborg, M. (2012). A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature genetics 44, 1066.

Lee, S. H., Ripke, S., Neale, B. M., Faraone, S. V., Purcell, S. M., Perlis, R. H., Mowry, B. J., Thapar, A., Goddard, M. E., Witte, J. S., et al. (2013). Genetic relationship between five psychiatric disorders estimated from genome-wide snps. Nature genetics 45, 984.

Li, X., Basu, S., Miller, M. B., Iacono, W. G., and McGue, M. (2011). A rapid generalized least squares model for a genome-wide quantitative trait association analysis in families. Human heredity 71, 67–82.

Li, Y. R., Li, J., Zhao, S. D., Bradfield, J. P., Mentch, F. D., Maggadottir, S. M., Hou, C., Abrams, D. J., Chang, D., Gao, F., et al. (2015). Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nature medicine 21, 1018.

Lindgren, C. M., Heid, I. M., Randall, J. C., Lamina, C., Steinthorsdottir, V., Qi, L., Speliotes, E. K., Thorleifsson, G., Willer, C. J., Herrera, B. M., et al. (2009). Genome-wide association scan meta-analysis identifies three loci influencing adiposity and fat distribution. PLoS genetics 5, e1000508.

Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., and Heckerman, D. (2011). Fast linear mixed models for genome-wide association studies. Nature methods 8, 833.

Liu, S. and Trenkler, G. (2008). Hadamard, khatri-rao, kronecker and other matrix products. Int. J. Inf. Syst. Sci 4, 160–177.

Loh, P.-R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., Chasman, D. I., Ridker, P. M., Neale, B. M., Berger, B., et al. (2015). Efficient bayesian mixed-model analysis increases association power in large cohorts. Nature genetics 47, 284.

Magnus, J. R. (1978). Maximum likelihood estimation of the gls model with unknown parameters in the disturbance covariance matrix. Journal of econometrics 7, 281–312.

Meyer, K. (2007). Wombat—a tool for mixed model analyses in quantitative genetics by restricted maximum likelihood (reml). Journal of Zhejiang University Science B 8, 815–821.

Miller, M. B., Basu, S., Cunningham, J., Eskin, E., Malone, S. M., Oetting, W. S., Schork, N., Sul, J. H., Iacono, W. G., and McGue, M. (2012). The minnesota center for twin and family research genome-wide association study. Twin Research and Human Genetics 15, 767–774.

Mokkadem, A. (1988). Mixing properties of arma processes. Stochastic processes and their applications 29, 309–315.

Neale, M. C., Cardon, L. R., et al. (1994). Methodology for genetic studies of twins and families. STATISTICS IN MEDICINE 13, 199–199.

Neale, M. C., Hunter, M. D., Pritikin, J. N., Zahery, M., Brick, T. R., Kirkpatrick, R. M., Estabrook, R., Bates, T. C., Maes, H. H., and Boker, S. M. (2016). Openmx 2.0: Extended structural equation and statistical modeling. Psychometrika 81, 535–549.

Nobel, A. and Dembo, A. (1993). A note on uniform laws of averages for dependent processes. Statistics & Probability Letters 17, 169–172.

Nordborg, M. (2004). Coalescent theory. Handbook of statistical genetics.

O’Reilly, P. F., Hoggart, C. J., Pomyen, Y., Calboli, F. C., Elliott, P., Jarvelin, M.-R., and Coin, L. J. (2012). Multiphen: joint model of multiple phenotypes can increase discovery in gwas. PloS one 7, e34861.

Pendergrass, S. A. et al. (2013). Phenome-wide association study (phewas) for detection of pleiotropy within the population architecture using genomics and epidemiology (page) network. PLoS Genet. 9, e1003087.

Pickrell, J., Berisa, T., Segurel, L., Tung, J. Y., and Hinds, D. (2015). Detection and interpretation of shared genetic influences on 42 human traits. Nat Genet 48, 709–717.

Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics 38, 904–909.

Rabe-Hesketh, S., Skrondal, A., and Gjessing, H. K. (2008). Biometrical modeling of twin and family data using standard mixed model software. Biometrics 64, 280–288.

Ray, D., Pankow, J. S., and Basu, S. (2016). Usat: A unified score-based association test for multiple phenotype-genotype analysis. Genetic epidemiology 40, 20–34.

Rencher, A. C. and Schaalje, G. B. (2008). Linear models in statistics. John Wiley & Sons.

Ried, J. S., Shin, S.-Y., Krumsiek, J., Illig, T., Theis, F. J., Spector, T. D., Adamski, J., Wichmann, H.-E., Strauch, K., Soranzo, N., et al. (2014). Novel genetic associations with serum level metabolites identified by phenotype set enrichment analyses. Human molecular genetics 23, 5847–5857.

Riedel, K. S. (1992). A sherman–morrison–woodbury identity for rank augmenting matrices with application to centering. SIAM Journal on Matrix Analysis and Applications 13, 659–662.

Robertson, C. and Fryer, J. (1970). The bias and accuracy of moment estimators.

Roś, B., Bijma, F., de Munck, J. C., and de Gunst, M. C. (2016). Existence and uniqueness of the maximum likelihood estimator for models with a kronecker product covariance structure. Journal of Multivariate Analysis 143, 345–361.

Shi, G., Rice, T. K., Gu, C. C., and Rao, D. C. (2009). Application of three-level linear mixed-effects model incorporating gene-age interactions for association analysis of longitudinal family data. In BMC proceedings, volume 3, page S89. BioMed Central.

Soranzo, N., Rivadeneira, F., Chinappen-Horsley, U., Malkina, I., Richards, J. B., Hammond, N., Stolk, L., Nica, A., Inouye, M., Hofman, A., et al. (2009). Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS genetics 5, e1000445.

Sung, Y. J., Simino, J., Kume, R., Basson, J., Schwander, K., and Rao, D. (2014). Comparison of two methods for analysis of gene–environment interactions in longitudinal family data: the framingham heart study. Frontiers in genetics 5, 9.

Van der Sluis, S., Posthuma, D., and Dolan, C. V. (2013). Tates: efficient multivariate genotype-phenotype analysis for genome-wide association studies. PLoS genetics 9, e1003235.

Visscher, P. M. and Yang, J. (2016). A plethora of pleiotropy across complex traits. Nature genetics 48, 707.

Weir, B. S., Anderson, A. D., and Hepler, A. B. (2006). Genetic relatedness analysis: modern data and new challenges. Nature Reviews Genetics 7, 771–780.

Weisstein, E. W. (2015). Schwarz’s inequality. MathWorld–A Wolfram Web Resource.

Wu, P. Y. (1988). Products of positive semidefinite matrices. Linear Algebra and Its Applications 111, 53–61.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., et al. (2010). Common snps explain a large proportion of the heritability for human height. Nature genetics 42, 565.

Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011). Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics 88, 76–82.

Yang, J., Weedon, M. N., Purcell, S., Lettre, G., Estrada, K., Willer, C. J., Smith, A. V., Ingelsson, E., O’connell, J. R., Mangino, M., et al. (2011). Genomic inflation factors under polygenic inheritance. European Journal of Human Genetics 19, 807.

Yengo, L., Sidorenko, J., Kemper, K. E., Zheng, Z., Wood, A. R., Weedon, M. N., Frayling, T. M., Hirschhorn, J., Yang, J., Visscher, P. M., et al. (2018). Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of european ancestry. Human molecular genetics 27, 3641–3649.

Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American statistical Association 57, 348–368.

Zhou, X. and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nature genetics 44, 821.

Zhou, X. and Stephens, M. (2014). Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature methods 11, 407.

Ziegler, A., Kastner, C., and Blettner, M. (1998). The generalised estimating equations: an annotated bibliography. Biometrical Journal: Journal of Mathematical Methods in Biosciences 40, 115–139.

Appendix A

Supplementary Materials for Chapter 2

A.1 Calculating variance of MFM-MOM heritability estimate

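Suppose that the phenotypes are centered around 0 and that we only have two time-points, i.e. $K = 2$. Let

\[
\begin{aligned}
m_{z11} &= \frac{1}{n_z}\sum_{i=1}^{n_z} (Y^{z}_{i11})^2, &
m_{z22} &= \frac{1}{n_z}\sum_{i=1}^{n_z} (Y^{z}_{i21})^2, &
m_{z33} &= \frac{1}{n_z}\sum_{i=1}^{n_z} (Y^{z}_{i12})^2, \\
m_{z44} &= \frac{1}{n_z}\sum_{i=1}^{n_z} (Y^{z}_{i22})^2, &
m_{z12} &= \frac{1}{n_z}\sum_{i=1}^{n_z} Y^{z}_{i11}Y^{z}_{i21}, &
m_{z34} &= \frac{1}{n_z}\sum_{i=1}^{n_z} Y^{z}_{i12}Y^{z}_{i22}.
\end{aligned}
\]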

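Using multivariate normality one can write,

\[
\begin{pmatrix} m_{z11} \\ m_{z22} \\ m_{z12} \\ m_{z33} \\ m_{z44} \\ m_{z34} \end{pmatrix}
-
\begin{pmatrix} \sigma^2_{z1} \\ \sigma^2_{z1} \\ \sigma^2_{z1} r_{z1} \\ \sigma^2_{z2} \\ \sigma^2_{z2} \\ \sigma^2_{z2} r_{z2} \end{pmatrix}
\sim N_6\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix},\;
\frac{1}{n_z}\begin{pmatrix} \Sigma^*_1 & \Sigma^*_{12} \\ \Sigma^*_{12} & \Sigma^*_2 \end{pmatrix} \right)
\]

where, suppressing the superscript $z$ on $Y$ inside the matrices,

\[
\Sigma^*_1 = \begin{pmatrix}
\mathrm{cov}(Y_{i11}^2, Y_{i11}^2) & \mathrm{cov}(Y_{i11}^2, Y_{i21}^2) & \mathrm{cov}(Y_{i11}^2, Y_{i11}Y_{i21}) \\
\mathrm{cov}(Y_{i21}^2, Y_{i11}^2) & \mathrm{cov}(Y_{i21}^2, Y_{i21}^2) & \mathrm{cov}(Y_{i21}^2, Y_{i11}Y_{i21}) \\
\mathrm{cov}(Y_{i11}Y_{i21}, Y_{i11}^2) & \mathrm{cov}(Y_{i11}Y_{i21}, Y_{i21}^2) & \mathrm{cov}(Y_{i11}Y_{i21}, Y_{i11}Y_{i21})
\end{pmatrix}
\]

\[
\Sigma^*_2 = \begin{pmatrix}
\mathrm{cov}(Y_{i12}^2, Y_{i12}^2) & \mathrm{cov}(Y_{i12}^2, Y_{i22}^2) & \mathrm{cov}(Y_{i12}^2, Y_{i12}Y_{i22}) \\
\mathrm{cov}(Y_{i22}^2, Y_{i12}^2) & \mathrm{cov}(Y_{i22}^2, Y_{i22}^2) & \mathrm{cov}(Y_{i22}^2, Y_{i12}Y_{i22}) \\
\mathrm{cov}(Y_{i12}Y_{i22}, Y_{i12}^2) & \mathrm{cov}(Y_{i12}Y_{i22}, Y_{i22}^2) & \mathrm{cov}(Y_{i12}Y_{i22}, Y_{i12}Y_{i22})
\end{pmatrix}
\]

\[
\Sigma^*_{12} = \begin{pmatrix}
\mathrm{cov}(Y_{i11}^2, Y_{i12}^2) & \mathrm{cov}(Y_{i11}^2, Y_{i22}^2) & \mathrm{cov}(Y_{i11}^2, Y_{i12}Y_{i22}) \\
\mathrm{cov}(Y_{i21}^2, Y_{i12}^2) & \mathrm{cov}(Y_{i21}^2, Y_{i22}^2) & \mathrm{cov}(Y_{i21}^2, Y_{i12}Y_{i22}) \\
\mathrm{cov}(Y_{i11}Y_{i21}, Y_{i12}^2) & \mathrm{cov}(Y_{i11}Y_{i21}, Y_{i22}^2) & \mathrm{cov}(Y_{i11}Y_{i21}, Y_{i12}Y_{i22})
\end{pmatrix}
\]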

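Notice that,

\[
\begin{aligned}
\mathrm{cov}(Y^z_{i11}Y^z_{i21},\, Y^z_{i12}Y^z_{i22})
&= E(Y^z_{i11}Y^z_{i21}Y^z_{i12}Y^z_{i22}) - E(Y^z_{i11}Y^z_{i21})\,E(Y^z_{i12}Y^z_{i22}) \\
&= \sigma^2_{z1}\sigma^2_{z2}\left(\rho_{12}^2 + r_{z12}^2 + r_{z1}r_{z2} - r_{z1}r_{z2}\right) \\
&= \sigma^2_{z1}\sigma^2_{z2}\left(\rho_{12}^2 + r_{z12}^2\right)
\end{aligned}
\]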
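The fourth-moment step above is an instance of Isserlis' theorem for zero-mean jointly normal variables. A worked version is sketched below; it assumes (as the final expression suggests, but as is not restated in this appendix) that $\rho_{12}$ is the correlation of the same individual's phenotype across the two time-points and $r_{z12}$ is the cross-twin, cross-time correlation.

```latex
\begin{aligned}
E(Y^z_{i11}Y^z_{i21}Y^z_{i12}Y^z_{i22})
&= E(Y^z_{i11}Y^z_{i21})E(Y^z_{i12}Y^z_{i22})
 + E(Y^z_{i11}Y^z_{i12})E(Y^z_{i21}Y^z_{i22})
 + E(Y^z_{i11}Y^z_{i22})E(Y^z_{i21}Y^z_{i12}) \\
&= (\sigma^2_{z1} r_{z1})(\sigma^2_{z2} r_{z2})
 + (\sigma_{z1}\sigma_{z2}\rho_{12})^2
 + (\sigma_{z1}\sigma_{z2} r_{z12})^2 \\
&= \sigma^2_{z1}\sigma^2_{z2}\left(r_{z1}r_{z2} + \rho_{12}^2 + r_{z12}^2\right)
\end{aligned}
```

Subtracting $E(Y^z_{i11}Y^z_{i21})E(Y^z_{i12}Y^z_{i22}) = \sigma^2_{z1}\sigma^2_{z2}r_{z1}r_{z2}$ leaves $\sigma^2_{z1}\sigma^2_{z2}(\rho_{12}^2 + r_{z12}^2)$.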

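In a similar manner all the other terms of the matrix $\Sigma^*$ can be derived. We get,

\[
\Sigma^*_1 = \sigma^4_{z1}\begin{pmatrix}
2 & 2r_{z1}^2 & 2r_{z1} \\
2r_{z1}^2 & 2 & 2r_{z1} \\
2r_{z1} & 2r_{z1} & (r_{z1}^2 + 1)
\end{pmatrix},
\qquad
\Sigma^*_2 = \sigma^4_{z2}\begin{pmatrix}
2 & 2r_{z2}^2 & 2r_{z2} \\
2r_{z2}^2 & 2 & 2r_{z2} \\
2r_{z2} & 2r_{z2} & (r_{z2}^2 + 1)
\end{pmatrix}
\]

\[
\Sigma^*_{12} = \sigma^2_{z1}\sigma^2_{z2}\begin{pmatrix}
2\rho_{12}^2 & 2r_{z12}^2 & 2\rho_{12}r_{z12} \\
2r_{z12}^2 & 2\rho_{12}^2 & 2\rho_{12}r_{z12} \\
2\rho_{12}r_{z12} & 2\rho_{12}r_{z12} & (\rho_{12}^2 + r_{z12}^2)
\end{pmatrix}
\]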

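Define a function $g(a, b, c, d, e, f) = \left(\frac{c}{\sqrt{ab}}, \frac{f}{\sqrt{de}}\right)^T$, so that $g(m_{z11}, m_{z22}, m_{z12}, m_{z33}, m_{z44}, m_{z34}) = (\hat{r}_{z1}, \hat{r}_{z2})^T$, i.e. the sample analogue of the vector $(r_{z1}, r_{z2})^T$. Then,

\[
\nabla g(a, b, c, d, e, f)^T = \frac{1}{2}\begin{pmatrix}
-\frac{c}{\sqrt{a^3 b}} & -\frac{c}{\sqrt{a b^3}} & \frac{2}{\sqrt{ab}} & 0 & 0 & 0 \\
0 & 0 & 0 & -\frac{f}{\sqrt{d^3 e}} & -\frac{f}{\sqrt{d e^3}} & \frac{2}{\sqrt{de}}
\end{pmatrix}
\]

So that

\[
\nabla g(\sigma^2_{z1}, \sigma^2_{z1}, \sigma^2_{z1}r_{z1}, \sigma^2_{z2}, \sigma^2_{z2}, \sigma^2_{z2}r_{z2})^T = \frac{1}{2}\begin{pmatrix}
-\frac{r_{z1}}{\sigma^2_{z1}} & -\frac{r_{z1}}{\sigma^2_{z1}} & \frac{2}{\sigma^2_{z1}} & 0 & 0 & 0 \\
0 & 0 & 0 & -\frac{r_{z2}}{\sigma^2_{z2}} & -\frac{r_{z2}}{\sigma^2_{z2}} & \frac{2}{\sigma^2_{z2}}
\end{pmatrix}
\]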

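The variance of the vector $(\hat{r}_{z1}, \hat{r}_{z2})^T$ can now be found using the delta method. Writing $\boldsymbol{\theta}_z = (\sigma^2_{z1}, \sigma^2_{z1}, \sigma^2_{z1}r_{z1}, \sigma^2_{z2}, \sigma^2_{z2}, \sigma^2_{z2}r_{z2})$ and $\Sigma^*$ for the $6 \times 6$ block matrix with blocks $\Sigma^*_1$, $\Sigma^*_{12}$ and $\Sigma^*_2$,

\[
\mathrm{var}\begin{pmatrix} \hat{r}_{z1} \\ \hat{r}_{z2} \end{pmatrix}
= \frac{1}{n_z}\, \nabla g(\boldsymbol{\theta}_z)^T\, \Sigma^*\, \nabla g(\boldsymbol{\theta}_z)
\]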

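Thus the variance and covariance of the estimates would be,

\[
\mathrm{var}(\hat{r}_{z1}) = \frac{1}{4n_z}
\begin{pmatrix} -\frac{r_{z1}}{\sigma^2_{z1}} & -\frac{r_{z1}}{\sigma^2_{z1}} & \frac{2}{\sigma^2_{z1}} \end{pmatrix}
\Sigma^*_1
\begin{pmatrix} -\frac{r_{z1}}{\sigma^2_{z1}} \\ -\frac{r_{z1}}{\sigma^2_{z1}} \\ \frac{2}{\sigma^2_{z1}} \end{pmatrix}
= \frac{1}{n_z}(1 - r_{z1}^2)^2
\]

\[
\mathrm{var}(\hat{r}_{z2}) = \frac{1}{4n_z}
\begin{pmatrix} -\frac{r_{z2}}{\sigma^2_{z2}} & -\frac{r_{z2}}{\sigma^2_{z2}} & \frac{2}{\sigma^2_{z2}} \end{pmatrix}
\Sigma^*_2
\begin{pmatrix} -\frac{r_{z2}}{\sigma^2_{z2}} \\ -\frac{r_{z2}}{\sigma^2_{z2}} \\ \frac{2}{\sigma^2_{z2}} \end{pmatrix}
= \frac{1}{n_z}(1 - r_{z2}^2)^2
\]

\[
\begin{aligned}
\mathrm{cov}(\hat{r}_{z1}, \hat{r}_{z2}) &= \frac{1}{4n_z}
\begin{pmatrix} -\frac{r_{z1}}{\sigma^2_{z1}} & -\frac{r_{z1}}{\sigma^2_{z1}} & \frac{2}{\sigma^2_{z1}} \end{pmatrix}
\Sigma^*_{12}
\begin{pmatrix} -\frac{r_{z2}}{\sigma^2_{z2}} \\ -\frac{r_{z2}}{\sigma^2_{z2}} \\ \frac{2}{\sigma^2_{z2}} \end{pmatrix} \\
&= \frac{1}{n_z}\left[(1 + r_{z1}r_{z2})(\rho_{12}^2 + r_{z12}^2) - 2\rho_{12}r_{z12}(r_{z1} + r_{z2})\right]
\end{aligned}
\]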

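Similarly, when $K > 2$, all such pair-wise covariances can be obtained. The univariate heritabilities and their variances and covariances are then calculated as,

\[
\begin{aligned}
\hat{h}^2_k &= 2(\hat{r}_{1k} - \hat{r}_{2k}) \\
\widehat{\mathrm{var}}(\hat{h}^2_k) &= 4\left[\frac{1}{n_1}(1 - \hat{r}_{1k}^2)^2 + \frac{1}{n_2}(1 - \hat{r}_{2k}^2)^2\right] \\
\widehat{\mathrm{cov}}(\hat{h}^2_k, \hat{h}^2_{k'}) &= \frac{4}{n_1}\left[(1 + \hat{r}_{1k}\hat{r}_{1k'})(\hat{\rho}_{kk'}^2 + \hat{r}_{1kk'}^2) - 2\hat{\rho}_{kk'}\hat{r}_{1kk'}(\hat{r}_{1k} + \hat{r}_{1k'})\right] \\
&\quad + \frac{4}{n_2}\left[(1 + \hat{r}_{2k}\hat{r}_{2k'})(\hat{\rho}_{kk'}^2 + \hat{r}_{2kk'}^2) - 2\hat{\rho}_{kk'}\hat{r}_{2kk'}(\hat{r}_{2k} + \hat{r}_{2k'})\right]
\end{aligned}
\]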
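The plug-in formulas above translate directly into code. The following is a minimal sketch, not the thesis software: the function and argument names are hypothetical, and it assumes, as the displayed formula is written, that the same $\hat{\rho}_{kk'}$ is used for both family types.

```python
def univariate_heritability(r1, r2, n1, n2):
    """Return (h2_hat, var_hat) from the two family-type correlations.

    r1, r2: estimated within-pair correlations for family types 1 and 2.
    n1, n2: number of families of each type.
    """
    h2 = 2.0 * (r1 - r2)
    var = 4.0 * ((1.0 - r1 ** 2) ** 2 / n1 + (1.0 - r2 ** 2) ** 2 / n2)
    return h2, var


def _cov_term(r_k, r_kp, rho, r_cross, n):
    # One family type's contribution to cov(h2_k, h2_k').
    return (4.0 / n) * ((1.0 + r_k * r_kp) * (rho ** 2 + r_cross ** 2)
                        - 2.0 * rho * r_cross * (r_k + r_kp))


def heritability_cov(r1k, r1kp, r2k, r2kp, rho, r1_cross, r2_cross, n1, n2):
    """Plug-in covariance between the heritability estimates at time-points k and k'.

    rho: estimated cross-time correlation (rho_kk'), assumed shared by both types.
    r1_cross, r2_cross: cross-twin cross-time correlations (r_1kk', r_2kk').
    """
    return (_cov_term(r1k, r1kp, rho, r1_cross, n1)
            + _cov_term(r2k, r2kp, rho, r2_cross, n2))
```

A useful sanity check on the covariance formula: setting $k' = k$ (so $\hat{\rho}_{kk} = 1$ and $\hat{r}_{zkk} = \hat{r}_{zk}$) collapses each bracket to $(1 - \hat{r}_{zk}^2)^2$, which recovers the variance expression.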

Multivariate heritability is calculated as a weighted combination of the univariate heritabilities,

\[
\hat{h}^2_{mult} = \sum_{k=1}^{K} \frac{\hat{\sigma}^2_{1k} + \hat{\sigma}^2_{2k}}{\sum_{k'=1}^{K}(\hat{\sigma}^2_{1k'} + \hat{\sigma}^2_{2k'})}\, \hat{h}^2_k
\]

The variance of this estimate can now easily be obtained by using univariate variances and covariances.
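One way to carry out this last step is sketched below. It is an illustration under a simplifying assumption that is not stated in the text: the weights are treated as fixed constants, so the variance of the weighted sum is the quadratic form $w^T \widehat{\mathrm{cov}}(\hat{h}^2)\, w$. The function and argument names are hypothetical.

```python
def multivariate_heritability(h2, cov_h2, total_var):
    """Weighted combination of univariate heritabilities and its variance.

    h2:        list of K univariate heritability estimates.
    cov_h2:    K x K nested list; variances on the diagonal, covariances off it.
    total_var: list of K values sigma_1k^2 + sigma_2k^2 used to form the weights.

    Assumption: weights are treated as non-random when computing the variance.
    """
    K = len(h2)
    denom = sum(total_var)
    w = [v / denom for v in total_var]
    h2_mult = sum(w[k] * h2[k] for k in range(K))
    var_mult = sum(w[k] * w[j] * cov_h2[k][j]
                   for k in range(K) for j in range(K))
    return h2_mult, var_mult
```

With two equally weighted time-points, for example, the variance reduces to one quarter of the sum of all four entries of the $2 \times 2$ covariance matrix.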

A.1.1 Theorems

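Theorem A.1. In the first step of our two-stage estimation we maximize the sum of the marginal log-likelihoods of the $Y^z_{i.k}$'s, denoted by $l_k(\boldsymbol{\gamma}_k)$, for each $k$,

\[
l_k(\boldsymbol{\gamma}_k) = \sum_{i=1}^{m_1} \log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k) + \sum_{i=m_1+1}^{m} \log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)
\]

where $f_z(\cdot \mid x, \boldsymbol{\gamma}_k)$ is the bivariate normal density function with mean $x\boldsymbol{\pi}_k$ and variance-covariance matrix $V^z_k$. The estimate of $\boldsymbol{\gamma}_k$ obtained, i.e. $\hat{\boldsymbol{\gamma}}_k$, is consistent if

1. $\frac{m_1}{m} \to c_1$ and $\frac{m - m_1}{m} \to c_2$ as $m \to \infty$, where $c_1, c_2$ are known constants.

2. $E_1(\log f_1(y_1 \mid x, \boldsymbol{\gamma}_k))$ and $E_2(\log f_2(y_2 \mid x, \boldsymbol{\gamma}_k))$ both exist and are finite. The true value of $\boldsymbol{\gamma}_k$ is $\boldsymbol{\gamma}_{0k} = (\boldsymbol{\pi}_{0k}^T, \boldsymbol{\theta}_{01k}^T)^T$. $E_1$ denotes that the expected value is with respect to the true joint density $f_1(y_1 \mid x, \boldsymbol{\gamma}_{0k})$ and integrated over $y_1$ and $x$. $E_2$ denotes that the expected value is with respect to the true joint density $f_2(y_2 \mid x, \boldsymbol{\gamma}_{0k})$ and integrated over $y_2$ and $x$.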

Proof. We will use Theorem 7.2 of Hayashi (2000). Define the average log-likelihood function as,

\[
Q_k(\boldsymbol{\gamma}_k) = \frac{1}{m}\, l_k(\boldsymbol{\gamma}_k) \tag{A.1}
\]

Clearly the maximizers of $Q_k(\boldsymbol{\gamma}_k)$ and $l_k(\boldsymbol{\gamma}_k)$ are the same. We want to verify the five conditions of Theorem 7.2 of Hayashi (2000),

3 2 1. The parameter space of γ k is Θ = R × R++ × [0, 1] and hence, convex. T T T The true parameter vector γ 0k = (π0k,θθ01k) ∈ Θ.

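2. Using Example 1.6.11 along with Corollary 1.6.2 from Bickel and Doksum (2015), we get that the likelihood of a bivariate normal distribution is log-concave, implying $l_{1k}(\boldsymbol{\gamma}_k)$ and $l_{2k}(\boldsymbol{\gamma}_k)$ are concave. So their sum $l_k(\boldsymbol{\gamma}_k)$, and hence $Q_k(\boldsymbol{\gamma}_k)$, is concave over $\boldsymbol{\gamma}_k$ for any dataset.

3. $Q_k(\boldsymbol{\gamma}_k)$ is continuous and hence measurable for all $\boldsymbol{\gamma}_k \in \Theta$.

4. Define $Q_{0k}(\boldsymbol{\gamma}_k) = c_1 E_1(\log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)) + c_2 E_2(\log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k))$. Using the non-negativity property of the Kullback-Leibler divergence (Joyce, 2011), $Q_{0k}(\boldsymbol{\gamma}_k)$ is uniquely maximized at $\boldsymbol{\gamma}_{0k}$.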

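5. We can write,

\[
\begin{aligned}
Q_k(\boldsymbol{\gamma}_k) &= \frac{1}{m}\, l_k(\boldsymbol{\gamma}_k) \\
&= \frac{1}{m}\left[\sum_{i=1}^{m_1} \log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k) + \sum_{i=m_1+1}^{m} \log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)\right] \\
&= \frac{m_1}{m}\,\frac{1}{m_1}\sum_{i=1}^{m_1} \log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k) + \frac{m - m_1}{m}\,\frac{1}{m - m_1}\sum_{i=m_1+1}^{m} \log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)
\end{aligned}
\]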

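Note that $\log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)$ for $i = 1, \ldots, m_1$ are iid with mean $E_1(\log f_1(y_1 \mid x, \boldsymbol{\gamma}_k))$, and $\log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k)$ for $i = m_1 + 1, \ldots, m$ are iid with mean $E_2(\log f_2(y_2 \mid x, \boldsymbol{\gamma}_k))$. So, by the Weak Law of Large Numbers (WLLN), we can conclude that, as $m_1$ and $m - m_1$ go to $\infty$,

\[
\frac{1}{m_1}\sum_{i=1}^{m_1} \log f_1(Y^1_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k) \xrightarrow{P} E_1(\log f_1(y_1 \mid x, \boldsymbol{\gamma}_k)) \tag{A.2}
\]

\[
\frac{1}{m - m_1}\sum_{i=m_1+1}^{m} \log f_2(Y^2_{i.k} \mid X_{i.k}, \boldsymbol{\gamma}_k) \xrightarrow{P} E_2(\log f_2(y_2 \mid x, \boldsymbol{\gamma}_k)) \tag{A.3}
\]

Also, we assumed that $\frac{m_1}{m} \to c_1$ and $\frac{m - m_1}{m} \to c_2$. Using these limits together with equations (A.2) and (A.3) we get,

\[
Q_k(\boldsymbol{\gamma}_k) \xrightarrow{P} Q_{0k}(\boldsymbol{\gamma}_k).
\]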

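Thus all five conditions of Theorem 7.2 of Hayashi (2000) are verified, and hence we have $\hat{\boldsymbol{\gamma}}_k \xrightarrow{P} \boldsymbol{\gamma}_{0k}$ for $k = 1, \ldots, K$, or,

\[
\begin{pmatrix} \hat{\boldsymbol{\pi}}_k \\ \hat{\boldsymbol{\theta}}_{1k} \end{pmatrix} \xrightarrow{P} \begin{pmatrix} \boldsymbol{\pi}_{0k} \\ \boldsymbol{\theta}_{01k} \end{pmatrix} \text{ for } k = 1, \ldots, K, \quad \text{or,} \quad
\begin{pmatrix} \hat{\boldsymbol{\pi}} \\ \hat{\boldsymbol{\theta}}_{1} \end{pmatrix} \xrightarrow{P} \begin{pmatrix} \boldsymbol{\pi}_{0} \\ \boldsymbol{\theta}_{01} \end{pmatrix}.
\]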

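Theorem A.2. In step two of our two-stage estimation we maximize the following two likelihood functions

\[
\begin{aligned}
q_1(\boldsymbol{\zeta}) &= \sum_{i=1}^{m_1} \log g_1(Y^1_{i1.} \mid X_{i1.}, \boldsymbol{\zeta}) + \sum_{i=m_1+1}^{m} \log g_2(Y^2_{i1.} \mid X_{i1.}, \boldsymbol{\zeta}) \\
q_2(\boldsymbol{\zeta}) &= \sum_{i=1}^{m_1} \log g_1(Y^1_{i2.} \mid X_{i2.}, \boldsymbol{\zeta}) + \sum_{i=m_1+1}^{m} \log g_2(Y^2_{i2.} \mid X_{i2.}, \boldsymbol{\zeta})
\end{aligned}
\]

to get the respective MLEs $\hat{\boldsymbol{\zeta}}^1$ and $\hat{\boldsymbol{\zeta}}^2$, where $g_z(\cdot \mid x, \boldsymbol{\eta})$ is the density function of a $K$-variate normal with mean $(x\boldsymbol{\pi}_k)$ and variance-covariance matrix $\Sigma_z$, and $\boldsymbol{\eta} = (\boldsymbol{\pi}^T, \boldsymbol{\theta}_2^{*T})^T$. The true value of $\boldsymbol{\zeta}$ is $\boldsymbol{\zeta}_0 = (\boldsymbol{\pi}_0^T, \boldsymbol{\theta}_{02}^{*T})^T$. $g(\cdot \mid x, \boldsymbol{\gamma}_k)$ is the density function of a $K$-variate normal with mean $(x\boldsymbol{\pi})$ and variance-covariance matrix $\Sigma$. If condition (1) of Theorem A.1 holds and if the $E_z(\log g_z(y \mid x, \boldsymbol{\eta}))$'s exist and are finite, then $\hat{\boldsymbol{\zeta}}^1$, $\hat{\boldsymbol{\zeta}}^2$ and their average are all consistent, where $E_z$ denotes that the expected value is with respect to the true joint density $g_z(y \mid x, \boldsymbol{\zeta}_0)$ and integrated over $y$ and $x$.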

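Proof. The proof proceeds in the same way as that of Theorem A.1, so we omit the details. To give a brief direction, consider the average log-likelihood functions corresponding to $q_1(\boldsymbol{\zeta})$ and $q_2(\boldsymbol{\zeta})$,

\[
Q^1(\boldsymbol{\zeta}) = \frac{1}{m}\, q_1(\boldsymbol{\zeta}), \qquad Q^2(\boldsymbol{\zeta}) = \frac{1}{m}\, q_2(\boldsymbol{\zeta})
\]

Next we can verify the five conditions of Theorem 7.2 of Hayashi (2000) exactly as in the previous theorem to get $\hat{\boldsymbol{\zeta}}^1 \xrightarrow{P} \boldsymbol{\zeta}_0$ and $\hat{\boldsymbol{\zeta}}^2 \xrightarrow{P} \boldsymbol{\zeta}_0$. As convergence in probability is preserved under addition and scaling by constants, it immediately follows that $\hat{\boldsymbol{\zeta}}^3_2 = \frac{1}{2}(\hat{\boldsymbol{\zeta}}^1_2 + \hat{\boldsymbol{\zeta}}^2_2) \xrightarrow{P} \boldsymbol{\zeta}_{02}$. Thus $\hat{\boldsymbol{\theta}}_2^{*1}$, $\hat{\boldsymbol{\theta}}_2^{*2}$ and $\hat{\boldsymbol{\theta}}_2^{*3}$ are all consistent.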

Appendix B

Supplementary Materials for Chapter 3

B.1 Additional Figures

The supplementary material consists of plots corresponding to the simulation studies performed in Chapter 3.

B.1.1 Simulation from section 3.3.2.1

Figure B.1: The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under simulation from section (3.3.2.1). The dotted line corresponds to the true heritability $h^2 = 0.7$ and the blue dot inside each boxplot refers to the mean point of the estimates.

B.1.2 Simulation from section 3.3.2.2

Figure B.2: The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (a) from section (3.3.2.2). The dotted line corresponds to the true heritability $h^2 = 0.7$.

Figure B.3: The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (b) from section (3.3.2.2). The dotted line corresponds to the true heritability $h^2 = 0.2$.

B.1.3 Simulation from section 3.3.2.3

Figure B.4: The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (a) from section (3.3.2.3). The dotted line corresponds to the true heritability $h^2 = 0.6$.

Figure B.5: The figure shows the box plots of the estimates by two methods for different subsample (knot) sizes under case (b) from section (3.3.2.3). The dotted line corresponds to the true heritability $h^2 = 0.2$.

Appendix C

Supplementary Materials for Chapter 5

C.1 Positive semi-definiteness of RMultiPAR’s covariance assumption

Here we show that under RMultiPAR’s modeling of $V_{ll'}$ as in equation (4.6), the resultant $V = \mathrm{cov}(Y)$ is positive semi-definite.

Theorem C.1. If $V_{ll'} = \rho_{ll'} V_l^{1/2} V_{l'}^{1/2}$, then $V$ is positive semi-definite whenever the $V_l$’s and $\Sigma^* = ((\rho_{ll'}))$ are positive semi-definite.

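Proof. $V$ can be written as,

\[
V = \begin{pmatrix}
V_1 & \rho_{12}V_1^{1/2}V_2^{1/2} & \cdots & \rho_{1L}V_1^{1/2}V_L^{1/2} \\
\rho_{12}V_2^{1/2}V_1^{1/2} & V_2 & \cdots & \rho_{2L}V_2^{1/2}V_L^{1/2} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{1L}V_L^{1/2}V_1^{1/2} & \rho_{2L}V_L^{1/2}V_2^{1/2} & \cdots & V_L
\end{pmatrix}
\]

\[
= \underbrace{\begin{pmatrix}
V_1^{1/2} & 0 & \cdots & 0 \\
0 & V_2^{1/2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & V_L^{1/2}
\end{pmatrix}}_{V^{diag}}
\underbrace{\begin{pmatrix}
I_n & \rho_{12}I_n & \cdots & \rho_{1L}I_n \\
\rho_{12}I_n & I_n & \cdots & \rho_{2L}I_n \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{1L}I_n & \rho_{2L}I_n & \cdots & I_n
\end{pmatrix}}_{\Sigma^* \otimes I_n}
\underbrace{\begin{pmatrix}
V_1^{1/2} & 0 & \cdots & 0 \\
0 & V_2^{1/2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & V_L^{1/2}
\end{pmatrix}}_{V^{diag}}
\]

Since the $V_l$’s are all positive semi-definite, the symmetric square roots $V_l^{1/2}$ are too, and thus $V^{diag}$ is symmetric and positive semi-definite. As $\Sigma^*$ is positive semi-definite, the Kronecker product $\Sigma^* \otimes I_n$ is positive semi-definite as well (Chansangiam et al., 2009). Because $V^{diag}$ is symmetric, for any vector $x$ we have $x^T V x = (V^{diag}x)^T (\Sigma^* \otimes I_n)(V^{diag}x) \geq 0$ (see also Wu, 1988, on products of positive semi-definite matrices), which shows that $V$ is indeed positive semi-definite.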
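The factorization used in the proof can be checked numerically. The sketch below is only an illustration with made-up inputs (random positive semi-definite $V_l$ and an equicorrelation $\Sigma^*$); the helper names are hypothetical, and NumPy is an assumption since the thesis does not name a software stack for this step.

```python
import numpy as np


def sym_sqrt(M):
    """Symmetric PSD square root via the eigendecomposition of M."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T


def assemble_V(V_list, Sigma):
    """Block matrix whose (l, l') block is rho_ll' * V_l^{1/2} V_l'^{1/2}."""
    roots = [sym_sqrt(V) for V in V_list]
    L = len(roots)
    return np.block([[Sigma[l, k] * (roots[l] @ roots[k]) for k in range(L)]
                     for l in range(L)])


def factorized_V(V_list, Sigma):
    """The same matrix written as V_diag (Sigma kron I_n) V_diag."""
    n, L = V_list[0].shape[0], len(V_list)
    V_diag = np.zeros((n * L, n * L))
    for l, V in enumerate(V_list):
        V_diag[l * n:(l + 1) * n, l * n:(l + 1) * n] = sym_sqrt(V)
    return V_diag @ np.kron(Sigma, np.eye(n)) @ V_diag


# Illustrative inputs: L = 3 traits, n = 5 individuals.
rng = np.random.default_rng(0)
n, L = 5, 3
V_list = []
for _ in range(L):
    B = rng.standard_normal((n, n))
    V_list.append(B @ B.T)                       # PSD by construction
Sigma = 0.4 * np.ones((L, L)) + 0.6 * np.eye(L)  # unit diagonal, PSD

V = assemble_V(V_list, Sigma)
min_eig = np.linalg.eigvalsh(V).min()            # non-negative up to round-off
```

Both constructions give the same matrix, and its smallest eigenvalue is non-negative up to floating-point error, in line with Theorem C.1.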

C.2 Comparing the assumptions of RMultiPAR with the traditional approach

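In both the traditional approach and RMultiPAR, for each trait the covariance matrix is modeled as $V_l = \sigma_{1ll}C_1 + \sigma_{Ell}I_n$. The difference lies in modeling the cross-trait covariance: in the traditional approach (equation (4.5)), it is modeled as $\sigma_{1ll'}C_1 + \sigma_{Ell'}I_n$ (let $V^*_{ll'} = \sigma_{1ll'}C_1 + \sigma_{Ell'}I_n$), and in RMultiPAR (equation (4.6)) it is modeled as $\rho_{ll'}V_l^{1/2}V_{l'}^{1/2}$ (let $V^{**}_{ll'} = \rho_{ll'}V_l^{1/2}V_{l'}^{1/2}$). Now let us look at the following derivations,

\[
\begin{aligned}
V_l &= \sigma_{1ll}\left(C_1 + \frac{\sigma_{Ell}}{\sigma_{1ll}}I_n\right) = \sigma_{1ll}(C_1 + r_{ll}I_n); \quad r_{ll} = \frac{\sigma_{Ell}}{\sigma_{1ll}} \\
V^{**}_{ll'} &= \rho_{ll'}V_l^{1/2}V_{l'}^{1/2} = \rho_{ll'}\sqrt{\sigma_{1ll}\sigma_{1l'l'}}\,(C_1 + r_{ll}I_n)^{1/2}(C_1 + r_{l'l'}I_n)^{1/2} \\
V^{*}_{ll'} &= \sigma_{1ll'}C_1 + \sigma_{Ell'}I_n = \sigma_{1ll'}(C_1 + r_{ll'}I_n); \quad r_{ll'} = \frac{\sigma_{Ell'}}{\sigma_{1ll'}} \\
\rho_{1ll'} &= \frac{\sigma_{1ll'}}{\sqrt{\sigma_{1ll}\sigma_{1l'l'}}}, \quad \rho_{Ell'} = \frac{\sigma_{Ell'}}{\sqrt{\sigma_{Ell}\sigma_{El'l'}}}
\end{aligned}
\]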

Theorem C.2. $V^{*}_{ll'}$ and $V^{**}_{ll'}$ are exactly the same if $\rho_{1ll'} = \rho_{Ell'} = \rho_{ll'}$ and $r_{ll} = r_{l'l'}$.

Proof. Let $\lambda_i$ be the $i$-th largest eigenvalue of the GRM $C_1$. The eigen-decomposition of the matrix $(C_1 + r_{ll'}I_n)$ will be,

\[
(C_1 + r_{ll'}I_n) = QD_{ll'}Q^T \tag{C.1}
\]

where $D_{ll'}$ is a diagonal matrix with the elements $(\lambda_1 + r_{ll'}, \ldots, \lambda_n + r_{ll'})$ and $Q$ is the orthogonal matrix of eigenvectors of $C_1$. So a square root of the above matrix will be $(C_1 + r_{ll'}I_n)^{1/2} = QD_{ll'}^{1/2}Q^T$, where $D_{ll'}^{1/2}$ is a diagonal matrix with the elements $(\lambda_1 + r_{ll'})^{1/2}, \ldots, (\lambda_n + r_{ll'})^{1/2}$. Next we can write,

\[
V^{*}_{ll'} = \sigma_{1ll'}\,QD_{ll'}Q^T, \qquad V^{**}_{ll'} = \rho_{ll'}\sqrt{\sigma_{1ll}\sigma_{1l'l'}}\;QD_{ll}^{1/2}D_{l'l'}^{1/2}Q^T \tag{C.2}
\]

The elements of $D_{ll}^{1/2}D_{l'l'}^{1/2}$ are $\sqrt{(\lambda_1 + r_{ll})(\lambda_1 + r_{l'l'})}, \ldots, \sqrt{(\lambda_n + r_{ll})(\lambda_n + r_{l'l'})}$.

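By the AM-GM inequality note that,

\[
\begin{aligned}
\sqrt{(\lambda_j + r_{ll})(\lambda_j + r_{l'l'})} &= \sqrt{\lambda_j^2 + (r_{ll} + r_{l'l'})\lambda_j + r_{ll}r_{l'l'}} \\
&\geq \sqrt{\lambda_j^2 + 2\sqrt{r_{ll}r_{l'l'}}\,\lambda_j + r_{ll}r_{l'l'}} \\
&= \lambda_j + \sqrt{r_{ll}r_{l'l'}}
\end{aligned}
\]

where equality occurs only when $r_{ll} = r_{l'l'}$. We can write,

\[
r_{ll'} = \frac{\sigma_{Ell'}}{\sigma_{1ll'}} = \frac{\rho_{Ell'}}{\rho_{1ll'}}\sqrt{r_{ll}r_{l'l'}} \tag{C.3}
\]

Next, we notice that if $\rho_{1ll'} = \rho_{Ell'} = \rho_{ll'}$ and $r_{ll} = r_{l'l'}$, then equation (C.3) gives $r_{ll'} = \sqrt{r_{ll}r_{l'l'}} = r_{ll}$, so

\[
\sqrt{(\lambda_j + r_{ll})(\lambda_j + r_{l'l'})} = \lambda_j + r_{ll'} \implies D_{ll}^{1/2}D_{l'l'}^{1/2} = D_{ll'}
\]

\[
V^{**}_{ll'} = \rho_{ll'}\sqrt{\sigma_{1ll}\sigma_{1l'l'}}\;QD_{ll}^{1/2}D_{l'l'}^{1/2}Q^T = \rho_{1ll'}\sqrt{\sigma_{1ll}\sigma_{1l'l'}}\;QD_{ll'}Q^T = \sigma_{1ll'}\,QD_{ll'}Q^T = V^{*}_{ll'}
\]

Thus, we have shown that $V^{*}_{ll'}$ and $V^{**}_{ll'}$ are equal.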
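Because $V^{*}_{ll'}$ and $V^{**}_{ll'}$ share the eigenvectors $Q$ of $C_1$, comparing them reduces to comparing their eigenvalues. The following sketch does exactly that for a handful of illustrative GRM eigenvalues; the function names and the numeric inputs are hypothetical and chosen only to show the two regimes of Theorem C.2.

```python
import math


def eigs_traditional(lams, s1_ll, s1_kk, sE_ll, sE_kk, rho1, rhoE):
    """Eigenvalues of V*_ll' = sigma_1ll' C_1 + sigma_Ell' I_n."""
    s1_lk = rho1 * math.sqrt(s1_ll * s1_kk)   # sigma_1ll'
    sE_lk = rhoE * math.sqrt(sE_ll * sE_kk)   # sigma_Ell'
    return [s1_lk * lam + sE_lk for lam in lams]


def eigs_rmultipar(lams, s1_ll, s1_kk, sE_ll, sE_kk, rho):
    """Eigenvalues of V**_ll' = rho_ll' V_l^{1/2} V_l'^{1/2}."""
    return [rho * math.sqrt((s1_ll * lam + sE_ll) * (s1_kk * lam + sE_kk))
            for lam in lams]


lams = [0.1, 0.5, 1.0, 2.0, 3.0]  # illustrative GRM eigenvalues

# Conditions of Theorem C.2 hold: r_ll = r_l'l' = 0.5 and rho_1 = rho_E = rho.
same_a = eigs_traditional(lams, 2.0, 4.0, 1.0, 2.0, 0.3, 0.3)
same_b = eigs_rmultipar(lams, 2.0, 4.0, 1.0, 2.0, 0.3)

# Condition r_ll = r_l'l' violated: r_ll = 0.5 but r_l'l' = 2.
diff_a = eigs_traditional(lams, 2.0, 4.0, 1.0, 8.0, 0.3, 0.3)
diff_b = eigs_rmultipar(lams, 2.0, 4.0, 1.0, 8.0, 0.3)
```

In the first setting the two eigenvalue lists coincide; in the second, the RMultiPAR eigenvalues are at least as large as the traditional ones, as the AM-GM step predicts.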

C.3 Development of Adjusted RMultiPAR

The usual RMultiPAR test statistic $T_M$ from equation (4.11) asymptotically follows a simple $\chi^2_L$ distribution when the identity from equation (4.8) holds which, in turn, holds when the assumption $V_{ll'} = V^{**}_{ll'}$ is true. In this section we derive the asymptotic distribution of $T_M$ when $V_{ll'} = V^{*}_{ll'} \neq V^{**}_{ll'}$.

Let us define $A = \Sigma^* \otimes I_n$, $R = (X^{*T}A^{-1}X^{*})^{-1}$, and $W$ to be the true covariance matrix of the transformed vector $Y^*$, i.e. $W = \mathrm{cov}(Y^*)$. For simplicity, we assume that the REMLE satisfies $(\hat{V}_l^{RF})^{-1/2} = V_l^{-1/2}$. The diagonal and off-diagonal blocks of $W$, namely $W_l$ and $W_{ll'}$, have the following structure,

\[
W_l = \mathrm{cov}(V_l^{-1/2}Y_l,\, V_l^{-1/2}Y_l) = V_l^{-1/2}V_lV_l^{-1/2} = I_n \tag{C.4}
\]

\[
W_{ll'} = \mathrm{cov}(V_l^{-1/2}Y_l,\, V_{l'}^{-1/2}Y_{l'}) = V_l^{-1/2}V_{ll'}V_{l'}^{-1/2} \tag{C.5}
\]

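From equations (4.9) and (4.10) we have,

\[
\begin{aligned}
Y^* &\sim N_{nL}(\boldsymbol{\mu}, W); \quad \boldsymbol{\mu} = X^*\boldsymbol{\beta} \\
T_M &= (H\hat{\boldsymbol{\beta}}_{FGLS})^T\,(H\,\mathrm{cov}(\hat{\boldsymbol{\beta}}_{FGLS})\,H^T)^{-1}\,(H\hat{\boldsymbol{\beta}}_{FGLS}) \\
&= Y^{*T}A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}Y^{*}
\end{aligned}
\]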

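Since $T_M$ is a quadratic form in $Y^*$, its distribution can be approximated by an $a\chi^2_b$ distribution, where $a$ and $b$ are determined by matching the first two moments of $T_M$ with those of the $a\chi^2_b$ distribution. We look at the expectation and variance of the test statistic.

\[
\begin{aligned}
E(T_M) &= E(Y^{*T}A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}Y^{*}) \\
&= E(\mathrm{trace}(Y^{*T}A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}Y^{*})) \\
&= E(\mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}Y^{*}Y^{*T})) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}E(Y^{*}Y^{*T})) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}(W + \boldsymbol{\mu}\boldsymbol{\mu}^T)) \\
&= \mathrm{trace}(X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}WA^{-1}) \\
&\quad + \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}\boldsymbol{\mu}\boldsymbol{\mu}^T) \\
&= \mathrm{trace}(\underbrace{RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}WA^{-1}X^{*}}_{P}) \\
&\quad + \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}\boldsymbol{\mu}\boldsymbol{\mu}^T) \\
&= M_1 + M_2
\end{aligned}
\]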

Similarly, by Rencher and Schaalje (2008), the variance of $T_M$ would be,

\[
\mathrm{var}(T_M) = 2\,\mathrm{trace}(P^2) + 4M_2 = 2M_3 + 4M_2
\]

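Now note that under $H_0$,

\[
\begin{aligned}
M_2 &= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}\boldsymbol{\mu}\boldsymbol{\mu}^T) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}\boldsymbol{\beta}\boldsymbol{\beta}^TX^{*T}) \\
&= \mathrm{trace}(RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}\boldsymbol{\beta}\boldsymbol{\beta}^TX^{*T}A^{-1}X^{*}) \\
&= \mathrm{trace}((HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}\boldsymbol{\beta}\boldsymbol{\beta}^TX^{*T}A^{-1}X^{*}RH^T) \\
&= \mathrm{trace}((HRH^T)^{-1}H\boldsymbol{\beta}\boldsymbol{\beta}^TH^T) = 0 \quad (\text{since } H\boldsymbol{\beta} = 0)
\end{aligned}
\]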

Therefore when $C$ is known, $a$ and $b$ can be estimated by moment matching as $a = \frac{M_3}{M_1}$ and $b = \frac{M_1^2}{M_3}$. When $V_{ll'} = V^{**}_{ll'}$, equation (C.5) gives us,

\[
W_{ll'} = \rho_{ll'}I_n \quad \text{or,} \quad W = \Sigma^* \otimes I_n = A.
\]

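In this scenario we have,

\[
\begin{aligned}
M_1 &= \mathrm{trace}(X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}WA^{-1}) \\
&= \mathrm{trace}(X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}) \\
&= \mathrm{trace}((HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}RH^T) \\
&= \mathrm{trace}((HRH^T)^{-1}HRR^{-1}RH^T) \\
&= \mathrm{trace}((HRH^T)^{-1}HRH^T) = \mathrm{trace}(I_L) = L
\end{aligned}
\]

\[
\begin{aligned}
M_3 &= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}WA^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}W) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRR^{-1}RH^T(HRH^T)^{-1}HRX^{*T}) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRH^T(HRH^T)^{-1}HRX^{*T}) \\
&= \mathrm{trace}(A^{-1}X^{*}RH^T(HRH^T)^{-1}HRX^{*T}) \\
&= \mathrm{trace}((HRH^T)^{-1}HRX^{*T}A^{-1}X^{*}RH^T) \\
&= \mathrm{trace}((HRH^T)^{-1}HRH^T) = L
\end{aligned}
\]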

Thus, when $A$ and $W$ are equal we will have $a = 1$, $b = L$, i.e. $T_M$ would follow a $\chi^2_L$ distribution asymptotically. Notice that $P$ is of dimension $L$, and using the Cauchy-Schwarz inequality with traces we get,

\[
\mathrm{trace}(P) \leq \sqrt{\mathrm{trace}(P^2)\,\mathrm{trace}(I_L^2)} \implies b = \frac{M_1^2}{M_3} = \frac{\mathrm{trace}(P)^2}{\mathrm{trace}(P^2)} \leq L, \qquad a = \frac{M_1}{b} \geq \frac{M_1}{L}
\]
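The moment-matching step and the bound $b \leq L$ can be illustrated with a few lines of code. This is a sketch with hypothetical helper names; it takes the eigenvalues of $P$ as input and assumes they are real, so that $M_1 = \mathrm{trace}(P)$ and $M_3 = \mathrm{trace}(P^2)$ are the sums of the eigenvalues and of their squares.

```python
def moments_from_eigenvalues(eigs):
    """M1 = trace(P) and M3 = trace(P^2), given the eigenvalues of P."""
    m1 = sum(eigs)
    m3 = sum(e * e for e in eigs)
    return m1, m3


def moment_match(m1, m3):
    """Scale a and degrees of freedom b of the a*chi^2_b approximation under H0.

    Matches E(T_M) = a*b = M1 and var(T_M) = 2*a^2*b = 2*M3.
    """
    a = m3 / m1
    b = m1 * m1 / m3
    return a, b


# W = A case: P behaves like the identity on L = 4 dimensions, so a = 1, b = L.
a_null, b_null = moment_match(*moments_from_eigenvalues([1.0, 1.0, 1.0, 1.0]))

# Unequal (illustrative) eigenvalues: b drops below L and a moves away from 1.
a_adj, b_adj = moment_match(*moments_from_eigenvalues([0.5, 1.0, 1.5, 2.0]))
```

When all eigenvalues of $P$ equal 1 the approximation returns the plain $\chi^2_L$; any spread in the eigenvalues reduces $b$ below $L$, which is the Cauchy-Schwarz bound above.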

We investigate what happens to the inference when the true distribution of $T_M$ is $a\chi^2_b$ but we use the cutoffs from a $\chi^2_L$ distribution. When $a$ is close to 1 or less than 1, the test will be conservative. When $a$ is significantly higher than 1, the test will be inflated. Under the traditional covariance assumption, i.e. when $V_{ll'} = V^{*}_{ll'}$, from equation (C.5) we have,

\[
W_{ll'} = V_l^{-1/2}(\sigma_{1ll'}C_1 + \sigma_{Ell'}I_n)V_{l'}^{-1/2} \tag{C.6}
\]

Therefore the parameters $\sigma_{1ll'}$ and $\sigma_{Ell'}$ need to be estimated in order to find $a$ and $b$. This can be done by fitting a REML model, just like the variance component estimation step of MTMM, but then the computational complexity of our method would be the same as that of MTMM. We therefore recommend estimating these parameters by a moment-based approach instead (Ge et al., 2016), which is much faster. Alternatively, a bootstrap-based approach (Boos and Stefanski, 2013) can be used to estimate $E(T_M)$ and $\mathrm{var}(T_M)$, which in turn can be used to estimate $a$ and $b$. We refer to this $a$, $b$ adjusted version of the RMultiPAR test statistic as Adjusted RMultiPAR.

C.4 Discussion about Adjusted RMultiPAR

In this section we argue why, in most general scenarios, $a$ and $b$ will respectively be almost equal to 1 and $L$, which means that using the simple $\chi^2_L$ distribution would not bias the association inference. Using equations (C.1), (C.2) and (C.6) we can write,

\[
\begin{aligned}
W_{ll'} &\approx \frac{\sigma_{1ll'}}{\sqrt{\sigma_{1ll}\sigma_{1l'l'}}}\; Q\underbrace{D_{ll}^{-1/2}D_{ll'}D_{l'l'}^{-1/2}}_{D^{*}_{ll'}}Q^{-1} \\
&= \rho_{1ll'}\,QD^{*}_{ll'}Q^T
\end{aligned} \tag{C.7}
\]

$D^{*}_{ll'}$ is a diagonal matrix with elements $\lambda^{*}_{jll'} = \frac{\lambda_j + r_{ll'}}{\sqrt{(\lambda_j + r_{ll})(\lambda_j + r_{l'l'})}}$, $j = 1, \ldots, n$. Equation (C.7) can be rewritten as,

\[
W_{ll'} = \rho_{1ll'}\sum_{j=1}^{n}\lambda^{*}_{jll'}\,Q_jQ_j^T \tag{C.8}
\]

where $Q_j$ is the $j$-th column of $Q$. The $\lambda_j$’s are all equal to 1 if the GRM ($C_1$) is the identity, i.e. when the individuals are all genetically unrelated. In such a case, the $\lambda^{*}_{jll'}$’s are all equal to some constant $z = \frac{1 + r_{ll'}}{\sqrt{(1 + r_{ll})(1 + r_{l'l'})}}$. Letting $\rho_{ll'} = z\rho_{1ll'}$ we can simplify equation (C.7) as,

\[
W_{ll'} = \rho_{ll'}\sum_{j=1}^{n}Q_jQ_j^T = \rho_{ll'}I_n \implies W = \Sigma^* \otimes I_n = A
\]

Thus, just as in the earlier section, in this particular scenario we also have $M_1 = M_3 = L$, or $a = 1$, $b = L$.

When we have distantly related individuals, a SNP-based GRM is usually computed as $C_1 = \frac{1}{r}ZZ^T$, where $Z$ is the $n \times r$ matrix of standardized SNP genotypes ($r$ SNPs in total). It is a well-known result that the empirical spectral distribution of $C_1$ asymptotically follows the Marchenko-Pastur distribution as the ratio $\frac{n}{r} \to \tau$ and $n, r \to \infty$. It can further be shown that the smallest eigenvalue $\lambda_n \to (1 - \sqrt{\tau})^2$ and the largest eigenvalue $\lambda_1 \to (1 + \sqrt{\tau})^2$ (Jiang et al., 2014). We investigate how the $\lambda^{*}_{jll'}$’s look under these assumptions. We consider the following function $f$,

\[
f(x) = \frac{x + r_{ll'}}{\sqrt{(x + r_{ll})(x + r_{l'l'})}}
\]

and study how the function behaves for $x \in ((1 - \sqrt{\tau})^2, (1 + \sqrt{\tau})^2)$. If we can show that $f$ takes very closely packed values, we will be able to conclude that the $\lambda^{*}_{jll'}$’s are almost equal.

In biobank datasets, $\tau$ is usually less than 1 (otherwise $C_1$ will be singular). We keep it at $\tau = 0.6$. We consider different combinations of the values of $r_{ll'}$, $r_{ll}$ and $r_{l'l'}$. We note that $r_{ll}$ can be written as $r_{ll} = \frac{\sigma_{Ell}}{\sigma_{1ll}} = \frac{1 - h_l}{h_l}$, where $h_l = \frac{\sigma_{1ll}}{\sigma_{1ll} + \sigma_{Ell}}$ is the heritability of the $l$-th trait. To replicate a few realistic scenarios, we consider three different setups: a) where both traits $l$, $l'$ have heritability more than 0.5, $r_{ll} = 0.5$, $r_{l'l'} = 0.2$; b) where one of the traits has heritability more than 0.5, $r_{ll} = 0.5$, $r_{l'l'} = 5$; and c) where both traits have heritability less than 0.5, $r_{ll} = 5$, $r_{l'l'} = 10$. Motivated by equation (C.3), in all three setups we consider three different values of $r_{ll'}$: $\left(\frac{1}{2}\sqrt{r_{ll}r_{l'l'}},\ \sqrt{r_{ll}r_{l'l'}},\ 1.5\sqrt{r_{ll}r_{l'l'}}\right)$. Thus, for each of the cases a, b and c we have subcases 1, 2 and 3 corresponding to the three different values of $r_{ll'}$. We vary $x$ from $(1 - \sqrt{\tau})^2 = (1 - \sqrt{0.6})^2 = 0.051$ to $(1 + \sqrt{\tau})^2 = (1 + \sqrt{0.6})^2 = 3.15$ in steps of 0.002. We plot the function values for the different cases and subcases in figure C.1 and list the mean and sd of the values in table C.1.

Figure C.1: Plot of function $f$ for $x \in ((1 - \sqrt{\tau})^2, (1 + \sqrt{\tau})^2)$ under different cases and subcases from section (C.4).

Table C.1: The table lists the mean (and sd in brackets) of the values the function $f$ takes in the interval $((1 - \sqrt{\tau})^2, (1 + \sqrt{\tau})^2)$ for datapoints separated by 0.002.

|        | subcase 1    | subcase 2    | subcase 3   |
|--------|--------------|--------------|-------------|
| case a | 0.87 (0.083) | 0.98 (0.005) | 1.1 (0.075) |
| case b | 0.63 (0.063) | 0.87 (0.025) | 1.1 (0.1)   |
| case c | 0.58 (0.04)  | 0.99 (0.003) | 1.4 (0.047) |
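The grid evaluation behind Table C.1 can be sketched as follows. This is an illustrative reimplementation rather than the code used for the thesis: the function names are hypothetical, the grid end-point handling is a guess, and the sample standard deviation is assumed (the text does not say which sd was used), so small differences from the tabulated values are possible.

```python
import math
import statistics


def f(x, r_ll, r_kk, r_lk):
    """f(x) = (x + r_ll') / sqrt((x + r_ll)(x + r_l'l'))."""
    return (x + r_lk) / math.sqrt((x + r_ll) * (x + r_kk))


def summarize(tau, r_ll, r_kk, multiplier, step=0.002):
    """Mean and sd of f over the Marchenko-Pastur support for a given subcase.

    multiplier: 0.5, 1.0 or 1.5, giving r_ll' = multiplier * sqrt(r_ll * r_l'l').
    """
    lo = (1.0 - math.sqrt(tau)) ** 2
    hi = (1.0 + math.sqrt(tau)) ** 2
    r_lk = multiplier * math.sqrt(r_ll * r_kk)
    n_steps = int((hi - lo) / step)
    values = [f(lo + i * step, r_ll, r_kk, r_lk) for i in range(n_steps + 1)]
    return statistics.mean(values), statistics.stdev(values)


# Case a (r_ll = 0.5, r_l'l' = 0.2), subcase 2 (r_ll' = sqrt(r_ll * r_l'l')).
mean_a2, sd_a2 = summarize(0.6, 0.5, 0.2, 1.0)
```

For subcase 2 the AM-GM inequality guarantees $f(x) \leq 1$ on the whole grid, so the mean sits just below 1 with a small spread, which is the pattern the table reports.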

We see that in all the scenarios $f$ takes very closely packed values, which implies that in most realistic scenarios the $\lambda^{*}_{jll'}$’s will be very close to some constant $z$. We can then simplify equation (C.8) as,

\[
W_{ll'} \approx \rho_{1ll'}\sum_{j=1}^{n} z\,Q_jQ_j^T = \rho_{ll'}I_n; \quad \rho_{ll'} = z\rho_{1ll'} \implies W \approx \Sigma^* \otimes I_n = A
\]

Thus, we will have $M_1 \approx L$, $M_3 \approx L$, or $a \approx 1$, $b \approx L$.

C.5 Simulation with more number of traits

Figure C.2: The plot compares the type 1 error and power of different methods under the simulation setup of section (4.3.1.2) of the main paper with 6 and 8 traits respectively.

From figure C.2, we see that the overall trend for both cases is similar to the case with 4 traits in section (4.3.1.2). RMultiPAR still has the highest power when half the traits are associated, and loses power as the number of associated traits increases beyond that.

C.5.1 Manhattan Plots

Figure C.3: The figure shows the Manhattan plots of the negative of the log p-values from the different methods. The blue line corresponds to the negative of the logarithm of $1 \times 10^{-5}$. The red line corresponds to the negative of the logarithm of the threshold of $1 \times 10^{-8}$.

Figure C.3 shows the Manhattan plots of the negative of the log p-values from the different methods. We see that the Manhattan plots of RMultiPAR and MinP are very similar. An interesting observation is that on chromosome 16 all the methods display similar p-values.
