THEORETICAL AND COMPUTATIONAL STUDIES

OF BAYESIAN LINEAR MODELS

A Dissertation Submitted to the Faculty of

The Graduate School

Baylor College of Medicine

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

by

QUAN ZHOU

Houston, Texas

March 16, 2017

APPROVED BY THE DISSERTATION COMMITTEE

Signed Yongtao Guan, Ph.D., Chairman

Rui Chen, Ph.D.

Dennis Cox, Ph.D.

Chris Man, Ph.D.

Michael Schweinberger, Ph.D.

APPROVED BY THE STRUCTURAL AND COMPUTATIONAL

BIOLOGY & MOLECULAR BIOPHYSICS GRADUATE PROGRAM

Signed Aleksandar Milosavljevic, Ph.D., Director of SCBMB

APPROVED BY THE INTERIM DEAN OF

GRADUATE BIOMEDICAL SCIENCES

Signed Adam Kuspa, Ph.D.

Date

Acknowledgements

First of all, I want to express my deepest gratitude to my advisor Dr. Yongtao Guan and all the other members of our group for their tremendous help in every aspect of my life throughout the past five years. The completion of my degree projects should be attributed to Dr. Guan's amazing ideas, selfless support and unceasing hard work. From him I learned programming and statistical skills, enthusiasm for and devotion to science, and how to conduct academic research. My sincere thanks also go to the other group members, Hang Dai, Zhihua Qi, Hanli Xu and Liang Zhao, for their kindness, friendship and encouragement. Second, I want to thank my thesis committee, Dr. Rui Chen, Dr. Dennis Cox, Dr. Chris Man and Dr. Michael Schweinberger, for their amiable, generous and continuous academic help and guidance. Special thanks to Dr. Cox for his excellent courses (Mathematical Statistics I and II, Stochastic Process, Multivariate Analysis, Functional Data Analysis), which opened a new world for me. In addition, I want to thank Dr. Philip Ernst, who has collaborated with me on a probability paper and offered me priceless opportunities such as giving lectures for the course Mathematical Probability. Third, I am truly grateful to my program, SCBMB, and the Graduate School of BCM, which have always supported me in choosing my research and learning the skills that I like. Without the learning environment they have created, I could have achieved almost nothing. I also want to thank Rice University for the courses I took there over four years and for its hospitality to a visiting student like me. Last, but by no means least, I want to thank my parents and my friends and teachers in and outside of Houston, who have filled my PhD journey with hope and happiness.

Dedication

This thesis is dedicated to my family, especially my grandfather and grandmother, who passed away during my PhD study. To their love I shall be immensely and forever indebted.

Abstract

Statistical methods have been extensively applied to genome-wide association studies to demystify the genetic architecture of many common complex diseases in the past decade. Bayesian methods, though not as popular as traditional methods, have been used for various purposes, such as association testing, causal SNP identification, heritability estimation and genotype imputation. This work focuses on Bayesian methods based on linear regression.

Bayesian hypothesis testing reports a (null-based) Bayes factor instead of a p-value. For linear regression, it is shown in Chap. 2.1 that under the null model of no effect, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of independent χ²_1 random variables, where the weights are all between 0 and 1. Similarly, under the alternative model with some necessary conditions on the effect size, 2 log(Bayes factor) is asymptotically distributed as a weighted sum of independent noncentral chi-squared random variables. An immediate benefit is that the p-values associated with the Bayes factors can be computed analytically rather than by permutation, which is of vital importance in genome-wide association studies: due to multiple testing, the significance threshold in whole-genome studies is extremely small, and permutation is in fact impractical. Furthermore, the asymptotic results help explain the behaviour of the Bayes factor and the origin of some well-known paradoxes, like Bartlett's paradox (Chap. 2.2). Lastly, in light of this null distribution, a new statistic named the scaled Bayes factor is proposed. It is defined via a rescaling of the Bayes factor so that the expectation of log(scaled Bayes factor) is fixed to zero (or some other constant). Its practical and theoretical benefits are discussed in Chap. 5.1, and Chap. 5.2 describes an application of the scaled Bayes factor to the analysis of a real whole-genome dataset for intraocular pressure.

For multiple linear regression, the computation of the p-value associated with the Bayes factor requires evaluating the distribution function of a weighted sum of independent χ²_1 random variables. We implemented in C++ a recent polynomial method of Bausch [2013], which appears to be the most efficient solution so far (Chap. 2.3.2 and 2.3.3). Simulation studies (Chap. 2.3.4) show that the p-values computed according to the asymptotic null distribution have very good calibration, even for very large Bayes factors, validating the use of this method in genome-wide association studies.

The expression of the Bayes factor for linear regression contains the posterior mean estimator for the regression coefficients, which is also called the ridge estimator by non-Bayesians. When X^tX is available (X denotes the design matrix), ridge estimators are usually computed via the Cholesky decomposition of the matrix X^tX + cI, which is efficient but still has cubic complexity in the number of regressors. A new iterative method, called ICF (iterative solutions using complex factorization), is proposed in Chap. 3. It assumes that the Cholesky decomposition of X^tX has already been obtained. Simulation (Chap. 3.5) shows that, when ICF is applicable, it is much better than the Cholesky decomposition and other iterative methods like the Gauss-Seidel algorithm.

The ICF algorithm fits perfectly with the Bayesian variable selection regression proposed by Guan and Stephens [2011], since in MCMC the Cholesky decomposition of X^tX can be obtained by efficient updating algorithms (but not for X^tX + cI if c is changing). A reimplementation of their method using ICF substantially improves the efficiency of posterior inference (Chap. 4.3). Simulation studies (Chap. 4.4) show that the new method can efficiently estimate the heritability of a quantitative trait and report well-calibrated posterior inclusion probabilities. Furthermore, compared with another popular software package, GCTA (Chap. 7.5), the new method has much better prediction performance (Chap. 4.4.3).

The use of the scaled Bayes factor for variable selection is discussed in Chap. 5.3. To achieve consistency, the scaling factor is calibrated using the data (Chap. 5.3.1). Simulation studies demonstrate that, after the calibration, the scaled Bayes factor performs at least as well as the unscaled Bayes factor in both heritability estimation and prediction (Chap. 5.4).

Contents

Approvals
Acknowledgements
Abstract
Symbols and Notations
Abbreviations

1 Introduction
  1.1 Genome-Wide Association Studies
    1.1.1 Some Genetic Concepts
    1.1.2 Some Statistical Concepts
  1.2 Bayesian Linear Regression
  1.3 Applications of Bayesian Linear Regression to GWAS
    1.3.1 Association Testing
    1.3.2 Variable Selection, Heritability Estimation and Prediction

2 Distribution and P-value of the Bayes Factor
  2.1 Distribution of Bayes Factors in Linear Regression
    2.1.1 Distributions of Quadratic Forms
    2.1.2 Asymptotic Distributions of log BFnull
    2.1.3 Asymptotic Results in Presence of Confounding Covariates
  2.2 Properties of the Bayes Factor and Its P-value
    2.2.1 Comparison with the P-values of the Frequentists' Tests
    2.2.2 Independent Normal Prior and Zellner's g-prior
    2.2.3 Behaviour of the Bayes Factor and Three Paradoxes
    2.2.4 Behaviour of the P-value Associated with the Bayes Factor
    2.2.5 More about Simple Linear Regression
  2.3 Computation of the P-values Associated with Bayes Factors
    2.3.1 Bartlett-type Correction
    2.3.2 Bausch's Method
    2.3.3 Implementation of Bausch's Method
    2.3.4 Calibration of the P-values

3 A Novel Algorithm for Computing Ridge Estimators
  3.1 Background
  3.2 Direct Methods for Computing Ridge Estimators
    3.2.1 Spectral Decomposition of X^tX
    3.2.2 Cholesky Decomposition of X^tX + Σ
    3.2.3 QR Decomposition of the Block Matrix [X^t Σ^{1/2}]^t
    3.2.4 Bidiagonalization Methods
  3.3 Iterative Methods for Computing Ridge Estimators
    3.3.1 Jacobi, Gauss-Seidel and Successive Over-Relaxation
    3.3.2 Steepest Descent and Conjugate Gradient
  3.4 A Novel Iterative Method Using Complex Factorization
    3.4.1 ICF and Its Convergence Properties
    3.4.2 Tuning the Relaxation Parameter for ICF
  3.5 Performance Comparison by Simulation
    3.5.1 Methods
    3.5.2 Wall-time Usage, Convergence Rate and Accuracy

4 Bayesian Variable Selection Regression
  4.1 Background and Literature Review
    4.1.1 Models for Bayesian Variable Selection
    4.1.2 Methods for the Model Fitting
  4.2 The BVSR Model of Guan and Stephens
    4.2.1 Model and Prior
    4.2.2 MCMC Implementation
  4.3 A Fast Novel MCMC Algorithm for BVSR using ICF
    4.3.1 The Exchange Algorithm
    4.3.2 Updating of the Cholesky Decomposition
    4.3.3 Summary of fastBVSR Algorithm
  4.4 GWAS Simulation
    4.4.1 Posterior Inference for the Heritability
    4.4.2 Calibration of Posterior Inclusion Probabilities
    4.4.3 Prediction Performance
    4.4.4 Wall-time Usage

5 Scaled Bayes Factors
  5.1 Motivations for Scaled Bayes Factors
  5.2 An Application to Intraocular Pressure GWAS Datasets
  5.3 Scaled Bayes Factors in Variable Selection
    5.3.1 Calibrating the Scaling Factors
    5.3.2 Prediction Properties
  5.4 Simulation Studies for Variable Selection

6 Summary and Future Directions
  6.1 Summary of This Work
  6.2 Specific Aims for Future Studies
    6.2.1 Bayesian Association Tests Based on Haplotype or Local Ancestry
    6.2.2 Application of ICF to Variational Methods for Variable Selection
    6.2.3 Extension of This Work to Categorical Phenotypes

7 Appendices
  7.1 Linear Algebra Results
    7.1.1 Some Matrix Identities
    7.1.2 Singular Value Decomposition and Pseudoinverse
    7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition
    7.1.4 Orthogonal Projection Matrices
  7.2 Bayesian Linear Regression
    7.2.1 Posterior Distributions for the Conjugate Priors
    7.2.2 Bayes Factors for Bayesian Linear Regression
    7.2.3 Controlling for Confounding Covariates
  7.3 Big-O and Little-O Notations
  7.4 Distribution of a Weighted Sum of χ²_1 Random Variables
    7.4.1 Davies' Method for Computing the Distribution Function
    7.4.2 Methods for Computing the Bounds for the P-values
  7.5 GCTA and Linear Mixed Model
    7.5.1 Restricted Maximum Likelihood Estimation
    7.5.2 Newton-Raphson's Method for Computing REML Estimates
    7.5.3 Details of GCTA's Implementation of REML Estimations
  7.6 Metropolis-Hastings Algorithm
  7.7 Used Real Datasets
    7.7.1 Merged Intraocular Pressure Dataset
    7.7.2 Height Dataset

Bibliography

List of Figures

2.1 Comparison between P_BF, P_F and P_LR for simple linear regression
2.2 Power comparison between P_BF and P_LR for p = 2
2.3 How BFnull changes with σ
2.4 Calibration of P_BF and P_LR for p = 10
2.5 Calibration of P_BF and P_LR for p = 20
3.1 The relationship between ρ(Ψ) and n
3.2 Distribution of optimal ρ(Ψ) in presence of multicollinearity
3.3 Wall time usage of ICF, Chol, GS, SOR and CG
3.4 Iterations used by ICF, SOR and CG
3.5 Accuracy of ICF, SOR and CG
4.1 Heritability estimation with 200 causal SNPs
4.2 Heritability estimation with 1000 causal SNPs
4.3 Posterior estimation for the model size
4.4 Calibration of the posterior inclusion probabilities
4.5 Calibration of the PIPs with insufficient MCMC iterations
4.6 Relative prediction gain of fastBVSR and GCTA
4.7 Wall time used by fastBVSR for 10K MCMC iterations
5.1 How BFnull and sBF change with σ
5.2 Distributions of BFnull and sBF in the IOP dataset
5.3 Heritability estimation using the scaled Bayes factor
5.4 Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF
5.5 Relative prediction gain of the scaled Bayes factor

List of Tables

3.1 Wall time usage of ICF, Chol, GS, SOR and CG under the null model
4.1 Heritability estimation with 200 causal SNPs
5.1 Top 20 single SNP associations by BFnull (σ = 0.2) in the IOP dataset
5.2 Top 20 single SNP associations by BFnull (σ = 0.5) in the IOP dataset
5.3 Taylor series approximations for the ideal scaling factors
5.4 Heritability estimation using the scaled Bayes factor

Symbols and Notations

R                     the real numbers
a                     a scalar
a (lowercase bold)    a vector
||a||_p               the ℓ_p-norm of a
A (uppercase bold)    a matrix
A^t                   transpose of A
A^*                   conjugate transpose of A
|A|                   determinant of A
tr(A)                 trace of A
I                     identity matrix
I_n                   identity matrix of size n × n
P(A)                  probability of event A
E[X]                  expected value of X
:=                    equal by definition
I_A(x)                indicator function
l.h.s.                left-hand side
r.h.s.                right-hand side
i.i.d.                independent and identically distributed
ind.                  independent

Abbreviations

BVSR Bayesian variable selection regression

BFnull null-based Bayes factor

FDR false discovery rate

GWAS genome-wide association study

ICF iterative solutions using complex factorization

LASSO least absolute shrinkage and selection operator

LD linkage disequilibrium

LRT likelihood ratio test

MAF minor allele frequency

MAP maximum a posteriori

MSE mean squared error

MSPE mean squared prediction error

MCMC Markov chain Monte Carlo

MVN multivariate normal distribution

PIP posterior inclusion probability

REML restricted/residual maximum likelihood

RPG relative prediction gain

sBF scaled Bayes factor

SNP single nucleotide polymorphism

SVD singular value decomposition

Chapter 1

Introduction

1.1 Genome-Wide Association Studies

Genome-wide association studies (GWASs) refer to analyses that use a dense set of SNPs across the whole genome with the ultimate aim of predicting disease risks and identifying the genetic foundations of complex diseases [Bush and Moore, 2012]. Unlike candidate-gene association analysis [Hirschhorn and Daly, 2005], a GWAS scans the whole genome, and a typical study may involve about one million SNPs. Since the discovery of complement factor H as a susceptibility gene for age-related macular degeneration (AMD) [Edwards et al., 2005, Haines et al., 2005, Klein et al., 2005], GWASs have had great success in demystifying the genetic architecture of some common complex diseases [McCarthy et al., 2008], such as breast cancer [Easton et al., 2007], prostate cancer [Thomas et al., 2008], type I and type II diabetes [Todd et al., 2007, Zeggini et al., 2008] and inflammatory bowel disease [Duerr et al., 2006]. A complete list of GWAS findings is available in the NHGRI-EBI (National Human Genome Research Institute and European Bioinformatics Institute) GWAS Catalog [Welter et al., 2014]. Besides, GWASs have also shed light on individual drug metabolism and given birth to the idea of personalized medicine (see Motsinger-Reif et al. [2013] for a review).

In this section I will explain several important genetic or statistical concepts that are indispensable to the understanding of GWAS. Some notions like heritability and the details of the statistical methods will be explicated in later sections.

1.1.1 Some Genetic Concepts

Single Nucleotide Polymorphism Commonly abbreviated as SNP, single nucleotide polymorphism refers to a single base-pair change in the DNA sequence that has a high prevalence, say greater than 1%, in some population. Most SNPs are biallelic, and the frequency of the less common allele is denoted by MAF (minor allele frequency). For example, suppose we know some SNP with major allele A and minor allele T has an MAF equal to 0.2 in some population (human DNA contains only four types of nucleotides: A, T, C, G). Then at that SNP locus, under certain conditions, we may estimate that about 64% of that population have genotype AA, about 4% have genotype TT, and the remaining 32% have genotype AT (the human genome is diploid). Here the Hardy-Weinberg equilibrium is implicitly assumed, which refers to a set of conditions under which the number of copies of the minor allele follows a binomial distribution. SNPs with more than two alleles are very rare and are often excluded from statistical analysis. Note that in meta-analysis, when we merge datasets from different studies, great care should be taken when dealing with A/T and C/G SNPs, since a flipping of the reference allele can easily cause a false positive in the association testing. Most SNPs do not pose any obvious harm to human health, since either they are located in the non-coding region, which occupies about 98% of the whole genome, or the allele change is synonymous, which means the amino acid coded remains the same. Some SNPs may cause amino acid changes and thereby directly alter protein functions; however, the overall effect on the human body is usually imperceptible, since otherwise they would probably have been eliminated by natural selection. SNPs with MAF lower than 5% (or sometimes 1%) are often called rare variants [Committee, 2009, Lee et al., 2014]. Extremely rare variants, which can be detrimental to protein functions, are often called mutations [Bush and Moore, 2012].
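The 64%/32%/4% figures above are just the Hardy-Weinberg (binomial) expansion of the allele frequencies. A minimal sketch of this calculation, assuming a biallelic SNP under random mating (the code is purely illustrative):

```python
# Genotype frequencies under Hardy-Weinberg equilibrium for a biallelic SNP.
# Assumes random mating; maf is the minor allele frequency (0.2 in the example above).
def hwe_genotype_freqs(maf):
    p, q = 1.0 - maf, maf          # major and minor allele frequencies
    return {"AA": p * p,           # homozygous major: 0.64 when maf = 0.2
            "AT": 2 * p * q,       # heterozygous:     0.32
            "TT": q * q}           # homozygous minor: 0.04

print(hwe_genotype_freqs(0.2))
```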

Quantitative and Categorical Traits In GWAS, the response variable y in a regression model is often referred to as the trait or the phenotype. A trait can be categorical, in fact often binary, for example the case/control status of a disease like cancer. In such situations, an association test between the trait and a SNP directly estimates how likely the SNP is to be disease-predisposing, together with its odds ratio. For some other diseases, there exists a quantitative trait that is known as a risk factor, for example the lipid level [Kathiresan et al., 2008]. Then the association test with that quantitative trait can also help us find the genetic variants that could be used for predicting disease risks. Other typical continuous traits used in GWASs include height [Weedon et al., 2008] and fat mass [Scuteri et al., 2007]. Height appears to be the best candidate trait for studying heritability (see Chap. 1.3.2).

Common Disease-Common Variant Hypothesis Underlying most of the GWAS methodologies targeting common complex diseases is the common disease-common variant (CD-CV) hypothesis [Lander, 1996, Reich and Lander, 2001], which predicts that the genetic risk factors for common diseases like diabetes and cancer should mostly be alleles with relatively high frequencies (> 1%). Hence traditional family-based genetic studies may fail because of the very small effects of the disease-predisposing variants (since the total number of causal variants is large), but a whole-genome scan and joint analysis of the genetic variants for hundreds or thousands of case and control subjects are believed to lead to some findings. Early evidence supporting this hypothesis includes the high-MAF variants in the APOE gene that are risk factors for Alzheimer's disease [Corder et al., 1993]. For some diseases, guided by this principle, exciting discoveries have been made, as mentioned at the beginning of this section, while for some other diseases like asthma and coronary heart disease, GWASs have been much less fruitful [McCarthy et al., 2008].

There are doubts about the CD-CV hypothesis [Pritchard and Cox, 2002]. As a complement, genome-wide rare-variant association studies have attracted increasing attention in the past decade and made impressive breakthroughs [Li and Leal, 2008]. But unlike common variant analysis, one major difficulty of rare variant analysis is low statistical power (to be explained shortly) due to the low frequencies of the variants (the effective sample size is small). See Asimit and Zeggini [2010] and Lee et al. [2014] for reviews of the methods for rare variant studies. In this thesis, rare variants are not given special treatment.

Linkage Disequilibrium Linkage disequilibrium (LD) may be defined as the degree to which an allele of one SNP is correlated with an allele of another SNP [Bush and Moore, 2012]. In meiosis, recombination can break a chromosome into several segments, and thus the genome of a child is a mosaic of his/her parental genomes. However, nearby SNPs on a paternal or a maternal chromosome are usually inherited together, since the recombination rate is low. Even after many generations, fixed combination patterns of nearby alleles are still very common across the whole genome, and they are often called haplotypes. When two SNPs have statistical correlation equal to 1, we say they are in perfect LD. When the correlation is zero, we say they are in linkage equilibrium. Note that for two biallelic SNPs, uncorrelatedness is equivalent to independence. For other commonly used measures of LD, see Devlin and Risch [1995]. Linkage disequilibrium brings both benefits and problems to GWASs. Thanks to LD, we do not need to genotype every SNP of the human genome to search for the "causal" variants, because a SNP in LD with the truly causal variant would also show association with the phenotype. This is called indirect association [Collins et al., 1997]. As a result, whether susceptibility genes can be detected largely depends on the degree of LD between the genotyped variant and the truly causal variant [Ohashi and Tokunaga, 2001]. On the other hand, the efficiency of GWAS is reduced by LD, since association tests for different SNPs are correlated. We will discuss this issue in greater detail later.

Imputation and Phasing Missing genotypes are common in GWAS datasets. If the missing rate is very low, one may simply replace each missing value with the mean or the median of that SNP, but this is clearly not the ideal solution. Furthermore, sometimes we need to combine datasets from different studies that were generated from different genotyping platforms and different SNP arrays. Because the intersection of the genotyped SNPs is often small, there can be a large proportion of missing values. One way to make full use of the data in such situations is imputation, which means inferring the missing SNPs from the neighbouring genotyped SNPs. The key idea behind imputation is linkage disequilibrium, i.e., the fact that SNPs located close together are not independently distributed. Existing software packages include BIMBAM [Guan and Stephens, 2008], IMPUTE [Howie et al., 2009], MACH [Biernacka et al., 2009] and BEAGLE [Browning and Browning].

Another very similar concept is phasing, which refers to the inference of haplotypes from genotypes. The human genome is diploid and contains two copies of the haploid genome. Hence the genotype of a SNP takes a value in {0, 1, 2}, which represents the count of minor (or major) alleles. Current genotyping technology can only report the genotypes but cannot distinguish which alleles come from the same chromosome. Phasing is then used to separate a sequence of genotypes into two copies called haplotypes. Just like imputation, phasing also relies on linkage disequilibrium; both use hidden Markov models to model the LD and make inferences. Indeed, most of the imputation tools can perform phasing as well. Software packages designed specifically for phasing include SHAPEIT [Delaneau et al., 2008] and fastPHASE [Scheet and Stephens, 2006]. The booming next-generation sequencing technology is generating large amounts of phased SNP data (sequencing data), and perhaps in the near future association tests using haplotypes instead of SNPs will become a standard strategy for GWAS.

1.1.2 Some Statistical Concepts

Statistical Significance and Power Consider an association test with a single SNP. The goal is to figure out whether the SNP has an effect (direct or indirect) on the phenotype. Regarding the testing result there are four possible scenarios: true positive, true negative, false positive and false negative. If a SNP that has no effect is called positive by the test, we call it a false positive or a type I error. If a causal SNP is not identified by the test, we call it a false negative or a type II error. In the language of hypothesis testing, the null hypothesis is that the SNP has no effect on the phenotype, and the probability of incorrectly rejecting a null hypothesis is called the type I error rate or the size of the test. In practice a type I error is usually much more harmful than a type II error, and thus when conducting a hypothesis test we would like to control the type I error rate under some threshold, which is called the significance level and often denoted by α. To this end, a statistic named the p-value is computed, such that by rejecting the null when the p-value is smaller than α, the type I error rate is controlled. In fact the p-value is equal to the "tail probability" under the null hypothesis, and it is sometimes referred to as the "observed significance level". A smaller p-value indicates a more significant association (it does not imply a larger effect size). Another important concept in hypothesis testing, power, is defined as the probability of correctly rejecting the null, i.e., one minus the type II error rate. It is the most critical metric when comparing different testing methods: at a given significance level, the method with the larger power is deemed better. The power of a GWAS testing procedure depends on four factors: the testing method, the significance level, the true effect size of the causal variant, and the information we have in the data. In general, when we have a larger sample size or the causal variant has a larger MAF, the test has greater power. Since in most cases the causal SNPs have only small effects on the trait, how to construct a more powerful test is a central question for the statisticians engaged in GWASs. For more discussion of power in GWAS, see de Bakker et al. [2005] among others.

Single SNP Test The single SNP test, which means testing every SNP separately for association with the phenotype, is the most common statistical strategy for detecting causal variants in GWAS. Here by "causal" we mean that the variant has either a direct or an indirect effect, i.e., the "causal variant" may not be the exact genetic cause of the disease but is truly correlated with the phenotype and could be used for predicting disease risk. See Morris and Kaplan [2002] and Martin et al. [2000] for both theoretical and empirical reasons why the single SNP test is preferred to the multi-locus test. From a practical point of view, the multi-locus test faces many computational difficulties, since the number of possible models from a GWAS dataset is enormous. Even if we only want to test all the possible two-SNP or three-SNP models, the total number is already much greater than what modern computers can handle.

For a binary trait (case/control status), the traditional single SNP test methods include logistic regression and contingency table tests. The Cochran-Armitage trend test [Agresti and Kateri, 2011, Chap. 5.3] is deemed the best choice in many settings [McCarthy et al., 2008]. For a quantitative trait, linear regression, generalized linear regression and ANOVA are often used. The genotype is often coded as 0, 1 or 2 to represent the number of copies of the minor allele, but when genotypes are imputed they may take any value within [0, 2]. The effect of the minor allele can be modelled in several ways: dominant, recessive, multiplicative and additive. A simple linear regression model with the {0, 1, 2} coding corresponds to the additive model.
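To make the additive coding concrete, the following sketch simulates a {0, 1, 2}-coded genotype and a quantitative trait and fits the corresponding simple linear regression; the data, sample size and effect size are hypothetical and serve only to illustrate the single SNP test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, maf = 1000, 0.3
g = rng.binomial(2, maf, size=n)        # additive {0,1,2} genotype coding
y = 0.2 * g + rng.normal(size=n)        # quantitative trait with a small additive effect

# Simple linear regression y = mu + beta*g + e; the slope test is the additive single-SNP test.
slope, intercept, r, pval, se = stats.linregress(g, y)
print(f"beta_hat = {slope:.3f}, p-value = {pval:.2e}")
```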

Bayesian approaches to single SNP testing are not as popular as non-Bayesian methods, mainly due to the difficulty of producing p-values, though the Bayesian substitute, the Bayes factor, has its own advantages. The theoretical work in Chap. 2 will offer a solution. Nevertheless, there are some successful attempts at Bayesian methods for single SNP testing. See, for example, Marchini et al. [2007] and Servin and Stephens [2007].

Confounding Covariates Apart from genetic variants, there can be other factors influencing the phenotype, for example age and sex. Theoretically speaking, we always need to control for these confounding variables in order to avoid spurious associations and increase testing power. For a somewhat contrived example, consider a sex-linked trait like red-green color blindness. If we do not control for sex when the study subjects contain equally many males and females, probably a large proportion of the SNPs located on the X chromosome would be tested positive.

The confounding factor that most needs to be worried about in practice is population stratification. Many complex diseases are known to have different prevalence rates in different populations (like Asian, African and European). If the population stratification is not appropriately accounted for when the dataset consists of subjects from different populations, the ethnicity-specific SNPs are very likely to be tested positive. In such cases, "inflated" p-values can often be observed. To explain this, recall that under the null hypothesis, the p-value is uniformly distributed on (0, 1) [Casella and Berger, 2002, Chap. 8.3]. Since in GWAS as many as 1 million SNPs may be tested for association, most of the tests are expected to be under the null and, consequently, the p-values should exhibit a uniform distribution on (0, 1) except at the tail. But if some confounding factor fails to be controlled for, the p-values (after ordering) can display a clear overall tendency to be smaller than their expected values, which is called inflation. See Gamazon et al. [2015] for a figure of inflated p-values; it seems that population stratification was not controlled for in that study. A simple method for correcting inflation is genomic control [Devlin and Roeder, 1999, Devlin et al., 2001]. Today people usually prefer to use principal component analysis [Price et al., 2006]. One can do the eigendecomposition oneself and add the first three to ten principal component scores as covariates, or use software like STRUCTURE [Pritchard et al., 2000] and EIGENSTRAT [Price et al., 2006]. The International HapMap Project [The International HapMap Consortium, 2010] provides samples from different ethnic groups that have been very densely genotyped and are often used as the reference panel in the principal component analysis.
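The inflation just described is often summarized by the genomic-control factor λ_GC of Devlin and Roeder [1999], the ratio of the median observed test statistic to the median expected under the null. A minimal sketch, assuming the p-values come from 1-d.f. chi-squared tests:

```python
import numpy as np
from scipy import stats

def genomic_control_lambda(pvalues):
    """Genomic-control inflation factor: convert p-values back to 1-d.f. chi-squared
    statistics and divide their median by the chi-squared_1 median (about 0.455)."""
    chi2_obs = stats.chi2.isf(np.asarray(pvalues), df=1)   # p-value -> chi-squared_1 statistic
    return np.median(chi2_obs) / stats.chi2.ppf(0.5, df=1)

# Under the null, p-values are Uniform(0,1) and lambda_GC should be close to 1.
rng = np.random.default_rng(1)
print(genomic_control_lambda(rng.uniform(size=100_000)))
```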

Correction for Multiple Testing For a single test, one simply compares the p-value with the significance threshold α. However, when one performs multiple tests and still wants to control the probability of making one or more type I errors below α, a more stringent cutoff for the p-values is needed. For instance, if two independent tests both have type I error rate equal to 0.05, the probability of making at least one type I error is as much as 0.098. The most widely used, and probably the most convenient, correction method is the Bonferroni correction, which is an approximation to Šidák's formula [Šidák, 1968, 1971]. Unfortunately, both the Bonferroni and the Šidák correction methods are derived assuming independence of the tests and turn out to be too stringent in GWAS, so that some true signals might have to be discarded. Thus many substitutes for the Bonferroni correction have been proposed, for example Nyholt [2004] and Conneely and Boehnke [2007]. A non-parametric method for calculating the necessary p-value threshold is permutation, which is implemented in software like PLINK [Purcell et al., 2007] and PERMORY [Pahl and Schäfer, 2010].
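For reference, the two classical corrections reduce to simple per-test thresholds; the sketch below computes both for m tests (the numbers are illustrative only).

```python
def bonferroni_threshold(alpha, m):
    # Reject when p < alpha / m; guarantees family-wise error rate <= alpha.
    return alpha / m

def sidak_threshold(alpha, m):
    # Reject when p < 1 - (1 - alpha)^(1/m); exact for m independent tests.
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

# With one million tests the two thresholds are nearly identical (about 5e-8),
# which is the origin of the usual genome-wide significance threshold.
print(bonferroni_threshold(0.05, 10**6), sidak_threshold(0.05, 10**6))
```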

A simple rule of thumb for determining whether a p-value is significant in GWAS is the so-called genome-wide significance threshold. Due to the LD between the SNPs, even if all the SNPs across the whole genome are genotyped, the effective number of independent tests is much smaller. It is estimated that most of the SNPs in the human genome could be expressed as a linear combination of 500,000 to 1,000,000 SNPs. Using the dataset from the Wellcome Trust Case-Control Consortium [Burton et al., 2007], Dudbridge and Gusnanto [2008] estimated the genome-wide significance threshold to be 7.2 × 10^{-8}, which corresponds to a 0.05 family-wise type I error rate, for GWASs with subjects of European descent. A more widely used threshold that can be applied to any GWAS is 5 × 10^{-8} [Barsh et al., 2012, Panagiotou and Ioannidis, 2012, Jannot et al., 2015]. It can be thought of as obtained from the Bonferroni correction with α = 0.05 assuming 1 million independent SNPs. Thus this threshold should only be used when the total number of tests is greater than 1 million.

Another approach is to control the false discovery rate (FDR) [Benjamini and Hochberg, 1995] instead of the type I error rate. In effect it allows a larger p-value cutoff, and thus more SNPs would be declared significant. With the rapid development of biological technologies, the validation of a causal variant by molecular experiment becomes easier, and thus scientists are willing to increase the test power at the cost of more type I errors. See Storey and Tibshirani [2003] for a discussion on the use of FDR in GWAS. See Sun and Cai [2009] for a review of different methods for controlling FDR.

1.2 Bayesian Linear Regression

The following Bayesian linear regression model is the main object of this work:

    y | β, τ ∼ MVN(Xβ, τ^{-1}I),
    β | τ, V ∼ MVN(0, τ^{-1}V),                                  (1.1)
    τ | κ_1, κ_2 ∼ Gamma(κ_1/2, κ_2/2),   κ_1, κ_2 → 0.

Here y = (y_1, ..., y_n) is the response vector, X is an n × p design matrix, and β is a p-vector called the regression coefficients. I denotes the identity matrix and MVN stands for the multivariate normal distribution. The first statement in (1.1) is equivalent to

    y = Xβ + ε,   ε | τ ∼ MVN(0, τ^{-1}I),

and thus implicitly the errors ε_1, ..., ε_n are assumed to be i.i.d. normal random variables with mean 0 and variance τ^{-1}. The second and third lines of (1.1) are called the normal-inverse-gamma prior, which is conjugate for the normal linear model. The only prior parameter that needs to be specified is the covariance matrix V; the other parameters, κ_1 and κ_2, are sent to 0 to represent a noninformative setting. See Chap. 7.2 for more details and variations of this model. Here is a summary of important points.

• All of our major results hold for the full regression model y = Wa + Xb + ε, where W represents the confounding covariates to be controlled for and X represents the variables of interest. The Bayes factor for this full model is equivalent to the Bayes factor for model (1.1) once y and X are replaced with their residuals after regressing out W. See Chap. 7.2.3 for proof and discussion.

• The intercept term does not need to be included in model (1.1) for the reason explained in the last remark: it is equivalent to centering both X and y. However, there is a slight difference between the following two statements: (a) y | β, τ ∼ MVN(Xβ, τ^{-1}I); (b) y | β, τ, µ ∼ MVN(Xβ + µ, τ^{-1}I). Because µ is unknown, when it is integrated out the errors "lose" one degree of freedom. This difference, nevertheless, has very little effect, as we will see in Chap. 2.1. The same rationale applies to regressing out W.

• The prior for τ is equivalent to the well-known Jeffreys prior, which is most commonly used in the literature. It is the standard choice in a noninformative setting. The posteriors for β and τ are still proper.

• In rare applications, it might be more desirable to assume τ^{-1} is known. Inference with known error variance is also discussed in Chap. 7.2. Note that as n goes to infinity, τ can be estimated precisely, in the sense that its posterior contracts to the true value. Therefore, the case of known error variance is again a special case, or rather the limiting case as n → ∞, of model (1.1). This intuition is very important in deriving the asymptotic distribution of the Bayes factor (Chap. 2.1).

The conditional posterior of β given τ is (see Chap. 7.2.1)

    β | y, τ, V ∼ MVN((X^tX + V^{-1})^{-1}X^ty, τ^{-1}(X^tX + V^{-1})^{-1}).

Hence, the maximum a posteriori (MAP) estimator for β is

    β̂ = (X^tX + V^{-1})^{-1}X^ty.                               (1.2)

The null-based Bayes factor for model (1.1) is given by (see Chap. 7.2.2)

    BFnull = |I + X^tXV|^{-1/2} ((y^ty − y^tX(X^tX + V^{-1})^{-1}X^ty) / y^ty)^{-n/2}.      (1.3)

It is straightforward to check that the BFnull defined in (1.3) is invariant to the scaling of y. Throughout this study, whenever we refer to a Bayes factor, the null model is used as the reference unless otherwise stated. For the covariance matrix V, we consider two choices:

    Independent normal prior:  V = σ²I;                          (1.4)
    Zellner's g-prior:         V = g(X^tX)^{-1}.                 (1.5)
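For moderate p, the quantities in (1.2)-(1.5) can be computed directly from these formulas. The sketch below uses simulated data and the independent normal prior (1.4) with a hypothetical σ = 0.5; it is only meant to illustrate the algebra, not the C++ implementation used later in this work.

```python
import numpy as np

def ridge_and_log_bf(X, y, V):
    """Posterior mean / ridge estimator (1.2) and log BFnull (1.3) for model (1.1)."""
    n, p = X.shape
    A = X.T @ X + np.linalg.inv(V)
    beta_hat = np.linalg.solve(A, X.T @ y)                 # (X^tX + V^{-1})^{-1} X^t y
    rss_ratio = 1.0 - y @ X @ np.linalg.solve(A, X.T @ y) / (y @ y)
    log_bf = -0.5 * np.linalg.slogdet(np.eye(p) + X.T @ X @ V)[1] \
             - 0.5 * n * np.log(rss_ratio)
    return beta_hat, log_bf

rng = np.random.default_rng(0)
n, p, sigma = 500, 3, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([0.3, 0.0, -0.2]) + rng.normal(size=n)
X -= X.mean(axis=0); y -= y.mean()                         # center X and y, as discussed above
V = sigma**2 * np.eye(p)                                   # independent normal prior (1.4)
beta_hat, log_bf = ridge_and_log_bf(X, y, V)
print(beta_hat, log_bf)
```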

1.3 Applications of Bayesian Linear Regression to GWAS

There is probably little doubt that linear regression is the most widely used statistical model. In genome-wide association studies (GWAS), Bayesian linear regression, though much less popular than its non-Bayesian counterpart, has been applied for various purposes in an extensive body of literature. For a review, see Balding [2006] and Stephens and Balding [2009].

1.3.1 Association Testing

Although the regression model appears to imply a prospective study design (y is random and X is fixed), it could be applied to retrospective studies as well, as justified by Seaman and Richardson [2004]. For a quantitative trait, the single SNP test could be performed using model (1.1) where X has only one column. Servin and Stephens [2007] proposed such a model and discussed how to choose a noninformative and improper prior that admits a proper Bayes factor, which will be used in the derivation of the distribution of Bayes factors in Chap. 2. For a binary trait, the Bayesian logistic regression model is the appropriate choice [Marchini et al., 2007]. Wakefield [2009] proposed an asymptotic method which is very efficient compared with most inference methods for logistic regression. On a side note, both Servin and Stephens [2007] and Marchini et al. [2007] discussed how to perform association testing using imputed genotypes.

For many non-statisticians, an uneasy feature of the Bayesian association test, or more generally Bayesian hypothesis testing, is that it produces a Bayes factor instead of a p-value. To be fair, each statistic has its merits and demerits; see Kass and Raftery [1995], Lavine and Schervish [1999], Katki [2008] and Goodman [1999] among many others for comparisons between p-values and Bayes factors. Bayesians would probably prefer the Bayes factor since it measures the evidence for the alternative hypothesis while the p-value does not. Another practical advantage of the Bayes factor is its convenience in combining multiple tests. The Bayes factor comparing a model M against the null model M_0 is defined by

    BFnull(M) := p(y | M) / p(y | M_0),

where p(y | ·) denotes the marginal likelihood of the model (see Chap. 7.2.2 for more details). Suppose we have a small candidate genomic region that contains K SNPs and we want to average the association test over these K SNPs. Then the Bayes factor for this SNP set, or for this region, is simply

    BFnull(M) = (1 / p(y | M_0)) Σ_{i=1}^{K} p(y | x_i, M) p(x_i | M).

Similarly we can also average over the four genetic models: dominant, recessive, additive and multiplicative. This method was implemented in Marchini et al. [2007].
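With equal prior weights p(x_i | M) = 1/K, the regional Bayes factor above is simply the average of the K single-SNP Bayes factors. A minimal sketch of this averaging, working on the log scale for numerical stability and assuming the single-SNP log Bayes factors have already been computed:

```python
import numpy as np

def region_log_bf(single_snp_log_bfs, prior_weights=None):
    """Log Bayes factor for a SNP set: a weighted average of single-SNP Bayes
    factors, computed with the log-sum-exp trick to avoid overflow."""
    log_bfs = np.asarray(single_snp_log_bfs, dtype=float)
    k = log_bfs.size
    w = np.full(k, 1.0 / k) if prior_weights is None else np.asarray(prior_weights)
    m = log_bfs.max()
    return m + np.log(np.sum(w * np.exp(log_bfs - m)))

# Example: three SNPs in a candidate region with equal prior weights.
print(region_log_bf([0.5, 4.2, 1.1]))
```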

The multi-locus association test with model (1.1) is often used to model the joint effect of the SNPs within a restricted region, for example a candidate gene. Servin and Stephens [2007] tested all the possible K-QTN (QTN: quantitative trait nucleotide) models within a given region for K = 1, 2, 3, 4. Another potential application of multiple linear regression is rare variant studies. The sequence kernel association test (SKAT) proposed by Wu et al. [2011] and Ionita-Laza et al. [2013] uses the variance component model, a non-Bayesian method, to identify the rare variants associated with the phenotype. Using the Bayesian linear regression model (1.1) with the independent normal prior (1.4), the idea of SKAT can be interpreted as testing the null hypothesis σ² = 0 against the alternative σ² > 0.

1.3.2 Variable Selection, Heritability Estimation and Prediction

A typical variable selection procedure with the regression model assumes

    y = Σ_{i=1}^{N} β_i x_i + ε,   ε ∼ MVN(0, τ^{-1}I),

where N is the total number of SNPs in the dataset and most of the β's are equal to 0. Variable selection simultaneously analyzes all the SNPs and identifies which β's are not zero. Unlike frequentists' methods, which aim to find a single optimal model, Bayesian variable selection tries to estimate the probability P(β_i ≠ 0) for every SNP. Variable selection is one of the central topics of applied Bayesian analysis. Chap. 4.1 gives an extensive review of the generic methods for Bayesian variable selection based on regression.

Early attempts at Bayesian variable selection in genetic studies usually involved up to a few hundred covariates. The most typical application was the mapping of quantitative trait loci (QTLs) [Uimari and Hoeschele, 1997, Sillanpää and Arjas, 1998, Broman and Speed, 2002, Kilpikari and Sillanpää, 2003, Meuwissen and Goddard, 2004]. Besides, Yi et al. [2005] and Hoti and Sillanpää [2006] studied epistatic interactions and genotype-expression interactions, respectively. Although the computational methods used in those studies were not designed for whole-genome datasets, some key ideas and techniques were employed later in GWASs: for instance, the Jeffreys shrinkage prior used by Xu [2003] and Wang et al. [2005] (see Ter Braak et al. [2005] for a discussion of the propriety of the posterior), the Laplace shrinkage prior used by Yi and Xu [2008], the reversible jump MCMC approach taken by Lunn et al. [2006], the stochastic search variable selection algorithm used by Yi et al. [2003] and the composite model space search of Yi [2004]. Recent years have witnessed an increasing number of rewarding applications of Bayesian variable selection to GWASs. Li et al. [2011] used the Bayesian LASSO to detect genes associated with body mass index; Ishwaran and Rao [2011] applied a Gibbs sampling scheme proposed in Ishwaran and Rao [2000] to the microarray data analysis for colon cancer; Stahl et al. [2012] proposed an approximate Bayesian computation method and studied a GWAS dataset for rheumatoid arthritis.

One particular application that needs emphasis is heritability estimation. The narrow-sense heritability refers to the proportion of the phenotypic variation that is due to additive genetic effects. It can be reliably estimated from close-relative data, especially twin studies [Gielen et al., 2008], using statistical methods that can be traced back to Galton [Galton, 1894] and Fisher [Fisher, 1919]. For example, the heritability of height was estimated to be between 0.77 and 0.91 [Macgregor et al., 2006]. But, in stark contrast, early GWASs on tens of thousands of individuals detected around 50 variants statistically associated with height, which in total could only explain about 5% of the phenotypic variance [Yang et al., 2010]. A later study increased the proportion of variance explained to 10% after identifying 180 associated loci from 183,727 individuals [Allen et al., 2010]. This huge gap is referred to as the "missing heritability". Two reasons immediately stood out. First, many causal variants may be neither genotyped nor in complete linkage disequilibrium with the genotyped ones. Second, most causal variants may only contribute a very small amount of variation and thus fail to reach the significance thresholds. Other theories, like rare variants and epistatic interactions, cannot explain the fact that most of the heritability is missing.

There were few methodological studies on heritability estimation before the advent of GWAS. But one of them, Meuwissen et al. [2001], compared by simulation the performance of Bayesian models and the linear mixed model, which in fact represent the current mainstream approaches to heritability estimation. The first sensible heritability estimate from real GWAS datasets was attained by GCTA [Yang et al., 2011], a package that implements restricted maximum likelihood inference for the linear mixed model (see Chap. 7.5.1). It estimated the heritability of height to be about 45% from 3,925 subjects [Yang et al., 2010]. The rationale of GCTA is in fact the same as that of the classical methods using relative data: if two individuals are similar in genotype, they should have similar phenotypes. GCTA uses all the SNPs in the dataset to calculate the genetic relatedness between individuals and infers the heritability using that genetic relationship matrix. The Bayesian approach was later proposed by Guan and Stephens [2011], where the heritability was modelled as a hyperparameter and the regression coefficients were given the standard spike-and-slab prior. The model was then generalized by Zhou et al. [2013], where the prior for the regression coefficients became a mixture of two normal distributions. Compared with GCTA, the Bayesian methods have the following advantages. First, GCTA assumes every SNP makes a small i.i.d. contribution to the phenotype, which could be seriously violated for some phenotypes. On the contrary, Bayesian methods do not rely on this assumption and can pick out the SNPs with larger effects by variable selection. Second, the heritability estimator of GCTA usually has a larger variance than the Bayesian estimators. Last but not least, the prediction performance of GCTA is poor, on which we want to make more comments.
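The genetic relationship matrix just mentioned is conceptually simple: standardize each SNP and average the cross-products over SNPs. The sketch below standardizes by sample means and standard deviations (GCTA's exact scaling, which is based on allele frequencies, differs slightly) and assumes a complete 0/1/2 genotype matrix with no monomorphic SNPs.

```python
import numpy as np

def genetic_relationship_matrix(G):
    """n x n relatedness matrix from an n x N genotype matrix (0/1/2 coding).
    Each SNP is standardized to mean zero and unit variance; the relatedness of two
    individuals is the average product of their standardized genotypes."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    return Z @ Z.T / G.shape[1]

rng = np.random.default_rng(0)
G = rng.binomial(2, 0.3, size=(100, 5000)).astype(float)   # 100 individuals, 5000 SNPs
A = genetic_relationship_matrix(G)
print(A.shape, A.diagonal().mean())                        # diagonal averages to about 1
```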

Prediction is one of the ultimate goals of both GWAS and variable selection. A model that perfectly fits the observed data is practically useless if it has no predictive power. This is why in variable selection, given so many potential predictors, we only want to select a small number of them and shrink their regression coefficients. Although GCTA is computationally very fast and can provide unbiased estimators for the regression coefficients when the model assumptions hold, its prediction performance, shown in [Zhou et al., 2013, Fig. 2], is much worse than that of the Bayesian approaches. For more general purposes, Guan and Stephens [2011] showed that BVSR outperformed the LASSO (least absolute shrinkage and selection operator) [Tibshirani, 1996], which is one of the most widely used non-Bayesian variable selection procedures. This advantage is partly due to model averaging [Raftery et al., 1997].

Chapter 2

Distribution and P-value of the Bayes Factor

2.1 Distribution of Bayes Factors in Linear Regression

The null-based Bayes factor for the linear regression model introduced in Section 1.2 is given by

    BFnull(X, V) = |I + X^tXV|^{-1/2} (1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty)^{-n/2},      (2.1)

where V is the normalized prior variance matrix for β (see Section 1.2 for the notation). In order to calculate the p-value for this Bayes factor, it is necessary to characterize its distribution under the null model, that is, y ∼ MVN(0, τ^{-1}I). To this end, the distribution of the quadratic form z^tX(X^tX + V^{-1})^{-1}X^tz is first identified, where z is a multivariate normal variable with covariance matrix I. It is then used to derive the asymptotic distribution of log BFnull. The distribution of log BFnull under the alternative model (β ≠ 0) is also discussed.

2.1.1 Distributions of Quadratic Forms

Throughout this chapter, we define

    H := X(X^tX + V^{-1})^{-1}X^t,                               (2.2)

which is symmetric and thus admits a spectral decomposition,

    H = UΛU^t = Σ_{i=1}^{p} λ_i u_i u_i^t,                       (2.3)

with eigenvalues (the diagonals of Λ) λ_1 ≥ ... ≥ λ_p ≥ 0 = λ_{p+1} = ... = λ_n and eigenvectors u_1, ..., u_n. We first prove a lemma about these eigenvalues. See Chap. 7.1.3 for the definition and properties of eigenvalues.

Proposition 2.1. Let the spectral decomposition of H be given by (2.3). Then,

(a) 0 ≤ λi < 1 for 1 ≤ i ≤ p;

(b) H and X have the same (left) null space;

(c) log(|I + X^tXV|) = −Σ_{i=1}^{p} log(1 − λ_i).

Proof. Suppose λ_i ≠ 0. Since we may write

    H = XV^{1/2}(V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^t,

by Lemma 7.8, λ_i is also an eigenvalue of (V^{1/2}X^tXV^{1/2} + I)^{-1}V^{1/2}X^tXV^{1/2}. By spectral decomposition, λ_i/(1 − λ_i) is then an eigenvalue of V^{1/2}X^tXV^{1/2}. Next we claim that V^{1/2}X^tXV^{1/2} and H have the same number of nonzero eigenvalues. This is because

    rank(V^{1/2}X^tXV^{1/2}) = rank(X^tX) = rank(X) = rank(H).

The first two equalities follow from the properties of rank. The last equality needs more explanation. If Hz = 0 for some vector z, then z^tHz = 0. Since (X^tX + V^{-1})^{-1} is positive definite, this implies that X^tz = 0. Hence H and X have the same (left) null space (the other direction is trivial), which implies rank(X) = rank(H). Since V^{1/2}X^tXV^{1/2} is positive semi-definite, its eigenvalues must be nonnegative and thus λ_i ∈ [0, 1). Lastly, by Sylvester's determinant formula, log(|I + X^tXV|) = log|I + V^{1/2}X^tXV^{1/2}|, which finishes the proof.
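Numerically, the weights λ_i can be obtained from the p × p matrix V^{1/2}X^tXV^{1/2} as in the proof above, since λ_i = µ_i/(1 + µ_i) where the µ_i are its eigenvalues. A minimal sketch (simulated X and an independent normal prior with a hypothetical σ = 0.5) that also checks Proposition 2.1 (c):

```python
import numpy as np

def bf_weights(X, V):
    """Nonzero eigenvalues of H = X (X^tX + V^{-1})^{-1} X^t, computed from the
    p x p matrix sharing H's nonzero eigenvalues (see the proof of Proposition 2.1)."""
    L = np.linalg.cholesky(V)             # any square root of V gives the same eigenvalues
    mu = np.linalg.eigvalsh(L.T @ X.T @ X @ L)   # eigenvalues of V^{1/2} X^tX V^{1/2}
    return mu / (1.0 + mu)                # lambda_i = mu_i / (1 + mu_i), all in [0, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
V = 0.25 * np.eye(5)                      # independent normal prior, sigma = 0.5
lam = bf_weights(X, V)
# Proposition 2.1 (c): -sum log(1 - lambda_i) equals log|I + X^tX V|.
print(lam, np.allclose(-np.log(1 - lam).sum(),
                       np.linalg.slogdet(np.eye(5) + X.T @ X @ V)[1]))
```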

Define the second term, which depends on y, in the expression of BFnull in (2.1) by R, i.e.,

    2 log R := −n log(1 − y^tX(X^tX + V^{-1})^{-1}X^ty / y^ty) = −n log(1 − y^tHy / y^ty).

By Proposition 2.1 (c),

    2 log BFnull = 2 log R + Σ_{i=1}^{p} log(1 − λ_i).

Since λ_1, ..., λ_p are constants that depend only on the data X and the prior V, to characterize the distribution of 2 log BFnull we only need to figure out the distribution of 2 log R. Instead of considering the null and the alternative separately, let us assume y ∼ MVN(Xβ, τ^{-1}I). Then the alternative model corresponds to the cases where β ≠ 0 (or follows a nondegenerate distribution), and the null model is the special case with β = 0. Define the standardized response variable z by

    z := τ^{1/2}y ∼ MVN(τ^{1/2}Xβ, I),

and rewrite the expression for 2 log R as

    2 log R = −n log(1 − z^tHz/z^tz).

The statistic 2 log R is closely related to the likelihood ratio test (LRT) statistic, which can be written as (see Chap. 2.2.1)

    2 log LR = −n log(1 − z^tH_0z/z^tz),                         (2.4)

where H_0 := X(X^tX)^{+}X^t. H_0 is the hat matrix in traditional least-squares linear regression (Chap. 7.1.4). Clearly the distributions of the two statistics are determined by the distributions of the quadratic forms z^tHz, z^tH_0z and z^tz.

Definition 2.1. A random variable Q is said to have a noncentral chi-squared distribution with 1 degree of freedom and noncentrality parameter ρ ≥ 0 if it has the same distribution as (Z + √ρ)², where Z ∼ N(0, 1). The distribution of Q is denoted by Q ∼ χ²_1(ρ). When ρ = 0, the distribution of Q reduces to the central chi-squared distribution and is denoted by Q ∼ χ²_1.

Proposition 2.2. Let the spectral decomposition of H be given by (2.3) and assume rank(X) = p′. Let z ∼ MVN(τ^{1/2}Xβ, I). Then,

(a) z^tH_0z = Σ_{i=1}^{p′} (u_i^tz)² and z^tHz = Σ_{i=1}^{p′} λ_i (u_i^tz)²;

(b) z^tz = z^tH_0z + Σ_{i=p′+1}^{n} (u_i^tz)²;

(c) for 1 ≤ i ≤ p′, (u_i^tz)² ∼ χ²_1(τ(u_i^tXβ)²), independently;

(d) Σ_{i=p′+1}^{n} (u_i^tz)² := Q_0 ∼ χ²_{n−p′} and is independent of (u_i^tz)² for 1 ≤ i ≤ p′.

Proof. By the spectral decomposition of H, z^tHz = (U^tz)^tΛ(U^tz). Since U is orthogonal, U^tz ∼ MVN(τ^{1/2}U^tXβ, I_n). The covariance matrix is diagonal, and thus the (u_i^tz)² are mutually independent by Bernstein's theorem [Lukacs and King, 1954]. Note that this result is not trivial at all (uncorrelatedness is not equivalent to independence), and it will be used frequently in the following proofs. Part (c) is then self-evident. Recall that H and X have the same rank and the same left null space by Proposition 2.1 (b). Thus for i > p′, λ_i = 0 and u_i^tX = 0. Part (d) is then proved. Part (b) is immediate from part (a) since z^tz = z^tUU^tz = Σ_{i=1}^{n} (u_i^tz)². So the only remaining task is to work out the spectral decomposition of H_0. By Proposition 7.15, H_0 has p′ eigenvalues equal to 1 and the rest are zero. We claim its spectral decomposition can be written as

    H_0 = UΛ_0U^t = Σ_{i=1}^{p′} u_i u_i^t,                      (2.5)

where Λ_0 = diag(1, ..., 1, 0, ..., 0) and U are the eigenvectors of H as defined in (2.3). To prove this, let the singular value decomposition (Chap. 7.1.2) of X be X = U_0D_0V_0^t. Then H_0 = U_0D_0D_0^{+}U_0^t, which implies that the null space of H_0 is the same as the left null space of X, and thus the same as the null space of H by Proposition 2.1 (b). Hence H_0 and H also have the same column space, and there exist orthogonal matrices E_1, E_2 such that U = U_0 diag(E_1, E_2).

Immediately we have the following corollary under the null.

Corollary 2.3. If z ∼ MVN(0, I_n), then z^tH_0z ∼ χ²_p and z^tHz = Σ_{i=1}^{p} λ_i(u_i^tz)², where the (u_i^tz)² are independent χ²_1 random variables.

Both H and H_0 are positive semi-definite matrices. By their spectral decompositions given in Proposition 2.2 (a), we have the following lemma.

Lemma 2.4. For any vector z ∈ R^n, 0 ≤ z^tHz ≤ z^tH_0z.

2.1.2 Asymptotic Distributions of log BFnull

In this section, it will be shown that, loosely speaking, when the sample size n is sufficiently large, 2 log R has approximately the same distribution as z^tHz, provided z^tHz does not grow or grows only slowly. To explain the ideas further, consider a sequence of datasets with sample size n tending to infinity. For simplicity (and to avoid confusion), we drop the subscript n from z, X, H, H_0, but keep in mind that they always depend on n. The limiting distribution of 2 log R_n is our interest. It might not exist, since z^tHz may grow very quickly. For example, consider the simplest (though not realistic) case with p = 1, β_n = 1 and X = 1. Then (u_1^tz)² grows at rate n, and thus 2 log R_n would eventually blow up. This is expected, since the evidence supporting the alternative model accumulates as n → ∞. Thus, in order to discuss the asymptotic distribution of 2 log R_n, we need some constraint on the growth rate of z^tHz. For this purpose, we assume z^tH_0z = O_p(1), where O_p(1) means stochastic boundedness. Later we will see that this assumption is indeed very convenient. For more explanation of the stochastic Big-O and Little-O notations, see Chap. 7.3. By Lemma 2.4, we have the following.

Lemma 2.5. If z^tH_0z = O_p(1), then z^tHz = O_p(1).

The main result is now given below. Without loss of generality, we assume X always has full rank. The results can be easily extended to the rank-deficient case but are of very little practical interest.

Proposition 2.6. If z^tH_0z = O_p(1), then

    2 log R_n = −n log(1 − z^tHz/z^tz) = z^tHz + o_p(1).

Proof. Assuming X has full rank, by Proposition 2.2 (b), z^tz = Q_0 + z^tH_0z = Q_0 + O_p(1), where Q_0 = Σ_{i=p+1}^{n} (u_i^tz)² follows a chi-squared distribution with n − p degrees of freedom. By direct calculation we can show E(Q_0/n) → 1 and Var(Q_0/n) → 0. Hence,

    z^tz/n = Q_0/n + O_p(1)/n = Q_0/n + o_p(1) → 1 in probability.

By the continuous mapping theorem [van der Vaart, 2000, Chap. 2.1], n/z^tz = 1 + o_p(1). Although z^tz and z^tHz are correlated, by Slutsky's theorem [van der Vaart, 2000, Chap. 2.1], we can write

    z^tHz/z^tz = (z^tHz/n)(1 + o_p(1)) = O_p(1)/n = O_p(1/n),
    1 − z^tHz/z^tz = 1 − (z^tHz/n)(1 + o_p(1)) = 1 − z^tHz/n + o_p(1/n),

since z^tHz = O_p(1). Another way to verify this is to use the fact that z^tHz/z^tz < z^tHz/Q_0; since z^tHz and Q_0 are independent, z^tHz/Q_0 can be shown to be O_p(1/n). By Taylor expansion with Peano's form of the remainder,

    −n log(1 − z^tHz/z^tz) = z^tHz + o_p(1) + n·o_p(1/n) = z^tHz + o_p(1).

Piecing together Proposition 2.1, Proposition 2.2 and Proposition 2.6, we arrive at the main result of this section.

40 def. t 2 t Theorem 2.7 (Asymptotic distribution of BFnull). Let Qi = (uiz) . If z H0z =

Op(1), then

p X ind. 2 t 2 2 log BFnull = (λiQi + log(1 − λi)) + op(1),Qi ∼ χ1((τ(uiXβ) ). i=1

t At last, we need figure out when the condition z H0z = Op(1) holds. It is

t 2 clearly true under the null since z H0z ∼ χp. Thus the following corollary is immediate.

t 2 Corollary 2.8. Let Qi = (uiz) . Under the null,

p X i.i.d 2 2 log BFnull = (λiQi + log(1 − λi)) + op(1),Qi ∼ χ1. i=1

This result can be further generalized to the case of non-normal errors.

Corollary 2.9. Consider the null model y = ε, where 1, . . . , n are i.i.d. random

−1 4 t 2 variables with E[i] = 0, Var(i) = τ , and E[i ] < ∞. Let Qi = (uiz) =

t 2 τ(uiε) . Then

p X 2 log BFnull = (λiQi + log(1 − λi)) + op(1). i=1

t Proof. Clearly z Hz is still Op(1). The proof of Proposition 2.6 tells us that n P P t 2 we only need to check Q0/n → 1 where Q0 = (uiz) , since the rest follows i=p+1 from Slutsky’s Theorem and Taylor’s expansion. Note that we don’t need the

t independence between z Hz and Q0 which is not true when the errors are not

41 1/2 t normally distributed. Since E[τ uiε] = 0, we have

n X 1/2 t E[Q0] = Var(τ uiε) = n − p, i=p+1 n X 1/2 t 4 Var(Q0) = E[(τ uiε) ] = O(2n). i=p+1

P Hence we conclude Q0/n → 1 and complete the proof.

Under the alternative model, we have to limit the growth rate of the noncen-

t 2 trality parameter of Qi, i.e., τ(uiXβ) , since

p t 2 X t 2 z H0z ∼ χp(τ (uiXβ) ) i=1

2 where χp(ρ) denotes a noncentral chi-squared random variable with degree of freedom p and noncentrality parameter ρ. We borrow the idea of local alternatives √ from frequentists’ theory and assume βn = β0/ nτ. Then,

p p n X 1 X 1 X τ (utXβ)2 = (utXβ )2 = (utXβ )2 i n i 0 n i 0 i=1 i=1 i=1 1 = (U tXβ )t(U tXβ ) n 0 0 1 = βt XtXβ . n 0 0

β0 is now fixed but it may still blow up if the entries of X grow with n. To solve this, we need additional constraint. Two practical choices would be that

XtX/n converges or that X is bounded entrywise. The second condition is

easy to check. To see that the first condition would work, recall that a weakly

convergent sequence of random variables is always stochastically bounded [Shao,

2003, Chap. 1.6 (127)]. To summarize,

t 2 Corollary 2.10. Let Qi = (uiz) . Under a sequence of local alternatives with

42 √ t β = β0/ nτ, either if X X/n converges of X is bounded entrywise,

p X ind. 1 2 log BF = (λ Q + log(1 − λ )) + o (1),Q ∼ χ2( (utXβ )2). null i i i p i 1 n i 0 i=1

2.1.3 Asymptotic Results in Presence of Confounding

Covariates

According to Chap. 7.2.3, if an n × q matrix W , representing confounding co- variates, need controlling for, it suffices to regress it out from both y and X and compute BFnull with the residuals. The resulting Bayes factor is exactly the Bayes factor for the full model (see (7.13)). It is tempting to jump to the conclusion that the asymptotic distribution of BFnull remains the same, which is true but requires additional arguments since the distribution of z (or y) has changed.

Some notations need to be redefined. Let P = I − W (W tW )−1W t be the projection matrix that maps a vector to its residuals after regressing out W . Let

L be the matrix representing the covariates of interest and redefine X def=. PL.

Let H and H0 still be as defined in (2.2) and (2.4). If we compare the full model y = W a + Lβ + ε against the null model y = W a + ε, then the expression for

2 log R becomes (see Chap. 7.2.3 for more details)

2 log R = −n log(1 − ztHz/ztP z). (2.6) where z = τ 1/2y. We may replace z by z˜ = τ 1/2P y, and 2 log R would have exactly the same expression as before since P is idempotent. However, we use form (2.6) as it is more convenient. Recall that we have defined the spectral de- p P t t t t composition H = λiuiui in (2.3). The distributions of z, z P z, z Hz, z H0z i=1 are given in the following lemma.

43 Lemma 2.11. Assume y ∼ MVN(W a + Lβ, τ −1I), then

(a) z ∼ MVN(τ 1/2Xβ, P ) ;

p p t P t 2 t P t 2 t 2 2 t 2 (b) z Hz = λi(uiz) and z H0z = (uiz) where (uiz) ∼ χ1(τ(uiXβ) ); i=1 i=1

p t P t 2 2 (c) z P z = (uiz) +Q0 where Q0 is some random variable that follows χn−p−q. i=1

Proof. Part (a) and part (b) follow from the fact that PL = X and PW = 0. To

prove part (c), we need figure out the spectral decomposition for matrix P . Since

P is a projection matrix, it has n − q unit eigenvalues and q zero eigenvalues. For

1 ≤ i ≤ p, ui must satisfy Hui = λiui, which implies P ui = ui since PX = X.

Hence, ui is also an eigenvector of P that corresponds to eigenvalue 1. Similarly, if P v = 0 for some vector v, then v is also an eigenvector for H with zero

eigenvalue. As a result, we may reorder and rotate up+1,..., un so that

n−q X t P = uiui, i=1

n−q t P t 2 2 and thus z P z has the given decomposition with Q0 = (uiz) ∼ χn−p−q. i=p+1

P Inspection shows that Proposition 2.6 still holds because Q0/n → 1. Corol-

t lary 2.8 and corollary 2.10 hold immediately since the distribution of z H0z hasn’t changed. Since the results with the existence of confounding covariates are con-

sistent with the simpler model y = Xβ + ε, in the remaining sections of this

chapter I only focus on the latter model unless otherwise stated. Because usually

the intercept term, 1, is treated as a confounding covariate, by using the simpler

model we are actually assuming both X and y are centered.

44 2.2 Properties of the Bayes Factor and Its

P-value

Using the asymptotic distribution of log BFnull, we can calculate an asymptotic

p-value associated with BFnull, which we denote by PBF, and study the properties

of BFnull and PBF theoretically. In Chap. 2.2.3, 2.2.4 and 2.2.5, we simplify the discussion by omitting the op(1) term in Theorem 2.7. We may safely do so because when the error variance τ −1 is known, we have exactly (see Chap. 7.2)

p X ind. 2 t 2 2 log BFnull = (λiQi + log(1 − λi)) ,Qi ∼ χ1(τ(uiXβ) ). i=1

For a sufficiently large sample size, τ can be estimated very accurately in the sense

that its posterior distribution contracts to the true value.

2.2.1 Comparison with the P-values of the Frequentists’

Tests

We compare PBF with two non-Bayesian p-values, the p-value of the F test, de-

1/2 noted by PF, and the p-value of the LRT, denoted by PLR. We still let z = τ y denote the standardized response variable. In frequentists’ language, usually yty

t t is denoted by SST (total sum of squares), y y − y H0y is denoted by SSE (sum

t of squares of errors) and y H0y is denoted by SSReg (sum of squares due to re- gression). Assuming both y and X have been centered, the F test statistic, by definition is

t SSReg/p z H0z/p F = = t ∼ F(p,n−p−1). (2.7) SSE/(n − p − 1) z (I − H0)z/(n − p − 1)

45 The test statistic of the LRT is derived as follows,

sup f(y|τ, β) 2 log LR = 2 log τ,β supτ f(y|τ, β = 0)

t t = −n log(1 − y H0y/y y) (2.8) t t = −n log(1 − z H0z/z z)

X D 2 = Qi → χp. i=1

This is a special case of Wilks’s [1938] theorem. Hence, F test, which is exact, and LRT are asymptotically equivalent. The Bayes factor, as an averaged (or penalized) likelihood ratio, has a very similar form to the LRT statistic.

A special case is simple linear regression. Recall from the result of Chap. 2.1.2 that when p = 1,

λ1Q1 2 log BFnull = −n log(1 − ) + log(1 − λ1) ≈ λ1Q1 + log(1 − λ1). Q1 + Q0

n−1 P t 2 2 t 2 2 where Q0 = (uiz) ∼ χn−2 and Q1 = (u1z) . Under the null, Q1 ∼ χ1. We i=2 make two observations.

Proposition 2.12. Let PBF be the asymptotic p-value calculated by Corollary 2.8,

PF be the p-value of the F-test and PLR be the p-value of the likelihood ratio test. If p = 1, then,

(a) PBF is asymptotically equivalent to PLR;

(b) PF is the true p-value for the Bayes factor.

Proof. PBF is asymptotically equal to PLR since

2 log BFnull = λ1(2 log LR) + log(1 − λ1) + op(1).

46 (BFnull is asymptotically a monotone function of LR.) The second statement is

because BFnull is a monotone function of F .

 λ  2 log BF = −n log 1 − 1 + log(1 − λ ). null 1 + (n − 2)/F 1

2 To compare the three p-values, we fix λ1 = 0.8, n = 100, Q0 = E[χ98] = 98

and try different values for Q1. The result is shown in Figure 2.1. It can be seen

that for a limited sample size n = 100, PLR and PBF are very close to each other and only deviate from the truth when the p-value is extremely small. Later in

Chap. 2.3.1 we will see how to correct the test statistics so that the asymptotic

p-values can have better calibration.

2.2.2 Independent Normal Prior and Zellner’s g-prior

Now consider two special cases of the prior covariance matrix for β, V . The first

is called the independent normal prior, which assumes βi’s to be i.i.d. normal variables a priori. The second choice is the well-known Zellner’s g-prior [Zellner,

1986].

Proposition 2.13. Consider the independent normal prior, V = σ2I, and Zell- ner’s g-prior, V = g(XtX)−1 with g > 0.

(a) Under both the independent normal prior and the g-prior, the columns of U

(the eigenvectors of H) are the left-singular vectors of X.

2 2 −2 (b) Under the independent normal prior, λi = di /(di + σ ) where di is the i-th

singular value of X (|d1| ≥ · · · ≥ |dp| ≥ 0); under the g-prior, λi = g/(g + 1) for 1 ≤ i ≤ p.

47 ●● ●●● ● ●●●● ● 0 ●● ●● ●●● ● ●●●● 1.0 ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●●

0.8 ● ● −2 ●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ) ● ● ●●● ●● ●●● ●● ●●● ●● ●●● R ● ●● ●●● ● ●●● −4 ● 0.6 ●● ●●●

● L ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● P ● R ● ●● ●●● ●● ●●● ● ●●● L ● ●● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ● ●●●

P ● ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● o ●●● ●● ●●●

−6 ● l ●● 0.4 ● ●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● −8 ●●● 0.2 ● ●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●● ●●●● ●●● ●●● ●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●● ●●● ●●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●●● ●●● ●●●● ●●● 0.0 −10 0.0 0.2 0.4 0.6 0.8 1.0 −10 −8 −6 −4 −2 0 ( ) PF log10 PF

●● ●●● ● ●●● ● 0 ●● ●● ●●● ● ●●● 1.0 ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● 0.8 ● ●● ● ●●● ● −2 ●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ● ) ●● ●● ●●● ●● ●●● ● ●●●

F ● ●● ●●● ● ●●● ● ●●● 0.6 ●● ●●● ●● B ●● ● ●●● ● ●●● ● −4 ●● ●● ●●● ●● ●●● F ● ●● P ● ●● ●●● ●● ●●● ● ●●● B ● ● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ●● ●●● P ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●●

o ● ● ●●● ●● ●●● l ●● 0.4 ● ● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● −6 ● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●●● ●● ●●●● 0.2 ●● ●●● ●● ●●●● ●●● ●●●● ●● ●●●● ●● ●●●● ●● ●●●● ●●● ●●●● ●●● −8 ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●● ●●●● ●●●● ●●●● ●●●● ●●● ●●●● ●●● 0.0 0.0 0.2 0.4 0.6 0.8 1.0 −10 −8 −6 −4 −2 0 ( ) PF log10 PF

●● ●●● ● ●●●● ● 0 ●● ●● ●●●● ● ●●●● 1.0 ● ●●● ● ●●●● ●● ●●●● ● ●●●● ● ●●●● ●● ●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●●● ●● ●●●● ●● ●●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●●

0.8 ● ● −2 ●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ● ●●● ● ●●● ● ) ● ● ●●● ●● ●●● ●● ●●● ●● ●●● R ● ●● ●●● ● ●●● −4 ● 0.6 ●● ●●●

● L ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● P ● R ● ●● ●●● ●● ●●● ● ●●● L ● ●● ( ●● ●● ●●● ● ●●● ● ●●● ●●● ●●● ● ●●●

P ● ● ●● ●●● 10 ● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● g ● ● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ● o ●●● ●● ●●●

−6 ● l ●● 0.4 ● ●● ● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● ●●● ●● ●●● ● ●●● ●● ●●● ●● ●●● ● −8 ●●● 0.2 ● ●● ● ●●● ●● ●●● ●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●● ●●●● ●●● ●●●● ●●● 0.0 −10 0.0 0.2 0.4 0.6 0.8 1.0 −8 −6 −4 −2 0 ( ) PBF log10 PBF

Figure 2.1: Comparisons between PBF, PF, PLR for p = 1. We use n = 100, fix SSE = 98 (SSE: sum of squares of errors) and try different values for SSReg (sum of squares due to regression).

48 t 2 ind. (c) Under both the independent normal prior and the g-prior, Qi = (uiz) ∼

2 2 t 2 χ1(τdi (vi β) ) where vi is the i-th right-singular vector of X.

t Proof. Let the singular value decomposition of X be X = U0D0V0 . Then under

2 −2 −1 t the independent normal prior, H = U0D0(D0 +σ I) D0U0. Under the g-prior, g H = U U t. Part (a) and (b) then follows. To prove part (c), notice that g + 1 0 0

t t t t uiXβ = uiU0D0V0 β = divi β.

Under the independent normal prior, PBF differs from PLR since it assigns different weights to different directions (of β). The direction of the first (principal component) loading vector of the data matrix X, which corresponds to singular

value d1, has the biggest weight. In contrast, Under the g-prior, PBF acts just like

PLR since it also treats every direction equally.

Under the g-prior, the expression for BFnull can be simplified to

 g ytH y −n/2 BF = (g + 1)−p/2 1 − 0 = (g + 1)(n−p)/2 1 + g(1 − r2)−n/2 , null g + 1 yty (2.9)

where r2 is the coefficient of determination in traditional linear regression.

2.2.3 Behaviour of the Bayes Factor and Three Paradoxes

Using the asymptotic approximation,

p X t 2 2 log BFnull ≈ λiQi + log(1 − λi),Qi = (uiz) , (2.10) i=1

49 it is relatively easy to understand the genesis of three famous paradoxes of the

Bayes factor.

Jeffreys-Lindley’s paradox For any fixed significance level of α, one can al-

ways find a sample size such that an effect is statistically significant (i.e. the

p-value is less than α) however the posterior probability of the null model is

greater than 1 − α [Lindley, 1957, Naaman, 2016]. Consider our model with the independent normal prior. We may fix the values of Qi so that PLR is a constant

less than α and then let n goes to infinity. By Proposition 2.13, λ1, . . . , λp all go

to 1 and thus BFnull ↓ 0. Hence Bayesians would accept the null model.

Bartlett’s paradox When lacking prior information, people prefer to use the

noninformative prior, which assumes a flat shape of the parameter’s prior distri-

bution. However, using a noninformative prior can unintentionally favor the null

model [Bartlett, 1957, Liang et al., 2008]. In our model, the noninformative prior

should let the prior variance of β go to infinity. For the independent normal prior,

this means to let σ2 ↑ ∞; for the g-prior, it means to let g ↑ ∞. But in both cases,

assuming n is fixed, λ1, . . . , λp all go to 1 and thus BFnull ↓ 0.

Information paradox Under the g-prior, BFnull can be expressed using only g, n, p, and r2 (2.9). Suppose g, n, p are fixed and let r2 ↑ 1. We then have

(n−p)/2 2 BFnull ↑ (g + 1) . This is undesirable since as r ↑ 1, the evidence for the

alternative model becomes overwhelming and decisive. However, BFnull, which is assumed to measure the evidence, converges to a finite constant. Since r2 ↑ 1

represents the accumulation of the information, this paradox is called information

paradox [Liang et al., 2008].

50 In fact, none of the three paradoxes is a true paradox. But by investigating these paradoxes, we may better understand the nature of the Bayes factor. I

first explain why these paradoxes are indeed expected and how to solve them, and then perform a quantitative analysis to investigate whether they should be worried about for a finite sample size.

Jeffreys-Lindley’s paradox neglects the fact that as n ↑ ∞, the p-value of the LRT or the F test would also go to zero. It is impractical to assume that the observed significance level remains unchanged as n grows. Nevertheless, it is likely that when n is large, frequentists reject the null while Bayesians accept it.

But this is simply because the evidence is not strong enough, considering the large sample size. This is not paradoxical at all. It just reveals the different properties of the p-value and the Bayes factor.

The phenomenon of Bartlett’s paradox is expected as well. By letting σ2 or g go to infinity, we are actually assuming that the effect tends to be very large, which of course cannot be supported by the data. Hence the marginal likelihood of the alternative model, which is averaged over this noninformative prior, decreases and eventually becomes less than the marginal likelihood of the null model. However,

Bartlett’s paradox is a truly important observation. It reveals that there is no appropriate choice for σ2 when we have no prior information! To overcome this problem, people often put a hyperprior on σ2, which is referred to as the Bayesian random effect model (or Bayesian linear mixed model [Hobert and Casella, 1996]

). This is also the standard approach in variable selection, which will be discussed in the next chapter.

The information paradox, which is truly undesirable, arises from the nature of the g-prior. Bayesians usually prefer to use a prior that is independent of data.

But using the g-prior with a fixed g violates this rule. When n is fixed, the g-prior

51 is well-motivated. It is sometimes reasonable to assume that the prior precision

matrix of β is proportional to the covariance matrix of the data XtX since a

larger variance can be thought of as “more information”. Besides, the g-prior

is computationally convenient. However, when n grows and g is fixed, the prior covariance of β becomes smaller and smaller, and eventually vanishes! Just like

Bartlett’s paradox, one solution is to put a hyperprior on g ( see Liang et al.

[2008]).

Next let’s try to quantitatively characterize the behaviour pattern of the Bayes factor. We focus on the independent normal prior.

Proposition 2.14. Assume BFnull has the expression given in (2.10) and let

def. t 2 2 ρi = τ(uiXβ) so that Qi ∼ χ1(ρi). Then,

i.i.d. 2 (a) under the null, Qi ∼ χ1 and

p 1 X [log BF | β = 0] = (λ + log(1 − λ )), E null 2 i i i=1

E[BFnull | β = 0] = 1;

ind. 2 (b) under a fixed alternative, Qi ∼ χ1(ρi) and

p 1 X [log BF | β] = (λ (1 + ρ ) + log(1 − λ )), E null 2 i i i i=1 p Y λiρi [BF | β] = exp( ); E null 2(1 − λ ) i=1 i

52 −1 2 i.i.d. 2 (c) under the alternative β ∼ MVN(0, τ σ I), (1 − λi)Qi ∼ χ1 and

p 1 X λi [log BF | σ2] = ( + log(1 − λ )), E null 2 1 − λ i i=1 i   ∞ if λ1 ≥ 0.5 2  [BFnull | σ ] = p E Q 1 − λi  √ if λ1 < 0.5 i=1 1 − 2λi

Proof. By the independence between Q1,...,Qp, it is sufficient to calculate the 1 1 expectation of a single component (λ Q +log(1−λ )) or exp[ (λ Q +log(1−λ ))]. 2 i i i 2 i i i Part (a) and (b) then follow from routine calculations. For part (c), notice that

2 2 2 2 2 −1 by Proposition 2.13 (c), Qi/(1 + σ di ) ∼ χ1 and 1 + σ di = (1 − λi) .

Remark 2.14.1. In fact, the expectation of the null-based Bayes factor (with proper priors) under the null model is always 1, regardless of the models of the

problem. This can be checked from the definition of the Bayes factor.

E[log BFnull] Remark 2.14.2. By Jensen’s inequality, we always have e ≤ E[BFnull].

Since BFnull has a heavy-tailed distribution, the expected values of log BFnull ac- tually provide much more insight into the behaviour of BFnull than the expected values of BFnull.

2 Remark 2.14.3. As λ1 ↑ 1 (either the sample size n ↑ ∞ or σ ↑ ∞), the expected value of log BFnull goes to −∞ under setting (a) and goes to ∞ under setting (c).

2 Under setting (b), the limit depends on the growth rate of ρ1. Suppose σ is fixed,

2 ρ1 = O(n) and d1 = O(n), we still have the expected value of log BFnull goes to ∞ since log(1 − λ1) = O(log n).

A key message in the last remark is that the term log(1 − λ1) decreases quite slowly. In the usual case where we assume the observations are homogeneous, Q1 grows at a faster rate as n increases and thus the Bayes factor is consistent (the posterior probability of the alternative model goes to 1). In fact, log(1 − λ1) also

53 decreases very slowly as σ increases. It suffices to consider only p = 1 and thus

2 log BFnull = λ1Q1 + log(1 − λ1).

2 2 ∂2 log BFnull Q1d1 d1 2 = 2 2 2 − 2 2 ∂σ (σ d1 + 1) σ d1 + 1

2 2 2 Therefore, when σ ↓ 0, 2 log BFnull grows at rate (Q − 1)d1; when σ ↑ ∞,

−2 2 log BFnull decreases at a vanishing rate O(σ ).

2.2.4 Behaviour of the P-value Associated with the Bayes

Factor

Now consider the behaviour of PBF. Under the null, it follows a uniform distribu- tion (asymptotically), just like any other valid p-values. The calibration of it for a

finite sample size will be discussed later by simulation (Chap. 2.3.4). What would be more interesting is the power performance of PBF, and in particular, how it differs from PLR.

The power of the test given a fixed β is

p X 2 t 2 Power(PBF) = P( λiQi > Cα) ,Qi ∼ χ1(τ(uiXβ) ), (2.11) i=1

p P 2 where Cα is the critical value calculated from the null distribution, λiχ1. For i=1 comparison, the power of PLR is given by

p X 2 2 t 2 Power(PLR) = P( Qi > χp,1−α) ,Qi ∼ χ1(τ(uiXβ) ), (2.12) i=1

2 2 where χp,1−α denotes the (1−α) quantile of χp distribution. It should be noted that in (2.11), λ1, . . . , λp appear on both sides of the inequality inside the probability

54 term. Hence the scaling of λ1, . . . , λp does not affect the power. As a result, when p = 1, λ1 can be dropped from both sides and thus PBF and PLR have the same power.

Proposition 2.15. For simple linear regression, the power of PBF is equal to the power of PLR.

When p = 2 and λ1 6= λ2, the two p-values (PBF and PLR) have different power. This difference is illustrated by the following example. Choose τ = 1, and let the thin singular value decomposition of X be

    10 0 1 0     X = U     . 0 2 0 1

t 2 2 2 2 Denote β = (β1, β2) . By Proposition 2.13, Q1 ∼ χ1(100β1 ) and Q2 ∼ χ1(4β2 ).

Consider the independent normal prior with σ = 0.2. We can calculate λ1 = 0.8 and λ2 = 0.14. Then we use Monte Carlo sampling (10, 000 samples per β) to compute the power at α = 0.05. The result is shown in Figure 2.2. On the horizontal direction, PBF is better since, to achieve a given power, it requires a smaller value of |β1| than PLR. On the vertical direction, PLR is better. Note that both PLR and PBF have the largest power at direction (1, 0) (the first right-singular vector of X), because the data is most informative at that direction. However, this bias is exaggerated for PBF due to the weights λ1, λ2.

55 Figure 2.2: Power comparison between PBF and PLR for p = 2. The singular value of the design matrix X is set to 10 and 2, and we use σ = 0.2. The red contours represent the power of PLR and the blue contours are the power of PBF. We draw the contours at power = 0.1, 0.5, 0.8, 0.99.

2.2.5 More about Simple Linear Regression

For simple linear regression, it is possible to derive more quantitative results about the behaviour of BFnull and PBF. Recall that when p = 1,

2 log BFnull ≈ λ1Q1 + log(1 − λ1),

where Q1 ≈ 2 log LR. If Q1 is fixed, we observe that BFnull is maximized at some value of λ1 and the corresponding prior can be calculated analytically. Since the prior parameter V is now a 1 × 1 matrix, we may write

V = [σ2].

56 4.0

3.5 (BF) 3.0 10 g o l 2.5

n = 200 2.0 n = 500 n = 1000

0.0 0.5 1.0 1.5 2.0

σb

Figure 2.3: How BFnull changes as σ ranges from 0.05 to 2. We assume X has t unit variance and thus X X = n. Fix Q1 = 24, which corresponds to a p-value equal to 10−6.

Proposition 2.16. Consider simple linear regression. Assuming Q1 is given, then

max 2 log BFnull = Q1 − 1 − log Q1; σ2

Q1 − 1 arg max BFnull = n . σ2 P 2 xi i=1

Proof. By differentiating log BFnull w.r.t λ1, we get

Q1 − 1 = arg max BFnull. Q1 λ1

P 2 P 2 −2 Since λ1 = xi /( xi + σ ), we obtain the result.

Figure 2.3 shows how BFnull changes as σ ranges from 0.05 to 2 with Q1 = 24

−6 (PBF = 10 ).

Using this result, we can also quantify the numerical relationship between

PBF (which is equal to PLR by Proposition 2.12) and BFnull. Statisticians have

57 been very interested in this relationship because the Bayes factor can be used to calculate the posterior probability of the alternative model, which is often compared with the p-value. Certainly this numerical relationship depends on the model and the particular hypothesis testing method, but in most cases, the p-value is numerically “more significant” than the Bayes factor. For example, see

Good [1992], Berger and Sellke [1987] and Sellke et al. [2001], among others.

Proposition 2.17. For simple linear regression, let P = PBF = PLR. For suffi- ciently large Q1,

− log P ≈ max log BFnull + log Q1 + 0.73 σ2

Proof. From the last proposition, we have max 2 log BFnull = Q1 −1−log Q1. Thus we only need to establish the relationship between Q1 and P . Since Q1 is equal to the test statistic of the LRT,

Z ∞ 1 P = √ x−1/2e−x/2dx Q1 2π 2 Z ∞ x−1 −1/2 −Q1/2 −1/2 −x/2 = √ Q1 e − √ x e dx. 2π Q1 2π

Since 1/x ≤ 1/Q1 for x ≥ Q1, we have

2 1 2 −1/2 −Q1/2 −1/2 −Q1/2 √ Q1 e − P ≤ P ≤ √ Q1 e . 2π Q1 2π

2 Let f 2 be the density function of χ . After rearrangement we obtain χ1 1

2Q1 f 2 (Q ) ≤ P ≤ 2f 2 (Q ). χ1 1 χ1 1 Q1 + 1

Clearly, for sufficiently large Q , P ≈ 2f 2 (Q ). The result then follows by doing 1 χ1 1

58 some algebra. The constant in the formula corresponds to

1 (1 + log π − log 2) ≈ 0.73. 2

We may only use P ≤ 2f 2 (Q ) to derive the inequality χ1 1

− log P ≥ max log BFnull + log Q1 + 0.73. σ2

−1 This implies that P is always greater than BFnull. Besides, in practice, the value of σ2, which should be chosen before the testing, is usually not the “optimal” value

that maximizes the Bayes factor. Hence a p-value equal to 10−7 often corresponds to a Bayes factor around 105.

2.3 Computation of the P-values Associated

with Bayes Factors

By Theorem 2.7, we can calculate the asymptotic p-value for the Bayes factor.

However, there are two challenges. First, since this p-value, PBF, is asymptotic, can we still trust it for a finite sample size? In Chap. 2.3.1 a correction method

is introduced to improve the calibration of PBF when n is only moderate. Second, the numerical computation requires us to evaluate the distribution function of a

2 linear combination of χ1 random variables, which has been a difficult problem for a long time. We have implemented a new method, proposed by Bausch [Bausch,

2 2013], that has only polynomial complexity in the number of χ1 random variables.

59 2.3.1 Bartlett-type Correction

Denote the test statistic for calculating PBF by 2 log R which has asymptotically

2 the same distribution as a weighted sum of independent χ1 random variables with weights λ1, . . . , λp. For a moderate sample size, a correction method for 2 log R to

improve the calibration of PBF can be developed. We borrow the idea of Bartlett- type correction to the LRT statistic, which was first noticed by Bartlett [1937]

and later generalized by Box [1949] and Lawley [1956]. By Wilks’ theorem, the

2 likelihood ratio test statistic, denoted by Λ, converges weakly to a χp-distributed random variable at rate o(1) [Wilks, 1938]. For a small sample size, the calibration

of this p-value can be very poor. Suppose we have an estimator, E0[Λ], that esti-

−3/2 mates the expected value of Λ under the null with error as small as Op(n ). The

2 corrected test statistic, pΛ/ E0[Λ], converges weakly to a χp-distributed random variable at rate O(n−2), under very general conditions [Bickel and Ghosh, 1990].

This strategy can be used to introduce a heuristic correction for 2 log R.

Consider the general model with confounding covariates. This is because q

(the number of confounding covariates) has to appear in the correction term.

By Lemma 2.11, under the null, 2 log R can be expressed using independent chi-

squared random variables

p p P P λiQi λiQi i=1 i=1 2 log R = −n log(1 − p ) = n log(1 + p ), P P Q0 + Qi Q0 + (1 − λi)Qi i=1 i=1

2 Q0 ∼ χn−p−q,

2 Qi ∼ χ1 for 1 ≤ i ≤ p.

P Asymptotically 2 log R has the same distribution as i λiQi, of which the expec- P tation is i λi. To apply Bartlett-type correction, we need find a higher-order

60 approximation for E0[2 log R]. Define

p def. X A = λiQi i=1 p def. X B = (1 − λi)Qi i=1

Note that A and B are not independent. By Taylor expansion,

2 3 nA nA nA −2 2 log R = − 2 + 3 + oP (n ). Q0 + B 2(Q0 + B) 3(Q0 + B)

P Since B/Q0 → 0, we can apply Taylor expansion again,

2 nA nA B B −2 = (1 − + 2 ) + oP (n ), Q0 + B Q0 Q0 Q0 2 2 nA nA B −2 2 = − 2 (1 − 2 ) + oP (n )), 2(Q0 + B) 2Q0 Q0 3 3 nA nA −2 3 = 3 + oP (n ). 3(Q0 + B) 3Q0

We group the terms according to their orders and direct calculation gives

def. nA nα1 γ1 = E[ ] = , Q0 β1 2 def. n A −n(2α3 + α1α2) γ2 = E[ 2 (−AB − )] = , Q0 2 β1β2 3 def. n 2 2 A n 8 1 3 γ3 = E[ 3 (AB + A B + )] = [ α5 + 2α1α4 + α1 + (4 + p)α1α6 + (8 + 2p)α7], Q0 3 β1β2β3 3 3

where by abuse of notations,

p p p P P 1 P 1 βi = n − p − q − 2i , α1 = λi , α2 = (1 − λi) , α3 = λi(1 − λi) i=1 i=1 2 i=1 2 p p p p P 2 P 3 P P α4 = λi , , α5 = λi , α6 = (1 − λi) , α7 = λi(1 − λi). i=1 i=1 i=1 i=1 (2.13)

61 Combining them, we have, for k = 1, 2, 3,

k X −k+1 E0[2 log R] = γi + o(n ). (2.14) i=1

Hence, E0[2 log R] can be estimated by

k ˆ def. X E(k)[2 log R] = γi. i=1

ˆ In addition to 2 log R, we now have obtained three corrected test statistics, α1(2 log R)/E(k)[2 log R] (for k = 1, 2, 3). Similarly, the LRT statistic could be corrected by using

np np(p + 2) np(p + 2)(p + 4) −2 Eˆ[2 log LR] = − + + o(n ), (2.15) β1 2β1β2 3β1β2β3

where β1, β2, β3 are as defined in (2.13).

2.3.2 Bausch’s Method

The current most popular method is Davies’ method, which relies on the numerical inversion of the characteristic function. See Chap. 7.4.1 for a brief introduction. It is convenient, not difficult to implement, but involves the numerical integration of a highly oscillatory integrand which might produce only limited accuracy [Bausch,

2013].

Bausch [2013] proposed to calculate the distribution function of a linear com-

2 bination of independent χ1 random variables by Taylor expansion. His method

2 has only (at most) polynomial complexity in the number of χ1 random variables. Furthermore, the error bound can be explicitly evaluated and arbitrary accuracy can be obtained. I make a brief introduction of his algorithm in the current section and describe in details how we implemented it in C++ in the next section.

62 2 Let X1,...,Xp be i.i.d. χ1 random variables. We are interested in the distri- bution function of the weighted sum,

p ¯ X Q = λiXi, (2.16) i=1

where the weights λ1 ≥ · · · ≥ λp > 0. For the time being, let’s assume p is even and hence we may rewrite (2.16) as

p/2 ¯ X def. Q = Yk,Yk = λ2k−1X2k−1 + λ2kX2k. k=1

Bausch noticed that, by Kummer’s second transformation [Abramowitz and Ste- gun, 1964, Chap. 13],

1 λ1 + λ2 λ1 − λ2 fY1 (y) = 1/2 exp(− y)I0( y), (4λ1λ2) 4λ1λ2 4λ1λ2

where I0 is the modified Bessel function of the first kind with degree 0 [Abramowitz

and Stegun, 1964, Chap. 9]. The function I0 can be computed via Taylor expan- sion,

∞ X (x/2)2i I (x) = . 0 (i!)2 i=0

Hence, we may write

λ2k−1 − λ2k I0( y) = Tk(y) + Rk(y), 4λ2k−1λ2k

where Tk(y) is a finite power series and Rk(y) is the corresponding remainder term.

Multiply both Tk and Rk by the constant and the exponential term, and express

63 fYk by

˜ ˜ fYk (y) = Tk(y) + Rk(y), ˜ 1 λ2k−1 + λ2k Tk(y) = 1/2 exp(− y)Tk(y), (4λ2k−1λ2k) 4λ2k−1λ2k ˜ 1 λ2k−1 + λ2k Rk(y) = 1/2 exp(− y)Rk(y). (4λ2k−1λ2k) 4λ2k−1λ2k

A key observation made by Bausch is that the convolution of functions like e−βxxn is a “closed” operation:

n Z c n! X (βc)i e−βxxndx = [1 − e−βc ]. βn+1 i! 0 i=0

The integrand and the integral have the same form and the same highest degree. ˜ ˜ Therefore, we can perform the convolution of T1(y),..., Tp/2(y) algebraically. Such algebraic operations can be coded without much difficulty using mathematical programming language like Mathematica and Maple. However, we coded in C++ and hence had to define our own math objects. The complexity of this algorithm is polynomial in p.

To make the algorithm useful in practice, we need find an error bound. Let ∗ denote the convolution. By the associativity of convolution,

Z c ¯ ˜ ˜ ˜ ˜ P(Q ≤ c) = (T1 + R1) ∗ · · · ∗ (Tp/2 + Rp/2) 0 c Z    n 1−n  X ˜n1 ˜1−n1 ˜ p/2 ˜ p/2 = T1 · R1 ∗ · · · ∗ T1 · R1 . 0 nk∈{0,1}

By Young’s inequality for convolutions [Hardy et al., 1952, Chap. 4],

p/2    n 1−n  ˜n1 ˜1−n1 ˜ p/2 ˜ p/2 Y ˜nk ˜1−nk || T1 · R1 ∗ · · · ∗ T1 · R1 ||1 ≤ ||Tk · Rk ||1, k=1

64 1 where || · ||1 denotes the ` -norm. If p = 4, then we have,

Z c ¯ ˜ ˜ ˜ ˜ P(Q ≤ c) = (T1 + R1) ∗ (T2 + R2) 0 Z c  ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜  = T1 ∗ T2 + T1 ∗ R2 + R1 ∗ T2 + R1 ∗ R2 0 Z c  Z c Z c Z c Z c Z c Z c ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + T1 R2 + R1 T2 + R1 R2 0 0 0 0 0 0 0 Z c  Z c Z c Z c Z c ˜ ˜ ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + R2 + R1 + R1 R2 0 0 0 0 0 Z c  Z c Z c ˜ ˜ ˜ ˜ ≤ T1 ∗ T2 + (1 + R1)(1 + R2) − 1. 0 0 0

˜ The second last inequality follows from the fact that ||Tk||1 < 1 since fYk is a probability density function. This result could be easily generalized to any even p:

  Z c  p/2 Z c ¯ ˜ ˜ Y ˜  P(Q ≤ c) ≤ T1 ∗ · · · ∗ Tp/2 + (1 + Rk(y)dy) − 1. (2.17) 0 k=1 0 

Hence we may calculate P (Q¯ ≤ c) by

Z c ¯ ˜ ˜ P (Q ≤ c) ≈ T1 ∗ · · · ∗ Tp/2, 0 with error bound

  p/2 Z c Y ˜  (1 + Rk(y)dy) − 1. k=1 0 

˜ Note that in the derivation of (2.17), we actually don’t have to use the fact ||Tk||1 < 1, and the error bound would be

  p/2 Z c Z c p/2 Z c Y ˜ ˜  Y ˜ ( Tk(y)dy + Rk(y)dy) − Tk(y)dy. k=1 0 0  k=1 0

65 R c ˜ We might also use the fact 0 Tk < FYk (c) to derive another error bound,

  p/2 Z c p/2 Y ˜  Y (FYk (c) + Rk(y)dy) − FYk (c). (2.18) k=1 0  k=1

2 When the number of χ1 random variables is odd, we only need to first calculate

2 the distribution function of the weighted sum of the p − 1 χ1 random variables and then perform one numerical integration.

2.3.3 Implementation of Bausch’s Method

We implemented Bausch’s method in C++ to gain maximum speed. The source code and executables of our program BACH (Bausch’s Algorithm for CHi-square weighted sum) are freely available at http://haplotype.org. We used the GNU

Multiple Precision Arithmetic Library (GMP) so that we can use arbitrary-precision

floating-point numbers to accurately calculate extremely small p-values. By de- fault, the floating-point numbers in our program have 76 effective digits.

We now lay out the implementation details. Our goal is to use as few as possible Taylor expansion terms to achieve a desired precision. First, we sort the weights so that λ1 ≥ λ2 ≥ · · · ≥ λp. If p is odd, the smallest weight is held for the numerical integration at last, which is simply done by a weighted sampling scheme. The numerical integration almost never introduces additional noticeable error to the p-value due to its high precision and the ordering of the weights. The main reason for this ordering, nevertheless, is that we use Taylor expansion to approximate

λ2k−1 − λ2k I0( y) 4λ2k−1λ2k

66 and thus we want to make (λ2k−1 −λ2k)/4λ2k−1λ2k as small as possible (I0(x) grows fast as x increases). In fact, if we don’t order the weights, we are more likely to encounter the integration of e−βxxn with extremely small β in the convolution ˜ ˜ of T1,..., Tp/2. This would also make the algorithm numerically unstable. Now we explain how to find out the required Taylor expansion degrees. By Taylor’s

Theorem, we can control

Rk(y) ≤ δ(Tk(y) + Rk(y)), ∀0 < y < c.

(The method will be given later.) Then,

Z c Z c ˜ ˜ 1 ˜ FYk (c) = P(Yk < c) = [Tk(y) + Rk(y)]dy ≥ Rk(y)dy. 0 δ 0

By (2.18), the error bound is given by

Z c  def. ¯ ˜ ˜ Err(c) = P(Q ≤ c) − T1 ∗ · · · ∗ Tp/2 0 p/2 Z c p/2 Y ˜ Y ≤ (FYk (c) + Rk(y)dy) − FYk (c) k=1 0 k=1 p/2 p/2 Y Y ≤ (FYk (c) + δFYk (c)) − FYk (c) k=1 k=1 p/2  p/2  Y ≤ (1 + δ) − 1 FYk (c) k=1 p/2 pδ Y ≈ F (c) 2 Yk k=1

Hence, to control the error, we only need to choose δ such that

−1  p/2  2 Err(c) Y  δ ≥ F (c) . p Yk k=1 

67 Since usually we care more about the relative error instead of the absolute error, we

first calculate a lower bound for the p-value by the method described in Chap. 7.4.2

(Eq. (7.24)) and then use it to determine the value of Err(c).

In practice, small p-values are of most interest. For example, in GWAS, the significance threshold is typically 5 × 10−8, after correction for multiple compar- isons. Here we describe a trick that makes the calculation of extremely small p-values more efficiently. Note that the tail probability of Q¯ should have the form

p/2 ¯ X P(Q > c) = exp(−Akc)Gk(c), k=1

where Ak ∈ R and Gk(c) is an infinite power series. However, as we omit Rk(y) in the Taylor expansion of modified Bessel functions, we end up with

p/2 ¯ X 0 P(Q > c) ≈  + exp(−Akc)Gk(c), k=1

0 where Gk(c) is a finite power series and  6= 0. We might call  the limiting error since as c → ∞ (the p-value then should vanish), it is the error of the p-

value computed using our algorithm. In our implementation, we simply neglect

 when c is large. Note that we cannot always neglect  since its existence is to

offset the error introduced by Taylor expansion. From our tests, the extremely

small p-values obtained by omitting  appear to be very accurate. To see the

reason, note that when c is large, the amount we have omitted in the power series,

0 Gk(c) − Gk(c), has very little influence on the p-value, since the order of the

−Akc magnitude of exp(−Akc)Gk(c) is dominated by e .

In theory, it is likely that our algorithm fails to produce a correct or highly

accurate p-value. For instance, the maximum degree of the Taylor expansion of I0 is set to 160 in our software (it can be modified by the user), but when c is large, it

68 might not be enough to produce a desired relative error bound. Another possible

scenario is that the p-value is so small that we have to discard , as described in the last paragraph. In such cases, we calculate the lower and the upper bound of the p-value by (7.25) in Chap. 7.4.2. These bounds can always be quickly and exactly evaluated. When p is large or the p-value is very small, these bounds already meet most practical needs. If our p-value fails to be within the bounds, which never occurred in our simulation or tests, we use the bounds to estimate the p-value. Otherwise we trust our p-value but report the error bound by comparing the p-value with the bounds.

2.3.4 Calibration of the P-values

Using our asymptotic results, we can evaluate extremely small p-values for Bayes factors, which is an important advantage in applications such as GWAS compared to the permutation method described in Servin and Stephens [2007]. However, our PBF is an asymptotic p-value, and hence its calibration for moderate sample sizes need to be examined. Since LRT is one of the most widely used asymptotic tests, the calibration of PBF is compared with that of PLR.

A GWAS dataset (IOP) is used for simulation. The details of the IOP dataset are given in Chap. 7.7.1. Sample size n is chosen to be 100, 300, 1000 and the

number of covariates p is set to 10 or 20. For every combination of n and p,

a subset of genotypes of n individuals and p SNPs is randomly sampled, and y is simulated under the null model, y ∼ MVN(0, In). Then PLR and PBF are computed (σ = 0.2). This step is repeated for 107 times. Fig. 2.4 and Fig. 2.5 show that PBF is well calibrated, and the calibration is usually better than PLR at the tail. The performance of the corrected test statistics is also investigated. For both tests, a third-order approximation to the expected value of the test statistic under

69 Figure 2.4: Calibration of PBF and PLR for p = 10. The red dots represent PLR from the likelihood ratio test and the blue represent pB. The grey region indicates a 95% confidence band, which is calculated using the fact that a order statistic from a uniform distribution follows a Beta distribution. the null (see Eq. (2.14) and (2.15)) is used. Simulations show that the Bartlett- type correction improve the calibration of the p-values substantially on the linear scale but not much on the logarithmic scale. Overall, it can be concluded that PBF is well-calibrated when the sample size is more than a few hundred. Furthermore, as far as the tail calibration is concerned, the uncorrected test statistic can simply be used for computing PBF while for PLR the correction seems important for a small sample size.

70 Figure 2.5: Calibration of PBF and PLR for p = 20. The red dots represent PLR from the likelihood ratio test and the blue represent pB. The grey region indicates a 95% confidence band, which is calculated using the fact that a order statistic from a uniform distribution follows a Beta distribution.

71 Chapter 3

A Novel Algorithm for

Computing Ridge Estimators

This chapter introduces a novel algorithm for computing ridge regression estima- tors, which is a critical step in the calculation of Bayes factors given in Chap. 2.

This algorithm could be applied to the Bayesian variable selection and substan- tially boosts the MCMC sampling.

3.1 Background

Consider the linear regression with the response variable y = (y1, . . . , yn) and an n × p design matrix X,

y = Xβ + ε.

−1 The errors vector 1, . . . , n are i.i.d. with expectation 0 and variance τ . The ridge regression estimator for the coefficient vector β is obtained via a type of

72 Tikhonov regularization [Tikhonov and Arsenin, 1977],

ˆ 2 2 βR(λ) = arg min ||y − Xβ|| + c||β|| β (3.1) = (XtX + λI)−1Xty

where || · || denotes `2 norm. Comparing with the ordinary least squares estimator ˆ (denoted by βLS) given in Proposition. 7.14, it can be seen that the only difference in the objective function is the penalty term c||β||2. The constant c ≥ 0 is usually

referred to as the regularization or the shrinkage parameter since it forces βˆ to

be closer to 0. If c = 0, it reduces to the least-squares fitting. Initially the ridge

t ˆ regression was proposed for the cases where X X is ill-conditioned and βLS is unstable due to large variance or even numerically cannot be computed [Hoerl

and Kennard, 1970a,b]. The most important advantage of the ridge estimator can

be explained by the following decomposition of the mean squared error (MSE):

 2 2 −1 E (y − yˆ) = (E[y] − yˆ) + Var(ˆy) + τ .

ˆ The term, E[y] − yˆ, is called the bias. When βLS is used, the bias is clearly ˆ zero since βLS is the best linear unbiased estimator (BLUE) by Gauss Markov ˆ theorem [Plackett, 1950]. However, the MSE for βLS is not necessarily small due to the variance term. The ridge estimator, assuming c > 0, on the contrary is

always biased but the variance might be small if c is chosen appropriately. In fact, ˆ there always exists a c such that the ridge estimator βR(c) attains a smaller mean ˆ squared error than βLS [Hoerl and Kennard, 1970a], which is sometimes known as the bias-variance tradeoff.

Though ridge regression is a non-Bayesian method, the ridge estimator plays a

fundamental role in the Bayesian linear regression. Consider the Bayesian linear

73 regression model defined in (1.1) with the independent normal prior V = σ2I. ˆ First, βR is the maximum a posteriori (MAP) estimator (see Eq. (1.2)) and the posterior mean with the shrinkage parameter c = σ−2. Second, the calculation of ˆ βR is the “rate-determining” step in computing the Bayes factor (see Eq. (1.3)), which is central to a Bayesian variable selection procedure. When we have an extremely large sample space, the efficiency of a MCMC sampling procedure for

Bayesian variable selection hinges largely on whether the ridge estimators can be computed quickly. See Ishwaran and Rao [2005] for a discussion on the relationship between ridge regression and Bayesian variable selection.

3.2 Direct Methods for Computing Ridge

Estimators

The generalized ridge estimator [Draper and Van Nostrand, 1979], denoted by βˆ henceforth in this chapter (the subscript R dropped for simplicity), is obtained by solving

(XtX + Σ)βˆ = z, (3.2) where z = Xty and Σ is some diagonal matrix with nonnegative diagonals.

Clearly the ridge estimator defined in (3.1) is a special case with

Σ = σ−2I.

The length of βˆ is still denoted by p. Define

A def=. XtX + Σ, (3.3)

74 which is assumed to be invertible henceforth. The methods for solving (3.2) can be

divided into two groups: direct methods and iterative methods. In this section,

I first introduce the direct methods. Each of them has some advantages under

certain circumstances. Some methods make advantages of the structure of A

while some are just usual methods for solving systems of linear equations. Since

in our main application (variable selection for GWAS), the matrix X is usually

an n × p matrix with n  p, methods designed for rank-deficient XtX are not discussed.

3.2.1 Spectral Decomposition of XtX

Since XtX is positive semi-definite, it admits the spectral decomposition (Chap. 7.1.3),

XtX = UΛU t.

If Σ = σ−2I, we then have

XtX + Σ = U(Λ + σ−2I)U t.

Therefore, we can obtain the inverse of A and compute βˆ by

βˆ = U(Λ + σ−2I)−1U tz.

One advantage of this method is that if want to evaluate βˆ for many different

values of σ, we only need to perform the spectral decomposition once and every

new evaluation for βˆ costs ∼ 4p2 flops (floating-point operations). But later we

will see there is a better method for this purpose. Second, if we are computing

the Bayes factor defined in (1.3) with the independent normal prior, we can also

75 analytically evaluate its null distribution, since we have actually obtained the singular values of X .Moreover, if we obtain this spectral decomposition by the singular value decomposition (Chap. 7.1.2) of X, the behaviour of the Bayes factor under the alternatives can be quantified too. This is a unique advantage of this method.

Unfortunately, this method is probably the slowest. The standard approach to computing the spectral decomposition of XtX is to perform SVD of either X or XtX. For a square matrix, SVD has a time complexity cubic in p. The exact count of flops cannot be determined since it depends on both the algorithm and the accuracy. For example, for an n×p matrix, we need about 4np2 −4p3/3+O(p2)

flops by Golub-Kahan algorithm [Trefethen and Bau III, 1997, Lec. 31] .

3.2.2 Cholesky Decomposition of XtX + Σ

The Cholesky decomposition [Trefethen and Bau III, 1997, Chap. IV] is another standard way to solve systems of linear equations. Since A is positive definite, it can be decomposed into

A = LLt, where L is a lower triangular real matrix. Once we obtain L, we can quickly compute Ltβˆ by forward substitution and then βˆ by backward substitution. Both substitutions require ∼ p2 flops. The Cholesky decomposition, though has cubic time complexity, is much faster than the SVD and only needs ∼ p3/3 flops [Tre- fethen and Bau III, 1997, Lec. 23]. It is the fastest among the four methods introduced in this section.

76 t 3.2.3 QR Decomposition of the Block Matrix Xt Σ1/2

Any n × p real matrix can be factorized into an n × n orthogonal matrix and a n × p upper triangular matrix. This is called the QR decomposition. The QR decomposition is slower than the Cholesky decomposition but faster than the SVD.

Consider the QR decomposition of the block matrix Xt Σ1/2t,

  X     = QR, (3.4) Σ1/2 where Q is (n + p) × p and R is p × p. Notice that

  X  t 1/2   t t t t X Σ   = R Q QR = R R = X X + Σ = A. Σ1/2

Since R is upper triangular, now we can compute βˆ by one backward substitution and one forward substitution, just like in the last method using the Cholesky decomposition. The flop count for this QR decomposition is ∼ 2np2 − 4p3/3 by

Householder transformation [Trefethen and Bau III, 1997, Lec. 10]. In fact, this is acceptable even compared with the Cholesky decomposition since we don’t need to compute XtX, which requires ∼ np2 flops. But in practice usually XtX (or part of XtX) can be precomputed and stored in the memory, and the update of

XtX can be performed very efficiently.

In some situations, this method is most advantageous. Consider a variable selection procedure with a fixed regularization parameter, i.e., Σ is a diagonal matrix with a constant scaling factor. At every step, we may add or delete a column from X and βˆ need to be recomputed. There is no easy way to update the Cholesky decomposition of A and thus it must be recomputed at every step.

77 However, the QR decomposition of the block matrix in (3.4) is easy to obtain by updating. For example, the removal of the k-th column from X corresponds to the removal of the k-th column and the (n + k)-th row of the matrix Xt Σ1/2t. The new QR decomposition can be very efficiently computed using Givens rotation to introduce zeroes. See Golub and Van Loan [2012, Chap.12.5] for more details.

3.2.4 Bidiagonalization Methods

Numerically the singular value decomposition is often computed via a two-stage algorithm. The first step is called bidiagonalization, which is very similar to SVD except that in the middle of the decomposition is an upper bidiagonal matrix instead of a diagonal. This step can be done within a finite number of opera- tions. The second step is an iterative procedure for finding all the singular values.

Though each iteration is fast, we may need a very large number of iterations if the distribution of the singular values is extreme or a very high accuracy is needed.

Eld´en[1977] noticed that the second iterative stage could be avoided. He devel- oped an algorithm for computing βˆ using only the bidiagonalization, of which the number of operations is of the same order of magnitude as for the SVD-based algorithms (the first method introduced in this section). Therefore, this method might be preferred when we have a fixed X but want to compute βˆ for many different choices of Σ. Nevertheless, the bidiagonalization is still very slow. It requires ∼ 4np2 − 4p3/3 flops by Golub-Kahan algorithm and ∼ 2np2 + 2p3 flops by Lawson-Hanson-Chan algorithm [Trefethen and Bau III, 1997, Lec. 31].

78 3.3 Iterative Methods for Computing Ridge

Estimators

Iterative methods, in contrast to direct methods, produce a sequence of approxi-

mate solutions, βˆ(1),..., βˆ(k),... , such that lim βˆ(k) = βˆ under some conditions. k→∞ If the convergence is quick, iterative methods are much more efficient than direct methods since each iteration usually has time complexity O(p2).

3.3.1 Jacobi, Gauss-Seidel and Successive

Over-Relaxation

Recall our objective is to solve Aβˆ = z. Suppose we have the following decompo-

sition for A:

A = M + N.

where M is invertible. Then,

  Mβˆ = −Nβˆ + z ⇔ βˆ = M −1 −Nβˆ + z . (3.5)

This relationship inspires us to compute βˆ by an iterative procedure. We start

from an initial guess βˆ(0) and in each iteration we improve our guess by

  βˆ(k+1) = M −1 −Nβˆ(k) + z . (3.6)

Of course this approach does not necessarily work. It is possible that βˆ(k) becomes

worse and worse and eventually diverges to infinity. To study its convergence

79 properties, let’s define the error vector at the k-th iteration by

e(k) = βˆ(k) − βˆ. (3.7)

Combining (3.5), (3.6) and (3.7) yields to

e(k+1) = (−M −1N)e(k).

Since lim βˆ(k) = βˆ if and only iff lim e(k) = 0, we have the following theorem k→∞ k→∞ (see Golub and Van Loan [2012, Chap. 10.1.2] for a rigorous proof).

Theorem 3.1. (Convergence of standard iterations) Suppose both A and

M are invertible. Denote the spectral radius of −M −1N by

ρ(−M −1N) def=. max{|λ| : λ is an eigenvalue of − M −1N}.

If ρ(−M −1N) < 1, then for any initial guess βˆ(0), the sequence βˆ(k) defined by (3.6) converges to the true solution βˆ.

Furthermore, the smaller the spectral radius of M −1N, the faster the error

vanishes. Now we are ready to introduce three standard iterative methods. We

split the matrix A into three parts by

A = L + D + U,

where L is the strictly lower triangular component, U is the strictly upper trian-

gular component and D contains only the diagonals. Then we can define three

80 iterative procedures by

h i Jacobi method: βˆ(k+1) = D−1 −(L + U)βˆ(k) + z ;   Gauss-Seidel method: βˆ(k+1) = (D + L)−1 −Uβˆ(k) + z ; h i successive over-relaxation: βˆ(k+1) = (D + ωL)−1 −(ωU − (1 − ω)D)βˆ(k) + ωz .

The Jacobi method does not necessarily converge. One sufficient condition for the convergence of the Jacobi method is strict diagonal dominance, i.e.,

X |aii| > |aij|, ∀i = 1, . . . , p. j6=i

For the other two methods, we have a more convenient result.

Proposition 3.2. Suppose A is symmetric and positive definite. Then,

(a) the Gauss-Seidel method always converges to the true solution;

(b) the successive over-relaxation method converges for ω ∈ (0, 2).

See Golub and Van Loan [2012, Chap. 10.1.2] and Allaire et al. [2008, Chap. 8.2.3] for proofs. In fact, the successive over-relaxation can be seen as a generalization of the Gauss-Seidel method. It may be derived using the equality ωAβˆ = ωz.

In each iteration of the successive over-relaxation, we update our estimate by an weighted average of the last guess and the Gauss-Seidel update. The relaxation parameter, ω, acts as the weight. When ω is chosen appropriately, the succes- sive over-relaxation can achieves a faster convergence rate than the Gauss-Seidel method. However, unless the matrix A has a very nice structure, it is usually very difficult to find the optimal value for ω (clearly we don’t want to compute all the eigenvalues of A).

When implementing these methods, we should take advantage of the sparsity

81 of the matrices L, U and D. The updating equations for the three methods can be rewritten as

i−1 p ˆ(k+1) 1 P ˆ(k) P ˆ(k) Jacobi method: βi = (zi − aijβj − aijβj ); aii j=1 j=i+1 i−1 p ˆ(k+1) 1 P ˆ(k+1) P ˆ(k) Gauss-Seidel method: βi = (zi − aijβj − aijβj ); aii j=1 j=i+1 i−1 p ˆ(k+1) ω P ˆ(k+1) P ˆ(k) ˆ(k) successive over-relaxation: βi = (zi − aijβj − aijβj ) + (1 − ω)βi . aii j=1 j=i+1 (3.8)

Hence, for all the three methods, each iteration costs about 2p2 flops.

3.3.2 Steepest Descent and Conjugate Gradient

Another important class of iterative methods is called Krylov subspace meth-

ods [Trefethen and Bau III, 1997, Lec. 38]. The idea is to find an approximate

solution, βˆ(k), in the subspace

span{z, Az, A2z,..., Ak−1z}. (3.9)

Steepest descent is such an algorithm. Let

ˆ(k) r(k) = z − Aβ

be the “residual” at the k-th iteration. We update βˆ(k) and r(k) together by

rt r ˆ(k+1) ˆ(k) (k) (k) β = β + t r(k); r(k)Ar(k) ˆ(k+1) r(k+1) = z − Aβ .

82 Just like the name suggests, steepest descent searches for the next estimate along

the current gradient. In practice, it is usually used as an optimization algorithm to

find the local maximum (minimum). However, it is rarely used for solving systems

of linear equations due to its bad convergence properties. Instead, conjugate

gradient, which is another Krylov subspace method, is often preferred. Conjugate

ˆ(k) gradient relies on a conjugate sequence of vectors t(1),..., and updates β , r(k) and t(k) by

t r(k)r(k) t(k+1) = r(k) + t t(k); r(k−1)r(k−1) rt r ˆ(k+1) ˆ(k) (k) (k) β = β + t t(k+1); t(k+1)At(k+1) t r(k)r(k) r(k+1) = r(k) − t At(k+1). t(k+1)At(k+1)

The initial values for r and t are given by

ˆ(0) r(0) = t(1) = z − Aβ .

Conjugate gradient usually performs well when the matrix A is large and sparse.

Theoretically speaking, it can also be viewed as a direct method since it always con-

verges to the true solution within p iterations up to the rounding-off error. See Tre-

fethen and Bau III [1997, Lec. 38] and Golub and Van Loan [2012, Chap. 10.2] for

more information.

83 3.4 A Novel Iterative Method Using Complex

Factorization

In this section, a new method for solving (3.7) is proposed. It is iterative, just like the Gauss-Seidel method, and can be generalized by introducing a relaxation parameter. But unlike all the iterative methods discussed in the last section, this method relies on the Cholesky decomposition of $X^tX$ and makes use of the special structure of the matrix $A$. In some applications, like variable selection, the Cholesky decomposition of $X^tX$ can be obtained by updating very efficiently, whereas there is no easy way to update the Cholesky decomposition of $A$ if the diagonals of $\Sigma$ change. Our idea is based on a "complex factorization" of the matrix $A$, and thus we call our method ICF (Iterative solutions using Complex Factorization).

3.4.1 ICF and Its Convergence Properties

Assume the Cholesky decomposition of the Gram matrix $X^tX$ is available and given by $X^tX = R^tR$, where $R$ is an upper triangular matrix. Then we have $A = R^tR + \Sigma$. Define
$$D = R^t\Sigma^{1/2} - \Sigma^{1/2}R, \qquad H = (R^t - i\Sigma^{1/2})(R + i\Sigma^{1/2}). \qquad (3.10)$$

One can check that $A = H - iD$. According to (3.6), we may iteratively compute $\hat{\beta}$ by
$$\hat{\beta}^{(k+1)} = H^{-1}(iD\hat{\beta}^{(k)} + z).$$

Two important observations are made. First, since $H$ is a product of triangular matrices, calculating the right-hand side is quick by forward and backward substitutions. Second, since $\hat{\beta}$ is real, we may discard the imaginary part of the right-hand side in each iteration. Thus, the estimate for $\beta$ is updated by

$$\hat{\beta}^{(k+1)} = \mathrm{Re}\left[H^{-1}(iD\hat{\beta}^{(k)} + z)\right]. \qquad (3.11)$$

Discarding the imaginary part turns out to substantially expedite the convergence.

Just like the successive over-relaxation, we can define a more general iterative procedure by introducing a relaxation parameter ω ∈ (0, 1],

$$\hat{\beta}^{(k+1)} = \mathrm{Re}\left[(1 - \omega)\hat{\beta}^{(k)} + \omega H^{-1}(iD\hat{\beta}^{(k)} + z)\right]. \qquad (3.12)$$

When $\omega = 1$, (3.12) reduces to (3.11). The iterative method defined by (3.12) is referred to as ICF (Iterative solutions using Complex Factorization). Each iteration requires about $6p^2$ flops (a matrix-vector multiplication plus two complex backward/forward substitutions). We still use $e^{(k)}$ to denote the error at the $k$-th iteration. It can be shown that $e^{(k+1)} = \Psi(\omega)e^{(k)}$, where

$$\Psi(\omega) = \mathrm{Re}\left[(1 - \omega)I + i\omega H^{-1}D\right].$$
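To make the update (3.12) concrete, here is a minimal C++ sketch of one ICF iteration for the ridge case $\Sigma = sI$, so that $\Sigma^{1/2} = \sqrt{s}\,I$ and $D = \sqrt{s}(R^t - R)$; it applies $H^{-1}$ through one complex forward substitution and one complex backward substitution and keeps only the real part. The function name icf_step and the dense row-major storage are assumptions for illustration, not the fastBVSR implementation.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd  = std::complex<double>;
using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;  // dense, row-major; illustrative only

// One ICF update (3.12) assuming Sigma = s*I: beta <- Re[(1-w)beta + w*H^{-1}(i*D*beta + z)],
// with D = sqrt(s)*(R' - R) and H = (R' - i*sqrt(s)*I)(R + i*sqrt(s)*I),
// where R is the upper-triangular Cholesky factor of X'X.
void icf_step(const mat& R, double s, const vec& z, double w, vec& beta) {
    const std::size_t p = beta.size();
    const double sq = std::sqrt(s);

    // v = i*D*beta + z: real part z, imaginary part sqrt(s)*(R'beta - R*beta)
    vec Rb(p, 0.0), Rtb(p, 0.0);
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = i; j < p; ++j) {   // R is upper triangular
            Rb[i]  += R[i][j] * beta[j];
            Rtb[j] += R[i][j] * beta[i];
        }
    std::vector<cd> v(p), wv(p), u(p);
    for (std::size_t i = 0; i < p; ++i) v[i] = cd(z[i], sq * (Rtb[i] - Rb[i]));

    // forward substitution with the lower-triangular factor (R' - i*sqrt(s)*I)
    for (std::size_t i = 0; i < p; ++i) {
        cd acc = v[i];
        for (std::size_t j = 0; j < i; ++j) acc -= R[j][i] * wv[j];
        wv[i] = acc / cd(R[i][i], -sq);
    }
    // backward substitution with the upper-triangular factor (R + i*sqrt(s)*I)
    for (std::size_t ii = p; ii-- > 0; ) {
        cd acc = wv[ii];
        for (std::size_t j = ii + 1; j < p; ++j) acc -= R[ii][j] * u[j];
        u[ii] = acc / cd(R[ii][ii], sq);
    }
    // relaxed update; only the real part is kept
    for (std::size_t i = 0; i < p; ++i)
        beta[i] = (1.0 - w) * beta[i] + w * u[i].real();
}
```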

By Golub and Van Loan [2012, Theorem 10.1.1], we have the following proposition.

Proposition 3.3. The ICF method defined by (3.12) converges if and only if $\rho(\Psi(\omega)) < 1$, where $\rho$ denotes the spectral radius.

The next theorem provides the theoretical guarantee of the convergence of ICF.

Theorem 3.4. The convergence of ICF can always be obtained by choosing an appropriate relaxation parameter ω ∈ (0, 1].

Proof. First, since $H = A + iD$, the imaginary part of $H^{-1}$ can be computed explicitly. By Lemma 7.5,
$$\mathrm{Im}(H^{-1}) = -A^{-1}D(A + DA^{-1}D)^{-1}.$$

Then, by the fact that both $A$ and $D$ are real matrices and Lemma 7.2 (the Woodbury matrix identity),
$$\Psi(\omega) = I - \omega(I + A^{-1}DA^{-1}D)^{-1}. \qquad (3.13)$$

Inspection reveals that, given $\omega$, the spectrum of $\Psi(\omega)$ is fully determined by that of $A^{-1}D$. Because $A^{-1/2}DA^{-1/2}$ is skew-symmetric, the eigenvalues of the matrix $A^{-1}D$ must be conjugate pairs of purely imaginary numbers or zero by Proposition 7.11.

Let $\pm\eta i$ be such a pair with $\eta \geq 0$ and let $u$ be the eigenvector corresponding to the eigenvalue $\eta i$. We have
$$A^{-1}Du = i\eta u,$$
which can be rearranged to get $iu^*Du = -\eta u^*Au$, where $u^*$ denotes the conjugate transpose of $u$. From the definition in (3.10), $H$ is a Hermitian positive definite matrix, which yields
$$u^*Hu = u^*(A + iD)u = (1 - \eta)u^*Au > 0.$$

Since $A$ is also positive definite, we must have $\eta \in [0, 1)$. (This also implies that $(I + A^{-1}DA^{-1}D)$ is invertible.) Using (3.13), we can show that $\Psi(\omega)$ must have two eigenvalues equal to $(1 - \eta^2 - \omega)/(1 - \eta^2)$. Hence, to make $\rho(\Psi(\omega))$ smaller than $1$, we just need
$$\left|\frac{1 - \eta^2 - \omega}{1 - \eta^2}\right| < 1 \;\Longleftrightarrow\; \eta < \sqrt{1 - \omega/2} \qquad (3.14)$$
to hold for every possible $\eta$. Since $\eta$ is always strictly smaller than $1$, there must exist a positive $\omega$ that satisfies (3.14).

Consider the special case ω = 1. Then from the proof of the last theorem, the spectral radius of Ψ can be computed using that of A−1D.

Corollary 3.5. If $\omega = 1$, then $\rho(\Psi) = \rho^2(A^{-1}D)/(1 - \rho^2(A^{-1}D))$.

Proof. By definition, $\eta_{\max} = \rho(A^{-1}D)$. The claim is then proved by noticing that all the eigenvalues of $\Psi(1)$ must be non-positive.

When ω = 1, we can also formulate a sufficient condition for the convergence

of ICF.

Corollary 3.6. If $\Sigma = \sigma^{-2}I$ and $\omega = 1$, a sufficient condition for the convergence of ICF is $\max_{i<j} |R_{ij}| < 1/(\sigma\sqrt{2})$.

Proof. By the submultiplicativity of the matrix norm,
$$\rho(A^{-1}D) \leq \rho(A^{-1})\rho(D) \leq \sigma^2\rho(D) \leq \sigma^2\|D\|_{\max} = \sigma\max_{i<j}|R_{ij}|.$$
By (3.14), $\max_{i<j}|R_{ij}| < 1/(\sigma\sqrt{2})$ is therefore a sufficient condition for $\rho(\Psi(\omega)) < 1$.

3.4.2 Tuning the Relaxation Parameter for ICF

To make ICF generally applicable, it is necessary to work out a convenient way to choose the relaxation parameter $\omega$. From the proof of Theorem 3.4, it can be seen that there is a close relationship between $\rho(\Psi)$ and the spectrum of $A^{-1}D$. Indeed, if the latter is known, the optimal value for $\omega$ can be evaluated analytically.

Corollary 3.7. Let $\eta_{\min}$ and $\eta_{\max}$ denote the smallest and largest absolute values of the eigenvalues of $A^{-1}D$. Then the optimal value for $\omega$ is
$$\omega^* = 2\left(\frac{1}{1 - \eta_{\min}^2} + \frac{1}{1 - \eta_{\max}^2}\right)^{-1}.$$
If $\eta_{\min} = 0$, we have $\omega^* = (2 - 2\eta_{\max}^2)/(2 - \eta_{\max}^2)$ and
$$\rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2) = 1 - \omega^*.$$

Proof. From the proof of Theorem 3.4, we know that the smallest and largest eigenvalues of $\Psi(\omega)$ are $1 - \omega/(1 - \eta_{\max}^2)$ and $1 - \omega/(1 - \eta_{\min}^2)$. They can be positive or negative. Recall $\eta \in [0, 1)$. Since the optimal value of $\omega$ must minimize the spectral radius of $\Psi$, we have
$$\omega^* = \operatorname*{arg\,min}_{\omega \in (0,1]}\, \rho(\Psi(\omega)) = \operatorname*{arg\,min}_{\omega \in (0,1]}\, \max\left\{1 - \frac{\omega}{1 - \eta_{\min}^2},\ \frac{\omega}{1 - \eta_{\max}^2} - 1\right\}.$$
$\omega^*$ is attained when the two quantities in the braces on the right-hand side are equal.

By the properties of skew-symmetric matrices, $\eta_{\min} = 0$ when $p$ is odd. When $p$ is even, it is still extremely small for a moderate sample size. Therefore it is treated as $0$ in our discussion. Simulation shows that for a large design matrix that is not ill-conditioned, like in GWAS, $\eta_{\max}$ is usually close to zero. As a result, by simply choosing $\omega = 1$, which is in fact near optimal, ICF converges strikingly fast. In Figure 3.1, both $\rho(\Psi(1))$ and $\rho(\Psi(\omega^*))$ from simulated data are plotted against sample size $n$ ($p = 500$). When $n < p$, $\rho(\Psi(1)) \gg 1$ and thus ICF fails. Even if the optimal $\omega^*$ is used, the spectral radius is still very close to 1, which means the convergence cannot be attained within a reasonable number of iterations. But once $n$ grows greater than $p$, $\rho(\Psi(1))$ plummets. This phenomenon persists for other choices of $p$.


Figure 3.1: The relationship between ρ(Ψ(1)), ρ(Ψ(ω∗)) and n with p = 500 and Σ = I. For each n we simulate 100 datasets with independent predictors. The dots denote the mean and the grey bars indicate the 2.5% and 97.5% quantiles. Xij is sampled from Bin(2, fj) with fj ∼ U(0.01, 0.99) in the left panel and from N(0, 1) in the right.

If the design matrix $X$ has severe multicollinearity, even if $n \gg p$, it is likely that $\eta_{\max}^2 > 1/2$, so that ICF does not converge for $\omega = 1$. To show this, a sub-dataset containing the first 20,000 SNPs on chromosome 1 from the IOP dataset (see Chap. 7.7.1) is constructed. For given $n$ and $p$, $X$ is sampled from this sub-dataset and its corresponding $\eta_{\max}$ is computed. Then, by Corollary 3.7, assuming $\eta_{\min} = 0$, the optimal spectral radius $\rho(\Psi(\omega^*))$ can be computed. This is repeated 1,000 times. Since neighboring SNPs often have a high correlation due to linkage disequilibrium, when $p$ is large, say several hundred, the design matrix $X$ is susceptible to severe collinearity. Figure 3.2 displays the distribution of $\rho(\Psi(\omega^*))$. Recall that when $\eta_{\max}$ is very small, $\omega^*$ is close to 1 and thus $\rho(\Psi(\omega^*))$ is close to zero. But in Fig. 3.2, a large proportion of the $\rho(\Psi(\omega^*))$ values are away from zero, and for those cases using $\omega = 1$ cannot even achieve convergence (this can be shown using Corollaries 3.5 and 3.7). Fortunately, $\rho(\Psi(\omega^*))$ is still away from 1, which implies that the optimal convergence rate is always acceptable. An extremely interesting observation is the peak at $\rho(\Psi(\omega^*)) \approx 1/3$ in Fig. 3.2, which corresponds to $\omega^* = 2/3$ and $\eta_{\max}^2 = 1/2$, the boundary value for the convergence of ICF with $\omega = 1$.

Figure 3.2: The distribution of $\rho(\Psi(\omega^*))$ for the IOP-chr1 dataset. $\eta_{\max}$ is computed using the eigendecomposition of $A^{-1}D$. The optimal spectral radius is then computed by $\rho(\Psi(\omega^*)) = \eta_{\max}^2/(2 - \eta_{\max}^2)$, which is equal to $1 - \omega^*$.

From these simulations, it can be concluded that as long as $n \gg p$ and an appropriate value is chosen for $\omega$, ICF should converge rapidly. In fact, it is not difficult to adjust $\omega$ automatically during the iterations. We start from $\omega^{(0)} = 1$ and assume $\eta_{\min} = 0$. The idea is to compute an estimate of the spectral radius of $\Psi(\omega)$ in each iteration, which in turn helps us determine the value of $\omega$ for the next iteration. As the number of iterations grows, the choice of $\omega$ will tend to the optimal value.

Since overall $\omega^{(k)}$ has a decreasing trend, the spectral radius of $\Psi$ at the $k$-th iteration is
$$\rho(\Psi(\omega^{(k)})) = \max\left\{1 - \omega^{(k)},\ \frac{\omega^{(k)}}{1 - \eta_{\max}^2} - 1\right\} \approx \frac{\omega^{(k)}}{1 - \eta_{\max}^2} - 1.$$

Although sometimes $1 - \omega^{(k)}$ may be greater, this only occurs when $\omega^{(k)}$ is near $\omega^*$, and therefore $\rho(\Psi(\omega^{(k)}))$ can still be approximated by $\omega^{(k)}/(1 - \eta_{\max}^2) - 1$ by Corollary 3.7. A heuristic way to estimate the spectral radius of $\Psi$ is given by
$$\hat{\rho}^{(k)} = \frac{\|\hat{\beta}^{(k)} - \hat{\beta}^{(k-1)}\|_2}{\|\hat{\beta}^{(k-1)} - \hat{\beta}^{(k-2)}\|_2},$$
which leads to an update of $\omega$ by
$$\omega^{(k+1)} = \frac{2\omega^{(k)}}{1 + \omega^{(k)} + \hat{\rho}^{(k)}}.$$
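A minimal C++ sketch of this adaptive rule is given below; the function names estimate_rho and update_omega are illustrative, and the code simply mirrors the two formulas above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Estimate the spectral radius of Psi from the last three iterates
// (the ratio of successive differences in the 2-norm).
double estimate_rho(const std::vector<double>& beta_k,
                    const std::vector<double>& beta_km1,
                    const std::vector<double>& beta_km2) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < beta_k.size(); ++i) {
        num += (beta_k[i] - beta_km1[i]) * (beta_k[i] - beta_km1[i]);
        den += (beta_km1[i] - beta_km2[i]) * (beta_km1[i] - beta_km2[i]);
    }
    return std::sqrt(num / den);
}

// omega^(k+1) = 2*omega^(k) / (1 + omega^(k) + rho_hat^(k))
double update_omega(double omega, double rho_hat) {
    return 2.0 * omega / (1.0 + omega + rho_hat);
}
```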

Theoretically, this estimation may break down in some situations. A trivial case is $\hat{\beta}^{(0)} = \hat{\beta}$. A non-trivial situation is that the error vector may be orthogonal to the eigenvectors of $\Psi$ that correspond to the largest (in absolute value) eigenvalue. In practice such special situations are very rare, and this adaptive procedure works very well in our simulation studies.

3.5 Performance Comparison by Simulation

3.5.1 Methods

Simulation is used to compare the performance of seven methods: ICF, Cholesky

decomposition (of matrix A), Jacobi method, Gauss-Seidel method, successive

over-relaxation, steepest descent and conjugate gradient. Two sub-datasets are

constructed using the IOP dataset (see Chap. 7.7.1). The first sub-dataset contains

20, 000 SNPs evenly distributed across the whole genome, which henceforth is

referred to as IOP-ind. These SNPs can be regarded as independent from each

other due to their distant genomic locations. The second contains the first 20, 000

SNPs on chromosome 1, which henceforth is referred to as IOP-chr1. As explained

in the last section, severe multicollinearity is very likely to occur when the design

matrix is sampled from this sub-dataset. For given $n$ and $p$, $X$ is sampled from each sub-dataset and $y$ is simulated under both the null, $y \sim \mathrm{MVN}(0, I)$, and the alternative, $y \sim \mathrm{MVN}(0, I + 0.5^2XX^t)$. Then the seven methods are applied to compute the corresponding ridge estimator with $\sigma = 0.5$. The Jacobi method and

steepest descent are immediately excluded from the following experiments owing

to poor performance. Both methods easily fail to converge for $p = 200$ and fail every time for $p = 500$ on the IOP-ind dataset.

Five methods enter the final comparison: ICF, Cholesky decomposition (Chol),

Gauss-Seidel method (GS), successive over-relaxation (SOR), and conjugate gradient (CG). For SOR, we have to choose a value for the relaxation parameter, which is known to be very difficult. The most famous result concerning this problem is due to Young [1954] (see also Yang and Matthias [2007]). Unfortunately, the assumption of Young's rule is not met when $p$ grows greater than 500. Thus we did some tests and decided to use 1.2 for the relaxation parameter of SOR,

which appeared to produce a better overall performance than other choices. For all the iterative methods, we start from $\hat{\beta}^{(0)} = 0$ and stop if
$$\|\hat{\beta}^{(k)} - \hat{\beta}^{(k-1)}\|_\infty = \max_i |\hat{\beta}_i^{(k)} - \hat{\beta}_i^{(k-1)}| < 10^{-6} \qquad (3.15)$$
or the number of iterations exceeds $M$, where $M = 50$ for ICF and $M = 200$ for the other methods. Initializing $\hat{\beta}^{(0)}$ to the simple linear regression estimates was also tried, but all results remained essentially unchanged. The code was written in C++ and the Cholesky decomposition was implemented using GSL (the GNU Scientific Library) [Gough, 2009]. GS and SOR were implemented according to (3.8) to obtain maximum efficiency.
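As an illustration of the stopping rule (3.15), a small helper such as the following hypothetical has_converged checks the componentwise change of the iterates against the $10^{-6}$ threshold.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stopping criterion (3.15): the infinity norm of the change between two
// successive iterates must fall below tol. Illustrative sketch only.
bool has_converged(const std::vector<double>& beta_new,
                   const std::vector<double>& beta_old,
                   double tol = 1e-6) {
    double max_diff = 0.0;
    for (std::size_t i = 0; i < beta_new.size(); ++i)
        max_diff = std::max(max_diff, std::fabs(beta_new[i] - beta_old[i]));
    return max_diff < tol;
}
```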

3.5.2 Wall-time Usage, Convergence Rate and Accuracy

Simulation shows that ICF is the best in terms of wall-time usage, convergence rate and numerical accuracy.

Wall-time Usage Sample size n is fixed to 3000 and p ranges from 10 to 1200.

For each $p$, the simulation of $X$ and $y$ is repeated 1000 times. The total wall time used by the five methods is shown in Fig. 3.3, and some of the exact numerical values are provided in Table 3.1. It is very clear that as $p$ grows larger, ICF has an overwhelming advantage over all the other methods. In fact, in the simulation with the IOP-chr1 dataset, all the other iterative methods fail. The Gauss-Seidel method and successive over-relaxation fail to converge within 200 iterations in most cases when $p \geq 600$ (see Table 3.1). If we really wanted to use these two methods to achieve convergence, the time usage would be much greater than that of the Cholesky decomposition. Conjugate gradient, although it converges in all cases as ICF does, is always slower than the Cholesky decomposition. Since the Cholesky decomposition

93 is an exact direct method, there is no reason to use conjugate gradient either.

Figure 3.3: Comparison of the wall time usage of the five methods for computing 1, 000 ridge estimators.

p Dataset | Time in seconds: Chol ICF GS SOR CG | Convergence failures: ICF GS SOR CG

50 IOP-chr1 0.035 0.020 0.029 0.031 0.107 0 8 6 0

50 IOP-ind 0.034 0.019 0.016 0.025 0.104 0 0 0 0

200 IOP-chr1 1.39 0.45 2.13 1.69 2.66 0 125 93 0

200 IOP-ind 1.38 0.304 0.339 0.385 2.29 0 1 1 0

400 IOP-chr1 10.9 2.90 19.1 16.4 16.0 0 427 344 0

400 IOP-ind 11.0 1.64 2.05 1.76 11.5 0 1 0 0

600 IOP-chr1 35.6 10.3 58.0 53.6 51.3 0 754 639 0

600 IOP-ind 35.8 4.53 6.40 4.86 31.2 0 5 4 0

800 IOP-chr1 82.7 20.5 115.5 112 125 0 933 866 0

800 IOP-ind 83.0 7.85 15.2 10.9 65.1 0 10 8 0

1000 IOP-chr1 160 35.8 183 180 244 0 979 951 0

1000 IOP-ind 161 14.3 35.9 25.1 136 0 11 7 0

1200 IOP-chr1 270 63.6 290 287 450 0 998 992 0

1200 IOP-ind 269 20.3 62.5 43.1 202 0 26 22 0

Table 3.1: The wall time usage and the number of convergence failures of the five methods for computing 1, 000 ridge estimators under the null model. The “Time” columns correspond to the points in Figure 3.3. “Convergence failures” columns give the number of cases that fail to stop before M iterations where M = 50 for ICF and M = 200 for the other iterative methods. Simulation under the alternative produces very similar results.

Convergence Rate To further investigate the difference between the iterative methods, I compare the number of iterations used by ICF, successive over-relaxation and conjugate gradient. Gauss-Seidel is excluded since it is always poorer than successive over-relaxation. 10,000 pairs of $X$ and $y$ are simulated under the null for $p = 500, 1000$. The distributions of the number of iterations needed to stop are shown in Fig. 3.4. SOR only works for the IOP-ind dataset. It is excluded in the panels for the IOP-chr1 dataset since 50% of the cases use more than 200 iterations for $p = 500$ and 96% for $p = 1000$. Conjugate gradient, on the other hand, always uses many more iterations than ICF to converge.

Figure 3.4: Comparison of the number of iterations used by ICF, SOR and CG.

Accuracy In the simulation described in the last paragraph, the maximum absolute error for all the cases that have converged is also calculated. Recall our stopping criterion (3.15), which was designed with the aim of controlling this error below $10^{-6}$.

Figure 3.5 shows that ICF usually achieves this precision while the other two

almost never do so. Since X and y are simulated under the null, the entries of the

true βˆ are usually very small. The maximum relative error is found to be usually

two orders of magnitude greater than the maximum absolute error.

Figure 3.5: Comparison of the accuracy of ICF, SOR and CG. Each panel shows the distributions of the maximum absolute error on the $\log_{10}$ scale. Red bars stand for ICF, blue for CG and yellow for SOR. The non-convergent cases are excluded and thus for SOR the total number is much smaller than 10,000 for the IOP-chr1 dataset.

Chapter 4

Bayesian Variable Selection Regression

In genome-wide association studies, one central goal is to identify the causal SNPs, or rather, the SNPs that are associated with the phenotype, from millions of genotyped SNPs. This procedure is called variable selection by statisticians. The classical approach, which is simple but effective, is to use the regression model.

In this chapter, first the methods for Bayesian variable selection based on linear regression are reviewed in Chap. 4.1. Then a novel MCMC algorithm implementing the ICF method (introduced in the last chapter) is proposed, which substantially expedites the model fitting of Bayesian variable selection.

4.1 Background and Literature Review

Consider the regression model,

$$y = \mu\mathbf{1} + X\beta + \varepsilon = \mu\mathbf{1} + \sum_{i=1}^{N}\beta_i x_i + \varepsilon, \qquad \varepsilon \sim \mathrm{MVN}(0, \tau^{-1}I),$$

where $X$ is an $n \times N$ matrix with $N \gg n$ and $\varepsilon$ represents the errors, which are usually assumed to be i.i.d. normal random variables. Both $y$ and $X$ are assumed to be centered so that the intercept term is omitted. If there are other confounding covariates to be controlled for, they should also be regressed out from $y$ and $X$.

To avoid overfitting, we have to impose some constraints on β. For example, we may use ridge regression to shrink the estimates for βi to zero. However, it is often difficult to interpret such results. A more convenient way is to assume the sparseness of the model, i.e., to assume that most elements of β are zero. This assumption is indeed very desirable in most applications. For example, in GWAS, each column of X represents a SNP and only a small number of SNPs may be correlated with the phenotype y. The identification of these causal SNPs is of primary interest to most GWAS studies.

A wide variety of Bayesian methods for variable selection have emerged since the late 1980s. It is difficult to divide them into distinct categories due to the common features shared between different methods. For a comprehensive review structured in this way, see O'Hara and Sillanpää [2009]. I take a different approach. The review is separated into two parts, the formulation of the model and how to fit the model. They are the two defining elements of a Bayesian variable selection method.

For more introductory materials or surveys, see Miller [2002, Chap. 7] and Walli

[2010] among others. Innovations concerning the model selection criterion, such as the deviance information criterion [Spiegelhalter et al., 2002], the fractional

Bayes factor [O’Hagan, 1995] and the intrinsic Bayes factor [Berger and Pericchi,

1996a,b], are beyond the scope of this chapter and thus not to be discussed.

4.1.1 Models for Bayesian Variable Selection

Indicator Variable and “Spike and Slab” Prior Unlike frequentists’ model

selection methods, for example LASSO [Tibshirani, 1996] which outputs a single

best model, Bayesian variable selection procedures usually attempt to compute the

marginal posterior probability of every predictor’s being included in the model.

This can be achieved by introducing the auxiliary indicator variable γ ∈ {0, 1}N

and using the “spike and slab” prior [Ishwaran and Rao, 2005],

$$\beta_i \mid \gamma_i \overset{\text{i.i.d.}}{\sim} (1 - \gamma_i)F_0 + \gamma_i F_1. \qquad (4.1)$$

F0 is a distribution concentrated at or around zero (“spike”) and F1 is a flat distribution spread over a wide region (“slab”). Thus, the variable selection has been translated to the parameter estimation of γ. Typically, the prior for γ is

$$\gamma_i \mid \pi \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi).$$

$\pi$ may be chosen a priori or treated as a hyperparameter (and thus requires a hyperprior). In most applications, $\pi$ is chosen to be very small to reflect the prior belief that only a small number of the covariates are truly effective, implicitly introducing a penalty on model complexity into the marginal likelihood. $P(\gamma_i = 1 \mid y)$ is the posterior probability of a predictor being in association with the response

variable and is sometimes referred to as the posterior inclusion probability (PIP).

The estimation of the posterior inclusion probabilities is a major goal of Bayesian

variable selection.

The first use of this prior seems to be due to Mitchell and Beauchamp [1988], where

F0 is chosen to be the degenerate distribution with unit probability mass at 0,

denoted by δ0, and F1 is a uniform distribution on (−b, b) for some b > 0. A more

100 common choice for F1 is a normal distribution with mean 0. For instance see Chen and Dunson [2003] which also generalizes the “spike and slab” prior via a matrix factorization. Note that F0 and F1 may be allowed to change with i, for example, when the covariates are heterogeneous and form several distinct groups. Another important example of “spike and slab” prior is called stochastic search variable selection (SSVS), proposed by George and McCulloch [1993]. They considered

$$\beta_i \mid \gamma_i \sim (1 - \gamma_i)N(0, \phi_i) + \gamma_i N(0, c_i\phi_i).$$

By letting φi be small and ci be large, this prior also takes “spike and slab” shape. This method was later generalized to non-conjugate settings in George and McCulloch [1997]. A practical difficulty in the implementation of SSVS is the tuning of the parameters ci, φi, which is crucial to the mixing of the Gibbs sampling chain and thus the accuracy of the posterior inferences.

Shrinkage Priors for β Instead of using the indicator variable γ, the model sparseness can also be attained by using a shrinkage prior for β. Such a shrinkage prior should approximate the “spike and slab” shape so that when the data displays no evidence for association between βi and y, the estimate for βi is shrunk to

zero. A common approach is to let $\beta_i \mid \phi_i \overset{\text{i.i.d.}}{\sim} N(0, \phi_i)$ and then put a shrinkage hyperprior on $\phi$. For example, we may let $p(\phi_i) \propto 1/\phi_i$, which is sometimes called Jeffreys' prior. See Xu [2003] for an application to gene mapping. However, the use of this model is very controversial because the joint posterior distribution is improper, though the full conditionals, which are used by Gibbs sampling, are proper. See Hobert and Casella [1996] for the proof. An alternative choice is to

101 use

$$p(\beta_i \mid \phi) = \frac{1}{\sqrt{2\phi}}\exp\left(-\sqrt{\frac{2}{\phi}}\,|\beta_i|\right),$$

i.e., to use a double exponential (Laplace) prior for βi. This prior is referred to as “Bayesian LASSO” [Park and Casella, 2008] since its maximum a posteriori (MAP)

estimation coincides with the LASSO method. φ may be chosen by an empirical

Bayesian method or given a shrinkage hyperprior. To do variable selection with

these models, one can simply set a threshold C and select the covariates with

|βi| > C.

The mixture of g-priors is another important example. Recall Zellner’s g-prior

$$\beta \mid g \sim \mathrm{MVN}(0,\ g\tau^{-1}(X^tX)^{-1}).$$

Liang et al. [2008] proposed to put the following hyperprior on g,

$$p(g) = \frac{a - 2}{2}(1 + g)^{-a/2}, \qquad g > 0.$$

In their simulation studies, they used $a = 3$. An alternative is the Zellner-Siow prior [Zellner and Siow, 1980], which they show is actually a special case of mixtures of g-priors with an inverse-gamma prior on $g$. Remarkably, Liang et al. [2008] proved the consistency of using this prior for both model selection and prediction.

To obtain the PIPs, for small N we may compute the marginal likelihood of every possible model (or equivalently, the null-based Bayes factors) and then compute the PIPs by model averaging.

4.1.2 Methods for the Model Fitting

Markov Chain Monte Carlo Methods There is usually no closed-form ex-

pression for the posterior distribution in Bayesian variable selection. The most

common strategy for posterior inference is to use Markov chain Monte Carlo

(MCMC) methods to generate samples from the posterior. For an introduction to

various MCMC algorithms, see Liu [2008] and Brooks et al. [2011] among others.

In early attempts, Gibbs sampling [Liu, 2008, Chap. 6] was often used to fit the

Bayesian variable selection models with “spike and slab” priors [George and Mc-

Culloch, 1993, Kuo and Mallick, 1998]. However, if F0 and F1 in (4.1) are very different, a proposal of flipping γi may be very unlikely to be accepted because a value of βi sampled from F1 may be regarded as very unlikely under F0. As a result, the mixing of the Markov chain can be very slow and never converge.

Dellaportas et al. [2002] used the idea of Carlin and Chib [1995] and tried to solve this problem by a “pseudo prior”. In their method, when γi = 0, the covariate xi is excluded from the model but βi, instead of being set to 0, is sampled from a prior that is close to its conditional given γi = 1. But this method still relies on very careful tuning that may be difficult to implement in practice.

Another class of MCMC methods uses the Metropolis-Hastings algorithm (see

Chap. 7.6). An early example was the reversible jump MCMC algorithm [Green,

1995, Sillanpää and Arjas, 1998], which required the computation of the Jacobian of the proposal to account for the change in the dimension of the parameter space.

Uimari and Hoeschele [1997] compared the performance of reversible jump MCMC and other Gibbs-sampling-based approaches. Later it was noticed that the MCMC could be more efficient if βi’s are integrated out in each iteration. We simply propose to add or remove a covariate but no longer sample βi’s. The acceptance ratio is then obtained by computing the null-based Bayes factor or the marginal

103 likelihood of the model. This method is used in a large amount of literature, e.g.,

Godsill [1998], Yi [2004], Guan and Stephens [2011], Zhou et al. [2013]. The model

and the MCMC algorithm of Guan and Stephens [2011] will be discussed in great

detail in the next section.

Approximate and Variational Methods The computational difficulty of Bayesian inference arises from an intractable integration. Hence we may approximate this integral by some numerical techniques, e.g. Sen and Churchill [2001], or find a tractable asymptotic estimate. Ball [2001] estimates the marginal likelihood of a model by a modification of the BIC score.

The variational Bayesian method aims to find an analytical approximation to the true joint posterior distribution by some known closed-form distribution functions. See Šmídl and Quinn [2006] for a full introduction to the theory. On the one hand, since a variational method produces a distributional approximation, various posterior inferences can be carried out very easily once we obtain the approximating distributions. On the other, variational methods are deterministic and thus very fast compared with MCMC. Carbonetto and Stephens [2012] describe a simple but very efficient iterative variational algorithm for Bayesian variable selection with the "spike and slab" prior. Though the individual PIP estimates are usually off, which is a known phenomenon for variational Bayesian methods [Fox and Roberts,

2012], the variational estimates for the posteriors of the hyperparameters turn out to be very accurate in a wide range of settings. A recent work by Huang et al.

[2016] improved this algorithm and proved its consistency when the total number of covariates grows exponentially fast with the sample size.

Approximate Bayesian computation (ABC) is another class of methods that directly approximates the likelihood function by simulation [Sunnåker et al., 2013].

In its simplest form of rejection sampling, when new values of the parameters are proposed, ABC decides whether to accept them by simulating data under these parameters and comparing them to the observed data. See Stahl et al. [2012] for a recent application in GWAS.

Searching for Maximum a Posteriori Estimates We can also perform all the statistical inferences using the MAP estimates, which might significantly re- duce the computational cost. Methods of optimization theory and machine learn- ing then can be applied [Tipping, 2004]. Hoggart et al. [2008] applied this strategy to analyzing the whole-genome SNP data with both the normal-inverse-gamma prior and the double exponential prior. In a similar vein, Segura et al. [2012] de- scribed a stepwise variable selection procedure using an asymptotic Bayes factor.

However, in general such approaches are not favored since a paramount advantage of Bayesian methods is the ability to make inferences by model averaging [Raftery et al., 1997, Broman and Speed, 2002].

4.2 The BVSR Model of Guan and Stephens

The Bayesian variable selection regression (BVSR) model proposed by Guan and

Stephens [2011] is the focus of this chapter. Their method has the following advantages.

• There is no need for parameter tuning, except in some very special applications. The method is adaptive and the posterior inclusion probabilities can be accurately estimated under a wide range of settings.

• The MCMC algorithm proposed is very efficient compared with other methods.

• The model is parametrized using a hyperparameter describing the proportion of variance explained by the covariates. The estimation of this hyperparameter often turns out to be very accurate.

Their method was probably motivated by heritability estimation in genome-wide association studies but could be applied to many other fields.

4.2.1 Model and Prior

The model starts with a conventional spike-and-slab setting with indicator variable

γ.

$$\begin{aligned}
y \mid \gamma, \beta, X, \tau &\sim \mathrm{MVN}(X_\gamma\beta_\gamma,\ \tau^{-1}I), \\
\tau &\sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \qquad \kappa_1, \kappa_2 \downarrow 0, \\
\gamma_j &\sim \mathrm{Bernoulli}(\pi), \\
\beta_j \mid \gamma_j = 1, \tau &\sim N(0, \sigma^2/\tau), \\
\beta_j \mid \gamma_j = 0 &\sim \delta_0.
\end{aligned} \qquad (4.2)$$

Consider the application to GWAS. γj = 1 means that the j-th SNP has a causal

effect on the phenotype y. δ0 denotes a point mass at 0. Xγ denotes the design

matrix with columns for which γj = 1 and βγ denotes the corresponding sub-

vector of β. By letting hyperparameters κ1, κ2 (shape and rate parameters) go to zero, we are actually using the prior p(τ) ∝ 1/τ, which might be called Jeffreys’

prior (for $\tau$). Since $\beta_j = 0$ whenever $\gamma_j = 0$, we may also write

$$y \mid \gamma, \beta, X, \tau \sim \mathrm{MVN}(X\beta,\ \tau^{-1}I).$$

For the hyperparameters $\sigma$ and $\pi$, BVSR puts the following hyperpriors:

$$\begin{aligned}
\log\pi &\sim U(\log(\pi_{\min}), \log(\pi_{\max})), \\
\sigma^2 &= \frac{h}{1 - h}\,\frac{1}{\sum_j s_j\gamma_j}, \\
h &\sim U(0, 1),
\end{aligned} \qquad (4.3)$$

where sj denotes the variance of the j-th SNP. πmin (πmax) is chosen small (large) enough to ensure that the true value of π is included. For example, we may use

πmin = 1/N where N is the total number of SNPs in the dataset and πmax = 1. The uniform prior on log π is very critical because it brings about the penalization

on the model complexity that is needed to obtain sparseness. The introduction

of the parameter h is a major novelty of the BVSR model. h has a similar flavor

to the R squared statistic in traditional linear regression and can be interpreted

as the proportion of the variance of y that is due to the additive effects of the

covariates. To see this, notice that τ −1 is the error variance and

$$\mathrm{Var}(X_\gamma\beta_\gamma) = \sum_{\gamma_j = 1}\mathrm{Var}(x_j\beta_j) = \frac{\sigma^2}{\tau}\sum_{\gamma_j = 1}s_j,$$

assuming independence between xj (centered) and βj. Thus,

$$h = \frac{\mathrm{Var}(X_\gamma\beta_\gamma)}{\mathrm{Var}(X_\gamma\beta_\gamma) + \tau^{-1}}.$$

In GWAS, $h$ corresponds to the narrow-sense heritability, which is defined to be the proportion of the phenotypic variance that is due to additive genetic effects. See Chap. 1.3.2 for a short review on heritability estimation. Here are three comments.

1. Just like we did in Chapter 2, we assume X and y are centered and thus

drop the intercept term, µ, from the model. Note that if we explicitly include

µ in the model, the prior would change to p(µ, τ) ∝ τ −1/2 (see Chap. 7.2.3).

When there are other variables to be controlled for, we also regress them

out from X and y, which is equivalent to including them in the model and

putting a non-informative prior on their regression coefficients [George and

McCulloch, 1993].

2. The full posterior of the BVSR model specified by (4.2) and (4.3) is proper.

First, $p(y|\gamma, h) < \infty$ since the distribution of $\tau$ given $y, \gamma, h$ is still a gamma distribution. Second, since $p(h)$ is proper, we have $p(y|\gamma) < \infty$. Third, $p(\gamma) = \int p(\gamma|\pi)p(\pi)\,d\pi < \infty$ even if we use an improper prior for $\pi$ by setting $\pi_{\min} = 0$. Lastly, $\gamma$ can only take a finite number of possible values, and therefore the full posterior for $(\beta, \tau, h, \pi, \gamma)$ is proper.

3. The most important advantage of this model is its flexibility. Both h and

π are estimated from the data and the simulation studies showed that both

of them, especially h, can be accurately estimated in various settings. How-

ever, this parametrization also gives rise to identifiability issues when the

data contains almost no signals. In such cases, the data provides no infor-

mation for estimating σ and h. The estimates for σ concentrate near 0 and

consequently, the posterior for both π and h may spread over a wide range.

For practical purposes, this problem is of little concern since it can be easily

diagnosed from the posterior of σ and the large variability of the posterior

of π and h. If we really want to solve it, we may put a uniform prior on log h

instead of h (choose an hmin > 0 to make it proper) such that the posterior for h is shrunk to zero when the data contains no information for h.

4.2.2 MCMC Implementation

In the original work of Guan and Stephens [2011], the posterior inference is done by a Metropolis-Hastings algorithm with mixed proposals. In each iteration, a new value for $h$ is proposed by a random walk around the current value; $\pi$ is

proposed from the full conditional distribution p(π|γ); γ is proposed by adding or

deleting a predictor from the current model. The proposal distribution for adding

a SNP is a mixture of a uniform and a geometric distribution. Lastly, a long-

range proposal is made for γ with probability 0.3, which is to compound the local proposals randomly many times. Below are the details for each step.

Proposal for h and π By convention let q(θ∗ | θ) denote the proposal distri- bution for a new value θ∗ given the current value θ. The proposal for h is simply a random walk,

h∗ | h ∼ Unif(h − 0.1, h + 0.1).

When an invalid value (outside of (0, 1)) is proposed, the proposed value is then reflected about the boundary. Therefore the proposal ratio for h is always equal to 1.

For π, first notice that it actually can be integrated out. Let’s assume a more general form for the prior of π,

$$p(\pi) \propto \pi^{a-1}(1 - \pi)^{b-1}\,I_{(\pi_{\min},\pi_{\max})}(\pi),$$
that is, a truncated beta prior with $a, b \geq 0$. The prior specified in (4.3) is then a special case with $a = 0$ and $b = 1$. Define the beta function and the distribution

function of the beta distribution by
$$B(a, b) \overset{\text{def.}}{=} \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)}, \qquad F^B_{(a,b)}(x_1, x_2) \overset{\text{def.}}{=} P(x_1 < X < x_2), \quad \text{where } X \sim \mathrm{Beta}(a, b).$$

Then we may write

$$p(\pi) = \frac{\pi^{a-1}(1 - \pi)^{b-1}}{B(a, b)\,F^B_{(a,b)}(\pi_{\min}, \pi_{\max})}\,I_{(\pi_{\min},\pi_{\max})}(\pi),$$

and the marginal prior probability for γ is

$$p(\gamma) = \int p(\gamma|\pi)p(\pi)\,d\pi = \frac{B(a + |\gamma|,\ b + N - |\gamma|)\,F^B_{(a+|\gamma|,\,b+N-|\gamma|)}(\pi_{\min}, \pi_{\max})}{B(a, b)\,F^B_{(a,b)}(\pi_{\min}, \pi_{\max})}, \qquad (4.4)$$

where $|\gamma| = \|\gamma\|_0 = \sum_j \gamma_j$ is the number of covariates in the model. Clearly the denominator is a constant independent of $\gamma$. Since the beta function and the cumulative distribution function of the beta distribution are both easy to calculate, there is actually no need to sample $\pi$ in the MCMC sampling. Besides, the posterior for $\pi$ is not of much interest, or at least is of less interest than the posterior of $|\gamma|$. However, if one insists on sampling $\pi$, it could be proposed from the conditional distribution,

$$\pi^* \mid \gamma^* \sim I_{(\pi_{\min},\pi_{\max})}\,\mathrm{Beta}(a + |\gamma^*|,\ b + N - |\gamma^*|),$$

after we have sampled $\gamma^*$. When $\pi_{\min} = 0$ and $\pi_{\max} = 1$, the sampling is easy.

But when (πmin, πmax) is a region of very small probability, the sampling could be time-consuming. In our new implementation, the algorithm of Damien and

Walker [2001] is used to tackle this difficulty.
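Since the thesis implementation already links against GSL, the log of the marginal prior (4.4) can be evaluated, up to its $\gamma$-independent constant, with GSL's log-beta and regularized incomplete beta functions. The routine below is an illustrative sketch (the name log_prior_gamma is not the fastBVSR API), and it assumes GSL's gsl_sf_lnbeta and gsl_sf_beta_inc.

```cpp
#include <cmath>
#include <gsl/gsl_sf_gamma.h>   // gsl_sf_lnbeta, gsl_sf_beta_inc

// log p(gamma) from (4.4), up to the constant denominator that does not depend
// on gamma. size_gamma = |gamma|, N = total number of covariates, (a, b) are the
// truncated-beta hyperparameters and (pi_min, pi_max) the truncation range.
// Illustrative sketch; uses the regularized incomplete beta I_x(a,b).
double log_prior_gamma(int size_gamma, int N, double a, double b,
                       double pi_min, double pi_max) {
    double a1 = a + size_gamma;
    double b1 = b + N - size_gamma;
    double trunc = gsl_sf_beta_inc(a1, b1, pi_max) - gsl_sf_beta_inc(a1, b1, pi_min);
    return gsl_sf_lnbeta(a1, b1) + std::log(trunc);
}
```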

Proposal for γ The local proposal for $\gamma$ includes adding a covariate and deleting a covariate. The two types are equally likely. For deletion, the proposal is a uniform distribution on all the covariates in the model. For addition, the proposal is a mixture distribution. With probability 0.7, we randomly choose a SNP from all the SNPs that are not in the current model with equal probability. With probability 0.3, the SNP to be added is selected according to a truncated geometric distribution on all the candidate SNPs for which $\gamma_j = 0$. For this geometric proposal, the SNPs are ordered such that those with larger single-SNP Bayes factors (from Bayesian simple linear regression) are more likely to be proposed. This mixture proposal significantly improves the efficiency of the MCMC chain when we have a very large number of SNPs. Without the geometric proposal, the addition may have a very small acceptance probability since the majority of the SNPs are not correlated with the phenotype.

Great care must be taken when calculating the proposal ratio $q(\gamma \mid \gamma^*)/q(\gamma^* \mid \gamma)$. Conditioning on the current value of $\gamma$, the proposal probability for adding the $k$-th SNP is
$$0.7\,\frac{1}{N - |\gamma|} + 0.3\,\frac{s(1 - s)^{R(k;\gamma)}}{1 - (1 - s)^{N - |\gamma|}},$$
where $s$ is the success probability of the geometric proposal and $R(k; \gamma)$ is the rank of the single-SNP Bayes factor of the $k$-th SNP among all the SNPs for which $\gamma_j = 0$, which has to be recomputed every time.
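A small sketch of this addition-proposal probability follows; it uses the mixture weights stated in the prose above (0.7 uniform, 0.3 geometric), and the function name and 0-based rank convention are assumptions for illustration.

```cpp
#include <cmath>

// Probability of proposing to add the k-th SNP, given the current model size.
// N = total number of SNPs, model_size = |gamma|, s = geometric success
// probability, rank = R(k; gamma) among SNPs currently outside the model.
// Illustrative sketch; weights follow the description in the text.
double add_proposal_prob(int N, int model_size, double s, int rank) {
    int n_out = N - model_size;                        // SNPs available to add
    double uniform   = 1.0 / n_out;
    double geometric = s * std::pow(1.0 - s, rank) /
                       (1.0 - std::pow(1.0 - s, n_out));
    return 0.7 * uniform + 0.3 * geometric;
}
```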

111 Computing the Acceptance Ratio The marginal likelihood of the current

model is given by (see (1.3) and Chap. 7.2)

$$p(y \mid \gamma, \sigma^2(h, \gamma)) \propto \frac{1}{\sigma^{|\gamma|}\,|X_\gamma^tX_\gamma + \sigma^{-2}I|^{1/2}}\left(1 - \frac{y^tX_\gamma(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty}{y^ty}\right)^{-n/2}. \qquad (4.5)$$

Then, for the prior specified in (4.3), the Metropolis-Hastings ratio can be calculated by
$$\begin{aligned}
\alpha((h, \gamma), (h^*, \gamma^*)) &= \frac{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q(h, \gamma \mid h^*, \gamma^*)}{p(y \mid h, \gamma)\,p(h, \gamma)\,q(h^*, \gamma^* \mid h, \gamma)} \\
&= \frac{p(y \mid h^*, \gamma^*)}{p(y \mid h, \gamma)}\,\frac{q(\gamma \mid \gamma^*)}{q(\gamma^* \mid \gamma)}\,\frac{\Gamma(|\gamma^*|)\,\Gamma(N - |\gamma^*| + 1)\,F^B_{(|\gamma^*|,\,N-|\gamma^*|+1)}(\pi_{\min}, \pi_{\max})}{\Gamma(|\gamma|)\,\Gamma(N - |\gamma| + 1)\,F^B_{(|\gamma|,\,N-|\gamma|+1)}(\pi_{\min}, \pi_{\max})}.
\end{aligned}$$

Compounding the Local Proposals In every MCMC iteration, with prob-

ability 0.3, we propose a long-range proposal for γ. To this end, we sample an

integer K from a uniform distribution on {2, 3,..., 10} and then propose K local

proposals. The proposal ratio is simply calculated by multiplying the ratio of each

local proposal. To see that this calculation leaves the posterior invariant, by the

detailed balance condition (see Chap. 7.6), we only need to show

$$\begin{aligned}
&\sum_{\varrho \in \mathcal{P}(\gamma \to \gamma^*)} p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)\, \min\left\{1,\ \frac{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)}{p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)}\right\} \\
&\quad = \sum_{\bar\varrho \in \mathcal{P}(\gamma^* \to \gamma)} p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)\, \min\left\{1,\ \frac{p(y \mid h, \gamma)\,p(h, \gamma)\,q^{\varrho}(\gamma \to \gamma^*)}{p(y \mid h^*, \gamma^*)\,p(h^*, \gamma^*)\,q^{\bar\varrho}(\gamma^* \to \gamma)}\right\},
\end{aligned}$$
where $\mathcal{P}(\gamma \to \gamma^*)$ denotes the set of all possible proposal paths from $\gamma$ to $\gamma^*$.

The key observation is that every path $\varrho \in \mathcal{P}(\gamma \to \gamma^*)$ can be reversed to become another path, denoted by $\bar\varrho$, from $\gamma^*$ to $\gamma$. Hence there exists a one-to-one mapping between the sets $\mathcal{P}(\gamma \to \gamma^*)$ and $\mathcal{P}(\gamma^* \to \gamma)$, and for each pair the detailed balance condition holds. Thus, by summing over these paths, the detailed balance condition is still satisfied. The mixing of the MCMC chain can be greatly expedited by using the long-range proposal. Intuitively speaking, this is because it makes the sampling chain jump rapidly across the whole sample space and helps the chain overcome local traps. See Guan and Krone [2007] for a theoretical argument.

Rao-Blackwellization Rao-Blackwellization [Blackwell, 1947] is a common tech-

nique used in sampling schemes to reduce the variance of the estimates [Casella

and Robert, 1996, Douc and Robert, 2011]. Guan and Stephens [2011] also pro-

posed a Rao-Blackwellization procedure to estimate γ and β by computing

E[γj | y, γ−j, β−j, τ, h, π] and E[βj | γj = 1, y, γ−j, β−j, τ, h, π], (4.6)

where $\gamma_{-j}$ and $\beta_{-j}$ denote the corresponding sub-vectors with the $j$-th element removed. To calculate $\mathrm{E}[\gamma_j \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi]$, we only need to figure out
$$\frac{p(\gamma_j = 1 \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi)}{p(\gamma_j = 0 \mid y, \gamma_{-j}, \beta_{-j}, \tau, h, \pi)} = \frac{p(\gamma_j = 1 \mid \gamma_{-j}, \tau, h, \pi)}{p(\gamma_j = 0 \mid \gamma_{-j}, \tau, h, \pi)}\, \frac{p(\beta_{-j} \mid \gamma_j = 1, \gamma_{-j}, \tau, h, \pi)}{p(\beta_{-j} \mid \gamma_j = 0, \gamma_{-j}, \tau, h, \pi)}\, \frac{p(y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h)}{p(y \mid \gamma_j = 0, \gamma_{-j}, \beta_{-j}, \tau, h)}.$$

The first ratio term on the r.h.s. is simply equal to π/(1 − π). The second can be calculated by using the fact that

$$\beta_{-j} \mid \gamma, \tau, h, \pi \sim \mathrm{MVN}\left(0,\ \frac{\sigma^2(h, \gamma)}{\tau}I\right).$$

Note that the second ratio is not equal to 1 because σ2(h, γ) changes with the

value of γj. Let Xγ−j be the sub-matrix of X with all the columns, except the

$j$-th column, for which $\gamma_i = 1$, and let $\beta_{\gamma-j}$ be the corresponding sub-vector. Then,
$$y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h \sim N\!\left(X_{\gamma-j}\beta_{\gamma-j},\ \tau^{-1}(\sigma^2 X_jX_j^t + I)\right).$$

Therefore, we can compute the third ratio term by

$$\frac{p(y \mid \gamma_j = 1, \gamma_{-j}, \beta_{-j}, \tau, h)}{p(y \mid \gamma_j = 0, \gamma_{-j}, \beta_{-j}, \tau, h)} = \sigma^{-1}(X_j^tX_j + \sigma^{-2})^{-1/2}\exp\left\{\frac{\tau}{2}\,\frac{\tilde{y}^tX_jX_j^t\tilde{y}}{\sigma^{-2} + X_j^tX_j}\right\},$$
where $\tilde{y} = y - X_{\gamma-j}\beta_{\gamma-j}$ and $\sigma^2$ is calculated assuming $\gamma_j = 1$. Although these quantities are very fast to evaluate, we cannot afford Rao-Blackwellization in every

MCMC iteration because we have to go through every SNP. In our implementation by default we perform Rao-Blackwellization every 1000 iterations.

4.3 A Fast Novel MCMC Algorithm for BVSR using ICF

For a whole genome dataset that typically contains millions of genetic variants, we have to run a very large number of MCMC iterations in order to obtain good- quality posterior estimates. Filtering the variants before running MCMC is not recommended in light of the small effect size and the collinearity of the data. The

Bayesian methods then become less favorable since an MCMC algorithm could take a week to produce sensible results on a state-of-the-art computer.

The main reason is that when $|\gamma|$ is large, evaluating (4.5) is very time-consuming due to the calculations of

1. the determinant of a $|\gamma| \times |\gamma|$ matrix, $|X_\gamma^tX_\gamma + \sigma^{-2}I|$;

2. the inverse of a $|\gamma| \times |\gamma|$ matrix, $(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}$.

In this section, I describe a novel MCMC algorithm that bypasses the calculation

of the determinant by using the exchange algorithm and efficiently evaluates the

inverse using ICF algorithm introduced in Chap. 3.

4.3.1 The Exchange Algorithm

The exchange algorithm was proposed by Murray et al. [2012], which is actually a variation of the auxiliary method of Møller et al. [2006]. It was devised for sampling from “doubly intractable” posterior distributions where the marginal likelihood also has an intractable normalizing constant, which in our problem corresponds to the determinant term that we want to avoid computing. To illustrate their ideas, define

$$Z(\gamma, \sigma^2(h, \gamma)) = \sigma^{-|\gamma|}\,|X_\gamma^tX_\gamma + \sigma^{-2}I|^{-1/2}, \qquad f(y \mid \gamma, \sigma^2(h, \gamma)) = \left(y^ty - y^tX_\gamma(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty\right)^{-n/2}.$$

Thus, by (4.5), we have $p(y \mid \gamma, \sigma^2) \propto Z(\gamma, \sigma^2)f(y \mid \gamma, \sigma^2)$. The exchange algorithm estimates the ratio $Z(\gamma^*, \sigma^{2*})/Z(\gamma, \sigma^2)$, which is independent of $y$, by an unbiased importance sample $f(y' \mid \gamma, \sigma^2)/f(y' \mid \gamma^*, \sigma^{2*})$, where $y'$ is sampled from $p(\cdot \mid \gamma^*, \sigma^{2*})$. The Metropolis-Hastings ratio of the exchange algorithm is

$$\alpha((h, \gamma), (h^*, \gamma^*)) = \frac{f(y' \mid \gamma, \sigma^2)\,f(y \mid \gamma^*, \sigma^{2*})}{f(y' \mid \gamma^*, \sigma^{2*})\,f(y \mid \gamma, \sigma^2)}\, \frac{p(\gamma^*)\,q(\gamma \mid \gamma^*)}{p(\gamma)\,q(\gamma^* \mid \gamma)}, \qquad y' \sim p(\cdot \mid \gamma^*, \sigma^{2*}). \qquad (4.7)$$

(See Chap. 4.2.2 for the calculation of $p(\gamma)$ and $q(\gamma^* \mid \gamma)$.) It should be pointed out that when sampling $y'$, we don't need to sample $\tau$ because the scaling of $y'$ always cancels out during the calculation of the acceptance ratio.

By checking the detailed balance condition, this strategy can be proved to

115 leave the posterior distribution invariant. The proof was omitted in Murray et al.

[2012], so I give it here. Let θ = (γ, σ2). We only need to check

$$\frac{q(\theta^* \mid \theta)\int \min\{\alpha(\theta, \theta^*), 1\}\,f(y' \mid \theta^*)Z(\theta^*)\,dy'}{q(\theta \mid \theta^*)\int \min\{\alpha(\theta^*, \theta), 1\}\,f(y' \mid \theta)Z(\theta)\,dy'} = \frac{Z(\theta^*)f(y \mid \theta^*)p(\theta^*)}{Z(\theta)f(y \mid \theta)p(\theta)}, \qquad (4.8)$$
where
$$\alpha(\theta, \theta^*) = \frac{f(y \mid \theta^*)}{f(y \mid \theta)}\,\frac{f(y' \mid \theta)}{f(y' \mid \theta^*)}\,\frac{p(\theta^*)\,q(\theta \mid \theta^*)}{p(\theta)\,q(\theta^* \mid \theta)}.$$

Let $\mathcal{Y} = \{y' : \alpha(\theta, \theta^*) > 1\}$. We obtain that
$$\begin{aligned}
\int q(\theta^* \mid \theta)\min\{\alpha(\theta, \theta^*), 1\}\,f(y' \mid \theta^*)Z(\theta^*)\,dy'
&= Z(\theta^*)\int_{\mathcal{Y}} q(\theta^* \mid \theta)f(y' \mid \theta^*)\,dy' + Z(\theta^*)\frac{f(y \mid \theta^*)p(\theta^*)}{f(y \mid \theta)p(\theta)}\int_{\mathcal{Y}^c} q(\theta \mid \theta^*)f(y' \mid \theta)\,dy', \\
\int q(\theta \mid \theta^*)\min\{\alpha(\theta^*, \theta), 1\}\,f(y' \mid \theta)Z(\theta)\,dy'
&= Z(\theta)\int_{\mathcal{Y}^c} q(\theta \mid \theta^*)f(y' \mid \theta)\,dy' + Z(\theta)\frac{f(y \mid \theta)p(\theta)}{f(y \mid \theta^*)p(\theta^*)}\int_{\mathcal{Y}} q(\theta^* \mid \theta)f(y' \mid \theta^*)\,dy'.
\end{aligned}$$

Then (4.8) can be confirmed by straightforward calculations.

4.3.2 Updating of the Cholesky Decomposition

In our MCMC algorithm, a new value for σ is proposed in every iteration since it is determined by h and γ. Hence the matrix inversion in (4.5) should never be computed directly. Instead we just need to compute the ridge estimator,

$$(X_\gamma^tX_\gamma + \sigma^{-2}I)^{-1}X_\gamma^ty.$$

Our novel algorithm ICF, which was introduced in Chap. 3.4, was shown to be better than all the other common methods. To implement ICF, we need the Cholesky decomposition of $X_\gamma^tX_\gamma$, which at first glance is very undesirable since the Cholesky decomposition can take much more time than ICF itself. However, in our MCMC algorithm, most of the time we only propose to add or remove one column of $X_\gamma$. Even when a long-range proposal is made, the number of columns to be changed is often very small compared with $|\gamma|$. The Cholesky

decomposition then can be obtained by updating, which is very fast. The details

of the updating are laid out below. In principle it is very similar to the updating

of QR factorization [Golub and Van Loan, 2012, Chap. 12.5].

Let the current Cholesky decomposition be

$$X_\gamma^tX_\gamma = R_\gamma^tR_\gamma,$$

where Rγ is always upper-triangular. When a new SNP xj is added to the model, we attach it to the last column of Xγ and compute the corresponding entries of

Rγ by forward substitution according to

$$R_\gamma^t\, r_{|\gamma|+1} = X_\gamma^t x_j.$$

The forward substitution needs only $\sim |\gamma|^2$ flops. Calculating the r.h.s. in fact requires many more operations ($2n|\gamma|$ flops), but at least part of it could be precomputed and saved in memory.
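A minimal dense C++ sketch of the append update follows: the new off-diagonal column is obtained from the forward substitution above, and the new diagonal entry is $\sqrt{x_j^tx_j - r^tr}$ (this diagonal formula follows from equating blocks of the enlarged Gram matrix and is spelled out here as an assumption, since the text only states the forward-substitution step). Names and storage are illustrative, not the fastBVSR data structures.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;

// Append one column to the Cholesky factor R (m-by-m, upper triangular, dense).
// Xt_xj = X_gamma' * x_j (length m), xj_sq = x_j' * x_j.
// Returns the new last column (r_1, ..., r_m, r_{m+1}) of the enlarged factor.
vec cholesky_append(const mat& R, const vec& Xt_xj, double xj_sq) {
    const std::size_t m = Xt_xj.size();
    vec r(m + 1, 0.0);
    for (std::size_t i = 0; i < m; ++i) {            // forward substitution with R'
        double acc = Xt_xj[i];
        for (std::size_t j = 0; j < i; ++j) acc -= R[j][i] * r[j];
        r[i] = acc / R[i][i];
    }
    double d2 = xj_sq;
    for (std::size_t i = 0; i < m; ++i) d2 -= r[i] * r[i];
    r[m] = std::sqrt(d2);                            // new diagonal entry
    return r;
}
```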

When a SNP is deleted, we use Givens rotation [Golub and Van Loan, 2012,

Chap. 5.1] to introduce zeros. For example, suppose |γ| = 4 and we want to

remove the second column of Xγ . The new matrix is denoted by Xγ∗ . Then we

first remove the second column of $R_\gamma$, which results in
$$\tilde{R} = \begin{pmatrix} r_{11} & r_{13} & r_{14} \\ 0 & r_{23} & r_{24} \\ 0 & r_{33} & r_{34} \\ 0 & 0 & r_{44} \end{pmatrix}.$$

Though we have $\tilde{R}^t\tilde{R} = X_{\gamma^*}^tX_{\gamma^*}$, $\tilde{R}$ is not upper-triangular. We now construct a Givens rotation matrix
$$G_1 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & r_{23}/\rho & r_{33}/\rho & 0 \\ 0 & -r_{33}/\rho & r_{23}/\rho & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where $\rho = \sqrt{r_{23}^2 + r_{33}^2}$. As a rotation matrix, clearly $G_1^tG_1 = I$. Moreover, $G_1\tilde{R}$

sets $r_{33}$ to zero as we want. Similarly, we can then define a matrix $G_2$ to set $r_{44}$ to zero. Note that the order cannot be interchanged (we must eliminate $r_{33}$ first).

For a general upper-triangular matrix $R_\gamma$, using Givens rotations to remove the $(|\gamma| - k)$-th column requires only $3k(k + 1)$ flops.
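A minimal dense C++ sketch of the deletion update is shown below: the $k$-th column of the factor is removed and the subdiagonal entries that appear are zeroed with Givens rotations, after which the (now zero) last row is dropped. Names and dense storage are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using mat = std::vector<std::vector<double>>;

// R is p-by-p upper triangular (dense); remove column k (0-based) and
// re-triangularize with Givens rotations. Illustrative sketch.
void cholesky_delete(mat& R, std::size_t k) {
    const std::size_t p = R.size();
    for (auto& row : R) row.erase(row.begin() + k);  // drop the k-th column
    // Zero the subdiagonal entries R[j+1][j] for j = k, ..., p-2.
    for (std::size_t j = k; j + 1 < p; ++j) {
        double a = R[j][j], b = R[j + 1][j];
        double rho = std::sqrt(a * a + b * b);
        double c = a / rho, s = b / rho;
        for (std::size_t col = j; col + 1 < p; ++col) {  // rotate rows j and j+1
            double t1 = R[j][col], t2 = R[j + 1][col];
            R[j][col]     =  c * t1 + s * t2;
            R[j + 1][col] = -s * t1 + c * t2;
        }
    }
    R.pop_back();                                    // last row is now all zeros
}
```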

4.3.3 Summary of fastBVSR Algorithm

The new MCMC algorithm is called fastBVSR and it is summarized below.

Algorithm fastBVSR
Initialize $X_{\gamma^{(0)}}$ and calculate the corresponding Cholesky decomposition
for $i = 1$ to $N_{\mathrm{mcmc}}$ do
    Propose $h^*, \gamma^*$ using $h^{(i)}, \gamma^{(i)}$ and compute $\sigma^*$ by (4.3)
    Compute the Cholesky decomposition of $X_{\gamma^*}^tX_{\gamma^*}$ by updating
    Draw $y' \sim p(\cdot \mid \gamma^*, \sigma^*)$ using $\tau = 1$
    Calculate $\alpha((h, \gamma), (h^*, \gamma^*))$ by (4.7) using the ICF algorithm
    Set $h^{(i+1)} = h^*$ and $\gamma^{(i+1)} = \gamma^*$ with probability $\min\{1, \alpha\}$, and stay otherwise
    if $(i + 1) \bmod 1000 = 0$ then
        Sample $\pi^{(i+1)}$ and $\tau^{(i+1)}$ and do Rao-Blackwellization
    end if
end for

4.4 GWAS Simulation

The Height dataset, which contains 3, 925 subjects and near 300K SNPs, is used for simulation. It was the dataset used in Yang et al. [2010]. See Chap. 7.7.2 for more details. In reality, it is usually not recommended to run an MCMC algorithm with as many as 300K variants since the chain has to be run for a huge number of iterations to explore all the SNPs, let alone to achieve convergence. Besides, for most purposes, especially with our linear regression model, we can safely partition the data by chromosome. Therefore, to reduce the number of SNPs, the following three sub-datasets are constructed.

1. Height-ind: 10K SNPs evenly sampled from the whole genome.

2. Height-chr6: the first 10K SNPs located on chromosome 6.

3. Height-5C: the 97, 370 SNPs located on chromosome 1, 2, 3, 4 and 5.

The first two sub-datasets are used to study the effect of multicollinearity. Clearly,

Height-ind represents a dataset with no or very little multicollinearity and Height-

chr6 has severe multicollinearity due to linkage disequilibrium. Chromosome 6 is

picked because the MHC (major histocompatibility complex) region located on

it is well known for its highly complicated LD structure. The last sub-dataset

contains around 100K SNPs on 5 chromosomes and represents the most difficult

case. Our algorithm fastBVSR is implemented in C++ and it is compared to

GCTA (see Chap. 7.5).

4.4.1 Posterior Inference for the Heritability

Recall the definition of h given in (4.3). It is not exactly equal to the proportion of variation explained (PVE), which is defined by

$$\mathrm{PVE} \overset{\text{def.}}{=} \frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{N}(x_{ij} - \bar{x}_j)\beta_j\right)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}.$$

(n is the number of subjects and N is the total number of SNPs.) Even the

expected value of PVE is slightly different from h since the expectation of the

ratio of the two random variables (actually two correlated random variables) is

not equal to the ratio of the corresponding expectations. However, the parameter

h has the same meaning as the ratio of the variance components in the linear

mixed model, which is the heritability definition used by GCTA. Hence we refer

to the posterior inference for h as the posterior inference for the heritability.

For both Height-ind and Height-chr6 datasets, the phenotypes with heritability

h = 0, 0.01,..., 0.99 are simulated. For each choice of heritability, 200 causal SNPs

are randomly sampled. Figure 4.1 shows that for the two small datasets with 200 causal SNPs, the heritability can be very accurately estimated from the posterior.

Mean absolute error   fastBVSR(a)   fastBVSR(b)   GCTA
Height-ind            0.0217        0.0399        0.0289
Height-chr6           0.0271        0.0756        0.0209

Table 4.1: Mean absolute error (MAE) of the heritability estimation. The MAE is calculated as the mean of the absolute difference between the true and the estimated heritabilities. See Fig. 4.1 for the simulation settings.

GCTA seems unbiased too but in the Height-ind dataset its estimates have a larger mean absolute error than the posterior mean estimates of fastBVSR(a) (Table 4.1).

By comparing the two panels in the middle column, it can be seen that the collinearity of the Height-chr6 dataset slows down the convergence of fastBVSR.

For the Height-5C dataset, the number of causal SNPs is set to 1000 and the phenotypes are simulated with heritability h = 0, 0.05,..., 0.95. The GCTA esti- mates for the heritabilities appear to be still unbiased, though with relatively large variance while fastBVSR shows a clear tendency to underestimate the heritability

(Fig. 4.2). To explain the behaviour of fastBVSR, first let’s recall the rationale for variable selection. Just like the multiple comparison correction for the p-value, if we have a larger number of candidate SNPs, by chance we could observe more and stronger spurious signals. Hence we would require a stronger signal threshold to include a SNP into the model. In this simulation, both the numbers of total candidate SNPs and causal SNPs are much larger than in the previous ones (a larger number of causal SNPs implies a smaller σ). Therefore, it becomes much harder for fastBVSR to detect all the signals. A much larger sample size would help fastBVSR overcome this problem.


Figure 4.1: Heritability estimation with 200 causal SNPs in the Height-ind and Height-chr6 datasets. For both datasets, we run fastBVSR with two settings: (a) 20K burn-in iterations, 100K sampling iterations and Rao-Blackwellization every 1000 iterations; (b) 2K burn-in iterations, 10K sampling iterations and Rao- Blackwellization every 200 iterations. We compare the results with GCTA. The grey bars represent 95% posterior intervals for fastBVSR and ±2SE for GCTA.


Figure 4.2: Heritability estimation with 1000 causal SNPs in the Height-5C dataset. We run fastBVSR with 200K burn-in iterations, 1M sampling itera- tions and Rao-Blackwellization every 10K iterations. The grey bars represent 95% posterior intervals for fastBVSR and ±2SE for GCTA.

4.4.2 Calibration of Posterior Inclusion Probabilities

The posterior estimation for the model size (the number of SNPs in the model,

|γ|) in the previous simulations is shown in Fig. 4.3. As explained in Chap. 4.2.1, when the heritability is small, the posterior distribution of the model size shows a larger variance.

It is not realistic to recover all the true signals and accurately estimate the model size in a real data analysis, since many SNPs with tiny effects cannot be identified confidently via any statistical method. However, the posterior inclusion probability (PIP) serves as a measure of how likely a SNP is to be truly associated with the trait. From a practical standpoint, the calibration of the posterior inclusion probabilities is of high importance. To study the calibration, the SNPs are divided into different bins by PIP. Suppose in the $k$-th bin there are $B_k$ SNPs with PIP


Figure 4.3: The posterior estimation of the model size, |γ|. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The truth is marked by the red lines. The grey bars represent 95% posterior intervals.

$f_1, \dots, f_{B_k}$. Let $M_k$ be the predicted number of true positives in this bin. Then

$$\mathrm{E}[M_k] = \sum_{i=1}^{B_k} f_i, \qquad \mathrm{Var}(M_k) = \sum_{i=1}^{B_k} f_i(1 - f_i),$$
and thus

$$\mathrm{E}[M_k/B_k] = \frac{1}{B_k}\sum_{i=1}^{B_k} f_i, \qquad \mathrm{Var}(M_k/B_k) = \frac{1}{B_k^2}\sum_{i=1}^{B_k} f_i(1 - f_i). \qquad (4.9)$$

$M_k/B_k$ is compared with the proportion of true positives in Fig. 4.4. The Rao-Blackwellized estimators for PIP, which were defined in (4.6), exhibit a significant improvement over the crude PIP estimates, especially when the number of MCMC iterations is too small to achieve convergence (Fig. 4.5). In particular, for the Height-ind dataset, the calibration of the Rao-Blackwellized estimates is impressively accurate. For the other two datasets with collinearity, the calibration is arguably acceptable.

4.4.3 Prediction Performance

Prediction is one of the ultimate goals of variable selection. Even if a procedure cannot produce good estimates for the heritability, the model size or the inclusion probabilities, it is still practically useful as long as it has good prediction performance. Consider a future observation $y'$,
$$y' = \sum_{i=1}^{N}\beta_i x_i' + \mu + e' = \mu + \sum_{\gamma_i = 1}\beta_i x_i' + e',$$

where $e' \sim N(0, \tau^{-1})$. In GWAS, $x_i'$ can be assumed to be drawn from a centered $\mathrm{Binom}(2, f_i)$, where $f_i$ is the minor allele frequency of the $i$-th SNP. The covariates $x_1, \dots, x_N$ may not be independent, as in the Height-chr6 and Height-5C datasets.


Figure 4.4: Calibration of the posterior inclusion probabilities. For Height-ind and Height-chr6 we report the results of setting (a) (see Fig. 4.1 for details). PIP stands for posterior inclusion probability and RB-PIP means Rao-Blackwellized estimation of the posterior inclusion probability. The SNPs are divided into 20 bins by PIP (RB-PIP) and the y-axis represents the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives (see (4.9)). The grey bars represent ±2SD.


Figure 4.5: Calibration of the posterior inclusion probabilities for Height-ind and Height-chr6 under setting (b) (see Fig. 4.1 for details).

Let $\hat{\beta}_1, \dots, \hat{\beta}_N$ be the estimates from some procedure. Then $y'$ would be estimated by

$$\hat{y}' = \hat{\mu} + \sum_{i=1}^{N}\hat{\beta}_i x_i'.$$

Therefore, the mean squared prediction error (MSPE) is calculated by

$$\begin{aligned}
\mathrm{MSPE}(\hat{\beta}, \hat{\mu}) &\overset{\text{def.}}{=} \mathrm{E}[(y' - \hat{y}')^2 \mid \hat{\beta}, \hat{\mu}] = \mathrm{E}\left[\left(\sum_{i=1}^{N}(\beta_i - \hat{\beta}_i)x_i' + (\mu - \hat{\mu}) + e'\right)^2 \,\middle|\, \hat{\beta}, \hat{\mu}\right] \\
&= \tau^{-1} + (\mu - \hat{\mu})^2 + \mathrm{Var}\left(\sum_{i=1}^{N}(\beta_i - \hat{\beta}_i)x_i'\right) \\
&= \tau^{-1} + (\mu - \hat{\mu})^2 + (\beta - \hat{\beta})^t\,\mathrm{Cov}(X')\,(\beta - \hat{\beta}).
\end{aligned}$$

The covariance matrix $\mathrm{Cov}(X')$ can simply be estimated by the observed sample covariance matrix. Since $\hat{\mu}$ is usually estimated by the sample mean, its mean squared error is $\tau^{-1}/n$. Thus, we define the MSPE for $\hat{\beta}$ by

$$\begin{aligned}
\mathrm{MSPE}(\hat{\beta}) &\overset{\text{def.}}{=} \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}(\beta - \hat{\beta})^tX^tX(\beta - \hat{\beta}) \\
&= \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}\|X\beta - X\hat{\beta}\|_2^2.
\end{aligned} \qquad (4.10)$$

Consider two estimators for β: the optimal estimator βˆ = β and the null estimator

βˆ = 0. Their MSPEs are given by

$$\mathrm{MSPE}(\beta) = \frac{n + 1}{n}\tau^{-1}, \qquad \mathrm{MSPE}(0) = \frac{n + 1}{n}\tau^{-1} + \frac{1}{n}\beta^tX^tX\beta.$$

We define a metric, relative prediction gain (RPG), which will be used to measure the performance of an estimator $\hat{\beta}$, by

$$\mathrm{RPG}(\hat{\beta}) \overset{\text{def.}}{=} \frac{\mathrm{MSPE}(0) - \mathrm{MSPE}(\hat{\beta})}{\mathrm{MSPE}(0) - \mathrm{MSPE}(\beta)}. \qquad (4.11)$$

Though τ is assumed to be known for computing MSPE, it cancels out in the expression of RPG in (4.11).
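Indeed, substituting (4.10) into (4.11) shows that the $\tau$ terms cancel and $\mathrm{RPG}(\hat{\beta}) = 1 - \|X(\beta - \hat{\beta})\|_2^2/\|X\beta\|_2^2$, which a short function can evaluate directly. The C++ sketch below is illustrative (the name relative_prediction_gain is not part of any released code).

```cpp
#include <cstddef>
#include <vector>

using vec = std::vector<double>;
using mat = std::vector<std::vector<double>>;

// RPG = 1 - ||X(beta - beta_hat)||^2 / ||X beta||^2, obtained by plugging
// (4.10) into (4.11); tau cancels, so it is not needed here.
double relative_prediction_gain(const mat& X, const vec& beta, const vec& beta_hat) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        double fit_true = 0.0, fit_hat = 0.0;
        for (std::size_t j = 0; j < beta.size(); ++j) {
            fit_true += X[i][j] * beta[j];
            fit_hat  += X[i][j] * beta_hat[j];
        }
        num += (fit_true - fit_hat) * (fit_true - fit_hat);
        den += fit_true * fit_true;
    }
    return 1.0 - num / den;
}
```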

Figure 4.6 compares the RPGs of the BVSR regression estimates with the BLUPs (best linear unbiased predictors) used by GCTA. When the true heritability is very small, it makes little sense to talk about RPG since the denominator,

MSPE(0) − MSPE(β) (the variation that could be explained by the predictors), is very small and RPG will have a very large variance. For the Height-ind dataset, fastBVSR shows a substantial advantage over GCTA. Surprisingly, even in setting (b), where the previous plots show that the MCMC clearly has not attained convergence, the Rao-Blackwellized regression estimators still perform very well.

For the Height-chr6 dataset, fastBVSR is again much better than GCTA, especially when the heritability is in (0.2, 0.8), the region of most practical interest.

However, Rao-Blackwellization does not seem to provide much improvement (it is even worse when the heritability is close to 1), which should probably be attributed to the collinearity of the dataset. For the Height-5C dataset, fastBVSR is no longer advantageous, due to two main reasons: the slow convergence caused by the enormous sample space and the collinearity present in the dataset.

4.4.4 Wall-time Usage

Lastly, Fig. 4.7 reports the wall time used by fastBVSR for 10K MCMC iterations under different simulation settings. Since the calculation of the ridge estimators

[Figure 4.6 panels: Height-ind: fastBVSR(a), Height-ind: fastBVSR(b); Height-chr6: fastBVSR(a), Height-chr6: fastBVSR(b); Height-5C.]

Figure 4.6: Relative prediction gain of fastBVSR and GCTA for the simulated datasets with heritability ≥ 0.05. See Fig. 4.1 and Fig. 4.2 for the simulation settings. The RPG is computed by (4.10) and (4.11). BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model.

is the most time-consuming step, the wall-time usage is mostly determined by the posterior distribution of the model size. The simulation setting is otherwise unimportant.

[Figure 4.7 panels: Height-ind, Height-chr6, Height-5C.]

Figure 4.7: Wall time used by fastBVSR for 10K MCMC iterations. The x-axis is the posterior mean of the model size, which mainly determines the wall-time usage of fastBVSR. For the Height-ind and Height-chr6 datasets, we use the results from setting (a). Note that for the two small datasets, we do Rao-Blackwellization every 1K iterations; for Height-5C, we do Rao-Blackwellization every 10K iterations.

Chapter 5

Scaled Bayes Factors

5.1 Motivations for Scaled Bayes Factors

The null-based Bayes factor, in general, is defined by

BF_null(M) := p(y|M) / p(y|M_0),

where M_0 denotes the null model and M denotes the model of interest. Immediately we find that its expectation under the null model is 1 since

E[BF_null | M_0] = ∫ (p(y|M) / p(y|M_0)) p(y|M_0) dy = ∫ p(y|M) dy = 1.   (5.1)

The expectation of log BFnull, however, is not zero. In fact, by Jensen’s inequality,

E[log BF_null | M_0] < 0, but the exact value depends on the model (and the design matrix in regression). Should we fix this expected value to some constant to replace property (5.1)? There are two practical reasons to do so. First, log BF_null, instead of BF_null, is often used as the measure of evidence (see Jeffreys [1961, app. B] and Kass and

Raftery [1995]). Hence aligning the null expectation of log BFnull would provide

more practical convenience. Second, the distribution of Bayes factors is usually

heavy tailed, for example in linear regression. Thus an observation y under the

null usually produces a Bayes factor much smaller than 1 and even if the Bayes

factor is averaged over many observations, the sample mean is hardly close to 1.

In contrast, the expectation of log BF_null would be much easier to approximate by sampling, owing to its smaller variance.

For the time being, let’s simply define the scaled Bayes factor (sBF) by

log sBF = log BFnull − E[log BFnull | M0]. (5.2)

It immediately follows that E[log sBF | M0] = 0. Recall our multi-linear regression model given in (1.1),

y | β, τ ∼ MVN(Xβ, τ^{-1} I),
β | τ, V ∼ MVN(0, τ^{-1} V),
τ | κ_1, κ_2 ∼ Gamma(κ_1/2, κ_2/2),   κ_1, κ_2 → 0.

By Theorem 2.7 and Corollary 2.8, omitting the op(1) error, the null expectation

of log BFnull is given by

E[2 log BF_null | β = 0] = Σ_{i=1}^p (λ_i + log(1 − λ_i)),

where λ_1, ..., λ_p are the eigenvalues of H = X(X^t X + V^{-1})^{-1} X^t. Furthermore, if

we define Q_i := (u_i^t z)^2 as in Theorem 2.7, where z = τ^{1/2} y and u_i is the eigenvector of H (please see Chap. 2 for more information), we can write the asymptotic

expression for sBF as

2 log sBF = Σ_{i=1}^p λ_i (Q_i − 1).

The new statistic is called the scaled Bayes factor since

sBF = BF_null · Π_{i=1}^p e^{−λ_i/2} / √(1 − λ_i).   (5.3)

Each scaling component, e^{−λ_i/2}/√(1 − λ_i), is monotone increasing in λ_i. When

λi ↓ 0, the scaling goes to 1; when λi ↑ 1, the scaling goes to infinity.
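As a minimal sketch of the rescaling (5.3), the snippet below computes λ_i from the singular values of a design matrix under the independent normal prior V = σ^2 I (as in Proposition 2.13) and applies the scaling to a given log_10 BF_null. The design matrix, prior variance, and the value of log_10 BF_null are hypothetical inputs for illustration.

```python
import numpy as np

def scaling_eigenvalues(X, sigma2):
    """Eigenvalues lambda_i of H = X (X^t X + V^{-1})^{-1} X^t for the
    independent normal prior V = sigma2 * I, via the singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return d ** 2 / (d ** 2 + 1.0 / sigma2)

def log10_scaled_bf(log10_bf_null, lam):
    """Apply (5.3): sBF = BF_null * prod_i exp(-lam_i/2) / sqrt(1 - lam_i)."""
    correction = np.sum(-lam / 2.0 - 0.5 * np.log(1.0 - lam))
    return log10_bf_null + correction / np.log(10.0)

# toy usage: one standardized covariate, n = 1000, sigma = 0.2,
# and a hypothetical log10 BF_null of 3
rng = np.random.default_rng(2)
x = rng.standard_normal((1000, 1))
lam = scaling_eigenvalues(x, sigma2=0.2 ** 2)
print(lam, log10_scaled_bf(3.0, lam))
```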

[Figure 5.1 plot: y-axis log_10 sBF (BF), x-axis σ from 0.5 to 2.0; curves for n = 200, n = 500, n = 1000.]

Figure 5.1: The plot shows how BF_null and sBF change with σ in simple linear regression with P_BF = 10^{-6}. BF_null is in gray and sBF in black. BF_null and sBF are computed assuming the covariate has unit variance.

Let's focus on the independent normal prior, V = σ^2 I, where we have λ_i = d_i^2/(d_i^2 + σ^{-2}) (d_i is the i-th singular value of X). Figure 5.1 shows how the two Bayes factors change with the value of σ in simple linear regression. An arguable benefit of sBF is that the term log(1 − λ_i) has been removed. Therefore, as long as Q_i > 1, log sBF is monotone increasing in λ_i with the other λ's fixed. Recall our

result on the distribution of Q_i under the alternative, given in Proposition 2.14.

λ_i can in fact be viewed as a measure of power. To see this, consider the alternative β ∼ MVN(0, τ^{-1} σ^2 I), where

(1 − λ_i) Q_i ∼ i.i.d. χ_1^2.

If λ_i is larger, Q_i tends to be greater, which implies a smaller p-value. For the p-value of the likelihood ratio test, this is clear since P_LR is computed by comparing

Σ_i Q_i to the distribution function of χ_p^2. For the p-value of the Bayes factor, note that the ratios λ_1/λ_p, ..., λ_{p−1}/λ_p also have a small effect, but overall we can say that a greater Q_i often implies a smaller P_BF. For a fixed alternative model, it could be argued in a similar fashion that, for some given σ, λ_i measures the power in the direction of v_i, the i-th right-singular vector of X, by using Proposition 2.13. Hence, when faced with two tests with identical p-values that suggest the null should be rejected, the scaled Bayes factor tends to favor the one with the larger power, which itself might be a desirable property since the Bayes factor is often thought of as a link between power and significance [Sawcer, 2010, Stephens and

Balding, 2009] (but this property is missing in the unscaled Bayes factor). The

Bayes factor, together with its property E[BFnull | M0] = 1, of course has many advantages. However, just as a coin has two sides, we have to trade some nice

properties for some others.

5.2 An Application to Intraocular Pressure

GWAS Datasets

The scaled Bayes factor is now applied to analyze the IOP (intraocular pressure)

dataset. For details of this dataset, see Chap. 7.7.1. Age, sex, and 6 leading

principal components are regressed out from the raw phenotypes (the average

IOP of the two eyes). After quantile normalization, the residuals are used as the

phenotypes for single-SNP analysis. BF_null, sBF and P_BF are computed using prior σ = 0.2, which represents a prior belief of small but noticeable effect size [c.f.

Burton et al., 2007].

We first compared BFnull and sBF by minor allele frequency (MAF) bins.

Different MAF bins correspond to different bins of the informativeness (λ1) of

SNPs. Figure 5.2 shows that in each bin log10 sBF ∼ log10 BFnull is roughly parallel to the line y = x, and more importantly, the larger the MAF, the larger the

difference log10 sBF−log10 BFnull, as explained in (5.3). Another noticeable feature

is that the minimum value of sBF is larger than that of BFnull, because BFnull can

go to 0 while sBF is bounded below by e^{−λ_1/2}.

Figure 5.2: The distributions of log10 BF and log10 sBF by different bins of minor allele frequency (MAF). The bins are marked by color. In the left panel the diagonal line is y = x.

Next we examined the ranking of SNPs by different test statistics. Table 5.1

contains the top 20 SNPs in the ranking by BFnull. Rows are sorted according to SNP’s chromosome and position. Incidentally, the top 2 hits (rs7518099 and

SNP          Chr  Pos     MAF    bf(y)      bf(ỹ)   sbf(y)     sbf(ỹ)  p(y)       p(ỹ)
rs12120962   1    10.53   0.384  3.88 (5)   -0.90   4.56 (4)   -0.21   5.63 (5)   0.01
rs12127400   1    10.54   0.384  3.61 (9)   -0.90   4.29 (8)   -0.21   5.34 (9)   0.01
rs4656461    1    163.95  0.140  5.71 (2)   -0.57   6.26 (2)   -0.03   7.51 (2)   0.46
rs7411708    1    163.99  0.428  3.69 (8)   -0.68   4.38 (7)    0.01   5.43 (8)   0.52
rs10918276   1    163.99  0.427  3.59 (10)  -0.66   4.28 (9)    0.03   5.33 (10)  0.54
rs7518099    1    164.00  0.140  6.04 (1)   -0.61   6.58 (1)   -0.07   7.85 (1)   0.38
rs972237     2    125.89  0.119  3.05 (15)  -0.62   3.56 (17)  -0.11   4.65 (18)  0.31
rs2728034    3    2.72    0.090  3.80 (6)   -0.62   4.27 (10)  -0.15   5.45 (7)   0.22
rs7645716    3    46.31   0.254  3.34 (11)  -0.88   3.98 (11)  -0.16   5.03 (11)  0.21
rs7696626    4    8.73    0.023  2.96 (18)  -0.33   3.20 (42)  -0.01   4.70 (16)  0.31
rs301088     4    53.53   0.473  2.95 (20)  -0.81   3.64 (16)  -0.11   4.65 (17)  0.31
rs2025751    6    51.73   0.466  3.78 (7)   -0.75   4.47 (6)   -0.06   5.53 (6)   0.41
rs10757601   9    26.18   0.443  3.09 (13)  -0.79   3.78 (12)  -0.10   4.80 (13)  0.33
rs10506464   12   62.50   0.164  2.97 (17)  -0.75   3.54 (18)  -0.18   4.59 (19)  0.15
rs10778292   12   102.78  0.140  4.00 (4)   -0.75   4.54 (5)   -0.21   5.68 (4)   0.02
rs2576969    12   102.80  0.271  3.07 (14)  -0.85   3.71 (14)  -0.20   4.75 (14)  0.08
rs17034938   12   102.85  0.127  3.23 (12)  -0.71   3.75 (13)  -0.19   4.85 (12)  0.13
rs1288861    15   43.50   0.120  2.95 (19)  -0.45   3.46 (20)   0.06   4.54 (21)  0.59
rs4984577    15   93.76   0.367  3.02 (16)  -0.64   3.69 (15)   0.04   4.71 (15)  0.56
rs12150284   17   9.97    0.353  4.95 (3)   -0.75   5.63 (3)   -0.07   6.75 (3)   0.38

Table 5.1: Top 20 single-SNP associations by BF_null (σ = 0.2). Pos: genomic position in megabase pairs (reference HG18); bf(y): log_10 BF(y); bf(ỹ): log_10 BF(ỹ); sbf(y): log_10 sBF(y); sbf(ỹ): log_10 sBF(ỹ); p(y): −log_10 P_BF(y); p(ỹ): −log_10 P_BF(ỹ). The rankings by the three statistics are given in the parentheses. ỹ is obtained by permuting y once. SNP IDs are in bold if they are mentioned specifically in the main text.

rs4656461) are the same for all three test statistics. The rankings by the three statistics are largely similar to one another, particularly so for the rankings by BF_null and P_BF. There is, however, a noticeable exception, SNP rs7696626, whose ranking by sBF is much worse than its rankings by BF_null and P_BF. Not surprisingly, this SNP has the smallest MAF (0.023) among the 20 SNPs included in Table 5.1. We permuted the phenotypes once and recomputed the three test statistics. Let ỹ be the permuted phenotypes. log sBF(ỹ) is usually close to 0 whereas log BF_null(ỹ) is negative.

We also tried σ = 0.5 and found that, for most top signals in Table 5.1, BFnull

SNP          Chr  Pos     MAF    log10 BFnull   log10 sBF    −log10 PBF
rs12120962   1    10.53   0.384  3.549 (6)      4.624 (5)    5.628 (5)
rs12127400   1    10.54   0.384  3.271 (9)      4.346 (9)    5.339 (9)
rs4656461    1    163.95  0.140  5.494 (2)      6.424 (2)    7.507 (2)
rs7411708    1    163.99  0.428  3.360 (8)      4.438 (7)    5.434 (8)
rs10918276   1    163.99  0.427  3.258 (10)     4.336 (10)   5.328 (10)
rs7518099    1    164.00  0.140  5.829 (1)      6.758 (1)    7.852 (1)
rs972237     2    125.89  0.119  2.781 (14)     3.674 (17)   4.649 (18)
rs2728034    3    2.72    0.090  3.584 (5)      4.434 (8)    5.452 (7)
rs7645716    3    46.31   0.254  3.019 (12)     4.045 (11)   5.027 (11)
rs7696626    4    8.73    0.023  3.069 (11)     3.644 (18)   4.698 (16)
rs2025751    6    51.73   0.466  3.443 (7)      4.527 (6)    5.527 (6)
rs1081076    6    132.97  0.022  2.690 (19)     3.260 (44)   4.287 (36)
rs10757601   9    26.18   0.443  2.748 (15)     3.829 (13)   4.797 (13)
rs10778292   12   102.78  0.140  3.738 (4)      4.665 (4)    5.683 (4)
rs2576969    12   102.80  0.271  2.745 (16)     3.777 (14)   4.745 (14)
rs17034938   12   102.85  0.127  2.953 (13)     3.862 (12)   4.845 (12)
rs1955511    14   32.30   0.076  2.684 (20)     3.497 (25)   4.472 (25)
rs12150284   17   9.97    0.353  4.638 (3)      5.707 (3)    6.752 (3)
rs6017819    20   44.48   0.069  2.736 (17)     3.524 (24)   4.505 (23)
rs279728     20   44.51   0.087  2.704 (18)     3.541 (22)   4.515 (22)

Table 5.2: Top 20 single-SNP associations by BF_null (σ = 0.5). The "Pos" column gives the genomic position in megabase pairs (reference HG18). The rankings by the three statistics are given in the parentheses. SNP IDs are in bold if they are mentioned specifically in the main text.

becomes smaller and sBF grows larger, which is consistent with Fig. 5.1. Note that the p-value, P_BF, is not affected by the choice of σ. The rankings of the SNPs remained mostly unchanged. The result is provided in Table 5.2.

Lastly, although it was not our main objective, we examined the top hits in the association result. Our analysis reproduced three known genetic associations for IOP. Namely, the TMCO1 gene on chromosome 1 (163.9M-164.0M) which was reported in [van Koolwijk et al., 2012]; a single hit rs2025751 in the PKHD1 gene on chromosome 6 [Hysi et al., 2014]; and a single hit rs12150284 in the

GAS7 gene on chromosome 17 [Ozel et al., 2014]. A potentially novel finding is the gene PEX14 on chromosome 1. Two SNPs, rs12120962 and rs12127400,

have modest association signals. PEX14 encodes an essential component of the

peroxisomal import machinery. The protein interacts with the cytosolic receptor

for proteins containing a PTS1 peroxisomal targeting signal. Incidentally, PTS1

is known to elevate the intraocular pressure [Shepard et al., 2007]. In addition, a

mutation in PEX14 results in one form of Zellweger syndrome, and for children

who suffer from Zellweger syndrome, congenital glaucoma is a typical neonatal-

infantile presentation [Klouwer et al., 2015].

5.3 Scaled Bayes Factors in Variable Selection

5.3.1 Calibrating the Scaling Factors

Consider the Bayesian variable selection regression (BVSR) model described in

Chap. 4.2. Apparently it is not wise to directly plug in the scaled Bayes factor

defined in (5.2) since it is no longer consistent (see Friedman et al. [2001, Chap. 7.7]

for the meaning of consistency). For example, suppose the effect size σ (or the

heritability h) is given and y is generated from a linear regression model with causal SNPs Xγ with rank(Xγ ) = |γ|. We now need to choose between two

models X_γ and X_{γ'}, the latter of which is given by X_{γ'} = [X_γ, x*], where
X_γ^t x* = 0. Then by our asymptotic result in Chap. 2,

2 log BF_null(X_{γ'}) − 2 log BF_null(X_γ) = λ* Q* + log(1 − λ*),

where Q* ∼ χ_1^2. If we have an infinitely large sample size, then λ* goes to 1 and

thus we would choose model Xγ with probability one. However, for the scaled

Bayes factor, we have

2 log sBF(X_{γ'}) − 2 log sBF(X_γ) = λ* Q* − λ*.

Therefore, even if λ* = 1, the wrong model would still have a positive posterior probability. Note that in BVSR we also need to compute the prior probabilities of γ and γ'; however, the difference, log p(γ') − log p(γ), is bounded and thus does not affect our conclusion for this simple example. The key message is that the scaled

Bayes factor given in (5.2) does not produce a sufficiently large penalty on the model complexity.

To solve this problem, recall that our motivation for sBF was just to introduce a scaling factor to BFnull such that

log sBF = log BFnull − E[log BFnull | M0] + log C,

where C does not depend on X given the model size. For the BVSR model,

consider letting C = C(|γ|, σ^2). Since BF_null = p(y|M)/p(y|M_0), we have

sBF = C(|γ|, σ^2) exp(−E[log BF_null | M_0]) p(y|M) / p(y|M_0).

To make sBF suitable for variable selection, it suffices to require

E[ C(p, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]) | |γ| = p ] = 1,   (5.4)

where the inner expectation is with respect to y and the outer is with respect to

γ. This condition immediately leads to

E [ E[sBF | M0] | |γ| = p] = 1.

To see why this works for BVSR, recall that in the original BVSR model, for γ

the following prior is used,

p(γ | π) = π^{|γ|} (1 − π)^{N−|γ|},

which may be further decomposed to

p(γ | π) = p(γ, |γ| | π) = p(|γ| | π)p(γ | |γ|, π).

Implicitly we have assumed

p(γ | |γ|, π) = p(γ | |γ|) = |γ|! (N − |γ|)! / N!,   (5.5)
where N is the total number of candidate SNPs. Now, by using sBF, we change the prior to

p(γ | |γ|, π) = C(|γ|, σ^2) exp(−E[log BF_null(X_γ, σ^2) | M_0]) · |γ|! (N − |γ|)! / N!,

which by condition (5.4) is a valid probability mass function. As the information in

the data accumulates, the choice of prior no longer has an influence on the posterior

and thus the scaled Bayes factor is consistent just like the Bayes factor. This is

actually a result of the Bernstein–von Mises theorem [der Vaart, 2000, Chap. 10.2].

Note that C = C(|γ|, σ^2) is chosen instead of C = C(|γ|) since otherwise the

integration over σ2 would be difficult.

The expectation given in (5.4) can be computed as follows. Using our asymp-

totic result, we can write

" p # 1 Y e−λi/2 = E √ , C(p, σ2) 1 − λ i=1 i

where the expectation is with respect to the distribution of λ = (λ_1, ..., λ_p) induced from the uniform conditional distribution of γ given in (5.5). This integral

can be approximated by

" p # Y e−λi/2  e−λ/2  log E √ ≈ p log E √ , (5.6) 1 − λ i=1 i 1 − λ where the expectation on the r.h.s. is with respect to the mixture distribution of

λ1, . . . , λp. In the BVSR model, according to Proposition 2.13, we have

λ_i = d_i^2 / (d_i^2 + 1/σ^2),

where d_i is the i-th singular value of X_γ. Let η = d^2 denote an arbitrary eigenvalue

of X_γ^t X_γ. Its distribution is easy to characterize by sampling and in fact changes very little with the choice of |γ|. Let's also define φ = σ^2 (to simplify the following

expressions) and write

g(η) = e^{−ηφ/[2(ηφ+1)]} √(1 + ηφ) = e^{−λ/2} / √(1 − λ).

By Taylor expansion, we have

E[g(η)] = Σ_k (1/k!) E[(η − E[η])^k] g^{(k)}(E[η]).

In our implementation, a fourth-order approximation is used. The derivatives of

g are given by

g''(η) = e^{−ηφ/[2(ηφ+1)]} / [4(ηφ + 1)^{7/2}] · (2φ^2 − η^2 φ^4),
g^{(3)}(η) = e^{−ηφ/[2(ηφ+1)]} / [8(ηφ + 1)^{11/2}] · (−16φ^3 − 18ηφ^4 + 3η^3 φ^6),
g^{(4)}(η) = e^{−ηφ/[2(ηφ+1)]} / [16(ηφ + 1)^{15/2}] · (156φ^4 + 320ηφ^5 + 180η^2 φ^6 − 15η^4 φ^8).
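The following is a minimal sketch of the fourth-order Taylor approximation to the scaling factor (r.h.s. of (5.6)). For brevity it evaluates the derivatives of g by central finite differences rather than the closed-form expressions above, and the eigenvalue sample is a hypothetical gamma-distributed spectrum rather than eigenvalues from the real datasets.

```python
import numpy as np

def g(eta, phi):
    """g(eta) = exp(-eta*phi / (2*(eta*phi + 1))) * sqrt(1 + eta*phi)."""
    return np.exp(-eta * phi / (2.0 * (eta * phi + 1.0))) * np.sqrt(1.0 + eta * phi)

def taylor_log_scaling(eta_samples, p, sigma2, h=1e-3):
    """Fourth-order Taylor approximation to p * log E[g(eta)] (r.h.s. of (5.6)).
    Derivatives of g are obtained by central finite differences as a shortcut."""
    phi = sigma2
    m = eta_samples.mean()
    cm = [np.mean((eta_samples - m) ** k) for k in (2, 3, 4)]     # central moments
    step = h * max(abs(m), 1.0)
    grid = g(m + step * np.arange(-2, 3), phi)                    # g at m-2h, ..., m+2h
    d2 = (grid[3] - 2 * grid[2] + grid[1]) / step ** 2
    d3 = (grid[4] - 2 * grid[3] + 2 * grid[1] - grid[0]) / (2 * step ** 3)
    d4 = (grid[4] - 4 * grid[3] + 6 * grid[2] - 4 * grid[1] + grid[0]) / step ** 4
    approx = g(m, phi) + cm[0] * d2 / 2 + cm[1] * d3 / 6 + cm[2] * d4 / 24
    return p * np.log(approx)

# toy usage: hypothetical eigenvalue spectrum of X_gamma^t X_gamma, sigma = 0.2
rng = np.random.default_rng(3)
eta = rng.gamma(shape=5.0, scale=200.0, size=100_000)
print(taylor_log_scaling(eta, p=10, sigma2=0.2 ** 2))
```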

The performance of our method is checked by simulation. The Height-10K

and Height-chr6 datasets described in Chap. 4.4 are used. For each given |γ|, γ

is sampled until we have collected 100K singular values of Xγ . Then the ideal scaling factor (l.h.s. of (5.6)) and the Taylor series approximation to the r.h.s.

of (5.6) using σ = 0.2 are computed. The results given in Table 5.3 demonstrate

that our method is accurate enough.

         Height-10K               Height-chr6
|γ|      A_{|γ|}   Â_{|γ|}        A_{|γ|}   Â_{|γ|}
1        1.505     1.505          1.504     1.505
10       15.06     15.06          15.02     15.03
50       75.08     75.20          74.72     75.00
100      150.3     150.4          148.6     149.6
200      297.5     299.8          293.3     297.2

Table 5.3: Taylor series approximations for the ideal scaling factors with σ = 0.2. A_{|γ|} = log E[Π e^{−λ_i/2}/√(1 − λ_i)] denotes the ideal scaling factor; Â_{|γ|} = p log E[e^{−λ/2}/√(1 − λ)], where the expectation is evaluated using the Taylor expansion.

5.3.2 Prediction Properties

A natural question is how the Bayes factor and the scaled Bayes factor differ in prediction. The answer would be very complicated and depends on the specific problem. To gain some insight, let's consider a toy example. Suppose we have two SNPs, x_1 and x_2 (centered). We need to select between two models (σ^2 is given):

(1) M_1: y = β_1 x_1 + ε,   β_1 ∼ N(0, σ^2),   ε ∼ MVN(0, τ^{-1} I);

(2) M_2: y = β_2 x_2 + ε,   β_2 ∼ N(0, σ^2),   ε ∼ MVN(0, τ^{-1} I).

Let s_i = x_i^t x_i (i = 1, 2) and s_12 = x_1^t x_2. For the i-th model, the posterior for β_i and the Bayes factor are given by

β_i | M_i, τ ∼ N( x_i^t y / (s_i + σ^{-2}), τ^{-1} / (s_i + σ^{-2}) );
BF_null(x_i) = (σ^2 s_i + 1)^{-1/2} [ 1 − (x_i^t y)^2 / (y^t y (s_i + σ^{-2})) ]^{-n/2}.

˜ ˜ ˜ To avoid confusion let β = (β1, β2) be the true (realized) value of β. Then by (4.4.3), the mean squared prediction error (MSPE) of some estimate βˆ is

MSPE(β̂) = (1/n){ (n + 1)τ^{-1} + s_1(β̃_1 − β̂_1)^2 + s_2(β̃_2 − β̂_2)^2 + 2 s_12 (β̃_1 − β̂_1)(β̃_2 − β̂_2) }.   (5.7)

Using y ∼ MVN(Xβ̃, τ^{-1} I), we can define

Z_i := τ^{1/2} s_i^{-1/2} x_i^t y ∼ N( τ^{1/2} { s_i^{1/2} β̃_i + s_i^{-1/2} s_12 β̃_{3−i} }, 1 ),   (5.8)

and we can show

x_i^t y / (s_i + σ^{-2}) = τ^{-1/2} s_i^{1/2} Z_i / (s_i + σ^{-2}) ∼ N( (s_i β̃_i + s_12 β̃_{3−i}) / (s_i + σ^{-2}), τ^{-1} s_i / (s_i + σ^{-2})^2 ).

Assume n is sufficiently large. Then we have

x_i^t y / (s_i + σ^{-2}) ≈ τ^{-1/2} s_i^{-1/2} Z_i ∼ N( β̃_i + s_12 s_i^{-1} β̃_{3−i}, τ^{-1} s_i^{-1} ),

and by Theorem 2.7,

2 log BF_i = [s_i σ^2 / (s_i σ^2 + 1)] Z_i^2 − log(s_i σ^2 + 1),
2 log sBF_i = [s_i σ^2 / (s_i σ^2 + 1)] (Z_i^2 − 1),

where we have used the abbreviation BFi = BFnull(xi). Since the two models have the same dimension, we don’t need to compute the ideal scaling factor for sBF

and the posterior probability of each model (denoted by wi) is given by

w_i^B = BF_i / (BF_1 + BF_2),   w_i^S = sBF_i / (sBF_1 + sBF_2).

By the model averaging principle, the Bayesian variable selection estimate, de-

noted by β̂, is

β̂ = ( w_1 x_1^t y / (s_1 + σ^{-2}),  w_2 x_2^t y / (s_2 + σ^{-2}) ).

Plugging the expressions for β̂ and w_i into (5.7) and using (5.8), we can calculate, albeit only numerically, the expected MSPE of a Bayesian procedure. To obtain some

analytic conclusions, let's set Z_1, Z_2 to their expected values and let

ρ := s_12 / √(s_1 s_2)

be the correlation between x1 and x2. Then for the three terms on the r.h.s. of (5.7) we have

s_1(β̃_1 − β̂_1)^2 = s_1 {β̃_1 − w_1(β̃_1 + ρ √(s_2/s_1) β̃_2)}^2,
s_2(β̃_2 − β̂_2)^2 = s_2 {β̃_2 − w_2(β̃_2 + ρ √(s_1/s_2) β̃_1)}^2,
2 s_12 (β̃_1 − β̂_1)(β̃_2 − β̂_2) = 2ρ √(s_1 s_2) {β̃_1 − w_1(β̃_1 + ρ √(s_2/s_1) β̃_2)} {β̃_2 − w_2(β̃_2 + ρ √(s_1/s_2) β̃_1)}.

Using w1 + w2 = 1, we obtain

MSPE(β̂) = (1/n)[ (n + 1)τ^{-1} + (1 − ρ^2){ s_1 β̃_1^2 (1 − w_1)^2 + s_2 β̃_2^2 w_1^2 − 2ρ √(s_1 s_2) β̃_1 β̃_2 w_1 (1 − w_1) } ].

By differentiating, we find that the optimal value for w_1 is

w_1^* = ( s_1 β̃_1^2 + ρ √(s_1 s_2) β̃_1 β̃_2 ) / ( s_1 β̃_1^2 + s_2 β̃_2^2 + 2ρ √(s_1 s_2) β̃_1 β̃_2 ).   (5.9)

In some special circumstances, enlightening conclusions can be drawn.

(1) β̃_2 = 0. Then w_1^* = 1, which means that we should simply choose M_1, consistent with our intuition. Suppose the sample size n grows to infinity.

Then both 2 log(BF_1/BF_2) and 2 log(sBF_1/sBF_2) grow at rate O((1 − ρ^2) n).

Hence both statistics would yield w1 = 1 when n is sufficiently large.

(2) s_1 β̃_1^2 = s_2 β̃_2^2 and s_1 > s_2. Then w_1^* = 1/2. In this case Z_1^2 = Z_2^2 and thus the p-values of the two models would be equal. According to the previous discussion,

we have sBF_1 > sBF_2. But since 2 log sBF takes the form λ(Z^2 − 1), where

λ → 1 as n → ∞, for large sample sizes we have sBF_1 ≈ sBF_2 and thus w_1 ≈ 1/2, which is optimal. However, BF_1/BF_2 → √(s_2/s_1). Hence in this case the scaled Bayes factor is more advantageous.

Another useful observation is that if s_1 β̃_1^2 > s_2 β̃_2^2, we have w_1^* > 1/2. Therefore, if the true effect sizes β̃_1 and β̃_2 are equal in absolute value, we should favor the SNP with the larger variance; if the two SNPs have the same variance, we should favor the SNP with the larger effect size (in absolute value).
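A minimal numerical sketch of the toy example is given below. It evaluates the model weights implied by BF and sBF (with Z_1, Z_2 set to their expected values and the large-n approximations above) against the optimal weight (5.9). The values of s_1, s_2, ρ, β̃_1, β̃_2, and σ are hypothetical and chosen to illustrate case (2), where s_1 β̃_1^2 = s_2 β̃_2^2 with s_1 > s_2.

```python
import numpy as np

def toy_weights(s1, s2, rho, b1, b2, sigma2, tau=1.0):
    """Two-SNP toy example: w_1 under BF and sBF versus the optimal w_1^* in (5.9)."""
    s12 = rho * np.sqrt(s1 * s2)
    # expected values of Z_i from (5.8)
    z1 = np.sqrt(tau) * (np.sqrt(s1) * b1 + s12 * b2 / np.sqrt(s1))
    z2 = np.sqrt(tau) * (np.sqrt(s2) * b2 + s12 * b1 / np.sqrt(s2))
    def log_bf(s, z):
        lam = s * sigma2 / (s * sigma2 + 1.0)
        return 0.5 * (lam * z ** 2 - np.log(s * sigma2 + 1.0))
    def log_sbf(s, z):
        lam = s * sigma2 / (s * sigma2 + 1.0)
        return 0.5 * lam * (z ** 2 - 1.0)
    w1_bf = 1.0 / (1.0 + np.exp(log_bf(s2, z2) - log_bf(s1, z1)))
    w1_sbf = 1.0 / (1.0 + np.exp(log_sbf(s2, z2) - log_sbf(s1, z1)))
    w1_opt = (s1 * b1 ** 2 + s12 * b1 * b2) / (s1 * b1 ** 2 + s2 * b2 ** 2 + 2 * s12 * b1 * b2)
    return w1_bf, w1_sbf, w1_opt

# case (2): s1*b1^2 = s2*b2^2 and s1 > s2, so the optimal weight is 1/2;
# the sBF weight is close to 1/2 while the BF weight is pulled below 1/2
print(toy_weights(s1=4000.0, s2=1000.0, rho=0.0, b1=0.05, b2=0.1, sigma2=0.04))
```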

5.4 Simulation Studies for Variable Selection

The experiments done in Chap. 4.4 can also be performed using the scaled Bayes factor. The goal of this section is to show that our method is valid and that sBF works at least as well as BF. Permutation is used to compute E[log BF_null | M_0]. The scaling factor is then calibrated by computing C(|γ|, σ^2), using the method described in Chap. 5.3.1. Only the Height-ind and Height-chr6 datasets are used, and we run MCMC with 20K burn-in iterations, 100K sampling iterations and

Rao-Blackwellization every 1000 iterations (this was the setting (a) in Chap. 4.4).

Figure 5.3 shows the results for heritability estimation with different numbers of permutations. As a larger number of permutations reduces the estimation variance for E[log BF_null | M_0], the heritability is most accurately estimated when we permute 20 times. Moreover, using the mean absolute error metric, when we permute 20 times the scaled Bayes factor produces more accurate results than the Bayes factor (Table 5.4). It may be a little surprising that the result for sBF with only 1 permutation also looks good enough. This can be explained using our asymptotic results. Recall from Theorem 2.7 that asymptotically 2 log BF_null =

Mean absolute error    BF       sBF: number of permutations
                                1        2        5        10       20
Height-ind             0.0217   0.0434   0.0343   0.0260   0.0221   0.0203
Height-chr6            0.0271   0.0628   0.0504   0.0338   0.0266   0.0242

Table 5.4: Mean absolute error (MAE) of the heritability estimation using the scaled Bayes factor. The MAE is calculated as the mean of the absolute differences between the true and the estimated heritabilities. See Fig. 5.3 for more related results.

Σ_i λ_i Q_i + log(1 − λ_i), where under the null the Q_i are i.i.d. χ_1^2. Hence, by direct calculation,

E[ BF_null^{-1} | M_0 ] = Π_i 1 / √(1 − λ_i^2).

By the reasoning of pseudo-marginal MCMC methods [Andrieu and Roberts,

2009], the MCMC with 1 permutation converges to a stationary distribution in-

duced by a (scaled) Bayes factor (denoted by sBF^{(1)}) with the form

log sBF^{(1)} = (1/2) Σ_{i=1}^p (λ_i Q_i − log(1 + λ_i)) + log C(p, σ^2),

where Qi was defined in Theorem 2.7. The term − log(1 + λi) is close to the true

penalty term in sBF, which is −λi, compared with the original penalty log(1−λi).

The posterior inclusion probabilities can still be improved by Rao-Blackwellization, via the method proposed in (4.9). Again, the results for the scaled Bayes factor are very similar to those for the Bayes factor (Fig. 5.4). For prediction, the scaled Bayes factor is also as good as the Bayes factor (Fig. 5.5). In fact, for the Height-chr6 dataset, the scaled Bayes factor is better when the true heritability is large.

[Figure 5.3 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.3: Heritability estimation with sBF in the Height-ind (first row) and Height-chr6 (second row) datasets. The first three columns correspond to different numbers of permutations used in computing the null expectation of BF. The BVSR results for BF are given in the last column for comparison. The grey bars represent 95% posterior intervals.

[Figure 5.4 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.4: Calibration of the Rao-Blackwellized posterior inclusion probabilities for sBF. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The SNPs are divided into 20 bins by RB-PIP and the y-axis represents the proportion of true positives in each bin. The x-axis is the predicted proportion of true positives. The grey bars represent ±2SD.

[Figure 5.5 panel columns: 1 permutation, 5 permutations, 20 permutations, BF_null.]

Figure 5.5: Relative prediction gain of the scaled Bayes factor for the simulated datasets with heritability ≥ 0.05. The first row corresponds to the Height-ind dataset and the second to the Height-chr6 dataset. The first three columns correspond to BVSR with sBF and different numbers of permutations for computing the null expectation of BF, and the last column is the BVSR result obtained using BF. The RPG is computed by (4.10) and (4.11). In the legends, BVSR stands for the crude regression estimates from fastBVSR; BVSR-RB represents the Rao-Blackwellized estimates from fastBVSR; GCTA stands for the BLUPs of the linear mixed model output by GCTA.

Chapter 6

Summary and Future Directions

6.1 Summary of This Work

Bayesian linear regression has a wide application in genetics. Two important examples are association testing and variable (causal SNP) selection. This work studied both the theoretical and computational aspects of Bayesian linear regression, and provided examples of applications to genome-wide studies. We started from the characterization of the null distribution of the Bayes factor given in (1.3).

Under the null,

2 log BF_null = Σ_{i=1}^p (λ_i Q_i + log(1 − λ_i)) + o_p(1),   (6.1)

where Q_1, ..., Q_p are i.i.d. χ_1^2 random variables, λ_1, ..., λ_p are weights between 0 and 1, and o_p(1) is an error term that vanishes in probability. Under the alternative, assuming some conditions that guarantee the error vanishes, we still have the same asymptotic form for 2 log BF_null, but Q_1, ..., Q_p become noncentral chi-squared random variables. The proof was given in Chap. 2.1. An immediate impact is on the calculation of p-values for Bayesian methods in GWAS. Due to the burden of

multiple testing, the significance threshold in GWAS is very small, typically 5 × 10^{-8}. Consequently, the permutation approach to computing p-values associated with Bayes factors is not feasible. Using the asymptotic result (6.1), such p-values can be analytically computed. In Chap. 2.2 the behaviour of the Bayes factor and its associated p-value was discussed. The computation of p-values requires

the evaluation of the distribution function of a weighted sum of independent χ_1^2 random variables. To overcome this, we implemented in C++ a recent polynomial method of Bausch [2013], which appears to be the most efficient solution so far.

A striking feature of our implementation is that even extremely small p-values can be accurately computed. Besides, arbitrary precision is attainable and strict error bounds are provided. More details were given in Chap. 2.3.2 and 2.3.3.

Simulation studies (see Chap. 2.3.4) showed that the p-values computed using our asymptotic result have very good calibration, even at the tail, i.e., when the associated Bayes factor is very large.
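The sketch below is not the Bausch [2013] polynomial method implemented in BACH; it is only a simple Monte Carlo check of the tail probability implied by (6.1), usable for moderate p-values. The weights λ_i and the observed value of 2 log BF_null are hypothetical inputs.

```python
import numpy as np

def p_value_monte_carlo(two_log_bf, lam, n_sim=1_000_000, seed=0):
    """Monte Carlo estimate of P(sum_i lam_i*Q_i + log(1-lam_i) >= observed),
    with Q_i ~ chi^2_1, i.e. the null distribution in (6.1). Only a sanity
    check; it cannot reach the extreme tail that BACH handles analytically."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    q = rng.chisquare(df=1, size=(n_sim, lam.size))
    null_draws = q @ lam + np.sum(np.log1p(-lam))
    return np.mean(null_draws >= two_log_bf)

# toy usage with hypothetical weights and an observed 2*log(BF_null) of 12
print(p_value_monte_carlo(two_log_bf=12.0, lam=[0.9, 0.8, 0.5]))
```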

The expression of the Bayes factor (Eq. (1.3)) contains the term (X^t X + V^{-1})^{-1} X^t y, the posterior mean of the regression coefficient (also the maximum a posteriori estimator). It is often computed via the Cholesky decomposition of (X^t X + V^{-1}), which has cubic complexity in p (the number of columns of X) and is thus extremely slow for large p. A novel iterative method, called ICF (iterative solutions using complex factorization), another major contribution of this work, was proposed in

Chap. 3. Simulation (Chap. 3.5) shows that, when ICF is applicable, it is much better than the Cholesky decomposition and other iterative methods such as the Gauss-Seidel algorithm. The only limitation of ICF is that it relies on the availability of the Cholesky decomposition of X^t X. Fortunately, in the MCMC sampling of a Bayesian variable selection procedure, such decompositions are often easy to obtain by efficient updating algorithms. We studied the BVSR (Bayesian variable selection regression) model proposed by Guan and Stephens [2011] (see Chap. 4.2),

which turned out to fit well with the ICF algorithm. Our new MCMC algorithm for the inference of the BVSR model was described in great detail in Chap. 4.3. Apart from the ICF algorithm, the exchange algorithm proposed by Murray et al. [2012] was employed to bypass the calculation of matrix determinants. Simulation studies

(Chap. 4.4) showed that the new algorithm can efficiently estimate the heritability of a quantitative trait and report well-calibrated posterior inclusion probabilities. Furthermore, compared with another popular software package, GCTA (see

Chap. 7.5), it has much better performance in prediction (Chap. 4.4.3).

The last, but by no means the least, novelty of this work is a new statistic called the scaled Bayes factor (Chap. 5). It was motivated by the null distribution of the

Bayes factor given in (6.1). See Chap. 5.1 for its practical and theoretical benefits.

Chap. 5.2 gave an application of sBF to whole-genome single-SNP analysis of a real

GWAS dataset on intraocular pressure. Some known associations were replicated and a potentially novel finding, PEX14, was described. The scaled Bayes factor could also be used for variable selection. The method is a little more complicated since the scaling factor of sBF requires further calibration with the data (see

Chap. 5.3.1). Simulation studies in Chap. 5.4 demonstrated that sBF performs at least as well as the unscaled Bayes factor.

6.2 Specific Aims for Future Studies

6.2.1 Bayesian Association Tests Based on Haplotype or Local Ancestry

Background The asymptotic distribution of the Bayes factor in Bayesian linear regression was studied in Chap. 2 and a software package was provided for

computing the p-values associated with the Bayes factors. A GWAS application

of this result was given in Chap. 5.2. Nevertheless, in that GWAS data analysis

we only considered the single-SNP analysis. By Proposition 2.12, in simple linear

regression, P_BF (the p-value associated with the Bayes factor) is asymptotically equal to P_LR (the p-value of the likelihood ratio test). Hence the use and the importance of P_BF were not emphasized. For multi-locus association testing, P_BF may behave very differently from the p-values of the traditional tests (P_LR and

P_F). The calculation of P_BF would also be much harder, but could be done efficiently by our program BACH.

Typical examples of multi-locus testing include the sequence kernel association test of Ionita-Laza et al. [2013], the semiparametric regression test with the least-squares kernel machine of Kwee et al. [2008] and some other tests based on pooling rare variants. For these methods, the test statistic is distributed as a weighted sum of independent chi-squared random variables and thus our program

BACH could be applied.

Here we consider two new ideas. First, we may perform multi-locus association testing using haplotypes. Based on linkage disequilibrium, chromosomes can be divided into smaller haplotype blocks such that in each block the distributions of the SNPs are highly dependent and thus the combinations of these

SNPs form specific patterns. The inference of haplotypes from genotypes is called phasing (see Chap. 1.1.1 for more information). To test the association between the phenotype and a haplotype block, we build a multi-linear regression model using both the SNPs in that block and the haplotypes (represented by dummy variables). From a genetic perspective, such tests are appealing. In single-SNP analysis, scientists are much more interested in the region surrounding an association signal than in the signal itself. A haplotype block usually lies in a

specific region, e.g. a gene, and thus represents direct biological interest. From a statistical perspective, haplotype blocks contain more information than single

SNPs and the total number of tests becomes much smaller, resulting in a much milder multiple-testing correction. Therefore such tests tend to be more powerful than single-SNP analysis. Another example is the association test based on local ancestry. For an admixed population, it is often beneficial to infer the ancestral proportions at each locus. For example, Mexicans are an admixed population with three ancestral populations: European, Native American, and African. The local ancestry analysis of Mexican samples revealed a strong selection signal at the MHC region [Zhou et al., 2016]. Programs for local ancestry inference include ELAI [Xu and Guan, 2014], RFMix [Maples et al., 2013], LAMP-LD [Baran et al., 2012], etc. ELAI uses a two-layer model where the lower layer (the more recent layer) typically contains 10 to 20 clusters of ancestral haplotypes. Hence, just like the previous example, for each SNP we may use the genotype together with the local ancestry (represented by dummy variables) as the regressors and fit a Bayesian multi-linear regression model. The Bayes factor and its associated p-value are then used to detect significant associations.

Methods and Materials For the haplotype association testing, we may simply use the IOP dataset which was analyzed in Chap. 5.2. There are two ways to define the haplotype blocks. First, we can simply use a fixed block size. This is also the strategy used by most phasing and local ancestry inference programs, for example IMPUTE2 and LAMP-LD. Second, we may define the blocks according to biological function or the degree of linkage disequilibrium. For example, the SNPs located within the same gene (including the upstream and downstream regions that may contain promoters) should be grouped into blocks. We can also compute the linkage disequilibrium and group the SNPs such that in each block

LD is greater than some threshold. Then for each block, we perform a Bayesian

multi-linear regression and compute the Bayes factor and its associated p-value.

We write the model as

y = Xβ + Hu + ε,

ε_i | τ ∼ i.i.d. N(0, 1/τ),
β | τ, V_β ∼ MVN(0, τ^{-1} V_β),   (6.2)
u | τ, V_u ∼ MVN(0, τ^{-1} V_u),
p(τ) ∝ 1/τ,
where X represents the genotypes of the SNPs in the block and H represents the haplotypes. If there are k different haplotypes in the block, then H has k − 1 columns (dummy variables). H_ij = 1 means the i-th subject has the j-th haplotype. Note that unlike traditional dummy variables, the entries of

H_ij do not have to be integers. For example, if haplotypes are inferred from a Bayesian procedure, we may obtain the posterior probability for each possible haplotype. (Of course, we can also compute the Bayes factor for every possible inference of H and compute a weighted average over these Bayes factors.)

For the association testing using local ancestry, we need a dataset from an admixed population. One candidate is a dbGaP dataset, the Mexican hypertriglyceridemia study (accession number: phs000618.v1.p1), which contains 4,350 case and control samples [Weissglas-Volkov et al., 2013]. The dataset contains HDL (high-density lipoprotein) measurements which can be used as the phenotype for our analysis. At each SNP locus, we first infer the local ancestry and then use it to build a multi-linear regression model for association testing. The model has the same form as (6.2). However, X has only one column and H includes the dummy variables that represent the local ancestry. The local ancestry inference program, ELAI [Xu and Guan, 2014], can output the probabilistic estimates for

every ancestral population.

An important advantage of Bayesian analysis is the model flexibility. We may

combine the two design matrices in (6.2), X and H, and use V = diag(V_β, V_u) to represent the prior covariance matrix. Although in Chap. 2 only two special choices for V were discussed (the independent normal prior and the g-prior), V can actually be any positive definite matrix. For both association testing methods, how to choose an appropriate V would be a critical problem. When V_u = 0, the test reduces to the ordinary multi-locus test. Clearly, we would like to specify different effect sizes for V_β and V_u. To average over prior uncertainties, we may try different priors and then average over the Bayes factors.

Simulation studies can be performed to compare the performance of our methods with other methods, including the non-Bayesian multi-linear regression tests and other haplotype-based methods, for example the haplotype-sharing method of Nolte et al. [2007]. Besides, such studies can also be used to compare different phasing and local ancestry inference programs. For example, LAMP-LD and

RFMix use a one-layer model to infer the local ancestry and thus, for Mexican samples, we only have two dummy variables. For ELAI, in contrast, we can have more than ten regressors due to the two-layer modelling (the upper layer only contains three ancestral populations but the lower layer can contain 10 to 20 haplotype groups). It would be interesting to know whether the two modelling methods produce different association testing results. An existing software package for simulating admixed populations is cosi2 [Shlyakhter et al., 2014].

6.2.2 Application of ICF to Variational Methods for Variable Selection

Background The BVSR method described in Chap. 4 is quite powerful when the dataset has only a moderate number of SNPs. But for large datasets that contain more than 100K SNPs, 1000 of which are causal, BVSR has a clear tendency to underestimate the heritability (see the simulation results for the Height-5C dataset in Chap. 4.4). A real example is the inference of the heritability of height using the Height dataset. The GCTA estimate for the heritability is 0.44 [Yang et al., 2010]. In contrast, the BVSR estimate for PVE (proportion of variance explained) is only 0.15, as reported in Zhou et al. [2013]. Our new implementation, fastBVSR, has also been tried but the heritability estimate is still around

0.2 (result not shown in this work). One Bayesian approach to solve this problem is to use the BSLMM (Bayesian Sparse Linear Mixed Model) model of Zhou et al.

[2013]. BSLMM assumes the following prior for β:

β_i ∼ i.i.d. π N(0, (σ_a^2 + σ_b^2)/τ) + (1 − π) N(0, σ_b^2/τ).

For comparison, the prior for β in BVSR could be written as

β_i ∼ i.i.d. π N(0, σ^2/τ) + (1 − π) δ_0.

Hence, BSLMM essentially assumes that every SNP has a contribution to the phenotype but, for most of them, the effect is tiny. The variable selection of

BSLMM aims to identify the SNPs with relatively large effects. The rationale of

BSLMM can be seen as a mixture of BVSR and GCTA. Using this method, the

PVE estimate for the Height dataset is 0.41 [Zhou et al., 2013]. Our algorithm for computing ridge estimators, ICF, which was described in Chap. 3, could be applied

to BSLMM, and a substantial improvement in running speed is expected. Due to the similarity between BVSR and BSLMM, the implementation of BSLMM using ICF is easy.

For the Height dataset, the failure of BVSR to produce a heritability estimate comparable to that of GCTA is not necessarily caused by the model specification.

It might simply be due to computational limitations. The posterior inference for BVSR is made via MCMC, but the Height dataset contains about 300K SNPs, which implies that it is almost impossible for the MCMC to converge within a few million iterations. To make things worse, the number of causal SNPs is very large, probably much greater than 1000. Hence it is entirely likely that there exist models with large posterior probabilities and model size greater than 1000, but BVSR cannot find them. In fact, for any problem with so many potential predictors (and "true" predictors), MCMC becomes much less reliable. Note that

BSLMM effectively makes the model size much smaller and, for traits like height, the heritability estimation largely depends on the estimation of the parameter σb. In Zhou et al. [2013], it is reported that the proportion of variance explained by the sparse effects is only 0.12 for height.

Another Bayesian strategy to solve this problem that does not use MCMC is the use of variational methods, which shall be the focus of our second aim. See Jordan et al. [1999], Bishop [2006], and Grimmer [2011], among others, for an introduction.

The idea of variational inference was briefly explained in Chap. 4.1. To be more specific, consider the following approximating form for the posterior distribution of (β, γ),

q(β, γ) = Π_{j=1}^N {φ_j f_j(β_j)}^{γ_j} {(1 − φ_j) δ_0(β_j)}^{1−γ_j},   (6.3)
where N is the total number of SNPs and γ_j is the variable indicating whether

the SNP is included in the model. By integrating out β_j, it can be seen that

φj = P(γj = 1) is actually the posterior inclusion probability (PIP). fj is some distribution to be estimated and we restrict it to be normal. Note that (6.3) cannot be the true posterior because we have assumed the posterior independence between the SNPs! The variational inference aims to find an approximation with form (6.3) that minimizes the Kullback-Leibler divergence between q and the true

posterior,

KL(q(β, γ) ‖ p(β, γ | y)) = ∫ q(β, γ) log [ q(β, γ) / p(β, γ | y) ] dβ dγ.   (6.4)

The search for such an optimal approximating distribution is done by some deterministic algorithm and thus requires much less computation than the MCMC

approach. Such algorithms are usually iterative and conceptually resemble the

EM (expectation-maximization) algorithm.

Methods A potential application of our ICF algorithm is a very recent variational method proposed by Huang et al. [2016]. It is based on an earlier variational

algorithm of Carbonetto and Stephens [2012], which was shown in the paper to be

able to produce accurate estimates for the hyperparameters under a wide range of

settings, although the individual PIP estimate was often off. The method of Huang

et al. [2016] could produce more accurate estimates for PIPs and has better convergence properties. The consistency was proved for an exponentially growing (w.r.t.

the sample size) number of covariates.

Recall the BVSR model specified by (4.2).

y | γ, β, X, τ ∼ MVN(X_γ β_γ, τ^{-1} I),
γ_j ∼ Bernoulli(π),   (6.5)
β_j | γ_j = 1, τ ∼ N(0, σ^2/τ),
β_j | γ_j = 0 ∼ δ_0.

For the time being, we treat the hyperparameters τ, π, σ^2 as fixed. Let μ_j and v_j be the mean and the variance of the normal distribution f_j in (6.3). Carbonetto and Stephens [2012] proposed to update (μ_j, v_j, φ_j) sequentially for each j in each iteration. Huang et al. [2016] showed that a better approach is to do a batch-wise update. In each iteration, they proposed to first update {v_j : j = 1, ..., p}, then {μ_j : j = 1, ..., p} and lastly {φ_j : j = 1, ..., p}. It turns out that the computational cost mainly comes from the updating of {μ_j : j = 1, ..., p}. Let

µ = (µ1, . . . , µN ) and Φ = diag(φ1, . . . , φN ). The updating equation for µ, at the k-th iteration, can be written as

μ = ( Φ^{(k)} X^t X Φ^{(k)} + n Φ^{(k)}(I − Φ^{(k)}) + σ^{-2} Φ^{(k)} )^{-1} Φ^{(k)} X^t y.   (6.6)

The complexity of the matrix inversion is O((n ∧ N)^3) (n is the sample size).

Huang et al. [2016] used the Woodbury identity to convert the problem into the inversion of a much smaller matrix. However, when the number of causal SNPs is very large, such inversions could still be very time-consuming. Let A_γ denote the submatrix (or subvector) of A that corresponds to the SNPs with φ_j > 0. We may rewrite (6.6) as

μ_γ = (Φ_γ^{(k)})^{-1} ( X_γ^t X_γ + n((Φ_γ^{(k)})^{-1} − I) + σ^{-2} (Φ_γ^{(k)})^{-1} )^{-1} X_γ^t y.   (6.7)

Since n((Φ_γ^{(k)})^{-1} − I) + σ^{-2} (Φ_γ^{(k)})^{-1} is a diagonal matrix, ICF can be applied. The

Cholesky decomposition of X_γ^t X_γ can still be obtained by updating. If the initial values {φ_j^{(0)} : j = 1, ..., p} are chosen appropriately so that they are not too far from the truth, by using ICF we also avoid computing the entire Gram matrix X^t X. In the BVSR model, we put a hyperprior on the hyperparameters

τ, π, σ2. To average over the hyperprior distributions and obtain the posterior inference for the hyperparameters, we can use the importance sampling approach proposed by Carbonetto and Stephens [2012].
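To make the batch update concrete, here is a minimal reference sketch that evaluates (6.6) with a plain dense solve; it is not the proposed ICF-based scheme, only a naive baseline against which that scheme could be checked. The dimensions, inclusion probabilities, and σ^2 are hypothetical toy values.

```python
import numpy as np

def update_mu_direct(X, y, phi, sigma2):
    """Naive evaluation of the batch mean update (6.6):
    mu = (Phi X'X Phi + n Phi(I - Phi) + sigma^{-2} Phi)^{-1} Phi X'y.
    The text proposes replacing this dense solve by ICF on the reduced system (6.7)."""
    n = X.shape[0]
    Phi = np.diag(phi)
    A = Phi @ X.T @ X @ Phi + n * Phi @ (np.eye(len(phi)) - Phi) + Phi / sigma2
    return np.linalg.solve(A, Phi @ X.T @ y)

# toy usage with hypothetical data; phi is kept away from 0 so A is invertible
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.ones(5) + rng.standard_normal(200)
phi = np.clip(rng.uniform(size=50), 0.05, 0.95)
print(update_mu_direct(X, y, phi, sigma2=0.04)[:5])
```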

This novel algorithm could be extremely useful for problems like the heritability estimation of the Height dataset, where we have a very large number of causal

SNPs with only tiny effects. For the original algorithm, the PIPs of many causal

SNPs could be set to zero since they are too small. However, if we sum up the effects of these SNPs, the total effect is not negligible at all. By using ICF, we can accurately estimate these PIPs within an acceptable computational time. As shown in Chap. 3.5, when there are more than 1000 SNPs in the model, the speed advantage of ICF over all the other methods is extremely significant.

6.2.3 Extension of This Work to Categorical Phenotypes

Background In genetic studies, very often the phenotype is the case-control status and then the Bayesian linear regression model is not directly applicable. Hence the extension of our results to categorical phenotypes would be of high practical importance. The standard approach to analyzing categorical phenotypes by regression is to introduce a logit or a probit link function. For binary phenotypes,

this means

logit P(y_i = 1 | β) = log [ P(y_i = 1 | β) / P(y_i = 0 | β) ] = x_{(i)}^t β,   or   P(y_i = 1 | β) = Φ(x_{(i)}^t β),   (6.8)

where x_{(i)} = (1, x_{i1}, ..., x_{ip}), β = (β_0, β_1, ..., β_p), and Φ denotes the cumulative distribution function of the standard normal distribution. (Note that we cannot assume

β_0 = 0.) Unfortunately, the inference for either model is not easy due to the lack of a conjugate prior. In particular, β cannot be integrated out in the expression for the marginal likelihood. Take the logistic regression model as an example. Its marginal likelihood is given by

p(y) = ∫ Π_{i=1}^n { ( e^{x_{(i)}^t β} / (1 + e^{x_{(i)}^t β}) )^{y_i} ( 1 / (1 + e^{x_{(i)}^t β}) )^{1−y_i} } p(β) dβ.

The model (6.8) has another, more convenient formulation. Introduce latent variables z_1, ..., z_n such that y_i = I_{z_i > 0}. Then the logistic regression model is equivalent to stating that

z_i − x_{(i)}^t β ∼ Logistic,

and the probit model is equivalent to

z_i − x_{(i)}^t β ∼ N(0, 1).

Methods To extend our results on the null distribution of Bayes factors to the

binary phenotypes, we first need to work out a closed-form expression for the Bayes

factor. One solution is to use the Laplace approximation [Kass and Raftery, 1995],

which uses a Taylor expansion to approximate the marginal likelihood by

∫ p(y|β) p(β) dβ ≈ p(y|β̂) p(β̂) (2π)^{(p+1)/2} |Σ̂|^{-1/2},
where β̂ is the MAP (maximum a posteriori) estimator, Σ̂ = −D^2 l(β̂) is the negative Hessian, and l(β) = log p(y|β)p(β). However, the distribution of the corresponding Bayes factor is difficult to characterize. Another asymptotic approach, taken by Wakefield

[2009], makes use of the asymptotic normality of the maximum likelihood estimator and computes the Bayes factor as a function of the prior and the Wald test statistic. Nevertheless, this approach defeats the purpose of computing the p-value associated with the Bayes factor since it is always equal to the p-value of the Wald test. For the probit model, the marginal probability P(y_i = 1) can be computed exactly in a closed form. Let the prior for β be MVN(μ_β, V_β). Then,

since z_i | β ∼ N(x_{(i)}^t β, 1), we have

z_i ∼ N( x_{(i)}^t μ_β, 1 + x_{(i)}^t V_β x_{(i)} ),
P(y_i = 1) = P(z_i > 0) = Φ( x_{(i)}^t μ_β / √(1 + x_{(i)}^t V_β x_{(i)}) ).
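As a minimal check of this closed form, the snippet below compares it against a Monte Carlo estimate for a hypothetical covariate vector and prior (the specific μ_β and V_β are illustrative assumptions only).

```python
import numpy as np
from scipy.stats import norm

def probit_marginal_prob(x, mu_beta, V_beta):
    """Closed-form marginal P(y_i = 1) in the probit model with beta ~ MVN(mu_beta, V_beta)."""
    m = x @ mu_beta
    v = 1.0 + x @ V_beta @ x
    return norm.cdf(m / np.sqrt(v))

# Monte Carlo verification with a hypothetical prior and covariate vector
rng = np.random.default_rng(6)
p = 3
x = rng.standard_normal(p)
mu = np.array([0.2, -0.1, 0.3])
V = 0.25 * np.eye(p)
beta = rng.multivariate_normal(mu, V, size=200_000)
z = beta @ x + rng.standard_normal(200_000)
print(probit_marginal_prob(x, mu, V), np.mean(z > 0))   # should agree closely
```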

The computation of the Bayes factor requires a numerical integral of the Bayes factor for the linear regression model (integrating out z). There is no closed-form

expression unless all the observations are exactly independent, i.e., x_{(i)}^t x_{(j)} = 0 for all i ≠ j. Overall, the null distribution of the Bayes factor for binary phenotypes is a very challenging problem and some novel method would be necessary.

Extending BVSR to binary phenotypes is an easier task. For the probit model one method has already been given in Guan and Stephens [2011]. Using the latent variable model, compared with the BVSR model for quantitative phenotypes, we only need an additional sampling of z in each MCMC iteration. This

method appears to be a variant of the Gibbs sampler of Albert and Chib [1993], which may be viewed as the default choice for a Bayesian analysis with binary outcomes. For the logistic regression model, similar algorithms could be developed, using the t-distribution approximation to the logistic distribution proposed by Albert and Chib [1993]. Nonetheless, the additional update of z in MCMC implies that the mixing of the Markov chain is more difficult for binary phenotypes than for quantitative ones. Hence, to achieve convergence or accurate posterior inferences, MCMC needs to be run for more iterations for binary phenotypes. It remains a challenge to develop better MCMC algorithms for variable selection with binary phenotypes.

Chapter 7

Appendices

7.1 Linear Algebra Results

The readers are assumed to have an elementary knowledge of linear algebra. Notations that may be confusing are explained when first used and could also be found at the beginning of this work. The goal of this section is to introduce some known linear algebra results that will be used in the development of our theory.

All vectors and matrices are assumed to be real unless otherwise stated.

7.1.1 Some Matrix Identities

Lemma 7.1. (Block matrix inversion formula) Let A = [A_11, A_12; A_21, A_22] be an invertible partitioned matrix such that both A_11 and A_22 are square. If both A_11

and S = A_22 − A_21 A_11^{-1} A_12 are non-singular, then

A^{-1} = [ A_11^{-1} + A_11^{-1} A_12 S^{-1} A_21 A_11^{-1},  −A_11^{-1} A_12 S^{-1};  −S^{-1} A_21 A_11^{-1},  S^{-1} ].

S is called the Schur complement of A_11 [Hogben, 2006, Part I, Chap. 10].

The formula can be proved by directly checking that AA^{-1} = I. By symmetry,

A_11^{-1} + A_11^{-1} A_12 S^{-1} A_21 A_11^{-1} must be equal to the inverse of the Schur complement

of A22 provided that it exists. This is known as the Woodbury matrix identity.

Lemma 7.2. (Woodbury matrix identity) If both A and S are square matrices,

then

(A + USV)^{-1} = A^{-1} − A^{-1} U (S^{-1} + V A^{-1} U)^{-1} V A^{-1},

provided that U and V have conformable sizes and the inverses involved ex-

ist [Harville, 1997, Chap. 18].

Another way to prove this is to check that the product of the l.h.s and the r.h.s

is just the identity matrix. See also Press [2007, Chap. 2] for more information.
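A minimal numerical check of the Woodbury identity is sketched below; the random test matrices are arbitrary illustrative choices, with A diagonal only to make its inverse trivial.

```python
import numpy as np

def woodbury_inverse(A_inv, U, S, V):
    """Right-hand side of the Woodbury identity:
    (A + U S V)^{-1} = A^{-1} - A^{-1} U (S^{-1} + V A^{-1} U)^{-1} V A^{-1}."""
    inner = np.linalg.inv(np.linalg.inv(S) + V @ A_inv @ U)
    return A_inv - A_inv @ U @ inner @ V @ A_inv

# numerical check on random matrices of conformable sizes
rng = np.random.default_rng(5)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))
U, V = rng.standard_normal((n, k)), rng.standard_normal((k, n))
S = np.diag(rng.uniform(1.0, 2.0, size=k))
lhs = np.linalg.inv(A + U @ S @ V)
rhs = woodbury_inverse(np.linalg.inv(A), U, S, V)
print(np.allclose(lhs, rhs))   # expected: True
```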

Lemma 7.3. Let A be the partitioned matrix as given in Lemma 7.1. If both A11

and A22 are non-singular, then [Hogben, 2006, Part I, Chap. 10]

|A| = |A_11| · |A_22 − A_21 A_11^{-1} A_12| = |A_22| · |A_11 − A_12 A_22^{-1} A_21|.

Proof. The first equality can be proved by using the following decomposition

[ I, 0; −A_21 A_11^{-1}, I ] [ A_11, A_12; A_21, A_22 ] [ I, −A_11^{-1} A_12; 0, I ] = [ A_11, 0; 0, A_22 − A_21 A_11^{-1} A_12 ].

The determinant of the l.h.s is simply |A| and the determinant of the r.h.s is

|A_11| · |A_22 − A_21 A_11^{-1} A_12|. The second equality can be checked similarly.

Lemma 7.4. (Sylvester’s determinant formula) If A is an n × m matrix and B

is an m × n matrix, then

|In + AB| = |Im + BA|

where |·| denotes the determinant and In denotes an n×n identity matrix [Sylvester, 1851].

Proof. Consider the partitioned matrix M = [ I_n, A; −B, I_m ]. By Lemma 7.3,

|M| = |In + AB| = |Im + BA|.

Lemma 7.5. Let A + iB be a complex square matrix where both A and B are real. If A and (A + BA^{-1}B) are invertible, then

(A + iB)^{-1} = (A + BA^{-1}B)^{-1} − i A^{-1} B (A + BA^{-1}B)^{-1}.

This can be proved by calculating the real and the imaginary parts of (A + iB)(A + iB)^{-1}.
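A short numerical check of Lemma 7.5 (the complex-factorization identity underlying ICF) is sketched below; the test matrices are arbitrary illustrative choices satisfying the invertibility conditions with high probability.

```python
import numpy as np

# Numerical check of Lemma 7.5 on random real matrices A and B
rng = np.random.default_rng(7)
n = 5
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # invertible with high probability
B = rng.standard_normal((n, n))
A_inv = np.linalg.inv(A)
M = np.linalg.inv(A + B @ A_inv @ B)                # (A + B A^{-1} B)^{-1}
lhs = np.linalg.inv(A + 1j * B)
rhs = M - 1j * (A_inv @ B @ M)
print(np.allclose(lhs, rhs))                        # expected: True
```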

7.1.2 Singular Value Decomposition and Pseudoinverse

Any matrix, real or complex, admits a factorization called the singular value decomposition (SVD). Before we state the form of the SVD, we first review some terminologies

for complex matrices. We use M ∗ to denote the conjugate transpose of matrix

M.

Definition 7.1. (a) A complex square matrix M is said to be Hermitian if M =

M ∗.

(b) A complex square matrix M is said to be skew-Hermitian if M = −M ∗.

(c) A complex square matrix M is said to be unitary if MM ∗ = I.

Theorem 7.6. (Singular value decomposition) Let M be an arbitrary n × p complex matrix. Then there exist two unitary matrices U, V and a “rectangular

diagonal” matrix D of size n × p such that

  diag(d , . . . , d ) 0 ∗  1 r  M = UDV , D =   0 0

where diag(d1, . . . , dr) denotes a diagonal matrix with real diagonal elements d1, . . . , dr >

0. The singular values d1, . . . , dr are determined uniquely up to permutation and r is equal to the rank of M.

See Allaire et al. [2008, Chap. 2.7], Serre [2002, Chap. 7.7], Harville [1997,

Chap. 21.12], etc. for proofs and more information. SVD can be used to define

the pseudoinverse for any matrix.

Definition 7.2. (Moore-Penrose pseudoinverse) The Moore-Penrose pseu-

doinverse of any complex matrix M with SVD M = UDV ∗ is denoted by M +

and defined as [Allaire et al., 2008, Chap. 2.7]

  diag(d−1, . . . , d−1) 0 + def. + ∗ +  1 r  M = VD U , D =   . 0 0

Proposition 7.7. The Moore-Penrose pseudoinverse M + has following proper-

ties.

(a) MM +M = M;

(b) M +MM + = M +;

(c) MM + = (MM +)∗ ;

170 (d) M +M = (M +M)∗ ;

(e) (M +)+ = M;

(f) M + = (M ∗M)+M ∗;

(g) M + = M ∗(MM ∗)+;

(h) if M is invertible, M + = M −1.

Proof. We only prove part (a) and (f). The rest can be checked easily in similar ways.

(a) MM +M = UDV ∗VD+U ∗UDV ∗ = UDD+DV ∗ = UDV ∗ = M.

+ + ∗ (h) If M is invertible, then D = diag(d1, . . . , dr) and thus MM = UDD U = UU ∗ = I. Since the matrix inversion is unique, we must have M + = M −1.

These properties explain why M + is called pseudoinverse. In fact, it can be shown that the matrix M + satisfying properties (a) to (d) is unique. See Serre

[2002, Chap. 8.4] and Harville [1997, Chap. 20], for proofs.

7.1.3 Eigenvalues, Eigenvectors and Eigendecomposition

We first review the definitions of eigenvalue and eigenvector.

Definition 7.3. Let M be a p × p complex matrix. λ is called an eigenvalue of

M if there exists a nonzero vector u such that Mu = λu. u is then called the corresponding eigenvector.

Immediately we have the following lemma.

Lemma 7.8. Let $A$ be an $n \times p$ matrix and $B$ be a $p \times n$ matrix. If $\lambda \neq 0$ is an eigenvalue of $AB$, then it is also an eigenvalue of $BA$.

Proof. By definition, there exists a nonzero vector $u$ such that $ABu = \lambda u$. Multiplying both sides by $B$ gives
$$BA(Bu) = \lambda(Bu).$$
We claim that $Bu \neq 0$, i.e., $Bu$ is an eigenvector of $BA$ with corresponding eigenvalue $\lambda$. We prove this by contradiction. If $Bu = 0$, we have $ABu = 0 = \lambda u$. However, since $\lambda \neq 0$, this would imply $u = 0$, a contradiction.

Clearly we can always assume the eigenvector is normalized so that $\|u\|_2 = 1$. Using eigenvalues and eigenvectors, some matrices admit a factorization which is usually referred to as the spectral decomposition or eigendecomposition. For the purposes of this thesis, we only focus on a special class of matrices called normal matrices.

Definition 7.4. A complex square matrix $M$ is said to be normal if $MM^* = M^*M$.

Clearly, unitary matrices, Hermitian matrices and skew-Hermitian matrices

are normal. For real matrices, they correspond to orthogonal matrices, symmetric

matrices and skew-symmetric matrices respectively.

Theorem 7.9. (Spectral decomposition for normal matrices) If a square matrix $M$ is normal, it admits the factorization
$$M = U\Lambda U^*,$$
where $U$ is unitary and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)$. Each $(\lambda_i, u_i)$ is an eigenvalue-eigenvector pair of the matrix $M$ ($u_i$ is the $i$-th column of $U$), but $\lambda_1, \dots, \lambda_p$ are not necessarily distinct.

The set (in fact, multiset) $\{\lambda_1, \dots, \lambda_p\}$ is called the spectrum of $M$. It is unique up to permutation. The number of times that an eigenvalue appears in the spectrum is called its multiplicity. To see that $(\lambda_i, u_i)$ is an eigenvalue-eigenvector pair, notice that the decomposition is equivalent to $MU = U\Lambda$. For a formal proof, see Trefethen and Bau III [1997, Chap. 24] or Serre [2002, Chap. 3]. When all the eigenvalues are nonnegative, by convention we assume they are ordered so that $\lambda_1 \geq \cdots \geq \lambda_p$.

Consider a normal matrix $M = XX^*$. Let the SVD of $X$ be $UDV^*$. Then the SVD for $M$ is
$$M = UDD^*U^*. \qquad (7.1)$$
Since $U$ is unitary and $DD^*$ is diagonal with nonnegative entries, this is also the spectral decomposition for $M$. Hence the nonzero singular values of $M$ coincide with its nonzero eigenvalues. Similarly one can show that if $M = -XX^*$, the nonzero singular values of $M$ are equal to the absolute values of its nonzero eigenvalues. However, for a general square matrix, the singular values are not equal, in absolute value, to the eigenvalues.

Proposition 7.10. For a $p \times p$ normal matrix $M$ with eigenvalues $\lambda_1, \dots, \lambda_p$, counted with multiplicity [Trefethen and Bau III, 1997, Chap. 24],
$$|M| = \prod_{i=1}^{p}\lambda_i, \qquad \mathrm{tr}(M) = \sum_{i=1}^{p}\lambda_i, \qquad \mathrm{rank}(M) = \sum_{i=1}^{p}\mathbf{1}_{(0,\infty)}(|\lambda_i|),$$
where $\mathrm{tr}(M) = \sum_{i=1}^{p}M_{ii}$ denotes the trace of $M$, and $\mathbf{1}_{(0,\infty)}(|\lambda_i|)$ is the indicator function that equals 1 if $\lambda_i \neq 0$ and 0 otherwise.

Proof. Let $U\Lambda U^*$ be the spectral decomposition of $M$. Then $|M| = |U\Lambda U^*| = |\Lambda| = \prod_{i=1}^{p}\lambda_i$. Similarly, $\mathrm{tr}(M) = \mathrm{tr}(U\Lambda U^*) = \mathrm{tr}(U^*U\Lambda) = \mathrm{tr}(\Lambda)$ (the trace is invariant under cyclic permutations), and $\mathrm{rank}(M) = \mathrm{rank}(U\Lambda U^*) = \mathrm{rank}(\Lambda)$.

The eigenvalues of a real matrix are not necessarily real. However, when the

matrix is Hermitian, the eigenvalues are always real. Furthermore, if the matrix

is positive definite, the eigenvalues are positive. The properties of the eigenvalues

of some special matrices are summarized in the following proposition.

Proposition 7.11. Let $M$ be a $p \times p$ matrix and $\lambda$ be an arbitrary eigenvalue of it.

(a) If $M$ is unitary, $|\lambda| = 1$.
(b) If $M$ is idempotent, i.e., $MM = M$, then $\lambda$ is either 0 or 1.
(c) If $M$ is Hermitian, $\lambda$ is real.
(d) If $M$ is skew-Hermitian, $\lambda$ is either 0 or purely imaginary; if in addition $M$ is real, $\bar{\lambda}$ is also an eigenvalue of $M$, so $M$ has at least one zero eigenvalue when $p$ is odd.
(e) If $M$ is positive definite, $\lambda > 0$.
(f) If $M$ is positive semi-definite, $\lambda \geq 0$.

Proof. Let $u$ be the corresponding eigenvector for $\lambda$, so that $Mu = \lambda u$.

(a) $\|\lambda u\|_2^2 = |\lambda|^2u^*u = u^*M^*Mu = u^*u$. Thus $|\lambda| = 1$.

(b) $\lambda u = Mu = M^2u = \lambda Mu = \lambda^2u$. Since $u$ is nonzero, $\lambda = 0$ or 1.

(c) On one hand, $(Mu)^*u = u^*M^*u = u^*Mu = \lambda u^*u$. On the other, $(Mu)^*u = (\lambda u)^*u = \bar{\lambda}u^*u$. Therefore $(\lambda - \bar{\lambda})u^*u = 0$. Because $u$ is nonzero, we must have $\lambda = \bar{\lambda}$.

(d) Using the same argument, we obtain $(\lambda + \bar{\lambda})u^*u = 0$, which implies the real part of $\lambda$ is 0. When $M$ is real, writing $u = \Re(u) + i\Im(u)$, it is easy to show that $M\bar{u} = \bar{\lambda}\bar{u}$.

(e) Since $M$ is positive definite, $u^*Mu = \lambda u^*u > 0$. Thus $\lambda > 0$.

(f) The same argument shows $\lambda \geq 0$.

At last we point out that spectral decomposition provides a simple approach to calculating the $n$th root of a square matrix.

Lemma 7.12. Let $M$ be a positive semi-definite matrix with spectral decomposition $U\Lambda U^*$. Then its square root is given by $M^{1/2} = U\Lambda^{1/2}U^*$.

It is easy to check that $M = M^{1/2}M^{1/2}$. The proof of uniqueness can be found in Harville [1997, Chap. 21.9] (for real matrices) and Serre [2002, Chap. 7.1] (for complex matrices).
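The following is a minimal numerical sketch (Python; the toy matrix is an arbitrary positive semi-definite example) of Lemma 7.12, computing the square root through the eigendecomposition:

import numpy as np

# Square root of a positive semi-definite matrix via its spectral decomposition.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
M = X @ X.T                                # positive semi-definite, rank 3

lam, U = np.linalg.eigh(M)                 # eigendecomposition of a symmetric matrix
lam = np.clip(lam, 0.0, None)              # guard against tiny negative round-off
M_half = U @ np.diag(np.sqrt(lam)) @ U.T

assert np.allclose(M_half @ M_half, M)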

7.1.4 Orthogonal Projection Matrices

Let $X$ be an $n \times p$ real matrix. Define
$$H_X \overset{\text{def.}}{=} X(X^tX)^+X^t. \qquad (7.2)$$
By the pseudoinverse introduced previously, immediately we have

Lemma 7.13. Let $X = UDV^t$ be the singular value decomposition of $X$. Then
$$H_X = UDD^+U^t,$$
where $DD^+ = \mathrm{diag}(1, \dots, 1, 0, \dots, 0)$. The number of 1's is equal to the rank of $X$.

$H_X$ is called an orthogonal projection matrix. In traditional linear regression, it is also called the hat matrix since it maps the response vector to its fitted values under least squares.

Proposition 7.14. Assume $X$ is an $n \times p$ matrix with $n \geq p$ and $\mathrm{rank}(X) = p$. For any $n$-vector $y$, we have
$$(X^tX)^{-1}X^ty = \arg\min_{\beta \in \mathbb{R}^p}\|y - X\beta\|_2 \overset{\text{def.}}{=} \hat{\beta},$$
where $\|\cdot\|_2$ denotes the $\ell^2$-norm.

Proof. Since $\|y - X\beta\|_2^2 = (y - X\beta)^t(y - X\beta)$, we have
$$\frac{\partial\|y - X\beta\|_2^2}{\partial\beta} = 2X^tX\beta - 2X^ty.$$
Setting the derivative equal to 0, we obtain the expression for $\hat{\beta}$. Since the second derivative matrix $2X^tX$ is positive definite, $\hat{\beta}$ indeed minimizes $\|y - X\beta\|_2$. (For matrix differentiation, see for example Mardia et al. [1980, Appx. A].)

Proposition 7.14 implies that $H_Xy$ is the projection of the vector $y$ onto the column space of $X$. In fact, when $n < p$ or $X$ is rank deficient, the claim still holds since $\beta = (X^tX)^+X^ty$ satisfies $X^tX\beta = X^ty$, though the solution is no longer unique. The next proposition gives some important properties of $H_X$. For more information, see for example Harville [1997, Chap. 12] and Hogben [2006, Part I, Chap. 5].

Proposition 7.15. Let $H_X$ be a matrix as defined in (7.2). Then,

(a) $H_X$ is symmetric;
(b) $H_X$ is idempotent, i.e., $H_X^2 = H_X$;
(c) $H_XX = X$;
(d) $\mathrm{rank}(H_X) = \mathrm{rank}(X)$;
(e) $\mathrm{tr}(H_X) = \mathrm{rank}(H_X)$;
(f) $I - H_X$ is symmetric and idempotent;
(g) $\mathrm{rank}(I - H_X) = \mathrm{tr}(I - H_X) = n - \mathrm{rank}(H_X)$;
(h) $I - H_X$ is an orthogonal projection matrix.

Proof. (a) The symmetry follows from the definition of $H_X$ and Proposition 7.7 (d).

(b) By Proposition 7.7 (b), $H_X^2 = [X(X^tX)^+X^t][X(X^tX)^+X^t] = H_X$.

(c) By Proposition 7.7 (a) and (f), $H_XX = X(X^tX)^+X^tX = XX^+X = X$.

(d) By the definition of rank, it is equivalent to prove that the column spaces of $X$ and $H_X$ are identical. First, let $Xv$ ($v \in \mathbb{R}^p$) be a vector in the column space of $X$. By part (c), $Xv = H_X(Xv)$, which implies it is also in the column space of $H_X$. Second, let $H_Xv$ be a vector in the column space of $H_X$. By definition, $H_Xv = X[(X^tX)^+X^tv]$. Hence it is also in the column space of $X$. Combining the two arguments, we arrive at the conclusion $\mathrm{rank}(H_X) = \mathrm{rank}(X)$.

(e) By Proposition 7.10 and Proposition 7.11 (b), $H_X$ has $\mathrm{rank}(H_X)$ eigenvalues equal to 1 (counted with multiplicity) and $n - \mathrm{rank}(H_X)$ zero eigenvalues. Thus for $H_X$, the trace is equal to the rank.

(f) The symmetry of $I - H_X$ is self-evident. By part (b), $(I - H_X)(I - H_X) = I - 2H_X + H_X^2 = I - H_X$.

(g) Both $H_X$ and $I - H_X$ admit a spectral decomposition. Let the spectral decomposition of $H_X$ be $U\Lambda U^t$. Since $UU^t = I$, $I - H_X = U(I - \Lambda)U^t$. The result then follows.

(h) By part (f) and Proposition 7.7 (a), $I - H_X$ can be written in the following form that defines an orthogonal projection matrix,
$$I - H_X = (I - H_X)[(I - H_X)^t(I - H_X)]^+(I - H_X)^t.$$
In fact, any symmetric and idempotent matrix is an orthogonal projection matrix.

Clearly, any $n$-vector $y$ can be decomposed as $y = H_Xy + (I - H_X)y$, i.e., into the projection of $y$ onto the column space of $X$ and the projection of $y$ onto the orthogonal complement of that space. In linear regression, $(I - H_X)y$ is the vector of residuals from the least squares fit.
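A small numerical sketch (Python, arbitrary toy data) of this decomposition and of the connection to least squares in Proposition 7.14:

import numpy as np

# The hat matrix H_X projects y onto the column space of X; (I - H_X)y gives the residuals.
rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

H = X @ np.linalg.pinv(X.T @ X) @ X.T        # H_X = X (X^t X)^+ X^t, cf. (7.2)
fitted = H @ y
resid = (np.eye(n) - H) @ y

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(fitted, X @ beta_hat)     # H_X y equals the least squares fit
assert np.allclose(fitted + resid, y)        # decomposition of y into two pieces
assert np.isclose(fitted @ resid, 0.0)       # the two pieces are orthogonal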

At last, we prove an equality concerning the projection matrices that will be very useful in the restricted maximum likelihood inference.

Lemma 7.16. Let $X$ be a full-rank $n \times p$ matrix and $L$ be a full-rank $n \times (n-p)$ matrix such that $L^tX = 0$. For any positive definite $n \times n$ matrix $V$, we have
$$L(L^tVL)^{-1}L^t = V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1}.$$
If $X$ does not have full rank, the equality still holds with $(X^tV^{-1}X)^{-1}$ replaced by $(X^tV^{-1}X)^+$.

Proof. To prove this, just notice that
$$H_{V^{1/2}L} = V^{1/2}L(L^tVL)^{-1}L^tV^{1/2}, \qquad H_{V^{-1/2}X} = V^{-1/2}X(X^tV^{-1}X)^{-1}X^tV^{-1/2}.$$
Thus the lemma could be rewritten as $H_{V^{1/2}L} = I - H_{V^{-1/2}X}$. Clearly, the two matrices $V^{1/2}L$ and $V^{-1/2}X$ are orthogonal in the sense that $(V^{1/2}L)^tV^{-1/2}X = 0$. Since they have rank equal to $n-p$ and $p$ respectively, their column spaces must be orthogonal complements of each other in the vector space $\mathbb{R}^n$.

7.2 Bayesian Linear Regression

Consider the linear regression model
$$y = X\beta + \varepsilon, \qquad (7.3)$$
where $y = (y_1, \dots, y_n)$ is the response vector, $X$ is an $n \times p$ design matrix and $\beta$ is a $p$-vector called the regression coefficients. The errors $\varepsilon_1, \dots, \varepsilon_n$ are assumed to be i.i.d. normal random variables with mean 0 and variance $\tau^{-1}$, i.e.,
$$\varepsilon \mid \tau \sim \mathrm{MVN}(0, \tau^{-1}I).$$
Due to the normal error assumption, (7.3) is also referred to as the normal linear model, and can be equivalently written as
$$y \mid \beta, \tau \sim \mathrm{MVN}(X\beta, \tau^{-1}I). \qquad (7.4)$$

For a full exposition in book form of the Bayesian treatment of the normal linear model, see, for example, O'Hagan and Forster [2004, Chap. 9], Hoff [2009, Chap. 9], Koch [2007, Chap. 4], and Gelman et al. [2014, Chap. 14]. For readers who are not familiar with Bayesian methodology, more introductory material can also be found in these books.

7.2.1 Posterior Distributions for the Conjugate Priors

Throughout this thesis, only the family of conjugate priors is considered. We have two parameters in (7.4), $\beta$ and $\tau$, and $\beta$ is usually of direct interest. The conjugate prior for $\beta$ is a multivariate normal distribution. The error precision, $\tau$, can be treated as either known or unknown.

Known error variance If $\tau$ is known, the prior for model (7.4) is simply
$$\beta \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}V), \qquad (7.5)$$
where $V$ is a positive definite matrix. More generally, we could specify a nonzero prior mean for $\beta$, but this is rarely done in practice (see Jeffreys [1961, Chap. 5] for more reasons). The posterior distribution of $\beta$ is still normal, since
$$f(\beta \mid y, \tau, V) \propto f(y \mid \beta, \tau)f(\beta \mid \tau, V) = \frac{\tau^{(n+p)/2}}{(2\pi)^{(n+p)/2}}|V|^{-1/2}\exp\Big\{-\frac{\tau}{2}\big[(y - X\beta)^t(y - X\beta) + \beta^tV^{-1}\beta\big]\Big\} \propto \exp\Big\{-\frac{\tau}{2}\big[\beta^t(X^tX + V^{-1})\beta - 2\beta^tX^ty\big]\Big\}.$$

This is the normal density kernel corresponding to the posterior distribution

$$\beta \mid y, \tau, V \sim \mathrm{MVN}\big((X^tX + V^{-1})^{-1}X^ty,\ \tau^{-1}(X^tX + V^{-1})^{-1}\big).$$
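A minimal sketch (Python) of this known-variance conjugate update on simulated toy data; the prior $V = 10I$ and the true coefficients are arbitrary illustrative choices:

import numpy as np

# Posterior of beta: MVN((X^tX + V^{-1})^{-1} X^t y, tau^{-1}(X^tX + V^{-1})^{-1}).
rng = np.random.default_rng(4)
n, p, tau = 100, 3, 2.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -0.5, 0.0])
y = X @ beta_true + rng.normal(scale=1.0 / np.sqrt(tau), size=n)

V = 10.0 * np.eye(p)                          # prior covariance (up to tau^{-1})
A = X.T @ X + np.linalg.inv(V)
post_mean = np.linalg.solve(A, X.T @ y)
post_cov = np.linalg.inv(A) / tau

print("posterior mean:", post_mean)
print("posterior sd:  ", np.sqrt(np.diag(post_cov)))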

Unknown error variance If $\tau$ is unknown, we consider the following normal-inverse-gamma conjugate prior
$$\beta \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}V), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2). \qquad (7.6)$$
The gamma distribution is in the shape-rate parameterization. It is called the normal-inverse-gamma prior since the prior for the error variance ($\tau^{-1}$) is an inverse-gamma distribution. Thus the prior density is given by
$$f(\beta, \tau) = \frac{(\kappa_2/2)^{\kappa_1/2}}{(2\pi)^{p/2}\Gamma(\kappa_1/2)}|V|^{-1/2}\tau^{(p+\kappa_1-2)/2}\exp\big\{-(\kappa_2 + \beta^tV^{-1}\beta)\tau/2\big\}.$$

Under the prior (7.6), we have
$$y \mid \tau, V \sim \mathrm{MVN}(0, \tau^{-1}(XVX^t + I)),$$
which leads to the marginal likelihood (after integrating out $\beta$),
$$f(y \mid \tau, V) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\exp\Big\{-\frac{\tau}{2}y^t(XVX^t + I)^{-1}y\Big\}. \qquad (7.7)$$
Hence,
$$f(\tau \mid y, V) \propto f(y \mid \tau, V)f(\tau \mid \kappa_1, \kappa_2) \propto \tau^{(n+\kappa_1-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\},$$
which shows that
$$\tau \mid y, \kappa_1, \kappa_2 \sim \mathrm{Gamma}\big((n+\kappa_1)/2,\ [y^t(XVX^t + I)^{-1}y + \kappa_2]/2\big).$$

Non-informative prior In practice, usually there is no information guiding the choice of $\kappa_1$ and $\kappa_2$, and thus the non-informative prior is preferred. The most widely used such prior is the Jeffreys prior,
$$f(\tau) \propto 1/\tau. \qquad (7.8)$$
It is improper since the integral of the density function is not finite. However, it can be viewed as the limit of a sequence of proper gamma priors for $\tau$ and thus written as
$$\tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \quad \kappa_1 \downarrow 0,\ \kappa_2 \downarrow 0. \qquad (7.9)$$
The posterior for $\tau$ is still proper and given by
$$\tau \mid y \sim \mathrm{Gamma}\big(n/2,\ y^t(XVX^t + I)^{-1}y/2\big).$$

7.2.2 Bayes Factors for Bayesian Linear Regression

The Bayes factor is defined as the ratio of the marginal likelihoods of two models.

In practice, it suffices to calculate only the null-based Bayes factor, which is defined as
$$\mathrm{BF}_{\mathrm{null}}(M) \overset{\text{def.}}{=} \frac{f(y \mid M)}{f(y \mid M_0)}, \qquad (7.10)$$
where $M$ denotes the model of interest specified by (7.4), (7.5) and (7.6), and $M_0$ denotes the null model where we assume $\beta = 0$. Explicitly, if $\tau$ is known, the null model is simply
$$y \sim \mathrm{MVN}(0, \tau^{-1}I);$$
if $\tau$ is unknown, the null model is
$$y \sim \mathrm{MVN}(0, \tau^{-1}I), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2).$$
Clearly, if we want to compare two non-null models $M_1$ and $M_2$, the corresponding Bayes factor is the ratio of the two null-based Bayes factors,
$$\mathrm{BF}(M_1 : M_2) \overset{\text{def.}}{=} \frac{f(y \mid M_1)}{f(y \mid M_2)} = \frac{f(y \mid M_1)/f(y \mid M_0)}{f(y \mid M_2)/f(y \mid M_0)} = \frac{\mathrm{BF}_{\mathrm{null}}(M_1)}{\mathrm{BF}_{\mathrm{null}}(M_2)}.$$
We will use the model parameters, including the design matrix $X$, to denote a model. For example, if $\tau$ is unknown, we write $M = (X, V, \kappa_1, \kappa_2)$.

Known error variance If $\tau$ is known, the null-based Bayes factor can be computed from (7.7):
$$\mathrm{BF}_{\mathrm{null}}(X, \tau, V) = \frac{f(y \mid \tau, V)}{f(y \mid \tau)} = |I + XVX^t|^{-1/2}\exp\Big\{\frac{\tau}{2}\big[y^ty - y^t(XVX^t + I)^{-1}y\big]\Big\}.$$
By Lemma 7.2,
$$y^t(XVX^t + I)^{-1}y = y^ty - y^tX(X^tX + V^{-1})^{-1}X^ty.$$
Hence,
$$\mathrm{BF}_{\mathrm{null}}(X, \tau, V) = |I + X^tXV|^{-1/2}\exp\Big\{\frac{\tau}{2}y^tX(X^tX + V^{-1})^{-1}X^ty\Big\},$$
where we have also applied Lemma 7.4.

Unknown error variance If $\tau$ is unknown, all we need is to integrate out $\tau$ from (7.7):
$$f(y \mid V, \kappa_1, \kappa_2) = \int f(y \mid \tau, V)f(\tau \mid \kappa_1, \kappa_2)\,d\tau = \frac{(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\int \tau^{(n+\kappa_1-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\}d\tau$$
$$= \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|I + XVX^t|^{-1/2}\Big\{\frac{1}{2}\big[y^t(XVX^t + I)^{-1}y + \kappa_2\big]\Big\}^{-(n+\kappa_1)/2}.$$
Similarly, for the null model,
$$f(y \mid \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}\Big[\frac{1}{2}(y^ty + \kappa_2)\Big]^{-(n+\kappa_1)/2}.$$
Hence,
$$\mathrm{BF}_{\mathrm{null}}(X, V, \kappa_1, \kappa_2) = |I + X^tXV|^{-1/2}\left(\frac{y^t(XVX^t + I)^{-1}y + \kappa_2}{y^ty + \kappa_2}\right)^{-(n+\kappa_1)/2}. \qquad (7.11)$$

Non-informative prior Under the non-informative prior (7.8), the Bayes factor is still proper since the "improper" normalizing constants cancel out, and is given by
$$\mathrm{BF}_{\mathrm{null}}(X, V) = |I + X^tXV|^{-1/2}\left(\frac{y^t(XVX^t + I)^{-1}y}{y^ty}\right)^{-n/2}.$$
By comparing with (7.11), we have
$$\lim_{\kappa_1, \kappa_2 \downarrow 0}\mathrm{BF}_{\mathrm{null}}(X, V, \kappa_1, \kappa_2) = \mathrm{BF}_{\mathrm{null}}(X, V).$$
Therefore the Bayes factor given above can be viewed as the limit of a sequence of Bayes factors for proper priors. By Lemma 7.2, we can rewrite it as
$$\mathrm{BF}_{\mathrm{null}}(X, V) = |I + X^tXV|^{-1/2}\left(\frac{y^ty - y^tX(X^tX + V^{-1})^{-1}X^ty}{y^ty}\right)^{-n/2}. \qquad (7.12)$$
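The closed form (7.12) is straightforward to evaluate numerically. The sketch below (Python, toy data; the prior $V = I$ and the simulated effect sizes are arbitrary choices) computes the log of $\mathrm{BF}_{\mathrm{null}}(X, V)$ for numerical stability:

import numpy as np

# log of the null-based Bayes factor (7.12) under the non-informative prior on tau.
def log_bf_null(y, X, V):
    n = len(y)
    XtX = X.T @ X
    quad = y @ X @ np.linalg.solve(XtX + np.linalg.inv(V), X.T @ y)  # y^t X (X^tX + V^{-1})^{-1} X^t y
    _, logdet = np.linalg.slogdet(np.eye(X.shape[1]) + XtX @ V)
    return -0.5 * logdet - 0.5 * n * np.log1p(-quad / (y @ y))

rng = np.random.default_rng(5)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = X @ np.array([0.3, 0.0]) + rng.normal(size=n)
print("log BF_null:", log_bf_null(y, X, np.eye(p)))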

7.2.3 Controlling for Confounding Covariates

Consider the linear regression model
$$y = Wa + Lb + \varepsilon, \qquad \varepsilon \mid \tau \sim \mathrm{MVN}(0, \tau^{-1}I),$$
where $L$ is an $n \times p$ matrix that represents the covariates of interest and $W$ is an $n \times q$ matrix representing the covariates to be controlled for, including the intercept term. Equivalently this model can be written as
$$y \mid a, b, \tau \sim \mathrm{MVN}(Wa + Lb, \tau^{-1}I). \qquad (7.13)$$
When calculating the null-based Bayes factor, the null model becomes
$$y \mid a, \tau \sim \mathrm{MVN}(Wa, \tau^{-1}I). \qquad (7.14)$$

The Bayes factor with proper conjugate prior Suppose the error variance is unknown and use the following conjugate prior for model (7.13) and model (7.14),
$$a \mid \tau, V_a \sim \mathrm{MVN}(0, \tau^{-1}V_a), \qquad b \mid \tau, V_b \sim \mathrm{MVN}(0, \tau^{-1}V_b), \qquad \tau \mid \kappa_1, \kappa_2 \sim \mathrm{Gamma}(\kappa_1/2, \kappa_2/2), \qquad (7.15)$$
where both $V_a$ and $V_b$ are positive definite. To simplify the notation, define
$$\Sigma_0 \overset{\text{def.}}{=} (WV_aW^t + I)^{-1}, \qquad \Sigma_1 \overset{\text{def.}}{=} (WV_aW^t + LV_bL^t + I)^{-1}.$$

According to our previous calculations, we can obtain the marginal likelihoods,
$$f(y \mid V_a, \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|\Sigma_0|^{1/2}\Big[\frac{1}{2}(y^t\Sigma_0y + \kappa_2)\Big]^{-(n+\kappa_1)/2},$$
$$f(y \mid V_a, V_b, \kappa_1, \kappa_2) = \frac{\Gamma((n+\kappa_1)/2)(\kappa_2/2)^{\kappa_1/2}}{\Gamma(\kappa_1/2)(2\pi)^{n/2}}|\Sigma_1|^{1/2}\Big[\frac{1}{2}(y^t\Sigma_1y + \kappa_2)\Big]^{-(n+\kappa_1)/2}.$$
Hence, the Bayes factor is
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_a, V_b, \kappa_1, \kappa_2) = \frac{|\Sigma_1|^{1/2}}{|\Sigma_0|^{1/2}}\left(\frac{y^t\Sigma_1y + \kappa_2}{y^t\Sigma_0y + \kappa_2}\right)^{-(n+\kappa_1)/2}.$$

The Bayes factor with non-informative prior By letting $V_a^{-1} \to 0$ and $\kappa_1, \kappa_2 \to 0$, we obtain the non-informative prior
$$b \mid \tau, V_b \sim \mathrm{MVN}(0, \tau^{-1}V_b), \qquad f(a, \tau) \propto \tau^{(q-2)/2}, \qquad (7.16)$$
which is the Jeffreys prior for $(a, \tau)$ [Ibrahim and Laud, 1991, O'Hagan and Forster, 2004]. Some authors may favor a simpler form $f(a, \tau) \propto 1/\tau$, which is also conventionally referred to as the Jeffreys prior [Berger et al., 2001, Liang et al., 2008]. The two forms produce essentially the same proper posterior inferences when $n$ is sufficiently large.

Under the null model $b = 0$ with prior (7.16),
$$f(y) = \int f(y \mid \tau, a)f(\tau, a)\,d\tau\,da = \int \frac{\tau^{(n+q-2)/2}}{(2\pi)^{n/2}}\exp\Big\{-\frac{\tau}{2}(y - Wa)^t(y - Wa)\Big\}d\tau\,da$$
$$= (2\pi)^{-(n-q)/2}|W^tW|^{-1/2}\int \tau^{(n-2)/2}\exp\Big\{-\frac{\tau}{2}\big[y^ty - y^tW(W^tW)^{-1}W^ty\big]\Big\}d\tau = \frac{\Gamma(n/2)}{(2\pi)^{(n-q)/2}}|W^tW|^{-1/2}\Big\{\frac{1}{2}\big[y^ty - y^tW(W^tW)^{-1}W^ty\big]\Big\}^{-n/2}.$$

Under the alternative model, since $y \mid \tau, a, V_b \sim \mathrm{MVN}(Wa, \tau^{-1}(I + LV_bL^t))$, we have
$$f(y \mid \tau, a, V_b) = \frac{\tau^{n/2}}{(2\pi)^{n/2}}|I + LV_bL^t|^{-1/2}\exp\Big\{-\frac{\tau}{2}(y - Wa)^t(LV_bL^t + I)^{-1}(y - Wa)\Big\}.$$
Letting $\Sigma_2 \overset{\text{def.}}{=} (I + LV_bL^t)^{-1}$,
$$f(y \mid V_b) = \int f(y \mid \tau, a, V_b)f(\tau, a)\,d\tau\,da = \frac{\Gamma(n/2)}{(2\pi)^{(n-q)/2}}|W^t\Sigma_2W|^{-1/2}|\Sigma_2|^{1/2}\Big\{\frac{1}{2}\big[y^t\Sigma_2y - y^t\Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2y\big]\Big\}^{-n/2}.$$

To simplify the notation, first define
$$P \overset{\text{def.}}{=} I - W(W^tW)^{-1}W^t.$$
Next we claim
$$\Sigma_2 - \Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2 = P - PL(L^tPL + V_b^{-1})^{-1}L^tP. \qquad (7.17)$$

To prove this, consider $(I + \phi^{-1}WW^t + LV_bL^t)^{-1}$ with $\phi > 0$. By the Woodbury identity,
$$(I + \phi^{-1}WW^t + LV_bL^t)^{-1} = \Sigma_2 - \Sigma_2W(W^t\Sigma_2W + \phi I)^{-1}W^t\Sigma_2,$$
$$(I + \phi^{-1}WW^t + LV_bL^t)^{-1} = P_\phi - P_\phi L(L^tP_\phi L + V_b^{-1})^{-1}L^tP_\phi,$$
where
$$P_\phi \overset{\text{def.}}{=} (I + \phi^{-1}WW^t)^{-1} = I - W(W^tW + \phi I)^{-1}W^t.$$
Notice that both $(I + \phi^{-1}WW^t)$ and $(I + LV_bL^t)$ are invertible by checking the eigenvalues. Now let $\phi \downarrow 0$. The limit clearly exists since $(W^t\Sigma_2W)$ is invertible by the positive definiteness of $V_b$, and $\lim_{\phi \downarrow 0}P_\phi = P$. Thus by the uniqueness of the limit we have obtained (7.17).

Letting $X$ be the residuals of $L$ after regressing out $W$, i.e., $X = PL$, by the idempotence of $P$ we can rewrite (7.17) as
$$\Sigma_2 - \Sigma_2W(W^t\Sigma_2W)^{-1}W^t\Sigma_2 = P - X(X^tX + V_b^{-1})^{-1}X^t.$$

The ratio of the two determinant terms in the marginal likelihoods is
$$\frac{|W^t\Sigma_2W|^{-1/2}|\Sigma_2|^{1/2}}{|W^tW|^{-1/2}} = |W^t\Sigma_2W(W^tW)^{-1}|^{-1/2}|\Sigma_2|^{1/2} = |I - W^tL(L^tL + V_b^{-1})^{-1}L^tW(W^tW)^{-1}|^{-1/2}|\Sigma_2|^{1/2}$$
$$= |I - (I - P)L(L^tL + V_b^{-1})^{-1}L^t|^{-1/2}|\Sigma_2|^{1/2} = |I + PL(L^tL + V_b^{-1})^{-1}L^t(I + LV_bL^t)|^{-1/2} = |I + PLV_bL^t|^{-1/2} = |I + V_bX^tX|^{-1/2}.$$
In the third and last equalities we have used Sylvester's determinant formula.

Finally we obtain the expression for the Bayes factor,
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = |I + X^tXV_b|^{-1/2}\left(\frac{y^tPy - y^tX(X^tX + V_b^{-1})^{-1}X^ty}{y^tPy}\right)^{-n/2}. \qquad (7.18)$$
One can also check that
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = \lim_{V_a^{-1} \to 0,\ \kappa_1, \kappa_2 \downarrow 0}\mathrm{BF}_{\mathrm{null}}(W, L, V_a, V_b, \kappa_1, \kappa_2).$$

If we define the residuals of $y$ after regressing out $W$ by $y_w \overset{\text{def.}}{=} Py$, we can rewrite (7.18) as
$$\mathrm{BF}_{\mathrm{null}}(W, L, V_b) = |I + X^tXV_b|^{-1/2}\left(\frac{y_w^ty_w - y_w^tX(X^tX + V_b^{-1})^{-1}X^ty_w}{y_w^ty_w}\right)^{-n/2},$$
which has exactly the same form as (7.12). Hence it suffices to discuss the model (7.4) with no loss of generality. It also reveals that $\mathrm{BF}_{\mathrm{null}}$ defined in (7.18) is invariant to the following transformation of $y$,
$$T(y) = c(y + W\alpha), \quad \forall c \neq 0,\ \alpha \in \mathbb{R}^q,$$
which is a very convenient feature in simulation studies.
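A minimal sketch (Python, arbitrary toy data and an arbitrary prior $V_b$) of the residualized form of (7.18), together with a numerical check of the invariance under $T(y) = c(y + W\alpha)$:

import numpy as np

# log BF_null(W, L, V_b) via the residualized form, and the invariance check.
def log_bf_null_cov(y, W, L, Vb):
    n = len(y)
    P = np.eye(n) - W @ np.linalg.solve(W.T @ W, W.T)   # projection removing W
    X, yw = P @ L, P @ y
    XtX = X.T @ X
    quad = yw @ X @ np.linalg.solve(XtX + np.linalg.inv(Vb), X.T @ yw)
    _, logdet = np.linalg.slogdet(np.eye(L.shape[1]) + XtX @ Vb)
    return -0.5 * logdet - 0.5 * n * np.log1p(-quad / (yw @ yw))

rng = np.random.default_rng(6)
n, q, p = 150, 3, 2
W = np.column_stack([np.ones(n), rng.normal(size=(n, q - 1))])
L = rng.normal(size=(n, p))
y = W @ rng.normal(size=q) + L @ np.array([0.2, -0.1]) + rng.normal(size=n)
Vb = 0.5 * np.eye(p)

bf1 = log_bf_null_cov(y, W, L, Vb)
bf2 = log_bf_null_cov(3.7 * (y + W @ rng.normal(size=q)), W, L, Vb)
assert np.isclose(bf1, bf2)                             # invariance holds numerically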

7.3 Big-O and Little-O Notations

The Big-O and Little-O notations are very useful for studying the limiting behaviour of a sequence. Depending on whether the sequence is deterministic or stochastic, these notations have different meanings. We use $O(\cdot)$ and $o(\cdot)$ for deterministic sequences and $O_p(\cdot)$ and $o_p(\cdot)$ for stochastic sequences. The subscript "p" means "probabilistic". All proofs are omitted in this section. See Cox [2004, Chap. 3], Shao [2003, Chap. 1.5], and der Vaart [2000, Chap. 2.2] among others for proofs and more details.

Definition 7.5. Given two sequences of real numbers $a_n$ and $b_n$ and a sequence of random variables $X_n$, we write

(a) $a_n = O(b_n)$ if and only if there exist $N < \infty$ and $C \in (0, \infty)$ such that
$$|a_n| \leq C|b_n|, \quad \forall n > N;$$
(b) $a_n = o(b_n)$ if and only if for any $\epsilon > 0$, there exists an $N(\epsilon) < \infty$ such that
$$|a_n| \leq \epsilon|b_n|, \quad \forall n > N(\epsilon);$$
(c) $X_n = O_p(b_n)$ if and only if for any $\delta > 0$, there exist $N(\delta) < \infty$ and $C(\delta) < \infty$ such that
$$\mathbb{P}(|X_n| > C(\delta)|b_n|) < \delta, \quad \forall n > N(\delta);$$
(d) $X_n = o_p(b_n)$ if and only if for any $\epsilon > 0$ and $\delta > 0$, there exists $N(\epsilon, \delta) < \infty$ such that
$$\mathbb{P}(|X_n| > \epsilon|b_n|) < \delta, \quad \forall n > N(\epsilon, \delta).$$

In particular, if $X_n = O_p(1)$, we say the sequence $X_n$ is stochastically bounded; if $X_n = o_p(1)$, we say $X_n$ converges to 0 in probability. There are many rules for operating with Big-O and Little-O symbols. The following proposition lists some important ones that will be needed in the derivation of the asymptotic distribution for $\log\mathrm{BF}_{\mathrm{null}}$ in Chap. 2.1.2.

Proposition 7.17.

(a) If $\mathbb{P}(X_n = a_n) = 1$, then $X_n = O_p(b_n)$ if and only if $a_n = O(b_n)$.
(b) If $\mathbb{P}(X_n = a_n) = 1$, then $X_n = o_p(b_n)$ if and only if $a_n = o(b_n)$.
(c) $o_p(O(a_n)) = o_p(a_n)$.
(d) $o_p(a_n) + o_p(a_n) = o_p(a_n)$.
(e) $O_p(a_n) + o_p(a_n) = O_p(a_n)$.
(f) $O_p(a_n)o_p(b_n) = o_p(a_nb_n)$.

7.4 Distribution of a Weighted Sum of $\chi_1^2$ Random Variables

7.4.1 Davies' Method for Computing the Distribution Function

The characteristic function of a random variable $X$ is defined as
$$\phi_X(t) = \mathbb{E}[e^{itX}],$$
where $i$ is the imaginary unit. Unlike the moment-generating function, the characteristic function always exists (the integral is finite). Moreover, the characteristic function uniquely determines the distribution. See Durrett [2010, Chap. 2.3], Feller [1968, Chap. XV], and Resnick [2013, Chap. 9] for more information.

Consider a random variable $X \sim \chi_1^2$. Its characteristic function can be calculated as
$$\phi_X(t) = \mathbb{E}[e^{itX}] = \frac{1}{\sqrt{2\pi}}\int x^{-1/2}\exp\Big\{-\Big(\frac{1}{2} - it\Big)x\Big\}dx = (1 - 2it)^{-1/2}. \qquad (7.19)$$
Note that the existence of this integral can be verified by Euler's formula,
$$e^{itx} = \cos tx + i\sin tx. \qquad (7.20)$$
Next consider a linear combination of $\chi_1^2$ random variables:
$$\bar{Q} = \sum_{i=1}^{p}\lambda_iX_i, \qquad X_i \overset{\text{i.i.d.}}{\sim} \chi_1^2. \qquad (7.21)$$

We are interested in the computation of the distribution function of $\bar{Q}$. A key observation is that its characteristic function is readily available.

Lemma 7.18. The characteristic function of $\bar{Q} = \sum_{i=1}^{p}\lambda_iX_i$, where $X_i \overset{\text{i.i.d.}}{\sim} \chi_1^2$, is given by
$$\phi_{\bar{Q}}(t) = \prod_{i=1}^{p}(1 - 2i\lambda_it)^{-1/2}.$$

Proof. The result in fact follows directly from the properties of the characteristic function:
$$\phi_{\bar{Q}}(t) = \mathbb{E}\Big[\exp\Big(it\sum_{i=1}^{p}\lambda_iX_i\Big)\Big] = \prod_{i=1}^{p}\mathbb{E}[\exp(it\lambda_iX_i)] \quad \text{(by the independence of } X_1, \dots, X_p\text{)} = \prod_{i=1}^{p}\phi_{X_i}(\lambda_it) = \prod_{i=1}^{p}(1 - 2i\lambda_it)^{-1/2}.$$

Given the characteristic function, we can calculate the distribution function by the so-called inversion formula. There are different versions of the formula, among which the most general is Levy's inversion formula.

Theorem 7.19. (Levy's inversion formula) Let $\phi_Y(t)$ be the characteristic function of a random variable $Y$. For $a < b$,
$$\mathbb{P}(a < Y < b) + \frac{1}{2}\big[\mathbb{P}(Y = a) + \mathbb{P}(Y = b)\big] = \lim_{T \to \infty}\frac{1}{2\pi}\int_{-T}^{T}\frac{e^{-ita} - e^{-itb}}{it}\phi_Y(t)\,dt.$$

See Durrett [2010, Chap. 2.3] for a proof. For the random variable $\bar{Q}$ defined in (7.21), if $\lambda_1 \geq \cdots \geq \lambda_p > 0$, then we have
$$\mathbb{P}(\bar{Q} < c) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{1 - e^{-itc}}{it}\phi_{\bar{Q}}(t)\,dt.$$
This provides a way to numerically compute the tail probability of a linear combination of $\chi_1^2$ random variables. A more convenient method is to use Gil-Pelaez's inversion formula, which can be directly derived from Levy's inversion formula. See the original paper, Gil-Pelaez [1951], for a proof.

Theorem 7.20. (Gil-Pelaez's inversion formula) Let $\phi_Y(t)$ be the characteristic function of a continuous random variable $Y$. We have
$$F_Y(y) = \frac{1}{2} - \frac{1}{2\pi}\int_{-\infty}^{\infty}\frac{\Im[e^{-ity}\phi_Y(t)]}{t}\,dt,$$
where $\Im$ means to extract the imaginary part.

Davies [1973] showed that, using Gil-Pelaez's inversion formula,
$$\frac{1}{2} - \frac{1}{\pi}\sum_{k=0}^{\infty}\frac{\Im[\phi_Y((k + 1/2)\Delta)e^{-i(k+1/2)\Delta y}]}{k + 1/2} = \mathbb{P}(Y < y) + \sum_{n=1}^{\infty}(-1)^n\big\{\mathbb{P}(Y < y - 2\pi n/\Delta) - \mathbb{P}(Y > y + 2\pi n/\Delta)\big\}.$$
Hence, we may numerically compute the distribution function of $Y$ by
$$\mathbb{P}(Y < y) \approx \frac{1}{2} - \frac{1}{\pi}\sum_{k=0}^{K}\frac{\Im[\phi_Y((k + 1/2)\Delta)e^{-i(k+1/2)\Delta y}]}{k + 1/2}. \qquad (7.22)$$

There are two sources of error. First, we have omitted the term
$$\sum_{n=1}^{\infty}(-1)^n\big\{\mathbb{P}(Y < y - 2\pi n/\Delta) - \mathbb{P}(Y > y + 2\pi n/\Delta)\big\}.$$
But this term could be made arbitrarily small by choosing an appropriate $\Delta$. Second, in the summation in (7.22), there is a truncation error since we sum up to $k = K$ instead of $k = \infty$. In a later paper [Davies, 1980], Davies showed how to control this truncation error when $Y$ is a linear combination of independent $\chi_1^2$ random variables.
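The truncated sum (7.22) is simple to implement. Below is a minimal sketch (Python) using Lemma 7.18 for the characteristic function; the choices of $\Delta$ and $K$ are ad hoc here rather than the principled rules of Davies [1980], and the weights are arbitrary toy values:

import numpy as np

# Approximate P(Q < c) for Q = sum_i lam_i * chi^2_1 by truncating the Davies series (7.22).
def weighted_chisq_cdf(c, lam, delta=0.05, K=20000):
    k = np.arange(K + 1)
    t = (k + 0.5) * delta
    phi = np.prod((1 - 2j * np.outer(t, lam)) ** -0.5, axis=1)   # characteristic function (Lemma 7.18)
    terms = np.imag(phi * np.exp(-1j * t * c)) / (k + 0.5)
    return 0.5 - terms.sum() / np.pi

lam = np.array([1.0, 0.5, 0.2, 0.1])
c = 3.0
print(weighted_chisq_cdf(c, lam))

# Monte Carlo check of the same probability
rng = np.random.default_rng(7)
Q = (rng.chisquare(1, size=(200000, len(lam))) * lam).sum(axis=1)
print((Q < c).mean())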

7.4.2 Methods for Computing the Bounds for the P-values

Consider the random variable $\bar{Q}$ defined in (7.21). We now discuss methods for computing lower and upper bounds for $\mathbb{P}(\bar{Q} > c)$. Assume $\lambda_1 \geq \cdots \geq \lambda_p > 0$. These bounds are used to estimate the true p-value $\mathbb{P}(\bar{Q} > c)$ and hence should be very easy to evaluate.

First, clearly the upper bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \leq \mathbb{P}(\lambda_1\chi_p^2 > c). \qquad (7.23)$$
Similarly, we have $\mathbb{P}(\bar{Q} > c) \geq \mathbb{P}(\lambda_k\chi_k^2 > c)$ for $k = 1, \dots, p$. Thus the lower bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \geq \max_{1 \leq k \leq p}\mathbb{P}(\lambda_k\chi_k^2 > c). \qquad (7.24)$$

This method for computing the bounds is extremely fast to evaluate; however, the accuracy may be poor if $p$ is large and the weights cover a very wide range. Assuming $p$ is even, we now describe a better method for computing the bounds.

Since $\lambda_1 \geq \lambda_2$,
$$\mathbb{P}(\lambda_2\chi_2^2 > c) \leq \mathbb{P}(\lambda_1X_1 + \lambda_2X_2 > c) \leq \mathbb{P}(\lambda_1\chi_2^2 > c).$$
But $\chi_2^2$ is just an exponential random variable with rate parameter 1/2, of which the distribution function is very easy to compute and convolve! Let $Y_k \overset{\text{i.i.d.}}{\sim} \chi_2^2$. The upper bound and the lower bound can be computed by
$$\mathbb{P}(\bar{Q} > c) \leq \mathbb{P}\Big(\sum_{k=1}^{p/2}\lambda_{2k-1}Y_k > c\Big), \qquad \mathbb{P}(\bar{Q} > c) \geq \mathbb{P}\Big(\sum_{k=1}^{p/2}\lambda_{2k}Y_k > c\Big). \qquad (7.25)$$

To see why the convolution is fast, let's start from the simplest case, $p = 2$:
$$\mathbb{P}(\lambda_1Y_1 > c) = e^{-c/2\lambda_1}.$$
Next consider $p = 4$:
$$\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 > c) = 1 - \int_0^{c/\lambda_3}f_{\chi_2^2}(y)\,\mathbb{P}(\lambda_1Y_1 \leq c - \lambda_3y)\,dy = \frac{\lambda_1}{\lambda_1 - \lambda_3}e^{-c/2\lambda_1} + \frac{\lambda_3}{\lambda_3 - \lambda_1}e^{-c/2\lambda_3}.$$
Proceeding to $p = 6$:
$$\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 + \lambda_5Y_3 > c) = 1 - \int_0^{c/\lambda_5}f_{\chi_2^2}(y)\,\mathbb{P}(\lambda_1Y_1 + \lambda_3Y_2 \leq c - \lambda_5y)\,dy = \frac{\lambda_1^2e^{-c/2\lambda_1}}{(\lambda_1 - \lambda_3)(\lambda_1 - \lambda_5)} + \frac{\lambda_3^2e^{-c/2\lambda_3}}{(\lambda_3 - \lambda_1)(\lambda_3 - \lambda_5)} + \frac{\lambda_5^2e^{-c/2\lambda_5}}{(\lambda_5 - \lambda_1)(\lambda_5 - \lambda_3)}.$$

It is easy to generalize to any even $p > 0$. Letting $r = p/2$, we have
$$\mathbb{P}\Big(\sum_{k=1}^{r}\lambda_{2k-1}Y_k > c\Big) = \sum_{k=1}^{r}\frac{\lambda_{2k-1}^{r-1}e^{-c/2\lambda_{2k-1}}}{\prod_{j \neq k}(\lambda_{2k-1} - \lambda_{2j-1})}.$$
Hence, the bounds given in (7.25) are also very fast to evaluate.
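A minimal sketch (Python) of the bounds in (7.25), using the closed-form tail above; the weights are arbitrary toy values and the formula assumes the selected weights are distinct:

import numpy as np
from scipy.stats import chi2

# Closed-form tail of sum_k lam_k * Y_k with Y_k iid chi^2_2 (distinct weights assumed).
def hypoexp_tail(c, lams):
    lams = np.asarray(lams, dtype=float)
    r = len(lams)
    total = 0.0
    for k in range(r):
        denom = np.prod(lams[k] - np.delete(lams, k))
        total += lams[k] ** (r - 1) * np.exp(-c / (2 * lams[k])) / denom
    return total

lam = np.array([1.0, 0.8, 0.5, 0.3, 0.2, 0.1])      # lam_1 >= ... >= lam_p > 0
c = 6.0
upper = hypoexp_tail(c, lam[0::2])                   # uses lam_1, lam_3, lam_5
lower = hypoexp_tail(c, lam[1::2])                   # uses lam_2, lam_4, lam_6
crude_upper = chi2.sf(c / lam[0], df=len(lam))       # the simple bound (7.23)
print(lower, upper, crude_upper)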

7.5 GCTA and Linear Mixed Model

The linear mixed model is used by GCTA [Yang et al., 2010, Lee et al., 2011, Yang et al., 2011] to infer the heritability of phenotypes. The model may be written as
$$y = X\beta + Wu + \varepsilon, \qquad \varepsilon \sim \mathrm{MVN}(0, \sigma_\varepsilon^2I), \qquad u \sim \mathrm{MVN}(0, \sigma_u^2I), \qquad (7.26)$$
where $X$ is an $n \times q$ matrix and $W$ is an $n \times N$ matrix, by a slight abuse of notation. $\beta$ is called the fixed effects and $u$ is called the random effects. Equivalently, we can write
$$y = X\beta + g + \varepsilon \sim \mathrm{MVN}(X\beta, V), \qquad (7.27)$$
where
$$V = \sigma_\varepsilon^2H, \qquad H = \kappa A + I, \qquad A = \frac{1}{N}WW^t.$$
Hence, implicitly we have used $\kappa = N\sigma_u^2/\sigma_\varepsilon^2$. The model can be easily generalized to
$$y = X\beta + \sum_{i=1}^{r}g_i + \varepsilon, \qquad g_i \sim \mathrm{MVN}(0, \kappa_i\sigma_\varepsilon^2A_i).$$
For ease of reading, this introduction is restricted to the one-random-effect model (7.27).

GCTA estimates the variance components $\sigma_\varepsilon^2$ and $\sigma_u^2$ by REML (restricted/residual maximum likelihood), which is the classical approach for statistical inference in linear mixed models. For a thorough treatment in book form, see Jiang [2007] and Searle et al. [2009].

7.5.1 Restricted Maximum Likelihood Estimation

To gain intuition for REML estimation, recall that for $n$ observations $y_1, \dots, y_n$, the sample variance is given by $s^2 = \mathrm{SST}/(n-1)$, where $\mathrm{SST} = \sum(y_i - \bar{y})^2$ denotes the total sum of squares. It can be shown that $s^2$ is indeed unbiased. However, if we assume a normal distribution for the observations and compute the maximum likelihood estimator, we obtain $\hat{\sigma}_{\mathrm{ML}}^2 = \mathrm{SST}/n$, which is biased. The reason is that we have one "fixed effect", the expectation of $y_i$ (denoted by $\mu_y$), which we estimate by $\bar{y}$. This estimation, intuitively speaking, costs one degree of freedom and thus we prefer the estimator $s^2$. REML, in contrast, aims to construct a likelihood function that does not contain $\mu_y$; maximizing this restricted likelihood then yields an unbiased estimator for the variance.

For the linear mixed model (7.26), the idea of REML is to find an $n \times q$ full-rank matrix $L_1$ and an $n \times (n-q)$ full-rank matrix $L_2$ such that
$$L_1^tX = I, \qquad L_2^tX = 0,$$
and make inferences using $y_1 = L_1^ty$ and $y_2 = L_2^ty$. By (7.27),
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim \mathrm{MVN}\left(\begin{pmatrix} \beta \\ 0 \end{pmatrix}, \begin{pmatrix} L_1^tVL_1 & L_1^tVL_2 \\ L_2^tVL_1 & L_2^tVL_2 \end{pmatrix}\right).$$

The matrix $L_2$ is used to control for the loss of degrees of freedom caused by $X$. Now define the restricted likelihood as the likelihood for $\kappa$ and $\sigma_\varepsilon^2$ given the observation $y_2$. We write the log-restricted-likelihood as
$$l_r \overset{\text{def.}}{=} \log L(\sigma_\varepsilon^2, \kappa; y_2) = -\frac{n-q}{2}\log(2\pi) - \frac{1}{2}\log|L_2^tVL_2| - \frac{1}{2}y^tL_2(L_2^tVL_2)^{-1}L_2^ty.$$
Assuming $X$ has full rank, we have by Lemma 7.16
$$P_H \overset{\text{def.}}{=} H^{-1} - H^{-1}X(X^tH^{-1}X)^{-1}X^tH^{-1} = L_2(L_2^tHL_2)^{-1}L_2^t. \qquad (7.28)$$

The determinant term can also be transformed so that $L_2$ does not enter the calculation of the derivatives. By Lemma 7.3,
$$|L^tHL| = |L_2^tHL_2|\,|L_1^tHL_1 - L_1^tHL_2(L_2^tHL_2)^{-1}L_2^tHL_1|,$$
where $L = (L_1, L_2)$. By (7.28), the second term on the r.h.s. is equal to $|(X^tH^{-1}X)^{-1}|$. Thus,
$$\log|L^tHL| = \log|L_2^tHL_2| - \log|X^tH^{-1}X|.$$
Since both $L$ and $H$ are square matrices of full rank,
$$\log|L^tHL| = \log|H| + \log|L^tL|,$$
which yields
$$\log|L_2^tHL_2| = \log|H| + \log|L^tL| + \log|X^tH^{-1}X|.$$

Now omitting the constant term, we can rewrite the log-restricted-likelihood as
$$l_r = -\frac{1}{2}\big((n-q)\log\sigma_\varepsilon^2 + \log|H| + \log|X^tH^{-1}X| + y^tP_Hy/\sigma_\varepsilon^2\big). \qquad (7.29)$$

To compute the REML estimates for $\kappa$ and $\sigma_\varepsilon^2$, we need to differentiate $l_r$. For $\sigma_\varepsilon^2$, we have
$$\frac{\partial l_r}{\partial\sigma_\varepsilon^2} = -\frac{1}{2}\Big(\frac{n-q}{\sigma_\varepsilon^2} - \frac{y^tP_Hy}{\sigma_\varepsilon^4}\Big).$$
For $\kappa$, using the matrix differentiation rule $\frac{\partial\log|M|}{\partial x} = \mathrm{tr}\big(M^{-1}\frac{\partial M}{\partial x}\big)$, we obtain
$$\frac{\partial l_r}{\partial\kappa} = -\frac{1}{2}\Big[\mathrm{tr}(P_HA) - \frac{1}{\sigma_\varepsilon^2}y^tP_HAP_Hy\Big]$$

after heavy calculations. The REML estimates for $\kappa$ and $\sigma_\varepsilon^2$, which are unbiased, are then obtained by solving
$$\left.\frac{\partial l_r}{\partial\sigma_\varepsilon^2}\right|_{\hat{\sigma}_\varepsilon^2} = 0, \qquad \left.\frac{\partial l_r}{\partial\kappa}\right|_{\hat{\kappa}} = 0. \qquad (7.30)$$
However, this cannot be solved analytically (note that $P_H$ depends on the parameter $\kappa$).

7.5.2 Newton-Raphson's Method for Computing REML Estimates

To solve (7.30), the standard approach is Newton-Raphson's optimization method [Chong and Zak, 2013, Chap. 9]. Let $\theta = (\kappa, \sigma_\varepsilon^2)$. We start from an initial guess $\theta^{(0)}$ and then update it by
$$\theta^{(k+1)} = \theta^{(k)} - \left(\frac{\partial^2l_r}{\partial\theta\partial\theta^t}\right)^{-1}\frac{\partial l_r}{\partial\theta},$$

where $\frac{\partial^2l_r}{\partial\theta\partial\theta^t}$ is called the Hessian matrix. In most statistical applications, the Hessian matrix is computed or estimated by either the observed Fisher information matrix, $J$, or the expected Fisher information matrix, $I$. The corresponding iteration formulae are given by
$$\theta^{(k+1)} = \theta^{(k)} + J(\theta)^{-1}\frac{\partial l_r}{\partial\theta}; \qquad \theta^{(k+1)} = \theta^{(k)} + I(\theta)^{-1}\frac{\partial l_r}{\partial\theta}.$$

For our problem, we can calculate
$$J_{\sigma_\varepsilon^2\sigma_\varepsilon^2} = -\frac{\partial^2l_r}{\partial(\sigma_\varepsilon^2)^2} = -\frac{1}{2}\Big(\frac{n-q}{\sigma_\varepsilon^4} - \frac{2}{\sigma_\varepsilon^6}y^tP_Hy\Big), \qquad J_{\kappa\kappa} = -\frac{\partial^2l_r}{\partial\kappa^2} = -\frac{1}{2}\Big(\mathrm{tr}(P_HAP_HA) - \frac{2}{\sigma_\varepsilon^2}y^tP_HAP_HAP_Hy\Big),$$
$$J_{\kappa\sigma_\varepsilon^2} = -\frac{\partial^2l_r}{\partial\kappa\partial\sigma_\varepsilon^2} = \frac{1}{2}\cdot\frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy, \qquad I_{\sigma_\varepsilon^2\sigma_\varepsilon^2} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial(\sigma_\varepsilon^2)^2}\Big] = \frac{1}{2}\cdot\frac{n-q}{\sigma_\varepsilon^4},$$
$$I_{\kappa\kappa} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial\kappa^2}\Big] = \frac{1}{2}\mathrm{tr}(P_HAP_HA), \qquad I_{\kappa\sigma_\varepsilon^2} = \mathbb{E}\Big[-\frac{\partial^2l_r}{\partial\kappa\partial\sigma_\varepsilon^2}\Big] = \frac{1}{2}\cdot\frac{1}{\sigma_\varepsilon^2}\mathrm{tr}(P_HA).$$

Noticing that $J_{\sigma_\varepsilon^2\sigma_\varepsilon^2}$ and $I_{\sigma_\varepsilon^2\sigma_\varepsilon^2}$ (likewise $J_{\kappa\kappa}$ and $I_{\kappa\kappa}$) contain the same term with opposite signs, in practice we use the average information matrix [Gilmour et al., 1995], which is more convenient to compute,
$$\mathrm{AI}(\sigma_\varepsilon^2, \kappa) = \frac{1}{2}\begin{pmatrix} \frac{1}{\sigma_\varepsilon^6}y^tP_Hy & \frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy \\ \frac{1}{\sigma_\varepsilon^4}y^tP_HAP_Hy & \frac{1}{\sigma_\varepsilon^2}y^tP_HAP_HAP_Hy \end{pmatrix}.$$
The diagonal elements are the averages of those of $J$ and $I$, but the off-diagonal elements are simply chosen equal to those of $J$ so that $\mathrm{AI}$ is positive definite.

7.5.3 Details of GCTA’s Implementation of REML

Estimations

2 2 Parametrization GCTA uses the parametrization (σε , σg ) where

2 2 2 σg = κσε = Nσu.

By defining
$$P \overset{\text{def.}}{=} \sigma_\varepsilon^{-2}P_H = V^{-1} - V^{-1}X(X^tV^{-1}X)^{-1}X^tV^{-1},$$
we have
$$\frac{\partial l_r}{\partial\sigma_g^2} = -\frac{1}{2}\big[\mathrm{tr}(PA) - y^tPAPy\big], \qquad \mathrm{AI}(\sigma_\varepsilon^2, \sigma_g^2) = \frac{1}{2}\begin{pmatrix} y^tPPPy & y^tPAPPy \\ y^tPPAPy & y^tPAPAPy \end{pmatrix}.$$
The REML estimates for $\sigma_\varepsilon^2$ and $\sigma_g^2$ are then computed by Newton-Raphson's method.
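The sketch below (Python, toy data) illustrates one average-information update in this parametrization, following the score and AI formulas above with $\partial V/\partial\sigma_\varepsilon^2 = I$ and $\partial V/\partial\sigma_g^2 = A$. It is not GCTA's actual implementation: it uses dense linear algebra, includes no positivity safeguards, and the starting values, sample sizes and true variance components are arbitrary choices.

import numpy as np

# One AI-REML update in the (sigma_e^2, sigma_g^2) parametrization.
def ai_reml_step(y, X, A, s2e, s2g):
    n = len(y)
    V = s2e * np.eye(n) + s2g * A
    Vi = np.linalg.inv(V)
    P = Vi - Vi @ X @ np.linalg.solve(X.T @ Vi @ X, X.T @ Vi)
    Py = P @ y
    APy = A @ Py
    # score: dl_r/d theta_i = -0.5 * [tr(P dV/dtheta_i) - y'P dV/dtheta_i P y]
    score = np.array([-0.5 * (np.trace(P) - Py @ Py),
                      -0.5 * (np.trace(P @ A) - Py @ APy)])
    # average information: 0.5 * y'P dV/dtheta_i P dV/dtheta_j P y
    AI = 0.5 * np.array([[Py @ P @ Py,  Py @ P @ APy],
                         [APy @ P @ Py, APy @ P @ APy]])
    return np.array([s2e, s2g]) + np.linalg.solve(AI, score)

rng = np.random.default_rng(8)
n, N = 300, 500
W = rng.normal(size=(n, N))
A = W @ W.T / N
X = np.ones((n, 1))                      # intercept only
g = rng.multivariate_normal(np.zeros(n), 0.5 * A)
y = X @ np.array([1.0]) + g + rng.normal(scale=np.sqrt(0.5), size=n)

theta = np.array([1.0, 1.0])             # starting values (sigma_e^2, sigma_g^2)
for _ in range(20):
    theta = ai_reml_step(y, X, A, *theta)
print("sigma_e^2, sigma_g^2 estimates:", theta)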

Standard Errors of the Estimates Since maximum likelihood estimates are asymptotically normal with covariance matrix $I^{-1}$ and $\mathrm{AI}$ is a consistent estimator for $I$, we may compute the standard errors for the estimates $\hat{\sigma}_\varepsilon^2$ and $\hat{\sigma}_g^2$ by calculating
$$\mathrm{AI}^{-1} = \begin{pmatrix} \mathrm{AI}^{11} & \mathrm{AI}^{12} \\ \mathrm{AI}^{21} & \mathrm{AI}^{22} \end{pmatrix}.$$
Then the standard errors can be computed by
$$\mathrm{SE}(\hat{\sigma}_\varepsilon^2) = \sqrt{\mathrm{AI}^{11}}, \qquad \mathrm{SE}(\hat{\sigma}_g^2) = \sqrt{\mathrm{AI}^{22}}.$$

Heritability Estimation In genome-wide association studies, the matrix $W$ is composed of the dosages of the SNPs. If each column of $W$ is normalized to unit variance, then the heritability of the phenotype $y$ can be estimated by
$$\hat{h}^2 = \frac{\hat{\sigma}_g^2}{\hat{\sigma}_\varepsilon^2 + \hat{\sigma}_g^2} = \frac{\hat{\kappa}}{1 + \hat{\kappa}}.$$
GCTA calls this the "variance explained by the genome-wide SNPs". Define $\sigma_p^2 = \sigma_\varepsilon^2 + \sigma_g^2$. Clearly $\hat{\sigma}_p^2 = \hat{\sigma}_\varepsilon^2 + \hat{\sigma}_g^2$ is also unbiased. To compute the standard error for $\hat{h}^2$, we use the first-order Taylor expansion,
$$\mathrm{Var}\Big(\frac{\hat{\sigma}_g^2}{\hat{\sigma}_p^2}\Big) \approx \frac{\sigma_g^4}{\sigma_p^4}\Big[\frac{\mathrm{Var}(\hat{\sigma}_g^2)}{\sigma_g^4} - \frac{2\,\mathrm{Cov}(\hat{\sigma}_g^2, \hat{\sigma}_p^2)}{\sigma_g^2\sigma_p^2} + \frac{\mathrm{Var}(\hat{\sigma}_p^2)}{\sigma_p^4}\Big],$$

where
$$\mathrm{Var}(\hat{\sigma}_g^2) = \mathrm{AI}^{22}, \qquad \mathrm{Var}(\hat{\sigma}_p^2) = \mathrm{AI}^{11} + \mathrm{AI}^{22} + 2\mathrm{AI}^{12}, \qquad \mathrm{Cov}(\hat{\sigma}_g^2, \hat{\sigma}_p^2) = \mathrm{AI}^{22} + \mathrm{AI}^{12}.$$
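The following short sketch (Python) puts the delta-method formula above into code; the REML estimates and the inverse average information matrix are purely illustrative numbers, not output from a real analysis:

import numpy as np

# Heritability point estimate and its delta-method standard error.
s2e_hat, s2g_hat = 0.55, 0.45
AI_inv = np.array([[0.004, -0.002],       # ordering (sigma_e^2, sigma_g^2); illustrative values
                   [-0.002, 0.006]])

s2p_hat = s2e_hat + s2g_hat
h2_hat = s2g_hat / s2p_hat

var_g = AI_inv[1, 1]
var_p = AI_inv[0, 0] + AI_inv[1, 1] + 2 * AI_inv[0, 1]
cov_gp = AI_inv[1, 1] + AI_inv[0, 1]
var_h2 = (s2g_hat / s2p_hat) ** 2 * (var_g / s2g_hat**2
                                     - 2 * cov_gp / (s2g_hat * s2p_hat)
                                     + var_p / s2p_hat**2)
print("h2 =", h2_hat, "SE =", np.sqrt(var_h2))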

Calculation of the P-value The p-value in the GCTA output is calculated by the (restricted) likelihood ratio test. The null hypothesis is $\sigma_g^2 = 0$ and the alternative is $\sigma_g^2 > 0$. To calculate the maximum log-restricted-likelihood under the alternative hypothesis, denoted by $l_{r1}$, we simply plug the REML estimates $\hat{\sigma}_\varepsilon^2$ and $\hat{\sigma}_g^2$ into (7.29). Similarly, the maximum log-restricted-likelihood under the null hypothesis, denoted by $l_{r0}$, can be computed by plugging in
$$\hat{\sigma}_{g0}^2 = 0, \qquad \hat{\sigma}_{\varepsilon0}^2 = \sum(y_i - \bar{y})^2/(n - q).$$

Note that this is not a standard setting for the likelihood ratio test, because the null hypothesis lies on the boundary of the parameter space and thus the standard asymptotic result for the likelihood ratio test does not apply. Stram and Lee [1994] showed that asymptotically $-2(l_{r0} - l_{r1})$ follows $0.5\delta_0 + 0.5\chi_1^2$ ($\delta_0$ denotes a degenerate distribution with unit probability at 0), which was in fact a result from Self and Liang [1987]. Hence GCTA computes the p-value by
$$P = \frac{1}{2}\Pr\big(\chi_1^2 > -2(l_{r0} - l_{r1})\big). \qquad (7.31)$$

However, this asymptotic result is likely to perform poorly for a finite sample size.

Methods proposed by Crainiceanu and Ruppert [2004], Greven et al. [2012] may

be considered to produce more reliable p-values.

BLUP Estimation In most applications, we do not estimate $u$ but only estimate the variance of the random effects. If one does want to estimate $u$ or $g = Wu$, the standard choice is to use the BLUPs (Best Linear Unbiased Predictors). The BLUPs have very similar statistical properties to the BLUEs (Best Linear Unbiased Estimators) in linear regression, and why they are called "predictors" is not very clear. The idea of BLUP estimation is to compute the conditional expectation given $y_2$. For example, if we want to estimate a quantity $a$ which follows a normal distribution, we use
$$\hat{a} = \mathbb{E}[a \mid L_2^ty] = \mathrm{Cov}(a, L_2^ty)^t\,\mathrm{Var}(L_2^ty)^{-1}L_2^ty.$$

Thus for $u$, $g$ and $\varepsilon$, we have
$$\hat{g} = \sigma_g^2APy = \kappa AP_Hy, \qquad \hat{\varepsilon} = \sigma_\varepsilon^2Py = P_Hy, \qquad \hat{u} = \sigma_g^2W^tPy/N. \qquad (7.32)$$

But notice that in the GCTA output, $\hat{u}_i$ is rescaled by $\sqrt{2f_i(1 - f_i)}$, where $f_i$ is the minor allele frequency of the $i$-th SNP, so that it can be directly applied to the unscaled genotype data.

7.6 Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm [Hastings, 1970] is probably the most important example of Markov chain Monte Carlo (MCMC) methods. As suggested by the name of MCMC, it is a sampling method based on Markov chains. The reader is referred to Ross [1996], Levin et al. [2009] and Meyn and Tweedie [2012] for an introduction to Markov chains. The following result from Markov chain theory is key to understanding the Metropolis-Hastings algorithm.

Proposition 7.21. (Detailed balance condition) Let $P$ be the transition matrix of a Markov chain with a countable state space $\Omega$. If a distribution $\pi$ satisfies the detailed balance condition,
$$\pi(x)P(x, y) = \pi(y)P(y, x), \quad \forall x, y \in \Omega,$$
then $\pi$ is a stationary distribution for $P$.

Proof. By the detailed balance condition, $\forall y \in \Omega$,
$$\sum_{x \in \Omega}\pi(x)P(x, y) = \sum_{x \in \Omega}\pi(y)P(y, x) = \pi(y).$$
Treating $\pi$ as a row vector, we may write $\pi P = \pi$, which is the definition of a stationary distribution.

For a general state space $S$, the detailed balance condition is given by [Green and Mira, 2001]
$$\int_{(x,y) \in A \times B}\pi(dx)P(x, dy) = \int_{(x,y) \in A \times B}\pi(dy)P(y, dx), \quad \forall \text{ Borel sets } A, B \subset S. \qquad (7.33)$$

We are now ready to formalize the Metropolis-Hastings algorithm.

Proposition 7.22. (Metropolis-Hastings algorithm) Consider a countable state space $\Omega$ and a transition matrix $Q$ (but we will call it the proposal matrix) such that

• $Q$ is irreducible and aperiodic;
• if $Q(x, y) > 0$, then $Q(y, x) > 0$.

Let $\pi$ be a probability distribution on $\Omega$. The Metropolis-Hastings algorithm defines a Markov chain that starts from $x^{(0)}$ such that $\pi(x^{(0)}) > 0$ and moves according to the following rule:

• Given the current state $x^{(k)}$, propose a new state $y$ according to the distribution $Q(x, \cdot)$.
• Compute the acceptance ratio
$$\alpha(x, y) = \min\Big\{1, \frac{\pi(y)Q(y, x)}{\pi(x)Q(x, y)}\Big\}. \qquad (7.34)$$
• Set $x^{(k+1)} = y$ with probability $\alpha(x, y)$, and set $x^{(k+1)} = x^{(k)}$ with probability $1 - \alpha(x, y)$.

Then $\pi$ is the unique stationary and limiting distribution for this Markov chain.

Proof. We start the proof by checking the detailed balance condition. Let $P$ be the actual transition matrix of the Metropolis-Hastings Markov chain. For any $x, y \in \Omega$, clearly at least one of $\alpha(x, y)$ and $\alpha(y, x)$ must equal 1. Assume $\alpha(x, y) = \pi(y)Q(y, x)/\pi(x)Q(x, y)$ and $\alpha(y, x) = 1$. Then,
$$\pi(x)P(x, y) = \pi(x)Q(x, y)\alpha(x, y) = \pi(x)Q(x, y)\frac{\pi(y)Q(y, x)}{\pi(x)Q(x, y)} = \pi(y)Q(y, x) = \pi(y)P(y, x).$$
By Proposition 7.21, $\pi$ must be a stationary distribution for $P$. Let $\Omega^+ = \{x \in \Omega : \pi(x) > 0\}$. Since $\alpha(x, y)$ is always greater than 0 and $Q$ is irreducible on $\Omega$, $P$ is irreducible on $\Omega^+$. Since $Q$ is aperiodic, $P$ is also aperiodic. By standard Markov chain theory and, in particular, the ergodic theorem [Durrett, 2010, Chap. 7], $\pi$ is both the stationary and the limiting distribution for $P$.

We make two comments. First, the aperiodicity of $Q$ is not necessary at all. As long as for some $x, y \in \Omega^+$ we have $Q(x, y) > 0$ and $\alpha(x, y) \in (0, 1)$, $P$ must be aperiodic, since the chain may stay at $x$ for any positive number of steps with positive probability. Second, there are many other choices of the acceptance ratio such that the detailed balance condition holds. However, Peskun [1973] proved that, for a discrete state space, the acceptance ratio given in (7.34) is optimal in terms of statistical efficiency. This ratio is also called the Metropolis-Hastings ratio or simply the Hastings ratio.

For variations of the Metropolis-Hastings algorithm, see Liu [2008] and Brooks et al. [2011] among others.
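A minimal sketch (Python) of Proposition 7.22 on a small discrete state space; the target distribution and the symmetric random-walk proposal (under which $Q$ cancels in the Hastings ratio) are arbitrary illustrative choices:

import numpy as np

# Metropolis-Hastings sampler on {0,...,4} with a symmetric neighbour proposal.
rng = np.random.default_rng(9)
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # target distribution
n_states, n_iter = len(pi), 200000

x = 2                                              # start where pi(x) > 0
counts = np.zeros(n_states)
for _ in range(n_iter):
    y = (x + rng.choice([-1, 1])) % n_states       # propose a neighbour (wraps around)
    alpha = min(1.0, pi[y] / pi[x])                # Hastings ratio; Q is symmetric here
    if rng.random() < alpha:
        x = y
    counts[x] += 1

print(counts / n_iter)                             # close to the target pi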

7.7 Used Real Datasets

7.7.1 Merged Intraocular Pressure Dataset

We applied for access and downloaded two GWAS datasets from the database of

Genotypes and Phenotypes (dbGaP). Both studies were funded by the National

Eye Institute. One is the Ocular Hypertension Treatment Study [Kass et al., 2002]

(henceforth OHTS, dbGaP accession number: phs000240.v1.p1), and the other is National Eye Institute Human Genetics Collaboration Consortium Glaucoma

Genome-Wide Association Study [Ulmer et al., 2012] (henceforth NEIGHBOR, dbGaP accession number: phs000238.v1.p1). The phenotype of interest is the intraocular pressure (IOP). The OHTS dataset only contains individuals with high IOP (≥ 21). The NEIGHBOR dataset is a case-control design for glaucoma [Ulmer et al., 2012, Weinreb et al., 2014], in which many samples have IOP measurements, because a high IOP is considered a major risk factor for glaucoma. The NEIGHBOR dataset, however, contains case samples with small IOP and control samples with large IOP. To reduce the effect of any potentially confounding factors, we removed those samples. We also removed samples whose IOP measurements differ by more than 10 between the two eyes, since such a large difference is likely to be caused by physical accidents. We noticed that there is an additional column defined as I(max IOP > 21) in the original phenotype file. However, this column conflicts with the IOP measurements of the two eyes for some samples. Such samples were removed as well. The average IOP of the two eyes was used as the raw phenotype.

We then performed the routine quality control for the genotypes using the procedures described in Xu and Guan [2014]. OHTS and NEIGHBOR were genotyped on different SNP arrays and finally 301,143 SNPs genotyped in both studies passed the quality control. We then performed principal component analysis to remove outliers and extracted 3,226 subjects (740 from OHTS and 2,486 from

We then performed the routine quality control for the genotypes using the procedures described in Xu and Guan [2014]. OHTS and NEIGHBOR were geno- typed on different SNP arrays and finally 301, 143 SNPs genotyped in both studies passed the quality control. We then performed principal component analysis to remove outliers and extracted 3, 226 subjects (740 from OHTS and 2486 from

NEIGHBOR) that were clustered around European samples in HapMap3 [The

International HapMap Consortium, 2010].

We refer to this dataset as the IOP dataset in this manuscript. In the simu- lation study of the p-value calibration (Chap. 2.3.4), we only used the genotype and the phenotype was simulated.

7.7.2 Height Dataset

The Height dataset refers to the dataset used in Yang et al. [2010], from which GCTA estimated a 44.6% heritability for height. It contains 3,925 subjects and 294,831 SNPs. All the individuals are of European descent and are unrelated with each other; hence there is no need to control for population stratification. Strict quality control procedures had already been performed (see Yang et al. [2010]). In our simulation studies, we simply removed the SNPs with missing rate > 0.01 or MAF < 0.01, and 274,719 SNPs remained.

Bibliography

Milton Abramowitz and Irene A Stegun. Handbook of mathematical functions:

with formulas, graphs, and mathematical tables, volume 55. Courier Corpora-

tion, 1964.

Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.

James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychoto-

mous response data. Journal of the American statistical Association, 88(422):

669–679, 1993.

Grégoire Allaire, Sidi Mahmoud Kaber, and Karim Trabelsi. Numerical linear

algebra, volume 55. Springer, 2008.

Hana Lango Allen, Karol Estrada, Guillaume Lettre, Sonja I Berndt, Michael N

Weedon, Fernando Rivadeneira, Cristen J Willer, Anne U Jackson, Sailaja

Vedantam, Soumya Raychaudhuri, et al. Hundreds of variants clustered in

genomic loci and biological pathways affect human height. Nature, 467(7317):

832–838, 2010.

Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for

efficient monte carlo computations. The Annals of Statistics, pages 697–725,

2009.

211 Jennifer Asimit and Eleftheria Zeggini. Rare variant association analysis methods

for complex traits. Annual review of genetics, 44:293–308, 2010.

David J Balding. A tutorial on statistical methods for population association

studies. Nature Reviews Genetics, 7(10):781–791, 2006.

Roderick D Ball. Bayesian methods for quantitative trait loci mapping based on

model selection: approximate analysis using the bayesian information criterion.

Genetics, 159(3):1351–1364, 2001.

Yael Baran, Bogdan Pasaniuc, Sriram Sankararaman, Dara G Torgerson, Christo-

pher Gignoux, Celeste Eng, William Rodriguez-Cintron, Rocio Chapela, Jean G

Ford, Pedro C Avila, et al. Fast and accurate inference of local ancestry in latino

populations. Bioinformatics, 28(10):1359–1367, 2012.

Gregory S Barsh, Gregory P Copenhaver, Greg Gibson, and Scott M Williams.

Guidelines for genome-wide association studies. PLoS genetics, 8(7):e1002812,

2012.

Maurice S Bartlett. Properties of sufficiency and statistical tests. Proceedings

of the Royal Society of London. Series A, Mathematical and Physical Sciences,

pages 268–282, 1937.

Maurice S Bartlett. A comment on D. V. Lindley’s statistical paradox. Biometrika,

44(1-2):533–534, 1957.

Johannes Bausch. On the efficient calculation of a linear combination of chi-

square random variables with an application in counting string vacua. Journal

of Physics A: Mathematical and Theoretical, 46(50):505202, 2013.

212 Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a prac-

tical and powerful approach to multiple testing. Journal of the Royal Statistical

Society. Series B (Methodological), pages 289–300, 1995.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for linear models.

Bayesian statistics, 5:25–44, 1996a.

James O Berger and Luis R Pericchi. The intrinsic bayes factor for model selection

and prediction. Journal of the American Statistical Association, 91(433):109–

122, 1996b.

James O Berger and Thomas Sellke. Testing a point null hypothesis: the irreconcil-

ability of p values and evidence. Journal of the American statistical Association,

82(397):112–122, 1987.

James O Berger, Luis R Pericchi, JK Ghosh, Tapas Samanta, Fulvio De Santis,

JO Berger, and LR Pericchi. Objective bayesian methods for model selection:

introduction and comparison. Lecture Notes-Monograph Series, pages 135–207,

2001.

Peter J Bickel and JK Ghosh. A decomposition for the likelihood ratio statistic

and the bartlett correction–a bayesian argument. The Annals of Statistics, 18

(3):1070–1090, 1990.

Joanna M Biernacka, Rui Tang, Jia Li, Shannon K McDonnell, Kari G Rabe,

Jason P Sinnwell, David N Rider, Mariza De Andrade, Ellen L Goode, and

Brooke L Fridley. Assessment of genotype imputation methods. In BMC pro-

ceedings, volume 3, page 1. BioMed Central, 2009.

Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006.

213 David Blackwell. Conditional expectation and unbiased sequential estimation. The

Annals of Mathematical Statistics, pages 105–110, 1947.

George EP Box. A general distribution theory for a class of likelihood criteria.

Biometrika, 36(3/4):317–346, 1949.

Karl W Broman and Terence P Speed. A model selection approach for the identi-

fication of quantitative trait loci in experimental crosses. Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 64(4):641–656, 2002.

Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng. Handbook of

Markov Chain Monte Carlo. CRC press, 2011.

Brian L Browning and Sharon R Browning. A unified approach to genotype impu-

tation and haplotype-phase inference for large data sets of trios and unrelated

individuals. The American Journal of Human Genetics, 84(2):210–223, 2009.

Paul R Burton, David G Clayton, Lon R Cardon, Nick Craddock, Panos Deloukas,

Audrey Duncanson, Dominic P Kwiatkowski, Mark I McCarthy, Willem H

Ouwehand, Nilesh J Samani, et al. Genome-wide association study of 14,000

cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):

661–678, 2007.

William S Bush and Jason H Moore. Genome-wide association studies. PLoS

Comput Biol, 8(12):e1002822, 2012.

Peter Carbonetto and Matthew Stephens. Scalable variational inference for

bayesian variable selection in regression, and its accuracy in genetic associa-

tion studies. Bayesian analysis, 7(1):73–108, 2012.

Bradley P Carlin and Siddhartha Chib. Bayesian model choice via markov chain

214 monte carlo methods. Journal of the Royal Statistical Society. Series B (Method-

ological), pages 473–484, 1995.

George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury

Pacific Grove, CA, 2002.

George Casella and Christian P Robert. Rao-blackwellisation of sampling schemes.

Biometrika, 83(1):81–94, 1996.

Zhen Chen and David B Dunson. Random effects selection in linear mixed models.

Biometrics, 59(4):762–769, 2003.

Edwin KP Chong and Stanislaw H Zak. An introduction to optimization, vol-

ume 76. John Wiley & Sons, 2013.

Francis S Collins, Mark S Guyer, and Aravinda Chakravarti. Variations on a

theme: cataloging human dna sequence variation. Science, 278(5343):1580–

1581, 1997.

Psychiatric GWAS Consortium Coordinating Committee. Genomewide associa-

tion studies: history, rationale, and prospects for psychiatric disorders. Ameri-

can Journal of Psychiatry, 2009.

Karen N Conneely and Michael Boehnke. So many correlated tests, so little time!

rapid adjustment of p values for multiple correlated tests. The American Journal

of Human Genetics, 81(6):1158–1168, 2007.

EH Corder, AM Saunders, WJ Strittmatter, DE Schmechel, PC Gaskell, GW

Small, AD Roses, JL Haines, and Margaret A Pericak-Vance. Gene dose of

apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset

families. Science, 261(5123):921–923, 1993.

215 Dennis D. Cox. The Theory of Statistics and Its Applications. 2004. unpublished

book.

Ciprian M Crainiceanu and David Ruppert. Likelihood ratio tests in linear mixed

models with one variance component. Journal of the Royal Statistical Society:

Series B (Statistical Methodology), 66(1):165–185, 2004.

Paul Damien and Stephen G Walker. Sampling truncated normal, beta, and

gamma densities. Journal of Computational and Graphical Statistics, 10(2):

206–215, 2001.

Robert B Davies. Numerical inversion of a characteristic function. Biometrika, 60

(2):415–417, 1973.

Robert B Davies. Algorithm AS 155: The distribution of a linear combination of

χ2 random variables. Applied Statistics, pages 323–333, 1980.

Paul IW de Bakker, Roman Yelensky, Itsik Pe’er, Stacey B Gabriel, Mark J Daly,

and David Altshuler. Efficiency and power in genetic association studies. Nature

genetics, 37(11):1217–1223, 2005.

Olivier Delaneau, Cédric Coulonges, and Jean-François Zagury. Shape-it: new

rapid and accurate algorithm for haplotype inference. BMC bioinformatics, 9

(1):1, 2008.

Petros Dellaportas, Jonathan J Forster, and Ioannis Ntzoufras. On bayesian model

and variable selection using mcmc. Statistics and Computing, 12(1):27–36, 2002.

Aad W der Vaart. Asymptotic statistics, volume 3. Cambridge university press,

2000.

B Devlin and Neil Risch. A comparison of linkage disequilibrium measures for

fine-scale mapping. Genomics, 29(2):311–322, 1995.

216 B Devlin, Kathryn Roeder, and Larry Wasserman. Genomic control, a new ap-

proach to genetic-based association studies. Theoretical population biology, 60

(3):155–166, 2001.

Bernie Devlin and Kathryn Roeder. Genomic control for association studies. Bio-

metrics, 55(4):997–1004, 1999.

Randal Douc and Christian P Robert. A vanilla rao–blackwellization of

metropolis–hastings algorithms. The Annals of Statistics, 39(1):261–277, 2011.

Norman R Draper and R Craig Van Nostrand. Ridge regression and james-stein

estimation: review and comments. Technometrics, 21(4):451–466, 1979.

Frank Dudbridge and Arief Gusnanto. Estimation of significance thresholds for

genomewide association scans. Genetic epidemiology, 32(3):227–234, 2008.

Richard H Duerr, Kent D Taylor, Steven R Brant, John D Rioux, Mark S Sil-

verberg, Mark J Daly, A Hillary Steinhart, Clara Abraham, Miguel Regueiro,

Anne Griffiths, et al. A genome-wide association study identifies il23r as an

inflammatory bowel disease gene. science, 314(5804):1461–1463, 2006.

Rick Durrett. Probability: theory and examples. Cambridge university press, 2010.

Douglas F Easton, Karen A Pooley, Alison M Dunning, Paul DP Pharoah, Deb-

orah Thompson, Dennis G Ballinger, Jeffery P Struewing, Jonathan Morrison,

Helen Field, Robert Luben, et al. Genome-wide association study identifies

novel breast cancer susceptibility loci. Nature, 447(7148):1087–1093, 2007.

Albert O Edwards, Robert Ritter, Kenneth J Abel, Alisa Manning, Carolien Pan-

huysen, and Lindsay A Farrer. Complement factor h polymorphism and age-

related macular degeneration. Science, 308(5720):421–424, 2005.

Lars Eldén. Algorithms for the regularization of ill-conditioned least squares prob-

lems. BIT Numerical Mathematics, 17(2):134–145, 1977.

William Feller. An introduction to probability theory and its applications: volume

II, volume 3. John Wiley & Sons London-New York-Sydney-Toronto, 1968.

Ronald A Fisher. XV.—The correlation between relatives on the supposition of

Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52(02):

399–433, 1919.

Charles W Fox and Stephen J Roberts. A tutorial on variational bayesian infer-

ence. Artificial intelligence review, 38(2):85–95, 2012.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statis-

tical learning, volume 1. Springer series in statistics Springer, Berlin, 2001.

Francis Galton. Natural inheritance. Macmillan, 1894.

Eric R Gamazon, Heather E Wheeler, Kaanan P Shah, Sahar V Mozaffari, Keston

Aquino-Michaels, Robert J Carroll, Anne E Eyler, Joshua C Denny, Dan L

Nicolae, Nancy J Cox, et al. A gene-based association method for mapping

traits using reference transcriptome data. Nature genetics, 47(9):1091–1098,

2015.

Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian

data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA, 2014.

Edward I George and Robert E McCulloch. Variable selection via gibbs sampling.

Journal of the American Statistical Association, 88(423):881–889, 1993.

Edward I George and Robert E McCulloch. Approaches for bayesian variable

selection. Statistica sinica, pages 339–373, 1997.

M Gielen, PJ Lindsey, Cathérine Derom, HJM Smeets, NY Souren, ADC

Paulussen, R Derom, and JG Nijhuis. Modeling genetic and environmental

factors to increase heritability and ease the identification of candidate genes for

birth weight: a twin study. Behavior genetics, 38(1):44–54, 2008.

J Gil-Pelaez. Note on the inversion theorem. Biometrika, 38(3-4):481–482, 1951.

Arthur R Gilmour, Robin Thompson, and Brian R Cullis. Average information

reml: an efficient algorithm for variance parameter estimation in linear mixed

models. Biometrics, pages 1440–1450, 1995.

Simon J Godsill. On the relationship between MCMC model uncertainty methods.

Cambridge University, Engineering, Department, 1998.

Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU

Press, 2012.

IJ Good. The bayes/non-bayes compromise: A brief review. Journal of the Amer-

ican Statistical Association, 87(419):597–606, 1992.

Steven N Goodman. Toward evidence-based medical statistics. 2: The bayes

factor. Annals of internal medicine, 130(12):1005–1013, 1999.

Brian Gough. GNU scientific library reference manual. Network Theory Ltd.,

2009.

Peter J Green. Reversible jump markov chain monte carlo computation and

bayesian model determination. Biometrika, 82(4):711–732, 1995.

Peter J Green and Antonietta Mira. Delayed rejection in reversible jump

metropolis–hastings. Biometrika, 88(4):1035–1053, 2001.

219 Sonja Greven, Ciprian M Crainiceanu, Helmut K¨uchenhoff, and Annette Peters.

Restricted likelihood ratio testing for zero variance components in linear mixed

models. Journal of Computational and Graphical Statistics, 2012.

Justin Grimmer. An introduction to bayesian inference via variational approxi-

mations. Political Analysis, 19(1):32–47, 2011.

Yongtao Guan and Stephen M Krone. Small-world mcmc and convergence to

multi-modal distributions: From slow mixing to fast mixing. The Annals of

Applied Probability, 17(1):284–304, 2007.

Yongtao Guan and Matthew Stephens. Practical issues in imputation-based asso-

ciation mapping. PLoS Genetics, 4(12):e1000279, 2008.

Yongtao Guan and Matthew Stephens. Bayesian variable selection regression for

genome-wide association studies and other large-scale problems. The Annals of

Applied Statistics, pages 1780–1815, 2011.

Jonathan L Haines, Michael A Hauser, Silke Schmidt, William K Scott, Lana M

Olson, Paul Gallins, Kylee L Spencer, Shu Ying Kwan, Maher Noureddine,

John R Gilbert, et al. Complement factor h variant increases the risk of age-

related macular degeneration. Science, 308(5720):419–421, 2005.

Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities.

Cambridge university press, 1952.

DA Harville. Matrix algebra from a statistician's perspective. Springer-Verlag

New York, Inc., 1997.

W Keith Hastings. Monte carlo sampling methods using markov chains and their

applications. Biometrika, 57(1):97–109, 1970.

220 Joel N Hirschhorn and Mark J Daly. Genome-wide association studies for common

diseases and complex traits. Nature Reviews Genetics, 6(2):95–108, 2005.

James P Hobert and George Casella. The effect of improper priors on gibbs

sampling in hierarchical linear mixed models. Journal of the American Statistical

Association, 91(436):1461–1473, 1996.

Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for

nonorthogonal problems. Technometrics, 12(1):55–67, 1970a.

Arthur E Hoerl and Robert W Kennard. Ridge regression: applications to

nonorthogonal problems. Technometrics, 12(1):69–82, 1970b.

Peter D Hoff. A first course in Bayesian statistical methods. Springer Science &

Business Media, 2009.

Leslie Hogben. Handbook of linear algebra. CRC Press, 2006.

Clive J Hoggart, John C Whittaker, Maria De Iorio, and David J Balding. Si-

multaneous analysis of all snps in genome-wide and re-sequencing association

studies. PLoS Genet, 4(7):e1000130, 2008.

F Hoti and MJ Sillanpää. Bayesian mapping of genotype × expression interactions

in quantitative and qualitative traits. Heredity, 97(1):4–18, 2006.

Bryan N Howie, Peter Donnelly, and Jonathan Marchini. A flexible and accurate

genotype imputation method for the next generation of genome-wide association

studies. PLoS Genet, 5(6):e1000529, 2009.

Xichen Huang, Jin Wang, and Feng Liang. A variational algorithm for bayesian

variable selection. arXiv preprint arXiv:1602.07640, 2016.

Pirro G Hysi, Ching-Yu Cheng, Henriët Springelkamp, Stuart Macgregor, Jes-

sica N Cooke Bailey, Robert Wojciechowski, Veronique Vitart, Abhishek Nag,

Alex W Hewitt, René Höhn, et al. Genome-wide analysis of multi-ancestry co-

horts identifies new loci influencing intraocular pressure and susceptibility to

glaucoma. Nature genetics, 46(10):1126–1130, 2014.

Joseph G Ibrahim and Purushottam W Laud. On bayesian analysis of general-

ized linear models using jeffreys’s prior. Journal of the American Statistical

Association, 86(416):981–986, 1991.

Iuliana Ionita-Laza, Seunggeun Lee, Vlad Makarov, Joseph D Buxbaum, and Xi-

hong Lin. Sequence kernel association tests for the combined effect of rare and

common variants. The American Journal of Human Genetics, 92(6):841–853,

2013.

H Ishwaran and JS Rao. Bayesian nonparametric mcmc for large variable selection

problems. Unpublished manuscript, 2000.

Hemant Ishwaran and J Sunil Rao. Spike and slab variable selection: frequentist

and bayesian strategies. Annals of Statistics, pages 730–773, 2005.

Hemant Ishwaran and J Sunil Rao. Detecting differentially expressed genes in

microarrays using bayesian model selection. Journal of the American Statistical

Association, 2011.

Anne-Sophie Jannot, Georg Ehret, and Thomas Perneger. P < 5 × 10−8 has emerged

as a standard of statistical significance for genome-wide association studies.

Journal of clinical epidemiology, 68(4):460–465, 2015.

Harold Jeffreys. The theory of probability. OUP Oxford, 1961.

Jiming Jiang. Linear and generalized linear mixed models and their applications.

Springer Science & Business Media, 2007.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul.

An introduction to variational methods for graphical models. Machine learning,

37(2):183–233, 1999.

MA Kass, DK Heuer, EJ Higginbotham, et al. The ocular hypertension treat-

ment study: A randomized trial determines that topical ocular hypotensive med-

ication delays or prevents the onset of primary open-angle glaucoma. Archives

of Ophthalmology, 120(6):701–713, 2002. doi: 10.1001/archopht.120.6.701. URL

http://dx.doi.org/10.1001/archopht.120.6.701.

Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american

statistical association, 90(430):773–795, 1995.

Sekar Kathiresan, Olle Melander, Candace Guiducci, Aarti Surti, Noël P Burtt,

Mark J Rieder, Gregory M Cooper, Charlotta Roos, Benjamin F Voight, Aki S

Havulinna, et al. Six new loci associated with blood low-density lipoprotein

cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Na-

ture genetics, 40(2):189–197, 2008.

Hormuzd A Katki. Invited commentary: evidence-based evaluation of p values

and bayes factors. American Journal of Epidemiology, 168(4):384–388, 2008.

Riika Kilpikari and Mikko J Sillanpää. Bayesian analysis of multilocus association

in quantitative and qualitative traits. Genetic epidemiology, 25(2):122–135,

2003.

Robert J Klein, Caroline Zeiss, Emily Y Chew, Jen-Yue Tsai, Richard S Sackler,

Chad Haynes, Alice K Henning, John Paul SanGiovanni, Shrikant M Mane, Su-

san T Mayne, et al. Complement factor h polymorphism in age-related macular

degeneration. Science, 308(5720):385–389, 2005.

Femke CC Klouwer, Kevin Berendse, Sacha Ferdinandusse, Ronald JA Wanders,

Marc Engelen, et al. Zellweger spectrum disorders: clinical overview and man-

agement approach. Orphanet journal of rare diseases, 10(1):1, 2015.

Karl-Rudolf Koch. Introduction to Bayesian statistics. Springer Science & Business

Media, 2007.

Lynn Kuo and Bani Mallick. Variable selection for regression models. Sankhy¯a:

The Indian Journal of Statistics, Series B, pages 65–81, 1998.

Lydia Coulter Kwee, Dawei Liu, Xihong Lin, Debashis Ghosh, and Michael P

Epstein. A powerful and flexible multilocus association test for quantitative

traits. The American Journal of Human Genetics, 82(2):386–397, 2008.

Eric S Lander. The new genomics: global views of biology. Science, 274(5287):

536, 1996.

Michael Lavine and Mark J Schervish. Bayes factors: what they are and what

they are not. The American Statistician, 53(2):119–122, 1999.

DN Lawley. A general method for approximating to the distribution of likelihood

ratio criteria. Biometrika, 43(3/4):295–303, 1956.

Sang Hong Lee, Naomi R Wray, Michael E Goddard, and Peter M Visscher. Es-

timating missing heritability for disease from genome-wide association studies.

The American Journal of Human Genetics, 88(3):294–305, 2011.

Seunggeun Lee, Gonçalo R Abecasis, Michael Boehnke, and Xihong Lin. Rare-

variant association analysis: study designs and statistical tests. The American

Journal of Human Genetics, 95(1):5–23, 2014.

David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and

mixing times. American Mathematical Soc., 2009.

Bingshan Li and Suzanne M Leal. Methods for detecting associations with rare

variants for common diseases: application to analysis of sequence data. The

American Journal of Human Genetics, 83(3):311–321, 2008.

Jiahan Li, Kiranmoy Das, Guifang Fu, Runze Li, and Rongling Wu. The bayesian

lasso for genome-wide association studies. Bioinformatics, 27(4):516–523, 2011.

Feng Liang, Rui Paulo, German Molina, Merlise A Clyde, and Jim O Berger.

Mixtures of g priors for Bayesian variable selection. Journal of the American

Statistical Association, 103(481), 2008. ISSN 0162-1459.

Dennis V Lindley. A statistical paradox. Biometrika, pages 187–192, 1957.

Jun S Liu. Monte Carlo strategies in scientific computing. Springer Science &

Business Media, 2008.

Eugene Lukacs and Edgar P King. A property of the normal distribution. The

Annals of Mathematical Statistics, 25(2):389–394, 1954.

David J Lunn, John C Whittaker, and Nicky Best. A bayesian toolkit for genetic

association studies. Genetic epidemiology, 30(3):231–247, 2006.

Stuart Macgregor, Belinda K Cornes, Nicholas G Martin, and Peter M Visscher.

Bias, precision and heritability of self-reported and clinically measured height

in australian twins. Human genetics, 120(4):571–580, 2006.

Brian K Maples, Simon Gravel, Eimear E Kenny, and Carlos D Bustamante.

Rfmix: a discriminative modeling approach for rapid and robust local-ancestry

inference. The American Journal of Human Genetics, 93(2):278–288, 2013.

Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly.

A new multipoint method for genome-wide association studies by imputation

of genotypes. Nature genetics, 39(7):906–913, 2007.

Kantilal Varichand Mardia, John T Kent, and John M Bibby. Multivariate anal-

ysis. 1980.

Eden R Martin, Eric H Lai, John R Gilbert, Allison R Rogala, AJ Afshari, John

Riley, KL Finch, JF Stevens, KJ Livak, Brandon D Slotterbeck, et al. Snping

away at complex diseases: analysis of single-nucleotide polymorphisms around

apoe in alzheimer disease. The American Journal of Human Genetics, 67(2):

383–394, 2000.

Mark I McCarthy, Gonçalo R Abecasis, Lon R Cardon, David B Goldstein, Julian

Little, John PA Ioannidis, and Joel N Hirschhorn. Genome-wide association

studies for complex traits: consensus, uncertainty and challenges. Nature re-

views genetics, 9(5):356–369, 2008.

THE Meuwissen and ME Goddard. Mapping multiple qtl using linkage disequi-

librium and linkage analysis information and multitrait data. Genet. Sel. Evol,

36:261–279, 2004.

THE Meuwissen, BJ Hayes, and ME Goddard. Prediction of total genetic value

using genome-wide dense marker maps. Genetics, 157(4):1819–1829, 2001.

Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability.

Springer Science & Business Media, 2012.

Alan Miller. Subset selection in regression. CRC Press, 2002.

Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear

regression. Journal of the American Statistical Association, 83(404):1023–1032,

1988.

Jesper Møller, Anthony N Pettitt, R Reeves, and Kasper K Berthelsen. An efficient

markov chain monte carlo method for distributions with intractable normalising

constants. Biometrika, 93(2):451–458, 2006.

Richard W Morris and Norman L Kaplan. On the advantage of haplotype analysis

in the presence of multiple disease susceptibility alleles. Genetic epidemiology,

23(3):221–233, 2002.

Alison A Motsinger-Reif, Eric Jorgenson, Mary V Relling, Deanna L Kroetz,

Richard Weinshilboum, Nancy J Cox, and Dan M Roden. Genome-wide asso-

ciation studies in pharmacogenomics: successes and lessons. Pharmacogenetics

and genomics, 23(8):383, 2013.

Iain Murray, Zoubin Ghahramani, and David MacKay. Mcmc for doubly-

intractable distributions. arXiv preprint arXiv:1206.6848, 2012.

Michael Naaman. Almost sure hypothesis testing and a resolution of the jeffreys-

lindley paradox. Electronic Journal of Statistics, 10(1):1526–1550, 2016.

Ilja M Nolte, André R de Vries, Geert T Spijker, Ritsert C Jansen, Dumitru

Brinza, Alexander Zelikovsky, and Gerard J te Meerman. Association testing

by haplotype-sharing methods applicable to whole-genome analysis. In BMC

proceedings, volume 1, page S129. BioMed Central, 2007.

Dale R Nyholt. A simple correction for multiple testing for single-nucleotide poly-

morphisms in linkage disequilibrium with each other. The American Journal of

Human Genetics, 74(4):765–769, 2004.

Anthony O’Hagan. Fractional bayes factors for model comparison. Journal of the

Royal Statistical Society. Series B (Methodological), pages 99–138, 1995.

Anthony O’Hagan and Jonathan J Forster. Kendall’s advanced theory of statistics,

volume 2B: Bayesian inference, volume 2. Arnold, 2004.

Robert B O’Hara and Mikko J Sillanpää. A review of bayesian variable selection

methods: what, how and which. Bayesian analysis, 4(1):85–117, 2009.

Jun Ohashi and Katsushi Tokunaga. The power of genome-wide association studies

of complex disease genes: statistical limitations of indirect approaches using snp

markers. Journal of human genetics, 46(8):478–482, 2001.

A Bilge Ozel, Sayoko E Moroi, David M Reed, Melisa Nika, Caroline M Schmidt,

Sara Akbari, Kathleen Scott, Frank Rozsa, Hemant Pawar, David C Musch,

et al. Genome-wide association study and meta-analysis of intraocular pressure.

Human genetics, 133(1):41–57, 2014.

Roman Pahl and Helmut Schäfer. Permory: an ld-exploiting permutation test

algorithm for powerful genome-wide association testing. Bioinformatics, 26(17):

2093–2100, 2010.

Orestis A Panagiotou and John PA Ioannidis. What should the genome-wide sig-

nificance threshold be? empirical replication of borderline genetic associations.

International journal of epidemiology, 41(1):273–286, 2012.

Trevor Park and George Casella. The bayesian lasso. Journal of the American

Statistical Association, 103(482):681–686, 2008.

Peter H Peskun. Optimum monte-carlo sampling using markov chains. Biometrika,

60(3):607–612, 1973.

Ronald L Plackett. Some theorems in least squares. Biometrika, 37(1/2):149–157,

1950.

William H Press. Numerical recipes 3rd edition: The art of scientific computing.

Cambridge university press, 2007.

Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A

Shadick, and David Reich. Principal components analysis corrects for strat-

ification in genome-wide association studies. Nature genetics, 38(8):904–909,

2006.

Jonathan K Pritchard and Nancy J Cox. The allelic architecture of human disease

genes: common disease–common variant or not? Human molecular genetics, 11

(20):2417–2423, 2002.

Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of pop-

ulation structure using multilocus genotype data. Genetics, 155(2):945–959,

2000.

Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel AR

Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul IW De Bakker,

Mark J Daly, et al. Plink: a tool set for whole-genome association and

population-based linkage analyses. The American Journal of Human Genet-

ics, 81(3):559–575, 2007.

Adrian E Raftery, David Madigan, and Jennifer A Hoeting. Bayesian model av-

eraging for linear regression models. Journal of the American Statistical Asso-

ciation, 92(437):179–191, 1997.

David E Reich and Eric S Lander. On the allelic spectrum of human disease.

TRENDS in Genetics, 17(9):502–510, 2001.

Sidney I Resnick. A probability path. Springer Science & Business Media, 2013.

Sheldon M Ross. Stochastic processes, volume 2. John Wiley & Sons New York,

1996.

Stephen Sawcer. Bayes factors in complex genetics. European Journal of Human

Genetics, 18(7):746–750, 2010.

Paul Scheet and Matthew Stephens. A fast and flexible statistical model for large-

scale population genotype data: applications to inferring missing genotypes and

haplotypic phase. The American Journal of Human Genetics, 78(4):629–644,

2006.

Angelo Scuteri, Serena Sanna, Wei-Min Chen, Manuela Uda, Giuseppe Albai,

James Strait, Samer Najjar, Ramaiah Nagaraja, Marco Orrù, Gianluca Usala,

et al. Genome-wide association scan shows genetic variants in the fto gene are

associated with obesity-related traits. PLoS Genet, 3(7):e115, 2007.

Shaun R Seaman and Sylvia Richardson. Equivalence of prospective and retro-

spective models in the bayesian analysis of case-control studies. Biometrika, 91

(1):15–25, 2004.

Shayle R Searle, George Casella, and Charles E McCulloch. Variance components,

volume 391. John Wiley & Sons, 2009.

Vincent Segura, Bjarni J Vilhjálmsson, Alexander Platt, Arthur Korte, Ümit

Seren, Quan Long, and Magnus Nordborg. An efficient multi-locus mixed-model

approach for genome-wide association studies in structured populations. Nature

genetics, 44(7):825–830, 2012.

Steven G Self and Kung-Yee Liang. Asymptotic properties of maximum likelihood

estimators and likelihood ratio tests under nonstandard conditions. Journal of

the American Statistical Association, 82(398):605–610, 1987.

Thomas Sellke, M. J. Bayarri, and James O Berger. Calibration of p values

for testing precise null hypotheses. The American Statistician, 55(1):62–71,

2001. doi: 10.1198/000313001300339950. URL http://dx.doi.org/10.1198/

000313001300339950.

Śaunak Sen and Gary A Churchill. A statistical framework for quantitative trait

mapping. Genetics, 159(1):371–387, 2001.

D Serre. Matrices: Theory and Applications. Springer, New York, 2002.

Bertrand Servin and Matthew Stephens. Imputation-based analysis of association

studies: candidate regions and quantitative traits. PLoS Genetics, 3(7):e114,

2007.

Jun Shao. Mathematical Statistics. Springer Texts in Statistics. Springer, second

edition, 2003. ISBN 9780387953823.

Allan R. Shepard, Nasreen Jacobson, J. Cameron Millar, Iok-Hou Pang,

H. Thomas Steely, Charles C. Searby, Val C. Sheffield, Edwin M. Stone, and

Abbot F. Clark. Glaucoma-causing myocilin mutants require the peroxiso-

mal targeting signal-1 receptor (pts1r) to elevate intraocular pressure. Human

Molecular Genetics, 16(6):609–617, 2007. doi: 10.1093/hmg/ddm001. URL

http://hmg.oxfordjournals.org/content/16/6/609.abstract.

Ilya Shlyakhter, Pardis C Sabeti, and Stephen F Schaffner. Cosi2: an efficient

simulator of exact and approximate coalescent with selection. Bioinformatics,

30(23):3427–3429, 2014.

Zbyněk Šidák. On multivariate normal probabilities of rectangles: their depen-

dence on correlations. The Annals of Mathematical Statistics, pages 1425–1434,

1968.

Zbyněk Šidák. On probabilities of rectangles in multivariate student distributions:

their dependence on correlations. The Annals of Mathematical Statistics, pages

169–175, 1971.

Mikko J Sillanpää and Elja Arjas. Bayesian mapping of multiple quantitative trait

loci from incomplete inbred line cross data. Genetics, 148(3):1373–1388, 1998.

Václav Šmídl and Anthony Quinn. The variational Bayes method in signal pro-

cessing. Springer Science & Business Media, 2006.

David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van

Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal

Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.

Eli A Stahl, Daniel Wegmann, Gosia Trynka, Javier Gutierrez-Achury, Ron Do,

Benjamin F Voight, Peter Kraft, Robert Chen, Henrik J Kallberg, Fina AS

Kurreeman, et al. Bayesian inference analyses of the polygenic architecture of

rheumatoid arthritis. Nature genetics, 44(5):483–489, 2012.

Matthew Stephens and David J Balding. Bayesian statistical methods for genetic

association studies. Nature Reviews Genetics, 10(10):681–690, 2009.

John D Storey and Robert Tibshirani. Statistical significance for genomewide

studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445,

2003.

Daniel O Stram and Jae Won Lee. Variance components testing in the longitudinal

mixed effects model. Biometrics, pages 1171–1177, 1994.

Wenguang Sun and Tony T Cai. Large-scale multiple testing under dependence.

Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71

(2):393–424, 2009.

Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander,

Matthieu Foll, and Christophe Dessimoz. Approximate bayesian computation.

PLoS Comput Biol, 9(1):e1002803, 2013.

James Joseph Sylvester. XXXVII. On the relation between the minor determinants

of linearly equivalent quadratic functions. The London, Edinburgh, and Dublin

Philosophical Magazine and Journal of Science, 1(4):295–305, 1851.

Cajo JF Ter Braak, Martin P Boer, and Marco CAM Bink. Extending xu’s

bayesian model for estimating polygenic effects using markers of the entire

genome. Genetics, 170(3):1435–1438, 2005.

The International HapMap Consortium. Integrating common and rare genetic

variation in diverse human populations. Nature, 467(7311):52–58, 2010.

Gilles Thomas, Kevin B Jacobs, Meredith Yeager, Peter Kraft, Sholom Wacholder,

Nick Orr, Kai Yu, Nilanjan Chatterjee, Robert Welch, Amy Hutchinson, et al.

Multiple loci identified in a genome-wide association study of prostate cancer.

Nature genetics, 40(3):310–315, 2008.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of

the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

Andrej Nikolaevich Tikhonov and Vasiliy Yakovlevich Arsenin. Solutions of ill-

posed problems. 1977.

Michael E Tipping. Bayesian inference: An introduction to principles and practice

in machine learning. In Advanced lectures on machine Learning, pages 41–62.

Springer, 2004.

John A Todd, Neil M Walker, Jason D Cooper, Deborah J Smyth, Kate Downes,

Vincent Plagnol, Rebecca Bailey, Sergey Nejentsev, Sarah F Field, Felicity

Payne, et al. Robust associations of four new chromosome regions from genome-

wide analyses of type 1 diabetes. Nature genetics, 39(7):857–864, 2007.

Lloyd N Trefethen and David Bau III. Numerical linear algebra, volume 50. Siam,

1997.

Pekka Uimari and Ina Hoeschele. Mapping-linked quantitative trait loci using

bayesian analysis and markov chain monte carlo algorithms. Genetics, 146(2):

735–743, 1997.

Megan Ulmer, Jun Li, Brian L. Yaspan, Ayse Bilge Ozel, Julia E. Richards,

Sayoko E. Moroi, Felicia Hawthorne, Donald L. Budenz, David S. Friedman,

Douglas Gaasterland, Jonathan Haines, Jae H. Kang, Richard Lee, Paul Lichter,

Yutao Liu, Louis R. Pasquale, Margaret Pericak-Vance, Anthony Realini, Joel S.

Schuman, Kuldev Singh, Douglas Vollrath, Robert Weinreb, Gadi Wollstein,

Donald J. Zack, Kang Zhang, Terri Young, R. Rand Allingham, Janey L. Wiggs,

Allison Ashley-Koch, and Michael A. Hauser. Genome-wide analysis of central

corneal thickness in primary open-angle glaucoma cases in the neighbor and

glaugen consortia: the effects of cct-associated variants on poag risk. Investigative

Ophthalmology & Visual Science, 53(8):4468, 2012. doi: 10.1167/iovs.12-9784.

URL http://dx.doi.org/10.1167/iovs.12-9784.

Leonieke ME van Koolwijk, Wishal D Ramdas, M Kamran Ikram, Nomdo M

Jansonius, Francesca Pasutto, Pirro G Hysi, Stuart Macgregor, Sarah F Janssen,

Alex W Hewitt, Ananth C Viswanathan, et al. Common genetic determinants

of intraocular pressure and primary open-angle glaucoma. PLoS Genet, 8(5):

e1002611, 2012.

Jon Wakefield. Bayes factors for genome-wide association studies: comparison

with p-values. Genetic Epidemiology, 33(1):79–86, 2009.

Gertraud Malsiner Walli. Bayesian variable selection in normal regression models.

PhD thesis, Institut für Angewandte Statistik, 2010.

Hui Wang, Yuan-Ming Zhang, Xinmin Li, Godfred L Masinde, Subburaman Mo-

han, David J Baylink, and Shizhong Xu. Bayesian shrinkage estimation of

quantitative trait loci parameters. Genetics, 170(1):465–480, 2005.

Michael N Weedon, Hana Lango, Cecilia M Lindgren, Chris Wallace, David M

Evans, Massimo Mangino, Rachel M Freathy, John RB Perry, Suzanne Stevens,

Alistair S Hall, et al. Genome-wide association analysis identifies 20 loci that

influence adult height. Nature genetics, 40(5):575–583, 2008.

RN Weinreb, T Aung, and FA Medeiros. The pathophysiology and treatment of

glaucoma: A review. JAMA, 311(18):1901–1911, 2014. doi: 10.1001/jama.2014.

3192. URL http://dx.doi.org/10.1001/jama.2014.3192.

Daphna Weissglas-Volkov, Carlos A Aguilar-Salinas, Elina Nikkola, Kerry A

Deere, Ivette Cruz-Bautista, Olimpia Arellano-Campos, Linda Liliana Muñoz-

Hernandez, Lizeth Gomez-Munguia, Maria Luisa Ordoñez-Sánchez, Prasad

MV Linga Reddy, et al. Genomic study in mexicans identifies a new locus

for triglycerides and refines european lipid loci. Journal of medical genetics, 50

(5):298–308, 2013.

Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy

Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff,

et al. The nhgri gwas catalog, a curated resource of snp-trait associations.

Nucleic acids research, 42(D1):D1001–D1006, 2014.

Samuel S Wilks. The large-sample distribution of the likelihood ratio for testing

composite hypotheses. The Annals of Mathematical Statistics, 9(1):60–62, 1938.

Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong

Lin. Rare-variant association testing for sequencing data with the sequence

kernel association test. The American Journal of Human Genetics, 89(1):82–

93, 2011.

Hanli Xu and Yongtao Guan. Detecting local haplotype sharing and haplotype

association. Genetics, 197(3):823–838, 2014.

Shizhong Xu. Estimating polygenic effects using markers of the entire genome.

Genetics, 163(2):789–801, 2003.

Jian Yang, Beben Benyamin, Brian P McEvoy, Scott Gordon, Anjali K Hen-

ders, Dale R Nyholt, Pamela A Madden, Andrew C Heath, Nicholas G Martin,

Grant W Montgomery, et al. Common snps explain a large proportion of the

heritability for human height. Nature genetics, 42(7):565–569, 2010.

Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. Gcta: a

tool for genome-wide complex trait analysis. The American Journal of Human

Genetics, 88(1):76–82, 2011.

Shiming Yang and Matthias K Gobbert. The optimal relaxation parameter for the

sor method applied to a classical model problem. Technical Report TR2007-6,

Department of Mathematics and Statistics, University of Maryland, Baltimore

County, 2007.

Nengjun Yi. A unified markov chain monte carlo framework for mapping multiple

quantitative trait loci. Genetics, 167(2):967–975, 2004.

Nengjun Yi and Shizhong Xu. Bayesian lasso for quantitative trait loci mapping.

Genetics, 179(2):1045–1055, 2008.

Nengjun Yi, Varghese George, and David B Allison. Stochastic search variable

selection for identifying multiple quantitative trait loci. Genetics, 164(3):1129–

1138, 2003.

Nengjun Yi, Brian S Yandell, Gary A Churchill, David B Allison, Eugene J Eisen,

and Daniel Pomp. Bayesian model selection for genome-wide epistatic quanti-

tative trait loci analysis. Genetics, 170(3):1333–1344, 2005.

David Young. Iterative methods for solving partial difference equations of elliptic

type. Transactions of the American Mathematical Society, 76(1):92–111, 1954.

Eleftheria Zeggini, Laura J Scott, Richa Saxena, Benjamin F Voight, Jonathan L

Marchini, Tianle Hu, Paul IW de Bakker, Gonçalo R Abecasis, Peter Alm-

gren, Gitte Andersen, et al. Meta-analysis of genome-wide association data and

large-scale replication identifies additional susceptibility loci for type 2 diabetes.

Nature genetics, 40(5):638–645, 2008.

Arnold Zellner. On assessing prior distributions and bayesian regression analysis

with g-prior distributions. Bayesian Inference and Decision Techniques: Essays

in Honor of Bruno De Finetti, 6:233–243, 1986.

Arnold Zellner and Aloysius Siow. Posterior odds ratios for selected regression

hypotheses. Trabajos de estadística y de investigación operativa, 31(1):585–603,

1980.

Quan Zhou, Liang Zhao, and Yongtao Guan. Strong selection at mhc in mexicans

since admixture. PLoS Genet, 12(2):e1005847, 2016.

Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with

bayesian sparse linear mixed models. PLoS Genet, 9(2):e1003264, 2013.
