SEARCHING FOR RARE VARIANTS ASSOCIATED WITH

OSAHS-RELATED PHENOTYPES

THROUGH PEDIGREES

by

JINGJING LIANG

Dissertation Advisor: Dr. Xiaofeng Zhu

Department of Population and Quantitative Health Sciences

CASE WESTERN RESERVE UNIVERSITY

May 29, 2019

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Jingjing Liang

candidate for the degree of Ph.D.

Committee Chair

Scott M. Williams

Committee Member

Jonathan L. Haines

Committee Member

Xiaofeng Zhu

Committee Member

Rong Xu

Committee Member

Curtis M. Tatsuoka

Date of Defense

January 29, 2019

*We also certify that written approval has been obtained

for any proprietary material contained therein.


Table of Contents

CHAPTER 1: LITERATURE REVIEW AND SPECIFIC AIMS

1.1 Obstructive sleep apnea-hypopnea syndrome

1.2 AHI and SpO2

1.3 Rare variants and missing heritability

1.4 Rare variant association analysis

1.5 Rare variant test using pedigrees

1.6 Annotating variants in genetic regions

1.7 Mendelian randomization

1.8 Specific aims

CHAPTER 2: IDENTIFYING LOW FREQUENCY AND RARE VARIANTS ASSOCIATED WITH AVSPO2S USING PEDIGREES

2.1 Introduction

2.2 Material and methods

2.2.1 Description of study samples

2.2.2 Overview of the method

2.2.3 Primary phenotype

2.2.4 Whole-genome sequencing

2.2.5 Imputation of replication cohorts

2.2.6 Linkage analysis of AvSpO2S

2.2.7 Simulated family data

2.2.8 Analysis of CFS stage I families

2.2.9 -based association test

2.2.10 Identifying variants contributing to linkage evidence in CFS-EAs

2.2.11 Estimating the proportion of AvSpO2S variability explained by the identified variants in DLC1

2.2.12 Test the aggregation of variant effect size directions in DLC1

2.2.13 Analysis

2.2.14 Cell type specific regulatory annotation enrichment analysis

2.2.15 Mendelian randomization analysis

2.2.16 AvSpO2S and methylation in the DLC1 gene in MESA

2.2.17 AvSpO2S and DLC1 gene expression in the Sleep Heart Health Study from Framingham Heart Study

2.3 Results

2.3.1 Persistent linkage evidence of AvSpO2S on 8p23

2.3.2 Choice of family specific LOD score threshold 0.1

2.3.3 Multiple low frequency and rare variants in DLC1 are associated with AvSpO2S

2.3.4 Conditioning on the effects of identified DLC1 variants reduces the linkage evidence in CFS-EAs

2.3.5 Identified DLC1 variants are enriched in regulatory regions

2.3.6 DLC1 non-coding variants are associated with DLC1 expression level in skin cells-transformed fibroblasts

2.3.7 DLC1 DNA methylation is weakly associated with AvSpO2S and AHI

2.3.8 DLC1 expression is weakly associated with AvSpO2S and AHI

2.3.9 Association of DLC1 with AvSpO2S is not mediated by AHI

2.3.10 Potential pleiotropic effect between DLC1, AvSpO2S and lung function

2.4 Discussion

CHAPTER 3: ASSESSING THE INDEPENDENCE OF ASSOCIATION TESTS AND LINKAGE EVIDENCE OBTAINED IN THE SAME DATA

3.1 Introduction

3.2 Method

3.3 Results

3.3.1 Assess the empirical P-values for burden and SKAT tests using genome-wide variants

3.3.2 Enrichment of 5% significant gene-based tests on target region in CFS-EAs

3.4 Discussion

CHAPTER 4: DEVELOPING AN ANALYSIS PIPELINE TO SEARCH RARE VARIANTS THROUGH LINKAGE ANALYSIS USING WGS DATA

4.1 Introduction

4.2 Method

4.2.1 Overview of the analysis pipeline

4.2.2 Variance component linkage analysis

4.2.3 Identify candidate variants in WGS data

4.2.4 Gene-based association tests using WGS data

4.3 Results

4.3.1 Multiple rare variants in CAV1 are associated with AHI

4.3.2 CAV1 variants regulatory annotation and gene expression analysis

4.4 Discussion

CHAPTER 5: DISCUSSION AND FUTURE WORK


List of Tables

Table 1.1 Summary of the features, pros and cons of the different types of methods for rare variant association tests

Table 2.1 Sample characteristics of TOPMed WGS and imputed genotype studies

Table 2.2 Selected coding and non-coding variants on 105 in target region

Table 2.3 Type I error of burden and SKAT test

Table 2.4 Power of burden and SKAT tests when simulated rare variants account for 1% of variability of phenotype

Table 2.5 Power of burden and SKAT tests when simulated rare variants account for 0.5% of variability of phenotype

Table 2.6 Stage I and II gene-based association tests with AvSpO2S

Table 2.7 Gene-based association test for DLC1 with AvSpO2S in TOPMed sequencing and independent replication data with imputed genotypes

Table 2.8 Mendelian randomization analysis to assess causal effects of DLC1 expression in skin cell-transformed fibroblasts to AvSpO2S

Table 2.9 Sample characteristics and results of DLC1 methylation association test with AvSpO2S

Table 2.10 Sample characteristics and results of DLC1 expression level association with AHI and AvSpO2S

Table 2.11 Stage I and II gene-based association tests with AvSpO2S by adjusting AHI as covariate

Table 2.12 Stage I and II gene-based association tests for DLC1 with AvSpO2S or FEV1/FVC by adjusting FEV1/FVC or AvSpO2S as covariates

Table 3.1 Summary of the genome-wide gene based test using CFS-EA stage I family filtered variants for AvSpO2S

Table 3.2 Enrichment of 5% significant AvSpO2S gene-based tests on chromosome 8 target region in CFS-EA

Table 4.1 Stage I and II gene-based association tests with AHI


List of Figures

Figure 1.1 Normal breathing, snoring and OSAHS

Figure 1.2 Example of polysomnography trace during repetitive apnoeas

Figure 1.3 Distribution of AHI and AvSpO2S raw measurement in CFS-EAs

Figure 1.4 Boxplots to compare AHI and AvSpO2S in healthy people and OSAHS patients in CFS-EAs

Figure 1.5 Variance component linkage analysis of AHI

Figure 1.6 Genetic variant frequencies and effect sizes

Figure 1.7 Non-coding regions affect proximal and distal regulators

Figure 2.1 Variance component linkage analysis of AvSpO2S in CFS-EAs on chromosome 8

Figure 2.2 Mixture distribution of family specific LOD scores in CFS families

Figure 2.3 Distribution of family specific LOD score (All 40 causal variants have the same effect directions.)

Figure 2.4 Distribution of family specific LOD score (Half of 40 causal variants have the same effect directions.)

Figure 2.5 The analysis flow chart for searching low frequency and rare variants associated with AvSpO2S using the TOPMed WGS data

Figure 2.6 Linkage evidence of AvSpO2S on chromosome 8 in CFS-EAs

Figure 2.7 Conditional linkage analysis for AvSpO2S on chromosome 8 in CFS-EAs TOPMed individuals

Figure 2.8 Effect sizes of the 57 variants in DLC1 estimated using the stage II samples conditional on MAF

Figure 2.9 Cell type specific regulatory annotation enrichment tests for the identified non-coding variants in DLC1

Figure 2.10 51 non-coding variants and the corresponding effect sizes in DLC1 genes plotted against physical locations

Figure 2.11 Gene based test for DLC1, MYOM2 and CSMD1 selected variants with corresponding expression level in GTEx tissues

Figure 2.12 Mendelian randomization analysis using 24 DLC1 variants as instrumental variables

Figure 2.13 Comparison of the effect sizes of single variant associations with and without adjusting for FEV1, FVC and AHI as covariate for DLC1 selected variants

Figure 3.1 Distribution of variants number in each gene in genome-wide gene-based test using CFS-EA Stage I families filtered for coding and non-coding variants

Figure 3.2 Quantile-quantile plots for genome-wide gene-based test using CFS-EA stage I family filtered variants for AvSpO2S

Figure 4.1 Overview of the analysis pipeline processes

Figure 4.2 Variance component linkage analysis of AHI in CFS-EAs on Chromosome 7

Figure 4.3 Cell type specific regulatory annotation enrichment tests for the identified non-coding variants in CAV1 in cell lines defined in the Ensembl Regulatory Build

Figure 4.4 21 non-coding variants and the corresponding effect sizes in CAV1 genes plotted against physical locations

Figure 4.5 Gene based test for CAV1 selected variants with CAV1 expression level in GTEx tissues

Figure 4.6 Tissue-specific gene expression of CAV1 in GTEx database


Acknowledgement

First of all, I express my earnest gratitude to my advisor, Dr. Xiaofeng Zhu, for his guidance and support in my research and career development. He taught me professional knowledge and skills, and encouraged me during the hard times in my research. I also thank my committee members Dr. Scott Williams, Dr. Jonathan Haines, Dr. Rong Xu and Dr. Curtis Tatsuoka for their guidance and feedback on this dissertation work.

I would also like to thank my current and former colleagues and friends in the Department of Population and Quantitative Health Sciences, especially Dr. Heming Wang, Dr. Priya Shetty, Karen He, Dr. Xiaoyin Li, Mengshi Zhou, Jingting Yu and Mike Fang, for their kind help with statistical analyses and computational programming and for inspiring discussions. In addition, I am grateful to my collaborators outside Case Western Reserve University, including Drs. Susan Redline, Xihong Lin and Brian Cade from Harvard University, Dr. Nora Franceschini from the University of North Carolina, Dr. Todd Edwards from Vanderbilt University and all of our collaborators in the NHLBI Trans-Omics for Precision Medicine (TOPMed) Sleep Working Group.

I appreciate the financial support from the National Institutes of Health and the National Heart, Lung, and Blood Institute for my Ph.D. study. Finally, I dedicate this dissertation to my daughters, Grace and Alice, my husband and my parents for their love, understanding and support.


Searching For Rare Variants Associated With OSAHS-related

Phenotypes Through Pedigrees

Abstract

by

JINGJING LIANG

Genetic determinants of obstructive sleep apnea-hypopnea syndrome (OSAHS), a common complex disorder that contributes to significant cardiovascular and neuropsychiatric morbidity, are not clear. Average arterial oxyhemoglobin saturation during sleep (AvSpO2S) is a clinically relevant and easily measured indicator of OSAHS severity, but the genetic contribution to it has not been well studied. Using high depth whole-genome sequencing data from the NHLBI TOPMed project and focusing on genes with linkage evidence on chromosome 8p23, we observed 6 coding and 51 non-coding variants in the GTPase-activating (DLC1) significantly associated with AvSpO2S (trans-ethnic P = 4.2×10⁻⁵), an association that was replicated in 5,042 independent subjects (P = 0.01). A risk score of the 57 variants built in an independent data set explains 0.97% of the AvSpO2S variation and contributes to the linkage evidence (P = 0.0017). These variants are enriched in regulatory features in a human lung fibroblast cell line and contribute to DLC1 expression variation. Mendelian randomization using the 57 variants indicates a significant causal effect of DLC1 expression in fibroblasts on AvSpO2S (P = 1.32×10⁻⁴). An additive score using DLC1 expression effect explained 0.43% of AvSpO2S variation (P = 5.4×10⁻⁵). Multiple sources of information including gene expression, functional annotation and DNA methylation consistently suggest DLC1 is a gene associated with AvSpO2S.

Based on the above studies, we developed a method to identify low frequency and rare variants in high depth genomic sequencing data that segregate within families and contribute to observed linkage evidence. We also integrated functional annotation from the Ensembl Regulatory Build and omics data from the GTEx Portal for the identified variants to elucidate a potential causative mechanism regulating the phenotype. Finally, we present an analysis pipeline, implemented as an R program, that performs analyses using this method and is publicly available to the scientific community.


CHAPTER 1: LITERATURE REVIEW AND SPECIFIC AIMS

1.1 Obstructive sleep apnea-hypopnea syndrome

Obstructive sleep apnea-hypopnea syndrome (OSAHS) is a complex disorder characterized by the occurrence of repetitive episodes of complete or partial upper airway obstruction during sleep associated with snoring, intermittent hypoxemia, and daytime sleepiness (Mbata & Chukwuka 2012). Epidemiological studies indicate that OSAHS affects approximately 2 to 6% of children and more than 15% of adults (Badran et al. 2015, Capdevila et al. 2008, Rosen et al. 2003). The public health importance of this disorder relates not only to its high prevalence but also its impact on a wide range of health outcomes. The recurrent episodes of upper airway obstruction that occur with OSAHS lead to disruptive snoring, sleep fragmentation and daytime sleepiness, which negatively impact quality of life and daytime performance (Mbata & Chukwuka 2012). OSAHS also results in overnight hypoxemia and heightened sympathetic activation, which adversely affect blood pressure, metabolism, and vascular health (Mbata & Chukwuka 2012). Indeed, research over the last two decades has identified OSAHS as an independent risk factor for the development of hypertension, cardiovascular disease, stroke and premature mortality (Badran et al. 2015).

The primary causes of OSAHS include decreased muscle tone during sleep, increased soft tissue in the upper airway and anatomical abnormalities in the upper airway and jaw (Figure 1.1) (Schwab et al. 2015). Aging, temporary or permanent brain injury, and obesity are also risk factors associated with OSAHS (Halbower et al. 2006, Schwab et al. 2015). Aging is often accompanied by muscular and neurological loss of muscle tone in the upper airway. Traumatic brain injury and neuromuscular disorder might lead to permanent premature muscular tonal loss in the upper airway (Schwab et al. 2015). OSAHS is more common in men (Badran et al. 2015). Older men with increased mass in the torso and neck are at increased risk of developing OSAHS (Schwab et al. 2015). Women are reported to have OSAHS less frequently than men, but women during pregnancy are at relatively higher risk for developing OSAHS (Balserak 2014). Smoking also increases the chance of developing OSAHS, as the chemical irritants in smoke tend to inflame the soft tissue of the upper airway and narrow it (Lavie et al. 1995). Consumption of alcohol, sedatives, or any drugs that increase sleepiness through muscle relaxation may also increase OSAHS risk (Lavie et al. 1995).

[Figure 1.1 panels: Normal breathing; Snoring – Partial obstruction of the airway; OSAHS – Complete obstruction of the airway]

Figure 1.1 Normal breathing, snoring and OSAHS. In snoring and OSAHS, soft tissue falls to the back of the throat, which partially or completely impedes the passage of air (blue arrows) through the trachea. Image Credit: Alila Medical Media / Shutterstock.

Importantly, OSAHS aggregates in families, and genetic variation can influence susceptibility to it (Redline et al. 1992). Genetic factors were reported to affect upper airway structures (Schwab 2005). The proportion of OSAHS variability attributable to genetic factors was estimated to range from 21% to 84% (Redline & Tishler 2000). Understanding the genetic etiology of OSAHS, including the genetic basis for intermediate phenotypes, may help better elucidate its pathogenesis, with the goal of improving preventive strategies, diagnostic tools and therapies. Although a significant genetic basis for OSAHS is supported by several family studies and emerging candidate gene association analyses, molecular studies of OSAHS genetics have lagged behind those of other chronic diseases (Redline et al. 1995, Palmer et al. 2003, Larkin et al. 2008).

1.2 AHI and SpO2

Figure 1.2 Example of polysomnography trace during repetitive apnoeas showing dips in arterial oxygen saturation (SpO2) occurring in association with OSAHS. EEG: electroencephalogram. (Valipour et al. 2002).

To date, the chief quantitative metric of OSAHS severity used in genetic analysis is the apnea-hypopnea index (AHI), the average number of complete and partial airway occlusions per hour of sleep (Redline et al. 1999, Liang et al. 2016). To measure AHI and diagnose OSAHS, a polysomnogram study is typically performed. To be counted, a complete or partial airway occlusion (apnea or hypopnea) must last for at least 10 seconds and be associated with a decrease in blood oxygenation (Redline et al. 1999). The AHI values for adults are categorized as: normal (AHI < 5), mild sleep apnea (5 ≤ AHI < 15), moderate sleep apnea (15 ≤ AHI < 30), severe sleep apnea (AHI ≥ 30) (Redline et al. 1999). For children, an AHI in excess of 5 is referred for treatment (Redline et al. 1999). In large scale genetic studies, the advantages of using the AHI include its simplicity, high night-to-night reproducibility and widespread clinical use (Liang et al. 2016). However, the AHI does not provide information on oxygen saturation, which also reflects sleep apnea severity (Liang et al. 2016).
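As an illustration only, the AHI definition and the adult severity categories described above can be sketched in a few lines of Python. The function names `compute_ahi` and `classify_ahi_adult` are hypothetical and are not taken from the cited studies; the cut-offs are the ones stated in the text.

```python
def compute_ahi(n_apneas: int, n_hypopneas: int, sleep_hours: float) -> float:
    """AHI = (number of apneas + number of hypopneas) per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apneas + n_hypopneas) / sleep_hours


def classify_ahi_adult(ahi: float) -> str:
    """Map an adult AHI to the severity category given in the text."""
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"
```

For example, 30 apneas and 60 hypopneas over 6 hours of sleep give an AHI of 15, which falls at the lower bound of the moderate category.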

Arterial oxyhemoglobin saturation (SpO2) reflects the adequacy of ventilation and oxygen transport, fundamental physiological properties that are tightly regulated at molecular and cellular levels to ensure adequate delivery of oxygen to vital tissues. Reduction in oxyhemoglobin saturation increases mortality and cognitive decline (Antonelli Incalzi et al. 2003). Given its clinical relevance, oxygen saturation is commonly monitored in patients with pulmonary, cardiac and sleep disorders to identify adverse outcomes and to inform intervention. For OSAHS, repetitive dips in SpO2 were observed during successive apneas/hypopneas (Valipour et al. 2002). Combining AHI and oxygen saturation gives an overall sleep apnea severity score that reflects both the number of sleep disruptions and the degree of oxygen desaturation (low oxygen level in the blood) (Ruehland et al. 2009). Figure 1.2 shows an example of a polysomnography trace during repetitive apnoeas, with drops in arterial oxygen saturation occurring in association with OSAHS.

Figure 1.3 Distributions of AHI (A) and AvSpO2S (B) raw measurement in CFS-EAs

Figure 1.4 Boxplots to compare AHI and AvSpO2S in healthy people and OSAHS patients in CFS-EAs. (A) raw AHI measurement (B) raw AvSpO2S measurement (C) rank-normal transformed AHI (D) rank-normal transformed AvSpO2S. Both adults and children are included, with AHI > 15 for adults and AHI > 5 for children used as the diagnostic standard for OSAHS.

Figure 1.5 Variance component linkage analysis of AHI on chromosome 7 (A and B) and AvSpO2S on chromosome 8 (C and D) and chromosome 14 (E and F) in CFS-EAs. Raw AHI and AvSpO2S were rank-normal transformed and adjusted for the covariates age, age², gender, age × gender and ten principal components, with (A, C and E) or without (B, D and F) BMI, before linkage analysis was carried out. (Liang et al. 2016)

Average arterial oxyhemoglobin saturation during sleep (AvSpO2S) is the mean of the oxygen saturation values recorded during sleep. AvSpO2S is highly correlated with AHI. In the Cleveland Family Study (CFS), a family-based longitudinal study comprising index cases with laboratory diagnosed sleep apnea, their family members and neighborhood control families, the Pearson correlation coefficient between AHI and AvSpO2S was estimated to be -0.69 and -0.71 in 656 African Americans (CFS-AAs) and 645 European Americans (CFS-EAs), respectively (Liang et al. 2016). Both AHI and AvSpO2S raw measurement distributions are skewed in CFS (Figure 1.3). The distributions of the raw AHI and AvSpO2S measurements and of their rank-normal transformed values in healthy people and patients with OSAHS in CFS-EAs are compared in Figure 1.4, in which AHI > 15 for adults older than 18 years and AHI > 5 for children younger than 18 are used as the diagnostic standard for OSAHS. The AvSpO2S level is significantly lower in patients with OSAHS in the 645 CFS-EAs (P = 2.2×10⁻¹⁶, two-sided Student's t-test). Using family data, the heritabilities of AHI and AvSpO2S can be estimated. The heritability of AHI was reported to be 0.31 in 656 CFS African Americans (CFS-AAs) from 147 families and 0.27 in 645 CFS European Americans (CFS-EAs) from 139 families (Liang et al. 2016). For AvSpO2S, the heritability was estimated to be 0.22 in CFS-AAs and 0.41 in CFS-EAs (Liang et al. 2016). Both AHI and AvSpO2S are postulated to be useful for understanding the genetic basis of nocturnal hypoxaemia associated with OSAHS.
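Because the raw AHI and AvSpO2S distributions are skewed, the analyses referred to here use rank-normal transformed phenotypes. The following is a minimal sketch of one common form of that transformation, not the code used in the cited studies. The function name `rank_normal_transform` and the rank offset of 0.5 are assumptions made for illustration; other offsets (for example 3/8) are also in use.

```python
from statistics import NormalDist


def rank_normal_transform(values, offset=0.5):
    """Rank-based inverse normal transform; tied values share their average rank.

    The offset is an illustrative assumption: 0.5 maps rank r to the
    quantile (r - 0.5) / n of the standard normal distribution.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
            for r in ranks]
```

The transformed values keep the ordering of the raw measurements but follow an approximately standard normal distribution, which is the property that variance component linkage models rely on.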

In the study of OSAHS in CFS, linkage analyses of multiple OSAHS-related traits were performed using genotype data from the IBC array and rank-normal transformed phenotypes (Keating et al. 2008, Liang et al. 2016). The IBC array includes approximately 50,000 SNPs relevant to cardiovascular, pulmonary, hematological and sleep-related disorders (Keating et al. 2008). One region on chromosome 7 linked with AHI and two regions, on chromosomes 8 and 14, linked with AvSpO2S were identified in CFS-EAs (Figure 1.5) (Liang et al. 2016). In the later chapters of this thesis, we focus on these regions for rare variant detection.


Figure 1.6 Genetic variant frequencies and effect sizes (Manolio et al. 2009)

1.3 Rare variants and missing heritability

Genome-wide association studies (GWASs) have identified thousands of genetic variants associated with complex traits, and have provided valuable insights into their genetic architecture and biological basis (Visscher et al. 2012). However, most variants identified so far show only modest effects on disease risk or quantitative trait variation. A substantial proportion of the genetic contribution to complex traits remains unexplained, which is referred to as the “missing heritability” problem (Manolio et al. 2009). For example, the heritability of human height was estimated to be about 80%, but the more than 3,000 independent variants identified explain only ~25% of the phenotypic variability, despite being estimated from more than 700,000 people (Visscher 2008, Yengo et al. 2018).

Many theories have been proposed to explain the missing heritability problem in GWAS. Variants with large effects are often subject to strong purifying selection and therefore tend to remain rare, whereas rare variants with small effects are difficult to identify; thus, in genetic association studies, allele frequency and effect size are generally inversely related (Figure 1.6) (Manolio et al. 2009). GWASs analyze only common genetic variants with minor allele frequency (MAF) > 5%. One hypothesis is that analyses of low-frequency (0.5% ≤ MAF < 5%) and rare (MAF < 0.5%) variants can explain additional disease risk or trait heritability (Lee et al. 2014, Hemminki et al. 2011, Zuk et al. 2014). In rare Mendelian disorders, causal rare variants tend to show high penetrance (Ionita-Laza et al. 2011). In complex diseases, the penetrance levels of rare variants are mostly moderate or small (Auer & Lettre 2015).
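The frequency classes used throughout this thesis can be made concrete with a short sketch. The function names `maf_from_genotypes` and `frequency_class` are hypothetical. The text leaves a MAF of exactly 5% unassigned; treating it as low-frequency here is an assumption.

```python
def maf_from_genotypes(genotypes):
    """Minor allele frequency from diploid genotypes coded as 0, 1 or 2 copies
    of one allele; None marks a missing genotype call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    allele_frequency = sum(called) / (2 * len(called))
    return min(allele_frequency, 1 - allele_frequency)


def frequency_class(maf):
    """Classify a variant as common, low-frequency or rare by its MAF."""
    if not 0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf > 0.05:
        return "common"
    if maf >= 0.005:
        # Assumption: MAF exactly 0.05 is grouped with low-frequency variants.
        return "low-frequency"
    return "rare"
```

Under these definitions, a variant carried as a single heterozygous call among 1,000 genotyped individuals has MAF 0.0005 and is classified as rare.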

Both simulation and empirical studies have shown that low frequency and rare variants can contribute substantial fractions of heritability for complex traits and diseases. In 2017, a simulation study investigated why genetic associations have not been detected for preterm birth (Bandyopadhyay et al. 2017). This study simulated 1000 causal rare variants embedded into randomly selected subsets of 9,642 promoter regions from the 1000 Genomes Project genotype data and analyzed the correlations between rare variant aggregations and the phenotype (Bandyopadhyay et al. 2017). The results showed that the genetic association signal weakened as the proportion of causal rare variants in the embedded promoters decreased (Bandyopadhyay et al. 2017). This study showed that association signals for rare variants are difficult to detect when the causal rare variants are dispersed across the genome, which might account for an important proportion of missing heritability. Empirical studies have also reported a role for low frequency or rare variants in several human complex diseases. Studies reported association and potentially pathogenic roles of rare variants in several genes in (Purcell et al. 2014, Teng et al. 2018). Multiple studies demonstrated that a significant proportion of autism spectrum disorders (ASDs) heritability is contributed by rare variants in multiple synaptic and neuronal genes that implicate specific pathways and subcellular compartments in ASDs (Wang et al. 2018, Delahanty et al. 2011, Sanders et al. 2012, Neale et al. 2012). A study that measured genetic distance between individuals to build a multi-dimensional scaling (MDS) picture, using whole-exome genotyping data from a Mexican-American cohort and a European-ancestry cohort, suggested that low frequency and rare variants play significant roles in the genetics of major depressive disorder (Yu et al. 2018).

1.4 Rare variant association analysis

To evaluate the contribution of rare variants to complex traits and diseases, a large number of custom DNA arrays and DNA sequencing-based approaches have been developed (Tennessen et al. 2012, Huyghe et al. 2013). Many different study designs and analytic techniques have been developed to maximize the power of rare variant association studies. Association tests for rare variants often rely on aggregation tests, which evaluate the cumulative effects of multiple genetic variants in a gene or region, increasing power when multiple variants in a gene are associated with a given trait or disease (Asimit et al. 2012, Morgenthaler & Thilly 2007, Morris & Zeggini 2010, Li & Leal 2008). Numerous methods for region- or gene-based aggregation tests have been developed in recent years. Table 1.1 summarizes the general principles behind these tests.

One class of aggregation tests is the burden test, which collapses multiple genetic variants into a single genetic score and tests for association between this score and a phenotype (Asimit et al. 2012, Morgenthaler & Thilly 2007, Morris & Zeggini 2010, Li & Leal 2008). The combined multivariate and collapsing (CMC) test is an approach that combines collapsing and multivariate tests (Li & Leal 2008). In the CMC method, markers are divided into subgroups on the basis of MAF, and within each group, markers are collapsed. A multivariate test, Hotelling’s T² test, is then applied to the groups of marker data (Li & Leal 2008). Burden methods are based on the assumption that all variants in a set are causal and associated with the trait in the same direction. Thus, if a region has many non-causal variants or if both risk and protective variants are present, burden tests are less powerful.
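The collapsing idea behind burden tests can be illustrated with a minimal sketch for a quantitative trait. This is not any of the published burden methods: the names `burden_scores`, `slope` and `burden_test` are hypothetical, the default weights are all equal to one, and the permutation P-value assumes unrelated, exchangeable individuals, so it would not be valid for pedigree data without modification.

```python
import random


def burden_scores(genotypes, weights=None):
    """Collapse each individual's variants into one weighted allele count."""
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants  # assumption: unweighted collapsing
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("burden score has no variation")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def burden_test(genotypes, phenotype, n_perm=2000, seed=1):
    """Return the burden effect estimate and a two-sided permutation P-value."""
    scores = burden_scores(genotypes)
    observed = slope(scores, phenotype)
    rng = random.Random(seed)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype relationship
        if abs(slope(scores, shuffled)) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Because every variant enters the score with the same sign, a protective variant in the set pulls the estimated slope toward zero, which is the loss of power described above.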

Another class of methods uses a variance-component test within a random-effect model and tests for association by evaluating the distribution of genetic effects for a group of variants. The sequence kernel association test (SKAT) tests the weighted sum of squares of single-variant score statistics, which is robust to groupings that include variants with both positive and negative effects (Wu et al. 2011). Thus, SKAT tests are more powerful than burden tests if a region has many non-causal variants or if both risk and protective variants are present.
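The contrast between the two classes of tests can be seen by comparing their score-type statistics on a toy example. The sketch below is a simplification under stated assumptions: a quantitative trait, no covariates, unrelated individuals and unit weights. It computes only the statistics; the P-value step (in SKAT, the null distribution of the statistic is a mixture of chi-square distributions) is omitted. The function names are hypothetical.

```python
def variant_scores(genotypes, phenotype):
    """Single-variant score statistics S_j = sum_i g_ij * (y_i - mean(y))."""
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    n_variants = len(genotypes[0])
    return [sum(row[j] * r for row, r in zip(genotypes, residuals))
            for j in range(n_variants)]


def skat_q(genotypes, phenotype, weights=None):
    """SKAT-type statistic: weighted sum of squared single-variant scores."""
    scores = variant_scores(genotypes, phenotype)
    if weights is None:
        weights = [1.0] * len(scores)  # assumption: unit weights
    return sum((w * s) ** 2 for w, s in zip(weights, scores))


def burden_score_statistic(genotypes, phenotype, weights=None):
    """Burden-type statistic: square of the weighted sum of the scores."""
    scores = variant_scores(genotypes, phenotype)
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) ** 2
```

When one variant raises the trait and another lowers it by the same amount, the single-variant scores cancel in the burden statistic but not in the sum of squares, which is why the variance-component approach retains power in that setting.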


1.5 Rare variant test using pedigrees

Because low frequency or rare variants can be enriched in an extended pedigree and segregate with the phenotype, a powerful design for identifying disease associated rare variants is the family study, which usually genotypes co-affected family members and searches for variants that co-segregate with the phenotype (Zhu et al. 2010, Epstein et al. 2015). Family-based studies have successfully identified large effect, highly penetrant mutations that underlie Mendelian disorders (Ng et al. 2010a, Ng et al. 2010b).

For complex common disease, family-based studies have also identified several important risk variants. For example, the association of Apolipoprotein E-4 (ApoE) with Alzheimer’s disease (AD) was first discovered through a linkage study of multiplex familial late-onset AD, in which linkage was identified at chromosome 19q13 in 1991 (Roses 2006). ApoE ε4 is a common allele with a large effect size whose association with AD susceptibility has been robustly replicated (Roses 2006). This situation is not very common: the common variants implicated in common disease by GWAS generally have relatively small effect sizes. In rare variant analysis, family-based analysis to identify causal variants among a large set of candidate variants in sequencing data is still challenging for complex common disease (Cirulli & Goldstein 2010).

Linkage analysis searches for chromosomal segments that co-segregate with the disease phenotype through families (Borecki & Province 2008). Linkage methods are powerful for the detection of variants with large effect sizes, which are often rare for complex traits (Figure 1.6) (Bailey-Wilson & Wilson 2011). Studies adopting the strategy of combining linkage-based and genetic association methodologies improve rare variant detection. For example, a study performing GWAS and linkage successfully identified

27 modifier loci of cystic fibrosis severity, which were missed in GWAS alone (Wright et al.

2011). Our group performed combined linkage and association analysis of AvSpO2S using 139 CFS-EA families and successfully identified two low frequency haplotypes in angiopoietin-2 (ANGPT2), which significantly contributes to the variation of AvSpO2S

(Wang et al. 2016). This study performed family based association analysis in the genes under the linkage regions using the SNPs available in IBC array (Keating et al. 2008) in

CFS-EAs. The IBC array includes approximately 50,000 SNPs in cardiovascular, pulmonary, hematological and sleep-related disorders (Keating et al. 2008). Conditional linkage analysis was performed by adding the most significant SNP as a covariant in linkage analysis to test whether the added SNP was able to explain the linkage evidence.

The study also conducted haplotype association analysis using the SNPs on both sides of the index SNP. Gene-based tests using SKAT were also conducted in independent cohorts (Wang et al. 2016). However, this work did not integrate in silico functional annotation of variants, and the two identified rare variants in ANGPT2 are both non-coding. A different study from our group using the same strategy for blood pressure traits in CFS-EAs identified multiple rare coding variants in fox-1 homolog A (RBFOX1) associated with reduced systolic blood pressure (SBP) (He et al. 2017). This study examined low frequency or rare variants that were present at least twice in one family with a family-specific LOD score > 0.1 and had association test results with a marginal effect (P-value ≤ 0.1) or a large effect size (absolute regression coefficient beta ≥ 5), and it restricted the search to coding variants genotyped on the exome SNP array (He et al. 2017). These two successful studies show how linkage analysis can help detect rare variants in this project.
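The variant screening criteria described above for He et al. (2017) can be sketched as a simple filter. This is only an illustration of the stated thresholds; the `Variant` record, its field names, and the function `is_candidate` are hypothetical and are not taken from the original study.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All field names are hypothetical, chosen for illustration only.
    variant_id: str
    family_id: str
    carriers_in_family: int  # times the variant is observed in that family
    family_lod: float        # family-specific LOD score in the linkage region
    p_value: float           # single-variant association P-value
    beta: float              # regression coefficient from the association test


def is_candidate(v, min_carriers=2, lod_cutoff=0.1, p_cutoff=0.1, beta_cutoff=5.0):
    """Apply the thresholds quoted in the text to one variant."""
    # Present at least twice in a family whose family-specific LOD exceeds 0.1.
    segregates = v.carriers_in_family >= min_carriers and v.family_lod > lod_cutoff
    # Marginal association (P <= 0.1) or a large absolute effect (|beta| >= 5).
    associated = v.p_value <= p_cutoff or abs(v.beta) >= beta_cutoff
    return segregates and associated
```

A variant seen twice in a family with LOD 0.3 and P = 0.05 passes; the same variant seen only once, or in a family with LOD 0.05, does not.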


Linkage studies can detect regions of the genome that are likely to harbor rare variants, but identifying the actual sequence variant(s) responsible for these linkage signals is challenging, because the candidate regions identified by linkage analysis are often very large, spanning 20 to 40 Mb. However, in combination with whole-genome sequencing (WGS), linkage analysis can guide the search for rare variants.

Multipoint linkage analysis tests the co-segregation between a putative disease locus and a set of markers in a genomic region. The LOD score is the logarithm of the likelihood ratio and is computed by Z(x) = log10[L(x)/L(∞)], in which the numerator assumes a position, x, of the putative disease locus on the marker map and the denominator assumes the disease locus to be infinitely far away from the markers (Ott et al. 2015). The final linkage result can be obtained from multiple pedigrees by summing LOD scores at the same map position.
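As a minimal numerical sketch of the LOD score definition and of summing LOD scores across pedigrees, consider the following. The function names and the layout of the input (one list of LOD scores per pedigree, all evaluated at the same map positions) are assumptions made for illustration; in practice the likelihoods come from linkage software.

```python
import math


def lod_score(lik_at_x, lik_unlinked):
    """Z(x) = log10[L(x) / L(inf)] for one pedigree at one map position."""
    return math.log10(lik_at_x / lik_unlinked)


def summed_lod(per_pedigree_lods):
    """Sum LOD scores across pedigrees, position by position.

    per_pedigree_lods: list of lists, one inner list per pedigree,
    each holding LOD scores at the same ordered map positions.
    """
    return [sum(scores_at_position) for scores_at_position in zip(*per_pedigree_lods)]
```

A likelihood ratio of 1000 corresponds to a LOD score of 3, and a pedigree with a negative LOD at a position lowers the combined evidence there.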

1.6 Annotating variants in genetic regions

The advances in sequencing technologies in recent years have made next-generation sequencing the primary tool for identifying rare variants associated with traits and diseases. However, even with sequencing data readily accessible, understanding the underlying causes of diseases still requires accurate interpretation of genetic variants. Several approaches have been proposed to filter and prioritize genetic variants to identify causative variants from sequencing data (Cooper & Shendure 2011, Goldstein et al. 2013, MacArthur et al. 2014).

Coding variants can be prioritized based on their ability to result in protein damage. Approaches for predicting the effects of coding variants include PolyPhen-2 (Adzhubei et al. 2010), SIFT (Kumar et al. 2009) and FATHMM (Shihab et al. 2013), which are based on amino-acid substitutions; GERP++ (Davydov et al. 2010), which is based on conservation; combined annotation-dependent depletion (CADD) (Kircher et al. 2014), MutationTaster (Schwarz et al. 2010) and genome-wide annotation of variants (GWAVA) (Ritchie et al. 2014). Because those variants change the amino acid sequence and protein sequences are highly conserved, these approaches prioritize coding variants with high accuracy.

Figure 1.7 Non-coding regions affect proximal and distal regulators (Khurana et al. 2016)

Non-coding variants, which do not change the amino acid sequence and are not necessarily under as strong evolutionary constraint, can also have deleterious effects on a transcript through the regulation of transcription or splicing (Barbosa-Morais et al. 2012). Thus, for non-coding variants, variant-specific annotations of different classes and genomic scales have been developed to investigate whether a combination of regulatory annotations, genetic context and genome-wide properties can be used to assess variant function (Ritchie et al. 2014). Regulatory variants may have significant effects on gene regulation. Studies have demonstrated that there is allele-specific binding of transcription factors (TF) on a genome-wide scale and that a single variant can result in both disruption of TF binding and alteration of chromatin accessibility (Kasowski et al. 2010, McDaniell et al. 2010, Degner et al. 2012) (Figure 1.7). Numerous assays and approaches have been developed to predict the regulatory function of variants in the non-coding genome, including DNase I hypersensitive sites sequencing (DNase-seq), assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq), TF ChIP-seq, histone modification ChIP-seq, and expression quantitative trait loci (eQTL) analysis (Mathelier et al. 2015). Importantly, because many of these analyses are specific to cell type or developmental stage, most of these data are available with cell-type specific interactions (Nishizaki & Boyle 2017). In addition, many approaches using machine learning algorithms have recently been developed for genetic variant annotation, exploiting their power to build multifaceted predictive models of genetic variant function (Kotsiantis 2007). The most widely known machine learning methods for evaluating functional variants include GWAVA (Ritchie et al. 2014), CADD (Kircher et al. 2014), deleterious annotation of genetic variants using neural networks (DANN) (Quang et al. 2015), FATHMM-MKL (Shihab et al. 2015), deltaSVM (Lee et al. 2015) and DeepSEA (Zhou & Troyanskaya 2015). These methods usually incorporate both functional data and conservation data to train powerful, complex predictive models. However, there are some potential biases in these machine learning methods. For example, the training data and annotations used to train models might be enriched for variants near genes; the models may over-fit due to suboptimal parameterization; and there might be gaps in functional annotation or insufficient training data (Nishizaki & Boyle 2017).

For rare variant aggregation tests, variants with little or no effect within a collapsed region often weaken the statistical power of the analysis. Thus, filtering variants by their functional consequence is a useful strategy to reduce the number of neutral variants within a collapsed region or functional unit. For example, the CADD method integrates a range of different annotation metrics into a single measure (C score) and provides a reliable estimate of deleteriousness for variants at a genome-wide scale (Kircher et al. 2014). Using the CADD score to filter variants may reduce the number of neutral variants and keep functional variants for complex traits in rare variant association tests, and thereby improve power to identify susceptibility genes. A candidate gene analysis study published in 2016 using the UK10K sequence and lipid data suggested that filtering variants by CADD, as well as FATHMM-MKL and DANN, could identify association signals not detected in other analyses (Richardson et al. 2016). Other annotation tools, such as SIFT (Kumar et al. 2009) and PolyPhen-2 (Adzhubei et al. 2010), are also useful in rare variant analysis, but these two methods only use protein-based metrics and are therefore confined to coding regions (Lee et al. 2014).
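The filtering idea can be sketched as follows: variants in a region are first screened by allele frequency and a deleteriousness score, and only the retained variants contribute to a simple burden count. The cutoffs (`max_maf=0.05`, `min_score=15.0`), the data layout, and both function names are hypothetical choices for illustration; they are not thresholds reported in this work.

```python
def select_variants(annotations, max_maf=0.05, min_score=15.0):
    """Keep variants that are low frequency/rare and pass the score cutoff.

    annotations: dict mapping variant id -> (minor allele frequency, score),
    where the score stands in for a deleteriousness measure such as a C score.
    Both default cutoffs are illustrative assumptions.
    """
    return [
        vid
        for vid, (maf, score) in annotations.items()
        if maf <= max_maf and score >= min_score
    ]


def burden_score(dosages, selected):
    """Sum one individual's minor-allele dosages over the selected variants."""
    return sum(dosages.get(vid, 0) for vid in selected)
```

With three variants of which one is common and one has a low score, only the remaining variant enters the burden count, so likely neutral variants no longer dilute the collapsed signal.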

1.7 Mendelian randomization

Mendelian randomization is a common approach to infer the causality of an exposure on an outcome (Smith & Ebrahim 2004). Mendelian randomization uses genetic variants that are robustly associated with the exposure as instrumental variables and tests whether the exposure has a causal effect on the etiology of disease (Smith 2010). Mendelian randomization is based on a number of assumptions about the instruments. First, the instrumental variables must be sufficiently strongly associated with the exposure; second, they should not be associated with any potential confounder of the exposure-outcome relationship; third, they should be independent of the outcome given the exposure and all confounders of the exposure-outcome relationship (Verbanck et al. 2018). Failing to meet any of these assumptions would lead to a biased estimate of the causal effect (Verbanck et al. 2018).

One conventional analysis method for Mendelian randomization is the inverse-variance weighted (IVW) method, which assumes that all genetic variants satisfy the instrumental variable assumptions. In the IVW method, the causal effect of the exposure on the outcome can be estimated using the jth variant as the ratio of the gene-outcome association and the gene-exposure association estimates (Lawlor et al. 2008):

$$\hat{\beta}_j = \frac{\hat{\beta}_{Yj}}{\hat{\beta}_{Xj}} \qquad (1)$$

where $\hat{\beta}_{Yj}$ is the estimated coefficient from the univariate regression of the outcome on the jth genetic variant and $\hat{\beta}_{Xj}$ is from the univariate regression of the exposure on the jth genetic variant.

If the genetic variants are uncorrelated (not in linkage disequilibrium), then the ratio estimates from each genetic variant can be combined into an overall estimate using the formula from the meta-analysis method (Johnson 2013):

$$\hat{\beta}_{IVW} = \frac{\sum_j \hat{\beta}_{Xj}\,\hat{\beta}_{Yj}\,se(\hat{\beta}_{Yj})^{-2}}{\sum_j \hat{\beta}_{Xj}^{2}\,se(\hat{\beta}_{Yj})^{-2}} \qquad (2)$$

This is referred to as IVW estimator (Burgess et al. 2013). If all genetic variants satisfy the instrumental variable assumptions and pleiotropic effects of the genetic variants are not present, the IVW estimate is a consistent estimate of the causal effect.
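The ratio estimates and the IVW estimator can be computed directly from summary statistics. The sketch below assumes three parallel lists (gene-exposure estimates, gene-outcome estimates, and standard errors of the gene-outcome estimates) for uncorrelated variants; the function and argument names are illustrative and not taken from any particular software package.

```python
def ratio_estimates(beta_x, beta_y):
    """Per-variant causal estimates: gene-outcome over gene-exposure association."""
    return [by / bx for bx, by in zip(beta_x, beta_y)]


def ivw_estimate(beta_x, beta_y, se_y):
    """Inverse-variance weighted combination of the per-variant ratio estimates.

    Each variant is weighted by se(beta_Yj)^-2, so this equals
    sum(bx * by / se^2) / sum(bx^2 / se^2).
    """
    numerator = sum(bx * by / se ** 2 for bx, by, se in zip(beta_x, beta_y, se_y))
    denominator = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_x, se_y))
    return numerator / denominator
```

When every variant gives the same ratio estimate, the IVW estimate equals that common value regardless of the standard errors, which is a quick sanity check on an implementation.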

However, genetic variants can be invalid instrumental variables if they exhibit horizontal pleiotropy, which will lead to a biased causal estimate (Burgess & Thompson 2017). Mendelian randomization-Egger (MR-Egger) is an analysis method for Mendelian randomization that can correct the bias induced by pleiotropy of variants (Burgess & Thompson 2017). MR-Egger modifies the linear regression of the gene-outcome estimates on the gene-exposure estimates by not setting the intercept term to zero:

$$\hat{\beta}_{Yj} = \beta_{0E} + \beta_{1E}\,\hat{\beta}_{Xj} + \epsilon_{Ej} \qquad (3)$$

where the parameter $\beta_{0E}$ is the intercept, $\beta_{1E}$ is the slope (MR-Egger estimate) and $\epsilon_{Ej}$ is the residual term. If there is no intercept term in the regression model, then the MR-Egger estimate equals the IVW estimate (Burgess et al. 2015). The value of the intercept term is an estimate of the average pleiotropic effect of the genetic variants. The pleiotropic effect is the effect of the genetic variant on the outcome that is not mediated through the exposure.
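A minimal sketch of the MR-Egger regression follows, fitted as a weighted least-squares line through the per-variant estimates. Two points are assumptions rather than statements from the text: the weights are taken to be se(beta_Yj)^-2, matching the IVW weighting, and the variants are assumed to be already oriented so that the gene-exposure estimates are positive. The function name is illustrative.

```python
def mr_egger(beta_x, beta_y, se_y):
    """Weighted regression of beta_Y on beta_X with a free intercept.

    Returns (intercept, slope). The slope is the MR-Egger causal estimate and
    the intercept estimates the average pleiotropic effect.
    Weights se_y**-2 are an assumption made to mirror the IVW weighting.
    """
    weights = [1.0 / se ** 2 for se in se_y]
    total = sum(weights)
    # Weighted means of the gene-exposure and gene-outcome estimates.
    mean_x = sum(w * x for w, x in zip(weights, beta_x)) / total
    mean_y = sum(w * y for w, y in zip(weights, beta_y)) / total
    # Weighted covariance and variance give the slope of the fitted line.
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, beta_x, beta_y))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, beta_x))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

If every variant has the same direct effect of 0.02 on the outcome in addition to a causal effect of 0.5 per unit of exposure, the fit recovers an intercept of 0.02 and a slope of 0.5, whereas a regression forced through the origin would be biased.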

Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO) is a method developed to detect and correct for biological pleiotropic outliers in multi-instrument summary-level MR testing (Verbanck et al. 2018). MR-PRESSO extends the framework of the IVW method by detecting whether a subset of variants deviates significantly from the fitted weighted linear regression line between the set of effect sizes on the outcome and the effect sizes on the exposure (Verbanck et al. 2018). The MR-PRESSO global test evaluates the presence of horizontal pleiotropy (Verbanck et al. 2018).


1.8 Specific aims

Aim 1. Identifying low frequency and rare variants associated with AvSpO2S through linkage and association analysis using CFS and additional TOPMed cohorts sequencing data.

We previously performed linkage analysis of AvSpO2S and identified a linkage signal on chromosome 8p23 in CFS-EAs. In this section we develop an analysis approach to detect low frequency and rare variants in this linkage region and apply it to the Trans-Omics for Precision Medicine (TOPMed) sequencing data.

Aim 2. Assessing the independence of the burden and SKAT association analysis and linkage evidence obtained in the same data.

A previous simulation study of a common variant indicated that when there is linkage and association in the data, the test of linkage is positively correlated with the test of association, although the type I error rate of neither test is affected (Chung et al. 2007). However, it is still unknown whether the type I error will be inflated when testing for association of multiple low frequency and rare variants under a linkage region when those variants are not associated with the phenotype. In this section, we empirically assess the independence of the burden and SKAT association tests under a linkage peak when the low frequency and rare variants are not associated with the phenotype.

Aim 3. Developing an analysis pipeline to search for low frequency and rare variants through linkage analysis using sequencing data.

In Aims 1 and 2, we develop an analysis approach to detect low frequency and rare variants associated with OSAHS-related phenotypes through linkage and association analysis. In this section, we develop a pipeline to perform the analysis and make it publicly available to the scientific community.


CHAPTER 2: IDENTIFYING LOW FREQUENCY AND RARE VARIANTS ASSOCIATED WITH AVSPO2S USING PEDIGREES

2.1 Introduction

Obstructive sleep apnea-hypopnea syndrome (OSAHS) is a complex disorder characterized by the occurrence of repetitive episodes of complete or partial airway obstruction during sleep associated with snoring, intermittent hypoxemia, and daytime sleepiness (Mbata & Chukwuka 2012). Epidemiological studies indicate that OSAHS affects approximately 2 to 6% of children and more than 15% of adults (Badran et al. 2015, Capdevila et al. 2008, Rosen et al. 2003). The recurrent episodes of upper airway obstruction that occur with OSAHS lead to disruptive snoring, sleep fragmentation and daytime sleepiness, which negatively impact quality of life and daytime performance (Mbata & Chukwuka 2012). OSAHS also results in overnight hypoxemia and heightened sympathetic activation, which have been shown to increase risk for numerous adverse health outcomes, including diabetes (Punjabi et al. 2004), atrial fibrillation (Kanagala et al. 2003), cancer (Nieto et al. 2012), and cognitive decline (Yaffe et al. 2011).

Development of pharmacological interventions for OSAHS and its consequences, such as hypoxemia, has been limited by a lack of understanding of its genetic basis and molecular pathophysiology. The traditional clinical OSAHS measure for genetic study is the apnea-hypopnea index (AHI), which shows evidence of significant familial aggregation (Redline et al. 1992) and heritability estimates of 0.20 to 0.40 (Patel et al. 2004, Liang et al. 2016). However, AHI requires collection of multiple physiological signals during sleep, and does not entirely capture the multifactorial risk of OSAHS (Patel et al. 2008, Pham & Polotsky 2016), nor its clinical consequences (Kendzerska et al. 2014, Oldenburg et al. 2016).

One important clinical consequence is that OSAHS patients suffer lowered arterial oxyhemoglobin saturation (SpO2) levels. SpO2 reflects the adequacy of ventilation and oxygen transport, fundamental physiological properties that are tightly regulated at the molecular and cellular levels to ensure delivery of oxygen to vital tissues. Reduction in oxyhemoglobin saturation leads to increased rates of mortality and cognitive decline (Antonelli Incalzi et al. 2003). Given its clinical relevance, oxygen saturation is commonly monitored in patients with OSAHS to identify those at risk for adverse outcomes and to assess the success of therapy. The heritability of average SpO2 during sleep (AvSpO2S) was estimated to be 0.41 in individuals with European ancestry in CFS (Liang et al. 2016). Since AvSpO2S can be reliably and relatively easily measured with a finger-placed pulse oximeter, it has the potential to be scalable in large-scale genetic studies. Studying the genetic underpinnings of AvSpO2S can help elucidate the bases for variation in hypoxemia-related stresses, and may ultimately explain differences in susceptibility to many OSAHS-related morbidities (Nieto et al. 2012, Yaffe et al. 2011). This information may also inform underlying susceptibility to hypoxemia in the setting of lung injury or disease (Kolilekas et al. 2013, Corte et al. 2012, Ross et al. 2012, Lacasse et al. 2011).

The genetic analysis of AvSpO2S is in its infancy, with the first genome-wide association study reported in 2016 (Cade et al. 2016). This GWAS analyzed three OSAHS-related traits in 12,558 Hispanic/Latino Americans and reported two SNPs in KCNK1/SLC35F3 and ATP2B4 reaching suggestive genome-wide significance (P < 5×10⁻⁷) (Cade et al. 2016). A genetic study of resting SpO2, which is an important clinical index of chronic obstructive pulmonary disease (COPD), has also been reported: in 2014, a GWAS of resting SpO2 in 4,568 non-Hispanic whites reported multiple loci reaching suggestive genome-wide significance (P < 5×10⁻⁷). These studies highlighted the importance of investigating the genetics of SpO2 as a complex trait in different racial groups. Notably, these studies have only examined common variants. For complex traits, rare variants are suggested to play a greater role in heritability than anticipated in the common disease-common variant hypothesis (Eichler et al. 2010). Over the past several years, rapid advances in DNA sequencing technologies have enabled a more complete search for low frequency and rare variants and the investigation of their roles in complex traits (Goldfeder et al. 2017). Analyzing low frequency and rare variants for AvSpO2S using whole-genome sequencing (WGS) data is therefore an important next step.

The Trans-Omics for Precision Medicine (TOPMed) program is an initiative sponsored by the National Heart, Lung and Blood Institute of the National Institutes of Health (NIH) to improve the understanding of the biological processes that contribute to heart, lung, blood and sleep disorders (https://www.nhlbiwgs.org/). The TOPMed program has generated WGS data on over 100,000 individuals from multiple cohort studies at >30× depth, including sequencing data from seven studies with objective assessment of OSAHS. A variant imputation server using TOPMed data also allows for high-quality imputation of non-sequenced genotype chip data (Das et al. 2016). This program provides the ability to examine low frequency and rare variants for OSAHS-related traits, including AvSpO2S, in unprecedented detail.

Recently, many statistical approaches for rare variant association analyses have been developed for both unrelated samples (Lee et al. 2012b, Wu et al. 2011, Li & Leal 2008) and family data (Zhu et al. 2010, Epstein et al. 2015, Chen et al. 2015a). Because low frequency and rare variants can be enriched in families, family-based studies are powerful for identifying low frequency and rare variants (Zhu et al. 2010, Epstein et al. 2015). Family-based linkage studies have been shown to have good power to detect loci in which variants have large effect sizes, which tend to be rare in the population (Bailey-Wilson & Wilson 2011). Linkage studies usually detect genomic regions that are likely to harbor rare variants with large effects for Mendelian diseases or complex traits. With current DNA sequencing methods, it is now possible to integrate linkage information from families into association analysis of WGS data to identify low frequency and rare variants that account for variation of complex traits such as AvSpO2S.

In this chapter, we present a study based on a strategy that combines linkage analysis results with TOPMed WGS and imputed genotype data, with the aim of increasing statistical power to identify low frequency or rare variants associated with AvSpO2S. We also incorporate variant functional annotations and different types of omics data to examine the identified variants and genes. We examined 2,966 individuals with objectively measured AvSpO2S and WGS data in conjunction with data from 5,042 individuals with imputed genotype data.


2.2 Material and Methods

2.2.1 Description of study samples

Two types of data were used to identify low frequency and rare variants. Four studies have WGS data in a subset of participants by the TOPMed program, which are referred to as “WGS” studies; another two studies have array-based genotyping later imputed using the TOPMed imputation server (as described below), which are referred to as “Imputed” studies. Some studies (MESA, ARIC and FHS) with WGS contributed imputed study data from additional array-based genotyped individuals (Table 2.1). Each study had a protocol approved by its respective Institutional Review Board and participants provided informed consent.

Table 2.1 Sample characteristics of TOPMed WGS and imputed genotype studies.

| Study | Race | Total No. of Subjects | Males (%) | Age | BMI | AvSpO2S |
|---|---|---|---|---|---|---|
| WGS studies | | | | | | |
| Cleveland Family Study (CFS) | EA | 487 | 228 (46.8) | 41.4 (19.5) | 30.1 (8.6) | 93.9 (3.4) |
| Cleveland Family Study (CFS) | AA | 508 | 222 (43.6) | 38.0 (19.3) | 31.7 (9.6) | 94.6 (3.8) |
| Framingham Heart Study (FHS) | EA | 461 | 230 (49.9) | 60.2 (8.5) | 28.3 (5.1) | 94.6 (2.0) |
| Venous Thromboembolism Project (VTE) | EA | 471 | 247 (52.4) | 64.0 (6.0) | 28.5 (4.9) | 94.3 (1.9) |
| Multi-Ethnic Study of Atherosclerosis (MESA) | EA | 616 | 291 (47.2) | 68.6 (9.1) | 27.8 (5.1) | 94.0 (1.8) |
| Multi-Ethnic Study of Atherosclerosis (MESA) | AA | 423 | 196 (46.3) | 68.8 (9.1) | 30.2 (5.6) | 94.5 (2.0) |
| Imputed genotype studies | | | | | | |
| Atherosclerosis Risk in Communities (ARIC) | EA | 1,083 | 549 (50.7) | 60.7 (5.8) | 29.1 (5.2) | 94.3 (2.1) |
| Framingham Heart Study (FHS) | EA | 181 | 90 (49.7) | 57.5 (9.7) | 28.9 (5.2) | 94.7 (1.8) |
| Multi-Ethnic Study of Atherosclerosis (MESA) | EA | 93 | 42 (44.1) | 67.4 (9.0) | 27.0 (5.5) | 94.9 (2.2) |
| Osteoporotic Fractures in Men Sleep Study (MrOS) | EA | 2,178 | 2,178 (100) | 76.7 (5.6) | 27.2 (3.8) | 93.9 (1.7) |
| Western Australian Sleep Health Study (WASHS) | EA | 1,507 | 889 (59.0) | 52.3 (13.7) | 31.8 (7.9) | 84.6 (7.9) |

WGS studies

The Cleveland Family Study (CFS) is a family-based longitudinal study that includes participants with laboratory diagnosed sleep apnea, their family members and neighborhood control families followed between 1990 and 2006. Four examinations over 16 years provided measurements of sleep apnea with overnight polysomnography, anthropometry, and other related phenotypes, as detailed previously (Larkin et al. 2010, Redline et al. 2003). The CFS European Americans (EA) were genotyped by the IBC and Human OmniExpress+Exome arrays (Keating et al. 2008). In total, 487 EAs and 508 African Americans (AA) in CFS were whole genome sequenced in the Freeze 4 release of the NHLBI TOPMed WGS project (Table 2.1).

Data from the subsets of Atherosclerosis Risk in Communities Study (ARIC) and Framingham Heart Study (FHS) cohort participants who took part in the Sleep Heart Health Study (SHHS) were analyzed (Feinleib 1985, Quan et al. 1997). SHHS is a community-based study that built upon existing cohorts (e.g. ARIC, FHS) and included a baseline examination (1995 – 1998) with both in-home polysomnography and questionnaires (Quan et al. 1997). A subset of ARIC participants with AvSpO2S and sequencing data included 471 EAs from the Maryland and Minnesota sites, as part of the TOPMed Venous Thromboembolism (VTE) project. An additional 1,083 ARIC subjects with AvSpO2S who were not included in TOPMed and were genotyped using the Affymetrix 6.0 array were included in replication analysis. A portion of FHS participants included 461 EAs with AvSpO2S who also had whole genomes sequenced in the NHLBI TOPMed WGS project (Table 2.1). An additional 181 FHS subjects with AvSpO2S who were not included in TOPMed and were genotyped using the Affymetrix 6.0 array were included in replication analysis.

The Multi-Ethnic Study of Atherosclerosis (MESA) studies the characteristics and risk factors of subclinical cardiovascular disease in multiple ethnic groups in community samples from Baltimore MD, Chicago IL, Los Angeles CA, New York NY, Minneapolis/St. Paul MN, and Winston-Salem NC. The first examination of 6,429 individuals occurred in 2000, with four subsequent examinations and ongoing follow-up. A sleep exam that included overnight polysomnography, as described previously (Chen et al. 2015b), was performed in conjunction with Exam 5 (2010-2013) and included 616 EAs and 423 AAs who were also whole genome sequenced in the NHLBI TOPMed WGS project (Table 2.1). An additional 93 EAs with AvSpO2S who were not included in TOPMed and were genotyped using the Affymetrix 6.0 array as part of the NHLBI Candidate Gene Association Resource (CARe) and SNP Health Association Resource (SHARe) project were included in replication analysis.

Imputed genotype studies

The Osteoporotic Fractures in Men Sleep Study (MrOS) is a multi-center prospective, longitudinal, observational study of risk factors for osteoporosis and osteoporotic fractures. The MrOS study enrolled 5,995 community-dwelling men aged 65 and older during the baseline examination between 2000 and 2002. The study design, method of recruitment and demographics of the initial cohort have been previously published (Orwoll et al. 2005, Blank et al. 2005). The MrOS Sleep Study visit occurred on average 3.4 ± 0.5 years (range 1.9-4.9) after the baseline examination, between December 2003 and March 2005. Of the 5,995 MrOS participants, 3,135 participated in the MrOS Sleep Study, which included overnight polysomnography similar to that used in SHHS (Table 2.1). This analysis included 2,178 MrOS participants who were genotyped using the HumanOmni1-Quad_v1-0_H platform for replication analysis.

The Western Australian Sleep Health Study (WASHS) is a study of patients who were referred to the sole public sleep clinic in Western Australia for a presumed sleep disorder, the majority for snoring and/or obstructive sleep apnea. The protocol was developed in 2005 with full-scale data collection underway from April 2006. We included 1,507 participants of European ancestry who had laboratory-based polysomnography and were genotyped and imputed for replication analysis (Table 2.1).

In addition to WGS performed by TOPMed, imputed genotype data were available for additional members of the cohorts described above. ARIC contributed 1,083 EA individuals (Affymetrix 6.0). FHS contributed 181 EA individuals (Affymetrix 500k). MESA contributed 93 EA individuals (Affymetrix 6.0) (Table 2.1).

2.2.2 Overview of the method

Linkage analysis searches for chromosomal segments that co-segregate with the disease phenotype through families (Borecki & Province 2008). In practice, linkage analysis directly examines the transmission across generations of both disease phenotype and marker alleles within pedigrees, evaluating recombination events that suggest whether the marker is linked to a causal locus (Hu et al. 2014). In multipoint linkage analysis, the location of the causal locus is considered in combination with many linked loci. The distance between these loci can be extracted from a genetic map. Multipoint linkage analysis can localize a causal locus between two markers and maximize the informativeness of a series of markers. The LOD score is the logarithm of the likelihood ratio and is computed by Z(x) = log10[L(x)/L(∞)], in which the numerator assumes a position, x, of the putative disease locus on the marker map and the denominator assumes the disease locus to be infinitely far away from the markers (Ott et al. 2015). The final multipoint linkage result can be obtained from multiple pedigrees by summing LOD scores at the same map position.

In this study, we develop an analysis strategy based on the hypothesis that if identified linkage regions do not overlap with significant loci identified by common variant association studies, the linkage peaks are likely to be driven by multiple low frequency or rare variants buried in those genomic regions. We first perform traditional multipoint linkage analysis using a limited number of common variants, then analyze low frequency and rare variants in the genes within the linkage region using high depth sequencing data. When observed linkage evidence is attributed to low frequency or rare variants, these variants will be more likely to be present or enriched within families with relatively larger family specific LOD scores.

Our study has two stages. In stage I, we conduct linkage analysis using common variants in CFS EA families (CFS stage I families). Focusing on a genomic region with linkage evidence, we analyze family specific LOD scores and identify candidate families that potentially carry causal variants. Next, we search for low frequency and rare variants that segregate within those families and conduct gene-based association tests using the identified variants in each gene. It has been suggested that gene-based tests using high-density sequencing data can be improved by filtering variants based on predicted variant functional annotation (Friedrichs et al. 2016); we thus incorporate functional annotation information to filter variants in gene-based association tests to identify genes.

In stage II, we conduct independent validation of the identified variants and genes in the remaining cohorts (Stage II cohorts) with either sequencing data or imputed genotype data. We also develop a strategy to test whether the identified multiple low frequency or rare variants are able to explain the linkage evidence observed in CFS stage I families.

We conduct conditional linkage analysis by including a polygenic score calculated by the identified variants and their effect sizes estimated in Stage II cohorts. Functional genomics analysis using genomic annotations, gene expression and methylation analysis are also incorporated to dissect the potential mechanisms underlying the identified genes.

Generally, this analysis provides a framework for integrating linkage information and functional annotations with sequencing data to identify low frequency and rare variants that account for significant variation of polygenic human phenotypes.
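The polygenic score used as a covariate in the conditional linkage analysis can be sketched as a weighted sum of genotype dosages. The sketch assumes the score is the sum, over the identified variants, of the Stage II effect size multiplied by the individual's allele dosage; the dictionary layout and the function name are hypothetical.

```python
def polygenic_score(dosages, effects):
    """Weighted sum of allele dosages for one individual.

    dosages: dict mapping variant id -> allele dosage (0, 1 or 2, or imputed dosage)
    effects: dict mapping variant id -> effect size estimated in the Stage II cohorts
    Variants without an effect estimate are ignored.
    """
    return sum(effects[vid] * dosage for vid, dosage in dosages.items() if vid in effects)
```

If the linkage signal weakens after adjusting the phenotype for this score, the identified variants are able to account for at least part of the linkage evidence.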

2.2.3 Primary phenotype

The quantitative phenotypic outcome was average arterial oxyhemoglobin saturation measured continuously during sleep (AvSpO2S) using finger pulse oximetry (all using NONIN oximetry boards) as part of polysomnography. Intermittent waking and SpO2 artifacts were manually edited from all records. Since AvSpO2S was skewed, rank normal transformation was conducted in each cohort separately.
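A rank normal transformation replaces each value by the standard normal quantile of its rank. The sketch below is one common variant and is an assumption about the details: ties receive their average rank, and rank r among n values is mapped to the quantile at (r - 0.5)/n. The text does not state which offset was used, so the `offset` parameter and the function name are illustrative.

```python
from statistics import NormalDist


def rank_normal_transform(values, offset=0.5):
    """Map values to standard normal quantiles of their (tie-averaged) ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]
```

Applying the function within each cohort separately, as described above, keeps cohort-level differences in mean saturation from influencing the transformed values.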

2.2.4 Whole-genome sequencing

The National Heart, Lung and Blood Institute’s TOPMed (nhlbiwgs.org) phase I (October 2014-February 2016) included 11 studies covering a range of heart, lung, blood, and sleep disorder phenotypes and a total of 20,000 samples. The CFS, FHS, VTE and MESA studies are part of this phase I with 2,035 sequenced individuals (Table 2.1). The samples were sequenced at an average >30x depth of coverage at the Broad Institute, University of Washington, and Baylor College of Medicine. Individual genetic variants across the genome were identified in a joint calling of all samples conducted by the TOPMed Informatics Resource Center (University of Michigan), which also performed centralized read mapping and genotype calling, computed variant quality metrics, filtered variants, and reported samples that failed to meet these quality metrics. Data management, quality control (QC) to ensure correct sample identification, and general study coordination were provided by the TOPMed Data Coordinating Center. Variants with an average depth < 10x, missing rate > 5% and an internal QC score QUAL < 127 were excluded. Methods for WGS data acquisition and QC are described in ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?id=phd006969.1.

2.2.5 Imputation of replication cohorts MrOS, FHS, MESA, ARIC and WASHS.

Genotype imputation was performed using the Michigan Imputation Server (Das et al. 2016) (https://imputationserver.sph.umich.edu/). The reference samples are the 62,784 subjects in TOPMed Freeze 5b who have been whole genome sequenced.

Consortium imputation preparation script (https://www.well.ox.ac.uk/~wrayner/tools/) and confirmed using Ensembl variant allele checks and internal QC were performed on the server. Study-level data were imputed separately, and polymorphic variants with an r2 score > 0.8 were retained. Individuals with European ancestry in the five cohorts with imputed genotype data were used for independent replication.

2.2.6 Linkage analysis of AvSpO2S

We and others have demonstrated that phenotype-associated rare variants can be enriched in families, which leads to linkage evidence (Zhu et al. 2010, Jun et al. 2018).

We performed linkage analysis of AvSpO2S in 617 European Americans from 132 families in CFS. These subjects were genotyped using the Illumina Human OmniExpress+Exome chip. Using PLINK, we applied pairwise linkage disequilibrium (LD) pruning with a window size of 50 kb, a step size of 5 variants, and an r² threshold of 0.2. We also required a minor allele frequency (MAF) ≥ 0.2. This resulted in 1,673 SNPs on chromosome 8. To adjust AvSpO2S for covariates, two models were investigated. Model 1 included the covariates age, age², gender, and age × gender, reflecting non-linear age associations and age-gender interactions, and ten principal components to correct for population stratification (Zhu et al. 2008). Model 2 included all the covariates in Model 1 with the addition of body mass index (BMI) and BMI². BMI was defined as weight divided by the square of height. BMI and BMI² were used in Model 2 given the high, but non-linear, correlation of BMI with indices of OSAHS severity. The corresponding residuals were used for linkage analysis.
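As an illustration of the covariate adjustment, the following minimal sketch regresses a trait on the Model 1 covariates by ordinary least squares and rank-normalizes the residuals. The data are simulated, all variable and function names are hypothetical, the principal components are omitted for brevity, and the rank-based inverse normal transform with an offset of 0.5 is an assumption, since the exact transform is not specified here.

```python
import numpy as np
from statistics import NormalDist

def covariate_residuals(y, covariates):
    """Ordinary least squares residuals of y on the covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def rank_normalize(x):
    """Rank-based inverse normal transform (assumed offset 0.5, ties not handled)."""
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n
    nd = NormalDist()
    return np.array([nd.inv_cdf((r - 0.5) / len(x)) for r in ranks])

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(10, 80, n)
sex = rng.integers(0, 2, n).astype(float)
# Model 1 design without principal components: age, age^2, sex, age x sex
model1 = np.column_stack([age, age**2, sex, age * sex])
avspo2 = 97 - 0.03 * age + rng.normal(0, 1, n)     # simulated trait, not study data
resid = covariate_residuals(avspo2, model1)
z = rank_normalize(resid)
```

Model 2 would add BMI and BMI² as two further columns of the design matrix.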

Variance components linkage analysis was conducted using the software package Merlin (Abecasis et al. 2002). In this analysis, the total covariance matrix ($\Sigma$) of the rank-normalized AvSpO2S residuals in a pedigree is decomposed into three variance components: the variance due to the major quantitative trait locus ($\sigma^2_{QTL}$), the variance due to the random polygenic effect ($\sigma^2_G$), and the variance due to the random environmental effect ($\sigma^2_E$):

$$\Sigma = \Pi\sigma^2_{QTL} + 2\Phi\sigma^2_G + I\sigma^2_E.$$

The elements of the $\Pi$ matrix are the identical-by-descent probabilities at the tested quantitative trait locus between two members $i$ and $j$ in a family, $\Phi$ is the kinship matrix and $I$ is the identity matrix. In particular, when $i = j$, the total variance of a trait is $\sigma^2 = \sigma^2_{QTL} + \sigma^2_G + \sigma^2_E$. The null hypothesis of no linkage is $H_0: \sigma^2_{QTL} = 0$ and the alternative hypothesis is $H_A: \sigma^2_{QTL} > 0$. Merlin uses a likelihood ratio test to test the null hypothesis.
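To make the variance components model concrete, the sketch below assembles the covariance matrix for a toy sib pair and converts a log-likelihood difference into a LOD score. It is not Merlin's implementation: the function names are hypothetical, the IBD sharing value and variance components are arbitrary illustrative numbers, and the maximization of the likelihood under each hypothesis is left out.

```python
import numpy as np

def vc_covariance(pi_hat, kinship, s2_qtl, s2_g, s2_e):
    """Sigma = Pi * s2_qtl + 2 * Phi * s2_g + I * s2_e for one pedigree."""
    n = kinship.shape[0]
    return pi_hat * s2_qtl + 2.0 * kinship * s2_g + np.eye(n) * s2_e

def mvn_loglik(y, sigma):
    """Log-likelihood of zero-mean multivariate normal residuals y."""
    n = len(y)
    _, logdet = np.linalg.slogdet(sigma)
    quad = y @ np.linalg.solve(sigma, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def lod_score(loglik_alt, loglik_null):
    """LOD score: log10 of the likelihood ratio."""
    return (loglik_alt - loglik_null) / np.log(10)

# Toy sib pair: kinship 0.25 between sibs, 0.5 with oneself
kinship = np.array([[0.5, 0.25], [0.25, 0.5]])
pi_hat = np.array([[1.0, 0.5], [0.5, 1.0]])    # illustrative IBD sharing at the locus
sigma = vc_covariance(pi_hat, kinship, s2_qtl=0.2, s2_g=0.3, s2_e=0.5)
```

The diagonal of `sigma` equals the sum of the three variance components, as in the case $i = j$ above.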

2.2.7 Simulated family data

We performed simulations to examine whether a family-specific LOD score threshold of 0.1 would improve the statistical power for testing low frequency and rare variant association in either burden or SKAT (Wu et al. 2011, Lee et al. 2012a) association analysis. We simulated 200 nuclear families with an average family size of 6 to mimic the CFS European ancestry cohort, which has an average family size of 5.6. We applied a coalescent model (COSI) to simulate genetic variants in a 50 kb genomic region (Schaffner et al. 2005). The average number of variants in the 50 kb region was around 400. We simulated a variety of causal variant models by varying the number of causal variants, their effect sizes and their directions.

2.2.8 Analysis of CFS stage I families

We estimated family-specific LOD scores at SNP rs11782819, which had the highest LOD score in the linkage analysis of AvSpO2S on chromosome 8p23 in CFS families. We fitted the family-specific LOD scores of the 118 CFS families with a mixture of two normal distributions using the R package ‘mixtools’ (https://rdrr.io/cran/mixtools/).
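The mixture fit can be illustrated with a small expectation-maximization routine. This Python sketch is a stand-in for the R package used in the analysis, not a reproduction of it; the function names, starting values, iteration count and simulated data are all assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def fit_two_normals(x, n_iter=200):
    """EM for a two-component normal mixture; returns weights, means, sds sorted by mean."""
    x = np.asarray(x, dtype=float)
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])   # crude starting values
    sd = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        dens = np.stack([w[k] * normal_pdf(x, mu[k], sd[k]) for k in range(2)])
        resp = dens / dens.sum(axis=0)
        # M-step: update weights, means and standard deviations
        nk = resp.sum(axis=1)
        w = nk / len(x)
        mu = (resp @ x) / nk
        sd = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
        sd = np.maximum(sd, 1e-6)                                 # guard against collapse
    order = np.argsort(mu)
    return w[order], mu[order], sd[order]

rng = np.random.default_rng(1)
simulated = np.concatenate([rng.normal(0, 1, 600), rng.normal(6, 1, 400)])  # not study data
weights, means, sds = fit_two_normals(simulated)
```

With family-specific LOD scores as input, the two fitted means would correspond to families without and with variants contributing to linkage.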

We took the top 18 families, those with family-specific LOD scores larger than 0.1, as the families that potentially carry low frequency or rare AvSpO2S variants. There are 487 CFS EAs who were sequenced through TOPMed, with an average sequencing depth of 38x (Table 2.1). We focused on a target region of chr8:1,020,000-21,780,000 (GRCh37/hg19), which spans around 10 Mb on either side of SNP rs11782819. We observed 212,282 variants with MAF < 0.05 that passed QC filters in this region. We focused on the variants located in the 105 protein-coding genes or within 5 kb upstream or downstream of these genes. Further, to search for variants that can potentially account for the observed linkage evidence, we filtered out variants that were present at most once in each of the 18 selected families, reducing the number of variants to 20,168. We filtered out the genes with only one selected variant because of low statistical power. Among the 105 genes, 20 have at least 2 such variants that are also functional coding, defined as missense, in-frame deletion/insertion, stop gained/lost, start gained/lost, splice acceptor/donor, or initiator/start codon (Table 2.2). For the remaining variants, which were not functional coding, we applied the CADD PHRED score (Kircher et al. 2014), which estimates the likely impact on the encoded protein and the deleteriousness of a variant. We used a CADD score > 10 as the threshold to filter the variants, resulting in 709 variants distributed across 48 genes (Table 2.2). Both gene-based burden and SKAT tests were performed in the CFS EA cohort for each gene, analyzing functional coding and non-coding variants separately.
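The filtering steps above can be sketched as follows. The `Variant` record, its field names and the toy data are hypothetical, and reading "present at most once" as "fewer than two carriers within a family" is an assumption about the rule rather than a statement of the exact implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Variant:
    vid: str
    gene: str
    maf: float
    functional_coding: bool
    cadd_phred: float
    carriers_by_family: dict   # family id -> number of carriers (assumed representation)

def select_variants(variants, selected_families, maf_max=0.05, cadd_min=10.0):
    kept = []
    for v in variants:
        if v.maf >= maf_max:
            continue                                   # keep low frequency and rare variants only
        if not any(v.carriers_by_family.get(f, 0) >= 2 for f in selected_families):
            continue                                   # must be shared within a selected family
        if not v.functional_coding and v.cadd_phred <= cadd_min:
            continue                                   # non-coding variants need CADD PHRED > 10
        kept.append(v)
    # Coding and non-coding sets are analyzed separately; drop sets with a single variant
    counts = Counter((v.gene, v.functional_coding) for v in kept)
    return [v for v in kept if counts[(v.gene, v.functional_coding)] >= 2]

toy = [
    Variant("v1", "G1", 0.01, True, 5.0, {"F1": 2}),
    Variant("v2", "G1", 0.02, True, 3.0, {"F2": 3}),
    Variant("v3", "G1", 0.01, False, 12.0, {"F1": 2}),           # only non-coding variant in G1
    Variant("v4", "G2", 0.20, True, 20.0, {"F1": 4}),            # too common
    Variant("v5", "G2", 0.01, False, 15.0, {"F1": 1, "F2": 1}),  # never shared within a family
]
selected = select_variants(toy, {"F1", "F2"})
```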

2.2.9 Gene-based association tests

Table 2.2 Selected coding and noncoding rare variants on 105 genes in target region

Start and End are positions in GRCh37. Coding: number of selected functional coding variants. Non-coding: number of selected non-coding variants with CADD PHRED > 10.

Gene            Chr  Start      End        Coding  Non-coding
DLGAP2          8    1444533    1661624    3       1
CLN8            8    1706933    1739730    0       0
ARHGEF10        8    1767152    1911807    4       1
KBTBD11         8    1919570    1960098    0       1
MYOM2           8    1988171    2098379    10      10
CSMD1           8    2787879    4857494    10      82
MCPH1           8    6259113    6506139    1       12
ANGPT2          8    6352173    6425925    0       6
AGPAT5          8    6560879    6622164    1       13
DEFB1           8    6723100    6740539    0       1
DEFA6           8    6777216    6788596    0       0
DEFA4           8    6792998    6800860    1       1
DEFA1           8    6831642    6842577    0       0
DEFA1B          8    6852250    6861709    0       0
DEFA3           8    6871189    6880819    0       0
DEFA5           8    6907837    6919252    1       0
ZNF705G         8    7208048    7248074    0       0
DEFB4B          8    7267405    7279376    0       0
DEFB103B        8    7283072    7292862    0       0
SPAG11B         8    7295214    7325239    0       0
DEFB104B        8    7326264    7337603    0       0
DEFB106B        8    7337657    7348912    0       0
DEFB105B        8    7340320    7347054    0       0
DEFB107B        8    7348922    7371769    0       0
PRR23D1         8    7392190    7404968    0       0
PRR23D2         8    7635123    7643895    0       0
DEFB107A        8    7664318    7678047    0       0
DEFB106A        8    7678274    7688985    0       0
DEFB105A        8    7679504    7682669    0       0
DEFB104A        8    7688993    7699966    0       0
SPAG11A         8    7701868    7731385    0       0
DEFB103A        8    7733734    7745159    0       0
DEFB4A          8    7747211    7759204    0       0
ZNF705B         8    7778866    7817265    0       0
LRLE1           8    8041163    8051293    0       0
SGK223          8    8170274    8249008    1       14
CLDN23          8    8554461    8566614    1       2
MFHAS1          8    8635864    8756155    1       14
ERI1            8    8854659    8895849    2       4
PPP1R3B         8    8988778    9014083    1       4
RP11-10A14.4    8    9004253    9018871    0       1
TNKS            8    9408433    9644798    0       28
MSRA            8    9906789    10291400   1       46
PRSS55          8    10378069   10416667   0       0
RP1L1           8    10458870   10517608   8       7
C8orf74         8    10525153   10563101   1       1
SOX7            8    10576289   10702354   1       7
PINX1           8    10617482   10702378   4       4
XKR6            8    10748565   11063856   1       9
AF131215.5      8    10980896   10992735   0       1
MTMR9           8    11136929   11190641   1       6
SLC35G5         8    11187947   11192130   3       3
C8orf12         8    11220934   11296165   1       1
FAM167A         8    11273975   11330077   1       1
BLK             8    11346515   11427094   1       4
GATA4           8    11529471   11621013   1       9
C8orf49         8    11613911   11622143   0       0
NEIL2           8    11622149   11648080   0       1
FDFT1           8    11648093   11700029   1       3
RP11-297N6.4    8    11654221   11665073   1       1
CTSB            8    11700202   11731936   1       1
DEFB136         8    11826448   11837078   0       1
DEFB135         8    11837113   11847089   0       0
DEFB134         8    11847112   11858816   0       0
RP11-481A20.11  8    11866606   11878042   0       0
DEFB130         8    11916901   12180804   0       1
ZNF705D         8    11956900   11978017   0       0
USP17L2         8    11994693   12001584   0       0
FAM86B1         8    12040074   12056625   0       0
FAM86B2         8    12282918   12298890   0       0
LONRF1          8    12576644   12617997   1       4
KIAA1456        8    12798151   12894010   1       8
DLC1            8    12935872   13377392   6       51
-               8    13419354   13430774   0       0
SGCZ            8    13942368   15100842   0       76
TUSC3           8    15392736   15629150   0       12
MSR1            8    15960390   16429868   3       44
FGF20           8    16844683   16864688   0       8
MICU3           8    16879756   16983675   0       6
ZDHHC2          8    17013423   17082972   1       11
CNOT7           8    17086738   17109381   1       3
VPS37A          8    17102773   17159152   1       5
MTMR7           8    17155541   17275836   3       10
SLC7A2          8    17349599   17428940   5       9
PDGFRL          8    17428956   17501298   1       13
MTUS1           8    17501304   17663151   13      20
FGL1            8    17716897   17772866   4       9
PCM1            8    17775353   17890472   5       11
ASAH1           8    17908943   17947468   3       5
NAT1            8    18027994   18086198   6       4
NAT2            8    18248758   18263705   1       1
PSD3            8    18379841   18944229   1       51
SH2D4A          8    19166132   19258722   2       6
CSGALNACT1      8    19258730   19620511   1       14
INTS10          8    19669673   19714588   1       1
LPL             8    19754262   19829759   1       10
SLC18A1         8    19997367   20045709   0       1
ATP6V1B2        8    20049879   20082288   2       6
LZTS1           8    20098676   20166465   1       8
GFRA2           8    21542941   21674827   0       12
DOK2            8    21761400   21776367   2       5
XPO7            8    21776379   21869067   1       1
NPM2            8    21876672   21894865   0       0
FGF17           8    21894912   21906316   0       0
DMTN            8    21902188   21929994   1       0

We tested both single SNP and gene-based associations for the selected variants in all protein-coding genes within the linkage region 8p23 using the software EPACTS (H.M.Kang 2014). A kinship matrix was generated for each of the cohorts we analyzed using EPACTS and incorporated into all of the association analyses to adjust for within-study relatedness. Covariates included sex, age, age², BMI and the top ten principal components calculated from genomic variants. Functional coding variants and non-coding variants were analyzed separately. Single variant association tests for biallelic variants and INDELs were performed for all selected variants in the linkage region using the Efficient Mixed-Model Association eXpedited (EMMAX) test for quantitative traits (Kang et al. 2010). Gene-based tests for SNPs and INDELs were done using the collapsing burden test (burdenCMC) and mixed-model SKAT (mmSKAT). To improve power, we imposed additional filters for non-coding variants using CADD PHRED-like scores > 10 (Kircher et al. 2014). For the gene-based tests, meta-analyses of cohorts were calculated using Fisher’s method.
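Fisher's method, used for the gene-based meta-analysis in Section 2.2.9, combines k independent P-values through the statistic -2 Σ ln(p), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a generic illustration with a hypothetical function name, not the code used in the analysis; it relies on the closed-form chi-square tail for even degrees of freedom so that no external library is needed.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent P-values with Fisher's method."""
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)    # chi-square with 2k df under H0
    half = stat / 2.0
    # Closed-form chi-square survival function for even degrees of freedom
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

combined = fisher_combined_pvalue([0.01, 0.02])        # illustrative cohort P-values
```

A single P-value is returned unchanged, which is a convenient sanity check on the formula.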

2.2.10 Identifying variants contributing to linkage evidence in CFS-EAs

For the rare and low frequency variants in the three genes DLC1, MYOM2 and CSMD1, we performed single variant analysis in each of the Stage II cohorts followed by fixed-effects meta-analysis (Willer et al. 2010) to obtain the effect size for each variant. Next, for each individual in the Stage I CFS-EAs, we defined a polygenic score for each gene by $PS = \sum_{i=1}^{k} \hat{\beta}_i g_i$, where $k$ is the total number of identified variants in a gene, $g_i$ is the $i$th genotype value and $\hat{\beta}_i$ is the corresponding effect size obtained from the Stage II data. We performed linkage analysis including the polygenic score as a covariate and calculated the drop in the LOD score. To examine whether the LOD score drop is statistically significant, we randomly selected the same number of allele frequency matched variants outside of the linkage region, estimated their effect sizes using the Stage II cohorts, and calculated the polygenic score again. We repeated this procedure, including the linkage analysis in CFS-EAs, 1,000 times to obtain the empirical distribution under the null hypothesis that the variants are not associated with AvSpO2S. We then calculated the P-values for the observed LOD score drop in each gene.
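The polygenic score and the empirical P-value for the LOD score drop can be sketched as below. The function names and toy numbers are hypothetical, and the "+1" correction in the empirical P-value is a common convention assumed here, not a detail taken from the text.

```python
import numpy as np

def polygenic_score(genotypes, betas):
    """PS = sum_i beta_i * g_i; genotypes has shape (individuals, variants)."""
    return np.asarray(genotypes, dtype=float) @ np.asarray(betas, dtype=float)

def empirical_pvalue(observed_drop, null_drops):
    """Share of null LOD drops at least as large as the observed drop (assumed +1 correction)."""
    null_drops = np.asarray(null_drops, dtype=float)
    return (1 + np.sum(null_drops >= observed_drop)) / (1 + len(null_drops))

# Two individuals, three variants, Stage II effect sizes (all values illustrative)
ps = polygenic_score([[0, 1, 2], [1, 0, 0]], [0.5, -1.0, 0.25])
p_emp = empirical_pvalue(0.75, [0.1] * 1000)           # no null drop reaches the observed one
```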

2.2.11 Estimating the proportion of AvSpO2S variability explained by the identified variants in DLC1

To estimate the proportion of AvSpO2S variability explained by the 57 identified low frequency and rare variants in DLC1, we performed single variant analysis in 487 individuals in the Stage I CFS-EAs and obtained the effect size for each of the 57 variants. We then pooled 1,548 EAs in the Stage II cohorts. We defined the polygenic score of each of the 1,548 individuals by $PS = \sum_{i=1}^{57} \hat{\beta}_i g_i$, where $g_i$ is the $i$th genotype value and $\hat{\beta}_i$ is the corresponding estimated effect size obtained from the CFS-EAs. We regressed the rank-normalized AvSpO2S on the polygenic score with covariates of age, sex, and ten principal components, and estimated the proportion of AvSpO2S variability explained by the 57 variants (Tamar Sofer 2019).

2.2.12 Test the aggregation of variant effect size directions in DLC1

To examine whether adjacent variants have similar effect size directions among the 51 non-coding variants in DLC1, we evaluated the difference in effect size direction for the 50 adjacent pairs of variants. Assuming no linkage disequilibrium among rare variants, the number of adjacent pairs with opposite effect size directions follows a binomial distribution with probability 0.5:

$$X \sim \mathrm{Binom}(n, p = 0.5),$$

where $n$ is the number of adjacent pairs of variants and $X$ is the number of pairs with opposite effect size directions. If the number of tested variants is $N$, then $n = N - 1$.

We observed 13 out of 50 pairs having opposite effect size directions, resulting in a P-value of 4.7×10⁻⁴. However, rare variants may be in linkage disequilibrium. We therefore applied the same variant selection procedure as described in Section 2.2.8 (Analysis of CFS stage I families) across the genome, which yielded 157,912 non-coding variants. We performed the same analysis for each set of 51 variants and tallied the corresponding P-values to estimate the null distribution.
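The binomial P-value in Section 2.2.12 can be checked directly. The sketch below computes the lower tail probability of observing at most 13 opposite-direction pairs out of 50 under p = 0.5; the function name is hypothetical and the one-sided lower tail is assumed to be the quantity reported.

```python
from math import comb

def binom_lower_tail(x, n, p=0.5):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

# 13 of the 50 adjacent pairs have opposite effect size directions
p_value = binom_lower_tail(13, 50)
```

Under these assumptions the result is approximately 4.7×10⁻⁴, in line with the value reported above.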

2.2.13 Gene expression analysis.

GTEx V6p cis-eQTL gene expression data and covariates were downloaded from the GTEx Portal (https://www.gtexportal.org/home/datasets). WGS (N = 148) and imputed genotype data (N = 450) were downloaded from dbGaP. For each dataset, we extracted the expression levels of the 3 genes of interest, DLC1, MYOM2 and CSMD1, for all 44 available tissues. We used the residual of the gene expression level as the phenotype, after adjusting for sex, platform, the top three principal components, and tissue-specific latent factors inferred by GTEx using the PEER method (Stegle et al. 2012). We performed gene-based tests between the gene expression level and the identified rare and low frequency variants associated with AvSpO2S that were available in GTEx. We only retained the variants with imputation info scores larger than 0.4 and a Hardy-Weinberg equilibrium P-value larger than 1×10⁻⁶.

2.2.14 Cell type-specific regulatory annotation enrichment analysis.

We analyzed the annotations of cell type-specific genome segmentation states of regulatory activity prediction for 17 cell lines defined in the Ensembl Regulatory Build (Zerbino et al. 2015). Variants located in the predicted gene regulatory features, including CTCF binding sites, enhancers, heterochromatin, promoter flanks, transcription start sites (TSS) and active transcription factor binding sites within the ChIP-seq peaks, were defined as having active regulatory features (Zerbino et al. 2015); the other genetic variants were defined as not having active regulatory features. In each of the three genes DLC1, MYOM2 and CSMD1, we compared the proportion of the identified non-coding variants associated with AvSpO2S that are located in the gene regulatory features with that of the remaining non-coding variants with matched allele frequency and CADD PHRED ≥ 10, using Fisher’s exact test on a 2 by 2 table. This analysis was performed for each cell line separately. Significance was defined after correcting for 17 tests, which results in a significance level of 0.003. For DLC1, none of the allele frequency matched non-coding variants with CADD PHRED ≥ 10 fall in the gene regulatory features of the leukemia-lymphoma cell line (DND41), thus we reported regulatory annotation enrichment analysis results for the remaining 16 cell lines. For MYOM2, none of the allele frequency matched non-coding variants with CADD PHRED ≥ 10 fall in the gene regulatory features of B-lymphocyte cell (GM12878), dermal endothelium (HMEC), adult dermal fibroblast cells (NHDFAD), umbilical vein endothelial cell (HUVEC), liver (HEPG2), lung fibroblast (NHLF) and human acute leukemia (MONO), thus we reported the results for the remaining 10 cell lines. For CSMD1, none of the allele frequency matched non-coding variants with CADD PHRED ≥ 10 fall in the gene regulatory features of epidermal keratinocytes (NHEK), adult dermal fibroblast cells (NHDFAD), skeletal myoblasts (HSMM), osteoblast lineage (OSTEO) and lung fibroblast (NHLF), thus we reported the results for the remaining 12 cell lines.
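The enrichment test in Section 2.2.14 amounts to a Fisher's exact test on a 2 by 2 table per cell line, judged against a Bonferroni threshold. The sketch below is a generic pure-Python version with a hypothetical function name and an illustrative table; it is not the software used in the analysis, and the two-sided definition (summing all tables no more probable than the observed one) is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

alpha = 0.05 / 17                       # Bonferroni threshold for 17 cell lines
# Rows: identified vs. matched variants; columns: inside vs. outside regulatory features
p_example = fisher_exact_two_sided(3, 1, 1, 3)   # illustrative counts only
```

The threshold `alpha` rounds to the 0.003 significance level quoted above.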

2.2.15 Mendelian randomization analysis

We performed Mendelian randomization analysis to test and estimate the causal effect of DLC1 expression in human skin cells-transformed fibroblasts on AvSpO2S using an inverse-variance weighted (IVW) method, MR-Egger regression (Bowden et al. 2015) and the MR-PRESSO test (Verbanck et al. 2018). We chose human skin cells-transformed fibroblasts because the DLC1 variants are eQTLs in this tissue in the GTEx expression analysis. We constructed instrumental variables using the 24 available DLC1 variants in GTEx. The DLC1 expression level was treated as the exposure and AvSpO2S as the outcome.
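For orientation, the standard fixed-effect IVW estimator combines per-variant effects on the exposure and on the outcome as a weighted regression through the origin. The sketch below uses hypothetical names and toy summary statistics, and it uses first-order weights (outcome standard errors only), which is an assumption rather than a description of the exact implementation used here.

```python
import numpy as np

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted causal estimate and its standard error."""
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    w = 1.0 / np.asarray(se_outcome, dtype=float) ** 2   # first-order weights
    estimate = np.sum(w * bx * by) / np.sum(w * bx ** 2)
    se = np.sqrt(1.0 / np.sum(w * bx ** 2))
    return estimate, se

# Toy instruments: variant effects on expression (bx) and on the trait (by)
bx = [0.1, 0.2, 0.3]
by = [0.2, 0.4, 0.6]
est, est_se = ivw_estimate(bx, by, [0.05, 0.05, 0.05])
```

MR-Egger differs by adding an intercept to this regression, which allows for directional pleiotropy.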

2.2.16 AvSpO2S and methylation in the DLC1 gene in MESA.

The subset of MESA individuals in this analysis is composed of those who participated in a sleep examination conducted in conjunction with MESA Exam 5 (described in detail previously (Chen et al. 2015b, Bild et al. 2002)), and who also participated in the MESA DNA methylation study (Liu et al. 2013). The blood draws for the methylation study were obtained during MESA Exam 5 (2010-2012) (Reynolds et al. 2014) on a random subset of MESA participants at four of the six field centers: Johns Hopkins University, University of Minnesota, Columbia University, and Wake Forest University. A total of 623 individuals were available with both sleep data and DNA methylation data (see Table 2.9 for participant characteristics). DNA methylation was ascertained in sorted monocytes using the Illumina 450K bead chip array. Methods for collection and assays for DNA methylation have been described previously (Reynolds et al. 2014). We used the R package “Illumina450ProbeVariants.db” to identify 77 DNA methylation sites that are associated with DLC1 and whose probes do not measure SNPs in EAs, in AAs, or in either group. We first adjusted the methylation values for batch effects and confounding variables (all adjusting covariates) using ComBat (Johnson et al. 2007), which was applied on the beta-value scale. We subsequently analyzed the association between AvSpO2S and methylation at these DNA methylation sites using linear regression applied on the M-value scale, adjusted for age, sex, recruitment site, estimated residual cell count (beta, T, , and natural killer cells), and the batch covariate, array position on chip. We performed association analysis in AAs only, in EAs only, and in the pooled cohort that included AAs, EAs, and HAs, adjusting for race/ethnicity.
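The two methylation scales mentioned above are related by a logit transform in base 2: an M-value is log2(beta / (1 - beta)). The following sketch shows the conversion in both directions; the function names are hypothetical and the small clipping constant is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np

def beta_to_m(beta, eps=1e-6):
    """Convert methylation beta-values in (0, 1) to M-values."""
    b = np.clip(np.asarray(beta, dtype=float), eps, 1 - eps)   # assumed guard against 0 and 1
    return np.log2(b / (1 - b))

def m_to_beta(m):
    """Convert M-values back to beta-values."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

m_values = beta_to_m([0.5, 0.8])       # illustrative beta-values
```

M-values are unbounded and closer to homoscedastic, which is why regression is usually run on that scale.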

2.2.17 AvSpO2S and DLC1 gene expression association test in the Sleep Heart Health Study/Framingham Heart Study.

The Sleep Heart Health Study (SHHS) population (Newman et al. 2001, Gottlieb et al. 2010) consists of 517 European Americans from the Offspring Study of the Framingham Heart Study (FHS) with both gene expression and overnight polysomnography (PSG) measures. Gene expression was measured in whole blood using the Affymetrix Human 1.0 ST Array. Blood for gene expression measures was collected from Offspring Study participants in their examination cycle 8 (2005-2008) (Irvin et al. 2014). A dataset adjusted for technical batch covariates is available through dbGaP (phe000002.v4) and was used in the analysis. Here, we considered only probe 3125116, which measures expression of the DLC1 gene. SHHS-FHS individuals participated in two sleep studies (1994-1998 and 2001-2003). In this analysis, we used polysomnography data from the second study in 385 participants; for the remaining 132 participants, who only had an initial sleep study, we used the earlier polysomnography data. Participant characteristics are provided in Table 2.10. We used a linear mixed model to assess the association between expression of the DLC1 probe and two sleep traits: AvSpO2S and AHI, measured as the average number of apnea and hypopnea events per hour of sleep with at least 4% oxygen desaturation. We regressed expression on sleep measures adjusted for age, sex and BMI, and controlled for familial relationships using a random effect model.

2.3 Results

2.3.1 Persistent linkage evidence of AvSpO2S on chromosome 8p23 in CFS-EAs

We previously conducted variance component linkage analysis of AvSpO2S in 645 CFS EAs from 139 families using the IBC array and identified a linkage peak on chromosome 8p23 with LOD scores of 2.36 and 3.26 with and without including BMI as a covariate, respectively (Liang et al. 2016) (Figure 2.1 A-B). In this study, 617 individuals from 132 families were further genotyped using the Illumina OmniExpressExome chip. Linkage analysis of AvSpO2S in these 617 individuals using the Illumina OmniExpressExome chip genotype data showed persistent linkage evidence, with LOD scores of 2.32 and 3.38 with and without including BMI as a covariate (Figure 2.1 C-D). In the subsequent analyses, we used 487 individuals from 118 families with WGS data from the NHLBI TOPMed project to identify low frequency and rare variants accounting for the linkage evidence. We therefore also conducted linkage analysis using the Illumina OmniExpressExome chip data in these 487 individuals and identified a linkage signal with LOD scores of 2.56 and 4.24 for models with and without including BMI as a covariate, respectively (Figure 2.1 E-F). To identify variants that are independent of BMI, a trait correlated with several OSAHS traits, we focused our analysis on AvSpO2S adjusted for BMI.

Figure 2.1 Variance component linkage analysis of AvSpO2S in CFS-EAs on chromosome 8. (A-B) linkage analysis using the IBC array for the model with (A) and without (B) including BMI as covariate in 645 individuals from 139 families; (C-D) linkage analysis using the Illumina OmniExpressExome array for the model with (C) and without (D) including BMI as covariate in 617 individuals from 132 families; (E-F) linkage analysis using the Illumina OmniExpressExome array for the model with (E) and without (F) including BMI as covariate in 487 individuals from 118 families with TOPMed sequencing data.

2.3.2 Choice of family specific LOD score threshold 0.1

Supposing that a portion of CFS families carry rare variants accounting for the AvSpO2S linkage evidence on chromosome 8p23, the distribution of the CFS family-specific LOD scores for the peak SNP should be a mixture of two normal distributions: one is the family-specific LOD score distribution of families not carrying variants accounting for the linkage evidence, with mean zero, and the other is the distribution of families carrying variants accounting for the linkage evidence, with a positive mean. Thus we fitted the family-specific LOD scores by a mixture of two normal distributions. We observed that the means of the two normal distributions were 0.016 and 0.089, with mixture proportions 64% and 36%, respectively (Figure 2.2). Based on a discriminant analysis, the threshold that best separates the two distributions is the average of the means of the two components, which is 0.053. We chose a conservative threshold of 0.1.

We performed simulations (Section 2.2.7) to examine whether a family-specific LOD score threshold of 0.1 would improve the statistical power for testing low frequency and rare variant association in either burden or SKAT (Wu et al. 2011, Lee et al. 2012a) tests. We compared the distributions of the family-specific LOD score in families with at least 2 family members carrying a causal variant and in families with fewer than 2 family members carrying a causal variant. We found that, in both scenarios (all 40 simulated causal variants having the same effect direction, and half of the 40 causal variants having the opposite effect direction), the family-specific LOD scores for families with at least two members carrying causal variants are higher than those of the remaining families (Figures 2.3 and 2.4).

Figure 2.2 Mixture distribution of family specific LOD scores in CFS families.

Figure 2.3. Distribution of family specific LOD score. Left: families with fewer than 2 family members carrying a causal variant. Right: families with at least 2 family members carrying a causal variant. All 40 causal variants have the same effect direction.

Figure 2.4. Distribution of family specific LOD score. Left: families with fewer than 2 family members carrying a causal variant. Right: families with at least 2 family members carrying a causal variant. Half of the 40 causal variants have the opposite effect direction.

We further examined the statistical power and type I error rate for this threshold of 0.1 in the simulated family data. We simulated the causal variants in six scenarios with different numbers of causal variants with positive and negative effects (Tables 2.4 and 2.5). We also simulated the phenotype in two scenarios in which the simulated causal variants account for 1% and 0.5% of the variability of the phenotype. We compared the type I error and power of the burden and SKAT tests with and without selecting variants in families with family-specific LOD score larger than 0.1. We observed that while the type I error rate is reasonable (Table 2.3), the power for both burden and SKAT analysis was improved in most scenarios by using families with family-specific LOD score larger than 0.1 to select variants (Tables 2.4 and 2.5). In addition, we observed that the burden test is more powerful when a large proportion of causal variants have effects in the same direction, although SKAT is generally more robust than the burden test (Tables 2.4 and 2.5).

2.3.3 Multiple low frequency and rare variants in DLC1 are associated with AvSpO2S

An overview of the analytical approach is presented in Figure 2.5. We focused on a target linkage region of AvSpO2S encompassing 20.93 Mb on chromosome 8 (Figure 2.6). Using linkage information and focusing on low frequency and rare variants (MAF < 0.05), we were able to reduce our gene-based association analysis to 20 genes with at least 2 putative functional coding variants and 48 genes with at least 2 non-coding variants (Section 2.2.8). We observed 12 genes that had P-values < 0.05 (Table 2.6), indicating an enrichment of genes associated with AvSpO2S under the linkage peak. Of these 12 genes, 5 were identified by gene-based association tests using functional coding variants, 8 were identified using non-coding variants, and DLC1 was the only gene observed in both the functional coding and non-coding variant analyses (Table 2.6).

Table 2.3 Type I error of burden and SKAT test (N=10000)

                                                            Significance level
Variant selection                                   Test    0.05    0.01    1.00E-03  1.00E-04
Not select variants in families with fsLOD > 0.1    Burden  0.0528  0.0104  0.0014    0.0001
                                                    SKAT    0.0571  0.0114  0.0008    0.0002
Select variants in families with fsLOD > 0.1        Burden  0.0543  0.0109  0.0012    0.0001
                                                    SKAT    0.0547  0.0118  0.0013    0.0002

Table 2.4 Power of burden and SKAT tests when simulated rare variants account for 1% of variability of phenotype (N=10000).

Columns 3-4: not select variants in families with fsLOD > 0.1. Columns 5-6: select variants in families with fsLOD > 0.1.

# of rare variants with effect   Significance   Not select        Select
positive (+) / negative (-)      level          Burden   SKAT     Burden   SKAT
20 + / 0 -                       0.05           0.559    0.942    0.665    0.974
                                 0.01           0.384    0.889    0.527    0.927
                                 0.001          0.226    0.782    0.366    0.861
10 + / 10 -                      0.05           0.165    0.897    0.296    0.946
                                 0.01           0.072    0.787    0.167    0.881
                                 0.001          0.019    0.602    0.077    0.739
5 + / 15 -                       0.05           0.168    0.957    0.317    0.976
                                 0.01           0.077    0.900    0.210    0.938
                                 0.001          0.027    0.810    0.118    0.887
40 + / 0 -                       0.05           0.764    0.954    0.737    0.981
                                 0.01           0.563    0.854    0.572    0.927
                                 0.001          0.322    0.652    0.365    0.801
20 + / 20 -                      0.05           0.206    0.928    0.325    0.969
                                 0.01           0.099    0.812    0.177    0.900
                                 0.001          0.030    0.627    0.080    0.768
10 + / 30 -                      0.05           0.200    0.985    0.370    0.989
                                 0.01           0.081    0.965    0.246    0.977
                                 0.001          0.029    0.895    0.137    0.944

Table 2.5 Power of burden and SKAT tests when simulated rare variants account for 0.5% of variability of phenotype (N = 10000)

Columns 3-4: not select variants in families with fsLOD > 0.1. Columns 5-6: select variants in families with fsLOD > 0.1.

# of rare variants with effect   Significance   Not select        Select
positive (+) / negative (-)      level          Burden   SKAT     Burden   SKAT
20 + / 0 -                       0.05           0.484    0.912    0.490    0.943
                                 0.01           0.294    0.777    0.334    0.865
                                 0.001          0.137    0.540    0.166    0.688
10 + / 10 -                      0.05           0.169    0.910    0.269    0.940
                                 0.01           0.068    0.772    0.132    0.858
                                 0.001          0.016    0.574    0.050    0.686
5 + / 15 -                       0.05           0.161    0.966    0.309    0.975
                                 0.01           0.078    0.931    0.213    0.948
                                 0.001          0.023    0.833    0.119    0.891
40 + / 0 -                       0.05           0.742    0.930    0.590    0.945
                                 0.01           0.532    0.816    0.425    0.856
                                 0.001          0.307    0.623    0.250    0.698
20 + / 20 -                      0.05           0.200    0.944    0.293    0.953
                                 0.01           0.083    0.827    0.177    0.874
                                 0.001          0.021    0.601    0.078    0.684
10 + / 30 -                      0.05           0.167    0.982    0.323    0.985
                                 0.01           0.069    0.959    0.202    0.964
                                 0.001          0.022    0.903    0.110    0.915

Figure 2.5. The analysis flow chart for searching low frequency and rare variants associated with AvSpO2S using the TOPMed sequencing data.

Figure 2.6. Linkage evidence of AvSpO2S on chromosome 8 in CFS-EAs. (A) LOD score in 8p23 linked to AvSpO2S. The pink region is the 20 Mb target region in the sequencing analysis and the protein coding genes are presented at the bottom. (B) LOD score in 8p23 when the polygenic score (PS) of the 57 variants in DLC1 was included in the linkage analysis. The linkage curves are plotted with (red curve) and without (blue curve) adjusting for the polygenic score. The grey curves are the 1,000 linkage curves with adjusting for PS defined by 57 randomly selected frequency matched variants outside of the target region (chr8:21780000-146302000 for GRCh37/hg19) on chromosome 8. The location of DLC1 is marked by a black bar.

The TOPMed WGS project included an additional five cohorts consisting of 1,548 EAs and 931 African Americans (AAs) (Table 2.1) whose genomes were sequenced and who had AvSpO2S measured. To reduce the multiple comparison penalty, we performed the same burden and SKAT analyses in these samples, focusing on the same variants in the 12 genes identified in the CFS stage I analysis. Although the numbers of functional coding variants were small in each gene, we observed an association with DLC1 in EAs (P < 0.1) and in AAs (P = 0.024) in the burden test (Table 2.6). For the non-coding variants, we observed association with DLC1 in the SKAT analysis (P = 2×10⁻⁴) in EAs but not in AAs. This association evidence in EAs for non-coding variants was significant after accounting for multiple tests (12 genes × 2 analysis methods), and the association evidence was further improved when combining with Stage I CFS EAs (P = 2.11×10⁻⁵). We also observed an additional two genes, CSMD1 and MYOM2, with nominal replication association evidence in stage II samples for non-coding variants by SKAT analysis, although they were not significant after correcting for multiple tests (Table 2.6). In the combined AA and EA stage I and II TOPMed data, the association of DLC1 was significant (Table 2.7, P = 4.2×10⁻⁵). The association of DLC1 was further replicated in five independent samples of 5,042 EAs using imputed genotypes for these 57 variants (P = 0.01), with association evidence further improved when combining all EA stage I, II and replication data (P = 6.5×10⁻⁶, Table 2.7).

2.3.4 Conditioning on the effects of identified DLC1 variants reduces the linkage evidence in CFS EAs

If the variants identified in DLC1 are truly associated with AvSpO2S, conditioning on the effects of these variants should reduce the observed linkage evidence

74 in the CFS. We estimated the effect sizes of the 57 DLC1 variants using the stage II samples only and then calculated a DLC1 polygenic score for each subject in 487 EAs from 118 CFS families with NHLBI TOPMed WGS data (Section 2.2.10). We performed linkage analysis in 487 individuals including the DLC1 polygenic score as an additional covariate. The maximum LOD score dropped from 2.56 without including the DLC1 gene score to 1.81 with the score. The LOD score drop is statistically significant (P <

0.001, Figure 2.6B, Section 2.2.10), indicating that these variants contribute to the observed linkage evidence. Similar analyses showed that the variants in CSMD1 led to a significant LOD score drop (P = 0.04) but not variants in other genes (Figure 2.7).

Conditional on allele frequencies, 7 of the 57 variants in DLC1 have statistically significant effect sizes that fall in the top 5% of allele frequency matched variants (Figure 2.8, P = 0.023). These observed effect sizes of low frequency and rare variants are not necessarily large, which is consistent with prior literature (e.g. height (Marouli et al. 2017)). We further observed that 5 of 6 coding variants had a consistent positive effect direction (higher oxygen saturation; a more favorable phenotype), but this pattern was not observed for non-coding variants (Figure 2.8).

The 57 variants in DLC1 together explained 0.97% of AvSpO2S variation (P = 0.0017, Section 2.2.11). Based on this attributable variation, we calculated the power of the AA sample to be 12% at a 5% significance level, indicating insufficient statistical power in AAs, which likely explains the failure to observe association evidence in AAs (Table 2.6).

Figure 2.7. Conditional linkage analysis for AvSpO2S on chromosome 8 in CFS-EA TOPMed individuals. The linkage curves are plotted with (red curve) and without (blue curve) adjusting for the risk score, included as a covariate, that is defined by the selected variants of CSMD1 (A) and MYOM2 (B) with effect sizes estimated in four TOPMed European verification cohorts. The grey curves are 1000 linkage curves adjusted for risk scores defined by 56 variants with MAF < 0.05 randomly selected 1000 times from outside the target region on chromosome 8 (that is, from chr8: 21780000-146302000 for GRCh37/hg19).

Figure 2.8 Effect sizes of the 57 variants in DLC1 estimated using the stage II samples conditional on MAF. Red dots: coding variants; purple dots: non-coding variants; blue dots: the variants in the target region (chr8: 1000000-21780000 for GRCh37/hg19) with MAF < 0.05 after excluding the 57 variants in DLC1. The two curves represent the 95% confidence intervals conditional on MAF.

2.3.5 Identified DLC1 variants are enriched in regulatory regions

We next examined whether the identified non-coding variants in DLC1 are enriched in regulatory regions by comparing them with the remaining frequency and CADD PHRED score matched variants (Section 2.2.14). We examined the regulatory activity predicted elements for cell lines defined in the Ensembl Regulatory Build (Zerbino et al. 2015), which include CTCF binding sites, enhancers, heterochromatin, promoter flanks, and transcription start sites (TSS). The 51 non-coding variants in DLC1 were significantly enriched with cis-regulatory elements (CREs) in the human lung fibroblast cell line (NHLF) after correcting for multiple tests (P = 0.003, Figure 2.9A).

Figure 2.10 shows the genomic locations of the 51 variants with their corresponding regulatory elements in the human lung fibroblast cell line. We observed significant aggregation of variants with similar effect direction in the genomic region (P = 4.7×10⁻⁴, Section 2.2.13). The non-coding variants in MYOM2 were also marginally enriched in skeletal myotubes and skeletal myoblasts (Figure 2.9B).

2.3.6 DLC1 non-coding variants are associated with DLC1 expression level in human skin cells-transformed fibroblasts

We further investigated whether the low frequency and rare non-coding variants have gene regulatory roles by eQTL analysis of the corresponding RNA-seq data across the 44 tissues from the Genotype-Tissue Expression (GTEx) program (Consortium 2013). The 51 non-coding variants in DLC1 significantly contributed to DLC1 expression levels in human skin cell-transformed fibroblasts after correcting for 44 tests (P = 2.6×10⁻⁴, Figure 2.11A), consistent with our observation that these variants are enriched in regulatory features in the human lung fibroblast cell line (Figure 2.9A). We observed a significant correlation between the AvSpO2S effect sizes and the DLC1 expression effect sizes of the 24 variants available in GTEx (P = 2.1×10⁻⁵). The additive score using DLC1 expression effect sizes of the 24 DLC1 variants explained 0.43% of AvSpO2S variation (P = 5.4×10⁻⁵). Mendelian randomization analysis using the DLC1 variants demonstrated a significant causal effect of DLC1 expression on AvSpO2S (P = 1.32×10⁻⁴, Figure 2.12, Table 2.8), suggesting that DLC1 variants contribute to AvSpO2S variation through DLC1 expression. The non-coding variants in MYOM2 were significantly associated with MYOM2 expression level in brain cortex (burden test P = 2.1×10⁻⁴), but no such association was observed for CSMD1 (Figure 2.11 B-C).
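The two regression-based Mendelian randomization estimators named in Table 2.8 can be sketched as follows. This is a minimal illustration under stated assumptions: it uses summary statistics only (per-variant effects on the exposure `bx`, effects on the outcome `by`, and outcome standard errors `se_by`), a fixed-effect inverse-variance weighted (IVW) estimator, and an MR-Egger fit implemented as a weighted regression with an intercept. The variable names and toy values are hypothetical, and the sketch omits steps such as orienting variants to a positive exposure effect and computing P-values.

```python
import math
from typing import Sequence, Tuple


def ivw_estimate(bx: Sequence[float], by: Sequence[float],
                 se_by: Sequence[float]) -> Tuple[float, float]:
    """Fixed-effect IVW causal estimate and its standard error."""
    w = [1.0 / s ** 2 for s in se_by]
    num = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    den = sum(wi * x * x for wi, x in zip(w, bx))
    return num / den, math.sqrt(1.0 / den)


def egger_estimate(bx: Sequence[float], by: Sequence[float],
                   se_by: Sequence[float]) -> Tuple[float, float]:
    """Weighted regression of by on bx with an intercept: (intercept, slope)."""
    w = [1.0 / s ** 2 for s in se_by]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, bx)) / sw
    my = sum(wi * y for wi, y in zip(w, by)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, bx))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, bx, by))
    slope = sxy / sxx
    return my - slope * mx, slope


# Toy summary statistics with a true causal effect of 0.3 and no pleiotropy.
toy_bx = [0.1, 0.2, 0.3, 0.4]
toy_by = [0.3 * x for x in toy_bx]
toy_se = [0.01] * 4

ivw_beta, ivw_se = ivw_estimate(toy_bx, toy_by, toy_se)
egger_intercept, egger_slope = egger_estimate(toy_bx, toy_by, toy_se)
```

In this toy setting both estimators recover the same slope and the Egger intercept is zero, which mirrors the interpretation of a near-zero intercept as no evidence of average directional pleiotropy.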

Figure 2.9 Cell type specific regulatory annotation enrichment tests for the identified non-coding variants in DLC1 (A), MYOM2 (B) and CSMD1 (C) in cell lines defined in the Ensembl Regulatory Build. The vertical dotted line represents the significance level after adjusting for multiple tests.

Figure 2.10 The 51 non-coding variants in the DLC1 gene and their corresponding effect sizes plotted against physical locations. The corresponding DNase hypersensitive, H3K4me3, H3K27ac and CTCF elements derived from lung fibroblasts in the Encyclopedia of DNA Elements (ENCODE) data are also shown.

Figure 2.11 Gene based test for DLC1 (A), MYOM2 (B) and CSMD1 (C) selected variants with DLC1, MYOM2 and CSMD1 expression level in GTEx tissues. The horizontal dotted line shows the significance threshold after adjusting for multiple testing by Bonferroni correction.

Figure 2.12 Mendelian randomization analysis using 24 DLC1 variants as instrumental variables. DLC1 expression level in skin cells-transformed fibroblasts in GTEx is treated as the exposure and AvSpO2S as the outcome. The solid red and blue dotted lines represent the causal effects estimated by the inverse-variance weighted method and MR-Egger regression, respectively.

Table 2.8 Mendelian randomization analysis to assess causal effects of DLC1 expression in skin cell-transformed fibroblasts to AvSpO2S

Test           Effect (SE)     Effect P-value   Intercept (SE)    Intercept P-value   Global Test P-value
IVW[a]         0.274 (0.06)    1.32E-4          NA                NA                  NA
MR-EGGER[b]    0.280 (0.12)    0.03             -0.0012 (0.02)    0.95                NA
MR-PRESSO[c]   0.274 (0.06)    1.32E-4          NA                NA                  0.87

a: IVW: the inverse-variance weighted test estimates the direct causal effect without adjusting for pleiotropic effects (Bowden et al. 2015).
b: MR-EGGER: Mendelian randomization-Egger test; the effect is a bias-reduced estimate of the causal effect and the intercept is an estimate of the average pleiotropic effect across the genetic variants (Bowden et al. 2015). No bias is observed.
c: MR-PRESSO: Mendelian randomization pleiotropy residual sum and outlier; the global test P-value tests for horizontal pleiotropic outliers (Verbanck et al. 2018).

2.3.7 DLC1 DNA methylation is weakly associated with AvSpO2S and FEV1/FVC ratio

We investigated the association between DNA methylation in DLC1 and AvSpO2S. We tested 77 DNA methylation sites in DLC1 and observed one associated site at chr8:12992570 (GRCh37/hg19) (P < 0.001, FDR adjusted P = 0.078), using the 623 subjects from the Multi-Ethnic Study of Atherosclerosis (MESA) (Table 2.9).
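The FDR adjustment across the tested methylation sites can be illustrated with a short sketch. The text does not state which FDR procedure was used; the code below assumes the Benjamini-Hochberg step-up procedure, and the function name and toy P-values are hypothetical. It is not intended to reproduce the adjusted values reported for the 77 sites.

```python
from typing import List, Sequence


def bh_adjust(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P-values for four hypothetical methylation sites.
toy_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

Under this procedure the smallest of m P-values is multiplied by at most m, which is why a site with a small raw P-value can still have an adjusted P-value above 0.05 when many sites are tested.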

Two common variants in DLC1 have been reported to be associated with two emphysema-related phenotypes on computed tomography (Cho et al. 2015); emphysema is a pulmonary disease trait that is associated with dyspnea (Oelsner et al. 2015), reduced activity levels (Lo Cascio et al. 2017), and exercise tolerance (Smith et al. 2013). Given the potential relationship of SpO2 with the lung, we also examined the association between DNA methylation in DLC1 and the ratio of two lung function measures, predicted forced expiratory volume in one second (FEV1) and forced vital capacity (FVC), which is known as the Tiffeneau-Pinelli index and is used in the diagnosis of obstructive lung disease (Sahebjami & Gartside 1996). We observed that the DLC1 methylation site at chr8:12992570 (GRCh37/hg19) is also associated with the FEV1/FVC ratio (P < 0.001, FDR adjusted P = 0.063) in MESA (Table 2.9).

We also tested the associations of methylation at the identified site in DLC1 with three other OSAHS-related traits: the percentage of the night with oxyhemoglobin saturation (SpO2) < 90% (Per90), minimum SpO2, and AHI. We did not observe significant results (Table 2.9).

Table 2.9 Sample characteristics and results of DLC1 methylation association test with AvSpO2S

                            MESA-HA[a]    MESA-EA       MESA-AA       P-value
Sample Size                 134           203           286
Age (Mean(SD))              69.2 (9.0)    67.3 (8.9)    69.6 (9.5)
No. of Female (%)           80 (59.7)     104 (51.2)    148 (51.7)
BMI (Mean(SD))              30.7 (5.5)    30.4 (5.43)   29.0 (5.6)
FEV1[b]/FVC[c] (Mean(SD))   0.8 (0.1)     0.8 (0.1)     0.7 (0.1)     <0.001
AvSpO2S (Mean(SD))          94.7 (1.7)    94.4 (1.5)    93.8 (1.9)    <0.001
Minimum SpO2 (Mean(SD))     84.0 (7.3)    82.1 (8.3)    83.3 (7.3)    0.069
Per90[d] (Mean(SD))         3.3 (9.4)     3.4 (7.0)     5.3 (12.5)    0.078
AHI (Mean(SD))              17.6 (17.7)   21.4 (18.2)   19.3 (19.9)   0.198

a: HA is Hispanic American
b: FEV1 is forced expiratory volume in one second
c: FVC is forced vital capacity
d: Per90 is the percentage of the night with oxyhemoglobin saturation < 90%

Table 2.10 Sample characteristics and results of DLC1 expression level association with AHI and AvSpO2S

Framingham Offspring Cohort Study (N=517)

Phenotype            Values        No. of Missing Values   Effect Size   SE       P-Value
Age (Mean(SD))       62.2 (9.0)    0
No. of Female (%)    253 (48.9)    0
BMI (Mean(SD))       29.0 (5.3)    3
AHI (Mean(SD))       9.6 (13.0)    0                       0.0008        0.0004   0.0634
AvSpO2S (Mean(SD))   94.3 (2.1)    2                       0.0042        0.0029   0.1515

2.3.8 DLC1 expression is weakly associated with AvSpO2S and AHI

We further examined the association of DLC1 expression level measured by microarray with AvSpO2S and AHI, respectively, and observed weak associations between DLC1 gene expression and AvSpO2S (P = 0.15) and AHI (P = 0.06) in 517 subjects from the Framingham Heart Study (FHS) (Table 2.10).

2.3.9 Association of DLC1 with AvSpO2S is not mediated by AHI

AvSpO2S is a marker of sleep-disordered breathing severity and is correlated with AHI (Pearson correlation coefficient 0.63 in CFS-EAs). Given that DLC1 expression was weakly associated with both AvSpO2S and AHI, we further explored whether the association of DLC1 with AvSpO2S is mediated by AHI. After adjusting for AHI, the association of both coding and non-coding variants in DLC1 remained significant, with their effects almost unchanged (Table 2.11, Figure 2.13C), suggesting that the association was not mediated by the frequency of apneas.
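The logic of comparing a variant effect with and without adjustment for a covariate such as AHI can be sketched in a few lines. This is a simplified illustration that assumes an ordinary linear model in unrelated subjects and uses residualization (regressing the covariate out of both the genotype and the phenotype); the analyses reported here used family data, which this sketch ignores. All names and toy values are hypothetical.

```python
from typing import List, Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Least-squares slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def residuals(y: Sequence[float], c: Sequence[float]) -> List[float]:
    """Residuals of y after regressing out the covariate c."""
    b = slope(c, y)
    mc, my = mean(c), mean(y)
    return [yi - my - b * (ci - mc) for yi, ci in zip(y, c)]


def adjusted_effect(g: Sequence[float], y: Sequence[float],
                    c: Sequence[float]) -> float:
    """Effect of genotype g on phenotype y after adjusting for covariate c."""
    return slope(residuals(g, c), residuals(y, c))


# Toy data: the phenotype depends on genotype (effect 2) and covariate (effect 3).
toy_g = [0, 1, 0, 1, 2, 0]
toy_c = [1, 2, 3, 4, 5, 6]
toy_y = [2 * g + 3 * c for g, c in zip(toy_g, toy_c)]

unadjusted = slope(toy_g, toy_y)
adjusted = adjusted_effect(toy_g, toy_y, toy_c)
```

If the adjusted and unadjusted effects are close, as reported for the DLC1 variants, the covariate is unlikely to mediate the association; in the toy data they differ because the genotype and covariate are correlated by construction.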

2.3.10 Potential pleiotropic effect between DLC1, AvSpO2S and lung function

As mentioned above, two common variants in DLC1 have been reported to be associated with two emphysema-related phenotypes on computed tomography (Cho et al. 2015). DLC1 is reported to be highly expressed in lung (Feng et al. 2016). We therefore examined the association of DLC1 adjusting for FEV1 and FVC. The association remained at a slightly reduced significance level, but effect sizes with and without adjusting for FEV1 or FVC are highly correlated (Pearson correlation coefficient 0.9) (Figure 2.13 A and B); the reduced significance likely reflects the smaller sample with lung function data (Table 2.12). Furthermore, these associations suggest a potential pleiotropic effect between DLC1, AvSpO2S and lung function at the gene level (Table 2.12).

Figure 2.13 Comparison of the effect sizes of single variant associations with and without adjusting for FEV1 (A), FVC (B) and AHI (C) as a covariate for the selected DLC1 variants.

2.4 Discussion

We performed linkage analysis of AvSpO2S using the families collected in CFS and identified a consistent linkage signal on chromosome 8p23 with both the IBC array and the Illumina OmniExpressExome chip. In the model without BMI and BMI² as covariates, the maximum LOD score in the analysis using the Illumina OmniExpressExome chip reached the genome-wide significance threshold of LOD score 3.3. The signal dropped after adjusting for BMI and BMI², suggesting that there might be variants and genes in this region linked to BMI. It would be interesting to further examine linkage and association analyses of BMI in this region. However, in this study we are only interested in variants independent of BMI, and thus we focused on the analysis of AvSpO2S adjusted for BMI.

We searched published GWAS of AvSpO2S and did not identify any previously reported AvSpO2S variants on 8p23. We also performed single variant association tests in the Stage I and II cohorts and did not identify any common variant reaching the significance level after Bonferroni correction for multiple tests, suggesting that multiple low frequency or rare variants with relatively large effect sizes are possibly contributing to the observed linkage evidence. We further assumed that families with large LOD scores are more likely to carry phenotype-associated variants. Thus, we limited our search to the low frequency and rare variants segregating in families with a family-specific LOD score larger than 0.1.

We divided the selected variants into two groups based on their functional annotation. One group consists of functional coding variants annotated as adverse or missense, which can lead to an amino acid change in the corresponding protein and affect the protein’s structure and function; the other group consists of the remaining non-coding variants. Non-coding variants do not change the amino acid sequence and are thus under lower evolutionary constraint (Gelfman et al. 2012, Makalowski & Boguski 1998). However, non-coding variants can potentially have deleterious effects on a transcript through the regulation of transcription or splicing (Gelfman et al. 2017). We filtered the non-coding variants based on the CADD score. The CADD score is a widely used ensemble annotation score that incorporates many different features, such as conservation and regulatory annotations (Kircher et al. 2014). A CADD PHRED-like score of 10 or greater indicates a raw score in the top 10% most deleterious of all possible reference genome variants, and this value was recommended by the authors as a cutoff (Kircher et al. 2014, Rentzsch et al. 2019). Several variant annotations based on various approaches can be used to filter and prioritize variants in gene-based association tests (Gelfman et al. 2017). Some tools, such as SIFT (Ng & Henikoff 2003) and PolyPhen-2 (Adzhubei et al. 2010), use protein-based metrics and are therefore confined to evaluating coding variants. CADD aims to provide a more reliable estimate of deleteriousness for all known variants and allows collapsing methods to be applied to intronic regions of genes (Richardson et al. 2016).
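The relationship between the PHRED-like scale and the rank of a variant can be made concrete with a small sketch. It assumes the usual PHRED-like transformation, in which the scaled score equals -10 times the base-10 logarithm of the fraction of variants ranked at least as deleterious, so that a score of 10 corresponds to the top 10% and a score of 20 to the top 1%. The function names are hypothetical and the sketch does not call any CADD software.

```python
import math


def phred_scaled(rank_fraction: float) -> float:
    """PHRED-like score for a variant ranked in the top `rank_fraction` of all variants."""
    if not 0.0 < rank_fraction <= 1.0:
        raise ValueError("rank_fraction must be in (0, 1]")
    return -10.0 * math.log10(rank_fraction)


def passes_cadd_filter(phred: float, cutoff: float = 10.0) -> bool:
    """Keep variants whose PHRED-like score is at or above the cutoff."""
    return phred >= cutoff


top_10_percent = phred_scaled(0.10)
top_1_percent = phred_scaled(0.01)
```

A cutoff of 10 therefore retains roughly the most deleterious tenth of all possible variants, which is a lenient filter suited to non-coding regions where few variants reach very high scores.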

This study also provides an analysis strategy to test whether the multiple low frequency and rare variants identified in DLC1 contribute to the linkage signal in Stage I CFS families. We conducted conditional linkage analysis by adjusting for a DLC1 polygenic score constructed from effect sizes estimated in independent cohorts. We observed that the maximum LOD score dropped significantly when conditioning on the DLC1 polygenic score. This experiment supported our hypothesis that the observed linkage signal was driven by multiple low frequency and rare variants. Although it has been known for years that linkage analysis is sensitive to rare variants in the linkage region, this is the first study to test it explicitly.

We successfully identified 6 coding and 51 non-coding variants in DLC1 that together were significantly associated with AvSpO2S. The functional genomics analysis using genomic annotations, gene expression and methylation information consistently demonstrated that DLC1 was associated with AvSpO2S. We observed that the 51 non-coding variants in DLC1 were enriched in regulatory activity predicted elements in the human lung fibroblast cell line (NHLF) and contributed to DLC1 expression level in human skin cells-transformed fibroblasts. These results suggest a potential link between fibroblasts and SpO2.

We also performed Mendelian randomization analysis using the DLC1 variants to test the causal effect of DLC1 expression level in human skin cells-transformed fibroblasts on AvSpO2S. Mendelian randomization relies on the assumptions that the instrumental variables are associated with the outcome only through the exposure and are not associated with any confounder of the exposure-outcome relation (Verbanck et al. 2018). We used three Mendelian randomization methods and the results all supported a direct causal effect of DLC1 expression on AvSpO2S. This analysis sheds light on the potential biological mechanism underlying the association of the DLC1 variants with the phenotype and also highlights the feasibility of Mendelian randomization in the analysis of low frequency and rare variants.

We also explored the relationships of DLC1, AvSpO2S and lung function. We observed that the DLC1 DNA methylation site identified by AvSpO2S association analysis was also weakly associated with the FEV1/FVC ratio. After adjusting for FEV1 and FVC, the association of DLC1 with AvSpO2S remained at a slightly reduced significance level. These results suggest a potential pleiotropic effect between DLC1, AvSpO2S and lung function at the gene level. A formal statistical test to evaluate this pleiotropic effect is needed.

The implication of DLC1 in sleep-related oxygenation is of interest given that DLC1 is highly expressed in lung tissue, where it functions as an inhibitor of small GTPases, influencing cell proliferation, migration, and angiogenesis (Lu et al. 2009, Sim et al. 2017). DLC1 also functions as an activator of PLCD1, a repressor of airway smooth muscle hypertrophy (Sasse et al. 2017). Thus, by modulating endothelial cell function and smooth muscle contractility, it may influence cardiovascular and pulmonary traits (Sim et al. 2017, Lu et al. 2009). Two common variants in DLC1, distinct from those associated with AvSpO2S, were associated with quantitative lung imaging markers of emphysema (Cho et al. 2015). DLC1 may specifically influence oxygen saturation levels in OSAHS by modulating the effects of OSAHS-related mechanical (i.e., via episodic airway obstruction) or oxidative stressors on subclinical lung parenchymal disease (Kim et al. 2017). Notably, DLC1 function is modulated by reactive oxygen radicals (Yang et al. 2017), which are commonly elevated in OSAHS (Lira & de Sousa Rodrigues 2016). DLC1 is also a peroxisome proliferator-activated receptor gamma (PPARG) target critical for adipocyte differentiation (Sim et al. 2017), and thus may influence OSAHS-related hypoxemia through effects on body fat distribution and its influences on ventilation.

In summary, our analyses identified a novel association between oxyhemoglobin saturation levels during sleep, a clinically important but understudied phenotype, and DLC1, a gene with pleiotropic functions that has been most studied in relation to tumor activity but is also relevant to lung function and oxygenation.

CHAPTER 3: ASSESSING THE INDEPENDENCE OF ASSOCIATION TESTS

AND LINKAGE EVIDENCE OBTAINED IN THE SAME DATA

3.1 Introduction

Genetic linkage and association analyses are the major tools to identify susceptibility genes for complex diseases. In Chapter 2, we applied linkage and family-based association tests in the same data to help fine-map associated loci and identify low frequency and rare variants for AvSpO2S. Linkage analysis is used to find chromosomal regions that do not recombine with a proposed disease locus (Borecki & Province 2008). The regions identified by linkage tests likely harbor genes related to a disease or trait but are often large, possibly 20 to 40 cM. Linkage analysis is sensitive for the detection of rare variants. For complex diseases or traits, multiple low frequency or rare variants in a linkage region are likely to be causal variants. While linkage analysis can guide the identification of variants that may segregate with a trait in families, association analysis can be used as a complementary method to provide higher resolution for locating genes using sequencing data. Thus, in our study, we took advantage of these complementary properties by conducting linkage analysis first and then following up the linkage regions with family-based association tests using a denser panel of markers, in an attempt to localize the trait-associated genes.

With this strategy, we conducted both linkage and family-based association tests in the chromosomal regions where linkage evidence is present in the data of CFS-EAs. Because the association test is performed using the same family data in which the linkage analysis is performed, the association analysis may have inflated type I error. A previous paper (Chung et al. 2007) reported a simulation study to estimate the correlation between linkage statistics and family-based association statistics under various hypotheses and different types of pedigrees. This study simulated a common variant in different types of pedigrees, including nuclear families with affected sib pairs, extended pedigrees and incomplete pedigrees (Chung et al. 2007). Their results showed that when there is neither linkage nor association, the linkage and association tests are not correlated (Chung et al. 2007). When there is linkage and association, the two tests are positively correlated and the correlation increases with increasing association. Importantly, the study concluded that when linkage and association tests are applied in the same data, the type I error of neither test will be affected (Chung et al. 2007).

However, it is still unknown if type I error will be inflated in tests for associations of multiple genes with low-frequency and rare variants under a linkage region when the variants are not associated with the phenotype. In this chapter, we empirically assess the independence of the burden and SKAT tests under a linkage peak when the low frequency and rare variants are not associated with the trait and estimate the empirical significance level in the gene-based tests in CFS-EAs.

3.2 Method

We took the 18 families with family-specific LOD scores larger than 0.1 as those that potentially carry low frequency or rare AvSpO2S variants on chromosome 8p23. Among the 487 CFS EAs sequenced through TOPMed, we identified genome-wide variants, excluding those in genes in the chromosome 8p23 target region, with MAF ≤ 0.05 that were present at least twice in one of the 18 families. As in Chapter 2, we focused on the variants located in protein-coding genes or within 5 kb upstream or downstream of them. We analyzed functional coding variants and non-coding variants separately. Functional coding variants are defined according to their functional annotation as missense, in-frame deletion/insertion, stop gained/lost, start gained/lost, splice acceptor/donor, or initiator/start codon. For the remaining variants, we applied the CADD PHRED score (Kircher et al. 2014) and used a CADD score > 10 as the threshold to filter the variants. We also filtered out genes with only one variant because of low statistical power. Finally, we performed gene-based collapsing burden (burdenCMC) and mixed-model SKAT (mmSKAT) tests (Wu et al. 2011, Lee et al. 2012a) for 2,308 genes with functional coding variants and 16,538 genes with CADD score filtered non-coding variants.
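The variant selection steps described in this section can be sketched as a simple filter. This is an illustration only: the record fields (`gene`, `maf`, `max_family_copies`, `annotation`, `cadd_phred`), the annotation labels and the function name are hypothetical and do not correspond to the actual annotation files or software used in this analysis.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical labels standing in for the functional coding annotations.
FUNCTIONAL_CODING = {
    "missense", "inframe_indel", "stop_gained", "stop_lost",
    "start_gained", "start_lost", "splice_acceptor", "splice_donor",
    "initiator_codon",
}


def select_variants(variants: List[dict], maf_max: float = 0.05,
                    min_copies: int = 2,
                    cadd_min: float = 10.0) -> Dict[str, Dict[str, List[dict]]]:
    """Group filtered variants by gene, separately for coding and non-coding sets."""
    groups = {"coding": defaultdict(list), "noncoding": defaultdict(list)}
    for v in variants:
        # Keep low frequency or rare variants seen at least twice in one family.
        if v["maf"] > maf_max or v["max_family_copies"] < min_copies:
            continue
        if v["annotation"] in FUNCTIONAL_CODING:
            groups["coding"][v["gene"]].append(v)
        elif v["cadd_phred"] > cadd_min:
            groups["noncoding"][v["gene"]].append(v)
    # Drop genes with only one variant in a given set.
    return {
        name: {gene: vs for gene, vs in genes.items() if len(vs) >= 2}
        for name, genes in groups.items()
    }


toy_variants = [
    {"gene": "A", "maf": 0.01, "max_family_copies": 2, "annotation": "missense", "cadd_phred": 5.0},
    {"gene": "A", "maf": 0.02, "max_family_copies": 3, "annotation": "stop_gained", "cadd_phred": 30.0},
    {"gene": "B", "maf": 0.01, "max_family_copies": 2, "annotation": "missense", "cadd_phred": 20.0},
    {"gene": "A", "maf": 0.03, "max_family_copies": 2, "annotation": "intronic", "cadd_phred": 12.0},
    {"gene": "A", "maf": 0.04, "max_family_copies": 2, "annotation": "intronic", "cadd_phred": 15.0},
    {"gene": "A", "maf": 0.20, "max_family_copies": 4, "annotation": "intronic", "cadd_phred": 25.0},
]
selected = select_variants(toy_variants)
```

In the toy data, gene B is dropped from the coding set because it contributes a single variant, and the common intronic variant in gene A is removed by the MAF filter.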

3.3 Results

3.3.1 Assess the empirical P-values for burden and SKAT tests using genome-wide variants

The summary of the genome-wide gene-based test results using CFS-EA stage I family filtered variants for AvSpO2S is shown in Table 3.1. In the 2,308 tested genes with functional coding variants, the number of variants in each gene is relatively small, with a median of 4, whereas in the 16,538 genes with non-coding variants, the number of tested variants in each gene is larger, with a median of 30 (Figure 3.1, Table 3.1). The quantile-quantile plots of the P-value distributions for burden and SKAT tests with functional coding and non-coding variants are shown in Figure 3.2. We observed genome-wide inflation in all four tests. For the burden test with both functional coding and non-coding variants, the genomic inflation factor λ is 1.05; for the SKAT test, λ is 1.16 for functional coding variants and 1.24 for non-coding variants. Thus, the type I errors are inflated in SKAT tests of multiple genes using both coding and non-coding variants. We used the genome-wide results, excluding the target region on chromosome 8p23, to tally the P-values for each type of gene-based test used in Chapter 2 and calculated the empirical significance levels. We chose the 95% quantile as the threshold of the 5% significance level for functional coding variants and non-coding variants, respectively, resulting in P-values of 0.046 and 0.042 for burden and SKAT tests for functional coding variants, and 0.047 and 0.043 for non-coding variants, respectively (Table 3.2).

Figure 3.1 Distribution of the number of variants in each gene in the genome-wide gene-based test using CFS-EA Stage I families, filtered for coding (A) and non-coding (B) variants

Table 3.1 Summary of the genome-wide gene based test using CFS-EA stage I family filtered variants for AvSpO2S

                                     Number of variants in each gene   P-value
Test     Total No. of tested genes   Minimum   Maximum   Median        Median   5th Percentile
Functional coding variants
Burden   2308                        2         83        4             0.490    0.044
SKAT     2308                        2         83        4             0.468    0.040
Non-coding variants with CADD > 10
Burden   16538                       2         4763      30            0.490    0.046
SKAT     16538                       2         4763      30            0.431    0.033

Figure 3.2 Quantile-quantile plots for genome-wide gene-based test using CFS-EA stage I family filtered variants for AvSpO2S (A) burden test for functional coding variants; (B) SKAT test for functional coding variants; (C) burden test for non-coding variants; (D) SKAT test for non-coding variants.

Table 3.2 Enrichment of 5% significant AvSpO2S gene-based tests on chromosome 8 target region in CFS-EAs

Test     Variants Set   Total No. of Tested Genes   Empirical 5% significant P-value   Expected No. of Genes reaching 5% significance   No. of Genes reaching 5% significance   Ratio   Enrichment P-value
Burden   Coding         20                          0.046                              1                                                2                                       2.00    0.075
Burden   Noncoding      48                          0.042                              2.4                                              3                                       1.25    0.218
SKAT     Coding         20                          0.047                              1                                                3                                       3.00    0.035
SKAT     Noncoding      48                          0.043                              2.4                                              5                                       2.08    0.032
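The genomic inflation factor and the empirical significance threshold discussed in Section 3.3.1 can be sketched as follows. The sketch assumes that λ is computed by converting each gene-based P-value to a 1-degree-of-freedom chi-square statistic and dividing the median statistic by the median of the null chi-square distribution, and that the empirical 5% threshold is the 5th percentile of the genome-wide P-values. Both conventions, the simple percentile rule and the function names are assumptions made for illustration.

```python
from statistics import NormalDist, median
from typing import Sequence

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom.
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation(p_values: Sequence[float]) -> float:
    """Genomic inflation factor from two-sided P-values in (0, 1]."""
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / _CHI2_1DF_MEDIAN


def empirical_threshold(null_p_values: Sequence[float], alpha: float = 0.05) -> float:
    """P-value below which a fraction alpha of the null tests fall (simple rule)."""
    ordered = sorted(null_p_values)
    k = max(int(alpha * len(ordered)) - 1, 0)
    return ordered[k]


# Toy null distribution: evenly spaced P-values, for which lambda is 1.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
lambda_uniform = genomic_inflation(uniform_p)

# Squaring shifts the P-values towards zero and inflates lambda.
lambda_inflated = genomic_inflation([p ** 2 for p in uniform_p])

threshold = empirical_threshold([i / 100 for i in range(1, 101)])
```

When tests are inflated, the empirical threshold falls below the nominal 0.05, so using it in place of the nominal level makes the follow-up comparison in the target region more conservative.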

3.3.2 Enrichment of 5% significant gene-based tests on chromosome 8 target region in CFS-EAs

We used the empirical 5% thresholds estimated above to assess gene-based tests with AvSpO2S in the chromosome 8 linkage region using the identified coding or non-coding variants in CFS-EAs. We also estimated the expected number of genes reaching the 5% significance level for burden and SKAT tests with coding and non-coding variants. The enrichment of 5% significant gene-based tests in the target region in CFS-EAs was assessed by the ratio of the observed number of genes reaching the 5% significance level to the expected number. The ratio was at least 2 for three of the four variant sets (Table 3.2). We performed a binomial test with probability 0.05 on the observed number of genes reaching the 5% significance level. The SKAT tests for both coding and non-coding variants showed significant enrichment (P = 0.035 and 0.032). The burden test for coding variants was marginally significantly enriched (P = 0.075).
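The enrichment calculation can be sketched in a few lines. The expected count is the number of tested genes multiplied by 0.05 and the ratio is the observed count divided by the expected count, which follows directly from the description above. The binomial tail shown here includes the observed count, P(X ≥ k); this tail convention and the function names are assumptions, so the tail probabilities are illustrative and are not claimed to reproduce the enrichment P-values in Table 3.2.

```python
from math import comb
from typing import Tuple


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def enrichment(observed: int, n_genes: int, alpha: float = 0.05) -> Tuple[float, float, float]:
    """Expected count, observed/expected ratio and upper-tail binomial probability."""
    expected = n_genes * alpha
    return expected, observed / expected, binom_upper_tail(observed, n_genes, alpha)


# Counts taken from Table 3.2: 20 genes tested with coding variants, 2 reaching
# the 5% level in the burden test; 48 genes with non-coding variants, 5 in SKAT.
coding_burden = enrichment(2, 20)
noncoding_skat = enrichment(5, 48)
```

The expected counts (1 and 2.4) and ratios (2.00 and about 2.08) obtained this way agree with the corresponding columns of Table 3.2.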

3.4 Discussion

Linkage analysis and association tests use two different types of information to conduct trait locus mapping. Both methods take advantage of genetic recombination information. However, linkage analysis examines the recombination events that occur in the investigated pedigree, whereas association signals derive from the historical recombination events in the population. Many studies have combined linkage and family-based association analysis to improve the power to detect disease susceptibility genes (Lou et al. 2006, Dupuis et al. 2007, Li et al. 2014). One strategy for identifying complex trait or disease genes is to conduct linkage analyses first and then follow up significant results with family-based association tests at a high resolution of genetic markers to further localize the disease variants (Li et al. 2014). Our study adopted this strategy in Chapter 2. Another strategy is to use linkage test results to weight the P-values of association tests (Roeder et al. 2006). This strategy proposed a new false discovery rate (FDR) approach that involved weighting association hypotheses on the basis of prior data (Roeder et al. 2006). This study revealed that if the linkage study was informative, this procedure was more powerful than association tests alone (Roeder et al. 2006). In 2011, a study combining GWAS and linkage analysis successfully identified modifier loci for lung disease severity in cystic fibrosis (Wright et al. 2011). This study used linkage information to reprioritize genome-wide association using extensions of the FDR approach (Sun et al. 2006, Roeder et al. 2006). This approach corroborated the genome-wide significance of the identified loci on chromosomes 11 and 20 (Wright et al. 2011). These studies show that the same data are often used to perform both linkage and family-based association tests. If there is correlation between linkage and association test statistics, it is important to properly interpret the results from both tests together.

In this chapter, we used genome-wide variants, excluding the target region with linkage evidence, to assess the inflation of burden and SKAT tests of multiple genes under a linkage region. A previous simulation study of a common variant indicated that when linkage and association tests were applied in the same data, the type I error rate of neither test would be affected (Chung et al. 2007). However, it is still important to evaluate whether type I error will be inflated when testing multiple genes with low frequency and rare variants selected using family information under a linkage region. We found that for the burden test the genomic inflation was relatively slight, but for the SKAT test the analyses with both functional coding and non-coding variants were inflated. It is important to further evaluate the potential type I error of the SKAT test in analyses using our strategy.

Using the empirical significance level for each type of gene-based test in the chromosome 8 target region, we still observed significant enrichment (P < 0.05). Because there was linkage evidence in the target region, it is plausible that numerous genes were associated with the phenotype in this region, especially given that the tested variants were selected in families contributing to the linkage signal. The previous simulation study suggested that the conditional power of one test, conditioned on a significant result of the other, was greater than the unconditional power when there was both linkage and association (Chung et al. 2007). In our empirical analysis, we assumed that significant linkage and association results together implied a higher likelihood that a true positive was identified. With this assumption, we successfully identified a novel gene associated with AvSpO2S in Chapter 2. Both simulation and real data analyses are needed in the future to further evaluate and interpret our assumption.

CHAPTER 4: DEVELOPING AN ANALYSIS PIPELINE TO IDENTIFY RARE

VARIANTS THROUGH LINKAGE ANALYSIS USING WGS DATA

4.1 Introduction

Recent advances in sequencing technology and large decreases in cost have promoted the generation of a large amount of WGS data. This provides an opportunity to investigate low frequency and rare variants that might explain the missing heritability of complex traits or diseases beyond genome-wide association studies (GWAS). The analysis of WGS data to identify novel rare variants often requires the integration of various resources. Thus, many analysis pipelines have been developed to facilitate the analysis (Chung et al. 2016, Moore et al. 2016, Drichel et al. 2014).

Family-based studies are increasingly being conducted to identify low frequency and rare variants because rare alleles can be enriched in families and co-segregate with traits or diseases. Several tools or pipelines have been developed to analyze family-based sequencing data. For example, FamAnn (Yao et al. 2014) and VariantDB (Vandeweyer et al. 2014) were designed to identify rare variants for Mendelian disorders based on Mendelian inheritance rules. The Shared Genomic Segment (SGS) method was developed to identify haplotypes that are shared identical-by-descent (IBD) among affected members within a pedigree (Miyazawa et al. 2007, Thomas et al. 2008, Leibon et al. 2008). The SGS method was shown to be powerful for identifying rare disease variants (Knight et al. 2012), and an extension of it, pairwise shared genomic segment (pSGS), can be used for variant identification in complex disease (Cai et al. 2011, Cai et al. 2012). Another family-based tool to identify rare variants for complex traits or diseases is RVsharing, which calculates the exact probabilities of sharing by multiple affected relatives at variants under the null hypothesis of no linkage and no association (Bureau et al. 2014). These analysis pipelines and workflows were all developed to automate analysis processes that combine different tools and programs. For example, the SGS method calculates IBD statistics using Merlin (Abecasis et al. 2002) and incorporates processes that automatically generate input files for Merlin and process its output files for subsequent analysis. However, all of these pipelines focus on variant and gene identification, and none of them integrates functional genomic analysis, such as sequencing variant annotations, gene expression or methylation data, to further assess the identified variants and genes.

In Chapter 2, we conducted an analysis to detect low frequency and rare variants associated with AvSpO2S using high depth WGS data through family-based linkage analysis. Linkage analysis results implicated target genomic regions and provided linkage information from families to search the sequencing data for candidate variants that segregated within families. We interrogated variant functional annotations to analyze functional coding and non-coding variants separately in gene-based association tests. We also incorporated different types of predicted regulatory features of human variants for several cell lines from the Ensembl Regulatory Build (Zerbino et al. 2015) and gene expression data for multiple tissues from the GTEx Portal (Stegle et al. 2012) to assess the identified variants. This analysis elucidated a potential causative mechanism regulating the phenotype. Finally, we identified 6 coding and 51 non-coding variants in DLC1 significantly associated with AvSpO2S. This study provides a framework for integrating linkage information from families and functional annotations with WGS data to identify low frequency and rare variants for complex traits or diseases. We believe this study will appeal to those who try to understand the role of low frequency and rare variants in complex polygenic traits using WGS data.

To make the analysis strategy we developed in Chapter 2 broadly applicable to multiple complex traits and diseases, in this chapter we present a unified statistical analysis pipeline that integrates the tedious and inefficient analysis steps of Chapter 2. In this pipeline, linkage analysis, identification of candidate variants in WGS data, gene-based association tests and functional genomic analysis are integrated into a single workflow. Scripts were developed to automatically transform the output and input files for the different functions and programs, and they can be widely used when the data consist of both sequenced family samples and unrelated samples.

4.2 Methods

4.2.1 Overview of the analysis pipeline

Figure 4.1 shows the flowchart for the analysis pipeline, which consists of two analyses. Part one is to conduct linkage analysis using pedigree data. For variance component linkage analysis we recommend the software package Merlin (Abecasis et al. 2002), which outputs one file containing the LOD score and P-value at each marker location and another file containing the estimated heritability and family-specific LOD scores at each marker location.

Figure 4.1 Overview of the analysis pipeline processes, from linkage analysis in genotype data through candidate variant selection to gene-based association analysis. The diagram shows the input and output files for each part of the pipeline; major tools are named and intermediate processing is shown.

Based on the results of the linkage analysis in part one, candidate variant identification can be carried out using WGS data. We built an R program to select families with a LOD score at the linkage peak larger than a predefined threshold and to select candidate low-frequency and rare variants (MAF < 0.05) in the target region that are present at least twice in at least one of the selected families. Our program can also extract the variant annotations for the selected candidate variants from the VCF file and finally generate marker group files for the selected functional coding and non-coding variants, respectively.
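The variant selection rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the R program used in the pipeline. The function name, the dictionary-based input layout, and the toy data are all assumptions made for the example, and "present at least twice" is interpreted here as an alternate allele count of at least 2 within a family.

```python
from collections import defaultdict

def select_candidates(genotypes, family_of, maf, selected_families,
                      maf_max=0.05, min_copies=2):
    """Return {variant: {family: allele_count}} for candidate variants.

    genotypes: {variant: {sample: additive dosage 0/1/2 or None}}
    family_of: {sample: family ID}
    maf:       {variant: minor allele frequency}
    A variant is kept when MAF < maf_max and at least one selected
    family carries min_copies or more copies of the allele.
    """
    candidates = {}
    for variant, dosages in genotypes.items():
        if maf[variant] >= maf_max:
            continue  # not low-frequency or rare
        counts = defaultdict(int)
        for sample, dosage in dosages.items():
            family = family_of[sample]
            if family in selected_families and dosage is not None:
                counts[family] += dosage
        if any(count >= min_copies for count in counts.values()):
            candidates[variant] = dict(counts)
    return candidates

# Toy data (hypothetical): three samples in two families, three variants.
toy_genotypes = {
    "7:100_A/G": {"s1": 1, "s2": 1, "s3": 0},
    "7:200_C/T": {"s1": 1, "s2": 0, "s3": 0},
    "7:300_G/A": {"s1": 0, "s2": 0, "s3": 2},
}
toy_family_of = {"s1": "F1", "s2": "F1", "s3": "F2"}
toy_maf = {"7:100_A/G": 0.01, "7:200_C/T": 0.002, "7:300_G/A": 0.2}

toy_result = select_candidates(toy_genotypes, toy_family_of, toy_maf, {"F1"})
print(toy_result)
```

Only the first variant survives in this toy example: the second appears once in the selected family, and the third fails the frequency filter.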

For the gene-based tests, we recommend the gene-based collapsing burden test (burdenCMC) and mixed-model SKAT (mmSKAT) implemented in the software package Efficient and Parallelizable Association Container Toolbox (EPACTS; https://genome.sph.umich.edu/wiki/EPACTS). For the discovery study, we require pedigrees for linkage analysis, thus the gene-based association tests are performed using family data. EPACTS can generate a kinship matrix and incorporate it into the association analysis. For the independent verification gene-based association tests, both family samples and unrelated population samples can be used.

4.2.2 Variance component linkage analysis

The software package Merlin can conduct both non-parametric and variance component linkage analysis (http://csg.sph.umich.edu/abecasis/merlin/tour/linkage.html). The input files include a data file (-d parameter), a pedigree file (-p parameter) and a map file (-m parameter). Before linkage analysis, the genotype data need to be pruned to reduce the dependence among genetic markers. We recommend using the PLINK linkage disequilibrium based SNP pruning function (Purcell et al. 2007). We built an R program to prepare the input files for Merlin, including genetic marker pruning, generating data for all founders and preparing the dat, map and ped files. We also incorporate the Merlin commands in the R program.
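As a rough illustration of the two external calls described above, the following Python sketch assembles the command lines without executing them. It is not the pipeline's R code. The file names are hypothetical, and the pruning window, step and r² values are placeholders rather than the settings used in this study. The Merlin options `--vc` and `--tabulate` are included on the assumption that variance component analysis with tabulated output is wanted.

```python
def plink_prune_cmd(bfile, out, window=50, step=5, r2=0.2):
    """Build a PLINK LD-based pruning command (placeholder thresholds)."""
    return ["plink", "--bfile", bfile,
            "--indep-pairwise", str(window), str(step), str(r2),
            "--out", out]

def merlin_vc_cmd(dat, ped, map_file):
    """Build a Merlin variance component linkage command.

    -d, -p and -m are the data, pedigree and map files named in the text.
    """
    return ["merlin", "-d", dat, "-p", ped, "-m", map_file,
            "--vc", "--tabulate"]

# Hypothetical file names, for illustration only.
prune_cmd = plink_prune_cmd("chr7_genotypes", "chr7_pruned")
merlin_cmd = merlin_vc_cmd("chr7.dat", "chr7.ped", "chr7.map")
print(" ".join(prune_cmd))
print(" ".join(merlin_cmd))
```

Returning argument lists rather than shell strings makes the commands easy to inspect before passing them to a process launcher such as `subprocess.run`.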

4.2.3 Identify candidate variants in WGS data

To select a target region with linkage evidence, we recommend prioritizing genetic marker locations with a LOD score larger than 2 and the chromosomal region within a 1-LOD drop of the linkage peak. Pipeline chunk 2 shows the R program that selects candidate low frequency and rare variants using the Merlin linkage test output and the target region sequencing data. The target region and the marker with the linkage peak signal can be determined using the Merlin tabulated output file, and families whose LOD score at the linkage peak marker is larger than the predefined threshold can be selected using the Merlin variance component output file. To select variants, the sequencing data are transformed to a PLINK raw data file with additive coding (Purcell et al. 2007). Our pipeline R program outputs the allele counts for each selected candidate variant in the target region in each of the selected families.
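The region and family selection steps can be illustrated with a short Python sketch. It is a simplified stand-in for the R program, assuming the LOD scores have already been read from the Merlin output into plain lists and dictionaries. The function names and toy values are hypothetical.

```python
def one_lod_drop_interval(positions, lods, drop=1.0, min_peak=2.0):
    """Return (start, end, peak_position) of the 1-LOD drop interval.

    Walks outward from the peak marker while the LOD score stays at or
    above (peak - drop). Returns None if the peak does not exceed min_peak.
    """
    peak_idx = max(range(len(lods)), key=lods.__getitem__)
    peak = lods[peak_idx]
    if peak <= min_peak:
        return None
    cutoff = peak - drop
    left = peak_idx
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak_idx
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[right], positions[peak_idx]

def families_above(family_lods, threshold):
    """Return sorted family IDs whose LOD at the peak exceeds threshold."""
    return sorted(f for f, lod in family_lods.items() if lod > threshold)

# Toy example (hypothetical marker positions and LOD scores).
toy_positions = [10, 20, 30, 40, 50, 60]
toy_lods = [0.5, 1.4, 2.3, 1.9, 1.2, 0.3]
toy_interval = one_lod_drop_interval(toy_positions, toy_lods)
toy_families = families_above({"F1": 0.4, "F2": 0.05, "F3": 0.12}, 0.1)
print(toy_interval, toy_families)
```

In the toy data the peak LOD is 2.3 at position 30, so the interval extends to the neighbouring markers whose LOD stays at or above 1.3.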

4.2.4 Gene-based association tests using WGS data

To perform gene-based association tests using the identified candidate low frequency and rare variants, we classify variants into two groups: functional coding variants, with annotations of missense, in-frame deletion/insertion, stop gained/lost, start gained/lost, splice acceptor/donor, or initiator/start codon; and non-coding variants on genes with a protein-coding biotype, with annotations of 3′ or 5′ UTR variant/truncation, 5′ UTR premature start codon gain variant, downstream/upstream gene variant, intergenic/intragenic variant, intron variant, non-coding exon/transcript variant, rare amino acid variant, or stop retained variant. We also recommend filtering the selected non-coding variants to keep those with a CADD PHRED score larger than 10 (Kircher et al. 2014). The classified coding and non-coding variants are aggregated by gene to generate the group files for gene-based association tests. The pipeline also includes EPACTS commands for the collapsing burden test (burdenCMC) and mixed-model SKAT (mmSKAT) using WGS data. Our method uses the same group files for both the discovery and the independent association tests, because it tests and verifies exactly the same identified low frequency and rare variants.
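The classification and grouping step can be sketched as follows. This Python example is illustrative only and is not the pipeline's R code. The record layout, the function names, and the small set of annotation labels are assumptions, and the labels cover only a subset of the coding categories listed above. The output lines follow the EPACTS group file layout of a group name followed by tab-separated marker IDs of the form CHROM:POS_REF/ALT. Dropping genes with fewer than two variants reflects the low power of single-variant groups.

```python
from collections import defaultdict

# Illustrative subset of coding annotation labels (an assumption).
CODING_EFFECTS = {
    "missense_variant", "stop_gained", "stop_lost", "start_lost",
    "splice_acceptor_variant", "splice_donor_variant",
    "inframe_deletion", "inframe_insertion",
}

def build_groups(variants, cadd_min=10.0, min_variants=2):
    """Split variants into per-gene coding and non-coding marker lists."""
    coding, noncoding = defaultdict(list), defaultdict(list)
    for v in variants:
        marker = f"{v['chrom']}:{v['pos']}_{v['ref']}/{v['alt']}"
        if v["effect"] in CODING_EFFECTS:
            coding[v["gene"]].append(marker)
        elif v["cadd"] is not None and v["cadd"] > cadd_min:
            noncoding[v["gene"]].append(marker)  # CADD filter, non-coding only

    def keep(groups):
        return {g: m for g, m in groups.items() if len(m) >= min_variants}

    return keep(coding), keep(noncoding)

def group_lines(groups):
    """Format groups as tab-separated lines: gene, then marker IDs."""
    return ["\t".join([gene] + markers)
            for gene, markers in sorted(groups.items())]

# Toy records with hypothetical gene names and positions.
toy_variants = [
    {"gene": "GENE_A", "chrom": "7", "pos": 100, "ref": "A", "alt": "G",
     "effect": "missense_variant", "cadd": 22.0},
    {"gene": "GENE_A", "chrom": "7", "pos": 200, "ref": "C", "alt": "T",
     "effect": "stop_gained", "cadd": 35.0},
    {"gene": "GENE_A", "chrom": "7", "pos": 300, "ref": "G", "alt": "A",
     "effect": "intron_variant", "cadd": 12.0},
    {"gene": "GENE_B", "chrom": "7", "pos": 400, "ref": "T", "alt": "C",
     "effect": "intron_variant", "cadd": 15.0},
    {"gene": "GENE_B", "chrom": "7", "pos": 500, "ref": "G", "alt": "C",
     "effect": "intron_variant", "cadd": 11.0},
    {"gene": "GENE_B", "chrom": "7", "pos": 600, "ref": "A", "alt": "T",
     "effect": "intron_variant", "cadd": 3.0},
]
toy_coding, toy_noncoding = build_groups(toy_variants)
print(group_lines(toy_coding))
print(group_lines(toy_noncoding))
```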

To combine the gene-based association results from different cohorts, we recommend Fisher’s method, which combines P-values from independent tests (Brown 1975). The test statistic is

$$X^2 = -2\sum_{i=1}^{k} \ln(p_i) \sim \chi^2(2k)$$

where $p_i$ is the P-value for the $i$th test. Under the null hypothesis, $X^2$ has a chi-squared distribution with $2k$ degrees of freedom, where $k$ is the number of tests being combined.
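Fisher’s method is simple enough to compute directly. The Python sketch below is an illustration rather than part of the pipeline, and the function name is hypothetical. It uses the closed-form upper tail of the chi-squared distribution, which is available because the degrees of freedom (2k) are always even: P(χ² > x) = exp(−x/2) Σ_{j=0}^{k−1} (x/2)^j / j!.

```python
import math

def fisher_combine(pvalues):
    """Combine independent P-values with Fisher's method.

    Returns (X2, combined_p), where X2 = -2 * sum(ln p_i) and combined_p
    is the upper tail of a chi-squared distribution with 2k degrees of
    freedom, computed with the closed form valid for even df.
    """
    if not pvalues or any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(pvalues)
    x2 = -2.0 * sum(math.log(p) for p in pvalues)
    half = x2 / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j   # (x/2)^j / j!
        total += term
    return x2, math.exp(-half) * total

# Two independent tests, each with P = 0.05 (toy values).
stat, combined = fisher_combine([0.05, 0.05])
print(round(stat, 3), round(combined, 5))
```

With a single P-value the function returns that P-value unchanged, which is a convenient sanity check.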


4.3 Results

4.3.1 Multiple rare variants in CAV1 are associated with AHI

Example datasets were used to test the pipeline’s performance. We conducted linkage analysis of AHI in 617 European Americans from 132 families in CFS and identified a significant linkage peak on chromosome 7q32 with a LOD score of 2.3 (Figure 4.2). We focused on the 1-LOD drop linkage region of chr7:110468988-149102171 (GRCh37/hg19), encompassing 19.08 Mb and containing 102 protein coding genes.

Using the pipeline, we estimated family-specific LOD scores in the CFS families in the linkage analysis and selected the 12 families with family-specific LOD scores larger than 0.1 at rs6959579 on chromosome 7q32. We observed 159,435 variants with MAF < 0.01 that passed QC filters. Variants present only once in the 12 selected families were filtered out, reducing the number of variants to 17,158. We also filtered out genes with only one variant because of low statistical power. Among the 102 genes, 22 had at least 2 functional coding variants. For the non-coding variants, we obtained 876 variants across 67 genes. Both gene-based burden and SKAT tests (Wu et al. 2011, Lee et al. 2012a) were performed in the CFS EA cohort for each gene. We analyzed functional coding and non-coding variants separately. We applied the empirical significance level estimated in Chapter 3 (Table 3.1). We observed 5 out of 22 genes for functional coding variants and 13 out of 67 genes for non-coding variants with empirical P-value < 0.05 (Table 4.1).


Figure 4.2 Variance component linkage analysis of AHI in CFS-EAs on chromosome 7.

In stage II, we performed the burden and SKAT analyses in the stage II samples, focusing on the same variants in the 13 genes identified in CFS-EA. For coding variants, we observed nominal associations of AASS with the burden test and of CREB3L2 with the SKAT test in the stage II EA and AA samples (Table 4.1). For non-coding variants, we observed nominal associations (P < 0.05) of CAV1, ASZ1 and RBM28 with the burden test and of PTPRZ1 with both the burden and SKAT tests (Table 4.1). Importantly, for CAV1, combining the stage I and stage II EA results yielded genome-wide significant evidence (P = 1.38×10⁻⁶) in the non-coding variant tests (Table 4.1).

4.3.2 CAV1 variants regulatory annotation and gene expression analysis

We next focused on CAV1 and examined whether the identified rare non-coding variants in CAV1 are enriched in regulatory regions in specific cell lines. We compared the predicted regulatory activity, which includes CTCF binding sites, enhancers, heterochromatin, promoter flanks, and transcription start sites (TSS), of the identified rare variants with that of the remaining frequency- and CADD score-matched variants in CAV1 for the cell lines in the Ensembl Regulatory Build (Zerbino et al. 2015). The 21 non-coding variants in CAV1 were significantly enriched with regulatory features in a human mammary epithelial cell line (HMEC) (P = 7.61×10⁻⁵, Figure 4.3) and a human cervix cell line (HeLa S3) (P = 0.001, Figure 4.3) after correcting for multiple tests. Figure 4.4 shows the genomic locations of the 21 variants with their corresponding regulatory elements in the human mammary epithelial cell line. We observed significant aggregation of variants with similar effect direction in the genomic region (P = 0.001; the method is described in Section 2.2.12).

We also investigated whether these non-coding variants in CAV1 are associated with its corresponding RNAseq data across 48 tissues from GTEx (v7) data (Consortium 2013). We did not detect any significant association with CAV1 expression level in the 48 tissues after correcting for multiple tests (Figure 4.5).

4.4 Discussion

We developed a pipeline that uses linkage information in combination with WGS data to search for rare variants associated with complex traits. The pipeline enables researchers to perform the initial linkage analysis, candidate variant selection, gene-based association tests and variant functional annotation analysis under the R platform. The R program automatically calls PLINK and MERLIN commands, processes their output files, and incorporates whole genome sequencing annotator (WGSA) files (Liu et al. 2016) for the WGS variants in the analysis. We also provide the list of non-default parameters in the program, and each researcher needs to carefully consider parameter settings appropriate to his/her research project when performing the initial configuration. For example, to determine a LOD score cutoff for selecting families potentially carrying rare variants that account for the linkage evidence, we recommend conducting a mixture distribution analysis of the family-specific LOD scores. For data with several small families (families with 2, 3 or 4 members), it might be worthwhile to stratify by family size when performing the mixture distribution analysis. We also recommend removing families with only 1 person from the family-specific LOD score analysis because these families carry no linkage information.

Figure 4.3 Cell type specific regulatory annotation enrichment tests for the identified non-coding variants in CAV1 in cell lines defined in the Ensembl Regulatory Build. The vertical dotted line represents the significance level after adjusting for multiple tests.

Figure 4.4 The 21 non-coding variants in CAV1 and their corresponding effect sizes plotted against physical location. The corresponding DNase hypersensitivity, H3K4me3, H3K27ac and CTCF elements derived from the human mammary epithelial cell line in the Encyclopedia of DNA Elements (ENCODE) data are also presented.

Figure 4.5 Gene-based tests of the selected CAV1 variants with CAV1 expression level in GTEx (v7) tissues. The horizontal dotted line shows the significance threshold after adjusting for multiple testing by Bonferroni correction.

Figure 4.6 Tissue-specific gene expression of CAV1 in the GTEx (v7) database.

The software tools included in our pipeline were selected to fulfill the needs of our research project and might not suit the research goals of other projects. For example, we used MERLIN (Abecasis et al. 2002) for linkage analysis, which requires pruning genotypes based on LD pattern. However, several software packages and analysis pipelines have been developed to conduct linkage analysis directly on high-depth sequencing data (Hu et al. 2014, Chung et al. 2016), which might be more convenient and efficient for handling pedigree sequencing data. In our project, we used the CADD score to filter the non-coding variants. CADD integrates a range of different annotation metrics into a single measure (C score) and aims to provide a more reliable estimate of deleteriousness for all known variants (Kircher et al. 2014). Although the CADD score has been widely used to evaluate the impact of genetic variants after their identification in association studies, it has infrequently been used to filter low-frequency or rare variants before association analysis (Richardson et al. 2016). For rare non-coding variants, incorporating non-coding annotations based on predicted variant functionality would be more fruitful. Thus, our pipeline can be used as a template and learning tool that can be modified to incorporate and develop more valuable methods and tools.

We performed an implementation analysis using the pipeline, targeting the region on chromosome 7 linked with AHI. Interestingly, we observed that CAV1, with 21 identified rare non-coding variants, reached genome-wide significance (P = 1.38×10⁻⁶) for association with AHI by burden test in 2,035 European ancestry individuals. CAV1 is highly expressed in adipose tissue, breast mammary tissue, transformed fibroblast cells and lung (Figure 4.6). It encodes caveolin-1, the main component of the caveolae plasma membranes found in most cell types. This protein regulates the coupling of integrin subunits to the Ras-ERK pathway and promotes cell cycle progression (Wary et al. 1998). Caveolin-1 was reported to be induced by insulin to negatively regulate endothelial nitric oxide synthase (eNOS) activity and contribute to endothelial dysfunction (Wang et al. 2009). Interestingly, several studies have suggested impaired endothelial function in patients with OSAHS (Ip et al. 2004, Nieto et al. 2004). For example, a study of a large community sample of 1,037 older adults in the Sleep Heart Health/Cardiovascular Health Study cohort reported an association of AHI with impaired brachial artery flow-mediated dilation, a surrogate of endothelial dysfunction (Nieto et al. 2004). One possible mechanism for the development of endothelial dysfunction in OSAHS is hypoxemia-induced reactive oxygen species (ROS) production and systemic inflammation (Budhiraja et al. 2007, Hoyos et al. 2015). OSAHS is associated with obesity, hypertension, and metabolic dysfunction, which may contribute to endothelial injury (Budhiraja & Quan 2005). Treatment of OSAHS with continuous positive airway pressure (CPAP) therapy has also been suggested to improve microvascular endothelial function in the systemic circulation (Lattimore et al. 2006). In addition, OSAHS is reported to be independently associated with insulin resistance (Ip et al. 2002). This accumulating evidence suggests that caveolin-1 may play a significant role in the pathogenesis of OSAHS.

In 2017, a study investigated the associations of two variants in CAV1 with OSAHS in 86 patients with severe OSAHS and 86 healthy controls. This study reported a significant association of SNP rs7804372 in CAV1 with OSAHS (Asker et al. 2017). Further studies are warranted to examine and replicate this common variant and our identified rare variants in large independent cohorts. Our functional annotation analysis showed that the 21 identified CAV1 non-coding variants were enriched in regulatory activity in breast mammary epithelial cells. Mendelian randomization analysis should be conducted to further evaluate the causal effect of CAV1 expression on AHI. Caveolin-1 and the caveolin-1/eNOS interrelationship were reported to be involved in the pathogenesis of pulmonary artery hypertension (PAH) (Mathew et al. 2007). OSAHS has been identified as a significant cause of cardiovascular diseases such as hypertension, pulmonary vascular disease, ischemic heart disease, stroke and congestive heart failure (Jean-Louis et al. 2008). It would be very interesting to further examine potential pleiotropic effects among CAV1, OSAHS and other cardiovascular diseases such as PAH.


CHAPTER 5: DISCUSSION AND FUTURE WORK

Family-based studies have become an attractive strategy for analyzing next-generation sequencing data to identify rare variants, because rare variants that co-segregate with the traits or diseases can be observed in pedigrees (Wijsman 2012). Linkage analysis simply measures the co-segregation of any sequence variant with traits. Linkage analysis results can be used to identify the families that are most likely to segregate genetic variants and to guide the choice of regions containing targeted variants. Our study presents a method that combines linkage studies with sequencing data to search for low frequency and rare variants.

In Chapter 2 of this dissertation, using high depth WGS data from the NHLBI TOPMed project and focusing on genes with linkage evidence on chromosome 8p23, we successfully identified a novel association between AvSpO2S and DLC1, a gene with pleiotropic functions that has been most studied in relation to tumor activity but is also relevant to lung function and oxygenation. In this study, the 51 identified non-coding variants were associated with AvSpO2S with a P-value of 6.5×10⁻⁶ in the meta-analysis combining results from both TOPMed WGS data and imputed genotype data. Importantly, using a polygenic score built on the identified variants and estimated in an independent data set, we showed that those variants in DLC1 contributed significantly to the linkage evidence (P < 0.001). This is the first study that successfully identified and tested low frequency and rare variants in genes within a linkage region that drive the linkage peak. We also did a series of subsequent functional annotation and omics studies of DLC1. Multiple sources of information, including gene expression, functional annotation and DNA methylation, consistently suggested that DLC1 is a gene associated with AvSpO2S.

In the future, we will do more analyses to further examine the functional link between DLC1 regulatory elements, the identified cell lines/tissues and the phenotype. In our study, we only used the CADD score to filter the selected non-coding variants, which may not be an ideal method to evaluate rare variants. To improve the analysis, we plan to incorporate more functional scores to boost power in the gene-based association analysis. Multiple functional annotations evaluating various functionality, such as distance to coding, , mappability, microRNA and so on, will be analyzed and compared individually or in combination. Further analysis of the 57 identified coding and non-coding variants will be undertaken. The identified non-coding variants were tested for enrichment in regulatory activity in lung fibroblast cell lines. A comprehensive evaluation of all selected low frequency and rare variants with cell-line specific regulatory activity, including those with CADD score < 10, will be conducted to improve the analysis and increase power.

Additionally, our analysis has suggested a potential pleiotropic effect among DLC1, AvSpO2S and two lung function measures, FEV1 and FVC. Given the important intrinsic mechanistic relationship between oxygen saturation level and pulmonary function, it would be worthwhile to do a formal pleiotropy test. Although no published method is available to test pleiotropic effects of rare variants, we plan to develop a strategy that uses an independently estimated polygenic score as a covariate for a preliminary test of possible pleiotropic effects. Importantly, all the improvements, strategies and methods developed based on this AvSpO2S study will be incorporated into the analysis pipeline developed in Chapter 4.

In Chapter 3, we empirically assessed the independence of the gene-based association tests in the linkage region using a genome-wide testing strategy. Our results showed that the inflation of type I error was much larger for the SKAT test than for the burden test. In the future, we plan to do more simulation and theoretical studies to evaluate the practice of conducting gene-based association tests and linkage analysis using the same data.

In Chapter 4, a publicly available analysis pipeline was developed to search for rare variants through linkage analysis using sequencing data. We plan to improve the pipeline so that it can handle and incorporate different data sources and tools. The TOPMed project released both annotation and sequencing data in GDS format. We will build a cloud-based analysis pipeline platform that can work with annotated GDS files. We also plan to explore different computationally efficient and more powerful gene-based association methods (Chen et al. 2016, Chen et al. 2018) to make the pipeline more tractable when dealing with large-scale WGS data with complex study samples.


References

(1989) The Atherosclerosis Risk in Communities (ARIC) Study: design and objectives. The ARIC investigators. American journal of epidemiology, 129, 687-702.

Abecasis, G. R., Cherny, S. S., Cookson, W. O. and Cardon, L. R. (2002) Merlin--rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet, 30, 97-101.

Adzhubei, I. A., Schmidt, S., Peshkin, L., Ramensky, V. E., Gerasimova, A., Bork, P., Kondrashov, A. S. and Sunyaev, S. R. (2010) A method and server for predicting damaging missense mutations. Nat Methods, 7, 248-249.

Antonelli Incalzi, R., Marra, C., Giordano, A., Calcagni, M. L., Cappa, A., Basso, S., Pagliari, G. and Fuso, L. (2003) Cognitive impairment in chronic obstructive pulmonary disease--a neuropsychological and spect study. J Neurol, 250, 325-332.

Asimit, J. L., Day-Williams, A. G., Morris, A. P. and Zeggini, E. (2012) ARIEL and AMELIA: testing for an accumulation of rare variants using next-generation sequencing data. Hum Hered, 73, 84-94.

Asker, S., Taspinar, M., Koyun, H., Ozbay, B. and Arisoy, A. (2017) Caveolin-1 polymorphisms in patients with severe obstructive sleep apnea. Biomarkers, 22, 77-80.

Auer, P. L. and Lettre, G. (2015) Rare variant association studies: considerations, challenges and opportunities. Genome Med, 7, 16.

Badran, M., Yassin, B. A., Fox, N., Laher, I. and Ayas, N. (2015) Epidemiology of Sleep Disturbances and Cardiovascular Consequences. Can J Cardiol, 31, 873-879.

Bailey-Wilson, J. E. and Wilson, A. F. (2011) Linkage analysis in the next-generation sequencing era. Hum Hered, 72, 228-236.

Balserak, B. I. (2014) Sleep-disordered breathing in pregnancy. Am J Respir Crit Care Med, 190, P1-2.

Bandyopadhyay, B., Chanda, V. and Wang, Y. (2017) Finding the Sources of Missing Heritability within Rare Variants Through Simulation. Bioinform Biol Insights, 11, 1177932217735096.

Barbosa-Morais, N. L., Irimia, M., Pan, Q. et al. (2012) The evolutionary landscape of in vertebrate species. Science, 338, 1587-1593.

Belkadi, A., Bolze, A., Itan, Y. et al. (2015) Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proceedings of the National Academy of Sciences of the United States of America, 112, 5473-5478.

Bild, D. E., Bluemke, D. A., Burke, G. L. et al. (2002) Multi-ethnic study of atherosclerosis: objectives and design. American journal of epidemiology, 156, 871-881.

Blank, J. B., Cawthon, P. M., Carrion-Petersen, M. L., Harper, L., Johnson, J. P., Mitson, E. and Delay, R. R. (2005) Overview of recruitment for the osteoporotic fractures in men study (MrOS). Contemp Clin Trials, 26, 557-568.

Bomba, L., Walter, K. and Soranzo, N. (2017) The impact of rare and low-frequency genetic variants in common disease. Genome Biol, 18, 77.

Borecki, I. B. and Province, M. A. (2008) Linkage and association: basic concepts. Adv Genet, 60, 51-74.

Bowden, J., Davey Smith, G. and Burgess, S. (2015) Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International journal of epidemiology, 44, 512-525.

Brown, M. (1975) A method for combining non-independent, one-sided tests of significance. Biometrics, 31, 987-992.


Budhiraja, R., Parthasarathy, S. and Quan, S. F. (2007) Endothelial dysfunction in obstructive sleep apnea. J Clin Sleep Med, 3, 409-415.

Budhiraja, R. and Quan, S. F. (2005) Sleep-disordered breathing and cardiovascular health. Curr Opin Pulm Med, 11, 501-506.

Bureau, A., Younkin, S. G., Parker, M. M. et al. (2014) Inferring rare disease risk variants based on exact probabilities of sharing by multiple affected relatives. Bioinformatics, 30, 2189-2196.

Burgess, S., Butterworth, A. and Thompson, S. G. (2013) Mendelian randomization analysis with multiple genetic variants using summarized data. Genet Epidemiol, 37, 658-665.

Burgess, S., Dudbridge, F. and Thompson, S. G. (2015) Re: "Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects". Am J Epidemiol, 181, 290-291.

Burgess, S. and Thompson, S. G. (2017) Interpreting findings from Mendelian randomization using the MR-Egger method. Eur J Epidemiol, 32, 377-389.

Cade, B. E., Chen, H., Stilp, A. M. et al. (2016) Genetic Associations with Obstructive Sleep Apnea Traits in Hispanic/Latino Americans. Am J Respir Crit Care Med, 194, 886-897.

Cai, Z., Knight, S., Thomas, A. and Camp, N. J. (2011) Pairwise shared genomic segment analysis in high-risk pedigrees: application to Genetic Analysis Workshop 17 exome-sequencing SNP data. BMC Proc, 5 Suppl 9, S9.

Cai, Z., Thomas, A., Teerlink, C., Farnham, J. M., Cannon-Albright, L. A. and Camp, N. J. (2012) Pairwise shared genomic segment analysis in three Utah high-risk breast cancer pedigrees. BMC Genomics, 13, 676.

Capdevila, O. S., Kheirandish-Gozal, L., Dayyat, E. and Gozal, D. (2008) Pediatric obstructive sleep apnea: complications, management, and long-term outcomes. Proc Am Thorac Soc, 5, 274-282.

Chen, H., Huffman, J. E., Brody, J. A. et al. (2018) Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole genome sequencing studies. bioRxiv, 395046.

Chen, H., Wang, C., Conomos, M. P. et al. (2016) Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models. American journal of human genetics, 98, 653-666.

Chen, R., Wei, Q., Zhan, X. et al. (2015a) A haplotype-based framework for group-wise transmission/disequilibrium tests for rare variant association analysis. Bioinformatics, 31, 1452-1459.

Chen, X., Wang, R., Zee, P., Lutsey, P. L., Javaheri, S., Alcantara, C., Jackson, C. L., Williams, M. A. and Redline, S. (2015b) Racial/Ethnic Differences in Sleep Disturbances: The Multi-Ethnic Study of Atherosclerosis (MESA). Sleep, 38, 877-888.

Cho, M. H., Castaldi, P. J., Hersh, C. P. et al. (2015) A Genome-Wide Association Study of Emphysema and Airway Quantitative Imaging Phenotypes. American journal of respiratory and critical care medicine, 192, 559-569.

Chung, R. H., Hauser, E. R. and Martin, E. R. (2007) Interpretation of simultaneous linkage and family-based association tests in genome screens. Genet Epidemiol, 31, 134-142.

Chung, R. H., Tsai, W. Y., Kang, C. Y., Yao, P. J., Tsai, H. J. and Chen, C. H. (2016) FamPipe: An Automatic Analysis Pipeline for Analyzing Sequencing Data in Families for Disease Studies. PLoS Comput Biol, 12, e1004980.

Cirulli, E. T. and Goldstein, D. B. (2010) Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet, 11, 415-425.


Cohen, B., Novick, D. and Rubinstein, M. (1996) Modulation of insulin activities by leptin. Science, 274, 1185-1188.

Cohen, J. C., Pertsemlidis, A., Fahmi, S., Esmail, S., Vega, G. L., Grundy, S. M. and Hobbs, H. H. (2006) Multiple rare variants in NPC1L1 associated with reduced sterol absorption and plasma low-density lipoprotein levels. Proceedings of the National Academy of Sciences of the United States of America, 103, 1810-1815.

Consortium, G. T. (2013) The Genotype-Tissue Expression (GTEx) project. Nat Genet, 45, 580-585.

Cooper, G. M. and Shendure, J. (2011) Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat Rev Genet, 12, 628-640.

Corte, T. J., Wort, S. J., Talbot, S. et al. (2012) Elevated nocturnal desaturation index predicts mortality in interstitial lung disease. Sarcoidosis, vasculitis, and diffuse lung diseases : official journal of WASOG, 29, 41-50.

Das, S., Forer, L., Schonherr, S. et al. (2016) Next-generation genotype imputation service and methods. Nat Genet, 48, 1284-1287.

Davydov, E. V., Goode, D. L., Sirota, M., Cooper, G. M., Sidow, A. and Batzoglou, S. (2010) Identifying a high fraction of the to be under selective constraint using GERP++. PLoS Comput Biol, 6, e1001025.

Degner, J. F., Pai, A. A., Pique-Regi, R. et al. (2012) DNase I sensitivity QTLs are a major determinant of human expression variation. Nature, 482, 390-394.

Delahanty, R. J., Kang, J. Q., Brune, C. W., Kistner, E. O., Courchesne, E., Cox, N. J., Cook, E. H., Jr., Macdonald, R. L. and Sutcliffe, J. S. (2011) Maternal transmission of a rare GABRB3 signal peptide variant is associated with autism. Mol Psychiatry, 16, 86-96.

Design, E. C. http://genome.sph.umich.edu/wiki/Exome_Chip_Design.

Drichel, D., Herold, C., Lacour, A. et al. (2014) Rare variant testing of imputed data: an analysis pipeline typified. Hum Hered, 78, 164-178.

Dupuis, J., Siegmund, D. O. and Yakir, B. (2007) A unified framework for linkage and association analysis of quantitative traits. Proc Natl Acad Sci U S A, 104, 20210-20215.

Eichler, E. E., Flint, J., Gibson, G., Kong, A., Leal, S. M., Moore, J. H. and Nadeau, J. H. (2010) Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 11, 446-450.

Epstein, M. P., Duncan, R., Ware, E. B. et al. (2015) A statistical approach for rare-variant association testing in affected sibships. Am J Hum Genet, 96, 543-554.

Feinleib, M. (1985) The Framingham Study: sample selection, follow-up, and methods of analyses. National Cancer Institute monograph, 67, 59-64.

Feng, H., Zhang, Z., Wang, X. and Liu, D. (2016) Identification of DLC-1 expression and methylation status in patients with non-small-cell lung cancer. Mol Clin Oncol, 4, 249-254.

Friedrichs, S., Malzahn, D., Pugh, E. W., Almeida, M., Liu, X. Q. and Bailey, J. N. (2016) Filtering genetic variants and placing informative priors based on putative biological function. BMC genetics, 17 Suppl 2, 8.

Gelfman, S., Burstein, D., Penn, O., Savchenko, A., Amit, M., Schwartz, S., Pupko, T. and Ast, G. (2012) Changes in exon-intron structure during vertebrate evolution affect the splicing pattern of . Genome Res, 22, 35-50.

Gelfman, S., Wang, Q., McSweeney, K. M. et al. (2017) Annotating pathogenic non-coding variants in genic regions. Nat Commun, 8, 236.


















