Meta-Analysis of a Multi-Ethnic, Breast Cancer Case-Control Targeted Sequencing Study

The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters

Citation Ablorh, Akweley. 2015. Meta-Analysis of a Multi-Ethnic, Breast Cancer Case-Control Targeted Sequencing Study. Doctoral dissertation, Harvard T.H. Chan School of Public Health.

Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:16121143

Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA META-ANALYSIS OF A MULTI-ETHNIC,

BREAST CANCER CASE-CONTROL

TARGETED SEQUENCING STUDY

AKWELEY ABLORH

A Dissertation Submitted to the Faculty of

The Harvard T.H. Chan School of Public Health

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Science

Harvard University

Boston, Massachusetts.

May, 2015

Dissertation Advisor: Dr. Peter Kraft Akweley Ablorh

Meta-analysis of a multi-ethnic, breast cancer case-control targeted sequencing study

ABSTRACT

Breast cancer, the most commonly diagnosed cancer in American women, is a heritable disease with nearly one hundred known genetic risk factors. Using next generation sequencing, we explored the contribution of genetics at 12 GWAS-identified loci to breast cancer susceptibility in a multi-ethnic breast cancer case-control study.

Methods: The study population consists of 4,611 breast cancer cases and controls (2,316 cases and 2,295

controls) from four mutually exclusive ethnicities: African, Latina, Japanese, or European American.We conducted

rare variant association testing between sequenced genotypes and simulated phenotypes to compare the performance

of several approaches for assessing rare variant associations across multiple ethnicities and the statistical performance of different ethnic sampling fractions, including single-ethnicity studies and studies that sample up to four ethnicities. Findings from simulation of causal rare variant penetrance models were then applied to a non- synonymous -coding rare variant association study of breast cancer. Next, we applied variance partitioning

methods to determine what proportion of breast cancer heritability is explained by rare and common, coding and

non-coding, and the complete set of sequenced genetic variants.

Results: Variance component-based tests were better powered in several scenarios. Multi-ethnic studies were well powered, with inclusion of African Americans providing the largest gains in statistical power. Rare

variation in several was nominally associated (alpha=0.05) with breast cancer risk. Common variants

explained a significant amount of breast cancer heritability (5%; SE=2%). Total breast cancer heritability from all

sequenced SNVs from all 12 loci was approximately 11% (S.E.=4%), a substantial portion of breast cancer

heritability which ranges from 27% to 32% in European familial studies.

Conclusion: Our findings suggest that association studies between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.

We demonstrate practices with the potential to uncover and localize -based associations using gene-based rare

variant association testing at 12 GWAS-identified breast cancer susceptibility loci. We also present strong evidence

that just this subset of previously-identified loci explains a substantial portion of heritability which suggests that all

GWAS-identified loci may explain more breast cancer heritability than the 17% previously reported.

ii

Table of Contents

List of Figures iv List of Tables v Acknowledgments vi Dedication ix

Chapter 1 1 Introduction

Chapter 2 Meta-analysis of rare variant association tests in multi-ethnic 6 populations - Introduction 9 - Methods 11 - Results 19 - Discussion 32 - References 36

Chapter 3 Breast cancer and rare protein-coding variation at 12 43 susceptibility loci - Introduction 45 - Methods 48 - Results 57 - Discussion 64 - References 70

Chapter 4 Components of breast cancer heritability in a multi-ethnic 77 targeted sequencing study - Introduction 79 - Methods 81 - Results 90 - Discussion 98 - References 105

iii

List of Figures

Title Page Figure 2.1: Estimated singleton, doubleton and tripleton count 22 Figure 2.2: Type I error (at nominal 5% alpha) for individual ethnic results and multi- 23 ethnic designs Figure 2.3: Lambda GC when altering sampling fraction between ethnicities 24 Figure 2.4: Burden test and MetaSKAT power under two penetrance models by 26 study population and carrier proportion (relative risk = 2.5) Figure 2.5: Meta-analysis statistic power for all genes in fourteen multi-ethnic 27 populations and four monoethnic populations under two penetrance models Figure 2.6: Non-Synonymous Variants by Sample Size 34 Figure 4.1: Theoretical whole genome SE and observed SE for observed-scale 101 heritability estimates

iv

List of Tables

Title Page Table 2.1: 12 Sequenced Regions, Coverage Depth, and Length 12 Table 2.2: Study Subject Count by Cohort and Ethnicity 13 Table 2.3: Variant, Median Singleton, and Carrier Proportion for Genes with Highest and 20 Lowest Carrier Proportions Table 2.4: Gene Counts by Carrier Proportion and Self-Reported Ethnicity 21 Table 2.5: Gene-specific power for alternate proportion of causal variants – Single Ethnicity 28 Table 2.6: Gene-specific power for alternate proportion of causal variants – Multi-Ethnic 29 Table 2.7: Power by Penetrance Model for two multi-ethnic and four mono-ethnic study 31 populations Table 3.1: Study Participants 49 Table 3.2: 12 Sequenced Regions, Coverage Depth, and Length 51 Table 3.3: Count and Percent of Predictions and Polyphen-2 Scores for Non-Synonymous SNPs 53 Table 3.4: Significant gene meta-analyses for all breast cancer, and by ER status (alpha=0.05) 58 Table 3.5: Significant Gene Meta-Analyses after Correction for Number of Statistical Tests 59

(alphaB =0.025) Table 3.6: Gene-Based Associations by Ethnicity (P-Value < 0.02 in at least one ethnicity) 62 Table 3.7: Correlation between summary effect estimate and ethnicity specific effect 66 estimates by case ER status Table 3.8: Genetic Susceptibility Loci Identified in GWAS by receptor status 67 Table 3.9: Nominal gene-based rare variant breast cancer associations 69 Table 4.1: Study Population and Sequenced Variants by Ethnicity - Counts (Percent) 83 Table 4.2: Targeted Regions and Liability-Scale Heritability (and SE) from sequenced SNPs 85 Table 4.3: Counts of SNVs (M) in sequencing data by minimum MAF and ethnicity 86 Table 4.4: Conditional Analysis of all Sequenced SNPs 92 2 Table 4.5: Rare and Common Variant Breast Cancer Heritability (h g) 94 Table 4.6: Heritability in Coding and DHS Regions 95 Table 4.7: Heritability by Functional Annotation 97 Table 4.8: ER-Positive Breast Cancer heritability 98 Table 4.9: Heritability Estimation in Two Independent Samples 102 2 Table 4.10: Crude, Adjusted, and LD-pruned estimates of Rare, Common and Total hg in 103 Single variance-component analysis Table 4.11: Breast cancer heritability (Percent) in European populations 105

v

Acknowledgements

This, to paraphrase the most inspiring scientist I have been so fortunate to meet, was the work of many people; many more people than the few I acknowledge here. First and foremost, thank you to the members of my research committee, Dr. Heather Eliassen, Dr. Alkes Price, and Dr. Peter Kraft, for your feedback, time, and commitment to the success of this venture.

Thank you to Dr. Peter Kraft and our collaborators at other institutions Dr. Chris Haiman, Dr. Brian

Henderson, Dr. Loic Le Marchand, Dr. Seunngeun (Shawn) Lee and Dr. Dan Stram for graciously including me and sharing study resources and feedback as this project developed.

My sincere gratitude to the current and former postdoctoral members of the Program in Genetic epidemiology and Statistical Genetics (PGSG) and the Program in Molecular epidemiology And Genetic

Epidemiology (PMAGE) who include Dr. Sara Lindstrom and Dr. Sasha Gusev for their openness, flexibility, and insights.

I thank my colleagues and former classmates, Dr. Jason Wong, Dr. Shasha Meng, Dr. Zhonghua Liu, Dr.

Mitch Machiela, Dr. Mengmeng (Margaret) Du, Maxine Chen, Dr. Chia-Yen Chen, Tristan Hayek, Hilary

Finucane, and Dr. Gaurav Bhatia for their collegial nature and unflagging dedication to high standards.

I am grateful to the staff for meeting and exceeding expectations including Laura Ruggiero, Connie Chen,

Elizabeth Solomon, Donald Halstead and Jill McDonald.

I would like to thank members of the Harvard faculty for an unmatched learning experience and especially thank Dr. Ichiro Kawachi, Dr. David Williams, Dr. John Quackenbush, Dr. Lisa Signorello, Dr.

Betty Johnson, Dr. Robert Blendon, Dr. David Hunter and Dr. Xihong Lin who generously shared their support or guidance with me along the way.

vi

Finally, I thank Dr. Alkes Price, truly you inspire me, and Dr. Michelle Williams, Chair of the department of Epidemiology, for leading by example and for ensuring my progress through the Doctor of Science degree was fully supported in every sense that an aspiring investigator could dare to hope for.

vii

Support for this work was provided by:

Grant Source Grant Title Grant Number Funding Dates NIH – National Institute Initiative to Maximize Student Diversity GM055353-13 09/01/2010- of General Medical 08/31/2012 Sciences Department of Health Joint Interdisciplinary Training Program T32 - 09/01/2012 - and Services in Biostatistics & Computational GM074897 08/31/2014 Public Health Service Biology NIH – National Institute Statistical Methods for Studies of Rare R01 MH101244 09/01/2014 - of Mental Health Variants 07/01/2015

viii

For

my friends, family, and loved ones

and for

the many survivors

and their friends and family whose lives have been touched by breast cancer.

ix

Chapter 1

Introduction

Introduction

Breast cancer genetic association studies and NGS genotypes

Like so many transformative accomplishments, sequencing the first was both

the end and the beginning of a journey. The scientific journey from measuring genotypes to interpreting

those genotypes for the improvement of human health has only just begun. The human genome project

was a monumental task that took over a decade and 3 billion dollars but the cost of genotyping

subsequent whole genomes is falling precipitously.1 With this influx of genetic information, researchers, clinicians, and biotechnologists may find the amount of data quickly outpacing their ability to analyze the results responsibly.

This work is motivated by

1) a need to validate disease risk estimation methods in ever-more widely available, high quality,

next generation sequencing (NGS) data; and

2) a desire to explain the gap between the familial risk explained by known breast cancer genetic

loci and the observed two-fold increase in breast cancer risk for first degree relatives of breast

cancer cases.2

To that end, we applied NGS, the latest evolution in genotyping methods1 to a study population of breast cancer cases and controls.

Breast cancer is the most commonly diagnosed site-specific cancer in US women. Lifestyle and reproductive factors have been associated with breast cancer risk, but few risk factors have associations as large as mammographic density or family history’s two-fold relative risk. Although genetic risk variants have been identified, they explain just a portion of the phenotypic variability due to genetics

(heritability) of breast cancer. So far, the remaining heritability of breast cancer remains unexplained by genetic association studies.2

2

Genetic studies have changed dramatically in the last century. Linkage studies, some of the

earliest genetic studies, relied on correlation between neighboring nucleotide base pairs in

deoxyribonucleic acid (DNA). Sanger sequencing, an ingenious biochemical assay of DNA, was well-

suited to sequences several hundreds of nucleotides long.3 However, this was just a fraction of the 3 billion nucleotide base pairs in the human genome. Array-based genotyping, SNP chips, advanced the science by allowing researchers to interrogate regions of the genome with no prior linkage study association. By 2010, genome-wide association studies (GWASs) used SNP arrays which directly genotyped over half a million SNPs (single nucleotide polymorphisms).

Even before massively parallel NGS technologies streamlined genotyping of candidate regions, exonic sequences, or whole genomes, linkage studies of pedigrees identified associations between complex disease risk and genotype. The genetic basis of breast cancer has been particularly well- studied.4 Several of these associations had meaningful clinical consequences. However, high risk, low-

frequency variants have explained only a portion of the variation in risk for breast cancer so far.

To test association with moderate and high frequency SNPs in large samples, array-based methods are a more effective tool.5 Many common SNPs with moderate effects on complex disease risk have been discovered in the past decade using GWAS.6 These associations include both genotyped SNPs and imputed SNPs which were inferred from a reference human genome based on correlation with genotyped SNPs. By combining information across studies and increasing the size of each study, GWAS have started to identify risk alleles with lower and lower frequencies. Each newly-identified risk allele explains a little more, but not all, of the remaining complex disease heritability.

There are several explanations for missing heritability.7 One explanation is that heritability

estimates were inflated and thus, putative risk variants have explained all the heritability required. If we

instead assume that heritability estimates were accurate, then there are still several possibilities. For

one, rare variants with large effect sizes were not captured well in early GWAS. Even in NGS data, read

3

depth can also limit detection of rare variants.8 A second possibility is that complex diseases like breast cancer are highly polygenic. Many of the common variants that contribute to heritability would have effect sizes too low to reach statistical significance. Due to the high quality of our data, we could explore both of these possibilities with this work.

Early literature on multi-ethnic rare variant analysis relied on the use of simulated genotypes or included only one continental population.9 In practice, both authenticity and diversity of genetic data

will be vital to bridging the gap between theory and clinical utility. The current work is split into three

distinct but complementary endeavors:

Chapter 2: “Meta-analysis of rare variant association tests in multi-ethnic populations”

addresses two key aspects of rare variant association study design in a multi-ethnic population:

1) test statistic performance; and 2) study population selection. This first project contributes

practical advice for the design and analysis of rare variant association studies based on empirical

evidence. In addition, it has a broader application to investigators working with non-European

study populations and can be used to benchmark the expected performance of rare variant test

statistics for a larger audience.

Chapter 3: “Breast cancer and rare protein-coding variant association at 12 susceptibility loci” is

an applied study of NGS exposure data for a complex disease with missing heritability. This

association study demonstrates the potential to uncover gene-specific associations using fine-

mapping of disease-associated loci and the limitations of a study of moderate sample size.

Chapter 4: “Breast cancer heritability at 12 GWAS-identified loci” applies variance-component

methods to capture hidden heritability.

In the following chapters, we present a deep and broad exploration of high quality, next generation sequencing data from a multi-ethnic population of over 4600 breast cancer cases and controls covering

5 million base pairs of the human genome.

4

Works Cited

1. Mardis ER. A decade’s perspective on DNA sequencing technology. Nature. 2011;470(7333):198- 203. doi:10.1038/nature09796. 2. Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, Maranian MJ, Bolla MK, Wang Q, Shah M, Perkins BJ, Czene K, Eriksson M, Darabi H, Brand JS, Bojesen SE, Nordestgaard BG, Flyger H, Nielsen SF, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;advance on. doi:10.1038/ng.3242. 3. Schuster SC. Next-generation sequencing transforms today’s biology. Nat Methods. 2008;5(1):16-18. doi:10.1038/nmeth1156. 4. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nat Genet. 2008;40(1):17-22. doi:10.1038/ng.2007.53. 5. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135-145. doi:10.1038/nrg3118. 6. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7-24. doi:10.1016/j.ajhg.2011.11.029. 7. Manolio T a, Collins FS, Cox NJ, Goldstein DB, Lucia a, Hunter DJ, Mccarthy MI, Ramos EM, Cardon LR, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Alice S, Boehnke M, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747- 753. doi:10.1038/nature08494. 8. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;135(V):0-9. doi:10.1038/nature11632. 9. Zuk O, Schaffner SF, Samocha K, Do R, Hechter E, Kathiresan S, Daly MJ, Neale BM, Sunyaev SR, Lander ES. Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci U S A. 2014;111(4):E455-E464. doi:10.1073/pnas.1322563111.

5

Chapter 2

Meta-analysis of rare variant association tests in multi-ethnic populations

- -

Akweley Mensah-Ablorh1,2, Sara Lindstrom1,2, Christopher A. Haiman3, Brian E. Henderson3, Loic Le

Marchand4, Seunngeun Lee5, Daniel O. Stram3, A. Heather Eliassen1,6, Alkes Price1,2,7, Peter Kraft1,2,7

1 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02215

2 Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health,

Boston, MA 02215

3 Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer

Center, University of Southern California, Los Angeles, CA 90033

4 Epidemiology Program, University of Hawaii Cancer Research Center, Honolulu, HI 96813

5 Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109

6 Channing Division of Network Medicine, Harvard Medical School and Brigham & Women’s Hospital,

Boston, MA 02215

7 Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215

Running Title: Multiethnic rare variant meta-analysis

Correspondence: Akweley Ablorh

Harvard T.H. Chan School of Public Health

Bldg II, Suite 200

655 Huntington Ave

Boston, MA 02215

(617)432-7095

[email protected]

7

ABSTRACT

Several methods have been proposed to increase power in rare variant association testing by aggregating information from individual rare variants (MAF<0.005). However, how to best combine rare variants across multiple ethnicities and the relative performance of designs using different ethnic sampling fractions remains unknown. In this study, we compare the performance of several statistical approaches for assessing rare-variant associations across

multiple ethnicities. We also examine how different ethnic sampling fractions perform,

including single-ethnicity studies and studies that sample up to four ethnicities. We conducted simulations based on targeted sequencing data from 4,611 women in four ethnicities (African

American, European American, Japanese, Latino). As with single-ethnicity studies, burden tests had greater power when all causal rare variants were deleterious, and variance component- based tests had greater power when some causal rare variants were deleterious and some were protective. Multi-ethnic studies had greater power than single-ethnicity studies at many loci, with inclusion of African Americans providing the largest impact. On average, studies including African Americans had as much as 20% greater power than equivalently-sized studies without African Americans. This suggests that association studies between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.

Keywords: Fine-mapping; Sequencing study design; Statistical genetics; Study subject selection;

Multiethnic meta-analysis;

8

INTRODUCTION

Despite the successes of genome-wide association studies (GWAS), which have identified hundreds of common variants associated with complex diseases and traits, the genetic basis of these traits has not been fully explained.1–4 It is likely that rare variants contribute to the remaining heritability, 5,6 but to what extent remains an important open empirical question. Individual rare variants (defined here as minor allele frequency (MAF) <

0.005) are difficult to study, as they require prohibitively large samples sizes to have adequate power to detect realistic effect sizes. As an alternative, many methods have been proposed for aggregating information from individual rare variants.7–14 However, these have been proposed in the context of studies conducted in a single ethnicity. It remains an open question how to best combine rare variants in association testing across ethnicities in a multi-ethnic study, and the relative performance of designs using optimal sampling fraction across ethnicities remains

unknown.

Previous studies evaluating the performance of aggregate rare-variant tests have often used simulated genotype data,9–12 which may not adequately capture all of the properties of

empirical data. Those studies that have used empirical sequencing data have been restricted to small sample sizes (under 2,000).15–17 Importantly, most previous studies do not consider multi-

ethnic populations, and when they do, they focus on samples drawn from at most two ethnic

groups18,19 or from within one continental population.17,20

We address some of these gaps using simulations based on empirical targeted sequencing data from 4,611 women in four ethnicities (African American, European American,

9 Japanese, Latino). It is well established that multi-ethnic designs can aid in identifying the causal variant in the context of fine-mapping GWAS-identified loci;21 whether multi-ethnic designs

increase the power of aggregate rare-variant tests remains an open question. We compare the

performance of several statistical approaches for combining evidence for rare-variant association across multi-ethnic samples. We also investigate optimal ethnic sampling fractions

when samples of multiple ethnicities are available.

Our findings suggest that a multi-ethnic study that includes African American subjects may be advantageous since it leverages the greater genetic diversity among Africans.

10

METHODS

Breast Cancer Targeted Sequencing Data: We generated next-generation sequencing

data targeting 12 regions (Table 2.1) spanning 5,500kb on 2,316 breast cancer cases and 2,295

controls from three cohorts (the Nurses’ Health Study,22 the Nurses’ Health Study II,23 and the

Multiethnic Cohort24). Our study population is ethnically diverse with 937 women of African-

American ancestry, 907 women of Latino ancestry and 1,256 women of Japanese ancestry from

the Multiethnic Cohort (MEC) and 1,511 women of European ancestry from Nurses’ Health

Study (NHS) and Nurses’ Health Study II (NHSII) (Table 2.2). Boundaries of targeted regions

were defined by recombination hotspots flanking single nucleotide polymorphisms with published genome-wide-significant associations to breast cancer risk. These regions contained

74 genes, ranging in length from 0.4 kb to 910 kb, for a total of 182 kb of exonic sequence.

Target capture was performed using custom Agilent baits, and high-depth sequencing was performed at the Broad Institute using the Illumina HiSeq platform. Reads were aligned using

BWA and genotypes called using GATK in batches of approximately 100 samples25–28. Batches were balanced with respect to ethnicity and case-control status. Over 97% of bases in coding

regions had >20x coverage, with a mean coverage of 174x. All of the subjects were previously

genotyped using Illumina HumanHap arrays. Concordance in genotype calls at variable sites was over 99.7% across four pairs of duplicate samples; concordance between sequencing

genotype calls and GWAS genotypes was also over 99.7%.

We used SnpEff to annotate coding variants, and restricted analysis to non-synonymous

SNPs and stop-gain, stop-lost, start-lost, and splice site donor or acceptor variants.29

11

Table 2.1: 12 Sequenced Regions, Coverage Depth, and Length

Median | Minimum Region Index_SNP Locus Chr SNP_Pos Length(kb) Coverage Depth (Coding) 1 rs13387042 2q35 2 217905832 ------121.8

2 rs10069690 TERT 5 1279790 824 42 45.9

3 rs889312 MAP3K1 5 56031884 3681 282 307.8

4 rs2046210 ESR1 6 151990059 3749 358 242.7

5 rs1562430 8q24 8 128387852 1317 118 972.9

6 rs10995190 ZNF365 10 64278682 3221 244 397.9

7 rs704010 ZMIZ1 10 80841148 1307 122 473.2

8 rs2981579 FGFR2 10 123337335 1934 131 876.2

9 rs614367 11q13 11 69328764 910 36 258.9

10 rs999737 RAD51B 14 69034682 2319 36 815.4

11 rs3803662 TOX3 16 52586341 2797 40 269

12 rs8170 MERIT40 19 17389704 1025 35 712.4

12

Table 2.2: Study Subject Count by Cohort and Ethnicity

Cohort Ethnicity Total (% Subjects)

African American 937 (20%)

MEC Japanese American 1,256 (27%)

Latino American 907 (20%)

NHS European American 1,511 (33%)

All 4,611 (100%)

Sequencing was performed for subjects from three cohorts — Multiethnic Cohort (MEC) and Nurses'

Health or Nurses’ Health Study 2 (NHS). Ethnicity refers to self-reported ancestry validated by GWAS data.

13

Simulated case-control data sets: For the 47 genes that had at least one non-

synonymous variable site, we simulated case-control studies under the null hypothesis that the

gene was not associated with disease risk. In addition, we simulated data under four alternative

scenarios that were characterized by two conditions: the absolute value of the allele-specific

log relative risk was either constant or inversely proportional to the minor allele frequency, and

the minor alleles at causal loci were either all deleterious or a mixture of 70% deleterious and

30% protective alleles.

Case and control genotypes from each ethnic group were sampled by taking random

draws (with replacement) from the observed genotypes. Assuming the disease is rare, control

genotypes were sampled with equal probability, while case genotypes were selected with

probabilities proportional to the genotype relative risks. Specifically: for a given ethnicity, the

genotype for the ith control = ( , … , ) in gene k was set equal to = ( , … , )

宋餐 沈怠 沈徴入 似 碇怠 碇徴入 where ゜ is drawn from the set賛 of subject訣 indices訣 for the observed data {l=1,...,賛 L} with訣 probability訣

-1 th P0l=L . Here gij is the observed genotype at variable site j for the i individual in a

particular ethnicity; L is the total number of subjects sequenced in that ethnicity (e.g. L=937 for

African Americans); and Jk is the number of rare varying sites (MAF<0.005 in the pooled sample

of 4,611 women) in gene k. Genotypes for cases were sampled based on a specified log relative

risk, setting

exp [ ] = ,…, exp珍 珍 [鎮珍 ] 怠鎮 盤デ 紅 訣 匪 講 茅 茅 デ鎮 退怠 挑版 盤デ珍 紅珍訣鎮 珍 匪繁

14

Here is the log relative risk for variable site j, defined by the simulated penetrance model.

紅珍 Setting , equal to 1 if minor allele j has a causal protective effect and 0 for a deleterious

荊牒追墜 珍 effect; setting equal to 1 if there is MAF dependence and 0 otherwise; and setting ,

暢帖 寵銚通 珍 equal to 1 if variant荊 j is causal and 0 otherwise, the log relative risk for the minor allele 荊at site j is:

, 彫謎呑 = , × ( 1) × 彫鍋認任 乳 兼欠血陳銚掴 紅珍 荊寵銚通 珍 伐 頒紅茅 健剣訣 嵜俵 崟 番 兼欠血珍 H

{1.5,2.5,3.0,4.0}:

紅違 樺

荊警経 貸怠 兼欠血兼欠捲 珍 = 僅デ 健剣訣 嵜俵 倹 崟 巾 兼欠血, 勤 錦 紅 紅違 勤 デ珍 荊寵銚通 珍 錦 勤 錦

均 斤 For all alternatives, we assumed that a fraction of rare variants ( ) were causal and

寵銚通 generated the indicator , for each SNP from a Bernoulli( ) distribution.喧 For alternatives

寵銚通 珍 寵銚通 where causal variants were荊 a mixture of protective and deleterious喧 variants, we assumed that a

fraction of causal variants ( | ) were protective ( , i.i.d. Bernoulli( | )). , was

喧牒追墜 寵銚通 荊牒追墜 珍 喧牒追墜 寵銚通 荊寵銚通 珍 shared across ethnicity but , was drawn independently for each ethnicity, introducing

牒追墜 珍 heterogeneity across ethnicities荊 for some scenarios.

15 Association tests: For each gene k, we tested for association between rare variants and case-control status within each ethnicity separately using two aggregate association approaches: a burden test12 and the Sequence Kernel Association Test (SKAT).30 Both of these approaches test the null hypothesis:

: = = = =0.

茎待 紅怠 紅態 橋 紅徴入 The burden test collapses all rare alleles within a gene by creating an indicator variable

Xeik which is equal to 1 if subject i in ethnicity e carries any rare variant in gene k and 0 otherwise. Logistic regression is then used to test the association between case-control status and Xeik by fitting the model

logit[E(y )] = + X

奪辿 ゎ赴 ず侮奪谷 奪辿谷 and comparing to a central chi-squared 1 df distribution. ( ) 態 提撫武賑入 磐聴帳 提撫武賑入 卑 SKAT is equivalent to the score test for =0 from a random effects model where the 態 30 js are normally distributed with mean 0 and variance酵 . We used the default setting for 態 態 拳勅珍酵 SKAT and set wej=Beta(MAFej, 1, 25), where MAFej is the minor allele frequency in subjects of ethnicity e sample. The SKAT test statistic for association in ethnicity e is

= 徴入 態 態 芸勅 布拳勅珍 鯨勅珍 珍退怠 where = ( ) and is the expected value of case-control status Y for subject

勅珍 沈 勅沈珍 勅沈 勅沈 勅沈 i in ethnicity鯨 eデ under訣 検 the伐 null. 航 SKAT 航can adjust for covariates (such as genome-wide genetic

16

principal components) by incorporating them in the model for . For simplicity we simulated

勅沈 and fit models without any covariate effects; in this case = 航 and is simply the proportion of

勅沈 勅 cases in the sample from ethnicity e. The statistic Qe is distributed航 航 as a mixture of chi-squared random variables, where the mixing proportions depend on the weights and genetic covariance in ethnicity e.

Burden-test meta-analysis: Standard fixed-effect meta-analysis techniques can be applied to the ethnic-specific estimates of to test the null hypothesis that =...= = 0 (E

勅賃 怠賃 帳賃 is the total number of ethnic groups considered).肯侮 We consider three meta-analysis肯 approaches.肯

The first approach is based on the inverse-variance weighted estimate of the mean effect of carrying any rare variant across ethnicities. If the causal effects or allele frequencies at the causal variants differ greatly across ethnicity—in particular, if some causal variants are monomorphic in one or more populations—then the true s may differ across ethnicity, and

勅賃 the average effect may be small—causing the inverse variance肯 weighted test to lose power. 31,32

The second and third approaches account for possible heterogeneity in , at the cost of

勅賃 increased degrees of freedom. The second approach tests for heterogeneity肯 in the ethnic- specific effects using Cochran's Q and has E-1 degrees of freedom. The third approach is a joint test of mean effect and heterogeneity that sums the fixed-effect test of the overall mean and

Cochran's Q. Because these tests are independent, the joint test has E degrees of freedom. 31,32

SKAT meta-analysis W j are constant across ethnicity, and calculate a homogeneous and heterogeneous MetaSKAT statistic (Hom-Meta-SKAT and

Het-Meta-SKAT) proposed by Lee et al.33

17

= 陳 帳 態

芸朕墜陳貸陳勅痛銚貸聴懲凋脹 布蕃布拳勅珍鯨勅珍否 珍退怠 勅退怠

= 陳 帳 態 態 芸朕勅痛貸陳勅痛銚貸聴懲凋脹 布布拳勅珍 鯨勅珍 珍退怠 勅退怠 Under the null, these statistics are also distributed as a mixture of chi-squared variables, where the mixing proportions depend on the ethnic-specific weights and genetic covariance matrices.

Comparing studies with different ethnic sample fractions: We selected 14 different

combinations of samples from at least two of four ethnicities (African American, European

American, Japanese, and Latino) to construct studies with a total of 3,520 total subjects (tri- ethnic samples have 3,522 subjects). For all of these combinations the case-control ratio within each ethnic group was 1:1. We chose four combinations that included subjects from all four ethnicities but had differing proportions of ancestries (4:1:1:2, 2:1:1:1, 3:2:1:2, or 1:1:1:1 of

African American, European American, Japanese, and Latino subjects, respectively). We also

simulated populations by selecting equal numbers of individuals from three out of the four

ethnicities at a time (four combinations) and finally from only two of the four ethnicities (six

combinations).

18

RESULTS

We hypothesized that there may be differences in rare variant association test

performance due to both choice of rare variant test statistic and ethnic sampling fractions since

rare variant tests are better powered in study populations with a higher number of varying

sites. Distributions of rare variants by ethnicity and gene are given in Table 2.3 and Table 2.4 and Figure 2.1. Consistent with previous reports,4,34,35 African Americans have the highest number of rare varying sites, followed by Latinos. Consequently the carrier proportion (the fraction of subjects carrying at least one rare, non-synonymous or truncating allele in a gene) is highest in these populations. However, for most genes, the carrier proportion (CP) was less than 2%, except in African Americans (median CP = 2.6%). Table 2.4 lists summary values for ten genes with the five highest and five lowest overall CP.

We first evaluated rare variant association meta-analysis techniques under the null. In individual ethnic groups, we examined Type I error rates of the burden test and SKAT (Figure

2.2, green boxes). In addition to the false positive rate, we explored a measure of test statistic inflation, Figure 2.3, left with blue shading) and all five meta- analysis techniques. For the 33/47 of genes with an overall CP higher than 0.01, both the burden test and SKAT adequately controlled the Type I error rate. The burden test had deflated

Type I error when the overall CP was less than 0.01. For some of the genes with overall CP less

than 0.01, the deflation in the burden test for some ethnic groups was considerable (e.g.

less than 0.5 in European and Japanese ancestry subjects for RMND1), owing to the low CP (the

RMND1 CPs for European and Japanese ancestry subjects were both 0.002). The SKAT test

19

Table 2.3: Variant, Median Singleton, and Carrier Proportion for Genes with Highest and Lowest Carrier Proportions

Subjects with rare NS variant count Singleton Carrier

Gene Chr # Total # Rare Proportion 0 1 2 3+ Proportion TACC2 10 255 230 0.53 3,823 679 76 33 17.1% MYO9B 19 158 151 0.46 4,235 351 25 0 8.2% ANKLE1 19 82 71 0.45 4,260 321 21 9 7.6% FAM129C 19 67 61 0.48 4,299 286 25 1 6.8% ZFYVE26 14 131 121 0.66 4,348 253 9 1 5.7% HAUS8 19 40 36 0.67 4,545 65 1 0 1.4% PGLS 19 20 19 0.53 4,548 63 0 0 1.4% ABHD8 19 20 19 0.47 4,551 59 1 0 1.3% BTBD16 10 23 21 0.62 4,553 58 0 0 1.3% DDA1 19 7 7 0.43 4,565 46 0 0 1.0% Median --- 32 29 0.58 4,527 75 1 0 1.8%

Chr = ; Total = Varying non-synonymous (NS) sites observed in study population; Rare =

Varying sites with minor allele frequency (MAF) < 0.005, a subset of Total; Singleton Proportion -

Median number of singleton sites as described in Figure I as a proportion of Rare; ‘0’, ‘1’, ‘2’, and ‘3+’ -

Number of subjects out of 4,611 with 0, 1, 2, and 3 or more NS rare minor alleles. Carrier Proportion -

Proportion of study subjects with at least one rare, NS minor allele.

20

Table 2.4: Gene Counts by Carrier Proportion and Self-Reported Ethnicity

Carrier African European Japanese Latino All Proportion (CP) American American American

CP = 0 24 (32%) 27 (36%) 27 (36%) 25 (34%) 23 (31%)

0 < CP 12 (16%) 20 (27%) 18 (24%) 21 (28%) 18 (24%)

0.01 < CP 6 (8%) 14 (19%) 12 (16%) 12 (16%) 10 (14%)

0.02 < CP 10 (14%) 4 (5%) 5 (7%) 7 (9%) 10 (14%)

CP > 0.03 22 (30%) 9 (12%) 12 (16%) 9 (12%) 13 (18%)

All CP 74 (100%) 74 (100%) 74 (100%) 74 (100%) 74 (100%)

We observed variants in 74 distinct SnpEff annotated genes. Carrier proportion (CP) is defined as the

-synonymous (NS) rare variant divided by the number of individuals of that ancestry. Table displays count and (percent) of genes according to carrier proportion for each self-reported ancestry. 'All' refers to study participants from all ancestries combined. There were no NS rare variants observed in any ancestry for 23 of the 74 genes.

21

Figure 2.1: Estimated singleton, doubleton and tripleton count

Observed total non-singleton sites in subjects of each ethnicity for the sample size indicated in legend. To the right, median counts of singletons, doubletons and tripletons for 500 bootstrap samples of 907 subjects from each ethnicity.

22 Figure 2.2: Type I error (at nominal 5% alpha) for individual ethnic results and multi-ethnic designs

Each panel contains vertical box-plots of gene Type I error (false positive) rates for indicated meta-analysis statistic. X-axis lists study populations as a ratio of African American, European American, Japanese, and Latino

subjects. Box plots are arranged from left to right by increasing median Type I error rate for fourteen multi-ethnic and four mono-ethnic study populations. Green box plots are burden or SKAT tests of a mono-ethnic population.

Green horizontal line is the nominal type I error rate of alpha=0.05.

23 Figure 2.3: Lambda GC when altering sampling fraction between ethnicities

The blue- for burden and SKAT tests in four monoethnic populations.

M decreasing joint test lambda GC for five color-coded meta-analysis statistics. X- axis lists study populations as a ratio of African American, European, Japanese, and Latino subjects.

24 statistic was generally not dramatically deflated for genes with CPs less than 0.01, although 5

out of 14 genes with low CP (<0.01) < 0.85 in at least one ethnic group.

For most multi-ethnic designs, the meta-analyses based on the burden test--the inverse

variance weighted meta-analysis, Cochran's Q, and the joint test--all showed deflation (median

0.95; Figure 2.2). SKAT meta-analysis test statistics (het and hom) were slightly inflated

>1.04; Figure 2.3). As with the single-ethnicity analyses, the deflation for the burden

test statistics was greater for genes with overall CP less than 0.01. Cochran's Q and joint tests had notably miniscule <0.7) for designs that included all four ethnicities. The most extreme inflation factors (Cochran's Q ) were for genes where the CP in European,

Latino or Japanese subjects was 0.005 or less.

We next considered how power to detect causal association was influenced by penetrance model, design, and overall CP (Figure 2.4 and Figure 2.5). Consistent with previous reports30, the burden test has higher power than SKAT when all of the causal variants are deleterious, but SKAT has higher power than the burden test when an appreciable proportion of the causal variants are protective. Reducing the total proportion of causal variants gave qualitatively similar results, however the advantage of the fixed-effect burden test when all causal variants were deleterious declined as we lowered the proportion of deleterious causal sites to 10% (Table 2.5, Table 2.6). When only 10% of the rare variants were causal, MetaSKAT had better power than the fixed-effect burden test. Hom-Meta-SKAT and Het-Meta-SKAT have roughly the same power, although when all causal rare variants are deleterious, Hom-Meta-

SKAT has slightly greater power. This is consistent with previous reports when the majority of

25 Figure 2.4: Burden test and MetaSKAT power under two penetrance models by study population and carrier proportion (relative risk = 2.5)

In upper panels, all causal variants are deleterious. In lower panels, a portion of causal variants are protective. Left panels display burden tests and meta-analysis of burden tests. Right panels display MetaSKAT and SKAT tests.

Legend: Overall CP was estimated in a population of 4,611 women as the proportion of women carrying at least one rare variant. Ethnicity specific CP was estimated in each ethnicity as the proportion of women of that ethnicity who carry at least one rare variant. Left field within each panel shows statistical power for all genes and by overall

CP for fourteen multi-ethnic populations. Study populations are ordered by increasing mean power for all genes from left to right. In the far right field of each panel, four ethnicities are arranged from left to right by mean power for all genes. Mean power is shown for genes in each category of ethnic-specific CP for the indicated ethnicity.

26 Figure 2.5: Meta-analysis statistic power for all genes in fourteen multi-ethnic populations and four

monoethnic populations under two penetrance models

Both penetrance models use a RR of 2.5. In the left panel, all causal variants are deleterious and in the right panel,

a portion of causal variants are protective. Unlike the previous figure, multi-ethnic study populations are arranged

in the same order for both left and right panels and power for inverse-variance meta-analysis of burden tests

increases for all deleterious causal variants increases from left to right. Again, X-axis depicts multi-ethnic study populations as the ratio of African American, European American, Japanese American and Latino subjects. Not

shown: power for Cochran's Q was under 0.2 for all scenarios depicted.

27 Table 2.5: Gene-specific power for alternate proportion of causal variants – Single Ethnicity

Attachment

28 Table 2.6: Gene-specific power for alternate proportion of causal variants – Multi-Ethnic

Attachment

29 genetic effects are shared across studies.33 The two techniques for meta-analyzing burden test

results while accounting for possible heterogeneity in effects across ethnicity—the joint test

and Cochran’s Q—both had relatively low power. In the scenarios plotted in Figure 2.5, for

example, Cochran’s Q had less than 20% power, while all other tests had greater than 25%

power.

The proportion of the total sample in different ethnicities affected the power of multi- ethnic meta-analyses (Figures 1.4 and Figure 2.5). Fixing the total sample size, the power for the

different sampling fractions we considered differed by as much as 0.25 (for models with fixed

relative risks) or 0.37 (for models where the relative risk increased with decreasing MAF (Table

2.7). Notably, studies restricted to African Americans had greater power than studies of other

single ethnic groups, as well as many studies that sampled multiple ethnicities. However, the

overall CP had a much stronger effect on power: the average power between genes with overall

CP less than 0.01 and those with CP greater than 0.03 differed by more than 0.60 in many

scenarios.

30 Table 2.7: Power by Penetrance Model for two multi-ethnic and four mono-ethnic study populations

50% Deleterious 15% Protective/35% Deleterious Study No MAF- MAF- No MAF- MAF- Population Statistic Dependence Dependent Dependence Dependent Multi-ethnic - Cochran’s Q 0.07 0.05 0.09 0.06 Equal Inverse-Variance 0.60 0.31 0.36 0.20 Proportion of Joint Test 0.47 0.20 0.28 0.13 each Ethnicity Het-Meta-SKAT 0.50 0.23 0.43 0.22 (1:1:1:1) Hom-Meta-SKAT 0.54 0.24 0.42 0.21 Multi-ethnic - Cochran’s Q 0.06 0.04 0.07 0.05 More African Inverse-Variance 0.63 0.35 0.41 0.24 Americans Joint Test 0.50 0.21 0.32 0.15 (4:1:1:2) Het-Meta-SKAT 0.55 0.34 0.49 0.30 Hom-Meta-SKAT 0.58 0.34 0.48 0.30 African Burden Test 0.64 0.50 0.57 0.44 American SKAT 0.67 0.46 0.50 0.34 European Burden Test 0.50 0.30 0.43 0.25 American SKAT 0.54 0.40 0.35 0.27 Japanese Burden Test 0.48 0.35 0.44 0.30 American SKAT 0.50 0.39 0.36 0.27 Latino Burden Test 0.52 0.46 0.49 0.40 SKAT 0.55 0.51 0.41 0.35

31 DISCUSSION

Using empirical targeted sequencing data on a large sample of 4,611 women from four

ethnic groups (African American, European American, Japanese and Latino), we evaluated the performance of several different approaches to meta-analyzing aggregate tests of association between rare variants and disease, as well as the power of designs with varying sampling fractions across the four ethnic groups.

Although Type I error rates were low for the meta-analysis techniques we explored,

Type I error rates CP 0.01) in all ethnicities were consistent with nominal alpha levels. Our findings are analogous to the work of Ma et al. who also observed deflated type I error rates for meta-analysis of individual sites with allele frequencies below 1% in a population of 908 subjects of Northern European ancestry.36 Choice of study subjects also influenced Type I error rates and, unlike inverse-variance meta-analysis of burden tests and MetaSKAT, which held relatively steady, Type I error for the joint test and Cochran’s Q were greatly deflated for all genes as the number of ethnicities increased and subgroup sizes dropped.

We hypothesized that causal variants might be detected using tests of heterogeneity due to population private variants, but this was not the case. Both Cochran’s Q and the joint test had low power to detect bidirectional and unidirectional effects compared to MetaSKAT and inverse-variance meta-analysis of the burden test. For high CP, MetaSKAT and the burden test were both well-powered to detect association when all causal variants were deleterious.

The burden test meta-analysis was better powered to detect uniformly deleterious variants

32 than MetaSKAT, but in the presence of protective effects, both MetaSKAT approaches

maintained higher power. Our results suggest that Hom-Meta-SKAT is preferable to Het-Meta-

SKAT. These tests were nearly indistinguishable in the presence of protective effects, but Hom-

Meta-SKAT was better-powered to detect deleterious variants. Of the meta-analyses techniques explored, Hom-Meta-SKAT performed best under the alternate hypotheses while maintaining appropriate Type I error.

Study populations that included higher proportions of African Americans had more valid test statistic performance under the null and had more power to detect association using either the burden test or MetaSKAT regardless of penetrance model. This may be due to the higher number of rare varying sites in African American study populations 4 (Figure 2.6). This suggests that studies testing for association between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.

Because we focused on test validity and power, our simulations did not model intra- or inter-ethnicity population stratification bias. This is an important concern for analysis of empirical data. 19,37–39 Our stratified analysis strategy accounts for possible inter-ethnicity differences in disease rates. Methods for accounting for intra-ethnicity population stratification are also important. For example, O’Connor et al. show that including ten principal components of genetic variation to account for fine-scale population structure may be sufficient to reduce inflation in aggregate rare variant tests due to differences in disease rates across European subpopulations39.

33 Figure 2.6: Non-Synonymous Variants by Sample Size

Number of individuals (Sample Sizes) is plotted against median number of variants (Median No. Variants) as defined by individuals sampled. On panel 1, count includes all non-synonymous coding (NS) variants. Panel 2

NS MAF 1% in the full sample on vertical axis. Each figure has six Loess-smoothed, color-coded curves representing counts: (from top to bottom at rightmost edge) aa+ea+ja+la (black)=African,

European, Japanese and Latino, ea+ja+la (slate gray)=European, Japanese and Latino, aa(red)=African, la(blue)=Latino, ea(yellow)=European, and ja(green)=Japanese American only study subjects.

34 We examined a diverse set of rare-variant association tests in multi-ethnic samples, but an exhaustive survey is beyond the scope of this paper. Many other methods have been proposed for meta-analysis of GWAS data including MANTRA of Morris,40 a binary effects model by Han and Eskin41 for burden scores and RE-VC and RE-VC-O of Tang and Lin.42 Varying parameters of the approach described by Lee et al33 and examined here could also provide an endless array of approaches to address this question. Moreover, MANTRA,40 a meta-analysis technique specifically proposed for multi-ethnic populations was not amenable to large scale simulation because it was implemented with an interactive user interface.

Next generation sequencing association studies–including targeted sequencing, exome sequencing, and soon whole-genome sequencing studies–are still in their earliest stages and this study is one of the largest of its kind. The current study included more ethnicities than most next-generation sequencing studies to date.16,43,25,30,34,40,44–46 A larger sample size allows

detection of more rare variation and greater power in association testing. However, our study

population was not infinite and simulations drawing from a finite population to generate larger

population sizes could lead to an underestimate of the amount and perhaps, influence of rare

variation. In addition, the current study still omits some continental populations. Future

studies may uncover new ways to optimize rare variant studies with even more diverse populations.

35 REFERENCES

1. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet.

2012;90(1):7-24. doi:10.1016/j.ajhg.2011.11.029.

2. Hindorff L, MacArthur, J (European Bioinformatics Institute), Morales, J (European

Bioinformatics Institute), HA J, PN H, AK K, Manolio T. A Catalog of Published Genome-Wide

Association Studies. 2013. www.genome.gov/gwastudies.

3. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T,

Hindorff L, Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.

Nucleic Acids Res. 2014;42(Database issue):D1001-D1006. doi:10.1093/nar/gkt1229.

4. Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, Bonnen PE, de Bakker

PIW, Deloukas P, Gabriel SB, Gwilliam R, Hunt S, Inouye M, Jia X, Palotie A, Parkin M, Whittaker

P, Chang K, Hawes A, et al. Integrating common and rare genetic variation in diverse human

populations. Nature. 2010;467(7311):52-58. doi:10.1038/nature09298.

5. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135-145.

doi:10.1038/nrg3118.

6. Zuk O, Schaffner SF, Samocha K, Do R, Hechter E, Kathiresan S, Daly MJ, Neale BM, Sunyaev SR,

Lander ES. Searching for missing heritability: designing rare variant association studies. Proc Natl

Acad Sci U S A. 2014;111(4):E455-E464. doi:10.1073/pnas.1322563111.

7. Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies

involving rare variants. Nat Rev Genet. 2010;11(11):773-785. doi:10.1038/nrg2867.

36 8. Goldstein DB, Allen A, Keebler J, Margulies EH, Petrou S, Petrovski S, Sunyaev S. Sequencing

studies in human genetics: design and interpretation. Nat Rev Genet. 2013;14(7):460-470.

doi:10.1038/nrg3455.

9. Liu DJ, Leal SM. A novel adaptive method for the analysis of next-generation sequencing data to

detect complex trait associations with rare variants due to gene main effects and interactions.

PLoS Genet. 2010;6(10):e1001156. doi:10.1371/journal.pgen.1001156.

10. Price AL, Kryukov G V, de Bakker PIW, Purcell SM, Staples J, Wei L-J, Sunyaev SR. Pooled

association tests for rare variants in exon-resequencing studies. Am J Hum Genet.

2010;86(6):832-838. doi:10.1016/j.ajhg.2010.04.005.

11. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, Kathiresan S, Purcell

SM, Roeder K, Daly MJ. Testing for an unusual distribution of rare variants. PLoS Genet.

2011;7(3):e1001322. doi:10.1371/journal.pgen.1001322.

12. L B L SM M D A R V C D

Application to Analysis of Sequence Data. Am J Hum Genet. 2008;(83):311-321.

doi:10.1016/j.ajhg.2008.06.024.

13. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum

statistic. PLoS Genet. 2009;5(2):e1000384. doi:10.1371/journal.pgen.1000384.

14. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic

risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1-2):28-56.

doi:10.1016/j.mrfmmm.2006.09.003.

37 15. Ladouceur M, Zheng H-F, Greenwood CMT, Richards JB. Empirical power of very rare variants for

common traits and disease: results from sanger sequencing 1998 individuals. Eur J Hum Genet.

2013;(October 2012):1-4. doi:10.1038/ejhg.2012.284.

16. Ladouceur M, Dastani Z, Aulchenko YS, Greenwood CMT, Richards JB. The empirical power of

rare variant association methods: results from sanger sequencing in 1,998 individuals. PLoS

Genet. 2012;8(2):e1002496. doi:10.1371/journal.pgen.1002496.

17. Song K, Nelson MR, Aponte J, Manas ES, Bacanu S, Yuan X, Kong X, Cardon L, Mooser VE,

Whittaker JC, Waterworth DM. Sequencing of Lp-PLA2-encoding PLA2G7 gene in 2000

Europeans reveals several rare loss-of-function mutations. Pharmacogenomics J.

2012;12(5):425-431. doi:10.1038/tpj.2011.20.

18. Emison ES, Garcia-Barcelo M, Grice EA, Lantieri F, Amiel J, Burzynski G, Fernandez RM, Hao L,

Kashuk C, West K, Miao X, Tam PKH, Griseri P, Ceccherini I, Pelet A, Jannot A-S, de Pontual L,

Henrion-Caude A, Lyonnet S, et al. Differential contributions of rare and common, coding and

noncoding Ret mutations to multifactorial Hirschsprung disease liability. Am J Hum Genet.

2010;87(1):60-74. doi:10.1016/j.ajhg.2010.06.007.

19. Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, McLaren PJ, Gupta N, Sklar P, Sullivan PF,

Moran JL, Hultman CM, Lichtenstein P, Magnusson P, Lehner T, Shugart YY, Price AL, de Bakker

PIW, Purcell SM, Sunyaev SR. Exome sequencing and the genetic basis of complex traits. Nat

Genet. 2012;44(6):623-630. doi:10.1038/ng.2303.

20. Rivas MA, Beaudoin M, Gardet A, Stevens C, Sharma Y, Zhang CK, Boucher G, Ripke S, Ellinghaus

D, Burtt N, Fennell T, Kirby A, Latiano A, Goyette P, Green T, Halfvarson J, Haritunians T, Korn

JM, Kuruvilla F, et al. Deep resequencing of GWAS loci identifies independent rare variants

38 associated with inflammatory bowel disease. Nat Genet. 2011;43(11):1066-1073.

doi:10.1038/ng.952.

21. Z N P B G T Z E H E Laging genetic variability across populations

for the identification of causal variants. Am J Hum Genet. 2010;86(1):23-33.

doi:10.1016/j.ajhg.2009.11.016.

22. Spiegelman D, Colditz GA, Hunter D, Hertzmark E. Validation of the Gail et al. model for

predicting individual breast cancer risk. J Natl Cancer Inst. 1994;86(8):600-607.

www.ncbi.nlm.nih.gov/pubmed/8145275.

23. Eliassen AH, Tworoger SS, Mantzoros CS, Pollak MN, Hankinson SE. Circulating insulin and c-

peptide levels and risk of breast cancer among predominately premenopausal women. Cancer

Epidemiol Biomarkers Prev. 2007;16(1):161-164. doi:10.1158/1055-9965.EPI-06-0693.

24. Pike MC, Kolonel LN, Henderson BE, Wilkens LR, Hankin JH, Feigelson HS, Wan PC, Stram DO,

Nomura AMY. Breast Cancer in a Multiethnic Cohort in H L A R F-

adjusted Incidence in Japanese Equals and in Hawaiians Exceeds that in Whites. Cancer

Epidemiol Biomarkers Prev. 2002;(11):795-800.

25. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31-46.

doi:10.1038/nrg2626.

26. Auwera GA Van Der, Carneiro MO, Hartl C, Poplin R, Angel G, Levy-moonshine A, Shakir K,

Roazen D, Thibault J, Banks E, Garimella K V, Altshuler D, Gabriel S, Depristo MA. From FastQ

Data to High-C V C T Ge Analysis Toolkit Best Practices Pipeline. Curr

Protoc Bioinforma. 2013;11(October):1-33. doi:10.1002/0471250953.bi1110s43.

39 27. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D,

Gabriel S, Daly M, DePristo M a. The Genome Analysis Toolkit: a MapReduce framework for

analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-1303.

doi:10.1101/gr.107524.110.

28. DePristo MA, Banks E, Poplin R, Garimella K V, Maguire JR, Hartl C, Philippakis AA, del Angel G,

Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel

SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-

generation DNA sequencing data. Nat Genet. 2011;43(5):491-498. doi:10.1038/ng.806.

29. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Ruden DM, Lu X. A program

for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in

D -2; iso-3. Fly (Austin). 2012;6(2):1-13.

http://snpeff.sourceforge.net/SnpEff_paper.pdf.

30. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data

with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82-93.

doi:10.1016/j.ajhg.2011.05.029.

31. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies and

beyond. Nat Rev Genet. 2013;14(6):379-389. doi:10.1038/nrg3472.

32. Hardy RJ, Thompson SG. Detecting and Describing Heterogeneity In Meta-Analysis. Stat Med.

1998;(17):841-856.

33. Lee S, Teslovich TM, Boehnke M, Lin X. General Framework for Meta-analysis of Rare Variants in

Sequencing Association Studies. Am J Hum Genet. 2013;93(1):42-53.

doi:http://dx.doi.org/10.1016/j.ajhg.2013.05.010.

40 34. Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, Hixson JE, Rea TJ, Muzny

DM, Lewis LR, Wheeler D a, Sabo A, Lusk C, Weiss KG, Akbar H, Cree A, Hawes AC, Newsham I,

Varghese RT, et al. Deep resequencing reveals excess rare recent variants consistent with

explosive population growth. Nat Commun. 2010;1(8):131. doi:10.1038/ncomms1130.

35. Fu W, O’Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, Gabriel S, Rieder MJ, Altshuler D,

Shendure J, Nickerson D a, Bamshad MJ, Akey JM. Analysis of 6,515 exomes reveals the recent

origin of most human protein-coding variants. Nature. 2013;493(7431):216-220.

doi:10.1038/nature11690.

36. Ma C, Blackwell T, Boehnke M, Scott LJ, GoT2D Investigators. Recommended joint and meta-

analysis strategies for case-control association testing of single low-count variants. Genet

Epidemiol. 2013;37(6):539-550. doi:10.1002/gepi.21742.Recommended.

37. Mao X, Li Y, Liu Y, Lange L, Li M. Testing genetic association with rare variants in admixed

populations. Genet Epidemiol. 2013;37(1):38-47. doi:10.1002/gepi.21687.

38. Liu J, Lewinger JP, Gilliland FD, Gauderman WJ, Conti D V. Confounding and heterogeneity in

genetic association studies with admixed populations. Am J Epidemiol. 2013;177(4):351-360.

doi:10.1093/aje/kws234.

39. O’Connor TD, Kiezun A, Bamshad M, Rich SS, Smith JD, Turner E, Leal SM, Akey JM. Fine-scale

patterns of population stratification confound rare variant association tests. PLoS One.

2013;8(7):e65834. doi:10.1371/journal.pone.0065834.

40. Morris AP. Transethnic meta-analysis of genomewide association studies. Genet Epidemiol.

2011;35(8):809-822. doi:10.1002/gepi.20630.

41 41. Han B, Eskin E. Interpreting meta-analyses of genome-wide association studies. PLoS Genet.

2012;8(3):e1002555. doi:10.1371/journal.pgen.1002555.

42. Tang Z-Z, Lin D-Y. Meta-analysis of sequencing studies with heterogeneous genetic associations.

Genet Epidemiol. 2014;38(5):389-401. doi:10.1002/gepi.21798.

43. Lander ES. Initial impact of the sequencing of the human genome. Nature. 2011;470(7333):187-

197. doi:10.1038/nature09792.

44. Haiman CA, Stram DO, Pike MC, Kolonel LN, Burtt NP, Altshuler D, Hirschhorn J, Henderson BE. A

comprehensive haplotype analysis of CYP19 and breast cancer risk: the Multiethnic Cohort. Hum

Mol Genet. 2003;12(20):2679-2692. doi:10.1093/hmg/ddg294.

45. Hein R, Maranian M, Hopper JL, Kapuscinski MK, Southey MC, Park DJ, Schmidt MK, Broeks A,

Hogervorst FBL, Bueno-de-Mesquita HB, Bueno-de-Mesquit HB, Muir KR, Lophatananon A,

Rattanamongkongul S, Puttawibul P, Fasching PA, Hein A, Ekici AB, Beckmann MW, et al.

Comparison of 6q25 breast cancer hits from Asian and European Genome Wide Association

Studies in the Breast Cancer Association Consortium (BCAC). PLoS One. 2012;7(8):e42380.

doi:10.1371/journal.pone.0042380.

46. Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of

rare variant association studies. Hum Mol Genet. 2012;21(R1):R1-R9. doi:10.1093/hmg/dds387.

42

Chapter 3

Breast cancer and rare protein-coding variation at 12 susceptibility loci

-

Doctoral Candidate: Akweley Ablorh

Thesis Committee Members: Dr. A. Heather Eliassen, Dr. Peter Kraft, Dr. Alkes Price

Abstract: Breast cancer, the most commonly diagnosed cancer in American women, is a heritable disease with dozens of known genetic risk factors. Yet, the two-fold familial relative

risk of breast cancer has not been fully explained. Using next generation sequencing, we

applied gene-based association testing to explore the contribution of rare (MAF < 0.5%)

protein-coding variation at 12 breast cancer susceptibility loci in a multi-ethnic study population of 4,611 breast cancer cases and controls reporting one of four ethnicities: African, Latina,

Japanese, or European American. Rare variation in several of the genes was associated with breast cancer risk without correction for multiple comparisons at alpha=0.05. The most significant association, ORAOV1, an oral cancer oncogene (p=0.004) did not withstand

Bonferroni adjustment at a stringent family-wise error rate of 5% when we included all breast cancer cases. Rare genetic variation in this oncogene was nominally associated with both estrogen-receptor positive (p=0.024) and estrogen-receptor negative (p=0.028) breast cancer in stratified analysis. This association study demonstrates the potential to uncover and localize gene-based associations using fine-mapping of disease-associated loci and possible advantages of a more inclusive study population.

44

INTRODUCTION

In women, the most common cancer in developed nations is breast cancer and it is second only to cervical cancer in many less developed nations around the globe.1 Lifestyle and reproductive factors have been associated with breast cancer risk, including age at first birth, age at menarche, and postmenopausal hormone use, 2,3 but few have associations as large or consistent as mammographic density or family history’s two-fold relative risk.4 Family history is believed to affect breast cancer risk through inherited genetic susceptibility.5 Although a few high penetrance genetic variants and numerous common risk variants have been identified, 6

they explain only a portion of the phenotypic variability due to genetics (heritability) of breast

cancer. Much of the remaining heritability of breast cancer predicted from twin studies remains

unexplained.7

Several breast cancer susceptibility variants were identified using genome-wide association studies (GWAS). 7–18 However, GWAS cannot be used to localize causal variants with precision because GWAS rely on linkage disequilibrium (LD) or correlation between neighboring genetic variants.19 In GWAS, one genetic variant is used to tag segments of the genome over which LD is maintained, “or LD blocks”. LD blocks can contain multiple genes. This is further complicated by the possibility of multiple causal variants within one tagged segment of the genome.

A second limitation of previous studies was the exclusion of several ancestral populations. In the United States, breast cancer affects women of all ethnicities.20 Early GWAS were often

performed in single ethnicities and particularly in European American study populations.21 LD

45

blocks differ by ancestry, which limits the consistency of some GWAS findings across

populations and studies in a single ethnicity may miss risk alleles that are observed at higher

frequencies in other populations. Multi-ethnic studies have been proposed to aid fine mapping of causal variants.22

Third, GWAS do not capture rare variants which may explain some remaining breast cancer heritability. The contribution of rare variants to complex disease risk is unknown.23 According to the “common disease common variant” hypothesis, common complex diseases like breast cancer are the result of higher frequency alleles of low to moderate effect. 24 However, common variants have only explained a small portion of the heritability in most common diseases.

There are several metrics for measuring the contribution of variants to disease risk. For risk variants with uniformly low penetrance and high allele frequencies like those identified in breast cancer GWAS, heritability is approximated well by familial relative risk, an estimate of increased disease risk due to shared additive or dominant genetic effects between relatives.25

Only 14% of the familial risk of breast cancer is explained by common variants at 67 established

loci that have been associated with breast cancer susceptibility.7 This increases to less than a

third, 28%, of familial relative risk explained using all information available from a subset of

over 35,000 SNPs associated with breast cancer in a meta-analysis of 9 breast cancer GWAS.7

46

Resequencing of previously identified, breast cancer susceptibility loci could localize causal

signal, leverage information from multiple continental populations, and reveal novel, rare

variants that are independent of the original common variant.26

In order to further explain the genetic susceptibility of breast cancer, we performed targeted sequencing of over 4,600 women of four continental ancestries at 12 GWAS-identified breast

cancer susceptibility loci. We investigated the association between genes containing rare

variants and risk of breast cancer in a gene-based nested-case control study of non-

synonymous (NS) rare genetic variation and breast cancer incidence. We report the results of

these rare-variant analyses here; results for common-variant analyses are reported

elsewhere.27

47

METHODS

Study population

Breast cancer cases and controls were selected from three cohorts: Multiethnic Cohort (MEC),20

Nurses’ Health Study (NHS)28 and Nurses’ Health Study II (NHS2)29. All three cohorts – MEC, NHS and NHS2 – are described in further detail previously.30 Study subjects each reported one of four ethnicities: African American, European American, Latina American, and Japanese

American (Table 3.1) which were confirmed using principal component analysis of GWAS

SNPs.31

Phenotype

Nurses’ Health Study and Nurses’ Health Study II breast cancer cases were identified on biennial questionnaires and confirmed by medical record review. Cases had no previously reported cancer diagnosis.29,32

Breast cancer incidence in the MEC cohort was obtained through linkage to the Hawaii

Surveillance Epidemiology, and End Results cancer registry, the Los Angeles County

Surveillance, Epidemiology, and End Results registry, or the California State registry. Women who self-reported breast cancer prior to membership in the cohort were ineligible for breast cancer case status in this analysis. In addition, prior diagnosis with cancer of the endometrium or ovary was considered a censoring event because treatment often involves oophorectomy which alters risk of subsequent breast cancer incidence significantly.20

48

Table 3.1: Study Participants

Controls All Cases ER+ Cases ER- Cases All Subjects N (%) N (%) N (%) N (%) N (%) Study MEC 1,562 (67%) 1,538 (67%) 908 (66%) 272 (69%) 3,100 (67%) NHS 761 (33%) 467 (20%) 315 (23%) 80 (20%) 1,228 (27%) NHSII 0 (0%) 283 (12%) 143 (10%) 41 (10%) 283 (6%) Ethnicity African American 469 (20%) 468 (20%) 273 (20%) 117 (30%) 937 (20%) European American 761 (33%) 750 (33%) 458 (34%) 121 (31%) 1,511 (33%) Japanese American 637 (27%) 619 (27%) 352 (26%) 57 (15%) 1,256 (27%) Latina American 456 (20%) 451 (20%) 283 (21%) 98 (25%) 907 (20%) Total 2,323 2,288 1,366 393 4,611

49

The full, multi-ethnic population for the current study consisted of 2,323 breast cancer cases and 2,288 controls. We obtained 1,538 breast cancer cases and 1,562 controls from MEC and an additional 750 breast cancer cases and 761 controls from NHS and NHS2. Case estrogen receptor (ER) status was extracted from medical records for approximately three-quarters of breast cancer cases (See Table 3.1 for a full listing of case and control ethnicities and study membership).

Genotype Data

We selected 12 genomic regions for targeted sequencing in the vicinity of previously identified

GWAS hits (Table 3.2). Targeted sequencing is described previously for this population.30

Targeted regions were defined by recombination hotspots flanking variants with published

genome-wide-significant associations with breast cancer risk and encompassed 74 genes.

Custom Agilent baits were used for target capture, and high-depth sequencing was performed

at the Broad Institute using the Illumina HiSeq platform. Reads were aligned using BWA and

genotypes called using GATK in batches of approximately 100 samples.33–36 Sequencing batches

were balanced with respect to case-control status and ethnicity. All subjects were previously

genotyped using Illumina HumanHap arrays. Genotype call concordance was over 99.7% across

four pairs of duplicate samples at variable sites. Concordance was also over 99.7% between

sequencing genotype calls and GWAS genotypes. Across all 12 regions, there was greater than

20x coverage in 93.8 to 99.9% of baited portions. Sample batches were balanced with respect

to case-control status and ethnicity.

50

Table 3.2: 12 Sequenced Regions, Coverage Depth, and Length

Median | Minimum Region Index_SNP Locus Chr SNP_Pos Length(kb) Coverage Depth (Coding) 1 rs13387042 2q35 2 217905832 ------121.8 2 rs10069690 TERT 5 1279790 824 42 45.9 3 rs889312 MAP3K1 5 56031884 3681 282 307.8 4 rs2046210 ESR1 6 151990059 3749 358 242.7 5 rs1562430 8q24 8 128387852 1317 118 972.9 6 rs10995190 ZNF365 10 64278682 3221 244 397.9 7 rs704010 ZMIZ1 10 80841148 1307 122 473.2 8 rs2981579 FGFR2 10 123337335 1934 131 876.2 9 rs614367 11q13 11 69328764 910 36 258.9 10 rs999737 RAD51B 14 69034682 2319 36 815.4 11 rs3803662 TOX3 16 52586341 2797 40 269 12 rs8170 MERIT40 19 17389704 1025 35 712.4

51

Variants with sequencing quality scores below 30 were omitted. Variants were also removed if

greater than 10% of samples had no reads or if the total number of reads across all samples was

under 20,000. Genotypes were considered missing for an individual if the quality score was less than 10 or if there were fewer than five total reads. We excluded 8 samples that were pair-wise duplicates, 43 samples with less than 90% concordance with GWAS data since the low concordance could signal potential sample mix-ups, 16 samples with call rates under 90%, 36 samples for whom we did not have GWAS data, and five samples with unexpected ancestry.

Our final analysis included a total of 138,792 single nucleotide variants in 4,611 women.

Grouping of Rare Variant Information

Variants were grouped for statistical analysis according to whether variants were located within coding regions of protein-coding genes. Gene annotation of variants or assignment of a variant to a gene, was implemented using GEMINI (GEnome MINIng).37 GEMINI is a flexible UNIX- compatible framework for exploring genome variation which pulls information from SnpEff 38 as mapped by build 37 of the human genome. We further annotated each variant with Polyphen-2 scores39 which predict the functional impact of individual variants using Variant Effect Predictor

(VEP) 40.

Of the 2,085 NS coding variants that passed QC filters in our study population, we obtained

Polyphen-2 scores for 1,975 (95%) using GEMINI. Exclusive predictions for nearly 90% of (1,852)

NS variants remained after excluding variants with no predictions, None and unknown categories. Lower scores correspond to less damaging qualitative values and scores ranged from 1 to 617 (Table 3.3). Scored variants were assigned to at least one of six Polyphen-2

52

Table 3.3: Count and Percent of Predictions and Polyphen-2 Scores for Non-Synonymous SNPs

Polyphen-2 Prediction Number of QC’d Percentage of Polyphen-2 Score range variants Predictions

(Also predicted as (Out of 1,981) “None”)

Benign 1,060 (3) 54% 1 – 283 possibly damaging 365 (1) 18% 284 - 533 probably damaging 427 (1) 22% 534 – 616

Unknown 76 (1) 4% 1

None 53 3% 617

Six variants had two predictions and were assigned to None and one other prediction.

53

qualitative predictions: benign, possibly damaging, probably damaging, or in the case of

insufficient data, unknown or None.

To limit the incorporation of null variants, which have no impact on cell function, we tested for

association with only rare (MAF < 0.005), NS coding variants. However, not all rare, NS variants

are equally deleterious so we further refined the set of variants in each gene and also tested

rare, NS variants that were predicted to be possibly damaging or probably damaging by

Polyphen-2. We then tested each gene-specific subset of NS, rare variants or subset of NS, rare variants predicted to be damaging.

Statistical Methods

Gene-based association tests in ethnicity

We used two gene-based rare variant tests in each ethnicity: a burden test and a Sequencing

Kernel Association Test (SKAT).41 These tests can both accommodate covariates and can be

classified as either testing for association between breast cancer case status and the mean SNP

effect size in the case of the burden test or testing for association between case status and the

variance of SNP effect sizes in the case of SKAT. We tested overall breast cancer and breast

cancer by ER status for association with gene-specific rare variants. Each association test was

performed separately by ethnicity and adjusted for the first three ethnicity-specific principal

components to limit confounding due to population stratification.

54

We first tested the probability of case status ( ) for an individual i in ethnicity e using the

沈勅 following logistic regression model: 桁

[ ( )] = + + + + Equation 1

健剣訣件建 継 桁沈勅 紅待 紅怠隙怠 紅態隙態 紅戴隙戴 紅直訣沈勅珍 Above, the first three ethnicity-specific principal components are represented , , and .

怠 態 戴 The fourth coefficient is multiplied by an indicator for the presence of rare (MAF隙 <隙 0.005) non隙 -

synonymous genetic variation in gene for individual e, . Association with case status can

沈勅珍 also be tested using SKAT which does not倹 require variants件 訣 to be collapsed into one indicator

variable but still adjusts for the same covariates.41

Multi-ethnic meta-analysis

Meta-analysis for a multi-ethnic population of both SKAT and the indicator variable described

above is described in further detail elsewhere.30 We applied two meta-analysis techniques:

Inverse-variance (IV) weighted fixed effect meta-analysis 42, and Meta-analysis of SKAT

assuming effects of each variant is homogenous regardless of ethnicity (Hom-Meta-SKAT)43.

Each technique uses a distinct approach: a mean-based approach – fixed-effect meta-analysis; and a variance-based approach – homogenous MetaSKAT.

Correction for Multiple Comparisons

Since our analysis of rare variation is exploratory, type I error can be controlled using the false discovery rate (FDR). FDR is the expected proportion of the total number of null hypothesis

55

rejections which are false. For a set of 100 rejected null hypotheses, an FDR of 5% would be expected to include 5 falsely rejected null hypotheses.

We applied the Benjamini-Hochberg procedure described previously and illustrated here.44 To

FDR N Hi be the null hypothesis for test i

and H0 be the global null hypothesis that none of the N hypotheses should be rejected for any

test i. We ordered the set of p-values (p1 … pN) from smallest to largest so that p(1) (2) … p(N-1)

(N), determined the maximum j such that: ( ) , and rejected all hypotheses for p- 珍ゲ底 喧 珍 判 朝 values p(1) … p(j). To determine which hypotheses meet this criterion.

56

RESULTS

Using a family-wise error rate (FWER) of 0.05, no significant association remained between

rare, NS variation at 12 breast cancer loci and breast cancer case status after Bonferroni

correction for the number of phenotypes (3), tests per gene (2), sets per gene (2), and number of genes (47) (Table 3.4). However, the overall set of p-values did not differ from the null distribution when using a less conservative Benjamini-Hochberg procedure for a false-discovery rate of 0.05. There were several nominally significant findings after adjusting only for two tests per gene with alpha=0.025 (Table 3.5).

Of the 12 sequenced regions, 11 contained coding regions with rare, NS variation (Table 3.2).

There were nominally (alpha=0.05) significant associations between rare, NS variation and breast cancer at six of the eleven loci we explored.

All Breast Cancer: Adjusting for two independent tests at alpha=0.05, breast cancer case status was associated with the variability of rare, NS sites in four genes: ORAOV1, GTPBP3, GLT25D1, and NSMCE4A. Rare, NS variants that were predicted to be damaging were nominally associated with breast cancer case status in DDA1. When using fixed-effect (mean-based) meta- analysis techniques, only one of the associations was significant after correcting for two tests, but two genes had p-values below alpha=0.05. For one, NSMCE4A, the presence of any rare, NS variation was inversely associated with breast cancer risk (OR=0.4 95% CI: [0.2, 0.9]) while in the other, ANO8, the presence of damaging rare NS variation was associated with increased risk of breast cancer (OR=5.3 95% CI: [1.1, 24.2]).

57

Table 3.4: Significant gene meta-analyses for all breast cancer, and by ER status (alpha=0.05) Significant P- Cases Genea Chr: GWAS SNP Region Sites CPb Test(s)c Valued exp( ) [95%CI]e All ORAOV1 chr11: rs614367 9 43 4.8% Mhom 0.004* 0.8 [0.6, 1.1] GTPBP3 chr19: rs8170 12 48 3.5% Mhom 0.010* 1.3 [0.9,肯侮 1.8] GLT25D1 chr19: rs8170 12 55 2.5% Mhom 0.015* 0.9 [0.7, 1.4] DDA1 - p2 chr19: rs8170 12 1 <1.0% Mhom 0.046 2.0 [0.9, 4.1] NSMCE4A chr10: rs2981579 8 17 0.7% IV 0.023* 0.4 [0.2, 0.9] ANO8 - p2 chr19: rs8170 12 8 <1.7% IV 0.034 5.3 [1.1, 24.2] ER+ GTPBP3 chr19: rs8170 12 48 3.5% Mhom 0.007* 1.4 [1.0, 2.1] TMEM221 - p2 chr19: rs8170 12 5 <2.1% Mhom 0.013* 0.4 [0.2, 1.0] GLT25D1 chr19: rs8170 12 55 2.5% Mhom 0.018* 0.8 [0.5, 1.3] ZNF365 - p2 chr10: rs10995190 6 8 <2.0% Mhom, IV 0.019* 2.5 [1.0, 5.9] ORAOV1 chr11: rs614367 9 43 4.8% Mhom 0.024* 0.8 [0.6, 1.1] TMEM221 chr19: rs8170 12 24 2.1% Mhom 0.028 0.6 [0.4, 1.1] ZFYVE26 chr14: rs999737 10 126 5.7% Mhom 0.041 1.2 [0.9, 1.5] MAP1S - p2 chr19: rs8170 12 22 <4.9% Mhom 0.045 0.9 [0.6, 1.5] GLT25D1 - p2 chr19: rs8170 12 15 <2.5% Mhom 0.049 0.9 [0.4, 1.9] ZNF365 chr10: rs10995190 6 38 2.0% IV 0.013* 1.8 [1.1, 2.9] PLVAP chr19: rs8170 12 31 1.8% IV 0.016* 1.8 [1.1, 3.0] FGFR2 chr10: rs2981579 8 41 1.5% IV 0.016* 2.0 [1.1, 3.4] C6orf211 chr6: rs2046210 4 34 2.7% IV 0.044 0.6 [0.4, 1.0] ER- ORAOV1 chr11: rs614367 9 43 4.8% Mhom 0.028 0.9 [0.5, 1.4] ANO8 - p2 chr19: rs8170 12 8 <1.7% Mhom 0.028 4.4 [0.3, 74.5] FAM129C chr19: rs8170 12 62 6.8% Mhom 0.035 0.9 [0.6, 1.3] BABAM1 chr19: rs8170 12 24 1.4% Mhom 0.041 2.1 [0.9, 5.1] ORAOV1 - p2 chr11: rs614367 9 2 <4.8% Mhom 0.042 3E+6 [0, >1E50] UNC13A chr19: rs8170 12 69 3.7% Mhom 0.045 1.5 [0.9, 2.5] ZFYVE26 chr14: rs999737 10 126 5.7% Mhom 0.049 1.5 [1.0, 2.3] USE1 chr19: rs8170 12 14 0.4% IV, Mhom 0.022* 7.2 [1.3, 38.8] ABHD8 chr19: rs8170 12 19 1.3% IV 0.026 2.4 [1.1, 5.0] TMEM221 chr19: rs8170 12 24 2.1% IV 0.029 2.1 [1.1, 3.9] ZFYVE26 - p2 chr14: rs999737 10 29 <5.7% IV 0.031 2.5 [1.1, 5.7]

a. p2 = Subset of sites within gene predicted to be potentially or possibly deleterious by Polyphen-2

b. CP = Carrier Proportion or proportion of subjects who carry at least one rare, non-synonymous variant in

gene.

c. IV=Inverse-variance weighted burden test, Mhom=Hom-Meta-SKAT, Underline = lowest p-value if more

than one significant test

d. Unadjusted p-value. * = P-Value is less than 0.025 (0.05/2). A more conservative cutoff which takes

number of tests per gene-set into account.

e. Gray = Estimated fixed-effect summary log odds ratio ( ) for carrying any rare variation vs. none is not

significantly different from 0 at alpha=0.05 肯侮

58

Table 3.5: Significant Gene Meta-Analyses after Correction for Number of Statistical Tests

(alphaB =0.025)

ER-Receptor Start – Status Effect Region Genea Chromosome End Association Full Gene Name Estimate

69467844- oral cancer 9 ORAOV1 11 69490184 All and Positive overexpressed 1 none non-SMC element 123716603- 4 homolog A (S. 8 NSMCE4A 10 123734743 All cerevisiae) Protective GTP-binding 17445729- protein 3 12 GTPBP3 19 17453544 All and Positive (mitochondrial) none glycosyltransferase 17666511- 25 domain 12 GLT25D1 19 17693965 All and Positive containing 1 none

17546318- transmembrane 12 TMEM221a 19 17559376 Positive protein 221 None plasmalemma 17462257- vesicle associated 12 PLVAP 19 17488159 Positive protein Deleterious

64133951- zinc finger protein 6 ZNF365a 10 64431771 Positive 365 Deleterious

123237848- fibroblast growth 8 FGFR2 10 123357972 Positive factor receptor 2 Deleterious unconventional SNARE in the ER 17326155- 1 homolog (S. 12 USE1 19 17330638 Negative cerevisiae) Deleterious

a. Subset of sites within gene predicted to be potentially or possibly deleterious by Polyphen-2

59

ER-positive breast cancer: (N=1,378 Cases) After limiting cases to ER-positive breast cancer,

two of the lowest p-values for all cases were also the most associated genes in ER-positive cases: GTPBP3 and GLT25D1. For a third gene, TMEM221, p-values for ER-positive breast cancer were below 0.025 when testing variability in only those variants was predicted to be damaging.

The association was still nominally significant (alpha=0.05) when testing all variants in

TMEM221.

In contrast to all cases, the presence of rare variation in eight genes was nominally associated with ER-positive breast cancer when using fixed-effect meta-analysis techniques at the more

stringent alpha=0.025 level. (Table 3.4) At three genes (FGFR2, PLVAP, and ZNF365), those who

carried rare variants experienced an estimated two-fold increase in the odds of ER-positive

breast cancer compared to those who carried none. When limited to variants predicted to be

deleterious, fixed-effect meta-analysis association for ZNF365 was nominally significant

(alpha=0.05) and the estimated odds ratio increased from 1.8 [95%CI: 1.1, 2.9] for carrying any

rare, NS variants to 2.5 [95%CI: 1.0, 5.9] for carrying rare, NS variants that were predicted to be

damaging.

ER-negative breast cancer: (N=399 Cases) Only one gene-based association with ER-negative

breast cancer withstood any adjustment for multiple testing (USE1). (Table 3.4) ER-negative

breast cancer case status was nominally (alpha=0.05) associated with the presence of rare, NS

variation in three genes – ABHD8, TMEM221, and USE1; and with damaging rare, NS variation in

a fourth gene – ZFYVE26. ER-negative breast cancer was also associated (alpha=0.05) with

variability of all rare, NS variation in five genes (BABAM1, FAM129C, ORAOV1, UNC13A, and

60

ZFYVE26) and with the variability of rare, NS variation predicted to be damaging for one of the five genes (ORAOV1) and an additional gene (ANO8).

The most significant associations in ethnicity-specific analyses overlapped with the most significant meta-analysis results. However several genes reached nominal significance in only one or a few ethnicities and were not significant in meta-analysis such as NXNL1 for Japanese and African Americans, ZM1Z1 for European Americans, MYC in Latina Americans, and BTBD16 in African Americans (Table 3.6).

61

Table 3.6: Gene-Based Associations by Ethnicity (P-Value < 0.02 in at least one ethnicity)

Case SE Exp Burden SKAT

Ethnicity Gene ER beta (beta) (beta) Pval P-val Sites Cases Controls African NXNL1 + -0.9 0.4 0.4 0.015 0.081 10 273 469 African NXNL1 All -0.7 0.3 0.5 0.015 0.057 12 468 469 African MYO9B - 0.7 0.3 2.0 0.017 0.094 43 117 469 African GTPBP3 + 0.6 0.3 1.8 0.042 0.003 15 273 469 African TMEM221 P2 + -1.1 0.6 0.3 0.051 0.016 3 273 469 African UNC13A + 0.7 0.4 2.1 0.087 0.016 15 273 469 African MAP3K1 P2 - 2.1 1.2 7.8 0.095 0.020 3 117 469 African BTBD16 + -1.3 0.8 0.3 0.103 0.009 4 273 469 African GTPBP3 All 0.4 0.3 1.5 0.104 0.003 17 468 469 African ORAOV1 All -0.3 0.2 0.8 0.153 0.003 18 468 469 African TMEM221 + -0.4 0.4 0.7 0.367 0.018 7 273 469 African BTBD16 P2 + -15.0 438.2 0.0 0.973 0.010 1 273 469 European TACC2 All -0.5 0.2 0.6 0.002 0.382 68 750 761 European TMEM221 - 1.5 0.5 4.5 0.003 0.006 4 121 761 European GTPBP3 - 2.2 0.8 8.8 0.005 0.003 7 121 761 European ORAOV1 + -2.0 0.7 0.1 0.007 0.022 6 458 761 European GTPBP3 All 1.6 0.6 5.2 0.010 0.012 12 750 761 European FGFR2 + 1.4 0.5 4.0 0.010 0.123 10 458 761 European TMEM221 P2 - 1.7 0.7 5.5 0.013 0.007 2 121 761 European PLVAP + 1.2 0.5 3.2 0.013 0.130 8 458 761 European BABAM1 All 2.5 1.0 12.3 0.016 0.073 11 750 761 European GTPBP3 + 1.6 0.7 5.0 0.017 0.008 9 458 761 European BABAM1 + 2.5 1.1 11.8 0.021 0.004 7 458 761 European ANKLE1 P2 - 2.5 1.3 12.6 0.047 0.006 2 121 761 European UNC13A - 0.6 0.4 1.8 0.172 0.014 13 121 761 European ZMIZ1 - 0.7 0.6 2.1 0.211 0.017 14 121 761 European GLT25D1 P2 + -0.3 0.7 0.7 0.657 0.015 5 458 761 Japanese ZNF365 + 1.0 0.4 2.6 0.013 0.025 8 352 637 Japanese ZNF365 P2 + 1.3 0.5 3.8 0.014 0.016 2 352 637 Japanese USHBP1 P2 + 1.3 0.9 3.7 0.130 0.008 4 352 637 Japanese USHBP1 P2 All 1.1 0.8 3.1 0.164 0.011 4 619 637 Japanese FGF19 + 14.8 508.1 2.7E+06 0.977 0.011 1 352 637 Japanese NXNL1 P2 - 17.2 882.7 2.9E+07 0.984 0.011 1 57 637 Japanese BST2 - 17.2 882.7 2.9E+07 0.984 0.012 1 57 637 Japanese NXNL1 - 17.2 882.7 2.9E+07 0.984 0.012 1 57 637 Latina BABAM1 - 1.5 0.7 4.5 0.037 0.001 6 98 456 Latina USE1 - 2.4 1.2 10.5 0.058 0.007 2 98 456 Latina ZFYVE26 + 0.6 0.3 1.8 0.064 0.007 31 283 456 Latina MYC All 1.2 0.8 3.4 0.129 0.005 5 451 456

62

Table 3.6 (Continued)

Case SE Exp Burden SKAT

Ethnicity Gene ER beta (beta) (beta) Pval P-val Sites Cases Controls Latina TMEM221 P2 All -1.2 0.8 0.3 0.148 0.016 4 451 456 Latina GTPBP3 + 0.9 0.8 2.4 0.268 0.008 6 283 456 Latina NR2F6 All 0.0 0.8 1.0 0.961 0.013 3 451 456 Latina PPIF + 15.3 506.0 4.3E+06 0.976 0.008 3 283 456 Latina MYC P2 - 15.8 622.8 7.3E+06 0.980 0.005 2 98 456

63

DISCUSSION

We investigated the association between rare non-synonymous coding variation and breast

cancer risk in a multi-ethnic population and detected mechanistically plausible but not

significant associations.

Four nominally-associated genes (Table 3.5) were in a gene-rich portion of

where we identified 24 genes containing rare variation. One of the four genes, GTPBP3, is

localized to mitochondria; cell organelles tasked with energy production, and may interact with

estradiol, an estrogen hormone [RefSeq, Comparative Toxicogenomics Database (CTD)]. Three

other genes, GLT25D1(COLGALT1), TMEM221, and PLVAP are membrane localized to the endoplasmic reticulum45, extracellular membrane 46, and intracellular membranes 47.

Membrane-bound proteins may play an important role in cellular signaling cascades.

Most ethnicity-specific findings overlapped with meta-analysis results (Table 3.4). Observing similar associations in multiple ethnicities is consistent with the findings of Chen et al.48 who cautioned against the generalizability of a genetic risk score for breast cancer developed in one ethnic group (European Americans) across several ethnicities despite observing a similar trend for the score in all but one of five ethnicities.48

Like Chen and colleagues,48 the loci we sequenced were identified using GWAS mainly performed and replicated in European American study populations. Particularly for ER-positive breast cancer, effect estimates in European Americans tended to be more significant and larger in magnitude at these 12 loci than the multi-ethnic meta-analysis summary effect estimate

64 which was averaged across multiple ethnicities. However, effect estimates in African Americans were more correlated with the summary meta-analysis effect estimates compared to other single-ethnicity analysis. This suggests rare variant findings in African Americans may be more generalizable to non-African American subjects. (Table 3.7)

Within ethnicity, we did not adjust for breast cancer risk factors since omitting risk factors does not necessarily reduce power in case-control studies.49 However, aggregating breast cancer cases across estrogen receptor (ER) specific subtype could limit statistical power given that ER- positive and ER-negative breast cancer have different prognoses and risk factors50. Most of the loci we sequenced were more strongly associated with ER-positive than ER-negative breast cancer (Table 3.8). ER-negative association may have been harder to find because we sequenced regions that are more strongly associated with ER-positive breast cancer and, compared to all breast cancer cases, there were relatively few ER-negative breast cancer cases which limited statistical power.

Our analyses focused on coding variants. Yet, almost all (>98%) of the variants sequenced in our study were outside of coding regions. One locus (2q35) did not contain any rare, NS variants

(Table 3.2) and for three loci (MAP3K1, 8q24, and 11q13) the index (tag) SNP was upstream of the first coding variant (outside coding regions) (Table 3.9). There are several examples of noncoding variation playing an important role in breast cancer51,52 and other diseases53.

65

Table 3.7: Correlation between summary effect estimate and ethnicity specific effect estimates by case ER status

Ethnicity All Breast Cancer ER- Breast Cancer ER+ Breast Cancer

African American 0.28 0.30 0.32

European American 0.63 0.15 0.51

Japanese American -0.24 0.18 -0.17

Latina 0.36 -0.26 0.26

66

Table 3.8: Genetic Susceptibility Loci Identified in GWAS by receptor status (Adapted From Palmer et al. Cancer Epidemiol Biomarkers Prev 2013 Table 2)54 Associated ER+ ER- a GWAS (p-valuehet) OR OR 54 Locus SNP OR Allele MAFAA (95% CI) (95% CI) ER+ 1p11.2 rs1124943355 1.16 C/T 0.13 (7.6 × 10) 56 1.13 (1.10-1.16) 1.03 (0.98-1.07) ER+ 2q35 rs1338704257 1.2 A/G 0.73 (0.006) 58 1.16 (1.13-1.19) 1.09 (1.05-1.13) ER+ 3p24.1 rs497376859 1.11 T/C 0.37 (0.003) 58 1.13 (1.10-1.16) 1.05 (1.01,1.10) ER+ 5p12 rs441508460 1.16 T/C 0.61 (4x10-6) 61 B 1.15 (1.12-1.19) 1.04 (0.99-1.08) ER- 5p15.33 rs1006969012 1.18 C T/C 0.57 (1.7 x 10-4) 12 1.04 (1.00-1.08) 1.18 (1.13-1.25) both 5q11.2 rs8893128 1.13 C/A 0.32 (0.531)58 1.11 (1.08-1.15) 1.09 (1.5-1.14) ER+ 6q25.1 rs204621062 1.29 A/G 0.61 (5.1x10-6) 63 1.07 (1.04-1.1) 1.20 (1.15-1.25) ER+ 8q24.21 rs13281615 8 1.08 C/T 0.43 (0.001)64 1.13 (1.10-1.17) 1.03 (0.98-1.09) Both 9p21.3 rs101197065 1.09 T/G 0.34 (0.65) 66 1.07 (1.03-1.10) 1.13 (1.07-1.19) Both 10p15.1 rs238020565 0.94 T/C 0.6 (0.38) 66 0.97 (0.95-1.00) 0.99 (0.96-1.04) Both 10q21.2a rs1099519065 0.86 A/G 0.16 (0.61) 66 0.87 (0.85-0.91) 0.89 (0.84-0.94) Both 10q22.3 rs70401065 1.07 A/G 0.1 (0.13) 66 1.07 (1.05-1.10) 1.04 (1.00-1.08) ER+ 1.28 (1.24, 1.05 (1.01, 10q26.13 rs298157955 1.17 T/C 0.6 (6.1x10-18) 58 1.31) 1.09) Both 1.07 (1.04, 1.05 (1.00 , 11p15.5 rs38171988 1.07 C/T 0.11 (0.426) 58 1.10) 1.09) ER+ 11q13.3 rs61436765 1.15 T/C 0.14 (3x10-10) 66 1.26 (1.22-1.31) 1.01 (0.96-1.07) Both 0.90 (0.87, 0.93 (0.88, 14q24.1 rs99973755 1.06 C/T 0.95 (0.42)56 0.93) 0.98) ER+ 1.26 (1.23, 1.15 (1.10, 16q12.1 rs38036628 1.2 T/C 0.51 (2.1x10-10) 58 1.30) 1.20) ER+ 0.93 (0.90, 1.00 (0.95, 17q23.2 rs650495059 0.95 A/G 0.35 (0.002) 58 0.95) 1.05) ER- 19p13.11 rs817067 1.26 C A/G 0.19 (2.9 x 10-5) 67 0.91(0.84-0.98) 1.21 (1.07-1.37) Bold, shaded loci are examined in the current study. A. Case-Only test of heterogeneity p-value. 2 B. p-valuehet for rs10941679 instead of rs4415084. R = 0.51 C. GWAS studies of ER-negative breast cancer cases only

67

Table 3.9: Nominal gene-based rare variant breast cancer associations

Region Gene Chr: Position + - All Region Gene Chr: Position + - All

2 TERT 5:1295184 12 HAUS8 19:17160539

3 MAP3K1 5:56191979 12 MYO9B 19:17186591 x 3 C5orf35 5:56221359 12 USE1 19:17326155

3 MIER3 5:56267502 12 OCEL1 19:17337013

4 RMND1 6:151773259 12 NR2F6 19:17342692 x 4 C6orf211 6:151791236 12 USHBP1 19:17359985 x 4 C6orf97 6:151942328 12 BABAM1 19:17378159

5 POU5F1B 8:128432311 12 ANKLE1 19:17392454 x 5 MYC 8:128753674 12 ABHD8 19:17402940 x/xx 6 ZNF365 10:64431771 12 MRPL34 19:17403418 xx 7 ZMIZ1 10:81076276 12 DDA1 19:17420327 xx xx 7 PPIF 10:81115093 12 ANO8 19:17434032 x x 8 FGFR2 10:123357972 12 GTPBP3 19:17445729 x 8 ATE1 10:123688316 12 PLVAP 19:17462257 x 8 NSMCE4A 10:123734732 12 BST2 19:17513748

8 TACC2 10:124014060 12 FAM125A 19:17530912 x/xx x 8 BTBD16 10:124097677 12 TMEM221 19:17546318

9 CCND1 11:69469242 12 NXNL1 19:17566234 9 ORAOV1 11:69490184 x x x/xx 12 SLC27A1 19:17579578

9 FGF19 11:69519410 12 PGLS 19:17622438 10 ZFYVE26 14:68283307 x x/xx x 12 FAM129C 19:17634110 x x/xx x 10 RAD51B 14:69196935 12 GLT25D1 19:17666511 x 11 TOX3 16:52581714 12 UNC13A 19:17712137 xx 12 MAP1S 19:17830051

Chr: Position = Chromosome and gene 5’ start position

+ = ER-positive breast cancer

- = ER-negative breast cancer

All = breast cancer x = Gene-based rare variant association is nominally significant (alpha=0.05). xx = Gene-based polyphen-2 damaging, rare variant association is nominally significant (alpha=0.05). x/xx = Both x and xx

68

Despite limitations, our results suggest future investigators may localize causal signal with a

similar approach. Our study population is relatively large and incorporates multiple ethnicities,

which has been shown to maximize statistical power to detect phenotypic association with

causal rare variants.30 We explored a fraction of the more than 70 established GWAS-identified breast cancer loci68 and localized rare variant breast cancer association to several mechanistically plausible genes. Future investigators may benefit from further study of other

GWAS-identified loci and should explore the numerous non-coding annotations that have been proposed69 to aggregate rare variants. Sufficiently large sample sizes that incorporate ancestral populations with greater genetic diversity and that target regions appropriate to the phenotype under study will be critical for effective fine mapping. In addition, non-coding variation will be a vital next step in uncovering the remaining hidden heritability of breast cancer.

69

REFERENCES

1. Ferlay J, Soerjomataram I, Ervik M, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F. GLOBOCAN 2012: Estimated Cancer Incidence, Mortality and Prevalence Worldwide 2012. 2013. http://globocan.iarc.fr. 2. Yang XR, Chang-Claude J, Goode EL, Couch FJ, Nevanlinna H, Milne RL, Gaudet M, Schmidt MK, Broeks A, Cox A, Fasching PA, Hein R, Spurdle AB, Blows F, Driver K, Flesch-Janys D, Heinz J, Sinn P, Vrieling A, et al. Associations of breast cancer risk factors with tumor subtypes: a pooled analysis from the Breast Cancer Association Consortium studies. J Natl Cancer Inst. 2011;103(3):250-263. doi:10.1093/jnci/djq526. 3. Nelson NJ. Breast cancer prevention in high-risk women: searching for new options. J Natl Cancer Inst. 2011;103(9):710-711. doi:10.1093/jnci/djr159. 4. Tice JA, Cummings SR, Ziv E, Kerlikowske K. Mammographic breast density and the Gail model for breast cancer risk prediction in a screening population. Breast Cancer Res Treat. 2005;94(2):115-122. doi:10.1007/s10549-005-5152-4. 5. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7-24. doi:10.1016/j.ajhg.2011.11.029. 6. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nat Genet. 2008;40(1):17-22. doi:10.1038/ng.2007.53. 7. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, Schmidt MK, Chang-Claude J, Bojesen SE, Bolla MK, Wang Q, Dicks E, Lee A, Turnbull C, Rahman N, Fletcher O, Peto J, Gibson L, Dos Santos Silva I, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet. 2013;45(4):353-361, 361e1-e2. doi:10.1038/ng.2563. 8. Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R, Wareham N, Ahmed S, Healey CS, Meyer KB, Haiman CA. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447(7148):1087-1093. doi:10.1038/nature05887.Genome- wide. 9. Cai Q, Long J, Lu W, Qu S, Wen W, Kang D, Lee J-Y, Chen K, Shen H, Shen C-Y, Sung H, Matsuo K, Haiman C, Khoo US, Ren Z, Iwasaki M, Gu K, Xiang Y-B, Choi J-Y, et al. Genome-wide association study identifies breast cancer risk variant at 10q21.2: results from the Asia Breast Cancer Consortium. Hum Mol Genet. 2011;20(24):4991-4999. doi:10.1093/hmg/ddr405. 10. Long J, Cai Q, Sung H, Shi J, Zhang B, Choi J-Y, Wen W, Delahanty RJ, Lu W, Gao Y-T, Shen H, Park SK, Chen K, Shen C-Y, Ren Z, Haiman CA, Matsuo K, Kim MK, Khoo US, et al. Genome-wide association study in east Asians identifies novel susceptibility loci for breast cancer. PLoS Genet. 2012;8(2):e1002532. doi:10.1371/journal.pgen.1002532. 11. Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Yu K, Chatterjee N, Orr N, Willett WC, Graham A, Ziegler RG, Berg CD, Buys SS, Catherine A, Feigelson HS, et al.

70

A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 2007;39(7):870-874. doi:10.1038/ng2075.A. 12. Haiman CA, Chen GK, Vachon CM, Canzian F, Millikan RC, Wang X, Ademuyiwa F, Ahmed S, Ambrosone CB, Baglietto L, Balleine R, Bandera E V. A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor–negative breast cancer. Nat Genet. 2012;43(12):1210-1214. doi:10.1038/ng.985.A. 13. Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod S, Olshen AB, Gregersen P, Kosarin K, Olsh A, Bergeron J, Ellis NA, Klein RJ, Clark AG, Norton L, Dean M, Boyd J, et al. Genome-wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A. 2008;105(11):4340-4345. doi:10.1073/pnas.0800441105. 14. Chen F, Chen GK, Stram DO, Millikan RC, Ambrosone CB, John EM, Bernstein L, Zheng W, Palmer JR, Hu JJ, Rebbeck TR, Ziegler RG, Nyante S, Bandera E V, Ingles SA, Press MF, Ruiz-Narvaez EA, Deming SL, Rodriguez-Gil JL, et al. A genome-wide association study of breast cancer in women of African ancestry. Hum Genet. 2013;132(1):39-48. doi:10.1007/s00439-012-1214-y. 15. Fletcher O, Johnson N, Orr N, Hosking FJ, Gibson LJ, Walker K, Zelenika D, Gut I, Heath S, Palles C, Coupland B, Broderick P, Schoemaker M, Jones M, Williamson J, Chilcott-Burns S, Tomczyk K, Simpson G, Jacobs KB, et al. Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl Cancer Inst. 2011;103(5):425-435. doi:10.1093/jnci/djq563. 16. Kim H, Lee J-Y, Sung H, Choi J-Y, Park SK, Lee K-M, Kim YJ, Go MJ, Li L, Cho YS, Park M, Kim D-J, Oh JH, Kim J-W, Jeon J-P, Jeon S-Y, Min H, Kim HM, Park J, et al. A genome-wide association study identifies a breast cancer risk variant in ERBB4 at 2q34: results from the Seoul Breast Cancer Study. Breast Cancer Res. 2012;14(2):R56. doi:10.1186/bcr3158. 17. Siddiq A, Couch FJ, Chen GK, Lindström S, Eccles D, Millikan RC, Michailidou K, Stram DO, Beckmann L, Rhie SK, Ambrosone CB, Aittomäki K, Amiano P, Apicella C, Baglietto L, Bandera E V, Beckmann MW, Berg CD, Bernstein L, et al. A meta-analysis of genome-wide association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Hum Mol Genet. 2012;21(24):5373-5384. doi:10.1093/hmg/dds381. 18. Thomas G, Jacobs KB, Kraft P, Yeager M, Cox DG, Hankinson SE, Hutchinson A, Yu K, Chatterjee N, Gonzalez-Bosquet J, Prokunina-Olsson L, Orr N, Willett WC, Colditz GA, Ziegler RG, Berg CD, Buys SS, Mccarty CA, Feigelson S, et al. A multi-stage genome-wide association in breast cancer identifies two novel risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet. 2009;41(5):579-584. doi:10.1038/ng.353.A. 19. Machiela MJ, Chen C, Liang L, Diver WR, Stevens VL, Tsilidis KK, Haiman C a, Chanock SJ, Hunter DJ, Kraft P. One thousand genomes imputation in the National Cancer Institute Breast and Prostate Cancer Cohort Consortium aggressive prostate cancer genome-wide association study. Prostate. 2013;73(7):677-689. doi:10.1002/pros.22608.

71

20. Pike MC, Kolonel LN, Henderson BE, Wilkens LR, Hankin JH, Feigelson HS, Wan PC, Stram DO, Nomura AMY B C M C H L A R F-adjusted Incidence in Japanese Equals and in Hawaiians Exceeds that in Whites. Cancer Epidemiol Biomarkers Prev. 2002;(11):795-800. 21. Welter D, MacArthur J, Morales J, Burdett T, Hall P, Junkins H, Klemm A, Flicek P, Manolio T, Hindorff L, Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42(Database issue):D1001-D1006. doi:10.1093/nar/gkt1229. 22. Z N P B G T Z E H E L identification of causal variants. Am J Hum Genet. 2010;86(1):23-33. doi:10.1016/j.ajhg.2009.11.016. 23. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135-145. doi:10.1038/nrg3118. 24. P JK C NJ T n disease – common variant Hum Mol Genet. 2002;11(20):2417-2423. 25. Witte JS, Visscher PM, Wray NR. The contribution of genetic variants to disease depends on the ruler. Nat Rev Genet. 2014;15(11):765-776. doi:10.1038/nrg3786. 26. Galarneau G, Palmer CD, Sankaran VG, Orkin SH, Hirschhorn JN, Lettre G. Fine-mapping at three loci known to affect fetal hemoglobin levels explains additional genetic variation. Nat Genet. 2010;42(12):1049-1051. doi:10.1038/ng.707. 27. Lindström S, Ablorh A, Chapman B, Gusev A, Chen G, Chen C, Eliassen HA, Price A, Henderson B, Marchand L Le, Gabriel S, Hofmann O, Haiman CA, Kraft P. Deep Targeted Sequencing of 12 Breast Cancer Susceptibility Regions in 4,611 Women Across Four Different Ethnicities. 2015; In preparation. 28. Spiegelman D, Colditz GA, Hunter D, Hertzmark E. Validation of the Gail et al. model for predicting individual breast cancer risk. J Natl Cancer Inst. 1994;86(8):600-607. http://www.ncbi.nlm.nih.gov/pubmed/8145275. 29. Eliassen AH, Tworoger SS, Mantzoros CS, Pollak MN, Hankinson SE. Circulating insulin and c-peptide levels and risk of breast cancer among predominately premenopausal women. Cancer Epidemiol Biomarkers Prev. 2007;16(1):161-164. doi:10.1158/1055-9965.EPI-06-0693. 30. Mensah-Ablorh A, Lindstrom S, Haiman CA, Henderson BE, Marchand L Le, Lee S, Stram DO, Eliassen AH, Price A, Kraft P. Meta-analysis of rare variant association tests in multi-ethnic populations. Genet Epidemiol. 2015;Submitted. 31. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190. doi:10.1371/journal.pgen.0020190. 32. Zhang X, Tworoger SS, Eliassen AH, Hankinson SE. Postmenopausal plasma sex hormone levels and breast cancer risk over 20 years of follow-up. Breast Cancer Res Treat. 2013;137(3):883-892. doi:10.1007/s10549-012-2391-z.

72

33. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31-46. doi:10.1038/nrg2626. 34. Auwera GA Van Der, Carneiro MO, Hartl C, Poplin R, Angel G, Levy-moonshine A, Shakir K, Roazen D, Thibault J, Banks E, Garimella K V, Altshuler D, Gabriel S, Depristo MA. From FastQ Data to High- C V C T G A T B P P Curr Protoc Bioinforma. 2013;11(October):1-33. doi:10.1002/0471250953.bi1110s43. 35. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo M a. The Genome Analysis Toolkit: a MapReduce framework for analyzing next- generation DNA sequencing data. Genome Res. 2010;20(9):1297-1303. doi:10.1101/gr.107524.110. 36. DePristo MA, Banks E, Poplin R, Garimella K V, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011;43(5):491-498. doi:10.1038/ng.806. 37. Paila U, Chapman BA, Kirchner R, Quinlan AR. GEMINI: integrative exploration of genetic variation and genome annotations. PLoS Comput Biol. 2013;9(7):e1003153. doi:10.1371/journal.pcbi.1003153. 38. Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Ruden DM, Lu X. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of D -2; iso-3. Fly (Austin). 2012;6(2):1-13. http://snpeff.sourceforge.net/SnpEff_paper.pdf. 39. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248-249. doi:10.1038/nmeth0410-248. 40. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics. 2010;26(16):2069-2070. doi:10.1093/bioinformatics/btq330. 41. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82-93. doi:10.1016/j.ajhg.2011.05.029. 42. Evangelou E, Ioannidis JPA. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet. 2013;14(6):379-389. doi:10.1038/nrg3472. 43. Lee, S, Teslovich, TM, Boehnke, M and Lin X. General Framework for Meta-Analysis of Rare Variants in Sequencing Association Studies. Am J Hum Genet. 2013;93(1):42-53. 44. Sabatti C, Service S, Freimer N. False Discovery Rate in Linkage and Association Genome Screens for Complex Disorders. Genetics. 2003;833(June):829-833.

73

45. Liefhebber JMP, Punt S, Spaan WJM, Leeuwen HC Van. The human collagen beta ( 1-O ) galactosyltransferase , GLT25D1 , is a soluble localized protein. BMC Cell Biol. 2010;11(33). 46. Grimwood J, Gordon LA, Olsen A, Terry A, Schmutz J, Lamerdin J, Hellsten U, Goodstein D, Couronne O, Tran-gyamfi M, Aerts A, Altherr M, Ashworth L, Bajorek E, Black S, Branscomb E, Caenepeel S, Carrano A, Caoile C, et al. The DNA sequence and biology of human chromosome 19. Nature. 2004;428(April):529- 535. 47. Hnasko R, Carter JM, Medina F, Frank PG, Lisanti MP. PV-1 Labels Trans-Cellular Openings in Mouse Endothelial Cells and is Negatively Regulated by VEGF. Cell Cycle. 2014;5(17):2021-2028. doi:10.4161/cc.5.17.3217. 48. Chen F, Stram DO, Le Marchand L, Monroe KR, Kolonel LN, Henderson BE, Haiman CA. Caution in generalizing known genetic risk markers for breast cancer across all ethnic/racial populations. Eur J Hum Genet. 2011;19(2):243-245. doi:10.1038/ejhg.2010.185. 49. Pirinen M, Donnelly P, Spencer CCA. Including known covariates can reduce power to detect genetic effects in case-control studies. Nat Genet. 2012;44(8):848-851. doi:10.1038/ng.2346. 50. Setiawan VW, Monroe KR, Wilkens LR, Kolonel LN, Pike MC, Henderson BE. Breast cancer risk factors defined by estrogen and progesterone receptor status: the multiethnic cohort study. Am J Epidemiol. 2009;169(10):1251-1259. doi:10.1093/aje/kwp036. 51. Ahmadiyeh N, Pomerantz MM, Grisanzio C, Herman P, Jia L, Almendro V. 8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with MYC. PNAS. 2010;107(21):9742-9746. doi:10.1073/pnas.0910668107/-/DCSupplemental.www.pnas.org/cgi/doi/10.1073/pnas.0910668107. 52. French JD, Ghoussaini M, Edwards SL, Meyer KB, Michailidou K, Ahmed S, Khan S, Maranian MJ, O’Reilly M, Hillman KM, Betts JA, Carroll T, Bailey PJ, Dicks E, Beesley J, Tyrer J, Maia A-T, Beck A, Knoblauch NW, et al. Functional variants at the 11q13 risk locus for breast cancer regulate cyclin D1 expression through long-range enhancers. Am J Hum Genet. 2013;92(4):489-503. doi:10.1016/j.ajhg.2013.01.002. 53. Farh KK, Marson A, Zhu J, Kleinewietfeld M, Housley WJ, Beik S, Shoresh N, Whitton H, Ryan RJH, Shishkin AA, Hatan M, Carrasco-alfonso MJ, Mayer D, Luckey CJ, Patsopoulos NA, Jager PL De, Kuchroo VK, Epstein CB. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 2014;518(7539):337-343. doi:10.1038/nature13835. 54. Palmer JR, Ruiz-Narvaez EA, Rotimi CN, Cupples LA, Cozier YC, Adams-Campbell LL, Rosenberg L. Genetic susceptibility loci for subtypes of breast cancer in an African American population. Cancer Epidemiol Biomarkers Prev. 2013;22(1):127-134. doi:10.1158/1055-9965.EPI-12-0769. 55. Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A, Crenshaw A, Cancel-Tassin G, Staats BJ, Wang Z, Gonzalez-Bosquet J, Fang J, Deng X, Berndt SI, Calle EE, et

74

al. Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet. 2008;40(3):310-315. doi:10.1038/ng.91. 56. Figueroa JD, Garcia-Closas M, Humphreys M, Platte R, Hopper JL, Southey MC, Apicella C, Hammet F, Schmidt MK, Broeks A, Tollenaar RAEM, Van’t Veer LJ, Fasching P a, Beckmann MW, Ekici AB, Strick R, Peto J, dos Santos Silva I, Fletcher O, et al. Associations of common variants at 1p11.2 and 14q24.1 (RAD51L1) with breast cancer risk and heterogeneity by tumor subtype: findings from the Breast Cancer Association Consortium. Hum Mol Genet. 2011;20(23):4693-4706. doi:10.1093/hmg/ddr368. 57. Stacey SN, Manolescu A, Sulem P, Rafnar T, Gudmundsson J, Gudjonsson SA, Masson G, Jakobsdottir M, Thorlacius S, Helgason A, Aben KK, Strobbe LJ, Albers-Akkers MT, Swinkels DW, Henderson BE, Kolonel LN, Le Marchand L, Millastre E, Andres R, et al. Common variants on 2q35 and 16q12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet. 2007;39(7):865-869. doi:10.1038/ng2064. 58. Broeks A, Schmidt MK, Sherman ME, Couch FJ, Hopper JL, Dite GS, Apicella C, Smith LD, Hammet F, Southey MC, Van ’t Veer LJ, de Groot R, Smit VTHBM, Fasching PA, Beckmann MW, Jud S, Ekici AB, Hartmann A, Hein A, et al. Low penetrance breast cancer susceptibility loci are associated with specific breast tumor subtypes: findings from the Breast Cancer Association Consortium. Hum Mol Genet. 2011;20(16):3289-3303. doi:10.1093/hmg/ddr228. 59. Ahmed S, Thomas G, Ghoussaini M, Healey CS, Humphreys MK, Platte R, Morrison J, Maranian M, Pooley KA, Luben R, Eccles D, Evans DG, Fletcher O, Johnson N, dos Santos Silva I, Peto J, Stratton MR, Rahman N, Jacobs K, et al. Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat Genet. 2009;41(5):585-590. doi:10.1038/ng.354. 60. Stacey SN, Manolescu A, Sulem P, Thorlacius S, Gudjonsson SA, Jonsson GF, Jakobsdottir M, Bergthorsson JT, Gudmundsson J, Aben KK, Strobbe LJ, Swinkels DW, van Engelenburg KCA, Henderson BE, Kolonel LN, Le Marchand L, Millastre E, Andres R, Saez B, et al. Common variants on chromosome 5p12 confer susceptibility to estrogen receptor-positive breast cancer. Nat Genet. 2008;40(6):703-706. doi:10.1038/ng.131. 61. Milne RL, Goode EL, García-Closas M, Couch FJ, Severi G, Hein R, Fredericksen Z, Malats N, Zamora MP, Arias Pérez JI, Benítez J, Dörk T, Schürmann P, Karstens JH, Hillemanns P, Cox A, Brock IW, Elliot G, Cross SS, et al. Confirmation of 5p12 as a susceptibility locus for progesterone-receptor-positive, lower grade breast cancer. Cancer Epidemiol Biomarkers Prev. 2011;20(10):2222-2231. doi:10.1158/1055-9965.EPI-11- 0569. 62. Zheng W, Long J, Gao Y, Li C, Zheng Y, Xiang Y. Genome-wide association study identifies a novel breast cancer susceptibility locus at 6q25.1. Nat Genet. 2009;41(3):324-328. doi:10.1038/ng.318.Genome-wide. 63. Hein R, Maranian M, Hopper JL, Kapuscinski MK, Southey MC, Park DJ, Schmidt MK, Broeks A, Hogervorst FBL, Bueno-de-Mesquita HB, Bueno-de-Mesquit HB, Muir KR, Lophatananon A, Rattanamongkongul S,

75

Puttawibul P, Fasching PA, Hein A, Ekici AB, Beckmann MW, et al. Comparison of 6q25 breast cancer hits from Asian and European Genome Wide Association Studies in the Breast Cancer Association Consortium (BCAC). PLoS One. 2012;7(8):e42380. doi:10.1371/journal.pone.0042380. 64. Garcia-Closas M, Hall P, Nevanlinna H, Pooley K, Morrison J, Richesson D a, Bojesen SE, Nordestgaard BG, Axelsson CK, Arias JI, Milne RL, Ribas G, González-Neira A, Benítez J, Zamora P, Brauch H, Justenhoven C, Hamann U, Ko Y-D, et al. Heterogeneity of breast cancer associations with five susceptibility loci by clinical and pathological characteristics. PLoS Genet. 2008;4(4):e1000054. doi:10.1371/journal.pgen.1000054. 65. Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, Seal S, Ghoussaini M, Hines S, Healey CS, Hughes D, Warren-Perry M, Tapper W, Eccles D, Evans DG, Hooning M, Schutte M, van den Ouweland A, Houlston R, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet. 2010;42(6):504-507. doi:10.1038/ng.586. 66. Lambrechts D, Truong T, Justenhoven C, Humphreys MK, Wang J, Hopper JL, Dite GS, Apicella C, Southey MC, Schmidt MK, Broeks A, Cornelissen S, Hien R Van, Sawyer E, Tomlinson I, Kerin M, Miller N, Milne RL, Zamora MP, et al. 11q13 Is a Susceptibility Locus for Hormone Receptor. Hum Mutat. 2012;33(7):1123- 1132. 67. Antoniou AC, Wang X, Fredericksen ZS, McGuffog L, Tarrell R, Sinilnikova OM, Healey S, Morrison J, Kartsonaki C, Lesnick T, Ghoussaini M, Barrowdale D, Peock S, Cook M, Oliver C, Frost D, Eccles D, Evans DG, Eeles R, et al. A locus on 19p13 modifies risk of breast cancer in BRCA1 mutation carriers and is associated with hormone receptor-negative breast cancer in the general population. Nat Genet. 2010;42(10):885-892. doi:10.1038/ng.669. 68. Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, Maranian MJ, Bolla MK, Wang Q, Shah M, Perkins BJ, Czene K, Eriksson M, Darabi H, Brand JS, Bojesen SE, Nordestgaard BG, Flyger H, Nielsen SF, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;advance on. doi:10.1038/ng.3242. 69. Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, et al. ENCODE Data in the UCSC Genome Browser: Year 5 update. Nucleic Acids Res. 2013;41(D1):56-63. doi:10.1093/nar/gks1172.

76

Chapter 4

Components of breast cancer heritability in a multi-ethnic targeted sequencing study

Components of breast cancer heritability in a multi-ethnic targeted sequencing study

Akweley Ablorh1,2, Alexander Gusev1,2, Sara Lindström1,2, Brad Chapman3,4, Gary Chen5, Constance Chen1,2, A. Heather Eliassen2,6, Brian E. Henderson5, Loic Le Marchand7, Oliver Hofmann3,4, Christopher A. Haiman5, Peter Kraft1,2,3,** and Alkes Price1,2,3,**

1 Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Boston, MA 2 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 3 Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 4 HSPH Bioinformatics Core, Harvard T.H. Chan School of Public Health, Boston, MA 5 Department of Preventive Medicine, Norris Comprehensive Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles, CA 6 Channing Division of Network Medicine, Brigham & Women’s Hospital and Harvard Medical School, Boston, MA 7 University of Hawai’i Cancer Center, Honolulu, HI ** Both authors contributed equally to this work

Abstract: Much of the heritability of sporadic breast cancer (BC) remains unexplained despite dozens of common, reproducible and genome-wide significant single nucleotide variant (SNV) associations. Variance component methods may explain more trait heritability than top genome-wide association SNV associations alone. We applied variance component methods that considered all SNVs together to determine what proportion of BC heritability can be explained by common and rare variants at 12 GWAS-identified BC loci selected for targeted sequencing in 2,316 cases and 2,295 controls from four self-reported ethnicities. In each ethnicity, we partitioned the heritability explained by rare (minor allele frequency < 1%) or common SNVs. In addition, we partitioned heritability from GWAS index SNVs, broadly-defined DNase Hypersensitivity (DHS), , or coding variants. Common variants from these loci explained a significant amount of breast cancer heritability (5%; SE=2%). We found no statistically significant evidence that rare variants at these loci contributed to breast cancer heritability (- 1%, SE=4%), although this is largely due to lack of power. There was also no statistically significant evidence that the contribution of rare variants was different than common variants. None of the functional classes we examined explained a larger proportion of heritability at these loci than expected. Total breast cancer heritability from all sequenced SNVs from all 12 loci was approximately 11% (S.E.=4%), a substantial portion of breast cancer heritability which ranges from 27% to 32% in European familial studies and presents evidence that just this subset of previously-identified loci explains a substantial portion of heritability which suggests that all GWAS-identified loci together may explain more of the phenotypic variation in breast cancer than the 17% previously reported.

78

Introduction

Dozens of genetic risk factors for breast cancer have been identified in genome-wide association studies1–13 and familial studies of early-onset disease.14 However, breast cancer heritability, the

variability in breast cancer risk due to genetics, has not been fully explained.10,11,13 Possible reasons for

the unexplained variability include overestimates of the true heritability of breast cancer, rare genetic

variants of large effect size, and gene-environment interaction.15–17 Breast cancer, like most complex diseases, appears to be influenced by multiple risk alleles scattered across the genome.18 Additionally missing heritability may also be due to the diffuse or polygenic nature of genetic risk factors.19

Extensive polygenicity characterizes RA Fisher’s “infinitesimal model” of genetic risk.20–22 Under this

model, every genetic variant contributes to disease risk. These normally-distributed effect sizes can vary from large down to nearly imperceptible which would not reach statistical significance. GWAS require stringent p-value thresholds to take the number of tests conducted into account. Although the genetic risk architecture of breast cancer may not be fully infinitesimal, SNVs of modest effect size, low frequency, or both require prohibitively large samples to reach statistical significance and could contribute to unexplained heritability. Secondly, GWAS’ reliance on linkage disequilibrium (LD) to impute nearby variants limits detection of rare and low-frequency causal variants that are not well tagged by genotyped SNPs.

One solution to efficiently model highly polygenic traits is to analyze a group of variants instead of performing single variant association tests. In a study of several traits, Yang et al23 applied a variance component (VC) estimation method to determine the proportion of variance explained by linear combinations of thousands of SNVs. While 50 known single variant associations together explained 5% of variance in human height, testing approximately 300,000 SNVs together explained 45% of phenotypic variance.23

79

LD, a limitation in GWAS studies, can also impact the validity of VC methods since estimates can be

biased by the LD-structure of untyped variants in GWAS. Untyped causal variants that are correlated

with their neighbors can falsely inflate or deflate local estimates of heritability when using array-based methods.20 Robust adjustments have been proposed to address bias due to LD in GWAS studies.20,24

However, in the current study, we applied VC methods to high-quality, high-depth, high-density targeted sequencing data in a multi-ethnic population which more fully captures the array of human genetic diversity in these regions than mono-ethnic studies.25

In addition to estimating total heritability, VC methods can partition variants into categories and reveal insights into the genetic contribution at specific loci, functional classes, or allele frequencies.26

Determining the proportion of heritability due to common as compared to rare variants could have implications for future study design and choice of genotyping methods since GWAS are better powered to detect common than rare variation.21

To further characterize breast cancer heritability at known loci, we estimated the total heritability from sequenced variants and the heritability for several variant classes at 12 GWAS-identified breast cancer loci in a case-control re-sequencing study of a multi-ethnic population.

80

Methods

We refer to sets of variants that may include those with minor allele frequency (MAF) below 1% as single nucleotide variants (SNVs). Variants that occur with non-trivial frequency in the general population are referred to as single nucleotide polymorphisms (SNPs).

Study population

We selected breast cancer cases and controls from the Nurses’ Health Study (NHS)27, Nurses’ Health

Study 2 (NHS2)28 and Multiethnic Cohort of Southern California and Hawaii (MEC)29. Participants

reported one of four ethnicities: Latina, African, European and Japanese American. We excluded 8

duplicate samples, 43 samples with less than 90% concordance with GWAS data, 16 samples with

sequencing call rates under 90%, 5 samples with unexpected ancestry and additional 36 samples

without GWAS data. After exclusions, our total study population consists of 4,611 women. All remaining

participants were previously genotyped in GWAS with Illumina HumanHap arrays. We obtained raw

genotypes from the previous GWAS for European subjects.

Breast cancer case status

NHS and NHS2 participants’ breast cancer case status was reported on a biennial questionnaire and

confirmed by medical record review. Eligible cases had no prior cancer diagnosis. In MEC, incident

breast cancer cases were identified through linkage with cancer registries. Only MEC participants with

no prior breast cancer, endometrial cancer, or ovarian cancer were eligible. The breast cancer case-

control ratio was approximately 1:1 within each of the four ethnicities. Estrogen receptor status (defined

based on review of medical records) was available for a subset of breast cancer cases from each cohort.

(Table 4.1)

81

Table 4.1: Study Population and Sequenced Variants by Ethnicity - Counts (Percent)

African Japanese Latina European All Ethnicities

Participants 937 1,256 907 1,511 4,611 Controls 468 (50%) 609 (48%) 447 (49%) 771 (51%) 2,295 (50%) Cases 469 (50%) 647 (52%) 460 (51%) 740 (49%) 2,316 (50%) ER+ Cases 274 (29%) 359 (29%) 287 (32%) 458 (30%) 1,378 (30%) ER- Cases 117 (12%) 59 (5%) 102 (11%) 121 (8%) 399 (9%) SNVs 66,379 39,761 55,709 56,357 137,650 Non-Singleton SNVs 40,299 (61%) 22,762 (57%) 30,425 (55%) 26,805 (48%) 66,249 (48%)

‘ER+’ is estrogen receptor positive breast cancer cases and ‘ER-‘ is estrogen receptor negative breast cancer cases. ‘SNVs’ refers to the number of variants called after QC. ‘Non-Singleton SNVs’ is the number of variants observed on at least 2 chromosomes after QC. Only non-singleton SNVs are included in subsequent heritability analyses. Column percentages of participants and SNVs are in parentheses.

82

Targeted Sequencing

We chose 12 breast-cancer associated loci for high-depth sequencing using the Illumina HiSeq platform.

Sequencing was targeted to regions stretching from the nearest upstream to the nearest downstream recombination hot spot and is described previously for this population. (Table 4.2) We excluded SNVs with quality scores below 30 or with no reads in more than ten percent of samples. Genotype handling and additional quality control was performed using two software packages, vcftools30 and PLINK31. SNVs

with an allele count greater than one (non-singleton SNVs) in the ethnicity under study were included in

the variance component analysis. The effect of a more stringent minimum minor allele frequency (MAF)

cutoff on standard errors of our estimates was minimal. (Table 4.3) MAF was estimated within each

ethnicity from the study population of cases and controls.

LD pruning was used to eliminate any potential LD bias. We pruned SNPs within a 50-base window that

showed evidence of correlation (variance-inflation factor > 1.5) then shifted the window 5 bases with the “--indep" flag in PLINK software.31

Heritability estimation model

We assumed no gene-by-environment, dominance, or epistatic (gene-gene interaction) effects to estimate narrow-sense heritability (heritability) for several categories of SNVs.

We define “index SNPs” as the published genome-wide significant SNPs that were used to identify the sequenced breast-cancer risk regions; there is one index SNP per region. Index SNP heritability was estimated as = where is the squared correlation between breast cancer case status and 態 態 態 sequenced SNVs月彫津鳥勅掴 at locusデ i彫 as堅沈 estimated堅沈 in this data set. We estimated the total heritability from all sequenced SNVs and from annotation-defined subsets of SNVs based on a linear mixed model approach proposed and described by Yang et al32 and parameterized:

83

Table 4.2: Targeted Regions and Liability-Scale Heritability (and SE) from sequenced SNPs

Index Start - Region SNP Chr End Locus Meta African European Japanese Latina 217856966 - 0.5% -0.3% 2.0% 0.3% 0.7% 1 rs13387042 2 217978765 2q35 (0.6%) (1.3%) (1.4%) (0.8%) (1.8%) 1251287 - 0.2% 1.1% -0.1% 0.2% 2 rs10069690 5 1297162 TERT (1.0%) (2.2%) NA* (1.4%) (1.9%) 55990143 - 0.9% 0.8% 1.8% 1.0% 0.1% 3 rs889312 5 56297954 MAP3K1 (0.8%) (1.9%) (1.5%) (1.4%) (1.4%) 151765174 - 0.5% 0.0% 0.1% 3.2% 0.9% 4 rs2046210 6 152007889 ESR1 (0.7%) (2.0%) (0.8%) (2.1%) (1.9%) 127830818 - 1.4% 5.5% 2.3% 2.5% -1.9% 5 rs1562430 8 128803680 8q24 (1.4%) (4.5%) (2.3%) (2.9%) (2.5%) 64083915 - -0.4% -2.5% -0.5% 1.7% 1.5% 6 rs10995190 10 64481771 ZNF365 (0.7%) (1.8%) (0.8%) (2.3%) (2.3%) 80653082 - 0.0% -1.9% 1.5% 0.1% -1.6% 7 rs704010 10 81126285 ZMIZ1 (1.2%) (3.4%) (2.0%) (2.0%) (2.6%) 123187843 - -1.1% 8.3% 1.1% -1.2% -3.2% 8 rs2981579 10 124064057 FGFR2 (0.8%) (5.0%) (2.0%) (1.1%) (1.6%) 69281227 - -0.5% 8.2% -1.1% 0.3% 0.2% 9 rs614367 11 69540165 11q13 (0.6%) (3.6%) (0.7%) (1.5%) (2.1%) 68236495 - -0.2% 2.2% -0.5% -0.3% 0.8% 10 rs999737 14 69051900 RAD51B (0.5%) (2.6%) (0.9%) (0.7%) (1.9%) 52421917 - -0.4% 1.2% 1.5% -1.0% 3.9% 11 rs3803662 16 52690887 TOX3 (0.6%) (2.6%) (1.9%) (0.7%) (2.9%) 17136590 - -1.6% 0.6% -2.6% 1.5% -0.8% 12 rs8170 19 17849008 MERIT40 (0.9%) (3.7%) (1.1%) (2.5%) (2.5%) Index SNPs 1.4% 2.1% 2.9% 1.1% 2.4% only** Multiple (0.1%) (0.3%) (0.2%) (0.1%) (0.2%) 2.3% GWAS SNPs Multiple ------(1.5%) ------8.3% 25.2% 9.0% 5.7% 2.5% Total Multiple (3.2%) (10.1%) (5.1%) (5.6%) (7.2%) In each region, estimated liability-scale heritability (with SE) for a single VC model of SNVs from each of 12 regions beginning with a meta-analysis of ethnic specific values, “Meta” then separately for each of four ethnicities. GWAS SNPs refers to all array-genotyped SNPs positioned within the 12 regions from a GWAS of European subjects. Chr = Chromosome. Total = Estimated liability-scale heritability from all non-singleton sequenced SNPs over all 12 loci adjusted for 10 ethnic-specific PCs.*: In European American subjects, genetic related matrix was non-invertible. **:Due to excessive missingness, “Index SNPs only” excludes rs2046210 for all subjects, and in Japanese Americans, excludes rs999737 since it was monomorphic in these study participants. Bold point estimates and SE are significant at alpha=0.05.

84

Table 4.3: Counts of SNVs (M) in sequencing data by minimum MAF and ethnicity

Cohort Ethnicity None 0.0006 0.001 0.003 0.005 0.01 0.02 0.05

African 66,379 40,299 40,296 25,997 22,131 18,086 14,449 9,839 MEC Japanese 39,761 22,762 17,836 12,168 10,757 9,409 8,277 6,825 Latina 55,709 30,425 30,424 16,424 13,352 10,754 8,959 7,228 NHS European I & II 56,357 26,805 18,108 13,245 11,875 10,258 8,781 6,967 All Any 137,650 66,249 57,394 33,333 27,558 21,838 17,120 11,681

Counts from three American cohorts of sequenced SNVs that passed quality control.

85

= + + +

Here y is a 0,1 vector of participants’ disease status, is the overall mean of phenotype values, 1N is a vector of ones with length N. Optionally, covariates can be included as , an N by t dimensional matrix for t covariates. Elements of the vector g are a product of , the causal effect at locus m, and , an individual i's normalized genotype at that locus (0, 1, or 2 alleles centered and scaled by the sample mean and variance at locus m). This product can also be summed across M causal loci: =

making the model flexible enough to accommodate any number of SNVs in one or more variance components.

Phenotypic variance

The additive genetic effects, gi, are normally-distributed with mean zero and variance = . Thus, phenotypic variance can be estimated from the random effect g and sampling variability in the equation

above as ( ) = + where is an identity matrix and is the genetic relationship between pairs of individuals at causal loci. We approximate the genetic relationships we would obtain had we known and used all causal SNVs to calculate = with A, a matrix whose off-diagonal entries are:

1 ( 2 )( 2 ) = 2 (1 ) At site m, the allele frequency, pm is estimated from the sample and the raw genotype count xim is 0,1, or

2 for individual i. Using these variance components, we estimate heritability on the observed scale from genotyped SNVs as = , the proportion of total phenotypic variance due to additive genetic , effects.19,32

Liability scale heritability

86

We used the reported lifetime breast-cancer risk (K) for breast cancer of 12% 33 to convert heritability from the observed, case-control scale to a liability scale that can be compared across traits or populations.19

(1 ) (1 ) = , (1 ) 態 態 計 伐 計 計 伐 計 月直 月直 寵寵 態 Here, is the height of a standard normal curve at which穴 a鶏 truncation伐 鶏 point (z) leaves 12% of the area to

the right穴 of z and 88% of the area to the left of z and P is the proportion of cases in the case-control study population. This conversion simultaneously transforms , from the observed to the liability 態 直 寵寵 scale with the first proportion and adjusts for the excess number月 of cases due to ascertainment with the second proportion.19

All analyses were conducted stratified by ethnicity. Overall, cross-ethnicity heritability was estimated

using fixed-effect meta-analysis. To limit the confounding due to population stratification, we performed

both crude and adjusted analyses using the software GATK (the Genome Analysis Toolkit)34. Adjusted estimates included the first 10 ethnic-specific PCs.

Enrichment and depletion of heritability

We annotated sequenced regions using the six non-overlapping categories described by Gusev et al26.

SNVs that could fit multiple categories were labeled as: 1) Coding, 2) UTR, 3) promoter, 4) Dnase hypersensitivity (DHS) in any cell-type, 5) intronic, or if no other label applied, 6) intergenic. For example, a ‘Coding’ SNV may fall in a promoter but DHS variants do not include coding, UTR, or promoter regions. We obtained additional annotations from the publicly available ENCODE project site.35

87

Using estimated heritability for a group of SNVs, we investigated whether some categories of SNVs

contributed relatively large or small amounts of heritability. Assuming each normalized genotype contributes equally to the phenotypic variance explained, fold-enrichment for SNVs in category C is:

, / = 態 / 態 月直 寵 月直 繋 寵 Where , is the heritability estimated for SNV category警 警 C and is the heritability estimated for all 態 態 月直 寵 月直 sequenced SNVs. MC is the variant count for SNV group C and M is the variant count for all sequenced

SNVs. Depletion is defined similarly except MC represents the count of all SNVs outside SNV category C and , represents the estimated heritability from SNVs not in SNV category C. We obtained Z scores to 態 直 寵 test 月significance of both fold-enrichment and fold-depletion as described by Gusev et al.26

Other metrics of heritability

We compared heritability from sequenced SNVs, to other metrics of heritability reported in the 態 直 literature. These include familial relative risk and 月pedigree studies. Twin studies, described previously,22,36,37 are a special-case of pedigree studies and compare co-occurrence of phenotypic traits in monozygotic and dizygotic twins. They were preferred since estimates using closer relations have lower sampling errors. Generally, twin studies serve as a standard comparator since they estimate genome-wide additive effects with no single-variant model specification or genotyping required.

( ) Another metric, proportion of familial relative risk explained, = , measures the proportion of ( ) 狸誰巽 碇虹 懸 狸誰巽 碇轍 overall familial relative risk ( = 2)38,39 that is explained by the relative risk at genotyped loci, .

待 直 膏 = exp[ log( )] 膏

膏直 懸茅 膏待

88

Familial relative risk, , can be combined with lifetime incidence, K, to obtain an estimate of heritability

直 on the liability scale and膏 is described below and previously.40 As mentioned above, is the standard normal quantile for the cumulative probability 1 K. 権

伐 2 × [ 1 ( )(1 / )] = 態 態 直 + ( )直 態 権 伐 権 謬 伐 権 伐 権 伐 権 件 直 態 月 直 Here, = / and is the height of a standard normal件 権 件curve 伐 権 at threshold z. The second threshold, z ,

直 marks件 the 穴cumulative計 穴 probability 1 K for a standard normal distribution. Thus, we estimate the

直 liability scale breast cancer heritability伐膏 explained by significant GWAS SNPs using =33%, K=12% and

直 =2 to be =17%. 膏 態 膏待 月弔調凋聴

89

Results

Total Heritability

We tested the variance in breast cancer case status explained due to additive genetic effects from all

genotyped variants. Total heritability due to sequenced SNVs at 12 breast-cancer associated loci

( ) was 11% [SE=4.2%; 95% CI: 3%-19%]. Estimates of were noticeably lower for Latina and 態 態 直 直 Japanese月 American than for European and particularly African月 American samples. (Table 4.2).

We parsed heritability according to the 12 targeted loci. However, the contribution to heritability at any one of the 12 loci was not significantly different from zero in any ethnicity. In African Americans, region

8 (FGFR2) and region 9 (11q13) contributed the largest amount of heritability but neither estimate was significant after Bonferroni correction for the number of regions. The largest local estimates were in

African Americans. (Table 4.2)

Index SNP Heritability

Total index SNP heritability was small but appreciable for each of the ethnicities separately and 態 when meta-analyzed. One index月彫津鳥勅掴 SNP (rs999737) was monomorphic in Japanese American study participants. We also estimated , from the subset of GWAS SNPs positioned within the 態 弔調凋聴 痛銚追直勅痛勅鳥 boundaries of the 12 targeted regions月 and genotyped in a GWAS of European American participants in this study. For Europeans, the contribution from the 12 index SNPs alone and from genotyped GWAS

SNPs from these regions was comparable. ( , = 2.3%; =2.9%; Table 4.2) 態 態 弔調凋聴 痛銚追直勅痛勅鳥 彫津鳥勅掴 We conditioned on the 12 index SNPs as fixed月 effects to estimate heritability月 from non-index SNPs at targeted loci. The contribution to breast cancer heritability of all sequenced SNVs combined after adjusting for 12 index SNPs was not significant in European American participants (0% SE=6%) or in meta-analysis of all study participants (7%, SE=4%; Table 4.4). However, all remaining sequenced SNVs

90

Table 4.4: Conditional Analysis of all Sequenced SNPs

2 2 Cases Ethnicity h g SE(h g) z-score 2-sided P-value All African 0.25 0.12 2.05 * 4.1E-02 European 0.00 0.06 -0.06 9.6E-01 Japanese 0.12 0.08 1.44 1.5E-01 Latina 0.05 0.11 0.48 6.3E-01 All All (Meta) 0.07 0.04 1.64 1.0E-01 Non-European 0.13 0.06 2.24 * 2.5E-02 ER+ only African 0.30 0.15 1.97 * 4.8E-02 European -0.02 0.08 -0.32 7.5E-01 Japanese 0.16 0.11 1.51 1.3E-01 Latina 0.24 0.13 1.79 7.3E-02 ER+ only All (Meta) 0.10 0.05 1.94 5.3E-02 Non-European 0.22 0.07 2.96 * 3.0E-03

Liability-scale heritability estimated from LD Pruned SNVs and adjusted for 10 ancestry PCs (continuous) and up to 11 index SNP genotypes (categorical). Due to excessive missingness, index SNPs exclude rs2046210 for all subjects, and in Japanese Americans, exclude rs999737 since it was monomorphic in these study participants.

91

explained a significant (alpha=0.05) amount of both breast cancer heritability and ER+ breast cancer

heritability in meta-analysis of non-European American participants (BC: 13%, SE=6%; and ER+ BC: 22%,

SE=7%) after conditioning on index SNPs. (Table 4.4)

Rare and Common Variant Heritability

Common variants explained a significant amount of heritability, , = 5% [95% CI: 1.1%-8.1%]. 態 直 寵墜陳陳墜津 However common variants did not explain as much heritability as月 all sequenced SNVs. There was no evidence of enrichment for common SNV heritability compared to the null. (Table 4.5)

Point estimates of rare variant heritability were not significantly different from zero despite large point estimates in some groups (Japanese: 12%, SE=10%; African: 11%, SE=11%; Table 4.5) and standard errors were larger for rare than common variant variance components. Although some point estimates appeared higher or lower, there was no statistical evidence of heterogeneity between the four ethnicities.

Functional Annotations

We modeled subsets of SNVs according to functional annotation categories in separate variance components and did not observe functional enrichment of annotation categories. (Table 4.6) However, standard errors were relatively large and only the non-annotation variance components, VCs consisting of the remaining combination of annotation categories and, potentially, more SNVs than any one category, had heritability estimates significantly greater than zero. For example, the non-intron VC

(consisting of coding, UTR, promoter, DHS and uncategorized SNVs together) explained significant heritability when meta-analyzed across all four ethnicities but there was no evidence of intron depletion, that non-intron heritability was higher than expected. Each VC which consisted of DHS and at

92

2 Table 4.5: Rare and Common Variant Breast Cancer Heritability (h g)

Point Proportion Comparison Ethnicity SE Enrichment Estimate of SNVs Meta 0.05 0.02 0.33 1.3

Common African American 0.12 0.06 0.45 1.0 (MAF > 0.01) European American 0.07 0.03 0.38 2.0 Japanese American 0.02 0.03 0.41 0.4

Latina 0.02 0.04 0.35 1.5

Meta -0.01 0.04 0.67 -0.2 Rare African American 0.11 0.12 0.55 0.7 (0.006 < MAF < 0.01) European American -0.11 0.07 0.62 -1.9 Japanese American 0.12 0.10 0.59 1.9

Latina 0.00 0.10 0.65 0.1

Meta 0.05 0.05 1.0 --- Total - Sum of African American 0.23 0.13 1.0 --- 2 component model European American -0.04 0.07 1.0 --- Japanese American 0.14 0.10 1.0 ---

Latina 0.02 0.11 1.0 ---

Meta 0.11 0.04 1.0 --- Total - African American 0.26 0.12 1.0 --- 1 component model European American 0.09 0.07 1.0 --- Japanese American 0.11 0.08 1.0 ---

Latina 0.04 0.10 1.0 ---

LD-pruning was applied to all estimates. Liability scale heritability estimates for non-singleton variants from 2 variance component (VC) model with common and rare component with 10 ethnic-specific principal components as fixed effects. Where noted, LD-pruned, non-singleton SNVs are included in a single VC model.

93

Table 4.6: Heritability in Coding and DHS Regions

Annotation Ethnicity h2ANNOT(SE) h2NON(SE) SNVsANNOT Depletion Coding Meta-Analysis -0.02 (0.01) 0.08 (0.03) 1,337 0.8 African -0.02 (0.04) 0.26 (0.10) 735 1.0 European -0.03 (0.01) 0.11 (0.05) 545 1.1 Japanese 0.03 (0.03) 0.03 (0.06) 454 0.3 Latina 0.01 (0.03) 0.02 (0.07) 582 0.6 UTR Meta-Analysis -0.02 (0.01) 0.08 (0.03) 1,004 0.8 African -0.02 (0.03) 0.30 (0.11) 592 1.1 European -0.03 (0.01) 0.11 (0.05) 402 1.2 Japanese 0.03 (0.02) 0.02 (0.05) 326 0.2 Latina -0.03 (0.02) 0.05 (0.08) 460 1.2 Promoter Meta-Analysis -0.02 (0.01) 0.10 (0.03) 2,848 0.9 African -0.03 (0.04) 0.28 (0.11) 1,698 1.1 European -0.02 (0.02) 0.10 (0.05) 1,201 1.1 Japanese -0.01 (0.02) 0.06 (0.06) 942 0.6 Latina -0.03 (0.03) 0.04 (0.08) 1,341 1.2 DHS (Trynka) Meta-Analysis 0.03 (0.05) 0.05 (0.05) 21,313 0.7 African 0.09 (0.14) 0.16 (0.16) 12,816 0.9 European 0.03 (0.07) 0.06 (0.08) 8,323 1.0 Japanese 0.02 (0.08) 0.04 (0.09) 6,995 0.6 Latina 0.05 (0.13) -0.02 (0.13) 9,565 -0.8 Intron Meta-Analysis 0.0 (0.02) 0.08 (0.03) 33,432 1.5 African 0.05 (0.07) 0.20 (0.08) 20,073 1.5 European -0.01 (0.03) 0.10 (0.05) 13,494 2.2 Japanese 0.0 (0.04) 0.05 (0.05) 11,476 1.0 Latina 0.01 (0.05) 0.01 (0.06) 15,233 0.5

Adjusted, liability scale heritability from annotations in a 2 VC model. SNVs for the annotation (h2ANNOT) comprised the first component and the remaining SNVs comprised the second component (h2NON).

SNVsANNOT is the number of SNVs in that annotation category observed for participants in each ethnicity.

*Coding and non-Coding GRM did not converge for European American subjects.

94

least one other category explained significant heritability, and among non-annotation VCs, only the non-

DHS VC (consisting of coding, UTR, promoter, intron and uncategorized SNVs together) was statistically indistinguishable from zero in meta-analysis. Like DHS, separate analysis of consensus (all cell types)

H3K4me1 and breast myoepithelial cell H3K4me1 was suggestive but not significant. (Table 4.7)

ER-positive (ER+) breast cancer heritability

We assume that total phenotypic variance is estimated with high precision. Since the genetic risk factors for ER+ breast cancer may differ from those of other breast cancer subtypes, we examined heritability of

ER+ breast cancer in this sample and found higher estimates of total heritability from sequenced SNVs, and from rare and common heritability for most ethnicities. (Table 4.8) Estimated heritability from rare

2 SNVs in African Americans was 29% (SE=15%). Nonetheless, meta-analysis of h g,Rare for ER+ breast

2 cancer was only significant when European American subjects were excluded. (ER+ h g,Rare,-Eur= 15% [SE:

5%]; Table 4.8)

95

Table 4.7: Heritability by Functional Annotation

2 2 h ANNOT h NON Ethnicity Annotation SNVANNOT (SE) (SE) Depletion Factor Binding - Meta ENCODE 18,071 -0.02 (0.04) 0.10 (0.04) 1.7 All cell types - H3K4me1 (Trynka) 24,816 0.03 (0.04) 0.05 (0.04) 0.9 Conserved LindbladToh 4,531 0.00 (0.03) 0.08 (0.04) 1.0 Breast Myoepithelial Cells -H3K4me1 10,662 0.02 (0.03) 0.06 (0.04) 0.8 Stomach Smooth Muscle - H3K4me1 7,226 -0.01 (0.02) 0.08 (0.03) 1.1 Stomach Smooth Muscle peaks - H3K4me3 4,322 -0.01 (0.01) 0.08 (0.03) 1.1 Binding - African ENCODE 10,735 0.01 (0.12) 0.24 (0.14) 1.2 All cell types - H3K4me1 (Trynka) 14,812 -0.04 (0.13) 0.29 (0.15) 1.8 Conserved LindbladToh 2,586 -0.12 (0.07) 0.34 (0.12) 1.4 Breast Myoepithelial Cells - H3K4me1 6,348 0.03 (0.09) 0.22 (0.12) 1.0 Stomach Smooth Muscle - H3K4me1 4,218 -0.07 (0.06) 0.31 (0.11) 1.3 Stomach Smooth Muscle peaks - H3K4me3 2,515 -0.03 (0.04) 0.27 (0.1) 1.1 Transcription Factor Binding - European ENCODE 7,109 -0.01 (0.06) 0.09 (0.07) 1.4 All cell types - H3K4me1 (Trynka) 9,709 0.09 (0.07) 0.00 (0.07) 0.0 Conserved LindbladToh 1,710 0.02 (0.04) 0.08 (0.06) 0.9 Breast Myoepithelial Cells - H3K4me1 4,116 0.04 (0.04) 0.05 (0.06) 0.7 Stomach Smooth Muscle - H3K4me1 2,915 0.03 (0.03) 0.07 (0.05) 0.8 Stomach Smooth Muscle peaks - H3K4me3 1,719 0.01 (0.02) 0.08 (0.05) 0.9 Transcription Factor Binding - Japanese ENCODE 5,928 -0.05 (0.06) 0.1 (0.08) 1.3 All cell types - H3K4me1 (Trynka) 8,267 0.01 (0.07) 0.04 (0.07) 0.6 Conserved LindbladToh 1,479 0.06 (0.05) 0.01 (0.07) 0.1 Breast Myoepithelial Cells - H3K4me1 3,581 0.04 (0.05) 0.02 (0.06) 0.2 Stomach Smooth Muscle - H3K4me1 2,394 0.01 (0.04) 0.05 (0.06) 0.5 Stomach Smooth Muscle peaks - H3K4me3 1,459 0.00 (0.02) 0.06 (0.06) 0.6 Transcription Factor Binding - Latina ENCODE 8,218 0.01 (0.11) 0.02 (0.11) 0.6 All cell types - H3K4me1 (Trynka) 11,251 -0.03 (0.10) 0.05 (0.10) 2.0 Conserved LindbladToh 1,894 -0.02 (0.06) 0.05 (0.10) 1.5 Breast Myoepithelial Cells - H3K4me1 4,821 -0.03 (0.06) 0.06 (0.09) 1.8 Stomach Smooth Muscle - H3K4me1 3,306 -0.06 (0.04) 0.07 (0.08) 2.1 Stomach Smooth Muscle peaks - H3K4me3 1,986 -0.04 (0.02) 0.04 (0.07) 1.1

Liability scale heritability estimates for non-singleton SNVs from 2 VC model. One component had SNVs from annotation (ANNOT) and the second had SNVs outside indicated annotation (NON). SNVANNOT is the number of SNVs observed for subjects of an ethnicity in that annotation category.

96

Table 4.8: ER-Positive Breast Cancer heritability

Proportion Comparison Ethnicity Point Estimate SE Enrichment of SNVs Common (MAF > 0.01) Meta 0.06 0.02 0.33 1.2 African 0.07 0.07 0.45 0.5 European 0.06 0.03 0.38 2.6 Japanese 0.04 0.04 0.41 0.7 Latina 0.11 0.06 0.35 1.1 Meta 0.01 0.05 0.67 0.1 Rare (MAF ) African 0.25 0.15 0.55 1.6 European -0.13 0.08 0.62 -3.7 Japanese 0.10 0.13 0.59 1.2 Latina 0.13 0.13 0.65 0.8 Meta 0.08 0.06 1.0 --- Total - Sum of 2 VCs African 0.31 0.16 1.0 --- European -0.07 0.08 1.0 --- Japanese 0.15 0.13 1.0 --- Latina 0.24 0.13 1.0 --- Meta 0.15 0.05 1.0 --- Total - 1 VC model African 0.29 0.15 1.0 --- European 0.06 0.08 1.0 --- Japanese 0.15 0.10 1.0 --- Latina 0.27 0.13 1.0 ---

LD-pruning was applied to all estimates. Liability scale heritability estimates for non-singleton variants from 2 variance component (VC) model with common and rare component with 10 ethnic-specific pcs as fixed effects. Where noted, LD-pruned, non-singleton SNVs are included in a single VC.

97

Discussion

We characterized the total heritability and contribution of several classes of genetic variation to breast cancer heritability at 12 breast cancer GWAS-identified loci using high-depth, high-density sequencing data in a multi-ethnic case-control study. Common variants at these GWAS-identified loci contribute

2 significantly to breast cancer heritability. Estimated total h g in these regions was non-significant for

European American participants. However, meta-analysis of four ethnicities allowed us to capture a large and significant portion of total breast cancer heritability which was estimated to be 27% [95% 態 CI: 4%-41%] and in a young cohort, 32% from Northern European月 twin studies.41,42

Sequenced variants explained additional breast cancer heritability beyond that obtained from 12 index

SNPs in non-European American participants and marginally, additional ER+ breast cancer heritability in the full study population. (Table 4.4) This suggests that, relative to a model including GWAS-significant index SNPs, there may be multiple susceptibility variants in these regions and, by including all sequenced variants, we tagged or possibly modeled susceptibility variants better. The contribution of non-genome wide significant SNVs to complex trait heritability is important to consider.

Although rare variant breast cancer heritability was not significant, this may reflect lack of statistical power more than lack of rare variant heritability. Rare genetic variation contributed meaningfully to the heritability of another complex disease: prostate cancer.43 Breast cancer and prostate cancer are each the highest-incident cancer in their respective gender, share some genetic risk factors44,45 and may have similar genetic architecture. Both breast cancer and prostate cancer have a two-fold familial relative risk.38,39 For prostate cancer, 33% of familial relative risk is explained by 100 GWAS-identified loci46 and for breast cancer a third of familial risk is explained by 70 GWAS-identified loci.13

Approximately 40% of prostate cancer heritability was due to rare alleles in the African, European,

Japanese, and Latino American cohort.43 The prostate cancer study had more participants and a larger

98

proportion of those participants were African American. Lower sample sizes in the current study and a

higher effective number of rare relative to common SNVs contributed to large standard errors for

heritability estimates (Figure 4.1) and the current study was not sufficiently powered to confirm that rare variation contributes to breast cancer. Like the prostate cancer study,43 point estimates for rare variant heritability were higher in African American and Japanese American subjects than European

American subjects. (Table 4.9)

No significant breast cancer heritability was explained by any one functional annotation. However,

standard errors were large enough that this cannot be considered evidence against the biological

relevance of these categories. Variants classified as DHS contributed to significant heritability estimates

when combined with other categories.

Another consideration in interpreting our findings is that variance partioning methods are subject to LD

bias. Unanticipated LD between causal variants can bias local estimates of heritability from genome- wide SNPs.20,24 Gusev et al20 noted severe deflation of local estimates of heritability in the presence of low-frequency causal variants and moderate inflation when common causal variants were genotyped in a GWAS of a simulated trait mimicking the genetic architecture of height. This bias stems from LD between causal SNPs and has larger magnitude for imputed GWAS data than for genotyped GWAS data for several traits.20,24 No imputation was applied in the current study which limits the magnitude of this bias.

LD bias has not been explored in a sequencing study of this size and, we hypothesized, would be countered by high-density sequencing data. However, directionality of anticipated LD bias is consistent with what we observed for estimated rare and common variant heritability in European samples without

LD pruning. (Table 4.10) African American samples which tend to have shorter haplotype

99

Figure 4.1: Theoretical whole genome SE and observed SE for observed-scale heritability estimates

Theoretical whole genome SE and observed SE for observed-scale heritability estimates

Dashed line above - Theoretical standard error for a genome- S

2 Size). Heritability estimate (h g ) SE is a function of both sample size and the number of SNPs. SE falls with both increasing sample size and more stringent ethnicity-specific minimum MAF thresholds.

100

Table 4.9: Heritability Estimation in Two Independent Samples

Prostate Cancer Breast Cancer

Population: Meta African European Japanese Latino Meta African European Japanese Latina 2 h INDEX 0.07 0.05 0.1 0.09 0.08 0.02 0.03 0.03 0.01 0.03 2 SE (h INDEX) (<0.01) 0.01 0.02 0.02 0.02 (<0.01) (<0.01) (<0.01) (<0.01) (<0.01) 2 h RARE 0.07 0.12 0.03 0.09 0.02 -0.06 0.11 -0.08 0.04 -0.05 2 SE (h RARE) 0.03 0.04 0.06 0.08 0.06 0.02 0.11 0.03 0.08 0.08 2 h COMMON 0.16 0.16 0.29 0.14 0.16 0.04 0.12 0.06 0.02 0.03 2 SE (h COMMON) 0.02 0.03 0.06 0.04 0.05 0.02 0.06 0.03 0.03 0.04 Rare SNVs --- 58,699 33,606 29,121 46,374 --- 28,382 13,180 12,905 26,066 Common SNPs --- 63,972 53,164 40,742 45,932 --- 25,731 16,111 14,873 16,905 Sample Size 9,237 4,006 1,753 1,770 1,708 4,611 937 1,511 1,256 907 Loci Targeted 93 12 MB 11.3 5.5 sequenced Familial RR 38,39 2 2 .0) Lifetime 0.16 0.12 Incidence47 21% 17%

2匝 42% 27% h札撒冊傘 酸 (0.29- (0.04- (95% CI) 0.50) 0.41)

All h2 estimates are reported on the liability scale in a prostate cancer study43 on the left and the current study on the right. In both the prostate cancer study and the current breast cancer study, 10 ancestry principal components were included as fixed effects. In addition, Mancuso, Rohland and colleagues included age as a fixed effect.43 Meta = weighted average of heritability estimates from the four

2 2 ethnicities. h INDEX = heritability estimated from index SNPs; h RARE = heritability estimated from SNVs with 2 MAF between 0.1% and 1% in the indicated population; h COMMON = heritability estimated from SNPs with 2 2 MAF greater than 1% in the indicated population; .0 = Familial relative risk; h GWAS = see methods; h = heritability estimated from cohorts of twins from Sweden, Denmark, and Finland.41

101

2 Table 4.10: Crude, Adjusted, and LD-pruned estimates of Rare, Common and Total hg in Single variance- component analysis

Rare [0.6% - 1%) Common MAF Total

Ethnicity , ( , ) , ( , , ( , ) 匝 匝 匝 匝 匝 匝 African American 酸賛 司珊司蚕 傘撮 酸賛 司珊司蚕 酸賛 算伺仕仕伺仔 傘撮 酸賛 算伺仕仕伺仔 酸賛 嗣伺嗣珊残 傘撮 酸賛 嗣伺嗣珊残 Crude 10.5% 11.0% 12.7% 5.7% 23.8% 9.9% Adjusted for 10 PCs 13.3% 11.3% 12.8% 5.8% 25.2% 10.1% LD Pruned 10.3% 11.8% 12.8% 6.2% 24.0% 11.4% LD Pruned and 13.5% 12.1% 12.9% 6.3% 26.3% 11.7% Adjusted European American Crude -6.2% 5.1% 5.7% 2.6% 8.4% 5.0% Adjusted for 10 PCs -5.6% 5.4% 5.9% 2.7% 9.0% 5.1% LD Pruned -10.6% 6.5% 6.9% 3.0% 8.8% 6.5% LD Pruned and -10.2% 6.8% 7.1% 3.0% 9.5% 6.7% Adjusted Japanese American Crude 11.2% 9.6% 2.3% 2.5% 6.6% 5.6% Adjusted for 10 PCs 11.0% 9.7% 1.8% 2.4% 5.7% 5.6% LD Pruned 13.2% 9.8% 2.9% 2.9% 11.4% 7.9% LD Pruned and 13.3% 9.9% 2.4% 2.9% 10.6% 7.9% Adjusted Latina Crude -3.0% 7.7% 4.9% 3.7% 8.1% 7.3% Adjusted for 10 PCs -3.8% 7.8% 2.4% 3.7% 2.5% 7.2% LD Pruned 2.0% 10.1% 4.8% 4.1% 10.3% 9.7% LD Pruned and 0.8% 10.3% 2.0% 4.2% 3.8% 10.0% Adjusted Summary (Meta) Crude -1.1% 3.7% 4.7% 1.6% 9.3% 3.1% Adjusted for 10 PCs -0.5% 3.8% 4.1% 1.6% 8.3% 3.2% LD Pruned -0.4% 4.4% 5.5% 1.8% 11.8% 4.2% LD Pruned and 0.3% 4.6% 4.8% 1.8% 11.0% 4.2% Adjusted

102

blocks would be the least vulnerable to this type of bias. By estimating breast cancer heritability in a multi-ethnic cohort, we improved the generalizability and robustness of our findings.

In Europeans, heritability of breast cancer (Table 4.11) from familial studies (h2) was estimated to be

41 10 2 27%. The gap between heritability explained by significant GWAS SNPs (h GWAS) and total heritability

(h2) may be explained at least in part by the substantial increases we observed as we moved from either

GWAS SNPs within the bounds of our targeted regions ( , ), genome-wide significant index 態 月弔調凋聴 痛銚追直勅痛勅鳥 SNPs ( ), or common sequenced SNPs ( , ) to total heritability from all sequenced SNVs in 態 態 彫津鳥勅掴 直 頂墜陳陳墜津 these regions月 ( ). If these gains are recapitulated月 for other breast cancer susceptibility loci, VC 態 直 methods applied月 to resequencing studies of known and suspected susceptibility loci could make substantial progress towards explaining not only the missing heritability of breast cancer but missing heritability for many other complex diseases.

103

Table 4.11: Breast cancer heritability (Percent) in European populations

Metric Global or Local Point Estimate (%) 95% CI (%) Reference

h2 Global 27 4 - 41 41

2 13,40 h GWAS Global 17 NA

2 h GWAS,targeted Local 2.3 -0.6 - 5.2 ----

2 h Index Local 2.9 2.8 - 3 ----

2 h g,Common Local 6 4 - 13 ----

2 h g Local 10 3 - 23 ----

Global estimates are drawn from the literature. Local estimates are estimated from targeted sequencing

2 2 or array-genotyped data in the current study. h = twin study genome-wide estimate; h GWAS =

2 approximated from familial relative risk explained by 70 GWAS-identified loci; h GWAS,Local = contribution from array-genotyped SNPs in European American participants positioned within the bounds of 12

2 2 targeted regions; h g,Common = targeted sequencing SNPs with MAF above 1%; h g = all targeted sequencing SNVs

104

References

1. Easton DF, Pooley KA, Dunning AM, Pharoah PDP, Ballinger DG, Struewing JP, Morrison J, Field H, Luben R,

Wareham N, Ahmed S, Healey CS, Meyer KB, Haiman CA. Genome-wide association study identifies novel breast cancer susceptibility loci. Nature. 2007;447(7148):1087- G-wide.

2. Cai Q, Long J, Lu W, Qu S, Wen W, Kang D, Lee J-Y, Chen K, Shen H, Shen C-Y, Sung H, Matsuo K, Haiman C,

Khoo US, Ren Z, Iwasaki M, Gu K, Xiang Y-B, Choi J-Y, et al. Genome-wide association study identifies breast cancer risk variant at 10q21.2: results from the Asia Breast Cancer Consortium. Hum Mol Genet. 2011;20(24):4991-4999.

3. Long J, Cai Q, Sung H, Shi J, Zhang B, Choi J-Y, Wen W, Delahanty RJ, Lu W, Gao Y-T, Shen H, Park SK, Chen

K, Shen C-Y, Ren Z, Haiman CA, Matsuo K, Kim MK, Khoo US, et al. Genome-wide association study in east Asians identifies novel susceptibility loci for breast cancer. PLoS Genet. 2012;8(2):e1002532.

4. Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Yu K,

Chatterjee N, Orr N, Willett WC, Graham A, Ziegler RG, Berg CD, Buys SS, Catherine A, Feigelson HS, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 2007;39(7):870- A

5. Haiman CA, Chen GK, Vachon CM, Canzian F, Millikan RC, Wang X, Ademuyiwa F, Ahmed S, Ambrosone

CB, Baglietto L, Balleine R, Bandera E V. A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor–negative breast cancer. Nat Genet. 2012;43(12):1210- A

6. Gold B, Kirchhoff T, Stefanov S, Lautenberger J, Viale A, Garber J, Friedman E, Narod S, Olshen AB,

Gregersen P, Kosarin K, Olsh A, Bergeron J, Ellis NA, Klein RJ, Clark AG, Norton L, Dean M, Boyd J, et al. Genome- wide association study provides evidence for a breast cancer risk locus at 6q22.33. Proc Natl Acad Sci U S A.

2008;105(11):4340-

7. Chen F, Chen GK, Stram DO, Millikan RC, Ambrosone CB, John EM, Bernstein L, Zheng W, Palmer JR, Hu JJ,

Rebbeck TR, Ziegler RG, Nyante S, Bandera E V, Ingles SA, Press MF, Ruiz-Narvaez EA, Deming SL, Rodriguez-Gil JL,

105

et al. A genome-wide association study of breast cancer in women of African ancestry. Hum Genet.

2013;132(1):39- -012-1214-y.

8. Fletcher O, Johnson N, Orr N, Hosking FJ, Gibson LJ, Walker K, Zelenika D, Gut I, Heath S, Palles C,

Coupland B, Broderick P, Schoemaker M, Jones M, Williamson J, Chilcott-Burns S, Tomczyk K, Simpson G, Jacobs

KB, et al. Novel breast cancer susceptibility locus at 9q31.2: results of a genome-wide association study. J Natl

Cancer Inst. 2011;103(5):425-

9. Kim H, Lee J-Y, Sung H, Choi J-Y, Park SK, Lee K-M, Kim YJ, Go MJ, Li L, Cho YS, Park M, Kim D-J, Oh JH, Kim

J-W, Jeon J-P, Jeon S-Y, Min H, Kim HM, Park J, et al. A genome-wide association study identifies a breast cancer risk variant in ERBB4 at 2q34: results from the Seoul Breast Cancer Study. Breast Cancer Res. 2012;14(2):R56.

10. Michailidou K, Hall P, Gonzalez-Neira A, Ghoussaini M, Dennis J, Milne RL, Schmidt MK, Chang-Claude J,

Bojesen SE, Bolla MK, Wang Q, Dicks E, Lee A, Turnbull C, Rahman N, Fletcher O, Peto J, Gibson L, Dos Santos Silva

I, et al. Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nat Genet. 2013;45(4):353-

361, 361e1-

11. Siddiq A, Couch FJ, Chen GK, Lindström S, Eccles D, Millikan RC, Michailidou K, Stram DO, Beckmann L,

Rhie SK, Ambrosone CB, Aittomäki K, Amiano P, Apicella C, Baglietto L, Bandera E V, Beckmann MW, Berg CD,

Bernstein L, et al. A meta-analysis of genome-wide association studies of breast cancer identifies two novel susceptibility loci at 6q14 and 20q11. Hum Mol Genet. 2012;21(24):5373-

12. Thomas G, Jacobs KB, Kraft P, Yeager M, Cox DG, Hankinson SE, Hutchinson A, Yu K, Chatterjee N,

Gonzalez-Bosquet J, Prokunina-Olsson L, Orr N, Willett WC, Colditz GA, Ziegler RG, Berg CD, Buys SS, Mccarty CA,

Feigelson S, et al. A multi-stage genome-wide association in breast cancer identifies two novel risk alleles at 1p11.2 and 14q24.1 (RAD51L1). Nat Genet. 2009;41(5):579- A

13. Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, Maranian MJ, Bolla MK, Wang Q, Shah

M, Perkins BJ, Czene K, Eriksson M, Darabi H, Brand JS, Bojesen SE, Nordestgaard BG, Flyger H, Nielsen SF, et al.

106

Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast

cancer. Nat Genet

14. Turnbull C, Ahmed S, Morrison J, Pernet D, Renwick A, Maranian M, Seal S, Ghoussaini M, Hines S, Healey

CS, Hughes D, Warren-Perry M, Tapper W, Eccles D, Evans DG, Hooning M, Schutte M, van den Ouweland A,

Houlston R, et al. Genome-wide association study identifies five new breast cancer susceptibility loci. Nat Genet.

2010;42(6):504-

15. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Lucia a, Hunter DJ, Mccarthy MI, Ramos EM, Cardon LR, Cho

JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Alice S, Boehnke M, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747-

16. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012;109(4):1193-

17. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet.

2012;90(1):7-

18. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nat Genet.

2008;40(1):17-22.

19. Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genome-wide association studies. Am J Hum Genet. 2011;88(3):294-

20. Gusev A, Bhatia G, Zaitlen N, Vilhjalmsson BJ, Diogo D, Stahl E a, Gregersen PK, Worthington J, Klareskog L,

Raychaudhuri S, Plenge RM, Pasaniuc B, Price AL. Quantifying missing heritability at known GWAS loci. PLoS Genet.

21. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135-145.

107

22. Vinkhuyzen AA, Wrap NR, Yang J, Goddard ME, Visscher PM. Estimation and partioning of heritability in human populations using whole genome analysis methods. Annu Rev Genet. 2014;47:75- - genet-111212-133258.Estimation.

23. Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, de Andrade M, Feenstra B,

Feingold E, Hayes MG, Hill WG, Landi MT, Alonso A, Lettre G, Lin P, Ling H, Lowe W, Mathias RA, Melbye M, et al.

Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011;43(6):519-525.

24. Speed D, Hemani G, Johnson MR, Balding DJ. Improved heritability estimation from genome-wide SNPs.

Am J Hum Genet. 2012;91(6):1011-

25. Mensah-Ablorh A, Lindstrom S, Haiman CA, Henderson BE, Marchand L Le, Lee S, Stram DO, Eliassen AH,

Price A, Kraft P. Meta-analysis of rare variant association tests in multi-ethnic populations. Genet Epidemiol. 2015;

Submitted.

26. Gusev A, Lee SH, Trynka G, Finucane H, Vilhjálmsson BJ, Xu H, Zang C, Ripke S, Bulik-Sullivan B, Stahl E,

Kähler AK, Hultman CM, Purcell SM, McCarroll SA, Daly M, Pasaniuc B, Sullivan PF, Neale BM, Wray NR, et al.

Partitioning Heritability of Regulatory and Cell-Type-Specific Variants across 11 Common Diseases. Am J Hum

Genet. 2014;95:535-

27. Spiegelman D, Colditz GA, Hunter D, Hertzmark E. Validation of the Gail et al. model for predicting individual breast cancer risk. J Natl Cancer Inst. 1994;86(8):600-607.

28. Eliassen AH, Tworoger SS, Mantzoros CS, Pollak MN, Hankinson SE. Circulating insulin and c-peptide levels and risk of breast cancer among predominately premenopausal women. Cancer Epidemiol Biomarkers Prev.

2007;16(1):161- -9965.EPI-06-0693.

108

29. Kolonel LN, Henderson BE, Hankin JH, Nomura a M, Wilkens LR, Pike MC, Stram DO, Monroe KR, Earle ME,

Nagamine FS. A multiethnic cohort in Hawaii and Los Angeles: baseline characteristics. Am J Epidemiol.

2000;151(4):346-

30. Danecek P, Auton A, Abecasis G, Albers C a, Banks E, DePristo M a, Handsaker RE, Lunter G, Marth GT,

Sherry ST, McVean G, Durbin R. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156-2158.

31. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M a R, Bender D, Maller J, Sklar P, de Bakker PIW,

Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J

Hum Genet. 2007;81(3):559-

32. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden P a, Heath AC, Martin NG,

Montgomery GW, Goddard ME, Visscher PM. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010;42(7):565-

33. Desantis C, Ma J, Bryan L, Jemal A. Breast Cancer Statistics , 2013. CA Cancer J Clin. 2014;(64):52-62.

34. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S,

Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-

35. Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R,

Heitner SG, Lee BT, Barber GP, Harte RA, Diekhans M, Long JC, Wilder SP, Zweig AS, Karolchik D, Kuhn RM, et al.

ENCODE Data in the UCSC Genome Browser: Year 5 update. Nucleic Acids Res. 2013;41(D1):56-63.

36. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era--concepts and misconceptions. Nat Rev

Genet. 2008;9(4):255-

109

37. Witte JS, Visscher PM, Wray NR. The contribution of genetic variants to disease depends on the ruler. Nat

Rev Genet. 2014;15(11):765-

38. Eeles RA, Olama AA Al, Benlloch S, Saunders EJ, Leongamornlert DA, Tymrakiewicz M, Ghoussaini M,

Luccarini C, Dennis J, Jugurnauth-Little S, Dadaev T, Neal DE, Hamdy FC, Donovan JL, Muir K, Giles GG, Severi G,

Wiklund F, Gronberg H, et al. Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom

genotyping array. Nat Genet. 2013;45(4):385-391, 391e1-

39. Collaborative Group on Hormonal Factors in Breast Cancer. Familial breast cancer: collaborative reanalysis of individual data from 52 epidemiological studies including 58,209 women with breast cancer and 101,986 women without the disease. Lancet. 2001;358(9291):1389- S-6736(01)06524-2.

40. Wray NR, Yang J, Goddard ME, Visscher PM. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet

41. Lichtenstein P, Holm NV, Verkasalo P, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K.

Environmental and heritable factors in the causation of cancer-Analyses of cohorts of twins from Sweden,

Denmark, and Finland. N Engl J Med. 2000;343(2):78-85.

42. Ahlbom A, Lichtenstein P, Malmström H, Feychting M, Hemminki K, Pedersen NL. Cancer in twins: genetic and nongenetic familial risk factors. J Natl Cancer Inst. 1997;89(4):287-

43. Mancuso N, Rohland N, Tandon A, Rand K, Allen A, Quinque D, Swapan M, Li H, Stram A, X S, Le Marchand

L, African Ancestry Prostate Cancer GWAS Consortium, Stram D, Conti D V, Henderson BE, Pasuniac B, Haiman C,

Reich D. The contribution of rare variation to prostate cancer heritability. In Submission. 2015.

44. Ahmadiyeh N, Pomerantz MM, Grisanzio C, Herman P, Jia L, Almendro V. 8q24 prostate, breast, and colon cancer risk loci show tissue-specific long-range interaction with MYC. PNAS. 2010;107(21):9742-9746.

-DCS

45. Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A,

Crenshaw A, Cancel-Tassin G, Staats BJ, Wang Z, Gonzalez-Bosquet J, Fang J, Deng X, Berndt SI, Calle EE, et al.

110

Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet. 2008;40(3):310-315.

46. Al Olama AA, Kote-Jarai Z, Berndt SI, Conti D V, Schumacher F, Han Y, Benlloch S, Hazelett DJ, Wang Z,

Saunders E, Leongamornlert D, Lindstrom S, Jugurnauth-Little S, Dadaev T, Tymrakiewicz M, Stram DO, Rand K,

Wan P, Stram A, et al. A meta-analysis of 87,040 individuals identifies 23 new susceptibility loci for prostate cancer.

Nat Genet. 2014;46(10):1103-

47. Siegel R, Naishadham D, Jemal A. Cancer Statistics , 2012. CA Cancer J Clin. 2012;(62):10-29.

A

111