Meta-Analysis of a Multi-Ethnic, Breast Cancer Case-Control Targeted Sequencing Study
The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters
Citation Ablorh, Akweley. 2015. Meta-Analysis of a Multi-Ethnic, Breast Cancer Case-Control Targeted Sequencing Study. Doctoral dissertation, Harvard T.H. Chan School of Public Health.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:16121143
Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA META-ANALYSIS OF A MULTI-ETHNIC,
BREAST CANCER CASE-CONTROL
TARGETED SEQUENCING STUDY
AKWELEY ABLORH
A Dissertation Submitted to the Faculty of
The Harvard T.H. Chan School of Public Health
in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Science
Harvard University
Boston, Massachusetts.
May, 2015
Dissertation Advisor: Dr. Peter Kraft Akweley Ablorh
Meta-analysis of a multi-ethnic, breast cancer case-control targeted sequencing study
ABSTRACT
Breast cancer, the most commonly diagnosed cancer in American women, is a heritable disease with nearly one hundred known genetic risk factors. Using next generation sequencing, we explored the contribution of genetics at 12 GWAS-identified loci to breast cancer susceptibility in a multi-ethnic breast cancer case-control study.
Methods: The study population consists of 4,611 breast cancer cases and controls (2,316 cases and 2,295
controls) from four mutually exclusive ethnicities: African, Latina, Japanese, or European American.We conducted
rare variant association testing between sequenced genotypes and simulated phenotypes to compare the performance
of several approaches for assessing rare variant associations across multiple ethnicities and the statistical performance of different ethnic sampling fractions, including single-ethnicity studies and studies that sample up to four ethnicities. Findings from simulation of causal rare variant penetrance models were then applied to a non- synonymous protein-coding rare variant association study of breast cancer. Next, we applied variance partitioning
methods to determine what proportion of breast cancer heritability is explained by rare and common, coding and
non-coding, and the complete set of sequenced genetic variants.
Results: Variance component-based tests were better powered in several scenarios. Multi-ethnic studies were well powered, with inclusion of African Americans providing the largest gains in statistical power. Rare
variation in several genes was nominally associated (alpha=0.05) with breast cancer risk. Common variants
explained a significant amount of breast cancer heritability (5%; SE=2%). Total breast cancer heritability from all
sequenced SNVs from all 12 loci was approximately 11% (S.E.=4%), a substantial portion of breast cancer
heritability which ranges from 27% to 32% in European familial studies.
Conclusion: Our findings suggest that association studies between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.
We demonstrate practices with the potential to uncover and localize gene-based associations using gene-based rare
variant association testing at 12 GWAS-identified breast cancer susceptibility loci. We also present strong evidence
that just this subset of previously-identified loci explains a substantial portion of heritability which suggests that all
GWAS-identified loci may explain more breast cancer heritability than the 17% previously reported.
ii
Table of Contents
List of Figures iv List of Tables v Acknowledgments vi Dedication ix
Chapter 1 1 Introduction
Chapter 2 Meta-analysis of rare variant association tests in multi-ethnic 6 populations - Introduction 9 - Methods 11 - Results 19 - Discussion 32 - References 36
Chapter 3 Breast cancer and rare protein-coding variation at 12 43 susceptibility loci - Introduction 45 - Methods 48 - Results 57 - Discussion 64 - References 70
Chapter 4 Components of breast cancer heritability in a multi-ethnic 77 targeted sequencing study - Introduction 79 - Methods 81 - Results 90 - Discussion 98 - References 105
iii
List of Figures
Title Page Figure 2.1: Estimated singleton, doubleton and tripleton count 22 Figure 2.2: Type I error (at nominal 5% alpha) for individual ethnic results and multi- 23 ethnic designs Figure 2.3: Lambda GC when altering sampling fraction between ethnicities 24 Figure 2.4: Burden test and MetaSKAT power under two penetrance models by 26 study population and carrier proportion (relative risk = 2.5) Figure 2.5: Meta-analysis statistic power for all genes in fourteen multi-ethnic 27 populations and four monoethnic populations under two penetrance models Figure 2.6: Non-Synonymous Variants by Sample Size 34 Figure 4.1: Theoretical whole genome SE and observed SE for observed-scale 101 heritability estimates
iv
List of Tables
Title Page Table 2.1: 12 Sequenced Regions, Coverage Depth, and Length 12 Table 2.2: Study Subject Count by Cohort and Ethnicity 13 Table 2.3: Variant, Median Singleton, and Carrier Proportion for Genes with Highest and 20 Lowest Carrier Proportions Table 2.4: Gene Counts by Carrier Proportion and Self-Reported Ethnicity 21 Table 2.5: Gene-specific power for alternate proportion of causal variants – Single Ethnicity 28 Table 2.6: Gene-specific power for alternate proportion of causal variants – Multi-Ethnic 29 Table 2.7: Power by Penetrance Model for two multi-ethnic and four mono-ethnic study 31 populations Table 3.1: Study Participants 49 Table 3.2: 12 Sequenced Regions, Coverage Depth, and Length 51 Table 3.3: Count and Percent of Predictions and Polyphen-2 Scores for Non-Synonymous SNPs 53 Table 3.4: Significant gene meta-analyses for all breast cancer, and by ER status (alpha=0.05) 58 Table 3.5: Significant Gene Meta-Analyses after Correction for Number of Statistical Tests 59
(alphaB =0.025) Table 3.6: Gene-Based Associations by Ethnicity (P-Value < 0.02 in at least one ethnicity) 62 Table 3.7: Correlation between summary effect estimate and ethnicity specific effect 66 estimates by case ER status Table 3.8: Genetic Susceptibility Loci Identified in GWAS by receptor status 67 Table 3.9: Nominal gene-based rare variant breast cancer associations 69 Table 4.1: Study Population and Sequenced Variants by Ethnicity - Counts (Percent) 83 Table 4.2: Targeted Regions and Liability-Scale Heritability (and SE) from sequenced SNPs 85 Table 4.3: Counts of SNVs (M) in sequencing data by minimum MAF and ethnicity 86 Table 4.4: Conditional Analysis of all Sequenced SNPs 92 2 Table 4.5: Rare and Common Variant Breast Cancer Heritability (h g) 94 Table 4.6: Heritability in Coding and DHS Regions 95 Table 4.7: Heritability by Functional Annotation 97 Table 4.8: ER-Positive Breast Cancer heritability 98 Table 4.9: Heritability Estimation in Two Independent Samples 102 2 Table 4.10: Crude, Adjusted, and LD-pruned estimates of Rare, Common and Total hg in 103 Single variance-component analysis Table 4.11: Breast cancer heritability (Percent) in European populations 105
v
Acknowledgements
This, to paraphrase the most inspiring scientist I have been so fortunate to meet, was the work of many people; many more people than the few I acknowledge here. First and foremost, thank you to the members of my research committee, Dr. Heather Eliassen, Dr. Alkes Price, and Dr. Peter Kraft, for your feedback, time, and commitment to the success of this venture.
Thank you to Dr. Peter Kraft and our collaborators at other institutions Dr. Chris Haiman, Dr. Brian
Henderson, Dr. Loic Le Marchand, Dr. Seunngeun (Shawn) Lee and Dr. Dan Stram for graciously including me and sharing study resources and feedback as this project developed.
My sincere gratitude to the current and former postdoctoral members of the Program in Genetic epidemiology and Statistical Genetics (PGSG) and the Program in Molecular epidemiology And Genetic
Epidemiology (PMAGE) who include Dr. Sara Lindstrom and Dr. Sasha Gusev for their openness, flexibility, and insights.
I thank my colleagues and former classmates, Dr. Jason Wong, Dr. Shasha Meng, Dr. Zhonghua Liu, Dr.
Mitch Machiela, Dr. Mengmeng (Margaret) Du, Maxine Chen, Dr. Chia-Yen Chen, Tristan Hayek, Hilary
Finucane, and Dr. Gaurav Bhatia for their collegial nature and unflagging dedication to high standards.
I am grateful to the staff for meeting and exceeding expectations including Laura Ruggiero, Connie Chen,
Elizabeth Solomon, Donald Halstead and Jill McDonald.
I would like to thank members of the Harvard faculty for an unmatched learning experience and especially thank Dr. Ichiro Kawachi, Dr. David Williams, Dr. John Quackenbush, Dr. Lisa Signorello, Dr.
Betty Johnson, Dr. Robert Blendon, Dr. David Hunter and Dr. Xihong Lin who generously shared their support or guidance with me along the way.
vi
Finally, I thank Dr. Alkes Price, truly you inspire me, and Dr. Michelle Williams, Chair of the department of Epidemiology, for leading by example and for ensuring my progress through the Doctor of Science degree was fully supported in every sense that an aspiring investigator could dare to hope for.
vii
Support for this work was provided by:
Grant Source Grant Title Grant Number Funding Dates NIH – National Institute Initiative to Maximize Student Diversity GM055353-13 09/01/2010- of General Medical 08/31/2012 Sciences Department of Health Joint Interdisciplinary Training Program T32 - 09/01/2012 - and Human Services in Biostatistics & Computational GM074897 08/31/2014 Public Health Service Biology NIH – National Institute Statistical Methods for Studies of Rare R01 MH101244 09/01/2014 - of Mental Health Variants 07/01/2015
viii
For
my friends, family, and loved ones
and for
the many survivors
and their friends and family whose lives have been touched by breast cancer.
ix
Chapter 1
Introduction
Introduction
Breast cancer genetic association studies and NGS genotypes
Like so many transformative accomplishments, sequencing the first human genome was both
the end and the beginning of a journey. The scientific journey from measuring genotypes to interpreting
those genotypes for the improvement of human health has only just begun. The human genome project
was a monumental task that took over a decade and 3 billion dollars but the cost of genotyping
subsequent whole genomes is falling precipitously.1 With this influx of genetic information, researchers, clinicians, and biotechnologists may find the amount of data quickly outpacing their ability to analyze the results responsibly.
This work is motivated by
1) a need to validate disease risk estimation methods in ever-more widely available, high quality,
next generation sequencing (NGS) data; and
2) a desire to explain the gap between the familial risk explained by known breast cancer genetic
loci and the observed two-fold increase in breast cancer risk for first degree relatives of breast
cancer cases.2
To that end, we applied NGS, the latest evolution in genotyping methods1 to a study population of breast cancer cases and controls.
Breast cancer is the most commonly diagnosed site-specific cancer in US women. Lifestyle and reproductive factors have been associated with breast cancer risk, but few risk factors have associations as large as mammographic density or family history’s two-fold relative risk. Although genetic risk variants have been identified, they explain just a portion of the phenotypic variability due to genetics
(heritability) of breast cancer. So far, the remaining heritability of breast cancer remains unexplained by genetic association studies.2
2
Genetic studies have changed dramatically in the last century. Linkage studies, some of the
earliest genetic studies, relied on correlation between neighboring nucleotide base pairs in
deoxyribonucleic acid (DNA). Sanger sequencing, an ingenious biochemical assay of DNA, was well-
suited to sequences several hundreds of nucleotides long.3 However, this was just a fraction of the 3 billion nucleotide base pairs in the human genome. Array-based genotyping, SNP chips, advanced the science by allowing researchers to interrogate regions of the genome with no prior linkage study association. By 2010, genome-wide association studies (GWASs) used SNP arrays which directly genotyped over half a million SNPs (single nucleotide polymorphisms).
Even before massively parallel NGS technologies streamlined genotyping of candidate regions, exonic sequences, or whole genomes, linkage studies of pedigrees identified associations between complex disease risk and genotype. The genetic basis of breast cancer has been particularly well- studied.4 Several of these associations had meaningful clinical consequences. However, high risk, low-
frequency variants have explained only a portion of the variation in risk for breast cancer so far.
To test association with moderate and high frequency SNPs in large samples, array-based methods are a more effective tool.5 Many common SNPs with moderate effects on complex disease risk have been discovered in the past decade using GWAS.6 These associations include both genotyped SNPs and imputed SNPs which were inferred from a reference human genome based on correlation with genotyped SNPs. By combining information across studies and increasing the size of each study, GWAS have started to identify risk alleles with lower and lower frequencies. Each newly-identified risk allele explains a little more, but not all, of the remaining complex disease heritability.
There are several explanations for missing heritability.7 One explanation is that heritability
estimates were inflated and thus, putative risk variants have explained all the heritability required. If we
instead assume that heritability estimates were accurate, then there are still several possibilities. For
one, rare variants with large effect sizes were not captured well in early GWAS. Even in NGS data, read
3
depth can also limit detection of rare variants.8 A second possibility is that complex diseases like breast cancer are highly polygenic. Many of the common variants that contribute to heritability would have effect sizes too low to reach statistical significance. Due to the high quality of our data, we could explore both of these possibilities with this work.
Early literature on multi-ethnic rare variant analysis relied on the use of simulated genotypes or included only one continental population.9 In practice, both authenticity and diversity of genetic data
will be vital to bridging the gap between theory and clinical utility. The current work is split into three
distinct but complementary endeavors:
Chapter 2: “Meta-analysis of rare variant association tests in multi-ethnic populations”
addresses two key aspects of rare variant association study design in a multi-ethnic population:
1) test statistic performance; and 2) study population selection. This first project contributes
practical advice for the design and analysis of rare variant association studies based on empirical
evidence. In addition, it has a broader application to investigators working with non-European
study populations and can be used to benchmark the expected performance of rare variant test
statistics for a larger audience.
Chapter 3: “Breast cancer and rare protein-coding variant association at 12 susceptibility loci” is
an applied study of NGS exposure data for a complex disease with missing heritability. This
association study demonstrates the potential to uncover gene-specific associations using fine-
mapping of disease-associated loci and the limitations of a study of moderate sample size.
Chapter 4: “Breast cancer heritability at 12 GWAS-identified loci” applies variance-component
methods to capture hidden heritability.
In the following chapters, we present a deep and broad exploration of high quality, next generation sequencing data from a multi-ethnic population of over 4600 breast cancer cases and controls covering
5 million base pairs of the human genome.
4
Works Cited
1. Mardis ER. A decade’s perspective on DNA sequencing technology. Nature. 2011;470(7333):198- 203. doi:10.1038/nature09796. 2. Michailidou K, Beesley J, Lindstrom S, Canisius S, Dennis J, Lush MJ, Maranian MJ, Bolla MK, Wang Q, Shah M, Perkins BJ, Czene K, Eriksson M, Darabi H, Brand JS, Bojesen SE, Nordestgaard BG, Flyger H, Nielsen SF, et al. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer. Nat Genet. 2015;advance on. doi:10.1038/ng.3242. 3. Schuster SC. Next-generation sequencing transforms today’s biology. Nat Methods. 2008;5(1):16-18. doi:10.1038/nmeth1156. 4. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nat Genet. 2008;40(1):17-22. doi:10.1038/ng.2007.53. 5. Gibson G. Rare and common variants: twenty arguments. Nat Rev Genet. 2012;13(2):135-145. doi:10.1038/nrg3118. 6. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7-24. doi:10.1016/j.ajhg.2011.11.029. 7. Manolio T a, Collins FS, Cox NJ, Goldstein DB, Lucia a, Hunter DJ, Mccarthy MI, Ramos EM, Cardon LR, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Alice S, Boehnke M, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747- 753. doi:10.1038/nature08494. 8. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;135(V):0-9. doi:10.1038/nature11632. 9. Zuk O, Schaffner SF, Samocha K, Do R, Hechter E, Kathiresan S, Daly MJ, Neale BM, Sunyaev SR, Lander ES. Searching for missing heritability: designing rare variant association studies. Proc Natl Acad Sci U S A. 2014;111(4):E455-E464. doi:10.1073/pnas.1322563111.
5
Chapter 2
Meta-analysis of rare variant association tests in multi-ethnic populations
- -
Akweley Mensah-Ablorh1,2, Sara Lindstrom1,2, Christopher A. Haiman3, Brian E. Henderson3, Loic Le
Marchand4, Seunngeun Lee5, Daniel O. Stram3, A. Heather Eliassen1,6, Alkes Price1,2,7, Peter Kraft1,2,7
1 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA 02215
2 Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health,
Boston, MA 02215
3 Department of Preventive Medicine, Keck School of Medicine and Norris Comprehensive Cancer
Center, University of Southern California, Los Angeles, CA 90033
4 Epidemiology Program, University of Hawaii Cancer Research Center, Honolulu, HI 96813
5 Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI 48109
6 Channing Division of Network Medicine, Harvard Medical School and Brigham & Women’s Hospital,
Boston, MA 02215
7 Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02215
Running Title: Multiethnic rare variant meta-analysis
Correspondence: Akweley Ablorh
Harvard T.H. Chan School of Public Health
Bldg II, Suite 200
655 Huntington Ave
Boston, MA 02215
(617)432-7095
7
ABSTRACT
Several methods have been proposed to increase power in rare variant association testing by aggregating information from individual rare variants (MAF<0.005). However, how to best combine rare variants across multiple ethnicities and the relative performance of designs using different ethnic sampling fractions remains unknown. In this study, we compare the performance of several statistical approaches for assessing rare-variant associations across
multiple ethnicities. We also examine how different ethnic sampling fractions perform,
including single-ethnicity studies and studies that sample up to four ethnicities. We conducted simulations based on targeted sequencing data from 4,611 women in four ethnicities (African
American, European American, Japanese, Latino). As with single-ethnicity studies, burden tests had greater power when all causal rare variants were deleterious, and variance component- based tests had greater power when some causal rare variants were deleterious and some were protective. Multi-ethnic studies had greater power than single-ethnicity studies at many loci, with inclusion of African Americans providing the largest impact. On average, studies including African Americans had as much as 20% greater power than equivalently-sized studies without African Americans. This suggests that association studies between rare variants and complex disease should consider including subjects from multiple ethnicities, with preference given to genetically diverse groups.
Keywords: Fine-mapping; Sequencing study design; Statistical genetics; Study subject selection;
Multiethnic meta-analysis;
8
INTRODUCTION
Despite the successes of genome-wide association studies (GWAS), which have identified hundreds of common variants associated with complex diseases and traits, the genetic basis of these traits has not been fully explained.1–4 It is likely that rare variants contribute to the remaining heritability, 5,6 but to what extent remains an important open empirical question. Individual rare variants (defined here as minor allele frequency (MAF) <
0.005) are difficult to study, as they require prohibitively large samples sizes to have adequate power to detect realistic effect sizes. As an alternative, many methods have been proposed for aggregating information from individual rare variants.7–14 However, these have been proposed in the context of studies conducted in a single ethnicity. It remains an open question how to best combine rare variants in association testing across ethnicities in a multi-ethnic study, and the relative performance of designs using optimal sampling fraction across ethnicities remains
unknown.
Previous studies evaluating the performance of aggregate rare-variant tests have often used simulated genotype data,9–12 which may not adequately capture all of the properties of
empirical data. Those studies that have used empirical sequencing data have been restricted to small sample sizes (under 2,000).15–17 Importantly, most previous studies do not consider multi-
ethnic populations, and when they do, they focus on samples drawn from at most two ethnic
groups18,19 or from within one continental population.17,20
We address some of these gaps using simulations based on empirical targeted sequencing data from 4,611 women in four ethnicities (African American, European American,
9 Japanese, Latino). It is well established that multi-ethnic designs can aid in identifying the causal variant in the context of fine-mapping GWAS-identified loci;21 whether multi-ethnic designs
increase the power of aggregate rare-variant tests remains an open question. We compare the
performance of several statistical approaches for combining evidence for rare-variant association across multi-ethnic samples. We also investigate optimal ethnic sampling fractions
when samples of multiple ethnicities are available.
Our findings suggest that a multi-ethnic study that includes African American subjects may be advantageous since it leverages the greater genetic diversity among Africans.
10
METHODS
Breast Cancer Targeted Sequencing Data: We generated next-generation sequencing
data targeting 12 regions (Table 2.1) spanning 5,500kb on 2,316 breast cancer cases and 2,295
controls from three cohorts (the Nurses’ Health Study,22 the Nurses’ Health Study II,23 and the
Multiethnic Cohort24). Our study population is ethnically diverse with 937 women of African-
American ancestry, 907 women of Latino ancestry and 1,256 women of Japanese ancestry from
the Multiethnic Cohort (MEC) and 1,511 women of European ancestry from Nurses’ Health
Study (NHS) and Nurses’ Health Study II (NHSII) (Table 2.2). Boundaries of targeted regions
were defined by recombination hotspots flanking single nucleotide polymorphisms with published genome-wide-significant associations to breast cancer risk. These regions contained
74 genes, ranging in length from 0.4 kb to 910 kb, for a total of 182 kb of exonic sequence.
Target capture was performed using custom Agilent baits, and high-depth sequencing was performed at the Broad Institute using the Illumina HiSeq platform. Reads were aligned using
BWA and genotypes called using GATK in batches of approximately 100 samples25–28. Batches were balanced with respect to ethnicity and case-control status. Over 97% of bases in coding
regions had >20x coverage, with a mean coverage of 174x. All of the subjects were previously
genotyped using Illumina HumanHap arrays. Concordance in genotype calls at variable sites was over 99.7% across four pairs of duplicate samples; concordance between sequencing
genotype calls and GWAS genotypes was also over 99.7%.
We used SnpEff to annotate coding variants, and restricted analysis to non-synonymous
SNPs and stop-gain, stop-lost, start-lost, and splice site donor or acceptor variants.29
11
Table 2.1: 12 Sequenced Regions, Coverage Depth, and Length
Median | Minimum Region Index_SNP Locus Chr SNP_Pos Length(kb) Coverage Depth (Coding) 1 rs13387042 2q35 2 217905832 ------121.8
2 rs10069690 TERT 5 1279790 824 42 45.9
3 rs889312 MAP3K1 5 56031884 3681 282 307.8
4 rs2046210 ESR1 6 151990059 3749 358 242.7
5 rs1562430 8q24 8 128387852 1317 118 972.9
6 rs10995190 ZNF365 10 64278682 3221 244 397.9
7 rs704010 ZMIZ1 10 80841148 1307 122 473.2
8 rs2981579 FGFR2 10 123337335 1934 131 876.2
9 rs614367 11q13 11 69328764 910 36 258.9
10 rs999737 RAD51B 14 69034682 2319 36 815.4
11 rs3803662 TOX3 16 52586341 2797 40 269
12 rs8170 MERIT40 19 17389704 1025 35 712.4
12
Table 2.2: Study Subject Count by Cohort and Ethnicity
Cohort Ethnicity Total (% Subjects)
African American 937 (20%)
MEC Japanese American 1,256 (27%)
Latino American 907 (20%)
NHS European American 1,511 (33%)
All 4,611 (100%)
Sequencing was performed for subjects from three cohorts — Multiethnic Cohort (MEC) and Nurses'
Health or Nurses’ Health Study 2 (NHS). Ethnicity refers to self-reported ancestry validated by GWAS data.
13
Simulated case-control data sets: For the 47 genes that had at least one non-
synonymous variable site, we simulated case-control studies under the null hypothesis that the
gene was not associated with disease risk. In addition, we simulated data under four alternative
scenarios that were characterized by two conditions: the absolute value of the allele-specific
log relative risk was either constant or inversely proportional to the minor allele frequency, and
the minor alleles at causal loci were either all deleterious or a mixture of 70% deleterious and
30% protective alleles.
Case and control genotypes from each ethnic group were sampled by taking random
draws (with replacement) from the observed genotypes. Assuming the disease is rare, control
genotypes were sampled with equal probability, while case genotypes were selected with
probabilities proportional to the genotype relative risks. Specifically: for a given ethnicity, the
genotype for the ith control = ( , … , ) in gene k was set equal to = ( , … , )
宋餐 沈怠 沈徴入 似 碇怠 碇徴入 where ゜ is drawn from the set賛 of subject訣 indices訣 for the observed data {l=1,...,賛 L} with訣 probability訣
-1 th P 0l=L . Here gij is the observed genotype at variable site j for the i individual in a
particular ethnicity; L is the total number of subjects sequenced in that ethnicity (e.g. L=937 for
African Americans); and Jk is the number of rare varying sites (MAF<0.005 in the pooled sample
of 4,611 women) in gene k. Genotypes for cases were sampled based on a specified log relative
risk, setting
exp [ ] = ,…, exp珍 珍 [鎮珍 ] 怠鎮 盤デ 紅 訣 匪 講 茅 茅 デ鎮 退怠 挑版 盤デ珍 紅珍訣鎮 珍 匪繁
14
Here is the log relative risk for variable site j, defined by the simulated penetrance model.
紅珍 Setting , equal to 1 if minor allele j has a causal protective effect and 0 for a deleterious
荊牒追墜 珍 effect; setting equal to 1 if there is MAF dependence and 0 otherwise; and setting ,
暢帖 寵銚通 珍 equal to 1 if variant荊 j is causal and 0 otherwise, the log relative risk for the minor allele 荊at site j is:
, 彫謎呑 = , × ( 1) × 彫鍋認任 乳 兼欠血陳銚掴 紅珍 荊寵銚通 珍 伐 頒紅茅 健剣訣 嵜俵 崟 番 兼欠血珍 H
{1.5,2.5,3.0,4.0}:
紅違 樺
荊警経 貸怠 兼欠血兼欠捲 珍 = 僅デ 健剣訣 嵜俵 倹 崟 巾 兼欠血, 勤 錦 紅 紅違 勤 デ珍 荊寵銚通 珍 錦 勤 錦
均 斤 For all alternatives, we assumed that a fraction of rare variants ( ) were causal and
寵銚通 generated the indicator , for each SNP from a Bernoulli( ) distribution.喧 For alternatives
寵銚通 珍 寵銚通 where causal variants were荊 a mixture of protective and deleterious喧 variants, we assumed that a
fraction of causal variants ( | ) were protective ( , i.i.d. Bernoulli( | )). , was
喧牒追墜 寵銚通 荊牒追墜 珍 喧牒追墜 寵銚通 荊寵銚通 珍 shared across ethnicity but , was drawn independently for each ethnicity, introducing
牒追墜 珍 heterogeneity across ethnicities荊 for some scenarios.
15 Association tests: For each gene k, we tested for association between rare variants and case-control status within each ethnicity separately using two aggregate association approaches: a burden test12 and the Sequence Kernel Association Test (SKAT).30 Both of these approaches test the null hypothesis:
: = = = =0.
茎待 紅怠 紅態 橋 紅徴入 The burden test collapses all rare alleles within a gene by creating an indicator variable
Xeik which is equal to 1 if subject i in ethnicity e carries any rare variant in gene k and 0 otherwise. Logistic regression is then used to test the association between case-control status and Xeik by fitting the model
logit[E(y )] = + X
奪辿 ゎ赴 ず侮奪谷 奪辿谷 and comparing to a central chi-squared 1 df distribution. ( ) 態 提撫武賑入 磐聴帳 提撫武賑入 卑 SKAT is equivalent to the score test for =0 from a random effects model where the 態 30 js are normally distributed with mean 0 and variance酵 . We used the default setting for 態 態 拳勅珍酵 SKAT and set wej=Beta(MAFej, 1, 25), where MAFej is the minor allele frequency in subjects of ethnicity e sample. The SKAT test statistic for association in ethnicity e is
= 徴入 態 態 芸勅 布拳勅珍 鯨勅珍 珍退怠 where = ( ) and is the expected value of case-control status Y for subject
勅珍 沈 勅沈珍 勅沈 勅沈 勅沈 i in ethnicity鯨 eデ under訣 検 the伐 null. 航 SKAT 航can adjust for covariates (such as genome-wide genetic
16
principal components) by incorporating them in the model for . For simplicity we simulated
勅沈 and fit models without any covariate effects; in this case = 航 and is simply the proportion of
勅沈 勅 cases in the sample from ethnicity e. The statistic Qe is distributed航 航 as a mixture of chi-squared random variables, where the mixing proportions depend on the weights and genetic covariance in ethnicity e.
Burden-test meta-analysis: Standard fixed-effect meta-analysis techniques can be applied to the ethnic-specific estimates of to test the null hypothesis that =...= = 0 (E
勅賃 怠賃 帳賃 is the total number of ethnic groups considered).肯侮 We consider three meta-analysis肯 approaches.肯
The first approach is based on the inverse-variance weighted estimate of the mean effect of carrying any rare variant across ethnicities. If the causal effects or allele frequencies at the causal variants differ greatly across ethnicity—in particular, if some causal variants are monomorphic in one or more populations—then the true s may differ across ethnicity, and
勅賃 the average effect may be small—causing the inverse variance肯 weighted test to lose power. 31,32
The second and third approaches account for possible heterogeneity in , at the cost of
勅賃 increased degrees of freedom. The second approach tests for heterogeneity肯 in the ethnic- specific effects using Cochran's Q and has E-1 degrees of freedom. The third approach is a joint test of mean effect and heterogeneity that sums the fixed-effect test of the overall mean and
Cochran's Q. Because these tests are independent, the joint test has E degrees of freedom. 31,32
SKAT meta-analysis W j are constant across ethnicity, and calculate a homogeneous and heterogeneous MetaSKAT statistic (Hom-Meta-SKAT and
Het-Meta-SKAT) proposed by Lee et al.33
17
= 陳 帳 態
芸朕墜陳貸陳勅痛銚貸聴懲凋脹 布蕃布拳勅珍鯨勅珍否 珍退怠 勅退怠
= 陳 帳 態 態 芸朕勅痛貸陳勅痛銚貸聴懲凋脹 布布拳勅珍 鯨勅珍 珍退怠 勅退怠 Under the null, these statistics are also distributed as a mixture of chi-squared variables, where the mixing proportions depend on the ethnic-specific weights and genetic covariance matrices.
Comparing studies with different ethnic sample fractions: We selected 14 different
combinations of samples from at least two of four ethnicities (African American, European
American, Japanese, and Latino) to construct studies with a total of 3,520 total subjects (tri- ethnic samples have 3,522 subjects). For all of these combinations the case-control ratio within each ethnic group was 1:1. We chose four combinations that included subjects from all four ethnicities but had differing proportions of ancestries (4:1:1:2, 2:1:1:1, 3:2:1:2, or 1:1:1:1 of
African American, European American, Japanese, and Latino subjects, respectively). We also
simulated populations by selecting equal numbers of individuals from three out of the four
ethnicities at a time (four combinations) and finally from only two of the four ethnicities (six
combinations).
18
RESULTS
We hypothesized that there may be differences in rare variant association test
performance due to both choice of rare variant test statistic and ethnic sampling fractions since
rare variant tests are better powered in study populations with a higher number of varying
sites. Distributions of rare variants by ethnicity and gene are given in Table 2.3 and Table 2.4 and Figure 2.1. Consistent with previous reports,4,34,35 African Americans have the highest number of rare varying sites, followed by Latinos. Consequently the carrier proportion (the fraction of subjects carrying at least one rare, non-synonymous or truncating allele in a gene) is highest in these populations. However, for most genes, the carrier proportion (CP) was less than 2%, except in African Americans (median CP = 2.6%). Table 2.4 lists summary values for ten genes with the five highest and five lowest overall CP.
We first evaluated rare variant association meta-analysis techniques under the null. In individual ethnic groups, we examined Type I error rates of the burden test and SKAT (Figure
2.2, green boxes). In addition to the false positive rate, we explored a measure of test statistic inflation, Figure 2.3, left with blue shading) and all five meta- analysis techniques. For the 33/47 of genes with an overall CP higher than 0.01, both the burden test and SKAT adequately controlled the Type I error rate. The burden test had deflated
Type I error when the overall CP was less than 0.01. For some of the genes with overall CP less
than 0.01, the deflation in the burden test for some ethnic groups was considerable (e.g.
less than 0.5 in European and Japanese ancestry subjects for RMND1), owing to the low CP (the
RMND1 CPs for European and Japanese ancestry subjects were both 0.002). The SKAT test
19
Table 2.3: Variant, Median Singleton, and Carrier Proportion for Genes with Highest and Lowest Carrier Proportions
Subjects with rare NS variant count Singleton Carrier
Gene Chr # Total # Rare Proportion 0 1 2 3+ Proportion TACC2 10 255 230 0.53 3,823 679 76 33 17.1% MYO9B 19 158 151 0.46 4,235 351 25 0 8.2% ANKLE1 19 82 71 0.45 4,260 321 21 9 7.6% FAM129C 19 67 61 0.48 4,299 286 25 1 6.8% ZFYVE26 14 131 121 0.66 4,348 253 9 1 5.7% HAUS8 19 40 36 0.67 4,545 65 1 0 1.4% PGLS 19 20 19 0.53 4,548 63 0 0 1.4% ABHD8 19 20 19 0.47 4,551 59 1 0 1.3% BTBD16 10 23 21 0.62 4,553 58 0 0 1.3% DDA1 19 7 7 0.43 4,565 46 0 0 1.0% Median --- 32 29 0.58 4,527 75 1 0 1.8%
Chr = Chromosome; Total = Varying non-synonymous (NS) sites observed in study population; Rare =
Varying sites with minor allele frequency (MAF) < 0.005, a subset of Total; Singleton Proportion -
Median number of singleton sites as described in Figure I as a proportion of Rare; ‘0’, ‘1’, ‘2’, and ‘3+’ -
Number of subjects out of 4,611 with 0, 1, 2, and 3 or more NS rare minor alleles. Carrier Proportion -
Proportion of study subjects with at least one rare, NS minor allele.
20
Table 2.4: Gene Counts by Carrier Proportion and Self-Reported Ethnicity
Carrier African European Japanese Latino All Proportion (CP) American American American
CP = 0 24 (32%) 27 (36%) 27 (36%) 25 (34%) 23 (31%)
0 < CP 12 (16%) 20 (27%) 18 (24%) 21 (28%) 18 (24%)
0.01 < CP 6 (8%) 14 (19%) 12 (16%) 12 (16%) 10 (14%)
0.02 < CP 10 (14%) 4 (5%) 5 (7%) 7 (9%) 10 (14%)
CP > 0.03 22 (30%) 9 (12%) 12 (16%) 9 (12%) 13 (18%)
All CP 74 (100%) 74 (100%) 74 (100%) 74 (100%) 74 (100%)
We observed variants in 74 distinct SnpEff annotated genes. Carrier proportion (CP) is defined as the