Coal-Miner: a Coalescent-Based Method for GWA Studies of Quantitative Traits with Complex Evolutionary Origins
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/132951; this version posted May 2, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Coal-Miner: a coalescent-based method for GWA studies of quantitative traits with complex evolutionary origins Hussein A. Hejase Natalie Vande Pol Gregory M. Bonito Department of Computer Science and Department of Plant, Soil and Department of Plant, Soil and Engineering Microbial Sciences Microbial Sciences Michigan State University Michigan State University Michigan State University East Lansing, Michigan 48824 East Lansing, Michigan 48824 East Lansing, Michigan 48824 [email protected] [email protected] [email protected] Patrick P. Edger Kevin J. Liu∗ Department of Horticulture Department of Computer Science and Michigan State University Engineering East Lansing, Michigan 48824 Michigan State University [email protected] East Lansing, Michigan 48824 [email protected] ABSTRACT genomic architectures for complex traits and incorporate a range Association mapping (AM) methods are used in genome-wide asso- of evolutionary scenarios, each with dierent evolutionary pro- ciation (GWA) studies to test for statistically signicant associations cesses that can generate local genealogical variation. e empirical between genotypic and phenotypic data. e genotypic and pheno- benchmarks include a large-scale dataset that appeared in a re- typic data share common evolutionary origins – namely, the evo- cent high-prole publication. Across the datasets in our study, we lutionary history of sampled organisms – introducing covariance nd that Coal-Miner consistently oers comparable or typically which must be distinguished from the covariance due to biological beer statistical power and type I error control compared to the function that is of primary interest in GWA studies. A variety of state-of-art methods. methods have been introduced to perform AM while accounting for sample relatedness. However, the state of the art predominantly CCS CONCEPTS utilizes the simplifying assumption that sample relatedness is ef- •Applied computing ! Computational genomics; Computational fectively xed across the genome. In contrast, population genetic biology; Molecular sequence analysis; Molecular evolution; Computa- theory and empirical studies have shown that sample relatedness tional genomics; Systems biology; Bioinformatics; Population genetics; can vary greatly across dierent loci within a genome; this phe- nomena – referred to as local genealogical variation – is commonly KEYWORDS encountered in many genomic datasets. New AM methods are genome wide association study, population stratication, associa- needed to beer account for local variation in sample relatedness tion mapping, genealogy, variation, discordance, coalescent, simu- within genomes. lation study, Arabidopsis, Heliconius, Burkholdericeae We address this gap by introducing Coal-Miner, a new statis- tical AM method. e Coal-Miner algorithm takes the form of a ACM Reference format: methodological pipeline. e initial stages of Coal-Miner seek to Hussein A. Hejase, Natalie Vande Pol, Gregory M. Bonito, Patrick P. Edger, detect candidate loci, or loci which contain putatively causal mark- and Kevin J. Liu. 2017. Coal-Miner: a coalescent-based method for GWA ers. Subsequent stages of Coal-Miner perform test for association studies of quantitative traits with complex evolutionary origins. In Proceed- ings of ACM BCB, Boston, MA, 2017 (BCB), 10 pages. using a linear mixed model with multiple eects which account for DOI: 10.475/123 4 sample relatedness locally within candidate loci and globally across the entire genome. Using synthetic and empirical datasets, we compare the statistical 1 INTRODUCTION power and type I error control of Coal-Miner against state-of-the- art AM methods. e simulation conditions reect a variety of Genome-wide association (GWA) studies aim to pinpoint loci with genetic contributions to a phenotype by uncovering signicant ∗To whom correspondence should be addressed. statistical associations between genomic markers and a phenotypic trait under study. We refer to the computational methods used in a Permission to make digital or hard copies of part or all of this work for personal or GWA analysis as association mapping (AM) methods. Among the classroom use is granted without fee provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the full citation most widely studied organisms in GWA studies are natural human on the rst page. Copyrights for third-party components of this work must be honored. populations and laboratory strains of house mouse. Recently, GWA For all other uses, contact the owner/author(s). approaches have been applied to natural populations of other organ- BCB, Boston, MA © 2017 Copyright held by the owner/author(s). 123-4567-24-567/08/06...$15.00 isms sampled from across the Tree of Life. For example, the study DOI: 10.475/123 4 of Consortium [10] published whole genome sequences for over bioRxiv preprint doi: https://doi.org/10.1101/132951; this version posted May 2, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. BCB, 2017, Boston, MA H. Hejase et al. a thousand samples from globally distributed Arabidopsis popula- is RecHMM [61], which applies a sequentially Markovian approxi- tions. In combination with phenotypic data, the genomic sequence mation [40] to the full coalescent-with-recombination model [21]. data was used in a GWA analysis to pinpoint genomic loci involved More recently, coalescent-based methods have been developed to in owering time at two dierent temperatures. Other recent GWA infer local coalescent histories and explicitly ascribe local genealog- studies such as the study of Porter et al. [47] have focused on bacte- ical variation to dierent evolutionary processes. Examples include ria and other microbes (see [9] for a review of relevant literature). Coal-HMM [15, 20, 37, 38] and PhyloNet-HMM [35]. Regardless of sampling strategy – from one or more closely To address this methodological gap, we recently introduced Coal- related populations involving a single species to multiple popula- Map, a new AM method that accounts for local genealogical vari- tions from divergent species – it is well understood that sample ation across genomic sequences. Coal-Map performs statistical relatedness can be a confounding factor in GWA analyses unless inference under a linear mixed model (LMM). e LMM utilizes properly accounted for. Intuitively, the genotypes and phenotypes xed eects to account for global sample relatedness and, depending of present-day samples reect their shared evolutionary history, or upon whether the test marker is located within a locus containing phylogeny. For this reason, covariance due to a functional relation- putatively causal markers, local sample relatedness as well. e lat- ship between genotypic markers and a phenotypic character must ter condition is evaluated using model selection criteria. Coal-Map be distinguished from shared covariance due to common evolu- required local-phylogeny-switching breakpoints as input. We vali- tionary origins. EIGENSTRAT [49] is a popular AM method which dated Coal-Map’s performance using simulated and empirical data. accounts for sample relatedness as a xed eect. Other statistical Our performance study demonstrated that Coal-Map’s statistical AM methods have utilized linear mixed models (LMMs) to capture power and type I error control was comparable or beer than other sample relatedness using random eects; these include EMMA [30], state-of-the-art methods that account for sample relatedness using EMMAX [29], and GEMMA [62]. e question of whether sample xed eects. relatedness is beer modeled using the former or the laer – i.e., using xed vs. random eects – is a maer of ongoing debate 2 METHODS [50, 56]. 2.1 Overview of Coal-Miner algorithm Local variation in functional covariance across the genome is a crucial signature that AM methods use to uncover putatively We begin by introducing the high-level design of Coal-Miner, our causal markers. In contrast, virtually all of the most widely used new algorithm for statistical AM which accounts for local variation state-of-the-art AM methods assume that covariance due to sample of sample relatedness across genomic sequences. e input to the relatedness does not vary appreciably across the genome. Sample Coal-Miner algorithm consists of: (1) an n by k multi-locus sequence ∗ relatedness is therefore evaluated “globally” across the genome, data matrix X, (2) a phenotypic character y, and (3) ` , the number eliding over “local” genealogical variation across loci. e laer has of “candidate loci” used during analysis, where a candidate locus been observed by many comparative genomic and phylogenomic is a locus that is inferred to contain one or more putatively causal studies (see [16] for a review of relevant literature). It is well un- SNPs. e output consists of an association score for each site derstood now that local genealogical variation within genomes is x 2 X. pervasive across a range of evolutionary divergence – from struc- Coal-Miner’s statistical model captures the relationship