Inbred Strain Variant Database (Isvdb): a Repository for Probabilistically Informed Sequence Differences Among the Collaborative Cross Strains and Their Founders
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Carolina Digital Repository MULTIPARENTAL POPULATIONS Inbred Strain Variant Database (ISVdb): A Repository for Probabilistically Informed Sequence Differences Among the Collaborative Cross Strains and Their Founders Daniel Oreper,*,†,1 Yanwei Cai,*,†,1 Lisa M. Tarantino,†,§ Fernando Pardo-Manuel de Villena,†,‡ and William Valdar†,‡,2 *Curriculum in Bioinformatics and Computational Biology, †Department of Genetics, ‡Lineberger Comprehensive Cancer § Center, and Division of Pharmacotherapy and Experimental Therapeutics, Eshelman School of Pharmacy University of North Carolina, Chapel Hill, North Carolina 27599-7264 ORCID IDs: 0000-0003-4569-5160 (D.O.); 0000-0002-3950-3794 (Y.C.); 0000-0001-9984-2600 (L.M.T.); 0000-0002-5738-5795 (F.P.-M.d.V.); 0000-0002-2419-0430 (W.V.) ABSTRACT The Collaborative Cross (CC) is a panel of recently established multiparental recombinant KEYWORDS inbred mouse strains. For the CC, as for any multiparental population (MPP), effective experimental design Collaborative and analysis benefit from detailed knowledge of the genetic differences between strains. Such differences Cross can be directly determined by sequencing, but until now whole-genome sequencing was not publicly inbred strain available for individual CC strains. An alternative and complementary approach is to infer genetic online GUI differences by combining two pieces of information: probabilistic estimates of the CC haplotype mosaic variant from a custom genotyping array, and probabilistic variant calls from sequencing of the CC founders. The imputation computation for this inference, especially when performed genome-wide, can be intricate and time- haplotype consuming, requiring the researcher to generate nontrivial and potentially error-prone scripts. To provide multiparental standardized, easy-to-access CC sequence information, we have developed the Inbred Strain Variant populations Database (ISVdb). The ISVdb provides, for all the exonic variants from the Sanger Institute mouse MPP sequencing dataset, direct sequence information for CC founders and, critically, the imputed sequence information for CC strains. Notably, the ISVdb also: (1) provides predicted variant consequence metadata; (2) allows rapid simulation of F1 populations; and (3) preserves imputation uncertainty, which will allow imputed data to be refined in the future as additional sequencing and genotyping data are collected.TheISVdbinformationishousedinanSQL database and is easily accessible through a custom online interface (http://isvdb.unc.edu), reducing the analytic burden on any researcher using the CC. The Collaborative Cross (CC) is a large panel of recombinant inbred mouse Copyright © 2017 Oreper et al. doi: https://doi.org/10.1534/g3.117.041491 strains derived from a genetically diverse set of eight inbred founder strains: Manuscript received January 5, 2017; accepted for publication March 20, 2017. A/J (AJ), C57BL/6J (B6), 129S1Sv/ImJ (129), NOD/ShiLtJ (NOD), NZO/ This is an open-access article distributed under the terms of the Creative HlLtJ (NZO), CAST/EiJ (CAST), PWK/PhJ (PWK), and WSB/EiJ (WSB). Commons Attribution 4.0 International License (http://creativecommons.org/ These eight founder strains were first outcrossed for three generations to licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. produce mice with contributions from all eight founder strains. These Supplemental material is available online at http://www.g3journal.org/lookup/ outcrosses were initiated, with differentfounderorderings,inover1000in- suppl/doi:10.1534/g3.117.041491/-/DC1. dependent breedingfunnels(Shorteret al. 2017). Mice within each funnel 1These authors contributed equally to this work. 2 were subsequently inbred for multiple generations until two or more an- Corresponding author: Department of Genetics, University of North Carolina at fi Chapel Hill, 120 Mason Farm Rd., Suite 5113, Genetic Medicine Bldg., Chapel Hill, imals were identi ed by MegaMUGA genotyping as collectively having NC 27599-7264. E-mail: [email protected] over 90% of the genome fixed (i.e., homozygous and consistent for a Volume 7 | June 2017 | 1623 founder haplotype). These animals, hereafter termed the most recent com- mon ancestors (MRCAs), were then chosen to become the obligate ances- tors of all subsequent generations and bred to produce a distinct CC strain. ThesetofMRCAsfromallstrainscomposestheCC’s obligate ancestors; that is, the set of individuals that together circumscribes the initial genetic material that can be passed on to subsequent CC mice. As a result of this breeding scheme, the inbred CC strain genomes are random and independent mosaics of the eight founder haplotypes Collaborative Cross Consortium (2012); (Srivastava et al. 2017) (Figure 1; more details at http://csbio.unc.edu/CCstatus). This combination of independent genomes and high genetic di- versity, along with the reproducibility of inbred strains, has made the CC a unique resource in mammalian genetics, and early studies on the CC have begun to exploit these features (Aylor et al. 2011; Ferris et al. 2013; Phillippi et al. 2014; Rasmussen et al. 2014; Mosedale et al. 2017; Green et al. 2017; Gralinski et al. 2017). To make optimal use of the CC strains, it is desirable to have an accurate catalog of the genetic differences between them—specifically, the positions and other characteristics of all known CC strain genetic variants—and to be able to predict which variants will be polymorphic in future hypothetical crosses of CC strains and of CC strains with other laboratory strains. Such information could be determined directly by sequencing, but CC sequencing, soon to be released for a single male per strain (Srivastava et al. 2017), is not yet easily accessible. Furthermore, sequencing from a single animal will not resolve uncertainty arising from residual het- erozygosity, since two animals from the same strain could easily differ Figure 1 Breeding process for two Collaborative Cross (CC) strains. Both funnels begin by outcrossing the same eight founders, but the at residually heterozygous loci. More generally, whole-genome se- initial outcrossing order differs, resulting in completely independent quencing in most organisms is expensive and inconvenient. populations per funnel. Animals are outcrossed for three generations, A commonly used alternative is haplotype-based variant imputation, then inbred until genotyping reveals at least two animals with .90% whereby comparatively sparse and cheap genotyping data are combined consistent homozygosity by haploytpe. These homozygous animals with more complete information about allelic state in (even extremely [a.k.a., the most recent common ancestors (MRCAs)] are chosen to distant) relatives to infer allelic state at ungenotyped positions in the target become the obligate ancestors for the CC strains; all subsequent gen- sample. This typically involves inferring shared haplotype blocks and, by erations of a CC strain descend from a subset of the MRCAs. In (A), assuming that individuals sharing a haplotype block also share the arrows show CC1 MRCA regions of inconsistent homozygosity (L1) and corresponding allelic state, using this to impute genotype (Li et al. 2009; residual heterozygosity (L2 and L3). After further inbreeding, only L2 Marchini and Howie 2010). A broad array of such imputation methods continues to segregate. In (B), the CC2 MRCA set includes three an- imals rather than two. After further inbreeding, only L1 continues to have been developed for use in humans (e.g., Hawley and Kidd 1995; segregate, but a de novo mutation has become fixed at L2. Scheet and Stephens 2006; Marchini et al. 2007; Browning and Browning 2009; Howie et al. 2009; Li et al. 2010) and livestock (e.g.,VanRadenet al. 2011; Hickey et al. 2012; Sargolzaei et al. 2014); from a known and relatively limited number of founders that can be (more) these typically either start by inferring haplotypes from the genotype affordably deeply sequenced and genotyped. This in turn reduces variant data or by approximating the pool of extant haplotypes via a large imputation uncertainty. For MPP RI strains in particular, once an animal’s reference panel, and then use those haplotypes as a means to impute variants are imputed, the need for even sparse genotyping is largely obvi- variant genotypes, which are assumed to be the primary interest. ated in its inbred descendants; they are effectively genotyped as well. In MPPs of model organisms, where the founder haplotypes are Haplotypes can, and have been, similarly imputed for the entire CC pop- usually known, it is more common for primary interest to focus on ulation based on the CC MRCAs. In particular, a hidden Markov model reconstructing the haplotype mosaic itself, e.g., for the purpose of linkage (HMM)-basedmethod(Fuet al. 2012) has previously been applied to disequilibrium mapping (Mott et al. 2000; Liu et al. 2010; King et al. 2012; MegaMUGA genotyping of the CC MRCA animals coupled with Fu et al. 2012; Gatti et al. 2014; Verbyla et al. 2014; Zheng et al. 2015). In MegaMUGA genotyping from founder animals, (Welsh et al. 2012; such cases, haplotype-based imputation of specific variants may proceed Morgan et al. 2015; Srivastava et al. 2017) to impute a probabilistic as a second, refinement step to inform fine-mapping and candidate estimate of each CC strain’s haplotype mosaic. Sequencing