
Sequence-based characterization of structural variation in the mouse genome

Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2, Leo Goodstadt1, Jérôme Nicod1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Rebekah Dutton1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan Flint2,*

1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK
2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK

†Co-first authors

*Correspondence to:

Dr. David Adams
Wellcome Trust Sanger Institute
Hinxton, Cambs, CB10 1SA, UK
Ph: +44 (0) 1223 86862
Fax: +44 (0) 1223 494919
Email: [email protected]

Dr. Jonathan Flint
Wellcome Trust Centre for Human Genetics
Oxford, OX3 7BN, UK
Ph: +44 (0) 1865 287512
Fax: +44 (0) 1865 287501
Email: [email protected]

Abstract

The importance of structural variants (SVs) in DNA as a cause of quantitative variation and as a contributor to disease is unknown, but without knowing how many SVs there are, and how they arise, it is difficult to discover what they do. Combining experimental with automated analyses of the mouse genome sequence, we identified 0.71M SVs at 0.28M sites in the genomes of thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 58% were mapped to base-pair resolution, allowing us to infer that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene-expression or quantitative phenotypic variation. We identified 24 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One third of the genes so affected have immunological functions.
Our catalogue provides a starting point for the analysis of the most dynamic and complex regions of genomes from a genetically tractable model organism.

Introduction

Structural variation is believed to be widespread in mammalian genomes [1-5] and is an important cause of disease [6-8], but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear. Understanding what SVs do depends on understanding what they are, where they occur and how they arise: large SVs that keep recurring and coincide with genes are far more likely to contribute to phenotypic variation than small non-recurrent SVs within intergenic regions.

The preeminent organism for modeling the relationship between phenotype and genotype, including SVs, is the mouse, but our catalogue of SVs in this animal is incomplete. Estimates of SV numbers and of the proportion of the mouse genome they occupy vary considerably, from a few hundred to over 7,000 SVs [9-13], affecting from 3.2% to more than 10% of the genome [14,15]. Incompleteness and inconsistencies are largely due to reliance on differential hybridization of genomic DNA to oligonucleotide arrays [16], a technology that is blind to some SV categories (such as inversions and insertions) and has only limited ability to detect others (segmental duplications and transposable elements). Sequence-based methods of SV detection, with higher resolution and greater sensitivity, have so far had limited application [12,17].

Along with SV catalogues, we need to know how SVs arise, as this will tell us what SVs may or may not do. The major molecular mechanism producing SVs in the mouse genome is believed to be retrotransposition [12,17], which may account for more than 80% of SVs between 100 nucleotides and 10 kilobases in length [17]. In cell culture, about 10% of LINE-1 insertions delete DNA [18,19], a process that also occurs in mouse genomic DNA [20].
It is not known to what extent retrotransposons, or other mechanisms of SV formation, contribute to mouse phenotypic variation and disease.

What we know about the impact of SVs on phenotypes in the mouse comes primarily from analyses of gene expression [14,15,21]. Up to 28% of the between-strain variation in gene expression in hematopoietic stem and progenitor cells has been attributed to SVs [14]; for genes lying within SVs, the SVs account for between 66% and 74% of between-strain expression variation in kidney, liver, lung and testis [15]. If the genome is replete with SVs, and given that their influence on gene expression could extend up to 500 kb from their margins [15], then SVs might be responsible for a considerable fraction of heritable gene expression variance. Since gene expression variation is believed to contribute to variation in phenotypes in the whole organism [21], SVs may turn out to have a major role in the genetic determination of all aspects of mouse, and mammalian, biology.

In this paper we use next-generation sequencing to address three critical questions: what are the extent and complexity of SVs in the mouse genome, what are the likely mechanisms of SV formation, and to what extent do SVs contribute to phenotypic variation? Our molecular characterization of SVs in the mouse genome allows us to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

Using short-read paired-end mapping, we found SVs at 0.28M sites in the mouse genome, amounting to 0.71M SVs in 17 inbred strains of mice: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ, NZO/HlLtJ, 129S5SvEvBrd, 129P2/OlaHsd, 129S1/SvImJ, WSB/EiJ, PWK/PhJ, CAST/EiJ and SPRET/EiJ. Our catalogue contains far more SVs than previously recognized (Fig. 1a) and consists of a greater variety of molecular structures (Fig. 1b and 1c). To explain why we found more, we start by describing how we went about finding SVs.
We combined visual inspection of short-read sequencing data with molecular validation to improve automated SV detection across the genome. We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse’s smallest chromosome (19) in its entirety, and a random set of other chromosomal regions, for eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J), the founder strains of the heterogeneous stock (HS) population [22]. Based on read depth and PEM we expected to find eleven patterns that classify SVs. We refer to these as type H (“High-confidence”) patterns (H1-H11: Supplementary Fig. 1). For example, some deletions and inversions leave precise, easily identifiable signatures (Fig. 1d). In addition, we found ten patterns whose interpretation was ambiguous. We refer to these as type Q (“Questionable”) patterns (Q1-Q10: Supplementary Fig. 1, Fig. 1e).

We investigated the molecular structure of all 21 patterns using a PCR strategy (Supplementary Fig. 2, Supplementary Methods). We designed 584 pairs of primers and successfully amplified 547 SV sites across the eight strains (Supplementary Table 1). Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the confident identification of an SV for nineteen of the 21 patterns in all instances that we examined by PCR (Supplementary Table 2). Two patterns (Q6 and Q10) were always false, and arose because a retrotransposed pseudogene caused mapping errors. Recognizing these patterns, we were able to predict the underlying SV structure with high confidence. PCR confirmed that 12 patterns were indicative of a single SV and six patterns were indicative of multiple adjacent SVs (Supplementary Table 2). However, SVs of type Q7 (45 cases) were due to a variable number tandem repeat, for which we could not predict the number of repeats or the molecular structure.
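
As an illustration of the anomalous PEM criterion, the sketch below labels a single read pair by its mapping distance and mate orientation. It is a generic, simplified example and not the pipeline used in this study: the H and Q patterns combine read depth with PEM and are defined in Supplementary Fig. 1, which is not reproduced here. The names `ReadPair` and `classify_pair`, the insert-size defaults and the output labels are all hypothetical, and the forward-reverse orientation of concordant pairs is an assumption about the library type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical minimal record of one mapped read pair."""
    chrom1: str
    pos1: int      # leftmost mapped coordinate of mate 1
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped coordinate of mate 2
    strand2: str


def classify_pair(pair, mean_insert=300, sd_insert=30, n_sd=4):
    """Return a coarse label for the PEM signal carried by one read pair.

    Assumes a forward-reverse paired-end library. The insert-size
    parameters are illustrative defaults, not values from the paper, and
    the span between leftmost coordinates is used as a rough proxy for
    insert size (read length is ignored).
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"

    # Order the two mates by mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    span = right_pos - left_pos

    # Both mates on the same strand: one end may lie in inverted sequence.
    if left_strand == right_strand:
        return "inversion_candidate"

    # Mates pointing away from each other (everted pair).
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication_candidate"

    # Expected orientation; check the mapping distance.
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"    # mates map too far apart
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"   # mates map too close together
    return "concordant"


if __name__ == "__main__":
    example = ReadPair("19", 1000, "+", "19", 5000, "-")
    print(classify_pair(example))
```

A single anomalous pair is weak evidence on its own; in practice such labels are clustered across many pairs and read alongside read depth before a site is called.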
Available automated methods to identify SVs are unable to differentiate all 19 PEM patterns, and may also classify some SVs incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are similar to those for inversions or deletions. We therefore adapted automated methods to recognize 15 of the types established by manual inspection and PCR validation; Q1, Q2, Q3 and Q7 could not be unambiguously identified (Supplementary Methods and ref. 23).

Sensitivity and specificity analyses

We established false positive and false negative rates for the automated analysis in three ways. First, we used our manually identified set of SVs on chromosome 19 (Supplementary Table 3), where we found 932 deletions (684 type H and 248 type Q), 15 inversions (2 type H and 13 type Q) and three copy number gains (all type H). False negative rates per strain range from 14% to 17% (Supplementary Table 4a); false positive rates range from 3.1% to 4.6% (Supplementary Table 4b). Second, to ensure that our sensitivity and specificity analyses were not vitiated because we used chromosome 19 as a training set for the automated analysis, we derived a second, smaller, set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) of chromosome 3 in the strain C3H/HeJ. Automated analysis of this region correctly identified 43 deletions (82.7%) and called 2 false positive deletions (4.4%). Third, we investigated the false negative rate for the automated detection of deletions across the genome using a PCR validation dataset of 267 simple deletions (Supplementary Table 1). Consistent with the chromosome 19 and chromosome 3 analyses we found that the false negative rate for deletions was between 9% and 15%, respectively (Supplementary Table 5a). We could not assess the performance of automated analysis to detect SV types other than deletions from manual inspection of chromosome 19 because so few of these rearrangements were called.
So we turned to PCR-based validation of insertions, inversions and tandem duplications (n = 62 to n = 76) and found that the average false negative rate was higher than for deletions, ranging from 21% to 33% per strain (Supplementary Table 5b).
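
The arithmetic behind the chromosome 3 figures reported above can be sketched as follows. The text gives 43 correctly identified deletions (82.7%) and 2 false positives (4.4%), but not the denominators. The values used here, 52 curated deletions and 45 automated calls, are inferred from those percentages and are an assumption, as is the function name `detection_rates`. Under that reading, the reported false positive rate is the fraction of automated calls that are false.

```python
def detection_rates(true_positives, false_negatives, false_positives):
    """Summarize automated calls against a manually curated truth set.

    Returns fractions, not percentages. 'false_positive_fraction' is taken
    over all automated calls, which is an assumed reading of the text.
    """
    curated = true_positives + false_negatives   # size of the curated set
    called = true_positives + false_positives    # number of automated calls
    return {
        "sensitivity": true_positives / curated,
        "false_negative_rate": false_negatives / curated,
        "false_positive_fraction": false_positives / called,
    }


# Chromosome 3 region in C3H/HeJ: 43 deletions found and 2 false positives
# (both from the text); 9 missed deletions is inferred from 43 being 82.7%.
rates = detection_rates(true_positives=43, false_negatives=9, false_positives=2)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {100 * value:.1f}%")
```

With these inputs the sensitivity and false positive fraction round to the 82.7% and 4.4% quoted in the text, and the implied false negative rate for this region is about 17%.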