Sequence-based characterization of structural variation in the mouse genome

Binnaz Yalcin1†, Kim Wong2†, Avigail Agam1†, Martin Goodson1†, Thomas M. Keane2, Leo Goodstadt1, Jérôme Nicod1, Amarjit Bhomra1, Polinka Hernandez-Pliego1, Helen Whitley1, James Cleak1, Rebekah Dutton1, Deborah Janowitz1, Richard Mott1, David J. Adams2,*, Jonathan Flint1,*

1The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford, OX3 7BN, UK; 2The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK; 3MRC Functional Genomics Unit, Department of Physiology, Anatomy and Genetics, University of Oxford, South Parks Road, Oxford OX1 3QX, UK.

†Co-first authors

*Correspondence to:

Dr. David Adams
Wellcome Trust Sanger Institute
Hinxton, Cambs, CB10 1SA, UK
Ph: +44 (0) 1223 86862
Fax: +44 (0) 1223 494919
Email: [email protected]

Dr. Jonathan Flint
Wellcome Trust Centre for Human Genetics
Oxford, OX3 7BN, UK
Ph: +44 (0) 1865 287512
Fax: +44 (0) 1865 287501
Email: [email protected]

Abstract

The importance of structural variants (SVs) in DNA as a cause of quantitative variation and as a contributor to disease is unknown, but without knowing how many SVs there are, and how they arise, it is difficult to discover what they do.

Combining experimental with automated analyses of the mouse genome sequence, we identified 0.71M SVs at 0.28M sites in the genomes of thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 58% were mapped to base pair resolution, allowing us to infer that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene expression or quantitative phenotypic variation. We identified 24 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One third of the genes so affected have immunological functions.

Our catalogue provides a starting point for the analysis of the most dynamic and complex regions of genomes from a genetically tractable model organism.

Introduction

Structural variation is believed to be widespread in mammalian genomes1-5 and is an important cause of disease6-8, but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear.

Understanding what SVs do depends on understanding what they are, where they occur and how they arise: large SVs that keep recurring and coincide with genes are far more likely to contribute to phenotypic variation than small non-recurrent SVs within intergenic regions.

The preeminent organism for modeling the relationship between phenotype and genotype, including SVs, is the mouse, but our catalogue of SVs in this animal is incomplete. Estimates of SV numbers, and of the proportion of the mouse genome they occupy, vary considerably, from figures of a few hundred to over 7,0009-13, affecting from 3.2% to more than 10% of the genome14,15. Incompleteness and inconsistencies are largely due to reliance on differential hybridization of genomic DNA to oligonucleotide arrays16, a technology that is blind to some SV categories (such as inversions and insertions) and has only limited ability to detect others (segmental duplications and transposable elements). Sequence-based methods of SV detection, with higher resolution and greater sensitivity, have so far had limited application12,17.

Along with SV catalogues, we need to know how SVs arise, as this will tell us what SVs may or may not do. The major molecular mechanism producing SVs in the mouse genome is believed to be retrotransposition12,17, which may account for more than 80% of SVs between 100 nucleotides and 10 kilobases in length17. In cell culture, about 10% of LINE-1 insertions delete DNA18,19, a process that also occurs in mouse genomic DNA20. It is not known to what extent retrotransposons, or other mechanisms of SV formation, contribute to mouse phenotypic variation and disease.

What we know about the impact of SVs on phenotypes in the mouse comes primarily from analyses of gene expression14,15,21. Up to 28% of the between-strain variation in gene expression in hematopoietic stem and progenitor cells has been attributed to SVs14; for genes lying within SVs, the latter account for between 66% and 74% of between-strain expression variation in kidney, liver, lung and testis15. If the genome is replete with SVs, and given that their influence on gene expression could extend up to 500 Kb from their margins15, then SVs might be responsible for a considerable fraction of heritable gene expression variance. Since gene expression variation is believed to contribute to variation in phenotypes in the whole organism21, SVs may turn out to have a major role in the genetic determination of all aspects of mouse, and mammalian, biology.

In this paper we use next generation sequencing to address three critical questions: what are the extent and complexity of SVs in the mouse genome, what are the likely mechanisms of SV formation, and to what extent do SVs contribute to phenotypic variation? Our molecular characterization of SVs in the mouse genome allows us to determine the extent to which SVs contribute to genetic and phenotypic diversity.

Results

SV identification

Using short-read paired-end mapping, we found SVs at 0.28M sites in the mouse genome, amounting to 0.71M SVs in 17 inbred strains of mice: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6N, CBA/J, DBA/2J, LP/J, NOD/ShiLtJ, NZO/HlLtJ, 129S5SvEvBrd, 129P2/OlaHsd, 129S1/SvImJ, WSB/EiJ, PWK/PhJ, CAST/EiJ and SPRET/EiJ. Our catalogue contains far more SVs than previously recognized (Fig. 1a) and consists of a greater variety of molecular structures (Fig. 1b and 1c). To explain why we found more, we start by describing how we went about finding SVs.

We combined visual inspection of short-read sequencing data with molecular validation to improve automated SV detection across the genome.

We used two criteria to identify SVs manually: read depth and anomalous paired-end mapping (PEM). We did this using data from the mouse’s smallest chromosome (19) in its entirety, and a random set of other chromosomal regions, for eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J), the founder strains of a heterogeneous stock (HS) population22.
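To make the PEM criterion concrete, the following is a minimal sketch of how a single read pair might be labelled from its mapped distance and mate orientation. It is not the procedure used in this study: the function name, the library mean and standard deviation, and the z-score cutoff are all hypothetical values chosen for illustration.

```python
def classify_read_pair(mapped_distance, proper_orientation,
                       mean_insert=400.0, sd_insert=30.0, z_cutoff=3.0):
    """Label one read pair from its mapped distance and mate orientation.

    mean_insert, sd_insert and z_cutoff are hypothetical library
    parameters, not values taken from the study.
    """
    if not proper_orientation:
        # Mates that do not face each other suggest an inverted segment.
        return "inversion-like"
    z = (mapped_distance - mean_insert) / sd_insert
    if z > z_cutoff:
        # Mates map further apart on the reference than the library allows,
        # so the sample probably lacks sequence present in the reference.
        return "deletion-like"
    if z < -z_cutoff:
        # Mates map closer together than expected, so the sample probably
        # carries extra sequence between them.
        return "insertion-like"
    return "concordant"


labels = [classify_read_pair(d, True) for d in (1400, 450, 200)]
print(labels)
```

With these assumed parameters, a 60 bp deletion shifts the mapped distance by only two standard deviations and the pair is still labelled concordant, which illustrates why small SVs are hard to separate from ordinary variation in library insert size.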

Based on read depth and PEM we expected to find eleven patterns that classify SVs. We refer to these as type H (“High-confidence”) patterns (H1-H11: Supplementary Fig. 1). For example, some deletions and inversions leave precise, easily identifiable signatures (Fig. 1d). In addition, we found ten patterns whose interpretation was ambiguous. We refer to these as type Q (“Questionable”) patterns (Q1-Q10: Supplementary Fig. 1, Fig. 1e). We investigated the molecular structure of all 21 patterns using a PCR strategy (Supplementary Fig. 2, Supplementary Methods). We designed 584 pairs of primers and successfully amplified 547 SV sites across the eight strains (Supplementary Table 1).

Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the confident identification of an SV for nineteen of the 21 patterns in all instances that we examined by PCR (Supplementary Table 2). Two patterns were always false (Q6 and Q10), and arose because of the presence of a retrotransposed pseudogene giving mapping errors.

Recognizing these patterns, we were able to predict underlying SV structure with high confidence. PCR confirmed that 12 patterns were indicative of a single SV and six patterns indicative of multiple adjacent SVs (Supplementary Table 2). However, SVs of type Q7 (45 cases) were due to a variable number tandem repeat, for which we could not predict the number of repeats or molecular structure.

Available automated methods to identify SVs are unable to differentiate all 19 PEM patterns, and may also classify some SVs incorrectly; for example, the PEM patterns of linked insertions (Q5 and Q9: Supplementary Fig. 1) are similar to those for inversions or deletions. Therefore we adapted automated methods to recognize 15 of the types identified by manual inspection and PCR validation (Q1, Q2, Q3 and Q7 could not be unambiguously identified; Supplementary Methods and ref. 23).

Sensitivity and specificity analyses

We established false positive and false negative rates for the automated analysis in three ways. First, we used our manually identified set of SVs on chromosome 19 (Supplementary Table 3) where we found 932 deletions (684 type H and 248 type Q), 15 inversions (2 type H and 13 type Q) and three copy number gains (all type H). False negative rates per strain range from 14% to 17% (Supplementary Table 4a); false positive rates range from 3.1% to 4.6% (Supplementary Table 4b). Second, to ensure that our sensitivity and specificity analyses were not vitiated because we used chromosome 19 as a training set for the automated analysis, we derived a second, smaller, set of manually curated deletions from a randomly chosen 10 Mb region (101 Mb to 111 Mb) of chromosome 3 in the strain C3H/HeJ. Automated analysis of this region correctly identified 43 deletions (82.7%) and called 2 false positive deletions (4.4%). Third, we investigated the false negative rate for the automated detection of deletions across the genome using a PCR validation dataset of 267 simple deletions (Supplementary Table 1). Consistent with the chromosome 19 and chromosome 3 analyses, we found that the false negative rate for deletions was between 9% and 15% (Supplementary Table 5a).
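The chromosome 3 percentages can be reproduced with simple arithmetic. The sketch below assumes that the curated set contained 52 deletions (inferred from 43 being 82.7% of the total; the text does not state this number) and that the false positive figure is expressed as a fraction of all automated calls. Function names are illustrative only.

```python
def false_negative_rate(true_positives, curated_total):
    """Fraction of curated SVs that the automated analysis missed."""
    return 1.0 - true_positives / curated_total


def false_positive_fraction(false_positives, true_positives):
    """Fraction of automated calls that are not in the curated set."""
    return false_positives / (true_positives + false_positives)


curated_total = 52  # assumption: inferred from 43 / 0.827, not stated in the text
tp, fp = 43, 2

sensitivity = tp / curated_total
fnr = false_negative_rate(tp, curated_total)
fpf = false_positive_fraction(fp, tp)

print(f"sensitivity          {sensitivity:.1%}")  # 82.7%
print(f"false negative rate  {fnr:.1%}")          # 17.3%
print(f"false positive calls {fpf:.1%}")          # 4.4%
```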

We could not assess the performance of automated analysis to detect SV types other than deletions from manual inspection of chromosome 19 because so few of these rearrangements were called. So we turned to PCR-based validation of insertions, inversions and tandem duplications (n = 62 to n = 76) and found that the average false negative rate was higher than for deletions, ranging from 21% to 33% per strain (Supplementary Table 5b).

Automated analysis was less successful in detecting the more complex rearrangements, with 25% to 38% false negative rates (n = 46 to n = 54).

SV categories

The results of the detection and classification of 711,923 SVs across the entire genome of 17 strains are shown in Table 1. There are on average 26,000 SVs in classical inbred strains, and 92,000 in wild-derived inbred strains, affecting 1.2% (33.0 Mb) and 3.7% (98.6 Mb) of the genome respectively. SVs smaller than 100 bp are excluded, as below this size it is difficult to determine whether the deviation in distance between two paired-end reads is due to variation in the library insert size distribution or due to paired ends flanking a structural variant. However, we were able to identify 32 deletions smaller than 100 bp and greater than 10 bp from the chromosome 19 experimental analysis of SVs along a region of 2.4 Mb. We confirmed by PCR that in all cases the deletion is real and smaller than 100 bp (Supplementary Table 1). The size distribution of all SVs occurring in this 2.4 Mb region is plotted in Supplementary Figure 3. Assuming that the 2.4 Mb region is typical, this implies that the rest of the genome (in classical laboratory strains) will contain approximately 40,000 deletion SVs smaller than 100 bp and greater than 10 bp.
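The extrapolation from the 2.4 Mb region can be checked with a back-of-envelope calculation. The sketch below assumes a genome size derived from the statement that 33.0 Mb corresponds to 1.2% of the genome; the genome size actually used for the published estimate is not given, so the result should be read only as an order-of-magnitude check.

```python
deletions_observed = 32   # deletions of 10-100 bp found in the curated region
region_mb = 2.4           # size of the curated chromosome 19 region

# Assumption: genome size implied by "1.2% (33.0 Mb)" in the text.
genome_mb = 33.0 / 0.012

density_per_mb = deletions_observed / region_mb
genome_estimate = density_per_mb * genome_mb

print(f"{density_per_mb:.1f} small deletions per Mb")
print(f"about {genome_estimate:,.0f} small deletions genome-wide")
```

Under this assumption the estimate comes out a little under 37,000, in the same range as the approximately 40,000 quoted above.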

Table 1 classifies SVs greater than 100 bp into two groups: 99.4% are simple and 0.6% are complex. Simple SVs include those whose biological interpretation is straightforward: insertions, deletions and inversions. We separately identify one type of insertion, a copy number gain, consisting of non-repetitive DNA that is present in multiple copies in other strains. When this sequence occurs immediately adjacent to its original, it is annotated as a tandem duplication.

It is less clear to what extent the more complex categories we found represent different biological categories. Complex SVs consist of a mixture of events that abut each other. Sometimes the mixture arises because two or more simple SVs occur next to each other: given the density of deletions in the genome, the 2,132 deletions that we found separated by less than 250 bp could have occurred by chance (two of our PEM patterns, H3 and Q5: Supplementary Fig. 1). However, we recognize as a separate biological category SVs that are immediately adjacent to each other, with no intervening DNA, since we suspect that these might be the progeny of a single biological process (marked as del+ins and del+ins/inv in Table 1). Intriguingly, we noted that half of the inversions co-occur with an insertion or deletion, or in rare cases with both an insertion and deletion (Fig. 2a). We also separately identify an SV within a copy number gain (termed “nested” in Table 1) since the probability of coincidence is less than one event per genome.

SV formation

Microhomology at SV breakpoints, as well as the content of sequence within SVs and the SV’s ancestral state, was used to infer the likely mechanism of formation for simple SVs. To obtain breakpoint sequence, we performed de novo local assembly at 81.3% of deletions and 74.2% of non-transposable element insertions23. Comparison of 1,314 predicted breakpoints to the breakpoints delineated by PCR and sequencing (Supplementary Table 1; Supplementary Methods) revealed that 57.7% of breakpoint predictions are exact and 86.5% are within 20 bp (Supplementary Table 6a). In cases where the local assembly strategy failed, we relied on the original breakpoint estimates obtained from the mapping of reads to the reference genome: 83.3% of these estimates are within 100 bp of the actual breakpoint (Supplementary Table 6b). Breakpoint accuracy for insertions, inversions and copy number gains is presented in Supplementary Tables 6c, 6d and 6e, respectively.

To obtain genome-wide estimates of the contribution of each mechanism to SV formation, we used sequence data from relative deletions (in other words, deletions relative to the C57BL/6J genome). We have highly accurate breakpoint sequence for this sample, which should be unbiased with respect to ancestry. Using rat as an outgroup, we classified 19% of relative deletion SVs as ancestral deletions, 57% as ancestral insertions and the remainder (24%) were indeterminate (Supplementary Methods).

Classification of SVs and their size characteristics are summarized in Figure 3. SVs are most often due to retrotransposons (LINEs (25%), LTRs (14%) and SINEs (15%)), followed by variable number tandem repeats (VNTRs) (15%) and pseudogenes (2%). Other mechanisms, not involving retrotransposons, account for 29% of SVs. We found that the median length of all SVs is 349 bp, with modes at 100 bp and 6,400 bp, LINE insertions comprising the majority of the larger insertions (Fig. 3a). Outgroup analysis showed that the transposon-derived SVs arose almost exclusively from ancestral insertion events (98.8%). Microhomology (12-16 bp) surrounds the breakpoints of LINE and SINE derived SVs (known as target site duplication), and shorter (6-8 bp) stretches are associated with LTR SVs (Fig. 3b). Non-repeat mediated SVs are mainly a result of ancestral deletion events (79%), and are associated with short microhomologies, up to 7 bp in length, consistent with either microhomology-mediated break-induced replication (MMBIR) or microhomology-mediated end joining (MMEJ).

The great majority of SVs caused by LINE, ERV and SINE insertions do not show any missing nucleotides at their breakpoints (95%, 93.3% and 92.3% respectively; Table 2). However, we found rare cases (4 LINEs, 2 ERVs and 1 SINE) in which the insertion machinery also deletes nucleotides. Missing sequence ranged from 1 bp to 289 bp. We found that the presence of an ancestral microdeletion is directly linked to the absence of the target site duplication (TSD) for three LINEs. Of the 113 ancestral deletions, 36 (32%) had from 1 bp to 107 bp of inserted sequence at the breakpoint, in addition to the deletion.

We unexpectedly found that in all cases the presence of SNPs in the microhomology region was correlated with the presence of the SV (Fig. 2b). In every case the SNP elongates the microhomology. However, this phenomenon is rare: we observed only five (4.5%) cases amongst our 113 manually curated ancestral deletions (Supplementary Table 7) where a SNP and SV formation co-segregate. We found a similar relationship between a SNP formed at a target site duplication and the presence of an ancestral insertion: 15 ancestral insertions (16%) had SNPs or short indels within their target site duplication, coincident with the insertion (Supplementary Table 7).

Given their potential role in human disease24, we were interested to document the occurrence of recurrent SVs, those that arise at the same genomic locus independently in unrelated individuals. Non-allelic homologous recombination (NAHR) is the major mechanism for recurrent SVs25, while fork stalling and template switching (FoSTeS) and/or microhomology-mediated break-induced replication (MMBIR) mechanisms may be important for non-recurrent SVs26.

We looked for SVs occurring at the same locus in different strains, but with different breakpoints, indicating independent origins. Using the SV breakpoints obtained from PCR sequencing (249 SV sites in eight strains that account for over 4,000 breakpoints; Supplementary Table 7), we found that in the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences. However within all 17 strains, we found multiple alleles at 12% of SVs, due almost entirely to the presence of different alleles originating from the wild-derived inbred strains (Supplementary Table 7).

Consistent with the low frequency of recurrent SVs, breakpoint features associated with NAHR are rare. We estimated that 0.13% of deletions are due to NAHR, when we required a signature of >=200 bp of >=90% sequence identity. Two analyses, therefore, indicate that recurrent SVs are rare.
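The NAHR criterion stated above (at least 200 bp with at least 90% sequence identity) can be expressed as a simple check. The sketch below assumes the two flanking sequences have already been aligned without gaps and trimmed to equal length; how the sequences were aligned in this study is not described here, and the function name is hypothetical.

```python
def has_nahr_signature(flank_a, flank_b, min_len=200, min_identity=0.90):
    """Return True if two pre-aligned, ungapped flanks meet the thresholds.

    Assumption: flank_a and flank_b are already aligned base for base.
    """
    if len(flank_a) != len(flank_b):
        raise ValueError("flanks must be aligned to the same length")
    if len(flank_a) < min_len:
        return False
    matches = sum(a == b for a, b in zip(flank_a, flank_b))
    return matches / len(flank_a) >= min_identity


def mutate_every(seq, step):
    """Helper for the example: replace every `step`-th base with 'N'."""
    return "".join("N" if i % step == 0 else base for i, base in enumerate(seq))


left = "ACGT" * 50                       # 200 bp toy sequence
print(has_nahr_signature(left, mutate_every(left, 20)))  # 95% identity
print(has_nahr_signature(left, mutate_every(left, 5)))   # 80% identity
```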

Impact of SVs on gene function

We assessed the impact of SVs on phenotypes in three ways: (i) the relationship between the position of SVs and the position of genes; (ii) changes in expression of genes overlapping, or nearby, an SV; (iii) association between SVs and phenotypes in an outbred population of mice.

Across all strains, SVs overlap 10,291 genes. We found that SVs are, in all strains except C57BL/6N, significantly depleted (P<0.01) in genes (fold change 0.91). However, we found that SINE insertions are significantly enriched in the introns of genes (P<0.01, fold change 1.34).

The relative depletion of SVs within genes implies a proportionate deficit in their phenotypic consequences. We found this to be true for the effect of SVs on gene expression, by estimating heritability attributable to SVs. Variation within strains is due to environmental factors and variation between strains is due to both environmental and genetic factors, so the difference between the two variances is a measure of genetic effect (heritability). Figure 4a shows scaled variances in gene expression from brain RNAseq data measured between and within strains for five categories of SVs. Between-strain variances are clearly larger than within-strain variances, indicating that SVs do have an impact on expression, but how big an impact?
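The logic of comparing between-strain and within-strain variance can be illustrated with a textbook one-way ANOVA estimator. This is a generic sketch for a balanced design, not the scaled-variance formulae of ref. 15 used in this paper; the strain labels and expression values are invented for the example.

```python
from statistics import mean


def anova_variance_components(groups):
    """One-way ANOVA on a balanced design.

    groups maps a strain name to a list of expression values.
    Returns (ms_between, ms_within, heritability), where heritability is
    the genetic variance component divided by total variance.
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal numbers of animals per strain")
    n = sizes.pop()
    k = len(groups)
    grand = mean(x for values in groups.values() for x in values)
    ms_between = n * sum((mean(v) - grand) ** 2 for v in groups.values()) / (k - 1)
    ms_within = sum(
        (x - mean(v)) ** 2 for v in groups.values() for x in v
    ) / (k * (n - 1))
    # The excess of between-strain over within-strain variance is genetic.
    genetic = max((ms_between - ms_within) / n, 0.0)
    return ms_between, ms_within, genetic / (genetic + ms_within)


# Invented expression values for three hypothetical strains.
toy = {"A": [1.0, 1.2, 0.8], "B": [2.0, 2.2, 1.8], "C": [3.0, 3.2, 2.8]}
msb, msw, h2 = anova_variance_components(toy)
print(f"MS between {msb:.2f}, MS within {msw:.2f}, heritability {h2:.3f}")
```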

We estimated the proportion of heritability attributable to SVs15 and found that no category accounts for more than 10%. To determine if these results were specific to brain tissue, we analysed gene expression data for the eight founder strains of the HS population (n = 5 for each) from liver, measured on Illumina arrays27. Mean heritability attributable to an SV, for transcripts overlapping one or more SVs, was 9.5%. Since many transcripts overlap multiple small SVs (median of 3, maximum of 216), we hypothesized that SV heritability might be related to the amount of gene overlapped. For each transcript we summed the amount of SV sequence overlapping the gene and expressed this as a proportion of the total length of the gene. Overlap proportions of 50% or more make a large contribution to heritability: in brain tissue, at loci with SVs in this category, SVs contribute to 25% of the variance, compared to 7.8% for transcripts where SVs overlap less than 50% of the gene. However, large overlaps (50% or more) are rare, affecting less than 3% of transcripts. Thus while SVs make a modest contribution to the overall heritability of expression variance, at individual transcripts they may be the main cause of between-strain differences in expression.
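The overlap proportion described above can be sketched as an interval calculation. The version below clips each SV to the gene and merges overlapping SVs so that no base is counted twice; whether the published analysis merged intervals in this way is an assumption. Coordinates are half-open and invented for the example.

```python
def overlap_proportion(gene_start, gene_end, svs):
    """Fraction of the gene [gene_start, gene_end) covered by SV intervals."""
    # Keep only SVs that touch the gene, clipped to the gene boundaries.
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in svs
        if start < gene_end and end > gene_start
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Extend the current block over the overlapping SV.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)


# Hypothetical gene and SV coordinates.
fraction = overlap_proportion(100, 200, [(90, 120), (110, 130), (180, 250), (300, 400)])
print(fraction)  # 0.5
```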

We also found that SVs outside a gene have only small effects on gene expression. Figure 4b shows between and within strain variances for SVs lying at distances from less than 2 Kb to more than 40 Kb from brain transcripts with no SV overlap (the density of SVs meant that we found too few transcripts further than 60 Kb from an SV to analyze). Heritability attributable to SVs within 2 Kb of the gene is 2%, and falls as the distance from the gene increases.

We considered whether the lower estimates we obtained for the effect of SVs, compared to those obtained from array based assays, might be due to the differences in the way SVs were assessed. Using SVs from a genome-wide array based assessment of SVs in 12 classical strains, we calculated within and between strain variances15. Results, shown in Figure 4a, demonstrate a larger difference between within-strain and between-strain variances than seen using SVs from our sequence analysis. SVs assessed by arrays contribute to 25% of the variance in gene expression. Differences in the heritability estimates are thus due in part to the differences in the way SVs are called.

Our third observation on the phenotypic impact of SVs is that they are unlikely to be the causative variant at QTLs, as we know from genetic association with 100 phenotypes measured in over 2,000 heterogeneous stock (HS) mice22. We applied a test of functionality28 to 281,246 SVs where we were certain that the strain distribution pattern (SDP) was correct (Supplementary Methods). Relatively high rates of SV mutation10 might invalidate the imputation (the HS animals are at least 60 generations distant from the sequenced strains), so we genotyped 100 HS animals using a high-density array (ref. 16, Supplementary Methods). 194 deletions could be genotyped on the array (with an additional 47 deletions when we allow for non-segregating SVs in the HS). In every case imputation correctly predicted the logP obtained from ANOVA carried out using the array-based genotypes.

We identified 290 QTLs where SVs were among the variants most likely to be functional, but in all these cases the SVs were only a subset of the total number of functional variants. We found a small but significant deficit in SVs among the functional variants (0.36% compared to 0.54% among the non-functional, P < 1E-16, χ2 = 72.1).

While SVs make a relatively small contribution to the total amount of quantitative phenotypic variation, at a small number of QTLs they are the cause of variation. As shown in our companion paper29, larger effect QTLs are more likely to arise from SVs (and see Supplementary Figure 4a, 4b and 4c). We identified 12 QTLs where the SV overlapped a gene together with its flanking region (2 Kb upstream or downstream of the gene), and where the QTL effect size is in the top 5% of the distribution. Table 3 lists these SVs, the genes they affect and the putative phenotype with which they are associated. Complementation of the deletion of the H2-Ea promoter has confirmed the effect of this SV on the T-cell phenotype30. In one other case we have evidence in favour of a causative role for the SV: Eps15 -/- male mice exhibited significantly lower locomotor activity (Supplementary Fig. 5) compared to matched wild type male mice, indicating that the SV is likely the cause of the QTL.

SVs that disrupt coding exons

There are relatively few examples where an SV can be said unequivocally to delete one, or more, coding exons. Without nucleotide resolution accuracy it is often impossible to be certain whether the breakpoint of an SV lies within an exon, so to find SVs overlapping exons we used our most accurate and complete category of SV calls: deletions relative to the reference strain. Using this list, we started with 210 that overlap exons from Ensembl (Build 58); after removing pseudogenes, and anything not annotated as 'protein coding', we were left with just 24 SVs that affect coding exons, including six that encompass a gene (or several) in its entirety. Table 4 gives positional information for these SVs, the genes they affect (genes that are affected in their entirety are indicated by an asterisk), how they formed, their strain distribution pattern (SDP) and their known function as reported in the current literature.

Five of the 24 SVs are already known31-35; the remaining 19 SVs are novel. Remarkably, a third of the genes affected by these SVs are involved in immunity and infection. Figure 2c gives an example of how our data expands current knowledge of the molecular architecture of these SVs. The antiviral genes Trim5 and Trim12a are for the first time revealed as unique to C57BL/6J, due to segmental duplication36. All the other strains contain only the Trim12c gene. Therefore the mouse contains a unique homologue of the human TRIM5 gene, similarly to the rat, and the expansion of the Trim12 genes appeared only in the C57BL/6J lineage. A second example is our analysis of the beta defensin 8 gene (Defb8), another immune-related gene. Two alleles have been identified that differ by 3 base pair changes in the second exon37,38. Our analysis reveals that these documented exonic changes are linked to a previously undetected 3,192 bp deletion that includes the first exon of the gene.

Discussion

Our results are important in three respects: first we find an unexpectedly large number of SVs with diverse molecular architecture, thus providing a catalogue of the most dynamic and variable regions of the mouse genome. Second, we identify breakpoints at nucleotide level resolution, giving a genome wide picture of how SVs originate. Third, we demonstrate that, despite their abundance, SVs make relatively little functional impact, as assessed by their effects on gene expression and phenotypic variation in the whole animal.

We were able to find more SVs, of greater complexity, because we relied on manual inspection of the PEM results, combined with molecular validation, before using automated calling methods. Previous studies have revealed the noisiness of sequence-based methods of SV calling12,39,40, due in part to the multiplicity of forms, the presence of insertions, deletions and inversions often in close proximity to each other, and the difficulty of mapping sequence reads back to repetitive genomes. Nevertheless, we have shown here that it is possible to calibrate automated methods to generate genome-wide SV calls of high accuracy.

The SVs we find have two distinguishing characteristics. First, they are small. For deletions, whose size we know accurately, the median is 385 bp. In comparison, the median size of SVs in a recent high-density array analysis of the genomes of 20 laboratory strains was 9 Kb14 and about 1.9 Kb from a PEM analysis of DBA/2J12. Our size estimate is actually an upper limit, since it does not include SVs less than 100 bp, which we think could make up 10% of the total. Second, SV density means that we frequently find regions with high concentrations of small rearrangements. These two features emphasize the need for methods of SV identification at base pair, or near base pair resolution. Otherwise not only are many SVs missed, but those recognized are misclassified: a mixture of small deletions and insertions will be mistaken for a large SV of a single type16.

Our second important finding is the catalogue of SV mechanisms based on breakpoint sequence. We were able to map almost 60% of deletions to base pair resolution, allowing us to classify SVs by the mechanism that created them. We find that the primary origin of structural variation between mouse strains is attributable to LINE-1 retrotransposons. Mice differ from humans, in whom LINE-1 retrotransposition comes third after microhomology-mediated processes and nonallelic homologous recombination among the predominant processes generating SVs39. In contrast to human SV studies, the great majority of SVs we have discovered are non-recurrent rearrangements, based on two observations: among the classical strains, only 2.5% of deletions at the same locus had different breakpoint sequences, and less than 1% of deletions are due to NAHR, the mechanism thought to be responsible for the majority of recurrent SVs in humans24,25.

Our third important observation is that SVs have relatively little impact on gene function. Results from genome-wide association studies have revealed that common SNPs (minor allele frequency > 5%) explain only a part of trait heritability, suggesting that SVs might be a major unrecognized contributor to phenotypic variation41. Available evidence has not yet resolved whether or not this is so. Analysis of human lymphoblastoid cell lines attributed at least 8.5-17.7% of heritable gene expression variation to SVs42. Importantly, this heritability was not shared with common SNPs, potentially making SVs a contributor to missing heritability. In mice, SVs overlapping a gene were estimated to contribute to a substantial proportion of between-strain expression variance (up to 74%)15, which, together with the prevalence of SVs in the genome, implies that they might be responsible for a considerable fraction of heritable gene expression variance. If the genetic basis of gene expression were a model for understanding the molecular basis of other phenotypes, then SVs would be a major player. Two recent analyses of the association between SVs and disease phenotypes in humans did not support this view (common SVs were found to be no more likely than common SNPs to contribute to phenotypic variation1,43) but could not exclude a role for rare SVs (minor allele frequency < 5%) of large effect (odds ratio > 2).

Our findings add to this debate in three ways. First, we find that SVs overlapping a gene make a small contribution to variation in gene expression, accounting for less than 10%. This might be due to our analysis of very large numbers of small SVs, but we find that even when SVs overlap more than 50% of a gene they account for less than a third of the heritability. The most likely explanation is that previous array based studies conflated under one apparently large SV the effects of numerous smaller rearrangements together with regions of diploid DNA, containing other variants that influenced gene expression.

Second, our analysis of the phenotypic consequences of SVs on QTLs for multiple phenotypes also points to a relative deficit of SVs as the molecular basis of complex phenotypes. By working with an outbred population where all are descended from known progenitors, imputation effectively reconstitutes the genomes of all animals, so that we can detect the effects of all variants, both common and rare. Our results indicate that common and rare SVs make less of a contribution to phenotypic variation than we would expect given their abundance in the genome. However the outbred population we tested is derived from inbred progenitors whose homozygosity will have purged their genomes of variants that could be maintained in heterozygous freely mating populations.

Third, few SVs overlap exons. From our set of relative deletions we identified 24 that delete one or more exons. These SVs, with large effects on a phenotype, are the equivalent of rare variants found in human populations. In mouse populations they are very rare indeed. Since the frequency of insertions is equal to that of deletions, and since these two categories make up 98% of all SVs, we predict that there may be only about 50 SVs that directly overlap exons, or about 0.2% of the total burden of SVs in the genome.
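The arithmetic behind this prediction can be sketched as follows. The denominator is an assumption: the text does not say which total the 0.2% refers to, and the calculation below uses the average of 26,000 SVs per classical strain reported in the Results because it reproduces the quoted percentage.

```python
deletions_hitting_exons = 24

# Insertions are taken to be as frequent as deletions, so the count doubles.
predicted_exonic_svs = 2 * deletions_hitting_exons   # 48, quoted as "about 50"

# Assumption: total SV burden taken as the classical-strain average.
svs_per_classical_strain = 26_000
share = 50 / svs_per_classical_strain

print(predicted_exonic_svs)
print(f"{share:.2%} of the SV burden")
```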

Despite their rarity, SVs that cause phenotype change are likely to provide biological insights out of proportion to their relatively small contribution to phenotypic variance. Biological insight into a phenotype requires discovering which genes are involved. The task is considerably easier if the SV removes a coding segment, effectively creating a null allele. We expect that the alleles we have described will provide a starting point for investigating the relationship between phenotype and genotype in mice.

Methods Summary

SV discovery. We used a combination of four computational methods: split-read mapping44, mate-pair analysis45, single-end cluster analysis (SECluster and RetroSeq, unpublished), and read-depth46. These methods identify deletions, insertions, inversions and copy number gains. We also derived methods to recognize other types of rearrangements, such as inversion plus insertion or inversion plus deletion, newly revealed from our experimental analysis.

Experimental analysis. We inspected short-read sequencing data using LookSeq47 and manually detected SVs across mouse chromosome 19 in its entirety and a random set of other chromosomal regions. We analysed molecular structures of these SVs at nucleotide-level resolution using PCR and Sanger-based sequencing.

Outgroup analysis. We used the rat as an outgroup species to classify each mouse SV as either an ancestral deletion or an ancestral insertion. We predicted the ancestral state in the rat by estimating the size of the region in the rat genome that was homologous to the region that encompassed the mouse SV.
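The outgroup logic can be illustrated with a minimal sketch. It assumes that each SV has a long allele (containing the variable segment) and a short allele (lacking it), and that the size of the homologous rat region has already been estimated. If the rat region is close in size to the long allele, the segment is taken to be ancestral and the SV is called an ancestral deletion; if it is close to the short allele, the SV is called an ancestral insertion. The function name, the tolerance parameter and its default value are hypothetical and are not taken from the Supplementary Information.

```python
def classify_ancestral_event(long_allele_bp: int,
                             short_allele_bp: int,
                             rat_region_bp: int,
                             tolerance: float = 0.25) -> str:
    """Classify a mouse SV from the size of the homologous rat region.

    tolerance is a hypothetical cut-off, expressed as a fraction of the SV
    length, for how far the rat region may deviate from an allele size.
    """
    if not long_allele_bp > short_allele_bp > 0:
        raise ValueError("expected long_allele_bp > short_allele_bp > 0")

    sv_length = long_allele_bp - short_allele_bp
    limit = tolerance * sv_length
    diff_long = abs(rat_region_bp - long_allele_bp)
    diff_short = abs(rat_region_bp - short_allele_bp)

    # Rat resembles the long allele: the segment is ancestral, so strains
    # lacking it carry a deletion.
    if diff_long <= limit and diff_long < diff_short:
        return "deletion"
    # Rat resembles the short allele: the segment was gained in the mouse.
    if diff_short <= limit and diff_short < diff_long:
        return "insertion"
    return "unresolved"


print(classify_ancestral_event(2000, 1000, 1950))  # rat matches the long allele
print(classify_ancestral_event(2000, 1000, 1020))  # rat matches the short allele
print(classify_ancestral_event(2000, 1000, 1500))  # ambiguous size
```

Returning an explicit "unresolved" label keeps ambiguous regions out of both classes instead of forcing a call.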

SV classification. We developed a machine learning method to classify SVs. The method used a random forest classifier, trained using sequence features within the SVs. Microhomology between breakpoints was determined by recording the longest sequence of bases that was identical between each breakpoint of each SV.
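One common way to measure breakpoint microhomology for a deletion is sketched below; it is an illustration under stated assumptions, not necessarily the procedure used in the full methods. The deletion is assumed to be given as a 0-based, end-exclusive interval on a reference string. The sketch extends rightwards (start of the deleted segment against the sequence after the deletion) and leftwards (sequence before the deletion against the end of the deleted segment) for as long as the bases are identical. The function name and the toy sequence are hypothetical; only the GAACTA motif is borrowed from the example in Figure 2b.

```python
def breakpoint_microhomology(ref: str, start: int, end: int) -> str:
    """Return the sequence shared by both breakpoints of the deletion ref[start:end]."""
    if not 0 <= start < end <= len(ref):
        raise ValueError("deletion interval must lie within the reference")
    span = end - start

    # Rightward extension: deleted segment start vs. sequence after the deletion.
    right = 0
    while (right < span and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1

    # Leftward extension: sequence before the deletion vs. deleted segment end.
    left = 0
    while (left < span - right and start - 1 - left >= 0
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1

    return ref[start - left:start + right]


# Toy reference (invented flanks): a 14 bp deletion bounded by GAACTA repeats.
toy_ref = "ACGT" + "GAACTA" + "CCCCCCCC" + "GAACTA" + "TTGC"
print(breakpoint_microhomology(toy_ref, 4, 18))
print(breakpoint_microhomology(toy_ref, 7, 21))
```

Both calls describe the same deletion with differently placed coordinates, and both recover the same six bases, which is why the extension is run in both directions.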

Functional impact of SVs. We tested whether an SV is likely to be functional using merge analysis28. The variances of expression data were calculated using ANOVA in the statistical software R using formulae described in ref. 15 and also by comparing a model where the expression value is explained by the strain, to a model in which the expression is explained by strain and whether or not the animal has an SV.
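The idea of asking how much of the between-strain expression variation is attributable to SV status can be sketched as a sum-of-squares partition. The example below is a simplified illustration in Python rather than R and does not reproduce the formulae of ref. 15 or the exact model comparison used here. It assumes replicate expression values per strain and a single presence/absence call per strain; all names and the toy data are hypothetical.

```python
from statistics import mean


def sv_share_of_strain_variation(expression: dict, has_sv: dict) -> dict:
    """Partition expression variation for one transcript.

    expression: strain name -> list of replicate expression values
    has_sv:     strain name -> True if the strain carries the SV
    """
    grand_mean = mean(v for values in expression.values() for v in values)

    # Between-strain and within-strain sums of squares (one-way ANOVA).
    ss_strain = sum(len(values) * (mean(values) - grand_mean) ** 2
                    for values in expression.values())
    ss_within = sum((v - mean(values)) ** 2
                    for values in expression.values() for v in values)

    # Sum of squares explained by SV status alone (carriers vs. non-carriers).
    groups = {True: [], False: []}
    for strain, values in expression.items():
        groups[has_sv[strain]].extend(values)
    ss_sv = sum(len(g) * (mean(g) - grand_mean) ** 2
                for g in groups.values() if g)

    share = ss_sv / ss_strain if ss_strain else 0.0
    return {"ss_strain": ss_strain, "ss_within": ss_within,
            "ss_sv": ss_sv, "sv_share": share}


# Toy data: strain differences coincide exactly with SV status.
toy_expression = {"A": [1, 1], "B": [1, 1], "C": [3, 3], "D": [3, 3]}
toy_sv = {"A": False, "B": False, "C": True, "D": True}
result = sv_share_of_strain_variation(toy_expression, toy_sv)
print(result["sv_share"])
```

Because SV status is constant within an inbred strain, the SV sum of squares can never exceed the between-strain sum of squares, so the share lies between 0 and 1.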

Full methods are provided in Supplementary Information.

References

1 Conrad, D. F. et al. Origins and functional impact of copy number variation in the human genome. Nature 464, 704-712 (2010).
2 Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010).
3 Kidd, J. M. et al. Mapping and sequencing of structural variation from eight human genomes. Nature 453, 56-64 (2008).
4 Mills, R. E. et al. Mapping copy number variation by population-scale genome sequencing. Nature 470, 59-65 (2011).
5 Redon, R. et al. Global variation in copy number in the human genome. Nature 444, 444-454 (2006).
6 Buchanan, J. A. & Scherer, S. W. Contemplating effects of genomic structural variation. Genet Med 10, 639-647 (2008).
7 Hurles, M. E., Dermitzakis, E. T. & Tyler-Smith, C. The functional impact of structural variation in humans. Trends Genet 24, 238-245 (2008).
8 Zhang, F., Gu, W., Hurles, M. E. & Lupski, J. R. Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10, 451-481 (2009).
9 Cutler, G., Marshall, L. A., Chin, N., Baribault, H. & Kassner, P. D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res 17, 1743-1754 (2007).


10 Egan, C. M., Sridhar, S., Wigler, M. & Hall, I. M. Recurrent DNA copy number variation in the laboratory mouse. Nat Genet 39, 1384-1389 (2007).
11 Graubert, T. A. et al. A high-resolution map of segmental DNA copy number variation in the mouse genome. PLoS Genet 3, e3 (2007).
12 Quinlan, A. R. et al. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res 20, 623-635 (2010).
13 Snijders, A. M. et al. Mapping segmental and sequence variations among laboratory mice using BAC array CGH. Genome Res 15, 302-311 (2005).
14 Cahan, P., Li, Y., Izumi, M. & Graubert, T. A. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells. Nat Genet 41, 430-437 (2009).
15 Henrichsen, C. N. et al. Segmental copy number variation shapes tissue transcriptomes. Nat Genet 41, 424-429 (2009).
16 Agam, A. et al. Elusive copy number variation in the mouse genome. PLoS One 5 (2010).
17 Akagi, K., Li, J., Stephens, R. M., Volfovsky, N. & Symer, D. E. Extensive variation between inbred mouse strains due to endogenous L1 retrotransposition. Genome Res 18, 869-880 (2008).
18 Gilbert, N., Lutz-Prigge, S. & Moran, J. V. Genomic deletions created upon LINE-1 retrotransposition. Cell 110, 315-325 (2002).
19 Symer, D. E. et al. Human L1 retrotransposition is associated with genetic instability in vivo. Cell 110, 327-338 (2002).
20 Garvey, S. M., Rajan, C., Lerner, A. P., Frankel, W. N. & Cox, G. A. The muscular dystrophy with myositis (mdm) mouse mutation disrupts a skeletal muscle-specific domain of titin. Genomics 79, 146-149 (2002).
21 Schadt, E. E. et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 37, 710-717 (2005).
22 Valdar, W. et al. Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat Genet 38, 879-887 (2006).
23 Wong, K., Keane, T. M., Stalker, J. & Adams, D. J. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 11, R128 (2010).
24 Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu Rev Med 61, 437-455 (2010).
25 Gu, W., Zhang, F. & Lupski, J. R. Mechanisms for human genomic rearrangements. Pathogenetics 1, 4 (2008).
26 Zhang, F. et al. The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans. Nat Genet 41, 849-853 (2009).
27 Huang, G. J. et al. High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues. Genome Res 19, 1133-1140 (2009).


28 Yalcin, B., Flint, J. & Mott, R. Using progenitor strain information to identify quantitative trait nucleotides in outbred mice. Genetics 171, 673-681 (2005).
29 Keane, T. Sequence variation amongst 17 laboratory and wild-derived mouse genomes and its effect on gene regulation and phenotypic variation. Nature (2011).
30 Yalcin, B. et al. Commercially available outbred mice for genome-wide association studies. PLoS Genet 6 (2010).
31 Best, S., Le Tissier, P., Towers, G. & Stoye, J. P. Positional cloning of the mouse retrovirus restriction gene Fv1. Nature 382, 826-829 (1996).
32 Boyden, L. M. et al. Skint1, the prototype of a newly identified immunoglobulin superfamily gene cluster, positively selects epidermal gammadelta T cells. Nat Genet 40, 656-662 (2008).
33 Nelson, T. M., Munger, S. D. & Boughter, J. D., Jr. Haplotypes at the Tas2r locus on distal chromosome 6 vary with quinine taste sensitivity in inbred mice. BMC Genet 6, 32 (2005).
34 Persson, K., Heby, O. & Berger, F. G. The functional intronless S-adenosylmethionine decarboxylase gene of the mouse (Amd-2) is linked to the ornithine decarboxylase gene (Odc) on chromosome 12 and is present in distantly related species of the genus Mus. Mamm Genome 10, 784-788 (1999).
35 Wu, B. et al. Mutations in sterol O-acyltransferase 1 (Soat1) result in hair interior defects in AKR/J mice. J Invest Dermatol 130, 2666-2668 (2010).
36 Tareen, S. U., Sawyer, S. L., Malik, H. S. & Emerman, M. An expanded clade of rodent Trim5 genes. Virology 385, 473-483 (2009).
37 Bauer, F. et al. Structure determination of human and murine beta-defensins reveals structural conservation in the absence of significant sequence similarity. Protein Sci 10, 2470-2479 (2001).
38 Taylor, K. et al. Defensin-related peptide 1 (Defr1) is allelic to Defb8 and chemoattracts immature DC and CD4+ T cells independently of CCR6. Eur J Immunol 39, 1353-1360 (2009).
39 Kidd, J. M. et al. A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837-847 (2010).
40 Korbel, J. O. et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318, 420-426 (2007).
41 Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747-753 (2009).
42 Stranger, B. E. et al. Population genomics of human gene expression. Nat Genet 39, 1217-1224 (2007).
43 Craddock, N. et al. Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls. Nature 464, 713-720 (2010).
44 Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 25, 2865-2871 (2009).
45 Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 6, 677-681 (2009).


46 Simpson, J. T., McIntyre, R. E., Adams, D. J. & Durbin, R. Copy number variant detection in inbred strains from short read sequence data. Bioinformatics 26, 565-567 (2010).
47 Manske, H. M. & Kwiatkowski, D. P. LookSeq: a browser-based viewer for deep sequencing data. Genome Res 19, 2125-2132 (2009).

Supplementary Information is linked to the online version of the paper at www.nature.com/nature. Supplementary Information contains Supplementary Figures and Tables, additional Methods, and Supplementary References.

Acknowledgements

We thank Adam Whitley, Giles Durrant, Andrew Marc Hammond, Danica Joy Fabrigar, Lucia Chen, Martina Johannesson and Enzhao Cong for helping B.Y. with various laboratory-based work. We also thank Chris P. Ponting for comments on the manuscript. This project was supported by The Medical Research Council, UK and the Wellcome Trust. D.J.A. is supported by Cancer Research UK.

Author contributions

D.J.A. and J.F. conceived the study and directed the research. B.Y., K.W., A.A., R.M., M.G. and J.F. wrote the paper. K.W. and T.K. performed the genome-wide SV discovery. A.B., P.H.P., H.W., J.C., R.D. and D.J. carried out experimental work, led by B.Y. A.B. and B.Y. analysed Sanger-based sequencing data and resolved SV breakpoints at nucleotide-level resolution. M.G. performed the genome-wide SV mechanism of formation and outgroup analysis. A.A. and J.F. analysed functional impact of SVs on expression and phenotypes. L.G. and R.M. carried out additional analyses. J.N. and B.Y. characterised function of individual SV examples.

Author information

Data sets described here will be available under study accession number estd118 from the Database of Genomic Variants archive (DGVa) at http://www.ebi.ac.uk/dgva/page.php. Reprints and permissions information is available at www.nature.com/reprints. Readers are welcome to comment on the online version of this article at www.nature.com/nature. Correspondence and requests for materials should be addressed to J.F. ([email protected]).


Tables

Table 1: Structural variants greater than 100 bp in 17 inbred strains

                Simple                       Complex
Strain          del     gain   inv   ins     del+ins   nested   inv+del/ins
129P2/OlaHsd    16292   57     74    15604   105       27       68
129S1/SvImJ     17307   70     88    11516   73        32       67
129S5/SvEvBrd   16089   72     67    8970    43        41       58
A/J             16190   69     92    12184   61        28       67
AKR/J           15806   88     89    14576   88        13       82
BALB/cJ         14859   82     87    10551   48        17       58
C3H/HeJ         16062   94     94    12100   90        16       76
C57BL/6N        164     44     6     213     0         3        1
CAST/EiJ        50978   361    224   34122   133       239      265
CBA/J           16996   79     83    10867   64        16       78
DBA/2J          17478   67     83    10559   55        29       75
LP/J            16964   64     88    12745   64        30       69
NOD/ShiLtJ      17047   51     116   13244   53        16       79
NZO/HlLtJ       15429   62     71    9445    33        23       62
PWK/PhJ         54147   96     272   35098   184       60       268
SPRET/EiJ       91295   112    470   64304   463       110      554
WSB/EiJ         22154   88     97    12521   64        37       105

del=deletion; gain=copy number gain; inv=inversion; ins=insertion; del+ins=deletion plus insertion; nested=SV in a copy number gain region; inv+del/ins=inversion plus deletion(s) or inversion plus insertion.


Table 2: SV classification and inferred mechanism of formation

a Sequence features at breakpoints
Ancestral Events: CNG, multiple, Deletion, Insertion, Inversion, LINE, ERV, SINE, VNTR

Target site duplication
none         6.7%    6.7%    0.0%
4-10 bp      13.3%   93.3%   15.4%
11-20 bp     78.3%   0.0%    84.6%
>20 bp       1.7%    0.0%    0.0%

Microdeletion
none         95.0%   93.3%   92.3%   92.3%
1-34 bp      5.0%    6.7%    7.7%    7.7%
>200 bp      1.7%    0.0%    0.0%    0.0%

Microhomology
none         0.0%    15.9%   50.0%   12.5%   0.0%
1-2 bp       0.0%    15.0%   12.5%   37.5%   25.0%
3-25 bp      23.1%   69.0%   37.5%   50.0%   50.0%
26-200 bp    76.9%   0.9%    0.0%    0.0%    0.0%
>200 bp      0.0%    0.9%    0.0%    0.0%    0.0%

Microinsertion
none         69.9%   62.5%   87.5%   25.0%
1-10 bp      23.0%   37.5%   12.5%   50.0%
11-50 bp     8.0%    0.0%    0.0%    0.0%
>51 bp       0.9%    0.0%    0.0%    0.0%

Total (249 SV regions)   60   30   13   13   113   8   8   4

b Inferred mechanisms          Total
Retrotransposition
  LINE Retrotransposition      24.5%
  ERV Retrotransposition       12.0%
  SINE Retrotransposition      5.2%
SRS                            5.2%
MMEJ, MMBIR                    31.3%
NHEJ                           13.3%
SSA                            0.4%
NAHR                           0.4%
FoSTeS/others                  0.8%   3.2%   3.2%   1.2%

This detailed classification is based on the 249 SVs resolved at nucleotide-level resolution (Supplementary Table 7). MMEJ=Microhomology-mediated end joining; NHEJ=Non-homologous end joining; FoSTeS=fork stalling and template switching; MMBIR=Microhomology-mediated break-induced replication; NAHR=Non-allelic homologous recombination; SRS=Serial replication slippage; SSA=Single strand annealing; CNG=Copy number gain.


Table 3: QTLs associated with SVs

Phenotype                                  Chr  SV start   SV stop    Ancestral Event  Gene           SV overlap  LogP
Mean platelet volume                       1    175158884  175158885  insertion        Fcer1a         upstream    52.833
OFT Total activity                         2    144402772  144402974  SINE insertion   Sec23b         intron      15.721
Hippocampus cellular proliferation marker  4    49690364   49690365   SINE insertion   Grin3a         intron      20.119
Home cage activity                         4    108951264  108951265  ERV insertion    Eps15          upstream    15.922
T-cells: %CD3                              4    130038389  130038390  SINE insertion   Snrnp40        intron      12.129
Wound healing                              7    90731819   90731820   ERV insertion    Tmc3           upstream    22.216
Red cells: mean cellular haemoglobin       7    111398000  111480000  insertion        Trim5          exon        13.016
Red cells: mean cellular haemoglobin       7    111504957  111505193  deletion         Trim30b        UTR         12.806
Red cells: mean cellular volume            8    87957244   87957245   LINE insertion   4921524J17Rik  upstream    18.141
Serum urea concentration                   11   115106122  115106250  deletion         Tmem104        UTR         13.404
Hippocampus cellular proliferation marker  13   113783196  113783359  deletion         Gm6320         upstream    17.456
T-cells: CD4/CD8 ratio                     17   34483680   34483681   deletion         H2-Ea          upstream    82.858

Start and stop coordinates are given for build37 of the mouse genome, so that insertions into the reference are given as consecutive base pairs (columns headed SV start and SV stop). The part of the gene overlapped is reported in the column headed SV overlap. LogP is the negative logarithm of the P-value for association between the SV and the phenotype as assessed in outbred HS mice 22.


Table 4: SVs affecting coding regions

Strain order for the distribution pattern: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J, LP/J, 129P2/OlaHsd, 129S1/SvImJ, 129S5/SvEvBrd, NOD/ShiLtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ, SPRET/EiJ, WSB/EiJ.

MGI gene name  Chr  SV Bp Start  SV Bp Stop  SV Length  Ancestral SV State  Strain distribution pattern        Known function
Soat1          1    158394619    158401436   6818       del                 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0  Hair interior defects
Olfr1055       2    86179904     86186982    7079       ins                 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0  Olfaction
Fcrl5          3    87245076     87245948    873        del                 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0  Infection and immunity
Nes            3    87780517     87780656    140        VNTR                0 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 1  Brain development
Pglyrp3        3    91831864     91835385    3522       del                 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0  Infection and immunity
Skint4,3,9*    4    111731004    112272814   541811     ins                 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0  Infection and immunity
Fv1            4    147244399    147245739   1341       del+ins             0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1  Infection and immunity
Ugt2b38        5    87850560     87854999    4440       complex             1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 0  Metabolism
Klrb1a         6    128559593    128559740   148        ins                 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1  Infection and immunity
Klri2          6    129689528    129691211   1684       del                 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0  Infection and immunity
Tas2r120*      6    132593115    132613777   20663      del                 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0  Taste
Tas2r103       6    132985439    132986718   1280       del                 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0  Taste
Zfp607*        7    28646762     28671649    24888      del                 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0  DNA-binding
Krtap5-5       7    149415125    149415279   155        VNTR                1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1  Hair formation
Trim5, 12a*    7    111398000    111480000   82001      ins                 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0  Infection and immunity
Defb8          8    19447386     19450577    3192       ins+del             0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1  Infection and immunity
Zfp872         9    22004938     22004981    44         del                 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0  DNA-binding
Olfr913        9    38402597     38403495    899        del                 1 1 1 1 0 0 1 0 1 1 1 1 0 0 0 0 0  Olfaction
Rtp3           9    110888888    110889308   421        VNTR                0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0  Bone density
Nlrp1c*        11   71046193     71101410    55218      ins                 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0  Embryonic development
Fam110c        12   31759374     31759699    326        VNTR                0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 0  Cell migration
Olfr234        15   98328544     98328892    349        del                 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0  Olfaction
Krtap16-1      16   88874115     88874526    412        VNTR                1 0 1 1 1 0 0 1 0 0 0 1 0 0 0 0 1  Hair formation
Amd2*          18   64607761     64609669    1909       ins                 1 0 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1  Biosynthesis of polyamines

MGI is Mouse Genome Informatics. Ins: insertion; del: deletion; VNTR: variable number tandem repeat. The strain distribution pattern relative to the ancestral state is given for all strains: “1” referring to presence, “0” to absence and “2” to an additional allele. * indicates that the structural variant overlaps the entire gene.


Figure Legends

Figure 1: Identification of structural variants a) Venn diagram showing the overlap between deletion SVs (relative to C57BL/6J) detected in our study (blue) and those published elsewhere (Agam et al, 2010 in red and Quinlan et al in green), in DBA/2J. b) Basic rearrangements: deletion (del), insertion (ins), inversion (inv), tandem duplication (tandem dup) and other types of copy number gains. Inverted tandem duplication is drawn in Supplementary Figure 1 (H9). c) Complex rearrangements: deletion co-occurring with an insertion (del+ins), linked deletion (normal copy of small size flanked by two deletions), deletion within a gain (del in gain), inversion with flanking deletions (del+inv+del), inversion with an insertion (inv+ins), inversion within a gain (inv in gain), linked insertion (linked ins) where the inserted sequence is copied from nearby. Inverted linked insertion (Q9; drawn in Supplementary Fig. 1) has a similar pattern to a linked insertion but the inserted sequence is inverted. d) PEM pattern of a del+inv+del. Green arrows represent primers used for PCR amplification and sequencing reactions. e) PEM pattern of an inv+ins, with PCR data across the eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J). Hyperladder II is used as molecular marker. Amplicon size for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than the other strains, indicative of the insertion. Complete list of PEM patterns is given in Supplementary Figure 1.


Figure 2: Experimental analysis of SVs a) Complex SV, involving several genomic rearrangements including an inversion, a deletion and two small insertions, is displayed relative to its genic location along Zbtb10, a Zinc finger and BTB domain containing 10 gene. PCR amplification using forward (F) and reverse (R) primers revealed an AT insertion at the first breakpoint B1, followed by an inversion of 125 bp that comprises an inverted linked insertion of the 22 bp region, as seen in B2. The third breakpoint (B3) revealed a deletion of 813 bp. Hyperladder II was used as the size marker. C57BL/6J and LP/J show a normal size of 1604 bp, whereas A/J, AKR/J, BALB/cJ, C3H/HeJ, CBA/J and DBA/2J show a smaller band at 793 bp. b) Relationship between SNP and SV formation. Two SNPs lying on the 6 bp microhomology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. Sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6J). Note that the traces of all other test strains are identical to the one shown here. Asterisks are used to emphasize the microhomology of 6 bp (GAACTA). The presence of two SNPs (C->G and T->A) in all test strains (here only shown in A/J) is associated with the presence of the ancestral deletion. c) Schematic representation of the Trim6 to Trim30 gene cluster on chromosome 7. Boxes represent the sequential positions of the Trim6, Trim34, Trim5/12 and Trim30 genes. The Trim5 and Trim12a genes are present only in the C57BL/6J genome, having arisen by segmental duplication of the Trim12c gene, which is present in all 17 strains. The flanking Trim34 and Trim30 genes do not vary between strains.

Figure 3: Classification of structural variants a) Histogram of lengths for each deletion SV class. b) Microhomology surrounding SV breakpoints. SVs were classified as in (a) and the longest length of microhomology between both breakpoints was recorded.

Figure 4: Impact of SVs on gene expression a) Between-strain (grey boxes) and within-strain (white boxes) gene expression variances for transcripts which are not overlapped by any structural variant (No SV) and for those which are. Six categories are shown: deletions (Dels), insertions (Ins), copy number gains (Gains), inversions (Inv), complex rearrangements (Complex), and SVs (of any class) identified by an array analysis (Array:SVs). The difference between the two variances is a measure of heritability. b) Effect of distance from the transcript on gene expression variances. Grey boxes are between-strain and white boxes are within-strain variances. The figure shows standardized variances of gene expression for transcripts with structural variants at distances from less than 2 Kb to more than 40 Kb from either the start or end of the transcript.


Figure 1. Identification of structural variants (panels a-e: reference and alternative alleles for basic and complex rearrangements; PEM patterns of a del+inv+del and of an inv+ins, with gel lanes for the eight classical strains).

Figure 2. Experimental analysis of SVs (panels a-c: complex SV in Zbtb10 with breakpoints B1-B3; sequencing traces for A/J and C57BL/6J; chromosome 7 Trim6 to Trim30 locus).

Figure 3. SVs length and SV breakpoints microhomology (panels a, b).

Figure 4. Impact of SVs on gene expression (panel a: effect attributable to SV type; panel b: effect attributable to distance from SV; y-axis: scaled variances).