<<

-level assembly of thaliana Ler reveals the extent of translocation and inversion polymorphisms

Luis Zapataa,b,1, Jia Dingc,1,2, Eva-Maria Willingd, Benjamin Hartwigd, Daniela Bezdana,b, Wen-Biao Jiaod, Vipul Pateld,3, Geo Velikkakam Jamesd,4, Maarten Koornneefc,e,5, Stephan Ossowskia,b,5, and Korbinian Schneebergerd,5

aBioinformatics and Programme, Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, 08003 Barcelona, ; bUniversitat Pompeu Fabra, 08002 Barcelona, Spain; cDepartment of Breeding and , Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany; dDepartment of Plant , Max Planck Institute for Plant Breeding Research, 50829 Cologne, Germany; and eLaboratory of Genetics, Wageningen University, NL-6708 PE, Wageningen, The Netherlands

Contributed by Maarten Koornneef, May 12, 2016 (sent for review February 18, 2016; reviewed by Ian R. Henderson and Yves Van de Peer) Resequencing or reference-based assemblies reveal large parts of laboratories, but most likely all derived from the original mutant the small-scale sequence variation. However, they typically fail to isolated by Rédei, and we will collectively refer to them as “Ler.” separate such local variation into colinear and rearranged variation, First comparative analyses of the Ler included cyto- because they usually do not recover the complement of large-scale genetic studies using pachytene cells and in situ hybridization (4, rearrangements, including transpositions and inversions. Besides the 5), suggesting a large inversion on the short arm of chromosome availability of hundreds of of diverse Arabidopsis thaliana 4, as well as differences between 5S rDNA clusters compared accessions, there is so far only one full-length assembled genome: with the genome of Col-0 (4). The first large-scale analysis of the the reference sequence. We have assembled 117 Mb of the A. thaliana Ler genome sequence was published together with the Col-0 ref- Landsberg erecta (Ler) genome into five chromosome-equivalent erence sequence in 2000 (6). Within 92 Mb of random shotgun sequences using a combination of short Illumina reads, long PacBio dideoxy reads, 25,274 SNPs and 14,570 indels were reads, and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inver- Significance sions comprising ∼3.6 Mb, in addition to 4.1 Mb of nonreference sequence, mostly originating from duplications. Although rearranged Despite widespread reports on deciphering the sequences of regions are not different in local divergence from colinear regions, all kinds of genomes, most of these reconstructed genomes they are drastically depleted for meiotic recombination in heterozy- rely on a comparison of short DNA sequencing reads to a ref- gotes. Using a 1.2-Mb inversion as an example, we show that such erence sequence, rather than being independently reconstructed. rearrangement-mediated reduction of meiotic recombination can This method limits the insights on genomic differences to local, lead to genetically isolated haplotypes in the worldwide population mostly small-scale variation, because large rearrangements are of A. thaliana. Moreover, we found 105 single-copy , which likely overlooked by current methods. We have de novo assembled were only present in the reference sequence or the Ler assembly, thegenomeofacommonstrainofArabidopsis thaliana Landsberg and 334 single-copy orthologs, which showed an additional copy erecta and revealed hundreds of rearranged regions. Some of in only one of the genomes. To our knowledge, this work gives these differences suppress meiotic recombination, impacting the first insights into the degree and type of variation, which will haplotypes of a worldwide population of A. thaliana. In addition be revealed once complete assemblies will replace resequencing to sequence changes, this work, which, to our knowledge is the or other reference-dependent methods. first comparison of an independent, chromosome-level assembled A. thaliana genome, revealed hundreds of unknown, accession- de novo assembly | Arabidopsis | PacBio sequencing | inversions | specific genes. absence/presence polymorphisms

Author contributions: M.K., S.O., and K.S. designed research; B.H. and D.B. performed andsberg erecta (Ler) is presumably the second-most-used research; L.Z., J.D., E.-M.W., W.-B.J., V.P., G.V.J., S.O., and K.S. analyzed data; and L.Z. and Lstrain of Arabidopsis thaliana after the reference accession K.S. wrote the paper. Columbia (Col-0). It is broadly known as Ler-0, which is an ab- Reviewers: I.R.H., University of Cambridge; and Y.V.d.P., Ghent University, Vlaams Insti- breviation for its accession code La-1 and a mutation in the tuut voor Biotechnologie. ERECTA gene. In 1957, George Rédei, at the University of The authors declare no conflict of interest. Missouri–Columbia, irradiated La-1 samples, which were pro- Freely available online through the PNAS open access option. vided by Friedrich Laibach and were collected in Landsberg an Data deposition: Raw sequencing data have been deposited in the NCBI Sequence Read Archive [accession nos. SRX1567556 (180-bp fragment library), SRX1567557 (8-kb jumping der Warthe (now called Gorzów Wielkopolski), Poland, where library), SRX1567558 (20-kb jumping library), and SRX1567559 (raw reads with unknown Ler-0–related genotypes are still present (1). Some of the insert size)]. The Whole Genome Shotgun project has been deposited at DNA Data Bank original batch were irradiated with X-rays, resulting, among of Japan/European Nucleotide Archive/GenBank (accession no. LUHQ00000000; the ver- others, in the isolation of the erecta (er) mutant (2, 3). Will sion described in this paper is version LUHQ01000000). Feenstra received this mutant from George Rédei in 1959 be- 1L.Z. and J.D. contributed equally to this work. cause he was interested in its erect growth habit and introduced 2Present address: Centre for Human Genetics, University Hospital Leuven, and Depart- it as the standard strain in the Department of Genetics at ment of Human Genetics, Leuven, 3000 Leuven, Belgium. 3 Wageningen University. There, he started a mutant induction Present address: KWS SAAT SE, Research & Development-Data Management, 37574 Einbeck, Germany. program that was later continued by Jaap van der Veen and 4Present address: Rijk Zwaan Research & Development Fijnaart, Rijk Zwaan, 4793 RS, Maarten Koornneef. Mutants from this program as well as pa- Fijnaart, The Netherlands. rental lines of recombinant inbred lines were mainly distributed 5To whom correspondence may be addressed. Email: [email protected], stephan. as Ler-0 lines to other laboratories, reflecting the increasing in- [email protected], or [email protected]. A. thaliana. er terest in Some descendants of L -0 were later This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. renamed to Ler-1 and -2 to identify genotypes used in different 1073/pnas.1607532113/-/DCSupplemental.

E4052–E4060 | PNAS | Published online June 27, 2016 www.pnas.org/cgi/doi/10.1073/pnas.1607532113 Downloaded by guest on September 29, 2021 identified. Although this was a severe underestimation (7–11), the Table 1. Assembly statistics and comparison with the PNAS PLUS authors already observed that many of the large indels contained reference sequence entire active genes, half of which were found at different loci in the Feature Col-0 Ler genome of Ler, whereas others were entirely absent (6). The advent of next-generation sequencing greatly expanded Chromosome scaffolds 5 5 the knowledge of natural genetic diversity in A. thaliana (reviewed Unplaced scaffolds 0 25 in refs. 12 and 13). However, genome-wide studies on gene Assembly length 119,146,348 118,890,721 absence/presence polymorphisms were not repeated, because Ambiguous bases 185,738 1,777,652 short-read analyses focused on small-scale changes only. To Gene no. 27,416 27,170 resolve large variation, reference-guided assemblies (8, 9, 14) and structural variation-identifying tools (15) were introduced. However, such methods mostly reveal local differences, which do which allowed us to generate five chromosome-representing se- not include the complement of large-scale rearrangements in- quences using stretches of Ns as indication of assembly breaks cluding inversions or transpositions. (Table 1 and SI Appendix, SI Materials and Methods and Fig. S1). er So far, two de novo assemblies of L have been published. Because some of the short scaffolds were too small and did not The first was based on Illumina short-read data and resulted in overlap with enough markers, we introduced an additional seven an assembly with an N50 of 198 kb, showing similar performance scaffolds with a combined length of 1.4 Mb into the recon- as a reference-guided assembly (8). The second was based on a ’ structed based on homology information from set of previously released Pacific Bioscience s single-molecule, Col-0. The final assembly consisted of five pseudomolecules and real-time sequencing (PacBio) data (16) and assembled the ge- 25 unplaced scaffolds (including scaffolds representing the or- nome into 38 contigs with an N50 of 11.2 Mb (17). This drastic ganelle genomes) with a combined length of 118.9 Mb, including improvement outlines the potential of long-read sequencing <2 Mb of ambiguous bases. technologies such as PacBio sequencing (18) to overcome the Within the assembly, we found sequence similarity to telomeric limitations of short-read methods because long reads can span repeats at five of eight chromosome ends without heterochromatic many of the repetitive regions, which are presumably the most nuclear organizing region (i.e., shorter arms of chromosomes 2 and common reason for assembly breaks in short-read assemblies. 4) and centromeric repeat sequences and rDNA clusters within Alternatively, low-fold long-read sequencing could be combined most of the pericentromeric regions (Fig. 1). Telomeric repeats with cheaper short-read sequencing, either by using the short reads were also found as interstitial repeats near or in the pericentro- for error-correcting the long reads (19) or for integrating long-read meric regions, which is similar to their location in the Col-0 ge- information into short-read assemblies or vice versa (20). nome (27). Interestingly, we found a few rDNA copies in the Despite the unprecedented contiguity of this long-read as- upper arm of chromosome 3, revealing the location of a (not fully sembly, both earlier studies focused on methodological aspects er assembled) rDNA cluster, which is in agreement with earlier and did not perform any whole-genome comparisons of the L er genome or gene annotations, and, as a consequence, compre- findings of a L -specific rDNA locus in this region (4). To test the hensive reports on large-scale rearrangements and nonreference assembly quality further, we analyzed nucleotide-binding leucine- genes are still sparse. We have generated an advanced de novo rich repeat (NB-LRR) gene loci, which include regions that are assembly of Ler consisting of 117 Mb arranged into five se- known to be structurally diverse between accessions and are quences representing the five chromosomes, which is based on a challenging to assemble because of high levels of local repeats combination of short-read assembly, long-read-based gap clo- (28). For 154 of the 159 genes, we could identify the orthologous er sure, and scaffolding based on genetic maps. This chromosome- regions in the L assembly. Only 10 (6%) of these regions in- scale assembly and its comparison with the reference assembly cluded ambiguous sequences, which would indicate failures in revealed features that are typically not analyzed within next- sequence assembly. In contrast, 49 (31%) of them revealed dif- > generation sequencing assemblies, including the location of a ferences 100 bp, corroborating their strong divergence. er polymorphic rDNA cluster and centromeric repeats, as well as Finally, we used L wild-type RNA-sequencing (RNA-seq) the exact makeup of all large rearrangements including a 1.2-Mb reads from two public datasets (9, 29) and homology to Col-0 pro- inversion on the short arm of chromosome 4. This inversion tein sequences to annotate 27,170 -coding genes in the suppresses meiotic recombination in Ler and Col-0 hybrids, and assembly of Ler (compared with 27,416 protein-coding genes in the we show that this suppression introduced genetically isolated reference annotation) (SI Appendix, SI Materials and Methods). inversion haplotypes into the worldwide population of A. thaliana. De novo gene annotation revealed hundreds of copy-number The Complement, Divergence, and Impact of Nonallelic Rearrangements. polymorphisms as well as novel genes that are entirely absent in In contrast to resequencing, full-length assemblies facilitate direct one or the other genome. Finally, we report on variation in dif- identification of large-scale genomic rearrangements, including ferent Ler genomes, suggesting that some Ler lines feature un- transpositions and inversions. These types of higher-order differ- expected footprints of an additional event. ences constitute a second layer of genetic variation, because local differences (e.g., SNPs) can be found in colinear as well as in re- Results arranged regions (Fig. 2A). However, because resequencing studies Karyotype-Resolving Assembly of A. thaliana Ler. We deeply se- typically cannot distinguish between rearranged and nonrearranged quenced the genome of Ler using Illumina libraries of different regions, this distinction was so far not possible (30). GENETICS insert sizes (SI Appendix, Table S1). Combined with recently A whole-genome alignment of the Ler and Col-0 genome as- generated data (8), we obtained an initial assembly with an semblies (31) revealed 512 colinear (allelic) and 611 rearranged N50 of 7.5 Mb using ALLPATHS-LG (21) and SSPACE (22). (nonallelic) regions comprising ∼107.6 and 3.6 Mb (Fig. 2B and For further improvement, we generated long-read data from 10 SI Appendix, SI Materials and Methods). Among the nonallelic PacBio SMRTcells. However, during the course of this project, regions, we identified 47 inversions and 383 and 181 inter- and Pacific Biosciences released sequencing reads of Ler with higher intrachromosomal transpositions (Dataset S1). Most of the trans- quality (16), which we used for gap closure (20) and scaffolding positions resided in pericentromeric regions, whereas inversions (23) to generate an assembly with an N50 of 12.8 Mb consisting of were also found in chromosome arms (Fig. 2C). Nearly 40% of the 65 scaffolds >50 kb. Following the POPSEQ approach (24), we transposed sequences overlapped with transposable elements anchored 31 of these scaffolds to two public genetic maps (25, 26), (TEs); however, only a minor fraction of the inversion sequences

Zapata et al. PNAS | Published online June 27, 2016 | E4053 Downloaded by guest on September 29, 2021 12345 1 2 3 4 5 To avoid recombination between nonallelic regions, meiotic 0 0 recombination requires pairing of homologous sequence between

10 allelic regions (11), implying that rearranged regions should be 5 suppressed for meiotic recombination in heterozygotes (Fig. 2F). 20 10 To test this hypothesis, we overlapped the precise location of 362 30 crossover (CO) events from Col-0/Ler hybrids previously collected 40 15 from different studies (34) with allelic and nonallelic regions (SI Mbp Appendix SI Materials and Methods 50 , ). Only five of the CO events 20 did not reside in colinear, allelic regions (expected: 35 CO events), 60 of which four COs were located in nonaligned regions, and only 25 70 Chromosome length (µm) at pachytene one CO was mapped to a transposition (Fig. 2G). This highly

80 30 significant underrepresentation of COs in nonallelic regions Assembly-based karyotype provides evidence that rearranged regions are in fact suppressed Cytogenetics-based karyotype P = χ2 rDNA rDNA for meiotic recombination ( 1.96e-06, test). Centromere core Centromeric repeat Heterochromatin / NOR Long N-stretch The Effects of Inversions on Natural Haplotypes. Inverted regions Telomeric repeat were most drastically suppressed for meiotic recombination and Fig. 1. Chromosome-level sequence assembly reveals the karyotype and the in consequence genetic exchange between the two alleles of an arrangement of structural features of the A. thaliana Ler genome. Ideogram inversion is expected to be minimized (35). Despite this strong of the Ler genome shows idealized chromosomes at pachytene and chromo- impact, no population-wide analysis of inversions in Arabidopsis somal marks, which were revealed with cytogenetics including heterochromatic and only few inversions have been reported so far (5, 8, 36). clusters (dark gray), clusters of centromeric repeats (red), and rDNA (blue) (4, 5; Two inversions between Col-0 and Ler were >100 kb, including a modified from ref. 56). On the right is an illustration of the assembly, including 170-kb inversion on chromosome 3 and the 1.2-Mb inversion on five chromosomal sequences. The colored bars next to the chromosomes indicate the location of sequence similarity to telomeric repeat sequence (green), cen- chromosome 4. Overlapping the regions of the inversion with tromeric repeat (red), and rDNA (blue) as well as major gaps in the sequence meiotic recombination frequency data (26) showed locally re- (dark gray). The blue star marks a Ler-specific rDNA cluster, which was earlier duced recombination rates co-occurring with both of these two identified by cytogenetics (4) and was also found in the assembly. large inversions (Fig. 3A and SI Appendix, Fig. S4). The effect of the chromosome 4 inversion extended its recombination sup- pression into the heterochromatic pericentromere and was in were related to TEs. The inversion breakpoints overlapped with 7 agreement with early reports of reduced recombination specific and 10 genes in the Ler and Col-0 assemblies, which did not have to hybrids of these accessions (26, 37). syntenic orthologs in the other assembly, suggesting that these Selection acting on one of the alleles of an inverted region can genes have been deleted by the inversion events. have dramatic effect on its allele frequency and haplotype di- The by far largest sequence difference between the two as- vergence across the entire inverted regions. To estimate the semblies was an inversion of 1.2 Mb on chromosome 4, which we impact of these large inversions on the population of A. thaliana, confirmed by PCR (SI Appendix, Figs. S2 and S3). The latter we genotyped a worldwide selection of 409 accessions using inversion was already described by Fransz et al. (5) and was public whole-genome sequencing data (15, 30, 38) (SI Appendix, found to be associated with a polymorphic, heterochromatic SI Materials and Methods). For this process, we simultaneously knob on the short arm of the Col-0 chromosome, presumably aligned the short reads against the Col-0 and Ler reference se- due to inverting parts derived from the pericentromere (5, 32). quences and calculated the ratio of alignments to either of the This inversion could also be observed between Col-0 and the inversion breakpoints to assign the respective allele (SI Appendix, closely related plant , further corroborating that Fig. S5). Surprisingly, only sequencing reads of Ler matched the the Col-0 allele is the derived form (33). Moreover, 7.9 Mb of the Ler inversion breakpoints of the 170-kb inversion on chromo- Col-0 and 4.1 Mb of the Ler assembly were not aligned to any some 3, whereas the reads of all other accessions matched the homologous region of the other genome at all. This sequence Col-0 breakpoints, suggesting that this inversion was either specific space was separated into 713 and 535 regions, respectively (Fig. to the La-1 accession or was introduced during the mutagenesis 2B). Even though not aligned in a strict one-to-one whole-genome leading to the Ler genotype. In contrast, genotyping for the 1.2-Mb alignment, large portions of these unaligned regions showed ex- inversion revealed 26 accessions as carriers of the Col-0 allele and tensive similarity to regions in the other genome, suggesting that 383 accessions as carriers of the Ler allele (Fig. 3B). Most acces- most of these regions originated from duplication events (Fig. 2 A sions could be characterized at the distal inversion breakpoint; and B)(SI Appendix, SI Materials and Methods). however, some accessions showed an additional rearrangement on To compare sequence divergence in allelic and nonallelic re- the proximal breakpoint complicating short read alignments. De- gions, we searched for local differences, including SNPs and spite this complication, none of the accessions showed contra- small and large indels, as well as highly divergent regions (HDRs) dicting genotypes at the two breakpoints. (Fig. 2A, SI Appendix, SI Materials and Methods,andDataset S2). The relatedness of the accessions was estimated by using Approximately 4.5 Mb (4.2%) of the 107.6 Mb of allelic sequence 20,408 SNPs from within the inversion and revealed a perfectly were polymorphic, with the majority of this variation organized in separated subclade, which matched the assignment of the Col-0– long indels and HDRs (Fig. 2D). Approximately 39% of the long like inversion allele during the breakpoint genotyping (Fig. 3C indels had flanking (short or tandem) repeat sequences, indicating and SI Appendix, SI Materials and Methods). This finding sug- that a large proportion of the indel mutations were introduced by gested that suppressed recombination and genetic exchange homology-dependent events, whereas only 16% of them highly separated the population in two distinct groups, in particular overlapped with TEs. Even though rearranged (nonallelic) se- because this separation was not mirrored by geographic isolation quence harbored less large local differences, presumably due to (Fig. 3D). However, this separation was not absolute: Across the their short length [average lengths: transpositions, 4 kb; inversions, 20,408 sites, we found 13 (0.06%) sites with shared variation 9 kb (without the 1.2 Mb inversion); allelic regions, 209 kb], the between both groups (whereas 20,306 sites were polymorphic average pairwise difference in the alignments of nonallelic regions only within the Ler-like inversion accessions, and 89 sites were was still similar compared to the differences in allelic regions (Fig. polymorphic only within the Col-0-like accessions). Although 2 D and E). this very low level of shared variation still could indicate the

E4054 | www.pnas.org/cgi/doi/10.1073/pnas.1607532113 Zapata et al. Downloaded by guest on September 29, 2021 PNAS PLUS A D

Col-0 2.5 40 2.0 30 1.5

Local Ler

Mb 20 kb Variation SNPs & Large indels Copy-number 1.0 short indels Large indels with repeats variation HDRs 10 0.5 0 0 Duplication Col-0 SNPs HDR SNPs HDR

short longindels indels short indelslong indels Ler Allelic Non-allelic Transposition nucmer alignments

Variation Allelic Higher order } including local variation Non-allelic C Chr1 Redundant revealing duplications Col-0 Chr1 Ler

Chr2

r2 h C B E 108 512 0.04

Chr3

Chr3 10 713 Mb 535 pi 0.02 611 Ns

Inversions unique Chr4 Chr4 0 Transpos. Duplications Allelic Non-allelic Col-0 Ler 0.00 Chr5 Chr5 Aligned Non-aligned Inversion Allelic Transposition Non-allelic

F G 0.00 0.90 1.00 CO location CO location Genomic space Trasnposition Whole COs in transpositions can generate duplications and genome deletions CO location Genomic space arms CO location Chromosome Allelic regions Inversion Non-aligned regions COs in inversions can generate dicentric bridges and acentric fragments Tranpositions and inversions GENETICS Fig. 2. Higher-order sequence variation. (A) Schematic of local (Upper) and higher-order (Lower) sequence variation as revealed by a whole-genome alignment. Local sequence divergence does not only include small-scale variation like SNPs and small indels, but also structural variation like large indels and HDRs. Higher-order variation includes transpositions and inversions, which do not reside in the orthologous regions in the other genome. Both colinear (allelic) and rearranged (nonallelic) regions can harbor local variation. (B) Amount of aligned and nonaligned regions in a nonredundant whole-genome alignment of Col-0 and Ler. Aligned regions can be separated into colinear (gray) and rearranged regions [inversions and transpositions (transpos.); red]. Nonaligned regions, typically residing in the breaks between allelic and nonallelic regions, are shown for Col-0 and Ler separately, including the amount of putatively duplicated regions. (C) Location of transpositions and inversions. (D) Genomic space involved in different types of local sequence variation, sep- arately shown for allelic and nonallelic regions. (E) Sequence divergence in allelic and nonallelic alignments. (F) Schematic examples for the consequences of meiotic recombination (CO) events in transposed (Upper) and inverted (Lower) regions. Chromosome arm exchange in nonallelic regions can lead to extreme chromosomal rearrangements. (G) Distribution of the location of 362 CO events in respect to their occurrence in allelic (gray), nonaligned (green), and nonallelic (red) regions in contrast to the genomic fractions of these regions; shown are complete genome (Upper) and only chromosome arms (Lower).

Zapata et al. PNAS | Published online June 27, 2016 | E4055 Downloaded by guest on September 29, 2021 Inversion Peri-centromere A 15 B C Breakpoint Haplotype clustering genotyping cM/Mb

0 Vash−1 Vash−1 Var2−1 Var2−1 VarA_1 VarA_1 TV−22 TV−22 TV−30 TV−30 TV−10 TV−10 0 5 10 15 20 25 TV−7 TV−7 Kastel−1 Kastel−1 Se−0 Se−0 Chr. 3 Ts−1 Ts−1 Got−22 Got−22 Brosarp−61−162 Brosarp−61−162 Kru_3 Kru_3 Sku−30 Sku−30 Dog−4 Dog−4 Bak−2 Bak−2 Bak−7 Bak−7 Hov3−2 Hov3−2 Hov4−1 Hov4−1 Brosarp−21−140 Brosarp−21−140 Hov1−7 Hov1−7 Hov3−5 Hov3−5 Fri_1 Fri_1 Inversion Peri-centromere Tamm−27 Tamm−27 Tamm−2 Tamm−2 Ga−0 Ga−0 Hn−0 Hn−0 Amel−1 Amel−1 Bd−0 Bd−0 Hovdala−2 Hovdala−2 15 Tu−0 Tu−0 Puk_2 Puk_2 Brosarp−34−145 Brosarp−34−145 TDr−9 TDr−9 Brosarp−11−135 Brosarp−11−135 Lis−3 Lis−3 ICE216 ICE216 ICE29 ICE29 ICE102 ICE102 Bsch−0 Bsch−0 Is−0 Is−0 Liarum Liarum Tur_4 Tur_4 App1−12 App1−12 Sp−0 Sp−0 ICE21 ICE21 Bu−0 Bu−0 Utrecht Utrecht Lm−2 Lm−2 Tuescha−9 Tuescha−9 TueWa1−2 TueWa1−2 Com−1 Com−1 ICE120 ICE120 Or−1 Or−1 Chat−1 Chat−1 Ra−0 Ra−0 Var2−6 Var2−6 Spro_1 Spro_1 TDr−2 TDr−2 Vimmerby Vimmerby Got−7 Got−7 Kil−0 Kil−0 Ag−0 Ag−0 cM/Mb Wa−1 Wa−1 Ei−2 Ei−2 St−0 St−0 ICE181 ICE181 Or−0 Or−0 T1080 T1080 Bor−4 Bor−4 Agu−1 Agu−1 Pra−6 Pra−6 Ren−1 Ren−1 Cdm−0 Cdm−0 Mer−6 Mer−6 ICE93 ICE93 Stu−2 Stu−2 0 T720 T720 Hal_1 Hal_1 T1000 T1000 OMo1−7 OMo1−7 Spr1−2 Spr1−2 T550 T550 Mz−0 Mz−0 Ven−1 Ven−1 0 5 10 15 20 Lis−1 Lis−1 OMo2−3 OMo2−3 Fei−0 Fei−0 Spr1−6 Spr1−6 TueV13 TueV13 Fri_2 Fri_2 Sim_1 Sim_1 ICE71 ICE71 Fja1−2 Fja1−2 Chr. 4 Spro_2 Spro_2 Spro_3 Spro_3 Nc−1 Nc−1 Tha−1 Tha−1 Wl−0 Wl−0 Sanna−2 Sanna−2 Dra−3 Dra−3 TBO_01 TBO_01 Hel_3 Hel_3 T480 T480 TueSB30−3 TueSB30−3 ICE119 ICE119 ICE33 ICE33 Lillo−1 Lillo−1 OMo2−1 OMo2−1 Lag2.2 Lag2.2 Istisu−1 Istisu−1 Pog−0 Pog−0 Lerik1−3 Lerik1−3 Xan−1 Xan−1 Rev−1 Rev−1 Fly2−2 Fly2−2 Lund Lund Brosarp−15−138 Brosarp−15−138 FlyA_3 FlyA_3 Fly2−1 Fly2−1 D Kni−1 Kni−1 T780 T780 UllA_1 UllA_1 T880 T880 T800 T800 T850 T850 T710 T710 T900 T900 T860 T860 Kro−0 Kro−0 T790 T790 T840 T840 TAL_03 TAL_03 Gron_14 Gron_14 TGR_01 TGR_01 Ale−Stenar−41−1 Ale−Stenar−41−1 Ale−Stenar−44−4 Ale−Stenar−44−4 HolA2_2 HolA2_2 HolA1_1 HolA1_1 HolA1_2 HolA1_2 T470 T470 Fja1−1 Fja1−1 T960 T960 Leo−1 Leo−1 Dra3−1 Dra3−1 Ts−5 Ts−5 Bla−1 Bla−1 Pla−0 Pla−0 Ler-like Ale−Stenar−56−14 Ale−Stenar−56−14 Gron−5 Gron−5 TAA_04 TAA_04 Don−0 Don−0 Cnt−1 Cnt−1 HR−5 HR−5 NFA−10 NFA−10 Alst−1 Alst−1 Rou−0 Rou−0 Sq−1 Sq−1 CIBC−5 CIBC−5 Ema−1 Ema−1 Col-0-like HR−10 HR−10 ICE91 ICE91 Kulturen−1 Kulturen−1 ICE104 ICE104 ICE92 ICE92 ICE1 ICE1 WalhaesB4 WalhaesB4 Rev−2 Rev−2 ICE111 ICE111 ICE36 ICE36 T990 T990 T530 T530 T930 T930 Brosarp−45−153 Brosarp−45−153 T1090 T1090 ICE138 ICE138 Koch−1 Koch−1 Sha Sha ICE152 ICE152 ICE134 ICE134 ICE150 ICE150 ICE153 ICE153 Zal−1 Zal−1 Kondara Kondara Rubeznhoe−1 Rubeznhoe−1 ICE127 ICE127 ICE130 ICE130 ICE70 ICE70 Ms−0 Ms−0 Had_3 Had_3 Had_1 Had_1 Had_2 Had_2 Es−0 Es−0 ICE72 ICE72 Hs−0 Hs−0 Nas_2 Nas_2 TEDEN_03 TEDEN_03 Eden−2 Eden−2 Eden−6 Eden−6 Eden−7 Eden−7 In−0 In−0 Uod−7 Uod−7 Sap−0 Sap−0 Zdr−1 Zdr−1 Lp2−6 Lp2−6 Lu−1 Lu−1 Ta−0 Ta−0 TDr−1 TDr−1 Hov1−10 Hov1−10 TDr−13 TDr−13 TDr−7 TDr−7 Dr−0 Dr−0 Dra−0 Dra−0 Sr_5 Sr_5 Ler−1 Ler−1 RRS−7 RRS−7 Ey15−2 Ey15−2 Jm−0 Jm−0 Tscha−1 Tscha−1 ICE7 ICE7 Ham_1 Ham_1 Pu2−7 Pu2−7 Pu2−8 Pu2−8 Bor−1 Bor−1 Boot−1 Boot−1 NFA−8 NFA−8 ICE61 ICE61 Kz−9 Kz−9 Gr−1 Gr−1 Ang−0 Ang−0 Van−0 Van−0 App1−14 App1−14 App1−16 App1−16 Gie−0 Gie−0 TEDEN_02 TEDEN_02 TV−38 TV−38 Tottarp−2 Tottarp−2 Ull3−4 Ull3−4 Vie−0 Vie−0 Aa−0 Aa−0 Rd−0 Rd−0 T1070 T1070 Tomegap−2 Tomegap−2 Hau−0 Hau−0 TDr−8 TDr−8 Nie1−2 Nie1−2 Ale−Stenar−64−24 Ale−Stenar−64−24 Star−8 Star−8 Gu−0 Gu−0 TAL_07 TAL_07 T540 T540 Do−0 Do−0 Ste_4 Ste_4 Lom1−1 Lom1−1 Ha−0 Ha−0 Mh−0 Mh−0 Ak−1 Ak−1 Ws−2 Ws−2 Knox−18 Knox−18 Col−0 Col−0 Rod−17−319 Rod−17−319 Ca−0 Ca−0 Na−1 Na−1 Gre−0 Gre−0 Nw−0 Nw−0 Si−0 Si−0 Pna−17 Pna−17 RRS−10 RRS−10 NC−6 NC−6 Knox−10 Knox−10 Yo − 0 Yo − 0 ICE107 ICE107 Rue3−1−31 Rue3−1−31 ICE112 ICE112 ICE73 ICE73 ICE213 ICE213 HKT2.4 HKT2.4 Lp2−2 Lp2−2 Algutsrum Algutsrum Wt−5 Wt−5 Kl−5 Kl−5 Mnz−0 Mnz−0 Ull2−3 Ull2−3 Baa−1 Baa−1 Benk−1 Benk−1 Gel−1 Gel−1 Stw−0 Stw−0 Rhen−1 Rhen−1 Sq−8 Sq−8 Br−0 Br−0 Mc−0 Mc−0 Gro−3 Gro−3 TAA_14 TAA_14 TAA_17 TAA_17 FaL_1 FaL_1 TFA_07 TFA_07 Lan_1 Lan_1 Ull2−5 Ull2−5 T1130 T1130 Ste_2 Ste_2 TDr−16 TDr−16 Fja2−6 Fja2−6 Vastervik Vastervik Bs−1 Bs−1 Wei−0 Wei−0 ICE97 ICE97 Jl−3 Jl−3 Lip−0 Lip−0 Lan−0 Lan−0 An−1 An−1 Sr_3 Sr_3 Bch−1 Bch−1 Dra1−4 Dra1−4 RMX−A02 RMX−A02 RMX−A180 RMX−A180 ICE98 ICE98 Kin−0 Kin−0 Seattle−0 Seattle−0 Fri_3 Fri_3 Yst_1 Yst_1 Sei−0 Sei−0 San−2 San−2 T980 T980 Lov−5 Lov−5 T1160 T1160 Ty−0 Ty−0 E Lis−2 Lis−2 Je−0 Je−0 Inversion Peri-centromere CIBC−17 CIBC−17 Gy−0 Gy−0 Nok−3 Nok−3 Pro−0 Pro−0 Ped−0 Ped−0 1.0 Pu2−23 Pu2−23 Del−10 Del−10 Fab−4 Fab−4 ICE226 ICE226 ICE79 ICE79 ICE169 ICE169 ICE173 ICE173 ICE60 ICE60 ICE212 ICE212 Per−1 Per−1 Vinslov Vinslov Ste_3 Ste_3 Kor_3 Kor_3 Kor_4 Kor_4 Hey−1 Hey−1 Kelsterbach−4 Kelsterbach−4 Ye g − 1 Ye g − 1 Cvi−0 Cvi−0 ICE49 ICE49 ICE50 ICE50 Uk−1 Uk−1 Np−0 Np−0 Stu1−1 Stu1−1 Dra2−1 Dra2−1 Puk_1 Puk_1 Uod−1 Uod−1 TOM_06 TOM_06 TOM_07 TOM_07 ICE63 ICE63 ICE75 ICE75 Ler-like TAD_03 TAD_03 Bil−5 Bil−5 Eds−1 Eds−1 Bil−7 Bil−7 Eden−1 Eden−1 TAD_01 TAD_01 Nyl−2 Nyl−2 TGR_02 TGR_02

diversity Omn−5 Omn−5 Nyl−7 Nyl−7 Omn−1 Omn−1 Eden−5 Eden−5 Haplotype Col-0-like Adal_3 Adal_3 Eden−9 Eden−9 TAD_05 TAD_05 TAD_06 TAD_06 Nyl_13 Nyl_13 0 TOM_04 TOM_04 Adal_1 Adal_1 Gron_12 Gron_12 Eds−9 Eds−9 TAD_04 TAD_04 Dor−10 Dor−10 TNY_04 TNY_04 TRA_01 TRA_01 Fab−2 Fab−2 Lov−1 Lov−1 0 5 10 15 20 Ba1−2 Ba1−2 T1020 T1020 Qui−0 Qui−0 Ann−1 Ann−1 Nemrut−1 Nemrut−1 Sparta−1 Sparta−1 Ren−11 Ren−11 Chromosome 4 Ba4−1 Ba4−1 Ba5−1 Ba5−1 Fja1−5 Fja1−5 Fja2−4 Fja2−4 TDr−17 TDr−17 T570 T570 Kavlinge−1 Kavlinge−1 T460 T460 T1110 T1110 T740 T740 Bro1−6 Bro1−6 ICE163 ICE163 F Inversion Peri-centromere 0.3 Ler-like

st Col-0-like F Unassigned Left breakpoint 0 Right breakpoint 0 5 10 15 20 Chromosome 4

Fig. 3. Impact of large-scale inversions on meiotic recombination and haplotype diversity in a worldwide collection of A. thaliana accessions. (A) Male meiotic recombination frequencies across chromosomes 3 and 4 contrasted with the location of the two large-scale inversions (dark gray boxes) and the pericen- tromeric regions (light gray boxes) [recombination data generated by Giraut et al. (26)]. Recombination frequency was measured between markers withan average distance of 316 kb. Both inversions co-occur with locally reduced recombination frequencies. The interval harboring the inversion on chromosome 3, however, showed residual recombination activity, which does not imply recombination in the inverted region, but might arise from recombination in 111 kb of noninverted sequence in this interval. (B) The names of 409 accessions colored by the inferred chromosome 4 inversion allele (blue, Ler allele; red, Col-0 allele) as assessed on the left and right breakpoints of the inversion. The accessions were ordered after their occurrence in the haplotype clustering shown in C.(C) Haplotype clustering based on 9,198 SNPs located within the chromosome 4 inversion, revealing two distinct clusters, which perfectly matched the two chromosome 4 inversion alleles. (D) Distribution of the accession origins in central Europe, colored by their respective chromosome 4 inversion alleles. (E) Haplotype diversity within the accessions carrying a Col-0–like (red) or Ler-like allele (blue) of the chromosome 4 inversion. (F) Population differentiation

(Fst) between these two groups of accessions. Inversion and pericentromere shown with dark and light gray boxes.

existence of rare (double) recombination events between two The minor Col-0 inversion allele mostly co-occurred together inversion alleles, such patterns might also be explained by gene with the Ler-like allele across its entire distribution range in conversions (11) or accumulation of false-positive SNP pre- Central and Northern Europe, and even in recently invaded diction within the set of shared SNPs. North America (39). Although the Ler-like genotypes showed

E4056 | www.pnas.org/cgi/doi/10.1073/pnas.1607532113 Zapata et al. Downloaded by guest on September 29, 2021 genetic diversity within the inversion region, which was not different appeared specific to either of the genomes. These gene sets were PNAS PLUS from the rest of the genome, the diversity among the Col-0–like then further filtered to exclude the possibility that their acces- genotypes was greatly reduced within the inversion (Fig. 3E and SI sion-specific occurrence was due to failures in the assemblies, Appendix,Fig.S6) and extended across the low recombining peri- annotations, or assignment of orthologs (SI Appendix, SI Mate- centromere. The drastic reduction in diversity most likely reflects a rials and Methods). This conservative approach led to 40 unique bottleneck event as a result of the inversion generating the derived genes specific for Ler and 63 genes specific for Col-0. Using the Col-0 allele, which was maintained by the suppression of recom- genome of the close relative A. lyrata (36, 40) as outgroup, we bination in inversion heterozygotes. found that the majority of these polymorphic genes (77%) Taken together, these findings suggest that the inversion was evolved via deletions within the genome in which they are absent, introduced by a single event, and, in consequence, population rather than by a spontaneous appearance in the genome in which differentiation between the two groups was not uniform across they are present (Fig. 4A). In fact, the number of genes is pre- A. lyrata the chromosome. Wright’s fixation index (FST) was elevated in sumably underestimated because the assembly might the inversion, and, again, the effect was extended into the peri- not be complete and might lack some genes, leading to a false centromeric region (Fig. 3F and SI Appendix, Fig. S7). categorization of genes. Finally, we assessed patterns of genic selection in the inversion Single-copy gene loss is an extreme type of modification of the by calculating Ka/Ks to explain the relatively high allele frequency gene content of a genome, because no additional copy can re- of the inversion allele. One of the genes in the inversions, place the function of the absent gene. In contrast, small gene RLP46, a LRR- domain-containing gene with function families with multiple copies are expected to be at least partially in defense response, was among the four genes with the highest functionally redundant and, therefore, more variable for copy Ka/Ks values across the entire genome. However, the numbers of number. In fact, 330 unique one-to-one orthologous gene pairs nonsynonymous and synonymous changes were not high enough showed one additional copy in either the Ler or Col-0 genome to prove a significant difference from neutral , pre- (again, after strict filtering for artifacts) (SI Appendix, SI Mate- venting a final conclusion as to whether selection of a gene in the rials and Methods). In 151 of the cases, the additional gene was inversion helped to retain the inverted allele or whether the in- found in the genome of Col-0, whereas in 179 of the cases, the version simply drifted into the population. additional copy was found in Ler. In contrast to accession-specific genes, copy-number variation did not primarily evolve through Genic Copy Number Variation Between Two A. thaliana Lines. To deletion events; instead, copy loss events were underrepresented assess the amount of genes that are specific to Ler or Col-0, we (36%) compared with acquiring new copies by novel duplication first calculated groups of orthologous genes between the two (64%) (again, gain and loss events were distinguished by using genomes (SI Appendix, SI Materials and Methods). This process A. lyrata as an outgroup) (Fig. 4B). These additional genes re- revealed initial sets of 212 and 240 single-copy genes that sided in 134 duplicated regions in Col and 148 duplicated regions

AC 27.84 27.83 C 27.82 27.81 27.80 Col-0 chromosome 1 (Mb) GGeneene loslosss GeneGene gaingain

31 9 488 15 Ler CCol-0ol-0 AA.. llyratayrata Ler Col-0Col-0 A.A. llyratayrata

AbsentAbsent PresentPresent

26.74 26.75 26.76 26.77 26.78 26.79 Ler chromosome 1 (Mb) B D E CopyCopy loss CopyCopy gaigainn 115050 1.001.00 DispersedDispersed LocalLocal t

un 0.900.90 entity Co Count Id Identity

5757 94 GENETICS 61 118 0 0.800.80 Ler Col-0 A. lyrata Ler Col-0 A. lyrata Copy Copy Copy Copy Loss Gain Loss Gain

Fig. 4. Gene absence/presence polymorphisms between Col-0 and Ler.(A) Amount of single-copy, polymorphic genes in Col-0 and Ler. The genes were separated by the presence (Left) or absence (Right) of an ortholog in the related species A. lyrata.(B) Amount of single-copy genes with one additional copy in Col-0 and Ler. Cases were separated by the presence of one or two orthologs in the genome of A. lyrata as in A.(C) Dot plot (57) of an example of a local duplication event coping multiple genes in the genome of Ler. Identically colored arrows indicate similarity between underlying gene loci. (D) Amount of local or dispersed gene copies separately shown for copy loss or gain events as defined in B.(E) Sequence identity between gene copies separately shown for copy loss or gain events as defined in B.

Zapata et al. PNAS | Published online June 27, 2016 | E4057 Downloaded by guest on September 29, 2021 in Ler, implying that some of the duplicated genes were introduced Ler (Cologne) by the same duplication event (Fig. 4C). Interestingly, although Ler qrt1 (NASC) the loss of additional copies was independent of the location, the Ler (Tübingen) Ler qrt1 (Shanghai) gain of additional copies was clearly enriched for copies residing Ler (Wageningen) close to the original gene (Fig. 4D), suggesting that novel genes Ler (Cambridge) primarily evolved by local duplication events. This result was further supported by differences in the similarity between new copies and the original genes. Gained copies were much more similar to each other compared with copies, where the other ge- Chr. 1 nome lost one copy, corroborating that gene gains predominantly result from recent events (Fig. 4E). Performing a enrichment analysis, we found Chr. 2 that polymorphic single-copy genes are enriched for signaling T and -related pathways, whereas genes with Chr. 3 C copy-number changes were enriched for defense-related cate- A gories, protein polymerization, and translation. G Chr. 4 Polymorphisms Between Diverse Ler Lines. Earlier comparisons of N different Ler lines have revealed a polymorphic premature stop codon in the HUA2 gene (41). This allele, hua2-5, was specific Chr. 5 only to some, but not all, Ler lines; could not be identified in any A. thaliana Ler (Cologne) of 29 other accessions (41); and still has not been Ler qrt1 (NASC) identified in any of the hundreds of other accessions released by Ler (Tübingen) the 1001 Genomes Project (www.1001genomes.org). This result Ler qrt1 (Shanghai) suggested that the hua2-5 mutation occurred after the original Ler (Wageningen) La-1 mutagenesis and, because it was not found in some flowering Ler (Cambridge) time mutants isolated in the 1960s but was present in mutants Fig. 5. SNPs between six Ler genomes from different laboratories. Location isolated in the early 1970s at the Genetics Department in Wage- and type of SNPs distinguishing six genomes published as the genome of Ler. ningen, this mutation most likely occurred in that laboratory (41). Genome-wide visualization revealed large blocks of C→T and G→A muta- We have reanalyzed six different datasets, all of which have tions specific to two Ler lines. been labeled as whole-genome sequencing data of Ler (8–11) (SI Appendix, Table S2). We aligned the reads of all sets against the Ler reference sequence, and, after stringent filtering, we identi- refers to local sequence divergence without considering rear- fied only a marginal amount of structural variations between rangements like transpositions or inversions. In fact, by analyzing er these lines, including five deletions and two duplication events CO events, which were observed in Col-0/L hybrids, in respect to using read copy and read pair analysis. However, we also iden- their occurrence in colinear and rearranged regions, we could tified an unexpected amount of 723 single-nucleotide variations, show that nonallelic regions have significantly reduced levels of including the hua2-5 mutation, which was present in two of the meiotic recombination, despite no obvious difference in local six lines. Surprisingly, most of the other polymorphic sites were sequence divergence. not specific to a single genome as expected for inbred strains Such cross-specific suppression of recombination was de- distributed to different laboratories, but were associated to the scribed for individual regions before including the 1.2-Mb in- hua2-5 mutation. Nearly all of these mutations were C→Tor version on the upper arm chromosome 4 (5, 26, 37), which was G→A mutations and were found in large clusters throughout the found between Col-0 and Ler, as well as Ws and Ler. Similar entire genome (Fig. 5). Because such mutations and their clus- suppression patterns were found in a 2.2-Mb region on the upper tering are characteristic for ethyl methanesulfonate (EMS) mu- arm of chromosome 3 within the offspring of Bay-0 and Sha (47) tations, it is likely that the hua2-5 mutation (and all other [and were later also observed in crosses between Col-0 and Sha associated mutations) resulted from an EMS mutagenesis. (45, 48, 49)], a 2- to 3-Mb region on the upper arm of chromo- some 5 specific to RRS7 and a 0.2-Mb region on the lower arm Discussion of chromosome 1 specific to Bay-0 (49). Recently, local sup- We have used a combination of short Illumina reads, long PacBio pression of recombination on the lower arm of chromosome 4 reads, and genetic maps to generate a chromosome-level between a cross of Col-0 and Ws-2 led to the identification of a assembly of A. thaliana Ler. Despite the unprecedented conti- 1.8-Mb inversion, which perfectly matched the region without guity that we achieved by combining different data types, in- any CO events (36). tegration was not straightforward. In contrast, de novo assemblies Using a 1.2-Mb inversion on the upper arm of chromosome 4 based only on long reads are easier to perform and start to become as an example, we studied the effects of such inversion-mediated a powerful alternative (17, 42), although they do not assemble suppression of meiotic recombination on the global population entire chromosomes yet. Independent of the assembly type, the of A. thaliana. Suppression of genetic exchange between these amount of rearranged sequence that was revealed by comparing two inversion alleles introduced a new genetically isolated hap- two assemblies underlines the general importance of chromosome- lotype into the global population. Although accessions carrying level assemblies, in particular because rearrangements have been the Col-0–like allele showed a clear concentration in Germany recognized as confounding factors in genome-wide screens (11, 43) and Sweden, it was surprising that these genotypes were rather and are essential to fully understand the segregation and evolution widely distributed. Other examples of widely distributed polymor- of natural haplotypes. phisms include deletions in FRIGIDA (50), which could be asso- Recent meiotic recombination estimates suggested that high ciated with different selective pressures. We could not identify levels of sequence divergence themselves are not inhibitory for similar signals for any of the alleles in the derived form of the meiotic recombination (26, 44, 45), which is in agreement with a inversion, which, nevertheless, does not exclude a possible se- positive correlation of ancestral recombination frequency and lective advantage of this inversion allele, but might point to a regions with high sequence divergence (46). However, this finding more complex scenario of selection.

E4058 | www.pnas.org/cgi/doi/10.1073/pnas.1607532113 Zapata et al. Downloaded by guest on September 29, 2021 In addition to rearrangements, an even larger portion of the once more chromosome-level de novo genome assemblies become PNAS PLUS genome was not assigned to any orthologous region in the whole- available. genome alignment. Many of these regions resulted from dupli- cation events, which had a drastic impact on the gene content Materials and Methods of the genomes. Gene absence/presence polymorphisms have To assemble the genome of the A. thaliana Ler genome, we used ALLPATH- been connected to phenotypic variation (1) and genetic in- LG (21) and SSPACE-ShortRead (22) for an Illumina short-read contig as- compatibilities (51–54) before, but the genome-wide extent of sembly and scaffolding. Integration of public PacBio sequencing data (16) was performed with PBJelly (20) to close gaps and SSPACE-LongRead (23) to hundreds of polymorphic genes in both genomes was surprisingly connect scaffolds. Higher-order scaffolding was based on two public genetic high, even though we only looked at low-copy differences. The maps with 676 and 386 markers with a location on the scaffolds (25, 26). Ad- functional classes, which were enriched among the duplicated ditional integration of seven scaffolds (combined length of 1.4 Mb) based on genes, included defense-related genes, which can have a selective synteny assumptions led to the final assembly of five chromosome-representing advantage when duplicated because they can evolve race-specific scaffolds. Finally, PacBio data were used to correct 3.5-Mb ambiguous (N) ba- resistances exemplified by the RPP1 locus where Ler has a 70-kb ses. For gene annotation, we used AUGUSTUS, including alignment from additional sequence including several strain-specific genes (1, 55). public RNA-seq data. Even in an organism with so many resequenced genomes as A. thaliana ACKNOWLEDGMENTS. We thank Erik Wijnker for help with interpretation , a single chromosome-level genome assembly enabled of meiotic recombination suppression in non-allelic regions; Petra Tänzler us to analyze a second layer of genetic variation, which so far could for help with DNA extractions; Avraham Levy and Shay Shilo for sharing not be considered in most short-read-based genome-wide studies. their recombination data; and the 1001 Genomes Project for releasing their These regions showed a drastic impact on natural haplotypes and data, allowing us to search for natural variants of the hua2-5 allele. This work was supported by Spanish Ministry of Economy and Competitiveness introduced great variability into the gene space, giving a first Centro de Excelencia Severo Ochoa 2013-2017 Grant SEV-2012-0208. L.Z. was glimpse of the degree of natural variation, which can be revealed supported by the International PhD scholarship program of La Caixa at CRG.

1. Alcázar R, et al. (2014) Analysis of a plant complex resistance gene locus underlying 27. Uchida W, Matsunaga S, Sugiyama R, Kawano S (2002) Interstitial -like re- immune-related hybrid incompatibility and its occurrence in nature. PLoS Genet peats in the Arabidopsis thaliana genome. Genes Genet Syst 77(1):63–67. 10(12):e1004848. 28. Guo Y-L, et al. (2011) Genome-wide comparison of nucleotide-binding site-leucine- 2. Rédei GP (1962) Single loci heterosis. Z Vererbungsl 93(1):164–170. rich repeat-encoding genes in Arabidopsis. Plant Physiol 157(2):757–769. 3. Rédei GP (1992) A heuristic glance at the past of Arabidopsis genetics. Methods in 29. Shen H, et al. (2012) Genome-wide analysis of DNA methylation and gene expression Arabidopsis Research, eds Koncz C, Chua NH, Schell J (World Scientific, Singapore), pp changes in two Arabidopsis and their reciprocal hybrids. 24(3): 1–15. 875–892. 4. Fransz P, et al. (1998) Cytogenetics for the model system Arabidopsis thaliana. Plant J 30. Long Q, et al. (2013) Massive genomic variation and strong selection in Arabidopsis 13(6):867–876. thaliana lines from Sweden. Nat Genet 45(8):884–890. 5. Fransz PF, et al. (2000) Integrated cytogenetic map of chromosome arm 4S of A. 31. Kurtz S, et al. (2004) Versatile and open software for comparing large genomes. thaliana: Structural organization of heterochromatic knob and centromere region. Genome Biol 5(2):R12. Cell 100(3):367–376. 32. Cold Spring Harbor Laboratory, Washington University Genome Sequencing Center, 6. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flow- and PE Biosystems Arabidopsis Sequencing Consortium (2000) The complete sequence ering plant Arabidopsis thaliana. Nature 408(6814):796–815. of a heterochromatic island from a higher . Cell 100(3):377–386. 7. Clark RM, et al. (2007) Common sequence polymorphisms shaping genetic diversity in 33. Hu TT, et al. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid Arabidopsis thaliana. Science 317(5836):338–342. genome size change. Nat Genet 43(5):476–481. 8. Schneeberger K, et al. (2011) Reference-guided assembly of four diverse Arabidopsis 34. Shilo S, Melamed-Bessudo C, Dorone Y, Barkai N, Levy AA (2015) DNA crossover thaliana genomes. Proc Natl Acad Sci USA 108(25):10249–10254. motifs associated with epigenetic modifications delineate open chromatin regions in 9. Gan X, et al. (2011) Multiple reference genomes and transcriptomes for Arabidopsis Arabidopsis. Plant Cell 27(9):2427–2436. thaliana. Nature 477(7365):419–423. 35. Kirkpatrick M (2010) How and why chromosome inversions evolve. PLoS Biol 8(9): 10. Lu P, et al. (2012) Analysis of Arabidopsis genome-wide variations before and after e1000501. meiosis and meiotic recombination by resequencing Landsberg erecta and all four 36. Rowan BA, Patel V, Weigel D, Schneeberger K (2015) Rapid and inexpensive whole- products of a single meiosis. Genome Res 22(3):508–518. genome genotyping-by-sequencing for crossover localization and fine-scale genetic 11. Wijnker E, et al. (2013) The genomic landscape of meiotic crossovers and gene con- mapping. Genes Genomes Genet 5(3):385–398. versions in Arabidopsis thaliana. eLife 2(2):e01426. 37. Drouaud J, et al. (2006) Variation in crossing-over rates across chromosome 4 of 12. Schneeberger K, Weigel D (2011) Fast-forward genetics enabled by new sequencing Arabidopsis thaliana reveals the presence of meiotic recombination “hot spots”. technologies. Trends Plant Sci 16(5):282–288. Genome Res 16(1):106–114. 13. Hollister JD (2014) Genomic variation in Arabidopsis: Tools and insights from next- 38. Schmitz RJ, et al. (2013) Patterns of population epigenomic diversity. Nature generation sequencing. Chromosome Res 22(2):103–115. 495(7440):193–198. 14. Ossowski S, et al. (2008) Sequencing of natural strains of Arabidopsis thaliana with 39. Platt A, et al. (2010) The scale of population structure in Arabidopsis thaliana. PLoS short reads. Genome Res 18(12):2024–2033. Genet 6(2):e1000843. 15. Cao J, et al. (2011) Whole-genome sequencing of multiple Arabidopsis thaliana 40. Rawat V, et al. (2015) Improving the annotation of Arabidopsis lyrata using RNA-Seq populations. Nat Genet 43(10):956–963. data. PLoS One 10(9):e0137391. 16. Kim KE, et al. (2014) Long-read, whole-genome shotgun sequence data for five model 41. Doyle MR, et al. (2005) HUA2 is required for the expression of floral in organisms. Sci Data 1:140045. Arabidopsis thaliana. Plant J 41(3):376–385. 17. Berlin K, et al. (2015) Assembling large genomes with single-molecule sequencing and 42. VanBuren R, et al. (2015) Single-molecule sequencing of the desiccation-tolerant grass locality-sensitive hashing. Nat Biotechnol 33(6):623–630. Oropetium thomaeum. Nature 527(7579):508–511. 18. Eid J, et al. (2009) Real-time DNA sequencing from single polymerase molecules. 43. Qi J, Chen Y, Copenhaver GP, Ma H (2014) Detection of genomic variations and DNA Science 323(5910):133–138. polymorphisms and impact on analysis of meiotic recombination and genetic map- 19. Koren S, et al. (2012) Hybrid error correction and de novo assembly of single-molecule ping. Proc Natl Acad Sci USA 111(27):10007–10012. sequencing reads. Nat Biotechnol 30(7):693–700. 44. Barth S, Melchinger AE, Devezi-Savula B, Lübberstedt T (2001) Influence of genetic 20. English AC, et al. (2012) Mind the gap: Upgrading genomes with Pacific Biosciences RS background and heterozygosity on meiotic recombination in Arabidopsis thaliana. –

long-read sequencing technology. PLoS One 7(11):e47768. Genome 44(6):971 978. GENETICS 21. Gnerre S, et al. (2011) High-quality draft assemblies of mammalian genomes from 45. Ziolkowski PA, et al. (2015) Juxtaposition of heterozygosity and homozygosity during massively parallel sequence data. Proc Natl Acad Sci USA 108(4):1513–1518. meiosis causes reciprocal crossover remodeling via interference. eLife 4:e03708. 22. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W (2011) Scaffolding pre- 46. Kim S, et al. (2007) Recombination and linkage disequilibrium in Arabidopsis thaliana. assembled contigs using SSPACE. 27(4):578–579. Nat Genet 39(9):1151–1155. 23. Boetzer M, Pirovano W (2014) SSPACE-LongRead: Scaffolding bacterial draft genomes 47. Loudet O, Chaillou S, Camilleri C, Bouchez D, Daniel-Vedele F (2002) Bay-0 x Shahdara using long read sequence information. BMC Bioinformatics 15(1):211. recombinant inbred line population: S powerful tool for the genetic of 24. Mascher M, et al. (2013) Anchoring and ordering NGS contig assemblies by pop- complex traits in Arabidopsis. Theor Appl Genet 104(6-7):1173–1184. ulation sequencing (POPSEQ). Plant J 76(4):718–727. 48. Simon M, et al. (2008) Quantitative trait loci mapping in five new large recombinant 25. Singer T, et al. (2006) A high-resolution map of Arabidopsis recombinant inbred lines inbred line populations of Arabidopsis thaliana genotyped with consensus single- by whole-genome array hybridization. PLoS Genet 2(9):e144. nucleotide polymorphism markers. Genetics 178(4):2253–2264. 26. Giraut L, et al. (2011) Genome-wide crossover distribution in Arabidopsis thaliana 49. Salomé PA, et al. (2012) The recombination landscape in Arabidopsis thaliana F2 meiosis reveals sex-specific patterns along chromosomes. PLoS Genet 7(11):e1002354. populations. Heredity (Edinb) 108(4):447–455.

Zapata et al. PNAS | Published online June 27, 2016 | E4059 Downloaded by guest on September 29, 2021 50. Toomajian C, et al. (2006) A nonparametric test reveals selection for rapid flowering 54. Chae E, et al. (2014) Species-wide genetic incompatibility analysis identifies immune in the Arabidopsis genome. PLoS Biol 4(5):e137. genes as hot spots of deleterious epistasis. Cell 159(6):1341–1351. 51. Bikard D, et al. (2009) Divergent evolution of duplicate genes leads to genetic in- 55. Alcázar R, García AV, Parker JE, Reymond M (2009) Incremental steps toward in- compatibilities within A. thaliana. Science 323(5914):623–626. compatibility revealed by Arabidopsis epistatic interactions modulating 52. Vlad D, Rappaport F, Simon M, Loudet O (2010) Gene transposition causing natural pathway activation. Proc Natl Acad Sci USA 106(1):334–339. variation for growth in Arabidopsis thaliana. PLoS Genet 6(5):e1000945. 56. Koornneef M, Fransz P, de Jong H (2003) Cytogenetic tools for Arabidopsis thaliana. 53. Smith LM, Bomblies K, Weigel D (2011) Complex evolutionary events at a tandem Chromosome Res 11(3):183–194. cluster of Arabidopsis thaliana genes resulting in a single-locus genetic incompati- 57. Krumsiek J, Arnold R, Rattei T (2007) Gepard: A rapid and sensitive tool for creating bility. PLoS Genet 7(7):e1002164. dotplots on genome scale. Bioinformatics 23(8):1026–1028.

E4060 | www.pnas.org/cgi/doi/10.1073/pnas.1607532113 Zapata et al. Downloaded by guest on September 29, 2021