Analysis https://doi.org/10.1038/s41588-019-0393-z

Tracing the ancestry of modern bread wheats

Caroline Pont1,22, Thibault Leroy2,3,22, Michael Seidel4,22, Alessandro Tondelli5,22, Wandrille Duchemin1,22, David Armisen1,22, Daniel Lang4,22, Daniela Bustos-Korts6,22, Nadia Goué1,7, François Balfourier1, Márta Molnár-Láng8, Jacob Lage9, Benjamin Kilian10,11, Hakan Özkan12, Darren Waite13, Sarah Dyer14, Thomas Letellier15, Michael Alaux15, Wheat and Barley Legacy for Breeding Improvement (WHEALBI) consortium16, Joanne Russell17, Beat Keller18, Fred van Eeuwijk6, Manuel Spannagl4, Klaus F. X. Mayer4,19, Robbie Waugh17,20,21, Nils Stein11, Luigi Cattivelli5,23, Georg Haberer4,23, Gilles Charmet1,23 and Jérôme Salse1,23*

For more than 10,000 years, the selection of plant and animal traits that are better tailored for human use has shaped the development of civilizations. During this period, bread wheat (Triticum aestivum) emerged as one of the world’s most important crops. We use exome sequencing of a worldwide panel of almost 500 genotypes selected from across the geographical range of the wheat species complex to explore how 10,000 years of hybridization, selection, adaptation and plant breeding have shaped the genetic makeup of modern bread wheats. We observe considerable genetic variation at the genic, chromosomal and subgenomic levels, and use this information to decipher the likely origins of modern day wheats, the consequences of range expansion and the allelic variants selected since its domestication. Our data support a reconciled model of wheat evolution and provide novel avenues for future breeding improvement.

Bread wheat has an allo-hexaploid genome consisting of three closely related subgenomes (AABBDD). It is proposed to have originated from two polyploidization events; first, a tetraploidization some 0.5 million years before present (ybp) from the hybridization between wild Triticum urartu Tumanian ex Gandilyan (AA) and an undiscovered species of the Aegilops speltoides Tausch lineage (BB), and second, a hexaploidization some 10,000 ybp as a result of hybridization between a descendant of this original tetraploid hybrid (AABB) and the wild diploid Aegilops tauschii Coss (DD)1. Archaeobotanical evidence suggests that the resulting allo-hexaploid wheats were domesticated in the Fertile Crescent, a region extending from Israel, Jordan, Lebanon and Syria through South-East Turkey to northern Iraq and western Iran2–5. Modern cultivated bread wheats are therefore the product of at least 10,000 years of human selection during domestication and cultivation (improvement and breeding). Today they comprise high-yielding varieties adapted to a wide range of environments, ranging from low-humidity regions in Nigeria, Australia, India and Egypt to high-humidity regions like South America6.

Results

Wheat genomic diversity. To explore the origins and patterns of genetic diversity that exist within the currently accessible wheat gene pool, we assembled a worldwide panel of 487 genotypes that included wild diploid and tetraploid relatives, domesticated tetraploid and hexaploid landraces, old cultivars and modern elite cultivars (Supplementary Table 1). Mapping exome capture sequence data7 from these lines onto the ‘Chinese Spring’ reference genome sequence (International Wheat Genome Sequencing Consortium (IWGSC)8) revealed 620,158 high-confidence genetic variants (including 595,939 SNPs and indels between hexaploid genotypes) distributed across 41,032 physically ordered wheat genes (Supplementary Table 2). Equivalent sequence coverage at the chromosome and homoeologous gene or region levels (medians within one standard deviation) excluded bias in the detection and calling of structural variants (Supplementary Fig. 1). Furthermore, correlation between gene and structural variant distribution (r > 82%) across the three subgenomes, expected from an exome capture experiment, supports the lack of bias in genic variant detection at the chromosome level with a visible gradient from distal gene-rich regions to pericentromeric gene-poor regions (Supplementary Fig. 2). Both individual subgenomes (with a B > A ≫ D gradient) and chromosome compartments (with a telomere > core ≫ centromere gradient) exhibited differences in structural variant enrichment (Supplementary Fig. 3). In summary, our variant data set provides

1INRA-Université Clermont Auvergne, Clermont-Ferrand, France. 2INRA-Université de Bordeaux, Cestas, France. 3ISEM, Université de Montpellier, CNRS, IRD, EPHE, Place Eugène Bataillon, Montpellier, France. 4PGSB, Helmholtz Center Munich, Neuherberg, Germany. 5Council for Agricultural Research and Economics (CREA), Research Centre for Genomics and Bioinformatics, Fiorenzuola d’Arda, Italy. 6Wageningen University & Research, Biometris, Applied Statistics, Wageningen, the Netherlands. 7Plateforme Auvergne Bioinformatique, Mésocentre, Université Clermont Auvergne, Aubière, France. 8Agricultural Institute, Centre for Agricultural Research, Hungarian Academy of Sciences, Martonvásár, Hungary. 9KWS UK Ltd, Thriplow, UK. 10Global Crop Diversity Trust, Bonn, Germany. 11Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany. 12University of Çukurova, Faculty of Agriculture, Department of Field Crops, Adana, Turkey. 13Earlham Institute, Norwich Research Park, Norwich, UK. 14NIAB, Huntingdon Road, Cambridge, UK. 15URGI, INRA, Université Paris-Saclay, Versailles, France. 16A list of members and affiliations appears in the Supplementary Note. 17The James Hutton Institute, Invergowrie, Dundee, UK. 18Department of Plant and Microbial Biology, University of Zurich, Zürich, Switzerland. 19School of Life Sciences, Technical University Munich, Weihenstephan, Germany. 20The University of Dundee, Division of Plant Sciences, School of Life Sciences, Dundee, UK. 21School of Agriculture, Food and Wine, University of Adelaide, Adelaide, South Australia, Australia. 22These authors contributed equally: Caroline Pont, Thibault Leroy, Michael Seidel, Alessandro Tondelli, Wandrille Duchemin, David Armisen, Daniel Lang, Daniela Bustos-Korts. 23These authors jointly supervised this work: Luigi Cattivelli, Georg Haberer, Gilles Charmet, Jérôme Salse. *e-mail: [email protected]

Nature Genetics | VOL 51 | MAY 2019 | 905–911 | www.nature.com/naturegenetics 905 Analysis Nature Genetics


a comprehensive overview of wheat genomic diversity at various scales (gene, region, chromosome and genome), and represents a rich source of genetic information for exploitation by both the academic and agricultural research communities (Fig. 1, circles 1 to 3).

Phylogenetic and principal component analyses (PCAs) revealed three major factors driving the partitioning of diversity within our panel (Fig. 2): vernalization requirement (winter vs spring), historical groups (groups I to IV, oldest to newest: old landraces to modern elite lines) and geographical origin (Europe, Asia, Oceania, Africa and America) (Fig. 2a and Supplementary Fig. 4). Among the 11 major tree clades chosen based on size, representativeness and statistical support, permutation tests for the conservation of clade monophyly confirmed a strong grouping for all three factors (P < 1 × 10⁻⁶). However, the deep structure of the phylogeny is centered around continental difference, and subsequently more recent shifts in growth habit traits (such as vernalization requirement) resulting from intense selection for yield in modern wheat breeding practices. Superimposition of both country and continent of origin onto the phylogenetic clusters suggests that the observed genetic diversity is mainly structured along an east–west axis, consistent with established routes of human migration out of the Fertile Crescent. Two paths to Western Europe follow an inland route (via Anatolia and the Balkans to Central Europe) and a coastal route (via Egypt to the Maghreb and Iberian Peninsula), complemented by two additional paths north-east and along the Inner Asian Mountain Corridor, followed by further colonization events in American, Oceanian and African territories (Fig. 2b and Supplementary Fig. 5)3.

Fig. 1 | Wheat genome diversity map. Genome histograms illustrating (from outer to inner circles) the density in genes (1), SNPs (2) and indels (3), the genome scan for improvement among European genotypes (4), the reduction in diversity during the domestication between wild tetraploids versus domesticated hexaploids for subgenomes A and B, and between wild diploids and domesticated hexaploids for subgenome D (5), as well as GWAS for HD (6), PH (7) on the 21 chromosomes (from 1A to 7D) illustrated in circles for the three subgenomes (A, B and D). Homoeologous genes are joined with colored lines between chromosomes in the center (8). The three outer circles show centromeres (grey blocks) and telomeres (blue blocks) on the 21 chromosomes.

Wheat selection footprints. We next explored selection footprints resulting from domestication (comparing wild and domesticated wheats) and breeding (comparing historical bread wheat groups I to IV) using a sliding window (as opposed to variant- or gene-centric) approach to deciphering local reduction in diversity and taking into account the geographical structuring of the wheat panel (Fig. 1, circles 4 and 5). For the detection of domestication signals, we computed the nucleotide diversity per site (π) over non-overlapping 1 Mb sliding windows for wild diploid (T. urartu, A. speltoides, A. tauschii), wild tetraploid (T. dicoccoides) and the domesticated hexaploid (T. aestivum) wheats from Asia, where the previous diploid and tetraploid progenitors in our panel originated. Contrasts between wild wheat ancestors and hexaploid landraces support considerable heterogeneity in the reduction of diversity (ROD) during the domestication process along the wheat genome (Supplementary Fig. 6). Strongly affected genomic regions (1,221), showing a loss of at least four-fifths of the diversity (ROD > 0.8, red dots in Fig. 1, circle 5), cover 9% of the wheat genome (1.2 Gb). Known



domestication genes conferring brittle rachis (Brt), tenacious glume (Tg), homoeologous pairing (Ph) and non-free-threshing character (Q) were identified in close (< 5 Mb distance) proximity to these regions. Interestingly, known domestication genes only account for a minority of the observed peaks, suggesting that further domestication genes remain to be discovered (Supplementary Table 3)9,10, as reported for other grass species such as maize11.

Fig. 2 | Geographical components of the panel structure. a, The phylogenetic relatedness among the hexaploid bread wheat genotypes, with a color code (right) in pie charts along the 11 major tree clades, showing their geographical origins (see the associated world map in part b), historical groups (I to IV) and growth habit (winter, spring, alternative and facultative). b, The phylogenetic relatedness between wheat accessions of different geographical origins (see color legend in part a, detailed in shades of green (for Fertile Crescent, Central Asia, Eastern Asia, Northern Asia and Indian Peninsula), blue (for Central, Southern and Western Europe), brown (for Northern and Sub-Saharan Africa), red (for South and North America) and pink (for Oceania)) with colored connecting lines on the world map illustrating phylogenetic tree edges (part a) associated with a mean of at least one transition per simulation and illuminating the known historical routes of wheat migration, out of the Fertile Crescent (green connecting lines), west through inland (1) and coastal (2) paths, and northeast (3) and along the Inner Asian Mountain Corridor (4) followed by further colonization (black connecting lines) of American (5), African (6) and Oceanian (7) territories.

To unravel genomic regions targeted by breeders during the last two centuries (since 1830) of wheat improvement, we compared the ROD statistics of the European panel within historical groups II, III and IV (that is, those subject to breeding) to those of group I (landraces) (Fig. 3a). Our results are consistent with two main rounds of diversity reduction, an initial wave between groups I and II (11.7% of ROD), reflecting early breeding improvement, and a second wave between groups III and IV (13.3% of ROD) that followed the green revolution (that is, renovation of agricultural practices starting between 1950 and the late 1960s). Modern wheat varieties showed an average loss of nucleotide diversity of 21.8% compared to those of group I, with strong variation within and between chromosomes (Fig. 3b). This appeared to be more intense on the A (median ROD = 33.2%) and B (median ROD = 28.0%) subgenomes compared to the D subgenome (median ROD = 5.8%), which may reflect their different contributions to wheat improvement (Fig. 3a). To identify genetic markers and regions selected by wheat breeders, we performed a genome-wide scan across all samples using the individual-centric method PCAdapt12 to take into account the graduated population structure within and between groups, and at a higher granularity among European and Asian samples separately (Supplementary Fig. 7). We identified 5,089 polymorphic sites exhibiting improvement signals (P < 0.0001, red dots in Fig. 1, circle 4). Known genes, including genes for photoperiod sensitivity (Ppd) and vernalization (VRN), reduced height (Rht), Glutenin (Glu) and Gliadin (Gli) genes involved in storage protein accumulation, Frizzy panicle (FZP), grain number (GNS), waxy (Wx) as well as the CUL (uniCULm) gene driving plant architecture, were located close (< 5 Mb distance) to these improvement signals (Supplementary Table 3)13–16. Large genomic regions (> 10 Mb) where selection appears to have occurred during the two last centuries (between historical groups I and IV) and eventually become fixed, were observed especially on chromosome 1A and the two most structurally re-arranged chromosomes, 4A and 7B, of the wheat genome (Fig. 3b)17. Extension of the 8,308 and 9,948 polymorphic sites associated with improvement footprints observed in the European and Asian genotypes over 2 Mb overlapping windows, defined a cumulative genomic space of 950 Mb (7% of the genome) and 1.3 Gb (9% of the genome) with selection signatures for the two geographical areas, respectively. Interestingly, only 168 Mb (13% to 18% of the previous genomic space under selection) of the genomic regions harboring selection signatures are identical between the European and Asian germplasm, suggesting independent improvement targets from the two geographic origins (Supplementary Fig. 7).
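The overlap percentages quoted above follow from simple arithmetic: 168 Mb is roughly 18% of the 950 Mb European space and roughly 13% of the 1.3 Gb Asian space. The Python sketch below illustrates one way such cumulative spaces could be computed, assuming each significant site is extended by 1 Mb on either side (a 2 Mb window) and that overlapping windows are merged. The function names and toy coordinates are hypothetical and are not taken from the authors' pipeline.

```python
def merge_windows(sites, half_width=1_000_000):
    """Extend each site by half_width on both sides and merge overlapping windows."""
    merged = []
    for pos in sorted(sites):
        start, end = max(0, pos - half_width), pos + half_width
        if merged and start <= merged[-1][1]:
            # Window overlaps the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(w) for w in merged]


def total_length(windows):
    """Cumulative length (bp) covered by a list of non-overlapping windows."""
    return sum(end - start for start, end in windows)


def shared_length(a, b):
    """Length (bp) shared by two sorted lists of non-overlapping windows."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever window ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


# Toy positions on a single chromosome (hypothetical values).
european = merge_windows([5_000_000, 5_500_000, 20_000_000])
asian = merge_windows([6_000_000, 40_000_000])
overlap = shared_length(european, asian)

# Percentages reported in the text, recomputed from the published totals (Mb).
pct_of_european = round(100 * 168 / 950)
pct_of_asian = round(100 * 168 / 1300)
```

In a real analysis the merge would be run per chromosome, and the window size and significance threshold would follow the study's own settings.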



Fig. 3 | Temporal evolution of wheat diversity. a, Nucleotide diversity (π) from the hexaploid wheat genotypes between subgenomes A, B and D, and historical groups (landraces, old cultivars, cultivars and modern varieties) covering the two last centuries of breeding (a timescale legend is given in the white box from 1830 to present). b, Chromosomal distribution of nucleotide diversity (π) between landraces (group I) and modern wheats (group IV, dots) for subgenomes A, B and D, with bars illustrating the range of variation in diversity between these two groups (colored red for ROD ≥ 80%). Large regions of reduced diversity are shown in gray boxes.
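The two quantities shown in Fig. 3 can be illustrated with a short Python sketch. It assumes biallelic SNPs, that every base in a window is callable, and that ROD is defined as 1 − π(derived)/π(ancestral), which matches the description of ROD > 0.8 as a loss of at least four-fifths of the diversity. The function names and toy allele counts are hypothetical, and the estimator is a standard textbook one rather than necessarily the one used in the study.

```python
from collections import defaultdict


def site_diversity(alt_count, n_chrom):
    """Unbiased expected heterozygosity at one biallelic site."""
    if n_chrom < 2:
        return 0.0
    p = alt_count / n_chrom
    return 2.0 * p * (1.0 - p) * n_chrom / (n_chrom - 1)


def windowed_pi(snps, window=1_000_000):
    """Per-site nucleotide diversity in non-overlapping windows.

    snps: iterable of (position, alt_allele_count, n_sampled_chromosomes).
    Returns {window_index: pi}; assumes all `window` bases are callable.
    """
    sums = defaultdict(float)
    for pos, alt, n in snps:
        sums[pos // window] += site_diversity(alt, n)
    return {w: s / window for w, s in sums.items()}


def rod(pi_ancestral, pi_derived):
    """Reduction of diversity; None when the ancestral window has no diversity."""
    if pi_ancestral == 0:
        return None
    return 1.0 - pi_derived / pi_ancestral


# Toy data (hypothetical): two 1 Mb windows, 10 sampled chromosomes per group.
landraces = [(100, 5, 10), (200, 2, 10), (1_500_000, 5, 10)]
modern = [(100, 1, 10), (1_500_000, 5, 10)]

pi_land = windowed_pi(landraces)
pi_mod = windowed_pi(modern)
rod_by_window = {w: rod(pi_land[w], pi_mod.get(w, 0.0)) for w in pi_land}
```

Windows in which the derived group carries no SNP are treated here as having π = 0, which gives ROD = 1; a production analysis would also need to account for missing data and uneven exome coverage.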

We then tested whether the observed allelic variation could be linked to two key life-history traits, heading date (HD) and plant height (PH), by conducting multi-environment genome-wide association studies (GWAS) (Fig. 1, circles 6 and 7). We grew and evaluated 435 hexaploid bread wheat genotypes for HD and PH in four common garden experiments (partially replicated design) in France (Institut National de la Recherche Agronomique (INRA)), Hungary (ATK), Turkey (University of Çukurova) and United Kingdom (KWS). A subset of 390,657 SNPs, stringently filtered for call rate (< 0.80) and minor allele frequency (MAF, >0.05) was used for GWAS. We identified 48 and 40 genomic sites significantly associated (false discovery rate (FDR) of P < 0.01) with variation in HD (Supplementary Fig. 8 and Supplementary Table 4) and PH (Supplementary Fig. 9 and Supplementary Table 5), respectively, including regions (< 15 Mb) containing known (Ppd, VRN, FDL (Flowering Locus D like), WPCL (Phytoclock) for HD and Rht for PH)18 and unknown genes. The current data provide the basis for identifying relevant candidate genes in the previously detected but currently unknown loci for functional validation, as exemplified for the major HD association detected on chromosome 2A, where Cry (Cryptochrome) is a putative driver (Supplementary Fig. 8). Notably, diversity, selection footprint and GWAS analyses clearly



Fig. 4 | Model of reticulated evolution. a, Clustered phylogenetic consensus genotype network of 1,000 maximum likelihood tree topologies inferred from repeated RRHS. Nodes represent individual genotypes and are color-coded by taxon. Node size is proportional to the number of connections (that is, node degree). Edges represent minimal evolutionary distances in the RRHS trees deduced by the MST algorithm and are color-coded by the respective subgenome (green, A; purple, B; brown, D). Edge transparency is proportional to the relative number of RRHS trees where the edge was an MST edge (that is, edge weight). b, Schematic illustration of the hexaploid bread wheat evolutionary scenario based on the stronger edges of the subgenomes phylogenetic networks (extracted using a MST available in Supplementary Figure 11) with green, purple and brown lines denoting the paths from the A, B and D subgenomes. Arrow colors illustrate the phylogenetic relatedness between subgenomes (plain arrows are indicative of the main, vertical signal; dashed arrows show alternative paths well supported by the inferred topologies and indicative of introgression or gene flow). Circles show putative extinct ancestor intermediates. Additional Sitopsis species (grey boxes) were not part of this study, but are included for completeness. The schematic is based on current data, analysis and prior assumptions from the literature.
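The edge weights and node degrees described in the caption can be illustrated with a minimal Python sketch. It assumes that each replicate (for example, one RRHS tree) has already been reduced to a table of pairwise distances between genotypes. A minimum spanning tree (MST) is built per replicate with Kruskal's algorithm, and each edge is weighted by the fraction of replicates in which it appears. The data layout, function names and toy distances are hypothetical; the tree inference and community clustering steps of the study are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def mst_edges(names, dist):
    """Kruskal's algorithm; dist maps frozenset({a, b}) -> distance."""
    parent = {n: n for n in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    pairs = sorted(combinations(sorted(names), 2),
                   key=lambda pair: dist[frozenset(pair)])
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.add(frozenset((a, b)))
    return edges


def consensus_network(names, replicate_dists):
    """Edge weight = fraction of replicates in which the edge is an MST edge."""
    counts = Counter()
    for dist in replicate_dists:
        counts.update(mst_edges(names, dist))
    n_rep = len(replicate_dists)
    weights = {edge: c / n_rep for edge, c in counts.items()}
    # Node degree = number of distinct consensus edges touching the node.
    degree = Counter(node for edge in weights for node in edge)
    return weights, degree


# Toy example (hypothetical): three genotypes, two replicates.
names = ["A", "B", "C"]
rep1 = {frozenset("AB"): 1.0, frozenset("BC"): 2.0, frozenset("AC"): 3.0}
rep2 = {frozenset("AB"): 1.0, frozenset("BC"): 3.0, frozenset("AC"): 2.0}
weights, degree = consensus_network(names, [rep1, rep2])
```

In this toy case the A–B edge is an MST edge in both replicates (weight 1.0), whereas B–C and A–C each appear in one replicate (weight 0.5), which corresponds to the more transparent edges in a plot such as Fig. 4a.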

showed that only a small fraction of homoeologous loci harbour coincident signals, supporting the view that modern hexaploid bread wheats behave genetically as diploids, as suggested previously by the finding that convergent (also referred to as parallel advantageous) selection is rare between homoeologous regions13.

Wheat origins. Finally, we implemented a network-based phylogenetic approach19,20 that involves the inference of 1,000 trees from repeated random haplotype samples (RRHS) with maximum likelihood, subsequent graph reconstruction analysis and community clustering to reconstruct the reticulated evolutionary history of modern hexaploid bread wheats from their diploid and tetraploid progenitors. The resulting clustered consensus network (Fig. 4a) comprises signals of vertical (species relationships) and horizontal (reticulation) events within the Triticum–Aegilops species complex. The intermediate positioning in the network of synthetic polyploid wheats (that is, synthetic T. turgidum deriving from T. durum × T. dicoccoides and synthetic T. aestivum deriving from T. durum × A. tauschii) between their direct progenitors validates the robustness of our phylogenetic inference of the entire wheat panel (Fig. 4a and Supplementary Fig. 10). An integrative model of wheat evolution (Fig. 4b) was derived from the combined conclusions drawn from the in-depth analysis of the networks’ edges and edge weights19,20 (Fig. 4a, Supplementary Fig. 10 and Supplementary Table 6), and supported through the evaluation of alternative consensus tree topologies21 (Supplementary Fig. 11) and gene flow tests using Patterson’s D statistic22 (Supplementary Fig. 12 and Supplementary Table 7). For example, we were able to reconstruct, at the subgenome level, the introgressions at the base of modern synthetic T. turgidum (F7 RIL (recombination inbred line) offspring) polyploids mentioned earlier and detected as hybrids (Supplementary Figs. 10–12) by all methods with dominant T. dicoccoides genotypes and multiple independent T. durum introgressions23, illustrating the resolution obtained by using a combination of complementary approaches.

Our proposed model (Fig. 4b) largely refines the widely accepted evolutionary path leading to modern bread wheat with the hybridization of wild diploid AA and SS (close to BB) genotypes leading to wild tetraploid AABB progenitors, which subsequently hybridized with a wild diploid DD genotype resulting in the hexaploid T. aestivum (AABBDD) lineage23. In our analysis, the wheat B genome is confirmed to be derived from the Aegilops section Sitopsis lineage, which gave rise to A. speltoides (SS), while the progenitors of A. tauschii and T. urartu represent the established origins of the D and A genome lineages, respectively24. T. araraticum (also referenced as T. araraticum Jakubz) represents the closest wild descendant of the AAGG tetraploid ancestor. It appears to have been subsequently domesticated to form T. timopheevii (Zhuk.) Zhuk while also hybridizing with T. boeoticum leading to the hexaploid T. zhukovskyi (Menabde & Ericzjan) lineage (AAAAGG)25,26. The model confirms wild emmer (T. dicoccoides) as the closest descendant of the progenitor of the modern A and B wheat subgenomes of all of the modern tetraploid AABB and hexaploid AABBDD genotypes.

Our data suggest that during the early phase of domestication and cultivation, a pool of wild emmer wheat T. dicoccoides (Körn. ex Asch. and Graebner) Schweinf. gave rise to at least two distinct lineages of domesticated tetraploids, T. dicoccum Schrank ex Schübl. (domesticated emmer, also known as T. dicoccon Schrank) and T. durum Desf. (domesticated durum or hard or wheat)27,28. Finally, the model supports the hypothesis that T. aestivum is most likely to be derived from an ancestral hybridization event between the previous T. durum29 lineage and a D lineage close to wild A. tauschii30 (Fig. 4b and Supplementary Fig. 11). Subsequently, T. spelta emerged from the hybridization between the hexaploid T. aestivum and the tetraploid T. dicoccum, and still harbors evidence of T. dicoccum introgressions today (Supplementary Fig. 12). Additional putative reticulation events (Supplementary Fig. 12), supported only by either of the three analytical approaches (network, tree, Patterson’s D) need further investigation and were not integrated in our evolutionary model (Fig. 4b).

Such a reticulated evolutionary scenario would first have led to a founder hexaploid bread wheat gene pool (α community, Fig. 4a) that was established during and following domestication. This would probably have consisted of primitive wheat landraces originating in the Fertile Crescent, leading to two (β and γ) derived communities of hexaploids (Fig. 4a and Supplementary Fig. 13). While the γ cluster is enriched for modern genotypes from Western Europe (that is, lines originating from 1986 or later, mostly comprising wheat cultivars and current varieties), the β cluster is enriched for Eastern European countries formerly part of the Warsaw Pact from May 1955, during the Cold War. The clear separation in evolutionary phylogeny between these two modern pools (β and γ) may reflect how human history and resulting socioeconomic consequences have influenced the genetic makeup of modern wheat germplasm, with β genotypes still grown in Hungary and Ukraine today, while γ genotypes still dominate many parts of the European Union.

Discussion

Bread wheat derives from a reticulated evolution from its di- and tetraploid progenitors involving massive and recurrent hybridization and gene flow events, with T. dicoccoides being the progenitor of the A and B subgenomes of all the modern tetraploid and hexaploid genotypes, and the T. durum lineage being the most likely ancestor of today’s bread wheat cultivated germplasm. Such a complex history of hybridizations and gene flows explains the observed partitioning of diversity at the genomic scale (impoverished on the D subgenome) and supports the view that modern hexaploid bread wheats behave genetically as diploids with compartmentalized selective footprints, as well as trait loci, with only a small fraction of homoeologous loci harbouring domestication and/or breeding sweeps or driving phenotypic traits showing coincident signals. Modern bread wheat originated in the Fertile Crescent some 10,000 ybp and the variation observed in the gene pool today has been shaped during domestication by human migration, anthropogenic selection and latterly by breeding. Associations identified between diversity and both known and novel genes influencing plant height and flowering time demonstrate the potential value of our panel for both fundamental and applied studies. Importantly, the hallmarks of adaptation to new environments remain highly topical research subjects during a period of accelerated climate change and both our selective sweep analyses and GWAS highlight targets for future gene and/or allele discovery. The combined data and germplasm collection that we report here and that we make available to the broader research community represents a rich source of genetic diversity that can be applied to understanding and improving diverse traits, from environmental adaptation to disease resistance and nutrient use efficiency.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41588-019-0393-z.

Received: 24 April 2018; Accepted: 13 March 2019; Published online: 1 May 2019

References

1. Feldman, M. & Levy, A. A. Genome evolution due to allopolyploidization in wheat. Genetics 192, 763–774 (2012).
2. Tanno, K. & Willcox, G. How fast was wild wheat domesticated? Science 311, 1886 (2006).


3. Brown, T. A., Jones, M. K., Powell, W. & Allaby, R. G. The complex origins of domesticated crops in the Fertile Crescent. Trends Ecol. Evol. 24, 103–109 (2009).
4. Bocquet-Appel, J. P., Naji, S., Vander Linden, M. & Kozlowski, J. K. Detection of diffusion and contact zones of early farming in Europe from the space-time distribution of 14C dates. J. Archaeol. Sci. 36, 807–820 (2009).
5. Szécsényi-Nagy, A., Brandt, G., Keerl, V., Jakucs, J. & Haak, W. Tracing the genetic origin of Europe’s first farmers reveals insights into their social organization. Proc. R. Soc. B. 282, 20150339 (2015).
6. Damania, A. B. et al. (eds) The Origin of Agriculture and Crop Domestication (ICARDA, 1997).
7. Warr, A. et al. Exome sequencing: current and future perspectives. G3 (Bethesda) 5, 1543–1550 (2015).
8. The International Wheat Genome Sequencing Consortium (IWGSC) et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191 (2018).
9. Matsuoka, Y. Evolution of polyploid Triticum wheats under cultivation: the role of domestication, natural hybridization and allopolyploid speciation in their diversification. Plant Cell Physiol. 52, 750–764 (2011).
10. Gao, L., Zhao, G., Huang, D. & Jia, J. Candidate loci involved in domestication and improvement detected by a published 90K wheat SNP array. Sci. Rep. 7, 44530 (2017).
11. Wright, S. I. et al. The effects of artificial selection on the maize genome. Science 308, 1310–1314 (2005).
12. Luu, K., Bazin, E. & Blum, M. G. pcadapt: an R package to perform genome scans for selection based on principal component analysis. Mol. Ecol. Resour. 17, 67–77 (2017).
13. Jordan, K. W., Wang, S., Lun, Y., Gardiner, L. J., MacLachlan, R. A haplotype map of allohexaploid wheat reveals distinct patterns of selection on homoeologous genomes. Genome Biol. 16, 48 (2015).
14. Cavanagh, C. R. et al. Genome-wide comparative diversity uncovers multiple targets of selection for improvement in hexaploid wheat landraces and cultivars. Proc. Natl Acad. Sci. USA 110, 8057–8062 (2013).
15. Joukhadar, R., Daetwyler, H. D., Bansal, U. K., Gendall, A. R. & Hayden, M. J. Genetic diversity, population structure and ancestral origin of Australian wheat. Front. Plant Sci. 8, 2115 (2017).
16. Nielsen, N. H., Backes, G., Stougaard, J., Andersen, S. U. & Jahoor, A. Genetic diversity and population structure analysis of European hexaploid bread wheat (Triticum aestivum L.) varieties. PLoS ONE 9, e94000 (2014).
17. Devos, K. M., Dubcovsky, J., Dvorak, J., Chinoy, C. N. & Gale, M. D. Structural evolution of wheat chromosomes 4A, 5A, and 7B and its impact on recombination. Theor. Appl. Genet. 91, 282–288 (1995).
18. Nadolska-Orczyk, A., Rajchel, I. K., Orczyk, W. & Gasparis, S. Major genes determining yield-related traits in wheat and barley. Theor. Appl. Genet. 130, 1081–1098 (2017).
19. Gardiner, L. J. et al. Hidden variation in polyploid wheat drives local adaptation. Genome Res. 28, 1319–1332 (2018).
20. Lischer, H. E., Excoffier, L. & Heckel, G. Ignoring heterozygous sites biases phylogenomic estimates of divergence times: implications for the evolutionary history of Microtus voles. Mol. Biol. Evol. 31, 817–831 (2014).
21. Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
22. Martin, S. H., Davey, J. W., Jiggins, C. D. Evaluating the use of ABBA-BABA statistics to locate introgressed loci. Mol. Biol. Evol. 32, 244–257 (2015).
23. Ben-David, R. et al. Dissection of powdery mildew resistance uncovers different resistance types in the T. turgidum L. gene pool. in 11th Int. Wheat Genetics Symposium (Sydney University Press, 2008).
24. El Baidouri, M., Murat, F., Veyssiere, M., Molinier, M. & Flores, R. Reconciling the evolutionary origin of bread wheat (Triticum aestivum). New Phytol. 213, 1477–1486 (2017).
25. Balint, A. F., Kovacs, G. & Sutka, J.
27. Civáň, P., Ivaničová, Z. & Brown, T. A. Reticulated origin of domesticated emmer wheat supports a dynamic model for the emergence of agriculture in the Fertile Crescent. PLoS ONE 8, e81955 (2013).
28. Luo, M. C., Yang, Z. L., You, F. M., Kawahara, T. & Waines, J. G. The structure of wild and domesticated emmer wheat populations, gene flow between them, and the site of emmer domestication. Theor. Appl. Genet. 114, 947–959 (2007).
29. Matsuoka, Y. & Nasuda, S. Durum wheat as a candidate for the unknown female progenitor of bread wheat: an empirical study with a highly fertile F1 hybrid with Aegilops tauschii Coss. Theor. Appl. Genet. 109, 1710–1717 (2004).
30. Wang, J., Luo, M. C., Chen, Z., You, F. M. & Wei, Y. Aegilops tauschii single nucleotide polymorphisms shed light on the origins of wheat D-genome genetic diversity and pinpoint the geographic origin of hexaploid wheat. New Phytol. 198, 925–937 (2013).

Acknowledgements

The authors wish to thank the INRA Biological Resources Center on small grain cereals (https://www6.ara.inra.fr/umr1095_eng/Teams/Research/Biological-Resources-Centre) for providing and passport data, and for establishing a wheat biorepository. The authors thank the Federal ex situ Genbank Gatersleben, Germany (IPK), the N. I. Vavilov All-Russian Research Institute of Plant Industry, Russia (VIR), Centre for Genetic Resources, WUR, Netherlands (CGN), Kyoto University, National Bioresource Project, Japan (NBRP), the Australian Winter Cereal Collection Tamworth, Australia (AWCC), the National Plant Germplasm System, USA (USDA-ARS), the International Center for Agriculture Research in the Dry Areas (ICARDA), the Max Planck Institute for Plant Breeding Research Cologne, Germany (MPIPZ), Germplasm Resource Unit at the John Innes Centre UK (JIC) and the Wheat and Barley Legacy for Breeding Improvement (WHEALBI) consortium for providing plant material and passport data. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under grant agreement FP7-613556, Whealbi project (http://www.whealbi.eu/project/). R.W. and J.R. also acknowledge support from the Scottish Government Research Program and R.W. from the University of Dundee. H.O. acknowledges support from Çukurova University (FUA-2016–6033). K.F.X.M. acknowledges support from the German Federal Ministry of Food and Agriculture (2819103915) and the DFG (SFB924). T.L. acknowledges support from the Agence Nationale pour la Recherche (BirdIslandGenomic project 14-CE02-0002), European Research Council (TREEPEACE project, grant agreement 339728) and the bioinformatics platform from Toulouse Midi-Pyrénées (Bioinfo Genotoul) for providing computing and storage resources. J.S. acknowledges support from the Région Auvergne-Rhône-Alpes and FEDER Fonds Européens de Développement Régional (23000816 SRESRI 2015), the CPER contrat de plan État-région (23000892 SYMBIOSE 2016) and AgreenSkills fellowship (applicant ID 4146).

Author Contributions

F.B., B.K. and N.S. carried out panel constitution and distribution. S.D., J.R. and R.W. carried out exome sequencing. D.W., M.Se., M.Sp. and G.H. carried out variant (SNP and indel) calling. C.P., D.A., N.G., M.Se., D.L. and W.D. carried out variant analysis. D.L., W.D., M.Se., C.P., N.G. and G.H. carried out phylogenetic analysis. T.L., C.P. and D.A. carried out diversity analysis and selection footprints. A.T., D.B.K., C.P., H.O., M.M., F.E. and L.C. carried out field experiments and GWAS. B.K., J.R., K.F.X.M., R.W., N.S., L.C., G.H., G.C. and J.S. conceived and supervised the study and prepared the article.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information is available for this paper at https://doi.org/10.1038/s41588-019-0393-z.
Origin and of wheat Reprints and permissions information is available at www.nature.com/reprints. in the light of recent research. Acta Agronomica Hungarica 48, Correspondence and requests for materials should be addressed to J.S. 301–313 (2000). 26. Nesbitt, M. & Samuel, D. From staple crop to extinction: the archaeology and Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in history of the hulled wheats. in Proc. 1st International Workshop on Hulled published maps and institutional affiliations. Wheats (International Plant Genetic Resources Institute, 1996). © The Author(s), under exclusive licence to Springer Nature America, Inc. 2019

Nature Genetics | VOL 51 | MAY 2019 | 905–911 | www.nature.com/naturegenetics

Methods

Plant material. The wheat panel consists of 487 genotypes comprising 13 diploid, 38 tetraploid (including 25 AABB) and 436 hexaploid (including 435 AABBDD) genotypes, including landraces, cultivars and currently grown varieties from 68 countries. For a detailed description of all lines and varieties used in this study see Supplementary Table 1 and passport information available at https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/iwgsc_refseqv1.0_Whealbi_GWAS.zip. The genotypes were structured into four historical groups: landraces (group I, before 1935), old cultivars (group II, from 1936 to 1965), cultivars (group III, from 1966 to 1985) and modern varieties (group IV, after 1986).

The genotypes have been grouped according to country of origin: Af, Afghanistan; Ag, Argentina; Alb, Albania; Alg, Algeria; Ar, Armenia; As, Austria; Au, Australia; Az, Azerbaijan; Be, Belgium; Bu, Bulgaria; Br, Brazil; Ca, Canada; Ch, China; Co, Colombia; Cr, Croatia; Cz, Czech Republic; Dn, Denmark; Eg, Egypt; Et, Ethiopia; Fi, Finland; Fr, France; Fyrm, Former Yugoslav Republic of Macedonia; Ge, Germany; Go, Georgia; Gr, Greece; Hu, Hungary; In, India; Ir, Iran; Is, Israel; It, Italy; Ja, Japan; Ke, Kenya; Ko, Korea; La, Latvia; Le, Lebanon; Me, Mexico; Mo, Morocco; Ne, Nepal; Ni, Niger; Nt, Netherlands; Nz, New Zealand; Pa, Pakistan; Po, Poland; Pr, Portugal; Ro, Romania; Ru, Russia; Sa, South Africa; Sp, Spain; Sw, Sweden; Sy, Syria; Sz, Switzerland; Ta, Tajikistan; Tk, Turkmenia; Tn, Tunisia; Tu, Turkey; Uk, United Kingdom; Ukr, Ukraine; Ur, Uruguay; Usa, United States of America; Zi, Zimbabwe.

Finally, the genotypes have been grouped according to region or continent of origin: Fertile Crescent (Ir, Is, Le, Sy, Tu), Central Asia (Af, Ar, Az, Go, Pa, Ta, Tk), Eastern Asia (Ch, Ja, Ko), Northern Asia (Ru), Central Europe (Alb, Bu, Cr, Cz, Fi, Hu, La, Po, Ro, Ukr, Fyrm), Southern Europe (Gr, It, Pr, Sp), Western Europe (As, Be, Dn, Fr, Ge, Sw, Sz, Nt, Uk), Indian Peninsula (In, Ne), Northern Africa (Alg, Eg, Mo, Tn), Sub-Saharan Africa (Et, Ke, Ni, Sa, Zi), South America (Ag, Br, Co, Ur), North America (Ca, Me, Usa) and Oceania (Au, Nz).

Exome sequencing. As the large wheat genome consists of >80% mobile and repeated elements, whole-genome resequencing is a cost-intensive and probably highly error-prone approach to comprehensively cataloguing genetic diversity. To circumvent this limitation, we used a wheat exome-based target enrichment sequencing assay to capture variation in and around the gene-containing regions of 487 wheat genotypes. We selected the Nimblegen SeqCap EZ wheat exome design (120426_Wheat_WEC_D02, Roche), https://sequencing.roche.com/en/products-solutions/by-category/target-enrichment/shareddesigns.html. This design comprises 106.9 Mb of low-copy regions of the wheat genome and was developed by the Wheat Exome consortium (refs. 7, 13). To optimize the cost, we used a multiplex of six individually barcoded accession DNAs, combined prior to capture. Captured DNA was sequenced as paired ends (2 × 125 bp) on Illumina HiSeq instruments using HiSeq2500 high output mode. Genomic DNA (gDNA) samples were checked using the PerkinElmer DropSense to verify gDNA integrity. Samples were quantified using the Picogreen assay and normalized to 20 ng µl⁻¹ in 10 nM Tris-HCl (pH 8.0), as suggested in the NimbleGen SeqCap EZ Library SR protocol. The gDNA was fragmented to a mean fragment size of 350 bp and whole-genome libraries were prepared according to the Kapa Library Preparation protocol and quantified by Nanodrop. Six libraries were pooled and used for the hybridization with the SeqCap EZ oligo pool (120426_Wheat_WEC_D02) in a thermocycler at 47 °C for 72 h. Capture beads were used to pull down the complex of capture oligonucleotides and genomic DNA fragments, and unbound fragments were removed by washing. Enriched fragments were amplified by PCR and the final library was quantified by quantitative PCR (qPCR) and visualized on an Agilent Bioanalyser. Sequencing libraries were normalized to 2 nM, NaOH denatured and used for cluster amplification using the cBot clonal amplification system (Illumina). The clustered flow cells were sequenced on Illumina HiSeq2500 high output mode with a six-plex strategy (that is, six samples per HiSeq lane) with a 125 bp paired-end run module. Exome capture (Nimblegen) and next-generation sequencing (Illumina) delivered on average 34 million read pairs per genotype.

Variant calling. Raw reads were mapped to the hexaploid Chinese Spring reference sequence v1.0 using the ‘mem’ subcommand of BWA (v0.7.12, http://bio-bwa.sourceforge.net/). Samtools (v1.3, http://samtools.sourceforge.net/, http://www.htslib.org/) was used to mark duplicate reads. Variants were called using samtools and bcftools (v1.3) and filtered for overlap with exonic regions (± 1 kb) based on the current IWGSC genome annotation v1.0. Genotypes were subsequently filtered for a minimum sequencing depth of 3 and a genotype quality of at least 10. Two sets of genotype calls from experiments with Axiom 35k and iSelect 80k SNP arrays (details from CerealsDB, http://www.cerealsdb.uk.net/) with a total of 38 genotypes in common between either of those sets and the current exome capture were used to approximate optimal filter criteria leading to the lowest FDR. To derive the initial set of variants, variant positions were removed if the total count of samples with a defined (that is, non-missing) genotype was below 10 or the MAF was below 1%. This set was further subjected to imputation using beagle (default parameters) and the output was again analyzed for FDR by comparison to iSelect genotype information. Based on this evaluation, the optimal trade-off between FDR, number of variants and missing values was considered to be 0.6 for genotype probability (estimated by beagle) and 4% MAF (after imputation), and the final imputed variant data set was generated applying these criteria. We detected 620,158 small-scale variant positions on the IWGSC Refseqv1.0 wheat genome assembly (targeting 41,032 of the 110,790 high-confidence genes). The variants comprised 56,163 indels (9%) and 563,995 SNPs (91%). Of the variants, 595,939 (96%) were found in the 435 AABBDD hexaploid genotypes (Supplementary Table 2).

Phylogenetic analysis. The phylogeny of the 435 hexaploid bread wheat genotypes was inferred on an alignment of 91,554 SNPs (for a total of 65,467 distinct alignment patterns) found on triplets (2,855) of orthologous genes conserved in all three subgenomes, A, B and D. The data were analyzed with iqtreeX (ref. 21) (GTR + GAMMA(4) model), with 1,000 ultrafast bootstraps (ref. 31). Geographical regions for ancestral nodes were reconstructed using the following protocol: 10,000 simulations were performed using the stochastic mapping algorithm of the R phytools package (ref. 32) (using the equal rates model); the region of a node was then chosen as the one with maximum sampled frequency. Eleven major tree clades were identified based on criteria of size, representativeness and statistical support to offer good coverage of the tree, while taking into account sampling bias for European individuals. World maps illustrating the geographical structure of the panel diversity (Fig. 2b and Supplementary Figs. 5–13) were obtained with the R packages countrycode, geosphere and maps.

The phylogeny of the 487 di-, tetra- and hexaploid wheats was inferred while accounting for ambiguities and possible biases in phylogenetic inference from SNP data arising from varying levels of heterozygosity, linkage disequilibrium, incomplete lineage sorting and reticulate evolution. In that regard, we implemented a network-based approach to reconstruct the species history and community structure in the sampled genotypes. To this end, we stringently filtered biallelic, polymorphic SNPs present in >90% of the genotypes from non-imputed data accounting for linkage disequilibrium (delivering 15,490 filtered SNPs), and implemented a repeated random haplotype sampling procedure including heterozygous sites (RRHS; ref. 20) to infer 1,000 maximum likelihood tree topologies with the ASC_GTRGAMMA model and JC69 distances in RAxML (asc-corr = felsenstein). While these RRHS trees were also analyzed in the form of conventional consensus topologies and densitree visualizations to infer taxonomic clades, we analyzed the evolutionary distances among the tips of the 1,000 trees using the minimum spanning tree (MST) algorithm in Python. The MST graphs were subsequently combined into a weighted, phylogenetic consensus network whose nodes were clustered into clades using the Girvan–Newman Edge-Betweenness algorithm in Cytoscape 3.6 (ref. 33). The clustered network topology was plotted considering edge-betweenness in Cytoscape and taxonomic clades were inferred by intersection of community clusters with taxon information that was annotated using the AutoAnnotate plugin (ref. 33). The relative number of RRHS trees for which a respective edge was selected by the MST algorithm was used as the edge weight and was interpreted similarly to bootstrap support values in the consensus tree topologies. The composition, geographical and historical origins of the identified wheat communities were analyzed using chi squared tests and barplots in R. Gene flow in subgenomes A and B was investigated with Patterson’s D statistic (or ABBA-BABA statistic) using ANGSD (ref. 34) with a threshold of Z > 4 (ref. 35).

An integrative model (Fig. 4b) of wheat evolution was built by manual consolidation of the support values of the edges in the phylogenetic consensus network (Supplementary Fig. 10 and Supplementary Table 7), the various consensus and IUPAC tree topologies (Supplementary Fig. 11), the ABBA-BABA results (Supplementary Fig. 12) as well as the literature. Where species relationships remained ambiguous on the sole basis of the network approach (that is, when similar phylogenetic relatedness between groups of genotypes defines several possible evolutionary paths between putative progenitors and descendants), we then considered the results of the ABBA-BABA statistical test (Supplementary Fig. 12 and Supplementary Table 6) and the existing literature when available. Fig. 4b reports only the reticulation events identified on the basis of phylogenetic consensus networks supported by the ABBA-BABA analysis in both the A and B subgenomes.

Diversity analysis and selection footprints. Genome scans for breeding signatures among the hexaploid samples were performed with PCAdapt (ref. 12), an individual-based genome scan method able to handle massive NGS data. Given that PCAdapt is based on principal components, this method does not require any partitioning of the data set into different groups and can therefore be applied to the continuous pattern of population structure. This method is therefore conceptually robust to any source of errors associated with the boundaries of these groups and can take into account the gradual variation among all individuals of the improvement continuum (that is, time series-like data). For each data set, selection of the best number of principal components (K) was performed after a first assessment of the percentage of variance explained by 20 principal components. Analyses were performed assuming K = 4 for the whole data set and K = 3 for both European and Asian data sets. Computations were run under R (v3.4.3). Candidate genes for improvement are either associated with highly significant P values or considered in close vicinity (0 to 5 Mb) to loci with breeding signatures (Supplementary Table 3). π and Tajima’s D were computed over 1 Mb non-overlapping sliding windows on European genotypes to take into account the strong signal of intercontinental genetic signatures. To perform this analysis, we took into account the number of sites covered by reads aligned to the reference. All sites with a total depth of coverage greater than 1,461 (that is, at least three reads per individual on average) were considered as covered. A ROD index was then estimated for each 1 Mb window by comparing the diversity of each group (II, III or IV) to the landraces (group I) as follows:

$$\mathrm{ROD} = 1 - \frac{\pi_{\text{group}}}{\pi_{\text{landraces}}}$$

To further explore population structure, PCA was performed with the R package FactoMineR (ref. 36). Signatures of improvement were detected for loci associated with P < 0.0001 (that is, −log10(P) > 4). Similarly, for domestication signatures, genomic regions were identified using differences in π between diploid, tetraploid and hexaploid genotypes by computing the ROD index over 1 Mb windows. Signatures of domestication were detected for regions associated with ROD > 0.8. Candidate genes for domestication were considered in close vicinity (0 to 5 Mb) to loci with domestication signatures (Supplementary Table 3). Visualization was performed using R (ref. 37) packages such as graphics, stats, circlize and dendextend (refs. 38–41).

Field experiments and GWAS. These analyses focused on 435 hexaploid wheat genotypes evaluated for HD and PH in four common garden experiments in France (INRA), Hungary (ATK), Turkey (University of Çukurova) and the United Kingdom (KWS). Trials were grown under an augmented partially replicated design with 20% of the genotypes replicated twice and two check cultivars assigned uniformly to eight plots. Raw data were corrected for spatial heterogeneity using replicated controls and the SpATS package in R (ref. 42). After filtering out SNPs with a call rate < 0.80 and MAF < 0.05, a final set of 390,657 SNPs was used for subsequent analysis. Finally, circular genome maps were drawn using the R package circlize (ref. 43). A chromosome-specific kinship matrix was calculated using 1,000 SNPs sampled at random from each chromosome. This chromosome-specific kinship, $A^{AB}$, is the realized additive genetic relationship matrix calculated from all molecular markers along the whole genome, except those in the chromosome being tested (ref. 44). For example, to test SNPs in chromosome 1A, $A^{AB}$ was calculated with SNPs sampled from all chromosomes, except those from 1A. $A^{AB}$ was calculated following the equation proposed by Astle and Balding (ref. 45), with a typical entry for the relationship between genotypes i and j of:

$$A^{AB}_{ij} = \frac{1}{K} \sum_{k=1}^{K} \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2p_k(1 - p_k)} \qquad (1)$$

where $x_{ik}$ is a marker score indicating the allele count of the least frequent allele (2, 1 or 0) for genotype i at marker k, and $p_k$ is the corresponding allele frequency. The matrix above was calculated using the ‘realizedAB’ option in the ‘kin’ function of the Synbreed package (ref. 46).

A multi-environment mixed model GWAS analysis was performed analogous to the method described by Millet et al. (ref. 47) and Thoen et al. (ref. 48). Correction for population structure and kinship was done on the basis of eigenvectors (‘principal components’) extracted from the chromosome-specific Astle and Balding kinship matrices, $A^{AB}$. The number of significant principal components was calculated following Patterson et al. (ref. 49). We scanned the genome with the following single locus model:

$$y_{ij} = \mu + E_j + \sum_{p=1}^{P} x^{PC}_{ip}\,\beta^{G}_{p} + G_i + \sum_{p=1}^{P} x^{PC}_{ip}\,\beta^{GE}_{jp} + x^{SNP}_{i}\,\beta^{SNP}_{j} + GE_{ij} + \varepsilon_{ij} \qquad (2)$$

In equation (2), $\mu$ is an intercept term, $E_j$ is the fixed environmental main effect, $x^{PC}_{ip}$ stands for the genotype-specific scores on the pth kinship principal component, with p = 1…P, and $\beta^{G}_{p}$ and $\beta^{GE}_{jp}$ are the corresponding fixed regression coefficients for these principal components correcting for population structure with respect to the genotype main effect and the G × E interaction, respectively. $\beta^{SNP}_{j}$ is a term for the fixed SNP effect, while $x^{SNP}_{i}$ contains the marker information. This means that fitted QTLs are allowed to have an environment-specific effect or, equivalently, that at each marker position the model fits a QTL main effect and a QTL × E term simultaneously. The null hypothesis that $\beta^{SNP}_{j}$ is zero in all environments, against the alternative that it is non-zero in at least one environment, was assessed with a Wald test (refs. 50, 51). $G_i$ is a random genotypic main effect and $GE_{ij}$ is a random genotype by environment interaction. The random terms for $G_i$ and $GE_{ij}$ have variances $V_G$ and $V_{GE}$, which were restricted to be positive. The error term $\varepsilon_{ij}$ is environment-specific and was confounded with the $GE_{ij}$ term. The model was fitted in ASREML-R (VSN-International, 2016). Genomic control was applied a posteriori to correct for inflation (ref. 52).

The genome-wide significance threshold with multiple testing correction was calculated following the method proposed by Li and Ji (ref. 53). For each chromosome, the correlation matrix for the SNPs was calculated. Then, the effective number of independent tests per chromosome was estimated from the eigenvalues of the correlation matrix. The effective number of independent tests was summed across chromosomes ($M_{\mathrm{eff}}$) and the significance threshold for individual markers was calculated as

$$\alpha_p = 1 - (1 - \alpha_e)^{1/M_{\mathrm{eff}}}$$

where the genome-wide test level was $\alpha_e = 0.005$.

Software. The relevant source codes and analysis workflows used to generate the results presented in the different sections of the manuscript, figures and tables have been deposited in a public github repository, accessible at https://github.com/dandaman/whealbiCode. They comprise the following analysis steps: (1) raw data processing and variant calling (Fig. 1), (2) genome-wide association study (Fig. 1), (3) geographical component of the panel structure (Fig. 2a), (4) components of the panel structure (Fig. 2b), (5) computation of nucleotide diversity (Fig. 3), (6) phylogeny and reticulate evolution of the wheat species complex using repeated random haplotype sampling (RRHS; Fig. 4) and (7) inference of hybridization and introgression events using ABBA-BABA (Fig. 4b).

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data Availability
All data analyzed and generated during this study are included in this published article and its supplementary information files (6 tables and 13 figures) and are available online at https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/iwgsc_refseqv1.0_Whealbi_GWAS.zip (the catalog of imputed and non-imputed variants as a vcf file and passport information for the 487 genotypes as an .xls file). The Whealbi SNP data are open access and can be viewed in the IWGSC reference genome browser (ref. 54) at https://urgi.versailles.inra.fr/jbrowseiwgsc/gmod_jbrowse/?data=myData%2FIWGSC_RefSeq_v1.0. The sequence data are available at NCBI under the accession number PRJNA524104.

References
31. Minh, B. Q., Nguyen, M. A. & von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
32. Revell, L. J. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223 (2012).
33. Kucera, M. et al. AutoAnnotate: a Cytoscape app for summarizing networks with semantic annotations. F1000 Res. 5, 1717 (2016).
34. Korneliussen, T. S., Albrechtsen, A. & Nielsen, R. ANGSD: analysis of next generation sequencing data. BMC Bioinformatics 15, 356 (2014).
35. Durand, E. Y., Patterson, N., Reich, D. & Slatkin, M. Testing for ancient admixture between closely related populations. Mol. Biol. Evol. 28, 2239–2252 (2011).
36. Lê, S. et al. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 25, 1–18 (2008).
37. R Core Team. R: a Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2014).
38. Chipman, H. & Tibshirani, R. Hybrid hierarchical clustering with applications to microarray data. Biostatistics 7, 286–301 (2006).
39. Schmidtlein, S., Tichy, L., Hannes, F. & Ulrike, F. A brute-force approach to vegetation classification. J. Veg. Sci. 21, 1162–1171 (2010).
40. Witten, D. M. & Tibshirani, R. A framework for feature selection in clustering. J. Am. Stat. Assoc. 105, 713–726 (2010).
41. Galili, T. dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 31, 3718–3720 (2015).
42. Rodríguez-Álvarez, M. X., Boer, M. P., van Eeuwijk, F. A. & Eilers, P. H. Correcting for spatial heterogeneity in plant breeding experiments with P-splines. Spatial Stat. 23, 52–71 (2017).
43. Gu, Z., Gu, L., Eils, R., Schlesner, M. & Brors, B. circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014).
44. Rincent, R. et al. Recovering power in association mapping panels with variable levels of linkage disequilibrium. Genetics 197, 375–387 (2014).
45. Astle, W. & Balding, D. J. Population structure and cryptic relatedness in genetic association studies. Stat. Sci. 24, 451–471 (2009).
46. Wimmer, V., Albrecht, T., Auinger, H. J. & Schön, C. C. synbreed: a framework for the analysis of genomic prediction data using R. Bioinformatics 28, 2086–2087 (2012).
47. Millet, E. J. et al. Genome-wide analysis of yield in Europe: allelic effects vary with drought and heat scenarios. Plant Physiol. 172, 749–764 (2016).
48. Thoen, M. P. et al. Genetic architecture of plant stress resistance: multi-trait genome-wide association mapping. New Phytol. 213, 1346–1362 (2017).
49. Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
50. Welham, S. J. & Thompson, R. Likelihood ratio tests for fixed model terms using residual maximum likelihood. J. R. Statist. Soc. 59, 701–714 (1997).
51. Boer, M. P. et al. A mixed-model quantitative trait loci (QTL) analysis for multiple-environment trial data using environmental covariables for QTL-by-environment interactions, with an example in maize. Genetics 177, 1801–1813 (2007).
52. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
53. Li, J. & Ji, L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 95, 221–227 (2005).
54. Alaux, M. et al. Linking the International Wheat Genome Sequencing Consortium bread wheat reference genome sequence to wheat genetic and phenomic data. Genome Biol. 19, 111 (2018).

nature research | reporting summary

Corresponding author(s): Jerome Salse

Last updated by author(s): 2019-25-02

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section (checklist columns: n/a, Confirmed).
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection: All data analyzed and generated during this study are included in this published article and its supplementary information files (6 tables and 13 figures) and are available online at https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v1.0/iwgsc_refseqv1.0_Whealbi_GWAS.zip (catalog of imputed and non-imputed variants as a vcf file and passport information for the 487 genotypes as an .xls file).

Data analysis: The relevant source codes and analysis workflows used to generate the results presented in the different sections of the manuscript, figures and tables, were deposited in a public github repository, accessible at https://github.com/dandaman/whealbiCode and comprise the following analysis steps:
1. Raw data processing and variant calling
2. Genome-wide association study (Figure 1)
3. Geographical component of the panel structure (Figure 2A)
4. Components of the panel structure (Figure 2B)
5. Computation of nucleotide diversity (Figure 3)
6. Studying the phylogeny and reticulate evolution of the wheat species complex using repeated random haplotype sampling (RRHS; Figure 4)
7. Inference of hybridization and introgression events using ABBA-BABA (Figure 4B)

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

The Whealbi SNP data are open access and can be viewed in the IWGSC reference genome browser (ref. 56) at https://urgi.versailles.inra.fr/jbrowseiwgsc/gmod_jbrowse/?data=myData%2FIWGSC_RefSeq_v1.0

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size: In order to explore the origins and patterns of genetic diversity that exist within the currently accessible wheat gene pool, we assembled a worldwide panel of 487 genotypes that included wild diploid and tetraploid relatives, domesticated tetraploid and hexaploid landraces, old cultivars and modern elite cultivars (Table S1). Mapping exome capture sequence data (ref. 7) from these lines onto the ‘Chinese Spring’ reference genome sequence (IWGSC; ref. 8) revealed 620,158 high-confidence genetic variants (including 595,939 SNPs and indels between hexaploid genotypes) distributed across 41,032 physically ordered wheat genes (Table S2).

Data exclusions: Raw reads were mapped to the hexaploid Chinese Spring reference sequence v1.0 using the 'mem' subcommand of BWA (version 0.7.12, http://bio-bwa.sourceforge.net/). Samtools (version 1.3, http://samtools.sourceforge.net/; http://www.htslib.org/) was used to mark duplicate reads. Variants were called using samtools/bcftools (version 1.3) and filtered for overlap with exonic regions (± 1 kb) based on the current IWGSC genome annotation v1.0. Genotypes were subsequently filtered for a minimum sequencing depth (DP) of 3 and a genotype quality (GQ) of at least 10.
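The per-genotype filter stated above (DP of at least 3 and GQ of at least 10) can be sketched in a few lines of Python. This is an illustrative sketch rather than the pipeline used in the study, which relied on samtools/bcftools: the function name, the record layout (one dictionary per sample with GT, DP and GQ keys) and the sample names are assumptions made for the example, and calls that fail either cut-off are simply set to missing.

```python
def filter_genotypes(calls, min_dp=3, min_gq=10):
    """Return genotype calls per sample, with calls that fail the depth
    (DP) or genotype-quality (GQ) cut-off set to missing (None)."""
    filtered = {}
    for sample, call in calls.items():
        passes = call["DP"] >= min_dp and call["GQ"] >= min_gq
        filtered[sample] = call["GT"] if passes else None
    return filtered


# Hypothetical calls at a single variant position.
calls = {
    "sample_A": {"GT": "0/1", "DP": 12, "GQ": 40},
    "sample_B": {"GT": "1/1", "DP": 2, "GQ": 30},  # fails the depth cut-off
    "sample_C": {"GT": "0/0", "DP": 8, "GQ": 5},   # fails the quality cut-off
}

kept = filter_genotypes(calls)
print(kept)
```

Masking individual calls instead of dropping the whole position keeps the site available for the later position-level filters (count of non-missing genotypes and MAF).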

Replication: Two sets of genotype calls from experiments with Axiom 35k and iSelect 80k SNP arrays (details from CerealsDB, http://www.cerealsdb.uk.net/) with a total of 38 genotypes in common between either of those sets and the exome capture were used to approximate optimal filter criteria leading to the lowest false discovery rate (FDR).
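The text does not spell out how the FDR was computed from the array comparison. As one plausible reading, offered only as an assumption, the error rate of a candidate filter setting can be approximated by the share of exome genotype calls that disagree with the array call for the same genotype and marker. The Python sketch below illustrates that idea; the function name, the (genotype, marker) keys and the coding of genotypes as alternative-allele counts are hypothetical.

```python
def discordance_rate(exome_calls, array_calls):
    """Fraction of comparable calls where exome and array genotypes differ.

    Both inputs map (genotype_id, marker_id) to an alternative-allele count
    (0, 1 or 2), or to None when the call is missing. Pairs missing in
    either data set are skipped.
    """
    compared = 0
    discordant = 0
    for key, exome_gt in exome_calls.items():
        array_gt = array_calls.get(key)
        if exome_gt is None or array_gt is None:
            continue
        compared += 1
        if exome_gt != array_gt:
            discordant += 1
    return discordant / compared if compared else float("nan")


# Hypothetical calls for genotypes shared between exome capture and array.
exome = {
    ("g1", "m1"): 0,
    ("g1", "m2"): 2,
    ("g2", "m1"): 1,     # disagrees with the array call
    ("g2", "m2"): None,  # missing in the exome data, skipped
    ("g3", "m1"): 2,     # not genotyped on the array, skipped
}
array = {
    ("g1", "m1"): 0,
    ("g1", "m2"): 2,
    ("g2", "m1"): 0,
    ("g2", "m2"): 1,
}

rate = discordance_rate(exome, array)
print(rate)
```

Under this reading, the rate would be recomputed for each candidate combination of filter thresholds, and the combination with the lowest rate at an acceptable number of retained variants would be kept.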

Randomization: NA

Blinding: NA

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems (checklist columns: n/a, Involved in the study)
- Antibodies
- Eukaryotic cell lines
- Palaeontology
- Animals and other organisms
- Human research participants
- Clinical data

Methods (checklist columns: n/a, Involved in the study)
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging
