ARTICLE

https://doi.org/10.1038/s41467-020-18738-5 OPEN Origin and adaptation to high altitude of Tibetan semi-wild

Weilong Guo 1,4, Mingming Xin1,4, Zihao Wang 1,4, Yingyin Yao1, Zhaorong Hu 1, Wanjun Song1, ✉ Kuohai Yu1, Yongming Chen 1, Xiaobo Wang1, Panfeng Guan1, Rudi Appels2,3, Huiru Peng 1 , ✉ ✉ Zhongfu Ni 1 & Qixin Sun 1

Tibetan wheat is grown under environmental constraints at high-altitude conditions, but its

1234567890():,; underlying adaptation mechanism remains unknown. Here, we present a draft sequence of a Tibetan semi-wild wheat (Triticum aestivum ssp. tibetanum Shao) accession Zang1817 and re-sequence 245 wheat accessions, including world-wide wheat landraces, cultivars as well as Tibetan landraces. We demonstrate that high-altitude environments can trigger extensive reshaping of wheat , and also uncover that Tibetan wheat acces- sions accumulate high-altitude adapted haplotypes of related genes in response to harsh environmental constraints. Moreover, we find that Tibetan semi-wild wheat is a feral form of Tibetan landrace, and identify two associated loci, including a 0.8-Mb deletion region con- taining Brt1/2 homologs and a genomic region with TaQ-5A gene, responsible for rachis brittleness during the de-domestication episode. Our study provides confident evidence to support the hypothesis that Tibetan semi-wild wheat is de-domesticated from local land- races, in response to high-altitude extremes.

1 State Key Laboratory for Agrobiotechnology, Key Laboratory of Crop Heterosis and Utilization (MOE), Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100193, China. 2 AgriBio, Centre for AgriBioscience, Department of Economic Development, Jobs, Transport, and Resources, 5 Ring Road, La Trobe University, Bundoora, VIC 3083, Australia. 3 University of Melbourne, FVAS, Grattan Street, Parkville, VIC 3010, Australia. ✉ 4These authors contributed equally: Weilong Guo, Mingming Xin, Zihao Wang. email: [email protected]; [email protected]; [email protected]

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 1 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

nderstanding the evolution and domestication of chromosomes with the aid of the Chinese Spring (CS) IWGSCv1 Uhas accelerated crop breeding through being able to target reference genome11. A total of 95.51% scaffolds were ordered to new germplasm for crossing into elite lines and creating a form the 21 high-confidence pseudochromosomes, comprising more secure world food supply1–3. Hexaploid or bread wheat 1067 scaffolds with 14.05 Gb (Fig. 1b), which is similar to that of (Triticum aestivum, AABBDD) is a staple food crop for more the reference CS genome (14.07 Gb)11. The Zang1817 genome than one-third of the world population4 and originated ~10,000 assembly reported here is a sequenced genome of a semi-wild years ago in the Near-Eastern Fertile Crescent through hybridi- form of hexaploid wheat to date. zation between domesticated tetraploid wheat, Triticum turgidum A total of 118,078 high-confidence protein-coding genes were (AABB) and wild goatgrass, tauschii (DD)5. As a tem- inferred from the Zang1817 assembly using protein-homology- perate crop, hexaploid wheat is widely distributed around the based prediction methods supported by RNA-sequencing data. world, including the Tibetan Plateau, which has an average alti- Repetitive elements, a major component of complex genomes, tude of 4268 m above sea level6, and is characterized by extensive were widely dispersed throughout the wheat genome11,12.Upto UV-B irradiation, extremely low temperature and strong hypoxia 82.74% of the Zang1817 assembly sequence were annotated as stress. Although such environmental constraints would be repeat sequences, including retrotransposons (63.91%), DNA expected to trigger genomic variation of wheat associated with transposons (18.45%) and other unclassified sequence (1.57%). adaptation to high-altitude (HA), the underlying molecular Copia and Gypsy family retrotransposons accounted for ~44.16% changes are still unknown. Thus, Tibetan wheat provides a sig- and 15.28% of the Zang1817 assembly sequence, respectively nificant resource to understand genomic adaptation to the (Supplementary Table 1). The composition of the different classes environmental extremes at high altitude conditions. In addition, of repetitive DNA in Zang1817 was highly similar to that in the although wild forms of bread wheat are not found, a unique CS genome. We further evaluated the annotation with 1440 Tibetan semi-wild hexaploid wheat (Triticum aestivum ssp. tibe- Benchmarking Universal Single Copy Orthologs (BUSCO)13 tanum Shao) is discovered on the Tibetan Plateau, China7. Huang genes, and found 99.51% genes were recalled in the genome, and his colleagues6 investigated 17 agronomic traits of 77 Tibetan with 94.38~96.81% among the three subgenomes individually, a accessions and 277 Tibetan semi-wild wheat, and found that percentage similar to that of the CS genome (99.72%), thus Tibetan semi-wild wheat phenotypically resembles Tibetan local indicating a near completeness of Zang1817 genome assembly landraces but differs in brittle rachis characteristics (Fig. 1a). (Supplementary Table 2). Thus, they suspected that Tibetan semi-wild wheat was a de- Extensive genomic variations have been reported in maize and domesticated form of Tibetan local landraces as a result of natural rice14,15. We found that genomic variation between Zang1817 selection. Yet, no genetic or genomic evidence is provided to and CS, when the pseudochromosomes were aligned, mainly support this hypothesis. Brittle rachis is generally controlled by occurred in non-coding regions. By comparing the two genomes, genes on the short arm of the group 3 chromosomes, especially we determined the present/absent variation (PAV) for segments chr3D8, as well as alleles of the Q locus on 5A9. However, there is longer than 10 kb (Fig. 1c) and identified 345.73 Mb Zang1817- a lack of understanding of the genomic variations that contribute specific genomic segments, including 1875 Zang1817-specific to the Tibetan wheat wild-like traits at the population scale. genes compared with CS (Supplementary Data 1). GO analysis To address these knowledge gaps, we de novo assemble the revealed that these genes were predominantly enriched in genome sequence of a high-altitude wheat accession (Tibetan “polysaccharide binding” (GO:0030247) and “serine-type endo- semi-wild wheat Zang1817), and resequence 245 wheat acces- peptidase inhibitor activity” (GO:0004867) categories. We also sions, including Tibetan semi-wild wheat accessions, Tibetan identified 389.29 Mb of genomic segments containing 2540 genes landraces and cultivars, representing the largest resequencing of that were absent from the Zang1817 assembly in comparison with world-wide wheat landraces and accessions to date. In combi- CS genome (Supplementary Data 2). These genes were mostly nation with previously published 73 re-sequenced accessions10, involved in photosynthesis related GO terms including “photo- we demonstrate that adaptation to high-altitude environments system II reaction center” (GO:0009539), “chloroplast thylakoid likely trigger extensive reshaping of the wheat genomes. In par- membrane” (GO:0009535), “photosynthesis, light reaction” ticular, we argue that the TaHY5-like-mediated signaling pathway (GO:0019684) and “cytochrome b6f complex” (GO:0009512). might play a role in the adaptation to high-altitude extreme Interestingly, the PAV genes were unevenly distributed across the environments in Tibetan wheat and possibly other plants. Fur- genome and tended to be distributed at the ends of chromosomes thermore, we uncover a feral origin of Tibetan semi-wild wheat, and devoid in centromeric regions (Fig. 1d). A genomic inversion which is probably, in effect, a de-domesticated form of Tibetan at 366~375 Mb region of chromosome 6D in Zang1817 was landraces, suggesting that the Tibetan wheat is a resource for fine revealed based on the genomic comparison, which shared a fine tuning wheat germplasm to certain environments. collinearity between Zang1817 assembly and Aegilops tauschii genome12 (Supplementary Fig. 1), further validating the high- quality of the Zang1817 genome assembly. Results The α-gliadin genes confer wheat the unique viscoelastic De novo assembly of Tibetan semi-wild wheat Zang1817 gen- properties, which facilitate the processing of flour into bread, ome. Genomic variation in HA adapted wheat compared with pasta, noodles, and other food products. An in-depth sequence low-altitude (LA) wheat was determined by generating a de novo comparison of the terminal regions of the chromosome group 6 genome assembly of Tibetan semi-wild wheat Zang1817. First, we indicated that the duplication and expansion of α-gliadin genes in generated >240× coverage sequencing reads from a series of PCR- the three homeologous groups occurred extensively in Zang1817 free Illumina paired-end libraries and mate-paired libraries. The assembly compared with CS genome. For example, the 2.4 Mb contig and scaffold assembly was performed using the region at 25.91–28.31 Mb on chromosome 6 A from the DeNovoMAGIC3 software platform (NRGene, Nes Ziona, Israel), Zang1817 genome contained 34 α-gliadin genes, more than triple and additional Chromium 10X genomic data were utilized to the number from the CS genome (9, region at 24.92–26.84 Mb), support scaffold validation and further elongation. Finally, and the differences in terms of sequence length between the 384,307 scaffolds with N50 of 37.62 Mb were generated, and sequences were mostly caused by differential insertion of formed the basis of a total assembly size of 14.71 Gb for Zang1817 repetitive DNA elements. Similar results were revealed for α- (Table 1). The assembled scaffolds were ordered into gliadin-containing regions at chromosome 6B and 6D (Fig. 1e).

2 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5 ARTICLE

a

1 cm Zang1817 S79

1A b 7D 600 c 400

200

0

200

400 0 1B 600 7B 0 400 a 200 200 400 2000 b 600 1D 0 0 600 c 200 400 1500 7A d 400 200 e 0 200 2A 0 1000 400 f 400 Frequency 6D 200 g 600 500 0 h 0

600 200 2B 0 400 400 6B 0 100 k 200 k >300 k 200 600 Length 0 800 600 0

6A 400 200 d 1 200 400 2D

0 600 0 Absence in CS 400 0.5 5D 200

200 400

0 3A 600

600 0 0 short arm long arm

400 0 5B 200

200 400 0 600 3B 600 800 0.5 0 400 Absence in Zang1817 200 5A 200

0 400 Relative gene absence rate

600 0 400 200

0 400 200 3D 600 0 1 4D 600

400 200 Relative chromosome position 4B 4A

e chr6A: 25.91 – 28.31 Mb chr6B: 41.58 – 42.45 Mb chr6D: 27.25 – 27.71 Mb Zang1817

-gliadin CS-IWGSCv1 chr6A: 24.92 – 26.84 Mb chr6B: 43.38 – 44.66 Mb

Fig. 1 Phenotypes of Tibetan wheat and overview of the Zang1817 assembly. a Phenotype of rachis brittleness for Tibetan semi-wild wheat Zang1817 and Tibetan wheat landrace S79. b Concentric circle diagram shows distribution of the genomic features of Zang1817: a presence–absence variation (PAV) of genes absent in CS; b collinear gene pairs between CS and Zang1817 (discordant matching regions are marked in red); c PAV genes absent in Zang1817; d density of genomic variations (0 to 240,098 bp per 10 Mb); e density of high-confidence genes (HC; 0–253 genes per 10 Mb); f transposable element (TE) density in 5-kb windows; g distribution of K-mer frequency; h homeologous gene pairs between subgenomes. c Frequency distribution of PAV segment length (>10 kb). d Density distribution of the PAV genes across chromosome model. Red, Zang1817-specific retained genes. Blue, CS-specific retained genes. e Comparison analysis of genomic regions containing α-gliadin genes on group six homeologous chromosomes. Source data underlying Fig. 1c, d are provided as a Source Data file.

On a broader scale, a total of 22,782,409 SNPs were identified in resulting in amino-acid changes/stop codon occurrences between the aligned syntenic blocks between Zang1817 and CS genomes, the Zang1817 and CS genomes, respectively. with 8,085,857, 13,049,811 and 1,646,741 in subgenomes A, B, and D, respectively. Subgenome B had the highest number of SNPs, whereas subgenome D had the lowest number. Although Adaptation mechanism to high-altitude environment. High- ~99.61% of the Zang1817 sequences matched with CS sequences altitude plants and animals are subjected to genetically adaptive in syntenic blocks, we were still able to identify 184,913 SNPs that evolution that contributes to their survival in the harsh envir- resided in the protein-coding genes, including 95,734/1634 SNPs onment of long exposure to high UV-B irradiation, low

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 3 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

Table 1 Summary statistics for the Zang1817 assembly. E2-conjugating enzyme, respectively, might play important roles in DNA repair in response to UV-B induced DNA damage20–22 (Supplementary Figs. 8 and 9), and TaERF4 Scaffolds Contigs (TraesCS3B02G357500) (Supplementary Fig. 10), whose Total scaffolds 384,307 822,893 homolog in Arabidopsis is involved in cold acclimation23, Assembly size (bp) 14,707,915,341 14,553,951,571 were also significantly diverged between HA and LA wheat Gaps size (bp) 153,963,770 Gaps 1.05% samples (Fig. 2b, Supplementary Fig. 11). Furthermore, N50 (bp) 37,620,242 66,264 TraesCS2A02G134000 (TaCHLH) (Supplementary Figs. 12 N50 #sequences 102 63,361 and 13), a homolog of CHLH involving in light-dependent 24 N90 (bp) 5,215,097 13,297 chlorophyll accumulation , together with other photosynth- N90 #sequences 499 242,400 esis related genes including TraesCS2A01G145300 (Supple- MAX (bp) 207,426,720 994,222 mentary Fig. 14) and TraesCS7B02G178700 (Supplementary Fig. 15), also showed haplotype divergence between HA and LA wheat groups (Fig. 2b; Supplementary Fig. 16). Interest- temperature and hypoxia. To identify candidate genes and reg- ingly, the downstream genes ERF4 and CHLH are known to be ulatory pathways for adaptation to high altitude environ- positively regulated by HY5 in Arabidopsis, the promoter ments, we re-sequenced 109 Tibetan wheat accessions (including regions of TaERF4, TaPCO1, TaCHLH, and TaTDP1 also 74 Tibetan semi-wild wheat and 35 Tibetan landraces) as well as contained HY5 regulating cis-element (CACGTG), indicating 136 representative world-wide wheat landraces and cultivars at a that they are potentially regulated by TaHY5-like in wheat. The 6.07-fold coverage on average (Fig. 2; Supplementary Data 3). HY5 is a master regulator extensively involved in regulating Together with previously published data of 63 hexaploid wheat light signaling, DNA repair, cold acclimation, chlorophyll samples10, a total of 46,431,479 filtered SNPs were employed for synthesis and hypoxia response in plants25. Given that light further analysis. Of these, 0.77% SNPs resided in coding intensities, low temperature, and hypoxia frequently occur sequences of high-confidence genes, and led to 2813 gains of stop simultaneously on Tibetan Plateau,itislikelythatthesignaling codons (Supplementary Data 4). Phylogenetic analysis revealed modules have co-evolved to cope with these stimuli. We found that Tibetan semi-wild wheat (69/74) and Tibetan landraces (23/ that the TaHY5-like hub as well as TaHY5-like-mediated 35) grouped together (Supplementary Fig. 2). Interestingly, two pathway of Tibetan wheat were the targets of positive selection, landraces originated from Nepal, a high altitude region south of leading to their adaptation to the extreme. Interestingly, the Tibetan Plateau also showed a close relatedness with Tibetan geographic distribution analysis of haplotype diversity of wheat wheat (Supplementary Fig. 2), indicating a potential genetic homologs of TaHY5-like, TaERF4, TaPCO1, TaCHLH, and affinity between each other. TaTDP1 showed that the combination of five high-altitude To determine the genomic regions of Tibetan wheat associated specific haplotypes mainly exhibited in Tibetan plateau (Fig. 2c, with adaptation to the HA environments on Tibetan Plateau, FST Supplementary Fig. 17). These observations suggest that a for searching genetic differentiation regions was carried out rapid evolution of TaHY5-like-mediated functional pathway between HA wheat accessions and LA wheat accessions (Fig. 2b; occurred in Tibetan wheat in response to the harsh constraints Supplementary Data 5). In total, we identified 1‚905 significant in HA environments (Supplementary Fig. 18). To support our divergent genomic regions including 3847 genes. GO enrichment hypothesis, we analyzed the haplotypes of Nepal wheat lines, analysis of those genes showed that functional terms of “serine- which is also planted at high altitude (~1200–4800 meters type peptidase activity” (GO:0008236) was significantly over- high), and these HA adapted wheat accessions possessed the presented. In addition, photosynthesis related GO terms same haplotype to Tibetan wheat, including TaHY5- (“chloroplast thylakoid, GO:0009534” and “chlorophyllide a like, TaERF4, TaCHLH, TaPCO1, TaTDP1.Toconfirm the oxygenase [overall] activity, GO:0010277”) and “DNA repair” hypothesis, further evidence is needed to functionally validate (GO:0006281) were also enriched (Supplementary Fig. 3). These their biological significance in adaptation to high altitude. findings were consistent with previous reports, which indicated Regulation of flowering time also contributes to the environ- that genes involved in stress tolerance and DNA repair are mental adaptation in wheat26, and Tibetan semi-wild wheat enriched or expanded in high-altitude plants, such as Eutrema accessions exhibited delayed heading date compared with other species16, Crucihimalaya himalaica17, and Maca (Lepidium wheat species sowed with the same sowing date7. PPD1 gene is a meyenii)18. major determining factor affecting the photoperiod response of Significant genomic differentiation was identified on chro- wheat27 and barley28. Consistent with this observation, we mosome 2 A, where TraesCS2A02G142800, a homolog of identified related genes in HA wheat that may have been targeted ELONGATED HYPOCOTYL 5 (HY5) in Arabidopsis (Supple- by natural selection during the adaptation process. The TaPPD1- mentary Fig. 4), was diverged between HA and LA groups 2A (TraesCS2A02G081900) (Supplementary Fig. 19) and featured by two missense variations (designated as TaHY5-like, TaPPD1-2D (TraesCS2D02G079600) (Supplementary Fig. 19) c.1049 G > A|p.Arg350Gln and c.1034 G > T|p.Gly345Val) genes, two homeologs underlying major quantitative trait locus (Fig. 2b; Supplementary Fig. 5). With the missense SNPs for photoperiod-dependent flowering29, were dramatically diver- fi fi identi ed between HA (AT haplotype) and LA (GC haplotype) si ed between HA and LA wheat accessions based on FST analysis groups in TaHY5-like, two major haplotypes were character- (Supplementary Figs. 20 and 21), suggesting strong selection on ized for the 308 resequencing wheat lines together with 1026 this locus in high-altitude wheat populations. A comparison of whole-exome capture wheat accessions19.Significantly, we the haplotypes between LA and HA wheat groups at these two found that over 86% of Tibetan wheat carry the AT haplotype, genes identified two differentiated SNPs for TaPPD1-2A (c.1202 compared to wheat from LA environments, which mainly A > G|p.Asp401Gly) and TaPPD1-2D (c.1018 C > T|p.Arg340*), possessed GC haplotypes (Fig. 2c, Supplementary Fig. 5). respectively. These two candidate causal mutations were fixed in In addition, we found that wheat orthologues of TDP1 HA wheat, which together accounted for 87.61%, and has low and ATG10 in Arabidopsis (TraesCS2B02G244300 and allele frequency (33.33%) in LA population. The HA-haplotype TraesCS2B02G371600) (Supplementary Figs. 6 and 7), whi- combination of TaPPD1-2A and TaPPD1-2D only occurred in ch encode a tyrosyl-DNA phosphodiesterase and an Tibetan Plateau occupying 51% of the whole HA group at Tibetan

4 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5 ARTICLE

a

40

20

32 Latitude Tibetan Plateau 0 30

−20 28 −100 −50 0 50 100 150 90 95 Longitude TS CL NCL TL CC NCC

b TraesCS2A02G135000 TaCHLH TaHY5-like Photosynthesis Abiotic stress Flowering response 1 TaPPD1 TraesCS2A02G145300

0.5

0 1A 2A 3A 4A 5A 6A 7A TaERF4 Traes3B02G357900 Traes6B02G366100 Traes7B02G178700 1 TaTDP1

ST 0.5 F 0 1B 2B 3B 4B 5B 6B 7B

1 TaPPD1 TaPCO1

0.5

0 1D 2D 3D 4D 5D 6D 7D c

40

20 TaHY5-like HA LA 050100

Fig. 2 Geographic locations and candidate regions responsive to high-altitude extremes. a Global distribution of wheat accessions. NCL, Non-Chinese landrace; NCC, Non-Chinese cultivar; CL, Chinese landrace (excluding Tibetan landrace); CC, Chinese cultivar; TS, Tibetan semi-wild wheat; TL, Tibetan landrace. b Genome-wide distribution of genetic differentiation regions between high-altitude (HA) and low-altitude (LA) wheat accessions identified by FST analysis. Each point indicates a 100k-bp window. Dashed lines indicate the top 5% thresholds. Key candidate genes are labeled with arrows. c Geographic distribution of different haplotypes of TaHY5-like gene. The haplotype distribution analysis was determined using 308 resequencing accessions and 1026 wheat whole-exome capture samples. Orange indicates the proportion of HA adapted haplotype of TaHY5-like gene. HA, high-altitude wheat accessions. LA, low-altitude accessions. Geographic maps in a and c were generated using R packages, including maps, ggmaps, ggplot2, and sf, with geographic information data edited from original version of the Natural Earth project database (version 2013). Source data underlying Fig. 2b are provided as a Source Data file.

Plateau (Supplementary Fig. 22). This observation suggested that structure analyses of available accessions based on 364,856 high- Tibetan wheat have been selected to achieve appropriate flower- confidence homologous SNPs on subgenome D using Aegilops ing time to complete their life cycles in Tibet. tauschii accessions as an outgroup. This use of the D subgenome was necessitated based on previous analyses that extensive wild- relative introgressions were observed in the subgenome A and B Origin of Tibetan semi-wild wheat. Tibetan semi-wild wheat is a of hexaploid wheat19, rendered the analysis using these sub- unique form of hexaploid wheat30. To provide insights into their genomes together to be too complex and ambiguous. The evolutionary origin, we performed a comprehensive population resulting phylogenetic tree clearly encompassed three major

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 5 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

ab Clade I

0.1

Subcluster 1 0.0 PC2 (16.1%)

−0.1

Subcluster 2 −0.10 −0.05 0.00 0.05

Clade III Clade II PC1 (22.5%) TS CL NCL TL CC NCC c K =4 K =5

NCL CL TL TS

Fig. 3 Population structure analysis of resequencing wheat accessions. a A neighbor joining (NJ) phylogenetic tree of wheat resequencing samples using 364,856 high-confidence SNPs on subgenome D. Gray triangle, the outgroup, Aegilops tauschii, which is wild diploid progenitor for D genome in wheat. b PCA plot of resequencing samples based on SNPs in subgenome D. c Individual ancestry coefficients of resequencing accessions inferred using ADMIXTURE using SNPs in subgenome D. Each color represents one population. The length of each segment in each vertical bar represents the proportion contributed by ancestral populations. Source data are provided as a Source Data file. clades, among which, a portion of non-Tibetan Chinese landraces population (π = 5.38 × 10−4) than in landrace-counterparts (CL, 26/67) together with almost all of non-Chinese landraces (π = 5.67 × 10−4), indicating a limited genetic background of (NCL, 49/52) and non-Chinese cultivars (NCC, 35/38) formed Tibetan semi-wild wheat available during the adaptation Clade I, indicating a potential contribution of non-Chinese wheat process in the Tibetan Plateau. To further confirm our to the genetic diversity of Chinese wheat (Fig. 3a). In contrast, hypothesis, we tested the origin models of Tibetan semi-wild most of the remaining non-Tibetan Chinese landraces (32/41) wheat on eight different demographic simulation models and almost all of the elite cultivars (CC, 36/42) clustered into (Supplementary Table 3; Supplementary Fig. 26) and found Clade II, suggesting that Chinese landraces contributed sig- that the model with bottleneck and without any migration nificantly to varietal replacement during breeding process in (Supplementary Fig. 26) fit the genetic data best (Supplemen- China. Tibetan semi-wild wheat and Tibetan landraces were tary Table 3), indicating that initial Tibetan semi-wild wheat grouped into Clade III, which was further separated into two accessions are derived from Tibetan landraces by de- distinct subclusters in the phylogenetic tree, that is, a total of 41 domestication followed by a bottleneck period resulting from (55.41%) Tibetan semi-wild wheat accessions were grouped into a developed agriculture activities. Our genome-wide association monophyletic group (Subcluster I), whereas 28 (37.84%) Tibetan study (GWAS) identified two loci of particular interest as being semi-wild wheat accessions together with 63% Tibetan landraces significantly associated with the rachis brittleness phenotype in (TL, 22/35) formed a second subcluster (Subcluster II), revealing Tibetan semi-wide wheat, including a 0.8-Mb deletion region that the Tibetan semi-wild accessions shared a closer genetic on 3D and a genetic locus containing TaQ-5A gene on 5 A (see relationship with Tibetan landraces compared with the other Fig. 4a, Supplementary Fig. 27–30). The association was also wheat samples (Fig. 3a). We also performed phylogenetic analysis supported by results from FST, Pi-ratio, and CNV-index of wheat accessions using SNPs in the whole-genome and A&B analysis. Approximately 20.51% of the rachis brittleness subgenomes, respectively, and revealed consistent observations variation could be explained by a 0.8-Mb deletion region (Supplementary Fig. 23). (55.5–56.3 Mb) on chromosome 3D, as identified from the Principal component analysis (PCA) also showed a close GWAS analysis of the entire population (Fig. 4a). Further affinity of Tibetan semi-wild wheat with Tibetan landraces investigation showed that the deleted genomic region contained (Fig. 3b; Supplementary Fig. 24). This inferred relationship was two BRITTLE RACHIS-LIKE genes (TraesCS3D02G103200 and supported by a maximum likelihood estimation of individual TraesCS3D02G103400) (Supplementary Fig. 31), homologs of ancestries performed using ADMIXTURE where Tibetan semi- the Btr1 and Btr2 genes controlling rachis brittleness31.This wild wheat accessions shared a greater proportion of genetic finding suggested that the identified locus was a significant composition with Tibetan landrace group than with other contributor to rachis brittleness in Tibetan semi-wild wheat. wheat samples according to the ancestry states with K from 4 to Consistently, a previous study revealed that brittle rachis of 5(Fig.3c; Supplementary Fig. 25). Taken together, we suggest Tibetan semi-wide wheat is governed by a locus on the short that Tibetan semi-wild wheat was a de-domesticated form of arm of chromosome 3D32. Further investigation showed that all Tibetan wheat landraces. Furthermore, genome-wide nucleo- the homeologs of Btr1 and Btr2 gene in subgenome A and B of tide diversity was lower in the Tibetan semi-wild wheat Zang1817 shared almost the same amino-acid sequence with

6 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5 ARTICLE

a chr3D: 55.5 – 56.3 Mb deletion 20 20 TaQ-5A

value) 12 12 P 4 4 -log( 0.4 0.4

ST 0.2 0.2 F 0 0

15 15 7.5 7.5 ratio  0 0 0.7 0.7 0.25 0.25 −0.2 −0.2 CNV-index 3D 5A b

Zang1817 56.2 Mb 56.62 Mb 57.1 Mb Deletion site Deletion region Watson strand gene Crick strand gene

55.2 Mb 55.54 Mb TaBtr1 TaBtr2 56.31 Mb 56.8 Mb CS

chr3D c TraesCS5A02G473800 TGA, stop gain TraesZN5A01G476900

161 bp TE insertion Coding region TE sequence Frame shift Intergenic region d TS

TL rachis phenotype: Brittle Non-brittle CC chr3D deletion: With CL Without TaQ TE insertion: NCC With Without NCL Unknown

Fig. 4 Genomic differentiation during de-domestication process. a Genome-wide scanning of candidate regions for rachis brittleness during wheat de- domestication process using GWAS analysis. Dashed lines represent the top 5% thresholds. Two significant divergent regions (0.8-Mb genomic region in chr3D and TaQ-5A) are labeled above the dot peaks. b Genomic structure comparison of 0.8-Mb genomic region with significant divergence between the Zang1817 assembly (upper panel) and CS reference genome (lower panel). BRITTLE RACHIS-LIKE gene potentially related to rachis brittleness locates in the region. Collinearity of high-confidence genes between two genomes were shown in lines. Dashed line indicates un-annotated genes although with intact genomic sequence. c Genomic comparison of TaQ-5A between Zang1817 and CS genomes. A 161 bp TE insertion (red) was identified in Zang1817, resulting in frame-shift variation and gain of stop codon (red line). d Schematic diagram of genomic variations and phenotype alterations related to wheat rachis brittleness. Source data underlying Fig. 4a are provided as a Source Data file. the respective standards in CS (Supplementary Fig. 32), and rachis brittleness in semi-wild wheat associated with the 0.8-Mb genomic variations only occurred as the Btr1/2 homologs in deletion; it requires further biological investigation to develop subgenome D were lost in Zang1817 assembly (Supplementary this issue further. The locus at chromosome 5 A including the Fig. 33). However, Btr1 and Btr2 gene mutations lead to a non- TaQ-5A gene, is a well-established contributing factor for 33,34 brittle rachis phenotype in barley according to the previous brittle rachis in wheat .GWASandFST analysis narrowed report31, thus the deletion in their homologs of wheat cannot the candidate region down to 0.1 Mb (chr5A:650.1–650.2 Mb), directly explain the brittleness variation. Therefore, the data although several other SNPs located in non-coding region were suggest that there might be other gene interactions controlling also found to be associated with the phenotypic variation.

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 7 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

In addition, the 161-bp TE insertion has been functionally semi-wild wheat Zang1817 together with ~680 Gb of 10X Chromium library validated to be contributing to rachis brittleness previously9. sequencing data. Therefore, we propose that the 161-bp insertion in the fifth exon of TaQ-5A gene (TraesCS5A02G473800) on chromosome Scaffold assembly. Genome assembly was constructed from short sequencing data 34 5A is also a major factor affecting rachis brittleness, by the proprietary software package DenovoMAGIC-3.0 (NRgene)41. Main steps to accounting for ∼28.96% of the phenotypic variation in the construct the Zang1817 assembly scaffolds include: (1) reads pre-processing: entire population (Fig. 4a, Supplementary Fig. 27). The 0.8-Mb removing Illumina adaptors, PCR duplicates and Nextera linkers. Create the stit- ched reads by merging overlapping reads in PE libraries with minimal overlap of genomic sequence deletion and the 161-bp insertion in TaQ-5A 10 bp. Then scan and filter the putative sequencing errors in stitched reads; (2) gene were evident in Zang1817 genome in the comparison to contigs assembly: building the De Bruijn graph (k-mer length = 127 or 191 bp) of CS (Fig. 4b, c). However, we also found several wheat accessions contigs from stitched reads. For repeat resolving and contigs extension, reliable with either the deletion or the insertion genomic variations paths in the graph between contigs were found using PE reads; (3) scaffolds exhibiting rachis non-brittleness (Fig. 4d), indicating that there assembly: contigs were linked into scaffolds and gaps in between were estimated with PE and MP information. Then unique path connecting the gap edges was might be other genomic loci contributing to this phenotypic detected utilizing PE and MP links and De Bruijn graph information; (4) scaffolds variation. Yet, this study provides targets for the further validation and extension: 10X barcoded reads were mapped to the assembled investigation of the gene network contributing to rachis scaffolds, and clusters of reads were identified to be part of a single molecule when brittleness. We conclude that Tibetan semi-wild wheat has they share same barcode that mapped to adjacent contigs in the scaffolds. Each scaffold was further scanned with a 20 kb window, and the number of distinct undergone a de-domestication-like process during its evolution clusters covering the entire window was ensured to be statistically significant history with a selective footprint related to rachis brittleness, regarding the number of clusters that span the entire window. A scaffolds graph representing a feral form of Tibetan local wheat. Crop ferality is linking two scaffolds with more than two common barcodes is then generated by 35–38 comparing the barcodes mapped to the scaffold edges. Linear scaffolds paths in the observed in many crop species, including rice and barley fi and this report provides a genomic evidence for a de- scaffolds graph formed the nal scaffolds. domestication event in species. Construction of pseudomolecules. The scaffolds of Zang1817 were anchored to chromosomes with the reference of CS-IWGSCv1 assembly. Simulated reads of Discussion both Zang1817 scaffolds reference genome and CS-IWGSCv1 reference genome It is estimated that wheat was introduced into China ~5000 years were generated as paired-end short sequences, in length of 500 bp with insertion ago, and planted in Tibet as early as ~3500 years ago39. Here, we size of 500 bp and with step size of 1000 bp. The reads were aligned to each other reference genome mutually utilizing the BWA-mem aligner v0.7.1542. Each of the report that the colonization of wheat in the Tibetan Plateau fi fi appears to have co-selected primitive sequence diversity and post- Zang1817 scaffolds was rst grouped to a speci c chromosome, by which chro- fi mosome were aligned with most simulated reads from this scaffold. Scaffolds with xation of haplotypes for adaptation in high-altitude environ- <22 simulated reads aligned to one chromosome were reserved to build the pseudo- ments including, high light intensities, low temperature and sequence notated as chrUn. For each of the scaffolds grouped in one chromosome, hypoxia stresses in addition to photoperiod response. We propose the first five continuously aligned reads and last five continuously aligned reads fi a TaHY5-like regulation hub, featuring two main haplotypes in were identi ed and then used for deciding the relative-position and the direction. Then the scaffolds grouped in each chromosome were ordered by the related- HA and LA wheat populations, has been advantageous to Tibetan positions and directions and then were concatenating into a pseudo-chromosome wheat adaptation to HA extremes. In addition, the downstream sequence, with insertion gaps of 300 bp. Finally, the 21 pseudochromosomes and a genes of TaHY5-like, including TaERF4, TaPCO1, TaCHLH, and chrUn were generated for the Zang1817 assembly. TaTDP1 involved in cold acclimation, hypoxia adaptation, and high light intensity and UV radiation response, respectively, also Gene prediction and annotation. To aid genome annotation and perform the showed sequence differentiation in terms of altitude. During the transcriptome analysis, we generated the RNA-seq data for four different tissues, long history of the HA adaptation process, we propose that including leaf, root, stem, and spike. All fresh tissues were frozen in liquid nitrogen, Tibetan semi-wild wheat has undergone a de-domestication event and stored at −80 °C before processing. Paired-end RNA libraries were constructed linked to a genomic footprint including a 0.8-Mb deletion region using the Illumina RNA-Seq kit (Illumina Inc. San Diego, CA, USA) using pooled RNA, which were sequenced using the pair-ends sequencing module (2 × 150 bp) on 3D containing Btr1/2 homologs and a genetic loci containing on HiSeq X Ten platform. In total, sequencing resulted in 465.91 million reads. TaQ-5A gene on 5A, that correlates with rachis brittleness can be Reads contaminated with adapters were detected and removed using Trimmomatic considered a feral form of Tibetan local landrace. Feral forms are v0.3643. Poor-quality reads (Phred score <20) were trimmed from both ends with 44 ≥ observed in many crop species, including rice and barley37,38 and SolexaQA v2.0 ; Only the reads with lengths 25 bp on both sides of the paired- fi end format were subjected to further analysis. this report provides evidence for de ning new targets that can Protein-coding genes were annotated using a comprehensive strategy by delineate the gene network underpinning the de-domestication combining results obtained from homology-based prediction, ab initio prediction, event in Triticum species. and RNA-seq-based prediction methods. Annotated gene protein sequences from Triticum turgidum, Triticum aestivum, Triticum urartu, Aegilops tauschii, Hordeum vulgare, Brachypodium distachyon, Oryza sativa, and Zea mays were aligned to the Methods Triticum aestivum genome assembly using WUblast v2.0 with an E value cutoff of DNA extraction and sequencing. Genomic DNAs were isolated from the leaves of 10−5 and the hits were conjoined by Solar software. GeneWise v2.4.145 was used to semi-wild wheat Zang1817 (Triticum aestivum ssp. tibetanum Shao) using the predict the exact gene structure of the corresponding genomic regions for each modified CTAB40 and subjected to libraries construction. Two PCR-free paired- WUblast hit. Gene structure created by GeneWise was denoted as Homo-set end shotgun libraries were constructed using DNA template size of ~470 bp, and (homology-based prediction gene set). RNA-seq reads were assembly by Trinity sequenced on the Hiseq2500 v2 Rapid mode as 2 × 265 bp reads. PCR-free genomic v2.446 software; the assembly results were directly mapped to the Triticum aestivum library of ~800 bp DNA template size was prepared using the TruSeq DNA Sample genome assembly by BLAT v3647 and assembled by PASA v2.1.048. Gene models Preparation Kit version 2 according to the manufacturer’s protocol (Illumina, San created by PASA were denoted as the PASA-Iso-set (PASA-Iso-Seq set), and the Diego, CA), and sequenced on an Illumina HiSeq2500 as 2 × 160 bp reads (using training data for the ab initio gene prediction programs. Five ab initio gene the v4 illumina chemistry). Ten genomic DNA libraries ranging from 470 bp to 10 prediction programs (Augustus v2.5.549, Genscan v1.050, Geneid v1.451, K bp were further constructed and sequenced. Six Mate-Pair (MP) libraries with GlimmerHMM v3.0.452, and SNAP v2006-07-2853) were used to predict coding jump ranges of 2–5 Kb, 5–7 Kb, and 7–10 Kb were prepared utilizing the Illumina regions in the repeat-masked genome. In addition, Illumina RNA-seq data were Nextera Mate-Pair Sample Preparation Kit (Illumina, San Diego, CA), and mapped to the assembly using Tophat v2.0.1354, and Cufflinks v2.2.155 was then sequenced on HiSeq4000 as 2 × 150 bp reads. PE and MP libraries construction and used to assemble the transcripts into gene models (Cufflinks-set). Gene model sequencing were conducted at Roy J. Carver Biotechnology Center, University of evidence from the Homo-set, PASA, Cufflinks-set, ab initio programs were Illinois at Urbana-Champaign. The 10X genomics library was constructed with combined by EvidenceModeler v1.1.156 into a non-redundant set of gene DNA fragment longer than 50 Kb and was sequenced with the Gemcode platform annotations. Weights for each type of evidence were set as follows: PASA>Homo- (10X genomics, Pleasanton, CA). 10X Chromium library construction and set>Cufflinks-set>Augustus>GeneID=SNAP=GlimmerHMM=Genscan. Gene sequencing were conducted at HudsonAlpha Institute for Biotechnology, Hunts- models were filtered out by these criteria: (1) coding region lengths of 150 bp, (2) ville, Alabama. A total of ~3,535 Gb Illumina sequencing data was produced for supported only by ab initio methods and with FPKM < 1.

8 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5 ARTICLE

All protein-coding genes were aligned to two integrated protein sequence together. The finally merged windows were defined as Zang1817-retained regions. databases: SwissProt (release 2015_08) and NR (release 2015_08). Protein domains CS PAV regions were identified in the same way. Finally, 9385 windows with total were annotated by searching against the InterPro v32 using InterProScan v557 and length of 345.73 M bp were identified as Zang1817-specific retained regions; and Pfam v27.0 databases by HMMER v3.158, respectively. The Gene Ontology (GO) 8265 windows with total length of 389.30 M bp were identified as CS-specific terms for each gene were obtained from the corresponding InterPro or Pfam entry. retained regions. The pathways in which the genes might be involved were assigned by BLAST PAV genes were defined in a similar manner but in level of genes. To identify against the KEGG databases (release 53), with an E value cutoff of 10−5. Functional Zang1817-specific retained gene, average read depths were calculated for each annotation results were finally combined from two strategies above. annotated gene and normalized by the whole-genome average read depth. Genes that could be covered normally by Zang1817 (normalized read depth >0.8) but low covered by CS resequencing data (normalized read depth <0.2) were classified as Annotation of repetitive DNA. The transposable elements (TE) were annotated Zang1817-specific retained genes. CS-retained genes were identified in the through homology-based prediction method. Two TE libraries containing 3050 same way. complete wheat TEs sequences (ClariTeRep: https://github.com/jdaron/CLARI-TE) The MUMmerv4 module NUCmer65 was used to detect large structural and 2825 complete plant TE sequences (http://botserv2.uzh.ch/kelldata/trep-db/ variations between CS-IWGSCv1 and Zang1817 assemblies. Each chromosome of downloads/trep-db_complete_Rel-16.fasta.gz) were downloaded and combined. Zang1817 was aligned to the corresponding chromosome of CS using NUCmer RepeatMasker vopen-4.0.559 was used to annotate TEs in Triticum aestivum gen- with default parameters. The alignments were further filtered using delta-filter with omes using the combined data sets as searching library. parameter “−1” to retain only 1-to-1 alignment. The filtered alignments were visualized as dot plot using R. Assembly and annotation evaluation. We evaluated the assembly with 1440 Benchmarking Universal Single Copy Orthologs (BUSCOv3) genes from plants 13 Identification of SNPs and INDELs between assemblies. To identify SNPs and set . The whole genome was evaluated together, and meanwhile, each subgenome fi (A, B, D) was also evaluated individually. To evaluate the quality of the Zang1817 INDELs between the CS-IWGSCv1 and Zang1817 assemblies, we rst used fi NUCmer for alignment with the parameters “-c 90 -l 40 -t 4 –mum”. The delta file assembly, we rst inspected the genome-wide 1,888,547,646 (~20× expected cov- fi fi erage) random selected raw reads. To be compatible with the whole-genome was further ltered using delta- lter (module of MUMmer) to obtain 1-to-1 alignment block. Then show-snps (module of MUMmer) with parameters sequencing of CS technically, 150 bp PE sequencing data were generated from the “ ” 450 bp PE library. These reads were aligned to Zang1817 assembly using BWA- -ClrTH was used to identify SNPs and INDELs. mem with default parameters, resulting in 1,886,201,978 (99.88%) mapped reads. Sampling and sequencing. The re-sequenced collection contains 308 hexaploid Dot plot for comparing the assemblies. Simulated reads (500 bp PE) of both wheat accessions in total, covering 74 Tibetan semi-wild accessions, 43 Chinese assemblies were used to be aligned to each other assembly with BWA-mem. Using cultivars accessions and 102 Chinese landrace accessions (including 35 Tibetan the aligned position and oriented position of each simulated read, we generated the landrace accessions), and 89 accessions from countries word-widely beyond China. dot plot showing the collinearity between these two genomes across the pseudo- The re-sequenced 245 wheat accessions in this study and previous published 73 10 chromosomes of each assembly. Beyond the comparison between Zang1817 and accessions were combined together in this study. Tibetan semi-wild accessions CS-IWGSCv1 assemblies, we further compared Zang1817 and Aegilops tauschii Aet and chinese landrace accessions were obtained from National center for crop v4.0 assemblies12 in the same way. germplasm preservation and Sichuan Agricultural University. Genomic DNA was extracted from young leaves following a standard CTAB protocol. DNA libraries were constructed by Novogene and sequenced with the Illumina Hiseq Xten PE150 Collinear analysis of α-gliadin. The gliadins genes are amplified in number in platform with insert size of approximately 500 bp. wheat, and the gliadin family is important for studying the wheat grain protein. To reveal the genomic difference in gene level, we selected the α-gliadin gene family to construct the collinearity between Zang1817 and CS-IWGSCv1 assemblies. We Collection of published wheat resequencing data. A panel of previous published 10 used the α-gliadin gene sequences from CS as queries to identify all hits in resequencing data was used ; and the raw genotyping data in GVCF formats of the assemblies of CS and Zang1817 using BLAST v2.6.060 with E value ≤ 10−5, simi- hexaploid were requested from the authors. To reveal the geographical larity ≥75% and identity ≥75%. The bedtools v2.27.161 were used to annotate all distribution of haplotypes for important adaptive genes in a large genetic back- hits with genome annotation (high confidence and low confidence) and generate ground, we merged the resequencing data in this study with the recently published corresponding sequences. To reveal the collinearity, we used the reciprocal best hits whole-exome capture (WEC) data set for 1026 hexaploid wheat accessions. The for all these α-gliadin genes from BLAST results, and the gene pairs with multiple WEC data set was downloaded from http://wheatgenomics.plantpath.ksu.edu/ best hits in different chromosomes were discarded. 1000EC/, covering ~8.87 million SNP sites.

Identify homologous cluster for collinear analysis. Homologous gene cluster Alignment and genomic variation calling. Raw reads were trimmed using analyzed between Zang1817 and CS were characterized using OrthoMCL v2.0.962. Trimmomatic, and high-quality clean reads were mapped to the CS wheat refer- The transcript with the longest protein sequence of each gene was used to represent ence genome (IWGSC RefSeq v1.0)11 using BWA-MEM. Read pairs with abnormal the gene. The default setting orthomclFilterFasta module was used to filter out insert sizes (>10,000 bp or <−10,000 bp or =0 bp) or low mapping qualities (<1) poor-quality proteins. Mutual similarities were calculated by performing BLASTP. were filtered using Bamtools v2.4.166. Potential PCR duplicates reads were further Then Markov clustering was performed based on the similarities in order to define removed using Samtools v1.3.167. SNPs and INDELs were identified through all gene groups with default inflation value. Thus, 12,915 homologous gene clusters 245 samples by the HaplotypeCaller module of GATK v3.868 in GVCF mode. Then with at least two members were identified between Zang1817 and CS-IWGSCv1. joint call was performed using GATK GenotypeGVCFs for all 318 samples. SNPs Finally, the homologous gene clusters were used for collinear analysis. were preliminarily filtered using GATK VariantFiltration function with the para- meter “–filterExpression QD < 2.0| | FS > 60.0| | MQRankSum < −12.5| | Read- PosRankSum < −8.0| | SOR > 3.0| | MQ < 40.0| | DP > 30| | DP < 3.” The filtering Syntenic blocks detection. Syntenic blocks between CS-IWGSCv1 and Zang1817 “ ” “ − 63 settings for INDELs were QD < 2.0, FS > 200.0, and ReadPosRankSum < 20.0| | assemblies were detected by MCScanX v1.0 and further merged by Quota Align DP > 30 | | DP < 3”. SNPs and INDELs that did not meet any of the following 64 fi v1.0 using high con dence genes as anchors. The longest CoDing Sequence criterias were further discarded: (1) MAF ≥ 5% (2) Missing rate ≤ 40% (3) bi-allelic (CDS) was used as the representative sequence of each gene for CS-IWGSCv1. sites. Finally, the identified SNPs and INDELs were further annotated with SnpEff First, pairwise sequence similarities between all CDS sequences of CS-IWGSCv1 v4.369. and Zang1817 were calculated using BLASTN with an E value cutoff of 10−10. Then syntenic blocks were detected using MCScanX with default parameters. Quota Align was further used to merge adjacent syntenic blocks. The syntenic Population genetics analysis. We assessed the overall population genetic struc- depth was set to 1:1. The overlap cutoff was set to 40, and the distance cutoff was ture of 308 hexaploid accessions using neighbor joining (NJ) phylogenetic tree, set to 20. PCA, and ADMIXTURE v1.3.070 for non-hierarchical, quantitative clustering analyses. For all these analyses, we used a non-redundant SNP data set obtained after removing rare alleles with minor allele frequency <5% and genotype missing Identification of PAV regions and genes. PAV segments between CS-IWGSCv1 ratio >10%, and further pruning out SNPs with intra-chromosomal LD r2 < 0.4 to and Zang1817 were defined through a read depth based method. To identify remove the bias caused by LD. Zang1817-specific regions, we aligned raw resequencing reads of Zang1817 and CS to Zang1817 and CS-IWGSCv1 reference genomes using BWA-mem with default parameters. Then the reads depths were calculated by windows (window size, 5K- Phylogenetic analysis. The NJ phylogenetic tree was obtained by calculating the bp) crossing Zang1817 reference genome, and then were normalized by the whole- pairwise genetic distances using PLINK v1.971 with parameters “–distance square genome average coverages. The windows that could be normally aligned by 1-ibs”. The tree was constructed using njs method in the R package ape v5.372. NJ- Zang1817 itself (normalized read depth >0.8) but very low covered by CS rese- tree of A&B subgenome was rooted with five wild emmer accessions as outgroup10. quencing data (normalized read depth <0.2) were identified as Zang1817-retained NJ-tree of D subgenome was rooted with five Aegilops tauschii accessions as sequences. The identified windows with adjacent distances <10k-bp were merged outgroup10.

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 9 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

Population structure analysis. The population structure analysis was performed Fig. 27a), we combined the top five PCs for chr3B and the top five PCs for the using ADMIXTURE, with K values from 2 to 6 with 10 replications using different whole genome together as covariance representing local population structure for random seeds. The replicate with the highest log-likelihood for each K was used for performing GWAS analysis specific on chr3B (Supplementary Fig. 27b). representing population structure. Genomic region detection responsive to de-domestication. To detect the dif- PCA analysis. PCA analysis was performed using smartpca program embedded in ferentiation regions during de-domestication process of Tibetan semi-wild wheat, two EIGENSOFT v6.1.473 with default settings. sample groups DE (de-domestication) and DO (domestication) were defined according to phenotypes. DE group contains all Tibet samples with brittle rachis (a part of TS samples, 50 in total). DO group contains the rest Tibet samples (the other Demographic analysis. Fourfold degenerate SNPs on D subgenome of Tibetan part of TS and all TL samples, 52 in total). Further analyses were performed base on semi-wild wheat (TS) and Tibetan landraces (TL) in subcluster I of Fig. 3a were used these two groups. Considering the frequently introgression events in AB subgenomes to minimize bias due to sample population structure. To reduce the possibility of from the tetraploid wheat19, the top 5% quantiles for AB subgenomes and D sub- missing data, TL and TS were downsampled to 30 and 40 via projection, respec- genome were calculated separately. Differentiation index (F ) was calculated using tively. Two-dimension site frequency spectrum was generated using easySFS ST VCFtools v0.1.1319 with window size of 100 kb and window step of 100 kb. (https://github.com/isaacovercast/easySFS). Eight demographic models were con- Nucleotide diversity (π) was estimated using VCFtools with window size of 100 structed and fitted by 1000 independent runs with randomized starting points kb. Pi-ratio was further calculated between two groups as (π /π ). To reveal the utilizing ∂a∂i v2.0.574 (Supplementary Fig. 26). Given the observed site frequency DO DE phenotype-associated genomic sequence loss event, we proposed a coverage depth spectrum, the parameters of each model with highest log-likelihood were selected as difference based variable, CNV-index, as a counterpart to F and Pi-ratio methods best fitting parameters. We scaled population size by N (ancestral effective popu- ST e those are merely based on single nucleotide polymorphism neglecting situations lation size) to estimate the population size in number of individuals, and scaled such as genomic segment deletion. CNV-index was defined as the difference of migration rate and time by 2N to estimate number of migration per generation and e normalized average read depths in two groups in each size-fixed window, to reflect time in years. The ancestral population size was estimated by the formula θ = 4× − the coverage differences between groups of samples. To calculate CNV-index N × μ × L, where the neutral mutation rate (μ) is set to be 6.5 × 10 9 according to e between DE and DO groups, we first calculated the mean of Normalized Coverage Molina et al.75, and the generation time (L) was assumed to be 1 year. θ was Number (NCN) in each 100 kb window for both groups; Then CNV-index was computed based on site frequency spectrum of D subgenome using ∂a∂i. Therefore, calculated as (NCN -NCN ), which is notated as CNV-index. Positive numbers N was estimated as 67,003. DO DE e indicate there are more accessions in DE were detected as genomic segment deletion in this window, and vice versa. Rachis brittleness phenotyping. Field trials were conducted for 1 year at China Agricultural University research station located at Beijing, China. At least five Candidate region and gene identification. The high-altitude (HA) wheat acces- spikes per accession were pulled apart at physiological maturity. The rachis brit- sions (99 in total) were reserved to accessions gathered in Tibetan Plateau. The LA tleness phenotype was identified by applying elevated mechanical pressure by hand. wheat accessions (73 in total) were selected globally and clearly known to be in LA. The accessions which might be vague to be identified as HA or LA were excluded. The fi Chr3D deletion region identi cation. As the genomic comparison analysis top 5% quantile distributions of FST (AB subgenomes and D subgenome respectively) revealed a ~0.8 M segment deletion in chr3D in Zang1817 as compared with CS- between HA and LA groups were classified as candidate genetic differentiation IWGSCv1, that the region ranges chr3D:55.5–56.3 Mbp along with CS-IWGSC1 regions, and genes residing in these regions were classified as candidate genes for reference genome. We then identified this segment deletion based on the coverage high-altitude adaption. To exclude the possibility that divergence between HA and LA fi data in all resequence data, and identi ed this deletion in 50 Tibetan wheat accessions are caused by population structure variations, we performed the FST accessions, which mainly in Tibetan semi-wide wheat accession. analysis with randomly shuffled sample-labels repeatedly for 100 times to generate a series of empirical distributions (expected diversity) of FST. As the maximum FST Identification of homologs to barley Btr1/2 genes. To identify the Btr1 and Btr2 scores (0.22 for A&B subgenome, 0.15 for D subgenome, corresponding to the 5% homologs in wheat, we used the barley Btr1/2 gene sequences31 as queries to thresholds) in empirical distributions is still below our threshold, indicating that the identify the candidate homologs in the assembly of CS-IWGSCv1, Zang1817, thresholds for A&B subgenome and D subgenome as the criteria is able to rule out the Triticum dicoccoides WEWseq v1.076, and Aegilops tauschii Aet v4.012 using source of expected genomic diversity. TBLASTN, with the thresholds E value <10−5 and identity >60%. The candidate homolog genes were aligned with Btr1 and Btr2 of barley with MAFFT v777 to filter Haplotype analysis in response to high-altitude. Classifying the haplotypes of the out the candidates missing crucial domain. The remaining gene sequences were genotype-associated candidate genes into HA-prone haplotype and LA-prone hap- used to conduct BLASTN analyze against each other with the threshold E value <10 lotype helps understanding the adaption of wheat to the high-altitude environment. −5. In collinearity analysis, the gene pairs were first determined by selecting To gain enough statistical power in showing the haplotype distributions geo- homologous gene pairs with highest BLAST scores mutually, and then by selecting graphically, we merged the resequencing data in this study with a recently published the rest genes with its single-side best hits. WEC data set (>1000 accessions)19. Only SNPs that are polymorphic in both data sets were retained. For each gene, we identify the haplotypes of all samples based on SNPs residing in gene region using haplotype function in R package pegas v0.1280. Hap- TE insertion identification in TaQ-5A.Thegenomiccomparisonanalysisinlocal lotype harboring mutations that causing amino-acid changes and more enriched in region revealed a TE insertion in TaQ-5A gene in Zang1817 comparing with CS- HA groups were classified as HA-haplotype, and the other haplotype were classified as IWGSCv1. To inspect the TE insertion in TaQ-5A gene in each of the re-sequenced LA-haplotype. Samples or genes without enough sites or poorly covered were dis- wheat accessions, we scanned all the mapped reads covering the insertion site carded when generating the HapMap graph for specific genes. (chr5A:650129563 on CS-IWGSCv1) for each sample. Samples with at least two perfect matched reads were classified as non-insertion. For the rest samples with soft- clipped reads covering the insertion site, we extracted these soft-clipped reads, and Reporting summary. Further information on research design is available in the Nature then aligned these sequences in soft-clipped sequences with the 161 bp TE sequence Research Reporting Summary linked to this article. (detected in Zang1817); if the soft-clipped sequences are 100% match with detected TE sequence and length >10 bp, the sample was classified as with-TE insertion. The fi fi Data availability remaining samples were classi ed as unknown, considering insuf cient fi statistical power. Data supporting the ndings of this work are available within the paper and its Supplementary Information files. A reporting summary for this Article is available as a Supplementary Information file. The data sets generated and analyzed during the current GWAS analysis. GWAS was performed using a mixed linear model (MLM) study and genetic material used for this paper are available from the corresponding fi 78,79 incorporating population structure and kinship coef cients in TASSEL v5 to author upon request. The genomic sequencing reads and RNA-seq reads have been fi fi nd genomic sites signi cantly associated with rachis brittleness. All non-Tibetan deposited to NCBI Sequence Read Archive with accession PRJNA596843 and samples were classified as non-brittle to increase statistical power. Principal 71 PRJNA596859, respectively. Zang1817 genome assembly has been deposited at DDBJ/ components (PCs) of the association panel were calculated in PLINK v1.90 using ENA/GenBank under the accession JACSIS000000000. The version described in this the high-quality data for the 35,045,206 variation sites (SNPs or INDELs) with paper is version JACSIS010000000. The genome assembly and full annotation data are MAF ≥ 0.05 and missing data <30%. The first five PCs were used to estimate also available from website http://wheat.cau.edu.cn/Zang1817_genome/. The raw reads of population structure. The variance-covariance kinship matrix was estimated in 73 previously published re-sequenced accessions are available under NCBI Sequence TASSEL. The SNPs (P ≤ 2.85 × 10−10, Bonferroni-corrected threshold of 0.01) Read Archive accession PRJNA476679 and https://downloads-qcif.bioplatforms.com/ identified in the GWAS were used to declare significant marker-trait association. bpa/wheat_cultivars/cultivars/. The whole-exome capture data set containing 1,026 To identify phenotypic variances explained by chromosome 3D 0.8-Mb deletion and TE-insertion in TaQ-5A gene, we produced a pseudo-VCF file containing two hexaploid wheat accessions is downloaded from http://wheatgenomics.plantpath.ksu. sites for each of these two genomic variances, and set genotype of each sample to edu/1000EC/. Pfam databases used in this study are available at ftp://ftp.ebi.ac.uk/pub/ “0/0” (variant absence) or “1/1” (variant presence) and performed GWAS analyze databases/Pfam/releases/Pfam27.0/. KEGG databases used in this study are available at as mentioned above. To justify population structure bias on chr3B (Supplementary https://www.kegg.jp/kegg/download/. Source data are provided with this paper.

10 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5 ARTICLE

Code availability 28. Russell, J. et al. Exome sequencing of geographically diverse barley landraces The source code and scripts used in the paper have been deposited in GitHub [https:// and wild relatives gives insights into environmental adaptation. Nat. Genet. github.com/wangzihell/CAU-TibetanWheatSeq]. 48, 1024–1030 (2016). 29. Liu, J. et al. PSBP-DOMAIN PROTEIN1, a nuclear-encoded thylakoid lumenal protein, is essential for photosystem I assembly in Arabidopsis. Plant Received: 4 February 2020; Accepted: 3 September 2020; Cell 24, 4992–5006 (2012). 30. Sun, Q., Ni, Z., Liu, Z., Gao, J. & Huang, T. Genetic relationships and diversity among Tibetan wheat, and European wheat revealed by RAPD markers. Euphytica 99, 205–211 (1998). 31. Pourkheirandish, M. et al. Evolution of the grain dispersal system in Barley. Cell 162, 527–539 (2015). References 32. Chen, Q.-F., Yen, C. & Yang, J.-L. Chromosome location of the gene for brittle 1. Zsogon, A. et al. De novo domestication of wild tomato using genome editing. rachis in the Tibetan weedrace of common wheat. Genet. Resour. Crop Evol. Nat. Biotechnol. 36, 1211–1216 (2018). 45, 407–410 (1998). 2. Li, T. et al. Domestication of wild tomato is accelerated by genome editing. 33. Sormacheva, I. et al. Q gene variability in wheat species with different spike Nat. Biotechnol. 36, 1160–1163 (2018). morphology. Genet. Resour. Crop Evol. 62, 837–852 (2015). 3. Bailey-Serres, J., Parker, J. E., Ainsworth, E. A., Oldroyd, G. E. D. & Schroeder, 34. Faris, J. D., Fellers, J. P., Brooks, S. A. & Gill, B. S. A bacterial artificial J. I. Genetic strategies for improving crop yields. Nature 575, 109–118 (2019). chromosome contig spanning the major domestication locus Q in wheat and 4. Shewry, P. R. & Hey, S. J. The contribution of wheat to human diet and health. identification of a candidate gene. Genetics 164, 311–321 (2003). Food Energy Security 4, 178–202 (2015). 35. Li, L.-F., Li, Y.-L., Jia, Y., Caicedo, A. L. & Olsen, K. M. Signatures of 5. Marcussen, T. et al. Ancient hybridizations among the ancestral genomes of adaptation in the weedy rice genome. Nat. Genet. 49, 811–814 (2017). bread wheat. Science 345, 1250092 (2014). 36. Qiu, J. et al. Diverse genetic mechanisms underlie worldwide convergent rice 6. Huang, H., Wong, Y. & Zhang, Z. Trait clusting and evolution trend of wheat feralization. Genome Biol. 21, 70 (2020). subspecies in Tibet. Tibet J. Agric. Sci. 3,42–55 (1999). 37. Qiu, J. et al. Genomic variation associated with local adaptation of weedy rice 7. Shao, Q. & Li, C. B. Semi-wide wheat from Xizang (Tibet). J. Genet. genomics during de-domestication. Nat. Commun. 8, 15323 (2017). 7, 149–156 (1980). 38. Zeng, X. et al. Origin and evolution of qingke barley in Tibet. Nat. Commun. 8. Watanabe, N., Sugiyama, K., Yamagishi, Y. & Sakata, Y. Comparative 9, 5433 (2018). telosomic mapping of homoeologous genes for brittle rachis in tetraploid and 39. Betts, A., Jia, P. W. & Dodson, J. The origins of wheat in China and potential hexaploid wheats. Hereditas 137, 180–185 (2002). pathways for its introduction: a review. Quat. Int. 348,158–168 (2014). 9. Jiang, Y.-F. et al. Re-acquisition of the brittle rachis trait via a transposon 40. Murray, M. G. & Thompson, W. F. Rapid isolation of high molecular-weight insertion in domestication gene Q during wheat de-domestication. N. plant DNA. Nucleic Acids Res. 8, 4321–4325 (1980). Phytologist 224, 961–973 (2019). 41. Hu, Y. et al. Gossypium barbadense and Gossypium hirsutum genomes 10. Cheng, H. et al. Frequent intra- and inter-species introgression shapes the provide insights into the origin and evolution of allotetraploid cotton. Nat. landscape of genetic variation in bread wheat. Genome Biol. 20, 136 (2019). Genet. 51, 739–748 (2019). 11. The International Wheat Genome Sequencing Consortium (IWGSC). Shifting 42. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows- the limits in wheat research and breeding using a fully annotated reference Wheeler transform. Bioinformatics 25, 1754–1760 (2009). genome. Science 361, eaar7191 (2018). 43. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for 12. Luo, M. C. et al. Genome sequence of the progenitor of the wheat D genome Illumina sequence data. Bioinformatics 30, 2114–2120 (2014). Aegilops tauschii. Nature 551, 498–502 (2017). 44. Cox, M. P., Peterson, D. A. & Biggs, P. J. SolexaQA: at-a-glance quality 13. Simao, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, assessment of Illumina second-generation sequencing data. BMC E. M. BUSCO: assessing genome assembly and annotation completeness with Bioinformatics 11, 485 (2010). single-copy orthologs. Bioinformatics 31, 3210–3212 (2015). 45. Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome Res. 14. Sun, S. et al. Extensive intraspecific gene order and gene structural variations 14, 988–995 (2004). between Mo17 and other maize genomes. Nat. Genet. 50, 1289–1295 (2018). 46. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data 15. Zhang, J. et al. Extensive sequence divergence between the reference genomes without a reference genome. Nat. Biotechnol. 29, 644–652 (2011). of two elite indica rice varieties Zhenshan 97 and Minghui 63. Proc. Natl. 47. Kent, W. J. BLAT—The BLAST-like alignment tool. Genome Res. 12, 656–664 Acad. Sci. USA 113, E5163–E5171 (2016). (2002). 16. Guo, X. et al. The genomes of two Eutrema species provide insight into plant 48. Haas,B.J.etal.ImprovingtheArabidopsis genome annotation using adaptation to high altitudes. DNA Res. 25, 307–315 (2018). maximal transcript alignment assemblies. Nucleic Acids Res. 31,5654–5666 17. Zhang, T. et al. Genome of Crucihimalaya himalaica, a close relative of (2003). Arabidopsis, shows ecological adaptation to high altitude. Proc. Natl. Acad. 49. Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction Sci. USA 116, 7137–7146 (2019). in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33, 18. Zhang, J. et al. Genome of Plant Maca (Lepidium meyenii) Illuminates W465–W467 (2005). genomic basis for high-altitude adaptation in the Central Andes. Mol. Plant 9, 50. Burge, C. & Karlin, S. Prediction of complete gene structures in human 1066–1077 (2016). genomic DNA. J. Mol. Biol. 268,78–94 (1997). 19. He, F. et al. Exome sequencing highlights the role of wild-relative 51. Parra, G., Blanco, E. & Guigo, R. GeneID in Drosophila. Genome Res. 10, introgression in shaping the adaptive landscape of the wheat genome. Nat. 511–515 (2000). Genet. 51, 896–904 (2019). 52. Majoros, W. H., Pertea, M. & Salzberg, S. L. TigrScan and GlimmerHMM: two 20. Lee, S.-Y. et al. Identification of Tyrosyl-DNA phosphodiesterase as a novel open source ab initio eukaryotic gene-finders. Bioinformatics 20, 2878–2879 DNA damage repair enzyme in Arabidopsis. Plant Physiol. 154, 1460 (2010). (2004). 21. Sample, A. & He, Y.-Y. Autophagy in UV damage response. Photochem. 53. Leskovec, J. & Sosic, R. SNAP: a general-purpose network analysis and graph- Photobiol. 93, 943–955 (2017). mining library. ACM Trans. Intell. Syst. Technol. 8, 20 (2016). 22. Phillips, A. R., Suttangkakul, A. & Vierstra, R. D. The ATG12-conjugating 54. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions enzyme ATG10 Is essential for autophagic vesicle formation in Arabidopsis with RNA-Seq. Bioinformatics 25, 1105–1111 (2009). thaliana. Genetics 178, 1339–1353 (2008). 55. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals 23. Fujimoto, S. Y., Ohta, M., Usui, A., Shinshi, H. & Ohme-Takagi, M. Arabidopsis unannotated transcripts and isoform switching during cell differentiation. ethylene-responsive element binding factors act as transcriptional activators or Nat. Biotechnol. 28, 511–515 (2010). repressors of GCC box-mediated gene expression. Plant Cell 12, 393–404 (2000). 56. Haas, B. J. et al. Automated eukaryotic gene structure annotation using 24. Osanai, T. et al. ChlH, the H subunit of the Mg-chelatase, is an anti-sigma EVidenceModeler and the program to assemble spliced alignments. Genome factor for SigE in Synechocystis sp PCC 6803. Proc. Natl. Acad. Sci. USA 106, Biol. 9, R7 (2008). 6860–6865 (2009). 57. Jones, P. et al. InterProScan 5: genome-scale protein function classification. 25. Gangappa, S. N. & Botto, J. F. The multifaceted roles of HY5 in plant growth Bioinformatics 30, 1236–1240 (2014). and development. Mol. Plant 9, 1353–1365 (2016). 58. Mistry, J., Finn, R. D., Eddy, S. R., Bateman, A. & Punta, M. Challenges in 26. Kamran, A., Iqbal, M. & Spaner, D. Flowering time in wheat (Triticum homology search: HMMER3 and convergent evolution of coiled-coil regions. aestivum L.): a key factor for global adaptability. Euphytica 197,1–26 (2014). Nucleic Acids Res. 41, e121 (2013). 27. Guo, Z., Song, Y., Zhou, R., Ren, Z. & Jia, J. Discovery, evaluation and 59. De Koning, A. P. J., Gu, W., Castoe, T. A., Batzer, M. A. & Pollock, D. D. distribution of haplotypes of the wheat Ppd-D1 gene. N. Phytologist 185, Repetitive elements may comprise over two-thirds of the human genome. PloS 841–851 (2010). Genet. 7, e1002384 (2011).

NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications 11 ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-18738-5

60. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of Acknowledgements protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997). We thank Lihui Li (Chinese Academy of Agricultural Sciences), Youliang Zheng 61. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for (Sichuan Agricultural University), and Wanquan Ji (Northwest A&F University) for comparing genomic features. Bioinformatics 26, 841–842 (2010). providing wheat seeds and Yue Zhao, Jun Yuan, and Jingjing Wen for extracting DNA 62. Li, L., Stoeckert, C. J. & Roos, D. S. OrthoMCL: identification of ortholog and RNA. This work was supported by the Major Program of the National Natural groups for eukaryotic genomes. Genome Res. 13, 2178–2189 (2003). Science Foundation of China (31991210 and 91935304), Chinese Universities Scientific 63. Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of Fund (2017TC035 and 2019TC153), and Strategic International Science and Technology gene synteny and collinearity. Nucleic Acids Res. 40, e49 (2012). Innovation Collaboration Project (2020YFE0202300). 64. Tang, H. et al. Screening synteny blocks in pairwise genome comparisons through integer programming. BMC Bioinformatics 12, 102 (2011). Author contributions 65. Marcais, G. et al. MUMmer4: a fast and versatile genome alignment system. Plos Comput. Biol. 14, e1005944 (2018). Q.S., Z.N., and H.P. designed the research. Z.W., W.G., M.X., Y.Y., Z.H., W.S., K.Y., Y.C., and X.W. performed analyses. H.P. and P.G. collected phenotypic data. M.X., Q.S., and R. 66. Barnett, D. W., Garrison, E. K., Quinlan, A. R., Stroemberg, M. P. & Marth, G. fi T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. A. wrote and edited the manuscript. All authors read and approved the nal manuscript. Bioinformatics 27, 1691–1692 (2011). 67. Li, H. et al. The sequence alignment/map format and SAMtools. Competing interests Bioinformatics 25, 2078–2079 (2009). The authors declare no competing interests. 68. McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). Additional information 69. Cingolani, P. et al. A program for annotating and predicting the effects of Supplementary information is available for this paper at https://doi.org/10.1038/s41467- single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila 020-18738-5. melanogaster strain w(1118); iso-2; iso-3. Fly 6,80–92 (2012). 70. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of Correspondence and requests for materials should be addressed to H.P., Z.N. or Q.S. ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009). 71. Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger Peer review information Nature Communications thanks Jill Preston, and the other, and richer datasets. Gigascience 4, 7 (2015). anonymous, reviewer(s) for their contribution to the peer review of this work. 72. Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2019). Reprints and permission information is available at http://www.nature.com/reprints 73. Price, A. L. et al. Principal components analysis corrects for stratification in ’ genome-wide association studies. Nat. Genet. 38, 904–909 (2006). Publisher s note Springer Nature remains neutral with regard to jurisdictional claims in fi 74. Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. published maps and institutional af liations. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PloS Genet. 5, e1000695 (2009). 75. Molina, J. et al. Molecular evidence for a single evolutionary origin of Open Access This article is licensed under a Creative Commons domesticated rice. Proc. Natl. Acad. Sci. USA 108, 8351–8356 (2011). Attribution 4.0 International License, which permits use, sharing, 76. Avni, R. et al. Wild emmer genome architecture and diversity elucidate wheat adaptation, distribution and reproduction in any medium or format, as long as you give evolution and domestication. Science 357,93–96 (2017). appropriate credit to the original author(s) and the source, provide a link to the Creative 77. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software Commons license, and indicate if changes were made. The images or other third party version 7: improvements in performance and usability. Mol. Biol. Evolution material in this article are included in the article’s Creative Commons license, unless – 30, 772 780 (2013). indicated otherwise in a credit line to the material. If material is not included in the 78. Bradbury, P. J. et al. TASSEL: software for association mapping of complex article’s Creative Commons license and your intended use is not permitted by statutory – traits in diverse samples. Bioinformatics 23, 2633 2635 (2007). regulation or exceeds the permitted use, you will need to obtain permission directly from 79. Zhang, Z. et al. Mixed linear model approach adapted for genome-wide – the copyright holder. To view a copy of this license, visit http://creativecommons.org/ association studies. Nat. Genet. 42, 355 360 (2010). licenses/by/4.0/. 80. Paradis, E. pegas: an R package for population genetics with an integrated- modular approach. Bioinformatics 26, 419–420 (2010). © The Author(s) 2020

12 NATURE COMMUNICATIONS | (2020) 11:5085 | https://doi.org/10.1038/s41467-020-18738-5 | www.nature.com/naturecommunications