Dong et al. BMC Genomics (2015) 16:1039 DOI 10.1186/s12864-015-2257-y

RESEARCH ARTICLE Open Access

Single-molecule real-time transcript sequencing facilitates common wheat genome annotation and grain transcriptome research

Lingli Dong1†, Hongfang Liu2†, Juncheng Zhang1, Shuangjuan Yang1,3, Guanyi Kong2, Jeffrey S. C. Chu2,4, Nansheng Chen5,6* and Daowen Wang1,7*

Abstract

Background: The large and complex hexaploid genome has greatly hindered genomics studies of common wheat (Triticum aestivum, AABBDD). Here, we investigated transcripts in common wheat developing caryopses using the emerging single-molecule real-time (SMRT) sequencing technology PacBio RSII, and assessed the resultant data for improving common wheat genome annotation and grain transcriptome research.

Results: We obtained 197,709 full-length non-chimeric (FLNC) reads, 74.6 % of which were estimated to carry a complete open reading frame. A total of 91,881 high-quality FLNC reads were identified and mapped to 16,188 chromosomal loci, corresponding to 13,162 known genes and 3026 new genes not annotated previously. Although some FLNC reads could not be unambiguously mapped to the current draft genome sequence, many of them are likely useful for studying highly similar homoeologous or paralogous loci or for improving chromosomal contig assembly in further research. The 91,881 high-quality FLNC reads represented 22,768 unique transcripts, 9591 of which were newly discovered. We found 180 transcripts each spanning two or three previously annotated adjacent loci, suggesting that they should be merged to form correct gene models. Finally, our data facilitated the identification of 6030 genes differentially regulated during caryopsis development, and of full-length transcripts for 72 transcribed gluten gene members that are important for the end-use quality control of common wheat.

Conclusions: Our work demonstrated the value of PacBio transcript sequencing for improving common wheat genome annotation by uncovering loci and full-length transcripts not discovered previously. The resource obtained may aid further structural genomics and grain transcriptome studies of common wheat.

Keywords: SMRT sequencing, PacBio RSII, Genome annotation, Transcriptome, Grain development, Common wheat

* Correspondence: Nansheng Chen; Daowen Wang
†Equal contributors
1The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China
5School of Life Science and Technology, Huazhong Agricultural University, Wuhan 430075, China
Full list of author information is available at the end of the article

© 2015 Dong et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Background

Structural and functional genomics studies are fundamental to the understanding of biology. To effectively perform such studies, it is essential to have access to high-quality genome and transcriptome sequences. The advent of second generation sequencing (SGS) technologies, such as the Illumina technology, has stimulated the construction of genome and transcriptome resources for many plant species [1–4]. The development of a draft genome sequence using SGS technologies generally involves three main steps: generating and assembling short sequence reads into longer DNA contigs, ordering contigs along chromosomes, and annotating protein-coding genes and other elements for the contigs [1, 2]. The construction of transcriptomic sequences generally entails the production and assembly of short RNA-seq reads, and the assembly step can be made easier if there is a high-quality genome sequence available as a reference [3, 4].

During the development of genome and transcriptome resources, full-length transcripts can greatly increase the accuracy of genome annotation and transcriptome characterization when compared to the transcript tags assembled from short RNA-seq reads. Full-length transcript sequences permit efficient analysis of exon-intron structure and alternative splicing, thus facilitating a complete understanding of the transcriptional behavior of genomic loci [5–7]. Furthermore, well characterized full-length transcripts are also beneficial for subsequent functional studies of important loci. On the other hand, the transcript tags derived from RNA-seq may suffer from misassembly of the reads transcribed from highly repetitive regions or very similar members of multigene families [3, 4]. This problem may become even more severe for polyploids, which often harbor a large number of nearly identical homoeologous gene sets. Although full-length transcripts are highly desirable, their production was often labor-intensive and time-consuming in the past because of the need to clone individual cDNAs and to sequence them by traditional Sanger sequencing [5–7]. Recently, the third generation sequencing technology PacBio RSII has emerged as a unique opportunity for constructing full-length transcripts [8, 9]. This technology accomplishes single-molecule real-time (SMRT) sequencing with a read length up to 20 kb [8, 9], which renders PacBio RSII very effective in sequencing full-length cDNAs including long transcript isoforms [10–12]. One concern with PacBio sequencing is its relatively high error rate, but this can be effectively reduced through two types of correction, i.e., by constructing consensus sequence reads from raw PacBio subreads and by alignment with the reads generated from appropriate SGS platforms [13–15]. However, compared to many successful applications in human research (http://www.pacificbiosciences.com/news_and_events/publications/), the use of PacBio sequencing to assist plant transcriptome investigations has so far been limited [16–20]. Nevertheless, significant progress has recently been made on using SMRT cDNA reads to aid the prediction and validation of plant gene models [20].

Owing to broad adaptability and numerous end-uses, wheat is the most widely cultivated and consumed staple food crop [21]. Among the two main types of wheat cultivated currently, the hexaploid common wheat (bread wheat, Triticum aestivum, AABBDD, 2n = 6x = 42) is predominant, accounting for 95 % of global wheat production [22]. As a polyploid, the genome of common wheat is both large (about 17 Gb) and complex (containing 80 % or more repetitive DNA) [23, 24]. A complete reference genome sequence is still unavailable for common wheat. But a draft genome sequence, constructed mainly using Illumina HiSeq sequencing technology and covering about 60 % of the hexaploid genome, has recently been published for the spring-type common wheat land race Chinese Spring (CS) [24]. A large number of contigs were assembled for each of the 21 chromosomes, and a total of 133,090 high-confidence genes were annotated [24]. Because of its incomplete coverage, it is possible that many genes are missing or fragmented (i.e., exist in multiple short contigs) in the current draft genomic sequence. Thus, substantial efforts are being devoted to improving this draft genome sequence through refining contig assembly, gene annotation or both for individual chromosomes [25–27].

The A and D subgenomes of common wheat are derived from Triticum urartu (AuAu) and Aegilops tauschii (DtDt), respectively, whereas subgenome B probably originated from an extinct Aegilops species [28, 29]. A natural hybridization, which occurred about 10,000 years ago between wild tetraploid wheat (T. turgidum ssp. dicoccoides, AABB) and Ae. tauschii, gave rise to common wheat [29]. Thus, the A and D subgenomes of common wheat are largely syntenic with their counterparts in the diploid progenitors [29]. Using the Illumina HiSeq platform, draft genome sequences have been developed for T. urartu and Ae. tauschii, with the number of protein-coding genes annotated for the two species being 34,879 and 43,150, respectively [30, 31]. There is also an effort to develop a detailed genomic sequence of Ae. tauschii based on sequencing chromosomally ordered BAC clones [32, http://aegilops.wheat.ucdavis.edu/ATGSP/data.php]. The contigs and annotated gene sets of T. urartu and Ae. tauschii have been useful for constructing the draft genome of common wheat [24]. In addition, the genomic information of other sequenced grasses, including rice, Brachypodium distachyon, sorghum and maize, has also aided contig ordering and gene annotation during the development of the CS draft genome sequence [24].

Grains represent the most valuable organ of common wheat, and have been the subject of numerous genetic, breeding, and more recently, functional genomics studies [32, 33]. They develop after double fertilization of caryopses, and their morphometric, weight and biochemical traits are the targets of wheat yield and quality improvement efforts [32, 33]. A huge number of genes and genetic interactions have been found to be involved in common wheat grain development, and their function is often regulated at multiple levels, including temporal and spatial transcriptional regulation [32, 33]. Because of this complexity, our understanding of the genes functioning in common wheat grain development is still incomplete, which is not conducive to effectively improving the yield and end-use traits of this important crop.

The main objectives of this study were to sequence the transcripts expressed during common wheat caryopsis development using the emerging SMRT sequencing platform PacBio RSII, and to assess the resultant data for improving common wheat genome annotation and grain transcriptome research. Towards these aims, we first identified a population of full-length non-chimeric (FLNC) SMRT cDNA reads from a pooled sample of unfertilized caryopses and developing grains using PacBio sequencing. Then we mapped the reads to the draft genome sequence of CS, and performed an in-depth analysis of the high-quality reads. Finally, we examined the value of the FLNC reads for finding full-length transcript sequences of the genes encoding three complex families of gluten proteins: high-molecular weight glutenin subunits (HMW-GSs), low-molecular weight glutenin subunits (LMW-GSs) and gliadins. These proteins are specifically expressed in the developing grains and are important determinants of the processing and nutritional qualities of common wheat [34]. We also produced transcriptomic reads using HiSeq 2000 for the unfertilized caryopsis and developing grain samples, separately, for three purposes, i.e., error correction of FLNC reads, validation of exon-intron junction sequence in FLNC reads, and investigation of the genes whose transcription was differentially regulated during caryopsis development. The experimental variety used in this work, Xiaoyan 81, is an elite winter-type common wheat line with superior end-use quality [35]. Understanding the transcripts during the grain development of Xiaoyan 81 may provide a useful transcriptome resource for genetically enhancing the end-use quality of common wheat.

Results

Transcript sequencing and error correction

Using mRNAs extracted from a pooled sample of unfertilized caryopses and developing grains collected at 5, 15 and 25 days after anthesis (DAA), two different libraries, with cDNA insert size < 2 kb and ≥ 2 kb, respectively, were prepared. Each library was sequenced using four SMRT cells on the PacBio RSII platform. In total, we obtained 526,915 continuous long reads (CLRs), including 265,832 from the short-insert library and 261,083 from the long-insert library (Additional file 1). Following a previous publication [20], the CLRs were divided into two types, type I containing two or more subreads and type II with only one subread. From type I CLRs we identified a total of 1,618,400 subreads, which formed 240,312 circular consensus sequences (CCSs) after merging and error correction through subread comparison (Additional file 1). Of the 240,312 CCSs, 175,375 were found to be FLNC reads because each of them contained a distinct poly(A) tail and the 5′ and 3′ cDNA synthesis primers. The type II CLRs carried 570,293 subreads, from which 22,334 FLNC reads were identified (Additional file 1). Consequently, a total of 197,709 FLNC reads (175,375 + 22,334) were obtained.

To further correct the 197,709 reads, we generated Illumina HiSeq 2000 transcriptomic reads for each of the unfertilized caryopsis and developing grain samples. After adaptor sequence trimming and low-quality read filtering, we obtained an average of 69.3 million reads (with a mean size of 101 bp) for each of the four samples (Additional file 2). The proovread software, which had been found highly efficient for correcting SMRT sequences through iterative short read consensus [20, 36], was used to correct the 197,709 reads. Before proovread correction, the average alignment identity of the 197,709 reads to the CS draft genome sequence was 96.2 %. This value increased to 98.3 % after proovread correction. We therefore focused our subsequent investigations on the 197,709 error-corrected FLNC reads.

Estimation of the proportion of FLNC reads carrying complete open reading frame

To evaluate the proportion of FLNC reads carrying a complete open reading frame (ORF), we made use of the 5495 wheat full-length cDNAs published previously [37] and the draft genome sequence of CS. After mapping the wheat full-length cDNAs and FLNC reads to the draft genome using the software GMAP [38], 1347 gene loci were found to be covered by both data sources. The number of FLNC reads assigned to the 1347 loci was 28,599. Out of the 28,599 reads, 21,326 (74.6 %) carried a complete ORF (with start and stop codons) as defined in the wheat full-length cDNAs.

Genome mapping of 197,709 FLNC reads

First, the 197,709 FLNC reads were mapped against the draft genome sequence of CS using GMAP. During mapping, the alignment direction of 10,944 reads could not be reliably determined. Therefore, the genome mapping characteristics of the remaining 186,765 reads were further investigated. These reads could be divided into five groups (G1 to G5, Fig. 1a). G1 consisted of 134,204 reads (67.88 % of the 197,709 FLNC reads), each of which could be mapped to one unique location with higher than 90 % coverage and identity. Thus, the G1 reads were mapped to the draft genome sequence with high confidence. G2 contained 15,352 reads showing multiple best alignments (with identity and/or coverage values ≥ 90 %). G3 included 27,014 reads exhibiting partial mapping to two or more distinct draft genome contigs (coverage 30–80 %, identity ≥ 90 %). G4 had 8669 reads generally showing low-quality alignment to the draft genome (coverage 30–50 %, identity 40–90 %). Finally, G5 contained 1526 reads with no significant mapping to the draft genome.
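The grouping of reads into G1 to G5 described above can be illustrated with a short sketch. This is not the authors' pipeline: the `Alignment` record, the function name and the exact cut-offs (in particular the use of ≥ 90 % for G1, and treating every remaining aligned read as G4) are assumptions made for illustration only. The group sizes are taken from the text and are checked against the reported total of 186,765 reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    contig: str      # contig the read aligned to (hypothetical field names)
    coverage: float  # percent of the read covered by this alignment
    identity: float  # percent identity over the aligned portion


def classify_read(alignments: List[Alignment]) -> str:
    """Return 'G1'..'G5' for one read, using assumed cut-offs."""
    if not alignments:
        return "G5"  # no significant mapping
    # Alignments that are both nearly complete and accurate.
    full = [a for a in alignments if a.coverage >= 90 and a.identity >= 90]
    if len(full) == 1:
        return "G1"  # one unique high-confidence location
    if len(full) > 1:
        return "G2"  # multiple best alignments
    # Accurate but partial alignments, counted per distinct contig.
    partial_contigs = {
        a.contig
        for a in alignments
        if 30 <= a.coverage <= 80 and a.identity >= 90
    }
    if len(partial_contigs) >= 2:
        return "G3"  # read split across two or more contigs
    return "G4"      # assumption: any other aligned read is low quality


# Group sizes reported in the text; they sum to the 186,765 analysed reads.
GROUP_SIZES = {"G1": 134_204, "G2": 15_352, "G3": 27_014,
               "G4": 8_669, "G5": 1_526}
TOTAL_ANALYSED = sum(GROUP_SIZES.values())
```

Because the published thresholds for the groups overlap in places, a real implementation would need the aligner's own notion of a "best" alignment; the sketch only shows the order in which the tests could be applied.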

[Figure 1. Panel a: pie chart of the five read groups. G1, mapped to single best loci: 134,204; G2, mapped to multiple loci: 15,352; G3, mapped to two or more contigs: 27,014; G4, low quality mapping: 8669; G5, no significant mapping: 1526. Panel b: FLNC read m140526_121759_42199_c100632732550000001823123009121491_s1_X0/152066/31_1768_CCS, the CS contig 1DL_2289966, and the Ae. tauschii orthologous contig TGAC_WGS_tauschii_v1_contig_90959.]

Fig. 1 Mapping FLNC reads to the draft genome sequence of Chinese Spring (CS). a Division of the 186,765 FLNC reads into five groups (G1 to G5) based on their genome mapping characteristics. The number of reads in each group is depicted in the pie chart. b An example illustrating the FLNC read mapped to two different CS contigs located on the same chromosome arm. The read is shown as a split-mapped molecule (SMM) with both exon (filled box) and intron (line between two neighboring exons) depicted. The arrow indicates the direction of alignment to the genomic sequence. The two CS contigs to which the shown FLNC read mapped were both located on the long arm of chromosome 1D (1DL). The representative transcripts annotated for the two contigs by the draft genome sequence of CS are shown below as SMMs (boxed in green). The bottom panel is the Ae. tauschii contig orthologous to the two 1DL contigs of CS. The transcript annotated for this Ae. tauschii contig is also provided as a SMM (boxed in purple). The Ae. tauschii contig was identified by mapping the exemplary FLNC read to the Dt genome sequencing database (http://aegilops.wheat.ucdavis.edu/ATGSP/data.php). The diagrams shown are not drawn to scale
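The situation drawn in panel b, one read whose two halves align to different contigs of the same chromosome arm, can be expressed as a small check. The sketch below is an illustration under stated assumptions, not the method used in the study: the `Segment` record, the overlap and coverage cut-offs, and the idea of reading the arm from the contig ID prefix (as in `1DL_2289966`) are all hypothetical, and the contig IDs in the usage note are invented.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    contig: str      # contig ID, assumed to look like "<arm>_<number>"
    read_start: int  # 0-based start of the aligned part on the read
    read_end: int    # end of the aligned part on the read (exclusive)


def arm_of(contig: str) -> str:
    """Chromosome arm taken from the contig ID prefix (assumed naming)."""
    return contig.split("_", 1)[0]


def bridges_contigs(a: Segment, b: Segment, read_len: int,
                    max_overlap: int = 20,
                    min_covered: float = 0.9) -> bool:
    """True if two alignments on different contigs tile most of the read."""
    if a.contig == b.contig:
        return False
    first, second = sorted((a, b), key=lambda s: s.read_start)
    # Read positions claimed by both alignments.
    overlap = max(0, min(first.read_end, second.read_end) - second.read_start)
    covered = ((first.read_end - first.read_start)
               + (second.read_end - second.read_start)
               - overlap)
    return overlap <= max_overlap and covered / read_len >= min_covered
```

For example, `bridges_contigs(Segment("1DL_0000001", 0, 900), Segment("1DL_0000002", 890, 1700), 1700)` evaluates to `True`, and `arm_of` returns `"1DL"` for both hypothetical contigs, which corresponds to the same-arm case shown in the figure.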

Second, we examined the level of splice junction support with HiSeq transcriptomic reads for the well mapped G1 FLNC reads. As anticipated, nearly all of the splice junction sites (96 %) in the 134,204 G1 reads were supported by HiSeq transcriptome data. The most frequent splice junction was GT-AG (accounting for 98.51 %), followed by GC-AG (1.29 %), AT-AC (0.11 %) and some other minor types (0.09 %). These findings conformed to the consensus splice sites of higher plants [39–41].

Third, we investigated the mapping characteristics of the reads in G2 and G3. Among the 15,352 G2 reads, a few (303) exhibited less than 90 % identity or coverage values when mapped to the CS draft genome sequence, and were not considered further. Of the remaining 15,049 G2 reads, 554 showed one best mapping in all three (A/B/D) or two (A/B, B/D or A/D) subgenomes; 6348 exhibited mapping to two or more chromosomal locations in one of the three subgenomes; 8147 were mapped to two or all three subgenomes with multiple mapping (≥2) found in at least one subgenome (Additional file 3). Therefore, the G2 reads were most likely transcribed from highly similar homoeologous and/or paralogous loci. Among the 27,014 G3 reads, 6102 were mapped to two different contigs, whereas the remaining 20,912 reads displayed alternative and complex mapping patterns. Interestingly, of the 6102 reads, 4219 bridged different CS contigs generally from the same chromosome arm (Fig. 1b), suggesting that the genes giving rise to the 4219 reads may not be adequately covered by the current CS draft genome sequence. To investigate this possibility, these reads were compared to the Dt genome sequence of Ae. tauschii constructed based on a high-quality physical map and BAC clone sequencing (http://aegilops.wheat.ucdavis.edu/ATGSP/data.php) [42]. We found 90 different cases where the two separate CS draft genome contigs bridged by a G3 FLNC read were actually contiguous in the Dt genome sequence (Fig. 1b, Additional file 4). A total of 200 similar cases were found when the 4219 G3 reads were compared to the scaffolds of the T. urartu draft genome sequence (Additional file 4). Clearly, many of the G3 reads are likely useful for improving the contig assembly of the CS draft genome sequence.

Lastly, because the FLNC reads in G4 and G5 showed low-quality or no apparent mapping to the CS draft genome sequence, we investigated if some of them might have corresponding orthologous genes in the Au genome of T. urartu or the Dt genome of Ae. tauschii. These reads were therefore aligned to the annotated protein-coding genes of T. urartu (34,879) and Ae. tauschii (43,150) [30, 31]. If some of the FLNC reads map to distinct genes in the Au or Dt genomes with both coverage and identity values higher than 90 %, then it is possible that the orthologous common wheat loci do exist but are inadequately, or not yet, covered by the current draft genome sequence. As displayed in Additional file 5, the G4 FLNC reads having corresponding genes in T. urartu and Ae. tauschii amounted to 3076 and 2150, respectively, and a number of G5 reads were also found to have corresponding genes in both species. These data supported the assumption that many gene loci were still missing in the present draft genome sequence of CS.

Detailed characterization of G1 FLNC reads

To further analyze the 134,204 well mapped G1 FLNC reads, more stringent criteria were adopted when finding their genomic loci in the CS draft genome sequence using GMAP. The reads missing 5′ exons and the singleton reads not supported by RNA-seq data were not considered further. These criteria were also applied in previous PacBio transcriptome studies [e.g., 10, 20]. With this filtering step, 42,323 FLNC reads were excluded from further analysis. The remaining 91,881 reads were regarded as having high-quality nucleotide sequence suitable for more detailed analysis. Of the 91,881 reads, 83,736 were assigned to 13,162 extant loci, and 8145 were mapped to 3745 contig regions that did not have prior annotated gene models (Table 1). The new loci defined by the 8145 reads amounted to 3026 (Table 1, Additional file 6). Of these newly identified loci, 666 were located on the contigs with other gene models, and 2360 were on the contigs that did not have any previously annotated gene models (Additional file 6). When compared to the genes in GO, KEGG, KOG and NR databases, 2433 of the 3026 new loci (80.4 %) could be annotated (Additional file 6). Together, the 91,881 reads were mapped to 16,188 loci distributed on 21 common wheat chromosomes (Table 1, Fig. 2). The percentages of mapped FLNC reads and loci varied among the 42 chromosomal arms (Fig. 2). The distribution of the 3026 new loci also differed among the 42 chromosomal arms, with more of them tending to be found on the long arms of subgenome B chromosomes (Fig. 2). Among the 16,188 loci, the majority (92.0 %) were covered by 2 to 10 FLNC reads, while the rest (8.0 %) were each supported by ≥ 10 reads. There were 49 loci each covered by more than 100 reads.

Unique transcripts represented by the cDNA inserts of the 91,881 high-quality FLNC reads, including different isoforms for multi-exon genes, were examined. In this study, we defined the transcript isoforms of a multi-exon gene as having at least one different intron/exon junction. A total of 22,768 unique transcripts were identified, including 19,023 transcribed from previously annotated loci and 3745 from the loci newly annotated by this work (Table 1). Of the 19,023 transcripts corresponding to known loci, 13,177 confirmed previous annotations, whereas 5846 were newly discovered by this work (Table 1). Consequently, the total number of transcripts newly discovered by this work was 9591 (5846 + 3745, Table 1). The size of the 9591 transcripts varied from 575 to 4537 bp, with the mean being 2433 bp. For comparison, we calculated the size range and average length of previously reported transcripts for the 13,162 existing loci, which was 671–4636 bp with an average of 2388 bp. Thus, on average, the newly discovered transcripts were 45 bp longer than previously reported common wheat transcripts. The 22,768 transcripts identified based on our PacBio sequencing and their corresponding genomic loci and representative FLNC reads are listed in Additional file 7.

Interestingly, we found 180 transcripts (among the 22,768 transcript set) each spanning two or three different genes annotated by the CS draft genome sequence (Additional file 8). These transcripts carried an intact ORF capable of encoding polypeptides of 80 to 1307 amino acids (Additional file 8). One possible explanation for this observation was that the loci transcribing the 180 transcripts were incorrectly annotated as separate genes in the CS draft genome sequence. To investigate this possibility, the 180 transcripts were mapped against the genomes of B. distachyon and rice. The mapping results showed that 66 such transcripts were reliably mapped to discrete genomic locations each with a single gene annotation in B. distachyon (coverage ≥ 90 %, mean identity 87 %) (Fig. 3a). Twenty-eight of these transcripts could be aligned to distinct rice loci (coverage ≥ 90 %, mean identity 84 %), all of which had one annotated gene (Fig. 3b).
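The isoform definition used in this section (two transcripts of a multi-exon gene are distinct isoforms when they differ in at least one intron/exon junction) amounts to grouping reads by their ordered list of introns. The following is a minimal sketch of that idea, assuming each read is already reduced to a locus name and a list of exon intervals; the data layout and function names are hypothetical and do not describe the software used by the authors.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end) on the contig, end exclusive


def intron_chain(exons: List[Exon]) -> Tuple[Tuple[int, int], ...]:
    """Ordered introns implied by a read's exons (empty for one exon)."""
    ordered = sorted(exons)
    return tuple((left[1], right[0])
                 for left, right in zip(ordered, ordered[1:]))


def collapse_isoforms(
    reads: Dict[str, Tuple[str, List[Exon]]]
) -> Dict[tuple, List[str]]:
    """Group read IDs by (locus, intron chain).

    Reads that differ only at their outer 5' or 3' ends share a key;
    reads that differ in any junction get separate keys.
    """
    groups: Dict[tuple, List[str]] = defaultdict(list)
    for read_id, (locus, exons) in reads.items():
        groups[(locus, intron_chain(exons))].append(read_id)
    return dict(groups)
```

With this grouping, each key corresponds to one unique transcript in the sense used above. Single-exon reads at a locus all share the empty intron chain, so they collapse into one transcript, which is a simplification of this sketch.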

Table 1 Genomic loci and unique transcripts represented by 91,881 high-quality FLNC reads
        FLNC read   Locus             Transcript
        83,736      13,162 (extant)   19,023 (13,177 extant, 5846 new)
        8145        3026 (new)        3745 (new)
Total   91,881      16,188            22,768

Fig. 2 Chromosomal distributions of 91,881 high-quality FLNC reads and the loci identified by them. The total and new loci identified by the 91,881 reads numbered 16,188 and 3026, respectively. The three values were used as backgrounds for calculating the percentages displayed along each short arm (SA) and long arm (LA)

Investigation of the genes differentially expressed during caryopsis development

To promote use of the transcripts identified by our PacBio sequencing for studying wheat caryopsis development in further research, we conducted the following lines of analysis. First, we computed the number of expressed genes, and the number of expressed genes covered by transcripts identified through PacBio sequencing, in the unfertilized caryopses and the developing grains at 5, 15 and 25 DAA (Table 2). The genes expressed at the four stages (S1 to S4), as identified using the uniquely mapped Illumina transcriptomic reads, amounted to 50,650, 42,444, 44,547 and 37,369, respectively (Table 2). These numbers were comparable to those reported for CS developing grains by a previous study [33]. At S1 to S4, the expressed genes with coverage by the transcripts identified through PacBio sequencing were 11,798, 9452, 10,626 and 8676, respectively, with the total number of such transcripts detected at the four stages being 17,330, 13,938, 15,909 and 12,943, respectively (Table 2).

Second, we found 6030 genes showing differential expression of the transcripts identified by our PacBio sequencing among S1 to S4 (Additional file 9). Of these genes, 4783 had only a single form of transcript detected, and 1247 had two or more transcript isoforms. Two genes, TRAES_1DS_114C78BF4 (encoding a putative RING/U-box superfamily protein) and TRAES_2BS_CDE410A7D (specifying a probable O-fucosyltransferase family protein), were chosen as representatives to test if the differential expression computed could be verified by RT-PCR. Two different transcript isoforms (designated as a and b) were identified for TRAES_1DS_114C78BF4, with both present from S1 to S3 but only isoform b at S4 (Additional file 9). Consistent with this finding, RT-PCR analysis using specific primers confirmed the presence of both isoforms from S1 to S3 but only isoform b at S4 (Fig. 4). The two transcript isoforms (a and b) of TRAES_2BS_CDE410A7D were both detected at S1, with only one isoform found at S2 and S3 and no isoform expressed at S4 (Additional file 9). This differentially regulated isoform expression pattern was also confirmed by RT-PCR (data not shown).

Finding of full-length transcripts for wheat gluten gene members

To test the utility of the 197,709 FLNC reads, we used this resource to search for full-length transcripts of the genes encoding three families of gluten proteins, i.e., HMW-GSs, LMW-GSs and gliadins (see Introduction). A common characteristic of the three gluten gene families is the presence of multiple homoeologous and paralogous copies with high sequence similarity [34], and hence the construction of full-length transcripts for these genes using short SGS reads is prone to misassembly. In our test variety Xiaoyan 81, there are three homoeologous

[Figure 3. Panel a: transcript 2BS_5155291.1.1; corresponding CS contig 2BS_5155291 (Traes_2BS_D46E40C29, Traes_2BS_033FD1621, Traes_2BS_00DF01F06) with annotated transcripts Traes_2BS_D46E40C29.1, Traes_2BS_033FD1621.1, Traes_2BS_033FD1621.2 and Traes_2BS_00DF01F06.1; orthologous region in B. distachyon, Bradi1g21372.1. Panel b: transcript 1AL_3888283.1.2; corresponding CS contig 1AL_3888283 (Traes_1AL_2D4B01C64, Traes_1AL_6275047AA) with annotated transcripts Traes_1AL_2D4B01C64.3 and Traes_1AL_6275047AA.1; orthologous region in rice, Os05g0345400.02.]

Fig. 3 Analysis of representative transcripts spanning two or three Chinese Spring (CS) loci. a The transcript 2BS_5155291.1.1 and the three CS loci (Traes_2BS_D46E40C29, Traes_2BS_033FD1621 and Traes_2BS_00DF01F06) it covered. These loci are located on the CS contig 2BS_5155291, and the transcripts annotated for the three loci by the draft genome sequence are boxed in green. The bottom panel shows B. distachyon genomic region orthologous to 2BS_5155291. A single locus (Bradi1g21372.1) and a corresponding transcript (boxed in purple) are annotated for this B. distachyon genomic region (http://www.plantgdb.org/BdGDB/). b The PacBio transcript 1AL_3888283.1.2 and the two CS loci (Traes_1AL_2D4B01C64 and Traes_1AL_6275047AA) it covered. The two loci reside on the CS contig 1AL_3888283, and the representative transcripts annotated for them by the draft genome sequence are boxed in green. The bottom panel is the rice genomic region orthologous to 1AL_3888283. A single locus (Os05g0345400.02) and a corresponding transcript (boxed in purple) are annotated for this rice genomic region (http://www.plantgdb.org/OsGDB/). The transcripts in (a and b) are all shown as SMMs with exon (filled box) and intron (line between two neighboring exons) depicted. The diagrams are not scaled
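The cases shown in Fig. 3, where one PacBio transcript covers two or three annotated loci, can be found by intersecting transcript exons with annotated gene intervals on the same contig. The sketch below illustrates that test under simple assumptions: half-open integer coordinates, one strand, and a plain dictionary of loci. The locus IDs in the usage note come from the figure, but every coordinate is invented for illustration and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) on the contig, end exclusive


def loci_spanned(exons: List[Interval],
                 loci: Dict[str, Interval]) -> List[str]:
    """Annotated loci overlapped by at least one exon, in contig order."""
    hits = []
    for locus_id, (l_start, l_end) in loci.items():
        if any(e_start < l_end and l_start < e_end
               for e_start, e_end in exons):
            hits.append(locus_id)
    return sorted(hits, key=lambda locus_id: loci[locus_id][0])


def is_merge_candidate(exons: List[Interval],
                       loci: Dict[str, Interval]) -> bool:
    """True if the transcript overlaps two or more annotated loci."""
    return len(loci_spanned(exons, loci)) >= 2
```

As a usage example with made-up coordinates, placing Traes_2BS_D46E40C29 at (1000, 2000), Traes_2BS_033FD1621 at (2500, 3500) and Traes_2BS_00DF01F06 at (4000, 5000), a transcript with exons (1100, 1300), (2600, 2900) and (4200, 4800) overlaps all three loci and would be flagged. A flagged transcript is only a candidate for merging gene models; the study additionally compared such transcripts with B. distachyon and rice annotations.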

Table 2 Estimation of the genes expressed in unfertilized loci (Glu-A1/B1/D1) carrying six closely related, single- caryopses and developing grains exon HMW-GS genes, including five active members a Developmental stage (1Ax1, 1Bx14, 1By15, 1Dx2 and 1Dy12) and one pseudo- S1 S2 S3 S4 gene (1Ay) [35]. The A, B and D copies are homoeologs Total number of genes expressedb 50,650 42,444 44,547 37,369 whereas the x and y members are paralogs. In our search, Number of genes with coverage by 11,798 9452 10,626 8676 we detected 574 FLNC reads carrying complete HMW-GS PacBio sequencing identified transcripts ORF, and their corresponded well to six different full- Total number of PacBio sequencing 17,330 13,938 15,909 12,943 length HMW-GS transcripts (Additional file 10). The 1Ay identified transcripts detected at alleles in common wheat and related species are highly each stage similar, and have often been found to carry premature aS1, unfertilized caryopses; S2-S4, developing grains at 5, 15 or 25 days after anthesis stop codon in the coding region [43, 44]. Consistent with bJudged based on RPKM (reads per kilobase per million mapped reads) > 1 the past observation, we noticed that the 1Ay transcript Dong et al. BMC Genomics (2015) 16:1039 Page 8 of 13

sequence of Xiaoyan 81 was more than 99 % identical to two previously reported inactive 1Ay alleles, and that the three alleles all carried a premature stop codon at the same position of the coding region (Additional file 11).

The genes encoding LMW-GSs in common wheat come from three large homoeologous loci (Glu-A3/B3/D3) [34]. The number of LMW-GS genes in Xiaoyan 81 is still unknown, but its parental variety (Xiaoyan 54) has been found to carry 14 such genes (including 11 active and three pseudogene members) by a combined genomic and proteomic analysis [45]. Here, we found 139 FLNC reads carrying complete LMW-GS ORF, and they represented 14 distinct full-length LMW-GS transcripts (including 12 with intact ORF and two with disrupted coding region, Additional file 10).

The genes encoding common wheat gliadins, contained mainly in two sets of compound homoeologous loci (Gli-A1/B1/D1 and Gli-A2/B2/D2), are exceedingly complex, and have been divided into three main subfamilies according to their protein products (α/β-, γ- or ω-gliadins) [34]. However, the exact numbers of genes coding for α/β-, γ- and ω-gliadins are still unclear in common wheat. Here, we detected 263 FLNC reads harboring complete α/β-gliadin gene ORF, and they identified 32 unique full-length transcripts for α/β-gliadins (including 25 with intact ORF and seven with disrupted coding region, Additional file 10). A total of 208 FLNC reads carrying complete γ-gliadin gene ORF were detected, and they represented 14 distinct full-length transcripts for γ-gliadins (12 with intact ORF and two with disrupted coding region, Additional file 10). This finding agreed with the previous annotation of 13 γ-gliadin coding genes for CS [46]. Lastly, 27 FLNC reads carrying complete ω-gliadin gene ORF were scored, and they corresponded to six different full-length transcripts for ω-gliadins, four of which had intact ORF (Additional file 10). The finding of six unique full-length transcripts for ω-gliadins here was consistent with the identification of five to seven ω-gliadin proteins in the grains of American and British common wheat cultivars [47, 48]. In total, we found 72 non-redundant full-length transcripts for HMW-GSs, LMW-GSs and gliadins (Additional file 10). The total positive FLNC reads found in this search were 1577, 1211 of which contained complete ORF (though some of them were disrupted). Thus, the proportion of FLNC reads with complete gluten gene ORF was 76.8 %.

Fig. 4 Analysis of the two different transcript isoforms of TRAES_1DS_114C78BF4 by RT-PCR. a A diagram showing the exon (filled box)-intron (line in between filled boxes) patterns of the two isoforms (designated as a and b, respectively). Arrows mark the positions of the primers (FP, RP1 and RP2) used for specifically amplifying each of the two isoforms. The length of the amplicon (bp) is indicated for each isoform (565 bp for isoform a with FP and RP1; 1415 bp for isoform b with FP and RP2). b The result of amplifying isoforms a and b by RT-PCR in the caryopsis samples of four developmental stages (S1 - S4). Amplification of the common wheat actin gene (GenBank accession AB181991) served as internal control for normalizing cDNA content prior to PCR amplification. S1, unfertilized caryopses; S2-S4, developing grains collected at 5, 15 or 25 DAA. The data displayed are typical of three independent experiments

Discussion
In this work, we applied PacBio sequencing to investigate transcripts in the unfertilized caryopses and developing grains of common wheat. Following the latest methodologies in analyzing PacBio transcriptome data [10–12, 20, 36], we obtained 197,709 error-corrected FLNC reads, 91,881 of which were found to be of high quality. The new resource and transcriptional information gathered and their values for improving common wheat draft genome annotation and grain transcriptome research are discussed below.

Utility of PacBio transcriptome sequencing for obtaining full-length transcript sequence information in plants
Full-length transcript sequence information is very useful for both genome annotation and gene function studies in plants. However, it is often difficult to obtain such information efficiently using traditional cDNA cloning and sequencing approaches [5–7]. Here we suggest that PacBio sequencing is an effective route for obtaining reliable full-length transcript sequence information in plants, particularly for polyploid species like common wheat. This suggestion is supported by the following evidence. First, about 74.6 % of the FLNC reads generated in this work were found to carry complete ORF when compared to several thousand full-length cDNAs published previously. Second, by searching the 197,709 FLNC reads, we identified full-length transcripts for 72 transcribed gluten gene members belonging to three complex gene families, and the proportion of FLNC reads with complete gluten gene ORF was 76.8 %. Third, our PacBio sequencing correctly captured the transcripts derived from a number of pseudogenes, such

as 1Ay, Glu-D3-5 and the inactive gliadin gene members (Additional file 10). The sequencing of RT-PCR amplicons using the PacBio platform in a previous study had also identified the transcripts from a number of gluten pseudogenes [18]. Together, these data provide good support for the reliability of the full-length nucleotide sequence information obtained through PacBio transcriptome sequencing. The self-correction through subread comparison and the additional correction using HiSeq transcriptomic reads have both contributed to yielding reliable nucleotide sequence data in PacBio sequencing. Finally, the construction of PacBio CCSs and FLNC reads completely avoided the need to assemble short transcriptomic reads. This advantage has enabled us to obtain and differentiate the full-length transcripts of three families of gluten genes with highly similar homoeologous and paralogous members (Additional file 10). Because the expression of homoeologous gene sets with nearly identical gene members is a fundamental characteristic of polyploid plants, PacBio sequencing should be particularly valuable for transcriptome studies of these species.

The high capacity of PacBio transcriptome sequencing to generate full-length transcript sequence information may well be related to its long-read property. In agreement with this reasoning, we found that the 9591 transcripts newly discovered by our PacBio sequencing were on average more than 45 bp longer than the known transcripts of 13,177 existent loci. Previous transcriptome studies have also reported that PacBio sequencing represented an efficient strategy for identifying full-length and relatively long transcript sequences in human cells [10–12].

Improvements on common wheat draft genome annotation
The large and complex hexaploid genome makes it difficult to develop a complete and high-quality reference genome sequence for common wheat in a short time. The available draft genome sequence of CS represents an important aid to common wheat structural and functional genomics investigations [49], and continuous refinement of the draft sequence with transcriptome data should enhance its utility. In this work, we annotated 3026 new chromosomal loci in the draft genome contigs based on newly gathered transcript evidence (Additional file 6). More than 80.4 % of the newly annotated genes had homologs in various databases, and thus represent a useful addition to the gene complement of the draft genome sequence. We found 9591 new transcripts (Additional file 7), which not only enrich the transcriptional information of the draft genome sequence but also are useful for functional studies of important genes in further research (see also below). Two other sets of data with direct implications for future revision of the draft genome sequence are 1) the finding of 290 G3 FLNC reads each mapped to two draft genome contigs from the same chromosome arm (Additional file 4, Fig. 1b) and 2) the observation of 180 transcripts each spanning two or three previously annotated gene models (Additional file 8, Fig. 3). The 290 FLNC reads should be useful for refining chromosomal contig assembly, whereas the 180 transcripts can assist more detailed annotations of the concerned chromosomal loci.

In addition to the improvements discussed above, the 15,352 G2 FLNC reads with three patterns of multiple genome mapping (Additional file 3) are potentially useful for annotating highly similar homoeologous or paralogous genes of common wheat. The FLNC reads in G4 and G5 may aid future annotation of the chromosomal loci not covered by the current draft genome sequence. There is also a possibility that some of the FLNC reads in G4 and G5 may come from divergent variety-specific genes since the elite common wheat variety used here (i.e., Xiaoyan 81) may differ from the land race CS used for constructing the draft genome sequence. These variety-specific genes may help genetic analysis of the traits unique to Xiaoyan 81 in the future.

An interesting observation in this work was that subgenome B seemed to host more of the 3026 newly annotated loci than subgenomes A and D (Fig. 2). This may be related to the fact that the chromosomes of subgenome B are generally larger in size than those of the A and D subgenomes in common wheat [24, 50].

Implications for further studies on common wheat grain transcriptome
Several past studies have investigated common wheat transcriptome using HiSeq technology, with substantial insights gained into the number of genes expressed, potential existence of subgenome dominance, and major biological processes operating in the developing grains [33, 51, 52]. Compared to previous studies, our work is unique in using PacBio sequencing to characterize the transcripts in the unfertilized caryopses and developing grains. The newly found chromosomal loci and transcripts, as outlined and discussed above, will contribute positively to further studies on common wheat grain transcriptome. Also valuable is the list of the 6030 genes showing differentially regulated transcriptional pattern during caryopsis development, because many of them have functionally important homologs in model plants (Additional file 9). Moreover, we demonstrated that the differentially regulated isoform expression pattern computed for these genes could be verified using two representatives (Fig. 4). Therefore, the 6030 genes might provide some useful clues for studying the involvement of alternative splicing in regulating common wheat grain development. However, we acknowledge that the identification of this set of differentially expressed genes is

preliminary because the transcripts yielded by our PacBio sequencing are limited in number, and the differentially expressed transcripts were judged based on only their presence or absence at the four developmental stages. A detailed identification of the differentially expressed genes based on statistical comparison of their transcript levels at different wheat grain developmental stages is an important target for our future study.

The finding of full-length transcript sequence information for 72 transcribed gluten gene members should be of practical value for more systematically dissecting and improving the function of gluten proteins in common wheat. Of special interest is the identification of full-length transcripts for 52 gliadin gene members, since they are involved in controlling both the processing and nutritional qualities of common wheat, and yet the expression and function of individual gliadin proteins remain poorly understood [53]. With the available full-length transcript information, it is now possible to conduct a detailed proteogenomic analysis to accurately establish the correspondence between gliadin proteins and their coding genes. The resultant data should speed up the identification of functionally important gliadin members, thus aiding the enhancement of common wheat end-use and nutritional qualities through appropriate molecular breeding strategies.

Conclusions
The data described and discussed above suggest that our PacBio transcript sequencing has generated novel resources and information with positive implications for common wheat genome annotation and gene function research. Clearly, PacBio transcript sequencing can facilitate the annotation and functional studies of complex plant genomes. Our study, together with those published previously [16–20], may help to stimulate more intensive application of PacBio sequencing in plant transcriptome research.

Methods
Plant material
Xiaoyan 81 was cultivated in the field as described previously [35]. After heading, the plants were inspected regularly to record the timing of anthesis and grain development. Unfertilized caryopses were collected from 10 different main stem spikes approximately 2 days before anthesis. The grain samples were similarly collected at 5, 15 and 25 DAA, respectively. Total RNA samples were isolated from unfertilized caryopses and developing grains using a commercial kit (Takara Biotechnology, Dalian, China). The purified RNA was dissolved in RNase-free water, with genomic DNA contamination removed using TURBO DNase I (Promega, Beijing, China). The integrity of the RNA thus prepared was determined with the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, California). Only the total RNA samples with RIN value ≥ 8 were used for constructing the cDNA libraries in PacBio or HiSeq sequencing.

PacBio library construction and sequencing
Total RNA (10 μg) was reverse-transcribed into cDNA using the SMARTer PCR cDNA Synthesis Kit that has been optimized for preparing high-quality, full-length cDNAs (Takara Biotechnology, Dalian, China), followed by size fractionation using the BluePippin™ Size Selection System (Sage Science, Beverly, MA). Each SMRT bell library was constructed using 500 ng size-selected cDNA with the Pacific Biosciences DNA Template Prep Kit 2.0. The binding of SMRT bell templates to polymerases was conducted using the DNA/Polymerase Binding Kit P5 and v2 primers. Sequencing was carried out on the Pacific Biosciences RS II platform using C3 reagents with 120 min movies.

Illumina library construction and sequencing
HiSeq libraries were prepared using the Illumina TruSeq RNA sample Prep kit. Briefly, fragmentation buffer was added to break mRNA into fragments of 200–700 nucleotides. The resultant mRNA fragments were used as templates to synthesize first strand cDNA. After second strand cDNA synthesis, the fragments with suitable size were gel-purified and amplified by PCR. The PCR products were sequenced using Illumina HiSeq 2000.

Subread processing and error correction
Effective subreads were obtained using the P_Fetch and P_Filter function (parameters: miniLength = 50, readScore = 0.75, artifact = −1000) in the SMRT Analysis software suite (http://www.pacificbiosciences.com/devnet/). CCS was obtained from the P_CCS module using the parameters MinCompletePasses = 2 and MinPredictedAccuracy = 0. After examining for poly(A) signal and 5′ and 3′ adaptors, only the CCS with all three signals was considered as a FLNC read [20]. Unmerged subreads were also examined for the three signals, and those with three signals were incorporated into the final FLNC read set. Additional nucleotide errors in FLNC reads were corrected using the Illumina RNA-seq data with the software proovread [36].

Mapping of FLNC reads to CS draft genome sequence
The error-corrected FLNC reads were mapped to the draft genome sequence of CS using GMAP as described previously [38]. We used the no-chimera setting to ensure mapping on the same contig as much as possible. The best mapped locus was chosen for each FLNC read based on both identity and coverage values. The genome mapping results of FLNC reads were visualized using the

Integrative Genomics Viewer [54]. The proportion of FLNC reads carrying complete ORF was evaluated using 5495 wheat full-length cDNAs (downloaded from http://trifldb.psc.riken.jp/v3/index.pl) as reference and by mapping to CS draft genome sequence with the aid of GMAP.

Verification of splice junctions in G1 FLNC reads
The FLNC reads were aligned to CS draft genome sequence using GMAP, and clustered by locus. The splice junction sites were then obtained with an in-house perl script. In the meantime, the HiSeq transcriptomic reads were also mapped to the draft genome sequence of CS, with the splice junctions identified using TopHat2 [55]. The splice junctions revealed by TopHat2 were compared to those in the FLNC reads to determine if the junction sequence in the concerned FLNC read was supported by HiSeq reads.

Finding the FLNC read showing cross-contig mapping
The FLNC read exhibiting cross-contig mapping was found using two criteria, 1) positive alignment (with identity and coverage values both higher than 90 %) within 600 bp of CS contig ends, and 2) the orthologs of the concerned contigs were contiguous in the Dt (Ae. tauschii) or Au (T. urartu) reference genomes.

Discovering the transcript spanning two or more CS chromosomal loci
The transcript that covered two or more loci annotated by CS draft genome sequence was identified using the following standards. First, the transcript showed unique and positive mapping (with identity and coverage values both higher than 90 %) to the exons of the concerned CS loci, and these loci were located on the same contig. Second, the transcript could be reliably mapped (with identity and coverage values both higher than 80 %) to an orthologous genomic region of B. distachyon or rice that had only one gene locus annotated.

Analysis of the genes and their transcripts expressed during caryopsis development
The HiSeq transcriptomic data (Additional file 2) were used to calculate gene expression level and for computing the number of genes expressed at the four caryopsis developmental stages of Xiaoyan 81 (S1 to S4, Table 2). In brief, the four sets of clean reads from the unfertilized caryopses and the developing grain samples at 5, 15 and 25 DAA (Additional file 2) were each mapped to CS draft genome sequence using TopHat2 with default parameters. HTSeq-count was used to determine the reads mapped to individual genes [56]. The gene expression level, as measured by reads per kilobase per million mapped reads (RPKM), was calculated as total exon reads/(mapped reads in millions × exon length in kb) [56]. A gene was considered as expressed if its RPKM value was > 1. The number of genes with coverage by PacBio sequencing identified transcripts was calculated by comparing the gene sets expressed at S1 to S4 to the 22,768 unique transcripts (Table 1).

The genes covered by PacBio sequencing identified transcripts were divided into two classes, one with a single form of transcript (named as 'a') and the other with multiple transcript isoforms (designated alphabetically) (Additional file 9). The expression status of the different types of transcripts at S1 to S4 was checked by finding the HiSeq transcriptomic read(s) that were mapped to transcript-specific splice junction(s), with positive expression judged when the target junction was covered by ≥ 1 HiSeq read. In this way the 6030 genes with differentially expressed transcripts during caryopsis development were compiled (Additional file 9).

For verifying the expression patterns of the two transcript isoforms (a and b) of TRAES_1DS_114C78BF4 by RT-PCR, three nucleotide primers were designed, FP (5′-ACCACCACCACCTCATTCAA-3′), RP1 (5′-ACATCAAGGGGAGACATGGA-3′) and RP2 (5′-GCAACCCTTCTGTCATCCAC-3′). FP and RP1 were used for amplifying isoform a, whereas FP and RP2 were for isoform b. Total RNA samples from the unfertilized caryopses and developing grains collected at 5, 15 and 25 DAA were reverse-transcribed into cDNA as described above. The resultant cDNA samples were normalized by amplifying a common wheat actin gene (GenBank accession AB181991) as detailed previously [57]. Subsequently, the normalized cDNA samples were used for amplifying isoforms a and b, respectively. The PCR was carried out in 20 μl volume containing 10 mM dNTPs, 5 pmol of each primer, and 1 U Taq polymerase (TransGen Biotech, Beijing, China). The cycling parameters included 94 °C for 5 min and 30 cycles of 94 °C for 30 s, 58 °C for 30 s and 72 °C for 1 min, with a final extension at 72 °C for 10 min. The PCR products were separated in 1 % agarose gels. Three independent experiments were conducted with identical results obtained.

Searching full-length transcripts for gluten gene members
The 197,709 FLNC reads were employed to search for the full-length transcripts of the gluten gene members expressed in Xiaoyan 81 grains by local BlastN. The query sequences used for the search are indicated in Additional file 10. The alignment of three 1Ay alleles (one from Xiaoyan 81 and two from GenBank, Additional file 11) was carried out using Clustal Omega with default settings on the European Bioinformatics Institute website (http://www.ebi.ac.uk/Tools/msa/clustalo/).
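The RPKM definition given under "Analysis of the genes and their transcripts expressed during caryopsis development" can be illustrated with a short Python sketch. It is not the pipeline used in this study, where read counts came from HTSeq-count. The function names `rpkm` and `is_expressed` and the example numbers are hypothetical and serve only to show the arithmetic and the RPKM > 1 cut-off.

```python
def rpkm(exon_reads: int, mapped_reads: int, exon_length_bp: int) -> float:
    """RPKM = exon reads / (mapped reads in millions x exon length in kb)."""
    if mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped_reads and exon_length_bp must be positive")
    mapped_millions = mapped_reads / 1e6
    exon_length_kb = exon_length_bp / 1e3
    return exon_reads / (mapped_millions * exon_length_kb)


def is_expressed(rpkm_value: float, threshold: float = 1.0) -> bool:
    """A gene counts as expressed when its RPKM is strictly above the threshold."""
    return rpkm_value > threshold


# Hypothetical gene: 500 exon reads, 20 million mapped reads, 2500 bp of exon.
value = rpkm(exon_reads=500, mapped_reads=20_000_000, exon_length_bp=2500)
print(value, is_expressed(value))
```

The parentheses in the denominator matter: the read count is divided by the product of library size and exon length, so that genes of different lengths and samples of different sequencing depths become comparable.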

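The presence or absence rule used in the Methods for transcript isoforms (a transcript-specific splice junction covered by ≥ 1 HiSeq read counts as expressed) can be sketched in the same way. The read counts below are invented for illustration and are not data from this study. The helper `varies_across_stages` encodes one possible reading of "differentially expressed", namely present at some stages and absent at others. That reading is an assumption, since the exact rule is not spelled out in the text.

```python
from typing import Dict, Tuple

STAGES = ("S1", "S2", "S3", "S4")


def presence_pattern(junction_reads: Dict[str, int], min_reads: int = 1) -> Tuple[bool, ...]:
    """Presence call per stage: True if the junction has at least min_reads reads."""
    return tuple(junction_reads.get(stage, 0) >= min_reads for stage in STAGES)


def varies_across_stages(pattern: Tuple[bool, ...]) -> bool:
    # Assumption, not stated in the Methods: an isoform is flagged when it is
    # detected at one or more stages but not at all four.
    return any(pattern) and not all(pattern)


# Hypothetical junction read counts for two isoforms of one gene.
isoform_a = {"S1": 12, "S2": 7, "S3": 3, "S4": 5}
isoform_b = {"S1": 0, "S2": 0, "S3": 4, "S4": 9}

pattern_a = presence_pattern(isoform_a)
pattern_b = presence_pattern(isoform_b)
print(pattern_a, varies_across_stages(pattern_a))
print(pattern_b, varies_across_stages(pattern_b))
```

A call based on a single supporting read is sensitive to sequencing depth, which is one reason the Discussion describes the resulting gene list as preliminary.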
Availability of supporting data
The 197,709 FLNC reads and the HiSeq transcriptomic reads generated in this study have been submitted to the BioProject database of National Center for Biotechnology Information (accession number PRJNA285723).

Additional files
Additional file 1: Summary of the cDNA libraries sequenced by PacBio RSII and the full-length non-chimeric reads obtained. (DOCX 40 kb)
Additional file 2: Summary of Illumina HiSeq 2000 transcriptomic reads obtained for unfertilized wheat caryopses and the grains at 5, 15 and 25 days after anthesis (DAA). (DOCX 39 kb)
Additional file 3: Analysis of the G2 FLNC reads showing multiple best alignments in the draft genome sequence of common wheat. (XLSX 799 kb)
Additional file 4: List of 290 FLNC reads showing cross-contig mapping. (XLS 40 kb)
Additional file 5: Mapping data of G4 and G5 FLNC reads in the Au genome of T. urartu and Dt genome of Ae. tauschii. (DOCX 39 kb)
Additional file 6: List of 3026 newly discovered loci and their annotations. (XLS 266 kb)
Additional file 7: Summary of 22,768 unique transcripts represented by 91,881 high-quality G1 FLNC reads. (XLS 1122 kb)
Additional file 8: List of 180 transcripts each spanning two or three separately annotated gene models in Chinese Spring draft genome sequence. (XLS 28 kb)
Additional file 9: List of 6030 genes with differentially expressed transcripts during caryopsis development. (XLSX 546 kb)
Additional file 10: Finding of full-length gluten gene transcripts and their representative FLNC reads. (DOCX 26 kb)
Additional file 11: Comparison of coding region nucleotide sequence of three 1Ay alleles. The 1Ay sequence of Xiaoyan 81 (represented by 1Ay_XY81) was based on the representative FLNC read identified for 1Ay in this work (Additional file 10). The sequences of the GenBank accessions AY260548 and AY303766 are two 1Ay alleles from tetraploid durum wheat and hexaploid spelta wheat, respectively. Asterisks indicate identical nucleotides. The premature stop codons in the coding region of the three sequences are boxed in red. (DOCX 1265 kb)

Abbreviations
BAC: Bacterial artificial chromosome; bp: Base pair; CCS: Circular consensus sequence; CLR: Continuous long read; CS: Chinese Spring; DAA: Day after anthesis; FLNC: Full-length non-chimeric; GO: Gene ontology; HMW-GS: High-molecular-weight glutenin; kb: Kilobase; KEGG: Kyoto encyclopedia of genes and genomes; KOG: Eukaryotic orthologous group; LMW-GS: Low-molecular-weight glutenin; NR: Non-redundant; ORF: Open reading frame; PacBio: Pacific Biosciences; RIN: RNA integrity number; RPKM: Reads per kilobase per million mapped reads; SGS: Second generation sequencing; SMM: Split-mapped molecule; SMRT: Single-molecule real-time.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
DW and LD designed the research. LD, JZ and SY performed the research. LD, HL, NC, GK and JC analyzed the data. LD, DW and NC wrote the paper. All authors read and approved the final manuscript.

Acknowledgements
This study was supported by the Ministry of Science and Technology of China (grants 2012AA10A308 and 2013CB127703), the National Natural Science Foundation of China (project 31471483) and the State Key Laboratory of Plant Cell and Chromosome Engineering (grant PCCE-TD-2012-02). Research in the laboratory of NC was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). We thank Aisi Fu (Wuhan Institute of Biotechnology, Wuhan, China) for assistance in using the PacBio RSII platform.

Author details
1The State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing 100101, China. 2Frasergen, Wuhan East Lake High-tech Zone, Wuhan 430075, China. 3University of Chinese Academy of Sciences, Beijing 100049, China. 4School of Pharmaceutical Sciences, Wuhan University, Wuhan 430071, China. 5School of Life Science and Technology, Huazhong Agricultural University, Wuhan 430075, China. 6Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada. 7The Collaborative Innovation Center for Grain Crops, Henan Agricultural University, Zhengzhou 450002, China.

Received: 24 June 2015 Accepted: 30 November 2015

References
1. Hamilton JP, Buell CR. Advances in plant genome sequencing. Plant J. 2012;70(1):177–90.
2. Michael TP, Jackson S. The first 50 plant genomes. Plant Genome. 2013;6. doi:10.3835/plantgenome2013.03.0001in.
3. Schliesky S, Gowik U, Weber AP, Brautigam A. RNA-Seq assembly - are we there yet? Front Plant Sci. 2012;3:220.
4. Li B, Fillmore N, Bai Y, Collins M, Thomson JA, Stewart R, et al. Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol. 2014;15(12):553.
5. Ogihara Y, Mochida K, Kawaura K, Murai K, Seki M, Kamiya A, et al. Construction of a full-length cDNA library from young spikelets of hexaploid wheat and its characterization by large-scale sequencing of expressed sequence tags. Genes Genet Syst. 2004;79(4):227–32.
6. Seki M, Satou M, Sakurai T, Akiyama K, Iida K, Ishida J, et al. RIKEN Arabidopsis full-length (RAFL) cDNA and its applications for expression profiling under abiotic stress conditions. J Exp Bot. 2004;55(395):213–23.
7. Soderlund C, Descour A, Kudrna D, Bomhoff M, Boyd L, Currie J, et al. Sequencing, mapping, and analysis of 27,455 maize full-length cDNAs. PLoS Genet. 2009;5(11):e1000740.
8. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323(5910):133–8.
9. Roberts RJ, Carneiro MO, Schatz MC. The advantages of SMRT sequencing. Genome Biol. 2013;14(7):405.
10. Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA, et al. Characterization of the human ESC transcriptome by hybrid sequencing. Proc Natl Acad Sci U S A. 2013;110(50):E4821–30.
11. Sharon D, Tilgner H, Grubert F, Snyder M. A single-molecule long-read survey of the human transcriptome. Nat Biotechnol. 2013;31(11):1009–14.
12. Treutlein B, Gokce O, Quake SR, Sudhof TC. Cartography of neurexin alternative splicing mapped by single-molecule long-read mRNA sequencing. Proc Natl Acad Sci U S A. 2014;111(13):E1291–9.
13. Au KF, Underwood JG, Lee L, Wong WH. Improving PacBio long read accuracy by short read alignment. PLoS One. 2012;7(10):e46679.
14. Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012;30(7):693–700.
15. Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014;30(24):3506–14.
16. Gross SM, Martin JA, Simpson J, Abraham-Juarez MJ, Wang Z, Visel A. De novo transcriptome assembly of drought tolerant CAM plants, Agave deserti and Agave tequilana. BMC Genomics. 2013;14:563.
17. Martin JA, Johnson NV, Gross SM, Schnable J, Meng X, Wang M, et al. A near complete snapshot of the Zea mays seedling transcriptome revealed from ultra-deep sequencing. Sci Rep. 2014;4:4519.
18. Zhang W, Ciclitira P, Messing J. PacBio sequencing of gene families - a case study with wheat gluten genes. Gene. 2014;533(2):541–6.
19. Xu Z, Peters RJ, Weirather J, Luo H, Liao B, Zhang X, et al. Full-length transcriptome sequences and splice variants obtained by a combination of sequencing platforms applied to different root tissues of Salvia miltiorrhiza and tanshinone biosynthesis. Plant J. 2015;82(6):951–61.

20. Minoche AE, Dohm JC, Schneider J, Holtgrawe D, Viehover P, Montfort M, et al. Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol. 2015;16:184.
21. Langridge P. Genomics: decoding our daily bread. Nature. 2012;491:678–80.
22. Shewry PR. Wheat. J Exp Bot. 2009;60(6):1537–53.
23. Brenchley R, Spannagl M, Pfeifer M, Barker GL, D'Amore R, Allen AM, et al. Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature. 2012;491(7426):705–10.
24. International Wheat Genome Sequencing Consortium. A chromosome-based draft sequence of the hexaploid bread wheat (Triticum aestivum) genome. Science. 2014;345(6194):1251788.
25. Lucas SJ, Akpinar BA, Simkova H, Kubalakova M, Dolezel J, Budak H. Next-generation sequencing of flow-sorted wheat chromosome 5D reveals lineage-specific translocations and widespread gene duplications. BMC Genomics. 2014;15:1080.
26. Helguera M, Rivarola M, Clavijo B, Martis MM, Vanzetti LS, Gonzalez S, et al. New insights into the wheat chromosome 4D structure and virtual gene order, revealed by survey pyrosequencing. Plant Sci. 2015;233:200–12.
27. Pingault L, Choulet F, Alberti A, Glover N, Wincker P, Feuillet C, et al. Deep transcriptome sequencing provides new insights into the structural and functional organization of the wheat genome. Genome Biol. 2015;16:29.
28. Salse J, Chague V, Bolot S, Magdelenat G, Huneau C, Pont C, et al. New insights into the origin of the B genome of hexaploid wheat: evolutionary relationships at the SPA genomic region with the S genome of the diploid relative Aegilops speltoides. BMC Genomics. 2008;9:555.
29. Marcussen T, Sandve SR, Heier L, Spannagl M, Pfeifer M, Jakobsen KS, et al. Ancient hybridizations among the ancestral genomes of bread wheat. Science. 2014;345(6194):1250092.
30. Ling HQ, Zhao S, Liu D, Wang J, Sun H, Zhang C, et al. Draft genome of the wheat A-genome progenitor Triticum urartu. Nature. 2013;496(7443):87–90.
31. Jia J, Zhao S, Kong X, Li Y, Zhao G, He W, et al. Aegilops tauschii draft genome sequence reveals a gene repertoire for wheat adaptation. Nature. 2013;496(7443):91–5.
32. Drea S, Leader DJ, Arnold BC, Shaw P, Dolan L, Doonan JH. Systematic spatial analysis of gene expression during wheat caryopsis development. Plant Cell. 2005;17(8):2172–85.
33. Pfeifer M, Kugler KG, Sandve SR, Zhan B, Rudi H, Hvidsten TR, et al. Genome interplay in the grain transcriptome of hexaploid bread wheat. Science. 2014;345(6194):1250091.
34. Shewry PR, Halford NG, Lafiandra D. Genetics of wheat gluten proteins. Adv Genet. 2003;49:111–84.
35. Yang Y, Li S, Zhang K, Dong Z, Li Y, An X, et al. Efficient isolation of ion beam-induced mutants for homoeologous loci in common wheat and comparison of the contributions of Glu-1 loci to gluten functionality. Theor Appl Genet. 2014;127(2):359–72.
36. Hackl T, Hedrich R, Schultz J, Forster F. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics. 2014;30(21):3004–11.
37. Mochida K, Yoshida T, Sakurai T, Ogihara Y, Shinozaki K. TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics. Plant Physiol. 2009;150(3):1135–46.
38. Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21(9):1859–75.
39. Filichkin SA, Priest HD, Givan SA, Shen R, Bryant DW, Fox SE, et al. Genome-wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010;20(1):45–58.
40. Marquez Y, Brown JW, Simpson C, Barta A, Kalyna M. Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis. Genome Res. 2012;22(6):1184–95.
41. Shen Y, Zhou Z, Wang Z, Li W, Fang C, Wu M, et al. Global dissection of alternative splicing in paleopolyploid soybean. Plant Cell. 2014;26(3):996–1008.
42. Luo MC, Gu YQ, You FM, Deal KR, Ma Y, Hu Y, et al. A 4-gigabase physical map unlocks the structure and evolution of the complex genome of Aegilops tauschii, the wheat D-genome progenitor. Proc Natl Acad Sci U S A. 2013;110(19):7940–5.
43. Forde J, Malpica JM, Halford NG, Shewry PR, Anderson OD, Greene FC, et al. The nucleotide sequence of a HMW glutenin subunit gene located on chromosome 1A of wheat (Triticum aestivum L.). Nucleic Acids Res. 1985;13(19):6817–32.
44. Jiang QT, Wei YM, Wang F, Wang JR, Yan ZH, Zheng YL. Characterization and comparative analysis of HMW glutenin 1Ay alleles with differential expressions. BMC Plant Biol. 2009;9:16.
45. Dong L, Zhang X, Liu D, Fan H, Sun J, Zhang Z, et al. New insights into the organization, recombination, expression and functional mechanism of low molecular weight glutenin subunit genes in bread wheat. PLoS One. 2010;5(10):e13548.
46. Anderson OD, Huo N, Gu YQ. The gene space in wheat: the complete γ-gliadin gene family from the wheat cultivar Chinese Spring. Funct Integr Genomics. 2013;13(2):261–73.
47. Dupont FM, Vensel WH, Tanaka CK, Hurkman WJ, Altenbach SB. Deciphering the complexities of the wheat flour proteome using quantitative two-dimensional electrophoresis, three proteases and tandem mass spectrometry. Proteome Sci. 2011;9:10.
48. Wan Y, Gritsch CS, Hawkesford MJ, Shewry PR. Effects of nitrogen nutrition on the synthesis and deposition of the omega-gliadins of wheat. Ann Bot. 2014;113(4):607–15.
49. Wang M, Wang S, Xia G. From genome to gene: a new epoch for wheat research? Trends Plant Sci. 2015;20(6):380–7.
50. Safar J, Simkova H, Kubalakova M, Cihalikova J, Suchankova P, Bartos J, et al. Development of chromosome-specific BAC resources for genomics of bread wheat. Cytogenet Genome Res. 2010;129(1–3):211–23.
51. Pont C, Murat F, Confolent C, Balzergue S, Salse J. RNA-seq in grain unveils fate of neo- and paleopolyploidization events in bread wheat (Triticum aestivum L.). Genome Biol. 2011;12(12):R119.
52. Li HZ, Gao X, Li XY, Chen QJ, Dong J, Zhao WC. Evaluation of assembly strategies using RNA-seq data associated with grain development of wheat (Triticum aestivum L.). PLoS One. 2013;8(12):e83530.
53. Rasheed A, Xia X, Yan Y, Appels R, Mahmood T, He Z. Wheat seed storage proteins: advances in molecular genetics, diversity and breeding applications. J Cereal Sci. 2014;60(1):11–24.
54. Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat Biotechnol. 2011;29(1):24–6.
55. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
56. Anders S, Pyl PT, Huber W. HTSeq–a Python framework to work with high-throughput sequencing data. Bioinformatics. 2015;31(2):166–9.
57. Wang GF, Wei X, Fan R, Zhou H, Wang X, Yu C, et al. Molecular analysis of common wheat genes encoding three types of cytosolic heat shock protein 90 (Hsp90): functional involvement of cytosolic Hsp90s in the control of wheat seedling growth and disease resistance. New Phytol. 2011;191(2):418–31.
