Leveraging Multiple Transcriptome Assembly Methods for Improved Gene Structure Annotation

bioRxiv preprint doi: https://doi.org/10.1101/216994; this version posted December 21, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Venturini et al. SOFTWARE Leveraging multiple transcriptome assembly methods for improved gene structure annotation Luca Venturini1, Shabhonam Caim1,2, Gemy G Kaithakottil1, Daniel L Mapleson1 and David Swarbreck1* mental stages and conditions, aiding the annotation of transcription start sites, exons, alternative splice vari- ants and polyadenylation sites. Currently, one of the most commonly used technol- Abstract ogy for RNA-Seq is Illumina sequencing, which is char- The performance of RNA-Seq aligners and acterised by extremely high throughput and relatively assemblers varies greatly across different organisms short read lengths. Since its introduction, numerous and experiments, and often the optimal approach algorithms have been proposed to analyse its output. is not known beforehand. Here we show that the Many of these tools focus on the problem of assigning accuracy of transcript reconstruction can be reads to known genes to infer their abundance [1{4], or boosted by combining multiple methods, and we of aligning them to their genomic locus of origin [5{7]. present a novel algorithm to integrate multiple Another challenging task is the reconstruction of the RNA-Seq assemblies into a coherent transcript original sequence and genomic structure of transcripts annotation. Our algorithm can remove directly from sequencing data. Many approaches de- redundancies and select the best transcript models veloped for this purpose leverage genomic alignments according to user-specified metrics, while solving [8{11], although there are alternatives based instead common artefacts such as erroneous transcript on de novo assembly [9, 12, 13]. While these methods chimerisms. We have implemented this method in focus on how to analyse a single dataset, related re- an open-source Python3 and Cython program, search has examined how to integrate assemblies from Mikado, available at multiple samples. While some researchers advocate for https://github.com/lucventurini/Mikado. merging together reads from multiple samples and as- Keywords: RNA-Seq; transcriptome; assembly; sembling them jointly [9], others have developed meth- genome annotation ods to integrate multiple assemblies into a single coherent annotation [8, 14]. The availability of multiple methods has generated interest in understanding the relative merits of each Background approach [15{17]. The correct reconstruction of tran- The annotation of eukaryotic genomes is typically a scripts is often hampered by the presence of multiple complex process which integrates multiple sources of isoforms at each locus and the extreme variability of extrinsic evidence to guide gene predictions. Improve- expression levels, and therefore in sequencing depth, ments and cost reductions in the field of nucleic acid within and across gene loci. This variability also af- sequencing now make it feasible to generate a genome fects the correct identification of transcription start assembly and to obtain deep transcriptome data even and end sites, as sequencing depth typical drops near for non-model organisms. However, for many of these the terminal ends of transcripts. The issue is partic- species often there are only minimal EST and cDNA ularly severe in compact genomes, where genes are resources and limited availability of proteins from clustered within small intergenic distances. Further, closely related species. In these cases, transcriptome the presence of tandemly duplicated genes can lead to data from high-throughput RNA sequencing (RNA- Seq) provides a vital source of evidence to aid gene alignment artefacts that then result in multiple genes structure annotation. A detailed map of the transcrip- being incorrectly reconstructed as a fused transcript. tome can be built from a range of tissues, develop- As observed in a comparison performed by the RGASP consortium [18], the accuracy of each tool depends on *Correspondence: [email protected] how it corrects for each of these potential sources of 1Earlham Institute, Norwich Research Park, NR4 7UZ Norwich, United Kingdom errors. However, it also depends on other external fac- Full list of author information is available at the end of the article tors such as the quality of the input sequencing data bioRxiv preprint doi: https://doi.org/10.1101/216994; this version posted December 21, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Venturini et al. Page 2 of 17 as well as on species-dependent characteristics, such previous RGASP evaluation, we performed our tests as intron sizes and the extent of alternative splicing. It on the three metazoan species of Caenhorabditis el- has also been observed that no single method consis- egans, Drosophila melanogaster and Homo sapiens, tently delivers the most accurate transcript set when using RNA-Seq data from that study as input. We tested across different species. Therefore, none of them also added to the panel a plant species, Arabidopsis can be determined a priori as the most appropriate for thaliana, to assess the performance of these tools on a given experiment [19]. These considerations are an a non-metazoan genome. Each of these species has important concern in the design of genome annota- undergone extensive manual curation to refine gene tion pipelines, as transcript assemblies are a common structures, and moreover, these annotations exhibit component of evidence guided approaches that inte- very different gene characteristics in terms of their grate data from multiple sources (e.g. cDNAs, protein proportion of single exon genes, average intron lengths or whole genome alignments). The quality and com- and number of annotated transcripts per gene (Sup- pleteness of the assembled transcript set can therefore plementary Table ST1). Similar to previous studies substantially impact on downstream annotation. [18, 24], we based our initial assessment on real rather Following these studies, various approaches have than simulated data, to ensure we captured the true been proposed to determine the best assembly using characteristics of RNA-Seq data. Prediction perfor- multiple measures of assembly quality [19, 20] or to mance was benchmarked against the subset of anno- integrate RNA-Seq assemblies generated by compet- tated transcripts with all exons and introns (minimum ing methods [21{23]. In this study we show that alter- 1X coverage) identified by at least one of the two RNA- native methods not only have different strengths and Seq aligners. weaknesses, but that they also often complement each The number of transcripts assembled varied sub- other by correctly reconstructing different subsets of stantially across methods, with StringTie and Trin- transcripts. Therefore, methods that are not the best ity generally reconstructing a greater number of tran- overall might nonetheless be capable of outperforming scripts (Supplementary Figure SF1). Assembly with the \best" method for a sub-set of loci. An annotation Trinity was performed using the genome guided de- project typically integrates datasets from a range of novo method, where RNA-Seq reads are first parti- tissues or conditions, or may utilise public data gener- tioned into loci ahead of de-novo assembly. This ap- ated with different technologies (e.g. Illumina, PacBio) proach is in contrast to the genome guided approaches or sequencing characteristics (e.g. read length, strand employed by the other assemblers that allow small specificity, ribo-depletion); in such cases, it is not un- drops in read coverage to be bridged and enable the common to produce at least one set of transcript as- exclusion of retained introns and other lowly expressed semblies for each of the different sources of data, as- fragments. As expected Trinity annotated more frag- semblies which then need to be reconciled. To ad- mented loci, with a higher proportion of mono-exonic dress these challenges, we developed MIKADO, an ap- genes (Supplementary Figure SF1). proach to integrate transcript assemblies. The tool de- Accuracy of transcript reconstruction was measured fines loci, scores transcripts, determines a representa- using recall and precision. For any given feature (nu- tive transcript for each locus, and finally returns a set cleotide, exon, transcript, gene), recall is defined as the of gene models filtered to individual requirements, for percentage of correctly predicted features out of all ex- example removing transcripts that are chimeric, frag- pressed reference features, whereas precision is defined mented or with short or disrupted coding sequences. as the percentage of all features that correctly match Our approach was shown to outperform both stand- a feature present in the reference. In line with previ- alone methods and those that combine assemblies, by ous evaluations, we found that accuracy varied consid- returning more transcripts reconstructed correctly and erably among methods, with clear trade-offs between less chimeric and unannotated genes. recall and precision (Supplementary Figure SF2). For instance, CLASS2 emerged as the most precise

Load more