Long-Read Sequencing of Trypanosoma Cruzi
Total Page:16
File Type:pdf, Size:1020Kb
RESEARCH ARTICLE Berna et al., Microbial Genomics 2018;4 DOI 10.1099/mgen.0.000177 Expanding an expanded genome: long-read sequencing of Trypanosoma cruzi Luisa Berna, 1 Matias Rodriguez,2 María Laura Chiribao,1,3 Adriana Parodi-Talice,1,4 Sebastian Pita,1,4 Gastón Rijo,1 Fernando Alvarez-Valin2,* and Carlos Robello1,3,* Abstract Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (the abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require high degrees of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated with T. cruzi’s genome since they permit direct determination of the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (TcVI) and the non-hybrid Dm28c (TcI), determined by PacBio Single Molecular Real-Time (SMRT) technology. The improved assemblies herein obtained permitted us to accurately estimate gene copy numbers, abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a ‘core compartment’ and a ‘disruptive compartment’ which exhibit opposite GC content and gene composition. Novel tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of surface multigene families, mucins and trans-sialidases allows now a better overview of these complex groups of genes. DATA SUMMARY cruzi genomes; three species considered to be representative of the parasitic diversity of the group [1–3]. The new infor- The genome assemblies and annotation of TCC and Dm28c have been deposited in GenBank with the accession num- mation obtained from these genomes opened a new era that bers: PRJNA432753 and PRJNA433042, respectively. Raw allowed researchers to conduct studies with unprecedented sequences of TCC and Dm28c have been deposited in comprehensiveness. Several new types of analyses, such as GenBank with the accession numbers SRP134374 and large-scale comparative genomics, aimed at understanding SRP134013, respectively. Supporting Data tables are avail- the common evolutionary basis of parasitism and pathogen- able on Figshare data: https://figshare.com/s/38da1c- esis, or searching for new vaccine candidates and drug de5667299b3a1e. We confirm that all supporting data, code targets [4, 5] were made possible. These three genomes, and protocols have been provided within the article or however, were published with very different degrees of fin- through supplementary data files. ishing. The genome of L. major was precisely assembled and annotated. The T. brucei genome was also of very good quality but was mostly focused on megabase chromosomes, INTRODUCTION whereas minichromosomes were not included. Conversely, The year 2005 represents a landmark in the study of trypa- the assembly of the T. cruzi genome (CL Brener strain) was nosomatid biology with the simultaneous publication of the extremely fragmented (4098 contigs) with very few contigs Leishmania major, Trypanosoma brucei and Trypanosoma exceeding 100 kb in size. In fact, only 12 contigs met this Received 19 February 2018; Accepted 3 April 2018 Author affiliations: 1Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay; 2Sección Biomatematica - Unidad de Genómica Evolutiva, Facultad de Ciencias-UDELAR, Montevideo, Uruguay; 3Departamento de Bioquímica, Facultad de Medicina-UDELAR, Montevideo, Uruguay; 4Sección Genetica, Facultad de Ciencias-UDELAR, Montevideo, Uruguay. *Correspondence: Fernando Alvarez-Valin, [email protected]; Carlos Robello, [email protected] Keywords: Trypanosoma cruzi; PacBio; whole genome sequencing; Chagas disease. Abbreviations: HSP, high-scoring pairs; SMRT, Single Molecular Real-Time. Data statement: All supporting data, code and protocols have been provided within the article or through supplementary data files. Five supplementary figures and five supplementary tables are available with the online version of this article. 000177 ã 2018 The Authors Downloaded from www.microbiologyresearch.org by This is an open access article under the terms of the http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited. IP: 164.73.80.128 1 On: Fri, 12 Jul 2019 20:33:37 Berna et al., Microbial Genomics 2018;4 threshold and none was larger than 150 kb [3]. Despite this degree of fragmentation, this draft genome was highly valu- IMPACT STATEMENT able because it led to the identification of several novel spe- We present the assembled and annotated genomes of cies-specific multigene families and gave, for the first time, a two Trypanosoma cruzi clones, the hybrid TCC and the draft overview of the genome architecture of T. cruzi. This non-hybrid Dm28c, obtained using PacBio technology. draft, also made it possible to identify the vast majority of The complications associated with this genome (assem- conserved trypanosomatid genes, as indicated by the fact bly fragmentation and collapse of repetitive sequences) that very few genes conserved between Leishmania and were basically solved. These improved assemblies per- T. brucei were not found in the draft T. cruzi genome [4]. mitted us not only to accurately estimate copy numbers Most likely this can be attributed to the fact that genes are of tandemly arrayed genes and multigene families but short in trypanosomatids (since they lack introns), hence also allowed the unambiguous identification of many sin- even moderately small contigs can contain complete coding gle-copy genes. Reliable information on the latter is sequences (CDS). The genome sequences of additional T. required to conduct functional studies based on gene cruzi strains Dm28c [6], Sylvio X10/1 [7], and T. c. marin- knock-out, a fundamental source of information to deter- kellei [8] have been reported since this initial publication mine which genes are essential and to find new drug tar- using new sequencing technologies, such as Roche 454 alone gets for Chagas disease.We found that the genome of T. or in combinations with other methodologies (Illumina and cruzi is compartmentalized in a isochore-like manner, Sanger). A common feature of all the assemblies reported ‘ ’ ‘ for these strains is again high fragmentation, which was containing a core compartment and a disruptive com- ’ even higher than that originally reported for CL Brener. partment , which exhibit significant differences in GC con- tent and gene composition, the former being GC-poorer To tackle the problem of assembly fragmentation, Weath- and composed of conserved genes and the latter erly et al. [9] used complementary sources of data to scaffold enriched in trans-sialidases (TS), mucins and mucin- T. cruzi (CL Brener strain) contigs, aiming to recover full- associated surface protein (MASP) genes.Additionally, length chromosome sequences. Their approach used a com- many homologous chromosomes were separately bined strategy based on sequencing bacterial artificial chro- assembled (haplotypes), and some homologous recombi- mosome (BAC) ends, as well as their co-location on nation events could be identified. The availability of high- chromosomes and synteny conservation with T. brucei and quality genomes opens new possibilities to conduct anal- Leishmania chromosomes. This strategy enabled these ysis on this parasite that require high degrees of preci- authors not only to obtain an assembly that represented a sion of genomic data (e.g. allelic exclusion, population substantial improvement in comparison to previous ver- genomics). sions of the genome but also to reconstruct chromosomes (with a few gaps), with only a relatively minor portion of the genome remaining unassigned to chromosomes. Nonethe- less, the issue of assembly fragmentation is a limitation that malariae [11]. Furthermore, this technology allows obtain- poses a number of complications for diverse types of analy- ing separated assemblies of homologous chromosomes, such ses that require high precision. that haplotypes can be retrieved as separated contigs/scaf- folds instead of a unique mosaic sequence. This feature is Third-generation sequencing technologies are particularly particularly desirable in genomes with a high level of hetero- suitable for addressing the challenges associated with the zygosity, as in the case of some T. cruzi hybrid strains that peculiarities of the T. cruzi’s genome since they allow have arisen from relatively divergent ancestors, in which the sequencing reads of more than 15 kb in average length, and divergence between haplotypes is higher than 5 % [3]. Need- many much longer can be obtained. This opens up the pos- less to say, a single mosaic assembly, representing a mixture sibility of directly determining the full sequence