In Silico Signature Prediction Modeling in Cytolethal Distending Toxin-Producing Escherichia coli Strains
eISSN 2234-0742
Genomics Inform 2017;15(2):69-80
Genomics & Informatics
https://doi.org/10.5808/GI.2017.15.2.69

ORIGINAL ARTICLE

Maryam Javadi, Mana Oloomi*, Saeid Bouzari
Department of Molecular Biology, Pasteur Institute of Iran, Tehran 13164, Iran

In this study, the genomes of cytolethal distending toxin (CDT)-producing isolates were compared with the genomes of pathogenic and commensal Escherichia coli strains. Conserved genomic signatures among different types of CDT-producing E. coli strains were assessed. It was shown that they could be used as biomarkers for research purposes and clinical diagnosis by polymerase chain reaction, or in vaccine development. cdt genes and several other genetic biomarkers were identified as signature sequences in CDT-producing strains. The identified signatures include several individual phage proteins (holins, nucleases, terminases, and transferases) and multiple members of different protein families (the lambda family, phage-integrase family, phage-tail tape protein family, putative membrane proteins, regulatory proteins, restriction-modification system proteins, tail fiber-assembly proteins, base plate-assembly proteins, and other prophage tail-related proteins). A sporadic phylogenic pattern was demonstrated in the CDT-producing strains. In conclusion, conserved signature proteins in a wide range of pathogenic bacterial strains can potentially be used in modern vaccine-design strategies.

Keywords: biomarkers, cytolethal distending toxin, genomic signature, multiple alignments, pathogenic Escherichia coli

Received April 28, 2017; Revised May 9, 2017; Accepted May 9, 2017
*Corresponding author: Tel: +98-21-66953311-20, Fax: +98-21-66492619, E-mail: [email protected]
Copyright © 2017 by the Korea Genome Organization. It is identical to the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/).

Introduction

The co-evolution of pathogenic bacteria and their hosts leads to the generation of functional pathogen-host interfaces. Well-adapted pathogens have evolved a variety of strategies for manipulating host cell functions to guarantee their successive colonization and survival. For instance, a group of gram-negative bacterial pathogens produces a toxin known as cytolethal distending toxin (CDT) [1]. Among the CDT producers is Escherichia coli, which is commonly found in the intestines of humans and other mammals. Most E. coli strains are harmless commensals; however, some isolates can cause severe diseases and are designated as pathogenic E. coli. Among the various pathogenic E. coli strains, some have acquired virulence determinants through the horizontal transfer of genes, such as the cdt genes encoding CDTs. CDTs were the first bacterial toxins identified that block the eukaryotic cell cycle and suppress cell proliferation, eventually resulting in cell death. The active subunits of CDT toxins exhibit features of type I deoxyribonuclease-like activity [2, 3].

In this study, comparative genome analysis of CDT-producing E. coli isolates with other pathogenic and commensal strains was performed. Alignments between multiple genomes led to the identification of a set of distinct ("signature") sequence motifs. These signature sequences could be used to delineate single genomes or a specified group of associated genomes within a desired group, such as the CDT-producing E. coli (the target group in this study). Genomic signatures were conserved in the target group, whereas they were not conserved or were absent in other related or unrelated genomes (i.e., the background group). From a clinical point of view, conserved signature sequences could offer advantages in predicting and further designing novel CDT inhibitors and vaccine candidates [4].

Phylogenic trees, on the other hand, can be constructed based on multiple sequence alignments. Importantly, phylogenies based on a large number of genes and whole-genome sequences are more reliable than those based on a single gene or a few selected loci [5]. Phylogenic analysis can provide an overall classification of the target group among the background group. Alignment of whole-genome sequences yields detailed information on specific differences between genomes and, consequently, has provided new insights into phylogenetic relationships in recent years [6-9].

In this study, phylogenic relationships of CDT+ strains with other pathogenic and commensal E. coli strains were assessed, and conserved signature genomic regions in the target group (CDT producers) were annotated. This information could be used for developing molecular diagnostic assays, for polymerase chain reaction primer and probe design, and in modern vaccine design.

Methods

CDT+ strains

Several databases were used to identify bacterial strains harboring cdt genes. Data were extracted from the following resources: NCBI, National Center for Biotechnology Information GenBank; EMBL, European Molecular Biology Laboratory; DDBJ, DNA Data Bank of Japan; PDB, Protein Data Bank; RefSeq, NCBI Reference Sequence Database; and UniProtKB, Swiss-Prot Database.

Whole-genome sequences

All genomes analyzed in this study were downloaded from the NCBI file transfer protocol (FTP) site at ftp://ftp.ncbi.nih.gov/genomes.

Reordering of draft genomes

Ordering and orienting contigs in draft genomes facilitates comparative genome analysis. Contig ordering can be predicted by comparison with a reference genome that is expected to have a conserved genome organization [10]. ProgressiveMauve (version 2.3.1) was used for ordering contigs in draft genomes. Mauve contig mover (MCM) offers advantages over methods that rely on matches in limited regions near the ends of contigs [11, 12]. The E. coli K-12 MG1655 strain (accession No. NC_000913.3) was used as the reference genome.

The following MCM optional parameters were used in this study: default seed weight (15), use seed families, determine Locally Collinear Blocks (LCBs), full alignment, iterative refinement, sum-of-pairs LCB scoring, and min LCB weight: 200.

Multiple genome alignments

In this study, Gegenees software (version 2.2.1) was used for multiple-genome alignments. The software is written in Java, making it compatible with several platforms. No limitations were observed in calculation speed, or in the number and memory requirements of the genomes that could be aligned. Gegenees software is also capable of performing fragmented alignments [4]. Multiple alignments of E. coli genomes were created using a fragment size of 200 nucleotides, a step size of 100 nucleotides, and BLASTN optimized for highly similar sequences.

Phylogenic tree construction

A phylogram was produced in SplitsTree 4, using the neighbor-joining method and a distance matrix Nexus file exported from Gegenees software [13]. Escherichia albertii TW07627 and Escherichia fergusonii ATCC 35469 were set as the outgroups.

Identifying conserved signatures

CDT-producing isolates were set as the target group, and all other strains were used as the background group, by using the in-group setting tab in Gegenees software. Because of the genomic diversity in CDT-producing E. coli, we repeated this procedure with five different strains (E. coli 53638, E. coli IHE3034, E. coli RN587/1, E. coli STEC B2F1, and E. coli STEC C165-02), each defined as a separate reference strain. The biomarker score (max/average) setting was also used. Biomarker scores were drawn graphically and loaded into the tabular view for further data analysis. In the tabular view, a score of 1.0 is the maximum biomarker score, and fragments with this score were considered signatures.

Assembling signature fragments

Several overlapping fragments were obtained, based on the sequences of each reference strain. To facilitate subsequent analysis steps, the overlapping fragments were assembled using DNA Dragon software, version 1.6.0 (http://www.dna-dragon.com/).

The settings were designed with minimum overlaps (100 bases) along the diagonal length, a minimum %-identity of complete overlapping fragments, and 100% full-search parameters.

BLAST

The sequences for each of the five reference strains were searched with NCBI BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify putative protein domains. Putative conserved domains were also detected. The results were confirmed using the UniProtKB BLASTX program (http://www.uniprot.org/blast/).

Results

Strains

The sequences of 76 strains were downloaded from the NCBI site. Details regarding genome sizes, %GC content,

Table 1. Strains characteristics

Strain | DNA length (Mb) | cdt gene | GC% | Protein count | Gene count | Genome type, No. of subsequences/contigs | Pathotype, serotype, other characteristic | Accession No.
Escherichia coli 96.0497 | 5.01426 | + | 50.80 | 4,862 | 5,026 | Draft, 13 | Host: Homo sapiens, O91:H21 | NZ_AEZQ00000000.2
Escherichia coli 3003 | 4.91733 | + | 50.7 | 4,825 | 4,982 | Draft, 8 | I.S: water, O157:H45 | NZ_AFAF00000000.2
Escherichia coli 5412 | 5.38651 | + | 50.20 | 5,670 | 5,761 | Draft, 373 | Host: Homo sapiens, SFO157 | NZ_AMUJ00000000.1
Escherichia coli 53638 | 5.37179 | + | 50.99 | 4,803 | 5,218 | Draft, 2 | EIEC, O144 | NZ_AAKB00000000.2
Escherichia coli ARS4.2123 | 4.98276 | + | 50.50 | 5,105 | 5,194 | Draft, 209 | I.S: water, O157:H16 | NZ_AMUL00000000.1
Escherichia coli 5.4079 + 50.30 5,541 5,692 Draft, 93 Host: homo sapiens,