Download from Patterns I and J Are Calculated for the Same DNA But
Total Page:16
File Type:pdf, Size:1020Kb
APPLICATION OF GENOME LINGUISTIC APPROACHES FOR IDENTIFICATION OF GENOMIC ISLAND IN BACTERIAL GENOMES AND TRACKING DOWN THEIR ORIGINS Genome Linguistics to Visualize Horizontal Gene Exchange Oliver Bezuidt, Kingdom Mncube and Oleg N. Reva University of Pretoria, Department Biochemistry, Bioinformatics and Computational Biology Unit 0002, Pretoria, South Africa Keywords: Genome linguistics, k-Mer statistics, Oligonucleotide usage pattern, Genomic island, Horizontal gene exchange, Pathogenicity. Abstract: With more sequences of complete bacterial genomes getting public availability the approaches of genome comparison by frequencies of oligonucleotides (k-mers) known also as the genome linguistics are becoming popular and practical to resolve problems which can not be tackled by the traditional sequence comparison tools. In this work we present several innovative approaches based on k-mer statistics for detection of inserts of genomic islands and tracing down the ontological links and origins of mobile genetic elements. 637 bacterial genomes were analyzed by SeqWord Sniffer program that has detected 2,622 putative genomic islands. These genomic islands were clustered by DNA compositional similarity. A stratigraphic analysis was introduced that allows distinguishing between new and old genomic inserts. A method of reconstruction of donor-recipient relations between micro-organisms was proposed. The strain E. coli TY- 2,482 isolated from the latest deadly outbreak of a haemorrhagic infection in Europe in 2011 was used for the case study. It was shown that this strain appeared on an intersection of two independent fluxes of horizontal gene exchange, one of which is a conventional for Enterobacteria stream of vectors generated in marine gamma-Proteobacteria; and the second is a new channel of antibiotic resistance genomic islands originated from environmental beta-Proteobacteria. 1 INTRODUCTION in many instances falsely predicted. Multiple genes which are of great importance for mobility of Recurrent outbreaks of pathogens armed with new plasmids and phages quickly become a wreck after virulence factors and broad range antibiotic integration into chromosomes due to multiple resistance gene cassettes demonstrate our ignorance mutations and fragmentations. on the principles of horizontal gene exchange and Bacterial species are variable in their overall GC the evolution of pathogenic bacteria. Ontology and content but the genes in genomes of particular phylogeny of laterally transferred genetic elements species are fairly uniform with respect to their base are difficult to investigate, let alone the predictions composition patterns and frequencies of of their insertion sites in hosts chromosomes. DNA oligonucleotides. DNA compositional comparison of and protein sequence similarity comparison by blast bacterial genomes showed that horizontally acquired is generally used to track the origins of genomic genes display features that are distinct from those of islands (GIs) but this approach has many limitations. their recipient genomes (Van Passel, 2006). Based Horizontally transferred genes are highly mutable on these observations we developed and utilized and similarities in their DNA sequences quickly several innovative genome linguistics approaches to disappear. Protein sequence similarity reflects improve GI detection in bacterial genomes. They are mostly functional conservations rather than the practical also for reconstraction of the ontological phylogenetic relations between GIs. Moreover, the links and donor-resipient relations between majority of genes found in GIs are hypothetical and microorganisms and their mobilomes. Bezuidt O., Mncube K. and N. Reva O.. 118 APPLICATION OF GENOME LINGUISTIC APPROACHES FOR IDENTIFICATION OF GENOMIC ISLAND IN BACTERIAL GENOMES AND TRACKING DOWN THEIR ORIGINS - Genome Linguistics to Visualize Horizontal Gene Exchange. DOI: 10.5220/0003704201180123 In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 118-123 ISBN: 978-989-8425-90-4 Copyright c 2012 SCITEPRESS (Science and Technology Publications, Lda.) APPLICATION OF GENOME LINGUISTIC APPROACHES FOR IDENTIFICATION OF GENOMIC ISLAND IN BACTERIAL GENOMES AND TRACKING DOWN THEIR ORIGINS - Genome Linguistics to Visualize Horizontal Gene Exchange 2 RESULTS Relative variance of an OU pattern (RV) was calculated by the following equation: 4 N 2.1 Oligonucleotide Usage Pattern 2 w w (3) Concept RV N 4 1 0 2 Statistical parameters of frequencies of k-mers in where N is the word length; Δ w is the square of a natural DNA sequences and a definition of the word w count deviation (see equation 1); and σ0 is oligonucleotide usage (OU) pattern were given in the expected standard deviation of the word our previous publications (Reva, 2005); (Ganesan, distribution in a randomly generated sequence which 2008). OU pattern was denoted as a matrix of depends on the sequence length (Lseq) and the word deviations [1…N] of observed from expected counts length (N): for all possible permutations of oligonucleotides 4 N 0 0.02 (4) (words) of length N. In this work tetranucleotides L were used, thus N = 256. Words are distributed in seq sequences logarithmically and the deviations of their GRV is a particular case of RV when expected counts frequencies from expectations may be found as of words are calculated based on the frequencies of follows: nucleotides estimated for the complete genome. C 2 C 2 C 2 1...N |obs 1...N |e 1...N |0 2.2 Identification and Clustering of ln 2 2 2 C C C Genomic Islands 1...N |e 1...N |obs 1...N |0 (1) w 1...N 6 2 2 ln C C 1 1...N |0 1...N |e Each genome may be characterized by a unique pattern of frequencies of oligonucleotides (Abe, where n is any nucleotide A, T, G or C in the N- 2003); (Reva, 2005;). Foreign DNA inserts retain long word; C[1…N]|obs is the observed count of a OU patterns of the genomes of origin. Comparison word [1…N]; C[1…N]|e is its expected count and of OU patterns of genomic fragments against the C[1…N]|0 is a standard count estimated from the entire genome OU pattern reveals areas with assumption of an equal distribution of words in the alternative DNA compositions. An algorithm for the -N sequence: C[1…N]|0 = Lseq 4 . In this work identification of horizontally transferred genomic C[1…N]|e frequencies were calculated by using 0- elements by superimposition of OU statistical order Markov model, i.e. by normalization to the parameters has been introduced in our previous frequencies of nucleotides. publications (Reva, 2005); (Ganesan, 2008). This The distance D between two patterns was algorithm calculates and superimposes the four OU calculated as the sum of absolute subtractions of pattern parameters discussed above: D – distance ranks of identical words after ordering of the words between local and global OU patterns; RV and GRV by [1…N] values (equation 1) in patterns i and j as variances; and PS. Horizontally transferred GIs are follows: characterized by a significant pattern deviation 4 N (large D), significant increase in GRV associated rank rank D w,i w, j min with decreased RV and a moderately increased PS D(%) 100 w (2) D D (Ganesan, 2008). Exploitation of all these parameter max min allows the discrimination of the putative mobile Application of ranks instead of relative genomic elements from other genomic loci with oligonucleotide frequencies made the comparison of alternative DNA compositions, namely: multiple OU patterns less biased to the sequence length tandem repeats, clusters of genes for ribosomal provided that the sequences are longer 5 kbp for a proteins and ribosomal RNA, giant genes, etc (Reva, reliable statistics of tetranucleotide frequencies 2005). To facilitate a large scale analysis of bacterial (Reva, 2005). genomes, a Python utility SeqWord Sniffer was Pattern skew (PS) is a particular case of D where developed. It is available for download from patterns i and j are calculated for the same DNA but www.bi.up.ac.za/SeqWord/sniffer/. for direct and reversed strands, respectively. Sniffer was compared to other available tools of N N Dmax = 4 (4 – 1)/2 and Dmin = 0 when calculating GI identification: IslandPick, SIGI-HMM and N a D, or, in a case of PS calculation, Dmin = 4 if N is IslandPath (Langille, 2009). The rate of false N N an odd number, or Dmin = 4 – 2 if N is an even negative predictions was determined by testing the number due to the presence of palindromic words. capacities of different programs to predict known 119 BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms patogenicity GIs from PAI DB (http://www.gem. hypothesized that the comparison of GIs of the same re.kr/paidb/about_paidb.php). The rate of false origin which are distributed in organisms that share positive predictions was estimated by counting the similar genomic OU patterns would result in numbers of GIs predicted by one program which different D-values relative to the time of their were not confirmed by other programs. To optimize acquisition. In Fig. 2 the differences in D-values are the Sniffer’s program run settings the factorial depicted by grey colour gradients. The darker colour experiment was used. Results of program gradients depict GIs which have been acquired comparison are shown in Table 1. recently, while the lighter colours depict ancient acquisitions. Table 1: Comparison of different programs for GI identification. Program False negative