ics om & B te i ro o P in f f o o r

l m

a Journal of

a

n

t

r

i Che et al., J Proteomics Bioinform 2014, 7:8 c u

s o J ISSN: 0974-276X Proteomics & Bioinformatics DOI: 10.4172/0974-276X.1000322

Research Article Article OpenOpen Access Access An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Dongsheng Che1*, Han Wang1, John Fazekas1 and Bernard Chen2 1Department of Computer Science, East Stroudsburg University, East Stroudsburg, PA 18301, USA 2Computer Science Department, University of Central Arkansas, Conway AR, 72034, USA

Abstract A genomic island (GI) is a genomic segment in a host , and it was transferred from donor genomes. Since genomic islands (GIs) usually contain important such as pathogenicity genes, the detection of GIs becomes extremely critical to medical research and industrial applications. Previous computational GI detection tools used one or a few GI-associate features, and thus they suffered the problem of low prediction accuracy. A systematic approach that uses multiple sources to improve GI prediction accuracy, therefore, is in great demand. In this paper, we report the development of Genomic Island Hunter (GIHunter), an accurate software tool for GI detection. GIHunter is a decision tree based bagging model that uses eight GI-associated features such as sequence composition, mobile information, and integrase. The performance metric comparison between our approach and other current existing GI prediction methods has shown that our approach is more accurate than other approaches. We have used GIHunter to predict GIs in more than 2000 prokaryotic genomes. We have also visualized our predicted GIs so that our predicted results become useful and meaningful for biomedical studies. Our GI program GIHunter can be obtained at: http://www.esu.edu/cpsc/che_lab/software/GIHunter. Our GI prediction results are available on our genomic islands database, which is: http://www5.esu.edu/cpsc/bioinfo/dgi.

Keywords: Genomic islands; Machine learning; Genomes; ; adjustment and selection, which is hard to perform and control because ; Computational methods it may lead to inconsistent selection criteria due to the unfamiliarity of different genome structures [8]. Introduction Sequence composition-based approach is based on the general Genomic Islands (GIs) are segments of genomic sequences that concept that different species have different genomic sequence were horizontally transferred from other bacterial species or signatures. Thus, if there is a genomic region that has special signature [1]. The driving forces of the forming of GIs in such genomes could different from that of the majority regions of the genome, this region be attributed to the adaptability to their ever changing surroundings, is considered to be a GI. Accordingly, this kind of approach enables us and thus the contents of GIs in the host genomes could be versatile [2]. to detect GIs within a genomic sequence without relying on reference For example, some GIs are pathogenic-related, while some others may genomes or manual selections, and thus it can be applied to any confer drug-resistance to bacterium. Some GIs contain genes which sequenced genome. could produce secondary metabolites, while some others degrade recalcitrant chemical products. The key of this kind of approach is to detect abnormal sequence compositions or signatures in the genome. Such signatures include It is extremely significant to identify the locations and contents codon usage [9], and G+C content [10]. Some GIs are also flanked of GIs because such findings will be a greatly benefit to biomedical by tRNA genes [11]. Furthermore, some others GIs could contain research. For instance, we can design and produce corresponding mobile genes such as integrase and transposes [12]. Computational antibiotics based on the identification of pathogenic GIs [3]. On tools using this kind of approach include AlienHunter [13], Centroid the other hand, for those GIs containing beneficial secondary [14], IslandPath [15], PAI-IDA [16] and SIGI-HMM [17]. Recently, metabolites, we can produce them on a large scale [4]. Due to the some integrative approaches uses the prediction results of individual explosive growth of genomic sequence data, huge amounts of genomic programs to obtain census GIs. Available programs and tools include structure information need to be elucidated. Unfortunately, biological EGID [18], GIST [19] and IsalndViewer [20]. experiments may only contribute us a small fraction of information for GIs in all sequenced genomes. Therefore, it is urgent to develop Despite of the availability of existing GI prediction tools, the computational tools to identify all GIs. prediction accuracy is not as good as desirable. The causes of the low Currently, there are mainly two kinds of GIs detection approaches: comparative genomic-based approach, and sequence composition- based approach [5]. Comparative genomic-based approach compares *Corresponding author: Dongsheng Che, Department of Computer Science, the query gene sequence with its closely related species genomes by East Stroudsburg University, East Stroudsburg, PA 18301, USA, Tel: (570)422- 2731; E-mail: [email protected] aligning them together and considering the genome segments that are only present in the query genomic sequence to be GIs [6]. Some Received May 27, 2014; Accepted July 07, 2014; Published July 11, 2014 examples of tools based on this approach include IslandPick [6] and Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island MobilomeFinder [7]. While the detection of GIs using this kind of Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics approach is relatively reliable, it is limited to a small group of genomes. Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322 The reason is that not all genomes have an enough number of closely Copyright: © 2014 Che D, et al. This is an open-access article distributed under related genomes for references, and thus they cannot be analyzed the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and by using this approach. Furthermore, such tools also need manual source are credited

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 214 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

prediction accuracy are due to the amelioration of genomes, which islands is summarized in Figure 1. The detailed procedures are led to false negatives [9], or the existence of abnormal sequence presented in the following subsections. composition such as ribosome regions in the host genome, which leads to false positives. Therefore, it may not be sufficient to use only one or Dataset collection two features to determine a sequence is a GIs or not. Another problem The positive genomic islands (GIs) and negative genomic islands for GIs detection is the boundary of GIs. Some GIs only cover several (non-GIs) datasets used in this study were obtained from [6]. The kilo bases (kbs) while others can cover hundreds of kbs, which makes it dataset contains 771 GIs and 3770 non-GIs. They covered 118 genomes difficult to determine the size of GIs. in total, and they were used for building decision-tree based GI In this paper, we propose an accurate GI prediction approach that prediction model. incorporates multiple features, including: 1) sequence composition The sequence lengths of original GIs and non-GIs varied from 8 kb based feature, Interpolated Variable Order Motifs (IVOM); 2) mobile to 31 kb, making it less meaningful using GI-associated feature values gene information, Integrase, and transposases; 3) tRNA gene; 4) phage with different GI or non-GI sizes to build our prediction model. In information, and 5) two new features we found recently, intergenic order to resolve this problem, we extracted all genes within the training distance and highly expressed genes (HEGs). We use decision-tree GI and non-GI datasets. From this training dataset, we obtained 10,459 based ensemble learning to build the GI model based on these features. genes from GIs, and 44,909 genes from non-GIs, with the total of Additionally, in order to resolve the GI boundary issue, we treat each 55,368 genes. Since these genes were from 118 genomes, representative gene in GIs individually, and predict GI gene or non-GI gene by using specie genomes from the domains of Bacteria and Archaea. Thus, we feature values associated with its neighboring region, and then piece believe that the selected genes in the training set should not be overly contiguous GI genes together using our GI gene merging process to biased, and will not affect our model overall. predict GIs. The experimental results have shown that our approach has outperformed other GI tools in terms of prediction accuracy, To obtain the feature values for GIs / non-GIs genes in the training suggesting the power of our approach, and its potential use in the dataset (or all genes in a genome used for genomic island prediction), future GI prediction. we downloaded the corresponding complete genome sequences and the annotations from the National Center for Biotechnology Information Methods and Materials (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). Computational framework Feature value extraction Our computational framework for genomic island prediction was For each of 55,319 genes from GI and non-GI datasets, we divided into two main stages: (1) model construction for GI/non-GI calculated the GI-associated feature values. The calculation of feature gene classification, and (2) genome scale genomic island prediction. values for any gene was not based on the gene itself. Instead, it relied on the gene and its neighboring region. In this study, we selected the region The stage of building GI/non-GI gene classifier model consisted of the following steps:

1. Collecting training data and genomic sequence data. The training Model construction for Genome scale dataset of known genomic islands and non-genomic islands were GI/non-GI gene classification genomic island prediction collected for model construction. The genomic sequence was collected Training data + for feature value extraction. Genome sequence Genome sequence collection 2. Extracting genomic island-related features. Feature values for all collection GI and non-GI genes from the training data were computed.

3. Building a GI/non-GI gene classifier model. The feature values Feature value Feature value obtained in Step 2 were used to build a model for classifying any gene extraction in the extraction in the training dataset whole genome to be either GI or non-GI. The stage of genome scale genomic island prediction consisted of the following steps: Model for GI/non- Classification of all Gi gene GI/non-GI genes in 1. Collecting genomic sequence. The genomic sequence to be classification the genome analyzed was collected, and was used for feature value extraction.

2. Extracting genomic island-related features. Feature values for all Merge of all the genes in the genome were computed. GI/non-GI genes in the genome 3. Classifying GI/non-GI genes using the model built from Stage 1. Based on the GI/non-GI gene classifier model, all genes in the genome were classified to be a GI gene or non-GI gene. Predicted genomic islands 4. Predicting genomic islands based on classified genes. Our program merged the contiguous GI genes and predict them a genomic island. Figure 1: Computational framework for the genome scale genomic island prediction. The computational framework flowchart for genome scale genomic

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 215 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

length of 8 kb because the feature information of a-8kb-long region Phage can roughly reflect the existence of a genomic island or not. Figure 2 >0 illustrates a selected gene in a genomic island, and its’ corresponding NO YES Bagging IVOM Score IVOM Score region (8 kb) for calculating feature values. >10 >20 For any gene, we calculated eight features, including IVOM, NO YES YES Transp HEG Density, Integrase, Phage, tRNA, Highly Expressed Genes (HEGs), >2 >2 Average intergenic gene distance (AvgID), and Transposase, based DT1 DT2 DT3 NO YES NO YES on its surrounding regions (8 kb). The reason of choosing these eight features was due to their contributions to genomic island prediction in previous studies [8,21,22]. (A). Bagging model (B). Decision tree model We calculated each of eight feature values using the methods Figure 3: A decision-tree based bagging model for GI/non-GI gene summarized in Table 1. Specifically, we used Alien Hunter [13] to classification. (A)A decision-tree based bagging classifier;(B) An example of obtain IVOM score for the selected gene within the genomic segment decision tree based classifier, which is used in bagging in (A). The pie chart in information of 8 kb (Figure 2). For the features of tRNA, Phage, the bottom shows the probabilities of being GI gene, colored in “Blue” or non-GI gene, colored in “Red”. Integrase, and Transposase, we used regular expression based search on annotated genome files information to count how many genes within the 8kb region that have these features. For the feature of could improve differentiating non-GIs and GIs when adding these HEGs, we considered ribosomal protein (RP) genes, and features. The feature value difference on non-GIs and GIs in the transcriptional processing factor (TF) genes (e.g., RNA helicase, RNA training dataset is shown in Supplementary Figure 1, and they have polymerase, tRNA synthetase), chaperone degradation (CH) genes, been summarized in Table 1. and genes involved in energy metabolism (e.g., NADH) to be highly expressed genes [23], and counted the number of HEGs within the 8kb Model for classifying GI/non-GI genes region. Our genomic island gene model is a bootstrap aggregating (also For the feature of gene density, we counted the number of genes known as bagging) of base classifiers, in which each base classifier within the 8kb region, and thus gene density was defined as the number is a decision trees in this study. Bagging [24] is a method whose of genes per kb. For the feature of average inter- genic distance, we first classification takes the majority votes of multiple classifiers, with the measured inter-genic distances for each adjacent gene pair (gi, gi+1) same weight for each of base classifiers (Figure 3A). The Bagging model within the 8 kb region, and then averaged them to obtain AvgID. The used in this study consists of multiple decision trees as base classifiers features of density and AvgID may not be as effective as the features of in the “committee”. In this study, we used WEKA [25] to build our IVOM or mobility genes. Nevertheless, they have some differentiation decision tree based bagging model for GI gene prediction. power as shown in Supplementary Figure 1, and thus we believe they The training set used for constructing each decision tree was sampled by bootstrap sampling of 55,319 examples. The training examples are a set of tuples , where c is the class label, and x is GI the set of features. In this study, x is the set of eight features: IVOM, Density, Integrase, Phage, tRNA, HEG, AvgID, and Transposase, while c is the label of GI (1) or non-GI (0). A decision tree based GI/non-GI gene classifier used in bagging is illustrated in Figure 3B.

4 kb 4 kb The construction of a decision tree model was based on the ID3 algorithm [26], which was a top-down greedy search algorithm. It first Figure 2: An illustration of extracting feature values for a gene within a started with the whole training set, and then chose the best feature as GI. The box denotes a genomic island, where circles represent genes, and the black circle is the selected GI gene for feature value calculation. The feature the root node based on the information gain values. It then split the set value calculation is based on the dashed box, which is chosen by starting from based on the possible values of the selected best feature. The algorithm the center of the selected gene, extending to left by 4 kb, and then extending assigned the node to a leaf node when all instances in a subset had the to right by 4 kb. same classification, and labelled the same classification as the instances. For the mixed label classification, the algorithm continued to choose

Feature Relation with genomic islands Feature value the next best feature based on the subset of the training examples. This (GIs) obtained whole process was repeated until no further features were available. IVOM GIs have higher IVOM scores AlienHunter (i.e., sequence composition bias) In this study, we used J48 in WEKA, which is the Java tRNA GIs are flanked by RNA genes Functional annotation implementation of C4.8 algorithm [25]. C4.8 is the latest research Phage GIs may contain phage-related genes Functional annotation version for the C4.5 algorithm, which is one of the best-known and Integrase GIs may contain integrase genes Functional annotation most widely-used decision tree algorithms. C4.5 extends the ID3 Transposase GIs may contain transposase genes Functional annotation algorithm by addressing several important issues. The C4.5 algorithm Highly expressed GIs may not contain highly expressed Functional annotation handles numerical attributes and missing values, it also incorporates gene genes post-pruning process to handle the over-fitting problem. Gene density GIs have lower gene densities Genome sequence Genome scale genomic island prediction Average intergenic GIs have higher inter-genic gene Genome sequence distance distance The trained model built using J48 decision tree based ensemble Table 1: Summary of genomic island associated features. algorithms were used to classify all genes (i.e., GI gene or non-GI

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 216 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

gene) in any query genome. The inputs for GI prediction were: (a) the In particular, the known GI/non-GI dataset was evenly separated whole genome sequence; (b) the gene annotation of the genome. The into ten parts, and the first part was evaluated based on the model procedure of calculating feature values for all genes in the genome is trained from the remaining nine parts. This process continued until as same as that of described in the previous Section, i.e., 1) Finding all ten parts were evaluated. The overall classification accuracy was the the neighboring region corresponding to the selected gene; and 2) average of all ten separate evaluations. True positives (TP) were the Calculating each of eight feature values (IVOM, Density, Integrase, number of GI genes predicted to be GI genes. False negatives (FN) were Phage, tRNA, HEG, AvgID, and Transposase) for each selected region. the number of GI genes predicted to be non-GI genes. True Negatives (TN) were the number of non-GI genes predicted to be non-GI genes. Based on the feature values for each gene in the genome, we False positives (FP) were the number of non-GI genes predicted to be classified these genes into either GI genes (or non-GI genes) using our GI genes. The recall, precision, and classification accuracy were defined decision-tree based bagging model built from the training set. Such GI as follows: gene (or non-GI gene) unit information was further processed, and we TP obtained genomic islands through the following steps: Recall = (1) TP+ FN 1. Finding all GI segments based on the model. From the decision- tree based bagging model, each gene for the whole genome was classified TP into GI-genes or non-GI genes. We considered those contiguous GI Pr ecision = (2) genes in the genome to be a GI segment, and we found all GI segments. TP+ FP 2. Picking two neighboring GI segments, Gi and Gi+1, calculating TP+ TN the average probability value for the region spanning from the first Accuracy = (3) gene of Gi to the last gene of Gi+1, including those non- GI genes TP+++ TN FP FN between Gi and Gi+1. We merged smaller adjacent GI segments into Genomic island prediction accuracy a new GI segment when the average value was greater than a threshold value T. A larger GI segment of G2 in Figure 4 is the product of two We also compared our predicted GIs with the benchmark dataset. neighboring GI segments merged. The optimal threshold value T was We evaluated our predicted GIs at the nucleotide level. In particularly, determined by selecting a set of values between 0 and 1, running on the true positives (TP) were the number of nucleotides both in benchmark benchmark sets, and comparing performance metrics of predicted GIs GI dataset and in predicted GIs. True negatives (TN) were the number under different threshold settings. In this study, we found that T=0.6 of nucleotides both in benchmark non-GI dataset and predicted non- was the optimal threshold, and chose it as the cut-off value. GIs. False positives (FP) were the number of nucleotides in predicted GIs but not in benchmark GI dataset. False negatives (FN) were the 3. Repeating the process of Step 2 until non neighboring GI number of nucleotides in benchmark GI dataset but not in predicted segments could be merged; GIs. We focused on four performance measures, recall (See equation 4. Filtering out short GI segments. In this study, we removed those 1), precision (See equation 2), Performance Coefficient (PC) (See GI segments containing less than six genes. Figure 4 shows an example equation 4), and F-Measure (See equation 5). of a short GI segment between G1 and G2 removed. The remaining GI TP segments were considered to be final predicted genomic islands for the PC = (4) whole genome. TP++ FP FN We have developed our program, Genomic Island Hunter 2*Recall*Precision (GIHunter), for predicting genomic islands in the whole genome, as F −=Measure (5) described above. Recall+Precision Performance evaluation Results and Discussion We looked into the performance of our computational framework Classification accuracy on GI/non-GI genes from two aspects: a) The accuracy of the decision-tree bagging model for GI/non-GI gene classification, and b) The prediction accuracy of To evaluate the classification power of GI/non-GI genes, we applied predicted genomic islands. the decision tree J48 model on training datasets. We used default parameter settings, with the confidence factor of 0.25 and minimum Classification accuracy on GI/non-GI genes number of instance per leaf of 2. We have achieved the recall of 0.682, We used a ten-fold cross-validation scheme to evaluate the GI/ precision of 0.879, and the overall accuracy of 0.922. The classification non-GI gene classification accuracy on our ensemble learning model. results indicate that this decision tree model with eight features is quite specific (0.879), although it is not quite sensitive (0.682). As a comparison, we also applied four other classifiers, including Naïve Bayes, logistic classification and regression trees (CART), and Support Vector Machines (SVMs), to the same datasets. These G1 G2 G3 G4 algorithms with default parameter settings were run through WEKA. Figure 4: The merging process of combining GI genes into GIs. Each As shown in Figure 5, the classification powers of the other four bar represents a gene in the genome, with the bar height representing the algorithms were slightly lower than that of the decision tree when such probability of being a GI-gene, calculated from our bagging model. The GIs are basically the genomic regions with a cluster of adjacent GI genes as shown in three performance measures (recall, precision and accuracy) were used G1, G3 and G4. Some adjacent GI gene clusters can be extended to become for comparison. We also tested the J48 decision tree based bagging a single GI, as shown in G2. model in order to improve classification accuracy. Our classification

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 217 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

As we can see in Figure 7, our program GIHunter reaches the 1 highest recall (0.69), highest precision rate (0.91), highest PC (0.65), and 0.9 0.8 highest F-measure (0.78), indicating the improvement of prediction 0.7 accuracy of our method over previous approaches. The improvement 0.6 of prediction accuracy is due to the integration of multiple sources (i.e., Precision 0.5 multiple features). Our program combines the IVOM score, generated Recall 0.4 by AlienHunter, with other GI-related information such as phage, 0.3 Accuracy 0.2 integrase, transposase, tRNA, which were used in the programs of 0.1 IslandPath and PAI-IDA. In addition, we also include new information 0 such as HEGs, and inter-genic gene distance. Naive Bayes Logistic CART SVM J48 Decision Bagging tree Case Study: predicted genomic islands in Corynebacterium Figure 5: GI/non-GI gene classification accuracy with different machine learning algorithms. diphtheria NCTC13129 C. diphtheria NCTC13129 is a microbe that produces diphtheria toxin (DT), which causes the symptoms of diphtheria [28]. It was first 1 isolated from the Pharyngeal membrane of a 72-year-old female with clinical diphtheria [29]. The genome has been fully sequenced, with the size of 2,488,635bp. In this genome, we predicted fourteen genomic

0.8 1 0.9 0.8 0.7 Recall 0.6 0.6 0.5 Precision 0.4 nPc ROC:0.911 0.3 F-Measure 0.2

Sensitivity 0.1 0.4 0 Alien Hunter Island Path SIGI-HMM INDeGeNIUS PAI-IDA GIHunter

Figure 7: Performance comparison with different GI prediction tools. The performance measures include recall, precision, nPc and F-measure. GI 0.2 prediction tools include Alien hunter, IslandPath, SIGI-HMM, INDeGeNIUS, PAI-IDA and our program, GIHunter.

0 0 0.2 0.4 0.6 0.8 1 0 MB 1-Specificity GI_14 GI_13 GI_14 Figure 6: The ROC curve for decision tree model in GIHunter. GI_12 GI_2 GI_11 results showed that this bagging model performed better than a single GI_3 decision tree model, with the increase of 2.7% for recall, 0.6% for GI_10 0.5 MB precision, and 0.6% for overall accuracy (Figure 5). Figure 6 shows ROC curves for decision tree model, and a decision tree that was built by J48 algorithm on the training set is shown in Supplementary Figure 2. GI_9 Prediction accuracy on genomic islands GI_8 The measure of classification accuracy on GI/non-GI genes is GI_4 an indirect way to evaluate the performance of our decision-tree based bagging model. To truly measure the prediction power of our approach for predicting genomic islands, we directly compared our predicted genomic islands with the benchmark dataset. To do so, GI_5 1 MB we collected 118 prokaryotic genomes from the National Center for Biotechnology Information (NCBI) FTP server, ran our GIHunter 1.5 MB GI_7 GI_6 (http://www5.esu.edu/cpsc/bioinfo/software/GIHunter) program, and generated GI locations for each genome. We used genomic islands Figure 8: An example of predicted genomic islands in the genome of C. obtained by IslandPick [27] as benchmark to evaluate the predicted diphtheria NCTC13129. Each circle represents the feature values of tRNA GIs by GIHunter. We also collected predicted GI results of five (1), Phage (2), Integrase (3), Transposase (4), HEG (5), AvgID (6), Gene component programs including AlienHunter, IslandPath, SIGI-HMM, density (7), and IVOM (8). The most outer circle shows the predicted fourteen INDeGeNIUS, and PAI-IDA. GI regions.

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 218 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

Label Start / Stop Genes covered (total Supporting information number of genes) GI_1 21,381-390,836 DIP0021 → DIP0425 This region contains two integrase genes annotated as phage integrase, five phage genes annotated as phage bp (404) prohead protease, phage capsid protein, and phage tail fiber protein, ten tRNA genes and five transposon related genes encoding for IS element transposase. The overall IVOM score in this region is 18.19, significantly higher than that of in non-GI region (usually <10). GI_2 403,882 DIP0441 → DIP0452 (11) The IVOM score is 24.18, much higher than that of the rest of genomic region. This region contains one -421,292 bp transposon related gene. This genomic island was supported by the genome analysis in [17], where researchers considered the gene region from DIP0438 to DIP0445 to be a . GI_3 460,848 DIP0495 → DIP0521 (26) The high IVOM score (17.98) in this region indicates sequence composition is different from that of its neighbor -490,567 bp genomic regions. GI_4 726,536 DIP0750 → DIP0826 (76) This predicted GI region contains one phage integrase gene, one tRNA gene, and six transposon related genes -799,536 bp including transposase A and IS element transposase. IslandViewer also has one corresponding genomic island from DIP0806→DIP0822. Other research results also discovered two genomic islands, including DIP0752 → DIP0766 and DIP0795 → DIP0820, in this region GI_5 929,413 DIP0957 → DIP0987 (30) The overall IVOM score in this region is 18.64. IslandViewer also predicted the genomic region from DIP0960 to -953,582 bp DIP0966 to be a genomic island. GI_6 1,136,731 DIP01150 → DIP1158 (9) The main evidence of predicted GI is due to its very high IVOM score (24.81) -1,139,080 bp GI_7 1,160,398 DIP1168 → DIP1662 (494) This region contains nine tRNA genes, and four putative transposase genes. IslandViewer predicted three GIs -1,699,168 bp in this region, i.e., DIP1184 → DIP1188, DIP1444 → DIP1453, and DIP1617 → DIP1623. In a separate genome analysis, the region from DIP1645 to DIP1663 was considered as a pathogenicity island GI_8 1,768,327 DIP1728 → DIP1738 (10) The IVOM score in this region is 18.08, and there are no highly expressed genes found in this region. -1,781,812 bp GI_9 1,861,303 DIP1811 → DIP1857 (46) This region contains one putative phage integrase, one putative phage terminase protein, and three tRNA genes. -1,900,281 bp IslandViewer also predicted the genomic region from DIP1811 to DIP1841 to be a GI, and the region of DIP1817 → DIP1837 was reported to a pathogenicity island in [17]. GI_10 1,993,436 DIP1941 → DIP1971 (31) Four putative transposase genes, one Integrase gene and one tRNA gene are found in this predicted GI. The -2,020,291 bp IVOM score in this region is 20.36. Islandviewer predicted two genomic Islands within this region, i.e., DIP1944 → DIP1951, and DIP1960 → DIP1964. GI_11 2,043,495 DIP1993 → DIP2098 (105) This region contains three putative transposase genes including putative transposase for insertion element. The -2,149,289 bp overall IVOM score is 20.18. IslandViewer predicted two GIs, DIP2015 → DIP2022 and DIP2046 → DIP2050. In another genome analysis study, two regions, DIP2010 → DIP2015, and DIP2066 → DIP2093, were considered to be pathogenicity islands. GI_12 2,208,692 DIP2142 → DIP2163 (21) In this region, several GI related genes have been found, including putative bacteriophage holin protein, putative -2,237,426 bp DNA-binding bacteriophage protein, and tRNA gene. In a separate genome study, researcher considered the region of DIP2148 → DIP2168 to be a pathogenicity island. GI_13 2,242,597 DIP2169 → DIP2191 (22) The IVOM score in this region is 15.42, which implies that this region contains different sequence composition -2,275,566 bp from the rest of its host genome. GI_14 2,284,782 DIP2199 → DIP2245 (46) The IVOM score in this region is 21.85. IslandViewer detected two genomic islands in this region, i.e., DIP2204 -2,335,407 bp → DIP2209, and 2) DIP2236 → DIP2239. In addition, a separate genome analysis revealed that there was one pathogenicity island within this genomic region, which was from DIP2208 to DIP2234. Table 2: A list of fourteen predicted genomic islands in the genome of C. diphtheria NCTC13129. islands in total. The visualization view of fourteen genomic islands region is pathogenic related. Four pathogenicity islands in this region as well as eight feature values is shown in Figure 8. More detailed were detected, including DIP0180 → DIP0222, DIP0223 → DIP0244, information about fourteen genomic islands is described in Table 2. DIP0282 → DIP0287, and DIP0334 → DIP0357. All four pathogenicity islands are covered in our prediction result. Whether this genomic In general, these fourteen genomic islands have higher IVOM region is a single insertion event or separate insertion events needs to scores, as shown in the second outer circle in Figure 8. Many of these be further investigated. However, detailed studies on these predicted islands also contain phage integrase, transposase and tRNA genes. In GIs by various algorithms make us to believe that smaller adjacent addition, most of our predicted genomic islands overlapped with those GIs, such as DIP0180 → DIP0222 and DIP0223 → DIP0244, might be identified pathogenicity islands in previous studies [28]. For instance, a single GI. the first predicted genomic island GI_1 covers 404 genes from gene DIP0021 to gene DIP0425. This genomic island contains two integrase Database of predicted genomic islands genes, five phage genes, ten tRNA genes and five transposon related genes. The overall IVOM score in this region is 18.19, significantly We downloaded fully sequenced microbial genomes from the higher than that of in non-GI region (typically <10). NCBI web server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria), and ran our program GIHunter to predict genomic islands for fully sequenced We compared our predicted GI_1 with the predictions of genomes. All predicted genomic islands are available on our genomic IslandViewer [20], and we found that IslandViewer also considered islands database (http://www5.esu.edu/cpsc/bioinfo/dgi). By providing this region as a GI region. In IslandViewer, however, this region was predicted genomic islands, we hope that microbiologists and medical predicted as ten genomic islands, including: DIP0023 → DIP0035, biologists can better understand the mechanism of gene transfer in DIP0057 → DIP0061, DIP0176 → DIP0179, DIP0219 → DIP0226, DIP0242 → DIP0249, DIP0281 → DIP0290, DIP0293 → DIP0294, bacterial and archaeal genomes. DIP0330 → DIP0333, DIP0361 → DIP0364, and DIP0407 → DIP0418. To provide microbiologists a better visualization on predicted In a separate genome analysis [28] has shown that this genomic genomic islands, we used Circos program [30] to generate genomic

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 219 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

island images, which can represent the relative positions of predicted GC content). We believe our GI tool GIHunter, which incorporates GIs in the genomes. In addition, we want to show the evidences of multiple sources, will be more accurately characterizing genomic predicted GI regions by displaying all GI-associated feature values in island information for the future draft genomes. Thus, we will build the corresponding positions. To this end, we can display all feature a similar genome annotation pipeline that incorporates our GIHunter values by different circular ideograms. By aligning different feature into current annotation tools in our future work. circles along with the GI circle, we can identify which feature values are Acknowledgement important when used in predicting GIs. Figure 8 shows an example of predicted genomic islands in the genome of C. diphtheria NCTC13129, This research was partially supported by President Research Fund, Faculty Professional Development & Research (FDR) major grant, and FDR mini grant at where each circle represents one GI- associated feature, with the order East Stroudsburg University, USA. of features from inside to outside: 1) tRNA; 2) Phage; 3) Integrase; 4) Transposons; 5) HEG; 6) Intergenic Distance; 7) Density; and 8) References IVOM. The most outer circle (9) represents the predicted GIs locations. 1. Hacker J, Bender L, Ott M, Wingender J, Lund B, et al. (1990) Deletions of chromosomal regions coding for fimbriae and hemolysins occur in vitro and in Conclusions vivo in various extraintestinal Escherichia coli isolates. Microb Pathog 8: 213- 225.

In this paper, we have reported the development of an ensemble 2. Dobrindt U, Hacker J (2001) Whole genome plasticity in pathogenic bacteria. of decision tree model for classifying GI/no-GI genes, based on eight Curr Opin Microbiol 4: 550-557. features: IVOM, density, Highly Expressed Gene, tRNA, integrase, 3. Dobrindt U, Hochhut B, Hentschel U, Hacker J (2004) Genomic islands in transposase, intergenic distance, and phage. We have developed our pathogenic and environmental microorganisms. Nat Rev Microbiol 2: 414-424. program GIHunter based on the trained model to predict GIs in any 4. Metsa-Ketela M, Halo L, Munukka E, Hakala J, Mantsala P, et al. (2002) genome. The prediction accuracy of our approach has shown to be Molecular evolution of aromatic polyketides and comparative sequence more accurate than that of other sequence composition approaches. analysis of polyketide ketosynthase and 16S ribosomal DNA genes from The improved performance of our approach over previous ones is due various streptomyces species. Appl Environ Microbiol 68: 4472-4479. to the incorporation of multiple genomic island related features in our 5. Langille MG, Hsiao WW, Brinkman FS (2010) Detecting genomic islands using model. For example, the feature of IVOM is effective in general, but bioinformatics approaches. Nat Rev Microbiol 8: 373-382. the incorporation of gene information such as integrase, transposase, 6. Langille MG, Hsiao WW, Brinkman FS (2008) Evaluation of genomic island and highly expressed genes will make the prediction more accurate. predictors using a comparative genomics approach. BMC Bioinformatics 9: We hope our program, GIHunter (http://www.esu.edu/cpsc/che_lab/ 329. software/GIHunter), can help accurately detecting genomic islands in 7. Ou HY, He X, Harrison EM, Kulasekara BR, Thani AB, et al. (2007) future sequenced genomes. MobilomeFINDER: web-based tools for in silico and experimental discovery of bacterial genomic islands. Nucleic Acids Res 35: W97-W104. Additionally, in order to make non computational scientists to 8. Vernikos GS, Parkhill J (2008) Resolving the structural features of genomic access predicted genomic islands easily, we have predicted genomic islands: a machine learning approach. Genome Res 18: 331-342. islands for more than 2,000 bacterial and archaeal genomes, and 9. Lawrence JG, Ochman H (1997) Amelioration of bacterial genomes: rates of provided visualized GI images (http://www5.esu.edu/cpsc/bioinfo/dgi). change and exchange. J Mol Evol 44: 383-397. Therefore, microbiologists and medical biologists could investigate 10. Karlin S (2001) Detecting anomalous gene clusters and pathogenicity islands in the gene transfer or pathogenicity mechanisms of their interested diverse bacterial genomes. Trends Microbiol 9: 335-343. by looking into our predicted genomic islands. 11. Hacker J, Kaper JB (2000) Pathogenicity islands and the evolution of microbes. In our future work, we will create the database of sub-categorized Annu Rev Microbiol 54: 641-679. genomic islands. Different genomic islands perform different 12. Cheetham BF, Katz ME (1995) A role for bacteriophages in the evolution and functions. Some of them may contain phage related genes which transfer of bacterial virulence determinants. Mol Microbiol 18: 201-208. may cause disease in some cases, while others contain degradation 13. Vernikos GS, Parkhill J (2006) Interpolated variable order motifs for identification or resistance genes. Therefore, it will be meaningful and important to of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. categorize genomic islands based on their functions inside the genomic Bioinformatics 22: 2196-2203. islands. We propose to categorize them into six groups, including 14. Rajan I, Aravamuthan S, Mande SS (2007) Identification of compositionally pathogenicity island, degradation island, resistance island, metabolism distinct regions in genomes using the centroid method. Bioinformatics 23: 2672-2677. island, symbiotic island, and secretion genomic island. We strongly believe that providing such sub-groups of islands will be beneficial to 15. Hsiao W, Wan I, Jones SJ, Brinkman FS (2003) IslandPath: aiding detection of researchers in the microbial communities. genomic islands in prokaryotes. Bioinformatics 19: 418-420. 16. Tu Q, Ding D (2003) Detecting pathogenicity islands and anomalous gene The high-throughput sequencing technologies have generated clusters by iterative discriminant analysis. FEMS Microbiol Lett 221: 269-275. huge amounts of draft genomes, and have left us to annotate. It will be very useful to integrate genomic island prediction tools into the current 17. Waack S, Keller O, Asper R, Brodag T, Damm C, et al. (2006) Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov genome annotation tools. Recently, GI-POP [31], a comprehensive models. BMC Bioinformatics 7: 142. annotation Web server, has been developed for ongoing genome 18. Che D, Hasan MS, Wang H, Fazekas J, Huang J, et al. (2011) EGID: an project analysis. ensemble algorithm for improved genomic island detection in genomic GI-POP contains a sequence assembling tool, functional annotation sequences. Bioinformation 7: 311-314. tools, and support vector machine (SVM)-based genomic island 19. Hasan MS, Liu Q, Wang H, Fazekas J, Chen B, et al. (2012) GIST: Genomic genomic profile scanning (GI-GPS) tool. The GI-GPS tool, however, island suite of tools for predicting genomic islands in genomic sequences. Bioinformation 8: 203-205. is mainly based on genomic and codon signature (including: codon usage frequency, dinucleotide frequency, codon adaption index, and 20. Dhillon BK, Chiu TA, Laird MR, Langille MG, Brinkman FS (2013) IslandViewer

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 220 Citation: Che D, Wang H, Fazekas J, Chen B (2014) An Accurate Genomic Island Prediction Method for Sequenced Bacterial and Archaeal Genomes. J Proteomics Bioinform 7: 214-221. doi:10.4172/0974-276X.1000322

update: Improved genomic island discovery and visualization. Nucleic Acids 26. Quinlan JR (1986) Induction of Decision Trees. Machine Learning 1: 81-106. Res 41: W129-132. 27. Langille MG, Brinkman FS (2009) Bioinformatic detection of horizontally 21. Wang H, Fazekas J, Booth M, Liu Q, Che D (2011) An Integrative Approach transferred DNA in bacterial genomes. F1000 Biol Rep 1: 25. for Genomic Island Prediction in Prokaryotic Genomes. In: Bioinformatics Research and Applications. Chen J, Wang J, Zelikovsky A (Eds) Springer 28. Cerdeño-Tárraga AM, Efstratiou A, Dover LG, Holden MT, Pallen M, et al. Berlin / Heidelberg. (2003) The complete genome sequence and analysis of Corynebacterium diphtheriae NCTC13129. Nucleic Acids Res 31: 6516-6523. 22. Che D, Hockenbury C, Marmelstein R, Rasheed K (2010) Classification of genomic islands using decision trees and their ensemble algorithms. BMC 29. (1997) Diphtheria acquired during a cruise in the Baltic Sea. Commun Dis Rep Genomics 11 Suppl 2: S1. CDR Wkly 7: 207.

23. Karlin S, Mrázek J (2000) Predicted highly expressed genes of diverse 30. Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, et al. (2009) Circos: an prokaryotic genomes. J Bacteriol 182: 5238-5250. information aesthetic for comparative genomics. Genome Res 19: 1639-1645.

24. Breiman L (1996) Bagging Predictors. Machine Learning 24: 123-140. 31. Lee CC, Chen YP, Yao TJ, Ma CY, Lo WC, et al. (2013) GI-POP: a 25. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, et al. (2009) The combinational annotation and genomic island prediction pipeline for ongoing WEKA Data Mining Software: An Update. SIGKDD Explorations 11: 10-18. microbial genome projects. Gene 518: 114-123.

J Proteomics Bioinform ISSN: 0974-276X JPB, an open access journal Volume 7(8) 214-221 (2014) - 221