BIOINF-2006-1117 Revision 3
Bioinformatics Advance Access published January 31, 2007

S.K.L NG and S.K MISHRA

large internal loops or bulges, especially large asymmetric ones (Ambros et al., 2003). Apparently, mere application of simple alignment queries and positive-selection rules is likely to overlook novel families lacking clear homologues to published mature miRNAs.

Advanced comparative approaches like MiRscan (Lim et al., 2003b; Lim et al., 2003a), MIRcheck (Jones-Rhoades and Bartel 2004), miRFinder (Bonnet et al., 2004a), miRseeker (Lai et al., 2003), findMiRNA (Adai et al., 2005), PalGrade (Bentwich et al., 2005), and MiRAlign (Wang et al., 2005) have systematically exploited the greater availability of sequenced genomes to eliminate the over-represented false positives. Cross-species sequence conservation based on computationally intensive multiple genome alignments is a powerful approach for genome-wide screening of phylogenetically well-conserved pre-miRs between closely related species. However, its sensitivity drops at greater evolutionary distances (Berezikov et al., 2005; Boffelli et al., 2003). Identifying pre-miRs that differ significantly or evolve rapidly at the sequence level while retaining their characteristic, evolutionarily conserved hairpin structures may also be an issue. Another significant drawback is that non-conserved pre-miRs with genus-specific patterns are likely to evade detection. Pathogenic viral-encoded pre-miRs have been uncovered in Epstein-Barr virus, Kaposi's sarcoma-associated herpesvirus, murine γ-herpesvirus 68, human cytomegalovirus, and Simian virus 40 that share significant sequence homology neither with known host pre-miRs nor among themselves (Cullen 2006; Sarnow et al., 2006).

To surmount the technical shortfalls of comparative works in distinguishing species-specific and non-conserved pre-miRs, several state-of-the-art de novo (or ab initio) predictive approaches have been developed (Table 1). The inaugural and definitive work by Sewer et al. (2005) compiled 40 distinctive sequence and structural "markers" from the hairpins, which obviates the use of comparative genomics information. The SVM classifier model, trained with experimental domain knowledge and binary-labeled feature vectors, recovered 71% of the positive pre-miRs with a remarkably low false-positive rate of ~3%. It also predicted ~50 to 100 novel pre-miRs for several species; ~30% of these were previously experimentally validated. The validation rate among the predicted cases that were conserved in ≥1 other species was higher at ~60%; many had not been detected by comparative genomics approaches. The 3SVM (Xue et al., 2005) improved the performance to ~90.00% for human and up to 90.00% in other species. Despite its methodological simplicity, promising performance, and independence of comparative genomics information, 3SVM was largely limited to classifying RNA sequences that fold into secondary structures without multiple loops.

RNAmicro (Hertel and Stadler 2006), which incorporates sequence and structural information in its feature vector, reported a sensitivity of 91.16% and a specificity of 99.47%. Still, its classification pipeline required computationally expensive multiple sequence alignments as inputs.

ProMiR (Nam et al., 2005) took advantage of a probabilistic co-learning model, HMM, to classify miRNA genes based on their pairwise aligned sequences. ProMiR reduced the false-positive rate to as low as 4.00%, but at the cost of a lower sensitivity of 73.00%. A relatively recent work, BayesMIRfinder (Yousef et al., 2006), adopted an alternative discriminative machine learning algorithm, NBI, as its underlying classifier model. Notwithstanding its technical novelty, BayesMIRfinder relied on the comparative analysis of conserved genomic regions for post-processing to yield, in mouse, a considerably higher sensitivity of 97.00% and a specificity of 91.00% that is comparable to existing algorithms.

Table 1. Existing (quasi) de novo classifiers, in chronological order, for distinguishing novel pre-miRs from genomic pseudo hairpins.

| Works | Classifiers | Num | Description of Features | Datasets | P | N | SE (%) | SP (%) |
|---|---|---|---|---|---|---|---|---|
| Sewer et al. (2005) | SVM | 40 | 16 statistics computed from the entire hairpin structure, 10 from the longest symmetrical region of the stem, 11 from the longest relaxed symmetry region, and 3 from the candidate stem-loop. | H | 178 | 5,395 | 71.00 | 97.00 |
| ProMiR (Nam et al., 2005) | HMM | − | A hairpin structure is represented as a pairwise sequence. Each position of the pairwise sequence has two states, structural and hidden. | H | 136 | 1,000 | 73.00 | 96.00 |
| 3SVM (Xue et al., 2005) | SVM | 32 | Each hairpin is encoded as a set of 32 triplet elements: a nucleotide type and three local continuous sub-structure-sequence attributes, e.g., "A(((" and "G(..". | H | 30 | 1,000 | 93.30 | 88.10 |
| | | | | H | 39 | 2,444 | 92.30 | 89.00 |
| BayesMIRfinder (Yousef et al., 2006) | NBI | 84 | 62 secondary structural features derived from the foot, mature, and head of a hairpin-loop; 12 sequence features extracted from the candidate sequence. | C.E | 11 | 150 | 83.00 | 96.00 |
| | | | | M | 22 | 150 | 97.00 | 91.00 |
| RNAmicro (Hertel and Stadler 2006) | SVM | 12 | 2 lengths of stem and hairpin loop regions; 1 G+C sequence composition; 4 sequence conservation; 4 thermodynamic stability; and 1 structural conservation. | Animal | 136 | 394 | 91.16 | 99.47 |

(Classifiers) SVM (Support Vector Machine), NBI (Naïve Bayesian Induction), and HMM (Hidden Markov Model). (Num) Number of features. (Datasets) H (H. sapiens), C.E (C. elegans), and M (M. musculus). P (real pre-miRs), N (pseudo hairpins), SE (Sensitivity), and SP (Specificity).

2 MATERIALS AND METHODS

2.1 Biologically relevant datasets

Training, testing, and independent datasets. They are pooled separately from four independent sources (see supplementary "Materials and Methods" for details). To improve the quality of this comprehensive collection, sequences with non-ACG[TU] nucleotides are filtered out and no sequence is reused. The entire set of 2,241 pre-miRs is obtained from miRBase 8.2 (Griffiths-Jones et al., 2006), and 8,494 pseudo hairpins are obtained from human RefSeq genes (Pruitt and Maglott 2001) that have not undergone any known experimentally validated alternative splicing (AS) events. For hyperparameter estimation and training the decision function of miPred, binary-class labeled samples consisting of 200 human pre-miRs (positives) and 400 pseudo hairpins (negatives) are randomly selected without replacement to avoid the classifier being skewed towards specifically screened training samples. The remaining 123 human pre-miRs (positives) and 246 randomly selected pseudo hairpins (negatives) are used for testing. These training and testing sets are denoted TR-H and TE-H, respectively. The ratio of 1:2 (positives to negatives) ensures that the selected negatives contribute more significantly to the specificity of a classifier than positives, while avoiding the problem of overtraining. Typically, the decision function of SVM converges to a solution where all samples belonging to the smaller class are classified as members of the larger class if the class sizes differ significantly.

The performance of miPred is evaluated against three datasets: IE-NH, IE-NC, and IE-M. They represent, respectively, the remaining 1,918 pre-miRs spanning 40 non-human species (positives) and 3,836 randomly selected pseudo hairpins (negatives); 12,387 functional ncRNAs (negatives) from Rfam 7.0 (Griffiths-Jones et al., 2005); and 31 mRNAs (negatives) from the GenBank DNA database (Benson et al., 2005).

Four complete viral genomes. They are downloaded from the GenBank DNA database (Benson et al., 2005), namely Epstein-Barr virus (EBV; 171,823 bp; DNA circular; AJ507799.2), Kaposi's sarcoma-associated herpesvirus (KSHV; 137,508 bp; DNA linear; U75698.1), murine γ-herpesvirus 68 strain WUMS (MGHV68; 119,451 bp; DNA linear; U97553.2), and human cytomegalovirus strain AD169 (HCMV; 229,354 bp; DNA linear; X17403.1).

De Novo SVM Classification of Precursor MicroRNAs from Genomic Pseudo Hairpins Using Global and Intrinsic Folding Measures

2.2 Computational pipeline of miPred

Background of SVM. Integral to miPred is the Support Vector Machine (SVM), a supervised classification technique derived from the structural risk minimization principle of statistical learning theory (Burges 1998). Given its ability to deal easily with multi-dimensional datasets that can be noisy or redundant (non-informative or highly correlated), SVM has been adopted extensively as an invaluable machine learning tool to address diverse bioinformatics problems (Dror et al., 2005; Han et al., 2004; Liu et al., 2006).

Briefly, the primary objective of SVM is to explicitly construct a multi-dimensional hyperplane separating a set of complex feature vectors x_i into binary-labeled classes y_i ∈ {1, −1}, with the distance between the hyperplane and the closest support vectors (the margin) maximized. In non-linear separable cases, the maximum-margin hyperplane is …

… rate (Duan et al., 2003). To avoid over-fitting, the best combination of hyperparameters (C, γ) maximizing the five-fold LOOCV accuracy rate serves as the default setting for training miPred. Finally, the classification is conducted on the testing and independent evaluation datasets with "svm-predict -b 1". See supplementary "Materials and Methods" for details on statistical tests and performance evaluation metrics.

3 RESULTS AND DISCUSSION

3.1 Training and classifying human pre-miRs

We calibrate miPred using TR-H; the optimal hyperparameter pair (C, γ) is (16.0, 0.03125), which maximizes the five-fold cross-validation ac-