Articles

Materials Science September 2010 Vol.55 No.3: 3576-3589 doi: 10.1007/s11434-010-4343-5

SPECIAL TOPICS:

Identification of common microRNA-mRNA regulatory biomodules in human epithelial cancer

YANG XiNan1,2, LEE Younghee2, FAN Hong3, SUN Xiao1* & LUSSIER Yves A2,4,5*

1 State Key Laboratory of Bioelectronics, Southeast University, Nanjing 210096, China; 2 Center for Biomedical Informatics and Section of Genetic Medicine, Department of Medicine, the University of Chicago, Chicago, IL 60637 USA; 3 MOE Key Laboratory of Developmental & Human Diseases, Southeast University, Nanjing 210009, China; 4 The University of Chicago Cancer Research Center, and the Ludwig Center for Metastasis Research, the University of Chicago, Chicago, IL 60637, USA; 5 The Institute for Genomics and Systems Biology, and the Computational Institute, Argonne National Laboratories and the University of Chicago, Chicago, IL 60637, USA

Received May 1, 2009; accepted August 14, 2009

The complex regulatory network between microRNAs and gene expression remains an unclear domain of active research. We proposed to address in part this complex regulation with a novel approach for the genome-wide identification of biomodules derived from paired microRNA and mRNA profiles, which could reveal correlations associated with a complex network of dysregulation in human cancer. Two published expression datasets for 68 samples with 11 distinct types of epithelial cancers and 21 samples of normal tissues were used, containing microRNA expression and gene expression profiles, respectively. As a result, microRNA expression used jointly with mRNA expression provides better classifiers of epithelial cancers against normal epithelial tissue than either dataset alone (p = 1×10⁻¹⁰, F-test). We identified a combination of six microRNA-mRNA biomodules that optimally classified epithelial cancers from normal epithelial tissue (total accuracy = 93.3%; 95% confidence interval: 86%–97%), using the penalized logistic regression (PLR) algorithm and three-fold cross-validation. Three of these biomodules are individually sufficient to cluster epithelial cancers from normal tissue using mutual information distance. The biomodules contain 10 distinct microRNAs and 98 distinct genes, including well-known tumor markers such as miR-15a, miR-30e, IRAK1, TGFBR2, DUSP16, CDC25B and PDCD2. In addition, there is a significant enrichment (Fisher's exact test, p = 3×10⁻¹⁰) between putative microRNA-target gene pairs reported in five microRNA target databases and the inversely correlated microRNA-mRNA pairs in the biomodules. Further, microRNAs and genes in the biomodules were found in abstracts mentioning epithelial cancers (Fisher's exact test, unadjusted p < 0.05). Taken together, these results strongly suggest that the discovered microRNA-mRNA biomodules correspond to regulatory mechanisms common to human epithelial cancer samples.
In conclusion, we developed and evaluated a novel comprehensive method to systematically identify, on a genome scale, microRNA-mRNA expression biomodules common to distinct cancers of the same tissue. These biomodules also comprise novel microRNAs and genes as well as an imputed regulatory network, which may accelerate the work of cancer biologists, as large regulatory maps of cancers can be drawn efficiently for hypothesis generation. Supplementary materials are available at http://www.lussierlab.org/publication/biomodule.

Keywords: biomodule, microRNA expression, gene expression, cancer, molecular diagnosis

Citation: Yang X N, Lee Y, Fan H, et al. Identification of common microRNA-mRNA regulatory biomodules in human epithelial cancer. Chinese Sci Bull, 2010, 55: 3576-3589, doi: 10.1007/s11434-010-4343-5

*Corresponding author (email: [email protected]; [email protected])

© Science China Press and Springer-Verlag Berlin Heidelberg 2010 csb.scichina.com www.springerlink.com

YANG XiNan, et al. Chinese Sci Bull September (2010) Vol.55 No.3

Mounting evidence shows that common gene expression signatures across many types of cancer are useful markers for medical diagnostics [1–3]. Recent work also reveals the universal diagnostic role of microRNAs for human tumors [4], which extends the potential application of microRNA profiling from specific cells within a tissue [5–10] to a more functional understanding of cancer development [11,12]. For example, Volinia et al. identified a set of universal microRNA signatures for solid cancers by a large-scale analysis on patients including lung, breast, stomach, prostate, colon, and pancreatic tumors [4]. Lu et al. found that, with lower expression levels in tumors than in normals, microRNAs are unexpectedly rich in information about cancer [13]. In addition, microRNA and messenger RNA (mRNA) interactions remain incompletely understood in most cancers. Among solutions conducted to clarify these interactions, hundreds of putative microRNA targets have generally been calculated from nucleic acid sequence similarities and are provided in datasets (miRBase [14,15], miRanda [12], PicTar [16,17], TarBase [18] and TargetScan [19], reviewed by Bartel [20]). However, sequence similarity alone, taken out of cellular context, leads to about 40% accuracy in microRNA target validations [21,22]. Further, traditional biological characterization of microRNA targets is time consuming. Since expression arrays of microRNA and mRNA can be conducted over the same samples, there is an opportunity to reverse engineer regulatory mechanisms and to provide microRNA-target hypotheses. Though previous studies also identified regulatory biomodules of specific human cancers by combining microRNA and mRNA expression profiles [23–26], their algorithms focus exclusively on linear inverse correlations [25,26] and/or on previously predicted microRNA targets [24–27] between two expression profiles, and may not reveal other complex patterns of regulation that may appear paradoxical (such as co-expression). However, co-expression has been observed between intragenic microRNAs and a target gene, when this target happens to be the microRNA host gene [28]. Further, one can hypothesize downstream signaling that may paradoxically lead to co-expression of microRNAs with genes that are not their immediate targets. Of note, certain microRNAs have already been discovered to be positively co-expressed with their target genes [29]; for example, miR-17-5p and its target E2F1 are positively coregulated by the proto-oncogene c-Myc [30]. Therefore, in this manuscript, we propose a novel unbiased computational strategy to derive microRNA-mRNA interaction biomodules that represent fundamental regulatory mechanisms common to several cancers, inclusive of inverse correlation and correlation patterns between microRNAs and mRNAs.

In bioinformatics, supervised machine learning techniques have been successfully used for classification and cancer diagnosis. So far, many supervised prediction metrics have been built to find the relation between gene expression and clinical conditions. The Support Vector Machine (SVM) is among the methods that have been successfully applied to the problems of cancer diagnosis [31]. The model of partial least squares (PLS) builds weighted linear combinations of genes that have maximal covariance with conditions of interest, but usually results in thousands of genes on microarray expression data. Assuming that only a few gene markers determine the classification of a genetic problem, the penalized logistic regression (PLR) is a technique that performs comparably to the SVM [32] and provides an estimation of the underlying class probabilities. PLR has been considered as a stand-alone method for classification of microarray gene expression data with "small sample size, larger variances" [32,33] and has been shown to be strongly predictive for clinical response in cancer as compared with the other above-mentioned methods [33]. However, to our knowledge, PLR has never been used over microRNA data before, nor for reverse engineering regulatory networks.

We hypothesized that we could identify, in high throughput, microRNA/mRNA biomodules common to different types of epithelial cancers that would also be associated with some fundamental biological mechanisms. We address the problem by re-analyzing public data comprising 89 human epithelial samples, including cancers and controls, that have both mRNA and microRNA expression profiles [13]. Using PLR, we identify common microRNA-mRNA biomodules across epithelial cancers that best distinguish them from normal epithelial tissues. Specifically, to reverse engineer the regulatory network consisting of microRNA and mRNA interactions, we assumed that microRNAs might act in concert with other regulatory processes to regulate gene expression [34], without using any prior knowledge of microRNA target genes or any assumption on the inverse correlation of expression or co-regulation.

1 Methods

1.1 Datasets

Expression profiles of 217 microRNAs and 16,063 mRNAs for the same 89 epithelial samples were published by Lu et al. [13] (GSE2564) and Ramaswamy et al. [31], respectively. They were among the very first paired samples of microRNA and mRNA expression, including 21 normals and 68 tumors. To generate the regulatory network, five microRNA gene target databases were downloaded and parsed to set up the comprehensive tables for human microRNA target genes. The five databases are miRBase v5 [14,15], miRanda version on July 2007 [12], PicTar server version 4.0.24 [16,17], TarBase v4.0 [18] and TargetScan v3.1 [19], which contain altogether 1112 human microRNAs and 22084 predictive putative microRNA gene-targets. Additionally, the Gene Ontology and PubMed databases were used for evaluation in this study, using Bioconductor software [35].

For the expression profiles, we did preprocessing and filtering on the downloaded data in two steps: (A) Preprocessing and preliminary filtering suggested by the author [13]: log2-transformation, keeping only those probes exceeding a measurement of 7.25 (on log2 scale) in one or more samples. This preprocessing and pre-filtering resulted in 195 microRNAs and 14546 mRNAs [13]. (B) Additional preprocessing steps applied to the microRNA and mRNA expression profiles to concentrate on probe-sets that vary among different conditions. This variation filter [13,31] involves the following steps: (a) setting a threshold using a ceiling of 1600 units and a floor of 20 units of expression measurement; (b) the maximum expression measurement was at least 5-fold greater than the minimum measurement; and (c) the maximum measurement was more than 500 greater than the minimum measurement. The filtering step resulted in 130 microRNAs and 6621 mRNAs. Finally, we performed "low-level fusion", that is, we combined expression profiles from two sources of microarrays before step-down analysis. This is different from high-level fusion (decision fusion) [36], in which microRNA and mRNA analyses are done in parallel rather than together. To allow expression levels to be comparable, we first scaled the combined data. Then we performed further supervised grouping.

1.2 Algorithm for sample classification and predictor-variable grouping

The penalized logistic regression (PLR) algorithm has been proposed as a stand-alone method for classification of microarray gene expression data with "small sample size, larger variances" [32,33] and has been shown to be strongly predictive for clinical response in cancer as compared with the other above-mentioned methods [33]. Let G = (1, g1, …, gq) be a group of q probe-sets, and β be a vector of q parameters (each associated with one probe-set) that are trained to optimize (for example, to minimize the S(β) in Equation 3) the PLR model by a penalized maximum likelihood principle [37]. The classical logistic model of PLR is defined as [38]:

$$\log\frac{p_\beta(G)}{1 - p_\beta(G)} = \sum_{k=0}^{q} \beta_k g_k \qquad (1)$$

Two implementations of PLR algorithms were employed in this study. (1) In the preliminary validation study comparing PLR usage in mRNA expression data alone, microRNA expression data alone and the combined profiles (step i in Figure 1), we applied three-fold cross-validation using the Bioconductor package MCRestimate and a penalized maximum likelihood algorithm to estimate the classical PLR model, using the R package Design [39]. The PLR model was estimated from the data of training samples in each cross-validation (CV); then, in the subsequent tests within the CV, we applied a predict function to classify the test data using the fitted model. The input data were the mRNA and/or microRNA expression together with the classification annotation of test samples and the trained PLR model. The corresponding outputs were the predicted class labels of each test sample. (2) To identify the merged microRNA-mRNA biomodule that best classified normal epithelial samples from the combined set of epithelial cancers, we performed the Supervised Grouping of Predictor Variables (pelora) method developed by Dettling et al. [37,40], using the R package supclust.

In summary, the second implementation of PLR finds prioritized groups of variables (probes in a microarray) that can best explain the sample classes (cancer and normal in this study), employing the penalized log-likelihood function that is based on estimations of conditional class probabilities from PLR. As a result, the algorithm estimates conditional probabilities while focusing on similarities and intersections among predictor-variables in a supervised way [37,40]. The assumption is as follows: assume there exist two groups of probe-sets, setA and setB, and two conditions of samples, e.g. tumor and normal. If it is indicative of cancer that the centroid of setA (CA) is high while the centroid of setB (CB) is low, then two such probe-sets and their contributions to the centroids can be understood as molecular signatures to gain insights into molecular regulation in cancer [40]. In detail, let x_g be the observed expression measurement of probe-set g; the centroid of a group of probe-sets G is given as:

$$C_G = \frac{1}{|G|} \sum_{g \in G} \alpha_g x_g \qquad (2)$$

using a discrete parameter α_g ∈ {-1, 1} that allows up- or down-regulated genes to contribute to the same group. In this way, the centroid approach can derive biomodules consisting of both up- and down-expressed probe-sets. This characteristic allows for unbiased observation of both the negative expression correlation between a microRNA and its target gene within the group [41], and the positive expression correlation between microRNAs and their host genes in the combined profiles [42]. Let β be a vector of parameters, λ be a variable parameter for penalization control, and P be a penalty matrix; pelora [37,40] measures the strength of a probe-set group G for distinguishing the n tumor/normal phenotypes (y1, y2, …, yn) as:

$$S(\beta) = -\sum_{i=1}^{n} \left[ y_i \log p_\beta\big(C_G^{(i)}\big) + (1 - y_i) \log\big(1 - p_\beta\big(C_G^{(i)}\big)\big) \right] + \frac{n\lambda}{2}\, \beta^{T} P \beta \qquad (3)$$

where C_G^{(i)} is the centroid of group G in sample i and p_β(C_G) = P_β[y = 1 | C_G] is the estimated conditional class probability from penalized logistic regression analysis. Pelora automatically chooses the starting probe-set for a group of probe-sets, then incrementally changes the number of probe-sets in a group Gi by adding or pruning one probe-set after the other, recalculating after each change until it optimizes the sample classification in the training set. Once a group of probe-sets and its trained parameters are optimized, pelora identifies the second best one. Therefore, by inputting a matrix of mRNA and/or microRNA expression with the classification annotation of each sample, and setting the number of biomodules to be searched to 10 while keeping the other default parameters, we obtained 10 biomodules that PLR searched, each associated with its respective "rescaled penalty parameter λ", and the fitted values of each biomodule.
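As a reading aid, the signed centroid of Equation (2) and the penalized criterion of Equation (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pelora implementation: the function names, the two-parameter form β = (β0, β1) with a logistic link on a single centroid, the choice of penalty matrix P, and the toy values are all hypothetical.

```python
import math

def signed_centroid(x, alpha):
    """Equation (2): mean of probe-set values x_g, each multiplied by a sign alpha_g in {-1, +1}."""
    if len(x) != len(alpha) or not x:
        raise ValueError("x and alpha must be non-empty and of equal length")
    if any(a not in (-1, 1) for a in alpha):
        raise ValueError("each alpha_g must be -1 or +1")
    return sum(a * v for a, v in zip(alpha, x)) / len(x)

def class_probability(centroid, beta):
    """Assumed logistic link: estimated P[y = 1 | C_G] for one sample."""
    eta = beta[0] + beta[1] * centroid
    return 1.0 / (1.0 + math.exp(-eta))

def penalized_criterion(y, centroids, beta, lam, P):
    """Equation (3): negative log-likelihood plus (n * lam / 2) * beta^T P beta."""
    n = len(y)
    nll = 0.0
    for y_i, c_i in zip(y, centroids):
        p = class_probability(c_i, beta)
        nll -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    dim = len(beta)
    quad = sum(beta[r] * P[r][c] * beta[c] for r in range(dim) for c in range(dim))
    return nll + 0.5 * n * lam * quad

# Toy values (hypothetical, not taken from the study).
c = signed_centroid([2.0, 4.0], [1, -1])       # one up- and one down-regulated probe-set
P = [[0.0, 0.0], [0.0, 1.0]]                   # assumption: intercept left unpenalized
s_null = penalized_criterion([1, 0], [1.0, -1.0], [0.0, 0.0], 1.0, P)
s_fit = penalized_criterion([1, 0], [1.0, -1.0], [0.0, 1.0], 1.0, P)
```

With β = (0, 0) every sample receives probability 0.5, so the criterion reduces to n·log 2; a non-zero slope lowers the likelihood term but pays the quadratic penalty, which is the trade-off λ controls.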

Figure 1 Data and six steps of analysis. Data: data collection, profile preprocessing and filtering. The analysis was performed as: (i) a preliminary validation study comparing PLR usage in mRNA expression data alone, microRNA expression data alone and the combined results; (ii) identification and the quantitative description of the discovered microRNA-mRNA biomodules and their interaction networks; (iii) expression patterns of the microRNA-mRNA biomodules classify epithelial cancers and normal samples across all 11 types of tissues; (iv) conducting the Gene Ontology enrichment of the genes associated with these biomodules to provide an unbiased description of their biological mechanisms. Finally, we evaluated the identified genes and microRNAs in the biomodules in two ways: (v) calculating the enrichment of gene targets of microRNAs among the negative co-expression patterns of microRNA and mRNA; (vi) systematic review of literature to identify the relevance of the gene patterns observed in the biomodules. Legend: GO: Gene Ontology, PPI: protein-protein interaction, PLR: penalized logistic regression, CV: Cross-Validation.
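Step (i) of Figure 1 rests on stratified three-fold cross-validation and on the total accuracy (TP+TN)/(TP+FP+TN+FN). The sketch below shows one way such a split could be drawn in plain Python; it is not the MCRestimate procedure used in the study, and the function names, the fixed seed and the round-robin assignment are assumptions made for illustration. Only the class sizes (68 tumors, 21 normals) come from the text.

```python
import random

def stratified_three_fold(labels, rng):
    """Split sample indices into 3 folds with a similar tumor/normal ratio in each fold."""
    folds = [[], [], []]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for k in range(3):
            folds[k].extend(idx[k::3])  # round-robin assignment within each class
    return folds

def total_accuracy(tp, fp, tn, fn):
    """Total accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

labels = [1] * 68 + [0] * 21          # 1 = tumor, 0 = normal
rng = random.Random(0)                # arbitrary fixed seed for reproducibility
folds = stratified_three_fold(labels, rng)

splits = []
for k, test_idx in enumerate(folds):
    # Two thirds of the samples train the classifier, one third is held out.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    splits.append((train_idx, test_idx))
```

Repeating the split with different seeds (100 times in the study) and recording how often each held-out sample is predicted correctly gives the distribution of error rates that is compared across the three data types.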

Table 1 Contingency table

 | #ref. including gene or microRNA symbol | #ref. not including the symbol a) | ∑
#ref. including epithelial cancers | n1 | n3-n1 | n3
#ref. not including epithelial cancers | n2-n1 | N-n2-n3+n1 | N-n3
∑ | n2 | N-n2 | N

a) ref = individual reference in PubMed
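Reading Table 1 as a 2×2 table with margins n2 and n3 and grand total N, a one-sided Fisher's exact test for over-representation of the co-occurrence count n1 is the upper tail of a hypergeometric distribution. The following is a minimal sketch using only the Python standard library; the function name and the toy counts are hypothetical, and it is not the tool used by the authors.

```python
from math import comb

def fisher_one_sided(n1, n2, n3, N):
    """P(X >= n1) for X ~ Hypergeometric(N, n2, n3).

    n1: references mentioning both the symbol and the cancer
    n2: references mentioning the symbol
    n3: references mentioning the cancer
    N:  all references
    """
    lowest = max(0, n2 + n3 - N)       # smallest count compatible with the margins
    highest = min(n2, n3)              # largest count compatible with the margins
    if not (lowest <= n1 <= highest and max(n2, n3) <= N):
        raise ValueError("counts are inconsistent with a 2x2 table")
    tail = sum(comb(n2, k) * comb(N - n2, n3 - k) for k in range(n1, highest + 1))
    return tail / comb(N, n3)

# Toy counts (hypothetical): 10 references, 4 with the symbol, 5 about the cancer, 4 with both.
p_toy = fisher_one_sided(4, 4, 5, 10)
```

Exact integer binomials keep the sketch simple but become slow for PubMed-scale totals; a log-space computation or a statistics library would be the more practical choice there.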


Step i: Comparing PLR usage in mRNA expression each number of including group n∈ N, and then applied to data, microRNA expression data and the combined pro- the test set while the predicted labels were compared with files the true labels. The fraction of misclassified individuals was We performed a three-fold stratified cross-validation estimated for each number of including biomodules. For [43,44] as well as a re-randomization algorithms to assess comparison, we also did the same estimation for mRNA and and compare the predictive power of mRNA, microRNA microRNA data, respectively. The optimal number of in- and their combined expression profiles, where “three-fold” cluded biomodules is the n* that achieved the lowest mis- means random choosing the training-set to be two thirds of classification. We then merged the n* leading biomodules the samples, and “stratified” means the proportion of tumor into a merged epithelial cancer biomodule for further vali- tissues in the training set more or less the same as in the dation. whole data set. Then, for the samples in the training-set, we We summarized the genes and microRNAs involved in employed penalized logistic regression (PLR) to train the the merged epithelial cancer biomodule as comprehensive classifier by modeling the probability of class membership. tables of human microRNA target genes. To describe the The resulting classifier model was then used to predict the regulatory network, we made use of shapes and colors: cir- samples in the remaining one-third test-set. Thus each run cle (gene), box (microRNA), pink node (averagely of the cross-validation yielded predictive class labels for the up-regulated gene and/or microRNA in cancer), blue node samples in the test-set. We iterated the above process 100 (averagely down-regulated gene and/or microRNA in can- times and calculated the frequency of each sample correctly cer), and orange node (tissue-specific expressed gene target predicted in the test sets. 
The three-fold cross validation has of down-expressed microRNAs). been shown at least as conservative as the leave-one out Step iii: Expression patterns of the microRNA-mRNA method to provide robust estimates that are not over trained biomodules classify epithelial cancers and normal sam- when using machine learning methods [45-47]. ples across all 11 Types of Tissues After completing the above-mentioned cross-validations, The 10 microRNAs and 98 genes identified in the mi- we plotted the distribution of resulting predictions for the croRNA-mRNA biomodule and all samples were hierarchi- microRNA expression profiles alone, the gene expression cally clustered using complete Mutual Information distance profiles alone and also the combined microRNA-mRNA of the expression levels, (Bioconductor package Biodist). expression profiles. Then we further compared such three We also observed the expression patterns of microRNAs distributions of the validation results using F-test. The total and genes in each prioritized biomodule, to see whether accuracy, (TP+TN)/(TP+FP+TN+FN), of cross-validation individual biomodules could classify epithelial cancers and was calculated, where TP (True positive) is the correctly normal samples across 11 types of tissue. classified cancer samples, FP (false positive) is the wrongly Step iv: Gene Ontology enrichment classified cancer samples, TN (True negative) is the cor- Validation of gene ontology enrichment of the biological rectly classified normal samples, and FN (false negative) is processes was conducted in the genes of the epithelial can- the wrongly classified normal samples. The 95% central cer biomodule in order to biologically characterize the bio- confidence interval [48,49] of the total accuracy for the module. 
We used the Bioconductor package GOstats [51] to population (89 samples) in this data was estimated using an estimate the Biological Processes (BP) enriched in genes of online Bayesian calculator (http://www.causascientia.org/ six microRNA-mRNA biomodules (version 2.6.0 based on math_stat/ProportionCI.html). Hu6800.db v2.2.0, hu35ksuba.db v2.2.0 and GO.db v2.2.0, Step ii: Identification and the quantitative description parameters p-value ≤ 0.01, gene count > 2). This package of the discovered microRNA-mRNA biomodules and GOstats reports a standard hypergeometric p-value to access their interaction networks whether the number of selected genes among the genes in The two kinds of filtered expression data were finally an annotated array associated with a GO term is larger than combined (6621 mRNAs and 130 microRNAs), log 10 expected [51], and it is a standard tool to evaluate the over- transformed, and standardized. We used penalized regres- representation of GO terms among a given gene set and has sion model [50] to search prioritized microRNA-mRNA been successfully implemented [52]. As each GO term is biomodules with good predictive potential based on this considered independently in GOstat, the review of the combined data, using Bioconductor package supclust enrichment was conducted with the understanding that if [37,40]. related or possibly redundant GO terms were found to be First, we searched N = 1,...,10 leading biomodules from enriched, the nature of this relationship could depend on the the combined microRNAs and mRNAs data. Second, we intrinsic annotations rather than the structure of GO (e.g. estimated the number of including biomodules that yielded redundant annotations of genes to related GO terms). Addi- the best prediction accuracy on the validated samples by tionally, because the raw data contains two different mRNA cross-validation. 
We iteratively ran a three-fold cross vali- platforms, we also used an old version of Bioconductor dation 100 times, using Bioconductor package MCResti- package Compdiagtools (version in April 2007) to access mate. A PLR model of the learning set was obtained for the BP from whole (Gene Ontology Built: 6 YANG XiNan, et al. Chinese Sci Bull September (2010) Vol.55 No.3

15-Mar-2006) enriched in these 98 genes identified in bio- 2 Results modules (parameters: p ≤ 0.05, gene count ≥ 2). Step v: Evaluation of enrichment of inversely corre- 2.1 Preliminary validation study comparing PLR lated microRNA-mRNA pairs in the Biomodules com- usage in mRNA expression data alone, microRNA ex- puted by expression among microRNA-target pairs pre- pression data alone and the combined Results dicted from nucleotide sequences microRNA target In order to demonstrate the increased statistical power of Enrichment of microRNA targets in the network was classification using joint microRNA and mRNA expres- evaluated using Fisher exact test to estimate possibility of drawing the intersection of biomodule observed inversely sions, we compared the classification accuracy for micro- RNA, mRNA, and the expression profiles of both kinds of correlated microRNA-mNRA pairs from all putative mi- croRNA-target pairs based on sequence similarities that microarrays on the same 89 epithelial samples, representing , 11 types of human tumor (colon, pancreas, kidney, bladder, contained in five databases (miRBase v5 [14,15] miRanda prostate, ovary, uterus, lung, mesothelioma, melanoma, and version on July 2007 [12], PicTar server version 4.0.24 breast cancer) [13]. [16,17], TarBase v4.0 [18] and TargetScan v3.1 [19]). This To estimate the total accuracy of expression profiles, we study was performed on the expression profiles of 217 mi- croRNAs and 16063 genes, about which there were 121755 repeatedly (n=100) performed 3-fold [43,44] stratified cross-validation, using standard PLR as the classifier to putative microRNA-target pairs among all possible 3485671 predict cancer from normal samples (details are given in pairs. We then counted the six biomodules identified in- section methods). 
The meta-analysis of both mRNA and versely correlated microRNA-mRNA pairs and the intersec- microRNA expression profiles resulted in a higher accuracy tion number of identified inversely correlated pairs among putative target pairs to calculate the p-value. with smaller variance (Figure 2). The differences between the predictions from individual mRNA or microRNA and We looked into the significance of differential expression for each gene and microRNA that acts together to best clas- the predictions from integrated data are significant (F-test = 56.27, p = 1×10–10). Our results suggest that there are sify epithelial tumors versus normal tissues. To estimate the groups of non-coding and coding genes that work together individually false positives rate when expression changes with respect to the classification (total accuracy = 93.3%; were significant [53], we calculated the q-value [53] of 95% confidence intervals: 86%–97%) of cancer tissue vs. fold-change equivalent measures on the log-transformed expression levels using Bioconductor package twilight [54]. normal tissue for the 11 types of epithelial tissues. The Entrez Gene IDs for mRNAs were mapped from probe sets using package hu6800 version 2.2.0, and package hu35ksuba version 2.2.0 in R. Step vi: Systematic review of literature to identify the relevance of the gene patterns observed in the biomodules We assessed the significance of the co-occurrence for every pair of two terms in which one was a microRNA or gene symbol and the other was a cancer type, to see the as- sociation between cancer and gene- or microRNAs in the identified biomodule. Similarly, using the online literature mining tool PubMatrix [55], we searched the PubMed in- dexed publications and counted the number of abstracts containing the symbol of the identified microRNA or gene and epithelial cancer type, also the number of co-occurrence of pair of terms. 
The total number of indexed publication Figure 2 Standard box-and-whisker plots of the error rate of repeated was derived from the PubMed online review tool. Subse- cross-validation on mRNA profiles, microRNA profiles and on the com- quently, Fisher’s exact test was used to estimate the signi- bined profiles. Each box presents the smallest observation, lower quartile, median, upper quartile and the largest observation, the whiskers are the ficance to co-occurrence. Let N be the totally records in lines that extend to a maximum of 1.5 times of the inter-quartile range (the PubMed, n2 be the number of abstracts containing a symbol range of the middle two quartiles) excluding outlines, and the circles represent the outliers. The y-axis gives the probability of samples to be of microRNA or gene, n3 be the number of abstracts about cancer, and n be the number of abstract mentioning both, wrongly predicted. The dashed line is the normal (lower) prevalence of the 1 samples. we got the contingency table (Table 1) for each symbol and cancer. Then one-side Fisher’s exact test was applied to evaluate the significance of co-occurrence. A threshold of unadjusted p < 0.05 was set for to indicate a true positive To identify the contribution of each specific epithelial result in this evaluation. tissue to the accuracy, we further counted the classification accuracy in each type of tissue (Table 2). The results indi YANG XiNan, et al. Chinese Sci Bull September (2010) Vol.55 No.3 7

Table 2 The population size and classification total accuracy in percentage (%) from 3-fold cross-validation for each type of cancer and as a whole in this study. The last row of this table lists the sample populations for each tissue and as a whole. The last three columns list the total accuracy for all types of cancer as a whole and the corresponding confidence intervals (CI). CI: 95% confidence interval calculated by a Bayesian calculator [48,49]; TP: true positive; TN: true negative.

Total CI-Lower CI -Upper colon prostate uterus kidney ovary breast bladder melanoma mesothelioma lung pancreas Accuracy Limit limit microRNA/ 63.6 91.7 90.9 100 100 100 100 100 100 100 100 93 86 97 mRNA (%) microRNA (%) 45.5 75 81.8 85.7 80 100 100 100 100 100 100 85 77 91 mRNA (%) 45.5 91.7 72.7 71.4 100 88.9 71.4 0 100 100 100 80 70 87 Sample population 11 12 11 7 5 9 7 3 8 7 9 89

Figure 3 Box and whisker plot showing the variation of the misclassification rates (y axis) of different number of microRNA-mRNA biomodules (x axis), conducted in three expression datasets: mRNA alone (a), microRNA alone (b), and the combined microRNA-mRNA expression (c). The optimal area is the lowest error rate as measured by the median (black horizontal line) and the direction of the box plot. Optimal results are observed in the leftmost group at X=6. The upper edge of the boxes indicates the 75th percentile of the data set, and the lower hinges indicate the 25th percentile. The “whiskers” are the lines that extend to a maximum of 1.5 times of the inter-quartile range (the range of the middle two quartiles) excluding outlines, and the circles represent the outliers. The dashed line in each plot is the normal (lower) prevalence of the samples.

Figure 4 Two-dimensional projection of supervised cluster of the microarray samples based on the mRNA (a), microRNA (b), and the combined (c) profiling data. In each plot, the expression level of its first biomodule for discrimination of tumor (labeled as 1 in the plot) versus normal (labeled as 0 in the plot) is given on the x axis and of its second biomodule on the y axis.

YANG XiNan, et al. Chinese Sci Bull September (2010) Vol.55 No.3

cate that, with the exception of colon cancer, no specific tissue type alters the total accuracy of classification of the combined expression profile, implying that there are common expression patterns among distinct kinds of epithelial cancers in the context of expression levels of both mRNA and microRNA. The lower accuracy of cancer-vs-normal diagnosis of the colon samples in this dataset suggests a tissue-specific expression pattern among these colon samples, because the dataset contains eleven different anatomical locations of epithelial tissues. Interestingly, for each tissue type, the analysis of the combined microRNA-mRNA expression resulted in a relatively higher accuracy than the analysis of either the microRNA or the mRNA expression profiles alone.

2.2 Identification and the quantitative description of the discovered biomodules and their interaction networks

We previously confirmed that combining the expression profiles of mRNAs and microRNAs can lead to higher classification power. In order to discover the common group of mRNA and microRNA interactions that best classified epithelial cancers separately from normal tissue, regardless of the specific cancer or anatomical location, we employed PLR on the whole combined microRNA and mRNA expression profiles to generate ten prioritized groups (biomodule candidates). To answer the question of how many biomodules are optimal for classification, we conducted a prediction-based statistical inference on the datasets [56]. Six microRNA-mRNA biomodules contributed to the lowest 95% quantile of the error rate on the combined data (Figure 3). In particular, the error rate of the microRNA data stably decreased when n < 3, while the combined data showed an increasing tendency of the error rate when n > 6. Our results confirm the six microRNA-mRNA biomodules that perform commonly in several types of human epithelial tumor. Thus our further investigation and discussion will focus on the leading six groups, which are referred to as "expression microRNA-mRNA biomodule". The biomodules contain 10 distinct microRNAs and 98 distinct genes (3 microRNAs and 24 genes, 3 microRNAs and 22 genes, 4 microRNAs and 21 genes, 5 microRNAs and 24 genes, 1 microRNA and 6 genes, 1 microRNA and 19 genes, respectively).

Two microRNAs and 11 mRNAs were involved in multiple microRNA-mRNA biomodules, such as miR-30e, miR-15a, ABO, METTL7A, CFD, IRAK1, PTMS, MED6, etc. (Suppl. Table 1). A two-dimensional projection of tumor (red 1) and normal (black 0) samples into the space of the expression level of the first biomodule (x axis) and second biomodule (y axis) separates the samples of tumor and the normal tissue (Figure 4). Moreover, the combined analysis of microRNA and mRNA (leftmost projection of Figure 4) yields a larger between-cluster distance than within-cluster distances.

2.3 Expression patterns of the microRNA-mRNA biomodules that classify epithelial cancers and normal samples across 11 types of tissues

The genes and/or microRNAs within a biomodule should have the biggest within-module correlation and the biggest central distance between cancer and normal sample groups. The mutual information (MI) has been used as a measurement between two genes related to their degree of independence [57]. The hypothesis here is that a higher mutual information observed between the expression of two genes and/or microRNAs indicates a closer biological relationship. Therefore, we used MI to validate the expression pattern of the combined six biomodules identified by the PLR algorithm as best predictors (lowest prediction error using cross-validation in Figure 2). Figure 5 shows that the six modules jointly can distinguish epithelial cancer from normal tissue samples using MI measurements. We also found high mutual information between the microRNAs and/or genes in the six biomodules (data not shown).

2.4 Gene Ontology (GO) enrichment of the genes associated with these biomodules

To further understand the biological mechanism underlying the identified merged microRNA-mRNA biomodule, we conducted a Gene Ontology enrichment of the biological processes and also reviewed the literature for the genes involved in the merged microRNA-mRNA biomodule. Table 3 gives 7 Biological Process terms defined by Gene Ontology (Built: 15-Mar-2006), which are significantly over-represented within the 98 genes based on running hypergeometric tests for each GO term of two Affymetrix chips (hu35ksuba and hu6800). Many of these biological processes are known to be involved in oncogenesis (e.g. DNA replication, growth, nucleic acid metabolism, etc.). The hypergeometric tests, run with the R CompdiagTools package, evaluate the likelihood that the corresponding number of annotations occurs in a random list of genes of the same size. PTMS [58], MCM7 [59,60], TK1, and NFIB are the genes important for DNA replication (hypergeometric p-value is 0.007). Higher expression levels of PTMS [61], TK1 [62], and MCM7 [60] in epithelial carcinoma cells than in normal cells have been reported. The over-expression of IRAK1 and under-expression of TGFBR2 in tumor cells have been previously reported as well [63,64]. The genes enriched in protein amino acid dephosphorylation are all well-known tumor suppressors or oncogenes. Both DUSP16 [65], positively, and TGFBR2 [66,67], negatively, are involved in the MAPK signaling pathway. CDC25B phosphatase, as a kind of cyclin-dependent kinase activator,

plays a key role in cell-cycle progression and the MAPK signaling pathway with positive expression [68]. In contrast, each biomodule individually did not have sufficient genes to achieve statistical power or to identify meaningful non-trivial processes (Suppl. Table 2).

Figure 5 The heatmap of log-transformed expression levels of microRNA-gene biomodules across 89 epithelial samples. The 10 microRNAs (symbols in red), the 98 genes and all samples are ordered by hierarchical clustering with complete linkage on the Mutual Information distance of the expression levels, using the Bioconductor package bioDist. The two black vertical lines delimit the normal samples that were clustered together. The 89 samples are annotated as "tumor_T" or "tumor_N" (e.g. normal kidney tissue is Kid-N); Kid=kidney, BLDR=bladder, PAN=pancreas, BRST=breast, UT=uterus, MELA=melanoma, MESO=mesothelioma, PROST=prostate.

2.5 Evaluation 1 – Enrichment of microRNA targets among inversely correlated microRNA-mRNA pairs

The PLR identified six microRNA-mRNA biomodules comprising 10 microRNAs and 98 mRNAs. The average expression of every one of the 10 microRNAs was down-regulated in the epithelial cancer group as compared with the normal tissue group, while 59 out of 98 mRNAs were up-regulated. The other 39 mRNAs in this biomodule were down-regulated in cancer. Five databases of microRNA target predictions based on sequences were used: miRBase v5 [14,15], miRanda version of July 2007 [12], PicTar server version 4.0.24 [16,17], TarBase v4.0 [18] and TargetScan v3.1 [19]. They were used to ascribe relationships between genes and microRNAs within each biomodule cluster to generate the network (shown in Figure 6) comprising 35 microRNA-target relationships. Among the biomodules, 208 distinct pairs of inversely correlated microRNA-mRNA were found, of which 29 were also found in the five putative microRNA target databases based on sequence similarities, which contain 121755 distinct microRNA-target pairs. Altogether, 217 microRNAs and 16063 genes were studied, with a potential for 3485671 microRNA-mRNA interactions. The Fisher's exact test therefore showed a significant enrichment (p = 3×10⁻¹⁰, odds ratio = 4.5) of observed inversely correlated microRNA-mRNA pairs in the six biomodules, corroborating the validity of the proposed approach. In some cases, the microRNA target was only up-regulated in some subset of epithelial cancers (tissue-specific inverse correlation with the microRNA). To address this question, we further measured the tissue-specific log-transformed group distance of expression for every microRNA and gene in the six biomodules (Suppl. Table 3). Eight putative target genes were down-regulated in the epithelial cancers group as compared with the normal tissue group and up-regulated in a specific cancer tissue as compared with the expression in the comparable normal tissue (Suppl. Table 3). Three microRNAs (hsa-miR-143, -193 and hsa-let-7b) were down-regulated in every cancer except colon cancer, while hsa-miR-1 was down-regulated in all cancers except pancreas cancer. Besides, there are four down-regulated putative target genes (SPCS1, FABP4, PDK4 and IHPK2) of one of the 10 microRNAs in the combined biomodules. Note that our method did not make any assumptions on the mechanism of correlation or inverse correlation between microRNA and gene expression in the same biomodule; therefore it identified co-expression as well as inverse co-expression patterns. As described in the background, correlation of microRNA and mRNA expression may be attributed to mechanisms such as genes hosting an intragenic microRNA being correlated with the expression of their intragenic microRNA [42], or target genes being regulated by other mechanisms such as transcription factors [29,34,69] and other complex feedback controls.

Figure 6 Regulatory network consisting of microRNAs and their putative gene targets in the six combined microRNA-mRNA biomodules. Circle (gene), box (microRNA), pink node (up-regulated gene or microRNA in epithelial cancer), blue node (down-regulated gene or RNA in epithelial cancer), and orange node (tissue-specific up-regulation of the gene target associated with the down-expressed microRNAs as described in Suppl. Table 3). It includes 9 out of 10 down-regulated microRNAs and their predicted target genes (n=26) summarized from five public databases observed among the 98 genes of the six combined biomodules.
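The enrichment reported in Section 2.5 can be retraced from the four counts given in the text. The following is a minimal sketch rather than the authors' implementation: the function name `target_enrichment` is hypothetical, the 2x2 layout is an assumption about how the counts were tabulated, and the p-value is a one-sided hypergeometric tail, so it is not expected to reproduce the reported p-value exactly.

```python
from math import comb


def target_enrichment(n_total, n_database, n_inverse, n_overlap):
    """Odds ratio and one-sided p-value for a 2x2 enrichment table.

    n_total    -- all possible microRNA-mRNA pairs
    n_database -- pairs listed in the target databases
    n_inverse  -- inversely correlated pairs found in the biomodules
    n_overlap  -- inversely correlated pairs also listed in the databases
    """
    a = n_overlap                   # inverse and in a database
    b = n_inverse - n_overlap       # inverse only
    c = n_database - n_overlap      # database only
    d = n_total - n_database - b    # neither
    odds_ratio = (a * d) / (b * c)

    # Hypergeometric tail: probability of at least n_overlap database
    # pairs among n_inverse pairs drawn at random from n_total.
    upper = min(n_database, n_inverse)
    numerator = sum(
        comb(n_database, k) * comb(n_total - n_database, n_inverse - k)
        for k in range(n_overlap, upper + 1)
    )
    p_value = numerator / comb(n_total, n_inverse)
    return odds_ratio, p_value


# Counts quoted in Section 2.5.
odds_ratio, p_value = target_enrichment(3485671, 121755, 208, 29)
print(round(odds_ratio, 1), p_value)
```

Exact integer arithmetic via `math.comb` is used here so that the very small tail probability is not lost to floating-point underflow before the final division.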

Table 3 Biological processes enriched in 98 genes of merged microRNA-mRNA biomodule. The listed GO terms meet the standard of significance (p≤0.05, gene count≥2, CompdiagTools software, see Methods)

GO ID | Biological process | p | Genes
0007178 | Transmembrane receptor protein serine/threonine kinase signaling pathway | 0.003 | IRAK1, TGFBR2
0051216 | Cartilage development | 0.003 | BMP2, ZBTB7A
0006957 | Complement activation, alternative pathway | 0.005 | CFD, CFB
0006260 | DNA replication | 0.007 | NFIB, PTMS, MCM7, TK1
0040007 | Growth | 0.008 | BMP2, INHBA
0006139 | Nucleobase, nucleoside, nucleotide and nucleic acid metabolism | 0.05 | FPGS, TK1, FPGS
0006470 | Protein amino acid dephosphorylation | 0.05 | DUSP16, TGFBR2, CDC25B
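The hypergeometric test behind each row of Table 3 can be sketched as follows. This is an illustrative stand-in for the CompdiagTools routine, not its actual code: the function name `go_term_pvalue` and the toy gene sets are hypothetical, and the real analysis used the genes on the two Affymetrix chips as the universe.

```python
from math import comb


def go_term_pvalue(universe, term_genes, selected):
    """Over-representation of one GO term in a selected gene list.

    Returns the number of selected genes annotated to the term and the
    probability of seeing at least that many in a random gene list of
    the same size drawn from the universe.
    """
    universe = set(universe)
    term = set(term_genes) & universe
    chosen = set(selected) & universe
    hits = len(term & chosen)

    upper = min(len(term), len(chosen))
    numerator = sum(
        comb(len(term), k) * comb(len(universe) - len(term), len(chosen) - k)
        for k in range(hits, upper + 1)
    )
    return hits, numerator / comb(len(universe), len(chosen))


# Toy example with made-up gene sets (not the chip annotation).
universe = ["TK1", "MCM7", "PTMS", "NFIB", "ABO", "CFD", "BMP2", "INHBA"]
dna_replication = ["TK1", "MCM7", "PTMS", "NFIB"]
selected = ["TK1", "MCM7", "PTMS", "ABO"]
hits, p_value = go_term_pvalue(universe, dna_replication, selected)
print(hits, p_value)
```

A term would then be listed when its p-value and hit count pass the thresholds stated in the table caption (p≤0.05, gene count≥2).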


2.6 Evaluation 2 – Enrichment of co-occurrence of one "genes or microRNA" and one epithelial cancer in the abstracts indexed in PubMed

To further evaluate the functional role of the 10 microRNAs and 98 genes in our merged epithelial cancer biomodule, we conducted a PubMed literature search to estimate the enrichment of co-occurrence of the symbol of (i) the microRNA or gene with that of (ii) any of the 11 types of epithelial cancers (Fisher's exact test). A true positive result was assumed to be unadjusted p < 0.05. As of April 15th, 2009, there were 18789608 papers indexed in the PubMed database. The numbers of abstracts containing each symbol, each cancer type and both were recorded. Of the gene and microRNA symbols taken from the six biomodules that have PubMed entries, 78% (69 out of 89) co-occurred significantly (p<0.05) with at least one term associated with epithelial cancers (Suppl. Table 4). This result strongly suggests that the proposed method can enrich known facts about the regulation of epithelial cancers, that it likely predicts new facts, and that it is likely scalable to other biological problems. Moreover, among the results, 4 out of 9 identified microRNAs and 33 out of the 80 genes that have official symbols (89 total) met the threshold of p < 0.05 for co-occurrence with at least two epithelial cancers in PubMed: miR-30e, miR-15a, let-7b and miR-143, ABO, MYO9B, CFD, RAD1, IL1RN, S100A2, HBA2, SH3BP2, BMP2, MT1E, CEBPD, C1R, ADRM1, NUAK1, DNAJB6, CCNDBP1, MMP1, NDE1, MCM7, GPX3, BFAR, ARMCX1, ZBTB7A, PKP1, ITIH1, GRB14, FPGS, SLC31A1, TK1, TGFBR2, SFRP1, CDC25B and TUBB3 (Suppl. Table 4). Of note, there were 12 gene symbols for which we did not find a single PubMed result, which may be novel predictions. They were METTL7A, KRTCAP2, tcag7.1314, SPCS1, psiTPTE22, CCDC103, RASL12, TRIM69, KIAA2026, OSBPL10, C1orf142 and hCG_1731871.

All the identified microRNAs have already been reported as associated with human cancer, except miR-33. The known tumor-suppressive microRNAs are miR-30e in breast, head and neck, and lung cancer [70], miR-193a and miR-338 in oral squamous cell carcinoma [71], miR-15a in prostate cancer [72], miR-130a in the drug-resistant cell lines of ovarian cancer [73], miR-136 in leukemia [74], miR-143 in cervical cancer [75], colorectal cancer [76,77] and nasopharyngeal carcinoma [78], miR-1 in lung cancer [79] and hepatocellular cancer [80], and let-7b in poor-prognosis ovarian carcinoma [81], colon cancer [82], lung cancer [83] and melanoma [84]. Our result suggests that the new candidate miR-33 may also function as a growth suppressor in human epithelial cancer cells.

We found that apoptosis-related microRNAs (miR-338, miR-193a) and genes (BCL2L13, PDCD2, IRAK1, IHPK2, INHBA, BFAR and MCM7), as well as genes involved in the cell division cycle pathway (RAD1, PTMS, INHBA and CDC25B) and DNA replication (NFIB, PTMS, MCM7 and TK1), are promising markers for diagnosis of human epithelial cancer. Other known tumor markers that we found include IRAK1, TGFBR2, DUSP16, and CDC25B. Some tumor suppressor genes such as PDCD2 also showed consistent down-regulation in all human tumors derived from epithelial tissues. Our most interesting finding is that many genes in the group of diagnostic markers are the expected targets of the identified microRNAs. Although computational prediction of microRNA targets requires experimental validation, this observation further reveals the complicated relationship between microRNAs and genes in tumors [85]. We suggest that combining expression profiling of microRNA with mRNA is a promising strategy to predict the risk of epithelial tumors.

We also summarized and computed basic statistics on the genes and microRNAs in these microRNA-mRNA biomodules for further possible investigation. All 10 microRNAs and 75% of the identified genes show strongly significant changes (q < 0.05 [54]) when we applied fold-change equivalent measures to the log-transformed expression levels [53] (Suppl. Table 1). Note that since the biological events are not independent, we cannot assess the performance by individual genes. Individually, some of the identified diagnostic signatures have mediocre identification power, but if joined together they may have good classification power.

3 Discussion

In this paper, we described an integrated analysis of expression levels of protein-coding transcripts and non-coding transcripts. We used modern classification and evaluation methods based on penalized logistic regression (PLR) and found that the combined analysis of mRNA and microRNA expression profiles resulted in the lowest rate of misclassification. Furthermore, the biomodules found by PLR are sorted according to their variances in descending order. We then identified and investigated a group of six leading microRNA-mRNA biomodules whose collective expression was strongly associated with epithelial cancers of different origins as compared with their corresponding normal tissues. The first, second and third biomodules could individually distinguish cancers from controls, suggesting the diversity of distinct biomodules involved in distinct mechanisms associated with epithelial cancers.

Tumorigenesis is a multi-step process in which cancer cells acquire characteristics such as self-sufficiency in growth, evasion of apoptosis, insensitivity to growth inhibitory signals, limitless replicative potential, and so on. Early detection of cancer is the key for the application of successful therapies. MicroRNAs have recently been identified as a new class of genes with tumor suppressor and oncogenic functions. More researchers are considering the advances of microRNA expression profiling and discussing their potential in cancer diagnostics and prevention. It is known that most microRNAs are controlled by developmental or tissue-specific signaling and that there are tissue-specific regulatory microRNA and gene expression patterns. However, our study shows that some microRNAs are consistently under-expressed in 11 different types of epithelial tumors compared with normal tissues, strongly suggesting that there are common microRNA regulatory patterns occurring in distinct types of epithelial cancers from distinct anatomical origins. The regulatory network of the identified microRNA-gene biomodules includes co-expression and inverse co-expression patterns within a biomodule, implying the complexity of regulatory mechanisms beyond the direct microRNA target, such as a host microRNA gene co-expressed with its intragenic microRNA, more complex interactions mediated via signaling and transcriptional activity [30,34,42], or a microRNA and gene activated by the same regulator [30].

In contrast to the algorithm of decision fusion by Wang et al. using the same datasets, our study is superior in two points. Wang's "high level decision fusion" consists of analyzing microRNAs and mRNAs separately. (1) Our proposed "low-level data fusion" method is more suitable for mining the oncogenic microRNAs and their regulation of gene expression. We combined and scaled the two sources of expression profiles to produce a new meta-dataset. Moreover, the null hypothesis of the method adopted by Wang et al. was that the expression levels of the variables were the same for normal tissues and tumors [36]; however, a large fraction of the known microRNAs were down-expressed in cancer [6,86]. (2) Our method is good for feature selection because it is based on the PLR algorithm, which is designed for searching small functional groups of genes/microRNAs that act together and whose collective expression is strongly associated with response conditions. PLR is also superior to classification with state-of-the-art methods based on single genes [33]. Therefore, our method can identify the signature from complete mRNA and microRNA expression changes which might be ignored in an individual data source.

Compared with other studies that calculate the biomodules of microRNA-mRNA by combining both expression profiles, our study is different in two ways: (1) The proposed unbiased and genome-wide method identifies microRNA-mRNA biomodules inclusive of both correlations and inverse correlations between microRNA and mRNA, while previous methods only addressed the latter (biased) and generally also used previous knowledge from microRNA target databases to reduce the number of comparisons (not genome-wide) [24-27]. (2) The proposed method is an alternative to bi-clustering which results in biomodules made of dys-regulated microRNAs and mRNAs whose collective expression is strongly associated with cancer conditions, because the algorithm balances between the within-group co-expression and the between-group distance. In contrast, previous methods based on negative linear correlation between microRNA and mRNA expression profiles either did not distinguish cancer and normal conditions [24,26] or distinguished them depending on an arbitrary threshold of differential expression [25].

Future studies of microRNA-mRNA biomodules common to multiple cancers will be: (1) the development of robust biomodules by generating a subset of robust genes and microRNAs within biomodules using cross-validations or other studies; and (2) conducting biological validations on some uncharacterized mechanisms that we have predicted. (3) In addition, the open question in our identified microRNA-mRNA biomodules is that some microRNAs are co-expressed with their putative target genes in this dataset. We should add additional databases, such as transfac, or other approaches to further characterize the putative mechanism involved in this otherwise paradoxical result, particularly when this correlation is observed between a microRNA and its putative target gene. It challenges either the disease-specific roles of microRNA on its target genes [87] or the prediction precision of putative target genes in those databases. In fact, neuronal-enriched microRNAs have been discovered to be either positively or negatively co-expressed with their target genes [29]. Therefore, besides the repressive effect on target expression, microRNAs might act in concert with other regulatory processes [42] to regulate target gene expression [29,34,69]. The combination of genome-wide microRNA and mRNA expression thus provides additional information for understanding the multiple levels of the dys-regulatory network in human cancer by integrating microRNA into regulatory networks [88].

4 Conclusions

Current approaches to derive microRNA-mRNA interactions from their respective expression profiles over paired tissue samples are generally focused on statistical associations of their expressions and, in many cases, assume an inverse correlation for the interaction. Different from others, we focused on coordinating mRNA and microRNA expressions together by rescaling their normalized values and treating them as probe-sets of a common pool. We then developed and evaluated a novel comprehensive method to systematically identify, on a genome scale, pan-expression biomodules common to distinct cancers of the same tissues. These pan-expression biomodules are enriched in genes and microRNAs that have previously been identified in epithelial cancers, using literature review. Since microRNAs can sometimes induce degradation of their complementary mRNA molecules, we confirmed that the biomodules were significantly enriched in microRNAs associated with their inversely correlated putative gene targets. Additionally, we have demonstrated that the pan-expression biomodules are robust classifiers of cancer vs. normal conditions. Taken together, these observations support the internal validity and biological consistency of the pan-expression biomodules in epithelial cancers. Further, the regulatory network of the identified microRNAs and corresponding genes in these biomodules may contribute to the understanding of the complexity of microRNA-mRNA interactions by providing unbiased synthetic models and hypothesis generation to cancer biologists.

This work was supported by the National Natural Science Foundation of China (60971099, 60671018 and 60771024), Center for Multilevel Analyses of Genomic and Cellular Networks (1U54CA121852-01A1), and the Cancer Research Foundation. We thank WALTS Adrienne for her contribution to editing. We also thank XIE Jianming for contributive discussion on the biological impact.

1 Rhodes, D.R., et al., Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A, 2004. 101(25): p. 9309-14.
2 Segal, E., et al., A module map showing conditional activity of expression modules in cancer. Nat Genet, 2004. 36(10): p. 1090-8.
3 Yang, X., S. Bentink, and R. Spang, Detecting common gene expression patterns in multiple cancer outcome entities. Biomed Microdevices, 2005. 7(3): p. 247-51.
4 Volinia, S., et al., A microRNA expression signature of human solid tumors defines cancer gene targets. Proc Natl Acad Sci U S A, 2006. 103(7): p. 2257-61.
5 Calin, G.A., et al., Frequent deletions and down-regulation of micro-RNA genes miR15 and miR16 at 13q14 in chronic lymphocytic leukemia. Proc Natl Acad Sci U S A, 2002. 99(24): p. 15524-9.
6 Calin, G.A., et al., A MicroRNA signature associated with prognosis and progression in chronic lymphocytic leukemia. N Engl J Med, 2005. 353(17): p. 1793-801.
7 Iorio, M.V., et al., MicroRNA gene expression deregulation in human breast cancer. Cancer Res, 2005. 65(16): p. 7065-70.
8 Yanaihara, N., et al., Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell, 2006. 9(3): p. 189-98.
9 Ruike, Y., et al., Global correlation analysis for micro-RNA and mRNA expression profiles in human cell lines. J Hum Genet, 2008. 53(6): p. 515-23.
10 Ambs, S., et al., Genomic profiling of microRNA and messenger RNA reveals deregulated microRNA expression in prostate cancer. Cancer Res, 2008. 68(15): p. 6162-70.
11 Pasquinelli, A.E., et al., Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. Nature, 2000. 408(6808): p. 86-9.
12 John, B., et al., Human MicroRNA targets. PLoS Biol, 2004. 2(11): p. e363.
13 Lu, J., et al., MicroRNA expression profiles classify human cancers. Nature, 2005. 435(7043): p. 834-8.
14 Griffiths-Jones, S., miRBase: the microRNA sequence database. Methods Mol Biol, 2006. 342: p. 129-38.
15 Griffiths-Jones, S., et al., miRBase: tools for microRNA genomics. Nucleic Acids Res, 2008. 36(Database issue): p. D154-8.
16 Yang, Y., Y.P. Wang, and K.B. Li, MiRTif: a support vector machine-based microRNA target interaction filter. BMC Bioinformatics, 2008. 9 Suppl 12: p. S4.
17 Chen, K. and N. Rajewsky, Natural selection on human microRNA binding sites inferred from SNP data. Nat Genet, 2006. 38(12): p. 1452-6.
18 Sethupathy, P., B. Corda, and A.G. Hatzigeorgiou, TarBase: A comprehensive database of experimentally supported animal microRNA targets. RNA, 2006. 12(2): p. 192-7.
19 Grimson, A., et al., MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol Cell, 2007. 27(1): p. 91-105.
20 Bartel, D.P., MicroRNAs: target recognition and regulatory functions. Cell, 2009. 136(2): p. 215-33.
21 Sethupathy, P., M. Megraw, and A.G. Hatzigeorgiou, A guide through present computational approaches for the identification of mammalian microRNA targets. Nat Methods, 2006. 3(11): p. 881-6.
22 Lewis, B.P., et al., Prediction of mammalian microRNA targets. Cell, 2003. 115(7): p. 787-98.
23 Lanza, G., et al., mRNA/microRNA gene expression profile in microsatellite unstable colorectal cancer. Mol Cancer, 2007. 6: p. 54.
24 Joung, J.G., et al., Discovery of microRNA-mRNA modules via population-based probabilistic learning. Bioinformatics, 2007. 23(9): p. 1141-7.
25 Liu, B., J. Li, and A. Tsykin, Discovery of functional miRNA-mRNA regulatory modules with computational methods. Journal of Biomedical Informatics. In Press, Corrected Proof.
26 Tran, D.H., K. Satou, and T.B. Ho, Finding microRNA regulatory modules in human genome using rule induction. BMC Bioinformatics, 2008. 9 Suppl 12: p. S5.
27 Xin, F., et al., Computational analysis of microRNA profiles and their target genes suggests significant involvement in breast cancer antiestrogen resistance. Bioinformatics, 2009. 25(4): p. 430-4.
28 Gennarino, V.A., et al., MicroRNA target prediction by expression analysis of host genes. Genome Res, 2009. 19(3): p. 481-90.
29 Tsang, J., J. Zhu, and A. van Oudenaarden, MicroRNA-mediated feedback and feedforward loops are recurrent network motifs in mammals. Mol Cell, 2007. 26(5): p. 753-67.
30 O'Donnell, K.A., et al., c-Myc-regulated microRNAs modulate E2F1 expression. Nature, 2005. 435(7043): p. 839-43.
31 Ramaswamy, S., et al., Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci U S A, 2001. 98(26): p. 15149-54.
32 Zhu, J. and T. Hastie, Classification of gene microarrays by penalized logistic regression. Biostatistics, 2004. 5(3): p. 427-43.
33 Shen, L. and E.C. Tan, Dimension reduction-based penalized logistic regression for cancer classification using microarray data. IEEE/ACM Trans Comput Biol Bioinform, 2005. 2(2): p. 166-75.
34 Yu, X., et al., Analysis of regulatory network topology reveals functionally distinct classes of microRNAs. Nucleic Acids Res, 2008. 36(20): p. 6494-503.
35 Reimers, M. and V.J. Carey, Bioconductor: an open source framework for bioinformatics and computational biology. Methods Enzymol, 2006. 411: p. 119-34.
36 Wang, Y., et al., Classification for poorly differentiated tumor classification using both messenger rna and microrna expression profiles, in 2006 Computational Systems Bioinformatics Conference (CSB 2006). 2006: Stanford, California.
37 Dettling, M. and P. Buhlmann, Finding predictive gene groups from microarray data. Journal of Multivariate Analysis, 2004. 90(1): p. 106-131.
38 Cessie, S.L. and J.V. Houwelingen, Ridge estimators in logistic regression. Applied Statistics, 1990. 41: p. 191-201.
39 Alzola, C. and F. Harrell, An Introduction to S and the Hmisc and Design Libraries. Available from: http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
40 Dettling, M. and P. Buhlmann, Supervised clustering of genes. Genome Biol, 2002. 3(12): p. RESEARCH0069.
41 Huang, J.C., Q.D. Morris, and B.J. Frey, Bayesian inference of MicroRNA targets from sequence and expression data. J Comput Biol, 2007. 14(5): p. 550-63.
42 Baskerville, S. and D.P. Bartel, Microarray profiling of microRNAs reveals frequent coexpression with neighboring miRNAs and host genes. RNA, 2005. 11(3): p. 241-7.
43 Dudoit, S., J. Fridlyand, and T. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 2002. 97(457): p. 77-87.
44 Fort, G. and S. Lambert-Lacroix, Classification using partial least squares with penalized logistic regression. Bioinformatics, 2005. 21(7): p. 1104-11.
45 Jack, F.C.-X., et al., Threefold vs. fivefold cross validation in one-hidden-layer and two-hidden-layer predictive neural network modeling of machining surface roughness data. Journal of Manufacturing Systems, 2005. 24(2): p. 93-105.
46 Campbell, G.P., et al., Compositional data analysis for elemental data in forensic science. Forensic Sci Int, 2009. 188(1-3): p. 81-90.
47 Marchese, A., et al., Two gene duplication events in the human and primate dopamine D5 receptor gene family. Gene, 1995. 154(2): p. 153-8.
48 Nicholson, B.J., On the F-Distribution for Calculating Bayes Credible Intervals for Fraction Nonconforming, in IEEE Transactions on Reliability. 1985. p. 227-228.
49 Harper, W.L. and C.A. Hooker, eds. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Confidence Intervals vs. Bayesian Intervals. 1976. 175.
50 Dettling, M. and P. Buhlmann, Finding predictive gene groups from microarray data. Journal of Multivariate Analysis, 2004. 90(1): p. 106-31.
51 Falcon, S. and R. Gentleman, Using GOstats to test gene lists for GO term association. Bioinformatics, 2007. 23(2): p. 257-8.
52 Martin-Subero, J.I., et al., New insights into the biology and origin of mature aggressive B-cell lymphomas by combined epigenomic, genomic, and transcriptional profiling. Blood, 2009. 113(11): p. 2488-97.
53 Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100(16): p. 9440-5.
54 Scheid, S. and R. Spang, twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics, 2005. 21(12): p. 2921-2.
55 Becker, K.G., et al., PubMatrix: a tool for multiplex literature mining. BMC Bioinformatics, 2003. 4: p. 61.
56 Dudoit, S. and J. Fridlyand, A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol, 2002. 3(7): p. RESEARCH0036.
57 Margolin, A.A., et al., ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 2006. 7 Suppl 1: p. S7.
58 Clinton, M., et al., Evidence for nuclear targeting of prothymosin and parathymosin synthesized in situ. Proc Natl Acad Sci U S A, 1991. 88(15): p. 6608-12.
59 Vareli, K., et al., Nuclear distribution of prothymosin alpha and parathymosin: evidence that prothymosin alpha is associated with RNA synthesis processing and parathymosin with early DNA replication. Exp Cell Res, 2000. 257(1): p. 152-61.
60 Lei, M., The MCM complex: its role in DNA replication and implications for cancer therapy. Curr Cancer Drug Targets, 2005. 5(5): p. 365-80.
61 Tsitsilonis, O.E., et al., The prognostic value of alpha-thymosins in breast cancer. Anticancer Res, 1998. 18(3A): p. 1501-8.
62 Fujiwaki, R., et al., Thymidine kinase in epithelial ovarian cancer: relationship with the other pyrimidine pathway enzymes. Int J Cancer, 2002. 99(3): p. 328-35.
63 Holleman, A., et al., The expression of 70 apoptosis genes in relation to lineage, genetic subtype, cellular drug resistance, and outcome in childhood acute lymphoblastic leukemia. Blood, 2006. 107(2): p. 769-76.
64 Biswas, S., et al., Transforming growth factor beta receptor type II inactivation promotes the establishment and progression of colon cancer. Cancer Res, 2004. 64(14): p. 4687-92.
65 Hoornaert, I., et al., MAPK phosphatase DUSP16/MKP-7, a candidate tumor suppressor for region 12p12-13, reduces BCR-ABL-induced transformation. Oncogene, 2003. 22(49): p. 7728-36.
66 Ku, J.L., et al., Establishment and characterization of four human pancreatic carcinoma cell lines. Genetic alterations in the TGFBR2 gene but not in the MADH4 gene. Cell Tissue Res, 2002. 308(2): p. 205-14.
67 Grady, W.M. and S.D. Markowitz, Genetic and epigenetic alterations in colon cancer. Annu Rev Genomics Hum Genet, 2002. 3: p. 101-28.
68 Guo, J., et al., Expression and functional significance of CDC25B in human pancreatic ductal adenocarcinoma. Oncogene, 2004. 23(1): p. 71-81.
69 Shalgi, R., et al., Global and local architecture of the mammalian microRNA-transcription factor regulatory network. PLoS Comput Biol, 2007. 3(7): p. e131.
70 Wu, F., et al., MicroRNA-mediated regulation of Ubc9 expression in cancer cells. Clin Cancer Res, 2009. 15(5): p. 1550-7.
71 Kozaki, K., et al., Exploration of tumor-suppressive microRNAs silenced by DNA hypermethylation in oral cancer. Cancer Res, 2008. 68(7): p. 2094-105.
72 Bonci, D., et al., The miR-15a-miR-16-1 cluster controls prostate cancer by targeting multiple oncogenic activities. Nat Med, 2008. 14(11): p. 1271-7.
73 Sorrentino, A., et al., Role of microRNAs in drug-resistant ovarian cancer cells. Gynecol Oncol, 2008. 111(3): p. 478-86.
74 Yu, J., et al., Human microRNA clusters: genomic organization and expression profile in leukemia cell lines. Biochem Biophys Res Commun, 2006. 349(1): p. 59-68.
75 Lui, W.O., et al., Patterns of known and novel small RNAs in human cervical cancer. Cancer Res, 2007. 67(13): p. 6031-43.
76 Chen, X., et al., Role of miR-143 targeting KRAS in colorectal tumorigenesis. Oncogene, 2009. 28(10): p. 1385-92.
77 Michael, M.Z., et al., Reduced accumulation of specific microRNAs in colorectal neoplasia. Mol Cancer Res, 2003. 1(12): p. 882-91.
78 Chen, H.C., et al., MicroRNA deregulation and pathway alterations in nasopharyngeal carcinoma. Br J Cancer, 2009. 100(6): p. 1002-11.
79 Nasser, M.W., et al., Down-regulation of micro-RNA-1 (miR-1) in lung cancer. Suppression of tumorigenic property of lung cancer cells and their sensitization to doxorubicin-induced apoptosis by miR-1. J Biol Chem, 2008. 283(48): p. 33394-405.
80 Datta, J., et al., Methylation mediated silencing of MicroRNA-1 gene and its role in hepatocellular carcinogenesis. Cancer Res, 2008. 68(13): p. 5049-58.
81 Karwowska, S. and S. Zolla-Pazner, Passive immunization for the treatment and prevention of HIV infection. Biotechnol Ther, 1991. 2(1-2): p. 31-48.
82 Akao, Y., Y. Nakagawa, and T. Naoe, let-7 microRNA functions as a potential growth suppressor in human colon cancer cells. Biol Pharm Bull, 2006. 29(5): p. 903-6.
83 Takamizawa, J., et al., Reduced expression of the let-7 microRNAs in human lung cancers in association with shortened postoperative survival. Cancer Res, 2004. 64(11): p. 3753-6.
84 Schultz, J., et al., MicroRNA let-7b targets important cell cycle molecules in malignant melanoma cells and interferes with anchorage-independent growth. Cell Res, 2008. 18(5): p. 549-57.
85 Wang, X. and X. Wang, Systematic identification of microRNA functions by combining target prediction and expression profiling. Nucleic Acids Res, 2006. 34(5): p. 1646-52.
86 Thomson, J.M., et al., Extensive post-transcriptional regulation of microRNAs and its implications for cancer. Genes Dev, 2006. 20(16): p. 2202-7.
87 Kuhn, D.E., et al., Experimental validation of miRNA targets. Methods, 2008. 44(1): p. 47-54.
88 Kanellopoulou, C. and S. Monticelli, A role for microRNAs in the development of the immune system and in the pathogenesis of cancer. Semin Cancer Biol, 2008. 18(2): p. 79-88.