AUCTSP: an Improved Biomarker Gene Pair Class Predictor Dimitri Kagaris1* , Alireza Khamesipour1 and Constantin T

Kagaris et al. BMC Bioinformatics (2018) 19:244 https://doi.org/10.1186/s12859-018-2231-1 RESEARCH ARTICLE Open Access AUCTSP: an improved biomarker gene pair class predictor Dimitri Kagaris1* , Alireza Khamesipour1 and Constantin T. Yiannoutsos2 Abstract Background: The Top Scoring Pair (TSP) classifier, based on the concept of relative ranking reversals in the expressions of pairs of genes, has been proposed as a simple, accurate, and easily interpretable decision rule for classification and class prediction of gene expression profiles. The idea that differences in gene expression ranking are associated with presence or absence of disease is compelling and has strong biological plausibility. Nevertheless, the TSP formulation ignores significant available information which can improve classification accuracy and is vulnerable to selecting genes which do not have differential expression in the two conditions (“pivot" genes). Results: We introduce the AUCTSP classifier as an alternative rank-based estimator of the magnitude of the ranking reversals involved in the original TSP. The proposed estimator is based on the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and as such, takes into account the separation of the entire distribution of gene expression levels in gene pairs under the conditions considered, as opposed to comparing gene rankings within individual subjects as in the original TSP formulation. Through extensive simulations and case studies involving classification in ovarian, leukemia, colon, breast and prostate cancers and diffuse large b-cell lymphoma, we show the superiority of the proposed approach in terms of improving classification accuracy, avoiding overfitting and being less prone to selecting non-informative (pivot) genes. Conclusions: The proposed AUCTSP is a simple yet reliable and robust rank-based classifier for gene expression classification. While the AUCTSP works by the same principle as TSP, its ability to determine the top scoring gene pair based on the relative rankings of two marker genes across all subjects as opposed to each individual subject results in significant performance gains in classification accuracy. In addition, the proposed method tends to avoid selection of non-informative (pivot) genes as members of the top-scoring pair. Keywords: Microarray data analysis, Gene expression, Gene selection, Receiver operating characteristic (ROC) curve, AUC, Leukemia, Breast cancer, Ovarian cancer, Colon cancer, Prostate cancer, Diffuse large B-Cell lymphoma Background is present (classification) or the risk for the occurrence Microarray data analysis is a high throughput method of disease in the future (prediction). While very complex used to gain information about gene functions inside cells. classifiers can be constructed, a number of authors have Thisinformationisinturnusedtodetectthepresenceor expressed concern with the “black box” nature of these absenceofdisease[1–3], and gain a better understanding approaches [5] preferring simpler more interpretable clas- of a disease mechanism [4]. sifiers in clinical applications [6, 7]. It is noted that the A particularly useful application of microarray tech- preference for the latter kind of classifiers should not be at nology uses microarray data to detect the presence of the expense of their performance. disease by combining gene expression levels from a num- Classification involves, at its most fundamental level, ber of genes, to provide information on whether disease a comparison between expression levels in one or more genes between two or more conditions (e.g., disease ver- *Correspondence: [email protected] sus no disease). This comparison can be based on a fairly 1 Department of Electrical and Computer Engineering, Southern Illinois heuristic criterion (e.g., fold-change in gene expression University, 1230 Lincoln Drive, 62901 Carbondale, IL, USA Full list of author information is available at the end of the article [8]), or by using parametric or non-parametric statistical © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Kagaris et al. BMC Bioinformatics (2018) 19:244 Page 2 of 13 methods [9–12]. There are several advantages and dis- Given a pair of genes (i, j),1 ≤ i = j ≤ G,thereversal advantages with each of these methods. For example, it score of the pair was defined in [6]as is biologically plausible that genes with large differential = ( > | = ) − ( > | = ) expression levels should be part of a classification crite- ij P Yi Yj C 1 P Yi Yj C 2 (1) rion. However, the fold-change criterion does not take where P(Yi > Yj|C = m) denotes the probability that the gene expression variability into account and determining expression level of gene i is larger than the expression level a cutoff is an arbitrary exercise [13]. On the other hand, of gene j in samples from class C,withC being equal to parametric statistical methods, which are based on some m = 1, 2. The score ij can be empirically approximated variant of the t-test, provide some sense of one’s confi- by the expression [6] dence on the gene expression difference, but frequently lose the intuitive appeal of heuristic methods like fold- n1 ( > ) n ( > ) = I1 Yi, Yj, =n +1 I2 Yi, Yj, change (e.g., when even small differences are statistically D = 1 − 1 ij n n significant). In addition, parametric methods make strong 1 2 and frequently untenable assumptions regarding the dis- (2) tribution of gene expression levels [13]. Non-parametric where index indicates the th subject, 1 ≤ ≤ n and methods, which are based on ranking gene expression I (Y > Y ) = 1ifY > Y in class m = 1, 2, and 0 levels, are expected to lose some information because of m i, j, i, j, otherwise. the use of ranks instead of actual gene-expression data. Obviously, the larger the , the higher the probabil- However, such methods are robust to deviations from ij ity that the expression levels of genes i and j have reverse parametric assumptions [13], and are less vulnerable to relative rankings in the two groups, and it is exactly this biases arising from data normalization and other pre- property that is used for classification by the TSP. More processing steps [14], which are plausibly assumed to be specifically, let (α, β) be the pair of genes that yields the rank-preserving [6, 7]. maximum score αβ = max{ } (referred to as the The fact that the TSP provides classifiers based on only ij Top Scoring Pair (TSP) [6]). Then the classification is two genes is also an attractive compromise in the so-called performed as follows: “bias-variance” tradeoff [15]. As a classifier’s performance Assume that is a combination of variance (random error) and bias (sys- tematic error), in many cases, high-dimensional classifiers P(Yα > Yβ |C = 1)>P(Yα > Yβ |C = 2) (3) withlowbias(duetogoodperformanceinthecurrent sample) have large variances (i.e., poor precision) in new i.e., samples. By contrast, simpler (and thus more rigid) clas- n1 n ( > ) = I1(Yα, > Yβ,) = + I2 Yα, Yβ, sifiers, while possibly having higher levels of bias, are less 1 > n1 1 (4) influencedbyaspecificsampleandmayhavebetterover- n1 n2 all performance (smaller variance) in multiple samples. Then a new subject s whose measured expression levels The simple TSP classifiers, it was hoped, would perform for genes a and b are Yα,s and Yβ,s respectively, will be clas- sufficiently well both in the current sample as well as in sified as belonging to the first class if Yα,s > Yβ,s,andto new samples. The TSP is a rank-based classifier in the the second class otherwise. sense that it uses the rankings of gene expression levels The genes in the top scoring pair as selected by the TSP within a gene profile rather than the levels themselves, an method may have a problem, as Lin et al. [5]alsopoint approach with significant advantages due to the nonpara- out: the selected genes may not be a pair of genuinely metric nature of the classification technique. The central up-regulated and down-regulated genes, but one of the idea behind the TSP classifier is that it identifies two genes selected genes in the pair happens to serve only as a ref- whose gene expression ranking changes between the two erence or “pivot” gene that may lead to a high TSP score conditions under consideration. This change lends itself but a rather non-informative gene pair. Most researchers to a simple biological interpretation as an inversion of have used more complicated methods or selected more mRNA abundance of the two genes in the two conditions features in order to overcome the mentioned problems. under consideration. The pair of genes selected by the In the proposed method we employ a simple statistic TSP [6], referred to as the top scoring pair (TSP), is found associated with the Receiver Operating Characteristic by the following approach: Consider G genes which have (ROC) curve that is commonly known as the Area Under been profiled by microarray analysis. Let n1 be the num- theROCcurve(AUROC)ortheAreaUndertheCurve ber of experiments from the first class with expression (AUC), for short. The ROC curve and the AUC in par- ={ ··· } levels Yi Yi,1, Yi,2, , Yi,n1 ,andletn2 be the number ticularhavebeenwidelyusedasameasureformicroar- of experiments from the second class with expression lev- ray classification and other medical diagnostic tests ={ ··· } = + els Yi Yi,n1+1, Yi,n1+2, , Yi,n ,wheren n1 n2. (see, e.g., [16–23]. Kagaris et al. BMC Bioinformatics (2018) 19:244 Page 3 of 13 The proposed method, referred to as AUCTSP (AUC- The probability PAUCTSP can be calculated by the Area based TSP), uses similar ideas as the TSP, thus benefiting Under the ROC Curve (AUC) [23].

Load more