Genomics 103 (2014) 48–55

Gene expression profile based classification models of psoriasis

Pi Guo a,b,c,1, Youxi Luo a,b,1, Guoqin Mai a,b, Ming Zhang f,g, Guoqing Wang d,e,⁎⁎, Miaomiao Zhao a,b, Liming Gao a,b, Fan Li d,e, Fengfeng Zhou a,b,⁎

a Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
b Key Lab for Health Informatics, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China
c Department of Public Health, Shantou University Medical College, No. 22 Xinling Road, Shantou, Guangdong 515041, PR China
d Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China
e Key Laboratory of Zoonosis Research, Ministry of Education, Jilin University, Changchun, Jilin 130021, PR China
f Department of Epidemiology and Biostatistics, Faculty of Infectious Diseases, University of Georgia, Athens, GA 30605, USA
g Institute of Bioinformatics, University of Georgia, Athens, GA 30605, USA

Article info

Article history:
Received 25 March 2013
Accepted 1 November 2013
Available online 13 November 2013

Keywords:
Gene expression profiles
Classification
Psoriasis

Abstract

Psoriasis is an autoimmune disease whose symptoms can significantly impair the patient's quality of life. It is mainly diagnosed through visual inspection of the lesional skin by experienced dermatologists. Currently no cure for psoriasis is available, due to limited knowledge about its pathogenesis and development mechanisms. Previous studies have profiled hundreds of differentially expressed genes related to psoriasis, but no robust psoriasis prediction model is available. This study integrated three feature selection algorithms, which revealed 21 features belonging to 18 genes as candidate markers. The final psoriasis classification model was established using the Incremental Feature Selection algorithm and utilizes only 3 features from 2 unique genes, IGFL1 and C10orf99. This model demonstrated highly stable prediction accuracy (99.81% on average) over three independent validation strategies. The two marker genes, IGFL1 and C10orf99, were revealed as upstream components of the growth signal transduction pathway of psoriatic pathogenesis.

© 2013 Elsevier Inc. All rights reserved.

⁎ Correspondence to: Fengfeng Zhou, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, PR China. Fax: +86-755-86392299.
⁎⁎ Correspondence to: Guoqing Wang, Department of Pathogeny Biology, Norman Bethune Medical College, Jilin University, Changchun, Jilin 130021, PR China.
E-mail addresses: [email protected], [email protected] (F. Zhou).
URL: http://www.healthinformaticslab.org/ffzhou/ (F. Zhou).
1 These authors contributed equally to this work.

0888-7543/$ – see front matter © 2013 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.ygeno.2013.11.001

1. Introduction

Psoriasis is a widespread chronic autoimmune disease with major symptoms on the skin and joints [1,2]. Psoriasis may increase the risks of stroke and myocardial infarction [3,4] and shares the same effective drug, p40-neutralizing antibodies, with the neurodegenerative Alzheimer's disease [5]. The prevalence of psoriasis varies considerably among geographic regions and ethnic groups, with a higher rate in the Caucasian population (e.g. 2.2% in the United States) and a lower rate in the Asian population (e.g. less than 1% in China) [6,7]. Along with the cutaneous manifestations, psoriasis is accompanied by inflammatory arthritis in up to 40% of cases [8]. Its phenotypic symptoms may also include epidermal hyperplasia, keratinocyte differentiations, angiogenesis, and infiltration of T lymphocytes, dendritic cells, neutrophils, cytokines as well as chemokines [9–11]. The current diagnosis of psoriasis relies heavily on the visual observation of the skin lesion biopsy by an experienced dermatologist [12], a procedure that is inefficient and labor intensive.

The majority of the large-scale psoriasis studies have employed microarray or qPCR techniques to compare the gene expression pattern in psoriatic lesion samples with healthy controls. Gudjonsson et al. systematically screened ~54,000 probe sets in the Affymetrix HG-U133 Plus 2 platform, and detected 179 unique genes for 223 probe sets with differential expression in the uninvolved psoriatic skin samples [13]. Lipid metabolism was demonstrated to be most associated with psoriasis, and three lipid-metabolism-related transcription factors, peroxisome proliferator-activated receptor alpha (PPARA), sterol regulatory element-binding protein (SREBF), and estrogen receptor 2 (ESR2), seem to regulate the dysregulated genes in the uninvolved psoriatic samples. Reischl et al. chose the Affymetrix HG-U133A platform to profile the expression patterns of ~22,000 probe sets, and detected 179 genes for 203 probe sets with more than 2-fold expression changes in the psoriatic lesion skin [14]. Their data strongly suggested that the Wnt pathway, which regulates stem cell proliferation, is associated with the development of psoriasis. But it remains elusive whether the Wnt pathway plays a causative role or merely reflects the human immune response to the psoriatic lesion. The human immune response to environmental stimuli, the type I interferon system, was screened in both the psoriatic lesion and uninvolved skin samples on the Affymetrix HG-U133 Plus 2 platform [15]. There were 1408 up-regulated and 1465 down-regulated probe sets detected in the psoriatic lesion skin samples, among which multiple T cell marker genes and two type I interferon inducible genes (STAT1 and ISG15) had expression changes of ~3-fold or higher. Swindell et al.
[16,17] compared the gene expression profiles of psoriasiform versus normal samples between human and mouse. The calculated lists of the top 5000 fold-change ranked up-regulated and down-regulated probe sets showed consistent overlaps among different mouse psoriasiform phenotypes. Unfortunately, the lists of marker genes barely overlapped between different studies, leaving little information to construct a psoriatic classification model.

We hypothesized that a robust disease classification model can be constructed based on marker genes. This study employed microarray based gene expression profiles, to which three widely used feature selection algorithms were applied to screen the psoriasis associated features (probe sets). The majority of the 21 resulting features were known to be associated with psoriasis. We further conducted a comprehensive performance evaluation of the neural network based binary classifiers trained with the Incremental Feature Selection strategy. The classifier based on the three features of IGFL1 and C10orf99 outperformed the rest of the models, with at least 99.43% accuracy in all three validation tests.

2. Methods

2.1. Data collection and preprocessing

Two microarray data sets with accession numbers GSE14905 [15] and GSE13355 [16,17] were retrieved from the Gene Expression Omnibus (GEO) Database [18]. The two data sets provide gene expression profiles for the same disease type (psoriasis) and the corresponding controls, but have no overlap of samples. This makes them an ideal paired set for detecting consistent psoriasis biomarkers. There are 176 participating subjects in total, including 91 psoriatic patients and 85 healthy controls, denoted by P = {P1, P2, …, P91} and N = {N1, N2, …, N85}. Samples from both data sets were processed by the standard protocol for the Affymetrix Human Genome U133 Plus 2.0 platform. The detailed information of the samples used in this study is listed in Supplementary Table S1.

The raw fluorescence intensity data within CEL files were processed by background correction, log2 transformation, and quantile normalization using the Robust Multichip Analysis (RMA) algorithm [19], as implemented in the program Affymetrix Expression Console™, which was downloaded from http://www.affymetrix.com/estore/index.jsp. The systematic batch biases were corrected by the Distance Weighted Discrimination (DWD) algorithm [20]. DWD eliminates source effects across different studies by finding a hyper-plane that separates the two systematic biases and adjusts the microarray data by projecting them on the hyper-plane through subtracting out the DWD plane multiplied by the batch mean.

We removed the probe sets with less discriminating power by measuring the overall variance. This procedure was performed with the function varFilter of the Bioconductor R package genefilter. We chose the default setting of the Inter-Quartile Range function IQR for the parameter var.func, and 0.5 as the var.cutoff. The function IQR was chosen for its robustness to outliers. This filtering step was applied on the DWD-processed data sets, and 27,336 probe sets of the 176 participating subjects were retained for further analysis.

This study investigated the binary classification problem between N1 = 91 psoriatic patients and N2 = 85 healthy controls. The total number of samples was N = 91 + 85 = 176. The expression level of each probe set was denoted as Fi, where i ∈ {1, 2, …, 27,336}. The following sections aimed to detect m features F′j, where j ∈ {1, 2, …, m}, from the M = 27,336 features of the binary classification model. We investigated various feature selection algorithms to detect psoriasis related features from the 27,336 probe sets, with classification accuracy as the most important criterion. The following three algorithms were chosen for their complementary performance features. ROC screens for the most significant phenotype-associated individual features, by investigating each feature's parameter-independent association [21]. SVM-RFE further optimizes the inter-feature associations, which can top-rank the important features that have weak individual phenotype-association [22]. Boruta removes the features with less contribution to classification accuracy by introducing random variables for the competition [23,24]. We believe that the consensus features derived from the three algorithms represent a reliable list of biomarker candidates. The R implementations of the three algorithms were used in this study.

2.2. SVM-RFE feature selection algorithm

SVM-RFE is a Recursive Feature Elimination (RFE) strategy utilizing the weight vector produced by the Support Vector Machine classification model [25]. RFE is a widely used strategy that recursively reduces the number of features by removing those with the lowest phenotype correlation rankings [22,26,27]. The Support Vector Machine algorithm was invented by Vladimir N Vapnik in 1979 for the classification problem [28], and it produces a weight vector for the input features after being trained to optimize the classification accuracy between the psoriatic and healthy samples. The generated weight vector is used to rank the features, and the RFE strategy recursively eliminates the lowest ranked features. The R package mSVM-RFE was chosen as the implementation of this algorithm [22].

2.3. ROC feature selection algorithm

The receiver operating characteristic (ROC) is a graphical plot from signal detection theory that illustrates the relationship between the true positive rate (TPR) and the false positive rate (FPR) of a binary classifier [29]. The ROC curve is one of the best measurements to rank the features between multiple groups of tissues [30]. We ranked the features based on the ROC-based binary classification performance of the psoriatic and healthy samples. In brief, the two expression distributions of a given feature and/or gene were calculated for the psoriatic and healthy samples, respectively. Next, the area under the ROC curve (AUC) was calculated to rank the features. The R package rocc was used as the implementation of this algorithm [21].

2.4. Boruta feature selection algorithm

The Boruta feature selection algorithm is a random forest based wrapper strategy that removes features proven to be less informative than random probes [31]. The classification algorithm, random forest, was chosen because it is computationally inexpensive and requires no manual parameter tuning. A random forest [22] collects votes over multiple decision tree based weak classifiers, which are independently trained over the binary classification data set between the psoriatic and healthy samples. The importance of a probe set is measured by how much the classification performance decreases when the values of the probe set are randomly permuted among samples. The trained random forest classifier provides an importance estimate for all features [22]. The R package implementation of Boruta was used for this study [24].

2.5. Classification model

This study utilized a feed-forward neural network classifier with one hidden layer of nodes [32] to distinguish the psoriatic samples from the healthy controls. A neural network is a mathematical model with a function f : X → Y, where X is a collection of input neurons and Y is a collection of output neurons. X = {F′j}, where j ∈ {1, 2, …, m}, is the set of features selected by the aforementioned feature selection algorithms. There are two output nodes, Y = {0, 1}, representing the predicted results of a healthy or psoriatic sample, respectively. The input and output neurons are connected by the hidden layers of neurons. In order to avoid model over-fitting, the number of hidden layers was set to one in this study. The rest of the parameters used the default values.
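As a rough illustration of the ROC-based ranking of Section 2.3 and of the consensus step that keeps only the features top-ranked by every algorithm, the following minimal Python sketch may be helpful. It is not the R code used in this study (the rocc, mSVM-RFE and Boruta packages); the function names auc, rank_by_auc and consensus, the toy probe IDs and the toy expression values are all hypothetical. The AUC is computed through its equivalence to the Mann-Whitney statistic, and taking max(AUC, 1 − AUC) to ignore the direction of regulation is an assumption of this sketch.

```python
from itertools import product


def auc(pos, neg):
    """Area under the ROC curve of a single feature, computed as the
    Mann-Whitney statistic: the fraction of (positive, negative) pairs
    in which the positive sample has the larger value (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def rank_by_auc(expr, labels):
    """Rank features by how well each one alone separates the classes.

    expr   : dict mapping a feature ID to a list of expression values
    labels : list of 1 (psoriatic) or 0 (healthy), in the same sample order
    """
    scores = {}
    for feature, values in expr.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        a = auc(pos, neg)
        # Assumption of this sketch: up- and down-regulation count equally.
        scores[feature] = max(a, 1.0 - a)
    return sorted(scores, key=scores.get, reverse=True)


def consensus(rankings, k):
    """Features that appear in the top k of every ranking."""
    return set.intersection(*(set(r[:k]) for r in rankings))


# Toy data: three hypothetical probe sets, two psoriatic and two healthy samples.
toy_expr = {"probe_a": [5.0, 6.0, 1.0, 2.0],
            "probe_b": [1.0, 2.0, 1.0, 2.0],
            "probe_c": [1.0, 6.0, 2.0, 5.0]}
toy_labels = [1, 1, 0, 0]
roc_ranking = rank_by_auc(toy_expr, toy_labels)

# Two further rankings stand in for the output of other algorithms.
other_rankings = [["probe_b", "probe_a", "probe_c"],
                  ["probe_a", "probe_c", "probe_b"]]
shared = consensus([roc_ranking] + other_rankings, k=2)
```

With real data, the three rankings would come from the three feature selection algorithms, and k would be set to the number of top-ranked features retained from each of them.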

2.6. Measurement of prediction performance

A neural network based binary classifier was evaluated using the following three performance measurements [33,34]. Let P = {P1, P2, …, P91} and N = {N1, N2, …, N85} be the positive and negative data sets, respectively. TP is the number of psoriatic samples predicted to be psoriatic, and FN is the number of psoriatic samples incorrectly predicted to be healthy. TN and FP are the numbers of healthy samples predicted to be healthy and psoriatic, respectively. The classification model's sensitivity and specificity are defined to be Sn = TP/(TP + FN) and Sp = TN/(TN + FP), respectively. The model's overall accuracy is Ac = (TP + TN)/(TP + FN + TN + FP).

A neural network classification model may be over-fitted if the number of features used in the model is similar to or larger than the number of training samples. This can be overcome by restricting the number of selected features and evaluating the classification model's cross-validation performance. This study measures how a classification model performs by the commonly used 10-fold cross-validation strategy, similar to [35–37]. Its principle is to randomly split the 91 psoriatic and 85 healthy samples into ten approximately equal-sized subsets P = {P1, P2, …, P10} and N = {N1, N2, …, N10}, respectively. For each pair of psoriatic and healthy sample subsets Pk and Nk, where k ∈ {1, 2, …, 10}, a single-hidden-layer feed-forward neural network classifier was trained over the remaining data P\Pk and N\Nk, i.e. the subsets Pl and Nl with l ≠ k, and the prediction performance was measured over the held-out subsets Pk and Nk. The overall prediction performances, Sn, Sp, and Ac, were aggregated over the 10 rounds of calculation.

3. Results

3.1. DWD adjustment results

The systematic batch bias between the two microarray data sets used in this study, GSE14905 [15] and GSE13355 [16,17], was corrected by the DWD algorithm [20]. Before the correction, the samples were clearly partitioned by their data set of origin in the unsupervised hierarchical clustering, as shown in Fig. 1(A). This separation suggests that the samples were more similar to those from the same microarray data set than to those from the other data set, and that samples with the same phenotype label, i.e. healthy or psoriatic, did not group together. A similar separation can be achieved by the first principal component in the principal component analysis (PCA) algorithm [38,39], as shown in Supplementary Fig. S1(a). The batch bias needs to be corrected before the two independently produced microarray data sets can be used as cross-validation data sets. After the DWD-based projection removed the batch bias, the two data sets were reasonably mixed together, as shown in the dendrogram in Fig. 1(B). Supplementary Fig. S1(b) shows
that at least the first four principal components and their pairs cannot separate the samples from the two different microarray data sets, thus supporting the hypothesis that the batch bias was corrected.

Fig. 1. Hierarchical clustering plots of the samples from the two microarray data sets (A) before and (B) after the DWD batch effect correction. The data sets GSE14905 and GSE13355 were colored in red and green, respectively. The three finally chosen features, i.e. 227736_at (gene: C10orf99), 227735_s_at (gene: C10orf99) and 239430_at (gene: IGFL1), were highlighted in red in the right vertical panel.

3.2. Gene sets determined by feature selection algorithms

Different feature selection algorithms rank the features with different phenotype-association measurements, and we hypothesized that features chosen by all three feature selection algorithms, i.e. SVM-RFE, ROC and Boruta, have stable discrimination power. We chose the 80 features top-ranked by each of the three algorithms. The intersection of the three top-ranked feature lists consisted of 21 features, corresponding to 18 human genes, as shown in Fig. 2. The probe set IDs, official gene symbols, and functional annotations of these features (probe sets) are detailed in Supplementary Table S2.

Fig. 2. The overlapping features top-ranked by the three feature selection algorithms. Each of the three algorithms, i.e. Boruta, SVM-RFE and ROC, chooses 80 features. There are 21 features in the overlapping region of the three algorithms.

We further investigated the discriminating power of the combined list of all 21 features. The expression patterns of the 21 features were visualized as an unsupervised hierarchical clustering heatmap of all 176 samples, as shown in Fig. 3. The 176 samples can be easily separated into their correct labels, with only two psoriatic samples incorrectly predicted as healthy. This visual separation gave a detection accuracy of Ac = 174/176 = 98.86% for the 176 samples. The 10-fold cross-validation of the neural network binary classifier rendered the same prediction result. Fig. 3 also demonstrated that all of the 21 features (probe sets) were up-regulated in the psoriatic samples.

Fig. 3. The hierarchical clustering heatmap of samples based on expression patterns of the 21 features. Each row corresponds to a feature and each column corresponds to a sample. The status of psoriasis or normal for each subject is shown in the bar above. The psoriatic and healthy samples are in red and blue, respectively.

3.3. Robustness test of the binary classifier

The prediction robustness of the binary classifier was evaluated using the Incremental Feature Selection (IFS) strategy. The selected 21 features were ranked by the averaged scores calculated from the three feature selection algorithms as {f1, f2, …, f21}. We calculated the overall accuracy Ac using the features {f1, f2, …, fi} (i ≤ 21) over three independent validations. The number of features i was chosen at the point where the average accuracy AvgAc over the three validations reaches its peak, as shown in Fig. 4. The three validation strategies are described below. For the two independently generated microarray data sets, GSE14905 and GSE13355, we first conducted a 10-fold cross-validation of the binary classifier over the combined data set, denoted here as 10FCV. In the second validation strategy, we used GSE14905 as the training data set and GSE13355 as the test set to validate the classifier, denoted here as tGSE14905. The third validation strategy, tGSE13355, was defined analogously to tGSE14905. The latter two validations evaluated the stability of the classification model on data sets with batch effects.

The binary classifier achieved its best average overall accuracy (99.81%) over the three validation strategies, as shown in Fig. 4. The top ranked feature, 227736_at (gene C10orf99), achieved 100% in Ac for both the 10FCV and tGSE13355 validations, but did not work equally well on the tGSE14905 validation. The second best feature, 227735_s_at, also resides within C10orf99. But the classifier based on the two features, 227736_at and 227735_s_at, performed worse than 227736_at alone over the 10FCV validation. The top three ranked features, 227736_at and 227735_s_at (gene C10orf99) and 239430_at (gene IGFL1), produced a binary classifier which performed well, with overall accuracies (Ac) of 99.43%, 100% and 100% for the three validation strategies. Classifiers based on more features produced overall accuracies fluctuating between 80% and 100% (Fig. 4). The three features are highlighted in bold and blue font in Supplementary Table S2, and their functions are discussed in detail below.

3.4. Enrichment analysis of differentially expressed genes

The Gene Ontology (GO) annotations were screened for the terms enriched in the 21 psoriasis associated features (18 human genes), as shown in Table 1. The enrichment analysis was conducted using the system DAVID with default parameters (Evalue cutoff = E−2) [40]. The Pvalues of these multiple tests were statistically corrected by the Bonferroni and Benjamini algorithms [41]. Only the GO categories with Pvalues ≤ 0.05 after both multi-test corrections were kept for further analysis, as shown in Table 1. Genes involved in the development and differentiation processes of the skin cells, in particular the keratinocyte and epidermal cells, were significantly enriched in the 21 features that were chosen by the three different feature selection algorithms. This result is consistent with the clinical observation that uncontrolled differentiation of the skin cells is a major symptom of psoriasis [42,43]. There is also a sub-cellular compartment, the cornified envelope (GO:0001533 with Pvalue = 5.65E−03 after both correction algorithms), which was enriched with 3 psoriasis associated genes.

4. Discussion

4.1. A robust prediction signature for psoriasis

This study aimed to establish a robust psoriasis prediction model, based on the gene expression profiles of human samples. Multiple sets of genes were proposed to be associated with psoriasis, or differentially expressed in psoriasis samples [13–15,44]. Such gene sets consist of dozens or even hundreds of candidate markers, which may potentially be used for psoriasis diagnosis. However, studies of psoriasis classification models are scarce. To our knowledge, this study is the first of its kind in constructing a feed-forward neural network based psoriasis binary classifier.

We hypothesized that the association between the marker features and psoriasis is strong enough to pass the filtering criteria of different feature selection algorithms. Therefore we applied three widely used feature selection algorithms, i.e. SVM-RFE, ROC and Boruta, to screen the 27,336 features that passed the data quality filtering criteria and have statistically significant variations between the disease and control samples. The top 80 features ranked by each of the three feature selection algorithms with default parameters were retained for further analysis. The intersection of these three lists resulted in 21 features (probe sets) that belong to 18 human genes. The majority of the 21 features were known to be associated with the initiation and development of psoriasis, as shown in Table 1 and Supplementary Table S2.

A comprehensive classification validation was conducted to investigate how a subset of these 21 features may be combined to build a binary classifier which can perform equally well on different data sets. We incrementally added one feature at a time to the feature subset by the ranking order. We evaluated the feed-forward neural network
classifier with one hidden layer of nodes in three independent validation strategies, as described in the above section. The binary classifiers based on the top one or two features excelled in some validations, but were not stable enough on the other data sets. With one more feature, the trained binary classifier achieved overall accuracies of 99.43%, 100% and 100% for the three validation strategies, respectively, and the best averaged overall accuracy (99.81%) among all the incrementally added feature subsets, as shown in Fig. 4. We also observed that adding more features made the overall accuracies fluctuate more.

In all, the binary classifier trained over the top three ranked features (two human genes, C10orf99 and IGFL1) represents a robust and highly accurate psoriasis prediction model.

Fig. 4. Prediction accuracy assessment based on the feature subset in three scenarios by the Incremental Feature Selection (IFS) strategy. The overall accuracy (Ac) values of the validations 10FCV, tGSE14905 and tGSE13355 were plotted in different line types. The average accuracy is averaged over the three validation strategies for the number of features i, and is plotted as a dashed line with circle points.

Table 1
Enriched Gene Ontology (GO) terms. GO terms from the groups of biological process (BP), cellular component (CC) and molecular function (MF) that were significantly enriched in the 21 selected features (probe sets).

Term                                               Pvalue     FC       Bonferroni   Benjamini
BP|GO:0031424 - keratinization                     6.45E−06   96.8     8.13E−04     8.13E−04
BP|GO:0009913 - epidermal cell differentiation     3.07E−05   57.81    3.86E−03     1.29E−03
BP|GO:0030216 - keratinocyte differentiation       2.36E−05   63.07    2.98E−03     1.49E−03
BP|GO:0030855 - epithelial cell differentiation    2.09E−04   30.38    2.60E−02     6.57E−03
CC|GO:0001533 - cornified envelope                 2.18E−04   124.5    5.65E−03     5.65E−03

4.2. Functional insights of the two marker genes, IGFL1 and C10orf99

Psoriasis was regarded as a T-cell-driven disease with autoimmune mechanisms as its primary cause [45]. This view was supported by clinical data and genetic studies, such as HLA-C*06, ERAP1, the IL-23 pathway, and NF-κB signaling [46]. The PI3K/Akt pathway may be continuously sustained by the C10orf99 stimulus, promoting keratinocyte proliferation and increasing the anti-apoptotic NF-κB signaling, which not only makes psoriatic keratinocytes resistant to cell death [47], but also triggers cytokine expression resulting in inflammatory responses [48]. At present, there exist conventional and biologic agents to treat psoriasis. Conventional treatments include methotrexate, cyclosporine A, and acitretin. Five biologic agents are FDA approved for psoriasis treatment: alefacept (T-cell modulator), adalimumab, etanercept and infliximab (TNF-α inhibitors), and ustekinumab (IL-12/IL-23 p40 inhibitor) [49]. These biologic agents may only control the development of the symptoms.

The psoriasis classification model developed in this study suggested an association between psoriasis and two marker genes, IGFL1 and C10orf99. A molecular pathway of psoriasis development was manually curated from the literature [50,51], as shown in Fig. 5. IGFL1 is one of the four members of the human insulin growth factor (IGF)-like (IGFL) family, and is the ligand of the cell membrane associated receptor IGFLR1. Structurally similar to the IGF family members, IGFLs comprise small, secreted proteins (shorter than 80 aa on average in the mature polypeptide) and contain 11 cysteines including two CC motifs [52]. Together with other hormones or growth factors, IGFLs promote cell proliferation and differentiation, and suppress apoptosis in various cell types [53]. The continuous stimulation of IGFL1 induces a prolonged association of phosphatidylinositol 3 kinase (PI3K) with the IGFL1 receptor, which was shown to be essential for IGFL1-induced cell proliferation, including promoting DNA synthesis and increasing cell number [54]. Two IGFLR1 molecules can dimerize when the growth factor IGFL1 is bound, and this induces auto-phosphorylation. The post-translational modification of IGFLR1 transduces the growth signal to the downstream PI3K/Akt pathway. C10orf99 may be a serine/threonine kinase that can phosphorylate IGFLR1 and induce the same signal transduction, based on multiple sources of evidence. The functional annotation of C10orf99 was not documented in various databases, such as UniProtKB [55] and the Human Protein Atlas [56]. We structurally aligned C10orf99 to the known 3D structures using I-TASSER version 1.1 [57]. The top five enzyme homologs of C10orf99 were found belonging to the Enzyme
Classification (EC) of serine/threonine kinases (2.7.11.14, 2.7.11.14, C10orf99 in psoriasis was independently supported by the epigenetic 2.7.11.1, 2.7.11.13 and 2.7.11.1). The best two matches (PDBs 3C4W studies, which showed significantly less methylation of C10orf99 in and 3C51) are G protein coupled receptor (GPCR) kinases (GRKs). The the psoriatic samples compared to the healthy controls [63]. best match, enzyme PDB 3C4W, has as little as 3.38 RMSD variation to the predicted structure of C10orf99 as shown in Fig. 6 (within the rea- 5. Conclusions sonable RMSD cutoff of 5.5 [58]), where RMSD is the root mean square deviation between two 3D structures. GRK was proposed to facilitate This study constructed a robust two-gene expression signature to the phosphorylation of IGFLR1, and to activate the downstream PI3K/ distinguish the psoriasis samples from the healthy ones. The functional Akt pathway [59]. By suppressing the GRK expression with siRNA, characterization of the two genes, IGFL1 and C10orf99, suggests that the Zheng et al. demonstrated that the decreased expression of GRK5/6 kinase IGFL1 activates the growth signal receptor C10orf99 by abolished IGFL1-mediated AKT activation, and GRK2 inhibition partially phosphorylation, and triggering a growth signal to the cell proliferation inhibited AKT signaling [59]. IGFL1 binding to its receptor IGFLR1 can machinery. Further experiments will be conducted to investigate the stimulate the phosphorylation of the insulin receptor substrate (IRS) regulatory roles of IGFL1 and C10orf99 in the psoriasis pathogenesis, [60] and activate the PI3K/Akt pathway [61]. Therefore, the elevated ex- and their potential use as psoriasis drug targets. pression of C10orf99 can activate the downstream IGLF1-mediated AKT pathway, as shown in Fig. 5. This hypothesis was supported by a num- List of abbreviations used ber of recent studies at the genomics and epigenetics levels. 
C10orf99 DWD Distance Weighted Discrimination was shown to be up-regulated by 1.7-fold in uninvolved skins of psori- RFE Recursive Feature Elimination atic patients and even by 30-fold in the lesion skins [62].Thetwomicro- ROC receiver operating characteristic array data sets used in this study also showed the elevated expression of TPR true positive rate C10orf99 in the psoriatic lesion skins [16,17]. The up-regulation of FPR false positive rate

Fig. 5. Molecular pathway of psoriasis development. The details of the abbreviations are listed here: IGFL1, insulin-like growth factor (IGF) like family member 1; IGFLR1, IGF-like family receptor 1; IRS, insulin receptor substrate; C10orf99, 10 open reading frame 99; P85, p85 regulatory subunit; P110α, PtdIns-3-kinase subunit p110-alpha; PI3K, phosphoinositide 3 kinase, catalytic, alpha polypeptide; PIP2, phosphatidylinositol (4,5)-bisphosphate; PIP3, phosphatidylinositol (3,4,5)-trisphosphate; NF-κB, nuclear factor of kappa light polypeptide gene enhancer in B-cells; P, phosphate. 54 P. Guo et al. / Genomics 103 (2014) 48–55

Fig. 6. The predicted structure of C10orf99. The alignment of C10orf99 to the best matched enzyme, GRK, has an RMSD of only 3.38.

List of abbreviations used (continued)

AUC    area under the ROC curve
PCA    principal component analysis
IFS    Incremental Feature Selection
GO     Gene Ontology
IGFL   insulin growth factor (IGF)-like
GRKs   G protein coupled receptor (GPCR) kinases

Authors' contributions

FZ conceived and designed the study. PG, MZ, YL, and LG performed the classification training and cross-validation. GM, GW and FL performed the functional analysis and clinical characterization of the results. FZ, PG and GM drafted the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare no potential competing interests.

Acknowledgments

This study was supported in part by the Shenzhen Research Grant ZDSY20120617113021359, China 973 program (2011CB512003 and 2010CB732606-6) and NSFC 31000447. Computing resources were partly provided by the Dawning supercomputing clusters at SIAT CAS. We would like to thank the Health Informatics Laboratory (HILab) members for their helpful discussions. Constructive comments from the two anonymous reviewers are also appreciated.

Appendix A. Supplementary data

The data sets supporting the results of this study are cited in the text and additional files. The Supplementary Figs. S1–S2 and Supplementary Tables S1–S2 are available at http://www.healthinformaticslab.org/supp/. Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.ygeno.2013.11.001.

References

[1] J. Villasenor-Park, D. Wheeler, L. Grandinetti, Psoriasis: evolving treatment for a complex disease, Cleve. Clin. J. Med. 79 (2012) 413–423.
[2] T.W. Chu, T.F. Tsai, Psoriasis and cardiovascular comorbidities with emphasis in Asia, G. Ital. Dermatol. Venereol. 147 (2012) 189–202.
[3] L. Puig, Cardiovascular risk and psoriasis: the role of biologic therapy, Actas Dermosifiliogr. 103 (2012) 853–862.
[4] J.M. Gelfand, E.D. Dommasch, D.B. Shin, R.S. Azfar, S.K. Kurd, X. Wang, A.B. Troxel, The risk of stroke in patients with psoriasis, J. Invest. Dermatol. 129 (2009) 2411–2418.
[5] J. Vom Berg, S. Prokop, K.R. Miller, J. Obst, R.E. Kalin, I. Lopategui-Cabezas, A. Wegner, F. Mair, C.G. Schipke, O. Peters, Y. Winter, B. Becher, F.L. Heppner, Inhibition of IL-12/IL-23 signaling reduces Alzheimer's disease-like pathology and cognitive decline, Nat. Med. 18 (2012) 1812–1819.
[6] R.G. Langley, G.G. Krueger, C.E. Griffiths, Psoriasis: epidemiology, clinical features, and quality of life, Ann. Rheum. Dis. 64 (Suppl. 2) (2005) ii18–ii23 (discussion ii24-15).
[7] Psoriasis-Aid.com, Psoriasis Statistics, 2012.
[8] D.D. Gladman, Natural history of psoriatic arthritis, Baillieres Clin. Rheumatol. 8 (1994) 379–394.
[9] F. Chamian, M.A. Lowes, S.L. Lin, E. Lee, T. Kikuchi, P. Gilleaudeau, M. Sullivan-Whalen, I. Cardinale, A. Khatcherian, I. Novitskaya, K.M. Wittkowski, J.G. Krueger, Alefacept reduces infiltrating T cells, activated dendritic cells, and inflammatory genes in psoriasis vulgaris, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 2075–2080.
[10] F. Chamian, J.G. Krueger, Psoriasis vulgaris: an interplay of T lymphocytes, dendritic cells, and inflammatory cytokines in pathogenesis, Curr. Opin. Rheumatol. 16 (2004) 331–337.
[11] J.K. Kulski, W. Kenworthy, M. Bellgard, R. Taplin, K. Okamoto, A. Oka, T. Mabuchi, A. Ozawa, G. Tamiya, H. Inoko, Gene expression profiling of Japanese psoriatic skin reveals an increased activity in molecular stress and immune response signals, J. Mol. Med. 83 (2005) 964–975.
[12] Wikipedia, Psoriasis, 2012.
[13] J.E. Gudjonsson, J. Ding, X. Li, R.P. Nair, T. Tejasvi, Z.S. Qin, D. Ghosh, A. Aphale, D.L. Gumucio, J.J. Voorhees, G.R. Abecasis, J.T. Elder, Global gene expression analysis reveals evidence for decreased lipid biosynthesis and increased innate immunity in uninvolved psoriatic skin, J. Invest. Dermatol. 129 (2009) 2795–2804.
[14] J. Reischl, S. Schwenke, J.M. Beekman, U. Mrowietz, S. Sturzebecher, J.F. Heubach, Increased expression of Wnt5a in psoriatic plaques, J. Invest. Dermatol. 127 (2007) 163–169.
[15] Y. Yao, L. Richman, C. Morehouse, M. de los Reyes, B.W. Higgs, A. Boutrin, B. White, A. Coyle, J. Krueger, P.A. Kiener, B. Jallal, Type I interferon: potential therapeutic target for psoriasis? PLoS One 3 (2008) e2737.
[16] W.R. Swindell, A. Johnston, S. Carbajal, G. Han, C. Wohn, J. Lu, X. Xing, R.P. Nair, J.J. Voorhees, J.T. Elder, X.J. Wang, S. Sano, E.P. Prens, J. DiGiovanni, M.R. Pittelkow, N.L. Ward, J.E. Gudjonsson, Genome-wide expression profiling of five mouse models identifies similarities and differences with human psoriasis, PLoS One 6 (2011) e18266.
[17] R.P. Nair, K.C. Duffin, C. Helms, J. Ding, P.E. Stuart, D. Goldgar, J.E. Gudjonsson, Y. Li, T. Tejasvi, B.J. Feng, A. Ruether, S. Schreiber, M. Weichenthal, D. Gladman, P. Rahman, S.J. Schrodi, S. Prahalad, S.L. Guthery, J. Fischer, W. Liao, P.Y. Kwok, A. Menter, G.M. Lathrop, C.A. Wise, A.B. Begovich, J.J. Voorhees, J.T. Elder, G.G. Krueger, A.M. Bowcock, G.R. Abecasis, Collaborative Association Study of Psoriasis, Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways, Nat. Genet. 41 (2009) 199–204.
[18] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, K.A. Marshall, K.H. Phillippy, P.M. Sherman, R.N. Muertter, M. Holko, O. Ayanbule, A. Yefanov, A. Soboleva, NCBI GEO: archive for functional genomics data sets—10 years on, Nucleic Acids Res. 39 (2011) D1005–D1010.
[19] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, T.P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003) 249–264.
[20] M. Benito, J. Parker, Q. Du, J. Wu, D. Xiang, C.M. Perou, J.S. Marron, Adjustment of systematic microarray data biases, Bioinformatics 20 (2004) 105–114.
[21] M. Lauss, A. Frigyesi, T. Ryden, M. Hoglund, Robust assignment of cancer subtypes from expression data using a uni-variate gene expression average as classifier, BMC Cancer 10 (2010) 532.
[22] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002) 389–422.
[23] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, Journal, 2010.
[24] A. Liaw, M. Wiener, Classification and regression by randomForest, R News 2 (2002) 18–22.
[25] Y. Ding, D. Wilkins, Improving the performance of SVM-RFE to select genes in microarray data, BMC Bioinforma. 7 (Suppl. 2) (2006) S12.
[26] C. Furlanello, M. Serafini, S. Merler, G. Jurman, Entropy-based gene ranking without selection bias for the predictive classification of microarray data, BMC Bioinforma. 4 (2003) 54.
[27] C. Furlanello, M. Serafini, S. Merler, G. Jurman, An accelerated procedure for recursive feature ranking on microarray data, Neural Netw. 16 (2003) 641–648.
[28] V.N. Vapnik, The Nature of Statistical Learning Theory, 2 ed. Springer-Verlag, New York, 1999.
[29] J.A. Swets, Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers, Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, England, 1996.
[30] M.S. Pepe, G. Longton, G.L. Anderson, M. Schummer, Selecting differentially expressed genes from microarray experiments, Biometrics 59 (2003) 133–142.
[31] M.B. Kursa, W.R. Rudnicki, Feature selection with the Boruta package, J. Stat. Softw. 36 (2010) 1–13.

[32] D.A. Elizondo, R. Birkenhead, M. Gongora, E. Taillard, P. Luyima, Analysis and test of efficient methods for building recursive deterministic perceptron neural networks, Neural Netw. 20 (2007) 1095–1108.
[33] K. Li, M. Yang, G. Sablok, J. Fan, F. Zhou, Screening features to improve the class prediction of acute myeloid leukemia and myelodysplastic syndrome, Gene 512 (2013) 348–354.
[34] F. Zhou, Y. Xu, cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data, Bioinformatics 26 (2010) 2051–2052.
[35] X. Zhou, J. Mao, J. Ai, Y. Deng, M.R. Roth, C. Pound, J. Henegar, R. Welti, S.A. Bigler, Identification of plasma lipid biomarkers for prostate cancer by lipidomics and bioinformatics, PLoS One 7 (2012) e48889.
[36] A.P. Thrift, B.J. Kendall, N. Pandeya, D.C. Whiteman, A model to determine absolute risk for esophageal adenocarcinoma, Clin. Gastroenterol. Hepatol. 11 (2013) 138–144.e2.
[37] X. Hu, H. Cammann, H.-A. Meyer, K. Miller, K. Jung, C. Stephan, Artificial neural networks and prostate cancer—tools for diagnosis and management, Nat. Rev. Urol. 10 (2013) 174–182.
[38] S. Pegolo, G. Gallina, C. Montesissa, F. Capolongo, S. Ferraresso, C. Pellizzari, L. Poppi, M. Castagnaro, L. Bargelloni, Transcriptomic markers meet the real world: finding diagnostic signatures of corticosteroid treatment in commercial beef samples, BMC Vet. Res. 8 (2012) 205.
[39] H. Akanuma, X.Y. Qin, R. Nagano, T.T. Win-Shwe, S. Imanishi, H. Zaha, J. Yoshinaga, T. Fukuda, S. Ohsako, H. Sone, Identification of stage-specific gene expression signatures in response to retinoic acid during the neural differentiation of mouse embryonic stem cells, Front. Genet. 3 (2012) 141.
[40] G. Dennis Jr., B.T. Sherman, D.A. Hosack, J. Yang, W. Gao, H.C. Lane, R.A. Lempicki, DAVID: Database for Annotation, Visualization, and Integrated Discovery, Genome Biol. 4 (2003) P3.
[41] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B Methodol. 57 (1995) 289–300.
[42] D. Globe, M.S. Bayliss, D.J. Harrison, The impact of itch symptoms in psoriasis: results from physician interviews and patient focus groups, Health Qual. Life Outcomes 7 (2009) 62.
[43] M.G. Boy, C. Wang, B.E. Wilkinson, V.F. Chow, A.T. Clucas, J.G. Krueger, A.S. Gaweco, S.H. Zwillich, P.S. Changelian, G. Chan, Double-blind, placebo-controlled, dose-escalation study to evaluate the pharmacologic effect of CP-690,550 in patients with psoriasis, J. Invest. Dermatol. 129 (2009) 2299–2302.
[44] E. Guttman-Yassky, M. Suarez-Farinas, A. Chiricozzi, K.E. Nograles, A. Shemer, J. Fuentes-Duculan, I. Cardinale, P. Lin, R. Bergman, A.M. Bowcock, J.G. Krueger, Broad defects in epidermal cornification in atopic dermatitis identified through genomic analysis, J. Allergy Clin. Immunol. 124 (2009) 1235–1244 (e1258).
[45] J.G. Bergboer, P.L. Zeeuwen, J. Schalkwijk, Genetics of psoriasis: evidence for epistatic interaction between skin barrier abnormalities and immune deviation, J. Invest. Dermatol. 132 (2012) 2320–2321.
[46] J.G. Bergboer, P.L. Zeeuwen, J. Schalkwijk, Genetics of psoriasis: evidence for epistatic interaction between skin barrier abnormalities and immune deviation, J. Invest. Dermatol. 132 (2013) 2320–2331.
[47] S. Madonna, C. Scarponi, S. Pallotta, A. Cavani, C. Albanesi, Anti-apoptotic effects of suppressor of cytokine signaling 3 and 1 in psoriasis, Cell Death Dis. 3 (2012) e334.
[48] M. Pasparakis, Role of NF-kappaB in epithelial biology, Immunol. Rev. 246 (2012) 346–358.
[49] E. Krulig, K.B. Gordon, Ustekinumab: an evidence-based review of its effectiveness in the treatment of psoriasis, Core Evid. 5 (2010) 11–22.
[50] E.D. Roberson, Y. Liu, C. Ryan, C.E. Joyce, S. Duan, L. Cao, A. Martin, W. Liao, A. Menter, A.M. Bowcock, A subset of methylated CpG sites differentiate psoriatic from normal skin, J. Invest. Dermatol. 132 (2012) 583–592.
[51] A.A. Lobito, S.R. Ramani, I. Tom, J.F. Bazan, E. Luis, W.J. Fairbrother, W. Ouyang, L.C. Gonzalez, Murine insulin growth factor-like (IGFL) and human IGFL1 proteins are induced in inflammatory skin conditions and bind to a novel tumor necrosis factor receptor family member, IGFLR1, J. Biol. Chem. 286 (2011) 18969–18981.
[52] P. Emtage, P. Vatta, M. Arterburn, M.W. Muller, E. Park, B. Boyle, S. Hazell, R. Polizotto, W.D. Funk, Y.T. Tang, IGFL: a secreted family with conserved cysteine residues and similarities to the IGF superfamily, Genomics 88 (2006) 513–520.
[53] S.M. Jones, A. Kazlauskas, Growth-factor-dependent mitogenesis requires two distinct phases of signalling, Nat. Cell Biol. 3 (2001) 165–172.
[54] T. Fukushima, Y. Nakamura, D. Yamanaka, T. Shibano, K. Chida, S. Minami, T. Asano, F. Hakuno, S. Takahashi, Phosphatidylinositol 3-kinase (PI3K) activity bound to insulin-like growth factor-I (IGF-I) receptor, which is continuously sustained by IGF-I stimulation, is required for IGF-I-induced cell proliferation, J. Biol. Chem. 287 (2012) 29713–29721.
[55] M. Magrane, U. Consortium, UniProt Knowledgebase: a hub of integrated protein data, Database (Oxford) 2011 (2011) bar009.
[56] F. Ponten, J.M. Schwenk, A. Asplund, P.H. Edqvist, The Human Protein Atlas as a proteomic resource for biomarker discovery, J. Intern. Med. 270 (2011) 428–446.
[57] A. Roy, A. Kucukural, Y. Zhang, I-TASSER: a unified platform for automated protein structure and function prediction, Nat. Protoc. 5 (2010) 725–738.
[58] Y. Zhang, J. Skolnick, Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins, Biophys. J. 87 (2004) 2647–2655.
[59] H. Zheng, C. Worrall, H. Shen, T. Issad, S. Seregard, A. Girnita, L. Girnita, Selective recruitment of G protein-coupled receptor kinases (GRKs) controls signaling of the insulin-like growth factor 1 receptor, Proc. Natl. Acad. Sci. U. S. A. 109 (2012) 7055–7060.
[60] T. Ogihara, T. Isobe, T. Ichimura, M. Taoka, M. Funaki, H. Sakoda, Y. Onishi, K. Inukai, M. Anai, Y. Fukushima, M. Kikuchi, Y. Yazaki, Y. Oka, T. Asano, 14-3-3 protein binds to insulin receptor substrate-1, one of the binding sites of which is in the phosphotyrosine binding domain, J. Biol. Chem. 272 (1997) 25267–25274.
[61] Y. Benomar, A.F. Roy, A. Aubourg, J. Djiane, M. Taouis, Cross down-regulation of leptin and insulin receptor expression and signalling in a human neuronal cell line, Biochem. J. 388 (2005) 929–939.
[62] Y. Liu, Identification of Genetic and Epigenetic Risk Factors for Psoriasis and Psoriatic Arthritis in: Graduate School of Arts and Sciences, Washington University in St. Louis, St. Louis, MO, 2011. 193.
[63] S.H. Chu, D.F. Feng, Y.B. Ma, H. Zhang, Z.A. Zhu, Z.Q. Li, P.C. Jiang, Promoter methylation and downregulation of SLC22A18 are associated with the development and progression of human glioma, J. Transl. Med. 9 (2011) 156.