<<

Hindawi Complexity Volume 2017, Article ID 9194801, 8 pages https://doi.org/10.1155/2017/9194801

Research Article Identifying the Risky SNP of Osteoporosis with ID3-PEP Decision Tree Algorithm

Jincai Yang,1 Huichao Gu,1 Xingpeng Jiang,1 Qingyang Huang,2 Xiaohua Hu,1 and Xianjun Shen1

1 School of Computer Science, Central China Normal University, Wuhan 430079, China 2School of Life Science, Central China Normal University, Wuhan 430079, China

Correspondence should be addressed to Jincai Yang; [email protected]

Received 31 March 2017; Revised 26 May 2017; Accepted 8 June 2017; Published 7 August 2017

Academic Editor: Fang-Xiang Wu

Copyright © 2017 Jincai Yang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the past 20 years, much progress has been made on the genetic analysis of osteoporosis. A number of and SNPs associated with osteoporosis have been found through GWAS method. In this paper, we intend to identify the suspected risky SNPs of osteoporosis with computational methods based on the known osteoporosis GWAS-associated SNPs. The process includes two steps. Firstly, we decided whether the genes associated with the suspected risky SNPs are associated with osteoporosis by using random walk algorithm on the PPI network of osteoporosis GWAS-associated genes and the genes associated with the suspected risky SNPs. In order to solve the overfitting problem in ID3 decision tree algorithm, we then classified the SNPs with positive results based on their features of position and function through a simplified classification decision tree which was constructed by ID3 decision tree algorithm with PEP (Pessimistic-Error Pruning). We verified the accuracy of the identification framework with the data set of GWAS-associated SNPs, and the result shows that this method is feasible. It provides a more convenient way to identify the suspected risky SNPs associated with osteoporosis.

1. Introduction Osteoporosis is a complex and polygenic disease of bone system with the heritability of bone mass is about 60–80% Osteoporosis is a type of systemic skeletal disease that is [3]. Much progress has been made on the genetic analysis characterized by reduced bone mass and microarchitecture of osteoporosis in the past 20 years and it has been found deterioration of bone tissues, thereby leading to the loss of that a lot of genes and SNPs are associated with osteoporosis strength and increased risk of fractures [1]. It is one of the age- through GWAS [4, 5]. related diseases with arteriosclerosis, hypertension, diabetes, Computational biology refers to the development and and cancer. Currently, none of the medical methods is safe application of data analysis, the theory of data method, and effective to cure osteoporosis. Therefore, it is necessary mathematical modeling, and computer simulation technol- to provide theoretical basis for developing a medical strategy ogy, used in the study of biology, behavioral, and social to cure the disease from the pathogenesis of osteoporosis. group system of a discipline [6]. The rapid mass of biological With the completion of the International HapMap Project data accumulation is unprecedented in the history of human and 1000 Genomes Project, about ten millions SNPs of science. Now, a variety of methods and tools of computational human were annotated, among which more than 3 million biology through the Internet have been successfully applied are common SNPs. Genetic analysis has reached the stage in every aspect in the field of biological research. They are of genome-wide association study (GWAS). The GWAS is powerful for post-GWAS studies [7] and could identify the applied to the study of 40 kinds of diseases that are related potential and promising causal SNPs that require experimen- to more than 500 thousands SNPs [2]. tal tests for follow-up functional studies. Extensive work has 2 Complexity been done in this area in recent years. The performances associated with osteoporosis through random walk algorithm were well validated through identifying numerous disease- based on Markov chain. By the algorithm, we also selected the associated SNPs for further study and revealing previously suspected risky SNPs whose associated genes are identified unknown mechanisms for complex diseases [8]. to be associated with osteoporosis. We then classified those The method of computational biology can also be used SNPs based on their characteristics of function and their loci to study and understand these osteoporosis-susceptible genes features by a classification decision tree, and the decision and the function of SNP. All the osteoporosis associated tree was constructed by ID3 decision tree algorithm with genes and SNPs (including linkage disequilibrium (LD) Pessimistic-Error Pruning. Figure 1 describes the process to SNPs) sequence information were collected and aggregated identify the osteoporosis risky SNPs. from the national center for biological information (NCBI) database, and the effects of osteoporosis GWAS-associated 2.1. The Identification of Genes Associated with Suspected Risky lead SNPS and their linked SNPs to (TF) SNPs. According to the modular property of the genetic binding affinity were studied through JASPAR database. At diseases, many scholars have proposed prioritization algo- the same time, the osteoporosis GWAS-associated genes have rithms to predict the disease-causing genes based on the PPI, also been analyzed with -Protein Interaction (PPI) Human Disease Network, and DISEASOME recently [12–16]. network analysis tool in the study of the osteoporosis GWAS- Similarly, we obtained the scores of the genes associated with associated SNPs associated by the online PPI tool named the suspected risky SNPs through the random walk algorithm String. Combining with GO and pathway analysis, we found based on the PPI of the osteoporosis GWAS-associated genes that the hub associated and the Wnt signaling path- and the genes associated with suspected risky SNPs. Then, way were related to the mesenchymal stem cell differentiation the result was acquired by setting up a threshold 𝑘,andthe and hormone signaling that was related to the metabolism of genesassociatedwithsuspectedriskySNPsareprobablythe osteoporosis [9]. Finally, it was found that the osteoporosis osteoporosis associated genes if their scores are greater than GWAS-associated SNPs in special region of genes had long- 𝑘. range interaction signal with other by analyzing the long-range interaction of osteoporosis associated SNPs on 2.2. The Random Walk Algorithm Based on Markov Chain. GWAS3D [10]. Kohler proposed a method for the problem of candidate- In the BIBM workshop paper [11], we utilized the known prioritization by random walk algorithm based on the osteoporosisGWAS-associatedSNPsandgenesasthedataset global network distance of PPI. The results indicate that the to identify the osteoporosis suspected risky SNPs. The process algorithm is more effective than the local network distance for identification was achieved by computation method. In algorithm [17]. The random walk algorithm was applied to this extension, we made some improvements on the paper. Protein-Protein Interaction network of all associated genes. Firstly, we had achieved graphical description for the SNPs An undirected graph 𝐺 = (𝑉,𝐸) isdefinedfortheProtein- identification process. We added a flow chart for the paper Protein Interaction network of all associated genes. In the to describe the process of identification method that made undirected graph 𝐺, 𝑉 is the set of vertices for the interactors the method more intuitive. Secondly, we used ID3 decision of the network. And 𝑉 is defined as 𝑉={V1, V2,...,V𝑛}; 𝐸 is tree algorithm with PEP method instead of ID3 decision the set of edges; and 𝐸 is defined as 𝐸={⟨V𝑖, V𝑗⟩|V𝑖, V𝑗 ∈𝑉}. tree algorithm in the second part of the method. We made Every edge in the set of edges corresponds to two nodes of theimprovementtosolvetheoverfittingprobleminID3 the set of vertices for the interaction between the interactors. decision tree algorithm; we used the C4.5 algorithm to Moreover, it is assumed that a random process meets the make a comparison with our ID3-PEP algorithm. Finally, condition of Markov chain. The random process should be we added type 2 diabetes (T2D) GWAS-associated SNPs and as follows: genesasthenegativedatasetbasedonosteoporosisGWAS- associated SNPs and genes to verify the accuracy of the (a) The probability distribution of time 𝑡+1is only related method comprehensively. to the state of time 𝑡, and it is not related to the state before time 𝑡. 2. Material and Method (b) The state transition is not related to the value of 𝑡 from the time 𝑡 to time 𝑡+1. Therefore, the Markov chain We identified the suspected risky SNPs associated with osteo- model is defined as porosis by algorithm based on the analysis of osteoporosis GWAS-associated SNPs with the method mentioned above (𝑆, 𝑃,) 𝑄 . (1) [9]. It is assumed that the SNPs that are similar to the osteoporosis GWAS-associated SNPs are possible risky SNPs associated with osteoporosis. The identification process of 𝑆 is a nonempty set that consists of all the possible states the suspected risky SNPs includes two steps in general. of the system. It is a state space that can be a limited and Firstly, we constructed a Protein-Protein Interaction (PPI) denumerable set or a nonempty set. 𝑃=[𝑃𝑖𝑗 ]𝑛×𝑛 is the network based on the Protein-Protein Interaction analysis state transfer-probability matrix, 𝑃𝑖𝑗 is the probability that of the osteoporosis GWAS-associated genes and the genes the system is in the state 𝑖 at time 𝑡 to the state 𝑗 at time associated with suspected risky SNPs and identify whether 𝑡+1. 𝑁 is the number of system states. 𝑄={𝑞0,𝑞1,...,𝑞𝑛−1} the genes associated with the suspected risky SNPs are is the initial probability distribution of the system, 𝑞𝑖 is the Complexity 3

Begin

SNPs to be identified

End Search associated genes of SNPs

Osteoporosis Is Search interactors of associated genes Osteoporosis associated genes and interactors SNP

Yes

No Construct PPI End Be classified

Random walk ID3 decision tree algorithm algorithm

Yes Output Search loci >Threshold risky SNP features

No

End

Figure 1: Process to identify the suspected risky SNPs associated with osteoporosis.

probability that the system in state 𝑖 at the initial time, and between V𝑖 and V𝑗 in the network, the element 𝑎𝑖𝑗 =1; 𝑁 𝑎 =0 ∑𝑖=0 𝑞𝑖 =1. otherwise, 𝑖𝑗 the formula is defined as Based on the above theory model, the random walk on graphs is defined as an iterative walk’s transition from its {1, ⟨V𝑖, V𝑗⟩∈𝐸 𝑎𝑖𝑗 = { (4) current node to a randomly selected neighbor starting at 0, . givensourcenode[17].Therandomwalkisdefinedas { otherwise

𝑡+1 𝑡 0 𝐷 is a diagonal matrix. Each element 𝑑𝑖𝑗 of 𝐷 is defined as 𝑃 = (1−𝛼) 𝑃 𝑊+𝛼𝑃 . (2) follows: if 𝑖=𝑗then it should have 𝑑𝑖𝑗 =𝑑𝑖𝑖;otherwise𝑑𝑖𝑗 =0. 𝑑𝑖𝑗 is the degree of V𝑖 in the network. The formula is defined 𝑃𝑡 𝑖 is a vector in which the th element holds the probability of as being at node 𝑖 at time step 𝑡. 𝛼 is a constant between 0 and 1 𝑖 𝑛 that it is the restart of the walk in every step at the node with {∑𝑎 ,𝑖=𝑗 𝛼 𝛼 ∈ (0, 1] 𝑃0 1×𝑛 𝑑 = 𝑖𝑘 probability ,and [17]. is a row vector of 𝑖𝑗 {𝑘=1 (5) which is the initial state of the system, and 𝑛 is the element 0, 0 { otherwise. number of 𝑉.Thevalueofknownelementsof𝑃 is equal, and the sum of them is 1. And the value of other elements is The transition probability matrix 𝑊 is also a row- 0. 𝑊 is the transition probability matrix which is defined as normalized adjacency matrix of the graph. Formula (2) meets the state of stationary distribution of Markov train model −1 𝑊=𝐷 𝐴. (3) obviously, so the central point of random walk algorithm is evaluating the stationary distribution state of the probability 𝐴 is an adjacency matrix of undirected graph 𝑉.Every of the nodes in the undirected network 𝐺 which consists of element 𝑎𝑖𝑗 of 𝐴 is defined as follows: if there is interaction PPI. Firstly, the transition probability matrix 𝑊 should be 4 Complexity

0 obtained and the initial value is set for 𝑃 .Then,process𝑡 When the attribute 𝑇=𝑡𝑗,aformulaisusedtodefinethe 𝑡+1 𝑡 times iteration based on formula (2) until lim(𝑝 −𝑝)= conditional entropy of the attribute 𝑇 as 𝑡+1 0, 𝑝 is a convergence vector. A threshold is set for the 𝑛 probability value, and if the probability value of the nodes 𝐻(𝑋𝑗)=−∑𝑃(𝑆𝑖 |𝑇=𝑡𝑗) log2𝑃(𝑆𝑖 |𝑇=𝑡𝑗). (9) (or genes) is greater than the threshold, they are osteoporosis 𝑖=1 associated genes.

𝑋𝑗 is the training instances set of training set 𝑆. 2.3. Classify the Suspected Risky SNPs by ID3 Decision Tree The information entropy of attribute 𝑇 is defined as Algorithm. ID3 decision tree algorithm is a classification algorithm for tree structure [18, 19]. The goal of the algorithm 𝑚 𝑇 =𝐻 𝑆 − ∑𝑃(𝑆 |𝑇=𝑡)𝐻(𝑋 ). is to predict target variable based on multiple input variables IG ( ) ( ) 𝑖 𝑗 𝑗 (10) anddeduceaclassificationrulewithdecisiontreeformfrom 𝑗=1 a group of irregular samples. We assume that all input charac- teristic elements have a limited discrete domain and need an We built a top-down decision tree and classified the train- individual characteristic element as a category. The nonleaf ing instances by choosing the attribute with the maximum nodes of a classification decision tree classify the samples by information entropy based on the formulas above. characteristics of samples, and each leaf node of the tree is However, the overfitting problem could not be avoided if a class or classes of probability distribution. Therefore, we there were many noise samples in the training set, because of chose decision tree to classify the SNP based on the condition a complicated classification decision tree constructed by ID3 of the training set and algorithm characteristics. decision tree algorithm with a fair amount of noise samples SNPs located within the or distant enhancer in the training set. To solve the problem, a PEP (Pessimistic- region of genes may alter the binding of TFs with DNA and Error Pruning) algorithm was exerted on the ID3 decision subsequently regulate [20]. The suspected tree classification algorithm. PEP is the most accurate top- risky SNPs are classified by ID3 decision tree algorithm based down pruning strategy which deals with the pruning problem on four features of significant position on genes, mapping on without separating the training set. 𝑇 putative enhancer region, mapping on distal interaction, and We define a decision tree which grows on a large scale 𝑇 the region where the SNPs are located [21]. based on the training set of SNPs with their loci features. 1 is 𝑇 𝑇 The decision tree algorithm chooses the attribute with anonleafnodeset, 2 is a leaf node set, and 3 is for all nodes 𝑇 𝑇 =𝑇 ∪𝑇 the maximum information gain after it is split, and the of .Theformulais 3 1 2. 𝑟(𝑡) 𝑡 algorithm searches the decision-space by way of top-down Before pruning, we define as the error rate of node greedy algorithm. 𝑆 is defined as the training set of SNPs in the decision tree. The formula is with their loci features, and the training set is divided into 𝑒 (𝑡) 𝑛 𝐶={𝑆,𝑆 ,...,𝑆 } 𝑟 (𝑡) = . classes. That is, 1 2 𝑛 .Thenumberofthe 𝑛 (𝑡) (11) training instances in 𝑖th class is defined as |𝑆𝑖|=𝐶𝑖.The number of the training instances in 𝑆 is |𝑆|.Theprobability 𝑛(𝑡) isthenumberofsamplesinnode𝑡,and𝑒(𝑡) is the number that a training instance belongs to the 𝑖th class is 𝑃(𝑆𝑖).And of samples that does not belong to node 𝑡 actually. a formula is defined as We define 𝑇𝑡 as a subtree of the decision tree 𝑇,and𝑡 is 𝐶 𝑇 𝑇 𝑃(𝑆)= 𝑖 . therootnodeof 𝑡. So the error rate of the subtree 𝑡 is 𝑖 |𝑆| (6) ∑ 𝑒 (𝑠) 𝑠∈𝑆푡 For the training set 𝑆, 𝐻(𝑆) is defined as the information 𝑟(𝑇𝑡)= . (12) ∑𝑠∈𝑆 𝑛 (𝑠) entropy of 𝐶,andwehavetheformula 푡

𝑛 𝑆𝑡 is the leaf node set of subtree 𝑇𝑡, and we define 𝑆𝑡 = 𝐻 (𝑆) =−∑𝑃(𝑆) 𝑃(𝑆). 𝑖 log2 𝑖 (7) {𝑠1,𝑠2,...,𝑠𝑛}. 𝑖=1 Apparently, the formula for error rate of the subtree 𝑇𝑡 The greater the value of information entropy 𝐻(𝑆) is, is binomial distribution. We define a continuity correction 𝑟󸀠(𝑡) the smaller the degree of uncertainty for the division of 𝐶 factor in order to make the binomial distribution is. The attribute 𝑇 is selected as the test attribute which is approach the normal distribution. And the formula is thelocifeaturesofthetrainingsetSNPs,andthevalueset 󸀠 𝑒 (𝑡) +1/2 for attribute 𝑇 is 𝑇={𝑡1,𝑡2,...,𝑡𝑚}.Theprobabilityofthe 𝑟 (𝑡) = . 𝑛 (𝑡) (13) attribute belongs to 𝑖th class when 𝑇=𝑡𝑗 canbeformulated as Therefore, we deduce the continuity correction factor for 𝐶 𝑖𝑗 the subtree 𝑇𝑡.Theformulais 𝑃(𝑆 |𝑇=𝑡)= . (8) 𝑖 𝑗 𝑆 | | 󵄨 󵄨 ∑ [𝑒 (𝑠) +1/2] ∑ 𝑒 (𝑠) + 󵄨𝑆 󵄨 /2 󸀠 𝑠∈𝑆푡 𝑠∈𝑆푡 󵄨 𝑡󵄨 𝐶𝑖𝑗 is the number of training instances which belongs to 𝑖th 𝑟 (𝑇𝑡)= = . (14) ∑𝑠∈𝑆 𝑛 (𝑠) ∑𝑠∈𝑆 𝑛 (𝑠) class. 푡 푡 Complexity 5

󸀠 In order to simplify the formula, we define 𝑒 (𝑡) as the Table 1: Part of the classification of training set. errorsamplenumberinsteadoferrorrate.Sotheerrorsample number of node 𝑡 in the decision tree 𝑇 is SNP bda td Enhancer Gene region Class rs7524102 Y Y Y Intergenic C 1 𝑒󸀠 (𝑡) =𝑒(𝑡) + . rs34920465 Y Y Y Control region D 2 (15) rs6426749 Y N Y Control region G

Therefore, the error sample number of the subtree 𝑇𝑡 is rs1430742 N N N Cording sequence B rs6929137 Y Y Y Missense A 󵄨 󵄨 󸀠 󵄨𝑆𝑡󵄨 rs479336 Y Y N Cording sequence K 𝑒 (𝑇𝑡)=∑𝑒 (𝑠) + . 2 (16) rs11898505 Y N Y Intergenic F 𝑠∈𝑆푡 rs17040773 Y Y Y Cording sequence E Similarly, the formula for the error sample number rs344081 Y N Y Cording sequence H of subtree 𝑇𝑡 is binomial distribution. And the standard rs6909279 Y Y N Intergenic I 𝑒󸀠(𝑇 ) deviation for 𝑡 is defined as (a) The first column is part of osteoporosis GWAS-associated SNPs; (b) the column of “bda,” “td,” and “enhancer” means whether the SNP is on 1/2 󸀠 󸀠 significant TFs binding affinity, mapping on distal interaction, and mapping 𝑒 (𝑇𝑡)×(𝑛(𝑡) −𝑒 (𝑇𝑡)) 󸀠 on putative enhancer region; (c) the last column is the category the SNP SE (𝑒 (𝑇𝑡)) = [ ] . (17) 𝑛 (𝑡) belong to.

Finally, we deduce from formulas above that the subtree 𝑇𝑡 will be cut if the node 𝑡 meets the condition: osteoporosis GWAS-associated genes and interactors. We used the common Protein-Protein Interaction databases, 󸀠 󸀠 󸀠 𝑒 (𝑡) ≤𝑒 (𝑇𝑡)+SE (𝑒 (𝑇𝑡)) . (18) such as Human Protein Interaction database (HPID) and General Repository for Interaction Data (GRID), to find The process of the PEP algorithm is as follows: the interactors which had interactions with the osteoporosis GWAS-associated genes and their interactions. Then, we Algorithm: PEP obtained the interaction network graph by Cytoscape v3.4.0. Begin Figure 2 is the PPI of osteoporosis GWAS-associated genes. The result was verified by 10-fold cross-validation based Input: decision tree 𝑇 before pruned on the data set of osteoporosis GWAS-associated genes and Output: decision tree 𝑇 after pruned SNPs.Wedividedthedatasetof129osteoporosisGWAS- (1) Get the nonleaf node set 𝑇1 of the decision tree 𝑇 associated lead SNPs and 222 SNPs linked with them into 10 𝑘=1 (𝑇 ) samples. One sample was then randomly chosen and saved as (2) For to length 1 the validation set to verify the model from the 10 samples, and (3) Do get a subtree 𝑇𝑡 whose root node is theother9samplesweresavedastrainingset.Theverification

𝑡[𝑘] (𝑡[𝑘]1 ∈𝑇 ) process was repeated 10 times so that each sample was the (𝑒󸀠(𝑡) ≤󸀠 𝑒 (𝑇 )+𝑆𝐸(𝑒󸀠(𝑇 ))) validation set once, and the accuracy was calculated every (4) If 𝑡 𝑡 time. A 10-fold cross-validation was completed by the process (5) Then delete 𝑇𝑡 above. 𝑘 𝑘>10−3 (6) Else 𝑘++ We set a threshold ( )asaresultofthe validation. The recall was calculated, which was the true (7) End positive result to positive result ratio. The 10-fold cross- End validation was repeated for ten times and the average recall rate of every validation was calculated. The result was shown We classified the suspected risky SNPs effectively based in Figure 3. on their loci characteristics and studied their functions The classification result was also verified by 10-fold cross- according the ID3 decision tree algorithm and PEP. validation.TheosteoporosisGWAS-associatedSNPswere used as the data set. The SNPs of training set were classified 3. Results based on their loci features. Part of classification of the training set was shown in Table 1. We classified the SNPs By the end of 2014, nine GWAS and nine meta-analyses of validation set through ID3 decision tree algorithm and hadreported107genesand129SNPs(leadSNP)thatwere recorded the accuracy of classification, which was the pro- associated with BMD, osteoporosis, or fractures with a signif- portion of classification accurate samples to all the samples. −8 icant threshold of 5×10 . 222 SNPs linked to osteoporosis Then, the process of validation was repeated for ten GWAS-associated lead SNPs had also been identified by using times and calculated the average accuracy rate and average LD information in the Caucasians population via HapMap classification reliability. The result was shown in Figure 4. website[9].Moreover,weobtained107knownosteoporosis We also used genome-wide association studies (GWAS) GWAS-associated genes which showed significant connec- of type 2 diabetes (T2D) data as negative data to verify tivity among proteins. And there were interactions between our method [22]. 50 lead SNPs of T2D were obtained with 6 Complexity

Dlg4 TGFBR2 MAPK8IP2 SDC2 KIAA2018 CHEK1 C16orf95 DCDC1 BOD1 IER2

KAL1

GALNT3 MEPE

DDX42 KPTN FMN2 CASP1 NFYC Skp1a GAN

APPBP2

ZBTB40 BTN2A1 APBB2 WNT5B

POLR3E DDA1 ERP44 NUFIP1 Tube1 G6PD PKDCC SH3GL3 LYPD6 LDHB Appl1 Traf2 Anapc2 MASTL ABAT PAGE3 LNPEP PGD SORD DAPK2 CDC42EP4 FER DNM1 FERMT2 MORF4L1 CLCN1 HIBCH DNAAF2 ALMS1 Copa Blzf1 CADPS Ddb1 CPT2 DNM2 SCPEP1 CAPNS1 STK24 PRDX5 PIN4 EFNB2 Ruvbl1 Kctd5 GLA FHL3 FAM161A NDUFB10 IGSF8 URB1−AS1 TP73 SH3GL2 C19orf57 RAB17 SYT17 PAGE1 GBE1 UBA5 L3HYPDH PRKAB2 MTOR Mtx1 ASNS MAB21L3 SPACA1 Mtx2 KIF2A GOT1 CAPN2 MRPL28 THBS3 ELK1 PDIA4 MOB1A ABT1 LIN7A NEBL TBRG4 DNMBP C1orf112 TXNL4A MYO15B RAPH1 RPE KRAS TBCB SMN2 SNX5 GNAQ HMGB3 ACP6 MFAP1 BSG FAHD2A BCL2A1 GPX1 TGFBI KIF2C EPS8 EIF5 POMK CCDC106 C12orf10 ISCU KIF11 GNAO1 MESDC2 RASD2 LYRM2 PHYH CSDE1 SNX1 SCN2B COL1A1 SNAP47ZCCHC10 RUSC1 LIPH DNAJC4 CENPL RSL24D1 CRH EBNA1BP2 CHST9 FZD8 CEP57L1 BLOC1S2 HAUS8 RHEBL1 UBE2E2 Ucn TCEB3 C14orf105 BET1 XRCC4 ETV6 LOC728175 TSG101 MRPL10 IGF1 HBB gag−pol SUSD4 SUCLA2 FASTKD3 CAPRIN2 AMZ2 BYSL AURKAIP1 UCN C2orf68 ATF7 TMEM263 tat SLPI AP1M1SERPINE2 ITIH1 TK2 CHD1 DPP4 FCHSD2 GSTK1 CAPN15 PARP16 TRMT61B RGSL1 NANOGNB SMG7 ALDH7A1 B3GALT2 SNAP29 FOXH1 DNM3 UPF3B CCDC70 PPP1R18 RBM17 C16orf72 DNAJB1 EIF3K LRRC4 HDAC8 KRTAP10−1 SLC16A3 DENND2D APOA1 DSN1 RP1L1 SCNN1G SMG5 RBM45 NDE1 MORF4L2 BRD3 TNNI3 PSG1 CORO1A C8orf33 ZGPAT FAM9B USP8 ZNF804A CTNNBL1 STARD3NL SOST RMND1 CISD2 DSEL MINPP1 FOXB1 MDFI POTEI ZNF410 ITGB1 FOXA2 PLXNB1 HIST1H1A TERT PRPF3 CMTM6 ZNF638 MRPL23 ALG8 ZNF614 ZNF567 ATP5F1 NACA LMO2 CDC23 YARS2 SEC24C RPL15 EDF1 CTR9 HBA2 CRHR1 KHSRP TCEANC POLL F2 MAP1A ITGAV TEAD2 CYB5R3 MNT ATF2 SP7 RIMS4 RHOU ACTG1 HIST2H2BE NQO1 MRPL9 SNX20 SDCBP2 LNX1 ANKRD36B STK39 RFPL4B TXNDC11SMAD6 AKT1S1 ARF4 COL20A1 MEOX2 CACNG4 PEPD GRIPAP1 SNAPIN HPS1 CSNK2B EIF3G PTRH1 APOC3TERF1 MDFIC FOXS1 CDCA7L KBTBD7 RNF19A LRP5 PRKACB KCNIP1 SDCBP SUMO1 Gsk3b PTH1R GOPC SSB TPT1 FOXF1 GRB2 PLD3 MAP3K4 ATXN1 SAV1 ZNF41 C20orf195 DNAJC15 TTC1 ITGB5 RAI1 ZNF775 PAX6 FERD3L DCX IGFBP5 CC2D1B WEE1 RBM8A MITF NR4A1 CCDC116 RAPGEF2 Apc CWF19L2 USP42 BANF1 DLG5 ERH SMARCB1 DKK1 PCBP1 AMOTL2 MARCKS GAK ROBO2 ITGA4 RAD51B CHRAC1 DAXX MATR3 C1RL CMTM3 RAB6B SYNPO PNPLA2 SMG6 CTSD B9D2 CEP95 TBC1D8 MYO18A GPRASP2 MYOTPIN1 Ctnnb1 CDK20 CPSF3 SMAD9 SAAL1 GDI1 CCDC88A ASB6 STK3 GFOD1 PPP1R21 CASQ2 ETV1 CSNK1D HMGN1 SGTA ADPRHL2 Sept9 TNKS NOL4 PEMT MPPED1 Pml PAX9 CTNNA1 HIST2H4A FOXL2 TMEM30A TPM1 NCK2 KIF21B SSX1 CDK18 SMARCC1 ARHGAP21 VPS11 PYGO2DAB2 RASGRF1 TMEM168 CHST15 PLCB1 DPP8 NCKAP1 ATP6V0D1 Flnb HIST2H3C LRP6 MAPK11 RAB10 CDK19 CPED1 ARRB2 HNRNPDL ITSN1 BAD FOXQ1 TXN2 ERC2 EPB41L3 SALL1 AMER1 FBXW11 EGF EN1 Myh10 DVL2 PAN2 gsk−3 SUPT20H Myo1c TAZ MAP2 CSNK1A1 MBP FOXE1 SLC39A6 AFF4 SPP1 PYGO1 SOX4 MAPT CARM1 FOXP3 COMMD1 RACGAP1 PDLIM7 PKP2 IL17RB QKI rev ARRB1 PPP2R5ACTBP2 UTP14A FOXA3 NUP62 ALB XIAP EIF2B2 CDH2 MYO5C CSNK1E SH3KBP1 SCLT1 GRIA2 MAP3K1 APPL1 SLC16A6 TCEA2 RABL2A GSK3A PMPCB SEC16A KCNMB1 Kif4 GNAS TTR Crtc2 CREB1 DNAJC11 C2orf50 RAD23A PRKCDBP TRAF3IP1 HIST1H2BB DAZAP1 NCBP1 DISC1 SREBF2 RPA3 TARDBP VCAM1 ABCE1 Ckap5 KIF23 IKBKB MTCL1 HIPK2 UNK CEP63 EED CBL ATP6V0C DDB2 BAG6 NEDD8 Arrb2TUFM MOB4 TCF3 Akap8 CAV3 STK4 GSK3B CRMP1 CENPJ WDR4 ANK1 HIST1H3EBcl7c APC RPA1 SAP130 NDUFS4 ACTG2 Actb PPP1CB PSMD13 TAF6L NPHP1 ITSN2 ECH1 UBASH3B CNBP ZAK OBSL1 PSMD14 PIM1 TADA2A TBXA2R NIN CEP162 LATS2 TAF4 PPIE OFD1 GRK5 TNIK MYBL2 TNFAIP3 PSMD3 Mapre1 DVL3 CAT EIF3L RAB6A ERC1 PLEKHA5 Myh9 RPS6KA5 FUBP3 TARS CEP44 Tpm1 FN1 HTT AXIN2 CAV2 ACTA2 RPGRIP1L EPAS1 PTHLH AXIN1 ZNRF3 UBQLN4 SASS6 Dlg5 FXR1 EAPP TWF2 CEP89 Lima1 MAGOH ATXN7L3 PSMD6 RSPO3 ELAVL1 TERF2BTRC PSMC4 MAPK4 NENF ADD1 MYO19 ITCH CDC25C KCNMA1 USB1 EPDR1 Calml3 CAPZA2 LIMA1 YWHAH USP22 LGR4 TRAIP PJA1 CREB3L4 TAF5 SLC25A11 CEP19 PCM1 SNX6 KPNB1 ATXN7 TAF10 GCM1 CEP120 TGFBR1 HNRNPLL NFKBIE SNX2 USP13 APP Ppp1cb GNAI1 YWHAG TAF5L GATA1 RPGRIP1 FBF1 MED4Flot2 PLA2G12A SIRT7 GLUD1 HERC1 USP21 UBL4A TAF12 TADA1 TTC4 TSSC4 CEP152CNTRL DCTN1 CD209 MAPRE1 CTNNB1 PRKD3 NOTCH2NL Cd2ap Tmod3 ATP2A2 ASB13 MYCMARK3 PRKAA1 KAT2A TLE1 CAMTA2 STIL SH3BP5 KSR1 TADA2B MCM2 CAMK1 DDX19A UBC SMAD4 CSRNP3STRAP SUPT7L CANX FTL COIL NFKBIA TSC1 CUL7 CAND1 ETFA CCDC94 CEP104 CEP128 DVL1 NFKB1 WDR83 FGFR1OP KIAA0101 SFN MCM5 PPIA DDB1 SLC25A3 DDX20 TCF4 SQSTM1 PCBP2 GOT2 CDK4 GPATCH1 CEP135 SRPK1SMAD5 WWP1 PRPF39 CCT7 SIK1 GAS2L1 TRIM23 IKBKG SPTBN1 MAP1B AGO3 UPF1 CDK13 KRTAP10−3 RIPK1NF2 Coro1cPPP2R4 CDK9 SSX2IP PACSIN1 MAPK7 RPA2 NLRP14 CCT6A NINL CLIP3 LMNB2 POU2F1 NR5A1 PRMT1 SUPT3H ATP5A1 PKN1NUP93 RCN3 LCA5 ETS1 TWIST2 RFXANK NUDC LDHA FGL2 TNKS1BP1CEP192 HES1 PML YWHAE Ahrr CCT4 CUL5 LYPD3 ANLN TRAF2 MARK2 SMURF2 NR2E1 SLC2A4RG SIK3 RNF126 Hoxa1 KATNA1 SPHK1 HSPA1A MAPK14 YWHAZ MEF2D CDK15 KRTAP10−8 MGEA5 CHGB CDK1 AURKC RBM23 MAP3K7 RPS13 TRAF1 IRS4 PKN2 DDOST UNC119 COPS3 CC2D2A SMOC1 SNCA DBN1 KAT2B GATA3 SF3B3SMAD7 NME1 DCUN1D1 SPTAN1 CAPZB PPP1CC RB1 PRDX4 TKT Cul3 CEP350 CYLD MYOGSTAT1 UBE2I PPP1CA XPOT CAMK2A DDI1 GINM1 ODF2 MIB2 CHUK STAM2 PON2 PMS1 FBLN1 TP53 CDC5L PCNA PPP2R2D HSPA5 AURKB MYH9 TEAD1 YWHAB LRPPRC HSP90AB3PVDAC1 CDKL4 EEF1A1NR3C1 TAF9 CCDC101 Cbx1 MYLK4 PPIP5K1 GAPVD1 YAP1 MEF2A MCM4 UQCRC2 COPS4 KLHL42 COPS5 YWHAQ ANKRA2GABARAPGATA2 CKMT1B PSMB1 CUL3 CNTROB CEBPA PHGDH Adrbk1CCAR2 IKZF3 HNRNPL MLLT4 USHBP1 PSMD11 CSRP1 EIF3F SPATA2L RANBP9 TUBA1A MYO6 YY1 TRRAPSMYD2 HIST3H3 ALDOA ATF3 SNRNP200 FAM168A TEKT2 COPS6 OPTN SKI RL2 PRKCD HIVEP3 CBFB MYCN MCM3 LTA4H UBQLN1 AMFR SRPK2 STAU1 TRAF5 MAF PIK3R3 HIC1 PABPC1 MBD3 MARS SMARCAD1 TRAF6 FLNA CTBP1 DNAJA1 PGK1 SPATA2 EWSR1 JUP HIST1H1C HCFC1 NOTCH1 DSP ENO1 FASN TUBB RABGAP1L PPP2CA QARS IQGAP1 XRCC6 SPDEF ACLY HGS RFWD2 PAK4 PSMD4 UBE2S DDX58 PCMT1 RPL13 SMURF1 IRS2 GPS2 OGT CALCOCO2 SCX Traf5 PARP1 MED13 COPB1 Cbx3 STXBP4 RPL5 SOX9 RUNX3 DDX47CIITA RBBP6 SOX18 RELA HSD17B10 PSMA5 Ywhae RARS TBX3 JAG1 FBXW7 USP32P2 HDAC6 RPL9 PSMC6 RBM14 PIAS4 GANAB TFEB SF1 NEURL1 PSMA2 PMS2 TRPA1 VCP HECTD1 SMAD3 KAT6B MECOM BMPR1A NFATC1 NFS1 CCT5 PSMC2 BARD1 MYO1B SKILZBTB16 EEF1G NEU1 PKMPPP2R2A SERTAD1 EEF2 CAV1 CCT8 SFXN1 MCM6 UBE2E3 INSIG2 PUF60 MEF2C HDAC5 DDIT3 MIB1 GTF3C1 PSMC3 TNFRSF11AZNF512B PRPF19KLF4 NPEPPS CS HK2 SCAP PLK1 NOP58 PIK3R2 HDAC7 HIF1A WNT16 NCSTN HMGCR CALM1 EHMT2 RPS3 MDH2 PDS5A XPO1 SMAD1PDLIM1 BCL6 CSE1L NOTCH2 ARID3B FTH1 CEP290 IKBKE RUNX2 ILF2 HNRNPA1DHX15 HSPB1 DYRK1B ASCL1 SRRM2 SMAD2SIRT1 POLR2I MED17 HNRNPD STAG1 PRKDC RBM25 KDM5B BRMS1 MAFF COPA SMPD4 MRPL45 ARHGEF12 BCL3 ACAT1 TRIM28 CAD COG5 HSPA6 ctnnb1−a NOP2 ALYREFPPP2CB SIK2PRPF8 ZBTB7B HOXC11 AHCY NOTCH3 SYMPK UBQLN2 SREBF1 PSMA7 SF3B1 H3F3A CHD1L LMNA DDAH2 NTRK1 CDK2 BCAS2 IKZF1 PSMD2 DDX39B JDP2 DDX39A DDX52 AKT1 ZBTB7A LARS POLR2J CEP250 PTBP1 LEKR1 TADA3JUN EPRS Cbx5 PRMT5 NUP205 CUL1 RBBP7 STAT3 CLTC TCP1 CKB SUGP2 PSMD12 IFRD1 SMARCA4 MAPK1 EIF4A1TFRC MCM7 EIF3B CPSF2 RNF139 MED12 HDAC4 TOP1 TRAP1 HIST1H2BK NUP107 FUBP1 ANP32A DDX54 HSP90B1 PDIA6SF3A1 APLP2 EEF1D HK3 Flot1 RNF5 RPS8 DNTTIP2RPN1 RBM39TUBG1 HDAC9 RAD51 XRCC5 PHB PRKAB1 TNPO1 EIF3C NOVA1 VPS13D MED23 C1QBP TMEM194A TOP3B HSPA14 VARS SUZ12 SVILRNF2DNAH2 POLR2B FHL2 RBMX CCT3 IKZF2 HSP90AB1 PTEN SMC1A POLR2G RPS11 P4HB EFTUD2 CAMK2D IARS UNC45A REST SHC1 XBP1MTHFD1 TBL1XR1 EMD TUBB4B ADRM1 EZH2 SYNCRIP ISL1 RPL7A SMARCE1 SIN3A SLC25A41 SIX5 TLE3 CORO1C SKP2 EP300 KLF5 FOS PSMB5 BAG2 ZFHX4 CAPZA1 SUV39H1 RPS3A AHRR TBL1X INSIG1 GSTM3 PTPLAD1 CSNK2A1 MYEF2 IMPDH2 RRBP1 RANBP2 PSMC5 LYARSTUB1 RAN RPL36 PNRC2SRSF1 PRDX3 DTNBP1 ORC3 RPS26TBK1 TAB3 CREBBP RPSA HIST2H2AC PSMB3 TRAM1 TAB2 NCOR1 PLECHNRNPF MRPL38 ARID3A MTA2 HDAC3 HR USP9X SATB1 ESYT1 DNAJB11 BAG1 CCND1 HSP90AA1 S100A8 TFF1 PRKD1 SATB2 STRBP SMARCA2 RAD50 MED16HDAC1 NUP214 CETN3 EIF4A3 TRIM25 RUVBL1 LGALS3BP HAUS2 ADCYAP1 MRPS30DARS2 ATP6V1H SKP1RPL22 AP2M1 SP1 GNB2 NUMA1PGAM5 PSMD1 SPIDR HELLS FOXL1 ADAR NPM1 ACTC1 RBBP4 HIST4H4 ALDH18A1 POC5 HAUS7 FBXO6 POLR2D EPB41L5 SET ELMSAN1 SMARCA5 NCAPD3 NR2C2 TRIM24 CCT2 DDX1 IGF2BP1 UTP20 PPP6R1 EIF4G1 APITD1−CORT KAT5 MED1 RPLP0 NONO CDK7 IKZF4 NUP98 KIAA0368 MYOD1 RBBP5 PHB2 NCOA2 MTA1 POLR2H RPS4X HSPA9 INTS3 LGALS7 MEN1 NR2F1 NR0B2 TSC2 EPSTI1 RPL10AACTN4 RNF31HNRNPH2PRKD2 CKAP5 STOM PA2G4 HSPA8 DDX17 RPL4 ILF3 TPI1 SYT6 RALY ESRRA SF3B2 HIST1H3A VDAC2 HNRNPUL2 UBE4A MAP7D1 MPG DARS GCN1L1 BCKDK TMEM17 YARS HDAC2 RLIM HIST1H2BG RPL7RPN2 SMYD3 COPB2 POLE TRAF3 SMC3 MDM2UBA1 NCOA3 GNAI2 PRR12 PNKD BAZ1A NOC2L SMC4 RUVBL2 PGR RPL18 ARHGDIA Cetn2 GTF3C5 ORC2 YTHDF2RABGEF1 ASH2L H2AFXYBX1 DHX9 MED20 PRDX6 ATP1A1 NELFCD MDN1 CEBPB HSPA4 POLR2CTDG NFXL1 NOL9NOP9 NCAPD2 ARID5A KDM4BSP2 UBTF GAPDH SFPQ UFL1 PI4KA POU2F2PPARGC1A ATP5B GNB1 OXLD1 BIRC7 ACTR3 HEXIM1 ASCC1 HNRNPA3 LRIF1 NSF ING1 PELP1 MYO1C SERPINB4 WDR5 SRSF2 Kcnk1 MYO9BKIF22 IRF9 RNF4 CDC25B USP7 DHFR RHOA MKI67 MYL6 ESR1 ISG15 Tada3 FMR1 ATAD3A SGSM1 PDE6D SMARCA1 HNRNPM GSNNCOA6 SLC25A1 GADD45GIP1 FEN1 NSDHL TRMT2A UBE4B POLR2F SPOP UTP6 MED24 IRS1PIK3R1 RPL3 BCORL1 HERC5 NCAPH SIN3B SRC NCOR2 BCAR1 HIST1H2BD ORC4 LMO4 TUBA1CCCNT1 NRIP1 MED21 MCOLN3 NAA40 RANGAP1 NCAPG HNRNPA2B1 PTGES3 LCOR FHL1 NCOA7 RAB5C Atp2a2 HEATR1 ARFGAP2RPL37A CBLL1 AKT2 NCOA1 DDX5 STAT5A POF1B RHOD Cep135 MTCH2 MPRIP POU4F1 HNRNPR GNAI3 GTPBP4 OAT CLCN7 RIF1 TOMM40 WNT4 SETD7 FOXK2 DYNC1H1 BTAF1 NXF1 RPS12 ANGPTL4 SCRIB SAFB MGMT NAT10 FOXO1 GRIP1 USP49 RPL27A CIRBP RBCK1 RAC3ACTB MNAT1 BCAS3 UBAP2L HSPD1 DCUN1D5 GTF3C4 CDKAL1 POLR2L ALKBH7 BLOC1S1BRCA1 NELFB KRT8 ZDHHC7 SLC25A6 LTBR FAM177A1 CENPB GBF1 RPL17 SUPT16H FAM46A Ktn1 SLC33A1 AARS2 ERBB2 NR2C1 MED14 HNRNPA0 GLCE MAP2K3 MMS19 POLR2AKMT2DHIST1H2BIFKBPL SRSF6 RINT1 SOD1 FAF2 SERBP1 DDX3X PAK1ESR2 IGF1R RPS9 PIK3CAKLK9 SHMT2 TMOD3 Nr1i2 CLTCL1 CCNC TAF2 SGK3 SRSF7MSH2DDX27 TAF1A TNFSF11 GOLGA2 UBE3A TTC5MUC1Nsd1 MSX2 TOMM70A NUP133 NCL AKAP13 PFN1 HOXD9 FBXO18 RNF13 Nr0b2 SNRPN PPP2R1A GOLT1B POP1 SLC25A20ATAD2 ATP9A PRKCSH PASK LMF1 ARHGAP1NUP153 TRIP12 DHX30 HADHA DUT GTF2H1 PIAS2HDLBP LGALS8 Nup98 N4BP2L2 SBF1 FGFRL1 HIST1H2BEPQBP1 TCOF1 U2SURP ITPA BBS10 SMC2 EIF4G3 TFCP2 TMEM216 EPPK1 CDK11B SOS2GNB2L1 MED10 TBP RNF40 DEK UBE3C RPS15A TOP2B KDM5A E2F6 Vps28 HOXC10 NDC80 AHR SP3 TTK PATZ1 ABCF3 YTHDF1BCOR PSMB9BTF3 FIP1L1 PTMA RPS16 P2RX4 TCTN2 RAC1 EGFR UIMC1 RPS2ACTR2 HNRNPK Tmed2 SAMD1 SFRP1 ZNF398 SUPT6H TCTN3 FRAS1 CHD6 GADD45G SNRPD1 PRMT2 Nr5a1 AGFG1 RPL13A Rab5c CBWD3 BORA FAM213B MED25 WDR36 RPL23A OTUB1 MED6 NCOA4 VWA9 TECR PSMC3IP HIST1H2BFSASH1 MYBBP1A PAK6 TFFARSA B4GALT7 CITED1FKBP5 RPL21 PPARGC1B MYL6B C1qbp WNT5A SFRP2 PDXDC1 BNIP2 RTCB PAGR1GNL3 Gadd45bPDIA3 KRT19 RPS6KA3 POLR2E ZNF131NGG1 RPL10 WHSC1 TMEM237 GOSR1 CBR1 PPP5C CDKN1A HNRNPA1L2 TXNRD1ARG1CUEDC2 ANP32B WDR18 EIF2S2 NR1H4 ATP6AP2 TMEM63B G3BP2 FBXO45POU4F2KDM1A MVP AP2A2 RPL36AL CRELD2 WLS PHAX FSD1 ZDHHC21 CHD9 TUBB1 NUDT21GNB4 SLC25A5 MYH14 FTSJ3 RMDN3 Bmpr1a RAD51C CHEK2 HIST1H2BC NCAPH2 PSPC1 XRCC3 GALNT7 HNRNPC KARS ZC3HC1 MAPKAPK2VAV3 RRP12 RPL11 MYLK2 CHD4 ARSB KAT6A RPS6KA1 EVC2 FAM96B KIAA1377 DKC1 SAFB2 SERPINH1PRKCZ PFKP CDC42 PPID RBFOX2 HSPA1L PSIP1 POLR2K NCCRP1 DAP3 WNT7B CIAO1 CRIPAK RBM28 YBX3 ZNF675 RHOC SEPT8 ARID1A LUC7L3 MED27 S100A7 SERPINB5 NR2F6NFYA SNRNP35 ZNF182 MBD2 TAF1B PRDM2 TOPBP1 AHNAK NEK8 SRA1 EXOSC10 AP1B1 RPL28BRI3BP ATP5G1 FKBP4 H3F3B CREB3 LCK TIA1 KIF1A THEM6 PGC MRPS9 RPL24TCERG1 Mms19 SRSF3 CCDC22 SEC61B BUD13 ND1 TNFRSF11B TCEB1 RPL29 WIPI1 TBL2 SHOC2AIFM1 OPRD1 OPRM1 MBTPS1 GFPT1 ABCF2 RPL14PPIB AKAP8 RPS14 GNL2 RPL8EIF6 CXCL1 SCYL2 HPCAL1 APOD BAP1 SND1 CFL1 FBXO25 GADD45A DSG1 GRWD1DDX50 DDX21 MOV10 MED9 EP400 FOXR2 RPS20 HNRNPAB FLNB KRT18 WDR44 BCLAF1MTDH U2AF2 PRDX2 UTP3 Cdc16 HTRA2 TOP2A CD244 DYX1C1FLII RPL18AANIB1 PPP2R3A Ncstn GIPC1 EIF5A CDK8 SEPT2 DNMT3L RPS18 RPS24 A4GNT AGTRAP NMD3PPAT RSL1D1 RRP1B PIAS1 MGST3 NSFL1C BNIPL RPS6 NCAPG2 PHLDA3 IRF2BP1 NSD1 NAP1L4 RPL19KLK5 ACTN1 LPAR6 B3GNT2 OPRK1 HTR3C ARL13B RBX1 SMARCD1 KRT10 DDX18 NME2 USP16 SRSF5 ST3GAL4 RPL6 FOXO4 ANXA2 RDX SARNP FUS G3BP1 CAPRIN1 DNAJB9 MIF4GD THRAP3 GLRA2 FAS RPL12 WHSC1L1 USP47 DDX3Y CCNH ATP4A FAM210A SARS2 GLYR1 Trim24 SMARCD3RPL36AHBP1 Smc1a GTF2B MED7 RALA LDB1 REXO4 FAM3C SIRT6 PRPF4B SNAP23 SRRM1 RPS23 Mis12 SUV420H1 DECR1 CHST8 TNFSF13 POLR1C NPPA GALNT2 ACAD9 CCDC8 NOP56 GCNT2 ILK TRIP4KRT1 EIF3A EIF2S3 Uso1 CREB3L1 CLUH KHDRBS1 NDUFS7 UPF2 HLA−C EFTUD1 TOMM22 CPN1 TOMM7 TM9SF4 DNAJA2 IER3IP1 EIF5B MAD2L1 MPV17 MRPL41 ZC3H15 FLOT1 CCNL2 PPP1R10 SNW1 SFXN3 CD300C Orc2 CCDC170 Nfyc COLEC10 B3GALNT1 PTCD3 MCU BABAM1 HLA−DPA1 DCP2

MIA3 C3 P3H4 STC2 RAD21 CLEC4G

CKAP4 TIMP3 COLGALT2 INHBE

NUCB2 PTX3 ARL8A

MBL2

TRAPPC2 MASP1

CALR MASP2

Figure 2: PPI of osteoporosis GWAS-associated genes (the pink nodes indicated those which had interactions with the osteoporosis GWAS- associated genes, and the yellow nodes indicated the osteoporosis GWAS-associated genes).

100 100 95 90 90 80 85 70 80 60 75 50 (%) Recall (%) 70 40 65 30 60 20 012345678910 10 Figure 3: Result of random walk (the ten colors of the points 0 indicated ten 10-fold cross-validation, and the same color of points 12345678910 indicated the validation process. The points connected by a line were Accuracy 𝑥 the average recall value of ten experiments. The -axis was the 10- Reliability step verification of the 10-fold cross-validation process). Figure 4: Result of ID3 decision tree (the blue credibility refers to the average accuracy values of 10-fold cross-validation, and the orange credibility refers to the average reliability value). their position features and associated genes. We searched the interactors of the associate genes from the PPI database and constructed the PPI network with the known osteoporosis GWAS-associated genes. The random walk algorithm was that not only was the computation efficiency improved, but used on the PPI network. also the accuracy rate of the result by using ID3 decision We then used PEP for ID3 decision tree to construct a tree algorithm with PEP in the identification method was simplified classification decision tree. We combined the two higher.Theimprovementisduetothefactthatwehadcut steps of the risky SNPs identification method and verified the subtrees which were constructed by the noise samples the method by 10-fold cross-validation. Finally, we found and solved the overfitting problem. While we defined ID3 Complexity 7

100 1 90 0.9 80 0.8 70 60 0.7 50

(%) 0.6 40 0.5 30 20 Specificity 0.4

10 0.3 0 12345678910 0.2 ID3 ID3-PEP 0.1 Figure 5: Comparison of two classification algorithm (the blue 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 credibility refers to the classification accuracy by ID3 algorithm, 1− and the yellow credibility refers to the classification accuracy by ID3 Specificity decision tree algorithm with PEP). ID3-PEP C4.5 Figure 6: The comparison of ID3-PEP and C4.5. decision tree algorithm with PEP in the identification method as ID3-PEP and ID3 decision tree algorithm as ID3, the result comparison of these two classification algorithm in the iden- correct and effective. Our method efficiently achieved the tification method was described by Figure 5. According to the process of identifying osteoporosis suspected risky SNPs. result, we concluded that the ID3-PEP in the identification However, there is still a need to perfect the identification method was more stable than ID3 algorithm, and it had better method. First of all, we need to search the loci features of effect for the classification problem. C4.5 is the optimization of ID3. They have the same suspected risky SNPs associated with osteoporosis and the way to learn training set and build a classification decision interactors of associated genes manually. The training set tree, but the difference of them is the way of choosing split for our method is the known osteoporosis GWAS-associated attribute. C4.5 algorithm chooses the maximum attribute SNPs, which is not large enough to identify the risky SNPs with information gain ratio to split. In order to solve the accurately. Therefore, further research is needed. Firstly, a problem of overfitting in ID3 decision tree algorithm, C4.5 workflow can be constructed to improve the identification algorithm needs to scan the data set and rank them in every process, aiming to automatically identify the suspected risky step. This calculation method and process of the algorithm SNPs’ features. In order to improve the accuracy of our have low operational efficiency. ID3-PEP algorithm solved method, more features of the SNPs should be examined, such theproblemandwasmoreaccuratethanC4.5.Wemade astheconservationofSNPsandtheinfluenceoftheSNPson a comparison of these two algorithms through ROC curve, miRNA binding site. Finally, we use our method to predict which is shown in Figure 6. Result shows that ID3-PEP is risky SNPs associated with osteoporosis by constructing the better than C4.5 in our classification. PPI network of all the human genes.

4. Discussion and Conclusion Conflicts of Interest SinceSNPplaysakeyroleintheprocessofpathology The authors declare that there are no conflicts of interest and susceptibility of osteoporosis [23], it is necessary to regarding the publication of this paper and the received findtheunknownriskySNPs.Usingthedatasetofknown funding did not lead to any conflicts of interest regarding the osteoporosis GWAS-associated SNPs and genes [8], we iden- publication of this manuscript. tifiedthegenesofsuspectedriskySNPsassociatedwith osteoporosis by random walk algorithm on the PPI network constructed by osteoporosis GWAS-associated genes and the Acknowledgments genes associated with suspected risky SNPs. The suspected risky SNPs were classified based on the features of their loci ThisresearchissupportedbytheNationalNaturalScience position and function. We used 10-fold cross-validation to Foundation of China (Grants nos. 61532008 and 31371275), verify our method. the National Social Science Foundation of China (no. The result of the experiment above showed that the 14BYY093), and the Fundamental Research Funds for the identification method for risky SNPs of osteoporosis was Central Universities (no. CCNU17TS0003). 8 Complexity

References [16]R.K.Nibbe,S.A.Chowdhury,M.Koyuturk,R.Ewing,andM. R. Chance, “Protein-protein interaction networks and subnet- [1] F. Rivadeneira, U. Styrkarsdottir,´ K. Estrada, B. V. Halldorsson,´ works in the biology of disease,” Systems Biology and Medicine, Y. H. Hsu, J. B. Richards et al., “Twenty bonemineral-density vol. 3, no. 3, pp. 357–367, 2010. loci identified by large-scale meta-analysis of genome-wide [17]S.Kohler,S.Bauer,D.Horn,andP.N.Robinson,“Walking association studies,” Nature Genetics,vol.41,no.11,pp.1199-206, the interactome for prioritization of candidate disease genes,” 2009. American Journal of Human Genetics,vol.82,no.4,pp.949– [2] D. Welter, J. MacArthur, J. Morales et al., “The NHGRI GWAS 958, 2008. Catalog, a curated resource of SNP-trait associations,” Nucleic [18] M. J. Blow, D. J. McCulley, Z. Li et al., “ChIP-seq identification Acids Research, vol. 42, no. 1, pp. D1001–D1006, 2014. of weakly conserved heart enhancers,” Nature Genetics,vol.42, [3] Q.-Y. Huang, R. R. Recker, and H.-W. Deng, “Searching for no.9,pp.806–812,2010. osteoporosis genes in the post-genome era: Progress and chal- [19] J. R. Quinlan, “Generating production rules from decision lenges,” Osteoporosis International,vol.14,no.9,pp.701–715, trees,” in Proceedings of the IJCAI-87,Milan,Italy,1987. 2003. [20] L. A. Hindorff, P.Sethupathy, H. A. Junkins et al., “Potential eti- [4] Q.HuangandA.W.C.Kung,“Geneticsofosteoporosis,”Molec- ologic and functional implications of genome-wide association ular Genetics and Metabolism, vol. 88, no. 4, pp. 295–306, 2006. loci for human diseases and traits,” Proceedings of the National Academy of Sciences of the United States of America,vol.106,no. [5]J.B.Richards,H.F.Zheng,andT.D.Spector,“Geneticsof 23, pp. 9362–9367, 2009. osteoporosis from genome-wide association studies: advances and challenges,” Nature Reviews Genetics,vol.13,no.8,pp.576– [21] A. Visel, E. M. Rubin, and L. A. Pennacchio, “Genomic views 588, 2012. of distant-acting enhancers,” Nature,vol.461,no.7261,pp.199– 205, 2009. [6] K. A. Frazer, L. Pachter, A. Poliakov, E. M. Rubin, and I. [22] M.Cheng,X.Liu,M.Yang,L.Han,A.Xu,andQ.Huang,“Com- Dubchak, “VISTA: computationaltools for comparative genom- putational analyses of type 2 diabetes-associated loci identified ics,” Nucleic Acids Research, vol. 32, pp. W273–W279, 2004. by genome-wide association studies,” Journal of Diabetes,vol.9, [7]Q.Y.Huang,“Geneticstudyofcomplexdiseasesinthepost- no. 4, pp. 362–377, 2016. GWAS era,” Journal of Genetics and Genomics,vol.42,no.3,pp. [23] E. T. Dermitzakis and A. G. Clark, “Evolution of transcription 87–98, 2015. factor binding sites in mammalian gene regulatory regions: [8] S. L. Edwards, J. Beesley, J. D. French, and A. M. Dunning, conservation and turnover,” Molecular Biology and Evolution, “Beyond GWASs: illuminating the dark road from association vol. 19, no. 7, pp. 1114–1121, 2002. to function,” American Journal of Human Genetics,vol.93,no. 5, pp. 779–797, 2013. [9] L. Qin, Y.Liu, Y.Wang et al., “Computational characterization of osteoporosis associated SNPs and genes identified by genome- wide association studies,” Plos One,vol.11,no.3,ArticleID e0150070, pp. 1–14, 2016. [10]M.J.Li,L.Y.Wang,Z.Xia,P.C.Sham,andJ.Wang,“GWAS3D: detecting human regulatory variants by integrative analysis of genome-wide associations, interactions and modifications,” Nucleic acids research,vol.41,pp.W150– W158, 2013. [11] J. Yang, H. Gu, X. Jiang, Q. Huang, X. Hu, and X. Shen, “Walking in the PPI network to predict the risky SNP of osteoporosis with decision tree algorithm,” in Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM ’16), pp. 1283–1287, Shenzhen, China, 2016. [12] K. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. Barabasi,´ “The human disease network,” Proceedings of the National Academy of Sciences of the United States of America, vol.104,no.21,pp.8685–8690,2007. [13]G.R.Abecasis,A.Auton,L.D.Brooks,M.A.DePristo,M.A. Durbin, and R. E. Handsaker, “An integrated map of genetic variation from 1, 092 human genomes,” Nature,vol.467l,no. 7311, pp. 52–58, 2010. [14]K.Lage,E.O.Karlberg,Z.M.Storlingetal.,“Ahuman phenome-interactome network of protein complexes impli- cated in genetic disorders,” Nature Biotechnology,vol.25,no.3, pp.309–316,2007. [15] X. Wu, R. Jiang, M. Q. Zhang, and S. Li, “Network-based global inference of human disease genes,” Molecular Systems Biology, vol. 4, no. 1, 2008. Advances in Advances in Journal of Journal of Operations Research Decision Sciences Applied Mathematics Algebra Probability and Statistics Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014

The Scientific International Journal of World Journal Differential Equations Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014

Submit your manuscripts at https://www.hindawi.com

International Journal of Advances in Combinatorics Mathematical Physics Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014

Journal of Journal of Mathematical Problems Abstract and Discrete Dynamics in Complex Analysis Mathematics in Engineering Applied Analysis Nature and Society Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014

International Journal of Journal of Mathematics and Mathematical #HRBQDSDĮ,@SGDL@SHBR Sciences

Journal of International Journal of Journal of Function Spaces Stochastic Analysis Optimization Hindawi Publishing Corporation Hindawi Publishing Corporation Volume 2014 Hindawi Publishing Corporation Hindawi Publishing Corporation Hindawi Publishing Corporation http://www.hindawi.com Volume 201 http://www.hindawi.com http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014 http://www.hindawi.com Volume 2014