BicBioEC: biclustering in biomarker identification for ESCC

P. Kakati, D. K. Bhattacharyya & J. K. Kalita

Network Modeling Analysis in Health Informatics and Bioinformatics

ISSN 2192-6662 Volume 8 Number 1

Netw Model Anal Health Inform Bioinforma (2019) 8:1-21 DOI 10.1007/s13721-019-0200-x


Network Modeling Analysis in Health Informatics and Bioinformatics (2019) 8:19 https://doi.org/10.1007/s13721-019-0200-x

ORIGINAL ARTICLE

BicBioEC: biclustering in biomarker identification for ESCC

P. Kakati1 · D. K. Bhattacharyya1 · J. K. Kalita2

Received: 23 November 2018 / Revised: 26 June 2019 / Accepted: 21 July 2019 © Springer-Verlag GmbH Austria, part of Springer Nature 2019

Abstract Analysis of expression patterns enables identification of significant genes related to a specific disease. We analyze gene expression data for esophageal squamous cell carcinoma (ESCC) using biclustering, gene–gene network topology and pathways to identify significant biomarkers. Biclustering is a clustering technique by which we can extract coexpressed genes over a subset of samples. We introduce a parallel and robust biclustering algorithm to identify shifted, scaled and shifted-and-scaled biclusters of high biological relevance. Additionally, we introduce a mapping algorithm to establish the module–bicluster relationship across control and disease stages and a hub-gene identification method to support our analysis framework. The C-CUDA implementation of our biclustering algorithm makes the method attractive due to faster speed and higher accuracy of results. Biomarkers such as CCNB1, CDK4, and KRT5 have been found to be closely associated with ESCC.

Keywords Gene expression · Bicluster · Primary gene · Secondary gene · Biomarkers · SSSIM · GPU computing

Electronic supplementary material: The online version of this article (https://doi.org/10.1007/s13721-019-0200-x) contains supplementary material, which is available to authorized users.

* D. K. Bhattacharyya

1 Department of Computer Science and Engineering, Tezpur University, Napaam, Tezpur, Assam 784028, India
2 Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA

1 Introduction

Esophageal squamous cell carcinoma (ESCC) is a subtype of esophageal cancer. ESCC is common in developing countries like India and China. It arises from epithelial cells that line the esophagus (Kelsen 2008). It is the eighth most common cancer globally with 456,000 new cases during the year 2014 (Ferlay et al. 2015). It caused around 400,000 deaths in 2014. This rate varied widely among countries. Due to the severity of this disease, identification of interesting biomarkers related to ESCC is highly essential. There are several ways to find the biomarkers for a given disease. In this paper, we analyze gene expression data for ESCC using a parallel biclustering approach followed by network topology analysis and pathway analysis, to identify interesting gene biomarker(s) related to ESCC.

In microarray technology, gene expression data are represented in matrix format. There are two types of gene expression data: (1) gene–sample ($G \times S$) data and (2) gene–sample–time ($G \times S \times T$) data (Mandal et al. 2018). There are generally three types of correlation patterns in gene expression data that can be used to show gene coexpression: (1) shifting, (2) scaling, and (3) shifting-and-scaling (Aguilar-Ruiz 2005). To identify such correlation patterns in an unsupervised framework with high accuracy, a number of clustering approaches have been introduced. Among these, biclustering approaches are prominent. However, most biclustering techniques consume tremendous computational time because the problem is NP-hard (Tanay et al. 2002). To address this issue, we introduce a parallel biclustering approach which we demonstrate to be capable of handling all three types of correlations during bicluster extraction in much less time. Based on the highly enriched biclustering results, we follow with gene–gene network topology analysis and pathway analysis to identify interesting biomarkers for ESCC, which have been associated with ESCC in the established literature. Additionally, to support the biomarker identification process, we introduce two effective techniques for (1) control-to-disease bicluster mapping and (2) hub-gene finding.

1.1 Problem definition

Given an expression matrix for ESCC, say $M = \langle G, S \rangle$, where $G$ represents a set of genes and $S$ represents a set of samples, and $L_p$, a list of primary genes for esophageal squamous cell carcinoma that appears in a formal repository, like (Malacards 2017). The problem is to identify and establish significant gene biomarkers (other than primary genes) related to ESCC using appropriate (1) unsupervised machine learning techniques on the gene expression data and (2) network and biological analysis without much prior knowledge. Performance of a biclustering based method for gene expression analysis is highly dependent on the proximity measure used to identify coexpressed patterns. So, identification of a robust measure that can handle shifting-and-scaling patterns for effective cluster analysis of gene expression data is a major issue. Further, most biclustering algorithms are inefficient due to the high computational cost during extraction of biclusters. So, developing a cost-effective and robust parallel biclustering technique which can extract biologically significant biclusters from an expression matrix is a prime motivation. After extraction of biologically significant biclusters, topological and biological analyses of each bicluster can help identify the biomarker(s) related to ESCC.

1.2 Contribution

The major contributions of this paper are given below:

• An overall model for the identification of significant biomarkers for ESCC using parallel biclustering, topological analysis and biological behavior analysis of gene expression data for ESCC.
• A robust parallel biclustering variant of Bhattacharya and Cui (2017) to identify biclusters with shifting, scaling or shifting-and-scaling patterns.
• An effective technique to map the biclusters across control and disease conditions for subsequent analysis.
• A weighted hub-gene finding technique to support the biomarker identification process.
• Network and biological behavior analysis of the identified genes to establish their significance in the context of ESCC.

1.3 Organization

The rest of the paper is organized as follows: Sect. 2 provides a background of existing biclustering algorithms. In Sect. 3, the proposed approach is discussed and in Sect. 4, results are reported with discussion. Finally, Sect. 5 presents the conclusion and the future direction of research.

2 Related work

Due to the large volume and high dimensionality of gene expression data, extraction of clusters with high biological significance is a challenging task. To address this issue, biclustering with parallelization has been considered as a potential solution. Zhao et al. (2009) introduced a parallel algorithm based on Hadoop MapReduce for K-means clustering. The programming technique called Hadoop MapReduce can handle large volumes of data with high efficiency. Olson (1995) reported a parallel hierarchical clustering approach with an effective proximity measure. The parallelization of hierarchical clustering has been shown to be superior in comparison to other approaches to parallelization of clustering.

Biclustering aims to extract biclusters (subsets of highly correlated genes over subsets of samples) from gene expression data that show high biological significance. Due to the need for simultaneous operations to eliminate less relevant rows and columns, it is more complex compared to normal clustering, especially for larger datasets. Researchers have developed many biclustering algorithms to mine large numbers of genes over subsets of samples to extract biclusters of high biological significance. Zhou and Khokhar (2006) proposed a parallel version of a biclustering algorithm, named ParRescue, and implemented it using MPI on a cluster of 64 nodes. ParRescue is effective in handling voluminous data using a large number of nodes. However, the biclusters extracted by it are not satisfactory from an enrichment perspective. Bhattacharya and Cui (2017) introduced a GPU-accelerated parallel biclustering algorithm which showed that GPU computing speeds up the biclustering process significantly, but was not concerned about the noisy values of gene expression data. To address this issue, we introduce a robust parallel biclustering algorithm using an effective proximity measure proposed by Ahmed et al. (2014), based on the concept of largest condition-dependent subgroups introduced by Bhattacharya and Cui (2017). GPU implementation of the proximity measure called SSSim (Ahmed et al. 2014) overcomes the scalability issue of the ICS biclustering algorithm reported by Ahmed et al. (2014). Our method also overcomes the limitations of Bhattacharya and Cui (2017), especially in handling data with outliers or noise. Further, the highly enriched biclusters extracted by our method have been found useful in the identification of significant biomarkers for ESCC in the later stage.


3 Proposed method: BicBioEC

Our method, referred to here as BicBioEC, aims to identify interesting biomarkers relevant for ESCC by exploiting parallel biclustering, condition (normal/disease)-specific topological, hub-gene centric and pathway analysis of biclusters or modules. We can identify biclusters of all three types of correlation patterns, namely shifting, scaling and shifting-and-scaling (Ahmed et al. 2014).

The BicBioEC framework includes six modules. Module 1 is dedicated to preprocessing and biclustering. Based on the biclusters obtained, module 2 identifies pairwise similar biclusters across the conditions (i.e., normal and disease) for subsequent analysis. In modules 3 and 4, we construct biological networks and identify hub-genes, respectively, to carry out topological and biological behavioral analyses in the subsequent module (i.e., in module 5) towards identification of interesting biomarkers for ESCC. Module 6 identifies and validates biomarkers or secondary genes through a rigorous process.

A conceptual framework of the proposed BicBioEC is depicted in Fig. 1. BicBioEC is able to identify biclusters of high biological significance in the presence of outliers.

Fig. 1 Conceptual framework

Some basic definitions to help describe the proposed method are as follows:

Definition 1 Primary genes: For a given disease $d_i$, a gene $g_i$ is referred to as an elite or primary gene if it appears in a formal repository (e.g., Malacards 2017) as a causal gene for $d_i$.

From Esophageal Squamous Cell Carcinoma (2017), we find that there are a total of seven possible elite genes which are related to ESCC. These are: DSE, RUNX3, CDH1, VIM, WWOX, CTTN, and CCND1.

Definition 2 Secondary genes: A gene $g_j$ is referred to as a secondary gene for a given disease $d_i$ w.r.t. a set of elite or primary gene(s), if and only if there is sufficient evidence for it to be considered a causal gene based on (i) gene–gene topological properties, (ii) pathway analysis and coexpression relationship, and (iii) established wet lab results.

3.1 Module $M_1$: preprocessing and parallel biclustering

This module performs two tasks, preprocessing and parallel biclustering. Preprocessing may include multiple subtasks such as normalization, missing value estimation and discretization. Depending on requirements and the input dataset, one can perform appropriate subtask(s) by choosing appropriate techniques. In our case, both the synthetic and benchmark datasets [GSE20347 downloaded from GEO (ESCC 2017)] chosen for analysis are already normalized, and do not include any missing value. Hence, no preprocessing is required for either the synthetic or the benchmark datasets. However, we subdivide the samples of GSE20347 into two datasets, one for the normal condition and the other for the disease condition.

The preprocessed dataset is next used as input by a biclustering technique. Our biclustering algorithm consists of three parts, each of which is described in sub-modules $M_{11}$, $M_{12}$ and $M_{13}$. Sub-module $M_{11}$ subdivides the samples into three spaces, sub-module $M_{12}$ creates the initial biclusters, and sub-module $M_{13}$ merges the biclusters with high inter-similarity towards generation of a final set of biclusters.

3.1.1 Sub-module $M_{11}$

This module is used to create the subspaces (or sample spaces) of two genes based on their expression values. Based on the trends shown by the expression values, the samples can be divided into three categories, namely upward trending, downward trending and mixed trend, for any given pair of genes.

The algorithm is described as follows:

Input:
Two gene expressions $g_i, g_j \in G$ are taken at a time for sample space creation, where $G = \{g_1, g_2, \ldots, g_n\}$ is the set of all genes in the dataset.
$g_i = \{C^i_1, C^i_2, \ldots, C^i_n\}$, the set of samples for gene $g_i$.
$g_j = \{C^j_1, C^j_2, \ldots, C^j_n\}$, the set of samples for gene $g_j$.
$C^i_k$: a sample for gene $g_i$, where $k \in \{1, 2, 3, \ldots, n\}$.
$C^j_k$: a sample for gene $g_j$, where $k \in \{1, 2, 3, \ldots, n\}$.


$C_k$: the $k$-th sample, where $k \in \{1, 2, 3, \ldots, n\}$.
$\bar{g}_i$: mean of all samples for gene $g_i$, i.e., $(1/n)\sum_{k=1}^{n} C^i_k$.
$\bar{g}_j$: mean of all samples for gene $g_j$, i.e., $(1/n)\sum_{k=1}^{n} C^j_k$.

Output:
$S = \{S^U, S^D, S^M\}$, i.e., upward-trending subspace ($S^U$), downward-trending subspace ($S^D$) and mixed-trend subspace ($S^M$). Initially, $S^U$, $S^D$ and $S^M$ are empty.

Steps:
A sample $C_k$ is added to subspace $S^U$, $S^D$ or $S^M$ based on the following conditions.
1. $C_k$ is added to the upward-trending subspace $S^U$ if $(C^i_k - \bar{g}_i) \geq 0$ and $(C^j_k - \bar{g}_j) \geq 0$.
2. $C_k$ is added to the downward-trending subspace $S^D$ if $(C^i_k - \bar{g}_i) < 0$ and $(C^j_k - \bar{g}_j) < 0$.
3. $C_k$ is added to the mixed-trend subspace $S^M$ if it follows both upward and downward trends, or $(C^i_k - \bar{g}_i) \times (C^j_k - \bar{g}_j) < 0$.

The pseudocode of the above steps is also given in the Supplementary file (Section 1).

3.1.2 Sub-module $M_{12}$: to create an initial set of biclusters, i.e., $\beta$

This sub-module operates on preprocessed data and creates an initial set of biclusters, $\beta$, for each condition (i.e., normal or disease).

Input:
$D_{G \times C}$: Gene expression dataset, with $G$ genes over $C$ samples.
base_number: User-defined maximum number of biclusters.
SSSim threshold: User-defined SSSim (Ahmed et al. 2014) correlation threshold.
Variance threshold: Variance of the base_number-th gene in decreasing order of variance.
$G' = \{g_i \in G \mid \mathrm{Variance}(g_i) \geq \text{variance threshold}\}$, i.e., the set of genes whose variance is at least the variance threshold.
tmpB: Temporary bicluster considering two genes as initial genes during bicluster creation.
tmpB(G) $= \{g_1, g_2, \ldots, g_x\}$, where $g_1, g_2, \ldots, g_x$ are the genes added to the bicluster tmpB during creation.
tmpB(S): Subspace which is considered for creation of the tmpB bicluster.

Output:
$\beta$: Set of biclusters, initially $\beta = \emptyset$.

Steps:
1. Select two distinct genes $g_i, g_j \in G'$.
2. Create subspace $S^k$, where $k \in \{U, D, M\}$, for genes $\{g_i, g_j\}$ using the steps mentioned in module $M_{11}$.
3. Calculate the SSSim similarity, $r$, between $g_i$ and $g_j$ for subspace $S^k$.
4. If $r$ is at least the SSSim threshold, go to step 5. Else go to step 10.
5. Construct a temporary bicluster tmpB with subspace $S^k$ and genes $\{g_i, g_j\}$.
6. Select gene $g_l \in G$, but $g_l \notin$ tmpB(G).
7. Calculate SSSim of $g_l$ with all genes present in tmpB(G).
8. If SSSim($g_l$, $g_x$) is at least the SSSim threshold, with $g_x \in$ tmpB(G), go to step 9. Else go to step 10.
9. If the size of tmpB(G) $\geq 5$ and the total number of samples in $S^k \geq 5$, add tmpB to $\beta$.
10. Repeat steps 2–9 for all three subspaces (i.e., $S^U$, $S^D$, $S^M$).
11. Repeat steps 1–10 until all combinations of $g_i$ and $g_j$ are exhausted.

The pseudocode for the above steps is also given in the Supplementary file (Section 2).

3.1.3 Sub-module $M_{13}$: to generate a final set of biclusters, $B$

This sub-module is responsible for the generation of the final set of biclusters, i.e., $B$. It operates on the set of initial biclusters, i.e., $\beta$, and iteratively merges the smaller initial biclusters based on the amount of overlap. The higher the overlap, the higher the chance of merging.

Input:
$\beta$: Set of initial biclusters.
bAss: Inter-bicluster similarity score.
Association threshold: Bicluster association threshold.

Output:
$B$: Set of final biclusters. Initially, $B = \emptyset$.

Steps:
1. Select a pair of biclusters $b_i, b_j \in \beta$.
2. If $b_i$ and $b_j$ share common gene(s) and bAss($b_i$, $b_j$) is at least the association threshold, go to step 3, else go to step 4.
3. Remove $b_i$, $b_j$ from $\beta$ and add bicluster $b_k = b_i \cup b_j$ to $\beta$.
4. Repeat steps 1–3 until, for every remaining pair, $b(g)_i \cap b(g)_j = \emptyset$ or bAss($b_i$, $b_j$) is below the association threshold.
5. Add all biclusters in $\beta$ to $B$.

The pseudocode for the above steps is given in the Supplementary file (Section 3).

In the above steps, bAss is used to merge two biclusters based on inter-bicluster similarity or overlap. Let $b_i(G_i, S_i)$ and $b_j(G_j, S_j)$ be two biclusters, where
$G_i = \{g^i_1, g^i_2, \ldots, g^i_x\}$, the set of genes present in bicluster $b_i$.
$G_j = \{g^j_1, g^j_2, \ldots, g^j_y\}$, the set of genes present in bicluster $b_j$.
$S_i = \{C^i_1, C^i_2, \ldots, C^i_l\}$, the set of samples present in bicluster $b_i$.
$S_j = \{C^j_1, C^j_2, \ldots, C^j_m\}$, the set of samples present in bicluster $b_j$.

Then, bAss can be represented as given below.

$$\mathrm{bAss} = \frac{\mathrm{SSSim}(b_i \cap b_j)}{\mathrm{SSSim}(b_i \cup b_j)} \qquad (1)$$

In Eq. (1), $b_i \cap b_j$ consists of $G_i \cap G_j$ genes and $S_i \cap S_j$ samples, whereas $b_i \cup b_j$ consists of $G_i \cup G_j$ genes and $S_i \cup S_j$ samples, where

$$\mathrm{SSSim}(b_i \cap b_j) = \frac{1}{|b_i \cap b_j|^2} \sum_{k_1, k_2 \in G_i \cap G_j} \mathrm{SSSim}(g_{k_1}, g_{k_2})$$

$$\mathrm{SSSim}(b_i \cup b_j) = \frac{1}{|b_i \cup b_j|^2} \sum_{k_1, k_2 \in G_i \cup G_j} \mathrm{SSSim}(g_{k_1}, g_{k_2})$$

Here, SSSim is the correlation measure introduced by Ahmed et al. (2014). As shown in Eq. (1), the value of bAss lies between 0 and 1.

3.2 Module $M_2$: mapping of biclusters

The mapping aims to find similar biclusters of genes across conditions (i.e., from normal to disease condition). In other words, it finds a pair of biclusters from both conditions, based on certain similarities. To provide support for subsequent topological and biological similarity and deviation analysis across the conditions, identification of similar or corresponding biclusters is essential. Due to different conditions, the expression values of the genes may vary for different samples across the conditions. It is hard to find an exact pair of biclusters across conditions, i.e., the genes which are in the same bicluster in the normal condition may not be in the same bicluster in the disease condition. Therefore, we attempt to find the best possible pair of biclusters, which correspond to each other over different conditions. The mapping technique described below accepts as input two sets of filtered biclusters based on the output given by Module $M_1$ and generates resultant pairs of similar biclusters across conditions.

Input:
$B_C$: Set of all biclusters having primary genes in the "stress" condition (that may cause ESCC).
$B_N$: Set of all biclusters having primary genes in the "normal" condition.
$N_1 = |B_C|$, i.e., the total number of biclusters in the "stress" condition.
$N_2 = |B_N|$, i.e., the total number of biclusters in the "normal" condition.
$M = \{(B_i, B_j) \mid B_i \in B_C \text{ and } B_j \in B_N\}$, i.e., a pair of mapped biclusters.

Output:
$R$, i.e., the final resultant mapped biclusters across the conditions. Initially $R = \emptyset$.

Steps:
1. Create a table with $N_1 \times N_2$ rows, where each row has two biclusters, one from the stress condition and the other from the normal condition. One column is for the number of similar genes in both biclusters and one for the elite or primary genes present in both biclusters.
2. Sort the table in descending order according to the number of similar genes in both biclusters.
3. Take the row which has the highest similarity and at least one similar elite or primary gene at both stages, and add it to $R$, i.e., $R = R \cup (b_c, b_n)$.
4. Delete all the rows from the original table which have $b_c$ or $b_n$. If at least one row is present with similar elite or primary genes present in both stages, then go to step 3 again. Else go to step 5.
5. Return the resultant mapping $R$.

For further understanding of the above steps, let us consider an example with a dataset of ten genes (say, $\{g_1, g_2, \ldots, g_{10}\}$) for a given disease. Also, let $g_3$, $g_4$, $g_7$ and $g_{10}$ be elite or primary genes. Assume that the genes are biclustered into three groups in the normal condition and four groups in the disease condition. So, for the two sets of given biclusters corresponding to the two conditions, our task is to find the best possible pairs of similar biclusters across the conditions. The bicluster details for both conditions are given in Table 1.

Table 1 Biclusters of genes for different conditions

| Normal: bicluster ID | Normal: genes | Normal: elite genes | Disease: bicluster ID | Disease: genes | Disease: elite genes |
|---|---|---|---|---|---|
| 1 | g1, g2, g10 | g10 | 1 | g1, g2, g6, g10 | g10 |
| 2 | g4, g7, g9 | g7, g4 | 2 | g8, g3, g9 | g3 |
| 3 | g3, g6, g5, g8 | g3 | 3 | g4, g7 | g7, g4 |
| – | – | – | 4 | g5, g10 | g10 |
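The mapping steps of Module M2 can be sketched in Python as follows, using the biclusters of Table 1 as input. This is a minimal illustration rather than the authors' Java implementation: biclusters are assumed to be plain sets of gene names, the function name `map_biclusters` and the variable names are hypothetical, and ties in the number of common genes are assumed to be broken by table order, which the text does not specify.

```python
# Sketch of the Module M2 mapping steps; names and data layout are illustrative.

def map_biclusters(normal, disease, elite):
    """Greedily pair normal and disease biclusters (dicts of id -> gene set)."""
    # Step 1: one table row per (normal, disease) pair.
    rows = []
    for n_id, n_genes in normal.items():
        for d_id, d_genes in disease.items():
            common = n_genes & d_genes
            rows.append((n_id, d_id, len(common), common & elite))

    # Step 2: sort by number of common genes, largest first.
    # Python's sort is stable, so ties keep their table order (an assumption;
    # the text does not say how ties are broken).
    rows.sort(key=lambda row: row[2], reverse=True)

    # Steps 3-4: take the best remaining row that shares an elite gene, then
    # skip every later row that reuses either of its biclusters.
    mapping, used_normal, used_disease = [], set(), set()
    for n_id, d_id, _, common_elite in rows:
        if common_elite and n_id not in used_normal and d_id not in used_disease:
            mapping.append((n_id, d_id))
            used_normal.add(n_id)
            used_disease.add(d_id)

    # Step 5: return the resultant mapping R.
    return mapping


# Biclusters and elite genes from Table 1.
ELITE = {"g3", "g4", "g7", "g10"}
NORMAL = {1: {"g1", "g2", "g10"}, 2: {"g4", "g7", "g9"}, 3: {"g3", "g6", "g5", "g8"}}
DISEASE = {1: {"g1", "g2", "g6", "g10"}, 2: {"g8", "g3", "g9"},
           3: {"g4", "g7"}, 4: {"g5", "g10"}}

print(map_biclusters(NORMAL, DISEASE, ELITE))
```

For the Table 1 data this sketch returns `[(1, 1), (2, 3), (3, 2)]`, where each pair is (normal bicluster ID, disease bicluster ID). A single pass over the sorted rows with two "used" sets is equivalent to repeatedly deleting rows and returning to step 3.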


Table 2 (with 3 × 4 = 12 rows) demonstrates the various possibilities for identification of similar bicluster pairs based on (1) the number of common genes and (2) the primary gene(s) present in both biclusters. Finally, we identify the similar biclusters as shown in Table 3.

Table 2 All possible pairs of biclusters across conditions

| Normal bicluster ID | Disease bicluster ID | Total number of common genes | Elite gene(s) present in both biclusters |
|---|---|---|---|
| 1 | 1 | 3 | g10 |
| 1 | 2 | 0 | – |
| 1 | 3 | 0 | – |
| 1 | 4 | 1 | g10 |
| 2 | 1 | 0 | – |
| 2 | 2 | 0 | – |
| 2 | 3 | 2 | g4, g7 |
| 2 | 4 | 0 | – |
| 3 | 1 | 1 | – |
| 3 | 2 | 2 | g3 |
| 3 | 3 | 0 | – |
| 3 | 4 | 1 | – |

Table 3 Mapping of biclusters across the conditions (best possible pairs)

| Normal bicluster ID | Disease bicluster ID | Matching genes | Matching elite genes |
|---|---|---|---|
| 1 | 1 | g1, g2, g10 | g10 |
| 2 | 3 | g4, g7 | g4, g7 |
| 3 | 2 | g3, g8 | g3 |

3.3 Module $M_3$: biological network construction

This module constructs biological networks, i.e., gene–gene coexpression networks (CEN) and gene–gene pathway networks (PAN), based on the resultant paired biclusters $R$ obtained from module $M_2$.

A gene–gene coexpression network (CEN) is a network (or graph) of genes where nodes represent genes and the edges between these nodes represent whether they are significantly coexpressed or not. The weight of an edge between a pair of genes represents how much the genes are coexpressed or correlated.

A gene–gene pathway network (PAN) is a network where nodes represent genes and edges between nodes represent whether they share the same pathway or not. The strength of each edge represents the number of common pathways shared by the genes.

In our experiment, we consider a network combining both CEN and PAN. We use the GeneMANIA (Tanay et al. 2002) plugin of Cytoscape v 3.6.0 (2018) to construct a combined network table of genes (i.e., CEN and PAN tables). Cytoscape v 3.6.0 (2018) is open-source software used for constructing and visualizing biological networks, like gene–gene interaction and molecular interaction networks. There are many weighting schemes to calculate the weight of an edge. In our experiment, we use the "assigned based query strategy" (Tanay et al. 2002) to calculate the weight of any edge in the gene–gene network. After creating the network table, we calculate the degree (connection to other genes) of each gene using an R network manipulation and analysis package called iGraph (2018). For visualization of the network table created by GeneMANIA (Tanay et al. 2002), we used Gephi-0.9.2 (2018), an open-source network visualization tool written in Java. The combined network of CEN and PAN is used in subsequent steps for both topological and causal analyses.

3.4 Module $M_4$: weighted hub-gene identification

Hub-gene finding and hub-gene centric analysis towards biomarker identification is an established approach (Mandal et al. 2018). In the past few years, several efforts have attempted to identify appropriate hub-gene(s) in a network. However, our observation is that there is no consensus among the results of the existing methods. Hence, we introduce a variant, yet effective, technique to identify hub-gene(s) in a biological network.

The probable hub-genes for each biological network constructed for each condition based on resultant mapped biclusters are identified using Eqs. (2) and (3).

$$\mathrm{degWeight} = \alpha \times \mathrm{scale}(\mathrm{degree}) + (1 - \alpha) \times \mathrm{scale}(\mathrm{totalStrength}) \qquad (2)$$

Here, $\alpha$ is a weight, which is a user-specified parameter. $\alpha = 0.5$ balances the two parts in Eq. (2) and works well for our analysis.

scale(degree): It scales the degree values of a gene in a network within the range (0–1] for elimination of bias. We take the highest degree of any gene in a network as 1.

scale(totalStrength): It scales the weights of all the edges of a gene in a network within the range (0–1] for elimination of bias. We take the highest weight in a network as 1.

Using Eq. (2), we compute the weighted degree (degWeight) of each gene in the network of a bicluster. The weighted degree (degWeight) of a gene represents the importance of the gene in the network considering the total degree as well as the strength of the gene.


After calculating the weighted degree (degWeight) of each gene in a bicluster, we compute the list of probable hub-genes (HubGenes) in a network using Eq. (3).

$$\mathrm{HubGenes} = \{\, g_i \in \mathrm{Bic} \mid \mathrm{degWeight}(g_i) \geq \mathrm{mean}(\mathrm{degWeight}) \,\} \qquad (3)$$

In Eq. (3), $g_i$ is a gene in the bicluster Bic whose degWeight is at least the mean degWeight of all the genes present in the bicluster (or in the bicluster network constructed in Sect. 3.3).

3.5 Module $M_5$: topological and biological behavior analysis

This module plays a crucial role in our biomarker identification process. It analyzes the network topology, observes the hub-gene(s) (whether primary or secondary genes) and associated primary and secondary genes, gene–gene causal relationships and significant deviations across the conditions. Such analysis results help distil out an initial subset of genes as secondary genes w.r.t. the given set of primary genes.

Topological behavior is represented by the characteristics of a gene–gene network based on information related to connectivity and weight of connectivity of each node. A gene with high degree or high association with other genes is considered a significant gene. Similarly, a pair of genes connected or correlated with high connection weight is considered more closely associated. From Eq. (3) in Module $M_4$, we get a list of genes which are highly significant in a bicluster. From this list, the gene(s) which have the highest degWeight (calculated using Eq. (2)) and are connected to at least one primary gene are considered "hub-genes". There may be multiple hub-genes in the case of a tie (i.e., multiple genes with equally highest degWeight and connected to at least one primary gene).

These hub-genes are further investigated for their biological behavior using coexpression analysis, pathway analysis and tissue analysis with an appropriate tool to confirm the status of the secondary genes. For coexpression analysis, we plot the graph of each hub-gene in both conditions to compare their deviation due to the conditions. In pathway analysis, we check the different pathways these hub-gene(s) share using (Geneanalytics 2018) and whether they directly or indirectly share pathways with esophageal cancer. After finding a subset of hub-gene(s) which have direct or indirect pathway sharing with esophageal cancer, we further investigate how these genes affect tissue and cell ranking of esophageal squamous cells in Geneanalytics (2018). Finally, we verify their states w.r.t. established wet lab results.

3.6 Complexity

In this section, we present the computational complexity of our biclustering algorithm, which is the heart of our method. Let $k$ be the total number of samples for each gene, $n$ be the total number of genes in the coexpression matrix, and $b$ be the base_number. In our dataset, the maximum value of $k$ is 17 and $n$ is 22,277. The time complexity of SSSim is $O(kn^2)$ (Ahmed et al. 2014). The time complexity of subspace creation for a pair of genes is $O(k)$, that of creating the initial biclusters for three subspaces (i.e., upward trending, downward trending and mixed trend) is $O(n(k + 3(kn^2 + b(b(kn^2)))))$, and that of the final bicluster generation (i.e., Algorithm 3) is $O(b^2)$. Therefore, the total complexity of the algorithm is $O(n(k + 3(kn^2 + b(b(kn^2))))) + O(b^2)$. In our experiment, the maximum value of base_number is taken to be 1000, which can be considered constant. Hence, the complexity of the algorithm considering base_number as constant is $O(n(k + 3(kn^2 + kn^2)))$, or $O(nk + 6kn^3)$. Further simplification gives the total complexity as $O(kn^3)$.

3.7 Comparison with other methods (Bhattacharya and Cui 2017; Zhou and Khokhar 2006; Orzechowski et al. 2018; González-Domínguez and Expósito 2018)

Our method differs from other existing methods (Bhattacharya and Cui 2017; Zhou and Khokhar 2006; Orzechowski et al. 2018; González-Domínguez and Expósito 2018) in the following ways:

• Unlike Bhattacharya and Cui (2017), we use the SSSim (Ahmed et al. 2014) measure to identify biclusters of all types. Consequently, our method is more robust to noise.
• Unlike Bhattacharya and Cui (2017), we identify significant biclusters based on high correlation w.r.t. a given user-defined threshold. Further, we use a bicluster association measure called bAss to merge smaller biclusters. In Bhattacharya and Cui (2017), biclusters are merged based on a score called BScore.
• Unlike BScore used in Bhattacharya and Cui (2017), which is dependent on the Pearson correlation coefficient, our measure bAss is dependent on the SSSim (Ahmed et al. 2014) measure. Consequently, our method is more robust to outliers. Further, unlike bAss, the authors of Bhattacharya and Cui (2017) compute BScore based on two sets of correlated gene pairs, M and N. For N, correlations are computed over samples which are present in a bicluster, whereas for M, correlations are computed over samples which are not present in a bicluster.


• ParRescue Zhou and Khokhar (2006) uses the minimum used as the primary language. Further, for statistical analy- sum-squared residue (MSSR) Cho et al. (2004) meas- sis, R 3.4.3 (2017) was used. ure for biclustering, which fails to determine overlapped clusters. Further, it cannot handle noisy data. In contrast, 4.2 Datasets used our method is capable of detecting overlapped biclusters. It is also robust to noisy data. In our experimental study, we used both synthetic and • EBIC Orzechowski et al. (2018) is a recent evolution- benchmark datasets. We created fve synthetic datasets ary parallel biclustering algorithm. As mentioned in using BiBench similar to the synthetic datasets of Bhat- Orzechowski et al. (2018), it fails to create signifcant tacharya and Cui (2017). BiBench is a Python package biclusters with fewer than 20 columns. As our ESCC using which we can create bicluster implanted synthetic dataset contains only 17 samples per condition, it does datasets. In our experiment, we created a total of fve not sufce our need. On the other hand, our method is synthetic datasets, two ( DC_100_1 and DC_250_1 ) of able to identify highly signifcant biclusters with a mini- which are constant type and other three ( DSS_100_1 , mum of fve samples per condition. DSS_250_1 and DSS_250_3 ) are shift and scale type. In • ParBiBit González-Domínguez and Expósito (2018) dataset DC_100_1 , one constant bicluster of size 50 × 50 is is another parallel biclustering algorithm which takes implanted and in dataset DC_250_1 , one constant bicluster binary matrices as input. Though it gives interesting with size 125 × 50 implanted. In dataset DSS_100_1 , one biclusters, it does not meet our needs as we are interested Shift and Scale bicluster of size 50 × 50 is implanted. Sim- in fnding signifcant biomarkers from non-binary gene ilarly, in other datasets also, biclusters are implanted using expression matrices for any disease. BiBench. 
Table 4 shows the characteristics of these fve synthetic datasets. Additionally, we use a real-life microar- To be best of our knowledge, there is no structured approach ray dataset for ESCC [GSE20347 from GEO (Gephi-0.9.2 to fnd signifcant biomarkers using parallel biclustering, 2018)] to establish the efectiveness of our method in iden- topological analysis and pathway analysis of biclusters. In tifying interesting biomarkers for ESCC based on condi- our approach, we provide a pipeline which takes microarray tion-specifc bicluster extractions. The dataset GSE20347 gene data for a specifc disease with two conditions (normal consists of total 22,277 gene expression values for two and disease) as input and gives a list of interesting biomark- conditions (normal and disease). Each condition consists ers relevant to the disease of 17 samples with a total of 34 samples.

4 Results and discussion 4.3 Results

4.1 Environment and platform used To establish the efectiveness of our method, experimen- tal analysis was carried out in two phases. In phase 1, BicBioEC was implemented using CUDA-C on a desktop we used fve unbiased synthetic datasets that included super computer (PARAM Shavak) with 64 GB RAM, 50TB shifted, scaled and shifted-and-scaled patterns to validate storage, NVIDIA GeForce 980 GPU graphics, 3TF compute the efectiveness of our biclustering algorithm of imple- support and 2048 cores. mentation on GPU over implementation on CPU and also We used the CUDA 7.5 toolkit (2017) for compilation and in terms of its ability to extract implanted biclusters. In extraction of biclusters in parallel. For mapping of biclusters addition, we used the GSE20347 dataset to validate our across the conditions and for hub-gene fnding, Java 8 was method in identifying signifcant biomarkers related to
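The shift-and-scale datasets described in Sect. 4.2 can be imitated in a few lines for readers who want to experiment without BiBench. The following is a minimal sketch only: the function name `implant_shift_scale`, the Gaussian background, the parameter ranges and the 100 × 100 matrix size are all assumptions for illustration and are not taken from BiBench or from our datasets. Every row of the implanted bicluster is a scaled and shifted copy of one base profile, so any two bicluster rows are perfectly correlated over the bicluster columns.

```python
import random


def implant_shift_scale(n_rows, n_cols, bic_rows, bic_cols, seed=0):
    """Return (matrix, rows, cols) with one shift-and-scale bicluster implanted."""
    rng = random.Random(seed)
    # Background: independent Gaussian noise (an assumption, not BiBench's model).
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    rows = sorted(rng.sample(range(n_rows), bic_rows))
    cols = sorted(rng.sample(range(n_cols), bic_cols))
    # One base profile shared by every row of the bicluster.
    base = [rng.uniform(1.0, 10.0) for _ in cols]
    for i in rows:
        scale = rng.uniform(0.5, 2.0)   # positive per-row scaling factor
        shift = rng.uniform(-5.0, 5.0)  # per-row shifting factor
        for k, j in enumerate(cols):
            data[i][j] = scale * base[k] + shift
    return data, rows, cols


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical layout: a 50 x 50 bicluster in an assumed 100 x 100 matrix.
data, rows, cols = implant_shift_scale(100, 100, 50, 50)
first = [data[rows[0]][j] for j in cols]
second = [data[rows[1]][j] for j in cols]
r_inside = pearson(first, second)  # correlation of two implanted rows
```

A constant bicluster is the special case in which the scaling factor is zero for every row and the shift is the same for all rows.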

Table 4 Biclustering results for synthetic datasets

Dataset | Correlation type | True implanted biclusters | Our results | Runtime of CPU implementation | Runtime of GPU implementation
DC_100_1 | Constant | 1(b1(50, 50)) | 1(b1(50, 50)) | 280.8 | 147.54
DC_250_1 | Constant | 1(b1(125, 50)) | 1(b1(125, 50)) | 9838.72 | 747.66
DSS_100_1 | Shift and scale | 1(b1(50, 50)) | 1(b1(50, 56)) | 468.18 | 142.799
DSS_250_1 | Shift and scale | 1(b1(125, 50)) | 1(b1(125, 57)) | 8838.72 | 1922
DSS_250_3 | Shift and scale | 3(b1(83, 33), b2(83, 33), b3(83, 33)) | 3(b1(79, 36), b2(83, 36), b3(69, 27)) | 2102.84 | 126.45
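The runtimes in Table 4 can be turned into speedup factors with a short calculation. The sketch below simply divides the CPU runtime by the GPU runtime for each dataset; the variable names are illustrative, and the runtimes are copied from Table 4 in whatever unit the table uses.

```python
# Runtimes copied from Table 4 as (CPU, GPU) pairs.
runtimes = {
    "DC_100_1": (280.8, 147.54),
    "DC_250_1": (9838.72, 747.66),
    "DSS_100_1": (468.18, 142.799),
    "DSS_250_1": (8838.72, 1922.0),
    "DSS_250_3": (2102.84, 126.45),
}


def speedup(cpu_time, gpu_time):
    """Ratio of CPU to GPU runtime; values above 1 mean the GPU run is faster."""
    return cpu_time / gpu_time


speedups = {name: speedup(cpu, gpu) for name, (cpu, gpu) in runtimes.items()}
smallest = min(speedups, key=speedups.get)  # dataset with the least gain
largest = max(speedups, key=speedups.get)   # dataset with the greatest gain

for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

On these figures, DC_100_1 has the smallest ratio and DSS_250_3 the largest.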


Fig. 2 CPU vs GPU implementation of BicBioEC

Table 4 shows a comparison of GPU and CPU execution times for the five synthetic datasets with implanted biclusters of various sizes. Further, the comparison between the CPU and GPU implementations is shown graphically in Fig. 2. Clearly, our GPU implementation is significantly faster for all the synthetic datasets except DC_100_1, where the gain is only marginal owing to the smaller number of instances.

4.3.1 Results for ESCC

The proposed parallel biclustering algorithm was run on the GSE20347 dataset, downloaded from GEO ESCC (2017). Of the 34 samples in the dataset, 17 are in the normal condition and the remaining 17 are in the disease condition. The biclustering results for this dataset are given in Table 5. The algorithm extracts 102 biclusters under the normal condition and 36 biclusters under the disease condition.

Table 5 The result of parallel biclustering for GSE20347 having at least one elite gene

Condition | No of genes | No of samples | No of significant biclusters
Normal | 22,277 | 17 | 102
Disease | 22,277 | 17 | 36

An effective technique based on maximum similarity among the participating genes between a pair of biclusters from the two conditions is used to map the biclusters across the two conditions. Module M2 results in a total of 22 bicluster pairs, which are given in Table 6. It also reports the number of genes common to each pair of biclusters, which varies from 6 to 643.

For each mapped bicluster, we consider all genes in the bicluster for both conditions (i.e., normal and disease conditions) and check whether the biclusters are biologically significant.

Table 6 Bicluster mapping across two conditions

Bicluster ID (normal) | Total genes (normal) | Bicluster ID (cancer) | Total genes (cancer) | Number of common genes
1 | 2253 | 1 | 1393 | 308
21 | 2093 | 2 | 2683 | 344
10 | 2446 | 3 | 3801 | 643
17 | 1858 | 4 | 1320 | 235
35 | 1497 | 6 | 822 | 111
40 | 1389 | 7 | 634 | 42
13 | 1506 | 9 | 1824 | 151
68 | 455 | 10 | 1288 | 50
25 | 1644 | 11 | 1358 | 124
8 | 1606 | 14 | 534 | 38
22 | 1200 | 15 | 332 | 39
5 | 1164 | 16 | 726 | 77
52 | 673 | 17 | 720 | 39
32 | 485 | 18 | 483 | 44
6 | 1620 | 19 | 1521 | 112
31 | 811 | 22 | 775 | 31
36 | 893 | 24 | 392 | 13
88 | 844 | 26 | 607 | 18
51 | 438 | 27 | 298 | 6
39 | 1003 | 28 | 152 | 20
38 | 675 | 30 | 532 | 26
16 | 1274 | 35 | 138 | 17
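The mapping of biclusters across conditions by their common genes can be illustrated as follows. This is a sketch under stated assumptions rather than the procedure of Sect. 3.2: it takes similarity to be the raw count of shared genes and pairs biclusters greedily and one-to-one, largest overlap first. The function name `map_biclusters` and the toy gene symbols are hypothetical.

```python
def map_biclusters(normal, disease):
    """Greedy one-to-one mapping by number of shared genes.

    normal, disease: dict mapping bicluster ID -> set of gene symbols.
    Returns a list of (normal_id, disease_id, n_common) tuples.
    """
    overlaps = []
    for n_id, n_genes in normal.items():
        for d_id, d_genes in disease.items():
            common = len(n_genes & d_genes)
            if common > 0:
                overlaps.append((common, n_id, d_id))
    # Largest overlap first; IDs break ties deterministically.
    overlaps.sort(key=lambda t: (-t[0], t[1], t[2]))
    used_normal, used_disease, pairs = set(), set(), []
    for common, n_id, d_id in overlaps:
        if n_id not in used_normal and d_id not in used_disease:
            pairs.append((n_id, d_id, common))
            used_normal.add(n_id)
            used_disease.add(d_id)
    return pairs


# Toy biclusters with made-up gene symbols.
normal = {1: {"A", "B", "C", "D"}, 2: {"E", "F"}}
disease = {10: {"A", "B", "C", "X"}, 11: {"A", "E", "F"}, 12: {"Y"}}
pairs = map_biclusters(normal, disease)
```

Biclusters with no shared genes, such as disease bicluster 12 in the toy input, stay unmapped, which is in keeping with only 22 of the 36 disease biclusters appearing in Table 6.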

19 Page 10 of 21 Network Modeling Analysis in Health Informatics and Bioinformatics (2019) 8:19 − − − − − − − − 30 6.26E 31 1.67E 44 1.21E 28 1.86E 28 2.18E 13 2.93E 24 1.01E 21 1.15E − − − − − − − − 30 4.94E 34 3.77E 45 1.27E 32 2.74E 28 6.92E 13 9.24E 24 2.41E 23 1.67E − − − − − − − − 28 25 34 27 27 12 21 20 2.23E 5.31E 3.34E 2.08E 3.16E 7.03E 1.23E 2.14E p value organelle part, nuclear organelle part process,single-organism biologi - cellular process, cal regulation cellular single-organism plasma mem - process, partbrane ing, nucleoplasm membrane, to targeting - cotransla SRP-dependent to targeting tional SRP-dependent membrane protein cotranslational membrane, to targeting the establishment of the localization to protein endoplasmic reticulum plasma plasma membrane, part,membrane intrinsic of plasma component membrane process, single-organism cellular single-organism process intrinsic of component inte - plasma membrane, gral of plasma component membrane Intra-cellular organelle part, organelle Intra-cellular Single-organism process, Single-organism part, bind - Nuclear protein protein Cotranslational Integral of component part,Plasma membrane part,Plasma membrane Function sample) × 9 9 9 9 9 9 10 9 × × × × × × × × Size(gene Size(gene 2542 960 5590 3790 878 1666 1600 1736 Disease (cancer) 1 2 3 4 6 7 9 10 ID (cancer) − − − − − − − − 56 2.28E 56 2.29E 58 1.24E 39 7.25E 36 1.97E 41 6.07E 38 1.75E 47 8.85E − − − − − − − − 59 9.85E 65 5.48E 60 5.26E 42 1.14E 47 1.70E 46 3.48E 38 8.56E 47 7.02E − − − − − − − − 54 55 57 39 35 39 34 46 4.15E 8.53E 3.49E 2.21E 4.51E 1.94E 7.98E 2.23E p value component organization organization component cellular or biogenesis, organization component molecular complex, binding protein mic part, binding RNA organization component cellular or biogenesis, organization component initiation, ribonucleopro - tein complex - binding, cotransla protein to targeting tional protein membrane cel - zation or biogenesis, - organiza lular component 
binding tion, protein membrane, to targeting - cotransla SRP-dependent to targeting tional protein - target protein membrane, ER ing to Protein binding, cellular Protein binding, macro- RNA - binding, cytoplas Protein part,Cytoplasmic cellular binding, translational RNA initiation, Translational - organi Cellular component protein Cotranslational Function sample) × 9 9 9 9 9 9 9 7 × × × × × × × × Size (gene Size (gene 2737 2529 2974 2216 542 1734 1650 1768 values across conditions across with Mapped biclusters their three top functionalities and p values 7 Table Normal ID (normal) 21 10 17 35 40 13 68 1


Network Modeling Analysis in Health Informatics and Bioinformatics (2019) 8:19 Page 11 of 21 19 − − − − − − − − 21 1.08E 19 8.78E 11 2.87E 14 1.16E 15 8.48E 18 5.05E 23 1.37E 20 1.58E − − − − − − − − 22 3.75E 19 7.32E 13 1.28E 18 6.15E 15 4.38E 18 8.93E 24 2.10E 21 1.07E − − − − − − − − 20 19 11 13 15 16 22 19 4.65E 1.17E 7.68E 1.86E 3.62E 3.09E 6.91E 4.57E p value integral of component intrinplasma membrane, - of plasma sic component membrane plasma plasma membrane, part,membrane integral of plasma component membrane process, developmental - develop single-organism mental process binding, macro-molecular complex plasma membrane, process, single-organism intrinsic of component plasma membrane plasma membrane, integral of component plasma plasma membrane, partmembrane plasma plasma membrane, part,membrane integral of plasma component membrane plasma membrane, integral of component plasma plasma membrane, partmembrane Plasma membrane part,Plasma membrane Intrinsic of component part, region Extra-cellular binding, RNA Protein Integral of component Intrinsic of component Intrinsic of component Intrinsic of component Function sample) × 10 9 9 9 10 7 10 10 × × × × × × × × Size(gene Size(gene 2074 908 677 698 353 815 1949 1019 Disease (cancer) 11 14 15 16 17 18 19 22 ID (cancer) − − − − − − − − 37 2.58E 32 1.74E 30 1.31E 27 9.51E 15 8.77E 12 7.62E 36 1.09E 17 4.17E − − − − − − − − 38 4.32E 36 1.49E 32 1.50E 28 1.30E 15 4.30E 14 4.84E 38 2.73E 18 8.43E − − − − − − − − 33 31 29 27 15 12 35 16 9.80E 2.75E 5.11E 9.35E 2.46E 2.91E 8.05E 7.96E p value binding, macro-molecular binding, macro-molecular complex cellular part, intra-cellular part organelle initiation ing, translational zation, cellular component - or biogen organization partesis, cytoplasmic zation, cellular component - or biogen organization region esis, extra-cellular part cellular macro-molecular localization, cytoplasmic part cel - zation or biogenesis, - organiza lular component binding tion, protein com - 
macro-molecular cell cycle mitotic plex, process Cytoplasmic part,Cytoplasmic protein binding, intra- Protein bind - binding, RNA Protein - organi Cellular component - organi Cellular component complex, Macro-molecular - organi Cellular component matrix,Extra-cellular Function sample) × 9 9 9 9 8 9 8 9 × × × × × × × × Size (gene Size (gene 727 909 551 1908 1856 1415 1322 1894 (continued) 7 Table Normal ID (normal) 8 22 5 52 32 6 31 25


19 Page 12 of 21 Network Modeling Analysis in Health Informatics and Bioinformatics (2019) 8:19 − − − − − − 11 1.45E 21 9.19E 07 3.35E 08 2.12E 20 2.02E 08 1.58E − − − − − − 11 7.58E 24 1.80E 07 7.11E 09 7.57E 21 2.62E 09 1.31E − − − − − − 09 21 06 07 19 08 1.28E 2.19E 5.98E 7.87E 2.11E 1.64E p value plasma membrane, intrinplasma membrane, - of plasma sic component multicellular membrane, process organismal intrinsic of component inte - plasma membrane, gral of plasma component membrane intrinplasma membrane, - of plasma sic component biological membrane, regulation - biological regula process, of cellular tion, regulation process plasma plasma membrane, part,membrane integral of plasma component membrane compound of nitrogen - regula metabolic process, - tion of cellular macromol process ecule biosynthetic Integral of component part,Plasma membrane Integral of component of biological Regulation Intrinsic of component binding, regulation Protein Function sample) × 10 8 9 10 6 6 × × × × × × Size(gene Size(gene 473 694 340 818 154 130 Disease (cancer) 24 26 27 28 30 35 ID (cancer) − − − − − − 23 2.28E 22 1.75E 45 6.24E 26 5.80E 20 1.02E 50 4.26E − − − − − − 26 7.20E 22 8.12E 46 6.24E 26 3.40E 22 4.26E 51 5.09E − − − − − − 22 21 45 26 17 49 2.76E 1.70E 5.36E 1.45E 1.57E 9.13E p value protein binding, nuclear protein part - organiza lar component tion, cellular component or biogenesis organization targeting tional protein - contrans membrane, to targeting lational protein protein membrane, to ER to targeting organization component cellular or biogenesis, organization component cellular exosome cellular component plex, - or biogen organization esis, cellular component organization Macro-molecular complex, complex, Macro-molecular matrix,Extra-cellular cellu - - cotransla SRP-dependent binding, cellular Protein extra- vesicle, Cytosol, com - Macro-molecular Function sample) × 8 9 9 9 8 8 × × × × × × Size (gene Size (gene 760 984 935 526 1138 1490 (continued) 7 Table 
Normal ID (normal) 88 51 39 38 16 36


Table 8 Secondary genes with their respective degree information in the mapped biclusters

Disease bicluster | Healthy bicluster | Secondary gene w.r.t. primary genes (disease) | Degree | Secondary gene w.r.t. primary genes (healthy) | Degree | Primary genes for ESCC (disease) | Degree | Primary genes for ESCC (healthy) | Degree
1 | 1 | CCNB1 | 570 | CCT7 | 589 | CDH1, VIM, CCND1 | 224, 246, 176 | CDH1, CCND1 | 258, 244
2 | 21 | CRYBB3 | 623 | CCT7 | 606 | DSE, CDH1 | 125, 231 | CDH1, CCND1 | 219, 232
3 | 10 | BMP10 | 625 | CCT7 | 678 | CDH1 | 328 | CDH1, CCND1 | 265, 254
4 | 17 | CCNB1 | 637 | CCT7 | 578 | CDH1, VIM | 185, 232 | CDH1, CCND1 | 212, 205
6 | 35 | NPM1 | 363 | CCT7 | 950 | CDH1 | 132 | CDH1, CCND1 | 194, 197
7 | 40 | BMP10 | 203 | CCT7 | 869 | CDH1 | 51 | CDH1, CCND1 | 202, 182
8 | 13 | NTSR2 | 561 | CCT7 | 743 | CDH1, WWOX, CTTN | 169, 93, 70 | CDH1 | 234
9 | 68 | CRYBB3 | 440 | NAE1 | 139 | DSE | 56 | DSE, CTTN | 27, 27
10 | 25 | BMP10 | 362 | CCT7 | 954 | CDH1 | 125 | CDH1, CTTN | 272, 138
11 | 8 | GRK2 | 145 | CCT7 | 941 | CDH1 | 38 | CDH1 | 45
14 | 22 | KRT5 | 122 | CCNB1 | 707 | CDH1, VIM | 76, 62 | CDH1, CCND1 | 170, 155
15 | 5 | RPL10A | 316 | RPLP0 | 614 | VIM | 143 | VIM | 363
16 | 52 | KRT2 | 267 | FBN1 | 368 | VIM | 72 | CDH1, VIM | 134, 264
17 | 32 | – | – | PPP1CC | 193 | CTTN | 100 | CTTN | 85
18 | 6 | KRT2 | 620 | CDK4 | 932 | VIM | 185 | VIM, CCND1 | 422, 218
19 | 31 | KRT2 | 343 | CDC20 | 589 | VIM | 96 | VIM | 291
22 | 36 | KRT2 | 183 | CDK4 | 522 | VIM | 59 | CDH1, VIM | 148, 195
26 | 88 | KRT2 | 282 | SPARC | 501 | RUNX3, VIM | 137, 66 | VIM | 310
27 | 51 | TM9SF2 | 56 | RPL9 | 530 | DSE | 16 | DSE, CTTN | 24, 22
28 | 39 | BAZ2B | 52 | CCNB1 | 704 | CDH1 | 19 | CDH1 | 171
30 | 38 | KRT2 | 285 | SCP2 | 177 | WWOX | 27 | WWOX, CTTN | 35, 61
35 | 16 | PTGES3 | 93 | CCNB1 | 809 | CDH1 | 21 | CDH1 | 166

Fig. 3 Gene–gene network for bicluster-35 (Cancer) with hub gene PTGES3

FuncAssociate (2018) is an open-source tool which can calculate the probability of finding m genes associated with a GO term in the same bicluster. It uses Fisher's exact test (Berriz et al. 2003) to calculate the probability of finding associated genes in a bicluster. This value signifies how well genes in a bicluster are associated with similar GO terms.
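The enrichment probability behind such a test can be sketched with the hypergeometric tail, which is the one-sided form of Fisher's exact test. The code below is an illustration only and does not reproduce FuncAssociate itself, in particular not its resampling-based adjustment. The symbols are assumptions: N is the number of genes in the dataset, K the number annotated with the GO term, n the bicluster size and k the number of annotated genes inside the bicluster.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the dataset, K: genes annotated with the GO term,
    n: genes in the bicluster, k: annotated genes inside the bicluster.
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the probabilities of seeing k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 genes, 4 annotated, a bicluster of 5 genes holding 3 of them.
p_toy = enrichment_p_value(3, 5, 4, 10)
```

For the toy numbers the tail is (4 · 15 + 1 · 6) / 252 = 66/252, about 0.26, so this small bicluster would not be called enriched at a cutoff of 0.05.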


In statistics, a lower p value for a cluster signifies higher coherency. As described by Ahmed et al. (2014), it is well established in validation that a lower p value of a bicluster for a gene ontology term attributes higher significance to the bicluster, meaning that more genes present in the bicluster are related to the GO term. Using FuncAssociate (2018), we calculated the p value of each bicluster in both conditions, limiting the total number of simulations for each bicluster to 1000 and using a significance cutoff of 0.05. In Table 7, we show only the three GO terms associated with the three lowest p values for each bicluster. As we have already mentioned, the expression values of genes strongly affect the biclustering of genes, and therefore the three lowest p values for a bicluster may not be the same across conditions (i.e., the three lowest p values of the normal and disease conditions of a bicluster may differ). Moreover, to date there is no effective algorithm that can identify exact biclusters across the states/conditions of genes. Therefore, we introduce this method to help find the best possible matched biclusters across the conditions, using the steps described in Sect. 3.2, to support effective downstream analysis. The top three GO terms related to each mapped bicluster are documented in Table 7.

As described in module M3, we use Cytoscape v 3.6.0 (2018) to construct the biological network (a combination of CEN and PAN), considering the biological pathways and correlation-based associations between the genes in a bicluster. These steps are followed for all biclusters from both conditions to construct the respective network tables. After constructing the biological network table using Cytoscape v 3.6.0 (2018), we need to calculate the degree and total strength of each gene in a network to find the hub-gene, according to Eqs. (2) and (3), for each bicluster in both normal and disease conditions. As described earlier, the hub-gene is the gene which has the highest degWeight in a bicluster, is connected to at least one primary gene, and has a degWeight higher than the mean of the degWeights of all the genes in the bicluster. The degWeight of a gene is calculated using Eq. (2). In Table 8, the genes which are considered hub-gene(s) (secondary genes) and the primary genes of a bicluster are shown with their degree values for each mapped bicluster in both normal and disease conditions.

In Fig. 3, we show a biological network for bicluster-35 (disease condition/cancer), which was created using Gephi-0.9.2 (2018). In this figure, we observe that gene PTGES3 is highly connected to all other genes.

Fig. 4 BAZ2B
Fig. 5 BMP10
Fig. 6 CCNB1
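The hub-gene selection rule stated above can be sketched as follows. Eqs. (2) and (3) are not reproduced in this section, so the score in `deg_weight` (degree plus total strength) is only a placeholder to be replaced by the actual degWeight formula. The function names, the edge-list format and the exclusion of primary genes from the candidates are likewise assumptions made for illustration.

```python
from collections import defaultdict


def deg_weight(degree, strength):
    # Placeholder score: substitute the paper's Eq. (2) here.
    return degree + strength


def find_hub_gene(edges, primary_genes):
    """edges: iterable of (gene_a, gene_b, weight). Returns the hub gene or None."""
    primary = set(primary_genes)
    neighbours = defaultdict(set)
    strength = defaultdict(float)
    for a, b, w in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
        strength[a] += w
        strength[b] += w
    scores = {g: deg_weight(len(neighbours[g]), strength[g]) for g in neighbours}
    if not scores:
        return None
    mean_score = sum(scores.values()) / len(scores)
    # Keep non-primary genes that touch a primary gene and score above the mean.
    candidates = [
        g for g in scores
        if g not in primary and neighbours[g] & primary and scores[g] > mean_score
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda g: (scores[g], g))


# Toy network with made-up gene names; "P1" plays the role of a primary gene.
toy_edges = [("H", "P1", 0.9), ("H", "A", 0.8), ("H", "B", 0.7),
             ("A", "B", 0.2), ("C", "A", 0.1)]
hub = find_hub_gene(toy_edges, {"P1"})
```

Returning `None` when no gene qualifies corresponds to a bicluster without a secondary gene, as in the row of Table 8 that shows a dash for the disease condition.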


By calculating the total degree and total strength of all genes present in bicluster-35 (disease condition/cancer), we find that gene PTGES3 can be considered the hub-gene for this bicluster.

After successfully finding a list of 22 hub-genes from the mapped biclusters shown in Table 8, we use them as initial secondary genes considering topological behavior. As described earlier, topological behavior represents the characteristics of a gene–gene network based on the information related to connectivity and weight of connectivity for each node. A gene with high degree or high association with other genes is considered a significant gene. Similarly, a pair of genes connected or correlated with high connection weights is considered to be closely associated. These 22 secondary genes, namely BAZ2B, BMP10, CCNB1, CCT7, CDC20, CDK4, CRYBB3, FBN1, GRK2, KRT2, KRT5, NAE1, NPM1, NTSR2, PPP1CC, PTGES3, RPL9, RPL10A, RPLP0, SCP2, SPARC and TM9SF2, have been found to have significant relevance to ESCC. We further investigate the coexpression behavior of these genes across the conditions.

The coexpression value of a gene varies over samples due to different environmental or other conditions. In Figs. 4–25, we plot the coexpression values of the 22 secondary genes using Matlab.

Fig. 7 CCT7
Fig. 8 CDC20
Fig. 9 CDK4
Fig. 10 CRYBB3
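One possible way to quantify how far a gene's pattern differs between the two conditions is the Pearson correlation between its normal and disease profiles. The sketch below is an assumption-laden illustration, not the procedure used to produce the plots: it assumes that sample i of the normal condition is paired with sample i of the disease condition, and the threshold of 0.5, the function names and the toy profiles are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def differing_genes(normal, disease, threshold=0.5):
    """Genes whose normal and disease profiles correlate below `threshold`.

    normal, disease: dict mapping gene symbol -> list of values, with
    position i referring to the same (paired) sample in both conditions.
    """
    flagged = []
    for gene in normal:
        r = pearson(normal[gene], disease[gene])
        if r < threshold:
            flagged.append((gene, r))
    return flagged


# Toy profiles: G1 reverses its pattern in disease, G2 is only shifted.
normal = {"G1": [1, 2, 3, 4, 5], "G2": [2, 4, 1, 5, 3]}
disease = {"G1": [5, 4, 3, 2, 1], "G2": [3, 5, 2, 6, 4]}
flagged = differing_genes(normal, disease)
```

A negative or weak correlation flags a gene for closer inspection, while a gene whose profile is merely shifted between conditions is not flagged.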

From the plotted coexpression values, we can observe that 15 genes exhibit highly different patterns across the conditions (i.e., the expression graphs for the normal and disease conditions of the samples show either negative or different correlation patterns), and we consider them for further investigation to obtain biological insights. These 15 genes are BAZ2B, CCNB1, CCT7, CDC20, CDK4, FBN1, KRT2, KRT5, NAE1, NPM1, PPP1CC, PTGES3, SCP2, SPARC, and TM9SF2.

Fig. 11 FBN1
Fig. 12 GRK2
Fig. 13 KRT2
Fig. 14 KRT5

Next, we carry out pathway analysis for these 15 genes using Geneanalytics (2018). We find that three genes, CCNB1, KRT5 and CDK4, share pathways with esophageal cancer. Six other genes, namely PPP1CC, NPM1, CDC20, KRT2, CCT7 and SCP2, share pathways with the above three genes, and these six genes are also related to cancer pathways directly or indirectly. Various pathways related to the secondary genes are given in Table 9.

Tissue and cell analysis shows that, out of the 15 secondary genes, 11 genes have direct evidence of affecting the scores of tissues and cell ranking in Geneanalytics (2018). These genes are CCNB1, CCT7, CDC20, KRT2, KRT5, NAE1, NPM1, PTGES3, SCP2, SPARC, and TM9SF2.

Further, we find that three secondary genes are directly connected to esophageal cancer. These are CCNB1 and CDK4 in esophageal squamous cell carcinoma and KRT5 in esophageal adenosquamous carcinoma. So, these three genes can be considered significant biomarkers related to esophageal cancer. Moreover, six genes CDC20, FBN1, CCT7,
SPARC, NPM1, and KRT2 are found to be causal genes in several cancer-related diseases (Geneanalytics 2018).

Next we introduce each of these seven secondary genes briefly.

CCNB1 (GeneCards 2017)
CCNB1 encodes a regulatory protein which is involved in mitosis. It also controls the formation of MPF (maturation-promoting factor), working with p34 (CDC2). This gene helps control the G2/M transition in the cell cycle. It is relevant to thyroid lymphoma, adrenal carcinoma, esophageal cancer, colorectal cancer and gastric cancer.

CDK4 (GeneCards 2017)
CDK4 encodes a protein of the Ser/Thr kinase family. It is a catalytic subunit of the protein kinase complex that is important for cell cycle G1 phase progression. It is relevant to melanoma, spindle cell lipoma, esophageal cancer and lung cancer susceptibility 3.

KRT5 (GeneCards 2017)
KRT5 encodes a protein of the keratin gene family. It is specifically expressed in the basal layer of the epidermis. Mutation of this gene is related to epidermolysis bullosa simplex. It is relevant to sarcomatoid basal cell carcinoma, esophageal adenosquamous carcinoma and prostate squamous cell carcinoma.

Fig. 15 NAE1
Fig. 16 NPM1
Fig. 17 NTSR2
Fig. 18 PPP1CC

Moreover, genes like CDC20, CCT7, NPM1 and KRT2, though there is no direct evidence of causing esophageal
cancer, share pathways with those three potential biomarkers and are also causes of different cancer-related diseases. These four genes therefore need to be considered seriously and deserve more rigorous investigation to uncover further biological insights.

CDC20 (GeneCards 2017)
It is a regulatory protein which interacts with many other proteins at different points in the cell cycle. It is mainly required for nuclear movement prior to anaphase and chromosome separation. It has relevance to lung cancer, uterine corpus endometrial carcinoma, hepatosplenic T-cell lymphoma and cervical squamous cell carcinoma.

CCT7 (GeneCards 2017)
A molecular chaperone is encoded by this gene. This gene produces two stacked rings known as the TCP1 ring complex. Each of these rings contains eight different types of protein. Actin and tubulin are two among many others which are folded by the TCP1 ring complex. It is related to prostate cancer.

Fig. 19 PTGES3
Fig. 20 RPL9
Fig. 21 RPL10A
Fig. 22 RPLP0

NPM1 (GeneCards 2017)
NPM1 encodes proteins involved in cellular processes (e.g., protein chaperoning, centrosome duplication, and cell proliferation). It is known to sequester the tumor suppressor ARF in the nucleolus. It protects the nucleolus from degradation. Mutations in this gene may lead to acute myeloid leukemia. It has relevance for leukemia, acute
myeloid, reticulosarcoma, lymphoma and hematologic cancer.

KRT2 (GeneCards 2017)
The protein encoded by KRT2 is related to the keratin family. The type II cytokeratins consist of basic or neutral proteins. It is coexpressed during differentiation of simple and stratified epithelial tissues. It has relevance to cervical squamous cell carcinoma.

Fig. 23 SCP2
Fig. 24 SPARC
Fig. 25 TM9SF2

Based on our exhaustive experimental study, we finally suggest that CCNB1, CDK4, and KRT5 are highly associated with "esophageal squamous cell carcinoma", and hence can be recommended as potential biomarkers for this deadly disease.

5 Conclusion and future work

A parallel biclustering algorithm can be used to identify significant biomarkers related to a disease in less time. Hub-gene centric analysis can help identify interesting biomarkers from biological networks that are constructed based on enriched biclusters. We observe that three secondary genes (i.e., CCNB1, CDK4, and KRT5) are directly related to ESCC, and four other genes (i.e., CDC20, CCT7, NPM1, and KRT2) are related to cancer as well as highly significant in esophageal squamous cells. All these seven genes can be considered significant biomarkers, though four of these genes, namely CDC20, CCT7, NPM1, and KRT2, need further investigation. The proposed method can be further extended to analyze other critical diseases for identification of biomarkers. There is scope for improvement of the hub-gene identification method using a multi-objective approach. The BicBioEC method presented in this paper can be further strengthened by integrating miRNA and other relevant sources of data.


Table 9 Pathway analysis of secondary genes using Geneanalytics (2018)

Pathway | Genes
SuperPath: cell cycle_Role of APC in cell cycle | CCT7, CDC20, CCNB1
SuperPath: validated targets of C-MYC transcriptional activation | NPM1, CDK4, CCNB1
SuperPath: DNA damage | NPM1, CDK4, CDC20, CCNB1
SuperPath: cell cycle, mitotic | PPP1CC, NPM1, CDK4, CDC20, CCNB1
SuperPath: mitotic prometaphase | PPP1CC, CDC20, CCNB1
SuperPath: cell cycle | CDK4, CDC20, CCNB1
SuperPath: PPAR alpha pathway | CDK4, SCP2
SuperPath: APC-Cdc20 mediated degradation of Nek2A | CDC20, CCNB1
SuperPath: oocyte meiosis | PPP1CC, CDC20, CCNB1
SuperPath: cellular senescence (KEGG) | PPP1CC, CDK4, CCNB1
SuperPath: GADD45 pathway | CDK4, CCNB1
SuperPath: FOXM1 transcription factor network | CDK4, CCNB1
SuperPath: Arora B signaling | PPP1CC, NPM1
SuperPath: PLK1 signaling events | CDC20, CCNB1
SuperPath: TP53 regulates transcription of cell cycle genes | NPM1, CCNB1
SuperPath: cell cycle cell cycle (genetic schema) | CDK4, CCNB1
SuperPath: mitotic roles of polo like kinases | CDC20, CCNB1
SuperPath: cytoskeleton remodeling neurofilaments | KRT5, KRT2
SuperPath: cyclins and cell cycle regulation | CDK4, CCNB1
SuperPath: retinoblastoma (RB) in cancer | CDK4, CCNB1
SuperPath: cell cycle_spindle assembly and separation | CDC20, CCNB1
SuperPath: DNA damage response | CDK4, CCNB1
SuperPath: arrhythmogenic right ventricular cardiomyopathy (ARVC) | CDK4, CCNB1
SuperPath: mitotic G1–G1/S Phases | CDK4, CCNB1
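The pathway-sharing relationship reported above can be checked directly against Table 9. The following sketch encodes the gene lists of Table 9 (pathway names shortened, which is our own labelling) and collects every gene that is listed on at least one pathway together with CCNB1, CDK4 or KRT5. The variable names are illustrative.

```python
# Gene lists copied from Table 9; the keys are shortened pathway labels.
pathways = {
    "Role of APC in cell cycle": ["CCT7", "CDC20", "CCNB1"],
    "C-MYC transcriptional activation": ["NPM1", "CDK4", "CCNB1"],
    "DNA damage": ["NPM1", "CDK4", "CDC20", "CCNB1"],
    "cell cycle, mitotic": ["PPP1CC", "NPM1", "CDK4", "CDC20", "CCNB1"],
    "mitotic prometaphase": ["PPP1CC", "CDC20", "CCNB1"],
    "cell cycle": ["CDK4", "CDC20", "CCNB1"],
    "PPAR alpha pathway": ["CDK4", "SCP2"],
    "APC-Cdc20 degradation of Nek2A": ["CDC20", "CCNB1"],
    "oocyte meiosis": ["PPP1CC", "CDC20", "CCNB1"],
    "cellular senescence (KEGG)": ["PPP1CC", "CDK4", "CCNB1"],
    "GADD45 pathway": ["CDK4", "CCNB1"],
    "FOXM1 transcription factor network": ["CDK4", "CCNB1"],
    "Arora B signaling": ["PPP1CC", "NPM1"],
    "PLK1 signaling events": ["CDC20", "CCNB1"],
    "TP53 regulates cell cycle genes": ["NPM1", "CCNB1"],
    "cell cycle (genetic schema)": ["CDK4", "CCNB1"],
    "mitotic roles of polo like kinases": ["CDC20", "CCNB1"],
    "cytoskeleton remodeling neurofilaments": ["KRT5", "KRT2"],
    "cyclins and cell cycle regulation": ["CDK4", "CCNB1"],
    "retinoblastoma (RB) in cancer": ["CDK4", "CCNB1"],
    "spindle assembly and separation": ["CDC20", "CCNB1"],
    "DNA damage response": ["CDK4", "CCNB1"],
    "ARVC": ["CDK4", "CCNB1"],
    "mitotic G1-G1/S phases": ["CDK4", "CCNB1"],
}

biomarkers = {"CCNB1", "CDK4", "KRT5"}

# Genes that appear on at least one pathway together with a biomarker.
partners = set()
for genes in pathways.values():
    if biomarkers & set(genes):
        partners.update(set(genes) - biomarkers)

print(sorted(partners))
```

The resulting set contains CCT7, CDC20, KRT2, NPM1, PPP1CC and SCP2, the same six genes named in the text as sharing pathways with the three biomarkers.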

References

Aguilar-Ruiz JS (2005) Shifting and scaling patterns from gene expression data. Bioinformatics 21(20):3840–3845
Ahmed HA, Mahanta P, Bhattacharyya DK, Kalita JK (2014) Shifting-and-scaling correlation based biclustering algorithm. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 11(6):1239–1252
Berriz GF, King OD, Bryant B, Sander C, Roth FP (2003) Characterizing gene sets with funcassociate. Bioinformatics 19(18):2502–2504
Bhattacharya A, Cui Y (2017) A gpu-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Sci Rep 7(1):4162
Cho H, Dhillon IS, Guan Y, Sra S (2004) Minimum sum-squared residue co-clustering of gene expression data. In: Proceedings of the 2004 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, pp 114–125
CUDA 7.5 toolkit. https://developer.nvidia.com/cuda-toolkit-archive. Accessed 1 Nov 2017
Cytoscape v 3.6.0. http://www.cytoscape.org/index.html. Accessed 25 Feb 2018
ESCC(GSE20347). https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE20347. Accessed 30 July 2017
Esophageal squamous cell carcinoma. http://www.malacards.org/card/esophagus_squamous_cell_carcinoma. Accessed 30 July 2017
Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, Parkin DM, Forman D, Bray F (2015) Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J Cancer 136(5):E359–E386
FuncAssociate. http://llama.mshri.on.ca/funcassociate/. Accessed 1 Jan 2018
Geneanalytics. http://geneanalytics.genecards.org/. Accessed 10 May 2018
GeneCards. https://www.genecards.org/. Accessed 30 July 2017
Gephi-0.9.2. https://gephi.org/. Accessed 24 Jan 2018
González-Domínguez J, Expósito RR (2018) Parbibit: parallel tool for binary biclustering on modern distributed-memory systems. PLoS One 13(4):e0194361
IGraph, r-package. http://igraph.org/r/. Accessed 10 Mar 2018
Kelsen D (2008) Principles and practice of gastrointestinal oncology. Lippincott Williams & Wilkins, Philadelphia
Malacards. http://www.malacards.org/. Accessed 30 July 2017
Mandal K, Sarmah R, Bhattacharyya DK (2018) Biomarker identification for cancer disease using biclustering approach: an empirical study. IEEE/ACM Trans Comput Biol Bioinform 1:1–1
Olson CF (1995) Parallel algorithms for hierarchical clustering. Parallel Comput 21(8):1313–1325
Orzechowski P, Sipper M, Huang X, Moore J (2018) Ebic: an evolutionary-based parallel biclustering algorithm for pattern discovery. Bioinformatics (Oxford, England). https://doi.org/10.1093/bioinformatics/bty401
R 3.4.3. https://cran.r-project.org/bin/windows/base/. Accessed 1 Nov 2017
Tanay A, Sharan R, Shamir R (2002) Discovering statistically significant biclusters in gene expression data. Bioinformatics 18(suppl 1):S136–S144
Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: IEEE international conference on cloud computing. Springer, Berlin, Heidelberg, pp 674–679
Zhou J, Khokhar A (2006) Parrescue: Scalable parallel algorithm and implementation for biclustering over large distributed datasets. In: 26th IEEE international conference on distributed computing systems (ICDCS'06). IEEE, pp 21
