Bicbioec: Biclustering in Biomarker Identification for ESCC

P. Kakati¹ · D. K. Bhattacharyya¹ · J. K. Kalita²

ORIGINAL ARTICLE

Network Modeling Analysis in Health Informatics and Bioinformatics (2019) 8:19, Volume 8, Number 1, ISSN 2192-6662
https://doi.org/10.1007/s13721-019-0200-x

Received: 23 November 2018 / Revised: 26 June 2019 / Accepted: 21 July 2019
© Springer-Verlag GmbH Austria, part of Springer Nature 2019

Abstract

Analysis of gene expression patterns enables identification of significant genes related to a specific disease. We analyze gene expression data for esophageal squamous cell carcinoma (ESCC) using biclustering, gene–gene network topology and pathways to identify significant biomarkers. Biclustering is a clustering technique by which we can extract coexpressed genes over a subset of samples.
We introduce a parallel and robust biclustering algorithm to identify shifted, scaled and shifted-and-scaled biclusters of high biological relevance. Additionally, we introduce a mapping algorithm to establish the module–bicluster relationship across control and disease stages and a hub-gene identification method to support our analysis framework. The C-CUDA implementation of our biclustering algorithm makes the method attractive due to faster speed and higher accuracy of results. Biomarkers such as CCNB1, CDK4, and KRT5 have been found to be closely associated with ESCC.

Keywords Gene expression · Bicluster · Primary gene · Secondary gene · Biomarkers · SSSIM · GPU computing

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s13721-019-0200-x) contains supplementary material, which is available to authorized users.

Corresponding author: D. K. Bhattacharyya
¹ Department of Computer Science and Engineering, Tezpur University, Napaam, Tezpur, Assam 784028, India
² Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA

1 Introduction

Esophageal squamous cell carcinoma (ESCC) is a subtype of esophageal cancer. ESCC is common in developing countries like India and China. It arises from epithelial cells that line the esophagus (Kelsen 2008). It is the eighth most common cancer globally with 456,000 new cases during the year 2014 (Ferlay et al. 2015). It caused around 400,000 deaths in 2014. This rate varied widely among countries. Due to the severity of this disease, identification of interesting biomarkers related to ESCC is highly essential. There are several ways to find the biomarkers for a given disease. In this paper, we analyze gene expression data for ESCC using a parallel biclustering approach followed by network topology analysis and pathway analysis to identify interesting gene biomarker(s) related to ESCC.

In microarray technology, gene expression data are represented in matrix format. There are two types of gene expression data: (1) gene–sample (G × S) data and (2) gene–sample–time (G × S × T) data (Mandal et al. 2018). There are generally three types of correlation patterns in gene expression data that can be used to show gene coexpression: (1) shifting, (2) scaling, and (3) shifting-and-scaling (Aguilar-Ruiz 2005). To identify such correlation patterns in an unsupervised framework with high accuracy, a number of clustering approaches have been introduced. Among these, biclustering approaches are prominent. However, most biclustering techniques consume tremendous computational time because the underlying problem is NP-hard (Tanay et al. 2002). To address this issue, we introduce a parallel biclustering approach which we demonstrate to be capable of handling all three types of correlations during bicluster extraction in much less time. Based on the highly enriched biclustering results, we follow with gene–gene network topology analysis and pathway analysis to identify interesting biomarkers for ESCC, whose association with the disease is supported by established literature. Additionally, to support the biomarker identification process, we introduce two effective techniques for (1) control-to-disease bicluster mapping and (2) hub-gene finding.

[...] results are reported with discussion. Finally, Sect. 5 presents the conclusion and the future direction of research.
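The three coexpression patterns named in the introduction (shifting, scaling, and shifting-and-scaling) can be made concrete with a small sketch. The profile values, the shift and scale constants, and the helper names `pearson` and `mean_abs_diff` are all hypothetical, chosen only for illustration. Pearson correlation is used here as a familiar stand-in for a pattern-aware proximity measure; it is not the SSSim measure used in the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_abs_diff(x, y):
    """Average absolute gap between two profiles (a plain distance)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical expression profile of one gene over five samples.
base = [1.0, 3.0, 2.0, 5.0, 4.0]

# Three coexpressed partners built from the base profile.
shifted = [v + 2.0 for v in base]               # shifting: g' = g + b
scaled = [1.5 * v for v in base]                # scaling: g' = a * g
shifted_scaled = [1.5 * v + 2.0 for v in base]  # both: g' = a * g + b

for name, profile in [("shifted", shifted),
                      ("scaled", scaled),
                      ("shifted-and-scaled", shifted_scaled)]:
    # The distance differs from pattern to pattern, while the correlation
    # stays (numerically) at 1 for each, because a > 0 in every case.
    print(name, round(mean_abs_diff(base, profile), 3),
          round(pearson(base, profile), 6))
```

A plain distance treats the three partner profiles as differently far from the base profile, whereas a correlation-style measure treats all of them as perfectly coexpressed. This is the reason a biclustering method aimed at such patterns needs a proximity measure that tolerates shifts and scalings.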
1.1 Problem definition

Given an expression matrix for ESCC, say M = ⟨G, S⟩, where G represents a set of genes and S represents a set of samples, and Lp, a list of primary genes for esophageal squamous cell carcinoma that appears in a formal repository such as MalaCards (Malacards 2017), the problem is to identify and establish significant gene biomarkers (other than primary genes) related to ESCC using appropriate (1) unsupervised machine learning techniques on the gene expression data and (2) network and biological analysis without much prior knowledge.

Performance of a biclustering-based method for gene expression analysis is highly dependent on the proximity measure used to identify coexpressed patterns. So, identification of a robust measure that can handle shifting-and-scaling patterns for effective cluster analysis of gene expression data is a major issue. Further, most biclustering algorithms are inefficient due to the high computational cost of extracting biclusters. So, developing a cost-effective and robust parallel biclustering technique which can extract biologically significant biclusters from an expression matrix is a prime motivation. After extraction of biologically significant biclusters, topological and biological analyses of each bicluster can help identify the biomarker(s) related to ESCC.

1.2 Contribution

The major contributions of this paper are given below:

• An overall model for the identification of significant biomarkers for ESCC using parallel biclustering, topological analysis and biological behavior analysis of gene expression data for ESCC.
• A robust parallel biclustering variant of Bhattacharya and Cui (2017) to identify biclusters with shifting, scaling or shifting-and-scaling patterns.
• An effective technique to map the biclusters across control and disease conditions for subsequent analysis.
• A weighted hub-gene finding technique to support the biomarker identification process.
• Network and biological behavior analysis of the identi-

2 Related work

Due to the large volume and high dimensionality of gene expression data, extraction of clusters with high biological significance is a challenging task. To address this issue, biclustering with parallelization has been considered as a potential solution. Zhao et al. (2009) introduced a parallel algorithm based on Hadoop MapReduce for K-means clustering. The Hadoop MapReduce programming technique can handle large volumes of data with high efficiency. Olson (1995) reported a parallel hierarchical clustering approach with an effective proximity measure. The parallelization of hierarchical clustering has been shown to be superior to other approaches to parallelizing clustering.

Biclustering aims to extract biclusters (subsets of highly correlated genes over subsets of samples) from gene expression data that show high biological significance. Due to the need for simultaneous operations to eliminate less relevant rows and columns, it is more complex than normal clustering, especially for larger datasets. Researchers have developed many biclustering algorithms to mine large numbers of genes over subsets of samples to extract biclusters of high biological significance. Zhou and Khokhar (2006) proposed a parallel version of a biclustering algorithm, named ParRescue, and implemented it using MPI on a cluster of 64 nodes. ParRescue is effective in handling voluminous data using a large number of nodes. However, the biclusters extracted by it are not satisfactory from an enrichment perspective. Bhattacharya and Cui (2017) introduced a GPU-accelerated parallel biclustering algorithm which showed that GPU computing speeds up the biclustering process significantly, but it did not address noisy values in gene expression data. To address this issue, we introduced a robust parallel biclustering algorithm using an effective proximity measure proposed by Ahmed et al. (2014), based on the concept of largest condition-dependent subgroups introduced by Bhattacharya and Cui (2017). GPU implementation of the proximity measure called SSSim
