bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Identification of Platform-Independent Diagnostic Biomarker Panel for

2 Hepatocellular Carcinoma using Large-scale Transcriptomics Data

3 Harpreet Kaur1,2, Anjali Dhall2, Rajesh Kumar1,2, Gajendra P. S. Raghava2,* 4 1 Bioinformatics Center, CSIR-Institute of Microbial Technology, Chandigarh, India 5 2 Department of Computational Biology, Indraprastha Institute of Information Technology, New 6 Delhi, India 7 8 Emails of Authors: 9 Harpreet Kaur: [email protected] , [email protected]

10 Anjali Dhall: [email protected]

11 Rajesh Kumar: [email protected]

12 Gajendra P.S. Raghava: [email protected] 13 14 15 * Correspondence 16 Professor, Department of Computational Biology 17 Indraprastha Institute of Information Technology, 18 Okhla Industrial Estate, Phase III, New Delhi 110020, 19 India. Tel.: +91 011 26907444; E-mail address: [email protected]

20

21

22

23

24

25

26

27

28 bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

29 Abstract

30 The high mortality rate of hepatocellular carcinoma (HCC) is primarily due to its late diagnosis. In the 31 past numerous attempts have been made to design genetic biomarkers for the identification of HCC; 32 unfortunately, most of the studies are based on a small dataset obtained from a specific platform or 33 lack of their reasonable validation performance on the external datasets. In order to identify a universal 34 expression-based diagnostic biomarker panel for HCC that can be applicable across multiple platforms; 35 we have employed large scale transcriptomic profiling datasets containing a total of 2,306 HCC and 36 1,655 non-tumorous tissue samples. These samples were obtained from 29 studies generated by mainly 37 four types of profiling techniques include Affymetrix, Illumina, Agilent and High-throughput-seq, 38 which implemented a wide range of platforms. Firstly, we scrutinized 26 that are differentially 39 expressed or regulated in uniform pattern among numerous datasets. Subsequently, we identified three 40 genes (FCN3, CLEC1B, & PRC1) panel-based HCC biomarker using different machine learning 41 techniques include Simple-threshold based approach, Extra Trees, Support Vector Machine, Random 42 Forest, K Neighbors Classifier, Logistic Regression etc. Three-genes panel-based HCC biomarker 43 classified HCC samples and non-tumorous samples of training and three external validation datasets 44 with an accuracy between 93 to 98% and AUROC (Area Under Receiver Operating Characteristic 45 curve) in a range of 0.97 to 1.0. Furthermore, the prognostic potential of these genes was evaluated on 46 TCGA cohort and GSE14520 cohort using univariate survival analysis revealed that these genes are 47 independent prognostic indicators for various types of the survivals, i.e. OS (Overall Survival), PFS 48 (Progression-Free Survival), DFS/RFS (Disease-Free Survival/Recurrence-Free Survival) and DSS 49 (Disease-Specific Survival) of HCC patients and significantly stratify high-risk and low-risk HCC 50 patients (p-value <0.05). In conclusion, we identified a universal platform-independent three genes- 51 based biomarker that can predict HCC patients with high precision; also possess significant prognostic 52 potential. Eventually, to provide service to the scientific community, we developed a webserver 53 HCCPred based on the above study (http://webs.iiitd.edu.in/raghava/hccpred/).

54 Keywords: liver cancer; HCC; biomarker; expression; diagnosis; survival; machine learning; genes, 55 classification, platform-independent, malignancy

56

57 1 Introduction

58 Cancer is a heterogeneous disease driven by genomic and epigenomic changes within the cell 59 (Chatterjee et al., 2018; Dawson and Kouzarides, 2012; Flavahan et al., 2017; Kagohara et al., 2018; 60 Kamel and Al-Amodi, 2017; Kumar et al., 2019; Nagpal et al., 2015; Narrandes and Xu, 2018; 61 Nebbioso et al., 2018; Sharma et al., 2010). dysregulation is considered a hallmark of cancer. 62 Among the 22 common cancer type, Hepatocellular carcinoma (HCC) ranks at sixth in terms of 63 frequency of occurrence and fourth at cancer-related mortality (Siegel et al., 2019). The aetiology of 64 HCC can be induced by multiple factors, especially virus infection, alcoholic cirrhosis and 65 consumption of aflatoxin-contaminated foods (Ho et al., 2016). Although various traditional and 66 locoregional treatment strategies such as Hepatic resection (RES), Percutaneous ethanol injection 67 (PEI), radiofrequency ablation (RFA), microwave ablation (MWA) and trans-arterial chemotherapy 68 infusion (TACI) have improved the survival rate but patient with HCC still have a late diagnosis and 69 poor prognosis (Tian et al., 2018).

70 In the past, several studies focus on the identification of biomarkers by comparing the global gene 71 expression changes between cancer tissue and non-tumorous tissues (Cai et al., 2017, 2019; Emma et

2

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

72 al., 2016; Gao et al., 2015; Jia et al., 2007; Jiao et al., 2019; Kang et al., 2015; Komatsu et al., 2016; 73 Li et al., 2017, 2018; Liao et al., 2018; Liu et al., 2015; Marshall et al., 2013; Meng et al., 2018; Shirota 74 et al., 2001; Wang et al., 2018; Xia et al., 2019; Xu et al., 2018; Zhang et al., 2017, 2019; Zheng et al., 75 2018). Such analyses yield hundreds or thousands of gene signature that are differentially expressed in 76 cancer tissue compared to normal tissue. Thus, making it difficult to identify a universal subset of genes 77 that play a crucial role in neoplastic transformation and progression (Rhodes et al., 2004). The lack of 78 concordance of signature genes among different studies and extensive molecular variation between the 79 patient’s samples restrains the establishment of the robust biomarkers, promising targets and their 80 experimental validation in clinical trials (Vasudevan et al., 2018). The transcriptome signatures have 81 yet to be translated into a clinically useful biomarker, that may be due to a lack of their satisfactory 82 validation performance on independent patient’s cohort.

83 In this regarding, treatment of HCC remains unsatisfying as only diagnostic and prognostic 84 biomarkers alpha-fetoprotein (AFP) has been established so far. Several other biomarkers AFP-L3, 85 osteopontin, glypican-3 are currently being under investigation for the early diagnosis of HCC patients 86 (Ocker, 2018). Advancement in the genomics has created rich public repositories of microarray and 87 high throughput datasets from numerous studies such as TCGA(The Cancer Genome Atlas Program - 88 National Cancer Institute), GDC(GDC) and GEO(Home - GEO - NCBI), provide the opportunity to 89 study the various aspects of cancer. Thus, novel methods exploring computational approach by 90 merging multiple datasets from different platforms could provide a new way to establish a robust and 91 universal biomarker for disease diagnosis and prognosis with increased precision and reproducibility. 92 Recently, multiple datasets integration approach implemented for identification of robust biomarker 93 panel for pancreatic ductal adenocarcinoma (PDAC) (Bhasin et al., 2016; Klett et al., 2018). Although, 94 various studies employed large-scale data or meta-analysis approaches to identify and miRNA 95 expression-based biomarker for HCC diagnosis (Chen et al., 2018b; Ding et al., 2017; Ji et al., 2016, 96 2018). But, best of our knowledge, RNA-expression data is not explored in this regard for identification 97 of the robust biomarker for HCC diagnosis and prognosis.

98 To overcome the limitations of existing methods, we made a systematic attempt to identify 99 genetic biomarkers for HCC diagnosis that apply to a wide range of platforms and profiling techniques. 100 One of the objectives of this study is to identify robust signature for discrimination of 101 HCC samples by the integration of multiple transcriptomic datasets from various platforms. Here, we 102 have collected and analysed a total of 3,961 samples from published datasets, out of which 2,306 and 103 1,655 are of HCC and normal or non-tumorous tissue samples, respectively. From this, we identified 104 26 genes, which are commonly differentially expressed in uniform patterns among most of the datasets, 105 which provides a universally activated transcriptional signature of HCC cancer type. Further, we have 106 established a robust "Three-genes based HCC biomarker" implementing different machine learning 107 techniques to distinguish HCC and non-tumorous samples with high precision. Additionally, the 108 survival analysis of HCC patient's cohorts using these genes revealed their significant prognostic 109 potential in the stratification of high-risk and low-risk patient’s groups. To best of our knowledge, this 110 is the first study in the regards of HCC cancer type, which utilizes such a large-scale transcriptomic 111 dataset from different platforms, for the identification of universal platform-independent diagnostic 112 biomarkers implementing machine learning approaches.

113 2. Materials and Methods

114 2.1 Dataset Collection

115 2.1.1 Collection of gene expression datasets of HCC

3

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

116 In order to analyse the large-scale gene expression profiles in Hepatocellular carcinoma (HCC), 117 supplementary data for twenty-nine transcriptome studies (GSE102079 (Chiyonobu et al., 2018), 118 GSE22405, GSE98383 (Diaz et al., 2018), GSE84402 (Wang et al., 2017), GSE64041 (Makowska et 119 al., 2016), GSE69715 (Sekhar et al., 2018), GSE51401, GSE62232 (Schulze et al., 2015), GSE45267 120 (Chen et al., 2018a), GSE32879 (Oishi et al., 2012), GSE19665 (Deng et al., 2010), GSE107170 (Diaz 121 et al., 2018), GSE76427 (Grinchuk et al., 2018), GSE39791 (Kim et al., 2014), GSE57957 (Mah et al., 122 2014), GSE87630 (Woo et al., 2017), GSE46408, GSE57555 (Murakami et al., 2015), GSE54236 123 (Villa et al., 2016; Zubiete-Franco et al., 2019), GSE65484 (Dong et al., 2015), GSE31370 (Seok et 124 al., 2012), GSE84598, GSE89377, GSE29721 (Stefanska et al., 2011), GSE14323 (Mas et al., 2009), 125 GSE25097 (Lamb et al., 2011; Tung et al., 2011; Wong et al., 2016), GSE14520 (Roessler et al., 2010; 126 Zhao et al., 2015), GSE36376 (Lim et al., 2013) and TCGA-LIHC), each contains minimum 10 127 samples. All these datasets were initially obtained using GEOquery package (Bioconductor - 128 GEOquery) from GEO and TCGA-LIHC RNA-seq data with metadata and clinical files was 129 downloaded using gdc-client from GDC data portal, respectively. Then, we manually curated these 130 datasets to ensure that the data contains only expression data from HCC tissues. For each of the 131 29 datasets included in the current study, we reviewed the samples profiled. Twenty-nine studies had 132 at least ten samples corresponding to both classes, i.e. HCC and normal (or adjacent non-tumor) were 133 considered for the analysis of interest. Further, each dataset carefully investigated to check proper 134 sample ID, type of sample and to remove any irrelevant error in dataset file. Besides, Gene symbol 135 mapped to probe IDs were extracted from respective Platform file (wherever available) and 136 incorporated in the dataset matrix for each dataset.

137 2.1.2 Pre-processing of datasets 138 Each retrieved raw dataset (supplementary data) was subjected to a detailed curation process. 139 We have pre-processed each dataset matrix individually from each profiling technique for each 140 platform in a standardized manner. For Affymetrix datasets, raw data files were pre-processed with 141 background correction and eventually RMA values calculated using the Oligo package (Carvalho and 142 Irizarry, 2010). For Illumina datasets, raw files were processed using Limma and Lumi packages (Du 143 et al., 2008; Ritchie et al., 2015) and finally log2 values calculated using in-house R scripts. Similarly, 144 raw Agilent-1-color and Agilent-2-color files were pre-processed using Limma package individually, 145 then A-values were generated, which were further transformed to log2 values. Eventually, the average 146 of multiple probes computed that correspond to a single gene for each dataset individually employing 147 in-house R scripts. TCGA-LIHC dataset contains FPKM values, which were further converted to log2 148 values. transcript IDs are mapped to the Gene symbols using Gencode v22.

149 2.1.3 Datasets used for the Identification of Differentially Expressed Genes (DEG) 150 Twenty-five out of twenty-nine datasets selected, each containing samples more than ten for 151 initial analysis to identify differentially expressed genes between HCC and normal as given in Figure 152 1A. These datasets were further analyzed to assign the samples to three different classes, i.e., HCC, 153 adjacent non-tumor and healthy normal samples. Eventually, we generated a total of 27 datasets; twenty 154 datasets contain HCC v/s adjacent non-tumor samples and seven datasets contain HCC v/s healthy 155 normal samples. These datasets encompass 1199 HCC samples and 949 normal or adjacent non-tumor 156 samples.

4

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

157 158 Figure1: Distribution of Samples among Datasets used in the study

159

160 2.1.4 Training and Independent Validation Datasets 161 Four out of twenty-nine datasets (GSE25097, GSE14520, GSE36376 and TCGA-LIHC 162 datasets) selected each having more than 400 samples, for training the model and validation as given 163 in Figure 1B. To reduce the cross-platform artifacts, quantile normalization is performed using the 164 PreprocessCore package of Bioconductor (GitHub - bmbolstad/preprocessCore) for each dataset and 165 for each profiling technique by keeping training dataset as model (target) matrix. These datasets contain 166 a total of 1107 HCC and 706 adjacent non-tumor samples.

167 2.2 Analysis for the Identification of Differentially Expressed Genes (DEGs) 168 For each of the 32 microarray datasets, each gene was analyzed for differential expression with 169 Student's t-test (Welch t-test and Wilcoxon t-test) using in-house Rscript after the assignment of 170 samples to the respective class, i.e. cancer or normal. T-tests were conducted as two-sided for 171 differential expression analysis. Only those set of genes chosen to define differential expressed genes 172 from each of datasets that are statistically differentially expressed between two classes with Bonferroni 173 adjusted p-value less than 0.01. With aim to identify a selected set of differential expression signatures 174 or “core genes of Hepatocellular carcinoma”, DEGs were compared among all 27 datasets.

175 2.3 Identification of Robust Biomarkers for HCC Diagnosis

176 2.3.1 Reduction of Features 177 With intent to reduce the number of genes from the selected the set of signature, i.e. “the core 178 genes of Hepatocellular carcinoma”, we used two machine learning feature selection techniques

5

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

179 include simple threshold-based approach (Bhalla et al., 2017) and ten-fold cross-validation technique 180 using training cohort (GSE25097). Firstly, we employed a simple threshold-based technique in which 181 we developed single feature-based models to distinguish cancer and normal samples. Single feature- 182 based models are also called threshold-based models wherein genes with a score above the threshold 183 are assigned to cancer class if it is found to be upregulated in cancer and otherwise normal; whereas 184 sample is assigned to normal class if the gene is downregulated in cancerous condition as reported in 185 our previous studies(Bhalla et al., 2017; Kaur et al., 2018). We compute the performance of each model 186 based on a given feature and identify the top 10 features having the highest performance in terms of 187 different performance measures like accuracy, MCC and AUROC, etc.

188 Further, ten-fold cross-validation technique is implemented to select only most important features from 189 above top 10 selected features (genes) based on their individual performance in terms of all the 190 performance measures (sensitivity, specificity, accuracy, MCC and AUROC) to discriminate cancer 191 and normal samples.

192 2.3.2 Development of Prediction models implementation Machine Learning Techniques 193 Here, we have developed the prediction models to distinguish HCC and normal samples based 194 on selected genomic features using various classifiers implementing scikit-learn package. These 195 classifiers include ExtraTrees, Naive Bayes, KNN, Random forest, Logistic Regression (LR), and 196 SVC-RBF (radial basis function) kernel were implemented employing scikit-learn package (Scikit- 197 learn: Machine Learning in Python). The optimization of the parameters for the various classifiers was 198 done by using a grid search with AUROC curve as scoring performance measure for selecting the best 199 parameter.

200 2.4 Performance Evaluation of Models 201 In the current study, both internal and external validation techniques were employed to evaluate 202 the performance of models. First, the training dataset is used to develop prediction models and ten-fold 203 cross-validation is used for performing internal validation. It is important to evaluate the

204 realistic performance of the model on external validation dataset, which should not be used for training 205 and testing during model development. Therefore, we also implemented external validation. Towards 206 this, we evaluated the performance of our model on three independent gene expression cohorts include 207 GSE14520, GSE36376, and TCGA-LIHC obtained from GEO and The Cancer Genome Atlas 208 (TCGA), each containing more than 400 samples (Supplementary Information File 2, Figure S1C), 209 were not used for training. In order to measure the performance of models, we used standard 210 parameters. Both threshold-dependent and threshold-independent parameters were employed to 211 measure the performance. In the case of threshold-dependent parameters, we measure sensitivity, 212 specificity, accuracy and Matthew’s correlation coefficient (MCC) using the following equations.

푇푃 213 푆푒푛푠푖푡푖푣푖푡푦 (푆푒푛) = × 100 (1) 푇푃+퐹푁

푇푁 214 푆푝푒푐푖푓푖푐푖푡푦 (푆푝푒푐) = × 100 (2) 푇푁+퐹푃

푇푃×푇푁 215 퐴푐푐푢푟푎푐푦 (퐴푐푐) = × 100 (3) 푇푃+퐹푃+푇푁+퐹푁

(푇푃×푇푁)−(퐹푃×퐹푁) 216 푀퐶퐶 = (4) √(푇푃+퐹푃)(푇푃+퐹푁)(푇푁+퐹푃)(푇푁+퐹푁)

6

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

217 Where, FP, FN, TP and TN are false positive, false negative true positive and true negative predictions, 218 respectively.

219 While, for threshold-independent measures, we used a standard parameter Area under the Receiver 220 Operating Characteristic curve (AUROC). The AUROC curve is generated by plotting sensitivity or 221 true positive rate against the false positive rate (1-specificity) at various thresholds. Finally, the area 222 under the curve calculated to compute a single parameter called AUROC.

223 2.5 Prognostic Potential of Identified HCC Diagnostic Biomarkers 224 The prognostic potential of the “Three-genes HCC Biomarker” was analyzed using gene 225 expression data of TCGA-LIHC and GSE14520 cohorts. The TCGA dataset and GSE14520 dataset 226 contain of 374 and 219 tumor samples with their clinical information extracted from GEO, TCGA and 227 literature (Liu et al., 2018; Roessler et al., 2010), respectively. The clinical characteristics of patients 228 are given in Table S11 (Supplementary Information File 1). The survival univariate analyses and risk 229 assessments were performed by Survival package in R (Therneau). The survival signatures of 230 biomarkers were evaluated by Kaplan-Meier plots, and a log-rank p-value < 0.05 was considered the 231 cut-off to describe the statistical significance in all survival analyses. Here, we analyzed four types of 232 survivals, i.e. OS (Overall Survival), DSS (Disease-Specific Survival), DFS (Disease-Free Survival) 233 and PFS (Progression-Free Survival) for TCGA cohort, and two types of survivals, i.e. OS and RFS 234 (Recurrence-Free Survival) [also called as Disease-Free Survival (DFS)] for GSE14520 cohort.

235 2.6 Functional Annotation of Signature Genomic Markers

236 In order to discern the biological relevance of the signature genes, enrichment analysis is 237 performed using Enrichr (Kuleshov et al., 2016). Enrichr executes Fisher exact test to identify 238 enrichment score. It provides Z-score and adjusted p-value, which is derived by applying correction on 239 a Fisher Exact test.

240 3 Results 241 Overview

242 The pipeline of our analysis is illustrated in Figure 2. The details of each step are described 243 below.

7

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

244 245 Figure2: Overview for the analysis implemented in the study

246

247 3.1 The Transcriptomic Cores for Hepatocellular Carcinoma 248 The individual statistical differential expression analyses of 27 gene expression datasets resulted 249 in the identification of hundreds of upregulated and downregulated Differentially Expressed Genes 250 (DEGs) (Supplementary Figure 1). The 9,954 genes are present among each of 27 datasets 251 (Supplementary Information File 1, Table S1). Further, the comparative analysis among all 27 datasets 252 scrutinized 26 genes that are found to statistically differentially expressed in at least more than 80% of 253 the datasets, i.e. 22 datasets. We called these genes as “Core genes for Hepatocellular carcinoma”. 254 Among these 26 genes, 12 are downregulated and 14 are upregulated in HCC in comparison to normal 255 samples. The regulatory patterns of the core genes were consistent among most of the datasets (Table 256 1). Additionally, the expression pattern of these genes in training and three external validation datasets 257 is shown in Figure S2 (Supplementary Information File 2). 258 259 Table1: List of genes that are differentially expressed between HCC and adjacent normal or adjacent 260 non-tumor samples with Bonferroni p-values < 0.01. 261 Gene #Up #Down #Sig #Non-sig Up (%) Down (%) Sig ( %) Regulation

FCN3 1 26 24 3 3.70 96.30 88.89 Down

CLEC4M 2 25 24 3 7.41 92.59 88.89 Down

FCN2 2 25 24 3 7.41 92.59 88.89 Down

MARCO 3 24 22 5 11.11 88.89 81.48 Down

CRHBP 2 25 22 5 7.41 92.59 81.48 Down

CFP 2 25 22 5 7.41 92.59 81.48 Down

8

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

STEAP3 2 25 25 2 7.41 92.59 92.59 Down

HGFAC 4 23 22 5 14.81 85.19 81.48 Down

CLEC1B 2 25 23 4 7.41 92.59 85.19 Down

CXCL12 3 24 24 3 11.11 88.89 88.89 Down

MT1E 3 24 24 3 11.11 88.89 88.89 Down

NSUN5 25 2 24 3 92.59 7.41 88.89 Down

MCM7 24 3 24 3 88.89 11.11 88.89 Up

MCM3 24 3 24 3 88.89 11.11 88.89 Up

ITGA6 24 3 24 3 88.89 11.11 88.89 Up

SSR2 24 3 23 4 88.89 11.11 85.19 Up

STMN1 23 4 24 3 85.19 14.81 88.89 Up

PRC1 24 3 23 4 88.89 11.11 85.19 Up

POLD1 24 3 23 4 88.89 11.11 85.19 Up

PBK 24 3 24 3 88.89 11.11 88.89 Up

IGSF3 22 5 23 4 81.48 18.52 85.19 Up

DTL 24 3 22 5 88.89 11.11 81.48 Up

ZWINT 24 3 22 5 88.89 11.11 81.48 Up

SPATS2 24 3 24 3 88.89 11.11 88.89 Up

GPSM2 23 4 23 4 85.19 14.81 85.19 Up

COL15A1 24 3 22 5 88.89 11.11 81.48 Up

262 Up=Upregulated in cancer or HCC; Down=Downregulated in cancer or HCC; #Up: No of datasets in which gene is over- 263 expressed; #Down: No of datasets in which gene is under-expressed; #Sig: No of datasets in which gene is significantly 264 differentially - expressed; #Non-Sig: No of datasets in which gene is not significantly differentially-expressed; Up (%): 265 Percentage of datasets in which gene is over-expressed; Down (%): Percentage of datasets in which gene is under- 266 expressed; Sig (%): Percentage of datasets in which gene is significantly differentially-expressed 267 268 269 3.2 Gene Enrichment analysis of the Transcriptomic Cores for Hepatocellular Carcinoma 270 Gene enrichment analysis of these “core genes of HCC” revealed their biological significance 271 as the encoded by the downregulated genes mainly enriched in complement activation and 272 lectin pathways related processes and negatively regulate cellular extravasation. They are also enriched 273 in GO molecular functions like serine-type endopeptidase activity, oxidoreductase activity, RNA 274 methyltransferase activity, etc. (Supplementary Information File 2, Figure S3). Whereas, upregulated 275 core genes enriched in cell cycle GO biological processes like mitotic spindle organization and mitotic 276 sister chromatid segregation, DNA synthesis and DNA replication, postreplication repair and cellular

9

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

277 response to DNA damage stimulus, etc. They are also enriched in GO molecular functions such as 278 exodeoxyribonuclease activity, GDP-dissociation inhibitor activity, DNA polymerase activity and 279 insulin-like growth factor binding, etc. (Supplementary Information File 2, Figure S3).

280 3.2 Identification of Robust HCC Biomarkers 281 With the aim to identify most important genes from the selected set of the signature, i.e. “the core genes 282 of Hepatocellular carcinoma”, a simple threshold-based approach (Bhalla et al., 2017; Kaur et al., 283 2018) and ten-fold cross-validation technique employed using training dataset (GSE25097).

284 3.2.1 Single Gene Prediction Models using Simple Threshold-based Approach 285 Here, a simple threshold-based approach employed to distinguish cancer and normal samples. We 286 developed simple threshold-based models for each gene of “the core genes of Hepatocellular 287 carcinoma”. Subsequently, all 26 genes are ranked based on their discriminatory power to distinguish 288 cancer from normal tissue samples as shown in table (Supplementary Information File 1, Table S2). 289 The top 10 genes among the core genes of HCC with the highest performance (i.e. Accuracy > 85%, 290 MCC >0.75 and AUROC >0.85) are selected and enlisted in Table 2.

291

292 Table 2: Top 10 genes based on the simple threshold-based approach. 293 Gene Thresh Sens (%) Spec (%) Acc (%) MCC AUROC Mean in Mean in Mean symbol HCC normal diff

FCN2 9.78 97.76 99.59 98.63 0.97 0.98 5.76 10.89 -5.13 CLEC4M 7.59 97.01 98.77 97.85 0.96 0.98 4.32 9.37 -5.06 FCN3 10.76 95.15 99.18 97.06 0.94 0.97 7.87 12.32 -4.45 CLEC1B 9.46 95.52 97.94 96.67 0.93 0.97 5.96 11.38 -5.42 CFP 8.14 96.64 94.24 95.5 0.91 0.96 6.15 8.63 -2.48 CRHBP 8.69 92.54 96.71 94.52 0.89 0.95 6.35 10.3 -3.95 PRC1 7.76 91.42 97.12 94.13 0.88 0.94 10.03 6.35 3.68 PBK 6.03 91.04 93.42 92.17 0.84 0.93 8.65 4.41 4.24 DTL 6.71 85.82 94.65 90.02 0.8 0.91 8.72 5.2 3.52 IGSF3 6.93 81.34 91.77 86.3 0.73 0.88 8.1 6.08 2.01 294 Sens: Sensitivity; Spec: Specificity; Acc: Accuracy; MCC: Mathews Correlation Coefficient; AUROC: Area under 295 Receiver operator curve; Thresh: Threshold; Mean diff: Mean in HCC – Mean in normal 296

297 3.2.2 Single Gene Prediction Models using Cross-validation Technique 298 We have obtained the top 10 genes to classify HCC and normal samples based on simple threshold- 299 based approach. Further, we implemented a ten-fold cross-validation approach to assess the classifying 300 ability of each of these genes using various machine learning classifiers include ExtraTrees, Naive 301 Bayes, KNN, Random forest, Logistic Regression (LR), and SVC-RBF as represented in Table S3 302 (Supplementary Information File 1). Five out of 10 genes include FCN3, CLEC1B, CLEC4M, PRC1, 303 and PBK are the best performers to distinguish HCC and normal samples with sensitivity, specificity

10

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

304 accuracy more than 90%, MCC and AUROC greater than 0.8 and 0.95, respectively as shown in Table 305 S3 (Supplementary Information File 1). 306 3.3 Multiple-Genes based Prediction Models and their Validation on External Validation 307 Datasets 308 Next, we sought to identify a minimum predictive gene signature that can classify HCC and non- 309 tumorous samples with high precision. Thus, prediction models developed for each gene from the top 310 five genes using different machine learning algorithms like ETREES, Naive Bayes, KNN, Random 311 Forest, LR, and SVC-RBF. Prediction performance of these models was evaluated by using ten-fold 312 cross-validation (CV) on the training dataset. Although models based on each of five genes can 313 distinguish HCC and normal samples of training datasets with high precision, i.e. more than 90% 314 sensitivity and specificity. To reduce bias through overfitting and to assure their independence on the 315 testing set, their performance is further evaluated on three external validation datasets. There was a 316 substantial decrease in specificity on these validation datasets (Supplementary Information File 1, 317 Table S4). Therefore, we further developed several models on combining these five genes in various 318 combinations using different machine learning algorithms. 319 3.3.1.1 The Diagnostic Performance of Five-genes HCC Biomarker-based Models in Identifying 320 HCC 321 At first all the five genes are combined and subsequently, prediction models developed using 322 different algorithms, i.e. ETREES, Naive Bayes, KNN, Random Forest, LR, and SVC-RBF. These 323 models discriminate HCC and normal samples with sensitivity, specificity in a range of 93-97%, an 324 accuracy 95-98%, along with AUROC 0.98-0.99 of training dataset as well as three different cross- 325 platform datasets (shown in Table 3). 326 327 328 Table 3: Performance of Five genes (FCN3, CLEC4M, CLEC1B, PRC1, PBK) based models on 329 training and validation datasets implementing various machine learning techniques. 330 Classifier Sens Spec Acc MCC AUROC with Sens Spec Acc MCC AUROC (%) (%) (%) 95% CI (%) (%) (%) with 95% CI Training Dataset Validation Dataset1 ETREES 97.39 98.35 97.85 0.96 0.99 (0.99-1) 97.78 94.09 95.96 0.92 0.98 (0.97-0.99) NB 97.76 99.18 98.43 0.97 0.99 (0.99-1) 97.33 95.45 96.4 0.93 0.98 (0.97-0.99) KNN 97.39 98.77 98.04 0.96 0.99 (0.99-1) 96.89 96.82 96.85 0.94 0.98 (0.97-0.99) RF 97.01 97.94 97.46 0.95 0.99 (0.99-1) 97.33 94.55 95.96 0.92 0.98 (0.97-0.99) LR 97.76 99.59 98.63 0.97 0.99 (0.99-1) 95.56 97.27 96.4 0.93 0.99 (0.98-0.99) SVC 97.01 100 98.43 0.97 0.99 (0.99-1) 96.89 95 95.96 0.92 0.99 (0.98-0.99) Validation Dataset2 Validation Dataset3 ETREES 95 97.41 96.07 0.92 0.98 (0.97-0.99) 97.86 96 97.64 0.89 0.99 (0.98-0.99) NB 94.58 98.45 96.3 0.93 0.98 (0.96-0.99) 98.13 92 97.41 0.88 0.98 (0.98-0.99) KNN 92.92 98.45 95.38 0.91 0.97 (0.96-0.99) 97.86 94 97.41 0.88 0.99 (0.98-0.99) RF 96.67 93.26 95.15 0.9 0.98 (0.97-0.99) 98.4 90 97.41 0.88 0.99 (0.98-0.99) LR 93.75 98.45 95.84 0.92 0.98 (0.97-0.99) 97.59 98 97.64 0.9 0.99 (0.98-0.99) SVC-RBF 93.33 98.45 95.61 0.91 0.98 (0.97-0.99) 97.33 98 97.41 0.89 0.99 (0.98-0.99)

11

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

331 ETREES: Extra Trees Classifier; NB: Naive Bayes; KNN: K Neighbors Classifier; RF: Random Forest; LR: Logistic 332 Regression; SVC-RBF: Support Vector Machine with RBF-kernel; Sens: Sensitivity; Spec: Specificity; Acc: Accuracy; 333 MCC: Mathews Correlation Coefficient; AUROC: Area under Receiver operator curve 334 335 336 3.3.1.2 Four-genes and Three-genes HCC Biomarker-based Models 337 To reduce the number of genes from the Five-genes HCC biomarker without compromising the 338 classification performance; we developed several models using the different combination of these five 339 genes, which include Three-genes and Four-genes HCC biomarker sets. 340 3.3.1.2.1 The Diagnostic Performance of Four-Genes HCC Biomarker based Models in 341 Identifying HCC 342 Two Four-Genes biomarker panel designed called as Four-genes HCC Biomarker-A contains 343 FCN3, CLEC1B, PRC1, PBK and second Four-genes HCC Biomarker-B contains FCN3, CLEC1B, 344 CLEC4M, PRC1. The prediction models based on Four-Genes HCC Biomarker-A classified HCC and 345 non-tumor samples with good prediction power, i.e. sensitivity and specificity in the range of 93-97% 346 with AUROC in the range of 0.97-0.99 on both training and three independent validation data sets as 347 shown in Table S5 (Supplementary Information File 1). 348 Prediction models built on Four-Genes HCC Biomarker-B almost performed in similar manner 349 with an accuracy more than 97% and AUROC in the range of 0.97-0.99 on training and two 350 independent validation data sets. While, there is a marginal decrease in the performance on third 351 independent validation dataset, i.e. specificity decreased for Random forest and SVC model in the 352 range 82-92% as shown in Supplementary Table S6. 353 3.3.1.2.2 The Diagnostic Performance of Three-genes HCC Biomarker based Models for 354 Identification of HCC samples 355 To further reduces the genes from HCC biomarker panel, two “Three-genes Biomarker” sets 356 designed as Three-genes HCC Biomarker-A contains FCN3, CLEC1B, PRC1 and second Three-genes 357 HCC Biomarker-B includes FCN3, CLEC4M, PRC1. Models based on “Three-genes HCC Biomarker- 358 A” categorize HCC and non-tumor samples very efficiently, i.e. accuracy 95-98% with AUROC in the 359 range of 0.96-0.99 on training as well as independent validation data sets as shown in Table 4. The 360 expression pattern of these three-genes among samples of training dataset and three external validation 361 datasets depicted in Figure 3. 362 363 364 Table 4: Performance of Three-genes HCC Biomarker-A (FCN3, CLEC1B, PRC1) based models on 365 training and validation datasets implementing various machine learning techniques. 366 Classifier Sens Spec Acc MCC AUROC with 95% Sens Spec Acc MCC AUROC with (%) (%) (%) CI (%) (%) (%) 95% CI Training Dataset Validation Dataset1 ETREES 96.64 97.94 97.26 0.95 0.99 (0.98-0.99) 94.67 95.91 95.28 0.91 0.97 (0.96-0.99)

NB 97.39 99.18 98.24 0.96 0.99 (0.99-1.0) 96 95.91 95.96 0.92 0.98 (0.97-0.99)

KNN 97.76 98.77 98.24 0.96 0.99 (0.99-1.0) 93.78 97.73 95.73 0.92 0.97 (0.96-0.99)

RF 97.01 97.53 97.26 0.95 0.99 (0.99-1.0) 94.67 96.36 95.51 0.91 0.97 (0.96-0.99)

12

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

LR 93.28 100 96.48 0.93 0.99 (0.99-1.0) 92.89 97.73 95.28 0.91 0.98 (0.97-0.99)

SVC-RBF 94.03 100 96.87 0.94 0.99 (0.98-0.99) 96 96.82 96.4 0.93 0.98 (0.97-0.99)

Validation Dataset2 Validation Dataset3 ETREES 93.75 96.37 94.92 0.9 0.98 (0.97-0.99) 95.72 98 95.99 0.84 0.99 (0.98-0.99)

NB 94.58 98.45 96.3 0.93 0.98 (0.97-0.99) 98.13 82 96.23 0.82 0.96 (0.95-0.98)

KNN 95.83 97.93 96.77 0.94 0.98 (0.97-0.99) 97.59 96 97.41 0.88 0.99 (0.98-0.99)

RF 95.42 94.3 94.92 0.9 0.98 (0.97-0.99) 95.45 96 95.52 0.82 0.98 (0.97-0.99)

LR 95.42 98.45 96.77 0.94 0.99 (0.98-0.99) 97.33 98 97.41 0.89 0.99 (0.98-0.99)

SVC-RBF 93.33 97.93 95.38 0.91 0.98 (0.97-0.99) 96.79 98 96.93 0.87 0.99 (0.98-0.99)

367 ETREES: Extra Trees Classifier; NB:Naive Bayes; KNN: K Neighbors Classifier; RF: Random Forest; LR: Logistic 368 Regression; SVC-RBF: Support Vector Machine with RBF kernel; Sens: Sensitivity; Spec: Specificity; Acc: Accuracy; 369 MCC: Mathews Correlation Coefficient; AUROC: Area under Receiver operator curve. 370 371

372 373 374 Figure 3: Boxplot representing the Expression pattern of Three-genes panel-based HCC 375 Biomarker. 376 377

13

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

378 Most of the models based on “Three-genes HCC Biomarker-B” performed in similar range in 379 classifying samples with a marginal decrease in performance, i.e. sensitivity and specificity in range 380 of 91-97% with an accuracy of 92-98% on training and validation data sets as shown in Table S7 381 (Supplementary Information File 1). 382 3.4 Prediction Model based on the Gene Expression of already identified protein-based 383 biomarker for diagnosis of HCC 384 To understand the importance of our “Three-genes HCC biomarker”, we also developed prediction 385 models using gene expression of already identified proteins biomarkers, i.e. AFP+GPC3 and 386 AFP+GPC3+KRT19 (CK19) [49]. As we do not have their protein expression for these patients’ 387 samples, hence we employed only their gene expression values. Models based on AFP+GPC3+KRT19 388 classified HCC and normal samples of training dataset with an accuracy 67-75%. While, this model 389 attained accuracy of 69-77%, 55-87% and 50-74% on external validation dataset1, dataset2 and 390 dataset3, respectively as shown in Table S8 (Supplementary Information File 1). Further, the prediction 391 models based on AFP+GPC3 have improved performance on training dataset with an accuracy of 70- 392 77%, but lower performance on all three validation datasets as given in Table S9 (Supplementary 393 Information File 1). 394 3.5 Prognostic Potential of the “Three-genes HCC biomarker” in the Survival of HCC Patients 395 To examine the prognostic potential of “Three-genes HCC Biomarker-A”, univariate survival 396 analysis was performed on TCGA and GSE14520 cohorts. The samples were partitioned into low-risk 397 and high-risk groups. Interestingly, all three genes of “Three-genes HCC Biomarker-A” are 398 significantly associated with the survival of HCC. For instance, higher expression (greater than mean) 399 of CLEC1B and FCN3 significantly associated with good outcome of the patients, i.e. OS, DSS, DFS 400 and PFS; while the overexpression of PRC1 is significantly associated with poor survival include DSS, 401 DFS or RFS and PFS of HCC patients for TCGA dataset as shown in Figure 4. In GSE14520 dataset, 402 higher expression of PRC1 is significantly associated with the poor outcome of patients i.e. OS and 403 DFS or RFS, while the higher expression of FCN3 significantly associated with the better outcome of 404 HCC patients as depicted in Figure 5. Complete results of survival analysis with HR (Hazard Ratio), 405 with 95% CI and p-value are represented in Table S10 (Supplementary Information File 1).

14

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

406

407 Figure 4: Kaplan Meier survival curves for the risk estimation of HCC patient in TCGA cohort based 408 on the RNA expression of A) FCN3, B) PRC1, and C) CLEC1B

15

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

409

410 Figure 5: Kaplan Meier survival curves for the risk estimation of HCC patient in GSE14520 cohort

411 based on the RNA expression of A) RFS for PRC1, B) RFS for FCN3, and C) OS for PRC1

412

413 3.6 Web server 414 To facilitates the scientific community working in the area of the liver cancer research, we 415 developed HCCPred (Prediction Server for Hepatocellular Carcinoma). In HCCPred, we execute 416 mainly two modules; Prediction Module and Analysis Module based on robust Five-genes, Four-genes 417 and Three-genes HCC biomarkers and 26 Core genes of HCC identified in the present study for the 418 prediction and analysis of samples from the RNA-expression data. The Prediction Module permits the 419 users to predict the disease status, i.e. cancerous or normal using RNA expression values of a subset of 420 genes using in-silico prediction models based on robust Five-genes, Four-genes and Three-genes HCC 421 biomarkers identified in the present study. Here, the user required to submit RMA (for Affymetrix), 422 A-value (for Agilent), Log2 value (for Illumina) or FPKM (High throughput RNA-seq data) for subset 423 of genes or biomarkers. The output result displays a list for patient samples and corresponding 424 predicted status of samples. Moreover, the user can select among the models, i.e. ETREES-based or 425 SVC-RBF based model. Further, the Analysis Module permits the user to analyse the expression 426 pattern of any of top10 ranked genes to check whether it is upregulated or downregulated in comparison 427 to HCC samples based on the samples of the current study. This webserver is freely accessible at webs 428 http://webs.iiitd.edu.in/raghava/hccpred/.

429 4. Discussion

16

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

430 HCC is a type of tumor that is associated with poor prognosis and high mortality rate among 431 all most common cancer types (Siegel et al., 2019). High recurrence rate and low rate of early detection, 432 results in poor prognosis. Accurate diagnosis of HCC may provide the opportunity for appropriate 433 treatment, including traditional available treatment like liver transplantation resection, etc. Although 434 the AFP and DCP proteins are well-established markers for the diagnosis of HCC, but their sensitivity 435 and specificity are not optimum (Sauzay et al., 2016). Therefore, the development of novel robust 436 diagnostic, prognostic biomarker for HCC is needed as it can assist the existing clinical management 437 of tumor.

438 In this study, we provide a novel large-scale analysis-based approach to identify a robust gene 439 expression-based candidate diagnostic biomarker for HCC derived from multiple transcriptomic 440 profiles/datasets across a variety of platforms obtained from GEO, TCGA. This metadata integration 441 approach across multiple transcriptome datasets of HCC to elucidate core HCC DEG subset followed 442 by class prediction analysis implementing various machine learning algorithms and validation on 443 external independent datasets led us to the identification of multiple-genes based robust biomarkers for 444 HCC. We have identified 26 genes named as “Core genes for HCC” that are uniformly differentially 445 expressed among 80% of datasets. Considering only these genes for downstream machine learning 446 analysis, in an urge to identify a manageable subset with the minimum number of genes from this list 447 that have a high discriminatory power; we further identified “Three-genes based HCC biomarker” 448 which have predictive accuracy of 95-98% and AUROC 0.96-0.99 on training and all three independent 449 validation datasets. Besides, we also developed the prediction models based on the gene expression of 450 already well established protein biomarkers of HCC in the literature, i.e. AFP+GPC3 and 451 AFP+GPC3+KRT19 (Lou et al., 2017). The prediction models based on AFP+GPC3+KRT19 452 discriminate samples of training dataset with an accuracy of 67-75% and 69-77% of validation 453 dataset1, 55-87% of validation dataset2 and 50-74% of validation dataset3; while the models based on 454 AFP+GPC3 have quite lower performance on validation datasets. Based on these observations, we 455 hypothesized that “Three-genes HCC Biomarker” might be proved as quite effective diagnostic 456 biomarker for HCC; further protein products of these genes can be also explored from blood or urine, 457 etc. to identify novel protein based non-invasive biomarker as they have very good discriminatory 458 power to distinguish HCC and non-tumor samples at gene expression level from the tissue samples. 459 Moreover, the product of FCN3 gene is released in the serum and bile (Akaiwa et al., 1999; FCN3 460 ficolin 3 [Homo sapiens (human)] - Gene - NCBI; Pan et al., 2015; Tizzot et al., 2018), thus this may 461 serve as non-invasive biomarkers for diagnosis of HCC.

462 Interestingly, the robust “Three-genes HCC Biomarker” contains FCN3, PRC1, CLEC1B 463 which have very high diagnostic potential, also possess prognostic potential, i.e. they are significantly 464 associated with survival of HCC patients. For instance, higher expression of CLEC1B and FCN3 465 significantly associated with good outcome of HCC patients in TCGA cohort; while higher expression 466 of PRC1 is significantly associated with poor outcome of HCC patients in both TCGA and GSE14520 467 cohorts. Besides, role of CLEC1B and PRC1 was previously also revealed in diagnosis and prognosis 468 of HCC (Chan et al., 2018; Chen et al., 2016; Hu et al., 2018; Kaur et al., 2018).

469 Taken together, we have established robust three-gene HCC diagnostic biomarker with reasonable 470 performance, possess both diagnostic and prognostic potential; and a meta-data integration pipeline for 471 identification of robust biomarker using machine learning technique, which can work across the 472 different platforms. Further, this pipeline can be used for the analysis of any other cancer type. 473 Although more and more research is under the development of novel biomarkers, further work will be 474 required to implement the clinical utilization of identified biomarker to meet real-world demand. We

17

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

475 are anticipating that the identifying novel cost-efficient biomarker using predictive technology for 476 detection of HCC will be promising.

477 5. Conclusions 478 This study identified and validated a highly accurate Three-genes HCC biomarker for 479 discriminating HCC and non-tumorous tissue samples, also possess a significant prognostic potential 480 that may facilitate more accurate early diagnosis and risk stratification upon validation in prospective 481 clinical trials. Moreover, the product of FCN3 gene is released in the serum and bile, thus this may 482 serve as non-invasive diagnostic biomarkers. Additionally, the uniform overexpression-pattern of 483 PRC1 among numerous HCC samples suggest it as a novel potential therapeutic target. 484 Author Contributions 485 H.K. collected the data and created the datasets, H.K. developed classification algorithms, H.K. and 486 A.D. implemented algorithms. H.K. and A.D perform the survival analysis. H.K. and A.D. created the 487 back-end server and front-end user interface. S.B., H.K., and G.P.S.R. analysed the results. H.K., R.K., 488 and A.D. wrote the manuscript. G.P.S.R. conceived and coordinated the project, helped in the 489 interpretation and analysis of data, refined the drafted manuscript and gave complete supervision to the 490 project. All of the authors have read and approved the final manuscript. 491 Funding 492 This research was funded by J. C. Bose National Fellowship (with Grant No. SRP076), Department of 493 Science and Technology (DST), INDIA. 494 Acknowledgments 495 All the authors acknowledge funding agencies J. C. Bose National Fellowship (DST). H.K., and R.K., 496 are thankful to Council of Scientific and Industrial Research (CSIR) and A.D. is thankful to DST 497 INSPIRE for providing fellowships, respectively. 498 Conflicts of Interest 499 The authors declare no financial and non-financial conflict of interest.

500 Abbreviations

501 AUROC Area under the Receiver Operating Characteristic curve

502 ETREES Extra Trees Classifier

503 SVC-RBF Support Vector Machine with RBF kernel

504 TCGA The Cancer Genome Atlas

505 KNN K Neighbors Classifier

506 HCC Hepatocellular Carcinoma

507 MCC Matthew’s correlation coefficient

508 LR Logistic Regression

509 NB Naive Bayes

18

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

510 RF Random Forest 511

512 References

513 Akaiwa, M., Yae, Y., Sugimoto, R., Suzuki, S. O., Iwaki, T., Izuhara, K., et al. (1999). Hakata 514 Antigen, a New Member of the Ficolin/Opsonin p35 Family, Is a Novel Human Lectin Secreted 515 into Bronchus/Alveolus and Bile. J. Histochem. Cytochem. 47, 777–785. 516 doi:10.1177/002215549904700607. 517 Bhalla, S., Chaudhary, K., Kumar, R., Sehgal, M., Kaur, H., Sharma, S., et al. (2017). Gene 518 expression-based biomarkers for discriminating early and late stage of clear cell renal cancer. 519 Sci. Rep. 7, 44997. doi:10.1038/srep44997. 520 Bhasin, M. K., Ndebele, K., Bucur, O., Yee, E. U., Otu, H. H., Plati, J., et al. (2016). Meta-analysis 521 of transcriptome data identifies a novel 5-gene pancreatic adenocarcinoma classifier. Oncotarget 522 7, 23263–23281. doi:10.18632/oncotarget.8139. 523 Cai, C., Wang, W., and Tu, Z. (2019). Aberrantly DNA Methylated-Differentially Expressed Genes 524 and Pathways in Hepatocellular Carcinoma. J. Cancer 10, 355–366. doi:10.7150/jca.27832. 525 Cai, J., Li, B., Zhu, Y., Fang, X., Zhu, M., Wang, M., et al. (2017). Prognostic Biomarker 526 Identification Through Integrating the Gene Signatures of Hepatocellular Carcinoma Properties. 527 EBioMedicine 19, 18–30. doi:10.1016/j.ebiom.2017.04.014. 528 Carvalho, B. S., and Irizarry, R. A. (2010). A framework for oligonucleotide microarray 529 preprocessing. Bioinformatics 26, 2363–2367. doi:10.1093/bioinformatics/btq431. 530 Chan, H. L., Beckedorff, F., Zhang, Y., Garcia-Huidobro, J., Jiang, H., Colaprico, A., et al. (2018). 531 Polycomb complexes associate with enhancers and promote oncogenic transcriptional programs 532 in cancer through multiple mechanisms. Nat. Commun. 9, 3377. doi:10.1038/s41467-018- 533 05728-x. 534 Chatterjee, A., Rodger, E. J., and Eccles, M. R. (2018). Epigenetic drivers of tumourigenesis and 535 cancer metastasis. Semin. Cancer Biol. 51, 149–159. doi:10.1016/J.SEMCANCER.2017.08.004. 536 Chen, C.-L., Tsai, Y.-S., Huang, Y.-H., Liang, Y.-J., Sun, Y.-Y., Su, C.-W., et al. (2018a). Lymphoid 537 Enhancer Factor 1 Contributes to Hepatocellular Carcinoma Progression Through 538 Transcriptional Regulation of Epithelial-Mesenchymal Transition Regulators and Stemness 539 Genes. Hepatol. Commun. 2, 1392–1407. doi:10.1002/hep4.1229. 540 Chen, H., Zhang, Y., Li, S., Li, N., Chen, Y., Zhang, B., et al. (2018b). Direct comparison of five 541 serum biomarkers in early diagnosis of hepatocellular carcinoma. Cancer Manag. Res. 10, 542 1947–1958. doi:10.2147/CMAR.S167036. 543 Chen, J., Rajasekaran, M., Xia, H., Zhang, X., Kong, S. N., Sekar, K., et al. (2016). The microtubule- 544 associated protein PRC1 promotes early recurrence of hepatocellular carcinoma in association 545 with the Wnt/β-catenin signalling pathway. Gut 65, 1522–34. doi:10.1136/gutjnl-2015-310625. 546 Chiyonobu, N., Shimada, S., Akiyama, Y., Mogushi, K., Itoh, M., Akahoshi, K., et al. (2018). Fatty 547 Acid Binding Protein 4 (FABP4) Overexpression in Intratumoral Hepatic Stellate Cells within 548 Hepatocellular Carcinoma with Metabolic Risk Factors. Am. J. Pathol. 188, 1213–1224. 549 doi:10.1016/j.ajpath.2018.01.012. 550 Dawson, M. A., and Kouzarides, T. (2012). Cancer epigenetics: from mechanism to therapy. Cell 551 150, 12–27. doi:10.1016/j.cell.2012.06.013. 552 Deng, Y.-B., Nagae, G., Midorikawa, Y., Yagi, K., Tsutsumi, S., Yamamoto, S., et al. (2010). 553 Identification of genes preferentially methylated in hepatitis C virus-related hepatocellular 554 carcinoma. Cancer Sci. 101, 1501–1510. doi:10.1111/j.1349-7006.2010.01549.x. 555 Diaz, G., Engle, R. E., Tice, A., Melis, M., Montenegro, S., Rodriguez-Canales, J., et al. (2018).

19

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

556 Molecular Signature and Mechanisms of Hepatitis D Virus–Associated Hepatocellular 557 Carcinoma. Mol. Cancer Res. 16, 1406–1419. doi:10.1158/1541-7786.MCR-18-0012. 558 Ding, Y., Yan, J.-L., Fang, A.-N., Zhou, W.-F., Huang, L., Ding, Y., et al. (2017). Circulating 559 miRNAs as novel diagnostic biomarkers in hepatocellular carcinoma detection: a meta-analysis 560 based on 24 articles. Oncotarget 8, 66402–66413. doi:10.18632/oncotarget.18949. 561 Dong, H., Zhang, L., Qian, Z., Zhu, X., Zhu, G., Chen, Y., et al. (2015). Identification of HBV- 562 MLL4 Integration and Its Molecular Basis in Chinese Hepatocellular Carcinoma. PLoS One 10, 563 e0123175. doi:10.1371/journal.pone.0123175. 564 Du, P., Kibbe, W. A., and Lin, S. M. (2008). lumi: a pipeline for processing Illumina microarray. 565 Bioinformatics 24, 1547–8. doi:10.1093/bioinformatics/btn224. 566 Emma, M. R., Iovanna, J. L., Bachvarov, D., Puleio, R., Loria, G. R., Augello, G., et al. (2016). 567 NUPR1, a new target in liver cancer: implication in controlling cell growth, migration, invasion 568 and sorafenib resistance. Cell Death Dis. 7, e2269. doi:10.1038/cddis.2016.175. 569 FCN3 ficolin 3 [Homo sapiens (human)] - Gene - NCBI Available at: 570 https://www.ncbi.nlm.nih.gov/gene?Db=gene&Cmd=DetailsSearch&Term=8547 [Accessed 571 July 25, 2019]. 572 Flavahan, W. A., Gaskell, E., and Bernstein, B. E. (2017). Epigenetic plasticity and the hallmarks of 573 cancer. Science 357, eaal2380. doi:10.1126/science.aal2380. 574 Gao, B., Ning, S., Li, J., Liu, H., Wei, W., Wu, F., et al. (2015). Integrated analysis of differentially 575 expressed mRNAs and miRNAs between hepatocellular carcinoma and their matched adjacent 576 normal liver tissues. Oncol. Rep. 34, 325–33. doi:10.3892/or.2015.3968. 577 GDC Available at: https://portal.gdc.cancer.gov/ [Accessed August 28, 2019]. 578 GitHub - bmbolstad/preprocessCore Available at: https://github.com/bmbolstad/preprocessCore 579 [Accessed July 25, 2019]. 580 Grinchuk, O. V., Yenamandra, S. P., Iyer, R., Singh, M., Lee, H. K., Lim, K. H., et al. (2018). 581 Tumor-adjacent tissue co-expression profile analysis reveals pro-oncogenic ribosomal gene 582 signature for prognosis of resectable hepatocellular carcinoma. Mol. Oncol. 12, 89–113. 583 doi:10.1002/1878-0261.12153. 584 Ho, D. W.-H., Lo, R. C.-L., Chan, L.-K., and Ng, I. O.-L. (2016). Molecular Pathogenesis of 585 Hepatocellular Carcinoma. Liver cancer 5, 290–302. doi:10.1159/000449340. 586 Home - GEO - NCBI Available at: https://www.ncbi.nlm.nih.gov/geo/ [Accessed August 28, 2019]. 587 Hu, K., Wang, Z.-M., Li, J.-N., Zhang, S., Xiao, Z.-F., and Tao, Y.-M. (2018). CLEC1B Expression 588 and PD-L1 Expression Predict Clinical Outcome in Hepatocellular Carcinoma with Tumor 589 Hemorrhage. Transl. Oncol. 11, 552–558. doi:10.1016/j.tranon.2018.02.010. 590 Ji, J., Chen, H., Liu, X.-P., Wang, Y.-H., Luo, C.-L., Zhang, W.-W., et al. (2018). A miRNA 591 Combination as Promising Biomarker for Hepatocellular Carcinoma Diagnosis: A Study Based 592 on Bioinformatics Analysis. J. Cancer 9, 3435–3446. doi:10.7150/jca.26101. 593 Ji, J., Wang, H., Li, Y., Zheng, L., Yin, Y., Zou, Z., et al. (2016). Diagnostic Evaluation of Des- 594 Gamma-Carboxy Prothrombin versus α-Fetoprotein for Hepatitis B Virus-Related 595 Hepatocellular Carcinoma in China: A Large-Scale, Multicentre Study. PLoS One 11, 596 e0153227. doi:10.1371/journal.pone.0153227. 597 Jia, H.-L., Ye, Q.-H., Qin, L.-X., Budhu, A., Forgues, M., Chen, Y., et al. (2007). Gene expression 598 profiling reveals potential biomarkers of human hepatocellular carcinoma. Clin. Cancer Res. 13, 599 1133–9. doi:10.1158/1078-0432.CCR-06-1025. 600 Jiao, Y., Li, Y., Jiang, P., Han, W., and Liu, Y. (2019). PGM5: a novel diagnostic and prognostic 601 biomarker for liver cancer. PeerJ 7, e7070. doi:10.7717/peerj.7070. 602 Kagohara, L. T., Stein-O’Brien, G. L., Kelley, D., Flam, E., Wick, H. C., Danilova, L. V, et al. 603 (2018). Epigenetic regulation of gene expression in cancer: techniques, resources and analysis. 604 Brief. Funct. Genomics 17, 49–63. doi:10.1093/bfgp/elx018.

20

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

605 Kamel, H. F. M., and Al-Amodi, H. S. A. B. (2017). Exploitation of Gene Expression and Cancer 606 Biomarkers in Paving the Path to Era of Personalized Medicine. Genomics. Proteomics 607 Bioinformatics 15, 220–235. doi:10.1016/j.gpb.2016.11.005. 608 Kang, L., Liu, X., Gong, Z., Zheng, H., Wang, J., Li, Y., et al. (2015). Genome-wide identification of 609 RNA editing in hepatocellular carcinoma. Genomics 105, 76–82. 610 doi:10.1016/j.ygeno.2014.11.005. 611 Kaur, H., Bhalla, S., and Raghava, G. (2018). Classification of early and late stage Liver 612 Hepatocellular Carcinoma patients from their genomics and epigenomics profiles. bioRxiv, 613 292854. doi:10.1101/292854. 614 Kim, J. H., Sohn, B. H., Lee, H.-S., Kim, S.-B., Yoo, J. E., Park, Y.-Y., et al. (2014). Genomic 615 Predictors for Recurrence Patterns of Hepatocellular Carcinoma: Model Derivation and 616 Validation. PLoS Med. 11, e1001770. doi:10.1371/journal.pmed.1001770. 617 Klett, H., Fuellgraf, H., Levit-Zerdoun, E., Hussung, S., Kowar, S., Küsters, S., et al. (2018). 618 Identification and Validation of a Diagnostic and Prognostic Multi-Gene Biomarker Panel for 619 Pancreatic Ductal Adenocarcinoma. Front. Genet. 9, 108. doi:10.3389/fgene.2018.00108. 620 Komatsu, H., Iguchi, T., Masuda, T., Ueda, M., Kidogami, S., Ogawa, Y., et al. (2016). HOXB7 621 Expression is a Novel Biomarker for Long-term Prognosis After Resection of Hepatocellular 622 Carcinoma. Anticancer Res. 36, 2767–73. Available at: 623 http://www.ncbi.nlm.nih.gov/pubmed/27272787 [Accessed July 25, 2019]. 624 Kuleshov, M. V, Jones, M. R., Rouillard, A. D., Fernandez, N. F., Duan, Q., Wang, Z., et al. (2016). 625 Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids 626 Res. 44, W90-7. doi:10.1093/nar/gkw377. 627 Kumar, R., Patiyal, S., Kumar, V., Nagpal, G., and Raghava, G. P. S. (2019). In Silico Analysis of 628 Gene Expression Change Associated with Copy Number of Enhancers in Pancreatic 629 Adenocarcinoma. Int. J. Mol. Sci. 20, 3582. doi:10.3390/ijms20143582. 630 Lamb, J. R., Zhang, C., Xie, T., Wang, K., Zhang, B., Hao, K., et al. (2011). Predictive Genes in 631 Adjacent Normal Tissue Are Preferentially Altered by sCNV during Tumorigenesis in Liver 632 Cancer and May Rate Limiting. PLoS One 6, e20090. doi:10.1371/journal.pone.0020090. 633 Li, L., Lei, Q., Zhang, S., Kong, L., and Qin, B. (2017). Screening and identification of key 634 biomarkers in hepatocellular carcinoma: Evidence from bioinformatic analysis. Oncol. Rep. 38, 635 2607–2618. doi:10.3892/or.2017.5946. 636 Li, N., Li, L., and Chen, Y. (2018). The Identification of Core Gene Expression Signature in 637 Hepatocellular Carcinoma. Oxid. Med. Cell. Longev. 2018, 3478305. 638 doi:10.1155/2018/3478305. 639 Liao, X., Liu, X., Yang, C., Wang, X., Yu, T., Han, C., et al. (2018). Distinct Diagnostic and 640 Prognostic Values of Minichromosome Maintenance Gene Expression in Patients with 641 Hepatocellular Carcinoma. J. Cancer 9, 2357–2373. doi:10.7150/jca.25221. 642 Lim, H.-Y., Sohn, I., Deng, S., Lee, J., Jung, S. H., Mao, M., et al. (2013). Prediction of Disease-free 643 Survival in Hepatocellular Carcinoma by Gene Expression Profiling. Ann. Surg. Oncol. 20, 644 3747–3753. doi:10.1245/s10434-013-3070-y. 645 Liu, F., Li, H., Chang, H., Wang, J., and Lu, J. (2015). Identification of hepatocellular carcinoma- 646 associated hub genes and pathways by integrated microarray analysis. Tumori 101, 206–14. 647 doi:10.5301/tj.5000241. 648 Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., et al. (2018). 649 An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival 650 Outcome Analytics. Cell 173, 400-416.e11. doi:10.1016/j.cell.2018.02.052. 651 Lou, J., Zhang, L., Lv, S., Zhang, C., and Jiang, S. (2017). Biomarkers for Hepatocellular Carcinoma. 652 Biomark. Cancer 9, 1–9. doi:10.1177/1179299X16684640. 653 Mah, W.-C., Thurnherr, T., Chow, P. K. H., Chung, A. Y. F., Ooi, L. L. P. J., Toh, H. C., et al.

21

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

654 (2014). Methylation Profiles Reveal Distinct Subgroup of Hepatocellular Carcinoma Patients 655 with Poor Prognosis. PLoS One 9, e104158. doi:10.1371/journal.pone.0104158. 656 Makowska, Z., Boldanova, T., Adametz, D., Quagliata, L., Vogt, J. E., Dill, M. T., et al. (2016). 657 Gene expression analysis of biopsy samples reveals critical limitations of transcriptome-based 658 molecular classifications of hepatocellular carcinoma. J. Pathol. Clin. Res. 2, 80–92. 659 doi:10.1002/cjp2.37. 660 Marshall, A., Lukk, M., Kutter, C., Davies, S., Alexander, G., and Odom, D. T. (2013). Global gene 661 expression profiling reveals SPINK1 as a potential hepatocellular carcinoma marker. PLoS One 662 8, e59459. doi:10.1371/journal.pone.0059459. 663 Mas, V. R., Maluf, D. G., Archer, K. J., Yanek, K., Kong, X., Kulik, L., et al. (2009). Genes Involved 664 in Viral Carcinogenesis and Tumor Initiation in Hepatitis C Virus-Induced Hepatocellular 665 Carcinoma. Mol. Med. 15, 85–94. doi:10.2119/molmed.2008.00110. 666 Meng, C., Shen, X., and Jiang, W. (2018). Potential biomarkers of HCC based on gene expression 667 and DNA methylation profiles. Oncol. Lett. doi:10.3892/ol.2018.9020. 668 Murakami, Y., Kubo, S., Tamori, A., Itami, S., Kawamura, E., Iwaisako, K., et al. (2015). 669 Comprehensive analysis of transcriptome and metabolome analysis in Intrahepatic 670 Cholangiocarcinoma and Hepatocellular Carcinoma. Sci. Rep. 5, 16294. doi:10.1038/srep16294. 671 Nagpal, G., Sharma, M., Kumar, S., Chaudhary, K., Gupta, S., Gautam, A., et al. (2015). PCMdb: 672 Pancreatic Cancer Methylation Database. Sci. Rep. 4, 4197. doi:10.1038/srep04197. 673 Narrandes, S., and Xu, W. (2018). Gene Expression Detection Assay for Cancer Clinical Use. J. 674 Cancer 9, 2249. doi:10.7150/JCA.24744. 675 Nebbioso, A., Tambaro, F. P., Dell’Aversana, C., and Altucci, L. (2018). Cancer epigenetics: Moving 676 forward. PLOS Genet. 14, e1007362. doi:10.1371/journal.pgen.1007362. 677 Ocker, M. (2018). Biomarkers for hepatocellular carcinoma: What’s new on the horizon? World J. 678 Gastroenterol. 24, 3974–3979. doi:10.3748/wjg.v24.i35.3974. 679 Oishi, N., Kumar, M. R., Roessler, S., Ji, J., Forgues, M., Budhu, A., et al. (2012). Transcriptomic 680 profiling reveals hepatic stem-like gene signatures and interplay of miR-200c and epithelial- 681 mesenchymal transition in intrahepatic cholangiocarcinoma. Hepatology 56, 1792–1803. 682 doi:10.1002/hep.25890. 683 Pan, J.-W., Gao, X.-W., Jiang, H., Li, Y.-F., Xiao, F., and Zhan, R.-Y. (2015). Low serum ficolin-3 684 levels are associated with severity and poor outcome in traumatic brain injury. J. 685 Neuroinflammation 12, 226. doi:10.1186/s12974-015-0444-z. 686 Rhodes, D. R., Yu, J., Shanker, K., Deshpande, N., Varambally, R., Ghosh, D., et al. (2004). Large- 687 scale meta-analysis of cancer microarray data identifies common transcriptional profiles of 688 neoplastic transformation and progression. Proc. Natl. Acad. Sci. U. S. A. 101, 9309–14. 689 doi:10.1073/pnas.0401994101. 690 Ritchie, M. E., Phipson, B., Wu, D., Hu, Y., Law, C. W., Shi, W., et al. (2015). limma powers 691 differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 692 43, e47. doi:10.1093/nar/gkv007. 693 Roessler, S., Jia, H.-L., Budhu, A., Forgues, M., Ye, Q.-H., Lee, J.-S., et al. (2010). A unique 694 metastasis gene signature enables prediction of tumor relapse in early-stage hepatocellular 695 carcinoma patients. Cancer Res. 70, 10202–12. doi:10.1158/0008-5472.CAN-10-2607. 696 Sauzay, C., Petit, A., Bourgeois, A.-M., Barbare, J.-C., Chauffert, B., Galmiche, A., et al. (2016). 697 Alpha-foetoprotein (AFP): A multi-purpose marker in hepatocellular carcinoma. Clin. Chim. 698 Acta. 463, 39–44. doi:10.1016/j.cca.2016.10.006. 699 Schulze, K., Imbeaud, S., Letouzé, E., Alexandrov, L. B., Calderaro, J., Rebouissou, S., et al. (2015). 700 Exome sequencing of hepatocellular carcinomas identifies new mutational signatures and 701 potential therapeutic targets. Nat. Genet. 47, 505–511. doi:10.1038/ng.3252. 702 Scikit-learn: Machine Learning in Python Available at:

22

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

703 https://dl.acm.org/citation.cfm?id=1953048.2078195 [Accessed July 25, 2019]. 704 Sekhar, V., Pollicino, T., Diaz, G., Engle, R. E., Alayli, F., Melis, M., et al. (2018). Infection with 705 hepatitis C virus depends on TACSTD2, a regulator of claudin-1 and occludin highly 706 downregulated in hepatocellular carcinoma. PLOS Pathog. 14, e1006916. 707 doi:10.1371/journal.ppat.1006916. 708 Seok, J. Y., Na, D. C., Woo, H. G., Roncalli, M., Kwon, S. M., Yoo, J. E., et al. (2012). A fibrous 709 stromal component in hepatocellular carcinoma reveals a cholangiocarcinoma-like gene 710 expression trait and epithelial-mesenchymal transition. Hepatology 55, 1776–1786. 711 doi:10.1002/hep.25570. 712 Sharma, S., Kelly, T. K., and Jones, P. A. (2010). Epigenetics in cancer. Carcinogenesis 31, 27–36. 713 doi:10.1093/carcin/bgp220. 714 Shirota, Y., Kaneko, S., Honda, M., Kawai, H. F., and Kobayashi, K. (2001). Identification of 715 differentially expressed genes in hepatocellular carcinoma with cDNA microarrays. Hepatology 716 33, 832–40. doi:10.1053/jhep.2001.23003. 717 Siegel, R. L., Miller, K. D., and Jemal, A. (2019). Cancer statistics, 2019. CA. Cancer J. Clin. 69, 7– 718 34. doi:10.3322/caac.21551. 719 Stefanska, B., Huang, J., Bhattacharyya, B., Suderman, M., Hallett, M., Han, Z.-G., et al. (2011). 720 Definition of the Landscape of Promoter DNA Hypomethylation in Liver Cancer. Cancer Res. 721 71, 5891–5903. doi:10.1158/0008-5472.CAN-10-3823. 722 The Cancer Genome Atlas Program - National Cancer Institute Available at: 723 https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga [Accessed 724 August 28, 2019]. 725 Therneau, T. M. Survival Analysis [R package survival version 2.44-1.1]. Available at: https://cran.r- 726 project.org/web/packages/survival/index.html [Accessed July 25, 2019]. 727 Tian, G., Yang, S., Yuan, J., Threapleton, D., Zhao, Q., Chen, F., et al. (2018). Comparative efficacy 728 of treatment strategies for hepatocellular carcinoma: systematic review and network meta- 729 analysis. BMJ Open 8, e021269. doi:10.1136/bmjopen-2017-021269. 730 Tizzot, M. R., Lidani, K. C. F., Andrade, F. A., Mendes, H. W., Beltrame, M. H., Reiche, E., et al. 731 (2018). Ficolin-1 and Ficolin-3 Plasma Levels Are Altered in HIV and HIV/HCV Coinfected 732 Patients From Southern Brazil. Front. Immunol. 9, 2292. doi:10.3389/fimmu.2018.02292. 733 Tung, E. K.-K., Mak, C. K.-M., Fatima, S., Lo, R. C.-L., Zhao, H., Zhang, C., et al. (2011). 734 Clinicopathological and prognostic significance of serum and tissue Dickkopf-1 levels in human 735 hepatocellular carcinoma. Liver Int. 31, 1494–1504. doi:10.1111/j.1478-3231.2011.02597.x. 736 Vasudevan, S., Flashner-Abramson, E., Remacle, F., Levine, R. D., and Kravchenko-Balasha, N. 737 (2018). Personalized disease signatures through information-theoretic compaction of big cancer 738 data. Proc. Natl. Acad. Sci. U. S. A. 115, 7694–7699. doi:10.1073/pnas.1804214115. 739 Villa, E., Critelli, R., Lei, B., Marzocchi, G., Cammà, C., Giannelli, G., et al. (2016). 740 Neoangiogenesis-related genes are hallmarks of fast-growing hepatocellular carcinomas and 741 worst survival. Results from a prospective study. Gut 65, 861–869. doi:10.1136/gutjnl-2014- 742 308483. 743 Wang, H., Huo, X., Yang, X.-R., He, J., Cheng, L., Wang, N., et al. (2017). STAT3-mediated 744 upregulation of lncRNA HOXD-AS1 as a ceRNA facilitates liver cancer metastasis by 745 regulating SOX4. Mol. Cancer 16, 136. doi:10.1186/s12943-017-0680-1. 746 Wang, Z., Teng, D., Li, Y., Hu, Z., Liu, L., and Zheng, H. (2018). A six-gene-based prognostic 747 signature for hepatocellular carcinoma overall survival prediction. Life Sci. 203, 83–91. 748 doi:10.1016/j.lfs.2018.04.025. 749 Wong, K.-F., Liu, A. M., Hong, W., Xu, Z., and Luk, J. M. (2016). Integrin 750 &#x3B1;2&#x3B2;1 inhibits MST1 kinase phosphorylation and activates Yes- 751 associated protein oncogenic signaling in hepatocellular carcinoma. Oncotarget 7, 77683–

23

bioRxiv preprint doi: https://doi.org/10.1101/758250; this version posted September 5, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

752 77695. doi:10.18632/oncotarget.12760. 753 Woo, H. G., Choi, J.-H., Yoon, S., Jee, B. A., Cho, E. J., Lee, J.-H., et al. (2017). Integrative analysis 754 of genomic and epigenomic regulation of the transcriptome in liver cancer. Nat. Commun. 8, 755 839. doi:10.1038/s41467-017-00991-w. 756 Xia, Q., Li, Z., Zheng, J., Zhang, X., Di, Y., Ding, J., et al. (2019). Identification of novel biomarkers 757 for hepatocellular carcinoma using transcriptome analysis. J. Cell. Physiol. 234, 4851–4863. 758 doi:10.1002/jcp.27283. 759 Xu, W., Rao, Q., An, Y., Li, M., and Zhang, Z. (2018). Identification of biomarkers for Barcelona 760 Clinic Liver Cancer staging and overall survival of patients with hepatocellular carcinoma. 761 PLoS One 13, e0202763. doi:10.1371/journal.pone.0202763. 762 Zhang, C., Peng, L., Zhang, Y., Liu, Z., Li, W., Chen, S., et al. (2017). The identification of key 763 genes and pathways in hepatocellular carcinoma by bioinformatics analysis of high-throughput 764 data. Med. Oncol. 34, 101. doi:10.1007/s12032-017-0963-9. 765 Zhang, Y.-L., Ding, C., and Sun, L. (2019). High Expression B3GAT3 Is Related with Poor 766 Prognosis of Liver Cancer. Open Med. (Warsaw, Poland) 14, 251–258. doi:10.1515/med-2019- 767 0020. 768 Zhao, X., Parpart, S., Takai, A., Roessler, S., Budhu, A., Yu, Z., et al. (2015). Integrative genomics 769 identifies YY1AP1 as an oncogenic driver in EpCAM+ AFP+ hepatocellular carcinoma. 770 Oncogene 34, 5095–5104. doi:10.1038/onc.2014.438. 771 Zheng, Y., Liu, Y., Zhao, S., Zheng, Z., Shen, C., An, L., et al. (2018). Large-scale analysis reveals a 772 novel risk score to predict overall survival in hepatocellular carcinoma. Cancer Manag. Res. 773 Volume 10, 6079–6096. doi:10.2147/CMAR.S181396. 774 Zubiete-Franco, I., García-Rodríguez, J. L., Lopitz-Otsoa, F., Serrano-Macia, M., Simon, J., 775 Fernández-Tussy, P., et al. (2019). SUMOylation regulates LKB1 localization and its oncogenic 776 activity in liver cancer. EBioMedicine 40, 406–421. doi:10.1016/j.ebiom.2018.12.031. 777

778 Supplementary Materials:

779 Supplementary Information File 1: Supplementary Tables.

780 Supplementary Information File 2: Supplementary Figures.

24