G C A T T A C G G C A T

Article A Sparse and Low-Rank Regression Model for Identifying the Relationships Between DNA Methylation and Expression Levels in Gastric Cancer and the Prediction of Prognosis

Yishu Wang 1,*, Lingyun Xu 2 and Dongmei Ai 1,3,*

1 School of Mathematics and Physics, University of Science & Technology Beijing, Beijing 100083, China 2 School of Mathematics and Statistics, Qingdao University, Qingdao 266003, China; [email protected] 3 Basic Experimental Center of Natural Science, University of Science and Technology Beijing, Beijing 100083, China * Correspondence: [email protected] (Y.W.); [email protected] (D.A.); Tel.: 010-62332349 (Y.W.)

Abstract: DNA methylation is an important regulator of that can influence tumor heterogeneity and shows weak and varying expression levels among different genes. Gastric cancer (GC) is a highly heterogeneous cancer of the digestive system with a high mortality rate worldwide. The heterogeneous subtypes of GC lead to different prognoses. In this study, we explored the relationships between DNA methylation and gene expression levels by introducing a sparse low- rank regression model based on a GC dataset with 375 tumor samples and 32 normal samples from The Cancer Genome Atlas database. Differences in the DNA methylation levels and sites were found to be associated with differences in the expressed genes related to GC development.   Overall, 29 methylation-driven genes were found to be related to the GC subtypes, and in the prognostic model, we explored five prognoses related to the methylation sites. Finally, based on a Citation: Wang, Y.; Xu, L.; Ai, D. A Sparse and Low-Rank Regression low-rank matrix, seven subgroups were identified with different methylation statuses. These specific Model for Identifying the classifications based on DNA methylation levels may help to account for heterogeneity and aid in Relationships Between DNA personalized treatments. Methylation and Gene Expression Levels in Gastric Cancer and the Keywords: low-rank sparse regression model; DNA methylation; prognosis; gene expression; Prediction of Prognosis. Genes 2021, gastric cancer 12, 854. https://doi.org/10.3390/ genes12060854

Academic Editor: Stefania Bortoluzzi 1. Introduction Gastric cancer (GC) is a cancer of the digestive system, and it has a high mortality Received: 27 April 2021 rate globally, ranking second among all cancers [1]. In recent years, GC morbidity and Accepted: 29 May 2021 Published: 2 June 2021 mortality rates have increased because of changes in our diet and environment. Patients are often diagnosed with GC at an inoperable stage, and recurrence is common after resec-

Publisher’s Note: MDPI stays neutral tion [1]. Tumor heterogeneity allows for early diagnosis and effective treatment of patients with regard to jurisdictional claims in with different types of GC. GC is highly heterogeneous and shows varying sensitivity to published maps and institutional affil- chemotherapy among different clinical subtypes. Therefore, molecular oncology studies of iations. GC are urgently needed to obtain better prognostic outcomes. GC usually originates from Helicobacter pylori infection, which may lead to chronic inflammation and consequent tumorigenesis [2]. In addition, other risk factors have been identified, such as environmen- tal, genetic, and epigenetic factors [3]. Because of tumor heterogeneity, the same treatment can result in different prognoses for different GC subtypes. Identification of the genetic Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. heterogeneity of GC could help clinicians to better understand the mechanism of GC. This article is an open access article The functional heterogeneity of cancer cells within tumors involves weak and varying distributed under the terms and genetic expression between cells or different functional cell subpopulations [4], indicat- conditions of the Creative Commons ing a more comprehensive understanding of tumor cells. One basic aspect of cancer cell Attribution (CC BY) license (https:// heterogeneity in the same tumor is the different levels of gene expression, which may be creativecommons.org/licenses/by/ influenced by many factors, for example, epigenetic changes which include DNA methyla- 4.0/). tion, non-coding RNA, chromatin remodeling, and histone modifications [3]. Epigenetic

Genes 2021, 12, 854. https://doi.org/10.3390/genes12060854 https://www.mdpi.com/journal/genes Genes 2021, 12, 854 2 of 18

processes can regulate the expression of genes but without DNA sequencing changes and is inheritable across generations. DNA methylation is a well-characterized epigenetic modification that plays an important role in carcinogenesis [5,6], and is mediated by DNA methyltransferase. DNA hypermethylation at regions can silence the expression of targeted genes [7], further influencing cell cycles, DNA repair, and even the signaling pathways of tumor development. It plays an important role in promoting cancer [6,8,9]. However, not all CpG sites regulate gene expression. Only hypermethylation of CpG islands in specific regions inhibits the expression of tumor suppressor genes (TSGs), DNA repair genes, housekeeping genes, and cell cycle control genes. Currently, exploring the relationships between gene expression levels and DNA methylation regions based on a specific tumor is needed for tumor typing and prognosis prediction. For GC, although several CpG sites are involved in the processes of GC develop- ment [3], there is a lack of systematic analysis to establish an efficient model that can provide insights into low gene expression levels in different GC subtypes affected by specific DNA methylation sites, as well as further help to prioritize disease-associated methylation sites and contribute to GC heterogeneity. Most abnormal changes, including methylation or demethylation of specific DNA methylation sites, exist in different GC TSGs or oncogenes. These changes often occur before tumor formation or development. Accordingly, if their sites can be accurately confirmed, they can be considered to be early diagnostic GC markers or predictors for people with a high risk of GC. In contrast, high concordance in methylation and gene expression predicts tumor immune infiltration levels [10] and immune-informative CpG sites show significant prog- nostic value [10]. Therefore, identification of specific CpG sites associated with disease would allow the prediction of relationships between methylation CpG sites and cancer prognosis, which would provide a reference for clinical practices of cancer prognosis. One approach is to determine which aspects of DNA, such as gene expression levels, are affected by DNA methylation. Naively, such association can be identified using a simple statistical test on all paired combinations of DNA methylation and gene transcripts. However, gene expression levels are influenced by many factors, including DNA methylation, changes in DNA sequence, and other hidden factors, such as cellular state [11], environmental factors [12], and experimental conditions [13]. The interaction of most epige- netic variations with genetic variation is diverse. Moreover, a wide variety of confounders lie hidden in the data, leading to both spurious associations and missed associations if not properly addressed. An interesting problem is how to distribute the linkage between epi- genetic and genetic variations. In particular, an association analysis of these RNA-directed DNA methylation regions with genetic variants could identify the methylation quantitative trait loci, which are related to GC. In this study, we introduce an alternative formulation to address these problems. We make use of sparse regression for RNA-directed DNA methylation mapping and propose a low-rank representation to account for the DNA sequence changes and other nongenetic hidden factors. Our methodology is inspired by the following linear mixed model [14]: y = Xβ + Zµ + e, where y is the vector of data; β is a vector of fixed effects; µ is a vector of random effects; and X and Z are the model matrices corresponding to β and µ, respectively. This linear mixed model has been used to model the hidden factors in eQTL mapping problems [15]. We aimed to establish the relationship between DNA methylation and genetic vari- ation (here, we focused on gene expression levels) related to GC subtypes. We rewrote this mixed model because it could be separated passively. We used sparse regression X1β1 to account for DNA methylation mapping, which was related to differential gene expression levels. We put the DNA sequence changes and hidden factors related to GC tumor subtypes, separately, into X2β2 and Zµ. Then, we merged these two parts into one low-rank matrix representation, L, for the expression heterogeneity (EH). Thereby, the regression model could be rewritten as follows: y = Xβ + L + e. Because a low-rank matrix Genes 2021, 12, 854 3 of 18

can always be written as L=WH, our model can be regarded as an equivalent form of the linear mixed model. There are two main advantages of our model as follows: Multiple gene expression levels and confounder effects can be jointly analyzed in disease-associated DNA methyla- tion site studies, and the low-rank matrix, L, represents different patients’ sample clusters according to the confounder effects. Then, we can identify the differentially expressed genes of the GC subtypes, which also have specific methylation sites.

2. Materials and Methods 2.1. Methods Let Y be a n × q matrix corresponding to a gene expression dataset where n is the number of samples and q is the number of genes. Let X be an n × q matrix corresponding to a DNA methylation dataset, where p is the number of DNA methylation sites. To model the relationship between Y and X, we propose decomposing Y as follows:

Y = XB + L + e (1)

where B ∈ Rp×q is the coefficient matrix and e ∈ Rn×q is a Gaussian random noise term 2 2 n×q with zero mean and variance σ , i.e., eij ∼ N(0, σ ). Here, we introduce L ∈ R to our model to account for the variations caused by EH, including DNA sequence changes and other hidden nongenetic factors. This model implies that gene expression levels are influenced by DNA methylation, EH is related to tumor subtypes, and random noise. To make the decomposition (1) possible, we made the following assumptions: • There are only a few expression heterogeneous factors related to tumor subtypes that influence differential gene expression levels. Thus, L is a low-rank matrix. Then, we can obtain gene clusters according to the matrix L. • The gene expression level is only affected by a small fraction of DNA methylation sites. This implies that the coefficient matrix B should be sparse. On the basis of these assumptions, Formula (1) can be rewritten as per the following optimization problem: 2 minkY−XB−LkF B,L (2) s.t. rank(L) ≤ r0, kBk1 ≤ t0 q 2 2 where |Bk1 is the element-wise l1 norm and |WkF = ∑ij Wij is the Frobenius norm. To make this minimization problem a convex surrogate, we relax the rank operator on L with the nuclear norm, which has been previously proven to be effective [16], as follows:

1 2 min kY−XB−L||F + ρkBk1 + λkLk∗ (3) B,L 2

where kLk∗ is the nuclear norm of L. ρ and λ are regularization parameters that control the sparsity of B and the rank of L. Then, we adopted an alternating strategy to solve problem (3). • For fixed B, the optimization problem becomes:

1 2 min kY−XB−L||F + λkL||∗ (4) L 2 The solution of L is Formula (5), according to [16], as follows:

L = Sλ(Y−XB) (5)

T T where Sλ(W) = UDλV , with Dλ = diag[(d1 − λ)+, ... , (dr − λ)+], and UDV is the singular value decomposition (SVD) of W, D = diag[d1,..., dr], and t+ = max(t, 0). Genes 2021, 12, 854 4 of 18

• For fixed L, the optimization problem becomes:

1 2 min kY−XB−L||F + ρkB||1 (6) B 2 It can be decomposed into q independent Lasso problems [17] as follows:

1 2 min kYj − XBj − LjkF + ρkBjk1, j = 1, . . . , q Bj 2

where Yj, Lj, and Bj are the jth columns of Y, L, and B. The Lasso problem can be solved efficiently by the coordinate descent algorithm [15,18]. After the two alternative steps, we selected the top d DNA methylation sites for each gene (based on the absolute value of the coefficients) and the low-rank matrix L. Then, we built the methylation prognosis subtypes based on specific methylation sites, and relationships between the methylation sites and gene expression levels. Meanwhile, the low-rank matrix L also provided a classification of tumor samples that may indicate the different GC subtypes caused by different methylation sites and epigenetic levels, which was defined as EH in our model.

2.2. Synthetic Data To demonstrate the effectiveness of our model, while avoiding the simulation setup favoring it, we generated the synthetic data, as in the setup in [19], as follows: • For the methylation effects, each methylation site is generated independently and uniformly from a binomial distribution with the probability p = 0.25 denoted by matrix X with dimension n × p. The coefficient matrix B is a sparse matrix with dimension p × q, with 2% non-zero entries, which are generated using a standard Gaussian distribution. Let G denote the methylation effect G = XB. T n×K • EH: The covariance matrix is Σ generated by HH , with H ∈ R and Hi,j ∼ N(0, 1). Here, K is the number of hidden factors. The random variable Lj was drawn from N(0, τΣ). Let L = [L1,..., Lq] 2 • Let e ∼ N(0, σe I) Now the synthetic data are:

Y=XB+µ+e=G+µ+e (7)

2.3. Gastric Cancer Data The GC datasets were downloaded, including the mRNA expression, methylation, and clinical datasets, from The Cancer Genome Atlas repository (http://portal.gdc.cancer.gov/, aceessed on 2 June 2021).

3. Results 3.1. Synthetic Results Demonstrated Our Model Benefited High Dimensional Data We used the synthetic dataset, constructed as described in the Materials and Methods, to test the performance of our method under different settings, i.e., we varied n, p, and q. Figure1 shows the receiver operating characteristic curves of different settings. From the simulation results, we observed that our method is beneficial with high dimensional data; is robust with a larger q, which is the dimension of the response variable in the regression model; that the higher the number of samples, the better the regression results; and if the number p is larger than q and n, this may result in overfitting. Genes 2021, 12, x FOR PEER REVIEW 6 of 19 Genes 2021, 12, 854 5 of 18

Figure 1. ROC curves of different parameters. Figure 1. ROC curves of different parameters. 3.2. Gastric Cancer Transcription and Methylation Datasets Filtration We downloaded the gastric cancer dataset from The Cancer Genome Atlas repository (http://portal.gdc.cancer.gov/, aceessed on 2 June 2021). There were 407 patients with 375 tumor samples and 32 normal samples with 56,753 gene expression profiles for each sample in this gastric cancer dataset and 469 patients with 443 tumor samples and 27 normal patients with 19,755 DNA CpG sites. First, in more than 70% of the samples, we deleted the methylation sites that had missing values; these were located in the second , and CpGs from above 2 kb upstream to 0.5 kb downstream (gene promoter regions). The clinical samples were also filtered if the follow-up duration was less than 30 days, or there were no follow-up data. Then, we took the intersection of samples based on the mRNA expression and methylation datasets to use in our model. There were 378 patients with 351 tumor samples and 27 normal patients. Furthermore, the number of gene expression profiles was too large to provide significant information. We selected the differentially expressed genes between tumor and normal samples as response variables.

3.3. Application of Our Model to Gastric Cancer Dataset Here, the Y matrix was read from the mRNA expression dataset, and X was read from the DNA methylation dataset. Our regression-decomposed model gave two matrixes, i.e., B and L, where B denoted the relationships between DNA methylation sites and gene expression levels and L denoted a low-rank matrix. Similar genes shared the same rank. Figure2 shows the linkage peaks in this GC study given by our model; the top 1000 associations based on abs (B) are shown here. Figure3A shows the methylation levels of the top 100 methylation sites based on abs (B) and Figure3B shows gene expression levels of the genes corresponding to the 100 sites. These 100 CpG sites indicated the important influence of DNA methylation on gene expression in GC samples. In addition, Figure4 shows the (GO) (A) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (B) enriched functional items of these methylation sites (p-value of <0.05). Most of Figurethese 2. sitesThe plot were of associated linkage peaks with carcinogenesis, given by our mo metabolism,del (Top 1000 and ECM-receptorassociations based interaction on abs(ma- trixB)functions are shown and here). most of the genes corresponding to them took part in receptor ligand activity and collagen-containing extracellular pathways.

Genes 2021, 12, x FOR PEER REVIEW 6 of 19

Figure 1. ROC curves of different parameters.

Genes 2021, 12, 854 6 of 18

Genes 2021, 12, x FOR PEERFigure REVIEWFigure 2. The 2. plotThe of linkage plot of peaks linkage given peaks by our given model by(Top our 1000 model associations (Top based 1000 associationson abs(ma- based on7 of abs(matrixB) 19

trixB)are are shown shown here). here).

(A)

(B)

FigureFigure 3. ((AA)) The The methylation methylation levels levels of the of top the 100 top methylation 100 methylation sites based sites on based abs(B); on (B abs(B);) the gene (B ) the gene expressionexpression levels levels of of genes genes corresponding corresponding to these to these methylation methylation sites. sites.

Genes 2021Genes, 12 ,2021 854 , 12, x FOR PEER REVIEW 7 of8 18of 19

(A): GO function items

(B): KEGG items BP

CC

MF Figure 4. Cont.

Genes 2021Genes, 12 2021, 854, 12, x FOR PEER REVIEW 89 of of 18 19

MF

FigureFigure 4. GO 4. GO and and KEGG KEGG enrichment enrichment of the of the top top 100 100 methylation methylation sites. sites.

3.4.3.4. 29 29 Genes Genes Were were Identified Identified as as Gastric Gastric Cancer-Associated Cancer-Associated Methylation-Driven Methylation-Driven Genes Genes ToTo find find more more information information related related to DNA to DNA methylation methylation sites with sites gene with expression gene expression levels, welevels, used Rwe package used R “Methylmix” package “Methylmix” [20] to find [20] the methylation-driven to find the methylation-driven genes with low genes mRNA with expressionlow mRNA and expression high methylation and high levels methylation and negative levels correlations and negative between correlations methylation between andmethylation expression and in cancer expression samples in cancer (p < 0.05). samples There (p were < 0.05). 31genes There selected were 31 by genes “Methylmix,” selected by and“Methylmix,” 29 genes overlapped and 29 genes with overlapped those corresponding with those tocorresponding 100 associations to 100 based associations on abs (B).based These on genes abs (B). were These RPP25, genes PLXNC1, were RPP25, BOP1, PLXNC1, HOXC13, BOP1, BST2, HOXC13, TMEM26, BST2, ARHGAP20, TMEM26, TONSL,ARHGAP20, CLEC7A, TONSL, STC2, HOXA13,CLEC7A, MCMDC2, STC2, HOXA13, DNAH14, MCMDC2, PRELID1P1, DNAH14, HOXA11, PRELID1P1, "DPY19L1, CDKN2A,HOXA11, GPR84, "DPY19L1, ZNF525, CDKN2A, FAM24B, GPR84, ZFPM2-AS1, ZNF525, RANBP17, FAM24B, C3AR1, ZFPM2-AS1, SAC3D1, RPS6KA6,RANBP17, PIWIL1,C3AR1, CCNI2, SAC3D1, BZW2, RPS6KA6, and PALB2. PIWIL1, They CCNI2, were the BZW2, important and PALB2. methylation-driven They were genesthe im- associatedportant methylation-driven with GC whose corresponding genes associated methylation with GC sites whose might corresponding affect gene expressionmethylation levels.sites might The other affect two gene genes expression HOXA10-AS levels and. The HOXA11-AS, other two genes which HOXA10-AS were not recognized and HOXA11- by ourAS, model which had were been not demonstrated recognized by do our not model have had obvious been differentialdemonstrated methylation do not have levels obvious be- tweendifferential tumor andmethylation normal samples levels between in Figure tumor5. Figure and6 shows normal the samples GO and in KEGG Figure functional 5. Figure 6 itemsshows of thesethe GO genes and with KEGG different functional log FCitems values, of these which genes were with denoted different as upregulated log FC values, if logwhich FC > were 0, otherwise, denoted as they upregulated were denoted if log as FC downregulated. > 0, otherwise, they Here, were we defineddenoted upregu-as down- latedregulated. genes as Here, having we higher defined methylation upregulated levels gene ins cancer as having subtypes higher and methylation lower methylation levels in levelscancer in subtypes normal subtypes, and lower otherwise, methylation they levels were in downregulated normal subtypes, genes. otherwise, Figure5 they shows were thedownregulated methylation and genes. gene Figure expression 5 shows level the heatmapmethylation results and ofgene the expression 31 methylation-driven level heatmap genes,results where of the the 31genes methylation-driven HOXA10-AS and gene HOXA11-ASs, where the dogenes not HOXA10-AS have obvious and differential HOXA11- methylationAS do not have levels obvious between differential tumor and methylation normal samples. levels Inbetween the Supplementary tumor and normal Informa- sam- tion,ples. we In also the provideSupplementary the correlation Information, relationship we also resultsprovide between the correlation the 29 GC-associated relationship re- methylation-drivensults between the 29 gene GC-associated expression levels methylatio and methylationn-driven gene levels expression Figures S1–S29.levels and meth- ylation levels Figure S1-S29.

Genes 2021, 12, 854 9 of 18 Genes 2021, 12, x FOR PEERGenes REVIEW 2021, 12 , x FOR PEER REVIEW 10 of 19 10 of 19

(A) methylation levels (A) methylation levels

(B) gene expression levels(B) gene expression levels

Figure 5. Heatmap ofFigure methylationFigure 5. 5.Heatmap Heatmap and expression of of methylation methylation levels and basedand expression expression on these levels31 levels driven based based genes. on on these these 31 31 driven driven genes. genes.

Figure 6. GO items of the driven genes. Figure 6. GO items of the driven genes. Figure 6. GO items of the driven genes. 3.5. Methylation Markers Associated with Prognosis of GC 3.5. Methylation Markers Associated with Prognosis of GC

Genes 2021, 12, x FOR PEER REVIEW 11 of 19 Genes 2021, 12, 854 10 of 18

For patients with GC, one important problem is poor prognosis because of different 3.5. Methylation Markers Associated with Prognosis of GC GC subtypes, with complex genetic heterogeneity. In this study, we aimed to identify spe- cific biomarkersFor that patients could with help GC, clinicians one important to better problem understand is poor prognosisthe mechanism because of of GC. different We GC subtypes, with complex genetic heterogeneity. In this study, we aimed to identify merged the clinical data with the mRNA expression profiles and DNA methylation pro- specific biomarkers that could help clinicians to better understand the mechanism of GC. files of theseWemerged 29 methylation-driven the clinical data with genes the associated mRNA expression with GC. profiles A prognostic and DNA assessment methylation model wasprofiles established, of these and 29 methylation-driven the Cox risk regr genesession associated model was with used GC. to A prognosticidentify independ- assessment ent prognosticmodel wasmethylation established, sites, and theimplement Cox risk regressioned by using model the was R package used to identify “survival” independent [21]. First, usingprognostic univariate methylation Cox regression sites, implemented analysis, we by obtained using the R17 package methylation “survival” sites [ 21related]. First, to patientusing prognosis univariate (p < Cox0.05). regression Then, a multivariate analysis, we obtainedCox regression 17 methylation was used sites to relatedanalyze to these sites.patient Five prognosismethylation (p < CpG 0.05). sites Then, were a multivariate observed, Cox and regression the risk was score used was to analyzecomputed. these The genessites. that Five corresponded methylation CpGto these sites CpG were observed,sites were and cg01531665 the risk score (NOL6), was computed. cg01201519 The (FEZF1), genescg00484488 that corresponded (ADPGK), to thesecg00730266 CpG sites (PPP1R14A), were cg01531665 and (NOL6), cg00333849 cg01201519 (KLHL35), (FEZF1), cg00484488 (ADPGK), cg00730266 (PPP1R14A), and cg00333849 (KLHL35), which were which were prognosis-related sites. Figure 7A shows the methylation heatmap of tumor prognosis-related sites. Figure7A shows the methylation heatmap of tumor samples based samples basedon these on five these genes five and genes Figure and7B Figure shows 7B the shows heatmap the of heatmap risk scores of basedrisk scores on these based five on these fivegenes genes for all for tumor all tumor samples samples and the and grouping the grouping results results of samples of samples with low with or high low risk.or high risk.Next, Next, according according to theto riskthe score,risk score, we grouped we grouped the patients the patients into high- into and low-riskhigh- and groups low- if risk groupstheir if risktheir scores risk were scores higher were than higher the median than ofthe all median risk scores. of Theall risk assessmentscores. The showed risk assessmentthat showed high-risk that patients high-risk had patients a poorer had prognosis a poorer than prognosis low-risk patients. than low-risk The methylation patients. The methylationlevels of levels the selected of the sites selected in the sites established in the modelestablished decreased model as thedecreased risk scores as increased.the risk scores increased.Figure7C Figure shows 7C the shows survival the curve survival with curve different with risk different subgroups. risk Figuresubgroups.7D shows Figure the hazard ratio of the risk score to other multi-covariates, such as age, gender, grade, and 7D shows the hazard ratio of the risk score to other multi-covariates, such as age, gender, cancer stage. According to this hazard ratio, the risk score based on these methylation CpG grade, andsites cancer could stage. be regarded According as a prognosticto this hazard factor. ratio, the risk score based on these meth- ylation CpG sites could be regarded as a prognostic factor. (A)

(B)

Figure 7. Cont.

(C) (D)

Genes 2021 12 Genes ,2021, 854,Genes 12, x 2021FOR, PEER12, x FOR REVIEW PEER REVIEW 1211 of of12 1819of 19

(C) (D)

Figure 7. (A) Heatmap of methylation levels of samples based on five prognosis genes; (B) heatmap of risk Score based on Figure 7. (FigureA) Heatmap 7. (A) Heatmap of methylation of methylation levels of levels samples of samples based onbased five on prognosis five prognosis genes; genes; (B) heatmap (B) heatmap of risk of Scorerisk Score based based on on five prognosis genes; (C) survival curves; (D) hazard ratio. five prognosisfive prognosis genes; (C )genes; survival (C) survival curves; (curves;D) hazard (D) ratio.hazard ratio. 3.6. Exploration of the Subtype-Associated GpG Sites 3.6. Exploration3.6. Exploration of the Subtype-Associated of the Subtype-Associated GpG SitesGpG Sites On the Onone the hand, one thehand, GC the molecular GC molecular subtype subtype seemed seemed to influence to influence the prognosesthe prognoses of the of the On the one hand, the GC molecular subtype seemed to influence the prognoses of the patientspatients in this study,in this study,while, while,on the on other the otherhand, hand, the expression the expression hete rogeneityheterogeneity (EH) (EH) in our in our patients in this study, while, on the other hand, the expression heterogeneity (EH) in our model indicatedmodel indicated the different the different GC subtypes GC subtypes caused ca usedby different by different methylation methylation sites sites and and epi- epi- model indicated the different GC subtypes caused by different methylation sites and epige- genetic geneticlevels. levels.In order In toorder explore to explore the subt theype-associated subtype-associated CpG CpG sites site baseds based on classifica-on classifica- netic levels. In order to explore the subtype-associated CpG sites based on classifications tions oftions tumor of samples,tumor samples, we used we the used low-rank the low-rank matrix matrix L as classificationL as classification features features to obtain to obtain of tumor samples, we used the low-rank matrix L as classification features to obtain the the samples’the samples’ classifications. classifications. Because Because L is low L is rank, low rank, variables variables sharing sharing the samethe same rank rank could could samples’ classifications. Because L is low rank, variables sharing the same rank could be re- be regardedbe regarded as being as inbeing one in group one group [22]. In[22]. our In algorithm, our algorithm, when when optimal optimal problems problems con- con- garded as being in one group [22]. In our algorithm, when optimal problems converge, the verge, theverge, rank the of rank matrix of matrix L is seven. L is seven. Therefore, Therefore, we grouped we grouped GC samples GC samples into into seven seven clusters clusters rank of matrix L is seven. Therefore, we grouped GC samples into seven clusters according accordingaccording to their to ranks. their ranks.A heatmap A heatmap of the of methylation the methylation levels levels of GC of subtypesGC subtypes is shown is shown in in to their ranks. A heatmap of the methylation levels of GC subtypes is shown in Figure8A . Figure 8A.Figure The 8A. boxplot The boxplot of methylation of methylation levels levels based based on these on these seven seven clusters clusters is shown is shown in in The boxplot of methylation levels based on these seven clusters is shown in Figure8B. Figure 8B.Figure According 8B. According to the clusteringto the clustering results results obtained obtained by matrix by matrix L, we L ,evaluated we evaluated all meth- all meth- Accordingylation to the sites clustering for different results clusters, obtained finding by matrix eight Lsubtype-associated, we evaluated all CpG methylation sites which sites had ylation sites for different clusters, finding eight subtype-associated CpG sites which had for differentdifferential clusters, methylation finding eightlevels subtype-associated in more than two clusters CpG siteswith whichp-values had

(A) (A)

Figure 8. Cont.

Genes 2021, 12, x FOR PEER REVIEW 13 of 19 Genes 2021, 12, x FOR PEER REVIEW 13 of 19

Genes 2021, 12, 854 12 of 18

(B)

Figure 8. (A) Heatmap of methylation(B) sites in different gastric cancer subtypes grouped by matrix L. The heat rectangle on the left map is red if there is a statistically significant difference between Figure 8. 8.( A(A) Heatmap) Heatmap of methylationof methylation sites insites different in different gastric cancergastric subtypes cancer groupedsubtypes by grouped matrix by matrix the selected cluster and the other clusters; otherwise, it is blue. Sig, significance, p < 0.05; No, no L. TheThe heatheat rectangle rectangle on on the the left left map map is red is if red there if isth aere statistically is a statistically significant significant difference difference between between significance. (B) The box-plot of methylation levels of these seven clusters. the selectedselected cluster cluster and and the the other other clusters; clusters; otherwise, otherwise, it is blue. it is Sig, blue. significance, Sig, significance,p < 0.05; No,p < no0.05; No, no significance.significance. (B ()B The) The box-plot box-plot of methylation of methylation levels oflevels these of seven these clusters. seven clusters. 3.7. Association between the Eight Subtype-Associated CpG Sites and Prognosis 3.7. Association between the Eight Subtype-Associated CpG Sites and Prognosis 3.7. AssociationWe constructed between the the prognosisEight Subt survivalype-Associated curves CpG for all Sites the andGC Prognosissamples grouped by hy- permethylationWe constructed with the prognosis low expression survival and curves hypo formethylation all the GC samples with high grouped expression by of the hypermethylation with low expression and hypomethylation with high expression of the eightWe subtype-associated constructed the prognosis CpG sites survival (Figure curv 9). esIn forthis all figure, the GC the samples Kaplan–Meier grouped curves by hy- eight subtype-associated CpG sites (Figure9). In this figure, the Kaplan–Meier curves show permethylationshow that the prognosticwith low expressionresults differ anded significantlyhypomethylation between with the hightwo groups. expression of the thateight the subtype-associated prognostic results differed CpG significantly sites (Figure between 9). In the this two figure, groups. the Kaplan–Meier curves

show that the prognostic results differed significantly between the two groups.

Figure 9. Cont .

GenesGenes2021 2021, 12,, 854 12, x FOR PEER REVIEW 13 of 18 14 of 19

FigureFigure 9. 9.Prognosis Prognosis survival survival curves curves of eight of eight different different expressed expressed methylation methylation sites. sites.

3.8.3.8. Identifying Identifying Four Four Subtype-Specificity Subtype-Specificity Prognosis-Related Prognosis-Related Sites Sites Finally,Finally, the the intersection intersection analysis analysis between between the five the prognosis-related five prognosis-related sites and sites the and the eighteight subtype-associated subtype-associated sites sites was was carried carried out, and out, four and factors, four factors, i.e., cg01531665 i.e., cg01531665 (NOL6), (NOL6), cg01201519cg01201519 (FEZF1), (FEZF1), cg02422011 cg02422011 (PPP1R14A), (PPP1R14A), and cg00388897 and cg00388897 (KLHL35) (KLHL35) were found were to found to bebe overlapped. overlapped. These These four four CpG CpG sites sites were were prognosis-related prognosis-related sites that sites influenced that influenced gene gene expression levels and also were related to GC subtypes, denoted as subtype-specificity expression levels and also were related to GC subtypes, denoted as subtype-specificity prognosis-related sites. Gene ontology (GO) and the KEGG items of genes according toprognosis-related these four CpG sites sites. are Gene shown ontology in Table 1(GO). Then, and we the used KEGG an onlineitems analysisof genes tool according to these four CpG sites are shown in Table 1. Then, we used an online analysis tool “MEX- PRESS” (https://mexpress.be/) to verify the correlations among the expression levels of these four genes and DNA methylations. Figure 10A–D show the results of different genes. On the right of each subfigure, the r values are the Pearson correlation coefficients

Genes 2021, 12, 854 14 of 18

Genes 2021, 12, x FOR PEER REVIEW 15 of 19

“MEXPRESS” (https://mexpress.be/, aceessed on 2 June 2021) to verify the correlations among the expression levels of these four genes and DNA methylations. Figure 10A–D showbetween the one results specific of different gene expression genes. On level the and right DNA of each methylation subfigure, levels the of r valuesdifferent are meth- the Pearsonylation sites correlation. Meanwhile, coefficients each subfigure between also one shows specific the gene relationships expression of other level factors and DNA such methylationas sample type, levels tumor of different stage, and methylation gender. sites. Meanwhile, each subfigure also shows the relationships of other factors such as sample type, tumor stage, and gender. Table 1. GO and KEGG items of NOL6, FEZF1, PPP1R14A and KLHL35. Table 1. GO and KEGG items of NOL6, FEZF1, PPP1R14A and KLHL35. Gene Name GO KEGG negative regulation of transcription Gene NameFEZF1 GO KEGGNone negative regulationfrom RNA of polymerase transcription II from promoter RNA FEZF1 None KIHL35 polymeraseprotein II promoter binding None rRNA processing, tRNA export from KIHL35NOL6 bindingRibosome biogenesis None in eukaryotes nucleus NOL6 rRNA processing, tRNA export from nucleus Ribosome biogenesis in eukaryotes regulation of protein dephosphoryla- regulation of protein dephosphorylation, cellular PPP1R14A Vascular smooth muscle contraction PPP1R14A responsetion, to drug, cellular phosphatase response inhibitor to drug, activity phos- Vascular smooth muscle contraction phatase inhibitor activity

A: Gene NOL6 B: Gene: FEZF1

Figure 10. Cont. C: Gene PPP1R14A D: Gene KLHL35

Genes 2021, 12, x FOR PEER REVIEW 16 of 19

Genes 2021, 12, 854 15 of 18

C: Gene PPP1R14A D: Gene KLHL35

FigureFigure 10. 10. TheThe results results of of correlation correlation betweenbetween various various factors factors and and gene gene expression, expression, especially especially the relationship the relationship of gene of gene methylationmethylation levels andand expression expression level. level. (A )(A) Gene Gene NOL6 NOL6 (B) Gene: (B) Gene: FEZF1 FEZF1 (C) Gene (C) PPP1R14A Gene PPP1R14A (D) Gene (D) KLHL35. Gene KLHL35

3.9.3.9. Determining Determining the the Influential Influential Power Power of the of the Subtype-Specificity Subtype-Specificity Prognosis-Related Prognosis-Related Sites Sites on on ExpressionExpression Levels—Important Levels—Important Prognostic Prognostic Markers Markers and an Regulationd Regulation of Gene of Gene Expression Expression Factors Factors. It canIt can be observedbe observed that that there there is one is one position position of the of the gene gene NOL6 NOL6 that that had had a significantly a significantly positivepositive correlation correlation with with its expressionits expression (Figure (Figure 10A), 10A), 26 positions26 positions of the of the gene gene FEZF1 FEZF1 CpG CpG sitesite have have significantly significantly negative negative correlations correlations with with its expression,its expression, while while 12 CpG12 CpG sites sites have have significantlysignificantly positive positive correlations correlations with with its expression.its expression. Actually, Actually, it has it has been been demonstrated demonstrated thatthat the the gene gene FEZF1-AS1 FEZF1-AS1 (FEZF1 (FEZF1 antisense antisense RNA) RNA) can can epigenetically epigenetically repress repress the the expression expression of P21of P21 which which is a is demethylase a demethylase [23]. [23]. Studies Studies have have found found that thethat gene the FEZF1-AS1gene FEZF1-AS1 can act can as act an “oncogene” for gastric cancer partly through suppressing P21 expression and may serve as an “oncogene” for gastric cancer partly through suppressing P21 expression and may as a candidate prognostic biomarker for new therapies of gastric cancer patients. Simul- serve as a candidate prognostic biomarker for new therapies of gastric cancer patients. taneously, there were 14 positions of the gene PPP1R14A CpG site that had significantly Simultaneously, there were 14 positions of the gene PPP1R14A CpG site that had signifi- negative correlations with its expression, while there were 12 positions of the gene KLHL35 cantly negative correlations with its expression, while there were 12 positions of the gene CpG site that had significantly negative correlations with its expression. In addition, the KLHL35 CpG site that had significantly negative correlations with its expression. In addi- gene PPP1R14A, which is regulated by promoter region methylation, has been proven to tion, the gene PPP1R14A, which is regulated by promoter region methylation, has been play a key role in the initiation and progression of gastric cancer, colorectal cancer, and proven to play a key role in the initiation and progression of gastric cancer, colorectal lymphomas [24–26]. At the same time, hypermethylation of the gene KLHL35 has also been cancer, and lymphomas [24–26]. At the same time, hypermethylation of the gene KLHL35 demonstrated to be associated with the development of hepatocellular carcinoma [27,28]. has also been demonstrated to be associated with the development of hepatocellular car- These results demonstrate the predictive accuracy of our model, which can identify the significantcinoma CpG[27,28]. sites These that result haves an demonstrate influence on the gene predictive expression, accuracy meanwhile, of our model, exploring which prognosis-relatedcan identify the sites significant associated CpG with sites GC that subtypes. have an These influence results on demonstrated gene expression, that themean- fourwhile, CpG sitesexploring were notprognosis-related only associated withsites theassociated GC subtypes with andGC patients subtypes. prognosis, These butresults alsodemonstrated had significant that correlation the four CpG with sites the were gene not expression only associated levels. Itwith means the GC that subtypes they had and potential to be important prognostic markers and regulate of gene expression factors.

Genes 2021, 12, 854 16 of 18

4. Discussion Epigenetic changes, such as DNA methylation, are closely associated with the devel- opment and malignant transformation of GC. DNA methylation changes accumulated as oncogenesis may result in different GC subtypes. Therefore, identification of specific DNA methylation sites which associated with gene expression and prognosis of tumor patients, might aid in the development of personalized treatment plans. In this study, we introduced a linear regression model to prioritize the relationship of disease-associated methylation sites and gene expression levels in different sample groups, and then provided an epigenetic explanation of tumor heterogeneity. Our model made use of matrix decomposition theory by introducing a sparse re- gression to account for DNA methylation mapping and a low-rank matrix to contain the expression heterogeneity (EH), resulting in one sparse coefficient matrix B and one low- rank matrix L. According to the top abs (B), we obtained the specific DNA methylation sites that were related to GC related to different gene expression levels. Because our model does not perform statistical significance tests, we were not able to report our results based on statistical significance. Alternatively, we were more interested in the top signals. Thus, we only showed the top 1000 associations based on the absolute value of B in Figure2 and selected the top 100 methylation sites for subsequent research. Concurrently, we used the low-rank matrix L to obtain seven tumor sample clusters, from which we selected eight subtype-associated CpG sites. In this article, our study firstly identified 29 Gastric Cancer-associated methylation- driven genes, whose corresponding methylation sites might affect gene expression levels, based on the top 100 abs (B). Then we used the Cox risk regression model to obtain five important prognosis-associated methylation sites that can be regarded as prognostic factors: cg01531665 (NOL6), cg01201519 (FEZF1), cg00484488 (ADPGK), cg00730266 (PPP1R14A), and cg00333849 (KLHL35). By using them to compute the risk score to predict the survival curves, we classified patients into high-risk and low-risk groups, in which patients with high-risk scores were corresponding to poor prognosis, and on thecontrary, low risk corresponded to relatively good prognosis. Then according to the low-rank matrix L, our model explored seven Gastric Cancer subtypes, by which we can evaluated all methylation CpG sites with the methylation levels to find the differentially methylated sites among different GC tumor groups. In this way, we identified eight subtype-associated CpG sites: cg01201519 (FEZF1), cg00388897 (KLHL35), cg01531665 (NOL6), cg02422011 (PPP1R14A), cg00563678 (RNF150), cg01192487 (SAMD12), cg00328900 (VCAN), and cg00112309 (ZFPM2). Kaplan-Meier curves demonstrated that these sites also had relationships with prognosis. Therefore, an intersection analysis be- tween the five prognosis-associated methylation sites and the eight subtype-associated sites was conducted, resulting in four sites: cg01531665 (NOL6), cg01201519 (FEZF1), cg02422011 (PPP1R14A), and cg00388897 (KLHL35). The downstream analysis demonstrated that they were not only associated with the GC subtypes and patients prognosis, but also had a sig- nificant correlation with the gene expression levels, which indicated that they had potential to be important prognostic markers and regulate of gene expression factors. Despite the advantages of our approach, the limitation is that we did not provide a rigorous statistical significance test of the estimated coefficient matrix B. Researchers can rank associations and select those that they are interested in.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/genes12060854/s1, Figures S1–S29: The correlation relationship results of gene methylation levels and expression levels. They also contains DNA methylation data matrix, mRNA data matrix in txt format text preprocessed by us and the clinical data download link from TCGA. Author Contributions: Y.W. and D.A. conceived and designed the sparse and low-rank regression model; L.X. implemented the simulated study and real dataset analysis; Y.W. wrote the codes for experiments; Y.W. and D.A. wrote the whole manuscript. All authors have participated sufficiently Genes 2021, 12, 854 17 of 18

in the work to take responsibility for it. All authors have reviewed the final manuscript and approve it for publication. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the National Natural Science Foundation of China (no. 3161194), and National Science Foundation of China (no. 61873027). Institutional Review Board Statement: Not applicable. Informed Consent Statement: Not applicable. Data Availability Statement: All the data in our manuscript are available in The Cancer Genome Atlas repository (http://portal.gdc.cancer.gov/, aceessed on 2 June 2021). We also have uploaded them with the supplementary. Acknowledgments: This study was supported by an NNSF grant 3161194 and an NSF grant 61873027. We also thank the future editor and anonymous reviewers for their comments on our manuscript. Conflicts of Interest: The authors declare that they have no conflict of interest.

References 1. Van Cutsem, E.; Sagaert, X.; Topal, B.; Haustermans, K.; Prenen, H. Gastric cancer. Lancet 2016, 388, 2654–2664. [CrossRef] 2. Bornschein, J.; Selgrad, M.; Warnecke, M.; Kuester, D.; Wex, T.; Malfertheiner, P.H. Pylori infection is a key risk factor for proximal gastric cancer. Dig. Dis. Sci. 2010, 55, 3124–3131. [CrossRef][PubMed] 3. Ebrahimi, V.; Soleimanian, A.; Ebrahimi, T.; Azargun, R.; Yazdani, P.; Eyvazi, S.; Tarhriz, V. Epigenetic modifications in gastric cancer: Focus on DNA methylation. Gene 2020, 742, 144577. [CrossRef][PubMed] 4. Park, S.Y.; G¨onen, M.; Kim, H.J.; Michor, F.; Polyak, K. Cellular and genetic diversity in the progression of in situ human breast carcinomas to an invasive phenotype. J. Clin. Investig. 2010, 120, 636–644. [CrossRef][PubMed] 5. Jaenisch, R.; Bird, A. Epigenetic regulation of gene expression: How the genome integrates intrinsic and environmental signals. Nat Genet. 2003, 33, 245–254. [CrossRef][PubMed] 6. Pfeifer, G.P. Defining driver DNA methylation changes in human cancer. Int. J. Mol. Sci. 2018, 19, 1166. [CrossRef] 7. Gujar, H.; Weisenberger, D.; Liang, G. The Roles of Human DNA Methyltransferases and Their Isoforms in Shaping the Epigenome. Genes 2019, 10, 172. [CrossRef] 8. Luo, C.; Hajkova, P.; Ecker, J.R. Dynamic DNA methylation: In the right place at the right time. Science 2018, 361, 1336–1340. [CrossRef] 9. Lea, A.J.; Vockley, C.M.; A Johnston, R.; A Del Carpio, C.; Barreiro, L.B.; E Reddy, T.; Tung, J. Genome-wide quantification of the effects of DNA methylation on human gene regulation. Dig. Dis. Sci. 2018, 7, 156. [CrossRef] 10. Liang, C.; Yu, X.; Li, B.; Chen, Y.A.; Conejo-Garcia, J.R.; Wang, X. DNA methylation-based immune cell deconvolution in solid tumors. bioRxiv 2019, 619965. [CrossRef] 11. Alter, O.; Brown, P.O.; Botstein, D. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. USA 2000, 97, 10101–10106. [CrossRef][PubMed] 12. Gibson, G. The environmental contribution to gene expression profiles. Nat. Rev. Genet. 2008, 9, 575–581. [CrossRef] 13. Leek, J.T.; Scharpf, R.B.; Bravo, H.C.; Simcha, D.; Langmead, B.; Johnson, W.E.; Geman, D.; Baggerly, K.; Irizarry, R.A. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010, 11, 733–739. [CrossRef] 14. Searle, S.R. An overview of variance component estimation. Metrika 1995, 42, 215–230. [CrossRef] 15. Kang, H.M.; Ye, C.; Eskin, E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics 2008, 180, 1909–1925. [CrossRef][PubMed] 16. Recht, B.; Fazel, M.; Parrilo, P.A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 2010, 52, 471–501. [CrossRef] 17. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [CrossRef] 18. Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.-Y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010, 42, 348–354. [CrossRef] 19. Tibshirani, R.; Hastie, T.; Friedman, J.H. Regularized Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22. 20. Olivier, G. MethylMix: An R package for identifying DNA methylation-driven genes. Bioinformatics 2015, 31, 1839–1841. 21. Therneau, T.M. Survival Analysis [Rpackage survival version 2.41-3]. Technometrics 2015, 46, 111–112. 22. Hansen, P.C.; Sekii, T.; Shibahashi, H. The Modified Truncated SVD Method for Regularization in General Form. SIAM J. Sci. Stat. Comput. 1992, 13, 1142–1150. [CrossRef] 23. Liu, Y.W.; Xia, R.; Lu, K.; Xie, M.; Yang, F.; Sun, M.; De, W.; Wang, C.; Ji, G. LincRNAFEZF1-AS1 represses p21 expression to promote gastric cancer proliferation through LSD1-Mediated H3K4me2 demethylation. Mol. Cancer 2017, 16, 39. [CrossRef] 24. Peng, Y.; Wu, Q.; Wang, L.; Wang, H.; Yin, F. A DNA methylation signature to improve survival prediction of gastric cancer. Clin. Epigenet. 2020, 12, 1–16. [CrossRef] Genes 2021, 12, 854 18 of 18

25. Li, D.; Guo, J.; Wang, S.; Zhu, L.; Shen, Z. Identification of novel methylated targets in colorectal cancer by microarray analysis and construction of coexpression network. Oncol. Lett. 2017, 14, 2643–2648. [CrossRef] 26. Bethge, N.; Honne, H.; Hilden, V.; Trøen, G.; Eknæs, M.; Liestøl, K.; Holte, H.; Delabie, J.; Smeland, E.B.; Lind, G.E. Lind Identification of Highly Methylated Genes across Various Types of B-Cell Non-Hodgkin Lymphoma. PLoS ONE 2013, 8, e79602. [CrossRef][PubMed] 27. Cheng, Q.; Zhao, B.; Huang, Z.; Su, Y.; Chen, B.; Yang, S.; Peng, X.; Ma, Q.; Yu, X.; Zhao, B.; et al. Epigenome-wide study for the offspring exposed to maternal HBV infection during pregnancy, a pilot study. Gene 2018, 658, 76–85. [CrossRef][PubMed] 28. Morris, M.R.; Ricketts, C.J.; Gentle, D.; McRonald, F.; Carli, N.; Khalili, H.; Brown, M.; Kishida, T.; Yao, M.; Banks, R.E.; et al. Genome-wide methylation analysis identifies epigenetically inactivated candidate tumour suppressor genes in renal cell carci- noma. Oncogene 2011, 30, 1390–1401. [CrossRef][PubMed]