<<

1 An integrated epigenomic–transcriptomic landscape of lung cancer reveals novel 2 methylation driver of diagnostic and therapeutic relevance 3 4 Xiwei Sun1#, Jiani Yi 1#, Juze Yang1#, Yi Han1, Xinyi Qian2, Yi Liu1, Jia Li1, Bingjian Lu2, 5 Jisong Zhang1, Xiaoqing Pan3, Yong Liu4, Mingyu Liang4, Enguo Chen1,5*, Pengyuan 6 Liu1,4,5*, Yan Lu2,5*

7 8 1Sir Run Run Shaw Hospital and Institute of Translational Medicine, Zhejiang University 9 School of Medicine, Hangzhou, Zhejiang 310016, China.

10 2Center for Uterine Cancer Diagnosis & Therapy Research of Zhejiang Province, Women's 11 Reproductive Health Key Laboratory of Zhejiang Province, Department of Gynecologic 12 Oncology, Women’s Hospital and Institute of Translational Medicine, School of Medicine, 13 Zhejiang University, Hangzhou, Zhejiang 310006, China.

14 3Department of Mathematics, Zhejiang University, Hangzhou, Zhejiang 310027, China.

15 4Center of Systems Molecular Medicine, Department of Physiology, Medical College of 16 Wisconsin, Milwaukee, WI 53226, USA

17 5Cancer center, Zhejiang University, Hangzhou, Zhejiang 310013, China.

18

19 # XS, JNY, and JZY equally.

20 Running title: Integrative epigenomic–transcriptomic landscape of lung cancer

21 Key words: DNA methylation; driver genes; epigenomics; lncRNA, lung cancer; miRNA,

22 reduced representation bisulfite sequencing; transcriptomics; transcription factor

23 *Correspondence: [email protected], [email protected] or [email protected]

1

24 Abstract

25 Background: Aberrant DNA methylation occurs commonly during carcinogenesis and is of 26 clinical value in cancers. However, knowledge of the impact of DNA methylation 27 changes on lung carcinogenesis and progression remains limited.

28 Methods: Genome-wide DNA methylation profiles were surveyed in 18 pairs of tumors and 29 adjacent normal tissues from non-small cell lung cancer (NSCLC) patients using Reduced 30 Representation Bisulfite Sequencing (RRBS). An integrated epigenomic–transcriptomic 31 landscape of lung cancer was depicted using the multi-omics data integration method.

32 Results: We discovered a number of hypermethylation events pre-marked by poised 33 promoter in embryonic stem cells, being a hallmark of lung cancer. These hypermethylation 34 events showed a high conservation across cancer types. Eight novel driver genes with 35 aberrant methylation (e.g., PCDH17 and IRX1) were identified by integrated analysis of 36 DNA methylation and transcriptomic data. Methylation level of the eight genes measured by 37 pyrosequencing can distinguish NSCLC patients from lung tissues with high sensitivity and 38 specificity in an independent cohort. Their tumor-suppressive roles were further 39 experimentally validated in lung cancer cells, which depend on promoter hypermethylation. 40 Similarly, 13 methylation-driven ncRNAs (including 8 lncRNAs and 5 miRNAs) were 41 identified, some of which were co-regulated with their host genes by the same promoter 42 hypermethylation. Finally, by analyzing the transcription factor (TF) binding motifs, we 43 uncovered sets of TFs driving the expression of epigenetically regulated genes and 44 highlighted the epigenetic regulation of expression of TCF21 through DNA methylation 45 of EGR1 binding motifs.

46 Conclusions: We discovered several novel methylation driver genes of diagnostic and 47 therapeutic relevance in lung cancer. Our findings revealed that DNA methylation in TF 48 binding motifs regulates target by affecting the binding ability of TFs. Our 49 study also provides a valuable epigenetic resource for identifying DNA methylation-based 50 diagnostic biomarkers, developing cancer drugs for epigenetic therapy and studying cancer 51 pathogenesis.

2

52

3

53 Introduction

54 Lung cancer is the leading cause of cancer mortality worldwide and non-small cell lung 55 cancers (NSCLCs) account for about 85% of lung cancers [1, 2]. Although there are new 56 agents for the treatment of NSCLC patients, the 5-year survival rate is still estimated at 15% 57 [1, 3, 4]. To reduce the high lethality of NSCLCs, many significant efforts have been made. 58 Previous studies have reported causal genetic alterations, from somatic point mutations to 59 large structure variations, involved in the carcinogenesis of NSCLCs [5-9]. Based on these 60 findings, various therapeutic strategies have been developed for NSCLC treatment, such as 61 combination cisplatin-based chemotherapy with the anti-angiogenic bevacizumab [10, 11], 62 the use of tyrosine kinase inhibitors to treat EGFR-mutated, ALK or ROS1-rearranged 63 NSCLC patients [12-15]. However, owing to the primary, adaptive and acquired resistance of 64 these drugs, the target treatment is not effective for all NSCLC patients [16]. DNA 65 methylation alterations commonly occur in cancer and are involved in tumor initiation and 66 progression. Aberrant promoter CpG methylation provides a selective mechanism for the 67 regulation of tumor suppressor genes and oncogenes in cancer instead of genetic mutations 68 [17]. However, these methylation changes could be inherently reversed by agents, in contrast 69 with genetic mutations [18]. Thus, identification of epigenetically modulated genes opens 70 new avenues for epigenetic therapies and for discovery of novel drug targets.

71 In NSCLC, CDKN2A was the first reported gene whose downregulated expression in 72 lung carcinogenesis was predominantly attributed to promoter hypermethylation [19]. 73 Afterwards, several studies described several other epigenetically regulated genes, such as 74 FHIT, DAPK and RASSF1A [20-26]. However, these earlier studies only investigated a 75 single gene or a small list of genes. The advent of high throughput next-generation 76 sequencing (NGS) for analyzing DNA methylome and transcriptome has offered the unique 77 ability to analyze methylation changes and detect methylation driver events at a genome-wide 78 scale [27-29].

79 In this study, we analyzed genome-wide methylation profiles of ~ 2 million CpG sites 80 spanning more than 19,600 genes and noncoding regions using Reduced Representation 81 Bisulfite Sequencing (RRBS) technology [30]. RRBS combines restriction enzymes and 82 bisulfite sequencing to enrich for DNA segments with a high CpG content and regulatory 83 potentials. It is an efficient and cost-effective technique for analyzing the genome-wide 84 methylation profiles on a single nucleotide level. Using this technique and the multi-omics 85 data integration method, we pinpointed several novel methylation driver coding genes 86 and noncoding RNAs that could be potential targets for epigenetic therapy. Furthermore, we 87 detected sets of transcription factor (TF) binding motifs located in differentially methylated 88 regions (DMRs) which regulated target gene expression by affecting the binding ability of 89 TFs in lung cancer.

4

90 Materials and Methods

91 Reduced representation bisulfite sequencing (RRBS)

92 The study was approved by the institutional review board of Sir Run Run Shaw Hospital 93 at Zhejiang University (Hangzhou, China). Eighteen lung tumor tissues and adjacent normal 94 tissues were collected from NSCLC patients (Table S1). All samples used in the current study 95 were obtained at the time of diagnosis before any treatment was administered. Genomic DNA 96 of these tissues was extracted and then treated according to the RRBS library preparation 97 protocols as we previously described [30] with modifications to allow multiplexing [31]. 98 Paired-end sequencing with 100bp was performed on the Illumina HiSeq 2000 according to 99 manufacturer’s protocol.

100 Bioinformatics analyses of RRBS data

101 Bisulfite sequencing reads were pre-processed with Trim Galore 102 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Both adapters and 103 sequences with low quality (base quality < 20) were removed before the analysis. The 104 trimmed reads were then aligned to the human reference genome (hg19) and the methylation 105 status of each CpG was determined using Bismark (v0.14.1) with default parameters [32] 106 (Table S2). The unconverted cytosines at fill-in 3′ MspI sites of sequencing reads were used 107 to estimate the bisulfite conversion rate. For each CpG site with at least 5×coverage, the 108 methylation rate, C/(C+T), was calculated. We merged the 18 tumor-normal sample pairs 109 based on CpG coordinates, yielding 2,574,098 CpG sites that covered at least ten paired 110 samples. CpG sites in the heterosome and CpG sites overlapped with SNP (dbSNP build 142) 111 were filtered, and 2,166,853 were retained for subsequent analysis. The remaining missing 112 methylation values were imputed using k-nearest neighbors in the CpGs space 113 (http://bioconductor.org/packages/release/bioc/html/impute.html).

114 Next, we used metilene (Version 0.23) [33] to identify DMRs between tumor and 115 matched noncancerous tissues. These DMRs were further examined for significance using the 116 Wilcoxon sum test for a paired dataset. The final DMRs were determined using the 117 following threshold: at least ten CpGs in the DMRs, at least 10% differences in methylation, 118 and false discovery rate (FDR) < 0.05. We performed an unsupervised hierarchical cluster 119 analysis on the CpG sites’ methylation using ward linkage and Euclidean distance. The 120 Metascape software (http://metascape.org) was used to conduct (GO) 121 analyses according to the standard protocol.

122 RNA-seq analysis

123 mRNA sequencing libraries were prepared for the same set of lung tumor and adjacent 124 normal tissues using TruSeq RNA Sample Preparation Kit from Illumina as we described 125 previously with modifications [31] (Supplementary Materials and Methods). 126 Transcriptome data were mapped with Tophat v2, using the spliced mapping algorithm [34]. 127 A set of both known and novel transcripts was constructed and identified using Cufflinks 128 [35]. Reads per kilobase of exon per million fragments mapped (RPKM) was used to quantify 5

129 gene expression level. Finally, differentially expressed genes were obtained by using paired t- 130 test with FDR < 0.05.

131 Identification of lncRNAs

132 Identification of long noncoding RNAs (lncRNAs) followed with the workflow 133 described previously with modifications [36]. In brief, we first filtered transcripts with single 134 exon or length <160 bp. The protein coding potential of the remaining transcripts was 135 evaluated by PhyloCSF according to the genome alignments of chimp, rhesus, mouse, guinea, 136 pig, cow, horse and dog [37]. Transcripts with PhyloCSF scores greater than 50 were 137 removed for their high coding potential. Moreover, we discarded transcripts with complete 138 branch lengths (CBL) >0 and open reading frames (ORF) of >150 amino acids. Transcripts 139 with CBL scores equal to 0 because of low sequence alignments were also removed if they 140 contained an ORF with more than 50 amino acids. Finally, we employed blastx with repeats 141 masked to analyze the remaining transcripts and removed those with a median of the E-value 142 <1e-18. The identified transcripts were classified into different categories according to the 143 HUGO Gene Committee.

144 Small RNA sequencing analysis

145 Small RNA sequencing libraries were prepared with the TruSeq Small RNA Sample 146 Preparation Kit from Illumina as we described previously with modifications [38] 147 (Supplementary Materials and Methods). The raw short reads were trimmed by Trim 148 Galore as described above. The trimmed reads were aligned to the human reference genome 149 (NCBI build 37) using the Burrows-Wheeler Aligner [39]. Only one nucleotide mismatch 150 was allowed in the mapping process. Read counts for miRNAs (miRBase v21) were extracted 151 from alignment files using BEDtools (http://bedtools.readthedocs.io/en/latest). miRNA 152 expression levels were quantified by Reads Per Million (RPM) mapped reads and then 153 normalized with log2(RPM+1).

154 Annotation with genomic features

155 To gain a better insight into the genomic features of differential methylation, the 156 identified DMR sets were intersected with a collection of functional regions, including the 15 157 chromatin states in embryonic stem cells (ESCs), lung-related tissues and cells established by 158 the NIH Roadmap Epigenetics Mapping Consortium 159 (http://egg2.wustl.edu/roadmap/web_portal/). The genomic coordinates of CpG islands 160 (CGIs) based on GRCh37 were downloaded from the UCSC Genome Browser. The protein- 161 coding gene and lncRNA annotation were obtained from GENCODE (v28, GRCh37). The 162 genomic annotation of primary miRNAs were downloaded from miRBase 21 [40]. The 163 promoters were defined as ±2kb of transcription start site (TSS) of genes and noncoding 164 RNAs (ncRNAs). The feature was annotated to the DMR if it was overlapped by at least 50% 165 bases of that DMR. We further defined ESC-marked chromatin states of DMR, which 166 represent that states were recurrently annotated to the DMR in at least half of our available 167 ESCs (n=3). The chromatin landscape of the selected genes was visualized using the WashU 168 EpiGenome Browser tool (http://epigenomegateway.wustl.edu/browser/).

6

169 Enrichment analysis

170 The fold enrichment of overlap for a variety of functional annotations (e.g., CGI and 171 given states) over the set of DMRs was calculated using a previously described method [6]; 172 briefly, as the odds ratio of the joint probability of overlapped regions between the DMR set 173 and a given state and the product of marginal probability of the DMR set and the marginal 174 probability of the state in the whole genome. The fold enrichment is estimated as follows:

푛표⁄푛푔 175 fold enrichment = (푛푚⁄푛푔) × (푛푠⁄푛푔)

176 where 푛표 indicates the number of overlapped bases between the DMR set and the state, 푛푚 177 represents the number of bases in the DMR set, and 푛푔 is the number of bases in the 178 genome.

179 In order to make comparisons across different states, we normalized the enrichments 180 across all 15 states and used a normalized enrichment score (E) for a given state (i): 퐹푖 − 푚푖푛푖=1…15(퐹푖) 181 퐸푖 = 푚푎푥1….15(퐹푖) 182 where 퐹푖 is the fold enrichment for a given state. The normalized enrichment scores ranged 183 from 0 to 1.

184 Data collection and analysis of HM450 methylation and RNA-seq data from the TCGA 185 cohorts and CCLE database

186 The HM450 DNA methylation data of 21 cancer types in The Cancer Genome Atlas 187 (TCGA) were downloaded from the ICGC Data Portal (https://dcc.icgc.org) (Table S3). Beta 188 values ranging from 0 to 1 were used to measure DNA methylation levels. The values close 189 to 0 mean a low level of methylation and 1 represents high-level methylation. CpG sites in 190 the heterosome and overlapped with SNP (dbSNP build 142) were removed. In addition, CpG 191 sites with more than 10% missing values across all samples were discarded. The remaining 192 missing methylation beta values were estimated using k-nearest neighbors in the CpGs space 193 (http://bioconductor.org/packages/release/bioc/html/impute.html) and then used for 194 subsequent analysis.

195 The TCGA mRNA expression data of 21 cancer types were downloaded from the ICGC 196 Data Portal. Then the expression values of protein-coding genes were normalized using 197 RPKM. Low-expressed protein-coding genes with 90th quantile RPKM < 1 were removed 198 and the remaining log2-transformed RPKM values were used for subsequent analysis.

199 HM450 DNA methylation data of 1,028 cancer cell lines were downloaded from the 200 COSMIC database and RNA-seq data of 781 cancer cell lines were from the Encyclopedia of 201 DNA Elements (ENCODE) and Cancer Cell Line Encyclopedia (CCLE) databases (E- 202 MTAB-2770). In total, 455 cancer cells have both HM450 methylation and RNA-seq data.

203 These HM450 probes were then mapped to functional genomic regions based on their 204 genomic coordinates in GRCh37. In this way, we defined: (1) DMR probes, located within 205 the DMRs; and (2) gene probes, located within gene promoter regions (+/- 2kb from TSS). 7

206 Identification of methylation driver sets

207 We used a discovery-validation strategy to pinpoint high-confidence methylation driver 208 genes. The discovery-stage analyses were performed on our RRBS data and matched RNA- 209 seq data. Specifically, we first identified gene and DMR pairs in which the DMR was 210 overlapped with the gene’s promoter. Then, we filtered those gene-DMR pairs where the 211 genes were not differentially expressed in tumors compared with normal tissues (FDR < 0.05 212 and fold change > 2). Finally, we calculated the Spearman correlation coefficients (rho) 213 between methylation changes and gene expression for each of the remaining gene-DMR 214 pairs. For multiple DMRs for each promoter, we chose one DMR that had the most negative 215 association with matched gene expression to capture the most functionally relevant events 216 (epigenetic silencing). In the discovery stage, we obtained a primary gene list, which 217 consisted of 190 protein coding genes with rho values < -0.3 and FDR < 0.05.

218 Next, for the validation-stage analyses on TCGA Human450methylation (HM450) 219 microarray and RNA-seq data from two NSCLC subtypes (lung adenocarcinoma, LUAD; and 220 lung squamous cell carcinomas, LUSC), we found that the promoter of 133 out of the 221 primary gene set was covered by at least one HM450 CpG probe and these gene-probe pairs 222 were used for subsequent analysis. Similarly, the Spearman correlations between CpG probe 223 methylation and gene expression across all samples were calculated for these gene-probes 224 pairs. If multiple CpGs were annotated to one gene promoter, the CpG probe with the most 225 negative correlation was selected. Finally, we screened out a secondary gene set of 81 genes 226 with rho values < -0.3 and FDR < 0.05 in at least one cancer type (LUAD or LUSC). Among 227 them, 31 genes were identified in both LUAD and LUSC, including 20 hypermethylated ones 228 and 11 hypomethylated ones. Hypermethylation in a promoter was a classical paradigm to 229 induce gene silence, thus the list of 20 genes was chosen as our final high-confidence 230 methylation drivers.

231 Similarly, we also identified 13 methylation driver ncRNAs (eight lncRNAs and five 232 miRNAs) in lung cancer using the strategy as described above.

233 Analysis of regulatory transcription factors

234 Genomic coordinates of binding motifs of 84 TF groups were obtained from 235 (http://compbio.mit.edu/encode-motifs/). The 190 DMRs in promoters of genes inversely 236 correlated with gene expression were a set of regions of interest. A hypergeometric test was 237 employed to evaluate whether a particular TF binding motif was more enriched in the set of 238 regions of interest than other TF binding motifs. All the identified DMRs between tumor and 239 normal samples served as a background for this test.

240 In order to identify regulatory TFs, the Spearman rank correlation test was used to 241 calculate the correlation between the expression of TFs corresponding to enriched motifs and 242 the expression of targeted genes (FDR < 0.05).

243 Activity score [41] was defined as y = ρ×log2(fold change), where ρ was the Spearman 244 correlation coefficient between DMR and the targeted gene. The average activity score was 245 calculated for binding motifs of TFs with multiple binding sites in the DMR of targeted gene.

8

246 ROC analysis

247 ROC analysis was performed using a random forest model in the caret R packages [42]. 248 The samples in the TCGA cohorts (LUAD: Tumor 460, Normal 32; LUSC: Tumor 371, 249 Normal 41) were randomly divided into equal-size training and testing sets. Based on the 250 model parameters obtained from the training sets, the testing sets and an independent lung 251 cancer dataset were used to evaluate the model performance. A ROC plot was conducted 252 using pROC R packages (www.r-project.org).

253 Pyrosequencing methylation analysis

254 Genomic DNA from 23 pairs of lung cancer tissues and matched adjacent normal tissues 255 was isolated using a PureLink Genomic DNA Mini Kit (Invitrogen) according to the 256 manufacturer’s protocol. 500ng of DNA underwent bisulfite treatment using the EZ DNA 257 methylation-GoldTM Kit (Zymo Research) according to the manufacturer’s protocol to 258 convert all unmethylated cytosine to uracil while leaving 5-methylcytosine unaltered, and 259 was then eluted in 40 μl of DNase-free water. Primer sequences for methylation specific PCR 260 (MSP) and sequencing primers were designed using the Pyromark Assay Design 2.0 Software 261 (Qiagen) (Table S4). Bio-primer sequences and sequencing primers were oligo-synthesized 262 (Invitrogen, Beijing, China). DNA methylation level of eight DMRs was analyzed by MSP 263 and pyrosequencing. MSP products were observed at 2% agarose gels before pyrosequencing. 264 Pyrosequencing was done following the instructions of the manufacturer of the 265 pyrosequencing device, PyroMark Q24 (Qiagen).

266 The methylation values of all the CpG sites within the DMRs were averaged and paired 267 t-test was used to infer whether the differences between tumor and normal tissues were 268 statistically significant.

269 5-aza-dC treatment

270 Lung cancer cell lines A549, Calu1, H1299 and Hop62 were grown in 1640 medium 271 supplemented with 10% FBS and 1% penicillin-streptomycin (Life Technologies) and 272 incubated in 5% CO2 at 37°C. The first day, cell lines were seeded in a 6-well plate 273 (2×105/well) and the second day were treated with 5-aza-2’-deoxycytidine (Sigma) at 274 different final concentrations for 4 days; culture medium was changed every 2 days with new 275 drugs added. Then, cell pellets were harvested for DNA and RNA extraction. Total RNA was 276 extracted using Trizol (Invitrogen) and then the mRNA expression levels of target genes were 277 evaluated using real-time PCR (Takara, RR420A). Genomic DNA was isolated using the 278 PureLink Genomic DNA Mini Kit (Invitrogen, K1820-02) and then pyrosequencing 279 methylation analysis was performed as described above.

280 In vitro cell-based assays

281 Detailed information regarding in vitro cell-based assays including siRNA knockdown, 282 gene overexpression, cell proliferation and clone formation, and cell migration is provided in 283 Supplementary Materials and Methods and Table S5.

9

284 Results

285 Genome-wide DNA methylation changes in lung cancers

286 To investigate aberrations of DNA methylation in primary lung tumors, RRBS was 287 conducted in eighteen tumors and matched adjacent normal lung tissues, including nine lung 288 LUADs and LUSCs (Table S1). On average, approximately 22 million reads were uniquely 289 mapped to the reference DNA genome and the bisulfite conversion rate was 0.986 (Table 290 S2). 2,574,098 CpG sites that covered at least 10 paired samples were identified. CpG sites in 291 the heterosome and overlapped with a database of single nucleotide polymorphisms (dbSNP 292 build 142) were removed, and the remaining 2,166,853 CpG sites were used for downstream 293 analyses. To assess the reproducibility of RRBS, the genome-wide CpG methylation values 294 between each sample were compared. The average Pearson's correlation coefficient between 295 normal samples was above 0.95 (Figure S1), indicating excellent reproducibility of RRBS. 296 Compared with normal samples, the correlation coefficient between tumor samples or 297 between tumor-normal pairs was smaller (Figure S1), showing genome-wide DNA 298 methylation changes occurred between tumor-normal pairs, as well as among tumors.

299 300 Figure 1. Comparison of DNA methylation alterations in lung tumor and matched 301 normal tissues. (A) Unsupervised hierarchical clustering based on methylation levels of the 302 top 1% CpG sites (n = 21,668) that varied most across 18 normal/tumor pairs. Columns are 303 samples and rows are CpG sites. (B) Principal component analysis of 18 normal/tumor pairs 304 based on methylation levels of all CpG sites (n = 2,166,853). 305

306 The top 1% of CpG sites that varied most across all samples were selected to perform 307 unsupervised hierarchical clustering. As expected, the NSCLCs and matched normal samples 308 were clearly segregated, with the exception of one normal sample (Figure 1A). Principal 309 component analysis (PCA) using all CpG sites also showed similar segregation between the 310 two groups (Figure 1B). These results indicated substantial methylation differences between 311 NSCLCs and normal tissues. We observed that the variation within NSCLCs is much larger 312 than that within normal samples (Figure 1B), suggesting intra-tumor DNA methylation 313 heterogeneity in NSCLCs. Moreover, the NSCLCs revealed a lower proportion of highly

10

314 methylated CpG sites (>80%), and a similar fraction of intermediate methylation (10%–80%) 315 and low methylation (<10%) compared to matched normal ones (Figure S2A), showing 316 hypomethylation on a genome-wide scale in NSCLCs. However, enormous methylation gains 317 were observed in NSCLCs at CpG sites within CGIs that had low methylation levels in 318 normal samples (< 10%; Figure S2B). Then, we compared DNA methylation differences 319 between lung tumors and normal lung tissues for 15 chromatin states which were defined in 320 ESCs [6]. We found that heterochromatin states (Het) had the greatest hypomethylation in 321 NSCLCs compared to their normal counterparts. In contrast, methylation levels of bivalent 322 regulatory states (TssBiv, BivFlnk and EnhBiv) and repressed polycomb states (ReprPC) 323 showed a small increase in NSCLC tumors (Figure S2C).

324 Cancer-specific hypermethylated regions occurred primarily at genes with poised- 325 promoter in embryonic stem cells 326 To identify DMRs in NSCLCs, the whole RRBS dataset was analyzed by an unbiased 327 circular binary segmentation method [43]. In total, 4,410 hypermethylated DMRs (hyperMe- 328 DMRs) and 4,824 hypomethylated DMRs (hypoMe-DMRs) were detected in tumors 329 compared to matched normal tissues (Figure 2A and Table S6). In our study, similar amounts 330 of hypo- and hypermethylation were uncovered in lung cancer, which was in line with 331 previous reports in colon cancer [44]. The hyperMe-DMRs were of smaller size (mean values 332 = 215 bp versus 263 bp, P = 2.2×10-16) and contained more CpG sites (mean values = 22 333 versus 13, P = 2.2×10-16) than hypoMe-DMRs (Figure S3A-B).

334 Although both hypermethylation and hypomethylation changes occurred in lung cancer, 335 there were significant differences in the genomic regions and the activity states that were 336 modified. The majority (75.5%) of hyperMe-DMRs were located within CGI, while only 337 16.1% of hypoMe-DMRs overlapped with CGI (Figure 2A and Table S6). Moreover, by 338 integrating analysis with 15-chromatin states across ESC epigenomes generated from five 339 histone modification marks, we discovered that hyperMe-DMRs were strongly enriched in 340 poised promoters (TssBiv and BivFlnk) of ESCs, whereas hypoMe-DMRs did not show 341 significant enrichment across the activity states (Figure 2 and Table S6). In lung cancer, 342 46.1% (2,033) of hyperMe-DMRs were marked by poised promoters in ESCs, which covered 343 promoters of 995 genes including well-known aberrantly methylated homeobox gene cluster 344 [45] (Figure S4). Enrichment analysis of Gene Ontology and molecular signature found that 345 these genes were significantly enriched in developmental processes, stem cell differentiation, 346 tri-methylated H3K4 and H3K27, and hypermethylated genes in other cancer types (Figure 347 S5A-B, and Table S7).These findings demonstrate that genes targeted by a bivalent promoter 348 chromatin pattern (repressive mark H3K27me3 and active mark H3K4me3) in ESCs tend to 349 be de novo methylated in lung cancer.

350 351 HyperMe-DMRs in lung cancer are recurrently hypermethylated in various primary 352 cancers and cancer cell lines

353 Then we determined if these hyperMe-DMRs in lung cancer were also hypermethylated 354 in other tumor types. By investigating the methylation levels of CpG sites within hyperMe- 355 DMRs from methylation array data of 21 cancer types in TCGA, we observed that the 11

356 357 Figure 2. Characteristics of differentially methylated regions. (A) Circular plot of 4,410 358 hypermethylated (hyperMe) and 4,824 hypomethylated (hypoMe) differentially methylated 359 regions (DMRs). ESCs, embryonic stem cells. (B) Distribution of hyperMe-DMRs and 360 hypoMe-DMRs within the region of 15 chromatin states defined in ESCs. Definitions of 15 361 chromatin states are shown in Figure 2C. (C) Definitions of 15 chromatin states and histone 362 mark probabilities. (D) Enrichments of DMRs in ESCs, IMR90, NHLF, lung tissues, and 363 A549 cells across 15 chromatin states. 364 365 hyperMe-DMRs in our lung cancer data remained hypermethylated in the TCGA lung cancer 366 dataset as well as other primary cancers (Figure 3A). Strikingly, we also found that 367 methylation levels of these hyperMe-DMRs in cancer cell lines from the ENCODE project 368 and CCLE database [6, 46], were significantly increased compared to that in ESCs, primary 369 cells from peripheral blood, primary culture cell lines, and primary tissues from the NIH 370 Roadmap Epigenomics Consortium (Figure 3B). These results highlight that the hyperMe- 371 DMRs we identified are highly relevant to tumor development and progression. In addition, 12

372 hyperMe-DMRs targeted by bivalent or active promoters in ESCs also showed 373 hypermethylation in TCGA primary cancers and cancer cell lines (Figure 3B and Figure 374 S6A-B). Overall, the above data indicate that a group of development-associated genes, 375 mainly marked by ESC poised promoters, show highly conserved hypermethylation across 376 cancer types and cancer cell lines.

377 378 Figure 3. HyperMe-DMRs are recurrently hypermethylated. (A) Methylation levels of 379 total hyperMe-DMRs between tumor and normal samples across 21 cancer types. Data are 380 shown as mean ± SD. The top digits signified number of tumor (red) and normal (green) 381 samples analyzed. (B) Methylation levels of total hyperMe-DMRs and the other set of 382 hyperMe-DMRs (i.e., ESC-poised and ESC-active promoter marked hyperMe-DMRs) across 383 tissues, cell types and cancer cell lines. Data are shown as mean ± SD. The top digits 384 signified number of tissues/cells.

13

385

386 Figure 4. Functional characterization of hyperMe-DMRs subtypes. (A) Examples of two 387 hyperMe-DMRs subtypes, one is pre-marked by poised promoter (IRX1, poised-hyperMe 388 DMRs) and the other by active promoter (MAPK7, active-hyperMe DMRs) in ESCs. 389 Annotations of chromatin state across 127 reference epigenomes including ESCs and lung- 390 related tissues/cells are shown. (B) Fraction of no/bottom-expression genes for all annotated 391 genes and genes harboring poised-hyperMe DMRs and active-hyperMe DMRs in their 392 promoters. (C) Distribution of no differential expression, upregulation and downregulation 393 for genes harboring poised-hyperMe DMRs and active-hyperMe DMRs in their promoters. 394 395 DNA methylation of ESC-poised promoter marked genes correlates more strongly with 396 gene expression than ESC-active promoter marked genes 397 It was a classical regulatory paradigm that DNA hypermethylation at promoters were 398 generally correlated with gene silencing [47], thus hyperMe-DMRs at promoter regions were 399 our focal point. There were 2,519 (57.1%) hyperMe-DMRs located within promoters, 400 including two types; one where its promoter was marked by poised chromatin states in ESCs 401 (2,033, 46.1%) and another by active chromatin states in ESCs (486, 11%) (Figure 2B). 402 Examples included hyperMe-DMRs within IRX1 and MAPK7 promoters (Figure 4A), 403 targeted by bivalent and active states, respectively, showing differential chromatin 404 modification in ESCs. Furthermore, we annotated the two types of hyperMe-DMRs by 405 19,901 protein coding genes (GENCODE V28), and obtained 995 ESC-poised marked genes 406 and 307 ESC-active marked genes, respectively. By integrating analysis with corresponding 407 RNA-seq data, we found that ESC-poised marked genes were significantly enriched in genes 14

408 with no or low transcript levels in both lung normal and tumor tissues (P = 2.2×10-16), 409 whereas ESC-active marked genes showed no significant enrichment in those genes (Figure 410 4B). This finding suggests that most ESC-poised marked genes were inactive in normal 411 tissues and generally remain inactive in tumor tissues. This is probably due to switching of 412 the epigenetic silence mechanism where the silenced genes in normal tissues were inactive by 413 de novo methylation in cancer [48-51]. For the 393 ESC-poised marked genes and 187 ESC- 414 active marked genes that were expressed in both lung normal and cancer tissues, there were 415 more significantly downregulated genes among the ESC-poised marked genes than the ESC- 416 active marked genes (42% versus 28%, P = 0.0018) (Figure 4C). Taken together, the results 417 indicated that methylation levels of ESC-poised marked genes are more predictive of their 418 transcript levels over ESC-active marked genes.

419 Integrative analysis identified eight novel methylation-silenced genes in lung cancer

420 The above data and previous reports suggested that the majority of methylated events 421 were passenger methylation events which are probably a consequence of tumorigenesis [52, 422 53]. Hence, we developed a heuristic strategy to pinpoint driver gene methylation events 423 from passenger events in lung cancer. By integrative analysis of methylation levels of 424 promoter DMRs (pDMRs; within the 2kb upstream and downstream of TSS and 425 corresponding RNA-seq, this strategy identified putative methylation driver genes where not 426 only both their transcript expression and methylation level were dysregulated in tumors, but 427 also there were strong negative correlations between expression and methylation changes. As 428 a result, we obtained a primary gene list in the discovery set, which consisted of 190 protein 429 coding genes (FDR < 0.05, Spearman’s rho < -0.3). To further reduce false positive discovery 430 rates, we filtered the primary candidate gene list using the methylation and expression dataset 431 of the lung cancer samples in TCGA cohorts including LUAD and LUSC. Finally, we 432 obtained a high-confidence list with 31 unique genes including 20 hypermethylation drivers 433 and 11 hypomethylation drivers that were recovered in both the LUAD and LUSC samples 434 from TCGA cohorts, and a moderate-confidence list with 81 unique genes that were 435 recovered in only one of the above cancer types (Figure 5A and Table S8). Here, we focused 436 on the 20 hypermethylation driver genes in the high-confidence list for subsequent analysis.

437 Methylation status of the 20 hypermethylation driver genes showed a significant and 438 highly negative correlation (FDR < 0.01, Spearman’s rho < -0.3) with their expression 439 (Figure 5A). We further analyzed the transcript expression and HM450 methylation data of 440 638 cancer cell lines from the CCLE database, and found that 15 of 20 hypermethylation 441 driver genes exhibited a significant and highly negative correlation (FDR < 0.01, Spearman’s 442 rho < -0.3) between their expression and methylation in the promoter (Figure 5A), 443 demonstrating reproducibility of the methylation driver list. We also examined such negative 444 correlations for the 20 genes across 19 other cancer types in TCGA. We observed that 18 445 genes revealed a significant negative correlation (FDR < 0.01, Spearman’s rho < -0.3) in at 446 least five cancer types, except for ADCY8 and TBX5, which showed a cancer-type specific 447 methylation driver. Moreover, the majority of methylation driver genes (16 of 20) belonged to 448 ESC-poised marked genes (Figure 5A), further supporting our conclusion that preferential 449 hypermethylation events in ESC-poised marked genes played a key role in tumorigenesis.

15

450

451 Figure 5. Methylation-silenced driver genes in lung cancer. (A) 20 methylation driver 452 genes identified in lung cancer. Grey bars indicate the Spearman correlation calculated from 453 our NSCLC data, TCGA LUAD and LUSC, and CCLE database. The dark yellow triangle 454 indicates known methylation drivers in lung cancer, spring green circle indicates known 455 methylation drivers in non-lung cancer, and the Indian red rhombus indicates first report in 456 cancer. Light and dark red circles indicate genes pre-marked by poised and active promoter in 457 ESCs, respectively. Number of cancer types from TCGA cohorts (blue bar) that detected the 458 same methylation drivers is plotted in the right panel. (B) Validation of methylation changes 459 of eight novel methylation driver genes in an independent cohort (23 paired tumor and 460 normal tissues) by bisulfite pyrosequencing. The box indicates the median±1 quartile with 461 each point representing one sample (yellow for tumors and blue for normal tissues). (C) ROC 462 curve for random forest analysis using pyrosequencing value inputs of eight novel 463 methylation biomarkers from an independent cohort. ROC analysis yielded an AUC of 0.965,

16

464 showing high accuracy of distinguishing tumors from normal tissues. 465

466 Twelve of 20 genes (60%) in the list were methylation driver genes described previously 467 in lung cancer (Figure 5A), such as CDO1, SLIT2, SOX17 and TCF21. The remaining eight 468 genes are newly identified methylation drivers in lung cancer, including 5 known cancer 469 methylation driver genes in other cancer types (HSPB6, IRX1, ITGA5, PCDH17, TBX5) and 470 3 novel genes that had not been previously reported (ADCY8, GALNT13 and TCTEX1D1) 471 (Figure S7). Furthermore, we verified the reproducibility of the methylation changes of eight 472 novel methylation drivers in an independent set of 23 lung tumor and matched normal tissues 473 using pyrosequencing (Figure 5B). All of them exhibited highly significant differences in 474 methylation levels between tumor and normal tissues. Interestingly, the eight novel 475 methylation drivers could distinguish lung tumors from normal samples with an area under 476 the curve (AUC) of 0.965 (95%CI: 091-1) in an independent lung cancer cohort (Figure 5C), 477 suggesting this eight-gene panel could be a potential diagnostic biomarker of lung cancer.

478 Functional validation of novel methylation-silenced driver genes in lung cancer

479 Considering that PCDH17 is a known tumor suppressor but not previously associated 480 with lung carcinogenesis, we performed experimental investigation of its functional role in 481 this tumor type. Its expression is higher in normal spleen, brain and lung, probably due to the 482 active histone mark of their promoter in these tissues. However, for other tissue and cell types 483 (e.g., ESCs and A549), their promoter regions are often targeted by a combination of active 484 and repressive histone marks (poised chromatin state), resulting in lower expression in them 485 (Figure 6A and Figure S8). The TSS of this gene is within CGI. Our analysis identified a 486 1,123 bp DMR (from 58,206,777 to 58,207,899 on chromosome 13) within this CGI, located 487 at 124bp downstream of the gene’s TSS. Two probes in HM450 from TCGA cohorts are also 488 mapped to the DMR (Figure 6A). Increased methylation levels of both the probes and the 489 DMR are significantly correlated with decreased gene expression in lung tumor compared to 490 normal samples (Figure 6B). We further examined the consequence of demethylation on 491 PCDH17 expression using the demethylating agent 5-aza-dC in four lung cancer cell lines 492 (A549, Calu1, H1299, and HOP62). The expression of PCDH17 was found to be strongly 493 upregulated in Calu1, H1299, and HOP62 cells (Figure 6C), suggesting that PCDH17 494 expression is regulated by promoter methylation. PCDH17 expression in A549 cells was the 495 highest among the four cell lines, which probably explained why we did not observe 496 significant expression change after 5-aza-dC treatment (Figure 6D). HOP62 also had the 497 higher PCDH17 expression, thus the two cell lines were selected for inhibition of PCDH17 498 expression. The expression level of PCDH17 was substantially reduced (>85% and 90%, 499 respectively) after small interfering RNA (siRNA) was transfected into cells (Figure 6E). 500 Then clone formation assays and CCK-8 cell viability assays were performed to assess the 501 effect of PCDH17 knockdown on A549 and HOP62 cell proliferation. The depletion of 502 PCDH17 significantly promoted cell growth of A549 and HOP62 cells compared to cell lines 503 with scrambled control (P < 0.001, Figure 6F). Downregulation of PCDH17 also strongly 504 enhanced clone formation in both cell lines (Figure 6G). These findings indicate that 505 PCDH17 suppresses lung cancer cell proliferation.

17

506 507 Figure 6. Functional validation of PCDH17. (A) The epigenomic chromatin profile of 508 DMR in the PCDH17 promoter. Color-coded definitions of chromatin states are shown in 509 Figure 2C. The position of DMR is highlighted by the red box. The CpG islands overlapped 510 with the DMR are shown in a green box. The HM450 probes within the DMR are indicated

18

511 by yellow lines. (B) Scatterplots and box plots of methylation of DMRs/CpGs within the 512 PCDH17 promoter and expression of PCDH17 in RRBS NSCLC (top), TCGA LUAD 513 (middle) and TCGA LUSC (bottom) cohorts. (C) Relative expression of PCDH17 after 514 treatment with increasing concentration of 5-aza-dC in lung cancer cell lines A549, Calu1, 515 H1299, and HOP62. (D) Relative expression of PCDH17 in lung cancer cell lines A549, 516 Calu1, H1299, and HOP62. (E) Relative expression change of PCDH17 in A549 and HOP62 517 cells transfected with PCDH17 siRNA or control siRNA. (F) Growth curves of A549 and 518 HOP62 cells transfected with PCDH17 siRNA or control siRNA. (G) Colony formation 519 assays in A549 and HOP62 cells transfected with PCDH17 siRNA or control siRNA. The 520 error bars indicate SD of three independent experiments. *P <0.05, **P <0.01, ***P <0.001 521 using a two-sided Student’s t-test.

522 We also performed in vitro experiments to explore the functional role of other 523 methylation drivers (IRX1, TBX5 and HSPB6) in lung cancer cell lines. We observed 524 significant expression change of IRX1 and HSPB6 after 5-aza-dC treatment (Figure S9A-B). 525 In addition, overexpression cell lines of IRX1, TBX5 and HSPB6 were successfully 526 established (Figure S9C-D). Re-expression of IRX1 significantly inhibited lung cancer cells 527 growth compared to control cells (Figure S9E-F). Cell migration was significantly reduced 528 in lung cell lines with overexpression of IRX1, TBX5 and HSPB6 (Figure S9G). Overall, our 529 results suggest that these newly identified methylation drivers play a tumor-suppressor role in 530 lung tumor pathogenesis.

531 Shared epigenetic control of ncRNAs and their host genes

532 Epigenetic alterations of ncRNAs have been reported to be involved in tumorigenesis in 533 several cancer types [54, 55]. Here, we focused on two common types of ncRNAs, lncRNAs 534 and miRNAs. First, we evaluated the aberrant methylation pattern surrounding the TSSs and 535 the transcript end sites (TESs) of protein coding genes, lncRNAs, and miRNAs in lung 536 tumors and adjacent normal tissues. The depletion of DNA methylation in TSS and TES 537 regions and the enrichment of DNA methylation in transcribed regions were observed in both 538 protein coding genes and lncRNAs, but low levels of DNA methylation were distributed in 539 miRNAs (Figure 7A and Figure S10A). We further investigated if three subtypes of 540 lncRNAs (i.e., intergenic, intragenic and antisense), which were classified based on their 541 relative position to nearby coding genes, had differential methylation profiles. More 542 enrichment of DNA methylation toward the transcribed regions was exclusively observed in 543 intergenic lncRNAs compared to the other types of lncRNAs (Figure 7B and Figure S10B). 544 These results suggest that lncRNAs display similar methylation patterns with protein coding 545 genes, but the three types of lncRNAs have distinct methylation profiles, which probably 546 affect their regulatory role.

547 Moreover, we identified eight lncRNAs and five miRNAs showing significantly 548 negative correlation between their methylation and expression (Table S9). Interestingly, three 549 of them were co-regulated with their host genes by promoter hypermethylation, including 550 miR-218 and its host gene SLIT2, miR-490 and its host gene CHRM2, and CDO1-LNC and 551 its host gene CDO1 (Figure 7C, and Figure S11A and 12A). miR-218 was recently reported

19

552 to play a tumor suppressive role in lung cancer through regulation of the IL6/STAT3 pathway.

553

554 Figure 7. DNA methylation alterations control expression of ncRNAs and their host 555 genes. (A) DNA methylation patterns of protein-coding genes and ncRNAs (lncRNAs and 556 miRNAs) across the gene body and ±5kb flanking regions of the gene body. DNA 557 methylation level is calculated according to 1kb windows with 100bp steps in lung normal 558 samples. (B) DNA methylation patterns of three different subcategories of lncRNAs 559 (intergenic, intragenic and antisense) across the gene body and ±5kb flanking the gene body. 560 (C) The epigenomic chromatin profile of DMR in miR-218 and its host gene SLIT2 561 promoter. Color-coded definitions of chromatin states are shown in Figure 2C. The position 562 of DMR is highlighted by the red box. The CpG islands overlapped with the DMR are shown 563 in the green box. (D) Scatterplots and box plots of methylation of DMR within the SLIT2 564 promoter and its expression in the RRBS cohort. (E) Scatterplots and box plots of 565 methylation of DMR within the promoter of SLIT2 and expression of miR-218 in the RRBS 566 cohort. (F) The computational prediction, biological process that miR-218 is involved in.

20

567

568 Additionally, based on the bioinformatics analysis of our corresponding miRNA-seq and 569 RNA-seq data, miR-218 was downregulated in tumors and predicted to be involved in many 570 important cancer-related biological processes, such as the cell cycle, DNA replication and 571 DNA repair, further supporting the tumor suppressive role of miR-218 in lung cancer. 572 However, how to regulate the expression of miR-218 remains elusive. Here, our findings 573 suggest that promoter methylation is a probably selective mechanism of regulation of miR- 574 218 and its host gene (Figure 7D-F). Epigenetic silencing of miR-490 and its host gene 575 CHRM2 has been observed in gastric and colorectal cancer [56, 57]. For the first time, we 576 reported the potential epigenetic regulation of miR-490-3p and its host gene CHRM2 in 577 NSCLC (Figure S11B-D).

578 CDO1-LNC, a lncRNA annotated in the GENCODE (v28), was also detected in our 579 lung cancer data. It covered a stretch of 557bp within CDO1 (Figure S12A). In an effort to 580 determine the potential functional role of CDO1-LNC, a Guilt-By-Association (GBA) 581 analysis was performed, an analysis that has been widely used for studying lncRNAs [58]. 582 This analysis yielded a total of 1,975 protein coding genes that were significantly co- 583 expressed with CDO1-LNC (FDR <0.01, Spearman’s rho < -0.5 or >0.5). These genes were 584 mainly enriched in the cell cycle, DNA replication and the development process, suggesting 585 that it may be involved in tumorigenesis in NSCLC. In addition, hypermethylation of the 586 same DMR is significantly correlated with expression of both CDO1-LNC and its host gene 587 CDO1 (Figure S12B-D), indicating they are probably co-regulated by promoter methylation.

588 Differential DNA methylation of transcription factor binding motifs in NSCLC

589 TF binding can be blocked by the methylation of TF binding sites in ESCs [59]. 590 Consequently, we sought to search for sets of TFs that modulate expression of our identified 591 methylation driver genes in lung cancer. To ensure the power of identifying TFs, we focused 592 on the primary gene list in the discovery set, which consisted of 190 methylation driver genes 593 (Table S8). Then, 190 pDMRs in these genes were intersected with binding motifs of 84 TF 594 groups obtained from the literature and de novo motif discovery in 427 human ChIP-seq 595 datasets [60]. Significant enrichment of motifs was observed in the hyperMe regions of lung 596 tumor (46 in hyperMe regions versus 10 hypoMe regions; Fisher’s test p value= 0.0002; 597 Table S10). Finally, TFs that correspond to these enriched motifs were identified; the 598 correlation between expression of the TFs and their targeted genes was calculated. As a 599 result, 27 significantly correlated motifs in hyper-methylated and down-regulated genes, and 600 eight in hypo-methylated and up-regulated genes were identified (FDR < 0.05; Table S11), 601 including known methylation sensitive TFs such as SP1, YY1 and CTCF [61]. For these TFs, 602 the DNA methylation in their binding motifs regulated cancer-specific gene expression by 603 affecting the binding ability of TFs in NSCLCs. The top ten highly significantly correlated 604 motifs in hyper-methylated regions and the top five in hypo-methylated regions were selected 605 for downstream analysis. At least one CpG site was observed in most of these selected motifs, 606 further supporting that they were indeed involved in transcriptional regulation through CpG 607 methylation states (Figure S13).

21

608

609 Figure 8. Expression of epigenetically dysregulated genes is affected by methylation 610 changes of transcription factor binding motifs in lung cancer. (A, B) Activity plot 611 integrating the correlation between methylation of transcription factor binding motifs and 612 expressional fold changes of targeted methylation-driven genes. The strength of inactivation 613 (red; A) and activation (blue; B) in tumors compared to normal samples is represented by 614 color and intensity. (C) Correlation between methylation level of two CpG sites within the 615 EGR1 binding motif of TCF21 and expression level of TCF21. Genomic coordinates of the 616 EGR1 binding motif relative to TCF21 are shown in the top panel. (D) Correlation between 617 EGR1 and TCF21 expression. 618 To determine which genes were strongly influenced by the binding vicinity of the 619 corresponding TFs in NSCLC, we calculated activity scores by integrating the correlation of 22

620 targeted gene expression with methylation level of pDMRs and expression level of TFs 621 (Figure 8A-B). The strongly inactivated genes in lung tumors compared to normal samples, 622 whose corresponding TF binding motifs were affected by DNA methylation, included well- 623 known epigenetically dysregulated genes such as TCF21, CYYR1 and GATA2, as well as 624 some novel epigenetically dysregulated genes such as HSPB6, ITGA5 and TBX5. The 625 strongly activated genes in lung tumors were also found, including PKP3, a member of the 626 armadillo protein family whose overexpression was a key feature in lung carcinogenesis [62]. 627 Of interest was the motif EGR1_known3 of EGR1, including two CpG sites (Figure S13). 628 The genome-wide methylation level of this motif was lower than its neighbor regions, 629 supporting that TF binding sites tend to lose methylation (Figure S14). The high variance of 630 methylation in the motif and neighbor regions in lung tumors indicated that some CpG sites 631 have been selectively methylated (Figure S14). EGR1, a zinc-finger tumor suppressor 632 transcription factor, has been shown to regulate multiple tumor suppressors including TGFβ1, 633 TP53 and PTEN [63]. Here, we found the strongest inactivation of EGR1 binding motif 634 within the TCF21, an epigenetically regulated tumor suppressor gene in lung cancer (Figure 635 8A). We found a significant association between the methylation level of the two CpG sites in 636 the EGR1 binding motif and TCF21 expression (Figure 8C). The EGR1 expression was also 637 highly correlated with expression level of TCF21 (Figure 8D and Figure S15). Moreover, it 638 has been shown that TCF21 can be activated by WT1, recognizing and binding to EGR1-like 639 motifs [64, 65]. Taken together, EGR1 is potentially involved in transcriptional regulation of 640 TCF21 and the methylation states of EGR1 binding motif may influence the ability of EGR1 641 or WT1 to bind to TCF21 in lung cancer.

642 Discussion

643 Previous studies have explored the impact of DNA methylation events on lung tumor 644 initiation and progression [19-22, 27-29, 66-68]. However, those studies mainly investigated 645 a single gene or a small number of genes and a NGS-based methylation investigation in lung 646 cancer has not yet been reported. Here, we analyzed the genome-wide methylation profile in 647 lung cancer using an enriched sequencing approach called RRBS. This NGS-based technique 648 is greatly superior to traditional microarray-based methods with regard to its high CpG 649 coverage, single-base resolution and low input requirement [30]. In the present study, on 650 average we could interrogate more than two million CpG sites in each sample for methylation 651 analysis. Another major advantage of this method is the ability to search for DMRs spanning 652 multiple consecutive CpG sites, which are robust and functionally relevant methylation 653 events [47]. Due to strong correlation among CpG sites within a genomic region of about 654 500bp, calling DMR could reduce the dimensionality of methylation data and thus increase 655 the power and robustness of identifying differential methylation events, especially for low- 656 coverage regions. From our RRBS data, we uncovered 4,410 hypermethylated and 4,824 657 hypomethylated DMRs using the circular binary segmentation algorithm, which could help us 658 better dissect the casual role of DNA methylation in lung cancer.

659 As expected, we observed widespread aberrations of DNA methylation in our lung 660 cancer data including global DNA hypomethylation and CGI-specific hypermethylation, 23

661 which is largely in line with previous reports in several malignancies [44, 69], suggesting 662 their important roles in tumorigenesis. Based on the top two principal components of the PCA 663 results, we found that methylation profiles in tumor specimens were largely different from 664 each other, which could partially explain intra-tumor heterogeneity, whereas their 665 corresponding normal counterparts presented similar methylation landscapes. Interestingly, 666 the methylation level of our identified hyperMe-DMRs remained significantly higher in 667 tumor tissues from 21 cancer types of TCGA and the 1,028 CCLE cell lines compared to that 668 in corresponding normal tissues and cells, indicating that hypermethylation of these regions is 669 cancer specific. Additionally, 46.1% of the identified hyperMe-DMRs in lung cancer were 670 overlapped with poised promoters in ESCs. This further confirmed that the frequent 671 methylation of polycomb targets is a hallmark of lung cancer and many other human cancer 672 types [48, 49]. However, our observations also suggest that methylation silencing of these 673 genes is a highly selective pressure during tumor formation and potentially affects tumor 674 pathogenesis, rather than a residual stem-cell memory. Strikingly, 80% of our identified 675 hypermethylation driver genes were pre-marked with a poised promoter in ESCs.

676 It is challenging to pinpoint cancer driver events from a large number of DNA 677 methylation changes in genome-wide methylation studies. In the present study, we integrated 678 DNA methylation with matched gene expression from our own lung cancer data and data 679 provided by TCGA cohorts. Using this strategy, we identified 20 hypermethylation driver 680 genes, of which 12 were recovered from previous studies and known epigenetic-regulated 681 genes (e.g., SLIT2, CDO1 and TCF21), suggesting that the integration of DNA methylation 682 with other types of omics data, such as RNA-seq, is an effective method to study methylation 683 regulation of genes. Moreover, all of the eight novel hypermethylation driver genes 684 (PCDH17, IRX1, ITGA5, HSPB6, TBX5, ADCY8, GALNT13 and TCTEX1D1) were 685 successfully validated in an independent cohort via a pyrosequencing approach, with an AUC 686 of 0.965 for prediction of lung cancer patients. This indicates that these novel drivers could 687 be used as promising diagnostic biomarkers in the clinic. Notably, they have not been 688 included in the 14 currently commercially used DNA methylation-based biomarkers [70].

689 Considering the recapitulation of previously published cancer driver genes, we 690 experimentally tested the tumor-suppressive role of four new methylation driver genes in 691 lung cancer cells, including PCDH17, IRX1, HSPB6 and TBX5, which are known driver 692 genes but not linked with lung tumorigenesis [71-78]. Expression of these genes was 693 upregulated upon treatment with 5-Aza-dC demethylation, suggesting that their expression is 694 regulated by DNA methylation. Notably, the putative methylation driver PCDH17 was 695 previously described as a tumor suppressor that induces tumor cell apoptosis and autophagy, 696 and was functionally hypermethylated in gastric, colorectal, urological and esophageal 697 cancers [76-78]. We also found that inhibition of PCDH17 promotes cell proliferation, 698 suggesting its tumor-suppressive role in lung cancer. In addition, we validated the tumor- 699 suppressive role of IRX1, TBX5 and HSPB6 in lung cancer. Of note, our identified 700 methylation drivers are not within the gene list of 299 cancer driver genes that were identified 701 by PanSoftware using somatic mutations and insertions/deletions in pan-cancer data and 702 represented the largest discovery of cancer genes thus far [79]. This suggests that analysis of 703 methylation events could help to identify additional drivers and provide important research

24

704 candidates.

705 Epigenetic regulation of ncRNAs in cancer have been recently characterized [80]. Using 706 the same data mining strategy described above, we also discovered 13 methylation driver 707 ncRNAs (i.e., eight lncRNAs and five miRNAs) in lung cancer. This suggests that epigenetic 708 alteration is a selective mechanism for regulation of ncRNA expression in cancer. Strikingly, 709 our observations showed that several ncRNAs and their host genes were co-regulated by the 710 same DMRs within the promoter, such as miR-218 and its host gene SLIT2, miR-490 and its 711 host gene CHRM2, and CDO1-LNC and its host gene CDO1. This offers the possibility of 712 developing more effective epigenetic therapeutic drugs to target the common DMRs that 713 impact ncRNAs and the host genes, both of which play a critical role in tumor initiation and 714 progression.

715 It has been demonstrated that the binding affinity of some TFs to DNA motifs is affected 716 by CpG methylation status in vivo [81]. A recent study suggested that loss of NRF1 binding 717 was caused by local hypermethylation [82]. Based on this regulatory paradigm, we correlated 718 DNA methylation and expression level of targets with expression level of enriched TFs to 719 identify cancer-specific TFs in lung cancer at a large scale. Using this method, we discovered 720 several TFs and their target genes. Notably, we uncovered a potential regulatory relationship 721 between EGR1 and TCF21. The strong correlation between methylation level of two CpG 722 sites within EGR1 binding motif and TCF21 expression suggests that EGR1 transcriptionally 723 regulates TCF21 expression through an epigenetic mechanism. Our observations also showed 724 that the underlying connection between TFs and DNA methylation may be complex. For 725 example, several DNA binding motifs of TFs did not span CpG sites, such as NR3C1_disc6, 726 NR3C1_disc2 and SMARC_disc1 (Figure S13). However, the methylation level of 727 neighborhood regions flanking these DNA motifs showed a strong correlation with 728 expression of these TFs as well as target genes. Similar findings were reported in previous 729 studies in which the binding ability of OCT4 was sharply reduced providing that DNA 730 methylation occurred within 100bp on each side of the targeted motif of OCT4, but no CpG 731 sites were located within its recognized motifs [83, 84]. Therefore, the neighborhood regions 732 of TF binding sites may also play a role in gene regulation and should not be overlooked in 733 future studies. Further characterization and experiment-based validation of the identified 734 regulators may lead to the discovery of novel epigenetically dysregulated pathways in lung 735 cancer and give new insights into the function of DNA methylation alterations in cancer.

736 Conclusions

737 Our findings demonstrated that RRBS is an effective and robust approach to explore the 738 cancer methylome. In our study, we discovered a large number of hypermethylation events 739 pre-marked by poised promoter in ESCs, being a hallmark of lung cancer. These 740 hypermethylation events showed a high conservation across cancer types and cancer cell 741 lines. Among them, we identified and experimentally confirmed a group of methylation 742 driver genes (e.g., PCDH17, IRX1, TBX5 and HSPB6) that play a critical role in neoplasia 743 initiation, promotion, and progression. We also revealed shared epigenetic control of ncRNAs

25

744 and their host genes that are controlled by the same promoter hypermethylation. Furthermore, 745 we detected sets of TF regulators driving the expression of epigenetically dysregulated genes, 746 guiding us to investigate the potential novel pathways and mechanisms by which DNA 747 methylation regulates gene expression in lung cancer. Our study also provides a valuable 748 resource for future efforts to identify DNA methylation-based diagnostic biomarkers, develop 749 cancer epigenetic therapy and study cancer pathogenesis. 750 751

26

752 Abbreviations

753 AUC: an area under the curve

754 CCLE: Cancer Cell Line Encyclopedia

755 CGI: CpG island

756 DMR: differentially methylated regions

757 ENCODE: Encyclopedia of DNA Elements

758 ESCs: embryonic stem cells

759 FDR: false discovery rate

760 GO: gene ontology

761 hyperMe-DMRs: hypermethylated DMRs

762 hypoMe-DMRs: hypomethylated DMRs

763 lncRNAs: long noncoding RNAs

764 LUAD: lung adenocarcinoma

765 LUSC: lung squamous cell carcinomas

766 ncRNAs: noncoding RNAs

767 NGS: next-generation sequencing

768 NSCLC: non-small cell lung cancer

769 PCA: Principal component analysis

770 pDMRs: promoter DMRs

771 ROC: receiver operating characteristic

772 RPKM: Reads per kilobase of exon per million fragments mapped

773 RPM: Reads Per Million

774 RRBS: reduced representation bisulfite sequencing

775 siRNA: small interfering RNA

776 TCGA: The Cancer Genome Atlas

777 TES: transcript end sites

778 TF: Transcription factor

779 TSS: transcript start site

27

780 Declarations

781 Ethics approval and consent to participate

782 The study was approved by the institutional review board of Sir Run Run Shaw Hospital 783 at Zhejiang University School of Medicine (Hangzhou, China). The study was conducted in 784 accordance with the International Ethical Guidelines for Biomedical Research Involving 785 Human Subjects.

786 Consent for publication

787 All authors have agreed to publish this manuscript.

788 Availability of data and materials

789 The raw sequence data of these samples were deposited in the NCBI short read archive 790 (SRA) with accession numbers SRP125064 and SRP227900. All other data that support the 791 findings of this study are available from the corresponding authors upon reasonable request.

792 Competing Interest

793 No potential conflicts of interest were disclosed.

794 Funding

795 This work has been supported in part by National Key R&D program of China 796 (2019YFC1315700 and 2016YFA0501800), Key Program of Zhejiang Provincial Natural 797 Science Foundation of China (LZ20H160001), Medical Health Science and Technology Key 798 Project of Zhejiang Provincial Health Commission (WKJ-ZJ-2007 and 2017211914), 799 National Natural Science Foundation of China (81871864, 81772766 and 82072857), and the 800 American Heart Association (15SFRN23910002).

801 Authors’ Contribution

802 Yan Lu, PL and EC considered and designed the study. XS developed the algorithm and 803 performed the data analysis. JY, JY and XQ conducted in vitro cell-based assays. Yan Lu 804 prepared for RRBS, RNA-Seq and small RNA sequencing libraries. JL, ZJ and EC collected 805 tumor tissues. LB performed pathological analysis of tumor samples. XP and Yi Liu 806 contributed RRBS sequencing data analysis. Yong Liu and ML contributed to RRBS library 807 construction and sequencing. XS, Yan Lu and PL wrote the manuscript. Yan Lu, PL, EC and 808 ML revised the manuscript. All of the authors discussed and commented the study.

809 Acknowledgments

810 We thank TCGA for providing access to RNA-seq and methylation datasets, and 811 ENCODE and CCLE for providing access to methylation datasets. We thank the Core 812 Facility at Zhejiang University School of Medicine for providing technical support and 813 anonymous reviewers for reading and commenting on the manuscript. 28

814 Supplementary Materials

815 Supplementary Methods 816 817 Supplementary Tables 818 819 Table S1. Characteristics of patient samples used in the study. 820 821 Table S2. Sequencing quality of RRBS (A), RNA-seq (B) and miRNA-seq (C). 822 823 Table S3. Summary of TCGA samples used in this study. 824 825 Table S4. A list of primers used for pyrosequencing validation of eight novel methylation 826 driver genes in an independent samples. 827 828 Table S5. siRNA and qPCR primers used in the vitro assays. 829 830 Table S6. Summary of 4,810 hyperMe-DMRs (A) and 4,824 hypoMe-DMRs (B). 831 832 Table S7. Over-represented categories of molecular signature. 833 834 Table S8. Methylation driver genes identified in the study. (A) Confirmed in both TCGA 835 LUAD and TCGA LUSC cohorts, (B) Confirmed only in TCGA LUAD cohort, (C) 836 Confirmed only in TCGA LUSC cohort, (D) Not confirmed in TCGA LUAD cohort or 837 TCGA LUSC cohort. 838 839 Table S9. Methylation driver ncRNAs identified in the study 840 841 Table S10. Enrichment of transcription factor binding motifs in hypermethylated and 842 down-regulated methylation-driven genes (A) and in hypomethylated and up-regulated 843 methylation-driven genes (B). 844 845 Table S11. Transcription factors significantly correlated with hypermethylated and 846 down-regulated methylation-driven genes (A) and hypomethylated and up-regulated 847 methylation-driven genes (B). 848 849 Supplementary Figures 850 851 Figure S1. Correlations between RRBS samples. Box plots of Pearson correlation 852 coefficients calculated among 18 normal samples, 18 tumor samples, and 18 normal/tumor 853 pairs, respectively. N vs. T indicates normal/tumor pairs. 854 29

855 Figure S2. Characteristics of the methylation level of CpG sites across lung tumors and 856 adjacent normal tissues. (A) Density plots of average methylation level of CpGs from the 857 background set (n = 2,166,853; solid lines) and DMRs (n = 162,603; dashed lines) across 858 normal or tumor samples, respectively. (B) Correlation between average methylation level of 859 tumor and normal samples in CpG islands (CGI; left) and outside CpG islands (non-CGI; 860 right). Orange indicates high density of CpG sites and purple indicates low density. UMR, 0– 861 10%; LMR, 10%–80%; FMR, 80%–100%. (C) Average methylation differences (average for 862 tumor samples vs. normal samples) across chromatin segments annotated in ESCs. 863 864 Figure S3. Length of DMRs and number of CpG sites in DMRs. (A) Distribution of the 865 length of hypermethylated and hypomethylated DMRs. (B) Distribution of the number of 866 CpG sites in hypermethylated and hypomethylated DMRs. HyperMe-DMRs, 867 hypermethylated DMRs; HypoMe-DMRs, hypomethylated DMRs. 868 869 Figure S4. The epigenomic chromatin profile of DMR in HOXA cluster genes. Color- 870 coded definitions of chromatin states are shown in Figure 2C. The position of the DMR is 871 highlighted by the red box. The CpG islands overlapped with the DMR are shown in the 872 green box. 873 874 Figure S5. Gene ontology analysis of genes with HyperMe-DMRs in their promoters. (A) 875 Enriched gene ontology (GO) term. (B) Statistical significance of corresponding GO term. 876 877 Figure S6. Methylation levels of hyperMe-DMRs pre-marked by poised promoter (A) 878 and active promoter (B) in ESCs between tumor and normal samples across 21 cancer 879 types. Data are shown as mean ± SD. The top digits signified number of tumor (red) and 880 normal (green) samples analyzed. SD, standard deviation. 881 882 Figure S7. Scatterplots and box plots expression of novel methylation driver genes and 883 methylation of DMR/CpGs within their promoters in RRBS NSCLC, TCGA LUAD and 884 TCGA LUSC cohorts. (A) ADCY8, (B) GALNT13, (C) HSPB6, (D) IRX1, (E) TBX5, (F) 885 TCTEX1D1, (G) ITGA5. 886 887 Figure S8. mRNA expression of PCDH17 across tissue samples from 95 human 888 individuals representing 27 different tissues. Data were downloaded from the Publication: 889 PMID 24309898. 890 891 Figure S9. Functional validation of IRX1, TBX5, and HSPB6. (A) Relative expression of 892 IRX1 after treatment with increasing concentration of 5-aza-dC in lung cancer cell lines 893 A549, Calu1, and H1299. (B) Relative expression of HSPB6 after treatment with increasing 894 concentration of 5-aza-dC in lung cancer cell lines A549, Calu1, H1299 and HOP62. (C-D) 895 Relative expression change of IRX1, TBX5 and HSPB6 in A549 (C) and H1299 (D) cells 896 transfected with empty or IRX1, TBX5 and HSPB6 overexpression vectors. (E-F) Growth 897 curves of A549 (E) and H1299 (F) cells transfected with empty or IRX1 overexpression 898 vectors. (G) Migration assay following overexpression of IRX1, TBX5, and HSPB6 in A549

30

899 and H1299 cells. The error bars indicate SD of three independent experiments. *P <0.05, **P 900 <0.01, ***P <0.001 using a two-sided Student’s t-test. 901 902 Figure S10. DNA methylation patterns across the gene body and ±5kb up- or down- 903 stream of gene body in lung tumors. (A) DNA methylation patterns of protein-coding genes 904 and ncRNAs (lncRNAs and miRNAs). (B) DNA methylation patterns of three different 905 subcategories of lncRNAs (intergenic, intragenic and antisense). DNA methylation level is 906 calculated according to 1kb windows with 100bp steps in lung tumor samples. 907 908 Figure S11. DNA methylation alterations control expression of miR-490-3p and its host 909 gene CHRM2. (A) The epigenomic chromatin profile of DMR in miR-490-3p and CHRM2 910 promoter. Color-coded definitions of chromatin states are shown in Figure 2C. The position 911 of DMR is highlighted by the red box. The CpG islands overlapped with the DMR are shown 912 in the green box. (B) Scatterplots and box plots of methylation of DMR within CHRM2 913 promoter and its expression in the NSCLC cohort. (C) Scatterplots and box plots of 914 methylation of DMR within the promoter of CHRM2 and expression of miR-490 in the 915 NSCLC cohort. (D) Computational prediction of the biological process in which miR-490-3p 916 is involved. 917 918 Figure S12. DNA methylation alterations control expression of CDO1-LNC and its host 919 genes CDO1. (A) The epigenomic chromatin profile of DMR in CDO1-LNC and CDO1 920 promoter. Color-coded definitions of chromatin states are shown in Figure 2C. The position 921 of DMR is highlighted by the red box. The CpG islands overlapped with the DMR are shown 922 in the green box. (B) Scatterplots and box plots of methylation of DMR within the CDO1 923 promoter and its expression in the NSCLC cohort. (C) Scatterplots and box plots of 924 methylation of DMR within the promoter of CDO1 and expression of CDO1-LNC in the 925 NSCLC cohort. (D) Computational prediction biological process in which CDO1-LNC is 926 involved. 927 928 Figure S13. Display of selective transcription factor binding motifs for methylation- 929 driven genes. (A) hypermethylated and down-regulated methylation-driven genes, (B) 930 hypomethylated and up-regulated methylation-driven genes. 931 932 Figure S14. Distribution of methylation level for transcription factor binding motif, 933 EGR1_known3, and ±2kb flanking the motif in tumors and adjacent normal tissues. 934 TFBS, transcription factor binding sites. 935 936 Figure S15. Correlations between expression levels of EGR1 and TCF21 in TCGA 937 LUAD and LUSC cohorts. 938

939

31

940 References

941 1. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA: A Cancer Journal for 942 Clinicians. 2018; 68: 7-30. 943 2. Gatta G, Mallone S, van der Zwan JM, Trama A, Siesling S, Capocaccia R, et al. Cancer 944 prevalence estimates in Europe at the beginning of 2000. Annals of Oncology. 2013; 24: 945 1660-6. 946 3. Ettinger DS, Wood DE, Aisner DL, Akerley W, Bauman J, Chirieac LR, et al. Non–Small 947 Cell Lung Cancer, Version 5.2017, NCCN Clinical Practice Guidelines in Oncology. Journal 948 of the National Comprehensive Cancer Network. 2017; 15: 504-35. 949 4. Duruisseaux M, Esteller M. Lung cancer epigenetics: From knowledge to applications. 950 Seminars in Cancer Biology. 2018; 51: 116-28. 951 5. Govindan R, Ding L, Griffith M, Subramanian J, Dees ND, Kanchi KL, et al. Genomic 952 landscape of non-small cell lung cancer in smokers and never-smokers. Cell. 2012; 150: 953 1121-34. 954 6. Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, et al. 955 Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518: 317-30. 956 7. Minna JD, Roth JA, Gazdar AF. Focus on lung cancer. Cancer cell. 2002; 1: 49-52. 957 8. Sritharan N. Genomic landscape of non-small-cell lung cancer in smokers and never- 958 smokers. Thorax. 2014; 69: 54-. 959 9. Zheng Z, Chen T, Li X, Haura E, Sharma A, Bepler G. DNA synthesis and repair genes 960 RRM1 and ERCC1 in lung cancer. New England Journal of Medicine. 2007; 356: 800-8. 961 10. Scagliotti GV, Parikh P, Pawel Jv, Biesma B, Vansteenkiste J, Manegold C, et al. Phase 962 III Study Comparing Cisplatin Plus Gemcitabine With Cisplatin Plus Pemetrexed in 963 Chemotherapy-Naive Patients With Advanced-Stage Non–Small-Cell Lung Cancer. Journal 964 of Clinical Oncology. 2008; 26: 3543-51. 965 11. Sandler A, Gray R, Perry MC, Brahmer J, Schiller JH, Dowlati A, et al. Paclitaxel– 966 Carboplatin Alone or with Bevacizumab for Non–Small-Cell Lung Cancer. New England 967 Journal of Medicine. 2006; 355: 2542-50. 968 12. Maemondo M, Inoue A, Kobayashi K, Sugawara S, Oizumi S, Isobe H, et al. Gefitinib or 969 Chemotherapy for Non–Small-Cell Lung Cancer with Mutated EGFR. New England Journal 970 of Medicine. 2010; 362: 2380-8. 971 13. Shaw AT, Kim D-W, Nakagawa K, Seto T, Crinó L, Ahn M-J, et al. Crizotinib versus 972 Chemotherapy in Advanced ALK-Positive Lung Cancer. New England Journal of Medicine. 973 2013; 368: 2385-94. 974 14. Shaw AT, Ou S-HI, Bang Y-J, Camidge DR, Solomon BJ, Salgia R, et al. Crizotinib in 975 ROS1-Rearranged Non–Small-Cell Lung Cancer. New England Journal of Medicine. 2014; 976 371: 1963-71. 977 15. Brahmer J, Reckamp KL, Baas P, Crinò L, Eberhardt WEE, Poddubskaya E, et al. 978 Nivolumab versus Docetaxel in Advanced Squamous-Cell Non–Small-Cell Lung Cancer. 979 New England Journal of Medicine. 2015; 373: 123-35. 980 16. Sharma P, Hu-Lieskovan S, Wargo JA, Ribas A. Primary, Adaptive, and Acquired 981 Resistance to Cancer Immunotherapy. Cell. 2017; 168: 707-23. 982 17. Jones PA, Baylin SB. The fundamental role of epigenetic events in cancer. Nature 32

983 reviews genetics. 2002; 3: 415-28. 984 18. Rodríguez-Paredes M, Esteller M. Cancer epigenetics reaches mainstream oncology. 985 Nature medicine. 2011: 330-9. 986 19. Merlo A, Herman JG, Mao L, Lee DJ, Gabrielson E, Burger PC, et al. 5′ CpG island 987 methylation is associated with transcriptional silencing of the tumour suppressor 988 /CDKN2/MTS1 in human cancers. Nature Medicine. 1995; 1: 686. 989 20. Sozzi G, Veronese ML, Negrini M, Baffa R, Cotticelli MG, Inoue H, et al. The FHIT 990 Gene at 3p14.2 Is Abnormal in Lung Cancer. Cell. 1996; 85: 17-26. 991 21. Licchesi JDF, Westra WH, Hooker CM, Herman JG. Promoter Hypermethylation of 992 Hallmark Cancer Genes in Atypical Adenomatous Hyperplasia of the Lung. Clinical Cancer 993 Research. 2008; 14: 2570-8. 994 22. Yim HW, Slebos RJ, Randell SH, Umbach DM, Parsons AM, Rivera MP, et al. Smoking 995 is associated with increased telomerase activity in short-term cultures of human bronchial 996 epithelial cells. Cancer letters. 2007; 246: 24-33. 997 23. Alajez NM, Lenarduzzi M, Ito E, Hui AB, Shi W, Bruce J, et al. MiR-218 suppresses 998 nasopharyngeal cancer progression through downregulation of and the SLIT2- 999 ROBO1 pathway. Cancer research. 2011; 71: 2381-91. 1000 24. Le X, Mu J, Peng W, Tang J, Xiang Q, Tian S, et al. DNA methylation downregulated 1001 ZDHHC1 suppresses tumor growth by altering cellular metabolism and inducing 1002 oxidative/ER stress-mediated apoptosis and pyroptosis. Theranostics.2020; 10: 9495. 1003 25. Han B, Meng X, Wu P, Li Z, Li S, Zhang Y, et al. ATRX/EZH2 complex epigenetically 1004 regulates FADD/PARP1 axis, contributing to TMZ resistance in glioma. Theranostics.2020; 1005 10: 3351. 1006 26. Hlady RA, Zhao X, Pan X, Yang JD, Ahmed F, Antwi SO, et al. Genome-wide discovery 1007 and validation of diagnostic DNA methylation-based biomarkers for hepatocellular cancer 1008 detection in circulating cell free DNA. Theranostics.2019; 9: 7239. 1009 27. Carvalho RH, Haberle V, Hou J, van Gent T, Thongjuea S, van IJcken W, et al. Genome- 1010 wide DNA methylation profiling of non-small cell lung carcinomas. Epigenetics Chromatin. 1011 2012; 5: 9. 1012 28. Heller G, Babinsky VN, Ziegler B, Weinzierl M, Noll C, Altenberger C, et al. Genome- 1013 wide CpG island methylation analyses in non-small cell lung cancer patients. Carcinogenesis. 1014 2013; 34: 513-21. 1015 29. Selamat SA, Chung BS, Girard L, Zhang W, Zhang Y, Campan M, et al. Genome-scale 1016 analysis of DNA methylation in lung adenocarcinoma and integration with mRNA 1017 expression. Genome research. 2012; 22: 1197-211. 1018 30. Gu H, Bock C, Mikkelsen TS, Jäger N, Smith ZD, Tomazou E, et al. Genome-scale DNA 1019 methylation mapping of clinical samples at single-nucleotide resolution. Nature methods. 1020 2010; 7: 133-6. 1021 31. Liu Y, Liu P, Yang C, Cowley AW, Jr., Liang M. Base-resolution maps of 5- 1022 methylcytosine and 5-hydroxymethylcytosine in Dahl S rats: effect of salt and genomic 1023 sequence. Hypertension. 2014; 63: 827-38. 1024 32. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite- 1025 Seq applications. Bioinformatics. 2011; 27: 1571-2. 1026 33. Jühling F, Kretzmer H, Bernhart SH, Otto C, Stadler PF, Hoffmann S. metilene: Fast and

33

1027 sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome 1028 research. 2015. 1029 34. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate 1030 alignment of transcriptomes in the presence of insertions, deletions and gene fusions. 1031 Genome Biol. 2013; 14: R36. 1032 35. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and 1033 transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature 1034 protocols. 2012; 7: 562-78. 1035 36. Wang F, Li L, Xu H, Liu Y, Yang C, Cowley Jr AW, et al. Characteristics of long non- 1036 coding RNAs in the Brown Norway rat and alterations in the Dahl salt-sensitive rat. 1037 Scientific reports. 2014; 4. 1038 37. Lin MF, Jungreis I, Kellis M. PhyloCSF: a comparative genomics method to distinguish 1039 protein coding and non-coding regions. Bioinformatics. 2011; 27: i275-82. 1040 38. Kriegel AJ, Liu Y, Liu P, Baker MA, Hodges MR, Hua X, et al. Characteristics of 1041 microRNAs enriched in specific cell types and primary tissue types in solid organs. 1042 Physiological genomics. 2013; 45: 1144-56. 1043 39. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. 1044 Bioinformatics. 2009; 25: 1754-60. 1045 40. Griffiths-Jones S, Grocock RJ, Van Dongen S, Bateman A, Enright AJ. miRBase: 1046 microRNA sequences, targets and gene nomenclature. Nucleic acids research. 2006; 34: 1047 D140-D4. 1048 41. Kretzmer H, Bernhart SH, Wang W, Haake A, Weniger MA, Bergmann AK, et al. DNA 1049 methylome analysis in Burkitt and follicular lymphomas identifies differentially methylated 1050 regions linked to somatic mutation and transcriptional control. Nature genetics. 2015; 47: 1051 1316-25. 1052 42. Mevik B-H, Wehrens R. The pls package: principal component and partial least squares 1053 regression in R. Journal of Statistical Software. 2007; 18: 1-24. 1054 43. Juhling F, Kretzmer H, Bernhart SH, Otto C, Stadler PF, Hoffmann S. metilene: fast and 1055 sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome 1056 Res. 2016; 26: 256-62. 1057 44. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, et al. The human 1058 colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue- 1059 specific CpG island shores. Nat Genet. 2009; 41: 178-86. 1060 45. Rauch T, Wang Z, Zhang X, Zhong X, Wu X, Lau SK, et al. Homeobox gene methylation 1061 in lung cancer studied by genome-wide analysis with a microarray-based methylated CpG 1062 island recovery assay. Proceedings of the National Academy of Sciences. 2007; 104: 5527- 1063 32. 1064 46. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The 1065 Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. 1066 Nature. 2012; 483: 603. 1067 47. Teschendorff AE, Relton CL. Statistical and integrative system-level analysis of DNA 1068 methylation data. Nature reviews Genetics. 2018; 19: 129-47. 1069 48. Ohm JE, McGarvey KM, Yu X, Cheng L, Schuebel KE, Cope L, et al. A stem cell-like 1070 chromatin pattern may predispose tumor suppressor genes to DNA hypermethylation and

34

1071 heritable silencing. Nat Genet. 2007; 39: 237-42. 1072 49. Schlesinger Y, Straussman R, Keshet I, Farkash S, Hecht M, Zimmerman J, et al. 1073 Polycomb-mediated methylation on Lys27 of histone H3 pre-marks genes for de novo 1074 methylation in cancer. Nat Genet. 2007; 39: 232-6. 1075 50. Hahn MA, Li AX, Wu X, Yang R, Drew DA, Rosenberg DW, et al. Loss of the polycomb 1076 mark from bivalent promoters leads to activation of cancer-promoting genes in colorectal 1077 tumors. Cancer research. 2014; 74: 3617-29. 1078 51. Easwaran H, Johnstone SE, Van Neste L, Ohm J, Mosbruger T, Wang Q, et al. A DNA 1079 hypermethylation module for the stem/progenitor cell signature of cancer. Genome Res. 1080 2012; 22: 837-49. 1081 52. Kalari S, Pfeifer GP. Identification of driver and passenger DNA methylation in cancer 1082 by epigenomic analysis. Advances in genetics. 2010; 70: 277-308. 1083 53. De Carvalho DD, Sharma S, You JS, Su SF, Taberlay PC, Kelly TK, et al. DNA 1084 methylation screening identifies driver epigenetic events of cancer cell survival. Cancer Cell. 1085 2012; 21: 655-67. 1086 54. Cheung H, Lee T, Davis A, Taft D, Rennert O, Chan W. Genome-wide DNA methylation 1087 profiling reveals novel epigenetically regulated genes and non-coding RNAs in human 1088 testicular cancer. British journal of cancer. 2010; 102: 419-27. 1089 55. Li Y, Zhang Y, Li S, Lu J, Chen J, Wang Y, et al. Genome-wide DNA methylome analysis 1090 reveals epigenetically dysregulated non-coding RNAs in human breast cancer. Scientific 1091 reports. 2015; 5. 1092 56. Shen J, Xiao Z, Wu WK, Wang MH, To KF, Chen Y, et al. Epigenetic Silencing of miR- 1093 490-3p Reactivates the Chromatin Remodeler SMARCD1 to Promote Helicobacter pylori– 1094 Induced Gastric Carcinogenesis. Cancer research. 2015; 75: 754-65. 1095 57. Zheng K, Zhou X, Yu J, Li Q, Wang H, Li M, et al. Epigenetic silencing of miR-490-3p 1096 promotes development of an aggressive colorectal cancer phenotype through activation of the 1097 Wnt/β-catenin signaling pathway. Cancer letters. 2016. 1098 58. Huarte M, Guttman M, Feldser D, Garber M, Koziol MJ, Kenzelmann-Broz D, et al. A 1099 large intergenic noncoding RNA induced by mediates global gene repression in the p53 1100 response. Cell. 2010; 142: 409-19. 1101 59. Chen P-Y, Feng S, Joo J, Jacobsen SE, Pellegrini M. A comparative analysis of DNA 1102 methylation across human embryonic stem cell lines. Genome Biol. 2011; 12: R62. 1103 60. Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs 1104 in ENCODE TF binding experiments. Nucleic acids research. 2014; 42: 2976-87. 1105 61. Blattler A, Farnham PJ. Cross-talk between site-specific transcription factors and DNA 1106 methylation states. Journal of Biological Chemistry. 2013; 288: 34287-94. 1107 62. Furukawa C, Daigo Y, Ishikawa N, Kato T, Ito T, Tsuchiya E, et al. Plakophilin 3 1108 oncogene as prognostic marker and therapeutic target for lung cancer. Cancer research. 2005; 1109 65: 7102-10. 1110 63. Baron V, Adamson ED, Calogero A, Ragona G, Mercola D. The transcription factor Egr1 1111 is a direct regulator of multiple tumor suppressors including TGFβ1, PTEN, p53, and 1112 . Cancer gene therapy. 2006; 13: 115-24. 1113 64. Hashimoto H, Olanrewaju YO, Zheng Y, Wilson GG, Zhang X, Cheng X. Wilms tumor 1114 protein recognizes 5-carboxylcytosine within a specific DNA sequence. Genes &

35

1115 development. 2014; 28: 2304-13. 1116 65. Dong L, Pietsch S, Englert C. Towards an understanding of kidney diseases associated 1117 with WT1 mutations. Kidney international. 2015. 1118 66. Kordowski F, Kolarova J, Schafmayer C, Buch S, Goldmann T, Marwitz S, et al. 1119 Aberrant DNA methylation of ADAMTS16 in colorectal and other epithelial cancers. BMC 1120 cancer. 2018; 18: 796. 1121 67. Grasse S, Lienhard M, Frese S, Kerick M, Steinbach A, Grimm C, et al. Epigenomic 1122 profiling of non-small cell lung cancer xenografts uncover LRP12 DNA methylation as 1123 predictive biomarker for carboplatin resistance. Genome medicine. 2018; 10: 55. 1124 68. Ooki A, Dinalankara W, Marchionni L, Tsay JJ, Goparaju C, Maleki Z, et al. 1125 Epigenetically regulated PAX6 drives cancer cells toward a stem-like state via GLI- 1126 signaling axis in lung adenocarcinoma. Oncogene. 2018. 1127 69. Vidal E, Sayols S, Moran S, Guillaumet-Adkins A, Schroeder MP, Royo R, et al. A DNA 1128 methylation map of human cancer at single base-pair resolution. Oncogene. 2017; 36: 5648- 1129 57. 1130 70. Koch A, Joosten SC, Feng Z, de Ruijter TC, Draht MX, Melotte V, et al. Analysis of 1131 DNA methylation in cancer: location revisited. Nature reviews Clinical oncology. 2018; 15: 1132 459-66. 1133 71. Yu J, Ma X, Cheung K, Li X, Tian L, Wang S, et al. Epigenetic inactivation of T-box 1134 transcription factor 5, a novel tumor suppressor gene, is associated with colon cancer. 1135 Oncogene. 2010; 29: 6464-74. 1136 72. Koga Y, Pelizzola M, Cheng E, Krauthammer M, Sznol M, Ariyan S, et al. Genome-wide 1137 screen of promoter methylation identifies novel markers in melanoma. Genome research. 1138 2009; 19: 1462-70. 1139 73. Fang Z, Yao W, Xiong Y, Zhang J, Liu L, Li J, et al. Functional elucidation and 1140 methylation‐mediated downregulation of ITGA5 gene in breast cancer cell line MDA‐MB‐ 1141 468. Journal of cellular biochemistry. 2010; 110: 1130-41. 1142 74. Guo X, Liu W, Pan Y, Ni P, Ji J, Guo L, et al. Homeobox gene IRX1 is a tumor 1143 suppressor gene in gastric carcinoma. Oncogene. 2010; 29: 3908-20. 1144 75. Bennett KL, Karpenko M, Lin M-t, Claus R, Arab K, Dyckhoff G, et al. Frequently 1145 methylated tumor suppressor genes in head and neck squamous cell carcinoma. Cancer 1146 research. 2008; 68: 4494-9. 1147 76. Hu X, Sui X, Li L, Huang X, Rong R, Su X, et al. Protocadherin 17 acts as a tumour 1148 suppressor inducing tumour cell apoptosis and autophagy, and is frequently methylated in 1149 gastric and colorectal cancers. The Journal of pathology. 2013; 229: 62-73. 1150 77. Costa VL, Henrique R, Danielsen SA, Eknaes M, Patrício P, Morais A, et al. TCF21 and 1151 PCDH17 methylation: An innovative panel of biomarkers for a simultaneous detection of 1152 urological cancers. Epigenetics. 2011; 6: 1120-30. 1153 78. Haruki S, Imoto I, Kozaki K-i, Matsui T, Kawachi H, Komatsu S, et al. Frequent 1154 silencing of protocadherin 17, a candidate tumour suppressor for esophageal squamous-cell 1155 carcinoma. Carcinogenesis. 2010: bgq053. 1156 79. Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, et al. 1157 Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell. 2018; 174: 1158 1034-5.

36

1159 80. Wang Z, Yang B, Zhang M, Guo W, Wu Z, Wang Y, et al. lncRNA Epigenetic Landscape 1160 Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes 1161 Cell-Cycle Progression in Cancer. Cancer Cell. 2018; 33: 706-20.e9. 1162 81. Hu S, Wan J, Su Y, Song Q, Zeng Y, Nguyen HN, et al. DNA methylation presents 1163 distinct binding sites for human transcription factors. Elife. 2013; 2: e00726. 1164 82. Domcke S, Bardet AF, Ginno PA, Hartl D, Burger L, Schübeler D. Competition between 1165 DNA methylation and transcription factors determines binding of NRF1. Nature. 2015. 1166 83. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. 1167 Nature Reviews Genetics. 2012; 13: 484-92. 1168 84. You JS, Kelly TK, De Carvalho DD, Taberlay PC, Liang G, Jones PA. OCT4 establishes 1169 and maintains nucleosome-depleted regions that provide additional layers of epigenetic 1170 regulation of its target genes. Proceedings of the National Academy of Sciences. 2011; 108: 1171 14497-502. 1172

37