Author Manuscript Published OnlineFirst on March 19, 2021; DOI: 10.1158/0008-5472.CAN-20-3445 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Title:
Genome-wide DNA methylation profiling of esophageal squamous cell carcinoma from global high incidence regions identifies crucial and potential cancer markers

Author’s list:
Fazlur Rahman Talukdar1, Sheila C Soares Lima2, Rita Khoueiry1, Ruhina Shirin Laskar1, Cyrille Cuenin1, Bruna Pereira Sorroche1,3, Anne-Claire Boisson1, Behnoush Abedi-Ardekani1, Christine Carreira1, Diana Menya4, Charles P. Dzamalala5, Mathewos Assefa6, Abraham Aseffa7, Vera Miranda-Gonçalves8, Carmen Jerónimo8, Rui M Henrique8, Ramin Shakeri9, Reza Malekzadeh9, Nagla Gasmelseed10, Mona Ellaithi11, Nitin Gangane12, Daniel RS Middleton1, Florence Le Calvez-Kelm1, Akram Ghantous1, Maria Leon Roux1, Joachim Schüz1, Valerie McCormack1, M. Iqbal Parker13, Luis Felipe Ribeiro Pinto2, Zdenko Herceg1

Author’s affiliations:
1. International Agency for Research on Cancer, Lyon, France;
2. Department of Molecular Carcinogenesis, Brazilian National Cancer Institute, Rio de Janeiro, Brazil;
3. Molecular Oncology Research Center, Barretos Cancer Hospital, Barretos, Brazil;
4. Moi University, Eldoret, Kenya;
5. University of Malawi, Blantyre, Malawi;
6. Addis Ababa University, Addis Ababa, Ethiopia;
7. Armauer Hansen Research Institute, Addis Ababa, Ethiopia;
8. Department of Pathology and Cancer Biology & Epigenetics Group, Portuguese Oncology Institute of Porto & Biomedical Sciences Institute of University of Porto, Portugal;
9. Digestive Disease Research Institute, Tehran University of Medical Sciences, Tehran, Iran;
10. Department of Molecular Biology, National Cancer Institute, University of Gezira, Sudan;
11. Department of Histopathology and Cytology, Al-Neelain University, Khartoum, Sudan;
12. Mahatma Gandhi Institute of Medical Sciences, Sevagram, India;
13. Integrative Biomedical Sciences and IDM, University of Cape Town, Cape Town, South Africa

Running title:
Tumor-specific DNA methylation events in ESCC

Keywords: DNA methylation, esophageal squamous cell carcinoma, epigenetics, high incidence populations


Downloaded from cancerres.aacrjournals.org on October 3, 2021. © 2021 American Association for Cancer Research.

Correspondence to:
Dr. Zdenko Herceg
International Agency for Research on Cancer (IARC),
150 Cours Albert-Thomas, 69008 Lyon Cedex 08, France,
E-mail: [email protected],
Tel. +33 4 72 73 83 98;
&
Dr. Fazlur Rahman Talukdar
International Agency for Research on Cancer (IARC),
150 Cours Albert-Thomas, 69008 Lyon Cedex 08, France,
E-mail: [email protected],
Mobile: +33-620008958

Conflict of interest statement: There are no conflicts of interest to declare.

Number of figures and tables: 6 figures and 1 table
Supplementary figures and tables: 7 figures and 16 tables

Disclaimer:
Where authors are identified as personnel of the International Agency for Research on Cancer / World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer / World Health Organization.


ABSTRACT

Epigenetic mechanisms such as aberrant DNA methylation (DNAme) are known to drive esophageal squamous cell carcinoma (ESCC), yet they remain poorly understood. Here we studied tumor-specific DNAme in ESCC cases from nine high incidence countries of Africa, Asia, and South America. Infinium MethylationEPIC array profiling was performed on 108 tumors and 51 normal tissues adjacent to the tumors (NAT) in the discovery phase, and targeted pyrosequencing was performed on 132 tumors and 36 NAT in the replication phase. Top genes for replication were prioritized by weighting methylation results using RNA-seq data from TCGA and GTEx and validated by qPCR. Methylome analysis comparing tumor and NAT identified 6,796 differentially methylated positions (DMPs) and 866 differentially methylated regions (DMRs) with at least a 30% methylation difference (Δβ). The majority of identified DMPs and DMRs were hypermethylated in tumors, particularly in promoters and gene-body regions of genes involved in transcription activation. The top three prioritized genes for replication, PAX9, SIM2, and THSD4, had similar methylation differences in the discovery and replication sets. These genes were exclusively expressed in normal esophageal tissues in GTEx and downregulated in tumors. The specificity and sensitivity of these DNAme events in discriminating tumors from NAT were assessed. Our study identified novel, robust, and crucial tumor-specific DNAme events in ESCC tumors across several high incidence populations of the world. Methylome changes identified in this study may serve as potential targets for biomarker discovery and warrant further functional characterization.


SIGNIFICANCE

This study, the largest genome-wide DNA methylation analysis of ESCC from high incidence populations of the world, identifies functionally relevant and robust DNAme events that could serve as potential tumor-specific markers.


INTRODUCTION

Esophageal cancer is the seventh most common cancer worldwide, with an estimated 572,000 new cases and 509,000 deaths in 2018 (1). Esophageal squamous cell carcinoma (ESCC, the major subtype of esophageal cancer) is one of the most aggressive and lethal forms of cancer and is often diagnosed at later stages, resulting in high mortality rates. ESCC arises from squamous epithelial cells of the esophagus, whereas esophageal adenocarcinoma (EAC, the subtype commonly occurring in high-income countries of North America and Europe) originates from glandular cells near the lower esophageal sphincter. ESCC exhibits remarkable geographical differences in its incidence rate, the reasons for which are poorly understood. The highest incidences of ESCC are observed in many of the low- and middle-income countries (LMICs) of East Africa, Asia, and South America, and this disparity in incidence rates may be attributed to differences in environmental and dietary factors that are prevalent in the high-risk areas (1,2).

Unlike other gastrointestinal cancers, there is limited understanding of ESCC etiology and molecular mechanisms of carcinogenesis, especially from high-incidence countries (1,3). In addition to the high incidence of ESCC in these LMICs, this cancer is difficult to diagnose at early stages and to treat successfully (3), emphasizing the necessity for a better understanding of the underlying molecular mechanisms and the identification of molecular markers for early detection and theranostics for better management of the condition (4).

Epigenetic deregulation and the resulting changes are among the major hallmarks of cancer (4-6). Although many studies have identified the presence of aberrant DNA methylation (DNAme) in ESCC, these studies are often limited by genomic coverage, absence of control tissues, sample size, high ethnic diversity, and lack of follow-up functional characterization of epigenetic changes driving ESCC development (6,7). The largest study to date in terms of sample size was from The Cancer Genome Atlas (TCGA), which included methylome data of 96 ESCC tumors and was aimed at comparing the molecular features of ESCC and EAC (8). In addition, all of these studies focused on individual CpGs and did not address differential methylation across possible functional genomic regions involved in gene transcriptional regulation. The functional significance of the identified target genes was not analyzed either. Moreover, the majority of the genome-wide studies focused on genetic alterations predominantly from Chinese populations, and the representation of other high incidence regions like East Africa or South America is scant. Therefore, methylome studies on paired tumor and adjacent non-tumor tissues from neglected high incidence populations of ESCC with sufficient power and coverage are of obvious importance for both identifying robust DNAme markers and studying epigenetic mechanisms driving ESCC development.

To address these limitations, we conducted the first methylome-wide study, using the latest HM850K methylation array with the highest methylome coverage, on a large set of ESCC tumor and adjacent non-tumor tissues from high incidence, understudied populations of East Africa and South America. To account for functional significance, we selected candidate genes/regions enriched for differentially methylated CpGs and examined their differential expression in cancer tissues and their expression patterns in normal mucosa, using in-house data as well as data mining of various publicly available databases. We validated the robustness of our top candidates on an additional set of ESCC samples from other high incidence regions across the four continents.


MATERIALS AND METHODS

Study design, study population, and sample preparation
The overall study design is summarized in Figure 1A. Briefly, it comprised a standard discovery-replication design and included 108 tumors and 51 normal tissues adjacent to the tumor (NAT) in the discovery series and an independent set of 132 tumors and 36 NAT in the replication set. Samples were collected from collaborating centers in 9 countries, and the associated details are summarized in Figures 1B and 1C. High incidence and lack of molecular studies were the primary criteria for the inclusion of populations in the discovery and replication sets. Samples from Malawi, Brazil, South Africa, Ethiopia, Sudan, and Kenya were included in the discovery phase, and samples from Brazil, Iran, Ethiopia, India, Kenya, and Portugal were included in the replication phase.

ESCC patients included in the present study were recruited in case-control and complete case series studies (9-11) from various collaborating centers in association with major public-sector hospitals in the respective locations. Details of the sample collection sites and characteristics of the study participants are provided in Supplementary Tables S1 and S2, respectively. Each patient provided written informed consent to the study. The study was approved by the local Institutional Review Boards (IRB) of all the collaborating centers and by the International Agency for Research on Cancer (IARC) Ethical Committee (IEC; approval number: 16-25). All the included cases were histologically confirmed ESCC.

Tumor tissues were obtained by endoscopic biopsy or surgical excision and were processed as Formalin-Fixed Paraffin-Embedded (FFPE) and/or fresh-frozen tissues, whereas NATs were collected from pathologically confirmed negative margins of the ESCC samples. One 5-micron section from each of these tissues was processed for pathological examination by haematoxylin and eosin (H&E) staining, and the pathology was confirmed by experienced pathologists, according to the Digestive System Tumors (5th edition) of the World Health Organization (WHO) series on the classification of human tumors, before the tissue was designated as tumor or normal for inclusion in the current study. The inclusion criteria were as follows: samples containing more than 50% neoplastic cells were included as tumor tissue, and samples consisting of 100% normal esophageal mucosa from negative margins of ESCC samples were considered NAT.

Sample processing for HM850K array
Genomic DNA from tumor and NAT was extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA). The isolated DNA was then quantified with a Qubit 3 Fluorometer using the Qubit™ dsDNA BR Assay Kit (Thermo Fisher Scientific) as per the manufacturer’s protocol. Samples with more than 500 ng DNA were included. The DNA from FFPE samples was subjected to quality control (QC) checks with the Illumina FFPE QC Kit, as per the manufacturer’s protocol. All DNA that passed the preliminary QC steps was then bisulfite converted, using 500 ng of the isolated DNA from each sample and the EZ DNA Methylation Kit (Zymo Research, Irvine, CA) as described by the manufacturer. DNA isolated from FFPE samples was additionally restored with the Illumina Infinium HD FFPE Restoration kit as per the protocol (12) before being processed on the HM850K array as per the manufacturer’s instructions (13).

Genome-Wide DNAme analysis:
The Infinium MethylationEPIC BeadChip microarray (HM850K) (Illumina, San Diego, CA) (13) was used to determine the genome-wide DNAme profile at single-base resolution, covering over 850,000 CpG sites. Raw data files were pre-processed and normalized using the “minfi” Bioconductor package in the R programming language with RStudio (14). CpG sites with poor detection (detection P > 0.05) or missing data for >10% of samples were excluded. Similarly, samples with missing data or overall low confidence for >10% of CpG sites were also removed from the analysis. Probes with single nucleotide polymorphisms (SNPs) (15), cross-reactive probes (16), and probes from the X and Y chromosomes were also excluded. Finally, a total of 706,476 probes were retained for downstream analysis. Each CpG site was assigned a β value, defined as the ratio of the methylated probe signal intensity (M) to the sum of the methylated and unmethylated (U) probe signal intensities: β = M / (M + U). The β value ranges from 0 to 1, with 0 being unmethylated and 1 fully methylated. In some analyses, β values were converted to M-values, calculated as log2(M/U), for normalization (17,18). Batch correction and adjustment for potential confounding variables (age, sample type, country, etc.) and other unknown sources of variation were performed using the “SVA” package implemented in R (19). The additional raw datasets generated in this study are available in the GEO repository under the accession number GSE164083.
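To make the β and M-value definitions above concrete, the following minimal Python sketch computes both from probe intensities and applies a detection P-value filter. It is not the minfi implementation used in the study: the function names, the clipping constant `eps`, and the reading of the probe filter (a probe is dropped when more than 10% of samples fail detection or are missing) are assumptions made for illustration.

```python
import math

def beta_value(meth, unmeth):
    """beta = M / (M + U); returns None when the probe has no signal at all."""
    total = meth + unmeth
    return meth / total if total > 0 else None

def m_value(beta, eps=1e-6):
    """M-value = log2(M / U) = log2(beta / (1 - beta)).

    beta is clipped to [eps, 1 - eps] (an assumed safeguard) so the
    logarithm stays finite for fully (un)methylated probes.
    """
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def keep_probe(detection_p, p_cutoff=0.05, max_failed_fraction=0.10):
    """Keep a probe unless more than 10% of samples fail detection.

    A sample fails when its detection P-value is missing (None) or
    greater than p_cutoff.
    """
    failed = sum(1 for p in detection_p if p is None or p > p_cutoff)
    return failed / len(detection_p) <= max_failed_fraction
```

For example, `beta_value(300, 100)` gives 0.75, and `m_value(0.5)` gives 0.0, reflecting equal methylated and unmethylated signal.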

Differential methylation analysis and statistical methods:
The analysis consists of two sequential elements, starting with SVA followed by robust regression analysis while controlling for multiple testing. The “SVA” package corrects for known and unknown surrogate variables, including batch effects. In this analysis, 6 known variables (batch, Sentrix position, country, sample type, tobacco, alcohol) and 17 unknown significant surrogate variables were corrected for by SVA. Robust linear regression was used to find differential methylation between tumor and normal tissues after adjusting for surrogate variables using the “limma” R package (20). Additionally, variables such as batch, country, sample type, age, and gender were included in the regression model for further adjustment during the regression analysis. We set a stringent filter of 30% Δβ (at least a 30% difference in methylation) to identify differentially methylated positions (DMPs) after correcting for multiple testing with a False Discovery Rate (FDR) <0.05 (21). Differentially methylated regions (DMRs) between tumor and normal were obtained using the DMRcate package, requiring ≥2 CpGs within a 500 bp window (22). Further, to rank the DMRs, we assigned each DMR a score ($S_{DMR}$) as:

$$S_{DMR} = m\Delta\beta \times (-\log_{10} P) \times N_{CpG}$$

where $m\Delta\beta$ is the absolute mean β difference of the CpGs in a DMR between tumor and normal, $-\log_{10} P$ is the log-transformed P-value of the difference, and $N_{CpG}$ is the number of CpGs in the DMR.
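The DMP filter and the DMR ranking score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used limma and DMRcate in R, whereas the helper names here (`bh_adjust`, `call_dmps`, `dmr_score`) are hypothetical, the FDR is taken to be Benjamini-Hochberg, and the floor on P (`p_floor`) is an assumed safeguard against taking the logarithm of zero.

```python
import math

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_dmps(delta_beta, pvalues, min_delta=0.30, fdr=0.05):
    """Indices of CpGs with |delta beta| >= 0.30 and adjusted P < 0.05."""
    q = bh_adjust(pvalues)
    return [i for i, (d, qi) in enumerate(zip(delta_beta, q))
            if abs(d) >= min_delta and qi < fdr]

def dmr_score(mean_delta_beta, p_value, n_cpg, p_floor=1e-300):
    """S_DMR = |mean delta beta| x (-log10 P) x N_CpG."""
    return abs(mean_delta_beta) * -math.log10(max(p_value, p_floor)) * n_cpg
```

Because the score multiplies effect size, significance, and region size, a DMR with many CpGs and a very small P-value ranks well above a two-CpG DMR with the same mean Δβ.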

Gene expression analysis from online databases:
We used normalized RNA sequencing expression data from 96 ESCC tumors and 509 normal esophageal samples from the TCGA and GTEx projects, respectively, using the standard processing pipeline of RNAseqDB (23), which processes and unifies RNA-seq data from both sources after uniform realignment, gene expression quantification, and batch effect removal. Briefly, the pipeline uses STAR to align sequencing reads, RSEM and featureCounts to quantify gene expression, mRIN to evaluate sample degradation, RSeQC to measure sample strandedness and quality, and SVAseq for batch correction.

For differential gene expression (DGE) between TCGA and GTEx, we limited our analysis to the genes associated with significant DMRs in the methylation analysis. We used linear models to quantify the difference in expression after correcting for multiple testing.

We assigned each gene a differential expression score ($S_{DGE}$), taking into account both the magnitude and significance of the difference, as:

$$S_{DGE} = \mathrm{AbsFC} \times (-\log_{10} P)$$

where AbsFC is the absolute fold change between tumor and normal, and $-\log_{10} P$ is the log-transformed P-value of the difference. We also downloaded RNA-seq expression data across all sites from GTEx and compared them with the esophageal RNA-seq data from GTEx in order to identify tissue-specific expression. For each of the selected genes, a tissue-specific expression score ($S_{TSE}$) was calculated simply as:

$$S_{TSE} = \frac{\mathrm{TPM}_{\text{esophagus}}}{\text{average } \mathrm{TPM}_{\text{all other sites}}}$$

where TPM denotes transcripts per million reads. For weighting DMR statistics, we calculated an expression score (ES) as the linear combination of the differential expression score and the tissue-specific expression score for each gene:

$$ES = S_{DGE} + S_{TSE}$$

Prioritization of targets: criteria and selection of top targets:
In order to prioritize targets and genes with probable functional relevance for replication and expression studies, we ranked the targets by their $S_{DMR}$ scores weighted by their respective expression scores as:

$$\text{Combined Score (CS)} = S_{DMR} \times ES$$
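The scoring scheme above can be illustrated with a short Python sketch that computes the differential expression score, the tissue-specific expression score, and the combined score, then ranks genes. The function names, the dictionary keys, the floor on P, and the handling of a zero denominator are assumptions introduced for the example; they are not taken from the authors' code.

```python
import math

def dge_score(abs_fold_change, p_value, p_floor=1e-300):
    """S_DGE = AbsFC x (-log10 P); P is floored (assumed) to keep log10 finite."""
    return abs_fold_change * -math.log10(max(p_value, p_floor))

def tse_score(tpm_esophagus, tpm_other_sites):
    """S_TSE = TPM in esophagus / average TPM across all other sites."""
    mean_other = sum(tpm_other_sites) / len(tpm_other_sites)
    if mean_other == 0:
        # Assumed convention: no expression elsewhere means unbounded specificity.
        return math.inf
    return tpm_esophagus / mean_other

def combined_score(s_dmr, s_dge, s_tse):
    """CS = S_DMR x ES, where ES = S_DGE + S_TSE."""
    return s_dmr * (s_dge + s_tse)

def rank_targets(genes):
    """Sort gene records (dicts with s_dmr, s_dge, s_tse) by CS, highest first."""
    return sorted(
        genes,
        key=lambda g: combined_score(g["s_dmr"], g["s_dge"], g["s_tse"]),
        reverse=True,
    )
```

Since ES is a sum while CS is a product, a gene needs both a strong DMR and some expression evidence (differential or tissue-specific) to rank highly.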

Targeted methylation by Pyrosequencing:
The selected top three genes were then subjected to targeted DNAme replication using pyrosequencing on an additional 132 tumors and 36 NAT samples (Supplementary Table S2). DNA samples (100 ng) were treated with sodium bisulfite using the EZ-96 DNA Methylation-Gold Kit (Zymo Research, Irvine, CA) as per the manufacturer’s protocol. The bisulfite-treated DNA was then processed as per the standard protocol described previously (24). The primer details for the pyrosequencing are provided in Supplementary Table S3.

TCGA methylation analysis:
Infinium Human Methylation 450K BeadChip (HM450K) IDAT files and clinical data files for 96 ESCC tumors and 13 NAT (3 paired with tumor, 10 unpaired) were downloaded from the GDC data portal (National Cancer Institute-NIH) using the TCGAbiolinks R package (23). Of the 96 tumors, 5 were excluded due to missing clinical data. The IDAT files were subjected to pre-processing, normalization, and QC steps similar to those applied to the discovery set HM850K files. Methylation β values of tumor and normal were compared using an unpaired t-test for each of the CpGs selected for replication. We performed two regression analyses of the TCGA HM450K data. The first analysis compared tumors (n=91) with NAT (n=13) to determine tumor-specific DNAme. The second analysis compared early-stage ESCC cases, consisting of stage I and II patients (n=61), with NAT (n=13).

Gene-specific qPCR for functional validation of the targets:
From the discovery set of samples, RNA was also isolated using the AllPrep DNA/RNA Mini Kit (Qiagen, Valencia, CA), and 56 tumors and 29 NAT samples yielded sufficient RNA for further use. The inclusion criteria were the same for each set of samples, and the isolated RNA was used to validate the impact of DNAme on gene expression (Supplementary Table S4). The RNA was assessed using a NanoDrop 8000 Spectrophotometer (Thermo Fisher Scientific). 500 ng of RNA was then used to synthesize cDNA using an M-MLV Reverse Transcriptase kit (Promega Corporation, Madison, USA) as per the manufacturer’s protocol. The cDNA was used to measure the gene expression of the top three genes (primer details in Supplementary Table S5) relative to GAPDH (housekeeping gene) as per the standard protocol described previously (25).
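As a rough illustration of expression measured relative to a housekeeping gene, the Python sketch below uses the common 2^-ΔCt and 2^-ΔΔCt conventions. The manuscript only refers to a previously described standard protocol (25), so the choice of this particular calculation, the function names, and the assumption of perfect amplification efficiency are all assumptions of this example.

```python
def relative_expression(ct_target, ct_gapdh):
    """2^-(Ct_target - Ct_GAPDH): target abundance relative to GAPDH.

    Assumes perfect amplification efficiency (a doubling per cycle).
    """
    return 2.0 ** -(ct_target - ct_gapdh)

def fold_change(ct_target_tumor, ct_gapdh_tumor, ct_target_nat, ct_gapdh_nat):
    """2^-ddCt: GAPDH-normalized expression in tumor relative to NAT."""
    tumor = relative_expression(ct_target_tumor, ct_gapdh_tumor)
    nat = relative_expression(ct_target_nat, ct_gapdh_nat)
    return tumor / nat
```

With these conventions, a target that crosses threshold two cycles later in tumor than in NAT, at equal GAPDH Ct, corresponds to a fold change of 0.25, that is, a four-fold downregulation.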

Marker assessment to discriminate ESCC from normal tissue
To assess the potential of the top three prioritized DMRs as markers for ESCC, Partial Least Squares Discriminant Analysis (PLS-DA) was performed to predict tumor or NAT tissue based on the β values of the 48 differentially methylated CpGs from these genes. PLS-DA is a supervised method: it uses the DNAme β values as independent variables and the outcome (here, tumor or NAT) as the dependent variable (26). The discriminant analysis determines the best combination of CpGs, and the final model is chosen based on the lowest error rate and the lowest possible number of predictors (CpGs). The performance of the identified predictor CpGs was estimated by plotting receiver operating characteristic (ROC) curves. The performance was then validated with TCGA samples by plotting additional ROC curves.
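The ROC-based assessment can be illustrated with a small Python sketch that computes sensitivity, specificity, and the area under the ROC curve from per-sample classifier scores. It does not implement PLS-DA itself; the scores stand in for whatever a fitted model would output, and the function names and the label coding (1 = tumor, 0 = NAT) are assumptions of this example.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called tumor.

    labels: 1 = tumor, 0 = NAT.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(scores, labels):
    """AUC as the probability that a random tumor outscores a random NAT.

    Ties between a tumor and a NAT score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Sweeping `threshold` over the observed scores and plotting sensitivity against 1 - specificity traces the ROC curve whose area `roc_auc` summarizes.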


RESULTS

Patient characteristics
The study design overview and sample characteristics are shown in Figure 1. The mean age of ESCC patients was 54.93 years (ranging from 25 to 90) in the discovery set and 60.64 years (ranging from 29 to 87) in the replication set for pyrosequencing. Of the ESCC cases, 47% were male and 53% female in the discovery set, and 63% were male and 37% female in the replication set. In the discovery set, 54% and 50% of the cases were tobacco and alcohol consumers, respectively; in the replication set, 59% and 47% of the cases consumed tobacco and alcohol, respectively. The details of the patient characteristics are shown in Supplementary Table S2. This study was approved by the Institutional Review Boards of all local collaborating centers and the IARC Ethical Committee.

Identifying tumor-specific DMPs
A total of 6,796 tumor-specific DMPs spanning 2,737 unique genes were identified with a 30% Δβ DNAme difference (FDR <0.05) between tumors and NATs (Figure 2A). Principal Component Analysis (PCA) based on the DNAme status of the identified DMPs separated tumors from NATs, with tumors being more heterogeneous. No country or population-specific clustering was seen in the PCA plot using the identified DMPs (Figure 2B). There was an increased number of hypermethylated CpGs within 1500 bp upstream of transcription start sites and in gene bodies, whereas intergenic regions, particularly open-sea regions, had more hypomethylated CpGs (Figure 2C and 2D). The distribution of DMPs was relatively uniform across all autosomes, with 5,618 hypermethylated CpGs (82.7%) and 1,178 hypomethylated CpGs (17.3%) in tumors compared to NATs (Figure 2E, Supplementary Table S6). Unsupervised hierarchical clustering with the significant DMPs separated tumors from NAT markedly well, with little misclassification (Figure 2F). Among the 2,737 differentially methylated genes, there were several known tumor suppressors or oncogenes, including ADAMTS9, ADAMTS18, RNASET2, EPAS1, FHIT, RUNX3, RASSF1, ZNF382, FGFR1, KDM2A, etc. (Supplementary Table S6). Top hypermethylated CpGs were cg10752421 (SLC7A1, Δβ=54%, p=1.8 × 10⁻⁴³), cg19126169 (NUMA1, Δβ=52%, p=4.8 × 10⁻⁴⁸), cg04415798 (PAX9, Δβ=54%, p=1.8 × 10⁻⁴³), and cg11634930 (MKNK2, Δβ=52%, p=5.7 × 10⁻⁵⁸).

Enrichment analysis of the genes containing differentially methylated CpGs identified several important functional pathways and ontologies. The genes containing hypermethylated CpGs were enriched for several cancer-related pathways in KEGG, such as pathways in cancer, Hippo signaling pathways, and WNT signaling pathways, among others (Figure 2G, Supplementary Table S7). Gene ontology for the molecular function was enriched particularly for DNA binding and transcription activity and regulation functions, which might contribute to gene expression alterations. The genes containing hypomethylated CpGs were enriched for nervous system, neurotransmitter, and addiction-associated pathways, such as circadian entrainment, dopaminergic synapse, cholinergic synapse, and morphine and amphetamine addiction pathways.

Identifying tumor-specific DMRs
Genome-wide differential methylation analysis of specific regions (DMRs), requiring at least 2 CpGs within a 500 bp window and 30% Δβ (FDR <0.05) between tumor and normal tissues, identified 866 DMRs in autosomes. Hypermethylated DMRs vastly outnumbered hypomethylated DMRs, with 788 (91%) hypermethylated and 78 (9%) hypomethylated DMRs in tumors compared to NATs (Figure 3A). A PCA based on the DNAme status of the CpGs in the identified DMRs separated tumors from NATs, with tumors being more heterogeneous. No country or population-specific clustering was seen in the PCA plot (Supplementary Figure S1). A total of 45 (5%) DMRs consisted of 6 or more CpGs, up to a maximum of 21 CpGs; 244 (28%) consisted of 3 to 5 CpGs; and the rest contained 2 CpGs (Supplementary Table S8). Among the 866 DMRs, 735 were associated with gene loci and involved 634 unique genes (564 were unique gene-DMR pairs and 70 genes contained 171 DMRs), and 131 non-genic DMRs failed to map to any gene (Supplementary Table S8). Since around 67% of the DMRs consisted of 2 CpGs with 30% Δβ, we tested for the presence of other differentially methylated CpGs in these DMRs by reducing the methylation difference to 10% and increasing the number of CpGs to ≥5 (Supplementary Table S9). We found that 58% of the DMRs with 2 CpGs and 30% Δβ overlapped with the DMRs obtained when the Δβ was reduced to 10% and the number of CpGs was increased to ≥5 (Supplementary Figure S2). The DNA binding and transcription activator activity pathway was significantly enriched among the DMR genes (Figure 3B), and approximately 50-60% of the hypo- and hypermethylated DMRs were in promoters and enhancers (Figure 3C). The top DMRs ranked by DMR score ($S_{DMR}$) were PAX9 (mean Δβ=41%; p value=1.83 × 10⁻³⁰⁰) and SIM2 (mean Δβ=40%; p value=1.83 × 10⁻³⁰⁰), having 21 CpGs each, followed by MEIS1 (mean Δβ=44%; p value=1.83 × 10⁻³⁰⁰) involving 18 CpGs, the microRNA gene MIR23B (mean Δβ=42%; p value=1.83 × 10⁻³⁰⁰), and a 555 bp non-genic region on chromosome 6 with 12 CpGs (Supplementary Table S8).

Tumor-specific DMRs from TCGA HM450K data analysis
We next evaluated the overlap of the DMRs identified in our discovery set with the tumor-specific DMRs identified from 96 ESCC and 13 NAT samples available in TCGA. Although limited by the much lower genome coverage of the HM450K array, which has almost half the number of CpGs of the HM850K array, and by fewer NAT samples for comparison, we identified 5,063 DMRs spanning 2,935 unique genes (Supplementary Table S10). Comparison of these 2,935 genes with the 634 DMR-associated genes from our discovery set revealed a significant overlap (323/634, 51%), indicating that more than half of these DMR-associated genes are frequently aberrantly methylated across different populations and are largely consistent across the HM450K and HM850K array datasets (Supplementary Figure S3). Since the probes on the HM850K array cover >90% of the sites on the HM450K array, the lack of complete overlap is unlikely to be explained by differences in CpG probe content between the two arrays (27).

In order to predict whether our identified tumor-specific DNAme changes are early events during the development of ESCC, we compared early-stage ESCC (stage I and II; n=61) with NAT (n=13) samples from TCGA. A total of 1,647 DMRs spanning 1,038 unique genes were observed between early-stage tumors and NAT, of which 165 (~26%) were in common with the previously identified DMRs from our discovery set (866 DMRs spanning 634 genes) (Supplementary Table S11). Therefore, these 165 tumor-specific DMRs obtained in the discovery phase could be early events that might play a crucial role in ESCC carcinogenesis (Supplementary Figure S4A and S4B).

These results provide convincing evidence of frequent and robust DNAme alterations in esophageal cancer tissues in comparison with the normal mucosa, which are consistent across tumors from various populations. Since almost one-fourth of these alterations were also found among the TCGA early-stage ESCC DMRs, these DNAme alterations seem to be crucial in the initiation of ESCC development.
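The overlap proportions quoted above follow directly from the reported counts. The short Python sketch below shows how such an overlap might be tallied from two gene lists and reproduces the two percentages; the function name and the toy gene identifiers are hypothetical, and no statistical test of the overlap is implied.

```python
def overlap_summary(discovery_genes, reference_genes):
    """Return (number of shared genes, fraction of discovery genes shared)."""
    discovery = set(discovery_genes)
    shared = discovery & set(reference_genes)
    return len(shared), len(shared) / len(discovery)

# Toy identifiers (hypothetical), used only to exercise the function.
toy_discovery = ["gene_a", "gene_b", "gene_c", "gene_d"]
toy_reference = ["gene_a", "gene_b", "gene_x"]
n_shared, fraction = overlap_summary(toy_discovery, toy_reference)

# Percentages recomputed from the counts reported in the text.
pct_all_stage = round(100 * 323 / 634)    # overlap with all TCGA ESCC DMR genes
pct_early_stage = round(100 * 165 / 634)  # overlap with early-stage DMRs
```

Both proportions use the 634 DMR-associated genes of the discovery set as the denominator, which is why they are directly comparable.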

Evaluation of the impact of differential methylation on associated gene expression

To understand the functional impact of the differentially methylated genes in ESCC, we next studied the expression of these genes in ESCC and normal esophageal tissues from TCGA and GTEx, respectively. Expression data were available from TCGA and GTEx for 559/634 genes. Figure 4A depicts the average methylation of the DMRs associated with these genes, their differential and tissue-specific expression, and the weighting scores for each of them. A total of 430 genes were differentially expressed between tumor and normal esophageal tissues (Figure 4A and Supplementary Table S12). A total of 334 top DMR genes exhibited an inverse correlation with expression patterns, i.e. genes hypermethylated in tumors compared with normal tissues were down-regulated in tumor tissues, and hypomethylated genes were up-regulated in tumors (Supplementary Table S13 and Supplementary Figure S5). When tissue-specific expression was further compared between esophageal tissues and tissues from other body sites in GTEx, 56 genes had at least 2-fold higher expression in esophageal mucosa than at other sites. Among the top genes with esophagus-specific expression were IL1RN, PAX9, KRT32, SIM2, and TRIM29 (Supplementary Table S14). DNAme and altered RNA expression in genes expressed exclusively or highly in the esophagus suggest a crucial role of tumor-specific DNAme patterns in ESCC carcinogenesis. Further, SLC9A3, HOXA13, EPN3, THSD4, and PHYHD1 were among the top genes ranked by the expression score (ES), a measure combining the effects of differential expression and tissue-specific expression for each gene (Figure 4A and Supplementary Table S15).
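The inverse-correlation criterion described above (hypermethylated and down-regulated, or hypomethylated and up-regulated) amounts to a sign comparison. The sketch below assumes each gene is summarized by a mean Δβ (tumor minus NAT) and a log2 fold change (tumor versus normal); the gene labels and values are hypothetical and only illustrate the rule.

```python
def is_inversely_correlated(delta_beta, log2_fold_change):
    """True when methylation and expression change in opposite directions."""
    return delta_beta * log2_fold_change < 0


# Hypothetical genes: (mean delta-beta, log2 fold change). Not study data.
genes = {
    "GENE_A": (0.40, -3.5),   # hypermethylated and down-regulated
    "GENE_B": (-0.35, 2.0),   # hypomethylated and up-regulated
    "GENE_C": (0.38, 1.2),    # hypermethylated but up-regulated
}
inverse = sorted(
    name for name, (db, lfc) in genes.items() if is_inversely_correlated(db, lfc)
)
print(inverse)  # ['GENE_A', 'GENE_B']

# Share of genes with expression data that show the inverse pattern (334 of 559).
print(f"{334 / 559:.0%}")  # 60%
```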

Prioritization and replication of candidate genes

We next prioritized candidate genes for replication as well as for understanding functional significance in terms of expression changes associated with DNA methylation deregulation. To this end, we combined gene-based methylation statistics (SDMR), information regarding tissue-specific expression patterns, and differential expression between tumor and normal tissues (ES) in a combined score (CS) (Figure 4A and Supplementary Table S15). Based on this score, the top three genes were SIM2, PAX9, and THSD4; the top 20 prioritized genes are listed in Table 1. From the list, we selected the top three target DMRs, which were hypermethylated in tumors and whose genes were downregulated in TCGA ESCC tumors as compared with normal esophageal tissues from GTEx. SIM2 had a mean methylation difference of 41% (P < 1.83 × 10^-300) between tumor and NAT tissues, around 11-fold lower expression (P = 2.83 × 10^-68) in tumor than in normal tissues, and 24.6-fold higher expression in esophageal tissues than in other organs. Similarly, PAX9 and THSD4 had, respectively, 40% (P < 1.83 × 10^-300) and 38% (P = 8.35 × 10^-246) higher methylation, and 3.6-fold (P = 2.32 × 10^-16) and 21.4-fold (P = 1.33 × 10^-112) lower expression in tumors, with very high esophagus-specific expression patterns (PAX9: 93.2-fold and THSD4: 6.1-fold higher expression in esophageal tissues compared with other organs) (Figure 4B).

To validate the results obtained from the discovery phase analysis of the HM850K array, we performed pyrosequencing analysis of the three prioritized genes (PAX9, SIM2, and THSD4) and obtained the DNAme status at the target region in an independent set of 168 samples (132 tumors and 36 NAT). All three genes showed significantly higher DNAme in tumors than in NAT (Figure 5A), which further reinforced the robustness of the results generated in the discovery phase. The results showed a 20 to 60% DNAme difference across the CpGs within the pyrosequenced region of all three genes (Supplementary Figure S6A, S6B and S6C).
The two CpGs (cg00762160 from the HM850K array) in the PAX9 region showed at least 20% higher methylation in tumors than in NAT (P=0.0008). Moreover, the DNAme difference across the 5 CpGs (cg21697851 from HM850K) of SIM2 was around 30% (P = 3.38 × 10^-08), which is almost identical to the beta difference obtained from the discovery phase analysis (Figure 5B). Pyrosequencing of the THSD4 region with two CpGs (cg05337779 from HM850K) showed a somewhat larger DNAme difference between tumor and NAT (53% in replication vs 38% in the discovery phase; P = 2.49 × 10^-26).

We further validated the gene expression results for these three selected genes by qPCR (Figure 5C). The results showed significantly lower PAX9 (-2.7-fold; P=0.0006) and SIM2 (-11.14-fold; P<0.0001) expression in ESCC tumor (n=56) than in NAT (n=29) samples. THSD4 followed the same trend of higher expression in NAT, although the difference in expression was only marginally significant (-2.2-fold; P=0.07).

Performance assessment of the prioritized DMRs as potential markers

To identify the best combinations of CpG markers from the three prioritized DMR genes, which together comprise 48 CpGs, we explored all combinations of CpGs that could have discriminatory power to distinguish between tumor and NAT samples using Partial Least Squares-Discriminant Analysis (PLS-DA). Seven CpGs showed the best predictive value in our models (CpG IDs) using the discovery set samples, with an AUC of 0.98 (95% CI = 0.97-0.99; P<0.00001), as shown in Figure 6A and 6B. To test the effectiveness of these potential markers in an independent set, we applied them to TCGA samples. Five of the CpGs (CpG IDs) were also present on the HM450K array (used in TCGA) and were used in the analysis (Supplementary Table S16). Since stage information was available in TCGA, we performed one analysis including stage I and II samples (n=62) versus normal (n=13) to assess the effectiveness of the markers in early-stage tumors and another analysis with tumors of all stages (n=91) versus NAT (n=13). The analyses with TCGA early-stage and TCGA all-stage tumors showed AUCs of 0.90 (95% CI = 0.80-1.0; P<0.00001) and 0.89 (95% CI = 0.79-1.0; P<0.00001), respectively (Figure 6C, 6D and 6E). There was no difference in DNAme of the marker panel by age, gender, or country of origin in either dataset (Supplementary Figure S7A-S7E).
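The idea of scoring candidate CpG panels by their ability to separate tumor from NAT samples can be illustrated with a short sketch. It is not the PLS-DA model used here: as a simplifying assumption, each sample is scored by the mean β value over the CpGs in a panel, and the AUC is computed from pairwise rank comparisons. The function names, CpG labels, and β values are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def auc(tumor_scores, normal_scores):
    """Rank-based AUC: chance that a tumor outscores a normal sample (ties count 0.5)."""
    wins = 0.0
    for t in tumor_scores:
        for n in normal_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_scores) * len(normal_scores))


def panel_auc(panel, tumor, normal):
    """AUC of the mean beta value over the CpGs in `panel` (stand-in for a PLS-DA score)."""
    tumor_scores = [mean(sample[cpg] for cpg in panel) for sample in tumor]
    normal_scores = [mean(sample[cpg] for cpg in panel) for sample in normal]
    return auc(tumor_scores, normal_scores)


def best_panel(cpgs, size, tumor, normal):
    """Exhaustively evaluate every panel of `size` CpGs and return the top one."""
    return max(combinations(cpgs, size), key=lambda p: panel_auc(p, tumor, normal))


# Toy beta values for three hypothetical CpGs (not data from this study).
tumor = [
    {"cpg_1": 0.8, "cpg_2": 0.7, "cpg_3": 0.1},
    {"cpg_1": 0.6, "cpg_2": 0.9, "cpg_3": 0.2},
    {"cpg_1": 0.7, "cpg_2": 0.6, "cpg_3": 0.1},
]
normal = [
    {"cpg_1": 0.3, "cpg_2": 0.2, "cpg_3": 0.6},
    {"cpg_1": 0.4, "cpg_2": 0.3, "cpg_3": 0.7},
]

panel = best_panel(["cpg_1", "cpg_2", "cpg_3"], 2, tumor, normal)
print(panel, panel_auc(panel, tumor, normal))
```

In this toy example the two CpGs that are consistently more methylated in tumors form the best panel; a real analysis would replace the mean-β score with the fitted discriminant model and evaluate it on held-out samples.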


DISCUSSION

In the present study, we characterized the genome-wide methylation profiles of a large and unique set of ESCC tumors from patients living in high-incidence, underrepresented populations. This was combined with gene expression data from ESCC tumor and normal samples from TCGA and GTEx, and with targeted validation in an independent series of samples, to identify functionally relevant DNAme events and the associated genes and pathways involved in ESCC carcinogenesis. We found widespread DNA methylation changes and highlighted frequent, early, and functionally relevant novel differentially methylated target genes, consistent across populations, that might be potential drivers of ESCC development and progression.

We observed extensive hypermethylation across ESCC genomes, particularly at promoters and within gene bodies, and few hypomethylation changes, which were mainly enriched in intergenic and open sea regions of the genome. This dependence of differential CpG methylation on genomic location is consistent with the prevailing literature on several cancers (28,29). The genes mapped to the differentially methylated CpGs included genes with known oncogenic or tumor suppressor functions. For example, promoter methylation of RASSF1, FHIT, ADAMTS1, ADAMTS18, ZNF382, and RUNX3 has previously been implicated in ESCC (30). Accordingly, several pathways important for cancer development, such as the WNT and Hippo signaling pathways and inflammatory and cell communication pathways, were enriched. These findings are in line with our previous study of ESCC tumor vs NAT pairs from Brazil, which used one of the earliest high-throughput Illumina arrays with limited coverage of around 1,500 CpGs (6), as well as with two other studies exploring ESCC tumor DNAme using HM450K arrays with small sample sizes (n≤10) from specific Asian populations, which found similar enrichment of the WNT signaling pathway (7,31).

In contrast to most previous studies, which identified methylation of candidate genes associated with single CpGs in ESCC (32), we focused on genes associated with differentially methylated regions and identified a number of DMR genes across the cancer genome. By focusing on DMRs, we intended to identify methylation events that might have functional relevance in gene expression regulation. In addition to genic DMRs in promoters and enhancers, our study also found substantial methylation in introns and gene bodies, supporting the presence and regulatory functions of gene body DNAme in many cancers (33). Accordingly, several pathways involved in proximal promoter sequence-specific DNA binding and transcription activator activity were enriched among the DMRs. The only study reporting DMRs in ESCC found similar enrichment of gene body methylation and transcription regulation, though it used sequencing techniques in only 4 tumor and NAT pairs (34). In addition, more than half of the tumor-specific DMRs from the discovery set replicated in the TCGA DMRs despite the different populations and methylation array type (HM450K array). Around 26% of the tumor-specific DMRs were found to be early events, as identified through comparison with DMRs from early-stage tumors in TCGA. These findings are consistent with the DNAme events acting as early aberrations in ESCC development, although further studies are needed to substantiate this notion. The robustness and reproducibility of our identified tumor-specific DNAme events also represent attractive avenues for the development of assays targeting these methylation-based markers for improved early detection of ESCC.

Since our study identified tumor-specific DNAme alterations consistently across all tumors and populations, we reasoned that they may play a role in the ESCC development process. To further explore this possibility, we investigated the expression patterns of the DMR genes in silico in tumor and normal esophageal tissues. Expression data were available for 88% of the DMR genes (559/634); of these, 430 (77%) were differentially expressed between normal and tumor tissues and 334 (60%) showed directionally opposite expression patterns compared to methylation changes, supporting the notion of an inverse correlation of DNAme and gene expression (35,36). These genes are likely to be functionally important for esophageal cells, and disruption of their regulatory control by aberrant DNA methylation might act as a driver event in the malignant transformation of these cells.
Therefore, in order to identify functionally relevant methylation events, we weighted the DMRs by their corresponding expression patterns and prioritized genes for driver identification and future functional characterization. SIM2, PAX9, and THSD4 topped the list of prioritized genes, and several novel genes not previously reported in ESCC, such as THSD4, PHYHD1, GPT, KCNJ15, and TP53AIP1, were among the top 10 prioritized DMR-associated genes. To further support these findings, we performed qPCR on candidate genes to check whether the results were concordant with the in silico analysis. Among these DMR-associated differentially expressed genes, we validated three genes, PAX9, SIM2, and THSD4, which showed a trend of RNA expression consistent with that derived from the GTEx RNA-seq data. Many of these genes also showed esophagus-specific expression, including some of the top DMR genes such as SIM2 and PAX9.

DNAme patterns of the top three target genes were successfully replicated in an independent set of ESCC samples from Brazil, Portugal, Iran, Ethiopia, Kenya, and India. Our top DMR and one of the target genes is PAX9, which is well known for its role in tooth agenesis or hypodontia (37), and several studies suggest that tooth agenesis could be linked to cancer susceptibility (38,39). Our findings of hypermethylation and downregulation of PAX9 in ESCC as well as its exclusive expression in esophageal mucosa, coupled with previous reports of downregulation of PAX9 in esophageal adenocarcinoma, Barrett's esophagus (40), and head and neck cancers, suggest an important role in regulating squamous cell differentiation in the oro-esophageal epithelium. Therefore, epigenetic silencing of this gene could be a frequent event in upper aerodigestive tract cancers, contributing to oro-esophageal carcinogenesis (39,41,42). Similarly, SIM2 is a transcription factor of the basic helix–loop–helix–PER–ARNT–SIM family. Supporting our findings, DNAme-associated suppression of SIM2 expression in ESCC is considered a frequent event (43) and was reported in a previous study in as many as 90% of ESCC cases (n=60). SIM2 downregulation is also involved in chemoradiotherapy sensitivity in ESCC, suggesting this gene as a possible therapeutic target (43,44). It is noteworthy that one study identified four significantly differentially expressed genes in ESCC compared with NAT, two of which were SIM2 and THSD4, underscoring their roles in tumorigenesis (45). Although the functional role of THSD4 in ESCC is not known, it is a member of the extracellular calcium-binding family involved in cell adhesion and migration and was previously linked to tumorigenesis of hemangioblastoma (46) and to neoadjuvant chemoradiotherapy response in rectal cancer (47). The other top novel genes, PHYHD1 and GPT, have not previously been linked to carcinogenesis. However, loss of KCNJ15 expression in renal carcinoma promoted malignant phenotypes and was associated with poorer prognosis (48), and loss of TP53AIP1 expression was associated with the inhibition of cancer cell apoptosis in lung adenocarcinoma (49). Together, these studies on SIM2, PAX9, THSD4, KCNJ15, and TP53AIP1 suggest that their alteration in cancer is a common event that might be regulated by aberrant DNAme. Therefore, the upstream and downstream signaling pathways of these genes may provide potential therapeutic targets for ESCC and other related cancers.


One of the goals of this study was to discover non-random cancer-specific changes in DNAme that could discriminate ESCC tissue from normal samples. We identified a combination of 7 CpGs that could identify tumors with high sensitivity and specificity and has the potential to be used as ESCC markers. The marker performance on TCGA clinical samples (AUC 0.89) is comparable with the performance on the discovery data (AUC 0.98), given that the two were independent sets of tumor and normal tissues analyzed on different microarray platforms (450K vs 850K) and that the TCGA dataset contained only 5 of the CpGs in the marker panel. However, one limitation was the unavailability of additional independent cohorts of blood samples or other minimally invasive cell types from ESCC patients and healthy controls for marker validation.

In conclusion, our study is the first and largest to apply a methylome profiling approach to ESCC samples from unique high-incidence populations of the world, and we further replicated our results in cases from other populations using TCGA data. We identified novel, robust, early, and functionally relevant tumor-specific DNAme events in ESCC across tumors. Aberrant DNAme in tumors and the subsequent expression alterations of these genes (notably PAX9, SIM2, THSD4, and the additional genes), which are exclusively or highly expressed in normal esophageal mucosa, might be crucial events in the initiation of ESCC development. Our findings could serve as a reference for the functional characterization of these genes to explore their role in ESCC initiation and progression more deeply. Furthermore, the potential use of the identified CpGs as minimally invasive DNAme-based ESCC biomarkers, for example in samples collected minimally invasively using Cytosponge (50) or in circulating tumor DNA (ctDNA) from blood, should be prioritized in future early detection efforts. Since esophageal sponge cytology sampling has been shown to be feasible in African settings (51), the identified DNAme markers could also be tested and implemented for early detection in various high-incidence ESCC populations in low-resource settings.


ACKNOWLEDGMENTS

The work reported in this article was undertaken by FRT partly during the tenure of a Postdoctoral Fellowship from the International Agency for Research on Cancer, partially supported by the EC FP7 Marie Curie Actions – People – Co-funding of regional, national, and international programs (COFUND). The work in the Epigenetics Group at IARC is supported by grants from the Institut National du Cancer (INCa, France), the European Commission (EC) Seventh Framework Programme (FP7) Translational Cancer Research (TRANSCAN) Framework, the Fondation ARC pour la Recherche sur le Cancer (France), Plan Cancer-Eva-Inserm research grant, and La direction générale de l'offre de soins (DGOS), and INSERM (SIRIC LYriCAN, INCa-DGOS-Inserm_12563) to Z.H. VM-G was supported by NORTE-01-0145- 740 FEDER-000027 (ESTIMA). M.I.P. was jointly supported by the SAMRC with funds received from the National Department of Health and the MRC UK with funds from the UK Government's Newton Fund and GSK. The Kenyan case-control study was supported by the IARC and NIH/NCI (grant number R21CA191965). The Ethiopian case-control study was funded by the American Cancer Society (ACS) and the IARC.

AUTHOR CONTRIBUTIONS

FRT and ZH conceptualized, planned, and designed the experiments. FRT, SCSL, RK, RSL, and AG contributed to the analysis plan. FRT and ZH coordinated sample collection and shipment from all the collaborating centers. SCSL, DM, CD, MA, AA, VMG, CJ, RH, NG, ME, RS, RM, NG, DM, MLR, JS, and VM contributed to sample collection, processing, and shipment from various collaborating centers to the study site. ChC prepared tissue sections and processed them for histopathology analysis. BAA planned and performed the pathological analysis. FRT, SCSL, and VMG performed DNA isolation. FRT and ACB performed RNA isolation. FRT and AG planned the Illumina array design. FRT and CC performed bisulfite conversions of DNA and Illumina arrays. FLCK helped in performing Illumina arrays. FRT, RSL, and BS performed Illumina array data analyses. RSL downloaded TCGA and GTEx data and performed RNA-seq and statistical analyses. FRT and RK performed qPCR. FRT and CC performed pyrosequencing. JS, VM, MIP, LFRP, and ZH contributed resources and critical feedback on the project. ZH supervised the experiments and work progress. FRT and ZH wrote the original draft. All authors reviewed and edited the manuscript.


REFERENCES

1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin 2018;68:394-424
2. Arnold M, Laversanne M, Brown LM, Devesa SS, Bray F. Predicting the Future Burden of Esophageal Cancer by Histological Subtype: International Trends in Incidence up to 2030. Am J Gastroenterol 2017;112:1247-55
3. McCormack VA, Menya D, Munishi MO, Dzamalala C, Gasmelseed N, Leon Roux M, et al. Informing etiologic research priorities for squamous cell esophageal cancer in Africa: A review of setting-specific exposures to known and putative risk factors. Int J Cancer 2017;140:259-71
4. Ma K, Cao B, Guo M. The detective, prognostic, and predictive value of DNA methylation in human esophageal squamous cell carcinoma. Clin Epigenetics 2016;8:43
5. Kanwal R, Gupta S. Epigenetic modifications in cancer. Clin Genet 2012;81:303-11
6. Lima SC, Hernandez-Vargas H, Simao T, Durand G, Kruel CD, Le Calvez-Kelm F, et al. Identification of a DNA methylome signature of esophageal squamous cell carcinoma and potential epigenetic biomarkers. Epigenetics 2011;6:1217-27
7. Li X, Zhou F, Jiang C, Wang Y, Lu Y, Yang F, et al. Identification of a DNA methylome profile of esophageal squamous cell carcinoma and potential plasma epigenetic biomarkers for early diagnosis. PLoS One 2014;9:e103162
8. Cancer Genome Atlas Research Network, Analysis Working Group: Asan University, BC Cancer Agency, Brigham and Women's Hospital, Broad Institute, et al. Integrated genomic characterization of oesophageal carcinoma. Nature 2017;541:169-75
9. Menya D, Oduor M, Kigen N, Maina SK, Some F, Kibosia C, et al. Cancer epidemiology fieldwork in a resource-limited setting: Experience from the western Kenya ESCCAPE esophageal cancer case-control pilot study. Cancer Epidemiol 2018;57:45-52
10. Nicolau-Neto P, Da Costa NM, de Souza Santos PT, Gonzaga IM, Ferreira MA, Guaraldi S, et al. Esophageal squamous cell carcinoma transcriptome reveals the effect of FOXM1 on patient outcome through novel PIK3R3 mediated activation of PI3K signaling pathway. Oncotarget 2018;9:16634-47
11. Leon ME, Assefa M, Kassa E, Bane A, Gemechu T, Tilahun Y, et al. Qat use and esophageal cancer in Ethiopia: A pilot case-control study. PLoS One 2017;12:e0178911
12. Serrano J, Snuderl M. Whole Genome DNA Methylation Analysis of Human Glioblastoma Using Illumina BeadArrays. Methods Mol Biol 2018;1741:31-51
13. Kling T, Wenger A, Beck S, Caren H. Validation of the MethylationEPIC BeadChip for fresh-frozen and formalin-fixed paraffin-embedded tumours. Clin Epigenetics 2017;9:33
14. Fortin JP, Triche TJ Jr, Hansen KD. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics 2017;33:558-60
15. Zhou W, Laird PW, Shen H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res 2017;45:e22
16. Chen YA, Lemire M, Choufani S, Butcher DT, Grafodatskaya D, Zanke BW, et al. Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics 2013;8:203-9
17. Merid SK, Novoloaca A, Sharp GC, Kupers LK, Kho AT, Roy R, et al. Epigenome-wide meta-analysis of blood DNA methylation in newborns and children identifies numerous loci related to gestational age. Genome Med 2020;12:25
18. Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 2010;11:587
19. Zhang S, Wu Z, Xie J, Yang Y, Wang L, Qiu H. DNA methylation exploration for ARDS: a multi-omics and multi-microarray interrelated analysis. J Transl Med 2019;17:345
20. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 2015;43:e47
21. Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 2001;125:279-84
22. Peters TJ, Buckley MJ, Statham AL, Pidsley R, Samaras K, Lord RV, et al. De novo identification of differentially methylated regions in the human genome. Epigenetics Chromatin 2015;8:6
23. Mounir M, Lucchetta M, Silva TC, Olsen C, Bontempi G, Chen X, et al. New functionalities in the TCGAbiolinks package for the study and integration of cancer data from GDC and GTEx. PLoS Comput Biol 2019;15:e1006701
24. Busato F, Dejeux E, El Abdalaoui H, Gut IG, Tost J. Quantitative DNA Methylation Analysis at Single-Nucleotide Resolution by Pyrosequencing(R). Methods Mol Biol 2018;1708:427-45
25. Li YY, Wang K, Chen LH, Zhu XX, Zhou J. Quantification of mRNA Levels Using Real-Time Polymerase Chain Reaction (PCR). Breast Cancer: Methods and Protocols 2016;1406:73-9
26. Perez-Enciso M, Tenenhaus M. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Hum Genet 2003;112:581-92
27. Pidsley R, Zotenko E, Peters TJ, Lawrence MG, Risbridger GP, Molloy P, et al. Critical evaluation of the Illumina MethylationEPIC BeadChip microarray for whole-genome DNA methylation profiling. Genome Biol 2016;17:208
28. Blattler A, Yao L, Witt H, Guo Y, Nicolet CM, Berman BP, et al. Global loss of DNA methylation uncovers intronic enhancers in genes showing expression changes. Genome Biol 2014;15:469
29. Scala G, Federico A, Palumbo D, Cocozza S, Greco D. DNA sequence context as a marker of CpG methylation instability in normal and cancer tissues. Sci Rep 2020;10:1721
30. Li JS, Ying JM, Wang XW, Wang ZH, Tao Q, Li LL. Promoter methylation of tumor suppressor genes in esophageal squamous cell carcinoma. Chin J Cancer 2013;32:3-11
31. Kishino T, Niwa T, Yamashita S, Takahashi T, Nakazato H, Nakajima T, et al. Integrated analysis of DNA methylation and mutations in esophageal squamous cell carcinoma. Mol Carcinog 2016;55:2077-88
32. Talukdar FR, Ghosh SK, Laskar RS, Mondal R. Epigenetic, genetic and environmental interactions in esophageal squamous cell carcinoma from northeast India. PLoS One 2013;8:e60996
33. Arechederra M, Daian F, Yim A, Bazai SK, Richelme S, Dono R, et al. Hypermethylation of gene body CpG islands predicts high dosage of functional oncogenes in liver cancer. Nat Commun 2018;9:3164
34. Chen C, Peng H, Huang X, Zhao M, Li Z, Yin N, et al. Genome-wide profiling of DNA methylation and gene expression in esophageal squamous cell carcinoma. Oncotarget 2016;7:4507-21
35. Belanger AS, Tojcic J, Harvey M, Guillemette C. Regulation of UGT1A1 and HNF1 transcription factor gene expression by DNA methylation in colon cancer cells. BMC Mol Biol 2010;11:9
36. Xu W, Xu M, Wang L, Zhou W, Xiang R, Shi Y, et al. Integrative analysis of DNA methylation and gene expression identified cervical cancer-specific diagnostic biomarkers. Signal Transduct Target Ther 2019;4:55
37. Bonczek O, Balcar VJ, Sery O. PAX9 gene mutations and tooth agenesis: A review. Clin Genet 2017;92:467-76
38. Paranjyothi MV, Kumaraswamy KL, Begum LF, Manjunath K, Basheer S. Tooth agenesis: A susceptible indicator for colorectal cancer? J Cancer Res Ther 2018;14:527-31
39. Gawron-Jakubek W, Spaczynska J, Pitynski K, Loster BW. Coexistence of tooth agenesis and ovarian cancer - a systematic literature review. Ginekol Pol 2019;90:707-10
40. Lv J, Guo L, Wang JH, Yan YZ, Zhang J, Wang YY, et al. Biomarker identification and trans-regulatory network analyses in esophageal adenocarcinoma and Barrett's esophagus. World J Gastroenterol 2019;25:233-44
41. Xiong Z, Ren S, Chen H, Liu Y, Huang C, Zhang YL, et al. PAX9 regulates squamous cell differentiation and carcinogenesis in the oro-oesophageal epithelium. J Pathol 2018;244:164-75
42. Gerber JK, Richter T, Kremmer E, Adamski J, Hofler H, Balling R, et al. Progressive loss of PAX9 expression correlates with increasing malignancy of dysplastic and cancerous epithelium of the human oesophagus. J Pathol 2002;197:293-7
43. Tamaoki M, Komatsuzaki R, Komatsu M, Minashi K, Aoyagi K, Nishimura T, et al. Multiple roles of single-minded 2 in esophageal squamous cell carcinoma and its clinical implications. Cancer Sci 2018;109:1121-34
44. Takashima K, Fujii S, Komatsuzaki R, Komatsu M, Takahashi M, Kojima T, et al. CD24 and CK4 are upregulated by SIM2, and are predictive biomarkers for chemoradiotherapy and surgery in esophageal cancer. Int J Oncol 2020;56:835-47
45. Su P, Wen S, Zhang Y, Li Y, Xu Y, Zhu Y, et al. Identification of the Key Genes and Pathways in Esophageal Carcinoma. Gastroenterol Res Pract 2016;2016:2968106
46. Ma D, Yang J, Wang Y, Huang X, Du G, Zhou L. Whole exome sequencing identified genetic variations in Chinese hemangioblastoma patients. Am J Med Genet A 2017;173:2605-13
47. Frydrych LM, Ulintz P, Bankhead A, Sifuentes C, Greenson J, Maguire L, et al. Rectal cancer sub-clones respond differentially to neoadjuvant therapy. Neoplasia 2019;21:1051-62
48. Liu Y, Wang H, Ni B, Zhang J, Li S, Huang Y, et al. Loss of KCNJ15 expression promotes malignant phenotypes and correlates with poor prognosis in renal carcinoma. Cancer Manag Res 2019;11:1211-20
49. Fang H, Liu Y, He Y, Jiang Y, Wei Y, Liu H, et al. Extracellular vesicle-delivered miR-505-5p, as a diagnostic biomarker of early lung adenocarcinoma, inhibits cell apoptosis by targeting TP53AIP1. Int J Oncol 2019;54:1821-32
50. Januszewicz W, Tan WK, Lehovsky K, Debiram-Beecham I, Nuckcheddy T, Moist S, et al. Safety and Acceptability of Esophageal Cytosponge Cell Collection Device in a Pooled Analysis of Data From Individual Patients. Clin Gastroenterol Hepatol 2019;17:647-56.e1
51. Middleton DRS, Mmbaga BT, O'Donovan M, Abedi-Ardekani B, Debiram-Beecham I, Nyakunga-Maro G, et al. Minimally invasive esophageal sponge cytology sampling is feasible in a Tanzanian community setting. Int J Cancer 2021;148:1208-18


| Gene | Combined score (CS) | hg19 genomic coordinates | Width (bp) | No. of probes | Mean Δβ | p value |
|---|---|---|---|---|---|---|
| SIM2* | 688878.52 | chr21:38076709:38083099 | 6391 | 21 | 0.40 | 1.83 × 10^-300 |
| PAX9 | 329274.46 | chr14:37125531:37130313 | 4783 | 21 | 0.41 | 1.83 × 10^-300 |
| THSD4 | 293401.57 | chr15:71838235:71840098 | 1864 | 6 | 0.39 | 8.35 × 10^-246 |
| HOXA3* | 172195.13 | chr7:27160276:27162766 | 2491 | 12 | 0.37 | 1.83 × 10^-300 |
| PHYHD1 | 145078.68 | chr9:131683857:131684923 | 1067 | 5 | 0.39 | 9.63 × 10^-176 |
| GPT | 129918.28 | chr8:145728138:145728543 | 406 | 6 | 0.41 | 1.09 × 10^-217 |
| HOXB3 | 129143.23 | chr17:46627848:46629804 | 1957 | 7 | 0.41 | 1.01 × 10^-273 |
| KCNJ15 | 124430.48 | chr21:39643353:39644262 | 910 | 5 | 0.35 | 4.71 × 10^-188 |
| TP53AIP1 | 121633.01 | chr11:128812462:128813688 | 1227 | 9 | 0.40 | 1.83 × 10^-300 |
| NFIX* | 117796.80 | chr19:13135318:13135808 | 491 | 10 | 0.40 | 1.83 × 10^-300 |
| EN1 | 70812.55 | chr2:119606302:119607885 | 1584 | 6 | 0.39 | 1.25 × 10^-189 |
| FAM83F | 68656.14 | chr22:40417285:40417869 | 585 | 5 | 0.40 | 2.78 × 10^-204 |
| SHOX2 | 62780.05 | chr3:157821536:157821994 | 459 | 4 | 0.38 | 2.90 × 10^-136 |
| LAMB4 | 52733.81 | chr7:107771104:107771214 | 111 | 3 | 0.38 | 8.75 × 10^-112 |
| ITPRIP | 51282.25 | chr10:106093778:106094976 | 1199 | 6 | 0.38 | 4.22 × 10^-266 |
| EXPH5 | 49297.09 | chr11:108422663:108423292 | 630 | 5 | 0.37 | 3.12 × 10^-172 |
| ZNF471 | 45717.96 | chr19:57019005:57019279 | 275 | 6 | 0.41 | 2.13 × 10^-133 |
| MKNK2 | 42887.85 | chr19:2046085:2046350 | 266 | 4 | 0.52 | 7.67 × 10^-190 |
| SPARC | 37744.97 | chr5:151066460:151066486 | 27 | 3 | 0.39 | 3.71 × 10^-103 |
| NDST1 | 37215.63 | chr5:149887039:149887716 | 678 | 5 | 0.35 | 2.36 × 10^-189 |

Table 1: List of the top 20 DMR-associated genes based on combined score after the target prioritization of DMRs. Δβ ≥ 30% was the cut-off for the DMR analysis (FDR ≤ 0.05) using the DMRcate R package. Mean Δβ for each DMR is the mean across all probes within the DMR. Δβ values were calculated as the difference between the mean β for tumor and NAT.
* Two DMRs of the same gene were present in the top 20 list of DMRs (Supplementary Table 14). Only the first DMR based on the combined score is included in the table.
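The Δβ definitions in the Table 1 legend can be illustrated with a minimal sketch. The β values below are hypothetical and are not taken from the study; the function names are illustrative only, and applying the 30% cut-off to the absolute value of Δβ is an assumption. The width helper assumes both end coordinates are counted, which reproduces the printed width for SIM2.

```python
from statistics import mean

# Hypothetical beta values (0-1 scale) for three probes in one DMR.
# These numbers are illustrative only and are NOT from the study.
tumor_beta = {
    "probe_1": [0.72, 0.68, 0.80],
    "probe_2": [0.65, 0.70, 0.75],
    "probe_3": [0.60, 0.66, 0.72],
}
nat_beta = {
    "probe_1": [0.30, 0.28, 0.32],
    "probe_2": [0.25, 0.35, 0.30],
    "probe_3": [0.31, 0.29, 0.30],
}


def probe_delta_beta(tumor, nat):
    """Per-probe delta-beta: mean beta in tumor minus mean beta in NAT."""
    return {probe: mean(tumor[probe]) - mean(nat[probe]) for probe in tumor}


def dmr_mean_delta_beta(tumor, nat):
    """Mean delta-beta of a DMR: average of the per-probe delta-beta values."""
    return mean(list(probe_delta_beta(tumor, nat).values()))


def dmr_width(start, end):
    """Region width in bp, counting both the start and end coordinates."""
    return end - start + 1


mean_db = dmr_mean_delta_beta(tumor_beta, nat_beta)
# Assumption: the 30% cut-off is applied to the absolute delta-beta.
passes_cutoff = abs(mean_db) >= 0.30
sim2_width = dmr_width(38076709, 38083099)  # SIM2 coordinates from Table 1
```

In this toy case the DMR passes the cut-off, and the SIM2 width evaluates to 6391 bp, matching the table.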


FIGURE LEGENDS:

Figure 1: Overview of study design and sample characteristics. (A) Outline of the study design; (B) Country-wise sample distribution in percentages; (C) Dots in the map show sample collection sites, and their respective countries are colored.

Figure 2: Tumor-specific DMPs. (A) Circos plot showing the hypermethylated (purple) and hypomethylated (golden) DMPs across the autosomes; (B) Principal Component Analysis (PCA) performed using all DMPs. Samples from different countries are shown in various colors (BR: Brazil; ET: Ethiopia; IR: Iran; KN: Kenya; ML: Malawi; SA: South Africa; SD: Sudan). Tumors (T) and NAT (N) are shown as triangles and round dots, respectively; (C) Genomic distribution of DMPs; (D) CpG island distribution of DMPs; (E) Volcano plot representing DMPs. The significantly hypermethylated probes are shown in purple, while hypomethylated probes are shown in golden; (F) Heatmap based on identified tumor-specific DMPs. Rows represent individual probes and columns represent individual samples. Red indicates tumor and green indicates NAT samples. The color in the heatmap represents the methylation level of the genes: purple for hypermethylated probes and golden for hypomethylated probes; (G) Pathway enrichment analysis of the hypermethylated and hypomethylated DMPs.

Figure 3: Tumor-specific DMRs. (A) Circos plot of the tumor-specific DMR distribution; (B) Pathway enrichment analysis using DMR-associated genes; (C) Genomic distribution of the tumor-specific DMRs.

Figure 4: Comparison of DMR genes with RNA-seq data from TCGA-GTEx and target prioritization. (A) Heatmap with DMRs from the discovery set analysis of HM850K data correlated with gene expression from TCGA-GTEx derived RNA-seq data; (B) Mean DNAme values from the discovery set, mean RNA expression from TCGA-GTEx derived RNA-seq data, and esophagus-specific RNA expression of the three genes selected for validation (PAX9, SIM2 and THSD4). [DEG = Differentially Expressed Genes; SDMR (Score DMR) = mean Δβ × −log10(P) × number of CpGs; TSE = Tissue-specific expression, here esophagus; ES (Expression Score) = SDGE + STSE; CS = Combined Score]
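As a worked illustration of the SDMR definition in the Figure 4 legend, the sketch below evaluates mean Δβ × −log10(P) × number of CpGs for three rows of Table 1. It assumes the printed p values are read as powers of ten (for example 1.83 × 10^-300) and uses the rounded Mean Δβ values as printed, so the results are approximate. The function name is illustrative. The expression score and the way it enters the combined score are not defined in this section, so the output is not expected to reproduce the CS column.

```python
import math


def score_dmr(mean_delta_beta, p_value, n_cpgs):
    """SDMR = mean delta-beta x -log10(P) x number of CpGs (Figure 4 legend)."""
    return mean_delta_beta * -math.log10(p_value) * n_cpgs


# (mean delta-beta, p value, number of probes) as printed in Table 1.
# Assumption: p values are read as powers of ten, e.g. 1.83e-300.
table1_rows = {
    "SIM2": (0.40, 1.83e-300, 21),
    "PAX9": (0.41, 1.83e-300, 21),
    "THSD4": (0.39, 8.35e-246, 6),
}

scores = {gene: score_dmr(*values) for gene, values in table1_rows.items()}

for gene, score in scores.items():
    print(f"{gene}: SDMR ~ {score:.1f}")
```

With these rounded inputs, SIM2 and PAX9 score far higher than THSD4 on SDMR alone, mainly because their DMRs contain 21 probes rather than 6.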


Figure 5: Replication of the 3 target genes. (A) Mean beta value of the CpGs within the DMR region of PAX9, SIM2 and THSD4; (B) Replication of the target CpGs by pyrosequencing: PAX9 (CpG ID: cg00762160), SIM2 (CpG ID: cg2169785), and THSD4 (CpG ID: cg05337779); (C) RNA expression of PAX9, SIM2 and THSD4 normalized to GAPDH (housekeeping gene).

Figure 6: Specificity and sensitivity of the CpG panel as markers for ESCC. (A) Mean methylation of the proposed 7-CpG marker in tumor and NAT from the discovery set; (B) ROC curve of the 7-CpG marker from the discovery set: 108 tumors versus 51 NAT; (C) Mean methylation of the proposed 5-CpG marker (common in HM450K) in TCGA early-stage and all-stage tumors versus NAT; (D) ROC curve of the 5-CpG marker from TCGA early stage: 62 tumors versus 13 NAT; (E) ROC curve of the 5-CpG marker from TCGA all stages: 91 tumors versus 13 NAT.
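For readers unfamiliar with the quantities summarized in Figure 6, the following minimal sketch shows how an area under the ROC curve, sensitivity, and specificity could be computed from a per-sample mean methylation score. The scores and the 0.50 threshold are hypothetical, and the rank-based AUC formula is a generic textbook approach, not necessarily the procedure used in the study.

```python
def roc_auc(tumor_scores, nat_scores):
    """AUC as the fraction of tumor/NAT pairs in which the tumor scores
    higher (ties count one half); this equals the area under the ROC curve."""
    wins = 0.0
    for t in tumor_scores:
        for n in nat_scores:
            if t > n:
                wins += 1.0
            elif t == n:
                wins += 0.5
    return wins / (len(tumor_scores) * len(nat_scores))


def sensitivity_specificity(tumor_scores, nat_scores, threshold):
    """Call a sample positive when its score is >= threshold."""
    true_pos = sum(s >= threshold for s in tumor_scores)
    true_neg = sum(s < threshold for s in nat_scores)
    return true_pos / len(tumor_scores), true_neg / len(nat_scores)


# Hypothetical mean methylation (beta) across a CpG panel, one value per sample.
# These numbers are illustrative only and are NOT from the study.
tumor = [0.71, 0.64, 0.58, 0.80, 0.35]
nat = [0.22, 0.30, 0.41, 0.27]

auc = roc_auc(tumor, nat)
sens, spec = sensitivity_specificity(tumor, nat, threshold=0.50)
print(f"AUC={auc:.2f}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The pairwise formulation is quadratic in sample count, which is acceptable for cohorts of the size described in the legend.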


Genome-wide DNA methylation profiling of esophageal squamous cell carcinoma from global high-incidence regions identifies crucial genes and potential cancer markers

Fazlur Rahman Talukdar, Sheila C Soares Lima, Rita Khoueiry, et al.

Cancer Res Published OnlineFirst March 19, 2021.

Updated version: Access the most recent version of this article at doi:10.1158/0008-5472.CAN-20-3445

Supplementary Material: Access the most recent supplemental material at http://cancerres.aacrjournals.org/content/suppl/2021/03/20/0008-5472.CAN-20-3445.DC1


Downloaded from cancerres.aacrjournals.org on October 3, 2021. © 2021 American Association for Cancer Research.