Author Manuscript Published OnlineFirst on September 9, 2020; DOI: 10.1158/0008-5472.CAN-20-1791 Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

A deep learning framework identifies pathogenic noncoding somatic mutations from personal prostate cancer genomes

Cheng Wang and Jingjing Li*

The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, the Parker Institute for Cancer Immunotherapy, the Bakar Computational Health Sciences Institute, the Department of Neurology, School of Medicine, University of California, San Francisco, 35 Medical Center Way, San Francisco, CA 94143

*correspondence should be addressed to: [email protected] Mailing address: 35 Medical Center Way, San Francisco, CA 94143 Tel: 415-502-2572

Declaration of Interests. C.W. and J.L. are listed as inventors on a pending patent application filed by UCSF related to part of the methods presented here for prostate cancer analysis. J.L. is a co-founder and a member of the scientific advisory board of SensOmics, Inc.

Running Title: deep learning for prostate cancer

Abstract

Our understanding of noncoding mutations in cancer genomes has been derived primarily from mutational recurrence analysis by aggregating clinical samples on a large scale. These cohort-based approaches cannot directly identify individual pathogenic noncoding mutations from personal cancer genomes. Therefore, although most somatic mutations are localized in the noncoding cancer genome, their effects on driving tumorigenesis and progression have not been systematically explored, and noncoding somatic alleles have not been leveraged in current clinical practice to guide personalized screening, diagnosis, and treatment. Here we present a deep learning framework to capture pathogenic noncoding mutations in personal cancer genomes, which perturb gene regulation by altering chromatin architecture. We deployed the system specifically for localized prostate cancer by integrating large-scale prostate cancer genomes and the prostate-specific epigenome. We exhaustively evaluated somatic mutations in each patient's genome and agnostically identified thousands of somatic alleles altering the prostate epigenome. Functional genomic analyses subsequently demonstrated that affected genes displayed differential expression in prostate tumor samples, were vulnerable to expression alterations, and were convergent onto androgen-receptor-mediated signaling pathways. Accumulation of pathogenic regulatory mutations in these affected genes was predictive of clinical observations, suggesting potential clinical utility of this approach. Overall, the deep learning framework has significantly expanded our view of somatic mutations in the vast noncoding genome, uncovered novel genes in localized prostate cancer, and will foster the development of personalized screening and therapeutic strategies for prostate cancer.

Statement of Significance. This study's characterization of the noncoding genome in prostate cancer reveals mutational signatures predictive of clinical observations, which may serve as a powerful prognostic tool in this disease.

Downloaded from cancerres.aacrjournals.org on October 6, 2021. © 2020 American Association for Cancer Research.

Introduction

To date, more than 81 million simple somatic mutations (base substitutions and short indels) from cancer genomes have been catalogued by the International Cancer Genome Consortium (ICGC)(1), among which >90% fall in non-coding regions. However, current cancer genome analysis has been largely confined to somatic mutations in coding sequences, which account for only ~2% of the genome, leaving the vast majority of the human genome unexplored. Multiple lines of evidence have suggested a significant involvement of non-coding somatic mutations in cancer genomes: (i) it has been established that more than 90% of functional loci in complex diseases reside in the non-coding genome(2,3), and therefore a significant contribution to cancer etiologies by perturbing non-coding regulatory elements would also be expected(4); (ii) specifically for cancer, individual case studies have identified several non-coding mutations driving tumorigenesis and progression, including the well-known somatic mutations in the TERT promoter(5) as well as the clustered somatic mutations affecting the cis-regulatory elements of FOXA1, which promotes prostate cancer cell proliferation(6); (iii) the landmark cancer epigenome study has revealed distinct chromatin architecture in tumors relative to normal tissues(7), prompting further investigation of somatic mutations that alter tissue-specific chromatin structure, leading to tumor formation and progression.

Despite these considerations, the recent pan-cancer genome analysis unexpectedly reported a paucity of somatic driver mutations in the non-coding genome, identifying only a handful of non-coding somatic mutations displaying significant recurrence across tumor samples(8). Mutational recurrence has been the primary approach to inferring mutational pathogenicity; it does so indirectly, based on statistical enrichment, and cannot directly assess the molecular effects of individual non-coding somatic mutations. As such, mutational recurrence analysis effectively identifies mutations forming “hotspots” in a patient cohort, but cannot capture individual pathogenic mutations from personal genomes. Given the sporadic nature of somatic mutations, it is reasonable to expect that many pathogenic mutations do not form clusters, but individually exert their effects on tumor initiation, formation or progression. Identifying these individual mutations will not only significantly expand our view of the non-coding cancer genome, but will also foster the development of clinical tools to screen pathogenic regulatory mutations.

In addition to mutational recurrence analysis (or identifying mutation hotspots), several approaches have been proposed to individually annotate non-coding somatic mutations in
cancer genomes (as reviewed in these excellent articles(9,10)). Some of the studies examined mutational localization in known regulatory regions, followed by motif analysis. However, many transcription factors bind to degenerate DNA sequences(11), and a single base change is more likely to be neutral than consequential. Therefore, merely examining mutational localization in regulatory elements is not sufficient to determine specific allelic effects on perturbing gene regulation. Evolutionary conservation was also integrated in the analysis; however, even if one somatic mutation affects a conserved site, the observed conservation could merely result from background selection. Some studies identified somatic mutations displaying allele-specific expression (ASE, or somatic eQTLs) in tumor samples(12,13). However, this association framework to identify allelic imbalance at the RNA level cannot confirm causative roles of somatic alleles. Importantly, this practice also requires a large cohort of patient tumor tissues for genome and transcriptome sequencing, which is often impractical for regular clinical analyses. Taken together, as with mutational recurrence analysis, our current understanding of the non-coding genome in cancer has been derived largely from indirect inference from large-scale patient cohorts, which cannot directly determine the pathogenic effects of individual non-coding mutations from personal genomes. As such, non-coding mutations are often excluded from cancer genome analysis and have not been leveraged to guide personalized screening, diagnosis and treatment.

Because sequence composition of non-coding regulatory elements is not random, tissue-specific regulatory capacity of a given sequence could be predicted(14-16). Therefore, an allelic change that significantly alters regulatory capacity of a sequence element would be considered consequential, leading to gene dysregulation. Like deleterious missense mutations altering protein structure, the deleteriousness of non-coding mutations is thus defined by their mutational impacts on altering epigenomic architecture in specific cell and tissue types. Given the sporadic nature of cancer somatic mutations that are sparsely localized across the genome, it is reasonable to assume that these mutations individually exert their effects; in this way, one can evaluate the effect of each somatic mutation one at a time across the genome and identify those disrupting epigenomic architecture in disease-related tissue and cell types. We herein leveraged a deep-learning based framework to achieve this goal, and applied this strategy to analyzing somatic mutations in localized prostate cancer genomes, which serve as an excellent model based on the following considerations: (1) compared with metastatic cancers, localized prostate cancers are more likely to be affected by simple somatic mutations (base substitutions and short insertions and deletions, indels) than somatic copy number aberrations(17). However,
many clinical cases cannot be explained by coding sequence mutations, and thus pathogenic non-coding somatic mutations are yet to be identified; (2) localized prostate cancers are slow-growing tumors, which allows us to use reference tissue-specific epigenomes from healthy individuals to study how somatic alleles could alter the reference epigenome to promote tumor formation and progression. However, metastatic prostate cancers tend to have differential epigenetics(17,18), which might complicate the analysis; (3) compared with many other cancer types, prostate cancer has a more established molecular etiology centralized on the androgen receptor (AR)-mediated signaling pathways(19). This knowledge will help validate the identified pathogenic non-coding variants for their roles in perturbing AR pathways; (4) because prostate cancer is the second most prevalent cancer in males, which could be cured at the localized stage and becomes lethal at the metastatic stage, expanding our analysis to the vast non-coding genome will help improve diagnostic yield in our clinical sequencing practice, fostering the development of personalized therapeutic strategies.

With the deep learning framework, we identified numerous novel pathogenic non-coding somatic mutations from prostate cancer genomes, which significantly expanded our knowledge of genes in prostate cancer. Importantly, our functional genomic analysis demonstrated that these newly identified somatic mutations preferentially affect genes responsive to 5α-dihydrotestosterone (DHT, the physiological androgen) stimulation and enzalutamide treatment (an androgen receptor inhibitor), not only confirming their implication in prostate cancer but also highlighting the therapeutic value of our approach. For individual patients, our analysis further demonstrated that the identified pathogenic non-coding somatic alleles are significant indicators of clinical characteristics of localized prostate cancer. Overall, compared with previous cohort-based analyses, our work is able to identify individual pathogenic non-coding somatic alleles from personal genomes, and can be deployed as a tool to uncover novel clinical mutations in non-coding cancer genomes.

Methods and Materials

The genomic resources

We downloaded genome data and somatic mutations from the ICGC (International Cancer Genome Consortium) data portal (https://icgc.org). We examined 707,610 simple somatic mutations in the localized prostate cancer cohort (PRAD-CA), which includes 306 donors (Table S1). Gleason scores were available for 227 of these donors. Variant allele frequencies (VAF) were also provided for each of the somatic mutations, and this cohort has
uniformly benchmarked clinical information. We repurposed the HOMER software (http://homer.ucsd.edu) to annotate the genomic localization of each mutation, and we identified 422,314 somatic mutations that could be associated with protein-coding genes. We then annotated each somatic mutation for its localization in intronic, promoter, 5’UTR, 3’UTR, exonic and intergenic regions (adjacent to protein-coding sequences). We retrieved the prostate ATAC-Seq data from the ENCODE data portal (ENCFF670GFY, https://www.encodeproject.org). Peak calling was performed on the aligned BAM file using HMMRATAC(20). Peak annotation was performed with HOMER. TCGA gene expression data (the primary prostate tumors) were queried from UALCAN(21); we also downloaded the TCGA transcriptome data for the primary prostate tumors (TCGA-PRAD) and normal prostate tissues. We included in our analysis 495 patient samples that had not received pharmaceutical treatment. We used DESeq2(22) to compute the expression fold changes and differential expression. pLI scores were retrieved from an earlier paper(23). We retrieved RNA-Seq data for the LNCaP prostate cancer cell line after DHT stimulation and enzalutamide treatment(24) (GEO accession: GSE110903). For gene symbols mapped onto multiple Ensembl identifiers, we averaged their expression data. We also averaged gene expression across multiple replicates, and only considered genes with moderate or significant abundance in our comparison (RPKM>1). The human proteome map data were retrieved from the original publication(25). We averaged protein abundance for gene symbols mapped onto different protein identifiers. CADD (Combined Annotation Dependent Depletion) predictions(26) and GERP++ scores(27) were retrieved from the original publications. The reference genome in this study was GRCh37 (hg19) unless otherwise mentioned.

The deep learning model

We developed a multilayer deep convolutional neural network consisting of alternating convolution and maxpooling layers, which takes each 2kb sequence as input to predict the associated chromatin openness in the prostate. The model was described in previous papers(15,28), and we further adapted the model for prostate cancer genome analysis (Fig.S1). The three convolutional layers contain 320, 480, and 640 hidden neurons, respectively, and the output of each convolutional layer is activated by the ReLU function before propagating to the next maxpooling or convolutional layer. We implemented a fully connected layer with ReLU activation on top of the three convolutional layers, which is further propagated to the output sigmoid layer to compute the probability of a given input sequence having an open chromatin state. We trained the deep learning model using the prostate ATAC-Seq peaks. Specifically, we
first split the entire genome into 200bp bins, and then considered a 200bp sequence a positive sample for model training if 50% of the sequence overlapped with the prostate ATAC-Seq peaks. All the 200bp sequences were padded with 900bp of sequence on both the upstream and downstream sides as context, and the resulting sequences were subsequently fed into the deep neural network for model training or testing. Similar to previous work(15,28), we adopted a holdout strategy to verify the model performance, where we randomly held out a chromosome, trained the deep learning model using all other chromosomes, and then independently tested on all the sequences from this held-out chromosome. In our study, Chromosome 5 was randomly selected as the test set. The deep learning model was developed with PyTorch in Python version 3.7.
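
The construction of training examples described above can be sketched as follows. This is a minimal illustration rather than the released pipeline: the function names are hypothetical, the 50% rule is read here as "at least 50%", and bins without a full 900bp flank at chromosome ends are simply skipped, which is an assumption.

```python
from typing import Dict, List, Tuple

BIN = 200    # core bin size in bp (from the text)
FLANK = 900  # context padding on each side in bp (from the text)

Peak = Tuple[int, int]          # half-open interval [start, end)
Example = Tuple[int, int, int]  # (window_start, window_end, label)

def overlap_bp(start: int, end: int, peaks: List[Peak]) -> int:
    """Bases of [start, end) covered by non-overlapping peaks."""
    return sum(max(0, min(end, p_end) - max(start, p_start))
               for p_start, p_end in peaks)

def make_examples(chrom_len: int, peaks: List[Peak],
                  min_frac: float = 0.5) -> List[Example]:
    """Label each 200bp bin and return its padded 2kb window.

    ASSUMPTION: a bin is positive when at least `min_frac` of it is
    covered by peaks; bins lacking a full flank are skipped.
    """
    examples = []
    for start in range(FLANK, chrom_len - BIN - FLANK + 1, BIN):
        end = start + BIN
        label = 1 if overlap_bp(start, end, peaks) >= min_frac * BIN else 0
        examples.append((start - FLANK, end + FLANK, label))
    return examples

def chromosome_holdout(by_chrom: Dict[str, List[Example]],
                       test_chrom: str = "chr5"):
    """Train on every chromosome except the held-out one."""
    train = [ex for chrom, exs in by_chrom.items()
             if chrom != test_chrom for ex in exs]
    test = list(by_chrom.get(test_chrom, []))
    return train, test
```

Holding out a whole chromosome, rather than random bins, keeps overlapping 2kb context windows from leaking between the training and test sets.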

In silico mutagenesis and the prediction of deleterious mutations

For each mutation, we evaluated the change in chromatin openness from the reference allele to the somatic allele in our prostate-specific deep learning model. We used each 2kb sequence (200bp core sequence, plus 900bp flanking upstream and downstream sequences) to scan the 1kb flanking genomic regions of the mutation loci at a step size of 20bp, obtaining 101 such sequence windows. We then derived an integrated score for each somatic mutation by weighting each of the 101 sliding windows according to its distance from the mutation site. The procedure was detailed in previous publications(15,28). We used kernel density estimation to approximate the underlying score distribution of all the simple somatic alleles associated with protein-coding sequences in this study, and only considered those above the genome-wide threshold (score≥15.4321, the upper one percentile across the genome) as high-confidence deleterious mutations. In our analysis of clinical outcomes, as a control set, we also designated the lowest one percentile of all the genomic mutations as benign variants. We used the GTEx dataset to validate the scoring scheme, where high-confidence and low-confidence data have been annotated by CAVIAR(29). We additionally included 40,580 variants as background control, which had no association with expression of any genes in the prostate. All data were downloaded from the GTEx data portal (https://www.gtexportal.org).
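
The sliding-window scan can be illustrated with a short sketch for a single-base substitution. The exact weighting and score scale follow the cited protocol(15,28) and are not reproduced here: the Gaussian distance weight, its bandwidth, the use of an absolute probability difference, and all function names are assumptions made only for illustration, and `predict` stands in for the trained model.

```python
import math
from typing import Callable, List

WINDOW = 2000  # 200bp core plus two 900bp flanks (from the text)
STEP = 20      # sliding step in bp (from the text)
REACH = 1000   # scan 1kb on each side of the mutation (from the text)

def window_offsets() -> List[int]:
    """Offsets of window centres relative to the mutation site."""
    return list(range(-REACH, REACH + 1, STEP))

def score_substitution(seq: str, idx: int, alt: str,
                       predict: Callable[[str], float],
                       bandwidth: float = 250.0) -> float:
    """Distance-weighted mean |alt - ref| prediction over all windows.

    seq: reference sequence; idx: 0-based mutation index within seq.
    ASSUMPTION: Gaussian weights with a hypothetical bandwidth.
    """
    half = WINDOW // 2
    if idx - REACH - half < 0 or idx + REACH + half > len(seq):
        raise ValueError("need 2kb of sequence on each side of idx")
    mutated = seq[:idx] + alt + seq[idx + 1:]
    total, weight_sum = 0.0, 0.0
    for offset in window_offsets():
        start = idx + offset - half
        ref_p = predict(seq[start:start + WINDOW])
        alt_p = predict(mutated[start:start + WINDOW])
        weight = math.exp(-0.5 * (offset / bandwidth) ** 2)
        total += weight * abs(alt_p - ref_p)
        weight_sum += weight
    return total / weight_sum
```

With a toy predictor such as GC fraction, a reference-to-reference "mutation" scores exactly zero, while any real substitution yields a small positive score; the values are on an arbitrary scale and are not comparable to the DEEP scores reported in this study.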

Data availability

The code and the model for this study are available at: https://github.com/complexdisease/DEEP. The model output is provided in Supplementary Table S2.

Results

The overall study design is shown in Fig.1A. We leveraged the prostate ATAC-Seq data from ENCODE (Encyclopedia of DNA Elements) to construct a deep learning model, which was capable of predicting open chromatin status in the prostate for any given genomic sequence. We then used this model to screen every simple somatic mutation in prostate cancer genomes, and identified mutations whose alleles had a significant impact on altering local open chromatin status in the prostate. We identified a subset of somatic mutations showing extreme effects on perturbing gene regulation and integrated multi-omics data to demonstrate their associations with prostate cancer.

Overview of prostate cancer genomes and the prostate epigenome

We retrieved localized prostate cancer genomes from the ICGC PRAD-CA (Prostate Adenocarcinoma) cohort, totaling 707,610 simple somatic mutations (indels and single-nucleotide variants) (Table S1) identified from 306 patient tumor samples(1,17,18). The somatic mutations were identified by comparing the index lesion and paired blood samples from patients who were treatment-naïve at the time of sampling. Sample information, genome sequencing platforms and technical details of variant calling have been described in previous publications(17,18). Among the 707,610 somatic mutations (0.33% were short indels and 99.67% were single-nucleotide variants), we analyzed 422,314 that could be associated with protein-coding genes, among which we considered 224,503; 178,130; 5,165; 397; 3,592; and 5,703 somatic mutations in intergenic, intronic, promoter, 5’UTR, 3’UTR and exonic regions, respectively (Table S2).

We identified deleterious non-coding mutations by evaluating their allelic effects on altering the reference epigenome in the prostate (Fig.1A). The reference prostate epigenome was obtained from ENCODE, where ATAC-Seq was performed on the prostate gland from a 54-year-old healthy male, revealing the regulatory landscape in the prostate. We performed peak calling on these ATAC-Seq data, and identified 24,108 high-confidence elements representing a comprehensive collection of regulatory elements across the prostate genome (Table S3). These elements had an average size of 411 base pairs, demonstrating a high resolution for fine-mapping non-coding variants. These elements were predominantly localized in intergenic, intronic and promoter regions (Fig.1B). While the localization in promoter regions is expected, the prostate genome
accessibility in intergenic and intronic regions suggests a widespread distribution of regulatory elements (e.g. enhancers) in these non-coding regions. We examined several known genes in prostate cancer, and the ATAC-Seq data clearly revealed their active regulatory elements in the prostate, including NKX3-1(30,31) and TP53(32) (Fig.1C-D). Given that the vast majority of somatic mutations fall in intronic and intergenic regions, it is important to determine whether their somatic alleles could perturb the prostate epigenome and thereby contribute to tumorigenesis.

A deep learning model to predict the prostate epigenome

It has been established that the chromatin status of a given genomic region could be predicted, and therefore a somatic allele is considered deleterious if the allelic change from a reference allele to a somatic allele results in an alteration of the predicted chromatin status(14,33). Machine learning approaches have been proposed to capture such allelic changes (in silico mutagenesis)(14-16). These approaches utilized epigenome profiling data in heterogeneous tissue and cell types by aggregating all the ENCODE and Epigenome Roadmap resources(34,35); however, to derive mechanistic insights, predictive models have to be constructed based on specific tissue and cell types pertinent to particular diseases, and to date these models have not been developed for cancer genome analysis. We herein deploy this strategy to analyze prostate cancer genomes by training a machine learning model specific to the prostate epigenome. We trained a deep convolutional neural network (CNN) to (1) predict prostate-specific chromatin structure for any given genomic sequence, and (2) identify impactful mutations whose somatic alleles alter the predicted prostate-specific chromatin structure from reference alleles. We adopted the previously described CNN model(15,28), and further configured the model to specifically predict chromatin openness only in the prostate gland for any given 200-bp sequence. For model verification, we adopted a holdout strategy, where we randomly held out a chromosome, trained the deep learning model using all other chromosomes, and then independently tested on all the sequences from this held-out chromosome. In our study, Chromosome 5 was randomly selected as the test set, establishing the predictability of the prostate epigenome at an AUROC (area under the receiver operating characteristic curve) of 0.86 (Fig.2A). We also computed the area under the precision-recall curve (the AUPR score) on the test dataset, which reached AUPR=0.63.
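
For readers less familiar with the two metrics, the following sketch computes them from held-out labels and predicted probabilities. It is a generic illustration, not the evaluation code used in this study: AUROC is obtained from the rank-sum identity, and AUPR is approximated by average precision, a step-wise summary of the precision-recall curve (an assumption about which estimator is meant).

```python
from typing import List

def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks in ascending order; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def auroc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive outranks a random negative."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(labels: List[int], scores: List[float]) -> float:
    """Mean precision at each positive, scanning from the top score.

    Tied scores are ordered arbitrarily in this simplified version.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

Because open-chromatin bins are a small minority of the genome, the precision-recall summary is the more demanding of the two metrics, which is consistent with an AUPR lower than the AUROC.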

With the established precision, we next mapped all the somatic mutations onto the reference genome, and asked how the somatic alleles could alter the predicted chromatin status from their respective reference alleles; identifying these epigenomically impactful alleles is specific to the
prostate. We modified the previously developed protocol(15) and further developed the method specific to prostate cancer genome analysis: we used sliding windows to scan through a given somatic mutation, and computed the alterations of the predicted prostate chromatin openness resulting from the allelic change for each sliding window. We averaged the predicted alterations over all the sliding windows as a composite deleteriousness score specific to the prostate gland (see Methods and Materials and Fig.S1). This score quantifies the overall allelic impact on altering chromatin architecture in the prostate. Notably, this framework does not require that a mutation be localized in a regulatory element, and hence has the power to study mutations upstream or downstream of a given regulatory element, enabling us to model positional effects of somatic alleles. Moreover, compared with previous mutational recurrence analysis in large-scale patient cohorts, this framework can be directly deployed to scan personal cancer genomes. For convenience, we termed this framework DEEP (deep estimation from epigenome prediction) in our prostate cancer study.

We leveraged the prostate eQTL (expression quantitative trait loci) data in GTEx (the Genotype-Tissue Expression project) to determine whether the DEEP framework can effectively capture impactful mutations perturbing gene regulation in the prostate. GTEx has compiled eQTL fine-mapping results from CAVIAR(29), and we retrieved those identified in the prostate. Among 354,784 putative prostate eQTL sites in unlinked genomic regions initially identified by GTEx (across 221 prostate samples), the CAVIAR fine-mapping method identified 17,579 as high-confidence eQTLs (HC-eQTLs), and the remaining were considered low-confidence eQTLs (LC-eQTLs). For each of the GTEx variants, we implemented DEEP to score their allelic impacts on altering open chromatin structure in the prostate, and observed that these CAVIAR HC-eQTLs indeed received substantially increased DEEP scores relative to the LC-eQTLs (P=1.28e-56, Wilcoxon rank-sum test, Fig.2B) as well as to the genome background (genomic variants having no association with gene expression in the prostate, non-eQTLs, P=8.41e-73, Wilcoxon rank-sum test). This observation indicates that these HC-eQTLs likely dysregulate gene expression by altering their local chromatin structure in the prostate, thereby confirming an enrichment of causal variants among these CAVIAR high-confidence loci. Notably, the LC-eQTLs identified by GTEx but not by CAVIAR also displayed an increase in DEEP score relative to the genome background (P=2.48e-13, Fig.2B), suggesting a moderate enrichment of causal variants in this low-confidence set. Taken together, by applying DEEP to GTEx eQTLs, we validated the effectiveness of this deep learning model in capturing functional non-coding mutations, and confirmed that those mutations can indeed perturb gene expression by altering chromatin structure in the prostate.
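
The group comparisons above rely on the Wilcoxon rank-sum test. A minimal sketch of that test is given below, using the normal approximation with tie correction and no continuity correction; the function name is hypothetical, and the exact software and settings behind the reported P values are not stated in the text.

```python
import math
from typing import List, Tuple

def rank_sum_test(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Returns (U statistic for x, z score, two-sided P value) under the
    normal approximation, which is adequate for large groups.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and combined[j + 1][0] == combined[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # tied values share their mean rank
        t = j - i + 1
        tie_term += t ** 3 - t
        rank_sum_x += mean_rank * sum(1 for k in range(i, j + 1)
                                      if combined[k][1] == 0)
        i = j + 1
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:                    # all values identical
        return u, 0.0, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p
```

A positive z score indicates that the first group (for example, HC-eQTL DEEP scores) tends to rank above the second (for example, LC-eQTL scores).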

Agnostically identify pathogenic non-coding somatic alleles in prostate cancer genomes

Having established the predictive power of the DEEP framework, we next used the system to scan each of the 422,314 somatic mutations from the ICGC PRAD-CA cohort (described above). We computed DEEP scores for the entire collection of somatic mutations in this study, including those in coding sequences (Table S2), and the score distribution is shown in Fig. 3A. The score distribution peaked around 0, suggesting that the vast majority of the somatic alleles were effectively neutral in localized prostate cancer, having little effect on altering the prostate chromatin architecture. However, numerous outliers were also detected, reflecting the extreme effects of their somatic alleles on chromatin structure. Across all the 422,314 somatic mutations analyzed, the one receiving the highest score (DEEP score=166, Fig.3A) was a 3-bp indel localized in the first exon of MMGT1 (Membrane Magnesium Transporter 1), 324 bp immediately downstream of its transcription start site (TSS). Although the role of MMGT1 in prostate cancer has not been extensively characterized, we queried its expression in the TCGA prostate tumor and matched normal samples using UALCAN(21), and observed significant down-regulation of MMGT1 across most of the samples with varying Gleason scores (Fig.3B), suggesting a significant role of MMGT1 in prostate cancer (see Discussion). Another extreme outlier is a somatic allele (a one-base substitution) in the intronic region of ARHGEF16 (Rho Guanine Nucleotide Exchange Factor 16, Fig.3A), which was previously implicated in glioma etiology(36). In TCGA data, this gene displayed significant upregulation in prostate tumor samples relative to the matched normal samples across different clinical grades (stratified by Gleason scores, Fig.3C). Another extreme somatic allele (a one-base substitution) was identified in the promoter region of IPO11 (Importin 11, Fig.3A), indicating its strong effect on perturbing IPO11 promoter activity in the prostate. Interestingly, this gene has recently been identified as a tumor suppressor which prevents PTEN from degradation, and has been suggested as an indicator for clinical outcome of prostate cancer patients receiving radical prostatectomy(37). Querying TCGA expression data, we consistently observed its down-regulation in prostate tumor samples across varying Gleason score groups (Fig.3D). The detected somatic allele therefore likely contributed to prostate cancer etiology by dysregulating IPO11 expression in this patient. Notably, all these extreme examples were specific to individual personal genomes, and did not form mutational hotspots in their neighboring regions across all samples. Therefore, these mutations would be missed in a typical
mutational recurrence analysis, but now can be captured by the DEEP framework targeting individual mutations from personal genomes.

In addition to these extreme outliers, we next set out to agnostically identify high-confidence pathogenic non-coding somatic alleles. Across all the 422,314 somatic mutations analyzed, we used the upper one percentile to threshold the DEEP score distribution (Fig.3A), and identified 2,037 and 2,050 significant somatic mutations in genic and intergenic regions, respectively, characterizing the altered epigenomes in localized prostate tumors (Table S4). Among the significant genic mutations, 48, 1,914, 31, 3, and 41 were localized in promoter, intronic, exonic, 5’UTR and 3’UTR regions, respectively. We reasoned that if those mutations were indeed implicated in prostate cancer pathogenesis, their associated genes would be expected to be intolerant to expression alterations. We associated these mutations with their neighboring genes using the closest physical proximity, and compared the recently developed pLI scores for each gene, which have been widely used as a proxy of gene dosage sensitivity or haploinsufficiency(23). As expected, genes harboring significant genic non-coding somatic mutations indeed displayed substantially increased pLI scores relative to the genome background (P=1.98e-14, Wilcoxon rank-sum test, Fig. 4A), and the trend was also significant for genes physically associated with the significant intergenic mutations (P=4.77e-9, Wilcoxon rank-sum test, Fig. 4A). Taken together, we identified non-coding somatic mutations having extreme impacts on altering prostate epigenomic architecture, and their associated genes tended to be intolerant to expression alterations.
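
The two bookkeeping steps in this paragraph, thresholding at the upper one percentile and assigning each mutation to its closest gene, can be sketched as below. This is a simplified illustration: it uses an empirical rank cut-off rather than the kernel density estimate described in Methods, measures proximity to a single transcription start site per gene on one chromosome, and all names are hypothetical.

```python
import bisect
from typing import List

def upper_percentile_threshold(scores: List[float], pct: float = 1.0) -> float:
    """Score of the lowest-ranked mutation within the top pct percent.

    ASSUMPTION: an empirical rank cut-off stands in for the kernel
    density estimate used in the study.
    """
    n_top = max(1, round(len(scores) * pct / 100))
    return sorted(scores, reverse=True)[n_top - 1]

def nearest_gene(pos: int, tss_positions: List[int],
                 gene_names: List[str]) -> str:
    """Gene whose TSS is closest to pos on one chromosome.

    tss_positions must be sorted ascending and parallel to gene_names.
    """
    i = bisect.bisect_left(tss_positions, pos)
    candidates = [k for k in (i - 1, i) if 0 <= k < len(tss_positions)]
    best = min(candidates, key=lambda k: abs(tss_positions[k] - pos))
    return gene_names[best]
```

Mutations with a score at or above the returned threshold would then be carried forward, each paired with its nearest gene for the pLI comparison.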

We followed previous practice and used pLI≥0.9 to define dosage-sensitive genes(23). We identified 463 and 317 dosage-sensitive genes affected by significant genic and intergenic somatic mutations, respectively (i.e., mutations receiving extreme DEEP scores; hereafter DEEP-genic and DEEP-intergenic genes, Table S5). To confirm their implication in primary prostate cancers, we examined RNA-Seq data from the TCGA-PRAD (prostate adenocarcinoma) cohort and computed expression fold changes for each gene in 495 primary prostate tumor samples relative to 55 normal prostate tissues. Compared with genes harboring at least one somatic mutation, the DEEP-genic (P=4.78e-11, Wilcoxon rank-sum test, Fig. 4B) and DEEP-intergenic (P=4.45e-14, Wilcoxon rank-sum test, Fig. 4B) genes displayed substantial down-regulation. To exclude the possibility that the down-regulation was caused by their extreme pLI scores, we identified 3,230 genes passing the same pLI threshold of 0.9, and again observed that both DEEP-genic (P=2.48e-5, Wilcoxon rank-sum test) and DEEP-intergenic (P=3.44e-9, Wilcoxon rank-sum test) genes displayed significant down-regulation relative to these dosage-sensitive genes (Fig. 4B). Taken together, this deep learning model indeed captured genes implicated in prostate cancer. Among these identified genes, Table 1 lists 13 genes whose promoters were affected by significant somatic mutations. Ranking them by DEEP score, we observed that the top four genes, ING3, IPO11 (described above), LARP4 and TSC22D1, were all tumor suppressors(37-42), highlighting the strong and novel candidacies of these genes in prostate cancer. We particularly note that the top hit, ING3, is necessary for ATM signaling and DNA repair of double-strand breaks(38). MCL1, which has been proposed as a target for cancer therapy(43-45), ranked fifth. This list also included several other known prostate cancer genes such as HNRNPM(46) and NFKBIA(47), as well as CDK8, an emerging target for immunotherapy(48). A few other genes had uncharacterized functions in prostate cancer, and we confirmed their candidacies based on their differential expression in the TCGA prostate tumor samples by querying UALCAN(21), including ZNF711 (P=8.66e-13), DIDO1 (P=5.60e-3), LHX2 (P=1.17e-6) and MLLT3 (P=0.01). In addition to these genes with affected promoters, we also identified the androgen receptor AR with a significant intronic mutation receiving an extreme DEEP score, suggesting an alteration of an intronic regulatory element in AR. Importantly, using ChIP Enrichment Analysis (ChEA/EnrichR)(49,50), we observed that these genes (DEEP-genic and DEEP-intergenic) were highly enriched for targets of multiple transcription factors, with the highest enrichment for AR (Table S6). This enrichment suggested that the identified pathogenic non-coding mutations were effectively convergent onto the AR-mediated transcriptional regulatory network.
Moreover, examining the Human Proteome Map(25), we observed that the DEEP-genic genes displayed elevated protein abundance in the prostate gland (P=2.95e-8, Wilcoxon rank-sum test), confirming their physiological functions. Taken together, these observations confirmed the implication of the identified non-coding somatic mutations in prostate cancer. Although these mutations were individually identified from personal genomes, they are in fact convergent onto AR-mediated pathways in the prostate.
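The group comparisons in this section rely on the Wilcoxon rank-sum test. The sketch below shows one way such a test can be computed, assuming a two-sided test with a normal approximation, average ranks for ties, and no continuity correction; the text does not state which variant or software was used, so these choices and the function name `rank_sum_test` are assumptions for illustration only.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Returns (U statistic for sample x, two-sided p-value).
    Ties receive average ranks and the variance is tie-corrected.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all pooled values identical
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_value


if __name__ == "__main__":
    # Synthetic values only; not data from the study.
    print(rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))
```

For real analyses an established statistics library would normally be preferable; the hand-written version is given only to make the computation explicit.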

The perturbed AR signaling pathway

We next examined the identified genes for their roles in mediating AR signaling. We first considered DEEP-genic genes because these genes are less likely to be confounded by remote chromatin interactions. We began with the LNCaP prostate cancer cell line stimulated by an AR ligand, dihydrotestosterone (DHT, the physiological androgen). We examined RNA-Seq data generated in a previous study(24), where transcriptome profiling was performed on LNCaP cells at 6 and 24 hours after DHT stimulation. Using protein-coding genes harboring at least one genic somatic mutation as the reference, we observed that the 463 DEEP-genic genes displayed significant down-regulation 6 hours after DHT stimulation (P=3.12e-9, Wilcoxon rank-sum test, Fig. 5), whereas the expression alteration from 6 hours to 24 hours was insignificant (P=0.07, Wilcoxon rank-sum test). We further reasoned that if the down-regulation reflected the implication of the identified genes in AR signaling pathways, we would also expect their up-regulation when treating LNCaP cells with an AR antagonist. From the same study(24), we examined RNA-Seq data from LNCaP cells after 48 hours of enzalutamide treatment, using a 48-hour treatment with dimethyl sulfoxide (DMSO) as the control. Enzalutamide is an approved and potent AR antagonist for treating castration-resistant prostate cancer, which achieves its antagonist function by switching DNA motifs recognized by AR(51). As expected, the identified genes displayed significant up-regulation upon enzalutamide treatment (P=4.70e-8, Wilcoxon rank-sum test, Fig. 5), consistent with their marked down-regulation upon DHT stimulation. We note that the observed expression alterations were also significant when compared with the list of 3,230 genes passing the same pLI threshold of 0.9 (P<5e-5, Wilcoxon rank-sum test), excluding the possibility that these expression changes were merely explained by dosage sensitivity. We also performed the same analysis on the DEEP-intergenic genes, which displayed similar down-regulation and up-regulation upon DHT stimulation (P=1.64e-4, Wilcoxon rank-sum test) and enzalutamide treatment (P=5.02e-4, Wilcoxon rank-sum test), respectively.
Taken together, expression data from the DHT stimulation and the enzalutamide treatment mutually validated each other, and collectively demonstrated that the DEEP framework identified pathogenic mutations perturbing AR signaling in prostate cancer.
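The direction of expression change upon treatment can be illustrated with a simple per-gene log2 fold change. This is a rough sketch under stated assumptions: expression is given as one normalized value per gene and condition, a pseudocount of 1 avoids division by zero, and the function names are hypothetical. It is not the differential expression procedure used in the study.

```python
import math
import statistics


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2((treated + c) / (control + c)) for genes present in both dicts."""
    shared = treated.keys() & control.keys()
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in shared
    }


def median_shift(fold_changes, gene_set, background):
    """Median log2 fold change of `gene_set` minus that of `background`.

    A negative value means the gene set is shifted toward down-regulation
    relative to the background genes.
    """
    in_set = [fold_changes[g] for g in gene_set if g in fold_changes]
    in_bg = [fold_changes[g] for g in background if g in fold_changes]
    if not in_set or not in_bg:
        raise ValueError("each group needs at least one gene with a fold change")
    return statistics.median(in_set) - statistics.median(in_bg)


if __name__ == "__main__":
    # Synthetic expression values and placeholder gene names.
    treated = {"geneA": 3.0, "geneB": 7.0, "geneC": 7.0}
    control = {"geneA": 7.0, "geneB": 3.0, "geneC": 3.0}
    lfc = log2_fold_change(treated, control)
    print(median_shift(lfc, {"geneA"}, {"geneB", "geneC"}))
```

The resulting per-gene fold changes for a gene set and its reference set are the kind of two-sample input that a rank-sum test would then compare.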

Personal genome scan to predict clinical characteristics of prostate cancer

Given the mutational convergence onto AR signaling, we further hypothesized that, at the personal genome level, the accumulation of pathogenic mutations affecting AR signaling is likely predictive of clinical characteristics of prostate tumors. We analyzed the distribution of the identified pathogenic regulatory somatic mutations in each personal cancer genome in the ICGC PRAD-CA cohort, followed by a comparison with the respective clinical information documented in the ICGC data portal. Among the 463 DEEP-genic genes (deleterious somatic mutations falling in genic regions), we considered 257 that had at least one deleterious somatic mutation identified by DEEP in these patients and that were likely implicated in AR signaling given their down-regulation or up-regulation after DHT or enzalutamide treatment, respectively (Fig. 5). The mutation profile is shown in Fig. 6A, which immediately revealed one extreme patient: 51 of the 257 genes were perturbed by somatic regulatory mutations, suggesting a pervasive dysregulation of AR-mediated pathways in this patient. As expected, this extreme patient received a Gleason score of 9, suggesting a potential correlation between the number of affected genes and an adverse clinical manifestation. To generalize this observation, we extended the analysis to the entire patient cohort. Strikingly, we observed a significant positive correlation between Gleason scores and the number of affected genes in each personal cancer genome (Pearson’s R=0.34, P=1.1e-4, Spearman’s rho=0.33, P=2.38e-4). As an independent control experiment, we considered the somatic mutations receiving the lowest one percentile of DEEP scores across the genome in the same cohort (i.e., high-confidence benign variants, as opposed to the pathogenic mutations in the upper 1% in this cohort), and repeated the same analysis. No significant correlation was observed (Pearson’s R=0.06, P=0.4023, Spearman’s rho=0.006, P=0.9232), demonstrating the specificity of the DEEP scoring system. Stratifying patients based on their Gleason scores, individuals with milder tumors (Gleason score=6) on average had one affected gene per person, in contrast to ~2.5 among individuals with more aggressive tumors (Gleason score>6, P=4.22e-4, Wilcoxon rank-sum test, Fig. 6B). In addition to those genic mutations (the DEEP-genic set), we also tested significant somatic regulatory mutations falling in intergenic regions (the DEEP-intergenic set), associating each mutation with its closest gene. We still observed similar positive correlations, although the statistical significance was attenuated (Pearson’s R=0.27, P=7.3e-3, Spearman’s rho=0.18, P=0.07).
The weakened correlation likely resulted from potential remote chromatin interactions that could have complicated the assignment of intergenic mutations to their target genes. Taken together, leveraging the panel of 257 genes, this approach could be deployed to estimate the expected Gleason scores by scanning each personal prostate cancer genome.
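The per-patient burden analysis above can be sketched in two steps: count the distinct panel genes hit in each patient, then correlate those counts with Gleason scores. The code below is an illustrative sketch only; the record layout of (patient, gene) pairs, the function names, and the toy values are assumptions, and it computes correlation coefficients without the associated P values.

```python
import math
from collections import defaultdict


def affected_gene_counts(records, panel, patients):
    """Count distinct panel genes hit per patient.

    `records` is an iterable of hypothetical (patient_id, gene) pairs;
    patients with no hits in the panel get a count of 0.
    """
    hits = defaultdict(set)
    for patient, gene in records:
        if gene in panel:
            hits[patient].add(gene)
    return {p: len(hits[p]) for p in patients}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy data only; not values from the ICGC cohort.
    counts = [1, 2, 2, 5]
    gleason = [6, 7, 7, 9]
    print(pearson(counts, gleason), spearman(counts, gleason))
```

Counting distinct genes rather than raw mutations mirrors the "number of affected genes" described in the text, so several mutations in one gene contribute a single count.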

Discussion

Despite abundant somatic mutations in the non-coding genome, current cancer genome analysis has primarily focused on mutations in protein-coding sequences, which account for ~2% of the human genome. Analyzing recurrent somatic mutations has provided a glimpse into the mechanisms by which perturbed gene regulation contributes to tumorigenesis(8); however, compared with the vast non-coding genome, somatic mutations are sparsely localized, and therefore inferring pathogenicity merely from recurrent mutations could be less effective, leading to the observation of a paucity of non-coding driver mutations in cancer(8). In this study, by contrast, the DEEP framework directly assesses the allelic effect of each somatic mutation on chromatin architecture, thereby enabling us to identify pathogenic regulatory somatic mutations from personal cancer genomes. We applied this strategy to localized prostate cancer, and identified numerous novel pathogenic somatic alleles as well as their affected genes as novel candidate loci in prostate cancer. Our functional genomic analyses confirmed their function in prostate cancer, revealed the mutational convergence on AR-mediated pathways, and demonstrated the utility of our approach for personal genome scans. With this new approach, we conclude that pathogenic regulatory somatic mutations are widely dispersed across the prostate cancer genome.

Somatic mutations emerge along the tumor development trajectory, and mutations arising at early stages tend to have increased variant allele frequencies (VAF)(52,53). Therefore, it is possible to leverage VAF to trace the evolutionary origins of the predicted pathogenic non-coding mutations in this study. Among all the somatic mutations, we identified high-confidence deleterious (top 1% of the genome-wide prediction scores) and benign (bottom 1%) mutations falling in gene promoter regions. Indeed, these deleterious mutations displayed a significant increase in VAF compared with the genome background (all somatic mutations in this study, P=3.4e-3, Wilcoxon rank-sum test, Fig. S2), whereas the benign mutations displayed a significant reduction in VAF compared with the genome background (P=6.1e-4, Wilcoxon rank-sum test, Fig. S2). We also extended the analysis from promoter mutations to all the predicted deleterious somatic mutations (intronic, intergenic, etc., top 1 percentile), which consistently displayed increased VAF relative to the genome background (P=8.82e-14, Wilcoxon rank-sum test). Therefore, these pathogenic non-coding mutations in localized prostate cancers likely arose early during tumor development.
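For readers unfamiliar with VAF, the quantity compared above can be sketched as the fraction of reads at a site that support the alternate allele. The snippet assumes per-mutation reference and alternate read counts are available; the function names and the input layout are hypothetical, and how VAF was actually derived for this cohort is not described in this section.

```python
import statistics


def variant_allele_frequency(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate (mutant) allele."""
    total = ref_reads + alt_reads
    if total <= 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def median_vaf(mutations):
    """Median VAF over an iterable of (ref_reads, alt_reads) pairs."""
    return statistics.median(
        variant_allele_frequency(ref, alt) for ref, alt in mutations
    )


if __name__ == "__main__":
    # Synthetic read counts only.
    print(median_vaf([(30, 10), (10, 10), (0, 10)]))
```

Per-mutation VAF values for the deleterious, benign, and background groups could then be compared with a rank-sum test, as in the text.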

Applying DEEP to primary prostate tumor genomes, we examined the complete set of somatic mutations from 306 patient tumor samples and identified thousands of significant mutations receiving extreme DEEP scores (upper one percentile), indicating that they alter the prostate epigenome. Interestingly, genes affected by these somatic alleles are responsive to enzalutamide, an FDA-approved medication for treating prostate cancer(54,55). It is therefore reasonable to speculate that carriers of these mutations would respond differently to enzalutamide treatment than other patient groups. Future studies are needed to develop individualized therapeutic strategies utilizing personal somatic mutation profiles. In addition, we showed that the burden of pathogenic mutations predicted by DEEP was indicative of patients’ Gleason scores (Fig. 6). Future analyses are therefore warranted to examine the clinical utility of this approach on a large scale, which could be leveraged as a molecular tool to support pathological analysis of prostate tumor samples.

In our analyses, we adopted a very conservative strategy to agnostically identify high-confidence non-coding somatic mutations in prostate cancer by only considering extreme DEEP and pLI scores. However, mutations not considered in this analysis could also contribute to prostate cancer etiology. For example, MMGT1 (membrane magnesium transporter 1) received the highest DEEP score in our analysis (Fig. 3A), but its pLI score was 0.8 in the latest gnomAD database(56). Although it was not included in our downstream functional genomic study, which adopted a stringent threshold of pLI≥0.9, this somatic mutation affecting MMGT1 is expected to contribute to prostate cancer, and in fact down-regulation of MMGT1 was observed in the TCGA-PRAD cohort (Fig. 3B) across varying Gleason score groups, except for the most aggressive tumor group (Gleason score=10, small sample size). Intriguingly, previous work suggested that the competition between calcium and magnesium for membrane binding sites can result in a cellular micronutrient imbalance in human diseases(57,58), and a higher calcium to magnesium dietary intake ratio is positively correlated with an increased chance of developing aggressive prostate cancer(58,59). These observations are consistent with the perturbed gene regulation of MMGT1: by reducing the abundance of the membrane magnesium transporter, the magnesium intake would be decreased accordingly, leading to an increased calcium to magnesium ratio and therefore an increased chance of developing aggressive prostate cancer. This finding also highlights the possibility of personalizing dietary plans to manage prostate cancer based on individuals’ genomic profiles. Taken together, variants receiving intermediate DEEP scores likely also contribute to prostate cancer, and their implication requires further investigation based on clinical judgement and our understanding of gene-environment interactions (e.g., patient lifestyles).
In addition to these individual genes, as a general trend, we observed that genes affected by regulatory somatic mutations in the ICGC cohort also displayed down-regulation in the TCGA cohort. These genes included many tumor suppressors. The observation was consistent with our further investigation of enzalutamide treatment in LNCaP cells, which up-regulated those affected genes. These data at different levels collectively demonstrated the mutational convergence onto AR signaling pathways in prostate cancer.

We observed that many ATAC-Seq peaks in the prostate gland were localized in exonic regions, suggesting potential enhancer elements for nearby genes. We also identified several somatic mutations that likely alter chromatin structure in exonic regions (Table S7). Therefore, when analyzing coding-sequence mutations, extra caution has to be exercised to determine the mutational effect on gene regulation. This notion was also suggested in previous studies(60,61). In a similar vein, numerous intronic and intergenic regulatory elements were also observed in the prostate, and impactful mutations residing in these elements were also identified using the DEEP framework. To gain further mechanistic insights, extensive epigenome profiling experiments are required to systematically characterize the epigenome landscape in the prostate, such as ChIP-Seq profiling for enhancer (H3K4me1 and H3K27ac) and promoter (H3K4me3) marks. It is also important to note that the DEEP framework was trained on open chromatin regions (ATAC-Seq), so it is expected that the system preferentially detects mutations disrupting active regulatory elements. With the recent development of high-throughput platforms to identify silencer elements across the genome(62), the DEEP framework could be directly trained on silencer elements to identify mutations interfering with regulatory suppressors.

On the methodological side, the DEEP framework was purely driven by the allelic effects on altering chromatin structure without requiring an aggregation of large-scale clinical samples. We compared the DEEP scores with evolutionary conservation (GERP++ scores)(27), and observed that their correlation was close to 0, which is in line with previous work suggesting that genetic loci implicated in prostate cancer (appearing in later stages of life) are selectively neutral(63). The lack of substantial correlation was also observed when comparing the DEEP scores with CADD (Combined Annotation Dependent Depletion) scores(64). This is expected because much of the CADD scoring scheme derives from evolutionary conservation. CADD also incorporates epigenome information, but it only examines the localization of mutations in regulatory elements without quantifying the allelic effects(64). More importantly, CADD aggregates all existing epigenome information without considering tissue specificity. In contrast, we specifically trained the DEEP model using the prostate epigenome, so the mutational pathogenicity is solely defined in a prostate-specific context. Given the lack of correlation between our algorithm and CADD or GERP++, the information captured in our analysis would likely be missed by existing approaches.

Because of its flexible design (only requiring tissue-specific epigenomes), DEEP can be readily extended to other cancer types, and its capability to capture pathogenic mutations in the non-coding genome is expected to improve current clinical tumor sequencing practice, which has been largely confined to coding sequences. Expanding the search for causal variants to non-coding regions could significantly increase the diagnostic yield. However, substantial efforts are required to further develop a guideline for variant interpretation in the non-coding genome, similar to the ACMG (American College of Medical Genetics and Genomics) consensus for classifying coding-sequence mutations(65). As presented in this work, mutation impact scores (the DEEP scores), gene intolerance to expression changes, and gene functional relevance to particular cancer types should be considered simultaneously, and independent clinical studies are needed to evaluate the expected diagnostic yield of integrating coding and non-coding mutations for different cancer types. It is also important to note that DEEP identifies deleterious mutations by quantifying the likelihood that a mutation disrupts chromatin structure. This scoring scheme helps prioritize the most likely deleterious mutations in the non-coding genome, and substantial experimental work will be required to mechanistically characterize the identified mutations.

Acknowledgements

We sincerely thank all the anonymous reviewers for their insightful advice. The project is supported by the UCSF Prostate Cancer Program Pilot Research Funding Award and by startup funds from the Parker Institute for Cancer Immunotherapy, the Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, and the Bakar Computational Health Sciences Institute at UCSF.

References

1. International Cancer Genome C, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, et al. International network of cancer genome projects. Nature 2010;464:993-8
2. Corradin O, Scacheri PC. Enhancer variants: evaluating functions in common disease. Genome Med 2014;6:85
3. Schaub MA, Boyle AP, Kundaje A, Batzoglou S, Snyder M. Linking disease associations with regulatory information in the human genome. Genome Res 2012;22:1748-59
4. Khurana E, Fu Y, Chakravarty D, Demichelis F, Rubin MA, Gerstein M. Role of non-coding sequence variants in cancer. Nat Rev Genet 2016;17:93-108
5. Huang FW, Hodis E, Xu MJ, Kryukov GV, Chin L, Garraway LA. Highly recurrent TERT promoter mutations in human melanoma. Science 2013;339:957-9
6. Zhou S, Hawley JR, Soares F, Grillo G, Teng M, Madani Tonekaboni SA, et al. Noncoding mutations target cis-regulatory elements of the FOXA1 plexus in prostate cancer. Nat Commun 2020;11:441
7. Corces MR, Granja JM, Shams S, Louie BH, Seoane JA, Zhou W, et al. The chromatin accessibility landscape of primary human cancers. Science 2018;362

8. Rheinbay E, Nielsen MM, Abascal F, Wala JA, Shapira O, Tiao G, et al. Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Nature 2020;578:102-11
9. Gan KA, Carrasco Pro S, Sewell JA, Fuxman Bass JI. Identification of Single Nucleotide Non-coding Driver Mutations in Cancer. Front Genet 2018;9:16
10. Piraino SW, Furney SJ. Beyond the exome: the role of non-coding somatic mutations in cancer. Ann Oncol 2016;27:240-8
11. Zhang C, Xuan Z, Otto S, Hover JR, McCorkle SR, Mandel G, et al. A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Res 2006;34:2238-46
12. Cheng Z, Vermeulen M, Rollins-Green M, DeVeale B, Babak T. A catalog of cis-regulatory mutations in 12 major cancer types. bioRxiv 2019:710103
13. Zhang W, Bojorquez-Gomez A, Velez DO, Xu G, Sanchez KS, Shen JP, et al. A global transcriptional network connecting noncoding mutations to changes in tumor gene expression. Nat Genet 2018;50:613-20
14. Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet 2015;47:955-61
15. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods 2015;12:931-4
16. Zhou J, Park CY, Theesfeld CL, Wong AK, Yuan Y, Scheckel C, et al. Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat Genet 2019;51:973-80
17. Espiritu SMG, Liu LY, Rubanova Y, Bhandari V, Holgersen EM, Szyca LM, et al. The Evolutionary Landscape of Localized Prostate Cancers Drives Clinical Aggression. Cell 2018;173:1003-13 e15
18. Fraser M, Sabelnykova VY, Yamaguchi TN, Heisler LE, Livingstone J, Huang V, et al. Genomic hallmarks of localized, non-indolent prostate cancer. Nature 2017;541:359-64
19. Taplin ME. Drug insight: role of the androgen receptor in the development and progression of prostate cancer. Nat Clin Pract Oncol 2007;4:236-44
20. Tarbell ED, Liu T. HMMRATAC: a Hidden Markov ModeleR for ATAC-seq. Nucleic Acids Res 2019;47:e91
21. Chandrashekar DS, Bashel B, Balasubramanya SAH, Creighton CJ, Ponce-Rodriguez I, Chakravarthi B, et al. UALCAN: A Portal for Facilitating Tumor Subgroup Gene Expression and Survival Analyses. Neoplasia 2017;19:649-58
22. Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 2014;15:550
23. Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 2016;536:285-91
24. Zhang Y, Pitchiaya S, Cieslik M, Niknafs YS, Tien JC, Hosono Y, et al. Analysis of the androgen receptor-regulated lncRNA landscape identifies a role for ARLNC1 in prostate cancer progression. Nat Genet 2018;50:814-24
25. Kim MS, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, et al. A draft map of the human proteome. Nature 2014;509:575-81
26. Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 2019;47:D886-D94
27. Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol 2010;6:e1001025
28. Zhou J, Theesfeld CL, Yao K, Chen KM, Wong AK, Troyanskaya OG. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet 2018;50:1171-9

29. Hormozdiari F, Kostem E, Kang EY, Pasaniuc B, Eskin E. Identifying causal variants at loci with multiple signals of association. Genetics 2014;198:497-508
30. Bhatia-Gaur R, Donjacour AA, Sciavolino PJ, Kim M, Desai N, Young P, et al. Roles for Nkx3.1 in prostate development and cancer. Genes Dev 1999;13:966-77
31. Bowen C, Bubendorf L, Voeller HJ, Slack R, Willi N, Sauter G, et al. Loss of NKX3.1 expression in human prostate cancers correlates with tumor progression. Cancer Res 2000;60:6111-5
32. Ecke TH, Schlechte HH, Schiemenz K, Sachs MD, Lenk SV, Rudolph BD, et al. TP53 gene mutations in prostate cancer progression. Anticancer Res 2010;30:1579-86
33. Shrikumar A, Prakash E, Kundaje A. GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics 2019;35:i173-i82
34. Roadmap Epigenomics C, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, et al. Integrative analysis of 111 reference human epigenomes. Nature 2015;518:317-30
35. Consortium EP. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57-74
36. Huang D, Wang Y, Xu L, Chen L, Cheng M, Shi W, et al. GLI2 promotes cell proliferation and migration through transcriptional activation of ARHGEF16 in human glioma cells. J Exp Clin Cancer Res 2018;37:247
37. Chen M, Nowak DG, Narula N, Robinson B, Watrud K, Ambrico A, et al. The nuclear transport receptor Importin-11 is a tumor suppressor that maintains PTEN protein. J Cell Biol 2017;216:641-56
38. Mouche A, Archambeau J, Ricordel C, Chaillot L, Bigot N, Guillaudeux T, et al. ING3 is required for ATM signaling and DNA repair in response to DNA double strand breaks. Cell Death Differ 2019;26:2344-57
39. Egiz M, Usui T, Ishibashi M, Zhang X, Shigeta S, Toyoshima M, et al. La-Related Protein 4 as a Suppressor for Motility of Ovarian Cancer Cells. Tohoku J Exp Med 2019;247:59-67
40. Seetharaman S, Flemyng E, Shen J, Conte MR, Ridley AJ. The RNA-binding protein LARP4 regulates cancer cell migration and invasion. Cytoskeleton (Hoboken) 2016;73:680-90
41. Nakashiro K, Kawamata H, Hino S, Uchida D, Miwa Y, Hamano H, et al. Down-regulation of TSC-22 (transforming growth factor beta-stimulated clone 22) markedly enhances the growth of a human salivary gland cancer cell line in vitro and in vivo. Cancer Res 1998;58:549-55
42. Rentsch CA, Cecchini MG, Schwaninger R, Germann M, Markwalder R, Heller M, et al. Differential expression of TGFbeta-stimulated clone 22 in normal prostate and prostate cancer. Int J Cancer 2006;118:899-906
43. Arai S, Jonas O, Whitman MA, Corey E, Balk SP, Chen S. Tyrosine Kinase Inhibitors Increase MCL1 Degradation and in Combination with BCLXL/BCL2 Inhibitors Drive Prostate Cancer Apoptosis. Clin Cancer Res 2018;24:5458-70
44. Merino D, Kelly GL, Lessene G, Wei AH, Roberts AW, Strasser A. BH3-Mimetic Drugs: Blazing the Trail for New Cancer Medicines. Cancer Cell 2018;34:879-91
45. Senichkin VV, Streletskaia AY, Zhivotovsky B, Kopeina GS. Molecular Comprehension of Mcl-1: From Gene Structure to Cancer Therapy. Trends Cell Biol 2019;29:549-62
46. Yang T, An Z, Zhang C, Wang Z, Wang X, Liu Y, et al. hnRNPM, a potential mediator of YY1 in promoting the epithelial-mesenchymal transition of prostate cancer cells. Prostate 2019;79:1199-210
47. Carter SL, Centenera MM, Tilley WD, Selth LA, Butler LM. IkappaBalpha mediates prostate cancer cell death induced by combinatorial targeting of the androgen receptor. BMC Cancer 2016;16:141

48. Philip S, Kumarasiri M, Teo T, Yu M, Wang S. Cyclin-Dependent Kinase 8: A New Hope in Targeted Cancer Therapy? J Med Chem 2018;61:5073-92
49. Lachmann A, Xu H, Krishnan J, Berger SI, Mazloom AR, Ma'ayan A. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 2010;26:2438-44
50. Chen EY, Tan CM, Kou Y, Duan Q, Wang Z, Meirelles GV, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 2013;14:128
51. Chen Z, Lan X, Thomas-Ahner JM, Wu D, Liu X, Ye Z, et al. Agonist and antagonist switch DNA motifs recognized by human androgen receptor in prostate cancer. EMBO J 2015;34:502-16
52. Milanese J-S, Tibiche C, Zaman N, Zou J, Han P, Meng Z, et al. eTumorMetastasis, a network-based algorithm predicts clinical outcomes using whole-exome sequencing data of cancer patients. bioRxiv 2018:268680
53. Miller CA, White BS, Dees ND, Griffith M, Welch JS, Griffith OL, et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput Biol 2014;10:e1003665
54. Hussain M, Fizazi K, Saad F, Rathenborg P, Shore N, Ferreira U, et al. Enzalutamide in Men with Nonmetastatic, Castration-Resistant Prostate Cancer. N Engl J Med 2018;378:2465-74
55. Beer TM, Armstrong AJ, Rathkopf D, Loriot Y, Sternberg CN, Higano CS, et al. Enzalutamide in Men with Chemotherapy-naive Metastatic Castration-resistant Prostate Cancer: Extended Analysis of the Phase 3 PREVAIL Study. Eur Urol 2017;71:151-4
56. Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. bioRxiv 2020
57. Rosanoff A, Dai Q, Shapses SA. Essential Nutrient Interactions: Does Low or Suboptimal Magnesium Status Interact with Vitamin D and/or Calcium Status? Adv Nutr 2016;7:25-43
58. Dai Q, Motley SS, Smith JA, Jr., Concepcion R, Barocas D, Byerly S, et al. Blood magnesium, and the interaction with calcium, on the risk of high-grade prostate cancer. PLoS One 2011;6:e18237
59. Steck SE, Omofuma OO, Su LJ, Maise AA, Woloszynska-Read A, Johnson CS, et al. Calcium, magnesium, and whole-milk intakes and high-aggressive prostate cancer in the North Carolina-Louisiana Prostate Cancer Project (PCaP). Am J Clin Nutr 2018;107:799-807
60. Ahituv N. Exonic enhancers: proceed with caution in exome and genome sequencing studies. Genome Med 2016;8:14
61. Birnbaum RY, Clowney EJ, Agamy O, Kim MJ, Zhao J, Yamanaka T, et al. Coding exons function as tissue-specific enhancers of nearby genes. Genome Res 2012;22:1059-68
62. Pang B, Snyder MP. Systematic identification of silencers in human cells. Nat Genet 2020;52:254-63
63. Lachance J, Berens AJ, Hansen MEB, Teng AK, Tishkoff SA, Rebbeck TR. Genetic Hitchhiking and Population Bottlenecks Contribute to Prostate Cancer Disparities in Men of African Descent. Cancer Res 2018;78:2432-43
64. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 2014;46:310-5
65. Richards S, Aziz N, Bale S, Bick D, Das S, Gastier-Foster J, et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med 2015;17:405-24

Table 1. The significant somatic mutations in gene promoters

chr    start      end        ref  alt  DEEP score  dist. to TSS  symbol   pLI
chr7   120590816  120590816  C    T    62.09       -15           ING3     0.99
chr5   61714006   61714006   C    T    53.89       -705          IPO11    1.00
chr12  50795199   50795199   C    G    33.42       9             LARP4    0.99
chr13  45151140   45151140   C    G    33.20       -439          TSC22D1  0.94
chr1   150552163  150552163  A    C    27.69       51            MCL1     0.95
chrX   119709898  119709898  C    T    25.30       -214          CUL4B    1.00
chrX   84499744   84499744   G    A    22.38       747           ZNF711   0.99
chr20  61569348   61569348   C    A    22.30       -44           DIDO1    1.00
chr13  26827888   26827888   T    A    21.80       -378          CDK8     0.95
chr19  8509668    8509668    G    A    20.20       -191          HNRNPM   1.00
chr9   126773773  126773773  C    A    17.86       -274          LHX2     0.95
chr14  35873926   35873926   G    A    16.71       34            NFKBIA   0.98
chr9   20622011   20622011   G    C    15.55       -74           MLLT3    1.00

Figure Legends

Fig. 1. The regulatory landscape in the human prostate. (A) The overall study design. ATAC-Seq open chromatin data are used to train a deep learning model, which is capable of predicting regulatory capacity in the prostate for any given genomic sequence. The model is subsequently used to scan ~400K simple somatic mutations from the ICGC prostate cancer genomes to identify somatic alleles that significantly alter prostate chromatin structure. The identified consequential mutations are subjected to validation and characterization using multi-omics data and functional genomic analysis. (B) The genomic distribution of ATAC-Seq peaks in the prostate. (C-D) ATAC-Seq peaks in two genes associated with prostate cancer, NKX3-1 (C) and TP53 (D). ATAC-Seq peaks (blue) and their associated P-values (red) are shown. A strong peak is observed in the promoter of NKX3-1, and many ATAC-Seq peaks are also observed in TP53 intronic regions. Only nominally significant P-values are shown (P<0.01). Genomic coordinates are based on hg38.

Fig. 2. Performance of the deep learning model. (A) The receiver operating characteristic curve from the deep-learning-based prediction. The area under the curve was estimated from five-fold cross-validation. (B) Validation of the predicted deleterious regulatory mutations using GTEx eQTL data in the prostate. The high-confidence eQTLs (HC-eQTLs) received substantially higher prediction scores from the deep learning model than the low-confidence putative eQTLs (LC-eQTLs) and the genomic variants having no association with gene expression in the prostate (non-eQTLs). P-values were derived from the Wilcoxon rank-sum test. Error bars represent 95% confidence intervals of the mean estimated from bootstrapping analyses.
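The bootstrapped 95% confidence interval of the mean mentioned for panel B can be illustrated with a percentile bootstrap. This is a minimal sketch under stated assumptions: the function name, the number of resamples, the fixed seed, and the example scores are all hypothetical, and the legend does not specify which bootstrap variant the study used.

```python
import random
from statistics import mean

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return boot_means[lo_idx], boot_means[hi_idx]

# Made-up prediction scores, for illustration only.
scores = [0.8, 1.2, 0.5, 2.1, 1.7, 0.9, 1.4, 1.1, 0.6, 1.9]
low, high = bootstrap_mean_ci(scores)
print(f"mean={mean(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The percentile method is chosen here only because it is the simplest to state; other bootstrap intervals would serve the same illustrative purpose.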

Fig. 3. Genome-wide identification of pathogenic non-coding somatic alleles in localized prostate cancer. (A) Distribution of the prediction scores for all the somatic alleles in this study. Three extreme outliers and their prediction scores are shown to represent significant regulatory mutations in promoter, intronic and exonic regions, respectively. Significant mutations were identified by setting a threshold at the upper one percentile of scores across all the mutations (the red bar). (B-D) Differential expression of MMGT1 (B), ARHGEF16 (C) and IPO11 (D) in TCGA prostate tumor samples of varying clinical grades relative to normal tissues. Gene expression data were queried from UALCAN (21). No statistical significance was observed for the Gleason score 10 group due to its small sample size (N=10).
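The upper-one-percentile threshold described for panel A can be sketched as follows. The function names and the toy scores are hypothetical, and the rule used here (keep every score at or above the lowest score within the top 1% of ranked values) is an assumed reading of the legend, not a confirmed description of the authors' procedure.

```python
def upper_percentile_cutoff(scores, top_percent=1):
    """Return the lowest score that still falls within the top `top_percent`%.

    Hypothetical helper; assumes at least one score is always retained.
    """
    ranked = sorted(scores, reverse=True)
    # Integer arithmetic avoids floating-point surprises in the count.
    n_top = max(1, len(ranked) * top_percent // 100)
    return ranked[n_top - 1]

def significant_scores(scores, top_percent=1):
    """Keep scores at or above the cutoff (ties at the cutoff are all kept)."""
    cutoff = upper_percentile_cutoff(scores, top_percent)
    return [s for s in scores if s >= cutoff]

# Toy scores 1..1000, for illustration only: the top 1% is 991..1000.
demo = list(range(1, 1001))
print(upper_percentile_cutoff(demo))   # 991
print(len(significant_scores(demo)))   # 10
```

Because ties at the cutoff are retained, the selected set can be slightly larger than 1% of the input when many mutations share the same score.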

Fig. 4. Characterizing the deleterious non-coding somatic mutations. (A) The identified deleterious somatic variants displayed a significant elevation in dosage sensitivity, as measured by pLI scores. The variants localized in genic and intergenic regions were considered separately (P=1.98e-14 for genic variants, and P=4.77e-9 for intergenic variants, Wilcoxon rank-sum test). (B) Genes affected by the identified deleterious regulatory mutations displayed marked down-regulation in primary prostate tumors relative to the genome background (P-values in red) and to the set of dosage-sensitive genes (P-values in blue).

Fig. 5. The identified pathogenic somatic variants are convergent on androgen-receptor-mediated pathways. Genes affected by the identified deleterious somatic mutations displayed strong down-regulation and up-regulation after DHT stimulation and Enzalutamide treatment, respectively.

Fig. 6. Deleterious regulatory somatic mutations are predictive of adverse clinical outcomes. (A) Personal genome scan identified an extreme mutation carrier with excessive regulatory variants perturbing the 257-gene panel, corresponding to his extreme prostate cancer Gleason score (GS) of 9. (B) Across all the study subjects, individuals with mild tumors (GS=6) on average had one gene affected, compared with ~2.5 affected genes for individuals with aggressive tumors (GS>6, P=4.22e-4, Wilcoxon rank-sum test).
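The group comparison in panel B relies on a Wilcoxon rank-sum test. The sketch below shows one way such a test can be computed, using the normal approximation with average ranks and a tie correction (affected-gene counts are integers, so ties are common). The function name and the per-patient counts are hypothetical, and the study's actual software and settings are not stated in this legend.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (z, p_value); z < 0 means values in x tend to rank below y.
    Hypothetical helper; no continuity correction is applied.
    """
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                    # size of this group of tied values
        avg_rank = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2       # Mann-Whitney U for x
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:                             # all values identical
        return 0.0, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Made-up affected-gene counts per patient, for illustration only.
gs6 = [0, 1, 1, 0, 2, 1, 1, 0, 1, 2]
gs_high = [2, 3, 1, 4, 2, 3, 2, 5, 3, 2]
z, p = rank_sum_test(gs6, gs_high)
print(f"z={z:.2f}, p={p:.4f}")
```

For small groups an exact test would be preferable to the normal approximation; the approximation is used here only to keep the sketch short.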

A deep learning framework identifies pathogenic noncoding somatic mutations from personal prostate cancer genomes

Cheng Wang and Jingjing Li

Cancer Res Published OnlineFirst September 9, 2020.

Updated version: Access the most recent version of this article at doi:10.1158/0008-5472.CAN-20-1791

Supplementary Material: Access the most recent supplemental material at http://cancerres.aacrjournals.org/content/suppl/2020/09/09/0008-5472.CAN-20-1791.DC1

Downloaded from cancerres.aacrjournals.org on October 6, 2021. © 2020 American Association for Cancer Research.