A Deep Learning Framework Identifies Pathogenic Noncoding Somatic Mutations from Personal Prostate Cancer Genomes

Author Manuscript Published OnlineFirst on September 9, 2020; DOI: 10.1158/0008-5472.CAN-20-1791
Author manuscripts have been peer reviewed and accepted for publication but have not yet been edited.

Cheng Wang and Jingjing Li*

The Eli and Edythe Broad Center of Regeneration Medicine and Stem Cell Research, the Parker Institute for Cancer Immunotherapy, the Bakar Computational Health Sciences Institute, the Department of Neurology, School of Medicine, University of California, San Francisco, 35 Medical Center Way, San Francisco, CA 94143

*Correspondence should be addressed to: [email protected]
Mailing address: 35 Medical Center Way, San Francisco, CA 94143
Tel: 415-502-2572

Declaration of Interests. C.W. and J.L. are listed as inventors on a pending patent application filed by UCSF related to part of the methods presented here for prostate cancer analysis. J.L. is a co-founder and a member of the scientific advisory board of SensOmics, Inc.

Running Title: deep learning for prostate cancer

Abstract

Our understanding of noncoding mutations in cancer genomes has been derived primarily from mutational recurrence analysis by aggregating clinical samples on a large scale. These cohort-based approaches cannot directly identify individual pathogenic noncoding mutations from personal cancer genomes. Therefore, although most somatic mutations are localized in the noncoding cancer genome, their effects on driving tumorigenesis and progression have not been systematically explored and noncoding somatic alleles have not been leveraged in current clinical practice to guide personalized screening, diagnosis, and treatment. Here we present a deep learning framework to capture pathogenic noncoding mutations in personal cancer genomes, which perturb gene regulation by altering chromatin architecture.
We deployed the system specifically for localized prostate cancer by integrating large-scale prostate cancer genomes and the prostate-specific epigenome. We exhaustively evaluated somatic mutations in each patient's genome and agnostically identified thousands of somatic alleles altering the prostate epigenome. Functional genomic analyses subsequently demonstrated that affected genes displayed differential expression in prostate tumor samples, were vulnerable to expression alterations, and were convergent onto androgen-receptor-mediated signaling pathways. Accumulation of pathogenic regulatory mutations in these affected genes was predictive of clinical observations, suggesting potential clinical utility of this approach. Overall, the deep learning framework has significantly expanded our view of somatic mutations in the vast noncoding genome, uncovered novel genes in localized prostate cancer, and will foster the development of personalized screening and therapeutic strategies for prostate cancer.

Statement of Significance. This study's characterization of the noncoding genome in prostate cancer reveals mutational signatures predictive of clinical observations, which may serve as a powerful prognostic tool in this disease.

Downloaded from cancerres.aacrjournals.org on October 6, 2021. © 2020 American Association for Cancer Research.

Introduction

To date, more than 81 million simple somatic mutations (base substitutions and short indels) from cancer genomes have been catalogued by the International Cancer Genome Consortium (ICGC)(1), among which >90% fall in non-coding regions.
However, current cancer genome analysis has been largely confined to somatic mutations in coding sequences, which account for only ~2% of the human genome, leaving the vast majority of the genome unexplored. Multiple lines of evidence suggest a significant involvement of non-coding somatic mutations in cancer genomes: (i) it has been established that more than 90% of functional loci in complex diseases reside in the non-coding genome(2,3), and therefore a significant contribution to cancer etiologies from perturbed non-coding regulatory elements would also be expected(4); (ii) specifically for cancer, individual case studies have identified several non-coding mutations driving tumorigenesis and progression, including the well-known somatic mutations in the TERT promoter(5) as well as the clustered somatic mutations affecting the cis-regulatory elements of FOXA1, which promotes prostate cancer cell proliferation(6); (iii) the landmark cancer epigenome study has revealed distinct chromatin architecture in tumors relative to normal tissues(7), prompting further investigation of somatic mutations that alter tissue-specific chromatin structure, leading to tumor formation and progression. Despite these considerations, the recent pan-cancer genome analysis unexpectedly reported a paucity of somatic driver mutations in the non-coding genome, identifying only a handful of non-coding somatic mutations displaying significant recurrence across tumor samples(8). Mutational recurrence has been the primary approach to inferring mutational pathogenicity; it infers pathogenicity indirectly from statistical enrichment and cannot directly assess the molecular effects of individual non-coding somatic mutations. As such, mutational recurrence analysis effectively identifies mutations forming “hotspots” in a patient cohort, but cannot capture individual pathogenic mutations from personal genomes.
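As a rough illustration of why recurrence analysis is a cohort-level statistic, the sketch below computes an enrichment p-value for one genomic window. It is a minimal sketch under stated assumptions, not the procedure of the cited pan-cancer study: the uniform Poisson background model, the function names `poisson_tail` and `recurrence_p_value`, and all numeric values are hypothetical.

```python
import math


def poisson_tail(k, lam, max_terms=1000):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly.

    Intended for small expected counts; the tail is truncated after max_terms terms.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam) * lam ** k / math.factorial(k)  # P(X = k)
    total = 0.0
    for i in range(k, k + max_terms):
        total += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return min(1.0, total)


def recurrence_p_value(n_mutated, n_samples, window_bp, rate_per_bp):
    """Probability of at least n_mutated mutations in one window across a cohort.

    Assumption (hypothetical): a uniform background rate per base per sample,
    so the expected count is n_samples * window_bp * rate_per_bp.
    """
    expected = n_samples * window_bp * rate_per_bp
    return poisson_tail(n_mutated, expected)


# Hypothetical cohort: 200 tumors, a 100-bp window, 1e-6 mutations per bp per tumor.
p_hotspot = recurrence_p_value(10, 200, 100, 1e-6)  # window mutated in 10 tumors
p_single = recurrence_p_value(1, 200, 100, 1e-6)    # window mutated in 1 tumor
print(p_hotspot, p_single)
```

The statistic depends only on how many tumors share a mutated window, not on what the allele does to the sequence, so a mutation seen in a single patient is scored the same way whether or not it is functionally consequential.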
Given the sporadic nature of somatic mutations, it is reasonable to expect that many pathogenic mutations do not form clusters, but individually exert their effects on tumor initiation, formation or progression. Identifying these individual mutations will not only significantly expand our view of the non-coding cancer genome, but will also foster the development of clinical tools to screen pathogenic regulatory mutations.

In addition to mutational recurrence analysis (or identifying mutation hotspots), several approaches have been proposed to individually annotate non-coding somatic mutations in cancer genomes (as reviewed in these excellent articles(9,10)). Some of the studies examined mutational localization in known regulatory regions, followed by motif analysis. However, many transcription factors bind to degenerate DNA sequences(11), and a single base change is more likely to be neutral than consequential. Therefore, merely examining mutational localization in regulatory elements is not sufficient to determine specific allelic effects on perturbing gene regulation. Evolutionary conservation was also integrated in the analysis; however, even if one somatic mutation affects a conserved site, the observed conservation could merely result from background selection. Some studies identified somatic mutations displaying allele-specific expression (ASE, or somatic eQTLs) in tumor samples(12,13). However, this association framework to identify allelic imbalance at the RNA level cannot confirm causative roles of somatic alleles.
Importantly, this practice also requires a large cohort of patient tumor tissues for genome and transcriptome sequencing, which is often impractical for regular clinical analyses. Taken together, like the mutational recurrence analysis, our current understanding of the non-coding genome in cancer has been derived largely from indirect inference from large-scale patient cohorts, which cannot directly determine the pathogenic effects of individual non-coding mutations from personal genomes. As such, non-coding mutations are often excluded from cancer genome analysis and have not been leveraged to guide personalized screening, diagnosis and treatment.

Because sequence composition of non-coding regulatory elements is not random, tissue-specific regulatory capacity of a given sequence could be predicted(14-16). Therefore, an allelic change that significantly alters regulatory capacity of a sequence element would be considered consequential leading to gene dysregulation. Like deleterious missense mutations altering protein structure, the deleteriousness of non-coding mutations is thus defined by their mutational impacts on altering epigenomic architecture in specific cell and tissue types. Given the sporadic nature of cancer somatic mutations that are sparsely localized across the genome, it is reasonable to assume that these mutations individually exert their effects; in this way, one can evaluate the effect of each somatic mutation one at a time across the genome and identify those disrupting epigenomic architecture in disease-related tissue and cell types.
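The one-mutation-at-a-time evaluation described above can be sketched as a comparison of predicted regulatory capacity between the reference and the mutated sequence. This is a minimal sketch and not the authors' implementation: `toy_model` is a hypothetical stand-in (a GC-content score) for a trained tissue-specific sequence model, and the function names, the 0-based position convention, and the example sequence are assumptions made for illustration.

```python
import math

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def apply_substitution(seq, pos, ref, alt):
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos].upper() != ref.upper():
        raise ValueError(f"reference base mismatch at position {pos}")
    return seq[:pos] + alt + seq[pos + 1:]


def toy_model(encoded):
    """Hypothetical placeholder for a trained model of regulatory capacity.

    Returns a probability-like score from GC fraction only; a real model
    would be learned from tissue-specific epigenomic data.
    """
    gc_fraction = sum(row[1] + row[2] for row in encoded) / len(encoded)
    return 1.0 / (1.0 + math.exp(-10.0 * (gc_fraction - 0.5)))


def mutation_effect(seq, pos, ref, alt, model=toy_model):
    """Score one substitution as model(alternate) - model(reference)."""
    ref_score = model(one_hot(seq))
    alt_score = model(one_hot(apply_substitution(seq, pos, ref, alt)))
    return alt_score - ref_score


# Evaluate two hypothetical substitutions independently, one at a time.
window = "ACGTACGT"
for pos, ref, alt in [(0, "A", "G"), (0, "A", "T")]:
    print(pos, ref, alt, mutation_effect(window, pos, ref, alt))
```

Scoring each allele separately reflects the assumption in the text that sparse somatic mutations act individually; mutations whose score difference is large in magnitude would be the candidates for disrupting the local epigenomic state.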
We herein leveraged a deep-learning-based framework to achieve this goal and applied this strategy to analyze somatic mutations in localized prostate cancer genomes, which serve as an excellent model based on the following considerations: (1) compared with metastatic cancers, localized prostate cancers are more likely to be affected by simple somatic mutations (base substitutions and short insertions and deletions, indels) than by somatic copy number aberrations(17). However,
