Learning Mutational Signatures and Their Multidimensional Genomic
ARTICLE https://doi.org/10.1038/s41467-021-23551-9 OPEN

Learning mutational signatures and their multidimensional genomic properties with TensorSignatures

Harald Vöhringer1, Arne Van Hoeck2, Edwin Cuppen2,3 & Moritz Gerstung1,4 ✉

We present TensorSignatures, an algorithm to learn mutational signatures jointly across different variant categories and their genomic localisation and properties. The analysis of 2778 primary and 3824 metastatic cancer genomes of the PCAWG consortium and the HMF cohort shows that all signatures operate dynamically in response to genomic states. The analysis pins differential spectra of UV mutagenesis found in active and inactive chromatin to global genome nucleotide excision repair. TensorSignatures accurately characterises transcription-associated mutagenesis in 7 different cancer types. The algorithm also extracts distinct signatures of replication- and double strand break repair-driven mutagenesis by APOBEC3A and 3B with differential numbers and length of mutation clusters. Finally, TensorSignatures reproduces a signature of somatic hypermutation generating highly clustered variants at transcription start sites of active genes in lymphoid leukaemia, distinct from a general and less clustered signature of Polη-driven translesion synthesis found in a broad range of cancer types. In summary, TensorSignatures elucidates complex mutational footprints by characterising their underlying processes with respect to a multitude of genomic variables.

1 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK. 2 Center for Molecular Medicine and Oncode Institute, University Medical Center Utrecht, Universiteitsweg 100, Utrecht, The Netherlands. 3 Hartwig Medical Foundation, Amsterdam, The Netherlands. 4 European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
✉ email: [email protected]

NATURE COMMUNICATIONS | (2021) 12:3628 | https://doi.org/10.1038/s41467-021-23551-9 | www.nature.com/naturecommunications

Cancer arises through the accumulation of mutations caused by multiple processes that leave behind distinct patterns of mutations on the DNA. A number of studies have analysed cancer genomes to extract such mutational signatures using computational pattern recognition algorithms such as non-negative matrix factorisation (NMF) over catalogues of single nucleotide variants (SNVs) and other mutation types1–8. So far, mutational signature analysis has provided more than 50 different single base substitution patterns, indicative of a range of endogenous mutational processes, as well as genetically acquired hypermutation and exogenous mutagen exposures9.

Mutational signature analysis via computational pattern recognition draws its strength from detecting recurrent patterns of mutations across catalogues of cancer genomes. As many mutational processes also generate characteristic multi-nucleotide variants (MNVs)10,11, insertions and deletions (indels)12–14, and structural variants (SVs)6,15–17, it appears valuable to jointly deconvolve broader mutational catalogues to further understand the multifaceted nature of mutagenesis.

Moreover, it has also been reported that mutagenesis depends on a range of additional genomic properties, such as the transcriptional orientation and the direction of replication18–20, and sometimes manifests as local hypermutation (kataegis)1. Additionally, epigenetic and local genomic properties can also influence mutation rates and spectra21–23. In fact, these phenomena may help to more precisely characterise the underlying mutational processes, but the large number of possible combinations makes the resulting multidimensional data structure unamenable to conventional matrix factorisation methods.

We present TensorSignatures, a multidimensional tensor factorisation framework incorporating the aforementioned features for a more comprehensive extraction of mutational signatures. We apply TensorSignatures to 2778 whole genomes from the Pan Cancer Analysis of Whole Genomes (PCAWG) consortium24, and validate our findings in an additional 3824 metastatic cancer whole genomes from the Hartwig Medical Foundation (HMF)25. The resulting tensor signatures add considerable detail to known mutational signatures in terms of their genomic determinants and broader mutational context. Strikingly, some signatures are further subdivided based on genomic properties, illustrating the differential manifestation of the same mutational process in different parts of the genome. This includes UV-mutagenesis and tobacco-associated mutations, manifesting at differential rates in active and quiescent chromatin, and enables the algorithm to detect the prevalence of transcription-associated mutagenesis18,26 in more cancers than currently appreciated. The incorporation of additional variant types enables TensorSignatures to delineate two APOBEC signatures manifesting as replication-associated mutations, or highly clustered SV-associated base substitutions indicative of APOBEC3A and 3B1,18,19,27–30. Finally, TensorSignatures confirms localised somatic hypermutation at transcription start sites in lymphoid neoplasms7, with a distinct spectrum from a mostly unclustered, genome-wide signature of translesion synthesis found in a range of other cancer types28. Taken together, TensorSignatures sheds light on the manifold influences that underlie mutagenesis and helps to pinpoint mutagenic influences by jointly learning mutation patterns and their genomic determinants. TensorSignatures is implemented using the powerful TensorFlow31 backend. The software is available as a Python package on the PyPI repository and results presented in this study can also be explored on a webserver (see section Code availability at the end of the article).

Results

Multidimensional genomic features of mutagenesis. It is common practice in mutational signature analysis to classify single base substitutions by expressing the mutated base pair in terms of its pyrimidine equivalent (C > A, C > G, C > T, T > A, T > C and T > G) plus the flanking 5′ and 3′ bases. We additionally categorised other mutation types into 91 MNV classes, 62 indel classes, and used the classification of SVs provided by the PCAWG Structural Variants Working Group17. In addition to the immediate base context, a number of genomic features have been described to influence mutation rates. Here, we use 5 different genomic annotations (transcription and replication strand orientation, nucleosomal occupancy, epigenetic states and clustered hypermutation) and generate 96-dimensional base substitution spectra for each possible combination of these genomic states separately and for each sample. Partitioning variants creates a seven-dimensional count tensor (a multidimensional array), owing to the multitude of possible combinations of different genomic features (Fig. 1a).

Both transcription and replication introduce a strand specificity, which provides a distinction of pyrimidine and purine base context, which is considered to be indistinguishable in the

[Figure 1, panels: a Structure of the SNV count tensor; b TensorSignature factorisation; c Factorisation of signature tensor.]

Fig. 1 A multidimensional tensor factorisation framework to extract mutational signatures. a Splitting variants by transcriptional and replicational strand, and genomic states creates an array of count matrices, a multidimensional tensor, in which each matrix harbours the mutation counts for each possible combination of genomic states. b TensorSignatures factorises a mutation count tensor (SNVs) into an exposure matrix and signature tensor. Simultaneously, other mutation types (MNVs, indels, SVs), represented as a conventional count matrix, are factorised using the same exposure matrix. c The signature tensor has itself a lower dimensional structure, defined by the product of strand-specific signatures, and coefficients reflecting the activity of the mutational process in a given genomic state combination.