<<

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

The landscape of the human body

Pablo E. García-Nieto, Ashby J. Morrison, Hunter B. Fraser* Department of Biology, Stanford University, Stanford CA *Correspondence: [email protected]

ABSTRACT Somatic in healthy tissues contribute to aging, , and initiation, yet remain largely uncharacterized. To gain a better understanding of their distribution and functional impacts, we leveraged the genomic information contained in the transcriptome to uniformly call somatic mutations from over 7,500 tissue samples, representing 36 distinct tissues. This catalog, containing over 280,000 mutations, revealed a wide diversity of tissue- specific mutation profiles associated with gene expression levels and chromatin states. We found pervasive negative selection acting on missense and nonsense mutations, except for mutations previously observed in cancer samples, which were under positive selection and were highly enriched in many healthy tissues. These findings reveal fundamental patterns of tissue-specific somatic evolution and shed light on aging and the earliest stages of tumorigenesis.

INTRODUCTION In humans, somatic mutations play a key role in and tumorigenesis1. Pioneering work on somatic evolution in cancer has led to the characterization of cancer driver genes2 and mutation signatures3; the interplay between chromatin, nuclear architecture, and the mutational landscape4–7; the evolutionary forces acting on somatic mutations8–11; and clinical implications of somatic mutations12. Compared to cancer research, the study of somatic mutations in healthy tissues is limited. Early studies focused on blood13,14 as it is readily accessible and because of the known effects of immune-driven somatic mutation. Recently, somatic mutations have been characterized in tissues like skin15, brain16,17, esophagus18,19 and colon20. These studies confirmed that cells harboring certain mutations expand clonally, and the number of clonal populations—as well as the total number of somatic mutations—increases with age. Additionally, recurrent positively- selected mutations in specific genes (e.g. NOTCH1) were observed. However, a more comprehensive understanding of somatic mutations across the human body has been limited by the small number of tissues studied to date. Most studies on somatic evolution in healthy tissues have sequenced DNA from biopsies to high coverage. However, the transcriptome also carries all the genomic information of a cell’s transcribed genome, in addition to RNA-specific mutations or edits. RNA-seq has been used to identify DNA variants21, and recently single-cell (sc) RNA-seq was used to call DNA somatic mutations in the pancreas of several people22. To systematically identify somatic mutations in the human body and to investigate their distribution and functional impact, we developed a method that leverages the genomic information carried by RNA to identify DNA somatic mutations while avoiding most sources of false positives. We applied it to infer somatic mutations across 36 non-cancerous tissues, allowing us to explore the landscape of somatic mutations throughout the human body.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

RESULTS

Somatic mutation calling across 36 non-disease tissues from 547 people

Our method considers genomic positions where we observed two alleles in the RNA-seq reads and assesses whether they are likely to be bona fide DNA somatic mutations (Fig 1a). We optimized the minimum levels of RNA-seq read depth, sequence quality, and number of reads supporting the variant allele to limit the impact of sequencing errors (Supp. Fig. 1a; see Methods). We then applied extensive filters to eliminate false-positives from biological and technical sources, including RNA editing, sequencing errors, and mapping errors (see Methods; Fig 1b-c, Supp. Fig. 2a; Supp. Table 1). To validate the method, we compared somatic mutation calls from 105 blood RNA-seq samples to exome DNA sequencing performed on the same samples23 (Fig. 1d). We observed a false-discovery rate (FDR) of 29% which represents the percentage of somatic mutations called from RNA-seq not having evidence in the corresponding DNA exome-seq sample (Fig. 1d, Supp. Fig. 1c; see methods). This is comparable to the 40% FDR in a previous study that inferred mutations from scRNA-seq in pancreas22. After applying the pipeline and filters to RNA-seq data from the GTEx project, we retained a total of 7,584 samples from 36 different tissues and 547 different individuals with no detectable cancer (Supp. Table 2). This resulted in a total of 280,843 unique mutations (Supp. Table 3), most of which were rare across the entire data set (median frequency = 0.026% of samples; Supp. Fig. 2b). We first investigated the factors influencing mutation counts per sample and tissue (see Methods). The main contributor was sequencing depth and to a lesser extent other biological and technical factors (Supp. Fig. 2c, Supp. Table 4). Tissues that have more mutations than expected from sequencing depth include those most often exposed to environmental or with a high cellular turnover like skin, lung, blood, esophagus mucosa, spleen, liver and small intestine (Fig. 2a). On the other end of the spectrum are those with low environmental exposure or low cellular turnover such as brain, adrenal gland, prostate, and several types of muscle — heart, esophagus muscularis, and skeletal muscle (Fig. 2a). We observed similar trends across all six possible point mutation types (complementary mutations – such as C>T and G>A – were collapsed because when one is detected the other is always present on the other DNA strand) (Supp. Fig. 3a). C>T and T>C mutations were the most abundant types followed by C>A (Supp. Fig. 3b; Supp. Table 5).

Mutation load is affected by age, sex, ethnicity, and natural selection

After controlling for all known technical factors (see Methods), we observed that age was positively correlated with mutation load across most tissues (Supp. Fig. 4a; 29/36 tissues with Spearman ⍴ > 0, binomial p = 1.6x10-4). Blood showed the most significant age association for all mutation types combined (Fig. 2b; 1st-4th-quartile Wilcoxon p = 0.013; Spearman ⍴ = 0.21, Benjamini-Hochberg [BH] FDR < 0.0001), whereas sun-exposed skin was the most significant for C>T mutations, the most common mutation associated with UV radiation (1st-4th-quartile Wilcoxon p = 0.00041; Spearman ⍴ = 0.17 BH-FDR = 0.02). We observed other strong age

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. a

Data Tissue-specific DNA soma�c muta�on calling Cohort Filter 547 Donors Map reads false-positives

RNA-seq Mismatch-aware mapping - Blacklisted regions - RNA edit sites 36 Tissues - Splicing-based 7584 samples Identify positions mismapping with 2 base calls - Sequencing error - Read position bias - Mapping quality bias Eliminate germline - Base sequencing bias variants - Read distance bias - Strand bias Sample extraction - VAF bias RNA-seq Library prep Call somatic - Tissue-specific and whole-body sytematic mutations artifacts Probabilistic method - Hypermutated samples based on local coverage

b Sources of varia�on c Whole-Blood removal of false posi�ves (%) Soma�c muta�ons Germline variants Total Read posi�on bias DNA 100 33.69 RNA edi�ng RNA pol errors Blacklisted region Mapping quality bias Transcrip�on 84.72 33.64 RNA Sequencing errors RNA edit Base quality bias Mapping and processing ar�facts Sequencing 74.24 30.4

Reads Splicing junc�on error Mapping quality vs strand bias 68.36 30.37 Sequencing error Variant distance bias Discard false-posi�ves 65.29 14.62 Soma�c muta�ons kept removed kept removed d Method valida�on Figure 1. A method to iden�fy DNA soma�c muta�ons from RNA-seq. a, A general overview of the method. Whole-Blood RNA-seq reads were downloaded from GTEx v7 (le�) and sample extraction processed to iden�fy posi�ons with two different base calls at a high confidence. Then, sources of biological and technical ar�facts were removed (right, see Methods). b, RNA-seq Exome-seq Schema�c illustra�ng poten�al sources of sequence varia�on. c, Average percentage of variants detected in 105 people blood RNA-seq that are retained a�er each step of filtering (see Methods). d, Valida�on of the method. For Mutation Mapping 105 individuals we compared variant calls from exome total Calling DNA-seq data with those from RNA-seq of the same T.G samples. Median FDR values per muta�on type are shown T.C and they represent the frac�on of muta�ons called in % mutations in RNA with alt allele RNA-seq for which there are no exome reads suppor�ng T.A not detected in exome the same variant (see Methods and Supplementary Figure

Mutation C.T 1c). Error bars represent the 95% confidence interval a�er bootstrapping 10,000 �mes. C.G

C.A

0 10 20 30 40 Median FDR (%) bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

associations in several brain regions, and in particular the basal ganglia (Supp. Fig. 4a), supporting the hypothesis that somatic mutation load could contribute to the increasing risk of neurodegenerative diseases with age24. In addition to age, other biological factors also contributed to somatic mutations. For example, women show a much greater mutation load in breast than men (Fig 2b; Wilcoxon p = 1.7x10-8). We also observed female-biased mutations (BH-FDR<0.1) in subcutaneous adipose, visceral adipose, liver, and the adrenal gland; in contrast, we found no significant male-biased mutations after multiple hypothesis correction (Supp. Fig. 4c; Supp. Table 6). Ethnicity can also affect mutation rates: we found a significant increase of C>T mutations in Caucasian sun-exposed skin compared to non-exposed skin, but no corresponding difference in African-Americans (Fig. 2c, Supp. Fig. 4b), likely due to protection against UV radiation provided by higher melanin content. Finally, we observed that the number of divisions in a tissue was weakly correlated with mutation load (see Supp. Note 1, Supp. Fig. 5). To investigate the role of natural selection on deleterious somatic mutations, we examined the variant allele frequency (VAF) within every sample where a given mutation was observed. If mutations are subject to negative selection, we would expect it be acting most weakly on synonymous mutations that do not change the amino acid sequence; more strongly on missense variants that change an amino acid; and most strongly on nonsense mutations that introduce a premature termination codon. Indeed, this was the pattern we observed, with synonymous variants having significantly higher VAF than both missense and nonsense variants (Fig. 2d). Since our mutations are called from the transcriptome, these results could be influenced by Nonsense Mediated Decay (NMD), which degrades transcripts containing premature termination codons. While we did observe evidence of NMD (Supp. Fig. 4d), we found that nonsense mutations in the last exon of genes – which are not subjected to NMD25 – still have significantly lower VAF compared to synonymous and missense mutations (Wilcoxon p = 1.9x10- 213 and p = 5.6x10-150, respectively; Supp. Fig. 4d), confirming the existence of negative selection against nonsense mutations. Therefore, in contrast to what has been observed in cancer26, we found evidence of negative selection acting against both missense and nonsense mutations.

Somatic mutation profiles are tissue-specific and associated with chromatin state

To visualize the tissue-specificity of somatic mutational patterns, we applied t-SNE27 to the full set of mutations called in each of the 7584 tissue samples (including the two bp flanking each mutation; see Methods; Fig 2e). We then used a silhouette score (SS) to quantify clustering of samples in this two-dimensional space (see Methods), where a value SS = 1 indicates samples that are maximally clustered within a group, and a value SS=0 means that samples are equidistant to samples within vs. outside a group. Grouping samples by their donor-of-origin results in a mean SS = 0, no different than expected by chance (Fig. 2f). However, grouping samples by tissue resulted in values 0.7 > SS > 0.1, suggesting tissue-specific mutation patterns. Spleen, blood, skin, liver, and esophagus mucosa exhibited the most coherent profiles (Fig. 2f, Supp. Fig. 6a,d); conversely artery, lung, stomach and thyroid had the least consistent profiles (Fig. 2f, Supp. Fig. 6e,f). Interestingly, some tissues cluster together as shown by SSs obtained by grouping samples from more than one tissue, suggesting shared mutagenic or repair processes (Fig 2f; Supp. Fig.

a b Skin c Whole Blood Blood Breast rho = 0.8 Sun exposed Skin sun-protected 07 p=0.0013 p = 2.2x10 p=0.00041 p=1.7x10-8 bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprintSkin sun-exposed (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. 50It is made available under aCC-BY-ND 4.0 International2 license. -6 p=2.5x10 1 1 High muta�on load Esophagus Mucosa 2e+05 Lung 1 Skin sun-exposed p=0.97

40 Thyroid Muscle 0 0 0 Skin sun-protected Tibial Artery Tibial Nerve 1 30 Mutations Colon Transverse 1e+05 Aorta Artery 1 C>T mutations (%) Liver Breast Adipose Subcutaneous Low muta�on load 1

Normalized mutations (all) Normalized mutations (all) Spleen Stomach Normalized mutations (C>T) 2 Small Intestine Adipose Visceral Prostate Esophagus Muscularis Pituitary Pancreas Heart Atrial Appendage Ovary 20 Coronary Artery Heart Left Ventricle Colon Sigmoid Brain Cortex Adrenal Gland Esophagus Gastroesophageal Junction 3 2 Brain Nucleus accumbens basal ganglia Brain Frontal Cortex Brain Hypothalamus Brain Caudate basal ganglia 2 Brain Hippocampus Brain Putamen basal ganglia 2044 6170 2146 6270 MF 4810 16 Caucasian Total sequencing depth (billions of reads) Age Age Sex African American d 0.08 e p < 1x10300 30 Adipose Subcutaneous Esophagus Muscularis 204 p = 1.3x10 Adipose Visceral Heart Atrial Appendage 300 p < 1x10 Adrenal Gland Heart Left Ventricle 0.06 20 Aorta Artery Liver

Coronary Artery Lung Tibial Artery Muscle 10 Brain Caudate basal ganglia Tibial Nerve 0.04 Brain Cortex Ovary Brain Frontal Cortex Pancreas Median VAF 0 Brain Hippocampus Pituitary Brain Hypothalamus Prostate tSNE dimension 2 Brain Nucleus accumbens basal ganglia Skin protected 0.02 10 Brain Putamen basal ganglia Skin Sun Exposed Breast Small Intestine Colon Sigmoid Spleen 20 Colon Transverse Stomach 0.00 Esophagus Gastroesophageal Junction Thyroid

20 10 0 10 20 Esophagus Mucosa Blood Missense tSNE dimension 1 Synonymous Nonsense

f 0.8 Figure 2. Cross-�ssue analysis of soma�c muta�ons. a, The total number of muta�ons 0.6 observed in a �ssue depends on the sequencing depth of that �ssue. Sequencing depth is defined as 0.4 the cumula�ve amount of uniquely mapped reads 0.2 across all samples of a �ssue. A linear regression Random expecta�on line is shown in blue; �ssues above it exhibit more 0.0 Avergae silhouette muta�ons than expected by sequencing depth,

Liver Lung while �ssues below it show fewer muta�ons than Spleen Muscle Ovary BreastThyroid Pancreas Pituitary Prostate Stomach Whole Blood Brain Cortex Tibial ArteryAorta Artery Tibial Nerve expected. rho is the Spearman coefficient. b, Small IntestineMuscle & Heart Adrenal Gland Colon Sigmoid Skin sun-exposedBrain (7 regions) AdiposeColon Visceral Transverse Coronary Artery SkinEsophagus sun-protected Mucosa Brain Frontal Cortex Brain HypothalamusBrain Hippocampus Heart Left Ventricle Grouped by people HeartAdipose Atrial Appendage SubcutaneousEsophagus Muscularis Examples of significant muta�on associa�ons with Brain Caudate basal ganglia Brain Putamen basal ganglia Colon Treansverse & Small Int. age and biological sex (see Supplementary Figure 4 Skin (sun-exposed and not exposed) Brain Nucleus accumbens basal ganglia Esophagus Gastroesophageal Junction and Supp. Table 6 for all �ssue data). Age ranges g 20 H3K9me3 represent the youngest and oldest quar�les for Posi�ve associa�on with muta�on load H3K27me3 each �ssue. To control for sequencing depth and other technical ar�facts, muta�on values were H3K36me3 10 obtained as the residuals from a linear regression value) n.s. Bonferroni (see Methods). P-values are from a two-sided 0 Mann-Whitney test. c, Caucasian sun-exposed skin shows a higher percentage of C>T muta�ons compared to sun-protected skin, while no such 10 difference was seen for African-American skin. P- values are from two-sided Mann-Whitney tests. d, Signed log10(p Nega�ve associa�on with muta�on load 20 Median variant allele frequency (VAF) for each muta�on type based on their impact to the amino Breast Ovary Spleen Lung Muscle Pancreas Stomach Aorta Artery acid sequence; error bars represent the 95% Brain SmallCortex IntestineColon Sigmoid Adrenal Gland Esophagus MucosaSkin sun-protected Heart Left VentricleBrain FrontalBrain CortexHippocampus confidence interval a�er bootstrapping 1000 �mes; Adipose Subcutaneous Brain Putamen basal ganglia p-values are from two-sided Mann-Whitney tests. e, tSNE plot constructed from a normalized penta-nucleo�de muta�on profile (the mutated base plus two nucleo�des in each direc�on; see Methods for normaliza�on details) and all samples in this study. f, Average silhoue�e scores represen�ng the coherence of selected groups of samples from the tSNE space in panel e; a score of 1 represents maximal clustering, whereas 0 represents no clustering (see Methods). Grouping was performed by �ssue-of-origin, or mul�ple �ssues combined (red labels). “Grouped by people” (green label) is an average silhoue�e score a�er grouping samples by their person-of-origin from 20 randomly selected people. The blue dashed line represents the average random score expecta�on a�er permu�ng �ssue labels (see Methods) and the blue stripes are +/- two standard devia�ons. Error bars in points represent the 95% confidence interval based on bootstrapping 10,000 �mes. g, Muta�on load is posi�vely associated with H3K9me3 and/or nega�vely associated with H3K36me3 across most �ssues analyzed. P-value of associa�on was obtained from a linear regression using all chroma�n marks as explanatory variables (see Methods). Gray range denotes non-significant p-values a�er Bonferroni correc�on. bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

6b,c; e.g. skeletal muscle and heart, SS = 0.38; seven brain regions, SS = 0.53; colon and small intestine, SS = 0.31). These results suggest that the somatic mutation landscape of the transcribed genome is largely defined by tissue-of-origin. To explore the source of this tissue-specificity, we hypothesized that chromatin may play an important role, as it does in cancer4,5. We assessed the association between tissue-specific mutation rates and five chromatin marks measured across many human tissues by the Roadmap Epigenomics Project28 (Supp. Table 7, see Methods). Across most tissues (except brain) there is a strong positive association between mutation rate and a marker for heterochromatin (H3K9me3); conversely, we found strong negative associations with actively transcribed chromatin (H3K36me3; Fig. 2g). These results suggest that chromatin associations with mutation rates in the human body are nearly ubiquitous and arise prior to cancer development.

Mutational strand asymmetries are widespread and vary across individuals

Cancer mutations often occur preferentially on one DNA strand, either in reference to transcription or to DNA replication (leading vs lagging strand)29. Since our mutations are derived from transcribed exons, we focused on transcriptional mutational strand asymmetries. We observed the strongest asymmetry for C>A mutations, which preferentially occur on the transcribed strand in most tissues except for brain (Fig. 3a,b; mean [transcribed/non- transcribed] ratio = 1.6 in non-brain, 0.98 in brain). C>A mutation load on the transcribed strand also showed the greatest variation between individuals (Fig. 3a,b). C>A asymmetries were often correlated between different tissues of the same person (Fig. 3c, Supp. Fig. 7a), suggesting a common factor can induce an over-accumulation of C>A mutations on the transcribed strand (or equivalently, G>T mutations on the non-transcribed strand) across many of an individual’s tissues. This factor is of unknown origin; it could be intrinsic (e.g. genetic), extrinsic (e.g. exposure to a ), or a combination of the two. Interestingly, this same C>A asymmetry has been observed in lung and ovarian cancer29 and in cell lines exposed to different environmental agents30; our results suggest it is far more widespread, occurring in most tissues except for brain. Strand asymmetries could potentially arise from RNA-specific edits or transcriptional errors. To test if these are bona fide DNA mutations, we examined exome data from matching blood samples, and observed general agreement between the level of asymmetry per gene as measured by DNA vs. RNA-seq (Fig. 3d). In addition, we found blood samples to have the highest levels for C>T and T>C asymmetries. However, the directionality of these asymmetries suggests that samples with high biases may be driven by the two major types of RNA editing: A>I (represented in our data by T>C on the transcribed strand), and C>U (represented in our data by C>T on the transcribed strand) (Fig. 3a, right panel). Interestingly, estimating cell type abundances in each blood sample (see Methods) revealed that the abundances of two cell types, resting NK cells and CD8+ T cells, showed the strongest associations with the extent of both of these asymmetries (Supp. Fig 7b,c). Consistent with this result, these same two cell types have been shown to have increased levels of both types of editing in stress conditions31. This suggests that some RNA editing sites may be present in our catalog of somatic mutations, though we did not observe an increased FDR for these two mutation types in blood (Fig. 1d), suggesting that RNA editing has not substantially inflated our FDRs.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a Mutations Average aCC-BY-NDMutations 4.0 International S.D. licenseStrand. fold-change 2 10 1 2 3 1130 10 1 Tissue ZScore Tissue ZScore log2(mean(trans) / mean(non-trans)) C>A C>G C>T T>A T>C T>G C>A C>G C>T T>A T>C T>G

Heart Atrial Appendage Adipose Visceral Heart Left Ventricle Colon Sigmoid Adipose Subcutaneous Aorta Artery Adrenal Gland Muscle Esophagus Gastroesophageal Junction Esophagus Muscularis Tibial Nerve Tibial Artery Coronary Artery Prostate Spleen Skin sun-protected Stomach Ovary Breast Pancreas Lung Small Intestine Colon Transverse Thyroid Esophagus Mucosa Skin sun-exposed Whole Blood Liver Brain Cortex Brain Nucleus accumbens basal ganglia Pituitary Brain Hippocampus Brain Frontal Cortex Brain Hypothalamus Brain Caudate basal ganglia

Transcribed strand? Brain Putamen basal ganglia NY NY NY NY NYNY NYNYNY NYNY NY b Mutations Average Mutations S.D. Figure 3. Soma�c muta�onal strand 3 asymmetries. a, Muta�on average and S.D. 2 2 for each strand with respect to

1 transcrip�on (le� and middle panels) and 1

the ra�o of mean muta�ons on the

0 transcribed over the non-transcribed 0

strand (right panel). b, Distribu�on of z- Tissue Z Score 1 1 scores for muta�on averages and standard NYNYNYNYNYNY NYNYNYNYNYNY devia�ons on each strand (from panel a) c Transcribed strand? across all samples. c, Example of intra- individual correla�on of C>A strand asymmetry in two �ssues; each point C>A C>G C>T 6 rho =0.4 rho =0.17 rho =0.064 represents an individual for which we p = 1.2ex10-8 p = 0.013 2 p = 0.32 generated muta�on maps in the two 2 4 �ssues (see Supp. Fig. 7a for all �ssue 1 pairwise comparisons). d, Correla�on of 2 0 muta�onal strand biases between calls 0 0 from RNA-seq and matched DNA-seq; each point represents the muta�on strand bias 2 1 2 observed in one gene across all 105 blood 2 samples. Blue lines in all sca�er plots are 4 2.5 0.0 2.5 5.0 3 2 10 1 2 3 2 10 1 2 3 based on a linear regression; rho values are Tibial Nerve mutation strand bias

log2(transcribed/non transcribed) the Spearman correla�on coefficients. Tibial Artery mutation strand bias log2(transcribed/nontranscribed)

d C>A C>G C>T rho =0.45 rho =0.74 -11 4 p = 0.00018 4 p =1x10 4

0 0 0

4 4 4 rho =0.46 p = 6.3x1011 40 4 40 4 50 5 T>A T>C T>G rho =0.61 rho =0.38 p = 0.00013 p = 1.4x106 4 5 4

0 0 0 4 4 5 rho =0.68 p = 8.5x108 Blood mutations — DNA-based

log2(transcribed/non transcribed) 40 4 5.0 0.0 5.0 40 4 8 Blood mutations — RNA-based log2(transcribed/nontranscribed) bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Gene expression implicates pathways associated with somatic mutation load

To explore cellular factors accompanying an increase in somatic mutation load across non-disease tissues, we performed an unbiased tissue-level search for genes whose expression was associated—either positively or negatively—with exome-wide mutational load (see Methods). These enrichments may reflect a mixture of causal scenarios: gene expression impacting mutations or vice versa, or both driven by a third variable. Here we focused on the most-abundant C>T mutations; other mutation types are described in the supplement (Supp. Figs. 9,10). While most genes were tested in the majority of tissues (Supp. Fig. 8a), most significant associations were specific to only 1-3 tissues (Bonferroni-corrected p < 0.05; Fig. 4a, Supp. Fig. 8b). Among genes positively associated with mutation load in multiple tissues (see Methods), we observed enrichments for GO categories that included nucleotide excision repair, cellular transport, cell adhesion, and macroautophagy; and categories including immune response, keratinization, and cell polarity for negative expression-mutation associations (Fig. 4b, Supp. Tables 8,9). Consistent with these results, several of these same processes (e.g. cellular transport, autophagy and cell adhesion) are known to be associated with mutation load in cancer or normal cells32–34. To further explore the contribution of different DNA repair pathways to , we focused on the associations between the expression of repair pathway members with mutation load. Specifically, we analyzed genes involved in double-strand break (DSB) repair, mismatch repair (MMR), nucleotide-excision repair (NER), base excision repair (BER), the DNA deaminases APOBEC3A/B and the translesion polymerase POLH. We found 14 associations at a <20% FDR, 8 at <10% FDR, and 2 at <5% FDR (Fig. 4c blue asterisks; Supp. Fig. 8c-g) which are further described in Supplementary Note 2. We then tested genes in these pathways for consistent associations with mutation load across tissues (see Methods; Fig. 4c, left panel; Supp. Fig. 10a-d). We observed significant hits from NER (XPA, FDR=0.01; XPC, FDR=0.07; DDB1, FDR=0.07), MMR (MLH1, FDR=0.05; MSH6, FDR=0.16; PMS2 FDR=0.17), and BER (NEIL2, FDR=0.15). The associations were generally in the expected direction, based on what is known about each gene (see Supp. Note 3); for example, all associations of MLH1 were negative, indicating lower expression associated with higher mutation load (Fig. 4c, Supp. Fig. 8e). MLH1 silencing contributes to cancer development and mutagenesis35, and our results suggest that natural variation of MLH1 expression is associated with mutagenesis across many tissues in the same direction as in cancer. To further explore factors that may contribute to mutation load, we analyzed the expression of entire pathways or functionally related gene sets in each tissue (see Methods). MMR, BER, and NER are all strongly associated with mutation load in multiple tissues, but with very little overlap among them (Fig. 4d; Supp. Fig. 10e-i). As a result, these associations are primarily dominated by just one or two pathways per tissue. In summary, most transcriptional signatures associated with mutation load are tissue- specific, however, genes associated with mutation load in several tissues are enriched in a number of pathways including DNA repair. These results paint a complex landscape where mutational load across non-disease tissues is associated with a variety of cellular functions.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. a b Nega�ve gene expression associa�ons Term ID Descrip�on FDR 45.7% (13170) GO:0051641 Cellular localizaton 0.0000153 GO:0046907 Intracellular transport 0.0000652 10000 GO:0007155 Cell adhesion 0.00317 GO:0016032 Viral process 0.00319 GO:0000715 Nucleo�de-excision repair, DNA damage recogniton 0.00985 64.35% (5373) GO:0016241 Regula�on of macroautophagy 0.0183 5000 77.79%91.22%

No. of genes (3874) (3869) 96.68% Posi�ve gene expression associa�ons (1572) 98.46% 99.07%99.41% 99.7% 99.89%99.98%99.99%99.99% 100% Term ID Descrip�on FDR (513) 0 (178) (97) (84) (55) (24) (3) (2) (2) GO:0002250 Adap�ve immune response 0.0000547 012345 6 7 8 9 10 11 12 14 GO:0031424 Kera�nizaton 0.000583 No. of tissues in which a gene is associated GO:0030155 Regula�on of cell adhesion 0.00119 with mutation load GO:1902531 Regula�on of intracellular signal transduc�on 0.00124 GO:0007163 Establishment or maintenance of cell polarity 0.00813 GO:0006810 Transport 0.0292

c FDR(%) -log10(signed pval)

10 5 0510 10-5 5-10.1-1 100-2020-1515-10 APOBEC3A C deamination APOBEC3B FDR(%) POLH Translesion pol BRCA2 DSB repair BRCA1 * POLD1 EXO1 d 10-5 5-10.1-1 100-2020-1515-10 MSH2 * MSH3 ** MMR C deamination MSH6 Translesion pol MLH1 ** * PMS2 DSB repair XPC MMR XPA ** NER DDB1 NER DDB2 BER NEIL1 ***** * Liver

NEIL2 Lung

*** Ovary Breast Spleen Muscle NTHL1 Thyroid Pituitary * Prostate Stomach OGG1 ** BER Pancreas Aorta Artery Tibial Nerve Tibial Artery Brain Cortex MUTYH * Whole Blood Adrenal Gland Small Intestine Colon Sigmoid

POLB Coronary Artery Adipose Visceral Colon Transverse Skin sun-exposed

POLL Skin sun-protected

** Heart Left Ventricle Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Adipose Subcutaneous Heart Atrial Appendage Liver Lung Ovary Breast Brain Caudate basal ganglia Spleen Muscle Brain Putamen basal ganglia Thyroid Pituitary Prostate Stomach Pancreas Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood Adrenal Gland Small Intestine Colon Sigmoid Coronary Artery Adipose Visceral Esophagus Gastroesophageal Junction Colon Transverse Brain Nucleus accumbens basal ganglia Skin sun-exposed Skin sun-protected Heart Left Ventricle Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Adipose Subcutaneous Heart Atrial Appendage Brain Caudate basal ganglia Brain Putamen basal ganglia Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia

Figure 4. Muta�on load is associated with the expression of genes and pathways. a, Histogram of the number of �ssues for which each gene was significantly (Bonferroni corrected p < 0.05) associated with muta�on load. Associa�ons were es�mated for each �ssue using linear models controlling for popula�on structure and biological and technical cofactors (see Methods). b, Genes whose expression was nega�vely (top) or posi�vely (bo�om) associated with C>T muta�on load in mul�ple �ssues are enriched in these representa�ve GO categories (see Methods and Supp. Tables 8,9). c, Individual gene-�ssue associa�ons between C>T muta�ons and expression of genes involved in DNA repair or DNA mutagenesis (right panel). Blue asterisks denote significant associa�ons using a permuta�on-based FDR strategy (see Methods; * FDR < 0.2, ** FDR < 0.1, *** FDR < 0.05). Shown on the le� panel are genes whose expression was associated with muta�on load across all �ssues more than expected by chance at the indicated FDR (see Methods). d, Group-level gene expression associa�ons of the shown pathways and C>T muta�ons across �ssues (see Methods). Heatmap columns in c and d are ordered based on a hierarchical clustering. bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Cancer driver genes are enriched for mutations and under positive selection in non-disease tissues

To investigate the extent to which cancer-associated mutations exist in a pre-cancerous state, we calculated the enrichment of all COSMIC36 cancer point mutations in our mutation maps (see Methods). Many samples were highly enriched (Hypergeometric Bonferroni-corrected p<0.05), with some having more than 20% overlap with COSMIC mutations (Fig. 5a). In contrast, we observed almost no significant overlaps in a negative control (using a permuted set of mutations per-sample, that conserves mutation frequencies and genomic regions from the original set of mutations; see methods; Supp. Fig. 11a). Sun-exposed skin had the highest level of overlap (Hypergeometric BH-FDR<0.05), with 100% of samples having significant enrichments, followed by sun-protected skin and skeletal muscle (Fig. 5a). We observed the lowest overlaps across the seven brain regions, aorta, and spleen. Focusing our analysis on a panel of 53 known cancer driver genes37 (Supp. Table 10), we observed mutations in 31 of them (after excluding potential false-positive mutations; see Methods and Supp. Table 11). Some tissues showed significantly greater or lower mutation rates across these genes (Fig. 5b). Sun-exposed skin was once again the most significant tissue, with several others at FDR < 0.05, such as muscle and heart (Fig. 5b). Interestingly, MAP2K1 and RRAS2 are highly mutated in muscle, and IDH2 and PPP2R1A are highly mutated in both heart and skeletal muscle (Fig. 5c). Conversely, several tissues—including most brain regions—had significantly lower mutation rates than average in these cancer drivers (Fig. 5b). Observed mutations in cancer driver genes were then classified by their impact on protein sequences. We found that some genes were dominated by missense mutations, such as IDH2, while others showed an excess of synonymous mutations, such as MAP2K1 (Fig. 5d). To assess the selective pressures acting on these genes we used a dN/dS approach, which assesses the ratio of non-synonymous to synonymous mutations while accounting for sequence composition and variable mutation rates across the genome38 (see Methods). dN/dS values close to 1 represent little or no detectable selection, dN/dS>1 suggests positive selection, and dN/dS<1 suggests purifying selection. Our analysis showed that most of these cancer drivers are evolving under positive selection (Fig. 5e). Missense mutations had a mean dN/dS=1.06 whereas nonsense mutations had a mean dN/dS=9.14 (indicating over 9-fold enrichment for nonsense mutations compared to the expectation based on synonymous mutations in the same genes); the latter is significantly more than expected from genome-wide values (permutation-based p=0.1 and p=0.009, respectively; Fig. 5e). NOTCH1 has been recently observed to evolve under positive selection in non-cancerous skin15 and esophagus18,19. Consistent with this, we found the highest rates of NOTCH1 mutations in these same two tissues (Fig. 5c), with strong signals of positive selection (NOTCH1 missense dN/dS = 1.46, nonsense dN/dS = 14.04). We also observed NOTCH1 mutations at slightly lower rates in the small intestine and tibial artery (Fig. 5c). To further explore the effects of selection on cancer-associated mutations, we performed a VAF analysis (similar to Fig. 2d). Using the large catalog of COSMIC cancer point mutations (as in Fig. 5a), we created two subsets of our somatic mutation list: those that have been observed in cancer samples, and those at precisely the same location as a known cancer mutation, but with a different base change (which act as well-matched negative controls). We then separated each

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not a Significant samples certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is madeenriched available in cancer muta under�ons 25 28% 22% 32% 27% 57% 45% 35% 33% 64% 37% 44% 45% 48% 61% 72%a57%CC-BY-ND65% 56% 84% 4.078% International57% 61% 87% 69% 70%license95% 61%. 83% 75% 85% 78% 96% 81% 96% 99% 100% (Bonferroni p−value < 0.05)

20 log10(Bonferroni pvalue) >30 15 20

10 10

5 per sample(%) 0

0 Cancer COSMIC mutations

Spleen Breast Lung Ovary Liver Pituitary Thyroid Prostate Pancreas StomachMuscle Aorta Artery Tibial Artery Tibial Nerve Brain WholeCortex BloodColon Sigmoid Adrenal Gland Small Intestine Coronary Artery Adipose Visceral Colon Transverse Skin sun-exposed Brain FrontalBrain Cortex HippocampusBrain Hypothalamus Heart Left VentricleEsophagus MucosaSkin sun-protected Adipose Subcutaneous Esophagus MuscularisHeart Atrial Appendage Brain PutamenBrain basal Caudate ganglia basal ganglia Brain Nucleusb accumbens basal ganglia Esophagus Gastroesophageal Junction 4e04 *** *** 3e04 *** ***

2e04 * * n.s. n.s. n.s. n.s. Tissue-wide average n.s. n.s. n.s. * n.s. * ** * * 1e04 * ** ** ** *** *** ** *** *** *** *** ** *** *** *** *** 0e+00 Total mutationin tissue

Mutations in driver genes Liver Lung Muscle Ovary Breast Spleen Thyroid Prostate Stomach Pancreas Pituitary Tibial Nerve Aorta ArteryTibial Artery Brain Cortex Small Intestine Adrenal GlandColon Sigmoid Whole Blood Skin sun-exposed Coronary Artery Adipose Visceral Colon Transverse EsophagusHeart Mucosa Left Ventricle Skin sun-protected Brain Frontal Cortex Brain Hypothalamus Heart AtrialAdipose Appendage SubcutaneousEsophagus Muscularis Brain Putamen basal ganglia Brain Caudate basal ganglia Mutations in driver Log10 Esophagus Gastroesophageal() Junction Brain Nucleus accumbens basal ganglia c Total mutations in tissue d Missense Nonsense Stop loss Synonymous e 5.5 4 SF3B1 SF3B1 n = 182 p = 0.009 NFE2L2 NFE2L2 n = 96 CTNNB1 IDH2 IDH2 n = 65 6 ERBB2 ERBB2 n = 43 NFE2L2 FGFR3 FGFR3 n = 42 PPP2R1A PPP2R1A n = 38 CTNNB1 CTNNB1 n = 37 RHOA RHOA n = 30 NOTCH1 NOTCH1 n = 25 4 p = 0.1 ZFP36L2 ZFP36L2 n = 23 CTNNB1 NOTCH1 KEAP1 KEAP1 n = 22 RRAS2 RRAS2 n = 22 NFE2L2 RAC1 RAC1 n = 15 ZFP36L2 AKT1 AKT1 n = 14 ERBB2 MAP2K1 MAP2K1 n = 10 2 PPP2R1A SPOP SPOP n = 5 ZFP36L2 SF3B1 CDH1 CDH1 n = 4 IDH2 EP300 EP300 n = 4 KEAP1 ERBB2

TGFBR2 TGFBR2 n = 4 log2(dN/dS) NOTCH1 ERBB3 ERBB3 n = 3 0 PPP2R1A RAC1 IDH1 IDH1 n = 3 PIK3CA PIK3CA n = 3 FGFR3 DICER1 DICER1 n = 2 RHOA EGFR EGFR n = 2 FBXW7 FBXW7 n = 2 FGFR2 FGFR2 RRAS2 n = 2 2 GNA11 GNA11 n = 1 SF3B1 PIK3R1 PIK3R1 n = 1 RUNX1 RUNX1 n = 1 MAP2K1 SMAD2 SMAD2 n = 1 AKT1 TP53 TP53 n = 1

Lung Liver MuscleOvary Breast Spleen 0 25 50 75 100 Prostate StomachThyroid Pancreas Pituitary Missense Nonsense Tibial Nerve Tibial Artery Aorta Artery Brain Cortex Colon Sigmoid AdrenalSmall Gland Intestine Whole Blood Coronary Artery Adipose Visceral Colon Transverse Mutations (%) SkinEsophagus sun-exposedHeart Left Mucosa Ventricle Skin sun-protected Brain Frontal Cortex Brain Hypothalamus Adipose SubcutaneousEsophagus MuscularisHeart Atrial Appendage Brain Putamen basal ganglia Brain Caudate basal ganglia

Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia f Mutations In cancer Not in cancer g 300 p < 1x10 5 Rac1 p = 2.3x10204 5 Notch1

54 0 p = 3.1x10 Mutations # Ras 0.075 0 0 100 192aa # Mutations #

0 400 800 1200 1600 2000 2555aa 6 Ppp2r1a

11 Rhoa 0.050 0 # Mutations # HEAT 2 HEAT 2HEAT 2 0 0 100 200 300 400 500 589aa # Mutations # Ras Median VAF 0 100 193aa 0.025 20 Idh2 Silent Missense Nonsense 0 Oncogenic* 3D clustered hotspot* Mutations # Iso dh 0.000 0 100 200 300 400 452aa

Missense Synonymous Nonsense bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. Figure 5. Cancer driver genes evolve under strong posi�ve selec�on, and cancer muta�ons are enriched in healthy �ssues. a, Percentage of COSMIC cancer muta�ons observed per sample and grouped by �ssue; p-values for enrichment were calculated using a hypergeometric test accoun�ng for sequencing coverage, total number of muta�ons per sample, and total number of COSMIC muta�ons, and the three possible alternate alleles that any given reference allele can have (see Methods). P-values are Bonferroni-corrected across all samples. b, Rela�ve muta�on rates of a selected group of 53 genes known to carry cancer driver muta�ons37 (only 31 of them had at least 1 muta�on in this study); the �ssue-wide average is indicated with the do�ed line. Significant devia�on from the �ssue-wide average was calculated using the binomial distribu�on and the �ssue-wide average muta�on rate. Benjamini-Hochberg FDR: *** (FDR < 0.001), ** (FDR < 0.01), * (FDR < 0.05). c, Individual muta�on rates for each cancer driver gene across all �ssues. d, Percentage of muta�ons for each cancer driver gene stra�fied by impact to amino acid sequence; n is the total number of muta�ons observed in a gene. e, dN/dS values for missense (blue) and nonsense muta�ons (orange) in cancer driver genes calculated using dndsloc9,38 (see Methods); averages per group are shown as rhomboids and their respec�ve genome-wide averages are shown as dashed lines along with their 95% confidence intervals a�er bootstrapping 10,000 �mes. P-values indicate the probability of observing a higher average dN/dS from 10,000 equally-sized randomly-sampled groups of genes (see Methods). f, Median variant allele frequency (VAF) for each muta�on type based on their impact to the amino acid sequence and colored by their cancer status. Muta�ons “in cancer” (purple bars) are those that overlap with the COSMIC database in both base change and posi�on, and muta�ons “not in cancer” (yellow bars) are those that overlap with COSMIC only in posi�on but not in base change. Error bars represent the 95% confidence interval a�er bootstrapping 1000 �mes; p- values are from two-sided Mann-Whitney tests. g, Muta�on maps of five cancer driver genes; oncogenic state was obtained from oncoKB39, and clustered muta�on annota�ons were obtained from the databases Cancer Hotspots42 and 3D Hotspots of muta�ons occurring in close proximity at the protein level43. bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

of these two subsets into three mutation types: synonymous, missense, and nonsense. In all three cases, the VAFs of the cancer-associated mutations were significantly higher than the matched non-cancer-associated ones (Fig. 5f). Using the synonymous non-cancer mutations as representative of neutrality (similar to dN/dS), we found that all three cancer-associated mutation types had higher VAF than these neutral proxies (Fig. 5f), suggesting the action of positive (and not just weaker negative) selection. Interestingly, the ratio of cancer/non-cancer mutation VAFs was lowest for synonymous (1.4-fold), intermediate for missense (2.0-fold), and highest for nonsense (5.2-fold), suggesting stronger positive selection for more extreme mutations. These results are consistent with the dN/dS analysis in Fig. 5e, which also showed the strongest positive selection on nonsense mutations in cancer driver genes, but the VAF analysis has substantially more power due to the greater number of mutations analyzed. We also found a number of specific mutations manually curated as oncogenic or likely oncogenic from OncoKB39 (Supp. Table 12). For instance, all nonsense mutations that we observed in NOTCH1 have been labelled as oncogenic (Fig. 5g). We found many other oncogenic mutations as well; for example, RHOA and RAC1 had oncogenic mutations in their Ras domains (Fig. 5g). Together these results show that cancer mutations are enriched in non-disease tissues and not only do they accumulate in driver genes, but the majority of these genes are evolving under positive selection, suggesting that these mutations increase cellular proliferation well before any cancer is observed.

DISCUSSION We developed a method to detect rare somatic mutations from RNA-seq data and applied it to over 7,500 tissue samples (Fig. 1). To our knowledge this is the largest map to date of somatic mutations in non-cancerous tissues. It has been proposed that somatic mutations contribute to aging and organ deterioration40; consistent with this, we observed a positive correlation between age and mutation burden in most tissues. Interestingly, several brain regions are among the tissues exhibiting stronger age correlation, and somatic mutations have been shown to have a role in neurodegeneration24. We observed largely tissue-specific behaviors and some pervasive observations shared across tissues. Mutation profiles are defined by their tissue-of-origin, mutation maps are delineated by tissue-specific chromatin organization, and transcriptional signatures associated with mutation load are highly tissue-specific. These results suggest that different cell types are subjected to different evolutionary paths that could be dependent on environmental or developmental differences. For example, while most samples exhibit tissue-specific mutation profiles, some others like transverse colon and the small intestine have similar profiles. Additionally, we observed that genes whose expression is associated with mutation load in several tissues are enriched in DNA repair, autophagy, immune response, cellular transport, cell adhesion and viral processes; and while these functions have been implicated in mutagenesis in cancer32–34, our results highlight how expression variation of these genes associates with mutational variation in healthy tissues. Cancer mutations are enriched across many organs. Muscle and heart tissue are particularly interesting because they have lower-than-expected mutation rates, but those

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

mutations are highly enriched for cancer mutations and had high mutation rates in cancer drivers. Sarcomas are tumors originated from soft tissues – including muscle – that are relatively uncommon and have a low density of point mutations but high copy number variation41. Our results show that low mutation rates are also observed in these soft tissues, but compared to tumors, cancer mutations are frequent. The functional implications of this observation will need to be explored in further detail. Positive selection of driver genes has been recently observed in healthy tissues15,18. Accordingly, we confirmed that NOTCH1 is under positive selection in skin and esophagus. We found other genes positively selected both broadly (e.g. IDH2, CTNNB1, NFE2L2) and with more tissue-specific patterns (e.g. KEAP1 in prostate, thyroid and muscle; RAC1 in skin, esophagus, and breast), which will be important subjects for future studies. In general, mutations previously observed in cancer studies were found in high abundance across many healthy tissues, and our VAF analysis showed signatures of positive selection acting on them. In contrast, when looking at all mutations and in particular those not previously seen in cancer, we found that missense and nonsense mutations are generally under negative selection. Our results reconcile recent studies reporting prevalent positive9 or negative11 selection in somatic evolution, as we find evidence of both co-existing in different sets of mutations. Our findings paint a complex landscape of somatic mutation across the human body, highlighting their tissue-specific distributions and functional associations. The prevalence of cancer mutations and positive selection of cancer driver genes in non-diseased tissues suggests the possibility of a poised pre-cancerous state, which could also contribute to aging. Finally, our method for inferring somatic mutations from RNA-seq data may help accelerate the study of somatic evolution and its role in aging and disease.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

SUPPLEMENTARY TABLES

Supplementary Table 1. Average percentage elimination of putative mutation calls by per- sample false-positive filters across all tissues.

Supplementary Table 2. Total number of samples per tissue included for the final set of mutation calls.

Supplementary Table 3. List of all somatic mutations identified in this study.

Supplementary Table 4. P-values (-log10[p-value]) for the coefficients of each feature used in a linear regression on the total number of mutations per tissue.

Supplementary Table 5. Average percentage of each mutation type across samples of the given tissue.

Supplementary Table 6. Significant associations between biological sex and mutation load across tissues and mutation types.

Supplementary Table 7. Tissue correspondence between the GTEx23 and Roadmap Epigenomics projects28.

Supplementary Table 8. Significant GO enrichments for genes whose expression is significantly and negatively associated with C>T mutation load across several tissues.

Supplementary Table 9. Significant GO enrichments for genes whose expression is significantly and positively associated with C>T mutation load across several tissues.

Supplementary Table 10. List of cancer driver genes used in this study37.

Supplementary Table 11. List of mutations in cancer driver genes after further elimination of potential false positives.

Supplementary Table 12. List of mutations in cancer driver genes annotated in Oncokb39.

Supplementary Table 13. Correspondence between GTEx23 tissues and cancer types from Tomasetti and Vogelstein44.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not SUPPLEMENTARYcertified by peer review) FIGURES is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. Blood Exome DNA-seq a Coverage = 40 b max(p) = 8.4e18 0.003 1e40 0.002 1e77 Phred 0.001 1e114 seq quality 30 d = 1

Mean mutation rate 0.000 1e151 31 32 0 25 50 75 100 33 Bps to closest exon boundary 10 20 30 40 34 Blood RNA-seq Coverage = 10k 35 max(p) = 0.053 36

37 1e11 0.002 38 39

p(sequencing error) 40 1e26 0.001 1e41

d = 5 1e56 Mean mutation rate 0.000 0 25 50 75 100 Bps to closest exon boundary 10 20 30 40 Skin sun-protected RNA-seq c Reads supporting ALT allele 0.0015 100

Original mutations 0.0010 75 0.0005

d = 8 50 Mean mutation rate 0.0000 0 25 50 75 100 Bps to closest exon boundary Percentage 25 e Residual <1500 >=1500 Truncated axis 0 3000 0 10 20 30 40 4000 Reads supporting ALT allele 2000 1e+08 100 Permuted mutations 2000 1000

5e+07 Residuals 75 0 0 Expected mutations Uniquely mapped reads 50 0e+00 0 2000 4000 6000 0 2000 4000 6000 0 2000 4000 6000

Percentage Observed mutations Observed mutations Observed mutations 25

Truncated axis Supplementary Figure 1. Sta�s�cs associated to a method for calling DNA muta�ons from RNA-seq data. a, 0 Probabili�es of observing a sequencing error (y axis) on a posi�on covered with the lowest coverage used in this 0 10 20 30 40 study (top panel, 40 reads), or a highly covered base (bo�om panel, 104 reads), with increasing number of reads Reads supporting ALT allele suppor�ng the alternate allele (x axis), at different phred sequencing scores (colored lines). Probability is d calculated using the right tail of a binomial distribu�on on the number of observed reads suppor�ng a muta�on [X ~ binom(n = coverage, p=10-Prehd/10x1/3); Phred-based sequencing error probability is mul�plied by 1/3 to DNA-seq coverage 20 DNA-seq coverage 40 account for three poten�al different bases to mutate]. b, Average muta�on rate is high close to splice junc�ons in RNA-seq (bo�om 2 rows) but not in DNA-seq data (top row). Muta�on rate was calculated by taking the 10.0 number of muta�ons observed at a given distance from an exon junc�on and dividing it by the number of reads 10.0 covering posi�ons at that distance. Ver�cal dashed lines represent the point of inflec�on at which muta�on rate stabilizes; we used a 1-bp sliding window to iden�fy the posi�on with maximum absolute difference between 1.0 muta�on rates of the four bps downstream and the four bps upstream of that posi�on. Error bars are the 95% 1.0 confidence intervals based on bootstrapping 1,000 �mes. c, DNA soma�c muta�ons called from RNA-seq were Coverage validated by assessing the percentage of muta�ons for which there was at least one read suppor�ng the RNAseq 0.1 alternate allele in matched exome DNA-seq data (see Methods). The top panel shows the histogram of number 0.1 40 100 of reads suppor�ng the alternate allele in DNA-seq for all muta�ons found in RNA-seq. The bo�om panel shows 200 10 20 30 40 10 20 30 40 the same data a�er randomly assigning an alternate allele to the muta�ons called in RNA-seq, effec�vely

r 400 DNA-seq coverage 80 DNA-seq coverage 160 crea�ng a distribu�on expected by chance (see Methods). X axes were truncated for visual purposes. d, For the 800 100.0 method valida�on, we compared RNA-seq-based muta�on calls to DNA-seq data. To address differences in 100 1600 2400 coverage between the two methods we only compared posi�ons with r >= 8 (dashed line; r effec�vely represents 4800 the number of expected reads that support the alternate allele in DNA-seq at a posi�on given the alternate allele 10.0 frequency observed in RNA-seq at that posi�on and the DNA-seq coverage; see Methods), which ensures 10 considering only posi�ons for which reads suppor�ng the alternate allele were expected to be found in DNA-seq given the coverage of that posi�on in both experiments. e, Hyper-mutated samples were iden�fied by applying a 1.0 linear regression between the number of muta�ons and sequencing depth (Uniquely mapped reads), age, BMI 1 and gender. The observed number of muta�ons was mostly explained by sequencing depth (le� panel) and the expected number of muta�ons agrees for most samples with the observed number muta�ons (middle panel). 0.1 Hyper-mutated samples were tagged as the ones with residuals from the linear regression of >=1,500, meaning 10 20 30 40 10 20 30 40 they had at least 1,500 muta�ons more than expected (right panel; see Methods for more details). Reads supporting ALT allele bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a Whole-Blood RNA-seq aCC-BY-ND 4.0 InternationalWhole-Blood license. exome DNA-seq removal of false posi�ves (%) removal of false posi�ves (%) total read posi�on bias total read posi�on bias 100 33.69 100 67.8 blacklisted region mapping quality bias blacklisted region mapping quality bias 84.72 33.64 92.48 67.71 rna edit base quality bias rna edit base quality bias 74.24 30.4 92.13 62.35 splicing junc�on error mapping quality vs strand bias splicing junc�on error mapping quality vs strand bias 68.36 30.37 88.93 62.28 sequencing error variant distance bias sequencing error variant distance bias 65.29 14.62 88.9 30.58

kept removed kept removed kept removed kept removed

b c Prostate 125000 average = 0.15 median = 0.026 1st quartile = 0.013 Uniquely mapped reads 3d quartile = 0.092 n = 280843 Transcriptome diversity 100000

Time in fixative

Selfdefined race 75000 RIN

Count Ischemic time 50000 feature PC3

Genotype PC2 25000 Genotype PC1

BMI 0 Age 01234 Mutation frequency across samples (%) 02468 log10(pvalue)

Supplementary Figure 2. Calling of soma�c DNA-muta�ons in the GTEx cohort. a, Per-sample filters applied to muta�on calls in blood RNA-seq (le�) and DNA-seq (right) (See Supp. Table 1 for all-�ssue data). b, Frequency histogram of unique muta�ons across samples. c, Significance of associa�ons between total number of muta�ons observed in prostate with biological and technical variables. P-values were obtained from a linear regression (see Methods; see Supp. Table 4 for all-�ssue data). bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. a

Whole Blood rho =0.88 rho =0.78 rho =0.76 Skin sun-exposed -09 07 Esophagus Mucosa 07 Esophagus Mucosa 20000 p < 1x10 p = 3.3x10 p = 6.2x10 20000 Skin sun-exposed Esophagus Mucosa 60000 Skin sun-exposed Muscle Lung Muscle Whole Blood Lung Muscle Thyroid Thyroid Whole Blood Lung Skin sun-protected Thyroid 15000 15000 Skin sun-protected Tibial Nerve Skin sun-protected Tibial Artery 40000 Adipose Subcutaneous Colon Transverse

Colon Transverse Tibial Nerve Adipose Visceral 10000 Tibial Nerve Esophagus Muscularis Spleen 10000 Breast Adipose Subcutaneous Tibial Artery Small Intestine Breast Esophagus Muscularis Colon Transverse Spleen Mutations C>T Mutations C>A

Mutations C>G Aorta Artery Adipose Subcutaneous Liver Breast Stomach Liver Colon Sigmoid Adipose Visceral Tibial Artery Stomach Adipose Visceral Heart Atrial Appendage Small Intestine Pituitary Heart Left Ventricle Small Intestine Pituitary Colon Sigmoid Aorta Artery Spleen Heart Left Ventricle 20000 Heart Left Ventricle Pancreas Aorta Artery Esophagus Muscularis Stomach Ovary Esophagus Gastroesophageal Junction Ovary Pituitary Liver Adrenal Gland Esophagus Gastroesophageal Junction 5000 Heart Atrial Appendage 5000 Adrenal Gland Adrenal Gland Pancreas Heart Atrial Appendage Ovary Colon Sigmoid Pancreas Brain Cortex Prostate Prostate Esophagus Gastroesophageal Junction Brain Frontal Cortex Brain Cortex Coronary Artery Brain Cortex Prostate Brain Nucleus accumbens basal ganglia Brain Nucleus accumbens basal ganglia Coronary Artery Brain Frontal Cortex Brain Nucleus accumbens basal ganglia Brain Frontal Cortex Coronary Artery Brain Caudate basal ganglia Brain Hypothalamus Brain Hypothalamus Brain Hypothalamus Brain Caudate basal ganglia Brain Caudate basal ganglia Brain Hippocampus Brain Putamen basal ganglia Brain Hippocampus Brain HippocampusBrain Putamen basal ganglia Brain Putamen basal ganglia 4.0e+09 8.0e+09 1.2e+10 1.6e+10 4.0e+09 8.0e+09 1.2e+10 1.6e+10 4.0e+09 8.0e+09 1.2e+10 1.6e+10 Sequencing depth Sequencing depth Sequencing depth

15000 Skin sun-exposed 150000 16000 rho =0.85 rho =0.83 08 rho =0.74 08 Esophagus Mucosa p = 3x10 p = 9.8x1007 Whole Blood p = 9x10 Esophagus Mucosa Skin sun-exposed Lung Thyroid Lung Muscle Whole Blood Thyroid 12000 Muscle Skin sun-protected Whole Blood 100000 10000

Tibial Nerve Tibial Nerve Skin sun-protected Lung Tibial Artery Colon Transverse Tibial Artery Thyroid 8000 Colon Transverse Adipose Subcutaneous Esophagus Mucosa Spleen Adipose Visceral Tibial Artery Breast Adipose Subcutaneous Esophagus Muscularis Mutations T>A Breast Skin sun-exposed Mutations T>G Spleen Aorta Artery Aorta Artery Mutations T>C Small Intestine Esophagus Muscularis Colon Sigmoid 50000 Muscle Aorta Artery Liver Stomach Liver Adrenal Gland Liver Adipose Visceral Pituitary Heart Left Ventricle Tibial Nerve Heart Left Ventricle 5000 Stomach Pituitary Colon Transverse Brain FrontalEsophagus SmallCortex Intestine Gastroesophageal Junction Skin sun-protected Pituitary Heart Atrial Appendage Small Intestine Breast Brain Frontal Cortex Colon Sigmoid Esophagus Gastroesophageal Junction Pancreas Pancreas Heart Atrial Appendage Prostate Ovary Coronary Artery Adipose Subcutaneous 4000 Brain Cortex Adrenal Gland Brain Cortex Spleen Adipose Visceral Coronary Artery Stomach Esophagus Muscularis ProstateBrain Nucleus accumbens basal ganglia Ovary Prostate Brain Nucleus accumbens basal ganglia Pancreas Adrenal GlandHeart Atrial Appendage Brain Hypothalamus Coronary Artery Brain Hypothalamus Brain Nucleus accumbens basal ganglia Ovary Brain Cortex Esophagus Gastroesophageal JunctionHeart Left Ventricle Brain Caudate basal ganglia Brain Caudate basal ganglia Brain HypothalamusBrain Frontal CortexColon Sigmoid Brain Putamen basal ganglia Brain Hippocampus Brain Putamen basal gangliaBrain Hippocampus Brain HippocampusBrain Putamen basalBrain ganglia Caudate basal ganglia 4.0e+09 8.0e+09 1.2e+10 1.6e+10 4.0e+09 8.0e+09 1.2e+10 1.6e+10 4.0e+09 8.0e+09 1.2e+10 1.6e+10 Sequencing depth Sequencing depth Sequencing depth

b Skin sun-protected Skin sun-exposed Blood 250

600 200 200

150 150 400

100 100

mean mutations mean mutations 200 mean mutations 50 50

0 0 0 C>A C>G C>T T>A T>C T>G C>A C>G C>T T>A T>C T>G C>A C>G C>T T>A T>C T>G

Supplementary Figure 3. Muta�on load across different muta�on types in non-disease human �ssues. a, Across all muta�on types, the total number of muta�ons observed in a �ssue is explained by the total sequencing depth of that �ssue. A linear regression line is shown in blue; �ssues above it exhibit more muta�ons than expected by sequencing depth and �ssues below it show fewer muta�ons than expected. Rho is the Spearman coefficient. b, Representa�ve examples for rela�ve contribu�ons of different muta�on types across different �ssues (see Supp. Table 5 for all-�ssue data). bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not a certified by peer review) is theC>A author/funder, who has granted bioRxiv a license toC>G display the preprint in perpetuity. It is made availableC>T under Brain Putamen basal ganglia * aCC-BY-NDBrain Putamen 4.0basal Internationalganglia license. ** Brain Hypothalamus * Brain Caudate basal ganglia * Aorta Artery ** Whole Blood *** Brain Hypothalamus Adrenal Gland * Skin sun-exposed *** Adrenal Gland * Small Intestine Ovary Small Intestine Whole Blood * Brain Putamen basal ganglia Spleen Tibial Artery * Small Intestine Pituitary Muscle * Adrenal Gland Whole Blood * Lung * Tibial Artery * Brain Nucleus accumbens basal ganglia Coronary Artery Brain Nucleus accumbens basal ganglia Esophagus Mucosa Spleen Lung * Tibial Artery Brain Nucleus accumbens basal ganglia Coronary Artery Stomach Pituitary Aorta Artery Skin sun-exposed Breast Esophagus Mucosa Lung Ovary Tibial Nerve Adipose Visceral Tibial Nerve Breast Brain Hippocampus Brain Cortex Pituitary Ovary Heart Atrial Appendage Brain Hippocampus Esophagus Muscularis Stomach Esophagus Muscularis Muscle Brain Hypothalamus Colon Sigmoid Brain Frontal Cortex Adipose Subcutaneous Stomach Esophagus Gastroesophageal Junction Esophagus Muscularis Brain Cortex Pancreas Adipose Visceral Thyroid Coronary Artery Brain Hippocampus Adipose Subcutaneous Heart Left Ventricle Esophagus Gastroesophageal Junction Adipose Visceral Tibial Nerve Thyroid Pancreas Skin sun-protected Heart Left Ventricle Brain Caudate basal ganglia Colon Sigmoid Pancreas Skin sun-protected Breast Skin sun-exposed Muscle Thyroid Esophagus Mucosa Heart Atrial Appendage Heart Atrial Appendage Brain Caudate basal ganglia Heart Left Ventricle Liver Brain Frontal Cortex Liver Adipose Subcutaneous Skin sun-protected Brain Frontal Cortex Colon Transverse Colon Transverse Esophagus Gastroesophageal Junction Prostate Prostate Spleen Aorta Artery Colon Sigmoid Colon Transverse Brain Cortex Liver Prostate 0.4 0.2 0.0 0.2 0.4 0.25 0.00 0.25 0.25 0.00 0.25 Spearman's rho Spearman's rho Spearman's rho T>A T>C T>G Brain Nucleus accumbens basal ganglia Brain Putamen basal ganglia * Brain Nucleus accumbens basal ganglia Brain Putamen basal ganglia Whole Blood **** Small Intestine Brain Hypothalamus Brain Nucleus accumbens basal ganglia * Pituitary Small Intestine Small Intestine * Brain Hippocampus Coronary Artery Coronary Artery * Ovary Tibial Artery Tibial Artery *** Esophagus Muscularis Ovary Adrenal Gland * Spleen Adrenal Gland Ovary * Adrenal Gland Lung Brain Hypothalamus Liver Whole Blood Pancreas * Adipose Subcutaneous Brain Caudate basal ganglia Skin sun-protected * Whole Blood Esophagus Gastroesophageal Junction Colon Sigmoid * Tibial Artery Muscle Lung * Brain Caudate basal ganglia Stomach Brain Hippocampus Breast Thyroid Muscle * Brain Putamen basal ganglia Skin sun-exposed Liver Brain Hypothalamus Liver Aorta Artery * Skin sun-exposed Esophagus Muscularis Pituitary Pancreas Tibial Nerve Tibial Nerve * Colon Sigmoid Heart Atrial Appendage Esophagus Muscularis Brain Frontal Cortex Breast Adipose Visceral Muscle Heart Left Ventricle Thyroid Esophagus Mucosa Esophagus Mucosa Breast Lung Brain Frontal Cortex Brain Caudate basal ganglia Heart Left Ventricle Aorta Artery Adipose Subcutaneous Thyroid Pancreas Skin sun-exposed Skin sun-protected Adipose Visceral Stomach Stomach Brain Cortex Heart Atrial Appendage Tibial Nerve Spleen Esophagus Mucosa Aorta Artery Pituitary Heart Left Ventricle Brain Cortex Prostate Esophagus Gastroesophageal Junction Coronary Artery Skin sun-protected Spleen Esophagus Gastroesophageal Junction Adipose Subcutaneous Brain Cortex Adipose Visceral Brain Hippocampus Colon Transverse Heart Atrial Appendage Colon Sigmoid Brain Frontal Cortex Colon Transverse Colon Transverse Prostate * Prostate 0.4 0.2 0.0 0.2 0.4 0.4 0.2 0.0 0.2 0.4 0.2 0.0 0.2 Spearman's rho Spearman's rho Spearman's rho All Skin Adipose subcutaneous Liver Brain Putamen basal ganglia * bc Whole Blood p=0.0037 **** p=0.00019 p=0.015 Brain Nucleus accumbens basal ganglia * 1.0 Brain Hypothalamus * Coronary Artery * Adrenal Gland * 1 Small Intestine * Ovary 1 Tibial Artery * Skin sun-exposed * Pancreas Pituitary Lung 0 Aorta Artery Muscle * 0.5 Brain Hippocampus Colon Sigmoid 0 Stomach Tibial Nerve Brain Caudate basal ganglia 1 Breast Esophagus Muscularis Skin sun-protected Esophagus Mucosa Liver Thyroid 1 Adipose Visceral 0.0 2 Brain Cortex Adipose Subcutaneous Normalized mutations (all) Spleen Normalized mutations (C>T) Esophagus Gastroesophageal Junction Brain Frontal Cortex C>T Fold Change [log2(Sun/Non)] Heart Left Ventricle Colon Transverse 3 Heart Atrial Appendage 2 Prostate * 0.25 0.00 0.25 d Spearman's rho MF MF Caucasian Sex Sex 0.100 p < 1e300 African American p < 1e300 Supplementary Figure 4. Phenotypic associa�ons and proper�es of muta�on load in the human body. a, Age p = 1.9e213 associa�ons (x axis) with muta�on load per �ssue (y axis) across muta�on types (panels). P-values were 0.075 p = 1.3e204 obtained by assessing the frac�on of permuta�on-based rho values greater than the original rho value for p = 5.6e150 posi�ve rho values, or the frac�on of permuta�on-based rho values smaller than the original rho value for nega�ve rho values. A total 10,000 permuta�ons were performed. FDRs were obtained for each panel using the 0.050 Benjamini-Hochberg method: **** (FDR < 0.01), *** (FDR < 0.05), ** (FDR < 0.1), * (FDR < 0.2). b, Fold-change p = 2e91 of C>T muta�ons between matched sun-exposed and sun-protected skin samples from the same individual and Median VAF stra�fied by self-reported race. P-value is based on a two-sided Mann-Whitney test. c, Two representa�ve 0.025 examples of associa�ons between muta�on load and biological sex. To control for sequencing depth and other technical ar�facts, muta�on values were obtained as the residuals from a linear regression using technical features as explanatory variables (see Methods, see Supp. Table 6 for all significant associa�ons with sex). d, 0.000 Related to Figure 2d, median variant allele frequency (VAF) across all muta�ons for each muta�on type based on their impact to the amino acid sequence; nonsense muta�ons were divided into two different groups based on whether they are located in the last exon; error bars represent the 95% confidence interval a�er bootstrapping Missense Synonymous 1000 �mes; p-values are from two-sided Mann-Whitney tests. Nonsense - last exon Nonsense - not last exon bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peeraa review) is the author/funder,C>A who has granted bioRxiv a license to display the preprint inT>A perpetuity. It is made available under rho =0.36 Small Intestinea TerminalCC-BY-ND Ileum 4.0 International licenserho =0.48. Small Intestine Terminal Ileum p = 0.28 p = 0.14 Skin Not Sun Exposed Suprapubic Whole Blood Whole Blood Liver 0.4 Skin Sun Exposed Lower leg Liver Colon Transverse

0.4 Skin Not Sun Exposed Suprapubic

Lung Skin Sun Exposed Lower leg 0.2 Lung

Colon Transverse Esophagus average Thyroid 0.0 Thyroid Esophagus average 0.0

0.4 0.2 Brain average

Mutations log2(observed / expected) Mutations log2(observed / expected) Brain average

Pancreas Pancreas 9 10 11 12 9 10 11 12 log10(Number of stem cell divisions) log10(Number of stem cell divisions) T>C rho = 0.52 C>G rho =0.018 1.0 p = 0.098 Whole Blood Small Intestine Terminal Ileum p = 0.96 1.5

Whole Blood Colon Transverse 1.0 Liver 0.5 Skin Not Sun Exposed Suprapubic Skin Sun Exposed Lower leg

Thyroid Liver 0.5 Lung Small Intestine Terminal Ileum Lung

Thyroid Colon Transverse 0.0 Esophagus average 0.0 Skin Not Sun Exposed Suprapubic Pancreas Skin Sun Exposed Lower leg Pancreas Esophagus average Mutations log2(observed / expected) Mutations log2(observed / expected) Brain average 0.5 Brain average

9 10 11 12 9 10 11 12 log10(Number of stem cell divisions) log10(Number of stem cell divisions) C>T T>G 1.0 rho =0.4 0.8 rho =0.15 Small Intestine Terminal Ileum p = 0.23 Whole Blood p = 0.67 Small Intestine Terminal Ileum

Liver Colon Transverse Whole Blood 0.4 Liver Colon Transverse 0.5 Skin Sun Exposed Lower leg Skin Sun Exposed Lower leg Thyroid Lung Skin Not Sun Exposed Suprapubic Skin Not Sun Exposed Suprapubic Lung Thyroid 0.0 Esophagus average

Esophagus average 0.0 Brain average

0.4

Brain average Mutations log2(observed / expected) 0.5 Mutations log2(observed / expected) Pancreas Pancreas 9 10 11 12 9 10 11 12 log10(Number of stem cell divisions) log10(Number of stem cell divisions)

rho =0.22 All p = 0.52 Whole Blood Supplementary Figure 5. Number of stem cell divisions correlates 1.0 weakly with muta�on load in human �ssues. Supplementary Figure 5. Number of stem cell divisions correlates weakly with Small Intestine Terminal Ileum muta�on load in human �ssues. Associa�ons between different Liver muta�on types and the number of stem cell divisions of the most 44 0.5 abundant cell type for each �ssue . Observed/expected muta�ons (y-axis) were calculated by dividing the number of muta�ons Colon Transverse Lung observed in a given �ssue by the predicted number of muta�ons Thyroid Skin Not Sun Exposed Suprapubic Skin Sun Exposed Lower leg based on the sequencing depth of the �ssue from the linear regression in Figure 2a. Associa�ons between different muta�on 0.0 Esophagus average types and the number of stem cell divisions of the most abundant cell type for each �ssue44. Observed/expected muta�ons (y-axis) were calculated by dividing the number of muta�ons observed in a

Mutations log2(observed / expected) Pancreas Brain average given �ssue by the predicted number of muta�ons based on the 0.5 sequencing depth of the �ssue from the linear regression in Figure 9 10 11 12 2a. log10(Number of stem cell divisions) bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. Muscle Skeletal

a Spleen b Heart Left Ventricle Heart Atrial Appendage

30 30

20 20

10 10

0 0 tSNE dimension 2 tSNE dimension 2

10 10

20 20

20 10 0 10 20 20 10 0 10 20 tSNE dimension 1 tSNE dimension 1

c Skin protected Skin Sun Exposed d Esophagus Mucosa 30 30

20 20

10 10

0 0 tSNE dimension 2 tSNE dimension 2

10 10

20 20

20 10 0 10 20 20 10 0 10 20 tSNE dimension 1 tSNE dimension 1

e Artery Coronary f Thyroid 30 30

20 20

10 10 0 0 tSNE dimension 2 tSNE dimension 2 10 10

20 20

20 10 0 10 20 20 10 0 10 20

tSNE dimension 1 tSNE dimension 1

Supplementary Figure 6. Muta�on profiles cluster by �ssue. a-f, tSNE plots as described in Figure 2e, highligh�ng individual (a,d) and grouped (b,c) �ssues exhibi�ng highly similar within-�ssue muta�on profiles, and two �ssues showing �ssue profiles with weak clustering (e,f). certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder bioRxiv preprint

C>T strand bias c a log2(transcribed/nontranscribed) Aorta Artery 4 3 2 1 Skin sun-exposed 0 1 Adipose Subcutaneous Tibial Artery

01010200 150 100 50 0 Tibial Nerve Muscle Skin sun-protected doi:

NK cells resting NK cells Heart Atrial Appendage Small Intestine

Heart Left Ventricle https://doi.org/10.1101/668624 Spleen

p =2.5x10 rho =0.65 Coronary Artery Pituitary Brain Cortex Cell type contribution tosample typecontribution Cell

Brain Frontal Cortex C>A Pancreas Brain Caudate basal ganglia Brain Nucleus accumbens basal ganglia Liver

Whole Blood Adipose Visceral 36 Colon Sigmoid Esophagus Mucosa Esophagus Gastroesophageal Junction Stomach 4 3 2 1 Breast 0 1 Colon Transverse Ovary 0 200 100 0 Prostate Esophagus Muscularis Thyroid Adrenal Gland

T cellsCD8+ Whole Blood Lung -log10 (signed pval) -log10 (signed Aorta Artery Aorta Skin sun-exposed Subcutaneous Adipose Artery Tibial Nerve Tibial Muscle Skin sun-protected AtrialAppendage Heart Intestine Small LeftVentricle Heart Spleen Artery Coronary Pituitary Cortex Brain Cortex Frontal Brain Pancreas ganglia basal Caudate Brain ganglia basal accumbens Nucleus Brain Liver Visceral Adipose Sigmoid Colon Mucosa Esophagus Junction Gastroesophageal Esophagus Stomach Breast Transverse Colon Ovary Prostate Muscularis Esophagus Thyroid Gland Adrenal Whole Blood Lung p =3.9x10 rho =0.61 a ; CC-BY-ND 4.0Internationallicense this versionpostedJune12,2019. 05 50 32 strand asymmetry typecontent. andcell type Methods). Cell nega (see and posi�ve CIBERSORT45 composi�on. using type obtained cell was composi�on blood and strand) non-transcribed clustering. associa correla correla posi�ve indicates Yellow of pairs in strand) non-transcribed the (ra asymmetry strand between correla asymmetry muta correla Inter- �ssue 7. Figure Supplementary b Whole Blood ons. Tissues in light gray did not have enough samples to assess to samples enough have not did gray light in Tissues �ons. n ad el ye associa type cell and �ons n. oun ad os ee ree uig hierarchical using ordered were rows and Columns �ons.

Associa b, C>A The copyrightholderforthispreprint(whichwasnot .

ve correla �ve C>G n; -aus r fo Sera correla Spearman from are p-values �ons; n bten > srn aymty (transcribed/ asymmetry strand C>T between �ons C>T

T>A ons in blood (from panel b) between C>T between b) panel (from blood in �ons o muta of �o

n, hra bu idcts nega indicates blue whereas �ons, T>C

n. a, �ons. T>G ssues from the same individual. individual. same the from �ssues n o te rncie over transcribed the on �ons Spearman's rho Spearman's T cells gamma delta T cellsgamma 05 0 0 5 Mast cells activated Mast cells activated NK cells B cellsnaive Eosinophils B cellsmemory resting cells Dendritic resting memory T cellsCD4 Tregs T cellsregulatory M0 Macrophages T cellsCD8 resting NK cells Monocytes activated memory T cellsCD4 activated cells Dendritic Neutrophils helper T cellsfollicular resting Mast cells cells Plasma M2 Macrophages naive T cellsCD4 nl tad asymmetry strand �onal se strand inter-�ssue C>A c, �ons Top ve �ve bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a aCC-BY-ND 4.0 International license. 100% 10000

7500 c C>T Ovary — OGG1 expression 5000 rho =0.38 p = 2x1004

2500 9% 1.0 No. of genes

24% 50% 0 1 2 3 4 5 6 7 8 9 1011 12 13 14 15 16 17 18 19 2021 22 23 24 25 26 27 28 29 3031 32 33 34 35 36 0.5 No. of tissues a gene was tested on b 40.44% 6000 n = 5953 0.0

4000 61.04% 0.5 n = 3033 78.99% n = 2643

92.71% Normalized mutation load 2000 n = 2020

No. of genes 1.0 97.44% n = 696 98.78% 99.38% 99.69% 99.82% 99.91% 99.93% 99.98% 99.99% 100% n = 198 0 n = 87 n = 46 n = 20 n = 13 n = 3 n = 7 n = 1 n = 2 20 30 40 0 2.82.8 5.65.6 8.38.3 11 11 14 14 17 17 19 19 22 22 25 25 28 28 31 31 33 33 36 36 39 TPM Percentage of tissues in which a gene is associated with mutation load d C>T Liver — NEIL1 expression e C>T Lung — MLH1 expression rho =0.55 rho =0.28 p = 3.1x1010 p = 2.2x1006 2 1

0 0

1

2 Normalized mutation load Normalized mutation load

2

10 20 20 30 40 TPM TPM

f C>T Adrenal gland — MSH3 expression g C>T Artery tibial— NEIL2 expression 2 rho =0.23 2 rho =0.28 p = 0.0078 p = 1.8x1006

1 1

0 0

1 1 Normalized mutation load Normalized mutation load

2 2

2.5 5.0 7.5 10 20 30 TPM TPM Supplementary Figure 8. Gene expression associa�ons with C>T muta�on load. a, Histogram of the number of �ssues in which a gene was tested for associa�on between its expression and C>T muta�on load. A gene was selected to be tested in a �ssue based on having detectable expression in that �ssue (see Methods). b, Histogram of the percentage of �ssues exhibi�ng significant associa�ons (p < 0.05 a�er Bonferroni correc�on) between expression of a gene and C>T muta�on load. c-g, Examples of individual significant associa�ons between C>T muta�on load and expression of DNA repair genes in different �ssues (0.001 < FDR < 0.2, see Fig. 4c). Muta�on load is normalized by controlling for biological and technical factors (see Methods). Rho is the Spearman correla�on coefficient. No. of genes d c a 2000 4000 No. of genes 10000 Posi�ve expression gene associa�ons Nega�ve gene expression associa �ons certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder 7500 5000 2500 0

0 2.8 bioRxiv preprint GO:0015031 protein transport protein GO:0015031 process viral GO:0016032 signaling hippo GO:0035329 catabolic process formaldehyde GO:0046294 repair nucleotide-excision GO:0006289 0.0128 organization organelle GO:0006996 process microtubule-based GO:0007017 deneddylation protein GO:0000338 molecules adhesion membrane via plasma adhesion cell homophilic GO:0007156 transport intracellular GO:0046907 localization cellular Description GO:0051641 Golgitransport vesicle GO:0048193 Term ID GO:0030004 cellular monovalent inorganic cation homeostasis inorganic monovalent cellular GO:0030004 of cytotoxicity regulation mediated leukocyte GO:0001910 localization GO:0051179 establishment of cell epithelial polarity GO:0090162 process ofmetabolic regulation lipid GO:0019216 acid process metabolic monocarboxylic GO:0032787 to response staurosporine cellular GO:0072734 cell activation GO:0001775 hexose process metabolic GO:0019318 secretion GO:0046903 system immune process GO:0002376 process metabolic carbohydrate GO:0005975 of cell establishment or polarity maintenance GO:0007163 process metabolic phosphorus 0.0167 GO:0006793 process metabolic steroid GO:0008202 of ofRNA regulation phosphorylation II polymerase domain C-terminal GO:1901407 autophosphorylation protein GO:0046777 keratinization GO:0031424 ofGTPase regulation transduction small signal mediated GO:0051056 process metabolic alcohol GO:0006066 ofactivity regulation hydrolase GO:0051336 catabolic molecule process small GO:0044282 response immune adaptive GO:0002250 ion transport GO:0006811 process metabolic molecule small GO:0044281 process metabolic compound hydroxy organic Description GO:1901615 process metabolic lipid GO:0006629 Term ID 0 n =5534 37.59% 9%

2.8 5.6 36 35 34 33 32 3031 29 28 27 26 25 24 23 22 2021 19 18 17 16 15 14 13 12 1011 9 8 7 6 5 4 3 2 1 n =3102 58.66%

Percentage of tissues in which ageneisassociated oftissuesinwhich Percentage 5.6 8.3 n =2862 78.1%

8.3 11 26% n =2190 92.98% doi:

11 14 wastestedon No. oftissuesagene https://doi.org/10.1101/668624 97.72% n =699

14 17 with mutation load with mutation 99.12% n =205

17 19 99.67% n =81

19 22 99.84% n =26

22 25 99.96% n =17

25 28 99.98% n =3

28 31 99.99% n =1 a 31 33 ; CC-BY-ND 4.0Internationallicense this versionpostedJune12,2019. 99.99% n =0 50%

33 36 99.99% n =0

36 39 100% n =2 100% of GO categories (see Methods). muta nega is expression whose muta with associated percentageor (b) number the of (c) associa muta for tested thus and expressed, of number the of Histogram associa expression Gene muta 9. Figure Supplementary No. of genes b 12000 8000 4000 on load in mul�ple in load �on muta all across load �on n load. �on FDR 0.00631 0.00567 0.00565 0.00449 0.00426 0.00366 0 FDR 0.0148 0.0928 0.0905 0.0829 0.0786 0.0736 0.0583 0.0463 0.0311 0.0273 0.0265 0.0242 0.0234 0.0234 0.0199 0.0155 0.0104 0.00777 0.067 0.016 0.014 n =12510 0.0807 0.0788 0.0762 0.0608 0.0541 0.0463 0.0186 0.0126 0.0125 43.41% 0.071 0123456789 The copyrightholderforthispreprint(whichwasnot . n =5504 62.51% iia t Fgr 4, bt o al muta all for but 4a,b Figure to Similar n =4050 No. of tissues in which a gene isassociated agene No. oftissuesinwhich 76.57% n od p 00, ofroi corrected). Bonferroni 0.05, < (p load �on vely (bo� posi�vely or (top) �vely n =4243 91.29% ssues are enriched in these representa these in enriched are �ssues on types (see Methods). (see types �on se i wih ah ee a detectably was gene each which in �ssues n =1681 97.13% with mutation load with mutation ssues in which a genewasa signiwhichficantly in �ssues 98.98% n =534 n ewe is xrsin and expression its between �on 99.62% n =185 99.84% n =62 99.93% n =26 om) associated with associated om) n wt overall with �ons 99.98% n =15 b-c, n types. �on Histogram of Histogram 99.99% n =3 01 14 11 10 Genes d, ve lists �ve 99.99% n =1 100% n =2 a, bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not FDR(%)certified by peer review)-log is (signedthe author/funder, pval) who has granted bioRxivFDR(%) a license to display the-log preprint(signed pval)in perpetuity. It is made available under a 10 aCC-BY-NDb 4.0 International license. 10

10 5 0510 10 5 0510 10-5 5-10.1-1 10-5 5-10.1-1 100-2020-1515-10 100-2020-1515-10 C>A C>G APOBEC3A ** C deamination APOBEC3A C deamination APOBEC3B APOBEC3B POLH Translesion pol POLH ** Translesion pol * BRCA2 BRCA2 DSB repair DSB repair BRCA1 * BRCA1 * POLD1 * POLD1 EXO1 EXO1 MSH2 * ** * * * MSH2 MSH3 * **** MMR MSH3 * *** MMR MSH6 MSH6 * MLH1 * *** MLH1 PMS2 * PMS2 XPC XPC XPA XPA ** NER *** * * NER DDB1 * DDB1 DDB2 DDB2 ** NEIL1 NEIL1 * NEIL2 * NEIL2 *** NTHL1 * NTHL1 * * OGG1 * BER OGG1 *** BER MUTYH MUTYH POLB POLB POLL ** POLL * Liver Liver Lung Lung Ovary Ovary Breast Breast Spleen Muscle Spleen Muscle Thyroid Thyroid Pituitary Pituitary Prostate Prostate Stomach Stomach Pancreas Pancreas Aorta Artery Tibial Nerve Tibial Artery Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Brain Cortex Whole Blood Whole Blood Adrenal Gland Adrenal Gland Colon Sigmoid Colon Sigmoid Coronary Artery Coronary Artery Adipose Visceral Adipose Visceral Colon Transverse Colon Transverse Skin sun-exposed Skin sun-exposed Skin sun-protected Skin sun-protected Heart Left Ventricle Heart Left Ventricle Brain Hippocampus Esophagus Mucosa Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Esophagus Muscularis Adipose Subcutaneous Adipose Subcutaneous Heart Atrial Appendage Heart Atrial Appendage Brain Caudate basal ganglia Brain Caudate basal ganglia Brain Putamen basal ganglia Brain Putamen basal ganglia Small Intestine Terminal Ileum Small Intestine Terminal Ileum Esophagus Gastroesophageal Junction Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia c Brain Nucleus accumbens basal ganglia d T>A T>C APOBEC3A * C deamination APOBEC3A C deamination APOBEC3B APOBEC3B POLH ** Translesion pol POLH Translesion pol BRCA2 * DSB repair BRCA2 DSB repair BRCA1 *** BRCA1 POLD1 POLD1 EXO1 ** EXO1 MSH2 ** MSH2 MSH3 * * MMR MSH3 MMR MSH6 MSH6 MLH1 *** * MLH1 PMS2 * PMS2 XPC XPC XPA ** NER XPA DDB1 * DDB1 NER DDB2 DDB2 NEIL1 * NEIL1 * ** NEIL2 NEIL2 NTHL1 NTHL1 OGG1 ** BER OGG1 BER MUTYH **** MUTYH POLB POLB POLL * POLL Liver Lung Liver Lung Ovary Breast Ovary Spleen Muscle Breast Thyroid Spleen Muscle Thyroid Pituitary Prostate Stomach Pituitary Prostate Pancreas Stomach Pancreas Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood Adrenal Gland Colon Sigmoid Adrenal Gland Colon Sigmoid Coronary Artery Adipose Visceral Coronary Artery Colon Transverse Adipose Visceral Skin sun-exposed Colon Transverse Skin sun-exposed Skin sun-protected Heart Left Ventricle Skin sun-protected Brain Hippocampus Esophagus Mucosa Heart Left Ventricle Brain Frontal Cortex Brain Hypothalamus Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Adipose Subcutaneous Esophagus Muscularis Heart Atrial Appendage Adipose Subcutaneous Heart Atrial Appendage Brain Caudate basal ganglia Brain Putamen basal ganglia Brain Caudate basal ganglia Brain Putamen basal ganglia Small Intestine Terminal Ileum Small Intestine Terminal Ileum Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia Esophagus Gastroesophageal Junction f Brain Nucleus accumbens basal ganglia g C>A C>G e T>G C deamination Translesion pol APOBEC3A C deamination APOBEC3B DSB repair POLH Translesion pol MMR BRCA2 DSB repair NER BRCA1 POLD1 BER EXO1 MSH2 Liver Lung Liver Lung Ovary Ovary Breast Breast Spleen Muscle Spleen Thyroid

MSH3 Muscle

MMR Thyroid Pituitary Prostate ** Pituitary Prostate Stomach Stomach Pancreas MSH6 Pancreas Aorta Artery Tibial Nerve Tibial Artery Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood Brain Cortex MLH1 Whole Blood Adrenal Gland Adrenal Gland Colon Sigmoid Colon Sigmoid Coronary Artery

PMS2 Coronary Artery Adipose Visceral Adipose Visceral Colon Transverse Colon Transverse Skin sun-exposed Skin sun-exposed Skin sun-protected XPC Skin sun-protected Heart Left Ventricle Heart Left Ventricle Brain Hippocampus Brain Hippocampus Esophagus Mucosa Esophagus Mucosa Brain Frontal Cortex Brain Frontal Cortex * Brain Hypothalamus XPA Brain Hypothalamus Esophagus Muscularis Esophagus Muscularis Adipose Subcutaneous ** * Adipose Subcutaneous Heart Atrial Appendage NER Heart Atrial Appendage DDB1 *** * DDB2 Brain Caudate basal ganglia Brain Caudate basal ganglia Brain Putamen basal ganglia Brain Putamen basal ganglia Small Intestine Terminal Ileum NEIL1 Small Intestine Terminal Ileum NEIL2 ** NTHL1

OGG1 BER Esophagus Gastroesophageal Junction Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia MUTYH Brain Nucleus accumbens basal ganglia POLB POLL ** h i T>A T>C

Liver C deamination Lung Ovary Breast Spleen Muscle

Thyroid Translesion pol Pituitary Prostate Stomach Pancreas DSB repair Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood MMR Adrenal Gland Colon Sigmoid

Coronary Artery NER Adipose Visceral Colon Transverse Skin sun-exposed Skin sun-protected

Heart Left Ventricle BER Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Adipose Subcutaneous Heart Atrial Appendage Liver Liver Lung Lung Ovary Ovary Breast Breast Spleen Spleen Muscle Muscle Thyroid Thyroid Pituitary Pituitary Brain Caudate basal ganglia Prostate Prostate Brain Putamen basal ganglia Stomach Stomach Pancreas Pancreas Small Intestine Terminal Ileum Aorta Artery Tibial Nerve Tibial Artery Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Brain Cortex Whole Blood Whole Blood Adrenal Gland Adrenal Gland Colon Sigmoid Colon Sigmoid Coronary Artery Coronary Artery Adipose Visceral Adipose Visceral Colon Transverse Colon Transverse Skin sun-exposed Skin sun-exposed Skin sun-protected Skin sun-protected Heart Left Ventricle Heart Left Ventricle Brain Hippocampus Esophagus Mucosa Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Brain Frontal Cortex Brain Hypothalamus Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia Esophagus Muscularis Esophagus Muscularis Adipose Subcutaneous Adipose Subcutaneous Heart Atrial Appendage Heart Atrial Appendage Brain Caudate basal ganglia Brain Caudate basal ganglia Brain Putamen basal ganglia Brain Putamen basal ganglia Small Intestine Terminal Ileum j Small Intestine Terminal Ileum Esophagus Gastroesophageal Junction Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia

T>G Brain Nucleus accumbens basal ganglia C deamination Translesion pol DSB repair FDR(%) Supplementary Figure 10. Muta�on load associa�ons with expression of MMR NER genes involved in DNA repair or DNA mutagenesis. a-d, Individual gene-�ssue BER 10-5 5-10.1-1 associa�ons between different muta�on types and expression of genes 100-2020-1515-10 Liver Lung Ovary Breast

Spleen involved in DNA repair or DNA mutagenesis (right panels); blue stars denote Muscle Thyroid Pituitary Prostate Stomach Pancreas Aorta Artery Tibial Nerve Tibial Artery Brain Cortex Whole Blood significant associa�ons using a permuta�on-based FDR strategy (see Methods; Adrenal Gland Colon Sigmoid Coronary Artery Adipose Visceral Colon Transverse Skin sun-exposed Skin sun-protected

Heart Left Ventricle * FDR < 0.2, ** FDR < 0.1, *** FDR < 0.05). Shown in the le� panels are genes Brain Hippocampus Esophagus Mucosa Brain Frontal Cortex Brain Hypothalamus Esophagus Muscularis Adipose Subcutaneous Heart Atrial Appendage whose expression was associated across all �ssues more than expected by Brain Caudate basal ganglia Brain Putamen basal ganglia Small Intestine Terminal Ileum chance at the indicated FDR (see Methods). e-i, Group-level gene expression associa�ons of the shown pathways and different muta�on types across Esophagus Gastroesophageal Junction Brain Nucleus accumbens basal ganglia �ssues (see Methods). bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license. a Significant samples 25 2% 5% 8% 5% 5% 5% 5% 4% 3% 13% 4% 5% 8% 7% 6% 6% 9% 8% 5% 7% 9% 7% 7% 10% 11% 18% 11% 12% 9% 8% 21% 5% 11% 13% 12% 12% (Bonferroni p−value < 0.05)

20 log10(Bonferroni pvalue) >30 15 20

10 10

5 0 Permuted mutations seen in COMSIC (%)

0

Liver Ovary Lung SpleenBreast Pituitary Pancreas Prostate Thyroid Stomach Muscle Tibial Artery Aorta Artery Tibial Nerve Brain Cortex Whole Blood Colon Sigmoid AdrenalSmall Gland Intestine Coronary Artery Adipose Visceral ColonSkin Transverse sun-exposed Brain Hippocampus Brain Frontal CortexBrain Hypothalamus Skin sun-protected Heart EsophagusLeft Ventricle Mucosa Heart AtrialEsophagus Appendage Muscularis Adipose Subcutaneous Brain PutamenBrain Caudate basal gangliabasal ganglia

Brain Nucleus accumbens basal ganglia Esophagus Gastroesophageal Junction

Tissue Inidivual muta�ons accoun�ng for most muta�ons in a gene b 300

200 count 100

0 0 20 40 60 Mutation count # Unique mutations in gene-of-origin

Supplementary Figure 11. Nega�ve controls and filters for cancer muta�on enrichment in non-disease human �ssues. a, Percentage of randomly permuted muta�ons (see Methods) in non-disease �ssues that overlap with cancer muta�on sites (COSMIC) but where the muta�on type disagrees (i.e C>A in one dataset and C>G in the other); p-values for enrichment were calculated using a hypergeometric test accoun�ng for sequencing coverage, total number of muta�ons per sample, and total number of COSMIC muta�ons, and the three possible alternate alleles that any given reference allele can have (see Methods). P-values are Bonferroni-corrected across all samples. FDR is based on the Benjamini-Hochberg method across all samples. b, Histogram of t values (x axis, described in methods) for each muta�on in a panel of 31 cancer driver genes. Muta�ons with high t values account for most of the unique mutated sites in their gene-of- origin, leading to a low diversity in muta�ons of a given gene. These muta�ons are likely systema�c ar�facts and can bias dN/dS ra�os9. bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

METHODS

Raw RNA-seq retrieval and processing We downloaded raw RNA sequencing reads from the dbGAP GTEx23 project version 7 (phs000424.v7.p2). We included 36 tissues with a total of 7,584 samples (Supp. Table 2) after applying all filters in the mutation calling pipeline (see below). We also processed DNA-exome sequencing reads from 105 whole-blood samples that had a matched RNA-seq sample. Reads were mapped to the human genome Hg19 (NCBI build GRCh37.p13) using STAR46 with the following parameters: requiring uniquely mapping reads (--outFilterMultimapNmax 1), clipping 6 bases in the 5’ end of reads (--clip5pNbases 6), keeping reads with 10 or fewer mismatches and less than 10% mismatches of the read length that effectively mapped to genome (--outFilterMismatchNmax 10, --outFilterMismatchNoverLmax 0.1). To avoid germline variant contamination during the somatic mutation calling phase, all SNPs from dbSNP47 v142 were masked to Ns and ignored in downstream processing (we used SNPs annotated in https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b142_GRCh37p13/BED). After mapping we removed duplicate reads to avoid biases arising from PCR duplicates during library preparation using a custom python script.

DNA somatic mutation calling from RNA-sequencing Identifying DNA variants from RNA-seq requires filtering out many sources of false positives. These include sequencing errors, RNA editing events, mapping errors around splice junctions, germline variants, and other sequencing/mapping biases. To address these issues, we developed a custom mutation calling pipeline that borrows ideas from classical DNA-based variant calling coupled with extra filters that to increase the fidelity of the mutation calls from RNA-seq. After mapping the raw RNA-seq reads, the pipeline consists of three main parts: 1) identifying positions with two base calls, 2) removing germline variants, and 3) filtering out other potentially spurious variants. Identifying positions with two base calls. After mapping, bam files were scanned to identify genomic positions that were covered by reads having exactly two different base calls. Given the intrinsic sequencing error rate, we only considered positions with high coverage and high sequencing quality for this step. Stringent cutoffs were set for coverage (c >= 40 reads) and sequencing quality (qs >=30 in Phred scale); this is in comparison twice what some DNA-based calling algorithms have used21. Finally, only positions in which the minor allele was supported by at least 6 reads were considered (n >= 6). In combination these parameters define a theoretical distribution of the probability of observing a mutation due to sequencing errors, which is small for a wide range of sequence coverage levels (Supp. Fig. 1a). In addition we included a filter to only consider variants with a probability of sequencing error p<0.0001 (see filters below). These strict cutoffs ensure a low probability of observing sequencing errors even in highly covered genes; nonetheless we still apply further filters to account for sequencing errors (see below). Identifying and eliminating germline variants. Common and rare germline variants have to be excluded for the proper identification of somatic mutations. To eliminate common variants a strict filter was applied by using a human genome masked with Ns for positions known to have common variants from dbSNP v142 (see above).

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

To eliminate all other germline variants, including rare ones, we utilized the low confidence germline variants called by GTEx. These calls were made by GTEx using GATK’s HaplotypeCaller v3.4 using whole-genome sequencing data at 30x coverage from whole blood samples. We specifically used the vcf file GTEx_Analysis_2016-01- 15_v7_WholeGenomeSeq_652Ind_GATK_HaplotypeCaller.vcf which contains all germline variants before filtering for MAF and low-quality variants. Our goal was to exclude as many germline variants as possible, so we reasoned that using all germline variants called by GTEx – including the low-quality ones – was the safest option to minimize contamination in our somatic mutation calls. While these variants were called in whole-blood samples, germline variants should be present in all tissues of an individual and as such these variants were excluded in a per-individual basis across all tissues of that individual. Only sites that had heterozygous or homozygous variants for the alternate allele were excluded from the mutation calls of the given individual. Filtering out artifacts. A total of 13 filters were applied to exclude a variety of artifacts: 1. Blacklisted regions. We excluded sex chromosomes, unfinished chromosomal scaffolds, the mitochondrial genome and the HLA locus in chromosome 6 which is known to contain a high density of germline variants48 making accurate mapping challenging, and is hence a potential source of false-positive calls. 2. RNA edits. The most prevalent type of RNA-editing in humans is Adenine-to-Inosine (A>I) which is observed as an A>G/T>C substitution in sequencing data. The Rigorously Annotated Database of A-to-I RNA editing (RADAR)49 has extensively curated RNA-edit events including calls identified using the GTEx data50. We also obtained RNA-edit information from the Database of RNA-editing (DARNED) that includes further edit types curated from published studies51. We excluded all positions described in RADAR v2 (http://lilab.stanford.edu/GokulR/database/Human_AG_all_hg19_v2.txt) and in DARNED (https://darned.ucc.ie/static/downloads/hg19.txt), and observed an average decrease of 10% mutations per sample in our RNA-sequencing calls but not in our DNA- sequencing calls from GTEx (Supp. Fig. 2a), indicating that we are indeed eliminating real RNA-edit events that would otherwise be false-positive mutation calls. 3. Splice junction artifacts. Splice junctions are difficult to resolve during mapping because a gap has to be introduced in reads spanning a splice junction to map it to the corresponding exons in the genome. We observed that the mutation rate was higher close to annotated exon ends and it stabilized at ~7 bp away from the exon end across all tissues (Genecode v19 genes v7 annotation file) (Supp. Fig. 1b). Most of these are likely mapping artifacts and we therefore excluded all mutations located less than 7 bp away from an annotated exon end. 4. Sequencing errors. While sequencing errors are unlikely to be found given the parameters established in the first part of the mutation calling pipeline (see above), we additionally filtered out mutations that had a probability of being sequencing errors greater than 0.01%. This probability was calculated using the upper tail integral of the binomial where the number of successes is the number of reads supporting the alternate allele, the number of events is the coverage in that position, and the probability of success is the conservative assumption of p=0.001 which equals our cutoff of phred score 30 during the first part of the pipeline (see above). This is extremely conservative

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

because it does not incorporate the probability of observing the exact same base call across all reads supporting the alternate allele. 5. Read position bias. To eliminate any systematic bias of a mutation being consistently called around the same position along reads supporting it versus the rest of the reads, we excluded mutations that had a p-value less than 0.05 when applying a Mann-Whitney U test of the positions in the read supporting the alternate allele vs the positions supporting the reference allele. For these tests we used BCFtools52 mpileup. 6. Mapping quality bias. We excluded mutations that had a p-value less than 0.05 when applying a Mann-Whitney U test comparing mapping quality scores of the base calls supporting the alternate allele vs the mapping quality scores of reads supporting the reference allele. For these tests we used BCFtools52 mpileup. 7. Sequence quality bias. We excluded mutations with a p-value less than 0.05 when applying a Mann-Whitney U test comparing sequencing quality scores of base calls supporting the alternate allele vs the scores of base calls supporting the reference allele. For these tests we used BCFtools52 mpileup. 8. Strand bias. We excluded mutations with a p-value less than 0.05 when applying a Mann- Whitney U test comparing strand bias of bases supporting the reference and alternate allele (i.e., cases where mutations were only observed on one strand were excluded; this is not related to the strand asymmetry we observed for some mutation types). For these tests we used the ‘Mann-Whitney U test of Mapping Quality vs Strand Bias’ of BCFtools52 mpileup. 9. Variant distance bias. We excluded variants that showed a high or low mean pairwise distance between the alternate allele positions in the reads supporting it. Similar to the read position bias filter, this ensures that we filter mutations that are consistently observed around the same region of all the reads that support it. We used a cut off of p<0.05 for a two-tail distribution of simulated mean pairwise distances from the BCFtools52 implementation. 10. RNA-specific allele frequency bias. We observed an enrichment of variants having a variant allele frequency (VAF) greater than 0.9 only in mutation calls from RNA- sequencing but not from the matched DNA-sequencing samples. This RNA-specific bias could be due to several factors, including allele specific expression leading to enrichment of the alternate allele, RNA editing, or systematic artifacts during RNA extraction, library construction, and sequencing. We took a conservative approach and excluded all mutations that had a VAF greater than 0.7. 11. Tissue-specific mutation effects. To eliminate false-positives arising by systematic artifacts of unknown origin, we first looked for recurrent mutations observed in many samples of a given tissue. While these mutations could be real and have biological impact, they may also reflect a shared systematic artifact that is producing the same exact mutation across several samples of the same tissue. We decided to take a conservative approach and eliminated all mutations that were called in at least 40% of the samples in one tissue. Even though we labelled these mutations on a per-tissue basis, once identified we removed them from any sample in any tissue that had them.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

12. Overall systematic mutation bias. We further eliminated mutations that were present in at least 4% of all samples. Similarly, as in the previous step these mutations are more likely to have originated from a systematic artifact. 13. Hyper-mutated samples. We excluded samples that had an excess of mutations compared to what it was expected from sequencing depth and biological factors. To do so we looked at the residuals after applying a linear regression on mutation numbers using as features sequencing depth, age, sex, and BMI. We observed 48 samples that had residual values greater than 1500 (i.e. they had >1500 more mutations than expected by other factors) (Supp. Fig. 1e) and excluded them from further analysis, leaving 7,584 remaining. We did not observe any hypo-mutated samples having similar residual values in the opposite direction.

Method validation To validate the precision of our DNA somatic mutation calls from RNA-seq, we sought to compare those calls to DNA sequencing from the exome. The GTEx Consortium performed DNA exome capture followed by sequencing in blood samples with a median coverage of 80x. From the GTEx dbGaP repository (phs000424.v7.p2) we downloaded raw DNA exome sequencing runs from 105 randomly chosen donors. We mapped the reads and called mutations identically to our RNA-seq samples. Throughout the study we used the mutation calls from exome to validate certain results when appropriate. To validate the overall mutation calling pipeline we matched RNA-seq to exome DNA-seq samples of the same individuals. Then, we asked what percentage of the mutation calls from RNA-seq had reads supporting the alternate allele in the matched exome DNA-seq. To account for coverage differences between RNA-seq and DNA-seq, we focused this analysis only on mutation calls for which we had reasonable power of detection to validate mutations in DNA-seq data. This is especially relevant because in highly expressed genes, sequencing coverage can be >1000x in RNA-seq; therefore we could detect mutations with VAF as low as 0.006 for a 1000x- covered position (given our minimum alternate allele 6-read cutoff described above). In comparison, in exome DNA-seq given the 80x median coverage we expect to see only 0.48 reads supporting the alternate allele of a variant with VAF = 0.006. We therefore devised a way to account for this coverage difference between RNA-seq and DNA-seq, and only compared positions where we had power of detection in both experiments. We first made the following calculation: /#,0()*+,- "# = %#,'()*+,- . 1 %#,0()*+,-

Where Ci and Ai are total and alternate read counts for a mutation in position i; ri effectively represents the number of expected reads to support the alternate allele in DNA-seq at position i given the VAF observed in RNA-seq. As shown in Supplementary Figure 1d, we took a conservative approach and only compared genomic positions where r ≥ 8, to ensure sufficient power of validation. For all positions selected we then calculated the percentage of mutations observed in an RNA-seq sample that had at least one read supporting the variant allele in the exome DNA-seq, effectively resulting in a false-discovery rate. When assigning a random alternate allele (not

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

including the reference allele among the choices) to the mutated positions in RNA-seq we did not observe support for >98% of the position-permuted mutations, thus proving the validity of this FDR calculation (Supp. Fig. 1c).

Discovery of mutation associations with biological and non-biological factors We first sought to identify and subtract the effects of non-biological factors influencing the number of mutations observed per sample. To this end we grouped samples by tissue and performed a linear regression on the mutation numbers using non-biological factors as explanatory variables, which included sequencing depth, transcriptome Shannon diversity, time spent in the PAXgene fixative (SMTSPAX from GTEx sample attributes), total ischemic time (SMTSISCH from GTEx sample attributes), and RNA integrity number (SMRIN from GTEx sample attributes). We assessed the significance of association between mutation number and any factor by the p-value of each coefficient in a multiple regression (Supp. Table 4). To discover biological factors associated with mutation numbers, we took the residuals of the previous linear regression (constructed with non-biological explanatory variables) and performed another regression using biological factors as explanatory variables. Biological factors included: biological sex, age, BMI, and the first three genotype principal components constructed from the ~107 SNPs with MAF >= 0.1 available from GTEx. Significance was assessed by the p- value of each coefficient in a multiple regression (Supp. Table 4). Additionally, Spearman correlations were independently performed between each biological factor and the residuals from the linear regression between mutation number and non-biological factors.

tSNE tSNE was performed on a matrix of m rows and n columns, where m are samples and n are mutation type counts normalized by their corresponding sequence content in the sample transcriptome. We used a 1536-type mutation profile including two base-pairs upstream and downstream of the mutation site (pentanucleotide profile). The normalized mutation count (muti) by sequence content in the transcriptome was obtained as follows:

5# 234# = 8,:,+ ∑8 78,#98

Where ci is the count of the mutation type i, sg,i is the number of occurrences of the i sequence context in gene g, eg is the expression in TMP of gene g, and genes are all genes with TPM >= 1 in the given sample. This calculation was performed separately for every sample. We used the tSNE implementation in R (Rtsne) with parameters dims = 2, max_iter = 500, perplexity = 30, pca = TRUE, theta = 0.5. To quantify the clustering among different groupings in the tSNE two-dimensional space, we calculated a silhouette score (SS) for each group defined in Figure 2f by first obtaining a silhouette score (si) for each sample in each group:

;# − =# 7# = max (=#, ;# )

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

Where ai is the average distance of sample i to all other samples inside the group, and bi is the average distance of sample i to samples outside the group. We then calculated the average score s of all samples in a given group and the 95% confidence intervals based on bootstrapping 10,000 times. We performed the individual-based grouping by calculating silhouette scores for all tissues from 20 randomly selected individuals and then averaging them. The random expectation was calculated by permuting the tissue labels across samples 10 times, repeating the SS calculation across tissues and then averaging all SSs.

Cell type decomposition for blood and lung samples We used CIBERSORT45 to identify the cell type composition of each whole blood sample in the GTEx data. Briefly, CIBERSORT applies a support vector regression on a gene expression profile(s) using reference gene expression signatures from different cell types, and then retrieves the cell type composition from the signatures in the original expression profile(s). We used the online portal (https://cibersort.stanford.edu/) and the default LM22 expression signatures composed of the most prevalent immune cell types. CIBERSORT was run on the blood gene expression profiles with default parameters: 100 permutations and “absolute” mode.

Gene expression associations To avoid population-based effects for all expression associations analyses in this study we only used self-reported Caucasian people. For each tissue we used individual gene expression to model mutation counts in a linear regression as follows:

D# = EF + H E: 5IJ:,# + KL# :

Where x is the expression of a given gene in TPM, cov is a covariate variable and there are n covariates. To estimate the effects of a SNP on the phenotype, g is calculated for each gene in the genome that has a TPM > 1 in at least 20% of the samples. Significance is measured based on the p-value of a non-zero t-test performed on g. P-values are adjusted using Bonferroni correction. When pi is the vector of mutation loads, we defined it as the residuals after regressing non-biological factors as described previously (see above in “Discovery of mutation associations with biological and non-biological factors”). The covariates included in this model are the first 3 PCs of the whole-genome , age, sex, and BMI. In all cases both pi and g were normalized by converting the values into quantiles and mapping them to the corresponding values of the standard normal distribution quantiles, following standard GTEx practices23.

Gene Ontology analysis For gene expression associations with mutation load, we assessed enrichment in GO biological processes using GOrilla53. We used as input a ranked list of genes based on the number of tissues they were significant in (and only including genes that were tested in more than the

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

median number of tissues all genes were tested in, p<0.05 after Bonferroni correction, see above in “Gene expression associations”) and breaking ties by significance. We then used REVIGO54 to obtain non-redundant categories.

Expression associations with genomic instability genes We selected a panel of genes known to be involved in different pathways of DNA repair or translesion replication (Fig. 4) and performed association analyses between their expression and the mutational load for all different mutation types on a per-tissue basis. We followed the same linear regression strategy as described above in “Gene expression associations”. We devised customized strategies to calculate FDRs for individual gene-tissue associations, gene enrichment across all tissues, and pathway (gene group) associations in each tissue. For individual gene-tissue associations, we calculated an FDR based on the distribution of p-values of the linear regressions from the tests of all genes in the genome (see above for additional filters). Then, for each gene of interest we calculated the FDR as the percentage of genes from all tests that had an equal or lower p-value. To obtain an FDR of the enrichment for different pathways across tissues, for each pathway of interest we obtained the p-values from the individual linear regressions of each gene. We then created 10,000 groups with randomly selected genes with the same size as the pathway of interest and within the tissue of interest, and performed linear regressions with mutation load (as describe above in “Gene expression associations”). Finally, for each p-value in the original group we calculated the FDR by dividing the percentage of p-values of equal or lower value in all permuted regressions by the percentage of p-values of equal or lower value in the original regressions. For example, if 0.1% of permuted regressions reached p<0.001, compared to 1% of unpermuted regressions, the FDR at p<0.001 would be 0.1/1 = 10%. Lastly, to calculate FDR for the enrichment of individual genes across all tissues we used a similar strategy as the FDR calculation for gene pathways described in the previous paragraph. We first obtained the p-values of individual linear regression of a given gene across all tissues. We then created a permuted set of p-values from the regressions of one gene from each of the 10,000 permuted groups in all tissues. Finally, we calculated the FDR of the p-values in the original group (one across all tissues) by dividing the percentage of p-values of equal or lower p-value in all permuted regressions by the percentage of p-values of equal or lower value in the original regressions.

Chromatin analysis To address the influence of chromatin on somatic mutations, we first manually mapped tissues from GTEx to tissues of the Roadmap Epigenomics Project28. We were able to do such mapping for 18 tissues (Supp. Table 7). For each tissue, on a per-exon basis we calculated both mutation rates and signal for H3K36me3, H3K4me1, H3K4me3, H3K27me3, H3K9me3. To account for sequencing depth (expression) of an exon when calculating mutation rates, we assigned exons to 100 bins based on their sequencing depth, where each bin contains 1% of exons. We then calculated the median number of mutations observed in each bin, and finally the mutation rate per exon was obtained by subtracting expected number of mutations (the median for all exons in that bin) from the observed number for that exon. Chromatin signal for each exon was obtained from Roadmap ChIP-seq data. We calculated

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

the average base-pair ratios of IP/input obtained from the Roadmap bigwig files. Significance of the association between each chromatin mark and mutation rate was assessed by applying a linear regression on the mutation rate using all chromatin marks as features.

Mutation enrichment in COSMIC cancer mutations We downloaded the entire set of cancer mutations from COSMIC36 v86 and we further filtered them to only keep single nucleotide variants without indels. For each sample we calculated the percentage (overlap) of their mutations that are present in COSMIC mutations, and we calculated the significance of the overlap using the integral of the upper tail from a hypergeometric distribution, that is p(X > k), where X follows a hypergeometric distribution with parameters K (number of mutations in sample), n (number of COSMIC mutations whose positions were covered by >= 40 reads in the RNA-seq sample), N x 3 (the total number of base pairs covered by >= 40 reads in the RNA-seq sample multiplied by the three possible alternate alleles that any reference can have), and k (the overlap of mutations, K and n). As a control, for each sample this calculation was repeated using a permuted set of mutations. This set was constructed by randomly selecting genomic positions that were covered by >=40 reads in the sample and then mutations were simulated in those positions, importantly we conserved the number of reference alleles and their corresponding alternate alleles from the original mutations in this permuted set.repeated.

dN/dS analysis To calculate dN/dS ratios we applied a previously described method38 that uses a Poisson distribution to model the number of mutations with different impacts (i.e. synonymous vs non- synonymous). Briefly, the Poisson distribution is based on the relative content of a mutation type (e.g. C>T) across all types, the total content of that mutation type, and the density of mutations per site. For non-synonymous mutations an extra parameter represents the effect of selection (dN/dS), and maximum-likelihood estimates are calculated by Poisson regression for all parameters. This framework accounts for different substitution rates across different genes as well as sequence composition. We used dndsloc, which is the implementation of this method in R9 (https://github.com/im3sanger/dndscv).

Mutation analyses on cancer driver genes We downloaded the list of genes known to contain at least one cancer driver mutation, based on the latest TCGA publication on cancer driver genes37(Supp. Table 10). Since we did more in depth analysis in this set of genes, we further removed potential false-positive mutation calls that could bias gene-level analysis but are not necessarily an issue for genome-wide analysis. For some of these genes we observed a high number of total mutations but low number of unique mutations, in other words some mutations accounted for most of the unique mutated sites, which has been previously described as an artifact9. To flag these events, for each mutation we calculated a metric t as the ratio of the counts of that mutation divided by the unique number of mutations found in the gene-of-origin. Mutations with high t values account for most of the unique mutated sites in their gene-of-origin, leading to a low diversity in mutations of a given gene. These mutations may be artifacts and can bias dN/dS ratios9. The distribution of t values

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

for mutations in this set of genes was bimodal (Supp. Fig. 11b) and we therefore excluded mutations with r > 3.5. To assess the mutation load on these genes and their significance, for each tissue we calculated the mutation rate of these genes, and then calculated whether this mutation rate was higher or lower than expected from the overall mutation rate on these genes across all tissues (Fig. 5b). To do so we calculated the overall mutation rate in cancer driver genes across all tissues (k) and then for each tissue we calculated the probability of the number of observed mutations (n) in these genes using the binomial distribution [X ~ binom(n,k)]. For tissues showing more mutations than expected we used the integral of the right tail of the binomial distribution to calculate the probability of observing n mutations, and conversely for tissues showing fewer mutations we used the integral of the left tail of the distribution. FDR was calculated using Benjamini-Hochberg method on the p-values. We annotated the oncogenic status for the mutations in these driver genes using the oncokb39 tool MafAnnotator.py.

Acknowledgements

We would like to thank J. Pritchard, B. Li, and members of the Fraser Lab for helpful feedback. PEGN is supported by a Bio-X Bowes Graduate Student Fellowship. This work was supported by NIH grant 2R01GM097171-05A1.

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

REFERENCES

1. Kennedy, S. R., Loeb, L. A. & Herr, A. J. Somatic mutations in aging, cancer and neurodegeneration. Mech. Ageing Dev. 133, 118–126 (2012). 2. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018). 3. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013). 4. Schuster-Bockler, B. & Lehner, B. Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507 (2012). 5. Paz Polak, Rosa Karlic, Amnon Koren, Robert Thurman, Richard Sandstrom, Michael S. Lawrence, A. R. & Eric Rynes, Kristian Vlahovicˇek, J. A. S. & S. R. S. Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364 (2015). 6. García-Nieto, P. E. et al. susceptibility is regulated by genome architecture and predicts cancer mutagenesis. EMBO J. 36, 2829–2843 (2017). 7. Smith, K. S., Liu, L. L., Ganesan, S., Michor, F. & De, S. Nuclear topology modulates the mutational landscapes of cancer genomes. Nat. Struct. Mol. Biol. 24, 1000–1006 (2017). 8. Yadav, V. K., DeGregori, J. & De, S. The landscape of somatic mutations in protein coding genes in apparently benign human tissues carries signatures of relaxed purifying selection. Nucleic Acids Res. 44, 2075–2084 (2016). 9. Martincorena, I. et al. Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171, 1029-1041.e21 (2017). 10. Temko, D., Tomlinson, I. P. M., Severini, S., Schuster-Böckler, B. & Graham, T. A. The effects of mutational processes and selection on driver mutations across cancer types. Nat. Commun. 9, 1857 (2018). 11. Zapata, L. et al. Negative selection in tumor genome evolution acts on essential cellular functions and the immunopeptidome. Genome Biol. 19, 67 (2018). 12. Samstein, R. M. et al. Tumor mutational load predicts survival after immunotherapy across multiple cancer types. Nat. Genet. (2019). doi:10.1038/s41588-018-0312-8 13. Bose, S., Deininger, M., Gora-Tybor, J., Goldman, J. M. & Melo, J. V. The presence of typical and atypical BCR-ABL fusion genes in leukocytes of normal individuals: biologic significance and implications for the assessment of minimal residual disease. Blood 92, 3362–7 (1998). 14. Xie, M. et al. Age-related mutations associated with clonal hematopoietic expansion and malignancies. Nat. Med. 20, 1472–1478 (2014). 15. Martincorena, I. et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science (80-. ). 348, 880–886 (2015). 16. Bae, T. et al. Different mutational rates and mechanisms in human cells at pregastrulation and . Science (80-. ). eaan8690 (2017). doi:10.1126/science.aan8690 17. Lodato, M. A. et al. Aging and neurodegeneration are associated with increased mutations in single human . Science (80-. ). eaao4426 (2017). doi:10.1126/science.aao4426

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

18. Martincorena, I. et al. Somatic mutant clones colonize the human esophagus with age. Science (80-. ). eaau3879 (2018). doi:10.1126/SCIENCE.AAU3879 19. Yokoyama, A. et al. Age-related remodelling of oesophageal epithelia by mutated cancer drivers. Nature 565, 1 (2019). 20. Lee-Six, H. et al. The landscape of somatic mutation in normal colorectal epithelial cells. bioRxiv 416800 (2018). doi:10.1101/416800 21. Xu, C. A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput. Struct. Biotechnol. J. 16, 15–24 (2018). 22. Enge, M. et al. Single-Cell Analysis of Human Pancreas Reveals Transcriptional Signatures of Aging and Somatic Mutation Patterns. Cell 171, 321–330 (2017). 23. Consortium, G. Genetic effects on gene expression across human tissues. (2017). doi:10.1038/nature24277 24. Leija-Salazar, M., Piette, C. & Proukakis, C. Review: Somatic mutations in neurodegeneration. Neuropathol. Appl. Neurobiol. 44, 267–285 (2018). 25. Lindeboom, R. G. H., Supek, F. & Lehner, B. The rules and impact of nonsense-mediated mRNA decay in human . Nat. Genet. 48, 1112–1118 (2016). 26. Igo Martincorena, I., Raine, K. M., Davies, H., Stratton, M. R. & Campbell, P. J. Universal Patterns of Selection in Cancer and Somatic Tissues. (2017). doi:10.1016/j.cell.2017.09.042 27. Van Der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, (2008). 28. Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). 29. Haradhvala, N. J. J. et al. Mutational Strand Asymmetries in Cancer Genomes Reveal Mechanisms of DNA Damage and Repair. Cell 164, 538–549 (2016). 30. Kucab, J. E. et al. A Compendium of Mutational Signatures of Environmental Agents. Cell 0, (2019). 31. Sharma, S. et al. Mitochondrial hypoxic stress induces widespread RNA editing by APOBEC3G in natural killer cells. Genome Biol. 20, 1–17 (2019). 32. Kim, H. & Kim, Y.-M. Pan-cancer analysis of somatic mutations and transcriptomes reveals common functional gene clusters shared by multiple cancer types. Sci. Rep. 8, 6041 (2018). 33. Eliopoulos, A. G., Havaki, S. & Gorgoulis, V. G. DNA Damage Response and Autophagy: A Meaningful Partnership. Front. Genet. 7, 204 (2016). 34. Wang, Y. JimMY on the stage: Linking DNA damage with cell adhesion and motility. Cell Adh. Migr. 4, 166–8 35. Richman, S. Deficient mismatch repair: Read all about it (Review). Int. J. Oncol. 47, 1189– 202 (2015). 36. Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, 941–947 (2018). 37. Bailey, M. H. et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173, 371-385.e18 (2018). 38. Wong, C. C. et al. Inactivating CUX1 mutations promote tumorigenesis. Nat. Genet. 46, 33–38 (2014).

bioRxiv preprint doi: https://doi.org/10.1101/668624; this version posted June 12, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-ND 4.0 International license.

39. Chakravarty, D. et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precis. Oncol. 1, 1–16 (2017). 40. Martincorena, I. & Campbell, P. J. Somatic mutation in cancer and normal cells. Science (80-. ). 349, 1483–1489 (2015). 41. Cancer Genome Atlas Research Network, T. et al. Comprehensive and Integrated Genomic Characterization of Adult Soft Tissue Sarcomas. (2017). doi:10.1016/j.cell.2017.10.014 42. Chang, M. T. et al. Accelerating Discovery of Functional Mutant Alleles in Cancer. Cancer Discov. 8, 174–183 (2018). 43. Gao, J. et al. 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome Med. 9, 4 (2017). 44. Tomasetti, C. & Vogelstein, B. Cancer etiology. Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science 347, 78–81 (2015). 45. Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015). 46. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013). 47. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001). 48. Buhler, S. & Sanchez-Mazas, A. HLA DNA sequence variation among human populations: molecular signatures of demographic and selective events. PLoS One 6, e14643 (2011). 49. Ramaswami, G. & Li, J. B. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 42, D109–D113 (2014). 50. Tan, M. H. et al. Dynamic landscape and regulation of RNA editing in . Nature 550, 249–254 (2017). 51. Kiran, A. M., O’Mahony, J. J., Sanjeev, K. & Baranov, P. V. Darned in 2013: inclusion of model organisms and linking with Wikipedia. Nucleic Acids Res. 41, D258–D261 (2012). 52. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078– 2079 (2009). 53. Eden, E., Navon, R., Steinfeld, I., Lipson, D. & Yakhini, Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10, 48 (2009). 54. Supek, F., Bošnjak, M., Škunca, N. & Šmuc, T. REVIGO Summarizes and Visualizes Long Lists of Gene Ontology Terms. PLoS One 6, e21800 (2011).