Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Interplay between coding and exonic splicing regulatory sequences
2 3 Nicolas Fontrodona1,$, Fabien Aubé1,$, Jean-Baptiste Claude1, Hélène Polvèche1,§ , Sébastien Lemaire1, 4 Léon-Charles Tranchevent2, Laurent Modolo3, Franck Mortreux1, Cyril F. Bourgeois1, Didier Auboeuf1,* 5 6 1. Univ Lyon, ENS de Lyon, Univ Claude Bernard, CNRS UMR 5239, INSERM U1210, Laboratory of 7 Biology and Modelling of the Cell, 46 Allée d'Italie Site Jacques Monod, F-69007, Lyon, France 8 2. Proteome and Genome Research Unit, Department of Oncology, Luxembourg Institute of Health 9 (LIH), Strassen, Luxembourg 10 3. LBMC biocomputing center, CNRS UMR 5239, INSERM U1210, 46 Allée d'Italie Site Jacques Monod, 11 F-69007, Lyon, France 12 §. Present address: INSERM UMR 861, I-STEM, 28, Rue Henri Desbruères, 91100 Corbeil-Essonnes, 13 France 14 $ Equal contribution 15 *Corresponding author: Didier Auboeuf, Laboratory of Biology and Modelling of the Cell, ENS de 16 Lyon, 69007 Lyon, France. [email protected] 17
1
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Abstract 2 The inclusion of exons during the splicing process depends on the binding of splicing factors
3 to short low-complexity regulatory sequences. The relationship between exonic splicing
4 regulatory sequences and coding sequences is still poorly understood. We demonstrate that 5 exons that are coregulated by any given splicing factor share a similar nucleotide
6 composition bias and preferentially code for amino acids with similar physicochemical
7 properties because of the non-randomness properties of the genetic code. Indeed, amino 8 acids sharing similar physicochemical properties correspond to codons that have the same
9 nucleotide composition bias. In particular, we uncover that the TRA2A and TRA2B splicing
10 factors that bind to adenine-rich motifs promote the inclusion of adenine-rich exons coding 11 preferentially for hydrophilic amino acids that correspond to adenine-rich codons. SRSF2
12 that binds guanine/cytosine-rich motifs promotes the inclusion of GC-rich exons coding
13 preferentially for small amino acids, while SRSF3 that binds to cytosine-rich motifs promotes
14 the inclusion of exons coding preferentially for uncharged amino acids, like serine and
15 threonine that can be phosphorylated. Finally, coregulated-exons encoding amino acids with
16 similar physicochemical properties correspond to specific protein features. In conclusion, the 17 regulation of an exon by a splicing factor that relies on the affinity of this factor for specific
18 nucleotide(s) is tightly interconnected with the exonic encoded physicochemical properties.
19 We therefore uncover an unanticipated bidirectional interplay between the splicing 20 regulatory process and its biological functional outcome.
2
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Introduction 2 Alternative splicing is a cellular process involved in the regulated-inclusion or -exclusion of 3 exons during the processing of mRNA precursors. Alternative splicing is the rule in human since 95% 4 of human genes produce several splicing variants (Pan et al. 2008; Wang et al. 2008). The exon 5 selection process relies on RNA binding proteins or splicing factors that enhance or repress exon 6 inclusion following two main principles. First, splicing factors bind to short intronic or exonic motifs 7 (or splicing regulatory sequences) that are often low-complexity sequences composed of the 8 repetition of the same nucleotide or dinucleotide (Fu and Ares 2014). In addition, the interaction of 9 splicing factors with their cognate binding motifs often depends on the sequence context and on the 10 presence of clusters of related-binding motifs (Zhang et al. 2013; Cereda et al. 2014; Fu and Ares 11 2014; Dominguez et al. 2018; Jobbins et al. 2018). The second principle states that the splicing 12 outcome (i.e., exon inclusion or skipping) depends on where splicing factors bind on pre-mRNAs with 13 respect to the regulated exons. For example, HNRNP-like splicing factors often repress the inclusion 14 of exons they bind to, but they enhance exon inclusion when they do not bind to exons but instead 15 bind to introns (Erkelenz et al. 2013; Fu and Ares 2014; Geuens et al. 2016). Meanwhile, the exonic 16 binding of SR-like splicing factors (SRSFs) usually enhances exon inclusion (Erkelenz et al. 2013; Fu 17 and Ares 2014). 18 Since some splicing regulatory sequences lie within protein-coding sequences, a major 19 challenge is to understand how coding sequences accommodate this overlapping layer of 20 information (Itzkovitz et al. 2010; Lin et al. 2011; Savisaar and Hurst 2017a; Savisaar and Hurst 21 2017b). To date, a general assumption is that protein coding sequences can accommodate 22 overlapping information or “codes” (including the “splicing code”) as a direct consequence of the 23 redundancy of the genetic code that allows the same amino acid to be encoded by several codons 24 differing only on their third "wobble" nucleotide (Goren et al. 2006; Itzkovitz and Alon 2007; Itzkovitz 25 et al. 2010; Lin et al. 2011; Shabalina et al. 2013; Savisaar and Hurst 2017a; Savisaar and Hurst 26 2017b). Therefore, coding and exonic splicing regulatory sequences could evolve independently 27 because of the variation of the third nucleotide of codons. In this setting, it has recently been shown 28 that splicing regulatory sequence-selection severely constrains coding sequence evolution (Savisaar 29 and Hurst 2018). In addition, it has been reported that some amino acids are preferentially encoded 30 near exon-intron junctions because of the presence of general splicing consensus sequences near 31 splicing sites (Parmley et al. 2007; Warnecke et al. 2008; Smithers et al. 2015). Finally, recent 32 evidence has suggested that exons that are coregulated in specific pathophysiological conditions may 33 code for protein regions engaged in similar cellular processes (Irimia et al. 2014; Tranchevent et al. 34 2017). These observations raised the possibility that exons regulated by the same splicing regulatory 35 process code for similar biological information. So far, the lack of large sets of coregulated exons has
3
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 limited the studies addressing the interplay between the splicing regulatory process and peptide 2 sequences encoded by splicing regulated exons. By focusing on exons coregulated by different 3 splicing factors, we uncover a bidirectional interplay between the physicochemical protein features 4 encoded by exons and their regulation by splicing factors. 5 6 7 Results 8 Nucleotide composition bias of coregulated exons 9 To investigate the relationship between exonic splicing regulatory sequences and coding 10 sequences, we analyzed publicly available RNA-seq datasets generated from different cell lines 11 transfected with siRNAs or shRNAs targeting specific splicing factors (e.g., SRSF1, SRSF3, TRA2A, 12 TRA2B) or transfected with an SRSF2-expression vector (Supplemental Table S1). Since TRA2A and 13 TRA2B are paralogous and have been co-depleted in the analyzed datasets, we will refer both of 14 these factors as TRA2. SRSF1, SRSF2, SRSF3, TRA2 belong to the family of Arg/Ser (RS) domain- 15 containing splicing factors (SRSFs). Each splicing factor regulated a common set of exons in several 16 cell lines but many exons were regulated on a cell line-specific mode (Supplemental Figure S1). 17 SRSF1, SRSF2, SRSF3, and TRA2 bind to GGA-rich motifs, SSNG motifs (where S=G or C), C-rich and G- 18 poor motifs, and AGAA-like motifs, respectively (Grellscheid et al. 2011; Tsuda et al. 2011; Anko et al. 19 2012; Pandit et al. 2013; Ray et al. 2013; Best et al. 2014; Fu and Ares 2014; Anczukow et al. 2015; 20 Hauer et al. 2015; Giudice et al. 2016; Luo et al. 2017). As expected, hexanucleotides enriched in 21 SRSF1-, SRSF2-, SRSF3-, or TRA2-activated exons across different cell lines were enriched in purine- 22 rich, S-rich, C-rich, or A-rich hexanucleotides, respectively, when compared to control exons and in 23 contrast to exons repressed by the same factor (position weight matrices (PWM) in Figure 1A, 24 Supplemental Figure S2A and Supplemental Table S2). 25 Each set of splicing factor-regulated exons had a specific nucleotide composition bias. 26 Indeed, SRSF1-activated exons (but not SRSF1-repressed exons) were enriched in G when compared 27 to control exons (Mann-Whitney U test p-value 0.029; Figure 1A, and see Supplemental Figure S2B 28 and Table S2). SRSF2-activated exons (but not SRSF2-repressed exons) were enriched in S (G or C; 29 Figure 1A, Randomization test FDR < 1x10-3). SRSF3-activated exons were enriched in C (Figure 1A, 30 Randomization test FDR < 1x10-4) and impoverished in G (Supplemental Figure S2C). Finally, TRA2- 31 activated exons were enriched in A (Randomization test FDR < 1x10-4), unlike to TRA2-repressed 32 exons (Randomization test FDR < 1x10-4), when compared to control exons (Figure 1A). Accordingly, a 33 high density of G, S, C, or A nucleotides was more frequent in SRSF1-, SRSF2-, SRSF3-, or TRA2- 34 activated exons, respectively, when compared to the corresponding repressed exons (Figure 1B, 35 Kolmogorov-Smirnov (K-S) test p-value < 1x10-14).
4
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 While the enriched nucleotides within splicing factor-regulated exons could be randomly 2 distributed across exons, we observed an increased frequency of specific dinucleotides and low- 3 complexity sequences. For example, GG, SS, CC, or AA dinucleotides were more frequent in SRSF1-, 4 SRSF2-, SRSF3-, or TRA2-activated exons, respectively, than in control exons or in the corresponding 5 repressed exons (Supplemental Figure S2D and Table S2). We next performed a logistic regression 6 analysis to test differences in low-complexity sequence content between activated and repressed 7 exons for a given splicing factor while accounting for cell-line variations. As shown in Figure 1C, a 8 larger proportion of SRSF1-, SRSF2-, SRSF3-, or TRA2-activated exons contained G-, S-, C-, or A-rich 9 low-complexity sequences, respectively, when compared to the corresponding repressed exons 10 (logistic regression analysis p-value < 3x10-7). 11 12 Amino acid composition bias of SRSF-coregulated exons 13 We next analyzed the codon content of exons regulated by SRSF-related splicing factors. In 14 agreement with the nucleotide composition bias described above, SRSF1-, SRSF2-, SRSF3-, or TRA2- 15 activated exons were enriched in G-, S-, C-, or A-rich codons, respectively, compared to both sets of 16 control exons or the corresponding repressed exons (Figure 2A, Randomization test FDR < 0.05; and 17 see Supplemental Figure S3 and Table S2). In this setting, it has been proposed that coding sequences 18 can accommodate exonic splicing regulatory sequences through variation of the third codon position 19 (see Introduction). Accordingly, the nucleotide composition bias observed at the whole exon level 20 (Figure 1) was also observed on the third codon position across some datasets and for a subset of 21 codons (Figure 2B, upper panels and Supplemental Table S2 and Fig. S4). Importantly however, the 22 identity of the nucleotide at the first or second codon positions was systematically biased (Figure 2B, 23 lower panels and Supplemental Table S2). This raises the possibility that different sets of SRSF- 24 regulated exons may preferentially code for different amino acids. 25 As shown in Figure 3A (upper panels), amino acids more frequently encoded by SRSF1-, 26 SRSF2-, SRSF3-, and TRA2-activated exons corresponded to G-, S-, C-, and A-rich codons, respectively 27 (see also Supplemental Figure S5 and Supplemental Table S2). This was in sharp contrast to the 28 corresponding repressed exons (Figure 3A, lower panels). For example, glycine (GGN codons) was 29 more frequently encoded by SRSF1-activated exons (Mann-Whitney U test p-value 0.029) than by 30 control exons and SRSF1-repressed exons (Figure 3B). A counting of glycines encoded within SRSF1- 31 activated versus SRSF1-repressed exons showed a mirrored distribution: a large proportion of 32 activated exons (~60%) coded for more than three glycines, whereas nearly 70% of repressed exons 33 coded for a maximum of one glycine (Figure 3C). Similarly, alanine (GCN codons), proline (CCN 34 codons), and lysine (AAR codons) were more frequently encoded by SRSF2-, SRSF3-, and TRA2- 35 activated exons, respectively (Figures 3B and 3C).
5
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 SRSF-coregulated exons code for amino acids with similar physicochemical properties 2 The above observations revealed a nucleotide composition bias of splicing factor-regulated 3 exons and a bias regarding the nature of the amino acids that are encoded by these exons. In this 4 setting, it is well established that amino acids sharing similar physicochemical properties (e.g., size, 5 hydropathy, charge) are encoded by similar codons (i.e., codons composed of the same nucleotides) 6 (Woese 1965; Wolfenden et al. 1979; Taylor and Coates 1989; Biro et al. 2003; Prilusky and Bibi 7 2009). For example, small amino acids (Ala, Asn, Asp, Cys, Gly, Pro, Ser, Thr), and in particular very 8 small amino acids (Ala, Gly, Ser, Cys) are encoded by S-rich codons while, large amino acids (Arg, Ile, 9 Leu, Lys, Met, Phe, Trp, Tyr) are encoded by S-poor codons (Figure 4A). SRSF2 binds to SSNG motifs 10 and SRSF2-activated exons are S-rich (see Figure 1). The two sets of analyzed SRSF2-activated exons 11 encoded more frequently for very small and small amino acids when compared to control exons, in 12 contrast to SRSF2-repressed exons (Figure 4B, Randomization test FDR < 0.0003 and Supplemental 13 Table S2). Conversely, large amino acids were less frequently encoded by SRSF2-activated exons 14 (Figure 4B, Randomization test FDR < 1x10-4). Accordingly, a high density of very small amino acids 15 was more frequent in SRSF2-activated when compared to SRSF2-repressed exons, while a high 16 density of large amino acids was more frequent in SRSF2-repressed when compared to SRSF2- 17 activated exons (Figure 4C, K-S test p-value < 5x10-6). 18 Amino acids can be classified in three families in regards to their hydropathy and each family 19 is encoded by codons having different features. Hydrophilic amino acid (Arg, Asn, Asp, Gln, Glu, Lys) 20 are encoded by A-rich codons, hydrophobic amino acids (Ala, Cys, Ile, Leu, Met, Phe, Val) are 21 encoded by U-rich and A-poor codons, and ambivalent or neutral amino acids (Gly, His, Pro, Ser, Thr, 22 Tyr) are encoded by C-rich codons (Figure 4D) (Kyte and Doolittle 1982; Engelman et al. 1986; 23 Chiusano et al. 2000; Biro et al. 2003; Pommie et al. 2004; Prilusky and Bibi 2009; Zhang and Yu 24 2011). TRA2 that binds to AGAA-like motifs activates the inclusion of A-rich exons (Figure 1). TRA2- 25 activated exons encoded hydrophilic amino acids more frequently (Randomization test FDR < 1x10-4) 26 and neutral or hydrophobic amino acids less frequently than control exons (Figure 4E, Randomization 27 test FDR < 1x10-4 and Supplemental Table S2). Similar results were obtained by using different 28 hydrophobicity propensity scales (Figure 4F). Accordingly, a high density of hydrophilic amino acids 29 was more frequently encoded by TRA2-activated when compared to TRA2-repressed exons (Figure 30 4G, K-S test p-value < 1x10-13). 31 Polar uncharged amino acids (Asn, Gln, Ser, Thr, Tyr), including hydroxyl-containing amino 32 acids (e.g., Ser and Thr) correspond to C-rich and G-poor codons, while polar charged amino acids 33 (Asp, Glu, Lys, Arg) correspond to G-rich and C-poor codons (Figure 4H) (Biro et al. 2003; Zhang and 34 Yu 2011). SRSF3 binds to C-rich motifs and activates C-rich and G-poor exons (Figure 1). Two sets of 35 SRSF3-activated exons encoded uncharged (including hydroxyl-containing) amino acids more
6
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 frequently than control exons (Randomization test FDR < 1x10-4) or SRSF3-repressed exons (Figure 4I 2 and Supplemental Table S2). They also encoded more frequently hydropathically neutral amino acids 3 (Figure 4I, Randomization test FDR < 1x10-4 for both cell lines) that correspond to C-rich codons 4 (Figure 4D). As shown in Figure 4J, a high relative density of uncharged vs. charged and hydroxyl vs. 5 negatively charged amino acids was more frequent in SRSF3-activated when compared to SRSF3- 6 repressed exons (K-S test p-value < 3x10-12). These observations suggest a link between splicing- 7 related nucleotide composition bias of splicing-regulated exons and physicochemical properties of 8 the exon-encoded amino acids. 9 10 HNRNP-corepressed exons code for amino acids with similar physicochemical properties 11 SRSF-like splicing factors activate exons they bind to, in contrast to HNRNP-like splicing 12 factors that repress exons they bind to. We analyzed publicly available RNA-seq datasets generated 13 from different cell lines transfected with siRNAs or shRNAs targeting HNRNPH1, HNRNPK, HNRNPL, 14 or PTBP1 (Supplemental Table S1 and Supplemental Figure S1). HNRNPH1, HNRNPK, HNRNPL, and 15 PTBP1 bind to G-rich, C-rich, CA-rich, and CU-rich motifs, respectively (Klimek-Tomczak et al. 2004; 16 Katz et al. 2010; Llorian et al. 2010; Ray et al. 2013; Rossbach et al. 2014; Hauer et al. 2015; Giudice 17 et al. 2016). As shown in Figure 5A, HNRNPH1-repressed exons were enriched in Gs (Randomization 18 test FDR 0.005), while HNRNPK-repressed exons were enriched in Cs when compared to control 19 exons (Randomization test FDR < 1x10-4; see also Supplemental Figure S2F). This nucleotide 20 composition bias was observed at the first and second codon positions (Figure 5B, Randomization 21 test FDR < 0.003; see also Supplemental Figure S2G). Accordingly, glycine (GGN codons) and proline 22 (CCN codons) were more frequently encoded by HNRNPH1-repressed and by HNRNPK-repressed 23 exons, respectively (Figure 5C, Randomization test FDR < 0.05; see also Supplemental Figure S2H). As 24 in the case of C-rich SRSF3-activated exons (Figure 4I), C-rich HNRNPK-repressed exons encoded 25 more frequently uncharged and hydropathically neutral amino acids than control exons (Figure 5D, 26 Randomization test FDR < 1x10-4; see also Supplemental Figure S2I). 27 As mentioned above, HNRNPL represses the inclusion of exons containing CA-rich motifs, 28 while PTBP1 represses the inclusion of exons containing CU-rich motifs. As shown in Figure 5E (left 29 panel), HNRNPL repressed the inclusion of exon enriched in CA and AC dinucleotide when compared 30 to PTBP1-repressed exon (Mann-Whitney U test p-value 0.029), while PTBP1 repressed exons 31 enriched in CU an UC dinucleotides unlike hnRNPL (Mann-Whitney U test p-value 0.029). Glutamine 32 (Gln, CAR codons) and threonine (Thr, ACN codons) were more frequently encoded by HNRNPL- 33 repressed exons than by PTBP1-repressed exons, while serine (Ser, UCN codons) was more 34 frequently encoded by PTBP1-repressed exons than by HNRNPL-repressed exons (Figure 5E, right 35 panel; Mann-Whitney U test p-value 0.03). Both HNRNPL- and PTBP1-repressed exons encoded
7
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 respectively more frequently hydroxyl-containing amino acids and less frequently negatively charged 2 amino acids, compared to control exons (Figure5F; Mann-Whitney U test p-value 0.03). This is 3 consistent with the fact that hydroxyl-containing and charged amino acids are C-rich and C-poor, 4 respectively (Figure 4H). 5 In conclusion, each tested set of SRSF- or HNRNP-coregulated exons has a specific nucleotide 6 composition bias and codes for amino acids with similar physicochemical properties. 7 8 Bidirectional interplay between the splicing regulatory process and its functional outcome 9 The physicochemical properties of amino acids are often more or less related. For example, 10 hydrophilic amino acids are often charged amino acids. In addition, we observed that exons 11 regulated by a given splicing factor often coded for amino acids that have different physicochemical 12 properties depending on whether the exons are activated or repressed by this factor (Figure 4). 13 Consequently, each splicing factor induced a shift toward different combinations of protein 14 physicochemical properties encoded by their regulated exons (Figures 6A and 6B; Mann-Whitney U 15 testFDR < 0.05). 16 Since protein features depend on amino acid physicochemical properties, we measured an 17 enrichment Z-score of annotated protein features to determine whether each factor-specific set of 18 exons encodes specific protein features using our recently developed Exon Ontology bioinformatics 19 suite (Tranchevent et al. 2017). As shown in Figure 6C, all sets of SRSF-activated exons preferentially 20 encoded intrinsically unstructured protein regions (IUPR; FDR < 0.05), in agreement with previous 21 reports indicating that alternatively spliced exons often code for intrinsically disordered regions 22 (Tranchevent et al. 2017). However, each set of SRSF-coregulated exons encoded specific sets of 23 annotated protein features. For example, C-rich SRSF3-activated exons encoded peptides that are 24 enriched for experimentally validated phospho-serine and -threonine (“PTM”, Figure 6C; FDR < 0.05 25 and Figure 6D; FDR < 0.05). These phosphorylation sites arise in serine- and proline-rich regions 26 (Figure 6E). This observation is consistent with the fact that hydroxyl-containing amino acids and 27 proline correspond to C-rich codons (Figure 4H). Serine- and proline-rich regions have been shown to 28 play a role in RNA-protein interactions that can be regulated by phosphorylation (Wang et al. 2006; 29 Thapar 2015). In this setting, SRSF3-activated exons encoded annotated “Nucleic Acid Binding” 30 activity (Figure6C; FDR < 0.05). 31 Along the same line, A-rich TRA2-activated exons often encoded for nuclear localization 32 signal (“NLS”, Figure 6F). This is consistent with the fact that they encode hydrophilic amino acids, in 33 particular lysine that correspond to A-rich codons (Figures 3B-3C and 4D-4G), a major amino acid of 34 classical NLS (Marfori et al. 2011). In contrast, A-poor TRA2-repressed exons code for intramembrane 35 protein parts (“Mne”, Figure 6F; FDR < 0.05) which are intrinsically rich in hydrophobic amino acids.
8
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 This is consistent with the fact that A-poor TRA2-repressed exons encode more frequently 2 hydrophobic amino acids corresponding to A-poor codons (Figure 4D-4G). Collectively, these 3 observations support a model where a splicing factor-related nucleotide composition bias of exons 4 (Figures 1-3) impacts on the physicochemical properties of their encoded amino acids because of the 5 non-randomness of the genetic code (Figure 4) with direct consequences on protein features 6 encoded by splicing factor-regulated exons (Figures 6A-6F). 7 On one hand, splicing factors bind to sequences that have a biased nucleotide composition 8 and on the other hand, amino acids with similar physicochemical properties are encoded by codons 9 having the same nucleotide composition bias. Therefore, we hypothesized that increasing the exonic 10 density of specific nucleotides as measured in splicing factor-regulated exons would increase the 11 density of encoded amino acids sharing the same physicochemical properties, as observed in those 12 exons. To challenge this possibility, we generated random exonic coding sequences enriched in 13 specific nucleotide(s) by following the human codon usage bias (labelled CUB sequences) or by 14 randomly mutating human coding exons (labelled MUT sequences). For example, we generated 100 15 sets of 300 coding exons containing either 53% or 47% of S nucleotides, as measured in SRSF2- 16 activated and SRSF2-repressed exons, respectively. Increasing by ~13% the density of S nucleotides in 17 coding exons (S-CUB or S-MUT) increased (by ~15%) the frequency of encoded very small amino 18 acids, while it decreased (by ~10%) the frequency of encoded large amino acids (Figure 6G, “S=53% 19 vs 47%”; t-test p-value < 1x10-14), as observed when comparing SRSF2-activated and SRSF2-repressed 20 exons (Figure 4B). Increasing the density of A nucleotides in coding exons (A-CUB and A-MUT) from 21 23% to 34%, as measured in TRA2-repressed and TRA2-activated exons, respectively, increased (by 22 ~40%) the frequency of encoded hydrophilic amino acids and it decreased (by ~20%) the frequency 23 of encoded hydrophobic amino acids (Figure 6G, “A= 34% vs 23%”; t-test p-value < 1x10-14), as 24 observed when comparing TRA2-activated and TRA2-repressed exons (Figure 4E). Finally, increasing 25 the density of C nucleotides in coding exons (C-CUB and C-MUT) from 21% to 29%, as measured in 26 SRSF3-repressed and SRSF3-activated exons, respectively, increased the frequency of encoded 27 uncharged amino acids and neutral amino acids while it decreased the frequency of encoded charged 28 amino acids (Figure 6G, “C= 29% vs 21%”; t-test p-value < 1x10-14), as observed when comparing 29 SRSF3-activated and SRSF3-repressed exons (Figure 4I). 30 We next generated exons coding for different proportions of amino acids sharing the same 31 physicochemical features by mutating randomly-selected human coding exons. For example, we 32 generated 100 sets of 300 mutated exons encoding different proportions of very small and large 33 amino acids, using their respective frequency measured in SRSF2-activated or SRSF2-repressed exons 34 (See Materials and Methods). Mutated exons coding more frequently very small rather than large 35 amino acids had a higher frequency of S nucleotides and GC dinucleotides and contained more
9
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 frequently S-rich low complexity sequences (Figure 6H, “SRSF2-like encoded properties”; t-test p- 2 value < 1x10-14), as observed when comparing SRSF2-activated to SRSF2-repressed exons (Figure 1). 3 Mutated exons encoding more frequently hydrophilic amino acids had a higher frequency of A 4 nucleotides and AA dinucleotides and contained more frequently A-rich low complexity sequences 5 (Figure 6H, ““TRA2-like encoded properties”; t-test p-value < 1x10-14), as observed when comparing 6 TRA2-activated to TRA2-repressed exons (Figure 1). Finally, mutated exons encoding more frequently 7 hydropathically neutral amino acids had a higher frequency of C nucleotides and CC dinucleotides 8 and contained more frequently C-rich low complexity sequences (Figure 6H, “SRSF3-like encoded 9 properties”; t-test p-value < 1x10-14), as observed when comparing SRSF3-activated to SRSF3- 10 repressed exons (Figure 1). 11 12
13 Discussion 14 In this work, we uncover a direct link between the splicing regulatory process and its 15 biological outcome relying on two straightforward principles: 1) splicing factors bind to exonic 16 sequences that have a nucleotide composition bias with consequences on the nucleotide 17 composition of codons from coregulated exons; and 2) codons having the same nucleotide 18 composition bias encode amino acids with similar physicochemical properties with consequences on 19 protein features encoded by the coregulated exons. 20 Exons coregulated by a given splicing factor are enriched for specific low-complexity 21 sequences (often composed of a repeated (di)nucleotide) that correspond to the RNA-binding sites of 22 the cognate factor (Figures 1A-1C, 5A, and 5E). We showed that each set of exons whose inclusion 23 (or exclusion) is enhanced by a given splicing factor is enriched for specific nucleotide(s) when 24 compared to control exons or to exons repressed (or activated, respectively) by the same factor 25 (Figures 1A-1C). To the best of our knowledge, this splicing-related exonic nucleotide composition 26 bias has not been reported yet. However, it is in agreement with recent observations indicating that 27 the interaction of a splicing factor with a binding motif depends on the sequence context and on the 28 presence of clusters of related binding motifs (Zhang et al. 2013; Cereda et al. 2014; Fu and Ares 29 2014; Dominguez et al. 2018; Jobbins et al. 2018). For example, increasing the exonic frequency of 30 GGA-like motifs increases the probability of an exon to be regulated by the SRSF1 splicing factor that 31 binds to GGAGGA-like motifs even though only one binding site is used (Jobbins et al. 2018). In this 32 setting, we observed similar nucleotide and amino acid composition biases when analyzing exons 33 regulated by a splicing factor or exons containing CLIP-related signals for the same splicing factor 34 (Supplemental Figure S6).
10
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Coding sequences overlap several kinds of regulatory sequences, including exonic splicing 2 regulatory sequences. To date, it has been assumed that the redundancy of the genetic code permits 3 protein-coding regions to carry this extra information (Goren et al. 2006; Itzkovitz and Alon 2007; 4 Itzkovitz et al. 2010; Lin et al. 2011; Shabalina et al. 2013; Savisaar and Hurst 2017b; Savisaar and 5 Hurst 2017a). This means that the sequence constraints imposed by splicing factor binding motifs 6 would accommodate with coding sequences by impacting only the third codon position. In this 7 setting, we observed that nucleotide composition bias of splicing factor-regulated exons impacts not 8 only the third codon position but also the first and second positions (Figures 2B and 5B). Since amino 9 acids having the same physicochemical properties correspond to codons with similar nucleotide 10 composition bias, a direct consequence of the exonic nucleotide composition bias associated with 11 the splicing regulatory process is that each set of splicing factor-regulated exons preferentially 12 encodes amino acids having similar physicochemical properties (Figures 4 and 5). In addition, since 13 specific local protein features depend on amino acid physicochemical properties, splicing factor- 14 coregulated exons encode specific sets of protein features (Figures 6A-6F). 15 Therefore, we propose that the interplay between coding and exonic splicing regulatory 16 sequences that we report is based on straightforward principles related to both the non-randomness 17 of the genetic code and the preferential binding of splicing factors to low-complexity sequences. 18 Owing to these properties, the high exonic density of a specific nucleotide related to splicing factor- 19 binding features increases the probability that an exon encodes amino acids with similar 20 physicochemical properties (Figure 6G). Conversely, the high density of amino acids corresponding to 21 specific physicochemical properties increases the probability of generating exonic nucleotide 22 composition bias and nucleotide low complexity sequences (Figure 6H). 23 A deeper understanding of the interplay between splicing regulatory sequences and coding 24 information will require to improve: i) the characterization of splicing factor binding sites, ii) the 25 analysis of exon encoded-protein features, and iii) the identification of exons dependent on several 26 splicing factors. Indeed, splicing factor binding sites are unlikely to be defined by their sole nucleotide 27 composition bias. For example, it has been recently shown that RNA binding sites are within specific 28 context and that some RNA binding sites correspond to spaced "bipartite" short linear motifs (Zhang 29 et al. 2013; Cereda et al. 2014; Fu and Ares 2014; Dominguez et al. 2018; Jobbins et al. 2018). By 30 dissecting each RNA binding site, their flanking nucleotide preferences, their clustering, and the 31 space between them, it would be possible to better characterize the way they can affect codon and 32 amino acid usage. In this setting, while we focused our investigation on general properties of amino 33 acids, it will be interesting to look for complex pattern of protein-related features. For example, the 34 alternation of amino acids having specific features (e.g., alternation of hydrophilic and hydrophobic 35 amino acids) may allow to uncover specific protein-related properties (e.g., alpha-helix made of
11
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 periodic alternation of hydrophilic and hydrophobic amino acids). Finally, it will be important to 2 identify exons that are simultaneously regulated by two different splicing factors and to characterize 3 how the combination of different regulatory binding motifs impacts on coding sequences. An 4 interesting possibility is that the combinatory regulation of an exon by two splicing factors is 5 associated with specific exonic encoded protein features. 6 Another challenge will be to link our observations with the known tissue-specific regulation 7 of alternative splicing. Based on this work and previously published observations (Irimia et al. 2014; 8 Tranchevent et al. 2017), it can be anticipated that tissue-specific co-regulated exons encode similar 9 protein-related features. In this setting, the function of splicing factors would not only be to regulate 10 the production of individual specialized protein isoforms, but it would also be to more globally 11 control the intracellular content of specific protein regions having specific physicochemical 12 properties. Each splicing factor (or combination of factors) would control a specific combination of 13 exon-encoded protein physicochemical properties accordingly to its (or their) affinity for specific 14 nucleotides. Our work unravels how a complex phenomenon (e.g., splicing regulatory process and its 15 biological consequences) can rely on straightforward principles.
12
Downloaded from genome.cshlp.org on October 4, 2021 - Published by Cold Spring Harbor Laboratory Press
1 Materials and Methods 2 RNA-seq dataset analyses 3 Publicly available RNA-seq datasets generated from different human cell lines transfected with 4 siRNAs or shRNAs targeting specific splicing factors or transfected with splicing factor expression 5 vectors were recovered from GEO (Supplemental Table S1). These RNA-seq datasets were analyzed 6 using FARLINE, a computational program dedicated to analyze and quantify alternative splicing 7 variations, as previously reported (Benoit-Pilven et al. 2018). This pipeline is freely available 8 (http://kissplice.prabi.fr/pipeline_ks_farline). To determine a set of exons that is regulated by a given 9 splicing factor, we measured the Percent-Spliced-In (PSI) that corresponds to exon inclusion rate. 10 Each exon with a PSI variation (deltaPSI) greater than 10% or lower than -10% and a p-value below 11 0.05, when comparing each sample to it’s corresponding control, is considered to be regulated 12 (Benoit-Pilven et al. 2018). FARLINE analyses exons annotated from FASTERDB (http://fasterdb.ens- 13 lyon.fr/faster/home.pl) that is based on hg19 annotation. A lift over from hg19 to GRCh38 recovers 14 the same sequence for 99.94 % of the analyzed exons. 15 Frequency of hexanucleotides, dinucleotides, nucleotides, codons, amino acids, and amino acid 16 physicochemical features in exon sets.
17 The formula (1) was used to compute the frequencies of words (Dn) of size n within a set of exons SN =
18 {y1, …, yN} such that yi is an exon i having a number Li of nucleotides: