How Prevalent Is Functional Alternative Splicing in the Human Genome?Q
Total Page:16
File Type:pdf, Size:1020Kb
68 Update TRENDS in Genetics Vol.20 No.2 February 2004 How prevalent is functional alternative splicing in the human genome?q Rotem Sorek1,2, Ron Shamir3 and Gil Ast1 1Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel 2Compugen, 72 Pinchas Rosen Street, Tel Aviv 69512, Israel 3School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel Comparative analyses of ESTs and cDNAs with genomic introns as normal splice sites, thus, inserting part of DNA predict a high frequency of alternative splicing in an intron into the mature mRNA. Obviously such human genes. However, there is an ongoing debate as mistakes would not represent functional, regulated alterna- to how many of these predicted splice variants are func- tive splicing. tional and how many are the result of aberrant splicing What portion of the observed splice variants represents (or ‘noise’). To address this question, we compared functional alternative splicing? We employed a compara- alternatively spliced cassette exons that are conserved tive genomics approach to address this question, by com- between human and mouse with EST-predicted cas- piling a dataset of exon-skipping events (cassette exons) sette exons that are not conserved in the mouse gen- that are conserved between human and mouse. The con- ome. Presumably, conserved exon-skipping events servation of such events in both human and mouse species, represent functional alternative splicing. We show that which diverged from their common ancestor 75–110 conserved (functional) cassette exons possess unique million years ago, suggests the functional importance of characteristics in size, repeat content and in their influ- these exons. ence on the protein. By contrast, most non-conserved cassette exons do not share these characteristics. We Detecting cassette exons conserved between human conclude that a significant portion of cassette exons and mouse evident in EST databases is not functional, and might We have recently collected a dataset of 980 EST-predicted result from aberrant rather than regulated splicing. human alternatively spliced cassette exons [7]. From these 980 exons, 243 (25%) were also found to be alternatively Numerous studies have shown that alternative splicing is spliced in mouse [‘conserved alternatively spliced exons’ prevalent in mammalian genomes. Using ESTs and (CAS exons)]. The remaining 737 (75%) are ‘non-conserved cDNAs aligned to the genomic sequence, these studies alternatively spliced exons’ (non-CAS exons) (Box 1). Low estimate that between 35% and 59% of all human genes levels of alternative splicing conservation between human undergo alternative splicing [1,2]. However, it is not clear and mouse were also observed in other studies [8,9]. The how many of the splice variants predicted from ESTs are method that was used to locate cassette exons and functional and how many represent aberrant splicing human–mouse conservation is described in detail in (‘noise’) or EST artefacts (such as genomic contamination) Ref. [7]. Calculated features of the 980 exons used for [3–5]. An mRNA variant can be defined as being ‘func- tional’ if it is required during the life-cycle of the organism this study appear as supplementary material online. and activated in a regulated manner. Aberrant alternative splicing Somatic mutations within splice sites or introns could result in aberrant splicing, leading to non-functional Box 1. Finding exon-skipping events that are conserved mRNAs; ESTs derived from these mRNAs would be between humans and mouse indistinguishable from normal splice variants. Because The initial set of exons comprised 980 apparently alternatively somatic mutations are prevalent in cancer related tissues, spliced human exons, for which a mouse EST spanning the intron and .50% of the ESTs in dbEST come from cancer, cell- that represents the exon-skipping variant was found. Two strategies lines or tumor tissues [6], such spurious variants can be were used to identify an exon as being conserved in mouse: (i) identification of mouse ESTs that contain the exon and the two ubiquitous in dbEST. flanking exons; and (ii) if the exon was not represented in mouse Splicosomal mistakes have also been proposed as a ESTs, the sequence of the human exon was searched against the mechanism that can result in non-functional transcripts intron spanned by the skipping mouse EST on the mouse genome. If [3]. In common with any complex biological machine, the a significant conservation (.80%) was found, the alignment spanned the full length of the human exon, and the exon was flanked by the splicosome can ‘slip’ and identify cryptic splice sites in canonical AG acceptor and GT donor sites in the mouse genome, then the exon was declared as conserved. For 243 exons (25% of 980), q Supplementary data associated with this article can be found at doi: conserved alternative splicing was detected in mouse. Detailed 10.1016/j.tig.2003.12.004 description of the methods can be found in Ref. [7]. Corresponding author: Gil Ast ([email protected]). www.sciencedirect.com Update TRENDS in Genetics Vol.20 No.2 February 2004 69 Comparing conserved with non-conserved cassette influence of the non-CAS exons (Figure 1). In 147 of 191 exons (77%) CAS exons that were located within the protein- Presumably, orthologous exons that are alternatively coding region, the insertion of the alternatively spliced spliced both in human and mouse have functional exon did not alter the reading frame and only eight of importance. We can therefore regard the group of CAS these 147 exons (5%) contained an in-frame stop codon. exons as a representative group of functional, alterna- This means that 73% of the conserved alternative exons tively spliced exons. are ‘peptide cassettes’ (i.e. they insert a short amino-acid Non-CAS exons, on the other hand, can also be func- sequence into the translated protein without changing the tional, representing exons created after the divergence of coding frame) (Figure 1). These results indicate a strong the human and the mouse lineages. However, if these tendency of functional cassette exons to add or remove exons, as a group, were indeed functional, we might expect amino acids within the protein sequence, rather than them to have the same general characteristics as CAS inflict a dramatic change on it. exons. Therefore, we compared several features between By comparison, only 206 of the 510 (40%) non-CAS the two groups of exons predicted by ESTs. exons preserved the reading frame; when an alternatively We first sought to understand the influence of the spliced exon was inserted in frame, almost half (47%) of alternatively spliced exons on the proteins in which they these contained a stop codon. Thus, only 109 of the are inserted. From the 243 CAS exons identified, 195 510 (21%) non-CAS exons were indeed ‘peptide cassette’ had an unambiguous coding-region annotation in the exons (Figure 1). GenBank cDNAs. Of these, 191 (98%) were located within We examined in detail the cases in which the the protein-coding region and four were located within the alternatively spliced exons caused a frame shift in the untranslated region (UTR). This finding is not surprising CDS. Of the 44 CAS exons that caused a frame shift, 27 because UTRs are mostly found in the terminal exons [10], (61%) actually made the protein longer by suppressing a whereas the exons in our study were internal exons. A nearby stop codon and only four exon insertions (9%) similar percentage of the non-CAS exons were located in resulted in a protein shorter than 100 amino acids. This the coding sequence and UTR. indicates that, in a substantial fraction of the conserved The influence of the CAS exons on the protein coding alternatively spliced exons that cause a frame shift, the sequence (CDS) was significantly different from the exon insertion changes only the C-terminus of the protein. Human alternatively spliced exons 980 Conserved Non-conserved 243 (25%) 737 (75%) 195Unambiguous CDS annotation in Genbank 554 191 4 510 44 CDS UTR CDS UTR (98%) (2%) (92%) (8%) 44 147 304 206 FS No frame shift FS No frame shift (23%) (77%) (60%) (40%) 8 139 97 109 Stop in exon No stop Stop in exon No stop (5%) (95%) (47%) (53%) TRENDS in Genetics Figure 1. The influence of alternatively spliced exons on the protein-coding sequence. Our dataset of human alternatively spliced exons contained 980 exons. These were divided into two subsets: (i) exons whose existence was also confirmed by mouse ESTs or mouse genomic sequence (243 exons, termed ‘conserved alternatively spliced exons’); and (ii) exons for which no such evidence was detected (737 exons, termed ‘non-conserved alternatively spliced exons’). The influence of these exons on the pro- tein-coding sequence is indicated by the tree-like diagram. Those exons with ‘unambiguous CDS annotation’ [i.e. all annotated GenBank cDNAs in which the exon appeared had the same annotation (either UTR or CDS)] were analyzed further. Peptide cassettes were found in 73% (139 of 191) and 21% (109 of 510) of the conserved alternatively spliced and non-conserved alternatively spliced exons, respectively (shown in red). Definitions: frame shift, the length of the exon is not a multiple of three; stop in exon, the exon contains an in-frame stop codon; peptide cassette, contains exons that neither cause a frame shift nor contain a stop codon. Abbreviations: CDS, coding sequence; FS, frame shift; UTR, untranslated region. www.sciencedirect.com 70 Update TRENDS in Genetics Vol.20 No.2 February 2004 By contrast, of the 304 non-CAS exons that caused a The difference between these two distributions is statisti- frame shift, only 25 (8%) resulted in a longer protein, cally significant ðP , 1026Þ: (For comparison, the average whereas 91 (30%) resulted in a protein shorter than 100 length of constitutively spliced exons is 129 bases [16].) amino acids.