Comparative Analysis of Sequence Features Involved in the Recognition
BMC Genomics (BioMed Central)
Research article, Open Access

Comparative analysis of sequence features involved in the recognition of tandem splice sites

Ralf Bortfeldt*1, Stefanie Schindler2, Karol Szafranski2, Stefan Schuster1 and Dirk Holste*3,4

Address: 1Department of Bioinformatics, Friedrich-Schiller University, Ernst-Abbe-Platz 2, D-07743 Jena, Germany; 2Fritz-Lipmann Institute for Aging Research, Beutenbergstraße 11, D-07745 Jena, Germany; 3Research Institute of Molecular Pathology, Dr. Bohr-Gasse 7, A-1030 Vienna, Austria; 4Institute of Molecular Biotechnology of the Austrian Academy of Sciences, Dr. Bohr-Gasse 3-5, A-1030 Vienna, Austria

* Corresponding authors

Published: 30 April 2008. Received: 16 January 2008. Accepted: 30 April 2008.

BMC Genomics 2008, 9:202 doi:10.1186/1471-2164-9-202

This article is available from: http://www.biomedcentral.com/1471-2164/9/202

© 2008 Bortfeldt et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: The splicing of pre-mRNAs is conspicuously often variable and produces multiple alternatively spliced (AS) isoforms that encode different messages from one gene locus. Computational studies uncovered a class of highly similar isoforms, which were related to tandem 5'-splice sites (5'ss) and 3'-splice sites (3'ss), yet with very sparse anecdotal evidence in experimental studies.
To compare the types and levels of alternative tandem splice site exons occurring in different human organ systems and cell types, and to study known sequence features involved in the recognition and distinction of neighboring splice sites, we performed large-scale, stringent alignments of cDNA sequences and ESTs to the human and mouse genomes, followed by experimental validation.

Results: We analyzed alternative 5'ss exons (A5Es) and alternative 3'ss exons (A3Es), derived from transcript sequences that were aligned to assembled genome sequences to infer patterns of AS occurring in several thousands of genes. Comparing the levels of overlapping (tandem) and non-overlapping (competitive) A5Es and A3Es, a clear preference of isoforms was seen for tandem acceptors and donors, with exon extensions four nucleotides and three to six nucleotides long, respectively. A subset of inferred A5E tandem exons was selected and experimentally validated. With the focus on A5Es, we investigated their transcript coverage, sequence conservation and base-pairing to U1 snRNA, proximal and distal splice site classification, and candidate motifs for cis-regulatory activity, and compared A5Es with A3Es, constitutive exons and pseudo-exons in H. sapiens and M. musculus. The results reveal a small but authentic enriched set of tandem splice site preference, with specific distances between proximal and distal 5'ss (3'ss), which showed a marked dichotomy between the levels of in- and out-of-frame splicing for A5Es and A3Es, respectively, identified a number of candidate NMD targets, and allowed a rough estimation of the number of undetected tandem donors based on splice site information.
Conclusion: This comparative study distinguishes tandem 5'ss and 3'ss, with three to six nucleotides long extensions, as having unusually high proportions of AS, experimentally validates tandem donors in a panel of different human tissues, highlights the dichotomy in the types of AS occurring at tandem splice sites, and elucidates that human alternative exons spliced at overlapping 5'ss possess features of typical splice variants that could well be beneficial for the cell.

Background

As the central intermediate between transcription and translation of eukaryotic genes, the splicing of precursors to messenger RNAs (pre-mRNAs) in the nucleus is frequently variable and produces multiple alternatively spliced (AS) mRNA isoforms. The recognition of authentic pre-mRNA splice sites out of many possible pseudo-sites, the precise excision of introns, and the ligation of exons to produce a correct message are catalyzed by a large ribonucleoprotein (RNP) complex known as the spliceosome, which is composed of several small RNPs and perhaps over two-hundred proteins [1]. Splice sites mark the boundaries between exon and intron: a 5'-splice site (5'ss or donor) at the terminus of the exon/beginning of the intron and a 3'ss (acceptor) at the terminus of the intron/beginning of the exon. In addition, introns contain a branch point signal, typically 15 to 45 nucleotides upstream of the 3'ss. During later stages of spliceosome assembly, there are mediated interactions between the 5'ss and 3'ss, as well as splicing factors that recognize them, and a basic distinction is made between the pairing of splice sites across the exon ('exon-definition') or the intron ('intron-definition') [2]. In humans, with compact exons (average length of about 120 nucleotides) and comparatively much larger introns, exon-definition is thought to be the prevalent mode of RNA splicing. When a pair of closely spaced 3'ss-5'ss signals is recognized, the exon is roughly defined by interactions between U2 snRNP:3'ss, U1 snRNP:5'ss as well as additional splicing factors, including U2AF65:branch site and U2AF35:poly-(Y) site interactions.

AS events are categorized according to their splice site choice and one can distinguish four canonical types: exon-skipping (SE), in which mRNA isoforms differ by the inclusion/exclusion of an exon; alternative 5'ss exon (A5E) or alternative 3'ss exon (A3E), in which isoforms differ in the usage of a 5'ss or 3'ss, respectively; and retention-type intron (RI), in which isoforms differ by the presence/absence of an unspliced intron [3]. These types are not necessarily mutually exclusive and more complex types of AS events can be constructed from such canonical types. Alternative splicing produces similar, yet different messages from one gene locus, thus enabling the diversification of protein sequences and function [4]. In addition, AS holds the possibility to control gene expression at the post-transcriptional level via the non-sense mediated mRNA decay (NMD) pathway. To prevent aberrantly or deliberately incorrectly spliced transcripts that prematurely terminate translation, NMD ensures that only correctly spliced mRNAs that contain the full (or nearly so) message are subsequently utilized for protein synthesis. Therefore, NMD scans newly synthesized mRNA for the presence of one or more premature-termination codons (PTCs), and, if detected, can selectively degrade defective mRNAs [5].

Fostered by the abundant accumulation of complementary DNA (cDNA) sequences and expressed sequence tags (ESTs), genome-wide computational studies of AS have investigated its scope in metazoans and estimated that a fraction of up to two-thirds of human genes are predicted to encode or regulate protein synthesis via such pathways [6-9]. The outcome of these approaches has shown SEs as the most frequent AS event in mRNA isoforms in human and other mammalian organ systems and cell types, followed by A3Es and A5Es, in turn followed by RIs [10]. Interestingly, the sequence information of SEs and their flanking regions, and the phylogenetic conservation of such information, is sufficient to discriminate constitutive exons from SEs and can be used in computational models to start predicting AS events that have not yet been uncovered by cDNA and EST analyses [11,12].

Compared with the skipping of about one hundred exon nucleotides or the retention of several hundred intron nucleotides, A3Es and A5Es are thought to create more subtle changes, by affecting the choice of the 3'ss or 5'ss, respectively. Here, splice site usage gives rise to two types of exon segments: the 'core' common to both splice forms and the 'extension' that is present in only the longer isoform. Both types of AS events have been shown to play decisive roles during development (e.g., sex determination and differentiation in Drosophila melanogaster [13] or developmental stage-related changes in the human CFTR gene [14]), but also in human disease (e.g., 5'ss mutations in the tau gene [15]). A3Es and A5Es are thought to be regulated by splicing-regulatory elements in exons and nearby exon-flanking regions, as well as trans-acting antagonistic splicing factors, which bind them and affect the choice of splice sites in a concentration-dependent manner [16,17]. Interestingly, computational studies showed that for both A3Es and A5Es the distribution of extensions, f(E), is markedly skewed toward short-range splice forms [18]. In particular, alternative splice sites that are separated by the three-nucleotide long motif NAG/NAG/ (where '/' marks an inferred splice site) make up a predominant proportion of A3E events in mammals, extending to invertebrates and plants [19,20]. Yet additional support from experimental studies is still very sparse, and the similarities and dissimilarities of overlapping against non-overlapping ("competitive") as well as constitutive splice sites remain to be delineated.

Here, we describe an effort to compare and contrast A5E, A3E, and constitutive splice sites of human exons derived from transcript sequences, of different human organ systems and cell types, which were aligned to the assembled human genome sequence. To study known sequence features involved in the recognition and distinction of splice sites, we performed large-scale but stringent alignments of
rithms were required to agree in three out of all three independent alignments (see below, as well as Methods).