The Implications of Alternative Splicing in the ENCODE Protein Complement
Total Page:16
File Type:pdf, Size:1020Kb
The implications of alternative splicing in the ENCODE protein complement Michael L. Tressa,b, Pier Luigi Martellic, Adam Frankishd, Gabrielle A. Reevese, Jan Jaap Wesselinka, Corin Yeatsf, Pa´ ll Iso´ ´ lfur Olason´ g, Mario Albrechth, Hedi Hegyii, Alejandro Giorgettij, Domenico Raimondoj, Julien Lagardek, Roman A. Laskowskie, Gonzalo Lo´peza, Michael I. Sadowskil, James D. Watsone, Piero Farisellic, Ivan Rossic, Alinda Nagyi, Wang Kaig, Zenia Størlingg, Massimiliano Orsinim, Yassen Assenovh, Hagen Blankenburgh, Carola Huthmacherh, Fidel Ramírezh, Andreas Schlickerh, France Denoeudk, Phil Jonese, Samuel Kerriene, Sandra Orcharde, Stylianos E. Antonarakisn, Alexandre Reymondo, Ewan Birneye, Søren Brunakg, Rita Casadioc, Roderic Guigok,p, Jennifer Harrowd, Henning Hermjakobe, David T. Jonesl, Thomas Lengauerh, Christine A. Orengof, La´ szlo´ Patthyi, Janet M. Thorntone,q, Anna Tramontanor, and Alfonso Valenciaa,r aStructural Computational Biology Programme, Spanish National Cancer Research Centre, E-28029 Madrid, Spain; cDepartment of Biology, University of Bologna, 33-40126 Bologna, Italy; dHAVANA Group, The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom; eEuropean Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, United Kingdom; fDepartment of Biochemistry and Molecular Biology and lBioinformatics Unit, University College London, London WC1E 6BT, United Kingdom; gCenter for Biological Sequence Analysis, BioCentrum-DTU, DK-2800 Lyngby, Denmark; hMax Planck Institute for Informatics, 66123 Saarbru¨cken, Germany; iBiological Research Center, Hungarian Academy of Sciences, 1113 Budapest, Hungary; jDepartment of Biochemical Sciences, University of Rome ‘‘La Sapienza,’’ 2-00185 Rome, Italy; kResearch Unit on Biomedical Informatics, Institut Municipal d’Investigacio´Me` dica, E-8003 Barcelona, Spain; mCenter for Advanced Studies, Research and Development in Sardinia (CRS4), 09010 Pula, Italy; nDepartment of Genetic Medicine and Development, University of Geneva Medical School, 1211 Geneva, Switzerland; oCenter for Integrative Genomics, Genopode building, University of Lausanne, 1015 Lausanne, Switzerland; and pCentre de Regulacio´Geno`mica, Universitat Pompeu Fabra, E-08003 Barcelona, Spain Contributed by Janet M. Thornton, January 31, 2007 (sent for review October 20, 2006) Alternative premessenger RNA splicing enables genes to generate has provided us with the material to make an in-depth assess- more than one gene product. Splicing events that occur within ment of a systematically collected reference set of splice variants. protein coding regions have the potential to alter the biological function of the expressed protein and even to create new protein Results functions. Alternative splicing has been suggested as one expla- Alternative Splicing Frequency. The GENCODE set is made up of nation for the discrepancy between the number of human genes 2,608 annotated transcripts for 487 distinct loci. A total of 1,097 and functional complexity. Here, we carry out a detailed study of transcripts from 434 loci are predicted to be protein coding. the alternatively spliced gene products annotated in the ENCODE There are on average 2.53 protein coding variants per locus; 182 pilot project. We find that alternative splicing in human genes is loci have only one variant, whereas one locus, RP1–309K20.2 more frequent than has commonly been suggested, and we dem- (CPNE1) has 17 coding variants. onstrate that many of the potential alternative gene products will A total of 57.8% of the loci are annotated with alternatively have markedly different structure and function from their consti- spliced transcripts, although there are differences between target tutively spliced counterparts. For the vast majority of these alter- regions chosen manually and those chosen according to the native isoforms, little evidence exists to suggest they have a role stratified random-sampling strategy (11). The differences stem as functional proteins, and it seems unlikely that the spectrum of from gene clusters in the manually selected regions, such as the conventional enzymatic or structural functions can be substantially cluster of 31 loci that code for olfactory receptors in manual pick extended through alternative splicing. 9 from chromosome 11 (13). These olfactory receptors are recent in evolutionary origin, have a single large coding exon, and code function ͉ human ͉ isoforms ͉ splice ͉ structure for a single isoform. This means that although the 0.5% of the human genome that was selected for biological interest has 276 loci, just 52.1% of the loci have multiple variants. In contrast, the lternative mRNA splicing, the generation of a diverse range of regions that were selected in the stratified random-sampling mature RNAs, has considerable potential to expand the A process have fewer loci (158), but 68.7% of the loci have multiple cellular protein repertoire (1–3), and recent studies have estimated variants (see Fig. 1a). This number is toward the higher end of that 40–80% of multiexon human genes can produce differently previous estimates but in line with the most recent reports (14). spliced mRNAs (4, 5). The importance of alternative splicing in Analysis of the data suggests that the GENCODE-validated EVOLUTION processes such as development (6) has long been recognized, and transcripts are an underestimate of the real numbers of alter- proteins coded by alternatively spliced transcripts have been impli- cated in a number of cellular pathways (7–9). The extent of alternative splicing in eukaryotic genomes has lead to suggestions Author contributions: M.L.T., P.L.M., G.A.R., C.Y., M.A., E.B., S.B., R.C., R.G., J.H., H. Herm- that alternative splicing is key to understanding how human com- jakob, D.T.J., T.L., C.A.O., L.P., J.M.T., A.T., and A.V. designed research; M.L.T., P.L.M., A.F., plexity can be encoded by so few genes (10). G.A.R., J.J.W., C.Y., P.I´.O´ ., H. Hegyi, A.G., D.R., J.L., R.L., G.L., M.I.S., J.D.W., P.F., I.R., A.N., W.K., Z.S., M.O., Y.A., H.B., C.H., F.R., A.S., F.D., P.J., S.K., and S.O. performed research; The pilot project of the Encyclopedia of DNA Elements M.L.T., P.L.M., A.F., G.R., J.J.W., C.Y., P.I´.O´ ., H. Hegyi, M.A., A.G., D.R., J.L., R.L., M.I.S., S.E.A., (ENCODE) (11), which aims to identify all the functional J.D.W., and A.R. analyzed data; and M.L.T., P.L.M., C.Y., P.I´.O´ ., and A.V. wrote the paper. elements in the human genome, has undertaken a comprehen- The authors declare no conflict of interest. sive analysis of 44 selected regions that make up 1% of the Abbreviation: TMH, transmembrane helices. human genome. One valuable element of the project has been bTo whom correspondence may be addressed at: Spanish National Cancer Research Centres, the detailing of a reference set of manually annotated splice c. Melchor Ferna´ndez Almagro, 3, E-28029 Madrid, Spain. E-mail: [email protected]. variants by the GENCODE consortium (12). The annotation by qTo whom correspondence may be addressed. E-mail: [email protected]. the GENCODE consortium is an extension of the manually rTo whom correspondence may be addressed. E-mail: [email protected]. curated annotation by the Havana team at The Sanger Institute. This article contains supporting information online at www.pnas.org/cgi/content/full/ Although a full understanding of the functional implications 0700800104/DC1. of alternative splicing is still a long way off, the GENCODE set © 2007 by The National Academy of Sciences of the USA www.pnas.org͞cgi͞doi͞10.1073͞pnas.0700800104 PNAS ͉ March 27, 2007 ͉ vol. 104 ͉ no. 13 ͉ 5495–5500 Downloaded by guest on September 30, 2021 sequence sharing 3Ј exons in different reading frames. In this set, 23 separate loci code for variants with overlapping reading frames, suggesting that the frequency of variants with overlap- ping reading frames might be somewhat higher than previously thought (20). Few of the variants coded from overlapping reading frames appeared to be functional. We were able to compare nine transcripts where the human and mouse homologue had con- served exonic structure. If the two alternative reading frames evolve under functional constraints, the mutation rate for all three codon positions should be the same, and both frames should have an identical nonsynonymous substitution rate (Ka, 21). Only one of the nine pairs of transcripts had identical Ka values and equal rates of mutation for each coding position. Signal Peptides and Transmembrane Helices. Of the 1,097 tran- scripts, 219 were predicted to have signal peptides, accounting for 107 of the 434 loci. Unequivocal loss or gain of signal peptides can be seen in 12 loci, and, in these cases, localization will not be conserved between isoforms. Signal peptide loss/gain is caused by substitution of exons at the N terminus in eight of the loci. This is coherent with earlier findings (18) that showed that most signal peptide gain/loss in alternative splicing comes about Fig. 1. Isoform distribution. (a) The number of isoforms per locus in the through N-terminal exon substitution. One example is the al- manually selected regions compared with the number of isoforms per locus in ternative isoform in locus RP1-248E1.1 (MOXD1), which loses the regions selected by random stratified procedure. (b) The effect of splicing 86 N-terminal residues, including the signal peptide, with respect events on the protein sequence of alternative splice isoforms. to the principal isoform. Signal peptide loss in isoform 003 of locus AC010518.2 native splicing at the mRNA level. Although the set does include (LILRA3) is triggered by a 17 residue N-terminal insertion ahead many known isoforms and uncovers numerous previously un- of the signal peptide predicted for the principal sequence. This recognized variants, several known variants were not annotated results in the apparent internalization of the signal peptide and in the initial release.