
bioRxiv preprint doi: https://doi.org/10.1101/017772; this version posted April 9, 2015. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Comprehensive identification and characterization of conserved small ORFs in animals

Sebastian D. Mackowiak (1), Henrik Zauber (1), Chris Bielow (2), Denise Thiel (1), Kamila Kutz (1), Lorenzo Calviello (1), Guido Mastrobuoni (1), Nikolaus Rajewsky (1), Stefan Kempa (1), Matthias Selbach (1), Benedikt Obermayer (1)*

(1) Max-Delbrück-Center for Molecular Medicine, Berlin Institute for Medical Systems Biology, 13125 Berlin, Germany
(2) Berlin Institute of Health, Kapelle-Ufer 2, 10117 Berlin, Germany

*[email protected]

Abstract

There is increasing evidence that non-annotated short open reading frames (sORFs) can encode functional micropeptides, but computational identification remains challenging. We expand our published method and predict conserved sORFs in human, mouse, zebrafish, fruit fly and the nematode C. elegans. Isolating specific conservation signatures indicative of purifying selection on encoded amino acid sequence, we identify about 2000 novel sORFs in the untranslated regions of canonical mRNAs or on transcripts annotated as non-coding. Predicted sORFs show stronger conservation signatures than those identified in previous studies and are sometimes conserved over large evolutionary distances. Encoded peptides have little homology to known proteins and are enriched in disordered regions and short interaction motifs.
Published ribosome profiling data indicate translation for more than 100 novel sORFs, and mass spectrometry data give peptidomic evidence for more than 70 novel candidates. We thus provide a catalog of conserved micropeptides for functional validation in vivo.

Introduction

Ongoing efforts to comprehensively annotate the genomes of humans and other species revealed that a much larger fraction of the genome is transcribed than initially appreciated [1]. Pervasive transcription produces a number of novel classes of non-coding RNAs, in particular long intergenic non-coding RNAs (lincRNAs) [2]. The defining feature of lincRNAs is the lack of canonical open reading frames (ORFs), classified mainly by length, nucleotide sequence statistics, conservation signatures and similarity to known protein domains [2]. Although coding-independent RNA-level functions have been established for a growing number of lincRNAs [3,4], there is little consensus about their general roles [5]. Moreover, the distinction between lincRNAs and mRNAs is not always clear-cut [6], since many lincRNAs have short ORFs, which easily occur by chance in any stretch of nucleotide sequence. However, recent observations suggest that lincRNAs and other non-coding regions are often associated with ribosomes and sometimes in fact translated [7-16]. Indeed, some of the encoded peptides have been detected via mass spectrometry [10,17-23]. Small peptides have been marked as essential cellular components in bacteria [24] and yeast [25].
More detailed functional studies have identified the well-known tarsal-less peptides in insects [26-29], characterized a short secreted peptide as an important developmental signal in vertebrates [30], and established a fundamental link between different animal micropeptides and cellular calcium uptake [31,32].

Importantly, some ambiguity between coding and non-coding regions has been observed even on canonical mRNAs [15]: upstream ORFs (uORFs) in 5' untranslated regions (5'UTRs) are frequent, well-known and mostly linked to the translational regulation of the main CDS [33,34]. To a lesser extent, mRNA 3'UTRs have also been found associated with ribosomes, which has been attributed to stop-codon read-through [35], in other cases to delayed drop-off, translational regulation or ribosome recycling [36], and even to the translation of 3'UTR ORFs (dORFs) [10]. Translational regulation could be the main role of these ORFs, and regulatory effects of translation (e.g., mRNA decay) could be a major function of lincRNA translation [12]. Alternatively, they could be ORFs in their own right, considering well-known examples of polycistronic transcripts in animals such as the tarsal-less mRNA [26-28]. Indeed, many non-annotated ORFs have been found to produce detectable peptides [10,17], and might therefore encode functional micropeptides [37].

Typically, lincRNAs are poorly conserved on the nucleotide level, and it is hard to computationally detect functional conservation despite sequence divergence even when it is suggested by synteny [2,38]. In contrast, many of the sORFs known to produce
functional micropeptides display striking sequence conservation, highlighted by a characteristic depletion of nonsynonymous compared to synonymous mutations. This suggests purifying selection at the level of the encoded peptide (rather than DNA or RNA) sequence. Also, the sequence conservation rarely extends far beyond the ORF itself, and an absence of insertions or deletions implies conservation of the reading frame. These features are well-known characteristics of canonical protein-coding genes and have in fact been used for many years in comparative genomics [39,40]. While many powerful computational methods to identify protein-coding regions are based on sequence statistics and suffer from high false-positive rates for very short ORFs [41,42], comparative genomics methods have gained statistical power in recent years given the vastly increased number of sequenced animal genomes.

Here, we present results of an integrated computational pipeline to identify conserved sORFs using comparative genomics. We greatly extended our previously published approach [10] and applied it to the entire transcriptome of five animal species: human (H. sapiens), mouse (M. musculus), zebrafish (D. rerio), fruit fly (D. melanogaster), and the nematode C. elegans. Applying rigorous filtering criteria, we find a total of about 2000 novel conserved sORFs in lincRNAs as well as other regions of the transcriptome annotated as non-coding. By means of comparative and population genomics, we detect purifying selection on the encoded peptide sequence, suggesting that the detected sORFs, of which some are conserved over wide evolutionary distances, give rise to functional micropeptides. We compare our results to published catalogs of peptides from non-annotated regions, to sets of sORFs found to be translated using ribosome profiling, and to a number of computational sORF predictions.
While there is often little overlap, we find in all cases consistently stronger conservation for our candidates, confirming the high stringency of our approach. Overall, predicted peptides have little homology to known proteins and are rich in disordered regions and peptide binding motifs which could mediate protein-protein interactions. Finally, we use published high-throughput datasets to analyze expression of their host transcripts, confirm translation of more than 100 novel sORFs using published ribosome profiling data, and mine in-house and published mass spectrometry datasets to support protein expression from more than 70 novel sORFs. Altogether, we provide a comprehensive catalog of conserved sORFs in animals to aid functional studies.

Results

Identification of conserved coding sORFs from multiple species alignments.

Our approach, which is summarized in Fig. 1A, is a significant extension of our previously published method [10]. Like most other computational studies, we take an annotated transcriptome together with published lincRNA catalogs as a starting point. We chose the Ensembl annotation (v74), which is currently one of the most comprehensive ones, especially for the species considered here. In contrast to de novo genome-wide predictions [17,43], we rely on annotated transcript structures including splice sites. We then identified canonical ORFs for each transcript, using the most upstream AUG for each stop codon; although use of non-canonical start codons has been frequently described [15-17,44,45], there is currently no clear consensus on how alternative translation start sites are selected.
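
The ORF-calling step described above (one ORF per stop codon, starting at the most upstream in-frame AUG) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `find_orfs`, the `min_codons` parameter, and the 0-based half-open coordinates are assumptions introduced here, and the sketch scans only the three forward frames of an already spliced, stranded transcript sequence.

```python
# Hypothetical sketch: names and parameters are illustrative assumptions,
# not taken from the published method.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # AUG in RNA notation


def find_orfs(seq, min_codons=1):
    """Return sorted (start, end, frame) tuples, one ORF per stop codon.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    For each stop codon, the ORF begins at the most upstream in-frame ATG
    that is not separated from the stop by another in-frame stop.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        first_start = None  # most upstream ATG since the last in-frame stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon == START_CODON and first_start is None:
                first_start = pos
            elif codon in STOP_CODONS:
                if first_start is not None:
                    n_codons = (pos - first_start) // 3  # stop not counted
                    if n_codons >= min_codons:
                        orfs.append((first_start, pos + 3, frame))
                first_start = None  # a stop closes the current frame window
    return sorted(orfs)


if __name__ == "__main__":
    transcript = "GGATGAAAATGCCCTAAG"
    for start, end, frame in find_orfs(transcript):
        print(frame, start, end, transcript[start:end])
```

Choosing the most upstream in-frame AUG yields the longest ORF for each stop codon, so an internal AUG in the same frame is treated as part of that ORF rather than as a separate candidate.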
Next, ORFs were classified according to their location on lincRNAs or on transcripts from protein-coding loci: annotated ORFs serving as positive control; ORFs in 3'UTRs, 5'UTRs or overlapping