Proto-Genes and De Novo Gene Birth
Total Page:16
File Type:pdf, Size:1020Kb
LETTER doi:10.1038/nature11184 Proto-genes and de novo gene birth Anne-Ruxandra Carvunis1,2,3, Thomas Rolland1,2, Ilan Wapinski4, Michael A. Calderwood1,2, Muhammed A. Yildirim5, Nicolas Simonis1,2{, Benoit Charloteaux1,2,6,Ce´sar A. Hidalgo7, Justin Barbette1,2, Balaji Santhanam1,2, Gloria A. Brar8, Jonathan S. Weissman8, Aviv Regev9,10, Nicolas Thierry-Mieg3, Michael E. Cusick1,2 & Marc Vidal1,2 Novel protein-coding genes can arise either through re-organization genes from non-genic ORFs occurring by chance in non-genic of pre-existing genes or de novo1,2. Processes involving re-organization sequences15. The resulting gene catalogue has undergone numerous of pre-existing genes, notably after gene duplication, have been exten- adjustments16, with currently ,6,000 ORFs annotated as genes and sively described1,2.Incontrast,de novo gene birth remains poorly ,261,000 unannotated ORFs containing at least three codons understood, mainly because translation of sequences devoid of genes, considered to be non-genic ORFs (Supplementary Fig. 1). Non-genic or ‘non-genic’ sequences, is expected to produce insignificant poly- sequences are broadly transcribed in S. cerevisiae17, their overexpres- peptides rather than proteins with specific biological functions1,3–6. sion is mostly non-toxic18, and the corresponding transcripts can Here we formalize an evolutionary model according to which associate with ribosomes, often at AUG start codons6,19. We reasoned functional genes evolve de novo through transitory proto-genes4 that translation of non-genic ORFs could be more common than generated by widespread translational activity in non-genic expected. Such translation events would not systematically lead to de sequences. Testing this model at the genome scale in Saccharomyces novo gene birth, as the corresponding polypeptides would not cerevisiae, we detect translation of hundreds of short species-specific necessarily have specific biological functions. Instead, upon trans- open reading frames (ORFs) located in non-genic sequences. These lation, non-genic ORFs would become proto-genes (Fig. 1b). Proto- translation events seem to provide adaptive potential7, as suggested genes would provide adaptive potential6 by exposing genetic variations by their differential regulation upon stress and by signatures of that are usually hidden in non-genic sequences. A subset of proto- retention by natural selection. In line with our model, we establish genes could occasionally be retained over evolutionary time, for that S. cerevisiae ORFs can be placed within an evolutionary instance if providing an advantage to the organism under specific continuum ranging from non-genic sequences to genes. We identify environmental conditions. Retained proto-genes could gradually 1,900 candidate proto-genes among S. cerevisiae ORFs and find evolve the characteristics of genes, whereas other proto-genes might that de novo gene birth from such a reservoir may be more prevalent lose the ability to be translated. Such a reservoir of proto-genes would than sporadic gene duplication. Our work illustrates that evolution allow evolutionary innovations to be attempted without affecting exist- exploits seemingly dispensable sequences to generate adaptive ing genes. functional innovation. This evolutionary model leads to the following predictions: (1) the Both genome-wide surveys and analyses of individual cases have structural and functional characteristics of S. cerevisiae ORFs (for shown that de novo gene birth has occurred throughout the evolution example, length, expression level or sequence composition) should of many lineages, potentially affecting species-specific adaptations reflect an evolutionary continuum ranging from non-genic ORFs to and evolutionary radiations1,2,5,6,8,9. Genes are thought to emerge de genes; (2) many non-genic ORFs should be translated; and (3) ORFs novo when non-genic sequences become transcribed, acquire ORFs that emerged recently should occasionally have adaptive functions and the corresponding non-genic transcripts access the translation retained by natural selection. machinery1,2,4,5,8. However, it is hard to reconcile this proposed mech- To examine these predictions, we estimated the order of emergence anism with expectations that non-genic sequences should lack trans- of S. cerevisiae ORFs (Fig. 1c). Annotated ORFs were classified into ten lational activity and, even if translated, should encode insignificant groups based on their conservation throughout the Ascomycota phylo- polypeptides1,3,4,6. Evidence of associations between non-genic tran- geny (Supplementary Fig. 2). Of ,6,000 annotated ORFs, ,2% are 10 scripts and ribosomes has suggested that non-genic sequences may found only in S. cerevisiae (ORFs1) (Supplementary Fig. 2) and ,12% occasionally be translated, which could provide raw material for are found only in the four closely related Saccharomyces sensu stricto 6 natural selection . It has also been speculated that genes that originate species (ORFs1–4). The ,88% of annotated ORFs found outside of this de novo could initially be simple and gradually become more complex group (ORFs5–10) are well characterized and can confidently be con- 4 over evolutionary time . These ideas are consistent with reports show- sidered genes. ORFs1–4 are poorly characterized and their annotation ing that genes that emerged recently are shorter, less expressed and as genes is debatable (Supplementary Fig. 2)16,20. The weak conser- 1,10–13 more rapidly diverging than other genes . We developed an integ- vation of ORFs1–4 suggests that they emerged recently, which we rative evolutionary model whereby de novo gene birth proceeds corroborated using gene duplication events to control for relative time through intermediate and reversible proto-gene stages, mirroring the of emergence (Supplementary Fig. 3). We estimate that over 97% of 14 well-described pseudo-gene stages of gene death (Fig. 1a) . ORFs1–4 originated de novo rather than by cross-species transfer, We investigated this model at genome scale in the context of de novo which could also explain their weak conservation (Supplementary 8,10 gene birth in S. cerevisiae .InS. cerevisiae, a minimal length threshold Information). ORFs1–4 often partially overlap ORFs5–10, which seems of 300 nucleotides was originally used to delineate ORFs likely to be incompatible with cross-species transfer, or preferentially lie within 1Center for Cancer Systems Biology (CCSB) and Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA. 2Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA. 3UJF-Grenoble 1/CNRS/TIMC-IMAG UMR 5525, Computational and Mathematical Biology Group, Grenoble F-38041, France. 4Department of Systems Biology, Harvard Medical School, Boston, Massachusetts 02115, USA. 5Center for International Development and Harvard University, Cambridge, Massachusetts 02138, USA. 6Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liege, 4000 Liege, Wallonia-Brussels Federation, Belgium. 7The MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts 02142, USA. 8Howard Hughes Medical Institute, Department of Cellular and Molecular Pharmacology, University of California, San Francisco, and California Institute for Quantitative Biosciences, San Francisco, California 94158, USA. 9Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA. 10Howard Hughes Medical Institute, Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. {Present address: Laboratoire de Bioinformatique des Ge´nomes et des Re´seaux (BiGRe), Campus Plaine, Free University of Brussels, 1050 Brussels, Wallonia- Brussels Federation, Belgium. 00 MONTH 2012 | VOL 000 | NATURE | 1 ©2012 Macmillan Publishers Limited. All rights reserved RESEARCH LETTER subtelomeric regions whose instability may facilitate de novo emergence a (Supplementary Fig. 4). In addition to classifying ORFs1–10,weassigned a conservation level of 0 to ,108,000 unannotated ORFs longer than 30 Non-genic Non-genic sequence Proto-gene Gene Pseudo-gene sequence nucleotides and free from overlap with annotated features on the same strand (ORFs0) (Supplementary Information). ORFs0 and ORFs1–4 Mechanisms developed here Well-described mechanisms b constituted our initial list of candidate proto-genes. To test the evolutionary continuum prediction, we first verified that Hidden genetic variations Exposed genetic variations Translation ORF conservation level correlates positively with length and expres- event sion level (Fig. 2a and Supplementary Fig. 5)1,10–12. These correlations suggest that genes evolve from non-genic ORFs that lengthen and Non-genic Non-genic ORFs Genes increase in expression level over evolutionary time. A negative correla- sequences Proto-genes tion between ORF length and expression level21 was observed among Non-genic sequence Conservation level Outcome of natural selection: a b Candidate ORF Length Loss of translation Candidate Genes proto-genes Genes Expression level Evolution into gene proto-genes τ τ Sequence composition 1,600 = 0.31 0.16 = 0.12 c P < 2.2 × 10–16 P < 2.2 × 10–16 ~ 267,000 ORFs in S. cerevisiae Candidate 1,200 0.12 proto-genes Genes ~ 6,000 ~ 261,000 Median codon 5 adaptation index annotated ORFs unannotated ORFs 10 800 0.08 4 012345678910 10 Conservation level 103 (in nucleotides) 400 c Average ORF length Length > 30 nucleotides, 2 ORFs /ORFs Phylostratigraphic 10 ORFs1–4/ORFs0