The Classification of Orphans Is Improved by Combining Searches in Both Proteomes and Genomes
bioRxiv preprint doi: https://doi.org/10.1101/185983; this version posted May 19, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Walter Basile 1,2, Marco Salvatore 1,2, Arne Elofsson 1,2,3,*

1 Science for Life Laboratory, Stockholm University, SE-171 21 Solna, Sweden
2 Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden
3 Swedish e-Science Research Center (SeRC)
* Corresponding author: [email protected]

Abstract

The identification of de novo created genes is important as it provides a glimpse into the evolutionary processes of gene creation. Potential de novo created genes are identified by selecting genes that have no homologs outside a particular species, but accurate detection requires that this identification is correct. Genes without any homologs are often referred to as orphans; in addition to de novo created ones, fast-evolving genes or genes lost in all related genomes might also be classified as orphans. The identification of orphans depends on (i) a method to detect homologs and (ii) a database including genes from related genomes. Here, we set out to investigate how the detection of orphans is influenced by these two factors. Using Saccharomyces cerevisiae, we find that the best strategy is to combine a search of annotated proteins with a search of a six-frame translation of all ORFs from closely related genomes. Using this strategy we obtain a set of 54 orphans in Drosophila melanogaster and 38 in Drosophila pseudoobscura, significantly fewer than reported in some earlier studies.

Introduction

Even after twenty years of genome sequencing, some protein-coding genes appear to have no homologs.
These genes are often referred to as orphans, a definition that dates back to the sequencing of Saccharomyces cerevisiae [1-3]. It was believed that most of them would disappear as more genomes were sequenced. However, that has not been the case, even after thousands of genomes have been sequenced [4-7]. After the initial studies in yeast, orphans have also been identified in Drosophila [8-10], mammals [11], primates [12-16], plants [17,18], bacteria [19-21] and viruses [22,23].

The term "orphan" might refer to genes without any detectable homologs, or it might refer only to de novo created genes. The first definition would include genes whose homologs are all too distant to be detected, possibly due to extensive gene loss in closely related species, a lack of closely related species in the database searched, or fast evolution of the gene. Given better methods to detect homology and the inclusion of more closely related species in the database, the number of non-de novo created orphans should decrease.

If a gene truly does not have any homologs, it must have been created by mutations turning a non-coding DNA region into a protein-coding gene. It is likely that this is a widespread phenomenon, at least in yeast [24]. However, in evolutionary terms only a few of these genes will become fixed in the population. It is plausible that every protein superfamily was created by such mechanisms. However, it is also possible that at least some superfamilies were created by rearranging shorter stretches of coding regions [25].
A correct identification of orphans is therefore important for understanding the selective pressures affecting novel gene creation throughout evolutionary history. In particular, an imprecise detection that includes fast-evolving genes in the orphan set might give the incorrect picture that de novo created genes have the same properties as fast-evolving genes.

In some studies, different levels of orphanicity have been used [4,26-28]. In this approach, a gene is assigned a certain age when homologs are only found within genomes belonging to a certain clade. For instance, a gene can be specific to primates, to mammals or to eukaryotes. The correct classification of the age of a gene is difficult, as it would require accurate detection of very distant homologs. If this is not done correctly, fast-evolving genes will appear to be younger than they are. Taking these difficulties into account, we only study the most recent orphans and have not used gene age in this study.

In short, the identification of orphans starts by searching for homologous genes in a set of related genomes and then labeling as orphans the genes lacking hits. The results will depend on the database used and on the homology detection method. In general, better homology detection [29-31] would potentially reduce the number of genes assigned as orphans [4].

Another complication in the identification of de novo created genes is that it is almost impossible to know from its genomic sequence alone whether an ORF is a functional gene. For genes with homologs, conservation across species is the best indicator of whether a gene is functionally important, but by definition this is not available for orphans. An ORF might look like a gene but never be expressed under any circumstances, i.e. not be functional. Alternatively, an ORF can be missed because it appears not to be coding.
This difficulty applies both to the ORFs in the query species and to the genomes searched. Therefore, just having a hit to a genomic location does not guarantee that the corresponding region is an actual functional gene [5,28,32]. However, the reverse is also possible, i.e. a hit to a non-functional ORF may represent homology with a common functional ancestral gene whose function was lost in the other species but was present in the ancestral one. Other factors, such as the removal of CDS overlapping annotated genes, the study of syntenic regions, or experimental studies, improve the efficacy of detection of de novo created genes.

Several different estimates of the orphans of S. cerevisiae have been proposed [4,5,27,28,33]. Surprisingly, the obtained sets of orphans are quite different between these studies. One of the reasons for these differences is the use of different datasets [28]. For example, all S. cerevisiae ORFs are considered in Carvunis et al. [5], but only the annotated ones are present in Ekman et al. [4].

The goal of this study was to develop a robust strategy for orphan detection in sequenced genomes. We used a set of homology search methods (blastn, blastp and tblastn) and compared the classified orphans with earlier sets. To avoid some of the problems described above we decided to: (i) use only annotated genes, (ii) assume that every significant hit in a related genome shows that a gene is not de novo created, although we know that this would not identify very recently de novo created genes, and (iii) use a similar set of genomes as in earlier studies. We then explore how the inclusion of different genomes and the use of different homology detection tools influence the detection of orphans, first in S. cerevisiae and thereafter in D. melanogaster and D. pseudoobscura. Finally, we propose a strategy that combines the use of tblastn and blastp.
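As a rough illustration of such a combined strategy, the following Python sketch labels a gene as an orphan only if it has no significant hit in either a blastp search against related proteomes or a tblastn search against related genomes. It assumes that both searches were saved in BLAST+ tabular format (-outfmt 6) with the default twelve columns. The E-value cutoff, the function names and the toy gene identifiers are hypothetical and are not taken from this study.

```python
from typing import Iterable, Set

# Hypothetical significance cutoff; an assumption for this sketch only.
MAX_EVALUE = 1e-3


def queries_with_hits(tabular_lines: Iterable[str],
                      max_evalue: float = MAX_EVALUE) -> Set[str]:
    """Return the query IDs that have at least one hit with E-value <= max_evalue.

    Each row is expected to follow the default BLAST+ tabular layout:
    column 1 is the query ID and column 11 is the E-value.
    """
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def classify_orphans(gene_ids: Iterable[str],
                     blastp_lines: Iterable[str],
                     tblastn_lines: Iterable[str],
                     max_evalue: float = MAX_EVALUE) -> Set[str]:
    """Return genes with no significant hit in either the blastp or the tblastn search."""
    with_homolog = (queries_with_hits(blastp_lines, max_evalue)
                    | queries_with_hits(tblastn_lines, max_evalue))
    return set(gene_ids) - with_homolog


# Toy input (made-up identifiers and scores, for illustration only).
genes = ["geneA", "geneB", "geneC", "geneD"]
blastp_rows = ["geneA\tprot1\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200"]
tblastn_rows = [
    "geneB\tcontig1\t60.0\t90\t36\t0\t1\t90\t300\t570\t1e-10\t90",
    "geneC\tcontig2\t30.0\t40\t28\t0\t1\t40\t10\t130\t5.0\t20",  # not significant
]
print(sorted(classify_orphans(genes, blastp_rows, tblastn_rows)))
# prints ['geneC', 'geneD']
```

In this toy example geneA is rescued by the proteome search and geneB only by the genome search, which is the situation where combining the two searches changes the orphan set. The hit to geneC is above the assumed cutoff, so it stays an orphan together with geneD, which has no hits at all.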
Using this strategy, a conservative estimate of the number of orphans in Drosophila melanogaster and Drosophila pseudoobscura is obtained.

Materials and Methods

S. cerevisiae data

5,917 Saccharomyces cerevisiae protein-coding genes (reference strain S288C, NCBI taxon id 559292) were downloaded from the Saccharomyces Genome Database (SGD) [34]. Only coding and annotated ORFs were considered, and we also removed genes labeled as "dubious" and mitochondrial or 2-micron plasmid encoded genes, resulting in a set of 5,894 genes.

For homology detection we used fourteen genomes from the Saccharomycetales order, see Table 1. Five of these genomes belong to the Saccharomyces genus. All genomes are fully sequenced with an assembly status equal to "contigs", "scaffold", "chromosomes" or "complete genome". Full genome sequences were obtained from NCBI Genome Project, while proteomes were downloaded from UniProt [35]. It should be noted that many better annotated fungal genomes are available today. However, we chose to use this set, as it is representative of what was used in earlier studies.

Estimation of evolutionary distances between genomes

To estimate the phylogenetic distance between the genomes, we used the amino acid sequence of the well-conserved gene mre11 (nuclease subunit of the MRX complex, involved in DNA repair) as a proxy.