Identification and Analysis of Unitary Pseudogenes: Historic and Contemporary Gene Losses in Humans and Other Primates
Zhang et al. Genome Biology 2010, 11:R26
http://genomebiology.com/2010/11/3/R26

RESEARCH Open Access

Zhengdong D Zhang1, Adam Frankish2, Toby Hunt2, Jennifer Harrow2, Mark Gerstein1,3,4*

Abstract

Background: Unitary pseudogenes are a class of unprocessed pseudogenes without functioning counterparts in the genome. They constitute only a small fraction of annotated pseudogenes in the human genome. However, as they represent distinct functional losses over time, they shed light on the unique features of humans in primate evolution.

Results: We have developed a pipeline to detect human unitary pseudogenes through analyzing the global inventory of orthologs between the human genome and its mammalian relatives. We focus on gene losses along the human lineage after the divergence from rodents about 75 million years ago. In total, we identify 76 unitary pseudogenes, including previously annotated ones, and many novel ones. By comparing each of these to its functioning ortholog in other mammals, we can approximately date the creation of each unitary pseudogene (that is, the gene ‘death date’) and show that for our group of 76, the functional genes appear to be disabled at a fairly uniform rate throughout primate evolution - not all at once, correlated, for instance, with the ‘Alu burst’. Furthermore, we identify 11 unitary pseudogenes that are polymorphic - that is, they have both nonfunctional and functional alleles currently segregating in the human population. Comparing them with their orthologs in other primates, we find that two of them are in fact pseudogenes in non-human primates, suggesting that they represent cases of a gene being resurrected in the human lineage.

Conclusions: This analysis of unitary pseudogenes provides insights into the evolutionary constraints faced by different organisms and the timescales of functional gene loss in humans.
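The ortholog-based screen summarized in the Results paragraph above can be illustrated with a small sketch. It is not the authors' pipeline: the `ReferenceGene` record, its three boolean evidence fields, the gene names, and the "unresolved" label are all hypothetical and chosen only for illustration. The sketch assumes that a reference-species gene with no ortholog in the target species is treated as a likely lineage-specific gain when it cannot be aligned to the target genome, and as a candidate unitary pseudogene when it aligns but the aligned sequence is disrupted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceGene:
    """A reference-species gene with hypothetical screening evidence."""
    name: str
    has_ortholog: bool         # an ortholog was called in the target species
    aligns_to_target: bool     # the protein can be aligned to the target genome
    alignment_disrupted: bool  # the alignment shows a stop codon or frameshift


def classify(gene: ReferenceGene) -> str:
    """Assign a coarse label under the assumptions stated in the lead-in."""
    if gene.has_ortholog:
        return "conserved"
    if not gene.aligns_to_target:
        return "likely gain in reference lineage"
    if gene.alignment_disrupted:
        return "candidate unitary pseudogene"
    # Alignable and undisrupted, yet no ortholog called: left for inspection.
    return "unresolved"


# Made-up records; none of these correspond to real genes.
genes = [
    ReferenceGene("geneA", True, True, False),
    ReferenceGene("geneB", False, False, False),
    ReferenceGene("geneC", False, True, True),
    ReferenceGene("geneD", False, True, False),
]

labels = {gene.name: classify(gene) for gene in genes}
for name, label in labels.items():
    print(f"{name}: {label}")
```

In practice each candidate would still need the redundancy removal, annotation filtering, and manual inspection that the Results section describes; the sketch covers only the initial triage.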
* Correspondence: [email protected]
1 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA

© 2010 Zhang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Pseudogenes (ψ) are nongenic DNA segments that exhibit a high degree of sequence similarity to functional genes but contain disruptive defects. The initial pseudogenization of a functional gene is most likely a single mutagenic event that results in premature stop codons, abolished splice junctions, shifts to the coding frame, or impaired transcriptional regulatory sequences. Most pseudogenes are disabled copies of a functional ‘parent’ gene and can be classified as either processed or duplicated pseudogenes depending on whether they are generated by the retrotransposition of processed mRNA transcripts or the duplication of gene-containing DNA segments in the genome. Recently, the pseudogene complement of the human genome has been investigated both in gene family-specific studies [1-4] and in comprehensive surveys [5-7]. Of the approximately 20,000 pseudogenes identified in early studies, most, if not all, do not represent the extinction of a function as their ‘parent’ genes are intact and functional.

A third group of pseudogenes particularly relevant to functional analyses are unitary pseudogenes, which are unprocessed pseudogenes with no functional counterparts. They are generated by disruptive mutations that occur in functional genes and prevent them from being successfully transcribed or translated. They differ from duplicated pseudogenes in that the disabled gene had an established function rather than being a more recent copy of a functional gene. The initial analysis of the euchromatic sequence of the human genome identified 37 unitary pseudogene candidates [8]. In addition to unitary pseudogenes with fixed disruptive nucleotide substitutions, human genes with polymorphic disruptive sites that are currently segregating in the human population have also been identified [8-10], and many of them provide the genetic bases of certain inheritable diseases [11]. Such gene deactivation, which happens in situ giving rise to a unitary pseudogene, results in a loss to the functional part of the genetic repertoire of the organism. Polymorphic pseudogenes are unlikely to become fixed in a population if the gene loss is deleterious. However, various evolutionary processes, such as genetic drift, migration (population bottleneck), and in some cases, natural selection, can lead to fixation. A number of genes are known to have been lost in the human lineage in comparison with other mammals [4,12-15].

In this study, we develop a novel comparative genomic approach to identify genes disabled in situ without a functional copy (unitary pseudogenes) using the absence of human proteins orthologous to their mouse counterparts as the signal of the loss of well-established genes. Our method is able to systematically detect the sequence signature left by such genic losses, distinguishing true loss from mere loss of redundant genes following duplication or retrotransposition. We identify historic and contemporary losses of protein-coding genes in the human lineage since the last common ancestor of euarchontoglires (primates and rodents). In addition to pseudogenes in tandem gene families, we identify 76 losses of well-established genes in the human lineage since the common ancestor with mouse. Moreover, we also find 11 genes with polymorphic disruptive sites. This latter set represents gene losses on a very different timescale: the genic and pseudogenic alleles are segregating in the current human population and are subject to various evolutionary forces.

Results

Gene loss is indicated by the absence of orthologs

After a speciation event, the increasing divergence between two resultant species reflects the diminution in their genic orthology as gains and losses of genes gradually accumulate in each of them. Thus, the presence of genes unique to one species relative to another indicates either gene gains in one or gene losses in the other. In common with many other genomic features, genes in all species are in a state of flux during evolution. However, since all species are related to one another through speciation, gains and losses of genes in one species can be identified only relative to another. Based on this observation, we developed a pipeline that uses the orthologous relationship between genes from a pair of species to detect gene losses in one of them.

To take advantage of rich genomic annotation available for mouse, our study uses the mouse gene set as the reference to identify genes that have been lost in the human lineage since the divergence of these two species. Using the InParanoid [16] human-mouse orthologous gene set, we find 6,236 mouse proteins without discernible human orthologs. The presence of these unique mouse proteins indicates, most likely, both gene gains in the mouse lineage and gene losses in the human one. There are 2,005 unique mouse proteins that cannot be aligned to the human genome and thus are likely to be gene gains in the mouse. For the remaining unique mouse proteins that can be aligned, we find disruptions to the putative human coding sequences in 974 sequence alignments. Subsequent removal of redundancy reveals 612 potentially pseudogenic loci; 187 loci are removed from the list because they are identified based on predicted or modeled mouse genes, whose validity cannot be easily verified; 94 loci are also removed without further consideration as their identifications are based on unspliced mouse transcribed sequences labeled as ‘expressed’ or ‘RIKEN cDNA’ sequences. The filtering steps leave 258 loci based on annotated mouse genes and 73 of these are based on spliced mouse ‘expressed’ or ‘RIKEN cDNA’ sequences. Manual inspection of each of the remaining 331 pseudogenic loci removes 113 false positives (such as ones found in short, low-quality sequence alignments) and confirms the presence of 228 disabled human genes, which include 122 pseudogenes in large gene families, 81 possible fixed human unitary pseudogenes, and 15 likely segregating human pseudogenes. After removing five human fixed pseudogenes that are not in regions syntenic to those of their mouse orthologs and four segregating pseudogenes whose identifications are attributed to sequence errors in the human reference genome, we identify 87 unitary pseudogenes, of which 76 are fixed and 11 still segregating in the human population (Figure 1b).

Many genes were lost in the human lineage since the human-mouse divergence

Using the human-mouse genic orthology, we identify 228 pseudogenic loci - about 1% of the human gene catalog - in the human genome, which include 98 olfactory receptors, 23 vomeronasal receptors, and 1 zinc finger protein. The large number of olfactory receptors and vomeronasal receptors found in our study is consistent with previous observations [17,18]. These gene families form tandem gene clusters and have experienced copy number changes and complex local rearrangements. Because the dynamics of gene clusters make it difficult to unambiguously discern ortholog/paralog relationships