The GENCODE Pseudogene Resource

Pei et al. Genome Biology 2012, 13:R51 http://genomebiology.com/2012/13/9/R51 RESEARCH Open Access The GENCODE pseudogene resource Baikang Pei1†, Cristina Sisu1,2†, Adam Frankish3, Cédric Howald4, Lukas Habegger1, Xinmeng Jasmine Mu1, Rachel Harte5, Suganthi Balasubramanian1,2, Andrea Tanzer6, Mark Diekhans5, Alexandre Reymond4, Tim J Hubbard3, Jennifer Harrow3 and Mark B Gerstein1,2,7* Abstract Background: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. Results: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large- scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. Conclusions: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes. Background pseudogenes, which arise through in situ mutations in Pseudogenes are defined as defunct genomic loci with previously functional protein-coding genes [1,4–6]. sequence similarity to functional genes but lacking cod- Different types of pseudogenes exhibit different geno- ing potential due to the presence of disruptive muta- mic features. Duplicated pseudogenes have intron-exon- tions such as frame shifts and premature stop codons like genomic structures and may still maintain the [1–4]. The functional paralogs of pseudogenes are often upstream regulatory sequences of their parents. In con- referred to as parent genes. Based on the mechanism of trast, processed pseudogenes, having lost their introns, their creation, pseudogenes can be categorized into contain only exonic sequence and do not retain the three large groups: (1) processed pseudogenes, created upstream regulatory regions. Processed pseudogenes by retrotransposition of mRNA from functional protein- maypreserveevidenceoftheirinsertionintheformof coding loci back into the genome; (2) duplicated (also polyadenine features at their 3’ end. These features of referred to as unprocessed) pseudogenes, derived from processed pseudogenes are shared with other genomic duplication of functional genes; and (3) unitary elements commonly known as retrogenes [7]. However, retrogenes differ from pseudogenes in that they have intact coding frames and encode functional proteins [8]. The composition of different types of pseudogenes var- * Correspondence: [email protected] † Contributed equally ies among organisms [9]. In the human genome, pro- 1Program in Computational Biology and Bioinformatics, Yale University, Bass cessed pseudogenes are the most abundant type due to 432, 266 Whitney Avenue, New Haven, CT 06520, USA Full list of author information is available at the end of the article © 2012 Pei et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Pei et al. Genome Biology 2012, 13:R51 Page 2 of 26 http://genomebiology.com/2012/13/9/R51 a burst of retrotranspositional activity [10] in the ances- human genome. We combined automated pipelines and tral primates 40 million years ago [11–13]. manual curation into a production annotation workflow. Pseudogenes have long been considered as nonfunc- This allowed us to precisely annotate pseudogene loci tional genomic sequences. However, evidence of tran- and create a consensus set of pseudogenes. scription and conservation of some pseudogenes led to We identified potential transcribed pseudogenes from the speculation that they might be functional [14,15], locus-specific transcription evidence (that is, EST and and several estimates of the number of transcribed pseu- mRNA data) and high throughput sequencing data (for dogenes have been published in recent years [14,16,17]. example, RNA-Seq) [30]. Candidate transcribed pseudo- More recently, studies have shown that, in some cases, genes were assessed by large-scale RT-PCR-Seq. The expressed pseudogenes can perform crucial regulatory experimental results can serve as a benchmark for com- roles through their RNA products [18–21]. putational models of pseudogene transcription. Finally, Pseudogenes have been suggested to exhibit different for each tissue tested, a list of transcribed pseudogenes types of activity. Firstly, they can regulate the expression was obtained. The results indicate that pseudogene tran- of their parent gene by decreasing the mRNA stability scription is predominantly tissue-specific. Using the of the functional gene through their over-expression. A functional genomics data from the ENCODE consor- good example is the MYLKP1 pseudogene, which is up- tium together with the pseudogene annotation, we regulated in cancer cells [22]. The transcription of found that the transcribed pseudogenes tend to associate MYLKP1 creates a non-coding RNA (ncRNA) that inhi- with a more active chromatin state and maintain more bits the mRNA expression of its functional parent, active promoter regions, compared to their non-tran- MYLK. Moreover, studies in Drosophila and mouse have scribed counterparts. Both the transcription and regula- shown that small interfering RNA (siRNA) derived from tion of pseudogenes exhibit tissue specificity. processed pseudogenes can regulate gene expression by Alongside ‘fully active’ pseudogenes, we also found means of the RNA-interference pathway [19,20,23–25], evidence for pseudogenes showing partial activity pat- thus acting as endogenous siRNAs. In addition, it has terns. One hypothesis is that these pseudogenes are the also been hypothesized that pseudogenes with high result of genomic elements in the process of either los- sequence homology to their parent genes can regulate ing or gaining function. Thus, we consider pseudogenes their expression through the generation of anti-sense showing partial activity as products of ‘dying’ genes or transcripts. A recent study by Hawkins and Morris [26] undergoing a ‘resurrection’ process. Two well-known has shown that knocking down a ncRNA antisense to examples of ‘dying’ and ‘resurrected’ pseudogenes are an Oct4 pseudogene increases the expression of both ACYL3 [31] and XIST [32], respectively. Partially active Oct4 and its pseudogene. Finally, pseudogenes can com- pseudogenes form an interesting group of case studies pete with their parent genes for microRNA (miRNA) for the evolution and dynamics of function develop- binding, thereby modulating the repression of the func- ment. There can be different patterns of pseudogene tional gene by its cognate miRNA. For example, the partial activity. For example, duplicated pseudogenes pseudogene of PTEN, a crucial tumor suppressor, regu- that arise from ‘dying’ genes may lack transcriptional lates the expression of its parent gene following this evidence, but retain some of the upstream control ele- mechanism [19]. The 3’ UTR of the transcript originat- ments from their parents - for example, active transcrip- ing from the pseudogene, PTENP1,actsasadecoyfor tion factor binding sites (TFBSs) and various levels of the miRNA that represses the parent gene. It has been chromatin activity. However, these genomic elements suggested that this could be a general mechanism of may no longer be evolutionarily constrained. Similarly, regulation in cancer [27]. we can envision a scenario where processed pseudo- While the above examples clearly illustrate that some genes that do not have their parental upstream regula- pseudogenes indeed have a functional role, the extent of tory sequences might gain functionality when they are this phenomenon is not clear. The large corpus of func- inserted into a region of the genome favorable for tran- tional data from the ENCODE consortium provides us scription. Such pseudogenes may gain upstream regula- with an opportunity to study pseudogene transcription tory sequences and hence transcriptional potential and activity in a systematic and comprehensive manner. resulting in novel ncRNAs. The resurrection motif was It is of interest to study whether these examples are just previously used by Vinckenbosch et al. [7] and Kaess- sporadic exceptions,

The GENCODE Pseudogene Resource

Expert Curation of the Human and Mouse Olfactory Receptor Gene Repertoires Identifies Conserved Coding Regions Split Across Two Exons

Repetitive Elements in Humans

GENCODE: the Reference Human Genome Annotation for the ENCODE Project

GENCODE Reference Annotation for the Human and Mouse Genomes

Downloaded from the Tranche Distributed File System (Tranche.Proteomecommons.Org) and Ftp://Ftp.Thegpm.Org/Data/Msms

The GENCODE Consortium the GENCODE Update Trackhub

High-Throughput Annotation of Full-Length Long Noncoding

3 Characterization of Intergenic Regions and Gene Definition

Universal Alternative Splicing of Noncoding Exons

Nearly All New Protein-Coding Predictions in the CHESS Database Are Not Protein-Coding Irwin Jungreis*,✝,1,2, Michael L

Accurate Mutation Annotation and Functional Prediction Enhance the Applicability of -Omics Data in Precision Medicine

RNA-Seq Analysis Reveals Localization-Associated Alternative Splicing Across 13 Cell Lines