Evaluating CCDS and Pseudogenes
Rachel Harte and Mark Diekhans
Gencode Meeting, June 18, 2008

Challenges for CCDS
• Where is the translation initiation start site?
• NMD candidates - how to track these?
• Loci lacking transcript evidence
• Missing loci

CCDS - Missing Loci

CCDS Statistics

Source                 Loci   Transcripts   Fraction of Loci   Fraction of Transcripts
Ensembl only           3329          4075              0.128                     0.043
RefSeq only            3241          3322              0.125                     0.035
RefSeq AND Ensembl     2595          7383              0.100                     0.077
CCDS                  16788         80852              0.646                     0.845
Total                 25953         95632              1.000                     1.000

hg18 Loci Without CCDS

[Figure: bar chart. Y-axis: Number of Loci (0 to 1000). X-axis: RefSeq/ENSEMBL Percent CDS Similarity, binned at 100%, 95%, 90%, 75%, 50%, 25%, >0%. Total: 2595 Loci.]

Pseudogenes

• UCSC Retrotransposed (processed pseudogenes)
• Genome-wide search using retroFinder
• Re-run about every 6 months
• Compare to Havana and Yale pseudogenes
• Consolidate UCSC processed pseudogenes with Yale
• Future: re-run pipeline for UCSC duplicated pseudogenes track (Yontao Lu’s predictions)

Comparison of Havana/UCSC/Yale Processed Pseudogenes

Source                  Loci   Transcripts   Fraction of Loci   Fraction of Transcripts
Havana (3013)            249           251              0.011                     0.008
UCSC (16318)            8756          8762              0.387                     0.273
Yale (12799)            5967          6010              0.264                     0.187
Havana AND UCSC          872          1762              0.039                     0.055
Havana AND Yale          119           241              0.005                     0.008
UCSC and Yale           4899          9825              0.217                     0.306
UCSC, Havana and Yale   1754          5279              0.078                     0.164

Pseudogene - UCSC Only

[Figure: genome browser screenshot. Window Position: Human Mar. 2006, chr1:42,317-43,222 (906 bp). Tracks: Yale Pseudogenes Based on Ensembl Build 49, No Exon Structure (ENSP00000314947.frag.268392); Havana (Vega) Annotated Pseudogenes and Immunoglobulin Segments, May 2008; UCSC Retroposed Genes, Including Pseudogenes (retro-OR4F15); Havana Processed Pseudogenes (May 2008) Only; Yale Processed Pseudogenes (Build 49) Only; UCSC Retroposed (Processed) Pseudogenes (Jan 2008) Only (NM_001001674.1-1); Havana and Yale Only; Yale and UCSC Only; Havana and UCSC Only; Havana, Yale and UCSC; Your Sequence from Blat Search; MGC Genes; Human mRNAs; Spliced ESTs; Vertebrate Multiz Alignment & PhastCons Conservation (28 Species); SNPs (128); RepeatMasker.]

• Yale predicts a pseudogene fragment
• Classed: ambiguous
• Parent gene: OR4CD

UCSC predicts longer UTR

CHCHD2 Pseudogene

[Figure: genome browser screenshot. Window Position: Human Mar. 2006, chr1:15,803,567-15,804,450 (884 bp). Tracks: Yale Pseudogenes Based on Ensembl Build 49, No Exon Structure (ENSP00000352987.proc.268469); Havana (Vega) Annotated Pseudogenes and Immunoglobulin Segments, May 2008 (OTTHUMT00000006774); UCSC Retroposed Genes, Including Pseudogenes (retro-CHCHD2); Havana Only; Yale Only; UCSC Only; Havana and Yale Only; Yale and UCSC Only; Havana and UCSC Only; Havana (May 2008), Yale (Build 49) and UCSC (Jan 2008) Processed Pseudogenes (BC066331.1-1, ENSP00000352987.proc.268469, OTTHUMT00000006774); Your Sequence from Blat Search (CHCHD2_parent); Vega Protein Coding Annotations; Vega Annotated Pseudogenes and Immunoglobulin Segments (OTTHUMT00000006774); Human mRNAs; Spliced ESTs; Vertebrate Multiz Alignment & PhastCons Conservation (28 Species); SNPs (128); RepeatMasker.]

CHCHD2 Parent Gene

[Figure: genome browser screenshot. Window Position: Human Mar. 2006, chr7:56,138,052-56,141,600 (3,549 bp). Tracks: Your Sequence from Blat Search (CHCHD2_parent); MGC Genes; Vega Protein Coding Annotations (OTTHUMT00000251589); Vega Annotated Pseudogenes and Immunoglobulin Segments; Human mRNAs; Spliced ESTs; Vertebrate Multiz Alignment & PhastCons Conservation (28 Species); SNPs (128); RepeatMasker.]
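The Fraction columns in the CCDS Statistics and processed pseudogene comparison tables, and the window sizes quoted for the browser views, can be re-derived from the numbers on the slides. The following is a minimal sketch, not the authors' pipeline: the function names `fractions` and `window_length` are hypothetical, each fraction is assumed to be a row count divided by the column sum, and browser positions are assumed to be 1-based and inclusive (which is consistent with the bp sizes shown).

```python
# Counts copied from the slide tables as (loci, transcripts) per source category.
ccds_counts = {
    "Ensembl only": (3329, 4075),
    "RefSeq only": (3241, 3322),
    "RefSeq AND Ensembl": (2595, 7383),
    "CCDS": (16788, 80852),
}

pseudogene_counts = {
    "Havana (3013)": (249, 251),
    "UCSC (16318)": (8756, 8762),
    "Yale (12799)": (5967, 6010),
    "Havana AND UCSC": (872, 1762),
    "Havana AND Yale": (119, 241),
    "UCSC and Yale": (4899, 9825),
    "UCSC, Havana and Yale": (1754, 5279),
}


def fractions(counts):
    """Return {source: (loci_fraction, transcript_fraction)}.

    Assumption: each fraction is the row count divided by the column total.
    """
    total_loci = sum(loci for loci, _ in counts.values())
    total_tx = sum(tx for _, tx in counts.values())
    return {
        source: (loci / total_loci, tx / total_tx)
        for source, (loci, tx) in counts.items()
    }


def window_length(position):
    """Length in bp of a position string such as 'chr1:42,317-43,222'.

    Assumption: coordinates are 1-based and inclusive, so length = end - start + 1.
    """
    _, span = position.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return end - start + 1


if __name__ == "__main__":
    for title, table in (("CCDS", ccds_counts), ("Pseudogenes", pseudogene_counts)):
        print(title)
        for source, (f_loci, f_tx) in fractions(table).items():
            print(f"  {source:<24}{f_loci:8.3f}{f_tx:8.3f}")
    for pos in ("chr1:42,317-43,222", "chr1:15,803,567-15,804,450",
                "chr7:56,138,052-56,141,600"):
        print(pos, window_length(pos), "bp")
```

Recomputed this way, the values agree with the slides to within 0.001; small differences in the last digit come down to rounding versus truncation.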