Evaluating CCDS and
Rachel Harte and Mark Diekhans
Gencode meeting, June 18, 2008

Challenges For CCDS

• Where is the translation initiation start site?
• NMD candidates - how to track these?
• Loci lacking transcript evidence
• Missing loci

CCDS - Missing Loci

CCDS Statistics

Source                Loci    Transcripts    Fraction of Loci    Fraction of Transcripts
Ensembl only          3329           4075               0.128                      0.043
RefSeq only           3241           3322               0.125                      0.035
RefSeq AND Ensembl    2595           7383               0.100                      0.077
CCDS                 16788          80852               0.646                      0.845
Total                25953          95632               1.000                      1.000

hg18 Loci Without CCDS
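The Fraction columns of the CCDS Statistics table can be rechecked from the raw counts. The sketch below is illustrative only and is not taken from the slides; the function name `fractions` is made up for this example. The recomputed values agree with the table to within rounding (the CCDS loci fraction comes out as 0.647 against the 0.646 shown).

```python
# Counts copied from the CCDS Statistics table: (loci, transcripts).
counts = {
    "Ensembl only": (3329, 4075),
    "RefSeq only": (3241, 3322),
    "RefSeq AND Ensembl": (2595, 7383),
    "CCDS": (16788, 80852),
}


def fractions(table):
    """Return {source: (loci_fraction, transcript_fraction)}, rounded to 3 places.

    Hypothetical helper for illustration; not part of any CCDS tooling.
    """
    total_loci = sum(loci for loci, _ in table.values())
    total_tx = sum(tx for _, tx in table.values())
    return {
        source: (round(loci / total_loci, 3), round(tx / total_tx, 3))
        for source, (loci, tx) in table.items()
    }


result = fractions(counts)
for source, (f_loci, f_tx) in result.items():
    print(f"{source:20s} {f_loci:.3f} {f_tx:.3f}")
```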

[Figure: histogram. Y-axis: Number of Loci (0 to 1000). X-axis: RefSeq/ENSEMBL Percent CDS Similarity (bins: 100%, 95%, 90%, 75%, 50%, 25%, >0%). Total: 2595 Loci.]

Pseudogenes

• UCSC Retrotransposed (processed pseudogenes)
• Genome-wide search using retroFinder
• Re-run about every 6 months
• Compare to Havana and Yale pseudogenes
• Consolidate UCSC processed pseudogenes with Yale
• Future: re-run pipeline for UCSC duplicated pseudogenes track (Yontao Lu’s predictions)

Comparison of Havana/UCSC/Yale Processed Pseudogenes

Source                   Loci    Transcripts    Fraction of Loci    Fraction of Transcripts
Havana (3013)             249            251               0.011                      0.008
UCSC (16318)             8756           8762               0.387                      0.273
Yale (12799)             5967           6010               0.264                      0.187
Havana AND UCSC           872           1762               0.039                      0.055
Havana AND Yale           119            241               0.005                      0.008
UCSC and Yale            4899           9825               0.217                      0.306
UCSC, Havana and Yale    1754           5279               0.078                      0.164

- UCSC Only
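The rows of the comparison table, including the UCSC-only case illustrated on this slide, amount to grouping each locus by the set of sources that predict a processed pseudogene there. The following is a minimal sketch of that grouping, not the pipeline actually used: the locus identifiers and the `category` helper are hypothetical, and the slides do not say how predictions from different sources were merged into loci, so that step is assumed to have been done already.

```python
from collections import Counter

# Fixed source order so category names are stable.
SOURCES = ("Havana", "UCSC", "Yale")


def category(sources_present):
    """Name the comparison category for the set of sources predicting a locus."""
    present = [s for s in SOURCES if s in sources_present]
    if not present:
        raise ValueError("locus has no supporting source")
    if len(present) == 1:
        return f"{present[0]} only"
    return " AND ".join(present)


# Hypothetical locus -> sources mapping, for illustration only.
loci = {
    "locus_a": {"UCSC"},
    "locus_b": {"UCSC", "Yale"},
    "locus_c": {"Havana", "UCSC", "Yale"},
    "locus_d": {"Yale"},
    "locus_e": {"UCSC"},
}

# Count loci per category, as in the Loci column of the table.
tally = Counter(category(srcs) for srcs in loci.values())
for name, n in sorted(tally.items()):
    print(f"{name:25s} {n}")
```

Dividing each count by the total number of loci would then give the Fraction column.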

[Figure: UCSC Genome Browser view, Human Mar. 2006, chr1:42,317-43,222 (906 bp). Items shown: Yale Pseudogenes (Based on Ensembl Build 49, No Exon Structure): ENSP00000314947.frag.268392; UCSC Retroposed Genes, Including Pseudogenes: retro-OR4F15; UCSC Retroposed (Processed) Pseudogenes (Jan 2008) Only: NM_001001674.1-1. Other tracks displayed: Havana (Vega) Annotated Pseudogenes and Immunoglobulin Segments (May 2008), the Havana/Yale/UCSC processed pseudogene combination tracks, Blat search, MGC Genes, Human mRNAs, Spliced ESTs, Vertebrate Multiz Alignment & PhastCons Conservation (28 Species), SNPs (dbSNP build 128), RepeatMasker.]

• Yale predicts a pseudogene fragment
• Classed: ambiguous
• Parent gene: OR4CD

UCSC predicts longer UTR

CHCHD2 Pseudogene

[Figure: UCSC Genome Browser view, Human Mar. 2006, chr1:15,803,567-15,804,450 (884 bp). Items shown: Yale Pseudogenes (Based on Ensembl Build 49, No Exon Structure): ENSP00000352987.proc.268469; Havana (Vega) Annotated Pseudogenes and Immunoglobulin Segments (May 2008): OTTHUMT00000006774; UCSC Retroposed Genes, Including Pseudogenes: retro-CHCHD2; Havana (May 2008), Yale (Build 49) and UCSC (Jan 2008) Processed Pseudogenes: BC066331.1-1, ENSP00000352987.proc.268469, OTTHUMT00000006774; Blat search: CHCHD2_parent. Other tracks displayed: the remaining Havana/Yale/UCSC processed pseudogene combination tracks, Vega Protein Coding Annotations, Human mRNAs, Spliced ESTs, Vertebrate Multiz Alignment & PhastCons Conservation (28 Species), SNPs (dbSNP build 128), RepeatMasker.]

CHCHD2 Parent Gene

[Figure: UCSC Genome Browser view, Human Mar. 2006, chr7:56,138,052-56,141,600 (3,549 bp). Items shown: Blat search: CHCHD2_parent; Vega Protein Coding Annotations: OTTHUMT00000251589. Other tracks displayed: MGC Genes, Vega Annotated Pseudogenes and Immunoglobulin Segments, Human mRNAs, Spliced ESTs, Vertebrate Multiz Alignment & PhastCons Conservation (28 Species), SNPs (dbSNP build 128), RepeatMasker.]