The GENCODE Consortium the GENCODE Update Trackhub

1/25/2019

The status of GENCODE gene annotation in Ensembl
Jane Loveland PhD, Annotation Project Leader, Ensembl-HAVANA
PAG XXVII, 13th January 2019

GENCODE Manual Genome Annotation
Reference genebuild: ‘First pass’ systematic chromosome annotation
• Analysis of cDNA, ESTs, build genes
Mouse complete
Maturing genebuild: Targeted improvement of models
• Identification of additional genes, transcripts, exons
• ‘Completion’ of models
• Correct functional annotation
Major focus of human work: TAGENE, MANE project

The HAVANA team: Manual Annotation
Diagram labels: Whole genome or chromosome; Targeted regions or genes; GENCODE; Community projects; Sequences from databases; Structural and functional

Annotation: Biotypes based on transcriptional evidence
• Protein coding: Known_CDS, Novel_CDS, Putative_CDS, Nonsense_mediated_decay
• Transcript: retained intron, putative
• Non-coding: lincRNA, Antisense, Sense_intronic, Sense_overlapping, 3’_overlapping_ncRNA
• Pseudogene: Processed, Unprocessed, Transcribed, Translated, Unitary, Polymorphic
• Immunoglobulin: IG_pseudogene, IG_Gene, TR_Gene

The GENCODE consortium
HAVANA (manual annotation) + Genebuild (computational annotation) = GENCODE gene set

The GENCODE update trackhub

Ensembl gene view: ENO1, updated annotation (more transcripts)

Walking across the mouse genome
Figure label: Updated annotation

The GENCODE consortium: current gene counts

                               Human     Mouse
Total No of Transcripts        206694    141283
Total No of Genes               58721     55636
Protein-coding genes            19940     22407
Long non-coding RNA genes       16066     13250
Small non-coding RNA genes       7577      6108
Pseudogenes                     14729     13376
IG/TR gene segments
 - protein coding segments        408       494
 - pseudogenes                    237       203

GRCm38 Updates
Genome issues resolved post-GRCm38
Updates as of GRCm38.p6:
• 65 FIX patches
• 9 NOVEL patches
GRCm39 due summer 2019

This is A LOT of new transcript data
Within protein-coding genes…
SCN2A, Nanopore (cerebellum)
Figure label: RNAseq introns
Currently 17 SCN2A transcript models
… how many more could we annotate?
… should we annotate?

This is A LOT of new transcript data
… and outside of protein-coding genes
PacBio Capture-seq (non-redundant)
Figure labels: Existing GENCODE annotation; ENSG00000261738; ENSG00000264449

Long-read data: Better for discovering novel alternatively spliced transcripts and full-length transcripts
‘TAGENE’ workflow to aid manual annotation
PacBio-CaptureSeq (human: brain, testis, heart, liver, HeLa, K562) (mouse: brain, testis, heart, liver, E7, E15)
SLR-RNAseq (human/mouse brain)

Long reads and manual annotation
Manual analysis of TAGENE
TAGENE created:
• 259,964 models in coding genes
• 44,959 models in lncRNAs
• 17,025 models in intergenic space (11,506 novel genes)
1984 TAGENE models manually examined so far: ~80% ‘completely’ acceptable novel alternatively spliced transcripts
* No models have made it into GENCODE without manual inspection *
Future plans for 2nd round:
• More accurate splice site classification (trust, check, reject)
• Develop CDS prediction utility
• Scale up: manual annotators focus on function

Human geneset refinement
Two comprehensive independent human reference transcript sets: ~34,000 unique CCDS for 95% of human protein-coding genes
Why is this a problem?
• Resources use either RefSeq or Ensembl/GENCODE
• Differences in annotation make it hard for researchers to exchange data or translate co-ordinates (e.g. HGVS variants)
What’s the solution?
• Identify a representative transcript that captures the most information about each protein-coding gene (not just the longest/first one)
• Revise annotation in RefSeq and GENCODE sets to match overall splicing structure, CDS and precise 5’ and 3’ boundaries
• Create a common geneset for all applications

Figure label: Annotators review and edit models

MANE project: Matched Annotation from NCBI and EMBL-EBI
• A transcript set with the following attributes:
  - Match to GRCh38
  - One MANE Select transcript per locus
  - 100% identical between the RefSeq and corresponding Ensembl transcript for 5’ UTR, CDS, and 3’ UTR
• Tiers:
  - MANE Select: one per gene, representative of biology at each locus; well-supported, expressed, conserved
  - MANE Plus: alternate transcripts to capture key aspects of gene structure
  - MANE Extended: additional transcripts that match
• Fairly stable, but will allow updates when necessary
All the transcripts we annotate should always be considered, and we are certainly NOT saying that biology can be simplified to a single transcript at each genomic locus.

Step 1: Selecting transcripts
• Compare all transcripts annotated independently by RefSeq and Ensembl
Independent pipelines:
• RefSeq Select Pipeline
  - Expression
  - Conservation
  - Representation in UniProt and Ensembl
  - Length
  - Prior manual curation (LRG)
• Ensembl Select Pipeline
  - Length
  - Expression
  - Conservation
  - Representation in UniProt and RefSeq
  - Coverage of pathogenic variants

Current status of MANE project
Goals: Phase 1: End 2018 >50%; Phase 2: End 2019 >90%
• Bin 1: Identical (15%)
• Bin 2: Same CDS, but different UTR length or splicing pattern
• Bin 3: Different CDS, with or without different UTR length or splicing pattern
Other figure labels: Work in progress; 53%; 85%; Identical splicing and CDS

Step 2: Selecting UTRs, 5’ end
Example: KNG1 in NCBI’s Genome Data Viewer
Tracks: Ensembl, RefSeq, RNAseq, CAGE counts
Figure labels: Longest; Longest; Strongest
CAGE = Cap Analysis of Gene Expression, developed by RIKEN. This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, which give a quantification of RNA abundance.

Step 2: Selecting UTRs, 3’ end
Example: REM2 in NCBI’s Genome Data Viewer
Tracks: RefSeq, Ensembl, cDNA and ESTs, RNAseq, PolyA counts
Figure labels: Longest; Longest; Strong
PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript.

Getting from 53% to 90%
Pipelines selecting same transcript for ~75% of genes
• Bin 2: Same CDS, but different UTR length or splicing pattern
• Predominantly alternative splicing in 5’ UTR
• Missing data: no CAGE for ~17% of genes; no polyA for ~28% of genes

Getting from 53% to 90%
• Bin 3 = Pipelines picked different CDS
• Manual review of several genes to understand discrepancies
• Improve pipelines, based on review
• This is the hardest bin!
• In some cases, only manual review will be able to decipher the correct answer.
• In other cases, there is no right answer. Either one could be selected. This is biology!

Summary and future plans
GENCODE geneset for human and mouse:
• Complete QC for mouse genome
• Lessons learned from human first pass
• Protein coding genes
• Pseudogenes and retrogenes
• Plan for GRCm39
MANE project:
• Clinical data and refinement for human
• Phase 1 (release 0.5) Spring 2019
• Further integration into Ensembl
• Streamline merge process
TAGENE extension and refinement:
• Computational analyses with manual guidance
• Mouse Update cycle
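
The three bins used when comparing RefSeq and Ensembl transcripts in Step 1 (identical; same CDS but different UTR length or splicing; different CDS) can be illustrated with a minimal Python sketch. This is not the MANE pipeline itself: the `TranscriptModel` class, the `mane_bin` function and the coordinates are hypothetical, and the sketch assumes both transcripts sit on the same chromosome and strand of GRCh38.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end) genomic coordinates, inclusive


@dataclass(frozen=True)
class TranscriptModel:
    """Hypothetical, simplified transcript (same chromosome and strand assumed)."""
    exons: Tuple[Interval, ...]  # full exon structure, UTRs included
    cds: Tuple[Interval, ...]    # coding segments only


def _normalise(intervals: Iterable[Interval]) -> Tuple[Interval, ...]:
    # Sort so that the comparison does not depend on the order of the input.
    return tuple(sorted(intervals))


def mane_bin(refseq: TranscriptModel, ensembl: TranscriptModel) -> int:
    """Return 1, 2 or 3 following the bin definitions on the slide."""
    if _normalise(refseq.cds) != _normalise(ensembl.cds):
        return 3  # different CDS, with or without UTR differences
    if _normalise(refseq.exons) == _normalise(ensembl.exons):
        return 1  # identical splicing, CDS and UTR boundaries
    return 2      # same CDS, different UTR length or splicing pattern


# Illustrative coordinates only; they do not describe a real gene.
refseq_tx = TranscriptModel(exons=((100, 200), (300, 450)),
                            cds=((150, 200), (300, 400)))
ensembl_longer_utr = TranscriptModel(exons=((80, 200), (300, 450)),
                                     cds=((150, 200), (300, 400)))
ensembl_other_cds = TranscriptModel(exons=((100, 200), (300, 450)),
                                    cds=((160, 200), (300, 400)))

if __name__ == "__main__":
    for name, tx in [("identical", refseq_tx),
                     ("longer 5' UTR", ensembl_longer_utr),
                     ("different CDS", ensembl_other_cds)]:
        print(name, "-> bin", mane_bin(refseq_tx, tx))
```

Checking the CDS first mirrors the slide's ordering of difficulty: a CDS mismatch puts a pair in Bin 3 regardless of its UTRs, whereas Bin 2 pairs differ only outside the coding region, which is where the CAGE and polyA evidence from Step 2 applies.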
MANE Plus
Example: DST
Figure labels: P2 brain; P3 fibroblast; P4 myoblast; Capturing a larger set of

[email protected]

GENCODE Acknowledgements
Groups named: Ensembl-HAVANA; TGMI; GENCODE Consortium; Zmap/Otter
Names listed: Joannella Morales; Roderic Guigo, CRG; Adam Frankish; Ruth Bennett; Julien Legarde; If Barnes; Claire Davidson; Barbara Uszczynski; Andrew Berry; Mike
