1/25/2019

The status of GENCODE annotation

Jane Loveland PhD
Annotation Project Leader
Ensembl-HAVANA

PAG XXVII, 13 January 2019

GENCODE Manual Annotation in Ensembl

Reference genebuild
• ‘First pass’ systematic chromosome annotation
• Analysis of cDNA, ESTs
Mouse complete

Maturing genebuild
• Targeted improvement of models
• Identification of additional genes, transcripts, exons
• ‘Completion’ of models
• Correct functional annotation
Major focus of human work

TAGENE

MANE project

The HAVANA team: Manual Annotation

• Whole Genome or chromosome
• Targeted regions or genes
• GENCODE
• Community projects
Sequences from databases
Structural and functional

Annotation: Biotypes
Biotypes based on transcriptional evidence

Protein Coding: Known_CDS, Novel_CDS, Putative_CDS, Nonsense_mediated_decay

Transcript: retained intron, putative

Non-coding: lincRNA, Antisense, Sense_intronic, Sense_overlapping, 3’_overlapping_ncRNA

Pseudogene: Processed, Unprocessed, Transcribed, Translated, Unitary, Polymorphic

Immunoglobulin: IG_pseudogene, IG_Gene, TR_Gene

The GENCODE consortium

HAVANA: Manual annotation
Genebuild: Computational annotation

GENCODE gene set

The GENCODE update trackhub


Ensembl gene view: ENO1 updated annotation (more transcripts)

Walking across the mouse genome


The GENCODE consortium: current gene counts

                                  Human     Mouse
Total No of Transcripts          206694    141283
Total No of Genes                 58721     55636
  Protein-coding genes            19940     22407
  Long non-coding RNA genes       16066     13250
  Small non-coding RNA genes       7577      6108
  Pseudogenes                     14729     13376
IG/TR gene segments
  - protein coding segments         408       494
  - pseudogenes                     237       203
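As a quick illustration of what these totals imply, the sketch below computes the mean number of annotated transcripts per gene for each species. The figures are copied from the counts above; the dictionary layout and the function name are assumptions made only for this example.

```python
# Totals copied from the gene counts slide (human and mouse columns).
counts = {
    "Human": {"transcripts": 206694, "genes": 58721},
    "Mouse": {"transcripts": 141283, "genes": 55636},
}


def transcripts_per_gene(species: str) -> float:
    """Mean number of annotated transcripts per gene for one species."""
    c = counts[species]
    return c["transcripts"] / c["genes"]


for species in counts:
    ratio = transcripts_per_gene(species)
    print(f"{species}: {ratio:.2f} transcripts per gene")
```

On these numbers the human set carries roughly one more transcript per gene than the mouse set, in line with human being the main focus of the maturing genebuild.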



GRCm38 Updates

Genome issues resolved post-GRCm38

Updates as of GRCm38.p6
• 65 FIX patches
• 9 NOVEL patches

GRCm39 due summer 2019

This is A LOT of new transcript data

Within protein-coding genes…

SCN2A
Nanopore (cerebellum)
RNAseq introns

Currently 17 SCN2A transcript models
… how many more could we annotate?
… should we annotate?

This is A LOT of new transcript data … and outside of protein-coding genes

PacBio Capture-seq (non-redundant)

Existing GENCODE annotation
ENSG00000261738
ENSG00000264449


Long-read data: Better for discovering novel alternatively spliced transcripts and full-length transcripts

• PacBio-CaptureSeq (human: brain, testis, heart, liver, HeLa, K562) (mouse: brain, testis, heart, liver, E7, E15)
• SLR-RNAseq (human/mouse brain)

‘TAGENE’ workflow to aid manual annotation

Long reads and manual annotation

Manual analysis of TAGENE

TAGENE created:
• 259,964 models in coding genes
• 44,959 models in lncRNAs
• 17,025 models in intergenic space (11,506 novel genes)

1984 TAGENE models manually examined so far: ~80% ‘completely’ acceptable novel alt spliced transcripts

* No models have made it into GENCODE without manual inspection *
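To put the scale of manual review in context, the sketch below totals the TAGENE models and works out what share has been manually examined so far. All inputs are the figures quoted above; the ~80% acceptance rate is approximate, so the derived number of accepted models is only an estimate, and the variable names are chosen for this example.

```python
# TAGENE model counts as quoted on the slide.
tagene_models = {
    "coding genes": 259_964,
    "lncRNAs": 44_959,
    "intergenic space": 17_025,
}

total_models = sum(tagene_models.values())

# Models manually examined so far, and the approximate acceptance rate.
examined = 1_984
approx_acceptance = 0.80  # "~80%" on the slide, so treat as a rough figure

percent_examined = 100 * examined / total_models
approx_accepted = round(approx_acceptance * examined)

print(f"Total TAGENE models: {total_models}")
print(f"Manually examined:   {examined} ({percent_examined:.2f}% of all models)")
print(f"Roughly accepted:    {approx_accepted}")
```

Well under 1% of the models have been inspected, which is why the second round aims to scale up and let annotators focus on function.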

Future plans for 2nd round:
• More accurate splice site classification (trust, check, reject)
• Develop CDS prediction utility
• Scale up: manual annotators focus on function

Long reads and manual annotation

Annotators review and edit models

Human geneset refinement:

Two comprehensive independent human reference transcript sets (RefSeq and Ensembl/GENCODE):

~34,000 unique CCDS for 95% human protein coding genes

Why is this a problem?
• Resources use either RefSeq or Ensembl/GENCODE
• Differences in annotation make it hard for researchers to exchange data or translate co-ordinates (e.g. HGVS variants)

What’s the solution?

• Identify a representative transcript that captures the most information about each protein-coding gene (not just the longest/first one)
• Revise annotation in RefSeq and GENCODE sets to match overall splicing structure, CDS and precise 5’ and 3’ boundaries
• Create a common geneset for all applications


MANE project
Matched Annotation from NCBI and EMBL-EBI

• A transcript set with the following attributes:
  • Match to GRCh38
  • One MANE Select transcript per locus
  • 100% identical between the RefSeq and corresponding Ensembl transcript for 5’UTR, CDS, and 3’UTR
• Tiers:
  MANE Select – one per gene, representative of biology at each locus. Well-supported, expressed, conserved
  MANE Plus – alternate transcripts to capture key aspects of gene structure
  MANE Extended – additional transcripts that match
• Fairly stable, but will allow updates when necessary

All the transcripts we annotate should always be considered and we are certainly NOT saying that biology can be simplified to a single transcript at each genomic locus

Step 2: Selecting UTRs, 3’ end:

REM2 (NCBI’s Genome Data Viewer)
Tracks: RefSeq, Ensembl, cDNA and ESTs, RNAseq, PolyA counts
Labels: Longest, Longest, Strong

PolyA seq: This is data from the 3’ end. It is the sequence from the polyadenylated region of mRNA, defining the end of a transcript.

Step 1: Selecting transcripts
• Compare all transcripts annotated independently by RefSeq and Ensembl

Independent pipelines

• RefSeq Select Pipeline
  • Expression
  • Conservation
  • Representation in UniProt and Ensembl
  • Length
  • Prior manual curation (LRG)

• Ensembl Select Pipeline
  • Length
  • Expression
  • Conservation
  • Representation in UniProt and RefSeq
  • Coverage of pathogenic variants

Current status of MANE project
Goals: Phase 1: End 2018 >50%
       Phase 2: End 2019 >90%

Bin 1: Identical 15%
Bin 2: Same CDS, but different UTR length or splicing pattern
Bin 3: Different CDS, with or without different UTR length or splicing pattern

Chart labels: Work in progress; Identical splicing and CDS; 53%; 85%
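The three bins from Step 1 can be read as a simple comparison rule between the RefSeq and Ensembl picks for a gene. The sketch below is a minimal illustration of that rule, not the pipelines themselves: the `Transcript` class, its fields and `mane_bin` are hypothetical names, and exons and CDS segments are assumed to be given as (start, end) coordinates on GRCh38.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) in genomic coordinates


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript model used only for this example."""
    exons: Tuple[Interval, ...]  # every exon, including UTR sequence
    cds: Tuple[Interval, ...]    # coding segments only


def mane_bin(refseq: Transcript, ensembl: Transcript) -> int:
    """Assign a RefSeq/Ensembl transcript pair to Bin 1, 2 or 3."""
    if refseq.cds != ensembl.cds:
        return 3  # different CDS, with or without UTR differences
    if refseq.exons != ensembl.exons:
        return 2  # same CDS, but UTR length or splicing pattern differs
    return 1      # identical


# Same CDS, but the Ensembl model has a longer 5' UTR.
refseq_pick = Transcript(exons=((100, 200), (300, 400)), cds=((150, 200), (300, 350)))
ensembl_pick = Transcript(exons=((80, 200), (300, 400)), cds=((150, 200), (300, 350)))
print(mane_bin(refseq_pick, ensembl_pick))
```

Checking the CDS first mirrors the bin definitions: any CDS difference puts the pair in Bin 3 regardless of how the UTRs compare.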

Step 2: Selecting UTRs, 5’ end:

KNG1 (NCBI’s Genome Data Viewer)
Tracks: Ensembl, RefSeq, RNAseq, CAGE counts
Labels: Longest, Longest, Strongest

CAGE = Cap Analysis of Gene Expression, developed by RIKEN
This is a way of getting the full 5’ end of messenger RNA. The output of CAGE is tags, and these give a quantification of the RNA abundance.

Getting from 53% to 90%

Pipelines selecting same transcript for ~75% genes

Bin 2: Same CDS, but different UTR length or splicing pattern
• Predominantly in 5’ UTR
• Missing data:
  no CAGE ~17% genes
  no polyA ~28% genes
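One way to picture how CAGE tags could inform the choice of a 5’ end is sketched below. This is an assumed decision rule for illustration only, not the MANE pipelines' actual logic: it takes the candidate start with the strongest CAGE support and, when a gene has no CAGE data, falls back to the longest model. The function name, the plus-strand assumption and the data layout are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def pick_five_prime_end(
    candidate_starts: Sequence[int],
    cage_counts: Optional[Dict[int, int]] = None,
) -> int:
    """Choose a transcript start among candidates (plus strand assumed).

    candidate_starts: start coordinates proposed by the annotation sets.
    cage_counts: CAGE tag count per coordinate, or None when no CAGE data exist.
    """
    if not candidate_starts:
        raise ValueError("at least one candidate start is required")

    if cage_counts:
        supported = [s for s in candidate_starts if cage_counts.get(s, 0) > 0]
        if supported:
            # Strongest CAGE signal wins.
            return max(supported, key=lambda s: cage_counts[s])

    # No usable CAGE data: fall back to the longest model, which on the
    # plus strand is the most upstream (smallest) start coordinate.
    return min(candidate_starts)


print(pick_five_prime_end([100, 120, 150], {100: 3, 120: 40, 150: 7}))
print(pick_five_prime_end([100, 120, 150]))
```

The fallback branch matters because, as noted above, a sizeable share of genes has no CAGE data at all.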


Getting from 53% to 90%

• Bin 3 = Pipelines picked different CDS
• Manual review of several genes to understand discrepancies
• Improve pipelines, based on review
• This is the hardest bin!
• In some cases, only manual review will be able to decipher the correct answer.
• In other cases, there is no right answer. Either one could be selected. This is biology!

Summary and future plans:

GENCODE geneset for human and mouse:
• Complete QC for mouse genome
• Lessons learned from human first pass
• Protein coding genes
• Pseudogenes and retrogenes
• Plan for GRCm39
• MANE project
• Clinical data and refinement for human
• Phase 1 (release 0.5) Spring 2019
• Further integration into Ensembl
• Streamline merge process
• TAGENE extension and refinement
• Computational analyses with manual guidance
• Mouse
• Update cycle

[email protected]

MANE Plus
Capturing a larger set of functionally important transcripts
Figure labels: P1 epithelial, P2 brain, P3 fibroblast, P4 myoblast, P5 brain

GENCODE Acknowledgements

Groups: DST; Ensembl-HAVANA; TGMI; Zmap/Otter; Annotrack/GENCODE; Ensembl; GENCODE Consortium

Names as listed on the slide: Joannella Morales; Roderic Guigo, CRG; Adam Frankish; Ruth Bennett; Julien Legarde; If Barnes; Claire Davidson; Barbara Uszczynski; Andrew Berry; Mike Kay; Rory Johnson; Alex Bignell; Manolis Kellis, MIT; Sarah Donaldson; Irwin Jungreis; Matt Hardy; Jose Manuel Gonzalez; Michael Tress, CNIO; Toby Hunt; Alex Reymond, UNIL; Jane Loveland; Anne-Maude Ferreira; Jonathan Mudge; Paul Flicek; Mark Gerstein, Yale; Marie-Marthe Suner; Fiona Cunningham; Cristina Sisu; Mark Thomas; Carlos García Girón; Fabio Navara; Fergal Martin; Benedict Paten, UCSC; Mark Diekhans; Tim Hubbard, KCL

ftp://ngs.sanger.ac.uk/production/gencode/update_trackhub/hub.txt

Clinical Data

Human KCNMA1:
• 22 original transcripts
• Now 92 transcripts
• 25 from SLR-Seq
• SLR-Seq also extended other transcripts

Ensembl Acknowledgements

The Entire Ensembl Team

Fiona Cunningham¹, Premanand Achuthan¹, Wasiu Akanni¹, James Allen¹, M. Ridwan Amode¹, Irina Armean¹, Ruth Bennett¹, Jyothish Bhai¹, Konstantinos Billis¹, Sanjay Boddu¹, Carla Cummins¹, Claire Davidson¹, Kamalkumar Jayantilal Dodiya¹, Astrid Gall¹, Carlos García Girón¹, Laurent Gil¹, Tiago Grego¹, Leanne Haggerty¹, Erin Haskell¹, Thibaut Hourlier¹, Osagie G. Izuogu¹, Sophie H. Janacek¹, Thomas Juettemann¹, Mike Kay¹, Matthew R. Laird¹, Ilias Lavidas¹, Zhicheng Liu¹, Jane E. Loveland¹, José C. Marugán¹, Thomas Maurel¹, Aoife C. McMahon¹, Benjamin Moore¹, Joannella Morales¹, Jonathan M. Mudge¹, Michael Nuhn¹, Denye Ogeh¹, Anne Parker¹, Andrew Parton¹, Mateus Patricio¹, Ahamed Imran Abdul Salam¹, Bianca M. Schmitt¹, Helen Schuilenburg¹, Dan Sheppard¹, Helen Sparrow¹, Eloise Stapleton¹, Marek Szuba¹, Kieron Taylor¹, Glen Threadgold, Anja Thormann¹, Alessandro Vullo¹, Brandon Walts¹, Andrea Winterbottom¹, Amonida Zadissa¹, Marc Chakiachvili¹, Adam Frankish¹, Sarah E. Hunt¹, Myrto Kostadima¹, Nick Langridge¹, Fergal J. Martin¹, Matthieu Muffato¹, Emily Perry¹, Magali Ruffier¹, Daniel M. Staines¹, Stephen J. Trevanion¹, Bronwen L. Aken¹, Andrew D. Yates¹, Daniel R Zerbino¹, Paul Flicek¹,*

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom

Co-funded by the European Union

EBI is an Outstation of the European Molecular Biology Laboratory.
