Proteogenomics for accurate assembly and annotation of genomes

Harsha Gowda, Ph.D.
QIMR Berghofer Medical Research Institute
The University of Queensland

Exponential growth of genome sequences in NCBI

https://www.ncbi.nlm.nih.gov/refseq/statistics/

Human genome timeline

https://www.genome.gov

Draft map of the human genome

Science. (2001) 291(5507):1304-51
Nature. (2001) 409(6822):860-921

Estimates of the number of human protein-coding genes over the years

Lewin B., Genes IV (Oxford Univ. Press, Oxford, 1990) - 100,000 genes

Antequera et al. (1993) PNAS - 80,000 genes

Liang et al. (2000) Nature Genetics - 120,000 genes

Human Genome Sequencing Consortium (2001) Nature - 30,000–35,000 genes

Venter et al. (2001) Science - ~26,500 genes

Mouse Genome Sequencing Consortium (2002) Nature - ~30,000 genes

Clamp et al. (2007) PNAS - ~20,500 genes

None of these estimates was based on an approach that directly monitors proteins!

Basic approaches to genome annotation

Nat Rev Genet. (2012) 13(5):329-42.

Gene prediction and gene annotation

Nat Rev Genet. (2012) 13(5):329-42.

A snapshot of a genome browser

https://cgl.genomics.ucsc.edu/

Gene size as a function of genome size

Nat Rev Genet. (2012) 13(5):329-42.

Human genome characteristics

• Human genome: 3 billion bp in the haploid genome, 6 billion bp in the diploid genome

• ~20,000–21,000 protein-coding genes

• Transcripts from these genes add further complexity through alternative splicing and RNA editing events

• Protein-level complexity is greater still because of various post-translational modifications

• Several ncRNAs, including miRNAs, have distinct functional roles in a cell

One gene, many proteins

[Figure: Gene → mRNA → Protein. Alternative splicing of one gene gives several mRNAs (mRNA1 to mRNA5), and co/post-translational processing gives many proteins (Protein1, Protein2, ... Proteini).]

Genome sequencing and annotation have a significant impact on the discovery of genes associated with diseases

Genes linked to isolated intellectual disability (ID) and ID-associated disorders

Nat Rev Genet. 2016 Jan;17(1):9-18

Diagnostic yield for ID over time

Nat Rev Genet. 2016 Jan;17(1):9-18

Proteogenomics approach to annotating the human genome

• What are our options if we want to directly monitor proteins?

• How can the cutting-edge technologies available today be used to obtain the best possible protein measurements from human samples?

• What should be the study design?

• What are the challenges?

Most commonly used methods for monitoring proteins

• Mass spectrometry can be used to sequence peptides/proteins in a completely unbiased and untargeted manner, much like RNA-Seq for RNA

• In other words, during sequencing the method does not distinguish between a known and an unknown protein
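A minimal sketch of why the measurement itself is independent of annotation: the instrument records mass-to-charge ratios (m/z), and these can be computed for any peptide sequence, known or unknown. The residue masses are standard monoisotopic values; the function names and the example peptide PEPTIDE are illustrative assumptions, not taken from the slides.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # mass of one proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def peptide_mz(sequence, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


# Hypothetical example peptide (not from the slides).
for z in (1, 2, 3):
    print(z, round(peptide_mz("PEPTIDE", z), 4))
```

Nothing in the calculation asks whether the peptide is present in a protein database; that dependence enters only later, when spectra are matched to candidate sequences.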

LC elution profile of a complex mixture of peptides

[Figure: relative intensity (0–100) against time (0–120 min)]

A mass spectrum of eluted peptides at 61.2 min

[Figure: relative intensity (0–100) against m/z (300–1650); annotated "1 second"]

Determining challenges associated with proteogenomics

• Kelkar et al., Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry (2011) Molecular and Cellular Proteomics.

• Chaerkady et al., A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry (2011) Genome Research.

• Prasad et al., Proteogenomic analysis of Candida glabrata using high resolution mass spectrometry (2012) Journal of Proteome Research.

• Pawar et al., A proteogenomic approach to map the proteome of an unsequenced pathogen - Leishmania donovani (2012) Proteomics.

• Selvan et al., Proteogenomic analysis of pathogenic yeast Cryptococcus neoformans using high resolution mass spectrometry (2014) Clinical Proteomics.

Near-complete proteome coverage in single-celled organisms, and several novel protein-coding regions in every organism we studied

Novel gene identification in simpler organisms

Novel protein in C. glabrata

J Proteome Res. (2012) 11(1):247-60

Initial goals of the Human Proteome Map project

• To obtain experimental evidence for most of the annotated human proteins in public databases

• To establish a baseline of normal human proteome using histologically normal tissues and primary cells

• To generate a high quality mass spectrometry dataset usable by the community

• To validate human proteome annotation in public databases and enrich it further with experimentally derived data

• To develop new methods for finding novel proteins that have evaded detection

Human tissues used in the study

Nature (2014) 509, 575–581

Mass spectrometry derived data

Conventional workflow: evidence for proteins annotated in public databases

Proteogenomics workflow: novel protein coding regions

Nature (2014) 509, 575–581

Sample processing and mass spectrometry

Nature (2014) 509, 575–581

Mass spectrometry data

• >2,000 LC-MS/MS runs

• >25 million tandem mass spectra acquired

• ~293,000 non-redundant peptides were identified at <1% FDR

• Median mass measurement error ~260 ppb

• Median number of peptides identified per gene/protein: 10

• Median number of tandem mass spectra acquired per gene/protein: 37
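Two of the figures above, the <1% FDR and the ~260 ppb mass error, can be illustrated with a short sketch. The slides do not say how the FDR was estimated; the target-decoy approach shown here is a common choice and is an assumption, as are the function names, the score scale, and the toy data.

```python
def mass_error_ppb(observed_mz, theoretical_mz):
    """Relative mass measurement error in parts per billion."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e9


def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Lowest score cutoff whose estimated FDR stays within max_fdr.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    FDR is estimated as decoys / targets among matches at or above the
    cutoff (target-decoy strategy; an assumption, not stated on the slides).
    Returns None if no cutoff satisfies the threshold.
    """
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy data: 100 target matches scored 100..1 and two decoy matches.
toy_psms = [(float(s), False) for s in range(1, 101)]
toy_psms += [(50.5, True), (20.5, True)]
print(score_cutoff_at_fdr(toy_psms))             # cutoff for the toy data
print(round(mass_error_ppb(500.00013, 500.0)))   # an error of 0.13 mDa at m/z 500
```

An error of 260 ppb corresponds to 0.26 ppm, which is why the second call uses a deviation of only 0.00013 at m/z 500.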

Human proteome coverage

• Identified protein coding evidence for 17,294 genes annotated in the human genome

• ~84% of annotated human protein-coding genes

• Protein coding evidence for 2,535 of the total 3,844 genes designated “missing proteins” by the neXtProt database
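A quick arithmetic check of the coverage figures above, as a sketch. The variable names are illustrative; the only inputs are the three numbers on the slide plus the stated ~84%.

```python
genes_with_evidence = 17294      # genes with protein coding evidence
stated_coverage = 0.84           # "~84%" from the slide
missing_found = 2535             # "missing proteins" now observed
missing_total = 3844             # all neXtProt "missing proteins"

# Size of the annotated gene set implied by 17,294 being ~84% of it.
implied_annotated_genes = genes_with_evidence / stated_coverage

# Share of the "missing proteins" for which evidence was found.
missing_fraction = missing_found / missing_total

print(round(implied_annotated_genes))      # about 20,600 genes
print(round(missing_fraction * 100, 1))    # about two thirds, in percent
```

The implied total of roughly 20,600 genes sits inside the ~20,000–21,000 range quoted earlier for human protein-coding genes.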

Comparison with the two largest human peptide-based resources in the public domain

Nature (2014) 509, 575–581

Proteins encoded by “housekeeping genes” are an abundant fraction in all proteomics studies

Nature (2014) 509, 575–581

Relative abundance of proteins in the human proteome

Nature (2014) 509, 575–581

Tissue-specific signatures revealed by large-scale proteomics

Nature (2014) 509, 575–581

Isoform-specific expression pattern

Illumina’s Body Map shows a similar pattern for Fyn based on RNA-Seq data

A web-based portal to explore the human proteome dataset

http://www.humanproteomemap.org/

Nature (2014) 509, 575–581

Dependence on predefined protein sequence databases for identifying peptides/proteins – a fundamental limitation of current proteomics workflow

Mass spectrometry-guided exploration of protein coding regions in the genome

Proteogenomic analysis to identify novel protein features not represented in protein databases

[Figure: spectra divided into matched spectra and unmatched spectra]
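One common way to look beyond a predefined protein database, sketched here as an assumption because the slides do not spell out how the search space was built, is to translate the genome in all six reading frames and search the unmatched spectra against the result. The function names and the nine-base example sequence are illustrative only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical toy sequence, not a real locus.
for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

In practice the translated frames would be cut at stop codons and digested in silico before searching, which greatly enlarges the search space relative to a curated protein database.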

Nature (2014) 509, 575–581

Proteogenomic analysis

Nature (2014) 509, 575–581

Representative example of a novel ORF

Expressed in 25 of the 30 tissues analyzed in this study

Nature (2014) 509, 575–581

Novel ORFs in alternative reading frames of existing transcripts
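A minimal sketch of how candidate ORFs in alternative reading frames of a transcript could be enumerated. This illustrates the idea only and is not the method used in the study; the function names, the `cds_start` parameter (0-based position of the annotated start codon), and the toy transcript are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(mrna, min_codons=2):
    """List (frame, start, end) for ATG...stop ORFs in the three forward frames.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts the codons before the stop codon.
    """
    mrna = mrna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                      # first ATG since the last stop
            elif codon in STOP_CODONS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


def alternative_frame_orfs(mrna, cds_start, min_codons=2):
    """ORFs that lie in a different frame from the annotated CDS."""
    annotated_frame = cds_start % 3
    return [orf for orf in find_orfs(mrna, min_codons) if orf[0] != annotated_frame]


# Hypothetical toy transcript with its annotated CDS starting at position 0.
toy_mrna = "ATGAAATGCCCTAGGTAA"
print(find_orfs(toy_mrna))
print(alternative_frame_orfs(toy_mrna, cds_start=0))
```

Peptides that map uniquely to the translation of such an alternative-frame ORF, and not to the annotated protein, are the kind of evidence the slide title refers to.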

Nature (2014) 509, 575–581

A representative example demonstrating protein coding evidence for a ncRNA

NR_027693 - PPARGC1 and ESRR induced regulator, muscle 1 (PERM1), long non-coding RNA

Nature (2014) 509, 575–581

A peptide in the intergenic region between the NDUFV3 and PKNOX1 genes

Protein coding evidence for an annotated “pseudogene”

Protein coding evidence for pseudogene MYO15B

[Figure tracks: MYO15B pseudogene; identified peptides; RNA-Seq evidence; conservation in mammals]

Domain prediction for translated MYO15B pseudogene and sequence alignment with MYO15A

SMART domains predicted for translated MYO15B: MyTH4, SH3, MyTH4 (identified peptides are marked on the figure)

MYO15B pseudogene    24  PLAKPLTQLDGDNPQRALDINKVMLRLLGDGSLESWQRQIMGAYLVRQGQCRPGLRNELF  83
                         PL  PLTQL  ++   A+ I K++LR +GD  L   +  I G Y+V++G   P LR+E+
MYO15A             2069  PLRTPLTQLPAEHHAEAVSIFKLILRFMGDPHLHGARENIFGNYIVQKGLAVPELRDEIL  2128
//

MYO15B pseudogene   144  LGALEQS-QLASGATRAHPPTQLEWLAGWRRGRMALDVFTFSEECYSAEVESWTTGEQLA  202
                         + A+ ++ Q  SGA R  PPTQLEW A + +  MALDV  F+ + +S  V SW+TGE++A
MYO15A             2189  MQAMGRAQQQGSGAARTLPPTQLEWTATYEKASMALDVGCFNGDQFSCPVHSWSTGEEVA  2248
//

MYO15B pseudogene   960  DLIKLLP-----------------VATLEP--------GWQFGSAGGRSGLFPADIVQPA  994
                         D+I L P                 V  LE         GW+FG+  GR G FP+++VQPA
MYO15A             2892  DIIHLQPLEPPRVGYSAGCVVRRKVVYLEELRRRGPDFGWRFGTIHGRVGRFPSELVQPA  2951

MYO15B pseudogene   995  AAPDFSFSKEQRSGWHKGQLSNGEPGLARWDRASERPAHPWSQAHSDDSEATSLSSVAYA  1054
                         AAPDF     +        ++      A       R   P  +A S D    +L+
MYO15A             2952  AAPDFLQLPTEPGRGRAAAVAAAVASAAAAQEVGRRREGPPVRARSADHGEDALA-----  3011
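The degree of similarity in an aligned block can be quantified as percent identity. The sketch below does this for the first block above (MYO15B pseudogene residues 24–83 against MYO15A residues 2069–2128); the function name is an illustrative assumption, and the calculation counts exact matches only, ignoring the conservative substitutions marked "+" in the alignment.

```python
def alignment_identity(query, subject):
    """Return (identical positions, aligned positions) for two gapped rows.

    Columns containing a gap ('-') in either row are excluded from both counts.
    """
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    aligned = [(q, s) for q, s in zip(query, subject) if q != "-" and s != "-"]
    identical = sum(q == s for q, s in aligned)
    return identical, len(aligned)


# First alignment block from the slide.
myo15b = "PLAKPLTQLDGDNPQRALDINKVMLRLLGDGSLESWQRQIMGAYLVRQGQCRPGLRNELF"
myo15a = "PLRTPLTQLPAEHHAEAVSIFKLILRFMGDPHLHGARENIFGNYIVQKGLAVPELRDEIL"

identical, aligned = alignment_identity(myo15b, myo15a)
print(f"{identical}/{aligned} identical ({100 * identical / aligned:.0f}%)")
```

For this block the two sequences agree at 24 of 60 aligned positions, i.e. 40% identity over the stretch shown.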

Widespread protein expression pattern of some of the annotated “pseudogenes” in the human genome

Expressed Pseudogenes in the Transcriptional Landscape of Human Cancers

Cell 149, 1622–1634 (2012)

Summary of human proteome analysis

Nature (2014) 509, 575–581

Cell. 2015 Feb 12;160(4):595-606.

Discovery of novel protein coding regions in the human genome

Nat Commun. (2018) 9(1):903

A typical proteogenomics workflow to discover and annotate protein coding regions and associated variants in the genome
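One step that such workflows typically include is in silico digestion of candidate protein sequences, so that theoretical peptides can be compared with the peptides measured by mass spectrometry. The sketch below assumes trypsin, which the slides do not name, with the usual rule of cleaving after K or R unless the next residue is P; the function name, parameters, and toy sequence are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """In silico trypsin digestion (assumed enzyme; not named on the slides).

    Cleaves after K or R except when followed by P, and returns peptides
    containing up to `missed_cleavages` uncleaved sites.
    """
    if not protein:
        return []
    # Cleavage boundaries, including the two ends of the protein.
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for i in range(len(sites) - 1):
        last = min(i + 2 + missed_cleavages, len(sites))
        for j in range(i + 1, last):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides


# Hypothetical toy sequence.
print(tryptic_digest("MKRPAKLLR"))
print(tryptic_digest("MKRPAKLLR", missed_cleavages=1))
```

Applied to translated novel ORFs or variant-containing sequences, the resulting peptide list defines what the search engine can possibly identify.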

Nat Commun. (2018) 9(1):903

Discovery of novel protein coding regions in the human genome

Nat Commun. (2018) 9(1):903

Unannotated protein coding loci found in A431 cells

Nat Commun. (2018) 9(1):903

Examples of databases that are complementing and extending the annotation of the human genome

Gene expression

http://www.gtexportal.org/

Catalogue of Somatic Mutations in Cancer

http://cancer.sanger.ac.uk/cosmic

http://www.proteinatlas.org/

Thank you
