Proteogenomics for accurate assembly and annotation of genomes
Harsha Gowda, Ph.D.
QIMR Berghofer Medical Research Institute
The University of Queensland
Exponential growth of genome sequences in NCBI
https://www.ncbi.nlm.nih.gov/refseq/statistics/
Human genome project timeline
https://www.genome.gov
Draft map of the human genome
Science. (2001) 291(5507):1304-51; Nature. (2001) 409(6822):860-921
Estimation of number of human protein coding genes over the years
Lewin B., Genes IV (Oxford Univ. Press, Oxford, 1990) - 100,000 genes
Antequera et al. (1993) PNAS - 80,000 genes
Liang et al. (2000) Nature Genetics - 120,000 genes
Human Genome Sequencing Consortium (2001) Nature - 30,000–35,000 genes
Venter et al. (2001) Science - ~26,500 genes
Mouse Genome Sequencing Consortium (2002) Nature - ~30,000 genes
Clamp et al. (2007) PNAS - ~20,500 genes
None of these estimates was based on an approach that directly monitors proteins!
Basic approaches to genome annotation
Nat Rev Genet. (2012) 13(5):329-42.
Gene prediction and gene annotation
Nat Rev Genet. (2012) 13(5):329-42.
A snapshot of a genome browser
https://cgl.genomics.ucsc.edu/
Gene size as a function of genome size
Nat Rev Genet. (2012) 13(5):329-42.
Human genome characteristics
• Human genome – 3 billion bps haploid genome, 6 billion bps diploid genome
• ~ 20,000 – 21,000 protein coding genes
• Transcripts from these genes add further complexity through alternative splicing and RNA editing
• Proteome-level complexity is greater still because of various post-translational modifications
• Several ncRNAs, including miRNAs, have distinct functional roles in a cell
One gene, many proteins
[Figure: Gene, mRNA, Protein. Through alternative splicing, one gene yields several mRNAs (mRNA1 to mRNA5, ...); through co/post-translational processing, these yield many proteins (Protein1 to Protein6, ..., Proteini)]
Genome sequencing and annotation have a significant impact on the discovery of genes associated with diseases
Genes linked to isolated intellectual disability (ID) and ID-associated
Nat Rev Genet. 2016 Jan;17(1):9-18
Diagnostic yield for ID over time
Nat Rev Genet. 2016 Jan;17(1):9-18
Proteogenomics approach to annotate the human genome
• What are our options if we want to directly monitor proteins?
• How can we use today's cutting-edge technologies to obtain the best possible protein measurements from human samples?
• What should be the study design?
• What are the challenges?
Most commonly used methods for monitoring proteins
• Mass spectrometry can be used to sequence peptides/proteins in a completely unbiased and untargeted manner, much as RNA-Seq does for RNA
• This means that, during sequencing, the method does not distinguish between a known and an unknown protein
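To make the measurement concrete: what the instrument records for each peptide ion is a mass-to-charge ratio (m/z), which depends only on the peptide's composition and charge, not on whether the protein is annotated. The sketch below is illustrative only. It assumes standard monoisotopic residue masses for unmodified peptides, and the function names and the example sequence PEPTIDE are hypothetical, not taken from the slides.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O accounts for the free N- and C-termini
PROTON = 1.00728   # mass of one proton


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in seq) + WATER


def peptide_mz(seq, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(seq) + charge * PROTON) / charge


if __name__ == "__main__":
    for z in (1, 2, 3):
        print(z, round(peptide_mz("PEPTIDE", z), 4))
```

Leucine and isoleucine have identical masses in this table, which is why the two residues cannot be told apart by mass alone.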
LC elution profile of a complex mixture of peptides
[Figure: relative intensity (0 to 100) against time (0 to 120 min)]
A mass spectrum of eluted peptides at 61.2 min
[Figure: relative intensity (0 to 100) against m/z (300 to 1650); labeled "1 second"]
Determining challenges associated with proteogenomics
• Kelkar et al., Proteogenomic analysis of Mycobacterium tuberculosis by high resolution mass spectrometry (2011) Molecular and Cellular Proteomics.
• Chaerkady et al., A proteogenomic analysis of Anopheles gambiae using high-resolution Fourier transform mass spectrometry (2011) Genome Research.
• Prasad et al., Proteogenomic analysis of Candida glabrata using high resolution mass spectrometry (2012) Journal of Proteome Research.
• Pawar et al., A proteogenomic approach to map the proteome of an unsequenced pathogen - Leishmania donovani (2012) Proteomics.
• Selvan et al., Proteogenomic analysis of pathogenic yeast Cryptococcus neoformans using high resolution mass spectrometry (2014) Clinical Proteomics.
Near-complete proteome coverage in single-celled organisms, and several novel protein coding regions in all the organisms we studied
Novel gene identification in simpler organisms
Novel protein coding region in C. glabrata
J Proteome Res. (2012) 11(1):247-60
Initial Goals of Human Proteome Map Project
• To obtain experimental evidence for most of the annotated human proteins in public databases
• To establish a baseline of normal human proteome using histologically normal tissues and primary cells
• To generate a high quality mass spectrometry dataset usable by the community
• To validate human proteome annotation in public databases and enrich it further with experimentally derived data
• To develop new methods for finding novel proteins that have evaded detection
Human tissues used in the study
Nature (2014) 509, 575–581
Mass spectrometry derived data
[Figure: the conventional workflow yields evidence for proteins annotated in public databases; the proteogenomics workflow yields novel protein coding regions]
Nature (2014) 509, 575–581
Sample processing and mass spectrometry
Nature (2014) 509, 575–581
Mass spectrometry data
• >2,000 LC-MS/MS runs
• >25 million tandem mass spectra acquired
• ~293,000 non-redundant peptides were identified at <1% FDR
• Median mass measurement error ~260 ppb
• Median number of peptides identified per gene/protein – 10
• Median number of tandem mass spectra acquired per gene/protein - 37
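The "<1% FDR" and "~260 ppb" figures above can be made concrete with a short sketch. It assumes the widely used target-decoy strategy for estimating FDR (the slides do not state which estimation method was used). The function names and toy scores are hypothetical.

```python
def ppb_error(observed_mz, theoretical_mz):
    """Mass measurement error in parts per billion (ppb)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e9


def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Find a score cutoff for peptide-spectrum matches (PSMs).

    psms: list of (score, is_decoy) pairs; a higher score is a better match.
    FDR at each rank is estimated as (#decoys / #targets) among all PSMs
    scoring at least as well. Returns the lowest score at which the
    estimate is still <= max_fdr, or None if no cutoff qualifies.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


if __name__ == "__main__":
    # Toy data: 250 target hits with high scores, then three decoy hits.
    toy = [(1000 - i, False) for i in range(250)]
    toy += [(10, True), (9, True), (8, True)]
    print(score_cutoff_at_fdr(toy))
    print(round(ppb_error(500.00013, 500.0), 1))
```

For scale, an observed m/z of 500.00013 against a theoretical value of 500.00000 corresponds to an error of about 260 ppb, that is, 0.26 ppm.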
Human proteome coverage
• Identified protein coding evidence for 17,294 genes annotated in the human genome
• ~84% of annotated human protein-coding genes
• Protein coding evidence for 2,535 of the total 3,844 genes designated “missing proteins” by the neXtProt database
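As a quick consistency check on the coverage figures above, the following arithmetic uses only the numbers on the slide. The variable names are illustrative, and the "implied" gene count is a back-calculation from the rounded 84%, not a figure reported in the study.

```python
genes_with_evidence = 17_294
fraction_of_annotated = 0.84   # "~84%" as stated on the slide

# Back-calculate how many annotated protein-coding genes the 84% implies.
implied_annotated = genes_with_evidence / fraction_of_annotated

missing_found = 2_535
missing_total = 3_844
missing_fraction = missing_found / missing_total

print(f"Implied annotated protein-coding genes: ~{implied_annotated:,.0f}")
# prints: Implied annotated protein-coding genes: ~20,588
print(f"'Missing proteins' with evidence: {missing_fraction:.1%}")
# prints: 'Missing proteins' with evidence: 65.9%
```

The implied total of roughly 20,600 genes falls within the ~20,000 – 21,000 range quoted earlier for human protein-coding genes.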
Comparison with two largest human peptide-based resources in the public domain
Nature (2014) 509, 575–581
Proteins encoded by “housekeeping genes” are an abundant fraction of all proteomics studies
Nature (2014) 509, 575–581
Relative abundance of proteins in the human proteome
Nature (2014) 509, 575–581
Tissue-specific signatures revealed by large-scale proteomics
Nature (2014) 509, 575–581
Isoform-specific expression pattern
Illumina’s Body Map shows a similar pattern for Fyn based on RNA-Seq data
A web-based portal to explore the human proteome dataset
http://www.humanproteomemap.org/
Nature (2014) 509, 575–581
Dependence on predefined protein sequence databases for identifying peptides/proteins – a fundamental limitation of the current proteomics workflow
Mass spectrometry guided exploration of protein coding regions in the genome
Proteogenomic analysis to identify novel protein features not represented in protein databases
[Figure: spectra are divided into matched spectra and unmatched spectra]
Nature (2014) 509, 575–581
Proteogenomic analysis
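One common way to search spectra without relying on a predefined protein database is to translate the genome in all six reading frames and use the resulting sequences as the search database. The sketch below illustrates that idea only. It assumes the standard genetic code, the function names are hypothetical, and it is not the pipeline used in the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translation(dna):
    """Return {frame label: protein string} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def stop_to_stop_segments(dna, min_length=7):
    """Stop-to-stop stretches of at least min_length residues, per frame."""
    segments = []
    for frame, protein in six_frame_translation(dna).items():
        for segment in protein.split("*"):
            if len(segment) >= min_length:
                segments.append((frame, segment))
    return segments


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCAAGTAA").items():
        print(frame, protein)
```

Because the six-frame database is far larger than an annotated proteome, searches against it produce more chance matches, so peptide identifications from such searches typically call for stricter filtering and manual inspection.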
Nature (2014) 509, 575–581
Representative example of a novel ORF
Expressed in 25 of the 30 tissues analyzed in this study
Nature (2014) 509, 575–581
Novel ORFs in alternative reading frames of existing transcripts
Nature (2014) 509, 575–581
A representative example demonstrating protein coding evidence for a ncRNA
NR_027693 - PPARGC1 and ESRR induced regulator, muscle 1 (PERM1), long non-coding RNA
Nature (2014) 509, 575–581
A peptide in the intergenic region between NDUFV3 and PKNOX1 genes
Protein coding evidence for an annotated “pseudogene”
Protein coding evidence for pseudogene MYO15B
[Figure: tracks for the MYO15B pseudogene showing identified peptides, RNA-Seq evidence, and conservation in mammals]
Domain prediction for translated MYO15B pseudogene and sequence alignment with MYO15A
[Figure: SMART domains predicted for translated MYO15B (MyTH4, SH3, MyTH4), with identified peptides]
MYO15B pseudogene   24 PLAKPLTQLDGDNPQRALDINKVMLRLLGDGSLESWQRQIMGAYLVRQGQCRPGLRNELF 83
                       PL  PLTQL  ++   A+ I K++LR +GD  L   +  I G Y+V++G   P LR+E+
MYO15A            2069 PLRTPLTQLPAEHHAEAVSIFKLILRFMGDPHLHGARENIFGNYIVQKGLAVPELRDEIL 2128
//
MYO15B pseudogene  144 LGALEQS-QLASGATRAHPPTQLEWLAGWRRGRMALDVFTFSEECYSAEVESWTTGEQLA 202
                       + A+ ++ Q  SGA R  PPTQLEW A + +  MALDV  F+ + +S  V SW+TGE++A
MYO15A            2189 MQAMGRAQQQGSGAARTLPPTQLEWTATYEKASMALDVGCFNGDQFSCPVHSWSTGEEVA 2248
//
MYO15B pseudogene  960 DLIKLLP-----------------VATLEP--------GWQFGSAGGRSGLFPADIVQPA 994
                       D+I L P                 V  LE         GW+FG+  GR G FP+++VQPA
MYO15A            2892 DIIHLQPLEPPRVGYSAGCVVRRKVVYLEELRRRGPDFGWRFGTIHGRVGRFPSELVQPA 2951

MYO15B pseudogene  995 AAPDFSFSKEQRSGWHKGQLSNGEPGLARWDRASERPAHPWSQAHSDDSEATSLSSVAYA 1054
                       AAPDF     +        ++      A       R   P  +A S D    +L+
MYO15A            2952 AAPDFLQLPTEPGRGRAAAVAAAVASAAAAQEVGRRREGPPVRARSADHGEDALA----- 3011
Widespread protein expression pattern of some of the annotated “pseudogenes” in the human genome
Expressed Pseudogenes in the Transcriptional Landscape of Human Cancers
Cell 149, 1622–1634 (2012)
Summary of human proteome analysis
Nature (2014) 509, 575–581
Cell. 2015 Feb 12;160(4):595-606.
Discovery of novel protein coding regions in the human genome
Nat Commun. (2018) 9(1):903
A typical proteogenomics workflow to discover and annotate protein coding regions and associated variants in the genome
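A key step in a workflow of this kind is separating peptides already explained by annotated proteins from those that remain as candidates for novel coding regions. The sketch below is a minimal illustration. It assumes exact substring matching against a reference proteome and treats isoleucine and leucine as equivalent because they have the same mass. The function names and toy sequences are hypothetical and do not come from the cited paper.

```python
def normalize(seq):
    """Upper-case the sequence and map I to L (the two are isobaric)."""
    return seq.upper().replace("I", "L")


def split_known_and_novel(peptides, reference_proteins):
    """Partition peptides by whether any reference protein contains them.

    Returns (known, novel); input order is preserved within each list.
    """
    reference = [normalize(p) for p in reference_proteins]
    known, novel = [], []
    for peptide in peptides:
        key = normalize(peptide)
        if any(key in protein for protein in reference):
            known.append(peptide)
        else:
            novel.append(peptide)
    return known, novel


if __name__ == "__main__":
    # Toy reference proteome and identified peptides (made up for illustration).
    reference = ["MKTAYIAKQR", "GSHMLEDPVK"]
    identified = ["TAYLAK", "LEDPVK", "AAPDFK"]
    known, novel = split_known_and_novel(identified, reference)
    print("known:", known)
    print("candidate novel:", novel)
```

Peptides left in the novel list are only candidates; they still need to be mapped back to genomic coordinates and checked against alternative explanations, such as sequence variants of known proteins.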
Nat Commun. (2018) 9(1):903
Discovery of novel protein coding regions in the human genome
Nat Commun. (2018) 9(1):903
Unannotated protein coding loci found in A431 cells
Nat Commun. (2018) 9(1):903
Examples of databases that are complementing and extending the annotation of the human genome
Gene expression
http://www.gtexportal.org/
Catalogue of somatic mutations in cancer
http://cancer.sanger.ac.uk/cosmic
http://www.proteinatlas.org/
Thank you