Proteogenomics from Next-Generation Sequencing (NGS
Total Page:16
File Type:pdf, Size:1020Kb
Clinica Chimica Acta 498 (2019) 38–46 Contents lists available at ScienceDirect Clinica Chimica Acta journal homepage: www.elsevier.com/locate/cca Review Proteogenomics: From next-generation sequencing (NGS) and mass spectrometry-based proteomics to precision medicine T ⁎ Mia Yang Anga,1,2, Teck Yew Lowa, ,2, Pey Yee Leea, Wan Fahmi Wan Mohamad Nazariea, Victor Guryevb, Rahman Jamala a UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, 56000 Kuala Lumpur, Malaysia b European Research Institute for the Biology of Ageing, University of Groningen, University Medical Center Groningen, Groningen 9713AD, The Netherlands ARTICLE INFO ABSTRACT Keywords: One of the best-established area within multi-omics is proteogenomics, whereby the underpinning technologies Proteogenomics are next-generation sequencing (NGS) and mass spectrometry (MS). Proteogenomics has contributed sig- Genomics nificantly to genome (re)-annotation, whereby novel coding sequences (CDS) are identified and confirmed. By Proteomics incorporating in-silico translated genome variants in protein database, single amino acid variants (SAAV) and Next-generation sequencing (NGS) splice proteoforms can be identified and quantified at peptide level. The application of proteogenomics in cancer Mass spectrometry (MS) research potentially enables the identification of patient-specific proteoforms, as well as the association of the Genomic variant efficacy or resistance of cancer therapy to different mutations. Here, we discuss how NGS/TGS data are analyzed and incorporated into the proteogenomic framework. These sequence data mainly originate from whole genome sequencing (WGS), whole exome sequencing (WES) and RNA-Seq. We explain two major strategies for sequence analysis i.e., de novo assembly and reads mapping, followed by construction of customized protein databases using such data. Besides, we also elaborate on the procedures of spectrum to peptide sequence matching in proteogenomics, and the relationship between database size on the false discovery rate (FDR). Finally, we discuss the latest development in proteogenomics-assisted precision oncology and also challenges and oppor- tunities in proteogenomics research. 1. Introduction from that, it has assisted in unraveling the undiscovered protein counterparts of annotated genes, the so-named “missing proteins” [7]. Although mass spectrometry (MS)-based proteomics is closely Importantly, by incorporating in-silico translated coding variants in linked to genomics, these two disciplines have remained in relative protein FASTAs, single amino acid variants (SAAV), post-transcriptional isolation. It is not until recently that the breaking down of the silos took modifications and splice proteoforms [8–10]; in addition to products of place, and the fusion of proteomics with genomics contributed to an alternative frames of translation initiation and termination can be emerging field called “proteogenomics” [1,2]. With proteogenomics, identified and quantified at peptide level [11,12]. biologists have produced insightful research that can neither be Precision medicine is frequently associated with genomic ap- achieved by genomics or proteomics alone. Among these, the best- proaches. Notably, in precision oncology, clinical sequencing has led to documented contribution of proteogenomics lies in the annotation of cancer genomics whereby genome sequence data are not only used for newly sequenced non-model organisms [3]. Besides, proteogenomics is diagnosis, prognosis, management and stratification of patients based applied in genome re-annotation whereby it has helped to correct mis- on cancer sub-types; but also for guiding the selection of appropriate annotated genes [4], and to detect novel coding sequences (CDS) pre- therapeutic regimes, as well as the development of new combinatorial viously thought to be non-coding; such as pseudogenes, short open and molecular targeted therapies [13]. However, it was documented reading frames (sORFs) and other non-coding RNA genes [5,6]. Apart that not all cancer patients responded to targeted therapies that are ⁎ Corresponding author at: UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia (UKM), UKM Medical Centre, Jalan Yaacob Latiff, Bandar Tun Razak, 56000 Kuala Lumpur, Malaysia. E-mail address: [email protected] (T.Y. Low). 1 Current addresses: a) Department of Gene Diagnostics and Therapeutics, National Center for Global Health and Medicine, Tokyo 162-8655, Japan b) Department of Clinical Genome Informatics, Graduate School of Medicine, The University of Tokyo, Tokyo 113-0033, Japan 2 These authors contributed equally. https://doi.org/10.1016/j.cca.2019.08.010 Received 22 May 2019; Received in revised form 13 August 2019; Accepted 13 August 2019 Available online 14 August 2019 0009-8981/ © 2019 Elsevier B.V. All rights reserved. M.Y. Ang, et al. Clinica Chimica Acta 498 (2019) 38–46 tailored based on individual tumor genome profile [14]. Furthermore, the fact that highly abundant transcripts harbor many discrepant bases, genomics alone does not fully connect genotypes to disease phenotypes, resulting in problems when ddetermining sequencing errors. Ad- as it cannot relate to protein modification, signaling pathways and the ditionally, RNAseq can be strand-specific, besides sharing of exons from microbiome [15]. Thus mutational profiles of tumors cannnot fully the same gene by spliced variants can be difficult to resolve. To counter explain or predict patient outcomes. It is precisely at this juncture that challenges in de novo transcriptome assembly, a number of de novo proteogenomics can play a role in precision medicine and clinical di- transcriptome assemblers have been made available, such as Velvet agnostics. By combining proteomics with genomics, the following issues [21], Mira [29] Trans-AbySS [30], Trinity [31], Oases [32] and can potentially be addressed: (i) the validation of whether novel, SOAPdenovo-Trans [33]. Generally, these “assemblers” are based on cancer-specific mutations are translated into proteins, (ii) the poor either the “overlap graph” algorithm or the “De-Bruijin graph” algo- correlation of mRNA abundance to protein abundance, and (iii) the rithm [34]. Reference transcriptomes that are constructed can then be identification and quantification of post-translational modifications used translated in silico into protein FASTA database for querying MS- that are known to involve in cancer development. based proteomics data. This strategy is referred to as “proteomics in- formed by transcriptomics” (PIT) [35]. 2. Next-generation sequencing (NGS) 3.2. Gene prediction Lying at the upstream of proteogenomics are next-generation se- quencing (NGS) technologies. NGS is performed to decipher the gen- For a newly-sequenced and assembled genome, gene prediction omes and the transcriptomes, two major data modalities in proteoge- (gene finding) is performed to identify regions that encode protein- nomics; but has since been expanded to the epigenomes, transcription coding genes and other functional elements within an assembled factor-bound or ribosome-bound transcripts and 3D chromosomal genome. Identification of coding sequences is an essential step for the landscape. To decipher the genomes, whole genome sequencing (WGS) construction of a protein FASTA for proteogenomics study. is used to sequence the total DNA content; so that the context and Methodology-wise, algorithms for gene predictions perform gene dis- complexity of genomic alterations including point mutations, in- covery based mainly on (i) sequence similarity search and (ii) ab initio sertion–deletion mutations (INDELs) and copy number variations gene prediction. (CNVs) can be identified [16]. Whereas, in whole exome sequencing In the first method, similarity in gene sequences is discovered by (WES), only protein-coding exonic sequences are sequenced, as only the comparing a query sequence against expressed sequence tags (ESTs), exonic fragments are enriched before sequencing [17]. As for the gene sequences, proteins or other genomes. Sequence similarities be- transcriptomes, RNA-Seq sequences allows the detection of co/post- tween certain regions are used to interpret gene structure or the func- transcriptional modifications, such as alternative spliced variants tions of the region. The software used in this approach are the BLAST (ASVs), RNA editing and fusion transcripts; as well as the quantification family of programs including DIAMOND [36] and BLAT [37]. While of gene expression levels [18]. gene prediction for prokaryotes is less complicated due to the higher gene similarity and the absence of introns, gene discovery in eukaryotes 3. Analysis of NGS data is complicated by introns that interrupt open reading frames (ORFs). On the other hand, ab initio methods rely on signal and content Post sequencing and data quality control, two major strategies are sensors [38]. Signal sensors refer to short sequence motifs, such as applied for analyzing genome data: (i) de novo assembly or (ii) align- splice sites, branch points, poly-pyrimidine tracts, start and stop co- ment (reads mapping). In strategy (i), NGS reads are assembled into dons. Exon detection must rely on the content sensors, which refer to longer sequences without the aid of a pre-existing reference [19]. Once the patterns of codon usage that are unique to a species and allow fully assembled genomes become available, a reference genome can be coding sequences to be distinguished from the surrounding non-coding built