Terminomics Methodologies and the Completeness of Reductive Dimethylation: a Meta-Analysis of Publicly Available Datasets
Total Page:16
File Type:pdf, Size:1020Kb
proteomes Technical Note Terminomics Methodologies and the Completeness of Reductive Dimethylation: A Meta-Analysis of Publicly Available Datasets Mariella Hurtado Silva 1 , Iain J. Berry 1,2, Natalie Strange 1 , Steven P. Djordjevic 2 and Matthew P. Padula 1,* 1 Proteomics Core Facility and School of Life Sciences, Faculty of Science, University of Technology Sydney, Broadway NSW 2007, Australia; [email protected] (M.H.S.); [email protected] (I.J.B.); [email protected] (N.S.) 2 The ithree Institute, Faculty of Science, University of Technology Sydney, Broadway NSW 2007, Australia; [email protected] * Correspondence: [email protected]; Tel.: +61-403-838-981 Received: 16 January 2019; Accepted: 25 March 2019; Published: 29 March 2019 Abstract: Methods for analyzing the terminal sequences of proteins have been refined over the previous decade; however, few studies have evaluated the quality of the data that have been produced from those methodologies. While performing global N-terminal labelling on bacteria, we observed that the labelling was not complete and investigated whether this was a common occurrence. We assessed the completeness of labelling in a selection of existing, publicly available N-terminomics datasets and empirically determined that amine-based labelling chemistry does not achieve complete labelling and potentially has issues with labelling amine groups at sequence-specific residues. This finding led us to conduct a thorough review of the historical literature that showed that this is not an unexpected finding, with numerous publications reporting incomplete labelling. These findings have implications for the quantitation of N-terminal peptides and the biological interpretations of these data. Keywords: terminomics; mass spectrometry; amine labelling 1. Introduction Quantitative proteomics has become a widely used tool in the biological sciences, inferring biological significance from the changes in protein abundance between varying treatments or conditions. Protein quantification studies normally use an organism’s reference genome to identify and quantify products of an organism’s open reading frames (ORFs). This approach, while useful for quantitation of predicted protein products, does not take into consideration the various post-translational modifications (PTMs) that may occur on a protein, or the presence of biologically distinct proteoforms arising from a single ORF. Identifying specific proteoforms and quantifying their abundance levels is necessary for a complete assessment of proteomic variation and its biological significance. The production of mature proteoforms is an intricate process, as outlined in Figure1[ 1–3]. ORFs by definition are inferred by bioinformatic prediction from genome sequencing projects, often without direct proteomic evidence to confirm their accuracy [4–6]. Variations resulting from chemical modifications to the nascent protein (PTMs), such as acetylation and phosphorylation, as well as primary structural modifications by proteolytic cleavage, introduces a level of proteome complexity that is often overlooked when using reference genomes to detect the presence of a particular ORF using mass spectrometry [7–9]. Considering proteolytic cleavage in particular, the most prevalent Proteomes 2019, 7, 11; doi:10.3390/proteomes7020011 www.mdpi.com/journal/proteomes Proteomes 2019, 7, 11 2 of 13 Proteomes 2018, 6, x 2 of 14 PTM in biological systems, mature proteoforms that are products of proteolysis are often incorrectly representedrepresented inin proteinprotein databasesdatabases asas aa non-mature,non-mature, directdirect translationtranslation ofof thethe ORF.ORF. OurOur extensiveextensive workwork examiningexamining MycoplasmaMycoplasma sppspp.,., consideredconsidered thethe ‘simplest’‘simplest’ self-replicatingself-replicating organismorganism yetyet discovered,discovered, hashas shownshown thatthat proteolyticproteolytic cleavagecleavage isis aa criticalcritical processprocess inin thethe generationgeneration ofof maturemature proteoformsproteoforms fromfrom largelarge ORFs,ORFs, producingproducing aa largerlarger proteomeproteome thanthan bioinformaticallybioinformatically predictedpredicted forfor thesethese genome-reducedgenome-reduced bacteriabacteria [10[10].]. In In addition, addition, proteolysis proteolysis creates creates proteoforms proteoforms that havethat have different different functionality functionality to the parentto the proteoform,parent proteoform, further extendingfurther extending proteome proteome complexity complexity [10–22]. This[10–22]. increase This inincrease proteome in proteome diversity throughdiversity proteolysis through proteolysis also occurs also in eukaryotes, occurs in eukaryotes, and this presents and this a need presents to identify a need and to characterizeidentify and proteoformscharacterize proteoforms produced through produced proteolysis through proteolysis in order to in understand order to understand proteoform proteoform diversity diversity and its effectand its on effect biological on biological systems, systems, rather than rather quantifying than quantifying the abundance the abundance of aggregated of aggregated ORF products. ORF However,products. despiteHowever, a rangedespite of a methodologies range of methodologies being available being toavailable achieve to this, achieve the methods this, the aremethods often notare ableoften to not definitively able to definitively identify theidentify diversity the diversity of proteoforms of proteoforms on a proteome-wide on a proteome-wide scale or scale in a highor in throughputa high throughput manner. manner. Figure 1. Schematic for the various methods for the production of proteoforms. Final protein Figure 1. Schematic for the various methods for the production of proteoforms. Final protein products products may be modified during transcription or during translation at the ribosome. Alternatively, may be modified during transcription or during translation at the ribosome. Alternatively, nascent nascent polypeptide chains may be modified after translation with a myriad of post-translational polypeptide chains may be modified after translation with a myriad of post-translational modifications. Once translated, the protein sequence can often require further modifications to perform modifications. Once translated, the protein sequence can often require further modifications to a specialized function, generating proteoforms that vary from the original expressed protein [2,3,23]. perform a specialized function, generating proteoforms that vary from the original expressed protein Post-translational modifications such as phosphorylation, a reversible chemical addition to the protein, [2,3,23]. Post-translational modifications such as phosphorylation, a reversible chemical addition to or proteolysis, a permanent hydrolysis event removing amino acid(s) from the polypeptide chain, cause the protein, or proteolysis, a permanent hydrolysis event removing amino acid(s) from the important functional changes in proteins [24,25]. polypeptide chain, cause important functional changes in proteins [24,25]. The original method for analyzing intact proteoforms and protein sequences was Edman degradation,The original which method allows directfor analyzing and unambiguous intact proteoforms sequencing and of protein the N-terminal sequences amino was acidsEdman of intact,degradation, purified which proteins allows [3,26 direct–28]. and Nevertheless, unambiguous Edman sequencing degradation of the is aN-terminal time-consuming amino process,acids of sequencingintact, purified one aminoproteins acid [3,26–28]. residue per Nevertheless, hour, and is Edman limited bydegradation the efficiency is a of time-consuming the chemical reagents process, to hydrolyzesequencing the one peptide amino bonds acid residue of each per subsequent hour, and residue is limited [26– by29 ].the The efficiency ideal solution of the forchemical characterizing reagents to hydrolyze the peptide bonds of each subsequent residue [26–29]. The ideal solution for characterizing intact proteoforms is direct sequencing in a complex proteome sample using intact mass spectrometry (MS), with high-resolution accurate mass measurements and MS/MS Proteomes 2019, 7, 11 3 of 13 intact proteoforms is direct sequencing in a complex proteome sample using intact mass spectrometry (MS), with high-resolution accurate mass measurements and MS/MS fragmentation. Intact protein MS/MS (or Top Down MS) analysis can determine the N-and C-terminal sequences of a proteoform by matching of spectral data to the predicted ORF, as well as detecting the presence and location of PTMs, where the difference in proteoform mass corresponds exactly to the PTM mass [30–32]. However, higher throughput and increased proteome depth (at the expense of proteoform identification) can be achieved by analyzing peptides from enzymatically digested proteins using traditional shotgun Liquid Chromatography-Tandem Mass Spectrometry (LC/MS/MS), with routine identification of ~11,000 ORF products [33–35] compared to <1500 proteoforms identified by Top Down MS [36]. In order to characterize the N- or C-terminus of the mature proteoforms, the terminal peptide must be unambiguously identified to avoid conjecture, which can be challenging if the true terminal peptide is not ‘MS friendly’ and thus undetectable due to poor ionization potential or unsuitable length after digestion. A solution for these issues which has been implemented by several