Evaluating Peptide Mass Fingerprinting-Based Protein Identification

Perspective

Senthilkumar Damodaran1*, Troy D. Wood2, Priyadharsini Nagarajan3, and Richard A. Rabin1

1 Department of Pharmacology and Toxicology, School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY 14214, USA; 2 Department of Chemistry, University at Buffalo, Buffalo, NY 14260, USA; 3 Department of Biochemistry, School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY 14214, USA.

*Corresponding author. E-mail: [email protected]

Geno. Prot. Bioinfo. Vol. 5 No. 3–4 2007

Identification of proteins by mass spectrometry (MS) is an essential step in proteomic studies and is typically accomplished by either peptide mass fingerprinting (PMF) or amino acid sequencing of the peptide. Although sequence information from MS/MS analysis can be used to validate PMF-based protein identification, it may not be practical when analyzing a large number of proteins and when high-throughput MS/MS instrumentation is not readily available. At present, a vast majority of proteomic studies employ PMF. However, there are huge disparities in the criteria used to identify proteins using PMF. Therefore, to reduce incorrect protein identification using PMF, and also to increase confidence in PMF-based protein identification without accompanying MS/MS analysis, definitive guiding principles are essential. To this end, we propose a value-based scoring system that provides guidance on evaluating when PMF-based protein identification can be deemed sufficient without accompanying amino acid sequence data from MS/MS analysis.

Key words: peptide mass fingerprinting, Mowse, Mascot, ProFound, proteomics

Introduction

Protein identification using mass spectrometry (MS) is an essential step in studies that employ proteomic methods such as two-dimensional electrophoresis (2-DE), and is typically accomplished by either peptide mass fingerprinting (PMF) or amino acid sequencing of the peptide using tandem mass spectrometry (MS/MS). In PMF analysis, proteolytic cleavage using an enzyme such as trypsin results in a collection of peptides, which serves as a unique identifier or fingerprint of the protein. The accurate determination of the peptide mass-to-charge (m/z) ratio, typically using a matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer, allows identification of the unknown protein by matching the resulting peptide masses with the theoretical peptide masses of proteins in a database (such as NCBI and Swiss-Prot). For peptide sequence analysis, peptides obtained by proteolytic cleavage are subjected to MS/MS to fragment the peptide along the amide backbone. The amino acid sequence of the peptide is then obtained from the differences in m/z ratios for a series of daughter ions. Subsequently, the sequence ions and the intact peptide masses are matched against protein databases to identify the unknown protein.

To determine the prevalence of PMF analysis as well as the type of search algorithms employed in protein identification, we surveyed articles published in Proteomics (August 2005 to July 2007) and Proteome Science (January 2003 to September 2007). Out of the 581 articles surveyed, approximately 35% of the studies used only PMF-based protein identification, 32% used only MS/MS, and 33% used both PMF and MS/MS for protein identification. Thus, 68% of the studies utilized PMF for protein identification.

Although sequence information from MS/MS analysis can be used to validate PMF-based protein identification, it may not be practical when analyzing a large number of proteins and when high-throughput MS/MS instrumentation is not readily available. Besides, in comparison to MS/MS analysis on a single peptide, identification of proteins using PMF has two significant advantages that are often ignored. Firstly, in PMF, unsuspected post-translational modifications lead to only a marginal loss in the quality of data and do not affect the outcome. In contrast, in MS/MS, unspecified post-translational modifications can adversely affect the matching and scoring process, thus precluding unambiguous protein identification. Secondly, although MS/MS analysis can be used to determine the amino acid sequence of a peptide, that peptide may be unique to a particular protein or common to a number of different proteins (for example, enzymes from the same family). Conversely, the use of multiple peptides for protein identification in PMF allows more extensive coverage of the protein, thereby increasing the confidence in a positive identification. Thus, in some cases MS/MS analysis on a single peptide may actually be less specific than PMF.

A pertinent question when using PMF for protein identification is: at what point does one use MS/MS analysis for protein identification? A commonly used approach is to use PMF as an initial step, and then use MS/MS to corroborate proteins that are considered ambiguous. However, this brings up an interesting question: when can one conclude protein identification using PMF to be "unambiguous"? The traditional definition of "unambiguous" (having or exhibiting no ambiguity or uncertainty) can hardly be applied to PMF-based protein identification, which is probability-based, so there always remains a chance of obtaining a false positive protein match. So when used in the context of protein identification, a pragmatic definition of "unambiguous" would be "having or exhibiting little ambiguity or uncertainty". Without accompanying amino acid sequence data, can one increase the confidence in protein identification using PMF data? Western blotting can be used to confirm protein identification; however, this technique is not high-throughput, is limited by the availability of antibodies, and is impractical for analyzing large numbers of samples.

Presently, there are huge disparities in the criteria used to identify proteins using PMF. Plomion et al (1) selected a molecular weight search (Mowse) score of 71 or more as significant for a protein match, while Hoffrogge et al (2) used a Mowse score of greater than 69 as significant. On the other hand, it has been suggested that the Mowse score should be 50 more than the significance threshold level for protein identification (3). Because the Mowse score depends on the number of sequences in a database, this approach would not be appropriate for species with small database sizes. Naranjo et al (4) categorized a match as successful if a protein had a score greater than the significance threshold of the Mascot algorithm (5, 6; www.matrixscience.com) and was the top match in both the Mascot and ProFound (7, 8) algorithms. Others have considered proteins with a Mascot score of at least 95 or a ProFound score of 2.2 to be significant (9). Also, Guipaud et al (10) and Sinclair et al (11), in addition to the significant scores of protein algorithms, incorporated sequence coverage, number of peptides matched, and the congruence of isoelectric point (pI) and molecular mass of the protein with the sequence database in the evaluation of PMF data. Consequently, definitive guidelines for what constitutes a positive match are lacking.

To this end, we propose a value-based scoring system that provides guidance on evaluating when PMF-based protein identification can be deemed sufficient without accompanying amino acid sequence data from MS/MS analysis. To construct this value-based scoring system, we have used parameters that are considered important in substantiating protein identification, such as congruence of the observed pI and molecular mass with the protein sequence database, percent sequence coverage (percentage of the theoretical protein that is covered by the experimental peptide masses), number of peptides matched, and the matching scores from two different protein search engines.

Protein search algorithms

Since PMF-based protein identification is probability-based, the choice of protein search algorithms is of paramount importance. To date, thirteen different PMF search algorithms have been reported (12). There are several ways that these protein search engines differ from one another, but the fundamental difference is in the scoring algorithm that is employed. Some of the protein search engines calculate the probability of obtaining an incorrect match and report significance threshold scores to increase the confidence in a protein match, whereas others just rank possible protein matches without reporting a significance level. In our sampling of the literature, Mascot was the most commonly used protein search engine (67%), followed by MS-Fit (13) (14%) and ProFound (12%). Mascot and MS-Fit implement a probability-based scoring approach using Mowse; however, while Mascot reports scores that correspond to a 5% significance level, MS-Fit does not provide a significance level for its protein matches. ProFound, on the other hand, uses Bayesian probability and reports a significance threshold score (Z score). Incidentally, most studies that employed MS-Fit or ProFound also used Mascot for database searching.

The threshold Mowse score in Mascot is reported as −10 lg(P), where P (α value/number of database sequences) is the probability that the observed match is a random event (5, 6). Accordingly, the threshold Mowse score depends on two parameters: the number of sequences in a database and the α value used [...]

[...] increase. This issue is of greater relevance when it comes to the identification of proteins from organisms whose sequences have not been completely determined. Also, post-translational processing of proteins would increase the number of protein sequences in the databases, even for completely sequenced genomes. For instance, if the searched database contains 20,000 sequences, a protein match would be considered positive at the default α value of 0.05 when the Mowse score exceeds the threshold of 56. However, this protein match would no longer be positive if the number of sequences in the database were to increase.
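The dependence of the threshold on database size can be made concrete with a short calculation. The sketch below is an illustration only, not Mascot's implementation: it assumes that "lg" denotes the base-10 logarithm (which reproduces the threshold of 56 quoted above for 20,000 sequences at α = 0.05), and the function names mowse_threshold and is_significant are hypothetical names introduced here.

```python
import math


def mowse_threshold(num_sequences: int, alpha: float = 0.05) -> float:
    """Return -10 * log10(P), with P = alpha / number of database sequences.

    Assumption: "lg" in the text is the base-10 logarithm.
    """
    if num_sequences <= 0:
        raise ValueError("num_sequences must be a positive integer")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    p_random = alpha / num_sequences
    return -10.0 * math.log10(p_random)


def is_significant(score: float, num_sequences: int, alpha: float = 0.05) -> bool:
    """A match counts as positive when its score exceeds the threshold."""
    return score > mowse_threshold(num_sequences, alpha)


if __name__ == "__main__":
    # Hypothetical database sizes, chosen only to show the trend.
    for n in (20_000, 200_000, 2_000_000):
        print(f"{n:>9} sequences: threshold = {mowse_threshold(n):.1f}")
```

Under these assumptions, every tenfold increase in the number of database sequences raises the threshold by 10 points, so a hypothetical match scoring 60 would be positive against 20,000 sequences but not against 200,000.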
