Large-Scale Database Searching Using Tandem Mass Spectra: Looking up the Answer in the Back of the Book
Total Page:16
File Type:pdf, Size:1020Kb
REVIEW Large-scale database searching using tandem mass spectra: Looking up the answer in the methods back of the book Rovshan G Sadygov, Daniel Cociorva & John R Yates III .com/nature e Database searching is an essential element of large-scale proteomics. Because these .natur w methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and http://ww sequence: descriptive, interpretative, stochastic and probability–based matching. We oup r review the basic concepts used by most search algorithms, the computational modeling G of peptide identification and current challenges and limitations of this approach for protein identification. lishing b Pu An unintended consequence of whole-genome sequenc- to matching tandem mass spectra of peptides whose ing has been the birth of large-scale proteomics. What exact sequences may not be present in the database. Nature drives proteomics is the ability to use mass spectrome- Space limitations restrict a detailed description of all try data of peptides as an ‘address’ or ‘zip code’ to locate algorithms in this rapidly expanding field. Also, some 2004 proteins in sequence databases. Two mass spectrom- algorithms are proprietary, and thus, details on how © etry methods are used to identify proteins by database they work are unknown. This review should supple- search methods. The first method uses a molecular ment and update earlier reviews on database search weight fingerprint measured from a protein digested algorithms19–24. with a site-specific protease1–5. A second method uses tandem mass spectra derived from individual peptides Peptide fragmentation and data preprocessing of a digested protein6,7 (Fig. 1). Because each tandem In tandem mass spectrometry (MS/MS), gas phase mass spectrum represents an independent and verifi- peptide ions undergo collision-induced dissociation able piece of data, this approach to database searching (CID) with molecules of an inert gas such as helium has the ability to identify proteins in mixtures, enabling or argon25. Other methods of dissociation have been a rapid and comprehensive approach for the analysis of developed, such as electron capture dissociation protein complexes and other complicated mixtures of (ECD), surface induced dissociation (SID) and elec- proteins6,8–12. New biology has been discovered based tron transfer dissociation (ETD), but gas-phase CID on fast and accurate protein identification13–18. As tan- is the most widely used in commercial tandem mass dem mass spectral protein identification has proliferat- spectrometers. The dissociation pathways are strongly ed, it has become increasingly important to understand dependent on the collision energy, but the vast major- the rationale of individual database search algorithms, ity of instruments use low-energy CID (<100 eV)26. their relative strengths and weaknesses, and the math- At low collision energies fragmentation mainly occurs ematics used to match sequence to spectrum. along the peptide backbone bonds, whereas at higher In this review we discuss the prevailing fragmentation energies fragments generated by amino acid side- models, spectral preprocessing, methods to match tan- chains are observed25,27. At low-energy CID, condi- dem mass spectra to sequences and several approaches tions normally used in triple quadrupole, quadrupole Department of Cell Biology, The Scripps Research Institute, La Jolla, California 92037, USA. Correspondence should be addressed to J.R.Y. ([email protected]). PUBLISHED ONLINE 18 NOVEMBER 2004; DOI:10.1038/NMETH725 NATURE METHODS | VOL.1 NO.3 | DECEMBER 2004 | 195 REVIEW Figure 1 | Overview of the protein identification process. A protein mixture is digested, and the resulting peptides are analyzed by MS/MS to obtain experimental spectra. Search programs find database candidate sequences whose theoretical spectra are compared to the experimental spectrum. The best match (highest-scoring candidate sequence) defines the identified database peptide and the corresponding database protein. Validation software then determines whether the peptide and protein identifications are true or false. time of flight and ion trap (both linear and Paul) mass spectrom- eters, b-ions, y-ions and neutral losses of water and ammonia dominate the mass spectrum (Box 1). Fragmentation patterns are methods also strongly dependent on the chemical and physical properties of the amino acids and sequences of the peptide28,29. Most algo- rithms assume that peptides preferentially fragment into b- and y-ions. The distribution of intensities between b- and y-ions is .com/nature e the subject of intensive studies, and this distribution can vary by type of instrument (for example, ion trap as compared to Q-TOF), .natur w but this information is not yet fully exploited in most matching as a result of changes in the concentrations of peptides entering models. A mobile proton model30 has been proposed to explain the ion source. Given a precursor m/z value and a list of fragment intensity patterns observed in MS/MS, and Zhang has developed ions, the goal is to match these to an amino acid sequence within 31 http://ww a theoretical model that predicts fragment ion intensities well . the measurement and fragmentation uncertainty of the mass spec- As with any measurement process, tandem mass spectra may have trometer (Fig. 1). oup some level of uncertainty. The accuracy of the mass-to-charge Some of the methods described below are used in our labora- r G ratio (m/z) and the mass resolving power are limited, electronic tory and illustrate a general approach for the analysis of large data and chemical noise may be present and ion signals may fluctuate sets. One source of uncertainty in analyzing tandem mass spectra of multiply charged ions is determination of the charge state of lishing the precursor peptide, which is critical to accurately calculating a b peptide’s molecular weight. A highly accurate molecular weight Pu BOX 1 PEPTIDE FRAGMENTATION measurement or calculation can be very effective in restrict- ing search results. Methods to determine the charge state of ions Low-energy CID dissociates the amide bond along the Nature involve deconvolution or determination of the m/z spread of iso- peptide backbone. As a result, two fragments are produced, tope peaks32. On low-resolution instruments, where these methods one containing the N terminus and the other containing 2004 are ineffective, MS/MS of peptides with charge states higher than © the C terminus. Nomenclature denotes the N-terminal 1 are typically searched twice, once calculating a molecular weight fragments with letters a, b and c and the C-terminal assuming that the charge state is +2 and the second time assuming fragments with letters x, y and z59. Internal ion fragments the charge state is +3. Based on observations by Dancik and col- are formed by simultaneous cleavage of N and C termini. leagues33 that complementary fragment ions (N-terminal and C- Immonium ions are of the structure HN = CH–R25. A terminal fragment ions) can be used to improve molecular weight numerical subscript for each fragment ion indicates the calculations, our group and others used a variation of this approach position of the amino acid at which the bond cleavage to determine peptide ion charge state34,35. In good-quality tandem occurs. For N-terminal fragments, the numbering starts from the N terminus, and for C-terminal fragments, the mass spectra there are numerous complimentary ions; thus, if the numbering starts from the C terminus. precursor ion is assumed to be doubly charged, the complementary ions present in the spectrum should sum to this molecular weight. x3 If the ion is triply charged, then the complementary fragment ions y will sum to this molecular weight. Both situations are tested, and 3 the molecular weight calculation that accounts for the most com- z3 R1 R2 plementary ions is assumed to be the correct charge state. In addi- tion to the above method, ion traps can also be used to perform a narrow mass range scan at resolutions sufficient to determine the H N-CH-CO-NH-CH-CO-NH-CH-CO-NH-CH-CO-NH-CH-COOH 2 charge state, but this necessitates an additional scan and reason- ably abundant signal36. The newer linear ion traps with higher scan a2 R R R 3 4 5 speeds can accommodate the high-resolution scan without decreas- b 2 ing the efficiency of data acquisition. c2 Most large-scale tandem mass spectrometry data is acquired using automated methods such as data-dependent data acquisi- Fragment ion nomenclature. Schematic diagram of N-terminal tion, which triggers MS/MS based on ion abundance. When the a2, b2 and c2 ions and C-terminal x3, y3 and z3 ions for a five- amino-acid peptide. ion abundance level to trigger MS/MS is set just above the back- ground noise level, MS/MS data is almost continuously acquired. 196 | VOL.1 NO.3 | DECEMBER 2004 | NATURE METHODS REVIEW BOX 2 DATABASE SEARCHING ALGORITHMS Descriptive models Stochastic models Descriptive algorithms are based on a mechanistic prediction of Stochastic models are based on probability models for the how peptides fragment in a tandem mass spectrometer, which is generation of tandem mass spectra and the fragmentation then quantified to determine the quality of the match between of peptides. Basic probabilities of fragment ion matches are the prediction and the experimental spectrum. Mathematical obtained from training sets of spectra of known sequence methods such as correlation analysis have been used to assess identity. Stochastic models use statistical limits on the match quality. measurement and fragmentation process to create a likelihood that the match is correct. Interpretative models methods Interpretative approaches are based on manual or automated Statistical and probability models interpretation of a partial sequence from a tandem mass Statistical and probability models determine the relationship spectrum and incorporation of that sequence into a database between the tandem mass spectrum and sequences.