Ptmap—A Sequence Alignment Software for Unrestricted, Accurate, and Full-Spectrum Identification of Post-Translational Modification Sites
Total Page:16
File Type:pdf, Size:1020Kb
PTMap—A sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites Yue Chena,b,1, Wei Chenc, Melanie H. Cobbc, and Yingming Zhaoa,2 Departments of aBiochemistry and cPharmacology, University of Texas Southwestern Medical Center, Dallas, TX 75390; and bDepartment of Chemistry and Biochemistry, University of Texas, Arlington, TX 76019 Contributed by Melanie H. Cobb, November 20, 2008 (sent for review June 12, 2008) We present sequence alignment software, called PTMap, for the over the integers between Ϫ100 and ϩ 300, a list of 4,800 possible accurate identification of full-spectrum protein post-translational modified peptides is generated from the single peptide sequence. modifications (PTMs) and polymorphisms. The software incorpo- The exponentially increased size of the peptide pool and the high rates several features to improve searching speed and accuracy, similarity among peptide sequences derived from the full spectrum including peak selection, adjustment of inaccurate mass shifts, and of possible PTMs make it difficult to remove most of the false precise localization of PTM sites. PTMap also automates rules, positives, let alone all of them. Second, a PTM could happen at based mainly on unmatched peaks, for manual verification of several residue [e.g., protein methylation (18)], modified peptides identified peptides. To evaluate the quality of sequence alignment, with adjacent PTM sites are likely to have similar theoretical we developed a scoring system that takes into account both fragmentation patterns, and hence similar statistical scores. Third, matched and unmatched peaks in the mass spectrum. Incorpora- in silico-generated peptide sequence pools used for sequence tion of these features dramatically increased both accuracy and alignment are unlikely to include all of the peptide sequences sensitivity of the peptide- and PTM-identifications. To our knowl- generated by proteolytic digestion. Take, for example, trypsin, a edge, PTMap is the first algorithm that emphasizes unmatched protease often used for shotgun proteomics, generate not only peaks to eliminate false positives. The superior performance and tryptic peptides, but also semitryptic peptides, in which one termi- reliability of PTMap were demonstrated by confident identification nus is generated by chymotryptic activity within the enzyme prep- of PTMs on 156 peptides from four proteins and validated by aration (19). MS/MS of the synthetic peptides. Our results demonstrate that The false positives were caused during protein database search, PTMap is a powerful algorithm capable of identification of all because of misinterpretation of charge states, abnormal enzymatic possible protein PTMs with high confidence. digestion sites, misinterpretation of protein modifications, wrong assignment of modification sites and modification types, and incor- acetylation ͉ dehydration ͉ methylation ͉ phosphorylation rect use of isotopic peaks (19). Unfortunately, the decoy protein sequence database cannot accurately predicate the false discovery ass spectrometry is the method of choice for mapping sites of rates as many false positives are sequence-linked to the true hits, Mprotein post-translational modifications (PTMs) that are such as those cases caused by abnormal enzymatic digestion and known to have more than 300 types (1, 2). Efficient sequence wrong assignment of modification sites and types (19). alignment algorithms are essential for mapping PTM sites and We have observed that the MS/MS spectra of false positive peptide identification from mass spectrometric data (3). Widely identifications commonly contain unmatched peaks with significant used programs, such as Sequest (4, 5), Mascot (6), and X!Tandem intensities. Accordingly, we argue that, while identification of (7), identify PTM sites based on a restricted database search in candidate peptides should focus on matched peaks, to comprehen- which tandem mass spectra are aligned with protein sequences sively identify false positives, emphasis should be placed on un- bearing one or several specified PTMs at specified amino acid matched peaks. We realized that ‘‘unexplained peaks’’ have been residues. This restricted database search strategy developed in early used as a penalty parameter in de novo peptide sequencing (20). days has been very useful for identifying peptides bearing a limited number of specified PTMs, but lacks the flexibility to identify However, it has never been used as a major filter to remove false unexpected PTMs. positive peptide and PTM identifications during database search. Recently, several algorithms have been developed to extend the Our emphasis on unmatched peaks is based on the rationale that a capability of shotgun proteomics to allow identification of all true peptide identification should explain all major peaks in the possible PTMs and sequence polymorphisms (8–14). These algo- spectrum, especially for those MS/MS data that are generated in ion rithms can carry out unrestricted database searches, identifying any trap types of mass spectrometers that have relatively simple frag- PTM, whether previously known or unknown. To improve the mentation patterns (21). We further argue that the modification site accuracy of peptide identification, different statistical strategies should be unambiguously located to confidently identify peptides have been proposed in attempts to reduce the number of false containing PTMs, and to eliminate false-positive PTM assignments. positives (10–12, 15–17). These strategies typically involve applying a statistical significance test to score the confidence level of each Author contributions: Y.C., M.H.C., and Y.Z. designed research; Y.C. and W.C. performed identification. While useful, the reliability of these strategies has not research; Y.C. and Y.Z. analyzed data; and Y.C., M.H.C., and Y.Z. wrote the paper. been critically evaluated, for example, by testing them with manual The authors declare no conflict of interest. verification with high stringency, or with MS/MS of synthetic 1Present address: Ben May Department for Cancer Research, University of Chicago, Chi- peptides, the gold standard for confirming peptide identification. cago, IL 60637. Identification of false positives using statistical methods is daunt- 2To whom correspondence should be sent at the present address: Ben May Department for ing in unrestricted sequence alignments for several reasons. First, Cancer Research, University of Chicago, Chicago, IL 60637. E-mail: yingming.zhao@ the size of the protein sequence database is exponentially increased. uchicago.edu. For example, consider a peptide of 12 amino acid residues. If it is This article contains supporting information online at www.pnas.org/cgi/content/full/ assumed that this peptide contains a single PTM at an unknown 0811739106/DCSupplemental. BIOCHEMISTRY position, and that this PTM causes a mass shift that could range © 2009 by The National Academy of Sciences of the USA www.pnas.org͞cgi͞doi͞10.1073͞pnas.0811739106 PNAS ͉ January 20, 2009 ͉ vol. 106 ͉ no. 3 ͉ 761–766 Downloaded by guest on September 30, 2021 Here, we describe PTMap, a sequence alignment algorithm for based on matched peaks and then removes false positives from the reliable, full-spectrum identification of mass shifts associated with candidate list based on unmatched peaks. This filtering process is unspecified PTMs or polymorphisms. To reduce the risk of obtain- superior to statistical methods, which will typically sacrifice sensi- ing false positives when identifying peptides that bear an unknown tivity of peptide identification for identification accuracy. Third, an mass shift at an unknown amino acid residue, the algorithm adopts identification is only considered true if the PTM site is unambig- several strategies: selection of sequence-rich MS peaks, automation uously determined. We reason that the precise location of a PTM of mass-shift adjustment, precise localization of PTM sites, imple- site is critical in unrestricted sequence alignment because of the mentation of rules for manual verification, and development of a large number of possible modified peptides with the same sequence. scoring system based on spectrum quality to remove false-positive To further improve the performance of PTMap, we incorporated sequence alignments. PTMap also reduces searching time by align- three additional strategies (Fig. 1): (i) PTMap selects only those ing the sequences of only those proteins that are of interest, and by peaks with signal intensities that are significant relative to the local removing isotopic peaks and noise peaks. A unique feature of noise; (ii) PTMap uses only monoisotopic peaks for sequence PTMap is identification of false positives by emphasizing un- alignment; and (iii) PTMap automates the adjustment of mass matched peaks instead of matched peaks with statistically signifi- shifts. The last feature allows PTMap to analyze data from low- cant scores. The software is able to use MS/MS data from low- resolution mass spectrometers in addition to the data from high- resolution tandem mass spectrometers by implementing automatic resolution mass spectrometer. mass-shift adjustment. To demonstrate its usefulness, we used We use two parameters to evaluate a peptide identification made PTMap to identify PTM sites in human histone H4, HMG2, mouse by PTMap: unmatched peak score (SUnmatched) and PTMap score. SGK1, and BSA. Accuracy of peptide identification by PTMap was SUnmatched is determined by