A Novel Approach for Untargeted Post-Translational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry*□S

Research A Novel Approach for Untargeted Post-translational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry*□S Richard C. Baliban‡, Peter A. DiMaggio‡, Mariana D. Plazas-Mayorca§, Nicolas L. Young¶, Benjamin A. Garcia§¶ʈ, and Christodoulos A. Floudas‡** A novel algorithm, PILOT_PTM, has been developed for Identification of the types of post-translational modifica- the untargeted identification of post-translational mod- tions (PTMs)1 of various organisms is currently a major chal- ifications (PTMs) on a template sequence. The algorithm lenge in the field of proteomics. MS/MS has shown to be an consists of an analysis of an MS/MS spectrum via an excellent tool for de novo peptide sequence prediction and integer linear optimization model to output a rank-or- database peptide identification and is indispensable in deter- dered list of PTMs that best match the experimental mining PTMs (1, 2). Many research groups (3–28) have incor- data. Each MS/MS spectrum is analyzed by a prepro- porated modification discovery into their respective identifi- cessing algorithm to reduce spectral noise and label cation algorithms and utilize multiple databases, including potential complimentary, offset, isotope, and multiply UniMod (29), RESID (30), and Delta Mass,2 to build a list of charged peaks. Postprocessing of the rank-ordered list from the integer linear optimization model will resolve variable modifications that can exist on a candidate peptide. fragment mass errors and will reorder the list of PTMs To date, there exist two types of algorithms for identification based on the cross-correlation between the experimen- of PTMs: (a) hybrid sequence tag/database approaches (3–9, tal and theoretical MS/MS spectrum. PILOT_PTM is in- 28), which develop a sequence tag and subsequently com- strument-independent, capable of handling multiple pare this tag with a database to extract a candidate peptide fragmentation technologies, and can address the uni- sequence and determine the set of PTMs that best explain the verse of PTMs for every amino acid on the template MS/MS spectrum and (b) pure database-based approaches sequence. The various features of PILOT_PTM are pre- (10–15, 23–27), which directly compare the experimental sented, and it is tested on several modified and unmod- peak list with a theoretical peak list derived from candidate ified data sets including chemically synthesized phos- peptides in a database. Both approaches have had success phopeptides, histone H3-(1–50) polypeptides, histone both in validation of known modifications and discovery of H3-(1–50) tryptic fragments, and peptides generated novel ones. To our knowledge, there is no de novo approach from proteins extracted from chromatin-enriched frac- for identification of PTMs using a comprehensive variable tions. The data sets consist of spectra derived from fragmentation via collision-induced dissociation, elec- modification list. tron transfer dissociation, and electron capture dissoci- The hybrid methods, denoted as (a), are beneficial because ation. The capability of PILOT_PTM is then bench- the derivation of the sequence tag may limit the size of the marked using five state-of-the-art methods, InsPecT, database to proteins that contain that sequence tag. This Virtual Expert Mass Spectrometrist (VEMS), Modi, Mas- approach (3) can allow for a richer set of variable modifica- cot, and X!Tandem. PILOT_PTM demonstrates superior tions to be considered on candidate peptide sequences due accuracy on both the small and large scale proteome to the database size reduction. For example, the InsPecT experiments. A protocol is finally developed for the anal- algorithm (4) will generate de novo sequence tags of a fixed ysis of a complete LC-MS/MS scan using template se- length and scan a trie-based database for all instances of the quences generated from SEQUEST and is demonstrated tag. Each distinct variable modification combination, or dec- on over 270,000 MS/MS spectra collected from a total oration, is entered in a mass-ordered list prior to database chromatin digest. Molecular & Cellular Proteomics 9: 764–779, 2010. 1 The abbreviations used are: PTM, post-translational modification; VEMS, Virtual Expert Mass Spectrometrist; ETD, electron transfer From the Departments of ‡Chemical Engineering, §Chemistry, and dissociation; ECD, electron collision dissociation; ILP, integer linear ¶Molecular Biology, Princeton University, Princeton, New Jersey 08544 optimization; LP, linear programming; PL, protein list; TS, template Received, October 16, 2009, and in revised form, January 22, 2010 sequences; CPU, central processing unit. Published, MCP Papers in Press, January 26, 2010, DOI 10.1074/ 2 K. Mitchelhill, Delta Mass: a Database of Protein Post Transla- mcp.M900487-MCP200 tional Modifications. 764 Molecular & Cellular Proteomics 9.5 © 2010 by The American Society for Biochemistry and Molecular Biology, Inc. This paper is available on line at http://www.mcponline.org PILOT_PTM for Untargeted PTM Identification searching. When a peptide matching the tag is found, the c-ions and z⅐-ions (32, 35) that are analogous to the b-ions algorithm will attempt to increase the length of the tag with an and y-ions produced from CID. Although the c-ions and z⅐- amino acid sequence if the mass of the sequence plus the ions are often the most abundant ions present, both ETD and mass of one decoration is equal to the predetermined mass ECD spectra have been known to show b-ions, y-ions, and gap. The Virtual Expert Mass Spectrometrist (VEMS) (6) uses their neutral losses as well (37). ECD and ETD enhance the both a database-independent search for generation of se- diversity of peptides that can be fragmented because they quence tags and a database-dependent search to determine can analyze bigger peptides with higher charge state. In fact, possible peptides. Any sequence tag that is not validated by a recent decision tree model (38) was developed to differen- a peptide found in the database-dependent search is com- tiate which parent mass and charge states are most appro- pared with the list of proteins containing peptides found in the priate for CID and ETD. Generally, CID will provide the most database-dependent search to generate candidate amino fragmentation for peptides of charge 2 or high mass peptides acid sequences. All combinations of variable modifications of charge 3. Low mass peptides of charge 3 and all peptides that equal the difference between the parent mass and the of charge greater than 3 may have better fragmentation using candidate sequence mass are tested to derive the best pos- ETD or ECD (38). Unlike CID, ECD and ETD cleavage is very sible modification (6). The Modi algorithm (7) assumes that the weakly affected by the amino acid sequence and generally database has been reduced a priori to a candidate subset of provides more complete coverage than CID alone when used 20 proteins. After filtering the MS/MS spectrum, the algorithm on peptides with higher charge density. Depending on the generates a list of sequence tags derived from the spectrum precursor charge and basic residue location, one can expect and attempts to explain the mass gaps using any of the a large fraction of complementary c-ions and z⅐-ions to be modifications from the UniMod database. present in the spectral data. Additionally, both ECD and ETD Although sequence tags have proven to be very capable in also prevent cleavage of labile modifications (33, 36). Al- determining the candidate peptide sequence, the success of though the mechanism of cleavage during ECD and ETD is these methods relies greatly on the accurate prediction of the still debated, PTMs are fully present on the c-ions and z⅐-ions sequence tag. Pure database methods, denoted earlier as (b), produced during cleavage. As ETD/ECD fragmentation tech- remove this need by directly obtaining the peptide sequence niques become more readily available, they will serve as a (with or without modifications) from a database. These ap- complement for CID technology, and hence it is desirable that proaches are also beneficial because they use all of the computational algorithms be able to handle inputs from all MS/MS spectrum peak information at once when analyzing a three techniques. candidate peptide. That is, for each candidate sequence in A further limitation of most of the preceding algorithms is the database, a full set of theoretical ion fragments may be the inability to search for a large amount of variable modifi- compared with the experimental MS/MS spectrum peaks to cations. Enumerating all combinations of the variable modifi- derive a score for the candidate sequence. The potential cations will lead to an exponential increase in the search time drawback of the pure database algorithms is the limitation of and can pose a significant problem when the database size is variable modifications that can be analyzed. Each variable large. This may be reduced by implementing a two-pass modification will create an additional copy of the amino acid approach (39–41) where the database is initially scanned that must be analyzed when developing a theoretical candi- either with no modifications or a small subset of variable date peptide from the database. modifications to eliminate proteins that did not score above a When analyzing the entire database with a small modifica- given threshold (based on the peptide hits). Mascot (40), tion set, these algorithms have been very effective in identi- X!Tandem (39), and InsPecT (41) will run a first pass search fying modified spectra. The SEQUEST algorithm (10) uses a with a small set of variable modifications to analyze spectra technique known as cross-correlation to mathematically com- that are either unmodified or contain the queried modifica- pare the overlap between the theoretical spectrum from a tions. Because of the reduced database size, additional vari- candidate database peptide and the experimental spectrum.

A Novel Approach for Untargeted Post-Translational Modification Identification Using Integer Linear Optimization and Tandem Mass Spectrometry*□S

Table S1. List of Proteins in the BAHD1 Interactome

Inactive USP14 and Inactive UCHL5 Cause Accumulation of Distinct Ubiquitinated Proteins In

MALDI-TOF/TOF MS Protocols Objectives

Optimized Search Strategy to Maximize PTM Characterization and Protein Coverage in Proteome Discoverer Software

Proteomic and Bioinformatic Pipeline to Screen the Ligands of S

Identification and the Significance of Selective Proteins in Bile And

Supplementary Table 1. the List of Proteins with at Least 2 Unique

Protein Identification by Mass Spectrometry

Large-Scale Database Searching Using Tandem Mass Spectra: Looking up the Answer in the Back of the Book

Ptmap—A Sequence Alignment Software for Unrestricted, Accurate, and Full-Spectrum Identification of Post-Translational Modification Sites

Cross-Talk Between Overlap Interactions in Biomolecules: a Case Study of the Β-Turn Motif

Biological Recognition of Graphene Nanoflakes