DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics

David L. Tabb,*,† W. Hayes McDonald,‡ and John R. Yates, III‡

Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195, and Scripps Research Institute, 10550 North Torrey Pines Road, SR11, Department of Cell Biology, La Jolla, California 92037

Received September 24, 2001

Journal of Proteome Research 2002, 1, 21-26. Published on Web 01/07/2002. 10.1021/pr015504q. © 2002 American Chemical Society

* To whom correspondence should be addressed. Tel: 858 784-8876. Fax: 858 784-8883. E-mail: [email protected].
† University of Washington.
‡ The Scripps Research Institute.

The components of complex peptide mixtures can be separated by liquid chromatography, fragmented by tandem mass spectrometry, and identified by the SEQUEST algorithm. Inferring a mixture's source proteins requires that the identified peptides be reassociated. This process becomes more challenging as the number of peptides increases. DTASelect, a new software package, assembles SEQUEST identifications and highlights the most significant matches. The accompanying Contrast tool compares DTASelect results from multiple experiments. The two programs improve the speed and precision of proteomic data analysis.

Keywords: proteomics • protein identification • SEQUEST • data analysis • software • differential analysis • subtractive analysis • MudPIT • assembly

1. Introduction

Protein identification through mass spectrometry has become a major application of genetic sequence information. Two forms of protein identification have emerged. "Mass mapping" enzymatically digests an unknown protein and then identifies it by the masses of the peptides produced. Tandem mass spectrometry of peptides, on the other hand, fragments peptides to reveal their sequences.¹,² Both methods are effective for the identification of proteins isolated by polyacrylamide gels.³,⁴ Tandem mass spectrometry, however, is applicable to protein mixtures before isolation as well. "Shotgun proteomics" describes the process of digesting and separating a complex protein mixture, acquiring spectra from its peptides, and identifying the proteins originally found in the sample.⁵⁻⁹

Liquid chromatography is commonly used to separate mixtures of peptides for shotgun proteomics. Single-dimension liquid chromatography coupled to tandem mass spectrometry (LC/MS/MS) has been used to identify the components of protein complexes and subcellular compartments.⁵⁻¹² This technique is capable of identifying hundreds of peptides in an hour.¹³ Two dimensions of liquid chromatography (LC/LC) enable many thousands of peptides to be identified through improved resolution. Link et al. used this approach, called multidimensional protein identification technology or "MudPIT", to identify the components of large protein complexes such as yeast and human ribosomes, and Washburn et al. performed an analysis of whole yeast cells.⁸,⁹

Amino acid sequences of peptides can be inferred from tandem mass spectra.¹⁴ The SEQUEST algorithm automates this process.⁷ The software begins by enumerating candidate database peptides that match the observed peptide's mass. The sequences are quickly checked against the spectrum by a preliminary scoring algorithm, and obvious nonmatches are removed. A more extensive scoring algorithm cross-correlates sequence-derived theoretical spectra against the observed spectrum, and sequences are ranked on the basis of these scores. SEQUEST's automation of this process has helped the use of tandem mass spectrometry to expand from analytical chemistry to biology.

Because SEQUEST functions at the level of individual spectra, additional software is necessary to assemble information at the protein level. SEQUEST SUMMARY is broadly used,¹⁵ but it can only coordinate the results from a single dimension of chromatography. AUTOQUEST,¹⁵ a more extensive algorithm, can handle multidimensional results, but it has several limitations. As the number of peptides increases, the output grows in size very rapidly. It categorizes identifications by score, but the range of scores in each category is too broad for adequate discrimination. The program cannot organize results from DNA database searches. In addition, AUTOQUEST will not install alongside the commercial version of SEQUEST. Neither program allows much latitude for users to set standards on which identifications should be reported and which should not.

The DTASelect package was developed to assemble and evaluate shotgun proteomics data. The program applies multiple layers of filtering to SEQUEST search results and supplies several analytical reports in text and HTML files. The Contrast program is an accompanying tool for comparing multiple experiments and/or criteria sets. It is particularly useful for comparing results from different experimental conditions. This paper describes these two programs as demonstrated by an analysis of the purified yeast 26S proteasome using LC/MS/MS¹² and MudPIT.¹⁶

2. Experimental Section

2.1. DTASelect Algorithm. DTASelect functions in three phases: summarization, evaluation, and reporting. The summarization phase reads individual spectrum identifications from the SEQUEST output (.out) files and assembles information for each locus (the database identity of the protein or gene from which the peptide derives). The evaluation phase applies the criteria indicated by the user for spectra and proteins, removing those that do not qualify. The reporting phase creates files to be read in a web browser or spreadsheet. At each step, the user is notified of DTASelect's progress.

Summarization collects the most important information from SEQUEST identification. Each identification's output file is read to acquire the cross-correlation score (XCorr), normalized difference between first and second match scores (DeltCN), rank [...]

[...] of different peptides or that have at least one peptide showing up several times (see Table 2). Two filters reduce the redundancy of results. DTASelect can choose the best example of multiple spectra matching the same sequence by score or by signal intensity. In addition, many proteins may be indicated by exactly the same sets of peptides; DTASelect groups together proteins that are identical in sequence coverage. These selections highlight peptides and proteins that represent the entire sample.

The reporting phase generates several different files. The primary report is the DTASelect.html file, which enumerates [...]

Table 1. DTASelect Spectrum Filtersᵃ

option    default   individual spectrum filters
-1 #      1.8       set lowest +1 XCorr
-2 #      2.5       set lowest +2 XCorr
-3 #      3.5       set lowest +3 XCorr
-d #      0.08      set lowest DeltCN
-c #      1         set lowest charge state
-C #      3         set highest charge state
-i #      0.0       set lowest proportion of fragment ions observed
-s #      1000      set maximum Sp ranking
-m 0                require peptides to be modified
-m 1      X         include peptides regardless of modification
-m 2                exclude modified peptides
-y 0      X         include peptides regardless of tryptic status
-y 1                include only half- or fully tryptic peptides
-y 2                include only fully tryptic peptides

option    extended filters (off by default)
-Sic $    sequences must contain all of these characters (excludes C terminal residue)
-Sip $    sequences must contain this pattern
-Sec $    sequences must not contain any of these characters (excludes C terminal residue)
-Stn $    preceding residue must be one of these
-Stc $    C terminal residue must be one of these
-X1 #     set highest +1 XCorr
-X2 #     set highest +2 XCorr
-X3 #     set highest +3 XCorr

ᵃ # symbols indicate that a numerical parameter follows, while $ symbols indicate a character or string parameter follows. These filters are employed in the first pass of the filtering step. Each spectrum must pass all the standard filters to be included in the DTASelect report. Changes can be made at the command line or in a DTASelect configuration file.

Table 2. DTASelect Locus Filters and Utilitiesᵃ

option    default   locus filters
-t 0                do not purge duplicate spectra for each sequence
-t 1                purge duplicate spectra on basis of total intensity
-t 2      X         purge duplicate spectra on basis of XCorr
-u        false     include only loci with uniquely matching peptides
-o        false     remove proteins that are subsets of others
-e $                remove proteins with names matching this string
-E $                include only proteins with names matching this string
-l $                remove proteins with descriptions including this word
-L $                include only proteins with descriptions including this word
-M #      0         set minimum modified peptides per locus criterion
-r #      10        show all loci with peptides that appear this many times
-p #      2         set minimum peptides per locus criterion

option          utilities
--nofilter, -n  do not apply any criteria
--GUI           report through GUI instead of output files
--compress      create .IDX and .SPM files from spectra
--copy          print commands to copy selected IDs
--XML           save XML report of filtered results
--DB            save in format for database import
--chroma        save chromatography report
--similar       save protein similarity table
--mods          save modification report
--align         save sequence alignment report
--help, -h      print this list of options
--here, -.      include only IDs in current directory

ᵃ # symbols indicate that a numerical parameter follows, while $ symbols indicate a character or string parameter follows. Filtering is conducted at the locus level after it has been performed for individual spectra. A protein must pass either the -p or the -r filter to be retained in the DTASelect reports.
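The two-stage evaluation described in Tables 1 and 2 can be illustrated with a minimal Python sketch. It is not DTASelect's implementation. The `Match` record and the function names are hypothetical, the thresholds are treated as inclusive, only the XCorr, DeltCN, and Sp-rank filters are modeled, and repeats are assumed to be counted before duplicates are purged. The numeric defaults are the ones listed in the tables.

```python
from collections import defaultdict
from dataclasses import dataclass

# Defaults taken from Tables 1 and 2; everything else here is illustrative.
MIN_XCORR = {1: 1.8, 2: 2.5, 3: 3.5}  # -1, -2, -3 (charges 1..3, per -c/-C)
MIN_DELTCN = 0.08                     # -d
MAX_SP_RANK = 1000                    # -s
MIN_PEPTIDES = 2                      # -p
MIN_REPEATS = 10                      # -r


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one SEQUEST identification."""
    locus: str
    sequence: str
    charge: int
    xcorr: float
    deltcn: float
    sp_rank: int


def passes_spectrum_filters(m):
    """A spectrum must pass every standard filter (Table 1 footnote)."""
    threshold = MIN_XCORR.get(m.charge)
    if threshold is None:  # charge outside the default 1..3 range
        return False
    return (m.xcorr >= threshold
            and m.deltcn >= MIN_DELTCN
            and m.sp_rank <= MAX_SP_RANK)


def evaluate(matches):
    """Filter spectra, then loci; keep the best-XCorr spectrum per sequence."""
    by_locus = defaultdict(lambda: defaultdict(list))
    for m in matches:
        if passes_spectrum_filters(m):
            by_locus[m.locus][m.sequence].append(m)

    report = {}
    for locus, by_seq in by_locus.items():
        # A locus is retained if it passes either -p or -r (Table 2 footnote).
        enough_peptides = len(by_seq) >= MIN_PEPTIDES
        repeated = any(len(hits) >= MIN_REPEATS for hits in by_seq.values())
        if enough_peptides or repeated:
            # Analogue of -t 2: purge duplicate spectra on the basis of XCorr.
            best = (max(hits, key=lambda h: h.xcorr) for hits in by_seq.values())
            report[locus] = sorted(best, key=lambda h: h.sequence)
    return report
```

Calling `evaluate` on a list of `Match` records returns a dictionary that maps each retained locus to one representative spectrum per peptide sequence. Counting repeats before purging is a design choice of this sketch: purging first would leave every sequence with a single spectrum and the -r criterion could never be met.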
