DTASelect and Contrast: Tools for Assembling and Comparing Protein Identifications from Shotgun Proteomics

David L. Tabb,*,† W. Hayes McDonald,‡ and John R. Yates, III‡

Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195, and Scripps Research Institute, 10550 North Torrey Pines Road, SR11 Department of Cell Biology, La Jolla, California 92037

Received September 24, 2001

The components of complex peptide mixtures can be separated by liquid chromatography, fragmented by tandem mass spectrometry, and identified by the SEQUEST algorithm. Inferring a mixture’s source requires that the identified peptides be reassociated. This process becomes more challenging as the number of peptides increases. DTASelect, a new software package, assembles SEQUEST identifications and highlights the most significant matches. The accompanying Contrast tool compares DTASelect results from multiple experiments. The two programs improve the speed and precision of proteomic data analysis.

Keywords: proteomics • protein identification • SEQUEST • data analysis • software • differential analysis • subtractive analysis • MudPIT • assembly

1. Introduction

Protein identification through mass spectrometry has become a major application of genetic sequence information. Two forms of protein identification have emerged. “Mass mapping” enzymatically digests an unknown protein and then identifies it by the masses of the peptides produced. Tandem mass spectrometry of peptides, on the other hand, fragments peptides to reveal their sequences.1,2 Both methods are effective for the identification of proteins isolated by polyacrylamide gels.3,4 Tandem mass spectrometry, however, is applicable to protein mixtures before isolation as well. “Shotgun proteomics” describes the process of digesting and separating a complex protein mixture, acquiring spectra from its peptides, and identifying the proteins originally found in the sample.5-9

Liquid chromatography is commonly used to separate mixtures of peptides for shotgun proteomics. Single-dimension liquid chromatography coupled to tandem mass spectrometry (LC/MS/MS) has been used to identify the components of protein complexes and subcellular compartments.5-12 This technique is capable of identifying hundreds of peptides in an hour.13 Two dimensions of liquid chromatography (LC/LC) enable many thousands of peptides to be identified through improved resolution. Link et al. used this approach, called multidimensional protein identification technology or “MudPIT”, to identify the components of large protein complexes such as yeast and human ribosomes, and Washburn et al. performed an analysis of whole yeast cells.8,9

Amino acid sequences of peptides can be inferred from tandem mass spectra.14 The SEQUEST algorithm automates this process.7 The software begins by enumerating candidate database peptides that match the observed peptide’s mass. The sequences are quickly checked against the spectrum by a preliminary scoring algorithm, and obvious nonmatches are removed. A more extensive scoring algorithm cross-correlates sequence-derived theoretical spectra against the observed spectrum, and sequences are ranked on the basis of these scores. SEQUEST’s automation of this process has helped the use of tandem mass spectrometry to expand from analytical chemistry to biology.

Because SEQUEST functions at the level of individual spectra, additional software is necessary to assemble information at the protein level. SEQUEST SUMMARY15 is broadly used, but it can only coordinate the results from a single dimension of chromatography. AUTOQUEST,15 a more extensive algorithm, can handle multidimensional results, but it has several limitations. As the number of peptides increases, the output grows in size very rapidly. It categorizes identifications by score, but the range of scores in each category is too broad for adequate discrimination. The program cannot organize results from DNA database searches. In addition, AUTOQUEST will not install alongside the commercial version of SEQUEST. Neither program allows much latitude for users to set standards on which identifications should be reported and which should not.

The DTASelect package was developed to assemble and evaluate shotgun proteomics data. The program applies multiple layers of filtering to SEQUEST search results and supplies several analytical reports in text and HTML files. The Contrast program is an accompanying tool for comparing multiple experiments and/or criteria sets. It is particularly useful for comparing results from different experimental conditions. This paper describes these two programs as demonstrated by an analysis of the purified yeast 26S proteasome using LC/MS/MS12 and MudPIT.16

* To whom correspondence should be addressed. Tel: 858 784-8876. Fax: 858 784-8883. E-mail: [email protected].
† University of Washington.
‡ The Scripps Research Institute.

10.1021/pr015504q CCC: $22.00 © 2002 American Chemical Society. Journal of Proteome Research 2002, 1, 21-26. Published on Web 01/07/2002.

Table 1. DTASelect Spectrum Filters^a

option    default   individual spectrum filters
-1 #      1.8       set lowest +1 XCorr
-2 #      2.5       set lowest +2 XCorr
-3 #      3.5       set lowest +3 XCorr
-d #      0.08      set lowest DeltCN
-c #      1         set lowest charge state
-C #      3         set highest charge state
-i #      0.0       set lowest proportion of fragment ions observed
-s #      1000      set maximum Sp ranking
-m 0                require peptides to be modified
-m 1      X         include peptides regardless of modification
-m 2                exclude modified peptides
-y 0      X         include peptides regardless of tryptic status
-y 1                include only half- or fully tryptic peptides
-y 2                include only fully tryptic peptides

option    extended filters (off by default)
-Sic $    sequences must contain all of these characters (excludes C terminal residue)
-Sip $    sequences must contain this pattern
-Sec $    sequences must not contain any of these characters (excludes C terminal residue)
-Stn $    preceding residue must be one of these
-Stc $    C terminal residue must be one of these
-X1 #     set highest +1 XCorr
-X2 #     set highest +2 XCorr
-X3 #     set highest +3 XCorr

^a # symbols indicate that a numerical parameter follows, while $ symbols indicate a character or string parameter follows. These filters are employed in the first pass of the filtering step. Each spectrum must pass all the standard filters to be included in the DTASelect report. Changes can be made at the command line or in a DTASelect configuration file.

Table 2. DTASelect Locus Filters and Utilities^a

option    default   locus filters
-t 0                do not purge duplicate spectra for each sequence
-t 1                purge duplicate spectra on basis of total intensity
-t 2      X         purge duplicate spectra on basis of XCorr
-u        false     include only loci with uniquely matching peptides
-o        false     remove proteins that are subsets of others
-e $                remove proteins with names matching this string
-E $                include only proteins with names matching this string
-l $                remove proteins with descriptions including this word
-L $                include only proteins with descriptions including this word
-M #      0         set minimum modified peptides per locus criterion
-r #      10        show all loci with peptides that appear this many times
-p #      2         set minimum peptides per locus criterion

option            utilities
--nofilter, -n    do not apply any criteria
--GUI             report through GUI instead of output files
--compress        create .IDX and .SPM files from spectra
--copy            print commands to copy selected IDs
--XML             save XML report of filtered results
--DB              save in format for database import
--chroma          save chromatography report
--similar         save protein similarity table
--mods            save modification report
--align           save sequence alignment report
--help, -h        print this list of options
--here, -.        include only IDs in current directory

^a # symbols indicate that a numerical parameter follows, while $ symbols indicate a character or string parameter follows. Filtering is conducted at the locus level after it has been performed for individual spectra. A protein must pass either the -p or the -r filter to be retained in the DTASelect reports.

2. Experimental Section

2.1. DTASelect Algorithm. DTASelect functions in three phases: summarization, evaluation, and reporting. The summarization phase reads individual spectrum identifications from the SEQUEST output (.out) files and assembles information for each locus (the database identity of the protein or gene from which the peptide derives). The evaluation phase applies the criteria indicated by the user for spectra and proteins, removing those that do not qualify. The reporting phase creates files to be read in a web browser or spreadsheet. At each step, the user is notified of DTASelect’s progress.

Summarization collects the most important information from each SEQUEST identification. Each identification’s output file is read to acquire the cross-correlation score (XCorr), normalized difference between first and second match scores (DeltCN), rank by preliminary score (Sp rank), identified sequence, precursor m/z, protein or genetic locus from which this peptide derives, intensity, and percentage of fragment ions matched between predicted and observed spectra. The peptides are sorted by locus, and peptides for each locus are sorted by sequence. Next, the same database used by SEQUEST is consulted for the full protein sequence, descriptive name, and peptide positions. The information from the output files and database lookup is stored to the DTASelect.txt file so that this step can be bypassed in subsequent analyses of this sample.

Evaluation applies user-selected criteria to the matches. An identification is included only when it passes all specified spectrum criteria (see Table 1). Selection is next conducted at a higher level, retaining proteins that have a sufficient number of different peptides or that have at least one peptide showing up several times (see Table 2). Two filters reduce the redundancy of results. DTASelect can choose the best example of multiple spectra matching the same sequence by score or by signal intensity. In addition, many proteins may be indicated by exactly the same sets of peptides; DTASelect groups together proteins that are identical in sequence coverage. These selections highlight peptides and proteins that represent the entire sample.

The reporting phase generates several different files. The primary report is the DTASelect.html file, which enumerates the proteins and the spectra in evidence for each (see Figures 1 and 2). Hyperlinks from this report lead to sequence coverage for individual proteins, a view of each spectrum, and the output file for each spectrum. This information is also recorded in DTASelect-filter.txt, a tab-delimited text file suitable for review in a spreadsheet. For better data portability, DTASelect includes a graphical user interface, which can be used independently of SEQUEST’s support programs (see Figure 3). This interface provides superior support for identifying sequence ions from spectra, especially those that come from triply charged precursors. In another mode, it displays the selected protein’s sequence coverage. These files and the graphical user interface present the primary information produced by DTASelect.
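The two-level evaluation described in Section 2.1 can be sketched in Python. This is a minimal illustration under stated assumptions, not DTASelect’s actual implementation: the Match record and both function names are hypothetical, only a subset of the Table 1 filters is applied (the XCorr minimum for each charge state and the DeltCN minimum), and the locus rule uses the -p and -r defaults from Table 2. Counting how often a peptide “appears” as the number of passing spectra for one sequence is also an assumption, and duplicate purging (-t) is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass

# Default thresholds taken from Tables 1 and 2 of the paper.
MIN_XCORR = {1: 1.8, 2: 2.5, 3: 3.5}   # options -1, -2, -3
MIN_DELTCN = 0.08                      # option -d
MIN_PEPTIDES = 2                       # option -p
MIN_REPEATS = 10                       # option -r


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one SEQUEST spectrum identification."""
    locus: str
    sequence: str
    charge: int
    xcorr: float
    deltcn: float


def passes_spectrum_filters(m: Match) -> bool:
    """A spectrum must pass every spectrum-level criterion to be kept."""
    threshold = MIN_XCORR.get(m.charge)
    if threshold is None:              # charge outside the default 1..3 range
        return False
    return m.xcorr >= threshold and m.deltcn >= MIN_DELTCN


def filter_loci(matches):
    """Return {locus: {sequence: spectrum_count}} for loci that pass."""
    by_locus = defaultdict(lambda: defaultdict(int))
    for m in matches:
        if passes_spectrum_filters(m):
            by_locus[m.locus][m.sequence] += 1

    kept = {}
    for locus, peptides in by_locus.items():
        enough_peptides = len(peptides) >= MIN_PEPTIDES
        repeated_peptide = max(peptides.values()) >= MIN_REPEATS
        if enough_peptides or repeated_peptide:   # either -p or -r suffices
            kept[locus] = dict(peptides)
    return kept
```

Keeping the spectrum test separate from the locus test mirrors the order given in the text: spectra are judged individually first, and only the survivors count toward a protein’s evidence.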


Figure 1. Sample DTASelect.html fragment. Each protein identity is printed beside the count of peptide sequences associated with it. The number of spectra representing those sequences is also shown, along with the protein’s sequence coverage, length in residues, molecular weight, calculated pI, and description from the specified database. If multiple proteins in the database correspond to the same set of peptide sequences, the proteins are grouped together. The peptides found for each collection of loci are listed beneath it. Spectra matching the same sequences but possessing different charge states (discernible by the “.2” vs “.3” suffixes on filenames) are not considered duplicates. Peptides that are uniquely found at a particular locus are indicated with asterisks. The fields enumerated for each peptide include file name, XCorr, DeltCN, precursor ion mass, Sp rank, percentage of fragment ions found, copy count, and sequence. Addition symbols (as seen with w26S.0501.0501.1) link to other proteins in the report that also contain the indicated peptide. The similarity for protein YOL055C to YPL258C is reported, showing that one peptide present for YOL055C matches to the other protein and one peptide does not.
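The similarity value mentioned in this caption can be reproduced with a few lines of Python. The sketch follows the paper’s definition (the number of peptides shared between two proteins minus the number of nonmatching peptides). The function name is hypothetical, counting nonmatching peptides from the first protein’s point of view is an assumption, and the peptide strings are placeholders rather than real identifications.

```python
def similarity(peptides_a, peptides_b):
    """Shared peptides minus the peptides of A that are absent from B."""
    a, b = set(peptides_a), set(peptides_b)
    shared = len(a & b)
    nonmatching = len(a - b)
    return shared - nonmatching


# Placeholder sequences: one peptide of the first protein is shared, one is not,
# as in the YOL055C / YPL258C case described in the caption.
first = {"AAAK", "CCCK"}
second = {"AAAK", "DDDK", "EEEK"}
score = similarity(first, second)
print(score)  # 0: matching and nonmatching peptides balance
```

Because DTASelect reports all scores greater than or equal to zero, a pair like this one would appear in the similarity table.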

Figure 2. Summary tables from DTASelect output for LC/MS/MS and MudPIT analysis of purified 26S proteasomes: (A) DTASelect summary output for LC/MS/MS analysis on 4 µg of purified 26S proteasome. Shown are total counts for proteins, peptides, and spectra. The difference between the nonredundant and redundant protein counts reflects that some proteins have been grouped together because of identical sequence coverage. When used with databases that contain a large number of related proteins (such as the human database), DTASelect’s grouping functionality is a timesaver. (B) As in (A) except that results are for a MudPIT analysis of 40 µg of purified 26S proteasome.

Supplementary information is presented in two auxiliary files. The first contains information about protein similarity. Multiple proteins may feature a particular peptide, but this is not always an indicator of substantial similarity among the proteins. Similarity is calculated to be the number of peptides shared between two proteins minus the number of nonmatching peptides; a score of zero indicates that the number of peptides that match equals the number that do not match. DTASelect includes all similarity scores greater than or equal to zero in its primary report and in a separate similarity table. Another DTASelect report evaluates chromatography. After filters have been applied, the number of singly, doubly, and triply charged peptides remaining from each round of chromatography is given, along with a profile of where in each cycle the peptides eluted. This information can help users optimize their separations for both completeness and efficiency.

2.2. Contrast Algorithm. Contrast extends DTASelect across multiple samples or multiple sets of criteria. Its three phases are reading, comparison, and reporting. The program is configured by changes in the Contrast.params file, and the specified DTASelect results are read from the disk. The presence of each protein is compared between the described sets, and the similarities and differences are reported in HTML and text formats.

The reading phase begins with the Contrast.params file, which allows the user to specify the DTASelect results of interest and to enumerate criteria sets to be applied to the results. Contrast requires at least one sample and set of criteria to be specified, but more can be included as long as the number of samples multiplied by the number of criteria sets does not exceed 63. Each DTASelect result is processed under each set of criteria, so four samples and two criteria sets will result in eight “data sets,” corresponding to eight columns in the output. Once the Contrast.params file sets up the run, Contrast will read the DTASelect.txt file for each sample and apply the specified filters as in DTASelect.

To compare the data sets, Contrast develops a master list of proteins including each locus appearing in any of the data sets. Each data set is assigned a power of two. Each locus’ pattern of presence or absence is stored by summing the powers of two for the data sets in which the protein is found. Finally, the master list of loci is sorted by these power sums and then by protein name. The effect is that loci are grouped by patterns of appearance across data sets.

Contrast produces extensive output. The major report is Contrast.html, which lists the proteins in each group and includes links to the corresponding DTASelect.html files for each data set (see Figures 4 and 5). To permit Contrast results to be analyzed by spreadsheet, they are exported in a tab-delimited text file. For users who want to see how individual peptides for each protein differ among data sets, Contrast offers a “verbose” mode, which provides this level of detail (see Figure 6). A complex comparison can yield a lengthy report.
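The grouping step of the comparison phase can be illustrated with a short Python sketch. It is not Contrast’s code: the function names are hypothetical, and each data set is represented simply as a set of locus names. Data set i is assigned the value 2^i, each locus accumulates the sum over the data sets that contain it, and the master list is sorted by that sum and then by name. An ascending sort is assumed. The 63-data-set limit would be consistent with holding the sum in a signed 64-bit integer, although the paper does not state the reason.

```python
def group_loci(data_sets):
    """Group loci by their pattern of presence across data sets.

    data_sets: list of sets of locus names, one set per data set (column).
    Returns (pattern, locus) pairs sorted by pattern, then by locus name.
    """
    if not 1 <= len(data_sets) <= 63:
        raise ValueError("between 1 and 63 data sets are allowed")
    pattern = {}
    for i, loci in enumerate(data_sets):
        for locus in loci:
            # Data set i contributes the power of two 2**i to the locus sum.
            pattern[locus] = pattern.get(locus, 0) + (1 << i)
    return sorted((p, name) for name, p in pattern.items())


def describe(pattern, n):
    """Render a pattern as presence marks, e.g. 'X-' for data set 0 only."""
    return "".join("X" if pattern & (1 << i) else "-" for i in range(n))


# Hypothetical two-sample comparison; the locus lists are illustrative only.
new = {"YFR004W", "YDR471W", "YHR010W"}
prev = {"YFR004W", "YPL258C"}
for p, name in group_loci([new, prev]):
    print(describe(p, 2), name)
```

Because every distinct combination of data sets yields a distinct sum, sorting on the sum places all loci with the same presence pattern next to each other, which is the grouping the report relies on.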


Figure 3. DTASelect graphical user interface. Identified peaks are color-coded blue for y ions or red for b ions. The letters along the top of the window show the correspondence between fragment ions and sequence. Clicking on a peptide will cause its spectrum to be shown. Selecting a protein will show sequence coverage.

Contrast’s output organization ensures that details can be found in predictable locations.

3. Results and Discussion

Analyses of purified yeast 26S proteasomes by both single-dimension LC/MS/MS12 and by MudPIT16 have been reported previously. All experiments were performed on a Thermo Finnigan LCQ or LCQ Deca using standard procedures.4,8,9 The resulting peptide identifications were processed by DTASelect and Contrast.

3.1. DTASelect Application: Single-Dimension LC/MS/MS. Digested peptides from 4 µg of purified 26S proteasome were resolved and analyzed during a 2 h LC/MS/MS experiment.12 After a SEQUEST search against a database of predicted yeast open reading frames and common contaminants, the resulting 6587 output files were filtered and summarized using the default criteria of DTASelect (see Tables 1 and 2). The filtering yielded 411 spectra corresponding to 342 peptides and 37 distinct proteins (see Figure 2A).

All but two of these 37 proteins had been observed before: 32 had been identified as part of the 26S proteasome, one was known to associate with the complex,12 and two were exogenous contaminants.8 DTASelect prepared its report in minutes (see Figure 1). The DTASelect output compiled the important metrics for each spectrum into a single file, enabling further study through hypertext links.

3.2. DTASelect Application: MudPIT. Digested peptides from 36 µg of the same preparation of 26S proteasome as for the single-dimension experiment were analyzed by MudPIT9 and searched against the database. After filtering and summarization, the 33 841 output files were pared down to 4157 spectra representing 968 distinct peptides and 77 proteins (see Figure 2B). Upon manual evaluation of the results, only four of these 77 proteins (5.2%) passed through the filters as false positives. Such percentages are typical of the results for these experiments. The percentage of false positives, however, is a function of the filters employed by the user; DTASelect allows considerable latitude in the stringency applied to the analysis.

3.3. Contrast Application: MudPIT versus MudPIT. The most common application for Contrast is the comparison of multiple LC/MS/MS or MudPIT experiments. The recent MudPIT analysis was compared to a previous MudPIT analysis of 60 µg of purified proteasome.16 The summary table generated by Contrast allows the researcher to evaluate how many proteins were found in both samples (see Figure 5). Manual evaluation confirmed that none of the 60 shared proteins were passed through the filters as false positives. For this group of
proteins, all were either components of the proteasome (32), associated with the proteasome (8), highly abundant yeast proteins that are found in numerous purifications (18), or exogenous contaminants such as immunoglobulin and trypsin (2).

Contrast produces an analysis within a few minutes that would take a few hours to produce manually. As in DTASelect output, important information is directly available, including locus name, percent sequence coverage, and locus description/protein name (see Figure 4). The proteins are sorted by the experiments in which they are found, and hypertext links provide access to more detailed information for evaluative purposes (see section 2.2). When multiple positive and negative controls are part of the experiment, Contrast’s automation becomes even more valuable.

Figure 4. Sample Contrast.html fragment. This represents a group of proteins that appear in the new MudPIT sample but not the previous experiment when the same criteria are used against each. Each row in the table represents one protein, and the numbers in the columns are the sequence coverage percentages found in each data set (or, in the Total column, the cumulative sequence coverage across multiple columns). The percentages link to each protein’s location in a corresponding DTASelect.html file. If multiple proteins have identical sequence coverage, they are grouped together (for example, NRL_1IKFH and NRL_1INDH). Several such sections appear in each Contrast output file, one for each combination of presence and absence.

Figure 5. Sample Contrast.html summary. Each row in this table represents a particular combination of presence and absence in each of the data sets, with the “X” marks indicating this pattern. Each row’s count links back to the appearance of the group above it in the Contrast.html file. Of the 118 proteins appearing, 60 were present in both samples, 18 were present only in the “new” analysis, and 40 were found only in the “prev” experiment.

Figure 6. Sample Verbose Contrast.html fragment. Proteins YDR471W and YHR010W were found in both samples under this criteria set, though with different sequence coverages (17.6% and 21.3%, respectively). One peptide was found in both samples, but the other peptides were found in only one. The highest XCorr for each peptide in each sample is shown beside its sequence. Cumulatively, these peptides add up to 30.9% sequence coverage. The sequence coverage percentages for each sample lead to the relevant sections in the respective DTASelect output files. The cumulative sequence coverage links to a view of the protein’s sequence overlaid with the peptide sequences.

3.4. DTASelect and Contrast Applications: Miscellaneous. Since a large variety of customizable filters are available in DTASelect and thus Contrast, individual researchers can adapt the programs to their own needs. Contrast can be used for a “step analysis” in which progressively more stringent criteria are applied to data from a single experiment. Those proteins that meet the highest stringency requirements are sorted to the top of the output file, and those passing only the lowest criteria are sorted to the end. This approach is well suited for stratifying the protein identifications in a complex sample. For projects focusing on a particular protein, the verbose mode of Contrast can show how individual peptides vary among samples. If modifications like phosphorylation or methylation are included in SEQUEST search parameters, DTASelect can be used to list only those peptides found to be modified, saving time for researchers interested in this subset of the data. It can also be used for chemical labeling tags such as the ICAT reagent.17 The open-ended design of the programs accommodates many purposes.

4. Conclusions

DTASelect assembles protein-level information from peptide data more quickly, flexibly, and uniformly than existing tools. DTASelect focuses attention on peptides of interest by sweeping away less likely identifications. By streamlining data analysis for proteomics, DTASelect makes more complex experiments feasible. Contrast provides extensive sample comparison capabilities. Whether stepping through multiple levels of criteria to triage the proteins identified from a single sample or differentiating the protein content of multiple, complex samples, Contrast is a potent tool for the molecular biologist. Most users of SEQUEST can benefit from DTASelect and Contrast. Just as SEQUEST automated the time-intensive process of spectrum identification, DTASelect and Contrast automate the process of SEQUEST result interpretation.

Acknowledgment. We thank Rati Verma and Raymond Deshaies of the Division of Biology and Howard Hughes Medical Institute at Caltech for graciously providing proteasome samples used to demonstrate the software. D.L.T. was supported by a National Science Foundation Graduate Research Fellowship. W.H.M. was supported by the Merck Genome Research Institute. Partial support was derived from NIH Grant Nos. RR11823 and R33CA81665. Suggestions from the members of the Yates Lab contributed many features to the development of these programs.

References

(1) Yates, J. R., III; McCormack, A. L.; Eng, J. K. Anal. Chem. 1996, 68, 534A-540A.
(2) Yates, J. R., III. Electrophoresis 1998, 19, 893-900.
(3) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 5011-5015.
(4) Gatlin, C. L.; Kleeman, G. R.; Hays, L. G.; Link, A. J.; Yates, J. R., III. Anal. Biochem. 1998, 263, 93-101.
(5) McCormack, A. L.; Schieltz, D. M.; Goode, B.; Yang, S.; Barnes, G.; Drubin, D.; Yates, J. R., III. Anal. Chem. 1997, 69, 767-776.
(6) Link, A. J.; Carmack, E.; Yates, J. R., III. Intl. J. Mass Spectrom. 1997, 160, 303-316.
(7) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(8) Link, A. J.; Eng, J. K.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-682.
(9) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-247.
(10) Meeusen, S.; Tieu, Q.; Wong, E.; Weiss, E.; Schieltz, D. M.; Yates, J. R., III; Nunnari, J. J. Cell Biol. 1999, 145, 291-304.
(11) Goode, B. L.; Wong, J. J.; Butty, A. C.; Peter, M.; McCormack, A. L.; Yates, J. R., III; Drubin, D. G.; Barnes, G. J. Cell Biol. 1999, 144, 83-98.
(12) Verma, R.; Chen, S.; Feldman, R.; Schieltz, D.; Yates, J. R., III; Dohmen, J.; Deshaies, R. J. Mol. Biol. Cell 2000, 11, 3425-3439.
(13) Yates, J. R., III. J. Mass Spectrom. 1998, 33, 1-19.
(14) Hunt, D. F.; Yates, J. R., III; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. U.S.A. 1986, 83, 6233-6237.
(15) Tabb, D. L.; Eng, J. K.; Yates, J. R., III. Protein Identification by SEQUEST. In Proteome Research: Mass Spectrometry; James, P., Ed.; Springer: New York, 2001; Vol. 1, pp 125-142.
(16) Verma, R.; McDonald, W. H.; Yates, J. R., III; Deshaies, R. J. Mol. Cell 2001, 8, 439-48.
(17) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H. Nat. Biotechnol. 2000, 17, 994-999.
