<<

Proc. Natl. Acad. Sci. USA Vol. 90, pp. 5138-5142, June 1993 Biochemistry Rapid identification of ( composition/molecular weight/isoelectric point/ analysis) GERRY SHAW University of Florida College of Medicine, Department of Neuroscience, JHMHC Box-J100244, Gainesville, FL 32610 Communicated by George K. Davis, February 9, 1993

ABSTRACT The amino acid composition, molecular introduced since protein blotting is normally performed on weight, and isoelectric point of a protein can all be easily and proteins larger than this. MAKE-LIB simultaneously writes the economically determined by current electrophoretic tech- accession numbers and identifiers of each selected protein niques. A method which uses such easily obtained data to sequentially to the NAMES file. identify proteins is described. A computer program first cor- Amino acid composition determinations are of differing rects for systematic errors in amino acid quantitation and then accuracy and reproducibility for different amino acids (3-5). searches the current sequence database for proteins with amino We therefore made multiple composition determinations acid compositions similar to the corrected values, taking into from blots ofabout 5 ug ofeach offour different pure proteins account the reliability ofdetermination ofeach amino acid. The of known primary sequence and found that each amino acid program also provides the calculated molecular weight, iso- was quantified with a characteristic and quite repeatable electric point, and name of each candidate, providing three error (Table 1). The variance and standard deviation (SD) of further independent criteria for protein identification. The the quantities obtained for each amino acid were then cal- program is surprisingly sensitive, and the composition data culated as described in the legend to Table 1. The values alone, if of good quality, usually suggest the correct protein as obtained are in line with those reported by other workers (5) a strong candidate if it or a close homologue is present in the but show that different facilities have rather different accu- database. Further studies show that proteins in the current racies and reproducibilities for each amino acid. To make database have amino acid compositions distinct enough to aflow optimum use of this method, users should probably deter- this method to be generally applicable. The method is a quick mine their own values for Table 1. and cost-effective first step in protein characterization and The amino acid composition data obtained on an unknown should become increasingly useful as the number of fully protein are entered into the FINDER program, each quantity sequenced proteins continues to rise. is corrected by the appropriate error factor shown in Table 1, the mole percentage is calculated, and the corrected values Current PAGE and blotting methods allow the convenient are displayed on the monitor. Table 1 shows that Gly, Met, and inexpensive determination of amino acid composition of and Pro are determined particularly unreliably, as shown by a protein, as well as molecular weight estimation and iso- the high variance and SDs of scores. Val might sometimes be electric point measurement (1). These parameters are much less reliably determined than suggested by Table 1 since it is easier and cheaper to determine than primary amino acid poorly recovered from acid hydrolysates of proteins rich in sequence, the usual method of characterizing unidentified hydrophobic sequences (6). Both MAKE-LIB and FINDER proteins. is a fairly complex multistep therefore determine mole percentage values initially exclud- procedure requiring a considerable amount of time and ing these amino acids. The mole percentage values of these resources, and technical problems frequently occur (1, 2). 4 amino acids are then calculated relative to the total of 100% Here is described an efficient method for protein identifica- for the other 12. Gross errors in any or all of these 4 amino tion based on the more easily determined protein character- acids cannot therefore affect the remaining composition data. istics. FINDER compares the 16 calculated values with the corre- sponding values in each array in the SCORES file. The program gives a score of 2 if the experimental value for a particular MATERIALS AND METHODS amino acid and the corresponding database value are within Programs. The protein sequence database was the Protein 1.5 SDs ofeach other, 1 ifthey are within 3 SDs, and 0 ifthey Identification Resource (PIR) Release 32, containing 40,287 are further than 3 SDs apart (Table 1). Since amino acids entries. Fig. 1 shows in diagrammatic form how the programs present in very small percentage amounts are determined less work. MAKE-LIB counts the number of Asx, Thr, Ser, Glx, accurately, experimentally determined scores below 3 mol % Pro, Gly, Ala, Val, Met, Ile, Leu, Tyr, Phe, His, Lys, and Arg are treated, for the purposes of calculating the range of residues in each protein. These are the amino acids routinely variability tolerated, as if they were 3 mol %. This scoring quantified, counting Asp and Asn together (Asx) and Glu and method is one of more than 50 tested and is a good compro- Gln together (Glx), and not counting Trp and Cys. The mise between speed and sensitivity. percentages of these amino acids (calculated as described Entries in the scoREs file which have a score higher than below) are put into the first 16 elements of an 18-element a preset value (usually 20 points for a preliminary run) are array, the last two elements being loaded with the calculated displayed on the monitor screen along with the calculated molecular mass and calculated isoelectric point of the pro- mean score, SD of scores (see Table 2) and a histogram of tein. If the calculated molecular mass is >5 kDa, the entire score distribution. The programs and data files run on IBM 18-element array is saved on disc as one entry of a file called PC-compatible computers and occupy a total of less than 3.4 SCORES, the process being reiterated for all proteins in the megabytes of disk space. Programs were written in the C database. The 5-kDa cutoff removes 4786 proteins and was language and run in between 1.5 min and 10 sec, depending on the type of computer used. The publication costs ofthis article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" Abbreviations: PVDF, poly(vinylidene difluoride); PIR, Protein in accordance with 18 U.S.C. §1734 solely to indicate this fact. Identification Resource; GFAP, glial fibrillary acidic protein. 5138 Downloaded by guest on September 25, 2021 Biochemistry: Shaw Proc. Natl. Acad. Sci. USA 90 (1993) 5139

Protein Sequence Database (PVDF) membranes were obtained from Bio-Rad or Milli- pore. Amino Acid Analysis. Coomassie brilliant blue-stained pro- MAKE-LIB Program tein bands were excised from PVDF membranes and sealed in vacuo in glass vials containing 6 M HCI (constant boiling; Amino Acid Accession Numbers Pierce) plus 0.1% phenol. Acid hydrolysis was performed at Composition and Identifiers 120°C for 20 hr. Hydrolysates were applied to a Beckman LUbrary-SCORES Ubrary-NAMES system 6300 amino acid analyzer, which separates amino acids on a polystyrenesulfonic acid ion-exchange resin in I I sodium-containing buffers and derivatizes the amino acids Amino add FINDER1 with ninhydrin as they are eluted. Ninhydrin levels are then Composition determined spectrophotometrically. Data were interpreted Data with the Nelson Analytical 2600 chromatograph software package. Candidate Proteins RESULTS FIG. 1. Flow diagram of the operation of programs described The degree of over- or underestimation ofquantitation and the here. reproducibility of determination of each amino acid under the experimental conditions used were determined as described in Proteins. Proteins were obtained from Sigma (bovine ubiq- Table 1 and in Materials and Methods. The determination of uitin, bovine serum , rabbit muscle , pig muscle these two factors allows the systematic correction of amino glyceraldehyde-3-phosphate dehydrogenase, chicken ovalbu- acid composition data and the construction of a scoring min, chicken lysozyme, human carbonic anhydrase, bovine method which expects the best fit for amino acids which we carbonic anhydrase, rabbit phosphorylase b, Escherichia coli can determine very accurately but tolerates a loose fit with ,-galactosidase) or produced locally [porcine neurofilament those which are less accurately determined. To test FINDER, subunits NF-L and NF-M, porcine glial fibrillary acidic pro- several pure proteins were blotted onto PVDF membranes, tein (GFAP), and various unidentified proteins]. The cytoskel- their amino acid compositions were determined, and the data etal pig spinal cord cytoskeletal preparation was produced as were entered into the FINDER program. The program was described (7). ABRF-92 is a protein sample from the Associ- surprisingly sensitive, almost always producing the best match ation of Biomedical Research Facilities, the identity of which with the expected protein (Table 2). FINDER effectively points is kept secret, used to test how accurately amino acid com- to a single candidate based on the amino acid composition positions are being determined nationwide (5). alone, although the SDS/PAGE molecular sizes are also in Electroblotting. Electroblotting was performed at a con- agreement with the calculated molecular sizes of the candi- stant voltage of90 V for 30-120 min from Laemmli gels in the dates. Many ofthe closely matching sequences were obviously presence of 10 mM Mes at pH 6.0, either made 20% methanol related to the target . For instance, the second-best although rather distant match to E. coli 8-galactosidase was for low molecular weight proteins or between 0.1% and ,B-galactosidase from Klebsiella pneumoniae. Similarly, bo- 0.01% SDS for larger proteins. Poly(vinylidene difluoride vine NF-L finds as best matches all four NF-L sequences in the PIR 32 database. Note that the absence ofthe bovine NF-L Table 1. Accuracy and reproducibility of amino acid sequence from the database would clearly not have prevented composition data protein identification. Often members of the same protein Amino superfamilies as the source protein were among those better acid Factor Variance SD matching. For example, data from blots of GFAP produced vimentin, desmin, lamins, and other intermediate-filament Asx 1.01 0.0032 0.0563 subunits as poorer matches than GFAP itself, but nonetheless Thr 1.01 0.0007 0.0273 with scores in the 16-20 range. The relatively low best score Ser 0.94 0.0058 0.0759 obtained by certain ofthese bovine Glx 1.04 0.0028 0.0528 proteins-e.g., in Table 2-indicates that some of the data cannot have been of Pro 1.15 0.0929 0.3048 the highest but that the method was still sensitive Gly 1.10 0.0096 0.0978 quality, enough to the in In the case of Ala 1.04 0.0003 0.0169 identify protein question. ubiquitin, note that no than a the Val 0.87 0.0039 0.0627 protein other member of scored more than so that even the Met 0.68 0.0296 0.1722 ubiquitin family 20, rela- tively low score obtained still suggests as a Ile 0.91 0.0022 0.0464 ubiquitin strong candidate. FINDER also the as Leu 1.04 0.0013 0.0359 suggested expected proteins best candidates in the case Tyr 1.10 0.0014 0.0377 of bovine neurofilament subunit NF-M, bovine GFAP, bovine , chicken oval- Phe 1.00 0.0024 0.0488 bumin, chicken lysozyme, pig glyceraldehyde-3-phosphate His 0.80 0.0061 0.0778 and Lys 1.21 0.0044 0.0662 dehydrogenase, rabbit muscle myosin. In general, lower scores with the were Arg 0.97 0.0061 0.0784 expected candidate associated with lower amounts ofamino acid analyzed; >10 nmol oftotal amino acid Samples of bovine serum albumin, chicken ovalbumin, chicken hydrolyzed usually gave a workable result, but the best data lysozyme, and human carbonic anhydrase were blotted onto PVDF were obtained with >25 nmol, corresponding to about 2.5 pg, three separate times, and the amino acid compositions were deter- or 50 of a mined. The results obtained for each amino acid were compared with pmol, 50-kDa protein. the theoretically expected values, and the average degree ofover- or These studies suggest that the proteins selected have amino under quantitation has been placed in the "Factor" column. The acid compositions unique enough for the program to select values obtained for each amino acid from a particular protein were them from the database. To see whether this is generally true then standardized to give a mean of 1.00. The values for the same of known proteins, a program (called SEEKER) was written amino acid from each protein were then combined and the variance which randomly selects a protein composition from the SCORES and standard deviation (SD) were determined. file and looks for other proteins with similar compositions, Downloaded by guest on September 25, 2021 5140 Biochemistry: Shaw Proc. Natl. Acad. Sci. USA 90 (1993) Table 2. Results obtained from FINDER Molecular Protein analyzed and SDS/PAGE mass, Amount, molecular mass Best candidates Score kDa IEP nmol E. coli f-galactosidase, 115 kDa GBEC; t3-galactosidase, E. coli LacZ 30 116.3 5.1 43.45 A24925; /3galactosidase, Klebsiella pneumoniae 24 117.5 5.5 A39967; Inter-a-trypsin inhibitor, human (fragment) 22 36.4 5.1 Bovine NF-L, 68 kDa QFPGL; neurofilament triplet L protein, pig 26 61.8 4.3 25.93 A25227; neurofilament triplet L protein, mouse 25 61.8 4.3 S07144; neurofilament triplet L protein, human 24 61.7 4.4 QFMSL; neurofilament L protein, mouse (fragment) 20 32.4 4.1 Human carbonic anhydrase, 29 kDa CRHU2; carbonate dehydratase (EC 4.2.1.1) II, human 27 29.1 7.0 55.91 A27175; carbonate dehydratase (EC 4.2.1.1) II, human 27 29.2 7.0 A26386; acidic fibroblast , human (fragments) 23 18.4 7.1 A33879; cytosol aminopeptidase, Saccharomyces cerevisiae 22 57.0 5.5 Ubiquitin, 8 kDa A34080; ubiquitin 14, Dictyostelium discoideum 24 59.6 7.1 22.41 D34080; ubiquitin 1, D. discoideum 24 25.5 7.1 C34080; ubiquitin 2, D. discoideum 24 42.6 7.1 B27806; ubiquitin, D. discoideum 23 25.6 7.1 B34080; ubiquitin 19, D. discoideum 23 42.6 7.1 A31560; polyubiquitin, Drosophila melanogaster 22 25.9 7.1 A26087; ubiquitin, Drosophila melanogaster 22 8.5 7.1 A26437; ubiquitin, human 22 25.7 7.1 A22005; ubiquitin precursor, human 22 77.0 7.1 A27806; ubiquitin (clone p229), D. discoideum 22 42.8 7.1 UQHU; ubiquitin, human 21 8.4 7.1 UQBO; ubiquitin, bovine 21 8.4 7.1 UQFFM; ubiquitin, Mediterranean fruit fly 21 8.4 7.1 UQBY; ubiquitin, S. cerevisiae 21 8.5 7.1 UQNC; ubiquitin precursor, Neurospora crassa 21 34.4 7.1 UQUYSF; ubiquitin, fall armyworm (fragment) 21 8.4 8.1 S04863; ubiquitin precursor, maize (fragment) 21 30.5 7.8 D29456; ubiquitin fusion protein, S. cerevisiae 21 42.8 7.1 A30126; ubiquitin precursor, Caenorhabditis elegans 21 93.9 7.1 S17740; ubiquitin precursor, Phytophthora infestans 21 25.8 7.1 S12577; polyubiquitin, Tetrahymena pyriformis (SGC5) 21 25.5 8.1 S13928; polyubiquitin, chicken 21 25.7 7.1 S12583; polyubiquitin, mouse 21 34.0 7.1 Unidentified spinal cord S04695; (3 chain, Volvox carteri f. 27 49.6 4.6 67.54 cytoskeletal protein, 50 kDa UBKM; tubulin (3 chain; Chlamydomonas reinhardtii 25 49.6 4.5 JQ0177; tubulin X3 chain, Polytomella agilis 24 49.5 4.5 MZ0005; tubulin -2 chain, Polytomella agilis 24 49.5 4.5 S05496; tubulin (3 chain, Euglena gracilis 23 49.4 4.5 B30309; tubulin (8 chain, Euplotes crassus 23 49.9 4.5 S00683; tubulin (3 chain, Stylonychia lemnae 23 49.4 4.6 A37851; phosphoenolpyruvate carboxylase, E. coli 23 51.2 7.6 C36346; fibulin C, human 23 74.4 4.9 A23679; furin precursor, mouse 23 86.7 5.9 Unidentified protein sample KYBOA; chymotrypsin A precursor, bovine 30 25.6 8.0 71.76 ABRF-92, 25 kDa S09959; Ig heavy chain V-D-J region, mouse 20 13.4 7.1 DEZMG3; glyceraldehyde-3-phosphate dehydrogenase A precursor, maize 19 42.8 7.0 KYBOB; chymotrypsin B precursor, bovine 19 25.7 4.8 PS0140; aspergillopepsin A precursor; Aspergillus awamori (fragment) 19 38.9 4.3 F29380; Ig heavy chain precursor V region (fragment) 19 15.0 7.7 JU0340; aspergillopepsin A precursor; Aspergillus awamori 19 41.1 4.3 A21195; chymotrypsinogen 2 precursor, dog 19 27.7 7.0 Only the best scores are shown, and all proteins with the lowest score listed are included. IEP, calculated isoelectric point. Amount, total quantity (nmol) of relevant amino acids analyzed. The variable ubiquitin molecular masses arise from polyubiquitin cDNAs and ubiquitin precursors, which contain multiple 8.4-kDa ubiquitin monomers.

using a variety of scoring methods. Unrelated proteins of proteins were clearly related to the target protein-e.g., the similar amino acid composition are extremely difficult to find, same sequence under a different accession number; the same even with scoring methods looser than that used by FINDER, gene product from a different tissue, strain, or species; or suggesting a surprising degree of uniqueness to amino acid readily comprehensible situations such as a protein and a large composition data. SEEKER performed 100 random searches fragment ofthe same protein. Fifty-four ofthe searches found using the FINDER scoring method, and the resulting 1528 no unrelated protein scoring >20 points. In 28 of those cases proteins scoring .20 were examined in detail. Of these, 651 no other protein, whether related or not, scored >20, so that Downloaded by guest on September 25, 2021 Biochemistry: Shaw Proc. Natl. Acad. Sci. USA 90 (1993) 5141 within the limits defined by the scoring method, the amino acid composition ofthese proteins was unique. In the remaining 46 NF-H ** ?Po *." -* cases the unrelated proteins invariably obtained low scores, as NF-M wwwo shown in Fig. 2. No protein unrelated to the target protein scored >25, and only 2 unrelated proteins scored 25. In contrast, 55 clearly related proteins scored 25, and 201 related proteins scored >25. Of the proteins with low scores, 13.5% - _:y _ 9._ q of the proteins scoring 20 were related to the target, as were 50 kDa g6 31% ofthose scoring 21, 47% ofthose scoring 22, 80%o ofthose scoring 23, 86% ofthose scoring 24, and 96.5% ofthose scoring 25. FINDER scores of 20 and above therefore have significant A B C probability of indicating a genuine match, the likelihood in- 61 62636465666768697071 69 68 creasing dramatically with increasing score. a FIG. 3. (A) SDS/PAGE of solubilized salt- and detergent- Proteins selected by SEEKER that produced significant extracted pig spinal cord cytoskeletal material resolved on DEAE- number of scores with clearly unrelated proteins had amino cellulose; fraction numbers are at the bottom of each lane. An acid compositions close to the average composition of the unidentified protein with an SDS/PAGE molecular size of 50 kDa database (for the PIR 32 database, in mole percentages: Asx, was eluted between neurofflament subunits NF-H and NF-M, sug- 9.84; Thr, 6.05; Ser, 7.43; Glx, 10.77; Pro, 5.39; Gly, 7.41; Ala, gesting an isoelectric point of about 5.0. (B and C) All lanes 7.81; Val, 6.69; Met, 2.37; Ile, 5.58; Leu, 9.45; Tyr, 3.34; Phe, containing the 50-kDa protein are strongly labeled with both mono- 4.11; His, 2.35; Lys, 5.99; Arg, 5.39). The FINDER scoring clonal (B) and polyclonal (C) to 1-tubulin, in line with the algorithm was therefore used to search for proteins matching results from FINDER. this average. Even in this worst-case situation the highest ofthe 58 proteins scoring 20 or more, 29 were f-, and score was 25, obtained by a single protein. Sixteen proteins the 4 best-scoring proteins were all 3-tubulins, scoring in the scored 24, and 32 proteins scored 23. These data suggest that, range 24-27. The SDS/PAGE molecular mass and the iso- given good-quality input, even a protein with an amino acid electric point estimated from ion-exchange chromatography composition very close to the database average would be are also consistent with the 50-kDa protein being a ,B-tubulin, selected on the basis of amino acid composition alone. A although none of the close non-3-tubulin candidates meet corollary of this is that even quite low-quality data would still both these criteria. Finally, (8-tubulin, while not expected, is allow the identification ofthe -50% ofproteins in the current also not an implausible component of the preparation. Later database which are not close in amino acid composition to any experiments showed that antibodies to 3-tubulin gave clear unrelated protein. and strong signals on the 50-kDa band, leaving little doubt The program has been used to aid in the identification of a that this protein is a member of the 3-tubulin family (Fig. 3 growing list of proteins found in a variety of different situa- B and C). A further example is provided by ABRF-92, a tions. The program frequently suggests a strong and plausible protein sample given out by the Association of Biomedical candidate with reasonably matching molecular mass and Research Facilities, the identity of which is kept secret. The isoelectric point, which can then be tested for identity to the amino acid composition determined from ABRF-92 was fed target protein. Here I provide two examples. Salt- and deter- into FINDER, which pointed to bovine chymotrypsinogen B gent-extracted pig spinal cord cytoskeletal preparations con- precursor as clearly and unambiguously the best match tain intermediate-filament subunits as major components, but (Table 2). The Protein Core Facility at the University of microtubule- and microfilament-associated proteins would Florida obtained two sequences corresponding ex- be expected to be extracted (Fig. 3A), so that a protein of actly to the bovine chymotrypsinogen B precursor sequence, 50-kDa apparent molecular mass might be a novel interme- and a telephone call to the Association of Biomedical Re- diate filament-associated protein (7). The amino acid com- search Facilities further confirmed this identification. position from a single PVDF blot was fed into FINDER, which In the case of several other proteins, FINDER found no revealed that the 50-kDa protein had a composition very close matches. Later, partial peptide sequences from two of close to that of the 3-tubulin multigene family (Table 2). Out these showed no similarity to any known protein, suggesting Number of sequences that these proteins are novel, and also indicating that the program tends not to generate false positives. In one case 650- FINDER failed to identify a protein which was later charac- 600- terized by peptide sequencing. In this case, examination of 550 - the original composition data revealed that it matched the *Unrelated expected values very poorly, probably due to contamination or other technical problems. Finally, several published amino 460-450- I Related acid profiles were examined with FINDER. Even though the 400- error and reproducibility factors shown in Table 1 are ex- to be somewhat different for other laboratories, the 350 - pected program usually worked efficiently. For example, the amino 300- acid composition of the 66-kDa protein described by Chiu et 250- al. (8) was entered into FINDER, which found as the single best 200- match a-intemexin, (score of 25 points; next best score, 21 points), in line with the generally held belief that these two 150- proteins are identical (9). 100- 501I DISCUSSION 20 21 22 23 24 25 26 27 28 29 30 31 32 The program described should be a useful addition to the Score current array of scientific software. It is far easier, cheaper, and quicker to obtain amino acid composition data than FIG. 2. Results obtained from 100 runs of the SEEKER program. peptide sequence. The blockage ofthe N-terminal amino acid Downloaded by guest on September 25, 2021 5142 Biochemistry: Shaw Proc. Natl. Acad. Sci. USA 90 (1993) is not a problem for composition analysis, the amount of particular facility. Future improvements in protein hydroly- protein required is quite low, and the recovery ofamino acids sis techniques and amino acid determination will therefore from PVDF membranes should be excellent. Perhaps the naturally increase the resolution of the program. FINDER can most surprising finding reported here is the highly specific be easily modified and made even more sensitive if Cys and nature of the composition data; unrelated proteins have Trp, or separate Glu/Gln and Asp/Asn determinations, be- distinct amino acid compositions, and proteins with the same come more widely performed. It is becoming ever more likely or very similar profiles are invariably closely related. The that the sequence of an unknown protein in a particular composition profile, as routinely determined, is an array of 16 experimental situation has already been determined, so that non-integer numbers whose range in known proteins is quite the protein could be identified by programs like the one surprising. Even if proteins of >20 kDa are selected, the described here. Failure to find a good match with FINDER, ranges in PIR 32 are as follows: Asx, 0-46.08%; Thr, given good-quality amino acid composition data, is also 0-45.3%; Ser, 0-38.69%; Glx, 0-45.78%; Pro, 0-47.5%; Gly, useful information, since it suggests that a protein is novel 0-66.32%; Ala, 0-4.65%; Val, 0-19.18%; Met, 0-15.19%; and worthy of detailed characterization. After the work Ile, 0-23.35%; Leu, 0-29.18%; Tyr, 0-16.76%; Phe, described here was complete, I became aware that previous 0-27.71%; His, 0-64.94%; Lys, 0-32.23%; Arg, 0-32.14%. authors have used amino acid composition data to identify For comparison, note that only 10 integer numbers (i.e., each proteins, although it is fair to state that this approach is not number has only 10 possible values) are sufficient to uniquely well known or widely used and that the method described identify most of tens of millions of telephones in the United here is significantly more refined in several respects (11, 12). States. Three further criteria beyond the composition can Executable versions of FINDER and the NAMES and SCORES also be used in the identification process: the name, calcu- files are available from the author. FINDER has been copy- lated molecular mass, and isoelectric point of each candidate righted. protein. The molecular mass determined from SDS/PAGE is reasonably accurate for most proteins (10), but in some I thank Benne Parten at the Protein Core Facility ofthe University extreme cases may differ from the real molecular mass by as of Florida for performing the hydrolyses and amino acid analyses. much as a factor of 2. Similarly, the isoelectric-point calcu- Julio Hawkins, Laura Errante, Ben Dunn, Nancy Denslow, Jeff lation does not take into account interactions between Harris, and Paul Hargrave provided helpful criticism and other input. charged groups within the molecule or the effects of post- This work was supported by National Institutes of Health Grant translational modification and so is only accurate to within NS22695. about 1 pH unit. However both numbers, used judiciously, 1. Matsudaira, P., ed. (1989) A Practical Guide to Protein and can rule out a large number of proteins from any list of Peptide Purification for Microsequencing (Academic, San Di- candidates. Finally, the candidate should be a protein which ego). could plausibly be found in the experimental situation under 2. LeGendre, N. (1990) BioTechniques 9, 788-805. examination. The composition, molecular weight, isoelectric 3. Ozols, J. (1990) Methods Enzymol. 182, 587-601. point, and context therefore provide four independent crite- 4. Gharahdaghi, F., Atherton, D., DeMott, M. & Mische, S. M. ria which can be used in the identification process. Experi- (1992) Techniques in Protein Chemistry III, ed. Angeileti, R. H. ence to date suggests that one can feel very confident about (Academic, San Diego), pp. 249-260. 5. Strydom, D. J., Tarr, G. E., Pan, Y.-C. E. & Paxton, R. J. the identity ofa protein which matches a particular candidate (1992) Techniques in Protein Chemistry III, ed. Angelleti, R. H. by all four criteria. (Academic, San Diego), pp. 261-274. The finding that a l3-tubulin isotype is a component of the 6. Tsugita, A., Uchida, T., Mewes, H. W. & Ataka, T. (1987) J. salt-extracted spinal cord cytoskeleton is interesting and may Biochem. 102, 1593-1597. point to the presence of an unusually stable form of micro- 7. Shaw, G. & Hou, Z.-H. (1990) J. Neurosci. Res. 25, 561-568. tubules in these preparations. Alternatively, the ,-tubulin 8. Chiu, F. C., Barnes, E. A., Das, K., Haley, J., Socolow, P., may be a contaminant and of no real significance. In either Macaluso, F. P. & Fant, J. (1989) Neuron 2, 1435-1445. case we are now at a stage that would normally have required 9. Fliegner, K. H., Ching, G. Y. & Liem, R. K. H. (1990) EMBO considerable work and expense. J. 9, 749-755. 10. Weber, K. & Osborn, M. 0. (1969) J. Biol. Chem. 244, 4406- The surprising variability of amino compositions ofknown 4412. proteins, the reason that FINDER works, has been conclu- 11. Eckerskorn, C., Jungblut, P., Mewes, W., Klose, J. & Lott- sively demonstrated here. The resolution of FINDER is de- speich, F. (1988) Electrophoresis 9, 830-838. pendent on the empirically determined accuracy and repro- 12. Sibbald, P. R., Sommerfeldt, H. & Argos, P. (1991) Anal. ducibility of amino acid composition data produced by a Biochem. 198, 330-333. Downloaded by guest on September 25, 2021