Linking Yeast Genetics to Mammalian Genomes

Proc. Natl. Acad. Sci. USA
Vol. 90, pp. 10031-10035, November 1993
Genetics

Linking yeast genetics to mammalian genomes: Identification and mapping of the human homolog of CDC27 via the expressed sequence tag (EST) data base

STUART TUGENDREICH*, MARK S. BOGUSKI†, MICHAEL F. SELDIN‡, AND PHILIP HIETER*§

*Department of Molecular Biology and Genetics, The Johns Hopkins University School of Medicine, Baltimore, MD 21205; †National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894; and ‡Departments of Medicine and Microbiology, Duke University Medical Center, Durham, NC 27710

Communicated by Victor A. McKusick, June 30, 1993

ABSTRACT

We describe a strategy for quickly identifying and positionally mapping human homologs of yeast genes to cross-reference the biological and genetic information known about yeast genes to mammalian chromosomal maps. Optimized computer search methods have been developed to scan the rapidly expanding expressed sequence tag (EST) data base to find human open reading frames related to yeast protein sequence queries. These methods take advantage of the newly developed BLOSUM scoring matrices and the query masking function SEG. The corresponding human cDNA is then used to obtain a high-resolution map position on human and mouse chromosomes, providing the links between yeast genetic analysis and mapped mammalian loci. By using these methods, a human homolog of Saccharomyces cerevisiae CDC27 has been identified and mapped to human chromosome 17 and mouse chromosome 11 between the Pkca and Erbb-2 genes. Human CDC27 encodes an 823-aa protein with global similarity to its fungal homologs CDC27, nuc2+, and BimA. Comprehensive cross-referencing of genes and mutant phenotypes described in humans, mice, and yeast should accelerate the study of normal eukaryotic biology and human disease states.

The construction of high-resolution genetic and physical maps of the chromosomes of humans and several model organisms will provide an enhanced ability to identify, analyze, and manipulate genes involved in normal human biology and human disease. These maps will ultimately lead to the correlation of genes with phenotypes, including human disease states. Two model organisms, the budding yeast and the mouse, offer great potential in accelerating the study of gene function. Comprehensive cross-referencing of genes and mutant phenotypes among these three organisms should have an immediate impact on the analysis of eukaryotic biology.

One of the most important consequences of the construction of high-resolution genetic maps is that genes responsible for human disease can be identified and cloned (ref. 1 and references therein). Once cloned, a human disease gene can be used to identify cognate genes in genetically more tractable model organisms such as yeast. In two recent examples, the human CFTR gene (cystic fibrosis) was found to be homologous to STE6 of Saccharomyces cerevisiae (2) and the human NF1 gene (neurofibromatosis) was found to be homologous to yeast IRA1 and IRA2 (3, 4). Discovery of structural similarity to yeast proteins provides an immediate link to yeast genetics, offering the possibility of relating protein sequence to function and providing an experimental paradigm for further analysis. Unfortunately, the connection between yeast and human proteins came very late in the examples cited above, i.e., after both genes had been cloned, sequenced, and placed in the public sequence data bases.

The utility of systematically cloning human homologs of yeast genes and mapping them to mammalian genomes is clear, but the problem is how best to accomplish the cross-phylum cloning step. Standard methods include cross-hybridization of DNA probes or cross-reacting antibodies, complementation of yeast mutations by expression of heterologous cDNAs, and PCR with degenerate oligonucleotides. These methods are laborious, have a limited success rate, and are, therefore, not suitable for the identification of human homologs of yeast proteins on a comprehensive scale.

The most expedient and sensitive means of identifying homology between two proteins is by computer. Search algorithms such as BLAST (5) are capable of detecting much weaker homologies between two genes than any biochemical method and are much faster. The problem with this approach is that it is limited by the availability of human sequences with which to compare yeast proteins. Now, however, advances in automated sequencing technology have led to a rapidly expanding source of data that can be used for this purpose. These data come in the form of expressed sequence tags (ESTs), which are partial cDNA sequences derived from clones that are randomly selected from various cDNA libraries. These data have been organized in a special data base of ESTs (dbEST) (6). Similarities to yeast proteins can be detected using the TBLASTN algorithm, which conceptually translates dbEST sequences in all six reading frames and then seeks statistically significant similarities between these open reading frames (ORFs) and the yeast query sequences. Using this approach, we have identified a human homolog of the S. cerevisiae CDC27 gene in dbEST, confirmed its identity by full-length sequencing,¶ and positioned this gene on the mouse and human genetic maps. To generalize our methodology for identifying the human counterparts of S. cerevisiae gene products, we optimized search parameters on 11 known yeast/human homolog pairs using simulated ESTs derived from the human sequences. In this report, we describe a preliminary strategy for quickly identifying and mapping human homologs of yeast genes, which should facilitate the cross-referencing of yeast molecular genetics with mammalian maps on a genome scale.

Abbreviations: EST, expressed sequence tag; ORF, open reading frame; dbEST, special data base of EST; TPR, tetratrico peptide repeat.
§To whom reprint requests should be addressed.
¶The sequence reported in this paper has been deposited in the GenBank data base (accession no. U00001).
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

MATERIALS AND METHODS

Bioinformatics. The TBLASTN program takes a protein sequence query and compares it against six-frame conceptual
translations of all sequences in a nucleotide sequence data base. TBLASTN is similar to the BLASTX program (7) and both are extensions of the original BLAST algorithm (5). TBLASTN alignments were scored with any of six amino acid substitution matrices including PAM40, PAM120, and PAM250 (8) and BLOSUM80, BLOSUM62, and BLOSUM45 (9). PAM matrices are based upon a model of evolutionary change in proteins (10); BLOSUM matrices have recently been developed based upon conserved blocks from distantly related proteins (9). The statistical significance (P value) of a sequence match is expressed as the probability that such a match could have occurred by chance according to Karlin and Altschul (11) and is calculated by TBLASTN. Many proteins possess "low-complexity" subsequences, i.e., regions of locally "biased" amino acid composition [such as poly(Asn) or poly(Gln) or acidic or basic regions], which complicate the interpretation of data base search results. We used a recently developed method, SEG (12), to identify and mask these distracting subsequences as described in the text. TBLASTN was used to search a human-only subset of dbEST (Prerelease 4, November 1992) containing 9581 human EST sequences (2.9 million nucleotides).‖ The PILEUP and GAP (Needleman-Wunsch algorithm) programs found in the Genetics Computer Group software package were used to compute multiple and pairwise alignments of proteins. The plot of four-way homology in Fig. 2 was generated by the PLOTSIMILARITY program in Genetics Computer Group and then modified in a graphics editor.

Human Chromosome Assignment. A human-rodent somatic cell hybrid panel (NIGMS 2, a gift from A. Scott, Johns Hopkins Medical School) was subjected to PCR using a primer corresponding to the last few codons of the CDC27Hs** ORF (5'-ATGACACACAACTTCAT-3', bases 2490-2506) and a primer hybridizing to the 3' untranslated region (5'-CACGTCAGCACTAGTCA-3', the reverse complement of bases 2564-2580) that amplify a 90-bp product from the CDC27Hs cDNA or human genomic DNA.

Genetic Mapping in the Mouse. C3H/HeJ-gld and Mus spretus (Spain) mice and [(C3H/HeJ-gld x Mus spretus)F1 x C3H/HeJ-gld] interspecific backcross mice were bred and maintained as described (13), and DNA was analyzed as described (14). The CDC27Hs probe was a 1.1-kb PCR product derived from the CDC27Hs cDNA template by using oligonucleotides that correspond to bases 1428-1444 and the reverse complement of bases 2492-2509. Gene linkage was determined by segregation analysis (15). Gene order was determined by analyzing all haplotypes and minimizing crossover frequency between all genes that were determined to be within a linkage group (16).

‖For more information about dbEST, see ref. 6 or send E-mail to [email protected] (type 'help' in the message body and leave subject line blank).
**The correct nomenclature for this human gene is CDC27 but it will be called CDC27Hs hereafter to distinguish it from the S. cerevisiae gene CDC27.

[Figure 1: four-way sequence alignment of nuc2+, EST556, BimA, and CDC27 covering approximate alignment positions 1-159; alignment not reproduced.]

FIG. 1. Four-way alignment of the three fungal homologs to the human EST. Residues in boldface type are those that are conserved across the four proteins. The dot below the alignment corresponds to positions where the human protein is identical to at least two of the three fungal residues at that position. The TBLASTN alignments (maximal segment pairs, ref. 5) between nuc2+/EST556 and BimA/EST556 are shown in boxes. The numbers above the alignment are approximate sequence positions (since gaps are included) generated by the PILEUP program.

[Figure 2: panel A plots similarity against approximate sequence position (0-800), with the N-terminal "unique" domain, the N-terminal TPR, and the C-terminal TPR block labeled; panel B shows schematics of CDC27Hs, BimA, CDC27, and nuc2+; image not reproduced.]

FIG. 2. (A) Plot of similarity along a four-way alignment of full-length CDC27/CDC27Hs/nuc2+/BimA. A sliding window of 17 residues was used and the homology was scored from 0 (unrelated) to 1.5 (identical). (B) Schematic representation of the primary structure of the four homologs CDC27Hs/BimA/CDC27/nuc2+. The figure is to scale and the shading is the same used in A. NNN represents 24 consecutive Asn residues.

RESULTS

Searching dbEST for Homologs of CDC27/nuc2+/BimA. The proteins encoded by the CDC27, nuc2+, and BimA genes are thought to perform the same function(s) in S. cerevisiae, Schizosaccharomyces pombe, and Aspergillus nidulans, respectively, due to their amino acid sequence conservation and the similar mitotic arrest phenotype of cells bearing mutations in these genes (17, 18). The three proteins are members of the TPR gene family whose gene products contain multiple tandem repeats of a 34-aa motif called a TPR (tetratrico peptide repeat) unit (19, 20).

We searched a human-only subset of dbEST with the amino acid sequences encoded by CDC27, nuc2+, and BimA by using the default parameters and scoring matrix (PAM120) used by the TBLASTN data base searching algorithm (version: TBLASTN 1.3.OMP; November 22, 1992) (5). Any low-complexity subsequences in the three queries were masked with SEG prior to searching. An EST (no. 556) derived from a human fetal brain library was shown to be related to nuc2+ with a chance probability (P value) of 0.0025 (raw score = 71) and to BimA with a combined P value of 0.0056 (raw scores = 49, 45, and 40; n = 3, for three separate regions of similarity). In contrast, no EST matched significantly to S. cerevisiae CDC27 using the default parameters of TBLASTN

Rank  PAM120         PAM250         PAM40          BLOSUM45       BLOSUM62       BLOSUM80
      EST    P       EST    P       EST    P       EST    P       EST    P       EST    P
1     3889   0.21    13195  0.31    1432   0.1     3889   0.59    17567  0.084   556    0.04
2     556    0.21    556    0.45    12627  0.19    12594  0.6     556    0.46    12594  0.077
3     17678  0.32    1629   0.67    13318  0.21    19812  0.64    3889   0.55    6797   0.47
4     2633   0.37    1526   0.88    4068   0.22    556    0.73    2633   0.56    3889   0.48
5     2526   0.72    12793  0.93    19957  0.43    13195  0.77    12594  0.79    17567  0.58

FIG. 3. Results of TBLASTN searches of a human-only subset of dbEST with CDC27 masked with SEG set to a window size of 15 residues. The paired columns are the EST identification number and its corresponding P value with the scoring matrix used in each search shown above the top five ESTs returned in each search (as ranked by P value). EST556 (which corresponds to CDC27Hs) is boxed if present. Note that, in this retrospective search, EST12594 and EST17567 gave P values <0.1 by using BLOSUM matrices and are candidates for sequence extension and/or further analysis.
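The six-frame conceptual translation that TBLASTN applies to every EST before scoring, as described in Materials and Methods, can be illustrated with a short sketch. This is not the TBLASTN source code; it is a minimal Python illustration that assumes the standard genetic code, and the function names (reverse_complement, translate, six_frame_translations) are hypothetical names chosen for this example. Codons containing ambiguous bases are rendered as "X", which is also an assumption of the sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (the conventional layout of the standard translation table).
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translated strings would then be compared with the (masked) yeast protein query, so that an EST can be matched regardless of the strand or reading frame in which it was sequenced.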

(November 1992). A four-way amino acid alignment was made between the three fungal sequences and translated human EST556 and is shown in Fig. 1. The alignment revealed that certain amino acids conserved among EST556, nuc2+, and BimA are also maintained in the CDC27 sequence, indicating that EST556 might correspond to a cognate of CDC27/nuc2+/BimA. Importantly, the region of four-way homology is at the N terminus of each protein, upstream of any TPR units, which suggested that this EST did not simply encode a more distantly related TPR family member.

Sequence of CDC27Hs. The EST556 cDNA clone was obtained from the American Type Culture Collection (ATCC 7763) and completely sequenced. The DNA sequence is 2592 bp long and contains a 2472-bp coding region flanked by 52 and 56 bases of 5' and 3' untranslated sequence, respectively. The cDNA sequence predicts an 823-aa ORF with global (end-to-end) similarity to nuc2+/BimA/CDC27. Fig. 2A shows a four-way amino acid similarity plot between the human and fungal homologs of CDC27. The plot shows that there are essentially three conserved domains within the homologs: an N-terminal "unique" domain, an N-terminal TPR (with conservation extending to both sides of the TPR motif), and a C-terminal block of nine tandemly repeated TPRs. The human protein is ≈45% identical to any of the three fungal cognates in the ≈350-aa C-terminal TPR block and roughly 30% identical in the N-terminal ≈250 aa (P value = 1.4 x 10^-66 for full-length CDC27 vs. CDC27Hs; see Table 1). All four proteins contain a central region of 100-250 aa that varies greatly in sequence and length (see Fig. 2B).

Retrospective Analysis of the Search of dbEST with CDC27. We investigated why CDC27 failed to identify EST556 in the initial search. Using its default parameter settings, SEG had appropriately masked the central poly(Asn) tract in CDC27 but also masked a portion of CDC27's N terminus that might be important in detecting significant similarity to the human EST. Therefore, we experimented with different parameter settings and found that increasing the "window size" from 12 to 15 prevented SEG from masking this region of CDC27. The searches of the human ESTs with the less stringently masked CDC27 query were repeated and six scoring matrices were tested. EST556 ranked first with a low P value (P = 0.04) when CDC27 was scored against dbEST using BLOSUM80 (Fig. 3). EST556 also ranked in the top five using four of the five other scoring matrices, but the P values in these cases were not significant (P = 0.2-0.73). The scoring matrix and the degree of masking both seemed to play important roles in finding the human EST with the S. cerevisiae query.

Optimization of EST Data Base Searching Using a "Training Set" of Known Yeast/Human Homologs. The results of the retrospective search of dbEST with CDC27 suggested that it might be possible to increase the general rate of success of detecting human homologs of S. cerevisiae genes in dbEST. For this purpose, we developed a training set of 11 known S. cerevisiae/human cognate pairs whose sequences are moderately to weakly conserved (Table 1). A library of simulated human ESTs was created by dividing each of the 11 human cDNAs into 300-bp segments with 30-bp overlaps. These pseudo-ESTs were combined with actual EST data consisting of all human entries in dbEST (Pre-release 4) and searched with TBLASTN using each of the S. cerevisiae protein sequences masked with SEG set to a window size of 15. As above, six scoring matrices were tested. The results are graphically represented by plotting the P value obtained for the homology score as a function of the location of each pseudo-EST segment along the human cDNA (Fig. 4).

Ranking of a pseudo-EST that was marginally conserved to a region of a yeast query was highly dependent on the scoring matrix used. Fig. 5 shows rank order for 3 of the 14 marginal cases analyzed. For each yeast query, the rank order and P

Table 1. List of the human/yeast homolog pairs used to test search parameters

S. cerevisiae             Human                                Needleman-Wunsch    TBLASTN
Gene symbol  Accession    Gene                  Accession      % identity (gaps)   Score   P value     n
STE6         M26376       MDR3                  M23234         26 (18)             211     9.1 E-41    2
STE6         M26376       CFTR                  M28668         23 (48)             126     3.4 E-14    3
SRM1*        P21827       RCC1                  X06130         32 (16)             93      5.3 E-20    3
IRA2*        P19158       NF1                   M89914         27 (15)             130     2.2 E-27    4
GPH1         X04604       Liver phosphorylase   M14636         51 (13)             386     3.5 E-117   3
TOP2*        M13814       Topoisomerase II      J04088         45 (25)             727     3.5 E-144   10
TOP1*        K03077       Topoisomerase I       J03250         49 (16)             587     7.3 E-82    3
CDC17        P13382       DNA polymerase α      X60745         35 (43)             499     7.9 E-66    1
POL30        P15873       PCNA                  J04718         36 (1)              468     1.6 E-61    1
CDC9         P04819       DNA ligase            M36067         42 (9)              585     3.1 E-79    1
MCM3*        X53540       P1.h protein          X62153         37 (10)             141     3.4 E-13    1
CDC27*                    CDC27Hs               U00001         31 (16)             494     1.4 E-66    1

P value expressed as 9.1 E-41 is 9.1 x 10^-41, for example. The values indicated were computed by TBLASTN using PAM120 when the full-length human cDNAs were placed in the human-only dbEST data base and searched with the yeast protein queries masked with SEG at a window size of 15. Yeast query sequences that have an asterisk are those that contained low-complexity subsequences that SEG masked. n is the number of maximal segment pairs between the query and subject sequences. CDC27 vs. CDC27Hs is shown for comparison.
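The construction of pseudo-ESTs from the training-set cDNAs (300-bp segments with 30-bp overlaps, numbered consecutively from 1) can be sketched as follows. This is an illustration only, not the procedure actually used to build the library: the names PseudoEST and make_pseudo_ests are hypothetical, and keeping a final segment shorter than 300 bp at the 3' end is an assumption, since the text does not say how the trailing bases were handled.

```python
from typing import Iterator, NamedTuple


class PseudoEST(NamedTuple):
    number: int    # consecutive pseudo-EST number, starting from 1
    start: int     # 1-based position of the first base in the cDNA
    sequence: str  # segment of the cDNA (up to `length` bases)


def make_pseudo_ests(cdna: str, length: int = 300, overlap: int = 30) -> Iterator[PseudoEST]:
    """Cut a cDNA into consecutive segments of `length` bases that share
    `overlap` bases with their neighbors.

    Assumption of this sketch: a final segment shorter than `length`
    is kept rather than discarded.
    """
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap  # 270 bases with the values given in the text
    start = 0
    number = 1
    while start < len(cdna):
        yield PseudoEST(number, start + 1, cdna[start:start + length])
        if start + length >= len(cdna):
            break  # this segment already reaches the 3' end
        start += step
        number += 1


if __name__ == "__main__":
    # Placeholder sequence of 2592 bases (the length reported for the CDC27Hs cDNA);
    # the bases themselves are synthetic.
    cdna = ("ACGTTGCAAG" * 260)[:2592]
    for est in make_pseudo_ests(cdna):
        print(est.number, est.start, len(est.sequence))
```

With these settings the second pseudo-EST begins at base 271, which agrees with the numbering convention given in the legend to Fig. 4.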

value of the corresponding pseudo-EST are shown. In general, BLOSUM62 and BLOSUM80 revealed more significant matches to the test ESTs than the other matrices. Of 14 instances of marginal homologies between the pseudo-EST and the yeast query (P > 0.01), BLOSUM62 and BLOSUM80 performed better in 10 instances, in 3 instances the PAM and BLOSUM matrices worked equally well, whereas a PAM matrix (PAM250) worked best only once. Also, ranking the alignments by their statistical significance (P value) rather than their raw score aided in the interpretation of search results (data not shown).

Mapping CDC27 in Mice and Humans. The CDC27Hs gene was mapped by two criteria: (i) chromosome assignment in humans (21) and (ii) genetic map position in the mouse.

To assign CDC27Hs to a human chromosome, oligonucleotide primers were used that yielded a 90-bp CDC27Hs-specific PCR product using human, but not rodent, genomic DNA as template DNA. Genomic DNA from a panel of somatic cell hybrids, each containing a single human chromosome in a rodent background, was assayed by PCR. Only DNA from the cell line containing human chromosome 17 yielded a PCR product of the expected size. We concluded that CDC27Hs maps to human chromosome 17.

Using a mouse "interspecific" backcross panel developed by one of us (M.F.S.), the CDC27Hs gene was mapped between the Erbb-2 and Pkca genes. Informative restriction fragment length variants were detected in Southern blots with Pvu II-digested parental DNAs (C3H/HeJ-gld = 24.0 kb; Mus spretus = 18.0 kb and 5.5 kb). Each of 114 offspring DNAs analyzed displayed either the homozygous or heterozygous pattern when hybridized with the CDC27Hs probe. Examination of the haplotype distribution of the Cdc-27 locus indicated that in 111 of the 114 meiotic events examined, the Cdc-27 locus cosegregated with the protein kinase Cα locus on mouse chromosome 11 as depicted in Fig. 6A. The best gene order indicated that Cdc-27 is located 3.5 ± 1.7 centimorgans distal to Erba/Erbb-2 and 2.6 ± 1.5 centimorgans proximal to Pkca. A large block of mouse chromosome 11 encompassing this region has been conserved on human chromosome 17 (Fig. 6B). We infer that CDC27Hs maps between the ERBB2 and PRKCA genes on human chromosome 17q21-24.

[Figure 4: plot of P value (logarithmic y axis) against pseudo-EST number; image not reproduced.]

FIG. 4. Variation of P value vs. pseudo-EST number for an example from the training set. The P values of alignments of translated segments of a human cDNA to its related yeast protein are plotted on the y axis on a logarithmic scale. Each pseudo-EST (300 bp, each with 30-bp overlaps) along the length of the human cDNA is given a consecutive number starting from 1. For example, pseudo-EST 2 would begin at base 271 of the cDNA sequence. The plot is of human DNA polymerase α pseudo-ESTs searched with the CDC17 query.

[Figure 6: panel A tabulates the haplotypes at the Erbb-2, Cdc-27, and Pkca loci, with 56, 51, 2, 2, 3, and 0 mice in the six haplotype classes; panel B shows the map of mouse chromosome 11 alongside the human syntenic regions; image not reproduced.]

FIG. 6. (A) Segregation of Cdc-27 among mouse chromosome 11 loci in [(C3H/HeJ-gld x Mus spretus)F1 x C3H/HeJ-gld] interspecific backcross mice. Solid boxes, homozygous C3H pattern; open boxes, F1 pattern. (B) Map of mouse chromosome 11 and corresponding human syntenic regions. The mouse loci in boldface type are those that have been typed on the interspecific backcross panel used in this study. The inferred position of CDC27Hs on human chromosome 17q21-24 is shown at the right of the human map.

DISCUSSION

We have formulated a strategy for quickly identifying and positionally mapping human homologs of yeast genes to cross-reference yeast genetic analysis and mammalian chromosomal maps. The rapidly expanding EST data base serves as the source of candidate human genes that can be scanned using optimized computer search methods to find ORFs related to yeast protein sequences.

Many groups are randomly sequencing cDNAs from various human cDNA libraries to clone genes with no a priori information about their function or genomic locations (22, 23). Up to 70% of the sequences that have been obtained to date have shown no significant homology to any sequences in the public data bases using the default parameters of the most widely used search program. Indeed, the homology between EST556 and CDC27/nuc2+/BimA was not previously detected. We found that a combination of the matrix used to score alignments and the degree to which a query sequence is masked significantly affects whether a homologous EST is found for that sequence. Searching a concentrated source of potential coding sequences (dbEST) with each yeast protein query has proven to be more successful than searching a large complex data base such as GenBank. Another reason to search in this way is that half of the ≈2000 known yeast protein sequences are not yet publicly available and would be
missed by those who search the public data bases with six-frame translations of each EST.

Yeast   Human        No. of real ESTs     P value of   Scoring    ID numbers of ESTs that ranked in top four
query   pseudo-EST   that ranked higher   pseudo-EST   matrix     1              2              3       4
CDC17   Pol Alpha #16   12                0.99         BLOSUM45   4347           12770          12951   3752
CDC17   Pol Alpha #16   1                 0.23         BLOSUM62   4347           Pol Alpha #16  12951   3752
CDC17   Pol Alpha #16   0                 0.06         BLOSUM80   Pol Alpha #16  4347           19184   12951
CDC17   Pol Alpha #16   3                 0.89         PAM120     13397          18693          19037   Pol Alpha #16
CDC17   Pol Alpha #16   29                1            PAM250     18693          4749           4479    18903
CDC17   Pol Alpha #16   9                 0.91         PAM40      4207           19012          3715    1353
CDC9    Ligase #3       -                 -            BLOSUM45   75             4508           6619    19377
CDC9    Ligase #3       7                 0.92         BLOSUM62   75             19377          6971    20277
CDC9    Ligase #3       0                 0.067        BLOSUM80   Ligase #3      19377          20277   6619
CDC9    Ligase #3       -                 -            PAM120     4660           6619           13274   1927
CDC9    Ligase #3       -                 -            PAM250     4508           75             18089   1798
CDC9    Ligase #3       -                 -            PAM40      4660           13274          1187    19377

[Entries shown as "-" are not legible in the source.]

FIG. 5. Two examples of the effect of scoring matrix on the rank order and P value of weakly conserved pseudo-ESTs. The divergent pseudo-ESTs (boxed) were followed in TBLASTN searches of human-only dbEST with the yeast homolog as query and using each of the six scoring matrices.

We created a "training set" of known human homologs of yeast genes to have positive controls in tests of search strategies. By using a data base of human ESTs "doped" with "pseudo-ESTs" derived from the human homologs, we found that the recently derived BLOSUM matrices generally work better than the PAM matrices used almost universally for data base searching, especially in cases where the homology between the yeast and human proteins is weak. Indeed, using the optimized search parameters, S. cerevisiae CDC27 was able to significantly match the cognate EST whereas the November 1992 default parameters were unable to do this. (Note that the BLASTN default scoring matrix has subsequently been changed to BLOSUM62.) We also found that the degree of significance can vary greatly from one EST to the next, with certain regions being extremely well conserved (P ≈ 10^-50) and other regions being so divergent that insignificant P values were obtained. This variability in conservation implies that if one identifies a human EST that is weakly related to a yeast protein query, sequencing a few hundred more bases of the clone should be sufficient to establish whether the similarity is spurious or indicative of true homology.

Having found a human homolog of a yeast gene, the next step in cross-referencing to mapped mammalian phenotypes is to determine the genomic location of the human cDNA. We mapped CDC27Hs by using a combination of single human chromosome assignment by PCR (21) of a somatic cell hybrid panel and subchromosomal assignment in mouse using an interspecific backcross panel of DNAs (24, 25). These two methods are simpler than other available mapping methods, provide a high-resolution position, and can be used to quickly map large numbers of human genes. CDC27Hs mapped to human chromosome 17q21-24, which made it a candidate gene for early-onset breast cancer. The CDC27Hs locus has been shown independently to be distal to the BRCA1 locus (L. Brody, P. Ho, K. Abel, B. Weber, and F. Collins; L. Friedman, E. Lynch, and M. C. King; Y. Miki, J. Swensen, and M. Skolnick; personal communications).

A potential concern in the context of an EST search protocol is the extent to which "significant" matches identify functional homologs (orthologous proteins) vs. family members (proteins that perform related but distinct functions). In our view, this does not pose a problem. It seems likely that while some of the estimated 50,000-100,000 genes in mammals will have counterparts with identical function in yeast, the vast majority will be related to yeast proteins as members of a gene family that have analogous but not identical functions. In these cases, the function of the yeast protein might suggest a function for the human family member. The STE6 (yeast)/CFTR (human) relationship provides an excellent example. The predicted amino acid sequence of CFTR, which is an ion channel, is strikingly similar to the yeast STE6 protein (ref. 2 and also Table 1), which is a transmembrane protein involved in pumping mating pheromone out of the yeast cell (26). The two proteins are not identical but are clearly similar in function, and this link has promoted the application of yeast genetic studies to the analysis of CFTR structure and function (see ref. 27). Furthermore, had an EST corresponding to CFTR been identified using STE6 as query and mapped to human and mouse chromosomes, the initial identification of the CFTR gene would have occurred at a great savings of time and resources.

Cross-referencing of yeast genetic loci to the human and mouse genetic maps not only should facilitate the initial identification of disease genes but also should provide immediate functional information about corresponding gene products. Positional mapping of human cDNAs related to well-characterized yeast genes should provide a resource of "candidate genes" potentially responsible for phenotypes being mapped throughout the human and mouse genomes. Knowledge of phenotypes in yeast may often suggest priorities by which corresponding candidate gene probes could be directly tested for cosegregation with the mammalian mutant phenotype. Furthermore, identification of mammalian homologs of yeast genes permits study of the evolutionary conservation of genes and the use of available experimental advantages of mammalian cells (e.g., superior cytology) for analysis of yeast proteins. The development and implementation of methods for identifying (and subsequently mapping) human homologs of yeast genes via the EST data base should facilitate cross-referencing human, mouse, and yeast genes on a large scale.

We gratefully acknowledge F. Spencer and R. Reeves for insights and suggestions during the development of the yeast/mammalian cross-referencing scheme and for critical reading of the manuscript. S.T. is supported by National Institutes of Health Training Grant 5T32CA09139 and P.H. is supported by an American Cancer Society Faculty Research Award. This work was supported by National Institutes of Health Grants CA16519 (P.H.) and HG-00101 (M.F.S.).

1. Ballabio, A. (1993) Nature Genet. 3, 277-279.
2. Hyde, S. C., Emsley, P., Hartshorn, M. J., Mimmack, M. M., Gileadi, U., Pearce, S. R., Gallagher, M. P., Gill, D. R., Hubbard, R. E. & Higgins, C. F. (1990) Nature (London) 346, 362-365.
3. Ballester, R., Marchuk, D., Boguski, M., Saulino, A., Letcher, R., Wigler, M. & Collins, F. (1990) Cell 63, 851-859.
4. Xu, G. F., Lin, B., Tanaka, K., Dunn, D., Wood, D., Gesteland, R., White, R., Weiss, R. & Tamanoi, F. (1990) Cell 63, 835-841.
5. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol. 215, 403-410.
6. Boguski, M. S., Lowe, T. M. J. & Tolstoshev, C. M. (1993) Nature Genet. 4, 332-333.
7. Gish, W. & States, D. J. (1993) Nature Genet. 3, 266-272.
8. Altschul, S. F. (1991) J. Mol. Biol. 219, 555-565.
9. Henikoff, S. & Henikoff, J. (1992) Proc. Natl. Acad. Sci. USA 89, 10915-10919.
10. Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978) in Atlas of Protein Sequence and Structure, ed. Dayhoff, M. O. (Natl. Biomed. Res. Found., Washington, DC), pp. 345-358.
11. Karlin, S. & Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87, 2264-2268.
12. Wootton, J. C. & Federhen, S. (1993) Comput. Chem. 17, 149-163.
13. Seldin, M. F., Morse, H. C., Reeves, J. P., Scribner, J. P., LeBoeuf, R. C. & Steinberg, A. D. (1988) J. Exp. Med. 167, 688-693.
14. Maniatis, T., Fritsch, E. F. & Sambrook, J. (1982) Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Lab. Press, Plainview, NY).
15. Green, E. L. (1981) in Genetics and Probability in Animal Breeding Experiments, ed. Green, E. L. (Macmillan, New York), pp. 77-113.
16. Bishop, D. T. (1985) Genet. Epidemiol. 2, 349-361.
17. O'Donnell, K., Osmani, A., Osmani, S. & Morris, N. (1991) J. Cell Sci. 99, 711-719.
18. Hirano, T., Hiraoka, Y. & Yanagida, M. (1988) J. Cell Biol. 106, 1171-1183.
19. Sikorski, R. S., Boguski, M. S., Goebl, M. & Hieter, P. (1990) Cell 60, 307-317.
20. Goebl, M. & Yanagida, M. (1991) Trends Biochem. Sci. 16, 173-177.
21. Wilcox, A. S., Khan, A. S., Hopkins, J. A. & Sikela, J. M. (1991) Nucleic Acids Res. 19, 1837-1843.
22. Sikela, J. M. & Auffray, C. (1993) Nature Genet. 3, 189-191.
23. Davies, K. (1993) Nature (London) 364, 554.
24. Saunders, A. M. & Seldin, M. F. (1990) Genomics 8, 524-535.
25. Watson, M. L., D'Eustachio, P., Mock, B. A., Steinberg, A. D., Morse, H. C., Oakey, R. J., Howard, T. A., Rochelle, J. M. & Seldin, M. F. (1992) Mammal. Genome 2, 158-171.
26. Kuchler, K., Sterne, R. E. & Thorner, J. (1989) EMBO J. 8, 3973-3984.
27. Teem, J. L., Berger, H. A., Ostedgaard, L. S., Rich, D. P., Tsui, L.-C. & Welsh, M. J. (1993) Cell 73, 335-346.