1544 Current Biology 1996, Vol 6 No 12

sequence tags as described here relies on all proteins from an organism being in sequence databases. In this manner, if only one protein within a given pI and mass range is found with a certain amino- or carboxy-terminal sequence tag, one can be confident that there is no other, as yet undescribed, protein that could otherwise match the tag. In fully sequenced organisms, the procedure is thus self-checking.

The specificity of sequence tags may be an issue in larger organisms: whereas there are (for example) 3 200 000 combinations of five amino-acid tags, protein amino termini have biased sequences and many amino termini are shared. However, protein carboxyl termini have almost random sequences (data not shown) so their sequence tags should be more specific. Other factors to consider will be the accuracy of sequence data that can be obtained from proteins purified from two-dimensional gels, and the accuracy of prediction of protein open reading frames in genome/proteome databases. Large-scale protein characterization projects will define the effect of these factors and thus the utility of sequence tags for protein identification.

References
1. Wilkins MR, Ou K, Appel RD, Sanchez J-C, Yan JX, Golaz O, et al.: Rapid protein identification using N-terminal “sequence tag” and amino acid analysis. Biochem Biophys Res Commun 1996, 221:609–613.

Address: Central Clinical Chemistry Laboratory, Geneva University Hospital, 24 Rue Micheli-du-Crest, 1211-Geneve 14.
E-mail: [email protected]

The editors of Current Biology welcome correspondence in response to any article in the journal, but reserve the right to reduce the length of any letter to be published. Items for publication should either be submitted typed, double-spaced, or sent by electronic mail. They should include a full contact address, with phone and fax numbers.

Fibronectin type III domains in yeast detected by a hidden Markov model

Alex Bateman and

Proteins containing fibronectin type III (FnIII) domains play a central role in many intercellular processes: they are part of many cell-surface receptors, adhesive matrix proteins and cell adhesion molecules. FnIII domains are also found in the giant muscle proteins, titin and twitchin. The occurrence of FnIII domains in prokaryotes led to speculation that these domains existed in the last common ancestor of prokaryotes and animals [1]. However, it has been argued that the currently known prokaryotic examples were obtained from a horizontal transfer of a single domain from animals; that is, that this protein arose late in evolution, and is unlikely to occur in plants or fungi [2,3].

Here, we report evidence that three fungal proteins, L8543.18 and YEF3_YEAST of Saccharomyces cerevisiae and the L8543.18 homologue from Schizosaccharomyces pombe, contain FnIII domains. The evidence for this comes from a hidden Markov model (HMM) [4,5] of the amino-acid residues that determine the FnIII protein fold, and is supported by other calculations.

From alignments of the sequences of a protein family, an HMM can be built to encode the probabilities of different residues occurring at particular sites. The model can then be used to detect other sequences that are likely to be very distant members of the protein-fold family [6]. We built an HMM from a multiple alignment of 434 FnIII domains, and used it to search for FnIII domains in the yeast protein database release 4.1 [7].

Residues 76–166 of the sequence L8543.18, and residues 35–125 of YEF3_YEAST, matched the HMM with scores of 39.5 and 21.5 bits, respectively. We found a homologue of L8543.18 in cosmid c6G9 of the genomic data for S. pombe, using the program tblastn [8] (see Fig. 1a). Residues 77–167 of this sequence match the FnIII HMM with a score of 32.4 bits. The HMM score is the logarithm to base 2 of the probability of the sequence matching the HMM, divided by the probability of a randomly generated sequence matching the HMM. The next highest match, SVS1, scored 9.4 bits. We would expect a score of 12 bits to be significant against a database of this size. HMM scores of 39 and 21 bits are highly reliable indicators of sequence homology, in our experience.

We therefore expect that the yeast domains have an FnIII-like fold. However, we did try other methods to verify our results: database searches with BLASTP [8], key residues analysis [9], and PHD [10] secondary structure prediction.

BLASTP [8] found matches between the ‘FnIII’ sections of the yeast proteins and known animal FnIII domains in the SWISS-PROT database [11]. The top match against the L8543.18 protein was the FnIII-containing receptor tyrosine kinase KEK4_CHICK, with a p value of 7.5 × 10⁻⁴. The best match against YEF3_YEAST was FINC_CHICK, the fibronectin protein in chicken, with a p value of 2.0 × 10⁻³. Note that these matches were found using only the ‘FnIII’ portion of the two S. cerevisiae proteins. If the whole sequence of L8543.18 is used, the first protein with an FnIII domain to be matched is NCA1_MOUSE at rank position 284, with a p value of 0.56; for the whole sequence of YEF3_YEAST, the first such match is NCA2_XENLA at rank position 380 and with a p value of 0.96. Routine BLASTP analysis would not, therefore, find the yeast FnIII-like sequences.

Key residues are those that, through their packing, hydrogen

Figure 1

(a) The sequence of the putative S. pombe homologue of S. cerevisiae L8543.18, translated from genomic cosmid c6G9. (b) Alignment of yeast FnIII domains to the FnIII domains of neuroglian (NGd1 and NGd2) and tenascin (Ten) [12,13]. The S. pombe homologue of L8543.18 is denoted Pombe. Regions in the yeast sequences that are expected to have the same structure as one of the known structures are shown in upper case; regions expected to differ in conformation are shown in lower case. The secondary structure (marked ‘Strand’) of the known FnIII domain structures is shown, as is the PHD secondary structure prediction (marked ‘PHD’) [10] of the single yeast domains: residues predicted to be in an extended conformation are marked by - and those predicted to be in a helical conformation by +. Key residues important for the structure of the FnIII fold are shown in red [9]. The chemical nature of these sidechains is largely conserved, implying structural similarity of the yeast domains to the known structures. (c) The location of FnIII domains within the S. cerevisiae proteins.

(a)
  1 MDDTNQFMVS VAKIDAGMAI LLTPSFHIIE FPSVLLPNDA TAGSIIDISV HHNKEEEIAR  60
 61 ETAFDDVQKE IFETYGQKLP SPPVLKLKNA TQTSIVLEWD PLQLSTARLK SLCLYRNNVR 120
121 VLNISNPMTT HNAKLSGLSL DTEYDFSLVL DTTAGTFPSK HITIKTLRMI DLTGIQVCVG 180
181 NMVPNEMEAL QKCIERIHAR PIQTSVRIDT THFICSSTGG PEYEKAKAAN IPILGLDYLL 240
241 KCESEGRLVN VSGFYIENRA SYNANASINS VEAAQNAAPN LNATTEQPKN TAEVAQGAAS 300
301 AKAPQQTTQQ GTQNSANAEP SSSASVPAEA PETEAEQSID VSSDIGLRSD SSKPNEAPTS 360
361 SENIKADQPE NSTKQENPEE DMQIKDAEEH SNLESTPAAQ QTSEVEANNH QEKPSSLPAV 420
421 EQINVNEENN TPETEGLEDE KEENNTAAES LINQEETTSG EAVTKSTVES SANEEEAEPN 480
481 EIIEENAVKS LLNQEGPATN EEVEKNNANS ENANGLTDEK IIEAPLDTKE NSDDDKPSPA 540
541 AAEDIGTNGA IEEIPQVSEV LEPEKAHTTN LQLNALDKEE DLNITTVKQS SEPTADDNLI 600
601 PNKEAEIIQS SDEFESVNID

(b)
Strand    A------A B-----B C------C
NGd1      DVP.NAP..KLTGITC.QA.DKAEIHWE...QQGDNRSPI.....LHYTIQFNTS.
NGd2      .PDVPFKNP.DNVVG.QGTQP.NNLVISWT..PMPEIEHNAPN....FHYYVSWKRD.
Ten       .RL.DAP.SQIEV.KDVTD.TTALITWF.....KPLAEI...... DGIELTYGIK.
L8543.18  THKP.ESP..VLKI.VNVTQ.TSCVLAWD..plkl...gsak....LKSLILYRKG.
PHD       ------
Pombe     QKLP.SPP..VLKL.KNATQ.TSIVLEWD..plql...star....LKSLCLYRNN.
PHD       ------
YEF3      KIKTP.PAT..KVSI.DKIAT.DSVTIHWEnepvkaedngsadrnfiSHYLLYLNNT.
PHD       ------++++++++------

Strand    ----C' E-----E F------F G------G
NGd1      FTPASWDAAYEK..V..PNTDSSFVVQ..MSPW.ANYTFRVIAFNK..IGASPPSASSD.SCTTQ
NGd2      .IPAAAWENNN...I.FDWRQNNIVIA.DQPTF.VKYLIKVVAIND..RGESN.VAAEEVVGYSG
Ten       .KVPGDRTTID...L..TEDENQYSIG.NLKPD.TEYEVSLISR....RGDMS.SNPAKETFTT.
L8543.18  ....irsmvipnp.F....KVTTTKIS.GLSVD.TPYEFQLKLITT..SGTLW.SEKV..ILRTH
PHD       ------
Pombe     ....vrvlnisnp.M....TTHNAKLS.GLSLD.TEYDFSLVLDTT..AGTFP.SKHI..TIKTL
PHD       ------
YEF3      ..qlaifpnnpns.L.....YTCCSIT.GLEAE.TQYQLDFITINN..KGFIN.KPSI..YCPTK
PHD       ------++ +++ ------

(c)
L8543.18    FnIII; residue positions marked: 76, 166, 277, 671
YEF3_YEAST  FnIII; residue positions marked: 35, 136, 625, 694, 956
Key: low complexity region; high complexity region; FnIII domain. Scale bar: 100 amino acids.
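The bit score defined in the text, the base-2 logarithm of the ratio between the probability of a sequence under the HMM and its probability under a random model, can be illustrated with a short sketch. The function names and the input log-probabilities below are hypothetical and chosen only for illustration; they are not taken from the HMM software used in this work. The sketch assumes the two probabilities are already available as base-2 logarithms, which avoids numerical underflow for sequences of realistic length.

```python
def bit_score(log2_p_model: float, log2_p_random: float) -> float:
    """Log-odds score in bits: log2( P(seq | HMM) / P(seq | random model) ).

    Both arguments are base-2 log-probabilities, so the ratio inside the
    logarithm becomes a simple difference.
    """
    return log2_p_model - log2_p_random


def odds_ratio(bits: float) -> float:
    """Convert a bit score back into the likelihood ratio it represents."""
    return 2.0 ** bits


# Hypothetical log2-probabilities, for illustration only (not from the paper).
example_bits = bit_score(-100.0, -139.5)
```

On this definition each additional bit doubles the likelihood ratio, so a score of 12 bits corresponds to a ratio of 2^12 = 4096 in favour of the HMM over the random model.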

bonds or unusual torsion angles, play the major role in determining the three-dimensional structure of a protein. These residues tend to be strongly conserved, in type if not in identity, over long evolutionary periods, and they can be used to detect distant evolutionary relationships. We have defined the key residues of several FnIII domains of known structure [9]. Figure 1b shows a comparison of the residues at key sites in the yeast FnIII domains with those in three domains of known structures [12,13]. The yeast sequences aligned to the known structures share identical residues at only 8–16 sites. Inspection of the alignment, however, shows that, to a very large extent, the key residues in the core of the known structures are the same or conservatively substituted in the two new sequences. The key residues in the core of each of the predicted S. cerevisiae FnIII domains are shown schematically in Figure 2. The key residues of the S. pombe homologue of L8543.18 are the same as, or very similar to, those found in the S. cerevisiae form. All the residues found at core sites of the yeast domains can be found in the sequences of animal FnIII domains.

The known FnIII structures contain two β sheets, one with three strands (ABE) and the other with four (GFCC′). The sequences of the yeast domains were submitted to the PHD secondary structure prediction server [10]. For each of

Figure 2

The conservation of core residues between neuroglian domain 2 and S. cerevisiae YEF3_YEAST and L8543.18. Strands A, B and E are shown in pink. Strands G, F, C and C′ are shown in black. The buried core residues are shown in boxes, using the single-letter amino-acid code. Hydrogen bonds are denoted by thin horizontal lines. Other residues are represented by ovals. The C′ strand is found to assume varied conformations in the known structures, so the yeast sequences are not shown.

[Diagram: three panels labelled Neuroglian domain 2 (residues 713 to 810; strands G, A, F, B, C, E, C′), L8543.18 (residues 79 to 165; strands G, A, F, B, C, E) and YEF3_YEAST (residues 39 to 135; strands G, A, F, B, C, E).]
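The count of identical sites between two aligned sequences, referred to in the text, can be sketched as follows. This is a minimal illustration rather than the procedure used in this work: the function name, the treatment of '.' as the gap character, the case-insensitive comparison and the two short fragments are all assumptions made for the example.

```python
def count_identities(row_a: str, row_b: str, gap: str = ".") -> int:
    """Count alignment columns in which both rows carry the same residue.

    Columns containing a gap in either row are skipped. Comparison ignores
    case, because lower case in the alignment marks regions of expected
    conformational difference, not different residues.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(
        1
        for a, b in zip(row_a, row_b)
        if a != gap and b != gap and a.upper() == b.upper()
    )


# Hypothetical short aligned fragments, for illustration only.
fragment_a = "VLKI.VNVTQ"
fragment_b = "VLKL.KNATQ"
identities = count_identities(fragment_a, fragment_b)
```

Counting conservative substitutions at key sites, as opposed to strict identities, would additionally require a grouping of residues by chemical type, which is not attempted here.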

the yeast proteins, PHD predicts strand conformations in regions equivalent to the A, B, C, E, F and G strand regions in the known structures (Fig. 1b). PHD has an expected accuracy of above 72%, based on predictions of known structures [10], and the predictions for the yeast sequences agree very well with those for proteins of known structure.

Using the seg program [14], much of the yeast proteins’ sequence is found to be of low complexity (see Fig. 1c). The yeast protein database notes that L8543.18 protein is similar in sequence to neurofilament triplet H protein, whereas Genequiz [15] notes that it is most similar to a circumsporozoite protein from Plasmodium yoelii. However, these similarities are in low complexity regions, so the sequence similarity is unlikely to be of biological relevance. The function of L8543.18 is uncertain, but it is known to be required for chitin synthase III activity [16]. The function of YEF3_YEAST is at present unknown.

The evidence for the proposal of horizontal transmission of eukaryotic FnIII to bacteria is based on two arguments: there are examples of related bacterial proteins which lack the FnIII domains, and phylogenetic analysis suggests that the FnIII domains were acquired recently [2,3]. In the case of yeast FnIII domains, no homologues lacking FnIII domains have been found. Also, at least in the case of L8543.18, it is found in both S. pombe and S. cerevisiae, which diverged some 1 000 million years ago [17]. So, at present, it is reasonable to assume that the FnIII domain was present in the common ancestor of yeast and animals.

References
1. Watanabe T, Suzuki K, Oyanagi W, Ohnishi K, Tanaka H: Gene cloning of chitinase A1 from Bacillus circulans WL-12 revealed its evolutionary relationship to Serratia chitinase and to the type III homology units of fibronectin. J Biol Chem 1990, 265:15659–15665.
2. Bork P, Doolittle RF: Proposed acquisition of an animal protein domain by bacteria. Proc Natl Acad Sci USA 1992, 89:8990–8994.
3. Little E, Bork P, Doolittle RF: Tracing the spread of fibronectin type III domains in bacterial glycohydrolases. J Mol Evol 1994, 39:631–643.
4. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in . J Mol Biol 1994, 235:1501–1531.
5. Eddy SR, Mitchison G, Durbin R: Maximum discrimination hidden Markov models of sequence consensus. J Comput Biol 1987, 2:9–23.
6. Bateman A, Eddy SR, Chothia C: Immunoglobulin superfamily domains in bacteria. Protein Sci 1996, 5:1939–1941.
7. Garrels JI: YPD – a database for the proteins of Saccharomyces cerevisiae. Nucleic Acids Res 1995, 24:46–49.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403–410.
9. Bateman A, Jouet M, MacFarlane J, Du J-S, Kenwrick S, Chothia C: Outline structure of the human L1 cell adhesion molecule and the sites where mutations cause neurological disorders. EMBO J 1996, 15:6048–6057.
10. Rost B, Sander C: Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994, 19:55–72.
11. Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1991, 19:2247–2249.

12. Leahy DJ, Hendrickson WA, Aukhil I, Erickson HP: Structure of a fibronectin type III domain from tenascin phased by MAD analysis of the selenomethionyl protein. Science 1992, 258:987–991.
13. Huber AH, Wang YE, Bieber AJ, Bjorkman PJ: Crystal structure of tandem type III fibronectin domains from Drosophila neuroglian at 2.0Å. Neuron 1994, 12:717–731.
14. Wootton JC: Sequences with ‘unusual’ amino acid compositions. Curr Opin Struc Biol 1994, 4:413–421.
15. Casari G, Andrade M, Bork P, Boyle J, Daruvar A, Ouzounis C et al.: Challenging time for . Nature 1995, 376:647–648.
16. Bulawa CE: Genetics and molecular biology of chitin synthesis in fungi. Annu Rev Microbiol 1993, 47:505–534.
17. Sipiczki M, in Nasim A, Young P, Johnson BF (eds): Molecular Biology of Fission Yeast. New York: Academic Press; 1989:431–52.

Address: MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK.

Gazetteer

Tsukuba Science City

Where is it? Sixty kilometres north-east of Tokyo and forty from the airport, just far enough to be inconvenient for both. The city has an estimated population of 180 000 and covers an area of about 27 km².

How old is it? Although it’s now 33 years since the Japanese government first decided to establish the new city, most of the buildings are much more recent. It is only 24 years since the first research institute opened (Research in Inorganic Materials), and a mere 11 years since hamburger chain McDonalds made its cultural debut.

Why was it built? For two reasons, really. Tokyo was rapidly becoming over-crowded, and the government wanted to create a focal point for scientific research. By relocating 45 public institutes, covering the physical, life and environmental sciences, as well as an entire university (formerly Tokyo University of Education) to the new city, the government provided a catalyst for subsequent development. Now there are also several private research institutes and over 100 companies with research facilities in the area, including many foreign companies, such as Texas Instruments and Glaxo–Wellcome.

How has the University fared? Now called Tsukuba University, it is located in landscaped surrounds on a large campus to the north of the city. In its short history, Tsukuba University has hit the headlines surprisingly frequently. One reason is the reforms (considered radical in Japan), such as encouraging collaborations with private companies, that have been instigated by the present head, Professor Leo Esaki, one of Japan’s few Nobel laureates (he won the Physics prize in 1973 for the discovery of tunnelling in semiconductors). Nearly everyone has an opinion of Esaki, and not all are complimentary. Tensions were readily apparent this year when, despite his pre-eminence in Japanese society, Esaki only narrowly won re-election for a second term of office.

Is the press coverage always positive? Not always. One of the University’s alumni, a chemist called Masami Tsuchiya, joined the Aum Shinrikyo (Aum Supreme Truth Cult) instead of taking the usual career path into a lifetime job with a large company. There, he applied his chemical expertise to the production of sarin nerve gas. On 20 March 1995 his work was put to devastating use when Aum cult members released the gas on the Tokyo subway, killing ten people and injuring thousands.

Are there any famous research institutes? The Tsukuba Life Science Center (RIKEN) has an international reputation in basic biology, but perhaps one of the most successful institutes is the high energy physics laboratory, housing the photon factory which has become internationally known as a source for X-ray crystallographers. The favourable terms given to foreign research groups often make it more economical (and even sometimes more convenient) to hop on a plane to Japan than to use a local synchrotron source.

Is it a popular place to work? Tsukuba has been successful in many ways, and for the young Japanese family its merits are clear. Nice new homes are available at affordable prices and there are many parks and amenities to make life comfortable. The plans afoot to cut the ‘Tsukuba allowance’, which was originally made to compensate scientists for the move away from Tokyo, from the salaries of some 10 000 employees of national labs is bound to reduce satisfaction levels, however. To the non-Japanese eye, 1960s design features have left their mark, in straight wide roads more reminiscent of the USA than Japan and the lack of a traditional city centre (as in Britain’s ‘new towns’, such as Milton Keynes). If foreign scientists come to Japan seeking that quintessential Japanese experience, Tsukuba Science City may disappoint.

Intron

Tasty genetic bloomer

“. . . The predicted gene product contains seven helicase motifs that are present in the three members of the RecQ subfamily of DExH (D for asparagus, E for glutamic acid, H for histidine, x for any amino acid) box-containing DNA helicases . . .”

From the article by E. Passarge: A DNA helicase in full Bloom, Nature Genetics 1995, 11:356–358.