Fibronectin Type III Domains in Yeast Detected by a Hidden Markov Model

Current Biology 1996, Vol 6 No 12

Alex Bateman and Cyrus Chothia

Proteins containing fibronectin type III (FnIII) domains play a central role in many intercellular processes: they are part of many cell-surface receptors, adhesive matrix proteins and cell adhesion molecules. FnIII domains are also found in the giant muscle proteins, titin and twitchin. The occurrence of FnIII domains in prokaryotes led to speculation that these domains existed in the last common ancestor of prokaryotes and animals [1]. However, it has been argued that the currently known prokaryotic examples were obtained from a horizontal transfer of a single domain from animals, that this protein arose late in evolution, and that it is unlikely to occur in plants or fungi [2,3].

Here, we report evidence that three fungal proteins, L8543.18 and YEF3_YEAST of Saccharomyces cerevisiae and the L8543.18 homologue from Schizosaccharomyces pombe, contain FnIII domains. The evidence for this comes from a hidden Markov model (HMM) [4,5] of the amino-acid residues that determine the FnIII protein fold, and is supported by other calculations.

From alignments of the sequences of a protein family, an HMM can be built to encode the probabilities of different residues occurring at particular sites. The model can then be used to detect other sequences that are likely to be very distant members of the protein-fold family [6]. We built an HMM from a multiple alignment of 434 FnIII domains, and used it to search for FnIII domains in the yeast protein database release 4.1 [7]. Residues 76–166 of the sequence L8543.18, and residues 35–125 of YEF3_YEAST, matched the HMM with scores of 39.5 and 21.5 bits, respectively. We found a homologue of L8543.18 in cosmid c6G9 of the genomic data for S. pombe, using the program tblastn [8] (see Fig. 1a). Residues 77–167 of this sequence match the FnIII HMM with a score of 32.4 bits. The HMM score is the logarithm to base 2 of the probability of the sequence matching the HMM, divided by the probability of a randomly generated sequence matching the HMM. The next highest match, SVS1, scored 9.4 bits. We would expect a score of 12 bits to be significant against a database of this size. HMM scores of 39 and 21 bits are highly reliable indicators of sequence homology, in our experience.

We therefore expect that the yeast domains have an FnIII-like fold. However, we did try other methods to verify our results: database searches with BLASTP [8], key residues analysis [9], and PHD [10] secondary structure prediction.

BLASTP [8] found matches between the ‘FnIII’ sections of the yeast proteins and known animal FnIII domains in the SWISS-PROT database [11]. The top match against the L8543.18 protein was the FnIII-containing receptor tyrosine kinase KEK4_CHICK, with a p value of 7.5 × 10⁻⁴. The best match against YEF3_YEAST was FINC_CHICK, the fibronectin protein in chicken, with a p value of 2.0 × 10⁻³. Note that these matches were found using only the ‘FnIII’ portion of the two S. cerevisiae proteins. If the whole sequence of L8543.18 is used, the first protein with an FnIII domain to be matched is NCA1_MOUSE at rank position 284, with a p value of 0.56; for the whole sequence of YEF3_YEAST, the first such match is NCA2_XENLA at rank position 380 and with a p value of 0.96. Routine BLASTP analysis would not, therefore, find the yeast FnIII-like sequences.

Key residues are those that, through their packing, hydrogen bonds or unusual torsion angles, play the major role in determining the three-dimensional structure of a protein. These residues tend to be strongly conserved, in type if not in identity, over long evolutionary periods, and they can be used to detect distant evolutionary relationships. We have defined the key residues of several FnIII domains of known structure [9]. Figure 1b shows a comparison of the residues at key sites in the yeast FnIII domains with those in three domains of known structures [12,13]. The yeast sequences aligned to the known structures share identical residues at only 8–16 sites. Inspection of the alignment, however, shows that, to a very large extent, the key residues in the core of the known structures are the same or conservatively substituted in the two new sequences. The key residues in the core of each of the predicted S. cerevisiae FnIII domains are shown schematically in Figure 2. The key residues of the S. pombe homologue of L8543.18 are the same as, or very similar to, those found in the S. cerevisiae form. All the residues found at core sites of the yeast domains can be found in the sequences of animal FnIII domains.

The known FnIII structures contain two β sheets, one with three strands (ABE) and the other with four (GFCC′). The sequences of the yeast domains were submitted to the PHD secondary structure prediction server [10]. For each of

Figure 1

(a) The sequence of the putative S. pombe homologue of S. cerevisiae L8543.18, translated from genomic cosmid c6G9. (b) Alignment of yeast FnIII domains to the FnIII domains of neuroglian (NGd1 and NGd2) and tenascin (Ten) [12,13]. The S. pombe homologue of L8543.18 is denoted Pombe. Regions in the yeast sequences that are expected to have the same structure as one of the known structures are shown in upper case; regions expected to differ in conformation are shown in lower case. The secondary structure (marked ‘Strand’) of the known FnIII domain structures is shown, as is the PHD secondary structure prediction (marked ‘PHD’) [10] of the single yeast domains: residues predicted to be in an extended conformation are marked by ‘-’ and those predicted to be in a helical conformation by ‘+’. Key residues important for the structure of the FnIII fold are shown in red [9]. The chemical nature of these sidechains is largely conserved, implying structural similarity of the yeast domains to the known structures. (c) The location of FnIII domains within the S. cerevisiae proteins.

(a)
  1 MDDTNQFMVS VAKIDAGMAI LLTPSFHIIE FPSVLLPNDA TAGSIIDISV HHNKEEEIAR  60
 61 ETAFDDVQKE IFETYGQKLP SPPVLKLKNA TQTSIVLEWD PLQLSTARLK SLCLYRNNVR 120
121 VLNISNPMTT HNAKLSGLSL DTEYDFSLVL DTTAGTFPSK HITIKTLRMI DLTGIQVCVG 180
181 NMVPNEMEAL QKCIERIHAR PIQTSVRIDT THFICSSTGG PEYEKAKAAN IPILGLDYLL 240
241 KCESEGRLVN VSGFYIENRA SYNANASINS VEAAQNAAPN LNATTEQPKN TAEVAQGAAS 300
301 AKAPQQTTQQ GTQNSANAEP SSSASVPAEA PETEAEQSID VSSDIGLRSD SSKPNEAPTS 360
361 SENIKADQPE NSTKQENPEE DMQIKDAEEH SNLESTPAAQ QTSEVEANNH QEKPSSLPAV 420
421 EQINVNEENN TPETEGLEDE KEENNTAAES LINQEETTSG EAVTKSTVES SANEEEAEPN 480
481 EIIEENAVKS LLNQEGPATN EEVEKNNANS ENANGLTDEK IIEAPLDTKE NSDDDKPSPA 540
541 AAEDIGTNGA IEEIPQVSEV LEPEKAHTTN LQLNALDKEE DLNITTVKQS SEPTADDNLI 600
601 PNKEAEIIQS SDEFESVNID

(b)
Strand    A-------------A B-----B C--------C
NGd1      DVP.NAP..KLTGITC.QA.DKAEIHWE...QQGDNRSPI.....LHYTIQFNTS.
NGd2      .PDVPFKNP.DNVVG.QGTQP.NNLVISWT..PMPEIEHNAPN....FHYYVSWKRD.
Ten       .RL.DAP.SQIEV.KDVTD.TTALITWF.....KPLAEI......DGIELTYGIK.
L8543.18  THKP.ESP..VLKI.VNVTQ.TSCVLAWD..plkl...gsak....LKSLILYRKG.
PHD       ---- ---- ------ - --------
Pombe     QKLP.SPP..VLKL.KNATQ.TSIVLEWD..plql...star....LKSLCLYRNN.
PHD       -- ------- -----
YEF3      KIKTP.PAT..KVSI.DKIAT.DSVTIHWEnepvkaedngsadrnfiSHYLLYLNNT.
PHD       - -- ------ -- ++++++++-------

Strand    ----C' E-----E F---------F G-------------G
NGd1      FTPASWDAAYEK..V..PNTDSSFVVQ..MSPW.ANYTFRVIAFNK..IGASPPSASSD.SCTTQ
NGd2      .IPAAAWENNN...I.FDWRQNNIVIA.DQPTF.VKYLIKVVAIND..RGESN.VAAEEVVGYSG
Ten       .KVPGDRTTID...L..TEDENQYSIG.NLKPD.TEYEVSLISR....RGDMS.SNPAKETFTT.
L8543.18  ....irsmvipnp.F....KVTTTKIS.GLSVD.TPYEFQLKLITT..SGTLW.SEKV..ILRTH
PHD       ----- -------- ---- --------- -- ---- ---
Pombe     ....vrvlnisnp.M....TTHNAKLS.GLSLD.TEYDFSLVLDTT..AGTFP.SKHI..TIKTL
PHD       ------- --- --------- - ----
YEF3      ..qlaifpnnpns.L.....YTCCSIT.GLEAE.TQYQLDFITINN..KGFIN.KPSI..YCPTK
PHD       -- - ------ ++ +++ ----- - ---

(c)
L8543.18: FnIII; position labels 76, 166, 277, 671
YEF3_YEAST: FnIII; position labels 35136, 625, 694, 956
Key: low complexity region; high complexity region; FnIII domain. Scale bar: 100 amino acids.

Figure 2

The conservation of core residues between neuroglian domain 2 and YEF3_YEAST and S. cerevisiae L8543.18. Strands A, B and E are shown in pink. Panels: Neuroglian domain 2; L8543.18; YEF3_YEAST.
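The HMM score defined in the text (the base-2 logarithm of the probability of the sequence under the model, divided by its probability under a random model) can be illustrated with a small sketch. This is not the authors' software. It is a simplified, gap-free stand-in for a profile HMM, with no insert or delete states. The three-site profile, the uniform background and the names `make_position`, `PROFILE` and `bit_score` are all hypothetical and chosen only for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Null model: every residue equally likely (an assumption made for this sketch).
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def make_position(favoured, p_favoured=0.62):
    """Hypothetical match-state distribution: one favoured residue,
    with the remaining probability shared equally by the other 19."""
    rest = (1.0 - p_favoured) / (len(AMINO_ACIDS) - 1)
    return {aa: (p_favoured if aa == favoured else rest) for aa in AMINO_ACIDS}


# Toy three-site profile; a real FnIII model would span roughly 90 sites
# and would also model insertions and deletions.
PROFILE = [make_position("W"), make_position("Y"), make_position("L")]


def bit_score(segment, profile=PROFILE, background=BACKGROUND):
    """Log-odds score in bits: log2 of P(segment | profile) / P(segment | background)."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal the number of profile sites")
    score = 0.0
    for residue, column in zip(segment, profile):
        # Summing per-site log ratios equals the log of the ratio of products.
        score += math.log2(column[residue] / background[residue])
    return score


if __name__ == "__main__":
    print(bit_score("WYL"))  # favoured residues at every site: positive score
    print(bit_score("AAA"))  # disfavoured residues at every site: negative score
```

Under this definition a score of s bits corresponds to a likelihood ratio of 2**s in favour of the model, which is why each additional bit doubles the odds against a chance match.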
