* Contact emails Jonathan Ient

cerevisiae, secondary structure predicti structure secondary cerevisiae, KEYWORDS: ¶ 1 David R.Fidler Running title: HHsearch finds PH new domains [email protected] Timothy P.Levine: Ient:[email protected] Jonathan [email protected] ElTohamy: Rana pante Antonoudiou: Pantelis kather Courtis: Katherine [email protected] Sarah Murphy: david.r.fidler@googlema David Fidler: Department of Cell Biology, UCL Institute of Ophthal UCL Instituteof ofCellBiology, Department

to whom correspondence should be addressed beaddressed should whomto correspondence these authors contributed equally to this work tothis equally contributed these authors This article isprotected bycopyright. Allrights reserved. Accepted Article Using HHsearchtotackle proteinsofunk

article as doi: 10.1 differencesbetween version lead to this may been through the copyediting, typesetting, pagination andproofreading process, which This article hasbeen accepted for publication andundergone full peer review but has not

GRAM domains, Gyp7p, pleckstrin homology (PH) homology Gyp7p, pleckstrin GRAM domains,

1 andTimothy P.Levine 1 * , SarahE. Murphy [email protected] [email protected] [email protected] [email protected] 111/tra.12432

il.com, or [email protected] [email protected] il.com, 1 on, structural , TBC1D on, structuralbioinformatics, *

, Katherine Courtis 1¶

mology, 11-43 Bath Street, London EC1V 9EL, UK EC1V9EL,UK Street, London Bath 11-43 mology, domains

1 , PantelisAntonoudiou and the Versionof Record. Please cite this nown function–ap nown domains, profile-profile search, Saccharomyces search,Saccharomyces profile-profile domains, 15, Vps13p, YJL181C, YJR030C, YJL016W. YJR030C, YJL181C, 15, Vps13p, 1 , RanaEl-Tohamy ilot study with PH 1 ,

INTRODUCTION organism ofmodel entireproteomes across particularly widely, more applied be firststeps. should HHsearch useful they are workbe corroborated, to considerable require of importance we thefunctional where confirmed Vps13p, RESULTS andDISCUSION signific be could without domaininformation proteins st This Vps13p. of fulllength ofthefunction for oneaspect lipids inositide wide information for both query and target and forbothquery wide information from theinitial (MSA) created with a by searching sensitivity increase prokaryotic origin prokaryotic Inadditio protein. ancestral fromasingle to havediverged ( alignment makes asignificant with PHdomain searchaclassical seeded PSI-BLAST 1-3),butno Movies 1and (Figure backbones alpha-carbon Forexample, alignment. by linked sequence are not families predi structuresor onsolved relying structure ofsecondary aspects encode with supplemented havebeen searches sequence-sequence minority asignificant function. However, pr way ofanunknown thefunction An important todetermine PH-like domains are defined byacombinat aredefined PH-like domains newly domains, PH ofclassical discovery Since theinitial domains ofnew PH-like prediction Hence, discovery of discovery completely new folds targets, mostoften of restricted toasmallrange computational considerable which requires libraries, sequence whic manyof yeast proteins 13 in domains new 16 predicted of systematicapplication yeast, in fold domain homology structural foldinasinglemodelorganism databas toolsusedtoannotate aremissedby that homologies HHsearch toolssuchas profile-profile have developed structural of devoid orentirely are largely Such proteins by rela are hampered the biology cell inmembrane Advances ABSTRACT yeast proteins there is no functionally functionally thereisno proteins yeast modelorganism The divergence. protein toolstoall the mostsensitive Itmay possible be domains. of known structural homologues standard definition of the PH-like clan we used PSI-BLAST weto PSI-BLAST clan used ofthePH-like definition standard with theliterature, starts mining domains PH-like Defining discove alagbetween thatthereis 2), butgiven (Box knew already we – what PH-likedomains Curating thatit both was bydemonstrating verified domain this onePH-like implicated siteprotein contact membrane aconserved Vps13p, ye 16new wediscovered Overall, searches. offline automated by se greatersensitivity foundeven We toann was used thantools moreaccurate HHsearch yeast. with domains PH-like against HHsearch After bench-marking we tostudy the used HHsearch indepth, worked through clan PH-like growing in eukaryotes structuralho their and domains (PH) Pleckstrin homology widelyor systematically. applied that hasnotbeen 30% below drop domain within asingle identities sequence when perform poorly structure (sequence alone) structure (sequence early on in finding functions for yeast proteins with yeast proteins for functions in finding

This article isprotected bycopyright. Allrights reserved. Accepted Article (17) , and at least ten times since atleasttentimes , and

(14) (15) . Although PH-like domains have no unified function, they function, nounified have domains . AlthoughPH-like . They are often involved in targeting peripheral membrane proteins involved in traffic and signaling intrafficand proteinsinvolved membrane peripheral intargeting involved . are often They (26)

. PH-like families (Supplemental Table 1)havebeen Table families(Supplemental . PH-like (28) S. cerevisiae . PH-like domains are therefore a good exampl a good therefore are . PH-likedomains to any otherprotein toany

query. Even more sensitive tools carry out profile-profile searches, which extract family- searches, out profile-profile carry tools sensitive Evenmore query. i.e. (6) there is no hit) with GRAM domains and with and hit) domains thereisno GRAM is one of the most highly studied, especia studied, of themosthighly one is , which suggests that many regions of unknown function will to bedistant turnout of unknown function which thatmany regions suggests , cting secondary structural elements structural secondary cting regions of unknown function. HHsearch is HHsearch function. ofunknown regions or structurally relevant domain informat domain relevant or structurally (7) (18-27) eding searches with yeast protein sequences, although this required high volume volume high thisrequired although sequences, with protein yeast searches eding profile thatcontains profile (~20%) of protein sequences shows no obv shows no sequences ofprotein (~20%) , but some tools go further by explicitly including secondary structure secondary including further byexplicitly sometoolsgo , but as proof of principle. Intheentirecl proof ofprinciple. as (4-6) no recognizable domain is slower thanexpected isslower domain recognizable no may identify residues thatarefunctional residues identify may ion of structure and ofstructureandsequence ion . Each of these discoveries added a new family of PH-like domains toa domains family a new ofPH-like added discoveries . Eachofthese . Such tools involve the construction of theconstruction . Suchtoolsinvolve (1)

the P . Tools such as Basic Local Alignment Search Tool (BLAST) beginto (BLAST) SearchTool Alignment suchasBasicLocal . Tools antly reduced by systematic application of HHsearch. ofHHsearch. application by systematic antly reduced ries and upgrading databases, we thesedefinitions. repeated databases, upgrading ries and

rotein D (online version called HHpred), which can detect remote which detect can HHpred), called version (online s, to significantly improve database annotations. annotations. database improve s, tosignificantly ly significant domain annotations. Structural bioinformaticians Structuralbioinformaticians annotations. domain ly significant HHsearch accurately identified known PH-like domains. Italso domains. known PH-like identified accurately HHsearch mologues, PH-like domains, are one of the most common folds commonfolds areoneofthemost domains, PH-like mologues, the PDB for solved structures, and sequence databases. For a For databases. structures,andsequence PDBforsolved the PH-like domain clan (sometimes called superfamily) in yeast. yeast. in superfamily) called clan(sometimes PH-like domain solved PH-like structureshav PH-like solved n there is some evidence that that evidence thereissome n the predicted PH-like domain. Even though such predictions predictions such Eventhough PH-likedomain. the predicted udy of PH-like domains in yeast shows how the problem of problem shows how the yeast udy in of PH-likedomains statistical information from statisticalinformation otate databases, and it detected some new domains easily. domains somenew anditdetected otate databases, h are implicated inintracellula h areimplicated classical PH domains and GRAM domains have very similar similar have very domains andGRAM PHdomains classical asPSI-BLAST such searches profile-sequence otein is to identify homology to a protein with known to aprotein homology istoidentify otein to detect these distant structural homologies by applying byapplying homologies distantstructural to detectthese solved structures, we searched for PH-like domains in domains we forPH-like searched solved structures, tively high proportion of proteins with no known function. with function. noknown ofproteins proportion high tively atab study PDB, which contains > studywhich contains PDB, es. Here we have applied HHsearch to study a single tostudyasingle we HHsearch es. Hereapplied have ast PH-like domains. Oneoft domains. ast PH-like effort. Therefore, profile-profile searches tend to be tobe tend searches profile-profile effort.Therefore, in sorting along theendocytic along in sorting has a specific targeting acti hasaspecifictargeting ank of solved structures (PDB). Profiles can implicitly (PDB).Profilescanimplicitly structures ofsolved ank e of a structural fold with sequence astructuralfold high e of (11,12) defined by a variety of profile-sequence tools ofprofile-sequence byavariety defined have typical docking sites for other proteins or proteins sitesforother docking typical have vice versa . There has been a steep decline in the in decline asteep hasbeen . There lly in systems biology. Yet for 10-13% of 10-13% Yetfor biology. insystems lly an ofproteindomains ion (TL, unpublished observation). Progress Progress observation). unpublished ion (TL, a powerful profile-profile alignment tool alignment profile-profile a powerful (2) . To overcome this problem, overcomethisproblem, . To ly importantforintracellulartraffic. (Box 1). Different PH-like domain DifferentPH-likedomain (Box 1). profiles for all members oftarget forallmembers profiles ious homology at the level of primary ofprimary level atthe homology ious . Each PH-like family is presumed is presumed family PH-like . Each a multiple sequence alignment alignment sequence multiple a e unexpectedly been found, both both found, been unexpectedly e the whole clan diverged from a from diverged whole clan the r traffic. One of these wasr traffic.One of these (1) 200 different PH-like domains. PH-like domains. 200 different hese is at the C-terminus of of hese isattheC-terminus . As an example that can be thatcan . Asanexample pathway. The presence of presence The pathway. vity and that itisrequired that vity and sharing the (3) (8-10) , which pleckstrin , either , either (16) . (13)

was seeded with sequences from our minimal set of 39 PH-like families. families. setof39PH-like ourminimal from with sequences was seeded HHsearch - more sensitive than PSI-BLAST than sensitive -more HHsearch by PSI-BLAST. they existed, undetectable domains easilybe might 2A). T others(Figure fromall isolated (Figure 2A). However, themainfeatur (Figure 2A).However, 4A). (Figure domains are PH-like confirmedthat This Methods). (see bydomains HHalign with PH-like relevant the alignments out pairwise carried (3-5 aa),and with flanks minimal regions identified newly by PDB specificity and sensitivit and specificity wewhere ifhitsare PDB, arecertain We forPH-likedomains. on PSI-BLAST improve can HHsearch (Box asHHsearch toolssuch Profile-profile MSA, even though they are divergent from Age1p (Supplement fromAge1p are divergent they MSA, eventhough with featuresoftwo parts of sequence oftheirGEFdoma upstream domain PH MSA haveaclassical itc possibly unique; isessentially Age1p by PH-lik dominated oftheMSAisspuriously N-terminus The was fu was MSA madefor created. The MSAforAge1p way the suchastr hit,. Wenext investigatedhow hi other strong oranywas nomatchtoPHdomains tools there false positives): Specificity (avoiding andRanB each:FERM-C,PTB of thedomains (~ PHdomains theclassical of largestfamily consists The ismorecomplete classification 1).This Table and Supplemental searches.This strong hitsinPSI-BLAST thati sequences of 39PH-like set theminimal described We aligns with classical PH domains. This kind of falseposit kindof with PHdomains. This classical aligns above. Among hitswith prob[SS] containing a targetHMMlibrary tosearch usedHHsearch We 2A). Table (Supplemental these domains annotation notedthatcurrentdatabase We yeast. in domains proteins.T of5,900 for10-13% no structuralinformation PDB, fromsearching turned next We s PDB-to-yeast: easy detection of some new domains new some of easy detection PDB-to-yeast: prob[SS] hitsscoring non-PH-like all Among domains. detect many truepositives The 5 The s ourPDB-to-yeast 2B). Therefore, Table andothertool were that hits PH-likedomains, top produced withregions. five all PSI-BLAST seeded We profiles. may moreinformation 4). (Box Thus, theprofile biases which tools. byprofile-sequence detected been mightappear This homology. weturned tosequence predictions were with fivehits prob[SS] there addition, indicates that it mightalsoto thatit indicates detection 2B). The (Figure PHdomains toclassical domains GRAM evenlinked with HHsearch prob[SS]=96%. domains toclassicalPH ahit (4r8g_A)produced domain 1c TH1 (F families just4unlinked HHsearchleft by PSI-BLAST, unlinked with prob[SS] inPDB toPH-likedomains hits 1200 with produced HHsearch 39searches The together. PH-likeclan the tojoin ability prob[SS] struct ofthePH-like portions to longer alignm These (datanotshown). alignment structural secondary shorte by being differedfromtruepositives PH-like domains that indicates 3A). This prob[SS] (Figure Sensitivity (avoiding false negatives): tructure (prob[SS], Box 3). Previous workwith Box3).Previous tructure (prob[SS],

This article isprotected bycopyright. Allrights reserved. Accepted Article th region identified was the N-termi at regionidentified ⇔ ≥ whole protein with ( domain whole theidentified just protein 85% will ensure close to 100% specificity. to100%specificity. will close ensure 85% ≥ 85%, which is many more positive hits than PSI hitsthan whichmany positive ismore 85%, y. Importantly, for thisexercise (29-39) detect new PH-like domains in domains new detect PH-like . In our 39 searches,1300 hits scored prob[SS] scoredprob[SS] hits . Inour39searches,1300 ≥ 85%, there were 71 of the 73 known yeast PH-like domains (97%, Figure 2C). In 2C). Figure (97%, PH-likedomains yeast were ofthe73known 71 85%, there true or false. This allowed us to determine key parameters relating toboth relating parameters key ustodetermine allowed false. This true or he lack of cross-detection between mostfa between he lackofcross-detection where structure has been solved, tosear solved, wherestructure hasbeen e was the presence of 20 small families ( of 20smallfamilies was thepresence e ure (Supplemental Figure 1). We conclude from this benchmarking that a threshold that athreshold fromthisbenchmarking conclude 1).We Figure ure (Supplemental these PH domains. This led totheinclus led This PHdomains. these HHsearch provides a list of alignments with decreasing prob with decreasing alistofalignments provides HHsearch the prob[SS] metric is a conservative estimate of accuracy. Alignments tonon- Alignments accuracy. estimateof isaconservative theprob[SS]metric ontains an defunct relic of a PH domain. relicofaPHdomain. defunct an ontains ong false positive arose(asteriskinFigur false positive ong To compare the sensitivities ofHHs thesensitivities compare To minimal set indicates the presence of 39 families in the PH-like clan (Figure 2A 2A clan(Figure inthePH-like of39families thepresence minimal setindicates earches identified four new PH-like domains. domains. four new PH-like identified earches However, these profiles have typically been seeded on mammalian sequences, sequences, mammalian on seeded been typically have theseprofiles However, ≥ nus of the ARF-GEF Age1p. When this was seeded into HHsearch or other was orother intoHHsearch this seeded Age1p.When oftheARF-GEF nus 3) many detected have new domains previously 85% to regions not previously identifi notpreviously 85% toregions

D. Non-significant hits indicated that hitsindicated D. Non-significant HHsearch has foundthata has HHsearch Four sequences from Caf120p, Lam1p, Sip3p and Vid27p Vid27p and Sip3p Lam1p, fromCaf120p, sequences Four HHsearch was set to be unaware of solved structure. HHsearch HHsearch of solvedstructure. was settobeunaware HHsearch ≥

5%, the observed error rate was by lessthanindicated error rate far theobserved 5%, horough curation of current literature identified 73 PH-like 73PH-like identified of currentliterature curation horough ive can be avoided by repeating by repeating canbeavoided ive i.e. only the four new regions in Caf120p, Lam1p, Sip3p and Vid27p Vid27p and Sip3p Lam1p, inCaf120p, regions fournew the only 50% of all sequences). Three fa Three ofallsequences). 50% s also identified these regions as PH-like (Supplemental asPH-like(Supplemental theseregions s alsoidentified r (Figure 3B). In addition, non-PH-like domains had weaker had domains non-PH-like 3B).Inaddition, (Figure r PDB of homologies by HHsearch that PSI-BLAST could not find couldnotfind thatPSI-BLAST by HHsearch ofhomologies proteins of unknown structure. structure. ofunknown proteins dentified the complete set of PH-like domains inPDBas setofPH-likedomains thecomplete dentified s lag behind this, containing from 48 to 62 (66-85%) of from 48to62(66-85%) containing this, s lagbehind e domains because of three factors: (i) the N-terminus of of factors: (i)theN-terminus ofthree because e domains be obtained about yeast proteins if they are used to seed usedtoseed iftheyyeast are proteins about obtained be -BLAST produced (n=350). While 20 families were 20families While (n=350). produced -BLAST all yeast proteins with the 39 PH-like sequences defined defined with PH-likesequences the39 proteins yeast all in. (iii) the Age1 N-terminal region shares non-significant non-significant shares region Age1 N-terminal (iii)the in. t (Supplemental Table 2B).Theref Table t (Supplemental al Figure 2). The N-terminu The 2). al Figure ents were either to a single PH-like structural element, or element, PH-likestructural were toasingle either ents first studied how HHsearch detects PH-like domains in in domains detectsPH-like HHsearch firststudiedhow than definitions of the PH-like clan in databases (Box 2). (Box databases in ofthePH-likeclan definitions than ll length Age1, which aligns with >100 other ARF-GEFs. ARF-GEFs. with other >100 which aligns Age1, ll length paradoxical, since by definition these regions have not have regions these sincebydefinition paradoxical, igure 2B). For example, the previously isolated myosin- isolated thepreviously example, For igure 2B). ⇔ domain). dothis,weTo madeMSAsforallfive

threshold of prob[SS] ofprob[SS] threshold earch and PSI-BLAST, we compared their their we compared andPSI-BLAST, earch ching the yeast proteome where we have where yeast proteome the ching ≥ ≤ ion of two genuine PH sequences in the in PHsequences genuine oftwo ion 80%, of which 1298 were PH-like 1298 80%, ofwhich 3 members) that werecompletely3 members)that milies indicates that if other PH-like thatifotherPH-like milies indicates ed as PH-like. To investigate these these investigate as PH-like.To ed (ii) most of the other ARF-GEFs inthe(ii) mostoftheotherARF-GEFs e 3B). We foundanexplanatione 3B).inthe some families share weak homology weak homology familiesshare some all positive alignments obtained obtained alignments positive all milies have approximately 10% 10% approximately have milies s of the Age1p MSA therefore s oftheAge1p (29-39) , so we, so askedif ore, it is a false positive ore, itis a false positive ≥ 80% is sufficient to 80% is ability ofs ability hared hared previous structural information: Yjl016wp, is a t Yjl016wp, is structuralinformation: previous PH- areall 11matches predictthattheremaining We shown). cellular localisation (data not shown). We nextconstructed di We shown). (datanot localisation cellular fa the expressed also 4A).We (Figure Vps13p and Tph3p-2, in eight domains implicated in membrane traffic/functi membrane in implicated domains in eight Support for thepredictedPH-likedomaininVps13p 4B). (Figure domain PH-like new sothisisapossible like domain, yeast-to In threshold. positive was false below the it although notshown).it yeast (data However, in PH-like domains no produced sswto30% 3). Increasing (Box of11% default weight structure we thesecondary raised sequence, diverged with highly folds identify conserved To within HHsearch. waswe totestdifferentsettings explored finaloption The yeas in predictions Thus, eukaryotes. plants; and infungi degradation import and hum 5 including eukaryotes RabGAPinall late endosomal widely distributed are of thenew families searches.T indirect required sensitivity ofst oracombination use ofHHsearch, segments( with longer wereobtained only domains matches toPH-like Oneother oftheregions. for11 domains outpairwis bywere truepositives carrying tested ifthese domains PH-like 12furthercandidate and above found regions butatprob[SS] inPDB, domains matchedPH-like also domain Use of HHsearch increased the number of PH-like domains in yeast by >20% (73 yeast by in>20% ofPH-likedomains number the increased Use ofHHsearch not shown). (data searches yeast-to-PDB Importantly, t 4A). Figure (Vid27p-1, domain within HHsearch. areavailable species ofthese libraries other fungi: intwo yeastdomains PH-like forthe88 PH-like homologues yeast,we carri budding in domains PH-like For computation. we named it Tph3p, for twin PH domain-3. Yjl181c PH domain-3. fortwin itTph3p, wenamed and (2) identifi This library. to searchthePDB (s 8iterations using MSA library yeast entire remadethe We sequences. yeast with seeded frombeing would alsobenefit HHsearch benefitssomuch SincePSI-BLAST genomes. divergent sequenced andfewer whenwere sequences fewer there solved years ago, wereseveral made libraries organism and themodel 3).By seeBox compar toeight, (up iterations usemore queries standardus 4). However, (Box symmetrical are potentially HHsear Since sequences. non-yeast on seeded missed byprofiles that indicates tool.This sequence profile were hits not same thoughthe even domains, PH-like defined work such this A keyatool in initiating finding isthat astepchangeinsensitivity Yeast-to-PDB: (43) function this had nine only PHdomains, classical yeast surveyof 33 isorg domains One ofthemosttypical ofPH-like functions searching transitive called is indirectly database standard struct hits tobothaPDB that arestrong ofHHs thesensitivity way toenhance One Indirect use of HHsearch for maximum sensitivity sensitivity maximum for ofHHsearch use Indirect we havedone. as libraries, indicatesthat list. This werenot intheenriched that they forunknown-to-known library aproteome-wide constructing ledtoaclear withyeast proteins HHsearch Overall, seeding h small number of false positives. Howe of falsepositives. small number li would require searches.These PDB-to-yeast achieved easily searches yeast-to-PDB weby reverse found the 11domains withpr thehits byusing fortruepositives enriched unknowns omologs, so we named them Rbh1/2p. Rbh1/2p. them wenamed so omologs, . To determine if the newly described PH-like domains target domains PH-like described determineifthenewly . To

This article isprotected bycopyright. Allrights reserved. Accepted Article hadaprob[SS] ≥ 85%. This group contained 72 ofthe73kn 72 groupcontained This 85%. t can have broader significance. significance. broader have t can ed 88 yeast regions where the top hit washit thetop where regions yeast ed 88 ver, four of the 11 domains had prob[SS] had fourofthe11domains ver, he newly described sequences add fivefurt add sequences described he newly ure and to a yeast region of unknown function. Linking an unknown to thegold- to anunknown Linking function. ofunknown yeast region toa ure and rategies for the online server HHpred (Workflows described in Box 5), ultimate Box in described (Workflows serverHHpred rategies fortheonline profiles centred on yeast sequences ma sequences yeast centredon profiles throughout evolution, and all of andall evolution, throughout fo was afalsepositive inPib2p, region, earch without having to construct a library is to define intermediate sequences sequences intermediate istodefine alibrary toconstruct without having earch Vps13, a membrane contact site prot contact Vps13,amembrane hese indirect searches easily identify all 11 of the domains that we by found thedomainsthat of 11 all identify easily searches hese indirect

win PH win p and Yjr030cp are paralogs that are distant R thatare paralogs are Yjr030cp p and as PSI-BLAST with as yeast sequenc PSI-BLAST on: Bud2p (both domains), Gyp7p, domains), (both on: Bud2p

protein, the3 protein,

led to one extra region in Rec114p being identified weakly, identified being Rec114p in led tooneextraregion maximum sensitivity mightrequi sensitivity maximum Using these fungal domains, we identified one extra PH-like extraPH-like we one identified domains, thesefungal Using e alignments. These showed strong hits to known PH-like hitstoknown strong showed These e alignments. (40,41) age of HHsearch is not entirely symmetrical. Profilesfor isnotentirely ofHHsearch age an homologs: TBC1D15/16/17 RUTBC1/2; Vid27, vacuolar Vid27,vacuolar RUTBC1/2; TBC1D15/16/17 homologs: an anelle targeting, binding proteins or lipids. In a previous Inaprevious orlipids. proteins binding targeting, anelle searches, as we did here,us wedid searches, as major improvement in overall accuracy in identifying new new identifying in accuracy inoverall major improvement ed out a simplified form of indirect searching by finding all all by finding searching formofindirect ed outasimplified ee Methods). We then used each of the yeast MSAsinturn yeast each ofthe thenused We ee Methods). , five had moderately high prob[SS] (>60%) inthemore prob[SS] (>60%) high , fivehadmoderately lse positive in Age1p. None of None Age1p. in lse positive increase in sensitivity for HHsearch. In practice, ratherthan Inpractice, forHHsearch. insensitivity increase mers (in the form GFP-PH-PH) for six of these PH-like forsixofthese GFP-PH-PH) mers (intheform -PDB searches, the top hit for this region in PDB wasPDB aPH- in thetophitforthisregion -PDB searches, ob[SH]<85% in PDB-to-yeast searches (Figure 3A). Among 3A).Among searches(Figure inPDB-to-yeast ob[SH]<85% ing (ssw) within each alignment. In HHsearch, this has a thishasa InHHsearch, within alignment. each (ssw) ing like domains (Figure 4A). Three of the proteins have no no oftheproteinshave 4A).Three (Figure like domains , and theoretically it involves a massive increase in in a massiveincrease itinvolves , andtheoretically seen in the other direction with PSI-BLAST orany other with PSI-BLAST otherdirection seen inthe =79%. The 16 other regions included all four of the new fourofthenew all included 16otherregions =79%. The ttle work to find by yeast-to-PDB searches among a very amongavery work searches yeast-to-PDB ttle tofindby , with matches in the range of prob[SS] 88–100%. We We with ofprob[SS]88–100%. intherange , matches membranes, we fusionsfordomains GFP-PH membranes, expressed ison, target libraries were withiterations, two constructed libraries target ison, ch makes profiles of both query and target, its results itsresults target, and ofbothquery profiles makes ch own yeast PH-like domains (99%). The 73 The (99%). domains yeast PH-like own rd from being initiated on yeast sequences, we if asked yeast sequences, on initiated frombeing (42) in a family also containing Caf120p and Skg3p, so Skg3p, and Caf120p family containing a also in , but it is also applicable to other PH-like families families tootherPH-like , butitisalsoapplicable e.g. these are involved in intracellular traffic: Gyp7, inintracellular theseareinvolved U. maydis U. the whole protein) not the domain (data not (data notthedomain wholeprotein) the <5% in PDB-to-yeast searches, meaning meaning searches, PDB-to-yeast in <5% both (1) r the same reasons as Age1p, inthat asAge1p, reasons r thesame ein affecting lipid metabolism in all inall metabolism affectinglipid ein y contain critical information thatis criticalinformation y contain Æ her families to the PH-like clan. Three Three clan. her familiestothePH-like and and 89). Whether this involved offline offline thisinvolved Whether 89). es strongly matched themto matched strongly es S. pombe Lam1p, Pkh2p, Tph3p-1, Tph3p-1, Pkh2p, Lam1p, a solved >200 PH-like domain domain PH-like asolved>200 ers could produce alistof produce ers could re the construction ofentire theconstruction re these showed any specific specific any theseshowed an b . Proteome-wide . Proteome-wide inding domain domain inding rd known unfold in residue) hydrophobic motif(Ø= WxxxØ similar tothe Figure5B).We I3129A, L3125A (“LIAA” = mutation ofcons point double (b)a of3029-3144); (deletion tagged at the C-terminus with fully EGFP attheC-terminus tagged we 5B).Therefore, (Figure these loops (NVJ) junctions vacuole nucleus contactsand endosome-mitochondrial (vCLAMP), patches mitochondrial to localisations has intracellular aa) (3144 Vps13p Full-length may important. functionally be we iden thattheregion suggests 5A).This (Figure some cells fortheVps13 except cytosolic diffusely constructs remained Meanwhile HHpred is an extremely powerful online resource for discovering remote homologies. remotehomologies. fordiscovering resource online powerful extremely isan HHpred Meanwhile a systematically, yeast proteome the toanalyze weplan f about information toprovide distances at differentevolutionary in 6A);by comparison (Figure tothevacuole were close diffuse, was it largely phase inlog distribution: intracellular mutations on Vps13 theeffectoftwo studied next We unfolding. by partial caused stability protein with reduced consistent subject toastrongv subject andVps13 Vps13LIAA-EGFP EGFP, PH oftheproposed role forafunctional looked next We phase. toplay arole ofVps13pappears C-terminus the extreme Thus, Beyond PHdomains tothe whole proteome sitedat loops variable sites inthepredicted it.A toinactivate predicted amutantversion constructing tested.We been have domains targeting specific yet no Age1,si and Tph3p-2) and Tph3-1 except (all domains because many organisms exist that diverged that diverged exist organisms manybecause functi ofunknown forproteins good beparticularly may HHsearch alsoextended and domains theknown identified HHsearch onprot tofocussearches the literature, besi can annotations domain Current protein (prob[SS] sequence would discoverab also be ofthesenew domains many domains, of (12%) in16 were predicted 21new domains were examined. were usedtosearch MSAsforallORFs yeast proteome. the framesfromont reading open contiguous we 132 surveyed forPH-likedomains seen discovery in domain Is theincrease (Figure 6H) (Figure foranadja mightbeoverlooked thePH-likedomain HHsearch, weaker th were significantly both although than Vps13LIAA, Vps13LIAA in NVJ targeting, which was partial for Vps13LIAA-EGFP (Figure 6D) and complete forVps13 (Figure complete 6D)and was forVps13LIAA-EGFP whichpartial NVJ targeting, in reduction amarginal showed mutants two6B). Vps13p The the ARM repeat in Rrp1p), other predicti inRrp1p),other the ARMrepeat homologs (data not shown), but this not the butthisnotthe notshown), (data homologs VPS13,for of C-terminus attheextreme widelypredicted function this and important, functionally part was elicited mutated domain PH-like predicted which the CONCLUSION information. no structural/functional currently have domai new ~900 identify could proteomeHHsearch yeast entire with prior no proteins the19 found in were domains predicted Sevennewly Ydr124wp). in domain aDNA-binding Ydr089wp, in polymerase polyphosphate

This article isprotected bycopyright. Allrights reserved. Accepted Article (49) . Cells expressed the PH(LIAA)dimerconstructatalo the expressed . Cells (51) . ∆

vps13 ≥ acuolar p acuolar 99% n=9). Although some of the newly discover of the newly some Although 99% n=9). produced significantly stronger signals than strongersignals significantly produced rotein s orting ( orting ∆ structural/functional information, a discovery rate of 37%. Rolled out acrossthe out of37%.Rolled rate adiscovery information, structural/functional PH-EGFP rescuesortingofcarboxypeptidas PH-EGFP eins of unknown function and to maximize sensitivity. For the PH-like clan, thePH-likeclan, For tomaximizesensitivity. and function of unknown eins constructed two other mutants:(a)Vps13 other two constructed requires the predicted PH-like domain to be intact. Related PH-like domains are are domains RelatedPH-like tobeintact. domain PH-like thepredicted requires ons are more functionally significant ( significant are morefunctionally ons corrected the defect as shown previously asshown defect the corrected predicted that the hydrophobic resid thatthehydrophobic predicted β vps most highly conserved part of theC-terminus partof conserved most highly 1- gnificantly improved with profile-profile tools, to catch up with in advances tocatchup with tools, profile-profile improved gnificantly from itatdifferenttimesinthelast10

β ) defect (and hence is secreted) in issecreted) hence ) defect(and 2, β 3- β nce dimers can reveal weak targeting membrane reveal dimers can nce 4 and 4 and -like domain in Vps13p. We tested if plasmid-borne Vps13- ifplasmid-borne tested We inVps13p. domain -like early stationary phase Vps13- phase stationary early nnotating all the knowns and discovering many unknowns. unknowns. many discovering and theknowns all nnotating classical PH domains,sot classical PH determined the role of the predicted PH-like domain by domain PH-like ofthepredicted role the determined intracellular distribution. Vps13-EGFP showed a complex a complex showed Vps13-EGFP distribution. intracellular desirable strategy would be to mutate putative ligand binding binding ligand would tomutateputative be strategy desirable example in all four human homologs and in some plant plant insome and humanhomologs four inall example with the majority of cells containing puncta, some of which of some puncta, of cellscontaining with themajority erved hydrophobic residues in the predicted alpha helix helix alpha inthepredicted residues hydrophobic erved their number by 20%. Elsewhere in yeast, application of application yeast, in Elsewhere by20%. their number of >2% IV,representing chromosome armofyeast he right an empty plasmid (Figure 6G). (Figure plasmid an empty PDB, and regions not previously known to contain domains containdomains to known notpreviously PDB,andregions PH-dimer, which faintly localised to rings at the bud neck of neckof toringsatthebud localised which faintly PH-dimer, multiple membrane contact sites, including vacuole and and vacuole sites, including contact membrane multiple ial rescue. This shows that the C-terminus of Vps13p is Vps13p showsC-terminusof thatthe This ial rescue. tified in Vps13p is involved in intracellular targeting, and and targeting, intracellular in isinvolved Vps13p tified in β punctate targeting (Figure 6C/E), but considerable loss of loss 6C/E), butconsiderable (Figure targeting punctate likely to apply to other domains? To begin to examine this examine beginto To tootherdomains? toapply likely cent glycine-rich domain that is far more conserved isfarmoreconserved that domain glycine-rich cent 6- wer level than wild-type dimer (Figure 5C). This is This 5C). (Figure wild-type dimer wer than level in targeting, particularly to the NVJ in early stationary stationary the NVJinearly to particularly targeting, in the proteins (Supplemental Figure 3). As with PH-like 3).As Figure (Supplemental theproteins unctionally critical residues in critical residues unctionally ns, and 250-300 of these woul ofthese 250-300 ns, and le by systematically initiating PSI-BLAST yeast with PSI-BLAST the initiating le by systematically β on. Yeast proteins are very good subjects forthistool very subjects good proteinsare on. Yeast 7 wild-type Vps13-EGFP, strongerforVps13 Vps13-EGFP, wild-type (42) ed domains do not strongly determine function ( function determine strongly donot ed domains . However, Vps13p has no conserved residues in in residues hasnoconserved . However, Vps13p e.g. ues would stabilise the PH domain core, core, thePHdomain stabilise would ues ∆ intasome in Fob1p, vacuolar vacuolar inFob1p, intasome PH, lacking the entire PH domain PH domain theentire PH, lacking ∆ 9 he LIAA mutant may partially mutantmaypartially he LIAA (48) e-Y (CPY), a vacuolar enzyme that is that enzyme e-Y (CPY),avacuolar years, so homologues can be found canbefound years, sohomologues vps13 . Incontrast,Vps13 EGFP targeted the NVJ (Figure targetedtheNVJ(Figure EGFP (48) cells Thus, bothconstructsin Thus, ∆ . Without the prediction by the. Without prediction PH-EGFP (Figure 6F). 6F). (Figure PH-EGFP yeast proteins. Inthefutur yeast proteins. d be in the proteinsthat the d bein (50) . Wild-type Vps13 Vps13 . Wild-type (44,45) ∆ (46-48) PH-EGFP and PH-EGFP . All ∆ PH-EGFP , butas e.g. e

Vacuolar Protein Sorting of CPY: constructs. different forimagescomparing settings identical AOBS onaLeica andvisualized and coverslip, slide between Microscopy: repair non-gap and products I3129A). AllPCR Vps13 aa), (1-3144 Vps13-EGFP either clone tr and length Full and Ysp1p-2=568-717. Vps13p=3028-3144, Plasmids: were aa 116 thefinal attheextreme C-terminus, alignment (n ofx±9 average arolling bywere taking smoothed Values To allow secondary structure to be weighted higher than than higher weighted tobe allow structure To secondary limitonprob[SS]=0%. lower returned, “maximumaccu ofHHblits, 8iterations except: HHblits: round were toasingle thensubmitted sequences was mo 10untilno initiated, of further round fromNCBI(nr70,p sequences Tool were atTuebingen intwo carriedout stages Searches PSI-BLAST 70%identity than sharesmore nosequence sothat number PDB(“nr70”, (May templateHMMs 2008), yeast including June2012) Toolkit, (Tuebingen 2.0.15 HHsuite InterProScan generated): date (source, downloaded Files files Database todetectCPY then processed with anitrocellulo overlaid being before hours grown for24 was to: termin reduced underlined The aa ofVps13p Vps13p: of C-terminus the across conservation Residue PH-likedomain. the includes were were curated alignments ifthese found and domains, containPH-lik hitstoPDBentriesthat Top astarget. 2013) Yeast-to-PDB searches: request. tocreatea“Yc ssw=11%) chunks(8iterations, created forall (overlappi 200residues of intoregions proteins dividing sequence/S288C_refe MATERIALS andMETHODS Yeast HMMlibrary (Ychunk200): settings. with standard run inHHalign were compared entries, confidence result the and submitted toHHblits, HHalign: thec on acap changeremoved This was included as positive control. Ten-fold dilutions Ten-fold positivecontrol. as was included GFP- expressing well asaplasmid as plasmids, Vps13-EGFP Gyp Bud2p-2=168-319, 1=1-163, were separat dimers —SR*,and sequence LGSAPVMAS —cloned PHO5 HHsearch: 0.001–0.01. MSA = hit.score_aass=(hit.logPval<-10.0? hit.logPval: l hit.logPval: hit.score_aass=(hit.logPval<-10.0? fmax(0.0,0.2*(hit.score-8.0)))/0.45 l hit.logPval: hit.score_aass=(hit.logPval<-10.0?

This article isprotected bycopyright. Allrights reserved. Accepted Article formoderately stro This was run on its own or as part of HHsearch with upto was itsown runon ofHHsearch oraspart This Domain sequences from both PDB and yeast proteins yeast proteins and from bothPDB Domainsequences All plasmids were based on pRS416 (CEN, (CEN, were onpRS416 based plasmids All Use wasonline (“HHpred”) Use both Cells grown at 30°C to mid log phase, or 16 hours ther or16 phase, at 30°Ctomidlog Cellsgrown (53)

. Results were normalized sothatthe50 were normalized . Results

rence/orf_protein/ All HMMs in Ychunk200 were submitted as queries for HHsearch using the PDB library (June (June library thePDB using forHHsearch were asqueries submitted AllHMMsinYchunk200 ng, constitutiveexpression.

(50) ≤ 0.005, iterations = 20).Ifnew = were 0.005,iterations sequences , using monoclonal anti-CPY (Invitrogen). (Invitrogen). anti-CPY , usingmonoclonal 7p=1-182, Pkh1p=562-674, Pkh2p=833- Pkh1p=562-674, 7p=1-182, Total yeast protein sequence was yeast obta proteinsequence Total ing “representative alignments”, deleting deleting alignments”, “representative ing –3.0 Haploid BY4741 yeast lacking yeast lacking BY4741 Haploid ontribution of secondary structure. structure. ofsecondary ontribution . The problem highlighted by Age1p (Supplemental Figure 2) was reduced by was by reduced 2) Figure (Supplemental Age1p by highlighted problem . The re sequences were added (max 40 iterations, 40 iterations, (max were added re sequences at http://toolkit.t ∆ ed segments were checked by sequencing. bywere sequencing. segments checked ed racy” alignment was turned on, e-value = 0.01, = wason, e-value turned alignment racy” PH-EGFP (1-3028 aa), or Vps13LIAA-EGFP (1-3144, L3125A and and L3125A (1-3144, aa),orVps13LIAA-EGFP (1-3028 PH-EGFP og(-log(1-hit.Pval)))/0.45–lamda*hit.score_ss/0.45 og(-log(1-hit.Pval)))/0.45–fmin(lamda*hit.score_ss, of BLAST using PDB as the target database. the targetdatabase. usingPDBas of BLAST of cells were spotted on selective medium (10 medium selective wereon ofcells spotted PH domains were cloned as monomers in the form GFP — GFP intheform were asmonomers cloned PH domains URA3 ng by 100), producing ~26,000 “chunks”. MSAs and HMMs were MSAsandHMMs “chunks”. ~26,000 producing by 100), ng th h tnad1% eeie ie83o hhtitC: 853 of“hhhitlist.C”: we line edited standard11%, the (Saccharomyces Genome Database, December 2015), 2015), December GenomeDatabase, (Saccharomyces centile is zero and the interquartile distance (75 distance and theinterquartile iszero centile se membrane for 16 hours. This was washed extensively, and and washedextensively, was This for16hours. se membrane e domains were fo thenfiltered e domains kit. Firstly the PDB sequence was seeded in searchesof was in seeded kit. FirstlyPDB sequence the by hand to include only those hits where the alignment where thealignment hits those only toinclude byhand submitted separately with no significant difference. difference. withsignificant no submitted separately uebingen.mpg.de with any other, Tuebingen Toolkit), and Interpro 34.0. 34.0. andInterpro Toolkit), Tuebingen other, with any =19). To check that an internal domain did notforcemis- did domain checkthataninternal To =19). ConSurf was used with standard settings for the final 995 for thefinal995 with settings was used ConSurfstandard hunk200” library that is available as an FTP download on on download asanFTP isavailable that library hunk200” ), and used the 168 bpfragment the168 ), andused uncated Vps13p constructs were cloned by gap repair to gaprepair by were cloned constructs uncated Vps13p GFP. The parent wild-type strain, also carrying GFP-GFP, GFP-GFP, alsocarrying strain, wild-type parent GFP. The SP2 confocal microscope at microscope SP2 confocal with 3-5 flanking residues where available were where available residues with 3-5 flanking VPS13 8 iterations and an e-value threshold for inclusion in in forinclusion threshold e-value an and 8iterations ed by either SR or SS. Residues used were Bud2p- used Residues eitherSRorSS. ed by eafter for early stationary phase, were immobilized were immobilized phase, forearlyeafter stationary

966, Tph3p-1=1- 177, Tph3p-2=180-561, 177,Tph3p-2=180-561, 966, Tph3p-1=1- were transformed with andmutated length were full transformed ined from downloads.yeast-genome.org/ fromdownloads.yeast-genome.org/ ined the secondary structure prediction and and structureprediction the secondary (52) still added in the final of 10 iterations, a a of10iterations, inthefinal added still and locally in R. Settings were default inR.Settings locally and r the presence of multiple ofmultiple r thepresence ≤ room temperature, using using room temperature, 4000 sequences). Aligned Aligned 4000 sequences). fromthepromoterof ≥ 1000 matches were matches 1000 4 , 10 –3.0 3 , 10 i.e. th 2 per spot) and perspot)and –25 reduced in reduced in th ) =1. associated arginine methyltransferase 1 domains. EMBO J 2007;26:4391-4401. EMBOJ2007;26:4391-4401. 1domains. methyltransferase arginine associated 24. MorasD,Cava P, N,CuraV,Hassenboehler Troffer-Charlier 374. with theDNA replicatio its interaction 23. CP E,HerouxA,Hill Ferris BlanksmaM, AP, VanDemark for GLUEinlin role reveal structures domain interest. of noconflicts thattheyhave authors declare inthe norole had bodies funding The respectively. Society were by summerstudentsh PAandRE-T funded (BB/L016028). Matt Hayes forcritical anti-CPY,and providing kindly Acknowledgments manuscript. 22. 22. H,O,VeprintsevDB,Vallis Gill DJ,SunJ,Perisic Teo EMBO type 1protein. 21. ScheffzekK.Anovelbipar F, S,Bonneau I,Welti D'Angelo totheRalA Sec6/8 effectors 20. ErvinKE,SchellerRH JR,MaternHT, Jin R,Junutula nucleotide inDNA involved containsaPHdomain TFIIH 19. F, Wasielewski V,JawhariA,Frindel Gervais V,Lamour 1402. myopathy intomyotubular insights MTMR2: phosphatase, 18. JE GS,KimSA,VeineDM,Dixon Begley MJ,Taylor thephosphot of recognition Structure SW. Fesik andligand 17. PetrosAM, EF, Olejniczak KS, MM,Ravichandran Zhou 2007;10.1042/BSS0740081:81-93. 16. Lemmon MA.Pleckstrinhomology doma (PH) oft :apublication Protein science thepleckstrinhomol between 15. MJ,Driscoll Zvelebil MB,MichieAD, Orengo CA,Swindells 2004;340:991-1004. Mol Biol 14. Sternber A,BaldingDJ, KP,Muller LR,Fleming Mayor CASP9. Proteins2011; 13. A HaasJ,Schwede T. SchmidtT, V,KieferF, Mariani 1999;292:195-202. 12. bas structure prediction Proteinsecondary Jones DT. 1987;326:347-352. Nature molecules. ofnovel design 11. JM.Know MJ, Thornton Sibanda BL,Sternberg TL, Blundell 2011;12:275. Bioinformatics 10. ofH alignment AlignHUSH: N. O,Srinivasan Krishnadev by HM detection J.Proteinhomology Soding 9. hybr using new methods alignment: sequence and detection KohIY,Posy B.On S, AlexovE,Honig L, CL,Xie Tang 8. structure-dependent and tables substitution REFERENCES Authors’ contributions and created display items. All authors items.All display and created andJI items.PA,RE-T created display outcloni carried sorting,and CPY assayed SEM HHsearches. 7. Shi J, Blundell TL, Mizuguchi K. FUGUE: sequence-st K.FUGUE: Mizuguchi TL, ShiJ,Blundell 7. universe. protein ofthe regions C, Wooley J, KrishnaSS,Bakolitsa LiZ, L, Jaroszewski 6. 2002;315:1257-1275. Mol Biol pr asensitive thetwilightzone: YonaG,LevittM.Within 5. 2003;31:683-689. prot between weaksimilarities Finding AR. Panchenko 4. pr search database ofprotein new generation Schaffer TL, AltschulSF,Madden 3. Proc relationships. distant evolutionary co Assessingsequence TJ. C,Hubbard SE,Chothia Brenner 2. still over Why Hughes TR. arethere L, Pena-Castillo 1.

This article isprotected bycopyright. Allrights reserved. Accepted Article

We would like to thank Johannes Söding for assistance in using HHsearch, Kate Bowers forvery KateBowers HHsearch, inusing forassistance Söding liketothankJohannes would We Rep 2006;7:174-179. Rep2006;7:174-179. 79 Suppl 10:37-58. 79 Suppl10:37-58. DRF adapted the program hhhitlist.C, construc hhhitlist.C, theprogram adapted DRF GTPase. EMBO J 2005;24:2064-2074. EMBO J 2005;24:2064-2074. GTPase. ogy domain and verotoxin: ogy domain PLoS Biol 2009;7:e1000205. 2009;7:e1000205. PLoSBiol he Protein Society 1995;4:1977-1983. 1995;4:1977-1983. Society he Protein n factor RPA, and a potential role in rolein RPA,andapotential n factor AA, ZhangJ,Z,Lipm Miller W, contributed to drafting ofthemanus todrafting contributed Natl Acad Sci U S A 1998;95:6073-6078. Natl AcadSciUSA1998;95:6073-6078. created and analyzed match lists. TPL c matchlists.TPL analyzed created and gap penalties. J Mol Biol 2001;310:243-257. 2001;310:243-257. JMolBiol gappenalties. king to ESCRT-I and membranes. Cell 2006;125:99-111. 2006;125:99-111. Cell andmembranes. king toESCRT-I ograms. Nucleic Acids Res 1997;25:3389-3402. 1997;25:3389-3402. AcidsRes Nucleic ograms. M-HMM comparison. Bioinformatics 2005;21:951-960. 2005;21:951-960. Bioinformatics M-HMM comparison. ins and phosphoinositides. Biochem Soc Symp BiochemSoc ins andphosphoinositides. reading of the manuscript. SEM has a BBSRC iCASE studentship studentship SEMhasaBBSRCiCASE ofthemanuscript. reading the problem of measuring andev theproblemofmeasuring , Stuckey JA. Crystal structure of a phosphoinositide structureofaphosphoinositide JA.Crystal , Stuckey 1000 uncharacterized yeast ge uncharacterized 1000 excision repair. Nat Struct Mol Biol 2004;11:616-622. NatStructMolBiol2004;11:616-622. repair. excision ed on position-specific scoring matrices. JMolBiol scoringmatrices. position-specific ed on , Brunger AT. Exo84 and Sec5 are competitive regulatory regulatory arecompetitive andSec5 Exo84 AT. , Brunger ssessment of template based prot based ssessment oftemplate Y, Emr SD, Williams RL.ES Y,EmrSD,Williams ructure homology recognition recognition homology ructure and Charcot-Marie-Tooth syndrome. Mol Cell 2003;12:1391- MolCell syndrome. Charcot-Marie-Tooth and g MJ. Clustering of protein ofprotein Clustering g MJ. eins by sequence profile comparison. Nucleic Acids Res Acids Nucleic comparison. profile by sequence eins the role of structural information in remote homology remotehomology in information theroleofstructural yrosine binding domain of Shc. Nature 1995;378:584-592. Nature1995;378:584-592. ofShc. domain binding yrosine design, execution, analysis or writing up of this work. up ofthis The orwriting analysis execution, design, Deacon AM, Wilson IA, Godzik A. Exploration of uncharted ofuncharted IA,GodzikA.Exploration AM,Wilson Deacon ofile-profile comparison tool comparison ofile-profile MMs using structure and hydrophobicity information. BMC information. hydrophobicity and MMs usingstructure id sequence profiles. J Mol Biol 2003;334:1043-1062. 2003;334:1043-1062. JMolBiol profiles. id sequence E, Dubaele S, Egly JM, Thierry JC, Kieffer B,PoterszmanA. S,Egly JC, Dubaele JM,Thierry E, Meadows RP, Sattler M, Harlan JE, Wade WS, WS, Wade Burakoff SJ, RP, SattlerM,HarlanJE, Meadows , Formosa T. The structure of the yFACT Pob3-M domain, Pob3-Mdomain, yFACT structure ofthe The T. , Formosa ng and imaging. KC created and analyzed matchlists,and analyzed KCcreatedand imaging. ng and tite phospholipid-binding module in the neurofibromatosis intheneurofibromatosis module phospholipid-binding tite PC, Waterfield MD, Thornton JM. Structural similarity JM. Structuralsimilarity MD, Thornton PC,Waterfield ledge-based prediction of prediction ledge-based ips from the Biochemical Society and the Mycological Mycological and the Society fromtheBiochemical ips relli J. Functional insights fromstructuresofcoactivator- insights J.Functional relli mparison methods with reliable structurally identified identified structurally with reliable methods mparison ted Ychunk200, and carried out all offline outall carried and Ychunk200, ted nucleosome deposition. Mol Cell 2006;22:363- MolCell deposition. nucleosome cript, and read and approved the final final the and approved cript, andread an DJ. Gapped BLAST and PSI-BLAST: a andPSI-BLAST: BLAST an DJ.Gapped onceived the study, created matchlists, created thestudy, onceived domains in the human genome. J genome. inthehuman domains CRT-I core and ESCRT-II GLUE coreandESCRT-II CRT-I based on information theory. J theory. oninformation based using environment-specific environment-specific using nes? Genetics 2007;176:7-14. Genetics2007;176:7-14. nes? aluating structural similarity. similarity. aluating structural protein structures and the structures protein ein structure predictions in ein structurepredictions 48. 48. R,McGoldrick LL, MK,Policastro Park JS, Thorsness Vps protein in theendosomal 47. PeterAT, B. ER-mi P,Kornmann AB,John Walter Lang J CellSci2012;125:3004-3011. 46. morphoge membrane regulates Park JS,NeimanAM.VPS13 component -independent and dependent 45. 45. Munro S. Levine TP, Tar yeast and in 3-phosphate phosphatidylinositol 41. phi- P5proteinfrombacteriophage NV. Pei J,GrishinThe 1998;14:707-714. Bioinformatics sequence. 40. the e of Gerstein M.Measurement me widespread an and ancient of TSET, Characterization 39. Hirst J,SchlachtA,NorcottJP, D, Traynor in domains thatformplatformsforsmallGTPases 38. Roadblock and of new Longin A, BarrFA. Discovery Gerondopoulos LH,GattaAT, Daniels Wong RD, TP, Levine to isstructurallyrelated neurodegeneration, 37. MJ. Hayes LH, DanielsWong GattaAT, RD, TP, Levine 2011;410:18-26. APH-1.JMolBiol secretase subunit 36. of type DA, DixonJE,GrishinNV.Expansion Pei J,Mitchell 2011;333:1637-1640. factors. Science 35. ar TAF1B human and S.YeastRrn7 Knutson BA,Hahn interactors asconserved proteins 34. Yeas TP. J,Levine K,Satkurunathan MJ,Bryon Hayes 44. 44. Bryant M,GouldR, DJ, MorrowGillooly IC,Lindsay ER-vacuol and ER-mitochondria of component 43. J, PrinzWA, A,Yamada RD,Toulmay Murley A,Sarsam analysis ofmembranetargetingby 42. A,SinghS,KeletiD, MendrolaJM,Audhya Yu JW, Pr ofthe :apublication science 33. architectu repeat and domain intothe Knutson BA.Insights 2010;26:1927-1931. Bioinformatics ERandmitochondria. between exchange forlipid a structuralbasis 32. to ofSMPdomains Homology Kopec KO,AlvaV,LupasAN. tr tethered of membrane superfamily toanewbelong 31. A,S T, Wiedmer Biegert RD,SimsPJ, Bateman A,Finn 2009;8:1453-1455. 30. fold ofacaspase-like NV. Prediction Pei J,Grishin subunits ofRNApolymeraseII. 29. SS.Expansion Knutson BA,Broyles Sci1999;24:441-445. Biochem 28. PHs M,SarasteM. The E,Nilges N,Baraldi Blomberg of features functional reveals 27. LL,Fo H,McCullough DJ,Whitby FG,Robinson Kemble JMolBiol2010;396:31-46. domain. L DasD,DellerMC,Duan T, Clayton 26. Axelr P,AstakhovaT, RD,Abdubek A,Finn Xu Q,Bateman domain studyof theN-terminal 25. 25. K,Zhang X, Orlando A,LeeSH,Zhang Baek K,Knodler putative membrane-associated proteins putative membrane-associated 55. M,BorkP.GRAM, StraussM,Brendel anovel Doerks T, 54. 1993;363:309-310. Nature homology. domain BA.Pleckstrin HB,Hemmings Haslam RJ,Koide nucleic and structure ofproteins and sequence 53. N. H, ErezE,Martz Ben-Tal Ashkenazy E,PupkoT, 2005;33:W244-248. Res Acids Nucleic prediction. 52. interactiv HHpred A,LupasAN. J,Biegert The Soding a fungi,bacteria inplants, advances Recent 51. S, PlatJ, WarneckeD. A, Grille S,Zaslawski Thiele ofitsstructuralgene. isolation allows 50. Overprodu LA,Stevens TH. JH,HunterCP,Valls Rothman 49. (PH)domains homology BA.Pleckstrin Ingley E,Hemmings atme islocalized and function promotes mitochondrial

This article isprotected bycopyright. Allrights reserved. Accepted Article geting ofGolgi-specif the histone chaperone FACT. J FACT. chaperone thehistone 13. J Cell Biol 2015;210:883-890. 2015;210:883-890. 13. JCellBiol of exocyst subunit Sec3. J Biol Chem 2010;285:10424-10433. 2010;285:10424-10433. Sec3.JBiolChem subunit of exocyst otein Society 2005;14:1370-1374. 2005;14:1370-1374. otein Society Virus Genes200 of BLOC-1. Traffic 2011;12:260-268. 2011;12:260-268. of BLOC-1. Traffic S. cerevisiae pleckstrin homology S.cerevisiaepleckstrin , etal. Proc Natl Acad Sci U S A 1986;83:3248-3252. AcadSciUSA1986;83:3248-3252. Proc Natl ffectiveness of transitive sequence compar oftransitivesequence ffectiveness . Trends Biochem Sci 2000;25:483-485. 2000;25:483-485. Sci Biochem . Trends of poxvirus RNA polymerase subunits sharing homology with corresponding with corresponding homology sharing subunits RNApolymerase of poxvirus s. Curr Biol 2002;12:695-704. Biol2002;12:695-704. s. Curr Bacterial pleckstrin homology domains homology pleckstrin Bacterial DENN Rab-GEFs. Bioinformatics 2013;29:499-503. 2013;29:499-503. Bioinformatics DENN Rab-GEFs. nd animals. Prog Lipid Res 2010;49:262-288. 2010;49:262-288. Res Lipid Prog nd animals. mammalian cells. Embo J 2000;19:4577-4588. EmboJ2000;19:4577-4588. cells. mammalian Bloomfield G, Antrobus R, Kay RR, Dacks JB, Robinson MS. MS. Robinson G,AntrobusR,Kay RR,DacksJB, Bloomfield e contacts. J Cell Biol 2015;209:539-548. 2015;209:539-548. e contacts.JCellBiol acids. Nucleic Acids Res 2010;38:W529-533. AcidsRes2010;38:W529-533. acids.Nucleic 8;36:307-311. 8;36:307-311. ic pleckstrinhomology domains Ragulator and TRAPP-II. Sma RagulatorandTRAPP-II. anscription factors. Bioinformatics 2009;25:159-162. 2009;25:159-162. factors.Bioinformatics anscription mbrane contact sites. Mol Biol Cell 2016;27:2435-2449. 2016;27:2435-2449. MolBiolCell contactsites. mbrane DeWald DB, Murray D, Emr SD, Lemmon MA. Genome-wide DB,MurrayD, EmrSD,LemmonMA.Genome-wide DeWald in Tannerella forsythia virulence factor PrtH. Cell Cycle PrtH.CellCycle factor virulence forsythia in Tannerella NJ, Gaullier JM,PartonRG, NJ, Gaullier The functions of steryl glycosides come to those whowait: tothose come functions glycosides ofsteryl The ConSurf 2010: calculating evolutionary conservation in conservation evolutionary ConSurf2010:calculating e server for protein homolog forprotein e server mbrane trafficking complex. eLife 2014;3:e02866. 2014;3:e02866. eLife complex. trafficking mbrane Hollingsworth NM, Thorsness PE, Neiman AM.YeastVps13 PE,Neiman NM,Thorsness Hollingsworth uperfold: a structural scaffold for multiple functions. Trends Trends functions. scaffoldformultiple uperfold: astructural t homologues of three BLOC-1 subunits highlight KxDL highlight BLOC-1subunits ofthree t homologues e TFIIB-related RNA polymerase I general transcription transcription Igeneral polymerase RNA e TFIIB-related tochondrial junctions can be bypassed by dominant mutations bydominant canbebypassed junctions tochondrial oding J. Phospholipid scramblases and Tubby-like proteins proteins andTubby-like scramblases J.Phospholipid oding rmosa T, Hill CP. Structure of the Spt16 middle domain domain HillCP.StructureoftheSpt16middle rmosa T, The product ofC9orf72,a product The domain in glucosyltransferases, myotubularins and other andother myotubularins glucosyltransferases, in domain J, Foskett TJ, Guo W, Domingu GuoW, J, Foskett TJ, Nunnari J. Ltc1 is an ER-localized sterol transporter and a and sterol transporter Nunnari J.Ltc1isanER-localized 6 is a distant homolog ofly 6 isadistanthomolog Biol Chem 2013;288:10188-10194. 2013;288:10188-10194. Biol Chem ction-induced mislocalization of mislocalization ction-induced re of target of rapamycin. re oftargetrapamycin. II CAAX proteases reveals evolutionary origin ofgamma- origin evolutionary reveals IICAAXproteases od HL, Bakolitsa C, Carlton D, Chen C, Chiu HJ,ChiuM, D, ChenC,Chiu C,Carlton Bakolitsa od HL, in signal transduction. J Cell Biochem 1994;56:436-443. Biochem1994;56:436-443. JCell transduction. insignal the TULIP superfamily of lipid-binding proteins provides provides proteins superfamilyoflipid-binding theTULIP nesis during sporulation in Saccharomyces cerevisiae. cerevisiae. Saccharomyces in sporulation during nesis domains. MolCell domains. ll GTPases 2013;4:62-69. 2013;4:62-69. ll GTPases ison, through a third 'intermediate' athird'intermediate' through ison, involves both PtdIns4-kinase- involves both : a prokaryotic origin for thePH for origin : aprokaryotic Stenmark H. Localization of Stenmark H.Localization y detection and structure and y detection gene strongly implicated in implicated strongly gene 2004;13:6 tic transglycosylases. Protein transglycosylases. tic J Struct Biol 2010;170:354-363. J StructBiol2010;170:354-363. ez R.Structure-function a yeast vacuolar protein yeast protein avacuolar 77-688. 77-688. 69. 69. ple The PA. JM,SztulE,Randazzo X, Gruschus Jian Sci2010;123:3478-3489. Imh1p.JCell Golgitorecruit Arl1p atthelate 68. ChangLC, Hsu PC,HsuJW, CY, HC,Fang KY, Chen Tsai 2010;49:9353-9360. 67. binds M,OstapEM.Myo1e EA,IgnacioCM,Krendel Feeser 2006;17:4856-4865. Cell Mol Biol domain. homology 66. SeptD, Ostap EM.My JM,Lin T, Laakso DE, Hokanson futureof SY.Bigdata:The O,Rhee White 65. Hannick Howe M,Fey L,Hide D,Costanzo T, P,Gojobori 2011;39:W29-37. 64. web J,Eddy server:intera HMMER RD,Clements Finn SR. Res 2010;38:D296-300. Acids Nucleic 63. Lees J,YeatsC,RedfernO,Clegg 62. 1998;14:755-763. Bioinformatics Markovmodels. hidden Eddy SR.Profile 2006;34:D247-251. EL,BatemanA.Pfam SR,Sonnhammer R,Eddy Durbin 61. Finn RD,MistryJ,Schuster-Bockler Acids Res2016;44:D279-285. Pfam J,BatemanA, SalazarGA, proteinfami A.The Tate 60. RY,EddyMistry J, SR, P,Eberhardt Finn RD,Coggill assignments. genome and alignments 59. HM C.SUPERFAMILY: Gough J,Chothia Nature2005;437:154-158. proteins. related 58. WA,HurleyIm YJ,Raychaudhuri S,Prinz Struct JH. St Nat coupling. conformational 57. Lu Q,LiJ,Yetail F, ZhangM.Structureofmyosin-1c andELMO1forRacactivationin DOCK2 formation between Koshiba S, Kigawa T, Terada T, Shirouzu M, Kohda D, Nishikimi A, Fukui Y A,Fukui D,Nishikimi ShirouzuM,Kohda T, Terada T, S,Kigawa Koshiba 56. C, M,Mishima-Tsumagari K,Kukimoto-Niino Hanawa-Suetsugu

RD, Gough J, Haft D, Hulo N, KahnD RD, GoughJ,HaftD,HuloN, 70. BairochA,Bateman R,AttwoodTK, Hunter S,Apweiler site.J binding is anallosteric 75. 75. MJ.Proteinstru LA, Sternberg Kelley Acids Res2005;33:W284-288. 74. LiW, Godzik A.FF Rychlewski L,LiZ, L, Jaroszewski Res 2006;34:W335-339. Acids Nucleic 73. C,RemmertM,SodingJ,LupasAN.The A,Mayer Biegert NatMethods2012;9:173-175. HMM alignment. 72. li A,SodingJ.HHblits: Remmert M,BiegertA,Hauser AcidsRes2012;40:D700-705. yeast. Nucleic ofbudding resource genomics SR,FiskDG,HirschmanJE,HitzBC, Engel CJ Karra K,Krieger 71. R,Bi C,Balakrishnan EL,Amundsen JM,Hong Cherry 2009;37:D211-215. protocols 2009;4:363-371. protocols

This article isprotected bycopyright. Allrights reserved. Accepted Article

Biol Chem 2012;287:24273-24283. 2012;287:24273-24283. Biol Chem ruct Mol Biol 2015;22:81-88. 2015;22:81-88. ruct MolBiol , etal. Nucleic Acids Res 2002;30:268-272. AcidsRes2002;30:268-272. Nucleic A, Orengo C. Gene3D: merging structur merging A, OrengoC.Gene3D: B, Griffiths-Jones S, Hollich V,Lassm S,Hollich B, Griffiths-Jones cture prediction on the Web: a case theWeb: on cture prediction biocuration. Nature 2008;455:47-50. 2008;455:47-50. Nature biocuration. InterPro: the integrative protein si protein InterPro:theintegrative Ms representing all proteins ofknow all proteins Ms representing ural mechanismforsterol : clans, web tools and services. Nucleic Acids Res Res NucleicAcids services. weband : clans, tools ckstrin homology (PH) domain of domain (PH) homology ckstrin bound to calmodulin provides insights into calcium-mediated intocalcium-mediated insights provides tocalmodulin bound ghtning-fast iterative protei iterative ghtning-fast AS03: a server for profile--profile sequence alignments. Nucleic Nucleic alignments. sequence AS03: aserverforprofile--profile Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas M,Sangrador-Vegas Mitchell AL,PotterSC,PuntaM,Qureshi lies database: towards a more sustainable future. Nucleic future.Nucleic sustainable towardsamore database: lies nkley G, Chan ET, SS, MC,Dwight Christie KR,Costanzo G,ChanET, nkley o1c binds phosphoinositides through a putative pleckstrin pleckstrin aputative through phosphoinositides binds o1c lymphocyte chemotaxis. D Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn L,Finn L,Duquenne P, DasU,Daugherty BinnsD,Bork W, Hill DP, Kania R, Schaeffer M,StPierreS, W, HillDP,KaniaR,Schaeffer S, Twigger MPI Bioinformatics Toolkit for protein sequence analysis. analysis. forproteinsequence Toolkit MPI Bioinformatics ctive sequence similarity searching. NucleicAcidsRes searching. similarity ctive sequence Tsai YT, Yu CJ, Lee FJ. Syt1p promotes activation of Syt1p promotesactivation YuCJ,LeeFJ. YT, Tsai anionic phospholipids with high affinity. Biochemistry with affinity. Biochemistry high phospholipids anionic , etal. Akasaka R, Ohsawa S,ItoN, AkasakaN,Sekine T, Tochio Saccharomyces Genome Database: the GenomeDatabase: Saccharomyces , etal. study using the Phyre server. Nature server. Phyre the using study ann T, Moxon S, Marshall M, Khanna A, S, MarshallM,Khanna Moxon ann T, gnature database. Nucleic AcidsRes Nucleic database. gnature e and function for a Thousand genomes. genomes. foraThousand function e and n structure. SCOP sequence searches, searches, SCOPsequence n structure. Structural basis of functional complex complex offunctional Structuralbasis sensing andtransportbyOSBP- n sequence searching by HMM- searching n sequence OI:102210/pdb3a the Arf exchange factorBrag2 the Arfexchange 98/pdb 2010. 98/pdb 2010. Domains in pleckstrin were initially identif were initially inpleckstrin Domains weak classicalPH, = helix; strongcolors = blue thebetasand with ofbothsides views across 86residues) 83-183); residues (1lw3_A, ofMTMR2 domain structure identified as beta sheets 1–5 of a complete PH-like domain PH-like of acomplete 1–5 sheets asbeta structure identified solved soonafter LEGENDS

Ribbon diagrams oftheC diagrams Ribbon signi no share that 1:PH-likedomains Figure

This article isprotected bycopyright. Allrights reserved. Accepted Article Fidler C A lsi HGRAM Classic PH

(15) superimposed et al. . GRAM domains were identified as a family of sequences ~70 aalong ~70 of sequences were asafamily identified . GRAMdomains α backbones of: backbones Figure1. 180° B ied as a family of homologues ~100-120 aa ~100-120 as afamilyofhomologues ied A. classical PH domain in PEPP1 (1upq_A, residues 54-152); 54-152); residues (1upq_A, inPEPP1 domain classicalPH C. colors = GRAM. For more GRAM.For inA-C,see colors= detail ficant sequence have highly similar folds. similarfolds. have highly sequence ficant the two structures superimposed superimposed thetwo structures wich. Models colored by secondary structure: red = sheet, red= structure: by secondary colored wich. Models (18)

. (root mean square distance = 1.9 = distance mean square (root (54) , and several structures were structures , andseveral (55) Movies 1–3 Movies which firstsolved the respectively. B. GRAM

Å seen between two PTB groups and in two Ran-binding domain (RanBD) groups (EVH1, WH1). 20 independent families families 20 independent WH1). (EVH1, groups (RanBD) domain Ran-binding andintwo groups two PTB seen between with phosphotyros some overlap have some andthey groups, PHdomains Classical outlines. bydashed indicated grouping with false posit no occurred that (p>0.005) hits significant shown homologies, Weak 1. Table Supplemental listed in with Fami ofdomains. numbers correlates ofoverlap extent A. Figure 2: Family grouping of PHdomains in PDBand yeast C B A Spt16M1

~240 PH-like domains in PDB were divided into 39 families by PSI-BLAST. The size of the coloured shapes and and shapes sizeofthecoloured byThe PSI-BLAST. into39families were divided inPDB domains PH-like ~240 23 49 Fidler 1 1 This article isprotected bycopyright. Allrights reserved.

Accepted Article3 PDB domains–PSI-BLAST Yeast domains–HHsearch PDB domains–HHsearch PTB PTB PTB 20 4r8g 126classic et al. 4gx4gxb4 gxb xb 45classic / b 4 FERM 2 4gxb FERM /

b FERM 115 classic Figure2. C

7 2rrf AVO1 5

3 1 AVO1

7 4k17

DOK-PTB 44k17

2 9 7 NF1 2 NF1 1 12 5 26 split split 3 40 33 SEC3 SEC3 14 1 myosin1c-TH1 4r8g = 4r8g 4xuu 4xuu 2 1 2rrf 2rr u f 2 2 ive hits above them in hit lists. They lead to weak inhitlists. family to They lead hitsabovethem ive by dashed lines, were deduced from the presence ofnon- from the presence were deduced lines, dashed by ly names are standard abbreviations, and PDB identifiers are are identifiers PDB and abbreviations, arestandard names ly DCP1 VPS36 1 GRAM 1 ine binding (PTB) domains. A high degree of overlap is ofoverlap Ahighdegree domains. (PTB) binding ine ISP3 (~50% of all domains) are made up to three overlapping tothreeoverlapping made up are domains) ofall (~50% Rtt106C 2 TFIIH ISP VPS36 P3 3 TFIIH VPS36 3 GRAM 1 Rtt106C 20 RAM BEACH bact. 1 bactbact. 3 7 pre- POB3N 1 4 Rtt106C RanBD BEACH t. DCP1 DCP1 POB3N pre-

ICln

26 1 / 44em4emo 4emo 1 3 1 emo SPT16M1 GRAM GRAM 2 GRAM o 3 ICln recog SS- 1 2 necapn necap pre-BEACH 1 recog SPT16M1 5 CARM SS- 19 27 1 p

2 RanBDR RanBDR

+ 29

CCARMCA CARM

+ + Rpn13 6 Rpn13 Rpn13 M ICln split 3 3 3 1

+ + OC OCRL1/Inpp5b recog 4khb_A = 4gxb 4gou SS- TFIIH ISP3 SPT16D SEC3 RL1/In OCRL1/Inpp5b 2 2 2kie 2 2kie 2kie 4khbA 3 4gou 3u12 4gou 1 u1 4khbA 4emo AVO1 e e 4xuu 2 2kig2 2kig 2 POB3N pp NF1 bact. 2kig 2 2 2 necap 4k17 3u12 5 b

between prob[SS] ( prob[SS] between was less than predicted fromtheprob[SS]metr was predicted lessthan domains. ofnon-PH-like fortheoccurrence scanned and merged indicates the position of Age1p, astrong fa ofAge1p, theposition indicates squares)ar (black positives strongfalse Three not shown). ten similarly hits Non-PH-like were exceptions. some there wi increased COLs groups Inboth occurrences). individual (tot centile prob[SS] means forevery A. Figure 3. Properties ofPH-like and non-PH-like hits faint outline. (details literature inthe described PH-likedomains yeast 73 with grey toblack). colour 80% varying ofthesethreegroups fusion topartial leading GRAM domains, hitstobot produce domains families.Some other with nine fused are alsoGRAMdomains with domains; FERM-C arefused PTB groups. inlarger included are now small families with other toeach prob[SS] hits thatproduce Domains HHsearch. tolargerfamilies. with nolinks orthreedomains two one, contain B A Fidler 100 100 Specificity of HHsearch at different le HHsearchat of Specificity # aligned residues false positives (%) 20 40 60 80 This article isprotected bycopyright. Allrights reserved. Accepted20 40 60 80 Article 0 50 60708090 100 10 2030405060708090 100 et al. ≥ 50%) and the number of aligned residues (COL residues aligned of thenumber 50%)and Figure3. prob[SS] (%) prob[SS] (%) false +ve true +ve C. al n=2200, median 22 hits per centile) and for non-PH-like hits (showing 90 (showing hits andfornon-PH-like hitspercentile) 22 median al n=2200, The families arranged according to the HHsearch analysis from B were populated were fromB populated analysis the HHsearch to according familiesarranged The vels of prob[SS]. Hit lists withof prob[SS].Hitlists prob[SS] vels expected observed lse positive in yeast (Supplemental Figure 1). Figure (Supplemental yeast in positive lse ic. No non-PH-like domain scored prob[SS]>85%. scored prob[SS]>85%. domain ic. Nonon-PH-like *

th prob[SS]. Although COLs was positives, lower COLs prob[SS].Although forfalse th ded to have lower secondary structural similarity scores (data (data scores structural similarity secondary lower ded tohave h classical PH domains and either PTB/FERM-C domains or domains PTB/FERM-C either and domains classicalPH h e described in detail in Supplemental Figure 1.Asterisk Figure inSupplemental detail in e described . Dashed lines indicate incomplete overlap (prob[SH] = 50- (prob[SH]= overlap incomplete indicate lines . Dashed in Supplemental Table 2A). Empty families are shown in areshown Empty 2A). families Table inSupplemental At each level of prob[SS], the rate of non-PH-like hits hits therateof non-PH-like levelofprob[SS], At each ≥ B. 85% are grouped together. 16 of the 20 unlinked 20unlinked 16ofthe together. grouped are 85% The same domains and families were analyzed by were analyzed and families same domains The s, see Box 3), both for true positives (showing (showing truepositives 3), bothfor s, seeBox ≥ 5% from 39 PDB-to-PDB searches were searches PDB-to-PDB 5% from39 B. The relationship relationship The 2B. Table inSupplemental are like domains henceth weighting, structural secondary with prob[SH]. low This identified tentatively like domain domai ahelical contains Rbh2p) (and terminus ofRbh1p kinase=PKinase, =TMD, asfollo BAR,StART, andothers C2, RasGAP,DH,TBC, domains PH-like other,known include domains Accompanying so position, atthesame with PH-likedomains homologues Pkh1/ Bud2p-2, evolution. widelyin eukaryotic present Lightgrey box forbothsearches. aregiven prob[SS] values predi theonedomain For with prob[SS]. higher tothehit belong bothpr line, inthesame arereported paralogs hits. Where sectionshow scale).Main graded strongest hits(blue–red Ne omitted. been have patterns domain similar share highly A. yeast in identified 4:New PH-likedomains Figure B A Tph3p-1 &-2 (Caf120p-2) Figure 4 16 strongly predicted new PH-like domains in yeast, shown in PH-like domains new predicted strongly 16 Spo71p-1 (Sip3p-2) Bud2p x2 Lam1p-2 (Pkh2p) (Rbh2p) Vps13p Vid27p-1 This article isprotected bycopyright. Allrights reserved. Gyp7p Pkh1p Rbh1p

AcceptedRec114p Article & -2

52 % chorN PH 57% PH 8 100% 98% Fidler etal. 96% PH

PH helical > BAR PH 7 88% 97% 100%

PKinase PH e s-oPBPDB-to-yeast Yeast-to-PDB 2001 100% PH PH 5 C2 PH chorein-N domain in VPS13=chorN, WD40=WD, glycine-rich=GLY. The N- The glycine-rich=GLY. WD40=WD, VPS13=chorN, in domain chorein-N

e usual prob[SS] value scale does notapply. does scale value prob[SS] e usual DUF1162 428 WD RasGAP 92-96% TBC PH prob[SH] forprediction(%) 25 PH 95% 98 % 561 PH WD 612 PH 97% 2p, Tph3p and Caf120p (domains with green surrounds) have have with surrounds) green (domains andCaf120p Tph3p 2p, was first found with PDB-to-yeast searches using increased increased searchesusing waswith firstfound PDB-to-yeast n of unknown function with toRhoGEFs. homology unknown function of n ws: domain of unknown function=DUF; transmembrane transmembrane function=DUF; ws: domainofunknown s yeast-to-PDB hits; right-hand section shows PDB-to-yeast PDB-to-yeast sectionshows right-hand hits; yeast-to-PDB s 0100 50 746 ob[SS] values are reported, but the diagram and shading shading and butthediagram reported, valuesare ob[SS] these new discoveries are to some extent expected. aretosomeextent expected. thesenewdiscoveries 766 es (Gyp7p, Vid27p Vps13p) indicate new PH-like families families PH-like new indicate Vps13p) Vid27p es (Gyp7p, w domains (black outline) are are outline) w (black domains 782 (in Lam1p/Sip3p and Spo71p - shaded black) as wellas black) as -shaded Spo71p and Lam1p/Sip3p (in in the context of9full-lengt in thecontext StART cted by indirect alignment (yellow outline, Vid27p-1), Vid27p-1), outline, (yellow alignment byindirect cted PH

75

TMD 200 aa 1104 GLY TMD PH PH 94% Details of all newly predicted PH- predicted ofallnewly Details 1228 1246 3144 shaded according prob[SS] of prob[SS] according shaded that 4paralogs h proteins. weighting =30% secondary structural < 98 % -7 74% 5-97% PH PH 1 63% 61% PH > < < 77% B. 5-23% 5-33% < PH PH PH 66% PH PH 75% PH 97% 5% PH One PH- PH PH PH 96%

as query (Q, top 3 lines) aligned with a target (T) hit with (T) atarget aligned (Q,top3lines) as query with ot all seen being isnonspecific, enrichment anddotseithersideofoccasion buds (filledarrowheads), A. of Vps13p by PH-likedomain the targeting 5:Intracellular Figure amuch cellsto in accumulates I3129A) and (L3125A madeby HHalign. Alignment with and“|” (“+” respectively). 3hsa_A that align (highli I3129 and COLs.L3125 prob[SS] and “ss_dssp”sh targetalsohas target. The andCforuns Hforhelix (red) three states,Eforsheet(blue), ForbothQ “-“isbad,and“=”aclash. & “.”isneutral, good, amazonensis Q consensus confidence prediction T ss T ss 3hsa_A 31 T consensus Vps13p 3050 prediction Q ss B AC Fidler

GFP-Vps13-PH-PH (dimer) weakly seen thebudneck, targets (dimer) GFP-Vps13-PH-PH

dssp chorN GFP-Vps13PH (x2)

This article isprotected bycopyright. Allrights reserved. Accepted Article

gE~i~~ayk~i--rD~~~fTnkRlI~vD~qg~tGkK~~~~sipY~~I~~~siEtAG~~D~D~Elki~i~g~~~~i~~ef~k~~di~~i~~~La~~v

~d~Y~~H~~l~~~~~vlliT~~RIi~~~~~----~~~~~W~~~~~di~~i~~~~~G------i~i~lk~~~~~-fi~i~d~~~r~~ly~~i~~av

4444444444468899999999999954244445799999999988644477777765554233444444555777776554 TCCEEEEEECS--SEEEEEESSEEEEEEEESTTSCEEEEEEEEGGGEEEEEEEEESTTSCEEEEEEEETTCSSCEEEEECTTSCHHHHHHHHHHHH CeEEEcCEEEceEEc------EEEEEecCcCC-EEEcCCHHHHHHHHHHHHHHH CceEEEEEEecCCcEEEEEEcceEEEEEcC---- CC NEELQLAYKMV—RDLFVFTSKRLILIDKQGVTGKKVSYHSIPYKAIVHFQVETAGTFDMDAELKLWISGQHEPLVKELKRGTDVVGIQKTIARYAL ++.=.+-+-+. ...+++||+|+|.++..+.+-.-++||..|..+++|..||.|.+.++..+-+-.+....=..+++-|+..| NDEYLSHVILPGKELAVIVSMQHIAEVQMA----TQELMWSTGYPSIQGITLERSG------LQIKLKSQSEY-FIPISDPEERRS EEEEEE eee--cc , bottom 5 lines). The middle line indicates which residues align, where: “|” is a very good alignment, “+” is “+” alignment, where: avery “|” good is align, which residues indicates middleline The , bottom5lines).

EEEE et al., ecCe EEEEE cCcCccccee Ht ProbE-value Hit haAP-iedmi 95.60.39823050-313331-124 3hsa_A PH-likedomain Vps13p PH (3hsa_A) CcEEEEEeeHHHccceEEeCCe Vps13-C EEE Figure5 ecchhe GFP-Vps13PH EEEEEE

eCccccCccc DU

EEEEE F1162 owing its solved structure. The boxa structure.The itssolved owing cCCce

COLs QueryHMM EEEEE ghted in yellow) are partially conserved residues (lower case in consensus) inconsensus) case (lower residues conserved partially are yellow) in ghted ecCCCC LIAA her PH monomers and dimers (data not shown). (datanotshown). dimers and her PHmonomers HHHHHHHHHHH 3048—3144

Template HMM L YRN (x2) I from the solved structure 3hsa_A, a bacterial protein ( protein bacterial a from the structure3hsa_A, solved AIAV lesser extent than wild-type. Scale bars 5 µm. bars5µm. Scale wild-type. than extent lesser h 124 3133

al larger buds (hollow a al largerbuds(hollow

tructured loop (black), with prediction confidence shown for for shown confidence with (black),prediction tructured loop T, the secondary structure prediction (ss prediction) isin (ssprediction) prediction structure the secondary T, as linear targeting across the neck of small-to-medium theneckofsmall-to-medium across targeting as linear bove shows statistics on the hit, including statisticsonthehit,including bove shows C . GFP-tagged dimeric Vps13-PH(LIAA) Vps13-PH(LIAA) dimeric . GFP-tagged rrowheads). The minor nuclear minor rrowheads). The B. Vps13p 3028-3144 Vps13p3028-3144 Shewanella

µm. openarrowhe (NVJ, junction vacuole thenucleus and log phase, Sitesof (B/D/F). phase (A/C/E)orearly stationary log phase sequence. ofthePDB portion domain with thePH-like shares homology maydiffe multiple 2. However, PDBentries contain definition to homology has sequence that a Finding 2. definition a in Boi1p PHdomain yeast theclassical while in 1-3), PH domain Boththeclassical sequence. indire may (1).This thatfitsdefinition require domain thecharacteristic follows peptidebackbone The PH-like domains are the fourth most arethefourth PH-like domains Box 1: Defining aPH-likedomain 1:Defining Box BOXES (and inblue) Glossary –terms function no known of part mostconserved inMethods. The asdescribed scaled 6 similarexperiments. secretion respectively.of Rescue A-F. Vps13p of domain by PH-like targeting 6:Intracellular Figure The definition of a PH-like domain we use is ofaPH-likedomain definition The and 2). Excluding redundant sequences, >200 PH-like domain st domain PH-like >200 sequences, redundant and 2).Excluding beta slightlysplayed orthogonal, an aa, making span 100–120 earlier identifications made solely on the on solely made identifications earlier Fidler G

∆vps13∆ stationary log BF AE wt 10 This article isprotected bycopyright. Allrights reserved. controls plasmid) G. (empty Acceptedv Article Vps13-EGFP constructs, A/B: wildtype constructs,A/B: (WT) Vps13-EGFP p 4

blotstodetectsecretedC Overlay s 10 1 cells plated 3 3 WT

10 2

10 ∆vps13 PLASMID et al., 2

10 (51) 3

, not the predicted PH domain (green). (green). PHdomain predicted , notthe 10 C D + H. 4 ∆PH LIAA WT Residue conservation in the C-terminal 1095 residues of Vps13p, calculated by ConSurf and and by ConSurf calculated ofVps13p, residues 1095 intheC-terminal conservation Residue LIAA Figure6 H

2050 conservation (AU) -1 0 1 2 ∆ vps13

common fold in the human proteome foldinthehuman common basis of remote sequence homology inmost homology ofremotesequence basis

in PEPP1 and the GRAM domain in MTMR2 PEPP1 andtheGRAMdomaininMTMR2 in was as plasmids testedforVps13-EGFP PY. Controls: wild-type wild-type cells and PY. Controls: a PDB entry documented as having a PH-like fold goes some waymeet some to goes fold aPH-like ashaving documented PDB entry a either: (1) ∆PH βββββββα , C/D: L3125A I3129A (LIAA) and E/F: 1-3028( (LIAA)andE/F: I3129A , C/D:L3125A gly 3144 ct alignment (profile-sequence) rather thanjustsequence- rather (profile-sequence) ct alignment PH nd the GRAM domain in Ymr1p (orthologous to MTMR2) meet toMTMR2) (orthologous inYmr1p GRAMdomain nd the bystructural similarity tothere

pattern closely; localization include intracellular puncta (filled arrowheads) in arrowheads) puncta(filled intracellular include localization rent domains, soitisimporta rent domains, the Vps13 C-terminus is a glycine rich domain (orange) of (orange) richdomain glycine isa the Vps13C-terminus sandwich (4+3) capped by the helix (Figure 1,Movies1 (Figure bythehelix capped (4+3) sandwich ructures have been solved. solved. have been ructures ads) in stationary phase. Scale bar (shown in A only) 5 5 in Aonly) (shown Scalebar phase. ads) instationary (14) ∆ or (2) vps13 . Their 8 structural elements ( 8structural. elements Their by significant sequence homology to a to homology sequence by significant cells show no secretion and maximal and maximal no secretion show cells in A-F. Results are representative of of Resultsarerepresentative in A-F. (56,57) meet definition1(F gion occurring twice inpleckstrin. twice gion occurring nt to check that a sequence nt tocheckthatasequence butnotall Structures have confirmed Structures haveconfirmed ∆ PH) in PH) (58) cases. ∆ igure 1,Movies βββββββα vps13 cellsin ) entries. A third library constructed in-house was usedfor in-house constructed library entries. Athird at70%se (non-redundant entries ~35,000 which contains cerevisiae columns, and we found that shorter hits were oftentoasingle thatshorterhits we and found columns, prob intermsofdescending hitsordered upto~20,000 returns HHsearch within ala orthologues only toidentify specifically designed num times(maximum with Iterating more 2 iterations. built been and eukaryotes (x9): human, mouse, fruit fly, nematode, mala nematode, mouse,fruitfly, (x9):human, and eukaryotes curation of the profiles obtained, with obtained, oftheprofiles curation with or one seeded are tools All profile-sequence can be as few as 6. We applied a length threshold t threshold alength applied as 6.We can beasfew to3st states(reduced structure secondary structure. solved (C) using dominates. alignment alignm oftheoverall theproportion tovary possible (B) scoring secondarystructure. sensitivity. reducing deliberately (A) number ofiterations. setby be thatcan Important variables t wh ofHHblits, iterations multiple three stages has HHsearch powerful programis The t searchsoftware profile-profile accessible isfreely HHsearch withwhil structure, solved seeds uses only inparticula ofseeds, choice toolsdifferin Profile-sequence multiple iteratesthrough process whole information moresequence itcontains because and searching, hits (MSA) madefromseedplus alignment sequence multiple isast profile The threshold. theE-value than more closely Pfam, SuperFamily, GENE3D and many other profile-sequence tools profile-sequence other andmany Pfam, SuperFamily,GENE3D Bioinfo fromtheEuropean a resource which is InterProScan, a of themosthighly Yeast isone butlacksany linktothePH-likefold “Myosin_TH1”, perfect.For isnot example, curation and the literature but new discoveries pfam.xfam.org/clan/CL0266), ( regularly areupdated definitions Domain Box 2: Profile-sequence tools currently used todefine domains states (helix = H, sheet =E, loop H,sheet=E, states (helix = independent of sequence homology; and (iii) the number of col thenumberof (iii) and homology; sequence of independent equiva sizeofthelibrary, the given alone statisticalexpe values:(i)the threeimportant by summarized eachhit,apartfromtheprob[SS]value, For sequence. query andother Pfam, SuperFamily, Gene3D(CATH) available libraries(athttp://toolkit.tuebi isusedto structure secondary predicted (HHpred). HHsearch searches with 3:Profile-profile Box specific,asitin 2).Also,itisnotentirely Table data,butits useful provides thelikeliho enhances strongly hasbeen =0.05) SuperFamily found(>10), 0.002, Pfamnot PHdoma aclassical have inotherspecies GEFs and related has Syt1p example For homologues. arefull-length proteins to homologous ishighly sequence the protein accept weaknevertheless too are is homology because definition specific way which (HMM), codes Markovmodel intoahidden convert theMSA Amo penalties. opening/extension gap fixed using alignments theMSA. calculate way profiles inthe alsovary Tools

This article isprotected bycopyright. Allrights reserved. Accepted Article ), fission yeast ( yeast ), fission (9,62) . HMM profile-sequence searches searches . HMMprofile-sequence (13) S. pombe The defaults number ofiterati number defaults The , fast and flexible, with many settings that can be varied by users. by varied users. with thatcan be many settings , fastandflexible, (9) output is not close to 100% sensitive, only to100%sensitive, output isnotclose If the PDB is being used as either query or target, the user can choose to use solved solved touse or target,theusercanchoose query aseither IfthePDBisbeingused : od of a PH-like domain existing domain od ofaPH-like (1) ich is more sensitive than PSI-BLAST than sensitive ich ismore nnotated organisms. The Saccharomyces Saccharomyces The organisms. nnotated = C) for each position in theMSA in C)foreachposition = starting with a single seed or aligned multiple seeds (“query”), MSAsaremadeby (“query”), seeds multiple or aligned with seed asingle starting There is an option to not score secondary stru scoresecondary tonot isanoption There ) and maize ( maizeblast ) and Pfam using human intervention to group togroup intervention Pfamusinghuman ngen.mpg.de/) include include ngen.mpg.de/) he user for each of the three stages are: foreachofthethreestages user he rounds until no more sequences are added. added. are sequences more untilno rounds search through pre-made libraries of libraries pre-made through search lent totheHHblitsE-value;(ii) the2 e.g. e Pfam starts with any recognizable domain withPfam starts recognizable e any ates) instead ofpred ates) instead a protein that does contain a PH-like a doescontain that aprotein Pfam now version 30, where the PH domain clan is CL0266, see clanisCL0266, where thePHdomain Pfamnow30, version cludes three false positives (Supplemental Table 2). Table (Supplemental falsepositives three cludes profile tools, and also “unknowns”, “unknowns”, andalso tools, profile multiple sequences to find new hits, these being new hits,these tofind sequences multiple are not disseminated to databases instantly todatabases notdisseminated are are used inPfamandGene3D are used (57,66,67) ent weighted to secondary structure. Default is 11%, is11%, Default structure. weighted tosecondary ent hat the alignment should should alignment hat the The simple way, which minimizes computation, scores all scoresall which computation, minimizes way, simple The U. maydis Pfam describes a domain in unconventional myosins as myosins unconventional in adomain Pfam describes ons for building the query MSA is 3. Libraries oftargetshave 3.Libraries MSAis query the forbuilding ons off-line searches (see Yeast-to-PDB section, and Methods section,and (seeYeast-to-PDB off-line searches quence identity) updated weekly, and yeast, which has 5870 which has5870 yeast, and weekly, updated identity) quence . In addition, some domains somedomains . Inaddition, atistical representation ofthe atistical representation r how broad a net to cast. For example, the tool SuperFamily tocast.For howanet thetoolSuperFamily broad example, r rge family of homologues may ben may ofhomologues family rge accepted as PH-like,becaus accepted domain, (GEF) factor exchange guanine aSec7-homology

rmatics Institute (Hinxton, UK). (Hinxton, rmatics Institute (68,69) ctation (E-value) ofachievin ctation (E-value) SMART= Syt1p (E-values: hitin non-significant in. The hat is run either offline, or on a webserver called HHpred. HHpred. called webserver a oron offline, hat isruneither there is a precise alignment there isaprecise both “knowns” suchasPDB,andcurateddomainsfrom structural element (Supplement element structural rial parasite, thale cressan thale parasite, rial . The profile is then used to initiate another round of round toinitiateanother thenused profileis . The re sensitive, butcomputati re sensitive, umns icted structure. icted structure. than the previous round it may detect more hits. The itmay hits. round more detect The theprevious than ber = 8) enhances sensitivity, but takes longer. Searches Searches buttakeslonger. sensitivity, 8)enhances = ber ). The two libraries used mostly for this study are PDB, arePDB, mostlyforthisstudy used twolibraries ). The ed. The basis for this is typically that the remainder of theremainder that basisforthisistypically ed. The . (12) insertions or deletions at each position in aprofile- eachposition at ordeletions insertions

aligned between query and target (COLs), which (COLs), target and query between aligned . (70,71) (72) (3) including 61 out of 73 domains (Supplemental (Supplemental 73domains outof 61 including . the query HMM containing both sequence and and sequence both containing HMM thequery ability of ability s . yeast,For PH-likedomainsinInterPro (2) Genome Database (SGD) is annotated by by (SGD)isannotated Genome Database PSIPRED predicts one of three structural threestructural oneof PSIPREDpredicts cover 20% of the domain, of thedomain, 20% cover ° s target HMMs built before-hand. Freely Freely target HMMsbuiltbefore-hand. structurally related families intoclans relatedfamilies structurally tructural sim (63,64) cture at all. Run off-line, itisalso atall.Runoff-line, cture i.e. domain, and it is thought that the two thatthetwo anditisthought domain, hared s hared proteomes, both prokaryotes (>30) (>30) prokaryotes proteomes,both . that do not meet the PH-like donotmeetthe that (59,60) different residues acrossthe residues different onally harder, technique isto technique harder, onally e the GEF domain homology domainhomology e theGEF d three fungi: budding yeast ( yeast budding d threefungi: g the hit in terms of sequence ofsequence g thehitinterms (see Figure 5B) thatcanbe 5B) (seeFigure tructure (prob[SS]) withtructure (prob[SS]) the . Tools alsodifferinthe . Tools ilarity (2Ssim), which is (2Ssim), ilarity InterPro incorporates all of all InterPro incorporates efit from using 1 iteration, efit fromusing1iteration, al Figure 1A). Hits thatcan al Figure (65) . Upgrades lag behind behind lag . Upgrades

sequences that align thatalign sequences i.e. i.e. 20 aligned 20aligned sequence sequence (61) ) . . S. with 8 iterations, and align them pairwise with HHalign pairwise them andalign 8iterations, with (A2a steps fortheconfirmatory alternative A morecomplex A. To find a fold/function for an unknown protein: unknown findafold/functionforan A. To one obtained that would be expected to occur randomly given given randomly would tooccur expected be that one obtained target isassigned and query between alignment each E-value: ( samefold thatsharethe families ofdomain Clan: agroup Glossary here) domains inHHpred Box 5:Workflows aprob[SS] have typically PSI-BLAST by be detected profiles apply inflexible, standard rules, standard inflexible, apply profiles oft representation Profile: statistical ( sequences related ofhighly numbers bylarge MSAscanbedominated gaps. byinserting region with multiple sequences alignment: sequence MSA: multiple way;computatio more profile-specific way a MSAs,coding torepresent Markov Model: HMM: Hidden st more with an e-value aligns that Hit: targetindatabase by structure. MSAscanbebuilt toolthatexpl profile-profile HHsearch: serverforHHsearch. online HHpred: sensitivity. and speed that increase round multiple via query fromasingle profiles Builds HHblits: of HHsearch application pairwise HHalign: morestatis alignments 0.001) sothat (here chosen, B. To maximally extend the structural homologues foraknow homologues thestructural extend maximally B. To way (query either round them comparing and targets, way besymmetric.So atrivial toolsin might Profile-profile A:“I here”) startfrom to I Iget go?” (Q:“How do where want Asymmetrywouldn’t searches 4: in Box PSI-BLAST. of HHblitsthan basis of structural homology confirmed byHHsearch confirmed homology basis ofstructural when structurally and rela HMMs. curated,reliable, They are without kn even broadly, domains Pfam: atoolthatidentifies by filt beaddressed can problem This orthologues). group (~100) that are chosen to maximize diversity tomaximize thatarechosen (~100) group thisby does HHblits traceofitsorigin. any to eradicate tool hitsasthesame make thesame Thus, query. the centred on on toolsaredependent profile-sequence One way to overcome the problem ofprofil way theproblem toovercome One e.g. 2. Set a threshold forprob[SS]thatav Setathreshold 2. todefinefamiliesthat UsePSI-BLAST 1. of confirmthesignificance region, every For interesting 3. toexcl searches out domain-to-PDB Carry 2. afa Allhitsabove searches. out protein-to-PDB Carry 1. 6. Confirm all new matches with domain only searches. searches. with only domain newmatches Confirmall 6. searches by toPDBentries Linkfurtherhits indirect 5. with re those especially hits, non-significant Investigate 4. with thesese searches HHsearch SeedPDB-to-unknown 3. This article isprotected bycopyright. Allrights reserved.

Accepted superfamilies. called alone.Sometimes PSI-BLAST) Article strongest false positives in PDB-to-PDB searches inPDB-to-PDB positives strongest false further. considering related proteome-to-PDB searches. searches. proteome-to-PDB related homologues thathavenoovert toidentifyhits searches searches. above. obtained withhits threshold the positive family), identifying structure). setto option HHsearch from eachfamily(with one sequence

a profile-sequence tool searching in one direction ( direction inone toolsearching a profile-sequence

either PSI-BLAST or HHblits. orHHblits. PSI-BLAST either he residues across an MSA, sometimes represented as a protein “logo”. Simple Simple asaprotein“logo”. MSA,sometimesrepresented across an he residues searching in the other direction ( direction other inthe searching nally complex than a simple profile profile asimple than complex nally icitly weights a proportion ofanalignm weights aproportion icitly for example for insertions/deletions. forinsertions/deletions. forexample oids false positives, using HHsearch to carry out PDB-to-PDB searches with searches tocarry PDB-to-PDB HHsearch out using positives, oids false where they start because the profile where theprofile they startbecause e bias in favor of where itstartedis favorofwhere bias in e for direct comparison of two MSAs. MSAs. oftwo for directcomparison ensure representation of all families wi ofallfamilies representation ensure ude false positives (Supplemental Figure 2). Figure (Supplemental falsepositives ude

(72) ering MSAs for redundancy, reducing non-diversity. non-diversity. reducing MSAsforredundancy, ering tically unlikely than this are considered significant. significant. areconsidered this than unlikely tically (60) ≥ Æ . 99% and an E-value of10 E-value and an 99% (73) (set to detect predicted structure only). only). predictedstructure (settodetect . filtering an MSA (thousands of sequences) to only pickasmall toonly ofsequences) anMSA(thousands filtering target ortarget target . in related organisms. First, use unknown-to-related proteome proteome useunknown-to-related First, organisms. related in atistically significant (i.e. lower) than the chosen threshold. threshold. (i.e.lower) thechosen than significant atistically but cannot be linked by standard sequence alignment tools alignment sequence by standard linked cannotbe but long as profiles are made the same way for use as queries way asqueries foruse thesame aremade asprofiles long irly generous threshold (prob[SS]>50%, COLs>20) are worth are COLs>20) (prob[SS]>50%, threshold generous irly nd B6) is to create HMMs for the query and target regions targetregions and HMMsforthequery nd B6)istocreate own structure/function. Pfam own structure/function. the hit by comparing itsst by thehit comparing latively high COLs and 2Ssim (Box 3), with 3), unknown-to-PDB 2Ssim(Box COLsand latively high ted families are found, they are grouped into clans on the clansonthe into grouped found,they are are ted families small regions of local homology are aligned across a large acrossalarge arealigned oflocalhomology small regions s of searching, differing from PSI-BLAST in several ways in several differingfromPSI-BLAST s ofsearching, n fold in a proteomes with an HMM library (as forPH-like with HMMlibrary an inaproteomes n fold in the proteome of choice. Use these hits to carry out out hitstocarry choice.Usethese of proteome in the the size of the database bein of thedatabase the size quences, looking in the proteome of choice (one per ofchoice(one intheproteome looking quences, an e-value, which is the number of hits as good as the goodasthe ofhitsas number which isthe an e-value, features including penalizing insertions and deletions in a in anddeletions insertions penalizing features including ignore known structure, and use predicted secondary secondary use predicted structure, and known ignore e.g. Æ query) makes no difference. On the other hand On theotherhand nodifference. makes query) unknown-to-known). unknown-to-known). ent to aligned (predicted) secondary secondary (predicted) ent toaligned represents a large group of sequences sequences of group a large represents to manipulate thegr to manipulate e.g. -10 , which reflects the greater sensitivity which reflectsthegreatersensitivity , th the fold within the overall clan. clan. within theoverall fold th the known-to-unknown) will probably not not will probably known-to-unknown) rength to that obtained by the rength tothatobtained domains are constructed as areconstructed domains g searched. Athresholdis g searched. e.g. owing MSAtoattempt mammalian

sequence identity (20-90%). (20-90%). identity sequence se redundant byexcluding canbereduced Size genomes). protei structures,predicted solved small (all from relatively among adatabase into curated groupofsequences Target: P PSI-BLAST: SuperFamily: a narrow specificity tool for identifying do foridentifying tool narrow specificity a SuperFamily: areinitiated which searches :sequencefrom Query (orseed) ofBLAST. rounds

This article isprotected bycopyright. Allrights reserved. Accepted Article osition-s pecific i pecific terated b asic l asic ocal alignment s ocal alignment mains similartoknown stru mains ns in individual genomes) to very large (all proteins in all in (allproteins tovery large genomes) inindividual ns quences sharing more than a pre-determined levelof a pre-determined morethan sharing quences which homologs are being sought. Database size varies sizevaries Database sought. arebeing which homologs earch t ool. Builds profiles from a single query via multiple viamultiple query asingle from profiles Builds ool. ctures, similar to SCOP. similartoSCOP. ctures,