Protonet: Recent Developments


Nature Biotechnology

Ref: NBT-L28388A (acceptance letter – 19 Jan 2013)

ProtoNet: Charting the expanding universe of protein sequences

Nadav Rappoport1, Nathan Linial1 and Michal Linial2

1School of Computer Science and Engineering, 2Department of Biological Chemistry, Institute of Life Sciences, The Sudarsky Center for Computational Biology, The Hebrew University of Jerusalem, Israel

*Corresponding author

Corresponding author details: Michal Linial, E-Mail: [email protected]

Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University, Givat Ram Campus, Jerusalem, 91904 Israel

Phone: 972-2-6585425; Fax: 972-2-6586448

Appendix

Supplementary text:

Methodology outline and quality assessment of ProtoNet

Text Sections 1-9

Tables S1-S9

Table S1: Statistics of proteins classifications and data growth 

Table S2: Resources for annotation 

Table S3: Number of annotations for the mouse proteome 

Table S4: Annotation types for all proteins 

Table S5: Best-clusters for annotation inference by source 

Table S6: List of 2069 annotations and clusters’ statistics 

Table S7: Protein accessions for cluster A4768953 

Table S8: Expansion of cluster A4768953 by UniProtKB 2012 

Table S9: Root cluster A4741039, DUF148

Figures S1-S3

Fig. S1: The number of stable clusters along the ProtoNet depth 

Fig. S2: Quality assessment of ProtoNet clusters by InterPro keywords 

Fig. S3: General statistical information for clusters A4654740 and A4768953

Methodology – outline

1. Databases and sources

The tools and methods described in this paper were applied to the ProtoNet protein classification system. ProtoNet provides an agglomerative hierarchical clustering using several merging strategies. For the sake of simplicity, we discuss only one of the merging strategies offered by ProtoNet, called the 'Arithmetic' merging strategy [1].

The data source for ProtoNet is UniProtKB [2]. ProtoNet 6.0 is based on release 15.4 of UniProtKB. Early versions of ProtoNet, for example version 2.1, provided a classification hierarchy that covered all 114,033 proteins in the SwissProt database (SWP, release 40.28). The resulting tree contained 227,436 clusters and a total of 630 roots (mostly singletons). The growth of ProtoNet is shown in Table S1. It reflects a dramatic growth over the years, with over 9 million proteins included in ProtoNet 6.0 [3].
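The cluster and root counts quoted for the SwissProt-based tree are mutually consistent if every merge joins exactly two clusters. That binary-merge property is an assumption made here for illustration, not something stated in the text. Under it, a forest with N leaves and R roots contains N - R merged clusters, and therefore 2N - R clusters in total. A minimal sketch (the function name is hypothetical):

```python
def total_clusters(num_leaves: int, num_roots: int) -> int:
    """Count all nodes (leaves plus merged clusters) in a forest built
    by pairwise merges. Assumes each merge joins exactly two clusters."""
    # Each pairwise merge reduces the number of disconnected trees by one,
    # so going from num_leaves trees to num_roots trees takes this many merges.
    merged = num_leaves - num_roots
    return num_leaves + merged


# Figures quoted in the text: 114,033 proteins and 630 roots.
print(total_clusters(114_033, 630))  # 227436
```

The result equals the 227,436 clusters reported for that tree.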

2. Database update - ProtoNet 6.1

The growth in genomics has led to a fast increase in the protein sequence space. Coping with the most recent version of UniProtKB is extremely challenging. To meet this task, ProtoNet 6.1 was updated with 2,478,328 representatives. These UniRef50 representatives cover 18,887,498 proteins that are present in the current version of UniProtKB (24th Sep. 2012). Due to the dynamic nature of the UniProtKB resource, 241,795 ProtoNet leaves are no longer supported by UniProtKB (marked 'obsolete'). There are 2,778,067 UniRef50 representatives that are not represented in ProtoNet 6.1, and 1,972,512 of them are of size 1. Altogether, the expanded number of proteins in UniRef50 clusters that are missing from ProtoNet 6.1 reaches 6,707,256.

Table S1. Statistics of proteins classifications and data growth

UniProt version | Type of classification version(a) | Total proteins | Total clusters: w/o cond. | Total proteins: Expanded(b) | Total clusters: w cond. | Total root clusters: w/o cond. | Total root clusters: w cond.
UniProt6.1 | A-UP 2012 | 2,236,533 | 4,929,553 | 18,887,498 | 138,385 | 27,103 | 1,329,517
UniProt6.0 | A-UP 15.4 | 2,478,328 | 4,929,553 | 9,064,751 | 138,385 | 27,103 | 1,329,517
UniProt5.0 | A-UP 8.1 | 1,864,667 | 3,729,154 | 3,188,835 | 635,175 | 180 | 52,735
UniProt4.0 | A-SW41.21 | 1,072,911 | 295,615 | - | 50,768 | 8,947 | 11,716
UniProt2.0 | A-SW40.28 | 114,033 | 227,436 | - | 37,391 | 630 | 3,153
UniProt2.0 | G-SW40.28 | 114,033 | 227,436 | - | 30,065 | 630 | 2,217
UniProt1.0 | A-SW39.15 | 94,152 | 186,795 | - | 32,975 | 1,509 | 5,878
UniProt1.0 | G-SW39.15 | 94,152 | 186,795 | - | 26,312 | 1,509 | 5,111

(a) Classification types are A, Arithmetic; G, Geometric; SW, SwissProt/UniProtKB; UP, UniProtKB.
(b) Total number of proteins following expansion of the seed proteins (UniRef50 for UP15.4 and UniProt90 for UP8.1). Clusters are counted following filtration for minimal Life Time (LT≥1.0).

Fig. S1. The number of stable clusters along the ProtoNet depth (ProtoLevel). UP 15.4 (red); UP 8.1 (green); SW41.21 (blue). The X-axis ranges from 0 (proteins as singletons) to 100 (the root of the ProtoNet tree). The graph focuses on values PL 40-100.

3. ProtoNet clustering measurements

The agglomerative hierarchical clustering scheme defines a set of terms that are intrinsically associated with the process. In such a scheme, each cluster is created from smaller clusters, which are captured as its descendants in the clustering tree.

ProtoNet is model-free (i.e., no HMMs or alignment-based PSSMs are considered). Hence, only intrinsic properties of the data and the merging process determine the features of the final ProtoNet tree.

ProtoLevel (PL) ranges from 0 to 100 and is used as a standard quantitative measure of the relative height of a cluster in the merging tree. Indirectly, the PL of a cluster reflects the global average of the sequence similarity (BLAST E-score) between proteins in the cluster. Specifically, pre-calculated E-values from an all-against-all BLAST search are used for clustering. The similarity values are collected at a very relaxed BLAST E-value threshold of 100.

The PL of the leaves of the tree is defined as 0, whereas the PL of a root equals 100. The larger the PL, the 'later' the merging that created the cluster took place. Therefore, the PL scale is considered an internal timer during the clustering process (Fig. S1).

Lifetime (LT) of a cluster is the difference between the PL at its creation (i.e., the time when two clusters were merged to form the present cluster) and the PL at its termination (i.e., the time when the cluster was merged with another cluster). The LT of a cluster reflects its remoteness from the clusters in its "vicinity". Explanations for additional terms that describe the clustering process, such as depth, connectivity and compactness, are available at the ProtoNet Web site.
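The LT definition above reduces to a subtraction of two PL values. The sketch below is illustrative only: the `Cluster` record, its field names and the example PL values are hypothetical, and returning `None` for a root (which is never merged further) is a convention chosen here, since the text does not define the LT of a root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str
    birth_pl: float            # PL at which the cluster was created
    death_pl: Optional[float]  # PL at which it merged into its parent; None for a root


def lifetime(cluster: Cluster) -> Optional[float]:
    """LT = PL at termination minus PL at creation; None if never merged."""
    if cluster.death_pl is None:
        return None
    return cluster.death_pl - cluster.birth_pl


# Hypothetical cluster created at PL 40.0 and merged away at PL 42.5.
example = Cluster("toy", birth_pl=40.0, death_pl=42.5)
print(lifetime(example))  # 2.5
```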

Major annotation sources | Type | Coverage SWP (%) | Coverage ProtoDB (%) | Amount of annotations
ENZYME (6/09) | Func | 44 | 11 | 5,190
SMART (6.0) | DOM+Fam | 19 | 14 | 720
GENE3D (6.1.0) | DOM+Fam | 37 | 27 | 1,024
CATH (3.2.0) | Str | 37 | 27 | 3,338
SCOP (1.75) | Str | 47 | 35 | 7,821
GO (1.7) | Func | 93 | 52 | 27,050
UniProtKB (15.4) | Func | 99 | 63 | 949
PFAM (23.0) | DOM+Fam | 92 | 73 | 10,640
InterPro (21.0) | DOM+Fam | 95 | 77 | 18,638
NCBI Taxonomy (6/09) | Tax | 100 | 100 | 442,867

Table S2: The annotation resources in ProtoDB cover functional annotations (Func), domains (Dom) and families (Fam), structure (Str), taxonomy (Tax) and more.

4. Integration of annotations

The UniProt Keywords are a list of general functional terms. These keywords are based on information contained in SwissProt, TrEMBL, and PIR. InterPro [4] is a meta-annotation resource that combines 15 of the most widely used domain and family databases. We kept the original collection of databases that make up InterPro. The major resources that are combined in the InterPro annotation scheme are:

PROSITE, a database of protein families and domains. It is manually curated and used as a benchmark for false positive and false negative assignments.
Pfam [5], a large collection of multiple sequence alignments and hidden Markov models (HMMs) covering most protein domains and families. Pfam was used as a source of high-quality HMMs for repeats, domains and families.
PRINTS [6], a collection of protein fingerprints that characterize a protein family.
ProDom [7], a protein domain database that consists of an automatic compilation of homologous domains from a recursive PSI-BLAST search protocol.
SMART [8], which provides annotation for domains and their architectures.
TIGRFAM [9], a collection of protein families based on curated multiple sequence alignments and HMMs.
SUPERFAMILY [10], a library of profile HMMs that represent all proteins of known structure. The library is based on the SCOP classification [11].
Gene3D [12], which describes protein families and domain architectures in complete genomes. Gene3D unifies the HMM libraries of CATH [13] and Pfam domains.
PANTHER [14], a large collection of protein families that have been manually divided into subfamilies. HMMs are built for each family and subfamily.

Domain based annotations | # of annotated proteins | # of mouse proteins
PRODOM domain | 199,830 | 689
TIGRFAMs domain | 1,806,403 | 3,098
PIRSF domain | 494,833 | 3,483
PRINTS domain | 1,516,755 | 15,198
SSF domain | 3,804,487 | 22,605
SMART domain | 1,665,306 | 24,656
PROSITE domain | 4,836,740 | 33,444
PFAM domain | 8,945,725 | 46,467
Taxonomy species | 9,063,896 | 60,605
Total w/o Taxonomy | 194,312,350 | 602,836

Table S3. Summary of major domain based annotations associated with the mouse proteome.

5. Keyword correspondence scores

In order to measure the correspondence between a given cluster C and a specific annotation K, we define the notion of a correspondence score (CS). The CS for a certain cluster and a given keyword measures the correlation between the cluster and the keyword, using the well-known intersect-union ratio:

$$ CS(C, K) = \frac{|c \cap k|}{|c \cup k|} = \frac{TP}{TP + FP + FN} $$

where c is the set of annotated proteins in cluster C, k is the set of proteins annotated with K, and TP, FP, FN stand for true positives, false positives, and false negatives, respectively. TP = the number of proteins in cluster C that have keyword annotation K, FP = the number of annotated proteins in cluster C that do not have keyword annotation K, FN = the number of proteins not in cluster C that have keyword annotation K. The cluster receiving the maximal score for keyword K is considered the cluster that best represents K within the ProtoNet tree. The score for a given cluster on keyword K ranges from 0 (no correspondence) to 1 (the cluster contains exactly all of the proteins with keyword K, i.e., maximally corresponds to the keyword). For a formal definition see [15].
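The intersect-union ratio defined above can be computed directly on sets of protein identifiers. The following is a minimal sketch rather than the ProtoNet implementation; the function name and the toy identifiers are hypothetical, and returning 0.0 when both sets are empty is a convention chosen here.

```python
def correspondence_score(cluster: set, keyword: set) -> float:
    """CS = TP / (TP + FP + FN) for the annotated proteins of a cluster
    and the set of proteins carrying a keyword."""
    tp = len(cluster & keyword)   # in the cluster and annotated with K
    fp = len(cluster - keyword)   # in the cluster, not annotated with K
    fn = len(keyword - cluster)   # annotated with K, outside the cluster
    denominator = tp + fp + fn    # equals the size of the union
    return tp / denominator if denominator else 0.0


cluster_proteins = {"p1", "p2", "p3", "p4"}
keyword_proteins = {"p2", "p3", "p4", "p5", "p6"}
print(correspondence_score(cluster_proteins, keyword_proteins))  # 0.5
```

Here TP = 3, FP = 1 and FN = 2, so the score is 3 / 6.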

For annotation keywords from several external sources, we define the cluster with the best CS for each keyword as the best cluster for this keyword. The sources used for defining the best clusters as well as their CS are: InterPro (families, domains and others), Pfam, SCOP (fold, superfamily, family and domain levels), GO (in 3 categories: Molecular function, Cellular process and Cellular localization) and ENZYME (4 levels of the EC hierarchy).
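Selecting the best cluster for each keyword amounts to taking the maximum CS over all clusters. The sketch below assumes clusters and keywords are held as in-memory sets, which is a simplification for illustration; the function name, the toy data and the tie-breaking rule (the larger cluster identifier wins) are choices made here, not taken from ProtoNet.

```python
def best_clusters(clusters: dict, keywords: dict) -> dict:
    """Map each keyword to (cluster_id, CS) of its highest-scoring cluster."""
    best = {}
    for keyword, annotated in keywords.items():
        scored = []
        for cluster_id, members in clusters.items():
            union = members | annotated
            # Intersect-union ratio; 0.0 if both sets are empty.
            cs = len(members & annotated) / len(union) if union else 0.0
            scored.append((cs, cluster_id))
        if scored:
            cs, cluster_id = max(scored)  # ties resolved by cluster id
            best[keyword] = (cluster_id, cs)
    return best


toy_clusters = {
    "A1": {"p1", "p2"},
    "A2": {"p1", "p2", "p3", "p4"},
    "A3": {"p5"},
}
toy_keywords = {"kw": {"p1", "p2", "p3"}}
print(best_clusters(toy_clusters, toy_keywords))  # {'kw': ('A2', 0.75)}
```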

6. Cluster stability, pruning protocol and expanded clusters

The term assigned to our measure of the inherent stability of a cluster is its Life Time (LT). The LT is the difference between the time (i.e., merging steps) at which a cluster was created and the time at which it is merged into a larger cluster. This value reflects the relative height of a cluster in the merging tree (Figure 1B). The level of the tree (ProtoLevel, PL) is an additional internal monotonic timer for merging along the clustering process. The LT and PL are combined for the purpose of tree pruning. Pruning the ProtoNet 6.0 tree at LT and PL thresholds of 1.0 and 90, respectively, resulted in 162,088 high quality stable clusters. The pruning protocol yielded a 30-fold compression from the original 5 million clusters (including leaves) generated prior to pruning.
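A stability filter of this kind can be sketched as follows. Only the LT threshold is applied here, because the text does not spell out how the PL threshold of 90 is combined with it. The dictionary layout, function names and toy values are hypothetical. The compression figure uses the rounded counts quoted above.

```python
def stable_clusters(clusters: list, min_lt: float = 1.0) -> list:
    """Keep clusters whose lifetime is at least min_lt."""
    return [c for c in clusters if c["lt"] >= min_lt]


def compression(original_count: int, kept_count: int) -> float:
    """Fold reduction in the number of clusters after pruning."""
    return original_count / kept_count


toy = [
    {"id": "c1", "lt": 0.2},
    {"id": "c2", "lt": 1.0},
    {"id": "c3", "lt": 3.5},
]
print([c["id"] for c in stable_clusters(toy)])     # ['c2', 'c3']

# About 5 million clusters before pruning, 162,088 after.
print(round(compression(5_000_000, 162_088), 1))   # 30.8
```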

The ProtoNet tree is constructed using the representative proteins from UniRef50. UniRef50 comprises about 2.5 million representative proteins with the property that every protein has a >50% overlap with at least one representative protein. For each cluster, the expanded list of proteins of the complete UniProtKB is provided. On average, there is a 4.5-fold expansion from UniRef50 to the full UniProtKB list. Thus, the 10^7 proteins represent the expanded view of ProtoNet 6.0 [3].

7. Performance by external experts

ProtoNet 6.0 provides a nested tree with about 150,000 stable clusters. Since ProtoNet is an unsupervised platform, we assess the quality of the clustering process. At a ProtoLevel of >40, a drop in the number of clusters (containing at least two proteins) along the progression of the ProtoNet tree reflects the merging of pre-existing clusters and the establishment of larger clusters. Testing the quality of the mergers throughout the clustering protocol is based on the keyword Correspondence Score (CS).

The quality of the stable clusters (at a PL>1.0) is illustrated by testing the CS for all the families and domains from Pfam (12,000) and InterPro (18,000). The integrated version of InterPro covers about 80% of the proteins. About 2/3 of these keywords are included in the analysis using a minimal cutoff of ≥20 proteins in each cluster.

Fig. S2 shows that the quality as measured by the CS is stable throughout the entire range of the ProtoLevel / Birth time of the ProtoNet tree. This trend is valid for Pfam and InterPro. The average quality of Pfam clusters as measured by CS is 0.91 (Fig. 1D) and the CS values for InterPro average 0.8.

Fig. S2. Quality of ProtoNet clusters (≥20 proteins) according to InterPro families and domains. The dashed white line shows CS=0.5. Note that throughout the Birth time range (0-1.0), the CS remains at 0.8. Hence, the high quality of the ProtoNet tree in view of InterPro is valid for all the levels of the tree.

8. Genomic perspective on the protein clusters

The number of organisms covered by UniProtKB is huge (Table S2). Nevertheless, a third of the protein sequences originate from a relatively small number of organisms that were completely sequenced (annotated as a complete proteome). Multi-cellular organisms that serve as genetic model organisms are included in the genome view. ProtoNet 6.0 supports over 30 organisms from all superkingdoms. Specifically, D. melanogaster, C. elegans, human, honeybee, mouse and more are organized in a taxonomy tree. Selecting any node from the organisms' tree returns the clusters that include proteins from the selected taxonomical level.

Annotation type | # unique proteins in DB | # of proteins in mouse
EC - x | 1,025,565 | 4,170
EC - x.x | 1,016,337 | 4,082
EC - x.x.x | 1,001,790 | 3,985
EC - x.x.x.x | 893,537 | 3,025
Tax species | 9,063,869 | 60,605
Tax genus | 8,799,481 | 60,605
Tax family | 8,408,876 | 60,605
Tax order | 8,047,843 | 60,605
Tax class | 7,414,051 | 60,605
Tax phylum | 7,981,148 | 60,605
Tax kingdom | 2,380,779 | 60,605
Tax superkingdom | 8,272,660 | 60,605
SMART domain | 1,306,672 | 24,656
PRINTS domain | 1,321,388 | 15,198
PFAM domain | 6,582,986 | 46,467
PROSITE domain | 3,165,669 | 33,444
PRODOM domain | 190,503 | 689
TIGRFAMs domain | 1,657,561 | 3,098
PIRSF domain | 494,833 | 3,483
SSF domain | 3,137,662 | 22,605
PANTHER domain | 1,445,866 | 16,112
GENE3D domain | 2,434,880 | 19,293
PFAM CLANS domain | 3,438,771 | 27,588
InterPro Domain | 4,886,673 | 35,843
InterPro Family | 3,905,328 | 23,414
InterPro Conserved_site | 828,822 | 9,314
InterPro Active_site | 257,406 | 2,496
InterPro Repeat | 220,350 | 4,452
InterPro Region | 204,357 | 4,993
InterPro Binding_site | 170,864 | 2,480
InterPro PTM | 48,201 | 623
UniProt keyword | 5,667,975 | 33,365
GO molecular function | 4,135,974 | 40,941
GO biological process | 3,516,315 | 35,457
GO cellular component | 2,333,260 | 36,693
CATH class | 2,432,181 | 19,280
CATH architecture | 2,432,181 | 19,280
CATH topology | 2,430,414 | 19,278
CATH homologous superfamily | 2,424,415 | 19,238
SCOP class | 3,135,943 | 22,598
SCOP fold | 3,135,943 | 22,598
SCOP superfamily | 3,135,943 | 22,598
Total | 143,849,828 | 1,148,281
Total w/o Taxonomy | 74,416,565 | 602,836
Avg. / protein | 82 | 9.2

Table S4. Annotation types associated with all proteins (an expanded set, >9 million proteins).

9. Navigation in the ProtoNet family tree

The overall summary of the ProtoNet clusters is beyond the scope of this correspondence letter. A list of the proteins that are analyzed in the example clusters is shown.

Table S5. ProtoNet clusters with maximal Correspondence Score (CS) values for thousands of annotations are coined 'Best clusters'. Several filters were applied to present clusters with high confidence for annotation inference. The filters used include (i) average length ≤300; (ii) cluster size ≥10; (iii) the number of proteins that are subject to inference >0; (iv) the fraction of proteins with an inferred annotation ≤50%. There are 2,069 annotations that match the combination of the selected filters. These annotations are from a variety of sources that are supported by ProtoNet (Fig. 2A).

Keyword Type | Number of known proteins | Number of predicted proteins
CATH homologous superfamily | 4,818 | 717
CATH topology | 3,155 | 453
EC - all levels | 2,224 | 713
GO biological process | 6,375 | 2,000
GO cellular component | 1,708 | 504
GO molecular function | 5,639 | 1,815
InterPro Domain | 16,682 | 2,827
InterPro Family | 36,287 | 4,947
PFAM domain | 58,479 | 6,941
SCOP superfamily | 5,647 | 395
UniProt keyword | 3,586 | 1,266
Total | 144,600 | 22,578

Total number of clusters: 2,069
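The four filters listed in the Table S5 caption can be expressed as a single predicate. The sketch below is an illustration under stated assumptions: the `BestCluster` record and its field names are hypothetical, and the fraction in filter (iv) is taken to be the number of proteins receiving an inferred annotation divided by the cluster size, which the caption does not define explicitly.

```python
from dataclasses import dataclass


@dataclass
class BestCluster:
    cluster_id: str
    avg_length: float  # average protein length in the cluster
    size: int          # number of proteins in the cluster
    n_inferred: int    # proteins that would receive the annotation by inference


def passes_filters(c: BestCluster) -> bool:
    """Apply filters (i)-(iv) from the Table S5 caption."""
    if c.avg_length > 300:   # (i) average length <= 300
        return False
    if c.size < 10:          # (ii) cluster size >= 10
        return False
    if c.n_inferred <= 0:    # (iii) at least one protein subject to inference
        return False
    # (iv) at most half of the cluster is annotated by inference (assumed definition)
    return c.n_inferred / c.size <= 0.5


candidates = [
    BestCluster("toy1", avg_length=110.0, size=40, n_inferred=8),
    BestCluster("toy2", avg_length=450.0, size=40, n_inferred=8),
    BestCluster("toy3", avg_length=110.0, size=12, n_inferred=9),
]
print([c.cluster_id for c in candidates if passes_filters(c)])  # ['toy1']
```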

Table S6. The 2,069 annotations of the 'Best clusters' are associated with 1,082 unique clusters. A detailed list of the clusters and their related attributes (number of proteins, false positives, false negatives, CS values, fraction of inference, ProtoLevel and more) is shown. The numbers of proteins that are listed in the table are according to UniRef50 representatives. On average, the expanded protein list is 7.8-fold larger. The average CS for all the Best clusters is very high (0.89).

Table S7. UniProtKB accession numbers of proteins from cluster A4768953 (98.5% from bacteria). The cluster is created at PL 96.9 (and LT≥1.0). There are 121 UniRef50 representatives that account for 371 sequences in ProtoNet 6.0.

A0AY61_BURCH, A0AYP1_BURCH, A0B523_BURCH, A0KDK2_BURCH, A0YUS0_9CYAN, A0YXM6_9CYAN, A0ZIK3_NODSP, A1AQR7_PELPD, A1ASA5_PELPD, A1ATP6_PELPD, A1IQ44_NEIMA, A1KS43_NEIMF, A1U426_MARAV, A1URJ7_BARBK, A1VDE5_DESVV, A1VDJ0_DESVV, A1VID2_POLNA, A1VUI2_POLNA, A1VUU8_POLNA, A1VX41_POLNA, A1WA18_ACISJ, A1WDJ6_ACISJ, A1WN69_VEREI, A1WN70_VEREI, A2FQI0_TRIVA, A2S771_BURM9, A2W067_9BURK, A2W0N7_9BURK, A2WHL8_9BURK, A3ETU0_9BACT, A3EUI8_9BACT, A3EUT8_9BACT, A3EUW0_9BACT, A3IMU5_9CHRO, A3ITS1_9CHRO, A3IVA6_9CHRO, A3JL43_9ALTE, A3MPS7_BURM7, A3N343_ACTP2, A3NQ10_BURP0, A3RSU1_RALSO, A3RVF0_RALSO, A3T2F0_9RHOB, A3T2P0_9RHOB, A3U3I6_9RHOB, A3VJ15_9RHOB, A3XE30_9RHOB, A3YBX4_9GAMM, A3YXB9_9SYNE, A3YYN4_9SYNE, A4BMI2_9GAMM, A4BMU0_9GAMM, A4G757_HERAR, A4JLJ4_BURVG, A4JWV1_9CAUD, A4MHG7_BURPS, A4NIH0_HAEIN, A4P1H1_HAEIN, A4SDW1_PROVI, A4SDW9_PROVI, A4SMF9_AERS4, A4TNW4_YERPP, A4W5D4_ENT38, A4YYF9_BRASO, A5EB55_BRASB, A5EWA6_DICNV, A5G8E7_GEOUR, A5THN4_BURMA, A5VYQ8_PSEP1, A5VZX2_PSEP1, A5ZQ34_9FIRM, A6BTF5_YERPE, A6GBU6_9DELT, A6GLV3_9BURK, A6GSB6_9BURK, A6VNM7_ACTSZ, A6VRK4_MARMS, A6VWS7_MARMS, A6W1R6_MARMS, A7BLG0_9GAMM, A7BNL4_9GAMM, A7C8Q0_BURPI, A7DVA4_VIBVU, A7HV43_PARL1, A7J7Y1_PBCVF, A7JT83_PASHA, A7JWP4_PASHA, A7JWT6_PASHA, A7K8S7_9PHYC, A7MAH2_PSEAE, A7ZDD6_CAMC1, A8F2Y5_RICM5, A8GU01_RICRS, A8GV95_RICB8, A8YEU1_MICAE, A8YFI8_MICAE, A8ZL86_ACAM1, A8ZNL1_ACAM1, A8ZRU4_DESOH, A8ZUV6_DESOH, A8ZZ23_DESOH, A9AP44_BURM1, A9BR82_DELAS, A9BSW9_DELAS, A9BUI5_DELAS, A9BZS8_DELAS, A9EAX3_9RHOB, A9INC9_BART1, A9IPK2_BART1, A9ITS5_BART1, A9IY58_BART1, A9K0S7_BURMA, A9KH51_COXBN, A9L145_SHEB9, A9M1H7_NEIM0, A9MEC2_BRUC2, A9MYW2_SALPB, A9QS52_PSESX, A9R2C7_YERPG, A9VY58_METEP, A9WY66_BRUSI, A9ZA21_YERPE, A9ZV04_YERPE, B0BH72_9BACT, B0BSH4_ACTPJ, B0BVJ3_RICRO,
B0C1R5_ACAM1, B0C9N6_ACAM1, B0CDJ2_ACAM1, B0CF50_ACAM1, B0GF09_YERPE, B0GPY3_YERPE, B0H3V1_YERPE, B0HEH9_YERPE, B0HZ49_YERPE, B0J3J2_RHILT, B0JHD9_MICAN, B0JHE9_MICAN, B0JNS5_MICAN, B0JUA6_MICAN, B0JUV0_MICAN, B0KS04_PSEPG, B0QVD2_HAEPR, B0QX65_HAEPR, B0SXW2_CAUSK, B0USW2_HAES2, B0UW55_HAES2, B1FAH8_9BURK, B1FFX5_9BURK, B1GB48_9BURK, B1K4X1_BURCC, B1K610_BURCC, B1MA45_METRJ, B1SYC7_9BURK, B1T955_9BURK, B1VJ81_PROMH, B1YZV0_BURA4, B1Z0V9_BURA4, B1ZAK5_METPB, B1ZHK8_METPB, B1ZJU5_METPB, B1ZLX6_METPB, B2C6T4_ACIBA, B2FDC5_RHIME, B2FJ08_STRMK, B2GZV1_BURPS, B2I5N4_XYLF2, B2IAQ2_XYLF2, B2TE06_BURPP, B2TSF9_SHIB3, B2U639_ECOLX, B2UHE0_RALPJ, B2VAY8_ERWT9, B3DQN4_BIFLD, B3E764_GEOLS, B3ECU4_CHLL2, B3EMQ3_CHLPB, B3EMQ6_CHLPB, B3GYY1_ACTP7, B3HFC0_ECOLX, B3ISX8_ECOLX, B3JYH8_9DELT, B3PFX8_CELJU, B3PFX9_CELJU, B3R4X4_CUPTR, B3R659_CUPTR, B3X774_SHIDY, B4BWK0_9CHRO, B4EL47_BURCJ, B4EML7_BURCJ, B4F1A0_PROMH, B4RNY5_NEIG2, B4SE49_PELPB, B4SF38_PELPB, B4ST99_STRM5, B4VPB9_9CYAN, B4WDE7_9CAUL, B5C0W3_SALET, B5EJ13_GEOBB, B5ERP5_ACIF5, B5FD56_VIBFM, B5FHK8_SALDC, B5N208_SALET, B5PE14_SALET, B5PPY4_SALHA, B5RZ11_RALSO, B5SLX0_RALSO, B5WP46_9BURK, B5XNN8_KLEP3, B6AKX5_9BACT, B6ANJ3_9BACT, B6C3Q3_9GAMM, B6C438_9GAMM, B6IYD2_RHOCS, B6X043_9ENTR, B7GQ85_BIFLI, B7JAX0_ACIF2, B7KPZ4_METC4, B7KSA8_METC4, B7LJE8_ECOLU, B7YG92_VARPD, B7YTG6_VARPD, B8F4N1_HAEPS, B8F7F6_HAEPS, B8GSA3_THISH, B8GTV7_THISH, B8GUV3_THISH, B8H360_CAUCN, B8ITM9_METNO, B8KUY3_9GAMM, B8LA05_9GAMM, B8LA19_9GAMM, B9B269_9BURK, B9BKG6_9BURK, B9C4H2_9BURK, B9DAL1_9GAMM, B9JFK9_AGRRK, B9JPN3_AGRRK, B9K3H8_AGRVS, B9NWT0_9RHOB, C0BSI9_9BIFI, C0G9V0_9RHIZ, C0GTU2_9DELT, C0GUR9_9DELT, C0Q4S6_SALPC, C0QI01_DESAH, C0VCW4_9MICO, C0YUD1_9FLAO, C1ADF9_9BACT, C1DE86_AZOVD, C1DR99_AZOVD, C1HVF5_NEIGO, C1SPK8_9BACT, C1T3N5_DESBA, C1XQ07_9DEIN, C2A215_SULDE, C2BV60_9ACTO, C2CUD4_GARVA, C2CUJ3_GARVA, C2KQK6_9ACTO, C2KR20_9ACTO, C2LIN4_PROMI, C3K003_PSEFL, C3KFQ9_PSEFL, C3WDL4_FUSMR, Q01YS7_SOLUE, Q02RQ2_PSEAB, Q05HP1_XANOR,
Q07IW4_RHOP5, Q0AF37_NITEC, Q0B469_BURCM, Q0B4P3_BURCM, Q0CZA0_ASPTN, Q0KBP5_RALEH, Q11N66_MESSB, Q131N1_RHOPS, Q138H7_RHOPS, Q13FU3_BURXL, Q13LE6_BURXL, Q1BLP1_BURCA, Q1BW04_BURCA, Q1CAI7_YERPA, Q1CFJ4_YERPN, Q1I6T7_PSEE4, Q1IB13_PSEE4, Q1IU34_ACIBL, Q1NJK5_9DELT, Q1NP18_9DELT, Q1NR36_9DELT, Q1QF67_NITHX, Q1QFT4_NITHX, Q1RHK4_RICBR, Q1XGP3_PSEPU, Q214U9_RHOPB, Q21ZJ2_RHOFD, Q2ISG9_RHOP2, Q2IT41_RHOP2, Q2IUV2_RHOP2, Q2RPR8_RHORT, Q2RUT9_RHORT, Q2RX84_RHORT, Q2SIZ8_HAHCH, Q2TR62_ACIBA, Q2W3Q7_MAGMM, Q315U5_DESDG, Q392I8_BURS3, Q393E1_BURS3, Q39SL9_GEOMG, Q3AR07_CHLCH, Q3B1U1_PELLD, Q3BTL9_XANC5, Q3JDA3_NITOC, Q3JDK7_NITOC, Q3JDX3_NITOC, Q3KEQ6_PSEPF, Q3KIR2_PSEPF, Q3KJ92_PSEPF, Q3QZ61_XYLFA, Q3R7W4_XYLFA, Q3RFI8_XYLFA, Q48B36_PSE14, Q4BUP6_CROWT, Q4BUQ5_CROWT, Q4BVR2_CROWT, Q4KIY9_PSEF5, Q4UJX2_RICFE, Q5NZP9_AZOSE, Q5QWX9_IDILO, Q5X039_LEGPL, Q63LE9_BURPS, Q63YL3_BURPS, Q6ALJ0_DESPS, Q6G4P2_BARHE, Q6J5I5_HAEIN, Q6MCI0_PARUW, Q6N367_RHOPA, Q6W4R6_VIBAN, Q6ZEG0_SYNY3, Q70W78_YEREN, Q74CX9_GEOSL, Q7CH33_YERPE, Q7MBL1_VIBVY, Q7N237_PHOLL, Q7N3K9_PHOLL, Q7N4L9_PHOLL, Q7N6M0_PHOLL, Q7N7P7_PHOLL, Q7N7R2_PHOLL, Q7N8S2_PHOLL, Q7X1K1_9BACT, Q820L9_NITEU, Q840E6_9GAMM, Q847G4_PSEPU, Q879U0_XYLFT, Q87CA7_XYLFT, Q87UD0_PSESM, Q88NG8_PSEPK, Q88PM1_PSEPK, Q89KB2_BRAJA, Q8FLW1_COREF, Q8G4Q3_BIFLO, Q8VMN8_PSEPU, Q8XTN0_RALSO, Q8YUE9_ANASP, Q926N9_LISIN, Q9A407_CAUCR, Q9P9V8_XYLFA, Q9PCS6_XYLFA, Q9PHG2_XYLFA, Q9XAX6_PSEAC, Y1420_HAEIN

Fig. S3. A summary page for ProtoNet clusters A4654740 (98 proteins) and A4768953 (112 proteins, Fig. 2B). Statistical details for each cluster are shown in a table format. PL=100 reflects a root cluster. Cluster A4654740 was assigned as 'Best cluster' for the InterPro keyword 'Addiction module antidote protein, HI1420' (CS=1.0), and thus this keyword is assigned as its ProtoName. In the ProtoNet database, there are 53 UniRef50 representative proteins that are annotated by this keyword, all of which are included in this cluster.

Table S8. ProtoNet 6.1 view of expanded cluster A4768953. The 121 representative UniRef50 sequences are expanded to 1,022 sequences according to ProtoNet 6.1.

Length | Size | Cluster name | Cluster ID
150 | 1 | Putative uncharacterized protein | UniRef50_A0YUS0
104 | 2 | Putative uncharacterized protein | UniRef50_A0YXM6
100 | 1 | Putative transcriptional regulator | UniRef50_A1AQR7
68 | 1 | Putative transcriptional regulator | UniRef50_A1VX41
67 | 1 | Putative uncharacterized protein | UniRef50_A2FQI0
106 | 1 | Putative uncharacterized protein | UniRef50_A3IMU5
98 | 7 | Putative uncharacterized protein | UniRef50_A3T2P0
111 | 7 | Putative uncharacterized protein | UniRef50_A4BMI2
69 | 2 | Putative uncharacterized protein | UniRef50_A4G757
148 | 8 | Fimbrial protein pilin | UniRef50_A4YAZ9
133 | 2 | Putative uncharacterized protein | UniRef50_A5EB55
43 | 1 | Putative uncharacterized protein | UniRef50_A5EWA6
121 | 1 | Uncharacterized protein | UniRef50_A5ZQ34
125 | 2 | Putative uncharacterized protein | UniRef50_A6GBU6
94 | 1 | Putative transcriptional regulator | UniRef50_A6VWS7
94 | 4 | Putative transcriptional regulator | UniRef50_A6W1R6
92 | 1 | Transcriptional regulator | UniRef50_A7BLG0
44 | 1 | Putative uncharacterized protein | UniRef50_A7BNL4
98 | 4 | Transcriptional regulator, Cro/CI family | UniRef50_A7DVA4
107 | 4 | Putative transcriptional regulator | UniRef50_A7HV43
70 | 1 | Putative uncharacterized protein n627R | UniRef50_A7J7Y1
107 | 14 | Possible transcriptional regulator | UniRef50_A7JT83
90 | 1 | Putative uncharacterized protein z317L | UniRef50_A7K8S7
76 | 2 | Transcriptional regulator | UniRef50_A8F2Y5
106 | 3 | Putative uncharacterized protein | UniRef50_A8ZNL1
92 | 1 | Putative transcriptional regulator | UniRef50_A8ZRU4
99 | 1 | Putative transcriptional regulator | UniRef50_A8ZUV6
63 | 1 | Putative uncharacterized protein | UniRef50_A9BR82
79 | 1 | Putative uncharacterized protein | UniRef50_A9KH51
252 | 1 | Putative uncharacterized protein | UniRef50_A9L145
97 | 5 | Addiction module antitoxin, putative | UniRef50_B0C1R5
74 | 1 | Putative uncharacterized protein | UniRef50_B0CDJ2
101 | 1 | Putative uncharacterized protein | UniRef50_B0JHD9
275 | 1 | Putative transcriptional regulator | UniRef50_B0SXW2
89 | 1 | Putative transcriptional regulator | UniRef50_B1ZAK5
81 | 1 | Putative partial transcriptional regulator protein | UniRef50_B2FDC5
118 | 25 | DNA-binding prophage protein | UniRef50_B3X774
106 | 1 | Putative uncharacterized protein | UniRef50_B4VPB9
102 | 1 | Probable addiction module antidote protein | UniRef50_B4WDE7
96 | 5 | Putative uncharacterized protein | UniRef50_B5PPY4
106 | 1 | Putative uncharacterized protein | UniRef50_B6IYD2
108 | 3 | Putative transcriptional regulator, antitoxin protein higA | UniRef50_B8H360
62 | 1 | Putative transcriptional regulator | UniRef50_B8ITM9
88 | 5 | Transcriptional regulator | UniRef50_B9JFK9
40 | 1 | Putative uncharacterized protein | UniRef50_B9JPN3
78 | 1 | Putative uncharacterized protein | UniRef50_C0BSI9
97 | 7 | Uncharacterized protein | UniRef50_C1ADF9
97 | 5 | Possible transcriptional regulator | UniRef50_C2KR20
70 | 1 | Putative uncharacterized protein | UniRef50_C3WDL4
97 | 5 | Addiction module antidote protein | UniRef50_C5D1V6
107 | 14 | Addiction module antidote protein | UniRef50_C6AZF6
100 | 3 | Addiction module antidote protein | UniRef50_C6DYU8
86 | 19 | Predicted protein | UniRef50_C9T0K8
67 | 2 | Putative uncharacterized protein | UniRef50_D0SF18
99 | 6 | Addiction module antidote protein | UniRef50_D1BRK1
52 | 2 | Putative uncharacterized protein | UniRef50_D2UFF5
116 | 48 | Addiction module antidote protein | UniRef50_D6SJR1
105 | 8 | Toxin-antitoxin system, antitoxin component, ribbon-helix-helix domain protein | UniRef50_D8F2I0
182 | 30 | XRE family transcriptional regulator | UniRef50_E0NB55
105 | 5 | Predicted transcriptional regulator | UniRef50_E1VPC2
131 | 3 | Toxin-antitoxin system, antitoxin component, Xre family | UniRef50_E2STQ7
129 | 2 | Transcriptional regulator | UniRef50_E2XVJ2
106 | 2 | Transcriptional regulator 3 | UniRef50_E3HHX7
102 | 7 | Uncharacterized protein | UniRef50_E6PN15
101 | 27 | Uncharacterized protein | UniRef50_E6QCN5
105 | 13 | Uncharacterized protein | UniRef50_E6QCQ8
94 | 3 | Addiction module antidote protein | UniRef50_E6X9K7
106 | 36 | Addiction module antidote protein | UniRef50_E8X7N4
121 | 11 | Uncharacterized protein | UniRef50_F3INW7
100 | 4 | Addiction module antidote protein | UniRef50_F3L0G0
84 | 2 | Putative uncharacterized protein | UniRef50_F6AX15
196 | 35 | Addiction module antidote protein | UniRef50_F6BME8
108 | 32 | Transcriptional regulator | UniRef50_F7ND47
100 | 4 | Addiction module antidote protein | UniRef50_F8GGJ2
100 | 4 | Uncharacterized protein HI_1420 | UniRef50_F8LFL8
108 | 11 | Addiction module antidote protein | UniRef50_G4E1Q1
101 | 13 | Addiction module antidote protein | UniRef50_G6XQZ4
125 | 2 | Putative uncharacterized protein | UniRef50_G7CR59
109 | 13 | Addiction module antidote protein | UniRef50_G7HEF7
164 | 12 | Transciptional regulator | UniRef50_G8Q9N4
99 | 5 | Addiction module antidote protein | UniRef50_H0HXW3
97 | 11 | Addiction module antidote protein | UniRef50_H1KHC6
98 | 16 | Addiction module antidote protein | UniRef50_H1LAV3
390 | 5 | DNA replication and repair protein RecF | UniRef50_H1NMJ8
174 | 116 | Thiol-disulfide isomerase-like thioredoxin | UniRef50_H5WLK2
103 | 34 | Predicted transcriptional regulator | UniRef50_H6SIS2
106 | 6 | Uncharacterized protein | UniRef50_I0INU3
118 | 20 | Addiction module antidote protein | UniRef50_I1WF00
106 | 13 | Toxin-antitoxin system, antitoxin component, Xre family | UniRef50_I2Y9I5
108 | 18 | Uncharacterized protein | UniRef50_I4FCC2
120 | 9 | Uncharacterized protein | UniRef50_I4FRG9
107 | 10 | Uncharacterized protein | UniRef50_I4XN50
104 | 84 | Uncharacterized protein | UniRef50_I6HTI1
106 | 16 | Uncharacterized protein | UniRef50_J0QIC2
94 | 10 | Uncharacterized protein | UniRef50_J1A1Y8
126 | 37 | Uncharacterized protein | UniRef50_J1P6H9
103 | 4 | Uncharacterized protein | UniRef50_J2P562
97 | 13 | Uncharacterized protein HI_1420 | UniRef50_P44191
104 | 1 | Putative uncharacterized protein | UniRef50_Q01YS7
95 | 1 | Putative transcriptional regulator | UniRef50_Q0AF37
187 | 1 | Predicted protein | UniRef50_Q0CZA0
177 | 2 | Putative uncharacterized protein | UniRef50_Q11N66
305 | 2 | Transcriptional regulator-like | UniRef50_Q138H7
221 | 1 | Putative uncharacterized protein | UniRef50_Q1IB13
107 | 2 | Putative uncharacterized protein | UniRef50_Q1IU34
168 | 1 | Putative uncharacterized protein | UniRef50_Q1QFT4
183 | 1 | Putative uncharacterized protein | UniRef50_Q21ZJ2
81 | 1 | Putative transcriptional regulator | UniRef50_Q2ISG9
185 | 1 | Putative uncharacterized protein | UniRef50_Q2IUV2
78 | 1 | Putative uncharacterized protein | UniRef50_Q2W3Q7
100 | 5 | Addiction module antidote protein | UniRef50_Q315U5
112 | 2 | Putative uncharacterized protein | UniRef50_Q3BTL9
117 | 18 | Transcriptional regulator, Cro/CI family | UniRef50_Q3JDK7
99 | 3 | Similar to transcriptional regulator | UniRef50_Q4BUP6
105 | 18 | Predicted transcriptional regulator | UniRef50_Q4UJX2
86 | 1 | Putative uncharacterized protein | UniRef50_Q5X039
91 | 1 | Putative uncharacterized protein | UniRef50_Q6ALJ0
188 | 1 | Putative uncharacterized protein ORF7 | UniRef50_Q70W78
69 | 1 | Similar to hypothetical protein of Haemophilus influenzae | UniRef50_Q7N237
229 | 3 | Possible phoP; Response regulators consisting of a CheY-like receiver domain and a HTH DNA-binding domain | UniRef50_Q820L7
100 | 2 | Blr4995 protein | UniRef50_Q89KB2
98 | 1 | Asl2401 protein | UniRef50_Q8YUE9
173 | 1 | Pli0013 protein | UniRef50_Q926N9

Table S9. UniProtKB accession numbers of proteins from a root cluster A4741039 (65 proteins, Fig. 2D). Cluster Name: Protein of unknown function DUF148. The cluster includes all 47 proteins of DUF148 (PF02520, for UniRef50) and is thus associated with CS=1.0. An additional 18 proteins in this cluster lack any InterPro / Pfam annotations. Recall that the DUF148 family has no known function, nor do any of the proteins that possess it. Still, all the proteins in the cluster belong to a diverse collection of nematodes (some cause severe human diseases). Inspection of this cluster suggests that these proteins may act as surface antigens and may activate the host immune response. The expanded cluster (UniRef100) comprises 128 sequences.

A0DWN3_PARTE, A1IKL2_ANISI, A1Z6D0_CAEEL, A7M6T4_ANISI, A8NVF8_BRUMA, A8Q0T0_BRUMA, A8Q3S4_BRUMA, A8Q4D7_BRUMA, A8R0N5_BRUPA, A8WY92_CAEBR, A8WYA2_CAEBR, A8X0P8_CAEBR, A8X193_CAEBR, A8X194_CAEBR, A8X6I6_CAEBR, A8XIZ6_CAEBR, A8XLL8_CAEBR, A8XTF6_CAEBR, A8XZS0_CAEBR, A8Y4D4_CAEBR, B0ZBB0_9PELO, B2XCP1_ANISI, B6IH84_CAEBR, B6VBW3_9PELO, O01674_ACAVI, O17021_CAEEL, O17974_CAEEL, O45098_CAEEL, O45347_CAEEL, O76573_CAEEL, OV17_ONCVO, OV39_ONCVO, P91548_CAEEL, Q18280_CAEEL, Q18577_CAEEL, Q19406_CAEEL, Q19414_CAEEL, Q1W204_ANCCA, Q20202_CAEEL, Q20998_CAEEL, Q21588_CAEEL, Q22237_CAEEL, Q22537_CAEEL, Q23545_CAEEL, Q23614_CAEEL, Q23615_CAEEL, Q23683_CAEEL, Q4W5P1_CAEEL, Q5FC55_CAEEL, Q6LA91_CAEEL, Q6RUQ0_MELIC, Q6UY31_HUMAN, Q7YTI8_CAEEL, Q7YTP4_CAEEL, Q7YWN3_CAEEL, Q8MXD0_9BILA, Q8WQ00_OSTOS, Q93122_ASCSU, Q95YM3_NIPBR, Q9GU97_LOALO, Q9NFS0_GLORO, Q9NFU9_GLORO, Q9TZL7_WUCBA, Q9XTN7_CAEEL, YY23_CAEEL

References

1. O. Sasson, A. Vaaknin, H. Fleischer et al., Nucleic Acids Res 31 (1), 348 (2003).
2. C. H. Wu, R. Apweiler, A. Bairoch et al., Nucleic Acids Res 34 (Database issue), D187 (2006).
3. N. Rappoport, S. Karsenty, A. Stern et al., Nucleic Acids Res 40 (Database issue), D313 (2012).
4. S. Hunter, R. Apweiler, T. K. Attwood et al., Nucleic Acids Res 37 (Database issue), D211 (2009).
5. R. D. Finn, J. Tate, J. Mistry et al., Nucleic Acids Res 36 (Database issue), D281 (2008).
6. T. K. Attwood, P. Bradley, D. R. Flower et al., Nucleic Acids Res 31 (1), 400 (2003).
7. F. Corpet, F. Servant, J. Gouzy et al., Nucleic Acids Res 28 (1), 267 (2000).
8. I. Letunic, R. R. Copley, S. Schmidt et al., Nucleic Acids Res 32 (Database issue), D142 (2004).
9. D. H. Haft, J. D. Selengut, and O. White, Nucleic Acids Res 31 (1), 371 (2003).
10. M. Madera, C. Vogel, S. K. Kummerfeld et al., Nucleic Acids Res 32 (Database issue), D235 (2004).
11. A. G. Murzin, S. E. Brenner, T. Hubbard et al., J Mol Biol 247 (4), 536 (1995).
12. F. Pearl, A. Todd, I. Sillitoe et al., Nucleic Acids Res 33 (Database issue), D247 (2005).
13. C. A. Orengo, A. D. Michie, S. Jones et al., Structure 5 (8), 1093 (1997).
14. H. Mi, B. Lazareva-Ulitsky, R. Loo et al., Nucleic Acids Res 33 (Database issue), D284 (2005).
15. N. Kaplan, O. Sasson, U. Inbar et al., Nucleic Acids Res 33 (Database issue), D216 (2005).
