Supporting Information for Mackenzie et al., “Tertiary Alphabet for the Observable Structural Universe”

SI Methods

RMSD Cutoff Derivation

Fundamentally, optimal rigid-body superposition is an instance of linear regression, whose goal is to minimize the sum squared deviation (SSD) between corresponding coordinates of two sets of atoms. Regression quality is typically assessed using the mean squared error (MSE), an estimate of the error variance, computed as $SSD/\nu$, where $\nu$ is the number of degrees of freedom. The RMSD metric used in structural superposition is similar to the square root of MSE, but uses the number of atoms in place of $\nu$. This is statistically justified only when atoms are entirely independent of each other, which is clearly not the case in general. For example, consider comparing two contiguous segments of structure. Residues consecutive in sequence are joined by peptide bonds of relatively fixed geometry, so that the position of each next residue in a chain is highly constrained by the previous residues (e.g., Cα-to-Cα distances between adjacent residues are generally ~3.8 Å). On the other hand, two residues that are either distant in sequence or not part of the same chain more closely resemble independent data points. We thus reasoned that to formulate a systematic RMSD cutoff function, we must consider how the effective number of degrees of freedom changes with structure size and topology. Using the PDB, we found that a typical protein chain “forgets” its direction only after ~20 residues (see section “Persistence Length” below and Fig. S1), suggesting that positions of residues separated by fewer amino acids would be expected to be significantly inter-constrained. Considering residues not required to be part of the same chain as completely independent of each other, we then estimated the total effective number of degrees of freedom in a structural motif as:

$$\nu = N\left[1 - \frac{2}{N(N-1)}\sum_{i=1}^{m}\sum_{j=1}^{n_i-1}\sum_{k=j+1}^{n_i} e^{-(k-j)/L}\right] \tag{S1}$$

where the first sum extends over the $m$ disjoint segments of the motif, $n_i$ is the length of the $i$-th segment, $N$ is the total length of the motif (i.e., $N = \sum_i n_i$), and $L$ is a correlation length parameter emergent from the above observation of positional correlation in protein chains (see section “Effective Number of Degrees of Freedom” below for a derivation). Notice that $\nu$ ranges between zero and $N$. Zero is reached when the correlation length $L$ is very large (infinite), meaning that positions of all residues are highly inter-correlated and no degrees of freedom remain. On the other hand, the value of $N$ is approached either for very short correlation lengths or very long segment lengths. In these cases, each residue is effectively independent of the bulk of other residues, such that (nearly) all degrees of freedom are preserved. Given Eq. S1, we can now establish a “universal” RMSD metric $\sqrt{SSD/\nu}$, in place of the traditional RMSD, $\sqrt{SSD/N}$. Then, if we choose a specific cutoff in terms of our universal metric, $\sigma_{max}$, the corresponding traditional RMSD cutoff for a given TERM $t$ becomes:

$$\sigma_{cut}(t) = \sigma_{max}\sqrt{\frac{\nu(t)}{N(t)}} \tag{S2}$$

This value is no larger than $\sigma_{max}$ itself, being much smaller than $\sigma_{max}$ for small/simple TERMs (i.e., when $\nu \ll N$), and approaching $\sigma_{max}$ for large/complex TERMs (e.g., those with many segments) as $\nu$ approaches $N$. Thus, $\sigma_{max}$ can be thought of as a resolution parameter (i.e., the RMSD cutoff imposed in the limit of large structures), with Eq. S2 computing the RMSD cutoff for a structure of any finite size.

Persistence Length

We sought to determine how quickly the protein chain in a typical folded structure “forgets” its direction. Towards this end, we defined a chain tangent vector at each residue $i$ to be along the largest principal axis of the local backbone window (backbone atoms from residues $i$ through $i + 5$; the sign of the vector was chosen to align with the N-to-C direction of the window, i.e., last minus first backbone atom). Then, for all pairs of residues in a protein, the cosine of the angle between their respective chain tangent vectors, $\cos\theta$, was computed. Having done this for all chains in DB80, we plotted the average $\cos\theta$ as a function of inter-residue separation, shown in Fig. S1. At large separations there should be no relationship between chain tangent vectors (i.e., chain direction is fully randomized after a certain number of amino acids), such that the average cosine should be zero. At short separations, in contrast, chain directions should strongly correlate, resulting in shallow angles and large positive average cosines. Fig. S1 shows this general behavior, except that it also exhibits some oscillations before it converges to zero. These oscillations have to do with a characteristic length of secondary structural elements (SSEs) separated by turns and loops, such that within 15-30 residues a chain is expected to dramatically change direction, often in almost the opposite direction. In fact, the overall curve fits very closely to a product of exponential decay (due to the chain “forgetting” its direction) and a sinusoid (from SSE-related oscillations), as shown in Fig. S1. Specifically, the equation we used was

$$\exp\left(-\frac{d}{L}\right)\cdot\cos\left(d\cdot\frac{2\pi}{T}\right)$$

where $d$ is the separation (in residues), $L$ is the characteristic persistence length, and $T$ is the period of oscillations. Optimally fitting parameters were $T \approx 33$ and $L \approx 10$ residues. Based on this, we chose to use a correlation length parameter for our RMSD cutoff function in the range between 15 and 25 residues (corresponding to a drop-off in the exponential component to ~0.2 and ~0.1, respectively). Specifically, we tested all combinations of correlation lengths 15, 20, and 25 with resolution parameters of 0.9 Å, 1.0 Å, and 1.1 Å. From our initial sequence recovery results, it appeared that a correlation length of 20 with a resolution of 1.0 Å produced the best results, although results from all combinations were similar. We thus chose the RMSD cutoff function based on $L = 20$ and a resolution of 1.0 Å for our final set cover and most other experiments in this study.

Effective Number of Degrees of Freedom

Let us consider a structural query composed of $m$ disjoint segments, with the $i$-th segment of length $n_i$. Our goal is to estimate the number of degrees of freedom in such a structure, $\nu$. Based on the analysis in the section above, we shall assume that two residues in the same chain possess some amount of positional “correlation” that decays exponentially with sequence separation, while residue pairs in different chains are completely independent. The average degree of correlation across all residue pairs in the structure is then:

$$\bar{\rho} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n_i-1}\sum_{k=j+1}^{n_i} e^{-(k-j)/L}}{N(N-1)/2}$$

where $e^{-(k-j)/L}$ is the correlation between positions $j$ and $k$ of the same segment, with $L$ being a correlation length parameter, whereas a correlation value of zero is assumed for residue pairs across different chains (and these are thus skipped in the summation). An aggregate parameter that describes the overall amount of inter-dependence, $\bar{\rho}$ varies from 0 to 1, corresponding to all residues in the structure being either fully independent (i.e., $\nu = N$) or fully dependent (i.e., $\nu = 0$) of each other, respectively. We thus express $\nu$ as $N(1 - \bar{\rho})$ or:

$$\nu = N\left[1 - \frac{2}{N(N-1)}\sum_{i=1}^{m}\sum_{j=1}^{n_i-1}\sum_{k=j+1}^{n_i} e^{-(k-j)/L}\right]$$

When applying this to TERMs, we treated disjoint segments as separate chains. Given this formulation for the number of degrees of freedom, the modified universal RMSD metric follows readily as detailed above. We note that this formulation is entirely empirical. However, we have found the resulting functional form for an RMSD cutoff to work very well in practice, returning match ensembles that generally agree well with our intuition on structural similarity across a large range of motif sizes and complexities (Fig. S2).

TERM Secondary Structure Classification

Secondary-structural (SS) assignments were used to analyze TERM diversity and to generate representative sets of TERMs across different SSE classes. First, STRIDE was used to determine the SS class of each residue (1): helix (including STRIDE’s α-helix, 3-10 helix, and pi-helix classes; H), strand (including STRIDE’s extended and isolated β-bridge classes; E), turn (T), or coil (C). Next, each residue of a TERM segment was assigned an SS class based on the majority of residue assignments in the surrounding three-residue window. If there was no majority, the residue maintained its original class. The SS class of an entire segment was then assigned as a string of concatenated residue SS classes, with consecutive repeats removed. For example, a six-residue segment with two turn residues followed by four helix residues, TTHHHH, would be represented as the string TH. Finally, a TERM’s SS class was represented with the set of its segment classifications. For example, a two-segment TERM with two strand segments would be represented as {E,E}, a two-segment TERM with turn-helix and helix segments as {TH, H}, and a three-segment TERM with a beta sheet and a helix as {E,E,H}. Two TERMs were considered to share a secondary-structure class if they were assigned identical sets.

Metal and Water Binding Motifs

TERMs for which at least $x$% of matches interacted with a metal of interest (i.e., had a non-hydrogen atom within 3 Å of the metal atom) and which had at least ten matches were selected as enriched for metal binding. $x$ was 60% for Ca, Cu, and Na and 50% for Mg and Zn. For each metal, we further ranked TERMs by number of non-redundant matches.

Since water is very common on the surface of proteins in PDB structures, to find TERMs truly enriched for water binding we required that in at least 90% of matches the same pair of positions exhibited close contacts with water (i.e., non-hydrogen atom within 3 Å of the water oxygen), in addition to requiring at least 10 unique matches. These were then ranked by the number of contacts they covered.

Pseudo-energy Calculation for Design

The self pseudo-energy for amino acid $a$ at position $p$ was calculated as:

$$E_s(a, p) = -\ln P(a, p) \tag{S3}$$

where $P(a, p)$ is the propensity of $a$ at the corresponding positions of TERM matches covering $p$. Specifically, if $\Omega_p$ is the set of all TERMs covering position $p$ in the target and the position of TERM $t \in \Omega_p$ aligned with $p$ is $p_t$, then the propensity was:

$$P(a, p) = \frac{\sum_{t \in \Omega_p} w_t(p_t)\, f_t(a, p_t)}{\sum_{t \in \Omega_p} w_t(p_t)} \tag{S4}$$

where $f_t(a, p_t)$ is the relative frequency of amino acid $a$ at position $p_t$ in $t$ and $w_t$ is a weight given to $t$ in conjunction with position $p_t$. We gave higher weights to more complex motifs, as these were expected to describe more of the structural environment around $p$, and to motifs with stronger sequence signals. Specifically, the weighting function was defined as:

$$w_t(p_t) = \max\left[I_t(p_t),\ 0.5\right]\,\nu(t) \tag{S5}$$

where $I_t(p_t)$ is the information content at position $p_t$ across all matches of $t$ (computed with a small-sample correction (2)) and $\nu(t)$ is the effective number of degrees of freedom in $t$ (expression under the square root in Eq. S2), quantifying motif complexity. The max function ensures that all motifs get some say in determining propensities, even those with weak sequence signals. The relative frequency of amino acid $a$ at position $p_t$ of TERM $t$ was defined as:

$$f_t(a, p_t) = \frac{c_t(a, p_t) + \epsilon}{\sum_{b=1}^{20} c_t(b, p_t) + 20\epsilon} \tag{S6}$$

where $c_t(a, p_t)$ is the weighted count of $a$ at position $p_t$ in TERM $t$, and $\epsilon$ is a pseudocount-like parameter (set to 0.1 here). Before computing $c_t(a, p_t)$, we removed redundancy from TERM instances in the same manner as described in the “Non-redundant Instances” section above, but using more stringent parameters for cluster generation with BLASTclust (i.e., 45% sequence identity and 90% coverage) so as to preserve more of the data. Further, we chose to weight TERM instances based on the similarity between their surrounding structural environments (in the context of their originating protein) and the corresponding environment in the target. To this end, we defined an environment metric, $e(p)$, which quantified the structural context around a given residue $p$. This metric varied between 0 and roughly 5, with high values corresponding to positions likely to be buried/crowded and low values typically indicating exposure and lack of significant interference from nearby side-chains (see “Environment Metric” section below). With this, the weight of the $k$-th instance of TERM $t$, with respect to position $p$, was defined as:

$$\omega_t^k(p) = \frac{1}{\left|e(p) - e(p_t^k)\right| + \gamma} \tag{S7}$$

where $\gamma$, set to 0.1 in our study, dampened the weight of matches with environments highly similar to the target one. The weighted count $c_t(a, p_t)$ was then defined as:

$$c_t(a, p_t) = \frac{1}{Z}\sum_{k=1}^{M} \omega_t^k(p)\,\delta\!\left(a_t^k(p_t),\ a\right) \tag{S8}$$

where $M$ is the number of instances of TERM $t$, $a_t^k(p_t)$ is the native amino acid at position $p_t$ of the $k$-th instance of $t$, $\delta\!\left(a_t^k(p_t), a\right)$ is unity when this amino acid is $a$ and zero otherwise, and $Z$ is a normalization factor that ensures $\sum_{a=1}^{20} c_t(a, p_t) = M$.

By analogy to self energies, the interaction pseudo-energy between amino acids $a$ and $b$ at positions $p$ and $q$ in the target, respectively, was:

$$E_p(a, b, p, q) = -\ln Q(a, b, p, q) \tag{S9}$$

where $Q(a, b, p, q)$ is the enrichment of amino acids $a$ and $b$ in TERM positions that simultaneously cover $p$ and $q$, respectively. As with self energies, we combined data from different TERMs covering a given PC using a weighted average. In particular, if $\Omega_{pq}$ is the set of all TERMs that cover PC$(p, q)$, with the positions of TERM $t \in \Omega_{pq}$ aligned with $p$ and $q$ being $p_t$ and $q_t$, respectively, then the pair enrichment was calculated as:

$$Q(a, b, p, q) = \frac{\sum_{t \in \Omega_{pq}} w_t(p_t, q_t)\, Q_t(a, b, p_t, q_t)}{\sum_{t \in \Omega_{pq}} w_t(p_t, q_t)} \tag{S10}$$

where $w_t(p_t, q_t)$ is the weight assigned to TERM $t$, in conjunction with position pair $(p_t, q_t)$, and $Q_t(a, b, p_t, q_t)$ is the enrichment of amino-acid pair $(a, b)$ at positions $(p_t, q_t)$ of TERM $t$, respectively. The weights here are analogous to those used with self energies (Eq. S5), defined as:

$$w_t(p_t, q_t) = \max\left[I_t(p_t, q_t),\ 0.5\right]\,\nu(t) \tag{S11}$$

where $I_t(p_t, q_t)$ is the joint information content for the position pair $(p_t, q_t)$ within instances of $t$ (2). The enrichment $Q_t(a, b, p_t, q_t)$ was meant to capture the ratio between the number of observed amino-acid pairs $(a, b)$ at positions $(p_t, q_t)$ of $t$ and the number of such pairs expected based on self statistics at each position. Specifically, this was calculated as:

$$Q_t(a, b, p_t, q_t) = \frac{c_t(a, b, p_t, q_t) + \epsilon_2}{f_t(a, p_t)\, f_t(b, q_t)\, M + \epsilon_2} \tag{S12}$$

where $M$ is the number of instances of TERM $t$, $c_t(a, b, p_t, q_t)$ is the weighted count of amino-acid pairs $(a, b)$ at $(p_t, q_t)$ within instances of $t$, and $\epsilon_2$ is a pseudocount-like parameter set to 2.0 in our study to dampen noisy pair statistics. When calculating $c_t(a, b, p_t, q_t)$, redundancy between TERM matches was removed as discussed for $c_t(a, p_t)$ above, with remaining TERM instances weighted based on environmental similarity to the target environments around $p$ and $q$:

$$c_t(a, b, p_t, q_t) = \frac{1}{Z}\sum_{k=1}^{M} \omega_t^k(p)\cdot\omega_t^k(q)\,\delta\!\left(a_t^k(p_t),\ a\right)\delta\!\left(a_t^k(q_t),\ b\right) \tag{S13}$$

where $Z$ is a normalization constant that ensures $\sum_{a=1}^{20}\sum_{b=1}^{20} c_t(a, b, p_t, q_t) = M$. With the above definitions, the total energy for a proposed sequence of $n$ residues was then the sum of self and pair energies given by:

$$E_{tot} = \sum_{i=1}^{n} E_s(a_i, i) + \sum_{(i,j) \in \Pi,\ i<j} E_p(a_i, a_j, i, j) \tag{S14}$$

where $a_i$ is the amino acid at position $i$ of the sequence and $\Pi$ is the set of all PCs in the target. We used integer linear programming (ILP) to find the sequence that minimizes $E_{tot}$ for a given target backbone (3). If the ILP did not converge sufficiently quickly (9 out of 92 cases), we used a self-consistent mean field algorithm instead.

Because this procedure was applied to redesign native proteins, we were careful to remove any TERM instances that may be evolutionarily related to the target protein when generating pseudo-energies. First, before starting the procedure we used BLAST (4) to find PDB chains having more than 30% sequence identity over at least 70% of the target and removed TERM instances associated with these chains from the entire analysis. To further remove any residual local homology, we also ignored TERM instances originating from regions with local similarity to the corresponding region in the target during the procedure. To this end, for each covering TERM we excised a 31-residue window around its central residue from the originating protein chain (15 residues on either side; if fewer residues were available on one side, the other was lengthened to compensate). We also excised the corresponding regions from the target protein’s native sequence. TERM instances with windows having 40% or higher sequence identity to the corresponding window from the target were discarded.

Environment Metric

The environment score of a residue was defined by combining its contact degrees with surrounding residues and its crowdedness, a measure of the residue’s side-chain freedom lost due to surrounding backbone atoms. Specifically, crowdedness of residue $i$ was:

$$C(i) = \sum_{a=1}^{20}\sum_{r_i \in R_a(i)} \Delta(r_i)\,\Pr(a)\,p(r_i) \tag{S15}$$

where $R_a(i)$ is the set of all library rotamers of amino acid $a$ at position $i$ (5), $\Pr(a)$ and $p(r_i)$ are the same as those used in Eq. 2 in main text, and $\Delta(r_i)$ is unity if rotamer $r_i$ clashes with the backbone and zero if it does not. Crowdedness varied between 0 (for positions experiencing little backbone interference) and 1 (for positions whose amino-acid identity was highly restricted by the backbone). The environment score for residue $i$ was then defined as:

$$e(i) = 4\,C(i) + \left[1 - C(i)\right]\sum_{j=1}^{n} D(i, j) \tag{S16}$$

where $D(i, j)$ is the contact degree between residues $i$ and $j$ and $n$ is the length of the structure. Although the summation extends over all residues $j$, only those with non-zero contact degrees with $i$ contribute. The $\left[1 - C(i)\right]$ factor in the second term diminishes the role of the contact degree when crowdedness is high and the position is thus incapable of accommodating many contacts. We found the combined metric in Eq. S16, with its empirically chosen coefficient of 4 in the first term, to strike a good balance between accounting for effects of nearby backbone and side-chains.

Structure Prediction

The direct coupling analysis (DCA) framework by Morcos et al. (6) was used to produce a statistical sequence model for each TERM from the MSA of its matching instances. The end goal of the DCA method, to find contacting residue pairs based on direct co-variation in an MSA, is not directly related to our goals here. However, as part of identifying such directly coupled positions, the method estimates a two-body statistical model for the observed MSA by an elegant application of the weak-coupling approximation. This was the capability of interest to us, because applying it to the MSA of a given TERM’s instances would express its sequence preferences as a rigorous sequence model. Matlab code for the DCA method was obtained from Martin Weigt (Université Pierre et Marie Curie, Paris, France) and modified to output the statistical potential inferred from the input MSA, i.e., the self energies for each amino acid at each MSA position, and pair energies for all amino-acid pairs at pairs of positions (local biases and pair couplings, respectively, in the nomenclature used by Weigt and co-workers (6)). Such a model was generated for each TERM $t$, with the resulting potential used to score each of its possible alignments onto the sequence of a given benchmark protein (note, TERM instances arising from sequences with homology to the benchmark protein were removed from the MSA prior to generating the model; homology was defined as having at least 30% sequence identity over at least 70% of the benchmark protein). The final score for each alignment $k$ of TERM $t$ was then calculated as:

$$\ln\left[\frac{e^{E_k(t)}}{\sum_{j} e^{E_j(t)}}\cdot P_{prior}(t)\right] = E_k(t) - \ln\sum_{j} e^{E_j(t)} + \ln P_{prior}(t) \tag{S17}$$

where $E_k(t)$ is the statistical energy associated with the $k$-th alignment of TERM $t$, calculated by summing the appropriate self and pair energy components (more positive energies are more favorable by convention in Morcos et al. (6)), the sum in the denominator on the left extends over all possible alignments of $t$ in the corresponding benchmark protein, and $P_{prior}(t)$ represents the prior probability of observing the TERM and was taken simply as the frequency of $t$ in the set-cover database. The top 20 best-scoring alignments were found and recorded for each TERM, with the best alignments from the combined list corresponding to predicted TERM alignments. Alignments were considered structurally correct if the corresponding RMSD was below 1.0 Å for single-segment motifs, below 1.5 Å for two-segment motifs, and below 2.0 Å for three-segment motifs. So as not to complicate interpretation of test results, proteins with significant internal symmetries (those with repeated CATH (7) domains or repeat proteins) were excluded from the Xray-1 set for this analysis (i.e., 1BFG, 1BD8, 1AMF, 1ATG, 1BFG, 1BIO, 1LST, 1NKR, 2SGA).

To compute the background rate of correct alignments for multi-segment TERMs given perfect knowledge of the local backbone geometry, we performed the following procedure for each multi-segment TERM $t$ and each benchmark protein $b$. First, we used MASTER to find all structural matches in $b$ for all individual segments in $t$, defined as having a backbone RMSD below 1.0 Å. Then, the total number of combinations of these alignments was calculated, $N_c$. Finally, MASTER was used to find all matches in $b$ to the entire TERM $t$, using the same RMSD cutoff as stated above (i.e., 1.5 and 2.0 Å for two- and three-segment motifs, respectively), with the number of these matches recorded as $N_m$. The rate of random guessing based on knowledge of local backbone conformation was then estimated as $N_m / N_c$, with the average rate over all benchmark proteins presented in the Results section (main text).

To test the effect of biasing TERM alignments based on predicted secondary structure information, we employed PSIPRED (8) to compute from sequence the probability of helix, strand, and coil states for each residue in each benchmark protein. In addition, STRIDE (1) was used to classify each residue of each TERM into the same three classes based on backbone conformation. With this information, for any alignment of a given TERM, we computed a secondary-structure match score by summing the logarithms of PSIPRED probabilities of the TERM’s residue classes at corresponding alignment positions. This quantity was then added to the value in Eq. S17 to arrive at the final score for the alignment, and the remaining procedure repeated as above.

SI Results

Specialization of TERMs

Modularity in Metal Coordination. Most metal-linked TERMs do not include all residues in the binding site. Instead, multiple TERMs surround the metal, indicating that the structure of these binding sites is modular. This modularity allows TERMs to be reused across proteins, with variable overall binding-site geometries or even outside of binding sites. A particularly illustrative example of such modularity is shown by the TERM in Fig. S10A. This motif associates with a variety of different metals, including Na, Co, Ca, Mg, Mn, and Zn, with each of the instances in Fig. S10B arising from proteins that fall into divergent sequence clusters (see Materials and Methods in main text). Indeed, matches to this TERM exhibit highly varying inter-segment sequence lengths, often differing by more than a hundred residues (Fig. S10C). On the other hand, these proteins exhibit a clear functional bias, with phosphatase, sulfatase, phosphotransferase, phosphonate ester hydrolase, and phosphoethanolamine transferase activities prevailing (see Table S4). Analysis of corresponding mechanistic studies reveals that the metals associated with the TERM are in the active site and directly involved in catalysis, with the precise active-site geometry, including the number and type of bound metals, dictating the specific reaction and substrate (9-12). TERM side-chains involved in metal coordination vary somewhat depending on the bound metal, but not enough to fully explain specificity or the diversity of catalyzed reactions (Fig. S10C). In fact, this TERM captures only some of the residues important for coordination and function, while others are donated by the surrounding structural environment, which exhibits great variability despite the functional bias in these proteins (see Fig. S11). On the other hand, this variability itself appears to be modular. For example, as illustrated in Fig. S12, two different surrounding universal TERMs can combine with the main TERM in Fig. S10A to either coordinate a second Ca2+/Mg2+-binding site or to eliminate such a site and instead extend additional functional residues into the active site. Interestingly, these two different approaches appear to be employed by alkaline phosphatases and sulfatases, respectively, though other proteins in each functional class employ yet other surrounding TERMs and active-site geometries.

Water Binding. We also wondered whether any TERMs would specialize in binding water. Specifically coordinated water molecules can play key functional roles, so knowledge of water binding motifs could facilitate the design of complex functions. Compared to metal-associated TERMs, true water-binding sites were more difficult to delineate, due to the large number of ordered waters present in X-ray structures. Still, by filtering more stringently (see Materials and Methods in main text), enriched motifs could be identified, with two representatives shown in Fig. S13. The first of these coordinates a water molecule between three strands, using both main-chain and side-chain atoms (Fig. S13A). This TERM, with 92 non-redundant matches in our database, occurs within a β-trefoil motif shared by a highly diverse set of proteins (e.g., protease inhibitors, neurotoxins, growth factors, DNA binding proteins) (13, 14). While certain TERM positions in the water-binding site are conserved, the overall conservation of surrounding regions is low and the average pairwise sequence identity between TERM instances is only 20% (matching region only; identity is lower over full chains). The TERM in Fig. S13B wedges a water molecule between six β-strands, with four appearing to directly participate in coordination. Similarly to the previous motif, this TERM occurs in a diverse set of proteins (e.g., nucleases, phosphatases, and toxins) that share considerable similarity in the vicinity of the binding site and may exhibit related functional mechanisms (15), but are quite different overall (average pairwise sequence identity of 20%, with the most distant pair at 14%). Although the bound water molecule extends a network of polar interactions with active-site residues of respective proteins (e.g., in beta-toxin from Staphylococcus aureus, PDB code 3I48; mouse tyrosyl-DNA-phosphodiesterase 2, PDB code 4PUQ; human CNOT6L nuclease, PDB code 3NGO; or human inositol polyphosphate 5-phosphatase INPP5B, PDB code 3MTC), it has not yet been ascribed a specific functional role, to our knowledge (16-18). Its structural conservation within a functionally biased class of proteins suggests that it may have such a role.

SI Figures

Figure S1 The mean cosine of the angle between chain tangent vectors at two positions $d$ residues apart. Means were calculated over all within-chain position pairs in DB80.
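As an illustration of the curve in this figure, the damped-cosine form described in SI Methods can be evaluated directly. The following is a minimal sketch, assuming the fitted form is exp(-d/L)·cos(2πd/T) with a persistence length L of about 10 residues and an oscillation period T of about 33 residues. That assignment of the two fitted values is an assumed reading, chosen because it reproduces the drop-off of the exponential component to ~0.2 and ~0.1 at separations of 15 and 25 residues quoted in SI Methods. The function and parameter names are hypothetical.

```python
import math


def exponential_component(separation, persistence_len=10.0):
    """Decay of directional memory with sequence separation (in residues)."""
    return math.exp(-separation / persistence_len)


def mean_tangent_cosine(separation, persistence_len=10.0, period=33.0):
    """Assumed model for the average cosine between chain tangent vectors:
    exponential decay multiplied by an SSE-related oscillation."""
    oscillation = math.cos(2.0 * math.pi * separation / period)
    return exponential_component(separation, persistence_len) * oscillation


if __name__ == "__main__":
    # Print the exponential envelope and the full model at a few separations.
    for d in (0, 5, 15, 25, 40):
        print(d, round(exponential_component(d), 3), round(mean_tangent_cosine(d), 3))
```

Under these assumed parameters the model turns negative near half a period (about 16 residues), consistent with the chain often reversing direction within 15-30 residues.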

Figure S2 The RMSD cutoff for a motif is given by its segment lengths. A) Three example RMSD cutoff curves ($\sigma_{max}$ = 1.0 Å, $L$ = 20). The solid line represents RMSD cutoffs for a single-segment motif of $n$ residues. The dashed line represents the RMSD cutoffs for a motif with one segment of 5 residues and another of $n$ residues. The dashed-dotted line represents the RMSD cutoffs for a TERM with two 5-residue segments and one $n$-residue segment. B) Examples of ensembles created using the same RMSD cutoff scheme. Each ensemble contains 10 matches randomly selected from all those within the corresponding RMSD cutoff of each motif. In addition, each ensemble also includes the query (for which the ribbon representation is shown) and the match with the highest acceptable RMSD. Segment lengths and corresponding RMSD cutoffs are shown below each ensemble on the first and second lines, respectively.
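The three curves in panel A can be approximated with a short calculation. The sketch below is a minimal Python example, assuming Eq. S1 takes the form ν = N[1 − 2/(N(N−1)) Σ e^(−s/L)], with the sum running over all within-segment residue pairs at sequence separation s, and Eq. S2 takes the form σ_max·sqrt(ν/N). The function names and the example segment lengths are hypothetical and are not taken from the authors' code.

```python
import math


def effective_dof(segment_lengths, corr_len=20.0):
    """Effective number of degrees of freedom (assumed reading of Eq. S1)."""
    n_total = sum(segment_lengths)
    if n_total < 2:
        return float(n_total)
    pair_corr = 0.0
    for n in segment_lengths:
        # A segment of n residues has (n - s) residue pairs at separation s.
        for s in range(1, n):
            pair_corr += (n - s) * math.exp(-s / corr_len)
    # Pairs spanning different segments contribute zero correlation.
    mean_corr = 2.0 * pair_corr / (n_total * (n_total - 1))
    return n_total * (1.0 - mean_corr)


def rmsd_cutoff(segment_lengths, sigma_max=1.0, corr_len=20.0):
    """Traditional RMSD cutoff for a motif (assumed reading of Eq. S2)."""
    n_total = sum(segment_lengths)
    return sigma_max * math.sqrt(effective_dof(segment_lengths, corr_len) / n_total)


if __name__ == "__main__":
    # One n-residue segment, plus one or two extra 5-residue segments.
    for n in (5, 10, 20, 40):
        print(n,
              round(rmsd_cutoff([n]), 3),
              round(rmsd_cutoff([5, n]), 3),
              round(rmsd_cutoff([5, 5, n]), 3))
```

In this sketch, adding segments raises the cutoff toward σ_max because cross-segment pairs are treated as uncorrelated, which matches the ordering of the solid, dashed, and dashed-dotted lines.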

Figure S3 Statistics of universal TERMs. A) Secondary structure composition as a function of the number of segments. The Y-axis represents the percent of centroid residues, among TERMs with the given number of segments, that have a specific secondary structure classification (see SI Methods). B) The distribution of TERM sizes (i.e., the number of residues). C) The distribution of the number of TERM segments.

Figure S4 Anti-parallel β-sheets have two unique types of lateral contacts, while parallel ones only one. A) and B) both show anti-parallel two-stranded sheets corresponding to TERMs 2 and 5 from the main text, while the motif in C) corresponds to TERM 3 from the main text. We define the chirality class of each strand residue as either right or left by considering its backbone amide nitrogen and carbonyl oxygen atoms along with the Cα atom of the opposite residue on the opposing strand, as shown in D). Cβ atoms of all residues in A), B), and C) are colored based on chirality class: orange for right and gray for left. Clearly, anti-parallel sheets have two topologically different sides: one with all right residues and another with all left residues, whereas the two sides of parallel sheets have the same topology. Consequently, two types of lateral contacts exist in anti-parallel sheets, right-right and left-left, as shown in A) and B), respectively, but there is only one type of contact in parallel sheets as in C) (note that left-right and right-left pairs in parallel sheets are superimposable upon a rotation).

Figure S5 Structure quality parameters as a function of TERM priority. For each TERM in this study, we computed the mean backbone B-factor and occupancy (excluding hydrogens) over all instances of the TERM. A) and B) plot the resulting B-factor and occupancy values against TERM rank in the set cover procedure, respectively (order was randomized for TERMs that cover exactly the same number of PCs). Points are colored by density in B), and the red line in A) designates the 1st percentile across a moving window of 1,000 TERMs (i.e., 99% of TERMs are above the red line). Clearly, instances with unusually high B-factors and low occupancies occur only towards the tail end of the set cover (i.e., among the lowest-priority TERMs), although these instances are still rare with the bulk of the data showing high-quality parameters.

Figure S6 The structural novelty contributed by TERMs yet to be included in the set cover, compared to those already discovered. All TERMs are classified as either high- or low-priority based on whether they are discovered before or after the priority rank indicated on the X-axis, respectively. For each such classification, novelty is then computed as a normalized size of the set difference $L \setminus H$, where $L$ and $H$ are the sets of universal elements jointly covered by the low- and high-priority TERMs, respectively, and plotted on the Y-axis. Inset shows the same plot with a logarithmic scale along the X-axis.

Figure S7 Distribution of TERM size (top) and the number of segments per TERM (bottom) as a function of priority. Each vertical bar represents a group of 10,000 TERMs, considered in the order of priority, with pseudo-color used to represent the histogram of the relevant metric within each group (color bars on the right indicate frequencies).

Figure S8 Comparison of coverage of the original dataset and 1,095 novel proteins. Data are represented exactly as in Fig. 3 from the main text, except that the RMSD cutoff function with parameters $L$ = �� and $\sigma_{max}$ = 1.0 Å was used for both generating universal TERMs and assessing coverage in novel proteins.

Figure S9 Top metal binding motifs for five different metals. TERMs were first filtered by the fraction of matches involved in metal binding. For calcium, copper, and sodium this was set at 60%. For zinc and magnesium this was set at 50%. Enriched TERMs for each metal were then ranked by number of matches and the top five for each metal are shown in descending order from left to right in each row. Shown under each TERM is rank in the set cover (before the semicolon) and the number of unique matches (after the semicolon).
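The filter-then-rank selection described in this caption can be sketched as follows. This is a minimal Python example under stated assumptions: the per-metal fraction thresholds come from the caption, the minimum of ten matches comes from SI Methods, and the `TermStats` record, its field names, and the function name are hypothetical stand-ins for whatever match statistics are available.

```python
from dataclasses import dataclass

# Minimum fraction of matches that must contact the metal (from the caption).
FRACTION_THRESHOLDS = {"Ca": 0.60, "Cu": 0.60, "Na": 0.60, "Mg": 0.50, "Zn": 0.50}
MIN_MATCHES = 10  # minimum number of matches (from SI Methods)


@dataclass
class TermStats:
    """Hypothetical per-TERM summary for one metal of interest."""
    term_id: str
    n_matches: int   # non-redundant matches of the TERM
    n_binding: int   # matches with a non-hydrogen atom within 3 Å of the metal


def top_metal_terms(stats, metal, top_k=5):
    """Keep TERMs enriched for binding `metal`, then rank by match count."""
    cutoff = FRACTION_THRESHOLDS[metal]
    enriched = [
        s for s in stats
        if s.n_matches >= MIN_MATCHES and s.n_binding / s.n_matches >= cutoff
    ]
    return sorted(enriched, key=lambda s: s.n_matches, reverse=True)[:top_k]


if __name__ == "__main__":
    example = [
        TermStats("t1", 40, 30),
        TermStats("t2", 100, 55),
        TermStats("t3", 8, 8),
        TermStats("t4", 60, 31),
    ]
    print([s.term_id for s in top_metal_terms(example, "Zn")])
```

The same statistics can pass for one metal and fail for another, since the fraction threshold differs between the 60% and 50% groups.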

Figure S10 An example multi-metal binding TERM. Out of the 21 non-redundant matches for this TERM, shown superimposed in A), there are nine different binding modes for five metal types, shown in B). Side chains with close metal contacts are shown with sticks. The segment numbers, ordered by their appearance in the protein, are placed next to the segments in A). C) The distribution of sequence distances between segments in this four-segment motif, across all matches.

Figure S11 Variability in the structural environment around the metal-linked TERM in Fig. S10A. The TERM is shown in green and all surrounding structure in gray. Various associated metals are shown as spheres.

Figure S12 The metal-associated TERM in Fig. S10 (center/green) can combine with other universal TERMs in its surroundings (e.g., left/blue or right/cyan) to complete different active site geometries (e.g., extreme left or right, respectively). TERMs are shown as superpositions of several instances, with residues in the two combined functional sites colored to indicate which TERM they belong to. Residues that overlap between surrounding TERM 2 and the central TERM are colored yellow on the extreme left.

Figure S13 Representative water binding TERMs. A water molecule is bound between three and six β-strands in the TERMs shown in A) and B), respectively. TERM instances are shown superimposed as gray cartoons, while key residues participating in the coordination are rendered as green sticks.

Figure S14 Most frequent interfacial TERMs. A) Top 25 TERMs by the number of interfacial PCs covered. The number before the semicolon is the rank of the TERM in the set cover and the number after is the number of unique matches. Inter-helical geometric parameters for the top five parallel and anti-parallel helix-helix motifs are shown in B) and C), respectively, with the corresponding motifs labeled with single letters in A). The X- and Y-axes show the superhelical radius and the pitch angle, respectively, for each TERM; parameters were extracted by globally fitting to the ensemble of TERM matches via CCCP (19).

Figure S15 Coverage of the universe, as a function of number of TERMs, split by element type. Data are presented as in the main-text Fig. 1B, except that total coverage is broken into coverage of residues (blue), tertiary PCs (green), and quaternary PCs (red).

Figure S16 Correlation of enrichment ratios predicted from TERM-based and evolutionary alignments. Each point represents the enrichment ratio of a single amino acid at a specific protein position. Points are colored by local density. Overall correlation coefficient is R = 0.51.

Figure S17 Enrichment ratios were calculated for each position of each of the 91 proteins mentioned in the main text, using either TERM-predicted or evolutionary MSAs. Shown here is a histogram of per-position correlation coefficients between the two enrichment ratios. 30% of positions exhibit correlations above R = 0.8, with the mean correlation being R = 0.51.
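A per-position correlation analysis of this kind can be sketched as below, assuming each position carries one enrichment ratio per amino acid from each source and that Pearson correlation is the coefficient in use. The function names and the input layout are illustrative. How the enrichment ratios themselves are derived is described in the main text and is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # constant input: correlation is undefined
    return sxy / math.sqrt(sxx * syy)


def summarize_positions(term_ratios, evo_ratios, threshold=0.8):
    """Per-position correlations plus the two summary numbers.

    term_ratios, evo_ratios: lists (one entry per position) of per-amino-acid
    enrichment ratios from the two sources; layout is hypothetical.
    Returns (correlations, fraction above threshold, mean correlation).
    """
    rs = [pearson(t, e) for t, e in zip(term_ratios, evo_ratios)]
    rs = [r for r in rs if not math.isnan(r)]  # drop undefined positions
    if not rs:
        return [], 0.0, float("nan")
    frac_above = sum(r > threshold for r in rs) / len(rs)
    return rs, frac_above, sum(rs) / len(rs)
```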

Figure S18 Predicted and observed evolutionary sequence variation. A)–C) correspond to the best, median, and worst performing proteins, respectively (PDB ID codes are shown underlined). In each panel, the top sequence logo represents the inferred evolutionary profile (Materials and Methods, main text), whereas the bottom logo arises from MC sampling using TERM-based pseudoenergies. Between the two logos, vertical bars indicate positions where the top amino acid is the same in both cases, and two dots designate positions where the top choices are within the same physicochemical group. All logos were generated with the standalone version of Seq2Logo (20). D)–F) show the structures corresponding to A)–C), respectively. Cartoon color represents the correlation coefficient between TERM-predicted and observed enrichment ratios at each position, with the pseudocolor scale shown on the bottom right (several residues that were not covered by TERMs are shown in dark gray).

Figure S19 A representative example of a protein from the benchmark set (PDB code 2KL8) with structurally correct TERM alignments from all those that scored in the top 1st percentile in prediction. The protein is shown in cyan ribbon in A) and B). Magenta and green tubes, in B) and C), represent superimposed single- and multi-segment TERMs, respectively. It is apparent from C) that correct predictions for both types of motifs originate from nearly all regions of the protein and give an approximate outline of the overall structure.

Figure S20 Prediction of TERMs from sequence alone improves uniformly when secondary structure class predictions are used to bias TERM alignments (see main-text Materials and Methods). The figure is formatted identically to Fig. 6 from the main text.

SI Tables

Table S1 Proteins containing EF-hand instances of the calcium-enriched TERM discussed in the main text.

PDB ID | Title of PDB entry | PDB Classification
3Q5I | Crystal structure of pbanka_031420 | Transferase
3PM8 | Cad domain of pff0520w | Calcium dependent protein
1K9U | Crystal structure of the calcium-binding pollen allergen phl p 7 (polcalcin) at 1.75 angstroem | Allergen
1DTL | Crystal structure of calcium-saturated (3ca2+) cardiac complexed with the calcium sensitizer bepridil at 2.15 a resolution | Structural protein
1RWY | Crystal structure of rat alpha- at 1.05 resolution | Calcium-binding protein
4MSP | Crystal structure of human peptidyl-prolyl cis-trans isomerase fkbp22 (aka fkbp14) containing two ef-hand motifs | Isomerase
2R2I | Myristoylated guanylate cyclase activating protein-1 with calcium bound | Lyase activator
4ETO | Structure of s100a4 in complex with non-muscle myosin-iia peptide | Metal binding protein
1B47 | Structure of the n-terminal domain of cbl in complex with its binding site in zap-70 |
3SG7 | Crystal structure of gcamp3-kf(linker 1) | Fluorescent protein
4N5X | Crystal structure of n-terminal -like calcium sensor of human mitochondrial atp-mg/pi carrier scamc1 | Calcium-binding protein
1Y1X | Structural analysis of a homolog of programmed cell death 6 protein from leishmania major friedlin | Structural genomics
4MN0 | Spatial structure of the novel light-sensitive photoprotein berovin from the ctenophore beroe abyssicola in the ca2+-loaded apoprotein conformation state | Luminescent protein
1G33 | Crystal structure of rat parvalbumin without the n-terminal domain | Metal binding protein
2ZN9 | Crystal structure of ca2+-bound form of des3-20alg-2 | Apoptosis
2CT9 | The crystal structure of calcineurin b homologous proein 1 (chp1) | Metal binding protein
1SRA | Structure of a novel extracellular ca2+-binding module in bm-40(slash)sparc(slash) | Calcium-binding protein
2CCM | X-ray structure of calexcitin from loligo pealeii at 1.8a | Signaling protein
3A8R | The structure of the n-terminal regulatory domain of a plant nadph oxidase | Calcium binding protein
3U0K | Crystal structure of the genetically encoded calcium indicator rcamp | Fluorescent protein
1PSR | Human psoriasin (s100a7) | Ef-hand protein
3LCP | Crystal structure of the carbohydrate recognition domain of lman1 in complex with mcfd2 | Protein binding
3H4S | Structure of the complex of a mitotic kinesin with its calcium binding regulator | Motor protein/calcium binding protein
1UHN | The crystal structure of the calcium binding protein atcbl2 from arabidopsis thaliana | Metal binding protein
3MSE | Crystal structure of c-terminal domain of pf110239. | Transferase
2EGD | Crystal structure of human s100a13 in the ca2+-bound state | Metal binding protein
4MEW | Structure of the core fragment of human pr70 | Hydrolase
1Y1A | Crystal structure of calcium and integrin binding protein | Metal binding protein
3AKB | Structural basis for prokaryotic calcium-mediated regulation by a streptomyces coelicolor calcium-binding protein | Metal binding protein
3HNO | Crystal structure of pyrophosphate-dependent phosphofructokinase from nitrosospira multiformis. Northeast structural genomics consortium target id nmr42 | Transferase
3CS1 | Flagellar calcium-binding protein (fcabp) from t. Cruzi | Metal binding protein
1CB7 | Glutamate mutase from clostridium cochlearium reconstituted with methyl-cobalamin | Isomerase

Table S2 Elements of the structural universe (residues and PCs)

Element type | Non-redundant (×10^6) | Total (×10^6) | Unique/Total (%)
All residues & PCs | 12.8 | 38.9 | 32.9%
Residues | 3.5 | 14.1 | 24.8%
Tertiary PCs | 7.5 | 22.9 | 32.7%
Quaternary PCs | 1.8 | 2.0 | 90.0%
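The last column of Table S2 is the ratio of the non-redundant count to the total count. The short sketch below recomputes it from the tabulated values, assuming both count columns are in millions. Because those counts are rounded to one decimal place, the recomputed percentages can differ from the reported ones in the last digit. The names `TABLE_S2` and `unique_fraction_pct` are illustrative.

```python
# (non-redundant, total, reported Unique/Total %) as listed in Table S2;
# counts are assumed to be in millions.
TABLE_S2 = {
    "All residues & PCs": (12.8, 38.9, 32.9),
    "Residues": (3.5, 14.1, 24.8),
    "Tertiary PCs": (7.5, 22.9, 32.7),
    "Quaternary PCs": (1.8, 2.0, 90.0),
}


def unique_fraction_pct(non_redundant, total):
    """Percentage of elements that are non-redundant."""
    return 100.0 * non_redundant / total


def check_table(table, tolerance=0.1):
    """Return {row: (recomputed %, reported %, agrees within tolerance)}."""
    report = {}
    for name, (unique, total, reported) in table.items():
        pct = unique_fraction_pct(unique, total)
        report[name] = (pct, reported, abs(pct - reported) <= tolerance)
    return report
```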

Table S3 Structure sets used for the sequence recovery and evolutionary profile prediction studies.

Xray-1: 1A6M, 1A7S, 1A8O, 1ABA, 1AKR, 1AMF, 1AMX, 1ATG, 1AZO, 1B2V, 1B9O, 1BD8, 1BDO, 1BEA, 1BFG, 1BGC, 1BIO, 1BJ7, 1BK9, 1BM8, 1BQK, 1BXV, 1BY2, 1C52, 1CEX, 1CJW, 1CO6, 1G3P, 1GBS, 1HOE, 1HYP, 1IFC, 1KAO, 1KOE, 1LBU, 1LST, 1MAI, 1MH1, 1NKR, 1OPC, 1OPD, 1ORC, 1PNE, 1PPN, 1QF9, 1QST, 1RA9, 1RGP, 1TMY, 1UBQ, 1UCH, 1VSR, 2A0B, 2ACY, 2ENG, 2IGD, 2MCM, 2MHR, 2PTH, 2SGA, 2SNS, 3EUG, 3PYP, 3SEB, 3TSS
Xray-2: 2O0Q, 3IBW, 3FIF, 1TVG, 3E0E, 3IDU, 3H9X, 3C4S, 2Q00, 3K63, 1TTZ
NMR-1: 2LV8, 2LN3, 2LVB, 2LTA, 2KL8, 2KL6, 2KFP, 2JZ2, 2JPU, 2KRT, 1XPV
NMR-2: 2JQN, 2KO1, 2JN0, 1XPW, 2K5V

Table S4 Proteins containing non-redundant matches to the metal-associated TERM in Fig. S10A.

PDB ID | PDB description | Function
3NVL | Crystal Structure of Phosphoglycerate Mutase from Trypanosoma brucei | Phosphoglycerate mutase
3M8W | Phosphopentomutase from Bacillus cereus | Phosphopentomutase
3E2D | The 1.4 A crystal structure of the large and cold-active Vibrio sp. alkaline phosphatase | Alkaline phosphatase
3A52 | Crystal structure of cold-active alkailne phosphatase from psychrophile Shewanella sp. | Alkaline phosphatase
3NKM | Crystal structure of mouse autotaxin | Autotaxin hydrolase
2W5V | Structure of TAB5 alkaline phosphatase mutant HIS135ASP with Mg bound in the M3 site | Alkaline phosphatase
1SHN | Crystal structure of shrimp alkaline phosphatase with phosphate bound | Alkaline phosphatase
1ED8 | Structure of E. coli alkaline phosphatase inhibited by the inorganic phosphate at 1.75a resolution | Alkaline phosphatase
4MIV | Crystal Structure of Sulfamidase, Crystal Form L | Sulfamidase
2GSN | Structure of Xac Nucleotide Pyrophosphatase/Phosphodiesterase | Nucleotide phosphatase
2W8S | Crystal structure of a catalytically promiscuous phosphonate monoester hydrolase from Burkholderia caryophylli | Phosphonate monoester hydrolase
3ED4 | Crystal structure of putative arylsulfatase from escherichia coli | Arylsulfatase
4LQY | Crystal Structure of Human ENPP4 with AMP | AMP hydrolase
1E1Z | Crystal structure of an arylsulfatase a mutant C69S | Arylsulfatase
2ZKT | Structure of PH0037 protein from Pyrococcus horikoshii | Isomerase
2QZU | Crystal structure of the putative sulfatase yidJ from Bacteroides fragilis. Northeast Structural Genomics Consortium target BfR123 | Sulfatase
3B5Q | Crystal structure of a putative sulfatase (NP_810509.1) from Bacteroides thetaiotaomicron VPI-5482 at 2.40 A resolution | Sulfatase
4KAV | Crystal Structure of the soluble domain of Lipooligosaccharide phosphoethanolamine transferase A from Neisseria meningitidis | Phosphoethanolamine transferase
1FSU | Crystal Structure of 4-Sulfatase (human) | Sulfatase
1HDH | Arylsulfatase from Pseudomonas aeruginosa | Arylsulfatase
3Q3Q | Crystal Structure of SPAP: an novel alkaline phosphatase from bacterium Sphingomonas sp. strain BSAR-1 | Alkaline phosphatase

SI References

1. Frishman D & Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23(4):566-579.
2. Basharin GP (1959) On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability & Its Applications 4(3):333-336.
3. Kingsford CL, Chazelle B, & Singh M (2005) Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics 21(7):1028-1036.
4. Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403-410.
5. Lovell SC, Word JM, Richardson JS, & Richardson DC (2000) The penultimate rotamer library. Proteins 40(3):389-408.
6. Morcos F, et al. (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108(49):E1293-1301.
7. Sillitoe I, et al. (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43(Database issue):D376-381.
8. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292(2):195-202.
9. Stec B, Holtz KM, & Kantrowitz ER (2000) A revised mechanism for the alkaline phosphatase reaction involving three metal ions. J Mol Biol 299(5):1303-1311.
10. von Bulow R, Schmidt B, Dierks T, von Figura K, & Uson I (2001) Crystal structure of an enzyme-substrate complex provides insight into the interaction between human arylsulfatase A and its substrates during catalysis. J Mol Biol 305(2):269-277.
11. Boltes I, et al. (2001) 1.3 A structure of arylsulfatase from Pseudomonas aeruginosa establishes the catalytic mechanism of sulfate ester cleavage in the sulfatase family. Structure 9(6):483-491.
12. Mercaldi GF, Pereira HM, Cordeiro AT, Michels PA, & Thiemann OH (2012) Structural role of the active-site metal in the conformation of Trypanosoma brucei phosphoglycerate mutase. FEBS J 279(11):2012-2021.
13. Renko M, Sabotic J, & Turk D (2012) beta-trefoil inhibitors--from the work of Kunitz onward. Biol Chem 393(10):1043-1054.
14. Fujimoto Z (2013) Structure and function of carbohydrate-binding module families 13 and 42 of glycoside hydrolases, comprising a beta-trefoil fold. Biosci Biotechnol Biochem 77(7):1363-1371.
15. Whisstock JC, et al. (2000) The inositol polyphosphate 5-phosphatases and the apurinic/apyrimidinic base excision repair endonucleases share a common mechanism for catalysis. J Biol Chem 275(47):37055-37061.
16. Gao R, et al. (2014) Proteolytic degradation of topoisomerase II (Top2) enables the processing of Top2.DNA and Top2.RNA covalent complexes by tyrosyl-DNA-phosphodiesterase 2 (TDP2). J Biol Chem 289(26):17960-17969.
17. Wang H, et al. (2010) Crystal structure of the human CNOT6L nuclease domain reveals strict poly(A) substrate specificity. EMBO J 29(15):2566-2576.
18. Tresaugues L, et al. (2014) Structural basis for phosphoinositide substrate recognition, catalysis, and membrane interactions in human inositol polyphosphate 5-phosphatases. Structure 22(5):744-755.
19. Grigoryan G & DeGrado WF (2011) Probing designability via a generalized model of helical bundle geometry. J Mol Biol 405(4):1079-1100.
20. Thomsen MC & Nielsen M (2012) Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res 40(Web Server issue):W281-287.