
Articles https://doi.org/10.1038/s41594-018-0141-6

De novo design of a non-local β-sheet with high stability and accuracy

Enrique Marcos1,2,3,10*, Tamuka M. Chidyausiku1,2,10, Andrew C. McShan4, Thomas Evangelidis5, Santrupti Nerli4,6, Lauren Carter1,2, Lucas G. Nivón1,2,7, Audrey Davis1,2,8, Gustav Oberdorfer1,2,9, Konstantinos Tripsianes5, Nikolaos G. Sgourakis4 and David Baker1,2*

β-sheet proteins carry out critical functions in biology, and hence are attractive scaffolds for computational protein design. Despite this potential, de novo design of all-β-sheet proteins from first principles lags far behind the design of all-α or mixed-αβ domains owing to their non-local nature and the tendency of exposed β-strand edges to aggregate. Through study of loops connecting unpaired β-strands (β-arches), we have identified a series of structural relationships between loop geometry, side chain directionality and β-strand length that arise from hydrogen bonding and packing constraints on regular β-sheet structures. We use these rules to de novo design jellyroll structures with double-stranded β-helices formed by eight antiparallel β-strands. The nuclear magnetic resonance structure of a hyperthermostable design closely matched the computational model, demonstrating accurate control over the β-sheet structure and loop geometry. Our results open the door to the design of a broad range of non-local β-sheet protein structures.

β-sheet protein domains are ubiquitous in nature, carrying out a wide range of functions: transporting hydrophobic molecules, recognition and enzymatic processing of carbohydrates, and scaffolding of virus capsids and antibodies, among others. Although β-sheet protein scaffolds are well suited for incorporating new functions, their design from first principles remains an outstanding challenge. Recent progress in de novo protein design has enabled the accurate design of many hyperstable and structurally diverse proteins, but so far, other than short β-sheet peptides1–3, all exhibit either all-α or mixed-αβ folds4. The design of these last has been considerably facilitated by the derivation of a set of rules describing constraints on the backbone geometry of the loops connecting secondary structure elements5, but all-β proteins contain additional features that are less well understood. All-β-sheet structures are particularly challenging to design from scratch6 because a larger fraction of the interactions are non-local (that is, between residues distant along the linear sequence), leading to slower folding rates7, and because β-strands, particularly at the edges of β-sheets, can aggregate into amyloid-like structures. Hence, few β-sheet protein design studies have sought to generate new backbone structures8,9 and, except for a recent β-barrel structure with primarily local strand pairings10, those designs confirmed by high-resolution structure determination have relied heavily on sequence information11,12 and backbone structures13,14 from naturally occurring β-sheet proteins.

So far, the de novo design of β-sheet loop connections has been limited to β-hairpins (two antiparallel β-strands interacting via backbone hydrogen bonding and connected through a loop), which is the most local strand pairing possible and, in principle, the fastest to fold. However, these structures lack a critical feature of non-local globular all-β structures: loops connecting β-strands not paired to each other, also known as β-arches15. These loops connect distinct β-sheets and pair β-strands with larger sequence separation, and they are essential for enabling the protein fold complexity observed in antibodies, β-solenoids, jellyrolls and Greek key-containing β structures generally. Here we set out to identify the general principles for designing non-local β-sheet structures.

Results

Constraints on β-arch geometry. We undertook investigation of the constraints on the backbone geometry of β-strands and connecting loops that arise from hydrogen bonding and the requirement for a compact hydrophobic core. We studied side chain directionality patterns of the two β-strand residues adjacent to β-arch loops (Fig. 1a, left) in naturally occurring protein structures, defining the side chain orientation of the β-strand residue preceding the loop as ‘concave’ (represented by ↓) if its CαCβ vector is parallel to the vector d from the first to the second β-strand, and ‘convex’ (represented by ↑) if the CαCβ vector is antiparallel to d. For the residue following the loop, the side chain pattern is described in the same way, but instead using the vector from the second to the first β-strand (–d) as a reference (Fig. 1a). This results in four possible β-arch loop side chain orientation patterns: ↑↑, ↑↓, ↓↑ and ↓↓. We analyzed the side chain patterns and the local backbone geometry (as described with ABEGO torsion bins16) of 5,061 β-arch loops from a non-redundant database of natural protein structures (torsion bins A and B are the α-helix and extended regions; G and E regions are the positive φ angle equivalents of A and B; and O is the cis-peptide bond conformation; Supplementary Fig. 1). We found that all four side chain orientation patterns frequently occur, and, in contrast to other types of loop connections (that is, αβ, βα and

1Department of Biochemistry, University of Washington, Seattle, WA, USA. 2Institute for Protein Design, University of Washington, Seattle, WA, USA. 3Institute for Research in Biomedicine (IRB Barcelona), Barcelona Institute of Science and Technology, Barcelona, Spain. 4Department of Chemistry and Biochemistry, University of California, Santa Cruz, Santa Cruz, CA, USA. 5CEITEC - Central European Institute of Technology, Masaryk University, Brno, Czech Republic. 6Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA, USA. 7Present address: Cyrus Biotechnology, Seattle, WA, USA. 8Present address: Amazon, Seattle, WA, USA. 9Present address: Institute of Biochemistry, Graz University of Technology, Graz, Austria. 10These authors contributed equally: Enrique Marcos, Tamuka M. Chidyausiku. *e-mail: [email protected]; [email protected]

Nature Structural & Molecular Biology | www.nature.com/nsmb

[Figure 1 graphic: panel a, β-hairpin and β-arch schematics with the CαCβ and d vectors (CαCβ · d > 0, CαCβ · d < 0, CαCβ · –d > 0); panel b, loop types (ABEGO) versus β-arch patterns; panel c, frequency of loop types for each β-arch pattern; panel d, antiparallel β-arcade, with β-arch (1) patterns versus β-arch (2) patterns.]

Fig. 1 | Constraints on β-arch geometry. a, Side chain directionality in the β-arch. Comparison between β-hairpin and β-arch (left); the CαCβ and d vectors used to define the orientation of the two adjacent side chains are indicated. The four possible side chain directionality patterns are on the right. b, Loop type dependence of β-arch side chain patterns. Loops on the y axis are described by their ABEGO torsion bins (Supplementary Fig. 1). Most of the loops adopt only one of the four possible side chain patterns. c, Frequency of the most common loops for each of the four β-arch side chain patterns. There are strong preferences: for example, BBGB is strongly associated with the ↓↑ pattern, whereas ABB is strongly associated with the ↓↓ pattern (bottom). Only loops with bending <120° (see Methods) and containing between 1 and 5 amino acids were considered in this analysis. d, β-arcades consist of two stacked β-arches with in-register strand pairing (left). Because strand pairs of the β-arcade are in register, the side chains adjacent to one β-arch loop must have the same orientation as the paired side chains that are adjacent to the second β-arch loop, and therefore not all loop pairs are allowed (middle). Example of a β-arcade formed by two common β-arches with compatible side chain patterns (right).
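The side chain pattern definition illustrated in panel a can be written as two dot-product tests. The following is a minimal sketch, not the authors' analysis code: the function name `arch_pattern`, the plain-tuple coordinates and the way the inter-strand vector d is supplied are all assumptions made for illustration, and a dot product of exactly zero is arbitrarily reported as ↑.

```python
from typing import Sequence, Tuple

Vec = Sequence[float]


def _sub(u: Vec, v: Vec) -> Tuple[float, ...]:
    """Component-wise difference u - v."""
    return tuple(a - b for a, b in zip(u, v))


def _dot(u: Vec, v: Vec) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def arch_pattern(ca_pre: Vec, cb_pre: Vec, ca_post: Vec, cb_post: Vec, d: Vec) -> str:
    """Two-arrow side chain pattern of a beta-arch (hypothetical helper).

    ca_pre/cb_pre:   C-alpha and C-beta of the strand residue preceding the loop.
    ca_post/cb_post: C-alpha and C-beta of the strand residue following the loop.
    d:               vector from the first to the second beta-strand
                     (how d is measured is left to the caller; an assumption).
    """
    pre = _sub(cb_pre, ca_pre)      # CaCb vector of the preceding residue
    post = _sub(cb_post, ca_post)   # CaCb vector of the following residue
    minus_d = tuple(-x for x in d)  # reference for the following residue

    # Concave (down arrow) when the CaCb vector points along the reference.
    first = "↓" if _dot(pre, d) > 0 else "↑"
    second = "↓" if _dot(post, minus_d) > 0 else "↑"
    return first + second


# Two strands separated along x; both side chains point toward the other strand.
print(arch_pattern((0, 0, 0), (1.5, 0, 0), (10, 0, 0), (8.5, 0, 0), (1, 0, 0)))
```

In this toy geometry both CαCβ vectors point at the opposite strand, so both residues are classed as concave.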

β-hairpins)5, there was no correlation between β-arch loop length and side chain pattern. Instead, each loop ABEGO type, because of the way in which it twists and bends the polypeptide chain16, is associated with a specific flanking residue side chain pattern (Fig. 1b). The most frequently observed turn types (between 1 and 5 amino acids) for each side chain pattern are listed in Fig. 1c; for example, ABB, BBGB, BABB and BGB are the most frequent loop types for the patterns ↓↓, ↓↑, ↑↓ and ↑↑, respectively.

The next level of non-local interaction complexity in all-β folds involves strand pairing (parallel or antiparallel) between two β-arches forming a β-arcade (Fig. 1d), a common structural motif in naturally occurring β-solenoids15,17. Because the β-arch loops are stacked in register, the side chains adjacent to one β-arch loop are likely to have the same orientation as the side chains adjacent to the second β-arch loop; analysis of naturally occurring β-arcades confirms that the side chain patterns of the two β-arch loops are indeed correlated (Fig. 1d, middle).

Jellyroll design principles. The double-stranded β-helix can be regarded as a long β-hairpin wrapped around an axis perpendicular to the direction of β-strands, with β-helical turns formed by the pairing between β-arcades (Fig. 2a). In the compact folded structure, two antiparallel β-sheets pack against each other in a sandwich-like arrangement, with the first strand paired to the last, and all β-strands are connected through β-arch loops except for the central β-hairpin. We aimed to design β-helices with three β-arcades forming two antiparallel four-stranded β-sheets, with the eight β-strands connected through six β-arches and one β-hairpin. The non-local character of the structure grows from the first β-arcade, which starts from the central β-hairpin, to the last one, where the N and C termini are paired.

The analysis from Fig. 1 leads to strong constraints on the construction of β-sheet backbone structures, as the side chain directionality patterns of the β-strands and loops are coupled in several ways. First, the directionality patterns of the loops preceding and following


[Figure 2 graphic: panel a, double-stranded β-helix with β-sheets 1 and 2 and β-arcades 1, 2 and 3; panel b, β-sheets in register (strands S1 to S8, loops L1 to L7, central β-hairpin); panel c, β-sheets out of register, with the β-arcade shift defined by strand pairs S3–S8 and S4–S7; panels d and e, design models.]

Fig. 2 | Double-stranded β-helix topology specification. a, The double-stranded β-helix fold consists of two four-stranded antiparallel β-sheets (in blue and green) with six β-arch and one β-hairpin connection. Pairs of β-arches forming the three β-arcades are highlighted (right). β-arch loops belonging to the same β-arcade are displayed with the same color throughout the figure (β-arcades 1, 2 and 3 in red, orange and magenta, respectively). b, Topology diagram of a designed double-stranded β-helix with all β-strand pairs in register. The Cα traces of the first and second β-sheets are colored in blue and green, respectively. Side chain Cβ positions oriented toward the inner and outer faces of the β-helix are represented with up and down black arrows with rounded tips, respectively. β-arch loops are colored as in a. c, Definition of β-arcade register shift varied during conformational sampling. The β-arcade register shift (between β-arcades 1 and 3) is determined by the register of β-strand pairs S3/S8 and S4/S7, and the lengths of β-strands S3, S4, S8 and S7 (see Methods). In this example, β-strand pairs S3/S8 and S4/S7 each have a two-residue register shift, resulting in an overall β-arcade register shift of four residues. Loops are omitted to facilitate visualization. d, Example of a design model with all β-strand pairs in register forming a sandwich-like structure. e, Example of a design model with register shifts between β-arcades 1 and 3 (magenta and red) forming a barrel-like structure.

each β-strand are coupled to the length of the strand (Fig. 2b): for example, a β-strand with an even number of residues that is preceded by a ↑↑ loop must be followed by a ↓↑ or a ↓↓ loop, but not a ↑↑ or ↑↓ loop, owing to the alternating pleating of β-strands. Second, because the β-arcades of the β-helix have paired β-strands and β-arch loops, the side chains adjacent to one β-arch loop must have the same orientation as the paired side chains adjacent to the second β-arch loop (Fig. 1d). Owing to the antiparallel orientation of the β-arcades, ↓↓ and ↑↑ loops are compatible with loops of the same type, but ↑↓ loops are only compatible with ↓↑ loops (Fig. 1d). Third, the twist and curvature of the two β-sheets of the β-helix are constrained by the hydrogen-bonding register between β-arcades 1 and 3 (herein called β-arcade register), and within β-strand pairs S3/S8 and S4/S7, as shown in Fig. 2c.

De novo design of protein structures. We constructed double-stranded β-helix protein backbones by Monte Carlo fragment assembly using blueprints (representations of the target protein topologies specifying the ordering, lengths and backbone torsion bins of secondary structure elements and loop connections5) in conjunction with backbone hydrogen-bonding constraints specifying all pairings between β-strands. We explored strand lengths between 5 and 7 residues and the most commonly observed β-arch loops between 3 and 5 residues (Fig. 1c). The central β-hairpin was designed with two-residue loops following the ββ rule5. The register shifts between pairs of β-strands from different β-arcades (1 and 3) were allowed to range from 0 to 2 and the β-arcade register shifts between 0 and 4; strand pairs within the same β-arcade were kept in register. A total of 3,673 combinations were enumerated, of which 1,853 had mutually compatible strand lengths and loop types consistent with the constraints summarized in the previous paragraph. For each of these internally consistent blueprints, we used Rosetta to build thousands of protein backbones. The resulting ensemble of backbone structures has considerable structural diversity; those

with all strands in register had narrow, sandwich-like structures (Fig. 2d), while those with large register shifts had wider, barrel-like structures (Fig. 2e).

For each generated backbone, we carried out flexible-sequence design calculations18,19 to identify low-energy amino acid identities and side chain conformations providing close complementary packing, side chain–backbone hydrogen bonding in β-arch loops (to pre-organize their conformation and facilitate folding), and high sequence–structure compatibility. We favored inward-pointing charged or polar amino acids at the four edge strands to minimize aggregation propensity20. Loop sequences were designed with consensus profiles obtained from fragments with the same backbone ABEGO torsion bins21. Because the very large size of the space sampled by our design procedure limits convergence on optimal sequence–structure pairs, we carried out a second round of calculations starting from the blueprints yielding the lowest-energy designs, intensifying sampling at both the backbone and sequence level. For a subset of designs, we introduced disulfide bonds between paired β-strand positions with high sequence separation (for example, between the first and last β-strands) and optimal orientation (see Methods): disulfide bonds distant in primary sequence decrease the entropy of the unfolded state and therefore enhance the thermodynamic stability of the native state. To assess compatibility of the top-ranked designed sequences with their structures, we characterized their folding energy landscape with biased forward folding simulations21, and those with substantial near-native sampling were subsequently assessed by Rosetta ab initio structure prediction calculations22,23. Designs with funnel-shaped energy landscapes (where the designed structure is at the global energy minimum and has a substantial energy gap with respect to alternative conformations) were selected for experimental characterization. Ab initio structure prediction of natural β-sheet proteins tends to oversample local contacts24,25 (that is, favors β-hairpins over β-arches), but we succeeded in designing sequences with the β-arches sufficiently strongly encoded that they folded in silico to near the designed target structure.

Experimental characterization. For experimental characterization, we chose 19 designs with funnel-shaped energy landscapes ranging between 70 and 94 amino acids (Supplementary Table 1). BLAST searches26,27 indicated that the designed sequences had little or no similarity with native proteins (lowest Expect (E) values ranging from 0.003 to >10; Supplementary Table 2). Synthetic genes encoding the designs (design names are BH_n, where BH stands for β-helix, n stands for the design number and the _ss suffix is used if disulfide bonds are present) were obtained; the proteins were expressed in Escherichia coli and purified by affinity chromatography. Sixteen of the designs expressed well and were soluble, and two (BH_10 and BH_11) were monomeric (Supplementary Fig. 2) by size-exclusion chromatography coupled with multi-angle light scattering (SEC-MALS) (most of the non-monomeric designs were either dimers or soluble aggregates). Both monomeric designs had far-ultraviolet circular dichroism (CD) spectra at 25 °C characteristic of β proteins, a melting temperature (Tm) above 95 °C, and well-ordered structures according to 2D 1H-15N heteronuclear single-quantum coherence (HSQC) spectra (Fig. 3a–c and Supplementary Fig. 3). For both designs, the number of NMR peaks matched the number of expected amide resonances based on the protein sequence, but the higher stability of BH_10 under the conditions of the NMR experiments made it a better candidate for NMR structure determination.

The two monomeric designs with well-ordered structures were among those with better-packed cores and a larger proportion of β-arch loops containing proline and with backbone polar atoms making hydrogen bonds (Supplementary Table 3). β-arch loops that are structurally preorganized with the polar groups making internal hydrogen bonding likely favor folding to the correct topology and contribute to stability by compensating for the loss of interactions with water of polar groups in the side chains and backbone. These interactions likely disfavor the competing local strand-pairing arrangement in which the two strands form a β-hairpin; this is a very common pathology in ab initio structure prediction25. For the most stable dimeric design (BH_6), we introduced disulfide bonds to stabilize protein regions having contacts with large sequence separation (for example, between the N- and C-terminal strands), but this did not succeed in yielding stable monomers. Addition of an α-helix to the C terminus (one of the two extremes of the β-helix), as a capping domain protecting the strand edges from intermolecular pairing, also failed to yield stable monomers, even in combination with disulfide bonds. This suggests that the sequence of the core β-sheet must strongly encode its structure independently of disulfide bonds or protecting domains aimed at increasing stability.

NMR structure of a de novo–designed β-helix. We succeeded in solving the structure of BH_10 by 4D NMR spectroscopy (Fig. 3d, Table 1 and Supplementary Fig. 4), using the 4D-CHAINS/AutoNOE-Rosetta automated pipeline for resonance assignments and structure calculation28, and found it to be in very close agreement with the computational model (Cα r.m.s. deviation (r.m.s.d.) of 0.84 Å, averaged over 10 NMR models). The overall topology is accurately recapitulated, including all strand pairings, register shifts and loop connections, as supported by 132 long-range nuclear Overhauser effects (NOEs) between backbone amide and side chain protons (Supplementary Fig. 5). The designed aliphatic and aromatic side chain packing in the protein core, as well as salt bridge interactions across the two β-sheet surfaces, was also accurately reproduced; three salt bridges between the two paired β-arcades and one within the third β-arcade are well supported by the observed NOEs (Supplementary Fig. 6). The agreement in both the backbone conformation and hydrogen-bonding interactions of the loops forming the three β-arcades is remarkable given that these elements are the most flexible parts of the structure and therefore difficult to design owing to sampling bottlenecks. The β-arcades were designed with pairs of β-arch loops that interact via backbone-backbone hydrogen bonds (owing to the complementarity between their backbone conformations) stabilizing loop pairing and avoiding burial of polar backbone atoms (see Supplementary Fig. 7 for the BH_10 loop sequences and side chain patterns). For example, β-arcade 1 is formed by BBG and ABB loops, and the buried backbone NH group of the G position in the former makes a hydrogen bond with the buried backbone C=O of the neighboring loop (Fig. 3e). The other two β-arcades were designed with one β-arch loop containing buried and fully hydrogen-bonded asparagines (four hydrogen bonds in total) that stabilize both loop pairing and the local β-arch conformation (of ABABB loops). By design, the asparagine side chain geometry was further stabilized with hydrophobic stacking interactions from the two β-arch loops of the same arcade. The high degree of convergence of the designed rotamer in the NMR ensemble illustrates the high structural preorganization of this particular motif (Fig. 3f).

The amino acid sequence of BH_10 is unrelated to any sequence in the NCBI nr database (BLAST found one hit with insignificant sequence similarity; E value 6.3). We searched the Protein Data Bank (PDB) for similarities in structure (using the Dali server29 with the lowest-energy NMR model as the query structure) or sequence (with HHpred30 for sensitive profile-based sequence search), and identified matches similar in fold but containing additional and irregular secondary structures and longer loops. These matches are all homodimers with sheet-to-sheet interface packing (Supplementary Fig. 8) or domains integrated in larger structures, in sharp contrast to the BH_10 monomer.
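The first two coupling rules stated under "Jellyroll design principles" (strand length versus flanking loop patterns, and loop pairing within an antiparallel β-arcade) can be expressed as small consistency checks of the kind a blueprint filter would need. This is a hedged sketch rather than the published pipeline: the function names are hypothetical, and it assumes strictly alternating pleating along each strand (no β-bulges) with the loop arrows read as the orientations of the strand residues immediately flanking the loop.

```python
from typing import List

FLIP = {"↑": "↓", "↓": "↑"}


def allowed_following_loops(preceding: str, strand_length: int) -> List[str]:
    """Loop patterns that may follow a strand (hypothetical helper).

    preceding:     two-arrow pattern of the loop before the strand; its second
                   arrow is taken as the orientation of the first strand residue.
    strand_length: number of residues in the strand.
    """
    first_residue = preceding[1]
    # Pleating alternates, so the last residue matches the first only
    # when the strand has an odd number of residues.
    last_residue = first_residue if strand_length % 2 == 1 else FLIP[first_residue]
    # The following loop must start with the last residue's orientation;
    # its second arrow is unconstrained by this rule.
    return [last_residue + second for second in "↑↓"]


def arcade_compatible(loop_a: str, loop_b: str) -> bool:
    """True if two beta-arch loops can stack in an antiparallel beta-arcade.

    Paired flanking side chains must share orientation, so the second
    loop has to read as the first loop reversed.
    """
    return loop_b == loop_a[::-1]


print(allowed_following_loops("↑↑", 6))
print(arcade_compatible("↑↓", "↓↑"), arcade_compatible("↑↓", "↑↓"))
```

For an even-length strand preceded by a ↑↑ loop the check returns the ↓↑ and ↓↓ patterns, matching the example in the text; the register-shift constraint (the third rule) is not modelled here.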


[Figure 3 graphic: panel a, energy (kcal mol–1) versus r.m.s.d. (Å); panel b, ellipticity (mdeg) versus wavelength (nm); panel c, 15N (p.p.m.) versus 1H (p.p.m.); panel d, design model and NMR ensemble with loop ABEGO types (r.m.s.d. 0.8 Å); panels e and f, β-arcade 1 and β-arcade 3 (loops L1, L3, L5, L7; residue N65); panel g, contact order versus protein size, with BH_10 labelled; panel h, contact map (residue number versus residue number) marking β-hairpin, between-β-arcade and within-β-arcade contacts.]

Fig. 3 | The NMR structure of BH_10 is nearly identical to the design model. a, Calculated BH_10 folding energy landscape. Each dot represents the lowest-energy structure obtained from ab initio folding trajectories starting from an extended chain (red dots), biased forward folding trajectories (blue dots) or local relaxation of the designed structure (green dots); the x axis is the Cα r.m.s.d. from the designed model and the y axis is the Rosetta all-atom energy. b, Far-ultraviolet CD spectra (blue, 25 °C; red, 95 °C; green, 25 °C after cooling). c, 1H-15N HSQC spectra obtained at 37 °C at a 1H field of 800 MHz. d, NMR structure in comparison with the design model. Comparison of core side chain rotamers (NMR structure in gray and design in rainbow) (inset). Topology diagram showing the ABEGO torsion bins of all loop connections (left). Atomic coordinates for the design model are in Supplementary Dataset 1. e, Backbone hydrogen bonding of β-arcade 1 is well preserved across the NMR ensemble. f, Side chain interactions of N65 with backbone and side chains form a hydrogen-bonded network in β-arcade 3 that is well recapitulated in the NMR ensemble. g, Contact order of de novo–designed protein domains confirmed by high-resolution structure determination; all-α (blue), αβ (green) and all-β (red). The BH_10 design stands out with a contact order of 35.8 for a chain length of 78 residues. The domains are listed in Supplementary Tables 4 and 5. h, Contact map illustrating the large sequence separation of the contacts present in the BH_10 topology.
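The contact order plotted in panel g is the average sequence separation of residue pairs that are in contact in the 3D structure. A minimal sketch of that calculation follows. The contact definition is an assumption (one representative coordinate per residue and an 8 Å distance cutoff, neither of which is stated in this section), as are the function name and parameters, so values from this sketch are not expected to reproduce the published 35.8.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def contact_order(coords: Sequence[Point], cutoff: float = 8.0, min_sep: int = 1) -> float:
    """Average sequence separation |i - j| over residue pairs in contact.

    coords:  one representative coordinate per residue, in sequence order
             (for example C-alpha or C-beta; which atom to use is an assumption).
    cutoff:  distance in angstroms below which two residues count as in contact
             (assumed value, not taken from the article).
    min_sep: smallest sequence separation considered.
    """
    total_separation = 0
    n_contacts = 0
    for i in range(len(coords)):
        for j in range(i + min_sep, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                n_contacts += 1
    # With no contacts the average is undefined; return 0.0 by convention here.
    return total_separation / n_contacts if n_contacts else 0.0


# An extended chain only has sequence-local contacts ...
extended = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
# ... whereas a chain that folds back brings distant residues together.
folded = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 0.0)]
print(contact_order(extended), contact_order(folded))
```

The folded toy chain scores higher because its first and last residues are in contact, which is the same effect that pairing of the N- and C-terminal strands has on BH_10.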

Contact order and sequence determinants of the BH_10 fold. The non-local character of BH_10 is of particular note: a large fraction of the contacting residues are distant along the linear sequence, with extensive strand pairing between the N- and C-terminal β-strands. The contact order of the structure (that is, the average separation along the linear sequence of residues in contact in the 3D structure) is higher than that for any previous single-domain protein designed de novo (Fig. 3g,h). Proteins with high contact order fold more slowly than those with low contact order as there is a greater loss in chain entropy for forming the first native interactions, and they tend to form long-lived non-native structures that can oligomerize or aggregate31. We have overcome the challenges in designing non-local structures by focusing on backbones lacking internal strain and having maximal internal coherence, and programming β-strand orientation with highly structured loops.

One of the challenges in achieving high contact order through β-arches is to disfavor formation of more sequence-local β-hairpins. To evaluate in silico how each of our design features contributes to favoring β-arches over β-hairpins, we generated folding energy landscapes for a series of mutants of BH_10 in which loop hydrogen bonding, side chain packing of loop neighbors and loop local geometry were disrupted one at a time. For all conformations generated, we classified all the β-strand connections as β-arch or β-hairpin depending on strand-pairing formation, and calculated the overall frequency of β-hairpin formation for each pair of consecutive β-strands. As shown in Supplementary Fig. 9, disruption of packing within or between β-arch loops, removal of side chain–backbone hydrogen-bonding interactions and reducing loop geometry encoding by eliminating prolines all increase sampling of competing β-hairpin conformations, and thus substantially decrease sampling of β-arches and the target designed structure.

Table 1 | NMR and refinement statistics for BH_10

                                              BH_10 (PDB 6E5C)
NMR distance and dihedral constraints
  Distance constraints
    Total NOE                                 659
    Intraresidue                              272
    Inter-residue                             387
      Sequential (|i – j| = 1)                222
      Medium range (2 ≤ |i – j| ≤ 4)          33
      Long range (|i – j| ≥ 5)                132
    Intermolecular                            0
    Hydrogen bonds                            0
  Total dihedral angle restraints             156
    φ                                         78
    ψ                                         78
Structure statistics
  Violations (mean ± s.d.)
    Distance constraints (Å)a                 0.30 ± 0.46
    Dihedral angle constraints (°)b           9.30 ± 2.49
    Max. dihedral angle violation (°)b        47.59
    Max. distance constraint violation (Å)a   1.32
  Deviations from idealized geometry
    Bond lengths (Å)                          0.00 ± 0.00
    Bond angles (°)                           0.00 ± 0.00
    Impropers (°)                             0.00 ± 0.00
  Average pairwise r.m.s.d. (Å)c
    Heavy                                     0.61 ± 0.13
    Backbone                                  0.51 ± 0.11

aDistance constraint violations in the structural ensemble were calculated using a 7-Å universal upper distance bound for the NOE restraints assigned by AutoNOE-Rosetta. bDihedral angle restraints were derived from TALOS-N. The violations were calculated for the core secondary structural regions of the ten lowest-energy models using a 15° cutoff beyond TALOS-N-predicted dihedral angles. cPairwise r.m.s.d. was calculated among ten refined models for a core secondary structural region defined by residues 2–8, 11–18, 21–28, 32–36, 39–43, 46–53, 59–65 and 71–75.

Discussion

The design of all-β globular proteins from first principles has remained elusive for two decades of protein design research. We have successfully designed a double-stranded β-helix de novo, as confirmed by the NMR structure of the design BH_10, based on a series of rules describing the geometry of β-arch loops and their interactions in more complex β-arcades. Our work also achieves two related milestones: the first accurate design of an all-β globular protein with exposed β-sheet edges, and the most non-local structure yet designed from scratch. Comparison of successful and failed designs suggests that folding and stabilization of the monomeric structure (and implicitly disfavoring competing topologies with more local strand pairings) is bolstered by loops containing side chain–backbone and backbone-backbone hydrogen bonds together with well-packed mixed aliphatic/aromatic side chains in the protein core, inward-pointing polar amino acids at strand edges and salt bridges between paired strands. Previous design studies on β-propellers11 or parallel β-helices12 have used naturally occurring backbone structures and consensus sequence information on the target fold families; this approach, while powerful, sheds less light on the key principles underlying β-sheet structure construction and does not allow the programming of new backbone geometries. The β-helix fold designed here is well suited for incorporating metal, ligand-binding and active sites, as illustrated by the broad functional diversity of cupin protein domains, which are the closest naturally occurring structural analogs. With the basic design principles now understood, our de novo design strategy should enable the construction of a wide range of β-helix structures tailored to a broad range of target ligands.

Initial advances in protein design were from algorithms that allowed rapid identification of a very-low-energy sequence for a given backbone structure. In recent years, progress has come from the realization that the requirements of burying hydrophobic residues in a core away from solvent while avoiding the burial of backbone polar groups without compensating hydrogen bonds, together with torsional restrictions on the peptide backbone, considerably constrain overall globular protein backbone geometry, particularly for β-sheet-containing proteins: it is much harder than originally expected to construct new backbones that have these properties. The de novo design of β-sheet-containing proteins advanced considerably following the elucidation of β-sheet design principles for construction of backbones meeting the above constraints while having desired geometries: for example, principles for controlling the chirality of β-hairpins5, reducing strain in β-strands with kinks10, and combining β-bulges and register shifts to curve β-sheets21. The design rules described here are a considerable further advance as they provide control over β-arch connections between distinct β-sheets, and should enable the design of a broad range of β-protein families beyond the β-barrel and β-helix, with considerable medical and biotechnological potential; for example, the immunoglobulin fold widely utilized for binding and loop scaffolding in nature is topologically very similar to the double-stranded β-helices designed here, with a larger proportion of β-hairpins over β-arches.

Online content

Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41594-018-0141-6.

Received: 16 July 2018; Accepted: 11 September 2018; Published: xx xx xxxx

References

1. Kortemme, T., Ramírez-Alvarado, M. & Serrano, L. Design of a 20-amino acid, three-stranded β-sheet protein. Science 281, 253–256 (1998).
2. Searle, M. S. & Ciani, B. Design of β-sheet systems for understanding the thermodynamics and kinetics of protein folding. Curr. Opin. Struct. Biol. 14, 458–464 (2004).


3. Hughes, R. M. & Waters, M. L. Model systems for β-hairpins and β-sheets. Curr. Opin. Struct. Biol. 16, 514–524 (2006).
4. Marcos, E. & Adriano-Silva, D. Essentials of de novo protein design: methods and applications. WIREs Comput. Mol. Sci. 8, e1374 (2018).
5. Koga, N. et al. Principles for designing ideal protein structures. Nature 491, 222–227 (2012).
6. Hecht, M. H. De novo design of β-sheet proteins. Proc. Natl Acad. Sci. USA 91, 8729–8730 (1994).
7. Plaxco, K. W., Simons, K. T. & Baker, D. Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277, 985–994 (1998).
8. Quinn, T. P., Tweedy, N. B., Williams, R. W., Richardson, J. S. & Richardson, D. C. Betadoublet: de novo design, synthesis, and characterization of a β-sandwich protein. Proc. Natl Acad. Sci. USA 91, 8747–8751 (1994).
9. Nanda, V. et al. De novo design of a redox-active minimal rubredoxin mimic. J. Am. Chem. Soc. 127, 5804–5805 (2005).
10. Dou, J. et al. De novo design of a fluorescence-activating β-barrel. Nature 561, 485–491 (2018).
11. Voet, A. R. D. et al. Computational design of a self-assembling symmetrical β-propeller protein. Proc. Natl Acad. Sci. USA 111, 15102–15107 (2014).
12. MacDonald, J. T. Synthetic β-solenoid proteins with the fragment-free computational design of a β-hairpin extension. Proc. Natl Acad. Sci. USA 113, 10346–10351 (2016).
13. Ottesen, J. J. & Imperiali, B. Design of a discretely folded mini-protein motif with predominantly β-structure. Nat. Struct. Biol. 8, 535–539 (2001).
14. Hu, X., Wang, H., Ke, H. & Kuhlman, B. Computer-based redesign of β sandwich protein suggests that extensive negative design is not required for de novo β sheet design. Structure 16, 1799–1805 (2008).
15. Hennetin, J., Jullian, B., Steven, A. C. & Kajava, A. V. Standard conformations of beta-arches in β-solenoid proteins. J. Mol. Biol. 358, 1094–1105 (2006).
16. Lin, Y.-R. et al. Control over overall shape and size in de novo designed proteins. Proc. Natl Acad. Sci. USA 112, E5478–E5485 (2015).
17. Kajava, A. V., Baxa, U. & Steven, A. C. β arcades: recurring motifs in naturally occurring and disease-related amyloid fibrils. FASEB J. 24, 1311–1319 (2010).
18. Kuhlman, B. & Baker, D. Native protein sequences are close to optimal for their structures. Proc. Natl Acad. Sci. USA 97, 10383–10388 (2000).
19. Kuhlman, B. et al. Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368 (2003).
20. Richardson, J. S. & Richardson, D. C. Natural beta-sheet proteins use negative design to avoid edge-to-edge aggregation. Proc. Natl Acad. Sci. USA 99, 2754–2759 (2002).
21. Marcos, E. et al. Principles for designing proteins with cavities formed by curved β sheets. Science 355, 201–206 (2017).
22. Rohl, C. A., Strauss, C. E. M., Misura, K. M. S. & Baker, D. Protein structure prediction using Rosetta. Methods Enzymol. 383, 66–93 (2004).
23. Bradley, P. Toward high-resolution de novo structure prediction for small proteins. Science 309, 1868–1871 (2005).
24. Kuhn, M., Meiler, J. & Baker, D. Strand-loop-strand motifs: prediction of hairpins and diverging turns in proteins. Proteins 54, 282–288 (2004).
25. Bradley, P. & Baker, D. Improved β-protein structure prediction by multilevel optimization of nonlocal strand pairings and local backbone conformation. Proteins 65, 922–929 (2006).
26. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
27. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
28. Evangelidis, T. et al. Automated NMR resonance assignments and structure determination using a minimal set of 4D spectra. Nat. Commun. 9, 384 (2018).
29. Holm, L. & Laakso, L. M. Dali server update. Nucleic Acids Res. 44, W351–W355 (2016).
30. Zimmermann, L. et al. A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. J. Mol. Biol. 430, 2237–2243 (2018).
31. Clark, P. Protein folding in the cell: reshaping the folding funnel. Trends Biochem. Sci. 29, 527–534 (2004).

Acknowledgements
We thank S. Rettie for mass spectrometry assistance and the rest of the Baker laboratory members for discussion. We acknowledge computing resources provided by the Hyak supercomputer system funded by the STF at the University of Washington, and Rosetta@Home volunteers in ab initio structure prediction calculations. Work carried out at the Baker laboratory was supported by the Howard Hughes Medical Institute, Open Philanthropy, and the Defense Threat Reduction Agency. E.M. was supported by a Marie Curie International Outgoing Fellowship (FP7-PEOPLE-2011-IOF 298976). G.O. was supported by a Marie Curie International Outgoing Fellowship (FP7-PEOPLE-2012-IOF 332094). This research was financially supported by the Ministry of Education, Youths and Sports of the Czech Republic within the CEITEC 2020 (LQ1601) project, the Grant Agency of Masaryk University (to K.T.), an R35 Outstanding Investigator Award to N.G.S. through NIGMS (1R35GM125034-01), and the Office of the Director, NIH, under High End Instrumentation Grant S10OD018455, which funded the 800-MHz NMR spectrometer at UCSC. IRB Barcelona is the recipient of a Severo Ochoa Award of Excellence from the Ministry of Economy, Industry and Competitiveness (government of Spain).

Author contributions
E.M. designed the research, carried out the loop structural analysis, set up the design method and performed design calculations. T.M.C. carried out design calculations, protein expression, purification and CD experiments. A.C.M. collected 4D NMR data. T.E. performed 4D-CHAINS analysis. S.N. carried out AutoNOE-Rosetta calculations. L.C. expressed isotopically labeled proteins and performed SEC-MALS analysis. L.G.N. designed the research and carried out design calculations. A.D. and G.O. helped in protein expression and characterization. K.T. and N.G.S. supervised NMR structure determination. D.B. designed and supervised the research. E.M. and D.B. prepared the manuscript with input from all authors.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41594-018-0141-6.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to E.M. or D.B.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2018


Methods

Loop analysis. Loop connections between β-strands were collected from a non-redundant database of PDB structures obtained from the PISCES server32 with sequence identity <30% and resolution ≤2 Å. We discarded those loops connecting β-strands with hydrogen-bonded pairing (β-hairpins), and the remaining 5,061 β-arch loops were subsequently analyzed. The ABEGO torsion bins of each residue position were assigned based on the definition shown in Supplementary Fig. 1, and the side chain directionality pattern of neighboring residues was defined according to Fig. 1a. The secondary structure of all residue positions was assigned with DSSP33, and the last β-strand residue preceding and the first β-strand residue following the β-arch loop were chosen as the critical neighboring residues determining the side chain pattern of the loop. The loop bending was defined as the angle between the loop center of mass and the two strand positions adjacent to the loop. Loops with bending angles larger than 120° were discarded from the analysis, so that only loops producing a substantial change in the direction of the two connected β-strands were retained. The loop dataset is available in Supplementary Dataset 2.

Backbone generation. We used the Blueprint Builder mover5 of RosettaScripts34 to build protein backbones by Monte Carlo fragment assembly using nine- and three-residue fragments compatible with the target secondary structure and torsion bins (ABEGO), as specified in the blueprints of every target topology. We used a polyvaline centroid representation of the protein and a scoring function accounting for backbone hydrogen bonding, van der Waals interactions (namely to avoid steric clashes), planarity of the peptide bond (omega score term), and compactness of structures (radius of gyration). Thousands of independent folding trajectories were performed and subsequently filtered. Owing to the non-local character of β-sheet contacts, we used distance and angle constraints to favor the correct hydrogen-bonded pairing between β-strand main chain atoms. For every target topology, we automatically set all pairs of residues involved in β-strand pairing to generate all constraints for backbone building. Protein backbones were filtered based on their match with the blueprint specifications (secondary structure, torsion bins and strand pairing), and subsequently ranked based on backbone hydrogen-bonding energy (lr_hb score term) and the total energy obtained from one round of all-atom flexible-sequence design (see next section).

Flexible-sequence design. Generated protein backbones were subjected to flexible-sequence design calculations with RosettaDesign18,19 using the Rosetta all-atom energy function Talaris2014 (ref. 35) to favor amino acid identities and side-chain conformations with low energy and tight packing. We performed cycles of fixed backbone design followed by backbone relaxation using the FastDesign mover36 of RosettaScripts34. Designed sequences were filtered based on total energy, side-chain packing (measured with RosettaHoles37, packstat and core side-chain average degree21), side chain–backbone hydrogen bond energy, and secondary structure prediction (match between the designed secondary structure and that predicted by PSIPRED38 based on the designed sequence). Amino acid identities were restricted based on the solvent accessibility of protein positions, ensuring that hydrophobic amino acids were located in the core and polar ones on the surface. Further restrictions were imposed to improve sequence–structure compatibility in loop regions. Sequence profiles were obtained for naturally occurring loops with the same ABEGO string sequence, as previously21.

For those blueprints that yielded the lowest energy designs we performed a second round with ten times more backbone samples. Backbones generated in this second round were subjected to more exhaustive sequence design by running multiple Generic Monte Carlo trajectories optimizing total energy and side chain average degree simultaneously, and then applying all filters described above.

Design of disulfide bonds and helix capping domain. We used the Disulfidize mover of RosettaScripts34 to identify pairs of residue positions able to form disulfide bonds with a good scoring geometry. We searched for disulfide bonds between residues distant in primary sequence and with a disulfide score < −1.0. We designed a C-terminal helix capping domain (followed with a β-strand pairing with the first β-strand) using the backbone-generation protocol described above but starting from design BH_6. The structure of BH_6 was kept fixed during fragment assembly and the C-terminal domain was generated. Then, sequence design was performed for the C-terminal domain and those neighboring residues within 10 Å.

Sequence–structure compatibility. For assessing the local compatibility between designed sequences and structures we picked 200 naturally occurring fragments (9- and 3-mers) with sequences similar to the design, and evaluated the structural similarity (by r.m.s.d.) between the ensemble of picked fragments and the local designed structure. Designs with overall low-r.m.s.d. fragments, and therefore with high fragment quality, were subsequently assessed by Rosetta folding simulations using the Rosetta energy function 'ref2015'39. First, biased forward folding simulations21 (using the three lowest r.m.s.d. fragments and 40 folding trajectories) were used to quickly identify those designs more likely to have funnel-shaped energy landscapes. Those designs achieving near-native sampling (r.m.s.d. to target structure below 1.5 Å) were then assessed by standard Rosetta ab initio structure prediction22,23.

To evaluate the amount of β-hairpin sampling in each loop connection during ab initio structure prediction, we first detected all strand pairings formed in each generated decoy and then mapped the residues involved in those strand pairings to the secondary structure elements of the designed structure. After secondary structure mapping, pairings between strands consecutive in the sequence were counted as β-hairpins. The total count of β-hairpins sampled in each loop over the total number of generated decoys is a relative measure of hairpin sampling that allowed us to compare the β-hairpin propensity of different loops and mutants (see Supplementary Fig. 9).

Contact order. To evaluate the non-local character of protein structures we computed 'contact order' as the average sequence separation between pairs of Cα atoms within a distance of 8 Å and with a sequence separation of at least three residues.

Protein expression and purification. Genes encoding the designed sequences were obtained from Genscript and cloned into the pET-28b+ (with N-terminal 6×His tag and a thrombin cleavage site) expression vectors. Plasmids were transformed into Escherichia coli BL21 Star (DE3) competent cells, and starter cultures were grown at 37 °C in LB medium overnight with kanamycin. Overnight cultures were used to inoculate 500 ml LB medium supplemented with antibiotic, and cells were grown at 37 °C and 225 r.p.m. until an optical density (OD600) of 0.5–0.7 was reached. Protein expression was induced with 1 mM IPTG at 18 °C and, after overnight expression, cells were collected by centrifugation (at 4 °C and 4,400 r.p.m. for 10 min) and resuspended in 25 ml of lysis buffer (20 mM imidazole and PBS). Resuspended cells were lysed in the presence of lysozyme, DNase and protease inhibitors. Lysates were centrifuged at 4 °C and 18,000 r.p.m. for 30 min, and the supernatant was loaded onto a nickel affinity gravity column pre-equilibrated in lysis buffer. The column was washed with three column volumes of PBS + 30 mM imidazole and the purified protein was eluted with three column volumes of PBS + 250 mM imidazole. The eluted protein solution was dialyzed against PBS buffer overnight. The expression of purified proteins was assessed by SDS–PAGE and mass spectrometry, and protein concentrations were determined from the absorbance at 280 nm measured on a NanoDrop spectrophotometer (Thermo Scientific) with extinction coefficients predicted from the amino acid sequences using the ProtParam tool (https://web.expasy.org/protparam/). Proteins were further purified by fast protein liquid chromatography size-exclusion chromatography using a Superdex 75 10/300 GL (GE Healthcare) column.

Circular dichroism. Far-ultraviolet CD measurements were carried out with the AVIV 420 spectrometer. Wavelength scans were measured from 260 to 195 nm at temperatures between 25 and 95 °C, using a 1-mm path-length cuvette. Protein samples were prepared in PBS buffer (pH 7.4) at a concentration of 0.2–0.4 mg ml⁻¹.

Size-exclusion chromatography combined with multiple-angle light scattering. SEC-MALS experiments were performed using a Superdex 75 10/300 GL (GE Healthcare) column combined with a miniDAWN TREOS multi-angle static light scattering detector and an Optilab T-rEX refractometer (Wyatt Technology). One-hundred-microliter protein samples of 1–3 mg ml⁻¹ were injected onto the column equilibrated with PBS (pH 7.4) or TBS (pH 8.0) buffer at a flow rate of 0.5 ml min⁻¹. The collected data were analyzed with ASTRA software (Wyatt Technology) to estimate the molecular weight of the eluted species.

Protein expression of isotopically labeled proteins for NMR structure determination. Plasmids were transformed using standard heat-shock transformation into the Lemo21 expression strain of E. coli (NEB) and plated onto a minimal M9 medium containing glucose and kanamycin to maintain tight control over expression. A single colony was selected, inoculated into 50 ml of LB containing 50 µg ml⁻¹ of kanamycin and grown at 37 °C with shaking overnight. After approximately 18 h, the 50-ml starter culture was removed and 25 ml was used to inoculate 500 ml of Terrific Broth (TB) containing 50 µg ml⁻¹ kanamycin and mixed mineral salts40. The TB culture was grown at 37 °C with shaking at 250 r.p.m. until the OD600 reached a value of 1.0. Then, the culture was removed and the cells were pelleted by centrifugation at 4,000 r.p.m. for 15 min. The TB broth was removed and the pelleted cells were resuspended gently with 50 ml of 20 mM NaPO4, 150 mM NaCl, pH 7.5. The resuspended cells were transferred into minimal labeling media, containing ¹⁵N-labeled ammonium chloride at 50 mM and ¹³C-labeled glucose to 0.25% (w/v), as well as trace metals, 25 mM Na2HPO4, 25 mM KH2PO4 and 5 mM Na2SO4. The culture was returned to 37 °C at 250 r.p.m. for 1 h in order to replace unlabeled nitrogen and carbon with labeled nitrogen and carbon. After 1 h, IPTG was added to 1 mM, the temperature was reduced to 25 °C and the culture was allowed to express overnight. The following morning, the culture was removed and the cells were pelleted by centrifugation at 4,000 r.p.m. for 15 min. The cells were resuspended with 40 ml of lysis buffer (20 mM Tris, 250 mM NaCl, 0.25% CHAPS, pH 8) and lysed with a Microfluidics M110P Microfluidizer at 18,000 p.s.i. The lysed cells were clarified using centrifugation at 24,000g for 30 min. The labeled protein in the soluble fraction was purified using immobilized metal affinity chromatography using standard methods (Qiagen Ni-NTA resin). The purified protein was then concentrated to 2 ml and purified by FPLC

size-exclusion chromatography using a Superdex 75 10/300 GL (GE Healthcare) column into 20 mM NaPO4, 150 mM NaCl, pH 7.5. The efficiency of labeling was confirmed using mass spectrometry.

Heteronuclear single-quantum coherence spectra. The designed BH_10 (0.81 mM) and BH_11 (0.64 mM) were exhaustively buffer exchanged into NMR buffer (50 mM NaCl, 20 mM sodium phosphate, pH 6.5, 0.01% (v/v) NaN3, 4 mM EDTA and 1 U Roche protease inhibitor cocktail) in 95% H2O/5% D2O. 2D ¹H-¹⁵N HSQC experiments were acquired at 37 °C with four scans, an acquisition time of 72 ms (¹⁵N) in the indirect dimension and a recycle delay of 2 s.

Chemical shift assignment. For chemical shift assignment of BH_10, a set of two non-uniformly sampled (NUS) 4D NMR experiments, a 4D HC(CC-TOCSY(CO))NH and a 4D ¹³C-¹⁵N edited HMQC-NOESY-HSQC, were acquired at 800 MHz at 37 °C as previously described28. For the 4D HC(CC-TOCSY(CO))NH experiment, spectrum widths were set to 12,500 (acquisition dimension) × 2,100 (¹⁵N) × 8,000 (¹³C) × 7,300 (¹H) Hz and acquisition times in the indirect dimensions to 60 ms (¹⁵N), 8 ms (¹³C) and 8 ms (¹H), using 16 scans and a recycle delay of 1 s. Spectra were acquired with 2,000 hypercomplex NUS points distributed over the indirectly detected dimensions (0.38% sparsity). For the 4D ¹³C-¹⁵N edited HMQC-NOESY-HSQC, spectrum widths were set to 12,500 (acquisition dimension) × 1,000 (¹⁵N) × 8,000 (¹³C) × 10,000 (¹H) Hz and acquisition times in the indirect dimensions to 38 ms (¹⁵N), 10 ms (¹³C) and 20 ms (¹H), using 8 scans, a recycle delay of 1 s and a NOESY mixing time of 120 ms. Spectra were acquired with 4,000 hypercomplex NUS points distributed over the indirectly detected dimensions (0.32% sparsity). 4D NUS spectra were processed in NMRPipe41 using SMILE reconstruction42 and analyzed using NMRFAM-SPARKY43. For every ¹H-¹⁵N HSQC peak, the corresponding planes in 4D-HCNH TOCSY and 4D-HCNH NOESY spectra were inspected and peaks were picked manually. The 4D peak lists were used as input for the 4D-CHAINS algorithm28 to obtain sequence-specific resonance assignments of backbone and side chain atoms automatically. The overall assignment completeness reached 92%. No aromatic resonances were assigned. 4D-CHAINS assignments together with the 4D-HCNH NOESY peak list were used in AutoNOE-Rosetta for structure determination.

NOE assignment and structure determination using AutoNOE-Rosetta. To determine the structural models of the BH_10 target protein, we used CS-Rosetta44, which provides the AutoNOE-Rosetta45 and RASREC-Rosetta46 protocols. AutoNOE-Rosetta is an iterative NOE assignment method that utilizes RASREC-Rosetta to model protein structures de novo. These methods make use of valuable information contained within NMR chemical shifts about secondary and tertiary structures, and dynamics of proteins, to model targets of interest accurately44,47. The primary aim of AutoNOE-Rosetta is to assign proton atoms to the unassigned NOESY cross-peaks by mapping their resonance frequencies to the assigned chemical shift frequencies. The resulting assignments can be utilized to create NOE-based distance restraints that aid the structure calculation process. The method begins by creating an initial mapping between the assigned chemical shift list and the unassigned NOESY cross-peak list. This mapping produces ambiguous assignments due to possible noise in the NOESY spectra48,49. These assignments undergo evaluation and filtering. The evaluation criteria rely on the symmetry of cross-peaks, chemical shift compatibility, intermediate structural model compatibility (in the subsequent stages of the protocol) and the participation of any NOE in a network of NOEs (network anchoring)50. The cross-peaks are eliminated if they lie along the diagonal in the NOESY spectra or their contribution to some of the evaluation criteria (such as network anchor score) is insignificant. The intensities of high-scoring NOE peaks are calibrated to produce distance restraints. These restraints are used to calculate structures within the highly parallel RASREC-Rosetta, which performs fragment assembly44 using Monte Carlo methods and additional optimized algorithms46. This process of assigning NOEs and calculating structures is carried out iteratively across eight distinct stages. The final stage retains well-converged structural models alongside generated NOE restraints used for their calculation.

The process of setting up AutoNOE-Rosetta calculations is highly automated and accessible via a Python interface within the toolbox available at the CS-Rosetta website (https://csrosetta.chemistry.ucsc.edu). Before setting up NOE assignment and structure calculation runs, (1) NMR chemical shifts and target sequences are used to predict secondary structure (rigid regions) and flexible end regions from TALOS-N51, (2) the flexible end regions are trimmed from sequence and chemical shift files because they degrade the performance of structure determination methods by introducing large numbers of degrees of freedom, and (3) the predicted secondary structure, together with trimmed chemical shift files, is used to select 200 structural fragments of amino acid lengths of three and nine, for each position in the target sequence. On completion of the previous steps, AutoNOE-Rosetta calculations are set up with target sequence, structural fragments, chemical shifts and unassigned NOESY cross-peak lists. For the BH_10 target protein, we obtained NMR chemical shifts from 4D-CHAINS28 and additionally utilized amide to aliphatic (HCNH) unassigned NOESY cross-peak lists. Thereafter, we performed four rounds of AutoNOE-Rosetta calculations, where each round was supplied with a different restraint weight (standard restraint weights of 5, 10, 25 and 50 were used). All the calculation runs were evaluated using a function that assesses the all-atom energies39 of the structural models and their convergence. After selecting the best-scoring restraint weight run, the ten models that exhibited the lowest energy within this run were selected. Commands to set up the calculations were used exactly as provided in the Supplementary Methods of a previous work28. MolProbity52 was used to compute Ramachandran statistics for the ten lowest-energy structural models (100% of residues in allowed regions of Ramachandran space, and 99% in favored regions) and deviations from the ideal geometry (Table 1).

Salt bridges. We used ESBRI53 to predict salt bridges in the ten lowest-energy structural models. Out of 19 salt bridges predicted using ESBRI, AutoNOE-Rosetta recovers four salt bridges in the form of NOE contacts on the surface of the BH_10 protein. To identify salt bridges, we examined the NOE restraints assigned by AutoNOE-Rosetta between negatively charged glutamic acid or aspartic acid and positively charged arginine or lysine. We further filtered the restraints based on the upper distance bound of 4 Å in the ten lowest-energy structures. From these filters, we found that the NOE assignment module recapitulated salt bridges between residue pairs (15,62), (23,78), (33,64) and (35,62).

Hydrophobic core of BH_10. Buried residues were selected from the ten lowest-energy structural models using a 10 Å² solvent-accessible surface area threshold. There are 18 buried residues that contribute up to 70% of the total NOEs assigned by AutoNOE-Rosetta, and two of them are aromatic residues (F34 and F73). While 4D-CHAINS does not assign chemical shifts of side chain groups of aromatic residues (specifically aromatic rings), it provides respective chemical shift assignments of backbone atoms (Cα, Hα, N, H) and the β-carbon and β-proton (Cβ, Hβ) atoms. AutoNOE-Rosetta assigned a total of nine NOEs for the aromatic residues in the hydrophobic core, and the placement of aromatic side chains was optimized by the Rosetta packer algorithm. On close examination of the BH_10 structures, we found that the geometry of the two aromatic side chains was constrained by neighboring residues with methyl groups placed based on NOEs, supporting the accuracy of the aromatic side chain placement.

Visualization of protein structures and image rendering. Images of protein structures were created with PyMOL54.

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability. The Rosetta macromolecular modeling suite (http://www.rosettacommons.org/) is freely available to academic and non-commercial users. Scripts and protocols used in this article for generating protein backbone blueprints, performing Rosetta design calculations and analyzing protein structures are all available on GitHub (https://github.com/emarcos/beta_sheet).

Data availability
NMR chemical shifts and NOESY cross-peak lists used to determine structures of BH_10 have been deposited in the BMRB with accession code 30495. Coordinates of the ten lowest-energy structures and the restraint lists have been deposited in the wwPDB as PDB 6E5C. The design model of BH_10 is available as Supplementary Dataset 1, and the loop dataset used to analyze the side chain patterns of naturally occurring β-arches is available in Supplementary Dataset 2. Other data are available from the corresponding authors upon reasonable request.

References
32. Wang, G. & Dunbrack, R. L. Jr. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591 (2003).
33. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
34. Fleishman, S. J. et al. RosettaScripts: a scripting language interface to the Rosetta macromolecular modeling suite. PLoS ONE 6, e20161 (2011).
35. O'Meara, M. J. et al. Combined covalent-electrostatic model of hydrogen bonding improves structure prediction with Rosetta. J. Chem. Theory Comput. 11, 609–622 (2015).
36. Bhardwaj, G. et al. Accurate de novo design of hyperstable constrained peptides. Nature 538, 329–335 (2016).
37. Sheffler, W. & Baker, D. RosettaHoles2: a volumetric packing measure for protein structure refinement and validation. Protein Sci. 19, 1991–1995 (2010).
38. Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202 (1999).
39. Alford, R. F. et al. The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theory Comput. 13, 3031–3048 (2017).
40. Studier, F. W. Protein production by auto-induction in high density shaking cultures. Protein Expres. Purif. 41, 207–234 (2005).
41. Delaglio, F. et al. NMRPipe: a multidimensional spectral processing system based on UNIX pipes. J. Biomol. NMR 6, 277–293 (1995).


42. Ying, J., Delaglio, F., Torchia, D. A. & Bax, A. Sparse multidimensional iterative lineshape-enhanced (SMILE) reconstruction of both non-uniformly sampled and conventional NMR data. J. Biomol. NMR 68, 101–118 (2017).
43. Lee, W., Tonelli, M. & Markley, J. L. NMRFAM-SPARKY: enhanced software for biomolecular NMR spectroscopy. Bioinformatics 31, 1325–1327 (2015).
44. Nerli, S., McShan, A. C. & Sgourakis, N. G. Chemical shift-based methods in NMR structure determination. Prog. Nucl. Mag. Res. Sp. 106–107, 1–25 (2018).
45. Lange, O. F. Automatic NOESY assignment in CS-RASREC-Rosetta. J. Biomol. NMR 59, 147–159 (2014).
46. Lange, O. F. & Baker, D. Resolution-adapted recombination of structural features significantly improves sampling in restraint-guided structure calculation. Proteins 80, 884–895 (2012).
47. Berjanskii, M. V. & Wishart, D. S. Unraveling the meaning of chemical shifts in protein NMR. Biochim. Biophys. Acta 1865, 1564–1576 (2017).
48. Nilges, M. A calculation strategy for the structure determination of symmetric dimers by ¹H NMR. Proteins 17, 297–309 (1993).
49. Nilges, M. Ambiguous distance data in the calculation of NMR structures. Fold Des. 2, S53–S57 (1997).
50. Herrmann, T., Güntert, P. & Wüthrich, K. Protein NMR structure determination with automated NOE assignment using the new software CANDID and the torsion angle dynamics algorithm DYANA. J. Mol. Biol. 319, 209–227 (2002).
51. Shen, Y. & Bax, A. Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks. J. Biomol. NMR 56, 227–241 (2013).
52. Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D Biol. Crystallogr. 66, 12–21 (2010).
53. Costantini, S., Colonna, G. & Facchiano, A. M. ESBRI: a web server for evaluating salt bridges in proteins. Bioinformation 3, 137–138 (2008).
54. The PyMOL Molecular Graphics System, Version 1.7.2 (Schrödinger, LLC, 2016).
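
As an illustration of the contact order measure defined in Methods (the average sequence separation between pairs of Cα atoms within 8 Å and at least three residues apart), a minimal Python sketch might look as follows. The function name `contact_order`, the inclusive treatment of the 8 Å cutoff and the toy coordinates are assumptions made here for illustration; they are not taken from the authors' scripts.

```python
import math
from itertools import combinations


def contact_order(ca_coords, dist_cutoff=8.0, min_seq_sep=3):
    """Average sequence separation over Cα pairs that are in contact.

    ca_coords: list of (x, y, z) Cα positions in Å, ordered by residue.
    A pair counts as a contact when its distance is <= dist_cutoff
    (assumed inclusive) and its sequence separation is >= min_seq_sep.
    """
    separations = [
        j - i
        for (i, a), (j, b) in combinations(enumerate(ca_coords), 2)
        if j - i >= min_seq_sep and math.dist(a, b) <= dist_cutoff
    ]
    # Return 0.0 when no pair qualifies, to avoid dividing by zero.
    return sum(separations) / len(separations) if separations else 0.0


# Toy example (hypothetical coordinates): two antiparallel three-residue
# segments, 3.8 Å between consecutive Cα atoms and 4 Å between segments.
toy_hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 4.0, 0.0), (3.8, 4.0, 0.0), (0.0, 4.0, 0.0),
]
print(contact_order(toy_hairpin))  # 4.0
```

In the toy example the qualifying pairs have separations 4, 5, 3 and 4, giving an average of 4.0. For real structures the Cα coordinates would be read from a PDB file first, a step omitted here.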

nature research | reporting summary

Corresponding author(s): Enrique Marcos, David Baker

Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section).
n/a Confirmed
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
- Clearly defined error bars. State explicitly what error bars represent (e.g. SD, SE, CI)

Our web collection on statistics for biologists may be useful.

Software and code

Policy information about availability of computer code

Data collection: Protein structures used for loop geometric analysis were obtained from a non-redundant database of natural proteins from the Protein Data Bank.

Data analysis: Analyses of protein structures were carried out with PyRosetta, and all design and folding calculations were done with Rosetta. All NMR analyses were done as described in the Methods section with the following programs: NMRPipe, NMRFAM-SPARKY, 4D-CHAINS, CS-Rosetta and TALOS-N. Graphical data representations were made with Python and matplotlib. Visualization and image rendering of protein structures were done with PyMOL.

For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

April 2018

Data

Policy information about availability of data

All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

NMR chemical shifts and NOESY cross-peak lists used to determine structures of BH_10 have been deposited under BMRB ID 30495. Coordinates of ten lowest-energy structures and the restraint lists have been deposited under the PDB ID 6E5C.
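As a rough illustration of how a reader might check that a downloaded copy of the deposited ensemble contains the stated number of structures, the sketch below counts MODEL records in legacy PDB-format text. It assumes the entry is obtained in legacy PDB format; the function name `count_models` and the two-model example text are hypothetical stand-ins, not part of the deposition or of the authors' workflow.

```python
def count_models(pdb_text: str) -> int:
    """Count MODEL records in legacy PDB-format text (one per ensemble member)."""
    # In the legacy PDB format the record name occupies columns 1-6.
    return sum(1 for line in pdb_text.splitlines() if line[:6].strip() == "MODEL")


# Hand-written stand-in for an NMR ensemble file (not real coordinates).
example = "\n".join([
    "MODEL        1",
    "ATOM      1  N   MET A   1       0.000   0.000   0.000  1.00  0.00           N",
    "ENDMDL",
    "MODEL        2",
    "ATOM      1  N   MET A   1       0.100   0.000   0.000  1.00  0.00           N",
    "ENDMDL",
    "END",
])

n_models = count_models(example)
print(n_models)
```

For a real check, the same function could be applied to the contents of a locally saved file (for example a hypothetical `6e5c.pdb`), where the count would be expected to match the ten structures stated above.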

Field-specific reporting

Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences

For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf

Life sciences study design

All studies must disclose on these points even when the disclosure is negative.

Sample size: We selected 19 computationally designed sequences for experimental characterization.

Data exclusions: No data were excluded.

Replication: Replication was not relevant to our study.

Randomization: Randomization was not relevant to our study.

Blinding: Blinding was not relevant to our study.

Reporting for specific materials, systems and methods

Materials & experimental systems (n/a / Involved in the study)
- Unique biological materials
- Antibodies
- Eukaryotic cell lines
- Palaeontology
- Animals and other organisms
- Human research participants

Methods (n/a / Involved in the study)
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging
