Quick viewing(Text Mode)

Genetic Code Evolution Investigated Through the Synthesis and Characterisation of Proteins from Reduced- Alphabet Libraries Matilda S

Genetic Code Evolution Investigated Through the Synthesis and Characterisation of Proteins from Reduced- Alphabet Libraries Matilda S

DOI:10.1002/cbic.201800668 Full Papers

Genetic Evolution Investigated through the Synthesis and Characterisation of from Reduced- Alphabet Libraries Matilda S. Newton+,[a, b] DanaJ.Morrone+,[a, b] Kun-Hwa Lee,[a, b] and Burckhard Seelig*[a, b]

The universal of 20 amino acids is the product of bet. Four libraries werecomprised of the five, nine, and 16 evolution. It is believed that earlier versions of the code had most ancient amino acids, and all 20 extant residues for a fewer residues.Many theories for the order in which amino direct side-by-side comparison. Wecharacterisednumerous acids were integrated into the code have been proposed,con- variantsfrom each library for their solubility and propensity to sidering factorsranging from prebioticchemistry to codon form secondary,tertiary or quaternary structures.Proteins from capture. Several meta-analyses combined these theories to the two most ancient libraries were more likely to be soluble yield afeasible consensus chronology of the genetic code’s than those from the extant library. Severalindividual evolution, but there is adearth of experimental data to test variantsexhibited inducible protein folding and othertraits the hypothesisedorder. We used combinatorialchemistryto typical of intrinsically disordered proteins. From these libraries, synthesise libraries of random polypeptides that were based we can infer how primordial and function on different subsets of the 20 standard amino acids, thus rep- might have evolvedwith the genetic code. resenting different stages of aplausible history of the alpha-

Introduction

The geneticcode of 20 amino acids we know today is not famousMiller–Urey spark experiment;the hypothesis on the static. Althoughsupposedly “universal”,[1] the code has been importance of complementarity; the thermostability of the found to contain a21st and a22nd ( triplet code;and the codon-capture theory.[9–12] Other analyses and ) in many extant .[2,3] The addition of considerthe physicochemical properties of amino acids like these residues indicates that the code is not “frozen”,[4] but can size, charge and hydrophobicity;the code’sco-evolution; bio- expand.The 64 codons could, in principle, evolve to encode as synthetic theories;orthe detection of amino acids on meteor- many as 63 chemically distinct amino acids. As the genetic ites, to name afew.[13–16] Acommon trend among the hypothe- code shows signs of expansion, it is reasonable to assume ses is that the earliestamino acids were comparably small;this that, at earlier points in time, it coded for fewer amino acids. is supported by the observation that prebiotic synthesis seems Likewise, it is inconceivable that the code would have arisen to favour less complex amino acids.[17,18] It is widely agreed fully formed. Many lines of evidenceand reasoning indicate that the most recent additions to the geneticcode include the that earliest —before the last universal commonancestor aromatic amino acids , and trypto- (LUCA)—used agenetic code consisting of fewer amino phan.[17, 18] Unbiased, comprehensive meta-analyses of the acids.[5–8] Further amino acids were gradually added to result in proffered hypotheses andexperimental evidence suggest a the chemical complexity we find in the code today. consensus chronologicalorder for the incorporation of amino Numerous hypotheses have been proposed that attemptto acids into the genetic code[19,20] (Figure1A). define aplausible chronologicalorder for the of The evolution of the genetic code resulted in ashifting pro- amino acids in the geneticcode. Arguments include the teomecomposition over time, thereby alteringthe chemical properties of primordial proteins.Asaresult,these proteins [a] Dr.M.S.Newton,+ Prof. D. J. Morrone,+ K.-H. Lee, Prof. B. Seelig probablyalso changed in their biophysical and functional Department of Biochemistry,Molecular and Biophysics properties, which is the particularfocus of this work. Although University of Minnesota Minneapolis, MN, 55455 (USA) numeroustheories have been proposed describing the [b] Dr.M.S.Newton,+ Prof. D. J. Morrone,+ K.-H. Lee, Prof. B. Seelig of these early amino acid alphabets and the gradual addition BioTechnologyInstitute, University of Minnesota of “later” amino acids to give today’s genetic code, there are 1479 Gortner Avenue, 140 Gortner Laboratory still only afew experimental data to link these hypotheses to St. Paul, MN, 55108-6106 (USA) the biochemical characteristics of the corresponding proteins. E-mail:[email protected] There is aprecedent forstudying proteins with areduced [+] These authors contributed equally to this work. amino acid composition. Severalgroups have presented exam- Supporting information and the ORCID identification numbers for the authors of this article can be found under https://doi.org/10.1002/ ples of extant proteins that still fold andfunction correctly cbic.201800668. when their amino acid complexity is artificially reduced;[2–23]

ChemBioChem 2019, 20,846 –856 846  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

Figure 1. Plausible history of amino acid additions to the genetic code and reduced amino acid alphabets from different stages of code evolution. A) The con- sensus chronologicalorder of incorporation of amino acidsinto the genetic code over time.[19] Amino acidsofapproximately the sameproposed age are de- pictedinparentheses. B) The four different amino acid alphabets that were used to synthesise the four polypeptide libraries. still other projects used rational design techniques to generate the desired amino acids in the precise ratios we desired, some- de novo, folded proteins with as few as ten different amino thing that usually cannotbeachievedwith degenerate codon acids.[24] Althoughthis work confirmed that alimited contin- synthesis. gent of amino acids can support functional proteins,the This studyaimed to characterise the impact of amino acid amino acid alphabets used in those example studies[21,24–28] composition on the biophysical properties of the reduced- were chosen to reduce the chemical redundancy of the extant alphabet protein variants as afirst step to investigating the code for engineering purposes, rather than to represent a effect that an evolving, increasingly complex genetic code likely early geneticcode. would have upon protein structure and function. Previous We experimentally probedthe properties of reduced-alpha- studies have attempted to estimate the functionofancient bet proteins that represent snapshots along the most likely hypothetical genetic ,[36–40] but amajor strength of our chronology of the genetic code’s evolution. Using the consen- approachisinthe quality of our libraries. Specifically,wede- sus temporal order of amino acid additions to the genetic signedthe libraries to strictly adheretothe reducedalphabets code,[19] we defined three primordial protein alphabetsthat within the randomised regions,with no compromise for nonli- comprise the likely earliestfive, nine and 16 amino acids (Fig- brary amino acids. Furthermore, the fully random natureofthe ure 1B). From each alphabet,wesynthesised protein libraries sequences—rather than construction from apre-existing pro- of random sequence—each 83 amino acids in length. Proteins tein scaffold—allowed anunbiased exploration of the folding of comparable size have previously been foundtohave biolog- behaviours of the alphabets.Finally,the approachofgenerat- ical activity[29,30] and it is reasonabletopresumethat apre- ing multiple libraries for different points in the code’s evolu- LUCA organismwould have only been able to support alimit- tion allowed nuanced comparison and insightinto the chang- ed , thus favouring short .[31,32] For comparison, ing chemistries of primordial proteins. To our knowledge,this we also synthesised afourth library comprised of the canonical work therefore contains the most accurate assembly of re- 20 amino acids. duced alphabets to date and importantly links experimental The four libraries of five, nine, 16 and 20 were generated data to hypothesesonearly alphabets. through trimer chemical synthesis. Although it has been stan- To determine the propensity of the four alphabets to yield dard practice to generate such protein libraries through the soluble, stable structures, we expressedabouttwo dozen pro- use of degenerate codons,[33,34] no degenerate codons exist tein variants from each library in and analysed that would code for the reduced alphabets described. There- their secondary-structure content, hydrophobic packing and fore, we synthesised our libraries through the sequential cou- oligomerisation state. The reduced-library proteins demonstrat- pling of mixtures of trinucleotide phosphoramidites,[35] with ed asurprising propensity for solubility when expressed—with each trimer representing adesired codon. This method the five- and nine-amino-acid alphabetsshowing the highest uniquely allowed us to generate libraries comprised of only proportion of soluble protein,probablydue in part to their

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 847  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers overabundance of negatively charged residues.These analyses (20AA) composed of the 20 canonical extant aminoacids for also showedthat the proteins lack significant secondary struc- comparison. ture and could be described as intrinsically disorderedpro- Proteins from each library had arandomised region of 80 teins. amino acids that conformed to its given alphabet and was constructed from chemically synthesised 20-codon cas- settes that were subsequently ligated.[47] Alength of 20 trinu- cleotides (60 nt) was chosen due to the limitations of accurate Results chemicalsynthesis and had the benefitofshuffling fragments to increasesequence diversity further.Trimer phosphorami- Libraryconstruction and compositions dites were used in chemical synthesis—as opposed to classic We synthesised DNA libraries to encode random polypeptide monomer-based sequential synthesis—which allowed precise libraries of different amino acid alphabets to represent“snap- control over codon usage.The relative amino acidabundance shots” in the likely evolutionary development of the genetic in each library was based on residue usage in today’s pro- code. Accordingly,each library was constructed to encode an teome(as defined by the ExPASy server,[48] see Table S1 in the increasingly complex set of amino acids (Figures 1B and 2), Supporting Information). This amino acid usage was confirmed conforming to the consensus order reported by Trifonov.[19] We by Sanger (Tables S2 and S3, Figure S1). acknowledge that somefiner details of this consensus order The stretch of 20 trinucleotides was flankedatthe 5’ and 3’ might still be amatter of debate, but the general order of very termini by conserved sequences of 20 nt to be used as primer early amino acids and late amino acid additions to the genetic binding sites andtointroduce restriction endonucleaserecog- code is widely accepted. Our most reduced library was set to nition sites. Type IIS restrictionsites were used, as these gener- have just the five most ancient amino acids (5AA:Gly,Ala, ate cohesive ends outside their recognition sequences and Asp, Val, Pro), as aprevious study showed that asmall b-sheet thus avoid undesired cloning artefacts that would code for protein could be engineered to contain only five different amino acids not included in somealphabets. Following PCR amino acids butstill fold properly.[46] For our next point in the amplification and digestion, cassettes were then evolution of the genetic code, we chose alibrary of nine ligatedtogether to build up an open of 40 amino acids, whichare all believed to easily form prebiotical- codons and then, following asecond cycle of digestion and ly[20] (9AA:Gly,Ala, Asp, Val, Pro, Ser,Glu, Leu, Thr). These nine ligation, 80. Thiscloning strategy enabled us to have cloning contained the five amino acids from the first library and also scars of only asingle at each ligation junction. Conven- added residues that significantly increasethe polarity and iently,glycine is part of all four libraries. charge of the proteins they encode. Next, we constructed a We performed the ligationreactions for the libraries at a library from 16 amino acids (16 AA:Gly,Ala, Asp, Val, Pro, Ser, scale that enabled the net production of more than 1012 Glu, Leu, Thr,Arg, Ile, Gln, Asn,His, Lys, Cys) that included all uniquesequence variants in each of the four libraries. This vast extant residues except the four amino acids that are generally diversity was desired to takefull advantage of the high- thought be the most recent additions to the geneticcode. We throughput capacityofinvitro selection methods, such as the chose this particular compositionbecause this library lacks mRNA display technology,whichweplan to use in subsequent only the aromatic amino acids and , the most projects.Following the construction of the libraries, variants commonstart codon (although we were constrained to main- were cloned into expression plasmids and submitted for se- tain asingle ATGstartcodon). Finally,wemade alibrary quenceverification. Sanger sequencing of 206 library variants

Figure 2. Construction of the DNA libraryencoding the protein library.Achemicallysynthesised cassette of 20 randomnucleotide trimers was concatemerised to yieldaDNA libraryencoding 80 random amino acid positions.The starting cassette was flankedbyconstant termini. Twoaliquots of the library were sepa- rately digestedwith type IIS restriction BsrDI andBsmI. These digestedfragments were ligated at the resulting complementary overhangs to pro- duce aGGG scar (codon for Gly) and double the length of the randomtrimer region.Digestion and ligation was repeated to provide afinal libraryof80 randomcodons with three glycine scarsatdefinedpositionsand flanked by the constant termini. Each library was constructed separately with fourindividual cassettes, each encoding the appropriate alphabet.

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 848  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers revealed that 60%ofthe sequences exactly conformed to 16AA:8out of 23 tested;20AA: 8out of 24 tested;FigureS2). their design and alphabet. The other 40 %comprised sequen- Of the expressible proteins,variants from the 20AA library ces with , truncations or missense substitutions showedthe lowest solubilitywith11of16variants having had (Table S4). 20 %soluble protein. All three reduced-alphabetlibraries  showedhigher solubility than the library from the extant al- phabet, from which only one variant (20-5) showedhigh ( Characterisationofexpression and solubility of protein  80%) solubility.For each of the other three libraries, we found library variants at least five variants with asolubility as high as that of variant Sequence-confirmed variants( 20) from each library were 20-5. For several insoluble variants from different libraries,re-  expressed as C-terminallyHis6-tagged proteins for purification folding procedures by denaturing/renaturing were attempted and detection purposes.Initial test expressions determined to obtain soluble protein, but these methodswere not suc- that induction with1mm isopropyl-b-d-thiogalactopyranoside cessful. Select proteins were expressed at alarger 1Lscale and

(IPTG) yieldedthe most soluble protein.Following the induc- purifiedbyHis6 affinity chromatography for subsequent bio- tion of variants,the soluble and insoluble (solubilised with 8m physicalcharacterisation. urea) fractions were analysed by SDS-PAGE.Although SDS- PAGE and Coomassie stainingwere sufficient to identify most Biophysical characterisation of soluble protein variants over-expressed variantsinthe 16AA and 20AA libraries, the 5AAand 9AAlibraries showed lower-abundance proteins that Determining secondary structurebycircular dichroism: TheCD were easier to detect by the more sensitivewestern blot analy- spectra of select soluble variantsfrom each library indicated sis with aHis6-tag-binding antibody. that the proteins existed primarily in extended random-coil We categorised protein variants as “no expression” (no pro- form, as demonstrated by the absorbance minimabetween tein detectable in the soluble or insoluble fraction), “highly 198–202nm(Figures 4and S3). However,several variants dis- soluble” (when 80 %ofthe total detected protein was in the playedinducible increases in a-helicity.None of the variants,  soluble lane), “partially soluble” (21–79 %inthe soluble lane), when analysed under buffer-only conditions, showed signifi- or “no/lowsoluble” ( 20 %inthe soluble lane;Table 1and cant CD spectralcharacteristics of a-helical or b-strand-contain-  ing regions. The addition of 2,2,2-trifluoroethanol (TFE) to a protein solution induces or stabilises secondarystructure, par- Table 1. Distribution of protein solubility for all variants tested from the ticularly a-helicalregions.[49,50] TFE’seffects on protein structure four libraries,asdetermined by SDS-PAGEand western blot analysis. might be explained either by its promoting intramolecular Alphabet hydrogen bonding, stabilising hydrophobic regions, or by 5AA9AA 16AA 20AA disrupting protein–solvent interactions.[50,51] Twelve out of 13 expressed in E. coli 7111516library variants tested(5-11, 5-18, 5-20, 9-2, 9-4, 9-6, 9-21, 16-1, high solubility ( 80%) 6810 1  16-13, 16-18, 16-23, 20-5) showed alteredspectra in the pres- partial solubility (21–79%) 0334ence of 25 or 50%TFE. The single minimum characteristic of no/low solubility ( 20%) 10211  no expressioninE. coli 13 13 88random coilshiftedright to 202–206 nm, and asecond nega- total variants analysed 20 24 23 24 tive peak at 220 nm was formed, which together indicated  increased a-helical character (an all a-helical protein typically shows CD spectrum minima at 205 and 222 nm). The H values, Figures 3and S2). Despite using sensitivewestern blot analysis, ameasureofhelix content,[45] also increased, thusreflecting which allowed the detection of low-abundance proteins with- this spectral shift (Table S5). Variants16-18, 9-6, 20-5, 5-11and out obfuscation by endogenousprotein, no overexpression 16-23 showedthe greatesthelical content (> 60%) in the pres- was detected for 33–65% of protein variants from the different ence of 25 or 50%TFE, thus indicating that the effectisnot libraries (5AA:13out of 20 tested;9AA:13out of 24 tested; alphabet-specific.

Figure 3. Solubility analysis of libraryvariants by SDS-PAGE. A) Library variants were expressed in E. coli,and the soluble and insoluble fractions were separat- ed by SDS-PAGE. The upper panel shows asmall representative subset of variants tested on agel stained with Coomassie Brilliant Blue. The lower panel shows the samegel area but with the library variants visualised by anti-His6 westernblot. The library variant namesare indicatedatthe top;pET28A refers to anegative control of cells carrying an empty expressionvector underidenticalexpression conditions.B)Summaryofthesolubilityanalysis of all 91 protein variants tested.

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 849  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

Figure 5. ANS fluorescence emission spectrafor library variants. Proteins (10 mm)weremixed with 500 mm ANS and excited at 385 nm. The fluores- cenceemission spectrafor BSA (a), ribonuclease A(RNase; c), the RNA ligase 10C (····), variant 16-1 (c)are shown; all remaining tested soluble variants (c)gave similar spectra.

had substantially more ANS fluorescencethan all other variants tested,the spectra for which were all similar to that of the atypical ligase 10C protein.

Characterising the oligomeric state by size-exclusion chromatog- raphy: Size-exclusion chromatography was used to determine the oligomeric state of protein variant 20-1, which was the single highlysoluble protein from the 20 AA library. Soluble variants from the three reduced-alphabet libraries were not amenabletothis method because detection by UV absorption requires the presence of aromatic residues,which were only present in 20AA library variants.Using aSuperdex 75 column, which can separate proteins ranging in MW from 3000 to Figure 4. Secondary-structure content of soluble variants from the fourlibra- 70000 Da, we found that the solution of the 20-1 variant (pre- ries measured by CD spectroscopy.A)Reference CD spectra are shownfor dicted mass of 10 kDa) yielded multiple peaks with retention all random-coil (c), a-helix (c)and b-strand (a)proteins. B) One typi- cal Ni-NTA-purifiedvariant protein from each library. The spectraare present- times between tR =10.4 min and tR =16.9 min, thus indicating 3 2 1 ed as molarellipticities (q”10 degcm dmolÀ )for protein c:inbuffer, that the protein exists in multiple quaternary states (Figure 6). a:inthe presence of 25%trifluoroethanol (TFE) and ····:in50% TFE. Ad- For comparison, we also analysed two standards of albumin ditionalCDspectra are showninFigureS3. and ribonuclease Aof65and 13 kDa, respectively.

Assessing protein folding by ANS fluorescence: The tertiary struc- ture content of proteins can be assessed with the help of the amphiphilic dye 1-anilinonaphthalene-8-sulfonate (ANS), which increases its fluorescencewhen shielded from water,such as when it binds hydrophobic areas of proteins. Well-folded pro- teins with highly packed hydrophobic cores display ANS fluo- rescence.Molten globules—partiallyfolded proteins with dis- tinct secondary but little or no tertiarystructure—exhibit even greater ANS fluorescence, as hydrophobic regions that would be buried in well-folded protein are more easily accessible to the dye. Library variants were subjected to ANS binding, fol- Figure 6. Assessment of the oligomeric state of the purified soluble variant 20-5 (c)bysize-exclusion chromatography.Monomeric standards of a lowed by excitation and fluorescencedetection(Figure 5). For 13 kDa ribonuclease (a)and a65kDa albumin (c)are shownfor com- comparison, we included two well-folded proteins:ribonu- parison. The calculated molecularweight of variant 20-5 is 10 kDa. clease Aand BSA, as well as our artificial RNA ligase,[29,52] which is asmall 87-residue protein that hasanunusualthree-dimen- [53] sional structure with high conformational dynamics. Of the Discussion 13 libraryvariants examined, only variant 16-1 exhibited fluo- rescencesimilartothat of the well-folded controlproteins To experimentally test ahypothesised succession of amino (Figure 5). This variant also exhibited the highest inducible heli- acid inclusion during the evolution of the genetic code, we cal structure detectable by CD. Although the variant displayed generated four randomised protein libraries that represent dif- severalfold less fluorescence than ribonuclease Aand BSA, it ferent stages in the code’s history.The libraries of 5AA, 9AA,

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 850  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

16AA, and the extant 20 AA contained 80-residue randomised Sangersequencing demonstrated that the amino acid distribu- regions that conformed strictly to their defined amino acid al- tion among the correctsequences in each library showed 70– phabet. Randomly chosen variants from each library werechar- 80%agreement with the intended distribution (Figure S1 and acterised for solubility by recombinant expression and further Table S3). Finally,the number of our libraries, whichrepresent analysed for their propensity for secondary,tertiaryand quater- four different time points in the code’s history,allows valuable nary structures by CD spectroscopy,ANS fluorescenceand comparison of biophysical characteristics and potential for cel- size-exclusion chromatography,respectively. lular function. Although other groups have probedthe func- The libraries we constructed represent avigorousexperi- tion of ancient alphabets—in addition to earlier examples, co- mental exploration of the evolution of the geneticcode, there- factor bindingwas examined for VNM codon libraries encoding by complementing and advancingprior studies. Although the oldest 15 residues[36,37]—none, to our knowledge,has com- other groups have generatedreduced amino acid libraries pared multiple libraries strictly conforming to aconsensus before,our study has two major strengths overpreviouswork: order[19] for the geneticcode’s evolution. firstly,our library design conforms strictly to the defined alpha- All four libraries described in our study were biophysically bets and secondly, the construction of four libraries increases characterised independently under identicalconditions to the powerofvaluable comparison. Trinucleotide synthesis, allow comparison of the different alphabets.Heterologous ex- used in the current work, allows precise control of the amino pression of individual library variants in E. coli BL21(DE3) cells acid alphabet and relative abundance. We further ensured that indicated that the extant (20 AA) and most recent (16 AA) al- the cloning protocolused (Figure 2) did not introduce non- phabets are best-suitedtoexpression (independent of solubili- library residues at the cloningscars from cassette ligations. ty) under this model system.These libraries had the highest Sangersequencing of 206 individual variantsshowed that 60% fraction of expressed proteins detectable by western blot to of our libraryvariants conformed exactly to their intended al- total variants tested (16AA:15/23;20AA: 16/24;Table 1and phabet, without truncations, frameshifts or non-alphabet mis- Figures 3B and S2), followed by the 9AAlibrary (11/24). The sense (Table S4). For comparison, Kumachi et al. se- 5AAlibrary wasleast amenable to heterologous expression (7/ lected a30-residue tRNA-binding peptidefrom afour-amino- 20). The proteins detected for the 9AAand 5AAlibraries were acid code library (using the degenerate codon GNC:Gly,Ala, also, on average, expressed at lower abundance than proteins Asp, Val) butfound that only 13%ofall variants(6/46)con- from the more modernalphabets.Proteins from these libraries formed to their alphabetasdetermined by sequencing after appeared as fainter bands in the western blots and required a the selection.[40] Some degree of that leads to unde- longerdetection time (Figure S2). These two observations sirable non-alphabet codons is inevitable during library synthe- probablyindicate lower protein abundance in the , but de- sis and propagation. However, techniques such as controlling creased affinity of proteins from different alphabets to the pol- conditions through cell-free expression systems[54,55] yvinylidenedifluoride (PVDF) membrane used in western blots could further improvethe adherrance of alibrary to its alpha- could have some effect too.Furthermore, our detection meth- bet in future applications. ods did not distinguish between low protein abundance due Library design is just as important for quality libraries as ac- to low expression (for which the limitedavailability of charged curate synthesis. Notable previous work by Yanagawa and tRNA for our reduced alphabets could be acontributing factor) colleagues similarly generated multiple libraries (using codons or high degradation rates. Tanaka et al. reported an opposite GNN, RNN, and NNN forthe early five, 12, and all 20 amino expression trend:whereas 60%ofthe proteins from their acids, respectively),[39] but used alibrary construction strategy most diverse NNN librarycould be expressed (comparedto that resulted in regularly interspersed non-alphabet amino 77%for our 20AA library), as high as 87 and 86 %ofproteins acids:cloning scars introduced nine additional resi- from their reduced-alphabet RNN and GNN libraries could be dues within every library variant, corresponding to every 16th expressed, respectively.[39] This drastically different trend might residue. Unlike the invariant glycine cloning scar in our libra- be due in part to the high abundance of the modern amino ries, tryptophan is considered to have been the last addition acid tryptophan in all of their libraries. In anotherstudy of to the genetic code,[19] and ahigh tryptophancontentwould random sequence proteins of asimilarlength to those de- be likely to have agreat influence upon the biophysical prop- scribed here, Tretyachenko et al. reported that 53%ofrandom erties of proteins, especially in the case of the libraries from 20AA sequences with similaramino acid usage to modern pro- the reduced alphabets of early amino acids. Furthermore, al- teins could be expressed in E. coli (8/15).[56] Finally,a20AA thoughdegenerate codons allow easy,low-cost library genera- random library of proteins 95 amino acids in length createdby tion, they do not always allow the researcher to sample the Urabe and colleagues contained only 20 %solubly expressed exact cohort of desired amino acids or the ratios in which they variants (as detectedbywestern blot).[57] Theobservation that appear.For example, GNN and RNN libraries do not strictly our libraries becomeprogressively more amenable to heterolo- follow the consensus chronology[19] explored here, duetothe gous expression in E. coli as they becomemore similartothe arrangement of the code. In contrast, the use of trinucleotide modern proteome couldindicate that the modern expression phosphoramidites allows one to precisely manipulate both the system hasbeen optimised for modernproteins.Weacknowl- choice of amino acids and the relative abundanceofeach. The edge the limitations of using an extant with its trans- amino acid distributionsofour four libraries were designed to lation,chaperone and degradation systems to express “primor- mimic the extant proteome (determined through ExPASy[48]). dial” proteins, butthere is currently no satisfactory alternative.

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 851  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

The E. coli expression system is simply atechnically feasible The 16 AA library was also the only alphabet to yield avar- tool for protein synthesis that allows our different libraries to iant that exhibited potentialtertiary structure—hydrophobic be compared with one another, and with previous similarre- packing—asdetected by ANS fluorescence(variant 16-1; duced-alphabetstudies. Figure 5). The lower propensity for secondary structureinthe The two most ancient alphabets in our study displayed a 5AAlibrary can be explained by the abundance of the rigid greater propensity for solubility among their expressed var- and flexible glycine residues,which are known to disfa- iants than the more modern libraries. In spite of low expres- vour secondary structure.[59] These two amino acids are about sion levels in E. coli,the majority of variants that didexpress four to five times more abundant in the 5AAlibrary than in from the 5AAand 9AAlibraries showed high solubility (5AA: the extant proteome (Figure S1). b-Branched amino acids like 6/7 expressed were 80%soluble;8/11for 9AA; Table 1, Fig- (present in 5AAlibrary) might destabilise a-helices.[60]  ure 3B). In contrast, the 20AA library was dominated by pro- Therefore, it appearsthat the composition of the 5AAlibrary is teins with lowornosolubility.Eleven out of 16 expressible var- less likely to promote canonical secondary structure than are iants had 20%soluble protein (Table 1, Figure3B). The 16AA the later alphabets. The libraries representing intermediate  library appeared to be the “sweet-spot”for expression and sol- codes (9AA and 16AA) have agreater diversity of residues ubility:lackingthe bulky,aromatic residues,tryptophan, tyro- than the 5AAlibrary;this has the dual effect of reducing the sine and phenylalanine,itdisplayed comparable expression abundance of secondary-structure-disrupting side chains and levels to the 20 AA librarybut ahigher propensity for soluble also introducing amino acids that can have stabilising effects. protein. Thirteen of the 15 expressible variants had 21–100% It has been demonstrated that the dipole moment of a-helices soluble protein, corresponding to 56%ofall proteins tested can be stabilised by basic residues (absentfrom 5AAlibrary) at from that library.Ananalysisofthe amino acid distribution of their Ctermini and acidic residues at the Ntermini.[61] The differentsubsets of variants from each library (i.e.,expressed greater diversity of chemistries introduced in the 9AA(Glu vs. no expression;high/partialsolubility vs. low/nosolubility) added) and 16AA (Arg, His, Lysadded) libraries could provide uncovered severalcorrelations between the biophysical prop- such an effect. The further increase in chemical complexityin erties of the proteins and their aminoaciddistribution, thereby the 16 AA library could explain why 16-1 wasthe only variant indicating that selection is occurring for solubly expressing to display hydrophobic packing. Although it is difficult to proteins (Table S6). For instance, 20 AA variants with low or no make inferences about tertiary structure from the primary se- solubility contained higher proportions of bulky,aromatic tyro- quence, the increaseinpolar residues in the 16 AA library (Arg, sine and tryptophan residues, which are likely to lead to aggre- Gln, Asn, His, Cys are added to the 9AAalphabet)likelyena- gation.These initial solubility assay results—notably the trend bles stronger interactions at the protein/solvent interface. This of early alphabets displaying high solubility when expression effect, in turn, could facilitate the formation of ahydrophobic was observed—support the idea that our four libraries are core, even without the bulky aromatic residues tyrosine, phe- amenable to furtherexperimentation, such as future functional nylalanine or tryptophan. studies. The biophysical properties of randomly chosen variants from Based on the variants tested here, the 16 AA library yielded the four libraries showedatrend that followed the increasing the proteins with the highest secondary and tertiary structural chemicaldiversity of side chains. Variantsfrom the 5AAlibrary content.Weperformed CD analysisonfour solubly expressing exhibited the lowest propensity for expression in E. coli,and variants from each of the reduced libraries, and the single yet 85 %ofexpressed variants werehighly soluble. The 5AA highly expressed 20-5 variant from the extantlibrary (Figures 4 variants hadthe least TFE-induced increase in secondary struc- and S3).Under buffer-only conditions, all the spectra had simi- ture. Variants from the 9AAlibrary were more likely to be het- lar profiles,with aminimum at 198 nm, which is typical of a erologously expressed (46 %detected by western blot) and dis-  random coil. This indicates that the variants do not contain playedsimilarsolubility,with agreater average response to substantial a-helical or b-strand secondary structural compo- TFE in CD analyses.The 16AA variantscontinuedthis trend of nents. Aquantitative measurement of helicity (a functionof an increased propensity for expression and inducible secon- absorbance at 222 nm), however,gave helicityvalues ranging dary structure, including the variants with the greatest helical from 12–27 %(Table S4), indicating aslight a-helical character. content (16-23), and the single variant with demonstrable hy- The addition of organic solvent TFE increased helicity in the drophobic packing (16-1).Severalother research groups have majority of variants,with variant 16-23 exhibiting 99%helicity taken diverseapproaches to investigating intermediate alpha- in the presence of 25%TFE (Table S5). The CD spectrum of 16- bets such as our 9AAand 16AA:severalcomputational studies 23 still differs from the reference graph for an all-a-helicalpro- have suggested that about ten differentamino acids would be tein in Figure 4B,inthat the magnitude of the minimum at requiredfor an alphabet to ensure afoldable proteins.[62,63] In 208 nm suggestsadditional random-coil content.[58] This var- an experimental approach, a108-residue folded four-helix iant displayed the greatest TFE-induced shift in helicityfrom bundle protein made from just seven different aminoacids 27 to 99%(72%change). Although variants from all libraries was designed and structurally solved.[64] In afurtherremarkable demonstrated TFE-inducible increases in helicity, the average example, achorismate mutase was successively modified to shift for each library (5AA:20%;9AA:33%;16AA: 42%; 20-5 contain only nine different amino acids (chemically redundant, variant:45%)followed atrend of agreater shift being correlat- not primordial) and was still found to function inside a ed with greater alphabet complexity. cell.[21,25,65] In addition to showing the highest propensity for

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 852  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers soluble heterologous expression, inducible secondary structure ed an artificial RNA ligase [29,52,71] that had aCDspec- and tertiarypacking, the 16 AA library also contains many side trum similar to all but one variant reported here and, yet, this chains important for catalysis[66] that are lacking from the 5AA protein had astably folded three-dimensional structure[53] and and 9AAlibraries. These results suggestthat an alphabet of ameltingtemperature of 728C. Importantly,the example of the 16 earliest amino acids could probably support many this unnatural enzyme highlights that, in contrasttocommon extant protein functions. The incorporation of the four likely belief, the absence of canonical secondary structure elements, latest amino acids (Met, Tyr, Phe and Trp)[19] into the genetic namely a-helices or b-strands, or the lack of alargecompact code appears to represent aselectivetrade-off. Although the hydrophobic core does not precludeastable three-dimension- fraction of expressed variants from the 20AA library was com- al structure and enzymatic activity.Indeed there are also exam- parable to that from the 16AA library,the majority of these ples of that are not globular,yet possessactivity.[72] displayed no or low solubility.This poor solubility was likely to IDPs have been suggested to evolve more rapidly than glob- be due to the 20AA library’s inclusion of the three aromatic ular proteins due to the lack of strict structural constraints.[69] residues (Tyr,Phe, Trp). Indeed, tyrosine and tryptophan were Structuralflexibility predisposes IDPs to multifunctionality;[69] a enriched in the variants with low to no solubility,compared to trait that both increases the functional adaptability of acell,[73] other 20 AA library sequences. It is interesting to note that, in and is thought to be essential for the evolution of catalytic the soluble 20 AA variants, tyrosine was depletedwhereas phe- function.[32] Furthermore,the earliest cells would be unlikely to nylalanine, which differs by just the hydroxy group, was not. In be able to maintain alarge genomedue to the poorer fidelity contrast, the different constitutionalisomers and and processivity of early polymerases;therefore, it would have showedcomparable abundances.These resultsempha- been more economical to express proteins able to perform sise the sensitivity of protein properties to the chemistries of multiple cellular tasks.[31,32,73,74] Although this still needs to be amino acids and that, despite apparent similarities, each amino investigated in future studies, the proteins described in this acid added to the code has been subjecttoselective pres- work might possess some of these characteristics, as demon- sure.[13] Aromatic residues, which were only part of our 20 AA li- strated by their biophysical properties. This characteristic of brary,expanded the chemical toolbox of the code to increase IDPs can serve as areminder that the earliest proteins might functionality,such as coordinating aromatic ligands or partici- have had biological activity without adhering to the still- pating in catalysis,[66] and also enhance protein folding and sta- common notion that proteins mustbethe globular, well- bility by their potential to increase core hydrophobicity.For folded macromolecules. the genetic code to have evolved to the current 20-amino-acid alphabet,[67] these benefits must have outweighed the accom- panying decrease in solubility for partially folded or unfolded Conclusions regions.However,itisimportant to appreciate the evolution of the geneticcode as agradualprocess, during which newly en- This work describes the analysisofthe biophysical properties coded amino acids became integrated into pre-existing func- of severaldozen protein variants from four different amino tional proteins under selective pressure. This process is highly acid alphabet libraries and represents afoundational studyfor unlikely to continue to yield new functional proteins spontane- furtherexperimental exploration of the genetic code’s history. ously from naı¨ve random sequences, like in the earliest days of We have shown that reduced-alphabet proteins are surprising- life. This caveat does not preclude, however,the artificial selec- ly soluble compared to random extant-alphabet proteins, and tion of functional proteins from fully randomised polypeptide that they exhibit secondary structure under certain conditions. libraries,such as adenovo ATP-binding protein.[30,68] These variants represent only atiny fraction of the >1012 var- Although only afew of our protein variants exhibited clearly iants from the libraries synthesised;future work will probe the defined secondary structure when analysed by CD, it is reason- functional capacity of reduced-alphabet proteins, employing in able to postulate that theseproteins could behavelike intrinsi- vitro selection techniques such as mRNA display.[75,76] mRNA cally disordered proteins (IDPs).IDPs typically do not have a display can not only probe the libraries’diversities fully,but defined tertiary structure and regularly contain mostly random will also allow us to manipulate the selection conditions with coil. Members of this diverse family of proteins display dynam- great precision to reflect predicted early life. It will be valuable ic conformationsand can form stable structuresupon a to comparethe characteristics of functionally selected library change in chemical or physical environment,[69] such as an variants to the variants from unselected libraries described interaction with aligand or asolventlike TFE. IDPsexhibit here. The biophysical properties found in this study by testing strong CD minimaat 200 nm[70] and do not exhibit signifi- asmall number of random variantsfrom the reduced amino  cant ANS signal,[69] as was observed for most of our library var- acid libraries demonstrated that these primordial alphabets iants. Buildingonthe TFE-inducible a-helicalcharacter we ob- could have played akey role in early life, whether as autono- served for 12 of the 13 tested variants, Lu et al. used CD to mous proteins or as accessory proteinsinanRNA-dominated detect apopulation-levelchange in the secondarystructure of world.[77] We believe that our biophysical characterisationof their 15-amino-acid alphabet library in the presence of ATP proteins representing early pointsinthe genetic code’s evolu- and NADP.[37] These data support the idea that aprotein does tion lends support to the validity of hypothesised reduced not have to display discrete secondaryorcompact tertiary alphabets,and is an important first step in investigating the structuretofunction. Furthermore, in earlier work, we generat- early evolution of today’s protein-dominated world.

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 853  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

Experimental Section unique sequences, as determined by spectrophotometric quantifi- cation at 260 nm. To confirm the sequences, the final DNA libraries were digested with BsrDI and BsmI and ligated into apER13 plas- Library construction, sequencing and cloning: The libraries were mid that had been modified to introduce these restriction sites constructed from cassettes of 20 randomised codons (XXX) that into the multiple cloning site.[41,42] Where necessary,further se- represented aspecific mixture of codons according to each of the quencing was performed on variants cloned into pET28a by using four alphabets (Figure 2and Table S1). The cassettes were ligated the NcoI and BamHI restriction sites. Individual colonies were together to form the final libraries with arandomised region of 83 picked, and plasmids were subsequently isolated and sequenced amino acids flanked by the following constant regions:5’-TTACA (Beckman Coulter , Danvers, MA). ATTACAGCAA TGGGG [(XXX)20 GGG]4 ATTCC TCACC ATCAC CA-3’. The 5’ region contained atobacco mosaic translational Assaying solubility by E. coli expression and western blot: Se- enhancer;the 3’ region contained aHis6 purification tag. The rand- quence-confirmed mutants were expressed from pET28a that con- omised region contained three invariant glycine residues (GGG tained aT7promotor under control of the lac . Following codon) as aresult of cloning scars between cassettes, but glycine transformation into E. coli BL21(DE3) cells and plating onto lysoge- was part of all alphabets. ny broth agar with kanamycin, single colonies of each sequenced variant were selected and grown overnight in LB containing kana- The trimer phosphoramidites used for cassette chemical synthesis 1 were aproduct of Glen Research (VA), and 60-mer oligonucleotides mycin (36 mgmLÀ ), and then used to inoculate LB-kanamycin cul- (corresponding to 60 from 20 codons) were assembled tures (10 mL) and grown at 378C. Cultures were induced with iso- by the Keck Biotechnology Resources Center (Yale University,New propyl-b-d-thiogalactopyranoside (IPTG;1mm)atanOD600 of 0.6. Expression continued for 3hat 37 C. Cells were harvested by cen- Haven, CT) flanked by constant regions 5’-TTACA ATTACAGCAA 8 trifugation at 3000g and resuspended in BugBuster HT (Novagen) TGGGG-3’ and 5’-GGCATTCCTC ACCATCACCA-3’ at the 5’-and 3’- 1 ends, respectively.These constant regions contained the recogni- at 5mLgÀ wet cell pellet with SIGMAFAST protease inhibitor tion sites for the type IIS restriction enzymes BsrDI and BsmI. Ac- (Sigma–Aldrich, 1tablet per 500 mL). Cell cultures were lysed for cordingly,each variant’s contained aconstant N-ter- 30 min at room temperature with shaking, then the soluble frac- minal Met-Gly sequence, and aGly-Ile-Pro sequence C-terminal to tion was separated by centrifugation (21000g). Equal volumes of the randomised region of the library.The trimer codons used in soluble fraction and solubilised pellet (solubilised to original the synthesis and the desired ratios of residues in each library’s volume in 1” PBS, 250 mm NaCl, 8m urea:representing the insolu- random region are listed in Table S1. All primers and library cas- ble proteins) were analysed by SDS-PAGE (4–12%Bis·Tris, NuPage sette oligonucleotides were purified by urea-PAGE gel electropho- gels, Life Technologies) in MES buffer (Life Technologies), thereby resis. The gel bands were excised and extracted overnight into TE enabling the direct comparison of protein amounts for both solu- buffer (10 mm Tris, 1mm EDTA, pH 7.4) by using the crush-and- ble fraction and pellet on the electrophoresis gel. soak method. From these purified single-stranded DNA cassettes, For increased detection sensitivity,all electrophoresis gels were de- which encoded 20 random codon positions, an aliquot correspond- tected by western blotting with an antibody specific to the C-ter- ing to at least 1012 unique sequences was amplified by PCR. For- minal His6 tag present in all proteins. Following SDS-PAGE, the sep- ward primer 5’-GGACA ATTACTATTT AAATT ACAGC AATGG G-3’,re- arated proteins were transferred onto PVDF membranes (Roche) in verse primer 5’-TGGTG ATGGT GATGG TGAGG AATGC C-3’ and Phu- transfer buffer (25 mm Tris, 192 mm glycine, 20% v/v methanol, sion polymerase (New England Biolabs) were used for seven PCR 0.005% w/v SDS) with aPierce Power System, according to the cycles of 95 8Cfor 30 s, 60 8Cfor 20 sand 72 8Cfor 15 s. Primer con- manufacturer’s instructions. The membranes were washed in blot- centrations were 0.5 mm,dNTPs were 200 mm,and template con- ting buffer (25 mm Tris, pH 7.4, 150 mm NaCl, 0.1% v/v Tween-20) centration was 5nm.PCR products were purified by phenol/chloro- and blocked in blotting buffer with 5% w/v milk powder.Proteins form extraction, concentrated with butanol, and precipitated with were detected by using ahorseradish peroxidase (HRP)-conjugated ethanol. PCR cleanup kits (Qiagen) were used to de-salt samples mouse monoclonal anti-His6 tag antibody (Thermoscientific;1:500 prior to restriction digest. Samples were quantified by determining dilution in blotting buffer with 2% w/v milk powder) and the de- the absorbance at 260 nm, and their molecular weight was verified tection reagents from the Pierce fast western blot kit, according to by agarose gel electrophoresis. The purified PCR product was then the manufacturer’s instructions. Membranes were analysed with a divided into two equal pools, and each pool was digested with LICOR Odyssey Fc imaging system by using exposure times of 2or either BsrDI or BsmI for 4hat 658C. Following digestion and etha- 10 min. ImageJ software[43] was used to quantify band intensities in nol precipitation, products were separated on a3%agarose gel cases where variants showed partial solubility. (120 V, 60 min) to separate the desired digested library fragments from the terminal constant primer-binding region. The appropriate Large-scale expression and purification of selected soluble var- band was excised, purified (Qiagen gel extraction kit) and subject- iants: Protein variants that showed soluble over-expression in the ed to ligation with T4 DNA ligase in the commercial reaction buffer small-scale solubility assays (above) were also expressed in 1Lcul- (New England Biolabs). The ligation was carried out for 30 min at tures to obtain sufficient amounts of protein for biophysical char- room temperature. Ligation concentrations were 0.2 mm DNA, acterisation. The conditions for expression, harvest and lysis were 1 1mm ATPand 20 UmLÀ T4 DNA ligase. Following ligation, the otherwise identical to those for the small-scale screen. The soluble products were concentrated in avacuum concentrator at the fractions were diluted 1:1into purification buffer (10 mm sodium medium setting and subsequently resolved on a3%agarose gel, phosphate, 150 mm NaCl, 5mm imidazole, 0.5 mm b-mercapto- excised and purified (Qiagen gel extraction kit). The ligated prod- ethanol, pH 7.4) and mixed with pre-washed and equilibrated Ni- uct, consisting of two joined cassettes of 20 codons each, was NTAresin (300 mL, Life Technologies). Following 30 min of rotated then subjected to asecond round of PCR amplification, digestion incubation with the resin at 48C, the supernatant and resin were and ligation to build up afinal library containing 80 trinucleotides transferred to a10mLchromatography column (Bio-Rad). Column (four cassettes of 20 codons each). Identical PCR, purification, washes were performed at 48Cwith 2” 3column volumes of pu- digestion and ligation protocols were followed for this second rification buffer.Elution was conducted in astep-wise fashion with round. Each completed library represented greater than 2”1012 CD buffer (0.5 mL, 10 mm sodium phosphate, 150 mm NaF,pH7.4)

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 854  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers increasing from 20 to 50, 100, 150 and 250 mm imidazole. Frac- grants from the US National Aeronautics and Space Administra- tions were assessed for the presence of protein by SDS-PAGE. Frac- tion (NASA) Agreement (NNX14AK29G), the Simons Foundation tions containing protein were pooled and dialysed twice for 12 h (340762), the Minnesota MedicalFoundation (4036–9663-10), the against CD buffer (2 L) to remove imidazole. Following dialysis, University of Minnesota Biocatalysis Initiative, and the Office of protein samples were spin concentrated (Amicon Ultra, 3000 MW), the VP of Researchatthe University of Minnesota (Grant-in-Aid). and the concentration was determined by bicinchoninic acid (BCA) assay.[44] Protein samples were stored in CD buffer at 48C. Characterisation of the biophysical properties of protein Conflict of Interest variants The authors declare no conflict of interest. Determination of secondary structure content by circular dichroism: CD spectra were taken on aJASCO J-185 CD spectropolarimeter. 1 Proteins were scanned from 260 to 190 nm at arate of 50 nmsÀ Keywords: genetic code · origin of proteins · primordial for ten accumulations and four scans. All protein samples (10 mm) peptides · protein libraries were read with a1mm path-length cuvette at 238C. All spectra were baseline corrected against the CD spectrum of the buffer [1] D. Voet, J. G. Voet, Biochemistry,3rd ed. 2004,Wiley,Hoboken, pp. 80– only.TFE (Sigma–Aldrich) was added to samples to final concentra- 126. tions of 25 and 50%. In the presence of TFE, samples were allowed [2] J. E. Cone, R. M. Del Rio, J. O. E. N. Davis, T. C. Stadtman, Proc. Natl. Acad. to equilibrate at 238Cfor 15 min. Sci. USA 1976, 73,2659–2663. [3] M. A. Gaston, L. Zhang,K.B.Green-Church, J. A. Krzycki, Nature 2011, The measured ellipticities ([q]obs [mdeg]) were converted to mean 471,647–650. 2 1 residue ellipticities ([q][degcm dmolÀ ]) according to Equation (1): [4] F. H. C. Crick, J. Mol. Biol. 1968, 38,367–379. [5] E. V. Koonin, A. S. Novozhilov, IUBMB Life 2009, 61,99–111. MRW [6] J. Wong, S.-K. Ng, W.-K. Mat, T. Hu, H. Xue, Life 2016, 6,12. q q 1 ½ Š ¼½ Šobs  10 lc ð Þ [7] M. Di Giulio, J. Mol. Evol. 2016, 83,93–96.   [8] M. Eigen, P. Schuster, Naturwissenschaften 1978, 65,341–369. [9] S. L. Miller, Science 1953, 117,528 –529. MRWisthe mean residue molecular weight of the (calcu- [10] A. P. Johnson, H. J. Cleaves, J. P. Dworkin, D. P. Glavin, A. Lazcano, J. L. lated by dividing the molecular weight of the peptide by the Bada, Science 2008, 322,404 –404. number of residues), l is the optical path length of the cell [cm], [11] M. Eigen, P. Schuster, Naturwissenschaften 1977, 64,541 –565. 1 [12] S. Osawa, T. H. Jukes, K. Watanabe,A.Muto, Microbiol. Rev. 1992, 56, and c is the concentration of the peptide [mgmLÀ ]. The degree of helical structure (% a-helix) was calculated according to Equa- 229–264. [13] G. K. Philip, S. J. Freeland, Astrobiology 2011, 11,235–240. tion (2): [14] T. A. Ronneberg, L. F. Landweber,S.J.Freeland, Proc. Natl. Acad.Sci. USA 2000, 97,13690 –13695. q 100 % a-helix ½ Š222  2 [15] S. A. Sandford, J. AlØon, C. M. O’D. Alexander,T.Araki, S. Bajt, G. A. Barat- ¼ 40 000 1 2:5=n ð Þ ta, J. Borg, J. P. Bradley, D. E. Brownlee, J. R. Brucato, et al., Science 2006, À ð À Þ 314,1720–1724.

Here [q]222 is the mean residue ellipticity at 222 nm, and n the [16] C. D. K. Herd,A.Blinova, D. N. Simkus, Y. Huang, R. Tarozo, C. M. O’D. number of the residues in the peptide.[45] Alexander,F.Gyngard, L. R. Nittler,G.D.Cody,M.L.Fogel, et al., Science 2011, 332,1304–1307. Assessment of protein folding by ANS fluorescence: ANS fluores- [17] H. J. Cleaves II, J. Theor.Biol. 2010, 263,490 –498. cence was measured in aMolecular Devices SpectraMax M5 spec- [18] H. J. Cleaves,S.L.Miller, Proc. Natl. Acad. Sci. USA 1998, 95,7260 –7263. trofluorometer.With the exception of the size standard bovine [19] E. N. Trifonov, J. Biomol.Struct. Dyn. 2004, 22,1–11. serum albumin (BSA), samples were mixed in 200 mLvolumes con- [20] P. G. Higgs, R. E. Pudritz, Astrobiology 2009, 9,483–490. taining 10 mm protein and 500 mm ANS dissolved in CD buffer.Due [21] S. V. Taylor,K.U.Walter,P.Kast, D. Hilvert, Proc. Natl. Acad. Sci. USA to its greater mass and presumed greater surface area, BSA 1.5 mm, 2001, 98,10596 –10601. [22] S. Akanuma, T. Kigawa, S. Yokoyama, Proc. Natl. Acad. Sci. USA 2002, 99, was mixed in 200 mLwith 500 mm ANS. Baseline spectra were re- 13549 –13553. corded with CD buffer only.After incubation at room temperature [23] R. Shibue, T. Sasamoto, M. Shimada, B. Zhang, A. Yamagishi, S. Akanu- for 30 min, samples were excited at 385 nm, and fluorescence was ma, Sci. Rep. 2018, 8,1227. recorded from 400 to 600 nm. Six readings were taken per sample [24] S. Kamtekar,J.M.Schiffer,H.Xiong, J. M. Babik, M. H. Hecht, Science with the scan speed set to normal. 1993, 262,1680 –1685. [25] K. U. Walter,K.Vamvaca,D.Hilvert, J. Biol. Chem. 2005, 280,37742 – Determining oligomeric state by size-exclusion chromatogra- 37746. phy: Gel filtration chromatography was performed by using aSu- [26] M. T. Reetz, S. Wu, Chem. Commun. 2008,5499 –5501. perdex 75 resin (GE Healthcare) in a10mm”300 mm column (Tri- [27] Z. Sun, R. Lonsdale,X.D.Kong, J. H. Xu, J. Zhou,M.T.Reetz, Angew. corn) on the ¾KTAFPLC system (GE Healthcare). Protein samples in Chem. Int. Ed. 2015, 54,12410 –12415; Angew.Chem. 2015, 127,12587 – CD buffer were injected by using a10mLloop with an isocratic 12592. 1 elution at aflow rate of 1mLminÀ . [28] E. L. Peterson, J. Kondev,J.A.Theriot, R. Phillips, Bioinformatics 2009, 25,1356 –1362. [29] B. Seelig, J. W. Szostak, Nature 2007, 448,828–831. [30] A. D. Keefe, J. W. Szostak, Nature 2001, 410,715–718. Acknowledgements [31] A. Szilµgyi, .Kun, E. Szathmµry, Biol. Direct 2012, 7,38. [32] R. Jensen, Annu. Rev.Microbiol. 1976, 30,409 –425. We thank Dr.Maureen Quin for assistance with the western [33] M. V. Golynskiy,B.Seelig, Trends Biotechnol. 2010, 28,340–345. [34] C. Neylon, NucleicAcids Res. 2004, 32,1448 –1459. blots, and Prof. Romas Kazlauskas and Fredarla Miller for their [35] A. L. Kayushin,M.D.Korosteleva, A. I. Miroshnikov,W.Kosch, D. Zubov, comments on the manuscript. This work was fundedinpart by N. Piel, Nucleic Acids Res. 1996, 24,3748 –3755.

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 855  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim Full Papers

[36] S. Kang, B. Chen,T.Tian,X.Jia, X. Chu, R. Liu, P. Dong,Q.Yang, H. [56] V. Tretyachenko, J. Vymeˇtal, L. Bednµrovµ,V.Kopecky´,Jr.,K.Hofbauer- Zhang, Biochem. Biophys. Res. Commun. 2015, 466,400–405. ovµ,H.Jindrovµ,M.Hubµlek, R. Soucˇek, J. Konvalinka, J. Vondrµsˇek, K. [37] M.-F.Lu, Y. Xie, Y.-J. Zhang, X.-Y.Xing, ProteinPept. Lett. 2015, 22,579 – Hlouchovµ, Sci. Rep. 2017, 7,15449. 585. [57] I. D. Prijambada,T.Yomo, F. Tanaka,T.Kawama, K. Yamamoto, A. Hase- [38] N. Doi, K. Kakukawa, Y. Oishi,H.Yanagawa, Protein Eng. Des. Sel. 2005, gawa, Y. Shima, S. Negoro, I. Urabe, FEBSLett. 1996, 382,21–25. 18,279–284. [58] N. J. Greenfield, Nat. Protoc. 2006, 1,2876 –2890. [39] J. Tanaka, N. Doi, H. Takashima, H. Yanagawa, Protein Sci. 2010, 19,786 – [59] C. N. Pace, J. M. Scholtz, Biophys. J. 1998, 75,422–427. 795. [60] A. M. Facchiano, G. Colonna, R. Ragone, ProteinEng. 1998, 11,753 –760. [40] S. Kumachi, Y. Husimi, N. Nemoto, ACS Omega 2016, 1,52–57. [61] J. Richardson, D. Richardson, Science 1988, 240,1648 –1652. [41] T. Seitz, R. Thoma, G. A. Schoch, M. Stihle, J. Benz, B. D’Arcy,A.Wiget,A. [62] K. Fan, W. Wang, J. Mol. Biol. 2003, 328,921 –926. Ruf, M. Hennig, R. Sterner, J. Mol. Biol. 2010, 403,562–577. [63] L. R. Murphy,A.Wallqvist, R. M. Levy, Protein Eng. 2000, 13,149 –152. [42] M. V. Golynskiy,J.C.Haugner,B.Seelig, ChemBioChem 2013, 14,1553 – [64] C. E. Schafmeister,S.L.LaPorte, L. J. W. Miercke, R. M. Stroud, Nat. Struct. 1563. Biol. 1997, 4,1039 –1046. [43] C. A. Schneider,W.S.Rasband, K. W. Eliceiri, Nat. Methods 2012, 9,671 – [65] M. M. Müller,J.R.Allison,N.Hongdilokkul,L.Gaillon, P. Kast,W.F.van Gunsteren, P. Marli›re,D.Hilvert, PLoS Genet. 2013, 9,e1003187. 675. [66] G. L. Holliday, J. B. O. Mitchell,J.M.Thornton, J. Mol. Biol. 2009, 390, [44] P. K. Smith,R.I.Khrohn, G. T. Hermanson, A. K. Mallia, F. H. Gatner,M.D. 560–577. Provenzano, E. K. Fujimoto,N.M.Goeke, B. J. Olson, D. C. Klenk, Anal. [67] M. Ilardo, M. Meringer,S.Freeland, B. Rasulev,H.J.Cleaves, H. J. Cleaves Biochem. 1985, 150,76–85. II, Sci. Rep. 2015, 5,9414. [45] Y. H. Chen, J. T. Yang,K.H.Chau, Biochemistry 1974, 13,3350 –3359. [68] P. Lo Surdo, M. A. Walsh, M. Sollazzo, Nat. Struct. Mol. Biol. 2004, 11, [46] D. S. Riddle, J. V. Santiago, S. T. Bray-Hall, N. Doshi, V. P. Grantcharova, Q. 382–383. Yi,D.Baker, Nat. Struct. Biol. 1997, 4,805 –809. [69] J. D. Forman-Kay,T.Mittag, Structure 2013, 21,1492 –1499. [47] G. Cho, A. D. Keefe, R. Liu, D. S. Wilson,J.W.Szostak, J. Mol. Biol. 2000, [70] L. B. Chemes, L. G. Alonso,M.G.Noval, G. de Prat-Gay, Methods Mol. 297,309–319. Biol. 2012, 895,387 –404. [48] E. Gasteiger,C.Hoogland, A. Gattiker,S.Duvaud, M. R. Wilkins, R. D. [71] B. Seelig, Nat. Protoc. 2011, 6,540–552. Appel, A. Bairoch in Proteomics Protocols Handbook (Ed.:J.M.Walker), [72] C. M. Rufo, Y. S. Moroz, O. V. Moroz,J.Stçhr,T.A.Smith,X.Hu, W. F. De- Humana, Totowa, 2005,pp. 571 –607. Grado, I. V. Korendovych, Nat. Chem. 2014, 6,303 –309. [49] F. D. Sçnnichsen, J. E. VanEyk,R.S.Hodges, B. D. Sykes, Biochemistry [73] W. M. Patrick, E. M. Quandt, D. B. Swartzlander,I.Matsumura, Mol. Biol. 1992, 31,8790 –8798. Evol. 2007, 24,2716 –2722. [50] K. Gast, D. Zirwer,M.Müller-Frohne, G. Damaschun, Protein Sci. 1999, 8, [74] M. P. Ferla, J. L. Brewster, K. R. Hall, G. B. Evans,W.M.Patrick, Mol. Micro- 625–634. biol. 2017, 105,508 –524. [51] H. Reiersen, A. R. Rees, Protein Eng. 2000, 13,739–743. [75] R. W. Roberts, J. W. Szostak, Proc. Natl. Acad. Sci. USA 1997, 94,12297 – [52] A. Morelli, J. Haugner,B.Seelig, PLoS One 2014, 9,e112028. 12302. [53] F.-A. Chao, A. Morelli, J. C. Haugner, L. Churchfield, L. N. Hagmann, L. [76] N. Nemoto, E. Miyamoto-Sato, Y. Husimi,H.Yanagawa, FEBS Lett. 1997, Shi, L. R. Masterson, R. Sarangi, G. Veglia, B. Seelig, Nat. Chem. Biol. 414,405–408. 2013, 9,81–83. [77] H. S. Bernhardt, Biol. Direct 2012, 7,23. [54] Y. Shimizu, A. Inoue, Y. Tomari, T. Suzuki, T. Yokogawa, K. Nishikawa, T. Ueda, Nat. Biotechnol. 2001, 19,751–755. Manuscript received:November 1, 2018 [55] K. Amikura,Y.Sakai, S. Asami, D. Kiga, ACS Synth. Biol. 2014, 3,140 – Accepted manuscript online:December3,2018 144. Version of record online:February 15, 2019

ChemBioChem 2019, 20,846 –856 www.chembiochem.org 856  2019 Wiley-VCH Verlag GmbH &Co. KGaA, Weinheim