http://www.paper.edu.cn

PROTEINS: Structure, Function, and Genetics 44:119–122 (2001)

Silk Fibroin: Structural Implications of a Remarkable Amino Acid Sequence

Cong-Zhao Zhou,1,2,3 Fabrice Confalonieri,1 Michel Jacquet,1 Roland Perasso,2 Zhen-Gang Li,3 and Joel Janin4*
1 Institut de Génétique et Microbiologie, Université Paris-Sud et CNRS, Orsay Cedex, France
2 Laboratoire de Biologie Cellulaire 4, Université Paris-Sud et CNRS, Orsay Cedex, France
3 Department of Biology, University of Science and Technology of China, Hefei, Anhui, People’s Republic of China
4 Laboratoire d’Enzymologie et Biochimie Structurales, CNRS, Gif-sur-Yvette, France

ABSTRACT The amino acid sequence of the heavy chain of Bombyx mori silk fibroin was derived from the gene sequence. The 5,263-residue (391-kDa) polypeptide chain comprises 12 low-complexity “crystalline” domains made up of Gly–X repeats and covering 94% of the sequence; X is Ala in 65%, Ser in 23%, and Tyr in 9% of the repeats. The remainder includes a nonrepetitive 151-residue header sequence, 11 nearly identical copies of a 43-residue spacer sequence, and a 58-residue C-terminal sequence. The header sequence is homologous to the N-terminal sequence of other fibroins with a completely different crystalline region. In Bombyx mori, each crystalline domain is made up of subdomains of ~70 residues, which in most cases begin with repeats of the GAGAGS hexapeptide and terminate with the GAAS tetrapeptide. Within the subdomains, the Gly–X alternation is strict, which strongly supports the classic Pauling–Corey model, in which β-sheets pack on each other in alternating layers of Gly/Gly and X/X contacts. When fitting the actual sequence to that model, we propose that each subdomain forms a β-strand and each crystalline domain a two-layered β-sandwich, and we suggest that the β-sheets may be parallel, rather than antiparallel, as has been assumed up to now. Proteins 2001;44:119–122. © 2001 Wiley-Liss, Inc.

Key words: Bombyx mori; low-complexity sequences; β-sheet packing; fiber structure

Grant sponsor: Programme des Recherches Avancées Franco-Chinois; Grant number: BT97-05.
*Correspondence to: Joël Janin, LEBS-CNRS, 91198 Gif-sur-Yvette, [email protected]
Received 3 November 2000; Accepted 5 March 2001
Published online 00 Month 2001

INTRODUCTION

The long and robust protein fiber that makes up the cocoon of the Bombyx mori caterpillar has been used by man for more than 3,000 years to produce high-quality thread and cloth. Their remarkable properties originate in the physical chemistry of silk fibroin, the protein that constitutes almost all the fiber. Fibroin is synthesized in large quantity by the silk gland of the caterpillar as two polypeptide chains linked by a disulfide bridge. The larger heavy chain is glycine-rich, and most of its sequence is a repeat of Gly–Ala/Ser dipeptides. The silk fiber was known as early as 1913 to diffract x-rays.[1] Its diffraction pattern is characteristic of a pleated β-sheet, and it helped Pauling and Corey in defining this type of secondary structure.[2] The Pauling–Corey model of silk fibroin[3] is revisited here within the context of the recently determined protein sequence of the fibroin heavy chain deduced from that of the genomic DNA.[4]

SEQUENCE AND COMPOSITION OF THE FIBROIN HEAVY CHAIN

After the early studies mentioned above and reviewed by Lucas et al.,[1] Bombyx silk fibroin received little attention from biochemists. The sequence of the 26-kDa light chain, of a short internal fragment, and of an 85-residue C-terminal sequence of the heavy chain were determined.[5–7] The light chain, which is linked to the heavy chain by a single disulfide bridge, has a standard amino acid composition and a nonrepetitive sequence. It plays only a marginal role in the fiber.

The complete amino acid sequence of the heavy chain has now been deduced from that of an 80-kbp fragment of Bombyx mori genomic DNA comprising the 17-kbp fibroin gene (accession codes: GenBank AF226688; EMBL P05790).[4] The gene codes for a polypeptide chain of 5,263 residues with a molecular weight of 391 kDa, composed of 45.9% glycine, 30.3% alanine, 12.1% serine, 5.3% tyrosine, 1.8% valine, and only 4.7% of the other 15 amino acid types.

Most of the sequence is low-complexity and forms 2,377 repeats of a Gly–X (GX) dipeptide motif. The GX repeat is the building block of the β-sheets in the Pauling–Corey model and of the whole fiber. The sequence contains very long stretches of GX repeats that must constitute the x-ray diffracting structure, often called the “crystalline” component of silk fibroin. Residue X is Ala in 64% of the repeats, Ser in 22%, Tyr in 10%, Val in 3%, and Thr in 1.3% of the repeats. Essentially none of the other 14 amino acid types is present in the repeats. In 2% of the dipeptides, the first position is an alanine instead of glycine. This occurs almost exclusively in the GAAS tetrapeptide, which is repeated 41 times. Moreover, two hexapeptides, s = GAGAGS, in 433 copies, and y = GAGAGY, in 120 copies, account for 70% of the low-complexity region. Their abundance was known from partial sequences and it was suggested that fibroin might be made up entirely of these



Fig. 1. The fibroin heavy chain. The 5,263-residue polypeptide chain (EMBL P05790) is broken into domains and subdomains. The sequence number of the first residue and the length l of each domain are given in parentheses. The 25-residue motif in boldface characters is repeated between the header and the linkers. A one-residue (or three-residue) insertion in subdomain GX9.4 is also in boldface characters. Lowercase letters s, y, a, and µ represent frequently observed hexapeptides. Hexapeptide code and number of copies:

Code   Hexapeptide   Copies
s      GAGAGS        433
y      GAGAGY        120
a      GAGAGA         27
µ      GAGYGA         39
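The copy numbers listed with Fig. 1 can be checked against the statement in the text that the s and y hexapeptides account for about 70% of the low-complexity region. The following is a minimal sketch, assuming the low-complexity region is simply the 2,377 Gly–X dipeptides quoted in the text (4,754 residues). The function names `coverage` and `encode` are hypothetical helpers introduced here for illustration, and the letter `u` stands in for the fourth code letter of the figure.

```python
# Hexapeptide codes and copy numbers as listed with Fig. 1.
# "u" is a stand-in for the fourth code letter used in the figure.
HEXAPEPTIDES = {
    "s": ("GAGAGS", 433),
    "y": ("GAGAGY", 120),
    "a": ("GAGAGA", 27),
    "u": ("GAGYGA", 39),
}

# Number of Gly-X dipeptide repeats quoted in the text (assumed here to
# define the whole low-complexity region).
GX_DIPEPTIDES = 2377


def coverage(codes):
    """Fraction of low-complexity residues covered by the given code letters."""
    covered = sum(len(HEXAPEPTIDES[c][0]) * HEXAPEPTIDES[c][1] for c in codes)
    return covered / (2 * GX_DIPEPTIDES)


def encode(sequence):
    """Greedy left-to-right recoding of a residue string into the
    lowercase hexapeptide code; unmatched residues are kept as-is."""
    motif_to_code = {motif: code for code, (motif, _) in HEXAPEPTIDES.items()}
    out, i = [], 0
    while i < len(sequence):
        window = sequence[i:i + 6]
        if window in motif_to_code:
            out.append(motif_to_code[window])
            i += 6
        else:
            out.append(sequence[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(f"s + y coverage: {coverage('sy'):.3f}")
    print(f"all four codes: {coverage('syau'):.3f}")
    print(encode("GAGAGSGAGAGYGAAS"))
```

The greedy recoding is only one possible convention; the figure itself may resolve overlapping motifs differently, so the sketch should be read as an illustration of the notation rather than a reproduction of Fig. 1.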

hexapeptides.[6,8] Figure 1 shows that this is incorrect in spite of the hexapeptide abundance, and that fibroin has a more elaborate primary structure, as discussed below.

THE NONREPETITIVE OR “AMORPHOUS” SEGMENTS

The GX repeats that form the bulk of the polypeptide chain are distributed among 12 domains separated by short linkers (Fig. 1). In contrast to the domains, the N-terminal 151 residues, C-terminal 50 residues, and the 42–43 residue linkers between domains are nonrepetitive and “amorphous,” as opposed to the “crystalline” domains. The N-terminal segment or header has a standard amino acid composition and may constitute an independent globular unit, possibly with some α-helix as well as a β-sheet. Related sequences (33–38% identity) are found at the N-terminus of the fibroins of the moths Galleria mellonella and Antheraea pernyi (Fig. 2). The cocoon of the latter is used in Asia to produce tussah silk. Antheraea fibroin is otherwise completely different from Bombyx. It yields a different type of fiber, and the bulk of its 2,639-residue sequence (EMBL O76786), although low-complexity, is more alanine-rich than glycine-rich.[9] A BLAST search against the Drosophila genome also suggests the presence of a segment related to these N-terminal sequences in the CG18026 gene product, but the score is much lower (27% identity over 136 residues), and the function is unknown.

The 58-residue C-terminal segment of the Bombyx heavy chain is arginine/lysine-rich and depleted in hydrophobic residues, which does not suggest a globular fold. It contains three cysteine residues involved in two disulfide bonds,[7] one with the light chain, the other internal. The 11 linker segments connecting the crystalline domains have nearly identical sequences, including a 25-residue nonrepetitive peptide (boldface characters in Fig. 1), also present in the header sequence in a truncated version. The peptide breaks the GX alternation and terminates the crystalline


domains. It has a proline, charged residues, and the only tryptophan residue found in fibroin. Charged residues are entirely absent from the crystalline domains.

Fig. 2. The header sequence. Secondary structure prediction by PHD and an alignment of the N-terminal sequences of silk fibroins from Bombyx mori (EMBL P05790), Galleria mellonella (EMBL AF095239), and Antheraea pernyi (EMBL O76786). E, extended; H, helical.

PUNCTUATIONS IN THE CRYSTALLINE REGION

The 12 crystalline domains are labeled GX1–GX12 in Figure 1. Their average length is 413 residues, omitting GX12, which is much shorter (37 residues). Within a domain, the GX alternation is perturbed only by the occasional presence of a GAAS tetrapeptide, and, in domain GX9, by a single-residue insertion at position 4108. The insertion changes the phase of the alternation and must perturb the β-sheet packing. In contrast, the GAAS tetrapeptide maintains the phase while introducing a punctuation in the crystalline domains. In the gene, all GAAS tetrapeptides use the same codons, forming the same 12-base pair (bp) element.[4] Taking GX2 as an example, we find that the 511 residues of this domain are distributed into six subdomains beginning with a stretch of s = GAGAGS hexapeptides and ending in GAAS. The six subdomains are obviously related to each other, and it is most likely that the whole domain derives from successive duplications.

The same pattern of subdomains and duplications is seen in domains GX2 to GX11, except that GAGS or GTGS sometimes replace GAAS as a punctuation (Fig. 1). In the gene, these tetrapeptides all have a distinctive codon usage.[4] Domain GX1 can also be viewed as being made of subdomains ending in GAAS, but their sequence is divergent and four of them lack GAGAGS repeats. In total, the heavy chain comprises some 64 subdomains, each ~70 residues in length. Remarkably, the 70-residue segmentation of the protein sequence also shows up at the DNA level, where a repeating sequence unit of ~208 bp is observed, possibly the size of the DNA in a nucleosome.[4]

A MODEL OF THE SILK STRUCTURE: PAULING AND COREY REVISITED

Based on the diffraction pattern and the peculiar amino acid composition of silk, Pauling and Corey and their collaborators proposed that the fiber is made of antiparallel β-sheets packing on top of each other.[3] The β-strands extend along the fiber axis, yielding a 7.0-Å axial repeat containing two peptide units. There is also a repeat of ≈9.5 Å across the fiber in the diffraction pattern. It is interpreted as representing, in one direction, twice the spacing between β-strands in an antiparallel β-sheet, and in the orthogonal direction, twice the spacing between packed β-sheets. The Pauling–Corey model, elaborated upon by Crick and Kendrew,[10] takes note of the fact that, in a β-strand made of Gly–Ala/Ser repeats, all the nonglycine residues have their side-chains on the same face of the β-sheet. The β-sheets may then pack on each other alternately by their glycine face and by their Ala/Ser face. The first packing yields a short 3.5–3.9 Å spacing, the second a longer one of 5.3–5.7 Å. This type of packing was later observed in crystalline poly-(Ala–Gly).[11]

The Pauling–Corey model was challenged in a recent study of the Bombyx silk fiber pattern. In an attempt to interpret the diffracted intensities quantitatively, Takahashi et al.[12] tested four ways of assembling β-strands and packing β-sheets. Adjacent strands within a β-sheet may either be parallel or antiparallel, and their side-chains may either all point in the same direction (polar mode) or alternately up and down (antipolar mode). With this convention, the Pauling–Corey model is polar–antiparallel. It yields the best fit to a set of 26 diffracted intensities assuming the fibroin sequence to be just poly-(Ala–Gly). Nevertheless, the authors went on to consider effects of disorder and of the presence of serine residues on the calculated diffraction pattern of other types of assemblies. Their conclusions favor an antipolar–antiparallel model with Ala/Ser side-chains pointing alternately up and down across a β-sheet. As antipolar β-sheets do not have a glycine-only face, their packing is less regular than in the Pauling–Corey model. The fit to the diffraction data depends on a number of assumptions, on the sequence for instance, and none of the alternative models can be entirely excluded.

FOLDING THE HEAVY CHAIN

Folding a 5,263-residue polypeptide chain must take place in several steps, whether conceptually or in reality. The actual sequence forbids fitting the packed β-sheet model to the whole fibroin molecule, but the model is plausible for the crystalline domains, which are large units able to form several β-sheets each. We suggest that each subdomain forms a β-strand, ~66 residues or 200 Å long, connected to the next β-strand by a four-residue β-turn at the GAAS boundary tetrapeptide. These β-strands are much longer than in globular proteins, where they rarely exceed 15 residues, but the strong diffraction pattern of silk suggests long β-strands. A crystalline domain will then comprise either one β-sheet or two β-sheets packing on top of each other. The latter is more likely in GX1, GX2, GX5, GX7, GX8, and GX11, which contain six or more subdomains. Each one of these domains could constitute a structural unit made of two layers of three-stranded


β-sheets, with approximate dimensions of 10 × 15 × 200 Å. A thicker and broader structure can then be created by packing domains side by side and layer by layer both within and between fibroin chains.

The next question is the polarity of the β-strands and β-sheets. The strict GX alternation observed within subdomains implies an extreme selective pressure against substituting glycines. This is more easily understood in the polar mode, where the β-sheets have one face devoid of side-chains, than in an antipolar mode, where there are side-chains on both faces. On the other hand, the sequence has an obvious N-to-C directionality, with GAGAGS repeats at the N-terminus of the subdomains, and the rarer tyrosine, valine, and threonine residues near the C-terminus. In an antipolar mode of β-sheet assembly, no regular packing of the large side-chains of these residues can be envisaged. In an antiparallel β-sheet, the large side-chains of one β-strand are at one end of the β-sheet, those of adjacent β-strands at the opposite end. On the other hand, a polar–parallel assembly has all the GAGAGS hexapeptides at one end of the β-sheet, all the large side-chains clustering together at the other end, and a regular packing can be expected.

A natural way to fold a protein into parallel β-sheets is to build a two-layered solenoid structure, with successive β-strands alternating between the top and bottom layer. This type of assembly, which minimizes the conformation search made by the folding polypeptide chain, is common in globular all-β proteins. The transverse repeat of two β-strands observed in silk fibroin diffraction argues against parallel β-sheets, because their repeating unit is a single β-strand. However, only poly-(Ala–Gly) has a one-strand repeat, strictly speaking. In the actual sequence of the protein, the spatial distribution of the larger side-chains of serine and tyrosine may explain the larger unit cell. The fibroin diffraction pattern has weak reflections at low angles that cannot be indexed on the poly-(Ala–Gly) unit cell, and some are attributed to the presence of serine in GAGAGS hexapeptides.[12]

CONCLUSIONS

The Gly–X alternation is a remarkable feature of the sequence of the Bombyx mori silk fibroin heavy chain, where it is maintained over the long stretches that constitute the crystalline domains. Whereas this alternation is the basis of the Pauling–Corey model, the model cannot easily be fitted, as the actual sequence shows an additional level of repetition above the Gly–X unit, the subdomain, and a marked N-to-C directionality. Building a realistic model of the crystalline domains and of their association into a fiber will require far more experimental information than can be derived from the fiber diffraction pattern alone.

REFERENCES

1. Lucas F, Shaw JBT, Smith SG. The silk fibroins. Adv Protein Chem 1958;13:108–244.
2. Pauling L, Corey RB. Proc R Soc Lond 1953;B141:21.
3. Marsh RE, Corey RB, Pauling L. Biochim Biophys Acta 1955;16:1–34.
4. Zhou CZ, Confalonieri F, Medina N, Zivanovic Y, Esnault C, Yang T, Jacquet M, Janin J, Perasso R, Li ZG. Fine organization of Bombyx mori fibroin heavy chain. Nucleic Acids Res 2000;28:2413–2419.
5. Yamaguchi K, Kikuchi Y, Tagaki T, Kikuchi A, Oyama F, Shimura S, Mizuno S. Primary structure of the silk fibroin light chain determined by cDNA sequencing and peptide analysis. J Mol Biol 1989;210:127–139.
6. Gage LP, Manning RF. Internal structure of the silk fibroin gene of Bombyx mori. I. The fibroin gene consists of a homogeneous alternating array of repetitious crystalline and amorphous sequences. J Biol Chem 1980;255:9444–9450.
7. Tanaka K, Kajiyama N, Ishikura K, Waga S, Kikuchi A, Ohtomo K, Takagi T, Mizuno S. Determination of the site of the disulfide linkage between heavy and light chains of silk fibroin produced by Bombyx mori. Biochim Biophys Acta 1999;1432:92–103.
8. Mita K, Ichimura S, Zama M, James TC. Specific codon usage pattern and its implication on the secondary structure of silk fibroin mRNA. J Mol Biol 1988;203:917–925.
9. Sezutsu H, Tamura T, Yukuhiro K. Characterisation of the full length fibroin gene of a wild silkworm Antheraea pernyi (submitted).
10. Crick FHC, Kendrew J. Adv Protein Chem 1957;12:160.
11. Lotz B, Brack A, Spack G. β-Structure of periodic copolypeptides of L-alanine and glycine. J Mol Biol 1974;87:193–203.
12. Takahashi Y, Gehoh M, Yuzuhira K. Structure refinement and diffuse streak scattering of silk (Bombyx mori). Int J Biol Macromol 1999;24:127–138.