RNA folding and Theory

Henri Orland (IPhT,CEA, Saclay) Collaboration with

• A. Zee (KITP, UCSB) and • G. Vernizzi (SPhT, Saclay) • M. Bon (Saclay) Outline

• Review of basic properties of RNA • Secondary structures • Matrix field theory for RNA • Topological classification of RNA • Exact enumeration of RNA structures • Monte Carlo approach Review of basic properties of RNA

• RNA is a biopolymer – RNA (length ~ 70–3000): single stranded – DNA (length ~ 106–109): double stranded – Proteins (length ~ 102) – Polysaccharides (length ~ 103) Several forms of RNA

• Messenger : mRNA (L ~ 1000) (only 5% of RNA) • Transfer: tRNA (L ~ 70) • Ribosomal: rRNA (L ~ 3000) • Micro: µRNA (L ~ 25) • Huge amounts of non-coding RNA in “junk” DNA: up to 80% Why does the 3d structure of RNA matter? Important discovery in the 80s: RNA can have enzymatic activity Important discovery since 2000: some RNAs play crucial role in cell regulation

Function strongly related to shape

Must know 3d structure of RNA Chemistry of RNA

• RNA is a single- stranded heteropolymer • Four bases: – Adenine (A) – Guanine (G) – Cytosine (C) – Uracil (U) The sugar phosphate backbone polymerizes into a single stranded charged (-) polymer Energy scales

• Crick-Watson: conjugate pairs C – G A – U Pairing due to Hydrogen bonds between bases RNA folding Stacking of aromatic groups Electrostatics (Mg++ ions) controls 3d structure Energy scales

C — G : 3kCal/mole = 5 kT A — U : 2kCal/mole = 3.3 kT G — U : 1kCal/mole =1.6 kT

300 K = 0.6 kCal/mole = 1/40 eV Base pairing

• Induces helical strands (like in DNA) • Induces secondary structure of RNA

RNA folding problem: determine which bases are paired 11 Pictures of RNA

Transfer RNA Hammerhead Ribosomal RNA Ribozyme

13 Secondary structures elements

• In RNA, there are helical stems with loops • and bulges

14 Pseudo-knots in RNA • In addition to planar structure, • there are “pseudo-knots” which constrain the 3d structure

the Kissing Hairpin Loop-bulge

The H-hairpin

• 3d folding controlled by concentration of Mg ions. In fact base pairing is not good enough: need also stacking energies.

However: • saturation of Crick-Watson pairing • experimental observation: number of bases in pseudo-knot << number of bases in planar secondary structure (less than 10%) RNA folding much easier than protein folding Further simplifications: – Saturation of interactions – Watson-Crick pairing

Define

Base pair energy

• Approximation Chain rigidity

sterically allowed configurations where

• sum is mainly combinatorial • any index appears once and only once (saturation) • In using this partition function, we have not taken into account the entropy of loops. • For a loop of size l, the entropy is S = l log µ c log l − • In fact the log µ goes into the free energies of pairing, so that S = c log l • with c = 3 / 2 (Gaussian− chain) • c = 1 . 75 (Self Avoiding Walk)

19 Z = 1 + V + V V + + V V + (2) ij ij kl · · · ik jl · · · ! where < ij > denotes all pairs with j > i, < ijkl > all quadruplets with l > k > j > i, and so on. Then the partition function is given by

L L 1 − = d3!r f( !r !r ) Z (3) Z k | i+1 − i| i=1 " k#=1 # The function f(r) can be taken to be, for example, δ(r l) for a model in which the nucleotides are connected along (r−l)2/6σ2 the RNA heteropolymer by rigid rods of length l, or e− − for a model with elastic rods. Note that the saturation of the hydrogen bond has been incorporated by the requirement l > k > j > i, and so on. Once the nucleotide at i has interacted with the nucleotide at j it cannot interact with the nucleotide at k . Note that in (2), only the enthalpy and combinatorics of pairings are included. The integration over the atomic coordinates in (3) accounts for the actual topological feasibility of a given pairing and also for the entropic factor associated with loop formation. Biologists are interested in the folded configuration essentially at room temperature. Since room temperature is substantially less than the melting temperature (of order 800C, in other words, the energy scale of the problem), we want to determine the ground state configuration of the RNA heteropolymer. In other words, once we have obtained Z we would like to extract the term in Z that dominates as βε tends to infinity in (1). PlanarWe ha veSecondarygiven a simplified quantitative framework for the RNA foldin g pstructuresroblem. From a chemical point of view, it would be appropriate to include also the stacking energies of couples of complementary base pairs, instead of energies of single pairs of bases. However, in the following, we will stick with the latter. We will also concentrate on the evaluation of the “pairing” partition function (2). We expect that the various effects we have ignored, such as stacking , etc..., can be added later as “bells and whistles” to our approach. The stacking energies for instance can be taken into account by utilizing a 16 16 interaction matrix between pairs of bases instead of the 4 4 matrix × × • We workε (soni, sj) we u se hereQ. 0 III. MATRIX THEORY

What is the connection with matrix theory? • SecondaryConsid erstructurespulling apart the folded RNA structure giv=en in figArches. 2a.

Fig. 2b: Representation of the same RNA stretFcigh. 2ead: R.epresentation of the secondary structure of an RNA.

We obtain the structure of fig. 2b which to physicists are reminiscent of Feynman diagrams in a variety of subjects: matrix theory, quantum chromodynamics, and so on. which gluon lines cross are of higher or•der. DefineWe merely hav e Zt o g (Foo rit hoe ,sa tkehj oef de)l fian irtegasneess, Nlet us ebxtheorprowatnhestieormnin,oloagynodf qutanhtuem cdhrioamgodryanammicss. Tahreedotted lines are known 2 as gluon propagators, and the solid line as a quark propagator. The secondary structure corresponds to diagrams in classified by powers of 1/N . Note that a somewhat similar formwuhilchathieoglnuoninlinestedornmot scroossfovmer eaacthroithxer,twhhieleothreyterhtiaarysstbruceteurne courressepodndsftoordiagrams in which the the meander problem [12]. gluon lines do cross. The crucial observation, originally made by ’t Hooft [11], is that there is a systematic relation between the topology Consider the quantity • partition functionof a graph and its corresponding pofower of 1/segmentN 2. For instance, planar diagrams are(of oi,rder 1/Nj0, a)nd diagrams in

L L 3 N 1 1 (V − )ij tr(ϕiϕj ) 1 Z(1, L) = dϕ e− 2 ij tr (1 + ϕ ) (4) A(L) k N l ! k"=1 # "l=1 Here ϕ (i = 1, , L) denote L independent N by N Hermitian matrices and Π (1 + ϕ ) represents the ordered 20 i · · · l l matrix product (1 + ϕ1)(1 + ϕ2) (1 + ϕL). All matrix products will be understood as ordered in this paper. The normalization factor A(L) is defined· · · by

L N 1 2 (V − )ij tr(ϕiϕj ) A(L) = dϕke− ij (5) ! k"=1 # b Let us refer to the row and column indices a and b of the matrices (ϕi)a as color indices, with a, b = 1, 2, , N. The matrix integral (4) defines a matrix theory with L matrices. We can either think of it as a Gaussian theory·w· ·ith a 1 1 1 log[ N trΠl(1+ϕl)] complicated observable N trΠl(1 + ϕl), or alternatively, by raising N trΠl(1 + ϕl) = e into the exponent, N 1 1 as a complicated matrix theory with the action ( (V − ) tr(ϕ ϕ ) log[ trΠ (1 + ϕ )]). Another trivial remark 2 ij ij i j − N l l is that we can effectively remove 1 tr from (4). N # The important remark is that the matrix theory [13] defined by (4) has the same topological structure as ’t Hooft’s 1 large N quantum chromodynamics. There are L types of gluons, and the gluon propagators are given by N Vij . As 1 in large N quantum chromodynamics, each gluon propagator is associated with a factor of N and each color loop is associated with a factor of N. The reader familiar with matrix theory or large N quantum chromodynamics can see immediately that the Gaussian matrix integral (4) evaluates precisely to the infinite series 1 Z(1, L) = 1 + V + V V + + V V + (6) ij ij kl · · · N 2 ik jl · · · $ <$ijkl> <$ijkl> Some “typical” terms in this series correspond to the diagrams in fig. 3.

+ + + ... + + ... Fig. 3: Graphical representation of a few terms of the partition function.

This differs from (2) only in that the terms with different topological character are now classified by inverse powers 1 of N 2 . Thus, the use of the large N expansion allows us to separate out the tertiary structure, represented in (6) for 1 example by the term N 2 VikVjl, from the secondary structure. Note that the ordered product Π (1 + ϕ ) ensures that the diagonal elements V of the matrix V do not appear in # l l ii Z(1, L). We have nevertheless already Vii to 0. The program proposed in this paper is thus to evaluate Z(1, L) with V an arbitrary matrix. Once Z(1, L) is known 1 we can then insert it into (3) to evaluate . The parameter N serves as a convenient marker to distinguish the tertiary structure from the secondary structZure. What we offer here is a systematic way of generating refinements to the calculation of Z, and hence the free energy F, to any desired accuracy in a well controlled approximation. Since in Z(1, L) the quantities 1 and L represent arbitrary labels we can just as well define

L n N n 1 1 Σ (V − ) tr(ϕ ϕ ) 1 Z(m, n) = dϕ e− 2 i,j=m ij i j tr (1 + ϕ ) (7) A(m, n) k N l ! k"=1 l"=m

4 Recursion relation • Graphically, when one adds a base

k Z(i, k + 1) = Z(i, k) + V Z(i, j 1)Z(j + 1, k) j,k+1 − j=i • with !

βε(i,j) V (i, j) = e− θ( i j 4) | − | −

21 • by iterating this recursion, one can generate all possible secondary structures, with correct Boltzmann weights. • This is the best tool for predicting secondary structures in RNA : about 75% of base pairings correctly predicted • Algorithm scales as N 3 • One can include Entropies and Stacking Energies • MFOLD • Vienna Package

22 23 • Recursion equation looks like Hartree equations (tree diagrams) • No Pseudo-Knots • Is it possible to find a field theory such that secondary structures are the Hartree graphs? • Then, Pseudo-Knots would appear as the corrections to Hartree approximation.

24 Matrix Field Theory

25 Wick Theorem

• Simple representation: consider an RNA sequence of length L

L L 1 1 1 φiV − φj Q0 = dφie− 2 i,j ij (1 + φi) i=1 P i=1 N ! " " • due to Wick theorem L 1 1 1 φiV − φj Vij = dφie− 2 i,j ij φiφj i=1 P N ! " 26 Wick Theorem

L 1 1 1 φiV − φj VijVkl + VikVjl + VilVjk = dφie− 2 i,j ij φiφjφkφl i=1 P N ! "

• However, this form gives same weight to all pairings. No penalty for Pseudo-Knots. • Experimentally, few pseudo-knots.

27 • We look for a parameter N such that N + Secondary structures → ∞ ≡ 1 • Corrections in Pseudo-Knots N ≡ • Pseudo-knots are tunable by [Mg ++ ] concentration 1 ++ plays the role of [Mg ] N • TOPOLOGY=MATRIX FIELD THEORY

28 Matrix Field Theory: a Short Tutorial • Vector field theories: O ( n ) models count number of connected component of a graph. n is the fugacity of a loop. • Matrix field theories: “count” topology. • Consider the generalization of the scalar φ 4 field theory (t’Hooft, 1973) • Consider the fields φ ab ( x ) : a N N matrix at each point x in space. ×

29 Matrix Field Theory • A matrix φ 4 field theory is defined by

N dxT rφ(x)( 2+m2)φ(x) gN dxT rφ4(x) Z = φ (x)e− 2 −∇ − 4! D ab ! R R • represent φ ab ( x ) by a double line 4 • Vertex: N T r φ ab ( x ) factor N 1 1 • Propagator: G(x y) N − N 30 Fig. 2b: Representation of the same RNA stretched. Fig. 2b: Representation of the same RNA stretched. Fig. 2b: Representation of the same RNA stretched. which gluon lines cross are of higher order. We merely have to go to the large N expansion, and the diagrams are clawsshiificehdgbluyopnowlineres ocfro1s/sNa2r.e Nofotheigthheart oardsoemr.ewWheamt seirmeliylarhafovremtuolagtoiotno itnhetelramrgseoNf meaxtpriaxnstihoeno,ryanhdasthbeeedniaugsreadmfsorare classified bywphoiwchergsluofon1/lNin2e.s Ncrootses tahraetoaf shoimgheewrhoartdesirm. iWlare fmoremreullyathioavneintotegromtsootfhme alatrrigxetNheoerxypahnasiobne,enanudsetdhefodriagrams are the meander problem [12]. 2 Cothensidermeandertheclpqarsuaosiblemfinteitdyb[1y2p].owers of 1/N . Note that a somewhat similar formulation in terms of matrix theory has been used for Consider the qmuaeanndertity problem [12]. L L Consider the quantity N 1 1 (V − )ij tr(ϕiϕj ) 1 L 2 ij L Z(1, L) = dϕke− N 1 tr (1 + ϕl) (4) 1 (V − )ij tr(ϕiϕj ) 1 A(L) 2 L ij N L Z(1, L) = dϕ e− N 1 tr (1 + ϕ ) (4) ! k=1 1k (V − )li=j 1tr(ϕiϕj ) 1 l AZ(L(1), L") = # dϕ e− 2 ij N " tr (1 + ϕ ) (4) ! k=1A(L) # k l=1 N l Here ϕ (i = 1, , L) denote L independen"t N by N Herk=1mitian matrices and"Π (1 + ϕ )l=r1epresents the ordered i · · · ! " # l l " maHtreirxe pϕriod(iuc=t (11, + ϕ, )L(1) +deϕnot)e L(1in+deϕpen).deAnltl mNatbryixNprHerodumctitias wnillmbaeturincesderasntododΠla(s1 o+rdϕelr)edreprin etsenhistpsatpheer.orTderheed Here ·ϕ· · 1(i = 1, 2 ·,· ·L) denoLte L independent N by N Hermitian matrices and Π (1 + ϕ ) represents the ordered nomrmaatrlizxaptiroondufacctto(1r +Ai (ϕL1))(is1 d+efined·ϕ· ·2) by(1 + ϕL). All matrix products will be understood as orderedl in thils paper. The normalizatiomnaftarcixtopr rAod(Lu)ctis(1defined+ ϕ· ·1·)(b1y+ ϕ2) (1 + ϕL). All matrix products will be understood as ordered in this paper. The · ·L· normalization factor A(L) is defined by N 1 (V − )ij tr(ϕiϕj ) L 2 ij A(L) = dϕke− N 1 (5) 2 L (V − )ij tr(ϕiϕj ) A(L) = k=1 dϕke− ij N 1 (5) ! # 2 (V − )ij tr(ϕiϕj ) "A(L) = dϕ e− ij (5) ! k=1 k " # b Let us refer to the row and column indices a and b of th!e mk=a1trices (ϕi#)a as color indices, with a, b = 1, 2, , N. " b · · · Let us refer to the row and column indices a and b of the matrices (ϕi) as color indices, with a, b = 1, 2, , N. The matrix integral (4) defines a matrix theory with L matrices. We can eithera think of it bas a Gaussian theory with a Let us refer to the row and column indices a and b of the matrices (ϕ1 ) as color indices, with·a· ·, b = 1, 2, , N. The matrix integral (41) defines a matrix theory with L matrices. 1We can either thilnogk[ oiftraiΠtla(1+s aϕlG)]aussian theory with a complicated observable trΠl(1 + ϕl), or alternatively, by raising trΠl(1 + ϕl) = e N into the exponent, · · · The matriNx i1ntegral (4) defines a matrix theory withNL 1matrices. We canloegit[ h1 etrΠth(1+inkϕ o)]f it as a Gaussian theory with a complicated observable trΠl(1 + ϕl), or alternNatively, by1 raising trΠl(1 +1 ϕl) = e N l l into the exponent, as a complicated matrix thNeory with th1e action ( (V − )ijtr(ϕiϕNj ) log[ tr1Πl(1 + ϕl)]). Anoltohg[e1r ttrrΠiv(1+ialϕre)]mark complicated observable trΠ (1 + ϕ2 )N, oijr alternatively, by raisinNg 1 trΠ (1 + ϕ ) = e N l l into the exponent, 1 N l l 1 − N l l is tahsaat cwoemcpalnicaetffeedctmivaetlryixretmheoovrey wtitrhfrtohme a(c4t)i.on ( 2 ij(V − )ijtr(ϕiϕj ) log[ N trΠl(1 + ϕl)]). Another trivial remark as a complicated maNtrix theory with the action ( N (V 1−) tr(ϕ ϕ ) log[ 1 trΠ (1 + ϕ )]). Another trivial remark is that we can effectively remove 1 tr from (4). # 2 ij − ij i j N l l The important remark is that theNmatrix theo1ry [13#] defined by (4) has the same to−pological structure as ’t Hooft’s The impoirstatnhtatrewmeacraknisetffheactitvheelymreamtroixvethNeotryfr[o1m3] d(4e)fi.ned by (4) has the same topological structure as1 ’t Hooft’s large N quantum chromodynamics. There are L types of gluons#, and the gluon propagators are given by N Vij . As large N quantTumhecihmrpoomrotadnytnraemmicasr.k TishtehraetatrheeLmtaytpriexs tohfegolruyo[n1s3,] adnedfintehde bgylu(o4n) hparosptahgeast1aomrseatroepgoilvoegnicbalyst1ruVctu. rAesas ’t Hooft’s in large N quantum chromodynamics, each gluon propagator is associated with a factor of and each color Nloopij is 1 large N quantum chromodynamics. There are L types of gluons, and the gluNon1propagators are given by V . As assinocliartgeed NwitqhuanftaucmtorchorfoNm.odTheynarmeaicders, eafacmh iliagluronwithpropmaagtartioxrtiheos asrsyoociraltaerdgewNithqauafancttuomr ocfhromaondyenaacmhicoslcoarnlosoepe is N ij in large N quantum chromodynamics, each gluon propagator is associated with Na factor of 1 and each color loop is imamsseodciaiatteelyd twhiatth tahefaGctaoursosifaNn .mTheatrixreaindertegrfaalm(4ilia) ervawlithuatemsaptreixcisteheoly rtyo tohrelainrgfieniNte qsuerainestum chromodynaNmics can see immediatelyastshoactiattheedGwaiuthssiaanfamctaotrroixf Nint.eThegral r(ea4)derevaflaumatiliaes prrwecithiselmy attoritxhetheoinfirnyitoerslearrigese N quantum chromodynamics can see immediately that the Gaussian matrix integral (4) evalu1ates precisely to the infinite series Z(1, L) = 1 + V + V V + + V V + (6) ij ij kl · · · N 2 1 ik jl · · · Z(1, L) = 1 + Vij <+ijkl> Vij Vkl + + VikV1jl + (6) Z$(1, L) = 1$+ V + · · · V NV 2$+ + · · · V V + (6) ij ij kl · N 2 ik jl · · · Some “typical” terms in this series corres$pond to th$eagramsg. 3. $ Some “typical” terms in this series correspond to t$he diagram$s in fig. 3. $ Some “typical” terms in this series correspond to the diagrams in fig. 3.

+ + + ... + + ... Fig. 3: Graphical r+epresentat+ion of a few ter+m ...s +of the partition fu+ n ...ction. Fig. 3: Graphical representa+tion of a fe+w terms of the+ p ...ar+tition function. + ... This differs from (2) only in that tFheigt.e3r:mGs rwaipthidcaiffl erreepnrtesteonptoaltoigoincaolfcahaferwactteerrmarseonfotwheclpaasrstifiiteidonbyfuinncvteiorsne. powers 1This differs from (2) only in thFeynmannat the terms with different topological character arGraphse now classified by inverse powers of 2 . Thus, the use of the large N expansion allows us to separate out the tertiary structure, represented in (6) for N 1 This differs from (2) only in that the terms with different topological character are now classified by inverse powers of 2 . Thus, theFuigse.12obf :thReelparregseenNtaetixopnanofsitohneaslalomwesRuNs tAo ssterpeatcrahteed.out the tertiary structure, represented in (6) for examNple by the ter1m 2 VikVjl, from the secondary structure. of . TNhu1s,use of the large N expansion allows us to separate out the tertiary structure, represented in (6) for example by theNter2 m V V , from the secondary structure. Note that the orderedNp2rodl(i1k+jlϕl) ensures that the diagonal elements Vii of the matrix V do not appear in example by#the term N 2 VikVjl, from the secondary structure. which gluon lZin(e1s,NLcro).otsesWtaheraehatotvfheheneviogrhdeerrteohelessdrdpero. daWulrcetamdΠyel(rs1eetl+yVϕhiail)vteont0so.urgeos ttohatthethleardgieagNoneaxlpealenmsieonnt,saVnidi otfhethdeiamgaratrmixs Varedo not appear in Note that #the ordered product Πl(1 + ϕl) ensures that the diagonal elements Vii of the matrix V do not appear in classified by powTZeh(r1es,opLfr).o1g/WrNaem2.hapNrvoetpenevostehdeartitnhelessatshoims peawalrpheeartd#iysimtshetiulasVritiofoteromva0ul.ulaattieonZi(n1,tLer)mwsitohf Vmaatnriaxrbthiteroarryy hmaastbriexe.nOunsceed Zfo(r1, L) is known we cTanhethpernogirnZasm(e1r,tpLrio).tpioWnsteodha(i3nv)etthnevoisepveaarptluhelessearties tha.ulrseTtaohdeyevpsaetaluraaVmtieieZteo(r10,.1L)swerivtehsVasana acrobnivtreanriyenmt amtrairxk.eOr ntocedZis(t1in, Lgu)iisshktnhoewn the meander p•roblemV:[12]. vertices N 1 terwtiearcyanstrtuhcetnurienTshfererotmpritotghinreatsmoec(po3rn)odptaorsyevdsatirlnuactthZeuisrep..aWpTehrhaeits pwthaeuraosmffteoerteehvrearlueaistearZvse(ys1s,taLesm)aawticitcohnwvVaeynainoefnatgrebmnitearrraakrteyirnmgtoartedrfiiixnst.eimnOgenuncitesshZtot(h1e, L) is known Consider the quantity we can then insert it into (3) tZo evaluate . The Nparameter 1 serves as a convenient marker to distinguish the thetecratilacurylasttioruncotuf rZe,faronmd hthenecseectohnedfarerye esntreurcgtyurFe,. tWo ahnayt wdZeesiorffeedrahcecrueraiscya isnysatNewmeallticconwtaryololefdgaenpeprraotxiinmgarteifion.ements to the calculatitoenrtoiafrZy,starnudcthuerneLcferotmhetfhreesecnoenrgdyarFy,sttoruacntyurdee.sWirehLdaatcwcueroaffceyrinhearewiesllacsoynsttreomllaetdicapwparyoxoifmgaetnieorna.ting refinements to Since in Z(1, L) the quantities 1 and LN represen1 t arbitrary labels we can just as well define 1 (V − )ij tr(ϕiϕj ) 1 the calculation of Z, and h2enceij the free energy F, to any desired accuracy in a well controlled approximation. V I+L Since iZn (Z1,(1L,)L=) the quantitiedsϕ1kea−nd L represent arbitrarytrlabe(l1s +weϕcl)an just as well define (4) • I: internalSinceAin(LZ)(1, L) the q uapropagatorsntitieLs 1 and L reprNesent arbitrary labelsnwe can just as well define N n 1 ! k=1 1 Σ (V −l=)1 tr(ϕ ϕ ) 1 N # ij i j − Z(m, n) "= Ldϕ e− 2 i,j=m " tr n(1 + ϕ ) (7) 1 k N n 1 1 l A(m, n) 2 ΣLi,j=m (V − )ij tr(ϕiϕNj ) n Z(m, n) = k=1 dϕke− N n 1lt=rm (1 + ϕl) (7) Here ϕi (i = 1, , L) denote L independent N by N! Hermitia1 n matrices and ΠΣl(1 + (ϕVl−) )rieprj tr(eϕseniϕjt)s1the ordered · · · AZ((mm,,nn)) "= dϕke− 2 i,j=m N " tr (1 + ϕl) (7) matrix product (1 + ϕ1)(1 + ϕ2) (1 + ϕL). All matrix !prokAd=u(1mct,snw)ill be understood as orderedl=imn thiNs paper. The • L: loops· · · " ! k=1 " l=m normalization factor A(L) is defined by 4" " 4 L N 1 4 2 (V − )ij tr(ϕiϕj ) A(L) = dϕke− ij (5) ! k"=1 # b Let us refer to the row and column indices a and b of the matrices (ϕi)a as color indices, with a, b = 1, 2, , N. The matrix integral (4) defines a matrix theory with L matrices. We can either think of it as a Gaussian theory·w· ·ith a 1 • V=21 1 log[ trΠl(1+ϕl)] N complicated observable N trΠl(1 + ϕl), or alternatively, by raising N trΠl(1 + ϕl) = e into the exponent, N 1 1 as a complicated matrix theory with the action ( (V − ) tr(ϕ ϕ ) log[ trΠ (1 + ϕ )]). Another trivial remark 2 ij ij i j − N l l is that we can effectively remove 1 tr from (4). N # The import•ant remI=4ark is that the matrix theory [13] defined by (4) has the same topological structure as ’t Hooft’s 1 large N quantum chromodynamics. There are L types of gluons, and the gluon propagators are given by N Vij . As 1 in large N quantum chromodynamics, each gluon propagator is associated with a factor of N and each color loop is associated with a factor of N. The reader familiar with matrix theory or large N quantum chromodynamics can see immediately t•hat theL=4Gaussian matrix integral (4) evaluates precisely to the infinite series 1 Z(1, L) = 1 + V + V V + + V V + (6) ij ij kl · · · N 2 ik jl · · · • Euler $characteristic:<$ijkl> <$ijkl> χ = V I + L Some “typical” terms in this series correspond to the diagrams in fig. 3. − + + + ... + + ... 31 Fig. 3: Graphical representation of a few terms of the partition function.

This differs from (2) only in that the terms with different topological character are now classified by inverse powers 1 of N 2 . Thus, the use of the large N expansion allows us to separate out the tertiary structure, represented in (6) for 1 example by the term N 2 VikVjl, from the secondary structure. Note that the ordered product Π (1 + ϕ ) ensures that the diagonal elements V of the matrix V do not appear in # l l ii Z(1, L). We have nevertheless already set Vii to 0. The program proposed in this paper is thus to evaluate Z(1, L) with V an arbitrary matrix. Once Z(1, L) is known 1 we can then insert it into (3) to evaluate . The parameter N serves as a convenient marker to distinguish the tertiary structure from the secondary structZure. What we offer here is a systematic way of generating refinements to the calculation of Z, and hence the free energy F, to any desired accuracy in a well controlled approximation. Since in Z(1, L) the quantities 1 and L represent arbitrary labels we can just as well define

L n N n 1 1 Σ (V − ) tr(ϕ ϕ ) 1 Z(m, n) = dϕ e− 2 i,j=m ij i j tr (1 + ϕ ) (7) A(m, n) k N l ! k"=1 l"=m

4 Euler characteristic and the Genus • Consider a graph with Euler characteristics χ • Theorem: this graph can be drawn without crossings on a surface of genus 2 χ c given by g = − − where c is the 2 • number of boundaries of the graph • The genus g is the number of handles of the embedding surface

32 Double line graphs • In our problem, if we use matrix fields

φab(x): NxN matrix • If we use same rule: Propagator: 1/N Loop: N 1 • Above graph: N = 1 × N

33 • Other graph

2 internal lines: 1/N 2 2 Loops: N 2 Order 1

• Arches are of order 1

2 internal lines: 1/N 2 0 Loops: 1 • Pseudo-knots are of higher order in 1/N

34 • Matrix field theory seems to do what we want. • Not really surprising:

35 Hollywood already knew: the MATRIX rules

36 Fig. 2b: Representation of the same RNA stretched. which gluon lines cross areMatrixof higher o fieldrder. W erepresentationmerely have to go to th eofla rgRNAe N exp ansion, and the diagrams are classified by powers of 1/N 2. Note that a somewhafoldingt similar formulation in terms of matrix theory has been used for the meander problem [12]. Consider the quantit•y We thus generalize the Wick theorem

L L N 1 1 (V − )ij tr(ϕiϕj ) 1 Z(1, L) = dϕ e− 2 ij tr (1 + ϕ ) (4) A(L) k N l ! k"=1 # "l=1 Here ϕ (i = 1, , L) denote L independent N by N Hermitian matrices and Π (1 + ϕ ) represents the ordered i · · · l l matrix product (1 + ϕ1)(1 + ϕ2) (1 + ϕL). All matrix products will be understood as ordered in this paper. The normalization factor A(L) is defined· · · by • ϕ ab ( i ) is a real symmetric (or hermitian) L N 1 (V − )ij tr(ϕiϕj ) matrix 2 ij A(L) = dϕke− (5) • By looking at !a k"few=1 diagrams:# Hartree b Let us refer to the row diagramsand column i ndicesare ofa aordernd b of t1,he pseudo-knotsmatrices (ϕi)a as c oarelor in ofdic es, with a, b = 1, 2, , N. The matrix integral (4) defines a matrix theory with L matrices. We can either think of it as a Gaussian theory·w· ·ith a higher order. 37 1 1 1 log[ N trΠl(1+ϕl)] complicated observable N trΠl(1 + ϕl), or alternatively, by raising N trΠl(1 + ϕl) = e into the exponent, N 1 1 as a complicated matrix theory with the action ( (V − ) tr(ϕ ϕ ) log[ trΠ (1 + ϕ )]). Another trivial remark 2 ij ij i j − N l l is that we can effectively remove 1 tr from (4). N # The important remark is that the matrix theory [13] defined by (4) has the same topological structure as ’t Hooft’s 1 large N quantum chromodynamics. There are L types of gluons, and the gluon propagators are given by N Vij . As 1 in large N quantum chromodynamics, each gluon propagator is associated with a factor of N and each color loop is associated with a factor of N. The reader familiar with matrix theory or large N quantum chromodynamics can see immediately that the Gaussian matrix integral (4) evaluates precisely to the infinite series 1 Z(1, L) = 1 + V + V V + + V V + (6) ij ij kl · · · N 2 ik jl · · · $ <$ijkl> <$ijkl> Some “typical” terms in this series correspond to the diagrams in fig. 3.

+ + + ... + + ... Fig. 3: Graphical representation of a few terms of the partition function.

This differs from (2) only in that the terms with different topological character are now classified by inverse powers 1 of N 2 . Thus, the use of the large N expansion allows us to separate out the tertiary structure, represented in (6) for 1 example by the term N 2 VikVjl, from the secondary structure. Note that the ordered product Π (1 + ϕ ) ensures that the diagonal elements V of the matrix V do not appear in # l l ii Z(1, L). We have nevertheless already set Vii to 0. The program proposed in this paper is thus to evaluate Z(1, L) with V an arbitrary matrix. Once Z(1, L) is known 1 we can then insert it into (3) to evaluate . The parameter N serves as a convenient marker to distinguish the tertiary structure from the secondary structZure. What we offer here is a systematic way of generating refinements to the calculation of Z, and hence the free energy F, to any desired accuracy in a well controlled approximation. Since in Z(1, L) the quantities 1 and L represent arbitrary labels we can just as well define

L n N n 1 1 Σ (V − ) tr(ϕ ϕ ) 1 Z(m, n) = dϕ e− 2 i,j=m ij i j tr (1 + ϕ ) (7) A(m, n) k N l ! k"=1 l"=m

4 Topological classification of RNA folds • An RNA fold can be characterized by its topology:

• Number of handles of embedding surface P L g = − 2 38 • In fact, one can prove that the matrix field partition function is equal to

1 βE(pairing) Z = e− N 2g(pairing) all pa!irings • where g(pairing) is the genus of the pairing graph

39 g=0

g=1

g=2

g=3

Figure 3: First few terms of the topological expansion of closed oriented surfaces: the term g = 0 is a sphere, g = 1 is a torus, g = 2 is a double torus and so forth.

g=0 g=1 g=2 g=3

,,,

Figure 4: Any RNA circular diagram can be drawn on a closed surface with a suitable number of “handles” (the genus). For the sake of simplicity, in this figure all helices and set of pairings on the surfaces are schematically identified only by their color. Note that the circle of the RNA-backbone (in green) topologically corresponds to a hole (or puncture) on the surface. 40

6 Genus 0: the Sphere

a) b)

41

d) c) Genus 1: the Torus

• Genus 1: the Torus

H. Orland, SPhT, Saclay RNA folding, Japan 2006 43

42 Genus 2: the Bi-torus

43 Genus 3

44 NNoowwuusseetthheeGGaauussssiaiannrreepprreesseennttaattioionn

11 22 1 NN 22 2NtrtKrK 1 2ttrrAA++ititrrAAKK ee−−2N == ddAAee−−2 ((1177)) CC !! NN 22 2trtArA wwitithhtthheennoorrmmaalilzizaattioionnfafaccttoorrCC== ddAAee−−2 ..NNootteetthhaatteevveenntthhoouugghhKKisisccoommpplelexxwweeccaannttaakkeeAAttoobbeehheerrmmititeeaann.. ((EEqquuivivaalelennttlyly,,tthheeaanntti-i-hheerrmmititeeaannppaarrttooffAAddrrooppssoouutt.).) PPuuttttininggititttooggeetthheerrwweeoobbttaainin "" a 1 ∂ 1 NN 22 ψ∗ Mij ψa 1 ∂ 1 2ttrArA ij aψa∗a,i,iMij ψjj ZZ((11,,LL))== ddAAee−−2 ddψψ∗∗ddψψee−− ij a ((1188)) NN∂∂hhCC !! !! ## ## wherwheree

11 2 MMijij==δδijij δδi,ij,j++11++hhδδi,i1,1δδjj,L,L++11++ii((VVii 11,j,j))2 AAii 11,j,j ((1199)) −− −− −− oorrininmmaattrrixixfoforrmm 11 00 00 00 hh ·· ·· 11 11++aa1122 aa1133 aa11LL 00 −− ·· ··   aa∗1∗2 11 11++aa23 aa2L 00 12 −− 23 ·· ·· 2L MMLL==   ((2200)) ·· ·· ·· ·· ·· ·· ··    ·· ·· ·· ·· ·· ·· ··   11 11++aaLL 11LL 00 ·· ·· ·· ·· −− −−  aa∗1∗L aa∗2∗L aa∗L∗ 2L aa∗L∗ 1L 11 11  1L 2L · L −2L L −1L −   · − − −  wwhheerreewweehhaavveeuusseeddtthheeccoonnvveennieiennttnnoottaattioionn

ii VVijij AAijij==aaijij foforr ii<

1 ∂ 1 NN 22 1 ∂ 1 NN 22 1 ∂ 1 2trtArA NN 1 ∂ 1 2ttrrAA++NNttrrlologgMM((AA)) ZZ((11,,LL))==• After someddAAee−−2 algebraic((ddeettMM((AA) ))manipulations,) == ddAAee− −one2 ((2222)) NN∂∂hhCC NN∂∂hhCC has the!! exact expression: !! AAttthisthisppooinint,t,wweecancanddiffifferenerentiatetiatewwithithrrespespectecttotohhaannddsseetthhttoo00,,oobbttaaininininggtthheeaaltlteerrnnaattiviveeffoorrmm

1 NN 22 1 2ttrrAA++NNttrrlologgMM((AA)) 11 ZZ((11,,LL))== ddAAee−−2 MM−− ((AA))L+1,1 ((2233)) CC L+1,1 !! IInntthhisiseexxpprreessssioionn,, • where A l l is a L L matrix and ! × 11 MMij==δδij δδi,j+1++ii((VVi 1,j))22AAi 1,j ((2244)) ij ij − i,j+1 i −1,j i −1,j • The dependence− is explicit− − LLeettuussininttrroodduucceetthheeaaccttioionn N one can perform a loop expansion 11 2 SS((AA))== ttrrAA2 ttrrlologgMM((AA)) ((2255)) (saddle-point)22 −− 45 aannddddeefifinneetthheeaavveerraaggeeooffaann““oobbsseerrvvaabblele””OObbyy 1 1 NNSS((AA)) <>== ddAAee−− OO ((2266)) CC !! (No(Notteetthehenonon-n-stastandandarrddnonorrmmaalizalizattioionnuusedsedhherere.)e.)TThen,hen,oouurrrresultesultccaannbbeesummasummarrizedizedelegelegaanntlytlyaass

11 ZZ((11,,LL))==<> ((2277))

AAtttthhisisppooinintt,,aassrreemmaarrkkeeddeeaarrlileierr,,wweennootteetthhaatttthheeqquuaannttitityyZZ((11,,LL))ccaannoobbvvioiouusslylybbeeggeenneerraalilzizeeddttooZZ((ii,,jj)):: aafftteerr aalll,l,tthheessititeelalabbeelsls11aannddLLaarreeaarrbbititrraarryy..TThheennwweehhaavveetthheeaappppeeaalilninggrreessuultlttthhaatt

11 ZZ((ii,,jj))==<> foforrjj>>ii ((2288))

66 The loop expansion

• Saddle-point equation ∂S = 0 Hartree recursion equations ∂All! • Expansion in 1/N

(0) xll! All = A + ! ll! √N

• Propagators of x l l ! satisfy a Bethe- Salpeter equation

46 Bethe-Salpeter equation

• No order 1/N correction

k m k m k i m

= + j l n l n l n

k m k m k m k m = + + + + ...

l n l n l n l n Fig. 5: Graphical representation of the Bethe-Salpeter recursion relation. The dotted lines represent factors Vij while the dashed lines represent factors Vij . The solid thick lines represent Hartree propagators G. The Hartree propagators being directed, the arrows denote the direction of increasing spatial index. !

1 Dmn = M − Vm 1,n xm 1,n mm! !− !− m! "p ! (Bp)kl = (D )kl 47 Tp = TrBp (39)

In (38), the bracket means that the Wick theorem should be applied to contract the fields xll! which appear in this expression, their contraction being given by the kernel ∆. The calculation of the correction to the free energy is possible numerically for not too long RNA sequences. Work in this direction is in progress. Because of the complexity of the (exact) order 1/N 2 obtained in this approach, we found it simpler to generalize the Hartree recursion equation to incorporate some residual interactions between the loops and bulges.

VI. RECURSION APPROACH

Two approaches can be used to derive recursion relations for the partition functions. One is detailed in the following, whereas the other one is described in appendix B. A possible approach is to take the expression in (31) 1 ∂ Z(1, L) = < (det M(A))N > (40) N ∂h 0 N and try to relate Z(1, L + 1) to Z(1, L). In other words, we would like to relate < (det ML+1(A)) > to N < (det ML(A)) > where the subscript on M keeps track of the different matrices in the discussion. Note that ML is an L + 1 by L + 1 matrix. Explicitly, as noted before, the L + 2 by L + 2 matrix ML+1 has the form 1 0 0 0 h 1 1 + a a · · b 0 − 12 23 · · 1  a∗ 1 b 0  12 − · · · 2 ML+1 =   (41)  · · · · · · ·     · · · · · · ·   1 1 + bL 0   b· b· · b· b− 1 1   1∗ 2∗ L∗ 1 L∗   · − −  where for convenience we have denoted i V A = a for i < j L ij ij ij ≤ i V A = b for i L i,L+!1 i,L+1 i ≤ i V A = a∗ for j < i L ! ij ij ji ≤ i VL+1!,j AL+1,j = b∗ for j L (42) j ≤ ! 9 Recursion relations • It is possible to obtain exact recursion relations for genus 1 • There is an exact relation L k 1 L 1 − 1 Z(1, L + 1) = Z(1, L) + V < T r (1 + φ ) T r (1 + φ ) > L+1,k N i × N j i=1 k!=1 " j="k+1 • which can be expanded in powers of 1 N

• Algorithm scales as L 6 too long!

48 H. Orland, SPhT, Saclay Graphology Parallel pairings don’t change the genus

49 © @R

Irreducibility and Nesting

g

© @R Irreducible PK

Genus is additive

g @R Non nested PK

50

@R Only 4 primitive PK of genus 1 Primitive=Irreducible and non-nested

H PK (pseudo-nœud H)

KHP

51 H. Orland, SPhT, Saclay B

B

B B A ABCABC Text C

A C

52 H. Orland, SPhT, Saclay 53 27 potential ABCABC H. Orland, SPhT, Saclay Statistical study

• Look in database and calculate genii of pseudo-knots • PseudoBase: around 245 primitive pseudo-knots found experimentally • 237 H PK of the type ABAB • 6 KHP of the type ABACBC • 1 PK of the type ABCABC • 1 PK of type ABCDCADB with genus 2

54 • Protein Data Bank (PDB): 850 RNA Structures • Number of bases ranges from 22 ( H PK with genus 1) to 2999 (with genus 15) • Maximum total genus is 18. Maximum genus of primitive PK is 8. • Transfer RNA (L=78) are KHP of genus 1

55 Number of RNA as a function of genus

f

g

56 H. Orland, SPhT, Saclay 57 H. Orland, SPhT, Saclay L L/4

2000 0.14 280 × "

58

• •

• This PK of genus 7 is made of 3 HP, 3 KHP nested in a large KHP

59 Exemple de 1vou (sous-unité 50s d’un ribosome 70s d’ E.Coli)

2850 bases

pseudo-nœud H kissing-hairpin

• This PK of genus 7 is made of 3 HP, 3 KHP nested in a large KHP Are these genii big?

60 H. Orland, SPhT, Saclay Exact enumeration of RNA structures. • Model: RNA in which any base can pair with any other base. All pairing energies are identical

Vij = v • Partition function of the model can be written as

1 N Trφ2 1 L Z (L) = dφ e− 2v Tr (1 + φ) N A N ! • with only one NxN matrix φ 61 • This integral can be calculated exactly using random matrix theory (orthogonal

polynomials). number of graphs of length L of genus g ∞ a (g) Z (L) = L N N 2g g=0 • and the asymptotic behaviors! are given by

L 3g 3/2 aL(g) L Kg(1 + 2v) L − ≈ →∞ 1 Kg = 4g 3/2 2g+1 3 − 2 g!√π

62 H. Orland, SPhT, Saclay • The total number of diagrams with any genus is given by L/2+√L 1/4 L/2 e− − L L N ≈ →∞ √2 • the average genus is given by < g > 0.25L L≈ • for real RNA, the largest genus we found is 18 for ribosomes (size around 3000 bp). The genus should be around 750. • What about Steric Constraints?

63 Enumeration of self-avoiding RNA structures. • Self-avoiding polymer on a cubic lattice • Saturating attraction between nearest- neighbor monomers. • Monte Carlo growth method allows to calculate accurately free energies. • Length of chains up to 1200 < g > 0.13L • Still much bigger than≈ for real RNA: 390 for RNA of length 3000 instead of 18. 64 MonteMonte Carlo CarloCarlo methodmethod method • Idea: forget matrix fields, keep genus • •Idea:Idea: forget forget matrix matrix fields, fields, keep keep genus genus • Work in pairing space (contact map) • •WorkWork in inpairing pairing space space (contact (contact map) map) βE(pairing) 2g(pairing) Z = eβ−E(pairing) /2Ng(pairing) Z = e− /N possible pairings possible pa!irings • Introduce !a chemical potential for the • •IntroduceIntroduce a chemicala chemical potential potential for for the the topology: µ µ 1 1 topology: e−eµ− ==1 topology: e− = 2 2 NN2N βE(pairing) µg(pairing) Z = eβ−E(pairing) µg−(pairing) Z = e− − possible pairings possible pa!irings ! 65 42 42 H. Orland, SPhT, Saclay H. Orland, SPhT, Saclay Possible moves

C C’

1)

2)

3) = i

= j = others 4)

5)

6) When a pair is added or removed, the energy is changed and the genus of the graph may have changed 66 H. Orland, SPhT,Saclay • Accept or reject move with probability β∆E µ∆g p = e− −

• It is possible to – take into account the entropy – make it very fast • Current Turner energies are not fitted to calculate energies of pseudoknots • transfer RNAs (g=1) • Hepatitis delta virus ribozyme (g=2)

67 H. Orland, SPhT, Saclay Structure of a Hepatitis Delta Virus Ribozyme 543

crystallize, can be engineered to crystallize readily by introducing a crystallization module into its P4 stem. In our previous work, we employed a GAAA tetraloop and its 11 nt receptor sequence placed so as to stack coaxially and be available for intermolecular but not intramolecular association. While the engineered RNAs produced large crys- The structure of the HDVtals, theseribozymedid not diffract X-rays beyond 7 AÊ resol- ution (FerreÂ-D'Amare et al., 1998b). Because our approach was partially successful, we decided to introduce the cognate site for U1A into the same stem-loop. In order to minimize any distortion of the structure of the HDV active site by introduc- tion of the crystallization module, we placed the U1A binding site as a loop capping the distal end of P4, as far away as possible from the core of the ribozyme. A series of RNAs were prepared in which the HDV ribozyme and helix-terminal U1A-binding moieties were separated by a variable number of spacer base-pairs (Figure 1(b)). These paired nucleotides were expected to form segments of A-form double helix that would coaxially stack and rigidly connect the two RNA moieties. The various constructs differed in the separation and helical twist between the two domains. This is ana- logous to the length-variation strategy routinely employed for DNA-protein co-crystallization (Aggarwal, 1990; Schultz et al., 1990). We employed self-cleaved ribozymes for crystal- lization, because HDV ribozymes are known to be prone to misfold (Been & Wickham, 1997; Perrotta & Been, 1998). Use of self-cleavage products ensured that only molecules that had successfully traversed the transition state, and thus folded cor- rectly, were employed for crystallization (see Methods). Previous biochemical work suggested that the precursor and the product have very simi- lar structures (Been & Wickham, 1997; Duhamel et al., 1996). Nucleotides 50 of the cleavage site are unlikely to participate in forming the architecture of the active site, since the HDV ribozyme has no sequence requirements in the 50 leader sequence, will cleave ef®ciently after a single nucleotide, and the identity of this 1 position nucleotide has only modest effects on cleavÀ age rates (Been & Wickham, 1997). Binding of U1A-RBD to the constructs did not signi®cantly affect their self-cleavage activity

Figure 1. Engineering the U1A crystallization module into the genomic HDV ribozyme. (a) Hypothetical chemical mechanism of the reaction catalyzed by the HDV ribozyme. (b) Schematic of the length variation binding site moieties) whose crystal structure was deter- strategy. The secondary structure of the genomic HDV mined. Nucleotides corresponding to the genomic HDV ribozyme (top left) and the U1A binding site from stem- ribozyme are numbered with Arabic numerals starting loop II of U1 small nuclear RNA (bottom left) are with the guanine bearing the leaving group of the self- shown in simpli®ed form, separated by a horizontal scission reaction; nucleotides of the U1A binding site are gap. This gap was bridged with 0 to 3 spacer base-pairs, in lower-case Roman numerals. RNA segments68 are in order to generate multiple crystallization candidates. named P for base-paired region, J for joining region, Constructs were transcribed in vitro, preceded by a 41 nt and L for loop, numbered from the 50 end of the mol- leader sequence, from which the ribozymes self-cleaved, ecule, as by FerreÂ-D'Amare et al. (1998a); the same color to generate the mature RNAs used for crystallization. (c) scheme is employed in all subsequent Figures, except Sequence and secondary structure of the construct (no for Figure 6. The riboses of residues 22, 24, 27, 41, 74-76, spacer base-pairs between the ribozyme and U1A- 77, v, vii and x have C20-endo puckers. • We find pseudoknots with free-energies lower than the native ones! • We have worked out a new parametrization of pairing free energies in RNA. – we predict 99% of all tRNA – predict very well RNA smaller than 200.

69 Conclusion

• Matrix field theory introduces a natural classification of RNA folds according to their topological genus. • One can write exact recursion equations for genus 0, 1, ... • The Genus of real RNA is small • Most promising is the Monte Carlo calculation with chemical potential for the genus (in fact enumerations of structures)

70 71