RNA Secondary Structure Prediction

RNA structure prediction methods

◼ Base-Pair Maximization

◼ Context-Free Grammar

◼ Free Energy Methods

◼ Covariance Models

The Nussinov-Jacobson Algorithm

Dynamic-programming matrix for the sequence ACAGUUGCA (q = 9); the last entry of row 1, shown as ?, is the cell still to be filled:

       1  2  3  4  5  6  7  8  9
       A  C  A  G  U  U  G  C  A
1 A    0  0  0  1  2  2  2  3  ?
2 C    0  0  0  1  1  1  2  2  3
3 A       0  0  0  1  1  1  2  3
4 G          0  0  0  0  0  1  2
5 U             0  0  0  0  1  2
6 U                0  0  0  1  2
7 G                   0  0  1  1
8 C                      0  0  0
9 A                         0  0

SCFG Version

• Nussinov algorithm can be converted to a stochastic context-free grammar:

• S → W
• W → aW | cW | gW | uW
• W → Wa | Wc | Wg | Wu
• W → aWu | cWg | uWa | gWc
• W → WW

SCFGs

• Stochastic Context Free Grammars (SCFGs) have also been used to model RNA secondary structure

• Examples:
  – tRNAscan-SE
  – a program created to find snoRNAs

• Grammars are created by using a training set of data, and then the grammars are applied to potential sequences to see if they fit into the language

SCFGs

• SCFGs allow the detection of sequences belonging to a family:
  – tRNAs
  – group I introns
  – snoRNAs
  – snRNAs

SCFGs

• Any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279)

Transformational Grammars

• First described by linguist Noam Chomsky in the 1950s.

– (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

13 June 2006

Transformational Grammars

• Very important in computer science, most notably in compiler design

• Covered in detail in compiler and automata theory courses

Transformational Grammars

• Idea: take an output (a sentence, an RNA structure) and determine whether it can be produced using a set of rules

• A grammar consists of a set of symbols and production rules

• The symbols can be terminal (emitting) symbols or non-terminal symbols

Grammar for Palindromes

• Consider palindromic DNA sequences

• Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol)

Grammar for Palindromes

• Production rules, where S and W are non-terminal symbols:

• S → W
• W → aWa | cWc | gWg | tWt
• W → a | c | g | t | ε

Derivation of Sequences

• Using these production rules, a derivation of the palindromic sequence acttgttca follows:

• S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca
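The derivation above can be reproduced mechanically. The following is a minimal Python sketch, assuming the palindrome grammar S → W, W → xWx | x | ε over the alphabet {a, c, g, t}; the function name derive_palindrome is illustrative and not part of the original slides.

```python
def derive_palindrome(seq):
    """Return the derivation steps S => W => ... => seq, or None if
    seq cannot be produced by the palindrome grammar."""
    steps = ["S", "W"]          # S -> W is always the first production
    left, right, core = "", "", seq
    while len(core) >= 2:
        # W -> xWx requires matching outer symbols from {a, c, g, t}
        if core[0] != core[-1] or core[0] not in "acgt":
            return None
        left += core[0]
        right = core[-1] + right
        core = core[1:-1]
        steps.append(left + "W" + right)
    # Final step: W -> x (one symbol left) or W -> epsilon (nothing left)
    if core and core not in "acgt":
        return None
    steps.append(left + core + right)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
```

A sequence such as acg is rejected (the function returns None) because no production can emit mismatched outer symbols.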

SCFGs for RNA

• Base-paired columns are modeled by pairwise-emitting nonterminals
  – aWu; uWa; gWc; cWg; ...

• Single-stranded columns are modeled by leftwise-emitting nonterminals (when possible)
  – aW; cW; gW; uW; ...

Parse Trees

• A context-free grammar can be aligned to a sequence using a parse tree
• Root of the tree is the non-terminal start symbol, S
• Leaves are terminal symbols
• Internal nodes are the nonterminals
• Leaves can be parsed from left to right to view the results of production

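Parse Tree

[Figure: parse tree for the palindrome acttgttca. The root S has a single child W; five W nonterminals are nested one inside the next, and the leaves read a c t t g t t c a from left to right.]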
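To make the tree structure concrete, here is a small Python sketch that builds such a parse tree as nested (label, children) tuples and reads the leaves back from left to right. It assumes the palindrome grammar from the earlier slides; the names parse_tree, leaves and count_label are illustrative only.

```python
def parse_tree(seq):
    """Parse tree for a palindrome under S -> W, W -> xWx | x | epsilon.
    Internal nodes are (label, children) tuples; leaves are plain strings."""
    def build(s):
        if len(s) <= 1:
            return ("W", [s] if s else [])           # W -> x  or  W -> epsilon
        if s[0] != s[-1]:
            raise ValueError("not a palindrome: " + s)
        return ("W", [s[0], build(s[1:-1]), s[-1]])  # W -> xWx
    return ("S", [build(seq)])


def leaves(node):
    """Collect terminal symbols from left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    out = []
    for child in children:
        out.extend(leaves(child))
    return out


def count_label(node, label):
    """Count internal nodes carrying the given nonterminal label."""
    if isinstance(node, str):
        return 0
    name, children = node
    return (name == label) + sum(count_label(c, label) for c in children)


tree = parse_tree("acttgttca")
print("".join(leaves(tree)), count_label(tree, "W"))
```

Reading the leaves recovers the original sequence, and the tree for acttgttca contains one W per production step of its derivation.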

Amirkabir University of Technology
Department of Computer Engineering

CYK (Cocke-Younger-Kasami) Parsing Algorithm

Seyed Mohammad Hossein Moattar
Natural Language Processing

Parsing

• CFGs are the basis for describing the (syntactic) structure of NL sentences
• Thus, parsing algorithms are the core of NL analysis systems
• Recognition vs. parsing:
  – Recognition: deciding membership in the language
  – Parsing: recognition plus producing a parse tree for the input
• Is parsing more “difficult” than recognition? (time complexity)
• Ambiguity: an input may have exponentially many parses

CYK (Cocke-Younger-Kasami)

• One of the earliest recognition and parsing algorithms
• The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF).
• It is also possible to extend the CYK algorithm to handle some grammars which are not in CNF
  – Harder to understand
• Based on a “dynamic programming” approach:
  – Build solutions compositionally from sub-solutions
  – Store sub-solutions and re-use them whenever necessary

• Recognition version: decide whether S ⇒* w

CYK Algorithm

• The CYK algorithm for the membership problem is as follows:
  – Let the input string be a sequence of n letters a1 ... an.
  – Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
  – Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
  – For each i = 1 to n
    • For each unit production Rj -> ai, set P[i,1,j] = true.
  – For each i = 2 to n -- Length of span
    • For each j = 1 to n-i+1 -- Start of span
      – For each k = 1 to i-1 -- Partition of span
        » For each production RA -> RB RC
        » If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
  – If P[1,n,1] is true
    • Then string is member of language
    • Else string is not member of language

CYK Pseudocode

On input x = x1x2 … xn:

for (i = 1 to n)                          // create middle diagonal
    for (each var. A)
        if (A → xi)
            add A to table[i-1][i]
for (d = 2 to n)                          // d'th diagonal
    for (i = 0 to n-d)
        for (k = i+1 to i+d-1)            // split point
            for (each var. A)
                for (each var. B in table[i][k])
                    for (each var. C in table[k][i+d])
                        if (A → BC)
                            add A to table[i][i+d]
return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm
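The pseudocode above translates fairly directly into Python. The following is a minimal sketch, not a reference implementation: the rule encoding (pairs for A → a, triples for A → BC), the function names, and the accepts_empty flag for a start rule S → ε are all assumptions made for illustration. The example grammar is the one used in the worked aaabbb example in these slides.

```python
def cyk_table(word, unit_rules, binary_rules):
    """Fill the CYK table for a grammar in Chomsky Normal Form.

    unit_rules:   set of (A, a) pairs for productions A -> a
    binary_rules: set of (A, B, C) triples for productions A -> B C
    table[i][j] holds the variables that derive word[i:j].
    """
    n = len(word)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, ch in enumerate(word):                 # length-1 substrings
        for A, a in unit_rules:
            if a == ch:
                table[i][i + 1].add(A)
    for d in range(2, n + 1):                     # substring length
        for i in range(0, n - d + 1):             # start position
            j = i + d                             # end position
            for k in range(i + 1, j):             # split point
                for A, B, C in binary_rules:
                    if B in table[i][k] and C in table[k][j]:
                        table[i][j].add(A)
    return table


def cyk_accepts(word, unit_rules, binary_rules, start="S", accepts_empty=False):
    """Recognition: is word in the language of the grammar?"""
    if not word:
        return accepts_empty                      # models a rule S -> epsilon
    return start in cyk_table(word, unit_rules, binary_rules)[0][len(word)]


# Grammar G: S -> eps | AB | XB, T -> AB | XB, X -> AT, A -> a, B -> b
UNIT_RULES = {("A", "a"), ("B", "b")}
BINARY_RULES = {("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")}

print(cyk_accepts("aaabbb", UNIT_RULES, BINARY_RULES, accepts_empty=True))
```

Iterating over the binary rules inside the split loop, rather than over all variable triples as in the pseudocode, keeps the sketch short; the table contents are the same.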

• This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to be true if the sequence of letters starting from i of length j can be generated from Rk.
• Once it has considered sequences of length 1, it goes on to sequences of length 2, and so on.
• For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two halves, and checks to see if there is some production P -> Q R such that Q matches the first half and R matches the second half. If so, it records P as matching the whole subsequence.
• Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.

CYK Algorithm for Deciding Context Free Languages

Q: Consider the grammar G given by

S → ε | AB | XB
T → AB | XB
X → AT
A → a
B → b

1. Is x = aaabbb in L(G)?

Now look at aaabbb:

1) Write variables for all length 1 substrings: a a a b b b give A A A B B B.
2) Write variables for all length 2 substrings: only ab (start 2, end 4) gives S,T.
3) Write variables for all length 3 substrings: only aab (start 1, end 4) gives X.
4) Write variables for all length 4 substrings: only aabb (start 1, end 5) gives S,T.
5) Write variables for all length 5 substrings: only aaabb (start 0, end 5) gives X.
6) Write variables for all length 6 substrings: aaabbb gives S,T.

S is included, so aaabbb is accepted!

Can also use a table for the same purpose. Rows give the start position and columns the end position of each substring of aaabbb; the cells are filled one substring length at a time, following steps 1 to 6 (a dash means no variable derives that substring):

start \ end   1     2     3     4     5     6
0             A     -     -     -     X     S,T
1                   A     -     X     S,T   -
2                         A     S,T   -     -
3                               B     -     -
4                                     B     -
5                                           B

The entry for the whole string (start 0, end 6) contains S: ACCEPTED!

Parsing results

• We keep the results for every w_ij in a table.
• Note that we only need to fill in entries up to the diagonal
  – the longest substring starting at i is of length n-i+1

Constructing parse tree

• We need to construct parse trees for string w:
• Idea:
  – Keep back-pointers to the table entries that we combine
  – At the end, reconstruct a parse from the back-pointers
• This allows us to find all parse trees

References

• Hopcroft and Ullman, “Intro. to Automata Theory, Lang. and Comp.”, Section 6.3, pp. 139-141
• “CYK algorithm”, Wikipedia, the free encyclopedia
• A presentation by Zeph Grunschlag

The Nussinov-Jacobson Algorithm

[Figure: dynamic-programming matrix for the sequence ACAGUUGCA, identical to the matrix shown earlier and repeated over three slides. The first copy marks q = 9; the last is annotated with i < q ≤ j and the positions q-1 and q.]

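• Co-terminus foldings (positions i and j pair with each other):

[Figure: example folding; bases shown: A U C A U G G C A U]

• Partitionable foldings (the folding splits into two independent parts):

[Figure: the sequence A C A G U U G C A with positions 1 to 9]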

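Another way to write the Nussinov-Jacobson recursion

• Initialization:

\[
\gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L, \qquad
\gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L
\]

• Recursion:

\[
\gamma(i, j) = \max
\begin{cases}
\gamma(i+1, j); & \text{(special case of partitionable folding)} \\
\gamma(i, j-1); & \text{(special case of partitionable folding)} \\
\gamma(i+1, j-1) + \mathrm{BasePairScore}(i, j); & \text{(co-terminus folding)} \\
\max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) \right]. & \text{(partitionable folding)}
\end{cases}
\]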
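The recursion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: BasePairScore is taken to be 1 for Watson-Crick pairs (A-U, G-C) and 0 otherwise, no minimum loop length is enforced, and the names nussinov and WATSON_CRICK are illustrative. The traceback returns one optimal set of pairs, using 1-based positions as on the slides.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(seq, pairs=WATSON_CRICK):
    """Fill the gamma matrix for seq and trace back one optimal folding."""
    L = len(seq)
    gamma = [[0] * L for _ in range(L)]        # gamma(i, i) = 0

    def g(i, j):
        # gamma(i, i-1) = 0: the empty interval scores nothing
        return gamma[i][j] if i <= j else 0

    def score(i, j):
        return 1 if (seq[i], seq[j]) in pairs else 0

    for span in range(1, L):                   # fill by increasing j - i
        for i in range(L - span):
            j = i + span
            best = max(g(i + 1, j), g(i, j - 1),
                       g(i + 1, j - 1) + score(i, j))
            for k in range(i + 1, j):          # partitionable folding
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best

    found, stack = [], [(0, L - 1)]
    while stack:                               # traceback
        i, j = stack.pop()
        if i >= j:
            continue
        if g(i + 1, j) == gamma[i][j]:
            stack.append((i + 1, j))
        elif g(i, j - 1) == gamma[i][j]:
            stack.append((i, j - 1))
        elif g(i + 1, j - 1) + score(i, j) == gamma[i][j]:
            found.append((i + 1, j + 1))       # report 1-based positions
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if gamma[i][k] + gamma[k + 1][j] == gamma[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return gamma, sorted(found)


gamma, base_pairs = nussinov("ACAGUUGCA")
print(gamma[0], base_pairs)
```

For ACAGUUGCA the first row of gamma matches row 1 of the matrix on the earlier slides, and the entry for the whole sequence is the number of base pairs in the best folding.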

SCFG version of the Nussinov-Jacobson algorithm

• Stochastic Context-Free Grammars
• Makes use of production rules:
  – W → aW | cW | gW | uW (i unpaired)
• Every production rule has an associated probability parameter.
• The maximum probability parse is equivalent to the maximum probability secondary structure.

SCFG Version of Nussinov-Jacobson Algorithm

• The algorithm can be converted to a stochastic context-free grammar:
• S → W
• W → aW | cW | gW | uW
• W → Wa | Wc | Wg | Wu
• W → aWu | cWg | uWa | gWc
• W → WW

Needed terminology

• The inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs.
• The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum-probability alignment of the SCFG to the sequence.

CYK for Nussinov-style RNA SCFG

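Addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.

• Initialization:

\[
\gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L, \qquad
\gamma(i, i) = \max
\begin{cases}
\log p(x_i W) \\
\log p(W x_i)
\end{cases}
\quad \text{for } i = 1 \text{ to } L
\]

• Recursion:

\[
\gamma(i, j) = \max
\begin{cases}
\gamma(i+1, j) + \log p(x_i W); & \text{(special case of partitionable folding)} \\
\gamma(i, j-1) + \log p(W x_j); & \text{(special case of partitionable folding)} \\
\gamma(i+1, j-1) + \log p(x_i W x_j); & \text{(co-terminus folding)} \\
\max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) + \log p(WW) \right]. & \text{(partitionable folding)}
\end{cases}
\]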
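A minimal Python sketch of this log-probability fill stage follows. The probability values are hypothetical placeholders chosen only so that they sum to 1 over all W productions (they are not trained parameters), and the key layout of the logp dictionary and the names make_logp and cyk_scfg are assumptions made for illustration.

```python
import math


def make_logp(left=0.1, right=0.05, pair=0.075, bif=0.1):
    """Hypothetical rule probabilities: 4*left + 4*right + 4*pair + bif = 1."""
    logp = {("bif",): math.log(bif)}                  # W -> WW
    for a in "ACGU":
        logp[("left", a)] = math.log(left)            # W -> aW
        logp[("right", a)] = math.log(right)          # W -> Wa
    for a, b in ("AU", "UA", "GC", "CG"):
        logp[("pair", a, b)] = math.log(pair)         # W -> aWb
    return logp


def cyk_scfg(seq, logp):
    """Return gamma(1, L), the log probability of the best parse of seq."""
    L = len(seq)
    NEG = -math.inf
    if L == 0:
        return NEG
    gamma = [[NEG] * L for _ in range(L)]

    def g(i, j):
        return gamma[i][j] if i <= j else NEG         # gamma(i, i-1) = -inf

    def lp(*key):
        return logp.get(key, NEG)                     # missing rule: probability 0

    for i in range(L):                                # initialization
        gamma[i][i] = max(lp("left", seq[i]), lp("right", seq[i]))
    for span in range(1, L):                          # recursion
        for i in range(L - span):
            j = i + span
            cands = [
                g(i + 1, j) + lp("left", seq[i]),
                g(i, j - 1) + lp("right", seq[j]),
                g(i + 1, j - 1) + lp("pair", seq[i], seq[j]),
            ]
            cands += [gamma[i][k] + gamma[k + 1][j] + lp("bif")
                      for k in range(i + 1, j)]
            gamma[i][j] = max(cands)
    return gamma[0][L - 1]


print(cyk_scfg("AAU", make_logp()))
```

With these placeholder values, pairing the outer A and U of AAU scores higher than emitting all three bases unpaired, which is the kind of trade-off the probabilistic model encodes.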

CYK for Nussinov-style RNA SCFG (2)

• The final score \gamma(1, L) = \log P(x, \hat{\pi} \mid \theta) is the log likelihood of the optimal structure \hat{\pi} given the SCFG model \theta
• The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm

Example of RNA Structure SCFG

• The RNA structure that MFOLD produces for the following sequence (5’ to 3’) can be constructed with the grammar:

• GCUUACGACCAUAUCACGUUGAAUGCAC GCCAUCCCGUCCGAUCUGGCAAGUUAAG CAACGUUGAGUCCAGUUAGUACUUGGAU CGGAGACGGCCUGGGAAUCCUGGAUGU UGUAAGCU

Example Construction

• S
• W
• Wu
• gWcu
• gcWgcu
• gcuWagcu
• gcuuWaagcu
• gcuuaWuaagcu
• gcuuacWguaagcu
• gcuuacgWuguaagcu
• gcuuacgaWuuguaagcu
• gcuuacgacWguuguaagcu
• gcuuacgaccWguuguaagcu
• gcuuacgaccaWguuguaagcu....

CYK for Nussinov-style RNA SCFG

• Good starting example, but it is too simple to be an accurate RNA folder
• The algorithm does not consider important structural features, such as preferences for certain:
  – Loop lengths
  – Nearest neighbours in the structure, caused by stacking interactions between neighbouring base pairs in a stem.