CYK Algorithm

RNA Secondary Structure Prediction

RNA structure prediction methods
◼ Base-Pair Maximization
◼ Context-Free Grammar Parsing
◼ Free Energy Methods
◼ Covariance Models

The Nussinov-Jacobson Algorithm
Base-pair maximization table for the sequence ACAGUUGCA (rows i, columns j). The entry for i = 1, j = 9 does not appear in the source.

         1  2  3  4  5  6  7  8  9
         A  C  A  G  U  U  G  C  A
  1  A   0  0  0  1  2  2  2  3
  2  C   0  0  0  1  1  1  2  2  3
  3  A      0  0  0  1  1  1  2  3
  4  G         0  0  0  0  0  1  2
  5  U            0  0  0  0  1  2
  6  U               0  0  0  1  2
  7  G                  0  0  1  1
  8  C                     0  0  0
  9  A                        0  0

Slide annotation: q = 9

SCFG Version
• The Nussinov algorithm can be converted to a stochastic context-free grammar:
• S → W
• W → aW | cW | gW | uW
• W → Wa | Wc | Wg | Wu
• W → aWu | cWg | uWa | gWc
• W → WW

SCFGs
• Stochastic context-free grammars (SCFGs) have also been used to model RNA secondary structure
• Examples
  – tRNAscan-SE
  – program created to find snoRNAs
• Grammars are created from a training set of data and are then applied to candidate sequences to see whether they fit into the language
• SCFGs allow the detection of sequences belonging to a family
  – tRNAs
  – group I introns
  – snoRNAs
  – snRNAs
• Any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279)

Transformational Grammars
• First described by the linguist Noam Chomsky in the 1950s.
  – (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)
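The base-pair maximization table from the Nussinov-Jacobson slide can be reproduced with a short dynamic program. The following is a minimal sketch, not the lecture's own code: the function name `nussinov` is hypothetical, indices are 0-based, only A-U and G-C pairs are counted (which matches the zeros the slide shows for G-U neighbours), and no minimum loop length is enforced.

```python
def nussinov(seq, pairs=frozenset({("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")})):
    """Return score, where score[i][j] is the maximum number of nested
    base pairs in seq[i..j] (0-based, inclusive).

    Assumption: only the pairs listed in `pairs` count, and hairpin loops
    may have any length, including zero.
    """
    n = len(seq)
    # Cells on and below the diagonal are never written and stay 0.
    score = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # span = j - i
        for i in range(n - span):
            j = i + span
            best = max(score[i + 1][j],       # base i left unpaired
                       score[i][j - 1])       # base j left unpaired
            if (seq[i], seq[j]) in pairs:     # base i paired with base j
                best = max(best, score[i + 1][j - 1] + 1)
            for k in range(i + 1, j):         # two independent substructures
                best = max(best, score[i][k] + score[k + 1][j])
            score[i][j] = best
    return score


table = nussinov("ACAGUUGCA")
print(table[0][8])  # maximum number of pairs for the whole sequence
```

The four cases in the inner loop correspond to the productions of the SCFG version: `aW`-style and `Wa`-style rules for an unpaired end, `aWu`-style rules for a pair, and `WW` for the bifurcation.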
13 June 2006

Transformational Grammars
• Very important in computer science, most notably in compiler design
• Covered in detail in compiler and automata courses
• Idea: take an output (a sentence, an RNA structure) and determine whether it can be produced using a set of rules
• Consist of a set of symbols and production rules
• The symbols can be terminal (emitting) symbols or non-terminal symbols

Grammar for Palindromes
• Consider palindromic DNA sequences
• Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol)
• Production rules, where S and W are non-terminal symbols:
• S → W
• W → aWa | cWc | gWg | tWt
• W → a | c | g | t | ε

Derivation of Sequences
• Using these production rules, a derivation of the palindromic sequence acttgttca follows:
• S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca

Example
[Figure: example slides; image content not present in the source]

SCFGs for RNA
• Base-paired columns are modeled by pairwise-emitting nonterminals
  – aWu; uWa; gWc; cWg; ...
• Single-stranded columns are modeled by leftwise-emitting nonterminals (when possible)
  – aW; cW; gW; uW; ...

Parse Trees
• A context-free grammar can be aligned to a sequence using a parse tree
• The root of the tree is the non-terminal start symbol, S
• Leaves are terminal symbols
• Internal nodes are nonterminals
• The leaves can be read from left to right to view the results of the productions

Parse Tree
[Figure: parse tree for acttgttca, with root S, five W nodes, and leaves a c t t g t t c a]

Amirkabir University of Technology
Computer Engineering Department
CYK (Cocke-Younger-Kasami) Parsing Algorithm
Seyed Mohammad Hossein Moattar
Natural Language Processing

Parsing Algorithms
• CFGs are the basis for describing the (syntactic) structure of NL sentences
• Thus, parsing algorithms are the core of NL analysis systems
• Recognition vs. parsing:
  – Recognition: deciding membership in the language
  – Parsing: recognition plus producing a parse tree for the input
• Is parsing more “difficult” than recognition? (time complexity)
• Ambiguity: an input may have exponentially many parses

CYK (Cocke-Younger-Kasami)
• One of the earliest recognition and parsing algorithms
• The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF).
• It is also possible to extend the CYK algorithm to handle some grammars which are not in CNF
  – Harder to understand
• Based on a “dynamic programming” approach:
  – Build solutions compositionally from sub-solutions
  – Store sub-solutions and re-use them whenever necessary
• Recognition version: decide whether S ⇒* w

CYK Algorithm
• The CYK algorithm for the membership problem is as follows:
  – Let the input string be a sequence of n letters a1 ... an.
  – Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
  – Let P[n,n,r] be an array of booleans, indexed by start position, span length, and symbol. Initialize all elements of P to false.
  – For each i = 1 to n
    • For each unit production Rj -> ai, set P[i,1,j] = true.
  – For each i = 2 to n -- Length of span
    • For each j = 1 to n-i+1 -- Start of span
      – For each k = 1 to i-1 -- Partition of span
        » For each production RA -> RB RC
        » If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
  – If P[1,n,1] is true
    • Then the string is a member of the language
    • Else the string is not a member of the language

CYK Pseudocode
On input x = x1x2 … xn :
  for (i = 1 to n)                       // create middle diagonal
    for (each var. A)
      if (A → xi) add A to table[i-1][i]
  for (d = 2 to n)                       // d'th diagonal
    for (i = 0 to n-d)
      for (k = i+1 to i+d-1)
        for (each var. A)
          for (each var. B in table[i][k])
            for (each var. C in table[k][i+d])
              if (A → BC) add A to table[i][i+d]
  return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm
• This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence starting at i of length j can be generated from Rk.
• Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on.
• For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production P -> Q R such that Q matches the first part and R matches the second part. If so, it records P as matching the whole subsequence.
• Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.

CYK Algorithm for Deciding Context Free Languages
Q: Consider the grammar G given by
  S → ε | AB | XB
  T → AB | XB
  X → AT
  A → a
  B → b
1. Is x = aaabbb in L(G)?

Now look at aaabbb:

  a   a   a   b   b   b

1) Write variables for all length 1 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B

2) Write variables for all length 2 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B
           S,T

3) Write variables for all length 3 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B
           S,T
          X

4) Write variables for all length 4 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B
           S,T
          X
           S,T

5) Write variables for all length 5 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B
           S,T
          X
           S,T
          X

6) Write variables for all length 6 substrings.

  a   a   a   b   b   b
  A   A   A   B   B   B
           S,T
          X
           S,T
          X
           S,T

S is included, so aaabbb is accepted!

Can also use a table for the same purpose. Rows give the start position and columns the end position of a substring of aaabbb.

start \ end    1     2     3     4     5     6
    0
    1
    2
    3
    4
    5

1. Variables for length 1 substrings.

start \ end    1     2     3     4     5     6
    0          A
    1                A
    2                      A
    3                            B
    4                                  B
    5                                        B

2. Variables for length 2 substrings.

start \ end    1     2     3     4     5     6
    0          A     -
    1                A     -
    2                      A     S,T
    3                            B     -
    4                                  B     -
    5                                        B

3. Variables for length 3 substrings.

start \ end    1     2     3     4     5     6
    0          A     -     -
    1                A     -     X
    2                      A     S,T   -
    3                            B     -     -
    4                                  B     -
    5                                        B

4.
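The table-filling walk-through above can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the slides: the names `cyk_table`, `cyk_accepts`, `UNIT_RULES` and `BINARY_RULES` are hypothetical, the grammar is assumed to be in Chomsky Normal Form, and the rule S → ε is handled by a separate flag because it only matters for the empty string. As in the table version, `table[i][j]` holds the variables that derive the substring starting at position i and ending at position j.

```python
def cyk_table(word, unit_rules, binary_rules):
    """Fill the CYK table; table[i][j] is the set of variables deriving word[i:j]."""
    n = len(word)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, letter in enumerate(word):            # length-1 substrings
        for head, terminal in unit_rules:
            if terminal == letter:
                table[i][i + 1].add(head)
    for d in range(2, n + 1):                    # substring length
        for i in range(n - d + 1):               # start position
            for k in range(i + 1, i + d):        # split point
                for head, left, right in binary_rules:
                    if left in table[i][k] and right in table[k][i + d]:
                        table[i][i + d].add(head)
    return table


def cyk_accepts(word, unit_rules, binary_rules, start="S", start_derives_empty=False):
    """Return True if the start variable derives word."""
    if not word:
        return start_derives_empty               # only S -> ε can produce ""
    return start in cyk_table(word, unit_rules, binary_rules)[0][len(word)]


# Grammar from the worked example (S -> ε is passed as start_derives_empty=True).
UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [
    ("S", "A", "B"), ("S", "X", "B"),
    ("T", "A", "B"), ("T", "X", "B"),
    ("X", "A", "T"),
]

print(cyk_accepts("aaabbb", UNIT_RULES, BINARY_RULES, start_derives_empty=True))
```

Inspecting `cyk_table("aaabbb", UNIT_RULES, BINARY_RULES)` cell by cell gives the same entries as the tables for lengths 1 to 3, for instance S,T at start 2, end 4 and X at start 1, end 4. The three nested loops over d, i and k are where the cubic running time in the input length comes from.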
