The CYK

자연어처리 The CYK Algorithm • The membership problem: – Problem: • Given a context-free grammar G and a string w – G = (V, ∑ ,P , S) where » V finite set of variables » ∑ (the alphabet) finite set of terminal symbols » P finite set of rules » S start symbol (distinguished element of V) » V and ∑ are assumed to be disjoint – G is used to generate the string of a language – Question: • Is w in L(G)? The CYK Algorithm

• J. Cocke • D. Younger, • T. Kasami

– Independently developed an algorithm to answer this question. The CYK Algorithm Basics

– The Structure of the rules in a grammar

– Uses a “” or “table-filling algorithm” Chomsky Normal Form

• Normal Form is described by a set of conditions that each rule in the grammar must satisfy • Context-free grammar is in CNF if each rule has one of the following forms: – A  BC at most 2 symbols on right side – A  a, or terminal symbol – S  λ null string where B, C Є V – {S} Construct a Triangular Table

• Each row corresponds to one length of substrings –Bottom Row – Strings of length 1 –Second from Bottom Row – Strings of length 2 . . –Top Row – string ‘w’ Construct a Triangular Table

•Xi, i is the set of variables A such that

A  wi is a production of G

•Compare at most n pairs of previously computed sets:

(Xi, i , Xi+1, j ), (Xi, i+1 , Xi+2, j ) … (Xi, j-1 , Xj, j ) e.g. i=1, j=5

(X1,1 , X2,5 ), (X1,2 , X3,5 ) … (X1,4 , X5,5 )

Construct a Triangular Table

X1, 5

X1, 4 X2, 5

X1, 3 X2, 4 X3, 5

X1, 2 X2, 3 X3, 4 X4, 5

X1, 1 X2, 2 X3, 3 X4, 4 X5, 5

w1 w2 w3 w4 w5 Table for string ‘w’ that has length 5 Construct a Triangular Table

X1, 5

X1, 4 X2, 5

X1, 3 X2, 4 X3, 5

X1, 2 X2, 3 X3, 4 X4, 5

X1, 1 X2, 2 X3, 3 X4, 4 X5, 5

w1 w2 w3 w4 w5

Looking for pairs to compare Example CYK Algorithm

• Show the CYK Algorithm with the following example: – CNF grammar G • S  NP VP • NP  DET NP | NP NP | time | flies | arrow • VP  VP NP | VP PP | flies | like • PP  P NP • DET  an • P  like – w is "time flies like an arrow" – Question Is "time flies like an arrow" in L(G)? Constructing The Triangular Table

{NP} (X1,1) {NP,VP} (X2,2) {VP,P} (X3,3) {DET} (X4,4) {NP} (X5,5)

time (1) flies (2) like (3) an (4) arrow (5)

Calculating the Bottom ROW: Xi, i Constructing The Triangular Table

{NP,S} (X1,2) {S} (X2,3) {} (X3,4) {NP} (X4,5) {NP} {NP,VP} {VP,P} {DET} {NP}

time (1) flies (2) like (3) an (4) arrow (5)

X1,2: (X1,1 , X2,2 ) X3,4: (X3,3 , X4,4 )

X2,3: (X2,2 , X3,3 ) X4,5: (X4,4 , X5,5 ) Constructing The Triangular Table

{S} (X1,3) {} (X2,4) {VP,PP} (X3,5) {NP,S} {S} {} {NP} {NP} {NP,VP} {VP,P} {DET} {NP}

time (1) flies (2) like (3) an (4) arrow (5)

X1,3: (X1,1 , X2,3 ), (X1,2 , X3,3 )

X3,5: (X3,3 , X4,5 ), (X3,4 , X5,5 ) Constructing The Triangular Table

{} (X1,4) {S,VP} (X2,5) {S} {} {VP,PP} {NP,S} {S} {} {NP} {NP} {NP,VP} {VP,P} {DET} {NP} time (1) flies (2) like (3) an (4) arrow (5)

X1,4: (X1,1 , X2,4 ), (X1,2 , X3,4 ) , (X1,3 , X4,4 )

X2,5: (X2,2 , X3,5 ), (X2,3 , X4,5 ) , (X2,4 , X5,5 ) Constructing The Triangular Table

{S} (X1,5) {} {S,VP} {S} {} {VP,PP} {NP,S} {S} {} {NP} {NP} {NP,VP} {VP,P} {DET} {NP}

time (1) flies (2) like (3) an (4) arrow (5)

X1,5: (X1,1 , X2,5 ), (X1,2 , X3,5 ) , (X1,3 , X4,5 ) , (X1,4 , X5,5 )

CYK algorithm: Pseudocode let the input be a string S consisting of n characters: a1 ... an. let the grammar contain r nonterminal symbols R1 ... Rr. This grammar contains the subset Rs which is the set of start symbols. let P[n,n,r] be an array of booleans. Initialize all elements of P to false. for each i = 1 to n

for each unit production Rj  ai set P[i,1,j] = true for each i = 2 to n -- Length of span for each j = 1 to n-i+1 -- Start of span for each k = 1 to i-1 -- Partition of span

for each production RA  RB RC if P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true if any of P[1,n,x] is true (x is iterated over the set s, where s are all the indices for Rs) then S is member of language else S is not member of language