03 Pumping Lemma and Chomsky Hierarchy

Introduction to Computational Linguistics, Summer Term 2008
Intro CL | Session 3: Pumping lemma & the Chomsky hierarchy
Stefan Evert & Peter Bosch
Institute of Cognitive Science, University of Osnabrück

FSA ≡ regular expressions
• FSA & regular expressions are equivalent
  - regular expression → FSA (has been shown above)
  - FSA → regular expression (not so obvious)
• regular language A
  ≡ A described by regular expression
  ≡ A recognised by FSA

Proving the limits of regexp
• Show that no FSA recognises A = {a^n b^n} exactly
  - regular languages have a relatively simple structure
  - basic idea: if an FSA recognises all strings a^n b^n, then it will also accept strings that do not belong to A
  - our argument will depend on very long strings, since every finite language is regular (extensional def.)

FSA and long strings
[Figure: FSA with states 0 to 7 and transitions labelled a and b, processing the input a a a b b b; a second build shows the input a a a a a b b b b b with a cycle marked on the path]

Cycles
• Can we be sure that the FSA has to traverse a cycle?
  - given a finite-state automaton with K states
  - each transition "consumes" one symbol of input
  - after K symbols, the FSA must have visited some state twice (because it has passed through K + 1 states)
  - this state marks a cycle in the path
• For every accepted input string w with |w| ≥ K, the FSA must traverse a cycle
  - the corresponding substring can be pumped indefinitely

The pumping lemma
• Given an arbitrary regular language A
• there is some constant K
  - viz. the number of states of an FSA that accepts A
• such that for all z ∈ A with |z| ≥ K
• we can write z = uvw with v ≠ ε and |uv| ≤ K
  - "v is a nonempty substring among the first K symbols"
  - v corresponds to the first cycle of the path through the FSA
• Then: u v^n w ∈ A for n = 0, 1, 2, …
  - i.e. the cycle can be repeated any number of times

Application of the pumping lemma
• Proof by contradiction
  - assume that A = {a^n b^n} is regular → recognised by some FSA with K states
  - every string a^n b^n with |a^n b^n| ≥ K can be "pumped"
  - select a suitable string, e.g. z = a^K b^K
  - segmentation: z = uvw with v ≠ ε and |uv| ≤ K
    (we don't know the precise substring v that can be pumped!)
  - but v has to lie completely within a^K → v = a^m with m ≥ 1
  - pumping lemma: u v^2 w = a^(K+m) b^K ∈ A
    → contradiction, since K + m ≠ K means a^(K+m) b^K ∉ A

What about natural language?
• the cheese that the mouse stole was expensive
• the cheese that the mouse that the cat caught stole was expensive
• the cheese that the mouse that the cat that the dog chased caught stole was expensive
• the cheese that the mouse that the cat that the dog that the girl saw chased caught stole was expensive
• * the cheese that stole that caught that chased the mouse the cat the dog was expensive
• * the cheese that the mouse caught stole was expensive
☞ NL contains bracketing of type a^n b^n

Context-free grammars (CFG)
• What is an appropriate grammar formalism for bracketing structures?
• Context-free grammar G = (Σ, V, S, P)
  - Σ = alphabet (terminal symbols)
  - V = variables (nonterminals), S ∈ V = start symbol
  - P = set of productions A → α ∈ P ("rewrite rules")
  - more about CFG in Session 5
• A context-free grammar for A = {a^n b^n}
  - V = {S}, P = {S → a S b | a b}
  - derivation S ⇒ a S b ⇒ a a S b b ⇒ a a a b b b = a^3 b^3

Can CFG do everything?
• Are CFG the only formalism we need?
  - except when regular expressions are already sufficient
• Are there languages that cannot be described?
  - it's much harder to find such examples for CFG
• Three simple structures that are not CF
  - multiple repetition: a^n b^n c^n
  - crossing brackets: a^n b^k a^n b^k
  - COPY language: {ww | w ∈ Σ*}
  - but a^n b^k a^k b^n and {w w^R | w ∈ Σ*} are context-free

The pumping lemma for context-free languages
• Given an arbitrary context-free language A
• there is some constant K
• such that for all z ∈ A with |z| ≥ K
• we can write z = uvwxy with v ≠ ε or x ≠ ε, and |vwx| ≤ K
  - two substrings within a region of at most K symbols
  - corresponds to a recursion cycle in the (binary) parse tree
• Then: u v^n w x^n y ∈ A for n = 0, 1, 2, …
  - both substrings are pumped "in parallel"

Is NL context-free?
• Long-standing debate in linguistics
  - see "Footloose and context-free", Chapter 16 of Pullum, Geoffrey K. (1991). The Great Eskimo Vocabulary Hoax. The University of Chicago Press.
• Evidence found for a number of languages:
  - Bambara (reduplication → COPY language)
  - Swiss German dialects (cross-serial dependencies)
  - Dutch (multiple repetition)
• No mathematical proof for English or German
  - but that does not mean that CFG are a linguistically appropriate formalism for natural language!

Bambara
• Bambara has structures of the form ww
  - wulu 'dog'
  - wulu-filela 'dog watcher'
  - wulu-filela-nyinila 'dog watcher hunter'
  - wulu-o-wulu 'whatever dog'
  - wulu-filela-o-wulu-filela 'whatever dog watcher'
  - wulu-filela-nyinila-o-wulu-filela-nyinila 'whatever dog watcher hunter'

Swiss German
• Standard German: nested a^n b^k a^k b^n
  - Jan sagt, dass wir die Kinder_1 dem Mann_2 das Haus_3 anstreichen_3 helfen_2 lassen_1 wollen haben.
    'Jan says that we wanted to let the children help Hans (the man) paint the house.'
• Swiss German has cross-serial a^n b^k a^n b^k
  - Jan säit, das mer d'Chind_1 em Hans_2 es Huus_3 händ wele laa_1 hälfe_2 aastriiche_3.
  - different verbs select for dative/accusative objects
from: Shieber, Stuart M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8, 333–343.

Dutch
• Multiple repetition a^n b^n c^n in Dutch
  - Of Jan Piet hoorde en zag?
    'Did Jan hear Piet and see [him]?'
  - Of Jan Piet Marie hoorde ontmoeten en zag omhelzen?
    'Did J. hear P. meet M. and see [him] embrace [her]?'
  - Of Jan Piet Wim Marie hoorde helpen ontmoeten en zag horen omhelzen?
    'Did Jan hear Piet help Wim meet Marie and see [P.] hear [W.] embrace [M.]?'
from: Manaster-Ramer, Alexis (1987). Dutch as a formal language. Linguistics and Philosophy, 10(2), 221–246.

What else is there?
• Type 0: algorithmic languages (Turing machines, rewrite grammars)
• Type 1: context-sensitive languages (linear bounded automata, context-sensitive grammars)
• Type 2: context-free languages (context-free grammars, pushdown automata)
• Type 3: regular languages (regular expressions, finite-state automata)
• Perl-compatible regular expressions (PCRE)

Grammars and recognisers in the Chomsky hierarchy
grammar (description) vs. recogniser (algorithm)
• type 0: algorithmic languages
  - grammar: rewrite grammars, α → β
  - recogniser: Turing machines (TM)
• type 1: context-sensitive languages
  - grammar: context-sensitive grammars, βAγ → βαγ or α → β with |α| ≤ |β|
  - recogniser: linear bounded TM
• type 2: context-free languages
  - grammar: context-free grammars, A → α with α ∈ (V ∪ Σ)*
  - recogniser: push-down automata
• type 3: regular languages
  - grammar: regular expressions; linear grammars, A → aB
  - recogniser: finite-state automata

Issues in formal language theory
• Establish hierarchies of language classes
  - each class corresponds to a grammar formalism
• Algorithmic properties
  - decide whether a string belongs to a language
  - are two grammars equivalent? etc.
• Closure properties (of language classes)
  - closed under union, intersection, Kleene star, … ?

Algorithmic & closure properties
Algorithmic properties (√ = decidable, - = not decidable):

  class         w ∈ A?   A = ∅?   A = Σ*?   A ⊆ B?   A = B?
  algorithmic     -        -        -         -        -
  CS              √        -        -         -        -
  CF              √        √        -         -        -
  regular         √        √        √         √        √

Closure properties (√ = closed, - = not closed; R is a regular language, A^C the complement of A):

  class         A ∪ B   A ◦ B   A*   A ∩ B   A ∩ R   A^C   f(A)
  algorithmic     √       √     √      √       √      -      √
  CS              √       √     √      √       √      √      √
  CF              √       √     √      -       √      -      √
  regular         √       √     √      √       √      √      √

Intersection of FSA
[Figure: deterministic FSAs for (aba|ba)*b and a(bab|ab)*, and their product automaton with paired states such as 0/0, 1/1, 2/3, 0/4, 3/1, 0/2]
• Intersection algorithm for FSA → regular languages are closed under intersection
• There is no "intersection" operator for regexp!
  - but we now know it could be added easily
• Application: fast matching of word lists
  - match regular expression against large word list
  - word list encoded as minimised FSA
  - regular expression translated to deterministic FSA
  - compute intersection of word list and regexp FSAs

Complement of FSA
• a(a|bc)* over alphabet Σ = {a, b, c}
[Figure: three automata for a(a|bc)*: the deterministic FSA with states 0, 1, 2; the "complete" FSA with an added error state E; the complement FSA, which swaps final and non-final states]

Homework
• Read full handout!
• Reading assignment
  - Jurafsky & Martin, Chapters 2 & 15 (Chapters 2 & 13 in first edition)
• Homework assignment 3
  - various applications of regular expressions (tokenisation, corpus search)
  - a little larger because you've got two weeks to do it
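
Worked sketch: the cycle argument in code

The pigeonhole argument behind the pumping lemma (after K symbols an FSA with K states must have revisited some state) can be made concrete with a small Python sketch. It is not part of the original handout: the function names `pump_split` and `accepts` and the example automaton (a complete deterministic FSA for a*b* with K = 3 states) are assumptions chosen purely for illustration.

```python
def pump_split(delta, start, z):
    """Split z into (u, v, w), where v labels the first cycle on the path."""
    seen = {start: 0}  # state -> number of symbols consumed at first visit
    state = start
    for i, sym in enumerate(z, start=1):
        state = delta[(state, sym)]
        if state in seen:          # state visited twice: a cycle is closed
            j = seen[state]
            return z[:j], z[j:i], z[i:]
        seen[state] = i
    return None  # no state repeated (only possible if len(z) < number of states)


def accepts(delta, start, finals, z):
    """Run the deterministic FSA on z and report whether it ends in a final state."""
    state = start
    for sym in z:
        state = delta[(state, sym)]
    return state in finals


# Hypothetical example: complete deterministic FSA for a*b*.
# State 0 reads a's, state 1 reads b's, state 2 is the error state.
DELTA = {
    (0, "a"): 0, (0, "b"): 1,
    (1, "a"): 2, (1, "b"): 1,
    (2, "a"): 2, (2, "b"): 2,
}
START, FINALS, K = 0, {0, 1}, 3

z = "a" * K + "b" * K                  # the string a^K b^K from the proof
u, v, w = pump_split(DELTA, START, z)
assert v != "" and len(u + v) <= K     # conditions of the pumping lemma

# Every pumped string u v^n w is still accepted by this FSA ...
assert all(accepts(DELTA, START, FINALS, u + v * n + w) for n in range(5))

# ... but u v^2 w no longer has equally many a's and b's.
pumped = u + v * 2 + w
print(pumped, pumped.count("a"), pumped.count("b"))
```

The example automaton accepts a*b*, a proper superset of {a^n b^n}, so accepting the pumped string is harmless for it. The proof by contradiction applies the same split to a supposed FSA for exactly {a^n b^n}, where accepting a^(K+m) b^K is impossible.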
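
Worked sketch: complement and intersection of FSA

The two constructions named on the "Complement of FSA" and "Intersection of FSA" slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the handout's own code: the transition table for a(a|bc)* is reconstructed from the regular expression, the second automaton (strings of even length) is a hypothetical example, and all function names are invented here.

```python
from itertools import product


def complete(delta, states, alphabet, sink="E"):
    """Add an error state so that every (state, symbol) pair has a transition."""
    full = dict(delta)
    for q in list(states) + [sink]:
        for s in alphabet:
            full.setdefault((q, s), sink)
    return full, set(states) | {sink}


def complement(states, finals):
    """Complement of a complete deterministic FSA: swap final and non-final states."""
    return set(states) - set(finals)


def intersect(d1, f1, d2, f2, states1, states2, alphabet):
    """Product construction: run both complete deterministic FSAs in parallel."""
    delta = {((p, q), s): (d1[(p, s)], d2[(q, s)])
             for p, q in product(states1, states2) for s in alphabet}
    finals = {(p, q) for p in f1 for q in f2}
    return delta, finals


def accepts(delta, start, finals, word):
    state = start
    for s in word:
        state = delta[(state, s)]
    return state in finals


ALPHABET = "abc"

# Deterministic FSA for a(a|bc)* (reconstructed from the regular expression).
D1 = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "c"): 1}
FULL1, STATES1 = complete(D1, {0, 1, 2}, ALPHABET)
FINALS1 = {1}
CO_FINALS1 = complement(STATES1, FINALS1)   # states 0, 2 and E

# Hypothetical second automaton: strings of even length.
EVEN = {(p, s): q
        for p, q in (("even", "odd"), ("odd", "even")) for s in ALPHABET}
STATES2, FINALS2 = {"even", "odd"}, {"even"}

BOTH, BOTH_FINALS = intersect(FULL1, FINALS1, EVEN, FINALS2,
                              STATES1, STATES2, ALPHABET)

for word in ["a", "aa", "abc", "abca", "ab", ""]:
    print(repr(word),
          accepts(FULL1, 0, FINALS1, word),               # in a(a|bc)*
          accepts(FULL1, 0, CO_FINALS1, word),            # in the complement
          accepts(BOTH, (0, "even"), BOTH_FINALS, word))  # in the intersection
```

Swapping final states only yields the complement when the automaton is deterministic and complete, which is why the error state is added first. The product automaton built here includes unreachable state pairs; a practical implementation would typically explore only the pairs reachable from the start pair.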
