Introduction to Computational Linguistics, Summer Term 2008
Intro CL | Session 3: Pumping lemma & the Chomsky hierarchy
Stefan Evert & Peter Bosch
Institute of Cognitive Science, University of Osnabrück

FSA ≡ regular expressions

➔ FSA & regular expressions are equivalent
  • regular expression → FSA (has been shown above)
  • FSA → regular expression (not so obvious)

A is a regular language ≡ A described by regular expression ≡ A recognised by FSA


Proving the limits of regexp

➔ Show that no FSA recognises A = {a^n b^n} exactly
  • regular languages have a relatively simple structure
  • basic idea: if FSA recognises all strings a^n b^n, then it will also accept strings that do not belong to A
  • our argument will depend on very long strings, since every finite language is regular (extensional def.)

FSA and long strings

[Figure: FSA with states 0 to 7 and transitions labelled a and b; input string a a a b b b]

FSA and long strings

[Figure: FSA with states 0 to 7 and transitions labelled a and b, with a cycle marked on the path; input string a a a a a b b b b b]

Cycles

➔ Can we be sure that FSA has to traverse cycle?
  • given a finite-state automaton with K states
  • each transition “consumes” one symbol of input
  • after K symbols, FSA must have visited some state twice (because it has passed through K + 1 states)
  • this state marks a cycle in the path
➔ For every accepted input string w with |w| ≥ K, the FSA must traverse a cycle
  • corresponding substring can be pumped indefinitely
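The pigeonhole argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3-state automaton (strings over {a, b} ending in "ab") and the names `first_cycle_split`, `accepts`, `DELTA` are hypothetical and not taken from the slides.

```python
def first_cycle_split(delta, start, z):
    """Split z = u v w, where v is the substring read along the first cycle.

    delta maps (state, symbol) -> state for a deterministic, complete FSA.
    Returns (u, v, w), or None if no state repeats while reading z.
    """
    seen = {start: 0}  # state -> number of symbols consumed at first visit
    state = start
    for i, symbol in enumerate(z, start=1):
        state = delta[(state, symbol)]
        if state in seen:  # pigeonhole: this state was visited before
            j = seen[state]
            return z[:j], z[j:i], z[i:]
        seen[state] = i
    return None


def accepts(delta, start, finals, z):
    """Run the FSA on z and report whether it ends in a final state."""
    state = start
    for symbol in z:
        state = delta[(state, symbol)]
    return state in finals


# Hypothetical FSA with K = 3 states: strings over {a, b} that end in "ab".
DELTA = {(0, "a"): 1, (0, "b"): 0,
         (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 0}
START, FINALS = 0, {2}

u, v, w = first_cycle_split(DELTA, START, "aab")
pumped = [u + v * n + w for n in range(4)]  # u w, u v w, u v v w, ...
all_accepted = all(accepts(DELTA, START, FINALS, s) for s in pumped)
```

Because the first repeated state must occur within the first K symbols, the split always satisfies |uv| ≤ K, and repeating or dropping v keeps the automaton on an accepting path.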


The pumping lemma

➔ Given an arbitrary regular language A
➔ there is some constant K
  • viz. the number of states of a FSA that accepts A
➔ such that ∀z ∈ A with |z| ≥ K
➔ we can write z = uvw with v ≠ ε and |uv| ≤ K
  • “v is nonempty substring among first K symbols”
  • v corresponds to first cycle of path through FSA
➔ Then: u v^n w ∈ A for n = 0, 1, 2, …
  • i.e. the cycle can be repeated any number of times

Application of the pumping lemma

➔ Proof by contradiction
  • assume that A = {a^n b^n} is regular → recognised by some FSA with K states
  • every string a^n b^n with |a^n b^n| ≥ K can be “pumped”
  • select a suitable string, e.g. z = a^K b^K
  • segmentation: z = uvw with v ≠ ε and |uv| ≤ K
    - we don't know the precise substring v that can be pumped!
  • but v has to lie completely within a^K → v = a^m with m ≥ 1
  • pumping lemma: u v^2 w = a^(K+m) b^K ∈ A → contradiction, since K + m ≠ K

What about natural language?

  • the cheese that the mouse stole was expensive
  • the cheese that the mouse that the cat caught stole was expensive
  • the cheese that the mouse that the cat that the dog chased caught stole was expensive
  • the cheese that the mouse that the cat that the dog that the girl saw chased caught stole was expensive
  * the cheese that stole that caught that chased the mouse the cat the dog was expensive
  * the cheese that the mouse caught stole was expensive
☞ NL contains bracketing of type a^n b^n

Context-free grammars (CFG)

➔ What is an appropriate grammar formalism for bracketing structures?
➔ Context-free grammar G = (Σ, V, S, P)
  • Σ = alphabet (terminal symbols)
  • V = variables (nonterminals), S ∈ V = start symbol
  • P = set of productions A → β ∈ P (“rewrite rules”)
  • more about CFG in Session 5
➔ A context-free grammar for A = {a^n b^n}
  • V = {S}, P = {S → a S b | a b}
  • derivation S ⇒ a S b ⇒ a a S b b ⇒ a a a b b b = a^3 b^3
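The derivation with the grammar S → a S b | a b can be reproduced mechanically. The following is a minimal sketch; the function names `derive` and `in_A` are made up for this illustration.

```python
def derive(n):
    """Return the sentential forms of a derivation of a^n b^n (n >= 1) from S."""
    if n < 1:
        raise ValueError("the grammar derives a^n b^n only for n >= 1")
    form = ["S"]
    steps = [" ".join(form)]
    for _ in range(n - 1):
        i = form.index("S")
        form[i:i + 1] = ["a", "S", "b"]  # apply S -> a S b
        steps.append(" ".join(form))
    i = form.index("S")
    form[i:i + 1] = ["a", "b"]           # apply S -> a b
    steps.append(" ".join(form))
    return steps


def in_A(s):
    """Direct membership test for A = {a^n b^n | n >= 1}."""
    n, odd = divmod(len(s), 2)
    return odd == 0 and n >= 1 and s == "a" * n + "b" * n


steps = derive(3)
# steps == ["S", "a S b", "a a S b b", "a a a b b b"]
```

Every string produced by `derive(n)` passes `in_A` once the spaces are removed, which matches the claim that this grammar generates exactly the bracketing pattern a^n b^n.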

Can CFG do everything?

➔ Are CFG the only formalism we need?
  • except when regular expressions are already sufficient
➔ Are there languages that cannot be described?
  • it's much harder to find such examples for CFG
➔ Three simple structures that are not CF
  • multiple repetition: a^n b^n c^n
  • crossing brackets: a^n b^k a^n b^k
  • COPY language: {ww | w ∈ Σ*}
  • but a^n b^k a^k b^n and {w w^R | w ∈ Σ*} are context-free

The pumping lemma for context-free languages

➔ Given an arbitrary context-free language A
➔ there is some constant K
➔ such that ∀z ∈ A with |z| ≥ K
➔ we can write z = uvwxy with v ≠ ε ∨ x ≠ ε and |vwx| ≤ K
  • two substrings within a region of at most K symbols
  • corresponds to recursion cycle in (binary) parse tree
➔ Then: u v^n w x^n y ∈ A for n = 0, 1, 2, …
  • both substrings are pumped “in parallel”

Is NL context-free?

➔ Long-standing debate in linguistics
  • see “Footloose and context-free”, Chapter 16 of Pullum, Geoffrey K. (1991). The Great Eskimo Vocabulary Hoax. The University of Chicago Press.
➔ Evidence found for a number of languages:
  • Bambara (reduplication → COPY language)
  • Swiss German dialects (cross-serial dependencies)
  • Dutch (multiple repetition)
➔ No mathematical proof for English or German
  • but that does not mean that CFG are a linguistically appropriate formalism for natural language!

Bambara

➔ Bambara has structures of the form ww
  • wulu ‘dog’
  • wulu-filela ‘dog watcher’
  • wulu-filela-nyinila ‘dog watcher hunter’
  • wulu-o-wulu ‘whatever dog’
  • wulu-filela-o-wulu-filela ‘whatever dog watcher’
  • wulu-filela-nyinila-o-wulu-filela-nyinila ‘whatever dog watcher hunter’
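The ww pattern of the Bambara examples can be checked with a short Python sketch. The function names are hypothetical. The second check relies on a backreference (`\1`) in Python's `re` module, a feature that goes beyond regular expressions in the formal sense used in this session.

```python
import re


def is_copy(s):
    """True if s has the form ww for some string w (the COPY language)."""
    half, odd = divmod(len(s), 2)
    return odd == 0 and s[:half] == s[half:]


# Some noun w, the linker "-o-", then exactly the same w again.
WHATEVER = re.compile(r"(.+)-o-\1")


def is_whatever_form(word):
    """True if word looks like w-o-w, as in wulu-o-wulu."""
    return WHATEVER.fullmatch(word) is not None
```

For example, `is_whatever_form("wulu-filela-o-wulu-filela")` holds, while `is_whatever_form("wulu-filela-o-wulu")` does not, because the two halves differ.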


Swiss German

➔ Standard German: nested a^n b^k a^k b^n
  • Jan sagt, dass wir die Kinder1 dem Mann2 das Haus3 anstreichen3 helfen2 lassen1 wollen haben.
    ‘Jan says that we wanted to let the children help Hans (the man) paint the house.’
➔ Swiss German has cross-serial a^n b^k a^n b^k
  • Jan säit, das mer d'Chind1 em Hans2 es Huus3 händ wele laa1 hälfe2 aastriiche3.
  • different verbs select for dative/accusative objects
from: Shieber, Stuart M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8, 333–343.

Dutch

➔ Multiple repetition a^n b^n c^n in Dutch
  • Of Jan Piet hoorde en zag?
    ‘Did Jan hear Piet and see [him]?’
  • Of Jan Piet Marie hoorde ontmoeten en zag omhelzen?
    ‘Did J. hear P. meet M. and see [him] embrace [her]?’
  • Of Jan Piet Wim Marie hoorde helpen ontmoeten en zag horen omhelzen?
    ‘Did Jan hear Piet help Wim meet Marie and see [P.] hear [W.] embrace [M.]?’
from: Manaster-Ramer, Alexis (1987). Dutch as a formal language. Linguistics and Philosophy, 10(2), 221–246.
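The multiple-repetition pattern a^n b^n c^n can be tested against the pumping lemma for context-free languages by brute force. The sketch below is an illustration, not a proof: K = 4 is an arbitrary value chosen here (the real constant depends on the grammar), and all function names are hypothetical. It enumerates every split z = uvwxy with |vwx| ≤ K and vx nonempty, and keeps those that survive pumping with n = 0 and n = 2.

```python
def in_abc(s):
    """Membership in {a^n b^n c^n | n >= 1}."""
    n, rest = divmod(len(s), 3)
    return rest == 0 and n >= 1 and s == "a" * n + "b" * n + "c" * n


def in_ab(s):
    """Membership in {a^n b^n | n >= 1}, a context-free language."""
    n, rest = divmod(len(s), 2)
    return rest == 0 and n >= 1 and s == "a" * n + "b" * n


def pumpable_splits(z, K, member):
    """Yield splits (u, v, w, x, y) of z with |vwx| <= K and vx nonempty
    for which u v^n w x^n y is still a member for n = 0 and n = 2."""
    end = len(z)
    for i in range(end + 1):                      # v starts at position i
        limit = min(i + K, end)                   # vwx must end by here
        for j in range(i, limit + 1):             # v = z[i:j]
            for k in range(j, limit + 1):         # w = z[j:k]
                for m in range(k, limit + 1):     # x = z[k:m]
                    u, v, w, x, y = z[:i], z[i:j], z[j:k], z[k:m], z[m:]
                    if not v and not x:
                        continue
                    if all(member(u + v * n + w + x * n + y) for n in (0, 2)):
                        yield u, v, w, x, y


K = 4  # assumed value, for illustration only
splits_abc = list(pumpable_splits("a" * K + "b" * K + "c" * K, K, in_abc))
splits_ab = list(pumpable_splits("a" * K + "b" * K, K, in_ab))
```

For a^K b^K c^K no split survives, because a window of at most K symbols cannot touch both the a-block and the c-block. For a^K b^K, splits such as v = a, x = b around the middle do survive, as expected for a context-free language.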

What else is there?

Type 0: algorithmic languages (Turing machines, rewrite grammars)

Type 1: context-sensitive languages (linear bounded automata, context-sensitive grammars)

Type 2: context-free languages (context-free grammars, pushdown automata)

Type 3: regular languages (regular expressions, finite-state automata)

Perl-compatible regular expressions (PCRE)

Grammars and recognisers in the Chomsky hierarchy

type 0: algorithmic languages
  • grammar: rewrite grammars, β → γ
  • recogniser: Turing machines (TM)
type 1: context-sensitive languages
  • grammar: context-sensitive grammars, γAδ → γβδ or β → γ with |β| ≤ |γ|
  • recogniser: linear bounded TM
type 2: context-free languages
  • grammar: context-free grammars, A → β with β ∈ (V ∪ Σ)*
  • recogniser: push-down automata
type 3: regular languages
  • grammar: regular expressions; linear grammars, A → aB
  • recogniser: finite-state automata

Issues in formal language theory

➔ Establish hierarchies of language classes
  • each class corresponds to a grammar formalism
➔ Algorithmic properties
  • grammar (description) vs. recogniser (algorithm)
  • decide whether string belongs to language
  • are two grammars equivalent? etc.
➔ Closure properties (of language classes)
  • closed under union, intersection, Kleene star, … ?
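The rule shapes listed for types 0 to 3 can be turned into a small classifier. This is a sketch under assumptions: productions are represented as pairs of symbol tuples (a representation chosen here), type 3 also admits terminating rules A → a (not stated on the slide, but needed for derivations to end), and the usual special treatment of ε is ignored.

```python
def is_type3(lhs, rhs, V):
    """Right-linear shape: A -> a B, or A -> a (the latter is an assumption)."""
    if not (len(lhs) == 1 and lhs[0] in V):
        return False
    if len(rhs) == 1:
        return rhs[0] not in V
    return len(rhs) == 2 and rhs[0] not in V and rhs[1] in V


def is_type2(lhs, rhs, V):
    """Context-free shape: a single nonterminal on the left-hand side."""
    return len(lhs) == 1 and lhs[0] in V


def is_type1(lhs, rhs, V):
    """Non-contracting shape: the right-hand side is at least as long as the left."""
    return len(lhs) <= len(rhs)


def grammar_type(productions, V):
    """Return the highest type number whose rule shape every production has."""
    for number, shape in ((3, is_type3), (2, is_type2), (1, is_type1)):
        if all(shape(lhs, rhs, V) for lhs, rhs in productions):
            return number
    return 0


# The grammar for a^n b^n: S -> a S b | a b
anbn = [(("S",), ("a", "S", "b")), (("S",), ("a", "b"))]
anbn_type = grammar_type(anbn, {"S"})
```

The classifier reports the type of a grammar, not of the language it generates: a language may have a simpler grammar than the one given.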

Algorithmic & closure properties

class         w ∈ A?   A = ∅?   A = Σ*?   A ⊆ B?   A = B?
algorithmic     -        -        -         -        -
CS              √        -        -         -        -
CF              √        √        -         -        -
regular         √        √        √         √        √

class         A ∪ B   A ◦ B   A*   A ∩ B   A ∩ R   A^C   f(A)
algorithmic     √       √     √      √       √      -     √
CS              √       √     √      √       √      √     √
CF              √       √     √      -       √      -     √
regular         √       √     √      √       √      √     √

Intersection of FSA

[Figure: deterministic FSAs for (aba|ba)*b and a(bab|ab)*, and the deterministic FSA for their intersection with paired states 0/0, 1/1, 2/3, 0/4, 3/1, 0/2]
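The paired state labels in the figure suggest the standard product construction, sketched below. The two transition tables were worked out by hand from the regular expressions, with state numbers chosen to be compatible with the figure labels; they are assumptions of this example rather than a transcription of the slide. The result is cross-checked against Python's `re` module.

```python
import re
from collections import deque
from itertools import product as cartesian


def intersect(dfa1, dfa2):
    """Product automaton restricted to reachable state pairs.

    Each DFA is (delta, start, finals); delta maps (state, symbol) -> state,
    and a missing transition means the input is rejected.
    """
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    symbols = {a for (_, a) in d1} & {a for (_, a) in d2}
    start = (s1, s2)
    delta, finals, seen = {}, set(), {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        if p in f1 and q in f2:          # final only if both components are
            finals.add((p, q))
        for a in symbols:
            if (p, a) in d1 and (q, a) in d2:
                nxt = (d1[(p, a)], d2[(q, a)])
                delta[((p, q), a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return (delta, start, finals), seen


def accepts(dfa, z):
    delta, state, finals = dfa
    for a in z:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in finals


R1, R2 = r"(aba|ba)*b", r"a(bab|ab)*"
DFA1 = ({(0, "a"): 1, (0, "b"): 3, (1, "b"): 2, (2, "a"): 0, (3, "a"): 0}, 0, {3})
DFA2 = ({(0, "a"): 1, (1, "a"): 2, (1, "b"): 3, (2, "b"): 1,
         (3, "a"): 4, (4, "b"): 1}, 0, {1})

both, reachable = intersect(DFA1, DFA2)

# Compare with the regexps on all strings over {a, b} up to length 7.
words = ["".join(t) for n in range(8) for t in cartesian("ab", repeat=n)]
agrees = all(
    accepts(both, w) == bool(re.fullmatch(R1, w) and re.fullmatch(R2, w))
    for w in words
)
```

Only pairs reachable from the start pair are built, so the product stays far smaller than the full set of state combinations (here 6 pairs instead of 20).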


Intersection of FSA

➔ Intersection algorithm for FSA → regular languages are closed under intersection
➔ There is no “intersection” operator for regexp!
  • but we now know it could be added easily
➔ Application: fast matching of word lists
  • match regular expression against large word list
  • word list encoded as minimised FSA
  • regular expression translated to deterministic FSA
  • compute intersection of word list and regexp FSAs

Complement of FSA

a(a|bc)* over alphabet Σ = {a,b,c}

[Figure: deterministic FSA with states 0, 1, 2 and transitions labelled a, b, c]
[Figure: “complete” FSA with error state E]
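Completing the FSA for a(a|bc)* with an error state and then swapping final and non-final states can be sketched as follows. The transition table is read off the regular expression (an assumption, since the figure itself is not recoverable here), and the helper names are hypothetical. The complement is cross-checked against Python's `re` module.

```python
import re
from itertools import product as cartesian

ALPHABET = "abc"
# Deterministic FSA for a(a|bc)*: state 1 is final, missing transitions reject.
DELTA = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "c"): 1}
STATES, START, FINALS = {0, 1, 2}, 0, {1}


def complete(delta, states, alphabet, error="E"):
    """Add an error state so that every (state, symbol) pair has a transition."""
    full = dict(delta)
    for q in list(states) + [error]:
        for a in alphabet:
            full.setdefault((q, a), error)
    return full, set(states) | {error}


def complement(delta, states, finals, alphabet):
    """Complete the FSA first, then swap final and non-final states."""
    full, all_states = complete(delta, states, alphabet)
    return full, all_states - set(finals)


def accepts(delta, start, finals, z):
    state = start
    for a in z:
        state = delta[(state, a)]
    return state in finals


C_DELTA, C_FINALS = complement(DELTA, STATES, FINALS, ALPHABET)

# The complement FSA should accept exactly the strings the regexp rejects.
words = ["".join(t) for n in range(6) for t in cartesian(ALPHABET, repeat=n)]
agrees = all(
    accepts(C_DELTA, START, C_FINALS, w) == (re.fullmatch(r"a(a|bc)*", w) is None)
    for w in words
)
```

The order of the two steps matters: swapping final states on the incomplete FSA would still reject every string that falls off a missing transition, so the result would not be the complement.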

Complement of FSA

a(a|bc)* over alphabet Σ = {a,b,c}

[Figure: deterministic FSA with states 0, 1, 2; complement FSA with error state E (swaps final and non-final states)]

Homework

➔ Read full handout!
➔ Reading assignment
  • Jurafsky & Martin, Chapters 2 & 15 (Chapters 2 & 13 in first edition)
➔ Homework assignment 3
  • various applications of regular expressions (tokenisation, corpus search)
  • a little larger because you've got two weeks to do it