Syntactic Structure

Dr. Mark Lee [email protected]

Colourless green ideas sleep furiously

Introduction

The previous lecture gave an overview of language. You will recall that natural languages all share certain aspects, the most important being that they are driven by intentional behaviour, rely on convention and are used to categorise the world. As the last lecture showed, language understanding is a difficult task which we take for granted because we are natural language users.

A first step in language understanding is structural (or syntactic) analysis. This is the topic for this week.

A simple programming problem

Suppose we have a large text file which contains payroll information. For example, here’s an extract:

Dr. Mark Lee    Age 29    0121 414 4765    Room 110
Mr. Mike Leigh  Age 43    0121 428 2394
Ms. Sue Smith   Age 21
....

We’d like to extract the names, ages, phone numbers & locations and put them into a database. How could we do this?

Regular expressions

Regular expressions are a feature of many programming languages such as Java and Perl, are part of most operating systems (most notably Unix/Linux) and are used to match text strings. The following is based on Perl syntax.

\w  any word (alphanumeric) character
\d  any single numeric character
\s  any single whitespace (blank) character
+   one or more of the preceding symbol
*   zero or more of the preceding symbol
|   “or” (i.e. \w|\d means either a letter or a digit)

N.b. you should know what regular expressions are, but you are not expected to learn regular expression syntax for this module. However, you will encounter regular expressions throughout your degree (and especially in the web technology module next term).

Using regular expressions we could match the name, age etc. and easily add this information to our database. There are two concerns.

Under-generation: does the expression miss any valid example?
Over-generation: does the expression match any non-valid example?
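To make this concrete, here is a minimal sketch in Python, whose re module uses Perl-style regular expression syntax. The field layout (an optional phone number and an optional room) is an assumption read off the extract above, so treat the pattern as illustrative rather than definitive.

import re

# Hypothetical extract in the format shown above.
text = """Dr. Mark Lee Age 29 0121 414 4765 Room 110
Mr. Mike Leigh Age 43 0121 428 2394
Ms. Sue Smith Age 21"""

# Title and name, age, then an optional phone number and optional room.
pattern = re.compile(
    r"(\w+\.\s[\w ]+?)\s+"            # title + name, e.g. "Dr. Mark Lee"
    r"Age\s(\d+)"                     # age, e.g. "Age 29"
    r"(?:\s(\d{4}\s\d{3}\s\d{4}))?"   # optional phone, e.g. "0121 414 4765"
    r"(?:\sRoom\s(\d+))?"             # optional room, e.g. "Room 110"
)

for name, age, phone, room in pattern.findall(text):
    print(name, age, phone or "-", room or "-")

Asking whether this pattern under-generates (would it miss a record whose phone number is formatted differently?) or over-generates is exactly the concern raised above.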

More structured language

All languages have a syntax. That is, there are general rules governing how words fit together to form valid sentences. A language’s syntax can be described using a grammar.

We can formally describe a grammar as follows.

G = (Vt, Vn, P, S) where

Vt is a finite set of terminals
Vn is a finite set of non-terminals
P is a set of production rules
S is an element of Vn which is distinguished as the starting non-terminal.

Elements of P are all of the form V -> (Vt U Vn)*

We can add various constraints to the production rules in P (of which more later). Below is a simple grammar of English:

S  -> NP VP          Det -> [the]
NP -> Det N          N   -> [telescope]
NP -> NP PP          N   -> [boy]
VP -> V NP           N   -> [girl]
VP -> V NP PP        N   -> [dress]
PP -> P NP           V   -> [saw]
                     V   -> [gave]
                     V   -> [played]
                     P   -> [with]
                     Adj -> [yellow]
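(An aside for programmers: such a grammar is straightforward to encode as a data structure. Below is a minimal sketch in Python; the representation is our own choice rather than part of the formal definition, and the same encoding is reused in the parser sketches later in the handout.)

# The toy English grammar above, encoded as a mapping from each
# left-hand-side symbol to its alternative right-hand sides.
# Lower-case strings are terminals; capitalised ones are non-terminals.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["NP", "PP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["telescope"], ["boy"], ["girl"], ["dress"]],
    "V":   [["saw"], ["gave"], ["played"]],
    "P":   [["with"]],
    "Adj": [["yellow"]],
}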

Questions

Using this grammar, can you prove that “The boy saw the girl with the telescope” is a valid sentence? Using a grammar to analyse a sentence is called parsing. What other sentences are allowed? Are there any non-grammatical sentences allowed by this grammar?
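As a worked example, here is one derivation of the sentence from the rules above (lexical rules are applied together for brevity):

S -> NP VP
  -> Det N VP
  -> the boy VP
  -> the boy V NP PP
  -> the boy saw NP PP
  -> the boy saw Det N PP
  -> the boy saw the girl PP
  -> the boy saw the girl P NP
  -> the boy saw the girl with NP
  -> the boy saw the girl with Det N
  -> the boy saw the girl with the telescope

Note that there is a second derivation, via VP -> V NP and NP -> NP PP, which attaches “with the telescope” to “the girl”: the sentence is structurally ambiguous.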

Notice that the grammar potentially has the same issues of over-generation and under-generation as the regular expressions we designed earlier.

Expressive power

The grammar above is technically a context free grammar. This is because, according to the grammar, regardless of context it is always possible to replace the left hand side with the right hand side. Other grammars are possible if we are more or less strict about the restrictions on the production rules. For example, suppose we insist that the only legal production rules are of the following form:

P -> NT TT   or   P -> TT

That is, every production rule must have exactly one terminal symbol and at most one non-terminal symbol. This type of grammar is a regular grammar. In fact a regular grammar has exactly the same expressive power as a regular expression (i.e. if you can construct a regular expression to capture the structure of a given sentence then you can also construct a regular grammar to do the same).
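For example, the regular expression \d+ (one or more digits) corresponds to the regular grammar

S -> S d   or   S -> d

where d stands for any single digit: it generates d, dd, ddd and so on, and both rules fit the restricted form above.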

Levels of grammar

In addition to context free and regular grammars there are many more types of grammar. The most famous of these form the Chomsky hierarchy.

Table 1: Chomsky Hierarchy

Type     Grammar             Restriction on rules
Type 0   Unrestricted        unrestricted
Type 1   Context Sensitive   aXb -> aYb where X & Y are strings of NT & TT
Type 2   Context Free        NT -> any string of NT & TT
Type 3   Regular             NT -> NT TT or NT -> TT

The hierarchy was originally designed to describe natural language grammars but is now used far more in theoretical computer science and in particular principles of programming languages and compiler design.

It’s an interesting question how powerful (i.e. expressive) a grammar we need for natural language. It’s easy to show that a regular grammar is not enough, but there is some controversy about how much of natural language can be captured using a context free grammar. It has been shown that (at least) some aspects of verb-noun agreement in Swiss German require a more powerful grammar, as does the artificial language a^n b^n c^n (though this can be handled by a context free grammar augmented with a stack).

Parsing

Parsing is the process of analysing a sentence using a grammar. We’ve already done parsing informally, but more formally there are two main methods: “top down” and “bottom up”.

Top-down Parsing

1. Start with the S symbol.
2. Consume a constituent of the input if it matches the current symbol.
3. If not, rewrite the current symbol using a grammar rule.
4. Repeat from 1.

Pros:
• Won’t look for things where they can’t happen.

Cons:
• Can build (and rebuild) lots of useless structure.
• Falls down on looping left-recursive rules (X -> X y).
• Predictive ability (“garden path sentences”).
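To make this concrete, here is a minimal top-down (recursive descent) recogniser, a sketch in Python using the toy grammar from earlier. Note that the left-recursive rule NP -> NP PP has been dropped, for exactly the reason listed above: a naive top-down parser loops forever on it.

# A minimal top-down (recursive descent) recogniser for the toy grammar.
# The left-recursive rule NP -> NP PP is deliberately omitted, since
# naive top-down parsing loops forever on left recursion.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP", "PP"], ["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["telescope"], ["boy"], ["girl"], ["dress"]],
    "V":   [["saw"], ["gave"], ["played"]],
    "P":   [["with"]],
}

def parse(symbol, words, pos):
    """Try to derive words[pos:...] from symbol; yield each possible end position."""
    if symbol not in GRAMMAR:                # terminal: must match the next word
        if pos < len(words) and words[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:              # non-terminal: try each alternative
        ends = [pos]
        for part in rhs:                     # derive each rhs symbol in sequence
            ends = [e2 for e in ends for e2 in parse(part, words, e)]
        yield from ends

def accepts(sentence):
    words = sentence.lower().split()
    return len(words) in parse("S", words, 0)

print(accepts("the boy saw the girl with the telescope"))  # True
print(accepts("the boy saw with"))                         # False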

Bottom-up Parsing

1. Start with the input sentence.
2. Consume constituent(s) of the input which match the rhs of a rule.
3. If this completes the rule then repeat with the new constituent.
4. Otherwise match the next constituent and return to 2.
5. Otherwise invoke a new rule with the next constituent then try 3.

Pros:
• Avoids futile structure building (but must look at all categories for every word).

Cons:
• Can wander off endlessly given empty productions (X -> []).

Those of you doing AI will notice that parsing is essentially a form of “search”: the grammar formalises a “search space” of possible sentences and the parsing algorithm searches that space for the particular sentence.
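For contrast, here is an equally minimal bottom-up sketch in Python: an exhaustive shift-reduce style recogniser which repeatedly replaces any substring matching the right-hand side of a rule with its left-hand side until only S remains. Again this is an illustration, not an efficient parser.

# A naive bottom-up recogniser: keep reducing right-hand sides to
# left-hand sides; the sentence is grammatical if we can reach ["S"].
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["NP", "PP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["telescope"], ["boy"], ["girl"], ["dress"]],
    "V":   [["saw"], ["gave"], ["played"]],
    "P":   [["with"]],
}

def reductions(form):
    """Yield every sentential form reachable by one reduction."""
    for lhs, alternatives in GRAMMAR.items():
        for rhs in alternatives:
            n = len(rhs)
            for i in range(len(form) - n + 1):
                if form[i:i + n] == rhs:
                    yield form[:i] + [lhs] + form[i + n:]

def accepts(sentence):
    agenda = [sentence.lower().split()]
    seen = set()
    while agenda:                            # exhaustive search over reductions
        form = agenda.pop()
        if form == ["S"]:
            return True
        if tuple(form) in seen:
            continue
        seen.add(tuple(form))
        agenda.extend(reductions(form))
    return False

print(accepts("the boy saw the girl with the telescope"))  # True

Note how literally this is the “search” view above: the sentential forms are states, reductions are moves, and the recogniser searches for the state ["S"].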

What makes a good grammar a good grammar?

Any finite set of sentences can be described by an infinite number of grammars. Therefore there have to be some criteria for choosing a good grammar.

• Empirical criteria
  Does it under-generate?
  Does it over-generate?
  Does it assign appropriate structures to the strings it generates?

• Heuristics
  Simplicity (number of rules / number of non-terminals)
  Generality (does an English grammar look a bit like a French/Swedish one?)

• Mathematical criteria
  Expressive ability to capture all legal sentences
  Tractability of parsing (that is, could a computer program assign structure in reasonable time?)

It’s worth commenting on how ambiguous natural language is. The Verbmobil corpus (a collection of English, German and Japanese text) reveals that the average sentence is 14 words long and has 3245 possible parses using a decent grammar. This is why context free grammars are commonly used: they are computationally tractable (in fact they can be parsed in polynomial time, but that is an advanced topic; a taste is sketched below). Grammars can of course also be used to model non-natural languages such as programming languages. Most programming languages are based on context free grammars.
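For the curious, here is a sketch of the classic CYK recognition algorithm, which parses any context free grammar in O(n^3) time. CYK requires the grammar in Chomsky normal form, so the toy grammar has been hand-converted (the ternary rule VP -> V NP PP is split into VP -> VNP PP and VNP -> V NP); treat the conversion as illustrative.

from itertools import product

# Binary rules of the toy grammar in Chomsky normal form:
# (B, C) -> set of symbols A such that A -> B C.
BINARY = {
    ("NP", "VP"):  {"S"},
    ("Det", "N"):  {"NP"},
    ("NP", "PP"):  {"NP"},
    ("V", "NP"):   {"VP", "VNP"},
    ("VNP", "PP"): {"VP"},
    ("P", "NP"):   {"PP"},
}
LEXICON = {
    "the": {"Det"}, "boy": {"N"}, "girl": {"N"},
    "telescope": {"N"}, "dress": {"N"},
    "saw": {"V"}, "gave": {"V"}, "played": {"V"}, "with": {"P"},
}

def cyk(sentence):
    words = sentence.lower().split()
    n = len(words)
    # chart[i][j] = set of non-terminals deriving words[i..j] inclusive
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):             # three nested loops: O(n^3)
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):            # split point
                for b, c in product(chart[i][k], chart[k + 1][j]):
                    chart[i][j] |= BINARY.get((b, c), set())
    return "S" in chart[0][n - 1]

print(cyk("the boy saw the girl with the telescope"))  # True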

Summary

Today we’ve looked at syntax and how grammars can be used to capture syntax. We’ve looked at different types of grammar, most notably context free grammars, and discussed various algorithms for analysing sentences using grammars. This process is called parsing.

A puzzle to ponder for Thursday

There is a village in Germany where there is a male barber who shaves all and only those men who don’t shave themselves.

Does he shave himself?