Indexed Languages and Unication



Grammars

y

Tore Burheim

Abstract

Indexed languages are interesting in computational linguistics b ecause they

are the least class of languages in the that has not b een

shown not to b e adequate to describ e the string set of natural language sen

tences We here dene a class of unication grammars that exactly describ e

the class of indexed languages

Introduction

The o ccurrence of purely syntactical crossserial dep endencies in SwissGerman

shows that contextfree grammars can not describ e the string sets of natural lan

guage Shi The least class in the Chomsky hierarchy that can describ e unlimited

crossserial dep endencies is indexed grammars Aho Gazdar discuss in Gaz

the applicability of indexed grammars to natural languages and show how they can

b e used to describ e dierent syntactic structures We are here going to study how

we can describ e the class of indexed languages with a unication grammar formal

ism After dening indexed grammars and a simple unication grammar framework

we show how we can dene an equivalent unication grammar for any given indexed

grammar Two grammars are equivalent if they generate the same language With

this background we dene a class of unication grammars and show that this class

describ es the class of indexed languages

Indexed grammars

Indexed grammars is a grammar formalism with generative capacity b etween con

textfree grammars and contextsensitive grammars Contextfree grammars can

not describ e crossserial dep endencies due to the pumping lemma while indexed

grammars can However the class of languages generated by indexed grammars

the indexed languages is a prop er subset of contextsensitive languages Aho

Indexed grammars can b e seen as a contextfree grammar where we add a string

or stack of indices to the nonterminal no des in the phrase structure trees or

derivation trees as we will call them Some pro duction rules add an index to the

b eginning of the string while the use of other pro duction rules is dep endent on

the rst index in the string When such a pro duction rule is applied the index

of which it is dep endent is removed and the rest of the indexstring is kept by

the daughters In this way we may distribute information from one part of the

derivation tree to another The original denition of indexed grammars was given

This work has b een supp orted by grant from Norwegian Research Council

y

University of Bergen Department of Informatics N Bergen Norway and University of

the Saarland Computational Linguistic Postfach D Saarbr ucken Germany Email

ToreBurheimiiuibno

by Aho Aho We are here using the denition used by Hop croft and Ullman

HU with some minor notational variations

Denition An G is a tuple G hN T I P S i where

N is a nite set of symbols called nonterminals

T is a nite set of symbols called terminals

I is a nite set of symbols called indices

P is a nite set of ordered pairs each on one of the forms hA B f i hAf i or

hA i where A and B are nonterminal symbols in N is a nite string in

N T and f is an index in I An element in P is called a pro duction rule

and is written A B f Af or A

S is a symbol in N and is called the start symbol

and such that N T and I are pairwise disjoint

An indexed grammar G hN T I P S i is on reduced form if each production

in P is on one of the forms

a A B f

b Af B

c A BC

d A t

where A B C are in N f is in I and t is in T fg

Aho showed in his original pap er Aho that for every indexed grammar there

exists an indexed grammar on reduced form which generates the same language

To dene constituent structures and derivation trees we are going to use tree

domains Let N b e the set of all integers greater than zero A tree domain D is

+

a set D N of number strings so that if x D then all prexes of x are also in

+

D and for all i N and x N if xi D then xj D for all j j i

+

+

The out degree dx of an element x in a tree domain D is the cardinality of the set

fi j xi D i N g The set of terminals of D is ter mD fx j x D dx g

+

The elements of a tree domain are totally ordered lexicographically as follows x y

if x is a prex of y or there exist strings z z z N and i j N with i j

+

+

1

such that x z iz and y z j z We also dene that x y if x y and x y

A tree domain D can b e viewed as a tree graph in the following way The

elements of D are the no des in the tree is the ro ot and for every x D the

element xi D is xs child number i A tree domain may b e innite but we shall

restrict attention to nite tree domains A nite tree domain can also describ e the

top ology of a derivation tree This representation provides a name for every no de

in the derivation tree directly from the denition of a tree domain Our denition

of derivation trees for indexed grammars with the use of tree domains is based on

Hayashi Hay

Denition A derivation tree based on an indexed grammar G hN T I P S i

is a pair hD C i of a nite tree domain D and a function C D NI T fg

I I

where

i C S

I

1

See Gallier Gal for more ab out tree domains

ii C x NI for every node x in D with dx Moreover if C x A

I I

for A N and I and C xi B with B N T fg and I

I i i i i

for every i i dx then either

a A B f is a production rule in P such that dx f I and

1

f or

1

b Af B B is a production rule in P such that f I where

1

d(x)

f and if B N and if B T fg or

i i i i

c A B B is a production rule in P such that if B N

1 i i

d(x)

and if B T fg

i i

iii C x T fg for every node in D with dx

I

sy m

The symbol function C D N T and the index string func

I

idx

tion C D I are total functions on D such that if C x A where

I

I

sy m

idx

A N T fg and I then C x for al l x D x A and C

I

I

The terminal string of a derivation tree hD C i is the string C x C x

I I 1 I n

where fx x g ter mD and x x for al l i i n

1 n i i+1

We also dene the license function license D ter mD P such that

if A is a production rule according to a b or c in ii for a node x in D

then licensex A

Informally this is a traditional phrase structure tree If we have a no de with

lab el A where A is a nonterminal symbol and is a string of indices and we use

a pro duction rule A B f then the no des only child gets the lab el B f If we

instead use a pro duction rule A BC on the same no de it gets two children lab eled

B and C resp ectively or if we use a pro duction rule A t where t is a terminal

symbol then we remove all the indices and the no des only child gets the lab el t

If we have a no de lab eled with Af where f is a index and we use a pro duction

rule Af B then the no des only child gets the lab el B We also see that the

terminal string is a string in T since C x T fg for all x ter mD

U

Denition A string w is grammatical with respect to an indexed grammar G

if and only if there exists a derivation tree based on G with w as the terminal string

The language generated by G LG is the set of al l grammatical strings with respect

to G

Example Let G hN T I P S i b e an indexed grammar where T fa b cg is

the set of terminal symbols N fS S A B C g is the set of nonterminal symbols

I ff g g is the set of indices and P is the least set containing the following

pro duction rules

S S f Ag aA Af a

S S g B g bB B f b

S AB C C g cC C f c

Figure shows the derivation tree for the string aabbcc based on this grammar

n n n

The language LG generated by this grammar is fa b c j n g

We close this presentation of indexed grammars by showing a simple technical

observation that we will use in later pro ofs

Denition An indexed grammar G hN T I P S i has a marked indexend

if and only if it has one and only one production rule where the start symbol occurs

and this rule is on the form S A where A N and the index does not occur

in any other production rule S

S'f

S'gf

Agf Bgf Cgf

a Afb Bf c Cf

a b c

Figure Derivation tree for the string aabbcc based on the grammar in Example

If an indexed grammar has a marked indexend then in any derivation tree every

nonterminal no de except the ro ot gets a at the end of the index list Since no

rule requires that there is an empty index list and neither nor the start symbol

o ccurs in any other pro duction rule it is straight forward to construct an equivalent

grammar with a marked indexend for any indexed grammar

Lemma For every indexed grammar G there exists an indexed grammar with a

marked indexend G such that LG LG

$ $

Pro of Let G hN T I P S i b e an indexed grammar and assume that S and

0

do not o ccur in G G is dened from G by adding the pro duction rule S S

0

$

such that S b ecomes the new start symbol and is added to the set of nonterminal

0

symbols and is added to the set of indices Formally if G hN T I P S i and

S N T I then G hN fS g T I fg P fhS S ig S i Then G

0 0 0 0

$ $

has a marked indexend and we have to show that for any string w w LG if

and only if w LG

$

Let hD C i b e any derivation tree based on G and assume that w is its

I

terminal string From this we construct a derivation tree hD C i based on G

$

I

as follows First let D fx j x D g fg Then let C S and let

0

I

x C x for all x C x for all x D ter mD Let also C C

I I

I I

i has then the same terminal string as x ter mD The derivation tree hD C

I

hD C i Since no rule requires that there is an empty index list and do es not

I

o ccur in any pro duction rule in G a pro duction rule that is licensing a no de x in

hD C i will license the no de x in hD C i The rule S S licenses the ro ot

I 0

I

Then hD C i is a valid derivation tree according to Denition

I

Let hD C i b e any derivation tree based on G and assume that w is its

$

I

terminal string Since S S must license the ro ot and do es not o ccur in any

0

other pro duction rule the index symbol o ccurs at the end of the index list at every

nonterminal no de except the ro ot in hD C i From this derivation tree we construct

I

a derivation tree hD C i based on G as follows First let D fx j x D g

I

Then for all x D ter mD let C x where C x Let also

I

I

C x C x for all x ter mD The derivation tree hD C i has then the

I I

I

same terminal string as hD C i Since every pro duction rule in G except S S

0

$

I

also is a pro duction rule in G the rule S S only can license the ro ot and

0

do es not o ccur in any other pro duction rule a pro duction rule that licenses a no de

x in hD C i will license the no de x in hD C i Then hD C i is a valid derivation

I I

I

tree according to Denition 2

Notice in the pro of that if G is on reduced form then G is also on reduced

$

form Then for any indexed grammar on reduced form there also exists an indexed

grammar on reduced form with a marked indexend

Unication grammars

We are here going to give a description of a very simple unication grammar for

malism The formalism itself is not particularly interesting and it is only meant as

a framework for the rest of this pap er The formalism is just a notational variant of

the basic formalism used by Colban in his work on restrictions on unication gram

mars Col It should b e easy to reformulate this in most of the known formalisms

available We give an informal description of feature structures in the way they are

used here b efore we dene the grammar formalism

A feature structure over a set of attribute symbols A and value symbols V is

a fourtuple hQ m i where Q is a nite set of no des Q A Q is a

D

partial function called the transition function Q V is a partial function

called the atomic value function and m D Q is a function called the name

D

mapping We will mostly omit the namedomain from the notation so m will alone

denote the name mapping We extend the transition function to b e a function from

pairs of no des and strings of attribute symbols For every q Q let q q

If q q and q a q then let q a q for every q q q Q

1 2 2 3 1 3 1 2 3

A and a A

A feature structure is describable if there for every no de is a path from a named

no de to the no de This means that for every q Q there is an x D and a A

such that mx q A feature structure is atomic if every no de with an atomic

value has no outedges This means that for every no de q Q q a is not dened

for any a A if q is dened A feature structure is acyclic if it do es not contain

attribute cycles This means that for every no de q Q q q if and only if

A feature structure is wel l dened if it is describable atomic and acyclic

When nothing else is said we require that feature structures are well dened in the

rest of this pap er

We are going to use equations to describ e feature structures in a way where

feature structure satises equations A feature structure satises the equation

x x

2 2 1 1

if and only if mx mx and the equation

1 1 2 2

v x

1 1

if and only if mx v where x x D A and v V

1 1 1 2 1 2

We only allow equations on those two forms This means that there is no typing

quantication implication negation or explicit disjunction as we may nd in other

unication grammars and feature logics

If E is a set of equations of the ab ove form and M is a well dened feature

structure such that M satises every equation in E then we say that M satises E

and we write

M j E

A set of equations E is consistent if there exists a well dened feature structure

that satises E

The notation of the grammar formalism is b orrowed from Lexical Functional

Grammar KB

Denition A simple unification grammar G over a set of attribute symbols

A and value symbols V is a tuple hN T P L S i where

N is a nite set of symbols called nonterminals

T is a nite set of symbols called terminals

P is a nite set of production rules

A A A

0 1 n

E E

1 n

where n A A N and for al l i i n E is a nite set with

0 n i

equations on the forms

j j

j v

+ 2

where A A and v V

L is a nite set of lexicon rules

A t

E

where A N t T fg and E is a nite set of equations on the form

j v

+

where A and v V

S is a symbol in N called start symbol

As an example is a pro duction rule

A B C C

a a a

1 3 4

v a a c v a a

2 1 3 1 2 3

Denition A constituent structure cstructure based on a simple uni

cation grammar G hN T P L S i is a triple hD C E i where

U U

D is a nite tree domain

C D N T fg is a function

U

E D fg is a function where is the set of al l equation sets in P

U

and L

such that C x T fg for al l x ter mD C S and for al l x

U U

D ter mD if dx n then

C x C x C xn

U U U

E x E xn

U U

is a production or lexicon rule in G

The terminal string of a constituent structure is the string C x C x

U 1 U n

where fx x g ter mD and x x for al l i i n

1 n i i+1

To get equations that can b e satised by a feature structure we must instantiate

the up and down arrows in the equations from the rule set We substitute them

with no des from the cstructure such that the no des b ecome the domain of the name

mapping For this purp ose we dene the function such that E xi E xix

U

U

xi We see that the value of the function E is a set of equations that feature

U

structures may satisfy

2

j denotes here a or a

Denition The cstructure hD K E i generates the feature structure M if and

only if

[

E x M j

U

xD

A cstructure may generate dierent feature structures The tree domain will

form a name set for feature structures that this union generates A string is gram

matical if this union is consistent

Denition A string w is grammatical with respect to a simple unication

grammar G if and only if there exists a cstructure based on G with w as the terminal

string and the cstructure generates a wel l dened feature structure The language

generated by G LG is the set of al l grammatical strings with respect to G

From Indexed Grammars to Unication Gram

mars

We are here going to dene a simple unication grammar that is equivalent to

a given indexed grammar The main idea is that we use feature structures to

represent the index string more or less like a nested stack The use of feature

structures to represent stacks for indexed grammars is also used by Gazdar and

Mellish GM although they do not go into much details Here we dene a function

that transforms any indexed grammar on reduced form with a marked indexend

to a simple unication grammar such that the new grammar generates the same

language

Denition Let G hN T I P S i be an indexed grammar on reduced form

$

with a marked indexend We then dene the simple unication grammar U G as

$

hN T P L S i where P and L are the least sets where

a For each rule on the form A B f in P P has a production rule on the

form

A B

next

idx f

b For each rule on the form Af B in P P has a production rule on the

form

A B

next

idx f

c For each rule on the form A BC in P P has a production rule on the

form

A B C

d For each rule on the form A a in P L has a lexicon rule on the form

A a

If p is a production rule in G then U p is the production or lexicon rule in

$

U G dened by a b c or d

$

Notice that there is a onetoone relation b etween the pro duction rules in G

$

and pro ductionlexiconrules in U G We will later dene a class of unication

$

grammars which can b e dened by pro duction and lexicon rules on the forms used

here But rst we will show that G and U G are equivalent

$ $

Lemma For every indexed grammar G on reduced form with a marked index

$

end LG LU G

$ $

Pro of We have to show that for any string w w LG if and only if w

$

LU G

$

For every w LG there exists a derivation tree hD C i for w based

I

$

on G We have to show that based on U G there exist cstructure with w as

$ $

the terminal string which generates a well dened feature structure We dene the

cstructure hD C E i on the same tree domain D

U U

For every nonterminal no de x in D we have a unique pro duction rule licensex

in the indexed grammar and for each pro duction rule in the indexed grammar

we have a unique corresp onding pro duction or lexicon rule U licensex in U G

$

according to Denition If

U licensex A A A

0 1 n

E E

1 n

then let C xi A and E xi E for all i n and let C x A

U i U i U 0

sy m

x for all x D it also Then we have a valid cstructure and since C x C

U

I

has w as terminal string Now we only have to show that all the equations in the

cstructure are satised by a well dened feature structure

For any nite string over an alphab et I we may dene a feature structure where

the no de set is the union of all suxes of and all symbols o ccurring in Here we

make a distinction b etween the singleton string of a symbol and the symbol itself

such that they are regarded as two distinct no des For all nonempty string no des

let the idx attribute p oint to the rst symbol of the string and let the next attribute

p oint to the rest of the string when we remove the rst symbol ie f idx f

and f next for every nonempty sux f of where f I Let also the

atomic value of each symbolno de b e the symbol itself ie f f Else let no

more attributes or atomic values b e dened and in particular let next idx

and b e undened We extend the denition directly to any nite set of strings

over an alphab et With any namemapping to the string no des dened from this

nite set this is a well dened feature structure since each nonempty string has a

unique rst symbol and a unique sux with length one less than the string itself

Let M b e the feature structure dened as describ ed on the set of all index strings

that o ccur in the derivation tree hD C i with the mapping of each nonterminal

I

idx

no de in the tree domain to the indexstring of that no de mx C x This is

I

a well dened feature structure We now have to show that all the equations in the

cstructure are satised by the feature structure M We have three dierent cases

to consider

Assume for a no de x that C x A where is an indexstring and that

I

licensex A B f Then C x B f mx and mx f From

I

x fx next x x idx f g which is satised U licensex we have that E

U

by the feature structure M since f next and f idx f

Assume for a no de x that C x Af where f is an nonempty index

I

string and that licensex Af B Then C x B mx f and

I

x fx next x x idx f g mx From U licensex we have that E

U

which is satised by the indexstring feature structure M since f next and

f idx f

Assume for a no de x that C x A where is an indexstring and that

I

licensex A BC Then C x B C x C and mx mx

I I

x x fx xg and E mx From U licensex we have that E

U U

fx xg which is satised by the indexstring feature structure M

We do not have to consider the no des which license pro duction rules with ter

minal symbols since all the terminal no des have empty equation sets Then all

the equations in the cstructure are satised by the feature structure M and then

w LU G

$

We will here use the function idxlst Q V dened on any well

dened acyclic feature structure as follows idxlstq q if q is dened

If q idx and q next are b oth dened then idxlstq is the concatenation of

idxlst q idx followed by idxlst q next Else idxlstq We restrict our

attention to its prex with as last symbol Let idxlst Q V b e the function

$

such that idxlst q is the smallest prex of idxlstq with as the last symbol

$

If idxlstq do es not contain any then idxlst q

$

For every w LU G there exists a cstructure hD C E i for w based on

U U

$

U G which generates a well dened feature structure We dene the derivation

$

sy m

x C x tree hD C i for w based on G on the same tree domain D Let C

U I

$

I

idx

for all no des in D and C x idxlst mx for all nonterminal no des in D

$

I

idx

except for the ro ot for which we dene C to b e the empty string This

I

derivation tree has w as terminal string and we just have to show that this is a

valid derivation tree according to Denition

Since G has a marked indexend the only pro duction rule where the start

$

symbol o ccurs is S A for an A N This gives the following corresp onding

pro duction rule in U G

$

S A

next

idx

which is the only pro duction rule in U G where the start symbol o ccurs Then

$

C S which is the start symbol of G Here we also have that idxlst m

I

$ $

and C A so that C A and S A licenses the ro ot no de For all

U I

the other nonterminal no des in the tree domain we have four cases to consider

Assume for a nonterminal no de x except for the ro ot no de that C x A and

U

idxlst mx Then C x A Assume also that there exists a pro duction

I

$

rule in U G from Denition a such that C x B E x fx next

U

$

U

x x idx f g and x has no sister no des Since only o ccurs in the one pro duction

rule with the start symbol f Then idxlst mx f and C x B f

I

$

From the reverse of Denition a there exists a pro duction rule A B f in G

$

which licenses x

Assume for a nonterminal no de x except for the ro ot no de that C x A

U

and idxlst mx f Then C x Af Assume also that there exists a

I

$

pro duction rule in U G from Denition b such that C x B E x

U

$

U

fx next x x idx f g and x has no sister no des Since only o ccur in the

one pro duction rule with the start symbol f Then idxlst mx and

$

C x B By the reverse of Denition b there exist a pro duction rule

I

Af B in G which licenses x

$

Assume for a nonterminal no de x except for the ro ot no de that C x A

U

and idxlst mx Then C x A Assume also that there exist a pro

I

$

duction rule in U G from Denition c such that dx C x B

U

$

x fx xg Then idxlst mx x fx xg and E C x C E

U

$

U U

idxlst mx C x B and C x C By the reverse of Denition

I I

$

c there exist a pro duction rule A BC in G which licenses x

$

Assume for a nonterminal no de x except for the ro ot no de that C x A and

U

idxlst mx Then C x A Assume also that there exists a lexicon rule

I

$

x in U G from Denition d such that dx C x t and E

U

$

U

Then C x t By the reverse of Denition d there exist a pro duction rule

I

A t in G which licenses x

$

We then have a valid derivation tree with the same terminal string as the c

structure and then w LG 2

$

Example Let G hN T I P S i b e an indexed grammar where T fdg is the

set of terminal symbols N fS A B C C D g is the set of nonterminal symbols

I f f g g is the set of indices and P is the least set containing the following

pro duction rules

S A B CC

A B f C g C C CC

B B g C f D D d

This grammar is on reduced form with a marked indexend The simple unication

grammar U G as given in Denition is then the tuple hN T P L S i where

P is the least set containing the following pro duction rules

S A B C C

next

idx

A B C C C C C

next next

idx f idx g

B B C D

next next

idx g idx f

and L contains one single lexicon rule

D d

Figure shows the derivation tree for the string dddd based on the indexed

grammar G together with the cstructure and the feature structure for the same

string string based on the simple unication grammar U G This shows that the

string dddd is b oth in LG and in LU G The language generated by G and

n

2

U G is fd j n g S S Α A$ ↓ next =˙ ↑ ↓ idx =˙ $ Bf$

Bgf$ B ↓ next =˙ ↑ ↓ idx =˙ f Cgf$ Cgf$ B C'f$ C'f$ ↓ next =˙ ↑ ↓ idx =˙ g Cf$ Cf$ Cf$ Cf$ C C ↑=↓ ↑=↓ D$ D$ D$ D$ ˙ ˙

d d d d C' C' ↑ next =˙ ↓ ↑ next =˙ ↓ ↑ idx =˙ g ↑ idx =˙ g a) Derivation tree C C C C ↑=˙↓ ↑=˙↓ ↑=˙↓ ↑=˙↓

D D D D next next next ↑ next =˙ ↓ ↑ next =˙ ↓ ↑ next =˙ ↓ ↑ next =˙ ↓ ↑ idx =˙ f ↑ idx =˙ f ↑ idx =˙ f ↑ idx =˙ f idx idx idx d d d d g f $ ∅ ∅ ∅ ∅

c) Feature structure b) C-structure

Figure Derivation tree a for the string dddd based on the grammar G in

Example together with the cstructure b and feature structure c for the same

string based on the grammar U G

A Unication Grammar Formalism for Indexed

Languages

We are here going to dene a version of the simple unication grammar that de

scrib es the class of indexed languages Just to b e precise a class of languages C

over a countable set of symbols is a set of languages such that each language

L C is a subset of where is a nite subset of The class C GF of

languages that a grammar formalism GF describ es is the set of all languages L

over such that there exists a grammar G in GF where LG L The class of

indexed languages is then the set of languages such that there for each language

exist a indexed grammar that generates the language We assume that is the set

of all terminal symbols that we use and drop as subscript

Denition A Unification grammar for Indexed languages UGI is a

simple unication grammar where

a each equation set in the production rules is on one of the three forms

E fg

E f next idx f g

E f next idx f g

where f is any value symbol and next and idx are the same two attribute

symbols for al l equations in al l production rules in UGI

b each lexicon rule has en empty equation set

Lemma The class of languages C UGI contains the class of indexed languages

Pro of Aho Aho showed that for every there exists an indexed

grammar on reduced form which generates the language From Lemma and its

pro of we have that for every indexed grammar G on reduced form there exists an

indexed grammar on reduced form with a marked indexend G such that LG

$

LG The simple unication grammar U G dened from the indexed grammar

$ $

on reduced form with a marked indexend in Denition is an UGI grammar From

Lemma we have that LG LU G Then every indexed language can b e

$ $

generated by an UGI grammar 2

We shall now show that every UGI grammar generates an indexed language

but to do this we need some technical results First it is easy to see that every UGI

grammar can b e formulated with rules only on the forms used in Denition ad

We dene the reduced form for this

Denition A UGI grammar is on reduced form if and only if every produc

tion rule is on one of the three fol lowing forms

A B A B A B C

next next

idx f idx f

Lemma For every UGI grammar there is an equivalent grammar on reduced

form

Pro of Using the techniques from the standard pro of for normal form for context

free grammars it is straight forward to replace each pro duction rule in the original

grammar not on reduced form with a set of new lexicon rules and pro duction rules

on reduced form This can b e done such that one instance of an original rule

corresp onds to the net eect of combining one ore more of the new rules This is

p ossible since we allow the empty string in lexicon rules 2

To make this formalism more directly comparable to indexed grammars with a

marked indexend we use what we will call a sinkmapped root

Denition A UGI grammar hN T P L S i has a sinkmapped root if and

only if it has one and only one production rule where the start symbol occurs and

this rule is on the form

S A

next

idx

where A N and the value symbol does not occur in any other production rule

The value symbol will form some kind of a blo ckade in the feature structure

since it do es not o ccur in any other pro duction rule hence no other no de in the

cstructure will b e mapp ed to the same no de in the feature structure as the ro ot of

the cstructure

What we are doing here is to put a mark at the b ottom of the stack of indices

in the way the nested stack is represented as a feature structure We also want

to map the ro ot of the cstructure to the sink of the feature structure when we

follow the next attribute

Lemma For every UGI grammar G there exists a UGI grammar with a sink

mapped root G such that LG LG

Pro of First we show how we from any UGI grammar G may dene a UGI grammar

with a sinkmapp ed ro ot G After this we show that for any string w w LG if

and only if w LG

Let any UGI grammar G hN T P L S i b e given and assume that S S and

0

S are neither terminal nor nonterminal symbols in G and that is a value symbol

not used in G The grammar G is dened by adding the following pro duction and

lexicon rules to the rules we have in G

i Let the following b e two pro duction rules

S S

0

next

idx

S S S

ii For each f V used in any pro duction rule in G let the following b e a

pro duction rule

S S

next

idx f

iii Let the following b e a lexicon rule

S

Complete G by adding S S and S to the nonterminal symbols and let S b e

0 0

the start symbol of G We see that G is a UGI grammar with a sinkmapped

3

root Notice also that if G is on reduced form so is the new grammar

Now we have to show that for any string w w LG if and only if w LG

We show this direction in two steps First we dene something that we

call a canonical feature structure for cstructures based on UGI grammars This is

done such that if the cstructure generates a well dened feature structure at all

then it is also generating the canonical feature structure After this denition we

show how we from a cstructure based on G together with its canonical feature

structure can construct a cstructure together with a feature structure based on the

grammar G This is done such that the two cstructures have the same terminal

string and if the terminal string is in LG so is it in LG also

Let hD K E i b e any cstructure based on a UGI grammar G such that it gen

erates a feature structure The canonical feature structure hQ mi for the c

structure is dened as follows Let rst Q b e the set of all sequences of no des

+

from the cstructure with at most n no des in each sequence where n is the

height of the cstructure Then let the name mapping function m b e dened on

Q by topdown induction on the no des in the cstructure First let the mapping

+

of the ro ot no de m b e the sequence of n s where again n is

3

The use of S in rule together with rule where it will lab el the mother of a no de with

"

the empty string is only done b ecause we want to stay in the domain of grammars on reduced

form when G is on reduced form This denition could b e simplied if we did not want this

the height of the cstructure Now assume that mx is dened for a no de x in the

cstructure Then for each daughter xi of x let

mxi mx if E xi

mxi popmx if next E xi

mxi addxi mx if next E xi

where pop of any nonempty sequence is the sequence we get by removing the

rst element popx x x x x and add of a single element and a

1 2 k 2 k

sequence is the sequence we get by adding the single element to the b eginning of the

sequence addx x x x x x Since the ro ot no de is mapp ed to

1 k 1 k

the sequence of n s pop and add may not go out of their domain and therefore

is m well dened

Extend now the set Q such that all the value symbol used in the cstructure also

+

are elements in Q Then let the partial function Q fnext idxg Q b e

+ + + +

dened such that q next popq for all nonempty sequences q Q and let

+ +

q next b e undened when q is the empty sequence Moreover let q idx f

+ +

for the value symbol f if and only if there exists a no de x in the cstructure such

that either idx f E x or idx f E xi for a daughter xi of x This is the

only place where inconsistency may o ccur and we will later see that it will not o ccur

if the cstructure generates any feature structure at all We extend the denition

of the to pairs of no des and strings of the attribute symbols as describ ed in the

+

denition of feature structures in the b eginning of section

Now let us shrink the denitions of Q and such that we get a well dened

+ +

feature structure First let Q Q b e the set of all no des that is reachable from

+

a named no de formally Q fq j x D fnext idxg mx q g

+

Then we restrict to the new domain Q fnext idxg Q Finally

+

let f f for all value symbol used in the cstructure We now have a feature

structure and it is describable and acyclic directly from the denition of Q and

It is also atomic since is not dened on any feature symbol no de and is only

dened on feature symbol no des Moreover it satises all the equations from the

cstructure after we have instantiated the up and down arrows We will now show

that if the cstructure generates any well dened feature structure so will it generate

the well dened canonical one also

Let M hQ m i b e any well dened feature structure which the c

structure generates and assume that we have the canonical feature structure as

describ ed From the fact that the cstructure generates a feature structure and from

the denition of the canonical feature structure we have that if mx my for

any two no des x and y in the cstructure then m x m y Now we may dene a

function h Q Q from the no des in the canonical feature structure to the no des in

M such that m x hmx for all no des x in the cstructure Assume then that

we dont have a well dened canonical feature structure b ecause of inconsistency

in it denition This means that there exist two instantiated equations x idx f

and y idx f from the cstructure where mx my but f f However

then m x m y and inconsistency must also o ccur with resp ect to M and the

cstructure can not generate any well dened feature structure Then the canonical

feature structure must b e consistent dened and since it is also describable acyclic

and atomic it is well dened Since it also satises all the equations in the cstructure

it is generated by the cstructure

Now we have a well dened canonical feature structure for each cstructure based

on any UGI grammar if the cstructure generates a feature structure Notice that

idx is not dened for the canonical feature structure This due to the

mapping of the ro ot in the cstructure to the sequence of n s where n is the

height of the cstructure With this height it is only p ossible to p op of n s

according to denition of the name mapping and since q idx is only dened

for q if there exist a no de x such that mx q idx can not b e dened

Assume now that w LG for a grammar G Then we have a cstructure

for w based on G which generates a well dened feature structure Then it is

also generating a canonical feature structure M hQ mi as describ ed ab ove

For this feature structure we extend the denition of and as follows First let

idx and let For all sequences q of s such that q idx

is not dened let q idx f for any value symbol f which o ccurs in the c

structure When we construct the new cstructure based on G the old no des keep

their mapping values

We construct a new cstructure for w based on G by the following steps First

add a new no de on the top of the cstructure by applying the pro duction rule

This give us also a new sister no de for the old ro ot no de Map the two new

no des to the same no de in the extended canonical feature structure as the old ro ot

no de This secures that the equations in the pro duction rule is satised by

the extended feature structure The new sister no de lab eled with S may only b e

a mother of a terminal no de lab eled with the empty string such that the terminal

string is still w Now add n no des ab ove the present ro ot no de by applying the

generic pro duction rule n times and pro duction rule on the topmost

no de This top no de will b e the ro ot no de in the new cstructure and it is now

lab eled with the start symbol in G When applying the generic pro duction rule

let f mx idx for each new no de x where it is applied The new

no des are each mapp ed to the sequence of k s where k is the no des distance from

the new ro ot no de In this way the new ro ot no de is mapp ed to the empty sequence

the daughter of the ro ot no de is mapp ed to and so on Since idx

the equations in pro duction rule is satised by the feature structure Moreover

since f mx idx for each no de x where the pro duction rule is applied

and q next popq all the equations is satised by the feature structure We

then have a cstructure based on G with w as terminal string and this cstructure

generates a well dened feature structure Then w LG

Assume that w LG for a grammar G Then there is a cstructure with

category S in the ro ot and a sequence of derivations down to a no de with category

0

S where each intermediate no de has category S This has b een constructed by

rst using pro duction rule and then a sequence of zero or more applications

of pro duction rule b efore pro duction rule gives the no de with category S

Every no de ab ove the rst no de with category S has only one child except the rst

which has an additional daughter lab eled with S This daughter is the mother

of a single terminal no de lab eled with the empty string Then we can remove

all no des ab ove the no de lab eled S and still have the same terminal string w in

the cstructure The new cstructure will have a ro otno de with category S and

only pro duction rules from the grammar G are used Since the original cstructure

generates a feature structure so do es the new one Then w LG 2

Now we have the necessary technical results to show that every language in

C UGI is an indexed language We do this in two steps

Lemma For any UGI grammar G on reduced form with a sinkmapped root

there exists an indexed grammar G such that U G G

I I

Pro of Assume that G hN T P L S i is a UGI grammar on reduced form with

a sinkmapp ed ro ot Then let G hN T I P S i b e an indexed grammar where

I

I is all the value symbols o ccurring in G and P is constructed from P and L by

reversing Denition ad This can b ee done since G is on reduced form and there

exist a one to one relation b etween the pro duction rules in the indexed grammar and

the pro duction and lexicon rules in the unication grammar dened there Since G

has a sinkmapp ed ro ot the start symbol will o ccur in one and only one pro duction

rule together with a unique value symbol Then G has a marked indexend and

I

U G G 2

I

Lemma Every language in C UGI is an indexed language

Pro of From Lemma and Lemma we have for any language in C UGI that there

exist a UGI grammar G on reduced form with a sinkmapp ed ro ot that generates the

language From Lemma we have an indexed grammar G such that U G G

I I

By Lemma we have that LG LG Then we have an indexed grammar for

I

all languages in C UGI 2

From Lemma and Lemma we then have the following result

Theorem The class C UGI is the class of indexed languages

Acknowledgments

I would like to thank Tore Langholm for his advice during the work that this pap er

is based on and for extended comments on earlier versions of this pap er

References

Aho Alfred V Aho Indexed grammars an extension of contextfree gram

mars Journal of the Association of Computing Machinery

Octob er

Col Erik A Colban Three Studies in Computational Semantics Drscient

thesis University of Oslo

Gal Jean H Gallier Logic for Computer Science Harp er Row Publishers

New York

Gaz Gerald Gazdar Applicability of indexed grammars to natural languages In

Uwe Reye and Christian Rohrer editors Natural Language Parsing and

Linguistic Theories pages D Reidel Publishing Company Dor

drecht Holland

GM Gerald Gazdar and Chris Mellish Natural Language Processing in LISP

AddisonWesley Publishing Company Also in Prolog version

Hay Takeshi Hayashi On derivation trees of indexed grammars an extension

of the uvwxyteorem Publications of the Research Institute for Mathe

matical Sciences Kyoto University

HU John E Hop croft and Jerey D Ullman Introduction to

Languages and Computation AddisonWesley

KB Ronald M Kaplan and Joan Bresnan Lexical functional grammar A

formal system of grammatical representation In Joan Bresnan editor

The Mental Representation of Gramatical Relations MITPress

Shi Stuart M Shieb er Evidence against the contextfreeness of natural lan

guage Linguistics and Philosophy