
Parsing Inside-Out

Joshua Goodman

TR-07-98

June 1998

Computer Science Group
Harvard University
Cambridge, Massachusetts

Parsing Inside-Out

A thesis presented

by

Joshua T. Goodman

to

The Division of Engineering and Applied Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Computer Science

Harvard University

Cambridge, Massachusetts

May 1998

Copyright 1998 by Joshua T. Goodman

All rights reserved

Abstract

Probabilistic Context-Free Grammars (PCFGs) and variations on them have recently become some of the most common formalisms for parsing. It is common with PCFGs to compute the inside and outside probabilities. When these probabilities are multiplied together and normalized, they produce the probability that any given nonterminal covers any piece of the input sentence. The traditional use of these probabilities is to improve the probabilities of grammar rules. In this thesis we show that these values are useful for solving many other problems in Statistical Natural Language Processing.

We give a framework for describing parsers. The framework generalizes the inside and outside values to semirings. It makes it easy to describe parsers that compute a wide variety of interesting quantities, including the inside and outside probabilities, as well as related quantities such as Viterbi probabilities and n-best lists. We also present three novel uses for the inside and outside probabilities. The first novel use is an algorithm that gets improved performance by optimizing metrics other than the exact match rate. The next novel use is a similar algorithm that, in combination with other techniques, speeds Data-Oriented Parsing by a large factor. The third use is to speed parsing for PCFGs, using thresholding techniques that approximate the inside-outside product; the thresholding techniques lead to a substantial speedup at the same accuracy level as conventional methods. At the time this research was done, no state-of-the-art grammar formalism could be used to compute inside and outside probabilities. We present the Probabilistic Feature Grammar formalism, which achieves state-of-the-art accuracy and can compute these probabilities.

Acknowledgements

To Erica,
my love, my help

There are too many people for me to thank, and for too many things. Most of all, this thesis is for Erica, whom I met a month after I started graduate school, and whom I will marry a month after I finish.

Next, I want to thank my committee. Stuart Shieber is an amazing advisor, who gave me wonderful support and just the right amount of guidance. Barbara Grosz has been a great acting advisor in my last year, putting her foot down on important things, but allowing me to occasionally split infinitives when it did not matter. Fernando Pereira has given me very good advice, and is a terrific source of knowledge. Thanks also to Leslie Valiant for the almost thankless task of serving on my committee.

For my first year or two, I shared an office with Stan Chen and Andy Kehler. I'd rather share a small office with them than have a large office to myself; they taught me as much as anyone here, and made grad school fun. They and the other members of the AI Research group read endless drafts of my papers, attended no end of nearly-identical practice talks, and listened to hundreds of bad ideas. Wheeler Ruml in particular has done countless small favors. Lillian Lee has been a source of guidance. I'm happy to call Rebecca Hwa my friend. Others, Luke Hunsberger, Ellie Baker, Kathy Ryall, Jon Christenson, Christine Nakatani, Nadia Shalaby, Greg Galperin, and David Magerman (for poker, and rejecting my ACL paper), have all helped me through and made my time here better.

A year after getting here, I developed tendonitis in my hands. I needed more help than usual to graduate, and I got it, both from the Division of Engineering and Applied Sciences and from many of the people I have named so far.

A PhD is built on a deep foundation. My parents and family have always pushed me and supported me, and continue to do so. My father encouraged me from a ridiculously early age (what kind of person gives a young child Scientific American?), while my mother, through what she called gentle teasing, gave me her own special kind of encouragement. I want to thank Edward Siegfried, my mentor for nine years, who taught me so much about computers. As an undergraduate, I had many helpful professors, among whom Harry Lewis stands out. After college, I worked at Dragon Systems, where Dean Sturtevant, Larry Gillick, and Bob Roth initiated me into the black art of speech recognition.

In the AI Research Group, acknowledgments are traditionally funny. This encourages people to read the acknowledgements and skip the thesis. I have therefore written rather dry acknowledgments, and put one joke in this thesis, to encourage the reading of this work in its entirety. The joke is not very funny, so you will have to read the whole thing to be sure you have found it. I will send a prize to the first person each year to find the joke without help. Some restrictions apply.

I'd also like to thank the National Science Foundation for providing most of my funding, with two grants and an NSF Graduate Student Fellowship.

Contents

1 Introduction
Background
Introduction to Statistical NLP
Context-Free Grammars
Probabilistic Context-Free Grammars
Overview

2 Semiring Parsing
Introduction
Earley Parsing
Overview
Semiring Parsing
Semiring
Item-Based Description
The Grammar
Conditions for Correct Processing
The derivation semirings
Efficient Computation of Item Values
Item Value Formula
Solving the Infinite Summation
Reverse Values
Reverse Values in Noncommutative Semirings
Semiring Parser Execution
Bucketing
Interpreter
Grammar Transformations
Examples
Finite State Automata and Hidden Markov Models
Prefix Values
Beyond Context-Free
Tomita Parsing
Graham Harrison Ruzzo Parsing
Previous Work
Recent similar work
Conclusion
A Additional Proofs
A Viterbi-n-best is a semiring
B Additional Examples
B Graham, Harrison, and Ruzzo (GHR) Parsing
B Beyond Context-Free
B Greibach Normal Form
C Reverse Value of Noncommutative Semirings
C Pair Semirings
C Specific Pair Semirings
C Derivation of Non-Commutative Reverse Value Formulas

3 Maximizing Metrics
Introduction
Evaluation Metrics
Basic Definitions
Evaluation Metrics
Maximizing Metrics
Which Metrics to Use
Labelled Recall Parsing
Formulas
Pseudocode Algorithm
Item-Based Description
Bracketed Recall Parsing
Experimental Results
Grammar Induced by Pereira and Schabes Method
Grammar Induced by Counting
NP-Completeness of Bracketed Tree Maximization
NP-Completeness of HMM Most Likely String
Bracketed Tree Maximization is NP-Complete
General Recall Algorithm
N-Ary Branching Parse Trees
N-Ary Branching Evaluation Metrics
Combined Rate Maximization
N-Ary Branching Experiments
Conclusions
A Proof of Crossing Brackets Theorem
B Glossary

4 Data-Oriented Parsing
Introduction
Previous Research
Reduction of DOP to PCFG
Parsing Algorithms
Sampling Algorithms
Experimental Results and Discussion
Timing Analysis
Analysis of Bod's Data
Conclusion

5 Thresholding
Introduction
Beam Thresholding
Global Thresholding
Global Thresholding Algorithm
Multiple-Pass Parsing
Multiple-Pass Speech Recognition
Multiple-Pass Parsing
Multiple Parameter Optimization
Comparison to Previous Work
Experiments
Data
The Grammar
What we measured
Experiments in Beam Thresholding
Experiments in Global Thresholding
Experiments combining Global Thresholding and Beam Thresholding
Experiments in Multiple-Pass Parsing
Future Work and Conclusion
Future Work
Conclusions

6 Probabilistic Feature Grammars
Introduction
Motivation
Formalism
Events and EventProbs
Terminal Function, Binary PFG, Alternating PFG
Comparison to Previous Work
Parsing
Pruning
Experimental Results
Features
Experimental Details
Results
Contribution of Individual Features
Conclusions and Future Work
A Backoff Tables

7 Conclusion
Summary

List of Figures

Inside probability example
Outside probability example
Inside-Outside probability example
CKY algorithm
Inside probabilities
Inside algorithm
Outside probabilities
Outside algorithm
Inside-outside probabilities
Inside-outside algorithm
Dependencies in the thesis

CKY Recognition Algorithm
CKY Inside Algorithm
Item-based description of a CKY parser
Earley Parsing
Semirings Used
Grammar derivation tree, item derivation tree, value
Derivation Forest Implementation
Outside algorithm
Goal tree, outer tree
Combining an outer tree with inner trees to form an outer tree
Forward Semiring Parser Interpreter
Reverse Semiring Parser Interpreter
Removal of Unary Productions
Removal of Epsilon Productions
Renormalization Parsing
Removal of n-ary Productions
NFA/HMM parser
Prefix Derivation Rules
Prederivation Illustration
Fast Prefix Derivation Rules
Fast Prederivation Illustration
Graham Harrison Ruzzo
TAG parser item-based description
Greibach Normal Form Transformation

Noncrossing and crossing constituents
Four trees of sample grammar and Labelled Recall tree
Labelled Recall Algorithm
Labelled Recall Description
Bracketed Recall Description
Labelled Tree versus Bracketed Recall in Pereira and Schabes Grammar
Conversion of Productions to Binary Branching
Portion of HMM corresponding to a SAT formula
Tree corresponding to an output of a symbol from the symbol alphabet
Tree corresponding to state sequence S ABC
Example Stochastic Tree Substitution Grammar and two parses
STSG to PCFG conversion example
General Recall Description
N-ary Labelled Recall Description
Labelled Combined Algorithm vs Labelled Tree Algorithm
Crossing trees

Training corpus tree for DOP example
Sample STSG Produced from DOP Model
Example tree with addresses
STSG elementary tree isomorphic to a PCFG subderivation
Example of Isomorphic Derivation
Monte Carlo parsing algorithm
Bod's sampling algorithm
Faster sampling algorithm

Precision and Recall versus Time in Beam Thresholding
Inside Parser with Beam Thresholding
Beam thresholding item-based description
Example Hidden Markov Model
Global Thresholding Motivation
Global Thresholding Algorithm
Global thresholding item-based description
Second Pass Parsing Algorithm
Multiple-Pass Parsing Description
Gradient Descent Multiple Threshold Search
Optimizing for Lower Entropy versus Optimizing for Faster Speed
Converting to Binary Branching
Converting to Terminal and Terminal-Prime Grammars
Smoothness for Precision and Recall versus Total Inside for Different Test Data Sizes
Productions versus Time
Beam Thresholding with and without the Prior Probability, Two Different Scales
Combining Beam and Global Search
Multiple Pass Parsing vs Beam and Global vs Beam

Example tree with features
Producing "the man" one feature at a time
PFG Inside Algorithm
Example tree with features: "The normal man dies"

List of Tables

Labelled Tree (Viterbi) versus Bracketed Recall for PS
Grammar Induced by Counting: Three Algorithms Evaluated on Five Criteria
Algorithm Timings in Seconds
Metrics and corresponding algorithms
DOP Labelled Recall versus Pereira and Schabes on Minimally Edited ATIS
DOP Labelled Recall versus Pereira and Schabes on Bod's Data
Three-way comparison on minimally edited ATIS data
Three-way comparison on ATIS data edited by Bod
Transformations from N-ary to Binary Branching Structures
Probabilities of test data with ungeneratable sentences
Monotonicity of various metrics
Features Used in Experiments
PFG experimental results
Contribution of individual features
Binary Event Backoff
Unary Event Backoff
Start Prior Event Backoff

Chapter 1

Introduction

Background

Consider the sentence "Stuart loves his thesis and Barbara does too." If a human being or a computer attempts to understand this sentence, what will the steps be? One might imagine that the first step would be a gross analysis of the sentence, a syntactic parsing, to determine its structure: breaking the sentence into a conjunction and two sentential clauses; breaking each sentential clause into a noun phrase (NP) and a verb phrase (VP); and so on, recursively, down to the words.

[Parse tree for "Stuart loves his thesis and Barbara does too": an S node splits into two sentential clauses joined by the conjunction "and"; each clause divides into an NP and a VP, continuing down to the individual words.]

After this first syntactic step, a semantic step would follow. The sentence would be converted into a logical representation, ruling out incorrect interpretations such as "Stuart loves someone's thesis and Barbara loves someone else's thesis," and allowing only interpretations such as "Stuart loves his own thesis and Barbara loves Stuart's thesis too" and "Stuart loves someone's thesis and Barbara loves that thesis too." After this semantic step, a pragmatic step would be necessary to disambiguate between these interpretations. Given that Stuart is notoriously hard to please, we would probably decide that it is Stuart's own thesis which is the intended antecedent, rather than some more recent one.

[Figure: Inside probability example — inside(loves, verb) = 0.1 and inside(loves, noun) = 0.01.]

The eventual goal of Natural Language Processing (NLP) research is to understand the meaning of sentences of human language. This is a difficult goal, and one we are still a long way from achieving. Rather than attacking the full problem all at once, it is prudent to attack subproblems separately, such as semantics or pragmatics. We will be solely concerned with the first problem, syntax. There are two broad approaches to syntax: the rule-based and the statistical. The statistical approach has become possible only recently, with the advent of bodies of text (corpora) which have had their structures hand annotated, called tree banks. Each sentence in the corpus has been assigned a parse tree, a representation of its structure, by a human being. The majority of these sentences can be used to train a statistical model, and another portion can be used to test the accuracy of the model.

In this thesis, we will be particularly concerned with two special quantities in statistical NLP: the inside and outside probabilities. For each span of terminal symbols (words), and for each nonterminal symbol (i.e. a phrase type, such as a noun phrase), the inside probability is the probability that that nonterminal would consist of exactly those terminals. For instance, if one out of every ten thousand noun phrases is "his thesis," then

    inside(his thesis, NP) = 0.0001

While a phrase like "his thesis" can be interpreted only as a noun phrase, natural language is replete with ambiguities. For instance, the word "loves" can be either a verb or a noun. Let us assume that there is a one in ten chance that a given verb will be "loves" (as seems true in the literature of computational linguistics examples), and a one in one hundred chance that a given noun will be "loves." Then

    inside(loves, verb) = 0.1
    inside(loves, noun) = 0.01

We might denote this graphically as in the inside probability figure above.

The outside probabilities can be thought of as the probability of everything surrounding a phrase. For instance, if the probability of a sentence of the form "Stuart verb his thesis" is, say, one in one thousand, and one of the form "Stuart noun his thesis" is one in ten thousand, then

    outside(Stuart ___ his thesis, verb) = 0.001
    outside(Stuart ___ his thesis, noun) = 0.0001

[Figure: Outside probability example — two partial trees for "Stuart ___ his thesis," one with a verb gap (probability 0.001) and one with a noun gap (probability 0.0001).]

[Figure: Inside-Outside probability example — multiplying inside by outside: 0.1 x 0.001 = 0.0001 for the verb reading and 0.01 x 0.0001 = 0.000001 for the noun reading of "Stuart loves his thesis."]

This is illustrated in the outside probability figure.

Now we can multiply the inside probabilities by the outside probabilities, as illustrated in the inside-outside figure above. The product of the inside and outside probabilities gives us the overall probability of the sentence having the indicated structure. For instance, there is a 0.0001 chance of the sentence "Stuart loves his thesis" with "loves" being a verb, and a 0.000001 chance of the sentence with "loves" being a noun. Since these are the only two possible parts of speech for "loves," the overall probability of the sentence is 0.000101. We can divide by this overall probability to get the conditional probabilities. That is, 0.0001/0.000101 (about 0.99) is the conditional probability that "loves" is a verb given the sentence as a whole, while 0.000001/0.000101 (about 0.01) is the conditional probability that "loves" is a noun.

Traditionally, the inside-outside probabilities have been used as a tool to learn the parameters of a statistical model. In particular, the inside-outside probabilities of a training set using one model can be used to find the parameters of another model, with typically improved performance. This procedure can be iterated, and is guaranteed to converge to a local optimum.

While learning the probabilities of a statistical grammar is the traditional use for the inside and outside probabilities, the goal of this thesis is different. Our goal is to show that the inside-outside values are useful for solving many other problems in statistical parsing, and to provide useful tools for finding these values.

Introduction to Statistical NLP

In this section, we will give a very brief introduction to statistical NLP, describing context-free grammars (CFGs), probabilistic context-free grammars (PCFGs), and algorithms for parsing CFGs and PCFGs.

Context-Free Grammars

We begin by quickly reviewing Context-Free Grammars (CFGs), less with the intention of aiding the novice reader than of clarifying the relationship to PCFGs in the next subsection.

A CFG G is a tuple ⟨N, Σ, R, S⟩ where N is the set of nonterminals including the start symbol S, Σ is the set of terminal symbols, and R is the set of rules (we use R rather than the more conventional P to avoid confusion with probabilities, which will be introduced later). Let V, the vocabulary of the grammar, be the set N ∪ Σ. We will use lowercase letters a, b, c to represent terminal symbols, uppercase letters A, B, C for nonterminals, and Greek symbols α, β, γ to represent a string of zero or more terminals and nonterminals. We will use the special symbol ε to represent a string of zero symbols. Rules in the grammar are all of the form A → α.

Given a string αAβ and a grammar rule A → γ ∈ R, we write

    αAβ ⇒ αγβ

to indicate that the first string produces the second string by substituting γ for A. A sequence of zero or more such substitutions, called a derivation, is indicated by ⇒*. For instance, if we have an input sentence w_1 ... w_n, and a sequence of substitutions starting with the start symbol S derives the sentence, then we write S ⇒* w_1 ... w_n.

There are several well known algorithms for determining whether, for a given input sentence w_1 ... w_n, such a sequence of substitutions exists. The two best known algorithms are the CKY algorithm (Kasami; Younger) and Earley's algorithm (Earley). The CKY algorithm makes the simplifying assumption that the grammar is in a special form, Chomsky Normal Form (CNF), in which all productions are of the form A → BC or A → a. The CKY algorithm is given in the figure below.

boolean chart[n, |N|, n+1] := FALSE;
for each start s
    for each rule A → w_s
        chart[s, A, s+1] := TRUE;
for each length l, shortest to longest
    for each start s
        for each split length t
            for each rule A → BC ∈ R
                /* extra TRUE for expository purposes */
                chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l] ∧ TRUE);
return chart[1, S, n+1];

Figure: CKY algorithm

Most of the parsing algorithms we use in this thesis will strongly resemble the CKY algorithm, so it is important that the reader understand this algorithm. Briefly, the algorithm can be described as follows. The central data structure in the CKY algorithm is a boolean three-dimensional array, the chart. An entry chart[i, A, j] contains TRUE if A ⇒* w_i ... w_{j-1}, FALSE otherwise. The key line in the algorithm,

    /* extra TRUE for expository purposes */
    chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l] ∧ TRUE);

says that if A → BC and B ⇒* w_s ... w_{s+t-1} and C ⇒* w_{s+t} ... w_{s+l-1}, then A ⇒* w_s ... w_{s+l-1}.

Notice that once all spans of length one have been examined, which occurs in the first double set of loops, we can then proceed to examine all spans of length two, and from there all spans of length three, which must be formed only from shorter spans, and so on. Thus, the outermost loop of the main set of loops is a loop over lengths, from shortest to longest. The next three loops examine all possible combinations of start positions, split lengths, and rules, so that the main inner statement examines all possibilities. Because array elements covering shorter spans are filled in first, this style of parser is also called a bottom-up chart parser.
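To make the chart indexing concrete, the following is a minimal Python sketch of the same bottom-up recognizer (my own illustration, not the thesis's code); the grammar representation (dictionaries keyed by terminals and by right-hand-side pairs) and the function name are assumptions of this sketch, and the toy grammar is the one that reappears in the semiring parsing example of Chapter 2.

def cky_recognize(words, unary, binary, start="S"):
    """Return True iff `words` is derivable from `start`.

    unary:  dict mapping a terminal w to the set of nonterminals A with A -> w
    binary: dict mapping a pair (B, C) to the set of nonterminals A with A -> B C
    The grammar must be in Chomsky Normal Form.
    """
    n = len(words)
    # chart contains (i, A, j) iff A derives words i .. j-1 (1-based, j exclusive)
    chart = set()
    for s in range(1, n + 1):                      # spans of length one
        for A in unary.get(words[s - 1], ()):
            chart.add((s, A, s + 1))
    for l in range(2, n + 1):                      # lengths, shortest to longest
        for s in range(1, n - l + 2):              # start positions
            for t in range(1, l):                  # split lengths
                for (B, C), parents in binary.items():
                    if (s, B, s + t) in chart and (s + t, C, s + l) in chart:
                        for A in parents:
                            chart.add((s, A, s + l))
    return (1, start, n + 1) in chart

# Example: S -> X X, X -> X X, X -> x, on the input "x x x"
unary = {"x": {"X"}}
binary = {("X", "X"): {"S", "X"}}
print(cky_recognize(["x", "x", "x"], unary, binary))   # True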

Probabilistic Context-Free Grammars

In this thesis we will primarily be concerned with a variation on context-free grammars: probabilistic context-free grammars (PCFGs). A PCFG is simply a CFG augmented with probabilities. We denote the probability of a rule A → γ by P(A → γ). We can also discuss the probability of a particular derivation, or of all possible derivations of one string from another. In order to make sure that equivalent derivations are not counted twice, we need the concept of a leftmost derivation. A leftmost derivation is one in which the leftmost nonterminal symbol is the one that is substituted for.

[Figure: Inside probabilities — a nonterminal X dominating the words w_i ... w_{j-1}.]

We will write ⇒[A→γ] to indicate a leftmost derivation step using a substitution of γ for A. Now we define the probability of a one-step derivation to be

    P(αAβ ⇒[A→γ] αγβ) = P(A → γ)

We can define the probability of a string of substitutions:

    P(α_1 ⇒[A_1→γ_1] α_2 ⇒[A_2→γ_2] ... ⇒[A_k→γ_k] α_{k+1}) = ∏_{i=1..k} P(A_i → γ_i)

Finally, we can define the probability of all of the leftmost derivations of some string from some initial string. In particular, we define

    P(α ⇒* β) = Σ over all k and all rule sequences A_1→γ_1, ..., A_k→γ_k such that α ⇒[A_1→γ_1] ... ⇒[A_k→γ_k] β of ∏_{i=1..k} P(A_i → γ_i)

Several probabilities of this form are of special interest. In particular, we say that the probability of a sentence w_1 ... w_n is P(S ⇒* w_1 ... w_n), the sum of the probabilities of all leftmost derivations of the sentence. In general, we will be interested in probabilities of nonterminals deriving sections of the sentence, P(A ⇒* w_i ... w_{j-1}). We will call this probability the inside probability of A over the span i to j, and will write it as inside(i, A, j).

Note that in general, a derivation of the form A ⇒* α can be drawn as a tree, in which the internal branches of the parse tree represent the nonterminals where substitutions occurred: for each substitution A → B_1 ... B_k there will be a node with parent A and children B_1 ... B_k. The leaves of the tree, when concatenated, form the string α. There is a one-to-one correspondence between leftmost derivations and parse trees, so we will use the concepts interchangeably.

When we consider inside probabilities, we are summing over the probabilities of all possible parse trees with a given root node covering a given span. Since the internal structure is summed over, we graphically represent the inside probability of a nonterminal X covering words w_i ... w_{j-1} as shown in the inside probabilities figure above. The inside algorithm, shown below, computes these probabilities.

float inside[n, |N|, n+1] := 0;
for each start s
    for each rule A → w_s
        inside[s, A, s+1] := P(A → w_s);
for each length l, shortest to longest
    for each start s
        for each split length t
            for each rule A → BC ∈ R
                inside[s, A, s+l] := inside[s, A, s+l] +
                    inside[s, B, s+t] × inside[s+t, C, s+l] × P(A → BC);
return inside[1, S, n+1];

Figure: Inside algorithm

Notice that the inside algorithm is extremely similar to the CKY algorithm given above. The inside algorithm was created by Baker; Lari and Young have written a good tutorial explaining it.

In many practical applications, we are not interested only in the sum of probabilities of all derivations, but also in the most probable derivation. The inside algorithm can be easily modified to return the probability of the most probable derivation, which corresponds uniquely to a most probable parse tree, instead of the sum of the probabilities of all derivations. We simply change the inner loop of the inside algorithm to read

    Viterbi[s, A, s+l] := max(Viterbi[s, A, s+l], Viterbi[s, B, s+t] × Viterbi[s+t, C, s+l] × P(A → BC));

These probabilities are known as the Viterbi probabilities, by analogy to the Viterbi probabilities for Hidden Markov Models (HMMs) (Rabiner; Viterbi). With slightly larger changes, the algorithm can record the actual productions used to create this most probable parse tree.
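As an illustration of how small the change from recognition to inside and Viterbi values is, here is a rough Python sketch (my own, not the thesis's code) that fills both charts at once for a CNF PCFG; the rule representation, mapping (A, w) and (A, B, C) to probabilities, is an assumption of the sketch.

from collections import defaultdict

def inside_and_viterbi(words, lexical, binary, start="S"):
    """Compute inside and Viterbi charts for a CNF PCFG.

    lexical: dict (A, w) -> P(A -> w)
    binary:  dict (A, B, C) -> P(A -> B C)
    Returns (inside, viterbi); inside[(i, A, j)] = P(A =>* w_i ... w_{j-1}).
    """
    n = len(words)
    inside = defaultdict(float)
    viterbi = defaultdict(float)
    for s in range(1, n + 1):
        for (A, w), p in lexical.items():
            if w == words[s - 1]:
                inside[(s, A, s + 1)] += p
                viterbi[(s, A, s + 1)] = max(viterbi[(s, A, s + 1)], p)
    for l in range(2, n + 1):                      # span length, shortest to longest
        for s in range(1, n - l + 2):
            for t in range(1, l):
                for (A, B, C), p in binary.items():
                    left, right = inside[(s, B, s + t)], inside[(s + t, C, s + l)]
                    if left and right:
                        inside[(s, A, s + l)] += left * right * p
                        v = viterbi[(s, B, s + t)] * viterbi[(s + t, C, s + l)] * p
                        viterbi[(s, A, s + l)] = max(viterbi[(s, A, s + l)], v)
    return inside, viterbi

# Example: the small grammar S -> X X, X -> X X | x, with made-up probabilities.
lexical = {("X", "x"): 0.7}
binary = {("S", "X", "X"): 1.0, ("X", "X", "X"): 0.3}
ins, vit = inside_and_viterbi(["x", "x", "x"], lexical, binary)
print(ins[(1, "S", 4)], vit[(1, "S", 4)])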

Another useful probability is the outside probability. The outside probability of a nonterminal X covering w_i to w_{j-1} is

    outside(i, X, j) = P(S ⇒* w_1 ... w_{i-1} X w_j ... w_n)

This probability is illustrated in the outside probabilities figure below. The outside probability can be computed using the outside algorithm, also given below (Baker; Lari and Young).

Notice that if we multiply an inside probability by the corresponding outside probability, we get

    inside(i, X, j) × outside(i, X, j) = P(X ⇒* w_i ... w_{j-1}) × P(S ⇒* w_1 ... w_{i-1} X w_j ... w_n)
                                       = P(S ⇒* w_1 ... w_{i-1} X w_j ... w_n ⇒* w_1 ... w_n)

which is the probability that a derivation of the whole sentence uses a constituent X covering w_i to w_{j-1}. This combination is illustrated in the inside-outside probabilities figure below.

[Figure: Outside probabilities — S at the root, with X covering a gap between w_1 ... w_{i-1} and w_j ... w_n.]

for each length l, longest downto shortest
    for each start s
        for each split length t
            for each rule A → BC ∈ R
                outside[s, B, s+t] := outside[s, B, s+t] +
                    outside[s, A, s+l] × inside[s+t, C, s+l] × P(A → BC);
                outside[s+t, C, s+l] := outside[s+t, C, s+l] +
                    outside[s, A, s+l] × inside[s, B, s+t] × P(A → BC);

Figure: Outside algorithm

[Figure: Inside-outside probabilities — S at the root with an X constituent covering w_i ... w_{j-1}, combining the outside context w_1 ... w_{i-1} and w_j ... w_n with the inside span.]
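For completeness, here is a hedged Python sketch of the outside computation in the same style as the inside sketch above (again my own code, not the thesis's); it assumes an already-filled inside chart, and it initializes the root item outside(1, S, n+1) to 1, a detail not shown in the pseudocode above.

from collections import defaultdict

def outside_chart(n, inside, binary, start="S"):
    """Compute outside[(i, A, j)] = P(S =>* w_1..w_{i-1} A w_j..w_n).

    inside: chart as produced by inside_and_viterbi above
    binary: dict (A, B, C) -> P(A -> B C)
    """
    outside = defaultdict(float)
    outside[(1, start, n + 1)] = 1.0          # assumed initialization of the root span
    for l in range(n, 1, -1):                 # span length, longest down to shortest
        for s in range(1, n - l + 2):
            for t in range(1, l):
                for (A, B, C), p in binary.items():
                    out = outside[(s, A, s + l)]
                    if out:
                        outside[(s, B, s + t)] += out * inside[(s + t, C, s + l)] * p
                        outside[(s + t, C, s + l)] += out * inside[(s, B, s + t)] * p
    return outside

# Continuing the example: the posterior that X spans the first two words is
# inside[(1, "X", 3)] * outside[(1, "X", 3)] / inside[(1, "S", 4)].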

Now, the probability of the sentence as a whole is just

    inside(1, S, n+1) = P(S ⇒* w_1 ... w_n)

If we normalize by dividing by the probability of the sentence, we get the conditional probability that X covers w_i ... w_{j-1}, given the sentence:

    inside(i, X, j) × outside(i, X, j) / inside(1, S, n+1)
        = P(X ⇒* w_i ... w_{j-1}) × P(S ⇒* w_1 ... w_{i-1} X w_j ... w_n) / P(S ⇒* w_1 ... w_n)
        = P(S ⇒* w_1 ... w_{i-1} X w_j ... w_n ⇒* w_1 ... w_n | S ⇒* w_1 ... w_n)

We might also want the probability that a particular rule was used to cover a particular span, given the sentence:

    Σ_k inside(i, B, k) × inside(k, C, j) × outside(i, A, j) × P(A → BC) / inside(1, S, n+1)
        = P(S ⇒* w_1 ... w_{i-1} A w_j ... w_n ⇒[A→BC] w_1 ... w_{i-1} BC w_j ... w_n ⇒* w_1 ... w_n | S ⇒* w_1 ... w_n)

This conditional probability is extremely useful. Traditionally, it has been used for estimating the probabilities of a PCFG from training sentences, using the inside-outside algorithm shown below.

for each iteration i, from 1 until the probabilities have converged
    for each rule A → γ
        C(A → γ) := 0;
    for j := 1 to number of sentences
        compute inside probabilities of sentence j using P_i;
        compute outside probabilities of sentence j using P_i;
        for each length l, shortest to longest
            for each start s
                for each rule A → γ
                    C(A → γ) := C(A → γ) +
                        P_i(S ⇒* w_1 ... w_{s-1} A w_{s+l} ... w_n ⇒[A→γ] w_1 ... w_{s-1} γ w_{s+l} ... w_n ⇒* w_1 ... w_n | S ⇒* w_1 ... w_n);
    for each rule A → γ
        P_{i+1}(A → γ) := C(A → γ) / Σ_{γ'} C(A → γ');

Figure: Inside-outside algorithm

In this algorithm, we start with some initial probability estimate P_1(A → γ). Then, for each sentence of training data, we determine the inside and outside probabilities to compute, for each production, how likely it is that that production was used as part of the derivation of that sentence. This gives us a number of counts for each production for each sentence. Summing these counts across sentences gives us an estimate of the total number of times each production was used to produce the sentences in the training corpus. Dividing by the total counts of productions used for each nonterminal A gives us a new estimate of the probability of the production. It is an important theorem that this new estimate will assign a higher probability to the training data than the old estimate, and that when this algorithm is run repeatedly, the rule probabilities converge towards locally optimum values (Baker), in terms of maximizing the probability of the training data.
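The reestimation step can be sketched in the same Python setting as the earlier charts (my own code, not the thesis's): for a single sentence, it collects expected rule counts using the conditional probability above, and then renormalizes per left-hand side. Summing counts over all training sentences before renormalizing gives one iteration of the algorithm.

from collections import defaultdict

def expected_counts(words, lexical, binary, inside, outside, start="S"):
    """One E-step for one sentence: expected number of uses of each rule."""
    n = len(words)
    Z = inside[(1, start, n + 1)]            # probability of the sentence
    counts = defaultdict(float)
    if Z == 0:
        return counts                        # sentence not generated by the grammar
    for (A, w), p in lexical.items():        # lexical rules A -> w
        for s in range(1, n + 1):
            if words[s - 1] == w:
                counts[(A, w)] += p * outside[(s, A, s + 1)] / Z
    for (A, B, C), p in binary.items():      # binary rules A -> B C
        for l in range(2, n + 1):
            for s in range(1, n - l + 2):
                for t in range(1, l):
                    counts[(A, B, C)] += (inside[(s, B, s + t)] *
                                          inside[(s + t, C, s + l)] *
                                          outside[(s, A, s + l)] * p) / Z
    return counts

def reestimate(counts):
    """M-step: divide each rule's expected count by the total for its left-hand side."""
    totals = defaultdict(float)
    for rule, c in counts.items():
        totals[rule[0]] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items() if totals[rule[0]] > 0}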

Overview

While the traditional use for the inside-outside probabilities is to estimate the parameters of a PCFG, our goal is different: our goal is to demonstrate that the inside-outside probabilities are useful for solving many other problems in statistical parsing, and to provide useful tools for finding these values. In the remaining chapters of this thesis, we will first provide a general framework for specifying parsers that makes it easy to compute inside and outside probabilities. Next, we will show three novel uses: improved parser performance on specific criteria; faster parsing for the Data-Oriented Parsing (DOP) model; and faster parsing more generally, by using thresholding. Finally, we describe a state-of-the-art parsing formalism that can compute inside and outside probabilities.

In Chapter 2, we develop a novel framework for specifying parsers that makes it easy to compute the inside and outside probabilities, and others. The CKY algorithm is very similar to the inside algorithm, and the inside algorithm in some ways resembles the outside algorithm. For a simple parsing technique such as CKY parsing, it is not too much work to derive each of these algorithms separately, ignoring their commonalities, but for more complicated algorithms and formalisms, the duplicated effort is significant. In particular, for sophisticated formalisms or techniques, the outside formula can be especially complicated to derive. Also, for parsing algorithms that can handle loops, like those that result from a grammar rule like A → A, the inside and outside algorithms may become yet more complicated, because of the infinite summations that result. We develop a framework that allows a single parsing description to be used to compute recognition, inside, outside, and Viterbi values, among others. The framework uses a description language that is independent of the values being derived, and thus allows the complicated manipulations required to handle infinite summations to be separated out from the construction of the parsing algorithms. Using this framework, we show how to easily compute many interesting values, including the set of all parses of a grammar, the top n parses of a sentence, the most probable completion of a sentence, and many others. With this format, it will be simple to specify the thresholding and parsing algorithms of the following chapters, although we will also use traditional pseudocode as well, in an effort to keep the chapters self-contained.

In Chapter 3, we present our first novel use for the inside-outside probabilities: tailoring parsing algorithms to various metrics. Most probabilistic parsing algorithms are similar, in that they attempt to maximize a single metric: the probability that the guessed parse tree is exactly correct. If the score that the parser receives is given by the number of exactly correct guessed trees, then this approach is correct. However, in practice, many other metrics are typically used, such as precision and recall, or crossing brackets. These metrics measure, in one way or another, how many pieces of the sentence are correct, rather than whether the whole sentence is exactly correct. Because the inside-outside product is proportional to the probability that a given constituent is correct, we can use it to maximize correct pieces, rather than the whole. We give various algorithms using the inside-outside probabilities for maximizing performance on these piecewise metrics. We also show that, surprisingly, a similar problem, maximizing performance on the well known zero crossing brackets rate, is NP-Complete. Finally, we give an algorithm that, using the inside and outside probabilities, allows the tradeoff of precision versus recall. These algorithms can be easily specified in the format of Chapter 2.

Next, we show another use for the inside-outside probabilities: Data-Oriented Parsing. When we began this research, DOP was one of the most promising techniques available, with reported results an order of magnitude better than other parsers. However, the only algorithms available for parsing using the DOP model required exponentially large grammars. Furthermore, these algorithms were randomized algorithms, with some chance of failure. We show how to use the inside-outside techniques of Chapter 3 to parse the DOP model deterministically, without sampling at all. We also show a grammar construction technique that is linear in sentence length, rather than exponential. Using these techniques, we are able to parse much faster than previous algorithms. Our results are not as good as the previously published results, and we give an analysis of the data that shows that these previous results are probably due to a fortuitous split of the data into test and training sections, or to easy data.

The third novel use we give for the inside-outside probabilities is to speed parsing, which we discuss in Chapter 5. Parsers can be sped up using thresholding, a technique in which some low-probability hypotheses are discarded, speeding later parsing. Since the normalized inside-outside probability gives the probability that any constituent is correct, it is the mathematically ideal probability to use to determine which hypotheses to discard. However, since the outside probability cannot be determined until after parsing is complete, we instead use three approximations to the inside-outside probabilities. These include: a variation on beam search, in which, rather than thresholding based only on the inside probability, we also include a very simple approximation to the outside probability; another thresholding technique that uses a more complicated approximation to the inside-outside probability, which takes into account the whole sentence; and a multiple-pass technique that uses the inside-outside probability from one pass to threshold later passes. All three of these algorithms lead to significantly improved speed. In order to maximize performance using all of these algorithms at once, it is necessary to maximize many parameters simultaneously. We give a novel algorithm for maximizing the parameters of multiple thresholding algorithms. Rather than directly maximizing accuracy, this algorithm maximizes the inside probability, which turns out to be much more efficient. Combining all of the thresholding algorithms together leads to a substantial speedup over traditional thresholding algorithms at the same error rates. All of the thresholding algorithms can be succinctly described using the item-based descriptions of Chapter 2.

Despite the usefulness of the inside and outside probabilities, as shown in the previous chapters, most state-of-the-art parsing formalisms cannot be used to compute either the inside or the outside scores. In Chapter 6, we introduce a grammar formalism, Probabilistic Feature Grammar (PFG), that combines the best properties of most of the previous existing formalisms, but more elegantly. PFGs can be used to compute both inside and outside probabilities, meaning that they can be used with the previously introduced algorithms. PFGs achieve state-of-the-art performance on parsing tasks.

This thesis shows that using the inside-outside probabilities is a powerful, general technique. We have simplified the process of computing inside and outside probabilities for new parsing algorithms. We have shown how to use these probabilities to improve performance by matching parsing algorithms to metrics, to quickly parse DOP grammars, and to quickly parse Probabilistic Context-Free Grammars (PCFGs) and PFGs with novel thresholding techniques. Finally, we have introduced a novel formalism, PFG, for which the inside and outside probabilities can be easily computed, and that achieves state-of-the-art performance.

In writing this thesis, we have taken into account that it will be a rare person who wishes to read the thesis in its entirety. Thus, whenever it is reasonable to make a chapter self-contained, or to isolate interdependencies, we have sought to do so.

The figure below shows the organization and dependencies of the content chapters of the thesis. The chapters on maximizing metrics, thresholding, and probabilistic feature grammars all depend somewhat on the semiring parsing chapter, in that algorithms in these chapters are given using the format and theory of semiring parsing. However, to keep the chapters self-contained, all algorithms in these chapters are also presented in traditional pseudocode. The most important dependency is that of the Data-Oriented Parsing chapter, which uses algorithms and ideas from the chapter on maximizing metrics, and should probably not be read on its own. The thresholding and probabilistic feature grammar chapters each depend somewhat on the other, and can best be appreciated as a pair, although each can be read separately.

[Figure: Dependencies in the thesis — Chapter 2 (Semiring Parsing) feeds Chapters 3 (Maximizing Metrics), 5 (Thresholding), and 6 (Probabilistic Feature Grammars); Chapter 4 (Data-Oriented Parsing) depends on Chapter 3.]

Chapter 2

Semiring Parsing

In this chapter, we present a system for describing parsers that allows a single, simple representation to be used for describing parsers that compute inside and outside probabilities, as well as many other values, including recognition, derivation forests, and Viterbi values. This representation will be used throughout the thesis to describe the parsing algorithms we develop.

Introduction

For a given grammar and string, there are many interesting quantities we can compute. We can determine whether the string is generated by the grammar; we can enumerate all of the derivations of the string; if the grammar is probabilistic, we can compute the inside and outside probabilities of components of the string. Traditionally, a different parser description has been needed to compute each of these values. For some parsers, such as CKY parsers, all of these algorithms except for the outside parser strongly resemble each other. For other parsers, such as Earley parsers, the algorithms for computing each value are somewhat different, and a fair amount of work can be required to construct each one. We present a formalism for describing parsers such that a single simple description can be used to generate parsers that compute all of these quantities, and others. This will be especially useful for finding parsers for outside values, and for parsers that can handle general grammars, like Earley-style parsers.

We will compare the CKY algorithm (Kasami; Younger) to the inside algorithm (Baker; Lari and Young) to illustrate the similarity of parsers for computing different values. Both of these parsers were described in the previous chapter; we repeat the code here in the two figures below. Notice how similar the inside algorithm is to the recognition algorithm: essentially all that has been done is to substitute + for ∨, × for ∧, and P(A → w_s) and P(A → BC) for TRUE. For many parsing algorithms, this or a similarly simple modification is all that is needed to create a probabilistic version of the algorithm. On the other hand, a simple substitution is not always sufficient.

boolean chart[n, |N|, n+1] := FALSE;
for each start s
    for each rule A → w_s
        chart[s, A, s+1] := TRUE;
for each length l, shortest to longest
    for each start s
        for each split length t
            for each rule A → BC ∈ R
                /* extra TRUE for expository purposes */
                chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l] ∧ TRUE);
return chart[1, S, n+1];

Figure: CKY Recognition Algorithm

float chart[n, |N|, n+1] := 0;
for each start s
    for each rule A → w_s
        chart[s, A, s+1] := P(A → w_s);
for each length l, shortest to longest
    for each start s
        for each split length t
            for each rule A → BC ∈ R
                chart[s, A, s+l] := chart[s, A, s+l] + chart[s, B, s+t] × chart[s+t, C, s+l] × P(A → BC);
return chart[1, S, n+1];

Figure: CKY Inside Algorithm

To give a trivial example, if in the CKY recognition algorithm we had written

    chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l]);

instead of the less natural

    chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l] ∧ TRUE);

larger changes would be necessary to create the inside algorithm.

Besides recognition, there are four other quantities that are commonly computed by parsing algorithms: derivation forests, Viterbi scores, number of parses, and outside probabilities. The first quantity, a derivation forest, is a data structure that allows one to efficiently compute the set of legal derivations of the input string. The derivation forest is typically found by modifying the recognition algorithm to keep track of back pointers for each cell, recording how it was produced. The second quantity often computed is the Viterbi score, the probability of the most probable derivation of the sentence. This can typically be computed by substituting × for ∧ and max for ∨. Less commonly computed is the total number of parses of the sentence, which, like the inside values, can be computed using multiplication and addition; unlike for the inside values, the probabilities of the rules are not multiplied into the scores. One last commonly computed quantity, the outside probability, cannot be found with modifications as simple as the others. We will discuss how to compute outside quantities later in this chapter.

One of the key ideas of this chapter is that all five of these commonly computed quantities can be described as elements of complete semirings (Kuich). A complete semiring is a set of values over which a multiplicative operator and a commutative additive operator have been defined, and for which infinite summations are defined. For parsing algorithms satisfying certain conditions, the multiplicative and additive operations of any complete semiring can be used in place of ∧ and ∨, and correct values will be returned. We will give a simple normal form for describing parsers, then precisely define complete semirings and the conditions for correctness.

We now describe our normal form for parsers, which is very similar to that used by Shieber et al. and by Sikkel. In most parsers, there is at least one chart of some form. In our normal form, we will use a corresponding concept, items. Rather than, for instance, a chart element chart[i, A, j], we will use an item [i, A, j]. Conceptually, chart elements and items are equivalent. Furthermore, rather than use explicit procedural descriptions, such as

    chart[s, A, s+l] := chart[s, A, s+l] ∨ (chart[s, B, s+t] ∧ chart[s+t, C, s+l] ∧ TRUE);

we will use inference rules, such as

    R(A → BC)   [i, B, k]   [k, C, j]
    ---------------------------------
               [i, A, j]

The meaning of an inference rule is that if the top line is all true, then we can conclude the bottom line.

Item form:
    [i, A, j]

Goal:
    [1, S, n+1]

Rules:

    R(A → w_i)
    -----------    (Unary)
    [i, A, i+1]

    R(A → BC)   [i, B, k]   [k, C, j]
    ---------------------------------    (Binary)
               [i, A, j]

Figure: Item-based description of a CKY parser

For instance, the example inference rule above can be read as saying that if A → BC and B ⇒* w_i ... w_{k-1} and C ⇒* w_k ... w_{j-1}, then A ⇒* w_i ... w_{j-1}.

The general form for an inference rule will be

    A_1 ... A_k
    -----------
         B

where, if the conditions A_1 ... A_k are all true, then we infer that B is also true. The A_i can be either items or, in an extension to the usual convention for inference rules, can be rules such as R(A → BC). We write R(A → BC) rather than A → BC to indicate that we could be interested in a value associated with the rule, such as the probability of the rule if we were computing inside probabilities. If an A_i is in the form R(...), we call it a rule. All of the A_i must be rules or items; when we wish to refer to both rules and items, we use the word terms.

We now give an example of an item-based description and its semantics. The figure above gives a description of a CKY-style parser. For this example, we will use the inside semiring, whose additive operator is addition and whose multiplicative operator is multiplication. We use the input string xxx with the following grammar:

    S → XX
    X → XX
    X → x

Our first step is to use the unary rule

    R(A → w_i)
    -----------
    [i, A, i+1]

The effect of the unary rule will exactly parallel the first set of loops in the CKY inside algorithm. We will instantiate the free variables of the unary rule in every possible way. For instance, we instantiate the free variable i with the value 1 and the free variable A with the nonterminal X. Since w_1 = x, the instantiated rule is then

    R(X → x)
    ---------
    [1, X, 2]

Because the top line of the instantiated unary rule consists of the single term R(X → x), whose value is the rule probability P(X → x), we deduce that the bottom line [1, X, 2] has value P(X → x). We instantiate the rule in two other ways, and compute the following chart values:

    [1, X, 2] = P(X → x)
    [2, X, 3] = P(X → x)
    [3, X, 4] = P(X → x)

Next, we will use the binary rule

    R(A → BC)   [i, B, k]   [k, C, j]
    ---------------------------------
               [i, A, j]

The effect of the binary rule will parallel the second set of loops of the CKY inside algorithm. Consider the instantiation i = 1, k = 2, j = 3, A = X, B = X, C = X:

    R(X → XX)   [1, X, 2]   [2, X, 3]
    ---------------------------------
               [1, X, 3]

We use the multiplicative operator of the semiring of interest to multiply together the values of the top line. In the inside semiring, the multiplicative operator is just multiplication, so we get P(X → XX) × P(X → x) × P(X → x), and deduce that [1, X, 3] has that value. We can do the same thing for the instantiation i = 2, k = 3, j = 4, A = X, B = X, C = X, getting the following item values:

    [1, X, 3] = P(X → XX) × P(X → x)²
    [2, X, 4] = P(X → XX) × P(X → x)²

We can also deduce that

    [1, S, 3] = P(S → XX) × P(X → x)²
    [2, S, 4] = P(S → XX) × P(X → x)²

There are two more ways to instantiate the conditions of the binary rule:

    R(S → XX)   [1, X, 2]   [2, X, 4]
    ---------------------------------
               [1, S, 4]

    R(S → XX)   [1, X, 3]   [3, X, 4]
    ---------------------------------
               [1, S, 4]

The first has the value P(S → XX) × P(X → x) × P(X → XX) × P(X → x)², and the second has the same value. When there is more than one way to derive a value for an item, we use the additive operator of the semiring to sum them up. In the inside semiring, the additive operator is just addition, so the value of this item is [1, S, 4] = 2 × P(S → XX) × P(X → XX) × P(X → x)³. Notice that the goal item for the CKY parser is [1, S, n+1], here [1, S, 4]. Thus, we now know the inside value for xxx. The goal item exactly parallels the return statement of the CKY inside algorithm.

Besides the fact that this item-based description is simpler than the explicit looping description, there will be other reasons we wish to use it. Unlike the other quantities we wish to compute, the outside probabilities cannot be computed by simply substituting a different semiring into either an iterative or item-based description. Instead, we will show how to compute the outside probabilities using a modified interpreter of the same item-based description used for computing the inside probabilities.
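The point of the example is that nothing in the deduction itself depends on which semiring is used. As a rough illustration (my own sketch, not the thesis's interpreter, ignoring the ordering and infinite-sum issues discussed later in the chapter), here is the CKY item-based description evaluated under an arbitrary semiring supplied as a (zero, one, plus, times) tuple; the rule-value encoding is an assumption of the sketch.

from collections import defaultdict
from itertools import product

def cky_semiring(words, rule_value, nonterms, semiring, start="S"):
    """Evaluate the CKY item-based description under an arbitrary semiring.

    rule_value(rule) returns the semiring value of a grammar rule, where a rule is
    either ("unary", A, w) or ("binary", A, B, C); rules not in the grammar map to zero.
    semiring is a (zero, one, plus, times) tuple.
    """
    zero, one, plus, times = semiring
    n = len(words)
    val = defaultdict(lambda: zero)
    for i in range(1, n + 1):                      # Unary: R(A -> w_i) / [i, A, i+1]
        for A in nonterms:
            r = rule_value(("unary", A, words[i - 1]))
            val[(i, A, i + 1)] = plus(val[(i, A, i + 1)], r)
    for l in range(2, n + 1):                      # Binary: R(A -> BC) [i,B,k] [k,C,j] / [i,A,j]
        for i in range(1, n - l + 2):
            j = i + l
            for k in range(i + 1, j):
                for A, B, C in product(nonterms, repeat=3):
                    r = rule_value(("binary", A, B, C))
                    top = times(times(r, val[(i, B, k)]), val[(k, C, j)])
                    val[(i, A, j)] = plus(val[(i, A, j)], top)
    return val[(1, start, n + 1)]

# The same description evaluated in two semirings: inside probabilities and boolean recognition.
rules = {("unary", "X", "x"): 0.7, ("binary", "X", "X", "X"): 0.3, ("binary", "S", "X", "X"): 1.0}
inside_sr = (0.0, 1.0, lambda a, b: a + b, lambda a, b: a * b)
boolean_sr = (False, True, lambda a, b: a or b, lambda a, b: a and b)
print(cky_semiring("xxx", lambda r: rules.get(r, 0.0), {"S", "X"}, inside_sr))          # inside value
print(cky_semiring("xxx", lambda r: bool(rules.get(r, 0.0)), {"S", "X"}, boolean_sr))   # True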

Earley Parsing

Many parsers are much more complicated than the CKY parser. This will make our description of semiring parsing a bit longer, but will also explain why our format is so useful: these complexities occur in many different parsers, and the ability of semiring parsing to handle them automatically will prove to be its main attraction.

Most of the interesting complexities we wish to discuss are exhibited by Earley parsing (Earley). Earley's parser is often described as a bottom-up parser with top-down filtering. In a probabilistic framework, the bottom-up and top-down aspects are very different: the bottom-up sections compute probabilities, while the top-down filtering non-probabilistically removes items that cannot be derived. In order to capture these differences, we expand our notation for deduction rules to the following form:

    A_1 ... A_k
    C_1 ... C_l
    -----------
         B

C_1 ... C_l are side conditions, interpreted non-probabilistically, while A_1 ... A_k are main conditions, with values in whichever semiring we are using. While the values of all main conditions are multiplied together to yield the value for the item under the line, the side conditions are interpreted in a boolean manner: either they all have nonzero value, or not. The rule can only be used if all of the side conditions have nonzero value, but other than that, their values are ignored.

The figure below gives an item-based description of Earley's parser. We assume the addition of a distinguished nonterminal S' with a single rule S' → S. An item of the form [i, A → α•β, j] asserts that α ⇒* w_i ... w_{j-1}.

The prediction rule includes a side condition, making it a good example. The rule is

    R(B → γ)
    [i, A → α•Bβ, j]
    ----------------
    [j, B → •γ, j]

Through the prediction rule, Earley's algorithm guarantees that an item of the form [j, B → •γ, j] can only be produced if S' ⇒* w_1 ... w_{j-1} Bβ' for some β'; this top-down filtering leads to significantly more efficient parsing for some grammars than the CKY algorithm.

Item form:
    [i, A → α•β, j]

Goal:
    [1, S' → S•, n+1]

Rules:

    ----------------    (Initialization)
    [1, S' → •S, 1]

    [i, A → α•w_j β, j]
    ----------------------    (Scanning)
    [i, A → α w_j•β, j+1]

    R(B → γ)
    [i, A → α•Bβ, j]    (Prediction)
    ----------------
    [j, B → •γ, j]

    [i, A → α•Bβ, k]   [k, B → γ•, j]
    ---------------------------------    (Completion)
           [i, A → αB•β, j]

Figure: Earley Parsing

The prediction rule combines side and main conditions. The side condition,

    [i, A → α•Bβ, j]

provides the top-down filtering, ensuring that only items that might be used later by the completion rule can be predicted, while the main condition,

    R(B → γ)

provides the probability of the relevant rule. The side condition is interpreted in a boolean fashion, while the main condition's actual probability is used.

Unlike the CKY algorithm, Earley's algorithm can handle grammars with epsilon, unary, and n-ary branching rules. In some cases, this can significantly complicate parsing. For instance, given unary rules A → B and B → A, a cycle exists. This kind of cycle may allow an infinite number of different derivations, requiring an infinite summation to compute the inside probabilities. The ability of item-based parsers to handle these infinite loops with ease is a major attraction.

Overview

This chapter will simplify the development of new parsers in several ways. First, it will simplify specification of parsers: the item-based description is simpler than a procedural description. Second, it will make it easier to generalize parsers to other tasks: a single item-based description can be used to compute values in a variety of semirings, and outside values as well. This will be especially advantageous for parsers that can handle loops resulting from rules like A → A, and computations resulting from ε-productions, both of which typically lead to infinite sums. In these cases, the procedure for computing an infinite sum differs from semiring to semiring, and the fact that we can specify that a parser computes an infinite sum separately from its method of computing that sum will be very helpful.

In the next section, we describe the basics of semiring parsing. In the following sections, we derive formulae for computing the values of items in semiring parsers, and then describe an algorithm for interpreting an item-based description. Next, we discuss using this same formalism for performing grammar transformations. At the end of the chapter, we give examples of using semiring parsers to solve a variety of problems.

Semiring Parsing

In this section, we first describe the inputs to a semiring parser: a semiring, an item-based description, and a grammar. Next, we give the conditions under which a semiring parser gives correct results. At the end of this section, we discuss three especially complicated and interesting semirings.

Semiring

In this subsection, we define and discuss semirings. The best introduction to semirings that we know of, and the one we follow here, is that of Kuich, who also gives more formal definitions than those given in this chapter.

A semiring has two operations, ⊕ and ⊗, that intuitively have most, but not necessarily all, of the properties of the conventional + and × operations on the positive integers. In particular, we require the following properties: ⊕ is associative and commutative; ⊗ is associative and distributes over ⊕. If ⊗ is commutative, we will say that the semiring is commutative. We assume an additive identity element, which we write as 0, and a multiplicative identity element, which we write as 1. Both addition and multiplication can be defined over finite sets of elements; if the set is empty, then the value is the respective identity element, 0 or 1. We also assume that x ⊗ 0 = 0 ⊗ x = 0 for all x. In other words, a semiring is just like a ring, except that the additive operator need not have an inverse. We will write

    ⟨A, ⊕, ⊗, 0, 1⟩

to indicate a semiring over the set A with additive operator ⊕, multiplicative operator ⊗, additive identity 0, and multiplicative identity 1.

For parsers with loops, i.e. those in which an item can be used to derive itself, we will also require that sums of an infinite number of elements be well defined. In particular, we will require that the semirings be complete (Kuich). This means that sums of an infinite number of elements should be associative, commutative, and distributive, just like finite sums. All of the semirings we will deal with in this chapter are complete. Completeness is a somewhat stronger condition than we really need; we could instead require that limits be appropriately defined for those infinite sums that occur while parsing, but this weaker condition is more complicated to describe precisely.

    boolean              ⟨{TRUE, FALSE}, ∨, ∧, FALSE, TRUE⟩
    inside               ⟨R_0^∞, +, ×, 0, 1⟩
    Viterbi              ⟨R_0^1, max, ×, 0, 1⟩
    counting             ⟨N_0^∞, +, ×, 0, 1⟩
    tropical semiring    ⟨R_0^∞, min, +, ∞, 0⟩
    arctic semiring      ⟨R ∪ {-∞}, max, +, -∞, 0⟩
    derivation forest    ⟨2^E, ∪, ·, ∅, {⟨⟩}⟩
    Viterbi-derivation   ⟨R_0^1 × 2^E, max_Vit, ×_Vit, ⟨0, ∅⟩, ⟨1, {⟨⟩}⟩⟩
    Viterbi-n-best       ⟨{topn(X) : X ⊆ R_0^1 × E}, max_Vitn, ×_Vitn, ∅, {⟨1, ⟨⟩⟩}⟩

Figure: Semirings Used, ⟨A, ⊕, ⊗, 0, 1⟩

Certain semirings are naturally ordered, meaning that we can define a partial ordering ⊑ such that x ⊑ y if and only if there exists z such that x ⊕ z = y. We will call a naturally ordered complete semiring continuous (Kuich) if, for any sequence x_1, x_2, ..., and for any constant y, if for all n, x_1 ⊕ ... ⊕ x_n ⊑ y, then ⊕_i x_i ⊑ y. That is, if every partial sum is less than or equal to y, then the infinite sum is also less than or equal to y. This important property makes it easy to compute, or at least approximate, infinite sums. All of the semirings we discuss here will be both naturally ordered and continuous.
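To see why approximating infinite sums is often enough in a continuous semiring, consider the inside semiring and a unary cycle A → A with probability p < 1: each extra trip around the cycle multiplies a derivation's probability by p, so the cycle contributes the geometric series Σ_{k≥0} p^k = 1/(1-p), and the partial sums approach that limit from below. A tiny Python check of this (my own illustration, not from the thesis):

p = 0.4                      # probability of the looping rule A -> A
partial = 0.0
term = 1.0                   # p**0
for k in range(50):          # partial sums of the geometric series
    partial += term
    term *= p
print(partial)               # approaches the infinite sum from below
print(1.0 / (1.0 - p))       # closed form: 1.666...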

There will be several especially useful semirings in this chapter, which are defined in the figure above. We will write R_a^b to indicate the set of real numbers from a to b inclusive, with similar notation for the natural numbers N. We will write E to indicate the set of all derivations, where a derivation is an ordered list of grammar rules. We will write 2^E to indicate the set of all sets of derivations. There are three derivation semirings: the derivation forest semiring, the Viterbi-derivation semiring, and the Viterbi-n-best semiring. The operators used in the derivation semirings (∪, ·, max_Vit, ×_Vit, max_Vitn, and ×_Vitn) will be described later in this chapter.

The inside semiring includes all nonnegative real numbers, to be closed under addition, and includes infinity, to be closed under infinite sums, while the Viterbi semiring contains only numbers up to 1, since that is all that is required to be closed under max.

There are two additional semirings: the tropical semiring, which usually is restricted to natural numbers but is extended to real numbers here, and another semiring which we have named the arctic semiring, since it is the opposite of the tropical semiring, taking maxima rather than minima.

The three derivation semirings can be used to find especially important values: the derivation forest semiring computes all derivations of a sentence, the Viterbi-derivation semiring computes the most probable derivation, and the Viterbi-n-best semiring computes the n best derivations. A derivation is simply a list of rules from the grammar. From a derivation, a parse tree can be derived, so the derivation forest semiring is analogous to conventional parse forests. Unlike the other semirings, all three of these semirings are noncommutative. The additive operation of these semirings is essentially union or maximum, while the multiplicative operation is essentially concatenation. These semirings are relatively complicated, and are described in more detail later in this chapter.
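To give a feel for how such a noncommutative semiring might look, here is a minimal Python sketch of one plausible rendering of the Viterbi-derivation semiring (my own assumption; the thesis's exact operator definitions appear later in the chapter and are not reproduced here). A value is a pair of a score and a set of derivations, where a derivation is a tuple of rules: ⊗ multiplies scores and concatenates derivations, and ⊕ keeps the higher-scoring argument.

# Value: (score, derivations), where derivations is a set of rule tuples.
VD_ZERO = (0.0, frozenset())
VD_ONE = (1.0, frozenset({()}))

def vd_times(x, y):
    """Multiply scores; concatenate every derivation of x with every derivation of y."""
    return (x[0] * y[0], frozenset(dx + dy for dx in x[1] for dy in y[1]))

def vd_plus(x, y):
    """Keep whichever argument has the higher score (on ties, keep both derivation sets)."""
    if x[0] > y[0]:
        return x
    if y[0] > x[0]:
        return y
    return (x[0], x[1] | y[1])

A rule value in this semiring would then be the pair (P(rule), {(rule,)}), so that evaluating the item-based description yields both the best score and the derivation(s) achieving it.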

Item-Based Description

A semiring parser requires an item-based description D of the parsing algorithm, in the form given earlier. So far we have skipped one important detail of semiring parsing. In a simple recognition system, as used in deduction systems, all that matters is whether an item can be deduced or not. Thus, in these simple systems, the order of processing items is relatively unimportant, as long as some simple constraints are met. On the other hand, for a semiring such as the inside semiring, there are important ordering constraints: for instance, we cannot compute the inside value of a CKY-style chart element until the inside values of all of its children have been computed.

Thus we need to impose an ordering on the items, in such a way that no item precedes any item on which it depends. We will associate with each item x a bucket B, and write bucket(x) = B. We order the buckets in such a way that if item y depends on item x, then bucket(x) ≤ bucket(y). We will write first, last, next(B), and previous(B) for the first, last, next, and previous buckets, respectively. For some pairs of items, it may be that both depend, directly or indirectly, on each other; we associate these items with special looping buckets, whose values may require infinite sums to compute. We will also call a bucket looping if an item in it depends on itself. The predicate loop(B) will be true for looping buckets B.

One way to achieve a bucketing with the required ordering constraints is to create a graph of the dependencies, with a node for each item and an edge from each item x to each item b that depends on it. We then separate the graph into its strongly connected components and perform a topological sort. Items forming singleton strongly connected components are in their own buckets; items forming non-singleton strongly connected components are together in looping buckets. A small sketch of this procedure appears below.
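The following is a minimal sketch of the bucketing procedure just described, assuming the item dependency graph is already available as a list of (x, b) edges; it leans on the networkx library for the strongly connected components (condensation) and the topological sort, and the function name is our own.

    import networkx as nx

    def bucket_items(items, dependencies):
        """Assign items to ordered buckets.

        `dependencies` is an iterable of (x, b) pairs meaning item b depends on
        item x.  Each strongly connected component becomes one bucket; a bucket
        is looping if it has more than one item, or an item depending on itself.
        """
        graph = nx.DiGraph()
        graph.add_nodes_from(items)
        graph.add_edges_from(dependencies)

        condensed = nx.condensation(graph)          # one node per SCC, acyclic
        buckets = []
        for scc in nx.topological_sort(condensed):  # dependencies come first
            members = list(condensed.nodes[scc]["members"])
            looping = len(members) > 1 or any(graph.has_edge(x, x) for x in members)
            buckets.append((looping, members))
        return buckets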

An actual example may help here. Consider an example grammar such as

S → CAC
C → c
B → A
A → B
A → a

and an input sentence cac, parsed with the Earley parser presented earlier. It will be possible to derive items such as C → •c through prediction, C → c• through scanning, and S → C•AC through completion. Each of these items would form a singleton strongly connected component and could be put into its own bucket. Now, using completion, we see that

$$\frac{B \to {\bullet}A \qquad A \to a{\bullet}}{B \to A{\bullet}}$$

Next comes the looping part. Notice that

$$\frac{A \to {\bullet}B \qquad B \to A{\bullet}}{A \to B{\bullet}}$$

and

$$\frac{B \to {\bullet}A \qquad A \to B{\bullet}}{B \to A{\bullet}}$$

Thus the items B → A• and A → B• can each be derived from the other and, since they depend on each other, will be placed together in a looping bucket.

A topological sort is not the only way to bucket the items. In particular, for items such that neither depends on the other, it is possible to place them into a bucket together with no loss of efficiency. For some descriptions, this could be used to produce faster, simpler parsing algorithms. For instance, in a CKY-style parser we could simply place all items of the same length in the same bucket, ordering buckets from shortest to longest, avoiding the need to perform a topological sort.

Later, when we discuss algorithms for interpreting an item-based description, we will need another concept. Of all the items associated with a bucket B, we will be able to find derivations for only a subset. If we can derive an item x associated with bucket B, we write x ∈ B and say that item x is in bucket B. For example, the goal item of a parser will almost always be associated with the last bucket; if the sentence is grammatical, the goal item will be in the last bucket, and if it is not grammatical, it won't be.

It will be useful to assume that there is a single, variable-free goal item, and that this goal item does not occur as a condition for any rules. We can always add a new goal item goal and a rule $\frac{\textit{old-goal}}{\textit{goal}}$, where old-goal is the goal item in the original description. We will assume in general that this transformation has been made or is not necessary.

The Grammar

A semiring parser also requires a grammar as input. We will need a list of rules in the grammar, and a function that gives the value for each rule in the grammar. This latter function will be semiring-specific. For instance, for computing the inside and Viterbi probabilities, the value of a grammar rule is just the conditional probability of that rule, or 0 if it is not in the grammar. For the boolean semiring, the value is TRUE if the rule is in the grammar, FALSE otherwise. For the counting semiring, the value is 1 if the rule is in the grammar, 0 otherwise. We call this function R(rule). This function replaces the set of rules R of a conventional grammar description: a rule is in the grammar if R(rule) is not the zero element of the semiring.

Conditions for Correct Processing

We will say that a semiring parser works correctly if, for any grammar, input, and semiring, the value of the input according to the grammar equals the value of the input using the parser. In this subsection we will define the value of an input according to the grammar and the value of an input using the parser, and give a sufficient condition for a semiring parser to work correctly.

From this point onwards, unless we specifically mention otherwise, we will assume that some fixed semiring, item-based description, and grammar have been given, without specifically mentioning which ones.

Value according to grammar

Under certain conditions, a semiring parser will work correctly for any grammar, input, and semiring. First we must define what we mean by working correctly. Essentially, we mean that the value of a sentence according to the grammar equals the value of the sentence using the parser.

Consider a derivation E consisting of grammar rules $e_1, e_2, \ldots, e_l$. We define the value of the derivation to be simply the product, in the semiring, of the values of the rules used in E:

$$V_G(E) = \bigotimes_{i=1}^{l} R(e_i)$$

Then we can define the value of a sentence that can be derived using grammar derivations $E^1, E^2, \ldots, E^k$ to be

$$V_G = \bigoplus_{j=1}^{k} V_G(E^j)$$

where k is potentially infinite. In other words, the value of the sentence according to the grammar is the sum of the values of all derivations. We will assume that in each grammar formalism there is some way to define derivations uniquely; for instance, in CFGs one way would be using leftmost derivations. For simplicity, we will simply refer to derivations rather than, e.g., leftmost derivations, since we are never interested in non-unique derivations.

Example of value according to grammar

A short example will help clarify. We consider the following grammar:

S → AA, with rule value R(S → AA)
A → AA, with rule value R(A → AA)
A → a, with rule value R(A → a)

and the input string aaa. There are two grammar derivations, the first of which is

S ⇒ AA ⇒ AAA ⇒ aAA ⇒ aaA ⇒ aaa,

using the rules S → AA, A → AA, A → a, A → a, A → a, which has value

$$R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a)$$

Notice that the rules in the value are the same rules, in the same order, as in the derivation. The other grammar derivation is

S ⇒ AA ⇒ aA ⇒ aAA ⇒ aaA ⇒ aaa,

using the rules S → AA, A → a, A → AA, A → a, A → a, which has value

$$R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)$$

We note that for commutative semirings the values of the two grammar derivations are equal, but for noncommutative semirings they differ.

The value of the sentence is the sum of the values of the two derivations:

$$R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a) \;\oplus\; R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)$$
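As a concrete illustration (not from the thesis), the following sketch evaluates this example in the inside semiring, with made-up rule probabilities; the sentence value is just the ordinary sum of the two derivation products.

    # Rule values for the inside semiring; the probabilities are invented.
    R = {("S", ("A", "A")): 1.0,
         ("A", ("A", "A")): 0.4,
         ("A", ("a",)):     0.6}

    derivation1 = [("S", ("A", "A")), ("A", ("A", "A")), ("A", ("a",)),
                   ("A", ("a",)), ("A", ("a",))]
    derivation2 = [("S", ("A", "A")), ("A", ("a",)), ("A", ("A", "A")),
                   ("A", ("a",)), ("A", ("a",))]

    def derivation_value(derivation):
        """Semiring product of the rule values used in the derivation."""
        value = 1.0                     # multiplicative identity of the inside semiring
        for rule in derivation:
            value *= R[rule]
        return value

    # Sentence value = semiring sum over all derivations.
    sentence_value = derivation_value(derivation1) + derivation_value(derivation2)
    print(sentence_value)               # 2 * 1.0 * 0.4 * 0.6**3 = 0.1728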

Item derivations

Next we must define item derivations, i.e., derivations using the item-based description of the parser. We will define item derivations in such a way that, for a correct parser description, there will be exactly one item derivation for each grammar derivation. The value of a sentence using the parser is the sum of the values of all item derivations of the goal item.

We say that

$$\frac{a_1 \cdots a_k}{b}\; c_1 \cdots c_l$$

is an instantiation of deduction rule

$$\frac{A_1 \cdots A_k}{B}\; C_1 \cdots C_l$$

whenever the first expression is a variable-free instance of the second, that is, the first expression is the result of consistently substituting constant terms for each variable in the second.

Now we can define an item derivation tree. Intuitively, an item derivation tree for x just gives a way of deducing x from ground items—items that don't depend on other items, i.e., items that can be deduced using rules that have no items among the $A_i$. We define an item derivation tree recursively. The base case is rules of the grammar: $\langle r\rangle$ is an item derivation tree, where r is a rule of the grammar. Also, if $D_{a_1}, \ldots, D_{a_k}, D_{c_1}, \ldots, D_{c_l}$ are derivation trees headed by $a_1, \ldots, a_k, c_1, \ldots, c_l$ respectively, and if $\frac{a_1\cdots a_k}{b}\;c_1\cdots c_l$ is the instantiation of a deduction rule, then $\langle b, D_{a_1}, \ldots, D_{a_k}\rangle$ is also a derivation tree. Notice that the $D_{c_1}, \ldots, D_{c_l}$ do not occur in this tree: they are side conditions, and although their existence is required (to show that $c_1 \cdots c_l$ can be derived), they do not contribute to the value of the tree. We will write $\frac{a_1\cdots a_k}{b}$ to indicate that there is an item derivation tree of the form $\langle b, D_{a_1}, \ldots, D_{a_k}\rangle$.

As mentioned earlier, we will write $x \in B$ if bucket(x) = B and there is an item derivation tree for x.

Example of item derivation

We can continue the example of parsing aaa, now using the item-based CKY parser given earlier. There are two item derivation trees for the goal item; we give the first as an example, displaying it as a tree rather than with angle-bracket notation, for simplicity. The figure below shows this tree and the corresponding grammar derivation.

[Figure: the grammar derivation S ⇒ AA ⇒ AAA ⇒ aAA ⇒ aaA ⇒ aaa; the corresponding grammar derivation tree; the corresponding item derivation tree, whose leaves are the rule values R(S → AA), R(A → AA), R(A → a), R(A → a), R(A → a); and the derivation value R(S → AA) ⊗ R(A → AA) ⊗ R(A → a) ⊗ R(A → a) ⊗ R(A → a). Caption: Grammar derivation, grammar derivation tree, item derivation tree, value.]

Notice that an item derivation is a tree, not a directed graph. Thus an item subderivation could occur multiple times in a given item derivation. This means that we can have a one-to-one correspondence between item derivations and grammar derivations: loops in the grammar lead to an infinite number of grammar derivations and an infinite number of corresponding item derivations.

A grammar including rules such as

S → AAA
A → B
A → a
B → A
B → ε

would allow derivations such as S ⇒ AAA ⇒ BAA ⇒ AA ⇒ BA ⇒ A ⇒ B ⇒ ε. Depending on the parser, the corresponding item derivation might include the exact same subderivation (using A → B) three times. Similarly, for a derivation such as A ⇒ B ⇒ A ⇒ B ⇒ A ⇒ a, we would have a corresponding item derivation tree that included multiple uses of the A → B and B → A rules.

Value of item derivation

The value V(D) of an item derivation D is the product of the values R(r) of its rules, in the same order that they appear in the item derivation tree. Since rules occur only in the leaves of item derivation trees, there is no ambiguity as to order in this definition. For an item derivation tree D with rules $d_1, d_2, \ldots, d_l$ as its leaves,

$$V(D) = \bigotimes_{i=1}^{l} R(d_i)$$

Alternatively, we can write this equation recursively as

$$V(D) = \begin{cases} R(D) & \text{if } D \text{ is a rule}\\ \bigotimes_{i=1}^{k} V(D_i) & \text{if } D = \langle b, D_1, \ldots, D_k\rangle \end{cases}$$

Continuing our example, the value of the item derivation tree in the figure above is

$$R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a),$$

the same as the value of the first grammar derivation. Notice that the recursive equation is just a recursive expression for the product of the rule values appearing in the leaves of the tree.

Let inner(x) represent the set of all item derivation trees headed by an item x. Then the value of x is the sum of the values of all item derivation trees headed by x. Formally,

$$V(x) = \bigoplus_{D \in inner(x)} V(D)$$

The value of a sentence is just the value of the goal item, V(goal).
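The recursive equation above translates almost directly into code. The sketch below is ours, not the thesis's: it assumes an item derivation tree is a nested list [item, child_1, ..., child_k] whose leaves are grammar rules (here written as plain strings), and uses a tiny semiring record like the one sketched earlier.

    from collections import namedtuple

    Semiring = namedtuple("Semiring", "plus times zero one")
    INSIDE = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

    def tree_value(tree, R, semiring):
        """Value of an item derivation tree, per the recursive equation above."""
        if not isinstance(tree, list):             # base case: the tree is a rule
            return R(tree)
        value = semiring.one
        for child in tree[1:]:                     # the head item itself carries no value
            value = semiring.times(value, tree_value(child, R, semiring))
        return value

    # The first item derivation tree for "aaa": each CKY node lists the rule
    # used and then the two child items (leaf nodes use only the lexical rule).
    R = {"S -> A A": 1.0, "A -> A A": 0.4, "A -> a": 0.6}.get
    tree = ["[S over aaa]", "S -> A A",
            ["[A over aa]", "A -> A A",
             ["[A over a]", "A -> a"], ["[A over a]", "A -> a"]],
            ["[A over a]", "A -> a"]]
    print(tree_value(tree, R, INSIDE))             # 1.0 * 0.4 * 0.6**3 = 0.0864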

Isovalued derivations

In certain cases, a particular grammar derivation and a particular item derivation will have the same value for any semiring and any rule value function R. In particular, if the same rules occur in the same order in both the grammar derivation and the item derivation, then their values will be the same no matter what. If a grammar derivation and an item derivation meet this condition, then we define a new term to describe them: isovalued. In the figure above, the grammar derivation and item derivation both have the rules R(S → AA), R(A → AA), R(A → a), R(A → a), R(A → a), and so they are isovalued.

In some cases, a grammar derivation and an item derivation will have the same value for any commutative semiring and any rule value function. If the same rules occur the same number of times in both the grammar derivation and the item derivation, then they will have the same value in any commutative semiring. We say that a grammar derivation and an item derivation meeting this condition are commutatively isovalued.

Isovalued and commutatively isovalued derivations will be important when we discuss conditions for correctness.

Example value of item derivation

Finishing our example, the value of the goal item, given our example sentence, is just the sum of the values of the two item-based derivations:

$$R(S \to AA) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a) \otimes R(A \to a) \;\oplus\; R(S \to AA) \otimes R(A \to a) \otimes R(A \to AA) \otimes R(A \to a) \otimes R(A \to a)$$

This value is the same as the value of the sentence according to the grammar.

Conditions for correctness

We can now specify the conditions for an item-based description to be correct.

Theorem. Given an item-based description D, if for every grammar G there exists a one-to-one correspondence between the item derivations using D and the grammar derivations, and the corresponding derivations are isovalued, then, for every complete semiring, the value of a given input $w_1 \cdots w_n$ is the same according to the grammar as the value of the goal item. If the semiring is commutative, then the corresponding derivations need only be commutatively isovalued.

Proof. The proof is very simple: essentially, each term in each sum occurs in the other. We separate the proof into two cases. First is the noncommutative case. In this case, by hypothesis, for a given input there are grammar derivations $E_1, \ldots, E_k$ (where k is possibly infinite) and corresponding isovalued item derivation trees $D_1, \ldots, D_k$ of the goal item. Since corresponding derivations are isovalued, for all i, $V(E_i) = V(D_i)$. Now, since the value of the string according to the grammar is just $\bigoplus_i V(E_i)$ and the value of the goal item is $\bigoplus_i V(D_i)$, the value of the string according to the grammar equals the value of the goal item.

The second case, the commutative case, follows analogously.

There is one additional condition for an item-based description to be usable in practice, which is that there be only a finite number of derivable items for a given input sentence; there may, however, be an infinite number of derivations of any item.

The derivation semirings

All of the semirings we use should be familiar, except for the derivation semirings, which we now describe. These semirings, unlike the other semirings described earlier, are not commutative under their multiplicative operator, concatenation.

In many parsers it is conventional to compute parse forests, compact representations of the set of trees consistent with the input. We will use a related concept, derivation forests: a compact representation of the set of derivations consistent with the input, which corresponds to the parse forest for CNF grammars but is easily extended to other formalisms. Although the terminology we use is different, the representation of derivation forests is similar to that used by Billot and Lang.

Often we will not be interested in the set of all derivations, but only in the most probable derivation. The Viterbi-derivation semiring computes this value. Alternatively, we might want the n best derivations, which would be useful if the output of the parser were passed to another stage, such as semantic disambiguation; this value is computed by the Viterbi-n-best derivation semiring.

Notice that each of the derivation semirings can also be used to create transducers. That is, we simply associate strings, rather than grammar rules, with each rule value. Instead of grammar rule concatenation, we perform string concatenation. The derivation semiring then corresponds to nondeterministic transductions, the Viterbi semiring corresponds to a weighted or probabilistic transducer, and the inside semiring could be used to, for instance, perform re-estimation of probabilistic transducers.

Derivation Forest

The derivation forest semiring consists of sets of derivations, where a derivation is a list of rules of the grammar. In the CFG case, these rules would form, for instance, a leftmost derivation. The additive operator produces a union of derivations, and the multiplicative operator produces the concatenation of one derivation with the next. The concatenation operation is defined on both derivations and sets of derivations; when applied to a set of derivations, it produces the set of pairwise concatenations. The simplest derivations are simply rules of the grammar, such as X → YZ for a CFG. Sets containing one rule, such as {⟨X → YZ⟩}, constitute the primitive elements of the semiring.

(The derivation forest semiring is equivalent to a semiring well known to mathematicians: the polynomials over noncommuting variables. Such a polynomial is a sum of terms, each of which is an ordered product of variables. If these variables correspond to the basic elements of this semiring, then each term in the polynomial corresponds to a derivation.)

A few examples may help. The additive identity, the zero element, is simply the empty set: union with the empty set is an identity operation.

    Function Concatenate(f, g)
        return ⟨Concatenate, f, g⟩

    Function Union(f, g)
        return ⟨Union, f, g⟩

    Function Create(rule)
        return ⟨Create, rule⟩

    Function Extract(f)
        switch f
            case ⟨Concatenate, f1, f2⟩:
                return Concat(Extract(f1), Extract(f2))
            case ⟨Union, f1, f2⟩:
                return choose(Extract(f1), Extract(f2))
            case ⟨Create, rule⟩:
                return ⟨rule⟩

Figure: Derivation forest implementation

The multiplicative identity is the set containing the empty derivation, {⟨⟩}: concatenation with the empty derivation is an identity operation. Derivations need not be complete. For instance, assuming we are using leftmost derivations with CFGs, {⟨X → YZ, Y → y⟩} is a valid element, as is {⟨Y → y, X → x⟩}. In fact, {⟨X → AB, B → b⟩} is a valid element of the semiring, even though it could not occur in a valid grammar derivation; this value should never occur in a correctly functioning parser.

The obvious implementation of derivation forests, as actual sets of derivations, would be extremely inefficient. In the worst case, when we allow infinite unions—a case we will wish to consider—the obvious implementation does not work at all. However, later in this chapter we will show how to use pointers to efficiently implement infinite unions of derivation forests, in a manner analogous to the traditional implementation of parse forests.

We can now describe a simple, efficient implementation of the derivation forest semiring. We will assume that four operations are desired: concatenation, union, primitive creation, and extraction. Primitive creation is used to create the basic elements of the semiring, sets containing a single rule. Extraction nondeterministically extracts a single derivation from the derivation forest. Code for these four functions is given in the figure above. All of the code is straightforward. We assume a function choose, needed to handle unions in the extract function, that nondeterministically chooses between its inputs.

A straightforward implementation of this algorithm would work fine, but slight variations are required for good efficiency. The problem comes from the concatenation operation. Typically, in LISP-like languages, concatenation is implemented as a copy operation. If we were to build up a derivation of length n one rule at a time, then the run time would be O(n²), since we would first copy one element, then two, then three, etc., resulting in O(n²) rules being copied. To get good efficiency, we need to implement concatenation destructively: we assume that lists (indicated with angle brackets) are implemented as linked lists with a pointer to the last element. For those skilled in Prolog, this implementation of linked lists is essentially equivalent to difference lists. List creation can be performed efficiently using nondestructive operations, and destructive concatenation can operate in constant time. Given this destructive operation, any interpreter of this nondeterministic algorithm would need to keep a list of destructive changes to undo during backtracking, as is done in many implementations of unification grammars, or of unification in Prolog.

This modified implementation is efficient in the following senses. First, concatenation and union are constant-time operations. Second, if we were to use Extract with a nondeterministic interpreter to generate all derivations in the derivation forest, the time used would be at worst proportional to the total size of all trees generated.
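The following sketch shows the same four operations in runnable form, under the simplifying assumptions that forests are small shared nodes (so Union and Concatenate are constant time) and that Extract resolves the nondeterministic choose by random choice; it uses ordinary list concatenation in Extract rather than the destructive linked-list trick described above.

    import random

    def create(rule):        return ("create", rule)
    def union(f, g):         return ("union", f, g)
    def concatenate(f, g):   return ("concat", f, g)

    def extract(forest):
        """Nondeterministically extract one derivation (a list of rules)."""
        tag = forest[0]
        if tag == "create":
            return [forest[1]]
        if tag == "concat":
            return extract(forest[1]) + extract(forest[2])
        if tag == "union":
            return extract(random.choice([forest[1], forest[2]]))
        raise ValueError(tag)

    # Example: the forest containing <X -> Y Z, Y -> y> and <X -> x>.
    forest = union(concatenate(create("X -> Y Z"), create("Y -> y")),
                   create("X -> x"))
    print(extract(forest))   # one of the two derivations, chosen at random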

Billot and Lang show how to create grammars to represent parse forests. Readers familiar with their work will recognize the similarity between their representation and ours. Where we write

e = Concatenate(f, g)

Billot et al. write

e → f g

Where we write

e = Union(f, g)

they write

e → f
e → g

Viterbi-derivation Semiring

The Viterbi-derivation semiring computes the most probable derivation of the sentence, given a probabilistic grammar. Elements of this semiring are a pair of a real number v and a derivation forest E, i.e., the set of derivations with score v. We define $\max_{Vit}$, the additive operator, as

$$\max_{Vit}(\langle v, E\rangle, \langle w, D\rangle) = \begin{cases} \langle v, E\rangle & \text{if } v > w\\ \langle w, D\rangle & \text{if } v < w\\ \langle v, E \cup D\rangle & \text{if } v = w\end{cases}$$

In typical practical Viterbi parsers, when two derivations have the same value, one of the derivations is arbitrarily chosen. In practice this is usually a fine solution, and one that could be used in a real-world implementation of the ideas in this chapter, but from a theoretical viewpoint the arbitrary choice destroys the associative property of the additive operator $\max_{Vit}$. To preserve associativity, we keep derivation forests of all elements that tie for best. An alternate technique for preserving associativity would be to choose between derivations using some ordering, but the derivation forest solution simplifies the discussion in Appendix C.

The definition of $\max_{Vit}$ above covers only two elements. Since the operator is associative, it is clear how to define $\max_{Vit}$ for any finite number of elements, but we also need infinite summations to be defined. We require an operator sup, the supremum, for this definition. The supremum of a set is the smallest value at least as large as all elements of the set; that is, it is a maximum that is defined in the infinite case.

We can now define $\max_{Vit}$ for the case of infinite sums. First, let

$$w = \sup_{\langle v, E\rangle \in X} v$$

Then let

$$D = \bigcup \{E \mid \langle w, E\rangle \in X\}$$

Then $\max_{Vit} X = \langle w, D\rangle$. In the finite case, this is equivalent to our original definition. In the infinite case, D is potentially empty, but this causes us no problems in theory, and infinite sums with an empty D will not appear in practice.

We define the multiplicative operator $\times_{Vit}$ as

$$\langle v, E\rangle \times_{Vit} \langle w, D\rangle = \langle v \times w,\; E \cdot D\rangle$$

where $E \cdot D$ represents the concatenation of the two derivation forests.
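A minimal sketch of these two operators (ours, not the thesis's), representing an element as a (score, set-of-derivations) pair rather than the shared forest representation above; the toy rules are invented.

    def max_vit(a, b):
        """Additive operator: keep the higher-scoring derivations, merging ties."""
        (v, E), (w, D) = a, b
        if v > w:
            return (v, E)
        if v < w:
            return (w, D)
        return (v, E | D)                     # tie: keep both derivation sets

    def times_vit(a, b):
        """Multiplicative operator: multiply scores, concatenate derivations pairwise."""
        (v, E), (w, D) = a, b
        return (v * w, {e + d for e in E for d in D})

    # Example: two ways of building the same item, tying at score 0.2.
    x = times_vit((0.5, {("A -> a",)}), (0.4, {("B -> b",)}))
    y = (0.2, {("S -> ab",)})
    print(max_vit(x, y))    # (0.2, {('A -> a', 'B -> b'), ('S -> ab',)})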

Viterbi-n-best semiring

The last kind of derivation semiring is the Viterbi-n-best semiring, which is used for constructing n-best lists. Intuitively, the value of a string using this semiring will be the n most likely derivations of that string, unless there are fewer than n total derivations. Furthermore, in a practical implementation, this is actually how a Viterbi-n-best semiring would typically be implemented. From a theoretical viewpoint, however, this implementation is inadequate, since we must also define infinite sums and be sure that the distributive property holds. Thus we introduce two complications. First, when not only are there more than n total derivations but there is a tie for the nth most likely, there will be more than n entries, which we can represent efficiently with a derivation forest. Second, in order to make infinite sums well defined, it will be useful to have an additional value, ∞, counted as a legal derivation. The value ∞ will arise due to infinite sums of elements approaching a supremum; thus we will want to consider ∞ to represent an infinite number of values approaching the derivation value.

The best way to define the Viterbi-n-best semiring is as a homomorphism from a simpler semiring, the Viterbi-all semiring. The Viterbi-all semiring keeps all derivations and their values. The additive operator is set union, and the multiplicative operator is defined as

$$X \cdot Y = \{\langle v \times w,\; d \cdot e\rangle \mid \langle v, d\rangle \in X,\ \langle w, e\rangle \in Y\}$$

Then the Viterbi-all semiring is

$$\langle\, 2^{[0,1] \times E},\ \cup,\ \cdot,\ \emptyset,\ \{\langle 1, \langle\rangle\rangle\}\,\rangle$$

Now we can define a helper function we will need for the homomorphism to the Viterbi-n-best semiring. We define $\text{simpletop}_n$, which returns the n highest-valued elements. Ties for last are kept; this property will make the additive operator commutative and associative.

$$\text{simpletop}_n(X) = \{\langle v, d\rangle \in X \mid \text{there are at most } n-1 \text{ items } \langle w, e\rangle \in X \text{ s.t. } w > v\}$$

We can now define a function $\text{top}_n$, which will provide the homomorphism. Like $\text{simpletop}_n$, $\text{top}_n$ returns the n highest-valued elements, keeping ties. In addition, if there is an infinite number of elements approaching a supremum, $\text{top}_n$ returns a special element whose value is the supremum and whose derivation is the symbol ∞:

$$\text{top}_n(X) = \begin{cases} \text{simpletop}_n(X) & \text{if } |\text{simpletop}_n(X)| \ge n \text{ or } X = \text{simpletop}_n(X)\\[4pt] \text{simpletop}_n(X) \cup \bigl\{\langle \sup_{v : \langle v, d\rangle \in X - \text{simpletop}_n(X)} v,\ \infty\rangle\bigr\} & \text{otherwise}\end{cases}$$

We can now define the Viterbi-n-best semiring as a homomorphism from the Viterbi-all semiring. In particular, we define the elements of the semiring to be $\{\text{top}_n(X) \mid X \in 2^{[0,1]\times E}\}$. Because of this definition, for every A, B in the Viterbi-n-best semiring there are some X, Y such that $\text{top}_n(X) = A$ and $\text{top}_n(Y) = B$. We can then define

$$\max_{Vit\text{-}n}(A, B) = C$$

where $C = \text{top}_n(X \cup Y)$ for some X, Y such that $\text{top}_n(X) = A$ and $\text{top}_n(Y) = B$. In Appendix A we prove that C is uniquely defined by this relationship. Similarly, we define the multiplicative operator $\times_{Vit\text{-}n}$ to be

$$A \times_{Vit\text{-}n} B = C$$

where $C = \text{top}_n(X \cdot Y)$ for some X, Y such that $\text{top}_n(X) = A$ and $\text{top}_n(Y) = B$, and again prove the uniqueness of the relationship in the appendix. Also in the appendix, we prove that these operators do indeed form a continuous semiring.

Efficient Computation of Item Values

Recall that the value of an item x is just $V(x) = \bigoplus_{D \in inner(x)} V(D)$. This definition may require summing over exponentially many, or even infinitely many, terms. In this section we give relatively efficient formulas for computing the values of items. There are three cases that must be handled. First is the base case: if x is a rule, then

$$V(x) = \bigoplus_{D \in inner(x)} V(D) = \bigoplus_{D \in \{\langle x\rangle\}} V(D) = V(\langle x\rangle) = R(x)$$

The second and third cases occur when x is an item. Recall that each item is associated with a bucket, and that the buckets are ordered. Each item x is either associated with a non-looping bucket, in which case its value depends only on the values of items in earlier buckets, or with a looping bucket, in which case its value depends potentially on the values of other items in the same bucket. In the second case, when the item is associated with a non-looping bucket, and if we compute items in the same order as their buckets, we can assume that the values of items $a_1, \ldots, a_k$ contributing to the value of item b are known. We give a formula for computing the value of item b that depends only on the values of items in earlier buckets.

For the third case, in which x is associated with a looping bucket, infinite loops may occur, when the values of two items in the same bucket are mutually dependent, or an item depends on its own value. These infinite loops may require computation of infinite sums. Still, we can express these infinite sums in a relatively simple form, allowing them to be efficiently computed or approximated.

Item Value Formula

Theorem. If an item x is not in a looping bucket, then

$$V(x) = \bigoplus_{a_1 \cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} V(a_i)$$

Proof. Let us expand our notion of inner to include deduction rules: $inner\!\left(\frac{a_1\cdots a_k}{b}\right)$ is the set of all derivation trees of the form $\langle b, \langle a_1 \ldots\rangle, \langle a_2 \ldots\rangle, \ldots, \langle a_k \ldots\rangle\rangle$. For any item derivation tree D of x that is not a simple rule, there are some $a_1 \cdots a_k$ such that $\frac{a_1\cdots a_k}{x}$ and $D \in inner\!\left(\frac{a_1\cdots a_k}{x}\right)$. Thus, for any item x,

$$V(x) = \bigoplus_{D \in inner(x)} V(D) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}}\;\; \bigoplus_{D \in inner\!\left(\frac{a_1\cdots a_k}{x}\right)} V(D)$$

Consider item derivation trees $D_{a_1}, \ldots, D_{a_k}$ headed by items $a_1, \ldots, a_k$ such that $\frac{a_1\cdots a_k}{x}$. Recall that $\langle x, D_{a_1}, \ldots, D_{a_k}\rangle$ is the item derivation tree formed by combining each of these trees into a full tree, and notice that

$$inner\!\left(\frac{a_1\cdots a_k}{x}\right) = \bigl\{\langle x, D_{a_1}, \ldots, D_{a_k}\rangle \;\bigm|\; D_{a_1} \in inner(a_1), \ldots, D_{a_k} \in inner(a_k)\bigr\}$$

Therefore,

$$\bigoplus_{D \in inner\!\left(\frac{a_1\cdots a_k}{x}\right)} V(D) = \bigoplus_{D_{a_1} \in inner(a_1)} \cdots \bigoplus_{D_{a_k} \in inner(a_k)} V(\langle x, D_{a_1}, \ldots, D_{a_k}\rangle)$$

Also notice that $V(\langle x, D_{a_1}, \ldots, D_{a_k}\rangle) = \bigotimes_{i=1}^{k} V(D_{a_i})$. Thus,

$$\bigoplus_{D_{a_1} \in inner(a_1)} \cdots \bigoplus_{D_{a_k} \in inner(a_k)} V(\langle x, D_{a_1}, \ldots, D_{a_k}\rangle) = \bigoplus_{D_{a_1} \in inner(a_1)} \cdots \bigoplus_{D_{a_k} \in inner(a_k)} \;\bigotimes_{i=1}^{k} V(D_{a_i})$$

Since for all semirings both operations are associative, and multiplication distributes over addition, we can rearrange summations and products:

$$\bigoplus_{D_{a_1} \in inner(a_1)} \cdots \bigoplus_{D_{a_k} \in inner(a_k)} \;\bigotimes_{i=1}^{k} V(D_{a_i}) = \bigotimes_{i=1}^{k} \;\bigoplus_{D_{a_i} \in inner(a_i)} V(D_{a_i}) = \bigotimes_{i=1}^{k} V(a_i)$$

Substituting this back into the first equation, we get

$$V(x) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} V(a_i)$$

completing the proof.
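The theorem gives a direct recipe for non-looping buckets: visit buckets in order and evaluate the sum-of-products. The sketch below is ours; it assumes the instantiated deduction rules are available as a map from each item to its lists of antecedents, and that every antecedent is either a grammar rule or an item from an earlier bucket.

    def compute_values(buckets, instantiations, rule_value, semiring):
        """Item value formula for non-looping buckets.

        `buckets` lists items in a dependency-respecting order;
        `instantiations[x]` is a list of antecedent tuples (a_1, ..., a_k);
        `rule_value` gives R(a) for antecedents that are grammar rules.
        """
        V = {}
        for bucket in buckets:
            for x in bucket:
                total = semiring.zero
                for antecedents in instantiations.get(x, []):
                    prod = semiring.one
                    for a in antecedents:
                        # items were computed in earlier buckets; anything else
                        # is assumed to be a grammar rule with value R(a)
                        prod = semiring.times(prod, V[a] if a in V else rule_value(a))
                    total = semiring.plus(total, prod)
                V[x] = total
        return V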

Now we address the case in which x is an item in a looping bucket. This case requires computation of an infinite sum. We will write out this infinite sum and discuss how to compute it, exactly in all cases except for one, where we approximate it.

Consider the derivable items $x_1, \ldots, x_m$ in some looping bucket B. If we build up derivation trees incrementally, when we begin processing bucket B, only those trees with no items from bucket B will be available—what we will call 0th-generation derivation trees. We can put these 0th-generation trees together to form first-generation trees headed by elements in B. We can combine these first-generation trees with each other and with 0th-generation trees to form second-generation trees, and so on. Formally, we define the generation of a derivation tree headed by x in bucket B to be the largest number of items in B we can encounter on a path from the root to a leaf. It will be convenient to define the generation of the derivation of any item x in a bucket preceding B to be 0; generation 0 will not contain derivations of any items in bucket B.

Consider the set of all trees of generation at most g headed by x. Call this set $inner_g(x, B)$. We can define the gth-generation value of an item x in bucket B, $V_g(x, B)$:

$$V_g(x, B) = \bigoplus_{D \in inner_g(x, B)} V(D)$$

Intuitively, as g increases, for $x \in B$, $inner_g(x, B)$ becomes closer and closer to inner(x). Thus, intuitively, the finite sum of values over the former approaches the infinite sum of values over the latter. For continuous semirings, which includes all of the semirings considered in this chapter, an infinite sum is equal to the supremum of the partial sums (Kuich). Thus

$$V(x) = \bigoplus_{D \in inner(x)} V(D) = \sup_g V_g(x, B)$$

It will be easier to compute the supremum if we find a simple formula for $V_g(x, B)$. Notice that for items x in buckets preceding B, since these items are all included in generation 0,

$$V_g(x, B) = V(x)$$

Also notice that for items $x \in B$ there will be no generation-0 derivations, so

$$V_0(x, B) = 0$$

Thus generation 0 makes a trivial base for a recursive formula. Now we can consider the general case.

Theorem. For x an item in a looping bucket B, and for $g \ge 1$,

$$V_g(x, B) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} \begin{cases} V(a_i) & \text{if } a_i \notin B\\ V_{g-1}(a_i, B) & \text{if } a_i \in B\end{cases}$$

The proof parallels that of the preceding theorem, and is given in Appendix A.

Solving the Infinite Summation

A formula for $V_g(x, B)$ is useful, but what we really need is specific techniques for computing the supremum $V(x) = \sup_g V_g(x, B)$.

For all continuous semirings, the supremum of iteratively approximating the value of a set of polynomial equations, as we are essentially doing in the generation formula above, is equal to the smallest solution to the equations (Kuich). In particular, consider the equations

$$V(x, B) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} \begin{cases} V(a_i) & \text{if } a_i \notin B\\ V(a_i, B) & \text{if } a_i \in B\end{cases}$$

where V(x, B) can be thought of as indicating |B| different variables, one for each item x in the looping bucket B. The generation formula represents the iterative approximation of this equation, and therefore the smallest solution to this equation represents the supremum of that one.

One fact will be useful for several semirings: whenever the values of all items $x \in B$ at generation g are the same as the values of all items in the preceding generation g−1, they will be the same at all succeeding generations as well. Thus the value at generation g will be the value of the supremum. The proof of this fact is trivial: we substitute the value of $V_{g-1}(x, B)$ for $V_g(x, B)$ to show that $V_{g+1}(x, B) = V_g(x, B)$:

$$V_{g+1}(x, B) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} \begin{cases} V(a_i) & \text{if } a_i \notin B\\ V_g(a_i, B) & \text{if } a_i \in B\end{cases} = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} \begin{cases} V(a_i) & \text{if } a_i \notin B\\ V_{g-1}(a_i, B) & \text{if } a_i \in B\end{cases} = V_g(x, B)$$

Now we can consider various semiring-specific algorithms for computing the supremum. We first examine the simplest case, the boolean semiring (booleans under ∨ and ∧). Notice that whenever a particular item has value TRUE at generation g, it must also have value TRUE at generation g+1, since if the item can be derived in at most g generations, then it can certainly be derived in at most g+1 generations. Thus, since the number of TRUE-valued items is nondecreasing and is at most |B|, eventually the values of all items must not change from one generation to the next. Therefore, for the boolean semiring, a simple algorithm suffices: keep computing successive generations until no change is detected in some generation; the result is the supremum. We can perform this computation efficiently if we keep track of items that change value in generation g and only examine items that depend on them in generation g+1. This algorithm is then similar to the algorithm of Shieber et al.
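A minimal sketch of this fixpoint computation for the boolean semiring, assuming the antecedent tuples for each in-bucket item are available and that antecedents outside the bucket (including grammar rules, which can simply be entered as TRUE) already have values in V.

    def solve_boolean_loop(bucket, instantiations, V):
        """Iterate generations over a looping bucket until no value changes."""
        for x in bucket:
            V.setdefault(x, False)
        changed = True
        while changed:                       # at most |bucket| useful iterations
            changed = False
            for x in bucket:
                new = V[x] or any(all(V.get(a, False) for a in antecedents)
                                  for antecedents in instantiations.get(x, []))
                if new != V[x]:
                    V[x] = new
                    changed = True
        return V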

For the next three semirings—the counting semiring, the Viterbi semiring, and the derivation forest semiring—we need the concept of a derivation subgraph. In the bucketing discussion, we considered the strongly connected components of the dependency graph, consisting of items that, for some sentence, could possibly depend on each other, and we put these possibly interdependent items together in looping buckets. For a given sentence and grammar, not all items will have derivations. We will find the subgraph of the dependency graph of items with derivations, and compute the strongly connected components of this subgraph. The strongly connected components of this subgraph correspond to loops that actually occur, given the sentence and the grammar, as opposed to loops that might occur for some sentence and grammar given the parser alone. We call this subgraph the derivation subgraph, and we will say that items in a strongly connected component of the derivation subgraph are part of a loop.

Now we can discuss the counting semiring (counts under + and ×). In the counting semiring, for each item there are three cases: the item can be in a loop, the item can depend directly or indirectly on an item in a loop, or the item does not depend on loops. If the item is in a loop or depends on a loop, its value is infinite. If the item does not depend on a loop in the current bucket, then its value becomes fixed after some generation. We can now give the algorithm. First, compute successive generations until the set of items in B does not change from one generation to the next. Next, compute the derivation subgraph and its strongly connected components. Items in a strongly connected component (a loop) have an infinite number of derivations, and thus an infinite value. Compute items that depend, directly or indirectly, on items in loops; these items also have infinite value. Any other items can only be derived in finitely many ways using items in the current bucket, so compute successive generations until the values of these items do not change.

The arctic semiring (real numbers under max and +) is analogous to the counting semiring: loops lead to infinite values. Thus we can use the same algorithm.

The method for solving the infinite summation for the derivation forest semiring depends on the implementation of derivation forests. Essentially, that representation will use pointers to efficiently represent derivation forests. Pointers, in various forms, allow one to efficiently represent infinite circular references, either directly (Goodman a) or indirectly (Goodman b).

The algorithm we use is to compute the derivation subgraph, and then create pointers analogous to the directed edges in the derivation subgraph, including pointers in loops whenever there is a loop in the derivation subgraph, corresponding to an infinite number of derivations. For the representation described earlier, we perform two steps. First, for each item $x \in B$ we will set

$$V(x) = \langle \text{Union}, x_1, \langle \text{Union}, x_2, \langle \text{Union}, x_3, \langle \cdots \rangle\rangle\rangle\rangle$$

where there is one $x_i$ for each instantiation of a rule $\frac{a_1\cdots a_k}{x}$. Next, for each such instantiation, we will create a value

$$\langle \text{Concatenate}, V(a_1), \langle \text{Concatenate}, V(a_2), \langle \cdots \langle \text{Concatenate}, V(a_{k-1}), V(a_k)\rangle \cdots\rangle\rangle\rangle$$

and destructively change the appropriate $x_i$ to this value. In a LISP-like language, this will create the appropriate circular pointers. As in the finite case, this representation is equivalent to that of Billot and Lang.

For the Viterbi semiring, the algorithm is analogous to the boolean case. Derivations using loops in these semirings will always have values no higher than derivations not using loops, since the value with the loop will be the same as some value without the loop, multiplied by some set of rule probabilities that are at most 1. Thus loops do not change values. Therefore we can simply compute successive generations until values fail to change from one iteration to the next. The tropical semiring is analogous: again, loops do not change values, so we can compute successive generations until values don't change.

Now consider implementations of the Viterbi-derivation semiring in practice, in which we keep only a representative derivation rather than the whole derivation forest. In this case loops do not change values, and we use the same algorithm as for the Viterbi semiring. On the other hand, for a theoretically correct implementation, loops with value 1 can lead to an infinite number of derivations, in the same way they did for the derivation forest semiring. Thus, for a theoretically correct implementation, we must use the same techniques we used for the derivation forest semiring. In an implementation of the Viterbi-n-best semiring in practice, loops can change values, but at most n times, so the same algorithm used for the Viterbi semiring still works. Again, in a theoretical implementation, we need to use the same mechanism as in the derivation forest semiring.

The last semiring we consider is the inside semiring. This semiring is the most difficult. There are two cases of interest, one of which we can solve exactly, and the other of which requires approximations. In many cases involving looping buckets, all deduction rules will be of the form

$$\frac{a \quad x}{b}$$

where a and b are items in the looping bucket and x is either a rule or an item in a previously computed bucket. This case corresponds to the items used for deducing singleton productions, such as Earley's algorithm uses for rules such as A → B and B → A. In this case, the equation above forms a set of linear equations that can be solved by matrix inversion. In the more general case, as is likely to happen with epsilon rules, we get a set of nonlinear equations, and must solve them by approximation techniques, such as simply computing successive generations for many iterations. (Note that even in the case where we can only use approximation techniques, this algorithm is relatively efficient: by assumption, in this case there is at least one deduction rule with two items in the current generation, so the number of deduction trees over which we are summing grows exponentially with the number of generations; a linear amount of computation yields the sum of the values of exponentially many trees.) Stolcke provides an excellent discussion of these cases, including a discussion of sparse matrix inversion, useful for speeding up some computations.
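For the linear case just described, the generation formula has the form V = MV + c over the in-bucket items, so the supremum is the least solution of (I − M)V = c. A small numpy sketch, with a made-up two-item loop (say, the items for A → B• and B → A•, with unary-rule probabilities):

    import numpy as np

    # Each item is derived from the other, scaled by a rule probability, plus a
    # contribution c_i from earlier buckets:  V = M @ V + c.
    M = np.array([[0.0, 0.3],      # x1 <- 0.3 * x2
                  [0.5, 0.0]])     # x2 <- 0.5 * x1
    c = np.array([0.2, 0.1])       # contributions from earlier buckets

    V = np.linalg.solve(np.eye(2) - M, c)
    print(V)                       # exact inside values of the two looping items

    # Sanity check: iterating the generations converges to the same values.
    approx = np.zeros(2)
    for _ in range(50):
        approx = M @ approx + c
    print(approx)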

Reverse Values

The previous section showed how to compute several of the most commonly used values for parsers, including boolean, inside, Viterbi, counting, and derivation forest values, among others. Noticeably absent from the list are the outside probabilities. In general, computing outside probabilities is significantly more complicated than computing inside probabilities.

In this section we show how to compute outside probabilities from the same item-based descriptions used for computing inside values; outside probabilities have many uses, described below.

    for each length l, longest to shortest
        for each start s
            for each split length t
                for each rule A → BC ∈ R
                    outside(s, B, s+t) := outside(s, B, s+t) +
                        outside(s, A, s+l) × inside(s+t, C, s+l) × P(A → BC);
                    outside(s+t, C, s+l) := outside(s+t, C, s+l) +
                        outside(s, A, s+l) × inside(s, B, s+t) × P(A → BC);

Figure: Outside algorithm
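The following is a minimal runnable sketch of the inside and outside recurrences for a PCFG in Chomsky Normal Form, following the loop structure of the figure above but indexing items by word positions, with inside[(i, A, j)] covering words i through j−1; the toy grammar reuses the made-up probabilities from the earlier aaa example.

    from collections import defaultdict

    # Toy CNF grammar: binary rules (A, B, C) -> p and lexical rules (A, w) -> p.
    binary = {("S", "A", "A"): 1.0, ("A", "A", "A"): 0.4}
    lexical = {("A", "a"): 0.6}

    def inside_outside(words):
        n = len(words)
        inside = defaultdict(float)
        outside = defaultdict(float)

        for i, w in enumerate(words):                       # spans of length 1
            for (A, word), p in lexical.items():
                if word == w:
                    inside[(i, A, i + 1)] += p
        for length in range(2, n + 1):                      # longer spans, bottom-up
            for i in range(n - length + 1):
                j = i + length
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        inside[(i, A, j)] += p * inside[(i, B, k)] * inside[(k, C, j)]

        outside[(0, "S", n)] = 1.0                          # start symbol, whole input
        for length in range(n, 1, -1):                      # longest to shortest
            for i in range(n - length + 1):
                j = i + length
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        out = outside[(i, A, j)] * p
                        outside[(i, B, k)] += out * inside[(k, C, j)]
                        outside[(k, C, j)] += out * inside[(i, B, k)]
        return inside, outside

    inside, outside = inside_outside(list("aaa"))
    print(inside[(0, "S", 3)])                        # sentence probability, ~0.1728
    # inside * outside sums P(D) over derivations using the item; ~0.0864 here:
    print(inside[(0, "A", 2)] * outside[(0, "A", 2)])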

The uses of outside probabilities include re-estimating grammar probabilities (Baker), improving parser performance on some criteria, speeding parsing in some formalisms such as Data-Oriented Parsing, and good thresholding algorithms (each the subject of a later chapter of this thesis).

We will show that, by substituting other semirings, we can get values analogous to the outside probabilities for any commutative semiring, and, in the following subsection and Appendix C, that we can get similar values for many noncommutative semirings as well. We will refer to these analogous quantities as reverse values. For instance, the quantity analogous to the outside value for the Viterbi semiring will be called the reverse Viterbi value. Notice that the inside-semiring values of a Hidden Markov Model (HMM) correspond to the forward values of HMMs, and the reverse inside values of an HMM correspond to the backward values.

Compare the outside algorithm (Baker; Lari and Young), given in the figure above, to the inside algorithm given earlier. Notice that while the inside and recognition algorithms were very similar, the outside algorithm is quite a bit different. In particular, while the inside and recognition algorithms looped over items from shortest to longest, the outside algorithm loops over items in the reverse order, from longest to shortest. Also, compare the inside algorithm's main loop formula to the outside algorithm's main loop formula. While there is clearly a relationship between the two equations, the exact pattern of the relationship is not obvious. Notice that the outside formula is about twice as complicated as the inside formula. This doubled complexity is typical of outside formulas, and partially explains why the item-based description format is so useful: descriptions for the simpler inside values can be developed with relative ease, and then automatically transformed to compute the twice-as-complicated outside values.

For a context-free grammar, using the CKY parser given earlier, recall that the inside probability for an item [i, A, j] is $P(A \Rightarrow^* w_i \cdots w_j)$. The outside probability for the same item is $P(S \Rightarrow^* w_1 \cdots w_{i-1}\, A\, w_{j+1} \cdots w_n)$. Thus the outside probability has the property that, when multiplied by the inside probability, it gives the probability that the start symbol generates the sentence using the given item, $P(S \Rightarrow^* w_1 \cdots w_{i-1}\, A\, w_{j+1} \cdots w_n \Rightarrow^* w_1 \cdots w_n)$. This probability equals the sum of the probabilities of all derivations using the given item. Formally, letting P(D) represent the probability of a particular derivation, and C(D, [i, X, j]) represent the number of occurrences of item [i, X, j] in derivation D (which for some parsers could be more than one, if X were part of a loop),

$$\sum_{D \text{ a derivation}} P(D)\; C(D, [i, X, j]) = \text{inside}(i, X, j) \times \text{outside}(i, X, j)$$

[Figure: on the left, an item derivation of the goal item; on the right, an outer tree of an item b, i.e., the derivation of the goal with the subtree headed by b (whose children are $a_1, a_2, \ldots, a_{k-1}, a_k$) removed and its spot labelled (b). Caption: Goal tree, outer tree.]

The reverse values in general have an analogous meaning. Let C(D, x) represent the number of occurrences (the count) of item x in item derivation tree D. Then, for an item x, the reverse value Z(x) should have the property

$$\bigoplus_{D \text{ a derivation}} V(D) \cdot C(D, x) = V(x) \otimes Z(x)$$

Notice that we have multiplied an element of the semiring, V(D), by an integer, C(D, x). This multiplication is meant to indicate repeated addition, using the additive operator of the semiring. Thus, for instance, in the Viterbi semiring, multiplying by a count other than 0 has no effect, since $x \oplus x = \max(x, x) = x$, while in the inside semiring it corresponds to actual multiplication. This value represents the sum of the values of all derivation trees that the item x occurs in; if an item x occurs more than once in a derivation tree D, then the value of D is counted more than once.

To formally define the reverse value of an item x, we must first define the outer trees, outer(x). Consider an item derivation tree of the goal item containing one or more instances of item x. Remove one of these instances of x, and its children too, leaving a gap in its place. This tree is an outer tree of x. The figure above shows an item derivation tree of the goal item, including a subderivation of an item b derived from terms $a_1, \ldots, a_k$. It also shows an outer tree of b, with b and its children removed; the spot b was removed from is labelled (b).

For an outer tree $D \in outer(x)$, we define its value Z(D) to be the product of the values of all rules in D, $\bigotimes_{r \in D} R(r)$. Then the reverse value of an item can be formally defined as

$$Z(x) = \bigoplus_{D \in outer(x)} Z(D)$$

That is, the reverse value of x is the sum of the values of each outer tree of x.

Now we show that this definition of reverse values has the property described by the equation above.

Theorem.

$$\bigoplus_{D \text{ a derivation}} V(D) \cdot C(D, x) = V(x) \otimes Z(x)$$

Proof.

$$V(x) \otimes Z(x) = V(x) \otimes \bigoplus_{O \in outer(x)} Z(O) = \Bigl(\bigoplus_{I \in inner(x)} V(I)\Bigr) \otimes \Bigl(\bigoplus_{O \in outer(x)} Z(O)\Bigr) = \bigoplus_{I \in inner(x)} \;\bigoplus_{O \in outer(x)} V(I) \otimes Z(O)$$

Next we argue that this last expression equals the other side of the equation, $\bigoplus_D V(D) \cdot C(D, x)$. For an item x, any outer part of an item derivation tree for x can be combined with any inner part to form a complete item derivation tree. That is, any $O \in outer(x)$ and any $I \in inner(x)$ can be combined to form an item derivation tree D containing x, and any item derivation tree D containing x can be decomposed into such outer and inner trees. Thus the list of all combinations of outer and inner trees corresponds exactly to the list of all item derivation trees containing x. In fact, for an item derivation tree D containing C(D, x) instances of x, there are C(D, x) ways to form D from combinations of outer and inner trees. Also notice that, for D combined from O and I,

$$V(D) = \bigotimes_{r \in D} R(r) = \Bigl(\bigotimes_{r \in I} R(r)\Bigr) \otimes \Bigl(\bigotimes_{r \in O} R(r)\Bigr) = V(I) \otimes Z(O)$$

Thus

$$\bigoplus_{D} V(D) \cdot C(D, x) = \bigoplus_{I \in inner(x)} \;\bigoplus_{O \in outer(x)} V(I) \otimes Z(O)$$

Combining the two equations, we see that

$$\bigoplus_{D \text{ a derivation}} V(D) \cdot C(D, x) = V(x) \otimes Z(x)$$

completing the proof.

(We note that satisfying this equation is a useful, but not sufficient, condition for using reverse inside values for grammar re-estimation. While this definition will typically provide the necessary values for the E step of an EM algorithm, additional work will typically be required to prove this fact; the equation should be useful in such a proof.)

There is a simple recursive formula for efficiently computing reverse values. Recall that the basic equation for computing forward values not involved in loops was

$$V(x) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} V(a_i)$$

At this point, for conciseness, we introduce a nonstandard notation. We will soon be using many index sequences of the form $1, 2, \ldots, j-1, j+1, \ldots, k$; we abbreviate these by writing $i \ne j$ beneath sum and product signs, with i understood to range over $1, \ldots, k$ with j omitted, significantly simplifying some expressions. By extension, we also write $a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_k$ for the corresponding sequence of values.

Now we can give a simple formula for computing reverse values Z(x) not involved in loops.

Theorem. For items $x \in B$, where B is non-looping,

$$Z(x) = \bigoplus_{x = a_j,\ a_1\cdots a_k,\ b \ \text{ s.t. } \frac{a_1\cdots a_k}{b}} \Bigl(\bigotimes_{i \ne j} V(a_i)\Bigr) \otimes Z(b)$$

unless x is the goal item, in which case Z(x) = 1, the multiplicative identity of the semiring.

Proof. The simple case is when x is the goal item. Since an outer tree of the goal item is a derivation of the goal item with the goal item and its children removed, and since we assumed earlier that the goal item can only appear at the root of a derivation tree, the outer trees of the goal item are all empty. Thus

$$Z(goal) = \bigoplus_{D \in outer(goal)} Z(D) = Z(\langle\rangle) = \bigotimes_{r \in \langle\rangle} R(r) = 1$$

As mentioned earlier, the product of zero elements is the multiplicative identity.

Now we consider the general case. We need to expand our concept of outer to include deduction rules: $outer\!\left(\frac{a_1\cdots a_k}{b}, j\right)$ is an item derivation tree of the goal item with one subtree removed—a subtree headed by $a_j$, whose parent is b and whose siblings are headed by $a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_k$. Notice that for every outer tree $D \in outer(x)$ there is exactly one j, $a_1\cdots a_k$, and b such that $x = a_j$ and $D \in outer\!\left(\frac{a_1\cdots a_k}{b}, j\right)$; this corresponds to the deduction rule used at the spot in the tree where the subtree headed by x was deleted. The figure below illustrates the idea of putting together an outer tree of b with inner trees for $a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_k$ to form an outer tree of $x = a_j$. Using this observation,

$$Z(x) = \bigoplus_{D \in outer(x)} Z(D) = \bigoplus_{x = a_j,\ a_1\cdots a_k,\ b \ \text{ s.t. } \frac{a_1\cdots a_k}{b}} \;\;\bigoplus_{D \in outer\!\left(\frac{a_1\cdots a_k}{b},\, j\right)} Z(D)$$

[Figure: an outer tree of b, together with inner trees for $a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_k$, combined to form an outer tree of $a_j$ (the goal at the root, b above the removal point, and the removal spot labelled $(a_j)$). Caption: Combining an outer tree with inner trees to form an outer tree.]

Now consider all of the outer trees $outer\!\left(\frac{a_1\cdots a_k}{b}, j\right)$. For each choice of item derivation trees $D_{a_1} \in inner(a_1), \ldots, D_{a_{j-1}} \in inner(a_{j-1}), D_{a_{j+1}} \in inner(a_{j+1}), \ldots, D_{a_k} \in inner(a_k)$, and for each outer tree $D_b \in outer(b)$, there will be one outer tree in the set $outer\!\left(\frac{a_1\cdots a_k}{b}, j\right)$; similarly, each tree in $outer\!\left(\frac{a_1\cdots a_k}{b}, j\right)$ can be decomposed into an outer tree in outer(b) and derivation trees for $a_1, \ldots, a_{j-1}, a_{j+1}, \ldots, a_k$. Then

$$\bigoplus_{D \in outer\!\left(\frac{a_1\cdots a_k}{b},\, j\right)} Z(D) = \bigoplus_{D_b \in outer(b)}\ \bigoplus_{\substack{D_{a_i} \in inner(a_i)\\ i \ne j}} Z(D_b) \otimes \bigotimes_{i \ne j} V(D_{a_i}) = Z(b) \otimes \bigotimes_{i \ne j} V(a_i) = \Bigl(\bigotimes_{i \ne j} V(a_i)\Bigr) \otimes Z(b)$$

where the last two steps use distributivity and commutativity. Substituting this into the previous equation, we conclude that

$$Z(x) = \bigoplus_{x = a_j,\ a_1\cdots a_k,\ b \ \text{ s.t. } \frac{a_1\cdots a_k}{b}} \Bigl(\bigotimes_{i \ne j} V(a_i)\Bigr) \otimes Z(b)$$

completing the general case.

Computing the reverse values for loops is somewhat more complicated. As in the forward case, it requires an infinite sum; computation of this infinite sum will be semiring-specific. Also as in the forward case, we use the concept of generation. Let us define the generation g of an outer tree D of item x in bucket B to be the number of items in bucket B on the path between the root and the removal point, inclusive. Thus outer trees of items in buckets following B will be in generation 0. Let $outer_g(x, B)$ represent the set of outer trees of x with generation at most g. It should be clear that, as g approaches ∞, $outer_g(x, B)$ approaches outer(x). Now we can define the gth-generation reverse value of an item x in bucket B, $Z_g(x, B)$:

$$Z_g(x, B) = \bigoplus_{D \in outer_g(x, B)} Z(D)$$

For continuous semirings, an infinite sum is equal to the supremum of the partial sums:

$$Z(x) = \bigoplus_{D \in outer(x)} Z(D) = \sup_g Z_g(x, B)$$

Thus we wish to find a simple formula for $Z_g(x, B)$.

Notice that for $x \in C$, where C is a bucket following B, by the inclusion of derivations of these items in generation 0,

$$Z_g(x, B) = Z(x)$$

Also notice that, for g = 0 and $x \in B$,

$$Z_0(x, B) = \bigoplus_{D \in outer_0(x, B)} Z(D) = \bigoplus_{D \in \emptyset} Z(D) = 0$$

Thus generation 0 makes a simple base for a recursive formula for outer values. Now we can consider the general case, for $g \ge 1$.

Theorem. For items $x \in B$ and $g \ge 1$,

$$Z_g(x, B) = \bigoplus_{x = a_j,\ a_1\cdots a_k,\ b \ \text{ s.t. } \frac{a_1\cdots a_k}{b}} \Bigl(\bigotimes_{i \ne j} V(a_i)\Bigr) \otimes \begin{cases} Z_{g-1}(b, B) & \text{if } b \in B\\ Z(b) & \text{if } b \notin B\end{cases}$$

The proof parallels that of the preceding theorem, and is given in Appendix A.

Reverse Values in Noncommutative Semirings

The reverse-value formulas above apply only to commutative semirings, since their derivation makes use of the commutativity of the multiplicative operator. We might wish to compute reverse values in noncommutative semirings as well. For instance, the reverse values in the derivation semiring could be used to compute the set of all parses that include a particular constituent. It turns out that there is no equation in the noncommutative semirings corresponding directly to the formulas above; in fact, in general, there are no values in noncommutative semirings corresponding directly to the reverse values.

When we compute reverse values, we need them to be such that the product of the forward and the reverse values gives the sum of the values over all derivation trees using the item. Consider a case in which the forward value of the item is b, and there are derivation trees with the values abc and dbe. The product of the forward and reverse values should thus be abc ⊕ dbe, but in a noncommutative semiring there will not in general be any reverse value x such that xb = abc ⊕ dbe. Instead, one must create what we call Pair semirings, corresponding to multisets of pairs of values in the base semiring. In this example, the reverse value would be the multiset {⟨a, c⟩, ⟨d, e⟩}, and a special operator would combine b with this set to yield abc ⊕ dbe. Using Pair semirings, one can find equations directly analogous to the commutative ones. A complete explication of reverse values for noncommutative semirings, and the derivation of the equations, are given in Appendix C.

Semiring Parser Execution

Bucketing

Executing a semiring parser is fairly simple. There is, however, one issue that must be dealt with before we can actually begin parsing. A semiring parser computes the values of items in the order of the buckets they fall into. Thus, before we can begin parsing, we need to know which items fall into which buckets, and the ordering of those buckets. There are three approaches to determining the buckets and ordering that we will discuss in this section. The first approach is a simple brute-force enumeration of all items, derivable or not, followed by a topological sort. This approach will have suboptimal time and space complexity for some item-based descriptions. The second approach is to use an agenda parser in the boolean semiring to determine the derivable items and their dependencies, and to then perform a topological sort. This approach has optimal time complexity, but typically suboptimal space complexity. The final approach is to use bucketing code specific to the item-based interpreter. This achieves optimal performance, for additional programming effort.

The simplest way to determine the bucketing is to simply enumerate all possible items for the given item-based description, grammar, and input sentence. Then we compute the strongly connected components and a partial ordering; both steps can be done in time proportional to the number of items plus the number of dependencies (Cormen et al.). For some parsers, this technique has optimal time complexity, although poor space complexity. In particular, for the CKY algorithm, the time complexity is optimal, but since it requires computing and storing all possible O(n³) dependencies between the items, it takes significantly more space than the O(n²) space required in the best implementation. In general, the brute-force technique raises the space complexity to be the same as the time complexity. Furthermore, for some algorithms, such as Earley's algorithm, there could be a significant time complexity added as well. In particular, Earley's algorithm may not need to examine all possible items: for certain grammars, Earley's algorithm examines only a linear number of items and a linear number of dependencies, even though there are O(n²) possible items and O(n³) possible dependencies. Thus the brute-force approach would require O(n³) time and space instead of O(n) time and space.

The next approach to finding the bucketing solves the time complexity problem. In this approach, we first parse in the boolean semiring, using the agenda parser described by Shieber et al., and then we perform a topological sort. Shieber et al. use an interpreter which, after an item is derived, determines all items that could trigger off of that item. For instance, in an Earley-style parser such as the one given earlier, if an item [i, A → α•Bβ, j] is processed and there is a rule B → γ, then, by the prediction rule, [j, B → •γ, j] can be derived. Thus, immediately after [i, A → α•Bβ, j] is processed, [j, B → •γ, j] is added to an agenda; new items to be processed are taken off of the agenda. This approach works fine for the boolean semiring, where items only have value TRUE or FALSE, but cannot be used directly for other semirings. For other semirings, we need to make sure that the values of items are not computed until after the values of all items they depend on are computed. However, we can use the algorithm of Shieber et al. to compute all of the items that are derivable, and to store all of the dependencies between the items; then we perform a topological sort on the items. The time complexity of both the agenda parser and the topological sort will be proportional to the number of dependencies, which will be proportional to the optimal time complexity. Unfortunately, we still have the space complexity problem, since again the space used will be proportional to the number of dependencies, rather than to the number of items.

The third approach to bucketing is to create algorithm-specific bucketing code; this results in parsers with both optimal time and optimal space complexity. For instance, in a CKY-style parser, we can simply create one bucket for each length and place each item into the bucket for its length, as in the sketch below. For some algorithms, such as Earley's algorithm, special-purpose code for bucketing might have to be combined with code for triggering, just as in the algorithm of Shieber et al., in order to achieve optimal performance.
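A minimal sketch (ours) of this algorithm-specific bucketing for CKY, with items written as (i, A, j) spans:

    def cky_buckets(n, nonterminals):
        """One bucket per span length, shortest first; no topological sort needed."""
        buckets = []
        for length in range(1, n + 1):
            bucket = [(i, A, i + length)
                      for i in range(n - length + 1)
                      for A in nonterminals]
            buckets.append(bucket)
        return buckets

    print(len(cky_buckets(3, ["S", "A"])))   # 3 buckets for a 3-word sentence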

    current := first
    do
        if loop(current)
            /* replace with semiring-specific code */
            for each x ∈ current:
                V_0(x, current) := 0
            for g := 1 to ∞:
                for each x ∈ current and a_1 ⋯ a_k such that (a_1 ⋯ a_k)/x:
                    V_g(x, current) := V_g(x, current) ⊕
                        ⊗_{i=1..k} ( V(a_i) if a_i ∉ current; V_{g−1}(a_i, current) if a_i ∈ current )
            for each x ∈ current:
                V(x) := V_∞(x, current)
        else
            for each x ∈ current and a_1 ⋯ a_k such that (a_1 ⋯ a_k)/x:
                V(x) := V(x) ⊕ ⊗_{i=1..k} V(a_i)
        oldCurrent := current
        current := next(current)
    while oldCurrent ≠ last
    return V(goal)

Figure: Forward semiring parser interpreter

Interpreter

Once we have the bucketing, the parsing step is fairly simple. The basic algorithm appears in the figure above. We simply loop over each item in each bucket. There are two types of buckets: looping buckets and non-looping buckets. If the current bucket is a looping bucket, we compute the infinite sum needed to determine the bucket's values; in a working system, we substitute semiring-specific code for this section, as described earlier. If the bucket is not a looping bucket, we simply compute all of the possible instantiations that could contribute to the values of items in that bucket. Finally, we return the value of the goal item.

The reverse semiring parser interpreter is very similar to the forward semiring parser interpreter. The differences are that in the reverse semiring parser interpreter we traverse the buckets in reverse order, and we use the formula for the reverse values rather than the forward values.

Both interpreters are closely based on the formulas derived earlier. The forward semiring parser interpreter uses the code

    for each x ∈ current and a_1 ⋯ a_k such that (a_1 ⋯ a_k)/x:
        V_g(x, current) := V_g(x, current) ⊕
            ⊗_{i=1..k} ( V(a_i) if a_i ∉ current; V_{g−1}(a_i, current) if a_i ∈ current )

to implement the formula for computing the values of looping buckets,

$$V_g(x, B) = \bigoplus_{a_1\cdots a_k \text{ s.t. } \frac{a_1\cdots a_k}{x}} \;\bigotimes_{i=1}^{k} \begin{cases} V(a_i) & \text{if } a_i \notin B\\ V_{g-1}(a_i, B) & \text{if } a_i \in B\end{cases}$$

    current := last
    do
        if loop(current)
            /* replace with semiring-specific code */
            for each x ∈ current:
                Z_0(x, current) := 0
            for g := 1 to ∞:
                for each j, a_1 ⋯ a_k, x such that (a_1 ⋯ a_k)/x and a_j ∈ current:
                    Z_g(a_j, current) := Z_g(a_j, current) ⊕
                        ( ⊗_{i ≠ j} V(a_i) ) ⊗ ( Z(x) if x ∉ current; Z_{g−1}(x, current) if x ∈ current )
            for each x ∈ current:
                Z(x) := Z_∞(x, current)
        else
            for each j, a_1 ⋯ a_k, x such that (a_1 ⋯ a_k)/x and a_j ∈ current:
                Z(a_j) := Z(a_j) ⊕ ( ⊗_{i ≠ j} V(a_i) ) ⊗ Z(x)
        oldCurrent := current
        current := previous(current)
    while oldCurrent ≠ first

Figure: Reverse semiring parser interpreter

to implement Equation for computing the values of lo oping buckets

O M

V a if a B

i i

V x B

g

V a B if a B

g i i

ik

a a

k

a a st

k

x

For computing the values of nonlo oping buckets the interpreter uses the co de

a a

k

for each x current a a st

k

x

N

k

V x V x V a

i

i

which is simply an implementation of Equation

k

O M

V a V x

i

i

a a

k

a a st

k

x

The corresp onding lines in the reverse interpreter corresp ond to Equations and

resp ectively

Using these equations a simple inductive pro of shows that the semiring parser inter

preter is correct and an analogous theorem holds for the reverse semiring parser interpreter

Theorem

The forward semiring parser interpreter correctly computes the value of all items

A sketch of the pro of is given in App endix A

There are two other implementation issues First for some parsers it will b e p ossible

to discard some items That is some items serve the role of temp orary variables and can

b e discarded after they are no longer needed esp ecially if only the forward values are going

to b e computed Also some items do not dep end on the input string but only on the rule

value function of the grammar The values of these items can b e precomputed using the

forward semiring parser interpreter

Grammar Transformations

We can apply the same techniques to grammar transformations that we have so far applied

to parsing Consider a grammar transformation such as the Chomsky Normal Form CNF

grammar transformation which takes a grammar with epsilon unary and nary branching

pro ductions and converts it into one in which all pro ductions are of the form A BC

or A a For any sentence w w its value under the original grammar in the b o olean

n

semiring TRUE if the sentence can b e generated by the grammar FALSE otherwise is

the same as its value under a transformed grammar Therefore we say that this grammar

transformation is value preserving under the b o olean semiring We can generalize this

concept of value preserving to other semirings For instance if prop erly sp ecied the CNF

transformation also preserves value under any complete commutative semiring so that the

value of any sentence in the transformed grammar is the same as the value of the sentence

in the original grammar Thus for instance we could start with a grammar with rule

probabilities transform it using the CNF transformation and nd Viterbi values using a

CKY parser and the transformed grammar the Viterbi values for any sentence would b e

the values using the original grammar

We now show how to sp ecify grammar transformations using almost the same itembased

descriptions we used for parsing We give a value preserving transformation to CNF in this

section and in App endix B we give a value preserving transformation to Greibach

Normal Form GNF While itembased descriptions have b een used to sp ecify parsers by

Shieb er et al and Sikkel we do not know of previous uses for sp ecifying

grammar transformations

The concept of value preserving grammar transformation is known in the intersection

of formal language theory and algebra Kuich shows how to p erform trans

formations to b oth CNF and GNF with a valuepreserving formula Teitelbaum

shows how to convert to CNF for a sub class of continuous semirings The contribution

then of this section is to show that these value preserving transformations can b e fairly

simply given as itembased descriptions allowing the same computational machinery to b e

used for grammar transformations as is used for parsing and to some extent showing the

relationship b etween certain grammar transformations and certain parsers such as that of

Graham et al discussed in Section and App endix B While the relation

ship b etween grammar transformations and parsers is already known in the literature on

covering grammars Nijholt Leermakers our treatment is clearer b ecause we

use the same machinery for sp ecifying b oth the transformations and the parsers allowing

commonalities to b e expressed in the same way in b oth cases

There are three steps to the CNF transformation removal of epsilon pro ductions re

moval of unary pro ductions and nally splitting of nary pro ductions Of these three the

b est one for exp ository purp oses is the removal of unary pro ductions shown in Figure

so we will explicate this transformation rst even though logically it b elongs second This

transformation assumes that the grammar contains no epsilon rules It will b e convenient

to allow variables A B C to represent either nonterminals or terminals throughout this

section

To distinguish rules in the old grammar from rules in the new grammar we have num

b ered the rule functions R in this transformation for the original grammar and R for

the new grammar The only dierence b etween an itembased parser description and an

itembased grammar transformation description is the goals section instead of a single goal

item there is a rule goal with variables such as R A giving the value of items in

the new grammar

An item of the form A B C can derived if and only there is a derivation of the form

A D E F B C

An item in this form is simply derived using the extension rule several times combined with

Item form

A

Rule Goal

R A

Rules

R A B C

Nary

A B C

R A a

Unary

A a

R A B B

Extension

A

A

Output

R A

Figure Removal of Unary Pro ductions

the Nary rule Similarly an item of the form A a can b e derived if and only if there is

a derivation of the form

A D E F a

Items of the form A B where B is a nonterminal cannot b e derived

A short example will help illustrate how this grammar transformation works Consider

the following grammar with values in the inside semiring

S Aa

A a

A A

There are an innite number of derivations of the item A a It can b e derived using

just the unary rule with value it can b e derived using the unary rule and the extension

rule with value it can b e derived using the unary rule and using the extension rule

twice with value and so on The total of all derivations is There is just one

derivation for the item S Aa which has value also Thus the resulting grammar is

S Aa

A a

which has no nonterminal unary rules

Now with an example nished we can discuss conditions for correctness for a grammar

transformation

Consider a grammar derivation D as always leftmost in the original grammar and

a grammar derivation E in the new grammar using item derivations I to derive the rules

used in E Roughly if there is a onetoone pairing b etween oldgrammar derivations D

and pairs E I then the transformation is value preserving Formally

Theorem

Consider a derivation D in the original grammar using rules D D Consider also a

d

derivation in the new grammar E using rules E E For each new grammar rule E there

e i

j

i

I We can consider sequences of is some set of itembased derivations of that rule I

i

i

k

k k

k

i

e

selecting one rule derivation I I I such rule derivations I for each new grammar

e

i

rule E A grammar transformation will b e value preserving for a commutative semiring if

i

k

k

e

and there is a onetoone pairing b etween derivations D D and pairs E E I I

e

d

e

k

k

e

if all rule values from the original grammar o ccur the same number of times in I I

e

as they do in D D

d

Proof The pro of is obvious since each term in the sum over derivations in the

original grammar has a term in the sum over derivations in the new grammar The details

essentially follow Theorem

There are two imp ortant caveats to note ab out this form of grammar transformation

The rst is that the grammar transformation is semiring sp ecic Consider the probabilistic

grammar example transformed with the unary pro ductions removal transformation

If we transform it using the inside semiring we get Grammar On the other hand if

we transform the grammar using the Viterbi semiring we get

S Aa

A a

which is also correct the Viterbi semiring value of the string aa in the original grammar is

just as it is in the transformed grammar Notice that the values dier it is imp ortant to

remember that grammar transformations are semiring sp ecic Furthermore notice that the

probabilities do not sum to in the transformed grammar in the Viterbi semiring While for

this example and the unary pro ductions removal transformation in general transformations

using the inside ring do preserve summation to this is not always true as we will show

during our discussion of the epsilon removal transformation

Next we consider the epsilon removal transformation This transformation is fairly

simple it is derived from Earleys algorithm This transformation assumes S do es not

o ccur on the right hand side of any pro ductions There is one item form A that

can b e derived if and only if there is a derivation of the form A using only

substitutions of the form B and where each symbol C in can derive some string of

terminals

There are six rules The rst prediction simply makes sure there is one initial item

A for each rule of the form A The next rule epsilon completion allows deletion

of symbols B that derive epsilon from rules of the form A B while the following rule

nonepsilon completion simply moves the dot over symbols B that can derive terminal

strings The fourth rule scanning moves the dot over terminals The last two rules nary

Item form

A

Rule Goal

R A

Rules

R A

Prediction

A

A B B

Epsilon Completion

A

A B

B C Nonepsilon Completion

A B

A a

Scanning

A a

A B

Nary Output

R A B

S

Epsilon Output

R S

Figure Removal of Epsilon Pro ductions

output and epsilon output derive the output values

For simplicity the removal of unary pro ductions transformation and the removal of n

ary pro ductions transformations do not handle the rule S pro duced by the epsilon

output rule but could easily b e mo died to do so

While the epsilon removal transformation preserves the value of any string using the

inside semiring it do es not in general pro duce a grammar with probabilities that sum to

Consider the probabilistic grammar

S aB

B

B b

The grammar with epsilons removed using the inside semiring will b e

S aB

S a

B b

which generates the same strings with the same probabilities value is preserved notice

however that probabilities of individual nonterminals sum to b oth more and less than one

and that simply normalizing probabilities by dividing through by the total for each left

hand side leads to a grammar with dierent string probabilities The lack of summation to

one could p otentially make this grammar less useful in a system that needed for instance

intermediate probabilities for thresholding

We can correctly renormalize using a mo died itembased description of Earleys algo

rithm in Figure This parser is just like Earleys parser except that the indices have

b een removed and the scanning rule do es not make reference to the words of the sentence

There is one valid derivation in this parser for each derivation in the grammar Computing

V A Z A

P

V A Z A

gives the normalized probability P A The intuition b ehind this formula is simply

that it is the usual formula for insideoutside reestimation and insideoutside reestimation

when in a lo cal minimum stays in the same place Since a grammar is a lo cal minimum of

itself we should get essentially the same grammar but with normalized rule probabilities

A stronger argument is given in App endix A Theorem

The renormalization parser has some other useful prop erties For instance in the count

ing semiring V S S gives the total number of parses in the language in the Viterbi

nb est semiring it gives the probabilities and parses of the n most probable parses in the

language and in the parseforest semiring it gives a derivation forest for the entire language

In the inside semiring it should always b e assuming a prop er grammar as input

One last interesting prop erty of the renormalization parser is that it can b e used to

remove useless rules The forward b o olean value times the reverse b o olean value of an item

A will b e TRUE if and only if the rule A is useful that is if it can app ear in

Item form

A

Goal

S S

Rules

Initialization

S S

A a

Scanning

A a

R B

A B Prediction

B

A B B

Completion

A B

Figure Renormalization Parsing

the derivation of some string Useless rules can b e eliminated

For completeness there are two more steps in the CNF transformation conversion

from nary branching rules n to binary branching rules and conversion from binary

branching rules with terminal symbols to those with nonterminals We show how to convert

from nary to binary branching rules in Figure This transformation assumes that

all unary and rules have b een removed We construct many new nonterminals in this

transformation each of the form hC D i This step is fairly simple so we wont explicate

it The remaining step removal of terminal symbols from binary branching rules is trivial

and we do not present it

We should note that not all transformations have a valuepreserving version For in

stance in the general case the transformation of a nondeterministic nite state automaton

NFA into a deterministic nite state automaton DFA cannot b e made value preserving

the problem is that in the DFA there is exactly one derivation for a given input string

so it is not p ossible to get a onetoone corresp ondence b etween derivations in the NFA

and the DFA meaning that these techniques cannot b e used Mohri discusses the

conditions under which some NFAs in certain semirings can b e made deterministic

Examples

In this section we give examples of several parsing algorithms expressed as itembased

descriptions

Item form

Rule Goal

R

Rules

R A B C

R A B C

R A B C D

R A B hC D i

R A B C D

R hCD i C D

R A B C D E

R hC D E i C hD E i

Figure Removal of nary Pro ductions

Finite State Automata and Hidden Markov Mo dels

NFAs and HMMs can b oth b e expressed using a single itembased description shown in

Figure We will express transitions from state A to state B emitting symbol a as

A a B As usual we will let R A a B have dierent values dep ending on the

semiring used

Bo olean TRUE if there is a transition from A to B emitting a FALSE otherwise

Counting if there is a transition from A to B emitting a otherwise

Derivation fhA a B ig if there is a transition from A to B emitting a otherwise

Viterbi Probability of a transition from A to B emitting a

Inside Probability of a transition from A to B emitting a

We assume there is a single start state S and a single nal state F and we allow

transitions

For HMMs notice that the forward algorithm is obtained simply by using the inside

semiring the backwards algorithm is obtained using the reverse values of the inside semiring

and the Viterbi algorithm is obtained using the Viterbi semiring For NFAs we can use the

b o olean semiring to determine whether a string is in the language of an NFA we can use

the counting semiring to determine how many state sequences there are in the NFA for a

given string and we can use the derivation forest semiring to get a compact representation

of all state sequences in an NFA for an input string

Item form

A i

Goal

F n

Rules

Start Axiom

S

A i R A w B

i

Scanning

B i

A i R A B

Epsilon Scanning

B i

Figure NFAHMM parser

Prex Values

For language mo deling it may b e useful to compute the prex probability of a string

That is given a string w w we may wish to know the total probability of all sentences

n

b eginning with that string

X

P S w w x x

n

k

k x x

k

Jelinek and Laerty and Stolcke b oth give algorithms for computing the pre

x probabilities However the derivations are somewhat complex requiring ve pages of

Jelinek and Laertys nine page pap er

In contrast we give a fairly simple itembased description in Figure which we will

pre

explicate in detail b elow We will call a prex derivation X w w a derivation in which

i j

X w w x x x

i j

k

The prex value of a string is the sum of the pro ducts of the values used in the prex

derivation A brief example will help consider the prex a and the grammar

S sA

A a

A b

There are two prex derivations of s of the form S sA sa and S sA sb

The inside score is the sum while the Viterbi score is the max

Figure gives an itembased description for nding prex values for CNF grammars

it would b e straightforward to mo dify this algorithm for an Earleystyle parser allowing it

Item form

i A j

i A

A

Goal

S

Rules

R A w

i

Unary In

i A i

R A BC i B k k C j

In In

i A j

R A a

Unary Out

A

R A BC B C

Out Out

A

R A w

n

Unary Between

n A

R A BC i B j j C

In Between

i A

R A BC i B C

Between Out

i A

Figure Prex Derivation Rules In Between

Between Out

In In Out Out

Unary Unary Unary

In Between Out

Figure Prederivation Illustration

to handle any CFG as shown by Stolcke

There are three item types in this description The rst item type i A j called In

can b e derived only if A w w The second type i A called Between can b e

i j

derived only if A w w x x The nal type A called Out can b e derived only

i n

k

if A x x The Unary In and In In rules corresp ond to the usual unary and binary

k

rules The Unary Out and Out Out rules corresp ond to the usual unary and binary rules

but for symbols after the prex For instance the Out Out rule says that if A BC and

B x x and C y y then A x x y y The last three rules deal with Between

k l k l

items Unary Between says that if A w then A w In Between says that if A BC

n n

and B w w and C w w x x then A w w x x Finally Between Out

i j j n i n

k k

says that if A BC and B w w x x and C y y then A w w x y

i n i n

k l l

Figure illustrates the seven dierent rules Words in the prex are indicated by solid

triangles and words that could follow the prex are indicated by dashed ones

There is one problem with the itembased description of Figure Both the Out and

the Between items are all asso ciated with lo oping buckets Since the Between items can only

b e computed on line ie only once we know the input sentence this means that we must

p erform time consuming innite sums on line It turns out that with a few mo dications

the values of all items asso ciated with lo oping buckets can b e computed oline once p er

grammar signicantly sp eeding up the online part of the computation

Figure gives a faster itembased description The fast version contains a new item

pre

type A B that can b e derived only if there is a derivation of the form A B x x

k

It turns out that for the fast description we need to compute rightmost derivations rather

than leftmost derivations There are eight deduction rules the rst four of which are the

same as b efore mo died for rightmost derivations The next two rules the Prederivation

pre

Axiom and Prederivation Completion compute all items of the form A B The last

rule Between Continuation is the most complicated The idea b ehind this rule is the

Item form

i A j

i A

A

pre

A B

Goal

S

Rules

R A w

i

Unary In

i A i

R A BC k C j i B k

In In

i A j

R A a

Unary Out

A

R A BC C B

Out Out

A

Prederivation Axiom

pre

A A

pre

R A BC C B D

Prederivation Completion

pre

A D

pre

A B R B w

n

Between Initialization

n A

pre

A B R B CD j D i C j

Between Continuation

i A

Figure Fast Prex Derivation Rules Between Continuation

Between Initialization

In In Out Out

Unary Unary

In Out

Figure Fast Prederivation Illustration

following In our previous implementation there would b e an In Between rule followed by

a string of Between Out rules It was b ecause of the Between Out rules that the Between

items were asso ciated with lo oping buckets In the fast version we collapse the In Between

followed by a string of many Between Outs into a single Between Continuation rule Between

pre pre

Continuation is the same as In Between except with A B prep ended the A B is

essentially equivalent to a string of zero or more Between Out rules It is prep ended rather

than app ended b ecause we are now nding rightmost derivations We use a similar trick for

Unary Between in the original description there could b e a Unary Between rule followed

by a string of Between Out rules Again we collapse the string of Between Out rules using

pre

a single item of the form A B Figure shows a schematic tree with examples of

the rules

We can give a more formal justication for each of the last four rules The Prederivation

Axiom says that for all A A A The Prederivation Completion rule says that if A BC

and B D x x and C y y then A D x x y y The Between Initialization

k l k l

B x x and B w then A w x x Finally Between rule says that if A

n n

k k

Continuation says that if A B x x and B CD and and C w w and D

i j

k

w w y y then A w w y y x x

j n i n

l l k

The careful reader can verify that there is a onetoone corresp ondence b etween item

derivations in this system and prex derivations of the input string Notice that the only

pre

items asso ciated with a lo oping bucket are of the form A B and A these can all b e

precomputed indep endent of the input string This algorithm is essentially the same as

that of Jelinek and Laerty

As usual there are advantages to the item based description b esides its simplicity For

instance we can use the same itembased description with the Viterbi semiring to nd the

Viterbi prex probabilities the probability of the most likely prex derivation with the

pre

b o olean semiring to compute the valid prex prop erty whether S w w etc

n

Prex derivation values are p otentially useful for starting interpretation of sentences

b efore they are completed For instance a travel agent program on hearing Show me

ights on April so I can could compute the Viterbiderivation prex parse so that it

could b egin pro cessing the transaction b efore the user was nished sp eaking Similar values

such as the prex derivationforest value would yield the set of all p ossible derivations that

could complete the sentence

We note that in the inside semiring

V i A j Z i A j

V S

X

P S w w Aw w x x w w x x jw w

i j n n n

k k

k x x

k

which is the probability that symbol A covers terminals i to j given all input symbols

seen so far From an information content p oint of view this formula is the optimal one to

use for thresholding although the resulting algorithm would b e O n since the outside

values which require time O n to compute would need to b e recomputed after each new

input symbol Thus although thresholding with this formula probably would not provide

a sp eedup it could b e useful for incremental interpretation or for analyzing garden path

sentences

Beyond ContextFree

There has b een quite a bit of previous work on the intersection of formal language theory

and algebra as describ ed by Kuich among others This previous work has made

heavy use of the fact that there is a strong corresp ondence b etween algebraic equations in

certain noncommutative semirings and CFGs This corresp ondence has made it p ossible to

manipulate algebraic systems rather than grammar systems simplifying many op erations

On the other hand there is an inherent limit to such an approach namely a limit to

contextfree systems It is then p erhaps slightly surprising that we can avoid these limita

tions and create itembased descriptions of parsers for weakly contextsensitive grammars

such as Tree Adjoining Grammars TAGs We avoid the limitations of previous approaches

using two techniques One technique is rather than computing parse trees for TAGs we

compute derivation trees Computing derivation trees for TAGs is signicantly easier than

computing parse trees since the derivation trees are contextfree The other trick we use is

that while earlier formulations created one set of equations for each grammar our parsing

approach can b e thought of as creating a set of equations for each grammar and string

length Because the number of equations grows with the string length we can recognize

strings in weakly contextsensitive languages

A further explication of this sub ject including an itembased description for a simple

TAG parser is given in the app endix in Section B

Tomita Parsing

Our goal in this section has b een to show that itembased descriptions can b e used to simply

describ e almost all parsers of interest One parsing algorithm that would seem particularly

dicult to describ e is Tomitas graphstructuredstack LR parsing algorithm This algo

rithm at rst glance b ears little resemblance to other parsing algorithms and worse uses

p ointers extensively Since there is no obvious way to emulate p ointers in an itembased

description it would app ear that this parser has no simple itembased description How

ever Sikkel gives an itembased description for a Tomitastyle parser for the b o olean

semiring which is also more ecient than Tomitas algorithm Sikkels format is similar

enough to ours that his description can b e easily converted to our format where it can b e

used for continuous semirings in general

Graham Harrison Ruzzo Parsing

Graham et al describ e a parser similar to Earleys but with several sp eedups that

lead to signicant improvements Essentially there are three improvements in the GHR

parser First epsilon pro ductions are precomputed Second unary pro ductions are pre

computed and nally completion is separated into two steps allowing b etter dynamic

programming

In App endix B we give a full itembased description of a GHR parser The forward

values of many of the items in our parser related to unary and epsilon pro ductions can

b e computed oline once p er grammar This idea of precomputing values oline in a

probabilistic GHRstyle parser is due to Stolcke Since reverse values require entire

strings the reverse values of these items cannot b e computed until the input string is

known Because we use a single itembased description for precomputed items and non

precomputed items and for forward and reverse values this combination of oline and

online computation is easily and compactly sp ecied

Previous Work

The previous work in this area is extensive including work in deductive parsing work in

statistical parsing and work in the combination of formal language theory and algebra This

chapter can b e thought of as synthetic combining the work in all three areas although in

the course of synthesis several general formulas have b een found most notably the general

formula for reverse values A comprehensive examination of all three areas is b eyond the

scop e of this chapter but we can touch on a few signicant areas of each

First there is the work in deductive parsing This work in some sense dates back to

Earley in which the use of items in parsers is introduced More recent work Pereira

and Warren Pereira and Shieb er demonstrates how to use deduction engines

for parsing Finally b oth Shieb er et al and Sikkel have shown how to sp ecify

parsers in a simple interpretable itembased format This format is roughly the format we

have used here although there are dierences due to the fact that their work was strictly

in the b o olean semiring

Work in statistical parsing has also greatly inuenced this work We can trace this work

back to research in HMMs by Baum and his colleagues Baum and Eagon Baum

A go o d introduction to this work and its practical application to sp eech recog

nition was written by Rabiner In particular the work of Baum developed the

concept of backward probabilities in the inside semiring as well as many of the tech

niques for computing in the inside semiring Viterbi developed corresp onding algo

rithms for computing in the Viterbi semiring Baker extended the work of Baum

et al to PCFGs including to computation of the outside values or reverse inside values

in our terminology Bakers work is describ ed by Lari and Young Bakers

work was only for PCFGs in Chomsky Normal Form avoiding the need to compute innite

summations Jelinek and Laerty showed how to compute some of the innite sum

mations in the inside semiring those needed to compute the prex probabilities of PCFGs

in CNF Stolcke showed how to use the same techniques to compute inside proba

bilities for Earley parsing dealing with the dicult problems of unary transitions and the

more dicult problems of epsilon transitions He thus solved all of the imp ortant prob

lems encountered in using an itembased parser to compute the inside and outside values

forward and reverse inside values he also showed how to compute the forward Viterbi

values

The nal area of work is in formal language theory and algebra Although it is not

widely known there has b een quite a bit of work showing how to use formal p ower series to

elegantly derive results in formal language theory In particular the ma jor classic results

can b e derived in this framework but with the added b enet that they typically apply to all

continuous semirings or at least to all commutative continuous semirings The most

accessible introduction to this literature we have found is by Kuich There are also

b o oks by Salomaa and Soittola and Kuich and Salomaa as well as a b o ok

concentrating primarily on the nite state case by Berstel and Reutenauer This

work dates back to to Chomsky and Schutzenberger Kuich gives a much

more complete bibliography than we give here

One piece of work deserves sp ecial mention Teitelbaum showed that any semiring

could b e used in the CKY algorithm He further showed that a subset of complete semirings

could b e used in valuepreserving transformations to CNF Thus he laid the foundation for

much of the work that followed

In summary this chapter synthesizes work from several dierent related elds including

deductive parsing statistical parsing and formal language theory we emulate and expand

on the earlier synthesis of Teitelbaum The synthesis here is p owerful by generalizing

and integrating many results we make the computation of a much wider variety of values

p ossible

Recent similar work

There has also b een recent similar work by Tendeau b a Tendeau b gives

an Earleylike algorithm that can b e adapted to work with complete semirings satisfying

certain conditions Unlike our version of Earleys algorithm Tendeaus version requires

L

time O n where L is the length of the longest right hand side as opp osed to O n for

the classic version and for our description There is also not much detail ab out how the

algorithm handles lo oping pro ductions

Tendeau a also shows how to compute values in an abstract semiring with a CKY

style algorithm in a manner similar to Teitelbaum More imp ortantly he gives a

generic description for dynamic programming algorithms His description is very similar to

our itembased descriptions with two caveats First he includes a single rule term on the

left rather than a set of rule values intermixed with item values For commutative semirings

this limitation is ne but for noncommutative semirings it may lead to inelegancies

Second he do es not have an equivalent to our side conditions This means that algorithms

such as Earleys algorithm which rely on side conditions for eciency cannot b e describ ed

in this formalism in a way that captures those eciency considerations

Tendeau b a introduces a parse forest semiring similar to our derivation

forest semiring in that it enco des a parse forest succinctly Tendeaus parse forest has an

advantage over ours in that it is commutative However it has two disadvantages First it

is partially b ecause of the enco ding of this parse forest that Tendeaus version of Earleys

algorithm has its p o or time complexity Second to implement this semiring Tendeaus

version of rule value functions take as their input not only a nonterminal but also the span

that it covers this is somewhat less elegant than our version Tendeau b also shows

that Denite Clause Grammars DCGs can b e describ ed as semirings with some caveats

including the same problems as the parse forest semiring

Conclusion

In this chapter we have shown that a simple itembased description format can b e used to

describ e a very wide variety of parsers These parsers include the CKY algorithm Earleys

algorithm prex probability computation a TAG parsing algorithm Graham Harrison

Ruzzo parsing and HMM computations We have shown that this description format

makes it easy to nd parsers that compute values in any continuous semiring The same

description can b e used to nd reverse values in commutative continuous semirings and

in many noncommutative ones as well We have also shown that this description format

can b e used to describ e grammar transformations including transformations to CNF and

GNF which preserve values in any commutative continuous semiring

While theoretical in nature this chapter is of some practical value There are three

reasons the results of this chapter would b e used in practice rst these techniques make

computation of the outside values simple and mechanical second these techniques make it

easy to show that a parser will work in any continuous semiring and third these tech

niques isolate computation of innite sums in a given semiring from the parser sp ecication

pro cess

Probably the way in which these results will b e used most is to nd formulas for out

side values For parsers such as CKY parsers nding outside formulas is not particularly

burdensome but for complicated parsers such as TAG parsers Graham Harrison Ruzzo

parsers and others it can require a fair amount of thought to nd these equations through

conventional reasoning With these techniques the formulas can b e found in a simple

mechanical way

The second advantage comes from clarifying the conditions under which a parser can

b e converted from computing values in the b o olean semiring a recognizer to computing

values in any continuous semiring We should note that b ecause in the b o olean semiring

innite summations can b e computed trivially TRUE if any element is TRUE and b ecause

rep eatedly adding a term do es not change results it is not uncommon for parsers that work

in the b o olean semiring to require signicant mo dication for other semirings For parsers

like CKY parsers verifying that the parser will work in any semiring is trivial but for other

parsers the conditions are more complex With the techniques in this chapter all that is

necessary is to show that there is a onetoone corresp ondence b etween item derivations and

grammar derivations Once that corresp ondence has b een shown Theorem states that

any continuous semiring can b e used

The third use of this chapter is to separate the computation of innite sums from the

main parsing pro cess Innite sums can come from several dierent phenomena such as

lo ops from pro ductions like A A pro ductions involving and sometimes left recursion

In traditional pro cedural sp ecications the solution to these dicult problems is intermixed

with the parser sp ecication and makes the parser sp ecic to semirings using the same

techniques for solving the summations

It is imp ortant to notice that the algorithms for solving these innite summations vary

fairly widely dep ending on the semiring On the one hand b o olean innite summations are

nearly trivial to compute For other semirings such as the counting semiring or derivation

forest semiring more complicated computations are required including the detection of

lo ops Finally for the inside semiring in most cases only approximate techniques can

b e used although in some cases matrix inversion can b e used Thus the actual parsing

algorithm if sp ecied pro cedurally can vary quite a bit dep ending on the semiring

On the other hand using our techniques makes innite sums easier to deal with in two

ways First these dicult problems are separated out relegated conceptually to the parser

interpreter where they can b e ignored by the constructor of parsing algorithms Second

b ecause they are separated out they can b e solved once rather than again and again Both

of these advantages make it signicantly easier to construct parsers

In summary the techniques of this chapter will make it easier to compute outside values

easier to construct parsers that work for any continuous semiring and easier to compute

innite sums in those semirings In Teitelbaum wrote

We have p ointed out the relevance of the theory of algebraic p ower series in

noncommuting variables in order to minimize further piecemeal rediscovery

Many of the techniques needed to parse in sp ecic semirings continue to b e rediscovered

and outside formulas are derived without observation of the basic formula given here We

hop e this chapter will bring ab out Teitelbaums wish

App endix

A Additional Pro ofs

In this app endix we prove theorems given earlier It will b e helpful to refer back to the

original statements of the theorems for context

Theorem

For x an item in a lo oping bucket B and for g

O M

V a if a B

i i

V x B

g

V a B if a B

g i i

ik

a a

k

a a st

k

x

Proof Observe that for any item x in a bucket preceding B inner x B

g

innerx Then

M

V x B V x

g

D inner xB

g

M M

V hx D D i

a a

k

a a inner a B D

a

k g

a a st

k

inner a B D

x

a

g k

k

M O M

V D

a

i

ik

inner a B D a a

a

g k

a a st

k

D inner a B

x

a

g k

k

O M M

V D

a

i

ik

inner a B D

a

i

g

i

a a

k

a a st

k

x

M O

V a B

g i

ik

a a

k

a a st

k

x

Now for elements x B we can substitute Equation into Equation to yield

M O

V a if a B

i i

V x B

g

V a B if a B

g i i

ik

a a

k

a a st

k

x

completing the pro of

Theorem

For items x B and g

O M

Z b B if b B

C B

g

V a Z x B

A

i g

Z b if b B

j

i k

a a

k

xa ja a b st

j

k

b

j

Proof Dene MakeOuterj D D D to b e a function that puts together

a a

b

k

the sp ecied trees to form an outer tree for a Then

j

Z x B

g

M

Z D

D outer xB

g

M M

j

Z MakeOuterj D D D

a a

b

k

j

a a innera D

a

k

ja a b st xa

j

k

innera D

b

a

k

k

D outer bB

b g

M M O

V D Z D

a

b

i

j

j

i k

a a innera D

a

k

ja a b st xa

j

k

D innera

b

a

k

k

D outer bB

b g

M M O M

Z D V D

a

b

i

j

D outer bB

j

b g

i k

innera D a a

a

k

xa ja a b st

j

k

D innera

b

a

k

k

B C

B C

M M O M

B C

B C

V D Z D

a

b

i

B C

B C

j

D outer bB

j

b g

i k

A

a a innera D

a

k

ja a b st xa

j

k

D innera

b

a

k

k

O M M

B C

Z b B V D

A

g a

i

j

innera B D

a

i

i k

i

a a

k

xa ja a b st

j

k

b

O M

C B

Z b B V a

A

g i

j

i k

a a

k

xa ja a b st

j

k

b

Now substituting Equation into Equation we get

O M

Z b B if b B

C B

g

V a Z x B

A

i g

Z b if b B

j

i k

a a

k

xa ja a b st

j

k

b

Theorem

The forward semiring parser interpreter correctly computes the value of all items

Sketch of proof We show that every item in each bucket has its value correctly

computed The pro of is by induction on the buckets The base case is the rst bucket Since

items in each bucket dep end only on the values of items in preceding buckets items in the

rst bucket dep end on no previous buckets Now if the rst bucket is a lo oping bucket the

interpreter uses an implementation of Equation previously shown correct in Theorem

If the rst bucket is a nonlo oping bucket the interpreter uses an implementation of

Equation previously shown correct in Theorem Since these equations refer to items

in the rst bucket and such items do not dep end on values outside the rst bucket other

than rule values the rst bucket is correctly computed

For the inductive step consider the current bucket of the lo op Assume that the values of

all items in all previous buckets have b een correctly computed Then dep ending on whether

the current bucket is a lo oping bucket or not either an implementation of Equation or

is used These equations only dep end on other values in the current bucket and on

values in previously correctly computed buckets so the values in the current bucket are

correctly computed

Thus by induction the values of all items in all buckets are correctly computed

We now discuss the renormalization parser of Figure We claim that using the

equation

V A Z A

P

P A

new

V A Z A

yields new probabilities that pro duce the same trees with the same probabilities as the

original grammar and that clearly these probabilities sum to for each nonterminal which

they may not have done in the original grammar The intuition b ehind this statement is

that Equation is simply the insideoutside reestimation formula We have applied it here

to the probability distribution of all trees pro duced by the original grammar but this should

b e very similar to applying the insideoutside formula to a very large approaching innite

sample of trees from that distribution Since the insideoutside probabilities approach a

lo cal optimum and since a grammar pro ducing the same trees with the same probabilities

will b e optimal we exp ect to achieve that grammar

We will now give a formal pro of that the resulting trees have the same probabilities as

the original trees sub ject to one caveat In particular we do not know how to show that the

resulting probability distribution is tight that is that the sum of the probabilities of all of

the trees equals Not all grammars are tight For instance S SS S a is not

tight Chi and Geman showed that CNF grammars pro duced by the insideoutside

reestimation formula are tight but it is not clear whether is has ever b een shown for the

case of grammars with unary pro ductions

Theorem

Assuming that the insideoutside reestimation formula pro duces tight grammars and that

the original grammar is tight the grammar pro duced using the renormalization parser of

Figure and Equation pro duces a grammar that pro duces an identical probability

distribution over trees as the original grammar but with each rule having a probability

b etween and

Proof We will show by induction that for each leftmost derivation D of length

at most k the value of all derivations starting with D is the same in the new grammar as

the old

The base case follows from our assumption that b oth the new grammar and the old

grammar are tight since the sum of all derivations starting with the empty derivation

k is the sum of the probability of all trees or by assumption in b oth cases

Next for the inductive step assume the theorem for k Consider a leftmost deriva

tion D of length k We can split D into a leftmost derivation E of length k followed by

E

the single step A S w w A w w We show that the sum of the values

j j

of all derivations that b egin this way is the same in b oth old and new grammars Since

the new grammar is by assumption tight and has rule probabilities that sum to one the

probability of all derivations starting with D is just P D

D E

P S w w P S w w A w w

new j new j j

E

P S w w A P A

new j new

V A

E

P

P S w w A

new j

V A

Now we consider values in the original grammar Let the value of A V A b e

P

V A Let the nonterminal symbols in b e denoted by and let

j j

Q

D

j j

V V Notice that given a derivation of the form S w w the sum

i j

i

of the values of all derivations starting with D is the value of D times the pro duct of the

value of all of the nonterminals in

D

V S w w V V

j

E

V S w w A w w V V

j j

E

V S w w A P A V V

j

orig

E

V S w w A V A V

j

Next by the inductive assumption

E E

V S w w A V A V P S w w A

j new j

Dividing b oth sides by V A

E

P S w w A

E

new j

V S w w A V

j

V A

Substituting into Equation we get

V A

D E

V S w w V V P S w w A

j new j

V A

V A

E

P

P S w w A

new j

V A

which equals Expression completing the inductive step

A Viterbinb est is a semiring

In this section we show that the Viterbinb est semiring as describ ed in Section has

all the required prop erties of a semiring and is continuous Recall that the Viterbin

b est semiring is a homomorphism from the Viterbiall semiring We rst show that the

Viterbiall semiring is continuous

We b egin by arguing that the Viterbiall semiring has all of the prop erties of an

continuous semiring saving the most complicated prop erty the distributive prop erty for

last It should b e clear that and are asso ciative To show that the semiring is

continuous we rst note that the prop er convergence of innite sums follows directly from

the prop erties of union as do es the asso ciativity of innite sums unions The only com

plicated prop erty is that distributes over even in the innite case We sketch this pro of

quickly We need to show that for any set of Y and any I

i

X Y Y X

i i

iI iI

Recall the denition of

X Y fhv w d eijhv di X hw ei Y g

S

Y It must b e the case that v X and First consider an element hv w d ei X

i

iI

S

w Y for some v w i So then hv w d ei X Y A reverse argument also holds

i i

iI

proving equality Technically we also need to show distributivity in the opp osite direction

S S

Y X but this follows from an exactly analogous argument Y X that

i i

iI iI

Now we will show that our denitions of op erations in the Viterbinb est semiring are

well dened and then show that topn really do es dene a homomorphism Recall our

denitions max A B C if and only if there is some X Y in the Viterbiall semiring

Vitn

such that such that topn X A topn Y B and topn X Y C We need to show

that this denes a one to one relationship that there is exactly one C which satises

this relationship for each A and B The existence of at least one such C follows from

the denition of the Viterbinb est semiring There must b e at least one X Y such that

topn X A and topn Y B and thus at least one C topn X Y

Now consider any X X such that topn X topn X and Y Y such that topn Y

topn Y To show that C is unique we need to show that topn X Y topn X Y

We will say that an element of a set Z is simple if there are at most a nite number of

larger elements in Z and complex otherwise We rst show equality of the simple elements

in the two sets Consider a simple element z in topn X Y There are at most n

larger elements in X Y z must have b een an element of either X or Y or b oth Assume

without loss of generality that it was in X Then since there are at most n larger

elements in X Y there are at most n larger elements in X and thus z topn X

Since topn X topn X z topn X and therefore z X and z X Y By

analogous reasoning for any simple element z topn X Y z X Y Now it must

b e the case that z topn X Y since each element of topn X Y is in X Y and

if there were n elements larger than z in X Y then each of these elements would b e in

topn X Y and thus in X Y which would prevent z from b eing in topn X Y By

analogous reasoning every simple element z topn X Y is also in topn X Y Thus

the simple elements of the two sets are equal

Now let us consider the innite elements hv i of topn X Y There are several

cases to consider involving the p ossibilities where zero one or b oth of X Y have an innite

element We only consider the most complicated case in which b oth do Notice that if

there are n or more simple elements in X and Y then topn X Y will not have an innite

element so we need only consider the case where there are at most n simple elements

in X Y Then the innite element of topn X Y will just b e the supremum of all of the

complex elements of X Y which will equal the supremum of the complex elements of X

union the complex elements of Y which will equal the maximum of the supremums of the

complex elements of X and the complex elements of Y which will equal the maximum of

the innite elements of topn X and topn Y The same reasoning applies to any X Y

and since if topn X topn X and topn Y topn Y their innite elements must b e

the same the maximum of their innite elements must b e the same and thus the innite

elements of topn X Y and topn X Y must b e the same Since we have already shown

equality of the nite elements this shows equality overall and thus shows the uniqueness

of C in our denition of additive op eration max Therefore max is well dened

Vitn Vitn

Next we must go through very similar reasoning for showing that the multiplicative

op erator is well dened Recall that we dened A B C if and only if there

Vitn Vitn

is some X Y in the Viterbiall semiring such that such that topn X A topn Y B

and topn X Y C There is at least one such C equal to topn X Y and now we

need to prove its uniqueness If x hv di and y hw ei then we will write xy to indicate

the pro duct hv w dei We examine rst the simple elements of X and Y Consider a simple

element z topn X Y z xy for some x X y Y Now x topn X otherwise

the n larger elements of X multiplied by y would result in n larger elements in X Y

meaning that z topn X Y Similarly y topn Y Thus z X Y Similarly each

z topn X Y is in X Y Following the same reasoning as b efore the simple elements

of the two sets are equal

Now consider the innite elements Again we will consider only the case where b oth

topn X and topn Y have an innite element If there is an innite element in topn X Y

it must b e formed in one of three ways by taking the top element of x and multiplying it

by the elements approaching a supremum in Y by taking the top element of y multiplied

by the elements approaching a supremum in X or if there is no top element in X or Y by

taking the elements approaching the supremum in X and multiplying them by the elements

approaching the supremum in Y Thus the innite element of topn X Y is the maximum

of these three quantities which will b e same for any X such that topn X topn X and

Y such that topn Y topn Y

Now that we have shown that b oth max and are welldened the asso ciative and

Vitn

Vitn

distributive prop erties in the Viterbinb est semiring follow automatically We simply map

the expression on the left side backwards from the Viterbinb est semiring into expressions in

the Viterbiall semiring map the expression on the right side into the Viterbiall semiring

and then use the appropriate prop erty in the Viterbiall semiring to show equality We

show how to use this technique for the asso ciativity of max the other prop erties follow

Vitn

analogously We simply note that since our homomorphism is onto for any A B C in the

Viterbinb est semiring there must b e some X Y Z in the Viterbiall semiring such that

A topn X B topn Y and C topn Z Consider

max max A B C

Vitn Vitn

There must b e some D max A B topn X Y Then max max A B C max

Vitn Vitn Vitn Vitn

D C topn X Y Z By analogous reasoning max A max B C topn X Y Z

Vitn Vitn

and thus by the asso ciativity of the asso ciativity of max is proved Using the same

Vitn

reasoning we can show asso ciativity of and distributivity

Vitn

Now we must show that the Viterbinb est semiring is complete has the asso ciative and

distributive prop erty for innite sums and is continuous To do this we need to show

that our homomorphism works even for innite sums Then by the same mapping argument

we just made we can show asso ciativity distributivity and continuity for innite sums

as well

such that topn X topn X We need to show that Consider X X and X X

i

i

S S S S

topn X topn Let X X X and let X Consider a simple element X

i i

i i i i

i i

z in topn X Our reasoning will b e nearly identical to the nite case and is included here

for completeness There are at most n larger elements in X For some i z must have

b een an element of X Then since there are at most n larger elements in X there are

i

at most n larger elements in topn X and thus z topn X Thus z topn X and

i i

i

and z X By analogous reasoning for any simple element z topn X therefore z X

i

z X Now it must b e the case that z topn X since each element of topn X is in

X and if there were n elements larger than z in X then each of these elements would b e

in topn X and thus in X which would prevent z from b eing in topn X By analogous

reasoning every simple element z topn X is also in topn X Thus the simple elements

of the two sets are equal

Now consider the case where there is an innite element in topn X this case is some

what more complex Let

w sup

v jhv diX simpletopn X

and

w sup

0 0

v jhv diX simpletopn X

Take an innite sequence of elements in X simpletopn X approaching w Consider an

element x in this sequence x must have come from some X simpletopn X We will

i

show that there is an element of X simpletopn X which is at least as close to w as x is

i

and thus that w is at least as large as w

Here are the cases to consider

a X has at most n elements

i

b X has at most n simple elements and an innite number of complex elements

i

c X has at least n simple elements

i

In case b and thus that x X we deduce X X In case a since topn X topn X

i i

i i i

either x is a simple element in which case this reduces to case a or x is a complex element

the supremum of the complex elements is the same in Now since topn X topn X

i

i

In case c we notice b oth cases and thus there is an element at least as large as x in X

i

that not all n simple elements can b e in simpletopn X since otherwise there would b e n

simple elements in X and we would not b e concerned with the supremum of the complex

elements In particular at least the nth largest element cannot b e in simpletopn X Now

And if since topn X topn X if x is one of the n largest elements of X then x X

i i

i i

is an element x is not as large as the nth largest element then the nth largest element of X

i

of X which is larger than x and in X simpletopn X

i i

This shows that even in the case of innite sums the homomorphism works Following

the same reasoning as b efore it should b e clear how to prove asso ciativity and distributivity

for innite sums and continuity

App endix

B Additional Examples

B Graham Harrison and Ruzzo GHR Parsing

In this section we give a detailed description of a parser similar to that of Graham et al

as introduced in Section The GHR parser is in many ways similar to Earleys parser

but with several improvements including that epsilon and unary chains are b oth precom

puted and that completion is separated into two steps allowing b etter dynamic program

ming

Sikkel p gives a deduction system for GHR parsing Note however that

Sikkels parser app ears to have more than one derivation for a given parse in particular

chains of unary pro ductions can b e broken down in several ways In a b o olean parser such

as Sikkels this rep etition leads to no problems but in a general semiring parser this is not

acceptable Another issue is derivation rules using precomputed chains such as a rule using

a condition like A In a b o olean parser we can simply have a side condition A

However in a semiring parser we will need to multiply in the value of the derivation as

well Thus we will need to explicitly compute items such as A recording the value

of the derivations We will need similar items for chain rules

Figure gives an itembased GHR parser description We assume that the start

symbol S do es not o ccur on the right hand side of any rule We note that this parser do es

not work for noncommutative semirings However since there will still b e a onetoone cor

resp ondence b etween item derivations and grammar derivations with simple mo dications

it would b e p ossible to map from the item values derived here for the noncommutative

derivation semirings to the grammar values as a p ostpro cessing step

There are ve dierent item types for this parser The rst item type A is

used for two purp oses determining which elements have derivations of the form A and

for determining which items have derivations of the form A B such that and

these are the rules which form a single step in a chain of unary pro ductions We

can actually only derive two subtypes of this item A which can b e derived only

if there is a derivation of the form A and A A which can b e derived

only if there is a derivation of the form A A A Items of the type A

are derived using three dierent rules the initial axiom initial epsilon scanning and initial

unary scanning The rst rule the initial axiom simply says that if A then A

The next rule initial epsilon scanning says that if A B and B then A

And nally initial unary scanning is a technical rule which simply advances the circle

over a single nonterminal It is b ecause of this rule that we cannot use noncommutative

semirings Consider a grammar

A EBE R A EBE

E R E

Then the value of A B will b e R A EBE R E R E For a commu

tative semiring this grammar will work ne but for a noncommutative semiring the value

[Figure: Item-based description of the Graham–Harrison–Ruzzo parser. The description lists the five item forms (the precomputed epsilon and unary-chain items, the Earley-style items [i, A → α ∘ β, j, u], and the items [i, finished A, j] and [i, extendedFinished A, j]), the goal item (extendedFinished S spanning the whole sentence), and the deduction rules: Initial Axiom, Initial Epsilon Scanning, Initial Unary Scanning, Unary Axiom, Unary Completion, Initialization, Epsilon Scanning, Prediction, Completion, Terminal Finishing, Finishing, and Extended Finishing.]

Figure : Graham, Harrison, Ruzzo parser.

is incorrect, since we would need to put the yet-to-be-determined value of B's derivation between the two R(E → ε)'s. Essentially, this is the same reason that our grammar transformations only work for commutative semirings. In some sense, then, the GHR parser performs a grammar transformation to CNF and then parses the transformed grammar. Of course, for many noncommutative semirings, such as the derivation semirings, there will still be a one-to-one correspondence between item derivations and grammar derivations; with slight modifications, we could easily map from the transposed item derivation to the corresponding correct grammar derivation.

The next item type, (A ⇒* B), can be derived only if there is a derivation of the form A ⇒* B. This item is derived with two rules. The unary axiom simply states that A ⇒* A. Unary completion states that if A ⇒* B and B ⇒ C, then A ⇒* C. It is important that unary completion uses single-step items of the form (B ⇒ C), rather than items (B ⇒* C), since this makes sure that there is a one-to-one correspondence between item derivations and grammar derivations.

The next item type is [i, A → α ∘ β, j, u]. This is almost exactly the usual Earley-style item: it can be derived only if S ⇒* w_1 ⋯ w_i A γ and α ⇒* w_{i+1} ⋯ w_j. There is one additional condition, however. We wish to precompute ε and unary derivations. Therefore, in order to avoid duplicate derivations, we need to make sure that we do not recompute ε or unary derivations during the body of the computation. Thus, we keep track of how many non-ε derivations have been used to compute [i, A → α ∘ β, j, u], and encode this in u. If u is zero, only ε rules have been used; if one, then one non-ε rule has been used; and if two, then at least two non-ε rules have been used. u can only take the values 0, 1, or 2.

We wish to collapse unary derivations. We can do this using two more item types. Roughly, we can deduce [i, finished A, j] if there is an item of the form [i, A → γ ∘, j, 2], i.e., when there is a derivation of the form A ⇒ w_{i+1} ⋯ w_j using at least two non-epsilon rules. We require the last element to be 2 to avoid recomputing unary chains or epsilon chains. We will also derive such an item if A = S and the last element is 0; this allows us to recognize sentences of the form S ⇒* ε. The second item type, [i, extendedFinished A, j], can be derived only if there is a derivation of the form A ⇒ w_{i+1} ⋯ w_j. The difference between this item type and the previous one is that there are now no restrictions on unary branches and non-epsilon rules; this item type captures unary branching extensions.

These items are derived using many rules. The initialization rule is just the initialization rule of Earley parsing. The epsilon scanning rule is new; it allows us to skip over nonterminals A such that A ⇒* ε. The prediction rule is the same as the Earley prediction rule. We thus implement prediction incrementally, without collapsing unary chains; Graham et al. (p. ) suggest this as one possible implementation of prediction. With a few more deduction rules, we could implement prediction more efficiently, in a manner analogous to the finishing rules.

Completion is the same as Earley completion, with a few caveats. First, we use the extended finishing items rather than the items used in Earley's algorithm; this automatically takes into account unary chains. Also, we increment the non-epsilon count, not letting it go above two.

We use the terminal finishing rule to handle terminal symbols, rather than the scanning rule of Earley's algorithm. This lets us take into account unary chains using the terminal symbols. The finishing rule simply computes items of the form [i, finished A, j], as previously described, and the extended finishing rule takes the finished items and extends them with precomputed unary chains.

Notice that the value of the precomputed epsilon and unary-chain items, which are the only items in looping buckets, can be computed without the input sentence; thus the value of these items can be computed offline, from the grammar alone. Notice also that the reverse values of these items require the input sentence. The item-based description format makes it easy to specify parsers that use offline computation for some values and online computation for other, similar values.

B Beyond ContextFree

In this section we consider the problem of formalisms that are more p owerful than CFGs

We will show that these formalisms p ose a slight problem for the conventional algebraic

treatment of formal language theory but that we can solve this problem without much

trouble In particular in the conventional view of formal language theory combined with

algebra a language is describ ed as a formal p ower series The terms of the formal p ower

series are strings of the language the co ecients give the values of the strings These

formal p ower series are most readily describ ed as sets of algebraic equations the formal

p ower series represents a solution to the equations Sets of algebraic equations used in this

way can only describ e contextfree languages On the other hand it is straightforward to

describ e a Tree Adjoining Grammar TAG parser which also can b e transformed into a set

of algebraic equations Since Tree Adjoining Grammars are more p owerful than contextfree

languages this is a useful result

Formal Language Theory Background

At this point, we need to discuss some results from algebra/formal language theory. One of the primary results is that there is a one-to-one correspondence between CFGs and certain sets of algebraic equations.

We must begin by defining a formal power series. A formal power series A⟨⟨Σ*⟩⟩ is a semiring using elements from a semiring A and an alphabet Σ. Elements in A⟨⟨Σ*⟩⟩ map from strings in Σ* to elements of the semiring A. If s is an element of A⟨⟨Σ*⟩⟩, we will write (s, u) to indicate the value s maps u to. Essentially, s can be thought of as a language; A is the semiring used to assign values to strings of the language. Let B represent the booleans; then formal power series s ∈ B⟨⟨Σ*⟩⟩ represent formal languages: strings u such that (s, u) = TRUE correspond to the elements in the language. If A were the inside semiring, then (s, u) would equal P(u), the probability of the string.

We can define sums s ⊕ t in A⟨⟨Σ*⟩⟩ as

    (s ⊕ t, u) = (s, u) ⊕ (t, u).

We can define products s ⊗ t in A⟨⟨Σ*⟩⟩ as

    (s ⊗ t, u) = ⊕_{u = u₁u₂} (s, u₁) ⊗ (t, u₂).

The product and sum definitions are analogous to the way products are defined for polynomials in variables that do not commute, which is why A⟨⟨Σ*⟩⟩ is called a formal power series; this is, however, probably not the best way to think of A⟨⟨Σ*⟩⟩. We can define multiplication by a constant t ∈ A, t ⊗ s, as

    (t ⊗ s, u) = t ⊗ (s, u).
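To make these definitions concrete, the following small sketch (ours, not part of the thesis) represents a formal power series with finite support as a dictionary from strings to semiring values, with the semiring's operations passed in explicitly; the sum is pointwise and the product sums over all splits of a string, following the definitions above.

    from itertools import product as cartesian

    def series_sum(s, t, plus, zero):
        # (s + t)(u) = s(u) + t(u), pointwise over all strings appearing in either series
        return {u: plus(s.get(u, zero), t.get(u, zero)) for u in set(s) | set(t)}

    def series_product(s, t, plus, times, zero):
        # (s * t)(u) = sum over all splits u = u1 u2 of s(u1) * t(u2)
        result = {}
        for u1, u2 in cartesian(s, t):
            u = u1 + u2
            result[u] = plus(result.get(u, zero), times(s[u1], t[u2]))
        return result

    # Boolean semiring: plus = or, times = and, zero = False
    s = {"x": True, "xy": True}
    t = {"z": True}
    print(series_product(s, t, lambda a, b: a or b, lambda a, b: a and b, False))
    # -> {'xz': True, 'xyz': True}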

Now, a formal power series is called algebraic if it is the solution to a set of algebraic equations. Consider the formal power series

    z ⊕ xzy ⊕ xxzyy ⊕ xxxzyyy ⊕ xxxxzyyyy ⊕ ⋯

in B⟨⟨Σ*⟩⟩. This formal power series is a solution to the following algebraic equations in B:

    S = xSy ⊕ Z
    Z = z

which is very similar to the CFG

    S → x S y
    S → Z
    Z → z.

The preceding example is in the boolean semiring, but could be extended to any continuous semiring by adding constants to the equations. In general, given a CFG and a rule value function R, we can form a set of algebraic equations. For instance, for a nonterminal A with rules such as

    A → α₁   with value R(A → α₁)
    A → α₂   with value R(A → α₂)
    A → α₃   with value R(A → α₃)

there is a corresponding algebraic equation

    A = R(A → α₁) α₁ ⊕ R(A → α₂) α₂ ⊕ R(A → α₃) α₃

with analogous equations for the other nonterminals in the grammar. Each nonterminal symbol represents a variable; each terminal symbol is a member of the alphabet Σ.

It is an important theorem from the intersection of formal language theory and algebra that for each CFG there is a set of algebraic equations in B⟨⟨Σ*⟩⟩, the solutions to which represent the strings of the language, and for each set of algebraic equations in B⟨⟨Σ*⟩⟩ there is a CFG whose language corresponds to the solutions of the equations.
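The correspondence can also be read computationally: treating each nonterminal as a variable, the least solution of the equations can be approximated by iterating them from zero. The following sketch (our own illustration; the grammar and its rule values are invented for the example) does this in the inside semiring, i.e., with ordinary probabilities.

    import math

    # Each nonterminal's equation: A = sum over rules A -> alpha of R(A -> alpha) times the
    # product of the current values of the nonterminals in alpha (terminals contribute 1).
    rules = {
        "S": [(0.6, ["S", "S"]),   # S -> S S   with value 0.6
              (0.4, [])],          # S -> x     with value 0.4 (no nonterminals on the right)
    }

    values = {nt: 0.0 for nt in rules}       # start from the zero of the semiring
    for _ in range(500):                     # iterate toward the least fixpoint
        values = {nt: sum(r * math.prod(values[c] for c in rhs) for r, rhs in expansions)
                  for nt, expansions in rules.items()}
    print(values["S"])   # approaches 2/3, the total probability S assigns to finite strings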

Tree Adjoining Grammars

In this section, we address TAGs. TAGs pose an interesting challenge for semiring parsers, since the tree adjoining languages are weakly context-sensitive. We develop an item-based description for a TAG parser, essentially using the description of Shieber et al., modified for our format and including a few extra rule values to ensure that there is a one-to-one correspondence between item and grammar derivations, which can easily be recovered. The reader is strongly encouraged to refer to that work for background on TAGs and explication of the parser.

In the TAG formalism, there are two trees that correspond to a parse of a sentence. One tree is the parse tree, which is a conventional parse of the sentence. The other tree is the derivation tree; a traversal of the derivation tree gives the rules that would actually be used in a derivation to produce the parse tree, in the order they would be used. While the parse trees of TAGs cannot be produced by a CFG, the derivation trees can. It is important to note that the derivations produced by the parser of Figure  correspond to derivation trees, not parse trees. This is in part what allows us to parse with a formalism more powerful than CFGs.

The other reason that we are able to parse with a more powerful formalism is that, while the algebraic formulation of formal language theory specifies a language as a single set of algebraic equations, we specify parsers as an input-string-specific set of algebraic equations. Recall that the way we find the values of items in a looping bucket is to create a set of algebraic equations for the bucket. Recall also that we can always place all items into a single large looping bucket. Thus, we can solve the parsing problem for any specific input string by solving a set of algebraic equations. However, the algebraic equations we solve are specific to the input string, and the number of equations will almost always grow with the length of the input string. This variability of the equations is the other factor that allows us to parse formalisms more powerful than CFGs.

B.3 Greibach Normal Form

As we discussed in Section , item-based descriptions can be used to specify grammar transformations. In Section  we showed how to use item-based descriptions to convert to Chomsky Normal Form. In this section, we show how to use these descriptions to convert to Greibach Normal Form (GNF). In GNF, every rule is of the form A → aα. We give here a value-preserving transformation to GNF, following Hopcroft and Ullman. While value-preserving GNF transformations have been given before (Kuich and Salomaa), this is the first item-based description of such a transformation.

Figure  gives an item-based description for a value-preserving GNF transformation. There is a sequence of steps in GNF transformation, so we will use items of the form (A → γ)_j, where j indicates the step number. The first step in a GNF transformation is to put the grammar in Chomsky Normal Form, which we have previously shown how to do in Section . We will assume for this subsection that our nonterminal symbols are A₁ to

[Figure: TAG parser item-based description. The items record a node of an elementary tree together with four string positions (the node's span and the span of its foot, when defined), plus a goal item for the whole sentence. The deduction rules are: Find Start, Terminal Axiom, Empty String Axiom, Foot Axiom, Complete Unary, Complete Binary, No Adjoin, and Adjoin.]

Figure : TAG parser item-based description.

A_m. The next step is conversion to ascending form:

    A_i → A_j γ,  j > i
    A_i → a γ

This will be accomplished by first putting all rules with A₁ into this form, then A₂, etc. To transform rules into this form, we substitute the right-hand sides of A_j for A_j in any rule of the form A_i → A_j γ, j < i, a process that must terminate after at most i steps. This step is done using three rules, beginning with the Rule Axiom, which creates one item of the form (A_i → γ) for every rule A_i → γ, i.e., for every rule in the grammar. We use two different item forms: one for rules not yet in ascending form, and one, with the next step subscript, for rules in ascending form (A_i → A_j γ, j > i). The Ascent Substitution rule substitutes the right-hand side of a rule in ascending form into the first nonterminal of a rule not in ascending form. The Ascent Completion rule detects that an item is in ascending form and promotes it, creating an item with the next subscript.

The next step is removal of all left-branching rules of the form A_i → A_i γ, by conversion to the form

    A_i → A_j γ,  j > i
    A_i → a γ
    B_i → γ

There are four rules that perform this operation (Right Beginning, Right Single Step, Right Termination, and Right Continuation), using items in four different forms. The first item form, for rules A_i → A_j γ, is essentially the input to this step. The other three item forms, for rules A_i → A_j γ with j > i, A_i → a γ, and B_i → γ, are essentially the outputs of this step. Consider a left-branching derivation of the form

    A_i ⇒ A_i γ₁ ⇒ A_i γ₂ γ₁ ⇒ A_j γ₃ γ₂ γ₁,  j > i.

The right-branching equivalent will be

    A_i ⇒ A_j γ₃ B_i ⇒ A_j γ₃ γ₂ B_i ⇒ A_j γ₃ γ₂ γ₁.

The item generating the rule used in the first substitution in the right-branching form will be derived using Right Beginning; the item for the second substitution will be derived using Right Continuation; and the item for the final substitution will be derived using Right Termination. A rule of the form A_i → A_j γ, j > i, is valid in both forms; Right Single Step derives the needed item.

Now we are ready for the next, and penultimate, step. Notice that since all items of the form (A_i → A_j γ) have j > i, the last nonterminal, A_m, must have rules only of the form A_m → a γ, the target form. We can substitute A_m's productions into any production of the form A_{m−1} → A_m γ, putting all A_{m−1} productions into the target form. This recursive substitution can be repeated all the way down to nonterminal A₁. Formally, we substitute A_j

[Figure: Greibach Normal Form transformation, item-based description. The item forms are (A_i → γ)_j and (B_i → γ)_j, and the rule goal gives the value of each rule X → γ of the transformed grammar. The deduction rules are: Rule Axiom, Ascent Substitution, Ascent Completion, Right Beginning, Right Single Step, Right Termination, Right Continuation, Descent Substitution, Descent Nonsubstitution, Auxiliary Substitution, Auxiliary Nonsubstitution, and Output.]

Figure : Greibach Normal Form Transformation.

into any rule of the form A_i → A_j γ, j > i, yielding rules of the form

    A_i → a γ.

We call this penultimate step Descent Substitution. Descent Substitution is done using two rules: Descent Substitution and Descent Nonsubstitution. The inputs to this step are items in the form (A_i → A_j γ), and the outputs are in the form (A_i → a γ). Descent Nonsubstitution simply detects when an item already has a terminal as its first symbol and promotes it to the output form.

Finally, we substitute the A_j into the B_i, to yield rules of the form

    A_i → a γ
    B_i → a γ.

This step is accomplished with the Auxiliary Substitution and Auxiliary Nonsubstitution rules. Hopcroft and Ullman show that all items (B_i → γ) are actually of the form (B_i → A_j γ) or (B_i → a γ). The Auxiliary Substitution rule performs substitution into items of the first form, and the Auxiliary Nonsubstitution rule promotes items of the second form.

The output rule trivially states that items of the final form (X → γ) correspond to the rules of the transformed grammar.

Proving that this transformation is value-preserving is somewhat tedious, and essentially follows the proof that grammars can be converted to GNF given by Hopcroft and Ullman.

Appendix C

Reverse Value of Noncommutative Semirings

In this appendix, we consider the problem of finding reverse values in noncommutative semirings; the situation is somewhat more complex than in the commutative case. The problem is that, for noncommutative semirings, there is no obvious equivalent to the reverse values. Consider the following grammar:

    S → S A
    S → B
    A → ε
    B → b

Now consider the derivation forest semiring. Without making reference to a specific parser, let us call the item deriving the terminal symbol b simply B. We will denote derivations using just the nonterminals on the left-hand side, for conciseness. So, for instance, the derivation

    S ⇒ SA ⇒ SAA ⇒ BAA ⇒ bAA ⇒ bA ⇒ b,

which uses the rules S → SA, S → SA, S → B, B → b, A → ε, A → ε, will be written as simply (S S S B A A). The inside value of B will just be

    V(B) = {(B)}.

The value of the sentence is the union of all derivations, namely

    {(S B), (S S B A), (S S S B A A), (S S S S B A A A), (S S S S S B A A A A), …}.

Now, since all derivations use the item B, it should be the case that the forward value times the reverse value of B should just be the value of the sentence:

    V(B) ⊗ Z(B) = {(S B), (S S B A), (S S S B A A), (S S S S B A A A), …}.

Since V(B) = {(B)}, we get

    {(B)} ⊗ Z(B) = {(S B), (S S B A), (S S S B A A), (S S S S B A A A), …}

for some Z(B). But it should be clear that there is no such value for Z(B) in a noncommutative semiring; the problem, intuitively, is that we need to get V(B) into the center of the product, and there is no way to do that. One might think that what we need is two reverse values, a left outside value Z_L(B) and a right outside value Z_R(B), so that we can find values such that

    Z_L(B) ⊗ V(B) ⊗ Z_R(B) = {(S B), (S S B A), (S S S B A A), (S S S S B A A A), …},

but some thought will show that even this is not possible.

The solution, then, is to keep track of pairs of values. That is, we let Z(B) be a set of pairs of values, such as

    Z(B) = {⟨{(S)}, {()}⟩, ⟨{(S S)}, {(A)}⟩, ⟨{(S S S)}, {(A A)}⟩, ⟨{(S S S S)}, {(A A A)}⟩, …},

and then define an appropriate way to combine these pairs of values with inside values. This lets us define reverse values in the noncommutative case. We can use this same technique of using pairs of values to find an analog to the reverse values for most noncommutative semirings.

C.1 Pair Semirings

There are many technical details. In particular, we will want to compute infinite sums using these pairs of values, in ways similar to what we have already done. Thus, it will be convenient to define things so that we can use the mathematical machinery of semirings. Therefore, given a semiring A, we will define a new semiring P(A). Later, we will describe a property, preserving pair order, and show that if A is continuous and preserves pair order, then P(A) is continuous, allowing us to compute infinite sums just as before. We will call P(A) the pair semiring of A.

Intuitively, we simply want pairs of values, but in practice it will be more convenient to define addition if we allow pairs to occur multiple times. Therefore, we will define elements of P(A) as a mapping function from pairs ⟨a, a'⟩ to N. Thus, an element of the semiring will be r: A × A → N.

Continuing our example, if r = Z(B), then

    r(⟨E₁, E₂⟩) = 1  if E₁ = {(S ⋯ S)} with k S's and E₂ = {(A ⋯ A)} with k−1 A's, for some k ≥ 1,
                  0  otherwise.

Next, we need to define combinations of an element r ∈ P(A) with an element a ∈ A, which we will write as r(a), to evoke function application. We simply multiply a between each of the pairs. Formally,

    r(a) = ⊕_{⟨b, b'⟩ ∈ A × A} r(⟨b, b'⟩) (b ⊗ a ⊗ b'),

where the premultiplication by the integral value r(⟨b, b'⟩) indicates repeated addition in A. Furthermore, we define two elements r and s to be equivalent if, whenever we combine them with any element of A, they yield the same value. Formally,

    r ≡ s  iff  for all a, r(a) = s(a).
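The following small sketch (ours, and only an illustration) implements these operations for one concrete noncommutative semiring: sets of strings under union and elementwise concatenation, a toy stand-in for the derivation-forest semiring. Because that semiring is idempotent, multiplicities can be dropped and a plain set of pairs suffices.

    # The underlying semiring A: sets of strings; "times" is elementwise concatenation.
    def a_times(x, y):
        return frozenset(u + v for u in x for v in y)

    # An element of P(A): a set of <left, right> pairs of semiring values.
    def p_apply(r, a):
        # r(a) = sum over pairs <b, b'> of b * a * b'
        out = frozenset()
        for left, right in r:
            out |= a_times(a_times(left, a), right)
        return out

    def p_plus(r, s):
        return r | s

    def p_times(r, s):
        # <a, a'> * <b, b'> = <a b, b' a'>: the second factor is multiplied in "inside" the first
        return {(a_times(a, b), a_times(b2, a2)) for (a, a2) in r for (b, b2) in s}

    # The reverse value of B from the motivating example, truncated to three pairs:
    Z_B = {(frozenset({"S"}), frozenset({""})),
           (frozenset({"SS"}), frozenset({"A"})),
           (frozenset({"SSS"}), frozenset({"AA"}))}
    V_B = frozenset({"B"})
    print(sorted(p_apply(Z_B, V_B)))   # ['SB', 'SSBA', 'SSSBAA']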

It will be convenient to denote certain single pairs of elements of A concisely. Let ⌊a, a'⌋ indicate the element defined by

    ⌊a, a'⌋(⟨b, b'⟩) = 1  if a = b and a' = b',
                       0  otherwise.

Intuitively, this value is the set containing the single pair ⟨a, a'⟩. We will need to make frequent reference to pairs of values; we will therefore also let ⌊a, a'⌋ indicate the pair ⟨a, a'⟩, and interchange these notations freely.

(In retrospect, it might have been simpler to consider the semiring of functions from A to A, where the multiplicative operator is composition. On the other hand, the pair semiring will simplify the discussion in Section C., where the fact that all elements of the semiring have the form of multisets of pairs will be useful.)

Addition can be simply defined pairwise: if t = r ⊕ s, then for every pair ⟨a, a'⟩, t(⟨a, a'⟩) = r(⟨a, a'⟩) + s(⟨a, a'⟩). Commutativity of addition follows trivially. We denote the zero element by ⌊0, 0⌋; notice that, for all a, ⌊0, 0⌋(a) = 0.

We now show that application distributes over addition:

    (r ⊕ s)(a) = Σ_{⟨b,b'⟩} (r ⊕ s)(⟨b, b'⟩) (b ⊗ a ⊗ b')
               = Σ_{⟨b,b'⟩} (r(⟨b, b'⟩) + s(⟨b, b'⟩)) (b ⊗ a ⊗ b')
               = Σ_{⟨b,b'⟩} (r(⟨b, b'⟩) (b ⊗ a ⊗ b') ⊕ s(⟨b, b'⟩) (b ⊗ a ⊗ b'))
               = Σ_{⟨b,b'⟩} r(⟨b, b'⟩) (b ⊗ a ⊗ b') ⊕ Σ_{⟨b,b'⟩} s(⟨b, b'⟩) (b ⊗ a ⊗ b')
               = r(a) ⊕ s(a).

A similar argument shows that application also distributes over infinite sums.

Next, we show that, for any elements r, s, t, our definition of equality is consistent with our definition of addition; i.e., if r ≡ s, then r ⊕ t ≡ s ⊕ t. We simply notice that, for all a ∈ A, (r ⊕ t)(a) = r(a) ⊕ t(a) = s(a) ⊕ t(a) = (s ⊕ t)(a).

Now, when we define multiplication, we want the property that

    ⌊a, a'⌋ ⊗ ⌊b, b'⌋ = ⌊a ⊗ b, b' ⊗ a'⌋,

multiplying the first elements on the right and the second elements on the left. More generally, we define multiplication, for t = r ⊗ s, as

    t(⟨c, c'⟩) = Σ_{⟨a,a'⟩} Σ_{⟨b,b'⟩ s.t. ⟨a⊗b, b'⊗a'⟩ = ⟨c, c'⟩} r(⟨a, a'⟩) s(⟨b, b'⟩).

This definition of multiplication has the desired property. We will also write a ⊙ b to indicate the pair ⌊a ⊗ b, b' ⊗ a'⌋ formed from the pairs a = ⟨a, a'⟩ and b = ⟨b, b'⟩; this allows us to write the multiplicative formula as

    t(c) = Σ_{a} Σ_{b s.t. a ⊙ b = c} r(a) s(b).

It should be clear that ⌊1, 1⌋ is a multiplicative identity.

We now show that for any elements r s t our denition of equality is consistent with

our denition of multiplication ie if r s then r t s t and t r t s We rst

show that for all r s r sa rsa

X

r sa r sb b bab

0

bb

X X X

r c c sd d bab

0 0 0 0 0 0

bb cc dd st cdbd c b

X X

r c c sd d cdad c

0 0

cc dd

X X

A

r c c c sd d dad c

0 0

cc dd

X

r c c csac

0

cc

rsa

Now if r s then for all a

r ta rta sta s ta

Also if r s then

t r a tra tsa t sa

This shows that our denition of equality is consistent with our denition of multiplication

Multiplication is not commutative but is asso ciative meaning that r s t r s t

We show this now rst computing a simple form for r s t

X X

e c td r s r s t

c

djcde

X X X X

r asbtd

c a

djcde bjabc

X X X X

r asbtd

c a

bjabc d jcde

X X X X

r asbtd

a

b cjabc d jcde

X X X

r asb td

a

b djabde

A similar rearrangement yields the same formula for r s t showing asso ciativity

Now we need to show that addition distributes over multiplication b oth left and right

We do only the right case ie r s t r s r t The left case is symmetric

X X

r s ta bs tc r

b c st bca

X X

bsc tc r

b c st bca

X X X X

r bsc r btc

b c st bca b c st b ca

r s r ta

We have now shown all of the nontrivial prop erties of a semiring and a few of the

trivial ones as well

There is one more prop erty we must show that P A is continuous assuming that A

is continuous and assuming an additional quality preserving pair order which we dene

b elow This involves several steps The rst is to show that P A is naturally ordered

meaning that we can dene an ordering v such that r v s if and only if there exists t

such that r t s To show that this ordering denes a true partial ordering we must

show that if r v s and s v r then r s The other prop erties of a partial ordering

reexivity and transitivity follow immediately from the prop erties of addition From the

fact that r v s we know that for some t r t s Therefore we know that for all

a A r ta sa Thus ra ta sa Now since A is also continuous it is also

naturally ordered implying that ra v sa Similarly from the fact that s v r we conclude

that sa v ra Thus ra sa which shows that r s

The next step in showing continuity is to show that P A is complete meaning that we

must show that innite sums are commutative and satisfy the distributive law Asso ciativity

of innite sums means that given an index set J and disjoint index sets I for j J

j

M M M

r r

i i

S

j J iI

j

I i

j

j

This prop erty follows directly from the fact that N is complete We also must show that

multiplication distributes b oth left and right over innite sums We show one case the

other is symmetric We use a simple rearrangement of sums which again relies on the fact

that N is complete

X X X M

r s c r as b

i i

a iI iI

bjabc

X X X

r a s b

i

a iI

bjabc

M

c s r

i

iI

The nal step in showing continuity is to show that for all s for all sequences r

i

L L

r v s We do not know how to show this for r v s for all n N then if

i i

iN in

continuous semirings A in general although continuity of A is certainly a requirement

We therefore make an additional assumption which is true of all semirings discussed in this

chapter we assume that if for all a A ra v sa then r v s If this assumption is true

for a semiring A we shall say that A preserves pair order Now assuming A preserves pair

order it is straightforward to show that P A is continuous Given a sequence r let

i

L L L

r r r Now for all n r v r since r r and let r r

i i n n i n

in in i

Notice that for any t t v s implies that for all a ta v sa Therefore since for all n and

all a r v s we conclude that

n

r a v sa

n

Now it is a prop erty of continuous semirings that Kuich p

M M

a a sup

i i

n

in i

Notice that

M M

A

r a a r r a

n i i

in in

and

M M

A

ra a r r a

i i

i i

Thus

sup r a ra

n

n

From Equation and Equation we conclude that for all a ra v sa and from this

and our assumption that A preserves pair order we conclude that r v s which shows that

P A is an continuous semiring

C.2 Specific Pair Semirings

Now we can discuss specific semirings, showing that they preserve pair order and showing how to efficiently implement P(A). We first note that, for any commutative semiring A, P(A) is isomorphic to A: the equivalence classes of P(A) are in direct correspondence with the elements of A, and application r(a) is equivalent to multiplication. It is thus straightforward to show that A preserves pair order. Thus, the formulae we will give in the sequel for paired semirings hold equally well for all commutative semirings.

Next, we notice that for all of the noncommutative semirings discussed in this chapter (the three derivation semirings), if a ⊑ b then a ⊕ b = b. Now, for such a semiring, if for all a, r(a) ⊑ s(a), then for all a, r(a) ⊕ s(a) = s(a), and thus r ⊕ s ≡ s, which implies that r ⊑ s. Thus, all of the noncommutative semirings discussed in this chapter preserve pair order.

Implementations of the paired versions of the three derivation semirings are straightforward. To implement the pair-derivation semiring, we simply keep sets of pairs of derivation forests. Recall that for the Viterbi-derivation semiring, in theory we keep a derivation forest of all top-scoring values, but practical implementations typically keep just an arbitrary element of this forest. In a theoretically correct implementation of the pair-Viterbi-derivation semiring, we simply maintain a top-scoring value v and the set of pairs of derivations with value v, using the same implementation as for the pair-derivation semiring. If, in practice, we are only interested in a representative top-scoring derivation, rather than the set of all top-scoring derivations, then we can simply maintain a pair ⟨d, e⟩ of a left and a right derivation, along with the value v. Similar considerations are true for both theoretical and practical implementations of the pair-Viterbi-n-best semiring.
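A minimal sketch of the practical representation just described (the class and method names are ours): a single representative outer derivation stored as a left/right pair of rule sequences, together with its score; combining with an inside derivation wraps it between the two halves.

    class PairViterbiDerivation:
        """One representative outer derivation, stored as a (left, right) pair of rule
        sequences, together with its score v."""
        def __init__(self, value, left, right):
            self.value, self.left, self.right = value, left, right

        def apply(self, inside_value, inside_derivation):
            # Wrap the inside derivation with the outer pair: left + inside + right,
            # multiplying the scores (here, probabilities).
            return self.value * inside_value, self.left + inside_derivation + self.right

    # e.g. PairViterbiDerivation(0.5, ["S->SA"], ["A->eps"]).apply(0.25, ["S->B", "B->b"])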

C.3 Derivation of Non-Commutative Reverse Value Formulas

Now we can rederive the reverse formulas but using noncommutative semirings and the

corresp onding pair semirings These derivations almost exactly follow the corresp onding

derivations in commutative semirings

For noncommutative semirings A Z x will represent a value in P A We will construct

reverse values in such a way that

M

V D C D x Z xV x

D a derivation

which is directly analogous to Equation

Recall that an outer tree O has a hole in it from where some inner tree headed by x

was deleted We now dene the left and right reverse values of an outer tree O Z O and

L

Z O to b e the pro duct of the values of the rules to the left and to the right of the hole

R

resp ectively

O

R r Z O

L

r D to left

O

R r Z O

R

r D to right

Notice that these are values in A Now we will dene the value of an outer tree We will

use the same notation Z O as we used for noncommutative semirings for consistency

which we hop e will not lead to confusion If these formulas were applied to commutative

semirings the values would b e the same as with the original formulas so this redenition

should not b e problematic

We dene the value of Z O to b e the pair of values of the rules to the left and right of

the hole

Z O bZ O Z O c

L R

which is a value in P A Then the reverse value of an item can b e dened just as b efore

in Equation

M

Z x Z D

D outerx

That is the reverse value of x is the sum of the values of each outer tree of x this time in

the pair semiring

Next we show that this new denition of reverse values has the prop erty describ ed by

Equation following almost exactly the pro of of Theorem

Theorem

M

V D C D x Z xV x

D a derivation

Proof

M

A

Z xV x V x Z O

O outerx

M M

A A

Z O V I

O outerx I innerx

M M

A A

V I bZ O Z O c

L R

I innerx O outerx

M M

A

bZ O Z O c V I

L R

O outerx I innerx

M M

A

Z O Z O V I

L R

O outerx I innerx

M M

Z O V I Z O

L R

I innerx O outerx

Now using the same reasoning as in Theorem it should b e clear that this last expression

L

equals the expression on the right hand side of Equation V D C D x completing

D

the pro of

There is a simple recursive formula for eciently computing reverse values analogous

to that shown in Theorem and using an analogous pro of Recall that the basic equation

for computing forward values not involved in lo ops was

k

O M

V a V x

i

i

a a

k

a a st

k

x

j

Earlier we introduced the notation k to indicate the sequence j j k

Now we introduce additional notation for constructing elements of the pair semiring Let

j

k

O O O

a b a a c

i i i

j

i ij

i k

We need to show

Lemma

L N

distributes over

j

i k

j

Proof Let H H b e index sets such that for h H x is a value in A

i i

k h

i

Then

O M O M O M

x x x

h h h

i i i

j

ij

ijk h H h H h H

i i i i i i

i k

O M O M

x x

h h

i i

ij

h H ijk h H

i i i i

M O O M

x x

h h

i i

ij

h H ik j h H

i i i i

O M O M

b x c bx c

h h

i i

ij

ik j h H h H

i i i i

M M O O

bx c b x c

h h

i i

ij

h H h H h H h H ik j

j j j j

k k

M O O

A

b x c bx c

h h

i i

j

ij

ik j

h H h H

k k

O O M

x x

h h

i i

j

ij

ijk

h H h H

k k

M O

x

h

i

j j

h H h H i k

k k

Now we can give a simple formula for computing reverse values Z x not involved in

lo ops

Theorem

For items x B where B is nonlo oping

M O

Z b Z x V a

i

j

i k

a a

k

xa ja a b st

j

k

b

unless x is the goal item in which case Z x b c the multiplicative identity of the pair

semiring

Proof We b egin with the goal item As b efore the outer trees of the goal item

are all empty Thus

M

Z goal bZ D Z D c

L R

D outergoal

bZ fhig Z fhigc

L R

O O

R r c R r b

r fhig r fhig

b c

Using the same observation as in Theorem that every outer tree of a D can

j a

j

b e describ ed as a combination of the surrounding outer tree of b D and inner trees of

b

a a

k

j

and observing that the value of such an outer tree is Z D a a where

b k

b

N

V D

j

a

i

i k

M

Z x Z D

D outerx

M M

Z D

a a

a a

k

k

ja a b st xa

D outer j

j

k

b

b

M M O

V D Z D

a

b

i

j

j

i k

a a innera D

a

k

xa ja a b st

j

k

innera D

b

a

k

k

D outer b

b

M O M M

Z D V D

a

b

i

j

D outerb

j

b

i k

innera D a a

a

k

xa ja a b st

j

k

D innera

b

a

k

k

O M M

Z b V D

a

i

j

innera D

a

i

i k

i

a a

k

xa ja a b st

j

k

b

O M

Z b V a

i

j

i k

a a

k

xa ja a b st

j

k

b

completing the general case

The case for lo oping buckets is as usual somewhat more complicated but follows

Theorem very closely As in the commutative case it requires an innite sum This

innite sum is why we so carefully made sure that the pair semiring is continuous we

wanted to make sure that we could easily handle the innite case by using the prop erties

of continuous semirings

Recall that outer x B represents the set of outer trees of x with generation at most

g

g Now we can dene the left and right g generation reverse value of an item x in bucket

B

M

Z x B bZ D Z D c

g L R

D outer xB

g

Since P A is an continuous semiring an innite sum is equal to the supremum of the

partial sums

M

Z D Z x B sup Z x B

g

g

D outerxB

Thus as b efore we wish to nd a simple formula for Z x B

g

Recall that for x in a bucket following B Z x B Z x and that for x B

g

Z x B providing a base case We can then address the general case for g

Theorem

For x B and g

O M

Z b B if b B

g

V a Z x B

i g

Z b if b B

j

i k

a a

k

xa ja a b st

j

k

b

j

Proof Recall that MakeOuterj D D D is a function that puts together

a a

b

k

the sp ecied trees to form an outer tree for a Then

j

Z x B

g

M

Z D

D outer xB

g

M M

j

Z MakeOuterj D D D

a a

b

k

D outer bB a a

b g k

xa ja a b st

j

k

j

innera D

b

a

innera D

a

k

k

O M M

V D Z D

a

b

i

j

i k

a a D outer bB

k b g

ja a b st xa

j

k

j

innera D

b

a

innera B D

a

k

k

M M O M

V D Z D

a

b

i

j

D outer bB

j

b g

i k

a a innera D

a

k

ja a b st xa

j

k

D innera

b

a

k

k

O M M

Z b B V D

g a

i

j

innera B D

a

i

i k

i

a a

k

xa ja a b st

j

k

b

O M

Z b B V a

g i

j

i k

a a

k

xa ja a b st

j

k

b

Substituting Equation into Equation we get

O M

Z b B if b B

g

V a Z x B

i g

Z b if b B

j

i k

a a

k

xa ja a b st

j

k

b

Chapter

Maximizing Metrics

It is well known how to find the parse with the highest probability of being exactly right. In this chapter, we show how to use the inside–outside probabilities to find parses that are the best in other senses, such as maximizing the expected number of correct constituents (Goodman, b). We will use these algorithms in the next chapter to parse the DOP model efficiently.

Introduction

In corpus-based approaches to parsing, researchers are given a treebank, a collection of text annotated with the correct parse tree. The researchers then attempt to find algorithms that, given unlabelled text from the treebank, produce as similar a parse as possible to the one in the treebank.

Various methods can be used for finding these parses. Some of the most common involve inducing Probabilistic Context-Free Grammars (PCFGs) and then using an algorithm such as the Labelled Tree (Viterbi) algorithm, which maximizes the probability that the output of the parser, the guessed tree, is the one that the PCFG produced.

There are many different ways to evaluate the output parses. In Section , we will define them using, for consistency, our own terminology. We will often include the conventional term in parentheses for those measures with standard names. The most common evaluation metrics include the Labelled Tree rate (also called the Viterbi criterion or Exact Match rate), Consistent Brackets Recall rate (also called the Crossing Brackets rate), Consistent Brackets Tree rate (also called the Zero Crossing Brackets rate), and Precision and Recall. Despite the variety of evaluation metrics, nearly all researchers use algorithms that maximize performance on the Labelled Tree rate, even when they evaluate performance using other criteria.

We propose that by creating algorithms that optimize the evaluation criterion, rather than some general criterion, improved performance can be achieved.

There are two commonly used kinds of parse trees. In one kind, binary branching parse trees, the trees are constrained to have exactly two branches for each nonterminal node. In the other kind, n-ary branching parse trees, each nonterminal node may have any number of branches; however, in this chapter we will only be considering n-ary trees with at least two branches.

In the first part of this chapter, we discuss the binary branching case. In Section , we define most of the evaluation metrics used in this chapter and discuss previous approaches. Then, in Section , we discuss the Labelled Recall algorithm, a new algorithm that maximizes performance on the Labelled Recall rate. In Section , we discuss another new algorithm, the Bracketed Recall algorithm, that maximizes performance on the Bracketed Recall rate (closely related to the Consistent Brackets Recall, or Crossing Brackets, rate). In Section , we present experimental results using these two algorithms on appropriate tasks and compare them to the Viterbi algorithm. Next, in Section , we show that optimizing a similar criterion, the Bracketed Tree rate, which resembles the Consistent Brackets Tree rate, is NP-Complete.

In the second part of the chapter, we extend these results in two ways. First, in Section , we show that the algorithms of the first part are both special cases of a more general algorithm, and give some potential applications for this more general algorithm. Second, in Section , we generalize the algorithm further, so that it can handle n-ary branching parses. In particular, we give definitions for Recall, Precision, and a new, related measure, Mistakes. We then show how to maximize the weighted difference of Recall minus Mistakes, which we call the Combined rate. Finally, we give experimental results for the n-ary branching case.

Evaluation Metrics

In this section, we first define some of the basic terms and symbols, including parse tree, and then specify the different kinds of errors parsing algorithms can make. Next, we define the different metrics used in evaluation. Finally, we discuss the relationship of these metrics to parsing algorithms.

In this chapter we spend quite a bit of time on the topic of parsing metrics, and the use of a consistent naming convention will simplify the discussion. However, this consistent naming convention is not standard, and so will only be used in this chapter. A glossary of these terms is provided in Appendix B.

Basic Definitions

In this chapter, we are primarily concerned with scoring parse trees and finding parse trees that optimize certain scores. Parse trees are typically scored by the number of their constituents that are correct according to some measure, so it will be convenient in this chapter to define parse trees as sets of constituents. As we have done throughout this thesis, we will let w₁ ⋯ w_n denote the sequence of terminals (words) in the sentence under discussion. Then we define a parse tree T as a set of constituents ⟨i, X, j⟩. A triple ⟨i, X, j⟩ indicates that w_i ⋯ w_{j−1} can be parsed as a terminal or nonterminal X. The triples must meet the following requirements, which enforce binary branching constraints and consistency:

The sentence was generated by the start symbol S. Formally, ⟨1, S, n+1⟩ ∈ T, and for all X ≠ S, ⟨1, X, n+1⟩ ∉ T.

[Figure: Noncrossing and crossing constituents — three configurations of two spans, the first two labelled "Doesn't cross" and the third labelled "Crosses".]

The tree is binary branching and consistent. Formally, for every ⟨i, X, j⟩ in T with j − i > 1, there is exactly one k, Y, and Z such that i < k < j and ⟨i, Y, k⟩ ∈ T and ⟨k, Z, j⟩ ∈ T.

Let T_C denote the correct parse (the one in the treebank), and let T_G denote the guessed parse (the one output by the parsing algorithm). Let N_C denote |T_C|, the number of vocabulary symbols in the correct parse tree, and let N_G denote |T_G|, the number of vocabulary symbols in the guessed parse tree.
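For readers who want to experiment with these definitions, the following sketch (the nested-tuple tree encoding is our own assumption, not the thesis's) converts a binary tree into its set of ⟨i, X, j⟩ constituents, using the span convention above (the first word position is included, the last is not).

    def constituents(tree, start=1):
        """Nodes are (label, left_child, right_child); leaves are (label, word).
        Returns (set of <i, X, j> triples, position just past the subtree)."""
        label = tree[0]
        if len(tree) == 2 and isinstance(tree[1], str):      # preterminal over one word
            return {(start, label, start + 1)}, start + 1
        left_set, mid = constituents(tree[1], start)
        right_set, end = constituents(tree[2], mid)
        return left_set | right_set | {(start, label, end)}, end

    example = ("S", ("A", ("B", "b"), ("C", "c")), ("D", "d"))
    T_example, _ = constituents(example)
    # {(1,'B',2), (2,'C',3), (1,'A',3), (3,'D',4), (1,'S',4)}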

Evaluation Metrics

There are various levels of strictness for determining whether a constituent (an element of T_G) is correct. The strictest of these is Labelled Match. A constituent ⟨i, X, j⟩ ∈ T_G is correct according to Labelled Match if and only if ⟨i, X, j⟩ ∈ T_C. In other words, a constituent in the guessed parse tree is correct if and only if it occurs in the correct parse tree.

The next level of strictness is Bracketed Match. Bracketed Match is like Labelled Match, except that the nonterminal label is ignored. Formally, a constituent ⟨i, X, j⟩ ∈ T_G is correct according to Bracketed Match if and only if there exists a Y such that ⟨i, Y, j⟩ ∈ T_C.

The least strict level is Consistent Brackets (traditionally misnamed Crossing Brackets). Consistent Brackets is like Bracketed Match in that the label is ignored. It is even less strict in that the observed ⟨i, X, j⟩ need not be in T_C; it must simply not be ruled out by any ⟨k, Y, l⟩ ∈ T_C. A particular triple ⟨k, Y, l⟩ rules out ⟨i, X, j⟩ if there is no way that ⟨i, X, j⟩ and ⟨k, Y, l⟩ could both be in the same parse tree. Figure  shows examples of noncrossing and crossing constituents. In particular, if the interval ⟨i, j⟩ crosses the interval ⟨k, l⟩, then ⟨i, X, j⟩ is ruled out and counted as an error. Formally, we say that ⟨i, j⟩ crosses ⟨k, l⟩ if and only if i < k < j < l or k < i < l < j.

If T_C is binary branching, then Consistent Brackets and Bracketed Match are identical; a proof of this fact is given in Appendix , immediately following this chapter. [Footnote: This follows Pereira and Schabes, except that in our notation spans include the first element but not the last.] The following symbols denote the number of constituents that match according to each of these criteria:

L = |T_C ∩ T_G|: the number of constituents in T_G that are correct according to Labelled Match.

B = |{⟨i, X, j⟩ : ⟨i, X, j⟩ ∈ T_G and for some Y, ⟨i, Y, j⟩ ∈ T_C}|: the number of constituents in T_G that are correct according to Bracketed Match.

C = |{⟨i, X, j⟩ : ⟨i, X, j⟩ ∈ T_G and there is no ⟨k, Y, l⟩ ∈ T_C crossing ⟨i, X, j⟩}|: the number of constituents in T_G correct according to Consistent Brackets.
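These counts are straightforward to compute from the two constituent sets; a minimal sketch (using the representation from the earlier converter):

    def crosses(span1, span2):
        (i, j), (k, l) = span1, span2
        return i < k < j < l or k < i < l < j

    def match_counts(T_G, T_C):
        """T_G, T_C: sets of <i, X, j> triples.  Returns (L, B, C) as defined above."""
        L = len(T_G & T_C)
        spans_C = {(i, j) for (i, _, j) in T_C}
        B = sum(1 for (i, _, j) in T_G if (i, j) in spans_C)
        C = sum(1 for (i, _, j) in T_G
                if not any(crosses((i, j), (k, l)) for (k, l) in spans_C))
        return L, B, C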

Following are the definitions of the six metrics used in this chapter for evaluating binary branching trees.

Labelled Recall Rate = L/N_C.

Labelled Tree Rate = 1 if L = N_C, and 0 otherwise. This metric is also called the Viterbi criterion or the Exact Match rate.

Bracketed Recall Rate = B/N_C.

Bracketed Tree Rate = 1 if B = N_C, and 0 otherwise.

Consistent Brackets Recall Rate = C/N_G. This metric is often called the Crossing Brackets rate. In the case where the parses are binary branching, this criterion is the same as the Bracketed Recall rate.

Consistent Brackets Tree Rate = 1 if C = N_G, and 0 otherwise. This metric is closely related to the Bracketed Tree rate. In the case where the parses are binary branching, the two metrics are the same. This criterion is also called the Zero Crossing Brackets rate.

The preceding six metrics each correspond to cells in the following table:

                          Recall        Tree
    Consistent Brackets   C/N_G         1 if C = N_G, 0 otherwise
    Bracketed             B/N_C         1 if B = N_C, 0 otherwise
    Labelled              L/N_C         1 if L = N_C, 0 otherwise

We will define metrics for n-ary branching trees, including Precision and Recall, in Section .

Maximizing Metrics

Although several metrics are available for evaluating parsers, there is only one metric most parsing algorithms attempt to maximize, namely the Labelled Tree rate. That is, most parsing algorithms attempt to solve the following problem:

    argmax_{T_G} E( 1 if L = N_C, 0 otherwise | M ).

Here E is the expected value operator, and M is the model under consideration, typically a probabilistic grammar. In words, this maximization finds that tree T_G which maximizes the expected Labelled Tree score, assuming that the sentence was generated by the model.

Equation  is equivalent to maximizing the probability that T_G is exactly right. The assumption that the sentence was generated by the model, or at least that the model is a good approximation, is key to any parsing algorithm. The Viterbi algorithm, the prefix probability estimate of Jelinek and Lafferty, language modeling using PCFGs, and n-best parsing algorithms all implicitly make this approximation; without it, very little can be determined. Whether or not the approximation is a good one is an empirical question. Later, we will show experiments, using grammars obtained in two different ways, that demonstrate that the algorithms we derive using this approximation do indeed work well. Since essentially every equation in this chapter requires this approximation, we will implicitly assume the conditioning on M from now on.

The maximization of Equation  is used by most parsing algorithms, including the Labelled Tree (Viterbi) algorithm, stochastic versions of Earley's algorithm (Stolcke), and variations such as those used in Picky parsing (Magerman and Weir), and in current state-of-the-art systems such as those of Charniak and Collins. Even in probabilistic models not closely related to PCFGs, such as Spatter parsing (Magerman), Expression  is still computed. One notable exception is Brill's Transformation-Based Error-Driven system (Brill), which induces a set of transformations designed to maximize the Consistent Brackets Recall (Crossing Brackets) rate. However, Brill's system does not induce a PCFG, and the techniques Brill introduced are not as powerful as modern probabilistic techniques. We will show that, by matching the parsing algorithm to the evaluation criterion, better performance can be achieved for the more commonly used PCFG formalism and variations.

Ideally, one might try to directly maximize the most commonly used evaluation criteria, such as the Consistent Brackets Recall (Crossing Brackets) rate or Consistent Brackets Tree (Zero Crossing Brackets) rate. However, these criteria are relatively difficult to maximize, since it is time-consuming to compute the probability that a particular constituent crosses some constituent in the correct parse. On the other hand, the Bracketed Recall and Bracketed Tree rates are easier to handle, since computing the probability that a bracket matches one in the correct parse is not too difficult. It is plausible that algorithms which optimize these closely related criteria will do well on the analogous Consistent Brackets criteria.

Which Metrics to Use

When building a system, one should use the metric most appropriate for the target problem. For instance, if one were creating a database query system, such as an automated travel agent, then the Labelled Tree (Viterbi) metric would be most appropriate. This is because a single error in the syntactic representation of a query will likely result in an error in the semantic representation, and therefore in an incorrect database query, leading to an incorrect result. For instance, if the user request "Find me all flights on Tuesday" is misparsed with the prepositional phrase attached to the verb, then the system might wait until Tuesday before responding: a single error leads to completely incorrect behavior.

On the other hand, for a machine-assisted translation system, in which the system provides translations and then a human fluent in the target language manually edits them, Labelled Recall is more appropriate. If the system is given the foreign-language equivalent of "His credentials are nothing which should be laughed at," and makes the single mistake of attaching the relative clause at the sentential level, it might translate the sentence as "His credentials are nothing, which should make you laugh." With this translation, the human translator must make some changes, but certainly needs to do less editing than if the sentence were completely misparsed. The more errors there are, the more editing the human translator needs to do. Thus, a criterion such as Labelled Recall is appropriate for this task, where the number of incorrect constituents correlates with application performance.

Labelled Recall Parsing

The Labelled Recall parsing algorithm finds the tree T_G that has the highest expected value for the Labelled Recall rate, L/N_C, where L is the number of correctly labelled constituents and N_C is the number of nodes in the correct parse. Formally, the algorithm finds

    T_G = argmax_{T_G} E(L/N_C).

The difference between the Labelled Recall maximization of Expression  and the Labelled Tree maximization of Expression  may be seen from the following example. Figure  gives an example grammar that generates four trees with equal probability. For the top left tree in Figure , the probabilities of being correct are 1 for S, 1/2 for A, and 1/4 for C. Similar counting holds for the other three. Thus, the expected value of L for any of these trees is 1.75.

In contrast, the optimal Labelled Recall parse is shown at the bottom of Figure . This tree has probability 0 according to the grammar, and thus is non-optimal according to the Labelled Tree rate criterion. However, for this tree the probabilities of each node being correct are 1 for S, 1/2 for A, and 1/2 for B. The expected value of L is 2, the highest of any tree. This tree therefore optimizes the Labelled Recall rate.

Formulas

We now derive an algorithm for finding the parse that maximizes the expected Labelled Recall rate. We do this by expanding Expression  out into a probabilistic form, converting this into a recursive equation, and finally creating an equivalent dynamic programming algorithm.

We begin by rewriting Expression , expanding out the expected value operator and

    S → A C        S → A D        S → E B        S → F B
    A → x x        B → x x        C → x x        D → x x        E → x x        F → x x

Sample grammar illustrating Labelled Recall.

[Four trees in the sample grammar: (S (A x x) (C x x)), (S (A x x) (D x x)), (S (E x x) (B x x)), and (S (F x x) (B x x)).]

[Labelled Recall tree: (S (A x x) (B x x)).]

Figure : Four trees of sample grammar, and Labelled Recall tree.

removing the 1/N_C, which is the same for all T_G and so plays no role in the maximization:

    argmax_{T_G} Σ_{T_C} P(T_C | w₁ ⋯ w_n) |T_G ∩ T_C|.

This can be further expanded to

    argmax_{T_G} Σ_{T_C} P(T_C | w₁ ⋯ w_n) Σ_{⟨i,X,j⟩ ∈ T_G} { 1 if ⟨i,X,j⟩ ∈ T_C; 0 otherwise }.

Now, given a Probabilistic Context-Free Grammar G with start symbol S, the following equality holds:

    Σ_{T_C} { 1 if ⟨i,X,j⟩ ∈ T_C; 0 otherwise } P(T_C | w₁ ⋯ w_n)
        = P(S ⇒ w₁ ⋯ w_{i−1} X w_j ⋯ w_n ⇒ w₁ ⋯ w_n | w₁ ⋯ w_n).

By rearranging the summation in Expression  and then substituting this equality, we get

    argmax_{T_G} Σ_{⟨i,X,j⟩ ∈ T_G} P(S ⇒ w₁ ⋯ w_{i−1} X w_j ⋯ w_n ⇒ w₁ ⋯ w_n | w₁ ⋯ w_n).

At this point it is useful to recall the inside and outside probabilities introduced in Section . Recall that the inside probability is defined as inside(i, X, j) = P(X ⇒ w_i ⋯ w_{j−1}), and the outside probability is outside(i, X, j) = P(S ⇒ w₁ ⋯ w_{i−1} X w_j ⋯ w_n).

Let us define a new symbol, g(i, X, j):

    g(i, X, j) = P(S ⇒ w₁ ⋯ w_{i−1} X w_j ⋯ w_n ⇒ w₁ ⋯ w_n | w₁ ⋯ w_n)
               = P(S ⇒ w₁ ⋯ w_{i−1} X w_j ⋯ w_n) P(X ⇒ w_i ⋯ w_{j−1}) / P(S ⇒ w₁ ⋯ w_n)
               = outside(i, X, j) inside(i, X, j) / inside(1, S, n+1).

Now the definition of a Labelled Recall parse can be rewritten as

    T_G = argmax_{T_G} Σ_{⟨i,X,j⟩ ∈ T_G} g(i, X, j).

Pseudocode Algorithm

Given the values of g(i, X, j), it is a simple matter of dynamic programming to determine the parse that maximizes the Labelled Recall rate. Define

    MAXC(i, j) = max_X g(i, X, j) + { max_{k : i < k < j} [MAXC(i, k) + MAXC(k, j)]   if j − i > 1
                                      0                                              if j − i = 1 }.

We now show that this recursive equation is correct, with a simple proof. We will write T_{ij} to represent a subtree covering w_i ⋯ w_{j−1}, and we will write L(T_{ij}) to represent a function giving the expected number of correctly labelled constituents in T_{ij}.

Theorem: MAXC(i, j) = max_{T_{ij}} L(T_{ij}).

Proof: The proof is by induction over the size of the parse tree, j − i. The base case is parse trees of size 1, when i + 1 = j. In this case there is a single constituent in the possible parse trees. Thus

    max_{T_{ij}} L(T_{ij}) = max_{{⟨i,X,j⟩}} L({⟨i, X, j⟩}) = max_{{⟨i,X,j⟩}} g(i, X, j) = max_X g(i, X, j) = MAXC(i, j),

completing the base case.

that for trees of size greater than one we can always break down a maximal subtree

T arg max LT

ij

ij

T

ij

into a ro ot no de hi X j i and two smaller maximal child trees

T arg max LT

ik

ik

T

ik

T arg max LT

k j

k j

T

k j

Notice that if we were trying to maximize the Lab elled Tree rate the most probable child

trees would dep end on the parent nonterminal However when we maximize the Lab elled

Recall rate the child trees do not dep end on the parent as was illustrated in Figure

where we showed that in fact the joint probability of the parent nonterminal and child trees

could even b e zero Thus

max LT max g i X j max max LT max LT

ij

ik k j

T X T T

k

ij

ik k j

max g i X j max MAXCi k MAXCk j

X

k

MAXCi j if i j

which completes the inductive step

Notice that MAXC n contains the score of the b est parse according to the Lab elled

    float maxC[1..n+1, 1..n+1];
    for length := 1 to n
        for i := 1 to n − length + 1
            j := i + length;
            maxG := max_X g(i, X, j);
            if length > 1 then
                bestSplit := max_{k : i < k < j} maxC[i, k] + maxC[k, j];
            else
                bestSplit := 0;
            maxC[i, j] := maxG + bestSplit;

Figure : Labelled Recall Algorithm

Recall rate.

Equation  can be converted into a dynamic programming algorithm, as shown in Figure . For a grammar with r rules and k nonterminals, the run time of this algorithm is O(n²(k + n)), since there are two layers of outer loops, each with run time at most n, and an inner loop over nonterminals and n. However, the overall runtime is dominated by the computation of the inside and outside probabilities, which takes time O(rn³).

By modifying the algorithm slightly to record the actual split used at each node, we can recover the best parse. The entry maxC[1, n+1] contains the expected number of correct constituents, given the model.
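A minimal Python sketch of this algorithm may make the recursion concrete. It assumes the g values have already been computed from the inside and outside probabilities and stored in a dictionary keyed by ⟨i, X, j⟩ (the representation and names are our own), and it records the best split so the tree can be recovered.

    def labelled_recall_parse(g, n):
        """g[(i, X, j)] = inside-outside value of <i, X, j>, with spans over positions 1..n+1
        (end exclusive).  Returns (expected number of correct constituents, best tree)."""
        maxc, best = {}, {}
        for length in range(1, n + 1):
            for i in range(1, n - length + 2):
                j = i + length
                # best label for this span
                label, max_g = max(((X, v) for (a, X, b), v in g.items() if (a, b) == (i, j)),
                                   key=lambda p: p[1], default=(None, 0.0))
                if length == 1:
                    split, best_split = None, 0.0
                else:
                    split, best_split = max(((k, maxc[i, k] + maxc[k, j])
                                             for k in range(i + 1, j)), key=lambda p: p[1])
                maxc[i, j] = max_g + best_split
                best[i, j] = (label, split)

        def build(i, j):
            label, split = best[i, j]
            if split is None:
                return (label, i, j)
            return (label, build(i, split), build(split, j))

        return maxc[1, n + 1], build(1, n + 1)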

Item-Based Description

(For those not familiar with the notation of Chapter , this section and succeeding references to item-based descriptions may be skipped without loss of continuity.)

It is also possible to specify the Labelled Recall algorithm using an item-based description, as in Chapter , although we will need to slightly extend the notation of item-based descriptions to do so. These same extensions will be necessary in Chapter , when we describe global thresholding and multiple-pass parsing. There are two extensions that will be required. First, we may have more than one goal item, leading to each item having a separate outside value for each goal item. Second, we will need to be able to make reference to the inside–outside value of an item. We will denote the forward value of an item x in the inside semiring by V_in(x), and its reverse value with goal item goal by Z_in(x, goal). We will abbreviate the quantity V_in(x) Z_in(x, goal) / V_in(goal) by VZ(x, goal).

x goal by We will abbreviate the quantity

V

in

V goal

in

Figure gives the Lab elled Recall itembased description The description is similar

to the pro cedural version The rst step is to compute the insideoutside values We thus

have the usual items i A j in the inside semiring and the usual unary and binary rules

VZ

for CKY parsing The insideoutside values will simply b e i A j S n using

V

in

the notation we just dened Next we need to nd for each i A j the insideoutside value

and the b est sum of splits for the span These will b e stored in items of the form i A j

[Figure: Labelled Recall item-based description. Item forms: [i, A, j] (inside semiring) and ⟨i, A, j⟩ (arctic semiring, or similar). Primary goal: [1, S, n+1]; secondary goal: ⟨1, S, n+1⟩. Rules: the usual Unary and Binary CKY rules for the [i, A, j] items, and Unary Labels and Binary Labels rules for the ⟨i, A, j⟩ items, whose rule value functions supply the inside–outside value VZ([i, A, j], [1, S, n+1]) of the corresponding constituent.]

Figure : Labelled Recall Description.

To get the inside–outside value for ⟨i, A, j⟩, we use a special rule value function whose value is just the inside–outside value VZ([i, A, j], [1, S, n+1]) of the constituent ⟨i, A, j⟩, which is passed into the function as its last argument. We also need to find the best sum of splits for the span, i.e., a maximum over a sum. For this, we use a special semiring, the arctic semiring, whose operations are max and +, as defined in Section . Notice that our two different item types use two different semirings: the inside semiring for [i, A, j], and the arctic semiring for ⟨i, A, j⟩. Using the arctic semiring and our special rule value function, the Unary Labels and Binary Labels rules compute the best nonterminal label and best sum of splits for each span. These rules are identical to the usual unary and binary rules, except that they use our special rule value function, which gives the inside–outside value of the relevant constituent.
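For concreteness, the two arctic-semiring operations used here can be written out directly; a minimal sketch:

    # Arctic semiring over the reals (with -infinity as the additive identity):
    # "addition" takes the maximum of alternative derivations, and "multiplication" adds,
    # so combining rule values along a derivation sums expected counts.
    ARCTIC_ZERO = float("-inf")
    ARCTIC_ONE = 0.0
    def arctic_plus(a, b):  return max(a, b)
    def arctic_times(a, b): return a + b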

We must, of course, find the inside and outside values for the [i, A, j] before we can compute the ⟨i, A, j⟩ values; this means that the order of interpretation is changed as well. Normally, the order of interpretation is simply all of the forward values, in order, followed by all of the reverse values, in the reverse order. In this new version, we first have all of the forward values of items [i, A, j], and then, in reverse order, all of the reverse values of these items; next we have all of the forward values of ⟨i, A, j⟩, and finally, optionally, in reverse order, all of the reverse values of ⟨i, A, j⟩.

If we use the arctic semiring, we get only the maximum expected number of correctly labelled constituents, but not the tree which gives this number. To get this tree, we would use the arctic-derivation semiring. As usual, there are advantages to describing the algorithm with item-based descriptions. For instance, we can easily compute n-best derivations just by using the arctic-top-n semiring.

A short example may help clarify the algorithm. Consider the example grammar of Figure . For this grammar, the inside–outside values VZ([i, X, j], [1, S, n+1]) of the [i, X, j], and the corresponding forward arctic values of the ⟨i, X, j⟩, are as follows:

[Figure: two charts over the four-word sentence x x x x, one giving the inside–outside value of each span for each of the labels S, A, B, C, D, E, F, and one giving the corresponding forward arctic values of the ⟨i, X, j⟩.]

The key element is the value for S over the whole sentence, which is derived using the Binary Labels rule, instantiated with the special rule value function R(S → B C, ⟨i, B, k⟩, ⟨k, C, j⟩, VZ(…)); computed in the arctic semiring, where the multiplicative operator is addition, this yields the value shown.

Bracketed Recall Parsing

The Labelled Recall algorithm maximizes the expected number of correct labelled constituents. However, many commonly used evaluation metrics, such as the Consistent Brackets Recall (Crossing Brackets) rate, ignore labels. Similarly, some grammar induction algorithms, such as those used by Pereira and Schabes (1992), do not produce meaningful labels. In particular, the Pereira and Schabes method induces a grammar from the brackets in the treebank, ignoring the labels in the treebank. While the grammar they induce has labels, these labels are not related to those in the treebank. Thus, while the Labelled Recall algorithm could be used with these grammars, perhaps maximizing a criterion that is more closely tied to the task will produce better results. Ideally, we would maximize the Consistent Brackets Recall rate directly. However, since it is time-consuming to deal with Consistent Brackets, as described earlier, we instead use the closely related Bracketed Recall rate.

For the Bracketed Recall algorithm, we find the parse that maximizes the expected Bracketed Recall rate, B/N_C. (Remember that B is the number of brackets that are correct, and N_C is the number of constituents in the correct parse.)

    T_G = arg max_T E(B/N_C)

Following a derivation similar to that used for the Labelled Recall algorithm, we can rewrite this as

    T_G = arg max_T  Σ_{⟨i,j⟩ ∈ T}  Σ_X  P(S ⇒ w_1 ⋯ w_{i-1} X w_j ⋯ w_n ⇒ w_1 ⋯ w_n | w_1 ⋯ w_n)

The algorithm for Bracketed Recall parsing is almost identical to that for Labelled Recall parsing.

Item form:
    [i, A, j]    (inside semiring)
    [i, j]'      (inside semiring)
    ⟨i, j⟩       (arctic semiring, or similar)

Primary Goal:     [1, S, n+1]
Secondary Goal:   ⟨1, n+1⟩

Rules:

    Unary:             R(A → w_i)                                ⟹  [i, A, i+1]

    Binary:            R(A → B C),  [i, B, k],  [k, C, j]        ⟹  [i, A, j]

    Summation:         VZ([i, A, j], [1, S, n+1])                ⟹  [i, j]'

    Unary Brackets:    R(A → w_i),  [i, i+1]'                    ⟹  ⟨i, i+1⟩

    Binary Brackets:   R(A → B C),  [i, j]',  ⟨i, k⟩,  ⟨k, j⟩    ⟹  ⟨i, j⟩

Figure: Bracketed Recall Description

The only required change is to sum, rather than maximize, over the symbols X when calculating maxG, substituting in the following line:

    maxG(i, j) = Σ_X g(i, X, j)

The Labelled Recall item-based description can be easily converted to a Bracketed Recall item-based description with the addition of one new item type and one new rule. The figure above gives an item-based description for the Bracketed Recall algorithm. There are now three item types: [i, A, j], which has the usual meaning; [i, j]', which has the value of Σ_A g(i, A, j); and ⟨i, j⟩, which has the value of MAXC(i, j). The summation rule sums over items of type [i, A, j] to produce items of the type [i, j]'.
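The difference between the two algorithms can be summarized in two lines of code. In the hedged sketch below, g is assumed to be the normalized inside-outside value defined earlier, and nonterminals is the nonterminal set; everything else is exactly as in the Labelled Recall algorithm.

    def maxG_labelled(g, i, j, nonterminals):
        """Labelled Recall: take the best single label for the span."""
        return max(g(i, X, j) for X in nonterminals)

    def maxG_bracketed(g, i, j, nonterminals):
        """Bracketed Recall: sum over all labels for the span."""
        return sum(g(i, X, j) for X in nonterminals)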

Experimental Results

We describe two experiments that tested these algorithms. The first uses a grammar without meaningful nonterminal symbols, and compares the Bracketed Recall algorithm to the traditional Labelled Tree (Viterbi) algorithm. The second uses a grammar with meaningful nonterminal symbols, and performs a three-way comparison between the Labelled Recall, Bracketed Recall, and Labelled Tree algorithms. These experiments show that use of an algorithm matched appropriately to the evaluation criterion can lead to a considerable reduction in error rate.

Grammar Induced by the Pereira and Schabes Method

We duplicated the experiment of Pereira and Schabes (1992). Pereira and Schabes trained a grammar from a bracketed form of the TI section of the ATIS corpus, using a modified form of the inside-outside algorithm. They then used the Labelled Tree (Viterbi) algorithm to select the best parse for sentences in held-out test data. We repeated the experiment, inducing the grammar the same way. However, during the testing phase, we ran both the Labelled Tree and Labelled Recall algorithms for each sentence. In contrast to previous research, we repeated the experiment ten times, with different random splits of the data into training set and test set, and different random initial conditions each time. Note that in one detail our scoring for experiments differs from the theoretical discussion in this chapter, so that we can more closely follow convention: constituents of length one are not counted in recall measures, since these are trivially correct for most criteria.

In three test sets there were sentences with terminals not present in the matching training set. The four sentences containing these terminals could not be parsed, and in the following analysis were assigned right-branching, period-high structure. This, of course, affects the Labelled Recall and Labelled Tree algorithms equally.

The table below shows the results of running this experiment, giving the minimum, maximum, mean, range, and standard deviation for three criteria: Consistent Brackets Recall, Consistent Brackets Tree, and Bracketed Recall. We also computed, for each split of the data, the difference between the Bracketed Recall algorithm and the Labelled Tree algorithm on each criterion. Notice that, on each criterion, the minimum of the differences is negative, meaning that on some data set the Labelled Tree algorithm worked better, and the maximum of the differences is positive, meaning that on some data set the Bracketed Recall algorithm worked better. The only criterion for which there was a statistically significant difference between the means of the two algorithms is the Consistent Brackets Recall rate (paired t-test).

Most researchers throw out a few sentences because of problems aligning the part-of-speech and parse files, or because of labellings of discontinuous constituents, which are not usable with the Crossing Brackets rate. The difficult data was cleaned up and used in these experiments, rather than thrown out. A diff file between the original ATIS data and the cleaned-up version, in a form usable by the ed program, is available by anonymous FTP from ftp.deas.harvard.edu, in pub/goodman/atis-ed/ (files ti_tb.par-ed and ti_tb.pos-ed). The number of changes made was small; the diff files are a small fraction of the size of the original files.

Table: Labelled Tree (Viterbi) versus Bracketed Recall for the Pereira and Schabes grammar. Rows: Consistent Brackets Recall, Consistent Brackets Tree, and Bracketed Recall, for the Labelled Tree algorithm, for the Bracketed Recall algorithm, and for their difference (Bracketed Recall minus Labelled Tree); columns: Min, Max, Range, Mean, StdDev. (Numeric entries are not reproduced here.)

Thus, use of the Bracketed Recall algorithm leads to a reduction in error rate on that criterion.

In addition, the performance of the Bracketed Recall algorithm was also qualitatively more appealing. The figure below shows typical results. Notice that the Bracketed Recall algorithm's Consistent Brackets rate versus iteration is smoother and more nearly monotonic than the Labelled Tree algorithm's. The Bracketed Recall algorithm also gets off to a much faster start, and is generally (although not always) above the Labelled Tree level. For the Labelled Tree rate, the two are usually very comparable.

Grammar Induced by Counting

The replication of the Pereira and Schabes experiment was useful for testing the Bracketed Recall algorithm. However, since that experiment induces a grammar with nonterminals not comparable to those in the training data, a different experiment is needed to evaluate the Labelled Recall algorithm: one in which the nonterminals in the induced grammar are the same as the nonterminals in the test set.

Grammar Induction by Counting

For this experiment, a very simple grammar was induced by counting, using the Penn Treebank. In particular, the trees were first made binary branching by removing epsilon productions, collapsing singleton productions, and converting n-ary productions (n > 2) as in the figure below (and as in the sketch that follows). The resulting trees were treated as the "correct" trees in the evaluation.
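The conversion of n-ary productions can be sketched as follows. The continuation label "X-Cont" follows the figure below; the exact labelling scheme used in the original experiments is an assumption of this sketch. The function binarizes a single node; applying it recursively over a whole tree is straightforward.

    def binarize(label, children):
        """Convert an n-ary node into a right-branching chain of binary nodes,
        introducing continuation nodes labelled label + '-Cont'."""
        if len(children) <= 2:
            return (label, list(children))
        cont = label + "-Cont"
        tree = (cont, children[-2:])
        for child in reversed(children[1:-2]):
            tree = (cont, [child, tree])
        return (label, [children[0], tree])

    # Example: binarize("X", ["A", "B", "C", "D"]) gives
    #   ("X", ["A", ("X-Cont", ["B", ("X-Cont", ["C", "D"])])])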

A grammar was then induced in a straightforward way from these trees, simply by giving one count for each observed production. No smoothing was done. The resulting grammar undergenerated somewhat, being unable to parse a small fraction of the test data.

[Figure: Labelled Tree versus Bracketed Recall on the Pereira and Schabes grammar. Percentage correct as a function of iteration number, for four curves: the Labelled Tree algorithm and the Bracketed Recall algorithm, each scored on Consistent Brackets Recall and on Labelled Tree.]

[Figure: Conversion of productions to binary branching. An n-ary production X → A B C D becomes a chain of binary productions through continuation nodes: X → A X-Cont, X-Cont → B X-Cont, X-Cont → C D.]

Table: Grammar induced by counting: three algorithms (Labelled Tree, Labelled Recall, Bracketed Recall) evaluated on five criteria (Labelled Tree, Labelled Recall, Bracketed Recall, Consistent Brackets Recall, Consistent Brackets Tree). (Numeric entries are not reproduced here.)

The unparsable sentences were assigned a right-branching structure, with their rightmost element attached high. Notice that the Labelled Tree (Viterbi), Labelled Recall, and Bracketed Recall algorithms all fail on exactly the same sentences, since the inside-outside calculation fails exactly when the Labelled Tree calculation fails. Thus this default behavior affects all algorithms equally.

Results

The table above shows the results of running all three algorithms, evaluating against five criteria. Notice that each algorithm is the best on the criterion that it optimizes. That is, the Labelled Tree algorithm is the best for the Labelled Tree rate, the Labelled Recall algorithm is the best for the Labelled Recall rate, and the Bracketed Recall algorithm is the best for the Bracketed Recall rate.

NP-Completeness of Bracketed Tree Maximization

In this section we show the NP-Completeness of the Bracketed Tree Maximization problem. Bracketed Tree Maximization is the problem of finding that bracketing which has the highest expected score on the Bracketed Tree criterion. In other words, it is that tree which has the highest probability of being exactly correct according to the model, ignoring nonterminal labels. Formally, the Bracketed Tree Maximization problem is to compute

    arg max_{T_G} E( 1 if B = N_C;  0 otherwise )

The Bracketed Tree criterion can be distinguished from the Labelled Tree (Viterbi) criterion in the following example grammar:

    S → L x
    S → M x
    S → x R
    L, M, R → x x

Here, given an input of xxx, the Labelled Tree parse uses S → x R with R → x x, and thus the Labelled Tree bracketing is (x(xx)), with some chance of being correct, assuming that the string was produced by the model. On the other hand, the bracketing ((xx)x) can be produced by either S → L x or S → M x, and so has a larger chance of being correct. If our goal is to maximize the probability that the bracketing is exactly correct, then we would like an algorithm that returns the second bracketing, rather than the first. Our previous algorithm, the Bracketed Recall algorithm, may return bracketings with probability 0; it does not solve this problem. We will show that maximizing the Bracketed Tree rate is NP-Complete. It will be easier to prove this if we first prove a related theorem about Hidden Markov Models (HMMs). The proof we give here is similar to one of Sima'an (1996a), showing that maximizing the Labelled Tree rate for a Stochastic Tree Substitution Grammar is NP-Complete.

We note that for binary branching trees there is a crossing bracket in any tree that does not exactly match. Thus, for binary branching trees, Bracketed Tree Maximization is equivalent to Zero Crossing Brackets Rate Maximization. Thus we will implicitly also be showing that Zero Crossing Brackets Rate Maximization is NP-Complete.

NP-Completeness of HMM Most Likely String

HMM Most Likely String (HMM-MLS)

Instance: An HMM specification, with probabilities specified as fractions x/y; a length n, given in unary; and a probability p.

Question: Is there some string of length n such that the specified HMM produces the string with probability at least p?

The proof that Bracketed Tree Maximization is NP-Complete is easier to understand if we first show the NP-Completeness of a simpler problem, namely finding whether the most likely string of a given length n output by a Hidden Markov Model has probability at least p. A different problem, that of finding the most likely state sequence of an HMM and the corresponding most likely output, can of course be solved easily. However, since the same string can be output by several state sequences, the most likely string problem is much harder.

We note that a related problem, Most Likely String Without Length, in which the length of the string is not prespecified, can be shown NP-hard by a very similar proof. However, it cannot be shown NP-Complete by this proof, since the most likely string could be exponentially long, and thus the generate-and-test method cannot be used to show that the problem is in NP.

We show the NP-Completeness of HMM Most Likely String by a reduction of the NP-Complete problem 3-Sat. The problem 3-Sat is whether or not there is a way to satisfy a formula of the form F_1 ∧ F_2 ∧ ⋯ ∧ F_n, where each F_i is of the form (α_{i1} ∨ α_{i2} ∨ α_{i3}). Here the α_{ij} represent literals, either x_k or ¬x_k. An example of 3-Sat is the question of whether a particular formula of this form, with three clauses over four variables, is satisfiable; we use such a formula as the running example below.

We will show how to construct, for any 3-Sat formula, an HMM that has a high-probability output if and only if the formula is satisfiable.

The most obvious way to pursue this reduction is to create a network whose outputs correspond to satisfying assignments of the clauses, e.g. a string of T's and F's, one letter for each variable, denoting whether that variable is true or false. There will be one section of the network for each clause, and the probabilities will be assigned in such a way that, for the output probability to sum sufficiently high, the maximal string must have a path through the subnetwork corresponding to every clause.

The problem with this technique is that such a string may be output by multiple paths through the subnetworks corresponding to some of the clauses, and zero paths through the parts of the network corresponding to other clauses. What we must do is to ensure that there is exactly one path through each clause's subnetwork. In order to do this, we augment the output with an initial sequence of the digits 1, 2, and 3. Each digit specifies which variable (the first, second, or third) of each clause satisfied that clause. This strategy allows construction of subnetworks for each clause with at most one path. Then, if there is a string whose probability indicates that it has f paths through the HMM, we know that the string satisfies every clause, and therefore satisfies the formula. Any string that does not correspond to a satisfying assignment will have fewer than f paths, and will not have a sufficiently high probability.

Theorem. The HMM Most Likely String problem is NP-Complete.

Proof. To show that HMM Most Likely String is in NP is trivial: guess a most likely string of the given length n, then compute its probability by well-known polynomial-time dynamic programming techniques (the forward algorithm), then verify that the probability is at least p.
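For concreteness, here is a minimal sketch of the forward algorithm referred to in the membership argument. The HMM encoding used here, with emissions attached to transitions and probabilities as floating-point numbers rather than unary fractions, is an assumption of the sketch, not part of the proof.

    def string_probability(string, start_probs, arcs):
        """Probability that the HMM emits `string`, summing over all state
        sequences.  arcs[s] is a list of (next_state, output_symbol, probability)
        triples, i.e. emissions happen on transitions.  Handling of a designated
        END state is omitted for brevity."""
        forward = dict(start_probs)
        for symbol in string:
            new_forward = {}
            for state, alpha in forward.items():
                for nxt, out, p in arcs.get(state, []):
                    if out == symbol:
                        new_forward[nxt] = new_forward.get(nxt, 0.0) + alpha * p
            forward = new_forward
        return sum(forward.values())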

To show NP-Completeness, we show how to reduce a 3-Sat formula to an HMM Most Likely String problem. Let f represent the number of disjunctions (clauses) and v represent the number of variables in the formula to satisfy. Create an HMM with 3f(f + v) + 2 states, as follows. The states will have the names START, END, and (h, i, j), where h ranges from 1 to f + v, i ranges from 1 to f, and j ranges from 1 to 3. Two of the states are a start and an end state, with an equiprobable epsilon transition from the start state to the 3f states of the form (1, i, j).

The basic form of the network is a chain of states (1, i, j), (2, i, j), ..., (f + v, i, j), followed by END, for each pair (i, j). The transition out of state (h, i, j) for h ≤ f emits a digit: the digit j if h = i, and any of 1, 2, or 3 (equiprobably) otherwise. The transition out of state (h, i, j) for f < h ≤ f + v emits a truth value: x(i, j) if h = f + k, where x_k is the variable appearing in α_{ij}, and T or F (equiprobably) otherwise. Here x(i, j) is T if α_{ij} is of the form x_k, and F if it is of the form ¬x_k. The figure below shows a portion of the HMM, giving the parts that correspond to the first and last clauses.

[Figure: Portion of the HMM corresponding to a 3-Sat formula, showing the subnetworks for the first and last clauses.]

The notation [1,2,3] on a transition indicates a state transition with equal probabilities of emitting the symbols 1, 2, or 3. Similarly, [T,F] indicates a state transition with equal probabilities of emitting T or F.

We now show that if the given 3-Sat formula is satisfiable, then the constructed HMM has a string with probability f · (1/(3f)) · (1/3)^{f-1} · (1/2)^{v-1}, whereas if it is not satisfiable, the most likely string will have lower probability. The proof is as follows. First, assume that there is some way to satisfy the 3-Sat formula. Let X_i denote T if the variable x_i takes on the value true, and let it denote F otherwise, under some assignment of values to variables that satisfies the formula. Also, let J_i denote the number 1, 2, or 3, respectively, depending on whether α_{i1}, α_{i2}, or α_{i3} satisfies formula F_i. If J_i could take on more than one value by this definition, that is, if F_i is satisfied in more than one way, then arbitrarily we assign it the lowest of these three. Since we are assuming here that the formula is satisfied under the assignment of values to variables, at least one of α_{i1}, α_{i2}, or α_{i3} must satisfy each clause.

Then the string

    J_1 J_2 ⋯ J_f X_1 X_2 ⋯ X_v

will be a most probable string. There may be other most probable strings, but all will have the same probability and satisfy the formula. (For example, a string whose final four symbols are FTFT would be the most likely string output by an HMM corresponding to the sample formula.) This string will have probability f · (1/(3f)) · (1/3)^{f-1} · (1/2)^{v-1}, because there will be f routes emitting that string, through exactly f of the 3f subnetworks, and each route will have probability (1/(3f)) · (1/3)^{f-1} · (1/2)^{v-1}.

On the other hand, assume that the formula is not satisfiable. Now the most likely string will have a lower probability. Obviously, the string could not have higher probability: the best path through any subnetwork has probability (1/(3f)) · (1/3)^{f-1} · (1/2)^{v-1}, and it is not possible to have routes through more than f of the 3f subnetworks. Therefore the probability can be at most f times this value. However, it cannot be this high, since if it were, then the string would have a path through f of the 3f networks, and would therefore correspond to a satisfaction of the formula. Thus its probability must be lower.

Bracketed Tree Maximization is NP-Complete

Having shown the NP-Completeness of the HMM Most Likely String problem, we can show the NP-Completeness of Bracketed Tree Maximization. To phrase this as a language problem, we rephrase it as the question of whether or not the best parse has an expected score of at least p according to the Bracketed Tree rate criterion.

Bracketed Tree Maximization (BTM)

Instance: A Probabilistic Context-Free Grammar G, a string w_1 ⋯ w_n, and a probability p.

Question: Is there a T_G such that E( 1 if B = N_C;  0 otherwise ) ≥ p?

Theorem. The Bracketed Tree Maximization problem is NP-Complete.

[Figure: Tree corresponding to an output of one symbol from a k-symbol alphabet: below A, a chain of binary nodes over the terminal x, with k − i nodes branching one way and i nodes branching the other way (giving symbol i a unique bracketing), and with the nonterminal B for the next state attached at the top of the chain.]

Proof. Using the generate-and-test method, it is clear that this problem is in NP. To show completeness, we show that for every HMM there is an equivalent PCFG that produces bracketings with the same probability that the HMM produces symbols. We do this by mapping states of the HMM to nonterminals of the grammar; we map each output symbol of the HMM to a small subtree with a unique bracket sequence.

The proof is as follows. Number the output symbols of the HMM 1 to k. Assign the start state of the HMM to the start symbol of the PCFG. For each state A with a transition emitting symbol i to state B with probability p, include in the grammar a rule, with probability p, that rewrites A as a subtree with k − i symbols x on right-branching nodes, followed by i symbols x on left-branching nodes, followed by the nonterminal B; the figure above shows what such a subtree looks like for one choice of i and k. Also, for each final state A, include a rule allowing A to end the derivation.

If an HMM had a state sequence S A B C, then the corresponding parse tree would look like the figure below, where each triangle represents the subtree corresponding to one of the HMM's output symbols.

[Figure: Tree corresponding to state sequence S A B C: a right-branching spine of nonterminals S, A, B, C, with an output subtree (triangle) attached at each state.]

There will be a one-to-one correspondence between parse trees in this grammar and state-sequence/output pairs in the HMM. If we throw away nonterminal information, leaving only brackets, there will be a one-to-one correspondence between bracketed parse trees and HMM outputs. The Bracketed Tree Maximization of a string in this grammar will correspond to the most likely output of the corresponding HMM. Thus, if we could solve the Bracketed Tree Maximization problem, we could also solve the HMM Most Likely String of length n problem. Therefore Bracketed Tree Maximization is NP-Complete.

Corollary. Consistent Brackets Tree (Zero Crossing Brackets) Maximization for binary branching trees is NP-Complete.

Proof. Bracketed Tree is equivalent to Consistent Brackets Tree if both T_C and T_G are binary branching.

We note that for n-ary branching trees, discussed later in this chapter, Consistent Brackets Tree maximization is typically achieved by simply placing parentheses at the topmost level and nowhere else. Thus it is the binary branching case that is of interest.

General Recall Algorithm

The Labelled Recall algorithm and Bracketed Recall algorithm are both special cases of a more general algorithm, called the General Recall algorithm. In fact, the General Recall algorithm was developed first, for parsing Stochastic Tree Substitution Grammars (STSGs). In this section we first define STSGs, and then use them to motivate the General Recall algorithm. Next, we show how the other two algorithms reduce to the general algorithm, and finally give some examples of other cases in which one might wish to use the General Recall algorithm.

There are two ways to define an STSG: either as a Stochastic Tree Adjoining Grammar (Schabes, 1992) restricted to substitution operations, or as an extended PCFG in which entire trees may occur on the right-hand side, instead of just strings of terminals and nonterminals. A PCFG, then, is a special case of an STSG in which the trees are limited to depth one. In the STSG formalism, for a given parse tree, there may be many possible derivations.

[Figure: Example Stochastic Tree Substitution Grammar and two parses of the string x x x x, showing for each parse its derivation probability and its total parse probability.]

[Figure: STSG-to-PCFG conversion example: a single STSG production whose right-hand side is a tree of depth greater than one is converted into several PCFG productions, one per internal node.]

Thus, variations on the Viterbi algorithm, which find the best derivation, may not be the most appropriate. What is really desired is to find the best parse. But best according to what criterion? If the criterion used is the Labelled Tree rate, then the problem is NP-Complete (Sima'an, 1996a), and the only known approximation algorithm is a Monte Carlo algorithm due to Bod. (We will show later that this Monte Carlo algorithm has a serious problem: either its accuracy decreases exponentially with sentence length, or its runtime increases exponentially.) On the other hand, if the criterion used is the Labelled Recall rate, then there is an efficient algorithm: the General Recall algorithm.

In the figure above, we give a sample STSG grammar and two different parses of the string xxxx, demonstrating that the most likely derivation differs from the most likely parse.

To use the General Recall algorithm for parsing STSGs, we must first convert the STSG to a PCFG. This conversion is done in the usual way. That is, assign to every internal node of every subtree of the STSG a unique label, which we will write as a subscript on the original nonterminal. Leave the root and leaf nodes unlabelled, or, equivalently, labelled with a null label.

Now, for each root A of a subtree with probability p and children B_k and C_l, create a PCFG rule

    A → B_k C_l    (p)

For each internal node A_j of a subtree, with children B_k and C_l, create a PCFG rule

    A_j → B_k C_l    (1)

Here the notation X_j denotes a new nonterminal created for this node; the subscript k or l is omitted if the corresponding node is a leaf node. The number in parentheses indicates the probability associated with a given production. The figure above illustrates this conversion.

Now, if we were to apply the Viterbi algorithm directly to this PCFG, rather than returning the most likely parse tree, it would return a parse tree corresponding to the most likely derivation. Not only that, but the labels in this parse tree would not even be the same as the labels in the original grammar. Of course, we could solve this latter problem by creating a function MAP_STSG that maps from symbols in the PCFG to symbols in the STSG:

    MAP_STSG(X)   = X
    MAP_STSG(X_j) = X

In other words, MAP_STSG removes subscripts.
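As a small illustration, the mapping functions discussed in this section can be sketched as below; encoding the subscripted nonterminals as strings such as "NP@7" is a convention of this sketch only, not of the grammar files themselves.

    def map_stsg(x):
        """Strip the node subscript, here encoded after an '@' sign."""
        return x.split("@")[0]

    def map_ident(x):
        """Identity mapping: yields the Labelled Recall maximization."""
        return x

    def map_brack(x):
        """Map every nonterminal to one symbol: the Bracketed Recall maximization."""
        return "S"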

One way to use this function would be to use the Viterbi algorithm or Labelled Recall algorithm with the PCFG to find the best tree, and then use the function to map each node's label to the original name. But we would presumably get better results following a more principled approach, such as trying to solve the following optimization problem:

    arg max_{T_G} E(  Σ_{⟨j, X, k⟩ ∈ T_G}  [ 1 if ⟨j, MAP_STSG(X), k⟩ ∈ T_C;  0 otherwise ]  )

Here T_G is a parse tree with symbols in the PCFG, while T_C is a parse tree with symbols in the original STSG grammar. This maximization is exactly parallel to the maximization performed by the Labelled Recall algorithm, except that the symbol names are transformed with MAP_STSG before the maximization is performed. In fact, if we define two other mapping functions, MAP_IDENT (the identity mapping function) and MAP_BRACK (the function that maps everything to the same symbol), then the similarity between the three maximizations becomes more clear.

The MAP_IDENT function

    MAP_IDENT(X) = X

can be used to define the Labelled Recall maximization

    max_{T_G} E(  Σ_{⟨j, X, k⟩ ∈ T_G}  [ 1 if ⟨j, MAP_IDENT(X), k⟩ ∈ T_C;  0 otherwise ]  )

while the MAP_BRACK function

    MAP_BRACK(X) = S

can be used to define the Bracketed Recall maximization

    max_{T_G} E(  Σ_{⟨j, X, k⟩ ∈ T_G}  [ 1 if ⟨j, MAP_BRACK(X), k⟩ ∈ T_C;  0 otherwise ]  )

If we substitute

    maxG(i, j) = max_X  Σ_{Y : X = map(Y)}  g(i, Y, j)

into the Labelled Recall algorithm given earlier, we get the General Recall algorithm. Here map is a function that maps from nonterminals in the relevant PCFG to nonterminals relevant for the output. It could be any of MAP_STSG, MAP_IDENT, or MAP_BRACK, among others.
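Continuing the sketches above, and again assuming the g function for normalized inside-outside values, the substituted maxG can be written compactly: group the PCFG nonterminals by their mapped output symbol, sum g within each group, then take the best group.

    from collections import defaultdict

    def maxG_general(g, i, j, nonterminals, map_fn):
        """General Recall per-span score."""
        totals = defaultdict(float)
        for Y in nonterminals:
            totals[map_fn(Y)] += g(i, Y, j)
        return max(totals.values()) if totals else 0.0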

Given this general algorithm, other uses for this technique become apparent. For instance, in the latest version of the Penn Treebank, nonterminals annotated with grammatical function are included. That is, rather than using simple nonterminals such as NP, more complex nonterminals such as NP-SBJ and NP-PRD are used, to indicate subject and predicate. Depending on the application, one might choose to use the annotated nonterminals in a PCFG, but to return only the simpler, unannotated nonterminals. Collins (1997) uses a version of the grammatically annotated nonterminals internally, parsing with the Labelled Tree (Viterbi) algorithm and then stripping the annotations. So we could define a mapping function MAP_SEM:

    MAP_SEM(X)   = X
    MAP_SEM(X-Y) = X

We could then use the General Recall algorithm with this function. If we used a simpler scheme, such as parsing using the Viterbi algorithm and then mapping the resulting parse, problems might occur. For instance, since noun phrases are divided into classes while verb phrases are not, an inadvertent bias against noun phrases will have been added into the system. Using the General Recall algorithm removes such biases, since all the different kinds of noun phrases are summed over, and is theoretically better motivated.

An item-based description of the General Recall algorithm is given in the figure below. The General Recall description is the same as the Bracketed Recall description, except that we now add nonterminals to the items and use the function map. In the next chapter, we will show how to use the General Recall algorithm to greatly speed up Data-Oriented Parsing.

Item form:
    [i, A, j]    (inside semiring)
    [i, A, j]'   (inside semiring)
    ⟨i, A, j⟩    (arctic semiring, or similar)

Primary Goal:     [1, S, n+1]
Secondary Goal:   ⟨1, S, n+1⟩

Rules:

    Unary:          R(A → w_i)                                       ⟹  [i, A, i+1]

    Binary:         R(A → B C),  [i, B, k],  [k, C, j]               ⟹  [i, A, j]

    Summation:      VZ([i, A, j], [1, S, n+1]),  map(A) = B          ⟹  [i, B, j]'

    Unary Labels:   R(A → w_i),  [i, map(A), i+1]'                   ⟹  ⟨i, map(A), i+1⟩

    Binary Labels:  R(A → B C),  [i, map(A), j]',
                    ⟨i, map(B), k⟩,  ⟨k, map(C), j⟩                  ⟹  ⟨i, map(A), j⟩

Figure: General Recall Description


N-Ary Branching Parse Trees

Previously, we have been discussing binary branching parse trees. However, in practice it is often useful to have trees with more than two branches. In this section we discuss n-ary branching parse trees, denoted by the symbol T^n. In general, n-ary branching parse trees can have any number of branches, including zero or one, but the discussion in this chapter will be limited to those trees where every node has at least two branches.

We begin by discussing why, with n-ary branching parse trees, different evaluation metrics should be used from those for binary branching ones. In particular, we discuss Consistent Brackets Recall (Crossing Brackets) and Consistent Brackets Tree (Zero Crossing Brackets) versus Bracketed or Labelled Precision and Recall. We then go on to discuss approximations for Bracketed and Labelled Precision and Recall: the Bracketed Combined rate and Labelled Combined rate. Next, we show an algorithm for maximizing these two rates. Finally, we give results using this new algorithm.

N-Ary Branching Evaluation Metrics

If we wish to model n-ary branching trees, rather than just binary branching ones, we could continue to use the same metrics as before. For instance, we could use the Consistent Brackets Recall rate or Consistent Brackets Tree rate. However, both of these have a problem: simply by returning trees of the form

    S → w_1 w_2 ⋯ w_n

i.e. trees that have only one node, with all terminals as children of that node, a perfect score can be achieved. The problem here is that there is no reward for returning correct constituents, only a penalty for returning incorrect ones. The Labelled Recall rate has the opposite problem: there is no penalty for including too many nodes, e.g. returning a binary branching tree.

What is needed is some criterion that combines both the penalty and the reward. The traditional solution has been to use a pair of criteria: the Bracketed Recall rate and the Bracketed Precision rate, or, more common recently, the Labelled Recall rate and the Labelled Precision rate. The Bracketed Recall rate gives the reward, yielding higher scores for answers with more correct constituents, while the Bracketed Precision rate gives the penalty, giving lower scores when there are more incorrect constituents. The Precision rates are defined as follows, with the Recall rates included for comparison:

    Labelled Precision rate:    L / N_G
    Bracketed Precision rate:   B / N_G
    Labelled Recall rate:       L / N_C
    Bracketed Recall rate:      B / N_C

Ideally, we would now develop an algorithm that maximizes some weighted sum of a Recall rate and a Precision rate. However, it is relatively time-consuming to maximize Precision, because the denominator is N_G, the number of nodes in the guessed tree. Thus, the relative penalty for making a mistake differs depending on how many constituents total are returned, slowing dynamic programming algorithms. On the other hand, we can create metrics closely related to the Precision metrics, which we will call Mistakes, and define as follows:

    Labelled Mistakes rate:    (N_G − L) / N_C
    Bracketed Mistakes rate:   (N_G − B) / N_C

N_G is the number of guessed constituents and L is the number of correctly labelled constituents, so N_G − L is the number of constituents that are not correct, i.e. mistakes. We normalize by dividing through by N_C, the number of constituents in the correct parse. While this factor is less intuitive than N_G, it will remain constant across parse trees, making the cost of each mistake independent of the size of the guessed tree.

Now we can define combinations of Mistakes and Recall, using a weighting factor λ:

    Labelled Combined rate:    L/N_C − λ(N_G − L)/N_C
    Bracketed Combined rate:   B/N_C − λ(N_G − B)/N_C

The positive term L/N_C provides a reward for correct constituents, and the negative term −λ(N_G − L)/N_C provides a penalty for incorrect constituents.
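The combined rates are simple functions of the four counts; a minimal sketch, with lam standing in for the weighting factor λ:

    def labelled_combined(L, N_C, N_G, lam):
        """Labelled Combined rate: recall reward minus weighted mistakes penalty."""
        return L / N_C - lam * (N_G - L) / N_C

    def bracketed_combined(B, N_C, N_G, lam):
        """Bracketed Combined rate, defined analogously."""
        return B / N_C - lam * (N_G - B) / N_C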

Combined Rate Maximization

We can now state the Labelled Combined Rate Maximization problem as

    arg max_{T_G^n} E( L/N_C − λ(N_G − L)/N_C )

Using manipulations similar to those used previously, we can rewrite this as

    arg max_{T_G^n}  Σ_{⟨i, X, j⟩ ∈ T_G^n}  ( g(i, X, j) − λ(1 − g(i, X, j)) )

The preceding expression says that we want to find that parse tree which maximizes our score, where our score is one point for each correct constituent and −λ points for each incorrect one.

The preceding expression does not address an interesting question. In general, if we are inducing n-ary branching trees, it will be because we are interested in grammars with other than two symbols on the right-hand sides of their productions. While grammars with zero or one elements on their right-hand sides are interesting, they are in general significantly harder to parse, so we will only be concerned in this chapter with grammars with at least two nonterminals on the right side of productions.

Grammars with at least two elements on the right can easily be converted to grammars with exactly two elements on the right, using a trick best known from Earley's algorithm, in which new nonterminals are created by "dotting". A rule such as

    A → B C D E    (p)

is converted into three rules of the form

    A → B  B·CDE       (p)
    B·CDE → C  BC·DE   (1)
    BC·DE → D E        (1)

This works fine, in that it produces a grammar which produces the same strings with the same probabilities, and which is amenable to simple chart parsing. However, the parse trees produced are binary branching.
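A minimal sketch of this conversion, under the assumption that dotted nonterminals are named simply by the symbols appearing before and after the dot:

    def binarize_rule(lhs, rhs, p):
        """Convert lhs -> rhs (with len(rhs) >= 2) into binary rules using
        dotted nonterminals; the naming of the dotted symbols is illustrative."""
        rules = []
        current_lhs, prob = lhs, p
        for i in range(len(rhs) - 2):
            dotted = "".join(rhs[:i + 1]) + "." + "".join(rhs[i + 1:])
            rules.append((current_lhs, [rhs[i], dotted], prob))
            current_lhs, prob = dotted, 1.0
        rules.append((current_lhs, list(rhs[-2:]), prob))
        return rules

    # binarize_rule("A", ["B", "C", "D", "E"], 0.5) yields
    #   ("A", ["B", "B.CDE"], 0.5), ("B.CDE", ["C", "BC.DE"], 1.0), ("BC.DE", ["D", "E"], 1.0)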

Let us define an operator nodot(X), which is true if and only if the label X does not have a dot in it, and then modify the expression above to be

    arg max_{T_G^n}  Σ_{⟨i, X, j⟩ ∈ T_G^n : nodot(X)}  ( g(i, X, j) − λ(1 − g(i, X, j)) )

The modified expression does not count the value of any dotted nodes. There is one more complication: any given constituent, even one without a dot, may contribute a negative score, even though its children contribute positive scores. We thus want to be able to omit any constituent which contributes negatively, while keeping its children. The final expression we maximize, which returns a binary branching tree, is

    arg max_{T_G}  Σ_{⟨i, X, j⟩ ∈ T_G : nodot(X)}  max( 0,  g(i, X, j) − λ(1 − g(i, X, j)) )

We take the binary branching tree which maximizes this expression, and remove all dotted nodes and all nodes which contribute a negative score, while leaving their children in the tree. The resulting tree is the optimal n-ary branching tree. Because the expression returns a binary branching tree, we can use our previous CKY-style parsing algorithms to efficiently perform the maximization, and remove the dummy and negative nodes as a postprocessing step.

(Furthermore, even though, using techniques described earlier, we can parse grammars that have loops from unary or epsilon productions, it becomes more complicated to define precision and recall appropriately for these grammars. For instance, given our definitions, recall scores above 100% are possible, by returning trees containing unary chains of the form S over S over S over S, because this could lead to L or B exceeding N_C, since the repeated S constituents would all be counted as correct. In fact, when we examined scoring code used by Collins (1997), we found this problem; while Collins (personal communication) says that his parser could not return trees of this form, it illustrates the problems that begin to surface in scoring trees with unary branches.)
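The per-constituent score being maximized can be sketched as a short function; here g_value is assumed to be the normalized inside-outside value of the constituent and lam the weight λ:

    def combined_score(g_value, lam, has_dot):
        """Per-constituent contribution to the Combined objective: a reward for
        probable constituents, a weighted penalty otherwise, clipped at zero so
        that unhelpful or dotted constituents can simply be omitted."""
        if has_dot:
            return 0.0
        return max(0.0, g_value - lam * (1.0 - g_value))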

A similar maximization can be used for the bracketed case. We first define g(i, j) as

    g(i, j) = Σ_{X : nodot(X)} g(i, X, j)

Now we can give the expression for Bracketed Combined maximization:

    arg max_{T_G}  Σ_{⟨i, X, j⟩ ∈ T_G}  max( 0,  g(i, j) − λ(1 − g(i, j)) )

Using these expressions, we could modify the Labelled Recall algorithm to maximize either the Labelled Combined rate or the Bracketed Combined rate; or, more generally, we could modify the General Recall algorithm. If we substitute

    bestG(i, j) = max_X  Σ_{Y : X = map(Y)}  g(i, Y, j)
    maxG(i, j)  = max( 0,  bestG(i, j) − λ(1 − bestG(i, j)) )

into the Labelled Recall algorithm given earlier, we get the general algorithm for n-ary branching parse trees, which we call the General Combined algorithm.

In the figure below, we give an item-based description for the Labelled Combined algorithm, very similar to the Labelled Recall description. This algorithm uses the function combined to give the score of constituents:

    combined(i, A, j) = max( 0,  VZ([i, A, j], [1, S, n+1]) − λ(1 − VZ([i, A, j], [1, S, n+1])) )   if nodot(A)
    combined(i, A, j) = 0                                                                            otherwise

Rather than just using the unary and binary labels rules, the description also contains unary omit and binary omit rules, which are triggered for items whose expected contribution is 0. These items then use rules with a null symbol on the left-hand side in the derivation, which can be used to reconstruct an appropriate n-ary branching tree.

N-Ary Branching Experiments

We performed a simple experiment, using the Penn Treebank version II, with a standard split into training sections and a held-out test section, extracting a grammar in the same way as in the counting experiment above. We varied the precision/recall tradeoff weight λ over a range of values.

The figure below gives the results. As can be seen, the Labelled Combined algorithm not only produced a smooth tradeoff between Labelled Precision and Labelled Recall, it also worked better than the Labelled Tree (Viterbi) algorithm on both measures simultaneously.

Item form:
    [i, A, j]    (inside semiring)
    ⟨i, A, j⟩    (arctic semiring, or similar)

Primary Goal:     [1, S, n+1]
Secondary Goal:   ⟨1, S, n+1⟩

Rules:

    Unary:           R(A → w_i)                                                   ⟹  [i, A, i+1]

    Binary:          R(A → B C),  [i, B, k],  [k, C, j]                           ⟹  [i, A, j]

    Unary Labels:    R(A → w_i) = combined(i, A, i+1),
                     combined(i, A, i+1) > 0                                      ⟹  ⟨i, A, i+1⟩

    Binary Labels:   R(A → B C) = combined(i, A, j),
                     ⟨i, B, k⟩,  ⟨k, C, j⟩,  combined(i, A, j) > 0                ⟹  ⟨i, A, j⟩

    Unary Omit:      R(∅ → w_i),  combined(i, A, i+1) = 0                         ⟹  ⟨i, A, i+1⟩

    Binary Omit:     R(∅ → B C),  ⟨i, B, k⟩,  ⟨k, C, j⟩,  combined(i, A, j) = 0   ⟹  ⟨i, A, j⟩

Figure: N-ary Labelled Recall Description

[Figure: Labelled Combined algorithm versus Labelled Tree algorithm: Labelled Precision plotted against Labelled Recall over a range of λ values, with the Viterbi (Labelled Tree) result shown as a single point.]

Table: Algorithm timings in seconds, for the Viterbi algorithm, the inside algorithm, the outside algorithm, finding one Combined tree, and finding many Combined trees. (Numeric entries are not reproduced here.)

The algorithm asymptotes vertically at a certain Labelled Recall value; this asymptote corresponds to such a strong weight on recall that the resulting tree is binary branching. The horizontal asymptote in Labelled Precision corresponds to a single constituent containing the entire sentence. The reason the asymptote does not approach 100% is that we measured Labelled Precision: because of our scoring mechanism, not all top nodes in the Penn Treebank were labelled the same, leading to occasional mistakes in the top node label. (In the Penn Treebank version II, the top bracket is always unlabelled and unary branching. We collapsed unary branches before scoring, since this algorithm could not produce unary branches; this led to a top node that was usually an S, but could be something else, such as SINV for questions.)

The table above gives the timings of the various algorithms in seconds. The Viterbi algorithm and the inside algorithm take approximately the same amount of time, while the outside algorithm takes about twice as long. Once the inside and outside probabilities are computed, the Labelled Combined trees can be computed very quickly. A portion of the work in computing combined trees (computing the g values) needs to be done only once per sentence, which is why the time to find many trees is not proportionally larger than the time to find one tree. It is very significant that almost all of the work to compute a combined tree needs to be done only once; this means that if we are going to compute one tree, we might as well compute the entire ROC (precision/recall) curve.

The fact that we can compute a precision/recall curve makes it much easier to compare parsing algorithms. If there are two parsing algorithms, and one gets a better score on Labelled Precision and the other gets a better score on Labelled Recall, we cannot determine which is better. However, if for one or both algorithms an ROC curve exists, then we can determine which is the better algorithm, unless, of course, the curves cross. But even in the crossing case, we have learned the useful fact that one algorithm is better for some things and the other algorithm is better for other things.

Conclusions

Matching parsing algorithms to evaluation criteria is a powerful technique that can be used to achieve better performance than standard algorithms. In particular, the Labelled Recall algorithm has better performance than the Labelled Tree algorithm on the Consistent Brackets Recall, Labelled Recall, and Bracketed Recall rates. Similarly, the Bracketed Recall algorithm has better performance than the Labelled Tree algorithm on the Consistent Brackets and Bracketed Recall rates. Thus, these algorithms improve performance not only on the measures that they were designed for, but also on related criteria. For n-ary branching trees, we have shown that the Labelled Combined algorithm can lead to improvements on both precision and recall, and can allow us to trade precision and recall off against each other, or even to quickly produce the curve showing this tradeoff. The table below summarizes the metrics and the corresponding algorithms.

    Criterion   Bracketed                    Labelled
    Tree        NP-Complete                  Labelled Tree algorithm
    Recall      Bracketed Recall algorithm   Labelled Recall algorithm

Table: Metrics and corresponding algorithms

Of course, there are limitations to the approach. For instance, the problem of maximizing the Bracketed Tree rate (equivalent to the Zero Crossing Brackets rate in the case of binary branching data) is NP-Complete: not all criteria can be optimized directly.

In the next chapter, we will see that these techniques can also, in some cases, speed parsing. In particular, we will introduce Data-Oriented Parsing and show that, while we cannot efficiently maximize the Labelled Tree rate, we can use the General Recall algorithm to maximize the Labelled Recall rate in polynomial time, leading to significant speedups.

Appendix A: Proof of Crossing Brackets Theorem

[Figure: Crossing trees, illustrating the positions k, i, m, j, l used in the proof.]

In this appendix we prove that, if the correct parse tree is binary branching, then Consistent Brackets and Bracketed Match are identical. Pereira and Schabes (1992) make reference to an equivalent observation, but without proof. The proof is broken into two parts. In the first part, we show that any constituent that matches does not cross. In the second part, we show that any constituent ⟨i, j⟩ that does not have a Bracketed Match does cross.

To show that any element which does match does not cross, we note that, by a simple induction, the left descendants of any constituent can never cross the right descendants of that constituent. Now, assume that an element which did match also crossed. Find the lowest common ancestor of the match and the crossing element. One of them must be a descendant of the left child, and the other must be a descendant of the right child. But then, by the lemma, they would not cross, and we have a contradiction, so our assumption must be wrong.

To show that any element ⟨i, j⟩ that does not match must have a crossing element, we find the smallest element ⟨k, l⟩ containing ⟨i, j⟩, that is, with k ≤ i and l ≥ j. Now, since ⟨i, j⟩ doesn't match ⟨k, l⟩, one of these inequalities must be strict; that is, k < i or l > j. Assume, without loss of generality, that l > j. Since we assumed that T_C is binary branching, ⟨k, l⟩ has two children. Let ⟨m, l⟩ represent the right child. This configuration is illustrated in the figure above. We know that m > i, since otherwise ⟨m, l⟩ would be a constituent smaller than ⟨k, l⟩ containing ⟨i, j⟩. Similarly, we know that m < j, since otherwise ⟨k, m⟩ would be a constituent smaller than ⟨k, l⟩ containing ⟨i, j⟩. Thus we have i < m < j < l, meeting the definition of a cross.

Appendix B: Glossary

B: Number of correctly bracketed constituents.

Bracketed Combined rate: B/N_C − λ(N_G − B)/N_C. The weighted difference of Bracketed Recall and Bracketed Mistakes.

Bracketed match: ⟨i, X, j⟩ bracketed-matches ⟨i, Y, j⟩.

Bracketed Mistakes rate: (N_G − B)/N_C. Approximation to unlabelled precision.

Bracketed Precision rate: B/N_G. A score which penalizes incorrect guesses.

Bracketed Recall algorithm: Algorithm maximizing the Bracketed Recall rate.

Bracketed Recall rate: B/N_C. Closely related to the Consistent Brackets rate.

Bracketed Tree rate: 1 if B = N_C, 0 otherwise.

C: Number of constituents that do not cross a correct constituent.

Consistent Brackets match: ⟨i, X, j⟩ matches if there is no ⟨k, Y, l⟩ crossing it.

Consistent Brackets Recall rate: C/N_G. Often called the Crossing Brackets rate. When the parses are binary branching, the same as the Bracketed Recall rate.

Consistent Brackets Tree rate: 1 if C = N_G, 0 otherwise. Closely related to the Bracketed Tree rate; when the parses are binary branching, the two metrics are the same. Also called the Zero Crossing Brackets rate.

Crossing Brackets rate: Conventional name for the Consistent Brackets rate.

E: Expected value function.

Exact Match rate: Conventional name for the Labelled Tree rate.

g(i, X, j): Normalized inside-outside value.

inside(i, X, j): Inside value.

L: Number of correct constituents.

Labelled Combined rate: L/N_C − λ(N_G − L)/N_C. The weighted difference of Labelled Recall and Labelled Mistakes.

Labelled match: ⟨i, X, j⟩ occurs in both the correct and guessed parse trees.

Labelled Mistakes rate: (N_G − L)/N_C. An approximation to Labelled Precision.

Labelled Precision rate: L/N_G. A score which penalizes incorrect guesses.

Labelled Recall algorithm: Algorithm for maximizing the Labelled Recall rate.

Labelled Recall rate: L/N_C.

Labelled Tree algorithm: Algorithm for maximizing the Labelled Tree rate. Also called the Viterbi algorithm.

Labelled Tree rate: 1 if L = N_C, 0 otherwise. This metric is also called the Viterbi criterion or the Exact Match rate.

N_C: Number of constituents in the tree in the treebank.

N_G: Number of constituents in the tree output by the parser.

outside(i, X, j): Outside value.

T_C: Correct parse tree (the tree in the treebank).

T_G: Guessed parse tree (the tree output by the parser).

Viterbi algorithm: Well-known CKY-style algorithm for maximizing the Labelled Tree rate.

w_i: Word i of the input sentence.

Zero Crossing Brackets rate: Conventional name for the Consistent Brackets Tree rate.

Chapter

Data-Oriented Parsing

In this chapter we describe techniques for parsing the Data-Oriented Parsing (DOP) model far faster than with previous parsers. This work (Goodman, 1996a) represents the first replication of the DOP model, and calls into question the source of the previously reported extraordinary performance levels. The results in this chapter rely primarily on two techniques: an efficient conversion of the DOP model to a PCFG, and the General Recall algorithm of the preceding chapter.

Introduction

The DataOriented Parsing DOP mo del has an interesting and controversial history It

was introduced by Remko Scha and was then studied by Rens Bo d Bo d c

was not able to nd an ecient exact algorithm for parsing using the mo del however he

did discover and implement Monte Carlo approximations He tested these algorithms on a

cleaned up version of the ATIS corpus Hemphill et al in which inconsistencies in the

data had b een removed by hand Bo d achieved some very exciting results rep ortedly getting

of his test set exactly correct a huge improvement over previous results For instance

Bo d b compares these results to Schabes in which for short sentences of

the sentences have no crossing brackets a much easier measure than exact match Thus

Bo d achieves an extraordinary fold error rate reduction

Other researchers attempted to duplicate these results but b ecause of a lack of details

of the parsing algorithm in his publications were unable to conrm the results Magerman

Laerty p ersonal communication Even Bo ds thesis Bo d b do es not contain enough

information to replicate his results

Parsing using the DOP mo del is esp ecially dicult The mo del can b e summarized

as a sp ecial kind of Sto chastic Tree Substitution Grammar STSG given a bracketed

lab elled training corpus let every subtree of that corpus b e an elementary tree with a

probability prop ortional to the number of o ccurrences of that subtree in the training corpus

Unfortunately the number of trees is in general exp onential in the size of the training corpus

trees pro ducing an unwieldy grammar

In this chapter we introduce a reduction of the DOP mo del to an exactly equivalent

S

H

H

H

NP VP

H

H

H H

pn pn V NP

H

H

det n

Figure Training corpus tree for DOP example

Probabilistic Context-Free Grammar (PCFG) that is linear in the number of nodes in the training data. Next, we show that the General Recall algorithm introduced in the preceding chapter, which uses the inside-outside probabilities, can be used to efficiently parse the DOP model in time polynomial in sentence length and linear in training data size T. This polynomial run time is especially significant, given that Sima'an (1996a) showed that computing the most probable parse of an STSG is NP-Complete. We also give a random sampling algorithm, equivalent to Bod's, that requires asymptotically less time per sample than Bod's algorithm (G is the grammar size). We use the reduction and the two parsing algorithms to parse held-out test data, comparing these results to a replication of Pereira and Schabes (1992) on the same data. These results are disappointing: both the Monte Carlo parser and the General Recall parser applied to the DOP model perform about the same as the Pereira and Schabes method. We present an analysis of the run time of our algorithm and Bod's. Finally, we analyze Bod's data, showing that some of the difference between our performance and his is due to a fortuitous choice of test data.

This work was the first published replication of the full DOP model, i.e. using a parser that sums over derivations. It also contains algorithms implementing the model with significantly fewer resources than previously needed. Furthermore, for the first time, the DOP model is compared to a competing model on the same data.

Previous Research

The DOP model itself is extremely simple, and can be described as follows: for every sentence in a parsed training corpus, extract every subtree. In general, the number of subtrees will be very large, typically exponential in sentence length. Now, use these trees to form a Stochastic Tree Substitution Grammar (STSG). Each tree is assigned a number of counts: one count for each time it occurred as a subtree of a tree in the training corpus. Each tree is then assigned a probability by dividing its number of counts by the total number of counts of trees with the same root nonterminal. (STSGs were described in the preceding chapter.)

Given the tree of the figure above, we can use the DOP model to convert it into the STSG of the figure below. The numbers in parentheses represent the probabilities.

[Figure: Sample STSG produced from the DOP model: all subtrees of the training tree, rooted at S, NP, and VP, each listed with its probability.]

To give one example, the subtree

    NP → det n

is assigned probability 1/2, because this tree occurs once in the training corpus, and there are two subtrees in the training corpus that are rooted in NP. The resulting STSG can be used for parsing.

In theory, the DOP model has several advantages over other models. Unlike a PCFG, the use of trees allows capturing large contexts, making the model more sensitive. Since every subtree is included, even trivial ones corresponding to rules in a PCFG, novel sentences with unseen contexts may still be parsed.

Because every subtree is included, the number of subtrees is huge; therefore Bod randomly samples a small fraction of the subtrees, throwing away the rest. This reduction in grammar size significantly speeds up parsing.

As we discussed in the preceding chapter, there are three ways to parse an STSG, and thus to parse DOP. The two ways Bod considered were the most probable derivation, and the most probable parse (best according to the Labelled Tree criterion). The most probable derivation and the most probable parse may differ when there are several derivations of a given parse, as illustrated previously. The third way to parse an STSG is to use the General Recall algorithm to find the best Labelled Recall parse.

Bod shows how to approximate the most probable parse using a Monte Carlo algorithm. The algorithm randomly samples possible derivations, then finds the tree with the most sampled derivations. Bod shows that the most probable parse yields better performance than the most probable derivation on the exact match criterion.

Sima'an (1996b) implemented a version of the DOP model which parses efficiently, by limiting the number of trees used and by using an efficient most-probable-derivation model. His experiments differed from ours and Bod's in many ways, including his use of a different version of the ATIS corpus; the use of word strings, rather than part-of-speech strings; and the fact that he did not parse sentences containing unknown words, effectively throwing out the most difficult sentences. Furthermore, Sima'an limited the number of substitution sites for his trees, effectively using a subset of the DOP model. Sima'an (1996a) shows that computing the most probable parse of an STSG is NP-Complete.

[Figure: Example tree with addresses: the training tree S → NP VP, NP → pn pn, VP → v NP, NP → det n, with each node annotated with a unique numeric address.]

Reduction of DOP to PCFG

Bod's reduction to an STSG is extremely expensive, even when throwing away most of the grammar. However, it is possible to find an equivalent PCFG that contains at most eight PCFG rules for each node in the training data; thus it is linear in the size of the training data. Because this reduction is so much smaller, we do not discard any of the grammar when using it. The PCFG is equivalent in two senses: first, it generates the same strings with the same probabilities; second, using an isomorphism defined below, it generates the same trees with the same probabilities, although one must sum over several PCFG trees for each STSG tree.

To show this reduction and equivalence, we must first define some terminology. We assign every node in every tree a unique number, which we will call its address. Let A@k denote the node at address k, where A is the nonterminal labelling that node. The figure above shows the example tree augmented with addresses. We will need to create one new nonterminal for each node in the training data. We will call this nonterminal A_k. We will call nonterminals of this form "interior" nonterminals, and the original nonterminals in the parse trees "exterior".

Let a_j represent the number of non-trivial subtrees headed by the node A@j. Let a represent the number of non-trivial subtrees headed by nodes with nonterminal A; that is, a = Σ_j a_j.

Consider a node A@j of the form

    A@j → B@k  C@l

How many non-trivial subtrees does it have? Consider first the possibilities on the left branch. There are b_k non-trivial subtrees headed by B@k, and there is also the trivial case where the left node is simply B. Thus there are b_k + 1 different possibilities on the left branch. Similarly, for the right branch there are c_l + 1 possibilities. We can create a subtree by choosing any possible left subtree and any possible right subtree. Thus there are

    a_j = (b_k + 1)(c_l + 1)

possible subtrees headed by A@j. In our example tree, both noun phrases have exactly one subtree, the verb phrase has two, and the sentence has six; these numbers correspond to the number of subtrees in the STSG figure above.

[Figure: An STSG elementary tree and an isomorphic PCFG subderivation. The elementary tree is S with children NP (expanding to PN PN) and VP (expanding to V NP); in the isomorphic PCFG subderivation, the root S and the leaves PN, PN, V, NP are exterior nonterminals, while the intermediate NP and VP are interior nonterminals.]
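The counts a_j defined above can be computed bottom-up in one pass. A minimal sketch, assuming trees are (label, children) tuples with terminals and bare leaf nonterminals as plain strings:

    def subtree_count(node, counts):
        """Count the non-trivial subtrees headed at each node:
        a_j = product over children of (count(child) + 1), where a terminal or
        bare leaf child contributes a factor of 1."""
        label, children = node
        a = 1
        for child in children:
            if isinstance(child, tuple):
                a *= subtree_count(child, counts) + 1
        counts[id(node)] = a
        return a

    # For the example tree, the two NP nodes get count 1, the VP gets 2,
    # and the S gets (1 + 1) * (2 + 1) = 6.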

We will call a PCFG subderivation isomorphic to an STSG elementary tree if the subderivation begins with an exterior nonterminal, uses interior nonterminals for intermediate steps, and ends with exterior nonterminals. The figure above gives an example of an STSG elementary tree, taken from the sample STSG, and an isomorphic PCFG subderivation.

We will give a simple, small PCFG with the following surprising property: for every subtree in the training corpus headed by A, the grammar will generate an isomorphic subderivation with probability 1/a. In other words, rather than using the large, explicit STSG, we can use this small PCFG that generates isomorphic derivations with identical probabilities.

The construction is as follows. For a node such as A@j -> B@k C@l, we will generate the following eight PCFG rules, where the number in parentheses following a rule is its probability:

    A_j -> B   C    (1/a_j)            A -> B   C    (1/a)
    A_j -> B_k C    (b_k/a_j)          A -> B_k C    (b_k/a)
    A_j -> B   C_l  (c_l/a_j)          A -> B   C_l  (c_l/a)
    A_j -> B_k C_l  (b_k c_l/a_j)      A -> B_k C_l  (b_k c_l/a)
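As an illustration only (the function name and rule representation below are hypothetical, not the thesis implementation), the eight rules for a single node can be generated as follows:

    def rules_for_node(A, j, B, k, C, l, a, a_j, b_k, c_l):
        """Return the eight PCFG rules generated for node A@j -> B@k C@l.
        Interior nonterminals are written 'A_j'; exterior ones keep their name.
        Each rule is (lhs, (rhs1, rhs2), probability)."""
        Aj, Bk, Cl = f"{A}_{j}", f"{B}_{k}", f"{C}_{l}"
        return [
            (Aj, (B,  C ), 1.0 / a_j),        (A, (B,  C ), 1.0 / a),
            (Aj, (Bk, C ), b_k / a_j),        (A, (Bk, C ), b_k / a),
            (Aj, (B,  Cl), c_l / a_j),        (A, (B,  Cl), c_l / a),
            (Aj, (Bk, Cl), b_k * c_l / a_j),  (A, (Bk, Cl), b_k * c_l / a),
        ]

    # Example: the S node of the tree above (a_j = 6), assuming its NP child is node 2
    # (b_k = 1), its VP child is node 3 (c_l = 2), and a = 6 for S overall.
    for rule in rules_for_node("S", 1, "NP", 2, "VP", 3, a=6, a_j=6, b_k=1, c_l=2):
        print(rule)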

Theorem

Subderivations headed by A with external nonterminals at the roots and leaves, and internal nonterminals elsewhere, have probability 1/a. Subderivations headed by A_j with external nonterminals only at the leaves, and internal nonterminals elsewhere, have probability 1/a_j.

Proof: The proof is by induction on the depth of the trees. For trees of depth one there are two cases, A_j -> B C and A -> B C. Trivially, these trees have the required probabilities, 1/a_j and 1/a.

Now, assume that the theorem is true for trees of depth n or less. We show that it holds for trees of depth n + 1. There are eight cases, one for each of the eight rules; we show two of them. Consider subtrees of depth at most n with external leaves and internal intermediate nonterminals, headed by B@k and by C@l; by the inductive hypothesis these have probabilities 1/b_k and 1/c_l. Then for trees in which A_j expands to such a subtree headed by B@k and such a subtree headed by C@l, the probability of the tree is

    (b_k c_l / a_j) (1/b_k) (1/c_l) = 1/a_j.

Similarly, for another case, trees in which A expands to such a subtree headed by B@k and the single external node C, the probability of the tree is

    (b_k / a) (1/b_k) = 1/a.

The other six cases follow trivially, with similar reasoning.

We call a PCFG derivation isomorphic to an STSG derivation if for every substitution in the STSG there is a corresponding subderivation in the PCFG. The figure below contains an example of isomorphic derivations, using two subtrees in the STSG and four productions in the PCFG.

We call a PCFG tree isomorphic to an STSG tree if they are identical when internal nonterminals are changed to external nonterminals.

Theorem

This construction produces PCFG trees isomorphic to the STSG trees, with equal probability.

Proof: If every subtree in the training corpus occurred exactly once, the proof would be trivial. For every STSG subderivation, there would be an isomorphic PCFG subderivation with equal probability. Thus, for every STSG derivation, there would be an isomorphic PCFG derivation with equal probability. Thus every STSG tree would be produced by the PCFG with equal probability.

However, it is extremely likely that some subtrees, especially trivial ones like S -> NP VP, will occur repeatedly.

[Figure: Example of Isomorphic Derivation. On the left, a PCFG derivation using four productions; on the right, the corresponding STSG derivation using two subtrees, an elementary tree S(NP(PN PN) VP(V NP)) and a substituted subtree NP(DET N).]

If the STSG formalism were modified slightly, so that trees could occur multiple times, then our relationship could be made one-to-one. Consider a modified form of the DOP

model in which the counts of subtrees which occurred multiple times in the training corpus were not merged: both identical trees would be added to the grammar. Each of these trees will have a lower probability than if their counts were merged. This would change the probabilities of the derivations; however, the probabilities of parse trees would not change, since there would be correspondingly more derivations for each tree. Now the desired one-to-one relationship holds: for every derivation in the new STSG there is an isomorphic derivation in the PCFG with equal probability. Thus, summing over all derivations of a tree in the STSG yields the same probability as summing over all the isomorphic derivations in the PCFG. Thus every STSG tree would be produced by the PCFG with equal probability.

It follows trivially from this that no extra trees are produced by the PCFG. Since the total probability of the trees produced by the STSG is 1, and the PCFG produces these trees with the same probability, no probability is left over for any other trees.

Parsing Algorithms

As we discussed earlier, there are several different evaluation metrics one could use for finding the best parse. The three most interesting are the most probable derivation, which can be found using the Viterbi algorithm; the most probable parse, which can be found by random sampling; and the Labelled Recall parse, which can be found using the General Recall algorithm described in the previous chapter.

Bod shows that the most probable derivation does not perform as well as the most probable parse for the DOP model, getting a lower exact match rate for the most probable derivation than for the most probable parse. This performance difference is not surprising: since each parse tree can be derived by many different derivations, the most probable parse criterion takes all possible derivations into account. Similarly, the Labelled Recall parse is also derived from the sum of many different derivations. Furthermore, although the Labelled Recall parse should not do as well on the exact match criterion, it should perform even better on the Labelled Recall rate and related criteria, such as the Crossing Brackets rate. In the preceding chapter, we performed a detailed comparison between the most likely parse (the Labelled Tree parse) and the Labelled Recall parse for PCFGs; we showed that the two have very similar performance on a broad range of measures, with only a small difference in error rate. We therefore think that it is reasonable to use the General Recall algorithm, which for STSGs can compute the Labelled Recall parse, to parse the DOP model, especially since our comparisons will be on the Crossing Brackets rate, where we expect, from both theoretical and empirical considerations, that an algorithm maximizing the Labelled Recall rate will outperform one maximizing the exact match (Labelled Tree) rate. Bod complains that if we use the General Recall algorithm we are not really parsing the DOP model, since our parser does not return a most probable parse. As we will show in the results section, our parser performs at least as well as a parser that does return the most probable parse, so this objection is immaterial.

Although the General Recall algorithm was described in detail in the previous chapter, we review it here. First, for each potential constituent, where a constituent is a nonterminal, a start position, and an end position, the algorithm uses the inside-outside values to find the probability that that constituent is in the correct parse. After that, the algorithm uses dynamic programming to put the most likely constituents together to form an output parse tree; the output parse tree maximizes the expected Labelled Recall rate.

    repeat until the standard error and the mpp error are smaller than a threshold:
        sample a random derivation from the derivation forest
        store the parse generated by the sampled derivation
        mpp := the parse with maximal frequency
        calculate the standard error of the mpp and the mpp error

    Figure: Monte Carlo parsing algorithm

    for length := 1 to n
        for start := 1 to n - length + 1
            for each node X in chart[start, start + length]
                select at random a subderivation of X
                eliminate the other subderivations

    Figure: Bod's bottom-up sampling algorithm

Recall that the run time of the General Recall algorithm is dominated by the time to compute the inside and outside probabilities. For a grammar with r rules, this is O(r n^3). Now, since there are at most eight rules for each node in the training data, if the training data is of size T, the overall run time is O(T n^3).

Sampling Algorithms

Bod gives the simple algorithm of the Monte Carlo figure above for finding the most probable parse (mpp), i.e., the parse that maximizes the expected Exact Match rate. Essentially, the algorithm is to randomly sample parses from the derivation forest and pick the maximal-frequency parse. In practice, rather than compute standard errors, Bod simply ran the outer loop a fixed number of times. The algorithm Bod used for sampling a random derivation is given in the bottom-up sampling figure above.

    Function fastsample(i, X, j):
        if i + 1 = j:
            return leaf(X)
        else:
            select at random a subderivation of X, [i, Y, k] and [k, Z, j]
            return tree(X, fastsample(i, Y, k), fastsample(k, Z, j))

    Figure: Faster top-down sampling algorithm

[Table: DOP Labelled Recall versus Pereira and Schabes on Minimally Edited ATIS. Min, Max, Range, Mean, and StdDev of the Crossing Brackets and Zero Crossing Brackets rates for DOP, for Pereira and Schabes, and for their paired difference.]

[Table: DOP Labelled Recall versus Pereira and Schabes on Bod's Data. The same statistics as the previous table, plus the Exact Match rate for DOP.]

Bod analyzes the run time of this algorithm for computing one sample as polynomial in the grammar size G and the sentence length n, although by using tables it can be efficiently approximated in less time (Bod, personal communication). We used a different, but mathematically equivalent, sampling algorithm: rather than a bottom-up algorithm, we used a top-down algorithm, as shown in the fastsample figure above. The run time of our sampling algorithm, if naively implemented as we did, is also polynomial in G and n at worst, although using the same trick that Bod used it can be implemented in time roughly linear in the sentence length.
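The top-down sampler is easy to sketch in Python; the derivation-forest encoding below is an assumption made for illustration, with each node storing its subderivations weighted in proportion to their inside probabilities:

    import random

    def sample_derivation(forest, node):
        """forest maps a node (i, X, j) either to the string 'leaf' (a single word)
        or to a list of (weight, left_node, right_node) subderivations."""
        entry = forest[node]
        if entry == "leaf":
            return node
        weights = [w for w, _, _ in entry]
        w, left, right = random.choices(entry, weights=weights, k=1)[0]
        return (node, sample_derivation(forest, left), sample_derivation(forest, right))

    if __name__ == "__main__":
        forest = {
            (0, "S", 2): [(0.7, (0, "A", 1), (1, "B", 2)), (0.3, (0, "C", 1), (1, "B", 2))],
            (0, "A", 1): "leaf", (0, "C", 1): "leaf", (1, "B", 2): "leaf",
        }
        print(sample_derivation(forest, (0, "S", 2)))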

Experimental Results and Discussion

We are grateful to Bod for supplying us with data edited for his experiments, although it appears not to have been exactly the data he used. We have been unable to obtain the exact same data he used, and since we cannot get it, we use the data Bod gave us.

[Table: Three-way comparison on minimally edited ATIS data. Crossing brackets measures and Exact Match for the Labelled Recall parse, the Most Probable Parse, and Pereira and Schabes, with statistical significance indicated.]

[Table: Three-way comparison on ATIS data edited by Bod. The same comparison on Bod's data.]

The original ATIS data from the Penn Treebank is very noisy; it is difficult to even automatically read this data, due to inconsistencies between files. Researchers are thus left with the difficult decision as to how to clean up the data. For this chapter, we conducted two sets of experiments: one using a minimally cleaned-up set of data (the same as described earlier), making our results comparable to previous results; the other using the ATIS data prepared by Bod, which contained much more significant revisions.

Ten data sets were constructed by randomly splitting the minimally edited ATIS sentences into a training set and a test set, then discarding test sentences above a length cutoff. For each of the ten sets, the Labelled Recall parse, the sampling algorithm given in the fastsample figure (equivalent to, but faster than, Bod's), and the grammar induction experiment of Pereira and Schabes were run. All sentences output by the parser were made binary branching using the Continued transformation described earlier, since otherwise the crossing brackets measures are meaningless (Magerman). A few sentences were not parsable; these were assigned right-branching, period-high structure, a good heuristic (Brill). Crossing brackets, zero crossing brackets, and the paired differences between Labelled Recall and Pereira and Schabes are presented in the minimally edited ATIS table above.

The results are disappointing. In absolute value, the results are significantly below the exact match rate reported by Bod. In relative value, they are also disappointing: while the DOP results are slightly higher on average than the Pereira and Schabes results, the differences are small.

We also ran experiments using Bod's data, with randomly chosen test sets and no limit on sentence length. However, while Bod provided us with data, he did not provide us with a split into test and training data; as before, we used ten random splits. The DOP results, while better, are still disappointing, as shown in the table for Bod's data above. They continue to be noticeably worse than those reported by Bod, and again comparable to the Pereira and Schabes algorithm.

(The data Bod gave us contained no epsilon productions (traces), while Bod's published data apparently did contain epsilons as explicit part-of-speech tags. We note that this is an unconventional way to handle epsilon productions, since real data would typically not contain traces annotated in this way, although it might be possible to train a tagger to produce them. We have asked Bod for the correct data, but we have never received it. We note that, soon after pointing out to us that the data he had given us was incorrect since it did not contain epsilons, Bod mailed another researcher, John Maxwell, data without epsilons. Strangely, the data Maxwell received is different from the data we received; in particular, the data Maxwell received is a subset of the data we received, with some lines repeated to reach the same total number of lines.)

Even on Bod's data, the zero crossing brackets rate the DOP model achieves is not close to the rate Bod reported on the much harder exact match criterion. It is not clear what exactly accounts for these differences. It is also noteworthy that the results are much better on Bod's data than on the minimally edited data: the crossing brackets measures are substantially better on Bod's data. Thus it appears that part of Bod's extraordinary performance can be explained by the fact that his data is much cleaner than the data used by other researchers.

DOP does do slightly better on most measures. We performed a statistical analysis, using a t-test on the paired differences between DOP and Pereira and Schabes performance on each run. On the minimally edited ATIS data, the differences were statistically insignificant, while on Bod's data the differences were statistically significant. Our technique for finding statistical significance is more strenuous than most: we assume that, since all test sentences were parsed with the same training data, all results of a single run are correlated. Thus we compare paired differences of entire runs, rather than of sentences or constituents. This makes it harder to achieve statistical significance.

Notice also the minimum and maximum columns of the DOP minus Pereira and Schabes lines, constructed by finding, for each of the paired runs, the difference between the DOP and the Pereira and Schabes algorithms. Notice that the minimum is usually negative and the maximum is usually positive, meaning that on some tests DOP did worse than Pereira and Schabes and on some it did better. It is important to run multiple tests, especially with small test sets like these, in order to avoid misleading results.

The three-way comparison tables above compare all the algorithms; the sampling algorithm and the Labelled Recall algorithm perform almost identically. In the next section we will show that the sampling algorithm's performance probably does not scale well to longer sentences.

Timing Analysis

In this section we examine the empirical runtime of the General Recall algorithm and analyze the runtime of Bod's Monte Carlo algorithm. We also note that Bod's algorithm will probably be particularly inefficient on longer sentences.

It takes a few seconds per sentence to run our algorithm on an HP workstation, versus hours per sentence to run Bod's algorithm on a Sparc (Bod). Factoring in that the HP is roughly four times faster than the Sparc, the new algorithm is still faster by a very large factor. Of course, some of this difference may be due to differences in implementation, so this estimate is approximate.

Furthermore, we believe Bod's analysis of his parsing algorithm is flawed. Letting G represent grammar size and epsilon represent the maximum estimation error, Bod correctly analyzes his runtime as polynomial in G and n, with an additional factor depending on epsilon. However, Bod then neglects analysis of this epsilon term, assuming that it is constant; thus he concludes that his algorithm runs in polynomial time. However, for his algorithm to have some reasonable chance of finding the most probable parse, the number of times he must sample his data is, as a conservative estimate, inversely proportional to the conditional probability of that parse. For instance, if the maximum probability parse had probability p, then he would need to sample on the order of 1/p times to be reasonably sure of finding that parse.

Now, we note that the conditional probability of the most probable parse tree will in general decline exponentially with sentence length. We assume that the number of ambiguities in a sentence will increase linearly with sentence length: if a five word sentence has on average one ambiguity, then a ten word sentence will have two, etc. A linear increase in ambiguity will lead to an exponential decrease in the probability of the most probable parse.

Since the probability of the most probable parse decreases exponentially in sentence length, the number of random samples needed to find this most probable parse increases exponentially in sentence length. Thus, when using the Monte Carlo algorithm, one is left with the uncomfortable choice of exponentially decreasing the probability of finding the most probable parse, or exponentially increasing the runtime.

We admit that this argument is somewhat informal. Still, the Monte Carlo algorithm has never been tested on sentences longer than those in the ATIS corpus, and there is good reason to believe the algorithm will not work as well on longer sentences. We note that our algorithm has true runtime O(T n^3), as shown previously.
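To illustrate the argument numerically (the ambiguity rate, per-ambiguity probability, and confidence level below are made-up values, not measurements from the thesis), one can compute how many samples are needed before the most probable parse is likely to appear at all:

    import math

    def samples_needed(length, ambiguities_per_word=0.2, p_per_ambiguity=0.5,
                       confidence=0.95):
        """If the most probable parse has conditional probability
        p = p_per_ambiguity ** (ambiguities_per_word * length), then roughly
        log(1 - confidence) / log(1 - p) independent samples are needed to
        draw it at least once with the given confidence."""
        p = p_per_ambiguity ** (ambiguities_per_word * length)
        return math.ceil(math.log(1 - confidence) / math.log(1 - p))

    for n in (5, 10, 20, 40):
        print(n, samples_needed(n))   # grows rapidly with sentence length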

Analysis of Bod's Data

In the DOP model, a sentence cannot be given an exactly correct parse unless all productions in the correct parse occur in the training set. Thus we can get an upper bound on performance by examining the test corpus and finding which parse trees could not be generated using only productions in the training corpus. As mentioned earlier, the data Bod provided us with may not have been the data he used for his experiments; furthermore, the data was not divided into test and training. Nevertheless, we analyze this data to find an upper bound on average case performance.

In our paper on DOP (Goodman), we performed an analysis of Bod's data based on the following lines from his thesis (Bod):

    It may be relevant to mention that the parse coverage was [high]. This means that for [that percentage] of the test strings the perceived test corpus parse was in the derivation forest generated by the system.

Using this figure, we were able to achieve a strong bound on the likelihood of achieving Bod's results. However, Bod later informed us (personal communication) that

    The coverage refers to the percentage of sentences for which a parse was found. I did not check whether the appriopriate [sic] parse was among the found parses. I just assumed that this would have been the case, but probably it wasn't.

We can still use the fact that Bod got a high exact match rate to aid us, although this leads to a weaker upper bound than that in the original paper.

Bod randomly split his corpus into test and training. From the exact match rate of his parser, we conclude that only three of his test sentences had a correct parse that could not be generated from the training data. This small number turns out to be very surprising. An analysis of Bod's data shows that at least some of the difference in performance between his results and ours must be due to a fortuitous choice of test data, or to the data he used being even easier than the data he sent us, which was significantly easier than the original ATIS data. Bod did examine versions of DOP that smoothed, allowing productions that did not occur in the training set; however, his exact match rate and his reference to coverage are both with respect to a version that does no smoothing.

[Table: Transformations from N-ary to Binary Branching Structures. An original n-ary production and the three binary-branching conversions discussed below: Correct, Continued, and Simple.]

[Table: Probabilities of test data with ungeneratable sentences. Rows for the no-unary and unary cases; columns for the Correct, Continued, and Simple transformations.]

In order to perform our analysis, we must determine certain details of Bod's parser that affect the probability of having most sentences correctly parsable. When using a chart parser, as Bod did, three problematic cases must be handled: epsilon productions, unary productions, and n-ary (n greater than two) productions. The first two kinds of productions can be handled with a probabilistic chart parser, but large and difficult matrix manipulations are required (Stolcke); these manipulations would be especially difficult given the size of Bod's grammar. In the data Bod gave us there were no epsilon productions; in other data, Bod (personal communication) treated epsilons the same as other part of speech tags, a strange strategy. We also assume that Bod made the same choice we did and eliminated unary productions, given the difficulty of correctly parsing them. Bod himself does not know which technique he used for n-ary productions, since the chart parser he used was written by a third party (Bod, personal communication).

The n-ary productions can be parsed in a straightforward manner by converting them to binary branching form; however, there are at least three different ways to convert them, as illustrated in the transformations table above. In the method Correct, the n-ary branching productions are converted in such a way that no overgeneration is introduced: a set of special nonterminals is added, one for each partial right-hand side. In the method Continued, a single new nonterminal is introduced for each original nonterminal. Because these nonterminals occur in multiple contexts, some overgeneration is introduced; however, this overgeneration is constrained, so that elements that tend to occur only at the beginning, middle, or end of the right-hand side of a production cannot occur somewhere else. If the Simple method is used, then no new nonterminals are introduced; using this method, it is not possible to recover the n-ary branching structure from the resulting parse tree, and significant overgeneration occurs.

The probabilities table above shows the undergeneration probabilities for each of these possible techniques for handling unary productions and n-ary productions. The first column gives the probability p that a sentence contains a production found only in that sentence, and the second column contains the probability that a random set of test sentences would contain at most three such sentences, that is,

    Sum_{i=0..3} C(m, i) p^i (1 - p)^(m - i),

where m is the number of test sentences.
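A quick way to check such a value is the following sketch; the per-sentence probability and test-set size used in the example call are placeholders, not the thesis's actual figures:

    from math import comb

    def p_at_most_k(p, m, k=3):
        """Probability that at most k of m independently drawn test sentences contain
        a production found nowhere else (binomial approximation; as noted below, this
        slightly overestimates because sentences are drawn without replacement)."""
        return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

    # Example with placeholder values: p = 0.10 and a 75-sentence test set.
    print(p_at_most_k(0.10, 75))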

The table is arranged from least generous to most generous. In the upper left-hand corner is a technique Bod might reasonably have used; in that case, the probability of getting the test set he described is very small. In the lower right corner, we give Bod the absolute maximum benefit of the doubt: we assume he used a parser capable of parsing unary branching productions, that he used a very overgenerating grammar, and that he used a loose definition of Exact Match. Even in this case, there is only a small chance of getting the test set Bod describes.

Conclusion

We have given efficient techniques for parsing the DOP model. These results are significant, since the DOP model has perhaps the best reported parsing accuracy; previously, the full DOP model had not been replicated, due to the difficulty and computational complexity of the existing algorithms. We have also shown that previous results were partially due to heavy cleaning of the data, which reduced the difficulty of the task, and partially due to an unlikely choice of test data, or to data even easier than that which Bod gave us.

Of course, this research raises as many questions as it answers. Were previous results due only to the choice of test data, or are differences in implementation partly responsible? In that case, there is significant future work required to understand which differences account for Bod's exceptional performance. This will be complicated by the fact that sufficient details of Bod's implementation are not available. However, based on the fact that further extraordinary DOP results have not been reported since the conference version of this work was published, it appears that problems in our implementation are not the source of the discrepancy.

(A perl script for analyzing Bod's data is available by anonymous FTP from ftp.das.harvard.edu, in pub/goodman/analyze.perl.)

(The undergeneration probabilities above are actually slight overestimates, for a few reasons, including the fact that the sentences are drawn without replacement. Also, consider a sentence with a production that occurs only in one other sentence in the corpus: there is some probability that both sentences will end up in the test data, causing both to be ungeneratable.)

This research also shows the importance of testing on more than one small test set, as well as the importance of not making cross-corpus comparisons: if a new corpus is required, then previous algorithms should be duplicated for comparison.

The speedups we achieved were critical to the success of this chapter. Running even one experiment without the large speedup we achieved would have been difficult, never mind running the ten experiments that allowed us to compute accurate average performance and to compute statistical significance. These speedups were made possible by the combination of our efficient equivalent grammar and our use of the General Recall algorithm, which depends on the inside-outside probabilities.

Chapter

Thresholding

In this chapter we show how to efficiently threshold Probabilistic Context-Free Grammars and Probabilistic Feature Grammars, using three new thresholding algorithms: beam thresholding with the prior, global thresholding, and multiple-pass thresholding (Goodman). Each of these algorithms approximates the inside-outside probabilities. We also give an algorithm that uses the inside probabilities to efficiently optimize the settings of all of the parameters simultaneously.

Introduction

In this chapter we examine thresholding techniques for statistical parsers. While there exist theoretically efficient O(n^3 G) algorithms for parsing Probabilistic Context-Free Grammars (PCFGs), n^3 G can still be fairly large in practice. Sentences of several tens of words are common, and large grammars are increasingly frequent: for instance, Charniak used a grammar built by simply reading the rules off a tree bank, and the grammar size of more recent lexicalized grammars, such as the Probabilistic Feature Grammars (PFGs) described in the next chapter, is effectively much larger. The product of n^3 and G can quickly become very large; thus, practical parsing algorithms usually make use of pruning techniques, such as beam thresholding, for increased speed.

We introduce two novel thresholding techniques, global thresholding and multiple-pass parsing, and one significant variation on traditional beam thresholding; each of these three techniques uses a different approximation to the inside-outside probabilities to improve thresholding. We examine the value of these techniques when used separately and when combined. In order to examine the combined techniques, we also introduce an algorithm for optimizing the settings of multiple thresholds using the inside probabilities. When all three thresholding methods are used together, they yield very significant speedups over traditional beam thresholding alone, while achieving the same level of performance.

We apply our techniques to CKY chart parsing, one of the most commonly used parsing methods in natural language processing, as described earlier. Recall that in a CKY chart parser, a two-dimensional matrix of cells, the chart, is filled in. Each cell in the chart corresponds to a span of the sentence, and each cell of the chart contains the nonterminals that could generate that span.

[Figure: Precision and Recall versus Time in Beam Thresholding. Precision and recall (percent correct) plotted against parsing time.]

The parser fills in a cell in the chart by examining the

nonterminals in lower, shorter cells and combining these nonterminals according to the rules of the grammar. The more nonterminals there are in the shorter cells, the more combinations of nonterminals the parser must consider.

In some grammars, such as PCFGs and PFGs, probabilities are associated with the grammar rules. This introduces problems, since in many grammars almost any combination of nonterminals is possible, perhaps with some low probability. The large number of possibilities can greatly slow parsing. On the other hand, the probabilities also introduce new opportunities. For instance, if in a particular cell in the chart there is some nonterminal that generates the span with high probability, and another that generates that span with low probability, then we can remove the less likely nonterminal from the cell. The less likely nonterminal will probably not be part of either the correct parse or the tree returned by the parser, so removing it will do little harm. This technique is called beam thresholding. If we use a loose beam threshold, removing only those nonterminals that are much less probable than the best nonterminal in a cell, our parser will run only slightly faster than with no thresholding, while performance measures such as precision and recall will remain virtually unchanged. On the other hand, if we use a tight threshold, removing nonterminals that are almost as probable as the best nonterminal in a cell, then we can get a considerable speedup, but at a considerable cost. The figure above shows the tradeoff between accuracy and time, using a beam threshold swept over a wide range of values.

When we beam threshold, we remove less likely nonterminals from the chart. There are many ways to measure the likelihood of a nonterminal. The ideal measure would be the normalized inside-outside probability, which would give the probability that the nonterminal was correct given the whole sentence. However, we cannot compute the outside probability of a nonterminal until we are finished computing all of the inside probabilities, so this technique cannot be used in practice.

We can, though, approximate the inside-outside probability in several different ways. In this chapter we will consider three different kinds of thresholding, using three different approximations to the inside-outside probability. In traditional beam search, only the inside probability is used: the probability of the nonterminal generating the terminals of the cell's span. We have found that a minor variation, introduced below, in which we also consider the average outside probability of the nonterminal, which is proportional to its prior probability of being part of the correct parse, can lead to nearly an order of magnitude improvement.

The problem with beam search is that it only compares nonterminals to other nonterminals in the same cell. Consider the case in which a particular cell contains only bad nonterminals, all of roughly equal probability. We cannot threshold out these nodes, because even though they are all bad, none is much worse than the best. Thus, what we want is a thresholding technique that uses some global information for thresholding, rather than just using information in a single cell. The second kind of thresholding we consider is a novel technique, global thresholding, described below. Global thresholding uses an approximation to the outside probability that uses all nonterminals not covered by the constituent, allowing inside-outside probabilities of nonterminals covering different spans to be compared.

The last technique we consider, multiple-pass parsing, is introduced later in this chapter. The basic idea is that we can use inside-outside probabilities from parsing with one grammar as approximations to inside-outside probabilities in another. We run two passes, with two different grammars. The first grammar is fast and simple. We compute the inside-outside probabilities using this first pass grammar, and use these probabilities to avoid considering unlikely constituents in the second pass grammar. The second pass is more complicated and slower, but also more accurate. Because we have already eliminated many low probability nodes using the inside-outside probabilities from the first pass, the second pass can run much faster and, despite the fact that we have to run two passes, the added savings in the second pass can easily outweigh the cost of the first one.

Experimental comparisons of these techniques show that they lead to considerable speedups over traditional thresholding when used separately. We also wished to combine the thresholding techniques; this is relatively difficult, since searching for the optimal thresholding parameters in a multidimensional space is potentially very time consuming. Attempting to optimize performance measures such as precision and recall using gradient descent is not feasible, because these measures are too noisy. However, we found that the inside probability was monotonic enough to optimize, and designed a variant on a gradient descent search algorithm to find the optimal parameters. Using all three thresholding methods together, and the parameter search algorithm, we achieved our best results, running

    for each start s
        for each rule A -> w_s
            chart[s, A, s+1] := P(A -> w_s)
    for each length l, shortest to longest
        for each start s
            for each split length t
                for each B such that chart[s, B, s+t] is nonzero
                    for each C such that chart[s+t, C, s+l] is nonzero
                        for each rule A -> B C in R
                            chart[s, A, s+l] := chart[s, A, s+l] +
                                P(A -> B C) * chart[s, B, s+t] * chart[s+t, C, s+l]
            best := max_A chart[s, A, s+l]
            for each A
                if chart[s, A, s+l] < T_B * best
                    chart[s, A, s+l] := 0
    return chart[1, S, n+1]

    Figure: Inside Parser with Beam Thresholding

many times faster than traditional beam search at the same performance level.

Beam Thresholding

The first and simplest technique we will examine is beam thresholding. While this technique is used as part of many search algorithms, beam thresholding with PCFGs is most similar to beam thresholding as used in speech recognition. Beam thresholding is often used in statistical parsers, such as that of Collins.

Consider a nonterminal X in a cell covering the span of terminals w_j ... w_{k-1}. We will refer to this as node <j, X, k>, since it corresponds to a potential node in the final parse tree. In beam thresholding, we compare nodes <j, X, k> and <j, Y, k> covering the same span. If one node is much more likely than the other, then it is unlikely that the less probable node will be part of the correct parse, and we can remove it from the chart, saving time later.

There is some ambiguity about what it means for a node <j, X, k> to be more likely than some other node. According to folk wisdom, the best way to measure the likelihood of a node <j, X, k> is to use the inside probability, inside(j, X, k) = P(X => w_j ... w_{k-1}). The figure above shows a PCFG parser for computing inside probabilities that uses this traditional beam thresholding. This parser is just a conventional PCFG parser, with two changes. The most important change is that after parsing a given span, we find the most probable nonterminal in that span; then, given a thresholding factor T_B, we find all nonterminals less probable than the best by a factor of T_B and set their probabilities to zero. The other change from the way we have specified PCFG parsers elsewhere is that we loop over child symbols B, C with nonzero probabilities, and then find rules consistent with these children. Elsewhere, we looped over all rules, but doing that here would mean that beam thresholding would not lead to any reduction in the number of rules examined.

Recall that the outside probability of a node <j, X, k> is the probability of that node given the surrounding terminals of the sentence, i.e., outside(j, X, k) = P(S => w_1 ... w_{j-1} X w_k ... w_n). Ideally, we would multiply the inside probability by the outside probability and normalize, computing

    inside(j, X, k) * outside(j, X, k) / inside(1, S, n+1)
        = P(S => w_1 ... w_{j-1} X w_k ... w_n => w_1 ... w_n | S => w_1 ... w_n).

This expression would give us the overall probability that the node is part of the correct parse, which would be ideal for thresholding. However, there is no good way to quickly compute the outside probability of a node during bottom-up chart parsing, although it can be efficiently computed afterwards. One simple approximation to the outside probability

compute the outside probability of a no de during b ottomup chart parsing although it can

b e eciently computed afterwards One simple approximation to the outside probability

of a no de hj X k i is just the average outside probability of the nonterminal X across the

language

X

outside i X j P S w w

n

jk jnk w w

n

X

P S w w X w w jS w w P S w w

j n n n

k

jk jnk w w

n

We will show that the average outside probability of a nonterminal X is prop ortional to the

prior probability of X where by prior probability we mean the probability that a random

nonterminal of a random parse tree will b e X Formally letting C D X denote the number

of o ccurrences of nonterminal X in a derivation D we can write the prior probability as

P

P D C D X

D a derivation

P P

P X

P D C D Y

Y

D a derivation

Now

X

P S w w X w w jS w w P S w w

j n n n

k

jk jnk w w

n

X

P S w w X w w P X w w

j n j j

k

jk jnk w w

n

X

P D C D X

D a derivation

Thus the average outside probability of a nonterminal X is the same as the prior probability

of X except for a factor equal to the normalization term of Expression

P P

P D C D Y

Y

D a derivation

Since it is easier to compute the prior probability than the average outside probability and

since all of our values will have the same normalization factor we use the prior probability

rather than the average outside probability
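In practice the prior can be estimated directly from parse trees; the following sketch (assumed tree encoding, illustration only) computes relative nonterminal frequencies, which is all the thresholding measure needs up to the common normalization factor:

    from collections import Counter

    def nonterminal_priors(trees):
        """trees: an iterable of parse trees, each (label, child, child, ...) with string leaves.
        Returns an estimate of P(X) as the relative frequency of X among nonterminal nodes."""
        counts = Counter()

        def walk(node):
            if isinstance(node, str):
                return
            label, *children = node
            counts[label] += 1
            for child in children:
                walk(child)

        for tree in trees:
            walk(tree)
        total = sum(counts.values())
        return {X: c / total for X, c in counts.items()}

    print(nonterminal_priors([("S", ("NP", "pn", "pn"), ("VP", "v", ("NP", "det", "n")))]))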

    Item form:
        [i, A, j]    (inside or Viterbi)
        <i, j>       (inside or Viterbi)

    Goal:
        [1, S, n+1]

    Rules:

    Unary:          R(A -> w_i)
                    -----------
                    [i, A, i+1]

    Binary:         R(A -> B C)   [i, B, k]   [k, C, j]
                    ---------------------------------      R(B) * [i, B, k] >= T_B * <i, k>
                                [i, A, j]                   R(C) * [k, C, j] >= T_B * <k, j>

    Thresholding:   R(A)   [i, A, j]
                    ----------------
                         <i, j>

    Figure: Beam thresholding item-based description

Our final thresholding measure, then, is P(X) * inside(j, X, k); the thresholding step of the parser above is modified to read:

    best := max_A chart[s, A, s+l] * P(A)
    for each A
        if chart[s, A, s+l] * P(A) < T_B * best
            chart[s, A, s+l] := 0
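For concreteness, here is a compact Python sketch of inside CKY parsing with this prior-scaled beam thresholding (the grammar and prior dictionaries are assumed representations for illustration, not the thesis's code):

    from collections import defaultdict

    def inside_beam(words, lexical, binary, prior, T_B=1e-3):
        """lexical[(A, w)] and binary[(A, B, C)] hold rule probabilities; prior[A] is P(A).
        chart[(i, j)][A] holds the (thresholded) inside probability of A over words i..j-1."""
        n = len(words)
        chart = defaultdict(dict)
        for i, w in enumerate(words):
            for (A, word), p in lexical.items():
                if word == w:
                    chart[(i, i + 1)][A] = chart[(i, i + 1)].get(A, 0.0) + p
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                cell = chart[(i, j)]
                for k in range(i + 1, j):
                    for B, pb in chart[(i, k)].items():
                        for C, pc in chart[(k, j)].items():
                            for (A, B2, C2), pr in binary.items():
                                if B2 == B and C2 == C:
                                    cell[A] = cell.get(A, 0.0) + pr * pb * pc
                if cell:
                    best = max(p * prior[A] for A, p in cell.items())
                    for A in list(cell):
                        # prune relative to the best entry, with both scaled by the prior
                        if cell[A] * prior[A] < T_B * best:
                            del cell[A]
        return chart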

We can also give an item-based description for the CKY algorithm with beam thresholding with the prior, as shown in the item-based figure above. The item-based descriptions in this chapter can be skipped by those not familiar with semiring parsing as described in the earlier chapter; the descriptions here use minor extensions to that framework, described previously. The thresholding algorithm is the same as the usual CKY algorithm, except that there is an added item form, <i, j>, containing the probability of the most probable nonterminal in the span, and an added side condition on the binary rule, which ensures that both children are sufficiently probable. For the item-based description, where we indicate rule values with the function R, we have used the notation R(X) to indicate the prior probability of nonterminal X.

In the experimental section, we will show experiments comparing inside-probability beam thresholding to beam thresholding using the inside probability times the prior. Using the prior can lead to a speedup of nearly an order of magnitude at the same performance level.

although not particularly insightful on our part Collins p ersonal communication inde .0001 B 1.0 A D

.9999 C 1.0

Figure Example Hidden Markov Mo del

p endently observed the usefulness of this mo dication and Caraballo and Charniak

used a related technique in a b estrst parser We think that the main reason this technique

was not used so oner is that b eam thresholding for PCFGs is derived from b eam threshold

ing in sp eech recognition using Hidden Markov Mo dels HMMs and in HMMs this extra

factor is almost never needed

Consider the simple HMM of the figure above. We have not annotated any output symbols; all states output the same symbol. After a single input symbol, we are in state B with probability .0001 and in state C with probability .9999. If we are using thresholding, we should threshold out state B. After one more symbol, we arrive in the final state D, and we are done. Now consider the same process backwards. HMMs can be run backwards, starting from the final state and the last time and moving towards the start state and beginning time. The total probability of any string will be the same computed backwards as it was forwards. However, notice what happens to the thresholding: state B and state C are both equally likely, with value 1.0, when we move backwards, and no thresholding occurs. When moving forwards, as long as every state in an HMM has some path to a final state (and in essentially all speech recognition applications this is the case), every state in an HMM has probability 1 of eventually (perhaps after transitioning through many other states) reaching the final state. When processed backwards, the probability of reaching the start state can be much less than one, and in order to perform optimal thresholding, it is necessary to factor in the prior probability of reaching each state from the start.

In speech recognition, where beam thresholding was developed, processing is usually done forwards, and this extra factor is not needed. In contrast, in parsing, the processing is usually bottom up, corresponding to a backwards processing from end states (terminals) to the start state (the start symbol S). It is because of this bottom-up, backwards processing that we need the extra factor indicating the probability of getting from the start symbol to the nonterminal in question, which is proportional to the prior probability. As we noted, this can be very different for different nonterminals.
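A tiny numeric check of the HMM example (transition probabilities taken from the figure; the code itself is only an illustration) shows why the backward pass alone cannot separate states B and C:

    # HMM from the figure: A -> B (0.0001), A -> C (0.9999), B -> D (1.0), C -> D (1.0).
    trans = {("A", "B"): 0.0001, ("A", "C"): 0.9999, ("B", "D"): 1.0, ("C", "D"): 1.0}

    # Forward probability of being in each state after one step from A.
    forward = {s: trans.get(("A", s), 0.0) for s in ("B", "C")}

    # Backward probability of reaching the final state D from each state.
    backward = {s: trans.get((s, "D"), 0.0) for s in ("B", "C")}

    print(forward)   # {'B': 0.0001, 'C': 0.9999} -> B can be thresholded out going forwards
    print(backward)  # {'B': 1.0, 'C': 1.0}       -> going backwards, B and C look equally good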

(In the cases where HMM processing is done backwards, typically the forward probabilities are available, and techniques more sophisticated than beam thresholding can be used.)

[Figure: Global Thresholding Motivation. Nodes A, B, and C form a sequence of nodes stretching from the start of the chart to the end; nodes X, Y, and Z are not part of any such sequence.]

Global Thresholding

As mentioned earlier, the problem with beam thresholding is that it can only threshold out the worst nodes of a cell. It cannot threshold out an entire cell, even if there are no good nodes in it. To remedy this problem, we introduce a novel thresholding technique, global thresholding, that uses an approximation to the outside probability which takes into account all terminals not covered by the span under consideration. This allows nonterminals in different cells to be compared to each other.

The key insight of global thresholding is due to Rayner and Carter. Rayner and Carter noticed that a particular node cannot be part of the correct parse if there are no nodes in adjacent cells. In fact, it must be part of a sequence of nodes stretching from the start of the string to the end. In a probabilistic framework, where almost every node will have some (possibly very small) probability, we can rephrase this requirement as being that the node must be part of a reasonably probable sequence.

The motivating figure above shows an example of this insight. Nodes A, B, and C will not be thresholded out, because each is part of a sequence from the beginning to the end of the chart. On the other hand, nodes X, Y, and Z will be thresholded out, because none is part of such a sequence.

Rayner and Carter used this insight for a hierarchical, non-recursive grammar, and only used their technique to prune after the first level of the grammar. They computed a score for each sequence as the minimum of the scores of each node in the sequence, and computed a score for each node in the sequence as the minimum of three scores: one based on statistics about nodes to the left, one based on nodes to the right, and one based on unigram statistics.

We wanted to extend the work of Rayner and Carter to general PCFGs, including those that are recursive. Our approach therefore differs from theirs in many ways. Rayner and Carter ignore the inside probabilities of nodes. While this approach may work after processing only the first level of a grammar, when the inside probabilities will be relatively homogeneous, it could cause problems after other levels, when the inside probability of a node will give important information about its usefulness. On the other hand, because long nodes will tend to have low inside probabilities, taking the minimum of all scores strongly favors sequences of short nodes. Furthermore, their algorithm is relatively expensive even when run only once; this is acceptable if the algorithm is run only after the first level, but running it more often would be too slow. Finally, we hoped to find an algorithm that was somewhat less heuristic in nature.

Our global thresholding technique thresholds out a node <j, X, k> if the ratio between the most probable sequence of nodes including node <j, X, k> and the overall most probable sequence of nodes is less than some threshold T_G. Formally, denoting sequences of nodes by L, we threshold node <j, X, k> if

    max_{L : <j,X,k> in L} P(L)  <  T_G * max_L P(L).

Now, the hard part is determining P(L), the probability of a node sequence. There is no way to do this efficiently as part of the intermediate computation of a bottom-up chart parser. Thus, we will approximate P(L) as follows:

    P(L) = Prod_i P(L_i | L_1 ... L_{i-1}) ~ Prod_i P(L_i).

That is, we assume independence between the elements of a sequence. The probability of a node L_i = <j, X, k> is just its prior probability times its inside probability, as before.

Another way to look at global thresholding is as an approximation to the unnormalized inside-outside probability. In particular,

    P(S => w_1 ... w_{j-1} X w_k ... w_n => w_1 ... w_n)
        = Sum_{l,m,A_1...A_l,B_1...B_m} P(S => A_1 ... A_l X B_1 ... B_m => w_1 ... w_{j-1} X w_k ... w_n => w_1 ... w_n)
        ~ Sum_{L : <j,X,k> in L} P(L).

Unlike the thresholding criterion above, this expression uses a summation rather than a maximum; in practice, we have not found a performance difference using either form. This approximation is not a very good one, since it will sum most derivations repeatedly. For instance, if we have S => A X, A => A_1 A_2, and X => w_j ... w_n, then we will sum both P(S => A X => w_1 ... w_{j-1} X) and P(S => A_1 A_2 X => w_1 ... w_{j-1} X). However, we speculate that each node <j, X, k> is affected more or less equally by this approximation, and that the effects cancel out. Next, we make the same independence assumption as before: that the probability of a sequence of nonterminals is equal to the product of the

prior probabilities of the nonterminals,

    P(S => A_1 ... A_l X B_1 ... B_m) ~ P(A_1) ... P(A_l) P(X) P(B_1) ... P(B_m).

Using this expression, we can approximate the inside-outside probability of any node <j, X, k> and compare it to the best inside-outside probability of the sentence.

    float f[0..n] := {1, 0, ..., 0}
    for end := 1 to n
        f[end] := max_{start, X} f[start] * inside(start, X, end) * P(X)

    float b[0..n] := {0, ..., 0, 1}
    for start := n-1 downto 0
        b[start] := max_{end, X} inside(start, X, end) * P(X) * b[end]

    bestProb := f[n]
    for each node <start, X, end>
        total := f[start] * inside(start, X, end) * P(X) * b[end]
        active[start, X, end] := TRUE if total >= T_G * bestProb, FALSE otherwise

    Figure: Global Thresholding Algorithm

The most important difference between global thresholding and beam thresholding is that global thresholding is global: any node in the chart can help prune out any other node. In stark contrast, beam thresholding only compares nodes to other nodes covering the same span. Beam thresholding typically allows tighter thresholds, since there are fewer approximations, but does not benefit from global information.

Global Thresholding Algorithm

Global thresholding is performed in a bottom-up chart parser, immediately after each length is completed. It thus runs n times during the course of parsing a sentence of length n.

We use the simple dynamic programming algorithm of the figure above. There are O(n^2) nodes in the chart, and each node is examined exactly three times, so the run time of this algorithm is O(n^2). The first section of the algorithm works forwards, computing, for each i, f(i), which contains the score of the best sequence covering terminals w_1 ... w_i. Thus f(n) contains the score of the best sequence covering the whole sentence, max_L P(L). The algorithm works analogously to the Viterbi algorithm for HMMs. The second section is analogous, but works backwards, computing b(i), which contains the score of the best sequence covering terminals w_{i+1} ... w_n.

Once we have computed the preceding arrays, computing max_{L : <j,X,k> in L} P(L) is straightforward: we simply want the score of the best sequence covering the nodes to the left of j, f(j), times the score of the node itself, times the score of the best sequence of nodes from k to the end, which is just b(k). Using this expression, we can threshold each node quickly. Since this algorithm is run n times during the course of parsing, and requires time O(n^2) each time it runs, the algorithm requires time O(n^3) overall. Experiments will show that the time it saves easily outweighs the time it uses.
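A Python rendering of this dynamic program may help; it is a sketch under the assumption that the chart nodes of the current pass are available as a dictionary from (start, X, end) to inside probability, with a separate prior table:

    def global_threshold(nodes, n, prior, T_G=1e-4):
        """nodes maps (start, X, end) -> inside probability, with 0 <= start < end <= n.
        prior maps nonterminal X -> its prior probability.
        Returns the set of nodes surviving global thresholding."""
        f = [0.0] * (n + 1)
        f[0] = 1.0
        for end in range(1, n + 1):          # best sequence score covering words 0..end
            f[end] = max((f[s] * p * prior[X] for (s, X, e), p in nodes.items() if e == end),
                         default=0.0)
        b = [0.0] * (n + 1)
        b[n] = 1.0
        for start in range(n - 1, -1, -1):   # best sequence score covering words start..n
            b[start] = max((p * prior[X] * b[e] for (s, X, e), p in nodes.items() if s == start),
                           default=0.0)
        best = f[n]
        return {(s, X, e) for (s, X, e), p in nodes.items()
                if f[s] * p * prior[X] * b[e] >= T_G * best}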

    Item form:
        [i, A, j]    (inside or Viterbi)
        <i>_j        (inside or Viterbi)

    Primary Goal:
        [1, S, n+1]

    Secondary Goals:
        <n+1>_j

    Rules:

    Unary:          R(A -> w_i)
                    -----------
                    [i, A, i+1]

    Binary:         R(A -> B C)   [i, B, k]   [k, C, j]
                    ---------------------------------
                                [i, A, j]
                    side condition: both [i, B, k] and [k, C, j] must be part of a sequence,
                    built from items of length at most j - i, whose score is within a factor
                    of T_G of the best such sequence, as measured by the forward and reverse
                    values of the <.>_{j-i} items

    Initialization:     -------
                        <1>_k

    Extension:      <i>_k   [i, A, j]
                    -----------------     j - i <= k
                          <j>_k

    Figure: Global thresholding item-based description

In the figure above we give an item-based description for a global thresholding parser. The algorithm is again very similar to the CKY algorithm: we have unary and binary rules as usual. However, there is now a side condition on the binary rules: both the left and the right child must be part of a reasonably likely sequence.

There are two item types. The first, [i, A, j], has the usual meaning. The second item type, <i>_j, can be deduced if there is a sequence of items [1, A, m], [m, B, o], ..., [p, C, i] where each item has length at most j, where the length of an item [i, A, k] is k - i, the number of words it covers.

The pseudo-code of the Global Thresholding Algorithm figure contains code only for thresholding, which would be run after each length was processed in the main parser; the item-based description gives a description of the complete parser. In the procedural version, the value of f(i), when it is computed after length j, would equal the forward value of <i>_j in the item-based description, and the value of b(i) would equal the reverse value of <i>_j.

There are two rules for deducing items of type <i>_j: initialization and extension. Initialization simply states that there is a zero-length sequence starting at the first word. Extension states that if there is a sequence of items of length at most k covering the words up through position i, and there is an item covering positions i through j of length at most k, then there is a sequence of items of length at most k covering the words up through position j. The trickiest rule is the binary rule, which has a fairly complicated side condition: it checks, for both [i, B, k] and [k, C, j], that there is a reasonably likely sequence covering the sentence using items of length at most j - i and including the item.

(We should note that an earlier version of global thresholding, using a standard pseudo-code specification, contained a subtle bug: the reverse values for the items <i>_j were incorrectly computed. The item-based description makes it clearer what is going on than the procedural definition can, and makes a bug of this form much less likely. Thanks to Michael Collins for catching it.)

Multiple-Pass Parsing

In this section we discuss a novel thresholding technique, multiple-pass parsing. We show that multiple-pass parsing techniques can yield large speedups. Multiple-pass parsing is a variation on a new technique in speech recognition, multiple-pass speech recognition (Zavaliagkos et al.), which we introduce first.

Multiple-Pass Speech Recognition

The basic idea behind multiple-pass speech recognition is that we can use the normalized forward-backward probabilities of one HMM as approximations to the normalized forward-backward probabilities of another HMM. In an idealized multiple-pass speech recognizer, we first run a simple, fast, first pass HMM, computing the forward and backward probabilities. We then use these probabilities as approximations to the probabilities in the corresponding states of a slower, more accurate second pass HMM. We do not need to examine states in this second pass HMM that correspond to low inside-outside probability states in the first pass. The extra time of running two passes is more than made up for by the time saved in the second pass.

The mathematics of multiple-pass recognition is fairly simple. In the first, simple pass, we record the forward probabilities, forward(t, i), and backward probabilities, backward(t, i), of each state i at each time t. Now

    forward(t, i) * backward(t, i) / forward(T, final)

gives the overall probability of being in state i at time t, given the acoustics. Our second pass will use an HMM whose states are analogous to the first pass HMM's states. If a first pass state at some time is unlikely, then the analogous second pass state is probably also unlikely, so we can threshold it out.

There are a few complications to multiple-pass recognition. First, storing all the forward and backward probabilities can be expensive. Second, the second pass is more complicated than the first, typically meaning that it has more states, so the mapping between states in the first pass and states in the second pass may be nontrivial. To solve both these problems, only states at word transitions are saved; that is, from pass to pass, only information about where words are likely to start and end is used for thresholding.

    for length := 1 to n
        for start := 1 to n - length + 1
            for leftLength := 1 to length - 1
                LeftPrev := PrevChart[leftLength, start]
                for each LeftNodePrev in LeftPrev
                    for each non-thresholded production instance Prod from LeftNodePrev of size length
                        for each elaboration L of Prod's left child
                            for each elaboration R of Prod's right child
                                for each elaboration P of Prod's parent such that P -> L R
                                    add P to Chart[length, start]

    Figure: Second Pass Parsing Algorithm

Multiple-Pass Parsing

We can use an analogous algorithm for multiple-pass parsing. In this case, we will use the normalized inside-outside probabilities of one grammar as approximations to the normalized inside-outside probabilities of another grammar. We first compute the normalized inside-outside probabilities of a simple, fast, first pass grammar. Next, we run a slower, more accurate second pass grammar, ignoring constituents whose corresponding first pass inside-outside probabilities are too low.

Of course, for our second pass to be more accurate, it will probably be more complicated, typically containing an increased number of nonterminals and productions. Thus, we create a mapping function from each first pass nonterminal to a set of second pass nonterminals, and threshold out those second pass nonterminals that map from low-scoring first pass nonterminals. We call this mapping function the elaborations function. (In this chapter, we will assume that each second pass nonterminal is an elaboration of at most one first pass nonterminal in each cell; the grammars used here have this property. If this assumption is violated, multiple-pass parsing is still possible, but some of the algorithms need to be changed.)

There are many possible examples of first and second pass combinations. For instance, the first pass could use regular nonterminals, such as NP and VP, and the second pass could use nonterminals augmented with head-word information. The elaborations function then appends the possible head words to the first pass nonterminals to get the second pass ones.

Even though the correspondence between forward-backward and inside-outside probabilities is very close, there are important differences between speech-recognition HMMs and natural-language-processing PCFGs. In particular, we have found that in some cases it is more important to threshold productions than nonterminals. That is, rather than just noticing that a particular nonterminal VP spanning the words "killed the rabbit" is very likely, we also note that the production VP -> V NP, with the relevant spans, is likely.

Both the first and second pass parsing algorithms are simple variations on CKY parsing. In the first pass, we now keep track of each production instance associated with a node, i.e., <i, X, j> -> <i, Y, k> <k, Z, j>, computing the inside and outside probabilities of each. We remove all constituents and production instances from the first pass whose normalized inside-outside probability is too small. The second pass requires more changes. Let us denote the elaborations of nonterminal X by X_1, ..., X_x. In the second pass, for each production instance of the form <i, X, j> -> <i, Y, k> <k, Z, j> in the first pass that was not thresholded out (by multi-pass, beam, or global thresholding), we consider every elaborated production instance, that is, all those of the form <i, X_p, j> -> <i, Y_q, k> <k, Z_r, j> for appropriate values of p, q, and r. This algorithm is given in the Second Pass Parsing figure above, which uses a current pass matrix, Chart, to keep track of nonterminals in the current pass, and a previous pass matrix, PrevChart, to keep track of nonterminals in the previous pass. We use one additional optimization: keeping track of the elaborations of each nonterminal in each cell of PrevChart that are in the corresponding cell of Chart.

We tried multiple-pass thresholding in two different ways. In the first technique we tried, production-instance thresholding, we remove from consideration in the second pass the elaborations of all production instances whose combined inside-outside probability falls below a threshold. In the second technique, node thresholding, we remove from consideration the elaborations of all nodes whose inside-outside probability falls below a threshold. In our pilot experiments, we found that in some cases one technique works slightly better, and in some cases the other does. We therefore ran our experiments using both thresholds together.

The item-based description format of the semiring parsing chapter allows us to describe multiple-pass parsing very succinctly; the figure below is such a description. This description only does node thresholding; production thresholding could be implemented in a similar way. The description is identical to the CKY item-based description, with the following changes. First, every item is annotated with a subscript indicating the pass of that item. We have also annotated the rule value function with a subscript indicating the pass for that rule. Finally, each deduction rule contains an additional side condition of the form

    x = 1,  or  V([i, map_x(A), j]_{x-1}) Z([i, map_x(A), j]_{x-1}) / V([1, S, n+1]_{x-1}) >= T_x

(where V and Z denote item values and reverse values, as in the semiring parsing chapter), indicating that the deduction rule should trigger only if we are either on the first pass, or the equivalent item from the previous pass, derived using the map_x function, was within the threshold T_x. Notice that the first p - 1 passes use the inside semiring, but that the final pass can use any semiring.

Item form:
    ⟨i, A, j⟩_x

Primary Goal:
    ⟨1, S, n⟩_p

Secondary Goals:
    ⟨1, S, n⟩_x    for x < p

Rules:

Unary:
        R_x(A → w_i)
    ---------------------  x = 1  or  (V_{x−1} · Z_{x−1})(⟨i, map_x(A), i+1⟩) / V_{x−1}(⟨1, S, n⟩) ≥ T_x
        ⟨i, A, i+1⟩_x

Binary:
        R_x(A → B C)    ⟨i, B, k⟩_x    ⟨k, C, j⟩_x
    --------------------------------------------------  x = 1  or  (V_{x−1} · Z_{x−1})(⟨i, map_x(A), j⟩) / V_{x−1}(⟨1, S, n⟩) ≥ T_x
        ⟨i, A, j⟩_x

Figure : Multiple-Pass Parsing Description

One nice feature of multiple-pass parsing is that under special circumstances it is an admissible search technique, meaning that we are guaranteed to find the best solution with it. In particular, if we parse using no thresholding, and our grammars have the property that for every non-zero probability parse in the second pass there is an analogous

non-zero probability parse in the first pass, then multiple-pass search is admissible. Under these circumstances, no non-zero probability parse will be thresholded out, but many zero probability parses may be removed from consideration. While we will almost always wish to parse using thresholds, it is nice to know that multiple-pass parsing can be seen as an approximation to an admissible technique, where the degree of approximation is controlled by the thresholding parameter.

Multiple Parameter Optimization

The use of any one of these techniques does not exclude the use of the others. There is no reason that we cannot use beam thresholding, global thresholding, and multiple-pass parsing all at the same time. In general, it would not make sense to use a technique such as multiple-pass parsing without other thresholding techniques: our first pass would be overwhelmingly slow without some sort of thresholding.

There are, however, some practical considerations. To optimize a single threshold, we could simply sweep our parameters over a one-dimensional range and pick the best speed versus performance tradeoff. In combining multiple techniques, we need to find optimal combinations of thresholding parameters. Rather than having to examine ten values in a single-dimensional space, we might have to examine one hundred combinations in a two-dimensional space. Later we show experiments with up to six thresholds. Since we do not have time to parse with one million parameter combinations, we need a better search algorithm.

Table : Monotonicity of various metrics. For each metric (Inside, Viterbi, Cross Bracket, Zero Cross Bracket, Precision, Recall), the table reported how many sentences decreased in score, stayed the same, or increased when the threshold was loosened.

Ideally, we would simply run some form of gradient descent algorithm, optimizing a

weighted sum of performance and time. There are two problems with this approach. First is that most measures of performance are too noisy. It is important when doing gradient descent that the performance measure be smooth and monotonic enough that the numerical derivative can be accurately measured. If the performance measure is noisy, then we must run a large number of sentences in order to accurately measure the derivative. However, if the number of sentences is too large, then the algorithm will be too slow. Thus, it is important to have a smooth performance measure. We will show that the inside probability is much smoother and more monotonic than conventional performance measures. Because it is so smooth, we can use a relatively small number of sentences when determining derivatives, allowing the optimization algorithm to run in a reasonable amount of time.

The second problem with using a simple gradient descent algorithm is that it does not give us much control over the solution we arrive at. Ideally, we would like to be able to pick a performance level, in terms of either entropy or precision and recall, and find the best set of thresholds for achieving that performance level as quickly as possible. If an absolute performance level is our goal, then a normal gradient descent technique will not work, since we cannot use such a technique to optimize one function of a set of variables (time as a function of thresholds) while holding another one constant (performance). We could use gradient descent to minimize a weighted sum of time and performance, but we would not know at the beginning what performance level we would have at the end. If our goal is to have the best performance we can while running in real time, or to achieve a minimum acceptable performance level with as little time as necessary, then a simple gradient descent function would not work as well as the algorithm we will give.

We will show that the inside probability is a good performance measure to optimize. We need a metric of performance that will be sensitive to changes in threshold values. In particular, our ideal metric would be strictly increasing as our thresholds loosened, so that every loosening of threshold values would produce a measurable increase in performance. The closer we get to this ideal, the fewer sentences we need to test during parameter optimization.

We tried an experiment in which we ran beam thresholding with a tight threshold and then a loose threshold on the test sentences within our length bound. For this experiment only, we discarded those sentences that could not be parsed with the specified setting of the threshold, rather than retrying with looser thresholds.

    while Thresholds not in ThresholdsSet
        add Thresholds to ThresholdsSet
        (BaseE_T, BaseTime) := ParseAll(Thresholds)
        for each Threshold in Thresholds
            if BaseE_T < TargetE_T
                tighten Threshold
                (NewE_T, NewTime) := ParseAll(Thresholds)
                Ratio := (BaseTime − NewTime) / (NewE_T − BaseE_T)
            else
                loosen Threshold
                (NewE_T, NewTime) := ParseAll(Thresholds)
                Ratio := (BaseE_T − NewE_T) / (NewTime − BaseTime)
            restore Threshold
        change the Threshold with the best Ratio

Figure : Gradient Descent Multiple Threshold Search

For any given sentence, we would

generally expect that it would do better on most measures with a loose threshold than with a tight one, but of course this will not always be the case. For instance, there is a very good chance that the number of crossing brackets, or the precision and recall, for any particular sentence will not change at all when we move from a tight to a loose threshold. There is even some chance, for any particular sentence, that the number of crossing brackets, or the precision, or the recall will even get worse. We calculated, for each measure, how many sentences fell into each of the three categories: increased score, same score, or decreased score. Table gives the results. As can be seen, the inside score was by far the most nearly strictly increasing metric. Furthermore, as we will show in Figure , this metric also correlates well with precision and recall, although with less noise. Therefore, we should use the inside probability as our metric of performance; however, inside probabilities can become very close to zero, so instead we measure entropy, the negative logarithm of the inside probability.

We implemented a variation on a steepest descent search technique. We denote the entropy of the sentences after thresholding by E_T. Our search engine is given a target value of E_T to search for, and then tries to find the best combination of parameters that works at approximately this level of performance. At each point, it finds the threshold to change that gives the most bang for the buck. It then changes this parameter in the correct direction to move towards the target (and possibly overshoot it). A simplified version of the algorithm is given in Figure .
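For concreteness, here is a small Python rendering of the simplified search in the figure above. It is only a sketch: ParseAll (which parses the held-out sentences and returns total entropy and time for a given setting of the thresholds) and the multiplicative step direction are assumptions, and the full algorithm described below adds sanity checks and an annealing schedule.

    def threshold_search(thresholds, target_entropy, parse_all, step=2.0):
        """Greedy multiple-threshold search towards a target entropy level.

        thresholds: dict of threshold name -> value.
        parse_all:  function taking a thresholds dict, returning (entropy, time).
        step:       multiplicative factor for changing one threshold; which
                    direction counts as "tightening" depends on how the threshold
                    is defined (here larger values are assumed tighter).
        """
        seen = set()
        while tuple(sorted(thresholds.items())) not in seen:
            seen.add(tuple(sorted(thresholds.items())))
            base_entropy, base_time = parse_all(thresholds)
            best = None  # (ratio, name, trial_value)
            for name, value in thresholds.items():
                if base_entropy < target_entropy:
                    trial = value * step      # tighten: accept more entropy for speed
                else:
                    trial = value / step      # loosen: spend time to reduce entropy
                new_entropy, new_time = parse_all({**thresholds, name: trial})
                if base_entropy < target_entropy:
                    denom = new_entropy - base_entropy
                    ratio = (base_time - new_time) / denom if denom > 0 else float("-inf")
                else:
                    denom = new_time - base_time
                    ratio = (base_entropy - new_entropy) / denom if denom > 0 else float("-inf")
                if best is None or ratio > best[0]:
                    best = (ratio, name, trial)
            thresholds = {**thresholds, best[1]: best[2]}
        return thresholds

The search terminates when a previously visited threshold setting recurs, which is the "loop" referred to in the annealing schedule described below.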

Figure : Optimizing for Lower Entropy versus Optimizing for Faster Speed (two panels plotting entropy against time as thresholds 1 and 2 are varied, with the goal entropy marked; when optimizing for lower entropy, steeper is better, and when optimizing for faster speed, flatter is better)

Figure shows graphically how the algorithm works. There are two cases. In the first case, if we are currently above the goal entropy, then we loosen our thresholds, leading to slower speed and lower entropy. We then wish to get as much entropy reduction as possible per time increase; that is, we want the steepest slope possible. On the other hand, if we are trying to increase our entropy, we want as much time decrease as possible per entropy increase; that is, we want the flattest slope possible. Because of this difference, we need to compute different ratios depending on which side of the goal we are on.

There are several subtleties when thresholds are set very tightly. When we fail to parse a sentence because the thresholds are too tight, we retry the parse with lower thresholds. This can lead to conditions that are the opposite of what we expect: for instance, loosening thresholds may lead to faster parsing, because we don't need to parse the sentence, fail, and then retry with looser thresholds. The full algorithm contains additional checks that our thresholding change had the effect we expected, either increased time for decreased entropy or vice versa. If we get either a change in the wrong direction, or a change that makes everything worse, then we retry with the reverse change, hoping that that will have the intended effect. If we get a change that makes both time and entropy better, then we make that change regardless of the ratio.

Also, we need to do checks that the denominator when computing Ratio is not too small. If it is very small, then our estimate may be unreliable, and we do not consider changing this parameter. Finally, the actual algorithm we used also contained a simple annealing schedule, in which we slowly decreased the factor by which we changed thresholds. That is, we actually run the algorithm multiple times to termination, first changing thresholds by a relatively large factor; after a loop is reached at this factor, we lower the factor, and repeat until the factor becomes small. (For this algorithm, although not for most experiments, our measurement of time was the total number of productions searched, rather than CPU time; we wanted the greater accuracy of measuring productions.)

We note that this algorithm is fairly task-independent. It can be used for almost any statistical parsing formalism that uses thresholds, or even for speech recognition.

Comparison to Previous Work

Beam thresholding is a common approach. While we do not know of other systems that have used exactly our techniques, our techniques are certainly similar to those of others. For instance, Collins uses a form of beam thresholding that differs from ours only in that it does not use the prior probability of nonterminals as a factor, and Caraballo and Charniak use a version with the prior, but with other factors as well.

Much of the previous related work on thresholding is in the similar area of priority functions for agenda-based parsers. These parsers try to do best-first parsing, with some function akin to a thresholding function determining what is best. The best comparison of these functions is due to Caraballo and Charniak, who tried various prioritization methods. Several of their techniques are similar to our beam thresholding technique, and one of their techniques, not yet published (Caraballo and Charniak), would probably work better.

The only technique that Caraballo and Charniak give that took into account the scores of other nodes in the priority function, the prefix model, required O(n^5) time to compute, compared to our O(n^3) system. On the other hand, all nodes in the agenda parser were compared to all other nodes, so in some sense all the priority functions were global.

We note that agenda-based PCFG parsers in general require more than O(n^3) run time, because when better derivations are discovered, they may be forced to propagate improvements to productions that they have previously considered. For instance, if an agenda-based system first computes the probability for a production S → NP VP, and then later computes some better probability for the NP, it must update the probability for the S as well. This could propagate through much of the chart. To remedy this, Caraballo et al. only propagated probabilities that caused a large enough change (Caraballo and Charniak). Also, the question of when an agenda-based system should stop is a little-discussed issue, and a difficult one, since there is no obvious stopping criterion. Because of these issues, we chose not to implement an agenda-based system for comparison.

As mentioned earlier, Rayner and Carter describe a system that is the inspiration for global thresholding. Because of the limitation of their system to non-recursive grammars, and the other differences discussed in Section , global thresholding represents a significant improvement.

Collins uses two thresholding techniques. The first of these is essentially beam thresholding without a prior. In the second technique, there is a constant probability threshold: any nodes with a probability below this threshold are pruned. If the parse fails, parsing is restarted with the constant lowered. We attempted to duplicate this technique, but achieved only negligible performance improvements. Collins (personal communication) reports a speedup when this technique is combined with loose beam thresholding, compared to loose beam thresholding alone. Perhaps our lack of success is due to differences between our grammars, which are fairly different formalisms. When Collins began using a formalism somewhat closer to ours, he needed to change his beam thresholding to take into account the prior, so this hypothesis is not unlikely. Hwa (personal communication), using a model similar to PCFGs, Stochastic Lexicalized Tree Insertion Grammars, also was not able to obtain a speedup using this technique.

There is previous work in the speech recognition community on automatically optimizing some parameters (Schwartz et al. ). However, this previous work differed significantly from ours, both in the techniques used and in the parameters optimized. In particular, previous work focused on optimizing weights for various components, such as the language model component. In contrast, we optimize thresholding parameters. Previous techniques could not be used for, or easily adapted to, thresholding parameters.

Experiments

Data

All experiments were trained on a contiguous range of sections of the Penn Treebank, version II. A few experiments were tested, where noted, on an initial segment of the test section, restricted to short sentences; in one experiment we used a somewhat larger test set with a longer length limit, and in the remainder of our experiments we used the shorter sentences within the smaller set. Our parameter optimization algorithm always used a fixed initial set of short sentences from one section. We ran some experiments on more sentences, but there were three sentences in this larger test set that could not be parsed with beam thresholding, even with loose settings of the threshold; we therefore chose to report the smaller test set, since it is difficult to compare techniques that did not parse exactly the same sentences.

The Grammar

We needed several grammars for our experiments, so that we could test the multiple-pass parsing algorithm. The grammar rules and their associated probabilities were determined by reading them off of the training section of the treebank, in a manner very similar to that used by Charniak . The main grammar we chose was essentially of the following form:

    X → A X'[B C D E F]
    X'[A B C D E] → A X'[B C D E F]
    X → A
    X'[A B] → A B

(In Chapter we describe Probabilistic Feature Grammars. This grammar can be more simply described as a PFG with six features: the continuation feature and five child features. No smoothing was done, and some dependencies that could have been captured, such as the dependence between the left child's child features and the parent's child features, were ignored, in order to capture only the dependencies described here. Unary branches were handled as described in Section .)

Figure : Converting to Binary Branching — an example tree and its equivalent in the unary/binary branching grammar, in which intermediate nonterminals are subscripted with the next several symbols remaining on the right-hand side (e.g. X'[B C D E F]).

Figure : Converting to Terminal and Terminal-Prime Grammars — the same kind of conversion, first so that each nonterminal is labeled by the first terminal in its span (the terminal grammar), and then with primed and non-primed versions of each label (the terminal-prime grammar).

That is, productions were all unary or binary branching. There were never more than five subscripted symbols for any nonterminal, although there could be fewer than five if there were fewer than five symbols remaining on the right-hand side. Thus, our grammar was a kind of n-gram model on symbols in the grammar.

Figure shows an example of how we converted trees to the form of our grammar. We refer to this grammar as the n-gram grammar. The terminals of the grammar were the part-of-speech symbols in the treebank. Any experiments that do not mention which grammar we used were run with the n-gram grammar.
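The conversion can be sketched in a few lines of Python. This is illustrative only (the names and the exact subscript format are assumptions, and the real conversion also involves the prime/continuation marking discussed below), but it shows how an n-ary production is turned into the unary and binary productions above, with intermediate nonterminals recording at most five of the remaining right-hand-side symbols:

    def binarize(parent, children, max_subscript=5):
        """Convert one n-ary production into unary/binary productions.

        Returns a list of (lhs, rhs) pairs.  Intermediate nonterminals are named
        after the parent plus the next few symbols still to be generated.
        """
        def prime(rest):
            return parent + "'[" + " ".join(rest[:max_subscript]) + "]"

        if len(children) <= 2:
            return [(parent, tuple(children))]           # X -> A   or   X -> A B
        rules = [(parent, (children[0], prime(children[1:])))]
        rest = children[1:]
        while len(rest) > 2:
            rules.append((prime(rest), (rest[0], prime(rest[1:]))))
            rest = rest[1:]
        rules.append((prime(rest), tuple(rest)))          # final binary production
        return rules

    # binarize("S", list("ABCDEFGH")) yields, for example,
    # S -> A S'[B C D E F],  S'[B C D E F] -> B S'[C D E F G],  ...,  S'[G H] -> G H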

For a simple grammar, we wanted something that would be very fast. The fastest grammar we can think of we call the terminal grammar, because it has one nonterminal for each terminal symbol in the alphabet. The nonterminal symbol indicates the first terminal in its span. The parses are unary and binary branching in the same way that the n-gram grammar parses are. Figure shows how to convert a parse tree to the terminal grammar. Since there is only one nonterminal possible for each cell of the chart, parsing is quick for this grammar. For technical and practical reasons, we actually wanted a marginally more complicated grammar, which included the prime symbol of the n-gram grammar, indicating that a cell is part of the same constituent as its parent. Therefore, we doubled the size of the grammar so that there would be both primed and non-primed versions of each terminal; we call this the terminal-prime grammar, and also show how to convert to it in Figure . This grammar is the one we actually used as the first pass in our multiple-pass parsing experiments.

(Our parser is the PFG parser of Chapter . Many of the decisions that parser makes depend on the value of the continuation feature, and so the PFG parser cannot run without that feature. Furthermore, keeping information for each constituent about whether it was an internal, primed, constituent or not seemed like it would provide useful information to the first pass. The terminal-prime grammar can be more simply described as a PFG with two features: the continuation feature, which corresponds to the prime, and a feature for the first terminal. No smoothing was done.)

What we measured

The goal of a good thresholding algorithm is to trade off correctness for increased speed. We must thus measure both correctness and speed, and there are some subtleties to measuring each.

The traditional way of measuring correctness is with metrics such as precision and recall, which were described in Chapter . There are two problems with these measures. First, they are two numbers, neither useful without the other. Second, they are subject to considerable noise. In pilot experiments, we found that as we changed our thresholding values monotonically, precision and recall changed non-monotonically (see Figure ). We attribute this to the fact that we must choose a single parse from our parse forest, and as we tighten a thresholding parameter, we may threshold out either good or bad parses. Furthermore, rather than just changing precision or recall by a small amount, a single thresholded item may completely change the shape of the resulting tree. Thus, precision and recall are only smooth with very large sets of test data. However, because of the large number of experiments we wished to run, using a large set of test data was not feasible.

Thus, we looked for a surrogate measure, and decided to use the total inside probability of all parses, which, with no thresholding, is just the probability of the sentence given the model. If we denote the total inside probability with no thresholding by I, and the total inside probability with thresholding by I_T, then I_T / I is the probability that we did not threshold out the correct parse, given the model. Thus, maximizing I_T should maximize correctness. Since probabilities can become very small, we instead minimize entropies, the negative logarithm of the probabilities. Figure shows that with a large data set, entropy correlates well with precision and recall, and that with smaller sets it is much smoother. Entropy is smoother because it is a function of many more variables: in one experiment, there were far fewer constituents contributing to precision and recall measurements than the millions of productions potentially contributing to entropy. Thus, we choose entropy as our measure of correctness for most experiments. When we did measure precision and recall, we used the metric as defined by Collins .
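As a small illustration (with hypothetical function names), the surrogate can be computed directly from the two inside probabilities:

    import math

    def thresholding_loss(inside_unthresholded, inside_thresholded):
        """Correctness surrogate: I_T / I is the model probability that the correct
        parse survived thresholding; entropies (negative log probabilities) are
        reported because raw inside probabilities underflow for long sentences."""
        retained = inside_thresholded / inside_unthresholded
        return {
            "retained_probability": retained,
            "entropy": -math.log(inside_thresholded),
            "entropy_increase": -math.log(retained),
        }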

Figure : Smoothness for Precision and Recall versus Total Inside for Different Test Data Sizes (precision and recall versus time, and total entropy versus time, each for a smaller and a larger test set)

Figure : Productions versus Time (number of productions examined plotted against elapsed time over a sweep of the beam threshold)

The fact that entropy changes smoothly and monotonically is critical for the performance

of the multiple parameter optimization algorithm. Furthermore, we may have to run quite a few iterations of that algorithm to get convergence, so the fact that entropy is smooth for relatively small numbers of sentences is a large help. Thus, the discovery that entropy, or equivalently the log of the inside probability, is a good surrogate for precision and recall is non-trivial. The same kinds of observations could be extended to speech recognition, to optimize multiple thresholds; the typical modern speech system has quite a few thresholds, a topic for future research.

For some sentences, with too tight thresholding, the parser will fail to find any parse at all. We dealt with these cases by restarting the parser with all thresholds lowered by a constant factor, iterating this loosening until a parse could be found. This restarting is why, for some tight thresholds, the parser may be slower than with looser thresholds: the sentence has to be parsed twice, once with tight thresholds and once with loose ones.
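A sketch of this restarting loop, under the assumption that parse() returns None when thresholding prunes away every parse (the loosening factor here is a placeholder):

    def parse_with_retries(sentence, parse, thresholds, loosen=10.0, max_tries=5):
        """Retry parsing with all thresholds loosened until some parse survives."""
        for _ in range(max_tries):
            result = parse(sentence, thresholds)
            if result is not None:
                return result
            # lower every threshold by a constant factor and try again
            thresholds = {name: value / loosen for name, value in thresholds.items()}
        return None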

Next, we needed to choose a measure of time. There are two obvious measures: amount of work done by the parser, and elapsed time. If we measure the amount of work done by the parser in terms of the number of productions with non-zero probability examined by the parser, we have a fairly implementation-independent, machine-independent measure of speed. On the other hand, because we used many different thresholding algorithms, some with a fair amount of overhead, this measure seems inappropriate: multiple-pass parsing requires use of the outside algorithm, global thresholding uses its own dynamic programming algorithm, and even beam thresholding has some per-node overhead. Thus, we will give most measurements in terms of elapsed time, not including loading the grammar and other one-time overhead. We did want to verify that elapsed time was a reasonable measure, so we did a beam thresholding experiment to make sure that elapsed time and number of productions examined were well correlated, using our test sentences and an exponential sweep of the thresholding parameter. The results, shown in Figure , clearly indicate that time is a good proxy for productions examined.

Figure : Beam Thresholding with and without the Prior Probability, Two Different Scales (total entropy versus log time, with and without the prior; the left panel uses a linear entropy axis, the right panel a log(entropy − asymptote) axis)

Experiments in Beam Thresholding

Our first goal was to show, at least informally, that entropy is a good surrogate for precision and recall. We thus tried two experiments: one with a relatively large test set, and one with a relatively small test set. Presumably, the larger test set should be much less noisy and fairly indicative of performance. We graphed both precision and recall, and entropy, versus time, as we swept the thresholding parameter over a sequence of values. The results are in Figure . As can be seen, entropy is significantly smoother than precision and recall for both size test corpora. In Section we gave a more rigorous discussion of the monotonicity of entropy versus precision and recall, with the same conclusion.

Our second goal was to check that the prior probability is indeed helpful. We ran two experiments, one with the prior and one without. The results, shown in Figure , indicate that the prior is a critical component. This experiment was run on our test data.

Notice that as the time increases, the data tends to approach an asymptote, as shown in the left-hand graph of Figure . In order to make these small asymptotic changes more clear, we wished to expand the scale towards the asymptote. The right-hand graph was plotted with this expanded scale, based on log(entropy − asymptote), a slight variation on a normal log scale. We use this scale in all the remaining entropy graphs. A normal logarithmic scale is used for the time axis. The fact that the time axis is logarithmic is especially useful for determining how much more efficient one algorithm is than another at a given performance level: if one picks a performance level on the vertical axis, then the distance between the two curves at that level represents the ratio between their speeds. There is roughly an order of magnitude difference between using the prior and not using it at the graphed performance levels, with a slow trend towards smaller differences as the thresholds are loosened. Because of the large difference between using the prior and not using it, all other beam thresholding experiments included the prior.

Experiments in Global Thresholding

We tried an experiment comparing global thresholding to beam thresholding. Figure shows the results of this experiment and later experiments. In the best case, global thresholding works twice as well as beam thresholding, in the sense that achieving the same level of performance requires only half as much time, although smaller improvements were more typical.

We have found that in general global thresholding works better on simpler grammars. In the complicated grammars of Chapter , there were systematic, strong correlations between nodes, which violated the independence approximation used in global thresholding. This prevented us from using global thresholding with these grammars. In the future, we may modify global thresholding to model some of these correlations.

Experiments combining Global Thresholding and Beam Thresholding

While global thresholding works better than beam thresholding in general, each has its own strengths. Global thresholding can threshold across cells, but because of the approximations used, the thresholds must generally be looser. Beam thresholding can only threshold within a cell, but can do so fairly tightly. Combining the two offers the potential to get the advantages of both. We ran a series of experiments using the thresholding optimization algorithm of Section . Figure gives the results. The combination of beam and global thresholding together is clearly better than either alone, in some cases running noticeably faster than global thresholding alone while achieving the same performance level. The combination generally runs twice as fast as beam thresholding alone, and sometimes up to a factor of three faster.

Experiments in Multiple-Pass Parsing

Multiple-pass parsing improves even further on our experiments combining beam and global thresholding. In addition to multiple-pass parsing, we used both beam and global thresholding for both the first and second pass in these experiments. The first pass grammar was the very simple terminal-prime grammar, and the second pass grammar was the usual n-gram grammar.

Figure : Combining Beam and Global Search (total entropy versus time for beam thresholding, global thresholding, and the two combined)

Figure : Multiple-Pass Parsing vs Beam and Global vs Beam (precision and recall versus time for beam thresholding, global thresholding, beam and global combined, and multiple-pass parsing)

Our goal throughout this chapter has been to maximize precision and recall as quickly as possible. In general, however, we have measured entropy rather than precision and recall, because it correlates well with those measures but is much smoother. While this correlation has held for all of the previous thresholding algorithms we have tried, it turns out not to hold in some cases for multiple-pass parsing. In particular, in the experiments conducted here, our first and second pass grammars were very different from each other. For a given parse to be returned, it must be in the intersection of both grammars, and reasonably likely according to both. Since the first and second pass grammars capture different information, parses that are likely according to both are especially good. The entropy of a sentence measures its likelihood according to the second pass, but ignores the fact that the returned parse must also be likely according to the first pass. Thus, in these experiments, entropy does not correlate nearly as well with precision and recall. We therefore give precision and recall results in this section. We still optimized our thresholding parameters using the same held-out corpus, and minimizing entropy versus number of productions, as before.

We should note that when we used a first pass grammar that captured a strict subset of the information in the second pass grammar, we found that entropy is a very good measure of performance. As in our earlier experiments, it tends to be well correlated with precision and recall, but less subject to noise. It is only because of the grammar mismatch that we have changed the evaluation.

Figure shows precision and recall curves for single pass versus multiple pass experiments. As in the entropy curves, we can determine the performance ratio by looking across horizontally. For instance, the multi-pass recognizer reaches a given recall level in considerably less time than the best single pass algorithm requires to reach that level. Due to the noise resulting from precision and recall measurements, it is hard to quantify exactly the advantage from multiple-pass parsing, but it is generally substantial.

Future Work and Conclusion

Future Work

In this chapter, we only considered applying multiple-pass and global thresholding techniques to parsing probabilistic context-free grammars. However, just about any probabilistic grammar formalism for which inside and outside probabilities can be computed can benefit from these techniques. For instance, Probabilistic Link Grammars (Lafferty et al. ) could benefit from our algorithms. We have, however, had trouble using global thresholding with grammars that strongly violated the independence assumptions of global thresholding.

One especially interesting possibility is to apply multiple-pass techniques to formalisms that require greater than O(n^3) parsing time, such as Stochastic Bracketing Transduction Grammar (SBTG) (Wu ) and Stochastic Tree Adjoining Grammars (STAG) (Resnik ; Schabes ). SBTG is a context-free-like formalism designed for translation from one language to another; it uses a four-dimensional chart to index spans in both the source and target language simultaneously. It would be interesting to try speeding up an SBTG parser by running an O(n^3) first pass on the source language alone, and using this to prune parsing of the full SBTG.

The STAG formalism is a mildly context-sensitive formalism, requiring O(n^6) time to parse. Most STAG productions in practical grammars are actually context-free. The traditional way to speed up STAG parsing is to use the context-free subset of an STAG to form a Stochastic Tree Insertion Grammar (STIG) (Schabes and Waters ), an O(n^3) formalism, but this method has problems, because the STIG undergenerates, since it is missing some elementary trees. A different approach would be to use multiple-pass parsing. We could first find a context-free covering grammar for the STAG, use this as a first pass, and then use the full STAG for the second pass.

There is also future work that could be done on further improvements to thresholding algorithms. One potential improvement that should be tried is modifying beam thresholding with the prior, or global thresholding, so that each compares only nonterminals which are similar in some way. Both of these algorithms make certain approximations. The more similar two nonterminals are, the smaller the relative error from these approximations will be. Thus, comparing only similar nonterminals will allow tighter thresholding. Obviously, these modified algorithms should be combined with the original algorithms, using the multiple parameter search technique.

Conclusions

The grammars described here are fairly simple, presented for purposes of explication and to keep our experiments simple enough to replicate easily. In Chapter we use significantly more complicated grammars, Probabilistic Feature Grammars (PFGs). For some PFGs, the improvements from multiple-pass parsing are even more dramatic: single pass experiments are simply too slow to run at all.

We have also found the automatic thresholding parameter optimization algorithm to be very useful. Before writing the parameter optimization algorithm, we had developed a complicated PFG grammar and the multiple-pass parsing technique, and ran a series of experiments using hand-optimized parameters. We thereafter ran the optimization algorithm and reran the experiments, achieving a factor of two speedup with no performance loss. While we had not spent a great deal of time hand optimizing these parameters, we are very encouraged by the optimization algorithm's practical utility.

This chapter introduces four new techniques: beam thresholding with priors, global thresholding, multiple-pass parsing, and automatic search for the parameters of combined algorithms. Beam thresholding with priors can lead to almost an order of magnitude improvement over beam thresholding without priors. Global thresholding can be up to two times as efficient as the new beam thresholding technique, although the typical improvement is smaller. When global thresholding and beam thresholding are combined, they are usually two to three times as fast as beam thresholding alone. Multiple-pass parsing can lead to a further improvement with the grammars in this chapter. We expect the parameter optimization algorithm to be broadly useful.

Chapter 

Probabilistic Feature Grammars

This chapter introduces Probabilistic Feature Grammars (PFGs), a relatively simple and elegant formalism that is the first state-of-the-art formalism for which the inside and outside probabilities can be computed (Goodman ). Because we can compute the inside and outside probabilities, we can use the efficient thresholding algorithms of Chapter when parsing PFGs.

Introduction

Recently, many researchers have worked on statistical parsing techniques which try to capture additional context beyond that of simple probabilistic context-free grammars (PCFGs), including work by Magerman , Charniak , Collins , Black et al. , Eisele , and Brew . Each researcher has tried to capture the hierarchical nature of language, as typified by context-free grammars, and to then augment this with additional context sensitivity based on various features of the input. However, none of these works combines the most important benefits of all the others, and most lack a certain elegance. We have therefore tried to synthesize these works into a new formalism, probabilistic feature grammar (PFG). PFGs have several important properties. First, PFGs can condition on features beyond the nonterminal of each node, including features such as the head word or grammatical number of a constituent. Also, PFGs can be parsed using efficient polynomial-time dynamic programming algorithms, and learned quickly from a treebank. Finally, unlike most other formalisms, PFGs are potentially useful for language modeling, or as one part of an integrated statistical system (e.g. Miller et al. ), or for use with algorithms requiring outside probabilities. Empirical results are encouraging: our best parser is comparable to those of Magerman and Collins when run on the same data. When we run using part-of-speech (POS) tags alone as input, we perform significantly better than comparable parsers.

Figure : Example tree with features — S[singular, dies] dominates NP[singular, man] and VP[singular, dies]; the NP dominates DET[singular, the] and N[singular, man], and the VP dominates V[singular, dies].

Motivation

PFG can be regarded in several different ways: as a way to make history-based grammars (Magerman ) more context-free, and thus amenable to dynamic programming; as a way to generalize the work of Black et al. ; as a way to turn Collins' parser (Collins ) into a generative probabilistic language model; or as an extension of language-modeling techniques to stochastic grammars. The resulting formalism is relatively simple and elegant. In Section we will compare PFGs to each of the systems from which it derives, and show how it integrates their best properties.

Consider the following simple parse tree for the sentence "The man dies":

    (S (NP the man) (VP dies))

While this tree captures the simple fact that sentences are composed of noun phrases and verb phrases, it fails to capture other important restrictions. For instance, the NP and VP must have the same number, both singular or both plural. Also, a man is far more likely to die than spaghetti, and this constrains the head words of the corresponding phrases. This additional information can be captured in a parse tree that has been augmented with features, such as the category, number, and head word of each constituent, as is traditionally done in many feature-based formalisms such as HPSG, LFG, and others. Figure shows a parse tree that has been augmented with these features.

While a normal PCFG has productions such as

    S → NP VP

we will write these augmented productions as, for instance,

    S[singular, dies] → NP[singular, man] VP[singular, dies]

In a traditional probabilistic context-free grammar, we could augment the first tree with probabilities in a simple fashion. We estimate the probability of S → NP VP using a treebank to determine

    C(S → NP VP) / C(S),

the number of occurrences of S → NP VP divided by the number of occurrences of S. For a reasonably large treebank, probabilities estimated in this way would be reliable enough to be useful (Charniak ). On the other hand, it is not unlikely that we would never have seen any counts at all of

    C(S[singular, dies] → NP[singular, man] VP[singular, dies]) / C(S[singular, dies]),

which is the estimated probability of the corresponding production in our grammar augmented with features.

The introduction of features for number and head word has created a data sparsity problem. Fortunately, the data-sparsity problem is well known in the language-modeling community, and we can use their techniques, n-gram models and smoothing, to help us. Consider the probability of a five word sentence, w_1 w_2 w_3 w_4 w_5:

    P(w_1 w_2 w_3 w_4 w_5) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) P(w_4 | w_1 w_2 w_3) P(w_5 | w_1 w_2 w_3 w_4)

While exactly computing P(w_5 | w_1 w_2 w_3 w_4) is difficult, a good approximation is

    P(w_5 | w_1 w_2 w_3 w_4) ≈ P(w_5 | w_3 w_4)

Let C(w_3 w_4 w_5) represent the number of occurrences of the sequence w_3 w_4 w_5 in a corpus. We can then empirically approximate P(w_5 | w_3 w_4) with C(w_3 w_4 w_5) / C(w_3 w_4). However, this approximation alone is not enough: there may still be many three word combinations that do not occur in the corpus, but that should not be assigned zero probabilities. So we smooth this approximation, for instance by using

    P(w_5 | w_3 w_4) ≈ λ_3 C(w_3 w_4 w_5) / C(w_3 w_4) + λ_2 C(w_4 w_5) / C(w_4) + λ_1 C(w_5) / Σ_w C(w)

where the λ_i are numbers between 0 and 1 that determine the amount of smoothing.
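A minimal Python sketch of this interpolated estimate (the counts and the fixed λ values are illustrative; real language models tune the λ's on held-out data):

    from collections import Counter

    def ngram_counts(corpus):
        """Unigram, bigram, and trigram counts from a list of token lists."""
        uni, bi, tri = Counter(), Counter(), Counter()
        for sentence in corpus:
            for i, w in enumerate(sentence):
                uni[w] += 1
                if i >= 1:
                    bi[(sentence[i - 1], w)] += 1
                if i >= 2:
                    tri[(sentence[i - 2], sentence[i - 1], w)] += 1
        return uni, bi, tri

    def p_smoothed(w3, w4, w5, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
        """Interpolated estimate of P(w5 | w3 w4), as in the formula above."""
        l1, l2, l3 = lambdas
        total = sum(uni.values())
        p1 = uni[w5] / total if total else 0.0
        p2 = bi[(w4, w5)] / uni[w4] if uni[w4] else 0.0
        p3 = tri[(w3, w4, w5)] / bi[(w3, w4)] if bi[(w3, w4)] else 0.0
        return l3 * p3 + l2 * p2 + l1 * p1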

Now we can use these same approximations in PFGs. Let us assume that our PFG is binary branching and has g features, numbered 1 through g; we will call the parent features a_i, the left child features b_i, and the right child features c_i. In our earlier example, a_1 represented the parent nonterminal category, a_2 represented the parent number (singular or plural), a_3 represented the parent head word, b_1 represented the left child category, etc. We can write a PFG production as a_1 ... a_g → b_1 ... b_g  c_1 ... c_g. If we think of the set of features for a constituent A as being the random variables A_1, ..., A_g, then the probability of a production is the conditional probability

    P(B_1 = b_1, ..., B_g = b_g, C_1 = c_1, ..., C_g = c_g | A_1 = a_1, ..., A_g = a_g)

We write a_i as shorthand for A_i = a_i, and a_i^k to represent A_i = a_i, ..., A_k = a_k. We can then write this conditional probability as

    P(b_1^g, c_1^g | a_1^g)

This joint probability can be factored as the product of a set of conditional probabilities in many ways. One simple way is to arbitrarily order the features as b_1, ..., b_g, c_1, ..., c_g. We then condition each feature on the parent features and all features earlier in the sequence:

    P(b_1^g, c_1^g | a_1^g) = P(b_1 | a_1^g) P(b_2 | a_1^g, b_1) P(b_3 | a_1^g, b_1^2) ⋯ P(c_g | a_1^g, b_1^g, c_1^{g−1})

We can now approximate the various terms in the factorization by making independence assumptions. For instance, returning to the concrete example above, consider feature c_1, the right child nonterminal or terminal category. The following approximation should work fairly well in practice:

    P(c_1 | a_1^g, b_1^g) ≈ P(c_1 | a_1, b_1)

That is, the category of the right child is well determined by the category of the parent and the category of the left child. Just as n-gram models approximate conditional lexical probabilities by assuming independence of words that are sufficiently distant, here we approximate conditional feature probabilities by assuming independence of features that are sufficiently unrelated. Furthermore, we can use the same kinds of backing-off techniques that are used in smoothing traditional language models, to allow us to condition on relatively large contexts. In practice, a grammarian determines the order of the features in the factorization, and then, for each feature, which features it depends on and the optimal order of backoff, possibly using experiments on development test data for feedback. It might be possible to determine the factorization order, the best independence assumptions, and the optimal order of backoff automatically, a subject of future research.
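The factorization and its independence assumptions translate directly into code. The sketch below is a hypothetical rendering, not the thesis's implementation: each generated child feature has a grammarian-chosen list of conditioning features and a smoothed estimator, and the production probability is the product of the per-feature conditionals.

    def production_probability(parent, left, right, feature_models):
        """P(b_1..b_g, c_1..c_g | a_1..a_g) as a product of per-feature conditionals.

        parent, left, right: tuples of g feature values (the a, b, c of the text).
        feature_models: dict mapping a generated feature, e.g. ('b', 1) or ('c', 3),
            to (conditioning_keys, estimate), where conditioning_keys lists feature
            names such as [('a', 1), ('b', 1)] and estimate(value, context) returns
            a smoothed conditional probability.
        """
        known = {("a", i + 1): v for i, v in enumerate(parent)}
        prob = 1.0
        for side, values in (("b", left), ("c", right)):
            for i, value in enumerate(values):
                name = (side, i + 1)
                conditioning_keys, estimate = feature_models[name]
                context = tuple(known[k] for k in conditioning_keys)
                prob *= estimate(value, context)    # e.g. P(c_1 | a_1, b_1)
                known[name] = value                 # available to later features
        return prob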

Intuitively, in a PFG, features are produced one at a time. This order corresponds to the order of the factorization. The probability of a feature being produced depends on a subset of the features in a local context of that feature. Figure shows an example of this feature-at-a-time generation for the noun phrase "the man". In this example, the grammarian picked the simple feature ordering b_1, b_2, b_3, c_1, c_2, c_3. To the right of the figure, the independence assumptions made by the grammarian are shown.

Formalism

In a PCFG, the important concepts are the terminals and nonterminals, the productions involving these, and the corresponding probabilities. In a PFG, a vector of features corresponds to the terminals and nonterminals. PCFG productions correspond to PFG events of the form a_1 ... a_g → b_1 ... b_g  c_1 ... c_g, and our PFG rule probabilities correspond to products of conditional probabilities, one for each feature that needs to be generated.

Figure : Producing "the man" one feature at a time. Starting from NP[singular, man], each child feature is generated in turn (the left child's category, number, and word, then the right child's), and next to each step the grammarian's independence assumption is shown; for example, P(B_1 = DET | A_1 = NP, A_2 = singular, A_3 = man) is approximated by P(B_1 = DET | A_1 = NP, A_2 = singular).

Events and EventProbs

There are two kinds of PFG events of immediate interest. The first is a binary event, in which a feature set a_1^g (the parent features) generates features b_1^g, c_1^g (the child features). Figure is an example of a single such event. Binary events generate the whole tree except the start node, which is generated by a start event.

The probability of an event is given by an EventProb, which generates each new feature in turn, assigning it a conditional probability given all the known features. For instance, in a binary event, the EventProb assigns probabilities to each of the child features, given the parent features and any child features that have already been generated.

Formally, an EventProb E is a tuple ⟨K, N, F⟩, where K is the set of conditioning features (the Known features), N = N_1, N_2, ..., N_n is an ordered list of conditioned features (the New features), and F = f_1, f_2, ..., f_n is a parallel list of functions. Each function f_i(n_i, k_1, ..., k_|K|, n_1, ..., n_{i−1}) returns

    P(N_i = n_i | K_1 = k_1, ..., K_|K| = k_|K|, N_1 = n_1, ..., N_{i−1} = n_{i−1}),

the probability that feature N_i = n_i, given all the known features and all the lower-indexed new features.

For a binary event, we may have E_B = ⟨{a_1, ..., a_g}, ⟨b_1, ..., b_g, c_1, ..., c_g⟩, F_B⟩; that is, the child features are conditioned on the parent features and earlier child features. For a start event, we have E_S = ⟨{}, ⟨a_1, ..., a_g⟩, F_S⟩, i.e., the parent features are conditioned only on each other, in sequence.
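A minimal sketch of an EventProb as a data structure (illustrative names; the per-feature functions f_i would in practice be smoothed, backed-off estimators):

    class EventProb:
        """<K, N, F>: known feature names, ordered new feature names, and one
        conditional probability function per new feature."""

        def __init__(self, known_names, new_names, functions):
            self.known_names = known_names    # e.g. ('a1', ..., 'ag') for E_B
            self.new_names = new_names        # e.g. ('b1', ..., 'bg', 'c1', ..., 'cg')
            self.functions = functions        # f_i(n_i, known_values, earlier_new) -> prob

        def probability(self, known_values, new_values):
            """Product of the per-feature conditionals, in generation order."""
            p = 1.0
            for i, f in enumerate(self.functions):
                p *= f(new_values[i], known_values, new_values[:i])
                if p == 0.0:
                    break
            return p

A start event is then just an EventProb with an empty set of known features, and a binary event one whose known features are the parent's.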

Terminal Function, Binary PFG, Alternating PFG

We need one last element: a function T from a set of g features to {T, N}, which tells us whether a part of an event is terminal or nonterminal (the terminal function). A Binary PFG is then a quadruple ⟨g, E_B, E_S, T⟩: a number of features, a binary EventProb, a start EventProb, and a terminal function.

Of course, using binary events allows us to model n-ary branching grammars for any fixed n: we simply add additional features for terminals to be generated in the future, as well as a feature for whether or not this intermediate node is a dummy node (the continuation feature). We demonstrate how to do this in detail in Section .

On the other hand, it does not allow us to handle unary branching productions. In general, probabilistic grammars that allow an unbounded number of unary branches are very difficult to deal with (Stolcke ). There are a number of ways we could have handled unary branches. The one we chose was to enforce an alternation between unary and binary branches, marking most unary branches as dummies with the continuation feature, and removing them before printing the output of the parser.

To handle these unary branches, we add one more EventProb, E_U. Thus, an Alternating PFG is a quintuple ⟨g, E_B, E_S, E_U, T⟩.

It is important that we allow only a limited number of unary branches. As we discussed in Chapter , unlimited unary branches lead to infinite sums. For conventional PCFG-style grammar formalisms, these infinite sums can be computed using matrix inversion, which is still fairly time-consuming. For a formalism such as ours, or a similar formalism, the effective number of nonterminals needed in the matrix inversion is potentially huge, making such computations impractical. Thus, instead, we simply limit the number of unary branches, meaning that the sum is finite for computing both the inside and the outside values. Two competing formalisms, those of Collins and Charniak , allow unlimited unary branches, but because of this can only compute Viterbi probabilities, not inside and outside probabilities.

Comparison to Previous Work

PFG bears much in common with previous work, but in each case has at least some advantages over previous formalisms.

Some other models (Charniak ; Brew ; Collins ; Black et al. b) use probability approximations that do not sum to one, meaning that they should not be used either for language modeling (e.g. in a speech recognition system) or as part of an integrated model such as that of Miller et al. . Some models (Magerman ; Collins ) assign probabilities to parse trees conditioned on the strings, so that an unlikely sentence with a single parse might get probability one, making these systems unusable for language modeling. PFGs use joint probabilities, so they can be used both for language modeling and as part of an integrated model.

Furthermore, unlike all but one of the comparable systems (Black et al. a), PFGs can compute outside probabilities, which are useful for grammar induction, as well as for the parsing algorithms of Chapter and the thresholding algorithms of Chapter .

Bigram Lexical Dependency Parsing Collins introduced a parser with extremely

go o d p erformance From this parser we take many of the particular conditioning features

that we will use in PFGs As noted this mo del cannot b e used for language mo deling

There are also some inelegancies in the need for a separate mo del for BaseNPs and the

treatment of punctuation as inherently dierent from words The mo del also contains a

nonstatistical rule ab out the placement of commas Finally Collins mo del uses memory

prop ortional to the sum of the squares of each training sentences length PFGs in general

use memory that is only linear

Generative Lexicalized Parsing Collins worked indep endently from us to con

struct a mo del that is very similar to ours In particular Collins wished to adapt his

previous parser Collins to a generative mo del In this he succeeded However while

we present a fairly simple and elegant formalism which captures all information as features

Collins uses a variety of dierent techniques First he uses variables which are analogous

to our features Next b oth our mo dels need a way to determine when to stop generating

child no des like everything else we enco de this in a feature but Collins creates a sp ecial

STOP nonterminal For some information Collins mo dies the names of nonterminals

rather than enco ding the information as additional features Finally all information in our

mo del is generated topdown In Collins mo del most information is generated topdown

but distance information is propagated b ottomup Thus while PFGs enco de all informa

tion as topdown features Collins mo del uses several dierent techniques This lack of

homogeneity fails to show the underlying structure of the mo del and the ways it could b e

expanded While Collins mo del could not b e enco ded exactly as a PFG a PFG that was

extremely similar could b e created

Furthermore our mo del of generation is very general While our implementation cap

tures head words through the particular choice of features Collins mo del explicitly gener

ates rst the head phrase then the right children and nally the left children Thus our

mo del can b e used to capture a wider variety of grammatical theories simply by changing

the choice of features

Simple PCFGs Charniak showed that a simple PCFG formalism in which the

rules are simply read o of a treebank can p erform very comp etitively Furthermore he

showed that a simple mo dication in which pro ductions at the right side of the sentence

have their probability b o osted to encourage right branching structures can improve p er

formance even further PFGs are a sup erset of PCFGs so we can easily mo del the basic

PCFG grammar used by Charniak although the b o osting cannot b e exactly duplicated

However we can use more principled techniques such as a feature that captures whether

a particular constituent is at the end of the sentence and a feature for the length of the

constituent Charniaks b o osting strategy means that the scores of constituents are no

longer probabilities meaning that they cannot b e used with the insideoutside algorithm

Furthermore the PFG featurebased technique is not extragrammatical meaning that no

additional machinery needs to b e added for parsing or grammar induction

PCFG with Word Statistics Charniak uses a grammar formalism which is in

many ways similar to the PFG mo del with several minor dierences and one imp ortant

one The main dierence is that while we binarize trees and enco de rules as features ab out

which nonterminal should b e generated next Charniak explicitly uses rules in the style of

traditional PCFG parsing in combination with other features This dierence is discussed

in more detail in Section

Stochastic HPSG Brew introduced a sto chastic version of HPSG In his for

malism in some cases even if two features have b een constrained to the same value by

unication the probabilities of their pro ductions are assumed indep endent The resulting

probability distribution is then normalized so that probabilities sum to one This leads to

problems with grammar induction p ointed out by Abney Our formalism in con

trast explicitly mo dels dep endencies to the extent p ossible given data sparsity constraints

IBM Language Modeling Group Researchers in the IBM Language Mo deling Group

developed a series of successively more complicated mo dels to integrate statistics with fea

tures

The rst mo del Black et al Black et al b essentially tries to convert a

unication grammar to a PCFG by instantiating the values of the features Because of

data sparsity however not all features can b e instantiated Instead they create a grammar

where many features have b een instantiated and many have not they call these partially

instantiated features sets mnemonics They then create a PCFG using the mnemonics as

terminals and nonterminals Features instantiated in a particular mnemonic are generated

probabilistically while the rest are generated through unication Because no smo othing

is done and b ecause features are group ed data sparsity limits the number of features that

can b e generated probabilistically whereas b ecause we generate features one at a time and

smo oth we are far less limited in the number of features we can use Their technique of

generating some features probabilistically and the rest by unication is somewhat inelegant

also for the probabilities to sum to one it requires an additional step of normalization

which they app ear not to have implemented

In their next mo del Black et al a which strongly inuenced our mo del ve

attributes are asso ciated with each nonterminal a syntactic category a semantic category

a rule and two lexical heads The rules in this grammar are the same as the mnemonic rules

used in the previous work developed by a grammarian These ve attributes are generated

one at a time with backo smo othing conditioned on the parent attributes and earlier

attributes Our generation mo del is essentially the same as this Notice that in this mo del

unlike ours there are two kinds of features those features captured in the mnemonics

and the ve categories the categories and mnemonic features are mo deled very dierently

Also notice that a great deal of work is required by a grammarian to develop the rules and

mnemonics

The third mo del Magerman extends the second mo del to capture more de

p endencies and to remove the use of a grammarian Each decision in this mo del can in

principal dep end on any previous decision and on any word in the sentence Because of

these p otentially unbounded dep endencies there is no dynamic programming algorithm

without pruning the time complexity of the mo del is exp onential One motivation for PFG

was to capture similar information to this third mo del while allowing dynamic program

ming This third mo del uses a more complicated probability mo del all probabilities are

determined using decision trees it is an area for future research to determine whether we

can improve our p erformance by using decision trees

Probabilistic LR Parsing with Unification Grammars. Briscoe and Carroll describe a formalism (Briscoe and Carroll; Carroll and Briscoe) similar in many ways to the first IBM model. In particular, a context-free covering grammar of a unification grammar is constructed. Some features are captured by the covering grammar, while others are modeled only through unifications. Only simple plus-one-style smoothing is done, so data sparsity is still significant. The most important difference between the work of Briscoe and Carroll and that of Black et al. is that Briscoe et al. associate probabilities with the augmented transition matrix of an LR parse table; this gives them more context sensitivity than Black et al. However, the basic problems of the two approaches are the same: data sparsity, difficulty normalizing probabilities, and lack of elegance due to the union of two very different approaches.

Parsing

The parsing algorithm we use is a simple variation on probabilistic versions of the CKY algorithm for PCFGs (Baker; Lari and Young), using feature vectors instead of nonterminals. The parser computes inside probabilities (the sum of the probabilities of all parses, i.e., the probability of the sentence), Viterbi probabilities (the probability of the best parse), and, optionally, outside probabilities. The figure below gives the inside algorithm for PFGs. Notice that the algorithm requires time O(n^3) in the sentence length, but is potentially exponential in the number of features, since there is one loop for each parent feature, a^1 through a^g.

for each length l, shortest to longest
    for each start s
        for each split length t
            for each b = b^1 ... b^g such that chart(s, s+t, b) is non-zero
                for each c = c^1 ... c^g such that chart(s+t, s+l, c) is non-zero
                    for each a^1 consistent with b, c
                        ...
                        for each a^g consistent with b, c, a^1 ... a^(g-1)
                            chart(s, s+l, a) += E_B(a -> b c) x chart(s, s+t, b) x chart(s+t, s+l, c)
return sum over a of E_S(a) x chart(1, n+1, a)

Figure: PFG Inside Algorithm
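To make the control flow of the figure concrete, here is a minimal Python sketch of the same inside computation (binary events only, ignoring unary events); the chart layout and the helper functions binary_event_prob, start_event_prob, consistent_parents, and lexical_feature_sets are illustrative assumptions, not the implementation used in the thesis.

```python
from collections import defaultdict
from itertools import product

def inside(words, binary_event_prob, start_event_prob, consistent_parents, lexical_feature_sets):
    """CKY-style inside algorithm over feature vectors (hypothetical interfaces).

    chart[(s, e, a)] holds the inside probability that feature set a spans words s..e-1.
    """
    n = len(words)
    chart = defaultdict(float)

    # Initialize length-1 spans from the lexical feature sets for each word (assumed helper).
    for s, w in enumerate(words):
        for a, p in lexical_feature_sets(w):
            chart[(s, s + 1, a)] += p

    for length in range(2, n + 1):                 # shortest to longest
        for s in range(0, n - length + 1):
            for t in range(1, length):             # split point
                # A real implementation would index the chart by span instead of scanning it.
                left = [(b, p) for (i, j, b), p in chart.items() if i == s and j == s + t]
                right = [(c, p) for (i, j, c), p in chart.items() if i == s + t and j == s + length]
                for (b, pb), (c, pc) in product(left, right):
                    # Enumerate only parent feature sets consistent with the children,
                    # one feature at a time, rather than all possible feature vectors.
                    for a in consistent_parents(b, c):
                        chart[(s, s + length, a)] += binary_event_prob(a, b, c) * pb * pc

    # Sum over possible root feature sets, weighted by the start-event probability.
    return sum(start_event_prob(a) * p for (i, j, a), p in chart.items() if i == 0 and j == n)
```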

When parsing a PCFG, it is a simple matter to find, for every right and left child, what the possible parents are. For a PFG, on the other hand, there are some subtleties. We must loop over every possible value of each feature. At first this sounds overwhelming, since it requires guessing a huge number of feature sets, leading to a run time exponential in the number of features. In practice, most values of most features have zero probability, and we can avoid considering these; only a small number of values will be consistent with previously determined features. For instance, features such as the length of a constituent take a single value per cell. Many other features take on very few values given the children. For example, we arrange our parse trees so that the head word of each constituent is dominated by one of its two children; this means that we need consider only two values for this feature for each pair of children. The single most time-consuming feature is the Name feature, which corresponds to the terminals and nonterminals of a PCFG. For efficiency, we keep a list of the parent/left-child/right-child name triples that have non-zero probabilities, allowing us to hypothesize only the possible values for this feature given the children. Careful choice of features helps keep parse times reasonable.

Pruning

We use two pruning methods to speed parsing. The first is beam thresholding with the prior, as described in Section. Within each cell in the parse chart, we multiply each entry's inside probability by the prior probability of the parent features of that entry, using a special EventProb, E_P. We then remove those entries whose combined probability is too much lower than that of the best entry in the cell.
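As a rough illustration, a beam-thresholding step over a single chart cell might look like the following sketch; the prior_prob function and the beam width are placeholders rather than values from the thesis.

```python
def prune_cell(cell, prior_prob, beam=1e-4):
    """Keep only entries whose inside probability times the prior probability of
    their parent features is within a factor `beam` of the best such score.

    cell: dict mapping feature-set -> inside probability (hypothetical layout).
    """
    if not cell:
        return cell
    scored = {a: p * prior_prob(a) for a, p in cell.items()}
    best = max(scored.values())
    return {a: cell[a] for a, s in scored.items() if s >= beam * best}
```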

The other technique we use is multiple-pass parsing, described in Section. Recall that in multiple-pass parsing we use a simple, fast grammar for the first pass, which approximates the later pass; we then remove any events whose combined inside-outside product is too low, essentially those events that are unlikely given the complete sentence. The technique is particularly natural for PFGs, since the first-pass grammar can simply be a PFG whose features approximate those of the later pass. The features we used in our first pass were Name, Continuation, and two new features especially suitable for a fast first pass: the length of the constituent and the terminal symbol following the constituent. Since these two features are uniquely determined by the chart cell of the constituent, they work particularly well in a first pass, providing useful information without increasing the number of elements in the chart. However, when used in our second pass, these features did not help performance, presumably because they captured information similar to that captured by other features. Multiple-pass techniques have dramatically sped up PFG parsing.
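A minimal sketch of the pruning step between passes, under the assumption that the first pass exposes inside and outside scores per labeled span, might look like this; the chart layout and threshold are illustrative.

```python
def multipass_prune(first_pass_chart, sentence_prob, threshold=1e-5):
    """Keep only constituents whose normalized inside-outside product from the
    first, coarse pass is above a threshold; later passes consider only these.

    first_pass_chart: dict mapping (start, end, label) -> (inside, outside) pairs.
    """
    keep = set()
    for span_label, (inside_p, outside_p) in first_pass_chart.items():
        # inside * outside / P(sentence) approximates P(constituent correct | sentence).
        if inside_p * outside_p / sentence_prob >= threshold:
            keep.add(span_label)
    return keep
```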

Experimental Results

The PFG formalism is an extremely general one, with the capability to model a wide variety of phenomena, and there are very many possible sets of features that could be used in a given implementation. We will show, with an example set of features, that the formalism can be used to achieve a high level of accuracy.

Features

In this section we describe the actual features used by our parser. The most interesting and most complicated features are those used to encode the rules of the grammar: the child features. We will first show how to encode a PCFG as a PFG. The PCFG has some maximum-length right-hand side, say five symbols. We would then create a PFG with six features. The first feature would be N, the nonterminal symbol. Since PFGs are binarized while PCFGs are n-ary branching, we need a feature that allows us to recover the n-ary branching structure from a binary branching tree. This feature, which we call the continuation feature C, will be 1 for constituents that are the top of a rule and 0 for constituents that are used internally. The next features describe the future children of the nonterminal. We will use the symbol − to denote an empty child. To encode a PCFG rule such as A → B C D E F, we would assign the following probabilities to the feature sets:

E_B(⟨A 1 B C D E F⟩ → ⟨B 1 X⋯X⟩ ⟨A 0 C D E F −⟩) = P(B → X⋯X)
E_B(⟨A 0 C D E F −⟩ → ⟨C 1 X⋯X⟩ ⟨A 0 D E F − −⟩) = P(C → X⋯X)
E_B(⟨A 0 D E F − −⟩ → ⟨D 1 X⋯X⟩ ⟨A 0 E F − − −⟩) = P(D → X⋯X)
E_B(⟨A 0 E F − − −⟩ → ⟨E 1 X⋯X⟩ ⟨F 1 Y⋯Y⟩) = P(E → X⋯X) × P(F → Y⋯Y)

It should be clear that we can define probabilities of individual features in such a way as to get the desired probabilities for the events. We also need to define a distribution over the start symbols:

E_S(⟨S 1 A B C D E⟩ → ⟨A 1 X⋯X⟩ ⟨S 0 B C D E −⟩) = P(S → A B C D E) × P(A → X⋯X)

A quick inspection will show that this assignment of probabilities to events leads to the same probabilities being assigned to analogous trees in the PFG as in the original PCFG.
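To make this encoding concrete, the hypothetical sketch below expands an n-ary PCFG rule into the corresponding chain of binary PFG events; the tuple layout (name, continuation, future children) and the helper names are assumptions for illustration, not the thesis's data structures.

```python
def encode_rule(parent, children, max_children=5, empty="-"):
    """Turn a PCFG rule parent -> children into a chain of binary PFG events.

    Each feature set is a tuple (name, continuation, child_1, ..., child_max),
    where continuation is 1 at the top of a rule and 0 for internal nodes.
    Returns a list of (parent_set, left_child_set, right_child_set) triples.
    """
    def feature_set(name, cont, future):
        padded = list(future) + [empty] * (max_children - len(future))
        return (name, cont, *padded)

    events = []
    remaining = list(children)
    cont = 1                                    # top of the rule
    while len(remaining) > 2:
        head, *rest = remaining
        parent_set = feature_set(parent, cont, remaining)
        left = feature_set(head, 1, [])         # peeled-off child is the top of its own rule
        right = feature_set(parent, 0, rest)    # dummy node carrying the rest of the rule
        events.append((parent_set, left, right))
        remaining, cont = rest, 0
    # Final event generates the last two children directly.
    events.append((feature_set(parent, cont, remaining),
                   feature_set(remaining[0], 1, []),
                   feature_set(remaining[1], 1, [])))
    return events

# Example: the rule A -> B C D E F from the text.
for ev in encode_rule("A", ["B", "C", "D", "E", "F"]):
    print(ev)
```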

N (Name): Corresponds to the terminals and nonterminals of a PCFG.
C (Continuation): Tells whether we are generating modifiers to the right or to the left, and whether it is time to generate the head node.
Child 1: Name of the first child to be generated.
Child 2: Name of the second child to be generated. In combination with Child 1, this allows us to simulate a second-order Markov process on nonterminal sequences.
H (Head name): Name of the head category.
P (Head pos): Part of speech of the head word.
W (Head word): Actual head word. Not used in the POS-only model.
D^L (left): Count of punctuation, verbs, and words to the left of the head.
D^R (right): Counts to the right of the head.
D^B (between): Counts between the parent's and child's heads.

Table: Features Used in Experiments

Now imagine that, rather than using the raw probability estimates from the PCFG, we were to smooth the probabilities in the same way we smooth all of our feature probabilities. This would lead to a kind of smoothed n-gram model on right-hand sides. Presumably there is some assignment of smoothing parameters that leads to performance at least as good as, if not better than, the unsmoothed probabilities. Furthermore, what if we had many other features in our PFG, such as head words and distance features? With more features, we might have too much data sparsity to make it really worthwhile to keep all five of the children as features. Instead, we could keep, say, the first two children as features. After each child is generated as a left nonterminal, we could generate the next child (the third, fourth, fifth, and so on) as the second-child feature of the right child. Our new grammar probabilities might look like:

E_B(⟨A 1 B C⟩ → ⟨B 1 X⋯X⟩ ⟨A 0 C D⟩) = P(B → X⋯X)
E_B(⟨A 0 C D⟩ → ⟨C 1 X⋯X⟩ ⟨A 0 D E⟩) = P(C → X⋯X)
E_B(⟨A 0 D E⟩ → ⟨D 1 X⋯X⟩ ⟨A 0 E F⟩) = P(D → X⋯X)
E_B(⟨A 0 E F⟩ → ⟨E 1 X⋯X⟩ ⟨F 1 Y⋯Y⟩) = P(E → X⋯X) × P(F → Y⋯Y)

where the P(· → X⋯X) values here only approximate the corresponding probabilities of the original PCFG. This assignment will produce a good approximation to the original PCFG, using a kind of trigram model on right-hand sides, with fewer parameters and fewer features than the exact encoding. Fewer child features will be very helpful when we add other features to the model. We will show in Section that, with the set of features we use, two children is optimal.

We can now describe the features actually used in our experiments. We used two PFGs: one that used the head word feature, and an otherwise identical grammar with no word-based features, only POS tags. The grammars used the features shown in the table above; a sample parse tree with these features is given in the figure below.

Recall that in Section we mentioned that, in order to get good performance, we made our parse trees binary branching in such a way that every constituent dominates its head word. To achieve this, rather than generating children strictly left to right, we first generate all children to the left of the head word, left to right; then we generate all children to the right of the head word, right to left; and finally we generate the child containing the head word. It is because of this that the root node of the example tree in the figure below has NP as its first child and VP as its second, rather than the other way around.
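For concreteness, a rough sketch of this head-outward binarization (left-of-head children first, then right-of-head children in reverse, with the head-dominating child generated last) might look like the following; the tree representation and the primed dummy labels are assumptions for illustration.

```python
def binarize_head_outward(label, children, head_index):
    """Binarize an n-ary node so that every intermediate (dummy) node dominates
    the head child: peel left-of-head children left to right, then right-of-head
    children right to left; the head-dominating child is generated last.
    Trees are (label, left, right) or (label, child) tuples; ' marks dummy nodes."""
    lefts = list(children[:head_index])          # peeled left to right
    rights = list(children[head_index + 1:])     # peeled right to left (outermost first)
    head = children[head_index]

    def build(lefts, rights, node_label):
        if lefts:
            return (node_label, lefts[0], build(lefts[1:], rights, label + "'"))
        if rights:
            return (node_label, build(lefts, rights[:-1], label + "'"), rights[-1])
        # Only the head-dominating child remains; dummy nodes collapse onto it.
        return head if node_label == label + "'" else (node_label, head)

    return build(lefts, rights, label)

# Example: S -> NP VP with head VP gives (S, NP, VP);
# NP -> DET ADJ N with head N gives (NP, DET, (NP', ADJ, N)).
print(binarize_head_outward("S", ["NP", "VP"], 1))
print(binarize_head_outward("NP", ["DET", "ADJ", "N"], 2))
```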

The feature D^L is a tuple indicating the number of punctuation characters, verbs, and words to the left of a constituent's head word. To avoid data sparsity, we do not count higher than a small cap on punctuation characters, verbs, and words; so, for example, one value of D^L indicates that a constituent has no punctuation, at least one verb, and at least several words to the left of its head word. D^R is a similar feature for the right side. Finally, D^B gives the counts between the constituent's head word and the head word of its other child. Words, verbs, and punctuation are counted as being to the left of themselves; that is, a terminal verb has one verb and one word on its left.

Notice that the continuation of the lower NP in the figure is of the R type, indicating that it inherits its child from the right, with an additional flag indicating that it is a dummy node to be left out of the final tree.

Experimental Details

The probability functions we used were similar to those of Section, but with three important differences. The first is that in some cases we can compute a probability exactly. For instance, if we know that the head word of a parent is "man" and that the parent got its head word from its right child, then we know that with probability one the head word of the right child is "man". In cases where we can compute a probability exactly, we do so. Second, we smoothed slightly differently. In particular, when smoothing a probability estimate of the form

p̂(a | b, c) = λ · C(a, b, c) / C(b, c) + (1 − λ) · p̂(a | b),

we set λ = C(b, c) / (k + C(b, c)), using a separate k for each probability distribution. Finally, we did additional smoothing for words, adding counts for the unknown word.
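A small sketch of this interpolated estimate, with an illustrative per-distribution constant k and assumed count tables, is given below.

```python
def smoothed_prob(a, b, c, counts_abc, counts_bc, lower_order_prob, k=5.0):
    """Interpolate the empirical estimate C(a,b,c)/C(b,c) with a lower-order
    estimate p(a|b), weighting by lambda = C(b,c) / (k + C(b,c))."""
    c_bc = counts_bc.get((b, c), 0)
    if c_bc == 0:
        return lower_order_prob(a, b)
    lam = c_bc / (k + c_bc)
    empirical = counts_abc.get((a, b, c), 0) / c_bc
    return lam * empirical + (1.0 - lam) * lower_order_prob(a, b)
```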

The actual tables that show, for each feature, the order of backoff for that feature are given in Appendix A. In this section we simply discuss a single example: the order of backoff for the second-child feature of the right feature set, 2_R. The most relevant features are, first, the other features of the right set (N_R, C_R, H_R, and the right child's first-child feature), then the parent features N_P and C_P, then the left-sibling features N_L and C_L, and finally P_R and W_R. We back off from the head word feature first because, although relevant, this feature creates significant data sparsity. Notice that, in general, features from the same feature set (in this case the right features) are kept longest, parent features are next most relevant, and sibling features are considered least relevant. We leave out entirely features that are unlikely to be relevant and that would cause data sparsity, such as W_L, the head word of the left sibling.

Figure: Example tree with features for the sentence "The normal man dies". Each node is annotated with its name N, continuation C, head name H, head part of speech P, head word W, and distance features D^L, D^R, and D^B.

Table: PFG experimental results (labeled recall, labeled precision, and crossing-brackets measures for the PFG with words, the PFG with POS tags only, Collins's parsers, and Magerman's parser).


Results

We used the same machine-labeled data as Collins: TreeBank II, with the same sections for training, testing, and development, using all sentences up to the same length limit. We also used the same scoring method, replicating even a minor bug for the sake of comparison (see the earlier footnote).

The table above gives the results. Our results are the best we know of from POS tags alone, and with the head word feature they fall between the results of Collins and Magerman (as given by Collins). To take one measure as an example, our crossing brackets per sentence with POS tags alone improves on Collins's results in similar conditions, and with the head word feature our crossing-brackets rate is between Collins's and Magerman's. The results are similar for other measures of performance.

Contribution of Individual Features

In this subsection we analyze the contribution of features by running experiments using the full set of features minus some individual feature. The difference in performance between the full set and the full set minus some individual feature gives us an estimate of that feature's contribution. Note that the contribution of a feature is relative to the other features in the full set. For instance, if we had features for both the head word and a morphologically stemmed head word, their individual contributions, as measured in this manner, would be nearly negligible, because the two are so highly correlated. The same effect will hold, to lesser degrees, for other features, depending on how correlated they are. Some features are not meaningful without other features. For instance, the second-child feature is not particularly meaningful without the first-child feature. Thus we cannot simply remove just the first-child feature to compute its contribution; instead, we first remove the second-child feature, compute its contribution, and then remove the first-child feature as well.

(We are grateful to Michael Collins and Adwait Ratnaparkhi for supplying us with the part-of-speech tags.)

Table: Contribution of individual features (labeled recall, labeled precision, crossing-brackets measures, and parse time for the full feature set and for the full set minus individual features, run both with and without the head name feature).

When performing these experiments, we used the same dependencies and the same order of backoff for all experiments. This probably causes us to overestimate the value of some features, since when we delete a feature we could add additional dependencies to other features that depended on it. We kept the thresholding parameters constant, as optimized by the parameter search algorithm of Chapter, but adjusted the parameters of the backoff algorithm.

In the original work on PFGs, we did not perform this feature contribution experiment, and thus included the head name (H) feature in all of our original experiments. When we performed these experiments, we discovered that the head name feature had a negative contribution. We thus removed the head name feature and repeated the feature contribution experiment without this feature. Both sets of results are given here. In order to minimize the number of runs on the final test data, all of these experiments were run on the development test section of the data.

The table above shows the results of experiments removing various features. Our first test was the baseline. We then removed the head word feature, leading to a fairly significant, but not overwhelming, drop in combined precision and recall. Our results without the head word are perhaps the best reported. Removing the other features (head pos, distance between, all distance features, and the second child) led to more modest drops in combined precision and recall.

On the other hand, when we removed both child features we got a larger drop, and when we removed the head name feature we achieved an improvement. We therefore repeated these experiments without the head name feature. Without the head name feature, the contributions of the other features are very similar: the head word feature again made the largest individual contribution, while head pos, distance between, all distance features, and the second child contributed more modestly. Without the two child features, performance dropped noticeably, presumably because there were now almost no name features remaining except for the name itself. When the name feature was also removed, performance dropped even further, to far more crossing brackets per sentence than the no-head-name baseline (there are no labels in this case, so we cannot compute labeled precision or labeled recall). Finally, we tried adding a third child feature, to see whether we had examined enough child features; performance dropped negligibly.

These results are significant with regard to the two child features. While the other features we used are very common, the child features are much less used. Both Magerman and Ratnaparkhi make use of essentially these features, but in a history-based formalism. One generative formalism that comes close to using these features is that of Charniak, which uses a feature for the complete rule. This captures the child features and more, but it does so in a crude way, with two drawbacks: first, it does not allow any smoothing, and second, it probably captures too many child features. As we have shown, the contribution of child features plateaus at about two children; with three children there is a slight negative impact, and more children probably lead to larger negative contributions. On the other hand, Collins does not use these features at all, although he does use a feature for the head name. Comparing the runs without the child features, with and without the head name, we see that without the child features the head name makes a significant positive contribution. Extrapolating from these results, integrating the child features, and perhaps removing the head name feature, in a state-of-the-art model such as that of Collins would probably lead to improvements.

The last column of the table, the Time column, is especially interesting. Notice that the runtimes are fairly tightly constrained, and that the longest runtime comes with the worst performance and the fewest features. Better models often allow better thresholding and faster performance. Features such as the child features, which we might expect to significantly slow down parsing because of the large number of feature sets they allow, in some cases actually speed performance. Of course, to fully substantiate these claims would require a more detailed exploration of the tradeoff between speed and performance for each set of features, which would be beyond the scope of this chapter.

We reran our experiment on the final test section, this time without the head name feature. The results are given here, with the original results (with the head name) repeated for comparison. The results are disappointing, leading to only a slight improvement. The smaller-than-expected improvement might be attributable to random variation, either in the development test data or in the final test data, or to some systematic difference between the two sets. Analyzing the two sets directly would invalidate any further experiments on the final test data, so we cannot determine which of these is the case.

Table: Final test results for the model with words, with and without the head name feature (labeled recall, labeled precision, and crossing-brackets measures).

Conclusions and Future Work

While the empirical performance of Probabilistic Feature Grammars is very encouraging, we think there is far more potential. First, for grammarians wishing to integrate statistics into more conventional models, the features of PFG are a very useful tool, corresponding to the features of DCG, LFG, HPSG, and similar formalisms. TreeBank II is annotated with many semantic features, currently unused in all but the simplest way by all systems; it should be easy to integrate these features into a PFG.

PFG has other benefits that we would like to explore, including the possibility of its use as a language model for applications such as speech recognition. Furthermore, the dynamic programming used in the model is amenable to efficient rescoring of lattices output by speech recognizers.

Another benefit of PFG is that both inside and outside probabilities can be computed, making it possible to reestimate PFG parameters. We would like to try experiments using PFGs and the inside-outside algorithm to estimate parameters from unannotated text.

Because the PFG model is so general, it is amenable to many further improvements, and we would like to try a wide variety of them. We would like to try more sophisticated smoothing techniques, as well as more sophisticated probability models, such as decision trees. We would like to try a variety of new features, including classes of words and nonterminals and morphologically stemmed words, and to integrate the most useful features from recent work such as that of Collins. We would also like to perform much more detailed research into the optimal order for backoff than the few pilot experiments used here.

While the generality and elegance of the PFG model make these and many other experiments possible, we are also encouraged by the very good experimental results. Wordless model performance is excellent, and the more recent models with words are comparable to the state of the art.

Appendix

A Backoff Tables

The tables below give the order of backoff used in the experiments. For the wordless experiments, the head word feature W was simply deleted. The order of backoff is provided here for those who would like to duplicate these results. However, these tables were produced mostly from intuition, with the help of only a few pilot experiments; there is no reason to think that this is a particularly good set of tables.

The tables require a bit of explanation. A basic table entry was already described in Section. Briefly, a table entry gives the order of backoff of the features, with the most relevant features first. In some cases we wished to stop backoff beyond a certain point; this is indicated by a vertical bar (|). This device was especially useful in constructing the n-gram grammar of Chapter, since it allowed smoothing to be easily turned off, making those experiments more easily replicated. In some cases it was possible to determine the value of a feature from the values of already-known features. For instance, from the parent continuation feature C_P, the parent's first-child feature, and the parent name feature N_P, we can determine the left child name feature N_L. Thus no backoff is necessary; this is indicated by a single vertical bar in the entry for N_L.
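As an illustration of how such an entry might be used, the sketch below recursively interpolates estimates down a backoff order, never backing off past the vertical bar; the counts interface and the uniform base distribution are assumptions in the spirit of the smoothing formula given earlier.

```python
def interpolated_estimate(feature_value, context, order, counts, k=5.0):
    """Recursively interpolate estimates down a backoff order (most relevant
    conditioning features listed first). `counts` is an assumed object exposing
    context_count, joint_count, and num_values; k is an illustrative constant."""
    features = [f for f in order if f != "|"]
    stop = order.index("|") if "|" in order else 0   # never back off past the bar

    def estimate(i):
        conditioning = tuple((f, context[f]) for f in features[:i])
        c_context = counts.context_count(conditioning)
        c_joint = counts.joint_count(feature_value, conditioning)
        if i > stop:
            lower = estimate(i - 1)                  # drop the least relevant feature
        else:
            lower = 1.0 / counts.num_values()        # assumed uniform base distribution
        if c_context == 0:
            return lower
        lam = c_context / (k + c_context)
        return lam * (c_joint / c_context) + (1.0 - lam) * lower

    return estimate(len(features))
```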

Table: Binary Event Backoff (the order of backoff for each feature of the binary events).

Table: Unary Event Backoff (the order of backoff for each feature of the unary events).

Table: Start/Prior Event Backoff (the order of backoff for each feature of the start and prior events).

Chapter

Conclusion

"All this means, of course, is that I didn't solve the natural language parsing problem. No big surprise there, although the naive graduate student in me is a little disappointed." (David Magerman)

Summary

The goal of this thesis has been to show new uses for the inside-outside probabilities and to provide useful tools for finding these probabilities. In Chapter we gave tools for finding inside-outside probabilities in various formalisms. We then showed in Chapters and that these probabilities have several uses beyond grammar induction, including more accurate parsing by matching parsing algorithms to metrics, much faster parsing of the DOP model, and faster parsing of general PCFGs through thresholding. Finally, in Chapter we gave a state-of-the-art parsing formalism that can compute inside and outside probabilities and that makes a good framework for future research.

We began by presenting, in Chapter, a general framework for parsing algorithms: semiring parsing. Using the item-based descriptions of semiring parsing, it is easy to derive formulas for a wide variety of parsers and for a wide variety of values, including the Viterbi, inside, and outside probabilities. As we look towards the future of parsing, we see a movement towards new formalisms with an increasingly lexicalized emphasis, and a movement towards parsing algorithms that can take advantage of the efficiencies of lexicalization (Lafferty et al.). Whether or not this occurs, as long as parsing technology does not remain stagnant, the theory of semiring parsing will simplify the development of whatever new parsers are needed. Furthermore, given the expense of development, learning algorithms like the inside-outside reestimation algorithm will probably play a key role in the development of practical systems. Thus, the ease with which semiring parsing allows the inside and outside formulas to be derived will be important.
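As a tiny illustration of the semiring idea, the sketch below shows the same chart-combination step run in an inside (sum-product) semiring and in a Viterbi (max-product) semiring, simply by swapping the two operations; the item and value representations are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Semiring:
    plus: Callable[[float, float], float]   # combines alternative derivations
    times: Callable[[float, float], float]  # combines sub-derivations
    zero: float
    one: float

INSIDE = Semiring(plus=lambda x, y: x + y, times=lambda x, y: x * y, zero=0.0, one=1.0)
VITERBI = Semiring(plus=max, times=lambda x, y: x * y, zero=0.0, one=1.0)

def combine(semiring, rule_value, left_value, right_value, current_value):
    """One deduction step: extend the current chart value with a new derivation
    built from a rule value and two child values."""
    new_derivation = semiring.times(rule_value, semiring.times(left_value, right_value))
    return semiring.plus(current_value, new_derivation)

# The same step gives a sum of probabilities under INSIDE and a best score under VITERBI.
print(combine(INSIDE, 0.5, 0.2, 0.3, 0.01))   # 0.04
print(combine(VITERBI, 0.5, 0.2, 0.3, 0.01))  # 0.03
```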

In Chapter we showed that the inside and outside probabilities could be used to improve performance by matching parsing algorithms to metrics. This basic idea is powerful. While many researchers have worked on different grammar induction techniques and different grammar formalisms, the only commonly used parsing algorithm was the Viterbi algorithm. Thus, many researchers can use the algorithms of that chapter, or appropriate variations, to get improvements.

In Chapter we sped up Data-Oriented Parsing by a large factor and replicated the DOP algorithms for the first time, using the inside-outside probabilities. While DOP had incredible reported results, we showed that these previous results were probably attributable to chance or to a particularly easy corpus. Negative results are always disappointing, but we hope to have at least helped steer the field in a more fruitful direction.

The normalized inside-outside probabilities are ideal for thresholding, giving the probability that any given constituent is correct. The only problem with using the inside-outside probabilities for thresholding is that the outside probability is unknown until long after it is needed. In Chapter we showed that approximations to the inside-outside probabilities can be used to significantly speed up parsing; in particular, we sped up parsing by a large factor at the same accuracy level as traditional thresholding algorithms. We expect that these techniques will be broadly useful to the statistical parsing community.

Finally, in Chapter we gave a state-of-the-art parsing formalism that could be used to compute inside and outside probabilities. While others (Charniak; Collins) independently worked in a very similar direction, we provided several unique contributions, including an elegant theoretical framework, a useful way of breaking rules into pieces, an analysis of the value of each feature, and excellent wordless results. We think the theoretical structure of PFGs makes a good platform for future parsing research, allowing a variety of researchers to phrase their models in a unified framework.

At the end of a thesis like this, the question of the future direction of statistical parsing naturally arises. Given that statistical parsing has borrowed so much from speech recognition, it is only natural to compare the progress of statistical parsing systems to the progress of speech recognition systems. In particular, over the last ten or fifteen years, the progress of the speech recognition community has been stunning. Statistical parsing has made progress at a similar pace, but for a much shorter period of time. Given that a state-of-the-art speech system requires perhaps many person-years of work to develop, while a state-of-the-art parsing system requires only one or two, quite a bit of progress remains to be made, if only by integrating everything we know how to do into a single system. If the most useful lessons of this thesis were combined with the most useful lessons of others (Magerman; Collins; Ratnaparkhi; Charniak, inter alia), we assume a much better parser would result. All of the techniques used in this thesis would fit well in such a system.

Recently, there has been work towards using statistical-parsing-style methods as the framework for relatively complete understanding systems (Miller et al.; Epstein et al.). As this higher-level work progresses, it seems likely that the techniques developed in this thesis will be applicable. Our work on maximizing the right criteria and our thresholding algorithms for search could both probably be applied in higher-level domains. Our work on PFGs could even serve as the framework for such new approaches, with features used to store or encode the semantic representation.

The lessons learned in this thesis are clearly applicable to future work in statistical NLP, but they are also useful more broadly within computer science. The general lessons include:

- Maximize the true criterion, or as close to it as you can get.

- Maximize the number of pieces correct, using the overall probability of each piece; this can be easier, and sometimes more effective, than maximizing the probability of the whole.

- Guide search using the best approximations to the future and the past that you can find, including local, global, and multiple-pass techniques. It may be possible to get better results by combining all of these techniques together.

Our work in Chapter presented a general framework which we used for statistical parsing. As pointed out by Tendeau (a), this kind of framework can be used to encode almost any dynamic programming algorithm, so the techniques of semiring parsing can actually be applied widely.

In summary, we have shown how to use the inside-outside probabilities, and approximations to them, to improve accuracy and to speed parsing, and we have provided the tools needed to find these probabilities. These techniques will be useful both in future work on statistical Natural Language Processing and more broadly.

References

Steve Abney. Stochastic attribute-value grammars. Available as cmp-lg preprint, October.
J. K. Baker. Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America, Boston, MA, June.
L. E. Baum and J. A. Eagon. An inequality with application to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society.
L. E. Baum. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities.
Jean Berstel and Christophe Reutenauer. Rational Series and Their Languages. EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.
Sylvie Billot and Bernard Lang. The structure of shared forests in ambiguous parsing. In Proceedings of the Annual Meeting of the ACL, Vancouver.
Ezra Black, Frederick Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos (a). Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the February DARPA Speech and Natural Language Workshop.
Ezra Black, John Lafferty, and Salim Roukos (b). Development and evaluation of a broad-coverage probabilistic grammar of English-language computer manuals. In Proceedings of the Annual Meeting of the ACL.
Ezra Black, George Garside, and Geoffrey Leech. Statistically-Driven Computer Grammars of English: the IBM-Lancaster Approach. Language and Computers: Studies in Practical Linguistics. Rodopi, Amsterdam.
Rens Bod. Mathematical properties of the data-oriented parsing model. Paper presented at the Third Meeting on Mathematics of Language (MOL), Austin, Texas.
Rens Bod (a). Data-oriented parsing as a general framework for stochastic language processing. In K. Sikkel and A. Nijholt, editors, Parsing Natural Language, Twente, The Netherlands.
Rens Bod (b). Monte Carlo parsing. In Proceedings, Third International Workshop on Parsing Technologies, Tilburg/Durbuy.
Rens Bod (c). Using an annotated corpus as a stochastic grammar. In Proceedings of the Sixth Conference of the European Chapter of the ACL.
Rens Bod (a). Efficient algorithms for parsing the DOP model: A reply to Joshua Goodman. Available from http://xxx.lanl.gov/abs/cmp-lg, May.
Rens Bod (b). Enriching Linguistics with Statistics: Performance Models of Natural Language. University of Amsterdam, ILLC Dissertation Series. Academische Pers, Amsterdam.
Rens Bod (c). The problem of computing the most probable tree in data-oriented parsing and stochastic tree grammars. In Proceedings of the Seventh Conference of the European Chapter of the ACL, Dublin, Ireland, March.
Chris Brew. Stochastic HPSG. In Proceedings of the Seventh Conference of the European Chapter of the ACL, Dublin, Ireland, March.
Eric Brill. A Corpus-Based Approach to Language Learning. PhD thesis, University of Pennsylvania.
Ted Briscoe and John Carroll. Generalized probabilistic LR parsing of natural language corpora with unification-based grammars. Computational Linguistics.
Sharon Caraballo and Eugene Charniak. Figures of merit for best-first probabilistic chart parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, May.
Sharon Caraballo and Eugene Charniak. New figures of merit for best-first probabilistic chart parsing. In submission. Available from http://www.cs.brown.edu/people/sc/NewFiguresofMerit.ps.Z.
John Carroll and Ted Briscoe. Probabilistic normalisation and unpacking of packed parse forests for unification-based grammars. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA.
Eugene Charniak. Tree-bank grammars. Technical Report CS, Department of Computer Science, Brown University. Available from ftp://ftp.cs.brown.edu/pub/techreports/.
Eugene Charniak. Statistical parsing with a context-free grammar and word statistics. In Proceedings of the AAAI, Providence, RI. AAAI Press/MIT Press.
Zhiyi Chi and Stuart Geman. Estimation of probabilistic context-free grammars. Computational Linguistics. To appear.
N. Chomsky and M. P. Schutzenberger. The algebraic theory of context-free languages. In P. Braffort and D. Hirschberg, editors, Computer Programming and Formal Systems. North-Holland.
Michael Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the Annual Meeting of the ACL, Santa Cruz, CA, June.
Michael Collins. Three generative lexicalised models for statistical parsing. In Proceedings of the Annual Meeting of the ACL, Madrid, Spain.
Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA.
Jay Earley. An efficient context-free parsing algorithm. Communications of the ACM.
Andreas Eisele. Towards probabilistic extensions of constraint-based grammars. DYANA Deliverable, September. Available from ftp://ftp.ims.uni-stuttgart.de/pub/papers/DYANA/.
M. Epstein, K. Papineni, S. Roukos, T. Ward, and S. Della Pietra. Statistical natural language understanding using hidden clumpings. In ICASSP, New York. IEEE.
Joshua Goodman (a). Efficient algorithms for parsing the DOP model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, May.
Joshua Goodman (b). Parsing algorithms and metrics. In Proceedings of the Annual Meeting of the ACL, Santa Cruz, CA, June.
Joshua Goodman. Global thresholding and multiple-pass parsing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
Joshua Goodman (a). Parsing Inside-Out. PhD thesis, Harvard University. Available as a cmp-lg preprint and from http://www.eecs.harvard.edu/goodman/thesis.ps.
Joshua Goodman (b). Semiring parsing. In preparation.
Susan L. Graham, Michael A. Harrison, and Walter L. Ruzzo. An improved context-free recognizer. ACM Transactions on Programming Languages and Systems, July.
Charles T. Hemphill, John J. Godfrey, and George R. Doddington. The ATIS spoken language systems pilot corpus. In DARPA Speech and Natural Language Workshop, Hidden Valley, Pennsylvania, June. Morgan Kaufmann.
John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Massachusetts.
F. Jelinek and J. D. Lafferty. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics.
J. Kasami. An efficient recognition and syntax analysis algorithm for context-free languages. Technical report, University of Hawaii.
Werner Kuich and Arto Salomaa. Semirings, Automata, Languages. EATCS Monographs on Theoretical Computer Science. Springer-Verlag, Berlin, Germany.
Werner Kuich. Semirings and formal power series: Their relevance to formal languages and automata. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages. Springer-Verlag, Berlin.
John Lafferty, Daniel Sleator, and Davy Temperley. Grammatical trigrams: A probabilistic model of link grammar. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, October.
K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language.
K. Lari and S. J. Young. Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language.
R. Leermakers. How to cover a grammar. In Proceedings of the Annual Meeting of the ACL, Vancouver.
D. M. Magerman and C. Weir. Efficiency, robustness and accuracy in Picky chart parsing. In Proceedings of the Association for Computational Linguistics.
David Magerman. Natural Language Parsing as Statistical Pattern Recognition. PhD thesis, Stanford University, February.
David Magerman. Statistical decision-tree models for parsing. In Proceedings of the Annual Meeting of the ACL, Cambridge, MA.
Scott Miller, David Stallard, Robert Bobrow, and Richard Schwartz. A fully statistical approach to natural language interfaces. In Proceedings of the Annual Meeting of the ACL, Santa Cruz, CA, June.
Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics.
Anton Nijholt. Context-Free Grammars: Covers, Normal Forms, and Parsing. Lecture Notes in Computer Science. Springer-Verlag, Berlin, Germany.
Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Annual Meeting of the ACL, Newark, Delaware.
Fernando Pereira and Stuart Shieber. Prolog and Natural Language Analysis. CSLI Lecture Notes. Center for the Study of Language and Information, Stanford, California.
Fernando Pereira and David Warren. Parsing as deduction. In Proceedings of the Annual Meeting of the ACL, Cambridge, Massachusetts.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, February.
Adwait Ratnaparkhi. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
Manny Rayner and David Carter. Fast parsing using pruning and grammar specialization. In Proceedings of the Annual Meeting of the ACL, Santa Cruz, CA, June.
P. Resnik. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings of the International Conference on Computational Linguistics, Nantes, France, August.
Arto Salomaa and Matti Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, Berlin, Germany.
R. Scha. Language theory and language technology: competence and performance. In Q. A. M. de Kort and G. L. J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, Landelijke Vereniging van Neerlandici (LVVN-jaarboek), Almere. In Dutch.
Y. Schabes and R. Waters. Tree insertion grammar: A cubic-time parsable formalism that lexicalizes context-free grammar without changing the tree produced. Technical Report, Mitsubishi Electric Research Laboratories.
Yves Schabes, Michal Roth, and Randy Osborne. Parsing the Wall Street Journal with the inside-outside algorithm. In Proceedings of the Sixth Conference of the European Chapter of the ACL.
Yves Schabes. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the International Conference on Computational Linguistics, Nantes, France, August.
Richard Schwartz, Steve Austin, Francis Kubala, John Makhoul, Long Nguyen, Paul Placeway, and George Zavaliagkos. New uses for the N-best sentence hypothesis within the Byblos speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, San Francisco, California.
Stuart Shieber, Yves Schabes, and Fernando Pereira. Principles and implementation of deductive parsing. Journal of Logic Programming.
Klaas Sikkel. Parsing Schemata.
Khalil Simaan (a). Computational complexity of probabilistic disambiguation by means of tree grammars. In Proceedings of COLING.
Khalil Simaan (b). Efficient disambiguation by means of stochastic tree substitution grammars. In R. Mitkov and N. Nicolov, editors, Recent Advances in NLP, Current Issues in Linguistic Theory. John Benjamins, Amsterdam.
Andreas Stolcke. An efficient probabilistic context-free parsing algorithm that computes prefix probabilities. Technical Report, International Computer Science Institute, Berkeley, CA.
Ray Teitelbaum. Context-free error analysis by evaluation of algebraic power series. In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, Austin, Texas.
Frederic Tendeau (a). Computing abstract decorations of parse forests using dynamic programming and algebraic power series. Theoretical Computer Science. To appear.
Frederic Tendeau (b). An Earley algorithm for generic attribute augmented grammars and applications. In Proceedings of the International Workshop on Parsing Technologies.
Andrew J. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory.
Dekai Wu. A polynomial-time algorithm for statistical machine translation. In Proceedings of the Annual Meeting of the ACL, Santa Cruz, CA, June.
D. H. Younger. Recognition and parsing of context-free languages in time n^3. Information and Control.
G. Zavaliagkos, T. Anastasakos, G. Chou, C. Lapre, F. Kubala, J. Makhoul, L. Nguyen, R. Schwartz, and Y. Zhao. Improved search, acoustic and language modeling in the BBN Byblos large vocabulary CSR system. In Proceedings of the ARPA Workshop on Spoken Language Technology, Plainsboro, New Jersey.