
University of Pennsylvania ScholarlyCommons

IRCS Technical Reports Series
Institute for Research in Cognitive Science

June 1997

Complexity of Lexical Descriptions and its Relevance to Partial Parsing

Srinivas Bangalore, University of Pennsylvania


Bangalore, Srinivas, "Complexity of Lexical Descriptions and its Relevance to Partial Parsing" (1997). IRCS Technical Reports Series. 82. https://repository.upenn.edu/ircs_reports/82

University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-97-10.




Complexity of Lexical Descriptions and Its Relevance to Partial Parsing

Srinivas Bangalore

University of Pennsylvania 3401 Walnut Street, Suite 400A Philadelphia, PA 19104-6228

June 1997

Site of the NSF Science and Technology Center for Research in Cognitive Science

IRCS Report 97-10

COMPLEXITY OF LEXICAL DESCRIPTIONS AND ITS

RELEVANCE TO PARTIAL PARSING

SRINIVAS BANGALORE

A DISSERTATION

in

COMPUTER AND INFORMATION SCIENCE

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Aravind K. Joshi
Supervisor of Dissertation

Mark Steedman
Graduate Group Chairperson

Copyright

by

Srinivas Bangalore

Acknowledgments

I owe my thanks to a number of people, each of whom contributed in their own way towards this research and the preparation of this document. First of all, I thank Prof. Aravind Joshi for his continued support during the period of this research. I have benefited significantly from his deep insights and his passion for subtle details, which have made a significant impact on this research. I thank Prof. Mitch Marcus, with whom I have had numerous corridor conversations that have helped shape this research; I deeply appreciate his support in the absence of Prof. Joshi during the Fall. My thanks are also due to Prof. Mark Steedman, whose seminar course helped me to extend the applicability of this research. I thank Steve Abney, Mark Liberman and John Trueswell who, as part of my committee, provided insightful comments and suggestions that have improved the quality of this research.

Without Yves Schabes and his dissertation, this research would not have been possible at all. I have benefited a great deal from my discussions with him on many occasions. I have also used his code, developed for XTAG, extensively in this research.

There would be no greater injustice than if I did not acknowledge the influence the XTAG group has had on this research, besides keeping me on my toes at all times. Beth Ann Hockey and Christine Doran deserve my special thanks for putting up with my linguistic ignorance and cheering me along with my often silly ideas. Without their critical remarks, this research would have taken much longer than it has. I thank Dania and Martin for providing the many tools used in this research. I would also like to thank the XTAG elves, Heather, Laura, Susan and young Tim, who cleaned up the XTAG corpus over the summer.

My special thanks to R. Chandrasekar (Mickey) for several invigorating discussions about my work in particular and the field of computer science in general. I also thank him for giving me a chance to apply supertagging to his work, and for his warm fraternal affection that saw me through the final stages of this work. Significant improvement in supertagging performance came about during the summer, thanks to the summer camp team: Breck Baldwin, Christine Doran, Michael Niv and Jeff Reynar. I am deeply indebted to them for providing constructive criticism and setting up high standards of performance.

My thanks to Martha Palmer, Tilman Becker and James Rogers for providing constructive comments during practice talks and in student-lounge discussions. I would also like to acknowledge Rajesh Bhatt, Eric Brill, Ted Briscoe, Mike Collins, Jason Eisner, Seth Kulick, Dan Melamed, Adwait Ratnaparkhi, Anoop Sarkar, David Yarowsky and the innumerable IRCS visitors for useful discussions.

I would like to thank the administrative staff of the CIS Department and IRCS for their efforts, including Mike Felker for dealing with my course issues, Gail Shannon for many money matters, and Trisha Yannuzzi and Chris Sandy for accommodating my weird requests for various kinds of letters and certifications.

Finally, I owe everything I am to my parents. In particular, I thank them for their unflagging encouragement to pursue my research career and for their confidence in me to succeed. My special thanks to Purnima for standing by me at all times and giving me a reason to smile during the frustrating times of this research. I also acknowledge the support from my friends Bharathi, Esther, Raghava, Tikkana and Vijay during the early stages of this research.

Abstract

Complexity of Lexical Descriptions and its Relevance to Partial Parsing

Srinivas Bangalore

Supervisor: Aravind K. Joshi

In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. However, increasing the complexity of descriptions makes the number of different descriptions for each lexical item much larger and hence increases the local ambiguity for a parser. This local ambiguity can be resolved by using supertag co-occurrence statistics collected from parsed corpora. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework, wherein supertag disambiguation provides a representation that is an almost parse. We have used the disambiguated supertag sequence in conjunction with a lightweight dependency analyzer to compute noun groups, verb groups, dependency linkages and even partial parses. We have shown that a trigram-based supertagger achieves an accuracy of 92.1% on Wall Street Journal (WSJ) texts. Furthermore, we have shown that lightweight dependency analysis on the output of the supertagger identifies 83% of the dependency links accurately. We have exploited the representation of supertags with Explanation-Based Learning to improve parsing efficiency. In this approach, parsing in limited domains can be modeled as a Finite-State Transduction. We have implemented such a system for the ATIS domain, which improves parsing efficiency by a factor of 15. We have used the supertagger in a variety of applications to provide lexical descriptions at an appropriate granularity. In an information retrieval application, we show that the supertag-based system performs at higher levels of precision compared to a system based on part-of-speech tags. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application.

Contents

Acknowledgments

1 Introduction
   Issues in Natural Language Parsing
      Computational Issues
      Linguistic Issues
      Psycholinguistic Issues
   Methods for Robust Parsing
      Finite State Grammar-based Approaches
      Statistical Parsers
   Our Approach
      Localizing Ambiguity
      Using Explanation-based Learning Technique
   Chapter Summaries

2 Literature Survey
   Fidditch
   CASS
   UNIVAC Parser (now called Uniparse)
   FASTUS
   ENGCG Parser
   Decision Tree Parsers
   Bilder
   Data Oriented Parser
   De Marcken's Parser
   Seneff's Extended Chart Parser
   The PLNLP approach
   TACITUS
   Sparser
   IBM's PCFG and HBG Model
   Summary

3 Merits of LTAG for Partial Parsing
   Feature-based Lexicalized Tree-Adjoining Grammar
   Key properties of LTAGs
   Derivation Structure
   Categorial Grammars
   Chunking and LTAGs
   Language Modeling and LTAG
   XTAG
   Summary

4 Supertags
   Example of Supertagging
   Reducing supertag ambiguity using structural information
   Models, Data, Experiments and Results
      Early Work
      Dependency model
      Recent Work
      Unigram model
      n-gram model
      Error-driven Transformation-based Tagger
      Head Trigram Model
      Head Trigram Model with Supertags as Feature Vectors
   Supertagging before Parsing
   Lightweight Dependency Analyzer
   Discussion
   Applicability to other Lexicalized Grammars
   Summary

5 Exploiting LTAG representation for Explanation-based Learning
   Explanation-based Learning
   Overview of our approach to using EBL
   Feature generalization
      Storing and retrieving feature-generalized parses
   Recursive generalization
      Storing and retrieving recursive-generalized parses
   Finite-State Transducer Representation
   Extending the Encoding for Clausal complements
      Types of auxiliary trees
   Experiments and Results
   Phrasal EBL and Weighted FST
   Discussion
   Summary

6 Stapler
   Input to the Stapler: Almost Parse
   Stapler's Tasks
      Identify the nature of operation
      Modifier Attachment
      Assigning the addresses to the links
      Feature Structures and Unification
   Data Structure
   Experimental Results
   Summary

7 Parser Evaluation Metrics
   Methods for Evaluating a Parsing System
      Test suite-based Evaluation
      Unannotated Corpus-based Evaluation
      Annotated Corpus-based Evaluation
      Limitations of Parseval
   Our Proposal
      Application Independent Evaluation
      Application Dependent Evaluation
      A General Framework for Parser Evaluation
   Evaluation of Supertag and LDA system
      Performance of Supertagging for Text Chunking
      Performance of Supertag and LDA system
   Summary

8 Applications of Supertagging
   Information Retrieval
      Methodology
      The Experiment: POS Tagging vs. Supertagging
   Gleaning Information from the Web
   Information Extraction
   Language Modeling
      Performance Evaluation
   Simplification
      Simplification with Dependency links
      Evaluation and Discussion
   Exploiting Document-level Constraints for Supertagging
   Summary

9 Conclusions
   Contributions
   Future Work

A List of Supertags

List of Tables

XTAG System Summary
Examples of syntactic environments
Supertag ambiguity with and without the use of structural constraints
The effect of filters on supertag ambiguity, tabulated against part-of-speech
Dependency Data
Results of the Dependency model
Results from the Unigram Supertag Model
Performance of the supertagger on the WSJ corpus
Performance of the supertagger on the IBM Manual corpus and ATIS corpus
The list of features used to encode supertags
Performance improvement of the best supertagger over the earlier best supertagger on the WSJ corpus
An example sentence with the supertags assigned to each word and dependency links among words
An example illustrating the working of the LDA on a center-embedded construction
An example illustrating the working of the LDA on a sentence with long-distance extraction
The explanation-based learning problem
Correspondence between EBL and parsing terminology
Description of the components in the tuple representation associated with each word
Description of the components in the tuple representation associated with each word
Recall percentage, average number of parses, response times and size of the FST for various corpora
Performance comparison of XTAG with and without the EBL component
Performance comparison of the transformation-based noun chunker and the supertag-based noun chunker
Performance comparison of the transformation-based verb chunker and the supertag-based verb chunker
Performance comparison of the supertag-based preposition phrase attachment against other approaches
Performance of the trigram supertagger on appositive, parenthetical, coordination and relative clause constructions
Comparative evaluation of the LDA on the Wall Street Journal and Brown Corpus
Performance of the LDA system compared against the XTAG derivation structures
The percentage of sentences with zero, one, two and three dependency link errors
Classification of "appoint" sentences
Precision and recall of different filters for relevant sentences
Precision and recall of different filters for irrelevant sentences
Classification of the documents retrieved for the search query
Precision and recall of Glean for retrieving relevant documents
Precision and recall of Glean for filtering out irrelevant documents
Word perplexities for the Wall Street Journal Corpus using the part-of-speech and supertag based language models
Class perplexities for the Wall Street Journal Corpus
Performance results on extracting Noun Phrases with and without pre-supertagging

List of Figures

Substitution and Adjunction in LTAG
Elementary trees for the sentence "The company is being acquired"
(a) Derived tree, (b) derivation tree and (c) dependency tree for the sentence "The company is being acquired"
(a) Tree for raising analysis anchored by "seems", (b) transitive tree, (c) object extraction tree for the verb "hit"
Chunks for "The professor from Milwaukee was reading about a biography of Marcel Proust"
Chunks in LTAG for "The professor from Milwaukee was reading about a biography of Marcel Proust"
Flowchart of the XTAG system
A selection of the supertags associated with each word of the sentence "the purchase price includes two ancillary companies"
Supertag disambiguation for the sentence "the purchase price includes two ancillary companies"
Comparison of the number of supertags with and without filtering for sentences of varying length
Percentage drop in the number of supertags with and without filtering for sentences of varying length
Supertag hierarchy
Flowchart of the XTAG system with the EBL component
(a) Derivation structure, (b) feature-generalized derivation structure, (c) index for the generalized parse for the sentence "show me the flights from Boston to Philadelphia"
(a) Feature-generalized derivation tree, (b) recursive-generalized derivation tree, (c) recursive-generalized derivation tree with the two Kleene stars collapsed into one, (d) index for the generalized parse for the sentence "show me the flights from Boston to Philadelphia"
Generalized derivation tree for the sentence "show me the flights from Boston to Philadelphia"
Finite state transducer representation for the sentences "show me the flights from Boston to Philadelphia" and "show me the flights from Boston to Philadelphia on Monday"
The elementary trees for the sentence "who did you say flies from Boston to Washington"
(a) Derivation structure, (b) feature-generalized derivation structure and (c) recursive-generalized derivation structure for the sentence "who did you say flies from Boston to Washington"
FST for the sentence "who did you say flies from Boston to Washington"
(a) Modifier auxiliary tree, (b) predicative auxiliary tree
Finite state transducer representation for the sentences "who did you say flies from Boston to Washington", "who did you say flies from Boston to Washington on Monday" and "who did you think I said flies from Boston to Washington"
(a) and (b): Tree chunks for the sentence "show me the flights from Boston to Philadelphia"
(a) Output of the supertagger for the sentence "The purchase price includes two ancillary companies", (b) output of EBL lookup for the sentence "Show me the flights from Boston to Philadelphia"
(a) Output of EBL lookup, (b) instantiated derivation tree, (c) intended derivation tree for the sentence "Give me the cost of tickets on Delta"
Data structure for an LTAG elementary tree
System setup used for Experiment (b)
System setup used for Experiment (c)
Summary of the Relation Model of Parser Evaluation
The phrase structure tree and the dependency linkage obtained from the phrase structure tree for the WSJ sentence "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29"
Overview of the Information Filtering scheme
Sample patterns involving POS tags and supertags
A sample MOP pattern using supertags and maximal noun phrases
Rule for extracting relative clauses

Chapter 1

Introduction

Although parsers have proved uncontroversially useful in the domain of processing Programming Languages, the issue of parsing in the domain of Natural Language (NL) processing has been a cause for tension between the computational and linguistic perspectives for a long time. In the past, the controversy about parsing was due to the divergence of objectives between natural language application developers, who were oriented to developing practical parsers, and psycholinguists, who were concerned with the psychological process of language comprehension. In recent times, however, developers of natural language applications have questioned the usefulness of parsing in practical NLP systems. This is primarily because there are no grammars that have complete coverage of freely occurring natural language texts, and there are no parsers that are robust enough to deal with that inadequacy. This limitation is further compounded by the fact that the inherent ambiguity of Natural Languages forces parsers to operate at speeds far from real-time requirements. In this dissertation, we address the issue of robust parsing by exploring various novel methods of using linguistically motivated, rich and complex lexical descriptions for partial parsing.

We review some of the issues faced by NL parsers and summarize two approaches to robust parsing in the sections that follow. We then motivate our approach of using rich and complex lexical descriptions for partial parsing, and introduce the two methods pursued in this dissertation.

Traditionally, a parser is defined as a device that assigns a structural description to an input string, given a set of elementary structures (rules, trees or graphs) defined in a grammar and the operations for combining those structures. (Parsers for context-free grammars can be viewed in this way, with rules as one-level trees that are combined using the substitution operation.) The output of a parser is a parse that serves as a proof of the well-formedness of the input string, given the elementary structures of the grammar. A parse is the result of searching, selecting and combining appropriate structures from the set of elementary structures defined in the grammar so as to span the input string. The process of combining elementary structures is represented by the derivation structure for the input string. In the past few years, however, statistically induced, grammarless parsing methodology (Jelinek et al.; Magerman; Collins) has redefined the parsing enterprise. Although the task of parsing is still to assign a structural description to an input, it is no longer the case that the description be arrived at by combining elementary structures of a grammar. In fact, there is no notion of an explicit grammar in this paradigm of parsing. A structural description is assigned to a string based on the likelihood of that structure in a context similar to that in the string. The contexts and likelihoods are collected from a corpus of sentences that are annotated with the desired structural descriptions.

Issues in Natural Language Parsing

In order to situate our work, we briefly review the issues involved in Natural Language (NL) parsing. Since parsing is the computation of assigning an interpretation to a sentence via a grammar, the NL parsing enterprise has been influenced by computational, linguistic and psycholinguistic considerations (Robinson). In this section, we review the issues in NL parsing from computational, psycholinguistic and linguistic perspectives, in the context of a purely grammar-based parsing approach. In the next few sections, we discuss the emerging approaches to NL parsing that address these issues to some extent.

Computational Issues

Natural Languages are both lexically and structurally ambiguous. This results in both local and global ambiguity for a natural language parser. This fact is further compounded by the high performance requirements on NL parsers in terms of speed, accuracy and robustness, which make NLs hard to parse.

Robustness: The syntactic structure of natural language is very complex, and no grammar has been written that has complete coverage of any natural language. This is further compounded by the fact that unrestricted texts contain ungrammatical and fragmented sentences. Natural language parsers are expected to be robust in the face of incomplete lexicon and grammar coverage, and even ungrammatical input. They are expected to rapidly parse simple sentences and yet degrade gracefully in the face of extragrammatical and ungrammatical input.

Speed: NL parsers that attempt to parse sentences of even moderate length are often faced with a combinatorially intractable task, due to the ambiguity inherent in natural language descriptions. Parsers arrive at an analysis by systematically searching, selecting and combining appropriate elementary structures from the set of possible structures defined in the grammar so as to span the input string. This aspect of searching through the space of elementary structures of the grammar, to determine the appropriate structures to combine, is the single most time-expensive component of the parsing process. The space of elementary structures to be searched and combined grows rapidly as the degree of ambiguity of words in the input string increases. This fact is further compounded in wide-coverage grammars, where a variety of elementary structures are included in the grammar to deal with a range of linguistic constructions. Due to the enormous size of the search space, parsers for wide-coverage grammars are excruciatingly slow and often perform at speeds that make them impractical to use.

Selecting an Analysis: Assuming that an NL parser is fast enough to find a set of analyses in a reasonable amount of time, it is faced with the problem of selecting a correct analysis. As a result of lexical and syntactic ambiguity in NLs, parsers invariably produce multiple parses for a given input. However, since the grammar itself does not directly encode any preference metric with each analysis, parsers for NLs in general do not have a method of choosing the correct analysis among competing analyses. (We consider the speed of the parser by itself, and not when the parser is embedded in an application and can have access to other sources of information such as semantic, pragmatic or world knowledge, or preference factors.)

Global Structure: The objective of a parser is to produce a structure that spans the entire input string. However, due to the lack of preference metrics, combined with the incompleteness of the lexicon and grammar and possibly ungrammatical input, parsers, in their attempt to assign a global structure to the complete input, often produce dispreferred local structures.

Linguistic Issues

It is quite evident that there is a close relationship between a parser and the representation the parser manipulates. However, in recent times there has been increasing debate on such issues as: what should the representation be, how linguistically detailed should the representation be, and how does one go about constructing such a representation? (We are hesitant to limit the notion of representation to the traditional sense of grammar, since these issues are applicable to statistical, grammarless parsing paradigms as well.)

An active line of research has been to develop wide-coverage grammars in well-studied, mathematically rigorous grammar frameworks that are linguistically adequate. A parser based on a linguistically motivated wide-coverage grammar has many advantages. A wide-coverage grammar that is domain independent encodes the invariances of the language in general and is not specialized to any particular domain. As a result, such a grammar is more portable across domains than a grammar developed for a particular domain. However, for a parser to better exploit the idiosyncrasies of a domain, a wide-coverage grammar can be specialized to that domain, thus making parsing efficient in limited domains. Also, depending on how directly a grammar framework encodes linguistic facts, a linguistically motivated grammar developed in that framework could produce output that is quite detailed and directly amenable to further processing. Furthermore, if a grammar for one language is created in detail and the structures of the grammar are organized systematically, then it is conceivable that grammars for closely related languages could be automatically generated by abstracting away from language-specific features.

Psycholinguistic Issues

Humans parse natural language expressions rapidly, robustly, efficiently and effectively, even in noisy environments with degraded inputs. This suggests that parsing mechanisms that build on what we know of human sentence processing might do better in terms of speed and robustness.

Lexicalist Approach: Humans seem to bring in vast amounts of lexical information and rapidly commit to a single structure on the basis of probabilistic and local contextual information (Trueswell and Tanenhaus; MacDonald et al.). In view of this, recent theories of sentence processing in psycholinguistics have emphasized the role of lexical mechanisms, a fact independently motivated in computational formalisms as well.

Local Constraints: An important aspect of human sentence processing is the use of local constraints and rapid commitment to a single structure, even at the cost of processing errors. Not many parsers in the past have focused on this issue; instead, they have attempted a globally consistent parse while entertaining many possible structures in parallel, rather than risking a total failure to parse. A primary reason for this distinction is that the cost of processing errors in humans is relatively low, since they are endowed with quick error detection and error recovery mechanisms that work in conjunction with the online, incremental processing mechanism. Recent trends in parsing can be viewed as addressing some of these issues.

Methods for Robust Parsing

In this section, we discuss some of the robust parsing methods that have been adopted to parse NLs, which either address or circumvent the issues discussed in the previous section. The main emphasis of these techniques has been on speed and coverage of parsing, while trading depth of analysis for robustness. There are two main approaches for dealing with robust parsing of natural languages. We summarize the similarities and differences between these approaches.

Finite State Grammar-based Approaches

Finite state grammar-based approaches to parsing are characterized by, for example, the parsing systems of Joshi; Joshi and Hopely; Abney; Appelt et al.; Roche; and Grishman. These systems use grammars that are represented as cascaded finite-state regular expression recognizers. The regular expressions are usually hand-crafted. Each recognizer in the cascade provides a locally optimal output. The output of these systems is mostly in the form of noun groups and verb groups rather than constituent structure, often called a shallow parse. There are no clause-level attachments or modifier attachments in the shallow parse. These parsers always produce one output, since they use the longest-match heuristic to resolve cases of ambiguity when more than one regular expression matches the input string at a given position. At present, none of these systems use any statistical information to resolve ambiguity. The grammar itself can be partitioned into domain-independent and domain-specific regular expressions, which implies that porting to a new domain would involve rewriting the domain-dependent expressions. This approach has proved to be quite successful as a preprocessor in information extraction systems (Hobbs et al.; Grishman).
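To make the longest-match style of processing concrete, here is a minimal sketch in Python, assuming a toy tag set and two hand-written patterns; the tags, the patterns and the chunk labels (NG, VG) are invented for illustration and are not taken from any of the systems cited above. A real cascade would apply several such stages in sequence.

    import re

    # Toy patterns over POS-tag sequences (one tag per token, space-separated).
    # The tags, patterns and chunk labels are illustrative only.
    PATTERNS = [
        ("NG", re.compile(r"((DT|JJ|NN|NNS) )*(NN|NNS) ")),  # noun group
        ("VG", re.compile(r"((MD|VBD|VBZ|VBG|VBN|VB) )+")),  # verb group
    ]

    def chunk(tokens, tags):
        """Scan left to right; wherever a pattern matches, take the longest
        match (the sole disambiguation heuristic) and emit a chunk."""
        tagstr = "".join(t + " " for t in tags)
        starts, off = [], 0
        for t in tags:                      # map char offsets to token indices
            starts.append(off)
            off += len(t) + 1
        out, i = [], 0
        while i < len(tokens):
            best = None
            for label, pat in PATTERNS:
                m = pat.match(tagstr, starts[i])
                if m and (best is None or m.end() > best[1]):
                    best = (label, m.end())
            if best:
                j = starts.index(best[1]) if best[1] < off else len(tokens)
                out.append((best[0], tokens[i:j]))
                i = j
            else:
                out.append((tags[i], [tokens[i]]))  # pass unmatched token through
                i += 1
        return out

    tokens = "the professor from Milwaukee was reading".split()
    tags = ["DT", "NN", "IN", "NN", "VBD", "VBG"]
    print(chunk(tokens, tags))
    # [('NG', ['the', 'professor']), ('IN', ['from']), ('NG', ['Milwaukee']),
    #  ('VG', ['was', 'reading'])]

Note that the output is flat, as described above: the chunker commits to one segmentation and leaves all attachments undecided.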

Statistical Parsers

Pioneered by the IBM natural language group (Fujisaki et al.) and later pursued by, for example, Schabes et al.; Jelinek et al.; Magerman; and Collins, this approach decouples the issue of well-formedness of an input string from the problem of assigning a structure to it. These systems attempt to assign some structure to every input string. The rules for assigning a structure to an input are extracted automatically from hand-annotated parses of large corpora, which are then subjected to smoothing to obtain reasonable coverage of the language. The resultant set of rules is not linguistically transparent and is not easily modifiable. Lexical and structural ambiguity is resolved using probability information that is encoded in the rules. This allows the system to assign the most likely structure to each input. The output of these systems consists of constituent analysis, the degree of detail of which is dependent on the detail of annotation present in the treebank that is used to train the system. However, since the system attempts a globally optimal parse, it may produce locally dispreferred analyses. Also, since there is no distinction between the domain-dependent and domain-independent rule sets, the system has to be retrained for each new domain, which in turn requires an annotated corpus for that domain.
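As a minimal illustration of how such systems pick the most likely structure, the following sketch runs a Viterbi-style CKY search over a toy probabilistic context-free grammar. The grammar, its probabilities and the sentence are invented for illustration; the treebank estimation and smoothing steps described above are omitted.

    from collections import defaultdict

    # Toy binary rules and lexical entries with invented probabilities; in
    # the systems described above these would be estimated from a treebank.
    BINARY = {("NP", "VP"): [("S", 1.0)],
              ("V", "NP"): [("VP", 1.0)],
              ("DT", "N"): [("NP", 0.6)]}
    LEXICAL = {"the": [("DT", 1.0)], "company": [("N", 0.5)],
               "acquired": [("V", 1.0)], "rivals": [("NP", 0.3), ("N", 0.2)]}

    def viterbi_cky(words):
        """Return the most probable S-rooted parse as a nested tuple,
        choosing among ambiguous analyses purely by rule probabilities."""
        n = len(words)
        best = defaultdict(dict)          # (i, j) -> {label: (prob, backptr)}
        for i, w in enumerate(words):
            for tag, p in LEXICAL.get(w, []):
                best[(i, i + 1)][tag] = (p, w)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for ll, (lp, _) in best[(i, k)].items():
                        for rl, (rp, _) in best[(k, j)].items():
                            for parent, q in BINARY.get((ll, rl), []):
                                p = lp * rp * q
                                if p > best[(i, j)].get(parent, (0.0, None))[0]:
                                    best[(i, j)][parent] = (p, (k, ll, rl))
        def tree(i, j, label):
            _, bp = best[(i, j)][label]
            if isinstance(bp, str):
                return (label, bp)
            k, ll, rl = bp
            return (label, tree(i, k, ll), tree(k, j, rl))
        return tree(0, n, "S"), best[(0, n)]["S"][0]

    print(viterbi_cky("the company acquired the rivals".split()))
    # prints the nested parse tree together with its probability, 0.036

Here "rivals" is lexically ambiguous (a bare NP or a noun), and the probabilities alone decide which analysis survives, with no separate notion of grammaticality.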

There are also parsers that use probabilistic weighting information in conjunction with hand-crafted grammars, for example Black et al.; Nagao; Alshawi and Carter; and Srinivas et al. In these cases, the probabilistic information is primarily used to rank the parses produced by the parser, and not so much for the purpose of making the system robust.

Our Approach

In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. We pursue the idea that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. This makes the number of different descriptions for each lexical item much larger than when the descriptions are less complex, thus increasing the local ambiguity for a parser. However, this local ambiguity can be resolved by using statistical distributions of supertag co-occurrences collected from a corpus of parses. Supertag disambiguation results in a representation that is effectively a parse (an almost parse).

The idea of using complex descriptions for primitives to capture constraints locally has some precursors in AI. For example, the Waltz algorithm (Waltz) for labeling vertices of polygonal solid objects can be thought of in these terms. The idea of the algorithm is to include the various possibilities of labeling vertices as primitives, thus localizing ambiguity. The labeling constraints associated with the primitives operate locally and are used to filter out mutually incompatible primitives, thus leaving only combinations of compatible ones as solutions. However, as far as we know, there is no rendering of Waltz's algorithm that exploits the locality of constraints using statistical techniques.
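The flavor of Waltz-style filtering can be conveyed with a small sketch: each position in a sequence starts with a set of candidate descriptions, and a description is repeatedly discarded if it has no compatible neighbor. The labels and the compatibility relation below are invented for illustration (the actual algorithm operates over junction labelings of line drawings):

    # Pairs of descriptions that may appear adjacently (illustrative only).
    OK = {("Det", "Adj"), ("Det", "Noun"), ("Adj", "Noun")}

    def waltz_filter(candidates):
        """candidates: one set of candidate descriptions per position.
        Discard any description with no compatible neighbor on either side,
        until nothing changes; only mutually compatible combinations remain."""
        changed = True
        while changed:
            changed = False
            for i, cands in enumerate(candidates):
                for c in list(cands):
                    left_ok = i == 0 or any((l, c) in OK
                                            for l in candidates[i - 1])
                    right_ok = (i == len(candidates) - 1
                                or any((c, r) in OK for r in candidates[i + 1]))
                    if not (left_ok and right_ok):
                        cands.discard(c)
                        changed = True
        return candidates

    # Three positions with lexical-style ambiguity at the last two:
    print(waltz_filter([{"Det"}, {"Adj", "Noun"}, {"Noun", "Verb"}]))
    # [{'Det'}, {'Adj'}, {'Noun'}]

The same local filtering idea, augmented with distributional statistics, is what supertag disambiguation adds in the linguistic setting.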

Localizing Ambiguity

In the linguistic context, there can be many ways of increasing the complexity of the descriptions of lexical items. The idea is to associate lexical items with descriptions that localize, within the same description, all and only those elements on which the lexical item imposes constraints. Further, each lexical item must be associated with as many descriptions as the number of different syntactic contexts in which the lexical item can appear. This, of course, increases the local ambiguity for the parser. The parser has to decide which complex description, out of the set of descriptions associated with each lexical item, is to be used for a given reading of a sentence, even before combining the descriptions together. The obvious solution is to put the burden of this job entirely on the parser: the parser will eventually disambiguate all the descriptions and pick one per lexical item for a given reading of the sentence. However, there is an alternate method of parsing which reduces the amount of disambiguation done by the parser. Just as in Waltz's algorithm, the idea is to locally check the constraints that are associated with the descriptions of lexical items, to filter out incompatible descriptions. During this disambiguation, the system can also exploit statistical information that can be associated with the descriptions, based on their distribution in a corpus of parses.

We employ these ideas in the context of LTAG, where each lexical item is associated with one supertag for each syntactic context that the lexical item appears in. The number of supertags associated with each lexical item is much larger than the number of standard parts-of-speech (POS) associated with that item. Even when the POS ambiguity is removed, the number of supertags associated with each item can be large. We show that disambiguating supertags prior to parsing speeds up a parser considerably. More interesting is the fact that, since supertags combine both phrase structure information and dependency information in a single representation, a disambiguated supertag sequence in conjunction with a lightweight dependency analyzer can be used to compute noun groups, verb groups, dependency linkages and even partial parses. We have shown that, training on one million supertag-annotated words from the Wall Street Journal corpus and testing on held-out data, a trigram-based supertagger assigns 92.1% of the words the same supertag as they would be assigned in the correct parse of the sentences. Furthermore, we have shown that lightweight dependency analysis on the output of the supertagger identifies 83% of the dependency links accurately.
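A minimal sketch of this kind of trigram disambiguation follows, assuming toy probability tables (in the dissertation these are estimated from roughly one million supertag-annotated WSJ words) and invented supertag names; the Viterbi search keeps one best score per pair of preceding supertags:

    import math

    # Toy contextual and emission probabilities; unseen events get a floor.
    # The supertag names (A_Dnx etc.) and all numbers are invented.
    TRANS = {("<s>", "<s>", "A_Dnx"): 0.6,
             ("<s>", "A_Dnx", "A_NXN"): 0.5,
             ("A_Dnx", "A_NXN", "B_nx0Vnx1"): 0.4,
             ("A_NXN", "B_nx0Vnx1", "A_NXN"): 0.5}
    EMIT = {("the", "A_Dnx"): 1.0, ("price", "A_NXN"): 0.4,
            ("price", "B_Nn"): 0.3, ("includes", "B_nx0Vnx1"): 0.8,
            ("companies", "A_NXN"): 0.5}
    CANDS = {"the": ["A_Dnx"], "price": ["A_NXN", "B_Nn"],
             "includes": ["B_nx0Vnx1"], "companies": ["A_NXN"]}

    def logp(table, key, floor=1e-3):
        return math.log(table.get(key, floor))

    def supertag(words):
        """Viterbi search for the supertag sequence maximizing the product
        of P(t | t-2, t-1) * P(word | t); state = last two supertags."""
        chart = {("<s>", "<s>"): (0.0, [])}
        for w in words:
            nxt = {}
            for (t2, t1), (score, seq) in chart.items():
                for t in CANDS[w]:
                    s = score + logp(TRANS, (t2, t1, t)) + logp(EMIT, (w, t))
                    if (t1, t) not in nxt or s > nxt[(t1, t)][0]:
                        nxt[(t1, t)] = (s, seq + [t])
            chart = nxt
        return max(chart.values())[1]

    print(supertag("the price includes companies".split()))
    # ['A_Dnx', 'A_NXN', 'B_nx0Vnx1', 'A_NXN']

The word "price" starts with two candidate supertags, and the contextual statistics, not the parser, select the one that fits its syntactic context.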

Using Explanation-based Learning Technique

Although hand-crafted, wide-coverage grammars are portable, they can be made more efficient if it is recognized that language in limited domains is usually well constrained and certain linguistic constructions are more frequent than others. Hence, neither all the structures of the domain-independent grammar nor all the combinations of those structures would be used in limited domains. In this dissertation, we view a domain-independent grammar as a repository of portable grammatical structures whose combinations are to be specialized for a given domain. We use the Explanation-based Learning (EBL) mechanism to identify the relevant subset of a hand-crafted, general-purpose grammar needed to parse in a given domain. The use of EBL for natural language parsing involves two phases: a training phase and an application phase. In the training phase, generalized parses of sentences are stored under a suitable index that is computed from the training sentences. In the application phase, an index computed from the test sentence is used to retrieve a generalized parse, which is then instantiated to that sentence. This approach improves the efficiency of a parsing system by exploiting the domain-specific sentence structures. The success of this approach depends heavily on the efficacy of generalizing parses. Rayner, Samuelsson and Neumann have used the EBL methodology to specialize CFG-based grammars for improving the efficiency of their systems.

We show that the idea of increased complexity of lexical descriptions, as instantiated in LTAGs, allows us to represent certain generalizations that are not possible in CFG-based approaches. Our method exploits some of the key aspects of LTAGs to (a) achieve an immediate generalization of parses for the training set of sentences, (b) achieve generalization over recursive substructures of the parses in the training set, and (c) allow for a finite-state transducer (FST) representation of the set of generalized parses.
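A minimal sketch of the two EBL phases follows, assuming the index is simply the supertag sequence of the sentence and a parse is represented as a list of dependency links between word positions. The supertag names and the representation are invented for illustration; the dissertation's actual feature and recursion generalizations, and the FST encoding, are not modeled here.

    def generalize(words, supertags, deps):
        """Training phase: abstract away the words, keeping the supertag
        sequence as index and word positions in the dependency links."""
        return tuple(supertags), deps    # deps: list of (head_pos, dep_pos)

    def train(treebank):
        table = {}
        for words, supertags, deps in treebank:
            index, parse = generalize(words, supertags, deps)
            table[index] = parse
        return table

    def apply_ebl(table, words, supertags):
        """Application phase: retrieve the generalized parse by index and
        instantiate it with the words of the new sentence."""
        deps = table.get(tuple(supertags))
        if deps is None:
            return None                  # fall back to the full parser
        return [(words[h], words[d]) for h, d in deps]

    treebank = [("show me the flights".split(),
                 ["V_ditrans", "Pron", "Det", "Noun"],
                 [(0, 1), (0, 3), (3, 2)])]
    table = train(treebank)
    print(apply_ebl(table, "list me the fares".split(),
                    ["V_ditrans", "Pron", "Det", "Noun"]))
    # [('list', 'me'), ('list', 'fares'), ('fares', 'the')]

The point of the sketch is that a sentence never seen in training is parsed by table lookup, because it shares its generalized structure with a training sentence.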

Chapter Summaries

The outline of the dissertation is as follows:

Chapter 2: In this chapter, we review previous work in robust natural language parsing. The chapter contains reviews of three classes of parsing methodologies that incorporate robustness. First, finite-state-based partial parsing systems employ simple mechanisms to group sequences of words and indicate the grammatical relations among word groups. Second, statistical parsing systems use annotated corpora to induce probability distributions pertaining to parsing decisions, and employ smoothing techniques to achieve robustness. Third, there are parsing systems that extend conventional parsing techniques to achieve robustness.

Chapter 3: In this chapter, we introduce the LTAG formalism and review the three central properties of LTAGs: Lexicalization, Extended Domain of Locality and Factoring of Recursion from the domain of dependencies. We discuss the relevance of these properties for partial parsing and present an overview of a grammar development environment and a wide-coverage grammar for English that is being developed in LTAG.

Chapter 4: In this chapter, we present the idea of supertag disambiguation and discuss its relationship to partial parsing. We present several models for supertag disambiguation and their performance evaluation results on various corpora. We also present a novel lightweight dependency analyzer that exploits the information encoded in the supertags to compute a dependency analysis of a sentence.

Chapter 5: Although hand-crafted, wide-coverage grammars are portable, they can be made more efficient when applied to limited domains if it is recognized that language in limited domains is usually well constrained and certain linguistic constructions are more frequent than others. In this chapter, we present the Explanation-based Learning technique for specializing a wide-coverage LTAG grammar to a limited domain. We show that, by exploiting the central properties of LTAGs, parsing in limited domains can be seen as Finite State Transduction from strings to their dependency structures.

Chapter 6: In this chapter, we present an impoverished parser, called the Stapler, that takes as input the almost parse structure resulting from an FST induced by the EBL mechanism, computes all possible parses, and instantiates them to the particular sentence. Using this technique, the specialized grammar obtains a speed-up of a factor of 15 over the unspecialized grammar on the Air Travel Information Service (ATIS) domain.

Chapter 7: This chapter addresses the issue of evaluating partial parsing systems. It consists of two parts. In the first part, we review the currently prevalent evaluation metrics for parsers and indicate their limitations for comparing parsers that produce different representations. We propose an alternate evaluation metric, based on dependency relations, that overcomes the limitations of the existing evaluation metrics. In the second part of this chapter, we provide results from an extensive evaluation of the supertagger and the LDA system on a range of tasks, and compare their performance against the performance of other systems on those tasks.

Chapter 8: In this chapter, we discuss several applications of the supertagger and the LDA system. Supertagging has been used in a variety of applications, including information retrieval, information extraction, text simplification and language modeling. Our goal in using a supertagger is to exploit the strengths of supertags as the appropriate level of lexical description needed for most applications. In an information retrieval application, we compare the retrieval efficiency of a system based on part-of-speech tags against a system based on supertags, and show that the supertag-based system performs at higher levels of precision. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application. The ability to bring to bear document-level and domain-specific constraints during supertagging of a sentence is explored in another application.

Chapter 2

Literature Survey

In recent times, robust Natural Language (NL) parsing has received renewed attention. This interest in robust parsing has been driven by the intuition that exploiting the sophisticated linguistic information present in NL texts should improve the performance of NL-based applications. However, although there has been significant progress in both grammar-based and stochastic robust parsing, general-purpose parsers are far from being readily deployable in practical applications.

Robustness in parsing refers to that property of the parser which, irrespective of the grammaticality of the input string, allows the parser to produce a parse that may or may not span the entire input string. The stumbling blocks that keep grammar-based parsing techniques from achieving robustness are incomplete lexicons, inadequate grammatical coverage, extremely long sentences that take an unreasonable amount of time to parse, and ungrammatical texts. The stochastic parsers, although more robust than purely grammar-based approaches, have their share of limitations. Being heavily data-driven, stochastic approaches need vast amounts of data, which need to be annotated: a fairly expensive proposition. Further, the richness of the output produced by these parsers is limited by the detail of the annotation.

In this chapter, we will review some of the previous approaches that have resorted to partial parsing in an attempt to achieve robustness in parsing. By partial parsing, we mean that these systems, either always or on failure to provide a single tree spanning the input, produce a forest of trees that span most of the input. The reviews in this chapter discuss the methodologies used by the various partial parsers without providing much description of their performance. However, a cross-comparison of the performance of these parsers, which would be invaluable, is extremely difficult to achieve, for obvious reasons. We also review a few data-driven, stochastic parsers that use smoothing techniques to achieve robustness.

The particular parsers being reviewed, and the main reasons for including them in this particular set, are listed below.

Type I: Extended Finite State Transducer-based Partial Parsers

This approach to parsing uses grammars that are represented as cascaded finite-state regular expression recognizers. The regular expressions are usually hand-crafted. Each recognizer in the cascade provides a locally optimal output. The output of these systems is mostly in the form of noun groups and verb groups rather than constituent structure, often called a shallow parse or a text chunk. There are no clause-level attachments or modifier attachments in the shallow parse. These parsers always produce one output, since they use the longest-match heuristic to resolve cases of ambiguity when more than one regular expression matches the input string at a given position. At present, none of these systems use any statistical information to resolve ambiguity. The grammar itself can be partitioned into domain-independent and domain-specific regular expressions, which implies that porting to a new domain would involve rewriting the domain-dependent expressions. This approach has proved to be quite successful as a preprocessor in information extraction systems.

Fidditch (Hindle): An industrial-strength version of Marcus' parser (Marcus) that is used extensively in the Treebank project at the University of Pennsylvania.

CASS (Abney): A multistage parser that employs a simple and computationally inexpensive set of tools at each stage to arrive at a parse.

UNIVAC Parser, now called Uniparse (Joshi; Joshi and Hopely): A parser of historical significance that employed many of the so-called current state-of-the-art techniques in parsing. (This parser was designed and implemented on UNIVAC during the period 1958-1959. The project was directed by Zellig Harris; its members were Carol Chomsky, Lila Gleitman, Aravind Joshi, Bruria Kauffman and Naomi Sager. See also Harris. The parser has been reconstructed from the original complete documentation, Papers of the Transformation and Discourse Analysis Project (TDAP), by Phil Hopely and Aravind Joshi.)

Helsinki Parser (Karlsson et al.): A parser (more of a tagger) that uses finite-state-expressible constraints to reduce the ambiguity of morphosyntactic tags.

Type II: Stochastic Parsers

This approach decouples the issue of well-formedness of an input string from the problem of assigning a structure to it. These systems attempt to assign some structure to every input string. The rules for assigning a structure to an input are extracted automatically from hand-annotated parses of large corpora, which are then subjected to smoothing to obtain reasonable coverage of the language. The resultant set of rules is not linguistically transparent and is not easily modifiable. Lexical and structural ambiguity is resolved using probability information that is encoded in the rules. This allows the system to assign the most likely structure to each input. The output of these systems consists of constituent analysis, the degree of detail of which is dependent on the detail of annotation present in the treebank that is used to train the system. However, since the system attempts a globally optimal parse, it may produce locally dispreferred analyses. Also, since there is no distinction between the domain-dependent and domain-independent rule sets, the system has to be retrained for each new domain, which in turn requires an annotated corpus for that domain. Some of the parsers that belong to this class are:

Decision Tree Parsers: This technique was first introduced in Jelinek et al. and was further developed in Magerman. In this parsing paradigm, parsing decisions are guided by the knowledge implicitly encoded in the probability distributions of a set of features. The probability distributions are estimated from a training corpus of parses using statistical decision tree models.

Bilder (Collins): Bilder is a bigram lexical dependency-based statistical parser that employs lexical information directly to model dependency relations


between pairs of words. The statistical model is far simpler than the statistical decision tree parsing model, but it has been shown to outperform the decision tree-based parser.

Data Oriented Parsing: Data Oriented Parsing (DOP), suggested by Scha and developed in Bod, is a probabilistic parsing method that views parsing as a process of combining tree fragments. Each tree fragment is associated with a probability estimate based on its frequency in the training corpus. The frequency estimates associated with the tree fragments are used to rank-order derivation steps and to retrieve the most probable parse tree.
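A toy sketch of the DOP probability model just described: a fragment's probability is its frequency relative to all fragments with the same root label, and a derivation's probability is the product of its fragments' probabilities. The fragments and counts below are invented for illustration:

    from collections import Counter

    # Invented corpus counts of tree fragments, written as bracketed strings.
    counts = Counter({"(S NP VP)": 10, "(S NP (VP V NP))": 5,
                      "(NP Det N)": 12, "(NP John)": 3, "(VP V NP)": 8})

    def root(fragment):
        """Root label of a bracketed fragment, e.g. '(S NP VP)' -> 'S'."""
        return fragment[1:].split()[0].strip("()")

    totals = Counter()
    for frag, c in counts.items():
        totals[root(frag)] += c             # total mass per root label

    def p(fragment):
        return counts[fragment] / totals[root(fragment)]

    def derivation_prob(fragments):
        """Probability of one derivation = product of fragment probs."""
        prob = 1.0
        for f in fragments:
            prob *= p(f)
        return prob

    print(derivation_prob(["(S NP VP)", "(NP John)",
                           "(VP V NP)", "(NP Det N)"]))
    # (10/15) * (3/15) * (8/8) * (12/15), roughly 0.107

Because the same tree can be built from different fragment combinations, a full DOP parser sums or maximizes over all derivations of each tree, which this sketch leaves out.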

Type III: Extended Chart Parsers

This class of robust parsers is the result of augmenting the chart parsing technique. Most of these parsers use a hand-crafted grammar, usually specific to a domain, written in a context-free grammar formalism. Robustness in these parsers is usually achieved by using techniques for retrieving partial constituents from the chart, even if the parser fails to retrieve a complete parse. However, as discussed below, a notable exception to this approach is Black et al.

De Marcken's Parser (de Marcken): An agenda-based parser with heuristics to order the agenda. This parser is being used as a front end to an information extraction system, PLUM (Weischedel et al.).

Seneff's Extended Chart Parser (Seneff): A chart parser that has been extended to permit the recovery of parsable phrases and clauses within a sentence. This parser is presently being used in conjunction with the MIT Speech Recognition System in the ATIS domain.

The PLNLP System (Jensen et al.): An integrated, incremental system for broad-coverage syntactic and semantic analysis and synthesis of natural language.

Sparser (McDonald), TACITUS (Hobbs et al.) and FASTUS (Appelt et al.): Multistage, domain-specific partial parsers that have been employed with good success in information extraction tasks aimed at template filling.

IBM's PCFG and HBG model (Black et al.): A hand-crafted context-free grammar that achieved robustness using probabilities obtained from a parsed corpus.

The partial parsers that are reviewed here have been intrinsically designed to provide partial output, either always or on failure to provide a complete parse. There are other parsers, not included here, that may provide some partial parses, but they do so as a byproduct rather than being intrinsically designed that way.

Fidditch

The Fidditch parser (Hindle) is a large-scale version of Mitch Marcus' parser PARSIFAL (Marcus). PARSIFAL is based on the determinism hypothesis, which claims that natural language can be parsed by a computationally simple mechanism that does not involve nondeterminism, and in which all grammatical structure created by the parser is indelible, in that it must all be output as part of the structural analysis of the parser's input. Once built, no grammatical structure may be altered or discarded in the course of the parsing process.

The design goal of the Fidditch parser was to process transcripts of spontaneous speech and produce labeled bracketed trees, partial if necessary, that represent the syntactic structure of the input. Since the parser is based on the determinism hypothesis, it produces a single analysis for each input.

The input to the parser is a sequence of words, each with one or more lexical categories selected from a word lexicon and a morphological analyzer. The lexicon entry for each word contains the lexical category, the subcategorization frame and, in some cases, information on compound words. Once the lexical analysis is complete, the phrase structure is constructed using pattern-action rules.

The parser maintains two data structures: the stack and the constituent buffer. The stack is used to keep track of the incomplete nodes, and the constituent buffer is used to keep track of the completed constituents. The size of the constituent buffer is fixed at three, which allows the parser to look through a window of three constituents before selecting a grammar rule to apply. The grammar for the parser is a set of pattern-action rules. Each pattern in a rule is matched against some subset of the constituent buffer and the top node of the stack. The actions of the parser, listed below, are illustrated in a schematic sketch after the list.

Create: build a new node by pushing a category onto the stack.

Attach: attach the first element of the constituent buffer to the element on the top of the stack.

Drop: drop the completed constituent from the stack into the first position of the buffer.

Switch: swap the contents of the first and the second elements of the buffer. This move is to accommodate subject-aux inversion.

Insert: add a new element into the buffer at some position and shift the contents one cell to the right. This move is to accommodate an empty category.

Attention-shift: shift the attention from the first constituent in the buffer to some later constituent. This move is to predict a leading edge of noun phrases.

Punt: if attachment is not possible, then move on to the next constituent.
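The following schematic sketch shows the deterministic, pattern-action control loop with the stack and buffer. Only the Create, Attach and Drop actions are modeled, only the front buffer cell (rather than the three-cell window) is inspected, and the rules, lexicon and node representation are invented for illustration; Fidditch's actual rule set is far richer.

    LEX = {"the": "det", "old": "adj", "man": "noun", "sleeps": "verb"}

    def cat(item):
        """Category of a buffer cell: a node label or a word's class."""
        return item[0] if isinstance(item, list) else LEX[item]

    def parse(words):
        buf = list(words)    # constituent buffer (rules inspect the front)
        stack = []           # incomplete nodes, as [label, children...]
        while buf or stack:
            first = cat(buf[0]) if buf else None
            top = stack[-1][0] if stack else None
            if first == "det" and top != "NP":
                stack.append(["NP"])                 # Create an NP node
            elif top == "NP" and first in ("det", "adj", "noun"):
                stack[-1].append(buf.pop(0))         # Attach into the NP
            elif top == "NP":
                buf.insert(0, stack.pop())           # Drop the finished NP
            elif first == "NP" and top is None:
                stack.append(["S"])                  # Create an S node
            elif top == "S" and first in ("NP", "verb"):
                stack[-1].append(buf.pop(0))         # Attach into the S
            elif top == "S" and first is None:
                buf.insert(0, stack.pop())           # Drop the finished S
                break
            else:
                break                                # no rule applies: punt
        return buf

    print(parse("the old man sleeps".split()))
    # [['S', ['NP', 'the', 'old', 'man'], 'sleeps']]

Note that every decision is made once and never revised, which is exactly the indelibility that the determinism hypothesis requires.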

The part-of-speech disambiguation in Fidditch is done during the parse, using certain context-sensitive rules. The parser does not attach most modifiers, adjuncts and relative clauses. Fidditch is the fastest partial parser, according to Abney.

CASS

CASS (Cascaded Analysis of Syntactic Structure) (Abney) is a multistage parser that uses simple and inexpensive analyzers at each stage of the parsing process. The input sentence is tagged for POS using a trigram tagger, and a stochastic NP recognizer (Church) is used to identify non-recursive noun phrases in the input. The output after applying these stages to the following sentence is shown below.

In South Australia beds of boulders were deposited by melting icebergs in a gulf that marked the position of the Adelaide geosyncline, an elongated sediment-filled depression in the crust.

[The tagged output assigns a part-of-speech tag to each word, e.g. p (preposition), pn (proper noun), npl (plural noun), det (determiner), adj (adjective), vbd/vbn/vbg (verb forms) and wps (wh-pronoun), with sentence boundaries marked by EOS.]

Chunker: The chunker consists of an NP corrector and a chunk recognizer. The NP corrector fixes the errors made by the stochastic NP recognizer, such as the fragmentation resulting from conjunction of prenominal adjectives, or modification of adjectives/cardinals by adverbs and qualifiers. This stage uses a regular expression recognizer to assemble noun phrases.

The chunk recognizer also uses a small regular expression grammar to identify chunks. These chunks usually include the NPs recognized in the previous stages. The output of this stage on the input sentence is shown below.

[The chunker's output groups the words of the sentence into PP, NP, VP and WHNP chunks, e.g. "In South Australia" as a PP containing an NP, "beds" as an NP, "of boulders" as a PP, "were deposited" as a VP, and "that" as a WHNP.]

Clause Recognizer: The clause recognizer consists of two modules: the simplex clause recognizer and the clause repair module.

Simplex Clauses: The simplex clause recognizer attempts to identify the beginning and end of each simplex clause, and marks the subject and predicate of the clause. This is done using regular expressions that leverage off of clause markers such as complementizers, conjunctions and periods. If no subject and predicate can be identified, the type of error encountered is classified as either NOSUBJ (no subject) or EXTRAVP (too many VPs).

Clause Repair: The parser attempts to repair NOSUBJ and EXTRAVP errors using predefined templates for each type of error. NOSUBJ errors are caused by unrecognized partitives, run-on NPs and complementizers mistaken for prepositions. EXTRAVP errors are caused by empty relative pronouns, empty complementizers and misanalyzed complementizers. If the error type does not match any of the repair templates, then any NP followed by a VP is regarded as a clause, and a free-standing VP as a predicate.

Using the information in this module, the example above is correctly chunked as follows.

[EOS] [PP In South Australia] [SUBJ [NP beds] [PP of] [NP boulders]] [VP were deposited] [PP by] [NP melting icebergs] [PP in] [NP a gulf] [WHNP that] [VP marked] [NP the position] [PP of] [NP the Adelaide geosyncline] [NP an elongated sediment-filled depression] [PP in] [NP the crust] [EOS]

Attachment: This stage, unlike the previous stages, is not a regular expression recognizer. It assembles recursive structures by attaching nodes one to another, based on information about the arguments and modifiers licensed by the head words. At each point, the parser considers attaching the current chunk to either the preceding chunk or the most recent chunk with a verb. There are also templates to handle constructions that involve conjunctions and sentence-initial adjuncts. The possible actions are ordered by heuristics, and the most appropriate action is applied. If no action is applicable, then the attachment site of the chunk is left indeterminate.

The parser has been evaluated for speed and accuracy of chunking. When used with the part-of-speech tagger, the CASS parser can parse a million-word corpus in under six hours, at the chunk segmentation accuracy reported by Abney. The design architecture of this parser is to make local decisions within each level and to make repairs, if any, downstream. Each stage outputs a single best answer. This way, the problem of exponential global ambiguity is overcome by making a sequence of independent small sets of choices. However, if the stages were to produce N-best output, the repair modules might not be required.

UNIVAC Parser (now called Uniparse)

One of the foremost attempts at parsing language on a large scale was made in the Discourse Analysis Project (Joshi; Joshi and Hopely; Harris) in the Department of Linguistics at the University of Pennsylvania. This parser was designed and implemented on UNIVAC in the late 1950s. The project was directed by Zellig Harris and included Carol Chomsky, Lila Gleitman, Aravind Joshi, Bruria Kauffman, and Naomi Sager as members. The parser, now called Uniparse (Joshi and Hopely), has been reconstructed from the original complete documentation (Papers of the Transformation and Discourse Analysis Project (TDAP)) by Phil Hopely and Aravind Joshi. This parser is not only historically important; it also incorporated what are currently regarded as state-of-the-art techniques for parsing large-scale texts.

The parser was designed as a multi-stage system. It is a cascade of Finite State Transducers (FSTs), except for the last stage, which technically is not an FST but more like a Push-Down Transducer (PDT). Each word in the input string is tagged with the class or classes to which the word belongs. If a word is tagged with multiple classes, then a series of tests are applied to identify environments in which a class definitely cannot hold. However, these negative tests may still leave the class ambiguity unresolved.

The input string is then segmented into first-order strings, based on the class marks associated with the words of the input. First-order strings are typically noun phrases, sequences of verbs, and adjunct phrases, and are only minimal structures that do not include any nesting of other first-order strings. Once the first-order strings are identified, their internal structure is not analyzed. The first-order strings behave like chunks, in which there is a principal word and all the other words bear a relation to this word. The domain of their relationship is local and does not extend beyond the substring concerned. The first-order strings can be recognized by a finite-state computation, since there is no nesting of first-order strings.

The input is scanned either left to right or right to left, depending on the type of first-order string being recognized. Elementary noun sequences, adjunct phrases, and verb sequences are recognized, in that order. The chunks recognized in one scan are treated as frozen for the subsequent scans; a sketch of this cascade of scans follows the examples. Examples with the first-order strings marked off are shown below; braces enclose verb sequences, and the other first-order strings are similarly delimited.

Those papers {may have been published} in a hurry.

I {may soon go}.
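The sketch below illustrates the cascade idea: each pass recognizes one kind of first-order string and freezes it for later passes. The word classes are invented for the example, both passes run left to right for simplicity, and nothing here reproduces the actual TDAP rules.

    # Illustrative cascade: one scan per first-order string type; chunks
    # recognized in an earlier scan are frozen (tuples) in later scans.
    def scan(tokens, is_start, is_member, label):
        out, i = [], 0
        while i < len(tokens):
            if isinstance(tokens[i], str) and is_start(tokens[i]):
                j = i + 1
                while j < len(tokens) and isinstance(tokens[j], str) and is_member(tokens[j]):
                    j += 1
                out.append((label, tokens[i:j]))   # frozen for later scans
                i = j
            else:
                out.append(tokens[i])
                i += 1
        return out

    sent = "those papers may have been published".split()
    nouns = {"papers"}
    verbs = {"may", "have", "been", "published"}
    pass1 = scan(sent, lambda w: w == "those", lambda w: w in nouns, "N")
    pass2 = scan(pass1, lambda w: w in verbs, lambda w: w in verbs, "V")
    print(pass2)
    # [('N', ['those', 'papers']), ('V', ['may', 'have', 'been', 'published'])]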

Second-order substrings are clausal in nature. Second-order substrings are fixed sequences of first-order strings and can include other second-order substrings. Recognition of a second-order string begins with a left-to-right scan of the input, in which the first-order substrings have been replaced by single characters. The input contains the second-order heads that are not part of any of the first-order strings, for example complementizers and conjunctions, or sentential subjects. Recognition of these second-order strings is done using a push-down-transducer-like automaton. The well-formedness of the sentence is determined by whether the subcategorization requirements of the verb are satisfied. The following examples illustrate the clausal annotation: the bracketing markers enclose a clause; that-clauses, sentential subjects, and objects are included between N and n; and W marks the end of a complement.

Those who read newspapers {waste} their time.

Under conditions of dual induction, decreasing G the amino acid concentration n N from to per cent {suppressed} W beta-galactosidase synthesis to a greater extent than nitrate reductase.

At places where there are multiple possibilities in each stage, one choice is pursued, but the alternatives are kept track of. The search is similar to a chart-based, depth-first, preference-driven search with the possibility of backtracking.

FASTUS

The FASTUS system (Appelt et al.) was developed in the context of the Fourth Message Understanding Conference (MUC-4) as a successor to the TACITUS system. The main motivation for developing FASTUS was that the TACITUS system was extremely slow, taking hours to parse the text of the MUC messages; the FASTUS system processed the same set of messages in minutes. The crucial difference between FASTUS and TACITUS is that FASTUS used a finite-state mechanism to extract partial parses, instead of a full parser.

Processing in FASTUS is driven by pattern matching. Patterns are used for name recognition, for determining the relevance of the sentence to the task at hand, and for syntactic and semantic processing. Each pattern is associated with at least one trigger word. The lexicon used for syntactic recognition contains lexical items together with their morphologically inflected forms. The syntactic component consists of nondeterministic finite-state automata that recognize noun groups and verb groups. The noun group recognizer recognizes noun phrases, each with a head noun and its left modifiers; the verb group recognizer includes the verb, the auxiliary, and any intervening adverbs. The noun and verb groups are used to recognize patterns and output an incident structure. A series of pattern-incident structure pairs have been written for the MUC task. There are patterns that skip over relative clauses and prepositional phrases on nouns. The incident structures are merged with other incident structures found in the same sentence; merging is blocked if the types of the incident structures are incompatible. A sketch of this trigger-and-merge processing is shown below.
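The sketch below illustrates the trigger-and-merge style of processing. The patterns, slot names, and compatibility test are illustrative assumptions, not the actual MUC-4 rules.

    # Patterns are indexed by a trigger word; each maps a noun/verb-group
    # record to an incident structure.
    PATTERNS = {
        "bombed": lambda g: {"type": "BOMBING", "target": g.get("object")},
        "exploded": lambda g: {"type": "BOMBING", "location": g.get("subject")},
    }

    def match(sentence_groups):
        """sentence_groups: dicts like {'subject': ..., 'verb': ..., 'object': ...}."""
        incidents = []
        for g in sentence_groups:
            action = PATTERNS.get(g.get("verb"))
            if action:                      # pattern fires only on its trigger word
                incidents.append(action(g))
        return incidents

    def merge(a, b):
        if a["type"] != b["type"]:
            return None                     # merging blocked: incompatible types
        merged = dict(a)
        merged.update({k: v for k, v in b.items() if v is not None})
        return merged

    groups = [{"subject": "guerrillas", "verb": "bombed", "object": "the embassy"},
              {"subject": "a car bomb", "verb": "exploded"}]
    a, b = match(groups)
    print(merge(a, b))
    # {'type': 'BOMBING', 'target': 'the embassy', 'location': 'a car bomb'}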

The success of this approach lies in identifying rules for chunking and associating interpretations with chunks. It would be interesting if these rules could be derived automatically for some domain, given some initial training material in that domain.

ENGCG Parser

The ENGCG grammar representation (Karlsson et al.; Anttila) imposes a shallow analysis based on surface order, resulting in a syntactic functional description for each word of the input string. The grammatical structure is encoded in terms of tags associated with each word, rather than in terms of phrase structure bracketings. The tags incorporate morphological information and partial dependency relations that encode information about the syntactic function of the word the tag is associated with. These tags leave a number of decisions about the dependency relations ambiguous.

The parsing grammar in this framework is a set of unordered rules that specify negative constraints, based on the linear order of words and tags, to eliminate alternative grammatical tags for a word. The rules are handcrafted, based on observations on large samples of authentic texts.

The parsing process involves a lookup stage, where each word is assigned all possible grammatical tags. The morphological descriptions are based on Koskenniemi's Two-level Model (Voutilainen). There is a rule-based heuristic component that assigns grammatical tags to unknown words. Next, a large set of morphological disambiguation constraints is used to filter the result of the morphological lookup. The constraints use contexts, some of which refer to unbounded contexts. It is claimed that the vast majority of words are fully unambiguous after this stage, with a very small maximum error rate (Voutilainen).

In the syntactic filtering stage, the impossible and very unlikely (heuristically determined) grammatical tags for a word are discarded, based on constraints specified in the grammar. The grammar consists of syntactic disambiguation constraints and a set of additional heuristic constraints. The result of this reductionist process is the grammatical analysis of the input; a sketch of this reductionist style of tag elimination is given below. The syntactic disambiguation component is claimed to achieve high recall, with most words left unambiguous (Voutilainen).
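The following sketch illustrates the reductionist style: each word starts with all its candidate tags, and negative constraints remove tags that are impossible in the local context. The two constraints are invented for the illustration and are not actual ENGCG rules.

    # Reductionist disambiguation: discard tags, never build structure.
    def remove_if(tags, tag, condition):
        if condition and tag in tags and len(tags) > 1:   # never leave a word tagless
            tags.discard(tag)

    def disambiguate(sentence):
        """sentence: list of (word, set-of-candidate-tags); mutated in place."""
        for i, (word, tags) in enumerate(sentence):
            prev = sentence[i - 1][1] if i > 0 else set()
            # invented rule: a verb reading is impossible right after a determiner
            remove_if(tags, "V", prev == {"DET"})
            nxt = sentence[i + 1][1] if i + 1 < len(sentence) else set()
            # invented rule: a determiner reading is impossible before a determiner
            remove_if(tags, "DET", nxt == {"DET"})
        return sentence

    s = [("the", {"DET"}), ("round", {"N", "V", "ADJ"}), ("table", {"N", "V"})]
    print([(w, sorted(tags)) for w, tags in disambiguate(s)])
    # [('the', ['DET']), ('round', ['ADJ', 'N']), ('table', ['N', 'V'])]

Note that 'table' remains ambiguous: the output representation is shallow and residual ambiguity is simply passed on.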

Unlike other parsing methods, no structure is built during parsing. Parsing in this framework reduces to the elimination of tags. The output representation is extremely shallow and quite ambiguous. Hence, evaluation of such a system should involve both accuracy and precision. It is claimed that this approach outperforms other contemporary rule-based and stochastic part-of-speech taggers.

Decision Tree Parsers

Decision Tree Parsing was first introduced in Jelinek et al. and was further developed in Magerman. Parsing in this framework, as in other frameworks, is viewed as a sequence of decisions, such as choosing the part of speech of a word, choosing the constituent structure, and choosing the label of the constituent. However, the distinguishing factor of this framework is that the decision-making process is not guided by an explicit handcrafted knowledge base, such as a grammar or a lexicon, but instead by the knowledge implicitly encoded in the probability distributions of a set of features. The probability distributions are estimated from a training corpus of parses, using statistical decision tree models. A decision tree is a data structure wherein each internal node corresponds to a decision question and each leaf node corresponds to a choice. A statistical decision tree is a decision tree which assigns a probability to each of the possible choices, conditioned on the context of the decision, as in the sketch below.
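A minimal sketch of a statistical decision tree follows: internal nodes ask questions about the context, and leaves hold a probability distribution over the possible choices. The question and the numbers are invented for the illustration.

    class Node:
        def __init__(self, question=None, yes=None, no=None, dist=None):
            self.question, self.yes, self.no, self.dist = question, yes, no, dist

        def predict(self, context):
            if self.dist is not None:        # leaf: P(choice | context)
                return self.dist
            branch = self.yes if self.question(context) else self.no
            return branch.predict(context)

    tree = Node(
        question=lambda c: c["prev_tag"] == "DET",
        yes=Node(dist={"N": 0.8, "ADJ": 0.15, "V": 0.05}),
        no=Node(dist={"V": 0.5, "N": 0.3, "ADJ": 0.2}),
    )
    print(tree.predict({"prev_tag": "DET"}))   # {'N': 0.8, 'ADJ': 0.15, 'V': 0.05}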

The preterminal nodes in a parse tree are labeled by part-of-speech labels, predicted by a part-of-speech tagging model based on the two words to the left, the two words to the right, and the two nodes to the left and two nodes to the right of the current word. The internal nodes in the parse tree are labeled by a constituent labeling model, based on specific words being present in the constituent, the two nodes to the left and two nodes to the right, and the two leftmost and two rightmost children of the current constituent.

A parse tree is viewed as a pattern wherein a node in a tree represents a meeting point of a set of edges extending from the children of the node. An edge from any given node can extend to the right, up, or to the left. The direction of extension of a node is predicted by the extension model, based on the two nodes to the left, the two nodes to the right, and the two leftmost and two rightmost children of the current node. The node to be extended is predicted by a derivation model, based on the current node information and the node information from a five-node window.

This model of parsing has been used to parse IBM computer manuals (Jelinek et al.), and its performance was compared to a handcrafted PCFG-based parser whose probabilities were fine-tuned to the computer manual domain. On the crossing-brackets test over the test sentences, the decision tree parsing model outperformed the handcrafted grammar-based parser. In Magerman, the same model was trained and tested on the Penn Treebank Wall Street Journal corpus. The parser could retrieve a large proportion of the constituents found in the treebank, and the parses for a substantial fraction of the sentences had no crossing brackets when compared against the treebank.

Bilder

Bilder (Collins) is a bigram lexical dependency based statistical parser that employs lexical information directly to model dependency relations between pairs of words. The statistical model is far simpler than the statistical decision tree parsing model, yet it has been shown to outperform the decision tree based parser. Bilder overcomes a shortcoming of the decision tree based parser by depending heavily on lexical items for each decision in the parse tree construction.

The parser takes as input a part-of-speech tagged sentence and produces the most likely parse for the sentence. A part-of-speech tagger based on the maximum entropy principle (Ratnaparkhi) is used as a preprocessor to tag a sentence. The input sentence is reduced by removing punctuation and replacing non-recursive noun phrases with their headwords alone. The lexical dependencies are computed using the Penn Treebank parses, by mapping the constituent structure to a dependency structure. This is done by manually identifying which of the children of each constituent serves as the head-child of that constituent. The head word of a constituent is the head word of its head child. A dependency relation between a pair of words is represented as a triple: the constituent label for each of the words and the label of the parent constituent. The objective of parsing is to find the most probable set of dependencies that collectively include all the words of the sentence. The problem of sparseness of data is handled by a backoff model that successively backs off from the pair of words to the parts of speech of the pair of words, as in the sketch below.
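The sketch below illustrates backing off from word-pair counts to tag-pair counts. The counts and the all-or-nothing backoff are illustrative simplifications, not Collins's exact estimation scheme.

    def dep_prob(w1, t1, w2, t2, rel, word_counts, tag_counts):
        """P(rel | pair), from word-pair counts when the pair has been seen,
        otherwise from the denser tag-pair counts."""
        seen = sum(c for (a, b, r), c in word_counts.items() if (a, b) == (w1, w2))
        if seen > 0:
            return word_counts.get((w1, w2, rel), 0) / seen
        total = sum(c for (a, b, r), c in tag_counts.items() if (a, b) == (t1, t2))
        return tag_counts.get((t1, t2, rel), 0) / total if total else 0.0

    word_counts = {("sold", "shares", "V-obj"): 12, ("sold", "shares", "V-subj"): 1}
    tag_counts = {("VBD", "NNS", "V-obj"): 900, ("VBD", "NNS", "V-subj"): 300}
    print(dep_prob("sold", "VBD", "shares", "NNS", "V-obj", word_counts, tag_counts))
    # 12/13, from the word-pair counts
    print(dep_prob("bought", "VBD", "bonds", "NNS", "V-obj", word_counts, tag_counts))
    # 0.75, backed off to the tag-pair counts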

This parsing model has been trained on the Penn Treebank parses of sentences from Wall Street Journal text. It performs at a high labeled recall accuracy, with a substantial fraction of the test sentences parsed with zero crossing brackets. The parser has a modest memory footprint and, discounting the time for POS tagging, parses at a speed of hundreds of sentences per minute.

Data Oriented Parser

Data Oriented Parsing (DOP), suggested by Scha and developed in Bod, is a probabilistic parsing method that views parsing as a process of combining tree fragments. The set of tree fragments is the set of all possible subtrees of the parses in the training corpus. Each tree fragment is associated with a probability estimate based on its frequency in the training corpus. The tree fragments are combined using the substitution operation. Substituting one tree fragment into the leftmost node of another tree fragment constitutes one derivation step in the parsing process. The frequency estimates associated with the tree fragments are used to rank-order derivation steps. Since a parse tree can be generated by many derivations involving different tree fragments, the probability of a parse tree is the sum of the probabilities of all possible derivations for that parse tree, as sketched below. The objective of the parser is to produce the most probable parse tree that spans the given sequence of words.
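The following sketch illustrates the two probability computations: a derivation's probability is the product of its fragments' relative frequencies, and a parse's probability is the sum over the derivations that yield it. The fragments, written here as bracketed strings standing in for subtrees, and all counts are invented.

    from math import prod
    from collections import Counter

    fragment_counts = Counter({"(S (NP) (VP))": 10, "(NP John)": 6, "(NP Mary)": 4,
                               "(VP sleeps)": 5, "(S (NP John) (VP))": 2})
    root_totals = {"S": 12, "NP": 10, "VP": 5}   # occurrences of each root category

    def frag_prob(fragment, root):
        return fragment_counts[fragment] / root_totals[root]

    def derivation_prob(fragments_with_roots):
        return prod(frag_prob(f, r) for f, r in fragments_with_roots)

    # Two different derivations of the same parse of "John sleeps":
    d1 = [("(S (NP) (VP))", "S"), ("(NP John)", "NP"), ("(VP sleeps)", "VP")]
    d2 = [("(S (NP John) (VP))", "S"), ("(VP sleeps)", "VP")]
    print(derivation_prob(d1) + derivation_prob(d2))   # sum over derivations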

The DOP model has been used to parse the ATIS corpus, using the parses found in the Penn Treebank as the training corpus. Recently (Bod), the DOP model of parsing has been extended such that the tree fragments are associated with frequency estimates obtained from Good-Turing smoothing. Further, a dictionary is used to assign the part-of-speech categories to the words of test sentences. This extended model performs at a high sentence accuracy (the percentage of sentences with no crossing brackets with respect to the treebank parses) and a high bracketing accuracy (the percentage of brackets of the most probable parse that do not cross the brackets in the treebank parses).

De Marcken's Parser

De Marcken's parser (de Marcken), like other robust parsers, attempts to parse simple structures rapidly while gracefully failing on more complicated constructions. The text chosen to parse is the LOB corpus. An N-best POS tagger is used as a front end to the parser.

The parser uses an agenda mechanism. It maintains three tables.

Rule-action table: Specifies the action to be taken by a rule in a state that encounters a phrase with a given category and features.

Single-phrase-action table: Specifies whether a phrase with a category and features should project or start a rule.

Phrase-phrase action table: Specifies the action to be taken when two phrases abut each other.

The possible actions that the parser can take are:

Shift the dot: Accepting a phrase by shifting the dot in a rule.

Closing: Reducing a set of phrases recognized so far to a single phrase, based on a rule.

Lowering: A mechanism by which, once a rule closes to a phrase, all the phrases constituting the phrase are lowered to the bottom of the agenda. Lowering helps in determinism.

An additional transducer mechanism is used to produce the tree relations for the output. Rules are given precedence over phrases in the agenda. The usual mechanism of retrieving the longest phrase starting at a place is used. This mechanism may fail to retrieve the best possible phrasal parse. The output is a list of phrasal parses with not much attachment information. The parser commits systematic errors in distinguishing adjuncts from arguments.

Seneff's Extended Chart Parser

Seneff extends a standard chart parser (Seneff) to permit the recovery of parsable phrases and clauses within sentences. This parser was used in conjunction with a spoken language system for the ATIS domain and was shown to provide a considerable improvement in the performance of the spoken language system.

The parser is designed to operate in two modes: full-sentence mode and relaxed mode. In the full-sentence mode, the parser attempts to find an analysis for the complete sentence; it succeeds only if it finds a sentence category that spans the entire input. In the relaxed mode, the parser proceeds left to right, producing all parses for the string starting at the first word. The parse that spans the most words is selected. The parser then attempts to parse the input starting at the first subsequent word, skipping a word if it does not find a parse starting at that word; this loop is sketched below. The selected parse after each iteration is converted into a semantic frame, based on domain knowledge. These semantic frames are glued together using discourse information. This approach to robust parsing is effective in limited domains; however, it will not be practical with a domain-independent wide-coverage grammar.
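The relaxed-mode loop can be sketched as follows; parse_prefix stands in for the chart parser, and its toy vocabulary is an invention for the example.

    def parse_prefix(words, start):
        """Return the end index of the longest parsable span starting at
        `start`, or None. Toy rule: any run of known words parses."""
        known = {"flights", "to", "boston", "show", "me", "cheap"}
        end = start
        while end < len(words) and words[end] in known:
            end += 1
        return end if end > start else None

    def relaxed_parse(words):
        fragments, i = [], 0
        while i < len(words):
            end = parse_prefix(words, i)
            if end is None:
                i += 1                    # skip an unparsable word
            else:
                fragments.append(words[i:end])
                i = end
        return fragments

    print(relaxed_parse("show me uh cheap flights to boston".split()))
    # [['show', 'me'], ['cheap', 'flights', 'to', 'boston']]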

The PLNLP approach

The Programming Language for Natural Language Processing (PLNLP) system (Jensen et al.) is an integrated, incremental system for broad-coverage syntactic and semantic analysis and synthesis of natural language. This system has been used to develop a robust computational broad-coverage grammar, the PLNLP English Grammar (PEG).

The PEG syntactic grammar is an augmented phrase structure grammar that has a feature-structure-like constraint mechanism. Its rules are specified in the PLNLP specification language. The grammar is supported by a large lexicon that includes part-of-speech and subcategorization information. The output of the PEG parser is a phrase structure with each level annotated with the heads and modifiers. The grammar and the parser have been used in an application context that by definition has to handle ill-formed text.

The parser is a bottom-up chart parser that attempts to cover the input with the category for a sentence. The parser uses a preference metric so as to produce a ranked set of parses. On failing, a parse-fitting process is initiated, in which the constituents from the chart are retrieved and fitted together to form the fitted parse for the sentence. The parse fitting proceeds by selecting a head constituent according to a preference list and then fitting the remaining constituents into it. The fitting process is complete if the head covers the entire input. If the head constituent does not cover the entire string, then the remaining constituents are added on either side, with the fitting process selecting the largest constituent at each iteration. The net result of parse fitting is a fitted parse that consists of the largest chunk of the sentence as a constituent, with the remaining chunks attached to it in some reasonable manner. A sketch of this fitting procedure is given below.
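The following sketch captures the shape of the fitting procedure: pick a head constituent by preference, then repeatedly attach the largest constituent that abuts the fitted span. The constituent spans and the preference order are illustrative assumptions.

    # Constituents are (label, start, end) spans over the input words.
    PREFERENCE = {"S": 0, "VP": 1, "NP": 2, "PP": 3}

    def fit(chart, n_words):
        head = min(chart, key=lambda c: (PREFERENCE.get(c[0], 9), -(c[2] - c[1])))
        fitted, lo, hi = [head], head[1], head[2]
        while lo > 0 or hi < n_words:
            candidates = [c for c in chart
                          if c[2] == lo or c[1] == hi]      # abuts the fitted span
            if not candidates:
                break
            best = max(candidates, key=lambda c: c[2] - c[1])  # largest constituent
            fitted.append(best)
            lo, hi = min(lo, best[1]), max(hi, best[2])
        return fitted

    chart = [("NP", 0, 2), ("PP", 2, 4), ("VP", 4, 7), ("NP", 5, 7)]
    print(fit(chart, 7))
    # [('VP', 4, 7), ('PP', 2, 4), ('NP', 0, 2)]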

TACITUS

The design philosophy of TACITUS (Hobbs et al.) has been to cater to detailed and thorough analysis of natural language text. However, in order to maintain an adequate level of performance in terms of efficiency and robustness, application-specific components (e.g., a Hispanic name detector) that do not violate the overall design philosophy have been added to TACITUS. There are modules for handling unknown words, and a relevance filter that decides the relevance of a sentence for the task at hand, even before parsing it, using N-gram information about words that appear in the domain.

The syntactic analysis component of TACITUS uses a broad-coverage grammar derived from the String Grammar (a descendant of the grammar used in the UNIVAC parser) and the DIAGRAM grammar. The output analyses are ordered using heuristics encoded in the grammar. The output of this component is translated into a predicate-argument relation and a subordination relation.

The syntactic analysis component consists of an agenda-based scheduling parser, a recovery heuristic for unparsable sentences, and a terminal substring parser for very long sentences. The parser works bottom-up, using an agenda to order the nodes and edges by their likelihood of participating in a correct parse. This allows the parser to pursue the most likely paths in the search space without trying out all possibilities. The nodes and edges are ranked based on parameters such as the number of words spanned, whether the constituent is complete or not, and certain preference heuristics. The authors mention that the preference heuristics proved to be the most effective of all; these are the heuristics discussed in Hobbs and Bear. A majority of the errors made by the parser were due to a few nagging problems, one such being the topicalization analysis for sentences that have two sentence-initial nouns.

On a parse failure, the parser retrieves the best sequence of fragments that spans the longest part of the sentence. Clauses, verb phrases, adverbial phrases, and noun phrases are the fragments retrieved, in that order. It was observed that in many cases the lost words, i.e., the words that did not feature in any of the fragment sequences, did not add much to the propositional content of the sentence. There were, on average, two fragments for each of the sentences that failed to parse.



The String Grammar is a descendant of the grammar used in the UNIVAC parser

The terminal substring parser component works on very long sentences. The sentence is segmented into smaller substrings by breaking at commas, conjunctions, relative pronouns, and certain instances of the word 'that'. The substrings are then parsed, starting from the right and working back, as one of the categories: main, subordinate, and relative clauses, infinitives, verb phrases, prepositional phrases, and noun phrases. At each substring, the parser also attempts to combine the set of substrings collected so far into one of the above categories. The best structure is then selected for subsequent processing.

George Bush, the President, held a press conference yesterday.

This sentence is broken into three fragments: 'George Bush', 'the President', 'held a press conference yesterday'. The last fragment parses as a VP, and then an attempt is made to parse the string 'the President VP' as a single category, which it is not. So 'the President' is parsed as an NP. Now the string 'George Bush NP VP' is parsed as an appositive on the subject.

This algorithm is superior to the more obvious one of parsing each fragment individually from left to right and then attempting to piece the fragments together: the latter algorithm would need to look inside all but the last of the fragments for possible attachment positions. The idea of a terminal substring parser is appealing, although it did not prove very satisfactory in terms of performance for the task at hand. A sketch of the right-to-left combination step follows.
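The right-to-left combination step can be sketched as follows; the parses_as oracle stands in for the real parser, and its lookup table is an invention covering just this example.

    def parses_as(words):
        """Toy oracle: return a category for a word/category sequence, or None."""
        table = {("held", "a", "press", "conference", "yesterday"): "VP",
                 ("the", "President"): "NP",
                 ("George", "Bush", "NP", "VP"): "S"}
        return table.get(tuple(words))

    def terminal_substring_parse(segments):
        parsed = []                                # categories, built right to left
        for seg in reversed(segments):
            combined = parses_as(seg + parsed)     # try to absorb what follows
            if combined:
                parsed = [combined]
            else:
                parsed = [parses_as(seg) or "X"] + parsed
        return parsed

    segments = [["George", "Bush"], ["the", "President"],
                ["held", "a", "press", "conference", "yesterday"]]
    print(terminal_substring_parse(segments))      # ['S']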

Sparser

Sparser (McDonald) was developed in the context of an information extraction system, to serve as an unrestricted, wide-coverage, robust parser that would parse text as it comes, without any preprocessing. The design of Sparser was based on the fact that most wide-coverage grammars lack grammatical analyses of the breadth of linguistic phenomena appearing in real text. Hence, Sparser was designed as a partial parser that extracts a particular class of task-specific information, instead of attempting to extract everything in the text. Sparser analyzes the part of the text that it is designed to analyze, while skipping the extragrammatical portions of the text.

The following are the components of Sparser. These components can be configured in any control structure as needed, and can feed each other, on subsequent passes, information which might not have been available in the first pass.

A tokenizer: This is a very conservative tokenization algorithm that groups words of the same class together. So a numeral followed by 'million' is seen as separate tokens, left for subsequent components to combine based on the context. This delay could be advantageous: an 'F' followed by a number could be a key on a keyboard or an airplane, depending on the context. The tokenizer also deals with suffixes of unknown words to determine their POS.

A set of transition networks: These networks serve to group adjacent elements, such as proper names, numbers, and some word compounds, into phrases with a flat structure. Though these phrases can be handled by grammatical rules (phrase structure rules), it is more natural to state them in terms of transition nets, especially since they have a flat structure.

A CFG parser: This is a chart-based parser that uses semantic constraints that are highly specific to the domain. For example, the word 'dollar' is represented by a triplet: a semantic label (currency), a syntactic label (head noun), and a structured ambiguity between a physical object, a quantity of money worth 100 cents, and the value of US currency on the international money market. In other words, the parser integrates the process of word-sense disambiguation directly with the process of grammatical analysis. This approach obviously reduces the structural ambiguity of a purely syntactic analyzer. However, the conflation of syntactic and domain-specific semantic information limits the modularity and portability of the system to new domains.

A context-sensitive parser: This component consists of highly domain-specific rules that apply only when the context associated with the rule matches the context around the rule application in the text. An example rule: a name is labeled as a person in the context '... was named the CEO'.

A conceptual analyzer: This component is responsible for skipping over extragrammatical regions of the text and composing semantically related constituents that may not be adjacent to one another. For example, a subject and predicate separated by a long appositive can be brought together using semantic and conceptual information.

Forming phrases heuristically: This is a heuristic facility to identify phrasal boundaries, using grammatical properties of function words and partially parsed phrases. It is particularly useful when the system encounters unknown words: it uses function words and suffixes of unknown words to determine the phrase boundaries, even though it may not recognize most of the words in the phrase.

In summary, Sparser is a partial parser that is intended to serve as a front end to an information extraction system. It relies heavily on domain knowledge for syntactic analyses and for combining partial parses. Hence, issues of modularity and portability are of major concern in this system.

IBM's PCFG and HBG Models

This work (Black et al.) is distinctive from the other systems discussed in this section in that robustness was achieved using a probabilistic model of parsing. The goal of this work was to integrate rich contextual information into a probabilistic extension of a handcrafted context-free grammar. Two models of including probabilities in the handcrafted context-free grammar were explored: the probabilistic context-free grammar (PCFG) model and the history-based grammar (HBG) model. Both models used, as training material, a treebank of computer manual sentences annotated with the correct parse.

In the PCFG model, the rules in the handcrafted context-free grammar were annotated with probabilities, which were estimated using an adaptation of the inside-outside algorithm. The adaptation was that only parses consistent with the treebank would contribute to the re-estimation of the parameters of the PCFG. After training, the PCFG was tested on computer manual sentences of varying lengths. The performance of the parsing model was measured using two measures: the any-consistent rate and the Viterbi rate. The any-consistent rate measured the percentage of sentences for which the parse consistent with the treebank is proposed among the many parses produced by the parser. The Viterbi rate measured the percentage of sentences for which the most likely parse produced by the parser was consistent with the treebank parse. A parse is regarded as consistent with the treebank parse if it matches the treebank parse structurally, including the labels for nonterminals. The PCFG model was evaluated on both rates. (The use of a grammar formalism that provides a greater domain of locality than that afforded by a CFG would eliminate this problem.)

Unlike the PCFG model, where the grammar rule expansion was independent of the context, in the HBG model the rule expansion was conditioned on rich contextual features. The contextual features included syntactic, semantic, and two lexical features. The model predicts the syntactic and semantic labels of a constituent, along with its rewrite rule and its lexical features, using the labels of the parent constituent, the parent's lexical heads, the parent's rewrite rule that led to the constituent, and the constituent's index in the parent's rule. These parameters are estimated using a statistical decision tree technique. A sketch of the conditioning is given below.
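Using only the quantities named above, the conditioning can be sketched as

    P(Syn, Sem, R, H1, H2 | Syn_p, Sem_p, R_p, Ipc, H1_p, H2_p)

where Syn and Sem are the syntactic and semantic labels, R is the rewrite rule, H1 and H2 are the lexical features, Ipc is the constituent's index in the parent's rule, and the subscript p marks the parent's attributes. The symbols here are ours, not necessarily those of Black et al.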

The HBG model performed at a substantially higher Viterbi rate than the PCFG model, a considerable error reduction. Thus, the HBG model showed that incorporating rich contextual features into a probabilistic model and conditioning rule expansions on the derivation structure can improve the performance of a probabilistic grammar. This provided a new approach to grammar development, where the problem of parsing is divided into two parts: extending the grammar coverage and picking the correct parse for a sentence.

Summary

As can be seen from the recent literature in robust parsing, two significant approaches are emerging: the grammar-based approach, which incorporates heuristic and ranking information for selecting among competing parses and resorts to partial parses to achieve robustness, and the statistical approach, which employs an annotated treebank to induce probability distributions pertaining to parsing decisions. Although there are evaluation metrics that have been used to rank statistical parsers, there are as yet no metrics to rank partial parsers among themselves or against statistical parsers. In a later chapter, we propose evaluation metrics that can be used to measure the performance of partial parsers as well as statistical parsers.

Chapter

Merits of LTAG for Partial Parsing

Lexicalized grammars are particularly well-suited for the specification of natural language grammars. The lexicon plays a central role in linguistic formalisms such as LFG (Kaplan and Bresnan), GPSG (Gazdar et al.), HPSG (Pollard and Sag), CCG (Steedman), Lexicon-Grammars (Gross), LTAG (Schabes and Joshi), Link Grammars (Sleator and Temperley), and some versions of GB (Chomsky). Parsing and lexical semantics, to name a few areas, have both benefited from lexicalization. Lexicalization provides a clean interface for combining the syntactic and semantic information in the lexicon.

In this chapter, we discuss the merits of lexicalization and other related issues in the context of partial parsing. We take Lexicalized Tree-Adjoining Grammars (LTAGs) as a representative of the class of lexicalized grammars. This chapter is organized as follows. The first section provides an introduction to Feature-based Lexicalized Tree-Adjoining Grammars, with an example. The next sections discuss in detail the merits of LTAGs for partial parsing. Finally, we briefly discuss the salient details of a wide-coverage grammar system developed in the LTAG formalism.

Feature-based Lexicalized Tree-Adjoining Grammar

Feature-based Lexicalized Tree-Adjoining Grammar (FB-LTAG) is a tree-rewriting grammar formalism, unlike Context-Free Grammars and Head Grammars, which are string-rewriting formalisms. FB-LTAGs trace their lineage to Tree Adjunct Grammars (TAGs), which were first developed in Joshi et al. and later extended to include unification-based feature structures (Vijay-Shanker; Vijay-Shanker and Joshi) and lexicalization (Schabes et al.). For a more recent and comprehensive reference, see Joshi and Schabes.

The primitive elements of FB-LTAGs are called elementary trees. Each elementary tree is associated with at least one lexical item on its frontier. The lexical item associated with an elementary tree is called the anchor of that tree. An elementary tree serves as a complex description of the anchor and provides a domain of locality over which the anchor can specify syntactic and semantic (predicate-argument) constraints. Elementary trees are of two kinds: (a) initial trees and (b) auxiliary trees. In an FB-LTAG grammar for natural language, initial trees are phrase structure trees of simple sentences containing no recursion, while recursive structures are represented by auxiliary trees.

Examples of initial trees (the α trees) and auxiliary trees (the β trees) are shown in the figures that follow. Nodes on the frontier of initial trees are marked as substitution sites by a '↓', while exactly one node on the frontier of an auxiliary tree, whose label matches the label of the root of the tree, is marked as a foot node by a '*'. The other nodes on the frontier of an auxiliary tree are marked as substitution sites.

Each node of an elementary tree is associated with two feature structures (FS), the top and the bottom. (Nodes marked for substitution are associated with only the top FS.) The bottom FS contains information relating to the subtree rooted at the node, and the top FS contains information relating to the supertree at that node. Features may get their values from three different sources:

The morphology of the anchor: from the morphological information of the lexical items that anchor the tree.

Structural characteristics: from the structure of the tree itself, for example the mode : ind/imp feature on the root node of the declarative tree in the figure below.

The derivation process: from unification with features from trees that adjoin or substitute.

Elementary trees are combined by the substitution and adjunction operations. Substitution inserts elementary trees at the substitution nodes of other elementary trees. Part (a) of the figure below shows two elementary trees and the tree resulting from the substitution of one tree into the other. In this operation, a node marked for substitution in an elementary tree is replaced by another elementary tree whose root label matches the label of the node. The top FS of the resulting node is the result of unification of the top features of the two original nodes, while the bottom FS of the resulting node is simply the bottom features of the root node of the substituting tree.

In an adjunction operation, an auxiliary tree is inserted into an elementary tree; part (b) of the figure shows an auxiliary tree adjoining into an elementary tree and the result of the adjunction. The root and foot nodes of the auxiliary tree must match the label of the node at which the auxiliary tree adjoins. The node being adjoined to splits, and its top FS unifies with the top FS of the root node of the auxiliary tree, while its bottom FS unifies with the bottom FS of the foot node of the auxiliary tree. For a parse to be well-formed, the top and bottom FS at each node must unify at the end of the parse.

Figure: Substitution and Adjunction in LTAG.
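The two operations can be sketched on trees represented as (label, children) pairs, with 'X!' marking a substitution node and 'X*' a foot node, echoing the down-arrow and star of the figure. Feature unification is omitted, so this illustrates the tree combination only; the trees are simplified inventions.

    def replace(tree, marker, arg):
        label, children = tree
        if label == marker:
            return arg
        return (label, [replace(c, marker, arg) for c in children])

    def substitute(tree, target, arg):
        """Fill the substitution node 'target!' with the initial tree arg."""
        return replace(tree, target + "!", arg)

    def adjoin(tree, target, aux):
        """The auxiliary tree aux replaces the target node; the original
        subtree is plugged back in at the foot node 'target*'."""
        label, children = tree
        if label == target:
            return replace(aux, target + "*", (label, children))
        return (label, [adjoin(c, target, aux) for c in children])

    transitive = ("S", [("NP0!", []), ("VP", [("V", [("hit", [])]),
                                              ("NP1!", [])])])
    t = substitute(transitive, "NP0", ("NP", [("John", [])]))
    t = substitute(t, "NP1", ("NP", [("Mary", [])]))
    t = adjoin(t, "VP", ("VP", [("V", [("will", [])]), ("VP*", [])]))
    print(t)   # derived tree for "John will hit Mary"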

The result of combining the elementary trees shown in the first figure below is the derived tree, shown in part (a) of the second figure. The process of combining the elementary trees to yield a parse of the sentence is represented by the derivation tree, shown in part (b). The nodes of the derivation tree are the tree names, each anchored by the appropriate lexical item. The combining operation is indicated by the type of the arcs: a broken line indicates substitution and a bold line indicates adjunction, while the address of the operation is indicated as part of the node label. The derivation tree can also be interpreted as a dependency tree, with unlabeled arcs between the words of the sentence, as shown in part (c).

Figure: Elementary Trees for the sentence 'The company is being acquired'.

Figure: (a) Derived Tree, (b) Derivation Tree, and (c) Dependency Tree for the sentence 'The company is being acquired'.

Key properties of LTAGs

In this section, we define the key properties of LTAGs: Lexicalization, Extended Domain of Locality (EDL), and Factoring of Recursion from the Domain of Dependency (FRD), and discuss how these properties are realized in natural language grammars written in LTAGs. A more detailed discussion of these properties is presented in Joshi; Kroch and Joshi; Schabes et al.; and Joshi and Schabes.

Lexicalized Grammar: A grammar is lexicalized if it consists of:

a finite set of elementary structures (strings, trees, directed acyclic graphs, etc.), each structure anchored on a lexical item;

lexical items, each associated with at least one of the elementary structures of the grammar;

a finite set of operations combining these structures.

This property proves to be linguistically crucial, since it establishes a direct link between the lexicon and the syntactic structures defined in the grammar. In fact, in lexicalized grammars, all we have is the lexicon, which projects the elementary structures of each lexical item; there is no independent grammar.

Extended Domain of Locality (EDL): This property has two parts:

(a) Every elementary structure must contain all and only the arguments of the anchor in the same structure.

(b) For each lexical item, the grammar must contain an elementary structure for each syntactic environment the lexical item might appear in.

Part (a) of EDL allows the anchor to impose syntactic and semantic constraints on its arguments directly, since they appear in the same elementary structure that it anchors. Hence, all elements that appear within one elementary structure are considered to be local. This property also defines how large an elementary structure in a grammar can be. The figure below shows the elementary tree anchored by 'seem' that is used to derive a raising analysis for the sentence below. Notice that the elements appearing in the tree are only those that serve as arguments to the anchor, and nothing else. In particular, the subject NP 'John' does not appear in the elementary tree for 'seem', since it does not serve as an argument for 'seem'. The same figure also shows the elementary tree anchored by the transitive verb 'hit', in which both the subject NP and the object NP are realized within the same elementary tree.

John seems to like Mary.

LTAG is distinguished from other grammar formalisms by possessing part (b) of the EDL property. In LTAGs, there is one elementary tree for every syntactic environment that the anchor may appear in. Each elementary tree encodes the linear order of the arguments of the anchor in a particular syntactic environment. For example, a transitive verb such as 'hit' is associated with both the elementary tree for a declarative transitive sentence, such as the first sentence below, and the elementary tree for an object-extracted transitive sentence, such as the second. Notice that the object noun phrase is realized to the left of the subject noun phrase in the object extraction tree.

John hit Mary.

Who did John hit?

As a consequence of the fact that LTAGs possess part (b) of the EDL property, the derivation structure in LTAGs contains the information of a dependency structure. Another aspect of EDL is that the arguments of the anchor can be filled in any order. This is possible because the elementary structures allocate a slot for each argument of the anchor in each syntactic environment that the anchor appears in.

There can be many ways of constructing the elementary structures of a grammar so as to possess the EDL property. However, by requiring that the constructed elementary structures be minimal, the third property of LTAGs, namely Factoring of Recursion from the Domain of Dependencies, follows as a corollary of EDL.

Figure: (a) Tree for the raising analysis anchored by 'seems', (b) Transitive tree, and (c) Object extraction tree for the verb 'hit'.

Factoring of Recursion from the Domain of Dependencies (FRD): Recursion is factored away from the domain for the statement of dependencies.

In LTAGs, recursive constructs are represented as auxiliary trees, which combine with elementary trees by the operation of adjunction. Elementary trees define the domain for stating dependencies such as agreement, subcategorization, and filler-gap dependencies. Auxiliary trees, by adjunction to elementary trees, account for the long-distance behavior of these dependencies.

An additional advantage of a grammar possessing the FRD and EDL properties is that feature structures in these grammars are extremely simple. Since the recursion has been factored out of the domain of dependency, and since the domain is large enough for agreement, subcategorization, and filler-gap dependencies, feature structures in such systems do not involve any recursion. In fact, they reduce to typed terms that can be combined by simple term-like unification.

Derivation Structure

In this section, we briefly discuss the notion of a derivation structure and its status in various formalisms. A derivation structure provides information regarding the elementary structures that participated in a derivation and how they were combined during the process of derivation. However, it does not indicate the order of the derivational steps. Thus, a derivation structure is a representative of an equivalence class of derivations, modulo the order of individual derivational steps.

In string-rewriting systems such as CFGs, there is no independent notion of a derivation structure. The result of a parse is a phrase structure tree that represents the rules that were combined and the result of that combination process.

In Categorial Grammars (CG), the result of a parse is a proof tree. The proof tree is like a phrase structure tree of a CFG, where each node is a category label. However, unlike a CFG phrase structure tree, each step in the proof tree is annotated with the operation that was used to combine the categories that appear at that step.

In LTAG, a parse produces two structures: the derived tree and the derivation tree. The derived tree is a phrase structure tree, while the derivation tree records the elementary trees that were combined, using the substitution and adjunction operations, in the derivation process. The nodes of the derivation tree are labels of the elementary trees, and the edges are either substitution or adjunction links. The edges are labeled with tree addresses that indicate the location of the operation of the tree labeling the daughter node into the tree labeling the parent node.

The fact that LTAGs produce both a phrase structure tree and a derivation tree distinguishes them from their counterparts in the group of Mildly Context-Sensitive Grammars, such as Combinatory Categorial Grammars, Head Grammars, and Linear Indexed Grammars (Joshi et al.). This is a direct consequence of the fact that LTAGs are tree-rewriting systems, while their counterparts are string-rewriting systems.

Linguistic relevance of a derivation structure

A phrase structure represents the constituency and linear order information for the given input. However, predicate-argument relations, dependency relations (such as head-complement and head-modifier relations), and some aspects of non-compositional semantics are not encoded in a phrase structure. Computing these relations in a formalism with a CFG backbone requires additional mechanisms, such as functional structures as in LFG.

In a lexicalized formalism, the elementary structures are anchored by a lexical item that serves as the head of the structure. However, in a lexicalized string-rewriting formalism, even though the elementary structures are headed, when two headed strings are combined, the operation that combines them has to choose between the two heads to head the resulting string. Such an approach is pursued in Head Grammars and Headed CFGs (Abney).

Derivation structures and dependencies in LTAG

As mentioned before, the LTAG derivation structure consists of nodes that are labeled by elementary tree names along with their anchors. The nodes are connected by substitution and adjunction links, which can be seen as relating two lexical items. These lexical relations can be identified as head-complement and head-modifier relations. However, as Vijay-Shanker observes, while substitution is used to add a complement, adjunction is used both for modification and for clausal complementation. The reason for this is that auxiliary trees can represent head-modifier relations as well as head-complement (clausal) relations: auxiliary trees represent recursive structures of the language and are independent of the syntactic relations that they express. Hence, it has been suggested that the auxiliary trees be further partitioned into predicative auxiliary trees and modifier auxiliary trees. Predicative auxiliary trees are recursive and record head-complement relations, while modifier auxiliary trees are recursive and record head-modifier relations. As a result, it is not the case that complements enter the derivation through substitution and modifiers through adjunction. Also, the derivation structure with predicative auxiliary trees needs to be interpreted differently to obtain the standard notion of dependency structure. (D-Tree Grammars (Rambow et al.) attempt to identify the type of syntactic dependency with the type of operation used to compose the elementary structures.)

In a later chapter, we discuss how lexicalization and EDL can be exploited for dependency-style parsing of LTAGs. We also discuss how FRD, in conjunction with lexicalization and EDL, can be exploited for parsing LTAGs in restricted domains.

Categorial Grammars

In this section, we briefly review how the properties of Lexicalization, EDL, and FRD are realized in natural language grammars written in the CG framework.

CGs, like LTAGs, are lexicalized grammars. A CG consists of a finite set of categories, and every category is anchored by a lexical item. Categories are combined by application and composition operations. However, CGs differ from LTAGs on the EDL property: while CGs possess part (a), they do not have part (b) of the EDL property. As a consequence, in a CG, 'hit' is assigned only one category, (S\NP)/NP. This category contains both the arguments of the anchor in the same structure and encodes the direction of the arguments in the declarative transitive sentence. However, CGs do not contain categories that encode the possible orderings of the arguments in other syntactic environments. Thus, 'hit' is assigned the same category in the declarative sentence and in the object extraction sentence. The derivation for the object extraction sentence requires type-raising of the subject (Steedman). Hence, the category (S\NP)/NP alone does not encode the syntactic environment in which it would be used; a part of the proof tree needs to be specified to determine the ordering of the arguments. Thus, the history of the derivation alone does not encode the dependency structure of the input.

In the system described by Joshi and Kulick, the primitive objects are partial proof trees that are formed from the categorial grammar categories by proof-building operations. These proof trees, like LTAG trees, encode the possible orderings of the arguments in each syntactic environment they would be used in. In such a system, the history of the derivation encodes the dependency structure of the input.

Chunking and LTAGs

As discussed earlier, a chunk-based partial parser is necessitated by an engineering requirement in wide-coverage grammar systems. The reasons include the following: grammatical coverage of these systems is seldom complete; full parsing is often extremely slow and sometimes requires an enormous amount of semantic knowledge; and unrestricted texts are often ungrammatical.

Abney provides another reason for chunking. His motivation is based on psycholinguistic studies of sentence processing by Gee and Grosjean that link pause duration in reading and naive word grouping to text clusters called phrases.

Abney defines chunks as parse tree fragments with problematic elements, such as conjuncts, modifiers, and in some cases arguments, left unattached. A clause is a sequence of chunks with no nesting of chunks. An example of the chunks for the sentence below, taken from Abney, is shown in the following figure.

The professor from Milwaukee was reading about a biography of Marcel Proust.

Figure: Chunks for 'The professor from Milwaukee was reading about a biography of Marcel Proust'.

Besides the psycholinguistic motivation, chunking provides a computational advantage. Abney shows that chunks can be identified with very simple, computationally inexpensive finite-state devices. If the decisions about attachments of chunks can be factored out from the task of resolving ambiguities in placing chunk boundaries, then the two tasks can be handled separately, and the parsing process can be simplified considerably. Once chunks are identified, they are pieced together using dependency information between chunks.

LTAGs possess properties that are conducive to a chunk-based parsing approach. Elementary trees compile out the various syntactic environments the anchor of the tree can appear in. Also, elementary trees factor out recursion. Hence, elementary trees may be regarded as linguistically motivated chunk templates, with slots for the arguments of the anchor. Frozen and nearly frozen groups of words, or chunks, can be represented as multi-component anchors of chunk templates. Identifying chunk boundaries in the input reduces to the assignment of the correct elementary trees to the words of the input, as sketched after the figure below. The figure illustrates the elementary trees assigned to the sentence in LTAG. When an anchor of an elementary tree, along with the arguments that appear within that elementary tree, is viewed as a chunk, it is hard to miss the striking similarity between LTAG chunks and the chunks proposed by Abney.

Figure: Chunks in LTAG for 'The professor from Milwaukee was reading about a biography of Marcel Proust'.
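A sketch of this reduction follows: each word is assigned the highest-scoring elementary tree from its lexicon entry, and each tree name stands for a chunk template with slots for the anchor's arguments. The tree names loosely echo the XTAG naming style; the lexicon and scores are invented for the example.

    # Each word lists (elementary-tree name, score) pairs; picking one tree
    # per word fixes the chunk boundaries at the same time.
    LEXICON = {
        "the":       [("alphaDXD", 0.9)],
        "professor": [("alphaNXdxN", 0.8), ("alphaNXN", 0.2)],
        "from":      [("betaNPpp", 0.6), ("betaVPpp", 0.4)],
        "Milwaukee": [("alphaNXN", 0.9)],
        "was":       [("betaVvx", 0.9)],
        "reading":   [("alphanx0Vnx1", 0.6), ("alphanx0V", 0.4)],
    }

    def assign_supertags(words):
        return [(w, max(LEXICON[w], key=lambda t: t[1])[0]) for w in words]

    print(assign_supertags("the professor from Milwaukee was reading".split()))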

Since LTAG represents modifiers, conjuncts, and other recursive structures as auxiliary trees, the process of identifying the sites of attachment of these structures can be separated from the process of identifying the chunk boundaries. Also, since the slots in a chunk are constrained by the anchor, constraints between chunks can be imposed naturally. Thus, dependencies across chunks can be captured in LTAGs. Stochastic models and simple finite-state devices can be used for assigning elementary trees; some of these models are discussed in a later chapter.

Language Modeling and LTAG

There has been increasing interest in predictive language modeling for a variety of applications, including speech recognition, machine translation, and statistical parsing. The most popular language models applied to this task are N-gram based models, largely due to their success in speech recognition. There have also been many attempts at applying probabilistic context-free grammars (Fujisaki et al.; Lafferty et al.; Jurafsky et al.). However, both of these models show a mismatch between the predictions of the model and the distribution of the information. We summarize the inadequacies of these models for the purpose of language modeling; a more detailed discussion is presented in Hindle.

N-gram models: N-gram models predict the next word based on the preceding words within a fixed window. This model is highly sensitive to lexical associations, which are very useful in identifying fixed expressions and idioms. However, it fails to capture any dependency relation that extends beyond the fixed N-word window. In particular, since it does not include any structural information, it will not be able to predict associations between words that are separated in terms of string positions but are structurally local. For example, using a trigram model, 'to' will be predicted in the first example below but not in the second, just because the relative clause separates 'to' from its predictor, pushing it beyond the N-word window; a small illustration follows the examples.

give kittens to

give kittens that were born yesterday to

Another inadequacy of N-gram models is that different word sequences require different values of N. Hindle shows that while the phrase 'New York Stock Exchange' is followed by the phrase 'composite' more than half the time in a large Wall Street Journal corpus, 'composite New York Stock Exchange' is never followed by 'composite'. To accommodate this, short N-grams are inadequate; a longer N-gram model is required, and with it the issues of sparseness of data need to be addressed.
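The following small illustration shows the window effect: the trigram context of 'to' contains 'give' in the first sentence but not in the second.

    def trigram_context(words, i):
        """The two preceding words a trigram model conditions on."""
        return tuple(words[max(0, i - 2):i])

    s1 = "give kittens to friends".split()
    s2 = "give kittens that were born yesterday to friends".split()

    print(trigram_context(s1, s1.index("to")))   # ('give', 'kittens')
    print(trigram_context(s2, s2.index("to")))   # ('born', 'yesterday'): 'give' is gone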

Probabilistic Context-Free Grammars (PCFGs): PCFGs are probabilistic versions of CFGs, where each production of a CFG is associated with a probability, and the probability of a derivation is the product of the probabilities of the rules participating in the derivation. The major problem with PCFGs is that they are not inherently sensitive to lexical associations. So the fact that 'powerful tea' is less probable than 'strong tea' is not expressed naturally in a PCFG.

Another inadequacy of PCFGs is that rule expansions are independent of the derivation context (context-free). Hence, the distributional evidence that pronouns are more likely to be subject noun phrases than non-subject noun phrases is not modeled, since a noun phrase is derived by the same set of rules whether it occurs in the subject position or not.

We now discuss the four features of LTAG that make it a more appropriate formalism for language modeling. First, in LTAGs, since every tree is lexicalized, words are associated more tightly with the structures they anchor. This opens the possibility of making the operations that combine elementary trees sensitive to lexical context. Second, since the anchors of trees can potentially be multi-word lexical items, idioms and fixed collocations can be modeled naturally, without any additional mechanism. Third, the EDL property of LTAGs allows the anchor to specify different constraints on the different argument positions in the tree. Thus, the distributional evidence of pronouns being more frequent in a subject noun-phrase position than in a non-subject noun-phrase position can be naturally captured, since each noun phrase appearing in an elementary tree can be identified with its grammatical function. Finally, since LTAGs factor out recursion from the domain of dependency, the problem of intervening material disrupting locality, as seen in an N-gram approach, does not arise. Referring to the previous example, the co-occurrence of 'give' and 'to' is structurally local and is not affected by the relative clause on 'kittens'.

Probabilistic versions of LTAGs have been proposed by Schabes and by Resnik. Experiments on the efficacy of Probabilistic LTAGs as language models have been stalled due to the unavailability of the data needed to estimate the statistical parameters of the model. Also, the estimation of the parameters requires a starting grammar. One possibility is to include all possible elementary structures in the starting grammar and let the re-estimation weed out useless structures. Alternatively, a hand-crafted grammar can serve as a very good starting grammar. In the next section we discuss a wide-coverage grammar system developed in the LTAG formalism, and in subsequent chapters we present some novel uses of probabilities in LTAGs.

XTAG

A broad-coverage grammar system, XTAG, has been implemented in the LTAG formalism. In this section we briefly discuss some aspects of XTAG for the sake of completeness; a more detailed report on XTAG can be found in XTAG-Group.

The XTAG system consists of a Morphological Analyzer, a Part-of-Speech Tagger, a wide-coverage LTAG English Grammar, a predictive left-to-right Earley-style Parser for LTAG (Schabes), and an X-windows Interface for grammar development (Doran et al.). The figure below shows a flowchart of the XTAG system. The input sentence is subjected to morphological analysis and is tagged with parts-of-speech before being sent to the parser. The parser retrieves the elementary trees that the words of the sentence anchor and combines them by adjunction and substitution operations to derive a parse of the sentence.

Figure: Flowchart of the XTAG system. The input sentence passes through the Morph Analyzer and the Tagger (backed by the Morph DB and Lex Prob DB), whose outputs are combined by the P.O.S Blender; Tree Selection (backed by the Trees DB and Syn DB) feeds the Parser, and Tree Grafting yields the Derivation Structure.

A summary of each of the system components is presented in the table below. A more detailed description of each of these components is presented in Doran et al. and XTAG-Group.

Morphological Analyzer and Morph Database:
Consists of a large inventory of inflected items; thirteen parts of speech are differentiated. Entries are indexed on the inflected form and return the root form, POS and inflectional information. The database does not address derivational morphology.

POS Tagger and Lexical Probabilities Database:
A Wall Street Journal-trained trigram tagger (Church), extended to output the N-best POS sequences (Soong and Huang). Decreases the average time to parse a sentence.

POS Blender:
Combines information from the Morphology and the POS tagger. Outputs the N-best POS sequences with morphological information for each word of the input sentence.

Tree Database:
Trees divided into tree families and individual trees. Tree families represent subcategorization frames; e.g., the intransitive tree family contains the following trees: indicative, wh-question, relative clause, imperative and gerund. Individual trees are generally anchored by non-verbal lexical items that substitute or adjoin into the clausal trees. Feature values may be specified within a tree or may be derived from the syntactic database.

Syntactic Database and Statistical Database:
Associates lexical items with the appropriate trees and tree families based on subcategorization information. Extracted from OALD and ODCIE, it contains a large number of entries. Each entry consists of the uninflected form of the word, its POS, the list of trees or tree families associated with the word, and a list of feature equations that capture lexical idiosyncrasies.

X-Interface:
Provides a menu-based facility for creating and modifying tree files; user-controlled parser parameters (the parser's start category; enable/disable/retry-on-failure for the POS tagger); storage/retrieval facilities for elementary and parsed trees as text and postscript files; graphical displays of tree and feature data structures; and hand combination of trees by adjunction or substitution, for diagnosing grammar problems.

Table: XTAG system summary

The grammar of XTAG has been used to parse sentences from the ATIS, IBM-manual and WSJ corpora (XTAG-Group). The resulting XTAG corpus contains sentences from these domains, along with all the derivations for each sentence. The derivations provide predicate-argument relationships for the parsed sentences. This information is used in the experiments discussed in the subsequent chapters.

Summary

In this chapter we introduced the LTAG formalism and reviewed the properties of lexicalization, extended domain of locality, and factoring of recursion from the domain of dependencies in LTAGs. We also discussed the relevance of these properties for partial parsing. Finally, we presented a brief overview of XTAG, a large wide-coverage grammar for English that is being developed in the LTAG formalism. In the following chapters we discuss two novel methods of exploiting the properties of LTAGs for partial parsing.

Chapter

Supertags

Part-of-speech disambiguation techniques (POS taggers) (Church; Weischedel et al.; Brill) are often used prior to parsing to eliminate, or substantially reduce, part-of-speech ambiguity. POS taggers are all local, in the sense that they use information from a limited context in deciding which tags to choose for each word. As is well known, these taggers are quite successful.

In a lexicalized grammar such as the Lexicalized Tree-Adjoining Grammar (LTAG), each lexical item is associated with at least one elementary structure (tree). The elementary structures of LTAG localize dependencies, including long distance dependencies, by requiring that all and only the dependent elements be present within the same structure. As a result of this localization, a lexical item may be (and in general almost always is) associated with more than one elementary structure. We will call these elementary structures supertags, in order to distinguish them from the standard part-of-speech tags. Note that even when a word has a unique standard part-of-speech, say a verb (V), there will usually be more than one supertag associated with this word. Since there is only one supertag for each word when the parse is complete (assuming there is no global ambiguity), an LTAG parser (Schabes et al.) needs to search a large space of supertags to select the right one for each word before combining them into the parse of a sentence. It is this problem of supertag disambiguation that we address in this chapter.

Since LTAGs are lexicalized, we are presented with a novel opportunity to eliminate, or substantially reduce, the supertag assignment ambiguity by using local information, such as local lexical dependencies, prior to parsing. As in standard part-of-speech disambiguation, we can use local statistical information, in the form of N-gram models based on the distribution of supertags in an LTAG-parsed corpus. Moreover, since the supertags encode dependency information, we can also use information about the distribution of distances between a given supertag and its dependent supertags.

Note that, as in standard part-of-speech disambiguation, supertag disambiguation could have been done by a parser. However, carrying out part-of-speech disambiguation prior to parsing makes the job of the parser much easier and therefore speeds it up; supertag disambiguation reduces the work of the parser even further. After supertag disambiguation, we would have effectively completed the parse, and the parser need only combine the individual structures, hence the term almost parsing. This method can also be used to parse sentence fragments, and in cases where the supertag sequence after disambiguation may not combine into a single structure.

In this chapter we present techniques for disambiguating supertags and evaluate their performance and their impact on LTAG parsing. Although presented with respect to LTAG, these techniques are applicable to other lexicalized grammars as well. The next section illustrates the objective of supertag disambiguation through an example. Various methods, and their performance results for supertag disambiguation, are then discussed in detail, followed by a discussion of the efficiency gained by performing supertag disambiguation before parsing. A robust and lightweight dependency analyzer that uses the supertag output is presented next, and we conclude with a discussion of the applicability of supertag disambiguation to other lexicalized grammars.

Example of Supertagging

LTAGs, by virtue of possessing the Extended Domain of Locality (EDL) property, associate with each lexical item one elementary tree for each syntactic environment that the lexical item may appear in. As a result, each lexical item is invariably associated with more than one elementary tree. We call the elementary structures associated with each lexical item super parts-of-speech (super POS), or supertags. The figure below illustrates a few of the elementary trees associated with each word of the sentence the purchase price includes two ancillary companies, and the table below provides example contexts in which each of these supertags would be used.

Construction: Example

Nominal Predicative: this is the purchase
Noun Phrase: the price
Topicalization: Almost everything, the price includes
Adjectival Predicative: this is ancillary
Noun Phrase: the company
Nominal Modifier: purchase order
Nominal Predicative (Subject Extraction): what is the price
Imperative: include the share price
Determiner: two hundred men
Adjectival Modifier: ancillary unit
Nominal Predicative (Subject Extraction): which are the companies
Noun Phrase: purchases have not increased
Nominal Predicative: this is the price
Transitive Verb: the price includes everything
Adjectival Predicative (Subject Extraction): what is ancillary
Noun Phrase: companies have not been profitable

Table: Examples of syntactic environments where the supertags shown in the figure would be used

The first figure below illustrates the initial set of supertags assigned to each word of the sentence the purchase price includes two ancillary companies. The order of the supertags for each lexical item in the example is not relevant. The second figure shows the final supertag sequence assigned by the supertagger, which picks the best supertag sequence using statistical information (described later in this chapter) about individual supertags and their dependencies on other supertags. The chosen supertags are combined to derive a parse. Without the supertagger, the parser would have to process combinations of the entire set of trees (at least all of the trees shown); with it, the parser need only process combinations of one tree per word.

Figure: A selection of the supertags associated with each word of the sentence the purchase price includes two ancillary companies

Figure: Supertag disambiguation for the sentence the purchase price includes two ancillary companies, showing the initial assignment of supertags to each word and the final assignment selected by the supertagger

Reducing supertag ambiguity using structural information

The structure of a supertag can best be seen as providing admissibility constraints on the syntactic environments in which it may be used. Some of these constraints can be checked locally. The following are a few constraints that can be used to determine the admissibility of a syntactic environment for a supertag.

Span of the supertag: The span of a supertag is the minimum number of lexical items that the supertag can cover; each substitution site of a supertag will cover at least one lexical item in the input. A simple rule can be used to eliminate supertags based on the span constraint: if the span of a supertag is larger than the input string, then the supertag cannot be used in any parse of the input string.

Left/Right span constraint: If the span of the supertag to the left (right) of the anchor is larger than the length of the string to the left (right) of the word that anchors the supertag, then the supertag cannot be used in any parse of the input string.

Prof. Mitch Marcus pointed out that these tests are similar to the generalized shaper tests used in the Harvard Predictive Analyzer (Kuno).

Lexical items in the supertag: A supertag can be eliminated if the terminals appearing on the frontier of the supertag do not appear in the input string. Supertags with the built-in lexical item by, which represent passive constructions, are typically eliminated from consideration during the parse of an active sentence.

More generally, these constraints can be used to eliminate supertags that cannot have their features satisfied in the context of the input string. An example of this is the elimination of a supertag that requires a wh NP when the input string does not contain wh-words. A sketch of these filters is given below.
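The following is a minimal Python sketch of how the three filters might be implemented; the Supertag representation (precomputed minimum left/right spans and built-in lexical items) is an assumption made for illustration, not the XTAG data structure.

    # A minimal sketch of the structural filters described above, assuming a
    # toy supertag representation with precomputed minimum spans.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Supertag:
        name: str
        left_span: int        # minimum lexical items covered left of the anchor
        right_span: int       # minimum lexical items covered right of the anchor
        lex_items: List[str]  # built-in lexical items on the frontier, e.g. ["by"]

    def admissible(tag: Supertag, sentence: List[str], anchor_pos: int) -> bool:
        n = len(sentence)
        # Span constraint: the minimum span must fit inside the input string.
        if tag.left_span + 1 + tag.right_span > n:
            return False
        # Left/right span constraints relative to the anchor position.
        if tag.left_span > anchor_pos or tag.right_span > n - anchor_pos - 1:
            return False
        # Lexical-item constraint: built-in terminals must occur in the input.
        if any(item not in sentence for item in tag.lex_items):
            return False
        return True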

The table below indicates the decrease in supertag ambiguity for a set of WSJ sentences (from the Penn Treebank) obtained by using the structural constraints, relative to the supertag ambiguity without the structural constraints.

Table: Supertag ambiguity with and without the use of structural constraints (columns: system, total number of words, average number of supertags per word; rows: without structural constraints, with structural constraints)

These filters prove to be very effective in reducing supertag ambiguity. The first graph below plots the number of supertags at the sentence level, with and without the filters, for sentences of up to fifty words. As can be seen from the graph, the supertag ambiguity is significantly lower when the filters are used. The second graph shows the percentage drop in supertag ambiguity due to filtering for the same range of sentence lengths. As can be seen, the average reduction in supertag ambiguity is substantial: given a sentence, a large share of the supertags can be eliminated, even before parsing begins, just by using the structural constraints of the supertags. This reduction in supertag ambiguity speeds up the parser significantly. In fact, the supertag ambiguity in the XTAG system is so large that the parser is prohibitively slow without the use of these filters.

The table below tabulates the reduction of supertag ambiguity due to the filters against the various parts-of-speech. Verbs in all their forms contribute most to the problem of supertag ambiguity, and most of the supertag ambiguity for verbs is due to light verbs and verb particles.

Figure: Comparison of the number of supertags with and without filtering, for sentences of up to fifty words (y-axis: number of supertags, in thousands; x-axis: sentence length)

The filters are very effective in eliminating a large proportion of the verb-anchored supertags.

Even though the structural constraints are effective in reducing supertag ambiguity, the search space for the parser remains quite large. In the next few sections we present stochastic and rule-based approaches to supertag disambiguation.

Table: The effect of filters on supertag ambiguity, tabulated against part-of-speech (columns: average number of supertags without filters, average number of supertags with filters, percentage drop in supertag ambiguity; rows: VBP, VB, VBD, VBN, MD, VBZ, VBG, RP, IN, JJS, WRB, JJR, JJ, NN, NNS, NNP, NNPS, LS, FW, RRB, LRB, RBR, RBS, CC, EX, CD, TO, PRP, UH, RB, PDT, WP, WP$, DT, PRP$, POS, WDT)

Figure: Percentage drop in the number of supertags due to filtering, for sentences of up to fifty words (y-axis: percentage drop; x-axis: sentence length)

Models, Data, Experiments and Results

Before proceeding to discuss the various models for supertag disambiguation, we would like to trace the time course of the development of this work. We do this not only to show the improvements made to the early work reported in COLING (Joshi and Srinivas), but also to explain the rationale for choosing certain models of supertag disambiguation over others. We summarize the early work in the following subsection.

Early Work

As reported in the COLING paper, we experimented with a trigram model as well as a dependency model for supertag disambiguation. The trigram model, trained on part-of-speech/supertag pairs collected from the LTAG derivations of WSJ sentences and tested on held-out WSJ sentences, produced the correct supertag for a substantial fraction of the words in the test set. We have since significantly improved the performance of the trigram model by using a larger training set and incorporating smoothing techniques; we present a detailed discussion of that model and its performance on a range of corpora later in this chapter. In this section we describe the dependency model of supertagging that was reported in the early work.

Dependency model

In an n-gram model for disambiguating supertags, dependencies between supertags that appear beyond the n-word window cannot be incorporated. This limitation can be overcome if no a priori bound is set on the size of the window, but instead a probability distribution of the distances of the dependent supertags for each supertag is maintained. We define dependency between supertags in the obvious way: a supertag is dependent on another supertag if the former substitutes or adjoins into the latter. Thus the substitution and foot nodes of a supertag can be seen as specifying the dependency requirements of the supertag. The probability with which a supertag depends on another supertag is collected from a corpus of sentences annotated with derivation structures. Given a set of supertags for each word, and the dependency information between pairs of supertags, the objective of the dependency model is to compute the most likely dependency linkage that spans the entire string. The result of producing the dependency linkage is a sequence of supertags, one for each word of the sentence, along with the dependency information.

Experiments and Results

The data required for the dependency model was collected from all the derivation structures of the WSJ sentences. A derivation structure associates each word with a supertag and links two supertags with substitution and adjunction operations. Using the derivation structures, the word emit probability and the dependency link probability are computed. The word emit probability, defined in the first equation below, is the probability of emitting a word given a supertag. The dependency link probability, defined in the second equation, is the probability that supertag t_j at ordinal distance d_k depends on supertag t_i.

\[ \Pr(w \mid t) = \frac{frequency(w, t)}{frequency(t)} \]

\[ \Pr(t_j, d_k \mid t_i) = \frac{frequency(t_i, t_j, d_k)}{\sum_{t_m} \sum_{d_l :\, sign(d_l) = sign(d_k)} frequency(t_i, t_m, d_l)} \]

Due to sparseness of data, we used POS tags instead of words. Since the POS is completely determined given a supertag, the word emit probability did not play a role in this experiment. The table below shows the data needed for the dependency model of supertag disambiguation. Each entry contains the following information:

Table: Dependency data (columns: POS/supertag pair, direction of the dependent supertags, dependent supertag, ordinal position, probability; example rows for determiner-, noun- and verb-anchored supertags)

- The POS and supertag pair.

- A list of pluses and minuses representing the directions of the dependent supertags with respect to the indexed supertag; the size of this list indicates the total number of dependent supertags required.

- The dependent supertag.

- A signed number representing the direction and the ordinal position of the particular dependent supertag mentioned in the entry, counted from the position of the indexed supertag.

- The probability of occurrence of such a dependency. The sum of the probabilities over all the dependent supertags at all ordinal positions in the same direction is one.

For example, the fourth entry in the table reads that the tree anchored by a verb (V) has a left and a right dependent, and that the first word to the left bearing the given dependent supertag is dependent on the current word. The strength of this association is represented by the probability.
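As an illustration, dependency-link probabilities of this form could be estimated from a corpus of derivation structures along the following lines; the data layout (triples of head supertag, dependent supertag and signed ordinal distance) is assumed here for exposition.

    # A minimal sketch of estimating dependency-link probabilities from
    # derivation structures: relative frequencies, normalized per head
    # supertag and per direction (sign of the distance), as in the
    # equations above.
    from collections import defaultdict

    def estimate_link_probs(derivations):
        counts = defaultdict(int)   # (t_i, t_j, d) -> frequency
        totals = defaultdict(int)   # (t_i, sign of d) -> frequency
        for deriv in derivations:
            # deriv is assumed to be a list of (t_i, t_j, d) dependency triples
            for t_i, t_j, d in deriv:
                counts[(t_i, t_j, d)] += 1
                totals[(t_i, d > 0)] += 1
        # Pr(t_j, d | t_i): probabilities over all dependents on one side
        # of t_i sum to one, matching the definition in the text.
        return {(t_i, t_j, d): c / totals[(t_i, d > 0)]
                for (t_i, t_j, d), c in counts.items()}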

The algorithm for computing a dependency linkage is similar to the CKY algorithm for categorial grammars (Steedman) and for link grammars (Sleator and Temperley). The dependency model of disambiguation works as follows. Suppose a supertag t is a member of the set of supertags associated with a word at position n in the sentence. The algorithm proceeds to satisfy the dependency requirements of t by picking up the dependency entries for each of the directions. It picks a dependency data entry (the fourth entry, say) from the database that is indexed by t, and proceeds to set up a path with the first word to the left that has the required dependent supertag as a member of its set of supertags. If the first word to the left that has the dependent supertag as a member of its set of supertags is at position m, then an arc is set up between the two supertags; the arc is also verified not to kite-string-tangle with any other arcs in the path so far. The path probability is incremented by the log probability of the dependency to reflect the success of the match; the path probability also incorporates the unigram probability of t. On the other hand, if no word is found that has the dependent supertag as a member of its set of supertags, then the entry is ignored. The algorithm makes a greedy choice by selecting the path with the maximum path probability to extend to the remaining directions in the dependency list. A successful supertag sequence is one which assigns a supertag to each position such that each supertag has all of its dependents, and which maximizes the accumulated path probability.

The table below shows results on a held-out test set of Wall Street Journal sentences. The table shows two measures of evaluation. In the first, the dependency link measure, the test sentences were independently hand-tagged with dependency links, which were then used to match the links output by the dependency model. The columns show the total number of dependency links in the hand-tagged set, the number of matched links output by this model, and the percentage correctness. The second measure, supertags, shows the total number of correct supertags assigned to the words in the corpus by this model.



Two arcs (a, c) and (b, d) kite-string-tangle if a < b < c < d or b < a < d < c.
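The footnoted test is easy to state in code; a minimal sketch:

    # The kite-string-tangle (crossing-arcs) test from the footnote: two
    # arcs over word positions tangle when exactly one endpoint of one arc
    # falls strictly inside the other arc.
    def tangles(arc1, arc2):
        a, c = sorted(arc1)
        b, d = sorted(arc2)
        return a < b < c < d or b < a < d < c

    assert tangles((1, 3), (2, 4))        # crossing arcs tangle
    assert not tangles((1, 4), (2, 3))    # nested arcs do not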

Table: Results of the dependency model (columns: criterion, total number, number correct, percentage correct; rows: dependency links, supertags)

Discussion

Since it was first reported in Joshi and Srinivas, we have not continued experiments using this model of supertagging, primarily for two reasons. First, although we are certain that the performance of this model can be greatly improved if lexical probabilities are taken into account, we are restrained by the lack of a large corpus of LTAG-parsed derivation structures, which is needed to reliably estimate the various parameters of this model. We are currently in the process of collecting a large LTAG-parsed WSJ corpus, with each sentence annotated with the correct derivation. A second reason for the disuse of the dependency model for supertagging is that the objective of supertagging is to see how far local techniques can be used to disambiguate supertags even before parsing begins. The dependency model, in contrast, is too much like full parsing and is contrary to the spirit of supertagging.

Recent Work

In recent work, we have improved the performance of the trigram model by incorporating smoothing techniques into the model and training the model on a larger training corpus. We have also proposed some new models for supertag disambiguation. In this section we discuss these developments in detail.

Two sets of data are used for training and testing the models for supertag disambiguation. The first set has been collected by parsing short sentences from the Wall Street Journal, IBM-manual and ATIS corpora, using the wide-coverage English grammar being developed as part of the XTAG system (Doran et al.). The correct derivation, from among all the derivations produced by the XTAG system, was picked for each sentence from these corpora.

The second and larger data set was collected by converting the Penn Treebank parses of the Wall Street Journal sentences. The objective was to associate each lexical item of a sentence with a supertag, given the phrase structure parse of the sentence. This process involved a number of heuristics based on local tree contexts. The heuristics appealed to information about the labels of a node's dominating nodes (parent, grandparent and great-grandparent), the labels of its siblings (left and right), and the siblings of its parent. It must be noted that this conversion is not perfect and is correct only to a first order of approximation, owing mostly to errors in conversion and to the lack of certain kinds of information, such as the distinction between adjunct and argument prepositional phrases, in the Penn Treebank parses. Even though the converted supertag corpus can be refined further, the corpus in its present form has proved to be an invaluable resource for improving the performance of the supertag models, as discussed in the following sections.

Unigram model

Using structural information to filter out supertags that cannot be used in any parse of the input string reduces the supertag ambiguity, but obviously does not eliminate it completely. One method of disambiguating the supertags assigned to each word is to order the supertags by the lexical preference that the word has for them. The frequency with which a certain supertag is associated with a word is a direct measure of its lexical preference for that supertag. Associating frequencies with the supertags, and using them to associate a particular supertag with a word, is clearly the simplest means of disambiguating supertags. The unigram model is therefore given by:

\[ Supertag(w_i) = \arg\max_{t_k} \Pr(t_k \mid w_i) \]

where

\[ \Pr(t_k \mid w_i) = \frac{frequency(t_k, w_i)}{frequency(w_i)} \]

Thus the most frequent supertag that a word is associated with in a training corpus is selected as the supertag for the word, according to the unigram model. For words that do not appear in the training corpus, we back off to the part-of-speech of the word and use the most frequent supertag associated with that part-of-speech as the supertag for the word.
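A minimal sketch of the unigram model with the part-of-speech backoff, under an assumed (word, POS, supertag) training format:

    # Unigram supertagger: pick the supertag most frequently seen with the
    # word in training; for unseen words, back off to the most frequent
    # supertag for the word's POS tag.
    from collections import Counter, defaultdict

    class UnigramSupertagger:
        def __init__(self, training):
            # training: iterable of (word, pos, supertag) triples
            self.word_counts = defaultdict(Counter)
            self.pos_counts = defaultdict(Counter)
            for word, pos, tag in training:
                self.word_counts[word][tag] += 1
                self.pos_counts[pos][tag] += 1

        def tag(self, word, pos):
            if word in self.word_counts:
                return self.word_counts[word].most_common(1)[0][0]
            # Backoff for unknown words: most frequent supertag for the POS.
            return self.pos_counts[pos].most_common(1)[0][0]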

Experiments and Results

We tested the performance of the unigram model on the two sets of data discussed previously. The words are first assigned standard parts-of-speech using a conventional tagger (Church), and are then assigned supertags according to the unigram model. A word in a sentence is considered correctly supertagged if it is assigned the same supertag as it is associated with in the correct parse of the sentence. The results of these experiments are tabulated below.

Table: Results from the unigram supertag model (columns: data set, training-set size, test-set size, number n of top supertags output per word, percentage success; rows: XTAG parses and converted Penn Treebank parses, each for several values of n)

Although the performance of the unigram model for supertagging is significantly lower than the corresponding performance of the unigram model for part-of-speech tagging, it performed much better than expected, considering that the supertag set is much larger than the part-of-speech tagset. One of the reasons for this high performance is that the most frequent supertag for the most frequent words (determiners, nouns and auxiliary verbs) is the correct supertag most of the time. Also, backing off to the part-of-speech helps in supertagging unknown words, which most often are nouns. The bulk of the errors committed by the unigram model consists of incorrectly tagged verbs (subcategorization and transformation), prepositions (noun-attached vs. verb-attached) and nouns (head vs. modifier noun).

N-gram model

We first explored the use of a trigram model for supertag disambiguation in Joshi and Srinivas. That trigram model was trained on part-of-speech/supertag pairs collected from the LTAG derivations of WSJ sentences and tested on held-out WSJ sentences; it produced the correct supertag for a substantial fraction of the words in the test set. A major drawback of this early work was that it used no lexical information in the supertagging process, as the training material consisted of part-of-speech/supertag pairs. Since that early work, we have improved the performance of the model by incorporating lexical information and sophisticated smoothing techniques, besides training on larger training sets. In this section we present the details and the performance evaluation of this model.

In a unigram model, a word is always associated with the supertag that is most preferred by the word, irrespective of the context in which the word appears. An alternate method that is sensitive to context is the n-gram model. The n-gram model takes into account the contextual dependency probabilities between supertags within a window of n words when associating supertags with words. Thus the most probable supertag sequence for an N-word sentence is given by:

\[ \hat{T} = \arg\max_{T} \Pr(T_1 T_2 \ldots T_N)\, \Pr(W_1 W_2 \ldots W_N \mid T_1 T_2 \ldots T_N) \]

where T_i is the supertag for word W_i.

To compute this using only local information, we approximate, assuming that the probability of a word depends only on its supertag:

\[ \Pr(W_1 W_2 \ldots W_N \mid T_1 T_2 \ldots T_N) \approx \prod_{i=1}^{N} \Pr(W_i \mid T_i) \]

and also use an n-gram (trigram, in this case) approximation:

\[ \Pr(T_1 T_2 \ldots T_N) \approx \prod_{i=1}^{N} \Pr(T_i \mid T_{i-2}\, T_{i-1}) \]

The term Pr(T_i | T_{i-2} T_{i-1}) is known as the contextual probability, since it indicates the size of the context used in the model, and the term Pr(W_i | T_i) is called the word emit probability, since it is the probability of emitting the word W_i given the tag T_i. These probabilities are estimated using a corpus where each word is tagged with its correct supertag.
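Given smoothed estimates of these two probabilities (smoothing is discussed next), the most probable supertag sequence can be computed with standard Viterbi decoding over (previous, current) supertag states. The following is an illustrative sketch, with invented names for the probability functions:

    # Viterbi decoding for the trigram supertag model: states are
    # (previous, current) supertag pairs, scored in log space by
    # log Pr(t | t_prev2, t_prev1) + log Pr(word | t).
    import math

    def viterbi(words, candidates, ctx_prob, emit_prob):
        # candidates(word) -> iterable of supertags for that word;
        # ctx_prob and emit_prob are assumed smoothed probability functions.
        START = "<s>"
        beams = {(START, START): (0.0, [])}
        for w in words:
            next_beams = {}
            for (t2, t1), (score, path) in beams.items():
                for t in candidates(w):
                    s = score + math.log(ctx_prob(t, t2, t1)) \
                              + math.log(emit_prob(w, t))
                    key = (t1, t)
                    if key not in next_beams or s > next_beams[key][0]:
                        next_beams[key] = (s, path + [t])
            beams = next_beams
        return max(beams.values())[1]   # best-scoring supertag sequence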

The contextual probabilities were estimated using the relative frequency estimates of the contexts in the training corpus. To estimate the probabilities for contexts that do not appear in the training corpus, we used the Good-Turing discounting technique (Good) combined with Katz's backoff model (Katz). The idea here is to discount the frequencies of events that occur in the corpus by an amount proportional to their frequencies, and to utilize this discounted probability mass in the backoff model to distribute to unseen events. Thus the Good-Turing discounting technique estimates the frequency of unseen events based on the distribution of the frequency of frequencies of observed events in the corpus. If r is the observed frequency of an event, N_r is the number of events with the observed frequency r, and N is the total number of events, then the probability of an unseen event is given by N_1/N. Furthermore, the frequencies of the observed events are adjusted so that the total probability of all events sums to one. The adjusted frequency r* for observed events is computed as:

\[ r^{*} = (r + 1)\, \frac{N_{r+1}}{N_{r}} \]

Once the frequencies of the observed events have been discounted and the frequencies for unseen events have been estimated, Katz's backoff model is used. In this technique, if the observed frequency of an n-gram supertag sequence is zero, then its probability is computed based on the observed frequency of the (n-1)-gram sequence. Thus:

\[
\Pr(T_3 \mid T_1 T_2) =
\begin{cases}
\hat{\Pr}(T_3 \mid T_1 T_2) & \text{if } \hat{\Pr}(T_3 \mid T_1 T_2) > 0 \\
\alpha_1 \Pr(T_3 \mid T_2) & \text{else if } \Pr(T_3 \mid T_2) > 0 \\
\alpha_2 \Pr(T_3) & \text{otherwise}
\end{cases}
\]

\[
\Pr(T_3 \mid T_2) =
\begin{cases}
\hat{\Pr}(T_3 \mid T_2) & \text{if } \hat{\Pr}(T_3 \mid T_2) > 0 \\
\alpha_3 \Pr(T_3) & \text{otherwise}
\end{cases}
\]

where the hatted probabilities denote the Good-Turing discounted estimates, and the constants α_1, α_2 and α_3, which depend on the conditioning context, ensure that the probabilities sum to one.
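A minimal sketch of the Good-Turing step that feeds this backoff scheme, assuming a simple dictionary of n-gram counts:

    # Good-Turing discounting: adjusted count r* = (r + 1) * N_{r+1} / N_r,
    # with the leftover mass N_1 / N reserved for unseen events, as in the
    # equations above.
    from collections import Counter

    def good_turing(counts):
        # counts: dict mapping events (e.g. supertag trigrams) -> frequency r
        n_r = Counter(counts.values())          # frequency of frequencies
        total = sum(counts.values())
        adjusted = {}
        for event, r in counts.items():
            if n_r.get(r + 1):
                adjusted[event] = (r + 1) * n_r[r + 1] / n_r[r]
            else:
                adjusted[event] = r             # no re-estimate available
        p_unseen = n_r[1] / total               # mass for unseen events
        return adjusted, p_unseen

Katz's backoff then distributes p_unseen over lower-order estimates, as shown in the case analysis above.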

The word emit probability for the (word, supertag) pairs that appear in the training corpus is computed using relative frequency estimates:

\[
\Pr(w_i \mid T_i) =
\begin{cases}
\dfrac{N(w_i, T_i)}{N(T_i)} & \text{if } N(w_i, T_i) > 0 \\
\Pr(\mathit{UNK} \mid T_i)\, \Pr(\mathit{word\ features} \mid T_i) & \text{otherwise}
\end{cases}
\]

The counts for the (word, supertag) pairs for words that do not appear in the corpus are estimated using the leaving-one-out technique (Niesler and Woodland; Ney et al.). A token UNK is associated with each supertag, and its count is estimated by:

\[ \Pr(\mathit{UNK} \mid T_j) = \lambda\, \frac{N_1(T_j)}{N(T_j)}, \qquad N_{\mathit{UNK}}(T_j) = \Pr(\mathit{UNK} \mid T_j)\, N(T_j) \]

where N_1(T_j) is the number of words that are associated with the supertag T_j exactly once in the corpus, N(T_j) is the frequency of the supertag T_j, and N_UNK(T_j) is the estimated count of UNK for T_j. The constant (written here as λ) is introduced so as to ensure that the probability is not greater than one, especially for supertags that are sparsely represented in the corpus.

We use word features similar to the ones used in Weischedel et al., such as capitalization, hyphenation and the endings of words, for estimating the word emit probability of unknown words.
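As an illustration, the word features for unknown words might be extracted as follows; the particular feature set and suffix list here are assumptions for exposition, not the exact inventory used:

    # Illustrative unknown-word features: capitalization, hyphenation,
    # digits, and a coarse suffix class.
    def word_features(word):
        return {
            "capitalized": word[:1].isupper(),
            "hyphenated": "-" in word,
            "has_digit": any(ch.isdigit() for ch in word),
            "suffix": next((s for s in ("ing", "ed", "ion", "ly", "s")
                            if word.lower().endswith(s)), None),
        }

    # Pr(word features | T) can then be estimated by the relative frequency
    # of these feature bundles among rare words seen with supertag T.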

Experiments and Results

We tested the performance of the trigram model on various domains: the Wall Street Journal (WSJ), the IBM Manual Corpus and ATIS. For the IBM Manual Corpus and the ATIS domains, a supertag-annotated corpus was collected using the parses of the XTAG system (Doran et al.) and selecting the correct analysis for each sentence. The corpus was then randomly split into training and test material. Supertag performance is measured as the percentage of words that are correctly supertagged by a model, when compared with the key for the words in the test corpus.

Experiment 1: Performance on the Wall Street Journal corpus. We used the two sets of data, from the XTAG parses and from the conversion of the Penn Treebank parses, to evaluate the performance of the trigram model. The table below shows the performance on the two sets of data. The first data set, collected from the XTAG parses, was split into training and test material. The data collected from converting the Penn Treebank was used in two experiments that differed in the size of the training corpus, with both tested on the same test material. A fixed inventory of distinct supertags was used in these experiments.

Table: Performance of the supertagger on the WSJ corpus (columns: data set, size of training set in words, model, size of test set in words, percentage correct; rows: unigram, baseline and trigram models on the XTAG parses, and on the converted Penn Treebank parses at two training-set sizes)

Experiment 2: Performance on the IBM Manual corpus and ATIS. For testing the performance of the trigram supertagger on the IBM-manual corpus, a set of correctly supertagged words was used as the training corpus and a held-out set of words was used as the test corpus. The performance of the supertagger on this corpus is shown in the table below. Performance on the ATIS corpus was evaluated in the same way, using a correctly supertagged training corpus and a held-out test corpus; the performance of the supertagger on the ATIS corpus is also shown in the table below.

As expected, the performance on the ATIS corpus is higher than on the WSJ and the IBM Manual corpus, despite the extremely small training corpus. Also, the performance on the IBM Manual corpus is better than on the WSJ corpus when the size of the training corpus is taken into account. The baseline for the ATIS domain is remarkably high, due to the repetitive constructions and limited vocabulary in that domain; this is also true for the IBM Manual corpus, although to a lesser extent.

Table: Performance of the supertagger on the IBM Manual corpus and the ATIS corpus (columns: corpus, size of training set in words, model, size of test set in words, percentage correct; rows: unigram, baseline and trigram models for each corpus)

The trigram model of supertagging is attractive for limited domains, since it performs quite well with relatively small amounts of training material. The performance of the supertagger can be improved in an iterative fashion, by using the supertagger to supertag larger amounts of training material, which can be quickly hand-corrected and used to train a better-performing supertagger.

Effect of Lexical versus Contextual Information

Lexical information contributes most to the performance of a POS tagger: the baseline performance of assigning the most likely POS for each word already produces high accuracy (Brill), and contextual information contributes a relatively small additional improvement. In contrast, contextual information has a much greater effect on the performance of the supertagger. As can be seen from the above experiments, the baseline performance of the supertagger is far lower than that of a POS tagger, and the performance improves considerably with the inclusion of contextual information. The relatively low baseline performance for the supertagger is a direct consequence of the fact that there are many more supertags per word than there are POS tags. Further, since many combinations of supertags are not possible, contextual information has a larger effect on the performance of the supertagger.

Error-driven Transformation-based Tagger

In an Error-driven Transformation-based (EdTb) tagger (Brill), a set of pattern-action templates is defined; the templates include predicates that test for features of words appearing in the context of interest. These templates are then instantiated with the appropriate features to obtain transformation rules. The effectiveness of a transformation rule in correcting an error, and the relative order of application of the rules, are learnt from a corpus. The learning procedure takes a gold corpus, in which the words have been correctly annotated, and a training corpus that is derived from the gold corpus by removing the annotations. The objective in the learning phase is to learn the optimum ordering of rule applications, so as to minimize the number of tag mismatches between the training and the gold corpus.

Experiments and Results

An EdTb model has been trained using templates defined on a three-word window. We trained the templates on a portion of the WSJ corpus and tested on held-out WSJ material; the resulting model performed at a level of accuracy competitive with the stochastic models described above. The EdTb model provides a great deal of flexibility to integrate domain-specific and linguistic information into the model. However, a major drawback is that the training procedure is extremely slow: it took days to train on the smaller data set, and training on the larger data set could not be completed, due to down-time of the computer during the training process.

Head Trigram Model

As is well known, a trigram model is inadequate at modeling dependencies that extend beyond a three-word window. This may not be a severe limitation in the case of predicting POS tags, since long-range dependencies between POS labels are not very typical. However, since supertags encode rich syntactic information, such as the subcategorization of a verb, syntactic transformations and attachment preferences, the selection of some supertags is dictated by dependencies that usually span beyond a three-word window. The performance of the trigram supertagger can be improved further if these dependencies are modeled. Consider the sentences in (1) to (3) below as examples illustrating the inadequacies of a trigram supertagger.

(1) An assassin in Colombia killed a federal judge on a Medellin street.

(2) First Options didn't have to put any money into the bailout.

(3) The man Mr. Krenz replaces has left an indelible mark on East German society.

In sentence (1), the decision whether the preposition on is assigned an argument supertag, an NP-adjunct supertag or a VP-adjunct supertag is made based on the supertags for federal and judge. Ideally, however, the decision should be based on the supertags of the verb killed and the noun judge. Similarly, in sentence (2), the decision whether the preposition into serves as an adjunct or an argument is made based on the supertags for any and money, since the verb is out of the trigram context. In sentence (3), the relative-clause supertag for the verb replaces would be hard to get, since the word man is beyond the trigram window; this makes the trigram context appear no different from the context in which an indicative transitive supertag would appear. Thus the limitation of a trigram model is that supertags from the preceding context that influence the selection of a supertag for the current word are not visible in the current context. We term the supertags in the prior context that influence the selection of the current supertag head supertags, and the words they are associated with head words.

One method of allowing supertags beyond the trigram window to influence the selection of a supertag is to make them visible in the current context. This can be achieved in one of two ways. First, the size of the window could be increased to more than three words. One major drawback of this method is that, since an a priori bound is set on the size of the window, no matter how large it is, dependencies that appear beyond that window size cannot be adequately modeled. Also, by increasing the size of the window, the number of parameters to be estimated increases exponentially, and sparseness of data becomes a significant problem.

An alternate approach to making head supertags visible in the current context is to first identify the head positions, and then percolate the supertags at the head positions through the preceding context. Once the positions of the head supertags are identified, the contextual probability in a trigram model is changed so as to be conditioned on the supertags of the two previous head words, instead of the two immediately preceding supertags. Thus the contextual probability for the head trigram model is given as:

(An interesting approach to incorporating variable-length N-grams in a part-of-speech tagging model is presented in Niesler and Woodland.)

\[ \Pr(T_1 T_2 \ldots T_N) \approx \prod_{i=1}^{N} \Pr(T_i \mid T_{H^2_i}\, T_{H^1_i}) \]

where T_{H^1_i} and T_{H^2_i} are the supertags associated with the two head words preceding position i.

The head positions are percolated according to the following equations, where H^1_i and H^2_i denote the positions of the two most recent head words up to position i:

\[ \text{Initialize: } H^1_0 = H^2_0 = 0 \]

\[ (H^1_i, H^2_i) = \begin{cases} (H^1_{i-1}, H^2_{i-1}) & \text{if } W_i \text{ is not a head word} \\ (i, H^1_{i-1}) & \text{if } W_i \text{ is a head word} \end{cases} \]
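A small sketch of this head-percolation bookkeeping, assuming the first-pass head tagger has already produced a boolean head/non-head label per word:

    # Track the positions of the two most recent head words while scanning
    # the sentence, so the contextual probability for word i can condition
    # on the supertags at those positions instead of positions i-1 and i-2.
    def head_contexts(is_head):
        # is_head: list of booleans, one per word (output of the head tagger)
        h1 = h2 = None          # positions of the two most recent head words
        contexts = []
        for i, head in enumerate(is_head):
            contexts.append((h1, h2))   # context available when tagging word i
            if head:
                h1, h2 = i, h1          # percolate: word i becomes newest head
        return contexts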

The head trigram supertagger makes two passes. In the first pass, the words are annotated for heads, and in the second pass the head information is used to compute the appropriate contextual probability. For the first pass, the head words can be identified using a stochastic trigram tagger that uses two labels and tags words either as head words or as non-head words. The training and test material for the head-word tagger was created from the Penn Treebank parses of WSJ sentences, using the head percolation rules presented in Jelinek et al. and Magerman. Trained on WSJ material and tested on held-out WSJ data, the tagger assigned the correct head tag to most of the words, improving on the baseline. Most of the head-word tagging errors were mistakes of tagging modifier nouns as head words in noun-group sequences.

Using the head information provided by the head-word tagger, the head trigram supertagger was trained and tested on the same WSJ data. Its supertag accuracy turned out to be lower than that of the trigram model; we suspect that this is because local constraints are not modeled by the head trigram model. We are working towards integrating the two models to produce a mixed model that takes into account both local and non-local constraints.

Head Trigram Model with Supertags as Feature Vectors

As previously mentioned, supertags contain subcategorization information and localize filler-gap dependency information within the same structure. As a result, for example, there are not only supertags that realize the complements of a transitive verb in their canonical positions, but also supertags that realize complements in passives, and supertags for wh-extraction and relative clauses for each of the argument positions. These various supertags are present for each subcategorization frame and are transformationally related to one another in a movement-based approach. The supertags related to one another in this way are said to belong to a family. The supertags for verbs can be organized in a hierarchy, as shown in the figure below. It must be noted that, just as this hierarchy is organized in terms of subcategorization information, similar hierarchies can be organized based on transformations such as relative clause or wh-extraction.

Figure: Supertag hierarchy. The root Verb divides into subcategorization classes (Intransitive, Transitive, Ditransitive, ..., Sentential Complement), each of which further divides into Decl, Imperative, Wh-Subj, Wh-Obj, ..., Passive.

The supertags for other categories, such as nouns, adjectives and prepositions, can also be organized in similar hierarchies, based on features such as modifiers, complements and predicates.

The representation of supertags as atomic labels, as was done in the previous models, does not exploit the relationships among supertags. Instead, supertags can be viewed as feature vectors, where each feature represents a dimension along which the supertags can be clustered. The table below shows the list of features used in representing supertags as feature vectors.

We are currently working on implementing a trigram model with appropriate backoff strategies; the version of the trigram model that treats supertags as feature vectors is given below, following the table of features. A more appropriate model, we believe, for such a distributed encoding of supertags is a maximum entropy model, which has the flexibility to select the best backoff strategy.

Feature: Description

Part-of-speech: part of speech of the anchor
Type of supertag: modifier or complement supertag
Transformation: gerundive, imperative, relative clause, wh-extraction, indicative
Argument position for the transformation: subject / object / indirect object
Passive: flag to indicate a passive supertag
Size of argument frame: number of arguments required by a supertag
Type of argument: constituent type of the argument (NP, S, N, A, Adv, VP, PP)
Direction of argument: relative to the anchor of a supertag (left or right)
Nature of argument: complement or modifiee

Table: The list of features used to encode supertags

We plan to implement a maximum entropy model for supertagging in the near future.

The trigram model over supertag feature vectors is given by:

\[ \hat{T} = \arg\max_{T} \prod_{i=1}^{N} \Pr(W_i \mid T_i)\, \Pr(T_i \mid T_{i-2}\, T_{i-1}) \]

The output of this model is a feature vector for each word of the input sentence. The performance of the model can be evaluated on a number of criteria, the most straightforward being an exact match of the feature vector against the gold standard. However, other metrics of performance could focus on the accuracy of predicting any one of the features used to represent a supertag, or any combination of them. For example, the information pertaining to subcategorization (the number, direction and type of arguments) and the nature of the transformation are the more important features for further processing; these features could be used for evaluating the supertagger.
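As an illustration of the feature-vector view of supertags, the features in the table above might be encoded as follows (the field names are ours, chosen for exposition); partial-match evaluation over a subset of features then falls out naturally:

    # An illustrative encoding of a supertag as a feature vector, following
    # the feature inventory in the table above.
    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass(frozen=True)
    class SupertagFeatures:
        pos: str                        # part of speech of the anchor
        kind: str                       # "modifier" or "complement"
        transformation: str             # e.g. "indicative", "rel-clause", "wh"
        transformed_arg: Optional[str]  # subject / object / indirect object
        passive: bool
        args: Tuple[Tuple[str, str, str], ...]  # (type, direction, nature)

    def match(pred: SupertagFeatures, gold: SupertagFeatures, fields) -> bool:
        # Partial-credit evaluation over a chosen subset of features, e.g.
        # only the subcategorization-related fields.
        return all(getattr(pred, f) == getattr(gold, f) for f in fields)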

Supertagging before Parsing

As mentioned earlier, a lexicalized grammar parser can be conceptualized as consisting of two stages (Schabes et al.). In the first stage, the parser looks up the lexicon and selects all the supertags associated with each word of the sentence to be parsed. In the second stage, the parser searches the lattice of selected supertags in an attempt to combine them, using substitution and adjunction operations, so as to yield a derivation that spans the input string. At the end of the second stage, the parser will not only have parsed the input, but will also have associated a small set of (usually one) supertags with each word.

The supertagger can be used as a front-end to a lexicalized grammar parser, so as to prune the search space of the parser even before parsing begins. It should be clear that by reducing the number of supertags that are selected in the first stage, the search space for the second stage can be reduced significantly, and hence the parser can be made more efficient. Supertag disambiguation techniques, as discussed in the previous sections, attempt to disambiguate the supertags selected in the first stage, based on lexical preferences and local lexical dependencies, so as to (ideally) select one supertag for each word. Once the supertagger selects the appropriate supertag for each word, the second stage of the parser is needed only to combine the individual supertags to arrive at the parse of the input. Tested on a set of WSJ sentences with each word in the sentence correctly supertagged, the LTAG parser took only a small fraction of the time per sentence to yield a parse (combine the supertags and perform feature unification) that it took on the same WSJ sentences without the supertag annotation. The parsing speedup gained by this integration is thus substantial.

In the XTAG system, we have integrated the trigram supertagger as a front-end to the LTAG parser, to pick the appropriate supertag for each word even before parsing begins. However, a drawback of this approach is that the parser fails completely if any word of the input is incorrectly tagged by the supertagger. This problem can be circumvented to an extent by extending the supertagger to produce the N best supertags for each word. Although this extension would increase the load on the parser, it would certainly improve the chances of arriving at a parse for a sentence. In fact, the table below presents the performance of a supertagger that selects at most the top three supertags for each word. The optimum value for the number of supertags that need to be output, so as to balance the success rate of the parser against the efficiency of the parser, must be determined empirically.

A more serious limitation of this approach is that it fails to parse ill-formed and extragrammatical strings, such as those encountered in spoken utterances and unrestricted texts. This is due to the fact that the Earley-style LTAG parser attempts to combine the supertags so as to construct a parse that spans the entire string. In cases where the

Table: Performance improvement of the 3-best supertagger over the 1-best supertagger on the WSJ corpus (columns: data set, size of test set, size of training set, model, percentage correct; rows: trigram (1-best) and 3-best supertag output, for the converted Penn Treebank parses at several training-set sizes)

supertag sequence for a string cannot be combined into a unified structure, the parser fails completely. One possible extension, to account for ill-formed and extragrammatical strings, is to extend the Earley parser to produce partial parses for the fragments whose supertags can be combined. An alternate method of computing dependency linkages robustly is presented in the next section.

Lightweight Dependency Analyzer

Supertagging associates each word with a unique supertag. To establish the dependency links among the words of the sentence, we exploit the dependency requirements encoded in the supertags. Substitution nodes and foot nodes in supertags serve as slots that must be filled by the arguments of the anchor of the supertag. A substitution slot of a supertag is filled by a complement of the anchor, while the foot node of a supertag is filled by a word that is being modified by the supertag. These argument slots have a polarity value reflecting their orientation with respect to the anchor of the supertag. Also associated with a supertag is a list of internal nodes (including the root node) that appear in the supertag. Using this structural information, coupled with the argument requirements of a supertag, a simple algorithm, such as the one below, provides a method for annotating the sentence with dependency links.

Step 1: For each modifier supertag s in the sentence:
            compute the dependencies for s;
            mark the words serving as complements as unavailable for Step 2.

Step 2: For each initial supertag s in the sentence:
            compute the dependencies for s.

To compute the dependencies for supertag s_i of word w_i:
    For each slot d_ij in s_i:
        connect word w_i to the nearest word w_k to the left or right of w_i
        (depending on the direction of d_ij), skipping over marked words,
        such that d_ij is among the internal nodes of s_k, if any such word exists.
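A compact sketch of the two-pass LDA under a toy supertag representation (each supertag listing its slots as (label, direction) pairs, its internal node labels, and whether it is a modifier); the representation is assumed for illustration, not the thesis's implementation:

    # Two-pass lightweight dependency analysis over a tagged sentence.
    def lda(supertags):
        # supertags: list of dicts with keys "slots", "internal", "modifier"
        links = []

        def attach(i, skip):
            filled = []
            for label, direction in supertags[i]["slots"]:
                step = -1 if direction == "left" else 1
                j = i + step
                while 0 <= j < len(supertags):
                    if j not in skip and label in supertags[j]["internal"]:
                        links.append((i, j, label))
                        filled.append(j)
                        break
                    j += step
            return filled

        taken = set()
        # Pass 1: modifier supertags; the words filling their slots are
        # marked unavailable, so Pass 2 skips over them.
        for i, st in enumerate(supertags):
            if st["modifier"]:
                taken.update(attach(i, skip=set()))
        # Pass 2: initial (non-recursive) supertags.
        for i, st in enumerate(supertags):
            if not st["modifier"]:
                attach(i, skip=taken)
        return links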

An example illustrating the output of this algorithm is shown in the table below. The first column lists the word positions in the input, the second column lists the words, and the third lists the names of the supertags assigned to each word by a supertagger. The slot requirements of each supertag are shown in column four, and the dependency links among the words computed by the above algorithm are shown in the seventh column. The two-pass process of dependency linking is shown in the fifth and the sixth columns. A marker beside a number indicates the type of the dependency relation: one symbol for a modifier relation and another for a complement relation.

LDA is a heuristic-based, linear-time, deterministic algorithm which produces dependency linkages that do not necessarily span the entire sentence. LDA can produce a number of partial linkages, since it is driven primarily by the need to satisfy local constraints, without being forced to construct a single dependency linkage that spans the entire input. This, in fact, contributes to the robustness of LDA, and promises to make it a useful tool for parsing the sentence fragments that are rampant in speech utterances, as exemplified by the Switchboard corpus.

Because LDA produces a dependency-annotated sentence as the output of the parsing process, parsing evaluation metrics that measure the performance of constituent bracketing, such as Parseval (Harrison et al.), are unsuitable for evaluating

Table: An example sentence, the implicit interior state of the iteration over the hash table entries has dynamic extent, with the supertags assigned to each word and the dependency links among the words (columns: position, word, supertag, slot requirements, pass 1, pass 2, dependency links)

the performance of LDA. Although the supertags contain constituent information, and it is possible to convert a dependency linkage into a constituency-based parse, the Parseval metric does not have provision for evaluating unrooted parse trees, and the output of LDA can result in disconnected dependency linkages. In a later chapter we suggest metrics that can be used to evaluate dependency-based parsers, and in particular LDA; we also present performance evaluation results for LDA in that chapter.

Discussion

In this section we discuss a few points about the supertag and LDA system. It is often a natural topic of tension to balance the domain specificity of a grammar against its portability. The more domain-specific a grammar is, the better its performance on that domain, but the less portable it is. On the other hand, a wide-coverage grammar is portable, but needs to be specialized to exploit the idiosyncrasies of the domain in order to be efficient. Given a new domain, either the parsing system or the heuristics employed by the grammar have to be retrained. In contrast, the modularization of the supertag and LDA system achieves the balance of domain specificity versus portability quite elegantly. The domain-dependent aspects are factored out and associated with the supertagger, while the LDA remains the domain-independent unit of the system. The probability model of the supertagger models the domain-specific idiosyncrasies, while the LDA uses a wide-coverage grammar and linguistically motivated, domain-independent structures to compute the dependency linkage. Each time the system is ported to a new domain, the only module that needs retraining is the supertagger; the data used by the LDA remains unchanged.

It is interesting to observe that the LDA works by first identifying the adjunct structures, using the distinction between recursive and non-recursive supertags. Once the arguments of recursive structures are identified, they are considered to be skippable for establishing the arguments of non-recursive structures. The two-pass strategy of the LDA thus simulates a stack. Further, the LDA relies on the locality of the arguments, modulo recursive structures, to establish the dependency links. This is made possible by two properties of LTAGs: extended domain of locality and factoring of recursion. The effectiveness of the LDA on some examples involving relative clauses, sentence complements and PP attachment is shown below.

[Table: an example illustrating the working of LDA on a center-embedded construction, "The rat the cat the dog chased bit ate the cheese". Columns as in the previous table; each determiner has slot requirement NP, and each of the verbs "chased", "bit" and "ate" has slot requirement NP NP.]

[Table: an example illustrating the working of LDA on a sentence with long-distance extraction, "who do you think Bill believes John loves". Columns as in the earlier table; "do" has slot requirement VP, "think" and "believes" have NP S, and "loves" has NP NP.]

Applicability to other Lexicalized Grammars

Lexicalized grammars associate with each word richer structures (trees in the case of LTAGs, and categories in the case of Combinatory Categorial Grammars (CCGs)) over which the word specifies syntactic and semantic constraints. Hence every word is associated with a much larger set of more complex structures than in the case where the words are associated with standard parts-of-speech. However, these more complex descriptions allow more complex constraints to be imposed and verified locally on the contexts in which these words appear. This feature of lexicalized grammars can be taken advantage of to further reduce the disambiguation task of the parser, as shown in supertag disambiguation. Hence supertag disambiguation can be used as a general pre-parsing component of lexicalized grammar parsers.

The degree of distinction between supertag disambiguation and parsing varies depending on the lexicalized grammar being considered. For both LTAG and CCG, supertag disambiguation serves as a pre-parser filter that effectively weeds out inappropriate elementary structures (trees or categories) given the context of the sentence. It also indicates the dependencies among the elementary structures, but not the specific operation to be used to combine the structures, nor the address at which the operation is to be performed: an almost parse. In cases where the supertag sequence for the given input string cannot be combined to form a complete structure, the almost parse may indeed be the best one can do.

In the case of LTAG, even though no explicit substitutions or adjunctions are shown, the dependencies among LTAG trees uniquely identify the combining operation between the trees, and the node at which the operation can be performed is almost always unique. Thus supertag disambiguation is almost parsing for LTAGs. In contrast, the dependencies among the CCG categories do not directly identify the combining operations between the categories, since two categories can often be combined in more than one way. Hence, for CCG, further processing needs to be performed after supertag disambiguation to obtain the complete parse of the sentence.

The dependency model of supertag disambiguation is closer to parsing in the dependency grammar formalism. Dependency parsers establish relationships among words, unlike phrase-structure parsers, which construct a phrase-structure tree spanning the words of the input. Since LTAGs are lexicalized and each elementary tree is associated with at least one lexical item, the supertag disambiguation for LTAG can therefore be viewed as establishing the relationships among words, as dependency parsers do. Then the elementary structures that the related words anchor are combined to reconstruct the phrase-structure tree, similar to the result of phrase-structure parsers. Thus the interplay of both dependency and phrase-structure grammars can be seen in LTAGs. Rambow and Joshi (Rambow and Joshi) discuss in greater detail the use of LTAG in relating dependency analyses to phrase-structure analyses, and propose a dependency-based parser for a phrase-structure based grammar.

Summary

In summary, we have presented a new technique that performs the disambiguation of supertags using local information such as lexical preference and local lexical dependencies. This technique, like part-of-speech disambiguation, reduces the disambiguation task that needs to be done by the parser. After the disambiguation, we would have effectively completed the parse of the sentence, and the parser need only complete the adjunctions and substitutions. This method can also serve to parse sentence fragments, in cases where the supertag sequence after the disambiguation may not combine to form a single structure. We have implemented this technique of disambiguation using n-gram models, with probability data collected from an LTAG-parsed corpus. The similarity between LTAG and dependency grammars is exploited in the dependency model of supertag disambiguation.

(Footnote: In some cases, the dependency information between an auxiliary tree and an elementary tree may be insufficient to uniquely identify the address of adjunction, if the auxiliary tree can adjoin to more than one node in the elementary tree, since the specific attachments are not shown.)

(Footnote: An approach similar to that of LTAG, of including a supertag for each syntactic environment a lexical item appeared in, was attempted in Chytil and Karlgren. However, this approach was not exploited for robust parsing.)

(Footnote: The relational labels between two words in LTAG are associated with the address of the operation between the trees that the words anchor.)

Chapter

Exploiting LTAG Representation for Explanation-based Learning

There are currently two philosophies for building grammars and parsers. Wide-coverage, hand-crafted grammars, such as those of Grover et al., Alshawi et al., Karlsson et al. and the XTAG Group, are designed to be domain-independent. They are usually based on a particular grammar formalism, such as Context-Free Grammars, Constraint Grammars or Lexicalized Tree-Adjoining Grammars. Statistically induced grammars and parsers (Jelinek et al.; Magerman; Collins), on the other hand, are trained on specific domains, using a manually annotated corpus of parsed sentences from the given domain.

Aside from the methodological differences in grammar construction, the two approaches differ in the richness of information contained in the output parse structure. Hand-crafted grammars generally provide a far more detailed parse than that output by a statistically induced grammar. Also, the linguistic knowledge which is overt in the rules of hand-crafted grammars is hidden in the statistics derived by probabilistic methods, which means that generalizations are also hidden, and the full training process must be repeated for each domain.

Although hand-crafted, wide-coverage grammars are portable, they can be made more efficient if it is recognized that language in limited domains is usually well constrained, and certain linguistic constructions are more frequent than others. Hence not all the structures of the domain-independent grammar, nor all the combinations of those structures, would be used in limited domains. In this chapter we view a domain-independent grammar as a repository of portable grammatical structures whose combinations are to be specialized for a given domain. We use the Explanation-based Learning mechanism to identify the relevant subset of a hand-crafted, general-purpose grammar needed to parse in a given domain. We exploit the key properties of Lexicalized Tree-Adjoining Grammars to view parsing in a limited domain as a finite-state transduction from strings to their dependency structures.

Explanation-based Learning

Explanation-based Learning (EBL) techniques were originally introduced in the AI literature by Mitchell et al., Minton, and van Harmelen and Bundy. The main idea behind explanation-based learning is that it is possible to form generalizations from single positive training examples if the system can explain why the example is an instance of the concept under study. The generalizer possesses the knowledge of the concept under study and the rules of the domain for constructing the required explanation. The definition of the EBL problem is summarized in the table below.

The objective of EBL in the AI domain is to keep track of suitably generalized solutions (explanations) to problems solved in the past, and to replay those solutions to solve new but somewhat similar problems in the future. Although, put in these general terms, the approach sounds attractive, it is by no means clear whether it would actually improve the performance of the system, an aspect which is of great interest to us here.

Rayner was the first to investigate the usefulness of the EBL technique in the context of natural language parsing systems. The correspondence between the EBL terminology and parsing terminology is shown in the second table below. In parsing, the concept under study is grammaticality of an input. The grammar serves as the domain theory. A training sentence serves as an instance of the concept. A parse of a sentence represents an explanation of why the sentence is grammatical according to the grammar. A derivation tree represents the explanation structure. Parsing new sentences amounts to finding analogous explanations from the explanations for the training sentences.

(Footnote: The definitions in the table below are from Mitchell et al.)

Given:

Goal Concept: A concept definition describing the concept to be learned.

Training Example (Instance): An example of the goal concept.

Domain Theory: A set of rules and facts to be used in explaining how the training example is an example of the goal concept.

Operationality Criterion: A predicate over concept definitions, specifying the form in which the learned concept definition must be expressed.

Goal:

A generalization of the training example that is a sufficient concept definition for the goal concept and that satisfies the operationality criterion.

Terminology:

Generalization of an example: A concept definition that describes a set containing that example.

Explanation: A proof of how an instance is an example of the concept.

Explanation Structure: The proof tree that leads to the proof that serves as an explanation.

Table: The explanation-based learning problem.

As a special case of EBL, Samuelsson and Rayner specialize a CFG-based grammar for the ATIS domain by storing chunks of the parse trees present in a treebank of parsed examples. The idea is to reparse the training examples by letting the parse tree drive the rule expansion process, and to halt the expansion of a specialized rule if the current node meets a tree-cutting criterion. However, the problem of specifying an optimal tree-cutting criterion was not addressed in this work. Samuelsson later used the information-theoretic measure of entropy to derive the appropriately sized tree chunks automatically. Neumann also attempts to specialize a grammar, given a training corpus of parsed examples, by generalizing the parse for each sentence and storing the generalized phrasal derivations under a suitable index.

Later in the chapter we will make some additional comments on the relationship between our approach and some of these earlier approaches. In particular, we show that generalization in previous approaches effectively computes the same structures that the LTAG representation provides to begin with.

EBL Terminology          | Parsing Terminology
Concept                  | Grammaticality
Domain Theory            | Grammar
Instance                 | Sentence
Explanation              | Parse
Explanation Structure    | Derivation Structure
Operationality Criterion | Elementary structures of the grammar

Table: Correspondence between EBL and parsing terminology.

Overview of our approach to using EBL

We are pursuing the EBL approach in the context of a wide-coverage grammar development system called XTAG (Doran et al.). The details of XTAG are discussed elsewhere in this thesis. The training phase of the EBL process involves generalizing the derivation trees generated by XTAG for the training sentences and storing these generalized parses in a generalized parse database, under an index computed from the morphological features of the sentence. The application phase of EBL is shown in the flowchart in the figure below. An index is computed from the morphological features of the words in the input sentence. Using this index, a set of generalized parses is retrieved from the generalized parse database created in the training phase. If the retrieval fails to yield any generalized parse, then the input sentence is parsed using the full parser. However, if the retrieval succeeds, then the generalized parses are input to the "stapler". The stapler is a very impoverished parser that generates all possible attachment sites for all modifiers (if any), besides instantiating the features of the nodes of the trees by term unification. The details of the stapler are discussed in the next chapter.

The three key aspects of LTAG, as discussed earlier, are (a) lexicalization, (b) extended domain of locality and (c) factoring of recursion from the domain of dependencies.

[Figure: Flowchart of the XTAG system with the EBL component. An input sentence passes through the morphological analyzer and the tagger (backed by the morphology and lexical probability databases) into the P.O.S blender; an index is computed and a generalized parse is selected from the generalized parse database. If one is found, tree selection and the stapler (using the trees and syntactic databases) produce the derivation tree by tree grafting; otherwise the sentence is handed to the full parser.]

In this chapter we show that exploiting these key properties of LTAG (a) leads to an immediate generalization of parses for the training set of sentences, (b) achieves generalization over recursive substructures of the parses in the training set, and (c) allows for a finite state transducer (FST) representation of the set of generalized parses, each discussed in its own section below. We then present experimental results evaluating the effectiveness of our approach on more than one kind of corpus, and contrast our approach of using the EBL technique with previous approaches.

Feature-generalization

In LTAGs, a derivation tree uniquely identifies a parse for a sentence. A derivation tree is a sequence of elementary trees associated with the lexical items of the sentence, along with substitution and adjunction links among the elementary trees. Also, the value of the features at each node of every elementary tree is instantiated by the parsing process. Given an LTAG parse, the generalization of the parse is truly immediate, in that a generalized parse is obtained by (a) uninstantiating the particular lexical items that anchor the individual elementary trees in the parse and (b) uninstantiating the feature values contributed by the morphology of the anchor and the derivation process. This type of generalization is called feature-generalization, and it results in a feature-generalized parse.

In other EBL approaches, based on context-free grammars (Rayner; Samuelsson), there is no independent notion of a derivation tree. The context-free parse tree corresponds to both the derivation tree and the derived tree. So, to generalize a parse, it is necessary to walk up and down the parse tree to determine the appropriate subtrees to generalize on and to suppress the feature values. In contrast, in our approach the process of generalization is immediate once we have the output of the parser, since the elementary trees anchored by the words of the sentence define the subtrees of the parse for generalization. Replacing the elementary trees with uninstantiated feature values is all that is needed to achieve this generalization. The process of feature generalization is shown in the figure below: after feature-generalization, trees that differ only in their anchors and feature values are no longer distinct, and each such set is denoted by a single generalized tree in part (b) of the figure.
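The following minimal sketch illustrates this kind of feature-generalization on a derivation tree; the dictionary-based node layout is an assumption for illustration, not XTAG's actual data structure.

```python
# Sketch of feature-generalization: each derivation-tree node keeps its
# elementary-tree identity but drops the lexical anchor and any feature
# values instantiated by morphology or by the derivation process.

def generalize(node: dict) -> dict:
    """Feature-generalize a node of the form
    {'tree': 'alpha1', 'anchor': 'show', 'features': {...}, 'children': [...]}."""
    return {
        'tree': node['tree'],        # keep the elementary-tree identity
        'anchor': None,              # uninstantiate the lexical anchor
        'features': {},              # uninstantiate morphology-derived features
        'children': [generalize(c) for c in node.get('children', [])],
    }
```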

Storing and retrieving feature-generalized parses

The feature-generalized parse of a sentence is stored in a generalized parse database, under an index computed from the training sentence. In this case, the part-of-speech (POS) sequence of the training sentence is treated as the index to the feature-generalized parse. The index for the example sentence "show me the flights from Boston to Philadelphia" is shown in part (c) of the figure below. In the application phase, the POS sequence of the input sentence is used to retrieve a generalized parse, which is then instantiated to the features of the sentence.

[Figure: (a) Derivation structure for the sentence "show me the flights from Boston to Philadelphia": α1[show] dominates α2[me] (2.2) and α4[flights] (2.3); α4[flights] dominates α3[the] (1), β1[from] (0) and β2[to] (0), which dominate α5[Boston] (2.2) and α6[Philadelphia] (2.2). (b) The feature-generalized derivation structure, with each anchor replaced by its part-of-speech. (c) The index V N D N P N P N for the generalized parse.]

We have chosen here to index the generalized parse under the POS sequence of the training sentence. However, it must be noted that, depending on the degree of abstraction from the lexical items, the generalized parse may be stored under different indices containing varying amounts of lexically specific information. The degree of abstraction strongly influences the amount of training material required for adequate coverage of the test sentences, as well as the number of parses assigned to the test sentences.

This method of retrieving a generalized parse allows for parsing of sentences of the same length and the same POS sequence as those in the training corpus. However, in our approach there is another generalization that falls out of the LTAG representation, which allows for flexible matching of the index, so that the system can parse sentences that are not necessarily of the same length as some sentence in the training corpus.
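A minimal sketch of such a parse database keyed on POS sequences follows; the names and the exact-match retrieval are illustrative assumptions (the next section replaces exact matching with regular-expression matching over the indices).

```python
# Sketch of the generalized-parse database indexed by POS sequences.

from collections import defaultdict

parse_db: dict[tuple[str, ...], list] = defaultdict(list)

def store(pos_sequence: list[str], generalized_parse) -> None:
    """Training phase: file the generalized parse under its POS index."""
    parse_db[tuple(pos_sequence)].append(generalized_parse)

def retrieve(pos_sequence: list[str]) -> list:
    """Application phase: exact-match retrieval on the POS index."""
    return parse_db.get(tuple(pos_sequence), [])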

Recursive-generalization

Auxiliary trees in LTAG represent recursive structures. So, if an auxiliary tree is used in an LTAG parse, then that tree, with the trees for its arguments, can be repeated any number of times, or possibly omitted altogether, to obtain parses of sentences that differ from the sentences of the training corpus only in the number of modifiers. This type of generalization can be called recursive-generalization. The figure below illustrates the process of recursive generalization. The two Kleene stars in the regular expression in part (b) of the figure can be merged into one, and the resulting recursive-generalized parse is shown in part (c).

Recursive generalization in other approaches to EBL first requires identifying the appropriate subderivations that are recursive. This is achieved by cutting up the parse tree, as is done by Rayner, or by taking arbitrary cuts of the parse tree, as is done by Neumann. These operations effectively rediscover the set of auxiliary trees that the LTAG provides to begin with.

Storing and retrieving recursive-generalized parses

Due to recursive-generalization, the POS sequence covered by the auxiliary tree and its arguments can be repeated zero or more times. As a result, the index of the generalized parse of a sentence with modifiers is no longer a string but a regular expression pattern over the POS sequence, and the retrieval of a generalized parse involves regular expression pattern matching on the indices. The index for the example sentence is shown in part (d) of the figure below, since the prepositions in the parse of this sentence would anchor auxiliary trees.

[Figure: (a) The feature-generalized derivation tree; (b) the recursive-generalized derivation tree, with a Kleene star on each of the two β1[P] subderivations; (c) the recursive-generalized derivation tree with the two Kleene stars collapsed into one; (d) the index V N D N (P N)* for the generalized parse for the sentence "show me the flights from Boston to Philadelphia".]

The most efficient method of performing regular expression pattern matching is to construct a finite state machine for each of the stored patterns, and then to traverse the machine using the given test pattern. If the machine reaches a final state, then the test pattern matches one of the stored patterns.
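A minimal sketch of this retrieval step follows, with Python's re module standing in for the minimized finite-state machine over the stored index patterns; the pattern and database contents are illustrative.

```python
# Sketch of index matching once recursive generalization introduces Kleene
# stars: each stored index is a regular expression over POS tags, and
# retrieval is pattern matching against the test sentence's tags.

import re

# e.g. the index for "show me the flights from Boston to Philadelphia",
# with the PP modifier starred: V N D N (P N)*
stored_indices = {r'V N D N( P N)*': 'generalized-parse-1'}

def match_index(pos_tags: list[str]):
    key = ' '.join(pos_tags)
    for pattern, parse in stored_indices.items():
        if re.fullmatch(pattern, key):
            return parse
    return None

# Matches the training-length sentence and one with extra PP modifiers:
assert match_index(['V', 'N', 'D', 'N', 'P', 'N', 'P', 'N', 'P', 'N'])
```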

Given that the index of a test sentence matches one of the indices from the training phase, the generalized parse retrieved will be a parse of the test sentence, modulo the modifiers. For example, if the test sentence, tagged appropriately, is

show/V me/N the/D flights/N from/P Boston/N to/P Philadelphia/N on/P Monday/N

then, although the index of the test sentence matches the index of the training sentence, the generalized parse retrieved needs to be augmented to accommodate the additional modifier on/P Monday/N.

To accommodate the additional modifiers and their arguments that may be present in the test sentences, we need to provide a mechanism that assigns the additional modifiers and their arguments the following:

- the elementary trees that they anchor, and
- the substitution and adjunction links to the trees into which they substitute or adjoin.

We assume that the additional modifiers, along with their arguments, would be assigned the same elementary trees and the same substitution and adjunction links as were assigned to the modifier and its arguments in the training example; hence the Kleene star around the subderivation rooted in the preposition in part (b) of the figure above. This, of course, means that we may not get all the possible attachments of the modifiers at this time; but see the discussion of the stapler in the next chapter.

The recursive-generalized parse shown in part (c) of the figure can be stored in the generalized parse database, indexed on the regular expression shown in part (d). However, in the application phase, matching the index requires keeping track of the number of instantiations of the Kleene star in the regular expression index, so as to instantiate the corresponding modifier structure in the derivation tree. In contrast, the finite-state transducer representation overcomes this problem by combining, in a uniform representation, the generalized parse and the POS sequence regular expression by which it is indexed.

Finite-State Transducer Representation

As mentioned earlier, a finite state machine (FSM) provides the most efficient method of performing regular expression pattern matching over the space of index patterns. An index in this representation is a path in the FSM. The generalized parse associated with an index is stored along the path that represents the index. The idea is to annotate each arc of the FSM with the elementary tree associated with that POS, and also to indicate which elementary tree it is to be adjoined or substituted into. This results in a finite state transducer (FST) representation, illustrated by the example below. Consider the example sentence with the generalized derivation tree in the figure below:

show me the flights from Boston to Philadelphia

[Figure: Generalized derivation tree for the sentence "show me the flights from Boston to Philadelphia": α1[V] dominates α2[N] and α4[N]; α4[N] dominates α3[D] and a Kleene-starred β1[P], which dominates α5[N].]

this tree: the elementary tree that the word anchors.

head word: the word on which the current word is dependent; "-" if the current word does not depend on any other word.

head tree: the tree anchored by the head word; "-" if the current word does not depend on any other word.

number: a signed number that indicates the direction and the ordinal position of the particular head elementary tree from the position of the current word; OR an unsigned number that indicates the Gorn address (i.e., the node address) in the derivation tree to which the word attaches; OR "-" if the current word does not depend on any other word.

Table: Description of the components in the tuple representation associated with each word.

An alternative representation of the derivation tree, similar to a dependency representation, is to associate with each word a tuple (this tree, head word, head tree, number), where the components of the tuple are described in the table above. Following this notation, the derivation tree in the figure is represented as

V/(α1, -, -, -) N/(α2, V, α1, -1) D/(α3, N, α4, +1) N/(α4, V, α1, -1) P/(β1, N, α4, 2) N/(α5, P, β1, -1) P/(β1, N, α4, 2) N/(α5, P, β1, -1)

which can be seen as a path in an FST, as in the figure below.

[Figure: Finite state transducer representation for the sentences "show me the flights from Boston to Philadelphia" and "show me the flights from Boston to Philadelphia on Monday". States S1 through S6; arcs V/(α1, -, -, -), N/(α2, V, α1, -1), D/(α3, N, α4, +1), N/(α4, V, α1, -1) and P/(β1, N, α4, 2), with a loop N/(α5, P, β1, -1) from S6 back to S5 for repeated PP modifiers.]

This FST representation is possible due to the lexicalized nature and the extended domain of locality of elementary trees, because of which lexical dependencies are localized to within a single elementary structure. The representation makes a distinction between head-modifier dependencies and head-complement dependencies. For a head-complement dependency relation, the number in the tuple associated with a word is a signed number that indicates the ordinal position of its head in the input string. For a head-modifier dependency relation, the number in the tuple associated with a word is an unsigned number that represents the tree address (Gorn address) of its head in the derivation tree.
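A minimal sketch of this per-word tuple representation follows; the tuple values transcribe the arcs recoverable from the figure and should be read as illustrative.

```python
# Sketch of the (this tree, head word, head tree, number) tuples that label
# the FST path for "show me the flights from Boston to Philadelphia".
# Signed numbers mark head-complement relations (ordinal position of the
# head); unsigned numbers mark head-modifier relations (Gorn address).

from typing import NamedTuple

class Dep(NamedTuple):
    this_tree: str   # elementary tree anchored by the word
    head_word: str   # POS of the head word ('-' if none)
    head_tree: str   # tree anchored by the head word ('-' if none)
    number: str      # signed ordinal (complement) or Gorn address (modifier)

path = [
    ('V', Dep('alpha1', '-', '-', '-')),        # show: the root, no head
    ('N', Dep('alpha2', 'V', 'alpha1', '-1')),  # me: complement of show
    ('D', Dep('alpha3', 'N', 'alpha4', '+1')),  # the: depends on flights
    ('N', Dep('alpha4', 'V', 'alpha1', '-1')),  # flights: complement of show
    ('P', Dep('beta1', 'N', 'alpha4', '2')),    # from: modifier, Gorn address
    ('N', Dep('alpha5', 'P', 'beta1', '-1')),   # Boston: complement of from
    # the (P, N) pair repeats for "to Philadelphia" via the FST loop
]
```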

Extending the Encoding for Clausal Complements

Long distance dependencies in LTAGs arise out of the adjunction of predicative auxiliary trees into other elementary trees. The method of recursive generalization discussed above does not distinguish predicative auxiliary trees from modifier auxiliary trees. Consider the following sentence as a training example:

who did you say flies from Boston to Washington

[Figure: The elementary trees for the sentence "who did you say flies from Boston to Washington": initial trees for "who", "you", "flies", "Boston" and "Washington"; auxiliary trees for "did" (with a VP foot node) and "say" (with an S foot node); and VP-modifier auxiliary trees for "from" and "to".]

The elementary trees for the words of this sentence are shown in the figure above, and the corresponding derivation tree and its generalized versions are shown in the figure below.

[Figure: (a) Derivation structure for "who did you say flies from Boston to Washington": α1[flies] dominates α2[who] (1), β1[say] (2), β3[from] (2.2) and β4[to] (2.2); β1[say] dominates β2[did] (0) and α3[you] (1); β3[from] and β4[to] dominate α4[Boston] (2.2) and α5[Washington] (2.2). (b) The feature-generalized derivation structure. (c) The recursive-generalized derivation structure, with Kleene stars on the β1[V] and β3[P] subderivations.]

The FST representation of this derivation structure is shown in the figure below. Due to recursive generalization, generalized derivations can be obtained for the following sentences as well:

Long-distance dependency: who did you think I said flies from Boston to Washington

Subject extraction: who flies from Boston to Washington

However, in the example for the long-distance dependency, although a generalized derivation tree can be assembled by traversing the FST, the dependency relation between "think" and "flies" is incorrect. This is because the recursive generalization assumes that successive instances of recursion attach to the same head as the initial one. To remedy this situation, the following observation is in order.

[Figure: FST for the sentence "who did you say flies from Boston to Washington".]

Types of auxiliary trees

Auxiliary trees in LTAG have been distinguished as modifier auxiliary trees and predicative auxiliary trees. While the modifier auxiliary trees modify the constituent they adjoin on to, the predicative auxiliary trees subcategorize for the constituent they adjoin on to. Examples of modifier auxiliary trees and predicative auxiliary trees are shown in the figure below.

[Figure: (a) A modifier auxiliary tree: the tree for "from", rooted at NP, with an NP foot node and a PP expanding to P and a substitution NP. (b) A predicative auxiliary tree: the tree for "said", rooted at S, with a substitution NP subject and a VP expanding to V and an S foot node.]

The additional auxiliary trees that result from recursive generalization display different adjunction behaviour depending on the type of the auxiliary tree, as shown in the two sentences below:

who did you say flies from Boston to Washington on Monday

who did you think I said flies from Boston to Washington

The successive repetitions of the modifier in the first sentence can modify the same head ("flies") as did the modifiers in the training example. However, successive repetitions of predicative auxiliary trees take as complement the clause following them. Thus, in the second sentence, "think" adjoins on to "said" and not to "flies".

For the extended encoding, the tuple associated with each word is revised as follows:

this tree: the elementary tree that the word anchors.

head word: the word on which the current word is dependent; "-" if the current word does not depend on any other word.

head tree: the tree anchored by the head word; T_clause if the type of the tree anchored by the head word is a clausal tree; "-" if the current word does not depend on any other word; "*" if the tree does not matter.

number: a signed number that indicates the direction and the ordinal position of the particular head elementary tree from the position of the current word; OR an unsigned number that indicates the Gorn address (i.e., the node address in the derivation tree) to which the word attaches; OR "-" if the current word does not depend on any other word.

Table: Description of the components in the extended tuple representation associated with each word.

Based on the preceding observations, we assume that the additional auxiliary trees, along with their arguments, would be assigned elementary trees, along with substitution and adjunction links, as follows:

- The additional auxiliary trees, along with their arguments, are assigned the same generalized elementary trees as they were assigned in the training example.
- The arguments of the auxiliary trees are assigned the same substitution and adjunction links as they were assigned in the training example.
- If the auxiliary tree is a modifier auxiliary tree, then it is assumed to modify the same head as it did in the training example.
- If the auxiliary tree is a predicative auxiliary tree, then it is assumed to adjoin to the immediately following clause, in terms of string position.

[Figure: (a) The extended tuple sequence and (b) the finite state transducer representation for the sentences "who did you say flies from Boston to Washington", "who did you say flies from Boston to Washington on Monday" and "who did you think I said flies from Boston to Washington". In the FST, the arc for a predicative tree such as "say" carries the tuple V/(β1, V, T_clause, +1), so that repeated predicative trees attach to the immediately following clause, while the arc V/(β2, V, *, +1) for "did" leaves the head tree unconstrained.]

Based on these assumptions, the derivation tree in part (c) of the earlier figure is encoded by associating each word with a tuple (this tree, head word, head tree, number), where the components of the tuple are described in the extended table above. With this extended encoding, and by traversing the FST shown in the figure, the sentence with the long-distance dependency is assigned a derivation tree with the correct dependency links.

It is interesting to note that the complement linkages of a tree are specified in terms of string positions, irrespective of the nature of the tree (recursive or non-recursive), while the adjunct linkages are specified in terms of tree positions.

Experiments and Results

We now present experimental results that show the effectiveness of our approach. Experiments (a) through (c) are intended to measure the coverage of the FST representation of the parses of sentences from a range of corpora: ATIS, IBM-Manual and the Alvey corpus. The results of these experiments provide a measure of the repetitiveness of patterns, as described in this chapter, at the sentence level in each of these corpora.

Experiment (a): The details of the experiment with the ATIS corpus are as follows. A set of ATIS sentences that had been completely parsed by the XTAG system was randomly divided into a training set and a test set using a random number generator. For each of the training sentences, the correct parse was generalized and stored as a path in an FST. The FST was then minimized. The minimized FST was tested for retrieval of a generalized parse for each of the test sentences, which were pre-tagged with the correct POS sequence. When a match is found, the output of the EBL component is a generalized parse that associates with each word the elementary tree that it anchors and the elementary tree into which it adjoins or substitutes: an almost parse.

Experiments (b) and (c): Similar experiments were conducted using the IBM-Manual corpus and using the noun definitions from the LDOCE dictionary that constitute the Alvey test set (Carroll).

[Table: Recall percentage, average number of parses, response times (seconds per sentence for ATIS and IBM, seconds per NP for Alvey) and size of the FST (number of states) for the ATIS, IBM and Alvey corpora, together with the sizes of the training sets.]

The results of these experiments are summarized in the table above. The size of the FST obtained for each of the corpora, the recall percentage, the average number of generalized parses and the traversal time per input are shown in this table. The recall percentage of the FST is the percentage of inputs that were assigned a correct generalized parse. Owing to the high degree of repetitive patterns in the ATIS domain, the size of the FST and the resulting recall are better than those for the IBM-Manual sentences.

(Footnote: See the chapter on supertag disambiguation for the role of the almost parse.)

In the next chapter we discuss the performance improvement gained by incorporating the EBL component into the XTAG system.

Phrasal EBL and Weighted FST

Our approach to using EBL is ideally suited for domains where the patterns are repetitive at the sentence level. However, the method can be extended for domains where the patterns are repetitive at the phrasal level as well. The idea would be to identify the repetitive phrasal subtrees from the derivation trees and to create an FST, as described in the previous sections, for each of those subtrees: an FST for NPs, an FST for PPs, and so on. In the test phase, these FSTs are applied as a sequence of transductions, with each transduction producing the type of the phrase that is recognized, along with the phrase-internal substitution and adjunction links. The head of the phrase (the word that does not depend on any other word in the phrase) identified by the current transduction is used in the next transduction for linking with other words.

Another extension to our approach is to associate weights with the arcs of the FST, so as to produce ranked generalized derivations. Traversing the FST created during the training phase could result in a set of generalized derivation structures. These derivations can be ranked according to their likelihood if each arc of the FST is weighted by the probability of traversing that arc. This can be achieved by simply counting the number of times each arc is traversed during the training phase, and normalizing the arc weights so that the weights of all the outgoing arcs at each node sum to one. In the test phase, a path probability is computed as the product of the probabilities of the arcs that make up that path.
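A minimal sketch of this weighting scheme follows; the state/arc naming and data layout are illustrative assumptions.

```python
# Sketch of arc weighting for ranked generalized derivations: arc-traversal
# counts from the training phase are normalized per state, and a path's
# score is the product of its arc probabilities.

from collections import defaultdict
from math import prod

counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

def observe(state: str, arc: str) -> None:
    """Training phase: count one traversal of this arc."""
    counts[state][arc] += 1

def arc_prob(state: str, arc: str) -> float:
    """Normalize so outgoing arc weights at each state sum to one."""
    total = sum(counts[state].values())
    return counts[state][arc] / total

def path_prob(path: list[tuple[str, str]]) -> float:
    """Test phase: path is a list of (state, arc) pairs."""
    return prod(arc_prob(s, a) for s, a in path)
```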

Discussion

In this section we compare and contrast our approach of using EBL with the previous approaches, and highlight the representational richness provided by LTAGs.

In contrast to our approach, Samuelsson and Rayner specialize a CFG-based grammar for the ATIS domain by cutting up the parse trees and identifying the most frequent subderivations, using a treebank of parsed examples. They then derive a specialized grammar in which the preterminal yields of these frequent subderivations are compiled out. The number of derivation steps in the specialized grammar is relatively smaller than in the original grammar, and hence the parsing process is relatively faster.

Rayner and Samuelsson use the following two rules to cut up the parse tree and recover the subderivations. These rules seem sufficient in the ATIS domain, but are not exhaustive in general for other domains. (A sketch of the rules follows the list.)

- If a node is labeled NP and is not dominated by any other node labeled NP, then cut above it.
- If a node is labeled NP and does not dominate any other node labeled NP, then cut above it.
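The following minimal sketch implements the two cutting rules over a toy parse-tree record; the Node layout is an assumption for illustration.

```python
# Sketch of the two NP tree-cutting rules: a node is a cut point if it is
# an NP not dominated by any NP (rule 1), or an NP that dominates no NP
# (rule 2); in both cases the tree is cut above that node.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list['Node'] = field(default_factory=list)

def dominates_np(n: Node) -> bool:
    """True if any descendant of n is labeled NP."""
    return any(c.label == 'NP' or dominates_np(c) for c in n.children)

def cut_points(n: Node, np_above: bool = False) -> list[Node]:
    """Collect the nodes above which the parse tree is cut."""
    cuts = []
    if n.label == 'NP' and (not np_above or not dominates_np(n)):
        cuts.append(n)  # rule 1 (maximal NP) or rule 2 (minimal NP)
    for c in n.children:
        cuts.extend(cut_points(c, np_above or n.label == 'NP'))
    return cuts
```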

[Figure: (a) Parse tree and (b) tree chunks for the sentence "show me the flights from Boston to Philadelphia"; dotted lines mark the nodes at which the parse tree is cut.]

Applying these two rules to the parse of the example sentence "show me the flights from Boston to Philadelphia" results in the three tree chunks shown in the figure above. The dotted lines in the figure show the nodes at which the parse tree is cut. Notice that the resulting trees are of the following three types, each corresponding to an LTAG elementary tree:

- A tree rooted at S, with lexical items and NP substitution nodes on the frontier, corresponding to an LTAG initial tree with NP arguments.
- A tree rooted at NP, with lexical items on the frontier, corresponding to an LTAG initial tree rooted at an NP node.
- A tree rooted at NP, with NP nodes and lexical items on the frontier, expressing recursion and corresponding to auxiliary trees.

As seen in this example, the tree-cutting rules effectively rediscover the LTAG elementary tree structures. Samuelsson, in continuation of his previous work, uses entropy as a metric to identify locations at which to cut up the parse tree. This method can identify certain subderivations that may be missed by Samuelsson and Rayner.

Neumann, in an attempt to achieve recursive-generalization, allows arbitrary cuts through the parse tree to serve as indices, thus allowing nonterminals (in particular NPs) to be part of the index. As a result, matching the index in the application phase becomes much more complex and requires scanning, prediction and completion steps. In contrast, recursive-generalization in our approach is triggered by lexical items that anchor auxiliary trees, and hence is quite immediate.

Another aspect that is not addressed in previous work is the issue of efficient indexing and retrieval of generalized parses. The FST representation used in our approach provides an optimal method for indexing and retrieving the generalized parses. Lexicalization and the extended domain of locality of LTAGs serve as the key properties for this efficient representation.

(Footnote: The auxiliary trees may not represent minimal recursive structure, as in this case.)

Summary

In this chapter we have shown some novel applications of Explanation-Based Learning techniques to parsing with Lexicalized Tree-Adjoining Grammars (LTAG) in limited domains. The three key aspects of LTAG, (a) lexicalization, (b) extended domain of locality and (c) factoring of recursion from the domain of dependencies, lead to an immediate generalization of parses for the training set of sentences, achieve generalization over the recursive substructures of the parses, and allow for a finite state transducer (FST) representation of the set of generalized parses. We have shown a methodology to speed up the performance of a domain-independent, wide-coverage, hand-crafted grammar when used in limited domains, without sacrificing the portability of the grammar. In the next chapter we present a novel device called the stapler that, in conjunction with the EBL component, functions as a lightweight parser.

Chapter

Stapler

A parser is a device that checks the well-formedness of an input string, given a set of elementary structures (a grammar) and the rules for combining them. The output of a parser is a parse that serves as a proof of the well-formedness of the input string, given the grammar. A parse is the result of searching for, selecting and combining appropriate elementary structures from the set of possible structures defined in the grammar, so as to span the input string; this is represented by the derivation structure for the input string. This aspect of searching through the space of elementary structures of the grammar, to determine the appropriate structures to combine, is the single most time-expensive component of the parsing process. The search space of elementary structures grows rapidly as the degree of ambiguity of words in the input string increases. This is further compounded in wide-coverage grammars, where a variety of elementary structures are included in the grammar to cover the range of linguistic constructions. Due to this enormous search space, parsers for wide-coverage grammars are excruciatingly slow and often perform at speeds that make them impractical.

Lexicalized grammars, however, provide an opportunity to reduce the ambiguity in the elementary structures even before parsing begins. Each lexical item in an input string contributes one elementary structure, from the collection of structures it anchors, towards a parse for the string. If this task of selecting the correct elementary structure is done before parsing begins, then the parser's task is reduced to combining the selected structures using the operations of the formalism. In the context of LTAGs, supertag disambiguation, as discussed earlier, and the EBL technique for parsing, as discussed in the previous chapter, are two approaches for selecting the suitable elementary trees even before parsing begins. In this chapter we introduce a novel device called the stapler, an impoverished parser whose only task is to combine the selected elementary trees to form a parse.

The layout of this chapter is as follows. We first discuss the nature of the input to the stapler. We then discuss the various tasks performed by the stapler to complete a parse. Finally, we present some experimental results demonstrating the speedup obtained by using the stapler in contrast to a conventional parser.

Input to the Stapler: Almost Parse

A stapler is a device used in conjunction with a supertagger or an EBL-lookup system. The output of a supertagger or an EBL-lookup system is one or more supertag sequences, each sequence annotated with dependency information: an almost parse, or generalized derivation tree. Part (a) of the figure below shows the generalized derivation tree resulting from the supertagger, as discussed earlier, and part (b) shows the generalized derivation tree from EBL-lookup, as discussed in the previous chapter. The following are the characteristics of an almost parse:

- Each word of the input is associated with a unique supertag.
- The supertags are linked by directed arcs originating at a dependent supertag and terminating at a head supertag.

The stapler assembles the supertags to obtain the parses of the input, using the dependency information present in the almost parse.

(Footnote: The lexical items are shown as instantiating the supertags in the figure; however, the features on the supertags are not yet instantiated with the morphological features of the lexical items that anchor them.)

(Footnote: A supertag is dependent on another supertag if it substitutes or adjoins into it; the latter is called the head supertag and the former the dependent supertag.)

[Figure: (a) Output of the supertagger for the sentence "The purchase price includes two ancillary companies": α15[includes] dominates α3[price] and α6[companies]; α3[price] dominates α1[the] and β2[purchase]; α6[companies] dominates α10[two] and β3[ancillary]. (b) Output of EBL-lookup for the sentence "Show me the flights from Boston to Philadelphia": α1[show] dominates α2[me] and α4[flights]; α4[flights] dominates α3[the], β1[from] and β2[to], which dominate α5[Boston] and α6[Philadelphia].]

Stapler's Tasks

As can be seen from the figure, the almost parse needs to be augmented with the following information to become a complete parse. The stapler performs these tasks:

- Identify the nature of the operation: dependency links are to be distinguished as either substitution links or adjunction links.
- Modifier attachment: alternate sites of attachment for modifiers need to be computed.
- Address of operation: node addresses are to be assigned to dependency links, to indicate the location of the operation.
- Feature instantiation: the values of the features on the nodes of the supertags have to be instantiated by a process of unification.

In the following sections we discuss each of these subtasks in greater detail.

Identify the nature of the operation

Given the supertag sequence and the dependency links among the supertags, the task of this module is to label the dependency links as either substitution links or adjunction links. This task is extremely straightforward, since the type of the supertags (initial or auxiliary) that a dependency link connects determines the nature of the link: auxiliary trees are adjoined, while initial trees are substituted into other elementary trees.
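Since the link type is fully determined by the dependent tree's type, the step reduces to a one-line test; the alpha/beta naming convention below mirrors the figures and is an assumption about how supertag names are encoded.

```python
# Sketch of the link-labeling step: auxiliary trees (beta) adjoin, initial
# trees (alpha) substitute.

def link_type(dependent_supertag: str) -> str:
    return 'adjunction' if dependent_supertag.startswith('beta') else 'substitution'

assert link_type('beta1') == 'adjunction'
assert link_type('alpha2') == 'substitution'
```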

Modifier Attachment

One of the generalizations made in EBL-lookup is recursive-generalization. It states that if an auxiliary supertag is used in an LTAG parse, then that supertag, with the supertags for its arguments, can be repeated any number of times, or possibly omitted altogether, to obtain parses of sentences that differ from the sentences of the training corpus in the number of modifiers only. However, an assumption is made that the additional instances of modifiers modify the same head as the training instance did. Due to this assumption, EBL-lookup may not provide all the possible attachments of the modifiers for a given input string. This is illustrated in the figure below. Part (a) is the generalized parse resulting from EBL-lookup for the sentence "Give me the cost of tickets on Delta". Part (b) shows the derivation tree instantiated to the words in the sentence. However, it can be clearly seen that the intended derivation tree is the one shown in part (c).

The task of this module is to generate the alternate sites of attachment for the modifiers in the input string, given an almost parse for that string. The algorithm proceeds as follows. To produce alternate attachment sites for the modifiers, the subderivation rooted at the modifier auxiliary tree can be seen as sliding up and down the derivation tree. Each modifier auxiliary tree can slide up from its parent to its grandparent, if there is one, or can slide down from its parent to its sibling. The sliding process, however, has to be constrained so as to avoid crossing dependencies. To do so, each supertag that is associated with a modifier is distinguished as either a left-anchor or a right-anchor supertag, based on the direction of the anchor node of the supertag relative to its foot node. To avoid crossing dependencies, left-anchor supertags can slide up provided they do not have a right sibling in the derivation tree, and right-anchor supertags can slide up provided they do not have a left sibling in the derivation tree; a sketch of this constraint is given after the figure below.

[Figure: (a) Output of EBL-lookup, (b) instantiated derivation tree and (c) intended derivation tree for the sentence "Give me the cost of tickets on Delta". In (b), both β1[of] and β2[on] attach to α4[cost]; in the intended derivation (c), α1[give] dominates α2[me] (2.2) and α4[cost] (2.3); α4[cost] dominates α3[the] (1) and β1[of] (0); β1[of] dominates α5[tickets] (2.2), to which β2[on] (0) attaches, dominating α6[Delta] (2.2).]
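The promised sketch of the sliding constraint follows; the dictionary-based representation of derivation-tree nodes is an illustrative assumption.

```python
# Sketch of the crossing-dependency constraint on sliding a modifier
# subderivation up the derivation tree: a left-anchor modifier may slide
# up only if it has no right sibling, and a right-anchor modifier only if
# it has no left sibling.

def can_slide_up(mod: dict, siblings: list[dict]) -> bool:
    """mod = {'anchor_side': 'left' | 'right', ...}; siblings are the
    modifier's siblings in the derivation tree, in string order."""
    idx = next(i for i, s in enumerate(siblings) if s is mod)
    if mod['anchor_side'] == 'left':
        return idx == len(siblings) - 1   # no right sibling
    return idx == 0                       # no left sibling
```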

Assigning the addresses to the links

The next step towards completing the almost parse is to assign node addresses to the dependency links, to indicate the site of substitution or adjunction. To do this, we need to represent the internal structure of supertags. A supertag is represented by the list of nodes that appear in it: the root node, the anchor node, the foot node, the list of substitution nodes and the list of internal nodes that do not have Null Adjunction constraints. Each node carries its label, its address and its features. The data-structure figure in the next section illustrates this pictorially.

The process of assigning the addresses to the dependency links proceeds in two steps. In the first step, the substitution links are assigned node addresses, and in the second step, the adjunction links are assigned node addresses. This allows for the possibility of exploiting more constraints when assigning the node addresses to adjunction links. The algorithm proceeds as follows.

For each substitution link, the possible substitution sites in the head supertag that match the root label of the dependent supertag are collected. If there is only one such node, then that is the site for substitution. If there is more than one such node, we select the one that respects the linear order of the anchors of the head and dependent supertags, as given in the input string.

For each adjunction link, the non-NA internal nodes of the head supertag whose node labels match the root label of the dependent supertag are collected. If there is only one such node, then that is the site for adjunction. If there is more than one such node, we select one based on the linear order of the anchors of the head and dependent supertags, as given in the input string.

Once a unique node (internal or substitution) has been identified, the feature values are checked for compatibility, according to the substitution and adjunction schema. In the next section we discuss an efficient feature-structure implementation that exploits some of the properties of LTAGs.

Feature Structures and Unification

LTAGs localize the argument structure of lexical items and distinguish recursive structures from the domain of local dependencies. Owing to these properties, features in LTAGs have the following properties:

- coindexing among features is limited to those features that appear within a supertag;
- the feature values are finite and non-recursive.

These properties of features in LTAGs are exploited in an efficient implementation of feature structures that avoids full-blown graph unification. Also, owing to the finite-valued features, it is possible to provide feature negation, feature disjunction and feature overriding mechanisms without any computational calamities.

[Figure: Data structure for an LTAG elementary tree: a supertag record with fields for the tree representation and for the top node, internal node list, substitution node list, anchor node list and foot node, together with a node hash table whose entries hold top and bottom attribute tables of metavariables (VAR) bound, through the environment, to values (VAL).]

Data Structure

The data structure for a supertag with feature information is shown in the figure above. The various nodes of the supertag, along with their addresses, are represented in the fields of the supertag structure. The Node Hash Table is a hash table indexed by the node labels that have features associated with them. Each hash entry in the Node Hash Table is a pair of attribute tables, corresponding to the top and bottom feature structures at that node. Each table is indexed by the attributes present in that feature structure. Each attribute points to a metavariable (VAR), which in turn is either bound to a value (VAL), is unbound, or is bound to another metavariable. The bindings of the metavariables to their values are stored in the Environment hash table, which is indexed by the metavariables. There is one Environment hash table for each supertag. The metavariables are used to encode facts of unification (coindexation equations among attributes): two coindexed attributes are bound to the same metavariable.

In this representation, a feature structure associated with a supertag is a pair of hash tables: the attribute hash table, consisting of the attributes appearing in that supertag, and the environment hash table, consisting of the values of the features in the supertag. The unification algorithm takes two pairs of attribute and environment hash tables, and returns a new pair of attribute and environment hash tables if unification succeeds, and nil otherwise.
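A minimal sketch of unification over such attribute/environment tables follows; the "?"-prefix convention for metavariables, and the assumption that the two supertags use disjoint metavariable names, are illustrative.

```python
# Sketch of the flat unification scheme described above: attributes map to
# metavariables, and an environment binds metavariables to finite,
# non-recursive values or to other metavariables; dereferencing replaces
# full graph unification.

def deref(var, env):
    """Follow metavariable bindings to a value or an unbound variable."""
    while var in env:
        var = env[var]
    return var

def unify(attrs1, env1, attrs2, env2):
    """Unify two (attribute-table, environment) pairs; return the merged
    pair, or None on failure. Values are atomic (finite-valued features);
    metavariables are strings starting with '?'."""
    env = {**env1, **env2}
    attrs = dict(attrs1)
    for attr, var2 in attrs2.items():
        if attr not in attrs:
            attrs[attr] = var2
            continue
        a, b = deref(attrs[attr], env), deref(var2, env)
        if a == b:
            continue
        if isinstance(a, str) and a.startswith('?'):
            env[a] = b                 # bind the unbound metavariable
        elif isinstance(b, str) and b.startswith('?'):
            env[b] = a
        else:
            return None                # clash between two atomic values
    return attrs, env
```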

Experimental Results

In an attempt to measure the speedup obtained by the stapler in comparison to a full parser, we performed the following experiments. Parses of a set of ATIS-domain sentences were generalized and stored as an FST, as discussed in the previous chapter. A set of sentences from the ATIS domain was used as test data for the following experiments.

Experiment (a): The performance of XTAG on the test sentences is shown in the first row of the table below. The coverage represents the percentage of sentences that were assigned a parse.

Experiment (b): The setup for this experiment is shown in the figure below. The almost parse from the EBL lookup is input to the full parser of the XTAG system. The full parser does not take advantage of the dependency information present in the almost parse. However, it benefits from the supertag assignment to the words in the almost parse. This information helps the full parser by reducing the ambiguity of assigning a correct supertag sequence to the words of the sentence.

The speedup shown in the second row of the table is entirely due to this reduction in ambiguity. If the EBL lookup fails to retrieve a parse, which happens for some of the sentences, then the ambiguity in supertag assignment is not reduced, and the full parser parses with all the supertags for the words of the sentence. The drop in coverage is due to the fact that, for some of the sentences, the retrieved generalized parse could not be instantiated to the features of the sentence.

[Figure: System setup used for Experiment (b): input sentence → morphological analyzer and tagger (morphology and lexical probability databases) → P.O.S blender → compute index → generalized-parse selection from the generalized parse database → tree selection → full parser (trees and syntactic databases) → tree grafting → derivation tree.]

[Table: Performance comparison of XTAG with and without the EBL component: number of sentences, coverage and average time in seconds for three systems: XTAG alone, EBL with the XTAG parser, and EBL with the stapler.]

Experiment (c): The setup for this experiment is shown in the figure below. In this experiment, the almost parse resulting from the EBL lookup is input to the stapler, which generates all the possible modifier attachments and performs the efficient unification previously discussed, thus generating all the derivation trees. The stapler uses both the supertag assignment information and the dependency information present in the almost parse, and speeds up the system even further. However, it results in a further decrease in coverage, for the same reason as mentioned in Experiment (b). It must be noted that the coverage of this system is limited by the coverage of the EBL lookup. The results of this experiment are shown in the third row of the table.

[Figure: System setup used for Experiment (c): input sentence → morphological analyzer and tagger → P.O.S blender → compute index → generalized-parse selection from the generalized parse database; if a generalized parse is found, the stapler (with the trees and syntactic databases) produces the derivation tree; otherwise the system fails.]

Summary

In this chapter we have presented a highly impoverished parser, called the stapler, that takes as input the almost parse resulting from the FST induced by the EBL mechanism, computes all the possible parses and instantiates them to the particular sentence. Using this technique, the specialized grammar obtains a speedup by a factor of 15 over the unspecialized grammar on the Air Travel Information Service (ATIS) domain.

Chapter

Parser Evaluation Metrics

Performance evaluation of a parser can be distinguished into three kinds, depending on the purpose it serves. First, intrinsic evaluation measures the performance of a parsing system in the context of the framework it is developed in. This kind of evaluation is applicable to both grammar-based and statistical parsing systems, since it helps system developers and maintainers measure the performance of successive generations of the system. For grammar-based systems, intrinsic evaluation helps identify the shortcomings and weaknesses in the grammar and provides a direction for productive development of the grammar. For statistical parsers, intrinsic evaluation provides a measure of the performance of the underlying statistical model and helps to identify improvements to the model. Since the evaluation is performed in the context of the framework that the parsing system is developed in, the metrics used for intrinsic evaluation can be made sensitive to the features and output representations of the parsing system.

A second method of evaluation of a parsing system is extrinsic evaluation. Extrinsic evaluation is meaningful when a parsing system is embedded in an application, and it refers to the evaluation of the parsing system's contribution to the overall performance of the application. Extrinsic evaluation can be used as an indirect method of comparing parsing systems, even if they produce different representations for their outputs, as long as the output can be converted into a form usable by the application that the parser is embedded in.

(Footnote: These evaluation methodologies are applicable to general-purpose speech and natural language processing systems (Cole et al.; Jones and Galliers).)

A third method of evaluation is comparative evaluation. The objective here is to directly compare the performance of different parsing systems that use different grammar formalisms and different statistical models. Comparative evaluation helps in identifying the strengths and weaknesses of different systems and suggests possibilities of combining different approaches. However, this evaluation scheme requires a metric that is insensitive to the representational differences in the output produced by different parsers. For this purpose, the metric may have to be sufficiently abstracted away from individual representations so as to reach a level of agreement among the different representations produced by parsers. However, as a result of the abstraction process, the strengths of the representations of certain parsers might be lost completely.

In this chapter, we focus on the comparative evaluation scheme. We first discuss the methods of parser and grammar evaluation that have been suggested and used in the literature, and indicate the limitations of these metrics for the purpose of comparative evaluation. We then present our proposal, a Relation-based Model for Parser Evaluation, as an evaluation framework that overcomes the limitations of previous evaluation schemes. Finally, we present the results of evaluating the Supertagger and the Lightweight Dependency Analyzer using this scheme.

Methods for Evaluating a Parsing System

A parsing system can be evaluated along different dimensions, ranging from grammatical coverage, to the average number of parses produced, to the average number of correct constituents in a parse produced by a system. Owing to this multidimensionality, a variety of metrics have been proposed for evaluating a parsing system. A comprehensive survey of different parsing metrics is provided in Briscoe et al. These metrics can be divided into test suite-based and corpus-based methods. The corpus-based methods are further divided into annotated and unannotated methods, depending on whether the corpus is annotated for some linguistic information or not. In the sections that follow, we review each metric and discuss its strengths and weaknesses.

Test suite-based Evaluation

In this traditional method of parsing system evaluation, a list of sentences for each syntactic construction that is covered and not covered by the grammar is maintained as a database (Alshawi et al.; Grover et al.; XTAG-Group; Oepen et al., forthcoming). The test suite is used to track improvements and verify consistency between successive generations of the system that result from the addition of an analysis of a construction to the grammar, or from altering the analysis of a previously analyzed construction in the grammar. Although this method of evaluation has mostly been used for hand-crafted grammars, it could also be used to track the improvements in performance of statistical parsers with the changes in the underlying statistical model. The advantage of this method of evaluation is that it is relatively easy and straightforward, and the negative information provides a direction for improving the system. However, the disadvantage is that it does not quantify how the performance of a parsing system would scale up when parsing unrestricted text data.

Unannotated Corpus-based Evaluation

The following methods also use unrestricted texts as corpora for evaluating parsing systems. However, the corpora consist of sentences which are not annotated with any linguistic information.

Coverage

Coverage is a measure of the percentage of sentences in the corpus that can be assigned one or more parses by a parsing system (Briscoe and Carroll; Doran et al.). It is a weak measure, since it does not guarantee that the analysis found is indeed the correct one; the output needs to be manually checked to determine this.

Average Parse Base

Average Parse Base (APB) (Black et al.) is defined as the geometric mean of the number of analyses, normalized by the number of input tokens, for each sentence parsed. This metric provides a method of predicting the average number of parses that a parsing system would produce for a sentence of length n tokens. The metric is useful in comparing different versions of a parsing system that is under development, when tested on the same data. However, the disadvantage is that the metric does not measure how often the system provides the correct analysis, and systems that have very low coverage would perform very well on this measure. Also, this metric cannot be used for cross-system comparative evaluation, since the inherent ambiguity in the data interacts with the ambiguity of the grammar.
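On one consistent reading of this definition (our sketch, under the assumption that sentence i of the N test sentences has n_i tokens and receives a_i analyses), the APB can be written as

APB = (∏_{i=1}^{N} a_i^(1/n_i))^(1/N)

so that the predicted number of analyses for a sentence of length n tokens is approximately APB^n.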

Annotated Corpus-based Evaluation

The following methods use unrestricted texts as corpora for evaluating parsing systems. The corpora consist of sentences which are annotated with information such as part-of-speech tags, constituency labels, or subject-verb-object triples. The annotation in the corpus is termed the gold standard. The usefulness of the following methods is heavily dependent on the fidelity and consistency of the annotation in the gold standard corpus. These methods also require that the output of the system be converted into the same representation as the gold standard.

Tagging Accuracy

This metric is designed with a view to evaluating the intermediate steps that are produced during the parsing process, such as part-of-speech and/or syntactic tags. The annotations are regarded as tags whose accuracy can be evaluated against a gold standard annotated with the correct tags. The accuracy with which a parsing system assigns the tag to a word is measured in a variety of ways, as the ratio of correct tags per word in running text, possibly divided into ambiguous and unambiguous and/or known and unknown words (Church; Karlsson et al.).

Structural Consistency

The Structural Consistency metric (Black et al.) is intended to measure the constituent accuracy of parses produced by a parsing system. Hence the gold standard annotation consists of sentences with constituent brackets. The Structural Consistency metric is defined as the percentage of sentences of a test corpus which receive one or more global analyses, one of which has no crossing brackets with the gold standard parse. The crossing brackets measure is the number of constituents that are inconsistent with the gold standard. The Structural Consistency metric is an approximation of the more stringent metric of exact match of constituent structure. By itself, the Structural Consistency metric is insufficient, since a grammar that assigns minimal structure (and hence no inconsistent constituents) can score as high on this metric as a grammar that produces every possible bracketing for a sentence.

Ranked Consistency

One of the drawbacks of the Structural Consistency metric is that an a priori bound cannot be placed on how far down the list of parses produced one can expect to find the correct parse. The Ranked Consistency metric is designed to overcome this limitation. It is defined as the percentage of sentences of a test corpus which receive one or more parses, one of the top n ranked of which has no crossing brackets with the gold standard parse. This metric gives a meaningful score as to how often a parsing system produces a consistent analysis with respect to a gold standard. The metric, however, assumes close compatibility of the output of the parsing system and the gold standard annotation. It also assumes that the parsing system produces a ranked order of parses.

Tree Similarity Measure

The tree similarity measures employed by Sampson (Sampson et al.) use a variety of metrics, including the ratio of rules correctly deployed in a derivation to all rules in that derivation, computed from the gold standard. This measure is more fine-grained than full identity of the parse, in that a parsing system may not produce an entirely correct parse, yet consistently produce analyses which are close to the correct one. This measure is also tolerant to errors in the gold standard.

Parseval

A metric that was designed with the aim of comparative evaluation of parsing systems, Parseval (Harrison et al.) is an alternate mechanism of relaxing exact match to the parse in the gold standard as the success criterion. This scheme utilizes only bracketing information to compute three figures: crossing brackets, the number of brackets in the system's parse that cross the treebank parse; recall, the ratio of the number of correct brackets in the system's parse to the total number of brackets in the treebank parse; and precision, the ratio of the number of correct brackets in the system's parse to the total number of brackets in the system's parse. The Parseval scheme served its purpose of providing a cross-representational evaluation metric using which the performance of various parsers could be quantified and compared, a hitherto impossible task.
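To make the computation concrete, here is a minimal sketch (ours, not the official Parseval scorer) that derives the three figures from unlabeled constituent spans, each represented as a (start, end) pair of token offsets:

def crosses(a, b):
    """True if spans a and b overlap without either containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def parseval(gold_spans, system_spans):
    """Crossing brackets, recall, and precision over (start, end) spans."""
    gold = set(gold_spans)
    correct = sum(1 for s in system_spans if s in gold)
    crossing = sum(1 for s in system_spans if any(crosses(s, g) for g in gold))
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(system_spans) if system_spans else 0.0
    return crossing, recall, precision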

Limitations of Parseval

There have been several objections to the Parseval scheme. In this section, we summarize some of the objections to and limitations of this scheme.

First, it was observed that the crossing brackets measure penalizes misattachments more than once (Lin). Consider, for example, the first sentence below, with its annotation, as the gold standard, and the second and third sentences as parses produced by two parsing systems.

(She bought (an incredibly expensive coat (with ((gold buttons) and (fur lining)))) (at the store))

(She bought (an incredibly expensive coat (with ((gold buttons) and (fur lining (at the store))))))

(She bought (an incredibly expensive coat) (with gold buttons and fur lining) (at the store))

Due to the misattachment of the phrase at the store to the phrase fur lining in the second parse, three crossing brackets are created:

between (gold buttons and fur lining) and (fur lining at the store);

between (with gold buttons and fur lining) and (gold buttons and fur lining at the store);

between (an incredibly expensive coat with gold buttons and fur lining) and (an incredibly expensive coat with gold buttons and fur lining at the store).

The second parse is thus penalized on both recall and precision for what is a single attachment error. In contrast, the shallow parse in the third sentence, with its minimal bracketing, has no crossing brackets with the gold standard parse and scores better on precision. Although the second parse contains more information and is closer to the intended parse, it scores lower on the Parseval metric.

A second objection to the Parseval metric is that the precision measure penalizes a parser for inserting extra, possibly correct, brackets when the annotation in the treebank is skeletal (Srinivas et al.). Consider a finer analysis of the phrase an incredibly expensive coat, as shown below.

(an ((incredibly expensive) coat))

Since this structure is more detailed than the structure assigned to the phrase in the treebank, the finer analysis achieves full recall on the phrase but is penalized on precision.

Due to the above-mentioned limitations of the Parseval metric, it is unclear how the score on this metric relates to success in parsing. It is also not clear if this metric can be used for comparing parsers with different degrees of structural fineness, since the score on this metric is tightly related to the degree of structural detail in the gold standard.

A significant limitation of the Parseval metric is that it has not been designed to evaluate parsers that produce partial parses. A partial parser could potentially produce constituents (chunks) that may not be connected in a complete tree. In cases where an attachment of a constituent is hard to make, it might be equally useful to leave the constituent unattached than to force the parser to always attach it to some node. However, the bracket representation used by Parseval is inadequate to represent disconnected trees. In the bracket notation, there is an implicit assumption that the unattached constituents are attached to the root of the tree.

Another significant limitation of the Parseval metric is that it does not relate analyses to parsing tasks or applications. The Parseval metric computes crossing brackets, recall, and precision as a way of approximating the exact match criterion: identity of the output parse with the gold standard parse. Performance using the exact match criterion has a direct interpretation in terms of performance on the tasks that a parser may be put to use for. However, a measure that is an approximation to the exact match criterion, such as Parseval, has a less direct interpretation, since the definition of approximation varies from task to task. Applications that use a parser may be interested in specific structures: Noun Phrases, Appositives, Predicative Nominatives, Subject-Verb-Object relations, and so on. The Parseval metric is not fine-grained enough to evaluate parses with respect to specific syntactic phenomena. It divorces the parser from the application the parser is embedded in.

Our Proposal

We present a proposal for parser evaluation that overcomes the limitations of the Parseval scheme and is also applicable for evaluating partial parsers. It is also intended to serve as a metric for evaluating parsers that are embedded in applications.

Application Independent Evaluation

The objective of this evaluation is to provide a measure that can be used to compare parsers irrespective of their output representation. Parsers are either constituency-based or dependency-based systems. Constituency-based parsers produce a hierarchically organized constituent structure, while dependency-based parsers produce a set of dependency linkages between the words of a sentence. (Dekang Lin proposed using a dependency linkage model for the complete sentence.) Furthermore, parsers could produce full parses that span the entire input string, or partial parses that are a set of locally consistent parses for non-overlapping substrings of the input. Also, in terms of the detail of a parse, statistical constituency-based parsers produce relatively flat structures for noun phrases and verb groups, since the treebanks they are trained off of do not annotate the internal structure of these elements. Hence, parsers that do produce internal structure for these elements need to be normalized first, so that non-recursive phrasal constituents are flattened to chunks such as noun chunks, verb chunks, and preposition chunks.

Also, hierarchically organized constituency notation does not have the provision to represent partial parses, since an unattached constituent is implicitly assumed to attach to the highest constituent. Hence, to facilitate the comparison of partial and full parsers, we propose that the accuracy of hierarchical structures be measured in terms of the accuracy of the relations between chunks, expressed by dependency links between certain words, typically the heads of the chunks. Partial parses are thus penalized by their not being able to express certain dependency links, due to unattached constituents.

Thus, we propose a normalized representation that is based on chunks and dependency links. Evaluating this representation amounts to first evaluating the flat phrasal structures, such as noun chunks, verb groups, and preposition phrases, disregarding the attachment location, and then evaluating the correctness of the hierarchical structures using dependencies between words, typically the heads of the two chunks. If the dependency links are labeled, then the evaluation could be performed with and without including the labels.

Application Dependent Evaluation

The objective of this evaluation is to serve as an Adequacy Evaluation of a parser (see Cole et al.). Adequacy Evaluation involves determining the fitness of a system for a purpose. It provides users (application developers) information to determine whether a parser is adequate for their particular task and, if so, whether it is more suited than others. The result of this evaluation is a report which provides comparative information about the parsers and does not necessarily identify the best parser, thus allowing the user to make an informed choice.

One of the dimensions of adequacy evaluation is to evaluate parsers from the point of view of specific grammatical constructions, such as minimal and maximal Noun Phrases, Appositives, Preposition Phrase modifiers, Predicative constructions, Relative Clauses, Parentheticals, and Predicate-Argument relations. The degree of difficulty and the accuracy of identifying these grammatical constructions given a parse depend on the representation adopted by the grammar underlying the parser. The appropriateness of the representation for a task can only be evaluated by evaluating the accuracy with which these grammatical constructions can be identified. Hence, we recommend that parsers be evaluated based on their performance in identifying specific grammatical constructions.

A General Framework for Parser Evaluation

In the preceding sections, we have identified two kinds of parser evaluations: application-independent evaluation and application-dependent evaluation. In this section, we propose a general framework, called the Relation Model of parser evaluation, that allows the two kinds of evaluations to be instantiated in it. In this framework, a parse is viewed in multiple dimensions, with each dimension expressing a relation R. A parser can be evaluated along any one or all of these dimensions. A gold standard used for evaluation is viewed as expressing one particular relation. The performance of the parser in expressing the relation R is measured in terms of recall, precision, and F-measure. The F-measure provides a parameter that can be set so as to measure the performance of a system depending on the relative importance of recall to precision. A summary of the evaluation framework is shown in the figure below.

Let x R y represent that x is in a relation R with respect to y.
Let S_gold be the relation R expressed in the key (annotated corpus).
Let S_out be the relation R expressed in the output of the parser.

Recall = |S_gold ∩ S_out| / |S_gold|
Precision = |S_gold ∩ S_out| / |S_out|
F-Measure = ((β² + 1) · P · R) / (β² · P + R), where β is the relative importance of Recall and Precision.

Figure: Summary of the Relation Model of Parser Evaluation.

A few instantiations of this general framework are as follows. For evaluating chunks of type t, where t could be a noun chunk, verb chunk, preposition phrase, and so on, R is defined as the relation "x starts and y ends the chunk of type t". A similar instantiation could be used to evaluate the parser on specific constructions. For evaluating dependencies, the relation R is defined as "x depends on y". The dependency links could be evaluated further based on their labels.
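As a concrete illustration of the framework (a minimal sketch of ours, not the evaluation code used in this dissertation), each relation is represented as a set of tuples, e.g. (head, dependent) pairs for dependencies or (start, end, type) triples for chunks:

def relation_scores(gold, out, beta=1.0):
    """Recall, precision, and F-measure for a relation R.

    gold, out: sets of tuples expressing the relation, e.g. (head, dependent)
    pairs for dependencies or (start, end, chunk_type) triples for chunks.
    beta encodes the relative importance of recall to precision.
    """
    correct = len(gold & out)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(out) if out else 0.0
    if precision + recall == 0.0:
        return recall, precision, 0.0
    f = ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
    return recall, precision, f

# Unlabeled dependency evaluation over (head, dependent) pairs:
gold = {("bought", "She"), ("bought", "coat"), ("bought", "store")}
out = {("bought", "She"), ("bought", "coat"), ("lining", "store")}
print(relation_scores(gold, out))  # recall 2/3, precision 2/3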

Evaluation of the Supertag and LDA system

In this section, we discuss the performance of supertagging for text chunking, and the performance of the LDA for dependency analysis, on a range of corpora.

Performance of Supertagging for Text Chunking

We have applied the supertag and Lightweight Dependency Analyzer (LDA) system to text chunking. Text chunking, proposed by Abney, involves partitioning a sentence into a set of non-overlapping segments. Text chunking serves as a useful and relatively tractable precursor to full parsing. Text chunks primarily consist of non-recursive noun and verb groups. The chunks are assumed to be minimal and hence non-recursive in structure.

Noun Chunking

Supertag-based text chunking is performed by the local application of the functor-argument information encoded in supertags. Once the head noun of the noun chunk is identified, the prenominal modifiers that are functors of the head noun, or functors of the functors of the noun, are included in the noun chunk. The head NP is identified as the noun with an initial non-recursive supertag (A_NXN). Thus, the following simple algorithm identifies noun chunks such as those shown in the examples below.

Algorithm: Scan right to left, starting with the noun initial supertag A_NXN, and collect all functors of a noun or of a noun modifier.

Some example noun chunks identified by this algorithm from supertagged WSJ texts are shown below:

New Jersey Turnpike Authority
its increasingly rebellious citizens
two million real estate mortgage investment conduits
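A minimal sketch of this scan (ours; the functor test and the modifier supertag names B_Dnx, B_ad, and B_an are illustrative stand-ins for the functor-argument information actually encoded in supertags):

def noun_chunk(words, supertags, head_index, is_nominal_functor):
    """Scan right to left from a head noun carrying the A_NXN supertag,
    collecting all functors of the noun or of a noun modifier."""
    start = head_index
    while start > 0 and is_nominal_functor(supertags[start - 1]):
        start -= 1
    return words[start:head_index + 1]

# Hypothetical modifier supertags standing in for the functor-argument test:
words = ["its", "increasingly", "rebellious", "citizens", "voted"]
tags = ["B_Dnx", "B_ad", "B_an", "A_NXN", "A_nxVnx"]
print(noun_chunk(words, tags, 3, lambda t: t in {"B_Dnx", "B_ad", "B_an"}))
# ['its', 'increasingly', 'rebellious', 'citizens']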

Ramshaw and Marcus present a transformation-based noun chunker using three specialized tags (I, O, B). Each word of the training corpus is annotated to indicate whether the word is included in a noun chunk (I), outside a noun chunk (O), or at the boundary of two noun chunks (B). A set of transformation rules is then learned using the error-driven transformation-based learning scheme presented in Brill. The performance of this scheme, using rules defined on parts of speech and rules incorporating lexical information, is shown in the table below.

System                              Training Size    Recall    Precision
RM (Baseline)
RM (without lexical information)
RM (with lexical information)
Supertags (Baseline)
Supertags
Supertags

Table: Performance comparison of the transformation-based noun chunker and the supertag-based noun chunker.

The table also shows the performance of the supertag-based noun chunking, trained and tested on the same texts as the transformation-based noun chunker. The supertag-based noun chunker performed better than the transformation-based noun chunker with lexical templates, even though the supertag-based noun chunker does not use lexical context in determining the extent of a noun chunk. The supertagger uses lexical information only on a per-word basis, to pick an initial set of supertags for a given word. Moreover, the supertag-based noun chunking not only identifies the extents of the noun chunks but, by virtue of the functor-argument information, provides internal structure to the noun chunks. This internal structure of the noun chunks could be utilized during subsequent linguistic processing.

Verb Chunking: We also performed an experiment similar to noun chunking, to identify verb chunks. We treat a sequence of verbs and verbal modifiers (including auxiliaries, adverbs, and modals) as constituting a verb group. Similar to the noun chunk identification algorithm discussed above, verb chunking based on supertags can be performed using the local functor-argument information encoded in supertags. However, for verb chunks, the scan is made from left to right, starting with a verbal modifier supertag (either an auxiliary verb or an adverbial supertag) and including all functors of a verb or of a verb modifier. Thus, the following simple algorithm identifies verb chunks such as those shown in the examples below.

Algorithm: Scan left to right, starting with the verb or verbal modifier supertag, and collect all functors of a verb or of a verb modifier.

would not have been stymied
did n't even care
just beginning to collect

System                              Training Size    Recall    Precision
RM (Baseline)
RM (without lexical information)
RM (with lexical information)
Supertags (Baseline)
Supertags
Supertags

Table: Performance comparison of the transformation-based verb chunker and the supertag-based verb chunker.

The table above shows the performance of the transformation-based verb chunker and the supertag-based verb chunker, both trained and tested on the same sentences of the Wall Street Journal corpus. The supertag-based verb chunker had higher precision than the transformation-based verb chunker with lexical information, but had lower recall. The supertag-based verb chunking not only identifies the extents of the verb chunks but, by virtue of the subcategorization information, provides internal structure to the verb chunks. This internal structure of the verb chunks could be utilized during subsequent linguistic processing.

Preposition Phrase Attachment

Supertags distinguish a noun-attached preposition from a verb-attached preposition in a sentence, as illustrated by the two different supertags on the prepositions in the examples below. Due to this distinction, the supertagging algorithm can be evaluated based on its ability to select the correct supertag for the prepositions in a text.

sell railcar platforms to/B_vxPnx Trailer Train Co. of/B_nxPnx Chicago

begin delivery of/B_nxPnx goods in/B_vxPnx the first quarter

The task of preposition attachment has been worked on by a number of researchers in the past. Humans perform this task at only a modest accuracy (Ratnaparkhi et al.), which gives an indication of the complexity of the task. The table below presents a comparative evaluation of various approaches in the literature against the supertag-based approach for the preposition attachment task.

System                                          Accuracy
Ratnaparkhi, Reynar & Roukos
Ratnaparkhi, Reynar & Roukos (with classes)
Hindle & Rooth
Brill & Resnik
Collins & Brooks
Supertags

Table: Performance comparison of the supertag-based preposition phrase attachment against other approaches.

It must be pointed out that the supertagger makes the preposition attachment decision, choosing between B_vxPnx and B_nxPnx, based on a trigram context. More often than not, the preposition is not within the same trigram window as the noun and the verb, due to modifiers of the noun and the verb. However, despite this limitation, the trigram approach performs as well as Ratnaparkhi, Reynar & Roukos and Hindle and Rooth, and is outperformed only by methods (Brill and Resnik; Collins and Brooks) that use four words (the verb, the object noun, the preposition, and the complement of the preposition) in making the attachment decision.
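Concretely, the supertagger's preposition attachment performance can be read off its output by comparing the attachment class of the supertag assigned to each preposition against the gold supertag, as in the following sketch of ours:

def pp_attachment_accuracy(assigned, gold):
    """Accuracy of noun- versus verb-attachment decisions for prepositions.

    assigned, gold: parallel lists of the supertags for the prepositions in
    a text, e.g. B_nxPnx (noun-attached) or B_vxPnx (verb-attached)."""
    correct = sum(1 for a, g in zip(assigned, gold) if a == g)
    return correct / len(gold)

print(pp_attachment_accuracy(["B_vxPnx", "B_nxPnx"], ["B_vxPnx", "B_vxPnx"]))  # 0.5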

Other Constructions

In this section, we summarize the performance of the supertagger in identifying constructions such as appositives, parentheticals, relative clauses, and coordinating conjunctions.

Appositives: In the XTAG grammar, the appositive construction has been analyzed using a Noun Phrase modifying supertag that is anchored by a comma, which takes the appositive Noun Phrase as an argument; for example, the comma preceding the appositive a former Ambassador to Panama in the examples below anchors an appositive supertag. Thus, a comma in a sentence could either anchor an appositive supertag or a set of coordination supertags (one supertag for each type of coordination). Further, a comma also anchors supertags that mark parentheticals. The task of the supertagger is to disambiguate among the various supertags and assign the appropriate supertag to the comma in the appositive context. The table below presents the performance results of the supertagger on this task. The baseline of assigning the most likely supertag to the comma results in zero percent recall.

Parentheticals: Propositional attitude verbs such as say and believe can not only appear in the canonical subject-verb-complement word order, but can also appear at other positions in a sentence, as in the examples below. Also, the relative order of the subject and the verb can be reversed. The XTAG grammar distinguishes such adverbial constructions from the sentence complement construction by assigning different supertags to the verb. A detailed analysis of this construction is presented in Doran. The task of the supertagger is to disambiguate among the sentence complement supertag and the various adverbial supertags for the verb, given the context of the sentence. The table below presents the performance results of the supertagger on this task. Once again, the baseline of selecting the most likely supertag of the verb results in zero percent recall.

"The US underestimated Noriega all along," says Ambler Moss, a former Ambassador to Panama.

Mr. Noriega's relationship to American intelligence agencies became contractual in either 1976 or 1977, intelligence officials say.

Coordinating Conjunctions: In the table below, we also present the performance of the supertagger in identifying coordination constructions. Coordinating conjunctions anchor a supertag that requires two conjuncts, one on either side of the anchor. There is one supertag for every possible pair of conjunct types: we not only have supertags that coordinate like types, but also include supertags that coordinate unlike types, amounting to a large number of different supertags for coordination.

Construction                           # of occurrences    # identified    # correct    Recall    Precision
Appositive
Parentheticals
Coordination
Relative Clauses
Relative Clauses (ignoring valency)

Table: Performance of the trigram supertagger on Appositive, Parenthetical, Coordination, and Relative Clause constructions.

Relative Clauses: As discussed in an earlier chapter, due to the extended domain of locality of LTAGs, verbs are associated with one relative clause supertag for each of the arguments of the verb. This is unlike CCG, where the relative pronoun is associated with multiple supertags. A more detailed discussion of the differences in the distribution of ambiguity between CCG and LTAG is presented in Doran and Srinivas. The task of the supertagger in LTAG is not only to identify that the verb is in a relative clause structure, but also to identify the valency of the verb and the argument that is being relativized. The performance of the supertagger, with and without the valency information being taken into account, is presented in the table above.

We are unaware of any other quantitative results on WSJ data for identifying these constructions, for comparative evaluation. (Lin provides results for some of these constructions on the Brown corpus.) It is interesting to note that the performance of the supertagger in identifying appositive and parenthetical constructions is better than for coordinating conjunctions and relative clause constructions. We believe that this might in part be due to the fact that appositives and parentheticals can mostly be disambiguated using relatively local contexts. In contrast, disambiguation of coordinate constructions and relative clauses typically requires large contexts that are not available to a trigram supertagger.

Performance of the Supertag and LDA system

In this section, we present results from two experiments using the supertagger in conjunction with the LDA system as a dependency parser. In order to evaluate the performance of this system, we need a dependency annotated corpus. However, the only annotated corpora we are aware of, the Penn Treebank (Marcus et al.) and the SUSANNE corpus (Sampson), annotate sentences with constituency trees. In the interest of time, and given the need for a dependency annotated corpus, we decided to transform the constituency trees of the two corpora using rules and heuristics. It must be noted that the resulting corpora are only an approximation to a manually annotated dependency corpus. However, although the annotation resulting from the transformation does not conform to a standard dependency annotation, it is nevertheless an invaluable resource for performance evaluation of dependency parsers.

The WSJ corpus from the Penn Treebank and the subset of the Brown corpus in SUSANNE were converted by two related but different mechanisms. The conversion process for the Penn Treebank proceeded as follows. For each constituent of a parse, a head word is associated, using the head percolation table introduced in Jelinek et al. and Magerman. The head percolation table associates with each possible constituent label an ordered list of the possible children of the constituent whose head word is passed to the parent. The annotation of a parse tree with head word information proceeds bottom-up. At any constituent, the percolation information is consulted to determine which of the constituent's children passes its head word up to the parent. Once the head words are annotated, the dependency notation is generated by making the head words of the non-head constituents dependent on the head word of the head constituent. An example of this procedure is illustrated in the figure below.

In a manner similar to the WSJ conversion process, we converted the subset of the LOB annotation of the SUSANNE corpus into a dependency notation. However, in this conversion process we could not use the head percolation table used in the WSJ conversion, since the annotation labels are different. Instead, we regard a constituent label as a projection of one of its children, which is regarded as the head constituent. Once the heads are annotated, the dependency notation is generated by making the head words of the non-head constituents dependent on the head word of the head constituent.

Figure: The phrase structure tree and the dependency linkage obtained from the phrase structure tree for the WSJ sentence Pierre Vinken, 61 years old, will join the board as a nonexecutive director, Nov. 29. The dependency links are: Pierre → Vinken; years → old; old → Vinken; Vinken → will; the → board; board → join; a → director; nonexecutive → director; director → as; as → join; Nov. → join; join → will.
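A minimal sketch of the WSJ conversion just described (ours; the toy head percolation table and the (label, children) tree encoding are simplifications of the actual resources):

# Toy head percolation table: for each constituent label, an ordered list of
# child labels to search for the head child (illustrative, not the real table).
HEAD_TABLE = {"S": ["VP", "NP"], "VP": ["VB", "MD", "VP"], "NP": ["NN", "NNP", "NP"]}

def find_head(label, children):
    """Pick the head child of a constituent using the percolation table."""
    for candidate in HEAD_TABLE.get(label, []):
        for child in children:
            if child[0] == candidate:
                return child
    return children[-1]  # fallback: treat the last child as the head

def to_dependencies(tree, deps):
    """Return the head word of tree, appending (dependent, head) links to deps.

    A tree is (label, [children...]) for constituents, (POS tag, word) for leaves.
    """
    label, rest = tree
    if isinstance(rest, str):  # leaf
        return rest
    head_words = [to_dependencies(child, deps) for child in rest]
    head_word = head_words[rest.index(find_head(label, rest))]
    for word in head_words:
        if word != head_word:
            deps.append((word, head_word))  # non-head depends on the head word
    return head_word

deps = []
tree = ("S", [("NP", [("NNP", "Vinken")]),
              ("VP", [("MD", "will"),
                      ("VP", [("VB", "join"), ("NP", [("NN", "board")])])])])
print(to_dependencies(tree, deps))  # will
print(deps)  # [('board', 'join'), ('join', 'will'), ('Vinken', 'will')]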

In the first experiment, we use the dependency versions of the Penn Treebank annotation of the WSJ corpus and the LOB annotation of the SUSANNE corpus as gold standards. However, in the second experiment, we use derivation structures of WSJ sentences that were parsed using the XTAG system as the gold standard. The derivation structures serve as dependency structures that are closest in conventions to those assumed by the LDA system.

Experiment 1

The trigram supertagger, trained on the WSJ corpus, was used in conjunction with the LDA to provide a dependency analysis for the sentences of a section of the WSJ corpus. The Penn Treebank parses for these sentences were converted into the dependency notation, which was used as the gold standard. A dependency link produced by the LDA was regarded as correct if the link between the two words was also present in the gold standard. However, to account for the differences in the annotation, the dependency relations among words in a noun group were treated as equivalent to each other, as were the dependency relations among words in a verb group. Recall was measured as the proportion of the dependency links in the gold standard that the LDA produced correctly, and precision as the proportion of the dependency links produced by the LDA that were correct; the resulting scores are shown in the table below.

We conducted a similar experiment using the SUSANNE corpus. Using the same trigram supertagger, trained on the WSJ corpus, in conjunction with the LDA, we provided a dependency analysis for the sentences in the SUSANNE corpus. The gold standard was created by converting the phrase structure annotations for these sentences into the dependency notation, using the procedure described above. As in the case of the WSJ corpus, to account for the differences in the annotation, the dependency relations among words in a noun group were treated as equivalent to each other, as were the dependency relations among words in a verb group. Precision was measured as the proportion of the dependency links in the output of the LDA that were also present in the gold standard, and recall as the proportion of the links in the gold standard that were produced by the LDA. It is interesting to note that, although the trigram model was trained on the WSJ corpus, the performance of the LDA on SUSANNE is comparable to its performance on the WSJ corpus.

On analyzing the errors, we discovered that several of the errors were due to the approximate nature of the dependency corpus created by the conversion process. A second source of errors was the difference between the notion of dependency produced by the LDA system and that resulting from the conversion of the Treebank. This is largely due to the lack of a standard dependency notation.

Corpus    System    # of dependency links    # produced by LDA    # correct    Recall    Precision
Brown     LDA
WSJ       LDA

Table: Comparative evaluation of the LDA on the Wall Street Journal and Brown corpora.

Experiment 2

In this experiment, we used the derivation structures produced by XTAG for a set of WSJ sentences (all below a length bound) as the gold standard. The correct derivation was hand-picked for each of the WSJ sentences. These sentences were supertagged using the supertagger trained on WSJ sentences, and dependency annotated using the LDA system. The metric for evaluation was recall and precision, computed based on the number of dependency links present in both the gold standard and the output produced by the system. The table below shows the performance of the system. Although the derivation trees produced by the XTAG system are the closest, in terms of the conventions used, to the dependency output of the LDA system, there are a few points of divergence, a major one being the annotation for sentence complement verbs: while in the LDA system the embedded verb depends on the matrix verb, the reverse is the case in an LTAG derivation structure (Rambow et al.). Also, since the XTAG system as implemented does not permit two auxiliary trees to be adjoined at the same node, certain valid derivation structures are not possible in XTAG. A precise formulation of the possible derivation structures in the LTAG framework is presented in Schabes and Shieber.

Training Size (words)    Test Size (words)    Recall    Precision

Table: Performance of the LDA system compared against the XTAG derivation structures.

% of sentences    % of sentences    % of sentences    % of sentences
with 0 errors     with 1 error      with 2 errors     with 3 errors

Table: The percentage of sentences with zero, one, two, and three dependency link errors.

Next, we present the performance of the LDA in terms of its accuracy at producing a dependency linkage at the level of a sentence, when evaluated against an LTAG derivation tree. The table above tabulates the percentage of sentences that have zero, one, two, and three dependency link errors. In contrast to evaluation against a skeletally bracketed treebank, evaluation against LTAG derivation trees is much more strict. For example, an LTAG derivation contains detailed annotation of the internal structure of the nominal modifiers in Noun Phrases and the verbal modifiers in verb groups. Also, a derivation structure imposes a stricter argument-adjunct distinction, and distinguishes readings that have similar phrase structure trees, such as predicative and equative readings, and idiomatic and non-idiomatic readings of a sentence. Further, the derivation structure is much closer to a semantic interpretation of a sentence than the phrase structure. Hence, the performance figures shown in the two tables above are stricter, and hence more significant, than crossing bracket, precision, and recall figures measured against skeletally bracketed corpora.

Summary

This chapter can be seen as consisting of two parts. In the first part, we reviewed some of the currently prevalent evaluation metrics for parsers and indicated their limitations for comparing parsers that produce different representations. We proposed an alternate evaluation metric, based on the Relation Model of evaluation, that overcomes the limitations of the existing evaluation metrics. In the second part of this chapter, we provided results from an extensive evaluation of the supertagger and the LDA system on a range of tasks, and compared their performance against the performance of other systems on those tasks.

Chapter

Applications of Supertagging

Supertagging has been used in a variety of applications, including information retrieval and information extraction, text simplification, and language modeling. Some of these applications have been developed in collaboration with researchers at IRCS (Dr. Baldwin, Dr. Chandrasekar, and Dr. Niv) and students (Christine Doran and Jeffrey Rayner) in the Department of Computer and Information Sciences at the University of Pennsylvania.

Although it is certainly true that a full parser can be used in all the applications discussed in this chapter, our goal in using a supertagger was to exploit the strengths of supertags as the appropriate level of lexical description needed for most applications. Also, due to their local nature, supertags were amenable to spot retraining. By spot retraining we mean that, if an application needs to make a certain distinction, say, for example, supertagging a time noun phrase (e.g., last July) differently from a common noun phrase (e.g., primary elections), the training material for the supertagger can be easily changed and the supertagger can be retrained quickly. The high speed and low memory requirements of the supertagger proved to be useful advantages as well.

The outline of this chapter is as follows. We first discuss the use of syntactic information, in the form of supertags, in an information retrieval system, comparing the retrieval efficiency of a system based on part-of-speech tags against a system based on supertags. We then discuss the application of supertags in an information extraction task. Next, we present the application of supertags as syntactically motivated class labels in a class-based language modeling task. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application, which is discussed after that. Finally, we discuss the ability to bring to bear document-level and domain-specific constraints during the supertagging of a sentence.

Information Retrieval

This work has been carried out in collaboration with Dr. Chandrasekar, Visiting Scholar, IRCS, University of Pennsylvania.

The availability of vast amounts of useful textual information in machine-readable form has led to a resurgence of interest in Information Retrieval (IR). There is considerable academic and business interest now in analyzing unrestricted text to extract information of relevance, and in the development of information retrieval and information extraction systems.

Although the ultimate goal in Information Retrieval is to have a system which understands all the text that it processes and responds to users' queries intelligently, there are many open problems in syntactic processing, semantic analysis, and discourse processing that need to be solved before the tools that are required for understanding texts can be developed.

As a result of this limitation of not being able to automatically understand texts, most IR systems approximate linguistic information by using keywords and features such as proximity and adjacency operators. But the standard problems of synonymy and polysemy adversely affect the recall and precision of information retrieval. Users typically have to scan a large number of potentially relevant items to get to the information that they are looking for. See Salton and McGill, and Frakes and Baeza-Yates, for details on work in information retrieval.

Clearly, it is inadequate to just retrieve documents which contain keywords of interest. Since any coherent text contains significant latent information, such as syntactic structure and patterns of language use, this can be exploited to make information retrieval more efficient.

In this section, we demonstrate quantitatively how syntactic information is useful in filtering out irrelevant documents. In contrast to earlier approaches, which used syntactic information during the information retrieval stage (for example, Croft et al.), we use it in a filtering stage after basic information retrieval. We also compare two different syntactic labelings, simple part-of-speech (POS) labeling and supertag labeling, and show how the richer, more fine-grained representation of supertags leads to more efficient and effective document filtering.

Our task is to develop a system which uses the syntactic information inherent in text to improve the efficiency of retrieval, given any basic IR tool, by automatically or semi-automatically identifying patterns of relevance.

Methodology

In this section, we describe how augmented patterns for the domain of appointments of people to posts are extracted from a corpus of news text and applied to filter out irrelevant information. In this description, all training and testing is done with sentences as the units of information. However, the methodology applies to larger units of text, where the relevant sentences are extracted using simple keyword matching.

Identifying the Training Set

A large textual corpus is first segmented into sentences. All sentences related to the words of interest, in this case appoint or a morphological variant, are extracted using some simple tool. These sentences are examined to see which of them are relevant to the domain of interest. Some decisions about the relevance of sentences may not be very easy. These decisions determine the scope of the filtering that is achieved using this system. Augmented patterns are then induced from the relevant sentences.

Inducing Patterns from Training Data

The training data is first tokenized into sentences, and the relevant sentences are processed to identify phrases that denote names of people, names of places, or designations. These phrases are effectively converted to one lexical item. The chunked relevant sentences are then tagged with POS tags or supertags, and the tags associated with the words in the sentences are used to create noun groups (involving prenominal modifiers) and verb groups (involving auxiliaries, modals, verbs, and adverbs). At this stage, we have an abstract view of the structure of each sentence.

We look at a small window around the words of our interest, in this case one chunk on either side of the word appoint or its morphological variant, skipping punctuation marks. The word and the syntactic labels in this window are then generalized to a small set of augmented patterns, where each augmented pattern is a description involving tags, punctuation symbols, and some words. The patterns for all sentences are then sorted and duplicates removed. The resulting patterns can be used for filtering. Once a set of patterns is identified for a particular class of query, it can be saved in a library for later use.

Generalization brings out the syntactic commonality between sentences and permits an economical description of most members of a set of sentences. We expect that a few patterns will suffice to describe the majority of sentences, while several patterns may be necessary to describe the remaining sentences. We could also sort patterns by the number of sentences that each of them describes, and ignore all patterns below some reasonable threshold. Note that generalization could increase recall while reducing precision, while thresholding decreases recall.
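As an illustration of the induction step (a sketch of ours, with a simplified chunk representation rather than Glean's actual pattern language), a window of one chunk on either side of the keyword is generalized to chunk labels:

def induce_pattern(chunks, keyword="appoint"):
    """Generalize a window of one chunk on either side of the keyword.

    chunks: list of (text, label) pairs for a chunked, tagged sentence; labels
    such as E_NG (noun group) and E_VG (verb group) stand in for the tags
    assigned during chunking."""
    for i, (text, label) in enumerate(chunks):
        if keyword in text:  # also matches morphological variants, e.g. "appointed"
            left = chunks[i - 1][1] if i > 0 else ""
            right = chunks[i + 1][1] if i + 1 < len(chunks) else ""
            return " ".join(p for p in (left, text + "/" + label, right) if p)
    return None

chunks = [("The board", "E_NG"), ("has appointed", "E_VG"), ("a new CEO", "E_NG")]
print(induce_pattern(chunks))  # E_NG has appointed/E_VG E_NG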

Pattern Application

The task in the pattern application phase is to employ the patterns induced in the pattern training phase to classify new sentences into relevant and irrelevant ones. The new sentences could be part of documents retrieved from newswire texts, from the World Wide Web (WWW), etc. The relevance of a document is decided on the basis of the relevance of the sentences contained in it.

In this phase, the sentences of each document are subjected to stages of processing similar to those applied to the training sentences. Each sentence in each document is chunked based on simple named-entity recognition and then labeled with syntactic information (POS tags or supertags). The syntactic labels for the words in each sentence are used to identify noun and verb chunks. At this stage, the sentence is ready to be matched against the patterns obtained from the training phase. A sentence is deemed to be relevant if it matches one or more of these patterns. Since the pattern matching is based on simple regular expressions specified over words and syntactic labels, it is extremely fast and robust. A document is deemed relevant if it contains at least one relevant sentence.
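A minimal sketch of the matching step (ours; the word/TAG token format and the wildcard syntax are simplifications of the actual augmented patterns):

import re

def compile_pattern(pattern):
    """Compile an augmented pattern of word/TAG tokens into a regex.

    A '*' in the word position matches any single word; the compiled pattern
    matches anywhere in a tagged sentence."""
    parts = []
    for token in pattern.split():
        word, tag = token.rsplit("/", 1)
        w = r"\S+" if word == "*" else re.escape(word)
        parts.append(w + "/" + re.escape(tag))
    return re.compile(r"\s+".join(parts))

def is_relevant(tagged_sentence, patterns):
    """A sentence is deemed relevant if it matches one or more patterns."""
    return any(p.search(tagged_sentence) for p in patterns)

sentence = "The_board/A_NXN appointed/A_nxVnx Mr._Smith/A_NXN"
patterns = [compile_pattern("*/A_NXN appointed/A_nxVnx */A_NXN")]
print(is_relevant(sentence, patterns))  # True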

Figure: Overview of the Information Filtering scheme. In the pattern training phase, the training set passes through a preprocessor and tokenizer, a syntax analyzer, and a pattern extractor to produce patterns. In the pattern application phase, the documents retrieved by the information retrieval system for a query pass through the same preprocessing, tokenization, and syntax analysis, and are then filtered by the pattern matcher to yield the filtered documents.

Such a tool for information filtering, named Glean (Chandrasekar and Srinivas), is being developed in a research collaboration between the National Centre for Software Technology (NCST), Bombay, the Institute for Research in Cognitive Science (IRCS), and the Center for the Advanced Study of India (CASI), University of Pennsylvania. Glean seeks to innovatively overcome some of the problems in IR mentioned above, and is predicated on the idea that any coherent text contains significant latent information, such as syntactic structure and patterns of language use, which can be used to enhance retrieval efficiency. In particular, Glean uses the notion of agents, a combination of augmented textual patterns and code, to identify specific structural features in the text.

In the next section, we describe experiments on retrieval efficiency using two syntactic labeling schemes, POS tagging and supertagging. We also discuss another experiment that uses this methodology to filter information retrieved from the World Wide Web.

The Experiment: POS Tagging vs. Supertagging

The objective of this experiment is to quantitatively measure the performance improvement achieved by richer syntactic information for filtering out irrelevant documents.

The Experiment: Training Phase

The text corpus consisted of New York Times (NYT) data, comprising the August wire service output; NYT text was chosen since it was felt that it would have more variety than, for instance, the Wall Street Journal. The corpus was sentence-segmented, and all sentences from this corpus that contained the word appoint or any of its morphological variants were extracted using the Unix tool grep. (We plan to handle other equivalent words and phrases later.)

The sentences containing appoint were examined manually, and a subset of them, the sentences which were actually relevant to appointments being announced, was identified. This constituted the training corpus. It included sentences about the appointments of groups of people, such as panels, commissions, etc. We rejected sentences where appointed was used as an adjectival or in a relative clause, and where a variant of appoint was used in the noun sense (e.g., appointee, appointment). Most of the acceptable sentences came from sentences with the word appointed; a few came from appoint and appointment.

Patterns obtained using POS tags: The relevant sentences were preprocessed to normalize punctuation. Named entities were identified and grouped into single tokens. These sentences were then POS tagged, and the tags were used to chunk verb groups and noun phrases. The chunked sentences were then processed to obtain generalized patterns.

Patterns obtained using supertags: In a similar manner, the training sentences were processed to obtain distinct patterns using supertags instead of POS tags.

Sample patterns involving POS tags and supertags, respectively, which match sentences that contain a noun phrase, followed by the transitive verb appointed (possibly qualified by auxiliaries and preverbal adverbs), followed by a noun phrase, are shown in the figure below. Using such patterns, sentences from a variety of domains were categorized into relevant and irrelevant sentences.


POS:      nS(E_NG) nS(E_VGQUAL) appointed/VBN(E_VG) nS(E_NG)
Supertag: nS(A_NXN) nS(E_VGQUAL) appointed/A_nxVnx(E_VG) nS(A_NXN)

Key: nS refers to any word/phrase; E_VGQUAL is any set of verbal qualifiers; E_VG is a verb group; A_NXN is a noun-phrase supertag; A_nxVnx refers to a verb preceded and followed by a noun phrase (a transitive verb); E_NG is a tag for a noun group; and VBN is a POS tag for a past participle verb.

Figure: Sample patterns involving POS tags and supertags.

The Experiment: Testing Phase

From the July NYT wire service text, all sentences containing the word appoint or its variants were extracted using grep. This constituted the base case, where sentences are retrieved using a simple retrieval mechanism such as grep, with no filtering applied.

Domain                     Total    Rel.     Classified as relevant          Classified as irrelevant
                           Sents    Sents    Total   Correct   Incorrect     Total   Correct   Incorrect
NYT July (Supertags)
NYT July (POS)
Base Case

Table: Classification of appoint sentences.

These sentences also constituted the test set. The gold standard was independently created by manually examining these sentences and classifying them into relevant and irrelevant sentences with respect to the task.

The patterns obtained from the training phase were applied to the sentences. As before, these sentences were processed in a manner similar to the training data. All sentences which matched the augmented patterns were deemed relevant by the program. The relevant and irrelevant sets for each method (POS tags and supertags) were compared to the standard relevant and irrelevant sets.

The results of these experiments are summarized in the classification table above for the supertagging method, the POS tag method, and the base case. The second column in that table gives the count of sentences judged relevant by humans. The columns that follow list the judgments made by the program and the overlap they have with the standard set.

Domain                            Recall    Precision    F-score    F-score    F-score
NYT July (Supertagging)
NYT July (Part of Speech)
Base Case (all appoint)

Table: Precision and recall of the different filters for relevant sentences.

Domain                            Recall    Precision    F-score    F-score    F-score
NYT July (Supertagging)
NYT July uniq (Part of Speech)
Base Case (all appoint)

Table: Precision and recall of the different filters for irrelevant sentences.

We have computed the recall, precision, and F-score measures for the three methods, as shown in the tables above. The F-score (Hobbs et al.) is defined as follows:

F-score = ((β² + 1) · P · R) / (β² · P + R)

The F-score provides a method of combining recall and precision. It provides a parameter, β, that can be set to measure the performance of a system depending on the relative importance of recall to precision.

The first of these tables shows the recall and precision scores for relevant sentences for each of the three methods. It also shows F-scores for three cases: precision as important as recall, precision less important than recall, and precision more important than recall. Similar results for irrelevant sentences are summarized in the second table.

POS Tagging vs. Supertagging: Merits and Demerits

In this section, we briefly summarize the merits and demerits of the POS and supertag labels used in the experiments above.

Granularity of description

The tags (POS and supertag labels) employed in our approach serve to categorize the different syntactic contexts of a word. In general, a richer set of tags leads to a better discrimination of the context. The supertags provide an optimal descriptive granularity, which is neither at the level of complete parses, which may often be hard or impossible to obtain, nor at the simpler level of parts of speech. Since supertags are richer in representation when compared to POS tags, it is expected that supertags would provide better discrimination.

Sources of Error

Both POS tagging and supertagging use statistical methods, and their accuracy and completeness depend on the material that was used to train them. The genre of the training material also introduces lexical biases. In addition, the vastly bigger tagset for supertagging makes it more prone to mistakes than POS tagging. Further, errors in tagging during the pattern training and pattern application phases can cause erroneous patterns to be created and lead to wrong categorization of documents, respectively.

Gleaning Information from the Web

In this section, we describe another experiment which uses the techniques discussed above on documents about appointments retrieved from the World Wide Web.

Design of the Experiment

Given a search expression, the URLs (Uniform Resource Locators) of the documents that match the search expression are retrieved using a publicly available search engine. The URLs are then filtered for duplicates, and the document corresponding to each URL is then retrieved. Each retrieved document is then processed using the tools described above. A document is deemed relevant if it matches at least one of the patterns induced for that domain.

For the particular experiment we performed, we used the Alta Vista web search engine (see http://altavista.digital.com) to retrieve the URLs matching a search expression, using the WWW::Search and WWW::Search::AltaVista modules distributed by ISI (http://www.isi.edu/lsam/tools/WWWSEARCH). The document corresponding to each matching URL was downloaded using a simple socket application program, with timeouts to account for busy networks or failed connections.

There are several hundred documents on the Web that mention events about appointments. To restrict the set retrieved to a manageable number, we searched the Web using the Alta Vista search expression shown below, where we require that the document retrieved contain the expression "Fortune company" as well as the word CEO:

appoint "Fortune company" CEO

Retrieval and Filtering Performance

A number of URLs matched this query. The documents corresponding to a few of these URLs could not be retrieved, due to problems not uncommon on the Web, such as network failures and sites that had moved. The documents that were retrieved were hand-checked for relevance, and a subset of them was found to be relevant. The retrieved documents were also subjected to the filtering process described above, which classified a set of documents as relevant; these judgments were compared against the hand-classified data. The tables below show the performance of the system in terms of recall and precision for relevant and irrelevant documents. The first row shows the performance of the web search engine by itself, while the second row shows the performance of Glean's filtering on the output of the web search engine. It is interesting to note that this method performs better in filtering out irrelevant documents than in identifying relevant documents.

As is seen from the tables below, Glean filtering increases precision significantly. Compared to the number of documents in the column under Total Docs, which is the number of documents retrieved by the plain search, the number of documents marked relevant is about a third of the total. The number of documents in this experiment is small, but we intend to collect similar figures for other experiments involving larger numbers of documents.

System             Total   Rel    Classified as relevant          Classified as irrelevant
                   Docs    Docs   Total  Correct  Incorrect       Total  Correct  Incorrect

Plain Web Search
(No Glean)

With Glean
filtering

Table: Classification of the documents retrieved for the search query.

System                     Recall    Precision

Plain Web Search
(Without Glean)

With Glean filtering

Table: Precision and Recall of Glean for retrieving relevant documents.


As briefly noted above, the performance of this mechanism is much better for filtering out irrelevant material than it is for identifying relevant material. This was also noticed in our experiments with New York Times data. There are many possible reasons for this, including extra-syntactic phenomena which we have not accounted for, inadequate size of the window used in creating patterns, unusual patterns of word use, etc.

Note that errors in supertagging could also lead to wrong categorization of documents, as could errors in spelling.

System                     Recall    Precision

Plain Web Search
(Without Glean)

With Glean filtering

Table: Precision and Recall of Glean for filtering out irrelevant documents.

Errors in supertagging during the creation of patterns may cause extremely specific patterns to be created, which may not be a serious problem, since these patterns are not likely to match any of the input.

We are addressing some of these problems in order to reduce the overall error rate. We are also testing the system on other domains and test areas, with very encouraging results for information refining.

Information Extraction

Supertagging has also been used for information extraction tasks. In contrast to information retrieval tasks, information extraction requires not only identifying the relevant sentences containing the information requested by the user, but also filling out the fields of an attribute-value matrix, typically called a template. The attributes of a template vary based on the type of information summarized by the template. A template can be regarded as a summarized representation of the relevant document.

An information extraction project sponsored by Lexis-Nexis was undertaken at the University of Pennsylvania over the course of a summer. Two different kinds of templates were required to be filled: management changeovers and new product introductions. (The fields of these templates were not identical to those in the corresponding MUC templates (Committee).) The documents were drawn from a variety of news sources, such as Reuters, Business Wire, PR News, the LA Times and the San Francisco Chronicle.

To do this task, we employed the EAGLE (An Extensible Architecture for General Linguistic Engineering) system (Baldwin et al.), which has been under development at the University of Pennsylvania. The EAGLE system provides a flexible architecture to integrate a variety of Natural Language tools in novel ways for text processing. It contains document preprocessing tools for sentence segmentation and case normalization; syntactic analysis tools such as a morphological analyzer, a combination of three part-of-speech taggers, a supertagger and a statistical parser; and discourse processing tools for detecting coreference information. A detailed description of each of these components of the system is provided in (Baldwin et al.). A pattern description language called Mother of PERL (MOP) (Doran et al.) has also been developed in conjunction with the EAGLE system. MOP allows a user to specify patterns, using the different levels of syntactic description provided by the EAGLE system, to spot information relevant to filling the fields of a template.

The supertagger is used in EAGLE as one of the two syntactic analyzers, the other being a statistical parser (Collins). The objective of using two syntactic analyzers was to benefit from the strengths of the two different representations and approaches to syntactic processing. The supertagger was used to rapidly and robustly detect chunks such as Noun groups (both minimal and maximal Noun groups) and Verb groups in a document. These chunks were used by the Lightweight Dependency Analyzer to compute subject-verb-object triples for a sentence. MOP patterns based on the supertag annotation, the noun and verb chunk annotation, and the subject-verb-object representation are specified to spot information that serves to fill the management and new product introduction templates. An example MOP pattern that employs supertags and maximal Noun Phrases is shown in the figure below.

    maxNP                               (Commenter brackets)
    st: B_nxVpus | B_nxVs
    (that)?
    tok: "," or ":" followed by an open quote
    quoted speech                       (Comment brackets)

Figure: A sample MOP pattern using supertags and maximal Noun Phrases.



This is a pattern meant to match a text such as the following. (We are thankful to Christine Doran for this example.)

    Whitaker stated: "We are delighted to have this exciting product group as part of QSound and to further solidify the successful business relationships between SEGA and all facets of the QSound operation. We are in negotiations with the present AGES Direct management staff and expect to have them on board with us immediately."

The pattern works as follows:

1. Match a maximal NP which does not itself contain any quotation marks (maxNP).

2. Go forward a small number of tokens.

3. Match one of the two supertags for a verb of saying in this position (B_nxVpus | B_nxVs).

4. Optionally match the word "that".

5. Match a comma or a colon, followed by an open quote.

6. Match any number of words up to the end of the document, without crossing another open quote, until hitting a close quote.

This pattern makes use of various levels of representation. It uses tokens for the obligatory punctuation marks, the maximal Noun Phrases for the commenter, and the supertags to identify the relevant subclass of main verbs. The supertags in particular offer just the right level of generalization: listing the verbs one by one is not practical for a class of this size, and yet allowing any verb at all would be too permissive. The two pieces of information extracted by this pattern are the Commenter and the Comment itself.
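As a rough illustration of what such a pattern does procedurally, the following Python sketch walks a supertagged token sequence. The data structures (`tokens`, `max_nps`) are assumptions made for illustration, and MOP patterns are of course written in MOP's own notation rather than in Python; the small token gap of step 2 is omitted for brevity.

    SAYING_VERB_SUPERTAGS = {"B_nxVpus", "B_nxVs"}
    QUOTES = {'"', "``", "''"}

    def match_comment(tokens, max_nps):
        # tokens: list of (word, supertag) pairs; max_nps: list of
        # (start, end) index pairs for maximal noun phrases, end inclusive.
        for start, end in max_nps:
            np_words = [w for w, _ in tokens[start:end + 1]]
            if any(w in QUOTES for w in np_words):   # step 1: NP, no quotes
                continue
            j = end + 1                              # step 3: verb of saying
            if j >= len(tokens) or tokens[j][1] not in SAYING_VERB_SUPERTAGS:
                continue
            k = j + 1
            if k < len(tokens) and tokens[k][0] == "that":  # step 4
                k += 1
            if k + 1 >= len(tokens) or tokens[k][0] not in {",", ":"} \
                    or tokens[k + 1][0] not in QUOTES:      # step 5
                continue
            comment = []                             # step 6: the quoted span
            for w, _ in tokens[k + 2:]:
                if w in QUOTES:
                    break
                comment.append(w)
            return {"Commenter": " ".join(np_words),
                    "Comment": " ".join(comment)}
        return None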

EAGLE was evaluated on documents from different publications. A hand analysis was used to measure the system's precision and recall on the fields filled for the management changeover task. It should be noted that EAGLE was more consistent in filling fields than the human annotators, and a number of the templates it filled had been completely missed by the human annotators on the first pass. No evaluation was conducted for new product announcements. Due to the complex nature of EAGLE, it was difficult to quantitatively measure the contribution of each component of EAGLE to the overall performance of the system.

Language Modeling

N-gram language models have proved to be very successful in the language modeling task. They are fast, robust, and provide state-of-the-art performance that is yet to be surpassed by more sophisticated models. Further, n-gram language models can be seen as weighted finite-state automata, which can be tightly integrated within speech recognition systems (Pereira and Riley). However, these models fail to capture even relatively local dependencies that exist beyond the order of the model. We expect that the performance of a language model could be improved if these dependencies can be exploited. However, extending the order of the model to accommodate these dependencies is not practical, since the number of parameters of the model is exponential in the order of the model, and reliable estimation of these parameters needs enormous amounts of training material.

In order to overcome the limitation of n-gram language models and to exploit syntactic knowledge, many researchers have proposed structure-based language models (Zue et al.; Kenji and Ward). Structure-based language models employ grammar formalisms richer than weighted finite-state grammars, such as Probabilistic LR grammars, Stochastic Context-Free Grammars (SCFGs), Probabilistic Link Grammars (PLGs) (Lafferty et al.) and Stochastic Lexicalized Tree-Adjoining Grammars (SLTAGs) (Resnik; Schabes). Formalisms such as PLGs and SLTAGs are more readily applicable for language modeling than SCFGs, due to the fact that these grammars encode lexical dependencies directly. Although all these formalisms can encode arbitrarily long-range dependencies, tightly integrating these models with a speech recognizer is a problem, since most parsers for these formalisms only accept and rescore N-best sentence lists. A recent paper (Jurafsky et al.) attempts a tight integration of the syntactic constraints provided by a domain-specific SCFG, in order to work with a lattice of word hypotheses. However, the integration is computationally expensive and the word lattice pruning is suboptimal. Also, the utterances in a spoken language system are most often ungrammatical and may not yield a full parse spanning the complete utterance.

Supertags present an alternate technique for integrating structural information into language models without really parsing the utterance. This technique brings together the advantages of an n-gram language model: speed, robustness and the ability to closely integrate with a speech recognizer.

In this section we discuss a supertag-based n-gram language model that is similar to a class-based n-gram language model, except that the classes are defined by the supertags. A related work that employs part-of-speech categories as classes is presented in (T.R. Niesler and P.C. Woodland). Since the supertags encode both lexical dependencies and syntactic and semantic constraints in a uniform representation, supertag-based classes are more fine-grained than part-of-speech based classes.

Performance Evaluation

The performance of a language model is ideally measured in terms of its ability to decrease the word error rate of a recognizer. However, to avoid this expensive metric, an alternate measure of performance, perplexity (PP) (Bahl et al.), is used:

    PP = 2^(-(1/n) · log2 Pr(W))

where Pr(W) is the probability of the word sequence W and n is the length of the sequence.

In a class-based n-gram language model, the probability of a k-word word sequence W is given by

    Pr(W) = Σ over class sequences c1 ... ck of  Π (i = 1 to k)  P(w_i | c_i) · p(c_i | c_{i-2}, c_{i-1})

where w_i is the i-th word of W, c_i is the class assigned to w_i, P(w_i | c_i) is the word emission probability, and p(c_i | c_{i-2}, c_{i-1}) is the class transition probability.

Perplexity, roughly speaking, measures the average size of the set from which the next word is chosen. The smaller the perplexity measure, the better the language model is at predicting the next word. The perplexity measure for a language model depends on the domain of discourse.
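As a sketch of how these quantities are computed, the following Python fragment evaluates the perplexity of a word sequence under a class-based trigram model, using a single (best) class sequence as in the experiments below; the probability tables here are hypothetical placeholders for the estimated model parameters.

    import math

    def class_trigram_logprob(words, classes, p_word_given_class, p_class):
        # log2 Pr(W) under the class-based model above, for one class
        # sequence: Pr(W) = prod_i P(w_i | c_i) * p(c_i | c_{i-2}, c_{i-1})
        lp = 0.0
        hist = ("<s>", "<s>")
        for w, c in zip(words, classes):
            lp += math.log2(p_word_given_class[(w, c)])
            lp += math.log2(p_class[(hist[0], hist[1], c)])
            hist = (hist[1], c)
        return lp

    def perplexity(words, classes, p_word_given_class, p_class):
        n = len(words)
        lp = class_trigram_logprob(words, classes, p_word_given_class, p_class)
        return 2 ** (-lp / n)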

In this section we present perplexity results for two class-based models on the Wall Street Journal corpus: a part-of-speech based and a supertag-based trigram language model. A set of words from the Wall Street Journal was randomly split into a training set and a test corpus. The corpus was tagged with the correct part-of-speech tags and supertags, to train the part-of-speech based language model and the supertag-based language model respectively. (The UPenn Treebank tagset was used for this purpose, and only the best class (POS or supertag) sequence was used in the perplexity calculation.) Then the performance of the two models was measured on the test corpus. The results are shown in the table below.

Model                          Perplexity

Part-of-speech based
trigram model

Supertag based
trigram model

Table: Word perplexities for the Wall Street Journal Corpus using the part-of-speech and supertag based language models.

It is interesting to note that the performance of the supertag model is better than that of the part-of-speech model, which suggests that the supertags define a better level of contextual generalization and provide more fine-grained classes than part-of-speech tags. The table below presents class perplexity results for the part-of-speech tag model and the supertag model. Class perplexity measures the average size of the set from which the next class is chosen, given the previous history of the sentence. As can be seen, even though there is much higher ambiguity on a per-word basis for supertags than for part-of-speech tags, the number of possible combinations for supertags is much more restricted than for part-of-speech tags.

Model                Perplexity

Part-of-speech

Supertag

Table: Class perplexities for the Wall Street Journal Corpus.

Simplification

Long and complicated sentences prove to be a stumbling block for current Natural Language systems. These systems stand to gain from methods that syntactically simplify such sentences. In applications such as Machine Translation, simplification results in simpler sentence structures and reduced ambiguity. In Information Retrieval systems, retrieving information from texts with simplified sentences, rather than with complex sentences, can help improve the precision of the retrieval system. Simplification can also be looked at as a summarization process, where complex sentences are summarized by removing extraneous clauses.

Simplification is viewed as a two-step process. The first stage identifies the structure of the sentence, and the second stage applies the rules of simplification to the computed structure. These rules identify the positions at which a sentence can be split, based on lexical and structural information surrounding those positions. Obviously, a parser could be used to obtain the complete structure of the sentence. However, full parsing is slow and prone to failure, especially on complex sentences. Supertagging provides an alternate method which is both fast and yields hierarchical structure (constituent) information that can be used for the purpose of simplification.

Simplification with Dependency Links

The output provided by the dependency analyzer not only contains dependency links among words, but also indicates the constituent structure as encoded by supertags. The constituent information is used to identify whether a supertag contains a clausal constituent, and the dependency links are used to identify the span of the clause. Thus embedded clauses can easily be identified and extracted along with their arguments. Punctuation can be used to identify constituents such as appositives, which can also be separated out. As with the finite-state approach, the resulting segments may be incomplete as independent clauses. If the segments are to be reassembled, no further processing need be done on them.

The figure below shows a rule for extracting relative clauses in dependency notation. We first identify the relative clause tree (Z) and then extract the verb which anchors it, along with all of its dependents. The right-hand side shows the two resulting trees. The gap in the relative clause (Y) need only be filled if the clauses are not going to be recombined. Examples (1) and (2) show a sentence before and after this rule has applied.

    X:S over Y:NP (modified by Z:RelClause) and W
        =>   Z':S (the extracted clause, with Y:NP filling its gap)
           + X:S over Y:NP and W (with the relative clause removed)

Figure: Rule for extracting relative clauses.

(1) an issue that generated unusual heat in the House of Commons

(2) An issue generated unusual heat in the House of Commons. The issue ...
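A minimal sketch of how such a rule might be applied procedurally, assuming a hypothetical dependency representation in which each token records its index and the index of its head (this is an illustration, not the LDA or the simplifier itself):

    def extract_relative_clause(tokens, rel_verb_idx, np_head_idx):
        # tokens: list of dicts {"idx": i, "word": w, "head": h}, where
        # "head" is the index of the word's dependency head and list
        # position equals "idx".  rel_verb_idx anchors the relative clause;
        # np_head_idx heads the modified NP.
        in_clause = {rel_verb_idx}
        changed = True
        while changed:                # close over all dependents of the verb
            changed = False
            for t in tokens:
                if t["head"] in in_clause and t["idx"] not in in_clause:
                    in_clause.add(t["idx"])
                    changed = True
        main = " ".join(t["word"] for t in tokens
                        if t["idx"] not in in_clause)
        # Fill the clause's gap with the head of the modified NP (needed only
        # if the clauses will not be recombined); dropping the relativizer
        # and adjusting determiners is omitted for brevity.
        clause = " ".join([tokens[np_head_idx]["word"]] +
                          [t["word"] for t in tokens if t["idx"] in in_clause])
        return main, clause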

Evaluation and Discussion

Preliminary results using the model presented in the previous section are very promising. Using a corpus of Press Trust of India (PTI) newswire data, and considering only relative clause and appositive simplification, we correctly recovered a large majority of the relative clauses and of the appositives, while generating only a small number of spurious relative clauses and appositives.

Dependency-based simplification provides many advantages over simplification methods that employ a Finite-State Grammar for the analysis of a complex sentence (Chandrasekar et al.). Simplification rules in these methods manipulate the noun groups and verb groups provided by the sentence analysis phase. As a result, the rules for the simplifier have to identify basic predicate-argument relations to ensure that the right chunks remain together in the output. The simplifier in the dependency-based simplification model has access to information about argument structure, which makes it much easier to specify simplification patterns involving complete constituents. The Finite-State Grammar based simplification approach is forced to enumerate all possible variants of the LHS of each simplification rule (e.g., Subject versus Object relatives, singular versus plural NPs). In contrast, in the dependency-based simplification model, the rules exploit the argument/adjunct distinctions encoded in supertags and the associated constituent types, and hence are more general.

We are presently working on automatically inducing simplification rules (Chandrasekar and Srinivas), given a parallel corpus of complex sentences and their simplified counterparts. The idea is to identify the points of divergence between the dependency structure for the complex sentence and the dependency structures for the simple sentences, and to store that divergence information indexed on the surrounding context. In the test phase, given a complex sentence, the context information is used to retrieve the appropriate rule to be applied for simplification.

Exploiting Document-level Constraints for Supertagging

Both statistical parsing systems and hand-crafted grammar-based parsing systems model sentence-level constraints and provide syntactic analyses of sentences. There are, however, other constraints that are present at the document level, as well as domain-specific constraints, that are not integrated at the level of sentence processing. Integrating document-level constraints into hand-crafted grammar-based parsing systems greatly complicates the model of processing, and incorporating domain-specific constraints into these systems requires elucidating the domain constraints in the grammar framework used by the system, a tedious task. For statistical parsing systems, domain constraints might be modeled by retraining the statistical model using data from that domain. However, this is a very expensive proposition, since training data for any given domain is hard to collect. Further, integrating discourse-level constraints increases the number of parameters and thus complicates the statistical model. Instead of a monolithic approach for integrating these constraints, we recommend a modular approach. In this section we illustrate how the supertagging approach is amenable to a modular approach of incorporating document-level and domain-specific constraints.

The lexicalized nature of supertags, combined with the fact that every syntactic context is represented by a different supertag, makes it possible to bring to bear document-level constraints as well as domain-specific constraints when supertagging a sentence. To do this, we simply pre-supertag the words of the sentence that are influenced by the higher-level constraints with a supertag that meets those constraints. For example, if we had an oracle that decided whether a preposition attaches to the Verb Phrase or to the Noun Phrase, then the preposition could be pre-supertagged with the appropriate supertag that indicates the type of attachment.
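A minimal sketch of the idea in Python, assuming a Viterbi-style supertagger: pre-supertagging amounts to restricting the candidate set for the pinned words to a single supertag before decoding. The probability tables and data structures here are hypothetical, and a bigram is used in place of the actual trigram to keep the sketch short.

    def viterbi_supertag(words, candidates, pinned, p_trans, p_emit):
        # candidates: dict word -> list of possible supertags.
        # pinned: dict position -> the single supertag forced by a
        # document-level or domain-specific constraint.
        lattice = [[pinned[i]] if i in pinned else candidates[words[i]]
                   for i in range(len(words))]
        best = {t: (p_emit[(words[0], t)], [t]) for t in lattice[0]}
        for i in range(1, len(words)):
            new = {}
            for t in lattice[i]:
                score, path = max((s * p_trans[(prev, t)], pth)
                                  for prev, (s, pth) in best.items())
                new[t] = (score * p_emit[(words[i], t)], path + [t])
            best = new
        return max(best.values(), key=lambda sp: sp[0])[1]

Pinning a supertag in this way not only fixes the analysis of the pinned word; it also sharpens the n-gram context for the surrounding words, which is why pre-supertagging can improve the tagging of a word's neighbors as well.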

Another example is to use coreference information to pre-supertag noun phrases (Srinivas and Baldwin). An interesting aspect of the coreference system (Baldwin) is that there is an early pass of proper noun resolution that does not use any linguistic analysis other than the case of the words and what sentence they are in. The algorithm starts with a sparse matrix representation of all upper-case string matches and attempts to grow the length of the matches while maintaining that they still match as a person name or company name. For example, Smith and Smith could be expanded into Bernard J. Smith and Mr. Smith by the algorithm. Eventually the subroutine will consider matches of greater scope than person names and company names.

System                  Total # of NPs    Recall    Precision

With Pretagging

Without Pretagging

Table: Performance results on extracting Noun Phrases with and without pre-supertagging.

The pre-tagging afforded by the pre-linguistic coreference recognition was successful in improving the performance of the linguistically based Noun Phrase recognition component. The proper noun recognizer collapses all the lexical items in a proper noun into a single lexical item, so Federal Railway Labor Act becomes pre-tagged as Federal_Railway_Labor_Act/A_NXN. The advantage of collapsing the Noun Phrase is two-fold. Besides improving the performance of the Noun Phrase recognition component, it allows the trigram model of supertagging to skip irrelevant words in the Noun Phrase, and perhaps gain access to constraints that it otherwise would not reach, given a three-word window.

The pre-supertagging improved the performance of the Noun Phrase detection system. In particular, the recall of the system was helped, with no damage to precision. In looking at the output, it is clear that fairly simple steps could be taken to improve performance further.

It is conceivable to have an architecture that employs experts of various specialties (domain, simple coreference, noun group, PP, etc.) which pre-tag words in a sentence with appropriately chosen supertags based on their information.

Summary

In this chapter we discuss several applications of the supertagger and the LDA system, including information retrieval and information extraction, text simplification and language modeling. Our goal in using a supertagger is to exploit the strengths of supertags as the appropriate level of lexical description needed for most applications. In an information retrieval application, we compare the retrieval efficiency of a system based on part-of-speech tags against a system based on supertags, and show that the supertag-based system performs at higher levels of precision. In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application. The ability to bring to bear document-level and domain-specific constraints during supertagging of a sentence is explored in improving the performance of a noun phrase recognizer.

Chapter

Conclusions

Contributions

In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. We have shown that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. The supertags are designed such that only those elements on which the lexical item imposes constraints appear within the same supertag. Further, each lexical item is associated with as many supertags as the number of different syntactic contexts in which the lexical item can appear. This makes the number of different descriptions for each lexical item much larger than when the descriptions are less complex, thus increasing the local ambiguity for a parser.

Earlier in this dissertation, we have presented several models for disambiguating supertags that use statistical distributions of supertag co-occurrences, collected from a corpus of parses, to resolve this local ambiguity. We have provided extensive evaluation results for all the models for supertagging, with the best supertag accuracy being 92.1%. We have shown that using the supertagger as a preprocessor for a lexicalized grammar parser results in a substantial speed-up, obtained by simply reducing the search space for the parser even before parsing begins. We have also shown that, besides being used as a preprocessor for a parser, a supertagger can also be used as a stand-alone parser. Supertag disambiguation results in a representation that is effectively a parse (an almost parse), and in conjunction with a lightweight dependency analyzer, a supertagger serves as an efficient and robust parser.

In later chapters, we have exploited the representation of the supertags in conjunction with Explanation-Based Learning (EBL) to improve the efficiency of parsing in limited domains. Language in limited domains is usually well constrained, and certain syntactic patterns appear more frequently than others. EBL improves the parsing efficiency in such domains by relying on domain-specific sentence constructions. In addition, by exploiting the representation of supertags in conjunction with EBL, parsing in limited domains can be modeled as a Finite-State Transduction. We have implemented such a system for the Air Travel Information Service (ATIS) domain, which improves parsing efficiency by a factor of 15 over a system not tuned to the domain.

We have attempted to develop a uniform evaluation metric for parsers that are either statistically induced or grammar-based, and that produce full or partial parses. We advocate a two-pronged evaluation metric: an application-independent glass-box evaluation metric and an application-dependent black-box evaluation metric. We argue that a mixed representation of text chunks and dependencies serves as a better level of representation for a glass-box evaluation metric, as opposed to a purely constituency-based or a purely dependency-based evaluation metric. For a black-box evaluation, we propose that parsing systems should be measured on their ability to correctly identify various grammatical constructions and relations, such as maximal noun phrases, appositives, predicative constructions, PP modifiers and Predicate-Argument relations. Using the supertagger in conjunction with the Lightweight Dependency Analyzer, we obtained recall and precision figures for noun chunking, verb chunking, appositives and parentheticals, as well as an accuracy figure for preposition phrase attachment. In terms of performance on pairwise dependency links, the lightweight dependency analysis identifies 83% of the dependency links accurately on Wall Street Journal text. At the sentence level, a large majority of the sentences had three or fewer pairwise dependency link errors.

As mentioned before, we have demonstrated the usefulness of the supertagger as a preprocessor for a parser by achieving a substantial speed-up. Moreover, the supertagger has been applied in a variety of applications, including information retrieval and information extraction, text simplification and language modeling. The supertagger in conjunction with the LDA has been used as a partial parser in information extraction applications. We have also used the fact that supertags provide richer and more fine-grained classes than part-of-speech tags in applications of information retrieval and language modeling.

Future Work

In this section we present some of the directions for future work we intend to pursue.

Supertagging Models: A number of models for supertag disambiguation have been presented earlier, which vary in terms of the size of context that they take into account (such as the trigram model, the head trigram model and the dependency model) and the representation of supertags (symbols or vectors of features). Each of these models has different strengths and limitations depending on the nature of the application domain. We have begun to evaluate these models for their supertag accuracy. Further evaluation of these models in terms of accuracy in predicting specific constructions would be very useful. We would also like to apply more powerful statistical classification techniques, such as statistical decision trees and maximum entropy models, to supertagging.

Lightweight Dependency Analyzer (LDA): Earlier we presented the LDA, a simple model for producing a dependency analysis once the supertagging is complete. The LDA uses the dependency requirements encoded in the supertags, and some simple heuristics, to produce a dependency analysis for a sentence. The deterministic nature of the LDA crucially depends on the fact that each word is associated with only one supertag. However, the performance of the LDA could be improved if it were extended to deal with the N-best supertags for each word.

Training Corpus: A distinct advantage for statistical parsing research is the availability of large annotated corpora, in the form of the Penn Treebank and SUSANNE. The training material for the supertag disambiguation models was obtained by converting the Penn Treebank parses using heuristics. There are several errors in the resulting supertag annotation, mostly due to errors of translation, but also due to the inadequacy of the annotation in the Penn Treebank for this purpose. A more appropriate corpus would be a collection of LTAG derivation structures based on the XTAG grammar.

Balancing Act: One of the goals of this research has been to show that a hand-crafted grammar, combined with domain-specific statistics, can perform as well as or better than a purely statistical, grammar-less parsing system. Besides the fine-grained nature of the output, we claim that the advantage of a hand-crafted grammar system is the ease of its portability to limited domains. We are in the process of performing experiments to establish this claim by porting the hand-crafted grammar to a weather reporting domain.

Psycholinguistic Implications: There has been increased emphasis on the role of lexical mechanisms in theories of sentence comprehension in the psycholinguistic literature (MacDonald et al.; Trueswell and Tanenhaus). By providing multiple structural representations of words, these theories explain syntactic ambiguity resolution in terms of lexical ambiguity resolution. This view is highly compatible with the view of supertagging in lexicalized grammars. We have begun experiments to model supertag disambiguation using cognitively plausible architectures and processing models. The attempt is to see to what extent the models' processing preferences for syntactically ambiguous phrases mimic those found in human sentence processing.

Supertag-based Applications: Synchronous TAGs (Shieber and Schabes) provide a well-developed framework for Machine Translation. In this framework, elementary trees of the source language are mapped to elementary trees of the target language via synchronous mappings. Although this framework has been studied for its representational capacity, there has not been a robust translation system based on Synchronous TAGs. Supertagging in conjunction with synchronous mappings provides a natural direction for a robust translation system.

A number of the applications reported earlier have employed supertags as labels of richer classes than POS tags. This view has been very profitable, but does not exploit the representation of supertags completely. The syntactic and the argument information encoded in the supertags is entirely ignored. We hope to improve the performance of some of these applications using the syntactic and argument information encoded in the supertags.

Applicability of Supertagging to other Lexicalized Grammars: As discussed earlier, the idea of supertagging has wider applicability than LTAGs. Any lexicalized grammar assigns multiple descriptions to a lexical item, although only one or a relatively small number of those descriptions would be used in the context of a given sentence. Thus the idea of supertagging, minimizing the ambiguity in terms of the number of descriptions assigned to a lexical item even before parsing begins, is applicable to other lexicalized grammars as well.

CCGs and LTAGs are similar in that they are both lexicalized grammars; however, they differ in the extent of the domain of locality. There is ongoing work on the development of a wide-coverage CCG system (Doran and Srinivas). We intend to investigate the applicability of the partial parsing methods discussed here to CCG, and to study the tradeoffs adopted by CCG and LTAG with regard to notions such as constituency, dependency and locality.

Appendix A

List of Supertags

The following is the list of supertags used in the experiments described in the preceding chapters.

EOS          End of Sentence
A_ARB        Adverb as argument
A_AXA        Adjective as argument
A_AXAs       Adjective with sentence complement
A_CONJ       Conjunction as argument
A_D          Determiner as argument in coordination
A_N          Noun as argument in coordination
A_NXG        Genitive as argument
A_NXN        Noun phrase as argument
A_NXNs       Noun phrase with sentence complement
A_P          Preposition as argument in coordination
A_PL         Particles
A_PU         Punctuation
A_PXARBPnx   Complex preposition argument
A_PXP        Exhaustive preposition argument
A_PXPNaPnx   Complex preposition argument
A_PXPPnx     Complex preposition argument
A_PXPnx      Preposition argument
A_by         "by" in passives

B_APnxs          Sentence initial complex preposition modifier
B_ARBCONJs       Sentence initial complex conjunction modifier
B_ARBPnxs        Sentence initial complex preposition modifier
B_ARBPs          Sentence initial complex preposition modifier
B_ARBa           Adverb modifying an adjective
B_ARBarb         Adverb modifying an adverb
B_ARBarbs        Sentence initial complex adverb modifier
B_ARBd           Adverb modifying a determiner
B_ARBn           Adverb modifying a noun
B_ARBnx          Adverb modifying a noun phrase
B_ARBpx          Adverb modifying a preposition phrase
B_ARBs           Adverb modifying a sentence
B_ARBvx          Adverb modifying a verb phrase
B_An             Adjective modifying a noun
B_COMPs          Complementizer
B_CONJACONJd     Complex determiner modifier
B_CONJDCONJd     Complex determiner modifier
B_CONJPnxs       Sentence initial complex preposition modifier
B_CONJarbCONJs   Sentence initial complex conjunction
B_CONJnxconjnx   Complex noun phrase conjunction
B_CONJs          Sentence initial conjunction
B_CONJsconjs     Complex sentence conjunction
B_CONJvxconjvx   Complex verb phrase conjunction
B_Dnx            Determiner
B_NEGarb         Negation on adverb
B_NEGnx          Negation on noun phrase
B_NEGpx          Negation on preposition phrase
B_NEGvx          Negation on verb phrase
B_Nn             Noun modifier
B_Ns             Temporal noun modifying a sentence
B_Nvx            Temporal noun modifying a verb phrase
B_PARBPnxs       Sentence initial complex preposition modifier
B_PARBarb        Complex preposition modifying an adverb
B_PARBd          Complex preposition modifying a determiner
B_PARBnx         Complex preposition modifying a noun phrase
B_PARBpx         Complex preposition modifying a preposition
B_PARBs          Complex preposition modifying a sentence
B_PARBvx         Complex preposition modifying a verb phrase
B_PDNPnxs        Sentence initial complex preposition modifier
B_PNPnxs         Sentence initial complex preposition modifier
B_PNaPnxs        Sentence initial complex preposition modifier
B_PPnxs          Sentence initial complex preposition modifier
B_PUapu          Punctuation around adjective
B_PUarbpu        Punctuation around adverb
B_PUnpu          Punctuation around noun
B_PUnxpu         Punctuation around noun phrase
B_PUpu           Punctuation on another punctuation
B_PUpx           Punctuation before preposition phrase
B_PUpxpu         Punctuation around preposition phrase
B_PUpxpunx       Punctuation around preposition phrase
B_PUs            Punctuation before a sentence
B_PUspu          Punctuation around a sentence
B_PUvpu          Punctuation around a verb
B_PUvxpu         Punctuation around a verb phrase
B_Pnxs           Sentence initial preposition modifier
B_UCONJUCONJs    Complex sentence initial conjunction
B_Vn             Participle verb prenominal modifier
B_Vnxpus         Parenthetical sentence complement verb
B_Vnxpuvx        Parenthetical sentence complement verb
B_Vnxs           Parenthetical sentence complement verb
B_Vs             Auxiliary inversion
B_Vvx            Auxiliary verb

B_aCONJa             Adjective coordination
B_aARB               Adverb modifying an adjective
B_arbCONJarb         Adverb coordination
B_arbARB             Adverb modifying an adverb
B_axCONJax           Adjective phrase coordination
B_conjAdconjAd       Multi-anchor adverb coordination
B_dCONJd             Determiner coordination
B_nCONJn             Noun coordination
B_nA                 Post-nominal adjective modifier
B_nARB               Post-nominal adverb modifier
B_nPUnx              Appositive for postal addresses without final punctuation
B_nPUnxpu            Appositives for postal addresses
B_nxCONJARBCONJnx    Multi-anchor noun phrase conjunction
B_nxCONJARBnx        Multi-anchor noun phrase conjunction
B_nxCONJnx           Noun phrase conjunction
B_nxconjCONJnx       Multi-anchor noun phrase conjunction
B_nxAPnx             Noun phrase modifying complex preposition
B_nxARB              Noun phrase modifying adverb
B_nxARBPnx           Noun phrase modifying complex preposition
B_nxGnx              Possessive 's
B_nxN                Noun phrase modifying time noun phrase
B_nxP                Noun phrase modifying exhaustive preposition
B_nxPDNPnx           Noun phrase modifying complex preposition
B_nxPNPnx            Noun phrase modifying complex preposition
B_nxPNaPnx           Noun phrase modifying complex preposition
B_nxPPnx             Noun phrase modifying complex preposition
B_nxPUnx             Appositives without final punctuation
B_nxPUnxpu           Appositives
B_nxPUpx             Noun phrase modifying preposition
B_nxPnx              Noun phrase modifying preposition
B_pCONJp             Preposition coordination
B_pxCONJARBCONJpx    Multi-anchor preposition phrase coordination
B_pxCONJpx           Preposition phrase coordination
B_pxARB              Preposition phrase modifying adverb
B_sCONJs             Sentence coordination
B_sARB               Sentence modifying adverb
B_sPU                Sentence final punctuation
B_sPUnx              Sentence modifying noun phrase
B_sPUs               Punctuation connecting sentences
B_sPnx               Sentence modifying noun phrase
B_vCONJv             Verb coordination
B_vnxVpu             Parenthetical sentence complement verb
B_vxCONJARBCONJvx    Multi-anchor verb phrase coordination
B_vxCONJvx           Verb phrase coordination
B_vxconjCONJvx       Multi-anchor verb phrase coordination
B_vxAPnx             Verb phrase modifying complex preposition
B_vxARB              Post-verbal adverbial modifier
B_vxARBPnx           Verb phrase modifying complex preposition
B_vxCONJPnx          Verb phrase modifying complex preposition
B_vxN                Verb phrase modifying time noun phrase
B_vxNPnx             Verb phrase modifying complex preposition
B_vxP                Verb phrase modifying exhaustive preposition
B_vxPDNPnx           Verb phrase modifying complex preposition
B_vxPNPnx            Verb phrase modifying complex preposition
B_vxPNaPnx           Verb phrase modifying complex preposition
B_vxPPnx             Verb phrase modifying complex preposition
B_vxPnx              Verb phrase modifying preposition
B_vxPs               Verb phrase modifying preposition with sentence complement

A_EWnxV        Wh-question on subject of ergative verb
A_EnxV         Active ergative verb
A_GnxAx        Gerundive adjectival predicate
A_GnxV         Gerundive intransitive
A_GnxVarb      Gerundive verb with adverb complement
A_GnxVax       Gerundive verb with adjective complement
A_GnxVnx       Gerundive transitive verb
A_GnxVnxnx     Gerundive ditransitive verb
A_GnxVnxpl     Gerundive transitive particle verb, shifted
A_GnxVnxs      Gerundive verb with noun phrase and sentence complement
A_GnxVplnx     Gerundive transitive particle verb, unshifted
A_GnxVpnx      Gerundive verb with preposition complement
A_GnxVs        Gerundive verb with sentence complement
A_InxV         Imperative intransitive verb
A_InxVax       Imperative adjectival predicate
A_InxVnx       Imperative transitive
A_InxVnxpnx    Imperative verb with noun phrase and preposition complement
A_InxVnxs      Imperative verb with noun phrase and sentence complement
A_InxVpl       Imperative intransitive verb particle
A_InxVplnx     Imperative transitive verb particle, unshifted
A_WnxVnx       Wh-question on subject of transitive verb
A_WnxN         Wh-question on predicate of a nominal predicate
A_nxARBPnx     Complex preposition predicate
A_nxAx         Adjective predicate
A_nxAxpnx      Adjective predicate with preposition complement
A_nxAxs        Adjective predicate with sentence complement
A_nxBEnx       Equative
A_nxCONJPnx    Complex preposition predicate
A_nxN          Nominal predicate
A_nxNs         Nominal predicate with sentence complement
A_nxPDNPnx     Complex preposition predicate
A_nxPNPnx      Complex preposition predicate
A_nxPNaPnx     Complex preposition predicate
A_nxPPnx       Complex preposition predicate
A_nxPnx        Preposition predicate
A_nxPx         Exhaustive preposition predicate
A_nxPxs        Preposition predicate with sentence complement
A_nxV          Intransitive
A_nxVarb       Verb with adverb complement
A_nxVax        Verb with adjective complement
A_nxVnx        Transitive verb
A_nxVnxPnx     Dative shifted double object verb
A_nxVnxnx      Ditransitive verb
A_nxVnxpl      Transitive verb particle, shifted
A_nxVnxpnx     Verb with noun phrase and preposition phrase complement
A_nxVnxnx      Ditransitive verb
A_nxVpl        Intransitive verb particle
A_nxVplnx      Transitive verb particle, unshifted
A_nxVplpnx     Verb particle with preposition complement
A_nxVpnx       Verb with preposition complement
B_nxVs         Verb with sentence complement
A_nxlVN        Intransitive light verb
A_nxlVNPnx     Light verb with preposition complement
A_nxV          Transitive passive verb with no by-phrase
A_nxVbynx      Transitive passive verb with by-phrase
A_nxVnx        Ditransitive passive verb with no by-phrase
A_nxVpl        Transitive passive verb particle with no by-phrase
A_nxVplbynx    Transitive passive verb particle with by-phrase
A_nxVpnx       Dative passive verb with no by-phrase
A_pWnxPnx      Wh-question on object of preposition predicate
A_sN           Sentence subject nominal predicate

B_InxVs        Imperative verb with sentence complement
B_NnxAx        Relative clause on subject of adjective predicate
B_NnxAxs       Relative clause on subject of adjective predicate with sentence complement
B_NnxBEnx      Relative clause on subject of equative
B_NnxN         Relative clause on subject of nominal predicate
B_NnxPnx       Relative clause on subject of preposition predicate
B_NnxPx        Relative clause on subject of exhaustive preposition predicate
B_NnxV         Relative clause on subject of intransitive verb
B_NnxVarb      Relative clause on subject of verb with adverb complement
B_NnxVax       Relative clause on subject of verb with adjective complement
B_NnxVnx       Relative clause on subject of transitive verb
B_NnxVnxPnx    Relative clause on subject of dative verb
B_NnxVnxarbx   Relative clause on subject of verb with noun phrase and adverb complement
B_NnxVnxnx     Relative clause on subject of double object verb
B_NnxVnxpl     Relative clause on subject of transitive verb particle, unshifted
B_NnxVnxpnx    Relative clause on subject of verb with noun phrase and preposition phrase complement
B_NnxVnxs      Relative clause on subject of verb with noun phrase and sentence complement
B_NnxVnxnx     Relative clause on subject of double object verb
B_NnxVpl       Relative clause on subject of intransitive verb particle
B_NnxVplnx     Relative clause on subject of transitive verb particle
B_NnxVplpnx    Relative clause on subject of verb particle with preposition complement
B_NnxVpnx      Relative clause on subject of verb with preposition phrase complement
B_NnxVs        Relative clause on subject of verb with sentence complement
B_NnxBEnx      Relative clause on object of equative verb
B_NnxVnx       Relative clause on object of transitive verb
B_NnxVnxPnx    Relative clause on object of dative verb
B_NnxVnxnx     Relative clause on object of double object verb
B_NnxVnxpnx    Relative clause on object of verb with noun phrase and preposition phrase complement
B_NnxVnxs      Relative clause on object of verb with noun phrase and sentence complement
B_NnxVplnx     Relative clause on object of transitive verb particle
B_NnxVpnx      Relative clause on object of verb with preposition complement
B_NnxV         Relative clause on object of passive transitive verb
B_NnxVPnx      Relative clause on object of passive dative verb
B_NnxVax       Relative clause on object of passive verb with adjective complement
B_NnxVbynx     Relative clause on object of passive transitive verb with by-phrase
B_NnxVnx       Relative clause on object of passive double object verb
B_NnxVpl       Relative clause on object of passive verb particle
B_NnxVpnx      Relative clause on object of passive verb with preposition phrase complement
B_NnxVs        Relative clause on object of passive verb with sentence complement

B_NnxVnxpnx    Relative clause on indirect object of dative verb
B_NnxVpnx      Relative clause on indirect object of passive dative verb
B_NnxAx        Adjunct relative clause for an adjective predicate
B_NnxBEnx      Adjunct relative clause for an equative
B_NnxN         Adjunct relative clause for a nominal predicate
B_NnxPnx       Adjunct relative clause for a prepositional predicate
B_NnxV         Adjunct relative clause for an intransitive
B_NnxVnx       Adjunct relative clause for a transitive
B_NnxVnxnx     Adjunct relative clause for a double object verb
B_NnxVnxpl     Adjunct relative clause for a transitive verb particle
B_NnxVnxs      Adjunct relative clause for a verb with noun phrase and sentence complement
B_NnxVpl       Adjunct relative clause for a verb particle
B_NnxVpnx      Adjunct relative clause for a verb with preposition phrase complement
B_NnxVs        Adjunct relative clause for a verb with sentence complement
B_NnxV         Adjunct relative clause for a passive transitive verb
B_NnxVpnx      Adjunct relative clause for a passive verb with preposition phrase complement
B_NnxVs        Adjunct relative clause for a passive verb with sentence complement
B_nxAss        Presentential clausal adjunct of adjective predicate with sentence complement
B_nxAxs        Presentential clausal adjunct of adjective predicate
B_nxVnxpnxs    Presentential clausal adjunct of dative verb
B_nxVnxs       Presentential clausal adjunct of transitive verb
B_nxVpls       Presentential clausal adjunct of intransitive verb particle
B_nxVpnxs      Presentential clausal adjunct of transitive verb particle
B_nxVs         Presentential clausal adjunct of intransitive verb
B_nxVs         Verb with sentence complement
B_nxVbynxs     Presentential clausal adjunct of passive transitive verb with by-phrase
B_nxVps        Presentential clausal adjunct of passive verb with preposition complement
B_nxVs         Presentential clausal adjunct of passive verb
B_nxVs         Passive verb with sentence complement
B_nxVnxpus     Parenthetical sentence complement verb
B_nxVpus       Parenthetical sentence complement verb
B_nxVpuvx      Parenthetical sentence complement verb
B_nxVs         Parenthetical sentence complement verb
B_pNnxVpnx     Relative clause on the object of verb with preposition complement
B_pNnxVpnx     Relative clause on the indirect object of dative verb
B_sVnx         Parenthetical sentence complement verb
B_snxV         Parenthetical sentence complement verb
B_snxVnx       Parenthetical sentence complement verb
B_vxnxARBPnx   Postsentential clausal adjunct of complex preposition predicate
B_vxnxAx       Postsentential clausal adjunct of adjective predicate
B_vxnxV        Postsentential clausal adjunct of intransitive verb
B_vxnxVax      Postsentential clausal adjunct of verb with adjective complement
B_vxnxVnx      Postsentential clausal adjunct of transitive verb
B_vxnxVnxnx    Postsentential clausal adjunct of double object verb
B_vxnxVnxpl    Postsentential clausal adjunct of transitive verb particle, shifted
B_vxnxVpnx     Postsentential clausal adjunct of verb with preposition phrase complement
B_vxnxV        Postsentential clausal adjunct of passive transitive verb
B_vxnxVbynx    Postsentential clausal adjunct of passive transitive verb with by-phrase
B_vxnxVnx      Postsentential clausal adjunct of passive ditransitive verb
B_vxnxVpnx     Postsentential clausal adjunct of passive dative verb

Bibliography

Steven Abney. Rapid Incremental Parsing with Repair. In Proceedings of the New OED Conference: Electronic Text Research, University of Waterloo, Waterloo, Canada.

Steven P. Abney. Rapid Incremental Parsing with Repair. In Proceedings of the New OED Conference, University of Waterloo, Waterloo, Ontario.

Steven Abney. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers.

Steven Abney. Dependency Grammars and Context-Free Grammars. Manuscript, University of Tübingen, March.

Steven Abney. Partial Parsing. Tutorial given at ANLP, Stuttgart, October.

Hiyan Alshawi and David Carter. Training and scaling preference functions for disambiguation. Computational Linguistics.

Hiyan Alshawi, David Carter, Richard Crouch, Steve Pullman, Manny Rayner, and Arnold Smith. CLARE: A Contextual Reasoning and Cooperative Response Framework for the Core Language Engine. SRI International, Cambridge, England.

A. Anttila. How to recognize subjects in English. In Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin and New York.

D. Appelt, J. Hobbs, J. Bear, D. Israel, and M. Tyson. FASTUS: a finite-state processor for information extraction from real-world text. In Proceedings of IJCAI, Chambery, France, September.

L.R. Bahl, J.K. Baker, F. Jelinek, and R.L. Mercer. Perplexity: a measure of the difficulty of speech recognition tasks. Program of the Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America.

Breckenridge Baldwin, Christine Doran, Jeffrey Reynar, Michael Niv, B. Srinivas, and Mark Wasson. EAGLE: An Extensible Architecture for General Linguistic Engineering. Manuscript, Department of Computer and Information Science, University of Pennsylvania.

F. Breckenridge Baldwin. CogNIAC: A Discourse Processing Engine. PhD thesis, Department of Computer and Information Science, University of Pennsylvania.

Ezra Black, R. Garside, and G. Leech, editors. Statistically-driven computer grammars of English: The IBM/Lancaster approach. Rodopi, Amsterdam.

Ezra Black, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Rens Bod. Enriching Linguistics with Statistics: Performance Models of Natural Language. PhD thesis, ILLC Dissertation Series, University of Amsterdam. ftp://ftp.fwi.uva.nl/pub/theory/illc/dissertations/

E. Brill and P. Resnik. A rule-based approach to prepositional phrase attachment disambiguation. In Proceedings of the International Conference on Computational Linguistics (COLING), Kyoto, Japan.

Eric Brill. Automatic grammar induction and parsing free text: A transformation-based approach. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio.

Ted Briscoe and John Carroll. Developing and Evaluating a Probabilistic LR Parser of Part-of-Speech and Punctuation Labels. In Proceedings of the Fourth International Workshop on Parsing Technologies (IWPT), Prague, Czech Republic.

Ted Briscoe, John Carroll, Nicoletta Calzolari, Stefano Federici, Simonetta Montemagni, Vito Pirrelli, Greg Grefenstette, Antonio Sanfilippo, Glenn Carroll, and Mats Rooth. Shallow parsing and knowledge extraction for language engineering: Specification of Phrasal Parsing. Pre-final Report, May.

John Carroll. Practical Unification-based Parsing of Natural Language. University of Cambridge Computer Laboratory, Cambridge, England.

R. Chandrasekar and B. Srinivas. Using syntactic information in document filtering: A comparative study of part-of-speech tagging and supertagging. Technical Report, IRCS, University of Pennsylvania.

R. Chandrasekar, Christine Doran, and B. Srinivas. Motivations and methods for text simplification. In Proceedings of the Sixteenth International Conference on Computational Linguistics (COLING), Copenhagen, Denmark, August.

Noam Chomsky. A Minimalist Approach to Linguistic Theory. MIT Working Papers in Linguistics, Occasional Papers in Linguistics.

Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Second Applied Natural Language Processing Conference, Austin, Texas.

Michal P. Chytil and Hans Karlgren. Categorial grammars and list automata for strata of non-CF languages. In Wojciech Buszkowski, Witold Marciszewski, and Johan van Benthem, editors, Categorial Grammar. John Benjamins, Philadelphia and Amsterdam.

Ronald A. Cole, Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue. Survey of the state of the art in human language technology. http://www.cse.ogi.edu/CSLU/HLTsurvey/

Michael Collins and James Brooks. Prepositional phrase attachment through a backed-off model. In Proceedings of the Third Workshop on Very Large Corpora, MIT, Cambridge, Massachusetts.

Michael Collins. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Santa Cruz.

The MUC Program Committee. The information extraction task definition. In Proceedings of the Sixth Message Understanding Conference.

W. Bruce Croft, Howard R. Turtle, and David D. Lewis. The use of phrases and structured queries in information retrieval. In Proceedings of the Annual International Conference on Research and Development in Information Retrieval (SIGIR), Chicago, USA.

Christine Doran and B. Srinivas. A Wide-Coverage CCG Parser. In Proceedings of the Third TAG Conference, Paris, France.

Christy Doran, Dania Egedi, Beth Ann Hockey, B. Srinivas, and Martin Zaidel. XTAG System: A Wide Coverage Grammar for English. In Proceedings of the International Conference on Computational Linguistics (COLING), Kyoto, Japan, August.

Christine Doran, Michael Niv, Breckenridge Baldwin, Jeffrey Reynar, and B. Srinivas. Mother of Perl: A Multi-tier Pattern Description Language. Manuscript, Department of Computer and Information Science, University of Pennsylvania.

Christy Doran. Punctuation in Quoted Speech. In Proceedings of SIGPARSE, Santa Cruz, California, June.

W.B. Frakes and R.S. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall.

T. Fujisaki, F. Jelinek, J. Cocke, E. Black, and T. Nishino. A Probabilistic Parsing Method for Sentence Disambiguation. In Proceedings of the Annual International Workshop on Parsing Technologies, Pittsburgh.

G. Gazdar, E. Klein, G. Pullum, and I. Sag. Generalized Phrase Structure Grammar. Harvard University Press, Cambridge, Massachusetts.

James Gee and François Grosjean. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology.

I.J. Good. The population frequencies of species and the estimation of population parameters. Biometrika.

Ralph Grishman. Where's the Syntax? The New York University MUC System. In Proceedings of the Sixth Message Understanding Conference, Columbia, Maryland.

Maurice Gross. Lexicon-Grammar and the Syntactic Analysis of French. In Proceedings of the International Conference on Computational Linguistics (COLING), Stanford, California.

Claire Grover, John Carroll, and Ted Briscoe. The Alvey Natural Language Tools Grammar.

Zellig Harris. String Analysis of Language Structure. Mouton and Co., The Hague, Netherlands.

P. Harrison, S. Abney, D. Fleckenger, C. Gdaniec, R. Grishman, D. Hindle, B. Ingria, M. Marcus, B. Santorini, and T. Strzalkowski. Evaluating syntax performance of parser/grammars of English. In Proceedings of the Workshop on Evaluating Natural Language Processing Systems, ACL.

Don Hindle and Mats Rooth. Structural ambiguity and lexical relations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Berkeley, CA.

Hindle D Hindle Deterministic Parsing of Syntactic NonFluencies In Proceedings

of the st Annual Meeting of the Association for Computational Linguistics

Hindle D Hindle Prediction of lexicalized tree fragments in text In ARPA

Workshop on Human Language Technology March

Hobbs and Bear Jerry R Hobbs and John Bear Two Principles of Parse

Preference In Current Issues in Natural Language Processing In Honor of Don Walker

Giardini with Kluw er

Hobbs et al Jerry Hobbs Douglas E App elt John S Bear Mabry Tyson and

David Magerman The TACITUS system The MUC exp erience Technical rep ort

AI Center SRI International Ravenswood Ave Menlo Park CA Octob er

Hobbs et al Jerry Hobbs Doug App elt John Bear David Israel and W Mary

Tyson FASTUS a system for extracting information from natural language text

Technical Rep ort SRI

[Hobbs et al., 1995] Jerry R. Hobbs, Douglas E. Appelt, John Bear, David Israel, Andy Kehler, Megumi Kameyama, David Martin, Karen Myers, and Mabry Tyson. SRI International FASTUS system: MUC-6 test results and analysis. In Proceedings of the Sixth Message Understanding Conference, Columbia, Maryland, 1995.

[Jelinek et al., 1994] Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, Adwait Ratnaparkhi, and Salim Roukos. Decision Tree Parsing using a Hidden Derivation Model. In Proceedings of the ARPA Workshop on Human Language Technology, March 1994.

[Jensen et al., 1993] Karen Jensen, George E. Heidorn, and Stephen D. Richardson. Natural Language Processing: The PLNLP Approach. Kluwer Academic Publishers, Boston/Dordrecht/London, 1993.

[Jones and Galliers, 1996] Karen Sparck Jones and Julia R. Galliers. Evaluating Natural Language Processing Systems: An Analysis and Review. Lecture Notes in Artificial Intelligence. Springer, Berlin and New York, 1996.

[Joshi and Hopely, 1996] Aravind Joshi and Philip Hopely. A parser from antiquity. Natural Language Engineering, 2(4), 1996.

[Joshi and Kulick] Aravind Joshi and Seth Kulick. Partial proof trees as building blocks for a categorial grammar. Linguistics and Philosophy. To appear.

[Joshi and Schabes] Aravind Joshi and Yves Schabes. Handbook of Formal Languages and Automata, chapter Tree-Adjoining Grammars. Springer-Verlag, Berlin.

[Joshi and Srinivas, 1994] Aravind K. Joshi and B. Srinivas. Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing. In Proceedings of the 15th International Conference on Computational Linguistics (COLING '94), Kyoto, Japan, August 1994.

[Joshi et al., 1975] Aravind K. Joshi, L. Levy, and M. Takahashi. Tree Adjunct Grammars. Journal of Computer and System Sciences, 10(1), 1975.

[Joshi et al., 1991] Aravind K. Joshi, K. Vijay-Shanker, and David Weir. The Convergence of Mildly Context-Sensitive Grammatical Formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1991.

[Joshi, 1961] Aravind K. Joshi. Advances in Documentation and Library Science, volume III, chapter Computation of Syntactic Structure. Interscience Publishers, Inc., New York, 1961.

[Joshi, 1985] Aravind K. Joshi. Tree Adjoining Grammars: How much context-sensitivity is required to provide a reasonable structural description? In D. Dowty, L. Karttunen, and A. Zwicky, editors, Natural Language Parsing. Cambridge University Press, Cambridge, UK, 1985.

[Joshi, 1987] A. K. Joshi. An introduction to tree adjoining grammars. In A. Manaster-Ramer, editor, Mathematics of Language. John Benjamins, Amsterdam, 1987.

[Jurafsky et al., 1995] D. Jurafsky, Chuck Wooters, Jonathan Segal, Andreas Stolcke, Eric Fosler, Gary Tajchman, and Nelson Morgan. Using a Stochastic CFG as a Language Model for Speech Recognition. In Proceedings IEEE ICASSP, Detroit, Michigan, 1995.

[Kaplan and Bresnan, 1982] Ronald Kaplan and Joan Bresnan. Lexical-Functional Grammar: A Formal System for Grammatical Representation. In J. Bresnan, editor, The Mental Representation of Grammatical Relations. MIT Press, Cambridge, Massachusetts, 1982.

[Karlsson et al., 1995] Karlsson, Voutilainen, Heikkila, and Anttila. Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter, Berlin and New York, 1995.

[Katz, 1987] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3), 1987.

[Kenji and Ward, 1991] Kita Kenji and Wayne Ward. Incorporating LR Parsing into SPHINX. In Proceedings IEEE ICASSP, 1991.

[Kroch and Joshi, 1985] Anthony S. Kroch and Aravind K. Joshi. The Linguistic Relevance of Tree Adjoining Grammars. Technical Report MS-CIS-85-16, Department of Computer and Information Science, University of Pennsylvania, 1985.

[Kuno, 1966] S. Kuno. Harvard predictive analyzer. In David G. Hays, editor, Readings in Automatic Language Processing. American Elsevier Pub. Co., New York, 1966.

[Lafferty et al., 1992] John Lafferty, Daniel Sleator, and Davy Temperley. Grammatical Trigrams: A Probabilistic Model of Link Grammar. Technical Report CMU-CS-92-181, School of Computer Science, Carnegie Mellon University, 1992.

[Lin, 1995] Dekang Lin. A Dependency-based Method for Evaluating Broad-Coverage Parsers. In Proceedings of IJCAI-95, Montreal, Canada, August 1995.

[Lin, 1996] Dekang Lin. Evaluation of PRINCIPAR with the SUSANNE corpus. In Proceedings of the Workshop on Robust Parsing at the European Summer School in Logic, Language and Information, Prague, August 1996.

[MacDonald et al., 1994] Maryellen MacDonald, Neal Pearlmutter, and Mark Seidenberg. The lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4), 1994.

[Magerman, 1995] David M. Magerman. Statistical Decision-Tree Models for Parsing. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995.

[Marcken, 1990] Carl G. De Marcken. Parsing the LOB Corpus. In 28th Meeting of the Association for Computational Linguistics, 1990.

[Marcus et al., 1993] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), June 1993.

[Marcus, 1980] Mitchell P. Marcus. A Theory of Syntactic Recognition for Natural Language. The MIT Press, Cambridge, Massachusetts and London, England, 1980.

[McDonald, 1992] David McDonald. Sparser. In Text-based Intelligent Systems. L. Erlbaum Associates, Hillsdale, NJ, 1992.

[Minton, 1988] Steve Minton. Quantitative Results concerning the Utility of Explanation-Based Learning. In Proceedings of the 7th AAAI Conference, Saint Paul, Minnesota, 1988.

[Mitchell et al., 1986] Tom M. Mitchell, Richard M. Keller, and Smadar T. Kedar-Cabelli. Explanation-Based Generalization: A Unifying View. Machine Learning, 1(1), 1986.

[Nagao, 1994] Makoto Nagao. Varieties of Heuristics in Sentence Processing. In Current Issues in Natural Language Processing: In Honour of Don Walker. Giardini with Kluwer, 1994.

[Neumann, 1994] Gunter Neumann. Application of Explanation-based Learning for Efficient Processing of Constraint-based Grammars. In 10th IEEE Conference on Artificial Intelligence for Applications, San Antonio, Texas, 1994.

[Ney et al., 1995] Herman Ney, Ute Essen, and Reinhard Kneser. On the estimation of small probabilities by leaving-one-out. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12), 1995.

[Oepen et al., forthcoming] Stephan Oepen, Klaus Netter, and Judith Klein. TSNLP: Test Suites for Natural Language Processing. CSLI Lecture Notes, forthcoming. http://cl-www.dfki.uni-sb.de/tsnlp/.

[Pereira and Riley] Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Proceedings of the ARPA Workshop on Human Language Technology, March.

[Pollard and Sag, 1987] Carl Pollard and Ivan A. Sag. Information-Based Syntax and Semantics. Volume 1: Fundamentals. CSLI, 1987.

[Rambow and Joshi] O. Rambow and A. K. Joshi. Dependency parsing for phrase structure grammars. University of Pennsylvania.

[Rambow et al., 1995] Owen Rambow, David Weir, and K. Vijay-Shanker. D-Tree Grammars. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, 1995.

[Ramshaw and Marcus, 1995] Lance Ramshaw and Mitchell P. Marcus. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora, MIT, Cambridge, Massachusetts, 1995.

[Ratnaparkhi et al., 1994] A. Ratnaparkhi, J. Reynar, and S. Roukos. A maximum entropy model for prepositional phrase attachment. In Proceedings of the ARPA Workshop on Human Language Technology, Plainsboro, NJ, March 1994.

[Ratnaparkhi, 1996] Adwait Ratnaparkhi. A Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, Philadelphia, 1996.

[Rayner, 1988] Manny Rayner. Applying Explanation-Based Generalization to Natural Language Processing. In Proceedings of the International Conference on Fifth Generation Computer Systems, Tokyo, 1988.

[Resnik, 1992] Philip Resnik. Probabilistic tree-adjoining grammar as a framework for statistical natural language processing. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), Nantes, France, July 1992.

[Robinson, 1982] Jane Robinson. Perspectives on parsing issues. In Proceedings of ACL, 1982.

[Roche, 1993] Emmanuel Roche. Analyse syntaxique transformationnelle du français par transducteurs et lexique-grammaire. PhD thesis, Université Paris 7, 1993.

[Salton and McGill, 1983] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[Sampson et al., 1989] G. Sampson, R. Haigh, and E. Atwell. Natural language analysis by stochastic optimization: a progress report on Project APRIL. Journal of Experimental and Theoretical Artificial Intelligence, 1, 1989.

[Sampson, 1994] G. Sampson. SUSANNE: a Doomsday book of English Grammar. In Corpus-based Research into Language. Rodopi, Amsterdam, 1994.

[Samuelsson and Rayner, 1991] Christer Samuelsson and Manny Rayner. Quantitative Evaluation of Explanation-Based Learning as an Optimization Tool for a Large-Scale Natural Language System. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, Australia, 1991.

[Samuelsson, 1994] Christer Samuelsson. Grammar Specialization through Entropy Thresholds. In 32nd Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994.

[Scha, 1990] Remko Scha. Taaltheorie en taaltechnologie: competence en performance. In Q. A. M. de Kort and G. L. J. Leerdam, editors, Computertoepassingen in de Neerlandistiek. Landelijke Vereniging van Neerlandici, Almere, 1990.

[Schabes and Joshi, 1991] Yves Schabes and Aravind K. Joshi. Parsing with Lexicalized Tree Adjoining Grammar. In M. Tomita, editor, Current Issues in Parsing Technologies. Kluwer Academic Publishers, 1991.

[Schabes and Shieber, 1992] Yves Schabes and Stuart Shieber. An Alternative Conception of Tree-Adjoining Derivation. In Proceedings of the 30th Meeting of the Association for Computational Linguistics, 1992.

[Schabes et al., 1988] Yves Schabes, Anne Abeille, and Aravind K. Joshi. Parsing strategies with 'lexicalized' grammars: Application to Tree Adjoining Grammars. In Proceedings of the 12th International Conference on Computational Linguistics (COLING '88), Budapest, Hungary, August 1988.

[Schabes et al., 1993] Y. Schabes, M. Roth, and R. Osborne. Parsing the Wall Street Journal with the Inside-Outside Algorithm. In Proceedings of the European ACL, 1993.

[Schabes, 1990] Yves Schabes. Mathematical and Computational Aspects of Lexicalized Grammars. PhD thesis, Computer Science Department, University of Pennsylvania, 1990.

[Schabes, 1992] Yves Schabes. Stochastic lexicalized tree-adjoining grammars. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING '92), Nantes, France, July 1992.

[Seneff] Stephanie Seneff. A relaxation method for understanding spontaneous speech utterances. In Proceedings, Speech and Natural Language Workshop, San Mateo, CA.

[Shieber and Schabes, 1990] Stuart Shieber and Yves Schabes. Synchronous Tree Adjoining Grammars. In Proceedings of the 13th International Conference on Computational Linguistics (COLING '90), Helsinki, Finland, 1990.

[Sleator and Temperley, 1991] Daniel Sleator and Davy Temperley. Parsing English with a Link Grammar. Technical Report CMU-CS-91-196, Department of Computer Science, Carnegie Mellon University, 1991.

[Soong and Huang] Frank K. Soong and Eng-Fong Huang. Fast Tree-Trellis Search for Finding the N-Best Sentence Hypothesis in Continuous Speech Recognition. Journal of the Acoustical Society of America, May.

[Srinivas and Baldwin, 1996] B. Srinivas and Breckenridge Baldwin. Exploiting supertag representation for fast coreference resolution. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLP+IA), Moncton, Canada, June 1996.

[Srinivas et al., 1995] B. Srinivas, Christine Doran, and Seth Kulick. Heuristics and parse ranking. In Proceedings of the 4th Annual International Workshop on Parsing Technologies, Prague, September 1995.

[Steedman, 1987] Mark Steedman. Combinatory Grammars and Parasitic Gaps. Natural Language and Linguistic Theory, 5, 1987.

[Steedman, 1996] Mark Steedman. Surface Structure and Interpretation. MIT Press, 1996.

[Steedman] Mark Steedman, editor. The Syntactic Interface. MIT Press, Cambridge, Massachusetts and London, England.

[Niesler and Woodland, 1996] T. R. Niesler and P. C. Woodland. A variable-length category-based n-gram language model. In Proceedings IEEE ICASSP, 1996.

[Trueswell and Tanenhaus, 1994] John Trueswell and Mike Tanenhaus. Toward a lexicalist framework for constraint-based syntactic ambiguity resolution. In K. Rayner, C. Clifton, and L. Frazier, editors, Perspectives on Sentence Processing. Lawrence Erlbaum Associates, Hillsdale, NJ, 1994.

[van Harmelen and Bundy, 1988] Frank van Harmelen and Alan Bundy. Explanation-Based Generalization = Partial Evaluation. Artificial Intelligence, 36, 1988.

[Vijay-Shanker and Joshi, 1991] K. Vijay-Shanker and Aravind K. Joshi. Unification Based Tree Adjoining Grammars. In J. Wedekind, editor, Unification-based Grammars. MIT Press, Cambridge, Massachusetts, 1991.

[Vijay-Shanker, 1987] K. Vijay-Shanker. A Study of Tree Adjoining Grammars. PhD thesis, Department of Computer and Information Science, University of Pennsylvania, 1987.

[Vijay-Shanker, 1992] K. Vijay-Shanker. Using Descriptions of Trees in a Tree Adjoining Grammar. Computational Linguistics, 18(4), 1992.

[Voutilainen] Atro Voutilainen. Two-level Morphology: A General Computation Model for Word-form Production and Generation. Publications of the Department of General Linguistics, University of Helsinki.

[Voutilainen, 1994] Atro Voutilainen. Designing a parsing grammar. Publications of the Department of General Linguistics, University of Helsinki, 1994.

[Waltz, 1975] D. Waltz. Understanding line drawings of scenes with shadows. In P. Winston, editor, The Psychology of Computer Vision. McGraw-Hill, New York, 1975.

[Weischedel et al., 1991] Ralph Weischedel, Damaris Ayuso, Sean Boisen, Heidi Fox, and Robert Ingria. A New Approach to Text Understanding. In Proceedings, Speech and Natural Language Workshop, San Mateo, CA, 1991.

[Weischedel et al., 1993] Ralph Weischedel, Richard Schwartz, Jeff Palmucci, Marie Meteer, and Lance Ramshaw. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2), June 1993.

[XTAG-Group, 1995] The XTAG-Group. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS 95-03, University of Pennsylvania, 1995.

[Zue et al., 1991] Victor Zue, James Glass, David Goodine, Hong Leung, Michael Phillips, Joseph Polifroni, and Stephanie Seneff. Integration of Speech and Natural Language Processing in the MIT Voyager System. In Proceedings IEEE ICASSP, 1991.