Formal and Computational Aspects of Natural Language Syntax Owen Rambow University of Pennsylvania

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by ScholarlyCommons@Penn University of Pennsylvania ScholarlyCommons IRCS Technical Reports Series Institute for Research in Cognitive Science May 1994 Formal and Computational Aspects of Natural Language Syntax Owen Rambow University of Pennsylvania Follow this and additional works at: http://repository.upenn.edu/ircs_reports Rambow, Owen, "Formal and Computational Aspects of Natural Language Syntax" (1994). IRCS Technical Reports Series. 154. http://repository.upenn.edu/ircs_reports/154 University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-94-08. This paper is posted at ScholarlyCommons. http://repository.upenn.edu/ircs_reports/154 For more information, please contact [email protected]. Formal and Computational Aspects of Natural Language Syntax Abstract This thesis explores issues related to using a restricted mathematical formalism as the formal basis for the representation of syntactic competence and the modeling of performance. The specific onc tribution of this thesis is to examine a language with considerably freer word-order than English, namely German, and to investigate the formal requirements that this syntactic freedom imposes. Free word order (or free constituent order) languages can be seen as a test case for linguistic theories, since presumably the stricter word order can be subsumed by an apparatus that accounts for freer word order. The formal systems investigated in this thesis are based on the tree adjoining grammar (TAG) formalism of Joshi et al. (1975). TAG is an appealing formalism for the representation of natural language syntax because its elementary structures are phrase structure trees, which allows the linguist to localize linguistic dependencies such as agreement, subcategorization, and filler-gap relations, and to develop a theory of grammar based on the lexicon. The ainm results of the thesis are an argument that simple TAGs are formally inadequate, and the definition of an extension to TAG that is. Every aspect of the definition of this extension to TAG, called V-TAG, is specifically motivated by linguistic facts, not by formal considerations. A formal investigation of V-TAG reveals that (when lexicalized) it has restricted generative capacity, that it is polynomial parsable, and that it forms an abstract family of languages. This means that it has desirable formal properties for representing natural language syntax. Both a formal automaton and a parser for V-TAG are presented. As a consequence of the new system, a reformulation of the linguistic theory that has been proposed for TAG suggests itself. Instead of including a transformational step in the theory of grammar, all derivations are performed within mathematically defined formalisms, thus limiting the degrees of freedom in the linguistic theory, and making the theory more appealing from a computational point of view. This has several interesting linguistic consequences; for instance, functional categories are expressed by feature content (not node labels), and head movement is replaced by the adjunction of heads. The thesis sketches a fragment of a grammar of German, which covers phenomena such as scrambling, extraposition, topicalization, and the V2 effect. Finally, the formal automaton for V-TAG is used as a model of human syntactic processing. It is shown that this model makes several interesting predictions related to free word order in German. Comments University of Pennsylvania Institute for Research in Cognitive Science Technical Report No. IRCS-94-08. This thesis or dissertation is available at ScholarlyCommons: http://repository.upenn.edu/ircs_reports/154 The Institute For Research In Cognitive Science Formal and Computational Aspects of Natural Language Syntax (Ph.D. Dissertation) by P Owen Rambow E University of Pennsylvania 3401 Walnut Street, Suite 400C Philadelphia, PA 19104-6228 N May 1994 Site of the NSF Science and Technology Center for Research in Cognitive Science N University of Pennsylvania IRCS Report 94-08 Founded by Benjamin Franklin in 1740 Formal and Computational Aspects of Natural Language Syntax Owen Rambow A Dissertation in Computer and Information Science Presented to the Faculties of the UniversityofPennsylvania in Partial Fulllment of the Requirements for the Degree of Do ctor of Philosophy Aravind K Joshi and AnthonyKroch Sup ervisors of Dissertation c Copyright by Owen Rambow This researchwas partially funded byARO grants DAAGK and DAALCPRI DARPAgrants NK and NJ NSF grants MCS and IRI and Ben Franklin Grant SC at the UniversityofPennsylvania iii Acknowledgements Ihave b een lucky to havetwo extremely supp ortive advisors Aravind Joshis deep insightinto and interest in the eld haveprovided the intellectual framework in which this thesis work is inscrib ed TonyKrochs undogmatic metho dological rigor has formed my understanding of lin guistic research This thesis is profoundly marked by their scholarship and advice and would not have b een p ossible without it Iwould like to thank the memb ers of my thesis committee for their insightful comments questions and criticisms Jean Gallier Mitch Marcus and Mark Steedman I would esp ecially like to thank the two external memb ers of my committee Beatrice Santorini Northwestern University and Hans Uszkoreit Universitat des Saarlandes whose detailed comments and questions havegone far b eyond the call of duty and whose feedback has b een instrumental in shaping various parts of this thesis It has b een a pleasant surprise to nd that academic research is not a solitary enterprise My interest in computational linguistics arose from an early joint pro ject with Tanya Korelsky and Richard Kittredge I would like to thank them for exp osing me to linguistic theory and to com putational linguistics and for sending me o to Penn for graduate scho ol This thesis originated with a joint pro ject with Tilman Becker and Aravind Joshi to examine the syntax of German in hael Niv led to a tree adjoining grammar framework Collab oration with Tilman Becker and Mic a b etter understanding of the formal prop erties of word order variation Together with Young Suk Lee I have explored the linguistic asp ects of using tree adjoining grammar for free word order languages A joint pro ject with Bob Frank and YoungSuk Lee help ed me gain a b etter understanding of the syntactic prop erties of scrambling Jointwork with Giorgio Satta has dealt with some mathematical prop erties of formalisms used for linguistic purp oses FinallyIhave worked with Tilman Becker on a parsing algorithm for one of the formalisms discussed in this thesis Throughout the thesis work that is directly based on jointwork is clearly marked as such however the entire thesis would have b een imp ossible without the stimulating eect of intellec tual interaction I am extremely grateful to the researchers with whom I have had the fortune of collab orating These collab orations have b een made p ossible to a large extentby the p ositive environment for interdiscipli nary research at the Institute for Research in Cognitive Science at the UniversityofPennsylvania This thesis work has b een greatly help ed by discussions with Anne Ab eille Bob Frank Beryl Homan Sabine Iatridou Bernhard Rohrbacher Yves Schab es K VijayShanker and David Weir Michael Hegarty and Caroline Heyco ckhave b een extremely generous in sp ending time explaining various asp ects of linguistic theory to me I would also like to thank Lauri Karttunen iv Bob Kasp er and Mike Reap e for helpful discussions and crucial comments FinallyIhave proted from the courses taughtby Ellen Prince and Bonnie Webb er even though the lessons learned ab out the imp ortance of linguistic and extralinguistic contextgobeyond the scop e of this thesis It is in order to acknowledge the multiplicity of authors ideas and inuences that have shap ed this thesis that I will use the rst p erson plural pronoun in the thesis Needless to say I assume full resp onsibility for all errors and infelicities Iwould like to thank John Bacon Tilman Becker Michael Hegarty Beryl Homan Philip Resnik Giorgio Satta and K VijayShanker for reading various drafts of various parts of this thesis and providing crucial early feedback I am grateful to Tilman Becker Mischa CzarnyTanya Korelsky YoungSuk Lee Aileen Rambow Bernhard Rohrbacher and Beatrice Santorini for native sp eaker judgments John Bacon has b een patient and supp ortive and has provided p ersp ective without him graduate scho ol would have b een less fun FinallyIwould like to thank my parents Gerhard and Susan Rambow and my sister Aileen for their interest in mywork and their encouragement and supp ort v Abstract Formal and Computational Asp ects of Natural Language Syntax Author Owen Rambow Sup ervisors Aravind K Joshi and Anthony Kro ch This thesis explores issues related to using a restricted mathematical formalism as the formal basis for the representation of syntactic comp etence and the mo deling of p erformance The sp ecic contribution of this thesis is to examine a language with considerably freer wordorder than English namely German and to investigate the formal requirements that this syntactic freedom imp oses Free word order or free constituent order languages can b e seen as a test case for linguistic theories since presumably the stricter word order can b e subsumed by an apparatus that accounts for freer word order The formal systems investigated in this thesis are based on the

Formal and Computational Aspects of Natural Language Syntax Owen Rambow University of Pennsylvania

Fundamental Methodological Issues of Syntactic Pattern Recognition

Lecture 5 Mildly Context-Sensitive Languages

The Linguistic Relevance of Tree Adjoining Grammar

Algorithm for Analysis and Translation of Sentence Phrases

Surface Without Structure Word Order and Tractability Issues in Natural Language Analysis

Parsing Discontinuous Structures

Fundamental Study

Technical Report No. 2006-514 All Or Nothing? Finding Grammars That

On the Degree of Nondeterminism of Tree Adjoining Languages and Head Grammar Languages Suna Bensch, Maia Hoeberechts

Discontinuous Data-Oriented Parsing Through Mild Context-Sensitivity

Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication

Expressivity and Complexity of the Grammatical Framework