<<

Towards a Principled Computational System of Syntactic Ambiguity Detection and Representation

Hilton Alers-Valentín1, Carlos G. Rivera-Velázquez2, J. Fernando Vega-Riveros2 and Nayda G. Santiago2 1Linguistics and Cognitive Science Program, Department of Hispanic Studies, University of Puerto Rico-Mayagüez, P.O. Box 6000, Mayagüez, 00681-6000, Puerto Rico 2Department of Electrical and Computer Engineering, University of Puerto Rico-Mayagüez, P.O. Box 6000, Mayagüez, 00681-6000, Puerto Rico

Keywords: Parser, Lexicon, Structural Ambiguity, Computational Linguistics, Natural Language Processing.

Abstract: This paper presents the current status of a research project in computational linguistics/natural language processing whose main objective is to develop a symbolic, principle-based, bottom-up system to process and parse sequences of lexical items as declarative sentences in English. For each input sequence, the parser should produce (maximally) binary trees as generated by the Merge operation on lexical items. Due to parametric variations in the algorithm, the parser should be able to output (up to four) grammatically feasible structural representations accounted for by alternative constituent analyses in case of structural ambiguity in the analysis of the input string. Finally, the system should be able to state whether a particular string of lexical items is a possible sentence on account of its parsability. The system has a scalable software framework that may be suitable for the analysis of typologically-diverse natural languages.

1 INTRODUCTION

Natural language parsing is a computational process that takes as input a sequence of words and yields a syntactic structure for said sequence according to some sort of procedure. The production of a syntactic structure from a sequence determines whether it legally belongs to a language. There are two main types of parsers used to analyse word sequences. On one hand, there are probabilistic parsers which, given a statistical model of the syntactic structure of a language, will produce the most likely parse of a sentence, even if the word sequence is actually judged as ungrammatical by native speakers. Probabilistic parsers are widely used in natural language processing applications. However, they require a manually annotated corpus, a statistical learning algorithm, as well as training. Although these parsers are particularly good at identifying syntactic categories or parts of speech and have a desirable cost-benefit relation between accuracy and speed, they have been found rather ineffective in the representation of sentences containing relative clauses or long-distance relations among constituents. Deterministic parsers, on the other hand, use a system of syntactic rules to produce a structural representation. Deterministic parsers take input strings of natural languages and analyse them using production rules of a context-free grammar. If, for a given sequence of lexical items, the rules of a language grammar cannot produce a structural representation, the sequence is considered ungrammatical for that language.

A single grammatical sequence, however, may have multiple representations if it is syntactically ambiguous. The (generally assumed) Principle of Compositionality states that the meaning of an expression is a function of the meaning of its parts and of the way they are syntactically combined (Partee, 2004). As a consequence, syntactically ambiguous sequences may also be semantically ambiguous. There are two main causes of syntactic ambiguity: referential and structural. Referential ambiguity is due to possible valuations and interpretations of noun phrases and pronouns, as happens with the three possible assignments of the possessive pronoun in the woman said that she kicked her lover. Structural ambiguity occurs when there exist multiple structural relations between the lexical items and constituents of a sentence, as in the classical example the boy saw a man with a telescope. Sometimes, structural ambiguity is caused by multiple syntactic category assignment to a lexical item, as can be seen in visiting relatives can be a nuisance, in which the first word can be tagged either as a transitive verb or as an adjective. In this paper, a computational system is described that detects syntactic ambiguity in a string and yields the corresponding structural representations.

For most probabilistic parsers, syntactic ambiguity, and even ungrammaticality, remain undetected. To deal with structural ambiguity, we propose a deterministic (symbolic) parser that produces X-bar structural representations based on Principles-and-Parameters Theory modules to generate multiple syntactic parses for syntactically ambiguous sentences. Deterministic parsers in the form of minimalist grammars have already been formalized (Stabler, 1997, 2011; Collins and Stabler, 2016). Other symbolic parsers have been developed as computational models of syntactic competence (Berwick, 1985; Fong, 1991, 2005; Chesi, 2004, 2012); however, the parser we propose implements variation parameters that may account for structural ambiguity.
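To make the two readings of the classical example concrete, the competing constituent analyses can be written down as nested bracketings. The following minimal sketch (illustrative only, not part of the system described below) encodes them as Python tuples:

```python
# Two constituent analyses of "the boy saw a man with a telescope".
# Nested tuples stand for constituents; labels are purely illustrative.

# (a) The PP "with a telescope" modifies the VP: the seeing is done with a telescope.
vp_attachment = ("S", ("DP", "the boy"),
                      ("VP", ("VP", "saw", ("DP", "a man")),
                             ("PP", "with a telescope")))

# (b) The PP modifies the noun "man": the man has a telescope.
np_attachment = ("S", ("DP", "the boy"),
                      ("VP", "saw", ("DP", "a", ("NP", "man", ("PP", "with a telescope")))))

print(vp_attachment != np_attachment)  # True: one string, two distinct structures
```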




2 THEORETICAL BACKGROUND

Principles and Parameters Theory (Chomsky, 1981, 1995) is a generative-derivational theory of the human Faculty of Language. According to this theory, a natural (I-)language is an internal, individual, intensional cognitive state of the human mind (hence a mental organ) whose initial state, known as Universal Grammar (UG), contains a set of invariable principles and constraints that apply to all languages, as well as a set of variable parameters (possibly binary-valued) that children set during language acquisition from the primary linguistic data to which they are exposed. Among the fundamental principles of UG are the Structural Dependency Principle (syntactic structures show hierarchical structure and non-linear dependencies) and the Projection Principle (every minimal category projects its features to a maximal or phrase-level projection). Some of the best studied syntactic parameters are the Null Subject Parameter (languages may allow or disallow null subjects in finite clauses) and the Head Parameter (syntactic heads can be linearized before or after their complements). Languages, as steady states in the development of the Faculty of Language, are computational cognitive systems consisting of a lexicon, which contains representations of all primitives of linguistic computation (along with their features), and a grammar, a combinatorial system of operations on these representations. Sequences that satisfy all grammatical constraints of a language are mapped to (at least) one tuple of syntactic levels of structural representation: the theory-internal levels of what has been known as Deep and Surface Structure (DS, SS), and the interface levels of Logical Form (LF) and Phonetic Form (PF). In this theory, constraints are highly modularized and apply to syntactic structures from a certain level of representation onwards.

X-bar is a powerful and compact module of Principles and Parameters Theory (Adger, 2003; Carnie, 2013; Sportiche, Koopman and Stabler, 2013) for the representation of syntactic category formation in natural language, as it yields hierarchical structures in binary trees that encode the constituency of a sentence. The syntactic category or part of speech of a lexical item in a sentence is determined according to the item's morphology, grammatical features and syntactic distribution. Syntactic categories with referential meaning or content are classified as lexical (nouns, verbs, adjectives, adverbs, prepositions), while those that strictly serve grammatical purposes and are required for well-formedness are called functional (determiners, complementisers, coordinators, tense, auxiliaries, negation). Heads are lexical items from which full phrases are formed, and they project themselves into different levels. X-bar theory (where the variable X stands for a syntactic category) assumes three syntactic projection levels: minimal, intermediate and maximal. In the X-bar binary tree structure, minimal projections or heads (sometimes denoted as X°) are nuclear categories that do not dominate any other category; in other words, they are the terminal nodes of a syntactic tree. Intermediate projections (denoted as X' and read as "X-bar") are typically generated from the merge of a minimal projection and a subcategorized complement. Maximal projections or phrases (denoted as XP) are the highest level of a nuclear category which has satisfied all its subcategorization requirements and may dominate another phrase-level constituent (a specifier) merged with the intermediate projection. The X-bar module has only three general rules that apply to all lexical categories, i.e. the specifier rule, the complement rule, and the adjunct rule. The context-free X-bar rules may be combined in any order, which allows the production of different structures from the same array of words or lexical items. As a recursive rule, adjunction is the most unconstrained syntactic operation and is related to most instances of structural ambiguity.
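The projection levels and the three X-bar rules can be pictured with a small data-structure sketch. The class and function names below are illustrative assumptions rather than the paper's implementation; they only show how merging a head with a complement yields an intermediate projection, merging a specifier yields a maximal projection, and adjunction recursively extends a phrase:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Node:
    category: str          # e.g. "V", "N", "D", "P"
    level: str             # "min" (X°), "bar" (X'), or "max" (XP)
    children: Tuple = ()   # empty for heads, i.e. the terminal nodes

def complement_rule(head: Node, complement: Node) -> Node:
    """X° merged with a subcategorized complement yields X' (intermediate)."""
    return Node(head.category, "bar", (head, complement))

def specifier_rule(specifier: Node, bar: Node) -> Node:
    """A specifier merged with X' yields XP (maximal projection)."""
    return Node(bar.category, "max", (specifier, bar))

def adjunct_rule(phrase: Node, adjunct: Node) -> Node:
    """Recursive adjunction: an XP and an adjoined XP form another XP."""
    return Node(phrase.category, "max", (phrase, adjunct))

# A toy verbal projection with a DP complement, a DP specifier and a PP adjunct.
v_bar = complement_rule(Node("V", "min"), Node("D", "max"))
vp = specifier_rule(Node("D", "max"), v_bar)
vp_adjoined = adjunct_rule(vp, Node("P", "max"))
```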


X-bar by itself is overgenerative, which is problematic for a descriptively adequate model of linguistic competence. For this reason, other syntactic modules, like the Thematic Criterion, the Case Filter, Binding Principles and Bounding Theory, among others, are needed to impose conditions on the legal combinations in a language. More recent proposals call for a Minimalist Program (Chomsky, 1995, 2000) in the revision of linguistic theory. Under closer analysis, X-bar may not be a primitive, independently motivated principle of Universal Grammar, but the result of Merge, a more fundamental, recursive, binary operation on syntactic structures. Also, LF and PF may be the only required levels of syntactic representation, as they are the interface between the computational system (the grammar) and both the conceptual-intentional and sensory-motor external systems.

3 SYSTEM ARCHITECTURE

As a computational model of syntactic analysis, the system we describe has two main components, i.e., the parser and the lexicon. The parsing process requires an input in the form of a word sequence. The input sequence is tokenized during the pre-processing stage. Tokenization involves the segmentation of a string into lexical items or word elements called tokens. The syntactic category or part-of-speech tagging process takes place right after tokenization. This process accesses each tokenized element's grammatical features. For the tagging process to be successful, the tokenized element must be found in the lexicon. The parser then takes the tagged sequence and establishes relationships between the tagged elements by generating the syntactic structures of the input word sequence in the form of binary trees that correspond to the SS level of syntactic representation.

3.1 The Parser

Parsing can be performed by means of bottom-up or top-down approaches. In the system described, a bottom-up, bidirectional algorithm is implemented that produces multiple binary trees when a sequence has syntactic ambiguity. The algorithm starts by segmenting the sequence into tokens that correspond to lexical items; this is followed by the categorization of each lexical item and the application of the X-bar syntactic rules to produce the parse tree. The absolute upper bound of structural representations for a given sequence of n lexical items corresponds to the (n-1)th Catalan number, a sequence of natural numbers used in combinatorics to determine, among other things, the number of distinct binary trees that can be created from a given number of nodes.
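As a quick illustration of that bound (a standalone sketch, not the system's code), the (n-1)th Catalan number can be computed with the standard recurrence:

```python
def catalan(k: int) -> int:
    """k-th Catalan number, via C(0) = 1 and C(i+1) = C(i) * 2(2i+1) / (i+2)."""
    c = 1
    for i in range(k):
        c = c * 2 * (2 * i + 1) // (i + 2)
    return c

def max_binary_analyses(n_tokens: int) -> int:
    """Upper bound on distinct binary trees over a sequence of n lexical items."""
    return catalan(n_tokens - 1)

# "the boy saw a man with a telescope" has 8 tokens -> C(7) = 429 conceivable
# binary trees, of which only the grammatically licensed ones are produced.
print(max_binary_analyses(8))
```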


Figure 1: Interaction of direction and inspection.

The parser currently takes into account three parameters in order to be able to produce multiple representations in case of structural ambiguity. These parameters are direction, inspection and delay. The direction parameter determines if the input word sequence is to be analysed from left to right or from right to left. The implemented algorithm is bidirectional (LR and RR) because it analyses word sequences from left to right and from right to left, producing in both cases a rightmost derivation in reverse (Grune and Jacobs, 2008). The inspection parameter determines if the current lexical element looks forwards or backwards to compare itself with another lexical item in order to check for selection features. If the analysis is from left to right, the inspection parameter must be forwards (look one token ahead); if the analysis is from right to left, then the inspection must be backwards (look one token back).

As can be seen in (1), the setting of the inspection parameter is dependent on the direction parameter. Syntactic heads are categories with c(ategorial)-selection features or requirements (subcategorization). In the case of head-first languages such as English, the c-selection features of a head can be checked by a constituent to its right. In the case of subject-object-verb (SOV) or head-last languages, the left-to-right analysis would have backward inspection (1bi) and the right-to-left analysis would have forward inspection (1bii). Since this parser currently deals with declarative sentences in English, which show the subject-verb-object (SVO) constituent order typical of head-first languages, the inspection parameter is set to forward if the direction is left to right (1ai), but it is set to backwards if the direction is right to left (1aii).

Figure 2: Interaction of direction, inspection, and delay.

The delay parameter, which is either true or false, determines if the formation of determiner phrases (DPs) not governed by a preposition is delayed until other phrase formations take place. The delay in the formation of these DPs, i.e., the merge of a determiner (D) with a noun phrase (NP), allows the formation of more complex DPs in certain contexts, particularly within verbal phrases (VPs). For example, prepositional phrases (PPs) can sometimes be parsed as adjuncts (modifiers) of VPs or as adjuncts of NPs, but not as adjuncts of DPs (for syntactic and semantic reasons outside the scope of this paper). While an NP is not merged with a D, the NP may adjoin a PP; otherwise, the PP can only be merged to the structure via VP adjunction. For example, the sequence saw a man with a telescope has two structural descriptions, as shown in (2). Without DP delay (2a), the DPs a man and a telescope are formed within the same iteration. In the next iteration, the VP saw a man and the PP with a telescope are formed; thus, the PP can only adjoin the VP. However, with DP delay (2b), the merge of the D a with an NP is delayed, allowing the formation of a more complex NP with PP adjunction. As can be observed, the formation of the DP a telescope is not delayed, as this DP is governed by the P with.

Since the direction and inspection parameters seem to be mutually dependent, the combination of these three parameters will produce a total of four variations of the same basic algorithm. This allows for the possible production of up to four syntactic representations for structurally ambiguous word sequences.
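Because inspection is fixed by direction in the current (SVO) setting, the four variants of the algorithm come from crossing direction with the delay flag. A minimal sketch of that parameter space (the names are illustrative, not the implementation's):

```python
from itertools import product

def inspection_for(direction: str) -> str:
    """In the head-first (SVO) setting described above, inspection follows direction."""
    return "forward" if direction == "left-to-right" else "backward"

# Four parameter settings -> up to four structural representations per input sequence.
variants = [
    {"direction": direction, "inspection": inspection_for(direction), "delay": delay}
    for direction, delay in product(("left-to-right", "right-to-left"), (False, True))
]

for variant in variants:
    print(variant)
```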

3.2 The Lexicon

The lexicon is the language module that contains the grammatical information about the lexical items in the sentence that is to be analysed by the parser. Since it is necessary to determine if a certain combination of words is licensed or grammatical in the language, this system also requires the construction of a robust lexicon that at least contains the syntactic category, subcategorization frames and relevant grammatical features (such as case, c-selectional and phi- [or agreement] features) for each lexical item. As Fong (2005: 313) defines it, the lexicon is "the heart of the implemented system." For this system, the lexicon was manually tagged by a team of linguists. To facilitate pre-processing, the lexicon contains every fully inflected word form appearing in a corpus of 200 sentences that were constructed for validation purposes. Lexical items are entered as a string of literals, and features are indicated by means of different data types. All lexical items are labelled with a syntactic category; additionally, each category requires a specific subset of valued features.

Verb. The main predicative category, verbs are labelled according to their argument structure with the subclasses transitive, ditransitive and intransitive, and by their tense, aspect and mood (TAM) features as ±pret, perf, prog, pas, base. Transitive and ditransitive verbs are assigned a case feature, acc(usative), and are given a subcategorization frame for arg1; ditransitive verbs also subcategorize for arg2. Optional c-selection features in subcategorization frames are indicated within parentheses. Intransitive verbs, on the other hand, which include both unergative and unaccusative verbs, are not assigned case or c-selection features. As for TAM features, verb forms tagged as ±pret, perf, prog, and pas are also labelled as finite with a 1 bit, and those tagged as base are labelled as non-finite with a 0 bit. In the case of passive participles, they are tagged as pas, no case feature is assigned, and their arg1 frame is replaced with their arg2 frame, leaving the arg2 frame subsequently as an empty list. In this way, the parser is able to analyse passive declarative clauses without accounting for syntactic movement or thematic roles.
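A hypothetical entry pair under this feature scheme is sketched below for the ditransitive verb promise; the field names and values are assumptions for illustration, not the project's actual records. Note how the passive participle loses its case feature and has its arg1 frame replaced by the former arg2 frame:

```python
# Illustrative entries for active and passive forms of the ditransitive verb
# "promise" (hypothetical values; not the project's actual database rows).
lexicon = {
    "promised": {                  # active: "Poirot promised Maigret the job"
        "category": "V",
        "subclass": "ditransitive",
        "tam": "+pret",
        "finite": 1,               # ±pret, perf, prog, pas forms are finite
        "case": "acc",             # (di)transitive verbs check accusative case
        "arg1": ["DP"],            # first internal argument
        "arg2": ["DP"],            # second internal argument
    },
    "promised_passive": {          # passive participle: "Maigret was promised the job"
        "category": "V",
        "subclass": "ditransitive",
        "tam": "pas",
        "finite": 1,
        "case": None,              # no case feature for passive participles
        "arg1": ["DP"],            # arg1 frame replaced by the former arg2 frame...
        "arg2": [],                # ...leaving arg2 as an empty list
    },
}
```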


Auxiliary. Items of this category are tagged with a subclass: Perf, Prog, Pas. Each auxiliary c-selects a specific subclass of AuxP or VP with a particular TAM feature. For example, the perfect auxiliary has c-selects a VP (or another AuxP) with TAM perf: He has interrogated the witness; He has been interrogated.

T(ense). This category includes true modals and tense/finiteness markers. Items of this category always precede the negation particle not, are assigned a nom(inative) case feature, and c-select either a VP, an AuxP or a NegP as arg1 and (with the exception of infinitival to) a DP or CP as arg0. As a consequence, T will always merge with a verbal (functional) projection as its complement and with a nominal or clause-level projection in its specifier as a subject. Thus, the universal requirement that every clause must have a subject (known in the syntactic literature as the Extended Projection Principle or EPP) is satisfied by checking the c-selection features in arg0.

Complementiser. With the exception of small clauses and raising structures, complementisers (Cs) are the maximal functional category of clauses and sentences. At this stage of the implemented system, complementisers are only labelled by their force as ±Q, although the parser is not yet handling interrogatives. Cs c-select a TP complement, and it is so specified in the arg1 frame.

Noun. Words of this category are classified as common or proper, which is syntactically relevant as the latter subclass does not generally admit determiners (at least definite and indefinite articles) in English. Nouns can also have argument structure, especially deverbal nouns, which inherit it from the verb they are derived from. But, unlike verbs, the c-selection features of nouns need not be checked in order to produce a well-formed structure: The destruction was imminent; the destruction of Carthage was imminent.

Determiner. This nominal functional category includes articles, demonstratives and possessives, as well as quantifiers and personal pronouns. The distribution of DPs is highly constrained: all (referential) DPs must be syntactically licensed by certain heads in order to appear in well-formed structures. Thus, the syntactic module of Case Theory rules the licensing and distribution of DPs. In order to implement this constraint, all Ds have to check a formal case feature with a licensing head: finite T checks nom(inative); transitive V, acc(usative); prepositions (Ps), obl(ique). Most determiners (with the exception of personal pronouns) may check any case in order to be licensed. On the other hand, since case in personal pronouns is morphological and not abstract, they must check a specific case according to their morphology: he must check nom; him must check either acc or obl. Non-pronominal determiners also c-select a complement NP, which is specified in the arg1 frame.

Other categories, such as adjective, adverb and preposition, are also labelled in the lexicon with the applicable features.
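The case-licensing requirement on determiners reduces to a small table of licensing heads plus the morphological restrictions of personal pronouns. The sketch below is a simplification with assumed names, not the module's actual code:

```python
# Which case feature each licensing head checks (simplified from the description above).
CASE_LICENSORS = {"T_finite": "nom", "V_transitive": "acc", "P": "obl"}

# Personal pronouns carry morphological case; other determiners accept any case.
PRONOUN_CASE = {"he": {"nom"}, "him": {"acc", "obl"}}

def is_licensed(determiner: str, licensing_head: str) -> bool:
    """A D is licensed if it can check the case feature of its licensing head."""
    checked = CASE_LICENSORS.get(licensing_head)
    if checked is None:
        return False                       # no case licensor: the DP is not licensed
    allowed = PRONOUN_CASE.get(determiner) # None for non-pronominal determiners
    return allowed is None or checked in allowed

print(is_licensed("he", "T_finite"))       # True: nominative subject position
print(is_licensed("him", "T_finite"))      # False: *"him will leave"
print(is_licensed("the", "V_transitive"))  # True: non-pronominal Ds check any case
```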
4 IMPLEMENTATION, TESTING AND RESULTS

The system was implemented in Python 3.6.1. The lexicon was provided and stored in a MySQL database system. The tagging process needed the pymysql external library in order to communicate with the database. The parsing algorithm was implemented purely in Python, but the binary trees were represented through the use of the external libraries Plotly and igraph.

As for the database system, a decision was made between different technologies that best fitted the system. MySQL was chosen mainly because of its compatibility with all operating systems and its support for horizontal partitioning. As soon as the lexicon was completed, the design of the database started by defining its entity-relationship diagram, how the tables should look, and the queries to be utilized by the algorithm. The database uses four tables: Lexicon, a table for arguments, Case, and Contractions. As each lexical item has a maximum of three arguments, a separate table was created to avoid mistakes in the lexicon. It should also be noted that, since arguments are represented as a string separated with commas, there is no intermediate table to assign an argument to each lexical item, since that is exactly what the tagging component expects to receive. The final table, Contractions, was created to manage contractions in a dynamic way. The tokenizer communicates with the server and, if a contraction is found in the sentence, it is separated into the two lexical items that form the contraction so the parser can analyse them. This Contractions table bears no relationship with the other tables created so far.
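A sketch of how the tokenizer's contraction lookup could work through pymysql is given below; the connection settings and the table and column names (Contractions, contraction, first_item, second_item) are assumptions for illustration, not the project's actual schema:

```python
import pymysql

def expand_contractions(tokens, connection):
    """Replace each contraction with the two lexical items stored for it."""
    expanded = []
    with connection.cursor() as cursor:
        for token in tokens:
            # Assumed table/columns: Contractions(contraction, first_item, second_item)
            cursor.execute(
                "SELECT first_item, second_item FROM Contractions WHERE contraction = %s",
                (token,),
            )
            row = cursor.fetchone()
            expanded.extend(row if row else [token])
    return expanded

# Example usage (credentials are placeholders):
# conn = pymysql.connect(host="localhost", user="user", password="pw", database="lexicon_db")
# print(expand_contractions(["he", "won't", "leave"], conn))  # -> ["he", "will", "not", "leave"]
```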


The system has a pipeline structure, shown in (3):

Figure 3: Pipeline structure of the system (Pre-Processor → Parser → Database).

The system efficiency was tested with a set of 200 declarative sentences that was constructed to validate the algorithm. The two major aspects of testing involved grammaticality judgement and ambiguity detection. Ambiguous sequences have first to be considered grammatical before alternate representations are produced. For the system to pass the grammaticality judgement test, at least one structural representation must be produced for the sequence to be identified as grammatical. On the other hand, to pass the ambiguity detection test, the analyser had to produce at least two syntactic representations for a structurally ambiguous sequence. The parser correctly identified a sequence as grammatical for 78% of the validation sentences. The ambiguity detection test resulted in an 82.76% success rate for the grammatically identified sequences. However, some structures were particularly difficult for the parser to judge as grammatical, for example, subordinate finite clauses with null complementisers, as in I think he will not leave the house. Although grammatically correct, this sequence failed to parse due to the fact that the verb think assigns accusative case to subject determiners and pronouns, yet the case feature of he is nominative, which was interpreted by the parser as a case feature mismatch. As expected, the sequence is judged correctly when the complementiser is overt (I think that he will not leave the house). Non-finite clauses were also challenging, like infinitival subjects (To err is human), which the current algorithm does not parse, and adjunct gerundive clauses (I will meet John eating waffles), since the parser expected present participles to be licensed by a progressive auxiliary. Double object constructions (Poirot promised Maigret the job last week) and Saxon genitive constructions (The army's destruction of the city was imminent) were typically misjudged by the parser as ungrammatical, on account of a case checking failure. Finally, for some sequences, although correctly judged as grammatical, the parser did not detect ambiguity (I ate the macaroni that my mother cooked yesterday).
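The two pass criteria amount to simple counts over the parser's output; a minimal sketch (function names assumed for illustration):

```python
def passes_grammaticality_test(parses) -> bool:
    """A sequence counts as grammatical if at least one representation is produced."""
    return len(parses) >= 1

def passes_ambiguity_test(parses) -> bool:
    """Ambiguity is detected if at least two distinct representations are produced."""
    return len(set(parses)) >= 2

parses = ["(VP (VP saw (DP a man)) (PP with a telescope))",
          "(VP saw (DP a (NP man (PP with a telescope))))"]
print(passes_grammaticality_test(parses), passes_ambiguity_test(parses))  # True True
```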

5 CURRENT LIMITATIONS AND FUTURE WORK

The validation test results were promising, yet there are still some foreseen limitations in the system. Currently, the parser cannot produce all possible adjunctions in cases of four or more consecutive maximal projections on which adjunction can be performed (such as APs, PPs, and CPs for NPs; AdvPs, PPs and CPs for VPs). Two mechanisms have been identified to overcome this limitation. First, a new parameter, adjunction implementation, may be added to the algorithm. This parameter may signal either of two methods: sequential adjunction, in which only the original nodes of the current iteration would be available objects for adjunction, or nested adjunction, in which the nodes newly formed by previous adjunction within an iteration would be available as well for this operation. Each method produces distinct results in sequences of four or more consecutive maximal projections that can be adjoined. A second, perhaps simpler and more elegant mechanism involves a recursive method that generates all possible adjunction patterns of this binary operation.
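One way to realize that recursive mechanism is to enumerate every binary attachment pattern over a core phrase and its consecutive adjuncts. The sketch below is an interpretation offered for illustration, not the project's code; the number of patterns it yields again follows the Catalan numbers mentioned in Section 3.1:

```python
def adjunction_patterns(nodes):
    """Enumerate every binary bracketing of `nodes`, preserving their linear order.
    Nested tuples stand for the structures produced by successive adjunctions."""
    if len(nodes) == 1:
        return [nodes[0]]
    patterns = []
    for split in range(1, len(nodes)):
        for left in adjunction_patterns(nodes[:split]):
            for right in adjunction_patterns(nodes[split:]):
                patterns.append((left, right))
    return patterns

# A VP followed by three consecutive adjoinable PPs yields five candidate structures.
for pattern in adjunction_patterns(("VP", "PP1", "PP2", "PP3")):
    print(pattern)
```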
Due to time constraints in the project development, the current implementation deals with CPs in a somewhat different way than other phrases, by using a top-down rule instead of the typical bottom-up strategy used everywhere else. This fact may be the cause of some failures in infinitival analysis and other non-finite clauses, as well as occasional flaws in ambiguity detection in subordinate clauses. This present limitation is expected to be relatively easy to overcome by adding extra rules or functionalities to the implementation.

Noun phrases allow for empty determiners (nude NPs) with certain kinds of noun heads. Not only the distinction between common and proper nouns is relevant in this regard, but also the distinction between count and mass nouns is necessary for the correct grammaticality judgement of nude NPs in sentences. Some polysemic nouns have a meaning associated with the mass noun class and another with the count noun class (as with resistance). This requires additional tagging in the lexicon for the distinction between these two subclasses, along with a method in the parsing implementation to handle the distinction appropriately.


Also within the nominal domain, the Saxon genitive construction was challenging. A solution to the problem requires a treatment similar to that of the contractions, labelling 's as a D that c-selects an arg0.

Double object constructions, or even ditransitives in general, may present a challenge to our current system. For a similar reason, complex NPs with more than one argument may not be correctly parsed. Among the possible solutions to this problem is the inclusion of other functional or light categories to allow for richer structural representations.

Currently, all intransitive verbs, either unaccusative or unergative, are handled in the same way by the system. This may be problematic for the analysis of passive constructions or for auxiliary selection in languages where the choice of certain verbal auxiliaries depends on whether the verb is unaccusative or not. Again, a richer VP-internal structure representation may be needed, as well as some implementation of a Theta-Theory module for both the parser and the lexicon.

Since the current parser does not account for syntactic movement, structures that require overt transformations, such as interrogatives and relative clauses, are not analysed. A single syntactic object may not comply with all required conditions, but a chain structure consisting of a moved object (such as a DP) and its trace in its base position would comply simultaneously with, for example, the Case Filter and Theta Theory, respectively. Different mechanisms are being considered for its implementation.

Along with the inclusion of mechanisms to account for movement, it would be necessary for the parser to recognize locality of dependencies and violations thereof, for which a Bounding Theory module must be implemented.
Referential ambiguity is not detected by the present system, as it requires additional data structures, such as indices for coreferentiality, and the implementation of Binding Conditions on the interpretation of referential expressions and pronouns. Various mechanisms should be considered to overcome this limitation.

The structural representations generated by this system mostly correspond to the SS level of syntactic representation. At this level, certain operator scope ambiguities may not be detected. These scope differences are encoded in the LF interface level, where it is argued that operators such as quantifiers or interrogative expressions covertly move, obeying the same movement restrictions and locality conditions of overt movements. To achieve this, the system would have to generate LF structural representations instead. Operating on logical forms would also facilitate the integration of this system with semantic functionalities such as semantic composition and valuation, and textual entailment.

Fortunately, this computational system has been purposely designed to have a scalable software framework, so that more functionalities may be added with minimum impact on current methods and data structures. Although at this stage the system has been implemented exclusively to analyse English sentences, it may well suit typologically-diverse natural languages. The lexicon database may be easily augmented, and the language-specific methods to handle English sequences are minimal, an advantage inherent to principle-based over rule-based systems.

ACKNOWLEDGEMENTS

This project was born from the fruitful discussions within the CompLing/NLP Research Group at the University of Puerto Rico-Mayagüez (UPRM). The authors would like to acknowledge the contributions of the other members of Team Forest at UPRM: Orlando Alverio, who was responsible for the system database and user interface, and the Linguistics team who were in charge of lexical tagging: Maday Cartagena, Jerry Cruz, Luis Irizarry, Joshua Mercado, Alejandra Santiago, and Ammerys Vázquez. David Riquelme and Victor Lugo reviewed the parser's output and helped with the validation process. The authors are also indebted to two anonymous reviewers for their valuable observations and suggestions.

REFERENCES

Adger, D., 2003. Core Syntax: A Minimalist Approach, Oxford University Press. Oxford.
Berwick, R., 1985. The Acquisition of Syntactic Knowledge, MIT Press. Cambridge, Mass.
Carnie, A., 2013. Syntax: A Generative Introduction, Blackwell. Oxford, 3rd edition.
Chesi, C., 2004. Phases and Cartography in Linguistic Computation. Ph.D. Thesis. University of Siena.
Chesi, C., 2012. Competence and Computation: Towards a Processing Friendly Minimalist Grammar, Unipress. Padova.
Chomsky, N., 1981. Lectures on Government and Binding, Mouton de Gruyter. Berlin.


Chomsky, N., 1995. The Minimalist Program, MIT Press. Cambridge, Mass.
Chomsky, N., 2000. Minimalist Inquiries: The Framework. In Martin, R. et al., 2000. Step by Step: Essays on Minimalist Syntax in Honor of Howard Lasnik. Cambridge, Mass: MIT Press.
Collins, C., Stabler, E., 2016. A Formalization of Minimalist Syntax. Syntax 19, 1.
Fong, S., 1991. Computational Properties of Principle-Based Grammatical Theories. Ph.D. Thesis. MIT.
Fong, S., 2005. Computation with Probes and Goals. In Di Sciullo, 2005. Amsterdam: John Benjamins.
Gomez-Marco, O., 2015. Towards an X-Bar Parser: A Model of English Syntactic Performance. M.S. Thesis. University of Puerto Rico-Mayagüez.
Grune, D., Jacobs, C.J.H., 2008. Parsing Techniques: A Practical Guide, Springer. Amsterdam, 2nd edition.
Joshi, A., Schabes, Y., 1997. Tree-Adjoining Grammars. In Rozenberg, G. and Salomaa, A., 1997. Berlin: Springer.
Partee, B., 2004. Compositionality in Formal Semantics, Blackwell. Oxford.
Partee, B., ter Meulen, A., Wall, R., 1990. Mathematical Methods in Linguistics, Kluwer Academic. Dordrecht.
Phillips, C., 1996. Order and Structure. Ph.D. Thesis. MIT.
Sportiche, D., Koopman, H., Stabler, E., 2013. An Introduction to Syntactic Analysis and Theory, John Wiley and Sons. Oxford.
Stabler, E., 1997. Derivational Minimalism. In Retoré, C., 1997. Logical Aspects of Computational Linguistics. Berlin: Springer.
Stabler, E., 2011. Computational Perspectives on Minimalism. In Boeckx, C., 2011. The Oxford Handbook of Linguistic Minimalism. Oxford: OUP.
