A Computational Grammar for Portuguese
Bruno Cuconato Claro

A computational grammar for Portuguese

Dissertation submitted to the Escola de Matemática Aplicada in partial fulfillment of the requirements for the degree of Master in Mathematical Modeling

Fundação Getulio Vargas
Escola de Matemática Aplicada
Master's Program in Mathematical Modeling
Emphasis in Information Modeling and Analysis

Supervisor: Alexandre Rademaker

Rio de Janeiro
2019

International Cataloging-in-Publication Data (CIP)
Catalog record prepared by the Sistema de Bibliotecas/FGV

Claro, Bruno Cuconato
A computational grammar for Portuguese / Bruno Cuconato Claro. – 2019.
112 leaves.
Dissertation (master's) - Fundação Getulio Vargas, Escola de Matemática Aplicada.
Advisor: Alexandre Rademaker.
Includes bibliography.
1. Linguistics - Data processing. 2. Type theory. 3. Natural language processing (Computer science). I. Rademaker, Alexandre. II. Fundação Getulio Vargas. Escola de Matemática Aplicada. IV. Title.
CDD – 006.35
Prepared by Maria do Socorro Almeida – CRB-7/4254

Acknowledgements

I thank my significant other for the love, patience, and help. You know well how much you helped me through this.

I thank my family for the love and support. The choices you made for me in the past allowed me to choose this path now.

I thank my advisor Alexandre Rademaker for the many ideas, discussions, and support. I have learned a multitude of things under your guidance, only some of which appear here.

I thank professor Flávio Coelho for introducing me to the Unix tradition. Not a day goes by that I don't use what I learned from the book you lent me.

I thank professor Paulo Carvalho for the help, advice, and teachings – your introduction to mathematics still echoes in everything I've done since then.

I thank Cirlei de Oliveira, Elisângela Santana, Conceição Lima, Cristiane Guimarães, and Mônica Souza for the help, support, and conversations.
I thank my friends and colleagues for the companionship and the conversations (which more than once led to interesting ideas, even if we were just talking nonsense). I'd especially like to mention Laura Sant'Anna, Kátia Nishiyama, Henrique Muniz, Guilherme Passos, Pedro Delfino, Harllos Arthur, Alessandra Cid, Alexandre Tessarollo, João Carabetta, and Fernanda Scovino.

Finally, I'd like to say I would have given up on this dissertation's idea if not for the extreme patience and kindness of a former stranger who is now a dear friend – thank you, Inari Listenmaa.

"Look at the raw PGF"

Abstract

In this work we present a freely available type-theoretical computational grammar for Portuguese, implemented in the Grammatical Framework (GF) multilingual formalism. Such a grammar can be used for both syntactic parsing and natural language generation. We first describe the formalism itself, discussing its logico-mathematical foundations; we then present our grammar. We evaluate our grammar's productions with respect to syntactic correctness, show possible applications, and discuss future work.

Keywords: Type theory. Natural Language Processing. Computational linguistics. Natural Language Generation. Grammar engineering.

Resumo

This work describes the creation of a computational grammar for Portuguese implemented in the Grammatical Framework formalism. In it we present the formalism and our grammar. We evaluate our grammar with respect to the syntactic correctness of its productions, demonstrate possible applications, and discuss future work.

Palavras-chave: Type theory. Natural Language Processing. Computational linguistics. Natural Language Generation. Grammar engineering.

List of Figures

Figure 1 – Constituent analysis of a sentence
Figure 2 – A functor implementation of Haskell and Lisp list syntaxes
Figure 3 – The PGF grammar for Haskell lists
Figure 4 – A context-free approximation for Haskell lists without an intermediary category for empty lists
Figure 5 – Parse for the string [x , x] using CFG approximation with on-the-fly specialization
Figure 6 – Parsing deduction rules
Figure 7 – Parse deduction for the string [x , x]
Figure 8 – Failed parse deduction for the string [x , x]: unexpected token
Figure 9 – Failed parse deduction for the string [x , x]: unexpected token
Figure 10 – German predicate Prime
Figure 11 – RGL module structure (condensed)
Figure 12 – GF RGL category system
Figure 13 – Module structure of the Portuguese mini resource grammar
Figure 14 – Abstract syntax trees for John is from the city; the hole in the diamond-shaped node can be filled by both UseComp and UseComp_estar
Figure 15 – Abstract syntax tree for what she did is important
Figure 16 – Two analyses for the sentence John is not a doctor
Figure 17 – Abstract syntax tree for John saw no animals
Figure 18 – Screenshot of the DG demo app
Figure 19 – Different kinds of trees
Figure 20 – Converting decorated parse tree to a dependency tree
Figure 21 – Dependency trees converted from GF trees

List of Tables

Table 1 – Inflection table for the word gramática
Table 2 – RGL and Romance tenses, and how they inflect verbs in a Portuguese sentence

Listings

Listing 1.1 – Example GF code listing
Listing 2.1 – Abstract grammar Foods
Listing 2.2 – Concrete English grammar for Foods
Listing 2.3 – Abstract syntax for linked lists in GF
Listing 2.4 – GF shell session with abstract module only
Listing 2.5 – Lisp list linearization types
Listing 2.6 – Lisp list linearization rules
Listing 2.7 – GF shell session with single concrete module
Listing 2.8 – Haskell List linearization types
Listing 2.9 – Haskell List linearization rules
Listing 2.10 – Cons linearization rule using GF tables
Listing 2.11 – Translation between List concrete syntaxes
Listing 2.12 – Cons using a consWith oper
Listing 2.13 – List syntax interface module
Listing 2.15 – Haskell list syntax instantiation
Listing 2.16 – Haskell list functor instantiation
Listing 2.17 – Lisp list syntax instantiation
Listing 2.18 – Lisp list functor instantiation
Listing 2.14 – List functor module
Listing 3.1 – A hand-written rule
Listing 3.2 – Using the RGL constructors directly
Listing 3.3 – Using the RGL API
Listing 3.4 – mkCl overloaded function in the RGL API
Listing 3.5 – Prime predicate in Portuguese and English
Listing 3.6 – prime_A in German, Portuguese, and English
Listing 3.7 – Portuguese noun constructors
Listing 3.8 – Concrete representation of pronouns and how they are built in the Portuguese mini-resource
Listing 3.9 – Portuguese parameter definitions in the Portuguese mini-resource
Listing 3.10 – Concrete representation of nouns and noun phrases in the Portuguese mini-resource
Listing 3.11 – Linearization of the UsePron constructor in the Portuguese mini-resource
Listing 3.12 – GF parameter representing verb forms in the Portuguese mini-resource
Listing 3.13 – GF verbal concrete representations in the Portuguese mini-resource
Listing 3.14 – GF verbal complementation in the Portuguese mini-resource
Listing 3.15 – GF clause building in the Portuguese mini-resource
Listing 3.16 – The Portuguese lexical constructor for noun gramática
Listing 3.17 – Naive smart paradigm for verbs
Listing 3.18 – Portuguese verb smart paradigm
Listing 3.19 – The GenRP constructor from Extend
Listing 3.20 – The AdjAsCN constructor from Extend
Listing 3.21 – The AdvImp constructor from ParseExtend
Listing 3.22 – A function mapping temporal order and anteriority to verb forms
Listing 4.1 – Dependency configurations for a few RGL functions
Listing 4.2 – Raw output of the GF to UD conversion for the sentence "há uma vaca na floresta"
Listing 4.3 – Partially corrected output of the GF to UD conversion for the sentence "há uma vaca na floresta"

Contents

1 INTRODUCTION
1.1 Motivation
1.2 Scope and Contributions
1.3 Structure
1.4 Typesetting Conventions
2 GRAMMATICAL FRAMEWORK
2.1 A GF tutorial
2.1.1 A simple GF grammar
2.1.2 Refactoring
2.2 GF, mathematically